Hypotheses

v0.4 pre-registers four primary contrasts (H1–H4), a fixed-effects pool (H5), three mechanism decompositions (H8a/b/c), and one judge-agreement check (H9). Two fairness controls (H6, with a 2× scorer budget on the bare arm, and H7, with a generic two-pass revision baseline) are reported in the paper appendix. Each card below links to its underlying definition; the results page renders the same hypotheses against the live stats.

F5: Hypothesis dependency tree. H1–H4 feed the H5 fixed-effects pool; H8a (within-arm paired revision-vs-draft), H8b (gate calibration F1), and H8c (commit-policy leaderboard) form the mechanism panel; H9 is the judge-vs-proxy sensitivity check. H8a is the load-bearing positive finding. TikZ source: paper/sections/07_methods.tex.

Primary cascade-vs-bare (H1–H4)

H1.v4 · aut

Cascade > bare on this domain

not supported

Hedges' g

-0.324

BCa 95% CI on paired mean Δ

[-0.059, 0.011]

p · n

0.813 · n=5

H2.v4 · poetry_interp

Cascade > bare on this domain

not supported

Hedges' g

0.315

BCa 95% CI on paired mean Δ

[-0.014, 0.146]

p · n

0.157 · n=10

H3.v4 · poetry_gen

Cascade > bare on this domain

not supported

Hedges' g

0.144

BCa 95% CI on paired mean Δ

[-0.050, 0.063]

p · n

0.375 · n=6

H4.v4 · sci_creativity

Cascade > bare on this domain

not supported

Hedges' g

0.303

BCa 95% CI on paired mean Δ

[-0.013, 0.061]

p · n

0.313 · n=4

Pooled effect (H5)

H5.v4

Fixed-effects pool of cascade-vs-bare

not supported

Hedges' g

0.145

Fixed-effects 95% Wald CI on pooled g

[-0.255, 0.544]

p · n

— · n=25

ADR-005 inverse-variance fixed-effects meta-pool (fixed_effects_inverse_variance). The interval is a Wald CI on the pooled g, not a BCa bootstrap.
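The pooling step can be sketched as below. This is a minimal illustration, not the ADR-005 implementation: the function names are hypothetical, and the per-domain variance formula `1/n + g²/(2n)` is an assumption (the page does not state how per-domain variances are computed). The inputs are the four per-domain Hedges' g values and n from the H1–H4 cards.

```python
from math import sqrt
from statistics import NormalDist


def approx_var_g(g, n):
    # ASSUMPTION: large-sample variance approximation for a paired effect
    # size. The source does not say which variance formula ADR-005 uses.
    return 1.0 / n + g * g / (2.0 * n)


def fixed_effects_pool(effects, variances, level=0.95):
    """Inverse-variance fixed-effects pool with a Wald interval."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * g for w, g in zip(weights, effects)) / sum(weights)
    se = sqrt(1.0 / sum(weights))          # standard error of the pooled g
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return pooled, (pooled - z * se, pooled + z * se)


# (Hedges' g, n) per domain, copied from the H1-H4 cards.
domains = [(-0.324, 5), (0.315, 10), (0.144, 6), (0.303, 4)]
effects = [g for g, _ in domains]
variances = [approx_var_g(g, n) for g, n in domains]

pooled, (lo, hi) = fixed_effects_pool(effects, variances)
print(f"pooled g = {pooled:.3f}, 95% Wald CI = [{lo:.3f}, {hi:.3f}]")
```

Under the assumed variance formula this sketch agrees with the H5 card to three decimals, which is suggestive but does not confirm that the pipeline uses the same approximation.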

Mechanism decomposition (H8a/b/c)

H8a.v4

Shadow revision > draft (paired)

supported

Hedges' g

0.649

Δ = 0.058

BCa 95% CI on paired mean Δ

[0.031, 0.095]

p · n

1.2e-4 · n=27

Direct test of vimarśa: does the recursive revision pass produce a measurable surface lift? The CI is on the paired mean Δ, not on Hedges' g.
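For readers unfamiliar with BCa intervals, the sketch below shows one way to put a bias-corrected and accelerated bootstrap interval on a paired mean Δ. It is an illustration under stated assumptions: the function name `bca_ci_mean`, the resample count, and the seed are hypothetical, and the draft and revision scores are synthetic, not the pilot data.

```python
import random
from statistics import NormalDist


def bca_ci_mean(diffs, n_boot=5000, level=0.95, seed=0):
    """BCa bootstrap interval for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    theta = sum(diffs) / n
    boots = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    nd = NormalDist()

    # Bias correction z0: share of bootstrap means below the observed mean,
    # clamped away from 0 and 1 so the inverse CDF stays finite.
    below = sum(b < theta for b in boots)
    prop = min(max(below / n_boot, 1.0 / n_boot), 1.0 - 1.0 / n_boot)
    z0 = nd.inv_cdf(prop)

    # Acceleration a from leave-one-out (jackknife) means.
    total = sum(diffs)
    jack = [(total - x) / (n - 1) for x in diffs]
    jbar = sum(jack) / n
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6.0 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den if den > 0 else 0.0

    def adjusted_quantile(alpha):
        z = nd.inv_cdf(alpha)
        p = nd.cdf(z0 + (z0 + z) / (1.0 - a * (z0 + z)))
        idx = min(max(int(p * n_boot), 0), n_boot - 1)
        return boots[idx]

    alpha = (1.0 - level) / 2.0
    return theta, (adjusted_quantile(alpha), adjusted_quantile(1.0 - alpha))


# SYNTHETIC paired scores (27 items, matching only the card's n).
rng = random.Random(42)
draft = [rng.uniform(0.4, 0.7) for _ in range(27)]
revision = [d + rng.gauss(0.05, 0.08) for d in draft]
diffs = [r - d for r, d in zip(revision, draft)]

delta, (lo, hi) = bca_ci_mean(diffs)
print(f"paired mean delta = {delta:.3f}, BCa 95% CI = [{lo:.3f}, {hi:.3f}]")
```

The interval is computed on the raw paired differences, which is why the card can report a CI in Δ units alongside a standardized Hedges' g.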

H8b.v4

Learned gate F1 > event-gated F1

supported

F1 (learned gate)

0.647

precision · recall

1.00 · 0.48

support

n=27

Binary-classifier metric: learned_gate F1 = 0.647 vs event_gated F1 = 0.516. The H8b card intentionally omits a CI and p-value because F1 contrasts are not Hedges' g and the v0.4 pilot did not bootstrap them.
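As a reading aid, the sketch below shows how precision, recall, and F1 relate. The confusion counts are hypothetical: they were chosen only because they are consistent with the rounded figures on the card (precision 1.00, recall 0.48, F1 0.647, support 27) and are not the pilot's actual confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from binary confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# HYPOTHETICAL counts, picked to agree with the card's rounded values.
p, r, f1 = precision_recall_f1(tp=11, fp=0, fn=12)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.3f}")

# Recomputing F1 from the already-rounded precision and recall drifts in
# the third decimal, so small mismatches against the card are expected.
f1_from_rounded = 2 * 1.00 * 0.48 / (1.00 + 0.48)
print(f"F1 from rounded inputs = {f1_from_rounded:.3f}")
```

A precision of 1.00 with recall near 0.48 describes a conservative gate: under these illustrative counts it never fires spuriously but misses about half of the cases where revision would have helped.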

H8c.v4

Commit-policy leaderboard — winner: always_revise

inconclusive

Hedges' g

0.533

BCa 95% CI on paired mean Δ

[0.022, 0.096]

p · n

— · n=0

Pairwise paired permutations across always_draft / always_revise / event_gated / learned_gate; Holm-corrected.
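The leaderboard procedure can be sketched as a sign-flip permutation test on each pair of policies followed by a Holm step-down adjustment. This is an assumption-laden illustration: the helper names, the permutation count, and the two-sided mean-difference statistic are hypothetical choices, and the scores are synthetic.

```python
import random
from itertools import combinations


def paired_permutation_p(a, b, n_perm=5000, seed=0):
    """Two-sided sign-flip permutation p-value for the paired mean difference."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    hits = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed - 1e-12:
            hits += 1
    # The +1 terms count the observed labelling as one permutation.
    return (hits + 1) / (n_perm + 1)


def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, (m - rank) * pvals[i])  # enforce monotonicity
        adjusted[i] = min(1.0, running)
    return adjusted


policies = ["always_draft", "always_revise", "event_gated", "learned_gate"]

# SYNTHETIC per-item scores for each policy (12 items, illustrative only).
rng = random.Random(1)
base = [rng.uniform(0.4, 0.7) for _ in range(12)]
scores = {
    name: [b + rng.gauss(0.02 * k, 0.03) for b in base]
    for k, name in enumerate(policies)
}

pairs = list(combinations(policies, 2))
raw = [paired_permutation_p(scores[a], scores[b]) for a, b in pairs]
adjusted = holm_adjust(raw)
for (a, b), p_raw, p_adj in zip(pairs, raw, adjusted):
    print(f"{a} vs {b}: raw p={p_raw:.4f}, Holm p={p_adj:.4f}")
```

Four policies give six pairwise contrasts, so the smallest raw p-value is multiplied by six under Holm. With small n, that correction alone can be enough to leave a leaderboard inconclusive.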

Methodological flag (H9)

H9 reports Spearman ρ and sign-agreement between the proxy composite scorer and a calibrated Sonnet-4.5 LLM-judge over a held-out subset. Disagreement is treated as a flag against the proxy scorer's construct validity rather than as a refutation of the cascade. The discussion page §10.4 unpacks this.
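The two H9 statistics can be sketched as follows. This is a minimal illustration: the function names are hypothetical, the inputs are assumed to be per-item cascade-minus-bare deltas under each scorer, and the numbers are made up. Treating a delta of exactly zero as non-positive is also an assumption.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman rho as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def sign_agreement(x, y):
    """Share of items where both scorers agree on the sign of the delta."""
    same = sum((a > 0) == (b > 0) for a, b in zip(x, y))
    return same / len(x)


# MADE-UP per-item deltas (cascade minus bare) under each scorer.
proxy_delta = [0.05, -0.02, 0.10, 0.01, -0.04]
judge_delta = [0.30, 0.10, 0.50, -0.20, -0.40]

rho = spearman_rho(proxy_delta, judge_delta)
agree = sign_agreement(proxy_delta, judge_delta)
print(f"Spearman rho = {rho:.2f}, sign agreement = {agree:.2f}")
```

The two numbers answer different questions: ρ asks whether the scorers order items similarly, while sign-agreement asks whether they agree on the direction of the cascade effect per item. A high ρ with low sign-agreement is still a flag against the proxy.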