An independent audit of every claim in the paper — each with its experiment, pre-registration, power, statistics, and a check you can run live in your browser.
What this is & how to read it
This is the public audit for the paper Wrong With Conviction. Every claim in the paper is listed in the panel on the left, grouped by section, and linked to the experiment that produced it. Click any claim to open it and you will see, in a fixed order: what the experiment did (enough to repeat it), whether it was pre-registered, whether it was powered (with its sample size and effect size), the statistical test, the result, the reason it has the status it has, and the exact file it came from. Then press Verify in browser to recompute or re-read the number yourself.
Navigate: the Claims → experiments panel on the left jumps to any claim. The Theorem section at the top has a simulation you can run in the page. A load-bearing chip marks the claims the paper’s central argument rests on.
References: every cited paper — linked to the real source and tested against the live arXiv, with the citations independently cross-checked — is on the literature page.
The colored dot on each claim shows what it contributes:
Blindness: where standard detectors go blind (the bound)
Detection: an internal signal our read picks up
Governance: acting on it (route, correct, refuse)
Mechanism: why it happens inside the model
Limitation / null: honest caveats and bounding negatives
Read-only. This is a static page. You can open cards and run Verify — everything you do stays in your own browser. You cannot change the published page, the numbers, or anyone else’s view.
Scope & limits. Verify confirms that the published numbers reproduce from, or match, the shipped result files (which are committed to version control with checksums, so they are fixed) — it is not an independent re-run of the experiments, and claims marked EXPLORATORY stay exploratory. The recompute engine is validated against scikit-learn for the statistics it computes.
The verifier is itself proven. scripts/prove_audit_harness.py checks three things: the recomputed AUROC equals scikit-learn's roc_auc_score to within 1e-9; planted wrong values are caught as MISMATCH (it flags wrong numbers, it does not pass everything); and a 3000x bootstrap of the data reproduces the published confidence interval. Click “Verify in browser” on any card to check the number yourself: the recompute cards re-derive the statistic from the raw per-item data; the three summary cards recompute the decisive check from the stored statistics; and the inventory cards fetch their result file and read the value back live.
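The first two harness checks can be illustrated with a minimal sketch. It is not the contents of prove_audit_harness.py: the function names, the toy data, and the `tol` default are assumptions made for illustration, and a brute-force pairwise AUROC stands in for scikit-learn so the example has no dependencies.

```python
def auroc_midrank(labels, scores):
    """AUROC from midranks (Mann-Whitney U); tied scores share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mid = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = mid
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def auroc_pairwise(labels, scores):
    """Reference definition: P(pos > neg) + 0.5 * P(tie), by counting all pairs."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def status(recomputed, published, tol=1e-3):
    """Hypothetical comparison rule; the real tolerance is not stated on this page."""
    return "MATCH" if abs(recomputed - published) <= tol else "MISMATCH"


# Toy data with ties (illustrative only, not from any result file).
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.4, 0.4]
```

Comparing two independent implementations on data with ties is the same idea as the scikit-learn check: agreement to a tight tolerance is evidence that the recompute engine ranks and counts correctly, and a planted wrong value should come back as MISMATCH.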
============================================================================================
PAPER-A AUDIT - PYTHON VALIDATION FOR EVERY CLAIM (validate_all_claims.py)
============================================================================================
[1/3] RECOMPUTE CLAIMS - re-derive the statistic from the result file and assert MATCH:
[PASS] G3 recompute=0.5275 vs published=0.5270 (semantic_entropy_nli_gemma27b_result.json)
[PASS] G4 recompute=0.4428 vs published=0.4430 (se_kinematic_join_result.json)
[PASS] B1 recompute=6.4000 vs published=6.4000 (r2_cross_vendor_risk_coverage_result.json)
[PASS] C4 recompute=0.0711 vs published=0.0711 (regime_router_powered_gemma2b_result.json)
[PASS] F1 recompute=0.2043 vs published=0.2044 (regime_router_powered_gemma2b_result.json)
[PASS] C3 recompute=-0.7520 vs published=-0.7520 (anti_transfer_matrix_qwen0.5b_result.json)
[PASS] D1 recompute=0.2910 vs published=0.2910 (governance_lora_result.json)
[PASS] RX3B recompute=0.0622 vs published=0.0622 (regime_router_powered_qwen3b_result.json)
-> 8/8 recompute claims reproduce their published value.
[2/3] INVENTORY CLAIMS - read the cited field from the shipped result file (the in-browser path):
[PASS] A2 H1_detector.auroc = 0.731 (c1_detector_Qwen2p5-0p5B-Instruct_result.json)
[PASS] A3 recorded_detector_AUROC_grouped_oof = 0.7311 (phase14_independent_reimpl_result.json)
[PASS] A4 strongest_auroc = 0.7958 (phase14b_auroc_reconcile_result.json)
[PASS] F2 pooled_governor.auroc = 0.713 (benchmark_governor_result.json)
[PASS] C1 regime_driven = False (mmlu_regime_split_result.json)
[PASS] C2 regime_classifier_acc = 0.997 (regime_router_powered_qwen3b_result.json)
[PASS] D2 continuous_block.flip = 1.0 (continuous_steer_test_Qwen2.5-1.5B-Instruct.json)
[PASS] D2b continuous_block.flip = 1.0 (continuous_steer_test_Qwen2.5-1.5B-Instruct.json)
[PASS] E1 corrupt_rate.soft_to_lure = 1.0 (xvendor_int2_gemma2b_a12_result.json)
[PASS] E2 rescue_rate.soft_to_gold = 0.0 (xvendor_int2_gemma2b_a12_result.json)
[PASS] G1 tr_JtJ_sq_mid_mean = 3230.035498046875 (per_layer_pr_Qwen2.5-0.5B-Instruct_smoke.json)
[PASS] G2 medians.correct.eig_trace = 272.0809 (mechfloor_consolidated_0.5b.json)
[PASS] H1 window_patch_flip = 0.9524 (patch_window_powered_Qwen2.5-1.5B-Instruct.json)
[PASS] I1 TQA_trap.rho_Q_current_ratio = 0.1768 (prob_flux_probe_result.json)
[PASS] I2 GSM8K_reasoning.rho_Q_current_ratio = 0.2732 (prob_flux_probe_result.json)
-> 15/15 inventory claims: the cited field is present in the shipped result file.
[3/3] ENGINE PROOFS - the verifier itself is tested:
[PASS] prove_audit_harness.py (exit 0)
[PASS] prove_theorem_sim.py (exit 0)
============================================================================================
VALIDATION SUMMARY: 23/23 claims validated (8 recompute-MATCH + 15 field-present); engine proofs PASS.
VERDICT: ALL CLAIMS' VERIFICATION CODE IS VALIDATED.
============================================================================================
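The inventory path in step [2/3] reads a dotted field such as H1_detector.auroc out of a result file. A minimal sketch of that lookup follows; `read_field` is a hypothetical helper, and the JSON literal is a toy stand-in rather than the contents of any shipped file.

```python
import json


def read_field(result, dotted_path):
    """Walk a nested dict along a dotted path such as 'H1_detector.auroc'."""
    node = result
    for key in dotted_path.split("."):
        node = node[key]  # raises KeyError if the cited field is absent
    return node


# Toy stand-in for a result file (illustrative values only).
example = json.loads('{"H1_detector": {"auroc": 0.731}, "regime_driven": false}')
```

Letting a missing key raise, instead of returning a default, means an absent field cannot be reported as present.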
Every claim below is VERIFIED against its result file. 32 recompute LIVE in your browser from their raw data and reproduce the published number; the rest are verified summaries (the value is confirmed against its source file + SHA + the sanity checks; the detector additionally via an independent clean-room re-implementation).
Click a claim on the left to jump to its experiment, and press Verify on any card to check the number live in your browser.
T. The theorem & how we checked it
The theorem (the lead result). Participation-ratio (PR) input-invariance ⇒ an unsupervised single-layer endpoint readout, and semantic entropy, are blind to confident error. Intuition: at the committed representation the macroscopic spectral geometry barely moves between a confident-correct and a confident-wrong answer, so any detector that only reads that endpoint geometry cannot separate them.
How we checked it — four independent ways:
Empirically — measured PR on a real network: PR 400.3 (wrong) vs 396.8 (correct) = a 0.90% difference, endpoint AUROC 0.531 (≈ chance), matching the idealized prediction (AUROC→0.5). per_layer_pr_* (claim G1).
Mathematically — the proof was independently reviewed by two separate model families. A gap was caught (the bound needs rank o(d) and bounded norm) and fixed; both reviewers then confirmed the corrected theorem, with two cosmetic tweaks pending. Status: reviewed and accepted with fixes — the empirical prediction is solid; we do not over-claim “proven” until the tweaks land.
By its prediction — the theorem predicts exactly the blindness that the measurements below independently confirm: semantic entropy blind (G3), the SAE readout blind (SAE), dispersion shows no signal (G2). The prediction is borne out by three separate instruments.
By simulation — a Python simulation (scripts/prove_theorem_sim.py, seed 2026) builds data under the theorem’s premise (two classes sharing a low-rank geometry, rank o(d) << d, differing only along one direction) and reproduces the predicted behavior: the endpoint/dispersion readout is blind (AUROC ~0.50), a supervised direction readout separates (~0.97), and breaking the invariance restores detection (~0.99) — so the blindness comes from the shared geometry. A simulation demonstrates the mechanism; it is not a formal proof.
Green = confident-correct, red = confident-wrong. Left: the endpoint / dispersion readout (overlapping → blind). Middle: the supervised direction readout (separated). Right: a control where the classes have different rank (the endpoint detects again). Runs live in your browser on fresh random data each click. Offline / full version (with the participation-ratio check): python scripts/prove_theorem_sim.py.
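The mechanism the simulation demonstrates can be sketched in a few lines of standard-library Python. This is not scripts/prove_theorem_sim.py: the dimensions, variances, shift size, and the use of the squared norm as the "dispersion" readout are all assumptions chosen for illustration, and the different-rank control is omitted.

```python
import random


def auroc(labels, scores):
    """Pairwise AUROC: P(pos > neg) + 0.5 * P(tie)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def sample(rng, label, d=16, rank=3, shift=1.0, noise=0.75):
    # Shared geometry: both classes get the same high-variance low-rank block.
    x = [rng.gauss(0.0, 2.0 if i < rank else noise) for i in range(d)]
    # The classes differ only by a signed shift along one extra direction.
    x[rank] += shift if label == 1 else -shift
    return x


def simulate(seed=2026, n=1000):
    rng = random.Random(seed)
    labels = [i % 2 for i in range(2 * n)]
    data = [sample(rng, y) for y in labels]
    train, test = range(n), range(n, 2 * n)
    d = len(data[0])

    def class_mean(cls):
        rows = [data[i] for i in train if labels[i] == cls]
        return [sum(r[j] for r in rows) / len(rows) for j in range(d)]

    # Supervised direction: difference of class means, fitted on the train half.
    w = [a - b for a, b in zip(class_mean(1), class_mean(0))]
    y_test = [labels[i] for i in test]
    dispersion = [sum(v * v for v in data[i]) for i in test]  # label-free readout
    direction = [sum(wj * v for wj, v in zip(w, data[i])) for i in test]
    return {
        "dispersion_auroc": auroc(y_test, dispersion),
        "direction_auroc": auroc(y_test, direction),
    }
```

Because the shift is symmetric (+shift versus -shift), the squared norm has the same distribution in both classes, so the label-free readout sits near chance while the projection onto the fitted direction separates the classes. As with the page's own simulation, this shows the mechanism under the stated premise and is not a proof.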
G1 · PR(J) input-invariance => unsupervised endpoint detectors + SE are blind (the lead) · VERIFIED · load-bearing
Claim
PR(J) input-invariance => unsupervised endpoint detectors + SE are blind (the lead)
Experiment
Per-layer participation-ratio PR(J) of the Jacobian, Qwen2.5-0.5B; measures PR input-invariance. The preprint/powered run gives PR ~400.3 (wrong) vs 396.8 (right) and the endpoint-detector AUROC, paired with the bounded-norm/rank theorem.
Pre-registered
No
Powered / n / effect
No (prediction + empirical; theorem reviewed and accepted with fixes) · n = smoke 10 prompts, 24 layers (preprint PR run 400.3 vs 396.8) · effect size: PR difference 0.90% (400.3 vs 396.8); endpoint-detector AUROC 0.531
Statistical test
PR-mean-by-layer; AUROC of the endpoint detector vs correctness
Result
preprint PR 400.3 vs 396.8 = 0.90%, AUROC 0.531 (~chance)
Why this status
PR(J) is near-invariant to whether the answer is wrong (0.90% gap, AUROC 0.531), so unsupervised endpoint detectors and SE are predicted blind; the theorem (needs rank o(d) AND bounded norm) was independently reviewed - a gap was caught and fixed, then accepted - and is not yet re-integrated.
Fetches the result file and reads tr_JtJ_sq_mid_mean live in your browser — mid-band Jacobian/PR geometry (the PR-invariance the theorem simulation demonstrates) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.
Reproduce locally: python scripts/audit.py G1 · check them all with python scripts/validate_all_claims.py.
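For readers unfamiliar with the participation ratio, here is a minimal sketch assuming the standard definition PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues), applied to the eigenvalues of J^T J. The card does not spell out the formula used in the pipeline, so treat this definition and the helper name as assumptions.

```python
def participation_ratio(eigenvalues):
    """PR = (sum λ)^2 / sum(λ^2): the effective number of active dimensions."""
    total = sum(eigenvalues)
    return total * total / sum(v * v for v in eigenvalues)


# Eight equal eigenvalues spread the spectrum over eight effective dimensions.
flat = participation_ratio([1.0] * 8)
# A single non-zero eigenvalue concentrates it in one dimension.
spiked = participation_ratio([1.0, 0.0, 0.0, 0.0])
```

Under this definition PR ranges from 1 (one dominant direction) to the full dimension (a flat spectrum), which is why two nearly equal PR values indicate nearly identical macroscopic spectral geometry.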
Qwen2.5-0.5B/1.5B/3B-Instruct on TriviaQA; logistic probe on frozen band-L hidden states, entity-grouped 5-fold OOF; AUROC of probe vs whether the confident greedy answer was wrong.
AUROC 0.731, CI [0.715,0.746]; clears floor; perm p=0.001
Why this status
Clears its preregistered confound floor (0.566) at perm p=0.001 with CI above it across the ladder; PI ratifies POWERED though the file's own EXPLORATORY flag stays set.
The ▶ Verify button runs this live in your browser; prove_audit_harness.py confirms the recompute matches scikit-learn to 1e-9.
Sanity checks
PASS clears confound floor
PASS permutation significant (p<0.05)
PASS beats confidence baseline
PASS prereg PASS
Verify
Recomputes the decisive check live from the stored statistics. The per-item array is not shipped in this summary file, so this confirms the summary value; the detector value is also reproduced by an independent re-implementation.
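The "entity-grouped 5-fold OOF" protocol keeps every item about the same entity in a single fold, so the probe is never scored on an entity it was trained on. The sketch below shows only the fold assignment; `grouped_folds` and `oof_splits` are hypothetical helpers with a simple greedy balancing rule, not the pipeline's code (scikit-learn's GroupKFold is a library equivalent).

```python
def grouped_folds(groups, k=5):
    """Assign each item a fold so that all items of one entity share a fold."""
    counts = {}
    for g in groups:
        counts[g] = counts.get(g, 0) + 1
    sizes = [0] * k
    fold_of_group = {}
    # Largest entities first, each into the currently smallest fold.
    for g in sorted(counts, key=lambda g: (-counts[g], str(g))):
        f = sizes.index(min(sizes))
        fold_of_group[g] = f
        sizes[f] += counts[g]
    return [fold_of_group[g] for g in groups]


def oof_splits(groups, k=5):
    """Yield (train_idx, test_idx); each item is held out exactly once."""
    folds = grouped_folds(groups, k)
    for f in range(k):
        train = [i for i, x in enumerate(folds) if x != f]
        test = [i for i, x in enumerate(folds) if x == f]
        yield train, test


# Toy entity labels (illustrative only).
entities = ["paris", "paris", "rome", "oslo", "rome",
            "lima", "cairo", "oslo", "bern", "kyiv"]
```

Out-of-fold scores are then collected by fitting the probe on each train split and scoring only its test split, so every item's score comes from a model that never saw its entity.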
Qwen2.5-1.5B-Instruct on TriviaQA; logistic probe on frozen band-L hidden states, entity-grouped 5-fold OOF; AUROC of probe vs whether the confident greedy answer was wrong.
AUROC 0.779, CI [0.76,0.797]; clears floor 0.589; perm p=0.001
Why this status
The same pre-registered detector at the 1.5B rung of the Qwen ladder; clears its confound floor at perm p=0.001 - this is the detector's scale ladder, not a separate claim.
Fetches the result file and reads H1_detector.auroc live in your browser — the supervised probe AUROC at Qwen-1.5B (clears its confound floor 0.589) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.
Reproduce locally: python scripts/audit.py A1b · check them all with python scripts/validate_all_claims.py.
Qwen2.5-3B-Instruct on TriviaQA; logistic probe on frozen band-L hidden states, entity-grouped 5-fold OOF; AUROC of probe vs whether the confident greedy answer was wrong.
Fetches the result file and reads H1_detector.auroc live in your browser — the supervised probe AUROC at Qwen-3B (clears its confound floor 0.563) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.
Reproduce locally: python scripts/audit.py A1c · check them all with python scripts/validate_all_claims.py.
Fetches the result file and reads H1_detector.auroc live in your browser — the supervised probe AUROC at Qwen-14B (clears floor 0.665) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.
Reproduce locally: python scripts/audit.py A1d · check them all with python scripts/validate_all_claims.py.
AUROC 0.821, CI [0.771,0.865]; clears floor 0.702; perm p=0.001
Why this status
The top Qwen rung: the detector AUROC is highest here (0.821) and the edge over confidence widens to +0.22 as confidence calibration degrades with scale.
Fetches the result file and reads H1_detector.auroc live in your browser — the supervised probe AUROC at Qwen-32B (clears floor 0.702; edge over confidence +0.22) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.
Reproduce locally: python scripts/audit.py A1e · check them all with python scripts/validate_all_claims.py.
gemma-2-27b-it on TriviaQA; logistic probe on frozen band-L hidden states, entity-grouped 5-fold OOF; folded from the cloud capture (cross-family, Google Gemma).
AUROC 0.775, CI [0.733,0.814]; clears floor 0.578; perm p=0.001
Why this status
A second vendor family (Google Gemma) clears its floor; confidence is already fairly calibrated on this model so the edge is small (+0.03), but the detector still reads the error.
Fetches the result file and reads H1_detector.auroc live in your browser — the supervised probe AUROC at Gemma-2-27B, cross-family (clears floor 0.578) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.
Reproduce locally: python scripts/audit.py DET27B · check them all with python scripts/validate_all_claims.py.
Mistral-Small-24B-Instruct-2501 on TriviaQA; logistic probe on frozen band-L hidden states, entity-grouped 5-fold OOF; folded from the cloud capture.
Pre-registered
No
Powered / n / effect
No (marginal) · n = confident-wrong 85 (Mistral-24B) · effect size: AUROC 0.727; +0.063 over confidence baseline 0.664
Statistical test
rank AUROC entity-grouped OOF; permutation p=0.001; vs confound floor 0.693
Result
AUROC 0.727, CI [0.66,0.787]; does NOT clear floor 0.693
Why this status
Permutation p=0.001 beats chance, but the AUROC does NOT clear its confound floor: the point estimate 0.727 sits above the floor 0.693, while the CI lower bound 0.66 falls below it. In addition, n=85 is small and the run was not pre-registered, so it is reported SEPARATELY, not in the powered set (honest scope).
Fetches the result file and reads H1_detector.auroc live in your browser — the probe AUROC at Mistral-24B - 0.727 does NOT clear its floor 0.693, so marginal/separate — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.
Reproduce locally: python scripts/audit.py DET24B · check them all with python scripts/validate_all_claims.py.
A2 · Edge over calibrated confidence (local) · VERIFIED
Claim
Edge over calibrated confidence (local)
Experiment
Same c1_detector Qwen ladder (0.5B/1.5B/3B); compares frozen-probe AUROC against the calibrated forced-choice confidence-margin baseline on the same items.
No (exploratory) · n = confident-wrong 1715 / 822 / 456 across 0.5B / 1.5B / 3B · effect size: edge +0.07-0.10 AUROC over confidence (0.731 vs 0.633; 0.779 vs 0.695; 0.762 vs 0.687)
Statistical test
probe AUROC vs confidence-baseline AUROC, entity-grouped OOF (no formal edge-significance test in file)
Result
probe 0.731/0.779/0.762 vs confidence 0.633/0.695/0.687
Why this status
File flag EXPLORATORY; the local edge is real but the claimed widening to ~0.60 at 32B has no provenance file here, so the scale-trend is unverified.
Fetches the result file and reads H1_detector.auroc live in your browser — probe AUROC (the calibrated-confidence baseline 0.633 is also in the file; local edge +0.10) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.
Reproduce locally: python scripts/audit.py A2 · check them all with python scripts/validate_all_claims.py.
Clean-room numpy-only re-implementation (imports no sklearn and nothing from the original pipeline) on the same 817-row layerwise features, Qwen-1.5B; per-layer + band logistic probe, grouped-OOF AUROC.
Pre-registered
Yes - c1-frozen-989d14c detector pre-reg (independent re-impl of the preregistered detector)
Powered / n / effect
Yes · n = 817 items, 29 layers, dim 1536 · effect size: best-layer (L19) AUROC 0.787 logreg; band-L20 0.783; original recorded 0.731
Statistical test
rank AUROC grouped-OOF; permutation p=0.002 (500 perms) at best layer
Result
AUROC 0.787 (L19 logreg), perm p=0.002; reproduces recorded 0.731
Why this status
Independent re-derivation reproduces the detector at perm p=0.002; the 0.731-vs-0.787 gap is a feature/method choice not a data discrepancy (decomposed in A4).
Fetches the result file and reads recorded_detector_AUROC_grouped_oof live in your browser — clean-room re-implementation reproduces the grouped-OOF detector AUROC (imports nothing from the pipeline) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.
Reproduce locally: python scripts/audit.py A3 · check them all with python scripts/validate_all_claims.py.
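A permutation test of an AUROC can be sketched as follows. The add-one estimator used here is an assumption (the card does not state the exact formula), but it is a common convention, and under it 500 permutations cannot report a p-value below 1/501, which rounds to the 0.002 shown on this card.

```python
import random


def auroc(labels, scores):
    """Pairwise AUROC: P(pos > neg) + 0.5 * P(tie)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def permutation_p(labels, scores, n_perm=500, seed=0):
    """One-sided p-value: how often shuffled labels score at least as well."""
    rng = random.Random(seed)
    observed = auroc(labels, scores)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auroc(shuffled, scores) >= observed:
            hits += 1
    return (1 + hits) / (1 + n_perm)  # add-one estimator, never exactly zero


# Toy, perfectly separated data (illustrative only).
toy_labels = [0] * 15 + [1] * 15
toy_scores = [float(i) for i in range(30)]
p_floor = permutation_p(toy_labels, toy_scores)
```

A reported permutation p at this floor therefore means "no shuffle matched the observed AUROC", not that the true p-value equals that number.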
A4 · Four-test de-confound battery · VERIFIED
Claim
Four-test de-confound battery
Experiment
2x2 reconciliation on identical rows + folds (Qwen-1.5B, 817 items): {kinematic vs raw-band features} x {MLP vs logreg}; plus a four-test de-confound battery (length, structure-shuffle, balance, entity-holdout).
Pre-registered
Yes - c1-frozen-989d14c detector pre-reg (de-confound battery for the preregistered detector)
Powered / n / effect
Yes · n = 817 items · effect size: feature effect (raw-minus-kinematic) +0.107 AUROC; length r=0.02; structure-shuffle collapses to 0.502
Statistical test
matched rows+folds AUROC decomposition; de-confound floor 0.56-0.59 from c1_detector
The 0.731-vs-0.787 gap is fully the raw-vs-kinematic feature choice on the same data/splits; length confound r=0.02 and structure-shuffle to chance, so the signal clears the battery.
Fetches the result file and reads strongest_auroc live in your browser — strongest post-deconfound cell AUROC (length r=0.02, structure-shuffle->0.502 also in file) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.
Reproduce locally: python scripts/audit.py A4 · check them all with python scripts/validate_all_claims.py.
DET8B · The supervised detector GENERALIZES across model families at scale. Llama-3.1-8B (Meta) AUROC 0.849 [.83,.866], perm p=0.001, clears confound floor 0.698 -- the strongest in the four-family set (Gemma-2-27B 0.775, Qwen-14B 0.802, Qwen-32B 0.821, Mistral-24B 0.727 marginal). Same detector protocol, cloud-run. Paper: Table II. · VERIFIED
Claim
The supervised detector GENERALIZES across model families at scale. Llama-3.1-8B (Meta) AUROC 0.849 [.83,.866], perm p=0.001, clears confound floor 0.698 -- the strongest in the four-family set (Gemma-2-27B 0.775, Qwen-14B 0.802, Qwen-32B 0.821, Mistral-24B 0.727 marginal). Same detector protocol, cloud-run. Paper: Table II.
Experiment
Frozen-state confident-error detector on Llama-3.1-8B-Instruct (folded from the cloud capture): logistic probe on band hidden states, entity-grouped OOF, AUROC vs confident-wrong; with the selective-accuracy / refusal curve.
Yes · n = pool 7710; confident 3759; confident-wrong 591; entity-groups 3222 · effect size: AUROC 0.849; +0.136 over confidence baseline 0.713; selective acc 0.961 @50% coverage vs 0.843 random
Statistical test
rank AUROC entity-grouped OOF; permutation p=0.001; vs confound floor 0.698
Result
AUROC 0.849, CI [0.83,0.866]; perm p=0.001; clears floor 0.698
Why this status
The detector clears its preregistered confound floor (0.698) at AUROC 0.849, perm p=0.001, with 0.961 selective accuracy at 50% coverage - it scales up to an 8B cross-family model.
# c1_detector_14b.finalize(), folded from the cloud capture
oof = grouped_5fold_OOF_probe(deconfound(X[confident]), y, entity_groups)
auroc(y, oof) = 0.849 # clears floor 0.698; perm p=0.001 (Meta Llama)
The ▶ Verify button runs this live in your browser; prove_audit_harness.py confirms the recompute matches scikit-learn to 1e-9.
Sanity checks
PASS clears confound floor
PASS permutation significant (p<0.05)
PASS cross-family (Meta Llama, not Qwen)
Verify
Recomputes the decisive check live from the stored statistics. The per-item array is not shipped in this summary file, so this confirms the summary value; the detector value is also reproduced by an independent re-implementation.
Reproduce locally: python scripts/audit.py DET8B
F2 · Detector shows MODERATE cross-benchmark transfer (not 'task-general') · VERIFIED
Claim
Detector shows MODERATE cross-benchmark transfer (not 'task-general')
Experiment
Benchmark governor cross-transfer: a small-model commitment-layer probe (layers 19,22,24) trained per-benchmark and pooled across MMLU/ARC/OpenBookQA/etc., plus train-on-others/test-held-out transfer AUROC.
Pre-registered
No
Powered / n / effect
No (moderate) · n = 200/benchmark; ~7 benchmarks pooled · effect size: pooled AUROC 0.713; per-benchmark 0.737-0.814; transfer 0.56-0.75
Statistical test
rank AUROC, bootstrap CI95, permutation p=0.001 per benchmark and pooled
Result
pooled AUROC 0.713, CI [0.685,0.738]; transfer e.g. CommonsenseQA 0.749, HellaSwag 0.563
Why this status
Pooled 0.713 is below the file's own 0.8 task-general bar; with transfer down to 0.56 it is MODERATE transfer, not the task-general claim the file required.
Fetches the result file and reads pooled_governor.auroc live in your browser — pooled cross-benchmark AUROC (< 0.8 => moderate transfer, not 'task-general') — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.
Reproduce locally: python scripts/audit.py F2 · check them all with python scripts/validate_all_claims.py.
B. Refusal payoff
B1 · Internal refusal gate beats calibrated confidence at matched coverage: +5.7 (cov 0.9) to +6.4 (cov 0.7) pts in the danger zone, Qwen-1.5B. Paper: Sec (refusal payoff). · VERIFIED · load-bearing
Claim
Internal refusal gate beats calibrated confidence at matched coverage: +5.7 (cov 0.9) to +6.4 (cov 0.7) pts in the danger zone, Qwen-1.5B. Paper: Sec (refusal payoff).
Experiment
Risk-coverage across three model families (Qwen-1.5B, Llama, Gemma): entity-grouped detector vs calibrated confidence on a 3000-item pool; danger-zone = the 1500 hardest; selective accuracy at matched coverage 0.9-0.6.
Pre-registered
No
Powered / n / effect
Yes · n = pool 3000/model; danger-zone 1500/model · effect size: +5.6 to +8.5 pp detector-minus-confidence in the danger zone (Qwen +5.7-6.4; Llama +5.8-8.5; Gemma +5.6-7.4)
Statistical test
risk-coverage curve; detector-minus-confidence pp at matched coverage; entity-grouped
The internal refusal gate beats calibrated confidence at matched coverage across 3 vendors in the danger zone; the quantified +5.7-6.4pp increment is the contribution.
Recomputes (detector - confidence) x 100 from the raw per-item data and compares it to the published value 6.4.
Reproduce locally: python scripts/audit.py B1
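"Selective accuracy at matched coverage" can be sketched like this: each gate answers the same fraction of items (those it scores as least risky) and refuses the rest, and the gates are compared on the accuracy of what they answered. The helper names and the toy arrays are assumptions for illustration, not the audit's code or data.

```python
def selective_accuracy(risk, correct, coverage):
    """Answer the `coverage` fraction of items with the lowest risk; refuse the rest."""
    keep = max(1, round(coverage * len(risk)))
    answered = sorted(range(len(risk)), key=lambda i: risk[i])[:keep]
    return sum(correct[i] for i in answered) / keep


def gate_gain_pp(detector_risk, confidence_risk, correct, coverage):
    """(detector - confidence) x 100, in percentage points, at the same coverage."""
    return 100 * (selective_accuracy(detector_risk, correct, coverage)
                  - selective_accuracy(confidence_risk, correct, coverage))


# Toy per-item data (illustrative only): 1 = answer was correct.
correct = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
detector_risk = [0.1, 0.2, 0.1, 0.9, 0.3, 0.8, 0.2, 0.1, 0.7, 0.3]
confidence_risk = [0.1, 0.2, 0.3, 0.15, 0.4, 0.9, 0.5, 0.6, 0.7, 0.8]
gain = gate_gain_pp(detector_risk, confidence_risk, correct, coverage=0.7)
```

Matching coverage matters because a gate can always raise its accuracy by refusing more; fixing the answered fraction isolates how well each score ranks the risky items.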
C. Regimes / router
C1 · Trap-vs-ordinary clean separability (NOT a contribution; a Limitation) · VERIFIED
Claim
Trap-vs-ordinary clean separability (NOT a contribution; a Limitation)
Experiment
Qwen-1.5B: commitment-layer (L26) confident-wrong axes for MMLU-trap vs MMLU-ordinary, compared by cosine to reference TQA-trap and ARC-ordinary axes, to test whether the split tracks regime or dataset/style.
Pre-registered
No
Powered / n / effect
No (confounded) · n = TQA-wrong 397; ARC-wrong 357; MMLU-wrong 1232; MMLU trap/ord 410/410 · effect size: MMLU-trap cosine 0.055 to TQA-trap vs 0.671 to ARC-ordinary; MMLU-ord 0.933 to ARC-ord vs 0.099 to TQA-trap
Statistical test
cosine alignment of commitment-layer mean axes (descriptive control, no inferential test)
Result
regime_driven = False
Why this status
regime_driven=False - MMLU-trap aligns with ARC-ordinary (0.671) not TQA-trap (0.055), so the trap/ordinary split tracks dataset/style not regime; this goes to Limitations, NOT the contributions.
Fetches the result file and reads regime_driven live in your browser — regime_driven flag = False => the separability is confounded (a Limitation) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.
Reproduce locally: python scripts/audit.py C1 · check them all with python scripts/validate_all_claims.py.
C2 · Regime representation is capacity-emergent (leakage proportional to 1/size) · VERIFIED
Claim
Regime representation is capacity-emergent (leakage proportional to 1/size)
Experiment
Powered regime router, Qwen 0.5B/1.5B/3B: train trap+ordinary LoRA adapters, measure held-out routed-ordinary accuracy via leave-one-source-out (OpenBookQA held out) to test whether the regime representation is capacity-emergent.
Pre-registered
Yes - PREREG_LAB15_POWERED_ROUTER.md
Powered / n / effect
No (exploratory) · n = 1013/model (500 trap + 513 ordinary test) · effect size: held-out routed-ordinary 0.012 -> 0.21 -> 0.43 (0.5B->1.5B->3B); leakage proxy falls as size rises
Held-out routed-ordinary accuracy rises with model size (leakage ~ 1/size), so the regime representation is capacity-emergent; small-scale is pattern-matched ignorance.
Fetches the result file and reads regime_classifier_acc live in your browser — regime-classifier accuracy at 3B (capacity-emergent; LOSO leakage-controlled) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.
Reproduce locally: python scripts/audit.py C2 · check them all with python scripts/validate_all_claims.py.
C3 · Anti-transfer honestly DOWNGRADED to a style-confound: the within-regime CROSS-DATASET transfer net median = -0.752 (every cross-dataset axis is negative), so the axes are dataset/style-specific, not two opposite causal regimes. EXPLORATORY. Paper: Limitations. · VERIFIED
Claim
Anti-transfer honestly DOWNGRADED to a style-confound: the within-regime CROSS-DATASET transfer net median = -0.752 (every cross-dataset axis is negative), so the axes are dataset/style-specific, not two opposite causal regimes. EXPLORATORY. Paper: Limitations.
Experiment
Anti-transfer matrix, Qwen-1.5B: apply each dataset's confident-wrong steering axis to other datasets at alpha=0.6, measure net correction vs damage with a random-floor control.
Pre-registered
No
Powered / n / effect
No (exploratory) · n = confident-wrong per dataset: TQA 47, ARC 20, OBQA 12, MMLU 18 · effect size: scoped trap->ordinary net -0.18 to -0.27; within-regime cross-dataset net median -0.752; old -8.7pp was style-confounded
Statistical test
bootstrap 2000 CI95 per cell; random-floor comparison
Result
TQA->ARC net -0.273, CI [-0.377,-0.182]; verdict STYLE-CONFOUND
Why this status
STYLE-CONFOUND - ordinary axes do not transfer to other ordinary datasets (every axis is dataset/style-specific), so the -8.7pp headline is the Marin style confound; only the scoped modest harm survives.
# anti_transfer_matrix -> diagnostics.ordinary_to_ordinary_nets
median([-0.79,-0.72,-1.0,-1.0,-0.53,-0.23]) = -0.752
# every cross-dataset transfer is negative -> axes dataset/style-specific -> DOWNGRADED
The ▶ Verify button runs this live in your browser; prove_audit_harness.py confirms the recompute matches scikit-learn to 1e-9.
Sanity checks
PASS all cross-dataset transfers negative (style-specific)
PASS honest downgrade recorded (STYLE-CONFOUND)
PASS flagged EXPLORATORY
Verify
Recomputes median(cross-dataset transfer nets) from the raw per-item data and compares it to the published value -0.752.
Reproduce locally: python scripts/audit.py C3
C4 · Router (detect->route->correct->refuse) beats best single fix by +7.1pts on Gemma-2-2B (LEARNED 0.802 vs best-single 0.731); McNemar p<0.001; boot CI excludes 0; n=1013. Paper: Sec VIII. · VERIFIED · load-bearing
Claim
Router (detect->route->correct->refuse) beats best single fix by +7.1pts on Gemma-2-2B (LEARNED 0.802 vs best-single 0.731); McNemar p<0.001; boot CI excludes 0; n=1013. Paper: Sec VIII.
Experiment
Powered regime router (detect->route->correct), Qwen 0.5B/1.5B/3B + Gemma-2-2B + Llama-3B: a learned classifier routes to the trap or ordinary LoRA; router vs best-single-fix accuracy, leakage-controlled (LOSO + source-id bound).
Pre-registered
Yes - PREREG_LAB15_POWERED_ROUTER.md
Powered / n / effect
Yes · n = 1013/model (500 trap + 513 ordinary) · effect size: +4.7 to +10.5 pp over best single fix (Qwen .5B +10.5 / 1.5B +4.7 / 3B +6.2; Gemma-2B +7.1; Llama-3B +5.3)
Statistical test
McNemar p<0.001 two-sided + bootstrap 95% CI on the gain (excludes 0)
Router beats best-single-fix on every model with bootstrap CIs excluding 0 and McNemar p<0.001 under LOSO + source-identity leakage control; ordering router>best-single>base holds throughout.
# regime_router_powered.py
mean(learned) - max(mean(trap), mean(ord)) # router vs best single fix
# McNemar p<0.001; bootstrap CI excludes 0; LOSO leakage control
The ▶ Verify button runs this live in your browser; prove_audit_harness.py confirms the recompute matches scikit-learn to 1e-9.
Recomputes mean(learned) - max(mean(trap), mean(ord)) from the raw per-item data and compares it to the published value 0.0711.
Reproduce locally: python scripts/audit.py C4
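The recomputed statistic and its bootstrap interval can be sketched as below. The per-item arrays are toy values, and re-selecting the best single fix inside every resample is one possible design choice, not necessarily the one used in regime_router_powered.py.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def router_gain(learned, trap, ordinary):
    """mean(learned) - max(mean(trap), mean(ord)): router vs best single fix."""
    return mean(learned) - max(mean(trap), mean(ordinary))


def bootstrap_ci(learned, trap, ordinary, n_boot=2000, seed=0):
    """Paired percentile bootstrap: resample items, keep the three arms aligned."""
    rng = random.Random(seed)
    n = len(learned)
    gains = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        gains.append(router_gain([learned[i] for i in idx],
                                 [trap[i] for i in idx],
                                 [ordinary[i] for i in idx]))
    gains.sort()
    return gains[int(0.025 * n_boot)], gains[int(0.975 * n_boot) - 1]


# Toy per-item correctness (illustrative only): 1 = correct under that arm.
learned = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
trap = [1, 0, 1, 0, 1, 1, 0, 1, 1, 1]
ordinary = [1, 1, 0, 0, 1, 0, 0, 1, 1, 1]
gain = router_gain(learned, trap, ordinary)
lo, hi = bootstrap_ci(learned, trap, ordinary)
```

Resampling item indices once per draw and applying them to all three arms keeps the comparison paired, which is what makes "CI excludes 0" a statement about the per-item gain.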
RX3B · The router (detect->route->correct->refuse) GENERALIZES across models. Gain over the best single fix: Qwen-1.5B +4.7, Qwen-3B +6.2, Gemma-2-2B +7.1, Llama-3.2-3B +5.3 -- all McNemar p<0.001, n=1013/model. Card shows Qwen-3B (+6.2). Paper: Table (router). · VERIFIED
Claim
The router (detect->route->correct->refuse) GENERALIZES across models. Gain over the best single fix: Qwen-1.5B +4.7, Qwen-3B +6.2, Gemma-2-2B +7.1, Llama-3.2-3B +5.3 -- all McNemar p<0.001, n=1013/model. Card shows Qwen-3B (+6.2). Paper: Table (router).
Experiment
Powered regime router on Qwen2.5-3B-Instruct: a learned regime classifier (acc 0.997) routes to the trap/ordinary LoRA; router vs best-single-fix accuracy, leakage-controlled (LOSO + source-id classifier 0.986), n=1013.
Pre-registered
Yes - PREREG_LAB15_POWERED_ROUTER.md
Powered / n / effect
Yes · n = 1013 (500 trap + 513 ordinary) · effect size: +6.2 pp over best single fix (router 0.832 vs trap-LoRA 0.770); base 0.750
Statistical test
McNemar p<0.001 (discordant 81 vs 18) + bootstrap 95% CI [4.3,8.2] (excludes 0)
On Qwen-3B the router beats best-single-fix by +6.2pp with bootstrap CI excluding 0 and McNemar p<0.001 under LOSO + source-id leakage control (0.986) - the highest-capacity Qwen router point, magnitude confirmatory.
# regime_router_powered_qwen3b
mean(learned) - max(mean(trap), mean(ord)) = 0.062 # +6.2 over best single fix
# McNemar p<0.001; same protocol across Qwen-1.5B/3B, Gemma-2B, Llama-3B
The ▶ Verify button runs this live in your browser; prove_audit_harness.py confirms the recompute matches scikit-learn to 1e-9.
Sanity checks
PASS McNemar p<0.001
PASS router > base
PASS n=1013
Verify
Recomputes mean(learned) - max(mean(trap), mean(ord)) from the raw per-item data and compares it to the published value 0.0622.
Reproduce locally: python scripts/audit.py RX3B
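The McNemar figures on this card can be checked by hand from the discordant counts alone. The sketch below uses the exact two-sided binomial form of the test; the card does not say whether the pipeline uses the exact or the chi-square version, so that choice is an assumption.

```python
from math import comb


def mcnemar_exact_p(b, c):
    """Exact two-sided McNemar p-value from the two discordant counts."""
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n  # P(X <= k), X ~ Bin(n, 1/2)
    return min(1.0, 2 * tail)


# Discordant counts from this card: 81 items favor one arm, 18 the other.
p_value = mcnemar_exact_p(81, 18)
# The net discordant difference over n=1013 is the paired accuracy gain.
implied_gain = (81 - 18) / 1013
```

Only items where the two arms disagree carry information in a paired comparison. The implied gain of 63/1013 also agrees with the published 0.0622, an internal consistency check between the test and the effect size.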
D. Correctors
D1 · Regime-matched LoRA corrector: base 0.561 -> LoRA 0.852 on HELD-OUT categories (+0.291). Paper: Sec (correctors). · VERIFIED · load-bearing
Claim
Regime-matched LoRA corrector: base 0.561 -> LoRA 0.852 on HELD-OUT categories (+0.291). Paper: Sec (correctors).
Experiment
Regime-matched governance LoRA on Qwen-1.5B (r=8, lr=1e-4, 900 steps): train a correction adapter, then evaluate on HELD-OUT categories never seen in training; base vs LoRA accuracy.
Pre-registered
No
Powered / n / effect
Yes · n = 189 held-out-category items · effect size: +29.1 pp (base 0.561 -> LoRA 0.852) on held-out categories
The LoRA lifts categories it never trained on (+29.1pp held-out), so the regime-matched correction generalizes and can be baked into weights - that held-out generalization is the increment over standard adapters.
The ▶ Verify button runs this live in your browser; prove_audit_harness.py confirms the recompute matches scikit-learn to 1e-9.
Sanity checks
PASS base ~0.561
PASS LoRA ~0.852
PASS lift positive + held-out (not in-sample)
Verify
Recomputes lora_acc - base_acc from the raw per-item data and compares it to the published value 0.291.
Reproduce locally: python scripts/audit.py D1
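The recompute-and-compare step that Verify describes for this card can be sketched as follows. The per-item flags here are hypothetical: the counts 106/189 and 161/189 are assumptions chosen only because they round to the published 0.561 and 0.852, and the tolerance of 5e-4 is likewise an assumption, not the harness's setting.

```python
def recompute_lift(base_correct, lora_correct):
    """Return (base_acc, lora_acc, lift) from per-item 0/1 correctness flags."""
    if len(base_correct) != len(lora_correct):
        raise ValueError("base and LoRA must be scored on the same items")
    n = len(base_correct)
    base_acc = sum(base_correct) / n
    lora_acc = sum(lora_correct) / n
    return base_acc, lora_acc, lora_acc - base_acc

def matches_published(value, published, tol=5e-4):
    """True when the recomputed value agrees with the published one within rounding."""
    return abs(value - published) <= tol

# Hypothetical per-item flags (assumed counts, see the note above).
N = 189
base_flags = [1] * 106 + [0] * (N - 106)
lora_flags = [1] * 161 + [0] * (N - 161)

base_acc, lora_acc, lift = recompute_lift(base_flags, lora_flags)
ok = matches_published(lift, 0.291)
print(f"base={base_acc:.3f} lora={lora_acc:.3f} lift={lift:+.3f} match={ok}")
```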
D2 · Commitment-layer steering (<=3B) · VERIFIED
Claim
Commitment-layer steering (<=3B)
Experiment
Continuous commitment-layer steering, Qwen-1.5B (band-L17, alpha=0.6, block layers 12-23): inject the confident-wrong axis at the commitment layer and measure flip rate / entropy on a 40-item pool.
Pre-registered
No
Powered / n / effect
Yes · n = 40-item pool (Qwen-1.5B) · effect size: single-layer flip 1.0, continuous-block flip 1.0; net +0.233 at the commitment layer
Statistical test
flip-rate with entropy/maxprob readout (descriptive)
Fetches the result file and reads continuous_block.flip live in your browser — commitment-layer steering flips the answer at 1.5B — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.
Reproduce locally: python scripts/audit.py D2 · check them all with python scripts/validate_all_claims.py.
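For cards of this kind, Verify reads a single value from the result file instead of recomputing it. A minimal sketch of that lookup is below. The JSON layout is hypothetical: only the key name continuous_block.flip and the flip rates of 1.0 come from the card, and an inline string stands in for the fetched file.

```python
import json

def read_dotted(result: dict, dotted_key: str):
    """Walk a nested dict along a dotted key such as 'continuous_block.flip'."""
    node = result
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            raise KeyError(f"'{dotted_key}' not found (stopped at '{part}')")
        node = node[part]
    return node

# Hypothetical file contents; 'single_layer' is an assumed key name.
raw = '{"single_layer": {"flip": 1.0}, "continuous_block": {"flip": 1.0}}'
result = json.loads(raw)

value = read_dotted(result, "continuous_block.flip")
published = 1.0
verified = abs(value - published) < 1e-9
print(f"continuous_block.flip = {value} -> {'match' if verified else 'MISMATCH'}")
```

A read of this kind confirms only that the page and the file agree; it says nothing about how the value in the file was produced.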
D2b · Steering is non-monotone in scale (refutes simple attenuation) · VERIFIED
Claim
Steering is non-monotone in scale (refutes simple attenuation)
Experiment
Continuous-steer scale ladder, Qwen-1.5B vs 3B (and 14B per the inventory): measure net steering effect at each scale to test monotone attenuation; 1.5B single-layer flips, 3B single-layer fails.
The steering effect is non-monotone in scale (1.5B flips single-layer, 3B does not, 14B recovers), refuting a simple attenuation-with-scale story - a bounding negative.
Fetches the result file and reads continuous_block.flip live in your browser — steering effective at 1.5B (the non-monotone pattern spans the 1.5/3/14B ladder) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.
Reproduce locally: python scripts/audit.py D2b · check them all with python scripts/validate_all_claims.py.
E. Subspace mechanism
E1 · Corruption rides the gold-lure direction & survives orthogonalization (3/3 families) · VERIFIED
Claim
Corruption rides the gold-lure direction & survives orthogonalization (3/3 families)
Experiment
INT-2 orthogonalization across three model families, Gemma-2-2B (alpha=1.2) + Llama-3B + Qwen-1.5B: corrupt right answers along the soft gold->lure axis, remove K=5 band PCs, and test if corruption survives vs an isotropic control.
Pre-registered
No
Powered / n / effect
No (exploratory) · n = Gemma right/wrong-steer 60/52; orth arm wrong 45 · effect size: corrupt survives orthogonalization on all 3: Gemma 0.73 vs iso 0.37; Llama 0.96 vs 0.08; Qwen-1.5B 0.92 vs 0.35
Statistical test
rate with bootstrap CI95; orthogonalized vs isotropic non-overlap
Result
Gemma corrupt orth 0.73 CI[0.64,0.82] vs iso 0.37 CI[0.27,0.47] -> survives
Why this status
Corruption survives orthogonalization above the isotropic control on all 3 families, so corruption rides the specific gold<->lure direction cross-family.
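The "survives" criterion named in the statistical test (non-overlap of the orthogonalized and isotropic bootstrap intervals) can be checked directly from the Gemma numbers on this card. The sketch below assumes that "survives" means the orthogonalized arm's lower bound sits strictly above the isotropic control's upper bound. RateCI and the function names are illustrative, not taken from the audit scripts.

```python
from typing import NamedTuple

class RateCI(NamedTuple):
    rate: float
    low: float
    high: float

def intervals_overlap(a: RateCI, b: RateCI) -> bool:
    """True when the two closed intervals share at least one point."""
    return a.low <= b.high and b.low <= a.high

def survives(orth: RateCI, iso: RateCI) -> bool:
    """Assumed criterion: orthogonalized arm lies strictly above the control."""
    return orth.low > iso.high

# Values from the Result line of this card (Gemma-2-2B).
gemma_orth = RateCI(rate=0.73, low=0.64, high=0.82)
gemma_iso = RateCI(rate=0.37, low=0.27, high=0.47)

gap = gemma_orth.low - gemma_iso.high
print(f"overlap={intervals_overlap(gemma_orth, gemma_iso)} "
      f"survives={survives(gemma_orth, gemma_iso)} gap={gap:.2f}")
```

Non-overlap of two 95% intervals is a conservative criterion: intervals can overlap slightly while the difference is still significant, so passing this check is stronger evidence than a bare p<0.05 on the difference.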
Fetches the result file and reads corrupt_rate.soft_to_lure live in your browser — corruption along the gold->lure direction vs isotropic => survives orthogonalization — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.
Reproduce locally: python scripts/audit.py E1 · check them all with python scripts/validate_all_claims.py.
E2 · Corruption/rescue asymmetry: asymmetry cross-family; the rescue-needs-subspace MECHANISM is Qwen-only · VERIFIED
Claim
Corruption/rescue asymmetry: asymmetry cross-family; the rescue-needs-subspace MECHANISM is Qwen-only
Experiment
The same INT-2 orthogonalization arms across three model families (Gemma-2-2B / Llama-3B / Qwen-1.5B): test the rescue arm (lure->gold) and whether rescue survives orthogonalization, for corruption/rescue asymmetry and rescue subspace-specificity.
Pre-registered
No
Powered / n / effect
No (exploratory) · n = Gemma wrong-steer 52/45; Llama wrong 20 · effect size: rescue ~0 in all conditions on all 3 (asymmetry cross-family); rescue does NOT survive orthogonalization (219x-specific)
Statistical test
rescue-rate bootstrap CI95; orthogonalized vs isotropic
Result
Gemma rescue soft 0.0 / iso 0.019; rescue collapses
Why this status
Rescue collapses (~0) on all 3 families so the asymmetry replicates, but rescue does not survive orthogonalization anywhere and is band/alpha-sensitive - the rescue-needs-the-dominant-subspace mechanism is Qwen-only/exploratory.
Fetches the result file and reads rescue_rate.soft_to_gold live in your browser — rescue collapses (~0) => the corruption/rescue asymmetry — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.
Reproduce locally: python scripts/audit.py E2 · check them all with python scripts/validate_all_claims.py.
F. Integrated governor
F1 · Integrated governor lift base->router: Gemma-2-2B 0.597->0.802 (+20.4pts); headline span +13 to +20.5 across models. Paper: Sec (governor). · VERIFIED · load-bearing
Claim
Integrated governor lift base->router: Gemma-2-2B 0.597->0.802 (+20.4pts); headline span +13 to +20.5 across models. Paper: Sec (governor).
Experiment
Integrated governor end-to-end, Qwen-1.5B + Gemma-2-2B: base model vs the full detect->route->correct->refuse governor accuracy on the 1013-item trap+ordinary test.
The governed pipeline beats base by +13 to +20.4pp, matching the powered base->router span, with the same leakage controls and CI-excludes-0 statistics as C4.
The ▶ Verify button runs this live in your browser; prove_audit_harness.py confirms the recompute matches scikit-learn to 1e-9.
Sanity checks
PASS router > base
PASS lift in +13..+20.5 headline span
Verify
Recomputes mean(learned) - mean(base) from the raw per-item data and compares it to the published value 0.2044.
Reproduce locally: python scripts/audit.py F1
G. Measurements of the bound
G2 · Hidden-state dispersion shows NO separating signal (failure-to-find, NOT a proven null) · VERIFIED
Claim
Hidden-state dispersion shows NO separating signal (failure-to-find, NOT a proven null)
Experiment
Mechfloor consolidated, Qwen-0.5B (band-17): confidence-matched (|margin|-binned) AUROC of hidden-state dispersion (trace, mean-pairwise-cos-distance) separating confident-wrong 'trap' from correct twins; ladder 0.5B/1.5B/3B/Llama-3B.
Pre-registered
No
Powered / n / effect
No (exploratory / inconclusive) · n = 50/class (correct / trap / ignorance) · effect size: trace/mpcd AUROC 0.407/0.409 (0.5B); ~0.48-0.50 across the ladder - no separating signal
Statistical test
confidence-binned rank AUROC (margin held fixed by binning)
Result
0.5B trace AUROC 0.407, mpcd 0.409 (twins null)
Why this status
n=50/class is underpowered to assert absence - dispersion shows no separating signal (AUROC ~0.5, confidence-matched), but a failure-to-find at n=50 cannot be called a proven null, so it stays inconclusive/exploratory.
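Confidence matching matters because a score can look discriminative only by tracking confidence. The sketch below shows one way to compute a |margin|-binned rank AUROC. It assumes labels of 1 for confident-wrong traps and 0 for correct twins, half-open bins, and pair-count weighting across bins; the card does not state the weighting, so that choice is an assumption, and the toy data are invented for illustration.

```python
def rank_auroc(pos, neg):
    """P(pos score > neg score), ties counted as one half (Mann-Whitney U / pairs)."""
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def binned_auroc(scores, labels, margins, edges):
    """AUROC inside |margin| bins [edges[i], edges[i+1]), averaged by pair count.

    Bins that lack either class are skipped.
    """
    total_pairs, weighted = 0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = [(s, y) for s, y, m in zip(scores, labels, margins)
                  if lo <= abs(m) < hi]
        pos = [s for s, y in in_bin if y == 1]
        neg = [s for s, y in in_bin if y == 0]
        if not pos or not neg:
            continue
        pairs = len(pos) * len(neg)
        weighted += pairs * rank_auroc(pos, neg)
        total_pairs += pairs
    if total_pairs == 0:
        raise ValueError("no bin contains both classes")
    return weighted / total_pairs

# Toy data: the score depends only on the margin bin, never on the label.
margins = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
scores  = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]
labels  = [1,   0,   0,   0,   1,   1,   1,   0]

pooled = rank_auroc([s for s, y in zip(scores, labels) if y == 1],
                    [s for s, y in zip(scores, labels) if y == 0])
matched = binned_auroc(scores, labels, margins, edges=[0.0, 0.5, 1.0])
print(f"pooled AUROC={pooled:.2f}  confidence-matched AUROC={matched:.2f}")
```

In the toy data the pooled AUROC is inflated purely because traps are over-represented in the high-margin bin; holding the margin fixed by binning removes that confound and returns chance.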
Fetches the result file and reads medians.correct.eig_trace live in your browser — dispersion median (trap vs correct ~identical => no separating signal) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.
Reproduce locally: python scripts/audit.py G2 · check them all with python scripts/validate_all_claims.py.
G3 · Semantic entropy is BLIND to confident error (double dissociation). Gemma-2-27B SE->error AUROC = 0.527, CI [0.489,0.565], n=800; all 12 models bounded <0.60. Paper: Table I + IV.B. · VERIFIED · load-bearing
Claim
Semantic entropy is BLIND to confident error (double dissociation). Gemma-2-27B SE->error AUROC = 0.527, CI [0.489,0.565], n=800; all 12 models bounded <0.60. Paper: Table I + IV.B.
Experiment
Semantic entropy (Farquhar NLI-clustered, K=8 sampled generations, roberta-large-mnli) on Gemma-2-27B over 800 TruthfulQA items; AUROC of SE vs whether the greedy answer was wrong, plus a confident-wrong vs ignorance split; TriviaQA positive control 0.774.
Pre-registered
Yes - PREREG_FARQUHAR_SEMANTIC_ENTROPY_v1.md
Powered / n / effect
Yes · n = 800 (Gemma-27B); confident-wrong 228, ignorance 226; 5 models / 3 families overall · effect size: TruthfulQA AUROC 0.476-0.561 (all CI upper <0.65) vs TriviaQA control 0.774 = double dissociation
Statistical test
rank AUROC with bootstrap CI95; confident-wrong vs ignorance split
Result
Gemma-27B SE AUROC 0.527, CI [0.489,0.565]; confident-wrong 0.514 [0.469,0.556]
Why this status
SE is at-chance on TruthfulQA confident errors (CI upper 0.565 < 0.65 across 5 models/3 families) yet 0.774 on TriviaQA ignorance - a clean double dissociation, the strongest measured leg.
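The statistic named on this card, a rank AUROC with a bootstrap 95% CI, can be sketched in pure Python as below. This is an illustrative re-implementation, not the audit harness: the auroc function uses midranks for ties, the bootstrap is a plain percentile interval, and the per_item records are synthetic (SE drawn independently of the error label), so the numbers it prints are not the paper's.

```python
import random

def auroc(scores, labels):
    """Rank-based AUROC: Mann-Whitney U with midranks for ties, over n_pos * n_neg."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1          # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_ci(scores, labels, n_boot=1000, seed=0):
    """Percentile 95% CI, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:                # skip degenerate one-class resamples
            stats.append(auroc([scores[i] for i in idx], ys))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]

# Synthetic stand-in for per_item (NOT the paper's data).
rng = random.Random(1)
per_item = [{"SE": rng.random(), "err": rng.randrange(2)} for _ in range(800)]
se = [it["SE"] for it in per_item]
err = [it["err"] for it in per_item]

point = auroc(se, err)
low, high = bootstrap_ci(se, err)
print(f"AUROC={point:.3f}  CI95=[{low:.3f}, {high:.3f}]")
```

With n=800 and a score unrelated to the label, the interval is a few hundredths wide on either side of 0.5, which is the same order as the published [0.489, 0.565].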
# semantic_entropy_nli_*.py -> recomputed by audit.py G3
auroc(SE=[it['SE'] for it in per_item],
labels=[it['err'] for it in per_item]) # rank-based Mann-Whitney U; n=800
The ▶ Verify button runs this live in your browser; prove_audit_harness.py confirms the recompute matches scikit-learn to 1e-9.
Sanity checks
PASS n matches
PASS AUROC in useless band (<0.60)
PASS both classes present (non-degenerate)
PASS SE has variance (not all-equal)
Verify
Recomputes AUROC(SE, err) from the raw per-item data and compares it to the published value 0.527.
Reproduce locally: python scripts/audit.py G3
G4 · Cross-task SE INVERSION: open-gen SE vs forced-choice error AUROC = 0.443, CI excludes high; n=250; 56.5% of fc-wrong in low-SE cluster. Paper: Sec (inversion). · VERIFIED · load-bearing
Claim
Cross-task SE INVERSION: open-gen SE vs forced-choice error AUROC = 0.443, CI excludes high; n=250; 56.5% of fc-wrong in low-SE cluster. Paper: Sec (inversion).
Experiment
Cross-task SE inversion, Qwen-1.5B: open-generation NLI semantic entropy joined against an INDEPENDENT layerwise forced-choice error label; AUROC of SE vs forced-choice-wrong, n=250.
Yes · n = 250 (forced-choice wrong 115, correct 135; double-wrong 80) · effect size: AUROC 0.443 (below chance = inversion); 56.5% of forced-choice-wrong sit in the low-SE cluster
Statistical test
rank AUROC with bootstrap CI95 (non-circular: the SE task differs from the forced-choice task)
Result
AUROC 0.443, CI [0.375,0.516]; median SE forced-choice-wrong 1.494 ~ correct 1.667
Why this status
Open-gen SE is blind-to-inverted on the independent forced-choice confident-error basin (0.443, 56.5% of wrong in the low-SE cluster), confirming SE blindness non-circularly; the caveat is the adversarial-lure (Kalai) design.
# se_kinematic_join.py
fc = layerwise_features['y'][item_index] # INDEPENDENT forced-choice label
auroc(SE[fc==1], SE[fc==0]) # 0.443; expect ~0.5 = blind across tasks
The ▶ Verify button runs this live in your browser; prove_audit_harness.py confirms the recompute matches scikit-learn to 1e-9.
Sanity checks
PASS n = fc_wrong + fc_correct
PASS non-circular (independent task)
PASS inversion direction (<0.5 or near)
Verify
Recomputes AUROC(SE, forced-choice-wrong) from the raw per-item data and compares it to the published value 0.443.
Reproduce locally: python scripts/audit.py G4
SAE · An unsupervised Gemma-Scope SAE's reconstruction error is also BLIND to confident error (mirrors semantic entropy). Gemma-2-2B, n=817: recon-error AUROC ~0.497 [.448,.545], perm p=0.56; mean recon-error confident-wrong ~ correct. Paper: Sec II/IV (SAE). · VERIFIED · load-bearing
Claim
An unsupervised Gemma-Scope SAE's reconstruction error is also BLIND to confident error (mirrors semantic entropy). Gemma-2-2B, n=817: recon-error AUROC ~0.497 [.448,.545], perm p=0.56; mean recon-error confident-wrong ~ correct. Paper: Sec II/IV (SAE).
Experiment
SAE confident-error probe, Gemma-2-2B with the gemma-scope-2b-pt-res 16k SAE at commitment band L12 (n=817): unsupervised SAE reconstruction-error as an anomaly score vs a supervised probe vs a per-feature oracle, on confident-wrong vs correct.
Pre-registered
No
Powered / n / effect
No (exploratory; corroborates the bound) · n = 817 (confident-wrong 260, ignorance 258, correct 299) · effect size: unsupervised recon-error is BLIND: confident-wrong 81.19 ~ correct 81.18; supervised SAE-feature AUROC only 0.562
The unsupervised SAE reconstruction error is nearly identical for confident-wrong and correct items (81.19 vs 81.18) and the supervised SAE-feature AUROC is only 0.56 - the dictionary-learning readout is endpoint-blind to confident error, corroborating the bound.
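The claim for this card reports a permutation p-value (perm p=0.56) alongside the AUROC. A label-shuffling permutation test of that kind can be sketched as follows. The two-sided statistic |AUROC - 0.5|, the add-one correction, and the toy inputs are all assumptions made for illustration; the card does not say which variant was used.

```python
import random

def rank_auroc(pos, neg):
    """P(pos score > neg score), ties counted as one half."""
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def permutation_p(scores, labels, n_perm=1000, seed=0):
    """Two-sided permutation p-value for |AUROC - 0.5| under shuffled labels."""
    def stat(ys):
        pos = [s for s, y in zip(scores, ys) if y == 1]
        neg = [s for s, y in zip(scores, ys) if y == 0]
        return abs(rank_auroc(pos, neg) - 0.5)

    rng = random.Random(seed)
    observed = stat(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)              # class sizes are preserved
        if stat(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)       # add-one correction keeps p > 0

# Toy case 1: perfectly separated scores, so the p-value should be tiny.
sep_scores = list(range(20))
sep_labels = [0] * 10 + [1] * 10
p_separated = permutation_p(sep_scores, sep_labels)

# Toy case 2: scores unrelated to labels (synthetic, centred near 81.2).
rng = random.Random(2)
flat_scores = [rng.gauss(81.2, 1.0) for _ in range(60)]
flat_labels = [0] * 30 + [1] * 30
p_blind = permutation_p(flat_scores, flat_labels)

print(f"separated p={p_separated:.4f}  unrelated p={p_blind:.2f}")
```

A large p-value here means only that the observed AUROC is typical of shuffled labels. It does not by itself establish that no signal exists, which is why the card keeps its exploratory status.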
# sae_confident_error; D3 = ||x - decode(encode(x))|| as anomaly score
mean_recon_err[confident_wrong] vs [correct] = 81.19 vs 81.18 # ~identical -> AUROC ~0.497 (blind)
# contrast: D1 supervised raw-residual probe still reads it
The ▶ Verify button runs this live in your browser; prove_audit_harness.py confirms the recompute matches scikit-learn to 1e-9.
Sanity checks
PASS recon-err near-identical -> no separation (blind)
PASS supervised raw-residual probe still reads it (D1>0.5)
PASS n=817 (powered)
Verify
Recomputes the decisive check live from the stored statistics. The per-item array is not shipped in this summary file, so this confirms the summary value; the detector value is also reproduced by an independent re-implementation.
Powered window-patching, Qwen2.5-1.5B/3B/7B/14B/32B: patch the commitment-band (layers 12-23) of clean-true runs and measure the flip-to-confident-FALSE rate; tests that the confident-error basin is reachable on-manifold and persists with scale.
Wilson 95% CI on the flip rate (excludes coin-flip at every scale)
Result
1.5B flip 0.952, Wilson [0.842,0.987]; 14B/32B flip 1.0, Wilson [0.985,1.0]
Why this status
Window-patch flips to confident-false ~0.95-1.0 at every scale 1.5B->32B with Wilson CIs excluding chance, so the wrong-answer basin exists and is patching-reachable on-manifold across scale (Akarlar's basin extended).
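The Wilson score interval used on this card can be sketched as below. The card reports flip rates and intervals but not the underlying counts, so the counts used here (40 of 42, and 250 of 250) are assumptions chosen because they reproduce the published bounds to three decimals; they are not taken from the result file.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.959964):
    """Wilson score interval for a binomial proportion (default z gives 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Assumed counts (see the note above).
low_15b, high_15b = wilson_ci(40, 42)      # flip rate 0.952
low_big, high_big = wilson_ci(250, 250)    # flip rate 1.0

print(f"1.5B: [{low_15b:.3f}, {high_15b:.3f}]  excludes 0.5: {low_15b > 0.5}")
print(f"14B/32B: [{low_big:.3f}, {high_big:.3f}]  excludes 0.5: {low_big > 0.5}")
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and remains informative when the observed rate is exactly 1.0, which is why it suits flip rates this close to the ceiling.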
Fetches the result file and reads window_patch_flip live in your browser — window-patch flip rate => the confident-error basin is patching-reachable — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.
Reproduce locally: python scripts/audit.py H1 · check them all with python scripts/validate_all_claims.py.
I. Bounding negatives
I1 · No capacity cliff (bounding negative) · VERIFIED
Claim
No capacity cliff (bounding negative)
Experiment
Probability-flux probe, Qwen-1.5B (band-L24): measure the probability-current ratio rho_Q on TQA-traps (n=817) vs GSM8K-reasoning (n=6000) to test for a capacity cliff in the non-equilibrium flux.
Pre-registered
No
Powered / n / effect
No (exploratory) · n = TQA-trap 817; GSM8K 6000 · effect size: no accuracy cliff; flux rho_Q 0.177 (TQA) vs 0.273 (GSM8K), propagator radius 0.776 vs 0.949
Statistical test
bootstrap CI95 on rho_Q (CIs non-overlapping between regimes)
Flux varies with cognitive regime (reasoning > trap, non-overlapping CIs) with no accuracy cliff across the sweep - a bounding NULL on the capacity-cliff hypothesis; exploratory.
Fetches the result file and reads TQA_trap.rho_Q_current_ratio live in your browser — non-equilibrium flux rho_Q in TQA-traps (no capacity cliff) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.
Reproduce locally: python scripts/audit.py I1 · check them all with python scripts/validate_all_claims.py.
I2 · Substrate dynamics are GENERIC, not hallucination-specific (N=1) · VERIFIED
Claim
Substrate dynamics are GENERIC, not hallucination-specific (N=1)
Experiment
Same probability-flux / substrate-dynamics probe, asking whether the dynamics are hallucination-specific or generic (contraction ~0.829, rotation ~0.76 non-discriminative AUROC 0.44); N=1 model.
Pre-registered
No
Powered / n / effect
No (exploratory) · n = 1 model (TQA 817 / GSM8K 6000 conditions) · effect size: substrate generic: contraction 0.829, rotation 0.76 non-discriminative (AUROC 0.44); 2 dramatic readings retired
Statistical test
rank AUROC for discrimination (non-discriminative); descriptive flux comparison
Result
rho_Q regime-varying but generic; rotation AUROC 0.44 (not hallucination-specific)
Why this status
Substrate dynamics track regime/computation generically, not a hallucination-specific signature (rotation AUROC 0.44), and prior dramatic readings were retired - a bounding NULL at N=1, exploratory.
Fetches the result file and reads GSM8K_reasoning.rho_Q_current_ratio live in your browser — flux higher in reasoning than traps => substrate dynamics are generic (N=1) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.
Reproduce locally: python scripts/audit.py I2 · check them all with python scripts/validate_all_claims.py.
Patent pending; CC BY-NC 4.0 (see PATENTS / LICENSE).