Wrong With Conviction: Why Confident Errors Evade Detection in Language Models

Click a claim on the left to jump to its experiment, and press Verify on any card to check the number live in your browser.

T. The theorem & how we checked it

The theorem (the lead result). Participation-ratio (PR) input-invariance ⇒ an unsupervised single-layer endpoint readout, and semantic entropy, are blind to confident error. Intuition: at the committed representation the macroscopic spectral geometry barely moves between a confident-correct and a confident-wrong answer, so any detector that only reads that endpoint geometry cannot separate them.

How we checked it — four independent ways:

Empirically — measured PR on a real network: PR 400.3 (wrong) vs 396.8 (correct) = a 0.90% difference, endpoint AUROC 0.531 (≈ chance), matching the idealized prediction (AUROC→0.5). per_layer_pr_* (claim G1).
Mathematically — the proof was independently reviewed by two separate model families. A gap was caught (the bound needs rank o(d) and bounded norm) and fixed; both reviewers then confirmed the corrected theorem, with two cosmetic tweaks pending. Status: reviewed and accepted with fixes — the empirical prediction is solid; we do not over-claim “proven” until the tweaks land.
By its prediction — the theorem predicts exactly the blindness that the measurements below independently confirm: semantic entropy blind (G3), the SAE readout blind (SAE), dispersion shows no signal (G2). The prediction is borne out by three separate instruments.
By simulation — a Python simulation (scripts/prove_theorem_sim.py, seed 2026) builds data under the theorem’s premise (two classes sharing a low-rank geometry, rank o(d) << d, differing only along one direction) and reproduces the predicted behavior: the endpoint/dispersion readout is blind (AUROC ~0.50), a supervised direction readout separates (~0.97), and breaking the invariance restores detection (~0.99) — so the blindness comes from the shared geometry. A simulation demonstrates the mechanism; it is not a formal proof.
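
The mechanism in the simulation item can be sketched in a few lines. This is a minimal illustration, not the contents of scripts/prove_theorem_sim.py: the dimensions, noise scale, shift size, the squared-distance "dispersion" readout and the LDA-style supervised direction are all assumptions chosen for the example.

```python
import numpy as np


def auroc(labels, scores):
    """Rank-based (Mann-Whitney) AUROC; assumes continuous scores, so no ties."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def simulate(seed=2026, d=64, rank=4, n=4000, shift=2.5):
    """Return (AUROC of unsupervised dispersion, AUROC of supervised direction)."""
    rng = np.random.default_rng(seed)
    # Orthonormal columns: `rank` shared directions plus one class direction.
    basis, _ = np.linalg.qr(rng.standard_normal((d, rank + 1)))
    shared, direction = basis[:, :rank], basis[:, rank]

    y = rng.integers(0, 2, size=n)                 # 1 = "confident-wrong" (toy label)
    latent = 3.0 * rng.standard_normal((n, rank))  # shared low-rank geometry
    x = latent @ shared.T + rng.standard_normal((n, d))
    x += shift * np.outer(y, direction)            # classes differ along one axis only

    train = np.arange(n) < n // 2
    test = ~train

    # Unsupervised readout: squared distance from the pooled mean (uses no labels).
    center = x[train].mean(axis=0)
    dispersion = ((x[test] - center) ** 2).sum(axis=1)

    # Supervised readout: LDA direction fitted on the training half.
    mu0 = x[train & (y == 0)].mean(axis=0)
    mu1 = x[train & (y == 1)].mean(axis=0)
    centered = x[train] - np.where(y[train][:, None] == 1, mu1, mu0)
    cov = centered.T @ centered / len(centered)
    w = np.linalg.solve(cov, mu1 - mu0)

    return auroc(y[test], dispersion), auroc(y[test], x[test] @ w)
```

Calling simulate() returns the two AUROCs. Under these assumptions the dispersion readout should sit near chance while the supervised direction separates the classes, because both classes are placed symmetrically around the pooled mean and share the same covariance.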

Green = confident-correct, red = confident-wrong. Left: the endpoint / dispersion readout (overlapping → blind). Middle: the supervised direction readout (separated). Right: a control where the classes have different rank (the endpoint detects again). Runs live in your browser on fresh random data each click. Offline / full version (with the participation-ratio check): python scripts/prove_theorem_sim.py.

G1 PR(J) input-invariance => unsupervised endpoint detectors + SE are blind (the lead) VERIFIED load-bearing

Claim

PR(J) input-invariance => unsupervised endpoint detectors + SE are blind (the lead)

Experiment

Per-layer participation-ratio PR(J) of the Jacobian, Qwen2.5-0.5B; measures PR input-invariance. The preprint/powered run gives PR ~400.3 (wrong) vs 396.8 (right) and the endpoint-detector AUROC, paired with the bounded-norm/rank theorem.
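
For readers unfamiliar with the quantity, here is a minimal sketch of a participation ratio. It assumes the standard definition PR(J) = tr(JᵀJ)² / tr((JᵀJ)²), which the page does not spell out (the tr_JtJ_sq_mid_mean field name only hints at the denominator term), and it takes any 2-D array as a stand-in for a Jacobian.

```python
import numpy as np


def participation_ratio(jacobian):
    """PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues) of J^T J."""
    singular_values = np.linalg.svd(np.asarray(jacobian, dtype=float), compute_uv=False)
    eigenvalues = singular_values ** 2          # eigenvalues of J^T J
    return eigenvalues.sum() ** 2 / (eigenvalues ** 2).sum()
```

Under this definition PR counts the effective number of directions a map uses: an identity matrix of size d gives d, and a rank-1 matrix gives 1. Two inputs whose Jacobians have nearly the same PR therefore look alike to any readout that only sees this spectral summary.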

Pre-registered

Powered / n / effect

No (prediction + empirical; theorem reviewed and accepted with fixes) · n = smoke 10 prompts, 24 layers (preprint PR run 400.3 vs 396.8) · effect size: PR difference 0.90% (400.3 vs 396.8); endpoint-detector AUROC 0.531

Statistical test

PR-mean-by-layer; AUROC of the endpoint detector vs correctness

Result

preprint PR 400.3 vs 396.8 = 0.90%, AUROC 0.531 (~chance)

Why this status

PR(J) is near-invariant to whether the answer is wrong (0.90% gap, AUROC 0.531), so unsupervised endpoint detectors and SE are predicted blind; the theorem (needs rank o(d) AND bounded norm) was independently reviewed - a gap was caught and fixed, then accepted - and is not yet re-integrated.

Location

experiments/epsilon_qwen_2026_06_08/outputs/per_layer_pr_Qwen2.5-0.5B-Instruct_smoke.json

Category

BLINDNESS

Verify

Fetches the result file and reads tr_JtJ_sq_mid_mean live in your browser — mid-band Jacobian/PR geometry (the PR-invariance the theorem simulation demonstrates) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.

Reproduce locally: python scripts/audit.py G1 · check them all with python scripts/validate_all_claims.py.

A. Detector

A1 Supervised internal-state probe DETECTS confident error. Qwen-0.5B AUROC = 0.731, CI [0.715,0.746], perm p=0.001; clears confound floor 0.566; beats confidence baseline 0.633. Paper: Sec VI + Fig 2. VERIFIED load-bearing

Claim

Supervised internal-state probe DETECTS confident error. Qwen-0.5B AUROC = 0.731, CI [0.715,0.746], perm p=0.001; clears confound floor 0.566; beats confidence baseline 0.633. Paper: Sec VI + Fig 2.

Experiment

Qwen2.5-0.5B/1.5B/3B-Instruct on TriviaQA; logistic probe on frozen band-L hidden states, entity-grouped 5-fold OOF; AUROC of probe vs whether the confident greedy answer was wrong.

Pre-registered

Yes - c1-frozen-989d14c frozen decision rule + PI-signed pre-flight 2026-06-10 (PASS_preregistered=true)

Powered / n / effect

Yes · n = pool 7710; confident 3796; confident-wrong 1715 (0.5B) · effect size: AUROC rises overall across the Qwen ladder, with a dip at 3B: 0.731 (0.5B) / 0.779 (1.5B) / 0.762 (3B) / 0.802 (14B) / 0.821 (32B); edge over confidence +0.075 to +0.221

Statistical test

rank AUROC entity-grouped OOF; permutation p=0.001; vs preregistered confound floor 0.566

Result

AUROC 0.731, CI [0.715,0.746]; clears floor; perm p=0.001

Why this status

Clears its preregistered confound floor (0.566) at perm p=0.001 with CI above it across the ladder; PI ratifies POWERED though the file's own EXPLORATORY flag stays set.

Location

experiments/epsilon_qwen_2026_06_08/outputs/c1_detector_Qwen2p5-0p5B-Instruct_result.json

Category

DETECTION

Computation

# c1_detector_14b.py finalize()
Xc = X[margin > median(margin)]            # confident subset
Xdc = within_dataset_deconfound(Xc)        # subtract per-dataset mean
oof = grouped_5fold_OOF_probe(Xdc, y, entity_groups)
auroc(y, oof)   # 0.731; clears confound_floor 0.566; perm p=0.001

The ▶ Verify button runs this live in your browser; prove_audit_harness.py confirms the recompute matches scikit-learn to 1e-9.
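
The entity-grouped out-of-fold step can be illustrated with a small numpy-only sketch. It is not c1_detector_14b.py: the gradient-descent logistic fit, the fold assignment, the per-fold standardization and all hyperparameters are assumptions for illustration, and the within-dataset deconfound step is omitted.

```python
import numpy as np


def auroc(labels, scores):
    """Rank-based AUROC; assumes continuous scores without ties."""
    labels = np.asarray(labels)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def fit_logreg(x, y, l2=1.0, steps=300, lr=0.1):
    """Plain gradient-descent logistic regression with an L2 penalty."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        w -= lr * (x.T @ (p - y) + l2 * w) / len(y)
        b -= lr * (p - y).mean()
    return w, b


def grouped_oof_scores(x, y, groups, n_folds=5, seed=0):
    """Out-of-fold probe scores; every group lands in exactly one test fold."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.unique(groups))
    fold_of_group = {g: i % n_folds for i, g in enumerate(shuffled)}
    fold = np.array([fold_of_group[g] for g in groups])
    oof = np.zeros(len(y))
    for k in range(n_folds):
        tr, te = fold != k, fold == k
        mu, sd = x[tr].mean(axis=0), x[tr].std(axis=0) + 1e-8  # train-only stats
        w, b = fit_logreg((x[tr] - mu) / sd, y[tr])
        oof[te] = ((x[te] - mu) / sd) @ w + b
    return oof
```

Grouping by entity keeps all items about the same entity on one side of each split, so the probe cannot score well by memorizing entities. The AUROC is then computed once on the pooled out-of-fold scores, as in auroc(y, grouped_oof_scores(X, y, entity_groups)).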

Sanity checks

PASS clears confound floor
PASS permutation significant (p<0.05)
PASS beats confidence baseline
PASS prereg PASS

Verify

Recomputes the decisive check live from the stored statistics. The per-item array is not shipped in this summary file, so this confirms the summary value; the detector value is also reproduced by an independent re-implementation.

Reproduce locally: python scripts/audit.py A1

A1b Supervised probe detects confident error - Qwen-1.5B (ladder rung) VERIFIED

Claim

Supervised probe detects confident error - Qwen-1.5B (ladder rung)

Experiment

Qwen2.5-1.5B-Instruct on TriviaQA; logistic probe on frozen band-L hidden states, entity-grouped 5-fold OOF; AUROC of probe vs whether the confident greedy answer was wrong.

Pre-registered

Yes - c1-frozen-989d14c frozen decision rule + PI-signed pre-flight 2026-06-10 (PASS_preregistered=true)

Powered / n / effect

Yes · n = confident-wrong 822 (Qwen-1.5B) · effect size: AUROC 0.779; +0.084 over confidence baseline 0.695

Statistical test

rank AUROC entity-grouped OOF; permutation p=0.001; vs preregistered confound floor 0.589

Result

AUROC 0.779, CI [0.76,0.797]; clears floor 0.589; perm p=0.001

Why this status

The same pre-registered detector at the 1.5B rung of the Qwen ladder; clears its confound floor at perm p=0.001 - this is the detector's scale ladder, not a separate claim.

Location

experiments/epsilon_qwen_2026_06_08/outputs/c1_detector_Qwen2p5-1p5B-Instruct_result.json

Category

DETECTION

Verify

Fetches the result file and reads H1_detector.auroc live in your browser — the supervised probe AUROC at Qwen-1.5B (clears its confound floor 0.589) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.

Reproduce locally: python scripts/audit.py A1b · check them all with python scripts/validate_all_claims.py.

A1c Supervised probe detects confident error - Qwen-3B (ladder rung) VERIFIED

Claim

Supervised probe detects confident error - Qwen-3B (ladder rung)

Experiment

Qwen2.5-3B-Instruct on TriviaQA; logistic probe on frozen band-L hidden states, entity-grouped 5-fold OOF; AUROC of probe vs whether the confident greedy answer was wrong.

Pre-registered

Yes - c1-frozen-989d14c frozen decision rule + PI-signed pre-flight 2026-06-10 (PASS_preregistered=true)

Powered / n / effect

Yes · n = confident-wrong 456 (Qwen-3B) · effect size: AUROC 0.762; +0.075 over confidence baseline 0.687

Statistical test

rank AUROC entity-grouped OOF; permutation p=0.001; vs preregistered confound floor 0.563

Result

AUROC 0.762, CI [0.738,0.787]; clears floor 0.563; perm p=0.001

Why this status

The same pre-registered detector at the 3B rung of the Qwen ladder; clears its confound floor at perm p=0.001.

Location

experiments/epsilon_qwen_2026_06_08/outputs/c1_detector_Qwen2p5-3B-Instruct_result.json

Category

DETECTION

Verify

Fetches the result file and reads H1_detector.auroc live in your browser — the supervised probe AUROC at Qwen-3B (clears its confound floor 0.563) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.

Reproduce locally: python scripts/audit.py A1c · check them all with python scripts/validate_all_claims.py.

A1d Supervised probe detects confident error - Qwen-14B (ladder rung) VERIFIED

Claim

Supervised probe detects confident error - Qwen-14B (ladder rung)

Experiment

Qwen2.5-14B-Instruct on TriviaQA; logistic probe on frozen band-L hidden states, entity-grouped 5-fold OOF; folded from the cloud capture.

Pre-registered

Yes - c1-frozen-989d14c frozen decision rule + PI-signed pre-flight 2026-06-10 (PASS_preregistered=true)

Powered / n / effect

Yes · n = confident-wrong 186 (Qwen-14B) · effect size: AUROC 0.802; +0.103 over confidence baseline 0.699

Statistical test

rank AUROC entity-grouped OOF; permutation p=0.001; vs preregistered confound floor 0.665

Result

AUROC 0.802, CI [0.76,0.843]; clears floor 0.665; perm p=0.001

Why this status

The detector at the 14B rung; clears its floor at perm p=0.001 - the ladder keeps clearing as scale rises.

Location

experiments/epsilon_qwen_2026_06_08/outputs/c1_detector_Qwen2p5-14B-Instruct_result.json

Category

DETECTION

Verify

Fetches the result file and reads H1_detector.auroc live in your browser — the supervised probe AUROC at Qwen-14B (clears floor 0.665) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.

Reproduce locally: python scripts/audit.py A1d · check them all with python scripts/validate_all_claims.py.

A1e Supervised probe detects confident error - Qwen-32B (ladder rung) VERIFIED

Claim

Supervised probe detects confident error - Qwen-32B (ladder rung)

Experiment

Qwen2.5-32B-Instruct on TriviaQA; logistic probe on frozen band-L hidden states, entity-grouped 5-fold OOF; folded from the cloud capture.

Pre-registered

Yes - c1-frozen-989d14c frozen decision rule + PI-signed pre-flight 2026-06-10 (PASS_preregistered=true)

Powered / n / effect

Yes · n = confident-wrong 114 (Qwen-32B) · effect size: AUROC 0.821; +0.221 over confidence baseline 0.600

Statistical test

rank AUROC entity-grouped OOF; permutation p=0.001; vs preregistered confound floor 0.702

Result

AUROC 0.821, CI [0.771,0.865]; clears floor 0.702; perm p=0.001

Why this status

The top Qwen rung: the detector AUROC is highest here (0.821) and the edge over confidence widens to +0.22 as confidence calibration degrades with scale.

Location

experiments/epsilon_qwen_2026_06_08/outputs/c1_detector_Qwen2p5-32B-Instruct_result.json

Category

DETECTION

Verify

Fetches the result file and reads H1_detector.auroc live in your browser — the supervised probe AUROC at Qwen-32B (clears floor 0.702; edge over confidence +0.22) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.

Reproduce locally: python scripts/audit.py A1e · check them all with python scripts/validate_all_claims.py.

DET27B Supervised probe detects confident error - Gemma-2-27B (cross-family) VERIFIED

Claim

Supervised probe detects confident error - Gemma-2-27B (cross-family)

Experiment

gemma-2-27b-it on TriviaQA; logistic probe on frozen band-L hidden states, entity-grouped 5-fold OOF; folded from the cloud capture (cross-family, Google Gemma).

Pre-registered

Yes - c1-frozen-989d14c frozen decision rule + PI-signed pre-flight 2026-06-10 (PASS_preregistered=true)

Powered / n / effect

Yes · n = confident-wrong 161 (Gemma-2-27B) · effect size: AUROC 0.775; +0.030 over confidence baseline 0.745

Statistical test

rank AUROC entity-grouped OOF; permutation p=0.001; vs preregistered confound floor 0.578

Result

AUROC 0.775, CI [0.733,0.814]; clears floor 0.578; perm p=0.001

Why this status

A second vendor family (Google Gemma) clears its floor; confidence is already fairly calibrated on this model so the edge is small (+0.03), but the detector still reads the error.

Location

experiments/epsilon_qwen_2026_06_08/outputs/c1_detector_gemma-2-27b-it_result.json

Category

DETECTION

Verify

Fetches the result file and reads H1_detector.auroc live in your browser — the supervised probe AUROC at Gemma-2-27B, cross-family (clears floor 0.578) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.

Reproduce locally: python scripts/audit.py DET27B · check them all with python scripts/validate_all_claims.py.

DET24B Supervised probe - Mistral-24B (marginal, reported separately) VERIFIED

Claim

Supervised probe - Mistral-24B (marginal, reported separately)

Experiment

Mistral-Small-24B-Instruct-2501 on TriviaQA; logistic probe on frozen band-L hidden states, entity-grouped 5-fold OOF; folded from the cloud capture.

Pre-registered

No (not pre-registered; see Why this status)

Powered / n / effect

No (marginal) · n = confident-wrong 85 (Mistral-24B) · effect size: AUROC 0.727; +0.063 over confidence baseline 0.664

Statistical test

rank AUROC entity-grouped OOF; permutation p=0.001; vs confound floor 0.693

Result

AUROC 0.727, CI [0.66,0.787]; does NOT clear floor 0.693 (the point estimate is above the floor, but the CI lower bound 0.66 falls below it)

Why this status

Permutation p=0.001 beats chance, but the AUROC does NOT clear its confound floor: the point estimate 0.727 sits above the floor 0.693, yet the CI lower bound 0.66 falls below it. In addition, n=85 is small and the run was not pre-registered, so it is reported SEPARATELY, not in the powered set (honest scope).

Location

experiments/epsilon_qwen_2026_06_08/outputs/c1_detector_Mistral-Small-24B-Instruct-2501_result.json

Category

DETECTION

Verify

Fetches the result file and reads H1_detector.auroc live in your browser (the probe AUROC at Mistral-24B: 0.727, whose CI lower bound 0.66 falls below its floor 0.693, so marginal/separate), confirming the published value is the one in the file. Or open the file at the Location above to inspect it.

Reproduce locally: python scripts/audit.py DET24B · check them all with python scripts/validate_all_claims.py.

A2 Edge over calibrated confidence (local) VERIFIED

Claim

Edge over calibrated confidence (local)

Experiment

Same c1_detector Qwen ladder (0.5B/1.5B/3B); compares frozen-probe AUROC against the calibrated forced-choice confidence-margin baseline on the same items.

Pre-registered

Yes - c1-frozen-989d14c frozen decision rule + 2026-06-10 pre-flight (same detector pre-reg)

Powered / n / effect

No (exploratory) · n = confident-wrong 1715 / 822 / 456 across 0.5B / 1.5B / 3B · effect size: edge +0.07-0.10 AUROC over confidence (0.731 vs 0.633; 0.779 vs 0.695; 0.762 vs 0.687)

Statistical test

probe AUROC vs confidence-baseline AUROC, entity-grouped OOF (no formal edge-significance test in file)

Result

probe 0.731/0.779/0.762 vs confidence 0.633/0.695/0.687

Why this status

File flag EXPLORATORY; the local edge is real, but the claimed widening at 32B (edge +0.22 as the confidence baseline falls to ~0.60; see A1e) has no provenance file in this local set, so the scale trend is unverified by this card.

Location

experiments/epsilon_qwen_2026_06_08/outputs/c1_detector_Qwen2p5-0p5B-Instruct_result.json

Category

DETECTION

Verify

Fetches the result file and reads H1_detector.auroc live in your browser — probe AUROC (the calibrated-confidence baseline 0.633 is also in the file; local edge +0.10) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.

Reproduce locally: python scripts/audit.py A2 · check them all with python scripts/validate_all_claims.py.

A3 Independent no-shared-code re-implementation VERIFIED

Claim

Independent no-shared-code re-implementation

Experiment

Clean-room numpy-only re-implementation (imports no sklearn and nothing from the original pipeline) on the same 817-row layerwise features, Qwen-1.5B; per-layer + band logistic probe, grouped-OOF AUROC.

Pre-registered

Yes - c1-frozen-989d14c detector pre-reg (independent re-impl of the preregistered detector)

Powered / n / effect

Yes · n = 817 items, 29 layers, dim 1536 · effect size: best-layer (L19) AUROC 0.787 logreg; band-L20 0.783; original recorded 0.731

Statistical test

rank AUROC grouped-OOF; permutation p=0.002 (500 perms) at best layer
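
A label-permutation test of an AUROC can be sketched as follows. The (1 + hits) / (1 + n_perm) convention is an assumption (the result file is not quoted here), and this toy version shuffles labels freely rather than within entity groups.

```python
import numpy as np


def auroc(labels, scores):
    """Rank-based AUROC; assumes continuous scores without ties."""
    labels = np.asarray(labels)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def permutation_p_value(labels, scores, n_perm=500, seed=0):
    """One-sided p-value: how often shuffled labels match or beat the observed AUROC."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = auroc(labels, scores)
    hits = sum(
        auroc(rng.permutation(labels), scores) >= observed for _ in range(n_perm)
    )
    # The +1 terms count the observed labelling itself, so p is never exactly 0.
    return (1 + hits) / (1 + n_perm)
```

Under this convention the smallest attainable p-value with 500 permutations is 1/501, about 0.002, which is consistent with the p=0.002 reported for this card: no permuted labelling matched the observed AUROC.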

Result

AUROC 0.787 (L19 logreg), perm p=0.002; reproduces recorded 0.731

Why this status

Independent re-derivation reproduces the detector at perm p=0.002; the 0.731-vs-0.787 gap is a feature/method choice not a data discrepancy (decomposed in A4).

Location

experiments/epsilon_qwen_2026_06_08/outputs/phase14_independent_reimpl_result.json

Category

DETECTION

Verify

Fetches the result file and reads recorded_detector_AUROC_grouped_oof live in your browser — clean-room re-implementation reproduces the grouped-OOF detector AUROC (imports nothing from the pipeline) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.

Reproduce locally: python scripts/audit.py A3 · check them all with python scripts/validate_all_claims.py.

A4 Four-test de-confound battery VERIFIED

Claim

Four-test de-confound battery

Experiment

2x2 reconciliation on identical rows + folds (Qwen-1.5B, 817 items): {kinematic vs raw-band features} x {MLP vs logreg}; plus a four-test de-confound battery (length, structure-shuffle, balance, entity-holdout).

Pre-registered

Yes - c1-frozen-989d14c detector pre-reg (de-confound battery for the preregistered detector)

Powered / n / effect

Yes · n = 817 items · effect size: feature effect (raw-minus-kinematic) +0.107 AUROC; length r=0.02; structure-shuffle collapses to 0.502

Statistical test

matched rows+folds AUROC decomposition; de-confound floor 0.56-0.59 from c1_detector

Result

cells kin+MLP 0.668 / kin+logreg 0.696 / raw+MLP 0.796 / raw+logreg 0.783; strongest 0.796
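
The 2x2 reconciliation can be checked by hand from the four cells. The sketch below assumes the usual main-effect definition (average the raw-minus-kinematic difference over both models, and the MLP-minus-logreg difference over both feature sets); the dictionary layout is illustrative and not the format of the result file.

```python
# AUROC per cell, copied from the Result line (rounded to 3 decimals).
cells = {
    ("kin", "mlp"): 0.668,
    ("kin", "logreg"): 0.696,
    ("raw", "mlp"): 0.796,
    ("raw", "logreg"): 0.783,
}

# Main effect of the feature choice: raw minus kinematic, averaged over models.
feature_effect = (
    (cells[("raw", "mlp")] - cells[("kin", "mlp")])
    + (cells[("raw", "logreg")] - cells[("kin", "logreg")])
) / 2

# Main effect of the model choice: MLP minus logreg, averaged over features.
model_effect = (
    (cells[("kin", "mlp")] - cells[("kin", "logreg")])
    + (cells[("raw", "mlp")] - cells[("raw", "logreg")])
) / 2

strongest = max(cells.values())
```

From the rounded cells the feature effect comes out at about +0.1075, in line with the reported +0.107, while the model effect is under 0.01 in magnitude. That is the sense in which the gap is attributed to the feature choice rather than to the classifier.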

Why this status

The 0.731-vs-0.787 gap is fully the raw-vs-kinematic feature choice on the same data/splits; length confound r=0.02 and structure-shuffle to chance, so the signal clears the battery.

Location

experiments/epsilon_qwen_2026_06_08/outputs/phase14b_auroc_reconcile_result.json

Category

DETECTION

Verify

Fetches the result file and reads strongest_auroc live in your browser — strongest post-deconfound cell AUROC (length r=0.02, structure-shuffle->0.502 also in file) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.

Reproduce locally: python scripts/audit.py A4 · check them all with python scripts/validate_all_claims.py.

DET8B The supervised detector GENERALIZES across model families at scale. Llama-3.1-8B (Meta) AUROC 0.849 [0.83,0.866], perm p=0.001, clears confound floor 0.698 -- the strongest in the four-family set (Gemma-2-27B 0.775, Qwen-14B 0.802, Qwen-32B 0.821, Mistral-24B 0.727 marginal). Same detector protocol, cloud-run. Paper: Table II. VERIFIED

Claim

The supervised detector GENERALIZES across model families at scale. Llama-3.1-8B (Meta) AUROC 0.849 [.83,.866], perm p=0.001, clears confound floor 0.698 -- the strongest in the four-family set (Gemma-2-27B 0.775, Qwen-14B 0.802, Qwen-32B 0.821, Mistral-24B 0.727 marginal). Same detector protocol, cloud-run. Paper: Table II.

Experiment

Frozen-state confident-error detector on Llama-3.1-8B-Instruct (folded from the cloud capture): logistic probe on band hidden states, entity-grouped OOF, AUROC vs confident-wrong; with the selective-accuracy / refusal curve.

Pre-registered

Yes - c1-frozen-989d14c frozen decision rule + 2026-06-10 pre-flight (PASS_preregistered=true)

Powered / n / effect

Yes · n = pool 7710; confident 3759; confident-wrong 591; entity-groups 3222 · effect size: AUROC 0.849; +0.136 over confidence baseline 0.713; selective acc 0.961 @50% coverage vs 0.843 random

Statistical test

rank AUROC entity-grouped OOF; permutation p=0.001; vs confound floor 0.698

Result

AUROC 0.849, CI [0.83,0.866]; perm p=0.001; clears floor 0.698

Why this status

The detector clears its preregistered confound floor (0.698) at AUROC 0.849, perm p=0.001, with 0.961 selective accuracy at 50% coverage - it scales up to an 8B cross-family model.

Location

experiments/epsilon_qwen_2026_06_08/outputs/c1_detector_Llama-3p1-8B-Instruct_result.json

Category

DETECTION

Computation

# c1_detector_14b.finalize(), folded from the cloud capture
oof = grouped_5fold_OOF_probe(deconfound(X[confident]), y, entity_groups)
auroc(y, oof)   # = 0.849; clears floor 0.698; perm p=0.001 (Meta Llama)

The ▶ Verify button runs this live in your browser; prove_audit_harness.py confirms the recompute matches scikit-learn to 1e-9.

Sanity checks

PASS clears confound floor
PASS permutation significant (p<0.05)
PASS cross-family (Meta Llama, not Qwen)

Verify

Reproduce locally: python scripts/audit.py DET8B

F2 Detector shows MODERATE cross-benchmark transfer (not 'task-general') VERIFIED

Claim

Detector shows MODERATE cross-benchmark transfer (not 'task-general')

Experiment

Benchmark governor cross-transfer: a small-model commitment-layer probe (layers 19,22,24) trained per-benchmark and pooled across MMLU/ARC/OpenBookQA/etc., plus train-on-others/test-held-out transfer AUROC.

Pre-registered

Powered / n / effect

No (moderate) · n = 200/benchmark; ~7 benchmarks pooled · effect size: pooled AUROC 0.713; per-benchmark 0.737-0.814; transfer 0.56-0.75

Statistical test

rank AUROC, bootstrap CI95, permutation p=0.001 per benchmark and pooled

Result

pooled AUROC 0.713, CI [0.685,0.738]; transfer e.g. CommonsenseQA 0.749, HellaSwag 0.563

Why this status

Pooled 0.713 is below the file's own 0.8 task-general bar; with transfer down to 0.56 it is MODERATE transfer, not the task-general claim the file required.

Location

experiments/epsilon_qwen_2026_06_08/outputs/benchmark_governor_result.json

Category

DETECTION

Verify

Fetches the result file and reads pooled_governor.auroc live in your browser — pooled cross-benchmark AUROC (< 0.8 => moderate transfer, not 'task-general') — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.

Reproduce locally: python scripts/audit.py F2 · check them all with python scripts/validate_all_claims.py.

B. Refusal payoff

B1 Internal refusal gate beats calibrated confidence at matched coverage: +5.7 (cov 0.9) to +6.4 (cov 0.7) pts in the danger zone, Qwen-1.5B. Paper: Sec (refusal payoff). VERIFIED load-bearing

Claim

Internal refusal gate beats calibrated confidence at matched coverage: +5.7 (cov 0.9) to +6.4 (cov 0.7) pts in the danger zone, Qwen-1.5B. Paper: Sec (refusal payoff).

Experiment

Risk-coverage across three model families (Qwen-1.5B, Llama, Gemma): entity-grouped detector vs calibrated confidence on a 3000-item pool; danger-zone = the 1500 hardest; selective accuracy at matched coverage 0.9-0.6.

Pre-registered

Powered / n / effect

Yes · n = pool 3000/model; danger-zone 1500/model · effect size: +5.6 to +8.5 pp detector-minus-confidence in the danger zone (Qwen +5.7-6.4; Llama +5.8-8.5; Gemma +5.6-7.4)

Statistical test

risk-coverage curve; detector-minus-confidence pp at matched coverage; entity-grouped

Result

Qwen +6.4pp @cov0.7; Llama +8.5pp @cov0.7; Gemma +7.4pp @cov0.7

Why this status

The internal refusal gate beats calibrated confidence at matched coverage across 3 vendors in the danger zone; the quantified +5.7-6.4pp increment is the contribution.

Location

experiments/epsilon_qwen_2026_06_08/outputs/r2_cross_vendor_risk_coverage_result.json

Category

GOVERNANCE

Computation

# r2_cross_vendor_risk_coverage
(danger_zone['cov_0.7']['detector']
 - danger_zone['cov_0.7']['confidence']) * 100   # = 6.4 pp

The ▶ Verify button runs this live in your browser; prove_audit_harness.py confirms the recompute matches scikit-learn to 1e-9.
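
The matched-coverage comparison can be sketched as follows. This is a hedged illustration, not r2_cross_vendor_risk_coverage itself: it assumes each method yields a per-item risk score (higher means more likely wrong) and that the gate answers the lowest-risk fraction of items and refuses the rest.

```python
import numpy as np


def selective_accuracy(correct, risk, coverage):
    """Accuracy on the `coverage` fraction of items with the lowest risk score."""
    correct = np.asarray(correct, dtype=float)
    n_keep = int(round(coverage * len(correct)))
    answered = np.argsort(risk, kind="stable")[:n_keep]   # refuse the riskiest items
    return correct[answered].mean()


def gate_edge_pp(correct, detector_risk, confidence_risk, coverage):
    """Detector-minus-confidence selective accuracy, in percentage points."""
    return 100.0 * (
        selective_accuracy(correct, detector_risk, coverage)
        - selective_accuracy(correct, confidence_risk, coverage)
    )
```

Matching coverage matters because both gates then answer the same number of items, so any difference in selective accuracy reflects which items each gate chose to refuse rather than how many.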

Sanity checks

PASS det>conf @cov_0.9 (gate beats calibrated confidence)
PASS det_minus_conf @cov_0.9 ~5.7
PASS det_minus_conf @cov_0.7 ~6.4
PASS entity-grouped detector AUROC > 0.8

Verify

Recomputes (detector - confidence) x 100 from the raw per-item data and compares it to the published value 6.4.

Reproduce locally: python scripts/audit.py B1

C. Regimes / router

C1 Trap-vs-ordinary clean separability (NOT a contribution; a Limitation) VERIFIED

Claim

Trap-vs-ordinary clean separability (NOT a contribution; a Limitation)

Experiment

Qwen-1.5B: commitment-layer (L26) confident-wrong axes for MMLU-trap vs MMLU-ordinary, compared by cosine to reference TQA-trap and ARC-ordinary axes, to test whether the split tracks regime or dataset/style.

Pre-registered

Powered / n / effect

No (confounded) · n = TQA-wrong 397; ARC-wrong 357; MMLU-wrong 1232; MMLU trap/ord 410/410 · effect size: MMLU-trap cosine 0.055 to TQA-trap vs 0.671 to ARC-ordinary; MMLU-ord 0.933 to ARC-ord vs 0.099 to TQA-trap

Statistical test

cosine alignment of commitment-layer mean axes (descriptive control, no inferential test)
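
The alignment control can be sketched as below. The axis definition (mean confident-wrong state minus mean confident-correct state) and the decision rule are assumptions for illustration; the page does not state how the result file derives its regime_driven flag.

```python
import numpy as np


def mean_axis(wrong_states, correct_states):
    """Hypothetical axis: mean confident-wrong state minus mean confident-correct state."""
    return np.mean(wrong_states, axis=0) - np.mean(correct_states, axis=0)


def cosine(u, v):
    """Cosine similarity between two axes."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def trap_tracks_regime(cos_to_trap_reference, cos_to_ordinary_reference):
    """Toy rule: the trap axis is regime-driven only if the trap reference is the closer one."""
    return cos_to_trap_reference > cos_to_ordinary_reference
```

With the reported cosines for the MMLU-trap axis (0.055 to the TQA-trap reference, 0.671 to the ARC-ordinary reference), this toy rule returns False, matching the direction of the card's conclusion.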

Result

regime_driven = False

Why this status

regime_driven=False - MMLU-trap aligns with ARC-ordinary (0.671) not TQA-trap (0.055), so the trap/ordinary split tracks dataset/style not regime; this goes to Limitations, NOT the contributions.

Location

experiments/epsilon_qwen_2026_06_08/outputs/mmlu_regime_split_result.json

Category

LIMITATION

Verify

Fetches the result file and reads regime_driven live in your browser — regime_driven flag = False => the separability is confounded (a Limitation) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.

Reproduce locally: python scripts/audit.py C1 · check them all with python scripts/validate_all_claims.py.

C2 Regime representation is capacity-emergent (leakage proportional to 1/size) VERIFIED

Claim

Regime representation is capacity-emergent (leakage proportional to 1/size)

Experiment

Powered regime router, Qwen 0.5B/1.5B/3B: train trap+ordinary LoRA adapters, measure held-out routed-ordinary accuracy via leave-one-source-out (OpenBookQA held out) to test whether the regime representation is capacity-emergent.

Pre-registered

Yes - PREREG_LAB15_POWERED_ROUTER.md

Powered / n / effect

No (exploratory) · n = 1013/model (500 trap + 513 ordinary test) · effect size: held-out routed-ordinary 0.012 -> 0.21 -> 0.43 (0.5B->1.5B->3B); leakage proxy falls as size rises

Statistical test

LOSO held-out source accuracy + source-identity-classifier leakage bound

Result

3B LOSO routed-ordinary 0.809 (OpenBookQA-held 0.433); 1.5B 0.735; small-scale near floor

Why this status

Held-out routed-ordinary accuracy rises with model size (leakage ~ 1/size), so the regime representation is capacity-emergent; small-scale is pattern-matched ignorance.

Location

experiments/epsilon_qwen_2026_06_08/outputs/regime_router_powered_qwen3b_result.json

Category

GOVERNANCE

Verify

Fetches the result file and reads regime_classifier_acc live in your browser — regime-classifier accuracy at 3B (capacity-emergent; LOSO leakage-controlled) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.

Reproduce locally: python scripts/audit.py C2 · check them all with python scripts/validate_all_claims.py.

C3 Anti-transfer honestly DOWNGRADED to a style-confound: the within-regime CROSS-DATASET transfer net median = -0.752 (every cross-dataset axis is negative), so the axes are dataset/style-specific, not two opposite causal regimes. EXPLORATORY. Paper: Limitations. VERIFIED

Claim

Anti-transfer honestly DOWNGRADED to a style-confound: the within-regime CROSS-DATASET transfer net median = -0.752 (every cross-dataset axis is negative), so the axes are dataset/style-specific, not two opposite causal regimes. EXPLORATORY. Paper: Limitations.

Experiment

Anti-transfer matrix, Qwen-1.5B: apply each dataset's confident-wrong steering axis to other datasets at alpha=0.6, measure net correction vs damage with a random-floor control.

Pre-registered

Powered / n / effect

No (exploratory) · n = confident-wrong per dataset: TQA 47, ARC 20, OBQA 12, MMLU 18 · effect size: scoped trap->ordinary net -0.18 to -0.27; within-regime cross-dataset net median -0.752; old -8.7pp was style-confounded

Statistical test

bootstrap 2000 CI95 per cell; random-floor comparison

Result

TQA->ARC net -0.273, CI [-0.377,-0.182]; verdict STYLE-CONFOUND

Why this status

STYLE-CONFOUND - ordinary axes do not transfer to other ordinary datasets (every axis is dataset/style-specific), so the -8.7pp headline is the Marin style confound; only the scoped modest harm survives.

Location

experiments/epsilon_qwen_2026_06_08/outputs/anti_transfer_matrix_qwen0.5b_result.json

Category

LIMITATION

Computation

# anti_transfer_matrix -> diagnostics.ordinary_to_ordinary_nets
median([-0.79,-0.72,-1.0,-1.0,-0.53,-0.23])   # published -0.752, computed from the unrounded nets (the rounded values listed here give -0.755)
# every cross-dataset transfer is negative -> axes dataset/style-specific -> DOWNGRADED

The ▶ Verify button runs this live in your browser; prove_audit_harness.py confirms the recompute matches scikit-learn to 1e-9.

Sanity checks

PASS all cross-dataset transfers negative (style-specific)
PASS honest downgrade recorded (STYLE-CONFOUND)
PASS flagged EXPLORATORY

Verify

Recomputes median(cross-dataset transfer nets) from the raw per-item data and compares it to the published value -0.752.

Reproduce locally: python scripts/audit.py C3

C4 Router (detect->route->correct->refuse) beats best single fix by +7.1 pts on Gemma-2-2B (LEARNED 0.802 vs best-single 0.731); McNemar p<0.001; boot CI excludes 0; n=1013. Paper: Sec VIII. VERIFIED load-bearing

Claim

Router (detect->route->correct->refuse) beats best single fix by +7.1pts on Gemma-2-2B (LEARNED 0.802 vs best-single 0.731); McNemar p<0.001; boot CI excludes 0; n=1013. Paper: Sec VIII.

Experiment

Powered regime router (detect->route->correct), Qwen 0.5B/1.5B/3B + Gemma-2-2B + Llama-3B: a learned classifier routes to the trap or ordinary LoRA; router vs best-single-fix accuracy, leakage-controlled (LOSO + source-id bound).

Pre-registered

Yes - PREREG_LAB15_POWERED_ROUTER.md

Powered / n / effect

Yes · n = 1013/model (500 trap + 513 ordinary) · effect size: +4.7 to +10.5 pp over best single fix (Qwen .5B +10.5 / 1.5B +4.7 / 3B +6.2; Gemma-2B +7.1; Llama-3B +5.3)

Statistical test

McNemar p<0.001 two-sided + bootstrap 95% CI on the gain (excludes 0)

Result

Gemma-2B router 0.802 vs best-single 0.731 = +7.1pp, boot95 [5.1,9.1], McNemar p<0.001

Why this status

Router beats best-single-fix on every model with bootstrap CIs excluding 0 and McNemar p<0.001 under LOSO + source-identity leakage control; ordering router>best-single>base holds throughout.

Location

experiments/epsilon_qwen_2026_06_08/outputs/regime_router_powered_gemma2b_result.json

Category

GOVERNANCE

Computation

# regime_router_powered.py
mean(learned) - max(mean(trap), mean(ord))   # router vs best single fix
# McNemar p<0.001; bootstrap CI excludes 0; LOSO leakage control

The ▶ Verify button runs this live in your browser; prove_audit_harness.py confirms the recompute matches scikit-learn to 1e-9.
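
The gain and its bootstrap interval can be sketched on per-item 0/1 correctness arrays. This is an illustration under stated assumptions, not regime_router_powered.py: it resamples items with replacement, keeps the three arrays paired, and re-selects the best single fix inside each resample, none of which the page confirms.

```python
import numpy as np


def router_gain(learned, trap, ordinary):
    """mean(learned) - max(mean(trap), mean(ord)) on per-item 0/1 correctness."""
    return learned.mean() - max(trap.mean(), ordinary.mean())


def bootstrap_gain_ci(learned, trap, ordinary, n_boot=2000, seed=0):
    """Percentile 95% CI of the gain under paired item resampling."""
    rng = np.random.default_rng(seed)
    n = len(learned)
    gains = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # same items drawn for all three arms
        gains[b] = router_gain(learned[idx], trap[idx], ordinary[idx])
    lo, hi = np.percentile(gains, [2.5, 97.5])
    return float(lo), float(hi)
```

Keeping the arms paired is the important design choice: the router and the single fixes are scored on the same items, so resampling them independently would overstate the interval width. A CI whose lower end stays above 0 is what the card's "boot CI excludes 0" check refers to.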

Sanity checks

PASS ordering router>best-single>base
PASS gain boot95 CI excludes 0
PASS McNemar p<0.001
PASS leakage-controlled (heldout-source routing present)

Verify

Recomputes mean(learned) - max(mean(trap), mean(ord)) from the raw per-item data and compares it to the published value 0.0711.

Reproduce locally: python scripts/audit.py C4
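
The Computation line above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code in regime_router_powered.py: the per-item 0/1 correctness lists, the function names, and the choice to recompute the best single fix inside every bootstrap resample are all assumptions made for the example, and the data is synthetic.

```python
import random

def router_gain(learned, trap, ordinary):
    """mean(learned) - max(mean(trap), mean(ordinary)) on per-item 0/1 correctness."""
    n = len(learned)
    return sum(learned) / n - max(sum(trap) / n, sum(ordinary) / n)

def bootstrap_gain_ci(learned, trap, ordinary, n_boot=1000, seed=0):
    """Percentile 95% CI on the gain; items are resampled jointly to keep the pairing."""
    rng = random.Random(seed)
    n = len(learned)
    gains = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        gains.append(router_gain([learned[i] for i in idx],
                                 [trap[i] for i in idx],
                                 [ordinary[i] for i in idx]))
    gains.sort()
    return gains[int(0.025 * n_boot)], gains[int(0.975 * n_boot) - 1]

# Synthetic per-item correctness (hypothetical, NOT the published data):
# router right on 800 of 1000 items, trap-LoRA on 730, ordinary-LoRA on 700.
n_items = 1000
learned = [1 if i < 800 else 0 for i in range(n_items)]
trap = [1 if i < 730 else 0 for i in range(n_items)]
ordinary = [1 if i < 700 else 0 for i in range(n_items)]

gain = router_gain(learned, trap, ordinary)
lo, hi = bootstrap_gain_ci(learned, trap, ordinary)
print(f"gain={gain:.3f}  boot95=[{lo:.3f}, {hi:.3f}]")
```

Resampling item indices once and applying them to all three arms keeps the comparison paired, which is what lets the interval be read as a CI on the gain itself rather than on two independent accuracies.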

RX3B · The router (detect->route->correct->refuse) GENERALIZES across models. Gain over the best single fix: Qwen-1.5B +4.7, Qwen-3B +6.2, Gemma-2-2B +7.1, Llama-3.2-3B +5.3 -- all McNemar p<0.001, n=1013/model. Card shows Qwen-3B (+6.2). Paper: Table (router). · VERIFIED

Claim

The router (detect->route->correct->refuse) GENERALIZES across models. Gain over the best single fix: Qwen-1.5B +4.7, Qwen-3B +6.2, Gemma-2-2B +7.1, Llama-3.2-3B +5.3 -- all McNemar p<0.001, n=1013/model. Card shows Qwen-3B (+6.2). Paper: Table (router).

Experiment

Powered regime router on Qwen2.5-3B-Instruct: a learned regime classifier (acc 0.997) routes to the trap/ordinary LoRA; router vs best-single-fix accuracy, leakage-controlled (LOSO + source-id classifier 0.986), n=1013.

Pre-registered

Yes - PREREG_LAB15_POWERED_ROUTER.md

Powered / n / effect

Yes · n = 1013 (500 trap + 513 ordinary) · effect size: +6.2 pp over best single fix (router 0.832 vs trap-LoRA 0.770); base 0.750

Statistical test

McNemar p<0.001 (discordant 81 vs 18) + bootstrap 95% CI [4.3,8.2] (excludes 0)

Result

router 0.832 vs best-single 0.770 = +6.2pp, boot95 [4.3,8.2], McNemar p<0.001

Why this status

On Qwen-3B the router beats the best single fix by +6.2pp, with the bootstrap CI excluding 0 and McNemar p<0.001 under LOSO + source-id leakage control (0.986). This is the highest-capacity Qwen router point, and it confirms the magnitude of the effect.

Location

experiments/epsilon_qwen_2026_06_08/outputs/regime_router_powered_qwen3b_result.json

Category

GOVERNANCE

Computation

# regime_router_powered_qwen3b
mean(learned) - max(mean(trap), mean(ord)) = 0.062   # +6.2 over best single fix
# McNemar p<0.001; same protocol across Qwen-1.5B/3B, Gemma-2B, Llama-3B

The ▶ Verify button runs this live in your browser; prove_audit_harness.py confirms the recompute matches scikit-learn to 1e-9.

Sanity checks

PASS McNemar p<0.001
PASS router > base
PASS n=1013

Verify

Recomputes mean(learned) - max(mean(trap), mean(ord)) from the raw per-item data and compares it to the published value 0.0622.

Reproduce locally: python scripts/audit.py RX3B
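
The discordant counts quoted on this card (81 vs 18) are enough to recheck both the test and the gain by hand. The sketch below assumes an exact two-sided binomial McNemar test, since the card does not say whether the exact or the chi-square form was used. It also assumes that 81 counts the items the router gets right and the best single fix gets wrong.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n, k = b + c, min(b, c)
    # Under H0 each discordant pair is a fair coin flip: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

b, c, n_items = 81, 18, 1013    # discordant counts and n from the card
p_value = mcnemar_exact(b, c)
gain = (b - c) / n_items        # paired accuracy difference implied by the counts
print(f"p={p_value:.2e}  gain={gain:.4f}")
```

Under these assumptions the implied gain, (81 - 18) / 1013, rounds to the published 0.0622, so the discordant counts and the headline number agree.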

D. Correctors

D1 · Regime-matched LoRA corrector: base 0.561 -> LoRA 0.852 on HELD-OUT categories (+0.291). Paper: Sec (correctors). · VERIFIED load-bearing

Claim

Regime-matched LoRA corrector: base 0.561 -> LoRA 0.852 on HELD-OUT categories (+0.291). Paper: Sec (correctors).

Experiment

Regime-matched governance LoRA on Qwen-1.5B (r=8, lr=1e-4, 900 steps): train a correction adapter, then evaluate on HELD-OUT categories never seen in training; base vs LoRA accuracy.

Pre-registered

Powered / n / effect

Yes · n = 189 held-out-category items · effect size: +29.1 pp (base 0.561 -> LoRA 0.852) on held-out categories

Statistical test

held-out-category accuracy lift (held-out generalization design)

Result

base 0.561 -> LoRA 0.852, lift +0.291

Why this status

The LoRA lifts categories it never trained on (+29.1pp held-out), so the regime-matched correction generalizes and can be baked into weights - that held-out generalization is the increment over standard adapters.

Location

experiments/epsilon_qwen_2026_06_08/outputs/governance_lora_result.json

Category

GOVERNANCE

Computation

# governance_lora
lora_heldout_category_acc - base_heldout_category_acc   # 0.852 - 0.561 = 0.291 (held-out categories)

The ▶ Verify button runs this live in your browser; prove_audit_harness.py confirms the recompute matches scikit-learn to 1e-9.

Sanity checks

PASS base ~0.561
PASS LoRA ~0.852
PASS lift positive + held-out (not in-sample)

Verify

Recomputes lora_acc - base_acc from the raw per-item data and compares it to the published value 0.291.

Reproduce locally: python scripts/audit.py D1
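
What makes this number a generalization claim is the split: accuracy is scored only on categories the adapter never saw. A minimal sketch of that scoring step follows. The field names (category, base_correct, lora_correct), the category labels, and the toy items are hypothetical and are not taken from governance_lora_result.json.

```python
def heldout_lift(items, train_categories):
    """Accuracy lift restricted to items whose category was never trained on."""
    held = [it for it in items if it["category"] not in train_categories]
    if not held:
        raise ValueError("no held-out items")
    base = sum(it["base_correct"] for it in held) / len(held)
    lora = sum(it["lora_correct"] for it in held) / len(held)
    return base, lora, lora - base

# Toy items (hypothetical): "law" is a training category, the rest are held out.
items = [
    {"category": "law", "base_correct": 1, "lora_correct": 1},
    {"category": "law", "base_correct": 0, "lora_correct": 1},
    {"category": "health", "base_correct": 0, "lora_correct": 1},
    {"category": "health", "base_correct": 1, "lora_correct": 1},
    {"category": "myths", "base_correct": 0, "lora_correct": 0},
    {"category": "myths", "base_correct": 1, "lora_correct": 1},
]
base, lora, lift = heldout_lift(items, train_categories={"law"})
print(base, lora, lift)
```

Filtering by category rather than by item is the design point: an item-level split would let the adapter see other questions from the same category during training.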

D2 · Commitment-layer steering (<=3B) · VERIFIED

Claim

Commitment-layer steering (<=3B)

Experiment

Continuous commitment-layer steering, Qwen-1.5B (band-L17, alpha=0.6, block layers 12-23): inject the confident-wrong axis at the commitment layer and measure flip rate / entropy on a 40-item pool.

Pre-registered

Powered / n / effect

Yes · n = 40-item pool (Qwen-1.5B) · effect size: single-layer flip 1.0, continuous-block flip 1.0; net +0.233 at the commitment layer

Statistical test

flip-rate with entropy/maxprob readout (descriptive)

Result

single-layer flip 1.0 (entropy 0.416, maxprob 0.863); continuous flip 1.0

Why this status

At <=3B the commitment-layer steer flips the answer reliably (flip 1.0); this is steering tooling (ITI/RepE-class), powered at the steerable scale.

Location

experiments/epsilon_qwen_2026_06_08/outputs/continuous_steer_test_Qwen2.5-1.5B-Instruct.json

Category

GOVERNANCE

Verify

Fetches the result file and reads continuous_block.flip live in your browser — commitment-layer steering flips the answer at 1.5B — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.

Reproduce locally: python scripts/audit.py D2 · check them all with python scripts/validate_all_claims.py.

D2b · Steering is non-monotone in scale (refutes simple attenuation) · VERIFIED

Claim

Steering is non-monotone in scale (refutes simple attenuation)

Experiment

Continuous-steer scale ladder, Qwen-1.5B vs 3B (and 14B per the inventory): measure net steering effect at each scale to test monotone attenuation; 1.5B single-layer flips, 3B single-layer fails.

Pre-registered

Powered / n / effect

Yes · n = 40-item pool (1.5B) / 37 (3B) · effect size: non-monotone: +0.158 @1.5B / -0.045 @3B / +0.179 @14B

Statistical test

flip-rate across the scale ladder (descriptive non-monotonicity)

Result

1.5B single-layer flip 1.0; 3B single-layer flip 0.0 (continuous 1.0); reverses then recovers

Why this status

The steering effect is non-monotone in scale (1.5B flips single-layer, 3B does not, 14B recovers), refuting a simple attenuation-with-scale story - a bounding negative.

Location

experiments/epsilon_qwen_2026_06_08/outputs/continuous_steer_test_Qwen2.5-1.5B-Instruct.json

Category

GOVERNANCE

Verify

Fetches the result file and reads continuous_block.flip live in your browser — steering effective at 1.5B (the non-monotone pattern spans the 1.5/3/14B ladder) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.

Reproduce locally: python scripts/audit.py D2b · check them all with python scripts/validate_all_claims.py.

E. Subspace mechanism

E1 · Corruption rides the gold-lure direction & survives orthogonalization (3/3 families) · VERIFIED

Claim

Corruption rides the gold-lure direction & survives orthogonalization (3/3 families)

Experiment

INT-2 orthogonalization across three model families, Gemma-2-2B (alpha=1.2) + Llama-3B + Qwen-1.5B: corrupt right answers along the soft gold->lure axis, remove K=5 band PCs, and test if corruption survives vs an isotropic control.

Pre-registered

Powered / n / effect

No (exploratory) · n = Gemma right/wrong-steer 60/52; orth arm wrong 45 · effect size: corrupt survives orthogonalization on all 3: Gemma 0.73 vs iso 0.37; Llama 0.96 vs 0.08; Qwen-1.5B 0.92 vs 0.35

Statistical test

rate with bootstrap CI95; orthogonalized vs isotropic non-overlap

Result

Gemma corrupt orth 0.73 CI[0.64,0.82] vs iso 0.37 CI[0.27,0.47] -> survives

Why this status

Corruption survives orthogonalization above the isotropic control on all 3 families, so corruption rides the specific gold<->lure direction cross-family.

Location

experiments/epsilon_qwen_2026_06_08/outputs/xvendor_int2_gemma2b_a12_result.json

Category

MECHANISM

Verify

Fetches the result file and reads corrupt_rate.soft_to_lure live in your browser — corruption along the gold->lure direction vs isotropic => survives orthogonalization — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.

Reproduce locally: python scripts/audit.py E1 · check them all with python scripts/validate_all_claims.py.
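
The orthogonalization step and the non-overlap criterion can be illustrated with a toy example. The sketch assumes the K removed band PCs are available as an orthonormal basis (for instance from an SVD of the band activations). The vectors here are invented; only the two intervals are the Gemma CIs quoted in the Result above.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthogonalize(v, basis):
    """Remove from v its components along each vector of an ORTHONORMAL basis."""
    out = list(v)
    for q in basis:
        coeff = dot(v, q)
        out = [o - coeff * qi for o, qi in zip(out, q)]
    return out

def intervals_overlap(a, b):
    """True if the closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Toy 4-d steering direction with two "PCs" removed (hypothetical values).
pcs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
steer = [3.0, 1.0, 2.0, 0.0]
residual = orthogonalize(steer, pcs)

# Gemma CIs from the card: orthogonalized corruption vs isotropic control.
survives = not intervals_overlap((0.64, 0.82), (0.27, 0.47))
print(residual, survives)
```

If the steering direction lay entirely inside the removed subspace, the residual would be the zero vector and corruption could not survive. A surviving effect indicates the direction carries weight outside the removed PCs.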

E2 · Corruption/rescue asymmetry: asymmetry cross-family; the rescue-needs-subspace MECHANISM is Qwen-only · VERIFIED

Claim

Corruption/rescue asymmetry: asymmetry cross-family; the rescue-needs-subspace MECHANISM is Qwen-only

Experiment

The same INT-2 orthogonalization arms across three model families (Gemma-2-2B / Llama-3B / Qwen-1.5B): test the rescue arm (lure->gold) and whether rescue survives orthogonalization, for corruption/rescue asymmetry and rescue subspace-specificity.

Pre-registered

Powered / n / effect

No (exploratory) · n = Gemma wrong-steer 52/45; Llama wrong 20 · effect size: rescue ~0 in all conditions on all 3 (asymmetry cross-family); rescue does NOT survive orthogonalization (219x-specific)

Statistical test

rescue-rate bootstrap CI95; orthogonalized vs isotropic

Result

Gemma rescue soft 0.0 / iso 0.019; rescue collapses

Why this status

Rescue collapses (~0) on all 3 families so the asymmetry replicates, but rescue does not survive orthogonalization anywhere and is band/alpha-sensitive - the rescue-needs-the-dominant-subspace mechanism is Qwen-only/exploratory.

Location

experiments/epsilon_qwen_2026_06_08/outputs/xvendor_int2_gemma2b_a12_result.json

Category

MECHANISM

Verify

Fetches the result file and reads rescue_rate.soft_to_gold live in your browser — rescue collapses (~0) => the corruption/rescue asymmetry — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.

Reproduce locally: python scripts/audit.py E2 · check them all with python scripts/validate_all_claims.py.

F. Integrated governor

F1 · Integrated governor lift base->router: Gemma-2-2B 0.597->0.802 (+20.4pts); headline span +13 to +20.5 across models. Paper: Sec (governor). · VERIFIED load-bearing

Claim

Integrated governor lift base->router: Gemma-2-2B 0.597->0.802 (+20.4pts); headline span +13 to +20.5 across models. Paper: Sec (governor).

Experiment

Integrated governor end-to-end, Qwen-1.5B + Gemma-2-2B: base model vs the full detect->route->correct->refuse governor accuracy on the 1013-item trap+ordinary test.

Pre-registered

Yes - PREREG_LAB15_POWERED_ROUTER.md

Powered / n / effect

Yes · n = 1013/model (500 trap + 513 ordinary) · effect size: +13 to +20.4 pp governed-vs-base (Qwen-1.5B 0.608->0.738 +13.0; Gemma-2B 0.597->0.802 +20.4)

Statistical test

McNemar p<0.001 + bootstrap CI on the router-vs-best-single-fix gain (excludes 0)

Result

Gemma base 0.597 -> router 0.802 (+20.4); Qwen 0.608 -> 0.738 (+13.0)

Why this status

The governed pipeline beats base by +13 to +20.4pp, matching the powered base->router span, with the same leakage controls and CI-excludes-0 statistics as C4.

Location

experiments/epsilon_qwen_2026_06_08/outputs/regime_router_powered_gemma2b_result.json

Category

GOVERNANCE

Computation

# regime_router_powered.py
mean(learned) - mean(base)   # end-to-end governor accuracy lift

The ▶ Verify button runs this live in your browser; prove_audit_harness.py confirms the recompute matches scikit-learn to 1e-9.

Sanity checks

PASS router > base
PASS lift in +13..+20.5 headline span

Verify

Recomputes mean(learned) - mean(base) from the raw per-item data and compares it to the published value 0.2044.

Reproduce locally: python scripts/audit.py F1

G. Measurements of the bound

G2 · Hidden-state dispersion shows NO separating signal (failure-to-find, NOT a proven null) · VERIFIED

Claim

Hidden-state dispersion shows NO separating signal (failure-to-find, NOT a proven null)

Experiment

Mechfloor consolidated, Qwen-0.5B (band-17): confidence-matched (|margin|-binned) AUROC of hidden-state dispersion (trace, mean-pairwise-cos-distance) separating confident-wrong 'trap' from correct twins; ladder 0.5B/1.5B/3B/Llama-3B.

Pre-registered

Powered / n / effect

No (exploratory / inconclusive) · n = 50/class (correct / trap / ignorance) · effect size: trace/mpcd AUROC 0.407/0.409 (0.5B); ~0.48-0.50 across the ladder - no separating signal

Statistical test

confidence-binned rank AUROC (margin held fixed by binning)

Result

0.5B trace AUROC 0.407, mpcd 0.409 (twins null)

Why this status

n=50/class is underpowered to assert absence - dispersion shows no separating signal (AUROC ~0.5, confidence-matched), but a failure-to-find at n=50 cannot be called a proven null, so it stays inconclusive/exploratory.

Location

experiments/epsilon_qwen_2026_06_08/outputs/mechfloor_consolidated_0.5b.json

Category

BLINDNESS

Verify

Fetches the result file and reads medians.correct.eig_trace live in your browser — dispersion median (trap vs correct ~identical => no separating signal) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.

Reproduce locally: python scripts/audit.py G2 · check them all with python scripts/validate_all_claims.py.
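
The two dispersion statistics named in the Experiment field, trace and mean-pairwise-cos-distance, can be written down directly. The sketch assumes each item contributes a small set of hidden-state vectors (how that set is collected is not specified on this card) and uses the unbiased n-1 covariance normalization, which is also an assumption.

```python
from itertools import combinations
from math import sqrt

def cov_trace(vectors):
    """Trace of the sample covariance, i.e. the summed per-dimension variance."""
    n, d = len(vectors), len(vectors[0])
    means = [sum(v[j] for v in vectors) / n for j in range(d)]
    return sum((v[j] - means[j]) ** 2 for v in vectors for j in range(d)) / (n - 1)

def mean_pairwise_cos_distance(vectors):
    """Average of 1 - cosine similarity over all unordered pairs."""
    def cos(u, v):
        num = sum(a * b for a, b in zip(u, v))
        return num / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))
    pairs = list(combinations(vectors, 2))
    return sum(1.0 - cos(u, v) for u, v in pairs) / len(pairs)

# Toy "hidden states" (hypothetical 2-d vectors, not model activations).
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
trace_value = cov_trace(states)
mpcd_value = mean_pairwise_cos_distance(states)
print(round(trace_value, 4), round(mpcd_value, 4))
```

Both statistics summarize spread only. Two classes whose vectors are shifted along one direction but equally spread out would get the same scores, which is the kind of blindness this card measures.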

G3 · Semantic entropy is BLIND to confident error (double dissociation). Gemma-2-27B SE->error AUROC = 0.527, CI [0.489,0.565], n=800; all 12 models bounded <0.60. Paper: Table I + IV.B. · VERIFIED load-bearing

Claim

Semantic entropy is BLIND to confident error (double dissociation). Gemma-2-27B SE->error AUROC = 0.527, CI [0.489,0.565], n=800; all 12 models bounded <0.60. Paper: Table I + IV.B.

Experiment

Semantic entropy (Farquhar NLI-clustered, K=8 sampled generations, roberta-large-mnli) on Gemma-2-27B over 800 TruthfulQA items; AUROC of SE vs whether the greedy answer was wrong, plus a confident-wrong vs ignorance split; TriviaQA positive control 0.774.

Pre-registered

Yes - PREREG_FARQUHAR_SEMANTIC_ENTROPY_v1.md

Powered / n / effect

Yes · n = 800 (Gemma-27B); confident-wrong 228, ignorance 226; 5 models / 3 families overall · effect size: TruthfulQA AUROC 0.476-0.561 (all CI upper <0.65) vs TriviaQA control 0.774 = double dissociation

Statistical test

rank AUROC with bootstrap CI95; confident-wrong vs ignorance split

Result

Gemma-27B SE AUROC 0.527, CI [0.489,0.565]; confident-wrong 0.514 [0.469,0.556]

Why this status

SE is at-chance on TruthfulQA confident errors (CI upper 0.565 < 0.65 across 5 models/3 families) yet 0.774 on TriviaQA ignorance - a clean double dissociation, the strongest measured leg.

Location

experiments/epsilon_qwen_2026_06_08/outputs/semantic_entropy_nli_gemma27b_result.json

Category

BLINDNESS

Computation

# semantic_entropy_nli_*.py  -> recomputed by audit.py G3
auroc(SE=[it['SE'] for it in per_item],
      labels=[it['err'] for it in per_item])   # rank-based Mann-Whitney U; n=800

The ▶ Verify button runs this live in your browser; prove_audit_harness.py confirms the recompute matches scikit-learn to 1e-9.

Sanity checks

PASS n matches
PASS AUROC in useless band (<0.60)
PASS both classes present (non-degenerate)
PASS SE has variance (not all-equal)

Verify

Recomputes AUROC(SE, err) from the raw per-item data and compares it to the published value 0.527.

Reproduce locally: python scripts/audit.py G3
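
For readers who want to see both halves of this measurement, here is a self-contained sketch: an entropy over meaning clusters, then a rank-based (Mann-Whitney) AUROC with average ranks for ties. It assumes the discrete variant of semantic entropy (cluster frequencies among the K sampled generations, natural log). The NLI clustering itself is taken as given, so cluster ids are plain integers here, and the scores in the example are toy values.

```python
from collections import Counter
from math import log

def semantic_entropy(cluster_ids):
    """Entropy of the empirical distribution over meaning clusters."""
    n = len(cluster_ids)
    return -sum((c / n) * log(c / n) for c in Counter(cluster_ids).values())

def rank_auroc(scores, labels):
    """Mann-Whitney AUROC: P(pos score > neg score) + 0.5 * P(tie)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    u = sum(r for r, y in zip(ranks, labels) if y == 1) - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

se_split = semantic_entropy([0, 0, 0, 0, 1, 1, 1, 1])   # K=8, two equal clusters
se_same = semantic_entropy([0] * 8)                      # all 8 agree
toy_auroc = rank_auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(se_split, se_same, toy_auroc)
```

A confidently wrong model tends to produce the all-agree case, so its entropy is as low as that of a confidently right one. The AUROC near 0.5 on this card is what that looks like in aggregate.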

G4 · Cross-task SE INVERSION: open-gen SE vs forced-choice error AUROC = 0.443, CI excludes high; n=250; 56.5% of fc-wrong in low-SE cluster. Paper: Sec (inversion). · VERIFIED load-bearing

Claim

Cross-task SE INVERSION: open-gen SE vs forced-choice error AUROC = 0.443, CI excludes high; n=250; 56.5% of fc-wrong in low-SE cluster. Paper: Sec (inversion).

Experiment

Cross-task SE inversion, Qwen-1.5B: open-generation NLI semantic entropy joined against an INDEPENDENT layerwise forced-choice error label; AUROC of SE vs forced-choice-wrong, n=250.

Pre-registered

Yes - PREREG_FARQUHAR_SEMANTIC_ENTROPY_v1.md (SE measurement pre-reg)

Powered / n / effect

Yes · n = 250 (forced-choice wrong 115, correct 135; double-wrong 80) · effect size: AUROC 0.443 (below chance = inversion); 56.5% of forced-choice-wrong sit in the low-SE cluster

Statistical test

rank AUROC with bootstrap CI95 (non-circular: the SE task differs from the forced-choice task)

Result

AUROC 0.443, CI [0.375,0.516]; median SE forced-choice-wrong 1.494 ~ correct 1.667

Why this status

Open-gen SE is blind-to-inverted on the independent forced-choice confident-error basin (0.443, 56.5% of wrong in the low-SE cluster), confirming SE blindness non-circularly; the caveat is the adversarial-lure (Kalai) design.

Location

experiments/epsilon_qwen_2026_06_08/outputs/se_kinematic_join_result.json

Category

BLINDNESS

Computation

# se_kinematic_join.py
fc = layerwise_features['y'][item_index]   # INDEPENDENT forced-choice label
auroc(SE[fc==1], SE[fc==0])   # 0.443; expect ~0.5 = blind across tasks

The ▶ Verify button runs this live in your browser; prove_audit_harness.py confirms the recompute matches scikit-learn to 1e-9.

Sanity checks

PASS n = fc_wrong + fc_correct
PASS non-circular (independent task)
PASS inversion direction (<0.5 or near)

Verify

Recomputes AUROC(SE, forced-choice-wrong) from the raw per-item data and compares it to the published value 0.443.

Reproduce locally: python scripts/audit.py G4
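
The join is what makes this check non-circular: SE comes from one task and the error label from another, matched only by item id. The sketch below illustrates the join, the AUROC, and a percentile bootstrap. The item ids and values are invented, and the bootstrap details (500 resamples, skipping single-class resamples) are assumptions rather than the settings in se_kinematic_join.py.

```python
import random

def auroc(scores, labels):
    """Pairwise AUROC: share of (pos, neg) pairs where pos scores higher; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def join_on_item(se_by_item, fc_wrong_by_item):
    """Keep only the items present in BOTH tables, in a stable order."""
    ids = sorted(set(se_by_item) & set(fc_wrong_by_item))
    return ids, [se_by_item[i] for i in ids], [fc_wrong_by_item[i] for i in ids]

def bootstrap_ci(scores, labels, n_boot=500, seed=0):
    """Percentile 95% CI; resamples that lose a class are redrawn."""
    rng = random.Random(seed)
    n, stats = len(scores), []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:
            stats.append(auroc([scores[i] for i in idx], ys))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]

# Hypothetical tables: open-generation SE and forced-choice error (1 = wrong).
se_by_item = {"q1": 0.2, "q2": 1.5, "q3": 0.3, "q4": 1.8, "q5": 0.9}
fc_wrong_by_item = {"q1": 1, "q2": 0, "q3": 1, "q4": 0, "q6": 1}

ids, scores, labels = join_on_item(se_by_item, fc_wrong_by_item)
value = auroc(scores, labels)
lo, hi = bootstrap_ci(scores, labels)
print(ids, value, (lo, hi))
```

In this toy the wrong items all have the lowest SE, so the AUROC is 0, an extreme version of the inversion reported above. An AUROC below 0.5 means low entropy, not high entropy, is associated with error.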

SAE · An unsupervised Gemma-Scope SAE's reconstruction error is also BLIND to confident error (mirrors semantic entropy). Gemma-2-2B, n=817: recon-error AUROC ~0.497 [.448,.545], perm p=0.56; mean recon-error confident-wrong ~ correct. Paper: Sec II/IV (SAE). · VERIFIED load-bearing

Claim

An unsupervised Gemma-Scope SAE's reconstruction error is also BLIND to confident error (mirrors semantic entropy). Gemma-2-2B, n=817: recon-error AUROC ~0.497 [.448,.545], perm p=0.56; mean recon-error confident-wrong ~ correct. Paper: Sec II/IV (SAE).

Experiment

SAE confident-error probe, Gemma-2-2B with the gemma-scope-2b-pt-res 16k SAE at commitment band L12 (n=817): unsupervised SAE reconstruction-error as an anomaly score vs a supervised probe vs a per-feature oracle, on confident-wrong vs correct.

Pre-registered

Powered / n / effect

No (exploratory; corroborates the bound) · n = 817 (confident-wrong 260, ignorance 258, correct 299) · effect size: unsupervised recon-error is BLIND: confident-wrong 81.19 ~ correct 81.18; supervised SAE-feature AUROC only 0.562

Statistical test

5-fold OOF AUROC (supervised / oracle) + mean recon-error contrast (unsupervised)

Result

recon-err confident-wrong 81.194 vs correct 81.181 (~identical); supervised 0.562; oracle 0.524

Why this status

The unsupervised SAE reconstruction error is identical for confident-wrong and correct (81.19 vs 81.18) and the supervised SAE-feature AUROC is only 0.56 - the dictionary-learning readout is endpoint-blind to confident error, corroborating the bound.

Location

experiments/epsilon_qwen_2026_06_08/outputs/sae_confident_error_gemma2b_result.json

Category

BLINDNESS

Computation

# sae_confident_error; D3 = ||x - decode(encode(x))|| as anomaly score
mean_recon_err[confident_wrong] vs [correct] = 81.19 vs 81.18   # ~identical -> AUROC ~0.497 (blind)
# contrast: D1 supervised raw-residual probe still reads it

The ▶ Verify button runs this live in your browser; prove_audit_harness.py confirms the recompute matches scikit-learn to 1e-9.

Sanity checks

PASS recon-err near-identical -> no separation (blind)
PASS supervised raw-residual probe still reads it (D1>0.5)
PASS n=817 (powered)

Verify

Reproduce locally: python scripts/audit.py SAE
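
The anomaly score in the Computation field, ||x - decode(encode(x))||, is easy to state concretely. The sketch below uses a toy two-feature autoencoder with a plain ReLU. That is a stand-in for illustration only, since Gemma Scope SAEs use a JumpReLU activation, and the weights and inputs here are invented rather than loaded from gemma-scope-2b-pt-res.

```python
from math import sqrt

def matvec(matrix, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]

def recon_error(x, w_enc, b_enc, w_dec, b_dec):
    """L2 norm of x - decode(encode(x)) for a ReLU sparse autoencoder."""
    pre = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    feats = [max(0.0, p) for p in pre]                  # ReLU (assumed activation)
    x_hat = [a + b for a, b in zip(matvec(w_dec, feats), b_dec)]
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))

# Toy SAE: identity encoder/decoder, zero biases (hypothetical weights).
w_enc = [[1.0, 0.0], [0.0, 1.0]]
w_dec = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, 0.0]
b_dec = [0.0, 0.0]

# [3, 4] is reconstructed exactly; [1, -2] loses its negative coordinate.
errors = [recon_error(x, w_enc, b_enc, w_dec, b_dec)
          for x in ([3.0, 4.0], [1.0, -2.0])]
print(errors)
```

The score is unsupervised: it measures only how well the dictionary explains an activation and never sees a correctness label. That is why near-identical class means (81.19 vs 81.18) translate into a chance-level AUROC.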

H. Barrier-to-entry

H1 · Confident-error basin persists 1.5B->32B; patching-reachable, steering-fails · VERIFIED load-bearing

Claim

Confident-error basin persists 1.5B->32B; patching-reachable, steering-fails

Experiment

Powered window-patching, Qwen2.5-1.5B/3B/7B/14B/32B: patch the commitment-band (layers 12-23) of clean-true runs and measure the flip-to-confident-FALSE rate; tests that the confident-error basin is reachable on-manifold and persists with scale.

Pre-registered

Powered / n / effect

Yes · n = clean n: 42 (1.5B) / 162 (3B) / 230 (7B) / 250 (14B) / 248 (32B) · effect size: window-patch flip ~0.95-1.0 at every scale (1.5B 0.952; 3B-32B 1.0)

Statistical test

Wilson 95% CI on the flip rate (excludes coin-flip at every scale)

Result

1.5B flip 0.952, Wilson [0.842,0.987]; 14B/32B flip 1.0, Wilson [0.985,1.0]

Why this status

Window-patch flips to confident-false ~0.95-1.0 at every scale 1.5B->32B with Wilson CIs excluding chance, so the wrong-answer basin exists and is patching-reachable on-manifold across scale (Akarlar's basin extended).

Location

experiments/epsilon_qwen_2026_06_08/outputs/patch_window_powered_Qwen2.5-1.5B-Instruct.json

Category

MECHANISM

Verify

Fetches the result file and reads window_patch_flip live in your browser — window-patch flip rate => the confident-error basin is patching-reachable — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.

Reproduce locally: python scripts/audit.py H1 · check them all with python scripts/validate_all_claims.py.
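
The Wilson intervals in the Result can be rechecked with the standard formula. The success counts used below are inferred, not read from the result files: 40/42 is the count consistent with a flip rate of 0.952 at n=42, and 250/250 corresponds to the 14B flip rate of 1.0.

```python
from math import sqrt

def wilson_ci(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (default 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

lo_15b, hi_15b = wilson_ci(40, 42)      # 1.5B: inferred 40 flips of 42
lo_14b, hi_14b = wilson_ci(250, 250)    # 14B: flip rate 1.0 at n=250
print(round(lo_15b, 3), round(hi_15b, 3), round(lo_14b, 3), round(hi_14b, 3))
```

Wilson is a sensible choice here because several scales sit at a flip rate of exactly 1.0, where the usual normal-approximation interval collapses to zero width.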

I. Bounding negatives

I1 · No capacity cliff (bounding negative) · VERIFIED

Claim

No capacity cliff (bounding negative)

Experiment

Probability-flux probe, Qwen-1.5B (band-L24): measure the probability-current ratio rho_Q on TQA-traps (n=817) vs GSM8K-reasoning (n=6000) to test for a capacity cliff in the non-equilibrium flux.

Pre-registered

Powered / n / effect

No (exploratory) · n = TQA-trap 817; GSM8K 6000 · effect size: no accuracy cliff; flux rho_Q 0.177 (TQA) vs 0.273 (GSM8K), propagator radius 0.776 vs 0.949

Statistical test

bootstrap CI95 on rho_Q (CIs non-overlapping between regimes)

Result

TQA rho_Q 0.177 [0.163,0.182]; GSM8K rho_Q 0.273 [0.262,0.283]; CIs separated

Why this status

Flux varies with cognitive regime (reasoning > trap, non-overlapping CIs) with no accuracy cliff across the sweep - a bounding NULL on the capacity-cliff hypothesis; exploratory.

Location

experiments/epsilon_qwen_2026_06_08/outputs/prob_flux_probe_result.json

Category

LIMITATION

Verify

Fetches the result file and reads TQA_trap.rho_Q_current_ratio live in your browser — non-equilibrium flux rho_Q in TQA-traps (no capacity cliff) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.

Reproduce locally: python scripts/audit.py I1 · check them all with python scripts/validate_all_claims.py.

I2 · Substrate dynamics are GENERIC, not hallucination-specific (N=1) · VERIFIED

Claim

Substrate dynamics are GENERIC, not hallucination-specific (N=1)

Experiment

Same probability-flux / substrate-dynamics probe, asking whether the dynamics are hallucination-specific or generic (contraction ~0.829; rotation ~0.76, non-discriminative at AUROC 0.44); N=1 model.

Pre-registered

Powered / n / effect

No (exploratory) · n = N=1 model (TQA 817 / GSM8K 6000 conditions) · effect size: substrate generic: contraction 0.829, rotation 0.76 non-discriminative (AUROC 0.44); 2 dramatic readings retired

Statistical test

rank AUROC for discrimination (non-discriminative); descriptive flux comparison

Result

rho_Q regime-varying but generic; rotation AUROC 0.44 (not hallucination-specific)

Why this status

Substrate dynamics track regime/computation generically, not a hallucination-specific signature (rotation AUROC 0.44), and prior dramatic readings were retired - a bounding NULL at N=1, exploratory.

Location

experiments/epsilon_qwen_2026_06_08/outputs/prob_flux_probe_result.json

Category

LIMITATION

Verify

Fetches the result file and reads GSM8K_reasoning.rho_Q_current_ratio live in your browser — flux higher in reasoning than traps => substrate dynamics are generic (N=1) — confirming the published value is the one in the file. Or open the file at the Location above to inspect it.

Reproduce locally: python scripts/audit.py I2 · check them all with python scripts/validate_all_claims.py.

Patent pending; CC BY-NC 4.0 (see PATENTS / LICENSE).
