Wrong With Conviction: Why Confident Errors Evade Detection in Language Models

How these were checked. All 38 references are listed below, each linked to its source. Each of the 30 arXiv entries was tested against the live arXiv, and all 30 return the paper with a matching title; the 8 venue entries (Nature, ICML, NeurIPS, Springer, and interpretability technical reports) link to a title search instead. The citations were also independently cross-checked, which caught and corrected three author-attribution errors before this list was published. Click any link to open the paper.

1	Azaria, A. & Mitchell, T. The Internal State of an LLM Knows When It's Lying. Findings of EMNLP, 2023. arXiv:2304.13734. arXiv title: The Internal State of an LLM Knows When It's Lying	arXiv 2304.13734 ↗ link verified
2	Orgad, H. et al. LLMs Know More Than They Show. ICLR, 2025. arXiv:2410.02707. arXiv title: LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations	arXiv 2410.02707 ↗ link verified
3	Chen, C. et al. INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection. ICLR, 2024. arXiv:2402.03744. arXiv title: INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection	arXiv 2402.03744 ↗ link verified
4	Sriramanan, G. et al. LLM-Check. NeurIPS, 2024.	Google Scholar ↗ Google Scholar search
5	Farquhar, S. et al. Detecting hallucinations in large language models using semantic entropy. Nature 630, 2024.	Google Scholar ↗ Google Scholar search
6	Kuhn, L. et al. Semantic Uncertainty. ICLR, 2023. arXiv:2302.09664. arXiv title: Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation	arXiv 2302.09664 ↗ link verified
7	Kossen, J. et al. Semantic Entropy Probes. arXiv:2406.15927, 2024. arXiv title: Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs	arXiv 2406.15927 ↗ link verified
8	Ma, H. et al. Semantic Energy: Detecting LLM Hallucination Beyond Entropy. arXiv:2508.14496, 2025. arXiv title: Semantic Energy: Detecting LLM Hallucination Beyond Entropy	arXiv 2508.14496 ↗ link verified
9	Karpowicz, M.P. On the Fundamental Impossibility of Hallucination Control in Large Language Models. arXiv:2506.06382, 2025. arXiv title: On the Fundamental Impossibility of Hallucination Control in Large Language Models	arXiv 2506.06382 ↗ link verified
10	Simhi, A. et al. Distinguishing Ignorance from Error in LLM Hallucinations. arXiv:2410.22071, 2024. arXiv title: Distinguishing Ignorance from Error in LLM Hallucinations	arXiv 2410.22071 ↗ link verified
11	Simhi, A. et al. HACK: Hallucinations Along Certainty and Knowledge Axes. arXiv:2510.24222, 2025. arXiv title: HACK: Hallucinations Along Certainty and Knowledge Axes	arXiv 2510.24222 ↗ link verified
12	Marin, J. A Geometric Taxonomy of Hallucinations in LLMs. arXiv:2602.13224, 2026. arXiv title: A Geometric Taxonomy of Hallucinations in LLMs	arXiv 2602.13224 ↗ link verified
13	Cherukuri, K. & Varshney, L.R. Hallucination Basins: A Dynamic Framework for Understanding and Controlling LLM Hallucinations. arXiv:2604.04743, 2026. arXiv title: Hallucination Basins: A Dynamic Framework for Understanding and Controlling LLM Hallucinations	arXiv 2604.04743 ↗ link verified
14	Akarlar, G.A. Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation. arXiv:2604.15400, 2026. arXiv title: Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation	arXiv 2604.15400 ↗ link verified
15	Kalai, A.T. et al. Why Language Models Hallucinate. arXiv:2509.04664, 2025. arXiv title: Why Language Models Hallucinate	arXiv 2509.04664 ↗ link verified
16	Fernando, T. & Guitchounts, G. Dynamics of the Transformer Residual Stream. arXiv:2605.14258, 2026. arXiv title: Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology	arXiv 2605.14258 ↗ link verified
17	Lieberum, T. et al. Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2. arXiv:2408.05147, 2024. arXiv title: Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2	arXiv 2408.05147 ↗ link verified
18	Lu, Y. et al. Beyond Finite Layer Neural Networks. ICML, 2018.	Google Scholar ↗ Google Scholar search
19	Li, K. et al. Inference-Time Intervention. NeurIPS, 2023. arXiv:2306.03341. arXiv title: Inference-Time Intervention: Eliciting Truthful Answers from a Language Model	arXiv 2306.03341 ↗ link verified
20	Zou, A. et al. Representation Engineering. arXiv:2310.01405, 2023. arXiv title: Representation Engineering: A Top-Down Approach to AI Transparency	arXiv 2310.01405 ↗ link verified
21	Rimsky, N. et al. Steering Llama 2 via Contrastive Activation Addition. ACL, 2024. arXiv:2312.06681. arXiv title: Steering Llama 2 via Contrastive Activation Addition	arXiv 2312.06681 ↗ link verified
22	Obeso, O., Arditi, A., Ferrando, J., Freeman, J., Holmes, C. & Nanda, N. Real-Time Detection of Hallucinated Entities in Long-Form Generation. arXiv:2509.03531, 2025. arXiv title: Real-Time Detection of Hallucinated Entities in Long-Form Generation	arXiv 2509.03531 ↗ link verified
23	Yeom, J., Sok, J., Kim, H., Park, S., Park, J. & Kim, T. Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer. arXiv:2605.22007, 2026. arXiv title: Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer	arXiv 2605.22007 ↗ link verified
24	Hendrycks, D. et al. Measuring Massive Multitask Language Understanding (MMLU). ICLR, 2021. arXiv:2009.03300. arXiv title: Measuring Massive Multitask Language Understanding	arXiv 2009.03300 ↗ link verified
25	Clark, P. et al. Think you have Solved Question Answering? Try ARC. arXiv:1803.05457, 2018. arXiv title: Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge	arXiv 1803.05457 ↗ link verified
26	Zellers, R. et al. HellaSwag: Can a Machine Really Finish Your Sentence? ACL, 2019. arXiv:1905.07830. arXiv title: HellaSwag: Can a Machine Really Finish Your Sentence?	arXiv 1905.07830 ↗ link verified
27	Lin, S. et al. TruthfulQA: Measuring How Models Mimic Human Falsehoods. ACL, 2022. arXiv:2109.07958. arXiv title: TruthfulQA: Measuring How Models Mimic Human Falsehoods	arXiv 2109.07958 ↗ link verified
28	Mihaylov, T. et al. Can a Suit of Armor Conduct Electricity? (OpenBookQA). EMNLP, 2018. arXiv:1809.02789. arXiv title: Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering	arXiv 1809.02789 ↗ link verified
29	Talmor, A. et al. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. NAACL, 2019. arXiv:1811.00937. arXiv title: CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge	arXiv 1811.00937 ↗ link verified
30	Joshi, M. et al. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. ACL, 2017. arXiv:1705.03551. arXiv title: TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension	arXiv 1705.03551 ↗ link verified
31	Marcenko, V.A. & Pastur, L.A. Distribution of eigenvalues for some sets of random matrices. Math. USSR-Sbornik, 1967.	Google Scholar ↗ Google Scholar search
32	Bai, Z. & Silverstein, J.W. Spectral Analysis of Large Dimensional Random Matrices. Springer, 2010.	Google Scholar ↗ Google Scholar search
33	Guo, C. et al. On Calibration of Modern Neural Networks. ICML, 2017. arXiv:1706.04599. arXiv title: On Calibration of Modern Neural Networks	arXiv 1706.04599 ↗ link verified
34	Geifman, Y. & El-Yaniv, R. Selective Classification for Deep Neural Networks. NeurIPS, 2017. arXiv:1705.08500. arXiv title: Selective Classification for Deep Neural Networks	arXiv 1705.08500 ↗ link verified
35	Hu, E.J. et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR, 2022. arXiv:2106.09685. arXiv title: LoRA: Low-Rank Adaptation of Large Language Models	arXiv 2106.09685 ↗ link verified
36	Bricken, T. et al. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread, 2023.	Google Scholar ↗ Google Scholar search
37	Templeton, A. et al. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread, 2024.	Google Scholar ↗ Google Scholar search
38	Lindsey, J. et al. On the Biology of a Large Language Model. Transformer Circuits Thread, 2025.	Google Scholar ↗ Google Scholar search
