
Wrong With Conviction: Why Confident Errors Evade Detection in Language Models

References: every cited work, linked to the real paper.

How these were checked. All 38 references are listed below, each linked to its source. The 30 arXiv entries were tested against the live arXiv, and all 30 returned the paper with a matching title; the 8 venue entries (Nature, ICML, NeurIPS, Springer, and interpretability technical reports) link to a title search instead. The citations were also independently cross-checked, which caught and corrected three author-attribution errors before this list was published. Click any link to open the paper.
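As a rough illustration of what "matching title" could mean in the check described above, here is a minimal sketch. It is not the audit's actual checker: the function names `normalize` and `titles_match` are hypothetical, fetching the title from arXiv is left out, and the matching rule (the cited title must equal the arXiv title or be a word-aligned prefix of it, ignoring case, punctuation, and parenthetical tags such as "(MMLU)") is an assumption chosen because several entries below cite a shortened title.

```python
import re


def normalize(title: str) -> str:
    """Lowercase, drop parenthetical tags and punctuation, collapse whitespace."""
    title = re.sub(r"\([^)]*\)", " ", title)            # drop tags such as "(MMLU)"
    title = re.sub(r"[^a-z0-9\s]", " ", title.lower())  # punctuation becomes a space
    return " ".join(title.split())


def titles_match(cited: str, arxiv_title: str) -> bool:
    """True if the cited title equals the arXiv title or is a word-aligned prefix of it.

    Assumed rule for illustration only; the audit's real criterion is not stated.
    """
    cited_words = normalize(cited).split()
    arxiv_words = normalize(arxiv_title).split()
    return bool(cited_words) and arxiv_words[:len(cited_words)] == cited_words
```

Under this rule, the short citation "LLMs Know More Than They Show" matches its full arXiv title, while a citation paired with a different paper's title does not. A stricter check would require exact equality and flag every shortened title for manual review.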
1
Azaria, A. & Mitchell, T. The Internal State of an LLM Knows When It's Lying. Findings of EMNLP, 2023. arXiv:2304.13734.
arXiv title: The Internal State of an LLM Knows When It's Lying
arXiv 2304.13734 ↗
link verified
2
Orgad, H. et al. LLMs Know More Than They Show. ICLR, 2025. arXiv:2410.02707.
arXiv title: LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
arXiv 2410.02707 ↗
link verified
3
Chen, C. et al. INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection. ICLR, 2024. arXiv:2402.03744.
arXiv title: INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection
arXiv 2402.03744 ↗
link verified
4
Sriramanan, G. et al. LLM-Check. NeurIPS, 2024.
Google Scholar ↗
Google Scholar search
5
Farquhar, S. et al. Detecting hallucinations in large language models using semantic entropy. Nature 630, 2024.
Google Scholar ↗
Google Scholar search
6
Kuhn, L. et al. Semantic Uncertainty. ICLR, 2023. arXiv:2302.09664.
arXiv title: Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation
arXiv 2302.09664 ↗
link verified
7
Kossen, J. et al. Semantic Entropy Probes. arXiv:2406.15927, 2024.
arXiv title: Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs
arXiv 2406.15927 ↗
link verified
8
Ma, H. et al. Semantic Energy: Detecting LLM Hallucination Beyond Entropy. arXiv:2508.14496, 2025.
arXiv title: Semantic Energy: Detecting LLM Hallucination Beyond Entropy
arXiv 2508.14496 ↗
link verified
9
Karpowicz, M.P. On the Fundamental Impossibility of Hallucination Control in Large Language Models. arXiv:2506.06382, 2025.
arXiv title: On the Fundamental Impossibility of Hallucination Control in Large Language Models
arXiv 2506.06382 ↗
link verified
10
Simhi, A. et al. Distinguishing Ignorance from Error in LLM Hallucinations. arXiv:2410.22071, 2024.
arXiv title: Distinguishing Ignorance from Error in LLM Hallucinations
arXiv 2410.22071 ↗
link verified
11
Simhi, A. et al. HACK: Hallucinations Along Certainty and Knowledge Axes. arXiv:2510.24222, 2025.
arXiv title: HACK: Hallucinations Along Certainty and Knowledge Axes
arXiv 2510.24222 ↗
link verified
12
Marin, J. A Geometric Taxonomy of Hallucinations in LLMs. arXiv:2602.13224, 2026.
arXiv title: A Geometric Taxonomy of Hallucinations in LLMs
arXiv 2602.13224 ↗
link verified
13
Cherukuri, K. & Varshney, L.R. Hallucination Basins: A Dynamic Framework for Understanding and Controlling LLM Hallucinations. arXiv:2604.04743, 2026.
arXiv title: Hallucination Basins: A Dynamic Framework for Understanding and Controlling LLM Hallucinations
arXiv 2604.04743 ↗
link verified
14
Akarlar, G.A. Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation. arXiv:2604.15400, 2026.
arXiv title: Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation
arXiv 2604.15400 ↗
link verified
15
Kalai, A.T. et al. Why Language Models Hallucinate. arXiv:2509.04664, 2025.
arXiv title: Why Language Models Hallucinate
arXiv 2509.04664 ↗
link verified
16
Fernando, T. & Guitchounts, G. Dynamics of the Transformer Residual Stream. arXiv:2605.14258, 2026.
arXiv title: Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology
arXiv 2605.14258 ↗
link verified
17
Lieberum, T. et al. Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2. arXiv:2408.05147, 2024.
arXiv title: Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
arXiv 2408.05147 ↗
link verified
18
Lu, Y. et al. Beyond Finite Layer Neural Networks. ICML, 2018.
Google Scholar ↗
Google Scholar search
19
Li, K. et al. Inference-Time Intervention. NeurIPS, 2023. arXiv:2306.03341.
arXiv title: Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
arXiv 2306.03341 ↗
link verified
20
Zou, A. et al. Representation Engineering. arXiv:2310.01405, 2023.
arXiv title: Representation Engineering: A Top-Down Approach to AI Transparency
arXiv 2310.01405 ↗
link verified
21
Rimsky, N. et al. Steering Llama 2 via Contrastive Activation Addition. ACL, 2024. arXiv:2312.06681.
arXiv title: Steering Llama 2 via Contrastive Activation Addition
arXiv 2312.06681 ↗
link verified
22
Obeso, O., Arditi, A., Ferrando, J., Freeman, J., Holmes, C. & Nanda, N. Real-Time Detection of Hallucinated Entities in Long-Form Generation. arXiv:2509.03531, 2025.
arXiv title: Real-Time Detection of Hallucinated Entities in Long-Form Generation
arXiv 2509.03531 ↗
link verified
23
Yeom, J., Sok, J., Kim, H., Park, S., Park, J. & Kim, T. Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer. arXiv:2605.22007, 2026.
arXiv title: Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer
arXiv 2605.22007 ↗
link verified
24
Hendrycks, D. et al. Measuring Massive Multitask Language Understanding (MMLU). ICLR, 2021. arXiv:2009.03300.
arXiv title: Measuring Massive Multitask Language Understanding
arXiv 2009.03300 ↗
link verified
25
Clark, P. et al. Think you have Solved Question Answering? Try ARC. arXiv:1803.05457, 2018.
arXiv title: Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
arXiv 1803.05457 ↗
link verified
26
Zellers, R. et al. HellaSwag: Can a Machine Really Finish Your Sentence? ACL, 2019. arXiv:1905.07830.
arXiv title: HellaSwag: Can a Machine Really Finish Your Sentence?
arXiv 1905.07830 ↗
link verified
27
Lin, S. et al. TruthfulQA: Measuring How Models Mimic Human Falsehoods. ACL, 2022. arXiv:2109.07958.
arXiv title: TruthfulQA: Measuring How Models Mimic Human Falsehoods
arXiv 2109.07958 ↗
link verified
28
Mihaylov, T. et al. Can a Suit of Armor Conduct Electricity? (OpenBookQA). EMNLP, 2018. arXiv:1809.02789.
arXiv title: Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
arXiv 1809.02789 ↗
link verified
29
Talmor, A. et al. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. NAACL, 2019. arXiv:1811.00937.
arXiv title: CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
arXiv 1811.00937 ↗
link verified
30
Joshi, M. et al. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. ACL, 2017. arXiv:1705.03551.
arXiv title: TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
arXiv 1705.03551 ↗
link verified
31
Marcenko, V.A. & Pastur, L.A. Distribution of eigenvalues for some sets of random matrices. Math. USSR-Sbornik, 1967.
Google Scholar ↗
Google Scholar search
32
Bai, Z. & Silverstein, J.W. Spectral Analysis of Large Dimensional Random Matrices. Springer, 2010.
Google Scholar ↗
Google Scholar search
33
Guo, C. et al. On Calibration of Modern Neural Networks. ICML, 2017. arXiv:1706.04599.
arXiv title: On Calibration of Modern Neural Networks
arXiv 1706.04599 ↗
link verified
34
Geifman, Y. & El-Yaniv, R. Selective Classification for Deep Neural Networks. NeurIPS, 2017. arXiv:1705.08500.
arXiv title: Selective Classification for Deep Neural Networks
arXiv 1705.08500 ↗
link verified
35
Hu, E.J. et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR, 2022. arXiv:2106.09685.
arXiv title: LoRA: Low-Rank Adaptation of Large Language Models
arXiv 2106.09685 ↗
link verified
36
Bricken, T. et al. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread, 2023.
Google Scholar ↗
Google Scholar search
37
Templeton, A. et al. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread, 2024.
Google Scholar ↗
Google Scholar search
38
Lindsey, J. et al. On the Biology of a Large Language Model. Transformer Circuits Thread, 2025.
Google Scholar ↗
Google Scholar search
