Chain-of-Thought Is Not Explainability

Abstract

Chain-of-thought (CoT) prompting allows language models to verbalise multi-step rationales before producing their final answer. While this technique often boosts task performance and creates an impression of transparency into the model's reasoning, we argue that rationales generated by current CoT techniques can be misleading and are neither necessary nor sufficient for trustworthy interpretability. Analysing faithfulness in terms of whether CoTs are not only human-interpretable but also reflect the model's underlying reasoning in a way that supports responsible use, we synthesise evidence from prior studies. We show that verbalised chains are frequently unfaithful, diverging from the true hidden computations that drive a model's predictions and giving an inaccurate picture of how models arrive at their conclusions.

Publication
Oxford AI Governance Initiative (AIGI)