Chain-of-Thought Is Not Explainability

Abstract

Chain-of-thought (CoT) prompting allows language models to verbalise multi-step rationales before producing their final answer. While this technique often boosts task performance and creates an impression of transparency into the model's reasoning, we argue that rationales generated by current CoT techniques can be misleading and are neither necessary nor sufficient for trustworthy interpretability. Analysing faithfulness in terms of whether CoTs are not only human-interpretable but also reflect the model's underlying reasoning in a way that supports responsible use, we synthesise evidence from prior studies. We show that verbalised chains are frequently unfaithful, diverging from the true hidden computations that drive a model's predictions and giving an inaccurate picture of how models arrive at their conclusions.

Publication
Oxford AI Governance Initiative (AIGI)