Why Don’t LLM Agents Act as They Explain? The Faithfulness Gap in 3 Recent Papers
Why Don’t LLM Agents Act as They Explain? The Faithfulness Gap in 3 Recent Papers
Three recent arXiv papers look at a related reliability problem from different angles: whether LLM agents act on the reasoning they state, how safety signals can be tracked across long action trajectories, and whether hidden reasoning traces can still be exposed. “Doing What They Say, Not What They Reason” introduces the faithfulness gap in a controlled Texas Poker simulator with a verifiable reference action for every decision. “TRACE” reframes long-horizon agent safety as trajectory-level evidence compression. “Hidden Thoughts Are Not Secret” examines reasoning trace exposure in systems that show users only summaries and final answers. Taken together, these papers suggest that agent reliability is not just about output quality, but also about the relationship between internal reasoning, visible explanations, and actual behavior over time. [S5][S8][S9] [S5] [S8] [S9]
What these papers are about
All three papers focus on trust in LLM-based systems, but they approach it at different layers. “Doing What They Say, Not What They Reason: Locating the Faithfulness Gap in LLM Agents,” posted on arXiv in June 2026, asks a direct question: do LLM agents act on the reasoning they themselves state? The paper frames this as a process-fidelity problem and studies it in a controlled Texas Poker simulator where each decision has a verifiable reference action. “TRACE: Trajectory Risk-Aware Compression for Long-Horizon Agent Safety,” also posted on arXiv in June 2026, addresses a different but related issue: long-horizon agents may produce safety-relevant evidence across many steps, and that evidence can be missed by detectors that only inspect short contexts or single turns. “Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs,” likewise posted on arXiv in June 2026, looks at the exposure risk around reasoning traces, especially in systems that hide raw traces and reveal only summaries and answers. The shared theme is reliability: what an agent says, what it does, what a monitor can detect, and what internal traces may still leak are all connected questions. [S5][S8][S9]
Sources: [S5], [S8], [S9]
Core idea: how to think about the gap between reasoning and action
The central idea in S5 is that the mismatch between an agent’s explanation and its behavior should not be treated as one single problem. Instead, the paper decomposes the faithfulness gap into two steps: reasoning-conclusion and conclusion-action. In simple terms, the first step asks whether the stated reasoning actually supports the conclusion the model reaches. The second asks whether that conclusion is then reflected in the action the agent takes. This matters because an agent can sound coherent in its explanation yet still fail to act accordingly, or it can produce reasoning that appears weak while still landing on an action that matches the reference. S5 studies this in a controlled Texas Poker simulator specifically because the environment provides a verifiable reference action for every decision, which makes it possible to inspect these links more carefully than in open-ended settings where there may be no clear ground truth for behavior. The paper’s abstract states that the two steps behave oppositely, which is an important clue that “faithfulness” may hide more than one failure mode. That is the paper’s stated finding; my interpretation is that evaluating only the final action or only the visible reasoning can miss where the actual mismatch occurs. [S5]
Sources: [S5]
What is different from earlier ways of looking at the problem
S5 differs from a more coarse-grained approach that asks only whether an agent’s final behavior looks right. Its contribution, as described in the abstract, is to locate the faithfulness gap by splitting it into reasoning-conclusion and conclusion-action rather than treating behavior alignment as a single end-to-end measure. S8 makes a different shift. Instead of relying on turn-level or short-context safety detectors, it argues that long-horizon agents generate sparse, delayed, and compositional risk signals that can escape local moderation. The paper therefore reframes long-horizon safety detection as trajectory-level evidence compression and proposes TRACE, short for Trajectory Risk-Aware Compression for Long-Horizon Agent Safety. S9 focuses on another layer entirely: the exposure of reasoning traces. Its abstract notes that detailed traces are useful learning signals for capability transfer, which is why some deployed reasoning systems hide raw traces and expose only summaries and answers. The paper then studies the problem that hidden thoughts may still not be fully secret. Put together, the three papers differ in scope but complement one another: S5 studies whether stated reasoning matches action, S8 studies how to retain safety evidence across long trajectories, and S9 studies whether internal reasoning can remain hidden when only compressed outputs are shown. [S5][S8][S9]
Sources: [S5], [S8], [S9]
Possible applications in practice
Within the scope described by the abstracts, S5 is especially relevant to settings such as social simulation, which the paper explicitly mentions. If researchers want to use LLM agents in simulations of strategic or social behavior, they need some way to judge whether the agent’s stated reasoning is actually connected to what it does. The Texas Poker simulator in S5 is a controlled example of that broader concern. S8 is relevant to long-running agent systems, where safety-relevant evidence may emerge gradually across many steps rather than in a single response. In such systems, a trajectory-level view could help safety review focus on accumulated evidence instead of isolated turns. S9 is relevant wherever reasoning models are deployed with hidden internal traces and public-facing summaries or answers. In those cases, the paper’s framing suggests that trace exposure is not only a model-training issue but also a deployment and interface issue. My interpretation is that these papers are useful less as finished recipes and more as ways to structure evaluation: one for behavior faithfulness, one for long-horizon safety monitoring, and one for reasoning-trace handling. [S5][S8][S9]
Sources: [S5], [S8], [S9]
Limitations and open questions
These papers also make clear that the problem is not solved. S5 studies faithfulness in a controlled environment, which is useful because it provides a verifiable reference action for every decision, but that same control also means the setup is narrower than many real-world agent tasks. The abstract tells us the decomposition is measurable there; it does not imply that the same measurement will be straightforward in open-ended domains without clear reference actions. S8 starts from the observation that long-horizon risk signals can be sparse, delayed, and compositional, which itself highlights the difficulty of the problem: evidence may be distributed across a long trajectory and may not be easy to preserve or aggregate. S9 points to another unresolved tension. Reasoning traces are valuable for learning and transfer, yet systems may try to hide them and expose only summaries and answers. The paper’s framing suggests that hiding raw traces does not automatically eliminate exposure concerns. Across all three, the open question is how to build systems whose explanations, actions, safety monitoring, and trace-handling policies remain consistent under realistic use, not just under idealized evaluation conditions. [S5][S8][S9]
Sources: [S5], [S8], [S9]
One-paragraph takeaway
These three papers point to a simple but important lesson: with LLM agents, reliability is not captured by the final answer alone. S5 shows that the gap between what an agent says and what it does can be decomposed into reasoning-conclusion and conclusion-action, and studies that gap in a Texas Poker simulator with verifiable reference actions. S8 argues that long-horizon safety should be treated as a trajectory-level evidence compression problem rather than a turn-by-turn one. S9 reminds us that even when systems expose only summaries and answers, reasoning trace exposure remains a live concern. Together, they do not offer a complete solution, but they do provide a clearer map of where trust can break down. [S5][S8][S9]
Sources: [S5], [S8], [S9]
One-line takeaway: Recent papers suggest that LLM agent reliability depends on more than output quality: it also hinges on whether stated reasoning matches action, whether long-run risk evidence is preserved, and whether hidden traces can still be exposed. [S5][S8][S9] [S5] [S8] [S9]
Short summary: Three recent papers examine why LLM agents may not behave in line with their own explanations. They cover the faithfulness gap, long-horizon safety monitoring through trajectory-level compression, and the exposure risks of hidden reasoning traces.
Sources and references: - [S5] cs.AI updates on arXiv.org - Doing What They Say, Not What They Reason: Locating the Faithfulness Gap in LLM Agents - URL: https://arxiv.org/abs/2606.00476 - [S8] cs.AI updates on arXiv.org - TRACE: Trajectory Risk-Aware Compression for Long-Horizon Agent Safety - URL: https://arxiv.org/abs/2606.00611 - [S9] cs.AI updates on arXiv.org - Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs - URL: https://arxiv.org/abs/2606.00642
Internal link ideas: - A beginner’s guide to LLM agents and why process fidelity matters - What chain-of-thought, reasoning traces, and answer summaries each reveal - How to evaluate long-horizon AI agent safety beyond single-turn moderation
LLM agents #faithfulness gap #reasoning trace #AI safety #paper brief
Note AI-assisted content
This post was drafted with AI (gpt-5.4) using source-grounded inputs.
Please review the citations and original links below.
Comments
Post a Comment