Tool Choice and Interpretability in LLM Agents: Key Ideas from Three Recent Papers

Recent work on LLM agents is converging on a practical question: when an agent uses tools or makes multi-step decisions, how can we reduce selection mistakes and make its reasoning easier to inspect? The papers discussed here approach that question from different angles. "Explainable AML Triage with LLMs: Evidence Retrieval and Counterfactual Checks" focuses on evidence-grounded explanations in a regulated workflow, "From Actions to Understanding: Conformal Interpretability of Temporal Concepts in LLM Agents" studies how concepts evolve over time inside agent behavior, and "SkillGraph: Graph Foundation Priors for LLM Agent Tool Sequence Recommendation" and "JTPRO: A Joint Tool-Prompt Reflective Optimization Framework for Language Agents" target tool choice and execution errors when tool libraries become large or ambiguous. Together, they show a common shift away from unconstrained generation and toward more structured support for both action selection and explanation. [S1][S3][S7][S10]

Paper overview: what these studies are trying to solve

"Explainable AML Triage with LLMs: Evidence Retrieval and Counterfactual Checks" addresses anti-money laundering alert triage, where investigators work under audit and governance constraints. Its starting point is that LLMs can summarize evidence and draft rationales, but free-form generation is risky because of hallucinations, weak provenance, and explanations that may not faithfully reflect the decision. [S1]

"From Actions to Understanding: Conformal Interpretability of Temporal Concepts in LLM Agents" looks at a broader agent setting. It starts from the observation that LLM agents can reason, plan, and act over multiple steps, but the internal mechanisms behind those sequential behaviors remain opaque. The paper proposes a framework for interpreting the temporal evolution of concepts during agent behavior. [S3]

"SkillGraph: Graph Foundation Priors for LLM Agent Tool Sequence Recommendation" focuses on a narrower but common operational problem: agents must choose tools from large API libraries and order them correctly. The paper argues that existing methods often rely on semantic similarity for both retrieval and ordering, even though correct ordering depends on inter-tool data dependencies not visible in tool descriptions. [S7]

"JTPRO: A Joint Tool-Prompt Reflective Optimization Framework for Language Agents" also studies tool-augmented agents, especially when the number of tools grows and tools become domain-specific. It attributes frequent tool mis-selection and incorrect slot or value instantiation to generic prompts and underspecified tool schemas. [S10]

Sources: [S1], [S3], [S7], [S10]

Core idea: how they handle agent choice and explanation

The AML paper proposes a more controlled way to use LLMs in a high-stakes workflow. According to the abstract, its emphasis is on evidence retrieval and counterfactual checks. The source explicitly frames these as responses to hallucinations, weak provenance, and unfaithful explanations. In plain terms, the paper is trying to make the model's written rationale stay closer to retrieved evidence and to test whether the decision still holds when relevant factors are varied. That is a different goal from simply producing a fluent explanation: it aims to make the explanation auditable. [S1]
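The abstract does not spell out the mechanism, but the counterfactual idea can be sketched in a few lines. In this illustrative Python sketch, a simple rule-based `triage` function stands in for the LLM's decision, and the check asks whether removing a factor cited in the rationale could actually change the outcome. All function names and risk factors here are hypothetical, not the paper's actual design.

```python
def triage(evidence):
    """Stand-in for an LLM triage decision: escalate an AML alert
    when enough risk signals are present. Purely illustrative."""
    score = 0
    if evidence.get("structuring_pattern"):
        score += 2
    if evidence.get("high_risk_jurisdiction"):
        score += 1
    if evidence.get("customer_pep"):
        score += 1
    return "escalate" if score >= 2 else "close"

def counterfactual_check(evidence, cited_factor):
    """If the rationale cites `cited_factor`, removing that factor
    should be able to change the decision; if it never can, the
    explanation may not be faithful to the decision basis."""
    original = triage(evidence)
    varied = dict(evidence)
    varied[cited_factor] = False
    return original != triage(varied)

alert = {"structuring_pattern": True,
         "high_risk_jurisdiction": False,
         "customer_pep": False}
print(counterfactual_check(alert, "structuring_pattern"))  # True: the decision flips
```

A factor that fails this check (removing it leaves the decision unchanged) is a signal that the written rationale leans on evidence the decision did not actually depend on.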

The conformal interpretability paper takes a different route. Rather than improving one specific decision directly, it proposes a framework for understanding how concepts change over time as an agent acts step by step. The source describes this as a step-wise conformal lens for interpreting temporal concepts in LLM agents. My interpretation is that this work is less about choosing a better tool in the moment and more about giving researchers a structured way to inspect what the agent appears to be tracking across a sequence of actions. [S3]
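The source offers only the framing of a step-wise conformal lens, so the following is a generic split-conformal sketch rather than the paper's method: calibrate a nonconformity threshold on held-out steps, then at each agent step emit a *set* of candidate concepts instead of a single label. All scores and concept names below are invented for illustration.

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest
    nonconformity score from a held-out calibration set."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(cal_scores)[min(k, n) - 1]

def concept_set(step_scores, threshold):
    """At one agent step, keep every candidate concept whose
    nonconformity score is within the calibrated threshold, yielding
    a prediction set with a coverage guarantee rather than one label."""
    return {c for c, s in step_scores.items() if s <= threshold}

# Hypothetical nonconformity scores (e.g. 1 - model confidence) for the
# true concept on held-out calibration steps.
calibration = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
q = conformal_threshold(calibration, alpha=0.2)
# Scores for candidate concepts at one step of an agent trajectory.
step = {"searching": 0.12, "comparing": 0.55, "booking": 0.95}
print(concept_set(step, q))  # contains 'searching' and 'comparing'
```

The size of the set itself is informative: a large set at some step means the calibrated procedure cannot pin down what concept the agent is tracking there.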

SkillGraph targets the selection problem more directly. The paper introduces a directed, weighted execution-transition graph mined from successful trajectories. The source states that this graph is meant to capture execution transitions, which matters because tool ordering depends on data dependencies between tools, not just on how similar their descriptions sound. In simple terms, the proposal adds workflow structure to tool recommendation. [S7]
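The abstract does not describe the graph construction in detail, but mining a directed, weighted transition graph from successful trajectories can be sketched minimally: count how often one tool directly follows another in successful runs, then rank candidate next tools by those counts. The tool names and trajectories below are hypothetical.

```python
from collections import Counter

def mine_transition_graph(trajectories):
    """Build a directed, weighted transition graph from successful
    tool-call trajectories. Edge weight = how often tool B directly
    followed tool A in a successful run."""
    edges = Counter()
    for traj in trajectories:
        for a, b in zip(traj, traj[1:]):
            edges[(a, b)] += 1
    return edges

def rank_next_tools(edges, current_tool):
    """Rank candidate next tools by observed transition frequency,
    rather than by description similarity alone."""
    candidates = {b: w for (a, b), w in edges.items() if a == current_tool}
    return sorted(candidates, key=candidates.get, reverse=True)

# Hypothetical successful trajectories over a small tool library.
runs = [
    ["search_flights", "get_price", "book_flight"],
    ["search_flights", "get_price", "book_flight"],
    ["search_hotels", "get_price", "compare_fares", "book_flight"],
]
graph = mine_transition_graph(runs)
print(rank_next_tools(graph, "get_price"))  # 'book_flight' ranks first
```

Even this toy version captures the paper's motivating point: "book_flight" outranks the alternatives after "get_price" because of observed execution order, something no amount of comparing tool descriptions would reveal.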

JTPRO focuses on the interaction between tool definitions and prompts. The source says the framework jointly optimizes tools and prompts through reflective optimization, motivated by the claim that generic prompts ignore tool-specific nuances and that tool schemas are often underspecified. The core idea is therefore not only to pick a tool better, but to make the instructions and tool representations themselves better aligned with the agent's task. [S10]
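The source does not detail the optimization loop, so here is a deliberately toy sketch of the reflect-and-refine idea: a word-overlap selector stands in for the LLM's tool choice, and the reflection step rewrites an underspecified tool description after observing a failure. The selector, schemas, and task are all invented for illustration.

```python
def select_tool(task, schemas):
    """Toy selector: pick the tool whose description shares the most
    words with the task (a crude stand-in for an LLM's tool choice)."""
    def overlap(desc):
        return len(set(task.lower().split()) & set(desc.lower().split()))
    return max(schemas, key=lambda name: overlap(schemas[name]))

def reflect_and_refine(schemas, failures):
    """Reflection step: for each observed failure, enrich the correct
    tool's description with the failing task's wording so future
    selections align with it."""
    refined = dict(schemas)
    for task, correct_tool in failures:
        refined[correct_tool] = refined[correct_tool] + " " + task.lower()
    return refined

schemas = {
    "convert_currency": "convert money from one currency to another",
    "get_fx_rate": "query rates",  # underspecified schema
}
task = "what is the usd to eur exchange rate today"
first = select_tool(task, schemas)           # mis-selects 'convert_currency'
schemas = reflect_and_refine(schemas, [(task, "get_fx_rate")])
second = select_tool(task, schemas)          # now selects 'get_fx_rate'
```

The point of the sketch is the failure mode, not the fix: with an underspecified schema, surface matching picks the wrong tool, and refining the tool representation (here, naively) is what corrects the choice.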

Sources: [S1], [S3], [S7], [S10]

How they differ from earlier approaches

A shared criticism across these papers is that simple, unconstrained methods break down in realistic settings. In the AML paper, the problem is unconstrained generation: the source explicitly warns that free-form LLM output is risky in regulated workflows because of hallucinations, weak provenance, and explanations that may not be faithful to the actual decision basis. Its answer is to anchor outputs in retrieved evidence and add counterfactual checks. [S1]

SkillGraph makes a similar argument in the tool-selection setting. The source says existing methods use semantic similarity for both retrieval and ordering, but ordering actually depends on inter-tool data dependencies that are absent from tool descriptions. The paper's main difference is to introduce a graph prior mined from successful trajectories, so recommendation is informed by observed execution structure rather than description similarity alone. [S7]

JTPRO differs from one-size-fits-all prompting. The source attributes tool errors partly to generic prompts and underspecified schemas, then proposes joint tool-prompt reflective optimization. Compared with a baseline where prompts stay fixed and tools are taken as given, this suggests a more adaptive setup in which the agent's instructions and tool definitions are refined together. [S10]

Taken together, these papers point to a broader pattern. The source-backed claim is that semantic matching or free-form generation alone is often insufficient. My interpretation is that recent work is moving toward explicit support structures: retrieved evidence, counterfactual checks, execution graphs, and reflective prompt-tool refinement. [S1][S7][S10]

Sources: [S1], [S7], [S10]

Possible application areas within the papers' scope

The clearest application described in the sources is AML transaction monitoring. The AML paper is specifically framed around triaging large volumes of alerts for investigators who operate under strict audit and governance requirements. In that setting, evidence-backed rationales and checks on explanation faithfulness are directly relevant. [S1]

The conformal interpretability paper applies to interactive environments where LLM agents perform multi-step reasoning, planning, and acting. The source does not limit this to one industry domain; instead, it positions the framework as a way to interpret sequential agent behavior over time. That makes it relevant wherever understanding the evolution of agent concepts matters. [S3]

SkillGraph is aimed at structured workflow domains in which an agent must select from large API libraries and order tools correctly. The source explicitly highlights the importance of inter-tool data dependencies, so the most natural applications are environments where tool outputs feed into later tool inputs in a nontrivial sequence. [S7]

JTPRO is relevant to tool-augmented language agents operating with many tools, especially domain-specific ones. Based on the source, its likely use cases are settings where ambiguous tool descriptions and incomplete schemas lead to wrong tool calls or incorrect slot and value filling. [S10]

Sources: [S1], [S3], [S7], [S10]

Limitations and open questions

These papers address concrete failure modes, but they do not imply that LLM agents have become fully transparent or fully reliable. The AML paper itself begins from the fact that hallucinations, weak provenance, and unfaithful explanations are serious risks in regulated workflows. Its proposal is a mitigation strategy, not a claim that explanation problems are solved in general. [S1]

The conformal interpretability paper is motivated by opacity in sequential agent behavior. That framing itself signals an open problem: even with a framework for interpreting temporal concepts, the internal mechanisms of LLM agents remain difficult to observe directly. The source presents a lens for interpretation, not a complete account of internal causality. [S3]

SkillGraph improves on semantic-only ordering by introducing execution-transition structure, but its own setup depends on successful trajectories mined into a graph. From the source alone, we can say the method is built around observed execution transitions; a reasonable caution is that its usefulness may depend on how well those transitions represent the workflows an agent will face. [S7]

JTPRO identifies generic prompts and underspecified schemas as root causes of tool errors, then proposes joint optimization. But the source also makes clear why the problem is hard: tool libraries can be large, domain-specific, and ambiguous. That means prompt and schema refinement may reduce some classes of error without removing the underlying complexity of tool-rich environments. [S10]

Sources: [S1], [S3], [S7], [S10]

One-line takeaway

Across these papers, the common direction is clear: making LLM agents more dependable requires work on both sides of the problem—improving what the agent selects or does, and improving how that choice is grounded, checked, or interpreted afterward. Evidence retrieval and counterfactual checks address explanation faithfulness, temporal interpretability frameworks address opaque sequential behavior, and graph or reflective optimization methods address tool-choice errors that simple semantic matching often misses. [S1][S3][S7][S10]

Sources: [S1], [S3], [S7], [S10]


One-line takeaway: These papers suggest that reducing LLM agent errors requires more than better generation: it also needs structured support for tool choice, evidence grounding, and interpretable decision traces. [S1][S3][S7][S10]

Short summary: Recent papers on LLM agents are tackling two linked problems: wrong tool choices and unclear explanations. They do so with different mechanisms, including evidence retrieval, counterfactual checks, temporal interpretability, graph-based sequencing, and joint tool-prompt optimization.

Sources and references:
- [S1] cs.AI updates on arXiv.org. Explainable AML Triage with LLMs: Evidence Retrieval and Counterfactual Checks. https://arxiv.org/abs/2604.19755
- [S3] cs.AI updates on arXiv.org. From Actions to Understanding: Conformal Interpretability of Temporal Concepts in LLM Agents. https://arxiv.org/abs/2604.19775
- [S7] cs.AI updates on arXiv.org. SkillGraph: Graph Foundation Priors for LLM Agent Tool Sequence Recommendation. https://arxiv.org/abs/2604.19793
- [S10] cs.AI updates on arXiv.org. JTPRO: A Joint Tool-Prompt Reflective Optimization Framework for Language Agents. https://arxiv.org/abs/2604.19821

Internal link ideas:
- How tool-augmented LLM agents fail in large API environments
- Why evidence grounding matters for explainable AI in regulated workflows
- A primer on interpretability methods for multi-step LLM agents

#LLM agents #tool selection #interpretability #evidence retrieval #counterfactual checks #agent workflows


Note: AI-assisted content. This post was drafted with AI (gpt-5.4) using source-grounded inputs. Please review the citations and original links above.
