How Can We Make LLM Agents More Reliable in Memory and Tool Use?

Three recent papers look at a shared problem in tool-using agents: an LLM may know how to call a tool, but still struggle with choosing the right tool at the right time, reusing past experience, or adapting a learned skill when the environment changes. "Contract2Tool: Learning Preconditions and Effects for Reliable Tool-Augmented LLM Agents" focuses on tool appropriateness through lightweight contracts about preconditions, effects, risk, and cost. "MemToolAgent" examines how long-term memory, retrieval of similar past cases, and reflection can improve tool-using behavior. "Efficient Skill Grounding via Code Refactoring with Small Language Models" addresses a related reliability issue in embodied agents, where a reusable skill can fail when embodiment or environment details differ. All three were announced on arXiv in June 2026 and, taken together, point to three practical axes for more stable agents: tool-use appropriateness, long-term memory, and adaptation to environmental differences. [S7][S8][S12]

Introduction: Which papers are these, and what do they cover?

"Contract2Tool: Learning Preconditions and Effects for Reliable Tool-Augmented LLM Agents" starts from a simple observation: standard tool schemas mainly explain how to call a tool, not when it is causally appropriate or what state change it should produce. The paper frames this gap as a reliability problem for tool-augmented agents. "MemToolAgent" looks at agents that need to learn from long-term historical events or prior interactions with the environment, arguing that memory is not just for dialogue continuity but also for better tool use. "Efficient Skill Grounding via Code Refactoring with Small Language Models" shifts the setting toward embodied agents and reusable skills, emphasizing that even small differences in embodiment or environment can make an existing skill incompatible. Although the settings differ, the common theme is clear: reliable agent behavior depends not only on generation quality, but also on better structure around action selection, experience reuse, and execution under changing conditions. [S7][S8][S12]

Sources: [S7], [S8], [S12]

Core idea: How do these papers strengthen tool choice and memory?

The core idea of Contract2Tool is to represent tools with lightweight contracts that describe preconditions, effects, risk level, and cost. In plain terms, this means the agent is not only told how to invoke an API, but also what must already be true before using it and what the tool is expected to change afterward. The source states that causal tool filtering uses these contracts to decide whether a tool is appropriate, and the paper focuses on learning such contracts because writing and maintaining them manually does not scale in large or changing tool ecosystems. [S7]
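
To make the contract idea concrete, here is a minimal Python sketch of contract-based filtering. It is an illustration under stated assumptions, not the paper's method: the `ToolContract` class, the `filter_tools` function, the set-of-facts state representation, the numeric risk scale, and the example tools are all hypothetical, and the contracts are written by hand here, whereas the paper is about learning them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolContract:
    """Hypothetical contract; the fields mirror the four items named in the text."""
    name: str
    preconditions: frozenset  # facts that must already hold before the call
    effects: frozenset        # facts the tool is expected to make true
    risk: int                 # assumed scale: 0 = low, 2 = high
    cost: float               # assumed abstract cost units


def filter_tools(contracts, state, goal, max_risk):
    """Return contracts that are usable now, useful for the goal, and within the risk budget."""
    unmet = goal - state
    usable = [
        c for c in contracts
        if c.preconditions <= state   # every precondition holds in the current state
        and c.effects & unmet         # at least one effect advances an unmet goal fact
        and c.risk <= max_risk        # respect the caller's risk budget
    ]
    return sorted(usable, key=lambda c: c.cost)  # cheapest candidates first


# Illustrative tools and facts (all invented for this example).
contracts = [
    ToolContract("search_flights", frozenset({"dates_known"}),
                 frozenset({"flight_options_listed"}), risk=0, cost=1.0),
    ToolContract("book_flight", frozenset({"flight_options_listed", "payment_on_file"}),
                 frozenset({"flight_booked"}), risk=2, cost=5.0),
    ToolContract("send_email", frozenset(), frozenset({"email_sent"}), risk=1, cost=0.5),
]
state = {"dates_known"}
goal = {"flight_options_listed", "flight_booked"}

print([c.name for c in filter_tools(contracts, state, goal, max_risk=1)])
```

With this state, only `search_flights` passes: `book_flight` has unmet preconditions, and `send_email` is callable but produces nothing the goal needs. That second case is the gap the text describes, since a plain schema would present `send_email` as a valid call.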

MemToolAgent addresses a different failure mode: agents often repeat mistakes or fail to benefit from prior interactions. The source describes a setup where the agent retrieves similar memories, receives feedback on an invalid action format, and generates a reflection that updates memory. The central idea is that memory should not be a passive log. Instead, past episodes can be retrieved when a similar task appears, and feedback can be turned into a compact reflection that improves future tool use. My interpretation is that this makes memory operational rather than archival. [S8]
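
The retrieve, feedback, and reflect loop described above can be sketched in a few lines of Python. This is a simplified illustration, not MemToolAgent's implementation: the `ReflectiveMemory` class is hypothetical, word-overlap (Jaccard) similarity stands in for whatever retrieval the paper uses, and a template string stands in for an LLM-generated reflection.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    task: str        # description of the past task
    reflection: str  # compact lesson derived from feedback


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; an assumed stand-in for a learned retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


class ReflectiveMemory:
    """Hypothetical memory that stores reflections and retrieves them by task similarity."""

    def __init__(self):
        self.entries = []

    def retrieve(self, task: str, k: int = 1):
        """Return the k stored entries most similar to the new task."""
        ranked = sorted(self.entries, key=lambda e: jaccard(task, e.task), reverse=True)
        return ranked[:k]

    def reflect_and_store(self, task: str, action: str, feedback: str) -> str:
        """Turn feedback on a failed action into a stored reflection (templated here)."""
        reflection = f"For '{task}', the action {action} failed: {feedback}"
        self.entries.append(MemoryEntry(task, reflection))
        return reflection


memory = ReflectiveMemory()
memory.reflect_and_store(
    "book a restaurant table for two",
    "book(time='7pm')",
    "invalid time format, expected HH:MM",
)
memory.reflect_and_store(
    "check the weather in Paris",
    "weather(city=None)",
    "city is required",
)

# A similar task arrives later; the stored lesson is retrieved before acting.
hits = memory.retrieve("book a restaurant table for four")
print(hits[0].reflection)
```

The point of the sketch is the data flow: what gets stored is the lesson from the feedback rather than the raw transcript, and it is looked up by task similarity so it can be placed in front of the agent before it repeats the same invalid call.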

The skill-grounding paper focuses on embodied settings where a previously written skill may break because the environment or embodiment is slightly different. Its proposed direction is code refactoring with small language models, rather than relying on large models at deployment time. The source frames this as a way to make reusable skills compatible again under practical constraints, especially in dynamic and partially observable environments where access to large models is impractical. Put simply, instead of asking a large model to reason from scratch each time, the system tries to rewrite or adapt the skill code so it fits the new setting. [S12]
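
One way to picture skill grounding as code refactoring is a check-and-rewrite loop: detect which actions a skill calls that the target environment does not offer, then ask a refactoring step to rewrite the code. The Python sketch below is an assumption-laden illustration, not the paper's pipeline: `called_names`, `ground_skill`, and `rename_refactor` are hypothetical, and a fixed rename table stands in for the small language model that would propose the rewrite.

```python
import ast


def called_names(source: str) -> set:
    """Collect bare function names called in the skill (method calls are ignored)."""
    tree = ast.parse(source)
    return {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }


def ground_skill(source, available_actions, refactor, max_attempts=3):
    """Refactor the skill until every called action exists in the target environment."""
    for attempt in range(max_attempts + 1):
        missing = called_names(source) - available_actions
        if not missing:
            return source
        if attempt < max_attempts:
            source = refactor(source, missing)
    raise ValueError(f"could not ground skill; unsupported actions: {sorted(missing)}")


def rename_refactor(source, missing):
    """Rule-based stand-in for a small language model: rename known mismatched actions."""
    renames = {"grasp": "pick_up"}  # assumed mapping between two embodiments

    class Renamer(ast.NodeTransformer):
        def visit_Name(self, node):
            if node.id in missing and node.id in renames:
                node.id = renames[node.id]
            return node

    return ast.unparse(Renamer().visit(ast.parse(source)))


# A skill written for one robot, to be deployed on another without `grasp`.
skill = "def fetch(obj):\n    move_to(obj)\n    grasp(obj)\n"
available = {"move_to", "pick_up"}

grounded = ground_skill(skill, available, rename_refactor)
print(grounded)
```

The design choice worth noting is that the compatibility check is cheap and deterministic, so the model (here replaced by a rename table) is only asked to fix a specific, named mismatch instead of regenerating the skill from scratch.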

Sources: [S7], [S8], [S12]

What is different from existing approaches?

Compared with standard tool use, Contract2Tool adds a layer of causal structure. The source explicitly contrasts standard tool schemas, which describe how to call a tool, with contracts that describe when the tool is appropriate and what task state it produces. That is a meaningful shift: the problem is no longer only invocation syntax, but action validity in context. [S7]

Compared with simple conversation history or generic memory buffers, MemToolAgent emphasizes retrieval of similar experiences and reflection based on feedback. The source notes that sophisticated memory systems exist for dialogue agents, but fewer studies have empirically examined how to improve tool-using capabilities through memory. The difference, then, is not merely storing more past text; it is using prior episodes as actionable guidance for future tool decisions and corrections. [S8]

Compared with approaches that assume a skill transfers directly across settings, the skill-grounding paper treats environmental mismatch as a first-class problem. The source argues that even minor embodiment or environmental differences can make an entire skill incompatible, especially in embodied settings. Its use of code refactoring with small language models also differs from methods that depend on large models being available during deployment. [S12]

Across the three papers, the shared change is a move away from treating the agent as a single-step text generator. Instead, they add explicit structure around three questions: when a tool should be used, how past experience should be reused, and how a learned procedure should be adapted when the world is not exactly the same as before. This synthesis is my interpretation based on the three abstracts. [S7][S8][S12]

Sources: [S7], [S8], [S12]

Applications: Where could these ideas matter in practice?

Contract-based tool filtering could matter in any setting where an agent has access to many APIs or tools and where misuse has meaningful cost or risk. Because the source highlights preconditions, effects, risk level, and cost, the most natural applications are tool ecosystems where choosing an inappropriate action can waste resources or move the task into the wrong state. [S7]

MemToolAgent is relevant for agents that repeatedly face similar tasks over time and can benefit from prior successes and failures. The source specifically ties memory to long-term historical events and previous agent-environment interactions, which suggests value in recurring workflows where the same kinds of tool decisions come up again and again. The restaurant-booking example in the source summary illustrates this in a simple form: the agent can retrieve a similar case, learn from feedback about an invalid format, and update memory for later use. [S8]

The skill-grounding paper is especially relevant for embodied or deployed systems operating in dynamic, partially observable environments. The source emphasizes reusable skills, environmental variation, and the practical constraint that large language models may not be available at deployment time. That makes the approach potentially useful wherever a skill must be adapted locally and efficiently rather than regenerated from scratch. [S12]

Taken together, these papers suggest a broader application pattern: agents with large tool sets, such as personal automation or enterprise agents, may need better tool appropriateness checks; long-running assistants may need memory that supports correction and reuse; and embodied or edge-deployed systems may need lightweight ways to adapt skills to local conditions. Each point builds on one of the sources, while the named deployment settings and the combined framing are my interpretation across them. [S7][S8][S12]

Sources: [S7], [S8], [S12]

Limitations and open problems

Contract2Tool is motivated by a clear limitation: manually writing and maintaining tool contracts does not scale to large or changing tool ecosystems. That means the paper addresses a real bottleneck, but it also implies an ongoing challenge around keeping learned contracts accurate as tools evolve. The source summary does not provide enough detail to claim how fully this is solved. [S7]

MemToolAgent highlights another unresolved issue: memory systems may exist, but their direct contribution to better tool use still needs careful empirical examination. The source itself notes that few studies have examined this question. That suggests a practical limitation: adding memory is not automatically the same as improving action quality, and the usefulness of retrieval and reflection likely depends on how memories are selected, updated, and applied. This caution is partly stated in the source and partly my interpretation. [S8]

The skill-grounding paper makes clear that embodied settings are difficult because small differences can invalidate an entire skill. It proposes code refactoring with small language models, but the source summary also makes the challenge visible: dynamic, partially observable environments and limited model capacity are hard constraints. In other words, adapting a skill with a smaller model may be necessary, but the information provided does not establish how reliable that adaptation is under varied real-world conditions, so this remains an open question. [S12]

A common limitation across all three papers is that they tackle different parts of the reliability stack rather than solving the whole problem at once. Better contracts do not automatically create useful memory; better memory does not automatically resolve environmental mismatch; and better skill adaptation does not by itself decide whether a tool should be used. That cross-paper conclusion is an interpretation, but it is consistent with the distinct scopes described in the sources. [S7][S8][S12]

Sources: [S7], [S8], [S12]

Summary

These three papers point to a practical view of agent reliability. Contract2Tool argues that tool use needs explicit knowledge about preconditions and effects, not just API signatures. MemToolAgent argues that agents should retrieve similar past experiences and turn feedback into reflections that improve future tool use. The skill-grounding paper argues that reusable skills must be adapted to embodiment and environment differences, and that small language models may need to do this through code refactoring when large models are unavailable. Together, they suggest that more stable LLM agents will likely require structured tool semantics, operational long-term memory, and mechanisms for adapting execution to changing environments. The first three claims come from the sources; the final synthesis is my interpretation across them. [S7][S8][S12]

Sources: [S7], [S8], [S12]


One-line takeaway: Recent papers suggest that making LLM agents more reliable requires three complementary pieces: clearer tool-use conditions, memory that supports reflection and reuse, and adaptation mechanisms for changing environments. [S7][S8][S12]

Short summary: Three recent papers examine how tool-using agents can become more reliable through better tool selection, long-term memory, and adaptation to environmental differences. Together, they show that stable agent behavior depends on more than fluent generation alone. [S7][S8][S12]

Sources and references:
- [S7] cs.AI updates on arXiv.org - Contract2Tool: Learning Preconditions and Effects for Reliable Tool-Augmented LLM Agents - URL: https://arxiv.org/abs/2606.07904
- [S8] cs.AI updates on arXiv.org - MemToolAgent overview with a simple restaurant booking scenario where the agent retrieves similar memories, receives feedback on an invalid time format, and generates a reflection to update its memory - URL: https://arxiv.org/abs/2606.07909
- [S12] cs.AI updates on arXiv.org - Efficient Skill Grounding via Code Refactoring with Small Language Models - URL: https://arxiv.org/abs/2606.07999


Note: AI-assisted content
This post was drafted with AI (gpt-5.4) using source-grounded inputs.
Please review the citations and original links above.