Why LLM Agents Stay Unstable: Three Recent arXiv Papers on Reliability, Web Skill Learning, and Reasoning Limits

Three recent arXiv papers look at a similar problem from different angles: LLM agents can appear capable, yet still behave unpredictably, struggle on long web workflows, or degrade during multi-step reasoning. “Numerical Instability and Chaos: Quantifying the Unpredictability of Large Language Models” examines instability at the numerical level, “WebXSkill: Skill Learning for Autonomous Web Agents” focuses on how web agents learn reusable skills for long-horizon tasks, and “The cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents” studies how to detect and recover from agent failures while they are running. Taken together, these papers shift attention from raw capability to reliability in actual agent use. [S1][S2][S7]

Introduction: the papers and their arXiv context

All three works were posted to arXiv in April 2026 and address practical weaknesses in LLM-based agents rather than presenting agents as already solved systems. S1 frames unpredictability as a reliability issue tied to numerical instability in large language models. S2 studies autonomous web agents and argues that long-horizon browser tasks remain difficult because current skill representations do not connect understanding and execution well enough. S7 focuses on deployed multi-step agents, where reasoning can degrade into looping, drift, or stuck states, and proposes a monitoring architecture to catch those failures during operation. This grouping is useful for beginners because the papers cover three layers of the same stack: model behavior, task skill learning, and runtime supervision. The first part is stated in the papers; the “three layers” framing is an interpretation that helps connect them. [S1][S2][S7]

Sources: [S1], [S2], [S7]

Core idea: what problem each paper identifies and what it proposes

S1 argues that LLM unpredictability is not only a vague behavioral issue but is rooted in finite numerical computation. The paper presents a rigorous analysis linking unpredictability to numerical instability and chaos-like behavior in model computation. In plain terms, its claim is that small numerical differences can matter enough to produce meaningfully different outputs, which becomes a serious concern when LLMs are used inside agent workflows. [S1]
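The paper’s formal analysis is not reproduced in its abstract, but the underlying phenomenon is easy to illustrate. Here is a minimal Python sketch, assuming only standard IEEE 754 floating point; the logit values are invented for illustration and are not from the paper:

```python
# A small, deterministic instance of the general phenomenon:
# floating-point addition is not associative, so the same sum computed
# in a different order gives a different result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)   # False: 0.6000000000000001 vs 0.6

# Why this can matter inside an agent: if two token logits are nearly
# tied, an error on this scale can flip the argmax, change the sampled
# token, and send the whole continuation down a different path.
logits = {"yes": left, "no": right}   # nearly tied scores
print(max(logits, key=logits.get))    # "yes" wins only by rounding error
```

At LLM scale, billions of such operations run in hardware-dependent orders, which is one concrete route by which small numerical differences can surface as visibly different outputs.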

S2 starts from a different bottleneck: web agents often fail on long tasks because their learned skills are either too abstract or too opaque. According to the paper, textual workflow skills are easy to read but cannot be directly executed, while code-based skills can be executed but do not give the agent step-level understanding for recovery when something goes wrong. WebXSkill is proposed as a way to learn skills for autonomous web agents that better bridge this grounding gap. [S2]
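The abstract does not spell out WebXSkill’s concrete skill format, but the text-versus-code trade-off it describes can be sketched. The following is a hypothetical hybrid representation, assuming each step pairs a readable description with an executable action; `SkillStep`, `run_skill`, and the login steps are all invented for illustration:

```python
# Hypothetical sketch (not WebXSkill's actual representation): each
# skill step carries both a readable description and an executable
# action, so the agent can run the step AND reason about it at step
# level when something goes wrong.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SkillStep:
    description: str                  # textual side: supports step-level reasoning
    action: Callable[[dict], dict]    # executable side: updates task state

def run_skill(steps, state):
    for i, step in enumerate(steps):
        try:
            state = step.action(state)
        except Exception as exc:
            # Because each step is described in text, a recovery policy
            # can be told WHICH step failed and what it was meant to do.
            print(f"step {i} failed ({step.description!r}): {exc}")
            break
    return state

login = [
    SkillStep("open the login page", lambda s: {**s, "page": "login"}),
    SkillStep("submit credentials", lambda s: {**s, "logged_in": True}),
]
print(run_skill(login, {}))  # {'page': 'login', 'logged_in': True}
```

The design point is that neither half is discarded: the text is not mere documentation (it feeds error recovery), and the code is not a black box (each executable step stays aligned with one described intent).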

S7 addresses failure during execution. Its starting point is that LLM agents on multi-step tasks can suffer reasoning degradation, including looping, drift, and stuck states. The paper proposes a “Cognitive Companion,” a parallel monitoring architecture with two versions: one LLM-based and one probe-based. The basic idea is to monitor the main agent from the side, detect signs of degraded reasoning, and support recovery without relying only on blunt limits or expensive per-step judging. [S7]
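The abstract does not give the companion’s internals, so the following is only a minimal sketch of the side-monitoring idea: a cheap check that watches the agent’s action trace and flags repetition. The `LoopMonitor` class, window size, and threshold are invented for illustration, not taken from the paper:

```python
# Hypothetical sketch of a lightweight parallel monitor: it observes
# the main agent's actions from the side and flags a possible loop
# when the same action recurs too often in a recent window.
from collections import deque

class LoopMonitor:
    def __init__(self, window=6, repeat_threshold=3):
        self.recent = deque(maxlen=window)     # sliding window of recent actions
        self.repeat_threshold = repeat_threshold

    def observe(self, action):
        self.recent.append(action)
        if self.recent.count(action) >= self.repeat_threshold:
            return f"possible loop: {action!r} repeated"
        return None                            # no intervention needed

monitor = LoopMonitor()
trace = ["open_page", "click_search", "click_search", "click_search"]
for step in trace:
    alert = monitor.observe(step)
    if alert:
        print(alert)  # fires on the third "click_search"
```

A check like this is far cheaper than judging every step with a second LLM, which is the kind of trade-off the paper’s two monitor variants (LLM-based vs. probe-based) are navigating.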

Sources: [S1], [S2], [S7]

How these papers differ from existing approaches

S1 differs from work that mainly measures downstream inconsistency without explaining where it comes from. Based on its abstract, the paper tries to move from observing unstable outputs to analyzing the underlying mechanism in finite numerical computation. That does not mean it fully closes the question, but it clearly shifts the discussion from symptoms to causes. [S1]

S2 differs from prior web-agent skill formulations by rejecting the trade-off between text-only and code-only skills. The paper explicitly describes the gap: textual skills provide natural-language guidance but are not directly executable, while code skills are executable but not transparent enough for step-level reasoning and error recovery. WebXSkill is presented as an attempt to combine the strengths of both rather than choosing one format. [S2]

S7 contrasts its approach with two existing operational strategies named in the abstract: hard step limits and LLM-as-judge monitoring. Hard limits are described as abrupt, while LLM-as-judge monitoring is said to add per-step overhead. The proposed companion architecture differs by running in parallel and by offering both an LLM-based monitor and a probe-based monitor, with the latter described as zero-overhead in the abstract. [S7]

Sources: [S1], [S2], [S7]

Possible applications: where this could help in practice

The most direct application in these sources is autonomous web interaction. S2 is explicitly about browser-based agents, so its skill-learning approach could matter in settings where an agent must complete long workflows across multiple pages and recover from intermediate mistakes. The paper’s framing suggests relevance for web automation tasks that require both execution and interpretable step structure. [S2]

S7 points to another practical layer: operating multi-step agents more safely. A parallel monitor that can detect looping, drift, or stuck states could be useful wherever agents are expected to run for several steps without constant human checking. The paper is careful to frame this as detection and recovery from reasoning degradation, not as a guarantee of correctness. That distinction matters in real deployments. [S7]

S1 is less application-specific in the abstract, but its relevance is broad: if numerical instability contributes to unpredictable outputs, then any agentic workflow built on top of LLMs may inherit that risk. This is an interpretation of the paper’s reliability framing rather than a separate application claim made in the abstract. [S1]

Sources: [S2], [S7], [S1]

Limitations and open questions

None of these papers should be read as saying the agent reliability problem is solved. S1 is important because it tries to explain root causes of unpredictability, but the abstract itself says the mechanisms have remained poorly understood, which implies the field is still in an early stage of explanation and control. Even if instability can be quantified more rigorously, that does not automatically mean it can be eliminated in practical systems. [S1]

S2 addresses a real weakness in long-horizon web tasks, but the abstract also makes clear that such tasks remain difficult. Bridging the gap between readable workflow knowledge and executable skills is a meaningful step, yet long web workflows involve many sources of failure beyond skill representation alone, including recovery, grounding, and task continuity. The last sentence is an interpretation based on the problem framing in the abstract, not a direct claim of the paper. [S2]

S7 is explicit that reasoning degradation is common enough to justify runtime monitoring, but monitoring is still a response layer rather than a full cure. Detecting loops, drift, or stuck states can help, yet it does not remove the underlying causes of those failures. Also, the paper compares itself to existing monitoring approaches, which suggests that trade-offs around cost, robustness, and intervention strategy remain active design questions. [S7]

Sources: [S1], [S2], [S7]

One-paragraph takeaway

These three papers make a similar point from different directions: better LLM agents require more than stronger base models or better benchmark scores. S1 says unpredictability may be rooted in numerical instability, S2 says long web tasks need skill representations that connect understanding and execution, and S7 says agents in operation need lightweight ways to detect and recover from reasoning degradation. My interpretation is that the common message is not that agents are failing in one single way, but that reliability problems appear at multiple levels and need different kinds of fixes. [S1][S2][S7]

Sources: [S1], [S2], [S7]


One-line takeaway: Recent arXiv papers suggest that LLM agent limits are not just about capability: instability, weak skill grounding, and runtime reasoning degradation remain central reliability challenges. [S1][S2][S7]

Short summary: Three recent arXiv papers examine why LLM agents remain unreliable in practice. They focus on numerical instability, long-horizon web skill learning, and runtime monitoring for reasoning degradation.

Sources and references:
- [S1] cs.AI updates on arXiv.org. Numerical Instability and Chaos: Quantifying the Unpredictability of Large Language Models. URL: https://arxiv.org/abs/2604.13206
- [S2] cs.AI updates on arXiv.org. WebXSkill: Skill Learning for Autonomous Web Agents. URL: https://arxiv.org/abs/2604.13318
- [S7] cs.AI updates on arXiv.org. The cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents. URL: https://arxiv.org/abs/2604.13759

Internal link ideas:
- What makes LLM agents fail on multi-step tasks?
- A beginner’s guide to web agents and browser automation
- How to think about reliability in LLM-based systems

#LLM agents #arXiv papers #AI reliability #web agents #reasoning degradation


Note: AI-assisted content
This post was drafted with AI (gpt-5.4) using source-grounded inputs.
Please review the citations and original links above.
