Posts

Featured

What Determines the Performance of LLM Agent Workflows? Balancing Latency, Reliability, and Cost

Three recent arXiv papers point to the same design problem from different angles: once AI systems are built as workflows of multiple interacting agents, performance is no longer explained by model quality alone. "Toward Reliable Design of LLM-Enabled Agentic Workflows" frames the central tradeoff as latency, reliability, and cost across mixed LLM and non-LLM agents. "Stop Comparing LLM Agents Without Disclosing the Harness" argues that the execution harness around the model can be a stronger determinant of results than the model itself in long-horizon settings. "QUIVER" adds a formal view of how perturbations propagate and how execution paths can bifurcate in compound AI systems. Together, these papers make the topic important not just for benchmarking, but for how agent systems should be designed and evaluated in practice. [S4][S9][S12]

Latest Posts

Why LLM Agent Evaluation Is Hard: Recent Papers on the Gap Between Benchmarks and Real Deployment

Three Recent AI Agent News Items: OpenAI, AWS, and Virgin Atlantic

Rethinking LLM Agent Evaluation: The New Criteria Proposed by AgentAtlas

What Data Shapes LLM Performance? Why This Paper Proposes Data Probes

Three Recent AI Papers on Agents, Documents, and Data: What Has Changed for Real-World LLM Systems?

Recent Papers on LLM Agents: Memory, Negotiation, and Structural Failure

Three Recent Papers on Making LLM Agent Execution More Reliable: SDOF, SkillSmith, and STAR

Two Axes for Reading LLM Agent Design: What the Agent Does and How It Runs

Designing Safer LLM Agents: Key Issues from Recent Papers

Why LLMs Lose Context in Multi-Turn Interaction: What Three New Papers Suggest About Causes and Responses