What Determines the Performance of LLM Agent Workflows? Balancing Latency, Reliability, and Cost
Three recent arXiv papers point to the same design problem from different angles: once AI systems are built as workflows of multiple interacting agents, model quality alone no longer explains performance. "Toward Reliable Design of LLM-Enabled Agentic Workflows" frames the central tradeoff as one among latency, reliability, and cost across mixed LLM and non-LLM agents. "Stop Comparing LLM Agents Without Disclosing the Harness" argues that, in long-horizon settings, the execution harness around the model can be a stronger determinant of results than the model itself. "QUIVER" adds a formal view of how perturbations propagate and how execution paths can bifurcate in compound AI systems. Together, these papers show that the question matters not only for benchmarking but also for how agent systems are designed and evaluated in practice. [S4][S9][S12]