Why LLM Agent Evaluation Is Hard: Recent Papers on the Gap Between Benchmarks and Real Deployment
Four recent arXiv papers from May 2026 approach the same problem from different angles: it is difficult to judge LLM agents by benchmark scores alone. "Design and Report Benchmarks for Knowledge Work" argues that evaluation for coding, research, healthcare, and other knowledge-work settings still too often follows the logic of traditional NLP tasks, even when real deployment demands something broader. "GENSTRAT" focuses on strategic reasoning in marketplaces and bidding settings, "When Planning Fails Despite Correct Execution" examines failures caused by misjudging what agents know during planning, and "PrefBench" studies negotiation under hidden user preferences. Taken together, these papers frame a common gap between clean benchmark settings and messy real-world use. [S4][S3][S7][S12]