Posts

Why LLM Agent Evaluation Is Hard: Recent Papers on the Gap Between Benchmarks and Real Deployment