Posts

Showing posts with the label benchmark

What Changed in Physics-Aware Diagram Generation and Physical Reasoning Benchmarks?

Can LLMs Reuse Tools Creatively? What CreativityBench Tries to Measure

Why Do Long-Horizon Agents Break? HORIZON and the Case for Diagnostic Evaluation