Posts

Showing posts with the label "benchmark"

Can LLMs Reuse Tools Creatively? What CreativityBench Tries to Measure

Why Do Long-Horizon Agents Break? HORIZON and the Case for Diagnostic Evaluation