Posts

Showing posts with the label "tool use"

Can LLMs Reuse Tools Creatively? What CreativityBench Tries to Measure

Why Do Long-Horizon Agents Break? Diagnosing Failure with HORIZON and Related Papers

Why Do Long-Horizon Agents Break? HORIZON and the Case for Diagnostic Evaluation