Showing posts with the label "agent evaluation"

Why Do Long-Horizon Agents Break? Diagnosing Failure with HORIZON and Related Papers

Why Do Long-Horizon Agents Break? HORIZON and the Case for Diagnostic Evaluation