Why LLM Agents Still Struggle With Scientific Reasoning: Limits and Responses From Recent Papers
Recent papers are converging on a similar concern: LLM-based agents can appear capable in end-to-end tasks, yet still fail in the parts that matter most for trust. “AI scientists produce results without reasoning scientifically” examines whether autonomous research agents follow the epistemic norms of scientific inquiry, rather than merely producing outputs. “ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System” argues that alignment pipelines built on RLHF can fail not only at the policy level but at the combined policy-reward-system level. “Personalized Benchmarking: Evaluating LLMs by Individual Preferences” questions whether aggregate evaluation hides meaningful differences in what users actually want, and “Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression” focuses on how to update model knowledge over time without destabilizing prior edits.