Can LLMs Reuse Tools Creatively? What CreativityBench Tries to Measure
CreativityBench is an arXiv paper that introduces a benchmark for evaluating creative reasoning in large language model agents. The paper frames creative problem-solving in a specific way: not as open-ended originality in general, but as the ability to repurpose available tools or objects by reasoning about their affordances and attributes rather than their usual, canonical use. [S1]
What is CreativityBench?
The paper is titled "CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing" and was released on arXiv. In the authors' framing, the benchmark is a first step toward evaluating whether an LLM-based agent can solve problems creatively by using tools in non-standard ways. Rather than asking only whether a model reaches the right answer, the benchmark is designed to examine a narrower question: can the model look at an available object, infer what properties it has, and use it for a purpose beyond its default role? [S1]
Sources: [S1]
Reusing a tool based on its properties, not its usual purpose
The core idea is approachable even for non-specialists. People sometimes solve problems by looking at what an object can do, not just what it was made for. The paper studies this same pattern in LLM agents. Its focus is "creative tool use," where a model repurposes available objects by reasoning about their affordances and attributes instead of relying on canonical usage. In plain terms, the benchmark is interested in whether a model can think, "This object is normally used for X, but because it has shape, weight, rigidity, or some other useful property, it might also help with Y." That is the paper's operational view of creative problem-solving. [S1]
Sources: [S1]
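The sources do not spell out the benchmark's actual task format, so the following is a minimal sketch, assuming a task pairs a goal with objects, their observable attributes, and their canonical uses. Every field name and value here is hypothetical, chosen only to make the "X is for reading, but it is rigid, so it can wedge a window" pattern concrete:

```python
from dataclasses import dataclass

@dataclass
class ToolRepurposingTask:
    """Hypothetical benchmark item (not the paper's schema): a goal that no
    object's canonical use solves, plus attributes an agent can reason over."""
    goal: str                         # what the agent must accomplish
    objects: dict[str, set[str]]      # object name -> observable attributes
    canonical_uses: dict[str, str]    # object name -> usual purpose
    acceptable: set[str]              # objects whose attributes satisfy the goal

task = ToolRepurposingTask(
    goal="prop a window open",
    objects={
        "hardcover_book": {"rigid", "wedge_shaped", "heavy"},
        "scarf": {"soft", "flexible"},
    },
    canonical_uses={"hardcover_book": "reading", "scarf": "warmth"},
    acceptable={"hardcover_book"},  # rigidity and shape solve it, not "reading"
)
```

Under this framing, the interesting question is not whether the model knows what a book is for, but whether it can route around the canonical use and select by attributes.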
How is this different from existing evaluations?
According to the paper abstract, recent LLM progress has produced strong results on reasoning and environment-interaction tasks, but creative problem-solving remains less explored. CreativityBench differs by targeting a more specific capability: affordance-based tool repurposing. This matters because many benchmarks emphasize final correctness or broad task completion, while giving less visibility into whether a model can discover unconventional but plausible uses for available objects. A related point appears in work on travel planning, which argues that end-to-end evaluation often lacks interpretability and makes it hard to identify why models fail. By comparison, CreativityBench narrows the lens to one interpretable sub-capability of agency: creative use of tools under constraints. That does not make it a complete measure of intelligence, but it does make the target ability more explicit. [S1][S7]
Sources: [S1], [S7]
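The abstract does not specify how responses are scored, so as a toy illustration of what an "interpretable sub-capability" check could mean, the sketch below credits an answer only if it names a workable object and justifies it by a property rather than by the canonical use. The function, its inputs, and the scoring rule are my own assumptions, not the paper's method:

```python
def score_repurposing(proposal: dict, canonical_uses: dict[str, str],
                      acceptable: set[str]) -> bool:
    """Toy scorer (illustrative only): accept a proposal when it picks an
    object that can plausibly serve the goal AND justifies it by a property,
    not by restating the object's usual purpose."""
    obj = proposal.get("object", "")
    if obj not in acceptable:
        return False  # the chosen object cannot plausibly do the job
    canonical = canonical_uses.get(obj, "").lower()
    justification = proposal.get("justification", "").lower()
    # Reject justifications that merely repeat the canonical use.
    return not canonical or canonical not in justification

# A rigid hardcover can prop a window; "reading" would be its canonical use.
ok = score_repurposing(
    {"object": "hardcover_book", "justification": "it is rigid and wedge-shaped"},
    canonical_uses={"hardcover_book": "reading"},
    acceptable={"hardcover_book"},
)
assert ok
```

A check like this is coarse, but it shows why a narrow target is more interpretable than an end-to-end score: a failure localizes to either object selection or property-based justification.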
Where could this kind of benchmark matter?
This benchmark is most relevant in settings where LLMs act as agents and need to work with available tools, objects, or interfaces. Source material on agent systems notes that tool-using LLM agents are increasingly deployed across web, app, operating-system, and transactional environments. In those broader agent settings, a model may face situations where the obvious tool is unavailable, incomplete, or ambiguous, and success depends on recognizing alternative uses or indirect paths. The travel-planning paper also highlights tool use as one of the atomic sub-capabilities needed for complex tasks, suggesting that tool-related reasoning is already a practical concern in agent evaluation. My interpretation is that CreativityBench could be useful as a diagnostic benchmark for constrained environments, embodied tasks, or agent workflows where flexible use of available resources matters. That reading is deliberately narrow: a diagnostic signal, not a claim of broad real-world readiness. [S1][S6][S7]
Sources: [S1], [S6], [S7]
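To make the "obvious tool is unavailable" case concrete, here is a minimal hypothetical fallback step an agent loop might include, assuming tools are tagged with attributes. The function name, attribute tags, and subset-matching rule are illustrative assumptions, not anything taken from the sources:

```python
def find_substitute(required: set[str], inventory: dict[str, set[str]]) -> str | None:
    """Hypothetical fallback for an agent loop: when the canonical tool is
    missing, return any available object whose attributes cover the need."""
    for name, attrs in inventory.items():
        if required <= attrs:  # all required affordances are present
            return name
    return None

# No hammer on hand, but anything hard and heavy might drive a tent stake.
inventory = {"wrench": {"hard", "heavy", "grippable"}, "sponge": {"soft", "absorbent"}}
assert find_substitute({"hard", "heavy"}, inventory) == "wrench"
```

Real agents would need richer matching than set inclusion, but the shape of the problem is the same: selection by affordance rather than by name.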
What this benchmark does not settle yet
The paper presents CreativityBench as a first step, which already signals an important limitation: it studies one slice of creativity rather than the full concept. Measuring creative problem-solving through tool repurposing is useful, but it does not automatically capture every form of creativity an agent might need in open-ended real environments. More broadly, other recent benchmark work warns that evaluations can overstate model ability when scenarios are too explicit or insufficiently ambiguous; for example, safety research notes that existing benchmarks may exaggerate judgment ability in deceptive or out-of-distribution settings. By extension, results on a focused benchmark like CreativityBench should be read as evidence about a defined capability, not as proof of general creative competence. [S1][S6]
Sources: [S1], [S6]
One-line takeaway: CreativityBench is an arXiv benchmark that evaluates LLM creative reasoning through a specific lens: whether an agent can repurpose tools based on their affordances rather than their standard use. [S1]
Short summary: CreativityBench is an arXiv benchmark for testing whether LLM agents can solve problems by reusing tools in non-standard ways. It focuses on affordance-based tool repurposing, offering a more specific view of creative reasoning than many general evaluations. [S1]
Sources and references:
- [S1] CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing (cs.AI updates on arXiv.org). URL: https://arxiv.org/abs/2605.02910
- [S6] Enhancing Agent Safety Judgment: Controlled Benchmark Rewriting and Analogical Reasoning for Deceptive Out-of-Distribution Scenarios (cs.AI updates on arXiv.org). URL: https://arxiv.org/abs/2605.03242
- [S7] Revisiting the Travel Planning Capabilities of Large Language Models (cs.AI updates on arXiv.org). URL: https://arxiv.org/abs/2605.03308
Internal link ideas:
- How LLM agent benchmarks measure tool use and planning
- What "affordance" means in AI and robotics
- Why end-to-end benchmark scores can hide failure modes
#CreativityBench #LLM #benchmark #agent #CreativeReasoning #ToolUse
Note: AI-assisted content
This post was drafted with AI (gpt-5.4) using source-grounded inputs.
Please review the citations and original links above.