Abstract
OpenAI’s o3-preview reasoning model exceeded human accuracy on the ARC-AGI benchmark, but does that mean state-of-the-art models recognize and reason with the abstractions that the task creators intended? In this work, we investigate the abstraction abilities of AI models using the ConceptARC benchmark. We evaluate models under settings that vary the input modality (textual vs. visual), whether the model is permitted to use external Python tools, and, for reasoning models, the amount of reasoning effort. In addition to measuring output accuracy, we perform fine-grained evaluation of the natural-language rules that models generate to explain their solutions. This dual evaluation allows us to assess whether models solve tasks using the abstractions that ConceptARC was designed to elicit, rather than relying on surface-level patterns. Our results show that, while some models using text-based task representations match human output accuracy, the best models’ rules are frequently based on surface-level “shortcuts”, and capture intended abstractions substantially less often than do humans. Thus their capabilities for general abstract reasoning may be overestimated by evaluations based on accuracy alone. In the visual modality, AI models’ output accuracy drops sharply, yet our rule-level analysis reveals that models might be underestimated, as they still exhibit a substantial share of rules that capture intended abstractions, but are often unable to correctly apply these rules. In short, our results show that models still lag humans in abstract reasoning, and that using accuracy alone to evaluate abstract reasoning on ARC-like tasks may overestimate abstract-reasoning capabilities in textual modalities and underestimate them in visual modalities. We believe that our evaluation framework offers a more faithful picture of multimodal models’ abstract reasoning abilities and a more principled way to track progress toward human-like, abstraction-centered intelligence.
1 Introduction
The ability to quickly form abstractions and reason with them via analogy is central to humans’ remarkable capacity to generalize knowledge to novel situations (Carey, 2011, Hofstadter, 2001, Lake et al., 2017). Many benchmarks have been designed to evaluate abstract reasoning abilities in machines (Foundalis, 2025, Hofstadter, 1995, Zhang et al., 2019). Among the most prominent such benchmarks is the Abstraction and Reasoning Corpus (ARC) (Chollet, 2019). ARC consists of a set of idealized problems that require few-shot rule-induction and analogical reasoning. As Figure 1 shows, each puzzle (“task”) consists of a small set of demonstrations (initial and transformed grids) and a test grid, each ranging in size from 1×1 to 30×30, with each cell having one of 10 possible colors. To solve a task, an agent should infer a rule governing the demonstrations and apply that rule to the test input to produce a correct output grid.
Chollet (2025) devised 1,000 such tasks, releasing 400 easier puzzles as a “training set,” 400 harder puzzles as an “evaluation set,” and keeping the remaining harder puzzles to form private test sets. Participants in the 2024 ARC-AGI Prize competition entered programs to vie for monetary prizes, including a $600,000 grand prize for a program that exceeds 85% accuracy (that is, the percentage of correct output grids) on a private test set of 100 tasks. The top-scoring program, which employed a fine-tuned LLM and extensive data augmentation, reached about 54% accuracy (Chollet et al., 2024).
After the competition, Chollet and colleagues, with collaboration from OpenAI, tested a pre-release version of OpenAI’s o3 model on a different “semi-private” test set of 100 tasks. This model achieved 76% accuracy on its low-effort setting and 88% accuracy on its high-effort setting, with computing cost per task estimated at $200 and $20,000 respectively (Chollet et al., 2025). While o3-preview was not qualified to participate in the official competition, its superior performance was described as “a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs” (Chollet, 2024).
Despite the high accuracy of o3 on ARC tasks, it is not clear to what extent AI systems have achieved human-like abstract reasoning abilities. Consider the task illustrated in the top row of Figure 1. A human solving this task is likely to be able to generalize across different instantiations of the underlying abstract concepts—identifying and removing the top and bottom objects—no matter the size, shape, color, position, or number of objects. To our knowledge, no prior studies have assessed whether AI systems such as o3 are solving these tasks by using the intended, generalizable abstractions, or if they are inferring less generalizable rules (“shortcuts”) based on unintended correlations in task demonstrations.
Here we assess the abstractions used by several commercial and open-weight models in solving tasks from ConceptARC (Moskvichev et al., 2023), a benchmark in the ARC domain containing tasks organized around basic spatial and semantic concepts, such as “inside vs. outside,” “above vs. below,” “extend to boundary,” and “same vs. different.” For example, the tasks shown in Figure 1 are from ConceptARC’s “top vs. bottom” and “extract object” concept groups, respectively. As described in Moskvichev et al. (2023), ConceptARC was designed to test robust understanding of these concepts by providing tasks—designed to be simple for humans—that deploy each concept in varying contexts and require varying degrees of generalization. Because it isolates simple abstract concepts, we believe this benchmark to be better suited than the original ARC dataset for investigating the concepts used by humans or machines in solving tasks.

Figure 1: Each row shows a task from the ConceptARC benchmark. Each task shown consists of three demonstrations of a transformation and one test grid. In this study, the solver is tasked with generating a rule that describes the transformations and applying that rule to the test grid.
Previous evaluations using the o3 model (as well as all entries in the 2024 ARC-AGI Prize competition) relied on text-based representations of the demonstration and test grids to solve each ARC task. Each grid is represented as an integer matrix, with entries encoding colors indexed from 0 to 9. However, o3 and related models are reported to possess sophisticated reasoning abilities in both textual and visual modalities (OpenAI, 2025). In our experiments, we investigate the models’ abstract reasoning abilities in both modalities. We also examine how reasoning effort (the token budget allocated for the reasoning stage) and access to external “tools” (here, the ability to generate and execute Python code) affect a model’s ability to discover abstract rules and solve tasks.
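As a concrete illustration of the text-based representation described above, the following minimal Python sketch serializes a grid of color indices into text and parses it back. The one-row-per-line, space-separated layout and the names `grid_to_text` and `text_to_grid` are assumptions made for illustration only; the format actually requested from the models is the one specified in the prompts.

```python
from typing import List

Grid = List[List[int]]  # each entry is a color index in 0..9


def grid_to_text(grid: Grid) -> str:
    """Serialize a grid as one line per row, cells separated by spaces.

    The layout is an assumed example, not necessarily the prompt format
    used in the study.
    """
    for row in grid:
        for cell in row:
            if not 0 <= cell <= 9:
                raise ValueError(f"color index out of range: {cell}")
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def text_to_grid(text: str) -> Grid:
    """Parse the layout produced by grid_to_text back into a grid."""
    return [[int(tok) for tok in line.split()]
            for line in text.strip().splitlines()]


# A hypothetical 3x3 grid: 0, 2, and 3 are arbitrary color indices.
example = [[0, 3, 0],
           [2, 2, 2],
           [0, 3, 0]]
serialized = grid_to_text(example)
```

In this encoding a color is just a digit, so a model reading the text sees numbers rather than colors, whereas the visual modality presents the same grid as an image.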
In the following sections, we describe our experimental setup and results, and discuss how our findings relate to three central questions: (1) How does the accuracy achieved by AI models on ConceptARC tasks compare to that of humans? (2) To what extent do the rules generated by AI models and by humans capture the abstractions intended by the test designers, and to what extent do they rely on unintended, superficial patterns? (3) How do modality (textual vs. visual), reasoning effort (token budget), and Python tool access affect how well models can solve these tasks via the intended abstractions?
4 Discussion
Given the results described above, we can now provide preliminary answers to the questions we listed at the beginning of this paper. (1) How does the accuracy obtained by AI models compare with that of humans? Table 1 shows that for textual inputs, o3, with medium reasoning effort, matches or surpasses human accuracy on ConceptARC tasks, with Claude and Gemini obtaining lower accuracy, and o4-mini surpassing humans only when Python tools are enabled. This aligns with results reported by Chollet et al. (2025) and ARC-Prize (2025). However, using the visual modality, the models’ performance still lags significantly behind human accuracy, even when models are given access to Python tools.
(2) To what extent do the rules generated by AI models capture the abstractions that were intended by ConceptARC’s creators, versus more superficial shortcuts? Figure 2 shows that for textual inputs and medium reasoning effort with Python tools, about 57% of o3’s generated rules (regardless of output accuracy) were correct and intended; that is, they captured the intended abstractions of the tasks. However, about 28% of o3’s generated rules were correct but unintended, meaning they were correct with respect to the given demonstrations, and frequently generated correct output grids, but did not capture the intended abstractions. ConceptARC, like ARC, is built on “core knowledge” priors, including “objectness” (Chollet, 2019), but we found that, for example, o3’s rules often focused on colors and individual pixels rather than objects. Moreover, using integers to encode colors enabled unintended shortcuts such as relying on numerical values (e.g., the value for green, 3, is greater than the value for red, 2), which were not available in visual modalities. Claude’s and Gemini’s shares of correct-unintended rules (14% and 17% respectively) were both lower than o3’s, but more than twice the percentage of correct-unintended rules produced by humans (3%). Thus AI models seem more likely than humans to miss intended abstractions and to solve tasks using more superficial features.
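To make the dual evaluation concrete, the sketch below tallies the share of responses in each combination of rule category and output-grid correctness. It is only an illustration: the function name `rule_shares`, the record layout, and the four toy records are hypothetical and are not data from this study.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# The three rule categories used in the rule-level evaluation.
RULE_LABELS = ("correct-intended", "correct-unintended", "incorrect")

Record = Tuple[str, bool]  # (rule category, output grid correct?)


def rule_shares(records: Iterable[Record]) -> Dict[Record, float]:
    """Return the fraction of responses in each (rule, grid) combination."""
    records = list(records)
    if not records:
        raise ValueError("no records to tally")
    for label, _ in records:
        if label not in RULE_LABELS:
            raise ValueError(f"unknown rule label: {label}")
    counts = Counter(records)
    return {key: n / len(records) for key, n in counts.items()}


# Toy records for illustration only (NOT results from the paper).
toy = [
    ("correct-intended", True),
    ("correct-intended", False),   # right rule, wrong grid
    ("correct-unintended", True),  # shortcut that still yields the right grid
    ("incorrect", False),
]
shares = rule_shares(toy)
```

Reporting the two dimensions jointly is what exposes cases that accuracy alone hides: a correct grid produced by an unintended shortcut, or an intended rule that was applied incorrectly.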
(3) Regarding the effects of textual vs. visual modalities, Table 1 and Figure 2 show that both output-grid and rule correctness drop dramatically in the visual mode. In addition, we observe that in this mode all three models are considerably better at forming correct-intended rules than generating correct output grids. As for the effects of reasoning effort and Python tools, Table 1 and Figure 3 show that the former is more helpful for textual inputs and the latter is more helpful for visual inputs, especially at higher reasoning effort. These results point to possible directions for strengthening visual reasoning models, especially in more abstract domains.
In short, our results show that models still lag humans in abstract reasoning, and that using accuracy alone to evaluate abstract reasoning on ARC-like tasks may overestimate abstract-reasoning capabilities in textual modalities and underestimate them in visual modalities. In evaluating capabilities such as abstract reasoning in AI systems, our results highlight the importance of going beyond simple accuracy, namely assessing both robustness and the extent to which a system uses generalizable mechanisms rather than more superficial shortcuts (Frank, 2023, Ivanova, 2025, Rane et al., 2025). More generally, developing AI models that are better at grasping the abstractions understood by humans will be essential for these systems to generalize in human-like ways, and for these systems to be able to explain their reasoning in ways understandable to humans, both of which are key abilities for successful human-AI interaction.
5 Conclusions
The contributions of this work are threefold. (1) We demonstrated the effects of task representation (textual or visual), reasoning effort, and Python tool use on the ConceptARC benchmark for abstract reasoning, finding that in textual modalities with medium reasoning effort, the best AI models match or surpass humans in output accuracy. (2) We evaluated not only accuracy, but also the rules that AI models generated to describe their solutions, and found that while they were able to capture intended abstractions in about half the cases in textual settings, in other cases their rules relied on more superficial features or patterns that are less generalizable. These results suggest that relying on accuracy alone to evaluate abstract reasoning capabilities, as was done in the ARC-Prize challenge, may overestimate the generality of these capabilities. (3) We showed that state-of-the-art multimodal reasoning models still lack human-like visual reasoning abilities, performing dramatically worse in the visual than in the textual modality. However, these models were substantially better at generating correct rules than they were at applying them, which points to directions for improving visual reasoning in such systems.
Improving the abstraction capabilities of AI models is an essential direction for future research. Recognizing and using human-like abstract concepts is a crucial step for AI systems to become more generalizable and trustworthy in their reasoning, and also to successfully communicate with humans about their reasoning processes.
Appendix I Output Grid Accuracies Reassessed For Incorrect Grid Formats
To compute the accuracies reported in Table 4 and Table 1, we followed the ARC-Prize evaluation method (ARC-Prize, 2024): we counted an output grid as correct only if it perfectly matched the ground-truth output grid and was in the format requested in the prompt (see Appendix A and Appendix B). However, upon exhaustive examination of the output grids generated by different models, we found that, in some cases, models generated these answer grids in formats different from the one requested in the prompt; these answers were assessed as incorrect. The incorrect output grid formats included surrounding grid rows with brackets, using commas or slashes as row separators, and several other variations.
We re-assessed each case of such formatting to see if the intended grid was actually correct. Table 8 gives, for each model and experimental setting, the original output-grid accuracy from Table 1 or Table 4 and the revised output-grid accuracy when incorrect formats are allowed. Table 8 shows that accepting alternate grid formats leads to minor increases in accuracy in most cases, with a few exceptions in which the accuracy rose by more than 5 percentage points: o4-mini low-effort, o4-mini low-effort + tools, and Claude Sonnet 4 medium-effort, which had the largest increase, from 60.2% to 72.5%.
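The difference between the original and the re-assessed scoring can be sketched as a strict parser versus a lenient one. This is a minimal illustration, not the procedure used for Table 8: the requested format is assumed here to be space-separated digits with one row per line, and `parse_strict`, `parse_lenient`, and `is_correct` are hypothetical helper names.

```python
import re
from typing import List, Optional

Grid = List[List[int]]
DIGITS = set("0123456789")


def parse_strict(text: str) -> Optional[Grid]:
    """Accept only the assumed requested format: digits 0-9 separated by
    spaces, one grid row per line. Anything else yields None."""
    rows = []
    for line in text.strip().splitlines():
        tokens = line.split()
        if not tokens or not all(tok in DIGITS for tok in tokens):
            return None
        rows.append([int(tok) for tok in tokens])
    return rows or None


def parse_lenient(text: str) -> Optional[Grid]:
    """Also accept bracketed rows, or rows separated by commas or slashes."""
    text = text.strip()
    bracketed = re.findall(r"\[([^\[\]]*)\]", text)  # innermost [...] groups
    if bracketed:
        rows = bracketed
    elif "\n" in text:
        rows = text.splitlines()
    else:
        rows = re.split(r"[/,]", text)
    grid = [[int(tok) for tok in re.findall(r"[0-9]+", row)] for row in rows]
    grid = [row for row in grid if row]  # drop rows with no cells
    return grid or None


def is_correct(predicted: Optional[Grid], target: Grid) -> bool:
    """Exact match: every cell and the grid dimensions must agree."""
    return predicted is not None and predicted == target


target = [[0, 3, 0], [2, 2, 2]]
bracketed_answer = "[0, 3, 0]\n[2, 2, 2]"
```

Under these assumptions, `bracketed_answer` is rejected by `parse_strict` but recovered by `parse_lenient`, which mirrors the kind of case that raises the re-assessed accuracies.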
Figure 7 gives a plot corresponding to Figure 2 but with the revised accuracies. Comparing this to Figure 2, we do not see any substantial changes in the fractions of correct-intended, correct-unintended, and incorrect rules associated with each bar.
In summary, while models sometimes generate their answer grid in a different format than what we requested, whether we accept these formats as valid answers and assess their correctness does not have a large effect on our overall results.
In a smaller number of cases, all in the visual setting, models would generate a natural-language description of the output grid rather than the grid itself. We did not consider these to be in a valid answer format and counted such outputs as incorrect.
Table 8: Output grid accuracies with alternative grid formats included. For each model and setting, we give original accuracy / re-assessed accuracy. Original accuracies are from Table 4 and Table 1.

Figure 7: Re-assessed rule evaluations. Results of rule evaluations, similar to that shown in Figure 2, but here with re-assessed accuracies.