A new artificial intelligence (AI) model has reached a noteworthy milestone by achieving human-level results on a test aimed at evaluating “general intelligence.” On December 20, OpenAI’s o3 system scored an impressive 85% on the ARC-AGI benchmark, outstripping the previous AI best score of 55% and aligning closely with the average human score. It also exhibited strong performance on a notably challenging mathematics test.
Creating artificial general intelligence (AGI) has long been the primary ambition of leading AI research labs. At first glance, OpenAI’s advancement signifies a meaningful stride toward this objective. However, skepticism persists, as many in the AI community debate whether we are indeed nearing the reality of AGI or if this is a false dawn.
To grasp the significance of o3’s results, one must understand the purpose of the ARC-AGI test. This benchmark assesses the “sample efficiency” of an AI system—specifically, how many instances of a new scenario it requires to understand and respond appropriately. By contrast, a system like ChatGPT (GPT-4) is not very sample efficient: it was trained on vast amounts of text and formed probabilistic rules about language, which leaves it less adaptable to uncommon tasks.
For AI systems to expand beyond basic repetitive jobs, they must learn from minimal examples and adapt efficiently. The ability to tackle previously unknown problems from limited data is recognized as a fundamental aspect of true intelligence.
The ARC-AGI benchmark tests sample-efficient adaptation through grid-based puzzles that require the AI to infer a pattern by deriving rules from a handful of examples. This is reminiscent of the IQ tests that many may recall from their school days.
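To make the format concrete, here is a minimal, hypothetical sketch of the kind of few-shot task ARC poses (this is illustrative only, not an actual ARC-AGI task or any OpenAI code): given a few input–output grid pairs governed by one hidden rule, a solver must recover the rule and apply it to a fresh grid. The toy rule below, a simple colour remapping, stands in for the far richer transformations real tasks use.

```python
# Toy ARC-style task: every example pair shares one hidden rule.
# Here the hidden rule is a colour (integer) remapping; a solver must
# recover it from the examples alone, then apply it to a test grid.

def infer_colour_map(examples):
    """Derive a consistent cell-by-cell colour mapping from (input, output) grid pairs."""
    mapping = {}
    for grid_in, grid_out in examples:
        for row_in, row_out in zip(grid_in, grid_out):
            for a, b in zip(row_in, row_out):
                # setdefault records the first mapping seen; a later conflict
                # means no single colour-map rule explains all the examples.
                if mapping.setdefault(a, b) != b:
                    raise ValueError("examples are inconsistent with a colour-map rule")
    return mapping

def apply_colour_map(mapping, grid):
    return [[mapping.get(c, c) for c in row] for row in grid]

examples = [
    ([[1, 2], [2, 1]], [[3, 4], [4, 3]]),  # hidden rule: 1 -> 3, 2 -> 4
    ([[2, 2], [1, 1]], [[4, 4], [3, 3]]),
]
rule = infer_colour_map(examples)
print(apply_colour_map(rule, [[1, 1], [2, 2]]))  # -> [[3, 3], [4, 4]]
```

A human solves such puzzles almost instantly from two examples; the benchmark asks whether an AI system can adapt from equally little data.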
OpenAI’s o3 model appears to showcase exceptional adaptability, as it can derive broad rules from just a few instances. While the exact methods employed by OpenAI are not fully disclosed, the results indicate that o3 successfully identifies weak rules that enhance its ability to generalize across new scenarios.
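The idea of “weak” rules can be illustrated with a toy sketch (my own illustration, not anything from OpenAI): among all candidate rules consistent with the examples, prefer the simplest one, since weaker, less specific rules tend to generalize better to unseen inputs. The rule names and complexity scores below are invented for the example.

```python
# Hypothetical illustration of preferring "weaker" (simpler, more general)
# rules: of all candidate rules that fit the examples, choose the one with
# the lowest description complexity, since it tends to generalise best.

candidate_rules = [
    # (name, function, hand-assigned complexity score)
    ("add 2", lambda x: x + 2, 1),
    ("double", lambda x: x * 2, 1),
    ("add 2 if x < 10 else add 12", lambda x: x + 2 if x < 10 else x + 12, 3),
]

examples = [(1, 3), (4, 6), (7, 9)]  # all consistent with "add 2"

def weakest_consistent_rule(rules, examples):
    consistent = [(name, fn, cx) for name, fn, cx in rules
                  if all(fn(x) == y for x, y in examples)]
    # Both "add 2" and the conditional rule fit the examples, but the
    # weaker rule (lowest complexity) is the better bet on new inputs.
    return min(consistent, key=lambda r: r[2])

name, fn, _ = weakest_consistent_rule(candidate_rules, examples)
print(name, fn(20))  # -> add 2 22
```

Note that the overfitted conditional rule also matches every example yet would predict 32 for an input of 20; choosing the weakest consistent rule is what makes generalization possible from so few data points.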
It’s uncertain whether OpenAI deliberately optimized the o3 system for finding weak rules, yet it seems to succeed at ARC-AGI tasks through a process reminiscent of how Google’s AlphaGo searched over possible sequences of moves to beat a world champion at Go. Francois Chollet, the French AI researcher who created the benchmark, posits that o3 explores different “chains of thought” describing steps to solve a problem, then selects the best one according to a heuristic.
This process generates many plausible candidate solutions, from which the heuristic picks the most suitable. And if o3 does work like AlphaGo, that heuristic is likely not a hand-written rule but another AI model, learned during training much as AlphaGo’s evaluation of board positions was.
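The search-then-select process Chollet describes can be sketched in miniature. This is purely speculative pseudocode-made-runnable, assuming a generic best-first search; it is in no way OpenAI’s actual method, and the toy arithmetic problem and a simple distance heuristic stand in for candidate chains of thought and a learned evaluation model.

```python
import heapq

# Speculative sketch of heuristic-guided search over candidate solution
# paths: expand partial solutions, score each with a heuristic, and
# always explore the most promising state next.

def best_first_search(start, expand, heuristic, is_goal, max_steps=1000):
    """Generic best-first search. `expand` yields successor states;
    `heuristic` estimates how promising a state is (lower is better)."""
    frontier = [(heuristic(start), start)]
    seen = {start}
    for _ in range(max_steps):
        if not frontier:
            break
        _, state = heapq.heappop(frontier)  # most promising state so far
        if is_goal(state):
            return state
        for nxt in expand(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (heuristic(nxt), nxt))
    return None

# Toy problem: reach 21 from 1 using the moves +1 and *3.
result = best_first_search(
    start=1,
    expand=lambda n: (n + 1, n * 3),
    heuristic=lambda n: abs(21 - n),  # stands in for a learned value model
    is_goal=lambda n: n == 21,
)
print(result)  # -> 21
```

The design point is that the search machinery is generic; what makes such a system strong is the quality of the heuristic, which in AlphaGo’s case was itself learned.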
The crucial question is whether o3 truly marks a step toward AGI. If its apparent adaptability comes from this search-and-selection process rather than from a fundamentally more capable underlying model, then o3 may not represent a fundamental advance over existing systems: the gains we observe may simply be better generalization squeezed out of an already established framework.
Currently, much about o3 remains undisclosed, and OpenAI’s limited communication means extensive evaluation will be needed to understand its capabilities. If o3 turns out to adapt as well as an average human, the economic impact could be transformative, potentially ushering in an era of self-improving intelligence; that would demand new benchmarks for AGI and serious consideration of how such systems should be governed.
Whether or not o3’s results can be replicated, the achievement is remarkable. Nonetheless, the day-to-day landscape of AI might remain largely unchanged for now.