OpenAI o1-preview benchmark performance

Jan 1, 2025

OpenAI's o1 model is a reasoning model designed to handle complex, multi-step tasks with improved accuracy. It excels in areas such as coding, mathematics, and scientific reasoning, making it suitable for applications that require deep contextual understanding and agentic workflows.

The o1 model series includes:

  • o1-preview: An early version of the model, offering advanced reasoning capabilities.
  • o1-mini: A smaller, faster, and more cost-effective variant, optimized for coding and STEM-related tasks. It is 80% cheaper than o1-preview, making it a powerful, cost-effective model for applications that require reasoning but not broad world knowledge.

Developers have used o1 to build applications that streamline customer support, optimize supply chain decisions, and forecast complex financial trends. Key features of o1 include function calling, which connects the model to external data and APIs, and structured outputs, which constrain responses to a custom JSON schema.
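The example below is a minimal sketch of the structured outputs feature using the OpenAI Python SDK. The schema, field names, and prompt are illustrative assumptions, and `response_format` support varies across o1 snapshots, so check the documentation for the variant you are using.

```python
# Minimal sketch: requesting a JSON-schema-constrained response from an
# o1-series model with the OpenAI Python SDK. The schema, field names, and
# prompt are illustrative; structured output support varies by o1 snapshot.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1",  # assumes a snapshot that supports structured outputs
    messages=[
        {"role": "user", "content": "Summarize this support ticket and rate its severity."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "ticket_summary",  # hypothetical schema name
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "summary": {"type": "string"},
                    "severity": {"type": "string", "enum": ["low", "medium", "high"]},
                },
                "required": ["summary", "severity"],
                "additionalProperties": False,
            },
        },
    },
)

print(response.choices[0].message.content)  # JSON conforming to the schema
```

Function calling follows the same request shape: instead of a response schema, tool definitions are passed through the `tools` parameter, and the model returns tool calls for your code to execute.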

Access to o1 is available through OpenAI’s API for users on paid usage tiers. In ChatGPT, Plus and Team users can select o1 or o1-mini in the model selector. Usage limits apply, with o1-mini offering 50 messages per day and o1-preview providing 50 messages per week.

  • Current
  • Commercial License
  • Instruction-tuned

Comparison 

Sourced on: January 1, 2025

OpenAI’s o1 model, introduced in September 2024, is designed to enhance reasoning capabilities, particularly in complex tasks such as mathematics, coding, and scientific problem-solving. It employs reinforcement learning techniques to generate internal chains of thought before responding, enabling it to handle intricate multi-step tasks with improved accuracy.

  • AIME (Math Competition): o1 demonstrated top-tier performance, reaching 83.3% accuracy with consensus@64 and far surpassing GPT-4o's 13.4%.
  • Codeforces (Competitive Programming): o1 achieved an Elo rating of 1673, placing in the 89th percentile, compared to GPT-4o's 808 Elo and 11th percentile.
  • GPQA Diamond: o1 exceeded the accuracy of human PhD-level experts on graduate-level questions in biology, chemistry, and physics.

| Benchmark | OpenAI o1-preview | OpenAI o1-mini | GPT-4o |
|---|---|---|---|
| Competition Math (AIME 2024) - Consensus@64 | 83.3 | 56.7 | 13.4 |
| Competition Math (AIME 2024) - Pass@1 | 74.4 | 44.6 | 9.3 |
| Competition Code (Codeforces) - Elo Rating | 1673 | 1258 | 808 |
| Competition Code (Codeforces) - Percentile | 89.0 | 62.0 | 11.0 |
| GPQA Diamond - Consensus@64 | 78.0 | 78.3 | 56.1 |
| GPQA Diamond - Pass@1 | 77.3 | 73.3 | 50.6 |
| Physics - Consensus@64 | 94.2 | 89.5 | 68.6 |
| Physics - Pass@1 | 92.8 | 89.4 | 59.5 |
| MATH Benchmark - Pass@1 | 94.8 | 85.5 | 60.3 |
| MMLU - Pass@1 | 92.3 | 90.8 | 88.0 |
| MMMU (val) - Pass@1 | 78.2 | N/A | 69.1 |
| MathVista (testmini) - Pass@1 | 73.9 | N/A | 63.8 |
| Chemistry - Consensus@64 | 65.6 | 60.2 | 43.0 |
| Chemistry - Pass@1 | 64.7 | 59.9 | 40.2 |
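For readers unfamiliar with the two scoring rules in the table, the sketch below illustrates the difference: pass@1 grades a single sampled answer, while consensus@64 (majority voting over repeated samples) grades the most frequent of 64 sampled answers. The `sample_answer` callable is a hypothetical stand-in for querying the model; it is not part of any OpenAI API.

```python
# Illustrative sketch of pass@1 vs. consensus@64 scoring on one problem.
from collections import Counter
from typing import Callable

def pass_at_1(sample_answer: Callable[[], str], reference: str) -> bool:
    """Grade a single sampled answer against the reference answer."""
    return sample_answer() == reference

def consensus_at_k(sample_answer: Callable[[], str], reference: str, k: int = 64) -> bool:
    """Sample k answers and grade the most frequent (majority-vote) answer."""
    votes = Counter(sample_answer() for _ in range(k))
    majority_answer, _ = votes.most_common(1)[0]
    return majority_answer == reference
```

A benchmark score is then the fraction of problems graded correct under the chosen rule, which is why consensus@64 typically reads higher than pass@1 for the same model.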

Team 

The team that developed the o1 model at OpenAI was a multidisciplinary group of researchers, engineers, and product specialists. The core team included AI researchers specializing in natural language processing (NLP) and reasoning systems, as well as software engineers who optimized the model architecture for performance and efficiency. A dedicated group of data scientists curated high-quality datasets, strengthening the model's capabilities in reasoning, STEM, and coding. Ethical AI specialists and compliance experts also helped align the model with OpenAI's principles for responsible AI use.

The collaboration involved extensive testing and iteration to improve accuracy, reliability, and structured output capabilities. The team used feedback from diverse user groups, including developers and domain experts, to fine-tune the model's performance. This collective effort reflects OpenAI's commitment to advancing AI while addressing real-world challenges with precision and ethical consideration.