Claude 3.5 Sonnet AI Model: A New Benchmark in Intelligence

Jun 25, 2024

Claude 3.5 Sonnet excels in intelligence and performance, mastering graduate-level reasoning, undergraduate knowledge, and coding tasks. Operating at double the speed of its predecessor, it's ideal for dynamic environments, introducing "Artifacts" for enhanced real-time interaction. Rigorously tested for safety, it offers a cost-effective AI solution for business integration.

Introducing Claude 3.5 Sonnet: The latest model in the Claude 3.5 series, Claude 3.5 Sonnet, is now available. It’s a top-performing AI model that excels in intelligence, speed, and cost-effectiveness.
Accessibility and Pricing: You can access Claude 3.5 Sonnet for free on Claude.ai and its iOS app. Subscribers to Claude Pro and Team plans get higher rate limits. The model costs $3 per million input tokens and $15 per million output tokens.
Performance Benchmarks: Claude 3.5 Sonnet sets new standards in graduate-level reasoning, undergraduate knowledge, and coding proficiency. It runs at twice the speed of Claude 3 Opus and offers a 200K token context window.
Coding and Vision Capabilities: The model showcases advanced coding abilities, solving 64% of the problems in Anthropic’s internal agentic coding evaluation. It also surpasses previous models on vision benchmarks, particularly in tasks requiring visual reasoning.
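
The pricing and context-window figures above lend themselves to a quick back-of-the-envelope calculation. The sketch below is illustrative only: the helper names `estimate_cost_usd` and `fits_context` are hypothetical, the token counts are made up, and it assumes that input and output tokens share the 200K window.

```python
# Figures quoted in the article (USD per million tokens, window in tokens).
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00
CONTEXT_WINDOW_TOKENS = 200_000


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (input_tokens * INPUT_PRICE_PER_MTOK
             + output_tokens * OUTPUT_PRICE_PER_MTOK)
    return total / 1_000_000


def fits_context(input_tokens: int, output_tokens: int) -> bool:
    """Rough check, ASSUMING input and output share the 200K window."""
    return input_tokens + output_tokens <= CONTEXT_WINDOW_TOKENS


if __name__ == "__main__":
    # Illustrative request: 50,000 input tokens and 2,000 output tokens.
    cost = estimate_cost_usd(input_tokens=50_000, output_tokens=2_000)
    print(f"Estimated cost: ${cost:.2f}")
    print("Fits in context:", fits_context(50_000, 2_000))
```

Because output tokens cost five times as much as input tokens, long generated responses tend to dominate the bill even when prompts are large.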

Current
Commercial License
Pretrained, Instruction-tuned

Comparison 

Sourced on: June 25, 2024

Claude 3.5 Sonnet performs competitively across benchmarks that test different aspects of reasoning, knowledge, and problem-solving. In “Graduate level reasoning” on the GPQA (Diamond) benchmark with 0-shot chain-of-thought (CoT), it scores 59.4%, ahead of both Claude 3 Opus (50.4%) and GPT-4o (53.6%). In “Undergraduate level knowledge” tested by MMLU, it scores 88.7% in 5-shot and 88.3% in 0-shot CoT, which suggests robustness in both few-shot and zero-shot settings.

On programming problems assessed by “Code HumanEval”, Claude 3.5 Sonnet scores 92.0% in the 0-shot setting, ahead of Claude 3 Opus (84.9%) and GPT-4o (90.2%). In “Multilingual math” via the MGSM benchmark, it scores 91.6% with 0-shot CoT, narrowly ahead of Claude 3 Opus (90.7%) and GPT-4o (90.5%). On the “Reasoning over text” DROP benchmark, it achieves an F1 score of 87.1 in the 3-shot setting.

The model also leads the “Mixed evaluations” category on BIG-Bench-Hard with 93.1% in 3-shot CoT, ahead of Claude 3 Opus (86.8%) and Gemini 1.5 Pro (89.2%); no GPT-4o score is listed for this benchmark. On mathematical problem solving (MATH benchmark), it scores 71.1% in 0-shot CoT, behind GPT-4o’s 76.6% but ahead of the other models listed. Lastly, in “Grade school math” (GSM8K), it achieves 96.4% in 0-shot CoT, the highest score in the table. Overall, Claude 3.5 Sonnet posts the top listed score on seven of the eight benchmarks.

| Benchmark | Claude 3.5 Sonnet | Claude 3 Opus | GPT-4o | Gemini 1.5 Pro | Llama-400b (early snapshot) |
|---|---|---|---|---|---|
| Graduate level reasoning (GPQA, Diamond) | 59.4%* 0-shot CoT | 50.4% 0-shot CoT | 53.6% 0-shot CoT | - | - |
| Undergraduate level knowledge (MMLU) | 88.7% 5-shot | 86.8% 5-shot | - | 85.9% 5-shot | 86.1% 5-shot |
| Code (HumanEval) | 92.0% 0-shot | 84.9% 0-shot | 90.2% 0-shot | 84.1% 0-shot | 84.1% 0-shot |
| Multilingual math (MGSM) | 91.6% 0-shot CoT | 90.7% 0-shot CoT | 90.5% 0-shot CoT | 87.5% 8-shot | - |
| Reasoning over text (DROP, F1 score) | 87.1 3-shot | 83.1 3-shot | 83.4 3-shot | 74.9 variable shots | 83.5 3-shot (pre-trained model) |
| Mixed evaluations (BIG-Bench-Hard) | 93.1% 3-shot CoT | 86.8% 3-shot CoT | - | 89.2% 3-shot CoT | 85.3% 3-shot CoT (pre-trained model) |
| Math problem-solving (MATH) | 71.1% 0-shot CoT | 60.1% 0-shot CoT | 76.6% 0-shot CoT | 67.7% 4-shot | 57.8% 4-shot CoT |
| Grade school math (GSM8K) | 96.4% 0-shot CoT | 95.0% 0-shot CoT | - | 90.8% 11-shot | 94.1% 8-shot CoT |
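
To make the comparison easier to check, the following sketch computes, for each benchmark, the gap in percentage points between Claude 3.5 Sonnet and the best competing score in the table. The scores are transcribed from the table, with missing entries as `None`; the function name `margin_over_best_rival` is hypothetical. Shot settings differ between models on several rows, so the margins are only a rough guide.

```python
# Scores transcribed from the comparison table; None marks a missing entry.
# Column order: Claude 3.5 Sonnet, Claude 3 Opus, GPT-4o,
#               Gemini 1.5 Pro, Llama-400b (early snapshot)
SCORES = {
    "GPQA (Diamond)": (59.4, 50.4, 53.6, None, None),
    "MMLU": (88.7, 86.8, None, 85.9, 86.1),
    "HumanEval": (92.0, 84.9, 90.2, 84.1, 84.1),
    "MGSM": (91.6, 90.7, 90.5, 87.5, None),
    "DROP (F1)": (87.1, 83.1, 83.4, 74.9, 83.5),
    "BIG-Bench-Hard": (93.1, 86.8, None, 89.2, 85.3),
    "MATH": (71.1, 60.1, 76.6, 67.7, 57.8),
    "GSM8K": (96.4, 95.0, None, 90.8, 94.1),
}


def margin_over_best_rival(row):
    """Return Sonnet's score minus the best reported rival score."""
    sonnet, *rivals = row
    best_rival = max(score for score in rivals if score is not None)
    return round(sonnet - best_rival, 1)


if __name__ == "__main__":
    for benchmark, row in SCORES.items():
        # A positive margin means Claude 3.5 Sonnet has the top listed score.
        print(f"{benchmark}: {margin_over_best_rival(row):+.1f}")
```

A negative margin marks a benchmark where another model has the higher listed score; with the figures above, that happens only for MATH.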

Team 

Team Anthropic is a collective of skilled individuals dedicated to advancing AI technology through their product, Claude. They focus on building an AI assistant that raises team productivity by drawing on shared expertise. Claude is designed to make collaboration easy, acting as a virtual teammate that speeds up routine tasks such as email and document writing and also helps generate ideas, pull insights from data, and produce high-quality work with less effort. With Projects, Claude can draw on team-specific knowledge, so each team member can deliver expert-level results across domains such as engineering, support, marketing, and sales [1].

[1] https://www.anthropic.com/team