Introducing Claude 3.5 Sonnet: The latest model in the Claude 3.5 series, Claude 3.5 Sonnet, is now available. It’s a top-performing AI model that excels in intelligence, speed, and cost-effectiveness.
Accessibility and Pricing: You can access Claude 3.5 Sonnet for free on Claude.ai and its iOS app. Subscribers to Claude Pro and Team plans get higher rate limits. The model costs $3 per million input tokens and $15 per million output tokens.
Performance Benchmarks: Claude 3.5 Sonnet sets new standards in graduate-level reasoning, undergraduate knowledge, and coding proficiency. It’s twice as fast as its predecessor and offers a 200K token context window.
Coding and Vision Capabilities: The model showcases advanced coding abilities, solving 64% of agentic coding problems. It also surpasses previous models in vision benchmarks, particularly in tasks requiring visual reasoning.
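As a quick illustration of the pricing above, per-request cost is a linear function of input and output token counts. The sketch below uses the published $3 / $15 per-million-token rates; the token counts in the example are hypothetical, chosen only to show the arithmetic at the 200K-token context limit:

```python
# Published Claude 3.5 Sonnet rates (USD per token).
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000    # $3 per million input tokens
PRICE_PER_OUTPUT_TOKEN = 15.00 / 1_000_000  # $15 per million output tokens

def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a single API request in US dollars."""
    return (input_tokens * PRICE_PER_INPUT_TOKEN
            + output_tokens * PRICE_PER_OUTPUT_TOKEN)

# Hypothetical request: a full 200K-token context plus a 1K-token reply.
print(f"${request_cost_usd(200_000, 1_000):.3f}")  # prints "$0.615"
```

Because output tokens cost five times as much as input tokens, long prompts with short answers stay comparatively cheap.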
Claude 3.5 Sonnet shows competitive performance across a variety of benchmarks designed to test different aspects of reasoning, knowledge, and problem-solving. In “Graduate level reasoning” on the GPQA (Diamond) benchmark with 0-shot chain-of-thought (CoT), it achieves a score of 59.4%, surpassing both Claude 3 Opus (50.4%) and GPT-4o (53.6%). In “Undergraduate level knowledge” tested by MMLU, Claude 3.5 Sonnet consistently shows strong performance, scoring 88.7% in 5-shot and 88.3% in 0-shot CoT, indicating robustness in both multi-shot and zero-shot settings.
When it comes to programming problems as assessed by “Code HumanEval”, Claude 3.5 Sonnet reaches a high score of 92.0% in 0-shot settings, clearly leading other models such as Claude 3 Opus and GPT-4o. In “Multilingual math” via the MGSM benchmark, it scores 91.6% in a 0-shot CoT context, maintaining high competence in multilingual mathematical problem solving. For the “Reasoning over text” DROP benchmark, Claude 3.5 Sonnet achieves an F1 score of 87.1 in 3-shot settings, showing solid reasoning abilities over text.
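The DROP figure above is an F1 score rather than plain accuracy: predictions are compared to gold answers by token overlap, balancing precision and recall. A minimal sketch of the standard token-level F1 used by reading-comprehension benchmarks (illustrative only, not Anthropic's exact evaluation harness, which also normalizes punctuation and articles):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("the cat sat", "the cat slept"), 4))  # prints 0.6667
```

A benchmark-level score like 87.1 is simply this per-example F1 (scaled to 0–100) averaged over the dataset.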
The model also excels in “Mixed evaluations” on BIG-Bench-Hard with 93.1% in 3-shot CoT, outperforming alternatives such as Gemini 1.5 Pro and Claude 3 Opus. In mathematical problem solving (MATH benchmark), it scores 71.1% in 0-shot CoT, solid mathematical reasoning, though it trails GPT-4o (76.6%) on this benchmark. Lastly, in “Grade school math” (GSM8K), it achieves an impressive 96.4% in 0-shot CoT, highlighting its strong foundational math skills. Overall, Claude 3.5 Sonnet demonstrates a well-rounded and robust performance across diverse cognitive domains.
Benchmark | Claude 3.5 Sonnet | Claude 3 Opus | GPT-4o | Gemini 1.5 Pro | Llama-400b (early snapshot) |
---|---|---|---|---|---|
Graduate level reasoning GPQA (Diamond) | 59.4% 0-shot CoT | 50.4% 0-shot CoT | 53.6% 0-shot CoT | - | - |
Undergraduate level knowledge MMLU | 88.7% 5-shot | 86.8% 5-shot | - | 85.9% 5-shot | 86.1% 5-shot |
Code HumanEval | 92.0% 0-shot | 84.9% 0-shot | 90.2% 0-shot | 84.1% 0-shot | 84.1% 0-shot |
Multilingual math MGSM | 91.6% 0-shot CoT | 90.7% 0-shot CoT | 90.5% 0-shot CoT | 87.5% 8-shot | - |
Reasoning over text DROP, F1 score | 87.1 3-shot | 83.1 3-shot | 83.4 3-shot | 74.9 Variable shots | 83.5 3-shot (pre-trained model) |
Mixed evaluations BIG-Bench-Hard | 93.1% 3-shot CoT | 86.8% 3-shot CoT | - | 89.2% 3-shot CoT | 85.3% 3-shot CoT (pre-trained model) |
Math problem-solving MATH | 71.1% 0-shot CoT | 60.1% 0-shot CoT | 76.6% 0-shot CoT | 67.7% 4-shot | 57.8% 4-shot CoT |
Grade school math GSM8K | 96.4% 0-shot CoT | 95.0% 0-shot CoT | - | 90.8% 11-shot | 94.1% 8-shot CoT |
Anthropic positions Claude as an AI assistant that enhances team productivity by tapping into shared expertise. Claude is designed to facilitate easy collaboration, serving as a virtual teammate that not only accelerates routine tasks like email and document writing but also aids in generating ideas, pulling insights from data, and producing high-quality work with less effort. By integrating Projects, Claude can access specific knowledge, enabling each team member to contribute expert-level results, making work more productive across domains such as engineering, support, marketing, and sales.
https://www.anthropic.com/team