The SWE-bench technical report presents a comparative analysis of language models’ performance in resolving real-world GitHub issues. The results show that Devin, an AI agent for software development built by Cognition, achieved a 13.86% success rate, a notable improvement over other models: Claude 2 and GPT-4 resolved 4.80% and 1.70% of instances, respectively, even with the aid of an oracle retriever. These results underscore Devin’s ability to execute multi-step plans and iterate based on feedback, traits essential for practical, intelligent software development.
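To put the quoted rates in perspective, a quick back-of-the-envelope comparison (the percentages are taken from the report; the helper function and names below are purely illustrative):

```python
# Resolution rates quoted in the report, as percentages of SWE-bench
# instances resolved. Claude 2 and GPT-4 figures are with an oracle retriever.
resolved = {"Devin": 13.86, "Claude 2": 4.80, "GPT-4": 1.70}

def relative_improvement(a: float, b: float) -> float:
    """How many times higher resolution rate `a` is than rate `b`."""
    return a / b

print(f"Devin vs Claude 2: {relative_improvement(resolved['Devin'], resolved['Claude 2']):.1f}x")
print(f"Devin vs GPT-4:    {relative_improvement(resolved['Devin'], resolved['GPT-4']):.1f}x")
```

By this simple ratio, Devin resolves roughly 2.9 times as many instances as Claude 2 and about 8.2 times as many as GPT-4 under the reported conditions.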

While Devin’s performance on SWE-bench is impressive, the benchmark’s design and the nature of its tasks may not provide a level playing field for all of the AI models being evaluated.

Cognition Labs, Scott Wu
Not Applicable
March 25, 2024
Cognition Labs Home Page