Best of 2025: AI Coding: New Research Shows Even The Best Models Struggle With Real-World Software Engineering
devops.com, Monday, December 29th, 2025
As AI increasingly permeates the software development landscape, new research from OpenAI offers sobering insights into the current limitations of even the most advanced AI coding assistants.
The benchmark study 'SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?' presents evidence that, despite rapid advances, today's frontier AI models still fall short when tackling realistic software engineering challenges.
The SWE-Lancer Benchmark: A New Standard for Evaluating AI Coding
Unlike previous coding benchmarks, which primarily test isolated programming tasks, SWE-Lancer evaluates AI models on 1,488 authentic software engineering tasks sourced from Upwork, collectively worth $1 million in real-world pay. This raises the bar significantly: to succeed, a model must demonstrate capabilities that match those of professional freelance engineers.