The Rise of Coding Agents

AI coding agents have evolved rapidly over the past year. Large language models (LLMs) from providers such as OpenAI, Anthropic, Google, and DeepSeek have shown remarkable progress in a short time. Alongside the models themselves, coding agents have steadily climbed popular coding benchmarks like SWE-bench, which has seen a succession of state-of-the-art models in this period, with top scores reaching new heights. In fact, the highest-scoring model on SWE-bench Verified changed while this blog was being written!

[Figures: SWE-bench results reported in the Cognition Labs blog (March 2024) and the Anthropic blog (February 2025)]

In the span of one year, we have seen agent performance jump from 14% to 70.3%. Throughout this period, we have seen repeated evidence for the claim that the largest performance gains in coding agents are driven by stronger frontier models.

“We found that scores on SWE-bench verified are largely driven by the quality of the foundation model.” Source: Augment Code Blog

At CurieTech AI, we are building agents for integrations, and we have consistently seen off-the-shelf frontier models struggle on tasks that are trivial for integration developers. We believe that the lack of domain knowledge and specialization is what separates a great general-purpose agent from a great vertical agent. Below, we present evidence for this from our evaluations on integration development tasks in MuleSoft.

Benchmarking

To create a holistic evaluation for our agents, we built a benchmark of the various tasks a MuleSoft developer is expected to do, including building integration flows, writing DataWeave transformations, adding connectors, and refactoring. Our evaluations are intended to mimic the complexities MuleSoft developers face in the real world. We strongly believe there is a huge jump in complexity from tasks requiring standalone code generation to tasks requiring work on existing repositories. Working with repositories demands a deep understanding of the domain and its concepts, of the relationships among components in the existing code, and of coding standards.
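
As a rough illustration of the shape such a benchmark can take, a task might be represented along the lines of the Python sketch below. The schema, field names, and scoring here are illustrative placeholders rather than our actual harness.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkTask:
    """One MuleSoft development task (illustrative schema, not our exact harness)."""
    task_id: str
    category: str       # e.g. "integration-flow", "dataweave", "connector", "refactor"
    repo_path: str      # existing Mule project the agent must modify
    instruction: str    # natural-language request, phrased as a developer would
    checks: list = field(default_factory=list)  # callables that inspect the modified project

def score_task(task: BenchmarkTask, modified_repo: str) -> float:
    """Fraction of a task's checks that the agent's modified repository passes."""
    if not task.checks:
        return 0.0
    passed = sum(1 for check in task.checks if check(modified_repo))
    return passed / len(task.checks)

# Usage sketch (hypothetical paths and IDs):
# task = BenchmarkTask("dw-001", "dataweave", "repos/order-sync", "…", checks=[...])
# score = score_task(task, modified_repo="runs/dw-001/output")
```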

While the gains due to specialization are clear from the numbers above, we get a stronger signal from an ablation study on the OpenHands agent using progressively ‘stronger’ base models. Unlike popular coding and reasoning benchmarks, we do not find the familiar trend of stronger reasoning models delivering better performance. This suggests that frontier models have made consistent gains on the distribution of tasks they are trained on, namely general-purpose coding in popular programming languages, but cannot perform well on specialized tasks without an infusion of domain knowledge.
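
As a minimal sketch of how such an ablation is run, the benchmark and agent scaffold are held fixed while only the base model varies. The run_agent and score_task callables below are hypothetical stand-ins for the agent invocation and the task scoring; they are not the OpenHands API.

```python
from statistics import mean
from typing import Callable, Iterable

def ablate_base_models(
    base_models: Iterable[str],
    tasks: Iterable,
    run_agent: Callable,   # hypothetical: runs the fixed agent scaffold with a given base model
    score_task: Callable,  # hypothetical: scores the agent's modified repository against the task
) -> dict:
    """Hold the agent scaffold and benchmark fixed; vary only the underlying base model."""
    results = {}
    for model in base_models:
        scores = [score_task(task, run_agent(model, task)) for task in tasks]
        results[model] = mean(scores)
    return results

# Usage sketch: progressively 'stronger' general-purpose models on the same MuleSoft task suite.
# ablate_base_models(["model-small", "model-medium", "model-frontier"], tasks, run_agent, score_task)
```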

In our evaluations, we find that general-purpose models perform reasonably only on benchmark tasks, such as refactoring, that require minimal domain knowledge. They struggle with tasks that require understanding existing integrations and the correct usage of MuleSoft components.

For instance, on a task as simple as “send the error log as message using Twilio”, we see general-purpose agents struggle to find the right connector operations.

[Screenshots: output of the Augment Code Agent vs. the CurieTech AI Agent on this task]
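
For illustration, a deliberately coarse verifier for this task could simply look for any Twilio connector element in the generated Mule configuration. The check below assumes the standard Mule 4 convention of configuration files under src/main/mule; it is a sketch, not our production validation.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

def uses_twilio_operation(repo_path: str) -> bool:
    """Return True if any Mule configuration file references a Twilio connector element
    (coarse, illustrative check)."""
    for xml_file in Path(repo_path).glob("src/main/mule/*.xml"):  # assumed Mule 4 layout
        try:
            root = ET.parse(xml_file).getroot()
        except ET.ParseError:
            continue
        for element in root.iter():
            # Namespaced tags look like "{namespace-uri}operation-name"; we only
            # check that some Twilio namespace is referenced at all.
            if "twilio" in element.tag.lower():
                return True
    return False
```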

CurieTech AI Agents combine the superior reasoning of frontier models with deep domain knowledge, drawing on proprietary knowledge bases and tools and harnessing Retrieval-Augmented Generation and rich feedback signals for iterative refinement to solve this last-mile problem. We strongly believe that carefully crafted systems of domain expertise are essential to turn great general-purpose models into great vertical products.
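
Stripped of the proprietary pieces, the overall pattern is roughly the loop sketched below. The retrieve, generate, and get_feedback callables are placeholders for the knowledge bases, frontier-model calls, and domain-aware validation tools described above; this is a simplified sketch, not our actual implementation.

```python
from typing import Callable

def refine_with_domain_knowledge(
    task: str,
    retrieve: Callable[[str], list],             # placeholder: query domain knowledge bases
    generate: Callable[[str, list, list], str],  # placeholder: frontier-model call
    get_feedback: Callable[[str], list],         # placeholder: validators, linters, test runs
    max_iterations: int = 3,
) -> str:
    """Simplified sketch of retrieval-augmented generation with iterative refinement."""
    feedback: list = []
    draft = ""
    for _ in range(max_iterations):
        # Ground the model in domain context: connector docs, DataWeave examples, standards.
        context = retrieve(task)
        # Ask the frontier model for a solution informed by that context and prior feedback.
        draft = generate(task, context, feedback)
        # Collect rich feedback signals from domain-aware tooling.
        feedback = get_feedback(draft)
        if not feedback:  # no remaining issues: accept the draft
            break
    return draft
```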
