Blogs

Latest blog posts

General · May 26, 2026
MindForge: Building Reproducible SWE-bench Environments at Scale
If you've ever tried to evaluate a coding agent against a SWE-bench-style dataset, you have probably run into the following problem: the task is easy to describe ("apply this patch, run these tests, check the exit code"), but the environment is a mess. Every repo wants a different version of the same language, a different package manager, a different set of system libraries. Even within a single repository, requirements can change depending on when a commit was created. Get one detail wrong and the test suite fails for reasons that have nothing to do with the patch under evaluation.
AI/ML · May 4, 2026
Before You Score the Model, Score the Benchmark: A Skeptical View Into Current Agentic Software Engineering Benchmarks
The software engineering (SE) community has been captivated by the rise of agentic SWEs. As these AI agents evolve, we are naturally shifting our focus from simple bug fixes to the significantly more complex task of full-blown feature implementation. However, evaluating these agents introduces a profound challenge that did not exist in earlier AI domains, such as image classification. In those fields, submitting the ground truth itself trivially guarantees a 100% score. In software engineering benchmarks, reaching that ceiling is neither easy nor straightforward, even when using an Oracle (the "gold" ground-truth patch itself). The sheer friction of environment building, combined with the instability of test execution, creates a significant infrastructural noise problem. If we can't reliably spin up environments and run tests, and the perfect solution can't consistently achieve 100% success, then we can't trust the benchmarks or meaningfully evaluate the agents. As part of an internal project diving into SE agent benchmarks, we decided to look under the hood of current standard datasets to investigate this discrepancy. What we found was a concerning amount of infrastructure noise and fundamental flaws.
AI/ML · Jul 18, 2025
SWE-Effi: Re-Evaluating SWE Agent Solutions for their Efficiency
Existing AI for software engineering leaderboards (e.g., SWE-bench) focus solely on "resolve rate", ignoring the crucial factor of effectiveness in a resource-constrained world. This is a universal problem that also exists beyond software engineering: any AI system should be more than correct; it must also be cost-effective. We introduce SWE-Effi, a set of new metrics to re-evaluate AI systems in terms of holistic effectiveness scores. We define effectiveness as the balance between the resolve rate (the outcome) and the resources consumed (e.g., tokens and time). In this blog post, we specifically focus on the software engineering scenario by re-ranking popular AI systems for issue resolution on a subset of the SWE-bench benchmark using these new, multi-dimensional metrics. Note that by “AI system”, we refer to a single software system in which an AI model (LLM) and a software scaffold (e.g., an agent) work together to solve a given task.
AI/ML · Jul 17, 2025
RepoForge: Training a SOTA Fast-thinking SWE Agent with an End-to-End Data Curation Pipeline Synergizing SFT and RL at Scale
The Problem: Training software engineering (SWE) LLMs is bottlenecked by expensive infrastructure, inefficient evaluation pipelines, scarce training data, and costly quality control. Our Solution: RepoForge, an autonomous, end-to-end pipeline that generates, evaluates, and trains SWE agents at scale. We present:...