The era of the AI “Copilot”, a helpful assistant that writes a few lines of code while you watch, might already be ending. In a new experiment published this week, the team behind the Cursor code editor revealed they have successfully scaled autonomous agents to handle massive, long-running projects, effectively creating an “AI software factory.”
The headline result? A swarm of AI agents wrote over 1 million lines of code to build a functional web browser from scratch in roughly a week. They also migrated the entire Cursor codebase from Solid.js to React and built a Windows 7 emulator. Here is the breakdown of how they moved from simple autocomplete to industrial-scale autonomous coding.
Until now, AI agents have been “sprinters.” They are excellent at fixing a bug or writing a script (tasks that take minutes), but they fail at “marathons” (tasks that take weeks).
Cursor’s experiment showed that you can’t just ask one super-smart model to “build a browser.” You need a corporate structure.
To solve the scaling problem, Cursor didn’t use a better model; they used a better org chart. They moved away from a flat structure (where every agent tries to do everything) to a strict hierarchy.
The Planners (the managers) are agents, powered by high-reasoning models like GPT-5.2, that act as project leads. They read the codebase, understand the high-level architecture, and break the project down into thousands of tiny, actionable tasks. They do not write code, which keeps their context window clean and focused purely on logic and strategy.
The Workers (the engineers) are stateless agents. They pick up a single task from the Planner’s queue (e.g., “Implement the tab closing logic in tabs.ts”), execute it, run the tests, and submit the work. Because they only care about one file at a time, they don’t get confused by the complexity of the wider project.
The Judges (QA) are a final layer of agents that review the output. If the tests pass and the code matches the spec, the work is merged. If not, it’s sent back into the loop; the sketch below ties the three roles together.
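Cursor hasn’t published its orchestration code, but the division of labor maps onto a simple queue-driven pipeline. Here is a minimal TypeScript sketch of the idea; every name in it (Task, planProject, runWorker, judge) is a hypothetical illustration, not Cursor’s actual implementation:

```typescript
// Minimal sketch of the Planner -> Worker -> Judge hierarchy.
// All types and function names here are hypothetical, not Cursor's code.

interface Task {
  id: number;
  file: string;        // the one file this task is scoped to
  description: string; // e.g. "Implement the tab closing logic in tabs.ts"
  attempts: number;
}

interface WorkResult {
  task: Task;
  patch: string;       // the code change the worker produced
  testsPassed: boolean;
}

// Planner: reads the goal, emits small file-scoped tasks, never writes code.
function planProject(goal: string): Task[] {
  console.log(`planning: ${goal}`); // a real planner would call a high-reasoning model here
  return [
    { id: 1, file: "tabs.ts", description: "Implement the tab closing logic", attempts: 0 },
    { id: 2, file: "history.ts", description: "Implement the back/forward stack", attempts: 0 },
  ];
}

// Worker: stateless; sees exactly one task, produces a patch, runs the tests.
async function runWorker(task: Task): Promise<WorkResult> {
  const patch = `// patch for ${task.file}: ${task.description}`;
  const testsPassed = Math.random() > 0.3; // stand-in for a real test run
  return { task, patch, testsPassed };
}

// Judge: merges passing work; failed work goes back into the queue,
// and stubborn failures get escalated rather than retried forever.
function judge(result: WorkResult, queue: Task[]): void {
  if (result.testsPassed) {
    console.log(`merged: ${result.task.file}`);
  } else if (result.task.attempts < 3) {
    queue.push(result.task); // retry with a fresh, stateless worker
  } else {
    console.log(`escalated to Planner: ${result.task.file}`);
  }
}

async function main(): Promise<void> {
  const queue = planProject("build a web browser");
  while (queue.length > 0) {
    const task = queue.shift()!;
    task.attempts++;
    judge(await runWorker(task), queue);
  }
}

main();
```

Keeping the Workers stateless is the key design choice: a failed task can simply be handed to a fresh worker, because no agent carries project-wide context between tasks.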
The experiment also highlighted a crucial divide in model capability. While earlier models like Claude Opus 4.5 were capable coders, they often struggled with the long-term planning required to manage a weeks-long project.
Cursor found that GPT-5.2 was essential for the “Planner” role. It demonstrated the ability to maintain a consistent “train of thought” over days, creating tasks that actually made sense for the workers, without hallucinating non-existent files or introducing circular dependencies.
The most surprising finding from the experiment was that quantity has a quality all its own. By throwing hundreds of parallel agents at the problem, Cursor generated code at a speed human teams couldn’t match. The 1 million lines weren’t perfect initially, but the sheer volume of iteration (write, test, fix, repeat) refined them rapidly.
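The report doesn’t include the scheduler itself, but the scale argument is easy to make concrete: hundreds of workers draining the task queue in parallel, each one looping write, test, fix until its task passes or gets escalated. A hypothetical TypeScript sketch, with the batch size and retry cap invented for illustration:

```typescript
// Hypothetical sketch of draining the task queue with many parallel workers,
// each looping write -> test -> fix until its task passes.

async function attemptTask(taskId: number): Promise<boolean> {
  // Stand-in for: generate a patch, run the test suite, return pass/fail.
  return Math.random() > 0.3;
}

async function writeTestFixLoop(taskId: number, maxRetries = 5): Promise<void> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    if (await attemptTask(taskId)) {
      console.log(`task ${taskId} passed on attempt ${attempt}`);
      return;
    }
    // On failure, a real agent would feed the test output back into the
    // model and regenerate the patch before retrying.
  }
  console.log(`task ${taskId} escalated after ${maxRetries} tries`);
}

async function drainQueue(taskIds: number[], concurrency = 100): Promise<void> {
  // Process the queue in batches of `concurrency` parallel workers.
  for (let i = 0; i < taskIds.length; i += concurrency) {
    const batch = taskIds.slice(i, i + concurrency);
    await Promise.all(batch.map((id) => writeTestFixLoop(id)));
  }
}

drainQueue(Array.from({ length: 1000 }, (_, i) => i));
```

Batching with Promise.all is the simplest way to cap concurrency; a production orchestrator would use a proper worker pool, but the economics are the same: throughput comes from width, not from any single agent being fast.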
Migrating a massive codebase, such as Cursor’s move from Solid.js to React, is usually a nightmare for developers. The agents treated it as a rote task, methodically translating file after file without getting bored or fatigued.
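A framework migration at this scale is essentially a loop over the file tree. The sketch below shows the shape of that loop; translateToReact and the directory path are stand-ins invented for illustration, not a real API:

```typescript
import { readdirSync, readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Placeholder for an agent call that rewrites one Solid.js component as React.
async function translateToReact(source: string): Promise<string> {
  return source; // a real implementation would call a model here
}

// Walk the directory and rewrite each component file in place.
async function migrate(dir: string): Promise<void> {
  for (const name of readdirSync(dir)) {
    if (!name.endsWith(".tsx")) continue;
    const path = join(dir, name);
    const migrated = await translateToReact(readFileSync(path, "utf8"));
    writeFileSync(path, migrated);
    console.log(`migrated: ${name}`);
  }
}

migrate("./src/components");
```

A real pipeline would also run the test suite after every file and requeue failures, exactly like the Judge loop earlier; that per-file checkpoint is what lets a mechanical translation proceed across thousands of files without drift.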
This experiment suggests a future where software engineering looks more like management. Instead of typing syntax, senior engineers might soon spend their days reviewing the “Architecture Plans” generated by a Planner agent, approving the strategy, and then letting a swarm of Worker agents implement the 5,000 necessary changes overnight. The code factory is here, and it runs on GPT-5.2.