Practical, production-useful code generation is a famously thorny challenge - yet if you browse X or LinkedIn, you might get the impression that we’re mere weeks away from end-to-end AI solutions taking the place of entire dev teams. After all, LLMs can already spit out functional, clean-looking code samples. That part is undeniably cool.
And yet… we can’t simply ask an LLM, “Create a PR with a new login endpoint for my enterprise codebase,” and watch it handle everything automatically - at least not in any general, reliable way. Why? Because for serious, real-world development, context is everything. Production environments are full of intricate domain logic, legacy architecture quirks, and business-specific constraints that a model just can’t absorb with a single prompt.
A few AI-driven products have made headlines with big revenue, funding, or launches: Cursor hitting impressive ARR, Bolt.new raising nine-figure rounds, Vercel's V0 launching with plenty of hype. These tools do create meaningful value, but they're still fundamentally constrained by how they handle context. To me, that's the real story: an ongoing search for how best to feed large, complex codebases into AI while preserving nuance. Until that's solved, these tools are better suited to spinning up weekend side projects than to building features at companies with 100+ engineers.
^ Primeagen showcasing the woeful current state of Devin in practice. The Devin dream is interesting, but the real-world value today looks weak.
Three Patterns of AI Code Generation
I’ve been observing three distinct approaches in the market. Each wrestles with the context problem in its own way.
1. “New Project” Generators (Bolt.new, Lovable, V0, etc.)
One category is focused on spinning up new codebases from scratch. You provide a prompt like, “Build me a minimal Rails app with user registration,” and - boom - a service stitches together a few templated modules, dropping in some boilerplate code to get you started.
Strengths:
By only dealing with greenfield scenarios, these tools dodge the complexities of preexisting architecture. No legacy schemas or cryptic internal APIs.
It’s like the AI version of create-react-app: quick, practical, and surprisingly helpful if your aim is a demo or weekend project.
Limitations:
For any medium-to-large organization, the real challenge isn’t spinning up something brand new; it’s integrating new features into an existing labyrinth of code.
Even in modest codebases, they can’t parse years of domain-specific logic or data schemas. That remains a manual task - one that grows exponentially with project size.
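For intuition on why these tools get to skip the hard part, here's a deliberately naive sketch of the "stitch together templates" idea. Everything in it is hypothetical and far cruder than what Bolt, Lovable, or V0 actually do, but it captures why greenfield generation is tractable: there is no existing code to respect.

```python
# Deliberately naive sketch of the "new project" pattern: match the prompt
# against a small library of templates and emit boilerplate. Purely
# illustrative; not how any real product is built.
SNIPPETS = {
    "user registration": (
        "app/controllers/registrations_controller.rb",
        "class RegistrationsController < ApplicationController\n"
        "  # boilerplate only; no knowledge of your domain\n"
        "end\n",
    ),
    "login": (
        "app/controllers/sessions_controller.rb",
        "class SessionsController < ApplicationController\n"
        "end\n",
    ),
}

def scaffold(prompt: str) -> dict[str, str]:
    """Return generated files (path -> contents) for every feature the prompt mentions."""
    files = {}
    for feature, (path, source) in SNIPPETS.items():
        if feature in prompt.lower():
            files[path] = source
    return files

print(list(scaffold("Build me a minimal Rails app with user registration")))
```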
2. IDE-Based Assistants (Cursor, Copilot, Windsurf, etc.)
The second category fits neatly into a developer’s daily workflow. You’ve likely seen tools like GitHub Copilot: they integrate directly into your IDE and generate code suggestions as you type.
Strengths:
They work well in complex codebases because the human developer guides the AI by opening relevant files, highlighting specific contexts, and verifying suggestions.
This has proven to be the most successful short-term strategy for enterprise use. Engineers already live in their IDEs, so it’s an easy add-on with tangible benefits.
Limitations:
Although these tools are genuinely helpful, they still rely on the human to supply context incrementally. It’s not a “fire-and-forget” solution.
You’re not asking the AI to read the entire codebase at once. You open a handful of files, the AI sees them, and that’s it. For big, cross-cutting features, you end up doing a lot of manual bridging.
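To make that limitation concrete: roughly speaking, the assistant only sees what the editor hands it, typically the code around your cursor plus slices of other open files, trimmed to a token budget. A simplified sketch, where the paths, window sizes, and budget are made-up numbers rather than any particular product's behavior:

```python
# Simplified sketch of how an IDE assistant assembles context: the prompt is
# whatever the editor hands it, not the repository. Window sizes and the
# character budget below are hypothetical.
from pathlib import Path

MAX_CONTEXT_CHARS = 12_000  # stand-in for a real token budget

def build_prompt(open_files: list[Path], active_file: Path, cursor_line: int) -> str:
    sections = []
    for path in open_files:
        text = path.read_text(errors="ignore")
        if path == active_file:
            # Keep a window around the cursor so suggestions stay locally relevant.
            lines = text.splitlines()
            window = "\n".join(lines[max(0, cursor_line - 50): cursor_line + 10])
            sections.append(f"# Active file: {path}\n{window}")
        else:
            # Other open tabs contribute only a truncated slice.
            sections.append(f"# Open file: {path}\n{text[:2_000]}")
    # Everything past the budget (and every file you never opened) is invisible.
    return "\n\n".join(sections)[:MAX_CONTEXT_CHARS]
```

Everything outside that string simply doesn't exist for the model, which is why cross-cutting changes still need a human to ferry context between files.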
3. “Headless Agent” Solutions (Devin, Factory, etc.)
The third approach is more ambitious: these tools aim to operate in the background, ingesting your entire repo and waiting for tasks. For instance, you create a GitHub issue or a Linear ticket, the AI sees it, tries to figure out the relevant code, and then automatically opens a PR with a fix or new feature.
Strengths:
This is the ultimate vision of code generation: ask for a high-level outcome (“add a new authentication method”), and the system scours the repo, hunts down relevant sections, and implements changes.
Could save developers from drudgery and context-switching if it worked perfectly.
Limitations:
The context gap is still huge. For large codebases, it’s nearly impossible to incorporate all relevant details without constant back-and-forth guidance.
Right now, these tools reliably handle only small bug fixes or housekeeping tasks. For anything bigger, they tend to produce broken or incomplete solutions.
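A rough sketch of the ticket-to-PR loop these agents attempt shows where the context gap bites. The three helpers below are hypothetical stubs standing in for a code-search index, an LLM call, and a Git-hosting API; this is not how Devin or Factory are actually implemented. The fragile step is the retrieval guess in the middle.

```python
# Rough sketch of the ticket-to-PR loop a headless agent attempts.
from dataclasses import dataclass

@dataclass
class Ticket:
    id: str
    title: str
    body: str

def search_index(query: str, top_k: int) -> list[str]:
    """Stub: return the file paths the index guesses are relevant."""
    return ["app/auth/login.py"][:top_k]

def generate_patch(ticket: Ticket, context_files: list[str]) -> str:
    """Stub: prompt a model with the ticket plus retrieved files, get back a diff."""
    return f"--- placeholder diff for {ticket.id}, touching {context_files} ---"

def open_pull_request(branch: str, title: str, diff: str) -> None:
    """Stub: push a branch and open a PR via the hosting provider's API."""
    print(f"Opened {title} on {branch}\n{diff}")

def handle_ticket(ticket: Ticket) -> None:
    # 1. Guess which slice of the repo matters. This is where big codebases hurt.
    relevant_files = search_index(query=f"{ticket.title}\n{ticket.body}", top_k=10)
    # 2. Generate a patch from only that slice of context.
    patch = generate_patch(ticket, relevant_files)
    # 3. Open a PR and hope the guess in step 1 was right.
    open_pull_request(branch=f"agent/{ticket.id}", title=f"[agent] {ticket.title}", diff=patch)

handle_ticket(Ticket(id="ENG-123", title="Add a new authentication method", body="..."))
```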
Where Do We Go From Here?
A lot of people are betting that models will get orders of magnitude smarter, cheaper, and more context-aware over the next 12 to 24 months. I’m generally in that camp - I do think we’ll see exponential improvements in areas like context window size and inference speed.
In that world, you might imagine entire codebases being shoved into a single context window, letting the AI reason about thousands of files at once. That brute-force approach could eventually erode the need for the IDE-based pattern, because the AI would “just know” everything about your project after one big prompt. Sam Altman has openly hinted that the complicated retrieval-augmented generation (RAG) pipelines we use today might end up as a footnote, superseded by bigger, cheaper models that can ingest everything at once.
But I’m cautious for three reasons:
Learning On the Fly: Human brains train and infer simultaneously. Today’s AI models aren’t good at re-training when they encounter new domain concepts; they infer from a static snapshot of knowledge. Maybe Google’s new Titans architecture will help?
Context Windows Are Finite: Even a “two-million-token” model is dwarfed by the kind of data a developer might hold in their head or the version control history that spans years and millions of lines.
Signal vs. Noise: Dumping an entire codebase into a model’s prompt can drown the signal in a sea of irrelevant details. Performance often degrades when the input is cluttered.
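That last point suggests the practical counter-move: instead of maximizing how much code fits in the prompt, maximize how much of the prompt is relevant. A toy sketch of the idea, with naive keyword overlap standing in for real embedding-based retrieval:

```python
# Toy sketch of "select, don't dump": score each code chunk against the task
# and keep only the top few. Keyword overlap stands in for embedding-based
# retrieval purely to keep the example dependency-free.
def relevance(chunk: str, task: str) -> float:
    task_words = set(task.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(task_words & chunk_words) / (len(task_words) or 1)

def select_context(chunks: list[str], task: str, top_k: int = 5) -> list[str]:
    """Return the few chunks most relevant to the task; the rest is treated as noise."""
    return sorted(chunks, key=lambda c: relevance(c, task), reverse=True)[:top_k]
```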
A Practical Approach for Today
Given these constraints, if I were to build an AI code-generation product right now (targeted at large, real-world codebases), I’d lean on a few design choices:
Background Analysis
Set up a process that regularly scans the codebase, file by file, building a robust index of references and dependencies. It can do this asynchronously, so users aren’t waiting around.
Over time, accumulate a detailed “map” of the code’s structure, pulling in only the most relevant context for each potential fix.
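As a sketch of what that background pass might look like, here's a minimal indexer that walks a repo and records what each file defines and imports. It's Python-only and regex-based to stay short; a production version would use real parsers and cover more languages.

```python
# Sketch of a background indexer: walk the repo, record what each file defines
# and imports, and persist a lightweight map to query later.
import json
import re
from pathlib import Path

DEF_RE = re.compile(r"^\s*(?:def|class)\s+(\w+)", re.MULTILINE)
IMPORT_RE = re.compile(r"^\s*(?:from\s+([\w.]+)\s+import|import\s+([\w.]+))", re.MULTILINE)

def index_repo(root: Path) -> dict[str, dict[str, list[str]]]:
    index = {}
    for path in root.rglob("*.py"):
        text = path.read_text(errors="ignore")
        index[str(path.relative_to(root))] = {
            "defines": DEF_RE.findall(text),
            "imports": [a or b for a, b in IMPORT_RE.findall(text)],
        }
    return index

if __name__ == "__main__":
    # Run this on a schedule (cron, CI job) so the map stays fresh without
    # anyone waiting on it interactively.
    repo_index = index_repo(Path("."))
    Path("code_index.json").write_text(json.dumps(repo_index, indent=2))
```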
Small, High-Confidence Fixes
Focus on specific, recurring issues - like deprecation warnings, unsafe patterns, or obviously inefficient functions.
When the system is sure it has a valid fix or optimization, it opens a small PR. If it’s not sure, it does nothing. This avoids spamming developers with half-baked ideas.
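The gating logic can be almost embarrassingly simple. A minimal sketch, with a hypothetical fix object, a stubbed PR helper, and an arbitrary confidence threshold:

```python
# Sketch of the "only open a PR when you're sure" gate. The fix object, the
# confidence score, and the PR helper are hypothetical stand-ins.
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.9  # arbitrary; tune per codebase

@dataclass
class ProposedFix:
    description: str
    diff: str
    confidence: float  # e.g. model self-assessment combined with tests passing

def open_pull_request(title: str, diff: str) -> None:
    print(f"PR opened: {title}")  # stand-in for a real hosting-provider API call

def maybe_open_pr(fix: ProposedFix) -> bool:
    """Open a small PR only for high-confidence fixes; otherwise do nothing."""
    if fix.confidence < CONFIDENCE_THRESHOLD:
        return False  # silence beats spamming reviewers with half-baked ideas
    open_pull_request(title=f"[bot] {fix.description}", diff=fix.diff)
    return True

maybe_open_pr(ProposedFix("Replace deprecated API call", diff="...", confidence=0.95))
```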
Limit the Scope
Don’t attempt net-new features or major refactors unless the confidence is rock-solid. For those tasks, an IDE-based approach or a human-led design cycle is still safer.
Accept that some big changes are simply out of scope, at least until the model (and context windows) get significantly more advanced.
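One cheap way to enforce that boundary is a hard scope gate on whatever the system proposes. The limits below are arbitrary placeholders:

```python
# Sketch of a hard scope gate: refuse anything that looks like a net-new
# feature or a broad refactor. The limits are arbitrary placeholders.
MAX_FILES_TOUCHED = 3
MAX_LINES_CHANGED = 80

def in_scope(files_touched: list[str], lines_changed: int, creates_new_files: bool) -> bool:
    if creates_new_files:
        return False  # new modules usually mean a net-new feature; leave it to humans
    return len(files_touched) <= MAX_FILES_TOUCHED and lines_changed <= MAX_LINES_CHANGED
```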
Learn from PR Interactions
Treat every acceptance or rejection of an AI-generated PR as a data point. Slowly, the system can refine its sense of what “good” or “acceptable” changes look like in your specific codebase.
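Even a crude version of this feedback loop is useful. A sketch that just tracks per-category acceptance rates (the storage format and category names are hypothetical):

```python
# Sketch of the feedback loop: record whether each bot PR was merged or closed,
# then use per-category acceptance rates to decide what to keep attempting.
import json
from pathlib import Path

FEEDBACK_FILE = Path("pr_feedback.json")

def record_outcome(category: str, merged: bool) -> None:
    data = json.loads(FEEDBACK_FILE.read_text()) if FEEDBACK_FILE.exists() else {}
    stats = data.setdefault(category, {"merged": 0, "closed": 0})
    stats["merged" if merged else "closed"] += 1
    FEEDBACK_FILE.write_text(json.dumps(data, indent=2))

def acceptance_rate(category: str) -> float:
    data = json.loads(FEEDBACK_FILE.read_text()) if FEEDBACK_FILE.exists() else {}
    stats = data.get(category, {"merged": 0, "closed": 0})
    total = stats["merged"] + stats["closed"]
    return stats["merged"] / total if total else 0.0

# e.g. keep proposing deprecation fixes only while the team keeps merging them
record_outcome("deprecation-warning", merged=True)
print(acceptance_rate("deprecation-warning"))
```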
Incremental Upgrade Path
When bigger, better models become available, swap them in under the hood. If eventually the model can handle broader changes, that’s great - but you don’t have to overhaul the user experience.
The key is to keep building user trust along the way. A developer who sees five high-quality fixes is more likely to trust the sixth, even if it’s bigger.
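Concretely, that means keeping the model behind a thin interface so swapping it in is a one-line change rather than a product rewrite. A minimal sketch, with hypothetical provider classes:

```python
# Sketch of keeping the model swappable behind a thin interface, so the
# product and its UX don't change when a bigger model ships.
from typing import Protocol

class CodeModel(Protocol):
    def propose_patch(self, task: str, context: str) -> str: ...

class SmallModel:
    def propose_patch(self, task: str, context: str) -> str:
        return f"# conservative patch for: {task}"

class FutureBigModel:
    def propose_patch(self, task: str, context: str) -> str:
        return f"# broader patch for: {task} (given {len(context)} chars of context)"

def run_fixer(model: CodeModel, task: str, context: str) -> str:
    # Everything upstream (indexing, gating, PR flow) stays the same;
    # only this object changes when a better model becomes available.
    return model.propose_patch(task, context)

print(run_fixer(SmallModel(), "remove deprecated API call", context=""))
```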
Right now, the sweet spot for code generation in large, messy codebases is the IDE-based pattern. It’s the easiest for real teams to adopt, and it has a track record of success because humans remain in the loop. But it’s also likely to be a transitional stage. If context windows and model intelligence progress as fast as people expect, the notion of pointing and steering an AI line by line may fade. We’ll just trust the AI as if it were a well-informed teammate.
Yet I’m not convinced we’ll see “ask for a new feature in Slack, get a perfect PR” across a multi-million-line codebase next week or even next quarter. There’s a gap between today’s demos and a truly robust solution for large-scale production systems. That’s why I believe a more methodical, iterative approach - focusing on small, high-value fixes first - might actually yield the most tangible short-term impact.