Context-first approach to code AI

Posted on X/Twitter on 2024-03-16, after David Sacks mentioned Sourcegraph’s context-first approach to code AI in E170 of the All-In Podcast. This is a sketch of our plan to build highly capable and autonomous AI devs.

AI devs need an objective function ƒ(code) (“is this the right code to do xyz?”) to iterate autonomously toward the right code.

The best AI dev is the one with the best ƒ(code), assuming equal access to equally capable, cheap, and fast LLMs. Everything else about an AI dev is undifferentiated.

How might you create an ƒ(code) to programmatically evaluate the code produced by an AI dev? You need tons of context:

  • code
  • code symbols/defs/refs/types/call-graphs/etc.
  • docs
  • tests
  • logs
  • execution
  • tickets
  • UI screenshots and live access
  • observability data
  • usage data and analytics
  • live DB data
  • team chat logs
  • simulated/shadow traffic
  • etc.

One way to think about this: your ƒ(code) must cover at least everything a human would need to check to see whether the code works as intended.

A very simple example of ƒ(code) is “does the AI-generated code typecheck?” Obviously no code AI should ever show you a suggestion that fails typechecking, but all do (even Cody), which shows how early everything is here. Other fairly obvious ƒ(code)s are “do the tests pass?” and “does an LLM browsing agent consider the app to be broken?”
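To make this concrete, here is a toy ƒ(code) combining the two cheapest checks above: "does it parse?" and "do the tests pass?" This is a hypothetical sketch, not Sourcegraph's implementation; a real ƒ(code) would also typecheck, lint, run a browsing agent, consult logs and observability data, and so on. The checks are ordered cheapest-first so an iterating AI dev can discard bad candidates quickly:

```python
import ast

def f_code(source: str, tests: str) -> float:
    """Toy objective function for AI-generated code (hypothetical sketch).

    Returns a score in [0, 1]: 0.0 if the candidate doesn't parse,
    0.5 if it parses but fails the tests, 1.0 if the tests pass.
    """
    # Check 1 (cheapest): does the candidate even parse?
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return 0.0
    # Check 2: do the tests pass when run against the candidate?
    namespace: dict = {}
    try:
        exec(compile(tree, "<candidate>", "exec"), namespace)
        exec(tests, namespace)  # tests are plain assert statements
    except Exception:
        return 0.5  # parses but fails tests
    return 1.0

# Two candidate implementations of the same function:
good = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5"
# f_code(good, tests) → 1.0; f_code(buggy, tests) → 0.5
```

The point of the graded score (rather than pass/fail) is that an iterating AI dev needs a gradient of sorts: "parses but fails tests" is a better starting point for the next attempt than "doesn't parse."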

You can imagine arbitrarily more complex ƒ(code)s, such as deploying the AI-generated diff to 1% of traffic on prod (on some retail website, for example) and seeing if that code change increases profit vs. the status quo.

Human code review is a very slow and flawed ƒ(code). Your CI is a very slow ƒ(code); obviously for an AI dev to iterate 10⁷ times, it can’t run your slow CI pipeline 10⁷ times. Human code review and slow CI will start becoming obsolete/irrelevant in a world with AI devs with better-than-~80%ile-human ƒ(code)s.

In theory, with a perfect instant ƒ(code), you don’t even need an LLM at all, just lots of monkeys typing on keyboards. LLMs do meaningfully narrow the search space, though. I believe today’s SOTA LLMs, used by an AI dev with an ƒ(code) that is feasible to build today (but well beyond what anyone has actually built yet), are sufficiently capable to automate coding way beyond what most people think possible. Can only prove that claim by building it, so we are.

And since the best AI dev is the one with the best ƒ(code), and the best ƒ(code) comes from having the most comprehensive context and using it smartly, that means that context is the most important part of an AI dev. Also, better context makes all the other ways you (human devs) are already using code AI today (autocomplete, chat, test generation, other commands/macros) much better.

Note 1: Transformer model context windows are strictly less powerful than iterative inference with an objective function incorporating external context. But most AI applications today just use n=1 inference output and a single iteration and show the raw output to the user. Soon we will look back on that as stupidly primitive, especially for fields like code AI where LLM outputs can be programmatically evaluated and aren’t just for human consumption.
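The shape of that iterative inference is roughly a best-of-n loop around the objective function. A minimal sketch, where `generate` stands in for an LLM call conditioned on the best candidate so far (all names here are hypothetical, for illustration only):

```python
def iterate(generate, f_code, rounds=10, n=4):
    """Best-of-n iterative inference loop (hypothetical sketch).

    generate(best_so_far) stands in for an LLM call conditioned on the
    best candidate found so far; f_code scores candidates in [0, 1].
    Instead of showing the raw n=1 output to the user, we sample,
    evaluate against the objective function, and iterate.
    """
    best, best_score = None, -1.0
    for _ in range(rounds):
        for _ in range(n):
            candidate = generate(best)
            score = f_code(candidate)
            if score > best_score:
                best, best_score = candidate, score
        if best_score >= 1.0:
            break  # objective satisfied; stop early
    return best, best_score

# Deterministic stand-in for an LLM, for illustration:
attempts = iter([
    "def add(a, b):\n    return a - b",  # first attempt is buggy
    "def add(a, b):\n    return a + b",  # second attempt is correct
])
def fake_llm(_best):
    return next(attempts)

def f(code):
    ns = {}
    exec(code, ns)
    return 1.0 if ns["add"](2, 3) == 5 else 0.0

best, score = iterate(fake_llm, f, rounds=2, n=1)
# score == 1.0; best is the corrected candidate
```

Note that the loop only works to the extent ƒ(code) is fast and cheap, which is exactly why slow CI and human review can't be the inner loop.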

Note 2: Everyone thinks of “context” as a RAG thing meaning “LLM input tokens”. You also need to think about tapping your context corpus in the objective function to evaluate outputs interactively and iterate.

Note 3: Good context requires tapping information that lives in tons of different tools from different vendors. Any AI built by a vendor that only integrates with that vendor’s own tools will be very limited compared to one that can slurp up info from all the tools you use.

Note 4: See also levels of code AI.