
    LLMs, Diffs and Moats

    2026-01-20

    tldr
    LLMs are dramatically better at targeted diffs than full rewrites — the best tools figured this out independently. Every accepted or rejected diff is training data. Capture it and you have a moat. Works for any text, not just code.

    It's January 2026. Claude Code has mass adoption, Anthropic is loss-leading their API, and every dev tool is scrambling to add "AI-powered" to their landing page.

    Here's a number that should change how you think about LLM generations: Anthropic reports that Claude Sonnet 4.5 went from a 9% error rate to 0% on their internal code editing benchmark. The secret? "A simple scaffold with two tools — bash and file editing via string replacements."

    Not full file rewrites. Targeted diffs.

    Aider's research found the same thing: switching to diff formats reduced "lazy coding" (those infuriating "...rest of code here..." placeholder comments) by 3X.

    Models are getting dramatically better at targeted edits, fast. Not just for code — for any text.

    The evidence

    I can't show you Anthropic's or OpenAI's training data. Nobody can. But we can observe behavior — these models behave as if they've seen millions of Git commits, bug-fix patches, and code review diffs.

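    Ask for a clean, targeted diff and they don't just comply, they excel. The model emits only the text that changes, so the response is faster and uses fewer tokens, and there's less untouched content for it to get wrong or hallucinate.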

    Look at how they're evaluated. SWE-Bench tests real GitHub issue resolution — Claude Sonnet 4.5 hit 77.2%, Opus 4.5 pushed it to 80.9%. These benchmarks reward minimal, surgical changes. The models that win touch only what's necessary.

    That's exactly how good pull requests work. And with 84% of developers now using AI tools, it makes sense — if you're training foundation models, you'd optimize for code editing. That's where the demand is.

    Convergent evolution

    The best AI coding agents landed on the same pattern independently.

    Claude Code's text editor tool:

    {
      "command": "str_replace",
      "path": "primes.py",
      "old_str": "for num in range(2, limit + 1)",
      "new_str": "for num in range(2, limit + 1):"
    }

    OpenCode's edit tool (80k+ stars, used by 650k+ developers):

    "This tool performs precise edits to files by replacing exact text matches. It's the primary way the LLM modifies code."

    Same pattern. No line numbers, no complex diff syntax. Just: find this exact string, replace it with this.

    That's no accident: exact string replacement is what works.
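
    To show how little machinery this needs, here's a minimal Python sketch of the idea. It is not either tool's actual implementation: the function name apply_str_replace and the surrounding primes.py text are made up for illustration, and refusing ambiguous matches is my assumption about sensible behavior.

```python
def apply_str_replace(text: str, old_str: str, new_str: str) -> str:
    """Replace exactly one occurrence of old_str, refusing missing or ambiguous matches."""
    count = text.count(old_str)
    if count == 0:
        raise ValueError("old_str not found; the edit does not apply")
    if count > 1:
        raise ValueError(f"old_str matches {count} places; include more context")
    return text.replace(old_str, new_str, 1)


# Hypothetical file contents with the missing colon from the example above.
source = "def primes(limit):\n    for num in range(2, limit + 1)\n        yield num\n"

fixed = apply_str_replace(
    source,
    old_str="for num in range(2, limit + 1)",
    new_str="for num in range(2, limit + 1):",
)
```

    The failure modes are the useful part. If the model misremembers the file, the edit fails loudly instead of silently rewriting the wrong thing.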

    Cursor went the other direction in 2024 — full file rewrites with a custom "Apply" model. Not sure what they do now.

    But the difference matters for what comes next. If we're building toward event-sourced AI systems where every change is stored:

    • String replacements give you clean, atomic, intent-preserving patches
    • Full file rewrites bury the edit — you'd have to diff them yourself to extract it
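
    To make the second bullet concrete, here's roughly what "diff them yourself" looks like with Python's standard difflib. A sketch under assumptions: extract_edits is a hypothetical helper, and the before/after strings stand in for two full-file generations.

```python
import difflib


def extract_edits(before: str, after: str) -> list[tuple[str, str]]:
    """Recover (old, new) line-level pairs from two full versions of a file."""
    a = before.splitlines(keepends=True)
    b = after.splitlines(keepends=True)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replaced, inserted, or deleted spans
            edits.append(("".join(a[i1:i2]), "".join(b[j1:j2])))
    return edits


before = "def primes(limit):\n    for num in range(2, limit + 1)\n        yield num\n"
after = "def primes(limit):\n    for num in range(2, limit + 1):\n        yield num\n"

edits = extract_edits(before, after)
```

    You get the changed lines back, but only at the granularity the diff algorithm picks, and with no record of why the change was made. A string replacement hands you both up front.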

    Event sourcing fits naturally

    Targeted diffs are already atomic, intent-preserving patches. They're practically events by default — store them with a timestamp and user intent, and you've got event sourcing without trying.
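
    Here's a minimal sketch of what that could look like in Python. The EditEvent fields and the replay function are my own illustrative names, not any product's schema; the only assumption is that each event is an exact string replacement plus a timestamp and the user's stated intent.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EditEvent:
    path: str
    old_str: str
    new_str: str
    intent: str  # what the user asked for, in their words
    timestamp: datetime


def replay(initial: str, events: list[EditEvent]) -> str:
    """Rebuild the current text by applying every event in order."""
    text = initial
    for event in events:
        if text.count(event.old_str) != 1:
            raise ValueError(f"event no longer applies cleanly: {event.intent!r}")
        text = text.replace(event.old_str, event.new_str, 1)
    return text


initial = "def primes(limit):\n    for num in range(2, limit + 1)\n        yield num\n"
log = [
    EditEvent("primes.py", "range(2, limit + 1)", "range(2, limit + 1):",
              "fix the missing colon", datetime(2026, 1, 20, 0, 0, tzinfo=timezone.utc)),
    EditEvent("primes.py", "yield num", "if is_prime(num):\n            yield num",
              "only yield primes", datetime(2026, 1, 20, 0, 5, tzinfo=timezone.utc)),
]

current = replay(initial, log)          # state after every event
after_first = replay(initial, log[:1])  # state at any earlier point in history
```

    The log is the source of truth and the file is just a projection of it, which is the core of event sourcing.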

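    This isn't a coincidence. Version control works in the same unit. Git stores snapshots under the hood, but everything you actually do with it (review, cherry-pick, revert, blame) is expressed as a diff with a message explaining why. LLMs are naturally producing the same structure. The architecture that makes sense for version control also makes sense for AI-assisted editing.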

    The moat

    I'm sure I'm just jogging your memory so far. Code is where the evidence is clearest, but the insight is broader: any LLM edit on any product is a diff. Legal drafting, contract redlining, copywriting, config management — every change an LLM makes can be captured as a targeted edit.

    Which means we're not just building better tools. We're building training data.

    Every accepted diff is a positive signal. Every rejection is a correction. Multiply across millions of users and you have exactly what fine-tuning needs — structured examples of what humans actually want.
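
    As a rough sketch of the capture step, here's how accept/reject decisions could be turned into labelled records in Python. The DiffFeedback schema and the "positive"/"negative" labels are assumptions for illustration; real fine-tuning formats vary by provider and this isn't modeled on any of them. The sample records are invented and deliberately not code.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DiffFeedback:
    instruction: str  # what the user asked for
    old_str: str
    new_str: str
    accepted: bool    # did the user keep the edit?


def to_jsonl(records: list[DiffFeedback]) -> str:
    """Serialize feedback as one JSON object per line, labelled by outcome."""
    lines = []
    for record in records:
        row = asdict(record)
        row["label"] = "positive" if row.pop("accepted") else "negative"
        lines.append(json.dumps(row, sort_keys=True))
    return "\n".join(lines)


feedback = [
    DiffFeedback("Tighten the termination clause",
                 "either party may terminate at any time",
                 "either party may terminate with 30 days' written notice",
                 accepted=True),
    DiffFeedback("Make the tone friendlier",
                 "Payment is due immediately.",
                 "Hey! Please pay whenever works for you :)",
                 accepted=False),
]

jsonl = to_jsonl(feedback)
```

    Each line pairs an instruction with a specific edit and a human verdict, which is the shape of signal the paragraph above describes.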

    The compounding effect: better diffs → more acceptances → more training data → better models. The products capturing this data now are building a moat that's very hard to replicate, because a latecomer can't retroactively reconstruct the decision history.

    The shift

    Targeted diffs are more accurate, cheaper to run, and naturally create the training data that makes them better. The tools are converging on this. The models already excel at it. And it works for any text, not just code.

    If you're building LLM creation or editing into your product, do it with diffs.

    — Simon


    References