
    Casey on AWS DynamoDB Outage: When RCAs Explain Nothing

    2026-01-16

    Casey uses the October 2025 DynamoDB us-east-1 outage as an example of companies publishing shallow "root cause analyses" that explain how the symptoms appeared but never show the actual programming bug.

    tldr

    • AWS blamed a "race condition" - but that's just the trigger, not the bug
    • the real bug: enactor code crashes hard when it can't read an old/deleted plan for the rollback record
    • that crash should never happen - graceful handling of missing data is basic engineering
    • AWS never showed the actual code (exception? null deref? assertion failure?) - teaches nothing
    • good RCAs (CrowdStrike, Google, Cloudflare) show the actual buggy line - AWS gave none of that depth

    what aws said happened

    • DynamoDB endpoints resolved via Route 53 DNS using a constantly-updated "load-balancing tree"
    • single planner generates new tree "plans" (hashed names), three enactors write them to Route 53
    • enactors use distributed lock (via Route 53 atomic ops) so only one updates at a time
    • unlucky enactor gets stuck in long backoff retries, falls way behind (e.g. plan #110 while others on #145+)
    • finally gets lock, writes endpoint to point to ancient plan
    • another enactor sees #110 is stale, garbage-collects it
    • endpoint now points to non-existent DNS name - "no records found"
    • when enactors try to fix it, they also update a "rollback" record pointing to the previous plan
    • previous plan no longer exists - enactor crashes permanently trying to update rollback record
    • all three enactors die the same way - no automatic updates - outage until humans intervene
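
    A minimal sketch of the sequence above, just to make the ordering concrete. Everything here is hypothetical: `DnsState`, `apply_plan`, `garbage_collect`, the `keep` threshold and the plan contents are made-up stand-ins, not anything AWS published. It models only the stale write followed by cleanup, not the lock or the backoff.

```python
# Hypothetical model of the sequence described in the notes; none of these
# names come from AWS. Plans are keyed by number, and the endpoint stores
# the number of the plan it currently points at.

class DnsState:
    def __init__(self):
        self.plans = {}        # plan number -> records (stand-in for hashed DNS names)
        self.endpoint = None   # plan number the public endpoint points at

    def resolve(self):
        """Return the records behind the endpoint, or None for 'no records found'."""
        return self.plans.get(self.endpoint)


def apply_plan(state, plan_no):
    # Assumed flaw: the enactor does not re-check that its plan is still
    # current once it finally holds the lock.
    state.endpoint = plan_no


def garbage_collect(state, newest, keep=10):
    # Delete plans much older than the newest applied plan (threshold is assumed).
    for plan_no in [p for p in state.plans if p < newest - keep]:
        del state.plans[plan_no]


state = DnsState()
for n in range(100, 146):
    state.plans[n] = [f"10.0.0.{n}"]

apply_plan(state, 145)               # healthy enactor applies the current plan
healthy = state.resolve()            # endpoint resolves normally

apply_plan(state, 110)               # delayed enactor finally gets the lock
garbage_collect(state, newest=145)   # another enactor cleans up stale plans
dangling = state.resolve()           # endpoint now points at a deleted plan
```

    After the last two steps `dangling` is `None`, the model's version of "no records found". Note that nothing in this part crashes; the permanent failure comes from the later rollback update.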

    casey's core criticism

    • everyone calls it a "race condition" and stops thinking
    • race condition is just the trigger that made an old plan get written then deleted
    • real bug: whatever code in the enactor crashes hard when it can't fetch the old plan for the rollback record
    • that crash should never happen:

      - rollback field should allow empty/null
      - stale/missing plan should be handled gracefully (the system starts with no prior plan anyway)
      - an operator typo or manual deletion would trigger the exact same total failure, no race needed

    • AWS never explains what the crashing code looked like
    • therefore the RCA teaches nothing useful about avoidable bad programming practices
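
    Since AWS never showed the crashing code, the following is only a guess at the shape of the problem, written to illustrate Casey's point. `PlanStore`, `update_fragile` and `update_tolerant` are hypothetical names, and using an unhandled `KeyError` as the failure mode is an assumption (it could equally have been a null deref or an assertion).

```python
# Hypothetical illustration only: the real enactor code was never published.

class PlanStore:
    def __init__(self, plans):
        self.plans = dict(plans)

    def fetch(self, plan_no):
        # Raises KeyError when the plan was garbage-collected or never existed.
        return self.plans[plan_no]


def update_fragile(store, records, new_plan, previous_plan):
    # Assumed failure mode: a missing previous plan raises and nothing catches it,
    # so the endpoint is never repaired.
    records["rollback"] = store.fetch(previous_plan)
    records["endpoint"] = store.fetch(new_plan)


def update_tolerant(store, records, new_plan, previous_plan):
    # A missing previous plan is treated as "no rollback target", the same
    # state the system is in when it first starts.
    try:
        records["rollback"] = store.fetch(previous_plan)
    except KeyError:
        records["rollback"] = None
    records["endpoint"] = store.fetch(new_plan)


store = PlanStore({145: ["10.0.0.145"]})   # plan 110 has already been deleted

fragile_records = {}
try:
    update_fragile(store, fragile_records, new_plan=145, previous_plan=110)
    fragile_crashed = False
except KeyError:
    fragile_crashed = True

tolerant_records = {}
update_tolerant(store, tolerant_records, new_plan=145, previous_plan=110)
```

    The fragile version fails before it ever touches the endpoint, which is the whole complaint: a missing rollback target blocks the repair. The tolerant version leaves the rollback empty and still fixes the endpoint, and it behaves the same whether the plan went missing through a race, a typo, or a manual deletion.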

    comparison to better rcas

    • CrowdStrike: buffer overflow from too many rules - clear "don't do that"
    • Google: JSON null-pointer deref - clear lesson
    • Cloudflare: famous one-line bug - they literally showed the line
    • AWS gave none of that depth - leaves engineers wondering whether AWS really understood its own bug

    bottom line

    • pretending you understand something (or accepting "it was a race condition") is common early in careers
    • real growth comes from refusing to stop until you actually understand the root programming mistake
    • AWS's vague RCA is a textbook example of stopping at the surface instead of showing the real bug