Casey on AWS DynamoDB Outage: When RCAs Explain Nothing

Casey uses the October 2025 DynamoDB US-East-1 outage as an example of companies publishing shallow "root cause analyses" that explain how symptoms appeared but never show the actual programming bug.

tldr

•AWS blamed a "race condition" - but that's just the trigger, not the bug
•the real bug: enactor code crashes hard when it can't read an old/deleted plan for the rollback record
•that crash should never happen - graceful handling of missing data is basic engineering
•AWS never showed the actual code (exception? null deref? assertion failure?) - so the RCA teaches nothing
•good RCAs (CrowdStrike, Google, Cloudflare) show the actual buggy line - AWS gave none of that depth

what aws said happened

•DynamoDB endpoints resolved via Route 53 DNS using a constantly-updated "load-balancing tree"
•single planner generates new tree "plans" (hashed names), three enactors write them to Route 53
•enactors use distributed lock (via Route 53 atomic ops) so only one updates at a time
•one unlucky enactor gets stuck in long backoff retries and falls far behind (e.g. still on plan #110 while the others are on #145+)
•finally gets lock, writes endpoint to point to ancient plan
•another enactor sees #110 is stale, garbage-collects it
•endpoint now points to non-existent DNS name - "no records found"
•when enactors try to fix it, they also update a "rollback" record pointing to the previous plan
•previous plan no longer exists - enactor crashes permanently trying to update rollback record
•all three enactors die the same way - no automatic updates - outage until humans intervene
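The sequence above can be replayed in a few lines. This is only a toy model under stated assumptions: the `PlanStore` class, the `apply_plan` and `garbage_collect` helpers, the `keep=10` retention window, and the addresses are all hypothetical, and nothing here is AWS code or a Route 53 API. The plan numbers reuse the #110 / #145 example from the notes.

```python
# Toy model of the sequence described above. Every name here is hypothetical;
# nothing below is AWS code or a Route 53 API.

class PlanStore:
    """In-memory stand-in for the DNS records that hold plans."""

    def __init__(self):
        self.plans = {}        # plan number -> list of addresses
        self.endpoint = None   # plan number the public endpoint points at

    def resolve(self):
        """Return the addresses behind the endpoint, or [] if the plan is gone."""
        return self.plans.get(self.endpoint, [])


def apply_plan(store, plan_no):
    """An enactor points the endpoint at a plan (no staleness re-check)."""
    store.endpoint = plan_no


def garbage_collect(store, newest, keep=10):
    """An enactor deletes every plan more than `keep` versions behind `newest`."""
    for plan_no in [n for n in store.plans if n < newest - keep]:
        del store.plans[plan_no]


store = PlanStore()
for n in range(100, 146):
    store.plans[n] = [f"10.0.0.{n}"]

apply_plan(store, 145)               # healthy enactor applies the newest plan
healthy = store.resolve()

apply_plan(store, 110)               # delayed enactor finally gets the lock
garbage_collect(store, newest=145)   # another enactor cleans up stale plans
dangling = store.resolve()

print(healthy)    # ['10.0.0.145']
print(dangling)   # []
```

The empty result at the end is the "no records found" state: the endpoint still names plan 110, but that plan has been deleted.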

casey's core criticism

•everyone calls it a "race condition" and stops thinking
•race condition is just the trigger that made an old plan get written then deleted
•real bug: whatever code in the enactor crashes hard when it can't fetch the old plan for the rollback record
•that crash should never happen:

- rollback field should allow empty/null
- stale/missing plan should be gracefully handled (system starts with no prior plan anyway)
- operator typo or manual deletion would trigger the exact same total failure - no race needed

•AWS never explains what the crashing code looked like
•therefore the RCA teaches nothing useful about avoidable bad programming practices
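Since AWS did not publish the crashing code, the contrast Casey is drawing can only be illustrated with a guess. The sketch below assumes the failure was an unhandled lookup error (a `KeyError` here); the function names and the shape of the rollback record are hypothetical. It shows a fragile update that assumes the previous plan always exists next to a tolerant one that treats a missing plan as "no rollback target".

```python
# Hypothetical sketch: AWS did not publish the enactor code, so the failure
# mode (here a KeyError) and all names are assumptions for illustration.

def update_rollback_fragile(plans, previous_plan_no):
    """Assumes the previous plan always exists; raises KeyError if it does not."""
    return {"rollback": plans[previous_plan_no]}


def update_rollback_tolerant(plans, previous_plan_no):
    """Treats a missing previous plan as 'no rollback target', like a fresh system."""
    previous = plans.get(previous_plan_no)
    if previous is None:
        return {"rollback": None}
    return {"rollback": previous}


plans = {145: ["10.0.0.145"]}   # plan 110 has already been garbage-collected

try:
    update_rollback_fragile(plans, 110)
    fragile_outcome = "ok"
except KeyError:
    fragile_outcome = "crashed"

tolerant_outcome = update_rollback_tolerant(plans, 110)

print(fragile_outcome)    # crashed
print(tolerant_outcome)   # {'rollback': None}
```

Note that the fragile version fails the same way whether the plan vanished through a race, an operator typo, or a manual deletion, which is the point of the criticism.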

comparison to better rcas

•CrowdStrike: out-of-bounds memory read when a rule referenced more input fields than the code supplied - clear "don't do that"
•Google: JSON null-pointer deref - clear lesson
•Cloudflare: famous one-line bug - they literally showed the line
•AWS gave none of that depth - leaves engineers wondering whether AWS really understood its own bug

bottom line

•pretending you understand something (or accepting "it was a race condition") is common early in careers
•real growth comes from refusing to stop until you actually understand the root programming mistake
•AWS's vague RCA is a textbook example of stopping at the surface instead of showing the real bug