> "We had the same outage three times in a year; different teams, same root cause." ^pain

A collective synthesis pain. Post-incident reviews are scoped too narrowly, lack rationale, and are never revisited. The result is Groundhog-day outages, where the same class of problem recurs for months or years. The post-mortem document exists; the *learning* doesn't propagate.

## Discovery questions

- "Walk me through your last three incidents - is there a thread you only see in retrospect?"
- "When was the last time a post-mortem changed how your team actually works, six months later?"

^discovery-questions

## Examples

- CFPB on Wells Fargo: improper sales-practice issues and fake-accounts patterns persisted for years across business units because root causes were not addressed.[^1]
- Google SRECon talks like "The Halting Problem of Incident Response" document orgs seeing recurring outages because postmortems focused on proximate causes rather than systemic fixes.[^2]
- Cloudflare incident reports include cases of repeated BGP or configuration-related outages, prompting stronger change management after Groundhog-day patterns.[^3]
- UK FSA/FCA documented repeat control failures at UBS that enabled multiple rogue-trading incidents over years, despite earlier warnings.[^4]
- NTSB repeatedly cites airlines and rail operators for the same kinds of human-factors and signalling errors across accidents, highlighting shallow learning.[^5]

[^1]: https://www.consumerfinance.gov/enforcement/actions/wells-fargo-bank-n-a-2016-ct-order/
[^2]: https://www.usenix.org/conference/srecon
[^3]: https://blog.cloudflare.com
[^4]: https://www.fca.org.uk
[^5]: https://www.ntsb.gov