danluu/post-mortems: A collection of postmortems

Config Errors

Allegro. E-commerce site went down after a sudden traffic spike caused by a marketing campaign. The outage was caused by a configuration error in cluster resource management which prevented more service instances from starting even though hardware resources were available.

Cloudflare. A bad config (router rule) caused all of their edge routers to crash, taking down all of Cloudflare.

Etsy. Sending multicast traffic without properly configuring switches caused an Etsy global outage.

Facebook. A bad config took down both Facebook and Instagram.

GoCardless. A bad config combined with an uncommon set of failures led to an outage of a database cluster, taking the API and Dashboard offline.

Google. A bad config (autogenerated) removed all Google Compute Engine IP blocks from BGP announcements.

Google. A bad config (autogenerated) took down most Google services.

Google. A bad config caused a quota service to fail, which caused multiple services to fail (including Gmail).

Google. / was checked into the URL blacklist, causing every URL to show a warning.

Google. A bug in configuration roll-out to a load balancer led to increased error rates for 22 minutes.

Google. A configuration change intended to address an uptick in demand for metadata storage overloaded part of the blob lookup system, which caused a cascading failure with user-visible service impact to Gmail, Google Photos, Google Drive, and other GCP services dependent on blob storage.

Google. Two misconfigurations, plus a software bug, caused a massive Google Cloud Network failure on the US East Coast.

Heroku. An automated remote configuration change did not propagate fully. Web dynos could not be started.

Microsoft. A bad config took down Azure storage.

OWASA. The wrong push of a button led to a water treatment plant shutting down due to excessive fluoride levels.

Stack Overflow. A bad firewall config blocked stackexchange/stackoverflow.

Sentry. Incorrect Amazon S3 settings on backups led to a data leak.

TravisCI. A configuration issue (incomplete password rotation) led to "leaking" VMs, leading to elevated build queue times.

TravisCI. A configuration issue (automated age-based Google Compute Engine VM image cleanup job) caused stable base VM images to be deleted.
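
The age-based cleanup failure in the last entry is a common pattern: a job that deletes anything older than a cutoff will eventually delete long-lived resources that are still in use. The summary above does not describe how the actual job worked, so the following is only a minimal, hypothetical sketch of one possible guard (an explicit allowlist of protected images). The function name `select_images_to_delete`, the image records, and the image names are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def select_images_to_delete(images, now, max_age, protected_names):
    """Return names of images older than max_age, skipping protected ones.

    Hypothetical helper: `images` is a list of dicts with "name" and
    "created" keys; `protected_names` is a set of image names that must
    never be deleted, regardless of age.
    """
    doomed = []
    for image in images:
        age = now - image["created"]
        # Age alone is not enough: a stable base image is old by design.
        if age > max_age and image["name"] not in protected_names:
            doomed.append(image["name"])
    return doomed


# Made-up example data, not taken from any real incident.
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
images = [
    {"name": "base-stable", "created": now - timedelta(days=400)},
    {"name": "build-tmp-1", "created": now - timedelta(days=45)},
    {"name": "build-tmp-2", "created": now - timedelta(days=2)},
]

to_delete = select_images_to_delete(
    images, now, timedelta(days=30), protected_names={"base-stable"}
)
print(to_delete)  # ['build-tmp-1']
```

With an empty `protected_names` set, the same call would also select `base-stable`, which is the failure mode the entry describes. An allowlist is only one option; checking whether an image is still referenced before deleting it is another.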