Note: Our post-mortems are conducted in the Five Whys format, which is useful for exploring beyond surface-level issues and identifying the deeper root cause.
Time (PT) | Event |
---|---|
12:08 pm | Support team first notices that prod is unresponsive. |
12:10 pm | Incident response protocol is triggered. Engineers attempt to connect to the database but are unable to. |
12:16 pm | Database is restarted, but this does not restore responsiveness. |
12:18 pm | Investigation into health metrics shows that the DB has run out of disk space. |
12:32 pm | Manual resize attempted, but is blocked by optimization task. |
12:36 pm | Engineering connects with AWS support, which confirms the optimization cannot be canceled and is only about 70% complete. |
12:40 pm | Read replica backup and restore is started. |
1:10 pm | Promotion of replica identified as possible solution, blocked by in-progress backup. |
1:20 pm | AWS support contacted again; they confirm the backup cannot be canceled and is only about 40% complete. |
2:46 pm | Replica backup completes. |
2:46 pm | Replica promotion initiated. |
2:47 pm | Engineering confirms production database is responsive and can execute queries. |
2:48 pm | Engineering updates helm charts to use new database. |
2:48 pm | app.hex.tech is available to end users again. |
Production was unavailable to end users for 2 hours and 40 minutes, interrupting critical customer workflows. During the initial mitigation and the subsequent fix, Fivetran sync was disabled, causing internal data to be stale for about a week. Between the initial incident response and subsequent follow-ups, the engineering and support teams lost around 40 hours of productive time.
Why did production go down?
The database filled up and became unresponsive.
Why did the database fill up?
The write-ahead log was growing unboundedly and filled up the disk before anyone on the team noticed the issue.
Why was WAL-based sync enabled for Fivetran?
At the time we configured Fivetran, WAL-based sync was the only option that supported deletes, which we needed for our internal analytics. When connecting Fivetran for the first time, we considered a number of ways it could impact production, but given the disk space available in the database, we did not expect the size of the WAL to be a problem.
Why was the WAL growing?
We replaced our database tunnel instance without changing credentials, and the change wasn't propagated to Fivetran. Because the new instance had a new host fingerprint, the SSH connection began failing silently. With Fivetran no longer consuming the WAL, it began growing quickly.
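In a setup like ours, the WAL retained behind each replication slot is directly queryable, so a periodic check along these lines would surface an unconsumed slot well before it fills the disk. This is a sketch, assuming Postgres 10+ and the psycopg2 driver; the DSN and threshold are placeholders.

```python
# Sketch: flag replication slots (e.g. a Fivetran logical slot) that are
# retaining too much WAL because nothing is consuming them.
# Assumes Postgres 10+ and psycopg2; DSN and threshold are placeholders.
import psycopg2

RETAINED_WAL_ALERT_BYTES = 50 * 1024**3  # alert past ~50 GB of retained WAL

def check_replication_slots(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT slot_name,
                   active,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
            FROM pg_replication_slots
            """
        )
        for slot_name, active, retained_bytes in cur.fetchall():
            if retained_bytes and retained_bytes > RETAINED_WAL_ALERT_BYTES:
                # Hook this up to your paging/alerting system of choice.
                print(f"WARNING: slot {slot_name} (active={active}) "
                      f"is retaining {retained_bytes / 1024**3:.1f} GB of WAL")

if __name__ == "__main__":
    check_replication_slots("postgresql://monitor@db-host:5432/postgres")
```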
Why did the change not propagate to Fivetran?
Because the failure was silent, we weren’t alerted, and we didn’t have a process in place to fix all downstream dependencies when the tunnel updates.
Why didn’t we have a process to update downstream dependencies?
The tunnel isn't restarted very often, and this was the first time we had encountered this error. Additionally, as a fast-moving team, tracking all architectural dependencies (including cross-team ones) would have been a significant process burden.
Why did we not notice that the DB was running out of space?
We had alerting for every database health metric except disk.
Why did we not have alerting for disk?
The vast majority of the time, disk was not the source of our database issues, so we relied on auto-scaling to handle low disk space, which had worked successfully in the past. Additionally, when we did have alerting on disk usage, it was very noisy, and autoscaling corrected the underlying issue most of the time.
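One low-noise way to close this gap is to alert directly on the RDS FreeStorageSpace metric with a sustained-breach requirement. The following is a sketch using boto3; the alarm name, instance identifier, threshold, and SNS topic are placeholders, not our actual configuration.

```python
# Sketch: create a CloudWatch alarm on RDS free storage that fires only after
# sustained low disk, so routine autoscaling events don't make it noisy.
# Instance identifier, threshold, and SNS topic are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="prod-db-low-free-storage",
    Namespace="AWS/RDS",
    MetricName="FreeStorageSpace",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "prod-db"}],
    Statistic="Minimum",
    Period=300,                      # 5-minute datapoints
    EvaluationPeriods=3,             # require 15 minutes below threshold
    Threshold=20 * 1024**3,          # 20 GB free, in bytes
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",    # missing metrics are themselves suspicious
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:db-alerts"],
)
```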
Why did we not notice that the Fivetran sync was failing?
Because the connector wasn’t truly broken (it was failing at the SSH layer), it didn’t trigger any alerting. We did eventually notice that our data was stale using our data freshness tests (only a few minutes before the incident began).
Why did it take so long for our freshness tests to fail?
They were configured to fail after 4 days of stale data. This could certainly be lower for many tables.
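For illustration, a tighter freshness test can be as simple as comparing the newest timestamp in each synced table against a per-table threshold. The sketch below is generic, not our actual configuration: the table names, timestamp columns, and thresholds are hypothetical, and it assumes timestamptz columns and psycopg2.

```python
# Sketch: a per-table freshness test with much tighter thresholds than 4 days.
# Table names, columns, and thresholds are hypothetical; assumes timestamptz
# columns so the driver returns timezone-aware datetimes.
from datetime import datetime, timedelta, timezone

import psycopg2

FRESHNESS_THRESHOLDS = {
    # table -> (timestamp column, max allowed staleness)
    "analytics.events": ("updated_at", timedelta(hours=6)),
    "analytics.accounts": ("synced_at", timedelta(hours=24)),
}

def check_freshness(dsn: str) -> list[str]:
    failures = []
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for table, (column, max_staleness) in FRESHNESS_THRESHOLDS.items():
            cur.execute(f"SELECT max({column}) FROM {table}")
            (latest,) = cur.fetchone()
            if latest is None or datetime.now(timezone.utc) - latest > max_staleness:
                failures.append(f"{table} is stale (latest {column} = {latest})")
    return failures

if __name__ == "__main__":
    for failure in check_freshness("postgresql://monitor@warehouse-host:5432/analytics"):
        print("FRESHNESS FAILURE:", failure)
```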
Why did the database filling up cause production to become unresponsive?
Because Postgres must write to the WAL before making any changes, no changes can be made unless there is free disk space, so adding more space is necessary to run any query.
Why couldn’t we add more space?
The database was in the middle of an optimization job, triggered after it had automatically scaled earlier in the day, and the database cannot be resized (nor the optimization canceled) while that job is in progress.
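The blocking state is visible through the RDS API as the storage-optimization instance status, so a pre-resize check can fail fast instead of attempting a resize that will be rejected. This is a sketch with boto3; the instance identifier and target size are placeholders.

```python
# Sketch: check whether an RDS instance is still in "storage-optimization"
# before attempting a manual storage resize. Identifier and size are placeholders.
import boto3

rds = boto3.client("rds")

def try_resize(instance_id: str, new_allocated_gb: int) -> None:
    status = rds.describe_db_instances(DBInstanceIdentifier=instance_id)[
        "DBInstances"
    ][0]["DBInstanceStatus"]

    if status == "storage-optimization":
        # A resize request will be rejected until optimization finishes,
        # and the optimization job itself cannot be canceled.
        print(f"{instance_id} is still optimizing storage; resize is blocked")
        return

    rds.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        AllocatedStorage=new_allocated_gb,
        ApplyImmediately=True,
    )
    print(f"Resize of {instance_id} to {new_allocated_gb} GB requested")

if __name__ == "__main__":
    try_resize("prod-db", 2000)
```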
Why was the outage so long?
We had taken actions in AWS that were slow and not cancelable before discovering the fast path of promoting the read replica. We also expected the backup to be faster than it was; had we known how long it would take, we would have explored other options.
Why did we not discover the fast path first?
We had a number of runbooks and incident response plans that we had practiced, including restore from backup, but we had not built one around this scenario. Deep into the incident at this point, we jumped to the solutions that we knew about without considering the potential failure modes or exploring alternatives.
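A runbook for this scenario could codify the replica-promotion fast path as a scriptable step. Below is a minimal sketch with boto3; instance identifiers are placeholders, and a real runbook would add confirmation and replication-lag checks first, since promotion is irreversible.

```python
# Sketch: the replica-promotion fast path as a scriptable runbook step.
# Instance identifiers are placeholders; a real runbook would verify
# replication lag and require explicit confirmation before promoting.
import boto3

rds = boto3.client("rds")

def promote_replica(replica_id: str) -> str:
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)

    # Wait until the promoted instance is available as a standalone primary.
    # (A real runbook might poll the status transition more carefully.)
    waiter = rds.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier=replica_id)

    endpoint = rds.describe_db_instances(DBInstanceIdentifier=replica_id)[
        "DBInstances"
    ][0]["Endpoint"]["Address"]
    return endpoint  # point application config (e.g. helm values) at this host

if __name__ == "__main__":
    print(promote_replica("prod-db-replica"))
```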
Why was the backup slower than expected?
We checked the timing of our nightly backups, but failed to account for the fact that those backups are incremental, while we would be backing up the replica for the first time.