<aside> ℹ️ Executive summary: A PostgreSQL connection spike (00:27 UTC) caused by Django moved a node to an unresponsive state (00:55 UTC). When the affected node was recycled, its volumes were placed into a state where they could not be mounted.
</aside>
List the sequence of events that led to the incident
lke13311-16405-5fafd1b46dcf begins its restart
Describe how internal and external users were impacted during the incident
Initially, this manifested as a standard node outage where services on that node experienced some downtime as the node was restored.
Post-restore, all stateful services (e.g. PostgreSQL, Redis, PrestaShop) could not start because of the volume issues, so any dependent services (e.g. Site, Bot, Hastebin) also had trouble starting.
PostgreSQL was restored early on, so moderation could continue for the most part.
Report when the team detected the incident, and how we could improve detection time
DevOps were first alerted at 00:32 UTC by the PostgreSQL connection surge and acknowledged the alert at the same time.