<aside> ℹ️ Executive summary A PostgreSQL connection spike (00:27 UTC) caused by Django moved a node into an unresponsive state (00:55 UTC). When the affected node was recycled, volumes were placed into a state where they could not be mounted.

</aside>

⚠ī¸ Leadup

List the sequence of events that led to the incident

Python Discord's Status page Powered by Freshstatus | Live status

đŸĨ Impact

Describe how internal and external users were impacted during the incident

Initially, this manifested as a standard node outage where services on that node experienced some downtime as the node was restored.

Post-restore, all stateful services (e.g. PostgreSQL, Redis, PrestaShop) were unable to start due to the volume issues, so any dependent services (e.g. Site, Bot, Hastebin) also had trouble starting.

PostgreSQL was restored early on, so for the most part moderation could continue.

👁ī¸ Detection

Report when the team detected the incident, and how we could improve detection time

DevOps were initially alerted at 00:32 UTC due to the PostgreSQL connection surge, and acknowledged the alert at the same time.
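
To illustrate the kind of check that can flag a connection surge like this one, here is a minimal sketch in Python. It is not the alerting rule that actually fired during this incident; the `ConnectionSample` structure, the function names, and the 80% threshold are all assumptions made for illustration. The two input values are the sort of numbers that could be read from PostgreSQL's `pg_stat_activity` view and `max_connections` setting.

```python
from dataclasses import dataclass


@dataclass
class ConnectionSample:
    """One reading of PostgreSQL connection usage (hypothetical structure)."""

    active: int           # e.g. from: SELECT count(*) FROM pg_stat_activity
    max_connections: int  # e.g. from: SHOW max_connections


def connection_usage(sample: ConnectionSample) -> float:
    """Return the fraction of the connection limit currently in use."""
    if sample.max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return sample.active / sample.max_connections


def should_alert(sample: ConnectionSample, threshold: float = 0.8) -> bool:
    """Return True once usage reaches the threshold (0.8 is an assumed default)."""
    return connection_usage(sample) >= threshold
```

Lowering the threshold trades earlier warnings for more noise. The right value depends on how quickly connections normally fluctuate, which this sketch does not attempt to model.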