<aside> ℹ️ Executive summary A PostgreSQL connection spike (00:27 UTC) caused by Django moved a node to an unresponsive state (00:55 UTC). When the affected node was recycled, its volumes were placed into a state where they could not be mounted.

</aside>

⚠️ Leadup

List the sequence of events that led to the incident

00:27 UTC: Django starts rapidly using connections to our PostgreSQL database
00:32 UTC: DevOps team is alerted that PostgreSQL has saturated its max connections limit of 115. Joe is paged.
00:33 UTC: DevOps team is alerted that a service has claimed 34 dangerous table locks (it peaked at 61).
00:42 UTC: Status incident created and backdated to 00:25 UTC.

00:55 UTC: It's clear that the node which PostgreSQL was on is no longer healthy after the Django connection surge, so it's recycled and a new one is to be added to the pool.
01:01 UTC: Node lke13311-16405-5fafd1b46dcf begins its restart.
01:13 UTC: Node has restored and regained healthy status, but volumes will not mount to the node. Support ticket opened at Linode for assistance.
06:36 UTC: DevOps team is alerted that Python is offline. The bot depends on Redis, which, as a stateful service, was not healthy.

🥏 Impact

Describe how internal and external users were impacted during the incident

Initially, this manifested as a standard node outage where services on that node experienced some downtime as the node was restored.

Post-restore, all stateful services (e.g. PostgreSQL, Redis, PrestaShop) could not start because their volumes would not mount, so any dependent services (e.g. Site, Bot, Hastebin) also had trouble starting.

PostgreSQL was restored early on, so for the most part moderation could continue.

👁️ Detection

Report when the team detected the incident, and how we could improve detection time

DevOps were initially alerted at 00:32 UTC due to the PostgreSQL connection surge, and acknowledged the alert at the same time.
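The alert fired once the 115-connection limit was already saturated, five minutes after the surge began. One possible way to shorten that gap is to warn at a fraction of the limit as well. The sketch below is illustrative only: the function name `check_connections`, the `ConnectionStatus` type, and the 80% warning ratio are assumptions, not part of our current alerting setup.

```python
from dataclasses import dataclass

# The inputs could be gathered with standard PostgreSQL queries, for example:
#   SELECT count(*) FROM pg_stat_activity;
#   SHOW max_connections;


@dataclass
class ConnectionStatus:
    active: int
    max_connections: int
    usage: float  # fraction of the limit in use
    level: str    # "ok", "warning", or "critical"


def check_connections(
    active: int,
    max_connections: int = 115,  # limit quoted in the timeline
    warn_ratio: float = 0.8,     # hypothetical early-warning threshold
) -> ConnectionStatus:
    """Classify connection usage so a warning fires before saturation."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    usage = active / max_connections
    if active >= max_connections:
        level = "critical"
    elif usage >= warn_ratio:
        level = "warning"
    else:
        level = "ok"
    return ConnectionStatus(active, max_connections, usage, level)
```

For instance, `check_connections(100).level` evaluates to `"warning"`, while `check_connections(115).level` evaluates to `"critical"`. The right ratio depends on how quickly connections normally fluctuate, so any value would need tuning against real traffic.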