Intercom

Write-up
Increased Error Rates & Latencies

The primary datastore of the Intercom Ruby on Rails application is MySQL. In the USA hosting region, where the vast majority of Intercom workspaces are hosted, we run 13 distinct MySQL clusters in order to scale with our customers' use of Intercom. We are currently migrating from AWS RDS Aurora MySQL to Vitess-based PlanetScale. One of the first databases we moved to PlanetScale powers conversations in Intercom.

At 09:26 UTC we started seeing query latencies slowly increase for this database; 4 minutes later query latencies rose sharply and a large-scale outage began. Core functionality such as replying to conversations in the Inbox and creating new conversations in the Intercom Messenger was severely degraded. Our full incident management process was quickly engaged. There was a brief recovery at 09:39 UTC, but it lasted only 4 minutes. The problems persisted until 10:25 UTC, when there was complete recovery.

The queries that were slow and failing were primarily updates and inserts into the database. The errors being surfaced indicated that the "transaction pool" was exhausted. We engaged PlanetScale at 09:39 UTC and they assisted us with troubleshooting. At 10:09 UTC we attempted to increase the transaction pool size by 25%; this was unsuccessful. A forced failover of the database at 10:25 UTC lines up with full recovery. Before the outage the transaction pool was running at 8% of capacity, so we believe it was the failover, rather than the increase in the transaction pool, that recovered our application.
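For context on what "transaction pool" exhaustion means here: in Vitess, each open write transaction holds a slot in a fixed-size pool on the tablet until it commits or rolls back, so transactions that stay open (for example, while blocked on a row lock or doing slow work) can fill the pool and cause new writes to fail. The Ruby sketch below illustrates that pattern in general terms only; the model and helper names are hypothetical and are not Intercom's actual code.

```ruby
# Hypothetical Rails sketch: why long-lived transactions exhaust a fixed-size
# transaction pool. Each transaction block holds one pool slot for its full duration.

# Anti-pattern: slow work inside the transaction keeps the slot occupied.
Conversation.transaction do
  conversation = Conversation.lock.find(conversation_id) # blocks if another writer holds the row lock
  conversation.update!(last_reply_at: Time.current)
  run_expensive_side_effects(conversation)                # slot stays held while this runs
end

# Safer shape: do the slow work first, keep the transaction (and its slot) short.
summary = build_summary(conversation_id)
Conversation.transaction do
  Conversation.lock.find(conversation_id).update!(last_reply_at: Time.current, summary: summary)
end
```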

While we have not established the trigger for the outage, we have already taken a number of steps to avoid recurrence and to ensure faster recovery if it does recur. We have dampened the scaling of some of our asynchronous workers, which scaled up this morning after the database outage started. We have identified a potential source of deadlock in the system: a lock taken on a particular high-volume table when it is written to. We have prepared a mechanism to immediately disable that lock, and we will continue investigating how to remove or improve the locking used here. We will continue to work with PlanetScale to understand how the database got into this state, how to avoid it in the future, and what else we can learn. Given the nature of today's recovery, we will perform an immediate failover of the database if the problem recurs. In the medium term, we are close to fully sharding this database, which will allow us to take full advantage of the Vitess platform and unlock significantly more scalability.
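The disable switch for that lock could look something like the sketch below, assuming a feature-flag style toggle; the flag name, models, and writer class are hypothetical stand-ins rather than Intercom's actual implementation.

```ruby
# Hypothetical sketch of a kill switch around a pessimistic lock on a
# high-volume table; Feature.enabled? stands in for any runtime flag system.
class ConversationPartWriter
  def create!(conversation, attrs)
    if Feature.enabled?(:conversation_row_lock)
      # Normal path: take a row lock (SELECT ... FOR UPDATE) on the parent row
      # so concurrent writers to the same conversation are serialized.
      conversation.with_lock do
        ConversationPart.create!(attrs.merge(conversation: conversation))
      end
    else
      # Incident path: flipping the flag off skips the lock immediately,
      # accepting the race it normally guards against in exchange for
      # removing a potential source of deadlock and long-held transactions.
      ConversationPart.create!(attrs.merge(conversation: conversation))
    end
  end
end
```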

We will keep this status page updated with additional information as we continue our investigation. As always, we apologize for the outage and the disruption to your business; our top priority is to permanently stabilize our database platform.