Root Cause Analysis: MySQL query routing failure took Intercom down
On Monday 23rd June 2025, Intercom went down in the USA region at 08:02 UTC. Web serving capacity recovered at 08:04 UTC, and the service was fully recovered by 08:09 UTC. Intercom recently migrated its use of MySQL to PlanetScale's hosted Vitess. The query routing layer of Vitess experienced a brief failure, which caused all connections to MySQL to hang and their queries to ultimately fail. These hangs exhausted all capacity on our web and worker fleets. After 60 seconds, all web fleet processes were automatically restarted, and replacement processes started coming online. There were no issues reconnecting to MySQL, and by 08:04 UTC all web serving capacity was restored.
Latency-based and queue-backlog autoscaling policies also started bringing in additional capacity. At 08:06 UTC our paging alarms fired and our incident management process kicked off. At 08:07 UTC worker processes timed out their MySQL queries and were able to reconnect to MySQL. These timeouts also surfaced the first errors implicating the database as the source of the problem. By 08:09 UTC all asynchronous job backlogs were processed and Intercom was back to normal. At 08:11 UTC we posted to our status page.
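The failure mode in this timeline, where hung queries hold every process until a timeout or restart frees it, can be illustrated with a minimal sketch. This is not Intercom's code and does not use any MySQL or Vitess client: hung_query, run_with_timeout, and simulate are hypothetical names, the hang is simulated with a sleep, and the pool size and timeout values are illustrative only. It shows that a client-side timeout puts an upper bound on how long a hung call can occupy a worker.

```python
import asyncio


async def hung_query() -> str:
    """Hypothetical stand-in for a query whose connection hangs."""
    await asyncio.sleep(3600)  # simulates a call that never returns in practice
    return "rows"


async def run_with_timeout(query, timeout_s: float) -> str:
    """Bound how long a single query may hold a worker."""
    try:
        return await asyncio.wait_for(query(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The worker is released here and can reconnect or report an error.
        return "timed out"


async def simulate(pool_size: int, timeout_s: float) -> list:
    """Every worker in the pool issues a query while routing is failing."""
    jobs = (run_with_timeout(hung_query, timeout_s) for _ in range(pool_size))
    return await asyncio.gather(*jobs)


# Illustrative values: 4 workers, 50 ms timeout.
results = asyncio.run(simulate(pool_size=4, timeout_s=0.05))
print(results)
```

In this sketch every worker is released after roughly the timeout rather than after the full duration of the hang, and the resulting timeout errors identify the hung call as the cause.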
We are continuing to work with PlanetScale on understanding the failure of the query routing layer, and will make any changes to our application deemed necessary. PlanetScale are continuing to work on improving the query routing system to avoid this failure mode.