Intercom

Write-up
Increased Errors and Latencies with Intercom

Intercom's primary data store is MySQL, and we recently completed most of a full migration from AWS RDS Aurora MySQL to Vitess, hosted by PlanetScale. A backlog of schema changes built up during this migration. While working through that backlog, we ran a change against a large table using MySQL’s “instant” algorithm, which we understood to be low-risk. The change unexpectedly hit an issue securing a metadata lock, which caused the database to stop accepting new queries. This was an unexpected failure mode and caused Intercom to go down immediately.

In addition to our standard incident response, we immediately had our database vendor working with us on restoring service. The Intercom team, in conjunction with our database vendor, was in the process of taking emergency action to mitigate the issue when the operation succeeded on its own, unblocking database queries. The downtime lasted almost exactly 10 minutes, from 14:18:10 to 14:28:10 UTC on 29th May 2025.
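For readers less familiar with this failure mode, the sketch below illustrates one generic way to keep a DDL statement from waiting indefinitely on a metadata lock in stock MySQL: set the session's `lock_wait_timeout` to a small value before running the statement, and pause between a limited number of retries. It is not the procedure Intercom ran, and it is not the cutover mechanism described in this write-up. The function name, the table and column in `EXAMPLE_DDL`, the retry parameters, and the assumption that the driver reports the MySQL error code as the first exception argument (as PyMySQL and MySQLdb do) are all hypothetical choices made for illustration.

```python
import time

# MySQL error code for "Lock wait timeout exceeded".
ER_LOCK_WAIT_TIMEOUT = 1205

# Hypothetical statement: the table and column names are made up.
EXAMPLE_DDL = (
    "ALTER TABLE example_table "
    "ADD COLUMN example_note VARCHAR(255), ALGORITHM=INSTANT"
)


def run_ddl_with_bounded_lock_wait(cursor, ddl, lock_wait_seconds=5,
                                   attempts=3, pause_seconds=30.0,
                                   sleep=time.sleep):
    """Run `ddl` on a DB-API style cursor with a short metadata lock wait.

    Returns the number of the attempt that succeeded. Re-raises the last
    error if every attempt times out, or immediately on any other error.
    """
    # lock_wait_timeout bounds how long the session waits for a metadata
    # lock, so the ALTER gives up instead of sitting in the lock queue.
    cursor.execute(f"SET SESSION lock_wait_timeout = {int(lock_wait_seconds)}")

    for attempt in range(1, attempts + 1):
        try:
            cursor.execute(ddl)
            return attempt
        except Exception as exc:
            # Assumption: the driver puts the MySQL error code in args[0].
            code = exc.args[0] if exc.args else None
            if code != ER_LOCK_WAIT_TIMEOUT or attempt == attempts:
                raise
            # Back off so queued application queries can drain before retrying.
            sleep(pause_seconds)
```

With a real connection this would be called as `run_ddl_with_bounded_lock_wait(connection.cursor(), EXAMPLE_DDL)`. The trade-off is that the schema change may fail and need to be rescheduled, in exchange for a bounded period during which other queries on the table can queue behind it.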

We are revising our operational procedures for running similar schema changes, and changing the schema migration process to avail of a more aggressive cutover mechanism, which will prevent similar issues from occurring in the future. This is a new feature that we will test, and we have high confidence that it will remove any possibility of recurrence.

As always, we apologise for the disruption. Please do reach out to our team if you have any concerns.