Between 14:28 and 15:12 UTC on 25th February 2025, Intercom's USA region experienced seven outages, each lasting between one and five minutes. The following graph shows the impact on conversations being replied to in the Intercom inbox; however, the outages affected all Intercom features in the USA region.
Intercom has a multi-tenant architecture, runs on Ruby on Rails, and uses MySQL as its primary datastore. This incident was primarily driven by a failure of a caching layer used to protect the MySQL databases from load.
Intercom uses a library called IdentityCache to cache objects read from the MySQL databases. The 12 memcached servers used by IdentityCache collectively serve over 7 million requests per second at peak, from a cache of 5 billion records. The MySQL clusters that authoritatively store the data cached in memcached serve orders of magnitude fewer requests, fewer than 200 thousand per second.
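For readers unfamiliar with IdentityCache, here is a minimal sketch of how it is typically wired up. The model, fields, and hosts are illustrative, not our actual code:

```ruby
# Gemfile: gem "identity_cache"  (Shopify's IdentityCache library)

# config/initializers/identity_cache.rb
# Point IdentityCache at the memcached fleet (hosts are illustrative).
IdentityCache.cache_backend = ActiveSupport::Cache::MemCacheStore.new(
  "memcached-1.internal:11211", "memcached-2.internal:11211"
)

# app/models/conversation.rb (hypothetical model for illustration)
class Conversation < ApplicationRecord
  include IdentityCache

  # Also cache lookups by a secondary index, generating fetch_by_app_id.
  cache_index :app_id
end

# Reads go to memcached first; a miss falls through to MySQL and
# repopulates the cache, so the databases see only a fraction of reads.
conversation = Conversation.fetch(42)
```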
At 11:41, as part of preparation for an upgrade from Ruby on Rails 7.1 to 7.2, we changed the Active Support cache_format_version from 6.1 to 7.0. This change was necessary because the 6.1 cache_format_version is deprecated. The format change caused previously cached records to be "missed", and as a result a slowly increasing amount of query load started going directly to the primary node of one of our database clusters. This extra load was not excessive, approximately 25% above normal, which is well within the headroom provisioned for this database cluster to allow for future growth and traffic spikes.
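The change itself is a single configuration line. A simplified before-and-after sketch (not our actual configuration) looks like this:

```ruby
# config/application.rb (simplified, illustrative)
module ExampleApp
  class Application < Rails::Application
    # Before: cache entries were written and read in the deprecated 6.1 format.
    # config.active_support.cache_format_version = 6.1

    # After: readers expected the 7.0 format, so entries previously written in
    # the 6.1 format no longer matched and lookups fell through to MySQL.
    config.active_support.cache_format_version = 7.0
  end
end
```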
However, starting at 14:28, the database cluster experienced short periods of extremely slow query performance. As a result, the database connection proxying layer marked the database as unhealthy, bringing the Intercom application down. The database recovered quickly, but this pattern recurred six more times. We believe the expiry of very hot keys in the caching layer caused the database to receive short bursts of extremely high query load; coupled with the additional background load, this caused query times to increase massively, connections to max out on the primary host, and the subsequent health checks to fail.
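A simplified sketch of this failure mode (using a hypothetical model and cache key, not our actual code) shows how an expiring hot key can send concurrent requests straight to the primary:

```ruby
# When a very hot cache key expires, every in-flight request misses at once
# and each one issues the same query against the MySQL primary.
def fetch_hot_record(id)
  key = "hot_record:#{id}"
  cached = Rails.cache.read(key)
  return cached if cached

  # On expiry, thousands of concurrent requests reach this point together,
  # producing a short but extremely high query spike on the primary.
  record = HotRecord.find(id)  # HotRecord is a hypothetical model
  Rails.cache.write(key, record, expires_in: 1.hour)
  record
end
```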
Incident management of this outage was prompt and effective, and we had the right folks quickly involved. However, troubleshooting the problem was difficult, as the behaviour change in the cache had started hours earlier and the general load on the databases was within operational norms. Once we identified the cache as the problem, we quickly found the format change and rolled it back, which immediately stabilized things.
We are continuing to investigate the exact nature of the cache behaviour change - we had specifically tested forward and backward compatibility in advance of this change, and that testing had not surfaced any issues. We are also continuing to remove AWS RDS Aurora MySQL and ProxySQL from our environment - while not directly implicated in the incident, we believe that Vitess/PlanetScale would have degraded more gracefully and pointed us more quickly at the cause of the problem. The database in question will be moved to Vitess/PlanetScale in Spring 2025.
As always, we apologize for the impact of these outages on our customers. Stabilizing our database platform is our number one priority. Please do reach out to us directly at team@intercom.com or to myself personally at brian.scanlan@intercom.io if there's anything we can do.