Root Cause Analysis: Intercom down for 25% of workspaces
Brian Scanlan, Senior Principal Engineer, 1st March 2024
The primary datastore of the Intercom Ruby on Rails application is MySQL. In the USA hosting region, where the vast majority of Intercom workspaces are hosted, we run 13 distinct AWS RDS Aurora MySQL clusters, and the databases hosted on these clusters are sharded using a mix of functional sharding and sharding on a per-customer-workspace basis. Five of these clusters contain databases sharded on a per-workspace basis, and contain many of our largest tables. As a result, each of these clusters has many thousands of individual databases.
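To make the per-workspace sharding model concrete, here is a minimal sketch in plain Ruby. Intercom's actual routing logic is not described in this post, so everything below is hypothetical: the shard names, the assignment table, and the helper methods are illustrative only. The key property shown is that each workspace is pinned to one cluster and gets its own database there, which is why a single cluster holds many thousands of databases.

```ruby
# Illustrative sketch only; not Intercom's actual routing code.
# Five hypothetical per-workspace sharded clusters ("Shard A" .. "Shard E").
SHARDED_CLUSTERS = %w[shard_a shard_b shard_c shard_d shard_e].freeze

# Hypothetical persisted mapping: workspaces are pinned to a shard at
# creation time rather than hashed on the fly.
WORKSPACE_ASSIGNMENTS = {
  "workspace_123" => "shard_c",
  "workspace_456" => "shard_a",
}.freeze

def cluster_for(workspace_id)
  WORKSPACE_ASSIGNMENTS.fetch(workspace_id) do
    # Fallback for this sketch: deterministic placement for new workspaces.
    SHARDED_CLUSTERS[workspace_id.sum % SHARDED_CLUSTERS.size]
  end
end

def database_for(workspace_id)
  # Each workspace owns its own database on its assigned cluster, hence
  # "many thousands of individual databases" per cluster.
  "#{cluster_for(workspace_id)}/db_#{workspace_id}"
end
```

Under this model, an incident on one cluster ("Shard C" below) affects exactly the subset of workspaces pinned to it, which is how roughly 25% of workspaces were impacted while the rest were not.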
On February 28th 2024 at 21:40 UTC, one of the MySQL clusters used for the sharded databases ("Shard C") experienced a significant deterioration in the throughput and performance of writes being made to the database. Around 25% of Intercom workspaces are located on this cluster, and affected customers experienced high latencies and errors when using Intercom. At 21:49 UTC the first paging alarm fired, and our full incident process was initiated. The errors persisted until 22:12:42 UTC, when we initiated a failover of the primary node of the cluster. By 22:19 UTC the database was performing normally and Intercom had recovered.
The cause of the degradation is related to the incident that affected Intercom on February 22nd 2024. As a result of that event, we are running a large number of column data type change migrations on our databases. These are generally precautionary, and Intercom's systems are not at immediate risk of further downtime; however, they are important to complete in order to avoid any chance of recurrence of that incident. Running migrations on our MySQL databases is a near-constant state for Intercom, and over time we have built many layers to protect our production traffic while these expensive operations are occurring on our databases. We believe that three factors contributed to the database performance degradation that occurred this time:
We believe the large volume of ALTER TABLE changes caused the database degradation. The precise mechanism here is still under investigation, but it is strongly correlated with the outage. Other sharded clusters were able to handle these changes without major problems, so there may be a workload-related factor specific to Shard C.
As a result, we have changed our migration strategy to always use a new copy of the table when changing column data types, regardless of the size of the table. This will slow down the overall process, and is a more expensive operation, but it avoids a recurrence of the problems observed in this outage.
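The copy-table strategy can be sketched as follows. This is a minimal in-memory simulation of the general pattern (as used by tools such as pt-online-schema-change, gh-ost, or LHM), not Intercom's actual tooling; the table, column, and method names are all hypothetical. The idea is to create a "ghost" table with the new column type, backfill it in batches, and then atomically swap it into place, rather than altering the live table directly.

```ruby
# In-memory sketch of the copy-and-swap migration pattern; hypothetical names.
# A "table" is a hash of primary key => row; a "schema" records column types.
tables  = { "conversations" => { 1 => { id: 1, count: 2_000_000_000 } } }
schemas = { "conversations" => { id: :bigint, count: :int } }

def copy_table_migrate(tables, schemas, name, column, new_type)
  ghost = "#{name}_ghost"

  # 1. Create an empty ghost table with the widened column type.
  schemas[ghost] = schemas[name].merge(column => new_type)
  tables[ghost]  = {}

  # 2. Backfill rows in batches. A real online-schema-change tool also
  #    replays ongoing writes (via triggers or the binlog) during the copy.
  tables[name].each_slice(1000) do |batch|
    batch.each { |pk, row| tables[ghost][pk] = row.dup }
  end

  # 3. Atomically swap the ghost table into place and drop the original.
  tables[name], schemas[name] = tables[ghost], schemas[ghost]
  tables.delete(ghost)
  schemas.delete(ghost)
end

copy_table_migrate(tables, schemas, "conversations", :count, :bigint)
```

The trade-off named above is visible in the sketch: every row is rewritten into the new table, which costs more I/O and time than an in-place ALTER, but the live table is never mutated while serving traffic, and the cutover is a single rename-style swap.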