Degraded Performance on Intercom Web App & Messenger
Incident Report for Intercom
Postmortem

Root Cause Analysis: Intercom down for 25% of workspaces

Brian Scanlan, Senior Principal Engineer, 1st March 2024

Issue Summary

The primary datastore of the Intercom Ruby on Rails application is MySQL. In the USA hosting region, where the vast majority of Intercom workspaces are hosted, we run 13 distinct AWS RDS Aurora MySQL clusters, and the databases hosted on these clusters are sharded both functionally and on a per-customer-workspace basis. Five of these clusters contain databases sharded on a per-workspace basis and hold many of our largest tables. As a result, each of these clusters has many thousands of individual databases.
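For illustration, the sketch below models what per-workspace sharding of this shape can look like: a workspace is assigned to one of the sharded clusters, and its tables live in a dedicated database on that cluster. The cluster names, lookup structure, and method names are assumptions made for this example, not Intercom's actual implementation.

```ruby
# Illustrative only: a minimal model of per-workspace sharding.
# The lookup structure and naming convention are assumptions for this sketch.

# A real system would persist this assignment; here it is a simple hash
# mapping a workspace id to the cluster that hosts its database.
WORKSPACE_TO_CLUSTER = {
  42 => "shard_c",
  77 => "shard_a",
}.freeze

def database_for(workspace_id)
  cluster = WORKSPACE_TO_CLUSTER.fetch(workspace_id)
  # Each workspace has its own database on its cluster, which is why a
  # single cluster ends up holding many thousands of databases.
  { cluster: cluster, database: "workspace_#{workspace_id}" }
end

p database_for(42)  # => {:cluster=>"shard_c", :database=>"workspace_42"}
```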

On February 28th 2024 at 21:40 UTC, one of the MySQL clusters used for the sharded databases ("Shard C") experienced a significant deterioration in the throughput and performance of writes to the database. Around 25% of Intercom workspaces are located on this cluster, and affected customers experienced high latencies and errors when using Intercom. At 21:49 UTC the first paging alarm fired and our full incident process was initiated. The errors persisted until 22:12:42 UTC, when we initiated a failover of the primary node of the cluster. By 22:19 UTC the cluster was performing normally and Intercom was working as expected.
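For reference, an Aurora primary failover of this kind can be triggered through the RDS API. The sketch below uses the aws-sdk-rds gem with a placeholder cluster identifier; it shows the general mechanism, not the tooling we actually ran.

```ruby
# Illustrative sketch: triggering an Aurora cluster failover via the RDS API.
# The region and cluster identifier are placeholders, not our real values.
require "aws-sdk-rds"

rds = Aws::RDS::Client.new(region: "us-east-1")

# Promotes a reader instance to primary; Aurora repoints the cluster's
# writer endpoint once the new primary is available.
rds.failover_db_cluster(db_cluster_identifier: "shard-c-cluster")
```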

The cause of the degradation is related to the incident that affected Intercom on February 22nd 2024. As a result of that event, we are running a large number of column data type change migrations on our databases. These are generally precautionary, and Intercom's systems are not at immediate risk of further downtime; however, they are important to complete in order to avoid any chance of a recurrence of that incident. Running migrations on our MySQL databases is a near-constant state for Intercom, and over time we have built many layers of protection for production traffic while these expensive operations are running. We believe three factors contributed to the database performance degradation that occurred this time:
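As a concrete, hypothetical example of the kind of change involved, widening an integer column to a bigint looks like the following in a Rails migration. The table and column names are placeholders, not the actual migrations we are running.

```ruby
# Hypothetical example of a column data type change migration.
# Table and column names are placeholders, not Intercom's actual schema.
class WidenExternalIdOnConversations < ActiveRecord::Migration[7.0]
  def up
    # On MySQL this issues an ALTER TABLE ... MODIFY COLUMN, which can
    # rebuild or lock the table depending on its size and the ALTER algorithm.
    change_column :conversations, :external_id, :bigint, unsigned: true
  end

  def down
    change_column :conversations, :external_id, :integer, unsigned: true
  end
end
```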

  1. Column data type changes are uncommon in our environment - we have far more experience with changes such as adding or deleting columns.
  2. Our migration strategy is bimodal depending on the size of the table being migrated: in order to complete migrations in a timely manner, we alter tables directly when they are very small, and make new copies of tables in all other cases (a simplified sketch of this decision follows this list).
  3. The large number of databases on the sharded clusters, and the resulting volume of direct “alter” changes being made to tables.
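The sketch below illustrates that bimodal decision. The row threshold, helper names, and SQL are illustrative assumptions, not our exact tooling; in particular, real copy-based online schema changes also track writes made while the copy is in progress.

```ruby
# Simplified sketch of the bimodal migration strategy described above.
# The threshold, names and SQL are illustrative assumptions only.
SMALL_TABLE_ROW_LIMIT = 100_000

def change_column_type(connection, table, column, new_type)
  rows = connection.select_value("SELECT COUNT(*) FROM #{table}").to_i

  if rows <= SMALL_TABLE_ROW_LIMIT
    # Small table: alter it in place. Cheap per table, but across many
    # thousands of per-workspace databases this produces a large volume
    # of concurrent ALTER statements on the cluster.
    connection.execute("ALTER TABLE #{table} MODIFY #{column} #{new_type}")
  else
    # Larger table: build a copy with the new column type, backfill it,
    # then swap it in. (Real tools also replay writes made during the copy.)
    connection.execute("CREATE TABLE #{table}_new LIKE #{table}")
    connection.execute("ALTER TABLE #{table}_new MODIFY #{column} #{new_type}")
    connection.execute("INSERT INTO #{table}_new SELECT * FROM #{table}")
    connection.execute("RENAME TABLE #{table} TO #{table}_old, #{table}_new TO #{table}")
  end
end
```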

We believe the large volume of “alter” table changes caused the database degradation - the precise mechanism is still under investigation, but the changes are strongly correlated with the outage. Other sharded clusters handled these changes without major problems, so a workload-related factor specific to Shard C may also have contributed.

As a result, we have changed our migration strategy to always make a new copy of the table when changing column data types, regardless of the table's size. This slows down the overall process and is a more expensive operation, but it avoids the problems observed in this outage.

Actions

  1. Stop directly altering column types on tables for all migrations regardless of table size (done)
  2. Review alarm thresholds to ensure faster detection of this type of partial failure. 
  3. Continue work to replace AWS RDS Aurora MySQL based architecture with a Vitess based system - fundamentally addressing the limitations of our current MySQL sharding setup.
Posted Mar 01, 2024 - 17:46 UTC

Resolved
Customers hosted in our US data center saw severely degraded performance across the Intercom Web App and Messenger between 21:41 and 22:18 UTC. This issue has now been resolved.
Posted Feb 28, 2024 - 22:26 UTC
Monitoring
We're seeing recovery for the impacted customers and will continue to monitor.
Posted Feb 28, 2024 - 22:20 UTC
Update
We're continuing to work on a fix for this issue. Some customers will continue to see major impact to Intercom functionality.
Posted Feb 28, 2024 - 22:15 UTC
Update
We are continuing to investigate this issue.
Posted Feb 28, 2024 - 22:10 UTC
Investigating
We are looking into higher than usual latencies and errors across the Intercom Web App and Messenger impacting customers hosted in our US data center since 21:41 UTC.
Posted Feb 28, 2024 - 22:03 UTC
This incident affected: Intercom Messenger (Web Messenger, Mobile Messenger) and Intercom Web Application.