Elevated Error Rate and Latencies on Intercom Web App
Incident Report for Intercom
Postmortem

Root Cause Analysis: Intercom down for 25% of workspaces

Brian Scanlan, Senior Principal Engineer, 13th March 2024

Issue Summary

The primary datastore of the Intercom Ruby on Rails application is MySQL. In the USA hosting region, where the vast majority of Intercom workspaces are hosted, we run 13 distinct AWS RDS Aurora MySQL clusters; the databases hosted on these clusters are sharded both functionally and on a per-customer-workspace basis. Five of these clusters contain databases sharded on a per-workspace basis and hold many of our largest tables. As a result, each of these clusters contains many thousands of individual databases.
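
For illustration, the sketch below shows one way per-workspace routing can be expressed with Rails 6.1+ horizontal sharding. The class, shard and method names (ShardedRecord, shard_a/shard_d, with_workspace_shard, workspace.shard_name) are hypothetical, and Intercom's actual routing layer is more involved, since each cluster also holds thousands of individual per-workspace databases.

  # Hypothetical sketch of per-workspace shard routing using Rails horizontal
  # sharding; class names, shard names and layout are illustrative only.
  class ShardedRecord < ApplicationRecord
    self.abstract_class = true

    # One entry per sharded Aurora cluster (shard_b, shard_c, shard_e omitted).
    connects_to shards: {
      shard_a: { writing: :shard_a, reading: :shard_a_replica },
      shard_d: { writing: :shard_d, reading: :shard_d_replica }
    }
  end

  # Run a block against the cluster that hosts the given workspace.
  def with_workspace_shard(workspace, &block)
    ShardedRecord.connected_to(shard: workspace.shard_name.to_sym, role: :writing, &block)
  end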

On March 13th 2024 at 17:01 UTC, one of the MySQL clusters used for the sharded databases ("Shard D") experienced a significant deterioration in the throughput and performance of writes being made to the database. Around 25% of Intercom workspaces are hosted on this cluster, and affected customers experienced high latencies and errors when using Intercom. Our full incident process was initiated at 17:06 UTC. The errors persisted until 17:13:45 UTC, when we initiated a failover of the primary node of the cluster. By 17:22 UTC the cluster was performing normally and Intercom was working normally.

The cause of the degradation is related to the incident that affected Intercom on February 22nd 2024. As a result of that event, we are running a large number of column data type change migrations on our databases. These are generally precautionary, and Intercom's systems are not at immediate risk of further downtime; however, they are important to complete in order to avoid any chance of a recurrence of that incident. Running migrations on our MySQL databases is a near-constant state for Intercom, and over time we have built many layers of protection for our production traffic while these expensive operations run on our databases. However, we have now had two similar incidents on different sharded clusters during these migrations (see https://www.intercomstatus.com/incidents/x3sq0ggjqzkv) and will continue investigating the lockup we have now seen twice. We will also look at detecting the lockup and failing the cluster over automatically, as in both cases a person has needed to initiate a failover to recover the system.
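
As a rough sketch of what automated detection and failover could look like (not a description of Intercom's actual tooling), the Ruby snippet below polls a CloudWatch write latency signal for the cluster and triggers an RDS failover when it stays elevated. The cluster identifier, the choice of metric (CommitLatency) and the threshold are assumptions made for illustration.

  require "aws-sdk-cloudwatch"
  require "aws-sdk-rds"

  # Hypothetical sketch: poll a write latency signal and fail the cluster over
  # when it stays elevated. Identifiers and thresholds are made up.
  CLUSTER_ID = "shard-d-cluster"   # hypothetical cluster identifier
  THRESHOLD_MS = 500.0             # hypothetical "lockup" threshold

  cloudwatch = Aws::CloudWatch::Client.new
  rds = Aws::RDS::Client.new

  stats = cloudwatch.get_metric_statistics(
    namespace: "AWS/RDS",
    metric_name: "CommitLatency",
    dimensions: [{ name: "DBClusterIdentifier", value: CLUSTER_ID }],
    start_time: Time.now - 300,
    end_time: Time.now,
    period: 60,
    statistics: ["Average"]
  )

  # Fail over only if every recent datapoint breaches the threshold,
  # to avoid flapping on short spikes.
  if stats.datapoints.any? && stats.datapoints.all? { |d| d.average > THRESHOLD_MS }
    rds.failover_db_cluster(db_cluster_identifier: CLUSTER_ID)
  end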

Actions

  1. Fail over the AWS RDS Aurora MySQL cluster so that the primary node is the largest instance type available (done).
  2. Open a support case with AWS.
  3. Investigate automating the detection of this specific lockup and failing over the MySQL cluster.
  4. Continue work to replace the AWS RDS Aurora MySQL-based architecture with a Vitess-based system, fundamentally addressing the limitations of our current MySQL sharding setup.
Posted Mar 13, 2024 - 18:18 UTC

Resolved
We saw an elevated rate of errors and latency across the Intercom web app between 17:00 UTC and 17:22 UTC. This issue has now been resolved.
Posted Mar 13, 2024 - 17:23 UTC
Investigating
We are looking into reports of increased error rates and latency on our web app since 17:00 UTC.
Posted Mar 13, 2024 - 17:16 UTC
This incident affected: Intercom Web Application.