The primary datastore of the Intercom Ruby on Rails application is MySQL. In the USA hosting region, where the vast majority of Intercom workspaces are hosted, we run 13 distinct MySQL clusters to scale with our customers' use of Intercom. We are currently migrating from AWS RDS Aurora MySQL to Vitess-based PlanetScale. This process is ongoing: four core datastores have already been migrated successfully, while others still operate in AWS RDS. Today's issues impacted one of the clusters still in AWS RDS.
At 09:42 UTC, query latency on one of the clusters still operating in AWS RDS increased rapidly and throughput fell off precipitously. The cluster in question is Intercom’s “shard-e” database, part of a system of horizontally scaled clusters which Intercom built to handle “user scale” data. Shard E currently serves newly created customers. While impact was contained to this single cluster, the “brown out” nature of the failure, with latencies vastly increasing rather than requests failing outright, caused some knock-on impact to customers hosted on other “shards” in the system, since web serving fleet capacity was consumed waiting on queries to the unhealthy database.
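To illustrate why a single degraded shard can consume shared serving capacity: in a horizontally sharded Rails application, each web worker that routes a request to a shard blocks until the query returns (or times out), so a shard answering in seconds rather than milliseconds ties up workers that would otherwise serve healthy shards. The sketch below is a minimal, hypothetical illustration of that routing using Rails' horizontal sharding API; the shard names, connection settings, and timeout values are illustrative assumptions, not Intercom's actual configuration.

```ruby
# Minimal sketch of horizontal sharding in Rails (all names are hypothetical).
class UserScaleRecord < ActiveRecord::Base
  self.abstract_class = true

  # Each shard maps to entries in database.yml; "shard_e" mirrors the naming
  # in this post, but the configuration itself is illustrative.
  connects_to shards: {
    default: { writing: :shard_a_primary, reading: :shard_a_replica },
    shard_e: { writing: :shard_e_primary, reading: :shard_e_replica }
  }
end

# A web worker serving a request pinned to shard E blocks inside this block
# for as long as the query takes. During a "brown out" the query still
# eventually returns, so the worker is held far longer than normal, reducing
# the capacity available to requests bound for healthy shards.
UserScaleRecord.connected_to(role: :reading, shard: :shard_e) do
  UserScaleRecord.connection.select_value("SELECT 1")
end

# Driver-level timeouts (e.g. the mysql2 adapter's connect_timeout,
# read_timeout and write_timeout keys in database.yml) bound how long that
# wait can last, turning a very slow query into a fast failure.
```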
By 09:46 UTC a full incident response team had assembled, including engineers and a dedicated incident commander to handle mitigation. At this point shard E still appeared to be doing no work at all, and there had been no leading indicators of a gradual decline in performance in the minutes prior to the outage. This was unusual, and combined with the fact that the database was still not responding to queries, the incident response team made the decision to manually trigger a failover of the database cluster. This operation is normally triggered automatically by AWS when an instance becomes unhealthy, but the team felt that a manual failover was the most appropriate action in the circumstances.
At 09:50 UTC an operator triggered the failover and the database went offline. AWS automation also began restarting the read replicas in the cluster. Generally speaking, Aurora failover operations are fast (on the order of 1-3 minutes), but this can vary depending on the load on the cluster. In this instance the failover took 7 minutes, with AWS reporting the operation as complete at 09:57 UTC. Full recovery followed shortly after incident responders redeployed the Intercom application, with metrics returning to normal by 09:59 UTC.
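For context, a manual Aurora failover promotes one of the cluster's existing read replicas to become the new writer. The sketch below shows roughly what triggering and monitoring such a failover looks like via the AWS SDK for Ruby; the cluster and instance identifiers are placeholders, and this is not the exact tooling our responders used.

```ruby
require "aws-sdk-rds"

rds = Aws::RDS::Client.new(region: "us-east-1") # region is a placeholder

# Ask Aurora to promote a reader to writer. If target_db_instance_identifier
# is omitted, Aurora chooses which replica to promote.
rds.failover_db_cluster(
  db_cluster_identifier: "shard-e-cluster",          # placeholder identifier
  target_db_instance_identifier: "shard-e-replica-1" # placeholder identifier
)

# Poll until the cluster reports "available" again; during this incident
# that took roughly 7 minutes rather than the usual 1-3.
loop do
  cluster = rds.describe_db_clusters(
    db_cluster_identifier: "shard-e-cluster"
  ).db_clusters.first
  break if cluster.status == "available"
  sleep 10
end
```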
As part of understanding this incident we’ve opened a high-severity support case with Amazon RDS. We suspect an underlying issue with the cluster, either in the Aurora version or in the hardware the primary instance was running on. We are also pushing to understand why the failover took 7 minutes to complete, which is an outlier in our experience.
Even before this incident occurred, the sharded clusters were the top priority for migration to PlanetScale, with significant testing carried out over the past several quarters, predating the discussion of the process on our blog. We expect the first workspaces to be transferred to PlanetScale as soon as this week, with customer workspaces following in the coming weeks. This process is expected to involve zero downtime and no impact on customers’ use of Intercom, and it will be overseen by multiple engineers on both the Intercom and PlanetScale teams. We believe that completing this migration is the most significant step we can take to reduce the risk of outages caused by database infrastructure issues, which we have seen increase over the past 12-18 months.