Increased Error Rates and elevated latencies using Intercom
Incident Report for Intercom
Postmortem

Issue Summary

The primary datastore of the Intercom Ruby on Rails application is MySQL. In the USA hosting region, where the vast majority of Intercom workspaces are hosted, we run 13 distinct AWS RDS Aurora MySQL clusters. The databases hosted on these clusters are sharded using a mix of functional sharding and sharding on a per-customer-workspace basis. Five of these clusters contain databases sharded on a per-workspace basis and hold many of our largest tables; as a result, each of these clusters has many thousands of individual databases.
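The per-workspace scheme above can be sketched as a simple routing layer: each workspace maps to one sharded cluster, and each workspace gets its own database on that cluster. This is a hypothetical illustration, not Intercom's actual code; the names (`route_workspace`, `WORKSPACE_TO_CLUSTER`, the `intercom_` database prefix) are assumptions.

```python
# Hypothetical sketch of per-workspace shard routing. In a real system the
# workspace -> cluster mapping would live in a directory service or
# metadata table; a dict stands in for it here.

SHARD_CLUSTERS = ["shard-a", "shard-b", "shard-c", "shard-d", "shard-e"]

WORKSPACE_TO_CLUSTER = {
    "workspace-1001": "shard-b",
    "workspace-2002": "shard-e",
}

def route_workspace(workspace_id: str) -> str:
    """Return the cluster whose databases hold this workspace's data.

    Unmapped (new) workspaces default to the newest, smallest cluster,
    mirroring the report's note that shard-e holds recently created
    workspaces.
    """
    return WORKSPACE_TO_CLUSTER.get(workspace_id, "shard-e")

def database_name(workspace_id: str) -> str:
    # Each workspace has its own database on its cluster, which is why
    # the sharded clusters contain many thousands of individual databases.
    return f"intercom_{workspace_id}"
```

Because routing is per workspace, a problem confined to one cluster (as happened here with "shard-b") affects only the fraction of customers whose workspaces live on it.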

On May 7th, 2024 at 14:42 UTC, "shard-e" failed over. "shard-e" is the smallest sharded cluster, used for recently created Intercom workspaces; it takes a small fraction of the load of the other four sharded clusters and runs on smaller instances (db.r6i.8xlarge). Despite the lower load, the primary database in the cluster ran out of freeable memory, and the cluster failed over. The failover was complete by 14:48 UTC, at which point the cluster started taking requests normally.
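A freeable-memory guard of the kind that could flag this condition before failover is easy to sketch. The threshold and function names below are assumptions for illustration, not Intercom's actual monitoring; the instance memory size is the published figure for db.r6i.8xlarge (256 GiB).

```python
# Hypothetical low-memory check against RDS's FreeableMemory-style metric.
# The 5% threshold is an assumed value, not a documented recommendation.

def low_memory_alert(freeable_bytes: int, total_bytes: int,
                     threshold_pct: float = 5.0) -> bool:
    """Return True when freeable memory drops below threshold_pct of total."""
    return (freeable_bytes / total_bytes) * 100 < threshold_pct

# db.r6i.8xlarge instances have 256 GiB of memory.
TOTAL = 256 * 1024**3
```

For example, 2 GiB freeable (~0.8%) would trip the alert, while 64 GiB (25%) would not.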

However, at the time of the failover, a small fraction of queries to an unrelated cluster, "shard-b", also started to fail, and our application began reporting login failures to "shard-b", which is used by ~25% of Intercom workspaces hosted in the USA region. That cluster had not failed over, and its metrics were nominal, with no obvious problems. After verifying the state of the cluster, we began investigating the health of the ProxySQL service, a shared service used by the Intercom Ruby on Rails application to connect to all databases. We added additional ProxySQL capacity at 15:20 UTC; the new hosts took a few minutes to enter service but did not immediately improve the situation. We then initiated the removal of the original ProxySQL hosts from the EC2 autoscaling group. The number of failing queries to "shard-b" was slowly increasing during this time. While we waited for the ProxySQL hosts to be fully removed from service, we prepared to fail over "shard-b" in an attempt to recover the cluster, even though doing so would cause a near-complete outage for ~25% of Intercom customers.
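The recovery procedure described above amounts to a rolling replacement of the proxy fleet: add fresh capacity first, then drain and remove the original hosts so clients reconnect through healthy proxies. The sketch below models that sequence; the `ProxyFleet` class and host names are hypothetical, and real tooling would drive an EC2 autoscaling group rather than an in-memory set.

```python
# Minimal model of replacing a proxy fleet: scale up, then drain the
# originals. Health-check delays are not modeled.

class ProxyFleet:
    def __init__(self, hosts):
        self.in_service = set(hosts)

    def add_capacity(self, new_hosts):
        # New hosts must pass health checks before taking traffic;
        # modeled here as an immediate add.
        self.in_service |= set(new_hosts)

    def remove_hosts(self, hosts):
        # Draining takes hosts out of service so clients reconnect
        # through the remaining proxies with fresh backend connections.
        self.in_service -= set(hosts)

original = {"proxysql-1", "proxysql-2", "proxysql-3"}
fleet = ProxyFleet(original)
fleet.add_capacity({"proxysql-4", "proxysql-5", "proxysql-6"})
fleet.remove_hosts(original)  # only the replacement hosts remain in service
```

Replacing the hosts entirely, rather than only adding capacity, is what cleared the bad state here: the stuck connection state lived on the original hosts, so new capacity alone did not help.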

By 15:40 UTC, all of the original ProxySQL hosts had been removed from service, and all queries to the cluster recovered without our initiating a failover. At the peak of the outage, just over 50% of requests made by ~25% of Intercom customers were failing, causing problems across the Intercom application, not limited to the Messenger and Inbox.

This outage had similarities to two outages experienced by Intercom in September and October 2023, detailed here: https://www.intercomstatus.com/incidents/6f01d7h9zs1f

In this case, while the specifics were different (impact to a single cluster, the ProxySQL version, and an unusual trigger), our use of ProxySQL in front of our RDS Aurora clusters entered a similar metastable state, and the hosts needed to be replaced to recover connectivity to a single cluster.

Actions

Update runbooks to replace ProxySQL hosts on any sign of login issues (done)

Investigate the trigger for the memory exhaustion on shard-e, upgrading the cluster size if necessary (in progress)

Continue to investigate possible triggers on shard-b and why an unrelated failover caused connectivity to get into a bad state (in progress)

Continue to remove all use of ProxySQL and RDS Aurora in the Intercom Ruby on Rails application (in progress)

Posted May 07, 2024 - 22:34 UTC

Resolved
Between approximately 15:45 UTC and 16:40 UTC on 2024-05-07, customers would have experienced elevated error rates and high latencies while using the Intercom application due to a database failover and recovery. This incident is now resolved, and all services are working as expected.
Posted May 07, 2024 - 16:04 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 07, 2024 - 15:49 UTC
Identified
We have identified a cause of the elevated error rates and high latencies across the app, and are working towards resolution. Customers will continue to experience errors at this time.
Posted May 07, 2024 - 15:24 UTC
Investigating
We are currently investigating increased error rates and high latencies while using Intercom.
Posted May 07, 2024 - 15:11 UTC
This incident affected: Intercom message delivery (Email, Chats and Posts, Mobile Push, Admin notifications), Intercom Messenger (Web Messenger, Mobile Messenger), Intercom Web Application, Intercom APIs, Intercom Mobile APIs, and Intercom Webhooks.