Slowness and errors in Intercom
Incident Report for Intercom
Postmortem

Database infrastructure failure in USA region

Brian Scanlan May 29, 2024

Background

The primary datastore of the Intercom Ruby on Rails application is MySQL. In the USA hosting region, where the vast majority of Intercom workspaces are hosted, we run 13 distinct AWS RDS Aurora MySQL clusters. One of the challenges of this architecture is connection management. There is a limit on the number of connections that can be opened to any individual MySQL host, and on AWS RDS Aurora that limit is 16,000 connections. Intercom runs a monolithic Ruby on Rails application, with hundreds of distinct workloads running in the same application and connecting to the same databases. Because each running Ruby on Rails process generally needs to connect to every database cluster, the connection limit is something we had to engineer around.

On most of the MySQL clusters, the Ruby on Rails application sends read traffic to read-replicas, which spreads connections across a number of hosts and horizontally scales read query capacity. Write requests need a different approach, so in 2017 we rolled out ProxySQL in front of the primary writer node of each MySQL cluster. ProxySQL maintains a connection pool to each writer and efficiently re-uses those connections to serve write requests made by our Ruby on Rails application. This allows us to scale the application without running into the connection limits of our MySQL databases.
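
As a rough sketch of the pattern (the hostnames, model names and the use of Rails' built-in multiple-database API below are illustrative assumptions, not our actual configuration), a Rails application can route reads to replicas and writes through a ProxySQL endpoint along these lines:

```ruby
# config/database.yml (hostnames are illustrative):
#   production:
#     primary:
#       host: maindb-proxysql.internal      # ProxySQL fronting the Aurora writer
#     primary_replica:
#       host: maindb.cluster-ro-xxxx.us-east-1.rds.amazonaws.com  # Aurora reader endpoint
#       replica: true

# app/models/application_record.rb
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  # Writes go through the ProxySQL connection pool in front of the writer;
  # reads can be spread across the fleet of read-replicas.
  connects_to database: { writing: :primary, reading: :primary_replica }
end

# Explicitly routing a query to the replica pool:
ActiveRecord::Base.connected_to(role: :reading) do
  Conversation.where(app_id: 42).count   # Conversation is an illustrative model
end
```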

In the last 9 months we have experienced a number of outages related to our use of ProxySQL. In August and October 2023 we had two outages related to an upgrade to ProxySQL 2. We rolled back that upgrade and returned to running the previous version of ProxySQL, but we then experienced an outage in early May 2024 in which our use of ProxySQL prevented us from recovering quickly from a database failover.

Incident

On 29th May 2024 at 14:54 UTC, errors and latencies across the Intercom application suddenly spiked. An incident was opened at 14:55 UTC, and multiple engineers and an incident commander were troubleshooting the problem by 14:57 UTC. We confirmed the impact, ruled out obvious causes such as database failovers and application deployments, and posted to our public Status Page at 15:04 UTC. The problem appeared to be centered on "MainDB", the oldest and default database at Intercom, which contains a large number of core application tables. The RDS Aurora MySQL metrics were reporting login failures, throughput on the database was plummeting, and errors were rising across all parts of the Intercom application. We tried some safe actions, such as rolling back recent deployments to rule out a code change as the cause, dug into infrastructure logs and recent changes, and looked for any sign of resource-intensive database queries. At 15:12 UTC we failed over the MainDB database to try to recover, but none of this got us anywhere.

CPU utilization on the ProxySQL fleet was elevated, so we added more capacity, but the new hosts came up with the same CPU utilization as the existing hosts. This seemed to improve things significantly for a short period, but the connection errors to MainDB never fully went away and quickly returned. We then recycled the entire ProxySQL fleet, which again briefly recovered things before the high error rate returned. We decided to do two things at once: scale the largest parts of our Ruby on Rails application down to zero, and at the same time fail over the MainDB primary node to the largest instance type available in AWS. The scale-down and failover worked, and by 16:13 UTC no more database errors were being seen. We rapidly scaled our largest fleets back up, recovering the Intercom Inbox by 16:17 UTC, with the Intercom Messenger, mobile SDK and REST API fleets following shortly afterwards.

Full recovery of the Intercom Messenger and mobile SDK was delayed because DynamoDB autoscaling had removed ~80% of our write capacity on the tables used for end-user and company data; these were not fully functional again until 16:50 UTC. We closed off the Status Page incident by 17:18 UTC once we had confirmed that there were no lingering problems.
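
For context on the DynamoDB behaviour: with target-tracking autoscaling, provisioned write capacity follows consumed writes, so when upstream traffic collapses during an outage the provisioned capacity is scaled in and then has to scale back out before full recovery. A minimal sketch of such a policy using the AWS SDK for Ruby; the table name, capacities and cooldowns are illustrative, not our actual settings:

```ruby
require "aws-sdk-applicationautoscaling" # gem "aws-sdk-applicationautoscaling"

autoscaling = Aws::ApplicationAutoScaling::Client.new(region: "us-east-1")

# Register the table's write capacity as a scalable target (illustrative values).
autoscaling.register_scalable_target(
  service_namespace:  "dynamodb",
  resource_id:        "table/end_users",               # illustrative table name
  scalable_dimension: "dynamodb:table:WriteCapacityUnits",
  min_capacity:       1_000,
  max_capacity:       40_000
)

# Target tracking keeps consumed/provisioned write utilization near 70%.
# When consumed writes collapse (e.g. upstream traffic stops during an outage),
# the policy scales provisioned capacity in, and it must scale back out again
# before writes can fully recover.
autoscaling.put_scaling_policy(
  policy_name:        "end_users-write-target-tracking",
  service_namespace:  "dynamodb",
  resource_id:        "table/end_users",
  scalable_dimension: "dynamodb:table:WriteCapacityUnits",
  policy_type:        "TargetTrackingScaling",
  target_tracking_scaling_policy_configuration: {
    target_value: 70.0,
    predefined_metric_specification: {
      predefined_metric_type: "DynamoDBWriteCapacityUtilization"
    },
    scale_in_cooldown:  60,
    scale_out_cooldown: 60
  }
)
```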

At this time we don't know the initial trigger. However, we now know that our ProxySQL architecture can get into a "metastable" state that is difficult to recover from without taking drastic action such as multiple database failovers and scaling down our fleets. Our fleet scaling policies likely contributed here: we scale our fleets up aggressively when high latencies are seen, which helps recovery in many situations, but in this case quickly added a huge number of additional connections to the system. In addition, while we thought we were rolling back and cleaning out configuration and state on our fleets, it turned out that at the host level the inability to connect to MainDB was blocking deployments, so connection state was not being reset when we attempted rollbacks and redeployments. These factors contributed to our inability to recover the system back to a good state. We will continue to investigate the configuration and architecture of our use of ProxySQL, along with our scaling and deployment policies, to understand how they interacted and prevented recovery, as well as to identify any potential triggers. We are also working on removing ProxySQL and AWS RDS Aurora entirely from our infrastructure by moving to PlanetScale/Vitess.
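
To make the connection pressure concrete: each Ruby on Rails process that comes up opens connections towards each database cluster, so an aggressive latency-triggered scale-up multiplies quickly against Aurora's 16,000-connection ceiling unless ProxySQL is absorbing and re-using those connections. A back-of-the-envelope sketch; the fleet sizes and per-process numbers are illustrative, not our actual values:

```ruby
# Back-of-the-envelope connection pressure from a single scale-up event.
# All numbers are illustrative, not Intercom's actual fleet sizes.
new_hosts          = 200   # hosts added by an aggressive latency-triggered scale-up
processes_per_host = 16    # Rails worker processes per host
conns_per_process  = 1     # writer connections per process towards one cluster

new_writer_connections = new_hosts * processes_per_host * conns_per_process
puts new_writer_connections   # => 3200 extra connections per cluster

aurora_connection_limit = 16_000
puts format("%.0f%% of the per-host Aurora limit",
            100.0 * new_writer_connections / aurora_connection_limit)
# Without pooling these connections would land directly on the writer; with
# ProxySQL they land on the proxy fleet instead, which has its own capacity
# limits and can itself become the bottleneck during a scale-up.
```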

Action items

  • Investigate the initial trigger for the degradation of MainDB.
  • Reduce the velocity of aggressive scale-up policies.
  • Investigate making deployments more reliable during database outages.
  • In the medium term, we will entirely remove ProxySQL and RDS Aurora from our architecture by migrating to PlanetScale/Vitess. This project is underway, and we expect to start migrating production data to PlanetScale in June 2024.
Posted May 29, 2024 - 21:47 UTC

Resolved
This incident has been resolved. Apologies for the disruption. We'll follow up with a Root Cause Analysis shortly.
Posted May 29, 2024 - 17:18 UTC
Update
We are seeing recovery across the Intercom platform.
Posted May 29, 2024 - 16:54 UTC
Monitoring
Intercom's services are recovering at this time. We are continuing to work on the problem.
Posted May 29, 2024 - 16:26 UTC
Update
We are continuing to work on resolving the problems affecting Intercom.
Posted May 29, 2024 - 16:08 UTC
Update
We are still working on fixing Intercom.
Posted May 29, 2024 - 15:29 UTC
Identified
We're continuing to troubleshoot the problems affecting Intercom.
Posted May 29, 2024 - 15:16 UTC
Investigating
We are investigating slowness and errors across the Intercom application.
Posted May 29, 2024 - 15:04 UTC
This incident affected: Intercom message delivery (Email, Chats and Posts, Mobile Push, Admin notifications), Intercom Messenger (Web Messenger, Mobile Messenger), Intercom Web Application, Intercom APIs, Intercom Mobile APIs, and Intercom Webhooks.