Root Cause Analysis: Database proxy service took Intercom down
Brian Scanlan, Senior Principal Systems Engineer
Oct 13, 2023
Over the last few months, Intercom has been rolling out an updated proxy service for MySQL. There have been two large outages as a result - we are discontinuing the project and have rolled back to the setup in place before August 2023. The two outages have both been relatively complex "metastable" events involving interactions between thousands of clients, multiple layers of proxies and the behavior of the RDS Aurora MySQL service, and the proxy service getting into an unrecoverable state. Both outages caused an unacceptably high impact to our customers as well as to our own use of Intercom. We are still in the process of understanding the nature of exactly how ProxySQL 2 gets into an unrecoverable state, but we want to quickly share the background and details of what happened, and we may update this document over time.
Intercom is primarily served by a large Ruby on Rails monolith application. On a typical day we have 50,000 Rails processes running across over 1,000 EC2 hosts, serving customer requests and processing background work. The primary datastore of the Ruby on Rails application is MySQL. The application connects to 13 AWS RDS Aurora MySQL clusters, which are sharded on a mix of functional and per-customer-workspace bases. One of the problems with this architecture is connection management on the MySQL hosts. There is a limit on the maximum number of connections that can be opened to any individual MySQL host; on AWS RDS Aurora that limit is 16,000 connections. As each running Ruby on Rails process generally needs to connect to each database cluster, this limit is something we had to engineer a solution for. On most of the MySQL clusters, the Ruby on Rails application sends read traffic to read-replicas, which spreads the connections out over a number of hosts while also horizontally scaling the query load across them. Write requests need a different approach, however, and in 2017 we rolled out ProxySQL in front of the primary writer node of each MySQL cluster. ProxySQL maintains a connection pool to each writer in the MySQL clusters and efficiently re-uses connections to serve write requests made by our Ruby on Rails application. Our operational experience with ProxySQL since then has generally been good, and we have built a good working relationship with the ProxySQL developers.
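To make the connection-management problem concrete, here is a rough back-of-the-envelope sketch. The 50,000-process and 16,000-connection figures come from the paragraph above; the proxy fleet size and pool sizing are illustrative assumptions, not our actual configuration:

```python
# Back-of-the-envelope connection math for the architecture described above.
# RAILS_PROCESSES and AURORA_CONN_LIMIT are from the text; the proxy fleet
# size and per-proxy pool size below are illustrative assumptions.

RAILS_PROCESSES = 50_000       # Rails processes running on a typical day
AURORA_CONN_LIMIT = 16_000     # max connections per RDS Aurora MySQL host

# Without a proxy, each process holding even one connection to a cluster's
# writer would need 50,000 connections on a single host: over 3x the limit.
direct_connections = RAILS_PROCESSES * 1
print(direct_connections > AURORA_CONN_LIMIT)   # True: direct connections won't fit

# With ProxySQL multiplexing, processes connect to the proxy fleet instead,
# and each proxy keeps a small shared pool of backend connections.
PROXY_HOSTS = 40               # assumed proxy fleet size
BACKEND_POOL_PER_PROXY = 100   # assumed pooled backend connections per proxy
pooled_connections = PROXY_HOSTS * BACKEND_POOL_PER_PROXY
print(pooled_connections)      # 4000, comfortably under the 16,000 limit
```

This also illustrates why the feature-flag rollback during the first outage was dangerous: removing the pooling layer pushes the backend connection count back toward the per-process figure.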
ProxySQL 2 was released a few years ago, and one of its new features is far better integration with AWS RDS Aurora MySQL clusters. These features can improve reliability through faster failover and recovery times, and also make ProxySQL suitable for placing in front of the reader nodes in each cluster, which would allow us to scale down the number of readers running in each cluster. We attempted a limited rollout of ProxySQL 2 in mid-2022, but ran into some edge-case bugs, and we worked with the ProxySQL developers on getting them fixed. Because ProxySQL is so critical on the serving path of the Intercom application, we were cautious about the rollout and were happy to wait for bugs to be fixed.
We started using ProxySQL 2 in production on August 8th 2023, with a controlled, gradual rollout that moved traffic over incrementally, one MySQL cluster at a time, over a number of weeks. On the morning of August 23rd we moved another large cluster over to ProxySQL 2 without issue. However, around our daily traffic peak that day, at 15:01 UTC, we started to get alerts for high latencies on the Intercom application. Within a minute we had identified that the problem was related to the serving of MySQL requests and began working on it, following our standard incident management process. At 15:09 the situation degraded further, with errors and high latencies widespread, making Intercom unusable. At this point we had identified DNS resolution errors occurring on the ProxySQL 2 fleet as a likely cause. At 15:13 we attempted to remove ProxySQL 2 from service by using a feature flag to revert to the prior configuration. However, this caused a large number of additional connections to be opened to the MySQL clusters, and connection counts reached their maximum on the largest database cluster, so we reverted back to using ProxySQL 2 at 15:18. Latencies and errors recovered at this point, but we were left without a safe rollback plan and with ProxySQL 2 back in the serving path. By 15:20, an autoscaling rule on the largest database cluster had added two additional read-replicas in response to the high connection counts on the cluster's read nodes.
A few minutes later, at 15:34, MySQL query latency and error rates started degrading again due to DNS resolution failures on the ProxySQL 2 fleet. We took two different actions at this point: scaling up the ProxySQL 2 fleet, and scaling down the fleet serving the Intercom web application. The web application had scaled up sharply in response to the elevated latencies; scaling up under latency pressure is a tactic that has worked well in our environment in the past, but we had recently made this scale-up much more aggressive for the web serving fleet. By 15:45, additional ProxySQL 2 hosts were in service and latencies started to improve, but we were still uncomfortable leaving ProxySQL 2 in service. We assessed the capacity of the largest database cluster now that it had scaled up, and switched off ProxySQL 2 again at 15:52, this time without any problems thanks to the additional read-replicas now in place.
After this outage, we followed our incident review process and completed the actions we identified, including rolling out robust DNS caching on each host, running a larger ProxySQL 2 fleet on smaller hosts, and testing various traffic loads against the fleet. We started using ProxySQL 2 in production again on September 4th 2023, and moved all MySQL clusters over to it without issue.
On October 12th 2023 at 13:56:42 UTC, a deployment went out to the Ruby on Rails application containing an expensive MySQL query against the MainDB MySQL cluster. The overall load increase was small, around 6%, along with a similar increase in network throughput, as the query sent a lot of data back to the Ruby on Rails application. Given the timing of this bad query rolling out, we consider it the trigger for the outage. At 14:00 UTC latency started rising on the Ruby on Rails application. A number of throughput metrics on the MainDB cluster started slowly dropping, including select rate and network bandwidth utilized, and the number of connections opened to each read-replica started to slowly increase; however, overall database load and CPU metrics were stable. At 14:05 the first paging alarm fired for high frontend latency. The oncall engineer acknowledged the alarm within a minute, and by 14:07 UTC had established that the problem related to queries against the reader nodes of the MainDB MySQL cluster, and initiated our incident command process. The first statuspage update was published at 14:10 UTC. At this time the Intercom application was mostly slow, with a small number of errors being reported.
This all changed at 14:11 UTC, when ProxySQL health checks for the read-replicas started failing and there were a large number of login failures. Each ProxySQL host monitors the health of every database it uses in order to route traffic only to healthy backends and maximize the availability of the service. As a result of the health check failures, all but one of the read-replicas were now serving very little traffic, and due to high latencies and error rates in queries, the Intercom application was now fully down. We brought in subject matter experts in ProxySQL and MySQL and started to troubleshoot the problem.
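The health-check dynamic described here can be illustrated with a simplified model of how a proxy shuns failing backends. This is a conceptual sketch only; ProxySQL's actual monitor module, thresholds, and backend state machine differ:

```python
# Conceptual model of proxy health checking: backends that fail consecutive
# checks are shunned and receive no traffic, so widespread login failures
# concentrate all load on the few replicas still passing checks.
# The threshold and states are illustrative, not ProxySQL's real values.

FAILURE_THRESHOLD = 3  # assumed consecutive failures before shunning

class Backend:
    def __init__(self, name):
        self.name = name
        self.failures = 0
        self.online = True

    def record_check(self, ok):
        if ok:
            self.failures = 0
            self.online = True
        else:
            self.failures += 1
            if self.failures >= FAILURE_THRESHOLD:
                self.online = False  # shunned: removed from routing

def routable(backends):
    return [b.name for b in backends if b.online]

replicas = [Backend(f"reader-{i}") for i in range(4)]
# Simulate login failures on all but one replica, as during the outage:
for _ in range(FAILURE_THRESHOLD):
    for b in replicas[:-1]:
        b.record_check(ok=False)
    replicas[-1].record_check(ok=True)

print(routable(replicas))  # ['reader-3']: one replica carries all traffic
```

This shows the metastable shape of the failure: once health checks start failing, traffic piles onto the surviving replicas, which makes their checks more likely to fail too.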
We tried a number of things based on our understanding of the previous outage: we rolled back deployments, scaled up the ProxySQL fleet, shed load by scaling down parts of the Intercom application, and scaled up the read-replicas. By 14:48 we had partial recovery, with near-full throughput going through the read-replicas, but there were still high latencies and errors on the application, along with ongoing login failures to the read-replicas, and we weren't confident in the stability of the service. Connectivity to the cluster was inconsistent across different ProxySQL 2 nodes, and we observed error messages indicating loss of Aurora cluster topology state. At 15:04, throughput through the read-replicas collapsed entirely after we applied a configuration intended to hardcode the cluster topology into the ProxySQL 2 configuration, and we had difficulty connecting directly to the databases. While we had stabilized the health-check related instability, the underlying issue of ProxySQL 2 being unable to maintain a healthy set of connections to the read-replicas was still present. We decided to go back to the architecture in place in early August, removing ProxySQL 2 entirely from our environment. The decision to roll back was made at 15:10 UTC, but as the feature flag change had to be made directly on the database due to the outage, it was completed at 15:20 UTC, and immediate recovery followed.