Root Cause Analysis: Europe down due to inability to scale
Brian Scanlan, Senior Principal Systems Engineer
Intercom's web application, REST APIs and Messenger hosted in the European region were down between 07:30 and 09:15 UTC on 22nd November 2023. Users experienced high latencies and error rates when using Intercom. Customers hosted in the USA and Australia regions were not impacted.
On the afternoon of 21st November, we shipped a configuration change affecting how our web server (nginx) boots the Intercom Ruby on Rails application. The purpose of this change was to remove explicit configuration of region-specific settings for the application, taking advantage of automatic configuration to further simplify and standardize our environment. This change had already been successfully rolled out to our asynchronous workers.
We now know that the change did not work as expected in the Europe and Australia regions: nginx was booting the Intercom Ruby on Rails application with the configuration for the USA region. Due to how we manage nginx configuration changes, the change only affected new hosts being brought into service; already-running hosts were unaffected. As a result, automated tests on new builds of the Intercom application in the Europe and Australia regions did not exercise this configuration change, and the build went into production.
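To make the failure mode concrete, here is a minimal sketch of how automatic region resolution with a silent fallback can boot every region with the USA configuration. This is purely illustrative Python, not Intercom's actual code: the names `APP_REGION`, `resolve_region`, `assert_region` and the default value are assumptions.

```python
import os

# Hypothetical default; if the automatic lookup fails, every region
# silently boots as if it were the USA region.
DEFAULT_REGION = "us-east-1"

def resolve_region(env=os.environ):
    """Return the region the application should boot with.

    If the environment variable is missing (e.g. because a config change
    removed the explicit setting and the automatic source isn't available
    on new hosts), this quietly falls back to the default.
    """
    return env.get("APP_REGION", DEFAULT_REGION)

def assert_region(expected, env=os.environ):
    """Boot-time guard: fail loudly if the resolved region is wrong.

    A check like this would surface the mismatch on the first new host
    brought into service, rather than under production traffic.
    """
    actual = resolve_region(env)
    if actual != expected:
        raise RuntimeError(f"booted with region {actual}, expected {expected}")
```

The key property is that the fallback is silent: a host in Europe resolving the USA region passes boot unless something explicitly asserts the expected region.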
Intercom relies on AWS Autoscaling to bring capacity in and out of service in response to customer demand. The Europe region web fleet scaled down as normal as customer traffic dropped off after business hours. When customer traffic started rising at the start of business hours this morning, new hosts failed to be brought into service.
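As a rough analogue of how target-tracking autoscaling decides capacity, the sketch below keeps average utilization near a target by scaling the fleet proportionally. The parameters and function are assumptions for illustration, not AWS's algorithm or our actual policy; the point is that scaling out only helps if the new hosts can actually enter service.

```python
import math

def desired_capacity(current_hosts, avg_utilization,
                     target=0.6, min_hosts=2, max_hosts=100):
    """Target-tracking style scaling: size the fleet so that average
    utilization lands near `target`, clamped to fleet limits."""
    desired = math.ceil(current_hosts * avg_utilization / target)
    return max(min_hosts, min(max_hosts, desired))
```

With a 60% target, ten hosts running at 90% utilization would scale out to fifteen; during this incident the scale-out was requested as normal, but the replacement hosts failed to boot the application and never took traffic.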
As traffic levels grew, the in-service web servers became overwhelmed. Latencies started to increase on the web fleet around 07:00. By 07:30 we were out of web capacity and some requests started to be queued. By 07:45 the web application was practically unusable due to slow responses from the web servers. Shortly after 08:00 the queues were full and errors started to be returned.
The first paging alarms fired at 08:02, and our on-call engineer triaged the problem and followed our standard incident management process, engaging an incident commander at 08:15. Our Status Page was updated at 08:24. Standard mitigations such as scaling up, rolling back and redeploying the application were tried without success, and more engineers were brought in to investigate. At 08:26 we identified that most web servers were not passing health checks, and by 08:36 we had identified that the Ruby on Rails application was not booting successfully. At 08:52 the configuration change was rolled back. The resulting deployment started going out to the web fleet at 09:08; from this point, new hosts brought into service were able to boot the application. At 09:13 we removed the failing servers so that they would be replaced with fresh hosts. Intercom fully recovered at 09:15.
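The health checks mentioned above can be sketched as a simple HTTP probe: a host is considered in service only if its health endpoint answers 200 within a deadline. This is an illustrative Python sketch; the `/health` path and plain-HTTP probe are assumptions, not a description of our actual load balancer configuration.

```python
import urllib.request

def healthy(host, timeout=2.0):
    """Return True if the host's health endpoint answers 200 within `timeout`.

    Any connection failure, timeout, or non-200 response counts as
    unhealthy -- which is why an application that cannot boot never
    enters service, no matter how many hosts autoscaling launches.
    """
    try:
        with urllib.request.urlopen(f"http://{host}/health",
                                    timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False
```

During the incident, new hosts launched fine at the infrastructure level but the application never booted, so probes like this failed and the hosts were never put behind the load balancer.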