Increased Error Rates and Latencies in Intercom Europe
Incident Report for Intercom
Postmortem

Root Cause Analysis: Europe region down because it was unable to scale

Brian Scanlan, Senior Principal Systems Engineer

Summary

Intercom's web application, REST APIs and Messenger hosted in the European region were down between 07:30 and 09:15 UTC on 22nd November 2023. Users experienced high latencies and error rates when using Intercom. Customers hosted in the USA and Australia regions were not impacted.

What happened?

On the afternoon of 21st November, we shipped a configuration change affecting how our web server (nginx) boots the Intercom Ruby on Rails application. The purpose of the change was to remove explicit configuration of region-specific settings for the application and instead take advantage of automatic configuration, further simplifying and standardizing our environment. This change had already been rolled out successfully for our asynchronous workers.
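
For context, "automatic configuration" here broadly means the host works out which region it is running in rather than being told explicitly. As a simplified sketch (not our actual implementation), region detection on an EC2 host typically looks something like the following, using the instance metadata service:

    # Simplified sketch of automatic region detection on an EC2 host (IMDSv2).
    # Illustrative only; this is not our production code.
    import urllib.request

    METADATA = "http://169.254.169.254/latest"

    def detect_region() -> str:
        # Request a short-lived IMDSv2 session token.
        token_req = urllib.request.Request(
            f"{METADATA}/api/token",
            method="PUT",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
        )
        token = urllib.request.urlopen(token_req, timeout=2).read().decode()

        # Ask the instance metadata service which region this host runs in.
        region_req = urllib.request.Request(
            f"{METADATA}/meta-data/placement/region",
            headers={"X-aws-ec2-metadata-token": token},
        )
        return urllib.request.urlopen(region_req, timeout=2).read().decode()

    print(detect_region())  # e.g. "eu-west-1"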

We now know that the change did not work as expected in the Europe and Australia regions: nginx was booting the Intercom Ruby on Rails application with the configuration for the USA region. Because of how we manage nginx configuration changes, the change only applied to new hosts being brought into service; already-running hosts were unaffected. As a result, automated tests on new builds of the Intercom application in the Europe and Australia regions did not exercise this configuration change, and the build went into production.
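
One way to close this gap is to verify, as part of bringing a new host into service, that the application actually booted with the configuration we expect. A hypothetical post-boot check might look like this (the health endpoint and region header are illustrative, not our real interface):

    # Hypothetical post-boot verification run against a freshly provisioned host.
    # The endpoint and header names are illustrative assumptions.
    import sys
    import urllib.error
    import urllib.request

    EXPECTED_REGION = "eu-west-1"                  # what this host should boot with
    HEALTH_URL = "http://localhost/health-check"   # hypothetical health endpoint

    try:
        resp = urllib.request.urlopen(HEALTH_URL, timeout=5)
    except urllib.error.URLError as exc:
        sys.exit(f"health check failed: {exc}")

    region = resp.headers.get("X-App-Region", "")  # hypothetical response header
    if region != EXPECTED_REGION:
        sys.exit(f"booted with region {region!r}, expected {EXPECTED_REGION!r}")

    print("host booted with the expected configuration")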

Intercom relies on AWS Auto Scaling to bring capacity into and out of service in response to customer demand. The Europe region web fleet scaled down as normal as customer traffic dropped off after business hours. When customer traffic started rising at the start of business hours on the morning of 22nd November, new hosts failed to come into service.
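
For context, a target-tracking scaling policy roughly along these lines (a simplified sketch using boto3; the group name and target value are illustrative, not our production settings) is what keeps web capacity matched to traffic:

    # Simplified sketch of a target-tracking scaling policy (boto3).
    # Group name and target value are illustrative, not our production settings.
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

    autoscaling.put_scaling_policy(
        AutoScalingGroupName="web-fleet",          # hypothetical group name
        PolicyName="track-average-cpu",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": 50.0,                   # aim to hold ~50% average CPU
        },
    )

When every new host fails to boot the application, a policy like this keeps requesting capacity that never becomes healthy, which is why the fleet fell behind as traffic rose.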

As traffic levels grew, the in-service web servers became overwhelmed. Latencies started to increase across the web fleet around 07:00. By 07:30 we were out of web capacity and some requests started to be queued. By 07:45 the web application was practically unusable due to slow responses from the web servers. Shortly after 08:00 the queues were full and errors started to be returned.

The first paging alarms fired at 08:02. Our on-call engineer triaged the problem and followed our standard incident management process, engaging an incident commander at 08:15. Our Status Page was updated at 08:24. Standard mitigations such as scaling up, rolling back and redeploying the application were tried without success, and more engineers were brought in to investigate the problem. At 08:26 we identified that most web servers were not passing health checks, and by 08:36 we had identified that the Ruby on Rails application was not booting successfully. At 08:52 the configuration change was rolled back. The rollback deployment started going out to the web fleet at 09:08; from that point, new hosts brought into service were able to boot the application. At 09:13 we took action to quickly remove the failing servers, which were replaced with fresh hosts. Intercom fully recovered at 09:15.

What went right

  • Despite the incident occurring outside business hours, we had many engineers on hand to help troubleshoot the problem.
  • Once the need for a rollback was identified, the rollback was quick.

What could be better

  • The system had been degrading for an hour as traffic ramped up, but we were not alerted until the application was returning errors.
  • Our deployment systems are typically robust and catch problems with many types of infrastructure and configuration changes, but they failed to catch this bad configuration change.
  • The application boot failures were silent: no exceptions were thrown, and the problem was not visible in our observability tooling. This slowed both our response and our recovery (see the sketch after this list for one way such failures could be surfaced).
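
One way to make these failures loud (a sketch only; the pre-flight command, metric name and namespace are illustrative assumptions) is a pre-flight step on each host that emits an explicit signal whenever the application fails to load:

    # Hypothetical pre-flight check run before a host is put into service: it
    # attempts to load the application and emits an explicit CloudWatch metric
    # if that fails, so boot problems show up in dashboards and alarms.
    # The command, namespace and metric names are illustrative assumptions.
    import subprocess
    import sys

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

    # Hypothetical command that loads the application without serving traffic.
    PREFLIGHT = ["bin/rails", "runner", "Rails.application.eager_load!"]

    result = subprocess.run(PREFLIGHT, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"application failed to boot:\n{result.stderr}", file=sys.stderr)
        cloudwatch.put_metric_data(
            Namespace="WebFleet/Boot",             # hypothetical namespace
            MetricData=[{"MetricName": "BootFailures", "Value": 1.0, "Unit": "Count"}],
        )
        sys.exit(1)
    print("application booted cleanly")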

Actions

  1. Tune timeouts and thresholds for synthetic monitoring to fire sooner when latency degrades. This will alert us sooner to capacity or performance problems.
  2. Separate the web application from the high-volume Messenger and REST APIs in Europe and Australia. We should not need to rely on autoscaling for the web application to remain online. This will improve our availability by reducing the potential blast radius of the web serving fleet.
  3. Alarm on hosts not successfully being brought into service at the infrastructure level (a sketch of one possible alarm follows this list).
  4. Ensure that nginx configuration changes are exercised fully in the test pipeline, and review any other similar long-running processes that are configured during instance boot.
  5. Ensure that application boot problems are very visible in our observability tooling.
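
For the third action above, a minimal sketch of the kind of infrastructure-level alarm we have in mind, using CloudWatch metric math on Auto Scaling group metrics (the group name, thresholds and notification target are illustrative, and it assumes group metrics collection is enabled):

    # Sketch of a CloudWatch metric-math alarm that fires when an Auto Scaling
    # group wants more hosts than it has in service for several minutes.
    # Group name, periods and the SNS topic are illustrative assumptions.
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

    def group_metric(metric_id: str, metric_name: str) -> dict:
        # Helper to reference an Auto Scaling group metric in the math expression.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/AutoScaling",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "AutoScalingGroupName", "Value": "web-fleet"}],
                },
                "Period": 60,
                "Stat": "Average",
            },
            "ReturnData": False,
        }

    cloudwatch.put_metric_alarm(
        AlarmName="web-fleet-hosts-not-in-service",
        ComparisonOperator="GreaterThanThreshold",
        Threshold=0.0,
        EvaluationPeriods=3,
        DatapointsToAlarm=3,
        TreatMissingData="breaching",
        Metrics=[
            {
                "Id": "shortfall",
                "Expression": "desired - inservice",
                "Label": "Hosts requested but not in service",
                "ReturnData": True,
            },
            group_metric("desired", "GroupDesiredCapacity"),
            group_metric("inservice", "GroupInServiceInstances"),
        ],
        AlarmActions=["arn:aws:sns:eu-west-1:123456789012:oncall-page"],  # hypothetical topic
    )
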
Posted Nov 22, 2023 - 20:31 UTC

Resolved
Between 2023-11-22 0733 UTC and 2023-11-22 0915 UTC, customers hosted in our EU Data Center (https://app.eu.intercom.com) would have experienced high error rates and latencies across all Intercom services due to a configuration error. This issue has been fixed, and all services are working as expected.

Customers hosted in our US (app.intercom.com) and Australia (app.au.intercom.com) regions were unaffected during this time.
Posted Nov 22, 2023 - 09:25 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Nov 22, 2023 - 09:17 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Nov 22, 2023 - 09:11 UTC
Update
We are continuing to investigate this issue.
Posted Nov 22, 2023 - 08:39 UTC
Investigating
We are currently investigating an increase in error rates and latencies for customers hosted in our Intercom Europe Data Center - hosted on https://app.eu.intercom.com.

Customers hosted in our US (app.intercom.com) and Australia (app.au.intercom.com) regions are unaffected at this time.
Posted Nov 22, 2023 - 08:24 UTC
This incident affected: Intercom Europe (Intercom Web Application, Intercom Web Messenger, Intercom Mobile Messenger, Intercom APIs, Intercom Webhooks).