On 31st March 2025, an engineer was adding a new shared cache between Intercom's Ruby on Rails application and a separate Python application that serves AI-related functionality, such as calls to third-party LLM APIs. The Ruby on Rails application serves Intercom customers directly, while the Python application is only accessible internally. The purpose of the shared cache is to enable enhanced telemetry about the customer experience of Intercom's AI features. The cache had previously been rolled out in the Python application and was now being added to the Ruby on Rails application.
Intercom uses EC2 Security Groups to segment access to internal services and infrastructure inside the production AWS accounts. The rollout of the new cache involved permitting access to the cache by updating the relevant security groups. This change was made using our Infrastructure as Code system by an experienced infrastructure engineer, was reviewed by another experienced infrastructure engineer, passed tests, and was automatically deployed to production without issue. Connectivity between the Intercom Ruby on Rails application and the new cache was manually verified in production.
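To make the mechanics concrete, here is a minimal sketch of the kind of rule such a change adds, expressed with boto3 rather than Intercom's actual Infrastructure as Code tooling; the security group IDs and the cache port are placeholders, not values from the incident.

```python
# Hypothetical sketch of the kind of security group change involved,
# expressed with boto3 rather than Intercom's actual IaC tooling.
# All IDs and the port are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

CACHE_SECURITY_GROUP = "sg-0cache00000000000"  # security group on the new cache
APP_SECURITY_GROUP = "sg-0appfleet000000000"   # security group on the application fleet

# Allow the application fleet to open connections to the cache port.
ec2.authorize_security_group_ingress(
    GroupId=CACHE_SECURITY_GROUP,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 6379,  # assumed cache port
            "ToPort": 6379,
            "UserIdGroupPairs": [{"GroupId": APP_SECURITY_GROUP}],
        }
    ],
)
```

Note that a rule like this only grants access to fleets whose security group is referenced, which is exactly the gap described later in this post: the web-facing fleets use a different set of security groups.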
A code change was made to start using the new cache in the telemetry API endpoint in the Ruby on Rails application, and was merged into the production code-base at 10:29 UTC. After passing tests, the change started being deployed to production at 10:43 UTC. At 10:46 UTC latencies started to increase on the USA REST API fleet and the Europe web fleet. In the USA region, Intercom has separate fleets for the REST API, mobile messenger, web messenger and web application. The latency increase was caused by the telemetry API going from approximately 60ms per request to over 30 seconds per request. The telemetry APIs are relatively high volume, and serving these slow requests consumed all serving capacity on these fleets, causing a full outage of the REST API in the USA and of the REST API, mobile messenger, web messenger and web application in Europe. The Australia region was also affected, but did not go fully down.
At 10:52 UTC alarms fired for the disruption in the Europe region, the deployment was automatically rolled back, and the Europe region endpoints were back up at 10:57 UTC. The USA REST API took until 11:05 UTC to fully recover due to retries by clients.
The cause of the elevated latencies was that the web fleets had no access to the new cache. As a result, the calls from the web fleets to the cache took 2 seconds each to time out, and there were numerous such calls per typical API request. Intercom uses a separate set of EC2 Security Groups for web-facing fleets, and the engineers were unaware of this at the time. The tests carried out had only verified connectivity from non-web-facing fleets.
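As a rough illustration of how this produced 30-second requests (the number of cache calls per request and the client behaviour below are assumptions, not Intercom's actual code): a blocking client with a 2-second connect timeout, invoked around 15 times sequentially per request, turns a sub-100ms endpoint into a roughly 30-second one.

```python
# Hypothetical sketch: how per-call connect timeouts compound per request.
# The 2-second timeout matches the behaviour described above; the number of
# cache calls per request is an assumption for illustration.
import socket
import time

CACHE_HOST = "cache.internal.example"  # placeholder hostname
CACHE_PORT = 6379                      # assumed cache port
CONNECT_TIMEOUT_SECONDS = 2.0

def cache_lookup(key: str):
    """Attempt one cache read; if a security group silently drops the traffic,
    the connect attempt hangs until the timeout expires and we fall through."""
    try:
        with socket.create_connection((CACHE_HOST, CACHE_PORT),
                                      timeout=CONNECT_TIMEOUT_SECONDS):
            return None  # a real client would issue the read here
    except OSError:
        return None  # treat an unreachable cache as a miss

def handle_telemetry_request(keys):
    start = time.monotonic()
    for key in keys:  # ~15 sequential lookups -> ~30s when every call times out
        cache_lookup(key)
    return time.monotonic() - start
```

A shorter connect timeout, or failing fast once the cache is known to be unreachable, would cap the per-request penalty; whether either is appropriate is a design question for the actual client.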
A second outage occurred in the Europe region an hour later, when the same breaking code was inadvertently deployed. An unrelated disruption to Intercom's deployment process had caused a backlog of changes that were ready to be deployed, and the automatic rollback had also globally locked all deployments. After the problem with the deployment tool was fixed, the revert of the code change that used the new cache was merged and manually verified as ready to deploy in the USA region, and deployments were unlocked. However, the revert was not yet ready to deploy in the Europe region because the automated tests there took 2 minutes longer to run. The deployment that went out to Europe therefore still contained the calls to the new cache. Once again alarms fired and the deployment was automatically rolled back. This caused downtime on the same endpoints in the Europe hosting region between 12:30 and 12:40 UTC.
The automated rollback mechanism reacted quickly and did the right thing in both outages.
The cause(s) of the outages were quickly identified.
Complicated EC2 Security Groups make it hard for engineers to reason about infrastructure changes and make simple changes safely. No tests existed to catch mismatched security groups used within the same application.
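One way to guard against this class of drift, sketched hypothetically below (the security group IDs are placeholders and the discovery logic is an assumption about how fleets could be identified), is a test that compares which shared dependencies grant ingress to each fleet's security group and fails when web-facing and non-web-facing fleets diverge:

```python
# Hypothetical drift check: assert that every fleet serving the same
# application is granted access to the same shared dependencies.
# Group IDs are placeholders; a real check would discover them by tag.
import boto3

ec2 = boto3.client("ec2")

FLEET_SECURITY_GROUPS = {
    "web": "sg-0webfleet000000000",
    "non_web": "sg-0workerfleet0000000",
}

def granted_dependencies(fleet_group_id: str) -> set:
    """Return (dependency_group_id, port) pairs whose ingress rules reference
    the given fleet security group."""
    grants = set()
    paginator = ec2.get_paginator("describe_security_groups")
    for page in paginator.paginate():
        for group in page["SecurityGroups"]:
            for permission in group.get("IpPermissions", []):
                for pair in permission.get("UserIdGroupPairs", []):
                    if pair.get("GroupId") == fleet_group_id:
                        grants.add((group["GroupId"], permission.get("FromPort", -1)))
    return grants

def test_fleets_share_dependency_access():
    web = granted_dependencies(FLEET_SECURITY_GROUPS["web"])
    non_web = granted_dependencies(FLEET_SECURITY_GROUPS["non_web"])
    assert web == non_web, f"Security group drift between fleets: {web ^ non_web}"
```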
Unlocking the deployment pipelines across the regions is a global action, but each pipeline works independently, making it possible to inadvertently deploy undesirable versions of code.
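A hedged sketch of one possible mitigation, using an invented pipeline model rather than Intercom's actual deployment tooling: make the global unlock refuse to proceed when regions disagree on the release candidate that would ship next, so the situation from the second outage requires an explicit override.

```python
# Hypothetical guard for a global deployment unlock: refuse the global
# action when regions disagree on what would ship next. The Pipeline
# class and its fields are invented for illustration.
from dataclasses import dataclass

@dataclass
class Pipeline:
    region: str
    release_candidate: str  # e.g. the git SHA queued to deploy next
    locked: bool = True

def unlock_all(pipelines, force: bool = False) -> None:
    candidates = {p.release_candidate for p in pipelines}
    if len(candidates) > 1 and not force:
        raise RuntimeError(
            "Release candidates differ between regions "
            f"({candidates}); unlock each region explicitly or pass force=True."
        )
    for pipeline in pipelines:
        pipeline.locked = False

# Example mirroring the second outage: Europe's pipeline still held the
# breaking change while the USA pipeline held the revert.
pipelines = [
    Pipeline(region="us-east-1", release_candidate="revert-abc123"),
    Pipeline(region="eu-west-1", release_candidate="breaking-def456"),
]
try:
    unlock_all(pipelines)
except RuntimeError as error:
    print(error)
```

This mirrors the last of the immediate actions listed below.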
Adding a new dependency globally was unnecessary. A staggered per-region rollout and/or a feature flag to opt in to the new functionality would have reduced the possibility of multiple outages sharing the same root cause.
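As a minimal sketch of the opt-in approach (the flag name, flag client, and fallback store are assumptions for illustration, not Intercom's actual code):

```python
# Hypothetical feature-flag guard around the new cache dependency.
# `feature_flags` and the flag name are placeholders for whatever
# flagging system is actually in use.
def fetch_telemetry(request, feature_flags, cache, fallback_store):
    # Opt in region by region (or customer by customer) rather than
    # enabling the new dependency everywhere in a single deploy.
    if feature_flags.enabled("telemetry-shared-cache", region=request.region):
        value = cache.get(request.key)
        if value is not None:
            return value
    # The pre-existing code path keeps working when the flag is off or
    # the cache misses, so a bad rollout degrades instead of failing.
    return fallback_store.get(request.key)
```

Rolling the flag out per region also limits a bad change to one region at a time, and turning a flag off is faster than reverting and redeploying code.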
The impact was smaller in the USA region due to the use of multiple web serving fleets. These were put in place due to the increased scale in the USA, but they also reduce the blast radius of outages caused by problems with individual APIs.
Alarms took 4 minutes to fire, and rollbacks took 5 minutes.
Immediate Actions
Create dedicated fleets in the Europe and Australia regions (already planned, but will be done this week).
Investigate consolidating all production applications' EC2 Security Groups to a single Security Group per production environment (no web/worker split) and/or prevent drift between the Security Groups.
Tune inbox activity alarms to fire faster.
Speed up rollbacks - prioritising web fleets and speeding up on-host deployments.
Investigate safer use of global deployment unlocks - requiring explicit override when the release candidate differs between regions.
As always, we apologize for the impact of these outages for our customers. Improving the availability of our service is our number one engineering priority. Please get in contact directly with me if there's anything I can help with: brian.scanlan@intercom.io.