Intercom in USA region down/slow

Write-up

Inbox outage in USA region - October 17th 2025

Brian Scanlan Oct 17, 2025

Between 11:46 and 11:57 UTC, the Inbox was down in the USA region. The outage was caused by a change to how unsafe URLs are processed in Intercom's backend. All URLs in conversations are examined for their reputation, and unsafe URLs are clearly identified in the Intercom UI. For performance reasons, the tracking of unsafe URLs was moved to the conversation level. However, the change added a database query to very busy code paths, and the query did not use an index because the tables it referenced were in different keyspaces in our sharded MySQL database. This forced full table scans, causing slow queries in hot code paths, which made the Inbox unusable.

Our heartbeat monitoring system detected a drop in the rate of Inbox activity, alarms triggered, and our deployment tool automatically rolled back the change. While the automatic rollback did eventually restore Inbox functionality, the impact was prolonged by our deployment mechanism, which replaces running processes one at a time. This works well under normal circumstances or when processes are experiencing errors, but it allows each process a “grace period” to finish the work it is doing before it is replaced. When processes are blocked, as they were in this instance on very slow database queries, the mechanism waits out the full grace period before each replacement, which significantly slows recovery.
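To illustrate why blocked processes slow a one-at-a-time replacement, here is a minimal sketch in Python. It is not our deployment tool: the function name, the process count, the 30-second grace period, and the 1-second restart cost are all hypothetical values chosen only to show the arithmetic.

```python
def rolling_replace_seconds(finish_times, grace_period, restart_cost=1.0):
    """Estimate the total time to replace processes one at a time.

    finish_times: seconds each process needs to finish its in-flight work.
    grace_period: maximum seconds the deployer waits before replacing anyway.
    restart_cost: seconds to start each replacement (assumed constant).
    """
    total = 0.0
    for finish in finish_times:
        # The deployer waits for the work to finish, capped at the grace period.
        total += min(finish, grace_period) + restart_cost
    return total


# Hypothetical fleet of 10 processes with a 30-second grace period.
healthy = rolling_replace_seconds([0.5] * 10, grace_period=30)    # work finishes quickly
blocked = rolling_replace_seconds([300.0] * 10, grace_period=30)  # stuck on slow queries

print(healthy)  # 15.0
print(blocked)  # 310.0
```

In this model, healthy processes never come close to the grace period, while blocked processes consume all of it on every replacement, so recovery time grows with the number of processes multiplied by the grace period.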

Next steps:

[Done] Co-locate the tables in the relevant query to ensure consistent performance and prevent recurrence.
[Done] Speed up detection of impact by sharpening alarm thresholds.
[In progress] Block high-volume joins on tables spread across keyspaces.
[Pending] Investigate speeding up the deployment of new code into production to reduce impact windows.
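As a rough illustration of what blocking joins across keyspaces could look like, the Python sketch below checks whether the tables in a query all live in the same keyspace. It is an assumption-laden example, not the actual safeguard: the table names, keyspace names, and the `check_join` helper are all hypothetical.

```python
# Hypothetical table-to-keyspace mapping; the names are illustrative only.
TABLE_KEYSPACE = {
    "conversations": "ks_conversations",
    "conversation_parts": "ks_conversations",
    "unsafe_urls": "ks_links",
}


def keyspaces_for(tables):
    """Return the set of keyspaces touched by a query over `tables`."""
    return {TABLE_KEYSPACE[table] for table in tables}


def check_join(tables):
    """Raise ValueError if a join would span more than one keyspace."""
    spaces = keyspaces_for(tables)
    if len(spaces) > 1:
        raise ValueError(
            f"cross-keyspace join over {sorted(tables)}: {sorted(spaces)}"
        )


# Tables in the same keyspace pass the check.
check_join(["conversations", "conversation_parts"])

# Tables in different keyspaces are rejected.
try:
    check_join(["conversations", "unsafe_urls"])
except ValueError as error:
    print(error)
```

A check like this could run in tests or at query-build time, so that a join across keyspaces is caught before it reaches a busy production path.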

Timeline (UTC):

11:41:53 - Deployment starts

11:45:15 - Deployment completes

11:46:00 - Application latency starts rising

11:48:59 - Inbox activity alarm fires

11:50:01 - Automatic rollback starts

11:52:02 - Status Page updated

11:52:17 - Automatic rollback completes

11:54:00 - Inbox activity starts recovering

11:57:00 - Inbox activity at normal levels

11:59:59 - Inbox activity alarm goes ok

12:01:30 - All metrics nominal
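
The intervals between the events above can be computed directly from the timestamps. The short Python sketch below does this; the helper name `seconds_between` and the choice of intervals are ours, and the timestamps are taken from the timeline.

```python
from datetime import datetime


def seconds_between(start, end):
    """Return whole seconds between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# Timestamps (UTC) copied from the timeline.
time_to_alarm = seconds_between("11:46:00", "11:48:59")      # latency rising -> alarm fires
alarm_to_rollback = seconds_between("11:48:59", "11:50:01")  # alarm fires -> rollback starts
rollback_duration = seconds_between("11:50:01", "11:52:17")  # rollback start -> complete
total_impact = seconds_between("11:46:00", "11:57:00")       # latency rising -> normal levels

print(time_to_alarm)      # 179
print(alarm_to_rollback)  # 62
print(rollback_duration)  # 136
print(total_impact)       # 660
```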