On March 16, 2026, between 15:44 and 16:08 UTC, EU customers experienced degraded availability of Intercom's Inbox product. Customers using the Mobile Messenger experienced errors and latency between 16:06 and 16:12 UTC. Fin was not impacted.
The incident was triggered by a cascading failure in our internal telemetry and tracing pipeline, which inadvertently consumed the shared compute capacity of our primary web servers. This prevented normal traffic from being served.
We understand that Intercom is core to how you run your business, and that any downtime directly impacts your ability to serve your own customers. We sincerely apologize for the disruption this caused.
Our application sends all tracing telemetry data to a sampling proxy service (Refinery), which then forwards the sampled data to our tracing provider. This service became unavailable after an out-of-memory (OOM) event terminated running processes. The remaining processes could not absorb the additional load and also crashed. Critically, an undetected configuration error deployed four days earlier prevented crashed processes from restarting and new processes from starting.
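To illustrate the failure mode, here is a minimal sketch, assuming a hypothetical configuration file, format, and loader rather than Refinery's actual configuration: because the configuration is only parsed at process start-up, a merged syntax error lies dormant until a process next tries to start, at which point it exits before it can serve traffic.

```python
# Minimal sketch of the restart failure mode (hypothetical config name and
# format, not Refinery's real configuration): the config is only parsed when
# a process starts, so a merged syntax error lies dormant until the next
# restart or scale-up event.
import sys
import yaml  # assumes PyYAML is available


def load_sampling_config(path: str) -> dict:
    """Parse the sampling config; any syntax error aborts start-up."""
    with open(path) as f:
        try:
            return yaml.safe_load(f)
        except yaml.YAMLError as err:
            # The process exits before it can register as healthy, so neither
            # crashed nor newly launched nodes ever come back into service.
            print(f"invalid sampling config: {err}", file=sys.stderr)
            sys.exit(1)


if __name__ == "__main__":
    config = load_sampling_config("sampling_rules.yaml")  # hypothetical path
    print("loaded", len(config or {}), "top-level keys")
```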
Our web application fleet proxies browser telemetry data, such as page load times, to Refinery on behalf of frontend clients. This forwarding is handled by the same pool of web server processes that serves all other Inbox API requests, such as loading conversations, sending replies, and updating assignments.
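The sketch below illustrates that coupling. The framework, route names, and Refinery URL are assumptions for the example, not our actual application code; the point is that both endpoints are served by the same worker pool, so a slow upstream holds an Inbox-capable worker for the full duration of the forward.

```python
# Illustrative only: hypothetical Flask routes and Refinery URL, not our
# actual application. Both endpoints are served by the same worker pool, so a
# slow telemetry forward ties up a worker that could otherwise serve the Inbox.
from flask import Flask, jsonify, request
import requests

app = Flask(__name__)
REFINERY_URL = "http://refinery.internal:8080/v1/traces"  # hypothetical


@app.route("/api/conversations")
def list_conversations():
    # Normal Inbox traffic: loading conversations, replies, assignments, ...
    return jsonify(conversations=[])


@app.route("/telemetry/traces", methods=["POST"])
def forward_browser_telemetry():
    # Browser telemetry (e.g. page load times) proxied to the sampling
    # service. While this call blocks, the worker cannot serve Inbox requests.
    resp = requests.post(REFINERY_URL, data=request.get_data(), timeout=10)
    return ("", resp.status_code)
```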
With Refinery unavailable, each telemetry forwarding request occupied a request-serving process for up to 40 seconds (median ~15 seconds), due to our timeout settings and exponential backoff on retries. The incoming rate of frontend telemetry requests, combined with how long each one was held open, was enough to exhaust the available server processes. With no processes left to handle normal Inbox traffic, EU Inbox requests timed out, returned errors, or saw very high latency.
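For a rough sense of the arithmetic, the schedule below is an assumption chosen to match the figures above, not our exact production retry policy:

```python
# Hypothetical retry schedule chosen to match the figures above, not our
# actual production settings: three attempts with a 10-second export timeout
# each, plus exponential backoff between attempts.
TIMEOUT_S = 10        # per-attempt export timeout (since reduced to 2 s)
ATTEMPTS = 3
BACKOFFS_S = [2, 8]   # exponential backoff between attempts

worst_case = ATTEMPTS * TIMEOUT_S + sum(BACKOFFS_S)
print(f"worst-case worker occupancy: {worst_case} s")  # 40 s
```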
We continue to investigate the root cause of the increased error rate and latency of our Mobile Messenger fleet.
March 12, 16:16: Telemetry change containing a syntax error is merged.
March 16, 15:40:36: Refinery node stops reporting incoming spans; initial failure begins.
15:42:29: Auto-scaling launches a new Refinery node, which immediately fails to start due to the dormant syntax error.
15:43:30: Remaining Refinery nodes take on 200% of normal span volume.
15:43:45: Inbox degradation begins in the EU region.
15:44:34: Emergency auto-scaling for the web fleet begins adding instances.
15:45:00: Refinery load balancer reports 0 healthy targets; complete downstream tracing outage.
15:46:00: Incident automatically declared; engineering and an incident commander are paged and begin investigating.
16:06:00: Latency and errors begin in the Mobile Messenger.
16:06:04: Engineers initiate a manual rollback of the Refinery configuration pipeline. Recovery begins.
16:08:00: Inbox fully recovered.
16:12:00: Mobile Messenger fully recovered.
We have fixed the invalid configuration for the tracing proxy service.
We have added more processes to the tracing proxy service, so that losing a small set of processes should no longer cascade to the rest of the fleet.
We have moved the browser telemetry trace export off the critical path, so slowdowns or errors in the pipeline no longer consume resources needed for production traffic; a sketch of this approach follows this list.
We revised client settings for the frontend trace export, including reducing the export timeout from 10 seconds to 2 seconds, to limit the impact of telemetry failures on production resources.
We have hardened the validation of configuration changes to the proxy service so that invalid configurations cannot pass our pre-production stages.
We have added additional monitoring and alarming to detect degradation in the overall health of the proxy service.
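As a rough sketch of the critical-path change and the tighter client timeout described above (the queue, worker, names, and URL are illustrative assumptions, not our actual implementation):

```python
# Illustrative sketch only (hypothetical names, not our actual code): the web
# worker enqueues telemetry and returns immediately; a background thread
# exports with a short timeout, so pipeline failures cannot hold
# request-serving processes.
import queue
import threading

import requests

REFINERY_URL = "http://refinery.internal:8080/v1/traces"  # hypothetical
EXPORT_TIMEOUT_S = 2                                       # reduced from 10 s
telemetry_queue: "queue.Queue[bytes]" = queue.Queue(maxsize=10_000)


def handle_browser_telemetry(payload: bytes) -> int:
    """Called from the web worker: enqueue and return immediately."""
    try:
        telemetry_queue.put_nowait(payload)
    except queue.Full:
        pass  # shed telemetry under pressure rather than block the worker
    return 202  # accepted


def export_loop() -> None:
    """Runs off the request path; slow or failing exports affect only this thread."""
    while True:
        payload = telemetry_queue.get()
        try:
            requests.post(REFINERY_URL, data=payload, timeout=EXPORT_TIMEOUT_S)
        except requests.RequestException:
            pass  # drop on failure; losing telemetry is preferable to an outage


threading.Thread(target=export_loop, daemon=True).start()
```

The trade-off in this design is that telemetry can be dropped when the pipeline is unhealthy, which is the intended behavior: losing trace data is preferable to losing request-serving capacity.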
We take the reliability of Intercom seriously. This incident exposed gaps in how we deploy and monitor our tracing infrastructure, and in how our telemetry pipeline interacts with our core application. The improvements above address both the immediate trigger and the underlying architectural issues that allowed a telemetry failure to affect customer-facing availability.