Intercom

Write-up
Intercom Inbox and Messenger Impacted
October 20, 2025 – AWS us-east-1 Region Outage

Between 06:48 UTC and 21:14 UTC on October 20, 2025, Intercom experienced a major outage in our US hosting region due to widespread availability issues with AWS (Amazon Web Services) in the us-east-1 region.

As a result, customers in the US data hosting region were unable to access the Intercom app from 06:48 UTC to 09:24 UTC. As AWS systems began to recover, Intercom became available again in a degraded state, with customers experiencing increased errors and latency. To aid recovery, we blocked new deployments to our app to reduce the number of changes happening in production. During this phase of the incident we were unable to acquire new capacity due to throttling from AWS, so our engineering team implemented load shedding on available hosts to prioritize and maintain core functionality across Inbox, Messenger, and Fin. The team also carried out optimisation work to ensure we were using our existing capacity effectively.
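Load shedding of this kind typically means rejecting lower-priority traffic once hosts approach capacity so that core services keep working. The sketch below is a minimal illustration of the general technique, not Intercom's actual implementation; the service tiers and threshold are hypothetical.

```python
# Hypothetical priority tiers -- Intercom's real classification is not public.
CORE_SERVICES = {"inbox", "messenger", "fin"}

def should_shed(service: str, utilization: float, threshold: float = 0.85) -> bool:
    """Decide whether to reject a request for `service`.

    Below the utilization threshold, everything is served. Above it,
    only core services get through, preserving the remaining capacity
    for the functionality that matters most during an incident.
    """
    if utilization < threshold:
        return False  # plenty of headroom: serve everything
    return service not in CORE_SERVICES  # under pressure: core only
```

In practice a check like this would sit in request middleware, with utilization fed from live host metrics rather than passed in directly.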

Once we saw that we could again acquire new capacity from AWS, our team began undoing our emergency changes: scaling back up the workloads that had been scaled down, eliminating residual latency, and restoring complete reliability across all services.

In the EU and AU regions, Intercom Phone and Fin Voice were affected due to Twilio’s services being impacted by the same AWS event. Our Slack integration also experienced intermittent connectivity issues because Slack was affected. These dependencies recovered as AWS services stabilized.

Root Cause

The outage was triggered by a race condition in AWS’s internal DNS management system for DynamoDB in the us-east-1 region, which resulted in the deletion of a critical DNS record and disrupted communication with DynamoDB. This failure cascaded across AWS control plane services, including those responsible for EC2 instance management and network load balancer health monitoring, preventing new instances from being launched and causing intermittent connectivity between existing hosts and dependent services such as SQS, Lambda, and DynamoDB.

As a result, Intercom’s infrastructure was unable to provision new EC2 capacity or replace unhealthy hosts during the incident, and when global traffic increased, autoscaling attempts were throttled by AWS, preventing recovery and amplifying the customer impact.
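When a provider throttles capacity requests, the standard client-side mitigation is to retry with capped exponential backoff and jitter, so many callers do not hammer the API in lockstep. A generic sketch of that pattern, under the assumption of a provider that signals throttling with an exception (this is not Intercom's tooling):

```python
import random
import time

class ThrottledError(Exception):
    """Raised when the provider rejects a call due to rate limiting."""

def launch_with_backoff(launch_fn, max_attempts=5, base_delay=1.0, cap=30.0):
    """Retry a throttled capacity request with capped exponential
    backoff plus full jitter. `launch_fn` is any zero-argument callable
    that raises ThrottledError when rate-limited."""
    for attempt in range(max_attempts):
        try:
            return launch_fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttle to the caller
            # Sleep a random duration in [0, min(cap, base * 2^attempt)].
            delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
            time.sleep(delay)
```

Backoff helps once the provider starts recovering, but as this incident showed, no retry strategy can conjure capacity while the control plane itself is impaired.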

AWS have published a deep dive on the root cause of the incident which is available here: https://aws.amazon.com/message/101925/


Incident Timeline - UTC:

06:48 - Heartbeat metrics start to degrade, indicating the Intercom app is not in a healthy state

06:49 - Out-of-hours on-call paged

06:53 - Issue confirmed via Heartbeat Metrics

06:54 - Incident Commander + Comms Lead engaged

07:00 - Initial Status Page posted

07:02 - Subject matter expert engineers engaged

07:06 - Confirmed impact across the entire region, beyond Intercom’s infrastructure

07:11 - AWS provides their first update to the service health dashboard

08:52 - Intercom internally blocks shipping to limit the number of changes actively taking place in production

09:24 - Inbox access is largely restored for US data region customers as AWS shows early signs of recovery

09:30 - Fin starts processing messages in a degraded state

09:40 - Engineering scales down non-critical services to free capacity

11:15 - Fin starts serving requests as expected as the backlog has cleared

14:15 - All services start to degrade again as new instances are still unavailable from AWS and we reach peak traffic

16:17 - Engineering team moves some of our available hosts from serving Messenger requests to serving Inbox, to optimise our available capacity

20:37 - Full recovery begins as AWS addresses the root cause, allowing us to acquire new EC2 instances

20:52 - Inbox goes down briefly as we scale up and encounter rate limiting

21:14 - We achieve confidence in declaring full recovery and enter a monitoring state

21:53 - Intercom closes off the status page posting

22:06 - Intercom engineering team stands down