On January 27, 2026, between 17:03 and 22:09 UTC, a small subset of Intercom customers (< 1%) experienced degraded service including inconsistent latency and stale data when reading from the platform. The incident was caused by database replicas becoming overloaded, which prevented them from keeping up with changes to the primary database.
We have no evidence of customer data loss or corruption. Write operations continued to function normally throughout the incident; the impact was limited to read operations returning stale data.
We understand that Intercom is core to how you run your business, and that any downtime directly impacts your ability to serve your own customers. We sincerely apologize for the disruption this caused.
High-Level Summary
Intercom utilizes PlanetScale (built on Vitess) as a high-scale core database layer. The majority of our high-scale data is sharded across many database clusters to distribute load and ensure performance.
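To illustrate why sharding contains load, here is a minimal sketch of key-based shard routing. This is an assumption-laden toy: Vitess actually routes queries through vindex functions, and the shard count and hashing scheme below are invented for illustration.

```python
import hashlib

NUM_SHARDS = 64  # hypothetical shard count, not Intercom's real topology

def shard_for(workspace_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a workspace to a shard by hashing its ID.

    Because all rows for one workspace land on the same shard, heavy load
    generated by one workspace (or one background job touching it) is
    contained to that shard's primary and replicas.
    """
    digest = hashlib.sha256(workspace_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

The routing is deterministic, so every read and write for a given workspace consistently hits the same shard, which is also why a single overloaded shard affects only the customers mapped to it.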
The incident was triggered when a specific database shard experienced excessive load from two concurrent workloads: an internal app deletion process and our ETL data pipeline. This combined workload caused the database replicas for that shard to reach 100% CPU utilization.
When the database replica reaches 100% CPU, it cannot process incoming changes from the primary database quickly enough, creating replication lag. During this incident, replication lag reached up to 2.2 hours, meaning customers on this shard were seeing data that was over two hours old.
While replica CPUs were saturated, the replicas could not apply the stream of changes from the primary fast enough, and a backlog built up. For customers on this shard, this manifested as:
Conversations not appearing in their inbox immediately after creation
Messages appearing not to send (though they were written successfully)
Updates to settings or data not appearing to take effect
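The staleness described above can be reasoned about as a simple lag calculation. The sketch below assumes a heartbeat-style measurement (a timestamp written on the primary and compared against the last value applied on the replica); the function names and the 30-second staleness threshold are illustrative, not Intercom's actual values. In MySQL-based systems this figure typically comes from `Seconds_Behind_Source` in `SHOW REPLICA STATUS` or from a dedicated heartbeat table.

```python
def replication_lag_seconds(primary_heartbeat_ts: float,
                            replica_heartbeat_ts: float) -> float:
    """Lag = how far the replica's last-applied heartbeat trails the primary's."""
    return max(0.0, primary_heartbeat_ts - replica_heartbeat_ts)

def is_stale(lag_seconds: float, threshold_seconds: float = 30.0) -> bool:
    """Decide whether reads from this replica should be treated as stale.

    At the incident's peak, lag reached 2.2 hours (7,920 seconds), far past
    any reasonable threshold, which is why reads showed hours-old data.
    """
    return lag_seconds > threshold_seconds
```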
Technical deep dive
The trigger
Two resource-intensive workloads converged on the same database shard:
Internal app deletion workers: Our internal process for removing customer workspaces was running more concurrent deletion steps per app than usual, with no rate limiting. This process performs many database operations to cleanly remove all associated data.
ETL ingestion: Our data pipeline was simultaneously reading from the same shard to sync data for analytics and reporting purposes.
The overload
The combined load from these two processes exceeded the capacity of the replica databases. Both replicas serving the affected shard reached 100% CPU utilization. Unable to keep up with the incoming change stream from the primary, replication lag grew steadily - eventually reaching 2.2 hours.
Timeline in UTC
17:03 - Replica CPU on affected shard begins climbing; replication lag starts accumulating
17:25 - Alarm raised by a worker reporting an excessive message backlog, automatically declaring an incident
18:25 - ETL processing disabled to reduce load on the affected shard
21:33 - App deletion workers scaled to zero; ETL cautiously re-enabled after monitoring showed stable metrics
22:09 - All replicas return to healthy CPU levels; replication lag cleared; full recovery confirmed
Improvements completed
Shard scaling
We have proactively increased capacity on shards to provide additional headroom against unexpected load spikes.
App deletion worker protection
We have implemented rate limiting and resource controls on the app deletion process to prevent it from overwhelming database infrastructure. The worker now respects capacity limits and backs off when database load is elevated.
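The shape of such a worker can be sketched as a loop that caps steady-state throughput and pauses whenever database load is elevated. All names, thresholds, and the load signal below are assumptions for illustration; they are not Intercom's actual implementation or values.

```python
import time

def run_deletion_steps(steps, db_load_fn, max_per_second=10,
                       load_threshold=0.8, backoff_seconds=5.0,
                       sleep_fn=time.sleep):
    """Process deletion steps under a rate cap, backing off under load.

    db_load_fn returns a 0.0-1.0 replica load signal (hypothetical);
    while it exceeds load_threshold, the worker pauses instead of
    piling more work onto an already saturated replica.
    """
    interval = 1.0 / max_per_second
    completed = 0
    for step in steps:
        while db_load_fn() > load_threshold:  # replica under pressure: wait
            sleep_fn(backoff_seconds)
        step()                                # one deletion operation
        completed += 1
        sleep_fn(interval)                    # cap steady-state throughput
    return completed
```

The key design choice is that the worker treats database capacity as the scarce resource: it yields to customer-facing load rather than competing with it.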
Ongoing improvements
Workload isolation
We are reviewing how background processes like ETL and maintenance tasks interact with production database capacity. The goal is to ensure these workloads cannot compete with customer-facing operations for critical resources.
Improved monitoring
We are enhancing our alerting to detect replication lag earlier and correlate it with specific workloads, enabling faster identification and mitigation of similar issues.
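One common pattern for earlier detection is to alert on sustained lag growth rather than a single spike. The sketch below fires after N consecutive samples above a threshold; the sample cadence, threshold, and consecutive-sample count are illustrative assumptions, not Intercom's alerting configuration.

```python
def lag_alert_fires(lag_samples, warn_threshold=60.0, warn_after=3):
    """Return True when replication lag exceeds warn_threshold seconds
    for warn_after consecutive samples.

    Requiring consecutive breaches filters out momentary spikes while
    still catching a steady climb long before lag reaches hours.
    """
    consecutive = 0
    for lag in lag_samples:
        consecutive = consecutive + 1 if lag > warn_threshold else 0
        if consecutive >= warn_after:
            return True
    return False
```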
Customer communication
We are improving our processes for proactively communicating with affected customers when incidents have a limited blast radius, ensuring impacted customers receive timely updates even when a broader status page update is not warranted.
Contextualizing this failure
This incident, while disruptive for affected customers, demonstrates the value of our sharded database architecture. By distributing data across many independent shards, we contained what could have been a platform-wide degradation to less than 1% of customers.
We remain committed to the high standard of reliability you expect from Intercom and are implementing the improvements outlined above to prevent similar incidents in the future.