Elevated level of 5xx errors for Help Center

Summary

On August 14, 2025, between approximately 16:55 and 17:20 UTC, some Intercom customers' Help Centers failed to load, returning HTTP 5xx errors to end users. The root cause was a sudden, abnormal surge in traffic that overwhelmed backend systems. Immediate mitigation steps were taken, and longer-term improvements have been implemented to prevent recurrence.


Customer Impact
The incident began at approximately 16:55 UTC, and full recovery was achieved around 17:20 UTC. Impact was limited to customers hosted in the US region. Affected users received errors when accessing Help Center content; in some cases, content loaded slowly rather than failing outright.


Root Cause & Resolution
The incident was caused by an unexpected surge in traffic, primarily from a single customer, that overwhelmed the load balancer and the downstream Node services powering the Help Center. The spike exceeded our automated scaling and throttling limits. Most requests were blocked or dropped in the Node layer before reaching the Ruby application layer, so our application-level request throttling never engaged. Requests that queued in the Node layer timed out, producing widespread HTTP 5xx errors.
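
For illustration, the Node-layer throttling described above can be thought of as a per-customer token bucket sitting in front of the application. The following TypeScript sketch uses assumed capacities, refill rates, and tenant keys; it is not Intercom's actual implementation:

// Illustrative sketch only: a per-tenant token bucket at the Node edge.
// CAPACITY, REFILL_PER_SEC, and tenant keying are assumptions for the example.
interface Bucket { tokens: number; lastRefill: number; }

const CAPACITY = 100;       // assumed max burst per tenant
const REFILL_PER_SEC = 50;  // assumed sustained requests/sec per tenant
const buckets = new Map<string, Bucket>();

function allowRequest(tenantId: string, now: number = Date.now()): boolean {
  const b = buckets.get(tenantId) ?? { tokens: CAPACITY, lastRefill: now };
  const elapsedSec = (now - b.lastRefill) / 1000;
  b.tokens = Math.min(CAPACITY, b.tokens + elapsedSec * REFILL_PER_SEC);
  b.lastRefill = now;
  const allowed = b.tokens >= 1;
  if (allowed) b.tokens -= 1;
  buckets.set(tenantId, b);
  return allowed; // false => respond 429/503 here; the request never reaches Ruby
}

Because requests rejected at this layer never reach the Ruby application, any throttling logic living in the application cannot observe or shape that traffic, which is exactly the gap this incident exposed.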

Engineering mitigated the issue by manually scaling backend infrastructure, rerouting the problematic traffic to an overflow fleet, updating load balancer rules, and adjusting rate-limiting settings at the edge. These actions restored normal service.
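
As a hedged sketch of the rerouting step (the hostnames and flagging mechanism below are hypothetical), isolating one tenant's traffic onto an overflow fleet at the routing layer keeps it from starving shared capacity:

// Hypothetical routing rule: send flagged tenants to an overflow fleet.
const OVERFLOW_TENANTS = new Set<string>(); // populated manually mid-incident

function upstreamFor(tenantId: string): string {
  return OVERFLOW_TENANTS.has(tenantId)
    ? "https://helpcenter-overflow.internal" // isolated overflow fleet
    : "https://helpcenter-primary.internal"; // shared fleet
}

Per the write-up, the actual rerouting was done through load balancer rules rather than application code; the sketch only shows the shape of such a rule.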


Next Steps & Prevention

We have updated rate-limiting and DDoS protection rules to better detect and block abnormal traffic patterns. Work is underway to improve autoscaling logic so that infrastructure responds more quickly to sudden traffic spikes. Additionally, monitoring and alerting have been strengthened to detect similar issues earlier.
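
As a sketch of the autoscaling direction described above (metric names and thresholds are assumptions, not Intercom's actual policy), scaling on request rate and queue depth reacts faster than scaling on lagging CPU averages:

// Illustrative scaling decision driven by request rate and queue depth.
// PER_INSTANCE_RPS and QUEUE_PER_INSTANCE are assumed values for the sketch.
interface FleetMetrics {
  requestsPerSec: number;
  queuedRequests: number;
  instances: number;
}

function desiredInstances(m: FleetMetrics): number {
  const PER_INSTANCE_RPS = 200;   // assumed safe throughput per instance
  const QUEUE_PER_INSTANCE = 500; // assumed queue drained per extra instance
  const byRate = Math.ceil(m.requestsPerSec / PER_INSTANCE_RPS);
  const byQueue = m.queuedRequests > 0
    ? m.instances + Math.ceil(m.queuedRequests / QUEUE_PER_INSTANCE)
    : 0;
  // Never scale down during a spike; take the most aggressive estimate.
  return Math.max(m.instances, byRate, byQueue);
}

Scaling on queue depth in particular adds capacity as soon as requests start backing up, rather than waiting for averaged utilization metrics to cross a threshold.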