Intercom

Write-up
Intercom teammate web app is down

On 24th April 2025 between 20:09 UTC and 20:17 UTC, the Intercom teammate web application, including the Inbox, was down in the USA region. The same problem recurred on 25th April 2025 between 11:57 UTC and 12:07 UTC. During these periods, customers would have experienced general slowness and seen error pages.

Starting on 22nd April, we had been rolling out a new mechanism for updating counters in the Intercom Inbox. Counters show, in real time, how many conversations are in a specific custom view or folder in the Inbox. Synchronising these counts in real time across all clients is a complex task, and to make them more reliable, accurate and efficient, we recently started rewriting parts of this system. The rollout was incremental and actively monitored, and on Thursday the rewrite was enabled for all customers.

On the evening of the 24th, memory usage of the Rails processes on the web serving fleet started to increase due to a thread leak bug, ultimately exhausting all memory on the fleet. Earlier in the day, regular deployments had reset the leaked threads and the associated memory growth, but by the evening it had been over two hours since the last deployment. Most web serving hosts ran out of memory within a very short period of time. A latency-based scale-up of the web fleet, combined with automatic replacement of hosts failing health checks, recovered the situation. The engineering team put in place automatic deployments every 30 minutes, so that leaked threads would be cleared regularly, and fixed a suspected cause of the increased memory usage.
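For illustration, the sketch below shows how this failure mode can arise in a long-lived Ruby process. It is a simplified, hypothetical example rather than the actual counters code: each request spawns a background thread that is never joined or terminated, so the thread and everything it references remain resident until the process is redeployed.

```ruby
# Hypothetical illustration of a per-request thread leak, not the real
# counters implementation. Each call spawns a thread that never exits,
# so it (and everything it references) stays resident until the next deploy.
class LeakyCounterSync
  def self.refresh(view_id)
    Thread.new do            # a brand-new thread on every request
      loop do
        # ...push updated counts for view_id to connected clients...
        sleep 5
      end
    end                      # never joined, never killed
  end
end

# Every request handled by the process grows the thread count and memory:
100.times { LeakyCounterSync.refresh(42) }
puts Thread.list.count  # => 101 (the main thread plus 100 leaked workers)
```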

On 25th April, the outage was being investigated by multiple engineers. The 30-minute automatic deployments were disabled, as we had a probable cause and the redeployments can slow down regular deployments. An unrelated issue with the deployment system then prevented deployments from going out for over an hour. In that time the memory usage of the web fleet grew to 100% utilisation and caused a repeat of the same outage. We doubled the available memory on the web serving fleet and re-enabled the 30-minute automatic deployments.

We then expanded the scope of our investigation and identified the new counters implementation as a possible cause. The new implementation was unintentionally creating a new set of threads on each request without properly cleaning them up. Over time, this caused the memory usage of our web fleet to increase steadily, eventually exhausting available memory and preventing the servers from handling traffic. We have since shipped a fix for the thread leak and verified that the problem no longer exists. We will be following up with additional alarming and scaling on memory utilisation.
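As a general pattern (illustrative only, and not a description of Intercom's actual fix), this class of bug can be avoided by routing per-request work through a single long-lived worker per process instead of spawning new threads, and by exposing the live thread count so that growth can be alarmed on alongside memory utilisation. The names below (CounterSyncWorker, refresh) are invented for the sketch.

```ruby
# Illustrative remediation pattern: one shared, long-lived worker per process
# handles counter refreshes, so request handling never creates new threads.
require "singleton"

class CounterSyncWorker
  include Singleton

  def initialize
    @queue  = Queue.new
    @thread = Thread.new do
      loop do
        view_id = @queue.pop           # blocks until work arrives
        # ...recompute and publish counts for view_id...
      end
    end
  end

  # Called from request handling: enqueue work, never spawn a thread.
  def refresh(view_id)
    @queue << view_id
  end
end

CounterSyncWorker.instance.refresh(42)

# A cheap per-process health signal that monitoring could alarm on:
puts Thread.list.count  # stays constant under load instead of growing
```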