Intercom

Write-up
Inbox experiencing degraded performance

Between 03:30 and 06:00 UTC on 5th November 2024, many features of Intercom in the USA hosting region were slow or not working. The causes of the problem were the failure of the background job processing system due to a DNS-related configuration change, our failure to detect the issue before the change was rolled out to production, delays in diagnosing the problem, and our inability to quickly roll the change back. While no data was lost, there were significant delays across large parts of Intercom, including conversations being updated, Fin AI Agent answering questions, webhooks being delivered, and the freshness of data in the Intercom platform.

The cause of the problem was a change to DNS resolution on the Intercom server fleet. A caching resolver was added to the Linux server image as a follow-up to a previous outage and as a fix for occasional delays with DNS resolution on the server fleet. This Linux server image had been rolled out to numerous applications without issue prior to this outage.

The Intercom application in the USA uses a private DNS zone to look up which ProxySQL cluster to use for the MySQL databases it depends on. No other production service in Intercom uses a private DNS zone. Unknown to us before the incident, the caching DNS resolver we used, unbound, behaves differently in our environment depending on whether hosts have outbound access to make DNS requests to the Internet: when the resolver could make queries to the Internet, the private DNS records were not resolvable. This was not surfaced in testing or during the rollouts to other applications.
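To illustrate the class of problem, a minimal unbound.conf sketch that pins a private zone to an internal resolver might look like the following; the zone name and resolver address here are hypothetical and are not our actual configuration:

    server:
        # Allow the private zone to return private (RFC 1918) addresses
        private-domain: "db.internal.example"
        # Do not require DNSSEC validation for the private zone
        domain-insecure: "db.internal.example"

    forward-zone:
        # Send queries for the private zone to the VPC resolver,
        # never to resolvers on the public Internet
        name: "db.internal.example"
        forward-addr: 10.0.0.2

Without an explicit rule like this, whether a query for a private zone reaches a resolver that can answer it depends on the path the host's queries take, which is the kind of environment-dependent behaviour described above.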

Linux server images are automatically built, and we replace every application server at Intercom with the new image on a weekly basis to ensure the security and consistency of our platform. At 03:00 UTC a deployment of the server image containing the DNS caching resolver began. It passed through our staging and pre-production testing environments without issue. These environments could not make DNS queries to the Internet, so the failure to resolve the internal records did not appear there. Once all of the approval steps were completed, the server image started rolling out to our production environment at 03:27 UTC. By 03:44 UTC numerous alerts were firing as background worker servers were replaced by servers that could not boot their Ruby on Rails processes because they could not resolve the ProxySQL endpoints, and backlogs of work began to build. Web servers were not impacted by the rollout because replacement hosts were health-checked before they replaced the existing servers.
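The gap here was that web servers had a health check gating replacement while background workers did not. As a rough sketch only, assuming a hypothetical ProxySQL endpoint name and the standard MySQL port, a boot-time check along these lines would keep a replacement host that cannot resolve its database proxy from entering service:

    #!/usr/bin/env python3
    """Fail fast if the database proxy endpoint does not resolve.
    The endpoint name below is illustrative, not Intercom's real record."""
    import socket
    import sys

    PROXYSQL_ENDPOINT = "proxysql.db.internal.example"

    def can_resolve(hostname: str) -> bool:
        try:
            socket.getaddrinfo(hostname, 3306)
            return True
        except socket.gaierror:
            return False

    if __name__ == "__main__":
        if not can_resolve(PROXYSQL_ENDPOINT):
            # A non-zero exit keeps this host out of the fleet, so a bad
            # image cannot replace workers that are still functioning.
            sys.exit(f"cannot resolve {PROXYSQL_ENDPOINT}; failing health check")
        print("DNS resolution check passed")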

Our standard out-of-hours on-call engineers responded to the alarms; soon after, our incident management process was followed and further engineering assistance was engaged. By 04:48 UTC the Linux server image rollout had been identified as the cause and a rollback was initiated. This rollback needed to pass through all of the pre-production safety checks, and then failed upon reaching the production environment because deployments had been locked by a speculative rollback of the Intercom application during the outage. The server image rollout was retried and at 05:46 UTC began rolling out to the production environment. Prior to this rollout we had manually repaired a small number of critical workers, which allowed functionality such as conversation updates and webhook delivery to recover sooner, and had prepared a script to rapidly roll out the new server image. By 06:00 UTC the vast majority of the background work had been processed and all features were fully functional.

We will follow up on this outage by making server image rollouts more robust and by further simplifying our environment to remove the use of private DNS records. As always, we apologize for the outage and any disruption to your business.