We want to provide some detail about the problem we experienced yesterday. First, we'd like to apologize to all our customers for the disruption to our services. We make significant efforts to keep Intercom working well at all times, and yesterday's events were far from the level of service that we are committed to providing.
Intercom was mostly down or impaired for over 2 hours. Starting just after 14:10 UTC, the MySQL (RDS Aurora) read-replica hosts for our main application database went to near 100% CPU utilization. We tried a number of things to get to a stable state, including restarting the RDS Aurora cluster and rolling back many code deploys and feature flag changes. Ultimately we got control of the situation by identifying the problem queries, killing them automatically, and disabling the parts of the Intercom application that were creating them. By 17:00 UTC, error rates across all of Intercom's services were back to normal. Around that time, we identified an unnecessary association in our ORM that was generating the queries in question and removed it.
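As a rough illustration of the "identify and kill problem queries" step, here is a minimal sketch (not Intercom's actual tooling; the query pattern, threshold, and field names are hypothetical) of selecting offending connections from rows shaped like MySQL's information_schema.PROCESSLIST. In practice the returned ids would be fed to `KILL <id>` statements:

```python
# Sketch of an automated query killer. All names and thresholds here are
# illustrative assumptions, not Intercom's real configuration.
import re

# Hypothetical fingerprint of the bad query and a hypothetical time limit.
BAD_QUERY_PATTERN = re.compile(r"SELECT .* FROM big_table", re.IGNORECASE)
MAX_SECONDS = 5

def queries_to_kill(processlist):
    """Return connection ids of long-running statements matching the bad pattern."""
    return [
        row["id"]
        for row in processlist
        if row.get("info")                        # skip idle/sleeping connections
        and row["time"] > MAX_SECONDS             # only long-running statements
        and BAD_QUERY_PATTERN.search(row["info"]) # only the known-bad query shape
    ]

# Example PROCESSLIST-style snapshot:
rows = [
    {"id": 1, "time": 120, "info": "SELECT * FROM big_table WHERE ..."},
    {"id": 2, "time": 2,   "info": "SELECT * FROM big_table WHERE ..."},
    {"id": 3, "time": 300, "info": None},  # sleeping connection
]
print(queries_to_kill(rows))  # → [1]
```

Killing the queries buys breathing room, but it is only a stopgap: the application keeps issuing them, which is why disabling the parts of the application creating them was also needed.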
Subsequent investigation has identified the root cause as a change in the execution plan MySQL used for the query. The query itself ran frequently, but after the plan change it no longer used an index, instead scanning every row of a large table. This caused a massive increase in IO and CPU utilization on the database hosts, raising latencies across almost all our services and leaving them effectively down. The trigger for the execution plan change is still being established; however, it was not caused by any direct change to Intercom, such as a deployment or configuration change, and this contributed to the time it took to resolve the issue.
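To see why the plan flip was so damaging, here is a toy model (not MySQL internals; table size and lookup value are made up) of rows examined per query under each plan. An indexed lookup touches only the matching rows, while a full table scan reads every row; multiply that gap by a frequently-issued query and the CPU and IO blow-up follows:

```python
# Toy comparison of rows examined: indexed lookup vs. full table scan.
# The table size and lookup value are illustrative assumptions.
from bisect import bisect_left, bisect_right

TABLE_SIZE = 1_000_000
table = list(range(TABLE_SIZE))  # stand-in for a sorted (indexed) column

def rows_examined_with_index(rows, value):
    # B-tree-style lookup: only the matching range is touched.
    lo, hi = bisect_left(rows, value), bisect_right(rows, value)
    return hi - lo

def rows_examined_full_scan(rows, value):
    # Plan without the index: every row is read and compared.
    return len(rows)

print(rows_examined_with_index(table, 42))  # 1 row examined
print(rows_examined_full_scan(table, 42))   # 1,000,000 rows examined
```

A six-orders-of-magnitude difference per query, on a query issued many times per second, is more than enough to saturate the read replicas' CPU.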
We're continuing to dig into yesterday's outage so we can learn how to improve our operational response and avoid this type of problem entirely in future. Once again, we'd like to apologize to our customers for the outage and any disruption it caused. Please do get in touch with us if you have any concerns or questions about the outage.
Brian Scanlan, Engineering Manager