Intercom uses MySQL as its default datastore, running on AWS’s RDS Aurora service. We run a mix of multi-tenant database clusters and, for high scale, sharded database clusters. The sharded database clusters contain a subset of our customers’ data in dedicated databases. For high availability, all of the database clusters are deployed across multiple availability zones in AWS.
At approximately 11:49 UTC on the 7th of November, we initiated a manual query against one of the sharded databases. Our aim was to determine the size of all tables within that database as part of a capacity planning exercise: we needed to establish how much of the data stored in the cluster came from temporary tables created by our database schema migration process. The query was run not directly on the database itself, but by connecting to the database from our production console.
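For illustration, the sketch below shows the general shape of such a table-size query, using a standard information_schema lookup issued from a Rails console. The exact SQL and the way it was invoked are assumptions made for this example, not the precise query that was run.

```ruby
# Illustrative sketch only: a generic table-size query against
# information_schema, run from a Rails production console.
# This is not the exact query used during the incident.
sql = <<~SQL
  SELECT table_name,
         ROUND((data_length + index_length) / 1024 / 1024, 2) AS size_mb
  FROM information_schema.tables
  WHERE table_schema = DATABASE()
  ORDER BY (data_length + index_length) DESC
SQL

ActiveRecord::Base.connection.exec_query(sql).each do |row|
  puts "#{row['table_name']}: #{row['size_mb']} MB"
end
```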
This query caused the free memory of the cluster to drop significantly. At 11:50 UTC, AWS, our cloud provider, initiated an automated failover to an instance of the cluster in a different availability zone. The failover terminated the query. At the same time, latency started rising across our Ruby on Rails application.
At 11:54 UTC the first paging alarm fired, indicating a degradation in availability. Our on-call engineers acknowledged the alarm, and by 11:57 UTC had established that the problem was related to the failover of that particular cluster. We updated the status page at 11:59 UTC.
The failover itself was completed at 11:54 UTC, and all database instances had restarted by 11:55 UTC. By that time, the Intercom app was working normally again for the majority of Intercom customers.
However, the Intercom app was still serving errors to customers whose data was on the cluster that had just rebooted. A deployment of our Rails application completed at 11:56 UTC, re-establishing connections evenly across the instances in the affected cluster. At that point, all customer impact was resolved.