Web App and API Degraded Performance
Incident Report for Intercom
Postmortem

Serena Fritsch, Senior Product Engineer

Issue Summary

Intercom uses MySQL as its default datastore, running on AWS’s RDS Aurora service. We run a mix of multi-tenant database clusters and sharded database clusters for high scale. The sharded database clusters contain a subset of our customers’ data in dedicated databases. For high availability, all of the database clusters are deployed across multiple availability zones in AWS.
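
As an illustration of that topology (a sketch only; the cluster identifier below is a placeholder, not one of our real clusters), the member instances of an Aurora cluster and the availability zones they run in can be listed with boto3:

    import boto3

    CLUSTER_ID = "sharded-cluster-example"  # placeholder identifier

    rds = boto3.client("rds")

    # Fetch the cluster, then print each member instance with its role and AZ.
    cluster = rds.describe_db_clusters(DBClusterIdentifier=CLUSTER_ID)["DBClusters"][0]
    for member in cluster["DBClusterMembers"]:
        instance = rds.describe_db_instances(
            DBInstanceIdentifier=member["DBInstanceIdentifier"]
        )["DBInstances"][0]
        role = "writer" if member["IsClusterWriter"] else "reader"
        print(member["DBInstanceIdentifier"], role, instance["AvailabilityZone"])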

At approximately 11:49 UTC on the 7th of November, we initiated a manual query against one of the sharded databases. Our aim was to determine the size of all tables within that database as part of a capacity planning exercise; specifically, we needed to establish how much of the data stored in the cluster was held in temporary tables created by our database schema migration process. The query was not run directly on the database host itself, but from our production console, which connects to the database.
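
The exact statement is not reproduced here, but a table-size check of this general shape, run against information_schema, illustrates the kind of query involved (the connection details and schema name below are placeholders):

    import os
    import pymysql

    # Placeholder connection details; during the incident the query was run from
    # a production console session connected to the sharded cluster.
    conn = pymysql.connect(
        host="db-shard-01.example.internal",
        user="readonly",
        password=os.environ["DB_PASSWORD"],
        database="information_schema",
    )

    # Approximate per-table footprint (data + indexes) for one schema. On a large
    # schema this touches a lot of table metadata, which is part of why a query
    # of this shape can be more expensive than it looks.
    SIZE_QUERY = """
        SELECT table_name,
               ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
        FROM information_schema.tables
        WHERE table_schema = %s
        ORDER BY (data_length + index_length) DESC
    """

    with conn.cursor() as cursor:
        cursor.execute(SIZE_QUERY, ("app_shard_0001",))
        for table_name, size_gb in cursor.fetchall():
            print(table_name, size_gb)

    conn.close()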

This query caused the free memory of the cluster to drop significantly. At 11:50 UTC, AWS, our cloud provider, initiated an automated failover to an instance of the cluster in a different availability zone. The failover terminated the query. At the same time, latency started rising across our Ruby on Rails application.

At 11:54 UTC the first paging alarm fired, denoting a degradation in availability. Our on-call engineers acknowledged the alarm, and by 11:57 UTC had established that the problem was related to the failover of that particular cluster. We updated the status page at 11:59 UTC.

The failover itself was completed at 11:54 UTC, and all database instances restarted at 11:55 UTC. By that time, the Intercom app was working normally again for the majority of Intercom customers.

However, the Intercom app was still serving errors to customers with data on the cluster that had just rebooted. A deployment of our Rails application completed at 11:56 UTC, which re-established connections evenly across the instances in the affected cluster. At that point, all customer impact was resolved.
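
The redeploy helped because it restarted the application processes, forcing them to open fresh connections spread evenly across the healthy instances. One possible mechanism for removing that manual step (see the Actions section below) is for the database client to reconnect when it hits a failover error, so that it re-resolves the cluster endpoint and reaches the new writer. The following is a minimal sketch only, written in Python with PyMySQL rather than our actual Ruby on Rails stack, with a placeholder endpoint and credentials:

    import os
    import pymysql

    # Placeholder writer endpoint; an Aurora cluster endpoint resolves to the
    # current writer, so re-resolving it on reconnect picks up the new instance
    # after a failover.
    WRITER_ENDPOINT = "db-shard-01.cluster-example.eu-west-1.rds.amazonaws.com"

    def connect():
        return pymysql.connect(
            host=WRITER_ENDPOINT,
            user="app",
            password=os.environ["DB_PASSWORD"],
            database="app_shard_0001",
        )

    conn = connect()

    def run_query(sql, params=None):
        """Run a query, reconnecting once if the connection died in a failover."""
        global conn
        try:
            with conn.cursor() as cursor:
                cursor.execute(sql, params)
                return cursor.fetchall()
        except pymysql.err.OperationalError:
            # The old connection points at a demoted or rebooted instance:
            # discard it and retry once against a freshly resolved endpoint.
            try:
                conn.close()
            except pymysql.err.Error:
                pass
            conn = connect()
            with conn.cursor() as cursor:
                cursor.execute(sql, params)
                return cursor.fetchall()

A production version of this would live inside the connection pool rather than being wrapped around individual queries.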

Evaluation

What went right

  • We updated the status page quickly.
  • Our detection and general operational response were fast. We had the right people involved within minutes and narrowed down the problem quickly.
  • Our systems recovered from the database failover without manual intervention.

What could be better

  • We did not realize that the query was dangerous and could result in degraded query performance and, ultimately, a failover.
  • Our long-tail recovery process depends on redeploying the Ruby on Rails application to re-establish connections.
  • Our schema migration process is complex and resource-heavy. The reason we performed the query in question was to determine how much additional disk space was being consumed by unused temporary migration tables.

Actions

  1. We have stopped running queries of this kind and are using other methods for the capacity planning exercise.
  2. We are looking into mechanisms to remove the need for a manual redeployment of the application after this type of event.
Posted Nov 08, 2023 - 17:05 UTC

Resolved
Between 11:50 UTC and 11:56 UTC, the Web App, Messenger, and API saw degraded performance. The incident has been fully resolved.
Posted Nov 07, 2023 - 12:13 UTC
Investigating
We are investigating reports of degraded performance and high latency across our web app and APIs.
Posted Nov 07, 2023 - 11:59 UTC
This incident affected: Intercom message delivery (Email, Chats and Posts, Mobile Push, Admin notifications), Intercom Messenger (Web Messenger, Mobile Messenger), Intercom Europe (Intercom Web Application, Intercom Web Messenger, Intercom Mobile Messenger, Intercom APIs, Intercom Webhooks), Intercom Australia (Intercom Web Application, Intercom Web Messenger, Intercom Mobile Messenger, Intercom APIs, Intercom Webhooks), and Intercom Web Application, Intercom APIs, Intercom Mobile APIs, Intercom Webhooks.