Root Cause Analysis: Problems with conversations
Brian Scanlan, Senior Principal Engineer - February 22nd 2024
Earlier today, Intercom suffered a partial outage that took a number of hours to fully recover from. We're very sorry for the problem and the impact on our customers. We take these outages very seriously, and will be doing a lot more work to understand what happened here and how to ensure similar outages cannot happen again. We are publishing this Root Cause Analysis on the day of the outage to help our customers understand what happened and how we responded.
Intercom's primary datastore for conversation data is MySQL. One of the larger tables used to store conversation metadata is the "message_threads" table. This table is used to quickly look up messages in different states, such as priority and snooze status, and is one of the fundamental building blocks of the Intercom Inbox. The table is not sharded, meaning that all of our customers' message metadata is stored in a single table.
The primary key ID of this table was originally a signed integer data type ("int"). We have alarms in place that alerted us to upcoming key space exhaustion in the USA region (where the vast majority of Intercom workspaces are hosted), and we started migrating the database table to use a "bigint" ID in July 2023. The migration took a few months to complete due to the size of the table and its indexes, and was completed in September 2023.
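A minimal sketch of this kind of key space check, for illustration only. The function names and the 80% threshold are assumptions and do not describe the production alarming; the limits are the standard maximums for signed MySQL "int" and "bigint" columns.

```python
# Largest value each signed MySQL integer type can store.
SIGNED_MAX = {
    "int": 2**31 - 1,     # 2,147,483,647
    "bigint": 2**63 - 1,  # 9,223,372,036,854,775,807
}


def keyspace_used(current_max_id: int, column_type: str) -> float:
    """Fraction of the ID key space already consumed (hypothetical helper)."""
    return current_max_id / SIGNED_MAX[column_type]


def should_alert(current_max_id: int, column_type: str, threshold: float = 0.8) -> bool:
    """True once usage crosses the threshold (0.8 is an assumed value)."""
    return keyspace_used(current_max_id, column_type) >= threshold


# An ID of 1.8 billion uses most of an "int" key space,
# but a negligible share of a "bigint" one.
print(should_alert(1_800_000_000, "int"))
print(should_alert(1_800_000_000, "bigint"))
```

A check like this only looks at the column it is pointed at, so it would need to be run against every column that stores the ID, not just the primary key.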
On 22nd February 2024 at 08:13:57 UTC the ID of the message threads table grew larger than the maximum value of the "int" data type (2,147,483,647 for a signed 32-bit integer). Writes to this table continued to work fine. However, there were four tables with columns referencing the message_thread ID that were still using the int data type. Writes to these tables started failing immediately, as the referenced message_thread ID was too large for that data type.
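The failure mode can be sketched as a check over every column that stores a message_thread ID. This is a hedged illustration: the column name "message_thread_id", the second table name and the column types shown are assumptions made for the example, not the real schema.

```python
# Largest value each signed MySQL integer type can store.
SIGNED_MAX = {
    "int": 2**31 - 1,
    "bigint": 2**63 - 1,
}


def columns_too_small(referencing_columns: dict[str, str], new_id: int) -> list[str]:
    """Return the referencing columns that cannot store new_id."""
    return [
        column
        for column, column_type in referencing_columns.items()
        if new_id > SIGNED_MAX[column_type]
    ]


# Hypothetical inventory of columns that reference the message_thread ID.
referencing = {
    "conversation_parts.message_thread_id": "int",     # assumed column name and type
    "some_migrated_table.message_thread_id": "bigint",  # hypothetical table
}

# The first ID past the signed 32-bit limit no longer fits in "int" columns.
first_overflowing_id = 2**31
print(columns_too_small(referencing, first_overflowing_id))
```

The primary key itself passes this check once it is a bigint, which is why writes to the parent table kept working while writes to the referencing tables failed.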
The failed writes started causing widespread errors across Intercom. We were immediately alerted and started investigating. We prepared database migrations to update the column data types on the affected tables. However, we knew that the affected tables were large and that the migrations would take a significant amount of time to complete, likely longer than a day. We assembled a large team of engineers to investigate other changes to work around the problem. The impact on Intercom functionality was widespread and varied depending on the features in use. Most of the time, new conversations could be started by users. However, in the Inbox, functionality such as assignment to teammates, closing conversations, workflows and SLAs was not working.
At 10:48 UTC we rolled out a change that bypassed writing a message_thread ID into the "conversation_parts" table, after verifying that it was a superfluous write and that skipping it had no side effects. This substantially restored access to Inbox features, though there were delays in recovery as a large backlog of conversations was processed.
By 11:21 UTC the Inbox mentions feature was fixed. The last remaining set of features with problems was Conversation SLAs and related functionality. By 12:40 UTC we had disabled SLA functionality in the application to allow adjacent features such as Inbox Rules to work normally. By 13:49 UTC a database migration relating to SLAs had completed and we re-enabled SLA features. There were residual issues with reporting data on SLAs and conversational insights, and we will continue to investigate these problems.
We have not completed a full review of this incident yet and this list of actions is incomplete: