Problems with conversations
Incident Report for Intercom
Postmortem

Root Cause Analysis: Problems with conversations

Brian Scanlan, Senior Principal Engineer - February 22nd 2024

Earlier today, Intercom suffered a partial outage that took a number of hours to fully recover from. We're very sorry for the problem and the impact on our customers. We take these outages very seriously, and will be doing a lot more work to understand what happened here and how to ensure similar outages cannot happen again. We are publishing this Root Cause Analysis on the day of the outage to help our customers understand what happened and how we responded.

Summary

Intercom's primary datastore for conversation data is MySQL. One of the larger tables used to store conversation metadata is the "message_threads" table. This table is used to quickly look up messages in different states, such as priority and snooze status, and is one of the fundamental building blocks of the Intercom Inbox. The table is not sharded, so all of our customers' message metadata is stored in this one table.

The primary key ID of this table was originally a signed integer data type ("int"). We have alarms in place, which alerted us to upcoming key space exhaustion in the USA region (where the vast majority of Intercom workspaces are hosted), and we started migrating the database table to use a "bigint" ID in July 2023. The migration took a few months to complete due to the size of the table and its indexes, and was completed in September 2023.
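
As an illustration of the kind of check such alarms perform, here is a minimal sketch. It is not Intercom's actual alerting code: the function names and the 80% threshold are assumptions made for this example.

```python
# Hypothetical sketch: none of these names come from Intercom's systems.
INT_MAX = 2**31 - 1      # largest value of a signed 32-bit MySQL INT
BIGINT_MAX = 2**63 - 1   # largest value of a signed 64-bit MySQL BIGINT


def keyspace_used(current_max_id: int, type_max: int) -> float:
    """Return the fraction of the ID key space already consumed."""
    return current_max_id / type_max


def should_alert(current_max_id: int, type_max: int, threshold: float = 0.8) -> bool:
    """Alert once usage crosses the threshold (0.8 is an assumed default)."""
    return keyspace_used(current_max_id, type_max) >= threshold


# An ID of 1.9 billion has used roughly 88% of the INT key space...
print(should_alert(1_900_000_000, INT_MAX))     # True
# ...but only a negligible fraction of the BIGINT key space.
print(should_alert(1_900_000_000, BIGINT_MAX))  # False
```

A check like this is evaluated per column type, so it only protects the columns it is actually pointed at.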

On 22nd February 2024 at 08:13:57 UTC the ID of the message_threads table grew larger than the maximum value of the signed "int" data type (2,147,483,647). Writes to this table continued to work fine. However, four other tables had columns referencing the message_thread ID that were still using the int data type. Writes to these tables started failing immediately, as the referenced message_thread ID was too large for that data type.
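
The arithmetic behind this failure can be sketched in a few lines. The ranges below are the standard signed 32-bit and 64-bit integer ranges; the helper function is illustrative only and does not model how MySQL itself reports the error.

```python
# Signed integer ranges for 32-bit ("int") and 64-bit ("bigint") columns.
INT_MIN, INT_MAX = -(2**31), 2**31 - 1
BIGINT_MIN, BIGINT_MAX = -(2**63), 2**63 - 1


def fits(value: int, lo: int, hi: int) -> bool:
    """Return True if value can be stored in a column with range [lo, hi]."""
    return lo <= value <= hi


# The first ID that no longer fits in a signed 32-bit column.
first_overflowing_id = INT_MAX + 1

print(INT_MAX)                                              # 2147483647
print(fits(first_overflowing_id, BIGINT_MIN, BIGINT_MAX))   # True: the primary key is fine
print(fits(first_overflowing_id, INT_MIN, INT_MAX))         # False: int references fail
```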

The failed writes started causing widespread errors across Intercom. We were immediately alerted and started investigating. We prepared database migrations to update the column data types on the affected tables. However, we knew that the affected tables were large and that the migrations would take a significant amount of time to complete, likely longer than a day. We assembled a large team of engineers to investigate other changes to work around the problem. The impact on Intercom functionality was widespread and varied depending on the features in use. Most of the time, new conversations could be started by users. However, in the Inbox, functionality like assignment to teammates, closing conversations, workflows and SLAs was not working.

At 10:48 UTC we rolled out a change that bypassed writing a message_thread ID into the "conversation_parts" table, after verifying that it was a superfluous write and that skipping it had no side effects. This substantially restored access to Inbox features, though there were delays in recovery as a large backlog of conversations was processed.

By 11:21 UTC the Inbox mentions feature was fixed. The last remaining set of features with problems was Conversation SLAs and related functionality. By 12:40 UTC we disabled SLA functionality in the application to allow adjacent features such as Inbox Rules to work normally. By 13:49 UTC a database migration relating to SLAs completed and we re-enabled SLA features. There were residual issues with reporting data on SLAs and conversational insights, and we will continue to investigate these problems.

Evaluation

What went right

  • We were fortunate with the timing of this event. The ID could have rolled over the integer limit at any time. We were able to get a full team of engineers working on the issue very fast as it happened at the start of a working day.
  • Not writing the large ID to the affected tables worked generally well as a mitigation approach, without many side effects.

What could be better

  • We did not ensure that all columns referencing the message_threads ID were migrated to bigint.
  • For some tables that use bigints, we ensure that all references to the ID are compatible with bigints by using very large numbers for IDs in our test suite. However, we did not extend this to the message_threads table.
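
The test-suite technique in the last point can be sketched as follows. This is a simplified model rather than Intercom's test code: the schema dictionary, the "example_migrated_table" entry, the "message_thread_id" column name and the helper function are all hypothetical, and a real test suite would exercise an actual database.

```python
from itertools import count

INT_MAX = 2**31 - 1
BIGINT_MAX = 2**63 - 1
TYPE_MAX = {"int": INT_MAX, "bigint": BIGINT_MAX}

# Hypothetical schema: the declared type of each column that stores a
# message_threads ID. Table and column names are illustrative only.
REFERENCE_COLUMNS = {
    ("message_threads", "id"): "bigint",
    ("conversation_parts", "message_thread_id"): "int",      # not yet migrated
    ("example_migrated_table", "message_thread_id"): "bigint",
}

# Test fixtures hand out IDs starting just above the int limit, so any
# reference column that is still an int is exercised with a value it cannot hold.
test_ids = count(start=INT_MAX + 1)


def columns_rejecting(value: int) -> list[str]:
    """Return the reference columns whose declared type cannot store value."""
    return sorted(
        f"{table}.{column}"
        for (table, column), col_type in REFERENCE_COLUMNS.items()
        if value > TYPE_MAX[col_type]
    )


thread_id = next(test_ids)
print(columns_rejecting(thread_id))  # ['conversation_parts.message_thread_id']
```

With small test IDs the same check would report nothing, which is why starting fixture IDs above the int limit surfaces columns that a migration has missed.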

Actions

We have not completed a full review of this incident yet and this list of actions is incomplete:

  • Migrate all affected tables (in progress)
  • Identify and fix any other references to bigints (done)
  • Ensure test suite coverage for all bigint tables (in progress)
  • Undo temporary workarounds
Posted Feb 22, 2024 - 18:36 UTC

Resolved
We have mitigated all substantial impact here. There is a small residual impact to SLA reporting which we will track separately so we can share a post mortem report with our customers explaining the more significant impact we experienced this morning.
Posted Feb 22, 2024 - 18:31 UTC
Identified
We are still seeing issues with a small percentage of SLA events not being visualised in SLA reports. We are investigating the issue and will hopefully have a fix available shortly.
Posted Feb 22, 2024 - 16:53 UTC
Update
We have shipped a fix to SLA reporting and are actively monitoring reports before we fully resolve this incident.
Posted Feb 22, 2024 - 15:44 UTC
Update
SLA functionality is restored. We are still seeing impact to SLA reporting which we are working on resolving as soon as possible.
Posted Feb 22, 2024 - 14:28 UTC
Monitoring
A fix is rolled out for SLAs and we are monitoring the results.
Posted Feb 22, 2024 - 13:55 UTC
Update
We are currently rolling out a fix for the SLA functionality.
Posted Feb 22, 2024 - 13:28 UTC
Update
The majority of functionality has been restored to normal.

SLAs and any downstream workflows based on SLAs continue to be impacted and we are working on fixing that.
Posted Feb 22, 2024 - 12:03 UTC
Update
Most functionality has been restored to normal. Assignments and SLAs are not working normally and we are working on fixing these. Mentions functionality should now be restored.
Posted Feb 22, 2024 - 11:26 UTC
Identified
Most functionality has been restored to normal. Mentions and SLAs are not working and we are working on fixing these.
Posted Feb 22, 2024 - 10:59 UTC
Update
We are continuing to work this problem, and have a potential mitigation to restore most functionality being tested.

Conversations created since 08:14 cannot be assigned, closed or in some cases replied to after the conversation is created. Related functionality around conversations such as SLAs is also broken. Conversations created before 08:14 are unaffected.

Customers hosted in the Europe and Australia regions are unaffected.
Posted Feb 22, 2024 - 10:50 UTC
Update
We are continuing to work this problem, and do not currently have an ETA for restoration of all functionality.

Conversations created since 08:14 cannot be assigned, closed or in some cases replied to after the conversation is created. Related functionality around conversations such as SLAs is also broken. Conversations created before 08:14 are unaffected.

Customers hosted in the Europe and Australia regions are unaffected.
Posted Feb 22, 2024 - 10:30 UTC
Update
We are continuing to work this problem, and do not currently have an ETA for restoration of all functionality.

Conversations created since 08:14 cannot be assigned, closed or in some cases replied to after the conversation is created. Related functionality around conversations such as SLAs is also broken. Conversations created before 08:14 are unaffected.

Customers hosted in the Europe and Australia regions are unaffected.
Posted Feb 22, 2024 - 10:08 UTC
Update
We are continuing to work this problem, and do not currently have an ETA for restoration of all functionality.

Conversations created since 08:14 cannot be assigned, closed or in some cases replied to. Related functionality around conversations such as SLAs is also broken. Conversations created before 08:14 are unaffected.
Posted Feb 22, 2024 - 09:58 UTC
Update
Large amounts of functionality around conversations in Intercom are broken due to a reference to a primary key breaking. We are working on restoring functionality but have no ETA currently. Workspaces hosted in the Europe and Australia regions are not impacted.
Posted Feb 22, 2024 - 09:12 UTC
Investigating
We are investigating reports that conversations can't be assigned or closed in the USA region.
Posted Feb 22, 2024 - 08:45 UTC
This incident affected: Intercom Messenger (Web Messenger, Mobile Messenger), Intercom Web Application, and Intercom APIs.