Fin is experiencing issues - US Hosting Status

Write-up

Fin is experiencing issues

Date: Apr 14, 2026

Summary

On April 14, 2026, between 11:18 UTC and approximately 12:20 UTC, Fin was unable to respond to customer conversations for the US region. The incident was caused by a database column reaching its maximum integer value, preventing new records from being written.

Conversations that Fin could not respond to during the outage window were not lost. They remained in your default Inbox for your team to handle manually.

During initial recovery, newly created conversations started receiving replies from Fin, as did existing Fin conversations where the customer sent another message. Conversations with no new replies, or those where a teammate had already taken action (a reply, an inbox assignment, etc.), did not receive a reply from Fin. Where possible, Fin re-engaged with conversations that had not yet had any human interaction.

Between 12:20 UTC and 13:00 UTC, Fin failed to escalate to teammates when requested.

By 13:00 UTC, full functionality of Fin was restored.

EU and AU regions were not impacted by this incident.

We understand that Fin is core to how many of you support your customers, and that any downtime directly impacts your ability to deliver fast, reliable support. We sincerely apologize for the disruption this caused.

Root cause analysis

High-level summary

Intercom uses PlanetScale (built on Vitess) as our high-scale database layer. Fin's conversation pipeline writes event records to one of the tables each time it processes a customer interaction.

The primary key column on this table was defined as a 32-bit signed integer, which has a maximum value of approximately 2.1 billion. On April 14, the auto-incrementing sequence for this column exceeded that maximum. Every subsequent attempt to insert a new record failed immediately with a range error.
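As a worked illustration of the limit described above, the following minimal Python sketch computes the upper bound of a signed 32-bit column and of its 64-bit counterpart. The helper name `fits` is hypothetical and exists only for this example; it is not part of Intercom's code.

```python
# Upper bounds of signed integer column types (two's complement).
INT_MAX = 2**31 - 1      # 2,147,483,647 (about 2.1 billion)
BIGINT_MAX = 2**63 - 1   # 9,223,372,036,854,775,807 (about 9.2 quintillion)


def fits(value: int, max_value: int) -> bool:
    """Return True if an ID value can be stored in a column with this upper bound."""
    return value <= max_value


# The first sequence value past the 32-bit limit.
next_id = INT_MAX + 1

print(fits(INT_MAX, INT_MAX))     # the last ID a 32-bit column accepts
print(fits(next_id, INT_MAX))     # the next value is out of range for INT
print(fits(next_id, BIGINT_MAX))  # the same value fits in a 64-bit column
```

Once the sequence passes `INT_MAX`, every later value is also out of range, which is why the failure was permanent rather than intermittent until the column type changed.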

Because this sequence is global across all database shards in Vitess, the failure was instantaneous: every shard rejected inserts simultaneously. Fin could not create the event records required to complete its conversation pipeline, causing it to stop responding entirely for US-hosted customers.

Why did this affect customers?

When a customer sends a message and Fin processes it, the pipeline writes an event record to track the interaction before generating and delivering a response. This write is a required step.

Once the integer limit was reached, this write failed on every attempt. The result was that Fin appeared completely unresponsive. Customers sending messages through Messenger, Inbox, or workflows received no response from Fin.

EU and AU customers were not directly impacted by this issue. The integer sequence had only reached its limit in the US region.

Technical deep dive

This incident was an integer overflow on a primary key where a table reached the maximum limit of a standard 32-bit integer, preventing new records from being created. While Intercom maintains a robust monitoring system that has successfully processed over 300 migrations for similar issues in the past two years, a specific gap in metric coverage for sharded environments, compounded by the unique characteristics of the affected table, resulted in a failure to trigger proactive alarms.

Intercom has long-standing safeguards against integer growth, including a dedicated worker monitoring production for integer growth to trigger preventative alarms. Upon migrating to PlanetScale in 2025, we collaborated to implement a specific integer growth metric. This system is active and effective. In fact, it successfully triggered a migration as recently as last week.

Despite these layers of defense, this specific table remained an INT rather than a BIGINT due to two primary factors:

During the investigation, we identified that our monitoring specifically covers auto-incrementing ID columns in MySQL and had not been extended to Vitess sequences, and that the provided integer growth metric does not trigger within sharded environments. As a result, neither monitoring layer was tracking ID growth for this table.
The affected table was created in 2018 but remained low-volume for years. For context, we updated to Rails 5.1 in May 2019, and all migrations generated from that point onwards used BIGINT, making standard 32-bit INT tables a legacy edge case. The table was moved into a sharded keyspace last year. Because its usage only scaled recently, it had not reached the alarm thresholds before moving into the sharded environment, where monitoring was silent.
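The gap described above can be pictured with a small Python sketch of a headroom check that treats auto-increment counters and sequence counters the same way. Everything here is an assumption made for illustration: the `IdCounter` type, the `source` labels, the table names, the sample values, and the 70% threshold are hypothetical and do not describe Intercom's or PlanetScale's actual monitoring.

```python
from dataclasses import dataclass

INT_MAX = 2**31 - 1
BIGINT_MAX = 2**63 - 1


@dataclass
class IdCounter:
    table: str        # hypothetical table name
    source: str       # "auto_increment" or "sequence" (hypothetical labels)
    next_value: int   # next ID the counter would hand out
    column_max: int   # upper bound of the column that stores the ID


def usage(counter: IdCounter) -> float:
    """Fraction of the column's ID range that has already been consumed."""
    return counter.next_value / counter.column_max


def counters_over_threshold(counters, threshold=0.7):
    """Return counters at or above the threshold, whatever their source."""
    return [c for c in counters if usage(c) >= threshold]


# Illustrative sample data only.
counters = [
    IdCounter("unsharded_events", "auto_increment", 400_000_000, INT_MAX),
    IdCounter("sharded_events", "sequence", 2_000_000_000, INT_MAX),
    IdCounter("modern_table", "sequence", 2_000_000_000, BIGINT_MAX),
]

for c in counters_over_threshold(counters):
    print(f"{c.table}: {usage(c):.0%} of ID range used ({c.source})")
```

The point of the sketch is that the check only depends on the counter value and the column bound. A monitor that collects only one kind of counter never evaluates the other kind at all.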

Once the error surfaced, the Fin heartbeat metric, which is emitted only after a successful database write, dropped, causing the Fin Heartbeat Anomaly Monitor alarm to fire. Once the root cause was identified, a killswitch was introduced to stop writes to the offending table. This preserved Fin's ability to send messages and escalations while the migration changing the key to a BIGINT was being deployed. The killswitch also prevented Fin from creating new records that depend on the offending table. As a result, Fin was able to speak to end users but was not creating reporting data during this time.
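The effect of the killswitch can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Intercom's pipeline: `EventTable`, `OutOfRangeError`, `handle_message`, and the `skip_event_writes` flag are hypothetical names introduced for this example.

```python
class OutOfRangeError(Exception):
    """Stands in for the database rejecting an ID above the column maximum."""


class EventTable:
    """Toy table whose primary key is bounded by max_id."""

    def __init__(self, next_id, max_id):
        self.next_id = next_id
        self.max_id = max_id
        self.rows = []

    def insert(self, payload):
        if self.next_id > self.max_id:
            raise OutOfRangeError(f"id {self.next_id} exceeds {self.max_id}")
        self.rows.append((self.next_id, payload))
        self.next_id += 1


def handle_message(message, table, skip_event_writes=False):
    """Return a reply; record an event first unless the killswitch is on."""
    if not skip_event_writes:
        # Required step: raises once the ID range is exhausted.
        table.insert(message)
    return f"reply to: {message}"


# The sequence has already moved past the 32-bit limit.
table = EventTable(next_id=2**31, max_id=2**31 - 1)

try:
    handle_message("hello", table)
    replied_without_flag = True
except OutOfRangeError:
    replied_without_flag = False  # no reply is produced while the write fails

# With the killswitch on, the reply goes out but no event row is recorded.
reply_with_flag = handle_message("hello", table, skip_event_writes=True)
```

In this model the trade-off matches the one described above: replies resume, while the records that would normally be written alongside them are not created until the column can accept new IDs again.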

We take full responsibility for this outage. While we have successfully prevented hundreds of similar issues, our monitoring failed to account for the specific architectural nuances of sharded databases.

Timeline (UTC)

11:20 - Fin AI conversation part creation rate drops by ~70% in US region
11:21 - Fin Heartbeat Anomaly Monitor fires and an automated rollback of the Intercom monolith is triggered in all regions. Incident declared internally, paging engineering and an incident commander, and initiating triage
11:28 - Root cause identified to be the database column reaching its maximum integer value
11:32 - Status page posted for the US region
11:50 - Migration PR opened to convert the offending id from INT to BIGINT
12:04 - Enabled emergency feature flags to stop further writes to the affected tables, allowing Fin to answer user queries again
11:49 - Migration executing across database shards
13:00 - Migration PR deployed. Fin AI Agent and Fin Voice were functioning as normal for new conversations
13:05 - We retried any pending Fin conversations that had not yet been actioned by a teammate on workspaces
13:30 - Full recovery confirmed. Fin responding normally for all US customers

Next steps

Completed remediation

The offending column and its foreign key reference have been migrated from INT to BIGINT, eliminating the integer overflow constraint. The new column type supports values up to 9.2 quintillion, which is effectively unlimited for this use case.
We have addressed the observability gaps by implementing further internal monitoring that verifies integer growth across all database types (sharded and unsharded), removing reliance on third-party metrics for this specific critical failure mode.

Ongoing improvements

We are conducting an audit of all high-write database tables to identify any other columns defined as INT that could approach their limits. Tables with high write volumes will be proactively migrated to BIGINT before they reach capacity.
We're adding linting to prevent deviations from the defaults, ensuring that BIGINT is used for any table expected to scale, regardless of initial volume.
We are reviewing how the Fin conversation pipeline can be made more resilient to transient failures, so that a single table's constraint does not cause a complete Fin outage across all interfaces. This also includes improving capabilities internally to be able to easily retrigger Fin for all states.
We are working closely with PlanetScale to rectify the metric reporting gaps for sharded environments to ensure their internal alarms provide the redundancy we expect.
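An audit of the kind described in the first item above could start from column metadata. The sketch below assumes rows shaped like MySQL's `information_schema.COLUMNS` view (using its `TABLE_NAME`, `COLUMN_NAME`, `DATA_TYPE`, and `COLUMN_KEY` fields). The function name and the sample rows are hypothetical, and this is not a description of the audit or linting Intercom is actually building.

```python
# Integer types narrower than BIGINT.
NARROW_TYPES = {"tinyint", "smallint", "mediumint", "int"}


def narrow_primary_keys(columns):
    """Return (table, column) pairs whose primary key is narrower than BIGINT."""
    return [
        (c["TABLE_NAME"], c["COLUMN_NAME"])
        for c in columns
        if c["COLUMN_KEY"] == "PRI" and c["DATA_TYPE"].lower() in NARROW_TYPES
    ]


# Illustrative rows only; a real audit would read them from the database.
sample_columns = [
    {"TABLE_NAME": "legacy_events", "COLUMN_NAME": "id",
     "DATA_TYPE": "int", "COLUMN_KEY": "PRI"},
    {"TABLE_NAME": "legacy_events", "COLUMN_NAME": "kind",
     "DATA_TYPE": "int", "COLUMN_KEY": ""},
    {"TABLE_NAME": "new_events", "COLUMN_NAME": "id",
     "DATA_TYPE": "bigint", "COLUMN_KEY": "PRI"},
]

for table_name, column_name in narrow_primary_keys(sample_columns):
    print(f"{table_name}.{column_name} should be reviewed for migration to BIGINT")
```

A check like this only finds primary keys. Foreign key columns that reference them would need the same review, since the completed remediation above migrated both the column and its foreign key reference.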

We take full responsibility for this outage and remain committed to the reliability and availability you expect from Intercom. The root cause of this incident is well understood and fully resolved, and we are taking steps to ensure similar issues are detected and addressed proactively across our infrastructure.