On January 9, 2026, between 19:39 and 20:50 UTC, Intercom services in the US region were fully unavailable for customers hosted there. The outage was caused by a failure in our database routing layer that prevented application queries from reaching the correct shards.
We have no evidence of customer data loss or corruption. The failure mode prevented queries from being routed and executed; it did not modify underlying data.
We understand that Intercom is core to how you run your business, and that any downtime directly impacts your ability to serve your own customers. We sincerely apologize for the disruption and frustration this caused.
Root cause analysis
High-level summary
Intercom uses PlanetScale (built on Vitess) as its high-scale core database layer. The outage was triggered by a software bug in a stable version of Vitess (v22) that was being rolled out to our infrastructure.
During a standard administrative workflow to move tables for testing, a bug in the rollback logic caused an empty "VSchema" to be applied to our primary production database. The VSchema acts as the routing map for our sharded data; when this map was overwritten with an empty configuration, the database proxy layer (VTGates) could no longer route queries to the correct data shards. This resulted in a total loss of connectivity between the Intercom application and our data layer.
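To make the role of the VSchema concrete, the sketch below is a deliberately simplified illustration of ours (not Vitess's actual implementation or data model) of why an empty routing map is fatal: the query proxy must resolve every table name to a shard before it can execute a query, so an empty map leaves it with no valid destination for any query.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-in for a VSchema: a per-keyspace map describing which
// tables exist and where they live. The real Vitess VSchema is far richer
// (vindexes, sequences, routing rules), but the essential property is the
// same: if a table has no entry, there is no route.
type vschema struct {
	tables map[string][]string // table name -> shards that can serve it (illustrative only)
}

// route mimics the decision a query proxy (VTGate) must make before it can
// execute a query: resolve the table to one or more shards.
func route(vs vschema, table string) ([]string, error) {
	shards, ok := vs.tables[table]
	if !ok || len(shards) == 0 {
		return nil, errors.New("no route for table " + table)
	}
	return shards, nil
}

func main() {
	healthy := vschema{tables: map[string][]string{
		"conversations": {"-80", "80-"}, // hypothetical shard ranges
		"users":         {"-80", "80-"},
	}}
	empty := vschema{tables: map[string][]string{}} // what the bug effectively applied

	fmt.Println(route(healthy, "conversations")) // [-80 80-] <nil>
	fmt.Println(route(empty, "conversations"))   // [] no route for table conversations
}
```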
Service was restored when PlanetScale engineers manually reapplied the correct VSchema from a snapshot, restoring the routing topology.
Technical deep dive
This reflects our current understanding, based on our internal incident review and PlanetScale’s preliminary RCA. We are publishing it within 24 hours of the incident and will update it as the investigation concludes.
The specific failure mechanism was a logic regression introduced in Vitess Pull Request #17401 affecting how MoveTables failures are handled in sharded keyspaces.
The trigger
A standard MoveTables command was issued. Due to transient networking issues in the provider's Kubernetes cluster at the time, the command failed (an ordinary operational state that the system is designed to handle safely).
The logic error
When a MoveTables command fails, Vitess attempts to revert the state.
For unsharded keyspaces, Vitess caches the previous VSchema in memory so that it can roll back, as the operation may modify the VSchema.
For sharded keyspaces, a VSchema change is generally not expected during this specific operation, so the previous state is not cached.
However, the regression in v22 caused the condition "Should we reapply the old VSchema?" to erroneously evaluate to TRUE for our sharded keyspace.
Invalid configuration
Because the system attempted to reapply a "previous" VSchema that it had never actually saved (expecting a rollback to be unnecessary for sharded keyspaces), it effectively applied an empty value to the production keyspace topology.
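In other words, the bug belongs to a well-known failure class: a rollback path restores a "saved" value without first checking that anything was actually saved, so the zero value (an empty VSchema) is written instead. The sketch below is our own illustration of that class, using hypothetical names; it is not the code from Pull Request #17401.

```go
// Illustration of the failure class only; not the actual Vitess code.
package main

import "fmt"

type keyspace struct {
	sharded          bool
	vschema          string // stand-in for the real VSchema document
	savedVSchema     string // cached copy taken before a risky operation
	savedForRollback bool
}

// beginMoveTables caches the current VSchema only where a rollback is
// expected to need it (unsharded keyspaces, in this simplified model).
func beginMoveTables(ks *keyspace) {
	if !ks.sharded {
		ks.savedVSchema = ks.vschema
		ks.savedForRollback = true
	}
}

// rollbackBuggy models the regression: it reapplies the "saved" VSchema
// unconditionally, so for a sharded keyspace it writes the empty zero value.
func rollbackBuggy(ks *keyspace) {
	ks.vschema = ks.savedVSchema // empty string when nothing was ever saved
}

// rollbackSafe restores the snapshot only if one was actually taken.
func rollbackSafe(ks *keyspace) {
	if ks.savedForRollback {
		ks.vschema = ks.savedVSchema
	}
}

func main() {
	buggy := &keyspace{sharded: true, vschema: `{"tables": {"conversations": {}}}`}
	beginMoveTables(buggy) // sharded keyspace: nothing is cached...
	rollbackBuggy(buggy)   // ...yet this path restores the (empty) cache anyway
	fmt.Printf("buggy rollback leaves: %q\n", buggy.vschema) // ""

	safe := &keyspace{sharded: true, vschema: `{"tables": {"conversations": {}}}`}
	beginMoveTables(safe)
	rollbackSafe(safe)                                        // guarded path leaves the VSchema untouched
	fmt.Printf("guarded rollback leaves: %q\n", safe.vschema) // original document intact
}
```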
Propagation
The PlanetScale control plane detected this new (empty) VSchema state and propagated it to the VTGates. The VTGates, which rely on the VSchema to route queries to the correct shards, immediately dropped all routes, causing the application-wide outage.
Resolution
PlanetScale engineers reapplied a snapshotted VSchema from one hour prior to the incident and removed artifacts left behind by the failed workflow. This action pushed the correct configuration back to the PlanetScale control plane, which subsequently propagated the valid routing rules to the VTGate fleet. As the VTGates consumed the restored VSchema, they rebuilt their internal routing tables, immediately re-establishing the path between the Intercom application and the backend databases.
Timeline in UTC
18:56 – A configuration in the database management pipeline interpreted a 'branch unfreeze' action as a signal to proceed with the v22 version upgrade that had been scheduled for the following week. This behavior was unintended and is being reconfigured.
19:00 – The Vitess v22 upgrade, containing the bug, began rolling out.
19:30 – A MoveTables workflow was created. When it failed (normally a safe outcome), the v22 bug caused an empty VSchema to be applied to the topology.
19:39 – The PlanetScale control plane read the empty VSchema and propagated it to the VTGates, triggering the production impact.
19:40 – Inbox Heartbeat Anomaly alarms triggered in the US region and an incident was declared.
19:41 – Engineers and an incident commander acknowledged the incident and began investigating.
19:48 – The incident team engaged PlanetScale to continue the investigation.
20:50 – PlanetScale restored the correct VSchema; service recovered.
21:00+ – We initiated a rollback of Vitess to v21, which completed over the following hours.
Next steps
Ongoing investigation
We are publishing this write-up based on our current understanding of the incident, including PlanetScale’s preliminary RCA and our own analysis of Intercom’s service behavior during the outage. We do this to provide immediate transparency into the root cause. We have high confidence in our understanding of the direct failure mode, but our work here is not finished. We remain actively engaged with PlanetScale to fully understand and remediate the issues that led to this incident. As we deepen our understanding, we will continue to refine our technical safeguards and remediation plans to ensure your service remains resilient.
When this investigation concludes, we will update this write-up with any material new findings and refine our corrective actions accordingly. Our goal is not only to explain what happened, but to ensure the specific failure mode is prevented and that recovery is faster and more deterministic if a similar class of issue ever occurs again.
Understanding the networking issue
PlanetScale is currently isolating the source of the Kubernetes networking packet drops that occurred at the start of the incident. While the Vitess software bug was the primary cause of the outage, these networking errors may have contributed to the chain of events by causing the initial workflow failure that triggered the bug.
Fixing the Vitess regression
PlanetScale has identified the specific code path causing the regression and is deploying a fix. We have paused further database upgrades for Intercom infrastructure until the patch is fully verified. Our upgrade cadence is deliberately designed to trail the leading edge so that new versions are vetted by the broader engineering community before we adopt them; consistent with that policy, v22 had been in general availability for over nine months prior to our adoption.
Hardening topology resilience
We are exploring further safeguards with PlanetScale to protect our application from invalid topology updates.
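As one illustrative example of the kind of safeguard being discussed (a hypothetical sketch, not a committed design from either team), a topology update that would remove every table route from a serving, sharded keyspace could be rejected, or gated behind an explicit override, before it is ever propagated to the VTGate fleet:

```go
// Hypothetical guard sketch; not an Intercom or PlanetScale implementation.
package main

import (
	"errors"
	"fmt"
)

type vschemaState struct {
	keyspace string
	sharded  bool
	tables   map[string]struct{} // table routes the VSchema contains
	force    bool                // explicit operator override for intentional removals
}

// validate rejects a proposed VSchema that would silently drop every table
// route from a serving sharded keyspace, the failure mode in this incident.
func validate(current, proposed vschemaState) error {
	if proposed.sharded && len(current.tables) > 0 && len(proposed.tables) == 0 && !proposed.force {
		return errors.New("refusing VSchema that removes all table routes from keyspace " + proposed.keyspace)
	}
	return nil
}

func main() {
	current := vschemaState{keyspace: "intercom", sharded: true,
		tables: map[string]struct{}{"conversations": {}, "users": {}}}
	proposed := vschemaState{keyspace: "intercom", sharded: true,
		tables: map[string]struct{}{}} // the empty VSchema from the incident

	if err := validate(current, proposed); err != nil {
		fmt.Println("update blocked:", err) // the outage path stops here, before propagation
	}
}
```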
Preventing unintentional upgrades
We are conducting a review with PlanetScale regarding the pipeline configuration that allowed a database "branch unfreeze" to automatically trigger a pending version upgrade. While this version was scheduled for a controlled rollout the following week, this specific pipeline trigger caused it to deploy prematurely. We are working to enforce stricter separation between maintenance tasks and version upgrades to ensure all deployments are explicitly intended.
Contextualizing this failure
We know the incidents on January 5th and 9th stand in stark contrast to the stability you have experienced recently. Over the last six months, our Core Platform availability averaged 99.98% (excluding the widespread AWS us-east-1 incident), with several months at 100% uptime following our migration to PlanetScale.
We share this data not to minimize the impact of this week’s outage, but to reinforce that the level of service you received is an anomaly that we do not accept. We are committed to returning to the high standard of reliability we maintained throughout the second half of 2025 and are redoubling our efforts to further ensure it.