On January 10th between 08:23 and 08:36 UTC, Intercom services in the EU region were fully unavailable for customers hosted there. The outage was caused by a failure in our database routing layer that prevented queries from being routed to the correct shards.
We have no evidence of customer data loss or corruption. The failure mode prevented queries from being routed and executed; it did not modify underlying data.
We understand that Intercom is core to how you run your business, and that any downtime directly impacts your ability to serve your own customers. We sincerely apologize for the disruption and frustration this caused.
Root cause analysis
High-level summary
Intercom uses PlanetScale (built on Vitess) as its high-scale core database layer. The outage was triggered by a software bug within a stable version of Vitess (v22) that was being rolled out to our infrastructure.
During a standard administrative workflow to move tables for testing, a bug in the rollback logic resulted in the application of an empty "VSchema" to our primary production database. The VSchema acts as the routing map for our sharded data; when this map was overwritten with an empty configuration, the database proxy layer (VTGates) could no longer route queries to the correct data shards. This resulted in a total loss of connectivity between the Intercom application and our data layer.
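To make the failure mode concrete: a VSchema is a JSON document that tells the VTGates how each table maps onto shards. A minimal illustrative sketch of a sharded keyspace's VSchema follows; the table and column names are hypothetical, not Intercom's actual configuration:

```json
{
  "sharded": true,
  "vindexes": {
    "hash": { "type": "hash" }
  },
  "tables": {
    "conversations": {
      "column_vindexes": [
        { "column": "app_id", "name": "hash" }
      ]
    }
  }
}
```

When a document like this is replaced with an empty VSchema, the VTGates no longer know which shard owns any table, so no query can be routed.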
Service was restored when PlanetScale engineers manually reapplied the correct VSchema from a VSchema snapshot, restoring the routing topology.
Approximately 12 hours prior, the US region faced a similar incident. We now know the EU configuration was corrupted at the same time (Jan 9th) but remained dormant: the region operated normally until a morning backup process propagated the latent configuration, overwriting the live topology.
Technical deep dive
This reflects our current understanding, based on our internal incident review and PlanetScale’s preliminary RCA.
The specific failure mechanism was a logic regression introduced in Vitess Pull Request #17401 regarding how MoveTables failures are handled in sharded keyspaces.
The trigger
A standard import workflow was created on Jan 9th, prior to the incident in the US. As part of this workflow, a standard MoveTables command was issued (on v22). This command failed, which is a normal operational state that the system is designed to handle safely.
This triggered the Vitess MoveTables bug, causing an empty VSchema to be stored but not immediately propagated. The active failure condition was triggered later when a production backup completed; this event initiated a standard reconciliation process that applied the empty VSchema to the topology.
The logic error
When a MoveTables command fails, Vitess attempts to revert the state.
For unsharded keyspaces, Vitess caches the previous VSchema in memory to facilitate a rollback, as the operation might modify the schema.
For sharded keyspaces, a VSchema change is generally not expected during this specific operation, so the previous state is not persisted.
However, the regression in v22 caused the condition "Should we reapply the old VSchema?" to erroneously evaluate to TRUE for our sharded keyspace.
Invalid configuration
Because the system decided to reapply a "previous" VSchema that it had not actually saved (expecting it to be unnecessary for sharded keyspaces), it effectively applied an empty value to the production keyspace topology.
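The failure sequence above can be illustrated with a minimal Python sketch. This is not Vitess source code; the names and structure are hypothetical simplifications of the MoveTables rollback path:

```python
# Minimal sketch (NOT Vitess source) of the v22 rollback regression.
# All names here are hypothetical simplifications.

# Live routing map held in the topology; a real VSchema maps tables to shards.
live_vschemas = {"main": {"sharded": True, "tables": {"conversations": {}}}}

# Previous VSchemas are cached only for unsharded keyspaces, because only
# those MoveTables operations are expected to modify the VSchema.
saved_vschemas = {}

def rollback_move_tables(keyspace, is_sharded, buggy=True):
    # Correct behavior: reapply the old VSchema only when one was cached,
    # i.e. only for unsharded keyspaces.
    should_reapply = True if buggy else (not is_sharded)
    if should_reapply:
        # For a sharded keyspace nothing was cached, so the "previous"
        # VSchema is empty -- and it overwrites the live routing map.
        live_vschemas[keyspace] = saved_vschemas.get(keyspace, {})

rollback_move_tables("main", is_sharded=True)
print(live_vschemas["main"])  # {} -> VTGates drop all routes
```

With the correct condition (`not is_sharded`), the rollback is a no-op for sharded keyspaces and the live routing map is left intact.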
Propagation
The PlanetScale control plane detected this new (empty) VSchema state and propagated it to the VTGates. The VTGates, which rely on the VSchema to route queries to the correct shards, immediately dropped all routes, causing the application-wide outage.
Resolution
PlanetScale engineers reapplied a snapshotted VSchema from one hour prior to the incident and removed artifacts left behind by the failed workflow. This action pushed the correct configuration back to the PlanetScale control plane, which subsequently propagated the valid routing rules to the VTGate fleet. As the VTGates consumed the restored VSchema, they rebuilt their internal routing tables, immediately re-establishing the path between the Intercom application and the backend databases.
Timeline in UTC
(Jan 9th) 19:29 – Standard import workflow created on PlanetScale (on v22), which failed and stored an empty VSchema.
08:23 – A daily production snapshot completed. The completion of this backup initiated a state reconciliation process which picked up the dormant empty VSchema and propagated it to the VTGates, triggering the production impact.
08:25 – Inbox Heartbeat Anomaly alarms triggered in the EU region.
08:25 – Incident declared and acknowledged; engineering and an incident commander paged in.
08:27 – Incident team engaged with PlanetScale to continue the investigation, noting patterns similar to the previous US-region incident from Jan 9th.
08:36 – PlanetScale restored the correct VSchema; service recovered.
08:52 – Rollbacks to Vitess v21 began across affected databases in EU and AU regions.
10:13 – Rollbacks completed for all regions.
Next steps
Ongoing investigation
We are publishing this write-up based on our current understanding of the incident, including PlanetScale’s preliminary RCA and our own analysis of Intercom’s service behavior during the outage. We do this to provide immediate transparency into the root cause. We have high confidence in the direct failure mode but our work here is not finished. We remain actively engaged with PlanetScale to fully understand and remediate the various issues that led to this incident. As we deepen our understanding, we will continue to refine our technical safeguards and remediation plans to ensure your service remains resilient.
When this investigation concludes, we will update this write-up with any material new findings and refine our corrective actions accordingly. Our goal is not only to explain what happened, but to ensure the specific failure mode is prevented and that recovery is faster and more deterministic if a similar class of issue ever occurs again.
Fixing the Vitess regression
PlanetScale has identified the specific code path causing the regression and is deploying a fix. We have paused additional database upgrades for Intercom infrastructure until this patch is suitably verified. To ensure reliability, our upgrade cadence is designed to trail the leading edge, allowing us to verify that new versions have been vetted by the broader engineering community prior to adoption; consistent with that policy, v22 had been in general availability for over nine months before we adopted it.
Hardening topology resilience
We are exploring further safeguards with PlanetScale to protect our application from invalid topology updates.
Preventing unintentional upgrades
We are conducting a review with PlanetScale regarding the pipeline configuration that allowed a database "branch unfreeze" to automatically trigger a pending version upgrade. While this version was scheduled for a controlled rollout the following week, this specific pipeline trigger caused it to deploy prematurely. We are working to enforce stricter separation between maintenance tasks and version upgrades to ensure all deployments are explicitly intended.