Between approximately 14:07 and 14:23 UTC, and again between approximately 14:37 and 14:47 UTC, teammates in our AU, EU, and US regions were logged out of the Intercom app and were unable to log in or maintain active sessions.
Other products, including Fin, the Messenger, the API, and inbound/outbound email, were not affected. No customer data was lost or corrupted. We nevertheless treat this with the same severity as a total outage.
We apologize for the disruption to your support operations and the frustration caused by this issue.
Root cause
Intercom’s backend is built on Ruby on Rails. We roll out every production deployment gradually, so new application code and behaviour must remain compatible with the code already running. As part of a framework configuration update, we deployed a change to the algorithm used for signing session cookies. This change was not backwards compatible with the old application versions running in production.
During a rollout, requests can be routed to different application servers, so many teammates’ requests alternated between the two application versions. Each version rejected the other’s session cookies as invalid, which progressively logged teammates out and prevented stable login sessions until the fleet converged on a single version.
This same incompatibility also meant that rolling back the change reintroduced another mixed-version window during recovery, which contributed to both the incident’s duration and its recurrence.
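To make the failure mode concrete, here is a minimal, self-contained sketch of how two application versions that sign session cookies with different algorithms end up rejecting each other’s sessions. The digest names, secret, and payload are illustrative assumptions rather than Intercom’s actual configuration; in a real Rails app this signing is handled by the framework (for example via ActiveSupport::MessageVerifier).

```ruby
require "openssl"

# Illustrative only: a per-environment secret and two hypothetical digests
# standing in for the "before" and "after" signing configuration.
SECRET = "per-environment-secret"

def sign(payload, digest)
  data = [payload].pack("m0") # Base64-encode the session payload
  "#{data}--#{OpenSSL::HMAC.hexdigest(digest, SECRET, data)}"
end

def verify(cookie, digest)
  data, signature = cookie.split("--", 2)
  expected = OpenSSL::HMAC.hexdigest(digest, SECRET, data)
  # A real implementation would use a constant-time comparison here.
  signature == expected ? data.unpack1("m0") : nil # nil behaves like "logged out"
end

old_cookie = sign("teammate-session", "SHA1")   # written by the old application version
new_cookie = sign("teammate-session", "SHA256") # written by the new application version

verify(old_cookie, "SHA256") # => nil: new code rejects cookies set by old code
verify(new_cookie, "SHA1")   # => nil: old code rejects cookies set by new code
verify(new_cookie, "SHA256") # => "teammate-session": same-version requests succeed
```

While both versions are in the fleet, any request that lands on the "other" version fails verification, which is why teammates were logged out even though each version was internally consistent.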
Because this issue only occurs while two incompatible versions are serving traffic during a rolling deployment, the symptoms would stop once the fleet converged on a single version. During an active incident we cannot rely on that timing or outcome, so our automated safeguards are designed to roll back to a known-good version rather than wait.
Why it reached production
We have identified a gap in our pre-production environment that prevented this specific failure mode from being detected. Before promoting a release to production, in addition to standard automated tests, we deploy it to an isolated environment that mirrors production and run synthetic browser tests against it. We deploy the new version to 100% of that test environment before running the automated test suite. Because the suite therefore ran against a homogeneous environment in which every server was running the new code, the mixed-version incompatibility that caused the incident did not exist there. The tests passed, and the release was automatically promoted to production, where the rolling deployment strategy exposed the incompatibility.
Why recovery was delayed
Our monitoring systems detected the anomaly immediately after deployment, and our automated rollback system attempted to revert the change. A logic flaw in our deployment pipeline caused a queued promotion from pre-production to take precedence over the rollback, which prevented the rollback from starting cleanly. Engineers intervened and completed a manual rollback; because the issue persists for as long as mixed versions are serving traffic, this delay extended the time to recovery.
Why the issue recurred
After the initial recovery, engineers moved to restore the platform’s safety systems by re-enabling the automated rollback system. Re-enabling this system also automatically unlocked the deployment pipeline. The pipeline, interpreting the previously failed (but newer) code as the next valid state, immediately began redeploying the problematic configuration.
When engineers attempted to manually roll back this mistaken deployment, the automated rollback system, now active again, was re-triggered by the recurrence of login issues. A different logic flaw led it to cancel the manual rollback without initiating a replacement, because the automated and manual rollbacks targeted the same code version. As a consequence, we continued to run the new application version until a redeployment of the old code was manually triggered two minutes later.
Timeline (UTC) - US region (AU/EU region timelines are similar)
Incident 1
14:07 - Code deployment begins. First failed requests are detected at 14:07:45. Mixed versions are now serving traffic.
14:08 - Inbox Heartbeat Anomaly alarms trigger in the EU and US regions.
14:09 - Incident is declared and acknowledged, and engineering and an incident commander are paged in.
14:11 - Automated rollback is triggered, but a deployment pipeline logic flaw causes a queued new application version to take precedence, delaying the rollback until engineers intervene.
14:15 - Engineers execute a manual rollback.
14:19 - Recovery observed as error rates drop.
14:23 - First recovery completed.
Incident 2
14:34 - Engineers re-enable automated rollbacks. Due to the tooling design, this action automatically unlocks the pipeline and redeploys the new application configuration before a revert is ready.
14:37 - Error rates spike again (the second incident begins).
14:39 - Engineers attempt a manual rollback.
14:39 - The automated rollback system triggers, terminating the manual rollback attempt in order to start its own process, but then fails to perform a rollback deployment.
14:41 - Engineers manually trigger a redeployment of the old code.
14:45 - Recovery observed as error rates drop.
14:47 - Full recovery confirmed.
Next steps
Ongoing investigation
We have identified the direct cause of the incident and are continuing a deeper review of the deployment and rollback safeguards that failed. This level of instability is not acceptable, and we treat the deployment pipeline and rollback automation as safety-critical systems.
Prevent bad version redeployments
We are adding the ability to mark a deployment as unsafe during an incident. Unsafe versions will be blocked from redeployment, even if the pipeline is unlocked, to prevent accidental reintroduction of a known bad change.
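As a rough sketch of the intended behaviour (the registry and method names below are hypothetical, not our actual tooling), a version flagged unsafe during an incident is refused by the deploy step even when the pipeline is otherwise unlocked:

```ruby
require "set"

# Illustrative only: a registry of versions flagged unsafe during an incident,
# consulted before any deployment is allowed to proceed.
class UnsafeVersionRegistry
  def initialize
    @unsafe = Set.new
  end

  def mark_unsafe!(version)
    @unsafe.add(version)
  end

  def deployable?(version)
    !@unsafe.include?(version)
  end
end

registry = UnsafeVersionRegistry.new
registry.mark_unsafe!("build-4213")  # flagged by the responding engineers

registry.deployable?("build-4213")   # => false: blocked even if the pipeline is unlocked
registry.deployable?("build-4212")   # => true: the known-good version can still ship
```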
Hardening automated rollbacks
We are patching the deployment pipeline logic to ensure that a rollback or cancellation command always takes precedence over the promotion of new builds. This fixes the specific bug that delayed our initial recovery.
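In terms of queue ordering, the guarantee the patch needs to provide looks roughly like the simplified sketch below (command types and build names are illustrative): rollbacks and cancellations always sort ahead of any queued promotion, regardless of arrival order.

```ruby
# Simplified sketch: pipeline commands are drained in priority order, so a
# rollback or cancellation is always processed before a queued promotion.
PRIORITY = { rollback: 0, cancel: 0, promote: 1 }.freeze

Command = Struct.new(:kind, :version, :enqueued_at)

def next_command(queue)
  queue.min_by { |cmd| [PRIORITY.fetch(cmd.kind), cmd.enqueued_at] }
end

queue = [
  Command.new(:promote, "build-4213", 1),  # promotion queued from pre-production
  Command.new(:rollback, "build-4212", 2)  # rollback triggered by the anomaly
]

next_command(queue).kind # => :rollback, even though the promotion was queued first
```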
Pre-production environment updates
We are updating our pre-production environment to better mirror the mixed-version reality of a production rolling deployment. Future tests will run against a cluster that contains both the old and new versions of the code simultaneously, ensuring backward-compatibility issues are caught before promotion.
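The compatibility property those tests need to exercise looks roughly like the sketch below, which reuses the illustrative cookie-signing model from the root-cause section: a session written by any server in a deliberately mixed-version cluster must be accepted by every other server before the release can be promoted. The digests, version labels, and promotion check are assumptions for illustration only.

```ruby
require "openssl"

SECRET = "per-environment-secret"

# Hypothetical server model: each application version signs sessions with its
# own digest, standing in for the real framework configuration change.
Server = Struct.new(:version, :digest) do
  def write_session(payload)
    data = [payload].pack("m0")
    "#{data}--#{OpenSSL::HMAC.hexdigest(digest, SECRET, data)}"
  end

  def accepts?(cookie)
    data, signature = cookie.split("--", 2)
    signature == OpenSSL::HMAC.hexdigest(digest, SECRET, data)
  end
end

# A pre-production cluster that mirrors a rolling deploy: old and new at once.
cluster = [Server.new("old", "SHA1"), Server.new("new", "SHA256")]

compatible = cluster.all? do |writer|
  cookie = writer.write_session("teammate-session")
  cluster.all? { |reader| reader.accepts?(cookie) }
end

compatible # => false for this change, so the promotion gate would have blocked it
```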
Tooling UX improvements
We are redesigning the automated rollback re-enablement workflow to decouple it from unlocking the pipeline. This will prevent operators from accidentally redeploying code when restoring safety systems. We have updated our immediate incident guidance to prevent this recurrence until the tooling fix is shipped.
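A rough sketch of the decoupled workflow (class and method names are hypothetical): re-enabling automated rollbacks becomes an operation of its own, and unlocking the pipeline remains a separate, explicit decision rather than a side effect.

```ruby
# Illustrative only: restoring the safety system and resuming deployments are
# two separate, explicit operations rather than one coupled action.
class DeploymentControls
  attr_reader :auto_rollback_enabled, :pipeline_locked

  def initialize
    @auto_rollback_enabled = false
    @pipeline_locked = true
  end

  # Restores the safety net without touching the pipeline lock.
  def enable_auto_rollback!
    @auto_rollback_enabled = true
  end

  # Resuming deployments requires a deliberate, separate step by the operator.
  def unlock_pipeline!
    @pipeline_locked = false
  end
end

controls = DeploymentControls.new
controls.enable_auto_rollback!
controls.pipeline_locked # => true: nothing redeploys as a side effect
```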
Automated rollback logic update
We are updating the rollback logic to ensure it never terminates an operator-initiated manual rollback. Human intervention during an incident should always take precedence over automation.
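A simplified arbitration sketch of that rule (not our actual rollback service): if an operator-initiated rollback is already in flight, the automated system stands down instead of cancelling it, even when both target the same version.

```ruby
# Illustrative decision logic: automation never cancels an operator's rollback.
Rollback = Struct.new(:initiator, :target_version, keyword_init: true)

def automated_rollback_decision(in_flight, target_version)
  if in_flight && in_flight.initiator == :operator
    { action: :stand_down, reason: "manual rollback already in progress" }
  else
    { action: :start,
      rollback: Rollback.new(initiator: :automation, target_version: target_version) }
  end
end

manual = Rollback.new(initiator: :operator, target_version: "build-4212")
automated_rollback_decision(manual, "build-4212")
# => { action: :stand_down, ... } rather than cancelling the operator's rollback
```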
Atomic deployment capability
We follow a strategy of safe, incremental rollouts, shipping the read path first to ensure backward compatibility during deployments (we talk about our process here). For certain framework-level configuration changes that are effectively binary, we are exploring an atomic deployment approach so traffic can switch from old to new without a mixed-version window.
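As a sketch of the idea (the class and fleet names are illustrative, not our infrastructure code), an atomic cut-over provisions the new version fully alongside the old one and then flips a single routing pointer, so there is never a window in which both versions are serving traffic:

```ruby
# Conceptual sketch of a blue/green-style atomic switch. In practice the flip
# would be a load balancer target swap; here it is a single all-or-nothing
# assignment rather than a gradual ramp.
class Router
  attr_reader :active_fleet

  def initialize(active_fleet)
    @active_fleet = active_fleet
  end

  def cut_over!(new_fleet)
    @active_fleet = new_fleet
  end
end

router = Router.new("fleet-running-build-4212")
router.active_fleet                          # old code serves 100% of traffic
router.cut_over!("fleet-running-build-4213")
router.active_fleet                          # new code serves 100% of traffic, no mixed window
```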