Between approximately 14:07 and 14:23 UTC, and again between approximately 14:37 and 14:47 UTC, teammates in our AU, EU, and US regions were logged out of the Intercom app and were unable to log in or maintain active sessions.
Other products, including Fin, the Messenger, the API, and inbound/outbound email, were not affected. No customer data was lost or corrupted. We nevertheless treat this with the same severity as a total outage.
We apologize for the disruption to your support operations and the frustration caused by this issue.
Root cause
Intercom’s backend is built on Ruby on Rails. We roll out every production deployment gradually, so new application code and behaviour must remain compatible with the code already running. As part of a framework configuration update, we deployed a change to the algorithm used for signing session cookies. This change was not backwards compatible with the old application versions running in production.
During a rollout, requests can be routed to different application servers, so many teammates’ requests alternated between the two application versions. Each version rejected the other’s session cookies as invalid, which progressively logged teammates out and prevented stable login sessions until the fleet converged on a single version.
This same incompatibility also meant that rolling back the change reintroduced another mixed-version window during recovery, which contributed to both the incident’s duration and its recurrence.
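To make the failure mode concrete, here is a minimal, self-contained sketch of how two application versions that sign session cookies with different algorithms end up rejecting each other’s sessions. The digest names, secret, and payload are illustrative assumptions rather than Intercom’s actual configuration; in a real Rails app this signing is handled by the framework (for example via ActiveSupport::MessageVerifier).

```ruby
require "openssl"

# Illustrative only: a per-environment secret and two hypothetical digests
# standing in for the "before" and "after" signing configuration.
SECRET = "per-environment-secret"

def sign(payload, digest)
  data = [payload].pack("m0") # Base64-encode the session payload
  "#{data}--#{OpenSSL::HMAC.hexdigest(digest, SECRET, data)}"
end

def verify(cookie, digest)
  data, signature = cookie.split("--", 2)
  expected = OpenSSL::HMAC.hexdigest(digest, SECRET, data)
  # A real implementation would use a constant-time comparison here.
  signature == expected ? data.unpack1("m0") : nil # nil behaves like "logged out"
end

old_cookie = sign("teammate-session", "SHA1")   # written by the old application version
new_cookie = sign("teammate-session", "SHA256") # written by the new application version

verify(old_cookie, "SHA256") # => nil: new code rejects cookies set by old code
verify(new_cookie, "SHA1")   # => nil: old code rejects cookies set by new code
verify(new_cookie, "SHA256") # => "teammate-session": same-version requests succeed
```

While both versions are in the fleet, any request that lands on the "other" version fails verification, which is why teammates were logged out even though each version was internally consistent.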
Because this issue only occurs while two incompatible versions are serving traffic during a rolling deployment, the symptoms would stop once the fleet converged on a single version. During an active incident we cannot rely on that timing or outcome, so our automated safeguards are designed to roll back to a known-good version rather than wait.
Why it reached production
We have identified a gap in our pre-production environment that prevented this specific failure mode from being detected. Before promoting a release to production, in addition to standard automated tests, we deploy it to an isolated environment that mirrors production and run synthetic browser tests against it. We deploy the new version to 100% of that test environment before running the automated test suite. Because the suite therefore ran against a homogeneous environment in which every server was running the new code, the mixed-version incompatibility that caused the incident did not exist there. The tests passed, and the release was automatically promoted to production, where the rolling deployment strategy exposed the incompatibility.
Why recovery was delayed
Our monitoring systems detected the anomaly immediately after deployment, and our automated rollback system attempted to revert the change. A logic flaw in our deployment pipeline caused a queued promotion from pre-production to take precedence over the rollback, which prevented the rollback from starting cleanly. Engineers intervened and completed a manual rollback; because the issue persists for as long as mixed versions are serving traffic, this delay extended the time to recovery.
Why the issue recurred
After the initial recovery, engineers moved to restore the platform’s safety systems by re-enabling the automated rollback system. Re-enabling this system also automatically unlocked the deployment pipeline. The pipeline, interpreting the previously failed (but newer) code as the next valid state, immediately began redeploying the problematic configuration.
When engineers attempted to manually roll back this mistaken deployment, the automated rollback system, now active again, was re-triggered by the recurrence of login issues. A different logic flaw led it to cancel the manual rollback without initiating a replacement, because the automated and manual rollbacks targeted the same code version. As a consequence, we continued to run the new application version until a redeployment of the old code was manually triggered two minutes later.
Timeline (UTC) - US region (AU/EU region timelines are similar)
Incident 1
14:07 - Code deployment begins. First failed requests are detected at 14:07:45. Mixed versions are now serving traffic.
14:08 - Inbox Heartbeat Anomaly alarms trigger in the EU and US regions.
14:09 - Incident is declared and acknowledged, and engineering and an incident commander are paged in.
14:11 - Automated rollback is triggered, but a deployment pipeline logic flaw causes a queued new application version to take precedence, delaying the rollback until engineers intervene.
14:15 - Engineers execute a manual rollback.
14:19 - Recovery observed as error rates drop.
14:23 - First recovery completed.
Incident 2
14:34 - Engineers re-enable automated rollbacks. Due to the tooling design, this action automatically unlocks the pipeline and redeploys the new application configuration before a revert is ready.
14:37 - Error rates spike again (the second incident begins).
14:39 - Engineers attempt a manual rollback.
14:39 - The automated rollback system triggers, terminating the manual rollback attempt in order to start its own process, but then fails to perform a rollback deployment.
14:41 - Engineers manually trigger a redeployment of the old code.
14:45 - Recovery observed as error rates drop.
14:47 - Full recovery confirmed.
Next steps
Ongoing investigation
We have identified the direct cause of the incident and are continuing a deeper review of the deployment and rollback safeguards that failed. This level of instability is not acceptable, and we treat the deployment pipeline and rollback automation as safety-critical systems.
Prevent bad version redeployments
We are adding the ability to mark a deployment as unsafe during an incident. Unsafe versions will be blocked from redeployment, even if the pipeline is unlocked, to prevent accidental reintroduction of a known bad change.
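As a rough sketch of the intended behaviour (the registry and method names below are hypothetical, not our actual tooling), a version flagged unsafe during an incident is refused by the deploy step even when the pipeline is otherwise unlocked:

```ruby
require "set"

# Illustrative only: a registry of versions flagged unsafe during an incident,
# consulted before any deployment is allowed to proceed.
class UnsafeVersionRegistry
  def initialize
    @unsafe = Set.new
  end

  def mark_unsafe!(version)
    @unsafe.add(version)
  end

  def deployable?(version)
    !@unsafe.include?(version)
  end
end

registry = UnsafeVersionRegistry.new
registry.mark_unsafe!("build-4213")  # flagged by the responding engineers

registry.deployable?("build-4213")   # => false: blocked even if the pipeline is unlocked
registry.deployable?("build-4212")   # => true: the known-good version can still ship
```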
Hardening automated rollbacks
We are patching the deployment pipeline logic to ensure that a rollback or cancellation command always takes precedence over the promotion of new builds. This fixes the specific bug that delayed our initial recovery.
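In terms of queue ordering, the guarantee the patch needs to provide looks roughly like the simplified sketch below (command types and build names are illustrative): rollbacks and cancellations always sort ahead of any queued promotion, regardless of arrival order.

```ruby
# Simplified sketch: pipeline commands are drained in priority order, so a
# rollback or cancellation is always processed before a queued promotion.
PRIORITY = { rollback: 0, cancel: 0, promote: 1 }.freeze

Command = Struct.new(:kind, :version, :enqueued_at)

def next_command(queue)
  queue.min_by { |cmd| [PRIORITY.fetch(cmd.kind), cmd.enqueued_at] }
end

queue = [
  Command.new(:promote, "build-4213", 1),  # promotion queued from pre-production
  Command.new(:rollback, "build-4212", 2)  # rollback triggered by the anomaly
]

next_command(queue).kind # => :rollback, even though the promotion was queued first
```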
Pre-production environment updates
We are updating our pre-production environment to better mirror the mixed-version reality of a production rolling deployment. Future tests will run against a cluster that contains both the old and new versions of the code simultaneously, ensuring backward-compatibility issues are caught before promotion.
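The compatibility property those tests need to exercise looks roughly like the sketch below, which reuses the illustrative cookie-signing model from the root-cause section: a session written by any server in a deliberately mixed-version cluster must be accepted by every other server before the release can be promoted. The digests, version labels, and promotion check are assumptions for illustration only.

```ruby
require "openssl"

SECRET = "per-environment-secret"

# Hypothetical server model: each application version signs sessions with its
# own digest, standing in for the real framework configuration change.
Server = Struct.new(:version, :digest) do
  def write_session(payload)
    data = [payload].pack("m0")
    "#{data}--#{OpenSSL::HMAC.hexdigest(digest, SECRET, data)}"
  end

  def accepts?(cookie)
    data, signature = cookie.split("--", 2)
    signature == OpenSSL::HMAC.hexdigest(digest, SECRET, data)
  end
end

# A pre-production cluster that mirrors a rolling deploy: old and new at once.
cluster = [Server.new("old", "SHA1"), Server.new("new", "SHA256")]

compatible = cluster.all? do |writer|
  cookie = writer.write_session("teammate-session")
  cluster.all? { |reader| reader.accepts?(cookie) }
end

compatible # => false for this change, so the promotion gate would have blocked it
```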
Tooling UX improvements
We are redesigning the automated rollback re-enablement workflow to decouple it from unlocking the pipeline. This will prevent operators from accidentally redeploying code when restoring safety systems. We have updated our immediate incident guidance to prevent this recurrence until the tooling fix is shipped.
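A rough sketch of the decoupled workflow (class and method names are hypothetical): re-enabling automated rollbacks becomes an operation of its own, and unlocking the pipeline remains a separate, explicit decision rather than a side effect.

```ruby
# Illustrative only: restoring the safety system and resuming deployments are
# two separate, explicit operations rather than one coupled action.
class DeploymentControls
  attr_reader :auto_rollback_enabled, :pipeline_locked

  def initialize
    @auto_rollback_enabled = false
    @pipeline_locked = true
  end

  # Restores the safety net without touching the pipeline lock.
  def enable_auto_rollback!
    @auto_rollback_enabled = true
  end

  # Resuming deployments requires a deliberate, separate step by the operator.
  def unlock_pipeline!
    @pipeline_locked = false
  end
end

controls = DeploymentControls.new
controls.enable_auto_rollback!
controls.pipeline_locked # => true: nothing redeploys as a side effect
```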
Automated rollback logic update
We are updating the rollback logic to ensure it never terminates an operator-initiated manual rollback. Human intervention during an incident should always take precedence over automation.
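A simplified arbitration sketch of that rule (not our actual rollback service): if an operator-initiated rollback is already in flight, the automated system stands down instead of cancelling it, even when both target the same version.

```ruby
# Illustrative decision logic: automation never cancels an operator's rollback.
Rollback = Struct.new(:initiator, :target_version, keyword_init: true)

def automated_rollback_decision(in_flight, target_version)
  if in_flight && in_flight.initiator == :operator
    { action: :stand_down, reason: "manual rollback already in progress" }
  else
    { action: :start,
      rollback: Rollback.new(initiator: :automation, target_version: target_version) }
  end
end

manual = Rollback.new(initiator: :operator, target_version: "build-4212")
automated_rollback_decision(manual, "build-4212")
# => { action: :stand_down, ... } rather than cancelling the operator's rollback
```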
Atomic deployment capability
We follow a strategy of safe, incremental rollouts, shipping the read path first to ensure backward compatibility during deployments (we talk about our process here). For certain framework-level configuration changes that are effectively binary, we are exploring an atomic deployment approach so traffic can switch from old to new without a mixed-version window.
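As a sketch of the idea (the class and fleet names are illustrative, not our infrastructure code), an atomic cut-over provisions the new version fully alongside the old one and then flips a single routing pointer, so there is never a window in which both versions are serving traffic:

```ruby
# Conceptual sketch of a blue/green-style atomic switch. In practice the flip
# would be a load balancer target swap; here it is a single all-or-nothing
# assignment rather than a gradual ramp.
class Router
  attr_reader :active_fleet

  def initialize(active_fleet)
    @active_fleet = active_fleet
  end

  def cut_over!(new_fleet)
    @active_fleet = new_fleet
  end
end

router = Router.new("fleet-running-build-4212")
router.active_fleet                          # old code serves 100% of traffic
router.cut_over!("fleet-running-build-4213")
router.active_fleet                          # new code serves 100% of traffic, no mixed window
```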