On April 21, 2026, from 18:11 to 18:23 UTC and again from 22:12 UTC to 01:04 UTC on April 22, the Fin AI Agent was disrupted for approximately 8% of our customers. Within those impacted workspaces, Fin failed to respond to a subset of conversations. The issue affected conversations using the “Let Fin answer” workflow step; Simple Setup and Fin Voice were not affected.
The incident unfolded in two phases. During the first phase, starting at 18:11 UTC, Fin was unable to respond to a subset of conversations, and the failed jobs began accumulating in a retry queue. Our automated monitoring detected an anomaly in Fin's health metrics and triggered an automated rollback of all recent deployments. The rollback completed at 18:23 UTC, allowing new conversations to be processed normally, and full processing returned to normal by 19:07 UTC. However, the triggering change appeared unrelated to the affected Fin AI code path, so it was not initially identified as the likely trigger during triage. When the same change was re-introduced as part of a later deployment bundle at 22:12 UTC, the issue recurred.
During the second phase, the incident was escalated to Major severity, the status page was updated, and a second incident call was convened. The team identified the failing code path and deployed a fix. Fin began responding to new conversations at approximately 00:47 UTC on April 22. By 01:04 UTC, impacted customers confirmed Fin was operating normally. All retry queues were cleared across all regions by 01:17 UTC.
Once recovered, Fin responded immediately to new conversations. Conversations that had stalled during the incident, however, did not automatically resume unless a teammate had already taken over. Those conversations required re-engagement, either from a new customer message or from an admin-triggered workflow that prompted Fin to resume. No conversation data was lost.
We understand that Fin is core to how many of you support your customers, and that any period where Fin is unable to respond directly impacts your ability to deliver fast, reliable support. We sincerely apologize for the disruption this caused.
The root cause was a latent defect in our sharded database connection routing that had been present, undetected, in our codebase for several years.
Intercom's database is sharded with customer data distributed across many independent database partitions. When Fin processes a conversation, it connects to the specific shard that holds that customer's data. A low-level data refresh operation in Fin's session handling bypassed the normal connection routing layer and attempted to use a default database connection that had never been registered. Under normal conditions this path was never reached, and the defect remained dormant.
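To make the failure mode concrete, here is a minimal sketch in Python. The names (ConnectionRouter, refresh_session_state, the shard keys) are hypothetical and chosen for illustration; this is not our actual code:

```python
# Illustrative sketch only; all names are hypothetical.

class Connection:
    def __init__(self, name: str) -> None:
        self.name = name

class ConnectionRouter:
    """Maps a shard key to a registered database connection."""
    def __init__(self) -> None:
        self._connections: dict[str, Connection] = {}

    def register(self, shard: str, conn: Connection) -> None:
        self._connections[shard] = conn

    def connection_for(self, shard: str) -> Connection:
        # Normal path: every customer shard is registered at boot,
        # so routed lookups always resolve.
        return self._connections[shard]

router = ConnectionRouter()
for shard in ("shard_01", "shard_02"):
    router.register(shard, Connection(shard))

def refresh_session_state() -> None:
    # Latent defect: this low-level refresh bypasses shard routing
    # and looks up a "default" connection that was never registered.
    conn = router.connection_for("default")  # raises KeyError: 'default'
    # ... refresh work would use `conn` here ...
```

Under normal conditions refresh_session_state is never reached, so the unregistered key goes unnoticed. Once a deployment makes the path reachable, every invocation raises, and the failed jobs accumulate in the retry queue.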
A deployment on April 21 triggered this latent defect at scale. The correlation was unambiguous: errors spiked to over 11,000 per hour within 12 minutes of the deployment, stopped immediately when the deployment was rolled back, and resumed when the same change was re-introduced in a later deployment bundle.
We are continuing to investigate exactly how this code change, which did not directly modify any database code, made the latent defect reachable at this scale. Our current understanding is that an earlier change introduced a database access pattern into Fin's conversation pipeline that interacted with the connection routing state in a way that exposed the unregistered default connection. The April 21 change then triggered that access pattern. This investigation is ongoing. We have prioritized sharing this update within 24 hours of incident resolution.
What is clear is that the latent defect was the root cause, not the deployment itself. The defect has been fixed: we have registered the missing default connection in our sharded database configuration, ensuring that the refresh operation can always find a valid connection regardless of how it is reached.
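In the terms of the illustrative sketch above, the fix amounts to a single registration, so the bypassing path resolves no matter how it is reached:

```python
# Continuing the illustrative sketch: register a valid connection
# under the default key so the refresh path always resolves.
router.register("default", Connection("primary"))
refresh_session_state()  # now finds a valid connection instead of raising
```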
18:11, Apr 21 — A deployment containing the triggering code change completes. Errors spike immediately
18:23 — Automatically triggered rollback completes. Errors stop
19:07 — Incident mitigated. Retry queue cleared
22:12 — Same code change re-introduced in a deployment bundle. Errors resume across all regions
23:00 — Customer reports arrive. Incident reopened
23:20 — Escalated to Major severity. Status pages updated for US, EU, and AU
23:40 — Error onset correlated to the 22:12 deployment
23:54 — Root cause identified: session state refresh loses shard connection
00:02, Apr 22 — First fix deployed. Partially effective
00:38 — Comprehensive fix deployed
00:47 — Errors stop. Fin begins responding to all new conversations
01:04 — Resolution confirmed by customers
01:17 — All retry queues cleared across all regions
01:19 — Incident resolved. Status pages updated
The latent defect has been fixed. The missing default connection has been registered in our sharded database configuration.
All retry queues across all three regions have been fully cleared. All conversation data was securely retained within our database, though we recognize the significant friction caused by the need to manually resume these stalled sessions.
We are investigating why our Fin health monitoring detected the first occurrence but not the second, and are working to close this gap so that failures like this consistently trigger automated rollbacks.
We are reviewing how database connections are maintained during Fin's conversation processing to ensure that connection issues are handled gracefully rather than causing complete job failure; a sketch of what this could look like follows below.
We are continuing to investigate the exact mechanism by which the April 21 deployment triggered the latent defect, to ensure that similar interactions cannot occur in the future.
We are prioritizing updates to Fin’s architecture that will allow us to safely auto-resume Fin on impacted conversations at scale during an interruption. This change focuses on protecting the end-user experience and entirely removing the manual recovery burden from our customers' support teams.
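As a sketch of the graceful connection handling described above, again using the hypothetical names from the earlier example (only Python's standard logging module is real here): a connection-resolution failure is caught and surfaced so the caller can skip and retry a single refresh rather than fail the entire job.

```python
import logging

log = logging.getLogger("fin.pipeline")

def refresh_with_guard(shard: str) -> Connection | None:
    # Resolve the shard connection defensively: a routing failure is
    # logged and returned as "no connection" so the caller can defer
    # this refresh and retry later, instead of failing the whole job.
    try:
        return router.connection_for(shard)
    except KeyError:
        log.warning("no connection for shard %r; deferring refresh", shard)
        return None
```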
We take full responsibility for this outage. The latent defect that caused it is well-understood and has been fixed. We are committed to the reliability you expect from Intercom, and are taking concrete steps to ensure that our connection handling, monitoring, and deployment processes prevent this class of failure from recurring.