Intermittent issues with Intercom

Root Cause Analysis: Main Database Instability Leading to App Outages

Between March 30th and today, April 11th, our Main database experienced intermittent instability that resulted in short performance blips, the majority lasting between 10 and 90 seconds. While most of these occurred outside peak traffic hours (09:00–20:00 UTC), several were sustained, required manual intervention, and impacted availability for customers in our US hosted region. Shorter blips were largely mitigated by our application retry logic, while longer ones may have resulted in error pages.
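
To illustrate how the shorter blips are absorbed, the sketch below shows a generic retry-with-backoff wrapper for transient database errors. It is illustrative only and not our actual implementation; it assumes the pymysql driver, and the function and parameter names are placeholders. Blips shorter than the retry window are retried transparently, while longer stalls exhaust the budget and surface as errors.

    import random
    import time

    import pymysql  # assumed MySQL driver; any DB-API driver with transient error types works

    # Errors that typically indicate a transient blip (connection dropped, writer restarting).
    TRANSIENT_ERRORS = (pymysql.err.OperationalError, pymysql.err.InterfaceError)

    def run_with_retries(operation, max_attempts=4, base_delay=0.2):
        """Run a database operation, retrying transient failures with jittered backoff."""
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()
            except TRANSIENT_ERRORS:
                if attempt == max_attempts:
                    raise  # the blip outlasted the retry budget; surface the error
                # Exponential backoff with jitter so retries don't pile up all at once.
                time.sleep(base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5))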

What happened

The issue began on Sunday, March 30th, and a failover mitigated the initial outage. In the following days, the database continued to experience brief periods of degraded performance.

In response, we:

  • Shifted 50% of the write load to PlanetScale (as part of the planned migration).

  • Reduced application transaction volume on the database.

  • Performed an emergency upgrade to the latest Aurora patch release.

  • Opened multiple business-critical cases with AWS Aurora support.

Unfortunately, these mitigations had no positive impact; the performance blips became more frequent, especially after the Aurora patch early on April 4th.

Late on Saturday, April 5th, we changed a database configuration setting, reducing binlog retention from 30 days to 7 days. This change should have had no impact on performance, and the previous setting was aligned with best practices for our migration off Aurora. However, following the change, the constant performance blips ceased. At the time, the correlation was suggestive but not conclusive.
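
For context, binlog retention on Aurora MySQL (assumed here, since binlogs and a PlanetScale migration both imply the MySQL engine) is adjusted through an RDS stored procedure rather than an instance parameter. A minimal sketch, with placeholder connection details:

    import os

    import pymysql  # assumed driver; the environment variables below are placeholders

    # Placeholder connection details pointing at the cluster writer endpoint.
    conn = pymysql.connect(
        host=os.environ["DB_WRITER_HOST"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
    )
    with conn.cursor() as cur:
        # Reduce binlog retention from 30 days (720 hours) to 7 days (168 hours).
        cur.execute("CALL mysql.rds_set_configuration('binlog retention hours', 168)")
        # Confirm the new value.
        cur.execute("CALL mysql.rds_show_configuration()")
        print(cur.fetchall())
    conn.close()

The April 7th revert described below was the equivalent call with a value of 720 hours.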

On April 7th, we attempted to revert the retention setting back to 30 days. This triggered a restart of the database writer and resulted in a 90-second stall, which was unexpected behavior for a routine configuration change and strongly suggested an underlying issue with Aurora itself.

On April 8th and April 11th, we again experienced issues that resulted in errors in the application.


Current status

We’re working closely with AWS and the Aurora team, who are conducting their root cause analysis. AWS has confirmed the root cause as an internal memory corruption issue, and we’re waiting for them to complete their RCA and mitigate the underlying issue. As part of this most recent incident, we have also moved the writer instance to different hardware to reduce the likelihood of further memory corruption.
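
Concretely, moving an Aurora writer to different hardware amounts to provisioning a fresh instance in the cluster and failing over to it. The sketch below uses boto3 to show the general shape; the cluster and instance identifiers, instance class, and region are placeholders, not our actual configuration:

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")  # placeholder region

    # Provision a new reader on fresh hardware (placeholder identifiers and class).
    rds.create_db_instance(
        DBInstanceIdentifier="main-writer-replacement",
        DBClusterIdentifier="main-cluster",
        Engine="aurora-mysql",
        DBInstanceClass="db.r6g.8xlarge",
    )

    # Wait for the new instance to come up, then fail over so it becomes the writer.
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="main-writer-replacement")
    rds.failover_db_cluster(
        DBClusterIdentifier="main-cluster",
        TargetDBInstanceIdentifier="main-writer-replacement",
    )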

In parallel, we’re continuing to migrate workloads to PlanetScale. This includes:

  • Migrating more tables from our Main database to PlanetScale in the coming days.

  • Migrating off our sharded Aurora infrastructure, which has been a key driver of DB-related incidents over the past year. More than 50% of our customers’ workspaces have already been migrated with zero downtime.

Going forward

We’ll publish a complete public RCA once AWS finalizes their internal investigation and mitigation. In the meantime, we continue to prioritize platform availability above all else.

We appreciate your patience and understanding as we work through this.