On July 30, 2025, customers in the EU region whose conversations were created during a brief dual-write window experienced issues loading those conversations in the app between approximately 9:20 and 11:30 UTC. A secondary issue affecting assigned-ticket functionality was discovered later. All services were fully restored by 15:00 UTC.
The root cause of the incident was data drift originating from a dual-write window in our migration to PlanetScale. For a brief period, writes were sent to both our legacy Aurora database and the new PlanetScale database, which meant that different conversations could receive the same ID on different databases. PlanetScale was the authoritative database, so writes that landed only on Aurora were lost; but because the application reported them as successful, there was a risk of incorrect data being written back to PlanetScale. When our security guardrails detected this, they raised errors to prevent the wrong conversation from being loaded, which caused the initial loading issue that customers observed.
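The ID collision can be illustrated with a small sketch. This is not our actual schema or code; it is a minimal stand-in showing how two databases that each generate their own auto-increment IDs can hand the same ID to different rows once their counters fall out of step:

```python
class Database:
    """Minimal stand-in for a SQL table with an auto-increment primary key."""
    def __init__(self):
        self.next_id = 100
        self.rows = {}

    def insert(self, payload):
        row_id = self.next_id
        self.next_id += 1
        self.rows[row_id] = payload
        return row_id

aurora = Database()
planetscale = Database()

# Before the dual-write window, Aurora receives one extra write that never
# reaches PlanetScale, so the two auto-increment counters diverge.
aurora.insert("conversation-from-before-the-window")

# During the dual-write window, every write goes to both databases.
for payload in ["alice-support-thread", "bob-billing-question"]:
    aurora.insert(payload)
    planetscale.insert(payload)

# ID 100 now refers to a different conversation on each side.
print(aurora.rows[100])       # conversation-from-before-the-window
print(planetscale.rows[100])  # alice-support-thread
```

Once the same ID maps to different conversations, any read routed to the non-authoritative side can surface the wrong data, which is exactly what the guardrails were built to block.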
Subsequently, we identified an issue with Elasticsearch, the tool that powers our inbox, views, and search functions. Because Elasticsearch had also indexed the wrong conversation IDs, conversation reassignments failed and the number of conversations or tickets displayed in a given view could differ from the number actually assigned to a teammate or inbox. This was resolved by manually reindexing data into Elasticsearch from PlanetScale.
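The recovery step amounts to treating the database as the source of truth and rebuilding the search index from it. The sketch below is illustrative only; the names and row shapes are assumptions, and a real reindex would go through the Elasticsearch bulk/reindex APIs rather than an in-memory dict:

```python
def reindex(authoritative_rows, index):
    """Drop stale documents and repopulate the index from the source of truth."""
    index.clear()
    for row in authoritative_rows:
        index[row["id"]] = {"assignee": row["assignee"], "subject": row["subject"]}
    return index

# Stale index: conversation 101 points at the wrong assignee after ID drift.
search_index = {101: {"assignee": "bob", "subject": "billing"}}

# Authoritative rows fetched from the primary database.
authoritative_rows = [
    {"id": 101, "assignee": "alice", "subject": "refund request"},
    {"id": 102, "assignee": "bob", "subject": "login issue"},
]

reindex(authoritative_rows, search_index)
print(search_index[101]["assignee"])  # alice
```

Because the index is rebuilt wholesale rather than patched, stale documents cannot survive the operation, at the cost of a full re-read of the authoritative data.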
To prevent this from happening again, we have taken immediate action to improve our database migration process. This includes updating our runbooks to require strict data consistency checks before switching writes, and to verify that all connections have been drained from the legacy database. The data recovery methods that proved successful during this new failure mode are now our standard procedure, ensuring a faster response and limiting the impact of any future data synchronization issues.
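A pre-cutover consistency check of the kind the runbooks now require can be sketched as comparing, per table, both row counts and an order-independent content checksum between the legacy and new databases. The table and row shapes here are hypothetical, not our production schema:

```python
import hashlib

def table_fingerprint(rows):
    """Order-independent (row count, checksum) fingerprint of a table."""
    digest = hashlib.sha256()
    for row in sorted(repr(r) for r in rows):
        digest.update(row.encode())
    return len(rows), digest.hexdigest()

def consistent(legacy_tables, new_tables):
    """True only if every table matches in both row count and content."""
    if legacy_tables.keys() != new_tables.keys():
        return False
    return all(
        table_fingerprint(legacy_tables[t]) == table_fingerprint(new_tables[t])
        for t in legacy_tables
    )

legacy = {"conversations": [{"id": 1, "subject": "hi"}]}
new = {"conversations": [{"id": 1, "subject": "hi"}]}
print(consistent(legacy, new))   # True: safe to switch writes

new["conversations"].append({"id": 2, "subject": "drifted row"})
print(consistent(legacy, new))   # False: drift detected, block the cutover
```

Gating the write switch on a check like this turns silent drift into a hard failure before the cutover, rather than an error customers see afterwards.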