Intercom

Write-up
Increased Latency on the Intercom Web App

On 2024-07-25, a database configuration change was deployed to one of Intercom’s core database services. The change increased the size of the transaction pool, allowing more concurrent transactions on the database. It was made because normal operations sometimes came close to the previous limit, which risked requests failing.

On 2024-07-26, a further configuration change was deployed to the same database, increasing the maximum duration of a transaction from 20 to 180 seconds.

Each change was deployed in isolation, and no immediate negative effects were observed. However, a significant number of scheduled conversation unsnoozes happen on the hour, and each unsnooze requires a transaction on the database where the configuration changes were deployed. Later in the day, the combination of the longer timeout (allowing transactions to wait for longer) and the larger pool (allowing more transactions to wait concurrently) interacted poorly under this load, flooding the transaction pool with long-running transactions. This manifested as latency and errors when loading conversations in the inbox for a few minutes at the start of each hour.
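The interaction between the two settings can be illustrated with a toy model. This is only a sketch under stated assumptions, not a description of Intercom's actual database or pooler: the pool sizes (100 and 400), the burst size, and the completion rate are hypothetical numbers chosen for illustration, and `PoolConfig` and `seconds_pool_is_busy` are made-up names. Only the 20 and 180 second timeouts come from the write-up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolConfig:
    pool_size: int   # max concurrent transactions (hypothetical values)
    timeout_s: int   # max transaction duration in seconds (20 and 180 are from the write-up)


def seconds_pool_is_busy(cfg: PoolConfig, burst: int, completions_per_s: float) -> float:
    """Toy model of an on-the-hour burst.

    Assumptions (all simplifications):
    - `burst` transactions arrive at t=0.
    - At most `pool_size` are admitted; the rest fail immediately.
    - The database finishes `completions_per_s` admitted transactions per second.
    - Anything still waiting after `timeout_s` seconds is aborted.

    Returns how long the burst keeps pool slots occupied.
    """
    admitted = min(burst, cfg.pool_size)
    time_to_drain = admitted / completions_per_s
    return min(time_to_drain, float(cfg.timeout_s))


# Hypothetical scenarios: baseline, each change alone, and both together.
SCENARIOS = {
    "baseline": PoolConfig(pool_size=100, timeout_s=20),
    "larger pool only": PoolConfig(pool_size=400, timeout_s=20),
    "longer timeout only": PoolConfig(pool_size=100, timeout_s=180),
    "both changes": PoolConfig(pool_size=400, timeout_s=180),
}

if __name__ == "__main__":
    for name, cfg in SCENARIOS.items():
        busy = seconds_pool_is_busy(cfg, burst=5000, completions_per_s=2.0)
        print(f"{name}: pool busy for {busy:.0f}s")
```

Under these made-up numbers, either change alone keeps the pool busy for 20 or 50 seconds, while both changes together keep it busy for 180 seconds. The short timeout caps the damage a larger pool can do, and the smaller pool caps the damage a longer timeout can do, so the problem only becomes pronounced when both limits are raised.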

Recovery was hampered because the first problems coincided with a logical change in our application that used new database functionality to speed up certain slow queries. That change was quickly reverted when our database team spotted the first signs of trouble. Unfortunately, the system recovered on its own at almost the same time as the revert, which made the revert appear to be the fix. We continued to investigate, but it was only after the next spike in latency at 15:00 UTC that we made the concrete connection between the database configuration changes and the scheduled unsnoozing of conversations. All of the configuration changes were reverted at 15:27 UTC, and no further impact was observed at 16:00 UTC.

Since the incident, a new set of configuration changes that achieves the desired result has been shipped without further issue.