Intercom

Intermittent issues with Intercom
Updates

Write-up published

Read it here

Resolved

Our systems have been stable for over 24 hours, so we are closing this status page. We will continue to monitor system health. A full root cause analysis is under way and will be written up and shared with our customers. We’ve provided a summary of the incident below.

Problem: Since 30 March, there have been intermittent errors and performance degradation across the system, caused by brief, spiky outages of mainDB.

Impact: Whilst many of the blips were not detectable, some customers may have experienced brief intermittent errors.

Causes: The problem has been linked to workload-related issues on our databases and the underlying hardware/storage layer. Further investigation is ongoing.

Steps to resolve: Measures have been implemented to mitigate the issue, including:

  • Reducing the binlog retention window on key databases

  • Enabling changes to remove problematic transactions

  • Reducing deletions on heavily loaded database tables

  • Coordinating with our database partner, AWS, on potential backend changes


Further steps to reduce database load are planned this week.

Mon, Apr 7, 2025, 08:37 AM

Monitoring

We have seen no recent database stability issues. We continue to monitor the situation closely.

Sun, Apr 6, 2025, 11:46 AM (20 hours earlier)

Monitoring

Our engineers will continue to investigate, improve, and monitor our database environment over the weekend. We'll update this page if the status changes.

Sat, Apr 5, 2025, 02:29 PM (21 hours earlier)

Monitoring

All systems remain operational, and we are continuing to monitor for further issues.

We will keep this status page open with updates on our progress until we are confident of a full resolution.

Fri, Apr 4, 2025, 03:16 PM (23 hours earlier)

Monitoring

  • The engineering team is continuing to investigate and mitigate the current issues.

  • We have spent the past few hours identifying problematic database access patterns and moving general load off our critical databases. Customer impact has been minimized over the past 2 hours, and we are growing more confident that the changes we introduced have had the intended effect.

  • We have not yet confirmed a root cause. We will keep this status page open with updates on our progress until full resolution.

Fri, Apr 4, 2025, 11:19 AM (3 hours earlier)

Monitoring

  • The engineering team is continuing to investigate and mitigate the current issues.

  • We have spent the past few hours identifying problematic database access patterns and moving general load off our critical databases. Customer impact has been minimized over the past 2 hours, and we are growing more confident that the changes we introduced have had the intended effect.

  • We have not yet confirmed a root cause. We will keep this status page open with updates on our progress until full resolution.

Fri, Apr 4, 2025, 10:49 AM (29 minutes earlier)

Investigating

  • The engineering team is continuing to investigate and mitigate the current issues.

  • We have spent the past few hours identifying problematic database access patterns and moving general load off our critical databases. Customer impact has been minimized over the past 2 hours, and we are growing more confident that the changes we introduced have had the intended effect.

  • We have not yet confirmed a root cause. We will keep this status page open with updates on our progress until full resolution.

Fri, Apr 4, 2025, 10:43 AM

Investigating

  • We are currently experiencing intermittent issues that could cause brief errors or degraded performance for customers.

  • We have taken steps to mitigate the issues and are continuing to investigate and monitor. We will keep this status page open with updates on our progress until full resolution.

  • Please be assured that getting all systems fully recovered and stable as quickly as possible is engineering's top priority.

Fri, Apr 4, 2025, 07:57 AM (2 hours earlier)

Investigating

We're still seeing brief spikes in errors, each lasting a few seconds, once or twice an hour. We are continuing to investigate.

Fri, Apr 4, 2025, 05:58 AM (1 hour earlier)

Investigating

Intercom had increased errors between 04:03 and 04:06 UTC, three minutes outside of our maintenance window. Our team is aware and is investigating the issue with AWS. We’ll update you here as soon as we have more information.

Fri, Apr 4, 2025, 04:15 AM (1 hour earlier)