Write-up published
Resolved
Our systems have been stable for over 24 hours, so we are closing this status page. We will continue to monitor system health. A full root cause analysis is in progress and will be written up and shared with our customers. A summary of the incident is provided below.
Problem: Since 30 March, there have been intermittent errors and performance degradations across the system, caused by brief, spiky outages on mainDB.
Impact: Whilst many of the blips were not detectable, some customers may have experienced brief intermittent errors.
Causes: The problem has been linked to workload-related issues on our databases and the underlying hardware/storage layer. Further investigation is ongoing.
Steps to resolve: Measures have been implemented to mitigate the issue, including:
Reducing the binlog retention window on key databases
Enabling changes to remove problematic transactions
Reducing deletions on heavily loaded database tables
Coordinating with our database partner, AWS, on potential backend changes
Further steps to reduce database load are planned this week.
Monitoring
We have seen no recent database stability issues. We continue to monitor the situation closely.
Monitoring
Our engineers continue to investigate, make improvements to and monitor our database environment over the weekend. We'll update here again with any change in status.
Monitoring
All systems continue to be operational and we are continuing to monitor in case of further issues.
We will keep this status page open with updates on our progress until we have confidence in a full resolution.
Monitoring
The engineering team is continuing to investigate and mitigate the current issues being experienced.
We have spent the past few hours identifying problematic access patterns against the database while moving general load off our critical databases. Customer impact has been minimal over the past 2 hours, and we are building confidence that the changes introduced have had the necessary effect.
We’ve yet to identify a confirmed root cause. We will continue to keep this status page open with updates on our progress until full resolution.
Investigating
The engineering team is continuing to investigate and mitigate the current issues being experienced.
We have spent the past few hours identifying problematic access patterns against the database while moving general load off our critical databases. Customer impact has been minimal over the past 2 hours, and we are building confidence that the changes introduced have had the necessary effect.
We’ve yet to identify a confirmed root cause. We will continue to keep this status page open with updates on our progress until full resolution.
Investigating
We are currently experiencing some ongoing intermittent issues which could cause brief errors or degradation for customers.
We have taken steps to mitigate the issues and continue to investigate and monitor. We will keep this status page open with updates on our progress until full resolution.
Please be assured that getting all systems fully recovered and stable as quickly as possible is engineering's top priority.
Investigating
We're still seeing brief spikes in errors, lasting seconds, once or twice an hour. We are continuing to investigate.
Investigating
We're seeing issues with Intercom: the product had increased errors between 04:03 and 04:06 UTC, three minutes outside of our maintenance window. Our team is aware and is investigating the issue with AWS. We’ll update you here as soon as we have more information.