A change was shipped on Wednesday, March 5th, 2025 at 14:34 UTC to enable multiple Identity Verification tokens for Messenger security. This caused login problems for customers in our European and Australian hosted regions who used Identity Verification.
The issue was reported at 14:54 UTC, and engineering incident response was activated at 15:34 UTC. We reverted the change at 15:41 UTC, which resolved the issue for all affected customers.
Upon investigation, we identified that the problem occurred because the necessary data backfill was not applied across all regions, leading to the observed login failures.
The change in question was feature-flagged, which allowed us to disable the functionality quickly and recover fast.
Once engaged, our engineering incident response resolved the issue in seven minutes.
Automated alarms did not detect the login failures before a customer report, highlighting an opportunity to improve our monitoring for this feature.
The 40-minute interval between the customer report (14:54 UTC) and the initiation of our incident response (15:34 UTC) was longer than desired. We are reviewing our escalation procedures to ensure a more immediate response in the future.
The data backfill process did not apply to every region where Identity Verification was active. We will strengthen our deployment checks to ensure comprehensive multi-region coverage.
☑️ Review and update alarm coverage for Messenger login failures.
We discovered that the Messenger install used for active monitoring was not configured to use Identity Verification. This gap in alarm coverage has now been resolved to ensure that login failures are promptly detected across all regions.
☑️ Complete the rollout of the feature that caused the incident.
[In progress] Implement warnings and checks in our backfill framework so that backfills default to multi-region.
We will add steps to our deployment process or superrake tasks to confirm each relevant region is updated successfully before finalizing.
This may include a validation step that explicitly flags any region left out of the backfill.
[In progress] Review and improve our escalation process, from initial customer contact to the initiation of incident response.
[In progress] Complete our internal incident review process.
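As an illustration of the backfill validation step described in the action items above, the following is a minimal sketch of a check that flags any region left out of a backfill. It is not our actual backfill framework: the function name `check_backfill_coverage`, the `BackfillCoverage` type, and the region names are all hypothetical, and it assumes the caller can supply the list of regions where the feature is active and a per-region success flag.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BackfillCoverage:
    """Result of comparing active regions against backfill outcomes (hypothetical type)."""

    missing: frozenset[str]  # feature is active, but no backfill was recorded
    failed: frozenset[str]   # backfill was recorded, but did not succeed

    @property
    def ok(self) -> bool:
        # Safe to finalize only when every active region was backfilled successfully.
        return not self.missing and not self.failed


def check_backfill_coverage(
    active_regions: list[str],
    backfill_results: dict[str, bool],
) -> BackfillCoverage:
    """Flag regions where the feature is active but the backfill is absent or failed.

    `backfill_results` maps a region name to True when its backfill succeeded.
    """
    active = frozenset(active_regions)
    missing = frozenset(r for r in active if r not in backfill_results)
    failed = frozenset(r for r in active if backfill_results.get(r) is False)
    return BackfillCoverage(missing=missing, failed=failed)


# Example with hypothetical region names: the backfill ran in one region only.
coverage = check_backfill_coverage(["us", "eu", "au"], {"us": True})
if not coverage.ok:
    print("Backfill incomplete; missing regions:", sorted(coverage.missing))
```

A deployment step could refuse to finalize a rollout while `coverage.ok` is false, which makes a skipped region an explicit failure instead of a silent omission.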