Affected components (Asia Pacific - R3): API, Web Application, Open and Link Tracking, Surveys and Forms, Mail Sending, Reporting, Pages and Forms, Contact Imports, SMS, Transactional Email, Integration Hub, Email to SMS, Dynamics Connector, Salesforce Connector
At approximately 10:51 UTC on Wednesday 30th August 2023, services in our Asia Pacific region (R3) began to fail as a result of an Azure datacenter outage. We restored services at 16:30 UTC the same day.
During this time, customers in R3 were unable to use any part of our platform, including, but not limited to, the services listed above.
The Dotdigital platform is hosted on Microsoft’s Azure cloud and uses the Australia East and Australia Southeast regions. The platform began to fail after a power surge damaged cooling units in the Azure datacenter. As a knock-on effect, Azure’s compute and storage systems were taken offline, causing widespread disruption for Azure customers. You can read more on Azure’s status page.
During this incident, Dotdigital engineers attempted to restore service in our alternate location, Azure’s Australia Southeast region. Together, Australia East and Australia Southeast form a cross-region pair, which is the approach Azure recommends for high availability. Our engineers found that deployments in the Southeast region failed because the Azure outage was also affecting that second region. Although this wasn’t confirmed on Azure’s public status page, it was reported on their customer-facing system health report.
The timeline (in UTC) for resolving this issue was:
10:51 - Our monitoring detected the first signs of failing services.
11:11 - We raised an incident and began investigations.
11:24 - We created a status page highlighting the issue with Azure.
11:31 - We disabled the send pipeline to protect the integrity of sends.
11:53 - We attempted to scale the healthy pool, which failed because the Azure management node was unresponsive.
12:23 - We opened a high-priority ticket with the Azure support team.
12:31 - Azure confirmed the issue was related to the cooling failure in their datacenter.
14:20 - Our attempts to restore services via the Southeast region failed due to a secondary Azure issue.
14:45 - We opened a second high-priority ticket with the Azure support team regarding the Southeast region also being unresponsive.
15:09 - We received verbal confirmation from the Azure support team that Australia Southeast was also degraded.
16:31 - The Azure issue was resolved and we started all Dotdigital services. Any scheduled sends submitted prior to the incident were immediately dispatched.
We’ll be working with Azure to understand the events that took place on their side and, in particular, how two regions were affected at the same time. In addition, we’ll investigate the adoption of Azure’s Availability Zones and reduce the need for infrastructure deployments in our secondary regions.
We're really sorry for the trouble the outage may have caused you. We know service reliability is super important. We'll be working hand in hand with Azure to learn from this incident and take action to prevent it from happening again. Thanks for your patience while we resolved the issue, and we promise to keep working hard to provide you with great service.
Problems with Azure's datacentre have been resolved and all Dotdigital services are now back online. Queued sends are being dispatched immediately.
Our service is still being impacted by problems in Azure's Australia East region. In addition, attempts to move systems to our alternate site in Australia Southeast have failed due to Azure outages in that region. You can track the Azure issue via the following link: https://azure.status.microsoft/en-us/status
We continue to work with our cloud provider, Microsoft Azure, to mitigate the issue. Azure has reported some success in its mitigation and is looking to restore services over the next few hours. You can track progress via the following link: https://azure.status.microsoft/en-us/status
Issue identified with cloud provider. Users will continue to experience disruption to all Dotdigital services, including Sends, API, Programs and Web Behaviour Tracking.
Issue identified with cloud provider. Users will continue to experience intermittent disruption to all Dotdigital services, including Sends, API, Programs and Web Behaviour Tracking.