Google first reported an “Issue” on Jun 2, 2019 at 12:25 PDT. As is now common in any type of disaster, reports of this outage first appeared on social media. It seems now the most reliable place to get any type of information early in a disaster is social media.
Many services that rely on Google Compute Engine were impacted. With three teenage kids at home, I first knew something was up when all three kids emerged from their caves, aka, bedrooms, at the same time with worried looks on their faces. Snapchat, YouTube and Discord were all offline!
They must have thought that surely this was the first sign of the apocalypse. I reassured them that this was not the beginning of the new dark ages and that maybe they should go outside and do some yard work instead. That scared them back to reality and they quickly scurried away to find something else to occupy their time.
All kidding aside, many services were reported as down, or available only in certain areas. The dust is still settling on the cause and breadth of the outage, but it certainly seems to have been significant, impacting many customers and services including Gmail and other G Suite services, Vimeo and more.
While we wait for the official root cause analysis of this latest Google Compute Engine outage, Google has reported that “high levels of network congestion in the eastern USA” caused the downtime. We will have to wait to see what they determine caused the network issues. Was it human error, cyber-attack, hardware failure, or something else?
Were you prepared?
As I wrote during the last major cloud outage, if you are running business critical workloads in the cloud, regardless of the cloud service provider, it is incumbent upon you to plan for the inevitable outage. The multi-day Azure outage of Sept 4th, 2018 was related to a failure of the secondary HVAC system to kick in during a power surge related to an electrical storm. While the failure was just within a single datacenter, the outage exposed multiple services that had dependencies on this single datacenter, making that datacenter itself a single point of failure.
Leveraging the cloud’s infrastructure, you can minimize your risks by continuously replicating critical data between Availability Zones, Regions or even cloud service providers. In addition to data protection, having a procedure in place to rapidly recover business critical applications is an essential part of any disaster recovery plan. There are various replication and recovery options available, from services provided by the cloud vendor themselves like Azure Site Recovery, to application specific solutions like SQL Server Always On Availability Groups, to third party solutions like SIOS DataKeeper that protect a wide range of applications running on both Windows and Linux.
Having a disaster recovery strategy that is wholly dependent on a single cloud provider leaves you susceptible to a scenario that might impact multiple regions within a single cloud. Multi-datacenter or multi-region disasters are not likely. However, as we saw with this recent outage and the Azure outage last fall, even if a failure is local to a single datacenter, the impact can be wide reaching across multiple datacenters or even regions within a cloud. To minimize your risks, you may want to consider a multi-cloud or hybrid cloud scenario where your disaster recovery site resides outside of your primary cloud platform.
The cloud is just as susceptible to outages as your own datacenter. You must take steps to prepare for disasters. I suggest you start by looking at your most business critical apps first. What would you do if they were offline and the cloud portal to manage them was not even available? Could you recover? Would you meet your RTO and RPO? If not, maybe it is time to re-evaluate your DR strategy.
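One way to make the RTO and RPO questions concrete is to record a few timestamps during a DR drill and compare them against your targets. The sketch below is only an illustration: the `DrillResult` class, the `evaluate` function, and all of the timestamps and targets are made up for this example and are not taken from any real outage or product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    """Timestamps recorded during a hypothetical DR drill."""
    outage_start: datetime           # when the primary site went down
    last_replicated_write: datetime  # newest write that reached the DR site
    service_restored: datetime       # when the app was serving users again


def evaluate(drill: DrillResult, rto: timedelta, rpo: timedelta) -> dict:
    """Compare measured recovery time and data-loss window with the targets."""
    recovery_time = drill.service_restored - drill.outage_start
    data_loss_window = drill.outage_start - drill.last_replicated_write
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "meets_rto": recovery_time <= rto,
        "meets_rpo": data_loss_window <= rpo,
    }


# Made-up drill: replication lagged 30 seconds, recovery took 45 minutes.
drill = DrillResult(
    outage_start=datetime(2024, 1, 1, 12, 0, 0),
    last_replicated_write=datetime(2024, 1, 1, 11, 59, 30),
    service_restored=datetime(2024, 1, 1, 12, 45, 0),
)
result = evaluate(drill, rto=timedelta(minutes=30), rpo=timedelta(minutes=1))
print(result["meets_rto"], result["meets_rpo"])
```

In this made-up drill the RPO target is met but the RTO target is not, which is the kind of gap that is much better to find in a test than during a real outage.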
“By failing to prepare, you are preparing to fail.”― Benjamin Franklin
2 thoughts on “Major Cloud Outage impacts Google Compute Engine – Were you prepared?”
It would also be interesting to see how Storage Replica fits into this. It can be configured between datacenters or, in Azure, between regions, in combination with stretched clusters and/or multi-cloud solutions. Can DataKeeper play a role in a multi-cloud DR solution? I think this would be an interesting topic.
Storage Replica (SR) can be used for replication only to a single target, which is very limiting in comparison to DataKeeper which can support multiple targets and stretch clusters, allowing you to build scenarios such as the following:
2-node cluster between Availability Zones in AWS with a 3rd node in Azure.
2-node cluster between Availability Zones in AWS with a 3rd node in GCP.
2-node cluster between Availability Zones in Azure with a 3rd node in AWS.
2-node cluster between Availability Zones in Azure with a 3rd node in GCP.
2-node cluster between Availability Zones in GCP with a 3rd node in Azure.
2-node cluster between Availability Zones in GCP with a 3rd node in AWS.
The 3rd node can be part of the cluster, or it can simply be a replication only target. This allows you to build things like a SQL Server Standard Edition 2-node cluster, but still have a 3rd copy of your data for DR.
You can add 2 nodes in your DR site, but in my experience people don’t want to pay for redundancy in their DR site unless they have to activate it and plan to be there for a while. In those cases they can simply add the 4th node as the need arises.
The biggest limitation of SR in a Cluster-to-Cluster configuration or Stretch Cluster configuration is that the SR endpoints need to reside on cluster aware storage, i.e., SAN, iSCSI or S2D. Since SAN and iSCSI are not typically available in the cloud, and S2D does not support configurations that span Availability Zones, you will not be able to build any configurations that combine HA and DR. You will only be able to do simple replication from one server in CloudProviderA to one server in CloudProviderB. And of course this assumes you are using Windows Server 2016 or later, whereas DataKeeper supports Windows Server 2008 R2 and later.