On April 25th, 2023, Google Cloud had an incident involving a datacenter. The incident impacted multiple zones.


If this sounds strange, it is because not all cloud zones are created equal.

Not all clouds are equal

Let’s start from the beginning. I’m not a cloud expert, but I did my studies (AWS Advanced Networking Specialty and Azure Network Engineer Associate).

Not being familiar with Google Cloud, I did not find the incident report very clear. How was it possible for a physical problem (a water leak that caused batteries to catch fire) affecting one DC to impact multiple zones?

The answer is simple: not all zones are created equal.

There are many clouds

For clarity, I searched the documentation of the major cloud providers for what a zone actually is. Here are my findings.

AWS availability zones

What is an availability zone for AWS?

An Availability Zone (AZ) is one or more discrete data centers with redundant power, networking, and connectivity in an AWS Region. Each Region has multiple, isolated locations known as Availability Zones.

Source

The point is: each AZ is in a different datacenter, and some zones may spread across multiple datacenters.

AZs are physically separated by a meaningful distance, many kilometers, from any other AZ, although all are within 100 km (60 miles) of each other.

Source

And each AZ is within 100 km of the others. This is to reduce latency in case workloads are spread across multiple AZs. Fair enough.
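As a rough sanity check on the 100 km figure, the sketch below estimates the propagation delay over fiber. It assumes a signal speed of about 200,000 km/s in fiber (roughly two thirds of the speed of light in vacuum), which is my assumption and not something stated in the AWS documentation. It ignores switching, queuing, and cable paths that are longer than the straight-line distance, so real latency will be higher.

```python
# Rough estimate of fiber propagation delay between two availability zones.
# Assumption (not from the AWS docs): signal speed in fiber is ~200,000 km/s.
FIBER_SPEED_KM_PER_S = 200_000


def propagation_delay_ms(distance_km: float) -> float:
    """Return the one-way propagation delay in milliseconds."""
    return distance_km / FIBER_SPEED_KM_PER_S * 1000


one_way = propagation_delay_ms(100)
round_trip = 2 * one_way
print(f"one-way: {one_way:.2f} ms, round trip: {round_trip:.2f} ms")
```

Under these assumptions, 100 km costs about half a millisecond one way, which is small enough for most synchronous replication between zones.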

That means that a water leak in one DC wouldn’t affect the other zones.

Azure availability zones

As with AWS, each Azure AZ is hosted in one or more DCs:

Each zone is composed of one or more datacenters equipped with independent power, cooling, and networking infrastructure.

Source

A water leak in one DC shouldn’t affect more than one zone.

Oracle Cloud Infrastructure availability domains

The AZs in OCI are called availability domains (AD).

An availability domain consists of a set of data centers within an Oracle Cloud Infrastructure region.

Source

In detail:

Availability domains are isolated from each other, fault tolerant, and very unlikely to fail simultaneously. Because availability domains do not share infrastructure such as power or cooling, or the internal availability domain network, a failure at one availability domain within a region is unlikely to impact the availability of the others within the same region.

Source

The definition of what an AD is for OCI is very close to the AZs of Azure and AWS.

Note that not all OCI regions have multiple ADs, meaning they may be in a single DC (list).

OCI is more conservative in its definition of ADs. While it’s true that an incident in one AD should not impact any other AD, this applies only to physical incidents like a water leak, a fire, or a power failure.

A logical incident, like a routing issue or a software bug, may impact multiple zones. This is the same for all the CSPs.

Alibaba Cloud

Another example of zones defined as isolated locations comes from Alibaba Cloud. The documentation is not very clear about how far the locations are from each other:

Each region has multiple isolated locations known as zones, which have their own independent power supply and network. Zones within the same region are interconnected over an internal network. Each zone is independent of each other, providing effective fault isolation.

Source

Assuming that isolated locations means a separate DC, once again one zone equals one DC.

GCP availability zones

Google Cloud’s definition of an AZ differs from that of the other Cloud Service Providers:

Regions are independent geographic areas that consist of zones. Zones and regions are logical abstractions of underlying physical resources provided in one or more physical data centers.

Source

Multiple zones may reside in the same data center because they are just abstractions, presumably meant to separate the control plane and/or management. We can assume that the zones run on different racks or in different rooms, but they could still be in the same DC.

Based on this assumption, the incident report makes more sense. A water intrusion in a DC could affect multiple zones hosted there.

It all started with zone europe-west9-a, and then all the other zones of the region europe-west9 (a.k.a. Paris) were shut down:

Description: Water intrusion in europe-west9-a has caused a multi-cluster failure and has led to an emergency shutdown of multiple zones. We expect general unavailability of the europe-west9 region.

The workaround was to migrate to other regions:

Customers are advised to failover to other regions if they are impacted.

Here’s the part I found unclear:

A zone is a deployment area for Google Cloud resources within a region. Zones should be considered a single failure domain within a region. To deploy fault-tolerant applications with high availability and help protect against unexpected failures, deploy your applications across multiple zones in a region.

Source

The advice to deploy in multiple zones and multiple regions is valid for all the major CSPs, but it is particularly critical for GCP because, among the providers reviewed here, it’s the only one where multiple AZs may reside inside the same DC.

In this specific case, the entire europe-west9 region is in a single location/DC (here). The DC is managed by Global Switch; details about the facility are available (link).

If customers deployed their applications/services only in that single region, distributing the workload across all AZs did not protect them from the water leak incident.
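The difference between the two designs can be made concrete with a toy model. The sketch below maps zones to the data centers that host them and checks whether a deployment spread across all zones survives the loss of one data center. The layouts and the zone and DC names are hypothetical, chosen only to illustrate the idea; they are not taken from any provider’s actual topology.

```python
# Toy model: which zones go down when a single data center fails?
# All layouts and names below are hypothetical illustrations.

def zones_lost(zone_to_dcs: dict[str, set[str]], failed_dc: str) -> set[str]:
    """Return the zones hosted (even partially) in the failed data center.

    Pessimistic assumption: a zone is considered down as soon as any of
    the data centers hosting it fails.
    """
    return {zone for zone, dcs in zone_to_dcs.items() if failed_dc in dcs}


def survives(zone_to_dcs, deployed_zones, failed_dc) -> bool:
    """True if at least one deployed zone is still up after the failure."""
    return bool(set(deployed_zones) - zones_lost(zone_to_dcs, failed_dc))


# Layout A: every zone lives in its own data center(s).
separate_dcs = {
    "zone-a": {"dc-1"},
    "zone-b": {"dc-2", "dc-3"},
    "zone-c": {"dc-4"},
}

# Layout B: zones are logical abstractions inside one shared data center.
shared_dc = {
    "zone-a": {"dc-1"},
    "zone-b": {"dc-1"},
    "zone-c": {"dc-1"},
}

all_zones = ["zone-a", "zone-b", "zone-c"]
print(survives(separate_dcs, all_zones, "dc-1"))  # True
print(survives(shared_dc, all_zones, "dc-1"))     # False
```

In layout B, spreading the workload across every zone changes nothing: all zones share the same physical failure domain, so only a second region would help.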

Lesson learned

What can be learned from this brief and concise analysis?

Certainly that you should always read the service provider’s documentation, and ask for details before making the design.

The assumption that the definition of Zone is consistent among CSPs can lead to some design errors.

The usual best practices also apply in the cloud: spreading the workload across regions, zones, or cloud providers is usually good advice. YMMV.

The tweet

Like many discussions, it all started with a tweet: