Azure NLB probe down behavior
I was recently pulled into a troubleshooting session on Azure Network Load Balancers (NLBs).
The cloud engineers involved wanted a deeper understanding of how NLBs behave in operation. My findings are outlined below.
The cloud
In cloud computing, load balancers sit in a gray area of ownership. Sometimes the Network team manages them; at other times System Engineers do. Some IT departments even classify load balancers as security appliances and hand them to the Security Operations Center (SOC) along with firewall management.
In cloud environments, system engineers typically handle the initial setup of networking, security, and compute. As complexity grows, those responsibilities are eventually split across different teams.
The case
In this case, the Systems Engineers saw unexpected session drops while rebooting a Network Virtual Appliance (NVA) sitting behind an Azure NLB.
It only took me a few minutes to put my Cloud Network Engineer hat back on and point them in the right direction:
Single instance probes down:
- New TCP connections succeed to remaining healthy backend endpoint. Established TCP connections to this backend endpoint continue
- Existing UDP flows move to another healthy instance in the backend pool
Source: Azure Network Load Balancer documentation
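To make the quoted behavior concrete, here is a minimal toy model in Python. It is only a sketch of the documented rules, not Azure's implementation: the class `ToyLoadBalancer`, the backend names `nva-a` and `nva-b`, and the byte-sum backend selection are all hypothetical stand-ins chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyLoadBalancer:
    """Toy model of the 'single instance probes down' rules (hypothetical)."""

    healthy: dict[str, bool]  # backend name -> probe status
    # (protocol, flow id) -> backend currently serving the flow
    flows: dict[tuple[str, str], str] = field(default_factory=dict)

    def _pick(self, flow_id: str) -> str:
        # Deterministic stand-in for the real distribution algorithm.
        candidates = sorted(name for name, up in self.healthy.items() if up)
        if not candidates:
            raise RuntimeError("no healthy backend left")
        return candidates[sum(flow_id.encode()) % len(candidates)]

    def new_flow(self, proto: str, flow_id: str) -> str:
        # New flows only ever land on backends whose probe is up.
        backend = self._pick(flow_id)
        self.flows[(proto, flow_id)] = backend
        return backend

    def probe_down(self, backend: str) -> None:
        self.healthy[backend] = False
        for (proto, flow_id), owner in list(self.flows.items()):
            if owner == backend and proto == "udp":
                # Existing UDP flows move to another healthy instance.
                self.flows[(proto, flow_id)] = self._pick(flow_id)
            # Established TCP flows are left untouched: the entry stays
            # pinned to the probed-down backend and no reset is sent.
```

In this model, a TCP flow pinned to `nva-a` stays pinned to it after `probe_down("nva-a")`. If `nva-a` is actually rebooting, that session simply goes silent, which matches what the Systems Engineers observed.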
Note that, unlike some other load balancers, an Azure NLB does not send a TCP reset to tell the client that its session is gone when a backend's probe goes down. The client only finds out when its own TCP timeout expires. Once it opens a new session, that session is assigned to a healthy backend instance.
For example, F5 load balancers can be configured to send a Reset (RST) to the client when a service goes down, so the client can open a new session immediately instead of waiting for the timeout.
Source: F5 K15095: Overview of the Action On Service Down feature
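Because no reset arrives from the NLB, how quickly a client notices the dead session depends on its own timers. The following is a minimal sketch, assuming a Python client, of tightening those timers with a read timeout and TCP keepalives. The function name `harden_client_socket` and the timer values are hypothetical, and the `TCP_KEEP*` options are platform-specific, so they are applied only where the `socket` module exposes them.

```python
import socket


def harden_client_socket(
    sock: socket.socket,
    io_timeout: float = 10.0,
    keepalive_idle: int = 10,
    keepalive_interval: int = 5,
    keepalive_probes: int = 3,
) -> socket.socket:
    """Make a TCP client socket notice a silently dead peer sooner."""
    # Blocking send/recv calls raise a timeout error (socket.timeout,
    # an alias of TimeoutError on Python 3.10+) after io_timeout seconds.
    sock.settimeout(io_timeout)

    # Ask the kernel to probe idle connections.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Fine-grained keepalive timers are not available on every platform.
    tuning = (
        ("TCP_KEEPIDLE", keepalive_idle),       # idle seconds before first probe
        ("TCP_KEEPINTVL", keepalive_interval),  # seconds between probes
        ("TCP_KEEPCNT", keepalive_probes),      # failed probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

A client would typically apply this right after connecting, for example `harden_client_socket(socket.create_connection((host, port), timeout=3))`, and then reconnect on a timeout error. The new connection is what lands on a healthy backend.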
The lesson learned
Never let System Engineers do a Network Engineer's job.
Being surprised is usually a consequence of not having studied a technology thoroughly or not understanding the consequences of certain actions.
Unknown unknowns are always around the corner: software bugs happen, and complex systems can react nondeterministically.
Fortunately, that was not the case here. Reading and understanding the documentation was enough to explain the NLB's behavior and confirm that it was working as expected.
The cloud often does not offer the same features as appliances, so the constructs provided by cloud service providers (CSPs) are not always the best option. Each design must weigh the pros and cons and decide where the extra features of an NVA justify using one.