This post is about the lack of best practices, workarounds, L2 extensions, tunnels and loops.

Fasten your seatbelt and prepare for a ride down the rabbit hole.

The scenario I describe is not recommended. Stay away from L2 extension as much as you can.

Business requirements

Every project should start with business requirements. In this case, the request was to extend an L2 domain across two datacenters to facilitate the relocation of the workload.

I warned the customer it was not a good idea, shared plenty of blogs about the topic, and then I did my job.

The good part was that the old site would be decommissioned soon, so the L2 extension would be temporary for real.

The tunnel

After discussing some options and considering the shipping delays for new hardware, it was decided to extend the L2 domain using VXLAN tunnels created by some virtual firewalls.

The team already had experience with the firewall vendor and VXLAN tunnels, but it had only been done with hardware appliances before.

The initial configuration included creating the virtual machines, configuring L3 reachability on the WAN interfaces terminating the tunnels, and creating the firewall LAN interface with the VLANs to be extended over the L3 link.
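
To make this concrete, here is a minimal scapy sketch of what one of the extended frames looks like on the L3 link; the outer IPs and the VNI are made up for illustration:

from scapy.all import Ether, IP, UDP
from scapy.layers.vxlan import VXLAN

# Inner frame: the original L2 broadcast from a workload VM
inner = Ether(src="00:50:56:84:aa:aa", dst="ff:ff:ff:ff:ff:ff")
# Outer headers: the firewall wraps the frame in UDP/4789 (VXLAN)
# between the WAN interfaces terminating the tunnel (made-up IPs and VNI)
frame = (Ether()
         / IP(src="198.51.100.1", dst="198.51.100.2")
         / UDP(sport=49152, dport=4789)
         / VXLAN(vni=10008)
         / inner)
frame.show()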


With the VXLAN up, the first test started with a no shut of the LAN interfaces of the firewall VMs, one per site.

The loop

Right after the LAN interfaces were unshut, the worst-case scenario occurred: a bridging loop.

How to notice a loop:

  • high number of pps on the tunnel (see the quick check after this list)
  • high CPU usage on the firewall VMs
  • Nexus switches complaining about MAC flapping
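
For the first symptom, a quick check is enough. Here is a small scapy sketch to estimate the pps on an interface (the interface name is a placeholder):

from scapy.all import sniff

WINDOW = 5  # seconds
count = 0

def tick(pkt):
    global count
    count += 1

# Count frames for WINDOW seconds without keeping them in memory
sniff(iface="ens160", prn=tick, store=False, timeout=WINDOW)
print(f"~{count / WINDOW:.0f} pps")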

Troubleshooting

The troubleshooting process started with the switch logs:

%L2FM-2-L2FM_MAC_FLAP_DISABLE_LEARN: Disabling learning in vlan 8 for 120s due to too many mac moves
%L2FM-2-L2FM_MAC_FLAP_RE_ENABLE_LEARN: Re-enabling learning in vlan 8 

The message was not very useful, so we made the logging more verbose:

DC1-NX-1(config)# logging level l2fm 5

We then got a message showing the offending MAC address and the flapping ports:

2022 Mar %L2FM-4-L2FM_MAC_MOVE2: Mac 0050.5684.aaaa in vlan 8 has moved from Po100 to Eth1/10

After some investigation the result was very clear: the MAC was flapping between the internal DC network on Po100 and the firewall VM with the VXLAN tunnels on port E1/10. The L2 loop was real.
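
With the verbose logging in place, even a small script can rank the flapping MACs. A minimal sketch that parses the L2FM_MAC_MOVE2 messages from an exported log file (the file name is hypothetical):

import re
from collections import Counter

# Matches the L2FM_MAC_MOVE2 message format shown above
MOVE_RE = re.compile(
    r"L2FM_MAC_MOVE2: Mac (?P<mac>[0-9a-f.]+) in vlan (?P<vlan>\d+) "
    r"has moved from (?P<src>\S+) to (?P<dst>\S+)"
)

moves = Counter()
with open("switch.log") as log:
    for line in log:
        m = MOVE_RE.search(line)
        if m:
            moves[(m["mac"], m["vlan"])] += 1

for (mac, vlan), count in moves.most_common(5):
    print(f"vlan {vlan}: {mac} moved {count} times")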

Virtual networking

The next phase of the troubleshooting process focused on the virtual networking.

The system engineer set up the VM in a port-group connected to a shared vSwitch. The vSwitch had 4 adapters (uplinks).

As requested, the vSwitch had promiscuous mode enabled and Forged Transmits set to Accept. This was a requirement because the firewall VM must be able to send packets with the source MAC of the original client on the other side of the VXLAN tunnel.

Of the four adapters connected to the vSwitch, one was set to Active and the other three were set to Standby. This is a common scenario I see in many datacenters; I guess it’s how it is taught in some manuals.
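
I had no access to this environment, but such a policy can be audited from vCenter. A hedged pyVmomi sketch (the vCenter address, credentials and the auditing approach are my own placeholders, not the customer setup):

import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="ro-user",
                  pwd="secret", sslContext=ctx)
view = si.content.viewManager.CreateContainerView(
    si.content.rootFolder, [vim.HostSystem], True)
for host in view.view:
    for pg in host.config.network.portgroup:
        sec = pg.spec.policy.security        # may be None when inherited
        team = pg.spec.policy.nicTeaming     # may be None when inherited
        print(pg.spec.name,
              "promiscuous:", sec.allowPromiscuous if sec else "inherited",
              "forgedTransmits:", sec.forgedTransmits if sec else "inherited",
              "active/standby:",
              (team.nicOrder.activeNic, team.nicOrder.standbyNic)
              if team and team.nicOrder else "inherited")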

flowchart LR;
  A[FW-VM] ==>|vNIC| B[vSwitch]
  B ==>|nic0 - Active| C[Nexus switch]
  B ==>|nic1 - Standby| C[Nexus switch]
  B ==>|nic2 - Standby| C[Nexus switch]
  B ==>|nic3 - Standby| C[Nexus switch]

The packet walk

Every time I learn a new networking technology I focus on the basics. The packet walk is usually my favorite part: it lets me validate my understanding and discuss the routing/forwarding decision of every hop.

flowchart LR;
  A[DC1 Virtual Machine] --> B[DC1 Nexus switch]
  B ==>|switchport| C[ESX vSwitch]
  C --> D(DC1 FW VM)
  D <-.->|VXLAN tunnel over L3| E(DC2 FW VM)
  E --> F[ESX vSwitch]
  F ==>|switchport| G[DC2 Nexus Switch]
  G --> H[DC2 Virtual Machine]

Starting from the left:

  1. the VM in DC1 sends an ARP request: SMAC = DC1-VM MAC, DMAC = L2 broadcast FF:FF:FF:FF:FF:FF
  2. the switch receives the broadcast and forwards the packet to all the ports in the same VLAN, excluding the receiving port
  3. the ESXi host receives the packet and forwards it to the firewall virtual machine DC1 FW-VM
  4. the firewall VM encapsulates the packet and sends it across the VXLAN tunnel
  5. the firewall on the other side receives the packet, decapsulates the VXLAN header and sends the broadcast to its LAN interface
  6. the vSwitch forwards the packet and transmits it via the Active adapter
  7. the physical switch receives the broadcast and sends it to all the ports in the same VLAN, except the receiving port

This is a key point: the ESX host is connected to the switch with trunk ports but no LAG; these are individual ports.

The switch, as part of its normal mode of operation, sends the broadcast packet to the three Standby ports of the ESX host.

Here comes the question: will the ESX host forward the packet internally to the vSwitch, even if it is received from Standby ports?

My assumption is: yes it does. I see no other way the firewall can receive the same packet and send it via the VXLAN tunnel back to the source, creating the loop.

The smoking gun

Since I have no access to the customer VMware setup, I tried to create the minimum scenario that would allow me to experience the loop and do some packet captures.

The lab is simple: one Linux VM and a vSwitch with uplink NICs in an active-standby configuration:

flowchart LR;
  A[Linux VM] ==>|vNIC| B[vSwitch]
  B ==>|nic0 - Active| C[Switch]
  B ==>|nic1 - Standby| C[Switch]

Now, how can we create a packet with a source MAC different from the one of the VM?

With Python and scapy, of course!

The script creates a packet with a forged SMAC and a broadcast DMAC:

from scapy.all import *

dest_mac = "FF:FF:FF:FF:FF:FF"  # L2 broadcast
src_mac = "00:0c:29:99:99:99"   # forged SMAC, not the VM's own MAC
# Gratuitous ARP reply (op=2) from the forged MAC, padded to the minimum frame size
sendp(Ether(src=src_mac, dst=dest_mac)/ARP(op=2, psrc="0.0.0.0", hwsrc=src_mac, hwdst=dest_mac)/Padding(load="X"*18), iface="ens160", verbose=0)

We can use tcpdump to verify if the VM receives back the packet it just sent. The wire doesn’t lie.

sudo tcpdump -e -vv -i ens160 ether src 00:0c:29:99:99:99 --direction=in
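
The same check can be scripted with scapy. Note that, unlike tcpdump with --direction=in, scapy also captures our own transmitted frame, so seeing more than one copy means the frame came back:

from scapy.all import sniff

# One copy is our own transmit; any additional copy was looped back to us
caps = sniff(iface="ens160", filter="ether src 00:0c:29:99:99:99", timeout=10)
print(f"captured {len(caps)} frames:", "loop!" if len(caps) > 1 else "no loop")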

And the result was very clear: the VM received back its own packet.

That’s a loop!

The explanation

After some additional investigation, it was clear that when an uplink of the vSwitch is set to Standby or Unused, it means that the NIC will not forward traffic from the VMs. Incoming traffic, on the other hand, is still received and, when promiscuous mode is enabled, the broadcast traffic is forwarded back to the VMs. When one of the VMs is bridging L2 over VXLAN we have the perfect storm and the bridging loop happens.

The remediation

How to fix the loop?

  1. use a LAG on the switch side - the switch will not forward broadcast frames back to the same LAG interface
  2. use a dedicated vSwitch for the firewall, with only one uplink (see the sketch after this list)
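
For completeness, the second remediation can be scripted too. A hedged pyVmomi sketch that creates a dedicated vSwitch with a single uplink (host selection, vSwitch name and vmnic are placeholders):

import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="admin",
                  pwd="secret", sslContext=ctx)
host = si.content.viewManager.CreateContainerView(
    si.content.rootFolder, [vim.HostSystem], True).view[0]  # pick the ESX host

spec = vim.host.VirtualSwitch.Specification(
    numPorts=128,
    bridge=vim.host.VirtualSwitch.BondBridge(nicDevice=["vmnic4"]))  # one uplink only
host.configManager.networkSystem.AddVirtualSwitch(vswitchName="vSwitchFW", spec=spec)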

The first suggested remediation was rejected. The “we always did it this way” policy won and the HiPPO (Highest Paid Person’s Opinion) used its authority to prevent any LAG from being configured on the switches.

The second remediation was applied and… done! The loop was fixed!

Lesson learned

I remember many troubleshooting sessions from my career. Sunny days are easily forgotten; hard times leave scars and help us improve.

In this particular case, the troubleshooting process required a combination of concepts and experience that was particularly satisfying, not to mention how good it felt to prove it was the network!

“Under pressure, you don’t rise to the occasion, you sink to the level of your training.” (Archilochus)

The lessons learned:

  • Always study the basics
  • Packet walks are great
  • The more tools in the toolbelt the better
  • Trust and verify (the documentation)

Hope you enjoyed this post!