This story starts with a phone call at night. If you have worked in IT long enough, you know what that means.

The customer’s HQ network is down, and since I replaced a pair of data center switches at a remote site the day before, I’m somehow involved, based on the well-known principle that the last one who made changes is responsible.

To be clear, everything that follows happened with me providing support over the phone, without any remote access to the machines.

Phase 1 - collect facts

Current status: services are unreachable, the network is slow, pings time out, panic, fear, uncertainty.

The only known fact is that a Cisco 4500 is showing high CPU usage. In my mind this translates almost immediately into a possible bridge loop: some heavy traffic is hitting the control plane of the switch.

The network topology is a classic three-tier design with access, distribution and core switches. The C4500 sits in the distribution layer, connecting the access layer to the core.

Phase 2 - isolate

The first troubleshooting step in cases like this is to shut down all the links to the access layer. A very slow switch console may slow down the operation, but it usually works.

Once all the links to the access switches are shut, the network performs as expected and the CPU of the C4500 goes back to normal levels.

Phase 3 - find the smoking gun

To find the cause of the problem we need to re-enable the ports to the access switches one by one. As soon as the CPU of the C4500 goes up again, we know which access switch or stack is causing the problem.
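The one-by-one procedure is simple enough to sketch in a few lines of Python. This is only an illustration of the logic, not what we ran that night (everything was typed by hand on the console). The function name, the callbacks, the 80% threshold and the interface names are all assumptions made up for the example; in practice you would also wait a little after each port comes up so the CPU reading can settle.

```python
from typing import Callable, Iterable, Optional


def find_offending_uplink(
    uplinks: Iterable[str],
    enable: Callable[[str], None],
    disable: Callable[[str], None],
    cpu_percent: Callable[[], float],
    threshold: float = 80.0,  # assumed value, tune it to the platform
) -> Optional[str]:
    """Re-enable uplinks one at a time and return the first one that
    pushes the CPU reading to the threshold or above, or None."""
    for port in uplinks:
        enable(port)
        if cpu_percent() >= threshold:
            # Shut the suspect again so the rest of the network stays usable.
            disable(port)
            return port
    return None


if __name__ == "__main__":
    # Simulated switch: the CPU spikes only while the looped uplink is up.
    # Interface names are hypothetical.
    active = set()
    looped = "Gi2/3"
    culprit = find_offending_uplink(
        ["Gi2/1", "Gi2/2", "Gi2/3", "Gi2/4"],
        enable=active.add,
        disable=active.discard,
        cpu_percent=lambda: 99.0 if looped in active else 12.0,
    )
    print("Suspect uplink:", culprit)
```

The sketch stops at the first uplink that triggers the spike and leaves the remaining ones untouched, which matches the goal here: identify one suspect and keep the network stable while you look at it.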

What now? Remember that we are working remotely: I’m on the phone with the customer, the network is slow again, and the access switch is not responding to management via SSH. Usually the investigation proceeds on-site from this point, but we can go a step further with our remote analysis.

The links from distribution to access switches are trunks. Enabling only the management VLAN on the trunk allows us to reach the access switch. This works unless the problem is in the management VLAN itself. This is a good reason to keep the management VLAN separated from user VLANs.

Best practice: do not assign access ports to the management VLAN on access switches

A quick check of the port counters on the access switch shows one port with a high level of broadcast traffic. Once the offending port is shut, we re-enable all the VLANs on the distribution port. The network now works fine, and the only impact is one access port down.
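The counter check itself is just a comparison of broadcast packets against everything else received on each port. Here is a minimal Python sketch of that idea, assuming the counters have already been collected into a dictionary. The data layout, the function name, the thresholds and the sample numbers are all invented for illustration; they are not the values from that night.

```python
from typing import Dict, List, Tuple


def broadcast_suspects(
    counters: Dict[str, Dict[str, int]],
    min_packets: int = 10_000,  # ignore nearly idle ports (assumed value)
    ratio: float = 0.5,         # flag ports where broadcast is >= 50% (assumed)
) -> List[Tuple[str, float]]:
    """Return (port, broadcast share) pairs for ports whose received
    traffic is dominated by broadcast, worst offender first."""
    suspects = []
    for port, c in counters.items():
        total = c["unicast"] + c["multicast"] + c["broadcast"]
        if total < min_packets:
            continue
        share = c["broadcast"] / total
        if share >= ratio:
            suspects.append((port, share))
    return sorted(suspects, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical input packet counters per access port.
    sample = {
        "Gi1/0/1": {"unicast": 90_000, "multicast": 1_000, "broadcast": 500},
        "Gi1/0/7": {"unicast": 2_000, "multicast": 500, "broadcast": 997_500},
        "Gi1/0/9": {"unicast": 10, "multicast": 0, "broadcast": 40},
    }
    for port, share in broadcast_suspects(sample):
        print(f"{port}: {share:.1%} broadcast")
```

A ratio is used instead of an absolute count because a busy but healthy port can legitimately show a large number of broadcasts, while a port in a loop shows almost nothing else.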

Well done!

Phase 4 - PEBKAC

The next day the investigation continues. Once we find where the cable drop ends, the mistake is very clear.

Sitting on a desk we find a cheap SoHo switch connected to a docking station for laptops. One user, in an attempt to fix some issue with his docking station, disconnected its Ethernet cable and connected it back to the SoHo switch, creating a bridge loop.

This is the most common event that can happen in a network, yet in 2018 I’m still getting calls to fix it.

Lesson learned

This phone call ended well, with a problem solved and a consultancy project to secure and fix the access switch configuration.

What most customers still underestimate is the importance of the access layer of the network, in terms of both availability and security.

Spanning tree, DHCP snooping, BPDU guard, 802.1X and port security are some examples of features supported by most switches today. Still, in many cases the access switch configuration includes just a few VLANs, and sometimes no STP at all because a misconfiguration created some issues in the past.
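A quick way to see where an access switch stands is to scan its saved configuration for these features. The Python sketch below does a naive text search, assuming Cisco IOS-style commands; the exact syntax varies by platform and software version, and the function name and sample configuration are made up for the example. Treat it as a first-pass checklist, not as a real audit.

```python
from typing import Dict

# Feature name -> text fragment expected in the configuration.
# Fragments assume IOS-style syntax and may differ on other platforms.
CHECKS: Dict[str, str] = {
    "BPDU guard": "bpduguard",
    "DHCP snooping": "ip dhcp snooping",
    "Port security": "switchport port-security",
    "802.1X": "dot1x system-auth-control",
}


def audit_access_config(config_text: str) -> Dict[str, bool]:
    """Report which access-layer features appear in a switch configuration."""
    lines = [line.strip().lower() for line in config_text.splitlines()]
    # Skip blanks, comment separators and negated ("no ...") commands.
    active = [
        l for l in lines
        if l and not l.startswith("!") and not l.startswith("no ")
    ]
    report = {
        feature: any(needle in l for l in active)
        for feature, needle in CHECKS.items()
    }
    # Spanning tree is on by default; look for someone having turned it off.
    report["Spanning tree enabled"] = not any(
        l.startswith("no spanning-tree vlan") for l in lines
    )
    return report


if __name__ == "__main__":
    # Hypothetical configuration excerpt.
    sample = """
    no spanning-tree vlan 10
    ip dhcp snooping
    interface GigabitEthernet1/0/1
     switchport mode access
     spanning-tree portfast
    !
    """
    for feature, present in audit_access_config(sample).items():
        print(f"{feature}: {'ok' if present else 'MISSING'}")
```

A substring match only tells you a feature is mentioned somewhere, not that it is applied to every access port, so a missing entry is a useful signal while a present one still needs a closer look.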

Most IT managers still think the bad guys are outside the network and ignore the fact that the access layer should be considered as bad as the Internet

Happily ever after

After this interruption I resumed reading about VXLAN, the integration of K8s with ACI and SD-Access.

As William Gibson said some time ago:

The future is already here — it’s just not very evenly distributed.

Network engineers, and IT professionals in general, like to be informed about what’s happening on the innovators’ side of the curve, but the reality often sits between the late majority and the laggards, with some steps towards the early majority in the best cases.

Massimo described this scenario very well in his post “Cloud and the Three IT Geographies”.

How many home labs today use technologies more advanced than the ones we find in the companies we work with? ;-)

I hope you liked this war story. If the feedback is positive, I’ll share some more. Feel free to link your own story if you have shared something similar online.