Receive Side Scaling

In recent years, I have been working more and more in datacenter networking. On several occasions during testing, a customer has complained that the network performance of the servers was lower than expected.

So I put on my systems engineer hat and dusted off my VMware training.

Non-blocking network

Datacenter networks are usually designed to be non-blocking, or very close to it.

In my case the DC has 4 spines and 12 leaf nodes. Each leaf has 48 1/10/25G access ports and a 100G uplink to each spine, which at 25G per access port means an oversubscription ratio of 3:1.
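
As a quick back-of-the-envelope check (taking the worst case, with every access port running at 25G):

```python
# Worst-case leaf oversubscription: all 48 access ports at 25G,
# with 4 x 100G uplinks (one per spine).
downlink_gbps = 48 * 25  # 1200G of potential server-facing traffic
uplink_gbps = 4 * 100    # 400G toward the spines
ratio = downlink_gbps / uplink_gbps
print(f"oversubscription = {ratio:.0f}:1")  # 3:1
```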

Since the tests were performed before production traffic was moved, with just a few servers connected to the new fabric, we can assume it was effectively non-blocking at the time.

A pair of servers were connected to 100G ports and bandwidth tests were run. The results were far below expectations. Was it a network issue?
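
For context, bandwidth tests like these are usually run with a tool such as iperf3. Purely as an illustration, a minimal single-stream TCP sender in Python might look like the sketch below (host and port are placeholders, and a TCP sink is assumed to be listening on the far end); note that a single stream rarely fills a 100G link on its own:

```python
import socket
import time

def measure_throughput(host="203.0.113.10", port=5201, seconds=5):
    """Push data over one TCP stream and report the achieved rate."""
    payload = b"\x00" * (1 << 20)  # send in 1 MiB chunks
    sent = 0
    with socket.create_connection((host, port)) as sock:
        deadline = time.monotonic() + seconds
        while time.monotonic() < deadline:
            sock.sendall(payload)
            sent += len(payload)
    print(f"{sent * 8 / seconds / 1e9:.2f} Gbit/s")

measure_throughput()
```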

I checked the usual port counters and buffer utilization, and found no issue.

What about the servers?

After the usual "blame the network" phase, and after explaining once again that an optical adapter always transmits at line rate, I managed to get the system engineers on board to discuss the issue.

Unsurprisingly, no effort had been made on the server side to RTFM, fine-tune, and optimize performance.

So I took it one step further and read the documentation, in this case the Performance Best Practices for VMware vSphere 8.0.

In the ESXi Networking Considerations section I found something interesting:

Receive Side Scaling (RSS) is a feature that allows network packets from a single NIC to be scheduled in parallel on multiple CPUs by creating multiple hardware queues. While this might increase network throughput for a NIC that receives packets at a high rate, it can also increase CPU overhead.
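
The way packets get spread across those queues is by hashing each packet's flow tuple: the Microsoft RSS specification defines a Toeplitz hash over the source/destination addresses and ports, so all packets of one flow always land on the same queue (and therefore the same CPU). Here is a minimal sketch of the idea in Python, with a placeholder key and a simplified queue selection (real NICs index an indirection table rather than taking a modulo):

```python
import ipaddress
import struct

def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Toeplitz hash as used by RSS: for every set bit of the input,
    XOR in the 32-bit window of the key starting at that bit position."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i, byte in enumerate(data):
        for bit in range(8):  # walk the input bits MSB first
            if byte & (0x80 >> bit):
                window_start = i * 8 + bit
                result ^= (key_int >> (key_bits - 32 - window_start)) & 0xFFFFFFFF
    return result

def flow_to_queue(src_ip, dst_ip, src_port, dst_port, key, num_queues):
    # Hash input for a TCP/IPv4 flow: src IP, dst IP, src port,
    # dst port, all in network byte order.
    data = (ipaddress.IPv4Address(src_ip).packed
            + ipaddress.IPv4Address(dst_ip).packed
            + struct.pack(">HH", src_port, dst_port))
    # Simplification: real hardware looks up an indirection table.
    return toeplitz_hash(key, data) % num_queues

key = bytes(range(40))  # placeholder 40-byte RSS key, not a real driver default
print(flow_to_queue("10.0.0.1", "10.0.0.2", 49152, 5201, key, 8))
```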

With non-RSS receive processing, all the traffic is handled by a single CPU.

With RSS, the traffic is distributed across multiple CPUs, each servicing its own hardware queue.

And then I found the perfect post:

Symptoms: Users might not see a high enough throughput for a VNIC to fill the physical NIC bandwidth.

After applying the suggested changes, the network performance of the server improved significantly, leading to a happy customer.
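
As a side note, one way to verify from inside a Linux guest that receive traffic is actually being spread over multiple queues is to count the per-queue interrupt vectors: with the vmxnet3 driver they show up in /proc/interrupts as <interface>-rxtx-<N>. A small sketch (the vector naming is specific to the vmxnet3 Linux driver):

```python
import re
from collections import defaultdict

# Count per-queue MSI-X vectors in /proc/interrupts. A vmxnet3 vNIC with
# multiple queues shows one "-rxtx-N" vector per queue, e.g. "ens192-rxtx-3".
queues = defaultdict(int)
with open("/proc/interrupts") as f:
    for line in f:
        match = re.search(r"(\S+)-rxtx-(\d+)", line)
        if match:
            queues[match.group(1)] += 1

for nic, count in sorted(queues.items()):
    print(f"{nic}: {count} queue vectors")
```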

Warning

Be sure to read all the relevant VMware and server documentation and carefully consider the possible impacts before applying configuration changes like these. Bugs happen; don't test in prod.