Cisco WLC roaming troubleshooting scripts
As it often happens, everything begins with a call from a customer with a problem.
The problem is related to WiFi roaming in a warehouse with clients disconnecting from RDP sessions. The clients are industrial PCs installed on forklifts that move quite fast (and dangerously).
Second rule of troubleshooting: measure
As the first rule is clearly identify the problem, I skip to the second one.
How can we measure roaming problems?
On Cisco WLC we can debug clients (up to 10 a the same time) and look for "Association" messages.
When a client roams correctly between APs we should see only REassociation messages. If the client for any reason drops the connection while moving it will start a new connections sending an Association request to the AP.
But how many Association messages are too many? How can we check if some clients are associating/reassociating more than others?
Association messages are part of the behavior of a normal WiFi network so it critical to distinguish what's normal from what's a problem.
Get the logs
We start getting the logs. Connect to WLC, debug up to 10 clients identified by MAC address, logs to a file. Easy huh?
debug client 7c:5c:f8:4a:79:82 7c:5c:f8:4a:7e:82 7c:5c:f8:4a:73:8d 7c:5c:f8:4a:7e:7d 7c:5c:f8:4a:7e:aa 7c:5c:f8:4b:9e:43 7c:5c:f8:4a:7e:41 7c:5c:f8:4a:7c:f7 7c:5c:f8:4a:7c:f7 7c:5c:f8:4a:7d:e2
Count the messages
NOTE: I work on a Windows machine with some GNU tools installed so you can easily port these scripts on Linux/OSX. Feel free to send me a MacBookAir if you need my support on OSX ;-)
Association messages format:
*apfMsConnTask_1: May 13 13:30:26.883: 7c:5c:f8:4a:7c:f7 Association received from mobile on BSSID 00:08:2f:fa:dc:b0
Reassociation message format:
*apfMsConnTask_7: May 13 13:25:48.487: 7c:5c:f8:4a:7e:41 Reassociation received from mobile on BSSID e0:89:9d:d2:31:e0
Now lets white a script that counts Association messages per AP:
@echo off
:loop
cls
echo ASSOCIATION EVENTS PER AP
time /T
grep " Association received" wlc_243_20160513.log | gawk "{print $12}" | sort | uniq -c | sort /R
sleep 5
goto loop
In the output we see a list of APs (BSSIDs) ordered by the number of Association messages they received from one of our 10 debugged clients.
Since I named all my APs I prefer to replace the BSSIDs with AP names.
Create a sed script with a good name - I chose ap2name.sed
s/00:08:2f:4b:yy:xx/APB058/g
s/1c:de:a7:1a:yy:xx/APD057/g
s/1c:de:a7:1a:yy:xx/APD069/g
s/1c:de:a7:1a:yy:xx/APA062/g
s/00:08:2f:fa:yy:xx/APE068/g
s/1c:de:a7:1a:yy:xx/APC053/g
s/1c:de:a7:1a:yy:xx/APA064/g
s/1c:de:a7:1a:yy:xx/APD055/g
s/00:08:2f:fa:yy:xx/APA063/g
s/dc:eb:94:11:yy:xx/APC071/g
s/00:08:2f:fa:yy:xx/APC051/g
s/00:08:2f:fa:yy:xx/APE067/g
s/1c:de:a7:1a:yy:xx/APC054/g
s/1c:de:a7:1a:yy:xx/APC052/g
s/00:08:2f:fa:yy:xx/APB061/g
s/00:08:2f:fa:yy:xx/APD070/g
s/00:08:2f:fa:yy:xx/APD059/g
s/00:08:2f:fa:yy:xx/APA065/g
s/0c:27:24:0f:yy:xx/AP1410/g
s/00:08:2f:fa:yy:xx/APE066/g
s/1c:de:a7:1a:yy:xx/APD056/g
s/0c:68:03:5f:yy:xx/AP1810/g
Add it at the end of the grep command to see AP names instead of the mac address:
@echo off
:loop
cls
echo ASSOCIATION EVENTS PER AP
time /T
grep " Association received" wlc_243_20160513.log | gawk "{print $12}" | sort | uniq -c | sort /R | sed -f ap2name.sed
sleep 5
goto loop
The output looks better now:
ASSOCIATION PER AP
07:55
14 APA062
14 APA063
14 APD059
9 APB060
9 APC051
9 APB058
8 APC053
7 APB061
5 APC071
4 APD057
4 APA064
4 APC052
4 APE068
3 APD069
3 APD055
3 APD056
3 APC054
3 APD070
1 APA065
1 APE066
1 APE067
With the same method we can count Reassociation messages per AP:
@echo off
:loop
cls
ECHO REASSOCIATION EVENTS PER AP
time /T
grep " Reassociation" wlc_243_20160513.log | gawk "{print $12}" | sort | uniq -c | sort /R | sed -f ap2name.sed
sleep 5
goto loop
Now focus on the clients, let's count Association messages per client:
@echo off
:loop
cls
echo ASSOCIATION EVENTS PER CLIENT
time /T
grep " Association received" wlc_243_20160513.log | gawk "{print $5}" | sort | uniq -c | sort /R
sleep 5
goto loop
And finally Reassociation messages per client:
@echo off
:loop
cls
echo REASSOCIATION EVENTS PER CLIENT
time /T
grep " Reassociation" wlc_243_20160513.log | gawk "{print $5}" | sort | uniq -c | sort /R | sed -f ap2name.sed
sleep 5
goto loop
This is the final result:
Understand the output
What we see now is a good view of which clients and APs have more Association events, that may be an indicator that they are not roaming correctly.
In the Association events per AP output we can identify which APs are getting more association messages, possibly meaning clients moving towards that AP experience a coverage hole and a disconnection.
Fix the problem
I can't really talk about how to fix the problem because it depends on your WiFi network design, AP placement, antennas in use, noise floor, interferes... I just want to show you the results.
Before:
After:
You see almost no Association messages while clients move around, it is a good sign roaming is working fine.
Wrap up
Use the counters cum granus salis, their meaning really depends on your infrastructure, the position of the APs and how the clients move.
Sometimes you get many Associations on an AP that is the first one the clients connect to in the morning and that's not a problem.
It is possible that an AP that receives many Associations request is working perfectly fine but the previous one is the row is not correctly positioned or has the wrong settings so that's the one causing roaming problems.
I strongly advise anybody who installs/manages WiFi network to study CWNA for a better understanding.
Enjoy!