It's Not Me, It's You
Last week I ran into an age-old problem: my WiFi network being blamed for the poor performance of a neighbor's network. I was pretty sure that the problem wasn't my fault, but how would I prove that? Here are some ways that you can use a WiFi sniffer to diagnose whether your new WLAN is disrupting what's already in place.
Let's start with a little background on my situation. I was teaching a class* at a training location that provides Guest WiFi (unencrypted; no web-based authentication) and optional Dedicated Classroom WiFi (TKIP-encrypted; WPA-PSK authentication). The Guest WiFi was being used by students in my class, but the Dedicated Classroom WiFi was not. Both of those networks were working poorly from the time we arrived. The Guest WiFi being down was just annoying, but the Dedicated Classroom WiFi was a red-level problem. A problem significant enough that we might have to move the students (and several crates worth of lab equipment) to a different building if things didn't improve.
As often happens with problems that have to be resolved quickly, the training location's IT Guy had already identified my class as the problem via deductive troubleshooting. No problem had been reported before I arrived, so the presence of my class must be the culprit. I could have tried to give the IT Guy my treatise on the flaws in deductive troubleshooting, but that would've probably made me look like a jerk. Instead, I figured that I would fire up the ol' WiFi sniffer with the IT Guy by my side and see if my classroom WLAN really was a problem.
Though the exercise proved somewhat calisthenic (they made us move the class, anyway), it did end up giving me a chance to share my methodology in identifying whether neighboring WiFi networks are causing problems with one another.
Step 1: Use the client utility to look for SSIDs and channels
Step one may seem odd coming from a guy who loves WiFi sniffers as much as I do, but it is the proper initial step. Before opening a WiFi sniffer it is best to see what the troubled station adapter sees.
In this case, both the IT Guy and I were using the Broadcom Client Utility (in my case rebranded as the Dell Client Utility). The Broadcom client is superb (perhaps a topic for a future post...). It's reliable, the roaming is consistent, it gives comprehensive status information and it's free as long as you have a WiFi adapter with a Broadcom chipset. Unfortunately, it does have a problem that exacerbated my troubles in this case: it does not scan well when lots of SSIDs are in the air.
When I teach wireless classes, I like to allow students to act as administrator, end user and support professional. As part of the administrator role, students get to create their own configurations for a RADIUS server, AP/controller and client utility. That means one SSID per student (or pair of students, depending on the size of the class). In this case, that meant 12 SSIDs. Twelve classroom SSIDs plus Guest WiFi plus Dedicated Classroom WiFi equals a frustrating experience with the Broadcom client. It's not that the Broadcom client won't work. It's just that the Broadcom client only scans the air for a limited amount of time, thus often missing some of the SSIDs that are broadcasting.
In my case, the Dedicated Classroom WiFi that we were trying to troubleshoot was not showing up in the list of available networks in the Broadcom client. Since I was aware of the Broadcom client's limitation I was able to chalk that up as a non-issue, but to the IT Guy this was evidence that we were killing his Dedicated Classroom WiFi.
There was another important piece of information that the Broadcom client revealed. The Site Monitor screen showed that my AP (and the twelve SSIDs it was broadcasting) was on the same channel as the IT Guy's Dedicated Classroom AP. Now, you're not going to believe this, but both of our APs were set to automatic channel selection!! I know, I know. It's a shocking development that automatic channel selection (or whatever name your WiFi vendor gives it) sometimes doesn't work. But it didn't. Luckily I'm a seasoned WiFi veteran so I was aware that I could turn off automatic channel selection and actually set the channel myself. (Alright, turning off the sarcasm faucet now.)
I set my twelve-network AP to a different channel (channel 1, with the Dedicated Classroom AP on channel 11). At this point I'd exhausted my options with the client utility and I was ready for the WiFi sniffer.
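If you want to sanity-check a channel choice like that one, here is a minimal Python sketch. It assumes 2.4 GHz channels 1 through 13 and the classic 22 MHz channel width; the function names are my own, not something from any sniffer or client utility.

```python
def center_mhz(channel: int) -> int:
    """Center frequency in MHz of a 2.4 GHz channel (1-13 only)."""
    if not 1 <= channel <= 13:
        raise ValueError("only channels 1-13 are handled in this sketch")
    return 2407 + 5 * channel


def channels_overlap(a: int, b: int, width_mhz: int = 22) -> bool:
    """True when the two channels' occupied bands overlap.

    width_mhz=22 is an assumption (the classic DSSS channel width).
    """
    return abs(center_mhz(a) - center_mhz(b)) < width_mhz


print(channels_overlap(11, 11))  # both APs on channel 11 -> True
print(channels_overlap(1, 11))   # after the change -> False
```

With both APs on channel 11 the check reports an overlap. With my AP on channel 1 the center frequencies are 50 MHz apart, so it does not.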
Step 2: View the APs and stations in your WiFi sniffer
I kept the Broadcom client utility running in the background and I started up WildPackets OmniPeek and began capturing on channel 11 (channel of the Dedicated Classroom AP). Then I went to the screen I always start on: the WLAN screen. The WLAN screen in OmniPeek shows a list of SSIDs, APs and stations. The equivalent screen in AirMagnet WiFi Analyzer would be the Infrastructure screen. (There is no equivalent screen in Wireshark, though if you're using Linux or an AirPcap adapter in Windows, you can get similar information by running Kismet).
In the WLAN screen I immediately saw two pieces of information that could be reasons for the poor performance of the Dedicated Classroom AP: the IT Guy was using a single Linksys AP and they had over 40 stations associated to it. Now, for those that know me outside of this blog, identifying this as a potential problem may sound hypocritical. Why? Because I tend to be someone who endorses the use of consumer-grade equipment for some business WiFi networks and because I tend to endorse high station-to-AP ratios. In my defense, I just have to point out that I don't endorse these two things together.
If you want to save money when installing WiFi at a small office, by all means go with an Apple AP. They're relatively cheap and they're reliable. And if you are designing an enterprise WiFi deployment, by all means work from a station-to-AP ratio of 40 or even 50. I've seen enterprise-class APs and controllers handle those numbers. But please, don't combine the two. Things that make enterprise-class APs enterprise class are PoE, controller-based management, IPSec tunneling, IDS sensor mode and reliability. In other words, you don't have to power cycle them once a month or limit the number of simultaneous users.
Even though I would recommend avoiding a station-to-AP ratio of 40 when using Linksys APs, I still wanted to see if something else might be causing the problem.
Step 3: View channel statistics in your WiFi sniffer
Look, there could be 45 stations on your rickety old Linksys wireless router, but if 40 of those stations are quiet, the AP will still function. You need to see how active the channel is to properly diagnose whether channel overload is your problem.
In OmniPeek, channel statistics are found in the Summary screen (AirMagnet uses the Channel screen while Wireshark has the Summary window, which is accessed from the Statistics menu). When in the Summary screen I was able to view the average utilization and the Retry percentage, which are two of the most important statistics to analyze (data rate percentage being the other). The average utilization was around 3 Mbps, which is high for a WiFi network, but far less than the maximum for a channel. The Retry percentage was hovering around 35%, which is a downright alarming number.
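For anyone curious what is behind a Retry percentage, here is a rough Python sketch. It assumes you already have raw 802.11 frames as bytes, starting at the MAC header (any capture header such as radiotap already stripped). The Retry flag is bit 3 of the second Frame Control byte. The sample frames are made up, and a real sniffer may count differently (for example, over data frames only).

```python
RETRY_FLAG = 0x08  # bit 3 of the second Frame Control byte


def is_retry(frame: bytes) -> bool:
    """True when the frame's Retry flag is set."""
    if len(frame) < 2:
        raise ValueError("frame too short to hold a Frame Control field")
    return bool(frame[1] & RETRY_FLAG)


def retry_percentage(frames) -> float:
    """Percentage of captured frames that are retransmissions."""
    frames = list(frames)
    if not frames:
        return 0.0
    retries = sum(1 for f in frames if is_retry(f))
    return 100.0 * retries / len(frames)


# Hypothetical capture: 20 data frames (first byte 0x08), 7 with Retry set.
sample = [bytes([0x08, 0x09])] * 7 + [bytes([0x08, 0x01])] * 13
print(retry_percentage(sample))  # 35.0
```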
When you see a Retry percentage that high, you have a problem. Unfortunately, it can be hard to identify exactly what is causing the problem. It could be:
- Bad APs and/or stations. Sometimes you just get crap equipment. A firmware upgrade may provide a fix, but sometimes you just have to ride it out until you can afford an upgrade.
- Interference. If you have a high CRC Error percentage along with your high Retry percentage, it may be time to break out the spectrum analyzer.
- Hidden nodes. If two or more APs/stations are on the same channel but out of data range from one another, the APs/stations may not know to stay quiet while the other is transmitting.
- Black Magic. Sometimes I can't pinpoint the problem, so I blame black magic.
In my case it probably wasn't interference. My CRC errors were high when I initially started my capture, but as I moved closer to the AP the percentage went down. If there were an interference problem the CRC error percentage would have most likely remained high.
I'd like to think the problem wasn't the hidden node. The channel had only one AP on it and all stations were in the same room. It was a very large room, but usually you need walls or some other obstruction to create a hidden node problem.
Still, I wanted to be sure, so I had the IT Guy enable RTS/CTS on the AP. He set the RTS Threshold to 1 (the lowest possible setting on this Linksys AP model), which meant that any unicast frames transmitted by the AP that are 1 byte or larger would use RTS/CTS.
We waited for the AP to reboot and proceeded to the final step.
Step 4: View frames to look for unusual patterns
When using a WiFi sniffer it is best to avoid looking at captured frames in most cases. Of course captured frames give very detailed information, but sifting through them can be time-consuming. This should really only be used if you've already looked at all of the statistical information in the sniffer and you still can't identify the problem.
In OmniPeek, captured frames can be viewed in the Packets screen (AirMagnet shows frames in the Decodes screen and Wireshark shows frames in the main window). When I went to the Packets screen I noticed an immediate problem: there were no RTS and CTS frames.
This narrowed down my problem with the quickness. The IT Guy was using a bad AP for the Dedicated Classroom WiFi. We had enabled RTS/CTS but something in the AP's software was not allowing it to use RTS/CTS.
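If you'd rather count than eyeball, the same check can be sketched in a few lines of Python. This assumes raw 802.11 frames as bytes starting at the MAC header. The type and subtype live in the first Frame Control byte, and RTS, CTS and ACK are control frames (type 1) with subtypes 11, 12 and 13. The sample capture is invented for illustration.

```python
from collections import Counter

TYPE_CONTROL = 1
SUBTYPE_NAMES = {11: "RTS", 12: "CTS", 13: "ACK"}


def control_subtype(frame: bytes):
    """Return 'RTS', 'CTS' or 'ACK', or None for any other frame."""
    if not frame:
        return None
    fc0 = frame[0]
    frame_type = (fc0 >> 2) & 0x3
    subtype = (fc0 >> 4) & 0xF
    if frame_type != TYPE_CONTROL:
        return None
    return SUBTYPE_NAMES.get(subtype)


def count_control_frames(frames) -> Counter:
    """Tally RTS, CTS and ACK frames in a capture."""
    counts = Counter()
    for frame in frames:
        name = control_subtype(frame)
        if name:
            counts[name] += 1
    return counts


# Hypothetical capture: two data frames, each followed by an ACK (0xD4).
sample = [bytes([0x08, 0x02]), bytes([0xD4, 0x00]),
          bytes([0x08, 0x0A]), bytes([0xD4, 0x00])]
counts = count_control_frames(sample)
print(counts["RTS"], counts["CTS"], counts["ACK"])  # 0 0 2
```

Zero RTS and zero CTS frames on a busy channel, with RTS/CTS supposedly enabled, is the pattern that gave this AP away.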
Now, the trouble with this is that the inability to use RTS/CTS did not address the initial problem. Even if RTS/CTS is unavailable, stations should still be able to maintain an association and communicate with the AP. So to delve a little bit deeper I sought out a Beacon frame from the offending AP.
The Beacon frame is a frame that is sent out at regular intervals by APs. It gives stations basic information (channel, SSID, security, etc.) about the WiFi network.
In this case, I was expecting to see channel 11 in the DS Parameter Set, the SSID being used for the Dedicated Classroom WiFi in the SSID information element and either the WPA or RSN information element indicating the use of PSK authentication with the TKIP cipher suite. Unfortunately, that is not what I saw. Instead I saw the correct DS Parameter Set and SSID information elements, but no WPA or RSN information elements. In other words, I saw that this AP was transmitting malformed Beacon frames.
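Here is a small Python sketch of that Beacon check. It assumes you have the Beacon frame body as bytes (everything after the MAC header), and it looks only at the information elements mentioned above: SSID (ID 0), DS Parameter Set (ID 3), RSN (ID 48) and the vendor-specific WPA element (ID 221 with the 00:50:F2 OUI and type 1). The SSID in the sample is hypothetical.

```python
WPA_OUI_TYPE = bytes([0x00, 0x50, 0xF2, 0x01])  # Microsoft OUI + type 1 = WPA


def parse_elements(body: bytes):
    """Yield (element_id, payload) pairs from a Beacon frame body."""
    pos = 12  # skip timestamp (8), beacon interval (2), capability info (2)
    while pos + 2 <= len(body):
        element_id, length = body[pos], body[pos + 1]
        payload = body[pos + 2 : pos + 2 + length]
        if len(payload) < length:
            break  # truncated element; stop rather than guess
        yield element_id, payload
        pos += 2 + length


def summarize_beacon(body: bytes) -> dict:
    """Report SSID, channel, and whether WPA or RSN elements are present."""
    summary = {"ssid": None, "channel": None, "wpa": False, "rsn": False}
    for element_id, payload in parse_elements(body):
        if element_id == 0:
            summary["ssid"] = payload.decode("utf-8", errors="replace")
        elif element_id == 3 and payload:
            summary["channel"] = payload[0]
        elif element_id == 48:
            summary["rsn"] = True
        elif element_id == 221 and payload.startswith(WPA_OUI_TYPE):
            summary["wpa"] = True
    return summary


# Hypothetical Beacon body: SSID and channel 11 only, no security elements.
ssid = b"Classroom"
body = bytes(12) + bytes([0, len(ssid)]) + ssid + bytes([3, 1, 11])
info = summarize_beacon(body)
print(info)
if not (info["wpa"] or info["rsn"]):
    print("no WPA or RSN element in this Beacon")
```

A Beacon like the sample, with the right SSID and channel but neither security element, matches what I saw from the misbehaving AP.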
I told the IT Guy that the AP was not beaconing correctly and we rebooted the AP. When the AP came back up the problem was briefly solved (stations began to see the AP in their client utilities), but after a while the same problems started up again.
It was clear to me that they just had an AP in need of a firmware upgrade (or maybe even a replacement). That type of thing happens. The tough part is that it sometimes happens after months (or even years) of reliable use. You may have an AP that functions well, but then once it hits a certain number of associations it starts to send malformed Beacon frames. You may have an AP that covers your entire home, but then once your iPad sleeps for too long, the AP's buffer fills up and it freezes until you power cycle it (ahem). The important thing to remember no matter what you encounter is how to use a WiFi sniffer to tell whether the problem is you or the network that's nearby.
*Shameless self-promotion: I'm available for teaching WLAN classes from time to time. Use my email address to contact me.
I think the most relevant statement in the article is: "Both of those networks were working poorly from the time we arrived."
I'm all for breaking out the wlan science on the knee-jerk technologist as the next dork but the IT guy's behavior just seems odd!
I have to assume he's somewhat in charge of maintaining those networks since he's diagnosing them; has this sort of error never been seen before? You said a power cycle gave temporary remedy... did he not try this during diagnosis? Living on the ragged edge with "re-tasked" wrt54g's and 2400 baud Motorola surfboard taught me the beauty of just pulling the plug if my settings and firmware were copasetic.
And blatant weirdness: Telling the _wifi guy_ who is _teaching_ at your "training facility" that his class FUBAR'ed _both_ networks.
I do commend you on handling it most gentlemanly though... are you going to bill them for wireless analysis? :)
Haha, no billing from me.
We did tell the IT Guy that we were having problems from the beginning, but the problem was that we didn't report them to him first thing Monday morning.
The other problem is that the Dedicated Classroom AP is an extra charge from the training location, so it's not up all the time. He set up the Dedicated Classroom AP for another class on Wednesday before class began. The instructor for the other class called IT Guy and told him that students couldn't connect. IT Guy then came by my class an hour or two into class Wednesday morning to blame me.
I'm really on the fence about IT guy. I can just imagine myself the technologist, young, angsty and neurotic, dragging myself across the quad to help a faculty member with another keyboard->chair data loss only to end up copping a 'tude with a visiting ieee bigwig or the new CS provost.