Sometimes, Two Plus Two Ain't Four
My love for WildPackets OmniPeek may be one of the few things in technology that exceeds my love for the iPhone...
Now that I've run off 20% of my audience, let's talk about how the former can be used to figure out if the latter is causing a problem.
I have a lot of enemies in life, and I'm proud of that. In my opinion, part of being an adult is recognizing who your enemies are. UCLA football players are my enemy when they play college football. Drivers who text while stopped at green lights are my enemy when I am running late. (No comments from the peanut gallery on that one, GT Hill.) And deductive reasoning is often my enemy when troubleshooting.
Deductive reasoning is oh so tantalizing. It's simple math; A + B = C. The WLAN works (C) when VoFi handsets (B) connect to my APs (A). If I switch out the VoFi handsets for SIP-based iPhones (thus changing the value of B) and the WLAN stops working, then the iPhones must be at fault. Right? Wrong.
Deductive reasoning only works if you eliminate all variables but one. A switch from a VoFi handset to an iPhone is not the changing of one variable. It is the changing of numerous variables. The chipset changes. The client utility changes. The apps that run in the background change. The SIP app changes.
Another problem I have with deductive reasoning is that sometimes a device that we think is working really isn't working. That's what was happening recently when I was doing some work on site at a health care provider.
The Problem
The health care provider had been using dedicated VoFi handsets for years. Like many organizations, they decided to move away from dedicated voice devices and use a smartphone app instead. In this case, it was an Avaya smartphone app running primarily on iPhone 5s.
Once the iPhones started getting used, the WiFi folks at the health care provider immediately started noticing problems. Users were saying that the phones kept beeping. Avaya said that the app will cause the phones to beep if the app gets disconnected. It is a way to protect phones from being stolen or taken to locations where they're not supposed to go.
The problem for the WiFi admins is that the phones were still connected when the beeping was happening. Something was causing the iPhone to think it was disconnected when it really wasn't.
Faulty Deductive Reasoning
The natural reaction of the WiFi admins was to blame the iPhones and/or Avaya. The health care provider had a working VoFi solution. They changed to iPhones and Avaya. The solution stopped working. The rest of the WiFi continued to work well, so the problem was thought to be the phones.
I was skeptical that the iPhones and the Avaya app were the problem right from the beginning. iPhones have been blamed innumerable times over the years for problems that were actually the infrastructure's fault. What's more, I tend to almost always blame the network. My attitude is that networking (both wired and WiFi) is the business of support. We don't make money for the company. We create infrastructure that allows saleswomen and craftsmen (see what I did there, ladies?) and executives to make money for the company. If a user wants to use a certain device, we should try to make our infrastructure support that device. (That is, assuming we want the company we work for to grow and be successful.)
OmniPeek to the Rescue
Armed with my natural skepticism of iPhone blame syndrome, I had the lead WiFi guy do a WildPackets OmniPeek capture with me. (OmniPeek is the WiFi sniffer that he was already using.) We set the capture to the channel of the nearest AP. We associated the iPhone. He made and received a call.
We then looked at the OmniPeek WiFi statistics (Summary -> 802.11 Analysis, for those of you OmniPeek users). I noticed that the Retry percentages were astronomical, getting as high as 60% at their worst. That told me that something was wrong. During normal operation the WiFi channel was seeing Retry percentages well below 10%.
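For those who like to see the arithmetic behind a number like that, here is a minimal Python sketch of a Retry percentage. It assumes each frame is a raw 802.11 MAC header (no radiotap or other capture header in front) and that the percentage is simply retries divided by all frames seen. How OmniPeek computes its own figure may differ, and the function names are mine.

```python
RETRY_MASK = 0x08  # Retry bit, found in the second byte of the Frame Control field


def is_retry(frame: bytes) -> bool:
    """Return True if the Retry bit is set in the 802.11 Frame Control field."""
    return len(frame) >= 2 and bool(frame[1] & RETRY_MASK)


def retry_percentage(frames) -> float:
    """Percentage of captured frames that have the Retry bit set."""
    frames = list(frames)
    if not frames:
        return 0.0
    retries = sum(1 for f in frames if is_retry(f))
    return 100.0 * retries / len(frames)
```

Three retries out of five frames would give the 60% we saw at the worst moments.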
From the Summary screen of OmniPeek, we switched to the Packets screen to get more detail. I turned on Auto Scroll (a little icon above the packets list that looks like a little padlock) and watched the packets come in while we repeated our iPhone test. I kept my eye on the Flags column in my Packets window because I knew that Retry frames always caused a red colored "+" to show up in that column.
Sure enough, the next time the iPhone connected, a whole slew of frames with a red "+" showed up in the Flags column. I saw more than 5 consecutive HTTPS frames, all with the Retry flag, and none of them were followed by acknowledgment (802.11 Ack) frames.
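What I was doing by eye can be roughly sketched in Python: walk the capture in order and flag every unicast data frame that is not immediately followed by an Ack addressed back to its transmitter. This is an illustration under assumptions, not what OmniPeek does internally. It assumes raw 802.11 MAC headers in capture order, legacy (non-Block-Ack) acknowledgments, and a sniffer that heard every Ack that was actually sent.

```python
ACK_FC0 = 0xD4  # first Frame Control byte of an Ack: control type (1), subtype 13


def unacked_data_frames(frames):
    """Return indexes of unicast data frames with no Ack right behind them."""
    frames = list(frames)
    missing = []
    for i, frame in enumerate(frames):
        # Data frames have type 2 in bits 2-3 of the first Frame Control byte.
        is_data = len(frame) >= 16 and ((frame[0] >> 2) & 0x3) == 2
        # Group-addressed frames (low bit of Address 1 set) are never acked.
        if not is_data or (frame[4] & 0x01):
            continue
        nxt = frames[i + 1] if i + 1 < len(frames) else b""
        # An Ack carries only a receiver address, which should equal the
        # data frame's transmitter address (Address 2, bytes 10-15).
        acked = len(nxt) >= 10 and nxt[0] == ACK_FC0 and nxt[4:10] == frame[10:16]
        if not acked:
            missing.append(i)
    return missing
```

A sniffer can miss Acks that the client heard just fine, so treat a hit here as a lead to chase and not as proof.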
We Found a Problem, but Whose Fault is It?
Once we saw a flood of unacknowledged frames, we had to determine what the cause might be. Was it the iPhone misbehaving? A bad Avaya app? Or the AP being unresponsive?
All of the Retry frames had the iPhone 5's MAC address as the second address in the 802.11 header. The second address is always the transmitter's MAC address. That meant that the iPhone was sending all of the frames that were failing.
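If you wanted to do that same check outside of a sniffer GUI, a small Python sketch might look like the following. It assumes each frame starts at the 802.11 MAC header, where Address 2 sits at byte offsets 10 through 15, and it counts Retry frames per transmitter. The function names and any MAC addresses you feed it are hypothetical.

```python
from collections import Counter


def transmitter(frame: bytes):
    """Return Address 2 (the transmitter) as a string, or None if absent."""
    if len(frame) < 16:
        return None  # Ack and CTS frames carry only Address 1
    return ":".join(f"{b:02x}" for b in frame[10:16])


def retries_by_transmitter(frames) -> Counter:
    """Count frames with the Retry bit set, grouped by transmitter address."""
    counts = Counter()
    for frame in frames:
        if len(frame) >= 16 and frame[1] & 0x08:  # 0x08 is the Retry bit
            counts[transmitter(frame)] += 1
    return counts
```

If one MAC address owns nearly all of the retries, you know which device's transmissions are failing, though not yet why.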
After seeing that the iPhone was the device that was sending the failed data, I immediately began to suspect that the APs were the problem. It is quite rare that a device will disconnect itself while sending a bunch of data at the same time. Remember, our initial problem was that we were getting a beeping sound that indicated a network disconnection.
Let's Get Specific
Blaming the WiFi infrastructure for an iPhone problem is always satisfying (at least, if you're an Apple loyalist like yours truly), but in order to get people to believe you it's best to pile on the evidence. My usual method is to try to find something logical. Why would an AP just stop acknowledging iPhone data? And why would the same WiFi APs and controllers have always worked fine with VoFi handsets, never showing a problem that was the fault of the infrastructure?
I used OmniPeek to look inside the frame decodes for any unique properties of the unacknowledged frames. I noticed that some of the problem data had the same WMM/QoS category in the QoS Control field: Background. I immediately added the Decode column to the OmniPeek Packets screen, then selected the Traffic Identifier (TID) subfield within the QoS Control field of the 802.11 header. The Decode column of the OmniPeek Packets screen displays whichever field or subfield is selected in the frame decode below. So in my case, that meant that the Decode column displayed the TID for every frame.
Once I was able to easily see the TID for each captured data frame, I started quickly scrolling up and down through the capture. Sure enough, I saw a consistent pattern. Every time the iPhone sent a frame tagged with the Background TID to the access point, the access point would fail to send an acknowledgment in return. We didn't have time to identify which specific app within the phone was sending data tagged as "Background", but at that point it was peripheral to the issue. If an AP can't handle Background traffic, then that AP is a potential problem that could spread beyond just iPhones and Avaya apps.
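The pattern I found by scrolling can be sketched in Python as well. The snippet below pulls the TID out of the QoS Control field of a QoS data frame, maps it to a WMM access category using the standard priority mapping (1 and 2 are Background), and tallies unacknowledged frames per category. It assumes raw 802.11 MAC headers with no HT Control field, and the `(frame, acked)` pairs are a hypothetical input you would build from your own capture.

```python
from collections import Counter

# Standard mapping of 802.11 user priority (TID 0-7) to WMM access category.
UP_TO_AC = {
    0: "Best Effort", 1: "Background", 2: "Background", 3: "Best Effort",
    4: "Video", 5: "Video", 6: "Voice", 7: "Voice",
}


def qos_tid(frame: bytes):
    """Return the TID of a QoS data frame, or None for any other frame."""
    if len(frame) < 26:
        return None
    frame_type = (frame[0] >> 2) & 0x3
    subtype = (frame[0] >> 4) & 0xF
    if frame_type != 2 or not (subtype & 0x8):  # QoS subtypes have bit 3 set
        return None
    # Address 4 is present only when both To DS and From DS are set.
    offset = 30 if (frame[1] & 0x03) == 0x03 else 24
    if len(frame) < offset + 2:
        return None
    return frame[offset] & 0x0F  # TID is the low 4 bits of QoS Control


def access_category(frame: bytes):
    """Return the WMM access category name, or None if it cannot be determined."""
    tid = qos_tid(frame)
    return UP_TO_AC.get(tid) if tid is not None else None


def unacked_by_category(observations) -> Counter:
    """Tally unacknowledged frames per access category from (frame, acked) pairs."""
    tally = Counter()
    for frame, acked in observations:
        if not acked:
            tally[access_category(frame)] += 1
    return tally
```

If every unacknowledged frame lands in the Background bucket while Voice and Best Effort stay clean, you have the same evidence we had on screen.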
The Conclusion
The lead WiFi administrator was in the room with me looking at OmniPeek as we were investigating this iPhone disconnection problem (and, unfortunately, this is why I am unable to show screenshots of our OmniPeek captures). He came into the day looking for an iPhone problem. After an hour or so of using OmniPeek, he realized that the problem was his APs. The fact that he was using 802.11a/b/g APs made things all the more logical. The AP vendor had mentioned to the lead WiFi admin that those old APs were no longer receiving firmware updates and that it could lead to problems. Our little OmniPeek exercise convinced him that it was either time to upgrade or to look for a new vendor (and his current AP vendor shall go nameless in this blog post, because I like to avoid trashing vendors for hardware or software that is no longer kept up to date).
I am aware that I sometimes come across as too sniffer-centric. I realize that some people find management software easier to access or spectrum analyzers easier to understand or discovery software easier to afford. But there is a reason that I am such a zealot for WiFi sniffers. They help me identify causes of the really difficult problems that plague so many high level wireless networks.