Thursday, July 24, 2008

Asynchronous Race Conditions

We've been having a serious problem with the reliability of our satellite communications for the last week. Our target is a 95% success rate (preferably much higher), but for the past week we've been hovering between 80% and 90%. A few abysmal tests have even wandered down into the high 70s. These results come from running stock hardware and stock software directly from the satellite transmitter manufacturer. Why we were seeing such poor rates was a huge mystery. When you're dealing with wireless communications in general, and satellite communications more specifically, there are a number of problems that you have to contend with, any one of which can ruin your results:
  1. Line of sight. If there is a lot of garbage between you and the satellite (buildings, trees, power lines, mountains, etc.) your signal quality is going to degrade. Less signal = less reliability.
  2. Noise and interference. Ever hear of multipathing? Multipathing is when the signal from the satellite, which is broadcast over a wide area, bounces off various objects on its way to you. Multiple versions of the signal, some slightly distorted and delayed, can add together and interfere with each other at the receiver. On top of this effect, you have to deal with all sorts of other interference, from objects which are supposed to emit signals (think cell phones and radio stations) to those that really aren't supposed to (think HVAC units).
  3. Transmit power. The power of an electromagnetic wave traveling through free space falls off with the square of the distance travelled (the inverse-square law). Satellites are hella far away. If your little doodad can't cut the mustard, your signal is going to be lost in space.
  4. Protocol issues. What's your transmit frequency? Bit rate? Bandwidth? What's your transmission protocol? Data collision policy? Coding scheme? Interleaving? Do you have encryption? What form of modulation are you using? Error correction? Get one thing wrong, and it's bye-bye message.
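To put a number on item 3: free-space loss follows the inverse-square law, so doubling the distance costs you about 6 dB. Here is a minimal Python sketch of the standard free-space path loss formula for isotropic antennas. The function names are my own, and the 1.6 GHz frequency and the distances in the example are illustrative assumptions, not the parameters of our actual link.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0


def free_space_path_loss_db(distance_m, frequency_hz):
    """Free-space path loss in dB between two isotropic antennas."""
    if distance_m <= 0 or frequency_hz <= 0:
        raise ValueError("distance and frequency must be positive")
    # FSPL = (4 * pi * d * f / c)^2, expressed in decibels.
    return 20.0 * math.log10(
        4.0 * math.pi * distance_m * frequency_hz / SPEED_OF_LIGHT_M_S
    )


def received_power_ratio(distance_m, reference_distance_m):
    """Received power at distance_m relative to reference_distance_m."""
    # Inverse-square law: power scales with 1 / d^2.
    return (reference_distance_m / distance_m) ** 2


if __name__ == "__main__":
    frequency_hz = 1.6e9  # assumed L-band carrier, for illustration only
    for distance_km in (1_000, 2_000, 36_000):  # illustrative distances
        loss = free_space_path_loss_db(distance_km * 1_000.0, frequency_hz)
        print(f"{distance_km:>6} km: {loss:.1f} dB")
```

This ignores antenna gain, atmospheric loss, and everything in items 1 and 2, so treat it as a lower bound on how much signal you lose.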
So we have all these possible problems, plus the potential that any number of things could be wrong with our hardware, the batteries, etc. In short, it's debugging hell. So we started tracing through the terminal logs to see when messages were supposed to be transmitted. From there, we could pull the satellite's receive logs and compare notes. However, we found a trend today that we hadn't seen in the past week of debugging: it appeared that some messages weren't being transmitted in the first place! We've got huge data logs, so I wrote up a few quick Perl scripts to chew through them (more on that tomorrow, probably). Sure enough, Perl told me what we had all started to suspect: some messages were simply not being transmitted. Of those that were being transmitted, the satellite was successfully picking up over 99%. This is good, because it rules out the entire "network cloud" and our hardware: the problem was in the software.
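The log comparison boils down to set arithmetic on message IDs. The real scripts were Perl and worked on the manufacturer's log formats, which I'm not reproducing here; this is a rough Python sketch of the idea, assuming you have already pulled three lists of IDs out of the logs (queued on the terminal, transmitted by the terminal, received by the satellite). The `summarize` function and its field names are hypothetical.

```python
def summarize(queued_ids, transmitted_ids, received_ids):
    """Compare terminal and satellite logs by message ID."""
    queued = set(queued_ids)
    # Only count transmissions and receptions we can trace back to the queue.
    transmitted = set(transmitted_ids) & queued
    received = set(received_ids) & transmitted
    return {
        # Messages the terminal meant to send but never put on the air.
        "never_transmitted": sorted(queued - transmitted),
        # Success rate as seen by the end user.
        "end_to_end_rate": len(received) / len(queued) if queued else 0.0,
        # Success rate of the radio link alone.
        "link_rate": len(received) / len(transmitted) if transmitted else 0.0,
    }


if __name__ == "__main__":
    # Made-up IDs, just to show the shape of the result.
    report = summarize(range(1, 11), range(1, 9), range(1, 9))
    print(report)
```

Separating the end-to-end rate from the link rate is what tells you whether to blame the radio path or the software.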

Here's a sample of the type of program that we were running. It is a short test script with very little real-world application, and it was given to us directly by the manufacturer for testing purposes. The software platform is highly asynchronous, and performs a lot of tasks automatically for us.
  1. Enqueue a simple "hello world" message
  2. Enqueue a message with the GPS-derived location of the terminal
  3. When all messages in the queue are sent and the queue becomes empty, shut down.
This script looks simple and innocuous. It should send two messages and then shut down the device. What we were seeing, however, was that sometimes the second message wasn't being sent. A look through the logs showed us that the message wasn't being queued before the device went to sleep. Why? Like I said before, the underlying platform is highly asynchronous. With that in mind, here's what's happening:
  1. We create and enqueue the first "hello world" message
  2. The transmitter takes the message out of the queue and sends it before the GPS message can be added to the queue
  3. Seeing that the queue has become empty, the device shuts down (as per our instructions) without sending the second message. Wash, rinse, repeat.
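The sequence above can be reproduced in a few lines. This is a minimal Python asyncio sketch, not the manufacturer's platform or API: `run_terminal`, `script`, `transmitter`, and the `wait_for_producer` flag are all made-up names, and the sleeps are stand-ins for the time it takes to get a GPS fix and to transmit a message.

```python
import asyncio


async def run_terminal(wait_for_producer):
    """Simulate the test script; return the messages actually sent."""
    queue = asyncio.Queue()
    sent = []
    producer_done = asyncio.Event()

    async def script():
        queue.put_nowait("hello world")
        await asyncio.sleep(0.05)   # stand-in for waiting on a GPS fix
        queue.put_nowait("GPS position")
        producer_done.set()         # nothing more will be enqueued

    async def transmitter():
        while True:
            if queue.empty():
                # Buggy policy: an empty queue alone triggers shutdown.
                # Safer policy: also require that the script has finished.
                if not wait_for_producer or producer_done.is_set():
                    return
                await asyncio.sleep(0.001)  # idle briefly, then check again
                continue
            sent.append(queue.get_nowait())
            await asyncio.sleep(0.01)       # stand-in for transmit time

    await asyncio.gather(script(), transmitter())
    return sent


if __name__ == "__main__":
    print("shutdown on empty queue:   ", asyncio.run(run_terminal(False)))
    print("shutdown after script done:", asyncio.run(run_terminal(True)))
```

With `wait_for_producer=False` the transmitter drains the queue and quits before the GPS message arrives. With `True`, "queue is empty" is no longer treated as "there is nothing left to send", which is one possible way to close the race; I'm not claiming it is how the real platform should be configured.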
Asynchronous systems aren't particularly popular among programmers precisely because they are so tricky: hard to trace, hard to design, and hard to understand. It's a topic that I tried to stress with my students (back when I still had students), and it has been an area where I've purposefully tried to concentrate my studies. If I had been more familiar with the platform, I would have known about these types of issues and might have been able to spot this particular problem earlier. I'm still learning, however, and won't be caught off guard next time. It's good to have consultants come in to help us with these things, because I can get answers to valuable questions like this.

In short: if you're an engineer or a programmer and aren't familiar with asynchronous systems, preemptive multithreading systems, or race conditions, do yourself a favor and spend some time reading up on them.
