Message priority is handled in a simple way. The post office simply maintains a set of post office boxes for each level of priority. When the channel becomes available, the next message will be from the highest set of boxes that has a message posted. Thus, if there are no messages in the level 3 boxes, the post office will check the level 2 boxes, and so forth.
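The selection rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `PostOffice`, the methods `post` and `next_message`, the three priority levels, and the sample messages are all hypothetical, and each level's boxes are modeled as a simple first-in, first-out queue.

```python
from collections import deque


class PostOffice:
    """Hypothetical sketch: one FIFO set of boxes per priority level."""

    def __init__(self, levels=3):
        # Index 0 holds level 1 (lowest); the last index holds the highest level.
        self.boxes = [deque() for _ in range(levels)]

    def post(self, level, message):
        """Place a message in the boxes for the given priority level."""
        self.boxes[level - 1].append(message)

    def next_message(self):
        """Called when the channel becomes available."""
        # Scan from the highest level down and take the oldest message found.
        for level in range(len(self.boxes), 0, -1):
            if self.boxes[level - 1]:
                return level, self.boxes[level - 1].popleft()
        return None  # nothing is posted at any level


office = PostOffice(levels=3)
office.post(1, "report battery")
office.post(2, "new path segment")
office.post(2, "halt at node")

first = office.next_message()   # level 3 is empty, so level 2 is served
second = office.next_message()  # level 2 still has a message
third = office.next_message()   # only now is level 1 served
```

Because the scan always restarts at the top level, a newly posted high-priority message goes out ahead of any lower-priority messages already waiting.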
Many of the field problems in robotics are communications related, so it is important to be able to quickly determine the nature of the problem. The post office diagnostic display in Figure 6.9 shows all of the elements just discussed for a simple single-level post office. The bar at the bottom shows the state of all 32 of the boxes reserved for this robot’s channel. The letter “d” indicates a message is done, while the letter “u” indicates it is unsent, etc. Clicking the “RxMsgs” button will show the most recent messages received that were rejected. The “Details” button gives statistics about which messages failed.
Although the concept of a post office may seem overly complex at first glance, it is in fact a very simple piece of code to write. The benefits far outweigh the costs.
Multipath interference occurs when the robot is at a point where radio waves are not being predominantly received directly from the transmitter, but rather as the result of reflections along several paths. If one or more paths are longer than the others by one-half wavelength, then the signals can cancel at the antenna even though the radio should be within easy range. When the patrol car approached, it added a new reflective path and communications was restored.8
Data integrity
The most obvious problem in communications is data integrity. A system can be made tolerant of communications outages, but bad messages that get through are much more likely to cause serious problems. There are many forms of error checking and error correction, ranging from simple checksums to complex self-correcting protocols.
It is useful to remember that if an error has a one in a million chance of going undetected, then a simple low-speed serial link could easily produce more than one undetected error per day! Luckily, our application protocol is likely to be carried by protocols that have excellent error detection. Even so, it is useful to discuss error checking briefly.
The two most popular error checking techniques are the checksum and the CRC (cyclical redundancy check). Checksums are calculated by merely adding or (more commonly) subtracting the data from an accumulator, and then sending the low byte or word of the result at the end of the data stream. The problem occurs when two errors in a message cancel each other in the sum, leaving both errors undetected.
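A short sketch makes the canceling problem concrete. It assumes the additive form of the checksum with a one-byte result; the function name `checksum8` and the sample bytes are hypothetical.

```python
def checksum8(data: bytes) -> int:
    """Additive checksum: the low byte of the sum of all data bytes."""
    return sum(data) & 0xFF


good = bytes([0x10, 0x20, 0x30, 0x40])

# A single corrupted byte changes the sum, so it is detected.
single_error = bytes([0x11, 0x20, 0x30, 0x40])

# Two errors that cancel: one byte is 1 too high and another is 1 too low.
double_error = bytes([0x11, 0x20, 0x2F, 0x40])

print(hex(checksum8(good)))          # 0xa0
print(hex(checksum8(single_error)))  # 0xa1, mismatch is caught
print(hex(checksum8(double_error)))  # 0xa0, both errors slip through
```

The same weakness applies to the subtracting form, since any pair of errors whose effects on the accumulator are equal and opposite leaves the final value unchanged.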
A CRC check is calculated by presetting a bit pattern into a shift register. An exclusive-or is then performed between incoming data and data from logic connected to taps of the shift register. The result of this exclusive-or is then fed into the input of the shift register. The term cyclical comes from this feedback of data around the shift register. The function may be calculated by hardware or software.
When all the data has been fed into the CRC loop, the value of the shift register is sent as the check. Various standards use different register lengths, presets, and taps.
8 The multipath problem disappeared in data communications with the advent of spread-spectrum radios, because these systems operate over many wavelengths. The problem is still experienced with analog video transmission systems, causing video dropout and flashing as the robot moves.
All of these problems go away when communications is combined with video on 802.11 spread-
It is mathematically demonstrable9 that this form of checking is much less likely to be subject to error canceling.
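The shift register description can be turned into a small software sketch. The register length (16 bits), preset (FFFFh), and taps (polynomial 1021h) used here are assumptions chosen for illustration; they match a widely used parameter set often called CRC-16/CCITT-FALSE, but the text does not name a particular standard. The function name `crc16` is hypothetical.

```python
def crc16(data: bytes, poly: int = 0x1021, preset: int = 0xFFFF) -> int:
    """Bit-at-a-time 16-bit CRC, most significant bit first."""
    reg = preset                      # preset the shift register
    for byte in data:
        reg ^= byte << 8              # exclusive-or incoming data into the top of the register
        for _ in range(8):
            if reg & 0x8000:          # bit about to shift out is fed back through the taps
                reg = ((reg << 1) ^ poly) & 0xFFFF
            else:
                reg = (reg << 1) & 0xFFFF
    return reg                        # the final register value is sent as the check


# Standard check value for this parameter set.
print(hex(crc16(b"123456789")))  # 0x29b1

# Two byte errors that leave a simple additive checksum unchanged.
good = bytes([0x10, 0x20, 0x30, 0x40])
double_error = bytes([0x11, 0x20, 0x2F, 0x40])
print(crc16(good) == crc16(double_error))  # False, the CRC catches the pair
```

Production code normally uses a table-driven version of the same calculation for speed, but the bit-level loop shows the feedback path most directly.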
The ratio of non-data to actual data in a message protocol is called the overhead. Generally, longer checks are more reliable, but tend to increase the overhead. Error correcting codes take many more bytes than a simple 16-bit CRC, which in turn is better than a checksum. The type and extent of error checking should match the nature of the medium. There is no sense in increasing message lengths 10% to correct errors that happen only 1% of the time. In such cases it is much easier to retransmit the message.
Temporal integrity
A more common and less anticipated problem is that of temporal integrity. This problem can take on many forms, but the most common is when we need data that represents a snapshot in time. To understand how important this can become, consider that we request the position of the robot.
First, consider the case where the X and Y position are constantly being updated by dead reckoning as the result of interrupts from an encoder on the drive system. Servicing the interrupts from this encoder cannot be delayed without the danger of missing interrupts, so it has been given the highest interrupt priority. This is in fact usually the case.
Now assume the following scenario. The X position is a 32-bit value stored as two 16-bit words, and at the moment our position request is received those words hold 0000h (high) and FFFFh (low). The communications program begins sending the requested value by sending the high word, 0000h, first. At that moment an encoder interrupt breaks in and increments the X position by one, to 0001h and 0000h. After this interruption, our communications program resumes and sends the low word as 0000h. We receive a valid message indicating that the X position is 0000h, 0000h, when the robot's actual position is 0001h, 0000h: an error of 65,536 counts!
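The scenario can be replayed deterministically in a few lines. This is a simulation only: the class `Odometer` and the function `torn_read` are hypothetical names, and the encoder interrupt is modeled as an ordinary function call placed between the two word reads.

```python
class Odometer:
    """Hypothetical 32-bit X position stored as two 16-bit words."""

    def __init__(self, high: int, low: int):
        self.high = high
        self.low = low

    def increment(self):
        # Stand-in for the encoder interrupt: add one, carrying into the high word.
        value = (((self.high << 16) | self.low) + 1) & 0xFFFFFFFF
        self.high = value >> 16
        self.low = value & 0xFFFF

    def value(self) -> int:
        return (self.high << 16) | self.low


def torn_read(odo: Odometer) -> int:
    """Read the two words one at a time, with an 'interrupt' in between."""
    high = odo.high    # the high word is sent first
    odo.increment()    # the encoder interrupt breaks in here
    low = odo.low      # the low word is read after the rollover
    return (high << 16) | low


odo = Odometer(0x0000, 0xFFFF)
reported = torn_read(odo)
print(hex(reported))           # 0x0
print(hex(odo.value()))        # 0x10000
print(odo.value() - reported)  # 65536
```

The message itself is well formed and would pass any checksum or CRC, because the corruption happens before the check is calculated.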
We cannot escape this problem by simply making the communications interrupt have a higher priority than the encoder interrupt. If we do this, we may interrupt the dead reckoning calculation while it is in a roll over or roll under condition, resulting in the same type of error as just discussed.
9 The author makes no claim to be able to demonstrate this fact, but believes those who claim they can!
For this and other reasons, it is imperative that the communications task copy all of the requested data into a buffer before beginning transmission. This transfer can take place with interrupts disabled, or through the use of a data transfer instruction that cannot be interrupted.
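A minimal sketch of the copy-then-transmit approach follows. Python has no interrupt disable, so a `threading.Lock` is used here as a stand-in for the protected transfer; the class `Position`, its methods, and the 4-byte big-endian encoding are all assumptions made for illustration.

```python
import threading


class Position:
    """Hypothetical dead-reckoning state shared with the communications task."""

    def __init__(self, x: int = 0, y: int = 0):
        self._lock = threading.Lock()  # stand-in for disabling interrupts
        self.x = x
        self.y = y

    def update(self, dx: int, dy: int):
        # Called by the dead-reckoning code; the update is all-or-nothing.
        with self._lock:
            self.x += dx
            self.y += dy

    def snapshot(self):
        # Copy every requested value in one protected step.
        with self._lock:
            return (self.x, self.y)


def encode(snap) -> bytes:
    """Build the reply from the private copy only, never from live data."""
    x, y = snap
    return x.to_bytes(4, "big", signed=True) + y.to_bytes(4, "big", signed=True)


pos = Position(0x0000FFFF, 10)
snap = pos.snapshot()   # buffer the data before transmission begins
pos.update(1, 0)        # an update arriving mid-transmission no longer matters
reply = encode(snap)
print(reply.hex())      # 0000ffff0000000a
```

The reply is slightly stale by the time it is sent, but it is a consistent snapshot of one instant, which is what the requester needs.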
There are even more subtle forms of errors.
Flashback…
I am reminded of one of the most elusive bugs I ever experienced. The problem showed up at very rare times, only in a few installations, always along the same paths, and in areas with marginal radio communication. The robot would receive a new program, begin executing it, and suddenly halt and announce an “Event of Path” error. This error meant that the robot had been instructed to perform an action as it drove over a point on a path, but that it did not believe the point was on the path! More rarely, the robot would suddenly double back to the previous node and then turn back around and continue on. It was very strange behavior for a major appliance!
This bug happened so rarely that it was at first dismissed as an observer-related problem, then as a path-programming problem. Finally, a place was found where the problem would occur fairly regularly (one time in 100) and we continued to exercise the robot until we were able to halt the robot at the problem and determine the cause. This is what was happening:
When the robot finished a job, it would halt. At that point, it would be sent a new program and then it would be told to begin execution of the program at step one. The program was loaded in blocks, but no block would be sent until the block before it had been successfully transmitted. Once the whole program was successfully loaded, the instruction pointer would be set to the first instruction and the mode would be set to automatic. As a final check, the program itself contained a 16-bit checksum to assure that it had been put into memory without modification.
The problem, it turned out, was caused when the robot was sent the message setting its program pointer and mode. If this message was not confirmed within a certain time, it was retransmitted. It had been assumed that a message timeout would be the result of this message not being received by the robot. In that case, the code worked fine. The real problem came when the robot did receive the message, but its reply saying that it had received the message did not get back to the host. The host would then wait a few seconds and retransmit the message, causing the robot to jump back to the beginning of its program like a record jumping a track10. Thus, it would begin the program over even though it might now be 50 feet from where that program was supposed to start. If it had crossed through a node before the second message arrived, the robot would throw an error. If it had not reached a node, then the second transmission would not cause a problem because the program pointer was still at the first step anyway.
The result was that the problem only occurred in areas of poor communications where the distance from the first node to the second node was relatively short. It was hard to detect because it occurred so rarely and because we had assumed that a message had not gone through whenever its reply was not received. This was a classic assumption bug.11
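One common guard against this class of bug is to tag each command with a sequence number so that a retransmitted copy can be recognized and acknowledged without being acted on twice. The sketch below illustrates that general technique; it is not presented as the fix actually used on this robot, and the class `Robot`, the method `handle_start`, and the sequence number value are all hypothetical.

```python
class Robot:
    """Hypothetical receiver that ignores repeated copies of a command."""

    def __init__(self):
        self.step = None      # current program step
        self.last_seq = None  # sequence number of the last command acted on

    def handle_start(self, seq: int, step: int = 1):
        """Set the program pointer, unless this is a retransmission."""
        if seq != self.last_seq:
            self.last_seq = seq
            self.step = step
        # Always acknowledge, so the host stops retransmitting.
        return ("ack", seq)

    def advance(self):
        """The robot passes a node and moves to the next program step."""
        self.step += 1


robot = Robot()
robot.handle_start(seq=7)        # command received, but assume the reply is lost
robot.advance()
robot.advance()                  # the robot is now at step 3
ack = robot.handle_start(seq=7)  # the host times out and sends the same command again
print(robot.step)                # 3, the duplicate did not reset the pointer
print(ack)                       # ('ack', 7)
```

The host must use a new sequence number for each genuinely new command, and must reuse the old one only when retransmitting.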