Designing Safe Software Systems Part 2

November 8, 2009

In the last post, I discussed one of the basic principles of writing safe systems: having a backup that is decoupled from the primary. For example, hospitals normally have a backup power generator in case the grid power fails. These generators are normally located in the basement. In New Orleans, the same event (Hurricane Katrina) that took out the grid power also flooded the basements and took out the backup power. The two systems were coupled. One building did have its backup generator on an upper floor, and it maintained power for days.

The “60 Minutes” episode that aired on 11/08/2009 showed a power generator literally self-destructing after it was infected with malicious code delivered over the internet. The techniques presented here can prevent such attacks from being successful.

Dual Path

Dual path is the gold standard. For embedded systems, a dual path solution would be to have two separate systems computing a result. A hardware comparator checks that both produce the same result; if they do not, the system is shut down. To avoid coupling between the two paths:

Separate teams develop each path. They are not allowed to talk to each other.
Each path is implemented using different software algorithms.
Different languages, different CPUs, and different hardware from different vendors are used as well.
A third group monitors both path designs to ensure they have no inadvertent similarities.

It’s a costly technique, but has proven very effective. It’s commonly used in aviation. A simple version of this exists in modern cars in the form of the dual brake system, with a hydraulic pressure difference sensor that turns on the [brakes] dash light if there’s a difference.

Monitors

A monitor is a watered-down dual path system. It is a separate hardware system that monitors the output of the main system. If the output is outside some preset bounds, the system is shut down. A monitor might be installed on an X-ray machine to check the output level.

Deadman

A deadman is a hardware timer switch added to a computer system that shuts it down if it isn’t regularly reset. The main loop in the software resets the timer. The idea is that if the software crashes or hangs, it won’t reset the timer. Another deadman timer switch can forcibly reset the system at regular intervals, regardless.

Deadmans are practical solutions for embedded systems that don’t need to be terribly reliable, but must work for long periods without human intervention. An example would be an elevator controller, or a lawn sprinkler controller.

Of course, most of us working programmers do not have the option of changing the hardware; we must work with what there is. What can we do within those constraints?

Processes

Processes have the nice feature of having the hardware separate one from the other, so one crashing process can’t take down the next. We can use this to emulate the hardware techniques above when custom hardware isn’t possible.

For example, a dual path setup can be emulated with two processes, each developed independently with different algorithms, etc. A third process watches the output of the two processes, and if they diverge, shuts the system down.
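As a rough illustration of the idea, here is a minimal Python sketch in which the parent process plays the role of the comparator and each path runs in its own child process. Everything in it is an assumption made for the example: the names run_path, dual_path, and PathsDiverged are hypothetical, and integer square root is just a convenient computation with two easily distinguished algorithms. Both paths here share a language, an interpreter, and a machine, so only the algorithm is decoupled; a real dual path design would go much further.

```python
import subprocess
import sys

# Path A: integer square root via the standard library.
PATH_A = "import math, sys; print(math.isqrt(int(sys.argv[1])))"

# Path B: the same result computed with a different algorithm (binary search).
PATH_B = """
import sys
n = int(sys.argv[1])
lo, hi = 0, n + 1          # invariant: lo*lo <= n < hi*hi
while hi - lo > 1:
    mid = (lo + hi) // 2
    if mid * mid <= n:
        lo = mid
    else:
        hi = mid
print(lo)
"""


class PathsDiverged(Exception):
    """Raised by the comparator when the two paths disagree."""


def run_path(source: str, n: int, timeout: float = 5.0) -> str:
    """Run one path in its own OS process and return what it printed."""
    done = subprocess.run(
        [sys.executable, "-c", source, str(n)],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # a crashed path raises CalledProcessError
    )
    return done.stdout.strip()


def dual_path(n: int, path_a: str = PATH_A, path_b: str = PATH_B) -> int:
    """Comparator: accept a result only if both paths agree on it."""
    a = run_path(path_a, n)
    b = run_path(path_b, n)
    if a != b:
        # Stand-in for "shut the system down".
        raise PathsDiverged(f"path A gave {a!r}, path B gave {b!r}")
    return int(a)
```

Calling dual_path(10) returns a value only when both children agree. A crash or timeout in either child surfaces as an exception from subprocess.run, which the caller would treat the same way as a divergence.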

Similarly, a monitor can be a separate process that watches the output of the primary process to see if it remains within preset bounds. A deadman can be a separate process that resets the primary process if it fails to tickle the deadman at regular intervals.
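The deadman process described above might be sketched as follows. This is only one possible arrangement, with assumptions chosen for the example: the primary "tickles" the deadman by printing a line to its standard output, and the hypothetical supervise function kills the primary if no line arrives within the interval. A real deadman would then restart the primary rather than simply report that it was killed.

```python
import queue
import subprocess
import sys
import threading


def supervise(source: str, interval: float) -> str:
    """Run `source` as the primary process and act as its deadman.

    Returns "finished" if the primary exits on its own, or "killed" if it
    went longer than `interval` seconds without tickling the deadman.
    """
    child = subprocess.Popen(
        [sys.executable, "-u", "-c", source],  # -u: unbuffered output
        stdout=subprocess.PIPE,
        text=True,
    )
    tickles = queue.Queue()

    def pump() -> None:
        # Forward each line printed by the primary; None marks end of stream.
        for line in child.stdout:
            tickles.put(line)
        tickles.put(None)

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()

    try:
        while True:
            try:
                item = tickles.get(timeout=interval)
            except queue.Empty:
                child.kill()  # the timer expired without a tickle
                return "killed"
            if item is None:
                return "finished"
    finally:
        child.wait()
        reader.join(timeout=1.0)
        child.stdout.close()


# A primary that tickles regularly and then exits normally.
HEALTHY = "import time\nfor _ in range(3):\n    print('tick')\n    time.sleep(0.05)\n"

# A primary that tickles once and then hangs.
HUNG = "import time\nprint('tick')\ntime.sleep(60)\n"
```

With these two example primaries, supervise(HEALTHY, interval=5.0) lets the primary run to completion, while supervise(HUNG, interval=0.5) kills it once the tickles stop. The monitor process can be built the same way, parsing each line as an output value and checking it against preset bounds instead of only timing its arrival.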

Contract Programming

Back when I was doing physics problems in college, once I had derived a solution, I’d check it by feeding the boundary conditions back into it to see if they were satisfied. If they were, then there was high confidence that I hadn’t made an algebraic or arithmetic error in the derivation.

Doing the same thing in programming is called Contract Programming. For example, the output of a square root function can be squared to see if the original argument can be reproduced. The odds of both the square root algorithm and the multiply being wrong in the same way are astronomically low. This check on the result is called a contract. Similarly, the output of a sort function could be checked to see if it is actually in sorted order. By using essentially orthogonal algorithms in the function and its contract, a level of decoupling is achieved.
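The two contracts just described can be sketched in Python with plain assert statements. The function names checked_sqrt and checked_sort are hypothetical, the rounding tolerance is an arbitrary choice for the example, and the permutation check on the sort is an addition beyond the sorted-order check mentioned above. Note that Python drops assert statements when run with -O, so this is an illustration of the idea rather than a hardened mechanism.

```python
import math
from collections import Counter


def checked_sqrt(x: float) -> float:
    """Square root with a contract: squaring the result gives back x."""
    assert x >= 0, "precondition: argument must be non-negative"
    result = math.sqrt(x)
    # Postcondition uses multiplication, an algorithm unrelated to the
    # square root itself. The tolerance allows for floating-point rounding.
    assert math.isclose(result * result, x, rel_tol=1e-9), "sqrt contract failed"
    return result


def checked_sort(items, sort_fn=sorted):
    """Sort with a contract: the output is ordered and has the same elements.

    `sort_fn` is a parameter only so a faulty sort can be substituted to
    show the contract catching it. Elements must be hashable for the
    Counter-based permutation check.
    """
    result = list(sort_fn(items))
    # Postcondition 1: every adjacent pair is in order.
    assert all(a <= b for a, b in zip(result, result[1:])), "output not sorted"
    # Postcondition 2 (an extra check): nothing was lost or invented.
    assert Counter(result) == Counter(items), "output is not a permutation"
    return result
```

For instance, checked_sort([3, 1, 2]) returns [1, 2, 3], while passing a broken sort_fn that returns its input unchanged trips the first postcondition instead of silently producing a wrong answer.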

While contract programming is a sound software practice, it is not a substitute for independent backups when safety is critical.

Conclusion

To make a proper safe software system, custom hardware is necessary in order to implement decoupled dual path, monitored, or deadman systems. Absent the possibility of custom hardware, this can be approximated using hardware process isolation. Within a program’s process, contract programming is effective at detecting faults.
