Fail Safe

A much-abused term

One of the most misunderstood engineering terms is 'fail safe'. Most people from a non-engineering background (including many software developers) believe it means that something won't fail. Last week even the Economist used it incorrectly.

A 'fail safe' device or system is expected to fail eventually, but when it does it will fail in a safe way. Classic examples include train brakes that engage when the braking system fails, and the ratchet mechanisms in lifts/elevators that stop the car dropping if the cable breaks. Well-engineered physical devices will state their Mean Time Between Failures (MTBF) and define how they can fail and what happens when they do. A well-maintained physical device may never fail over its lifetime, but you know what will happen if it does.

A fail-safe physical device may also define what happens when user error causes it to behave in an undesired manner. For example, the “dead man's handles” on lawn-mowers and electric drills. I own an angle-grinder and in order to turn it on I have to flick a switch and then pull a trigger. Importantly, if I let the trigger go the cutting blade is stopped, which means that if I drop it I'm much less likely to lose a foot. Releasing the trigger also resets the switch, so the trigger can't be accidentally pressed if the tool bounces off an object.

As there is no physical wear and tear on a software system, the concept of MTBF is arguably not applicable. However, software systems can and do fail all the time, so it's perhaps surprising that many of the software systems I've encountered neither cope with failure very well nor define what should happen when they fail. For example, the following may happen:

  • Underlying hardware failure. Networks and external disks are the ones I encounter most.
  • External system failure. Obviously your system is perfect but external systems you rely on start to feed you garbage.
  • User error. If you create an idiot-proof system then I guarantee someone will employ a better idiot.

It's tempting to try to correct a failure situation and keep on running but this can lead to a system getting into an unknown state and creating more issues. For example:

  • The network is not responding but you keep on processing inputs and queuing outputs, hoping it comes back. Your caches and disks fill up, affecting other systems. Eventually the network does come back online and your system stops responding as it processes hours' worth of stale data.
  • An external data provider starts sending blanks in a numeric field. A developer had previously decided to 'interpret' an empty value as zero (when it actually meant missing data). This fed through a bank's pricing systems and was forwarded on to other systems, which then tried to execute buys (as everything was obviously a bargain at zero!).
  • In finance we worry about 'fat fingers', where a trader hits the wrong keys and buys 12 million rather than 1 million...
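The blank-field failure in the second example above is easiest to stop at the parse boundary. A minimal sketch (the function and exception names are my own invention, not from any real pricing system) that treats an empty value as missing data rather than zero:

```python
from decimal import Decimal


class MissingDataError(ValueError):
    """Raised when a required numeric field is blank."""


def parse_price(raw: str) -> Decimal:
    # An empty field is missing data, not a price of zero:
    # fail this input so it can be suspended and reported.
    if raw is None or raw.strip() == "":
        raise MissingDataError("price field is blank")
    return Decimal(raw.strip())


print(parse_price("101.25"))  # 101.25
```

Callers that receive `MissingDataError` can suspend just that transaction instead of propagating a fabricated zero downstream.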

All of the above are real examples I have come across. How would I have changed the failure handling? I prefer to put the system into a known, safe state if possible.

  • Put limits on anything you do in recovery situations, e.g. retry only three times, put a time limit on caches etc. Don't continually do something that isn't working.
  • Don't make generic assumptions about correcting data across a system. If it's not a good input then fail that input, as you have no idea what it really means and correcting it only hides the error. Note that I'm not suggesting the entire system should be suspended, but the transactions that are in error should be suspended and reported on.
  • User inputs are often sanity checked, but “are you sure” dialogs are clicked through without being read, or the “never show this again” checkbox is ticked. Ultimately, there is only so much you can do to save users from themselves, but you might want to keep an audit of their decisions...
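The first point above, bounded retries, can be sketched as follows (the retry budget, helper and exception names are illustrative assumptions, not from the original post):

```python
import time


class RetriesExhausted(Exception):
    """The operation failed after the retry budget was spent."""


def with_retries(operation, max_attempts=3, delay_seconds=0.0):
    """Try an operation a bounded number of times, then fail safe.

    Rather than retrying forever (and filling caches and disks while
    you wait for a dead network to return), give up after max_attempts
    so the caller can suspend the transaction and report it.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    raise RetriesExhausted(f"gave up after {max_attempts} attempts") from last_error
```

A caller that catches `RetriesExhausted` knows the operation is genuinely failing and can move the system into a known, safe state instead of looping indefinitely.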

It's important not just to put the system (or transaction) into a safe state but also to inform those who can resolve the situation. As developers we often write

LOG.warn("Transaction X has failed")

and think nothing more of it. It's sobering to run a reporting tool like Splunk over a mature system and extract all the worrying messages. Would it be more appropriate to send an email, a pager message or a text, or to change a dashboard status?

We need to design the error reporting and monitoring services up front and define how the operators should be kept informed. We also need to allow the operators to resolve issues speedily and safely.

To conclude:

  • How can a system fail?
  • What safe state can be entered?
  • How can the failure be reported?
  • How can the issue be resolved?

About the author

Robert Annett

Robert works in financial services and has spent many years creating and maintaining trading systems. He knows far more about low latency data systems and garbage collection than is good for anyone. He likes to think of himself as a pragmatist who loves technology but uses what's appropriate rather than what's cool.

When not poring over data connections or tormenting interviewees with circular reference questions, Robert can be found locked in his shed with an impressive collection of woodworking tools.

E-mail : robert.annett at codingthearchitecture.com


Re: Fail Safe

Great post!

In many cases the failure modes, means of detection or appropriate response aren't understood in advance. These scenarios don't test well in isolation, and seem to prefer emerging in production environments over time. Experience and basic probability tell us we should expect this to be the case, yet the means by which we can monitor, diagnose and correct aberrant behaviour often aren't particularly good.

The dead man's handle is neither the start nor the end of fail-safe mechanisms for public transport. Similarly, our initial attempts at coping with failure are unlikely to be ideal. Automated responses to failure are great, but manual intervention may be the only way for your application to survive long enough to implement them.

Re: Fail Safe

Funny you should mention Splunk. I work at Splunk and we had a power outage this morning. One of the first things I checked was that all of my systems properly executed their fail-safe routines (mostly making sure Apache didn't start without MySQL running, thus providing lots of invalid data to test infrastructure). And yes, looking for 'ERROR' and 'WARNING' on legacy systems is frightening.
