One of the most misunderstood engineering terms is 'fail safe'. Most people from a non-engineering background (including many software developers) believe it means something won't fail. Last week even The Economist used it incorrectly.
A 'fail safe' device or system is expected to fail eventually, but when it does, it fails in a safe way. Classic examples include train brakes that engage when the braking system fails, and ratchet mechanisms in lifts/elevators that stop the car dropping if the cable breaks. Well-engineered physical devices state their Mean Time Between Failures (MTBF) and define how they can fail and what happens when they do. A well-maintained physical device may never fail over its lifetime, but you know what will happen if it does.
A fail safe physical device may also define what occurs when a user error causes it to behave in an undesired manner. Examples are the “dead man's handles” on lawnmowers and electric drills. I own an angle grinder, and in order to turn it on I have to flick a switch and then pull a trigger. Importantly, if I let go of the trigger, the cutting blade stops. This means that if I drop it I'm much less likely to lose a foot. Releasing the trigger also resets the switch, so the grinder cannot restart if the trigger is knocked by bouncing off an object.
As there is no physical wear and tear on a software system, the concept of MTBF is arguably not applicable. However, software systems can and do fail all the time, so it's perhaps surprising that many software systems I've worked with don't cope with failure very well or lack defined actions for when they fail. For example, the following may happen:
It's tempting to try to correct a failure and keep running, but this can leave the system in an unknown state and create more issues. For example:
All of the above are real examples I have come across. How would I have changed the failure handling? I prefer to put the system into a known, safe state if possible.
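One way to picture "a known, safe state" in code is a component that latches into a stopped state on the first unexpected failure and refuses further work until someone resets it, much like the angle grinder's switch. The following is only a minimal sketch: `FailSafeProcessor`, its `State` values and `operatorReset` are hypothetical names made up for illustration, not part of any real library.

```java
/** Hypothetical component that stops in a known state instead of guessing how to recover. */
final class FailSafeProcessor {
    enum State { RUNNING, SAFE_STOPPED }

    private State state = State.RUNNING;
    private String lastFailure = null;

    /** Runs one unit of work. Returns true only if the work ran and completed normally. */
    synchronized boolean process(Runnable work) {
        if (state != State.RUNNING) {
            return false; // refuse new work until an operator has reset the processor
        }
        try {
            work.run();
            return true;
        } catch (RuntimeException e) {
            // Do not try to patch things up and carry on: latch into the safe state.
            state = State.SAFE_STOPPED;
            lastFailure = e.getMessage();
            return false;
        }
    }

    /** Explicit, deliberate action by a person; the processor never restarts itself. */
    synchronized void operatorReset() {
        state = State.RUNNING;
        lastFailure = null;
    }

    synchronized State state() { return state; }

    synchronized String lastFailure() { return lastFailure; }
}
```

The design choice here is that recovery is never automatic. Whether that is right depends on the system; for some, retrying a transient failure is perfectly safe, and the latch should be reserved for failures whose consequences are not understood.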
It's important not just to put the system (or transaction) into a safe state but also to inform those who can resolve the situation. As developers we often write
LOG.warn("Transaction X has failed")
and think nothing more about it. Running a reporting tool like Splunk over a mature system and extracting all the worrying messages can be eye-opening. Would it be more appropriate to send an email, pager message or text message, or to change a dashboard status?
We need to design the error reporting and monitoring services up front and define how the operators should be kept informed. We also need to allow the operators to resolve issues speedily and safely.
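As a rough illustration of designing the reporting up front, the sketch below routes every failure to a log but additionally pages an operator when the failure is critical. Everything in it is an assumption for the sake of example: `Severity`, `AlertChannel` and `FailureReporter` are hypothetical names, and a real system would plug in its own logging, email or paging integrations behind the `AlertChannel` interface.

```java
/** Hypothetical severity levels; a real system would define its own. */
enum Severity { INFO, WARNING, CRITICAL }

/** Hypothetical abstraction over a log, an email gateway, a pager, a dashboard and so on. */
interface AlertChannel {
    void send(Severity severity, String message);
}

/** Decides in one place who gets told about a failure, instead of scattering log calls. */
final class FailureReporter {
    private final AlertChannel log;
    private final AlertChannel pager;

    FailureReporter(AlertChannel log, AlertChannel pager) {
        this.log = log;
        this.pager = pager;
    }

    void report(Severity severity, String message) {
        log.send(severity, message); // every failure is recorded
        if (severity == Severity.CRITICAL) {
            pager.send(severity, message); // only critical failures interrupt a person
        }
    }
}
```

A caller might construct it as `new FailureReporter(logChannel, pagerChannel)` and call `report(Severity.CRITICAL, "Transaction X has failed")`, where `logChannel` and `pagerChannel` are whatever implementations the operators have agreed on. Keeping the routing rule in one class makes it something operators and developers can review together.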