Maybe software developers are naturally optimistic, but in my experience they rarely consider system failure or disaster scenarios when designing software. Failures range from the likely (local disk failure) to the rare (tsunami) and from low impact to fatal (where fatal may mean the death of people or the bankruptcy of a business).
Failure planning broadly fits into the following areas:
Avoiding failure is what a software architect is most likely to think about at design time. This may involve a number of High Availability (HA) techniques and tools, including redundant servers, distributed databases and real-time replication of data and state. It usually means removing any single point of failure, but be careful not to consider only the software and the hardware it immediately runs on. You should also remove any single dependency on infrastructure such as power (battery backup, generators or multiple power supplies) or telecoms (multiple wired connections, satellite or radio backups, etc.).
Failing safely is a complex topic that I touched on recently and may not apply to your problem domain (although you should always consider if it does).
Failure recovery usually goes hand-in-hand with High Availability and ensures that when single components are lost they can be re-created or restarted and rejoin the system. There is no point in having redundancy if components cannot be recovered, as you will eventually lose enough components for the system to fail!
However, the main topic I want to discuss here is disaster recovery. This is the process that a system and its operators have to execute in order to recreate a fully operational system after a disaster scenario. This differs from a failure in that the entire system (potentially all the components but at least enough to render it inoperable) stops working. As I stated earlier, many software architects don't consider these scenarios but they can include:
These are usually classified into either natural or man-made disasters. Importantly these are likely to cause outright system failure and require some manual intervention - the system will not automatically recover. Therefore an organisation should have a Disaster Recovery (DR) Plan for the operational staff to follow when this occurs.
A disaster recovery plan should consider a range of scenarios and give very clear and precise instructions on what to do for each of them. In the event of a disaster scenario the staff members are likely to be stressed and not thinking as clearly as they would otherwise. Keep any steps required simple and don't worry about stating the obvious or being patronising - remember that the staff executing the plan may not be the usual maintainers of the system.
Please remember that 'cloud hosted' systems still require disaster recovery plans! Your provider could have issues, and you are still affected by scenarios that involve corrupt data and disgruntled staff. Can you roll back your data store to a known point in the past, before the corruption occurred?
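To make the roll-back question concrete, here is a minimal sketch of choosing a restore point. It assumes you keep a list of snapshot timestamps and can estimate when the corruption began; the function name and the sample dates are hypothetical, not taken from any particular data store or provider.

```python
from datetime import datetime
from typing import Optional, Sequence


def latest_snapshot_before(
    snapshots: Sequence[datetime], corrupted_at: datetime
) -> Optional[datetime]:
    """Return the newest snapshot taken strictly before the corruption began."""
    candidates = [taken for taken in snapshots if taken < corrupted_at]
    return max(candidates, default=None)


# Hypothetical nightly snapshots, with corruption starting mid-morning on 3 March.
snapshots = [datetime(2024, 3, day) for day in (1, 2, 3, 4)]
restore_point = latest_snapshot_before(snapshots, datetime(2024, 3, 3, 9, 30))
```

If the function returns None then no snapshot predates the corruption, which usually means the retention period is shorter than the time it takes you to notice a problem.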
The aims and actions of any recovery will depend on the scenario that occurs. Therefore the scenarios listed should each refer to a strategy which contains some actions.
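One way to picture this is as a lookup from scenario to strategy, where each strategy carries its aim and an ordered list of actions. The sketch below is only an illustration: the scenario names, aims and actions are made up for the example and a real plan would be far more detailed.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    aim: str
    actions: tuple[str, ...]


# Hypothetical strategies; the real aims and actions depend on your system.
RESTORE_FROM_BACKUP = Strategy(
    name="restore from backup",
    aim="full system, data loss limited to changes since the last clean backup",
    actions=(
        "Stop all writes to the data store",
        "Identify the last backup taken before the corruption",
        "Restore that backup to a fresh data store",
        "Verify the restored data and re-enable writes",
    ),
)
FAIL_OVER_TO_SECONDARY = Strategy(
    name="fail over to secondary site",
    aim="emergency subset of functionality, running as soon as possible",
    actions=(
        "Confirm the primary site is unreachable",
        "Promote the secondary data store",
        "Redirect traffic to the secondary site",
    ),
)

# Several scenarios may share one strategy.
PLAN = {
    "data corruption": RESTORE_FROM_BACKUP,
    "disgruntled staff member deletes data": RESTORE_FROM_BACKUP,
    "hosting provider outage": FAIL_OVER_TO_SECONDARY,
}


def runbook(scenario: str) -> list[str]:
    """Return the numbered actions for a scenario, failing loudly if none is documented."""
    strategy = PLAN.get(scenario)
    if strategy is None:
        raise KeyError(f"No strategy documented for scenario: {scenario!r}")
    return [f"{step}. {action}" for step, action in enumerate(strategy.actions, start=1)]
```

Raising an error for an undocumented scenario is deliberate in this sketch: a gap in the plan is better discovered during a rehearsal than during a real disaster.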
Before any strategy is executed you need to be able to detect that the event has occurred. This may sound obvious, but a common mistake is to have insufficient monitoring in place to actually detect it. Once it is detected there needs to be comprehensive notification in place so that all systems and people are aware that actions are now required.
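As a rough illustration of detection and notification, the sketch below assumes each component reports a heartbeat timestamp and that a disaster is declared when fewer than a minimum number of components are still responding. The function names, the threshold and the `notify` callback are all assumptions made for the example; real monitoring would normally be done with a dedicated tool.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, List


def silent_components(
    last_heartbeat: Dict[str, datetime], now: datetime, max_silence: timedelta
) -> List[str]:
    """Names of components whose last heartbeat is older than max_silence."""
    return sorted(
        name for name, seen in last_heartbeat.items() if now - seen > max_silence
    )


def check_for_disaster(
    last_heartbeat: Dict[str, datetime],
    now: datetime,
    max_silence: timedelta,
    min_healthy: int,
    notify: Callable[[str], None],
) -> bool:
    """Notify and return True when too few components respond for the system to operate."""
    silent = silent_components(last_heartbeat, now, max_silence)
    healthy = len(last_heartbeat) - len(silent)
    if healthy >= min_healthy:
        # Individual failures are left to normal failure recovery.
        return False
    notify(
        f"DISASTER: only {healthy} of {len(last_heartbeat)} components responding; "
        f"silent: {', '.join(silent)}"
    )
    return True
```

The `notify` callback stands in for whatever reaches your people and systems (pager, SMS, email). Make sure that channel does not itself depend on the system that has just failed.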
For each strategy there has to be an aim for the actions. For example, do you wish to try to bring up a complete system with all data (no data loss) or do you just need something up and running? Perhaps missing data can be imported at a later time or maybe some permanent data-loss is tolerated? Does the recovered system have to provide full functionality or is an emergency subset sufficient?
This is hugely dependent on the problem domain and scenario but the key metrics are recovery point objectives (RPO) and recovery time objectives (RTO) along with level of service. The RPO is the maximum amount of data loss you can tolerate, measured as a period of time before the disaster, and the RTO is the maximum time you can take to restore service after it. Your RPO and RTO are key non-functional (quality) requirements and should be listed in your software architecture document. These metrics should influence your replication and backup strategies and the actions in your plan.
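To show how these metrics can drive a backup strategy, here is a small sketch that checks a periodic backup schedule against an RPO and RTO. It assumes periodic backups are your only protection (no replication), that a backup captures the state as of its start time, and that you have an estimate of the restore time, ideally from a rehearsal. All names and figures are illustrative.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerable data loss, expressed as time
    rto: timedelta  # maximum tolerable time to restore service


def worst_case_data_loss(
    backup_interval: timedelta, backup_duration: timedelta
) -> timedelta:
    """Worst case: disaster strikes just before the next backup finishes,
    so the newest usable backup reflects state one interval plus one duration old."""
    return backup_interval + backup_duration


def objective_violations(
    objectives: RecoveryObjectives,
    backup_interval: timedelta,
    backup_duration: timedelta,
    estimated_restore_time: timedelta,
) -> List[str]:
    """Describe every objective that the current backup schedule cannot meet."""
    violations = []
    loss = worst_case_data_loss(backup_interval, backup_duration)
    if loss > objectives.rpo:
        violations.append(
            f"RPO of {objectives.rpo} missed: worst-case data loss is {loss}"
        )
    if estimated_restore_time > objectives.rto:
        violations.append(
            f"RTO of {objectives.rto} missed: restore takes {estimated_restore_time}"
        )
    return violations


# Hypothetical figures: nightly backups against a one-hour RPO and four-hour RTO.
objectives = RecoveryObjectives(rpo=timedelta(hours=1), rto=timedelta(hours=4))
problems = objective_violations(
    objectives,
    backup_interval=timedelta(hours=24),
    backup_duration=timedelta(minutes=30),
    estimated_restore_time=timedelta(hours=2),
)
```

With these example figures the nightly backup misses the RPO but the restore time is within the RTO, which suggests more frequent backups or replication rather than faster restores.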
The disaster recovery plans for the IT systems are actually a subset of the broader 'business continuity' plans (BCP) that an organisation should have. These cover all the aspects of keeping an organisation running in the event of a disaster. Business continuity plans also include manual processes, staff coverage, building access etc. You need to make sure that the IT disaster recovery plan fits into the business continuity plan and that you state the dependencies between them.
There is a range of official standards covering Business Continuity Planning, such as ISO 22301, ISO 22313 and ISO/IEC 27031. Depending on your business and location you might have a legal obligation to comply with these or other local standards. I would strongly recommend that you investigate whether your organisation needs to be compliant - if you fail to do so then there could be legal consequences.
This is a complex topic which I have only touched upon - if it raises concerns then you may have a lot of work to do! If you don't know where to start then I'd suggest getting your team together and running a risk storming workshop.