Most government IT agencies have a disaster recovery plan on paper, but how confident are they that it will work and enable them to ride out a catastrophe with minimal loss of data or equipment? There’s only one way to find out: Conduct a live test.
Last year, the Virginia Information Technologies Agency (VITA) completed a full disaster recovery drill involving 22 executive branch agencies, 211 servers and 170 terabytes of data. The agency successfully recovered critical systems to a backup data center in southwest Virginia within 24 hours, and less critical systems within 48 hours.
Just a decade ago, the commonwealth of Virginia’s one-size-fits-all disaster recovery approach involved transporting backup tapes to an external site where space wasn’t guaranteed. Testing was extremely limited, and there was little confidence in the ability to recover. VITA solved the problem with an evolutionary project that took several years to complete and required significant research. Implemented and managed for 89 executive branch agencies, Virginia’s dynamic, action-based DR plan undergoes continuous improvement. What follows are some of our best practices for formulating a dynamic DR plan.
Key to Virginia’s success is how the organization approaches continuity of operations planning. Focus on changing the thought process from planning and preparing for an annual test to thinking about DR as a continuous activity.
Because VITA leaders are responsible for declaring a disaster, they support the program from the top down. Involve individual agencies in DR planning and testing, and boost awareness through online and print materials such as posters and planning kits. Our staff members think about DR as a part of their everyday job, not just when an exercise is approaching.
During initial live testing, VITA found that changes to hardware and software weren’t being replicated to the recovery site. Now, a quarterly report compares the hardware, operating system and patch levels of production assets with those of the corresponding assets in the secondary data center, and a configuration management database flags discrepancies. Over time, the number of issues has steadily decreased, showing that the process is maturing.
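A comparison of this kind can be sketched in a few lines of Python. This is only an illustration, not VITA's actual tooling: the `AssetConfig` record, the `find_discrepancies` function, and every asset name and version string below are hypothetical, and a real report would pull both inventories from a configuration management database rather than from hard-coded dictionaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssetConfig:
    """Hypothetical snapshot of one asset's configuration."""
    hardware_model: str
    os_version: str
    patch_level: str


# Attributes compared between the production and recovery copies of an asset.
COMPARED_FIELDS = ("hardware_model", "os_version", "patch_level")


def find_discrepancies(production, recovery):
    """Return (asset, field, production value, recovery value) tuples.

    Assets are visited in name order so the report is stable between runs.
    """
    issues = []
    for name in sorted(production):
        prod = production[name]
        dr = recovery.get(name)
        if dr is None:
            # The asset exists in production but has no recovery counterpart.
            issues.append((name, "asset", "present", "missing"))
            continue
        for field_name in COMPARED_FIELDS:
            prod_value = getattr(prod, field_name)
            dr_value = getattr(dr, field_name)
            if prod_value != dr_value:
                issues.append((name, field_name, prod_value, dr_value))
    return issues


# Made-up inventories for illustration only.
production = {
    "app01": AssetConfig("model-a", "os-10.2", "patch-7"),
    "db01": AssetConfig("model-b", "os-10.2", "patch-7"),
    "web01": AssetConfig("model-a", "os-10.2", "patch-7"),
}
recovery = {
    "app01": AssetConfig("model-a", "os-10.2", "patch-7"),
    "db01": AssetConfig("model-b", "os-10.1", "patch-5"),
}

for issue in find_discrepancies(production, recovery):
    print(issue)
```

Tracking the length of the returned list from one quarter to the next gives a simple trend line for whether the number of discrepancies is falling.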
Examine all production changes to determine the impact, and have change owners speak up when a change in production has a DR component. Closely integrating change control and DR reduces risk.
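One way to picture the link between change control and DR is a simple gate that flags any change touching a replicated asset before it is approved. The Python sketch below assumes a hypothetical `ChangeRequest` record with a `dr_plan_updated` flag and a hypothetical set of replicated asset names; it does not describe any particular change management product.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    """Hypothetical change record; the field names are assumptions."""
    change_id: str
    affected_assets: list
    dr_plan_updated: bool = False


def changes_needing_dr_review(changes, replicated_assets):
    """Return IDs of changes that touch a replicated asset
    but have not yet had their DR impact addressed."""
    flagged = []
    for change in changes:
        touches_dr = any(
            asset in replicated_assets for asset in change.affected_assets
        )
        if touches_dr and not change.dr_plan_updated:
            flagged.append(change.change_id)
    return flagged


# Made-up data for illustration only.
replicated_assets = {"app01", "db01"}
changes = [
    ChangeRequest("CHG-1", ["db01"]),
    ChangeRequest("CHG-2", ["test99"]),
    ChangeRequest("CHG-3", ["app01"], dr_plan_updated=True),
]

print(changes_needing_dr_review(changes, replicated_assets))
```

A check like this only supplements the change owner's judgment; it catches the changes that can be identified from the asset list, not every change with a DR component.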
Customize what you have or use out-of-the-box tools to improve DR. There are a wide variety of tools available; evaluate which tools best support your applications and budget. For example, VITA implemented the ability to boot from a storage area network, disk backups and a wiki for easily accessible documentation.
Stop using physical tapes for backup to eliminate costs associated with transporting them and the failures associated with the fragility of tape. Deploying near real-time replication enabled us to reduce recovery times by more than 75 percent and better manage shift work to keep staff rested.
Standardize the restore process so that it is independent of hardware, and automate the collection of system information for the most accurate recoveries possible.
Comparing inventory and auditing the main and backup data centers reduced recovery failures caused by hardware mismatches and improved the input for storage capacity planning. Add configuration items to the configuration management database and conduct the inventory comparison and audit on a regular basis.
Ensure that IT staff who propose and oversee tech initiatives consider DR requirements when designing solutions. This reduces the number of missed requirements and the lag time between rollout and DR implementation.
We designed a workflow to notify all departments of DR requirements, improve documentation and ensure efficient use of DR resources. Not only does this augment the validation process, but it also resolves implementation errors before customer testing.
VITA plans to continue to improve DR for the commonwealth. For instance, the agency is considering deploying a mobile command center or adding more geographically diverse locations. A virtual private network enables systems access from anywhere there’s broadband.
Another challenge in an actual disaster is that staff may not be available for a variety of reasons. It is essential to test the second- and third-line staff in every area of expertise, not just the primary experts. Including these people in the drill provides cross-training and helps evaluate the quality of recovery documentation.