Continuous Preparation
As a small agency, the South Tahoe Public Utility District can't afford
to build and operate a secondary data center with backup servers at the ready.
But its IT department has discovered a less costly alternative.
The district, which provides water and sewage service to 17,000 customers
in South Lake Tahoe, Calif., last year purchased a Novell PlateSpin Forge
appliance that uses VMware virtualization software to copy server images.
Each image includes the application, the operating system it runs on, network
settings and even drivers and permissions. If a server crashes, the IT staff
can call up the image on the appliance and have the application running again
in about 30 minutes, with no additional configuration needed.
"It works quite well. It may not be as fast or as powerful as the original
server, but it runs the application," says Bill Frye, the district's
network and telecom systems administrator.
For Frye, the recovery appliance is one of many initiatives to improve continuity
of operations. For example, he rolled out a storage area network with a RAID
5 configuration to boost uptime and reliability. And before that project,
he deployed disk-to-disk backup to complement his tape backup system.
Frye isn't alone in his efforts. For local and state government IT
leaders, disaster recovery readiness and continuity of operations planning
are a continuous work in progress, improved over time as budgets
allow for new technology, experts say.
"Disaster recovery is an evolutionary process," says James Quinn,
an analyst with Info-Tech Research Group. "As long as you are backing
up data in some way, you have disaster preparedness. It's minimal, but
then you start looking at how you restore systems and start defining your
IT personnel's role and responsibilities."
Cost-Effective COOP
The South Tahoe Public Utility District spent about $33,000 on the PlateSpin
Forge appliance to guard against server outages, and the purchase paid off
almost immediately.
IT copied to the appliance images of its 10 most critical applications, including
financial, billing, maintenance and plant monitoring software, as well as
Microsoft Exchange, a SQL Server database and a domain controller.
Several weeks after the deployment, Frye and his staff were in the process
of upgrading to a new version of the district's financial application.
They were testing it on new servers when the server running the existing version
crashed. When they tried to revive the old server, it wouldn't respond,
so they started up the image of the existing financial application on the
Forge. It took the appliance about 30 minutes to boot the virtual machine,
and when it was up and running, users accessed the financial data as if nothing
had happened.
In total, downtime was about two hours. Without the appliance, IT would have
needed two to three days to bring up the application, which would have disrupted
accounts payable and payroll, says Chris Skelly, South Tahoe information technology
systems specialist. "We would have had to order new hardware and have
it delivered overnight. And then it would take 20 to 30 hours to reinstall
the software," he says.
The IT department continues to refine the utility district's disaster
recovery plan. Today, the district's main building includes a data center
that houses 25 servers and a 10 terabyte EMC CLARiiON AX4-5i SAN that includes
dual controllers for redundancy. A second building houses the Forge appliance
and the data backup system.
District IS manager Carol Swain is negotiating with a nearby city to lease
space at its data center to store data backups and possibly house another
Forge appliance. While the appliance doesn't fail over immediately, being
able to start up services again within the hour is good enough, Frye says.
"Having a hot site with duplicate servers and storage is too expensive,"
he says. "The Forge appliance is in a more reasonable price range. It's
already proven useful and allows us to have a much quicker turnaround to restore
a service. We can accept the limitations because we can afford the price."
Going Virtual
Trae Umstead has focused on continuity of operations since he became IT director
of Minnehaha County, S.D., three years ago. He upgraded the county's
network to improve reliability and then turned to server virtualization and
networked storage to improve uptime.
When he first arrived, each of the county's 12 buildings had just one
network connection to the data center. "If we had any accident at the
data center -- a fire or a tornado -- the rest of the county would
have no connection," he says.
Using a grant from the Department of Homeland Security, he built a ring topology
and laid additional fiber between the county buildings, so each building would
have two network connections. "If one point in the ring failed, we could
connect through the other side," he says.
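The ring's benefit can be checked with a small connectivity test. The sketch below is illustrative only: the building labels and the exact link layout are assumptions, since the article says only that each of the 12 buildings ended up with two connections. It confirms that a ring stays fully connected after any single link failure, while a chain does not.

```python
from collections import deque


def reachable(nodes, links, start):
    """Return the set of nodes reachable from start over undirected links."""
    neighbors = {node: set() for node in nodes}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbors[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def survives_any_single_link_failure(nodes, links):
    """True if every node stays reachable after removing any one link."""
    for failed in links:
        remaining = [link for link in links if link != failed]
        if reachable(nodes, remaining, nodes[0]) != set(nodes):
            return False
    return True


# Hypothetical layout: 12 buildings joined in a closed ring.
buildings = [f"building-{i}" for i in range(1, 13)]
ring_links = [
    (buildings[i], buildings[(i + 1) % len(buildings)])
    for i in range(len(buildings))
]
# The same buildings as a chain (ring with the closing link missing).
chain_links = ring_links[:-1]

ring_ok = survives_any_single_link_failure(buildings, ring_links)
chain_ok = survives_any_single_link_failure(buildings, chain_links)
```

Here `ring_ok` is true and `chain_ok` is false: closing the loop is what gives every building a second path.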
Though mainframe applications had backup servers at an offsite data center,
the county lacked those for Microsoft Exchange, SQL Server and other Windows
applications. Umstead purchased blade servers, VMware and NetApp network-attached
storage devices to create redundancy.
Today, three blade servers run about 30 virtual machines in the main data
center. If one blade fails, the other two blades automatically take over and
keep the applications running. For backup, Umstead moved a few older rackmount
servers to a secondary data center at the county's Emergency Operations
Center six miles away.
61 percent of respondents from North American enterprises rank upgrading disaster
recovery capabilities as either a critical or high priority, according to
Forrester Research.
Umstead installed a 50TB NetApp FAS3140 storage system for the main data
center and a 45TB NetApp FAS2020 storage system at the offsite location. He
uses NetApp SnapMirror software to replicate data from one site to the other.
Data for the most critical applications, such as Microsoft Exchange, are replicated
every two hours. Data for less critical applications are replicated at least
once a day, he says.
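The replication schedule sets an upper bound on how much recent data could be lost if the main site fails just before the next run. The sketch below is a rough illustration, not the county's configuration: the tier names and helper functions are hypothetical, and only the two intervals come from the article. It also ignores transfer time, which would widen the window.

```python
from datetime import timedelta

# Intervals from the article; the tier names are hypothetical labels.
REPLICATION_INTERVAL = {
    "critical": timedelta(hours=2),   # e.g. Microsoft Exchange
    "standard": timedelta(hours=24),  # "at least once a day"
}


def worst_case_data_loss(tier):
    """Longest stretch of changes that may be missing at the offsite copy."""
    return REPLICATION_INTERVAL[tier]


def is_due(tier, since_last_run):
    """True if the tier's interval has elapsed since its last replication."""
    return since_last_run >= REPLICATION_INTERVAL[tier]


critical_window = worst_case_data_loss("critical")
standard_window = worst_case_data_loss("standard")
```

Shortening an interval narrows the loss window at the cost of more replication traffic, which is one reason to reserve the two-hour schedule for the most critical data.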
The virtual machines are also stored as files on the offsite NetApp device.
So if the main data center ever goes offline, Umstead can fire up the old
rackmount servers and launch the virtual machines in the secondary data center
within three hours.
Next year, Umstead plans to replace the county's aging phone systems
with Voice over IP. VoIP will aid disaster recovery because if a county building
becomes uninhabitable, IT can easily run new lines or forward extensions to
other county buildings.
Disaster Readiness
The city of Gulf Shores, Ala., recently invested in much-needed disaster
recovery technology, and for the first time ever, IT is feeling prepared for
hurricane season.
The beach resort community faces the threat of hurricanes every summer. In
fact, Hurricane Ivan in 2004 caused widespread damage throughout the city.
City hall, which housed the city's lone server at the time, survived
unscathed, however.
When Network Administrator Lee Hartley joined the city's IT staff in
2007, Gulf Shores still didn't have much of a disaster recovery plan.
At the time, the data center had grown to about six servers, each backed up
with tape drives. If a hurricane threatened the city, IT's plan was
to pack the servers in a truck and drive to safety.
"It was a ‘turn the servers off and take it with us’ kind
of thing," Hartley recalls.
Since then, IT has planned for all types of disasters, including fires. This
past winter, Gulf Shores invested about $175,000 in IT hardware to improve
continuity of operations.
IT purchased VMware virtualization software and a 6TB HP SAN, allowing it
to replicate its servers and storage to a secondary data center at the city's
public works maintenance facility five miles north.
Four physical servers operate 19 virtual machines, with two servers housed
in each data center. If the main data center goes down, the virtual machines
can operate out of the secondary data center, Hartley says.
To conserve server resources, he says, only the most critical applications,
such as public safety applications, Microsoft Exchange and domain controllers,
will fail over automatically. The failover process takes just a few minutes,
he adds.
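Selective failover of this kind amounts to splitting the virtual machine inventory by criticality. The sketch below is an assumption-laden illustration: the `critical` flag, the machine names and the idea that the remaining machines are started by hand are all hypothetical, and it does not represent how VMware is configured in Gulf Shores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualMachine:
    name: str
    critical: bool  # hypothetical flag; the article does not say how this is recorded


def plan_failover(vms):
    """Split VMs into those restarted automatically and those left for manual start."""
    automatic = [vm.name for vm in vms if vm.critical]
    manual = [vm.name for vm in vms if not vm.critical]
    return automatic, manual


inventory = [
    VirtualMachine("public-safety", critical=True),
    VirtualMachine("exchange", critical=True),
    VirtualMachine("domain-controller", critical=True),
    VirtualMachine("records-archive", critical=False),  # hypothetical non-critical workload
]

automatic, manual = plan_failover(inventory)
```

Keeping the automatic list short leaves capacity on the two secondary servers for the workloads that matter most during an outage.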
The city's new HP LeftHand (now StorageWorks P4000) SAN replicates
data in real time between the main data center and the backup data center.
Gulf Shores also improved its backup processes by deploying EMC Data Domain
DD610 deduplication appliances for nightly disk backup and secondary storage.
The city keeps data backups for 45 days and archives older data to tape, Hartley
says.
In the future, the IT department plans to add redundant fiber links between
the two data centers and double the storage capacity on the SAN. But overall,
the city's IT employees have confidence in its disaster recovery plan.
"For the most part, if a hurricane is on its way, I don't think
we have anything to worry about," Hartley says. "If something
happens to city hall, our data is protected. We don't lose anything.
Instead of weeks of downtime, we're just talking minutes."
Fighting Floods
When relentless rain caused devastating floods in Rhode Island this spring,
the state's IT infrastructure emerged relatively unscathed, except for
two facilities.
One facility housing the state's Medicaid application was damaged,
knocking out the state's Medicaid operation. HP Enterprise Services,
which manages the Medicaid application, brought the application back up
in about three days through one of its backup data centers, says Rhode Island
CIO John (Jack) Landers.
Elsewhere, fast work by Rhode Island IT staff saved technology at the
Operator Control Division of the state's Department of Motor Vehicles from
damage. As flood waters began to rise at the DMV, IT administrators disconnected
the hardware -- six servers, PCs and printers -- and moved it to
a different floor, Landers says.
"From a state standpoint, we escaped the flood very well,"
he says.
Best Practices for COOP
1. Hire a good leader. Success in continuity of operations
planning revolves around the personality and skills of whoever is in charge.
"That person has to round up the troops and get them going,"
says Richard Jones, vice president and service director for Burton Group
Data Center Strategies.
2. Put proper procedures in place. "The key to continuity
of operations is people and how they react to a disaster. It's human
coordination that makes or breaks the response," says Anand Dubey,
Alaska's director of enterprise technology services. In the past,
different teams in Alaska had their own separate disaster recovery plans.
Dubey has worked to create an overarching, holistic plan.
3. Update the disaster recovery plan regularly. The work
is never finished because state agencies are always adding new critical
applications or replacing old mainframe applications with newer offerings,
says Rhode Island CIO John (Jack) Landers. Every application can't
be deemed critical, so it's important for states to regularly identify
and prioritize the applications that must be brought back up first, he says.
4. Test backup systems. Ensuring that the IT staff can
bring applications back online at a secondary data center is crucial for
disaster recovery, says Rhode Island's Deputy Information Processing
Officer Mike Lombardi, who manages the state's disaster recovery planning.
Lombardi and a group of IT workers from each state agency travel to the
state's recovery facility twice a year to run tests.
5. Build an incident response trailer. Rhode Island's
IT department has assembled a trailer stocked with servers, notebook PCs,
wiring and generators. If a server crashes, IT can drive to the scene and
restore the application with a spare server, Lombardi says.
Volcanic Interruption
Planning and testing help governments prepare for disasters. But it takes
a real emergency to determine whether government agencies are actually ready.
For Anand Dubey and his IT team in Alaska, that moment came in spring 2009,
when Mount Redoubt, a volcano 100 miles southwest of Anchorage, started to
rumble and geologists warned of an imminent eruption.
At that time, Dubey, Alaska's director of enterprise technology services,
had begun developing plans for different disaster scenarios, including earthquakes,
pandemics and volcanic eruptions. The plans were still in draft form, but
with the volcano threatening to blow, he put his volcano plan into action
and deployed his seven-member incident response team.
No IT facilities were close enough to be destroyed by lava flow. But an ash
cloud could destroy technology at the Anchorage data center and the state
WAN built with microwave dishes on mountaintops. Volcanic ash is abrasive,
and Dubey's fear was that the ash would not only take down the communications
network but also seep into the data center and wreck equipment.
"We had to mount a 24-hour watch for two weeks. Our team monitored
conditions and were ready to shut things down and bring them back up once
the ash cloud was gone," Dubey says.
After the first week, his team was exhausted. Dubey realized he needed more
staff, so he quickly recruited a second incident response team and had the
two teams work in shifts. The volcano did have minor eruptions, but not enough
to send ash toward Anchorage. After two weeks, the volcanic activity died
down.
The experience was a good exercise and provided Dubey's team with some
ways to improve their procedures.
"My ‘A’ team was fatigued, and it taught us that we needed
to formulate a ‘Team B,’" he says.