Data de-duplication shrinks backups by removing redundant bits of data.
In this tough economy, government must ensure storage costs don’t balloon in tandem with the amount of data that must be stored. As a result, budget-minded IT organizations have flipped the switch on data de-duplication, a technology that continuously trims data retention costs.
“The nature of backups is that when you do a series of incremental and full backups, inevitably you capture the same data over and over again,” says Noemi Greyzdorf, research manager at IDC. “With a high cost per gigabyte to manage and store data, it becomes attractive to eliminate that redundancy.”
Data de-duplication searches for duplicate segments of data and replaces them with a pointer to a single instance of that information. For example, if there are 50 copies of an e-mail attachment, the storage system will retain only one. This dramatic reduction in stored data in turn drives big savings: Less data requires fewer servers to process the data and fewer tape and disk drives to store it. Agencies also save on related expenses such as heating, cooling, space, personnel and bandwidth.
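The single-instance idea can be sketched in a few lines of code. This is an illustrative toy, not any vendor's implementation: it splits a data stream into fixed-size chunks, fingerprints each with SHA-256, and stores only one copy per unique fingerprint (commercial products typically use variable-length chunking and far more sophisticated indexing).

```python
import hashlib

def deduplicate(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep one copy per unique chunk.

    Returns the chunk store (fingerprint -> bytes) and a list of
    fingerprints that act as pointers to reconstruct the stream.
    """
    store = {}      # single-instance storage: one copy per unique chunk
    pointers = []   # the backup keeps only these lightweight references
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)   # keep only the first copy seen
        pointers.append(digest)
    return store, pointers

def restore(store, pointers):
    """Rebuild the original stream by following the pointers."""
    return b"".join(store[h] for h in pointers)

# 50 copies of the same "attachment": 50 pointers, but 1 stored chunk
attachment = bytes(range(256)) * 16        # exactly one 4,096-byte chunk
mailbox = attachment * 50
store, pointers = deduplicate(mailbox)
```

In this toy run the 50 identical attachments produce 50 pointers but a single stored chunk, which is exactly the e-mail example above in miniature.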
Parsram Rajaram, IT manager for the city of Winter Park, Fla., credits data de-duplication for enabling him to create a cost-effective, high-availability storage network to comply with public-records laws. “We used to back up each of our 55 servers individually,” he says. “Backups would run into each other, or we wouldn’t be able to get a full one because users had files open. Now there’s more pressure on us to have data available. We can’t tell someone we can’t access the backup and that data is lost.”
With manufacturers such as Cybernetics, Data Domain, FalconStor, HP, Overland Storage, Quantum, Spectra Logic and Symantec to choose from, Rajaram settled on a Data Domain 530 backup and recovery appliance because of its inline approach. The backup server channels data from 15 servers through the appliance to tape for offsite storage.
Inline data de-duplication technology examines data as it enters the box, before it’s stored to disk, whereas the post-processing technique de-duplicates data after it’s stored. The $300,000 Data Domain investment helps Winter Park save overall on disk capacity. “Because everything is done in the CPU and memory and not on disk, you only need a fraction of the disk space,” Rajaram says. For example, he’s been able to reduce some backups from 4 terabytes down to 138 gigabytes solely from data de-duplication.

That smaller data pool also speeds the time it takes to write data to tape, which makes disaster recovery more efficient. Eventually, Rajaram hopes to add a second Data Domain appliance to do replication between the appliances as a real-time fail-safe mechanism.
For the past few years, Butler County (Pa.) IT Director Bob Moyer has felt trapped. While county laws require that he keep seven years of data, the manual labor involved with his direct-attached tape backup system was exhausting.
“We’ve got more than three dozen servers, each with its own tape drive. That’s a heck of a lot of tapes to back up each night, once a week, once a month and annually,” he says.
In addition, the backups hampered user productivity. “There are about four hours of downtime associated with each backup. Also, if the tape gets full and the server hangs, then users have to wait for someone to come in the morning to fix it. The simple truth: I’d love to get rid of tapes altogether,” Moyer says.
He’s taken a giant first step in that direction by creating a redundant storage network featuring two Data Domain 530 appliances connected to two media servers equipped with Symantec’s Backup Exec System Recovery software. One media server/appliance pairing is onsite and the other offsite at the 911 center a mile and a half away.
Moyer estimates he has as much as 10TB of incremental and full backup data flowing each month through the appliances, which together with the backup software, storage and maintenance contract cost $185,000.
He relies on Data Domain’s inline de-duplication and replication to ensure that as much of that data as possible is readily accessible. So far, he’s been able to reduce what is typically a 2.4TB data load that includes the county’s financial system, tax system and databases by more than 50 percent, thus significantly increasing his pool of available storage.
“I get to replace most of our tape backups, free up my team’s time and gain a solid disaster recovery plan — you can’t really put a price tag on that,” he says.
The Virginia Department of Motor Vehicles applied data de-duplication to ward off a different problem: infrastructure creep. “The biggest challenge we faced was stopping the expansion of the footprint we started to see for our 6 terabytes of storage,” says Todd Gallagher, storage administrator at DMV. “We were definitely getting cramped.”
To tackle the problem, the team initially decided to shorten the DMV’s standard 45-day retention cycle for backups. “We thought we’d be happy with a full backup every other week, but it only got worse. Shortening our retention cycle compromised our ability to go back to a certain point in time,” he says.
Rather than continue down that dangerous road, Gallagher and his team turned to EMC’s Avamar disk-based storage networking solution with inline de-duplication. Each of the DMV’s three servers running the Avamar software can back up as often as necessary.
By applying EMC’s data de-duplication, the team has been able to reduce the size of e-mail, database and file system backups from 72TB to 5.9TB. This enabled the department to return to its 45-day retention policy.
That drastic reduction has also increased data center space and helped the DMV stave off the need for hardware purchases. “Now we have the capacity to do full backups every night and know we have full and reliable coverage,” Gallagher says.
The Indiana Office of Technology wants not only to halt expansion, but also to consolidate. Jim Rose, manager of systems administration in Indianapolis, has been busy centralizing 275 remote storage locations to eliminate the need for onsite staff to handle tape backups.
Although each of those locations had tape backup, there were no IT staff onsite, so managers had to remember to swap tapes. “It was a nightmare,” Rose says. “People lost tapes or forgot to take them offsite.”
The clincher: When an entire office burned to the ground, there was no way to recover the data.
His team faced one big obstacle: ferrying 45TB to 50TB of data each night across T-1 lines back to a data center. “We had done a small amount of testing to back up remote sites, and some had files that were hundreds of megabytes each, such as CAD/CAM drawings, that just weren’t feasible to continuously bring across the WAN. It was very formidable,” he says.
Rose could not afford to add pipes. The only solution was to use the data de-duplication features within Symantec’s storage networking tools to dramatically reduce the amount of data that had to traverse the network.
The team rolled out Symantec’s Veritas NetBackup with Shared Storage and Vault software, as well as Veritas NetBackup PureDisk Remote Office Edition to remote sites. The inline data de-duplication sends only single-instance data back to the data center. For example, each field office stores its own copy of state forms and documents, but only one copy is backed up.
Rose has reaped a 98 percent reduction in storage needs for 300 remote servers, from 42TB down to 1TB. “More important, we don’t have to rely on offsite, non-IT staff for backups. Previously, our recovery rate had been very low; now we’re at 95 percent,” he says.
If you apply data de-duplication to a 500GB backup and only 1 percent has changed since the previous backup, then you have to move only 5GB of data to keep the two systems synchronized, according to Data De-Duplication for Dummies.
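The arithmetic behind that claim is simple enough to check directly. This is a back-of-the-envelope sketch of the example above, not a model of any particular product:

```python
# Worked version of the example: a 500GB backup where only
# 1 percent of the data has changed since the previous run.
backup_size_gb = 500
changed_fraction = 0.01   # 1 percent of blocks are new or modified

# With de-duplication, only the changed blocks must move;
# everything else is already present on the target system.
transfer_gb = backup_size_gb * changed_fraction
print(f"{transfer_gb:.0f}GB crosses the wire instead of {backup_size_gb}GB")
```

The same ratio explains the WAN savings in the Indiana case: the less the data changes between backups, the smaller the transfer.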
Is De-Duplication Right for You?
Enterprise Management Associates Senior Analyst Michael Karp answers these questions about choosing de-duplication:
1. What is your data mix?
If most of your data consists of images or other types of information that are hard to label as redundant at the bit level, then de-duplication is not a good fit. Applying de-duplication to e-mail servers, databases and file systems will give you the best return on your investment.
2. How long is your retention period?
The longer you keep your data, the better de-duplication will work for you.
3. De-dupe now or later?
De-duplication can be done in two ways: as it’s being sent into the software or appliance (inline) or once it is already stored (post-processing). Each method has its benefits. Inline processing requires less storage because data is de-duplicated before it arrives on disk. Post-processing requires less overhead when sending data over the WAN.