
4 Ways to Optimize Data Deduplication

Tweak storage requirements and bandwidth to speed protection across multiple sites through global deduplication.

Rick Cook learned programming on a computer with magnetic drum memory. Since then he's written thousands of articles on all aspects of computers and high technology — as well as several fantasy novels full of bad computer jokes.

Some data deduplication users can reduce storage demands by up to 90 percent by eliminating redundant data in storage.

Redundancy is a fact of life with most data. Much of what is written to storage duplicates information already there. Data deduplication works by breaking the data stream into blocks and computing a much shorter signature for each block using a hash algorithm. That signature is checked against a database of existing signatures on the destination server (a DD database). If it matches, the block is a duplicate: it is discarded and replaced with a reference to the copy already in storage.

If it doesn’t match, the new block is stored and its signature is added to the DD database. Since only nonduplicate blocks are stored, the amount of storage can be reduced considerably. Blocks usually are compressed and can be encrypted. In most applications, the savings in storage more than compensate for the additional computing needed by data deduplication. Data deduplication is becoming an increasingly popular option for storage.
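To make the mechanism concrete, here is a minimal sketch in Python. It assumes fixed-size 4 KB blocks, SHA-256 signatures and an in-memory dictionary standing in for the DD database; the names deduplicate and restore are illustrative, not taken from any product, and real systems differ in block sizing, hashing and storage.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size; actual systems vary


def deduplicate(data: bytes, dd_database: dict[bytes, bytes],
                block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Store only unseen blocks and return the list of signatures
    needed to rebuild the original data."""
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        signature = hashlib.sha256(block).digest()
        if signature not in dd_database:
            # Nonduplicate block: keep it and remember its signature.
            dd_database[signature] = block
        # Duplicate or not, the data is represented by its signature.
        recipe.append(signature)
    return recipe


def restore(recipe: list[bytes], dd_database: dict[bytes, bytes]) -> bytes:
    """Rebuild the original data from its signatures."""
    return b"".join(dd_database[signature] for signature in recipe)


if __name__ == "__main__":
    # Three identical blocks followed by one different block.
    data = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
    database: dict[bytes, bytes] = {}
    recipe = deduplicate(data, database)
    print(len(recipe), "blocks written,", len(database), "blocks stored")
```

In this toy case four blocks are written but only two are stored, and restore returns the original bytes unchanged.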

So, what can IT leaders do to get more out of their storage?

1. Strategize Deduplication to Fit Your Schedule

Your backup strategy affects how much you save. In general, more frequent full backups will result in higher deduplication ratios, because each full backup repeats most of the data in the one before it. To ease the load on system resources, schedule data deduplication for off-peak hours. With source-based DD, the size of the caches on the source servers has a profound effect on performance. Usually the system will arrive with the cache sizes set to optimum or near-optimum for an average load. Those sizes can be reconfigured if the factory settings aren’t appropriate for your application.

Once your parameters are properly set, your DD system will need very little attention, and it should produce considerable savings in storage space.

2. Be Sure to Select the Right System for You

In choosing a data deduplication system, one important consideration is where the data will actually be deduplicated. The original systems conducted the dedupe on the server after the data was sent. (And that remains a popular option.) The other approach performs the dedupe on the client and sends only the deduplicated data to the server. Both approaches have advantages.

3. Don’t Overload a Network

Client-side DD cuts the load on the network by reducing the amount of data sent to the server; however, it puts the computing load involved in DD on the clients, which may overload them. That matters because DD is a compute-intensive process. Server-side DD spares the clients but puts an additional load on the network, since all of the data must be sent before duplicates are removed.

Which system is best for your enterprise depends in large part on your system configuration. Are you better off increasing the load on your network or on your clients?
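The network side of that trade-off can be estimated with a rough sketch like the one below. It assumes fixed-size blocks, SHA-256 signatures, and a simplified protocol in which a client-side system sends a signature for every block plus the contents of blocks the server has not seen. The function bytes_sent is hypothetical and ignores protocol overhead and compression.

```python
import hashlib


def bytes_sent(data: bytes, server_signatures: set[bytes],
               block_size: int = 4096, client_side: bool = True) -> int:
    """Estimate how many bytes cross the network for one backup."""
    if not client_side:
        # Server-side DD: everything is sent, then deduplicated remotely.
        return len(data)

    total = 0
    known = set(server_signatures)  # copy so the caller's set is untouched
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        signature = hashlib.sha256(block).digest()
        total += len(signature)  # the signature is always sent
        if signature not in known:
            total += len(block)  # only unseen blocks travel in full
            known.add(signature)
    return total


if __name__ == "__main__":
    data = b"A" * 4096 * 3 + b"B" * 4096
    print("server-side:", bytes_sent(data, set(), client_side=False))
    print("client-side:", bytes_sent(data, set(), client_side=True))
```

The bytes saved on the wire are paid for in hashing work on the client, which is the cost to weigh against your clients' spare capacity.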

4. Match Your System to Your Data Needs

Another consideration is whether all the data required can be deduplicated. Some systems will only handle DD blocks of a certain size. Others will not deduplicate data that is not in a storage pool. When choosing your system, it is important that it match your needs.
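Block size matters because the same data can deduplicate well at one block size and poorly at another. The sketch below assumes fixed-size blocks and SHA-256 signatures, and uses a hypothetical stored_bytes function to count how much data would be kept for a given block size. It is an illustration only, not a model of any particular product.

```python
import hashlib


def stored_bytes(data: bytes, block_size: int) -> int:
    """Return the bytes kept after deduplicating data in fixed-size blocks."""
    unique_blocks: dict[bytes, int] = {}
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        # Keep one length entry per distinct signature.
        unique_blocks.setdefault(hashlib.sha256(block).digest(), len(block))
    return sum(unique_blocks.values())


if __name__ == "__main__":
    # 16 KB of data that repeats every 2 KB.
    data = (b"A" * 1024 + b"B" * 1024) * 8
    for size in (1024, 2048, 3072):
        print(size, "byte blocks ->", stored_bytes(data, size), "bytes stored")
```

With this sample, block sizes that line up with the 2 KB repeat keep far less data than a 3 KB block size that straddles it, which is why it pays to check a system's block handling against your own data.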
