Deduplication removes redundant data from storage — originally, duplicate files, but more recently, duplicate blocks or even sub-blocks, called "chunks." If a new file is stored and parts of it are identical to a file already on the storage system, only the different parts are stored.
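To make the chunk-level idea concrete, here is a minimal sketch of a content-addressed chunk store, assuming fixed-size chunks and SHA-256 fingerprints (the `ChunkStore` class and its method names are hypothetical; production systems often use variable-size "content-defined" chunking instead):

```python
import hashlib


class ChunkStore:
    """Toy block-level deduplicating store (illustrative sketch only).

    Files are split into fixed-size chunks; each unique chunk is stored
    once, keyed by its SHA-256 digest. A file is then just an ordered
    list of chunk digests.
    """

    CHUNK_SIZE = 4096  # fixed-size chunking for simplicity

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes (stored once)
        self.files = {}   # filename -> list of digests

    def store(self, name, data):
        digests = []
        for i in range(0, len(data), self.CHUNK_SIZE):
            chunk = data[i:i + self.CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # Write the chunk only if it has never been seen before.
            self.chunks.setdefault(digest, chunk)
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        # Reassemble the file from its chunk references.
        return b"".join(self.chunks[d] for d in self.files[name])

    def physical_bytes(self):
        # Capacity actually consumed, after deduplication.
        return sum(len(c) for c in self.chunks.values())
```

Storing a second file that shares most of its content with the first consumes space only for the chunks that differ, which is exactly the behavior described above.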
Hardware that deduplicates near-line storage intended for backups has been available for several years. This works well because data varies little from one backup to the next. However, deduplication of online or primary storage is relatively new, and not all types of data will benefit from the process. Here are some pointers to optimize deduplicated storage.
The two types of deduplication present some trade-offs. In-line processing intercepts all data as it is written to storage, removes any bits that have already been stored and writes the rest. This requires heavy-duty processing power and can introduce latency, because incoming data must be compared against what is already stored before it is written. It also generally costs more than a post-processing system. In contrast, post-processing deduplication writes data to a "landing zone" first and deduplicates it later, which means that as much as 50 percent of the available raw capacity is consumed: data sits in the landing zone until it can be processed, then moves to the deduplicated half of the array.
Deduplication works most effectively when there are multiple copies of the same sets of files. For example, the .vmdk or .vhd virtual hard-disk files used by virtualization platforms such as VMware or Hyper-V often contain the same basic sets of operating system files and should be ripe for deduplication. In contrast, video files or databases tend to have large sets of unique data and will not see much reduction from deduplication. However, if video data were being processed with multiple working files of the same data, then deduplication would be appropriate.
Post-processing deduplication examines data saved to the storage system to find duplicate blocks or chunks of data. Typically, this scheduled process runs overnight; hence the need for a landing zone to store data until it can be processed. If the system is being used 24x7, you may need to move to an in-line deduplication system, which generally requires a greater upfront investment.
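A post-processing pass can be sketched as a single batch sweep over blocks already written to the landing zone, keeping one copy of each unique block and a map of references back to it (the function name and return shape below are hypothetical, for illustration):

```python
import hashlib


def postprocess_dedup(blocks):
    """Batch pass over already-written blocks (a notional 'landing zone').

    Returns (unique_blocks, block_map), where block_map[i] is the index
    into unique_blocks holding the data for original block i.
    """
    seen = {}        # digest -> index into unique_blocks
    unique_blocks = []
    block_map = []
    for blk in blocks:
        digest = hashlib.sha256(blk).digest()
        if digest not in seen:
            seen[digest] = len(unique_blocks)
            unique_blocks.append(blk)
        block_map.append(seen[digest])
    return unique_blocks, block_map
```

The key difference from the in-line approach is timing: the same hash-and-compare work happens, but after the write completes, so the application never waits on it.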
While in-line deduplication can be subject to latency, this should be a concern only for data that requires high-performance storage.
If the deduplication engine fails, the stored data itself remains, but the system may no longer know how to find the original primary copy of each block. Therefore, IT should either mirror the data to a second system or perform snapshot backups that capture the full, undeduplicated data set.
Predicting compression ratios can be difficult. Deduplication doesn't actually compress data; it just removes duplicates. Effective compression ratios of 100:1 or more are possible with some data. But user directories or other types of data may contain relatively little duplication, complicating the task of estimating how much compression you'll achieve. Run the deduplication system for a while to see how well it works in your environment.
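One way to get a rough advance estimate is to chunk a sample of your own files and compare logical size against the size of the unique chunks. This is only a sketch: the chunk size and fixed-size chunking are assumptions, and a real engine's algorithm will produce different numbers.

```python
import hashlib


def estimated_dedup_ratio(files, chunk_size=4096):
    """Rough estimate of the reduction a fixed-size-chunking dedup engine
    might achieve on a sample of file contents (list of bytes objects).

    Returns logical_bytes / physical_bytes, e.g. 4.0 for a '4:1' ratio.
    """
    logical = 0
    unique = {}  # digest -> chunk length
    for data in files:
        logical += len(data)
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            unique.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    physical = sum(unique.values())
    return logical / physical if physical else 0.0
```

Running this over a representative sample of your data, rather than trusting a vendor's headline ratio, gives a ballpark figure before you commit to hardware; an actual trial run remains the real test.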
Make sure to test a system thoroughly before you purchase it. Some storage systems offer deduplication as a standard feature. If you believe your environment can benefit from deduplication, you can deploy one of these devices, and if it doesn't work out, you'll have a functional SAN appliance that can be used like any other.