How should I back up data that doesn’t deduplicate? It’s one of the questions I’m asked often – by both our engineers and our customers. In fact, a TBW reader raised the issue in response to my recent post. Therefore, I’d like to explain how we approach such fundamental challenges and then share the approaches that I recommend to our customers.
The Fundamental Challenge
Difficult challenges require a system-level solution approach because the problems are too complex to be solved by one component. It is this systems view that drives my push to transition from tape to disk.
Over the past twenty years, tape-centric backup systems have evolved about as far as they can. Meanwhile, disk-centric backup continues to evolve rapidly because disk storage systems alter the constraints in the system. Therefore, “backup to disk” isn’t code for “write a tar image to a Data Domain VTL” (especially since VTL still implies a tape-centric backup approach).
Usually, one of the disk backup approaches can meet our customers’ RPO/RTO and reliability needs at the right cost… or come closer to the mark than anything else available. More importantly, with both the freedom and investment to innovate, disk-centric backup architecture will more effectively address IT challenges today and in the future.
The Approach: Four Use Cases
There are four “non-dedupe” backup use cases I hear about:
- Low-retention, non-repeating data (e.g., database logs): Customers usually choose between two options: Option 1: Store the logs on the backup appliance, getting only local compression, but with consolidated protection storage management. Option 2: Store the logs on non-deduplicating disk systems and coordinate the storage management (e.g., replication). Regardless, disk is usually the best option to handle the performance requirements for high value data with such an aggressive half-life.
- High-churn environments (e.g., test data): These data sets experience 30%+ daily change. Most customers opt for short-term retention because the data is so short-lived. In that case, I recommend snapshots/clones and/or replication. While the snapshots consume a significant amount of space, they save a tremendous amount of IOPS. Too often, organizations ignore the heavy I/O load caused by backups. Not only are most of the backup reads not served from cache, but they often pollute the cache. In high-churn environments, IOPS are even more precious, since the storage system’s disks are so heavily loaded with the application load (and the churn makes flash a non-ideal fit). Therefore, at a system level, it is often less expensive to consume extra space for snapshots than to consume the IOPS for traditional backups.
As an additional benefit, the snapshots enable faster recovery from current versions of data. The choice to replicate becomes a cost/benefit analysis around the availability of data vs. the cost of a second storage array and network bandwidth. Tape-centric approaches compromise application performance (or require overbuying the primary storage performance), recover stale copies of the data, and recover the data so slowly that customers prefer to regenerate the data (e.g., application binaries, satellite images, oil and gas analytics, or rendered movie scenes).
- Environments in which you don’t run multiple full backups and have little cross-backup dedupe (e.g., images, web objects, training videos): If data is never modified and rarely deleted, customers don’t run full backups. Since a backup appliance derives much of its space savings from deduplicating redundant full backups, dedupe rates fall in the absence of multiple fulls. The best approach for protecting these data sets is replication, especially if the replicated copy can service customer accesses.
Since the data is not modified, there is little value in retaining multiple point-in-time copies. Therefore, the most critical recovery path is that of a full recovery; nothing is faster than connecting to a live replica, and nothing is scarier than depending on multiple incremental tape restores. Furthermore, these types of datasets tend to have distributed access patterns, so technologies like EMC’s VPLEX can improve both protection and performance with the same copy (another way of deduplicating copies).
- Environments in which the application behavior compromises dedupe (e.g., compressing data that you modify): Think of an application that either modifies compressed files in place (e.g., open file, decompress file, modify file, recompress file) or creates multiple compressed copies of data (e.g., compressed or encrypted local database dumps). This workflow tends to create 10x more data modification than the actual new data.
In these cases, you have two options: Option 1: Decompress the data for the backup and/or write the database dumps directly to the dedupe storage, so you can get the optimal deduplication. Option 2: Treat the data like the first or second use case discussed above (low-retention, non-repeating data or high-churn data).
However, if the customer is unwilling to decompress the data and wants long-term retention, this is the most plausible instance in which to leverage tape. I’m just not sure it’s widespread enough to justify deploying a tape environment; I would fully explore cloud options first.
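To see why compression in the application compromises dedupe, here is a minimal sketch of the effect. It is an illustration under stated assumptions, not a model of any particular backup appliance: it assumes fixed-size 4 KB chunks fingerprinted with SHA-256 (many dedupe systems use variable-size chunking instead), synthetic CSV-like records, and zlib as the compressor. The names `chunk_hashes`, `new_chunk_fraction`, `make_dataset`, and `compare` are hypothetical helpers introduced for this example.

```python
import hashlib
import random
import zlib

CHUNK_SIZE = 4096  # assumption: fixed-size chunks, chosen only for simplicity


def chunk_hashes(data: bytes) -> list:
    """Split data into fixed-size chunks and fingerprint each chunk."""
    return [hashlib.sha256(data[i:i + CHUNK_SIZE]).digest()
            for i in range(0, len(data), CHUNK_SIZE)]


def new_chunk_fraction(previous: bytes, current: bytes) -> float:
    """Fraction of the current backup's chunks not already stored by the previous one."""
    stored = set(chunk_hashes(previous))
    incoming = chunk_hashes(current)
    return sum(1 for h in incoming if h not in stored) / len(incoming)


def make_dataset(records: int = 50_000, seed: int = 0) -> bytes:
    """Build synthetic, compressible CSV-like data (hypothetical stand-in for a dump)."""
    rng = random.Random(seed)
    names = ["orders", "users", "events", "invoices"]
    lines = [f"{i:08d},{rng.choice(names)},{rng.randrange(1_000_000):06d}\n"
             for i in range(records)]
    return "".join(lines).encode("ascii")


def compare() -> tuple:
    """Return (raw, compressed) fractions of new chunks after a tiny in-place edit."""
    day1 = make_dataset()
    edited = bytearray(day1)
    edited[1000:1016] = b"X" * 16  # overwrite 16 bytes; total length is unchanged
    day2 = bytes(edited)
    raw = new_chunk_fraction(day1, day2)
    compressed = new_chunk_fraction(zlib.compress(day1), zlib.compress(day2))
    return raw, compressed


if __name__ == "__main__":
    raw, compressed = compare()
    print(f"new chunks, uncompressed backup: {raw:.1%}")
    print(f"new chunks, compressed backup:   {compressed:.1%}")
```

In the uncompressed case, the 16-byte edit touches a single chunk, so the second backup stores almost nothing new. In the compressed case, the output stream generally diverges from the point of the edit onward, so a far larger share of the chunks looks new to the dedupe engine. That is the gap Option 1 (decompressing before backup) is meant to close.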
When I advocate for disk, I’m asking the industry to consider both the entire portfolio of disk solutions and the possibilities that can be developed. As we’ve been discussing on LinkedIn, as soon as you make disk your design center, it opens a whole new set of architectural approaches. And that’s the transition that is so exciting – moving from putting disk inside a tape-centric architecture to really designing around disk.
As you can see from the examples above, the most challenging environments for data protection require a system-level approach. In fact, some of them demand approaches that look beyond just the protection infrastructure. As we’ve talked about in the past, backup teams need to connect with application, virtualization, and storage owners to provide the services that their users need. With those connections, they can deliver better integrated, more innovative solutions to their customers.