There is a saying we use in IT, “not all data is equal”, that we apply to datasets when deciding on both primary storage location and backup policies. The same can be said of deduplication: “not all deduplication is equal”.
The goal of this blog is to provide an intro to Data Deduplication and then cover some of the things you should look out for when deciding on a technology fit. Note that while modern flash-based primary storage systems offer dedupe built in, this blog is focused on Data Protection software and appliances.
Data Protection is expensive!
Data protection is expensive. In essence it can be thought of as insurance, but it is more expensive than insurance. There is no getting away from it: when you consider the number of copies of data in play when it comes to backups, the reason is obvious.
Let’s use an example with nice round numbers:
- To protect 10 TB of Front-End data (FETB) using a standard 4 weekly/12 monthly/7 yearly protection scheme, the Back-End (BETB) total is:
10 FETB = (3*10) + (11*10) + (7*10) = 210 BETB!
(Note in the above calculations the 4th week is a monthly and the 12th month is a yearly)
- Let’s say we have managed to reduce our long-term retention to 12 months
10 FETB = (3*10) + (12*10) = 150 BETB!
– That’s still huge.
- Say you manage to convince the business that you are going to operate with only the last 4 weeks retention
10 FETB = (4*10) = 40 BETB!
– You require 4 times the amount of backend storage which is still a lot of capacity!
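If you want to rerun these sums for your own environment, the arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope model that assumes every retained copy is a full copy with no deduplication; the function name `backend_tb` and its parameters are made up for this example.

```python
def backend_tb(front_end_tb, weeklies=0, monthlies=0, yearlies=0):
    """Back-end TB needed when every retained copy is a full, non-deduplicated copy."""
    return front_end_tb * (weeklies + monthlies + yearlies)


# The three schemes from the example above, all protecting 10 FETB.
schemes = {
    "4 weekly / 12 monthly / 7 yearly": backend_tb(10, weeklies=3, monthlies=11, yearlies=7),
    "weeklies plus 12 monthlies": backend_tb(10, weeklies=3, monthlies=12),
    "4 weeklies only": backend_tb(10, weeklies=4),
}

for name, tb in schemes.items():
    print(f"{name}: {tb} BETB")
```

Swap in your own front-end capacity and retention counts to see how quickly the back-end requirement grows.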
Advancements in Data Protection
I.T. needed a way to reduce the storage cost. Everyone is familiar with the Iron Mountain vans driving around the country every Monday collecting tapes for offsite storage. This was because tape is cheap and previously was the only affordable way of managing these retention levels. But in this day and age you don’t want to be relying on tape media for data recovery.
Introducing Data Deduplication
“Deduplication: The elimination of duplicate or redundant information, especially in computer data – deduplication removes the repetitive information before storing it” – Lexico Dictionary, 2019.
Data Deduplication has been around for decades (file, single instance), but it was really only in the 2000s that vendors had the technology to bring out storage systems with inbuilt block deduplication, allowing companies to reduce their storage footprint.
The key to these systems was that they were not just focused at a file level but rather the block level. Unique chunks are stored on the system, and any other chunks that match the same characteristics are identified and simply stored as a pointer back to the original unique chunk. When over 90% of your data is not changing, this can significantly reduce your FETB and BETB requirements.
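To make the "unique chunk plus pointers" idea concrete, here is a minimal sketch. It assumes fixed 4 KB chunks and SHA-256 fingerprints purely for simplicity; the names `dedupe_store` and `restore` are invented for this illustration and are not taken from any product.

```python
import hashlib


def dedupe_store(data, chunk_size, store):
    """Split data into chunks, keep one copy of each unique chunk in `store`,
    and return the list of pointers (fingerprints) that describes the backup."""
    pointers = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)  # only stored if not seen before
        pointers.append(fingerprint)
    return pointers


def restore(pointers, store):
    """Rebuild a backup by following its pointers back to the stored chunks."""
    return b"".join(store[p] for p in pointers)


store = {}
monday = b"A" * 4096 + b"B" * 4096 + b"C" * 4096
tuesday = b"A" * 4096 + b"X" * 4096 + b"C" * 4096  # only the middle block changed

monday_ptrs = dedupe_store(monday, 4096, store)
tuesday_ptrs = dedupe_store(tuesday, 4096, store)

logical = len(monday) + len(tuesday)
physical = sum(len(c) for c in store.values())
print(f"logical {logical} bytes, physical {physical} bytes")
```

Two backups of three blocks each end up as four stored chunks, and both can still be restored in full from their pointer lists.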
What to look out for when selecting a Solution with dedupe capabilities
Make sure that your system uses inline deduplication
Inline deduplication – incoming data is checked against data that already exists in storage, and any redundant data discovered is not stored. This can be target or source side:
- Target dedupe – data is sent whole over the network to the deduplication appliance where it is deduped and stored
- Source side dedupe – data is analysed at source and only unique new chunks are sent over the network to the deduplication appliance
Some systems only do target-based dedupe. While target-only dedupe is sometimes workable within DCs, the ability to do source side dedupe can reduce not only your network traffic but also your backup times, which matters given your ever-shrinking backup window.
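A rough way to see the network difference is to count the bytes that cross the wire in each model. The sketch below is a simplification under stated assumptions: fixed 4 KB chunks, a 32-byte SHA-256 fingerprint sent for every chunk, and a set called `known_digests` standing in for whatever index the appliance keeps. Real products use their own protocols, so treat the function names and numbers as illustrative only.

```python
import hashlib

CHUNK = 4096  # assumed chunk size for this sketch


def target_side_bytes_sent(data):
    """Target dedupe: the whole backup crosses the network."""
    return len(data)


def source_side_bytes_sent(data, known_digests):
    """Source dedupe: send a fingerprint per chunk, plus the chunk itself
    only when the appliance has not seen that fingerprint before."""
    sent = 0
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        digest = hashlib.sha256(chunk).digest()
        sent += len(digest)
        if digest not in known_digests:
            known_digests.add(digest)
            sent += len(chunk)
    return sent


first = b"".join(bytes([i]) * CHUNK for i in range(10))      # 10 distinct chunks
second = first[:CHUNK] + b"\xff" * CHUNK + first[2 * CHUNK:]  # one chunk changed

appliance_index = set()
print("first backup, source side:", source_side_bytes_sent(first, appliance_index))    # 41280
print("second backup, source side:", source_side_bytes_sent(second, appliance_index))  # 4416
print("second backup, target side:", target_side_bytes_sent(second))                   # 40960
```

The first backup costs slightly more with source side dedupe because of the fingerprints, but every backup after that only pays for what actually changed.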
Make sure that your system is using variable length dedupe not fixed length
- Fixed length dedupe – the deduplication appliance analyses the chunks in fixed lengths
- Variable length dedupe – the deduplication appliance analyses the data and decides where best to break out into chunks.
The variable length option allows for greater accuracy in identifying redundant data, which results in a greater dedupe ratio and therefore a smaller back-end storage footprint.
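The classic case where this matters is a small insert near the start of a file: every fixed-length boundary after it shifts, so nothing lines up with what is already stored. The sketch below compares the two approaches on the same data. It is a toy model, not how any particular appliance works: the window, divisor and chunk sizes are assumed values, and `zlib.crc32` is recomputed over each window for readability where a production system would typically use a proper rolling hash.

```python
import hashlib
import random
import zlib

WINDOW = 16    # bytes examined when deciding on a boundary (assumed value)
DIVISOR = 256  # gives a boundary roughly every 256 bytes on average (assumed value)
FIXED = 256    # fixed chunk length used for the comparison (assumed value)


def fixed_chunks(data):
    """Cut the data every FIXED bytes regardless of content."""
    return [data[i:i + FIXED] for i in range(0, len(data), FIXED)]


def variable_chunks(data):
    """Cut after any position whose preceding WINDOW bytes hash to 0 mod DIVISOR,
    so boundaries depend on the content rather than on the byte offset."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[end - WINDOW:end]) % DIVISOR == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fingerprints(chunks):
    return {hashlib.sha256(c).digest() for c in chunks}


rng = random.Random(42)
original = bytes(rng.getrandbits(8) for _ in range(16384))
edited = b"!" + original  # a single byte inserted at the very start

for name, chunker in (("fixed", fixed_chunks), ("variable", variable_chunks)):
    old = fingerprints(chunker(original))
    new = fingerprints(chunker(edited))
    print(f"{name}: {len(old & new)} of {len(old)} chunks already stored")
```

With fixed lengths the one-byte insert should leave nothing recognisable, whereas the content-defined boundaries should resynchronise after the first chunk so that almost everything is found again.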
Make sure that the variable length dedupe is dynamic variable
This may seem obvious, but we have experienced vendors producing fixed length dedupe systems that allow you, the admin, to set what that fixed length is, and classifying the result as “Variable length”. This approach depends totally on the backup admin determining the correct length for all data sets. I don’t know about you, but I’d rather the technology identify what is best than rely on this method.
Make sure your system has the performance to meet your requirements
You typically have strict bandwidth limits and backup windows. Deduplication processing is resource intensive, so you need to be sure that your system can meet these requirements. When sizing a solution, make a point of having your solution provider include the full sizing methodology and calculations in their presentation.
Ask for real world numbers
This is a follow-on to the performance note above. Every vendor should have published numbers around capacity and performance. Do not go by the advertised hero numbers but rather request that they provide you real world examples (they can do this).
Also make sure that they are providing the correct dedupe ratios per data type. The file dedupe ratio is different to the DB dedupe ratio, so ensure they are sizing correctly based on your data sets.
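To show why the per data type breakdown matters, here is a small sizing sketch. The ratios and the file/database split below are hypothetical figures chosen only to illustrate the arithmetic, not vendor or real-world numbers, and the function name `deduped_backend_tb` is made up for this example. It models physical capacity simply as logical capacity divided by the dedupe ratio.

```python
def deduped_backend_tb(datasets, copies):
    """datasets maps a data type to (front-end TB, expected dedupe ratio).
    Physical capacity is modelled as logical capacity divided by the ratio."""
    return sum(fetb * copies / ratio for fetb, ratio in datasets.values())


copies = 21  # 3 weeklies + 11 monthlies + 7 yearlies, as in the earlier example

# HYPOTHETICAL split and ratios, for illustration only.
per_type = deduped_backend_tb({"file": (6, 10.0), "database": (4, 4.0)}, copies)

# The same 10 FETB sized on a single headline ratio.
headline = deduped_backend_tb({"all data": (10, 10.0)}, copies)

print(f"sized per data type: {per_type:.1f} BETB")
print(f"sized on one headline ratio: {headline:.1f} BETB")
```

In this made-up case, sizing on the single headline ratio would leave you well short of the capacity the per data type sums call for, which is exactly the kind of gap to ask your vendor to explain.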
Don’t opt for a post process deduplication appliance.
- Post process is where the data is sent whole over the network to the deduplication appliance (as in target dedupe) where it is stored whole. The appliance then runs an algorithm afterwards on this and copies unique chunks to another partition.
This may be a cheaper alternative, but it’s not as effective in reducing either backup window or network traffic, and in fact results in longer backup times to completion.
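The extra work in the post process model shows up if you count what gets written to disk on the appliance. The sketch below is a simplified model with assumed fixed 4 KB chunks and SHA-256 fingerprints; the function names are invented for this example and a dictionary stands in for the deduplicated partition.

```python
import hashlib

CHUNK = 4096  # assumed chunk size for this sketch


def store_unique(data, store):
    """Write chunks not already in `store`; return the number of bytes written."""
    written = 0
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        digest = hashlib.sha256(chunk).digest()
        if digest not in store:
            store[digest] = chunk
            written += len(chunk)
    return written


def inline_disk_writes(data, store):
    """Inline: only chunks that are new ever reach the disk."""
    return store_unique(data, store)


def post_process_disk_writes(data, store):
    """Post process: the whole backup lands first, then unique chunks are copied."""
    landed = len(data)
    return landed + store_unique(data, store)


last_night = b"".join(bytes([i]) * CHUNK for i in range(8))
tonight = last_night[:CHUNK] + b"\xff" * CHUNK + last_night[2 * CHUNK:]  # one chunk changed

inline_store, post_store = {}, {}
store_unique(last_night, inline_store)  # both appliances already hold last night's backup
store_unique(last_night, post_store)

print("inline writes:", inline_disk_writes(tonight, inline_store))            # 4096
print("post process writes:", post_process_disk_writes(tonight, post_store))  # 36864
```

Both approaches end up holding the same unique chunks, but the post process appliance has to land and then re-read the full backup before it gets there.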