Data Deduplication
What is Data Deduplication?
Data deduplication is a technique that eliminates redundant copies of data so that only one unique instance is actually retained on disk; every subsequent copy is replaced with a reference to that single instance. For example, if the same 1 MB attachment sits in 100 users' mailboxes, a deduplicated store keeps one 1 MB copy plus 99 references to it instead of 100 MB of identical data.
Why is Data Deduplication Needed?
- In general, data deduplication improves data protection, increases the speed of service, and reduces costs.
- Lower storage space requirements save money on disk expenditure.
- More efficient use of disk space also allows for longer disk retention periods, which means better recovery time objectives (RTO) over a longer window and less need for tape backups.
- Data deduplication also reduces the amount of data that must be sent across a WAN for remote backups, replication, and disaster recovery.
How does Data Deduplication work?
- Data deduplication can generally operate at the file, block, or even the bit level.
- File-level deduplication eliminates duplicate files (as in the example above), but it is not a very efficient means of deduplication.
- Block- and bit-level deduplication look within a file and save unique iterations of each block or bit; a sketch contrasting the two approaches follows this list.
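To make the difference concrete, here is a minimal Python sketch (illustrative only, not any vendor's implementation) that deduplicates the same pair of files first at the file level and then at a fixed 4 KB block level. Two large files differing by a single byte share nothing at the file level but almost everything at the block level:

```python
import hashlib
import os

BLOCK_SIZE = 4096  # illustrative fixed block size

def file_level_dedup(files: dict[str, bytes]) -> int:
    """Store each unique *file* once; files must match byte-for-byte.
    Returns bytes saved. Hash identity stands in for full comparison."""
    stored: dict[str, str] = {}
    saved = 0
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest in stored:
            saved += len(data)      # the whole file is a duplicate
        else:
            stored[digest] = name
    return saved

def block_level_dedup(files: dict[str, bytes]) -> int:
    """Store each unique fixed-size *block* once; far finer-grained."""
    stored: set[str] = set()
    saved = 0
    for data in files.values():
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest in stored:
                saved += len(block)  # this block already exists
            else:
                stored.add(digest)
    return saved

# Two 1 MiB files that differ in their very last byte.
a = os.urandom(1024 * 1024)
b = a[:-1] + b"\x01"
files = {"a.bin": a, "b.bin": b}
print(file_level_dedup(files))   # 0: the files are not identical
print(block_level_dedup(files))  # 1044480: every block but the changed one is shared
```

The gap only widens with real workloads, which is why block-level deduplication generally reclaims far more space than whole-file matching.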
Deduplication in a NetApp Environment
This picture gives a high-level overview of data before and after the deduplication process. Boxes of the same colour denote blocks holding the same data. Before deduplication, the duplicate blocks are written to different areas of the hard disk; once the deduplication process runs, it identifies all duplicate blocks and removes them, so that only unique blocks of data remain on the volume.
As stated before, the deduplication process runs at the storage level, so no configuration is required on the application side, and applications keep accessing the data as before. Because the system creates the fingerprints while new data is being written, the performance impact on your system is negligible; however, if your filer is heavily utilized and constantly runs above 50% utilization, the performance impact averages around 15%.
Under the hood
Whenever new data is written to a FlexVol volume that has A-SIS enabled (A-SIS is the NetApp term for deduplication), Data ONTAP creates a fingerprint for every block it writes, to be used later for comparison. At this point the system writes all the data just like any other system, except that it records some extra information about your data, namely a fingerprint for every block. You then either start the deduplication process manually or schedule it to run at a specific time. Once the deduplication process starts, the fingerprints are checked for duplicates; when a match is found, a byte-by-byte comparison of the blocks is done first to make sure the blocks are indeed identical, and only if they are is the duplicate block's pointer updated to point at the already existing data block and the new (duplicate) data block released. A minimal sketch of this flow follows.
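The Python sketch below models that flow end to end: writes append a per-block fingerprint to a log, and a later dedup pass matches fingerprints, verifies candidates byte-for-byte, repoints the duplicate reference, and releases the block. All names and structures here are illustrative assumptions, not Data ONTAP internals:

```python
import hashlib

BLOCK_SIZE = 4096

class DedupVolume:
    """Toy model of post-process (scheduled) block deduplication."""

    def __init__(self):
        self.blocks = {}        # physical block number -> data
        self.pointers = []      # logical block -> physical block number
        self.fingerprints = []  # (fingerprint, physical block number) log
        self.next_pbn = 0

    def write(self, data: bytes) -> None:
        # Writes proceed as on any other system; the only extra work is
        # recording one fingerprint per block, which is why the inline
        # overhead is small.
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            pbn = self.next_pbn
            self.next_pbn += 1
            self.blocks[pbn] = block
            self.pointers.append(pbn)
            self.fingerprints.append((hashlib.md5(block).hexdigest(), pbn))

    def run_dedup(self) -> int:
        """The manually started or scheduled pass. Returns blocks freed."""
        freed = 0
        keeper_by_fp = {}
        for fp, pbn in self.fingerprints:
            if pbn not in self.blocks:       # already released earlier
                continue
            keeper = keeper_by_fp.get(fp)
            if keeper is None:
                keeper_by_fp[fp] = pbn       # first block with this fingerprint
                continue
            # Fingerprint collisions are possible, so confirm the blocks
            # really are identical with a byte-by-byte comparison.
            if self.blocks[pbn] == self.blocks[keeper]:
                # Repoint every logical reference at the existing block,
                # then release the duplicate.
                self.pointers = [keeper if p == pbn else p for p in self.pointers]
                del self.blocks[pbn]
                freed += 1
        return freed

vol = DedupVolume()
vol.write(b"A" * BLOCK_SIZE * 3)   # three identical blocks
vol.write(b"B" * BLOCK_SIZE)       # one unique block
print(vol.run_dedup())             # 2: the duplicate 'A' blocks are released
```

Note the ordering in run_dedup: the cheap fingerprint match only nominates candidates, and the byte-by-byte comparison is what actually guarantees no data is lost to a hash collision.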
Deduplication in an EMC Environment
What follows is just a high-level overview of deduplication at EMC, as they added a dedupe function to their Celerra range of products only in January 2009.
EMC has deployed deduplication on its newer Celerra models at the file level, in conjunction with compression technology. Because it works at the file level, the deduplication is very fast, but it yields only small savings, since two or more files must be completely identical to be dedupe candidates. The compression technology EMC uses provides an additional level of space savings, and because it uses spare CPU cycles from the system, you don't have to invest in expensive specialized compression products. However, even with deduplication working alongside compression, it still gives less storage savings than NetApp's fixed-block-level deduplication technology. How? Here are the details (a sketch of these eligibility rules follows the list):
- Because it works at the file level and files must match exactly to be deduplicated, vmdk files and any LUNs created on the storage are never deduplicated.
- It targets only infrequently accessed files, as compressing active files is not a good idea.
- By default, any file larger than 200 MB is left untouched.
- Compression works only on files larger than 24 KB.
- It disables MPFS.
- There is a performance impact when reading deduplicated and compressed files.
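Putting those rules together, here is a hedged Python sketch of a Celerra-style file-level pass. The 200 MB and 24 KB thresholds come from the list above; the 30-day "infrequently accessed" cutoff and all names are assumptions for illustration:

```python
import hashlib
import time
import zlib
from dataclasses import dataclass

MAX_DEDUP_SIZE = 200 * 1024 * 1024   # files above 200 MB are left untouched
MIN_COMPRESS_SIZE = 24 * 1024        # compression applies only above 24 KB
COLD_AGE_SECONDS = 30 * 24 * 3600    # assumed "infrequently accessed" cutoff

@dataclass
class File:
    name: str
    data: bytes
    last_access: float               # epoch seconds

def celerra_style_pass(files: list[File]) -> int:
    """File-level dedup plus compression with the eligibility rules
    listed above. Returns bytes saved."""
    saved = 0
    seen: dict[str, str] = {}        # whole-file hash -> first file name
    now = time.time()
    for f in files:
        if len(f.data) > MAX_DEDUP_SIZE:
            continue                 # too large: skipped entirely
        if now - f.last_access < COLD_AGE_SECONDS:
            continue                 # recently accessed: not a candidate
        digest = hashlib.sha256(f.data).hexdigest()
        if digest in seen:
            saved += len(f.data)     # exact duplicate of an earlier file
        elif len(f.data) > MIN_COMPRESS_SIZE:
            # Unique but cold and large enough: compress in place,
            # trading spare CPU cycles for extra space savings.
            saved += len(f.data) - len(zlib.compress(f.data))
            seen[digest] = f.name
        else:
            seen[digest] = f.name    # unique and small: stored as-is
    return saved
```

Note how two nearly identical VMDK files would never deduplicate under this scheme: a whole-file match is required, and any file over 200 MB is skipped before it is even hashed.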
| Functionality | Celerra Data Deduplication | NetApp Data Deduplication |
| --- | --- | --- |
| User interface | GUI: a simple graphical user interface with a one-click operation to enable | CLI only; cryptic commands with limited flexibility |
| Compression | Additional space savings on duplicate plus unique data | NetApp does not offer compression; this makes EMC more efficient at saving space |
| Maximum file system size | EMC supports a 16 TB file system size across the entire Celerra unified storage series (NX4 through the NS-960) | NetApp limits the file system size based on the Filer model: from only 1 TB on the FAS2020 up to a maximum of 16 TB on the FAS6080 |
| Integration with snapshots | Celerra snaps do not negatively affect deduplication space savings in production file systems; space savings can be realized immediately | NetApp will not achieve space savings from deduplication on any data that is currently part of a snapshot |