Tuesday, May 26, 2009

What is Deduplication?

What is Data Deduplication?


Data deduplication essentially refers to the elimination of redundant data. In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored. However, an index of all the data is still retained should that data ever be required. Deduplication reduces the required storage capacity because only unique data is stored.

For example, a typical email system might contain 100 instances of the same one-megabyte (MB) file attachment. If the email platform is backed up or archived, all 100 instances are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; each subsequent instance is just a reference back to the one saved copy. In this example, a 100 MB storage demand is reduced to only 1 MB.
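To make the idea concrete, here is a minimal sketch of file-level deduplication in Python. It is only an illustration of the attachment example above, not how any particular email archiver works; the `archive_attachment` function and the in-memory dictionaries are hypothetical.

```python
import hashlib

# Toy illustration of file-level deduplication: identical attachments are
# stored once, and every message keeps only a reference (the content hash).
store = {}          # content hash -> attachment bytes (stored once)
references = []     # one entry per message, pointing into the store

def archive_attachment(data: bytes) -> str:
    digest = hashlib.sha1(data).hexdigest()
    if digest not in store:          # first copy: actually store the bytes
        store[digest] = data
    references.append(digest)        # every copy: store only the reference
    return digest

attachment = b"x" * (1024 * 1024)    # the same 1 MB attachment
for _ in range(100):                 # sent to 100 recipients
    archive_attachment(attachment)

print(len(references), "references,", len(store), "stored copy,",
      sum(len(v) for v in store.values()) // 1024, "KB on disk")
```

Running this prints 100 references but only one stored copy of roughly 1024 KB, which is the 100 MB-to-1 MB reduction described above.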

The Need for Data Deduplication

- In general, data deduplication improves data protection, increases the speed of service, and reduces costs.

- Lower storage space requirements save money on disk expenditures.

- More efficient use of disk space also allows for longer disk retention periods, which provides better recovery time objectives (RTO) for a longer period and reduces the need for tape backups.

- Data deduplication also reduces the data that must be sent across a WAN for remote backups, replication, and disaster recovery.


How Does Data Deduplication Work?

- Data deduplication can generally operate at the file, block, or even the bit level.

- File-level deduplication eliminates duplicate files (as in the example above), but this is not a very efficient means of deduplication.

- Block- and bit-level deduplication look within a file and save unique iterations of each block or bit.

Each chunk of data is processed using a hash algorithm such as MD5 or SHA-1. This process generates a unique number for each piece, which is then stored in an index. If a file is updated, only the changed data is saved: if only a few bytes of a document or presentation change, only the changed blocks or bytes are stored rather than an entirely new file. This behavior makes block and bit deduplication far more efficient. However, block and bit deduplication take more processing power and use a much larger index to track the individual pieces.
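The sketch below shows the block-level version of the same idea: split data into chunks, hash each chunk, and keep an index of unique chunks. The fixed 4 KB block size and the in-memory index are assumptions made purely for illustration; real products vary in how they chunk and store the index.

```python
import hashlib

BLOCK_SIZE = 4096      # illustrative fixed block size (an assumption)
index = {}             # block hash -> physical block (each unique block stored once)

def dedupe(data: bytes) -> list:
    """Split data into fixed-size blocks and return the list of block hashes;
    the 'file' becomes a sequence of references into the index."""
    refs = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha1(block).hexdigest()
        if digest not in index:        # unseen block: store it
            index[digest] = block
        refs.append(digest)            # seen before: reference only
    return refs

original = b"A" * 40960                                      # 10 identical blocks
edited   = original[:4096] + b"B" * 4096 + original[8192:]   # one block changed

dedupe(original)
dedupe(edited)
# Only 2 unique blocks are stored (the "A" block and the "B" block), even
# though the two files together are 80 KB logically.
print(len(index), "unique blocks stored")
```

Note how updating a single block adds just one new entry to the index, which is exactly why block-level deduplication handles small edits so much more efficiently than file-level deduplication.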

Deduplication in a NetApp Environment

NetApp has implemented its deduplication function at a fixed block (4 KB) level, which gives greater space savings and works very efficiently. Because it works at the block level, irrespective of file type or data format, you can dedupe any type of file over either CIFS or NFS; you can even dedupe a LUN of any size, no matter where its blocks are written in a volume.

This picture gives a high-level overview of data before and after the deduplication process. Boxes of the same colour denote blocks containing identical data. Before deduplication, the duplicate blocks were written to different areas of the disk; once the deduplication process runs, it identifies all duplicate blocks and removes them, so only unique blocks of data remain on the volume.

As stated before, the deduplication process runs at the storage level, so no configuration is required on the application side; applications keep accessing the data as before. Because the system creates the fingerprints while writing new data, the performance impact on your system is negligible. However, if your filer is heavily utilized and constantly above 50% utilization, the performance impact averages around 15%.

Under the hood

Whenever new data is written to a FlexVol that has A-SIS enabled (A-SIS is the NetApp term for deduplication), Data ONTAP creates a fingerprint for every block it writes, for later comparison. At this point the system writes all the data just like any other system, except that it records some extra information: a fingerprint for every block. You then either start the deduplication process manually or schedule it to run at a specific time. Once the deduplication process starts, the fingerprints are checked for duplicates; when a match is found, a byte-by-byte comparison of the blocks is done first to make sure the blocks really are identical. If they are, the block's pointer is updated to point at the existing data block and the new (duplicate) block is released.
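The two-phase flow described above can be sketched roughly as follows. This is not ONTAP code or its on-disk structures, just a toy model under assumed data structures (a fingerprint table, a block list, and per-file pointers) to show fingerprinting at write time and the deferred dedupe pass with its byte-by-byte check.

```python
import hashlib

BLOCK = 4096                 # fixed 4 KB blocks, as described above
fingerprints = {}            # fingerprint -> block number of the first copy seen
blocks = []                  # "disk": physical blocks, indexed by block number
pointers = []                # "file": logical position -> physical block number

def write_block(data: bytes):
    """Phase 1: write the block normally, but also record its fingerprint."""
    blocks.append(data)
    pointers.append(len(blocks) - 1)
    fingerprints.setdefault(hashlib.md5(data).hexdigest(), len(blocks) - 1)

def run_dedupe():
    """Phase 2 (manual or scheduled): for each block whose fingerprint matches
    an earlier block, verify byte-by-byte, then repoint and release it."""
    freed = 0
    for logical, phys in enumerate(pointers):
        first = fingerprints[hashlib.md5(blocks[phys]).hexdigest()]
        if first != phys and blocks[first] == blocks[phys]:   # byte-by-byte check
            pointers[logical] = first                         # update the pointer
            blocks[phys] = None                               # release the duplicate
            freed += 1
    return freed

for _ in range(5):
    write_block(b"\xAA" * BLOCK)          # five identical blocks written normally
print(run_dedupe(), "blocks freed")       # the later dedupe pass frees four of them
```

The important point the sketch illustrates is that writes are not slowed by comparisons; only the cheap fingerprint is recorded inline, and the expensive matching and verification happen later, when the dedupe job runs.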

The maximum sharing for a block is 255. This means, for example, that if there are 500 duplicate blocks, deduplication would reduce that to only 2 blocks. Also note that this ability to share blocks is different from the ability to keep 255 Snapshot copies for a volume.
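A quick back-of-the-envelope check of that figure, using only the 255-reference limit stated above:

```python
import math

MAX_SHARE = 255          # one physical block can be referenced at most 255 times
duplicates = 500
physical = math.ceil(duplicates / MAX_SHARE)   # 500 / 255 rounds up to 2
print(physical)          # 2 physical blocks: one shared 255 times, one 245 times
```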

Deduplication in an EMC Environment

This is just a high-level overview of deduplication at EMC, as they added a dedupe function to their Celerra range of products only in January 2009.

EMC has deployed deduplication on its newer Celerra models at the file level, in conjunction with compression technology. Because it works at the file level, deduplication is very fast, but it yields only small savings, since two or more files must be identical to be dedupe candidates. The compression technology EMC uses provides an additional level of space savings, and because it uses spare CPU cycles on the system, you don't have to invest in expensive specialized compression products. However, even with deduplication and compression combined, it delivers less storage savings than NetApp's fixed-block deduplication technology. How? Here are the details (a rough sketch of these eligibility rules follows the list):

- Because it works at the file level and files need to be an exact match to be deduplicated, VMDK files and any LUNs created on the storage are not deduplicated.

- It targets only infrequently accessed files, as compressing active files is not a good idea.

- By default, any file larger than 200 MB is left untouched.

- Compression works only on files larger than 24 KB.

- It disables MPFS.

- There is a performance impact when reading deduplicated and compressed files.
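The sketch below models those eligibility rules: whole-file hashing, the 200 MB skip threshold, and the 24 KB compression floor are taken from the list above, while the 30-day "infrequently accessed" cutoff is purely an assumption for illustration. This is not Celerra's implementation, just a way to see why file-level rules exclude so much data.

```python
import hashlib
import time
import zlib

MAX_FILE = 200 * 1024 * 1024      # files larger than 200 MB are left untouched
MIN_COMPRESS = 24 * 1024          # compression applies only above 24 KB
COLD_AGE = 30 * 24 * 3600         # "infrequently accessed": assumed 30-day cutoff

stored = {}                        # whole-file hash -> compressed (or raw) content

def maybe_dedupe(data: bytes, last_access: float) -> bool:
    """Return True if the file was deduplicated/compressed, False if skipped."""
    if len(data) > MAX_FILE:                       # too big: skip entirely
        return False
    if time.time() - last_access < COLD_AGE:       # recently accessed: skip
        return False
    digest = hashlib.sha1(data).hexdigest()        # the whole file must match exactly
    if digest not in stored:
        payload = zlib.compress(data) if len(data) > MIN_COMPRESS else data
        stored[digest] = payload
    return True
```

Because the match is on the whole file, a VMDK or LUN that differs by even one block never qualifies, which is the key contrast with the block-level approach described earlier.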

Functionality comparison: Celerra Data Deduplication vs. NetApp Data Deduplication

User interface
- Celerra: GUI; a simple graphical user interface with a one-click operation to enable it
- NetApp: CLI only; cryptic commands with limited flexibility

Compression
- Celerra: additional space savings on duplicate PLUS unique data
- NetApp: NetApp does not offer compression; this makes EMC more efficient in saving space

Unlimited file system size
- Celerra: EMC supports a 16 TB file system size across the entire Celerra unified storage series (NX4 through the NS-960)
- NetApp: NetApp limits the file system size based on the filer model; the FAS2020 supports only 1 TB, up to a maximum of 16 TB for the FAS6080

Integrated with snaps
- Celerra: Celerra snaps do not negatively affect deduplication space savings in production file systems; space savings can be realized immediately
- NetApp: NetApp will not achieve space savings from deduplication on any data that is currently part of a Snapshot copy