Tuesday, July 6, 2010

Data Compression, Deduplication and Single Instance Storage

Single Instance Storage

'Single Instance Storage' is sometimes referred to as 'File Level Deduplication'. It is the ability of a file system (or data storage container) to identify two or more identical files, retain the multiple external references to the file, and store only a single copy on disk.
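To make the idea concrete, here is a minimal sketch, not any vendor's implementation: a store that keeps one physical copy per unique file and any number of logical references to it. The class name, the use of SHA-256 fingerprints, and the example paths are illustrative assumptions only.

import hashlib

class SingleInstanceStore:
    def __init__(self):
        self.blobs = {}       # content digest -> file contents (stored once)
        self.references = {}  # logical path -> content digest

    def put(self, path, data):
        digest = hashlib.sha256(data).hexdigest()
        # Store the content only if an identical file is not already present.
        self.blobs.setdefault(digest, data)
        self.references[path] = digest

    def get(self, path):
        return self.blobs[self.references[path]]

store = SingleInstanceStore()
store.put("/mail/alice/report.pdf", b"...identical attachment bytes...")
store.put("/mail/bob/report.pdf", b"...identical attachment bytes...")
assert len(store.blobs) == 1   # two references, one physical copy on disk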

Many of us have used or accessed technologies which leverage 'Single Instance Storage', as it has been the primary storage savings technology in Microsoft Exchange Server 5.5, 2000, 2003, and 2007. If you are familiar with Exchange Server you probably recall that the ability of 'Single Instance Storage' to reduce file redundancy is limited to the content within a single Exchange database (or mailstore). In other words, multiple copies of a file may exist across databases, but each individual database will only maintain a single copy.

Are you aware that Exchange Server 2010 has discontinued support for 'Single Instance Storage'? It seems Microsoft has left it to the storage vendors to provide capacity savings.

Data Deduplication

'Data Deduplication' is best described as block-level, or sub-file, deduplication: the ability to reduce the redundancy in two or more files which are not identical. Historically the storage and backup industries have used the term 'Data Deduplication' specifically to mean the reduction of data at the sub-file level. I'm sure many of you use technologies which include 'Data Deduplication', such as systems from NetApp, Data Domain, or Sun Microsystems.

With 'Data Deduplication', data is stored in the same format as if it were not deduplicated, with the exception that multiple files share storage blocks between them. This design allows the storage system to serve data without any additional processing prior to transferring the data to the requesting host.
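A rough sketch of the mechanism follows, with fixed 4 KB blocks and SHA-256 fingerprints assumed purely for illustration: two files that are only partly alike still share the blocks they have in common, and reads simply stitch the referenced blocks back together with no extra transformation.

import hashlib

BLOCK_SIZE = 4096

class BlockDedupStore:
    def __init__(self):
        self.blocks = {}  # block digest -> block data (stored once)
        self.files = {}   # file name -> ordered list of block digests

    def write(self, name, data):
        digests = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # keep one copy per unique block
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        # Data is served as-is; no transformation is needed on the read path.
        return b"".join(self.blocks[d] for d in self.files[name])

store = BlockDedupStore()
common = b"A" * 4096 + b"B" * 4096        # two distinct blocks shared by both files
store.write("vm1.vmdk", common + b"C" * 4096)
store.write("vm2.vmdk", common + b"D" * 4096)
print(len(store.blocks))                  # 4 unique blocks stored instead of 6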

In summary, 'Data Deduplication' is an advanced form of 'Single Instance Storage'. It exceeds the storage savings provided by 'Single Instance Storage' by deduplicating both identical and dissimilar data sets.

Data Compression

Probably the most mature technology of the bunch is 'Data Compression'. I'm sure we are all familiar with this technology, as we use it every day when transferring files (à la WinZip), or may even have dabbled with NTFS compression on some of our Windows systems.

In the example below we have two virtual machines, each running the same guest operating system, yet each a unique object in its security realm and storing a dissimilar data set. This example represents common deployments of VMware, KVM, Hyper-V, etc. With 'Data Compression' the data comprising the VMs is rewritten into a dense format on the array. There is no requirement for the data to be common between any objects.

As compressed data is not stored in a format directly accessible to the requesting host, it falls to the storage controller to decompress the data before serving it to a host. This process adds latency to storage operations.
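The sketch below illustrates that read-path cost in the abstract, using Python's zlib module as a stand-in for whatever algorithm a real array would use; nothing here reflects any vendor's actual implementation.

import zlib

class CompressedStore:
    def __init__(self):
        self.segments = {}  # logical name -> compressed bytes

    def write(self, name, data):
        # Data is rewritten into a denser format before it hits disk.
        self.segments[name] = zlib.compress(data)

    def read(self, name):
        # Extra work (and latency) on every read: decompress before serving.
        return zlib.decompress(self.segments[name])

store = CompressedStore()
payload = b"guest OS files compress well because they repeat " * 200
store.write("vm1.vmdk", payload)
print(len(store.segments["vm1.vmdk"]), "bytes stored for", len(payload), "logical bytes")
assert store.read("vm1.vmdk") == payload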

Many of you may be surprised to know that NetApp arrays provide both 'Data Deduplication' and 'Data Compression'. I'll share more on the latter in my next post; however, relative to this discussion I can share with you that while we see performance increases with 'Data Deduplication', 'Data Compression' does add an additional performance tax to the storage system.

Note that these technologies are mutually inclusive, so compressed data sets gain the advantage of TSCS to help offset the performance tax.

In summary, 'Data Compression' is a stalwart of storage savings technologies which can provide savings unavailable with 'Single Instance Storage' or 'Data Deduplication'. Because of the performance tax of 'Data Compression', one should restrict its usage to data archives and NAS file services.

Wrapping Up This Post

Storage savings technologies are all the rage in the storage and backup industries. While every vendor has their own set of capabilities, it is in the best interest of any architect, administrator, or manager of data center operations to have a clear understanding of which technology will benefit which data sets before enabling these technologies. Saving storage while impeding the performance of a production environment is a sure-fire means to updating one's resume.

Suffice it to say these technologies are here, and they are reshaping our data centers. I hope this post will help you to better understand what your storage vendor means when he or she states that they offer 'deduplication'.


Source: http://blogs.netapp.com/virtualstorageguy/2010/06/data-compression-deduplication-single-instance-storage.html
