With the advancement of cloud computing technologies, both personal and business users store more and more data in consolidated data centers that can be accessed from anywhere using computers or smart devices. Multiple users may upload identical or similar content, which results in a large amount of duplicated data in the data center. Besides cloud services, emerging virtualization technologies allow hundreds of virtual machines to run on one physical machine, which must then store many copies of similar operating systems and applications. Traditional data storage systems are not able to fully exploit such data redundancy. This dissertation presents a new approach that identifies similar data blocks and stores them in compact formats to improve the performance of the storage system.
A histogram-based signature is proposed to capture the similarities between data blocks if their contents are similar or shifted. Similar data blocks are clustered into the same group based on their signatures. Furthermore, a heatmap algorithm is designed to find the most popular block among similar blocks considering both temporal and content localities of data blocks. Finally, a high-speed delta coding algorithm is developed to compress similar blocks into small deltas.
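To make the idea of a histogram-based signature concrete, here is a minimal sketch. The abstract does not say how the signature is built, so the construction below (the `k` most frequent byte values of a block) and the names `histogram_signature` and `group_by_signature` are illustrative assumptions, not the dissertation's algorithm. It shows only why a histogram tolerates shifted content: moving bytes around inside a block does not change how often each byte value occurs.

```python
from collections import Counter


def histogram_signature(block: bytes, k: int = 4) -> tuple:
    """Return the k most frequent byte values of a block.

    Assumption for illustration: ties are broken by byte value so the
    result is deterministic. The real signature may be built differently.
    """
    counts = Counter(block)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return tuple(value for value, _ in ranked[:k])


def group_by_signature(blocks):
    """Cluster block indices that share the same signature."""
    groups = {}
    for index, block in enumerate(blocks):
        groups.setdefault(histogram_signature(block), []).append(index)
    return groups
```

A block and a rotated copy of it have identical byte histograms, so both land in the same group, while a block with different dominant byte values lands in another.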
The proposed approach leverages a flash-memory-based solid-state disk (SSD) to store a single copy, the reference, for many redundant data blocks. Other similar blocks are stored as small deltas that refer to the reference block on the SSD. Compared to conventional magnetic hard disks, flash-based SSDs are orders of magnitude faster in terms of latency. Thus the reference block stored on the SSD can be retrieved quickly, and I/O requests to other similar blocks can be served by combining the corresponding deltas with the reference block, avoiding slow hard disk accesses.
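The reference-plus-delta read path can be sketched as follows. This is a simplified illustration, assuming equal-sized blocks and a delta made of the byte runs that differ from the reference; the function names `make_delta` and `apply_delta` are hypothetical, and the dissertation's high-speed delta coding algorithm may differ substantially.

```python
def make_delta(reference: bytes, block: bytes):
    """Encode a block as (offset, bytes) runs that differ from the reference.

    Assumption for illustration: both blocks have the same length.
    """
    if len(reference) != len(block):
        raise ValueError("reference and block must have equal length")
    delta = []
    i, n = 0, len(block)
    while i < n:
        if reference[i] != block[i]:
            start = i
            # Extend the run while the bytes keep differing.
            while i < n and reference[i] != block[i]:
                i += 1
            delta.append((start, block[start:i]))
        else:
            i += 1
    return delta


def apply_delta(reference: bytes, delta) -> bytes:
    """Rebuild a block by patching the differing runs onto the reference."""
    out = bytearray(reference)
    for offset, data in delta:
        out[offset:offset + len(data)] = data
    return bytes(out)
```

In this sketch a read of a similar block fetches the reference (which the proposed design keeps on the SSD) and applies a small delta, so the full block never has to be read from the hard disk.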
Two prototypes of the proposed data storage system have been implemented, one as part of the Linux kernel virtual machine monitor and the other as a Linux device driver. Numerical results on standard benchmarks show an order-of-magnitude improvement of the new storage system over existing disk I/O architectures such as RAID and SSD/HDD storage hierarchies.
The last part of this dissertation presents a block-level versioning system that is able to recover to any point in time in the past. The versioning system is independent of the operating system because it uses a network storage protocol. Version creation, log maintenance, and version recovery are done at the storage target to offload the versioning overhead from application servers. Experiments on Linux, Windows, and Solaris have demonstrated that the new versioning system allows users to recover selected files with much smaller metadata cost compared to existing file system versioning systems.
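Point-in-time recovery at the block level can be illustrated with a timestamped write log that is replayed up to the requested time. This is a sketch under stated assumptions: the class `VersionedVolume`, its in-memory log, and the integer timestamps are hypothetical and are not taken from the dissertation, which performs this work at the storage target behind a network storage protocol.

```python
class VersionedVolume:
    """Toy block volume that logs every write so any past state can be rebuilt."""

    def __init__(self, num_blocks: int, block_size: int = 4):
        self.block_size = block_size
        # Initial image: all blocks zero-filled.
        self.base = [bytes(block_size) for _ in range(num_blocks)]
        # Log entries are (timestamp, block_number, data), in time order.
        self.log = []

    def write(self, timestamp: int, block_number: int, data: bytes) -> None:
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        if self.log and timestamp < self.log[-1][0]:
            raise ValueError("writes must arrive in timestamp order")
        self.log.append((timestamp, block_number, data))

    def recover(self, timestamp: int):
        """Return the volume image as it was at the given time."""
        image = list(self.base)
        for ts, block_number, data in self.log:
            if ts > timestamp:
                break
            image[block_number] = data
        return image
```

Because the log lives with the storage layer rather than the file system, the same mechanism applies regardless of which operating system issued the writes.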
|Committee:||Fay-Wolfe, Vic; Sendag, Resit|
|School:||University of Rhode Island|
|School Location:||United States -- Rhode Island|
|Source:||DAI-B 72/06, Dissertation Abstracts International|
|Subjects:||Computer Engineering, Computer science, Condensed matter physics|
|Keywords:||Cache, Content locality, Data storage, Recovery, Solid state disks|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved