SECURE DATA DEDUPLICATION TO IMPROVE THE PERFORMANCE OF PRIVACY: REVIEW

SHILPI RAI

Research Scholar, IMEC Sagar, MP, India

PROF. MUKESH ASATI

Asst. Prof., IMEC Sagar, MP, India

Abstract: Recent studies have shown that moderate to high data redundancy clearly exists in primary storage systems in the Cloud. Our experimental studies reveal that data redundancy exhibits a much higher level of intensity on the I/O path than on disks, due to the relatively high temporal access locality associated with small I/O requests to redundant data. Moreover, directly applying data deduplication to primary storage systems in the Cloud will likely cause space contention in memory and data fragmentation on disks.

This review paper has been compiled from different research literature.

1. INTRODUCTION

Data deduplication has been demonstrated to be an effective technique in Cloud backup and archiving applications to reduce the backup window and to improve storage-space efficiency and network bandwidth utilization. Recent studies reveal that moderate to high data redundancy clearly exists in virtual machine (VM), enterprise and high-performance computing (HPC) storage systems. These studies have shown that by applying the data deduplication technology to large-scale data sets, an average space saving of 30 percent can be achieved, with up to 90 percent in VM and 70 percent in HPC storage systems. For example, the time for live VM migration in the Cloud can be significantly reduced by adopting the data deduplication technology. The existing data deduplication schemes for primary storage, such as iDedup and Offline-Dedupe, are capacity oriented in that they focus on storage capacity savings and only select the large requests to deduplicate, bypassing all the small requests (e.g., 4 KB, 8 KB or less). The rationale is that the small I/O requests account for only a tiny fraction of the storage capacity requirement, making deduplication on them unprofitable and potentially counterproductive considering the substantial deduplication overhead involved. However, previous workload studies have revealed that small files dominate in primary storage systems (more than 50 percent) and are at the root of the system performance bottleneck.
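To make the block-level deduplication idea above concrete, the following minimal Python sketch (illustrative only; the block size, hash choice, and the FingerprintIndex name are our assumptions, not taken from any of the surveyed systems) shows how a write path can detect redundant blocks by fingerprinting their content and consulting an index before writing.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size (4 KB), typical of the small requests discussed above


class FingerprintIndex:
    """Toy content-fingerprint index: maps a block's hash to the address of its first copy."""

    def __init__(self):
        self._by_hash = {}      # fingerprint -> physical block address
        self._next_addr = 0     # next free physical block
        self.store = {}         # physical address -> block contents (stands in for the disk)

    def write_block(self, data: bytes) -> int:
        """Write one block, returning the physical address that now holds its content."""
        fp = hashlib.sha256(data).hexdigest()
        if fp in self._by_hash:
            # Duplicate content: reference the existing copy instead of writing it again.
            return self._by_hash[fp]
        addr = self._next_addr
        self._next_addr += 1
        self.store[addr] = data
        self._by_hash[fp] = addr
        return addr


if __name__ == "__main__":
    idx = FingerprintIndex()
    a = idx.write_block(b"x" * BLOCK_SIZE)
    b = idx.write_block(b"x" * BLOCK_SIZE)   # redundant write
    c = idx.write_block(b"y" * BLOCK_SIZE)
    print(a == b, a == c)  # True False: the duplicate was coalesced, unique data was stored
```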

Furthermore, due to the buffer effect, primary storage workloads exhibit obvious I/O burstiness. From a performance perspective, the existing data deduplication schemes fail to consider these workload characteristics in primary storage systems, missing the opportunity to address one of the most important issues in primary storage, that of performance. With the explosive growth in data volume, the I/O bottleneck has become an increasingly daunting challenge for big data analytics in terms of both performance and capacity.

2. LITERATURE SURVEY

Decentralized De-duplication in SAN Cluster File Systems

File systems hosting virtual machines typically contain many duplicated blocks of data resulting in wasted storage space and increased storage array cache footprint. De-duplication addresses these problems by storing a single instance of each unique data block and sharing it between all original sources of that data.

While de-duplication is well understood for file systems with a centralized component, we investigate it in a decentralized cluster file system, specifically in the context of VM storage.

We propose DEDE, a block-level de-duplication system for live cluster file systems that does not require any central coordination, tolerates host failures, and takes advantage of the block layout policies of an existing cluster file system.

In DEDE, hosts keep summaries of their own writes to the cluster file system in shared on-disk logs. Each host periodically and independently processes the summaries of its locked files, merges them with a shared index of blocks, and reclaims any duplicate blocks. DEDE manipulates metadata using general file system interfaces without knowledge of the file system implementation. We present the design, implementation, and evaluation of our techniques in the context of VMware ESX Server. Our results show an 80% reduction in space with minor performance overhead for realistic workloads.

In this paper, we studied de-duplication in the context of decentralized cluster file systems. We have described a novel software system, DEDE, which provides block-level de-duplication of a live, shared file system without any central coordination. Furthermore, DEDE builds atop an existing file system without violating the file system's abstractions, allowing it to take advantage of regular file system block layout policies and in-place updates to unique data. Using our prototype implementation, we demonstrated that this approach can achieve up to 80% space reduction with minor performance overhead on realistic workloads.
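As a rough illustration of the decentralized scheme described above, the sketch below (our simplification; the class and method names are hypothetical) has each host append fingerprints of its own writes to a private log and later merge that log into a shared index, marking blocks whose content already appears elsewhere as reclaimable.

```python
import hashlib


class Host:
    """Each host summarizes its own writes locally, without central coordination."""

    def __init__(self, name: str):
        self.name = name
        self.write_log = []  # (virtual block id, content fingerprint) pairs

    def record_write(self, block_id: str, data: bytes) -> None:
        self.write_log.append((block_id, hashlib.sha1(data).hexdigest()))


def merge_into_shared_index(host: Host, shared_index: dict) -> list:
    """Periodically merge one host's log with the shared index; return blocks safe to reclaim."""
    reclaimable = []
    for block_id, fp in host.write_log:
        if fp in shared_index and shared_index[fp] != block_id:
            reclaimable.append(block_id)           # duplicate of an already-indexed block
        else:
            shared_index.setdefault(fp, block_id)  # first copy becomes the canonical one
    host.write_log.clear()
    return reclaimable


if __name__ == "__main__":
    shared = {}
    h1, h2 = Host("host1"), Host("host2")
    h1.record_write("vmdk1/blk0", b"same payload")
    h2.record_write("vmdk2/blk7", b"same payload")
    print(merge_into_shared_index(h1, shared))  # []
    print(merge_into_shared_index(h2, shared))  # ['vmdk2/blk7'] -> can be shared/reclaimed
```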

Singleton: System-wide Page De-duplication in Virtual Environments

We consider the problem of providing memory management in hypervisors and propose Singleton, a KVM-based system-wide page de-duplication solution to increase memory usage efficiency.

Specifically, we address the problem of double caching that occurs in KVM: the same disk blocks are cached at both the host (hypervisor) and the guest (VM) page caches. Singleton's main components are identical-page sharing across guest virtual machines and an implementation of an exclusive cache for the host and guest page-cache hierarchy. We use and improve KSM (Kernel Samepage Merging) to identify and share pages across guest virtual machines. We utilize guest memory snapshots to scrub the host page cache and maintain a single copy of a page across the host and the guests.

Singleton operates on a completely black-box assumption: we do not modify the guest or assume anything about its behaviour. We show that conventional operating system cache management techniques are sub-optimal for virtual environments, and how Singleton supplements and improves the existing Linux kernel memory management mechanisms. Singleton is able to improve the utilization of the host cache by reducing its size (by up to an order of magnitude) and increasing the cache-hit ratio (by a factor of 2). This translates into better VM performance (40% faster I/O).

Singleton's unified page de-duplication and host cache scrubbing are able to reclaim large amounts of memory and facilitate higher levels of memory overcommitment. The optimizations to page de-duplication we have implemented keep the overhead below 20% CPU utilization.

By combining inter-VM page de-duplication and host cache scrubbing, Singleton achieves unified redundancy elimination in KVM and can reclaim massive amounts of memory. Through a series of workloads under varying degrees of memory pressure, we have shown that host-cache scrubbing is a low-overhead way of implementing a host/guest exclusive cache in KVM. Our exclusive-cache implementation results in tiny host page caches (of the order of a few megabytes, as compared to several gigabytes without the scrubbing), along with improved guest performance because of better cache utilization.
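A toy version of the page-sharing side of this idea is sketched below (hypothetical names; pages are plain byte strings for brevity): memory pages from different guests with identical content are collapsed to a single shared copy, which is the effect that KSM-style merging achieves.

```python
import hashlib


def merge_identical_pages(guest_pages: dict) -> tuple:
    """Collapse identical pages across guests to one shared copy.

    guest_pages: {guest_name: [page_bytes, ...]}
    Returns (shared_store, page_table), where page_table maps
    (guest, page_index) -> key of the single shared copy.
    """
    shared_store = {}   # content hash -> page bytes (one copy per distinct content)
    page_table = {}     # (guest, index) -> content hash
    for guest, pages in guest_pages.items():
        for i, page in enumerate(pages):
            key = hashlib.sha1(page).hexdigest()
            shared_store.setdefault(key, page)
            page_table[(guest, i)] = key
    return shared_store, page_table


if __name__ == "__main__":
    guests = {
        "vm1": [b"\x00" * 4096, b"kernel text ..."],
        "vm2": [b"\x00" * 4096, b"app data"],
    }
    store, table = merge_identical_pages(guests)
    total = sum(len(p) for pages in guests.values() for p in pages)
    kept = sum(len(p) for p in store.values())
    print(f"pages kept: {len(store)} of 4, bytes: {kept} of {total}")
```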

I/O De-duplication: Utilizing Content Similarity to Improve I/O Performance

Duplication of data in storage systems is becoming increasingly common. We introduce I/O De-duplication, a storage optimization that utilizes content similarity to improve I/O performance by eliminating I/O operations and reducing the mechanical delays during I/O operations. I/O De-duplication consists of three main techniques: content-based caching, dynamic replica retrieval, and selective duplication. Each of these techniques is motivated by our observations with I/O workload traces obtained from actively used production storage systems, all of which revealed surprisingly high levels of content similarity for both stored and accessed data. Evaluation of a prototype implementation using these workloads revealed an overall improvement in disk I/O performance of 28-47% across these workloads. Further breakdown also showed that each of the three techniques contributed significantly to the overall performance improvement.

System and storage consolidation trends are driving increased duplication of data within storage systems. Past efforts have been primarily directed towards the elimination of such duplication for improving storage capacity utilization.


With I/O De-duplication, we take a contrary view: intrinsic duplication in a class of systems that are not capacity-bound can be effectively utilized to improve I/O performance, the traditional Achilles' heel of storage systems. Three techniques contained within I/O De-duplication work together to either optimize I/O operations or eliminate them altogether. An in-depth evaluation of these mechanisms revealed that together they reduced average disk I/O times by 28-47%, a large improvement that can directly impact the overall application-level performance of disk-I/O-bound systems.
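The content-based caching technique mentioned above can be illustrated with the following sketch (ours, not the authors' code): the cache is keyed by the hash of a block's content rather than by its disk address, so a read of one sector hits in the cache if a different sector with identical content was seen earlier.

```python
import hashlib
from collections import OrderedDict


class ContentAddressedCache:
    """LRU cache keyed by block content rather than by disk address."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.addr_to_fp = {}          # disk address -> content fingerprint
        self.cache = OrderedDict()    # fingerprint -> block data

    def read(self, addr: int, read_from_disk) -> bytes:
        fp = self.addr_to_fp.get(addr)
        if fp is not None and fp in self.cache:
            self.cache.move_to_end(fp)
            return self.cache[fp]                  # hit, possibly via a *different* address
        data = read_from_disk(addr)                # miss: go to disk
        fp = hashlib.sha1(data).hexdigest()
        self.addr_to_fp[addr] = fp
        self.cache[fp] = data
        self.cache.move_to_end(fp)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)
        return data


if __name__ == "__main__":
    disk = {0: b"dup", 1: b"dup", 2: b"unique"}
    reads = []
    cache = ContentAddressedCache(capacity=8)
    fetch = lambda a: (reads.append(a), disk[a])[1]
    cache.read(0, fetch)       # first access: one disk read
    # A prior write or scan could record that address 1 holds the same content:
    cache.addr_to_fp[1] = hashlib.sha1(b"dup").hexdigest()
    cache.read(1, fetch)       # served from the cache without touching the disk
    print(reads)               # [0]
```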

iDedup: Latency-aware, Inline Data De-duplication for Primary Storage

De-duplication technologies are increasingly being deployed to reduce cost and increase space efficiency in corporate data centers. However, prior research has not applied de-duplication techniques inline on the request path for latency-sensitive, primary workloads. This is primarily due to the extra latency these techniques introduce. Inherently, de-duplicating data on disk causes fragmentation that increases seeks for subsequent sequential reads of the same data, thus increasing latency. In addition, de-duplicating data requires extra disk I/Os to access on-disk de-duplication metadata. In this paper, we propose an inline de-duplication solution, iDedup, for primary workloads, while minimizing extra I/Os and seeks. Our algorithm is based on two key insights from real-world workloads: i) spatial locality exists in duplicated primary data; and ii) temporal locality exists in the access patterns of duplicated data. Using the first insight, we selectively de-duplicate only sequences of disk blocks. This reduces fragmentation and amortizes the seeks caused by de-duplication. The second insight allows us to replace the expensive, on-disk de-duplication metadata with a smaller, in-memory cache. These techniques enable us to trade off capacity savings for performance, as demonstrated in our evaluation with real-world workloads. Our evaluation shows that iDedup achieves 60-70% of the maximum de-duplication with less than a 5% CPU overhead and a 2-4% latency impact.

In this paper, we describe iDedup, an inline de-duplication system specifically targeting latency-sensitive, primary storage workloads. With latency-sensitive workloads, inline de-duplication faces many challenges: fragmentation leading to extra disk seeks for reads, de-duplication processing overheads in the critical path, and extra latency caused by I/Os for de-duplication metadata management. To counter these challenges, we derived two insights by observing real-world, primary workloads: i) there is significant spatial locality on disk for duplicated data, and ii) temporal locality exists in the accesses of duplicated blocks.
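The first insight, deduplicating only sufficiently long runs of consecutive duplicate blocks, can be sketched as follows (a simplification with our own naming and an assumed threshold; the real iDedup operates inside the file system write path and uses an in-memory fingerprint cache).

```python
import hashlib

MIN_RUN = 4  # assumed threshold: only runs of at least this many duplicate blocks are deduplicated


def select_dedup_runs(blocks, fingerprint_cache):
    """Return index ranges [(start, end), ...] of blocks worth deduplicating.

    A block is a candidate if its fingerprint is already in the in-memory cache;
    only maximal runs of at least MIN_RUN consecutive candidates are selected,
    which keeps the remaining data sequential on disk.
    """
    dup = [hashlib.sha1(b).hexdigest() in fingerprint_cache for b in blocks]
    runs, start = [], None
    for i, is_dup in enumerate(dup + [False]):       # sentinel to flush the last run
        if is_dup and start is None:
            start = i
        elif not is_dup and start is not None:
            if i - start >= MIN_RUN:
                runs.append((start, i))
            start = None
    return runs


if __name__ == "__main__":
    cache = {hashlib.sha1(bytes([i])).hexdigest() for i in range(8)}   # fingerprints seen before
    blocks = [bytes([i]) for i in [0, 1, 2, 3, 4, 99, 5, 98]]          # 5-long run, then singles
    print(select_dedup_runs(blocks, cache))   # [(0, 5)] -> short duplicate runs are bypassed
```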

Distributed Exact De-duplication for Primary Storage Infrastructures

De-duplication of primary storage volumes in a cloud computing environment is increasingly desirable, as the resulting space savings contribute to the cost effectiveness of a large-scale multi-tenant infrastructure. However, traditional archival and backup de-duplication systems impose prohibitive overhead for latency-sensitive applications deployed at these infrastructures, while current primary de-duplication systems rely on special cluster file systems, centralized components, or restrictive workload assumptions. We present DEDIS, a fully distributed and dependable system that performs exact and cluster-wide background de-duplication of primary storage. DEDIS does not depend on data locality and works on top of any unsophisticated storage backend, centralized or distributed, that exports a basic shared block device interface. The evaluation of an open-source prototype shows that DEDIS scales out and adds negligible overhead even when de-duplication and intensive storage I/O run simultaneously.

We presented DEDIS, a dependable and distributed system that performs cluster-wide off-line de-duplication across primary storage volumes. The design is fully decentralized, avoiding any single point of failure or contention, thus safely scaling out.

Also, it is compatible with any storage backend, distributed or centralized, that exports a shared block device interface.

The evaluation of a Xen-based prototype on up to 20 nodes shows that, by relying on an optimistic de-duplication algorithm and on several optimizations, de-duplication and primary I/O workloads can run simultaneously in a scalable system. In fact, DEDIS introduces less than 10% latency overhead while maintaining a baseline single-server de-duplication throughput of 4.78 MB/s with low-end hardware. This is key for performing efficient de-duplication and reducing the storage backlog of duplicates in infrastructures with scarce off-peak periods.
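The optimistic algorithm referred to above can be approximated by the sketch below (our interpretation, with hypothetical helper names): a background worker hashes blocks, proposes an alias against the shared index, and re-checks that the block was not rewritten in the meantime before freeing the duplicate, so foreground I/O never waits for deduplication.

```python
import hashlib


def offline_dedup_pass(volume, shared_index, free_block):
    """One background pass over a volume; volume maps address -> (data, version).

    The version counter would be bumped by the foreground write path. We only alias a
    block if its content and version are unchanged when we are about to free it
    (the optimistic check); otherwise we skip it and retry in a later pass.
    """
    for addr, (data, version) in list(volume.items()):
        fp = hashlib.sha1(data).hexdigest()
        canonical = shared_index.setdefault(fp, addr)
        if canonical == addr:
            continue                       # this copy is now the canonical one
        # Re-validate: the block must not have been overwritten since we hashed it.
        current_data, current_version = volume[addr]
        if current_version == version and current_data == data:
            free_block(addr, canonical)    # safe to point addr at the canonical copy


if __name__ == "__main__":
    vol = {0: (b"A", 1), 1: (b"A", 1), 2: (b"B", 1)}
    index, freed = {}, []
    offline_dedup_pass(vol, index, lambda a, c: freed.append((a, c)))
    print(freed)   # [(1, 0)]: block 1 is aliased to block 0, block 2 stays unique
```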

Exploiting Neighborhood Similarity for Virtual Machine Migration over Wide-Area Network

Conventional virtual machine (VM) migration focuses on transferring a VM's memory and CPU states across host machines. The VM's disk image has to remain accessible to both the source and destination host machines through shared storage during the migration. As a result, conventional virtual machine migration is limited to host machines on the same local area network (LAN), since sharing storage across a wide-area network (WAN) is inefficient. As datacenters are being constructed around the globe, we envision the need for VM migration across datacenter boundaries. We thus propose a system aiming to achieve efficient VM migration over the wide-area network. The system exploits similarity in the storage data of neighboring VMs by first indexing the VM storage images and then using the index to locate storage data blocks from neighboring VMs, as opposed to pulling all data from the remote source VM across the WAN. The experimental results show that the system can achieve an average 66% reduction in the amount of data transmission and an average 59% reduction in the total migration time.
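A simplified view of that block-fetch decision is sketched below (hypothetical names): for each block of the migrating VM's image, the destination first looks up the block's fingerprint in an index built over neighboring VMs' images and copies a local match over the LAN, falling back to the WAN transfer otherwise.

```python
import hashlib


def build_neighbor_index(neighbor_images):
    """Index every block of the destination-side ('neighbor') VM images by content hash."""
    index = {}
    for image in neighbor_images:
        for block in image:
            index.setdefault(hashlib.sha1(block).hexdigest(), block)
    return index


def migrate_image(source_image, neighbor_index):
    """Return the reconstructed image plus a count of blocks actually sent over the WAN."""
    wan_blocks = 0
    reconstructed = []
    for block in source_image:
        fp = hashlib.sha1(block).hexdigest()
        local = neighbor_index.get(fp)
        if local is not None:
            reconstructed.append(local)     # fetched from a neighboring VM over the LAN
        else:
            reconstructed.append(block)     # no local copy: pull it across the WAN
            wan_blocks += 1
    return reconstructed, wan_blocks


if __name__ == "__main__":
    base_os = [b"bootloader", b"libc", b"kernel"]
    neighbors = [base_os + [b"tenant-a data"]]
    source = base_os + [b"tenant-b data"]
    image, sent = migrate_image(source, build_neighbor_index(neighbors))
    print(sent, "of", len(source), "blocks crossed the WAN")   # 1 of 4
```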

Conventional virtual machine migration is limited to the LAN environment, because both the sharing and the migration of VM storage across a wide-area network (WAN) are expensive due to the amount of data in the VM storage and the limited bandwidth of the WAN. On the other hand, the adoption of cloud computing has caused active construction of datacenters around the globe. Being able to carry out VM migration across datacenter boundaries and across the WAN environment would open up new possibilities for more powerful resource utilization and fault tolerance in cloud computing.

Tango: Distributed Data Structures over a Shared Log

Distributed systems are easier to build than ever with the emergence of new, data-centric abstractions for storing and computing over massive datasets.

However, similar abstractions do not exist for storing and accessing metadata. To fill this gap, Tango provides developers with the abstraction of a replicated, in-memory data structure (such as a map or a tree) backed by a shared log. Tango objects are easy to build and use, replicating state via simple append and read operations on the shared log instead of complex distributed protocols; in the process, they obtain properties such as linearizability, persistence and high availability from the shared log. Tango also leverages the shared log to enable fast transactions across different objects, allowing applications to partition state across machines and scale to the limits of the underlying log without sacrificing consistency.
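The append/read pattern described above can be illustrated with a minimal shared-log-backed map (our sketch, not Tango's API): every mutation is appended to the shared log, and each replica materializes its in-memory state by replaying log entries it has not yet applied.

```python
class SharedLog:
    """A single append-only log shared by all replicas (stand-in for a real shared log service)."""

    def __init__(self):
        self.entries = []

    def append(self, entry) -> int:
        self.entries.append(entry)
        return len(self.entries) - 1


class LogBackedMap:
    """An in-memory map whose state is derived entirely from the shared log."""

    def __init__(self, log: SharedLog):
        self.log = log
        self.state = {}
        self.applied = 0   # index of the next log position to apply

    def put(self, key, value):
        self.log.append(("put", key, value))   # replicate by appending, not by messaging peers

    def get(self, key):
        self._sync()                           # read path: catch up on the log first
        return self.state.get(key)

    def _sync(self):
        while self.applied < len(self.log.entries):
            op, key, value = self.log.entries[self.applied]
            if op == "put":
                self.state[key] = value
            self.applied += 1


if __name__ == "__main__":
    log = SharedLog()
    replica_a, replica_b = LogBackedMap(log), LogBackedMap(log)
    replica_a.put("owner", "host-17")
    print(replica_b.get("owner"))   # host-17: replica B learned the update by replaying the log
```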

In the rush to produce better tools for distributed programming, metadata services have been left behind; it is arguably as hard to build a highly available, persistent and strongly consistent metadata service today as it was a decade earlier. Tango fills this gap with the abstraction of a data structure backed by a shared log. Tango objects are simple to build and use, relying on simple append and read operations on the shared log rather than complex messaging protocols. By leveraging the shared log to provide key properties such as consistency, persistence, elasticity, atomicity and isolation, Tango makes metadata services as easy to write as a MapReduce job or a photo-sharing website.

IDO: Intelligent Data Outsourcing with Improved RAID Reconstruction Performance in Large-Scale Data Centers

Dealing with disk failures has become an increasingly common task for system administrators in the face of high disk failure rates in large-scale data centers consisting of hundreds of thousands of disks. Thus, achieving fast recovery from disk failures in general, and high online RAID-reconstruction performance in particular, has become crucial. To address the problem, this paper proposes IDO (Intelligent Data Outsourcing), a proactive and zone-based optimization, to significantly improve online RAID-reconstruction performance. IDO moves popular data zones that are proactively identified in the normal state to a surrogate set at the onset of reconstruction. Thus, IDO enables most, if not all, user I/O requests to be serviced by the surrogate set instead of the degraded set during reconstruction.

Extensive trace-driven experiments on our lightweight prototype implementation of IDO demonstrate that, compared with the existing state-of-the-art reconstruction approaches WorkOut and VDF, IDO simultaneously speeds up reconstruction and reduces the average user response time. Moreover, IDO can be extended to improve the performance of other background RAID support tasks, such as re-synchronization, RAID reshape and disk scrubbing.

In many data-intensive computing environments, especially data centers, large numbers of disks are organized into various RAID architectures. Because of the increased error rates for individual disk drives, the dramatically increasing size of drives, and the slow growth in transfer rates, the performance of RAID during its reconstruction phase (after a disk failure) has become increasingly important for system availability. We have shown that IDO can substantially improve this performance at low cost by using the free space available in these environments. IDO proactively exploits both the temporal locality and spatial locality of user I/O requests to identify the hot data zones in the normal operational state. When a disk fails, IDO first reconstructs the lost data blocks on the failed disk belonging to the hot data zones and concurrently migrates them to a surrogate RAID set.
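A stripped-down version of the hot-zone bookkeeping described above might look as follows (the zone size and the class name are our own illustrative choices): accesses are counted per fixed-size zone during normal operation, and at reconstruction time the most popular zones are the ones rebuilt and redirected to the surrogate set first.

```python
from collections import Counter

ZONE_BLOCKS = 1024   # assumed zone size in blocks (illustrative)


class ZoneTracker:
    """Counts I/O accesses per zone during normal operation to identify hot data zones."""

    def __init__(self):
        self.hits = Counter()

    def record_access(self, block_addr: int) -> None:
        self.hits[block_addr // ZONE_BLOCKS] += 1

    def hot_zones(self, top_n: int):
        """Zones to reconstruct first and redirect to the surrogate RAID set."""
        return [zone for zone, _ in self.hits.most_common(top_n)]


if __name__ == "__main__":
    tracker = ZoneTracker()
    for addr in [5, 6, 7, 5000, 5001, 5002, 5003, 900000]:
        tracker.record_access(addr)
    print(tracker.hot_zones(top_n=2))   # [4, 0]: the busiest zones get priority
```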

3. OBJECTIVE OF THE PROJECT

With the explosive growth in data volume, the I/O bottleneck has become an increasingly daunting challenge for big data analytics in the Cloud. Recent studies have shown that moderate to high data redundancy clearly exists in primary storage systems in the Cloud. Our experimental studies reveal that data redundancy exhibits a much higher level of intensity on the I/O path than on disks, due to the relatively high temporal access locality associated with small I/O requests to redundant data. Moreover, directly applying data deduplication to primary storage systems in the Cloud will likely cause space contention in memory and data fragmentation on disks. Based on these observations, we propose a performance-oriented I/O deduplication, called POD, rather than a capacity-oriented I/O deduplication, exemplified by iDedup, to improve the I/O performance of primary storage systems in the Cloud without sacrificing the capacity savings of the latter.
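As a purely illustrative sketch of what request-based, performance-oriented selection could look like (POD's actual policies are not detailed in this review, so the criterion below is our assumption), the write path deduplicates a request, even a small one, only when its fingerprint is already resident in an in-memory cache, so that redundant small writes are removed from the I/O path without extra disk lookups.

```python
import hashlib


def handle_write(request_data: bytes, fingerprint_cache: set, write_to_disk) -> bool:
    """Return True if the write was eliminated from the I/O path.

    Unlike capacity-oriented schemes that skip small requests outright, this
    illustrative policy deduplicates any request whose fingerprint is already
    cached in memory, since those redundant small writes are what hurt performance.
    """
    fp = hashlib.sha1(request_data).hexdigest()
    if fp in fingerprint_cache:
        return True                 # redundant write: no disk I/O issued
    fingerprint_cache.add(fp)       # remember it; a later identical write becomes free
    write_to_disk(request_data)
    return False


if __name__ == "__main__":
    issued = []
    cache = set()
    for payload in [b"log record 42", b"log record 42", b"log record 43"]:
        handle_write(payload, cache, issued.append)
    print(len(issued), "of 3 writes reached the disk")   # 2 of 3
```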

Stages in SDLC:

Requirement Gathering

Analysis

Designing

Coding

Testing

Maintenance

Requirements Gathering stage:

The requirements gathering process takes as its input the goals identified in the high-level requirements section of the project plan. Each goal will be refined into a set of one or more requirements. These requirements define the major functions of the intended application, define operational data areas and reference data areas, and define the initial data entities.

Major functions include critical processes to be managed, as well as mission critical inputs, outputs and reports. A user class hierarchy is developed and associated with these major functions, data areas, and data entities. Each of these definitions is termed a Requirement.

Requirements are identified by unique requirement identifiers and, at minimum, contain a requirement title and textual description.


In the requirements stage, the RTM consists of a list of high-level requirements, or goals, by title, with a listing of associated requirements for each goal, listed by requirement title. In this hierarchical listing, the RTM shows that each requirement developed during this stage is formally linked to a specific product goal. In this format, each requirement can be traced to a specific product goal, hence the term requirements traceability.
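The goal-to-requirement linkage described above amounts to a simple traceability structure; a minimal sketch (with hypothetical field names) is shown below.

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    req_id: str
    title: str
    description: str


@dataclass
class Goal:
    goal_id: str
    title: str
    requirements: list = field(default_factory=list)   # requirements traced to this goal


def build_rtm(goals):
    """Requirements Traceability Matrix: each requirement listed under the goal it traces to."""
    return {g.title: [r.title for r in g.requirements] for g in goals}


if __name__ == "__main__":
    g1 = Goal("G1", "Reduce redundant I/O", [
        Requirement("R1.1", "Fingerprint incoming write requests", "..."),
        Requirement("R1.2", "Maintain an in-memory fingerprint cache", "..."),
    ])
    print(build_rtm([g1]))
```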

The outputs of the requirements definition stage include the requirements document, the RTM, and an updated project plan.

A feasibility study is all about identifying problems in a project.

The number of staff required to handle a project is represented as Team Formation; in this case, only modules and individual tasks will be assigned to employees who are working on that project.

Project Specifications are all about representing the various possible inputs submitted to the server and the corresponding outputs, along with the reports maintained by the administrator.

BIBLIOGRAPHY

1. A. T. Clements, I. Ahmad, M. Vilayannur, and J. Li, "Decentralized deduplication in SAN cluster file systems," in Proc. USENIX Annu. Tech. Conf., Jun. 2009, pp. 101-114.

2. K. Jin and E. L. Miller, "The effectiveness of deduplication on virtual machine disk images," in Proc. Israeli Exp. Syst. Conf., May 2009, pp. 1-12.

3. R. Koller and R. Rangaswami, "I/O Deduplication: Utilizing content similarity to improve I/O performance," in Proc. USENIX Conf. File Storage Technol., Feb. 2010, pp. 1-14.

4. D. T. Meyer and W. J. Bolosky, "A study of practical deduplication," in Proc. 9th USENIX Conf. File Storage Technol., Feb. 2011, pp. 1-14.

5. K. Srinivasan, T. Bisson, G. Goodson, and K. Voruganti, "iDedup: Latency-aware, inline data deduplication for primary storage," in Proc. 10th USENIX Conf. File Storage Technol., Feb. 2012, pp. 299-312.

6. A. El-Shimi, R. Kalach, A. Kumar, A. Oltean, J. Li, and S. Sengupta, "Primary data deduplication: Large scale study and system design," in Proc. USENIX Annu. Tech. Conf., Jun. 2012, pp. 285-296.

7. S. Al-Kiswany, M. Ripeanu, S. S. Vazhkudai, and A. Gharaibeh, "STDCHK: A checkpoint storage system for desktop grid computing," in Proc. 28th Int. Conf. Distrib. Comput. Syst., Jun. 2008, pp. 613-624.

8. D. Meister, J. Kaiser, A. Brinkmann, T. Cortes, M. Kuhn, and J. Kunkel, "A study on data deduplication in HPC storage systems," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., Nov. 2012, pp. 1-11.

9. X. Zhang, Z. Huo, J. Ma, and D. Meng, "Exploiting data deduplication to accelerate live virtual machine migration," in Proc. IEEE Int. Conf. Cluster Comput., Sep. 2010, pp. 88-96.

10. J. Lofstead, M. Polte, G. Gibson, S. Klasky, K. Schwan, R. Oldfield, M. Wolf, and Q. Liu, "Six degrees of scientific data: Reading patterns for extreme scale science IO," in Proc. 20th Int. Symp. High Perform. Distrib. Comput., Jun. 2011, pp. 49-60.
