Improving Consistency in Distributed Systems Using Controlled Data Replication


ISSN (PRINT) : 2320 – 8945, Volume -1, Issue -1, 2013


A. Jain & M. Patidar

Sanghvi Institute of Management & Science, Indore (M.P.) India E-mail : [email protected], [email protected]

Abstract – In centralized systems, all data are stored at a central location, i.e. at the server. Under heavy data access traffic at the server, the availability of data at the client side is reduced, whereas in distributed systems the problem is less severe. Data availability is improved by replicating data at multiple servers. The existence of multiple copies of the same data alleviates the traffic problem of centralized systems, but this availability is achieved at the cost of inconsistency. The need for multi-copy update to maintain consistency also arises in multiprocessors. In multiprocessors, the data in main memory may be inconsistent with the data in a cache, and the data in different caches may be inconsistent with each other. Many approaches are used to maintain consistency in multiprocessors, including different cache write policies, snooping, MESI and directory protocols. There the problem is confined to a single system with a single file system, and the machines are homogeneous, although they may or may not run the same operating system. Maintaining consistency in a heterogeneous multicomputer system is harder. The nodes of a distributed system may have local memories or may be diskless workstations. The data in the local memories of nodes may be inconsistent with the data at the server, the data at multiple servers may be inconsistent with each other, and the data in the local memories of different computers may also be inconsistent. An attempt to maintain consistency results in delayed write operations, while a reduction in the number of replicas increases traffic at the server. An optimal trade-off between the number of replicas and inconsistency is therefore sought. This paper aims at developing an effective algorithm for file replication that reduces the number of replicas, together with a corresponding algorithm for consistency. Defining the manner of replication avoids the need for an object-locating mechanism, leading to faster access. The location of the original files and their owner descriptions is kept on a repository server for easier access to files.

Keywords – Consistency, Replication, SHA-1, CHORD techniques.

I. INTRODUCTION

Nowadays, the need for heavy data download and data sharing has given a significant stimulus to distributed file sharing systems. A review of the features of recent distributed system applications yields a long list comprising redundant storage, durability, selection of nearby nodes, search and hierarchical naming. The review points to a demand for fast data access, low network failure, less congestion, cost efficiency and file availability. Replication is used to improve the performance of a distributed system by achieving all of the above objectives. Replication should also proactively reduce unnecessary replicas, leading to fewer updates and hence minimizing the overhead of consistency maintenance.

The paper proposes a hybrid consistency mechanism that uses a replication technique along with a compression and distribution mechanism with logging to overcome data loss, hot spot creation and network traffic. The mechanism may also help to create a backup in case of natural hazard or node failure. The proposed consistency mechanism uses the SHA-1 algorithm for compression and distribution of file chunks [14].

Subsequently, CHORD techniques are used for keeping track of chunk distribution and storage [15]. The same technique is also used for determining and locating required file chunks. Further, the work proposes a dedicated replication algorithm to handle traffic load and avoid system crashes. Finally, the work determines a consistency policy for an effective replication process.

II. RELATED WORK

The growth of bulk data storage and file transfer creates a demand for efficient file replication techniques for distributed systems. A distributed file system typically has two reasonably distinct components: the file service and the directory service.

The former is concerned with operations on individual files, such as reading, writing, appending and updating, whereas the latter is concerned with creating and managing directories, dividing a file into pieces, distributing them and keeping a log of each piece of a particular file.

To better understand and justify the need for file replication and consistency maintenance, the work examines compression and distribution of file chunks. It also explores keeping track of file chunks and efficient searching techniques. The work determines that SHA-1 is the most efficient and lowest-cost technique for compression and distribution of file chunks [14].

Subsequently, the work determines that CHORD techniques are efficient for logging and storage of file chunks [15].

The proposed consistency technique starts replication when it finds a hot spot at node level. For effective replication, it uses a neighbor network connected to the hot spot node and replicates the group of frequently used chunks to nodes of that network. It also transfers upcoming requests to the replica node of the neighbor network. To obtain effective consistency maintenance, the work determines the need to delete replicas of rarely used or unused file chunks, along with the original copy of unused chunks. The study of the work done so far shows that all previous works distributed the file chunks without compression. Replication of these chunks may create a bottleneck or congestion in network traffic and be a reason for node failure. Such a failure may be responsible for heavy data loss or file corruption, introducing unnecessary delay in file access.

Furthermore, integrated file replication and consistency maintenance (IRM) techniques have been used for consistency maintenance in peer-to-peer systems. This technique replicates the complete file in a distributed system without fragmentation or compression. Thus, requesting nodes and replica nodes are both totally dependent on the server machine. In case of server failure or a bottleneck condition, access to the original file or a replica becomes too complex [3].

To achieve the benefits of compression for file distribution, the work adopts a distributed hash function for compression of chunks before signing.

In summary, the work concludes that there is a need to develop a hybrid consistency mechanism based on a replication policy that involves compression, distribution, logging and effective searching of replicas in all possible connected networks. The proposed solution should be simple and effective for load forwarding, replication management and chunk storage in the event of node failure.

III. SYSTEM MODEL

Fig. 3.1 Hybrid Consistency Model

Analysis of the previous work shows that the IRM technique addresses the problem of replication. However, a replication scheme should not only perform fragmentation; the size of each fragment should also be as small as possible. To achieve replication, the work proposes a hybrid consistency maintenance model that integrates SHA-1 and CHORD techniques with the proposed chunk-efficient consistency management. At the initial stage, the model works only on .txt input files. The stages of the model are described below:

Stage 1: File Fragmentation

In the first stage of this mechanism, the file is divided into chunks of 256 bits each [|x| < 256 bits], and each chunk is assigned a public key along with a private signature. Because file sizes vary, the work encounters the problem of unequal-sized pieces during file fragmentation. Signing such pieces directly is quite insecure and very slow for long messages. To overcome this, the message is compressed prior to signing.
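The fragmentation step can be illustrated with a minimal Java sketch. It assumes chunks of at most 32 bytes (256 bits); the class name ChunkSplitter and the constant CHUNK_BYTES are hypothetical and do not come from the simulator. The example shows why the last piece of a file is generally shorter than the others.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: splits file content into chunks of at most 256 bits. */
final class ChunkSplitter {
    static final int CHUNK_BYTES = 32; // 256 bits, the chunk size named in Stage 1

    /** Returns consecutive slices of data; only the last slice may be shorter. */
    static List<byte[]> split(byte[] data) {
        List<byte[]> chunks = new ArrayList<>();
        for (int from = 0; from < data.length; from += CHUNK_BYTES) {
            int to = Math.min(from + CHUNK_BYTES, data.length);
            chunks.add(Arrays.copyOfRange(data, from, to));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // A 70-byte input yields pieces of 32, 32 and 6 bytes.
        for (byte[] chunk : split(new byte[70])) {
            System.out.println(chunk.length);
        }
    }
}
```

The unequal final piece is the problem that the fixed-length hash output of the next stage is meant to remove.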

Stage 2: SHA-1 Compression

A distributed hash table (DHT) is a mechanism in distributed systems that manages the distribution of files among a changing set of nodes by mapping them to keys. The biggest benefit of this approach is that stored resources can be located without using centralized servers. To better understand a replication process based on a compression and distribution policy, the work explores the internal working and practical implementation of a hash function. For this purpose the work selects the SHA-1 algorithm, which provides a 160-bit output for an input of arbitrary size. SHA-1 is similar in design to the MD4 and MD5 hash functions: it processes the message in 512-bit blocks and accepts inputs of up to 2^64 - 1 bits. The major advantage of the SHA-1 algorithm is that it works on arbitrary-length input, using a padding technique, and provides a fixed-length compressed output.

Thus it automatically solves the previous problem of unequal-sized fragments. SHA-1 pads the message with a single one bit followed by zeroes until the final block has 448 bits, and then appends the size of the original message as an unsigned 64-bit integer. It then initializes the five hash words (h0, h1, h2, h3, h4) to the specific constants defined in the SHA-1 standard. Next, for each 512-bit block, it allocates an 80-word array for the message schedule. The first 16 words are the 512-bit block split into 32-bit words, and each of the remaining words is generated by XORing four earlier words and rotating the result left by one bit.
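The padding and message-schedule steps described above can be sketched in Java as follows. This is an illustration of the standard SHA-1 preprocessing, not the simulator's code; the class name Sha1Steps and its methods are hypothetical, and the compression rounds themselves are left to the JDK's MessageDigest.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch of SHA-1 preprocessing: padding and message schedule. */
final class Sha1Steps {
    /** Appends 0x80, zero bytes up to 448 bits mod 512, then the 64-bit bit length. */
    static byte[] pad(byte[] msg) {
        int zeros = (56 - (msg.length + 1) % 64 + 64) % 64;
        ByteBuffer buf = ByteBuffer.allocate(msg.length + 1 + zeros + 8);
        buf.put(msg).put((byte) 0x80);
        buf.position(buf.position() + zeros); // skipped bytes are already zero
        buf.putLong((long) msg.length * 8);   // big-endian length in bits
        return buf.array();
    }

    /** Expands one 64-byte block into the 80-word message schedule. */
    static int[] schedule(byte[] block) {
        int[] w = new int[80];
        ByteBuffer buf = ByteBuffer.wrap(block, 0, 64);
        for (int t = 0; t < 16; t++) {
            w[t] = buf.getInt();              // first 16 words come from the block
        }
        for (int t = 16; t < 80; t++) {       // the rest: XOR of four words, rotated
            w[t] = Integer.rotateLeft(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1);
        }
        return w;
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] msg = "abc".getBytes(StandardCharsets.US_ASCII);
        System.out.println("padded bytes: " + pad(msg).length);
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(msg);
        System.out.println("digest bits: " + digest.length * 8);
    }
}
```

For the three-byte input "abc" the padded message is exactly one 512-bit block, and the digest returned by MessageDigest is 160 bits long regardless of input size.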

Stage 3: SHA-1 Hash Distribution

In this stage, the work accepts a fixed-length 160-bit chunk as input and distributes this file chunk among the nodes according to the chunk hash table. The work uses all eight (8) respective hashing techniques to reduce the collision rate.
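One way to turn a 160-bit digest into a position in a small network is sketched below. The reduction modulo 2^m is an assumption made for illustration; the paper does not specify how its chunk hash table or its eight hashing techniques map digests to nodes, and the class name ChunkKey is hypothetical.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch: reduces a chunk's SHA-1 digest to an m-bit identifier. */
final class ChunkKey {
    static int ringId(byte[] chunk, int m) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(chunk);
            BigInteger value = new BigInteger(1, digest);  // digest as unsigned integer
            return value.mod(BigInteger.ONE.shiftLeft(m)).intValue();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] chunk = "abc".getBytes(StandardCharsets.US_ASCII);
        // With m = 4 the identifier falls in the range 0..15.
        System.out.println(ringId(chunk, 4));
    }
}
```

Because the digest is deterministic, every node that hashes the same chunk computes the same identifier without consulting a central server.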

Stage 4: Chunk Storage & Logging using CHORD Techniques

The next level of the proposed solution requires a logging mechanism for storage and efficient searching of file chunks. The work adopts the CHORD mechanism to keep track of file chunk distribution and storage. It also helps in file chunk lookup in distributed systems. This mechanism uses a key hash function to map each key to a node during distribution, and that node is responsible for carrying the value associated with the key. To do that, CHORD uses a variant of consistent hashing, which tends to balance load: each node receives approximately the same number of keys, and few keys move when nodes join and leave the system. When nodes leave or join the system, updating the CHORD routing information requires, with high probability, no more than O(log² N) messages. This is far better than Napster, Gnutella, FastTrack replication and full replication. Figure 3.2 shows the CHORD structure of a distributed network of 2⁴ = 16 nodes, consisting of activated and deactivated nodes. Black nodes (1, 5, 8, 11, 15) are the activated nodes, whereas the remaining nodes are deactivated. The CHORD structure states that if the assigned node is not activated, the chunk points to its successor and is stored on the next activated successor node.

Fig. 3.2 CHORD Structure
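The successor rule for the 16-position ring of Figure 3.2 can be sketched with a sorted set of activated nodes. The class ChordRing is a hypothetical illustration rather than the simulator's implementation, which keeps its node log in .txt files.

```java
import java.util.TreeSet;

/** Hypothetical sketch of successor placement on a 2^m-position CHORD ring. */
final class ChordRing {
    private final int size;                               // number of identifiers, 2^m
    private final TreeSet<Integer> active = new TreeSet<>();

    ChordRing(int m, int... activeNodes) {
        this.size = 1 << m;
        for (int node : activeNodes) {
            active.add(node);
        }
    }

    /** First activated node at or after key, wrapping past the top of the ring. */
    int successor(int key) {
        Integer node = active.ceiling(Math.floorMod(key, size));
        return node != null ? node : active.first();
    }

    public static void main(String[] args) {
        ChordRing ring = new ChordRing(4, 1, 5, 8, 11, 15);
        System.out.println(ring.successor(6));   // 8: node 6 is deactivated
        System.out.println(ring.successor(11));  // 11: the assigned node is activated
        System.out.println(ring.successor(0));   // 1
    }
}
```

A chunk whose key is 6 is therefore stored on node 8, the next activated node clockwise.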

In the other case, if a node leaves the network, it has to transfer all of its stored chunks to its successor before leaving. Searching is also possible through the CHORD structure because of its logging technique. Each node stores the IP addresses of the activated nodes that follow it at offsets of 2^m around the ring, as shown in Figure 3.3, and a search follows these pointers to the stored chunk. The main advantage of the CHORD searching mechanism is that it does not require visiting all possible nodes. Thus it locates the desired chunk effectively, with a fast search rate.

Fig. 3.3 CHORD IP address Storage
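The pointer table and the search it enables can be sketched as below, using integer node identifiers in place of IP addresses. The sketch assumes the usual Chord convention that entry i of a node's table points to the activated successor of (node + 2^i) mod 2^m; the class ChordFingers and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch of CHORD pointer tables and lookup on a 2^m ring. */
final class ChordFingers {
    final int m;                                   // identifier bits
    final int size;                                // ring positions, 2^m
    final TreeSet<Integer> active = new TreeSet<>();

    ChordFingers(int m, int... activeNodes) {
        this.m = m;
        this.size = 1 << m;
        for (int node : activeNodes) active.add(node);
    }

    /** First activated node at or after key, wrapping around the ring. */
    int successor(int key) {
        Integer node = active.ceiling(Math.floorMod(key, size));
        return node != null ? node : active.first();
    }

    /** Entry i points to the activated successor of (node + 2^i) mod 2^m. */
    int[] fingerTable(int node) {
        int[] fingers = new int[m];
        for (int i = 0; i < m; i++) fingers[i] = successor(node + (1 << i));
        return fingers;
    }

    /** Clockwise distance from a to b, counted as a full turn when a == b. */
    private int distance(int a, int b) {
        int d = Math.floorMod(b - a, size);
        return d == 0 ? size : d;
    }

    /** Nodes contacted when start looks up key; the last entry holds the chunk. */
    List<Integer> lookupPath(int start, int key) {
        List<Integer> path = new ArrayList<>();
        int node = start;
        path.add(node);
        while (true) {
            int next = successor(node + 1);
            if (distance(node, key) <= distance(node, next)) { // key in (node, next]
                path.add(next);
                return path;
            }
            int[] fingers = fingerTable(node);
            for (int i = m - 1; i >= 0; i--) {     // farthest pointer that precedes key
                if (distance(node, fingers[i]) < distance(node, key)) {
                    next = fingers[i];
                    break;
                }
            }
            node = next;
            path.add(node);
        }
    }

    public static void main(String[] args) {
        ChordFingers ring = new ChordFingers(4, 1, 5, 8, 11, 15);
        System.out.println(Arrays.toString(ring.fingerTable(1))); // [5, 5, 5, 11]
        System.out.println(ring.lookupPath(1, 14));               // [1, 11, 15]
    }
}
```

In this example node 1 reaches the holder of key 14 by contacting node 11 only, skipping nodes 5 and 8, which is the sense in which the search does not visit every node.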

Stage 5 & 6: Replication management and Consistency Maintenance:

Various replication algorithms have been proposed that replicate files based on file ownership or on the location of the file requester (either on the requesting node or along the path) [3]. Some algorithms use the client-access history for file replication; they replicate at the client those files that have the most lookup requests in the client-access history. Other algorithms use parity-based consistency maintenance of files: they replicate the parity of data blocks upon update instead of the data block itself. The proposed algorithm also replicates the data file, but instead of the original version it replicates the popular chunks held in storage. The replication process starts when a normal node becomes a hot spot node and can no longer work efficiently. To avoid access problems during node failure or file corruption, it is better to have a backup copy of stored data files. However, it is impossible to keep multiple copies of each and every file, because of the limited capacity and time of the hot spot node. Moreover, some of the files may be unused and are supposed to be deleted in the near future. The algorithm therefore replicates only the frequently used file chunks, to all possible activated nodes of the connected network. Figure 3.4 shows the complete process of chunk replication during a hot spot. The next step is to maintain consistency. Since the replication process already avoids creating multiple copies of unused chunks, the algorithm requires minimal effort to implement consistency maintenance. Because the node whose chunks are replicated maintains the data regarding the new location of each replica, all references to the file are directed to all the replicas simultaneously. Storing all the references at one location makes it easy to forward changes to all these replicas. Hence consistency is attained implicitly while creating the replica for the file chunk.

Fig. 3.4 Replication Structure
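The replication and consistency steps can be sketched as a single in-memory class. The thresholds HOT_SPOT_REQUESTS and POPULAR_ACCESSES, the class name HotSpotNode and the use of strings for chunk contents are all assumptions made for illustration; the paper does not give concrete values or data structures.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: replicate only popular chunks once a node becomes a hot spot. */
final class HotSpotNode {
    static final int HOT_SPOT_REQUESTS = 5; // assumed load at which a node is a hot spot
    static final int POPULAR_ACCESSES = 3;  // assumed access count for a popular chunk

    final Map<String, String> store = new HashMap<>();             // chunk id -> content
    final Map<String, Integer> accessCount = new HashMap<>();
    final Map<String, Set<HotSpotNode>> replicaSites = new HashMap<>();
    final List<HotSpotNode> neighbours = new ArrayList<>();
    int requests = 0;

    String read(String chunkId) {
        requests++;
        accessCount.merge(chunkId, 1, Integer::sum);
        if (requests >= HOT_SPOT_REQUESTS) {
            replicatePopularChunks();
        }
        return store.get(chunkId);
    }

    /** Copies frequently used chunks to neighbours and records where they went. */
    void replicatePopularChunks() {
        for (Map.Entry<String, Integer> entry : accessCount.entrySet()) {
            if (entry.getValue() < POPULAR_ACCESSES) continue; // skip rarely used chunks
            String id = entry.getKey();
            for (HotSpotNode neighbour : neighbours) {
                neighbour.store.put(id, store.get(id));
                replicaSites.computeIfAbsent(id, k -> new HashSet<>()).add(neighbour);
            }
        }
    }

    /** Updates the local copy and forwards the change to every recorded replica. */
    void write(String chunkId, String content) {
        store.put(chunkId, content);
        Set<HotSpotNode> sites =
                replicaSites.getOrDefault(chunkId, Collections.<HotSpotNode>emptySet());
        for (HotSpotNode site : sites) {
            site.store.put(chunkId, content);
        }
    }

    public static void main(String[] args) {
        HotSpotNode origin = new HotSpotNode();
        HotSpotNode neighbour = new HotSpotNode();
        origin.neighbours.add(neighbour);
        origin.store.put("a", "v1");
        origin.store.put("b", "v1");
        for (int i = 0; i < 4; i++) origin.read("a");
        origin.read("b");                       // fifth request triggers replication
        origin.write("a", "v2");
        System.out.println(neighbour.store);    // {a=v2}
    }
}
```

Because the origin keeps the set of replica locations for each chunk, a write is forwarded to every replica in the same call, which mirrors the implicit consistency described above; the rarely used chunk "b" is never copied.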

IV. EXPERIMENTAL RESULTS

To evaluate the proposed model, we designed and implemented a simulator in Java. The simulator is a prototype of a distributed system and works on the .txt file format. The system takes a .txt file of arbitrary size as input, opens it in readable mode and converts all of its data into n-length binary output using a predefined binary function. The simulator also implements SHA-1, which takes arbitrary-size input from the binary file produced in the previous stage. Next, the simulator creates file chunks according to the binary input length and attaches a public key and a private signature to each. Key management is done through the KeyPairGenerator class. To implement the CHORD strategy, we use the .txt file format to maintain a log of all possible activated nodes in our network. The simulator was designed for a 2² (4-computer) network. It maintains a history file at every activated node to record the details of the chunks stored in the network. To simulate the connected network of a hot spot node, we used another node of our intranet to transfer the replica files and requests. The simulator only implements the proposed model and does not evaluate the time taken during the replication process.
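A minimal sketch of key generation and chunk signing with the JDK classes is given below. The paper names only KeyPairGenerator; the choice of RSA with 2048-bit keys, the "SHA1withRSA" signature scheme and the class name ChunkSigner are assumptions made for illustration. The scheme hashes the chunk with SHA-1 before signing, in line with the compress-then-sign idea of Stage 1.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

/** Hypothetical sketch: sign each chunk's SHA-1 digest with a generated key pair. */
final class ChunkSigner {
    static KeyPair newKeyPair() throws GeneralSecurityException {
        KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA"); // assumed algorithm
        generator.initialize(2048);                                       // assumed key size
        return generator.generateKeyPair();
    }

    /** Hashes the chunk with SHA-1 and signs the digest with the private key. */
    static byte[] sign(byte[] chunk, KeyPair keys) throws GeneralSecurityException {
        Signature signer = Signature.getInstance("SHA1withRSA");
        signer.initSign(keys.getPrivate());
        signer.update(chunk);
        return signer.sign();
    }

    /** Checks the signature against the chunk using the public key. */
    static boolean verify(byte[] chunk, byte[] signature, KeyPair keys)
            throws GeneralSecurityException {
        Signature verifier = Signature.getInstance("SHA1withRSA");
        verifier.initVerify(keys.getPublic());
        verifier.update(chunk);
        return verifier.verify(signature);
    }

    public static void main(String[] args) throws GeneralSecurityException {
        KeyPair keys = newKeyPair();
        byte[] chunk = "chunk-0".getBytes(StandardCharsets.UTF_8);
        byte[] signature = sign(chunk, keys);
        System.out.println(verify(chunk, signature, keys)); // true
    }
}
```

A node that receives a chunk together with its signature and the public key can run verify to detect a corrupted or altered chunk before storing it.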

V. CONCLUSION

Despite continuous efforts to develop file replication and file consistency maintenance methods for distributed systems, there has been very little research devoted to tackling both challenges simultaneously. File replication needs consistency maintenance to keep a file and its replicas in agreement. Connecting the two features enhances system performance. Unlike other approaches, the proposed structure need not store data at a server. SHA-1, CHORD and hashing techniques make it possible to distribute the chunks of a file to numerous nodes, thereby reducing the chances of traffic congestion and network failure.

Attractive features of CHORD and SHA-1 include their simplicity and provable correctness even in the face of concurrent node arrival and departure. Our theoretical analysis, simulation and experimental results confirm that the proposed structure works well with large networks as well.

VI. REFERENCES

[1] P. Triantafillou and C. Neilson, "Achieving Strong Consistency in a Distributed File", IEEE Transactions on Software Engineering, vol. 23, no. 1, January 1997.

[2] M. Wiesmann, F. Pedone, A. Schiper, B. Kemme, G. Alonso, "Database Replication Techniques: a Three Parameter Classification", in Proceedings of the 19th IEEE Symposium on Reliable Distributed Systems (SRDS-2000), Lausanne, 2000, pp. 206-215.

[3] H. Shen, "Integrated File Replication and Consistency Maintenance in P2P Systems", IEEE Transactions on Parallel Systems, vol. 21, no. 1, January 2010.

[4] A. Ahmed, A. Abdullah, and P.D.D. Dominic, "A Multi-Agent Based Replication Strategy for Improving Availability and Maintaining Consistency of Data in Large Scale Mobile Traffic Control Environments", International Symposium on High Capacity Optical Networks and Enabling Tech. (HONET), 2008.

[5] X. Sun, J. Zheng, Q. Liu, Y. Liu, "Dynamic Data Replication Based on Access Cost in Distributed Systems", Fourth International Conference on Computer Sciences and Convergence Information Technology, 2009.

[6] Y. Huang, J. Cao, B. Jin, X. Tao, J. Lu and Y. Feng, "Flexible Cache Consistency Maintenance over Wireless Ad Hoc Networks", IEEE Transactions on Parallel Systems, vol. 21, no. 8, August 2010.

[7] T. Repantis, A. Iyengar, V. Kalogeraki and I. Rouvellou, "Consistent Replication in Distributed Multi-Tier Architectures", International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), Orlando, Florida, USA, October 15-18, 201

[8] A. Diana, P. Fatos Xhafa, F. Pop and V. Cristea, "Evaluation of Optimistic Replication Techniques for Dynamic Files in P2P Systems", International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, 2011.

[9] A. Sulistio, C. Shin Yeo, and R. Buyya, "Simulation of Parallel and Distributed Systems: A Taxonomy and Survey of Tools".

[10] F. Huber, S. Molterer, A. Rausch, B. Schätz, M. Sihling, O. Slotosch, "Tool supported Specification and Simulation of Distributed Systems".

[11] G. Yadgar, "Multilevel Cache Management Based on Application Hints", PhD work submitted to the Senate of the Technion - Israel Institute of Technology, Iyar 5772, Haifa, 2012.

[12] I. Ben-Zvi, "Causality, Knowledge and Coordination in Distributed Systems", PhD thesis submitted to the Senate of the Technion - Israel Institute of Technology, Adar Bet 5771, Haifa, March 2011.

[13] R. Al Ekram and R. Holt, "Multi-Consistency Data Replication", 16th International Conference on Parallel and Distributed Systems, 2010.

[14] C. De Canniere and C. Rechberger, Institute for Applied Information Processing and Communications (IAIK), Graz University of Technology, Inffeldgasse 16a, A–8010 Graz, Austria, "Finding SHA-1 Characteristics: General Results and Applications".

[15] I. Stoica, R. Morris, D. Karger, M. Frans Kaashoek, H. Balakrishnan, MIT Laboratory for Computer Science, "Chord: A Scalable P2P Lookup Service for Internet Applications".


