Big Data Storage Concepts
2.2 Distribution Models
Figure 2.6 (a) Sharding. (b) Sharding example.
massive data growth. Sharding reduces the number of transactions each node handles and increases throughput. It also reduces the amount of data each node needs to store.
Figure 2.6b shows an example of how a data set is split up into shards across multiple nodes. A data set with employee details is split up into four small blocks: shard A, shard B, shard C, and shard D, and stored across four different nodes: node A, node B, node C, and node D. Sharding improves the fault tolerance of the system because the failure of a node affects only the block of data stored in that particular node.
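To make the idea concrete, the short Python sketch below splits a small employee data set into the four shards of Figure 2.6. The hash-based mapping of Employee_Id to a shard, and the sample records, are illustrative assumptions; the chapter does not prescribe a particular partitioning rule.

```python
# A minimal sketch of sharding: employee records are split across four
# shards (one per node) by hashing the Employee_Id.
from collections import defaultdict

SHARDS = ["shard_A", "shard_B", "shard_C", "shard_D"]  # one shard per node (assumed layout)

def shard_for(employee_id: int) -> str:
    """Map a record key to exactly one shard."""
    return SHARDS[employee_id % len(SHARDS)]

def split_into_shards(records):
    """Distribute the full data set so each node stores only its own shard."""
    shards = defaultdict(list)
    for emp_id, name in records:
        shards[shard_for(emp_id)].append((emp_id, name))
    return shards

# Illustrative sample records
employees = [(887, "John"), (900, "Joseph"), (901, "Stephen"), (917, "Marco")]
print(split_into_shards(employees))
```

Because each node stores and serves only its own shard, a query for a given Employee_Id is routed to a single node rather than scanning the whole data set.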
2.2.2 Data Replication
Replication is the process of creating copies of the same set of data across multiple servers. When a node crashes, the data stored in that node will be lost. Also, when a node is down for maintenance, it will not be available until the maintenance process is over. To overcome these issues, the data block is copied across multiple nodes. This process is called data replication, and each copy of a block is called a replica. Figure 2.7 shows data replication.
Replication makes the system fault tolerant since the data is not lost when an individual node fails, as the data is redundant across the nodes. Replication also increases data availability, as the same copy of the data is available across multiple nodes. Figure 2.8 illustrates that the same data is replicated across node A, node B, and node C. Data replication is achieved through the master-slave and peer-to-peer models.
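A minimal sketch of the idea, assuming an in-memory dictionary stands in for each node: the same block of employee records is copied to node A, node B, and node C, so any surviving node can serve a read.

```python
# A minimal sketch of data replication: the same block of records is copied
# to every node in the replica set, so a single node failure does not lose data.
# The node names and in-memory "nodes" dictionary are illustrative assumptions.
nodes = {"node_A": [], "node_B": [], "node_C": []}

def replicate(block, replica_nodes):
    """Store an identical copy (replica) of the block on each node."""
    for node in replica_nodes:
        nodes[node] = list(block)  # each node holds its own copy

employees = [(887, "John"), (888, "George"), (900, "Joseph"), (901, "Stephen")]
replicate(employees, ["node_A", "node_B", "node_C"])

# A read can now be served by any surviving node.
assert nodes["node_A"] == nodes["node_B"] == nodes["node_C"]
```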
Figure 2.7 Replication.
2.2.2.1 Master-Slave Model
Master-slave configuration is a model where one centralized device, known as the master, controls one or more devices known as slaves. In a master-slave configuration, a replica set consists of a master node and several slave nodes. Once the relationship between master and slave is established, the flow of control is only from the master to the slaves. In master-slave replication, all incoming data is written to the master node, and the same data is replicated over several slave nodes. All write requests are handled by the master node, and data updates, inserts, or deletes occur on the master node, while read requests are handled by the slave nodes. This architecture supports read-intensive workloads, as increasing demand can be handled by adding additional slave nodes. If the master node fails, write requests cannot be fulfilled until the master node is restored or a new master node is created from one of the slave nodes. Figure 2.9 shows data replication in a master-slave configuration.
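The following sketch illustrates this flow of control, assuming in-memory dictionaries stand in for the master and slave nodes and that replication happens synchronously inside write(); the class name and round-robin read routing are illustrative choices, not a prescribed implementation.

```python
# A minimal sketch of master-slave replication: writes go only to the master
# and are then copied to every slave; reads are spread across the slaves.
import itertools

class MasterSlaveReplicaSet:
    def __init__(self, num_slaves: int):
        self.master = {}                                # master holds the authoritative copy
        self.slaves = [{} for _ in range(num_slaves)]   # each slave holds a replica
        self._next_slave = itertools.cycle(range(num_slaves))

    def write(self, key, value):
        """All inserts/updates/deletes are handled by the master node."""
        self.master[key] = value
        for slave in self.slaves:                       # replicate the change to every slave
            slave[key] = value

    def read(self, key):
        """Reads are served by the slaves (round-robin), giving read scalability."""
        slave = self.slaves[next(self._next_slave)]
        return slave.get(key)

rs = MasterSlaveReplicaSet(num_slaves=4)
rs.write(887, "John")
print(rs.read(887))   # served by one of the slave nodes
```

Adding more slave nodes raises the read capacity, but every write still passes through the single master, which is the limitation the next section addresses.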
2.2.2.2 Peer-to-Peer Model
In the master-slave model, only the slaves are protected against a single point of failure; the cluster still suffers from a single point of failure if the master fails.

Figure 2.8 Data replication.

Also, the writes are limited to the maximum capacity that the master can handle;
hence, it provides only read scalability. These drawbacks of the master-slave model are overcome in the peer-to-peer model. In a peer-to-peer configuration there is no master-slave concept; all the nodes have the same responsibility and are at the same level. The nodes in a peer-to-peer configuration act as both client and server. In the master-slave model, communication is always initiated by the master, whereas in a peer-to-peer configuration, either of the devices involved in the process can initiate communication. Figure 2.10 shows replication in the peer-to-peer model.
In the peer-to-peer model, the workload is partitioned among the nodes. The nodes consume as well as donate resources. Resources such as disk storage space, memory, bandwidth, processing power, and so forth are shared among the nodes.
The reliability of this type of configuration is improved through replication. Replication is the process of sharing the same data across multiple nodes to avoid a single point of failure. Also, the nodes connected in a peer-to-peer configuration can be geographically distributed across the globe.
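The sketch below illustrates the peer-to-peer idea under simple assumptions: every node holds its own copy of the data, any node can accept a write, and the update is pushed on to the remaining peers. The Peer class and its push-style propagation are illustrative, not a specific product's protocol.

```python
# A minimal sketch of peer-to-peer replication: every node is at the same
# level, can accept a write directly, and then propagates it to its peers.
class Peer:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.peers = []          # every node knows its peer nodes

    def connect(self, others):
        self.peers = [p for p in others if p is not self]

    def write(self, key, value, seen=None):
        """Any peer can initiate a write; the update is pushed to all other peers."""
        seen = seen or {self.name}
        self.data[key] = value
        for peer in self.peers:
            if peer.name not in seen:   # visit each peer only once
                seen.add(peer.name)
                peer.write(key, value, seen)

nodes = [Peer(f"node_{i}") for i in range(1, 7)]
for node in nodes:
    node.connect(nodes)

nodes[2].write(887, "John")      # a write can start at any node...
print(nodes[5].data)             # ...and every other peer ends up with a copy
```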
2.2.3 Sharding and Replication
With sharding alone, when a node goes down, the data stored in that node will be lost, so sharding provides only limited fault tolerance to the system. Sharding and replication can be combined to make the system fault tolerant and highly available. Figure 2.11 illustrates the combination of sharding and replication, where the data set is split up into shard A and shard B. Shard A is replicated across node A and node B; similarly, shard B is replicated across node C and node D.
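A minimal sketch of this combination, mirroring Figure 2.11: records are routed to shard A or shard B by a range-based rule (an assumed split at Employee_Id 900), and each shard is then replicated on two nodes.

```python
# A minimal sketch of combining sharding with replication: the data set is
# split into shard A and shard B, and each shard is copied to two nodes.
CLUSTER = {
    "shard_A": {"node_A": [], "node_B": []},   # shard A replicated on nodes A and B
    "shard_B": {"node_C": [], "node_D": []},   # shard B replicated on nodes C and D
}

def shard_for(emp_id: int) -> str:
    """Range-based routing rule (assumed split at Employee_Id 900)."""
    return "shard_A" if emp_id < 900 else "shard_B"

def write(emp_id: int, name: str):
    """Route the record to its shard, then replicate it to every node of that shard."""
    for replica in CLUSTER[shard_for(emp_id)].values():
        replica.append((emp_id, name))

for emp_id, name in [(887, "John"), (888, "George"), (900, "Joseph"), (901, "Stephen")]:
    write(emp_id, name)

print(CLUSTER)   # losing any single node loses neither shard A nor shard B
```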
Figure 2.9 Master-slave model.
Figure 2.10 Peer-to-peer model.
Figure 2.11 Combination of sharding and replication.