

5.2 Hadoop Storage

5.2.1 HDFS (Hadoop Distributed File System)

The Hadoop Distributed File System is designed to store large data sets with a streaming access pattern while running on low-cost commodity hardware; it does not require highly reliable, expensive hardware. Data sets generated from multiple sources are stored in HDFS in a write-once, read-many-times pattern, and analysis is performed on the data to extract knowledge from it. HDFS is not suitable for applications that require low-latency access to data; HBase is a suitable alternative for such applications.

HDFS stores data by partitioning it into small chunks called blocks. The blocks of a single file are replicated across physically separate machines to provide fault tolerance and availability: if a block becomes corrupt, or if a disk or machine fails, the lost block can be retrieved from one of its replicas on another machine.
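A minimal client-side sketch of this write-once, read-many usage, assuming a reachable NameNode at the placeholder address below and hypothetical local and HDFS paths (the block replication described above is handled transparently by HDFS):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIngestExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");         // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);

        // Write once: copy a locally generated data set into HDFS.
        fs.copyFromLocalFile(new Path("/tmp/weblog.txt"),          // hypothetical local file
                             new Path("/data/input/weblog.txt"));  // hypothetical HDFS path

        // Read many: the stored file can now be listed (and analyzed) repeatedly.
        for (FileStatus status : fs.listStatus(new Path("/data/input"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}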

5.2.2 Why HDFS?

Figure 5.3 shows a DFS vs. a single machine. With a single machine that has four I/O channels, each capable of transferring data at 100 MB/s, reading 500 GB of data takes approximately 22.5 minutes; on top of that, data analysis still has to be performed, which increases the overall time further. If the same data is distributed over 100 machines with the same number of I/O channels per machine, the read takes approximately 13.5 seconds. This is essentially what Hadoop does: instead of storing the data at a single location, Hadoop stores it in a distributed fashion in the DFS, where the data is spread across hundreds of DataNodes and retrieval occurs in parallel. This approach eliminates the I/O bottleneck and improves performance.

Figure 5.3 Distributed file system vs. single machine (one machine vs. 100 machines).
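As a rough back-of-the-envelope check of the example above (the chapter's 22.5-minute and 13.5-second figures presumably include overhead beyond raw transfer time), the sketch below computes the ideal transfer time under the stated assumptions:

public class TransferTimeEstimate {
    public static void main(String[] args) {
        double dataMb = 500_000;        // 500 GB expressed in MB
        double channelMbPerSec = 100;   // speed of one I/O channel
        int channelsPerMachine = 4;

        // Ideal (overhead-free) read time on a single machine.
        double single = dataMb / (channelsPerMachine * channelMbPerSec);
        System.out.printf("1 machine   : %.1f s (~%.1f min)%n", single, single / 60);

        // The same data spread evenly over 100 machines and read in parallel.
        double parallel = single / 100;
        System.out.printf("100 machines: %.1f s%n", parallel);
    }
}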

5.2.3 HDFS Architecture

HDFS is highly fault tolerant and designed to be deployed on commodity hardware.

The files that applications store on HDFS typically range from terabytes to petabytes in size, as HDFS is designed to support such large files. It is also designed so that it is easy to port HDFS from one platform to another. HDFS adopts a master/slave architecture in which one machine in the cluster acts as the master and all other machines serve as slaves. Figure 5.4 shows the HDFS architecture. The master node runs the NameNode and the associated daemon called the JobTracker. The NameNode manages the namespace of the entire file system, supervises the health of the DataNodes through Heartbeat signals, and controls access to the files by end users. The NameNode does not hold the actual data; it is the directory for the DataNodes, holding the information about which blocks together constitute a file and where those blocks are located. The NameNode is the single point of failure in the entire system, and if it fails, manual intervention is needed. Also, HDFS is not suitable for storing a large number of small files. Because the file system metadata is kept in the NameNode's memory, the total number of files that can be stored in HDFS is governed by the memory capacity of the NameNode: the more small files that have to be stored, the more metadata must be held, and the more memory is consumed.
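As a rough illustration of the small-files problem, the sketch below estimates NameNode memory for a given number of files. The commonly cited rule of thumb of roughly 150 bytes of NameNode heap per file system object is an approximation, not an exact figure:

public class NameNodeMemoryEstimate {
    // Rule-of-thumb assumption: ~150 bytes of NameNode heap per file,
    // directory, or block object.
    private static final long BYTES_PER_OBJECT = 150;

    public static void main(String[] args) {
        long smallFiles = 100_000_000L;   // 100 million small files
        long objects = smallFiles * 2;    // each file also needs at least one block object
        double gb = objects * BYTES_PER_OBJECT / 1e9;
        System.out.printf("~%.1f GB of NameNode heap for %d small files%n", gb, smallFiles);
    }
}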

The set of all slave nodes, with the associated daemon called the TaskTracker, comprises the DataNodes. The DataNodes are where the actual data reside, distributed across the cluster. The distribution occurs by splitting the file holding the user data into blocks of 64 MB by default, and these blocks are then stored in the DataNodes. The mapping of blocks to DataNodes is performed by the NameNode; that is, the NameNode decides which block of the file is to be placed on which DataNode. Several blocks of the same file are stored in different DataNodes.

Figure 5.4 HDFS architecture.

Each block is mapped to three DataNodes by default to provide reliability and fault tolerance through data replication; the number of replicas that a file should have in HDFS can also be specified by the application. The NameNode records the location of each block on the DataNodes. It also performs several other operations, such as opening or closing files and renaming files and directories. The NameNode also decides which block of the file is to be written to which DataNode within a specific rack. A rack is a storage area where multiple DataNodes are put together. The three replicas of a block are written such that the first replica is placed on one rack and replicas 2 and 3 are always written on two different DataNodes of the same rack, which cannot be the rack where replica 1 is placed. This approach is designed to overcome rack failure. The placement of these blocks, decided by the NameNode, is based on the proximity between the nodes: the closer the proximity, the faster the communication between the DataNodes.
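The block-to-DataNode mapping can be inspected from a client through the Hadoop FileSystem API. The sketch below is minimal: the file path and the NameNode address in fs.defaultFS are placeholders, and dfs.blocksize / dfs.replication are shown only to illustrate that block size and replication factor can be set by the client.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder NameNode address
        conf.set("dfs.blocksize", "67108864");              // 64 MB block size
        conf.set("dfs.replication", "3");                   // replication factor

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/input/weblog.txt");     // hypothetical file
        FileStatus status = fs.getFileStatus(file);

        // Ask the NameNode which DataNodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset %d, length %d, hosts %s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}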

HDFS has a secondary NameNode, which periodically backs up the data that resides in the RAM of the NameNode. The secondary NameNode does not take over as the NameNode if the NameNode fails; rather, it acts as a recovery mechanism in case of such a failure. The secondary NameNode runs on a separate machine because it requires memory space equivalent to that of the NameNode in order to back up the data residing in the NameNode. Despite the presence of the secondary NameNode, the system does not guarantee high availability: the NameNode still remains a single point of failure.

Failure of the NameNode makes the file system unavailable for reads or writes until a new NameNode is brought into action.

HDFS federation was introduced because the memory of the NameNode, which holds the metadata and a reference to every block in the file system, limits cluster scaling. Under HDFS federation, additional NameNodes are added, and each NameNode manages its own namespace independently of the others. The NameNodes therefore do not communicate with each other, and failure of one NameNode does not affect the namespaces managed by the other NameNodes.
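Federation is configured in hdfs-site.xml by declaring multiple name services and the address of the NameNode serving each one. The snippet below is only a sketch; the name-service IDs ns1/ns2 and the host names are placeholders.

<!-- hdfs-site.xml: two independent NameNodes under federation (hosts are placeholders) -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>namenode2.example.com:8020</value>
</property>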

5.2.4 HDFS Read/Write Operation

The HDFS client initiates a write request to the Distributed File System (DFS), and the DFS, in turn, connects to the NameNode. The NameNode creates a new record for storing the metadata about the new file, and the file creation is initiated after a check for file duplication. The DataNodes are identified based on the number of replicas, which is three by default. The input file is split into blocks of the default size of 64 MB, and the blocks are then sent to the DataNodes in packets. The writing is done in a pipelined fashion: the client sends a packet to the DataNode in closest proximity among the three DataNodes identified by the NameNode; that DataNode forwards the packet to the second DataNode, and the second DataNode, in turn, forwards it to the third. Upon receiving a complete data block, an acknowledgment is sent from the receiving DataNode to the sending DataNode and finally to the client. If the data are successfully written on all the identified DataNodes, the connection established between the client and the DataNodes is closed. Figure 5.5 illustrates the file write in HDFS.
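From the client's point of view, the pipelined write is hidden behind an ordinary output stream. The sketch below assumes a reachable NameNode at the placeholder address and a hypothetical target path; the block allocation, pipelining, and acknowledgments described above happen inside the Hadoop client library.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path target = new Path("/data/output/result.txt");  // hypothetical destination

        // create() contacts the NameNode, which allocates blocks and picks DataNodes;
        // the returned stream then writes packets down the DataNode pipeline.
        try (FSDataOutputStream out = fs.create(target)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}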

The client initiates the read request to the DFS, and the DFS, in turn, interacts with the NameNode to receive the metadata, that is, the locations of the blocks of the file to be read. The NameNode returns the locations of all the DataNodes holding a copy of each block, sorted so that the nearest DataNode comes first. This metadata is passed from the DFS to the client; the client then picks the DataNode in closest proximity and connects to it. The read operation is performed, and the NameNode is called again to get the block locations for the next batch of blocks to be read. This process is repeated until all the necessary data have been read, and a close operation ends the connection established between the client and the DataNodes. Meanwhile, if any DataNode fails, the data is read from another DataNode where the same block is replicated. Figure 5.6 illustrates the file read in HDFS.
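The corresponding client-side read is again just a stream; the block-location lookup and the selection of the nearest DataNode happen inside the library. The path and NameNode address below are the same placeholders used earlier.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path source = new Path("/data/input/weblog.txt");   // hypothetical file to read

        // open() fetches the block locations from the NameNode and then
        // streams the data from the nearest DataNode holding each block.
        try (FSDataInputStream in = fs.open(source)) {
            IOUtils.copyBytes(in, System.out, 4096, false);  // dump file contents to stdout
        }
        fs.close();
    }
}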

Figure 5.5 File write.

5.2.5 Rack Awareness

HDFS has its DataNodes spread across different racks, and the racks are identified by rack IDs, the details of which are stored in the NameNode. The three replicas of a block are placed such that the first replica is written on one rack and replicas 2 and 3 are always written on two different DataNodes of another rack, never on the rack where replica 1 is placed, which makes the DFS highly available and fault tolerant. Thus, when the rack holding replica 1 goes down, the data can still be fetched from the rack holding replicas 2 and 3.

The logic is to place no more than two replicas on the DataNodes of the same rack, with each replica on a different DataNode. The number of racks involved in replication is kept smaller than the total number of replicas of the block, as rack failure is less common than DataNode failure. The second and third replicas are placed on different DataNodes of the same rack because the availability and fault-tolerance concerns are already handled by using two distinct racks, and writing the replicas on DataNodes of the same rack is remarkably faster than writing on DataNodes of different racks. The overall idea is to place the replicas on two separate racks and three different nodes to address both rack failure and node failure.
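The NameNode learns each DataNode's rack ID from a cluster-supplied topology mapping. One common way, sketched below with a placeholder script path, is to point core-site.xml at an administrator-written script that maps an IP address or host name to a rack path such as /rack1:

<!-- core-site.xml: map DataNode addresses to rack IDs (script path is a placeholder) -->
<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/rack-topology.sh</value>
</property>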

5.2.6 Features of HDFS

5.2.6.1 Cost-Effective

HDFS is an open-source storage platform; hence, it is available free of cost to the organizations that choose to adopt it as their storage tool. HDFS does not require high-end hardware for storage; it uses commodity hardware, which has made it cost-effective.

Figure 5.6 File read.

If HDFS used specialized, high-end hardware, handling and storing big data would be expensive.

5.2.6.2 Distributed Storage

HDFS splits input files into blocks, each of size 64 MB by default, and then stores the blocks in HDFS. A file of size 200 MB is split into three 64 MB blocks and one 8 MB block. The three 64 MB blocks are fully occupied, while the last block holds only 8 MB; the unused portion of that block does not consume disk space and remains available for storing other data, so the underlying storage is still fully utilized.
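A quick way to see how a file of a given size breaks into blocks (the file size and block size below simply mirror the example above):

public class BlockSplit {
    public static void main(String[] args) {
        long fileMb = 200;    // file size from the example
        long blockMb = 64;    // default HDFS block size

        long fullBlocks = fileMb / blockMb;    // 3 full 64 MB blocks
        long lastBlockMb = fileMb % blockMb;   // 8 MB left over
        System.out.printf("%d full blocks + one %d MB block%n", fullBlocks, lastBlockMb);
    }
}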

5.2.6.3 Data Replication

By default, HDFS makes three copies of every data block and stores them on different nodes in the cluster. If any node crashes, a node carrying a replica of the lost data is identified, and the data is retrieved from it.
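The replication factor can also be changed per file from the client API; a minimal sketch, assuming the placeholder NameNode address and file path used earlier:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/input/weblog.txt");     // hypothetical file

        // Raise the replication factor of this file from the default 3 to 5;
        // the NameNode schedules the extra copies in the background.
        boolean accepted = fs.setReplication(file, (short) 5);
        System.out.println("replication change accepted: " + accepted);
        fs.close();
    }
}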