HDFS runs on low-cost commodity hardware, which has made it cost effective. If HDFS used specialized, high-end hardware, handling and storing big data would be expensive.
5.2.6.2 Distributed Storage
HDFS splits the input files into blocks, each of size 64 MB by default, and stores these blocks across the nodes of the cluster. A file of size 200 MB is therefore split into three 64 MB blocks and one 8 MB block. The three 64 MB blocks are fully occupied, while the last block holds only 8 MB; HDFS stores just the actual 8 MB of data for that block rather than reserving the full 64 MB, so the underlying disk space is utilized fully.
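As a quick illustration of this arithmetic, the following sketch hard-codes the 200 MB example and the 64 MB default; a real cluster takes its block size from the dfs.blocksize configuration property.

// Sketch: how a 200 MB file maps onto 64 MB HDFS blocks.
// The sizes are illustrative; an actual cluster reads the block size
// from its configuration (dfs.blocksize).
public class BlockLayout {
    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64 MB
        long fileSize  = 200L * 1024 * 1024;  // 200 MB

        long fullBlocks  = fileSize / blockSize;           // 3 full blocks
        long lastBlockMB = (fileSize % blockSize) >> 20;   // 8 MB remainder

        System.out.println("Full 64 MB blocks: " + fullBlocks);
        System.out.println("Last block holds : " + lastBlockMB + " MB");
    }
}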
5.2.6.3 Data Replication
By default, HDFS makes three copies of every data block and stores them on different nodes in the cluster. If any node crashes, the nodes carrying copies of the lost blocks are identified and the data is retrieved from them.
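The replication factor can also be inspected or changed per file through the Hadoop FileSystem API. The following is a minimal sketch, assuming the Hadoop client libraries are on the classpath and that /data/sample.txt is a hypothetical file in HDFS:

// Sketch: reading and adjusting the replication factor of a file
// through the Hadoop FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/sample.txt");   // hypothetical path for illustration
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Current replication: " + status.getReplication());

        // Raise the replication factor to 3 (the HDFS default) if it is lower.
        if (status.getReplication() < 3) {
            fs.setReplication(file, (short) 3);
        }
        fs.close();
    }
}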
5.3 Hadoop Computation
5.3.1 MapReduce
5.3.1.1 Mapper
The mapper performs the function defined by the user in the MapReduce program and produces another intermediate key-value pair as the output. The processing of all the data blocks is done in parallel, and the same key can have multiple values. The output of the mapper is represented as list (K2, V2).
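For illustration, a word-count style mapper that emits an intermediate (word, 1) pair for every word in its input might be sketched as follows; the class name TokenizerMapper is illustrative, not taken from the text, and the Hadoop MapReduce libraries are assumed to be on the classpath.

// Sketch of a user-defined mapper (word count style): for each input line
// it emits an intermediate (word, 1) pair.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The input key is the byte offset of the line; the value is the line itself.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate (K2, V2) pair
        }
    }
}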
5.3.1.2 Combiner
The output of the mapper is optimized before the data is moved to the reducer. This reduces the overhead of transferring large data sets between the mapper and the reducer. The combiner is essentially the reducer of the map job; it logically groups the output of the mapper function, which consists of multiple key-value pairs. In the combiner, keys that are repeated are combined, and the values corresponding to each key are listed together. Figure 5.8 illustrates how processing is done in the combiner.

[Figure 5.7 MapReduce model: the input is divided into input splits, each processed in parallel by a map task; the map outputs are partitioned and combined, then passed to the reduce tasks, which produce the final output.]
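For the word-count mapper sketched above, a combiner that locally sums the counts produced by each map task might look like the following; the class name is illustrative, and the class is registered on the job with job.setCombinerClass(...). A combiner is written exactly like a reducer.

// Sketch of a combiner for the word-count example: it partially sums the
// (word, 1) pairs emitted by each map task before they are sent across
// the network to the reducers.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountCombiner
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable partialSum = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();   // combine repeated keys from this map task only
        }
        partialSum.set(sum);
        context.write(key, partialSum);   // (word, partial count)
    }
}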
5.3.1.3 Reducer
Reducer performs the logical function specified by the user in the MapReduce program. Each reducer runs in isolation from other reducers, and they do not communicate with each other. The input to the reducer is sorted based on the key.
The reducer processes the values of each key-value pair it receives and produces another key-value pair as the output. The output key-value pair may be either the same as the input key-value pair or modified based on the user-defined function.
The output of the reducer is written back to the DFS.
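Continuing the word-count example, a reducer that sums all the values received for each key and writes the final (word, total) pair to the output might be sketched as follows; the class name IntSumReducer is illustrative.

// Sketch of a user-defined reducer (word count style): all values for a key
// arrive together, sorted by key, and the reducer emits the final count.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        total.set(sum);
        context.write(key, total);   // final key-value pair written to the output files
    }
}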
[Figure 5.8 Combiner illustration: four input splits are processed by four map tasks, producing intermediate pairs such as (K1,V), (K2,V), and so on; the combiner groups the repeated keys, for example K1 with the list of its values, K2 with its values, etc., before the grouped pairs are passed to the reduce tasks that produce the output.]
5.3.1.4 JobTracker and TaskTracker
Hadoop MapReduce has one JobTracker and several TaskTrackers in a master/slave architecture. The JobTracker runs on the master node, and a TaskTracker runs on each slave node; there is always only one TaskTracker per slave node. The JobTracker typically runs on the same machine as the NameNode, while each TaskTracker runs alongside a DataNode, so every slave node performs both computing and storage tasks.
JobTracker is responsible for workflow management and resource management; the parallel processing of the data itself is carried out by the TaskTrackers under the JobTracker's coordination. Figure 5.9 illustrates the JobTracker as the master and the TaskTrackers as the slaves executing the tasks assigned by the JobTracker. The two-way arrows indicate that communication flows in both directions: the JobTracker communicates with the TaskTrackers to assign tasks, and each TaskTracker periodically reports the progress of its tasks.
JobTracker accepts job submission requests from clients, schedules the tasks that are to be run by the slave nodes, administers the health of the slave nodes, and monitors the progress of the tasks assigned to the TaskTrackers. JobTracker is a single point of failure: if it fails, all the tasks running on the cluster will eventually fail; hence, the machine hosting the JobTracker should be highly reliable. The communication between the client and the JobTracker, as well as between the JobTracker and the TaskTrackers, is established through remote procedure calls (RPC).
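The client side of this interaction is a small driver program that configures the job and submits it to the cluster. A minimal sketch, reusing the TokenizerMapper, WordCountCombiner, and IntSumReducer classes sketched earlier and assuming the newer org.apache.hadoop.mapreduce API with input and output paths passed on the command line, might look like this:

// Sketch of a driver: the client configures a MapReduce job and submits it;
// waitForCompletion() blocks while the job runs and reports progress.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(WordCountCombiner.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}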
TaskTracker sends a heartbeat signal to the JobTracker to indicate that the node is alive. Additionally, it sends information about the task it is currently handling, or, if it is idle, its availability to process a new task. If no heartbeat is received from a TaskTracker within a specific time interval, that TaskTracker is assumed to be dead.
Upon submission of a job, the details about the individual tasks in progress are stored in memory. The progress of each task is updated with every heartbeat signal received from the TaskTrackers, giving the end user a real-time view of the tasks in progress. On an active MapReduce cluster running multiple jobs, it is hard to estimate how much RAM the JobTracker will consume, so it is highly critical to monitor the JobTracker's memory utilization.

[Figure 5.9 JobTracker and TaskTracker: one JobTracker (master) connected by two-way links to several TaskTrackers (slaves), each of which runs map (M) and reduce (R) tasks.]
TaskTracker accepts tasks from the JobTracker, executes the user code, and sends periodic updates back to the JobTracker. When the processing of a task fails, the failure is detected by the TaskTracker and reported to the JobTracker, which reschedules the task to run again either on the same node or on another node of the same cluster. If multiple tasks of the same job fail on a single TaskTracker, that TaskTracker is barred from executing further tasks of that job. If tasks from different jobs fail on the same TaskTracker, the TaskTracker is barred from executing any task for the next 24 hours.
5.3.2 MapReduce Input Formats