
Driving Big Data with Hadoop Tools and Technologies

Chapter 5 Refresher

1 What is the default block size of HDFS?

A 32 MB B 64 MB C 128 MB D 16 MB Answer: b

Explanation: The input file is split up into blocks of size 64 MB by default, and these blocks are then stored in the DataNodes.
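The default is configurable. As a minimal sketch (not from the text), assuming the Hadoop Java client API and the dfs.blocksize property (older releases used dfs.block.size), the block size used for newly written files can be overridden like this:

```java
import org.apache.hadoop.conf.Configuration;

public class BlockSizeExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Override the HDFS block size (in bytes) for files written with this
        // configuration; 128 MB here instead of the 64 MB default of Hadoop 1.x.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        System.out.println("Block size: " + conf.getLong("dfs.blocksize", 0) + " bytes");
    }
}
```

The same property can also be set cluster-wide in hdfs-site.xml.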

2 What is the default replication factor of HDFS?

A 4 B 1 C 3 D 2 Answer: c

Explanation: The input file is split up into blocks, and each block is mapped to three DataNodes by default to provide reliability and fault tolerance through data replication.

3 Can HDFS data blocks be read in parallel?

A Yes B No Answer: a

Explanation: HDFS read operations are done in parallel, and write operations are done in pipelined fashion.

4 In Hadoop there exists _______.

A one JobTracker per Hadoop job B one JobTracker per Mapper C one JobTracker per node D one JobTracker per cluster Answer: d

Explanation: Hadoop follows a master/slave architecture where there is one master node and several slave nodes. The JobTracker resides in the master node, and TaskTrackers reside in the slave nodes, one per node.

5 The task assigned by the JobTracker is executed by the ________, which acts as the slave.

A MapReduce B Mapper C TaskTracker D JobTracker Answer: c

Explanation: JobTracker sends the necessary information for executing a task to the TaskTracker, which executes the task and sends back the results to JobTracker.

6 What is the default number of times a Hadoop task can fail before the job is killed?

A 3 B 4 C 5 D 6 Answer: b

Explanation: If a task running on TaskTracker fails, it will be restarted on some other TaskTracker. If the task fails for more than four times, the job will be killed.

Four is the default number of times a task can fail, and it can be modified.

7 Input key-value pairs are mapped by the __________ into a set of intermediate key-value pairs.

A Mapper B Reducer C both Mapper and Reducer D none of the above Answer: a

Explanation: Maps are the individual tasks that transform the input records into a set of intermediate records.
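As an illustrative sketch built on the standard org.apache.hadoop.mapreduce API (a word-count example, not something prescribed by the text), a Mapper turns each input record into intermediate key-value pairs:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each map() call transforms one input record (a line of text)
// into intermediate (word, 1) key-value pairs.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit an intermediate key-value pair
        }
    }
}
```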

8 The __________ is a framework-specific entity that negotiates resources from the ResourceManager.

A NodeManager B ResourceManager C ApplicationMaster D all of the above Answer: c

Explanation: ApplicationMaster has the responsibility of negotiating the resource containers from the ResourceManager.

9 Hadoop YARN stands for __________.

A Yet Another Resource Network B Yet Another Reserve Negotiator C Yet Another Resource Negotiator D all of the mentioned Answer: c

10 ________ is used when the NameNode goes down in Hadoop 1.0.

A Rack B DataNode C Secondary NameNode D None of the above Answer: c

Explanation: NameNode is the single point of failure in Hadoop 1.0, and when NameNode goes down, the entire system crashes until a new NameNode is brought into action again.

11 ________ is used when the active NameNode goes down in Hadoop 2.0.

A Standby NameNode B DataNode C Secondary NameNode D None of the above Answer: a

Explanation: When the active NameNode goes down in the Hadoop YARN architecture, the standby NameNode comes into action and takes up the tasks of the active NameNode.

Conceptual Short Questions with Answers

1 What is a Hadoop framework?

Apache Hadoop, written in the Java language, is an open-source framework that supports the processing of large data sets in a streaming access pattern across clusters in a distributed computing environment. It can store a large volume of structured, semi-structured, and unstructured data in a DFS and process it in parallel. It is a highly scalable and cost-effective storage platform.

2 What is fault tolerance?

Fault tolerance is the ability of the system to work without interruption in case of system hardware or software failure. In Hadoop, fault tolerance is the ability of the system to recover the data even if the node where the data is stored fails. This is achieved by data replication where the same data gets replicated across multiple nodes; by default it is three nodes in HDFS.

3 Name the four components that make up the Hadoop framework.

Hadoop Common: Hadoop common is a collection of common utilities that support other Hadoop modules.

Hadoop Distributed File System (HDFS): HDFS is a DFS to store large data sets in a distributed cluster and provides high-throughput access to the data across the cluster.

Hadoop YARN: YARN is the acronym for Yet Another Resource Negotiator and does the job-scheduling and resource-management tasks in the Hadoop cluster.

Hadoop MapReduce: MapReduce is a framework that performs parallel processing of large unstructured data sets across the clusters.

4 If replication across nodes in HDFS causes data redundancy occupying more memory, then why is it implemented?

HDFS is designed to work on commodity hardware to make it cost effective. Commodity hardware consists of low-performance machines with a higher possibility of crashing; thus, to make the system fault tolerant, the data are replicated across three nodes. Hence, if the first node crashes and the second node is not available for any reason, the data can still be retrieved from the third node, making the system highly fault tolerant.

5 What is a master node and slave node in Hadoop?

Slaves are the Hadoop cluster daemons that are responsible for storing the actual and replicated data and for processing the MapReduce jobs. A slave node in Hadoop has a DataNode and a TaskTracker. Masters are responsible for monitoring the storage of data across the slaves and the status of the tasks assigned to the slaves. A master node has a NameNode and a JobTracker.

6 What is a NameNode?

NameNode manages the namespace of the entire file system, supervises the health of the DataNode through the Heartbeat signal, and controls the access to the files by the end user. The NameNode does not hold the actual data; it is the directory for DataNode holding the information of which blocks together constitute the file and the location of those blocks. This information is called metadata, which is data about data.

7 Is the NameNode also commodity hardware?

No, the NameNode is the single point of failure, and it cannot be commodity hardware as the entire file system relies on it. NameNode has to be a highly available system.

8 What is MapReduce?

MapReduce is the batch-processing programming model for the Hadoop framework, which adopts a divide-and-conquer principle. It is highly scalable, reliable, and fault tolerant, capable of processing input data with any format in parallel, supporting only batch workloads.
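To make the batch-processing model concrete, here is a minimal driver sketch that configures and submits a MapReduce job; WordCountMapper and WordCountReducer are the hypothetical classes sketched elsewhere in this refresher, and the input/output paths come from the command line:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Splits the input, runs map tasks in parallel across the cluster,
// and aggregates the intermediate results in the reduce phase.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // map: divide
        job.setReducerClass(WordCountReducer.class);  // reduce: conquer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);   // block until the batch job finishes
    }
}
```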

9 What is a DataNode?

A slave node has a DataNode and an associated daemon, the TaskTracker. DataNodes are deployed on each slave machine; they provide the actual storage and are responsible for serving read/write requests from clients.

10 What is a JobTracker?

JobTracker is a daemon running on the master node that tracks the MapReduce jobs. It assigns the tasks to the different TaskTrackers. A Hadoop cluster has only one JobTracker, and it is a single point of failure: if it goes down, all the running jobs are halted. The JobTracker receives a Heartbeat signal from each TaskTracker, which indicates the health of that TaskTracker and the status of the MapReduce tasks it is executing.

11 What is a TaskTracker?

TaskTracker is a daemon running on the slave node that manages the execution of tasks on that node. When a job is submitted by a client, the JobTracker divides it and assigns the tasks to different TaskTrackers to perform the MapReduce tasks. The TaskTracker simultaneously communicates with the JobTracker by sending the Heartbeat signal, to update the status of the job and to indicate that the TaskTracker is alive. If the Heartbeat is not received by the JobTracker for a specified period of time, then the JobTracker assumes that the TaskTracker has crashed.

12 Why is HDFS used for applications with large data sets and not for applications having a large number of small files?

HDFS works with large blocks (64 MB by default), and the NameNode, an expensive high-performance machine, keeps the metadata for every file and block in memory. A large number of small files generates a disproportionately large volume of metadata that would fill the NameNode's memory, whereas a single large file of the same total size occupies far less metadata space. Thus, for optimized performance, HDFS is designed for large data sets rather than for a large number of small files.

13 What is a Heartbeat signal in HDFS?

The TaskTracker sends a Heartbeat signal to the JobTracker to indicate that the node is alive, together with the status of the task it is currently handling or its availability to process a new task. If the Heartbeat signal is not received from a TaskTracker within a specific time interval, that TaskTracker is assumed dead.

14 What is a secondary NameNode? Is the secondary NameNode a substitute for NameNode?

The secondary NameNode periodically backs up all the data that reside in the RAM of the NameNode. The secondary NameNode does not take over the role of the NameNode when the NameNode fails; rather, it acts as a recovery mechanism in case of such a failure. The secondary NameNode runs on a separate machine because it requires memory space equivalent to the NameNode in order to back up the data residing in it.

15 What is a rack?

A rack is a storage area where multiple DataNodes are put together; that is, a rack is a collection of DataNodes stored at a single physical location. Different racks can be located at different places.

16 What is a combiner?

The combiner is essentially a reducer applied to the output of the map task: it logically groups the mapper output, which consists of multiple key-value pairs, combining repeated keys and aggregating the values corresponding to each key. Instead of passing the output of the mapper directly to the reducer, it is first sent to the combiner and then to the reducer, which optimizes the MapReduce job by reducing the volume of intermediate data (see the sketch after this answer).
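A minimal sketch of a summing reducer that can also serve as the combiner (continuing the hypothetical word-count example); because summation is associative and commutative, pre-aggregating map output with the same class does not change the final result, it only shrinks the data shuffled to the reducers:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Combines repeated keys by summing the values associated with each key.
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```

In the driver, registering the class as a combiner is a single call: job.setCombinerClass(WordCountReducer.class).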

17 If a file size is 500 MB, block size is 64 MB, and the replication factor is 1, what is the total number of blocks it occupies?

Number of blocks = (500 / 64) × 1 = 7.8125, so the file occupies 8 blocks (the 8th block is only partially filled).

18 If a file size is 800 MB, block size is 128 MB, and the replication factor is 3, what is the total number of blocks it occupies? What is the size of each block?

Total number of blocks = 800 / 128 = 6.25, so the file is split into 7 blocks. The first 6 blocks are 128 MB each, and the size of the 7th block is 800 − (6 × 128) = 32 MB. With a replication factor of 3, each of these 7 blocks is stored three times, giving 7 × 3 = 21 block replicas in the cluster; the sketch below reproduces the arithmetic.
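The arithmetic from both questions can be checked with a small sketch; the method blocksFor is illustrative, not a Hadoop API:

```java
// Worked block-count arithmetic for the two refresher questions above.
public class BlockMath {

    // Number of HDFS blocks needed to hold a file (ignoring replication).
    static long blocksFor(long fileSizeMb, long blockSizeMb) {
        return (fileSizeMb + blockSizeMb - 1) / blockSizeMb;   // ceiling division
    }

    public static void main(String[] args) {
        // 500 MB file, 64 MB blocks: 500 / 64 = 7.8125 -> 8 blocks
        System.out.println(blocksFor(500, 64));                // prints 8

        // 800 MB file, 128 MB blocks: 800 / 128 = 6.25 -> 7 blocks
        long blocks = blocksFor(800, 128);                     // 7
        long lastBlockMb = 800 - (blocks - 1) * 128;           // 800 - 768 = 32 MB
        System.out.println(blocks + " blocks, last block of " + lastBlockMb + " MB");
        // With a replication factor of 3: 7 * 3 = 21 block replicas in the cluster.
    }
}
```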

Frequently Asked Interview Questions

1 In Hadoop, why is reading performed in parallel while writing is not?

In Hadoop MapReduce, a file is read in parallel for faster data access. The write operation is not performed in parallel since it would result in data inconsistency. For example, if two nodes write data into a file in parallel, neither node may be aware of what the other has written, which results in data inconsistency.

2 What is replication factor?

Replication factor is the number of times a data block is stored in the Hadoop cluster. The default replication factor is 3. This means that three times the storage needed to store the actual data is required.
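As a hedged sketch, the replication factor can be set as a client-side default or changed per file through the FileSystem API; the path /data/important.csv is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor for files created with this configuration.
        conf.setInt("dfs.replication", 3);

        // Raise the replication factor of an existing, high-priority file to 5.
        FileSystem fs = FileSystem.get(conf);
        fs.setReplication(new Path("/data/important.csv"), (short) 5);
        fs.close();
    }
}
```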

3 Since the data is replicated on three nodes, will the calculations be performed on all the three nodes?

On execution of MapReduce programs, calculations will be performed only on the original data. If the node on which the calculations are performed fails, then the required calculations will be performed on the second replica.

4 How can a running job be stopped in Hadoop?

A running Hadoop job is stopped by killing it through its job ID, for example with the command hadoop job -kill <job_id> (or mapred job -kill <job_id> in newer releases).

5 What if the DataNodes holding all three replicas fail?

If DataNodes of all the replications fail, then the data cannot be recovered. If the job is of high priority, then the data can be replicated more than three times by changing the replication factor value, which is 3 by default.

6 What is the difference between input split and HDFS block?

Input split is the logical division of data, and HDFS block is the physical division of data.
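A short sketch of how the logical split size can be tuned per job, independently of the physical HDFS block size, using FileInputFormat (the sizes shown are illustrative):

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    // Input splits are computed logically per job; their bounds can differ
    // from the physical HDFS block size.
    public static void configureSplits(Job job) {
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB minimum
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // 256 MB maximum
    }
}
```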

7 Is Hadoop suitable for handling streaming data?

Yes, Hadoop handles streaming data with technologies such as Apache Flume and Apache Spark.

8 Why are the data replications performed in different racks?

The first replica of a data block is placed in one rack, and the second and third replicas are placed together in a different rack from the one holding the first replica. This is done to overcome rack failure.

9 What are the write types in HDFS? And what is the difference between them?

There are two types of writes in HDFS, namely, posted and non-posted. A posted write does not require acknowledgement, whereas in case of a non-posted write, acknowledgement is required.

10 What happens when a JobTracker goes down in Hadoop 1.0?

When the JobTracker fails, all the running jobs are halted and have to be restarted once the JobTracker is brought back up, interrupting the overall execution.

11 What is a storage node and compute node?

The storage node is the computer or the machine where the actual data resides, and the compute node is the machine where the business logic is executed.

12 What happens when 100 tasks are spawned for a job and one task fails?

If a task running on TaskTracker fails, it will be restarted on some other TaskTracker. If the task fails for more than four times, the job will be killed. Four is the default number of times a task can fail, but it can be modified.


CHAPTER OBJECTIVE

This chapter begins to reap the benefits of the big data era. Anticipating the best time to make purchases before prices fall, or keeping up with current trends by following social media, is all possible with big data analysis. A deep insight is given into the various methods with which this massive flood of data can be analyzed, the entire life cycle of big data analysis, and various practical applications of capturing, processing, and analyzing this huge volume of data.

Analyzing the data is always beneficial, yet it is the greatest challenge for organizations. This chapter examines the existing approaches to analyzing stored data to assist organizations in making big business decisions, improving business performance and efficiency, competing with their business rivals, and finding new approaches to grow their business. It delivers insight into the different types of data analysis techniques (descriptive analysis, diagnostic analysis, predictive analysis, prescriptive analysis) used to analyze big data. The data analytics life cycle, from data identification to utilization of the analysis results, is explained. It unfolds the techniques used in big data analysis, that is, quantitative analysis, qualitative analysis, and various types of statistical analysis such as A/B testing, correlation, and regression. Earlier, big data analysis was performed by querying these huge data sets in batch mode. Today's trends have made big data analysis possible in real time, and the tools and technologies that make this possible are explained in this chapter.

6.1   Terminology of Big Data Analytics

6.1.1  Data Warehouse

A data warehouse, also termed an Enterprise Data Warehouse (EDW), is a repository for the data that various organizations and business enterprises collect. It gathers data from diverse sources to make it available for unified access and analysis by data analysts.
