
Driving Big Data with Hadoop Tools and Technologies

5.5 HBASE

HBase is a column-oriented NoSQL database: a horizontally scalable, open-source distributed database built on top of HDFS. As a NoSQL database, it does not require a predefined schema, and it supports both structured and unstructured data. HBase provides real-time, random access to massive amounts of structured data in HDFS. Hadoop by itself can access data sets only in sequential fashion, so even a simple job over a huge data set may take a long time to produce the desired output, which results in high latency. Hence, HBase came into the picture to access the data randomly. Hadoop stores data in flat files, while HBase stores data as key-value pairs in a column-oriented fashion. Also, Hadoop supports write once, read many times, while HBase supports both reading and writing many times. HBase was designed to support the storage of structured data and is based on Google's Bigtable.
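The key-value, column-oriented data model and the random access it enables can be illustrated with a minimal sketch. This is a toy in-memory model, not the real HBase client API; the class, table, and column names are invented for illustration:

```python
# Toy sketch of HBase's data model: rows keyed by a row key, values stored
# under column-family:qualifier coordinates, fetched randomly by key rather
# than by scanning a flat file sequentially.

class ToyHBaseTable:
    def __init__(self):
        self.rows = {}  # row_key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        # Random access: direct lookup by row key, no sequential scan.
        return self.rows.get(row_key, {}).get(column)

table = ToyHBaseTable()
table.put("user42", "info:name", "Asha")
table.put("user42", "stats:logins", 17)
print(table.get("user42", "info:name"))  # -> Asha
```

Note that, as in HBase, a row need not have a value for every column; only the cells actually written are stored.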

Figure 5.17 shows the HBase master-slave architecture with HMaster, RegionServer, HFile, MemStore, write-ahead log (WAL), and Zookeeper. The HBase master is called HMaster and coordinates the client application with the RegionServers. The HBase slave is the HRegionServer, and there may be multiple HRegions in an HRegionServer. Each region acts as a database and holds part of a table's data. Each HRegion has one WAL, multiple HFiles, and an associated MemStore. The WAL is the technique used for storing logs. HMaster and the HRegionServers work in coordination to serve the cluster.

HBase has no replication features of its own; replication has to be provided by the underlying file system. HDFS is the most commonly used file system because of its fault tolerance, built-in replication, and scalability. HBase finds application in medicine, sports, the web, e-commerce, and so forth.

HMaster – HMaster is the master node in the HBase architecture, similar to the NameNode in Hadoop. It is the master for all the RegionServers running on several machines, and it holds the metadata. It is also responsible for RegionServer failover and auto sharding of regions. To provide high availability, an HBase cluster can have more than one HMaster, but only one HMaster is active at a time; all other HMasters are passive until the active HMaster goes down. If the master goes down, the cluster may continue to work, since clients communicate directly with the RegionServers. However, because region splits and RegionServer failover are performed by the HMaster, it has to be restarted as soon as possible. In HBase, hbase:meta is the catalog table where the list of all the regions is stored.
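The active/passive master arrangement can be sketched as a simple promotion queue. In a real cluster this election is coordinated through Zookeeper; the pool class and master names below are hypothetical:

```python
# Toy sketch of active/passive HMaster failover: several masters register,
# only one is active, and a standby is promoted when the active one dies.

class MasterPool:
    def __init__(self, names):
        self.standby = list(names)
        self.active = self.standby.pop(0)  # first registrant becomes active

    def fail_active(self):
        # Active master went down; promote a standby if one exists.
        self.active = self.standby.pop(0) if self.standby else None
        return self.active

pool = MasterPool(["hmaster-1", "hmaster-2", "hmaster-3"])
print(pool.active)         # -> hmaster-1 (the others stay passive)
print(pool.fail_active())  # -> hmaster-2 takes over
```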

Zookeeper – Zookeeper provides a centralized service and manages the coordination between the components of a distributed system. It facilitates better reachability to the system components.

RegionServer – A RegionServer holds a set of regions. RegionServers hold the actual data, much as in a Hadoop cluster the NameNode holds the metadata and the DataNodes hold the actual data. A RegionServer serves the regions assigned to it, handles read/write requests, and maintains HLogs. Figure 5.18 shows a RegionServer.

Figure 5.17 HBase architecture: Zookeeper, HMaster, and a RegionServer containing an HRegion with its write-ahead log (WAL), HFile, and MemStore, layered over HDFS and MapReduce in Hadoop and exposed through the HBase API.

Region – The tables in HBase are split into smaller chunks called regions, and these regions are distributed across multiple RegionServers. The distribution of regions across the RegionServers is handled by the master. Two types of files are used for data storage in a region: the HLog, which is the WAL, and the HFile, which is the actual data storage file.
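How a row key is routed to the region that holds it can be sketched with sorted split points, mimicking the range lookup a client performs against the catalog table. The split points and server names below are illustrative, not from a real cluster:

```python
import bisect

# Sketch of routing a row key to its region: a table is partitioned into
# contiguous row-key ranges, and each range lives on some RegionServer.

split_points = ["g", "n", "t"]          # region boundaries (start keys)
servers = ["rs1", "rs2", "rs3", "rs4"]  # one region per server in this toy

def locate(row_key):
    # bisect finds which key range the row key falls into.
    idx = bisect.bisect_right(split_points, row_key)
    return servers[idx]

print(locate("apple"))  # -> rs1 (keys before "g")
print(locate("mango"))  # -> rs2 (between "g" and "n")
print(locate("zebra"))  # -> rs4 (at or after "t")
```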

WAL – A data write is not performed directly on the disk; rather, it is placed in the MemStore before it is written to the disk. Because the MemStore is volatile, the data would be lost if the RegionServer failed before the MemStore was flushed. So, to avoid data loss, each write is written into the log first and then into the MemStore; if the RegionServer goes down, the data can be effectively recovered from the log.
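The log-first discipline and the recovery it enables can be sketched as follows. The log and MemStore here are ordinary Python objects standing in for the durable file and the in-memory store:

```python
# Sketch of the write-ahead-log discipline: every edit is appended to the
# (durable) log before touching the (volatile) MemStore, so a crashed
# RegionServer can replay the log to rebuild the MemStore.

wal = []        # stands in for the durable log on disk
memstore = {}   # volatile in-memory store

def write(key, value):
    wal.append((key, value))   # 1. log first
    memstore[key] = value      # 2. then apply to the MemStore

write("row1", "a")
write("row2", "b")

memstore.clear()               # simulate a RegionServer crash: MemStore lost

for key, value in wal:         # recovery: replay the WAL in order
    memstore[key] = value

print(memstore)  # -> {'row1': 'a', 'row2': 'b'}
```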

HFile – HFiles are the files where the actual data are stored on the disk. The file contains several data blocks, and the default size of each data block is 64 KB. For example, a 100 MB file can be split up into multiple 64 KB blocks and stored in HFile.
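The block arithmetic in the example above works out as follows:

```python
# Back-of-envelope check: a 100 MB file stored in 64 KB HFile data blocks.

file_size_kb = 100 * 1024   # 100 MB expressed in KB
block_size_kb = 64          # default HFile data-block size
blocks = file_size_kb // block_size_kb
print(blocks)               # -> 1600 blocks
```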

MemStore – Data that have to be written to the disk are first written to the MemStore and the WAL. When the MemStore is full, a new HFile is created on HDFS, and the data from the MemStore are flushed to the disk.
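The flush cycle can be sketched with a toy size threshold. Real HBase flushes when a region's MemStore reaches a configured byte size (128 MB by default); counting entries here is a deliberate simplification:

```python
# Sketch of a MemStore flush: writes accumulate in memory until a threshold
# is reached, then the sorted contents are written out as a new immutable
# HFile and the MemStore is emptied.

FLUSH_THRESHOLD = 3   # toy limit; real HBase uses a byte-size threshold

memstore = {}
hfiles = []           # each flush produces a new file on "disk"

def put(key, value):
    memstore[key] = value
    if len(memstore) >= FLUSH_THRESHOLD:
        hfiles.append(sorted(memstore.items()))  # HFiles hold sorted KVs
        memstore.clear()

for i in range(7):
    put(f"row{i}", i)

print(len(hfiles), len(memstore))  # -> 2 flushed HFiles, 1 entry pending
```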

5.5.1 Features of HBase

Automatic failover – HBase failover is supported through HRegionServer replication.

Auto sharding – HBase regions contain contiguous rows and are split by the system into smaller regions when a threshold size is reached. Initially a table has only one region; as data are added and the configured maximum size is exceeded, the region is split up. Each region is served by an HRegionServer, and each HRegionServer can serve more than one region at a time.

Figure 5.18 RegionServer architecture: a RegionServer with its write-ahead log (WAL) and several regions, each holding an HFile and a MemStore, stored on an HDFS DataNode.
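A region split can be sketched as dividing a sorted run of row keys at its midpoint. Real HBase splits on region size in bytes and picks a split key from the stored data; the row-count threshold here is illustrative:

```python
# Toy illustration of auto sharding: when a region's row count crosses a
# threshold, it splits at its middle key into two daughter regions.

MAX_ROWS = 4

def maybe_split(region):
    # region is a sorted list of row keys
    if len(region) <= MAX_ROWS:
        return [region]
    mid = len(region) // 2
    return [region[:mid], region[mid:]]   # split at the middle row key

region = ["a", "c", "f", "k", "p", "t"]
print(maybe_split(region))  # -> [['a', 'c', 'f'], ['k', 'p', 't']]
```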

Horizontal scalability – HBase is horizontally scalable, which enables the system to scale wider to meet increasing demand without upgrading the server, as vertical scalability would require. More nodes can be added to the cluster on the fly. Since scaling out uses low-cost commodity hardware and storage components, HBase is cost effective.

Column oriented – In contrast with a relational database, which is row-oriented, HBase is column-oriented. A column-store database saves data in sections of columns rather than sections of rows.
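The difference between the two layouts can be shown by laying the same records out on "disk" both ways. The records are invented for illustration:

```python
# Row-oriented vs column-oriented layout of the same three records.
# A column store keeps each column's values contiguous, so scanning a
# single column touches far less data than reading whole rows.

records = [("u1", "Asha", 34), ("u2", "Ben", 28), ("u3", "Cho", 41)]

# Row-oriented: whole records stored together, one after another.
row_store = [field for record in records for field in record]

# Column-oriented: each column's values stored together.
column_store = [value for column in zip(*records) for value in column]

print(row_store)     # -> ['u1', 'Asha', 34, 'u2', 'Ben', 28, ...]
print(column_store)  # -> ['u1', 'u2', 'u3', 'Asha', 'Ben', 'Cho', ...]
```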

HDFS is the most common file system used by HBase. Since HBase has a pluggable file system architecture, it can run on any other supported file system as well. Also, HBase provides massive parallel processing through the MapReduce framework.