
Big Data Storage Concepts

2.1 Cluster Computing

Cluster computing is a distributed or parallel computing system comprising multiple stand-alone PCs connected together and working as a single, integrated, highly available resource. Multiple computing resources are connected in a cluster to constitute a single, larger, and more powerful virtual computer, with each computing resource running its own instance of the OS. The cluster components are connected through local area networks (LANs). Cluster computing technology is used for high availability as well as load balancing, with better system performance and reliability. The benefits of massively parallel processors and cluster computers are high availability, scalable performance, fault tolerance, and the use of cost-effective commodity hardware. Scalability is achieved by adding or removing nodes on demand without hindering system operation. A cluster thus connects a group of systems so that they can share critical computational tasks.

The servers in a cluster are called nodes. Cluster computing can follow a client-server architecture or a peer-to-peer model. It provides high-speed computational power for processing the data-intensive applications associated with big data technologies. Cluster computing with a distributed computation infrastructure provides fast and reliable data processing power to gigantic-sized big data solutions built from integrated yet geographically separated autonomous resources. Clusters make a cost-effective solution for big data because they allow multiple applications to share the computing resources, and they are flexible enough to add more computing resources as the big data technology requires.

Figure 2.1 Big data storage architecture.

Clusters are capable of changing their size dynamically: they shrink when a server shuts down and grow when additional servers are added to handle more load. They survive failures with no or minimal impact. Clusters adopt a failover mechanism to eliminate service interruptions. Failover is the process of switching to a redundant node upon the abnormal termination or failure of a previously active node. Failover is an automatic mechanism that does not require any human intervention, which differentiates it from a manual switch-over operation.
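The failover idea can be illustrated with a minimal sketch. The node names, the stand-in health check, and the promotion logic below are illustrative assumptions, not the mechanism of any particular cluster product:

```python
failed_nodes = set()                                   # nodes reported dead by health checks
cluster = {"active": "node-1", "standbys": ["node-2", "node-3"]}

def is_alive(node):
    # Stand-in for a real heartbeat or ping check.
    return node not in failed_nodes

def failover(cluster):
    """Automatically promote a redundant standby when the active node fails.
    No human intervention is needed, unlike a manual switch-over."""
    if not is_alive(cluster["active"]) and cluster["standbys"]:
        failed = cluster["active"]
        cluster["active"] = cluster["standbys"].pop(0)
        print(f"{failed} failed; {cluster['active']} is now the active node")

failed_nodes.add("node-1")                             # simulate an abnormal termination
failover(cluster)                                      # node-1 failed; node-2 is now the active node
```

A real cluster would run such a check periodically from a monitoring process, so the promotion happens within seconds of the failure being detected.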

Figure 2.2 shows an overview of cluster computing: multiple stand-alone PCs connected together through a dedicated switch. The login node acts as the gateway into the cluster. When the cluster has to be accessed by users from a public network, the user must first log in to the login node; this prevents unauthorized access. Cluster computing follows either a master-slave model or a peer-to-peer model. There are two major types of clusters, namely, high-availability clusters and load-balancing clusters; both are briefly described in the following section.

2.1.1  Types of Cluster

Clusters may be configured for various purposes, such as web-based services or computationally intensive workloads. Based on their purpose, clusters may be classified into two major types:

High availability

Load balancing

Figure 2.2 Cluster computing: users submitting jobs, login node, switch, and cluster compute nodes.

When the availability of the system is critical in the face of node failures, high-availability clusters are used. When the computational workload has to be shared among the cluster nodes, load-balancing clusters are used to improve the overall performance. Thus, computer clusters are configured according to the needs of the business.

2.1.1.1 High Availability Cluster

High-availability clusters are designed to minimize downtime and provide uninterrupted service when nodes fail. Nodes in a highly available cluster must have access to shared storage. Such systems are often used for failover and backup purposes. Without clustering, if the server running an application goes down, the application is unavailable until the server comes back up. In a highly available cluster, if a node becomes inoperative, continuous service is provided by failing the service over from the inoperative cluster node to another one, without administrative intervention. Such clusters must maintain data integrity while failing over the service from one cluster node to another. High-availability systems consist of several nodes that communicate with each other and share information.

High availability makes the system highly fault tolerant through many redundant nodes, which sustain faults and failures. Such systems also ensure high reliability and scalability. The higher the redundancy, the higher the availability. A highly available system eliminates single points of failure.
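The claim that higher redundancy yields higher availability can be made concrete with a back-of-the-envelope calculation. This is only a sketch: the 99% per-node availability is an assumed illustrative figure, and node failures are assumed to be independent:

```python
def cluster_availability(node_availability, redundant_nodes):
    # The cluster is unavailable only if every redundant node is down at the
    # same time (assuming independent failures): 1 - (1 - a) ** n
    return 1 - (1 - node_availability) ** redundant_nodes

for n in (1, 2, 3):
    print(f"{n} node(s): {cluster_availability(0.99, n):.6%}")
# 1 node(s): 99.000000%   2 node(s): 99.990000%   3 node(s): 99.999900%
```

Under these assumptions, each additional redundant node multiplies the remaining unavailability by the single-node failure probability, which is why even modest redundancy raises availability sharply.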

Highly available systems are essential for an organization that has to protect its business against the loss of transactional data or incomplete data and overcome the risk of system outages. These risks, under certain circumstances, can cause losses of millions of dollars to the business. Certain applications, such as online platforms, may face sudden increases in traffic, and managing these traffic spikes requires a robust solution such as cluster computing. Billing, banking, and e-commerce demand a system that is highly available with zero loss of transactional data.

2.1.1.2 Load Balancing Cluster

Load-balancing clusters are designed to distribute workloads across the cluster nodes so that the service load is shared among them. If a node in a load-balancing cluster goes down, its load is switched over to another node. This is achieved by keeping identical copies of the data on all the nodes, so the remaining nodes can absorb the increase in load. The main objectives of load balancing are to optimize resource use, minimize response time, maximize throughput, and avoid overloading any single resource. Resources are used efficiently in this kind of cluster because there is fine-grained control over the way requests are routed. Such routing is essential when the cluster is composed of machines that are not equally capable; in that case, lower-performance machines are assigned a smaller share of the work. Instead of using a single, very expensive and very powerful server, load balancing can be used to share the load across several inexpensive, lower-performing systems for better scalability.

Round-robin load balancing, weight-based load balancing, random load balancing, and server-affinity load balancing are common load-balancing algorithms.

Round-robin load balancing chooses servers sequentially, starting from the top of the list; once the last server has been chosen, it resets back to the top. The weight-based algorithm takes into account a weight previously assigned to each server. The weight field is given a numerical value between 1 and 100, which determines the proportion of the load the server can bear with respect to the other servers. If the servers carry equal weights, an equal proportion of the load is distributed among them. Random load balancing routes requests to servers at random; it is suitable only for homogeneous clusters, where the machines are similarly configured, because random routing does not account for differences in the machines' processing power. Server-affinity load balancing is the ability of the load balancer to remember the server on which a client initiated its requests and to route that client's subsequent requests to the same server.
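The four strategies can be sketched in a few lines of Python. The server names, weight values, and client identifier below are illustrative assumptions rather than the interface of any particular load balancer, and the weight-based variant is realized here as simple weighted random selection, one straightforward way to achieve proportional sharing:

```python
import itertools
import random

servers = ["server-1", "server-2", "server-3"]
weights = {"server-1": 50, "server-2": 30, "server-3": 20}   # values between 1 and 100

# Round robin: walk the list in order and wrap back to the top.
_rr = itertools.cycle(servers)
def round_robin():
    return next(_rr)

# Weight-based: each server receives a share of requests proportional to its weight.
def weight_based():
    return random.choices(list(weights), weights=list(weights.values()), k=1)[0]

# Random: only sensible for homogeneous clusters of similarly configured machines.
def random_routing():
    return random.choice(servers)

# Server affinity: remember which server first served a client and reuse it.
_affinity = {}
def server_affinity(client_id):
    if client_id not in _affinity:
        _affinity[client_id] = round_robin()
    return _affinity[client_id]

print([round_robin() for _ in range(4)])      # ['server-1', 'server-2', 'server-3', 'server-1']
print(server_affinity("client-42"))           # the same server is returned on every later call
```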

2.1.2 Cluster Structure

In a basic cluster structure, a group of computers is linked together and works as a single computer. Clusters are deployed to improve performance and availability.

Based on how these computers are linked together, cluster structure is classified into two types:

Symmetric clusters

Asymmetric clusters

A symmetric cluster is a type of cluster structure in which each node functions as an individual computer capable of running applications. The symmetric cluster setup is simple and straightforward: a sub-network is created with the individual machines, or the machines are added to an existing network, and cluster-specific software is installed on them. Additional machines can be added as needed.

Figure 2.3 shows a symmetric cluster.

An asymmetric cluster is a type of cluster structure in which one machine acts as the head node and serves as the gateway between the users and the remaining nodes. Figure 2.4 shows an asymmetric cluster.
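The difference between the two structures can be shown with a minimal sketch; the node names and the round-robin dispatch used by the head node are illustrative assumptions:

```python
import itertools

nodes = ["node-1", "node-2", "node-3"]

# Symmetric cluster: every node can accept and run an application directly,
# so a user may submit work to any node.
def submit_symmetric(job, node):
    return f"{node} runs {job}"

# Asymmetric cluster: the head node is the single gateway between the user and
# the remaining nodes; here it simply forwards jobs to them in turn.
_next_node = itertools.cycle(nodes)
def submit_asymmetric(job):
    return f"head node forwards {job} to {next(_next_node)}"

print(submit_symmetric("job-A", "node-2"))    # node-2 runs job-A
print(submit_asymmetric("job-B"))             # head node forwards job-B to node-1
```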

Figure 2.3 Symmetric clusters.

Figure 2.4 Asymmetric cluster.