
Driving Big Data with Hadoop Tools and Technologies

5.12 Apache Oozie

A task in the Hadoop environment may require multiple jobs to be sequenced to complete its goal, which is the role of the Oozie component in the Hadoop ecosystem. Oozie allows multiple Map/Reduce jobs to be combined into a logical unit of work that accomplishes the larger task.

Apache Oozie is a tool that manages the workflow of programs in a desired order in the Hadoop environment. Oozie can configure jobs to run on demand or periodically, which gives greater control over jobs by allowing them to be repeated at predetermined intervals. By definition, Apache Oozie is an open-source workflow management engine and scheduler system that runs and manages jobs in the Hadoop distributed environment. It acts as a job coordinator to complete multiple jobs: the jobs run in sequential order to complete a task as a whole, and jobs under a single task can also be scheduled to run in parallel.

Figure 5.22 Pig – internal process (stages shown: Pig Latin Scripts, Parser, Optimizer, Compiler, Execution Engine, MapReduce Job).

Oozie supports many types of Hadoop jobs, including MapReduce, Hive, Pig, Sqoop, and others.

There are three types of Oozie jobs:

Workflow jobs: These jobs are represented as directed acyclic graphs (DAGs) and run on demand.

Coordinator jobs: These jobs are scheduled to execute periodically based on frequency or availability of input data.

Bundle jobs: These are a collection of coordinator jobs run and managed as a single job.

Oozie job definitions for workflow jobs, coordinator jobs, and bundle jobs are written in XML. The Oozie workflow is created when the workflow definition is placed in a file named workflow.xml.
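To make the XML form concrete, the following is a minimal sketch of a workflow definition, held in a Python string and parsed with the standard library. The element names (workflow-app, start, action, fs, ok, error, kill, end) follow the Oozie workflow schema as commonly documented; the application name demo-wf, the action name cleanup, the ${nameNode} parameter, and the path are hypothetical values chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical contents of a workflow.xml file: one file-system action
# between the start and end control nodes, plus a kill node for errors.
WORKFLOW_XML = """
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="cleanup"/>
    <action name="cleanup">
        <fs>
            <delete path="${nameNode}/tmp/demo-output"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.split("}", 1)[-1]


def list_nodes(xml_text):
    """Return (node type, node name) for each top-level node of the workflow."""
    root = ET.fromstring(xml_text.strip())
    return [(local_name(child.tag), child.get("name")) for child in root]


print(list_nodes(WORKFLOW_XML))
# [('start', None), ('action', 'cleanup'), ('kill', 'fail'), ('end', 'end')]
```

The parsing step is only a convenience for inspecting the definition; Oozie itself reads the XML file directly.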

5.12.1 Oozie Workflow

An Oozie workflow has multiple stages. A workflow is a collection of actions, such as Hadoop Map/Reduce, Pig, Hive, or Sqoop jobs, arranged in a control-dependency DAG. An action can also be a non-Hadoop job such as an email notification or a Java application. A control dependency between two actions means that the second action cannot start until the first has completed. An Oozie workflow has control nodes and action nodes. Action nodes specify the actions, that is, the jobs themselves: a MapReduce job, a Hive job, a Pig job, and so forth. Control nodes determine the order in which the actions execute. Because the actions in a workflow depend on each other, an action does not start until its preceding action in the workflow has completed. Oozie workflows can be initiated on demand, but most of the time they run at regular time intervals or are triggered by data availability or external events, and workflow execution schedules are defined based on these parameters. The various control nodes in a workflow are:

Start and end control nodes;

Fork and join control nodes; and

Decision control nodes.

The start and end of the workflow are defined by the start and end control nodes. The fork and join control nodes perform parallel execution of actions. The decision control node selects an execution path within the workflow using the information provided in the job. Figure 5.23 shows an Oozie workflow.
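The roles of the control nodes can be illustrated with a toy interpreter. This is a minimal sketch, not Oozie code: the dictionary layout, the node names (mr-job, split, merge, check, and so on), and the rows value in the context are all hypothetical. Fork paths run one after another here to keep the output deterministic, whereas a real engine would run them in parallel.

```python
def run_branch(nodes, name, log):
    """Run the actions on one fork path; return the name of the node it reaches."""
    while nodes[name]["type"] == "action":
        log.append(name)            # the action "runs" here
        name = nodes[name]["ok"]    # follow the success transition
    return name


def run_workflow(nodes, context):
    """Walk a toy workflow from its start node to its end node.

    Returns the action names in the order they ran.
    """
    log = []
    name = nodes["start"]["to"]
    while nodes[name]["type"] != "end":
        node = nodes[name]
        kind = node["type"]
        if kind == "action":
            log.append(name)
            name = node["ok"]
        elif kind == "fork":
            # Every path leaving the fork must arrive at the same join.
            joins = {run_branch(nodes, path, log) for path in node["paths"]}
            if len(joins) != 1:
                raise ValueError(f"fork {name!r} paths must meet at one join")
            name = joins.pop()
        elif kind == "join":
            name = node["to"]
        elif kind == "decision":
            # Take the first case whose predicate holds, else the default.
            name = next(
                (target for test, target in node["cases"] if test(context)),
                node["default"],
            )
        else:
            raise ValueError(f"unknown node type {kind!r}")
    return log


# Hypothetical workflow: one job, two parallel jobs, then a decision.
NODES = {
    "start": {"type": "start", "to": "mr-job"},
    "mr-job": {"type": "action", "ok": "split"},
    "split": {"type": "fork", "paths": ["hive-job", "pig-job"]},
    "hive-job": {"type": "action", "ok": "merge"},
    "pig-job": {"type": "action", "ok": "merge"},
    "merge": {"type": "join", "to": "check"},
    "check": {
        "type": "decision",
        "cases": [(lambda ctx: ctx["rows"] > 0, "java-job")],
        "default": "shell-job",
    },
    "java-job": {"type": "action", "ok": "end"},
    "shell-job": {"type": "action", "ok": "end"},
    "end": {"type": "end"},
}

print(run_workflow(NODES, {"rows": 10}))
# ['mr-job', 'hive-job', 'pig-job', 'java-job']
print(run_workflow(NODES, {"rows": 0}))
# ['mr-job', 'hive-job', 'pig-job', 'shell-job']
```

The join node does no work of its own; it only marks the point where every forked path must have finished before the workflow moves on.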

5.12.2 Oozie Coordinators

An Oozie workflow runs jobs in a specified sequence. Workflows that have been created and stored still need to be scheduled, and this is done through Oozie coordinators. A coordinator schedules a workflow either based on a frequency parameter, so that jobs are executed at a specific time interval, or based on the availability of all the necessary input data. If input data is unavailable, the workflow is delayed until all the necessary input data becomes available.

Unlike a workflow, a coordinator does not have any execution logic; it simply starts and runs a workflow at the specified time or when the input data becomes available. An Oozie coordinator is defined by the following entities:

Start and end time;

Frequency of execution;

Input data; and

Workflow.

Figure 5.23 Oozie workflow (nodes shown: Start, MapReduce jobs, Fork, Hive job, Pig job, Join, Decision with Yes/No branches, Shell job, Java job, File System job, End).

Oozie coordinators are created based on time when jobs have to run daily or weekly to accomplish certain tasks, such as generating reports for the organization periodically. A time-based coordinator needs three important parameters: the start time, the end time, and the frequency of execution. The start time specifies when the workflow executes for the first time, the end time specifies when it executes for the last time, and the frequency specifies how often the workflow is executed. Once created, a time-based coordinator starts and runs automatically until the defined end time is reached. For example, an Oozie coordinator can be created to run a workflow at 8 p.m. every day for seven days, from November 4, 2016, to November 10, 2016.
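The 8 p.m. example can be checked with a few lines of date arithmetic. This is a minimal sketch under the assumption, taken from the description above, that the end time marks the last execution and is therefore inclusive; the function name run_times is hypothetical, and time zones are ignored.

```python
from datetime import datetime, timedelta


def run_times(start, end, frequency):
    """List every scheduled run from start to end (inclusive), one per frequency step."""
    times = []
    current = start
    while current <= end:
        times.append(current)
        current += frequency
    return times


# 8 p.m. daily, from November 4, 2016, to November 10, 2016.
runs = run_times(
    start=datetime(2016, 11, 4, 20, 0),
    end=datetime(2016, 11, 10, 20, 0),
    frequency=timedelta(days=1),
)

print(len(runs))   # 7
print(runs[0])     # 2016-11-04 20:00:00
print(runs[-1])    # 2016-11-10 20:00:00
```

Changing only the frequency, for instance to timedelta(weeks=1), yields the weekly schedule mentioned above without touching the start and end times.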

An Oozie coordinator created based on the availability of data checks whether the input data is available before triggering a workflow. The input data may be the output of another workflow or may be passed from an external source. When the input data is available, the workflow starts, processes the data, and produces the corresponding output data on completion. A data-based coordinator can also be created to run based on the frequency parameter. For example, a coordinator set to run at 8 a.m. triggers the workflow if the data is available at that time. If the data is not available at 8 a.m., the coordinator waits until the data is available and then triggers the workflow.
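The waiting rule in the 8 a.m. example reduces to taking the later of two moments: the scheduled time and the arrival of the last required input. The sketch below assumes the arrival times are already known, which a real coordinator would discover by checking for the data; the function name trigger_time, the dataset names, and the date are hypothetical.

```python
from datetime import datetime


def trigger_time(scheduled, arrivals):
    """Return when the workflow can start.

    scheduled: the time the coordinator is set to run.
    arrivals:  mapping of input dataset name to the time it became available.
    The workflow starts at the scheduled time, or later if any input is late.
    """
    return max([scheduled, *arrivals.values()])


eight_am = datetime(2016, 11, 4, 8, 0)

# All inputs are ready before 8 a.m.: the workflow starts on schedule.
on_time = trigger_time(eight_am, {
    "clicks": datetime(2016, 11, 4, 7, 30),
    "orders": datetime(2016, 11, 4, 7, 45),
})

# One input arrives at 9:15 a.m.: the workflow waits for it.
delayed = trigger_time(eight_am, {
    "clicks": datetime(2016, 11, 4, 7, 30),
    "orders": datetime(2016, 11, 4, 9, 15),
})

print(on_time)   # 2016-11-04 08:00:00
print(delayed)   # 2016-11-04 09:15:00
```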

5.12.3 Oozie Bundles

An Oozie bundle is a collection of coordinators that specifies the run time of each coordinator. Thus a bundle has one or more coordinators, and a coordinator in turn has one or more workflows. Bundles are especially useful for grouping two or more related coordinators where the output of one coordinator becomes the input of another, and in environments where hundreds or thousands of workflows are scheduled to run on a daily basis.
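The containment described above (a bundle holds coordinators, and each coordinator holds workflows) can be modeled with two small data classes. This is an illustrative sketch only: the class names, fields, and example values are hypothetical and do not correspond to an Oozie API.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Coordinator:
    name: str
    run_time: datetime          # when the bundle starts this coordinator
    workflows: List[str]        # names of the workflows it triggers


@dataclass
class Bundle:
    name: str
    coordinators: List[Coordinator] = field(default_factory=list)

    def start_order(self):
        """Coordinator names sorted by the run time the bundle gives them."""
        ordered = sorted(self.coordinators, key=lambda c: c.run_time)
        return [c.name for c in ordered]

    def workflow_count(self):
        """Total number of workflows managed through this bundle."""
        return sum(len(c.workflows) for c in self.coordinators)


# Hypothetical bundle: the report coordinator consumes what ingest produces,
# so the bundle gives it a later run time.
bundle = Bundle("daily-pipeline", [
    Coordinator("report", datetime(2016, 11, 4, 6, 0), ["build-report"]),
    Coordinator("ingest", datetime(2016, 11, 4, 2, 0), ["load-logs", "clean-logs"]),
])

print(bundle.start_order())      # ['ingest', 'report']
print(bundle.workflow_count())   # 3
```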