
Hadoop Beginner's Guide

Learn how to crunch big data to extract meaning from the

data avalanche

Garry Turkington


Hadoop Beginner's Guide

Copyright © 2013 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing and its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2013

Production Reference: 1150213

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-84951-730-0

www.packtpub.com


Credits

Author

Garry Turkington

Reviewers

David Gruzman
Muthusamy Manigandan
Vidyasagar N V

Acquisition Editor

Robin de Jongh

Lead Technical Editor

Azharuddin Sheikh

Technical Editors

Ankita Meshram
Varun Pius Rodrigues

Copy Editors

Brandt D'Mello
Aditya Nair
Laxmi Subramanian
Ruta Waghmare

Project Coordinator

Leena Purkait

Proofreader

Maria Gould

Indexer

Hemangini Bari

Production Coordinator

Nitesh Thakur


About the Author

Garry Turkington

has 14 years of industry experience, most of which has been focused on the design and implementation of large-scale distributed systems. In his current roles as VP Data Engineering and Lead Architect at Improve Digital, he is primarily responsible for the realization of systems that store, process, and extract value from the company's large data volumes. Before joining Improve Digital, he spent time at Amazon.co.uk, where he led several software development teams building systems that process Amazon catalog data for every item worldwide. Prior to this, he spent a decade in various government positions in both the UK and USA.

He has BSc and PhD degrees in Computer Science from Queen's University Belfast in Northern Ireland and an MEng in Systems Engineering from Stevens Institute of Technology in the USA.


About the Reviewers

David Gruzman

is a Hadoop and big data architect with more than 18 years of hands-on experience, specializing in the design and implementation of scalable high-performance distributed systems. He has extensive expertise in OOA/OOD and (R)DBMS technology. He is an adept of Agile methodology and strongly believes that a daily coding routine makes good software architects. He is interested in solving challenging problems related to real-time analytics and the application of machine learning algorithms to big data sets.

He founded—and is working with—BigDataCraft.com, a boutique consulting firm in the area of big data. Visit their site at www.bigdatacraft.com. David can be contacted at david@bigdatacraft.com. More detailed information about his skills and experience can be found at http://www.linkedin.com/in/davidgruzman.


Vidyasagar N V

has been interested in computer science since an early age. Some of his serious work in computers and computer networks began during his high school days. Later, he went to the prestigious Institute of Technology, Banaras Hindu University, for his B.Tech. He has been working as a software developer and data expert, developing and building scalable systems. He has worked with a variety of second, third, and fourth generation languages. He has worked with flat files, indexed files, hierarchical databases, network databases, relational databases, NoSQL databases, Hadoop, and related technologies. Currently, he is working as Senior Developer at Collective Inc., developing big data-based techniques for extracting structured data from the Web and local information. He enjoys producing high-quality software and web-based solutions and designing secure and scalable data systems. He can be contacted at vidyasagar1729@gmail.com.


www.PacktPub.com

Support files, eBooks, discount offers and more

You might want to visit www.PacktPub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.

Why Subscribe?

‹ Fully searchable across every book published by Packt

‹ Copy and paste, print and bookmark content

‹ On demand and accessible via web browser

Free Access for Packt account holders


Table of Contents

Preface 1

Chapter 1: What It's All About 7

Big data processing 8

The value of data 8

Historically for the few and not the many 9

Classic data processing systems 9

Limiting factors 10

A different approach 11

All roads lead to scale-out 11

Share nothing 11

Expect failure 12

Smart software, dumb hardware 13

Move processing, not data 13

Build applications, not infrastructure 14

Hadoop 15

Thanks, Google 15

Thanks, Doug 15

Thanks, Yahoo 15

Parts of Hadoop 15

Common building blocks 16

HDFS 16

MapReduce 17

Better together 18

Common architecture 19

What it is and isn't good for 19

Cloud computing with Amazon Web Services 20

Too many clouds 20

A third way 20

Different types of costs 21

AWS – infrastructure on demand from Amazon 22

Elastic Compute Cloud (EC2) 22


Elastic MapReduce (EMR) 22

What this book covers 23

A dual approach 23

Summary 24

Chapter 2: Getting Hadoop Up and Running 25

Hadoop on a local Ubuntu host 25

Other operating systems 26

Time for action – checking the prerequisites 26

Setting up Hadoop 27

A note on versions 27

Time for action – downloading Hadoop 28

Time for action – setting up SSH 29

Configuring and running Hadoop 30

Time for action – using Hadoop to calculate Pi 30

Three modes 32

Time for action – configuring the pseudo-distributed mode 32

Configuring the base directory and formatting the filesystem 34

Time for action – changing the base HDFS directory 34
Time for action – formatting the NameNode 35

Starting and using Hadoop 36

Time for action – starting Hadoop 36

Time for action – using HDFS 38

Time for action – WordCount, the Hello World of MapReduce 39

Monitoring Hadoop from the browser 42

The HDFS web UI 42

Using Elastic MapReduce 45

Setting up an account on Amazon Web Services 45

Creating an AWS account 45

Signing up for the necessary services 45

Time for action – WordCount in EMR using the management console 46

Other ways of using EMR 54

AWS credentials 54

The EMR command-line tools 54

The AWS ecosystem 55

Comparison of local versus EMR Hadoop 55

Summary 56

Chapter 3: Understanding MapReduce 57

Key/value pairs 57

What it means 57

Why key/value data? 58

Some real-world examples 59


The Hadoop Java API for MapReduce 60

The 0.20 MapReduce Java API 61

The Mapper class 61

The Reducer class 62

The Driver class 63

Writing MapReduce programs 64

Time for action – setting up the classpath 65
Time for action – implementing WordCount 65
Time for action – building a JAR file 68
Time for action – running WordCount on a local Hadoop cluster 68
Time for action – running WordCount on EMR 69

The pre-0.20 Java MapReduce API 72

Hadoop-provided mapper and reducer implementations 73

Time for action – WordCount the easy way 73

Walking through a run of WordCount 75

Startup 75

Mapper output and reduce input 77

Partitioning 77

The optional partition function 78

Reducer input 78

Reducer execution 79

Reducer output 79

Shutdown 79

That's all there is to it! 80

Apart from the combiner…maybe 80

Why have a combiner? 80

Time for action – WordCount with a combiner 80

When you can use the reducer as the combiner 81

Time for action – fixing WordCount to work with a combiner 81

Reuse is your friend 82

Hadoop-specific data types 83

The Writable and WritableComparable interfaces 83

Introducing the wrapper classes 84

Primitive wrapper classes 85

Array wrapper classes 85


Time for action – using the Writable wrapper classes 86

Other wrapper classes 88

Making your own 88

Input/output 88

Files, splits, and records 89

InputFormat and RecordReader 89

Hadoop-provided InputFormat 90

Hadoop-provided RecordReader 90

Output formats and RecordWriter 91

Hadoop-provided OutputFormat 91

Don't forget Sequence files 91

Summary 92

Chapter 4: Developing MapReduce Programs 93

Using languages other than Java with Hadoop 94

How Hadoop Streaming works 94

Why to use Hadoop Streaming 94

Time for action – WordCount using Streaming 95

Differences in jobs when using Streaming 97

Analyzing a large dataset 98

Getting the UFO sighting dataset 98

Getting a feel for the dataset 99

Time for action – summarizing the UFO data 99

Examining UFO shapes 101

Time for action – summarizing the shape data 102
Time for action – correlating sighting duration to UFO shape 103

Using Streaming scripts outside Hadoop 106

Time for action – performing the shape/time analysis from the command line 107

Java shape and location analysis 107

Time for action – using ChainMapper for field validation/analysis 108

Too many abbreviations 112

Using the Distributed Cache 113

Time for action – using the Distributed Cache to improve location output 114
Counters, status, and other output 117
Time for action – creating counters, task states, and writing log output 118

Too much information! 125

Summary 126

Chapter 5: Advanced MapReduce Techniques 127

Simple, advanced, and in-between 127


When this is a bad idea 128

Map-side versus reduce-side joins 128

Matching account and sales information 129

Time for action – reduce-side joins using MultipleInputs 129

DataJoinMapper and TaggedMapperOutput 134

Implementing map-side joins 135

Using the Distributed Cache 135

Pruning data to fit in the cache 135

Using a data representation instead of raw data 136

Using multiple mappers 136

To join or not to join... 137

Graph algorithms 137

Graph 101 138

Graphs and MapReduce – a match made somewhere 138

Representing a graph 139

Time for action – representing the graph 140

Overview of the algorithm 140

The mapper 141

The reducer 141

Iterative application 141

Time for action – creating the source code 142

Time for action – the first run 146

Time for action – the second run 147

Time for action – the third run 148

Time for action – the fourth and last run 149

Running multiple jobs 151

Final thoughts on graphs 151

Using language-independent data structures 151

Candidate technologies 152

Introducing Avro 152

Time for action – getting and installing Avro 152

Avro and schemas 154

Time for action – defining the schema 154
Time for action – creating the source Avro data with Ruby 155
Time for action – consuming the Avro data with Java 156

Using Avro within MapReduce 158

Time for action – generating shape summaries in MapReduce 158
Time for action – examining the output data with Ruby 163
Time for action – examining the output data with Java 163

Going further with Avro 165


Chapter 6: When Things Break 167

Cluster setup, test files, and block sizes 169

Fault tolerance and Elastic MapReduce 170

Time for action – killing a DataNode process 170

NameNode and DataNode communication 173

Time for action – the replication factor in action 174
Time for action – intentionally causing missing blocks 176

When data may be lost 178

Block corruption 179

Time for action – killing a TaskTracker process 180

Comparing the DataNode and TaskTracker failures 183

Permanent failure 184

Killing the cluster masters 184

Time for action – killing the JobTracker 184

Starting a replacement JobTracker 185

Time for action – killing the NameNode process 186

Starting a replacement NameNode 188

The role of the NameNode in more detail 188

File systems, files, blocks, and nodes 188

The single most important piece of data in the cluster – fsimage 189

DataNode startup 189

Safe mode 190

SecondaryNameNode 190

So what to do when the NameNode process has a critical failure? 190

BackupNode/CheckpointNode and NameNode HA 191

Hardware failure 191

Host failure 191

Host corruption 192

The risk of correlated failures 192

Task failure due to software 192

Failure of slow running tasks 192

Time for action – causing task failure 193

Hadoop's handling of slow-running tasks 195

Speculative execution 195

Hadoop's handling of failing tasks 195

Task failure due to data 196

Handling dirty data through code 196


Time for action – handling dirty data by using skip mode 197

To skip or not to skip... 202

Summary 202

Chapter 7: Keeping Things Running 205

A note on EMR 206

Hadoop configuration properties 206

Default values 206

Time for action – browsing default properties 206

Additional property elements 208

Default storage location 208

Where to set properties 209

Setting up a cluster 209

How many hosts? 210

Calculating usable space on a node 210

Location of the master nodes 211

Sizing hardware 211

Processor / memory / storage ratio 211

EMR as a prototyping platform 212

Special node requirements 213

Storage types 213

Commodity versus enterprise class storage 214

Single disk versus RAID 214

Finding the balance 214

Network storage 214

Hadoop networking configuration 215

How blocks are placed 215

Rack awareness 216

Time for action – examining the default rack configuration 216
Time for action – adding a rack awareness script 217

What is commodity hardware anyway? 219

Cluster access control 220

The Hadoop security model 220

Time for action – demonstrating the default security 220

User identity 223

More granular access control 224

Working around the security model via physical access control 224

Managing the NameNode 224

Configuring multiple locations for the fsimage class 225

Time for action – adding an additional fsimage location 225

Where to write the fsimage copies 226

Swapping to another NameNode host 227


Time for action – swapping to a new NameNode host 227

Don't celebrate quite yet! 229

What about MapReduce? 229

Managing HDFS 230

Where to write data 230

Using balancer 230

When to rebalance 230

MapReduce management 231

Command line job management 231

Job priorities and scheduling 231

Time for action – changing job priorities and killing a job 232

Alternative schedulers 233

Capacity Scheduler 233

Fair Scheduler 234

Enabling alternative schedulers 234

When to use alternative schedulers 234

Scaling 235

Adding capacity to a local Hadoop cluster 235

Adding capacity to an EMR job flow 235

Expanding a running job flow 235

Summary 236

Chapter 8: A Relational View on Data with Hive 237

Overview of Hive 237

Time for action – installing Hive 239

Using Hive 241

Time for action – creating a table for the UFO data 241
Time for action – inserting the UFO data 244

Validating the data 246

Time for action – validating the table 246
Time for action – redefining the table with the correct column separator 248

Hive tables – real or not? 250

Time for action – creating a table from an existing file 250
Time for action – performing a join 252

Hive and SQL views 254

Time for action – using views 254


Time for action – exporting query output 258

Partitioning the table 260

Time for action – making a partitioned UFO sighting table 260

Bucketing, clustering, and sorting... oh my! 264

User Defined Function 264

Time for action – adding a new User Defined Function (UDF) 265

To preprocess or not to preprocess... 268

Hive versus Pig 269

What we didn't cover 269

Hive on Amazon Web Services 270

Time for action – running UFO analysis on EMR 270

Using interactive job flows for development 277

Integration with other AWS products 278

Summary 278

Chapter 9: Working with Relational Databases 279

Common data paths 279

Hadoop as an archive store 280

Hadoop as a preprocessing step 280

Hadoop as a data input tool 281

The serpent eats its own tail 281

Setting up MySQL 281

Time for action – installing and setting up MySQL 281

Did it have to be so hard? 284

Time for action – configuring MySQL to allow remote connections 285

Don't do this in production! 286

Time for action – setting up the employee database 286

Be careful with data file access rights 287

Getting data into Hadoop 287

Using MySQL tools and manual import 288

Accessing the database from the mapper 288

A better way – introducing Sqoop 289

Time for action – downloading and configuring Sqoop 289

Sqoop and Hadoop versions 290

Sqoop and HDFS 291

Time for action – exporting data from MySQL to HDFS 291

Sqoop's architecture 294

Importing data into Hive using Sqoop 294

Time for action – exporting data from MySQL into Hive 295
Time for action – a more selective import 297


Time for action – using a type mapping 299
Time for action – importing data from a raw query 300

Sqoop and Hive partitions 302

Field and line terminators 302

Getting data out of Hadoop 303

Writing data from within the reducer 303

Writing SQL import files from the reducer 304

A better way – Sqoop again 304

Time for action – importing data from Hadoop into MySQL 304

Differences between Sqoop imports and exports 306

Inserts versus updates 307

Sqoop and Hive exports 307

Time for action – importing Hive data into MySQL 308
Time for action – fixing the mapping and re-running the export 310

Other Sqoop features 312

AWS considerations 313

Considering RDS 313

Summary 314

Chapter 10: Data Collection with Flume 315

A note about AWS 315

Data data everywhere 316

Types of data 316

Getting network traffic into Hadoop 316

Time for action – getting web server data into Hadoop 316

Getting files into Hadoop 318

Hidden issues 318

Keeping network data on the network 318

Hadoop dependencies 318

Reliability 318

Re-creating the wheel 318

A common framework approach 319

Introducing Apache Flume 319

A note on versioning 319

Time for action – installing and configuring Flume 320

Using Flume to capture network data 321

Time for action – capturing network traffic to a log file 321
Time for action – logging to the console 324

Writing network data to log files 326

Time for action – capturing the output of a command in a flat file 326

Logs versus files 327

Time for action – capturing a remote file in a local flat file 328


Sources 330

Sinks 330

Channels 330

Or roll your own 331

Understanding the Flume configuration files 331

It's all about events 332

Time for action – writing network traffic onto HDFS 333
Time for action – adding timestamps 335

To Sqoop or to Flume... 337

Time for action – multi level Flume networks 338
Time for action – writing to multiple sinks 340

Selectors replicating and multiplexing 342

Handling sink failure 342

Chapter 11: Where to Go Next 347

What we did and didn't cover in this book 347

Upcoming Hadoop changes 348

Alternative distributions 349

Why alternative distributions? 349

Bundling 349

Free and commercial extensions 349


Sources of information 356

Source code 356

Mailing lists and forums 356

LinkedIn groups 356

HUGs 356

Conferences 357

Summary 357

Appendix: Pop Quiz Answers 359

Chapter 3, Understanding MapReduce 359

Chapter 7, Keeping Things Running 360


Preface

This book is here to help you make sense of Hadoop and use it to solve your big data problems. It's a really exciting time to work with data processing technologies such as Hadoop. The ability to apply complex analytics to large data sets—once the monopoly of large corporations and government agencies—is now possible through free open source software (OSS).

But because of the seeming complexity and pace of change in this area, getting a grip on the basics can be somewhat intimidating. That's where this book comes in, giving you an understanding of just what Hadoop is, how it works, and how you can use it to extract value from your data now.

In addition to an explanation of core Hadoop, we also spend several chapters exploring other technologies that either use Hadoop or integrate with it. Our goal is to give you an understanding not just of what Hadoop is but also how to use it as a part of your broader technical infrastructure.

A complementary technology is the use of cloud computing, and in particular, the offerings from Amazon Web Services. Throughout the book, we will show you how to use these services to host your Hadoop workloads, demonstrating that not only can you process large data volumes, but also you don't actually need to buy any physical hardware to do so.

What this book covers


Chapter 1, What It's All About, gives an overview of the trends that have made Hadoop and cloud computing such important technologies today.

Chapter 2, Getting Hadoop Up and Running, walks you through the initial setup of a local Hadoop cluster and the running of some demo jobs. For comparison, the same work is also executed on Amazon's hosted Hadoop service.

Chapter 3, Understanding MapReduce, goes inside the workings of Hadoop to show how MapReduce jobs are executed and shows how to write applications using the Java API.

Chapter 4, Developing MapReduce Programs, takes a case study of a moderately sized data set to demonstrate techniques to help when deciding how to approach the processing and analysis of a new data source.

Chapter 5, Advanced MapReduce Techniques, looks at a few more sophisticated ways of applying MapReduce to problems that don't necessarily seem immediately suited to the Hadoop processing model.

Chapter 6, When Things Break, examines Hadoop's much-vaunted high availability and fault tolerance in some detail and sees just how good it is by intentionally causing havoc through killing processes and using corrupt data.

Chapter 7, Keeping Things Running, takes a more operational view of Hadoop and will be of most use for those who need to administer a Hadoop cluster. Along with demonstrating some best practice, it describes how to prepare for the worst operational disasters so you can sleep at night.

Chapter 8, A Relational View On Data With Hive, introduces Apache Hive, which allows Hadoop data to be queried with a SQL-like syntax.

Chapter 9, Working With Relational Databases, explores how Hadoop can be integrated with existing databases, and in particular, how to move data from one to the other.

Chapter 10, Data Collection with Flume, shows how Apache Flume can be used to gather data from multiple sources and deliver it to destinations such as Hadoop.

Chapter 11, Where To Go Next, wraps up the book with an overview of the broader Hadoop ecosystem, highlighting other products and technologies of potential interest. In addition, it gives some ideas on how to get involved with the Hadoop community and to get help.

What you need for this book


In the simplest case, a single Linux-based machine will give you a platform to explore almost all the exercises in this book. We assume you have a recent distribution of Ubuntu, but as long as you have command-line Linux familiarity any modern distribution will suffice.

Some of the examples in later chapters really need multiple machines to see things working, so you will require access to at least four such hosts. Virtual machines are completely acceptable; they're not ideal for production but are fine for learning and exploration. Since we also explore Amazon Web Services in this book, you can run all the examples on EC2 instances, and we will look at some other more Hadoop-specific uses of AWS throughout the book. AWS services are usable by anyone, but you will need a credit card to sign up!

Who this book is for

We assume you are reading this book because you want to know more about Hadoop at a hands-on level; the key audience is those with software development experience but no prior exposure to Hadoop or similar big data technologies.

For developers who want to know how to write MapReduce applications, we assume you are comfortable writing Java programs and are familiar with the Unix command-line interface. We will also show you a few programs in Ruby, but these are usually only to demonstrate language independence, and you don't need to be a Ruby expert.

For architects and system administrators, the book also provides significant value in explaining how Hadoop works, its place in the broader architecture, and how it can be managed operationally. Some of the more involved techniques in Chapter 4, Developing MapReduce Programs, and Chapter 5, Advanced MapReduce Techniques, are probably of less direct interest to this audience.

Conventions

In this book, you will find several headings appearing frequently.

To give clear instructions of how to complete a procedure or task, we use:

Time for action – heading

1. Action 1
2. Action 2
3. Action 3


What just happened?

This heading explains the working of tasks or instructions that you have just completed. You will also find some other learning aids in the book, including:

Pop quiz – heading

These are short multiple-choice questions intended to help you test your own understanding.

Have a go hero – heading

These set practical challenges and give you ideas for experimenting with what you have learned.

You will also find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text are shown as follows: "You may notice that we used the Unix command rm to remove the Drush directory rather than the DOS del command."

A block of code is set as follows:

# * Fine Tuning
#
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 16M
thread_stack = 512K
thread_cache_size = 8
max_connections = 300

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

# * Fine Tuning
#
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 16M


Any command-line input or output is written as follows:

cd /ProgramData/Propeople
rm -r Drush
git clone --branch master http://git.drupal.org/project/drush.git

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "On the Select Destination Location screen, click on Next to accept the default destination."

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to feedback@packtpub.com, and mention the book title through the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code


Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website, or added to any list of existing errata, under the Errata section of that title.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at copyright@packtpub.com with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions


Chapter 1: What It's All About

This book is about Hadoop, an open source framework for large-scale data processing. Before we get into the details of the technology and its use in later chapters, it is important to spend a little time exploring the trends that led to Hadoop's creation and its enormous success.

Hadoop was not created in a vacuum; instead, it exists due to the explosion in the amount of data being created and consumed and a shift that sees this data deluge arrive at small startups and not just huge multinationals. At the same time, other trends have changed how software and systems are deployed, using cloud resources alongside or even in preference to more traditional infrastructures.

This chapter will explore some of these trends and explain in detail the specific problems Hadoop seeks to solve and the drivers that shaped its design.

In the rest of this chapter we shall:

‹ Learn about the big data revolution

‹ Understand what Hadoop is and how it can extract value from data

‹ Look into cloud computing and understand what Amazon Web Services provides
‹ See how powerful the combination of big data processing and cloud computing can be


Big data processing

Look around at the technology we have today, and it's easy to come to the conclusion that it's all about data. As consumers, we have an increasing appetite for rich media, both in terms of the movies we watch and the pictures and videos we create and upload. We also, often without thinking, leave a trail of data across the Web as we perform the actions of our daily lives.

Not only is the amount of data being generated increasing, but the rate of increase is also accelerating. From emails to Facebook posts, from purchase histories to web links, there are large data sets growing everywhere. The challenge is in extracting from this data the most valuable aspects; sometimes this means particular data elements, and at other times, the focus is instead on identifying trends and relationships between pieces of data.

There's a subtle change occurring behind the scenes that is all about using data in more and more meaningful ways. Large companies have realized the value in data for some time and have been using it to improve the services they provide to their customers, that is, us. Consider how Google displays advertisements relevant to our web surfing, or how Amazon or Netflix recommend new products or titles that often match well to our tastes and interests.

The value of data

These corporations wouldn't invest in large-scale data processing if it didn't provide a meaningful return on the investment or a competitive advantage. There are several main aspects to big data that should be appreciated:

‹ Some questions only give value when asked of sufficiently large data sets.

Recommending a movie based on the preferences of another person is, in the absence of other factors, unlikely to be very accurate. Increase the number of people to a hundred and the chances increase slightly. Use the viewing history of ten million other people and the chances of detecting patterns that can be used to give relevant recommendations improve dramatically.

‹ Big data tools often enable the processing of data on a larger scale and at a lower cost than previous solutions. As a consequence, it is often possible to perform data processing tasks that were previously prohibitively expensive.

‹ The cost of large-scale data processing isn't just about financial expense; latency is also a critical factor.


‹ Previous assumptions of what a database should look like or how its data should be structured may need to be revisited to meet the needs of the biggest data problems.
‹ In combination with the preceding points, sufficiently large data sets and flexible tools allow previously unimagined questions to be answered.

Historically for the few and not the many

The examples discussed in the previous section have generally been seen in the form of innovations of large search engines and online companies. This is a continuation of a much older trend wherein processing large data sets was an expensive and complex undertaking, out of the reach of small- or medium-sized organizations.

Similarly, the broader approach of data mining has been around for a very long time but has never really been a practical tool outside the largest corporations and government agencies. This situation may have been regrettable but most smaller organizations were not at a disadvantage as they rarely had access to the volume of data requiring such an investment.

The increase in data is not limited to the big players anymore, however; many small and medium companies—not to mention some individuals—find themselves gathering larger and larger amounts of data that they suspect may have some value they want to unlock.

Before understanding how this can be achieved, it is important to appreciate some of these broader historical trends that have laid the foundations for systems such as Hadoop today.

Classic data processing systems

The fundamental reason that big data mining systems were rare and expensive is that scaling a system to process large data sets is very difficult; as we will see, it has traditionally been limited to the processing power that can be built into a single computer.

There are, however, two broad approaches to scaling a system as the size of the data increases, generally referred to as scale-up and scale-out.

Scale-up


The advantage of simple scale-up is that the architecture does not significantly change through the growth. Though larger components are used, the basic relationship (for example, database server and storage array) stays the same. For applications such as commercial database engines, the software handles the complexities of utilizing the available hardware, but in theory, increased scale is achieved by migrating the same software onto larger and larger servers. Note though that the difficulty of moving software onto more and more processors is never trivial; in addition, there are practical limits on just how big a single host can be, so at some point, scale-up cannot be extended any further.

The promise of a single architecture at any scale is also unrealistic. Designing a scale-up system to handle data sets of sizes such as 1 terabyte, 100 terabyte, and 1 petabyte may conceptually apply larger versions of the same components, but the complexity of their connectivity may vary from cheap commodity through custom hardware as the scale increases.

Early approaches to scale-out

Instead of growing a system onto larger and larger hardware, the scale-out approach spreads the processing onto more and more machines. If the data set doubles, simply use two servers instead of a single double-sized one. If it doubles again, move to four hosts.

The obvious benefit of this approach is that purchase costs remain much lower than for scale-up. Server hardware costs tend to increase sharply when one seeks to purchase larger machines, and though a single host may cost $5,000, one with ten times the processing power may cost a hundred times as much. The downside is that we need to develop strategies for splitting our data processing across a fleet of servers and the tools historically used for this purpose have proven to be complex.

As a consequence, deploying a scale-out solution has required significant engineering effort; the system developer often needs to handcraft the mechanisms for data partitioning and reassembly, not to mention the logic to schedule the work across the cluster and handle individual machine failures.

Limiting factors

These traditional approaches to scale-up and scale-out have not been widely adopted outside large enterprises, government, and academia. The purchase costs are often high, as is the effort to develop and manage the systems. These factors alone put them out of the reach of many smaller businesses. In addition, the approaches themselves have had several weaknesses that have become apparent over time:

‹ As scale-out systems get large, or as scale-up systems deal with multiple CPUs, the complexity of coordinating concurrent work across the hardware becomes significant.


‹ Hardware advances—often couched in terms of Moore's law—have begun to highlight discrepancies in system capability. CPU power has grown much faster than network or disk speeds have; once CPU cycles were the most valuable resource in the system, but today, that no longer holds. Whereas a modern CPU may be able to execute millions of times as many operations as a CPU 20 years ago would, memory and hard disk speeds have only increased by factors of thousands or even hundreds. It is quite easy to build a modern system with so much CPU power that the storage system simply cannot feed it data fast enough to keep the CPUs busy.

A different approach

From the preceding scenarios there are a number of techniques that have been used successfully to ease the pain in scaling data processing systems to the large scales required by big data.

All roads lead to scale-out

As just hinted, taking a scale-up approach to scaling is not an open-ended tactic. There is a limit to the size of individual servers that can be purchased from mainstream hardware suppliers, and even more niche players can't offer an arbitrarily large server. At some point, the workload will increase beyond the capacity of the single, monolithic scale-up server, so then what? The unfortunate answer is that the best approach is to have two large servers instead of one. Then, later, three, four, and so on. Or, in other words, the natural tendency of scale-up architecture is—in extreme cases—to add a scale-out strategy to the mix. Though this gives some of the benefits of both approaches, it also compounds the costs and weaknesses; instead of very expensive hardware or the need to manually develop the cross-cluster logic, this hybrid architecture requires both.

As a consequence of this end-game tendency and the general cost profile of scale-up architectures, they are rarely used in the big data processing field and scale-out architectures are the de facto standard.

If your problem space involves data workloads with strong internal cross-references and a need for transactional integrity, big iron scale-up relational databases are still likely to be a great option.

Share nothing


The conceptual view of a scale-out architecture in particular shows individual hosts, each processing a subset of the overall data set to produce its portion of the final result. Reality is rarely so straightforward. Instead, hosts may need to communicate between each other, or some pieces of data may be required by multiple hosts. These additional dependencies create opportunities for the system to be negatively affected in two ways: bottlenecks and increased risk of failure.

If a piece of data or individual server is required by every calculation in the system, there is a likelihood of contention and delays as the competing clients access the common data or host. If, for example, in a system with 25 hosts there is a single host that must be accessed by all the rest, the overall system performance will be bounded by the capabilities of this key host.

Worse still, if this "hot" server or storage system holding the key data fails, the entire workload will collapse in a heap. Earlier cluster solutions often demonstrated this risk; even though the workload was processed across a farm of servers, they often used a shared storage system to hold all the data.

Instead of sharing resources, the individual components of a system should be as independent as possible, allowing each to proceed regardless of whether others are tied up in complex work or are experiencing failures.

Expect failure

Implicit in the preceding tenets is that more hardware will be thrown at the problem with as much independence as possible. This is only achievable if the system is built with an expectation that individual components will fail, often regularly and with inconvenient timing.

You'll often hear terms such as "five nines" (referring to 99.999 percent uptime or availability). Though this is absolute best-in-class availability, it is important to realize that the overall reliability of a system comprised of many such devices can vary greatly depending on whether the system can tolerate individual component failures.

Assume a server with 99 percent reliability and a system that requires five such hosts to function. The system availability is 0.99*0.99*0.99*0.99*0.99, which equates to 95 percent availability. But if the individual servers are only rated at 95 percent, the system reliability drops to a mere 77 percent.


If figures such as 99 percent availability seem a little abstract to you, consider it in terms of how much downtime that would mean in a given time period. For example, 99 percent availability equates to a downtime of just over 3.5 days a year or 7 hours a month. Still sound as good as 99 percent?
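The arithmetic is easy to check for yourself. The following is a minimal illustrative sketch (not code from this book) that computes the compound availability of a set of hosts that must all be up, assuming independent failures, along with the downtime implied by a given availability figure:

public class Availability {
    // Availability of a system that needs all 'hosts' components working,
    // assuming each fails independently with the same availability.
    static double system(double perHostAvailability, int hosts) {
        return Math.pow(perHostAvailability, hosts);
    }

    public static void main(String[] args) {
        System.out.printf("5 hosts at 99%%: %.1f%%%n", 100 * system(0.99, 5)); // ~95.1
        System.out.printf("5 hosts at 95%%: %.1f%%%n", 100 * system(0.95, 5)); // ~77.4
        double unavailable = 1 - 0.99; // 99 percent availability
        System.out.printf("Days down per year: %.2f%n", unavailable * 365);    // ~3.65
        System.out.printf("Hours down per month: %.2f%n", unavailable * 730);  // ~7.3
    }
}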

This approach of embracing failure is often one of the most difficult aspects of big data systems for newcomers to fully appreciate. This is also where the approach diverges most strongly from scale-up architectures. One of the main reasons for the high cost of large scale-up servers is the amount of effort that goes into mitigating the impact of component failures. Even low-end servers may have redundant power supplies, but in a big iron box, you will see CPUs mounted on cards that connect across multiple backplanes to banks of memory and storage systems. Big iron vendors have often gone to extremes to show how resilient their systems are by doing everything from pulling out parts of the server while it's running to actually shooting a gun at it. But if the system is built in such a way that instead of treating every failure as a crisis to be mitigated it is reduced to irrelevance, a very different architecture emerges.

Smart software, dumb hardware

If we wish to see a cluster of hardware used in as flexible a way as possible, providing hosting to multiple parallel workflows, the answer is to push the smarts into the software and away from the hardware.

In this model, the hardware is treated as a set of resources, and the responsibility for allocating hardware to a particular workload is given to the software layer. This allows hardware to be generic and hence both easier and less expensive to acquire, and the functionality to efficiently use the hardware moves to the software, where the knowledge about effectively performing this task resides.

Move processing, not data

Imagine you have a very large data set, say, 1000 terabytes (that is, 1 petabyte), and you need to perform a set of four operations on every piece of data in the data set. Let's look at different ways of implementing a system to solve this problem.


Alternatively, the processing approach of previous cluster technologies would perhaps see a cluster of 1,000 machines, each with 1 terabyte of data divided into four quadrants, with each responsible for performing one of the operations. The cluster management software would then coordinate the movement of the data around the cluster to ensure each piece receives all four processing steps. As each piece of data can have one step performed on the host on which it resides, it will need to stream the data to the other three quadrants, so we are in effect moving 3 petabytes of data across the network to perform the processing.

Remembering that processing power has increased faster than networking or disk technologies, are these really the best ways to address the problem? Recent experience suggests the answer is no and that an alternative approach is to avoid moving the data and instead move the processing. Use a cluster as just mentioned, but don't segment it into quadrants; instead, have each of the thousand nodes perform all four processing stages on the locally held data. If you're lucky, you'll only have to stream the data from the disk once, and the only things travelling across the network will be program binaries and status reports, both of which are dwarfed by the actual data set in question.

If a 1,000-node cluster sounds ridiculously large, think of some modern server form factors being utilized for big data solutions. These see single hosts with as many as twelve 1- or 2-terabyte disks in each. Because modern processors have multiple cores it is possible to build a 50-node cluster with a petabyte of storage and still have a CPU core dedicated to process the data stream coming off each individual disk.
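As a back-of-envelope check of those numbers (an illustrative sketch using the figures from the paragraph above, not a sizing recommendation from this book):

public class ClusterSizing {
    public static void main(String[] args) {
        int nodes = 50, disksPerNode = 12, tbPerDisk = 2;
        // Raw capacity: 50 * 12 * 2 = 1,200 TB, comfortably over a petabyte.
        System.out.println("Raw storage: " + (nodes * disksPerNode * tbPerDisk) + " TB");
        // Dedicating one core to the stream from each local disk means each
        // node needs at least 12 cores, achievable with multi-core CPUs.
        System.out.println("Cores needed per node: " + disksPerNode);
    }
}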

Build applications, not infrastructure

When thinking of the scenario in the previous section, many people will focus on the questions of data movement and processing. But, anyone who has ever built such a system will know that less obvious elements such as job scheduling, error handling, and coordination are where much of the magic truly lies.

If we had to implement the mechanisms for determining where to execute processing, performing the processing, and combining all the subresults into the overall result, we wouldn't have gained much from the older model. There, we needed to explicitly manage data partitioning; we'd just be exchanging one difficult problem with another.


Hadoop

The thoughtful (or perhaps suspicious) reader will not be surprised to learn that the preceding approaches are all key aspects of Hadoop. But we still haven't actually answered the question about exactly what Hadoop is.

Thanks, Google

It all started with Google, which in 2003 and 2004 released two academic papers describing Google technology: the Google File System (GFS) (http://research.google.com/archive/gfs.html) and MapReduce (http://research.google.com/archive/mapreduce.html). The two together provided a platform for processing data on a very large scale in a highly efficient manner.

Thanks, Doug

At the same time, Doug Cutting was working on the Nutch open source web search engine. He had been working on elements within the system that resonated strongly once the Google GFS and MapReduce papers were published. Doug started work on implementations of these Google systems, and Hadoop was soon born, first as a subproject of Lucene and soon afterwards as its own top-level project within the Apache Software Foundation. At its core, therefore, Hadoop is an open source platform that provides implementations of both the MapReduce and GFS technologies and allows the processing of very large data sets across clusters of low-cost commodity hardware.

Thanks, Yahoo

Yahoo hired Doug Cutting in 2006 and quickly became one of the most prominent supporters of the Hadoop project. In addition to often publicizing some of the largest Hadoop deployments in the world, Yahoo has allowed Doug and other engineers to contribute to Hadoop while still under its employ; it has contributed some of its own internally developed Hadoop improvements and extensions. Though Doug has now moved on to Cloudera (another prominent startup supporting the Hadoop community) and much of Yahoo's Hadoop team has been spun off into a startup called Hortonworks, Yahoo remains a major Hadoop contributor.

Parts of Hadoop


HDFS is a filesystem that can store very large data sets by scaling out across a cluster of hosts. It has specific design and performance characteristics; in particular, it is optimized for throughput instead of latency, and it achieves high availability through replication instead of redundancy.

MapReduce is a data processing paradigm that takes a specification of how the data will be input and output from its two stages (called map and reduce) and then applies this across arbitrarily large data sets. MapReduce integrates tightly with HDFS, ensuring that wherever possible, MapReduce tasks run directly on the HDFS nodes that hold the required data.

Common building blocks

Both HDFS and MapReduce exhibit several of the architectural principles described in the previous section. In particular:

‹ Both are designed to run on clusters of commodity (that is, low-to-medium specification) servers

‹ Both scale their capacity by adding more servers (scale-out)

‹ Both have mechanisms for identifying and working around failures

‹ Both provide many of their services transparently, allowing the user to concentrate on the problem at hand

‹ Both have an architecture where a software cluster sits on the physical servers and controls all aspects of system execution

HDFS

HDFS is a filesystem unlike most you may have encountered before. It is not a POSIX-compliant filesystem, which basically means it does not provide the same guarantees as a regular filesystem. It is also a distributed filesystem, meaning that it spreads storage across multiple nodes; lack of such an efficient distributed filesystem was a limiting factor in some historical technologies. The key features are:

‹ HDFS stores files in blocks typically at least 64 MB in size, much larger than the 4-32 KB seen in most filesystems.
‹ HDFS is optimized for throughput over latency; it is very efficient at streaming read requests for large files but poor at seek requests for many small ones.
‹ HDFS is optimized for workloads that are generally of the write-once and read-many type.


‹ Instead of handling disk failures by having physical redundancies in disk arrays or similar strategies, HDFS uses replication. Each of the blocks comprising a file is stored on multiple nodes within the cluster, and the HDFS NameNode constantly monitors reports sent by each DataNode to ensure that failures have not dropped any block below the desired replication factor. If this does happen, it schedules the addition of another copy within the cluster.
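To make the block and replication model concrete, the following is a small illustrative sketch (not a listing from this book) that uses the HDFS Java client API to report how a file has been stored. It assumes a running cluster, configured as in Chapter 2, with the Hadoop libraries and configuration on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        // Connects to the filesystem named in the cluster configuration.
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        // Typically 64 MB or larger, as described above.
        System.out.println("Block size : " + status.getBlockSize() + " bytes");
        // The number of copies of each block held across the cluster.
        System.out.println("Replication: " + status.getReplication());
    }
}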

MapReduce

Though MapReduce as a technology is relatively new, it builds upon much of the fundamental work from both mathematics and computer science, particularly approaches that look to express operations that would then be applied to each element in a set of data. Indeed the individual concepts of functions called map and reduce come straight from functional programming languages where they were applied to lists of input data.

Another key underlying concept is that of "divide and conquer", where a single problem is broken into multiple individual subtasks. This approach becomes even more powerful when the subtasks are executed in parallel; in a perfect case, a task that takes 1000 minutes could be processed in 1 minute by 1,000 parallel subtasks.

MapReduce is a processing paradigm that builds upon these principles; it provides a series of transformations from a source to a result data set. In the simplest case, the input data is fed to the map function and the resultant temporary data to a reduce function. The developer only defines the data transformations; Hadoop's MapReduce job manages the process of how to apply these transformations to the data across the cluster in parallel. Though the underlying ideas may not be novel, a major strength of Hadoop is in how it has brought these principles together into an accessible and well-engineered platform.

Unlike traditional relational databases that require structured data with well-defined schemas, MapReduce and Hadoop work best on semi-structured or unstructured data. Instead of data conforming to rigid schemas, the requirement is instead that the data be provided to the map function as a series of key value pairs. The output of the map function is a set of other key value pairs, and the reduce function performs aggregation to collect the final set of results.
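To make the key/value flow concrete, the classic word count example is sketched below against the Hadoop Java MapReduce API, which is explored properly in Chapter 3. This is a minimal illustration rather than a listing from this book: the map function turns each line of text into (word, 1) pairs, and the reduce function sums the counts gathered for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit a (word, 1) pair for every word in the input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum all the counts gathered for each distinct word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: the only other piece the developer supplies; Hadoop manages the rest.
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}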


This last point is possibly the most important aspect of Hadoop. The platform takes responsibility for every aspect of executing the processing across the data. After the user defines the key criteria for the job, everything else becomes the responsibility of the system. Critically, from the perspective of the size of data, the same MapReduce job can be applied to data sets of any size hosted on clusters of any size. If the data is 1 gigabyte in size and on a single host, Hadoop will schedule the processing accordingly. Even if the data is 1 petabyte in size and hosted across one thousand machines, it still does likewise, determining how best to utilize all the hosts to perform the work most efficiently. From the user's perspective, the actual size of the data and cluster are transparent, and apart from affecting the time taken to process the job, they do not change how the user interacts with Hadoop.

Better together

It is possible to appreciate the individual merits of HDFS and MapReduce, but they are even more powerful when combined. HDFS can be used without MapReduce, as it is intrinsically a large-scale data storage platform. Though MapReduce can read data from non-HDFS sources, the nature of its processing aligns so well with HDFS that using the two together is by far the most common use case.

When a MapReduce job is executed, Hadoop needs to decide where to execute the code most efficiently to process the data set. If the MapReduce-cluster hosts all pull their data from a single storage host or an array, it largely doesn't matter as the storage system is a shared resource that will cause contention. But if the storage system is HDFS, it allows MapReduce to execute data processing on the node holding the data of interest, building on the principle of it being less expensive to move data processing than the data itself.

The most common deployment model for Hadoop sees the HDFS and MapReduce clusters deployed on the same set of servers. Each host that contains data and the HDFS component to manage it also hosts a MapReduce component that can schedule and execute data processing. When a job is submitted to Hadoop, it can use an optimization process as much as possible to schedule data on the hosts where the data resides, minimizing network traffic and maximizing performance.

Think back to our earlier example of how to process a four-step task on 1 petabyte of data spread across one thousand servers. The MapReduce model would (in a somewhat simplified and idealized way) perform the processing in a map function on each piece of data on a host where the data resides in HDFS and then reuse the cluster in the reduce function to collect the individual results into the final result set.


Common architecture

Both HDFS and MapReduce are, as mentioned, software clusters that display common characteristics:

‹ Each follows an architecture where a cluster of worker nodes is managed by a special master/coordinator node

‹ The master in each case (NameNode for HDFS and JobTracker for MapReduce) monitors the health of the cluster and handles failures, either by moving data blocks around or by rescheduling failed work
‹ Processes on each server (DataNode for HDFS and TaskTracker for MapReduce) are responsible for performing work on the physical host, receiving instructions from the NameNode or JobTracker, and reporting health/progress status back to it

As a minor terminology point, we will generally use the terms host or server to refer to the physical hardware hosting Hadoop's various components. The term node will refer to the software component comprising a part of the cluster.

What it is and isn't good for

As with any tool, it's important to understand when Hadoop is a good fit for the problem in question. Much of this book will highlight its strengths, building on the broad overview of large-scale data processing given previously, but it's also important to appreciate at an early stage where Hadoop isn't the best choice.

The architecture choices made within Hadoop enable it to be the flexible and scalable data processing platform it is today. But, as with most architecture or design choices, there are consequences that must be understood. Primary amongst these is the fact that Hadoop is a batch processing system. When you execute a job across a large data set, the framework will churn away until the final results are ready. With a large cluster, answers across even huge data sets can be generated relatively quickly, but the fact remains that the answers are not generated fast enough to service impatient users. Consequently, Hadoop alone is not well suited to low-latency queries such as those received on a website, a real-time system, or a similar problem domain.


Another member of the broader Hadoop family is HBase, an open-source implementation of another Google technology. This provides a (non-relational) database atop Hadoop that uses various means to allow it to serve low-latency queries.
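
To show the contrast with batch MapReduce jobs, here is a minimal sketch of a single-row lookup using the HBase Java client API of this era; the table name, row key, and column names are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // "pages" is a hypothetical table; a Get is a low-latency point lookup
            HTable table = new HTable(conf, "pages");
            Result row = table.get(new Get(Bytes.toBytes("com.example/index.html")));
            byte[] body = row.getValue(Bytes.toBytes("content"), Bytes.toBytes("body"));
            System.out.println(Bytes.toString(body));
            table.close();
        }
    }

A point lookup like this returns in milliseconds rather than the minutes or hours a MapReduce job may take, which is what makes HBase suitable for serving queries directly.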

But haven't Google and Yahoo both been among the strongest proponents of this method of computation, and aren't they all about websites where response time is critical? The answer is yes, and it highlights an important point: Hadoop is best incorporated into an organization, or used alongside other technologies, in a way that exploits the strengths of each. In a paper (http://research.google.com/archive/googlecluster.html), Google sketches how they utilized MapReduce at the time: after a web crawler retrieved updated webpage data, MapReduce processed the huge data set, and from this produced the web index that a fleet of MySQL servers used to service end-user search requests.

Cloud computing with Amazon Web Services

The other technology area we'll explore in this book is cloud computing, in the form of several offerings from Amazon Web Services. But first, we need to cut through some hype and buzzwords that surround this thing called cloud computing.

Too many clouds

Cloud computing has become an overused term, arguably to the point that its overuse risks it being rendered meaningless. In this book, therefore, let's be clear what we mean—and care about—when using the term. There are two main aspects to this: a new architecture option and a different approach to cost.

A third way

Traditionally, an organization needing a scalable system had two choices: buy and run its own hardware, or outsource the hosting to a managed service provider. Cloud computing offers a third way: you deploy your software onto a provider's platform and let the provider worry about buying, hosting, and scaling the hardware beneath it.

It's not always that simple, of course. But for many cloud services, the model truly is this revolutionary. You develop your software according to some published guidelines or interfaces, deploy it onto the cloud platform, and allow the platform to scale the service with demand, for a cost, of course. But given the costs usually involved in building systems that scale, this is often a compelling proposition.

Different types of costs

This approach to cloud computing also changes how system hardware is paid for. By offloading infrastructure to the provider, all users benefit from the economies of scale the cloud provider achieves by building platforms large enough to host thousands or millions of clients. As a user, not only do you get someone else to worry about difficult engineering problems such as scaling, but you pay for capacity only as it's needed and you don't have to size the system for the largest possible workload. Instead, you gain the benefit of elasticity, using more or fewer resources as your workload demands.

An example helps illustrate this. Many companies' financial groups run end-of-month workloads to generate tax and payroll data, and often much larger data crunching occurs at year end. If you were tasked with designing such a system, how much hardware would you buy? If you buy only enough to handle the day-to-day workload, the system may struggle at month end and will likely be in real trouble when the end-of-year processing rolls around. If you scale for the end-of-month workloads, the system will have idle capacity for most of the year and may still struggle with the end-of-year processing. If you size for the end-of-year workload, the system will have significant capacity sitting idle for the rest of the year. And considering the purchase cost of hardware in addition to the hosting and running costs—a server's electricity usage may account for a large majority of its lifetime costs—you are wasting huge amounts of money.

The service-on-demand aspects of cloud computing allow you to start your application on a small hardware footprint and then scale it up and down as the year progresses. With a pay-for-use model, your costs follow your utilization and you have the capacity to process your workloads without having to buy enough hardware to handle the peaks.

Consider a new web venture in the days before cloud computing: before serving its first customer, it would have needed significant up-front capital expenditure on hardware sized for the success it hoped to achieve.

Today, with cloud computing, the initial infrastructure cost could literally be as low as a few tens or hundreds of dollars a month and that would only increase when—and if—the traffic demanded.

AWS – infrastructure on demand from Amazon

Amazon Web Services (AWS) is a set of such cloud computing services offered by Amazon. We will be using several of these services in this book.

Elastic Compute Cloud (EC2)

Amazon's Elastic Compute Cloud (EC2), found at http://aws.amazon.com/ec2/, is basically a server on demand. After registering with AWS and EC2, a credit card is all that's required to gain access to a dedicated virtual machine, on which it's easy to run a variety of operating systems, including Windows and many variants of Linux.

Need more servers? Start more. Need more powerful servers? Change to one of the higher-specification (and higher-cost) types offered. Along with this, EC2 offers a suite of complementary services, including load balancers, static IP addresses, high-performance additional virtual disk drives, and many more.
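
To show just how programmatic "a server on demand" can be, the following sketch launches an instance using the AWS SDK for Java (version 1); the AMI ID and instance type are placeholders, and credentials and region are assumed to come from the SDK's default provider chain:

    import com.amazonaws.services.ec2.AmazonEC2;
    import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
    import com.amazonaws.services.ec2.model.RunInstancesRequest;
    import com.amazonaws.services.ec2.model.RunInstancesResult;

    public class LaunchInstanceExample {
        public static void main(String[] args) {
            // Credentials and region come from the SDK's default provider chain
            AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
            RunInstancesRequest request = new RunInstancesRequest()
                    .withImageId("ami-12345678")  // placeholder AMI ID
                    .withInstanceType("m4.large") // swap for a more powerful type as needed
                    .withMinCount(1)
                    .withMaxCount(1);             // need more servers? raise the count
            RunInstancesResult result = ec2.runInstances(request);
            System.out.println(
                    result.getReservation().getInstances().get(0).getInstanceId());
        }
    }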

Simple Storage Service (S3)

Amazon's Simple Storage Service (S3), found at http://aws.amazon.com/s3/, is a storage service that provides a simple key/value storage model. Using web, command-line, or programmatic interfaces, you create objects, which can be anything from text files to images to MP3s, and store and retrieve your data through a simple two-level model: buckets that contain objects. Each bucket has a unique identifier, and within each bucket, every object is uniquely named. This simple strategy enables an extremely powerful service for which Amazon takes complete responsibility (for service scaling, in addition to reliability and availability of data).
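
A minimal sketch of the bucket/object model, again using the AWS SDK for Java (version 1) and with a placeholder bucket name, looks like this:

    import java.io.File;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.S3Object;

    public class S3Example {
        public static void main(String[] args) throws Exception {
            // Credentials and region come from the SDK's default provider chain
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

            // Bucket names are globally unique; this one is a placeholder
            String bucket = "my-example-bucket-12345";
            s3.createBucket(bucket);

            // Within the bucket, an object is identified solely by its key
            s3.putObject(bucket, "music/track1.mp3", new File("track1.mp3"));

            // Retrieval uses the same (bucket, key) pair
            S3Object object = s3.getObject(bucket, "music/track1.mp3");
            object.close(); // release the underlying HTTP connection
        }
    }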

Elastic MapReduce (EMR)

Amazon's Elastic MapReduce (EMR), found at http://aws.amazon.com/elasticmapreduce/, is basically Hadoop in the cloud, building on top of both EC2 and S3.

In its most impressive mode, EMR can pull source data from S3, process it on a Hadoop cluster it creates on EC2, push the results back into S3, and then terminate the Hadoop cluster and the EC2 virtual machines hosting it. Naturally, each of these services has a cost (usually on a per-GB-stored and per-server-time-used basis), but the ability to access such large-scale data processing capabilities with no need for dedicated hardware is a powerful one.
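
The following sketch shows what that S3-to-EC2-to-S3 flow can look like through the AWS SDK for Java (version 1); the bucket names, JAR path, instance details, EMR release label, and IAM role names are all placeholders or assumptions rather than values taken from this book:

    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
    import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
    import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
    import com.amazonaws.services.elasticmapreduce.model.StepConfig;

    public class EmrJobFlowExample {
        public static void main(String[] args) {
            AmazonElasticMapReduce emr =
                    AmazonElasticMapReduceClientBuilder.defaultClient();

            // The step reads from S3 and writes results back to S3 (placeholder paths)
            StepConfig step = new StepConfig()
                    .withName("WordCount")
                    .withActionOnFailure("TERMINATE_JOB_FLOW")
                    .withHadoopJarStep(new HadoopJarStepConfig()
                            .withJar("s3://my-example-bucket/wordcount.jar")
                            .withArgs("s3://my-example-bucket/input/",
                                      "s3://my-example-bucket/output/"));

            RunJobFlowRequest request = new RunJobFlowRequest()
                    .withName("Example job flow")
                    .withLogUri("s3://my-example-bucket/logs/")
                    .withReleaseLabel("emr-5.0.0")      // assumed EMR release
                    .withServiceRole("EMR_DefaultRole") // assumed pre-created IAM roles
                    .withJobFlowRole("EMR_EC2_DefaultRole")
                    .withInstances(new JobFlowInstancesConfig()
                            .withInstanceCount(3)
                            .withMasterInstanceType("m4.large")
                            .withSlaveInstanceType("m4.large")
                            // terminate the cluster once the step completes
                            .withKeepJobFlowAliveWhenNoSteps(false))
                    .withSteps(step);

            RunJobFlowResult result = emr.runJobFlow(request);
            System.out.println(result.getJobFlowId());
        }
    }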

What this book covers

In this book we will be learning how to write MapReduce programs to do some serious data crunching and how to run them on both locally managed and AWS-hosted Hadoop clusters.

Not only will we be looking at Hadoop as an engine for performing MapReduce processing, but we'll also explore how a Hadoop capability can fit into the rest of an organization's infrastructure and systems. We'll look at some of the common points of integration, such as getting data between Hadoop and a relational database and also how to make Hadoop look more like such a relational database.

A dual approach

In this book we will not be limiting our discussion to EMR or Hadoop hosted on Amazon EC2; we will discuss both the building and the management of local Hadoop clusters (on Ubuntu Linux) in addition to showing how to push the processing into the cloud via EMR. The reason for this is twofold: firstly, though EMR makes Hadoop much more accessible, there are aspects of the technology that only become apparent when manually administering the cluster. Though it is also possible to use EMR in a more manual mode, we'll generally use a local cluster for such explorations. Secondly, though it isn't necessarily an either/or decision, many organizations use a mixture of in-house and cloud-hosted capacity, sometimes due to a concern about over-reliance on a single external provider; practically speaking, it's often convenient to do development and small-scale tests on local capacity and then deploy at production scale into the cloud.
