Managing the Data Lake

Moving to Big Data Analysis

Managing the Data Lake by Andy Oram

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Shannon Cutt

Interior Designer: David Futato

Cover Designer: Karen Montgomery

Revision History for the First Edition

2015-09-02: First Release

2015-10-20: Second Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc.

Managing the Data Lake and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

Cover photo credit: “55 Flying Fish” by Michal (Flickr).

978-1-491-94168-3

Chapter 1. Moving to Big Data Analysis

Can you tell by sailing the surface of a lake whether it has been well maintained? Can local fish and plants survive? Dare you swim? And how about the data maintained in your organization’s data lake? Can you tell whether it’s healthy enough to support your business needs?

An increasing number of organizations maintain fast-growing repositories of data, usually from multiple sources and formatted in multiple ways, that are commonly called “data lakes.” They use a variety of storage and processing tools — especially in the Hadoop family — to extract value quickly and inform key organizational decisions.

This report looks at the common needs that modern organizations have for data management and governance. The MapReduce model, introduced in 2004 in a paper¹ by Jeffrey Dean and Sanjay Ghemawat, completely overturned the way the computing community approached big data analysis. Many other models, such as Spark, have come since then, creating excitement and seeing eager adoption by organizations of all sizes to solve the problems that relational databases were not suited for. But these technologies bring with them new demands for organizing data and keeping track of what you’ve got.

I take it for granted that you understand the value of undertaking a big data initiative, as well as the value of a framework such as Hadoop, and are in the process of transforming the way you manage your organization’s data. I have interviewed a number of experts in data management to find out the common challenges you are about to face, so you can anticipate them and put solutions in place before you find yourself overwhelmed.

value in those constraints, and most of us will be entrusting data to relational systems for the foreseeable future. But some data tasks just don’t fit. And once you escape the familiarity and safety of the relational model, you need other tools to manage the inconsistencies, unpredictability, and breakneck pace of the data you’re handling.

The risk of the new tools is having many disparate sources of data — and perhaps multiple instances of Hadoop or other systems offering analytics operating inefficiently — which in turn causes you to lose track of basic information you need to know about your data. This makes it hard to set up new jobs that could provide input to the questions you urgently need to answer.

The fix is to restore some of the controls you had over old data sources through careful planning and coding, while still being flexible and responsive to fast-moving corporate data needs.

The main topics covered in this report are:

Acquisition and ingestion

Data comes nowadays from many different sources: internal business systems, product data from customers, external data providers, public data sets, and more. You can’t force everyone to provide the data in a format that’s convenient for you. Nor can you take the time (as in the old days) to define strict schemas and enter all data into schemas. The problems of data acquisition and ingestion have to be solved with a degree of automation.

Metadata (cataloguing)

Questions such as who provided the data, when it came in, and how it was formatted — a slew of concerns known as lineage or provenance — are critical to managing your data well. A catalog can keep this metadata and make it available to later stages of processing.

Data preparation and cleaning

addition, data might be lost or corrupted because sensors run out of battery power, networks fail, software along the way harbored a bug, or the incoming data had an unrecognized format. Some data users estimate that detecting these anomalies and cleaning takes up 90% of their time.

Managing workflows

The actual jobs you run on data need to be linked with the three other stages just described. Users should be able to submit jobs of their own, based on the work done by experts before them, to handle ingestion, cataloguing, and cleaning. You want staff to quickly get a new visualization or report without waiting weeks for a programmer to code it up.

Access control

Why Companies Move to Hadoop

To set the stage for exploration of data management, it is helpful to remind ourselves of why organizations are moving in the direction of big data tools.

Size

“Volume” is one of the main aspects of big data. Relational databases cannot scale beyond a certain volume due to architecture restrictions. Organizations find that data processing in relational databases takes too long, and as they do more and more analytics, data processing with conventional ETL tools becomes such a big time sink that it holds users back from making full use of the data.

Variety

Typical sources include flat files, RDBMSes, logs from web servers, devices and sensors, and even legacy mainframe data. Sometimes you also want to export data from Hadoop to an RDBMS or other repository.

Free-form data

Some data may be almost completely unstructured, as in the case of product reviews and social media postings. Other data will come to you inconsistently structured. For instance, different data providers may provide the same information in very different formats.

Streaming data

If you don’t keep up with changes in the world around you, it will pass you by — and probably reward a competitor who does adapt to it.

Streaming has evolved from a few rare cases, such as stock markets and sensor data, to everyday data such as product usage data and social media.

Fitting the task to the tool

preparing data can take more time than doing the analytics. Data in normalized relational format resides in many different tables and must be combined to make the format that the analytics engine can efficiently process.

Frequent failures

Modern processing systems such as Hadoop contain redundancy and automatic restart to handle hardware failures or software glitches. Even so, you can expect jobs to be aborted regularly by bad data. You’ll want to get notifications when a job finishes successfully or unsuccessfully. Log files should show you what goes wrong, and you should be able to see how many corrupted rows were discarded and what other errors occurred.

Unless you take management into consideration in advance, you end up unable to make good use of this data. One example comes from a telecom company whose network generated records about the details of phone calls for monthly billing purposes. Their ETL system didn’t ingest data from calls that were dropped or never connected, because no billing was involved. So years later, when they realized they should be looking at which cell towers had low quality, they had no data with which to do so.

A failure to collect or store data may be an extreme example of management problems, but other hindrances (such as storing it in a format that is hard to read, or failing to remember when it arrived) will also slow down processing to the point where you give up opportunities for learning insights from your data.

When the telecom company just mentioned realized that they could use information on dropped and incomplete calls, their ETL system required a huge new programming effort and did not have the capacity to store or process the additional data. Modern organizations may frequently get new sources of data from brokers or publicly available repositories, and can’t afford to spend time and resources doing such coding in order to integrate them.

program to try to fix corrupted data, and when do you abort the whole job? A minor error such as a missing ZIP code probably shouldn’t stop a job, or even prevent that record from being processed. A missing customer ID, though, might prevent you from doing anything useful with the data. (There may be ways to recover from these errors too, as we’ll see.)

Your choice depends of course on your goal. If you’re counting sales of a particular item, you don’t need the customer ID. If you want to update customer records, you probably do.

A more global problem with data ingestion comes when someone changes the order of fields in all the records of an incoming data set. Your program might be able to detect what happened and adjust, or might have to abort.
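The decisions just described (tolerate a missing ZIP code, skip a record that lacks a customer ID when the goal needs one, abort when the field order changes) can be sketched as a small triage routine. This is only an illustration: the field names, the goal names, and the expected header are assumptions made up for the example, not part of any particular tool.

```python
from enum import Enum


class Action(Enum):
    PROCESS = "process"  # record is usable as-is
    SKIP = "skip"        # drop this record, keep the job running
    ABORT = "abort"      # stop the whole job


# Hypothetical field layout, chosen only for illustration.
EXPECTED_HEADER = ["customer_id", "item", "zip_code"]


def check_header(header):
    """Abort if the incoming field order differs from what the job expects."""
    return Action.PROCESS if header == EXPECTED_HEADER else Action.ABORT


def triage(record, goal):
    """Decide what to do with one record, given the (hypothetical) job goal."""
    if goal == "update_customers" and not record.get("customer_id"):
        return Action.SKIP  # nothing useful can be done without the ID
    # A missing ZIP code, or a missing ID when only counting sales, is tolerated.
    return Action.PROCESS
```

A more ambitious version of check_header could try to remap reordered fields instead of aborting, at the cost of extra complexity.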

Acquisition and Ingestion

At this point we turn to the steps in data processing. Acquisition comes first. Nowadays it involves much more than moving data from an external source to your own repository. In fact, you may not be storing every source you get data from at all: you might accept streams of fast-changing data from sensors or social media, process them right away, and save only the results.

On the other hand, if you want to keep the incoming data, you may need to convert it to a format understood by Hadoop or other processing tools, such as Avro or Parquet.

The health care field provides a particularly complex data collection case. You may be collecting:

Electronic health records from hospitals using different formats

Claims data from health care providers or payers

Profiles from health plans

Data from individuals’ fitness devices

Electronic health records illustrate the variety and inconsistency of all these data types. Although there are standards developed by the HL7 standards group, they are implemented differently by each EHR vendor. Furthermore, HL7 exchanges data through several messaging systems that differ from any other kind of data exchange used in the computer field.

In a situation like this, you will probably design several general methods of ingesting data: one to handle the HL7 messages from EHRs, another to handle claims data, and so on. You’ll want to make it easy for a user to choose one of these methods and adjust parameters such as source, destination file, and frequency in order to handle a new data feed.

contain what you wanted at all, or needs complex processing to extract just what you need. And the documentation may not be trustworthy, so you have to test your ingestion process on actual input.

As mentioned earlier, you may be able to anticipate how incoming data changes — such as reordered fields — and adapt to it. However, there are risks to doing this. First, your tools become more complicated and harder to maintain. Second, they may make the wrong choice because they think they understand the change and get it wrong.

Another common ingestion task is to create a consolidated record from multiple files of related information that are used frequently together — for example, an Order Header and Details merged into one file. Hadoop has a particular constraint on incoming data: it was not designed for small files. Input may consist of many small files, but submitting them individually will force a wasteful input process onto Hadoop and perhaps even cause a failure. For this reason, it is recommended that, prior to processing these small files, they be combined into a single large file to leverage the Hadoop cluster more efficiently.
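As a rough sketch of the small-files advice, the following Python function concatenates many small local files into one large file before it is handed to the cluster. It assumes the inputs are headerless text files with the same layout, that each ends with a newline, and that the output is written outside the input directory; the function name and the default pattern are invented for this example.

```python
import shutil
from pathlib import Path


def combine_small_files(input_dir, output_path, pattern="*.csv"):
    """Concatenate every matching file in input_dir into output_path.

    Returns the number of files merged. Files are merged in sorted
    filename order so that repeated runs give the same result.
    """
    sources = sorted(Path(input_dir).glob(pattern))
    with open(output_path, "wb") as merged:
        for source in sources:
            with open(source, "rb") as part:
                shutil.copyfileobj(part, merged)  # streams, so large inputs are fine
    return len(sources)
```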

This example highlights an important principle governing all the processing discussed in this report: use open formats if possible, and leverage everything the open source and free software communities have made available. This will give you more options, because you won’t be locked into one vendor. Open source also makes it easier to hire staff and get them productive quickly.

However, current open source tools don’t do everything you need. You’ll have to fill in the gaps with commercial solutions or hand-crafted scripts. For instance, Sqoop is an excellent tool for importing data from a relational database to Hadoop and supports incremental loads. However, building a complete insert-update-delete solution to keep the Hive table in sync with the RDBMS table would be a pretty complex task. Here you might benefit from

Metadata (Cataloguing)

Why do you need to preserve metadata about your data? Reasons for doing so abound:

For your analytics, you will want to choose data from the right place and time. For instance, you may want to go back to old data from all your stores in a particular region.

Data preparation and cleaning require a firm knowledge of which data set you’re working on. Different sets require different types of preparation, based on what you have learned about them historically.

Analytical methods are often experimental and have some degree of error. To determine whether you can trust results, you may want to check the data that was used to achieve the results, and review how it was processed.

When something goes wrong in any stage from ingestion through to the processing, you need to quickly pinpoint the data causing the problem. You also must identify the source so you can contact them and make sure the problem doesn’t reoccur in future data sets.

In addition to cleaning data and preventing errors, you may have other reasons related to quality control to preserve the lineage or provenance of data.

Access has to be restricted to sensitive data. If users deliberately or inadvertently try to start a job on data they’re not supposed to see, your system should reject the job.

Regulatory requirements may require the access restrictions mentioned in the previous bullet, as well as imposing other requirements that depend on the data source.

Ben Sharma, CEO and co-founder of Zaloni, talks about creating “a single source of truth” from the diverse data sets you take in. By creating a data catalog, you can store this metadata for use by downstream programs. Zaloni divides metadata roughly into three types:

Business metadata

This can include the business names and descriptions that you assign to data fields to make them easier to find and understand. For instance, the technical staff may have a good reason to assign the name loc_outlet to a field that represents a retail store, but you will want users to be able to find it through common English words. This kind of metadata also covers business rules, such as putting an upper limit (perhaps even a lower limit) on salaries, or determining which data must be removed from some jobs for security and privacy.

Operational metadata

This is generated automatically by the processes described in this report, and includes such things as the source and target locations of data, file size, number of records, how many records were rejected during data preparation or a job run, and the success or failure of that run itself.

Technical metadata

This includes the data’s type and format (text, images, JSON, Avro, etc.) and the structure or schema. This structure includes the names of fields, their data types, their lengths, whether they can be empty, and so on. Structure is commonly provided by a relational database or the headings in a spreadsheet, but may also be added during ingestion and data preparation. Zaloni’s Bedrock integrates with Apache HCatalog for technical metadata so that other tools in the Hadoop ecosystem can take advantage of the structure definition.

As suggested in the previous list, one can also categorize metadata by the way it is gathered:

Some metadata pertains to the data acquisition process: the source of the data, filename, time of creation, time of acquisition, file size, redundancy checks generated to make sure the transmission was not corrupted, and MD5 hashes generated to uniquely identify a file.

Some metadata is created during ingestion. For instance, a watermark can be added to a file or to a column within the file. If you take JSON or other relatively unstructured data and create a schema around it, that schema becomes part of the metadata.

Some metadata is created during a job run, such as the number of records successfully processed, the number of bad fields or bad records, and how long a job took.
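To make the acquisition-stage metadata concrete, here is a minimal sketch that gathers a few of the items listed above for one incoming file. The dictionary keys and the function name are assumptions for illustration; a real catalog would define its own schema.

```python
import hashlib
import os
from datetime import datetime, timezone


def acquisition_metadata(path, source):
    """Collect basic acquisition metadata for one incoming file."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "source": source,
        "filename": os.path.basename(path),
        "size_bytes": os.path.getsize(path),
        "acquired_at": datetime.now(timezone.utc).isoformat(),
        "md5": digest.hexdigest(),
    }
```

Two files that produce the same md5 value can be treated as identical for the purpose of spotting a feed that was delivered twice.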

The next question is how to create metadata. Many tools can extract the easy stuff, such as file sizes and timestamps, as the stages of processing proceed. Other metadata requires custom-written programs that do such things as tag particular data fields you’ll want to extract later.

At any stage of processing, you may choose to update the metadata. Each stage can also consult the metadata when applying rules for user access, cleaning, and submitting data to jobs. We’ll see later how, at least in theory, storing feedback in metadata can create an environment of continuous quality improvement.

Currently, one of the huge challenges in data management is communicating metadata to downstream parts of a workflow. A good deal of Zaloni

Data Preparation and Cleaning

Assume that your data will come with a certain amount of errors, corrupted formats, and duplicates. I’m not using “assume” in a hypothetical sense here — you had better assume the presence of errors or you will be blindsided when they happen.

What will be the impacts of such errors? Suppose data transfers don’t complete, for instance? Your workflows should be able to handle the most common problems, and you’ll need to research your data feeds to discover those problems.

A sense of what you can run into comes, like several other examples in this report, from health care. The US government’s Centers for Medicare & Medicaid Services (CMS), which covers a large percentage of health care payments in the country, requires participating health care providers to submit quality data in a format called the Healthcare Effectiveness Data and Information Set (HEDIS). This format is strict, demanding, and absolutely gigantic. Fields that get mixed up or have incorrect coding cost huge amounts of money as providers rush to fix them.

Why is HEDIS hard to fill out? Because the data is drawn from reports that undergo many processing steps, in paper or electronic forms. You would not want your organs during a surgery to pass through as many hands as HEDIS data does. The doctor’s original note is processed by a business office within the provider, after which it is sent to an outside billing service because payer requirements are so strict and complicated. The forms then go to the insurer, who may question the claim and send it back through the route on which it came.

The trek may undergo several iterations, taking months. As the health care provider strives to get payment, lost data and errors in coding are likely to enter the data.

organization: fixes that can be done on a single piece of data and fixes that require analytics to be run on large data sets.

Note that even a fix on a single piece of data may be developed by analytics carried out within your organization, or a vendor. For instance, research can show that the state of California is commonly represented as Ca, CA, Cal, or Cali in data sets. A simple programming check, using fixed strings or regular expressions, can identify the various possible values and harmonize them on a single standard, such as CA.
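The California check might look like the following sketch, which uses a regular expression to map the variants named above onto CA. Accepting the spelled-out name and a trailing period is an extra assumption of this example.

```python
import re

# Variants named in the text: Ca, CA, Cal, Cali.
# The full name "California" and a trailing period are added assumptions.
_CALIFORNIA = re.compile(r"^\s*(ca|cal|cali|california)\.?\s*$", re.IGNORECASE)


def normalize_state(value):
    """Return 'CA' for known spellings of California; leave other values untouched."""
    return "CA" if _CALIFORNIA.match(value) else value
```

Because the pattern is anchored at both ends, a value such as Canada is left alone rather than being mistaken for a California variant.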

Similar research can help with the HL7 example I cited earlier, where different vendors implement a standard differently and put data in different places. Once you identify how a particular vendor codes an address, you can write a program to read it into the format of your choice. This program must be updated, of course, if the vendor changes their coding, which probably will happen without notice. Good reason for running more analytics.

A missing customer ID probably can’t be fixed by examining a single record, although it is possible you’ll discover the ID entered into a different field of the record. More likely, you’ll run a job to match customers by name, gender, address, and other characteristics. You can probably find a record in a different data set and be able to trust, with a good deal of confidence, that it’s the customer with the missing ID.

A job can identify two records that refer to the same customer. Such duplicates often appear when combining data sets from different sources. They can also arise out in the real world for many reasons: the customer changed his name, moved to a new address, decided to use a different email address, got a misspelled name because someone entered it into the system sloppily, etc.

Another example where a job can help enforce quality is checking city names against ZIP codes in US addresses. If you find two cities with the same ZIP code in your data, at least one is incorrect. Every ZIP code in the US is assigned to only one city (although a city can have many ZIP codes).
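The ZIP code check can be sketched in a few lines. The record layout (dictionaries with zip and city keys) is assumed for the example, and city names are compared after trimming whitespace and normalizing capitalization so that trivial differences are not reported as conflicts.

```python
from collections import defaultdict


def conflicting_zip_codes(records):
    """Return {zip_code: set of city names} for ZIP codes seen with more than one city."""
    cities_by_zip = defaultdict(set)
    for record in records:
        cities_by_zip[record["zip"]].add(record["city"].strip().title())
    return {z: cities for z, cities in cities_by_zip.items() if len(cities) > 1}
```

The function only flags the conflict; deciding which city name is the correct one still requires a reference data set or a human.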

Data older than a certain age should be discarded, or marked as less trustworthy because it might have changed.

Certain fields must not be empty. An empty field may be hard to identify because some people enter meaningless strings such as X or 9999 when they don’t know something. Sometimes you can find the data elsewhere and fill it in, but sometimes you’ll choose to reject the whole record.

Dates and times must be correct, and must be in a standard format.

Many commercial tools provide built-in functions to do common checks and even make fixes, but many sites write filters of their own at least part of the time.
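A home-grown filter for two of the checks above (disguised empty fields and date formats) might start out like this sketch. The placeholder list and the accepted input date layouts are assumptions; each site would tune them to its own feeds.

```python
from datetime import datetime

# "X" and "9999" come from the text; the other placeholders are assumed.
PLACEHOLDERS = {"", "x", "9999", "n/a", "unknown"}

# Input date layouts this sketch accepts; the list is an assumption.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def is_empty(value):
    """Treat missing values and known placeholder strings as empty."""
    return value is None or value.strip().lower() in PLACEHOLDERS


def standardize_date(value):
    """Return the date as YYYY-MM-DD, or None if it cannot be parsed."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Returning None instead of raising an exception leaves the caller free to decide whether to fill the value in from elsewhere or reject the whole record.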

In addition to checking each field, you usually need some higher-level checks that involve files and metadata. For instance, did incoming data conform to the schema you expected? Are you getting two identical files? Comparing the MD5 hashes generated on the files is a simple way to determine the answer to the previous question.

The data preparation stage is often where sensitive data, such as financial and health information, is protected. Although terms for this differ, most systems distinguish two types of protection: removing a field completely (often called masking) and changing the field to something innocuous (often called tokenization). As an example of tokenization, test data sets substitute realistic but fake names for real names so that developers can test their code against these sets.

Another kind of tokenization is to run the value from a field through a one-way hash (such as MD5), which ensures that the same value is always represented by the same hash, but prevents anyone from deriving the original value. This is a type of pseudonymity.
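A minimal sketch of hash-based tokenization, using MD5 as in the example above, shows the property that matters: equal inputs always give equal tokens, so records can still be joined on the tokenized field. The function name is invented for this illustration.

```python
import hashlib


def tokenize(value):
    """Replace a field value with its MD5 hex digest (32 hexadecimal characters)."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()
```

Values drawn from a small set, such as ZIP codes, can still be guessed by hashing every candidate, so sites that need stronger protection often mix in a secret key, for instance with Python’s hmac module.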

sets.

Managing Workflows

You have designed your filters and jobs for ingestion, cataloguing metadata, data preparation, and Hadoop itself. Can you make regular, productive use of all these things? That depends on how easily you can combine the tasks in end-to-end workflows.

First, you should make workflows for each task. How is data from a particular source ingested? Do you have a general workflow to which you can just assign parameters such as the source and type of data?

And how is the workflow triggered? Forcing someone to launch the job manually is a waste of staff time, and prone to errors.

You could do something as simple as schedule a job at regular intervals. (Unix and Linux provide cron for that purpose.) YARN is an open source tool that helps with resource allocation and scheduling. Resource allocation gets particularly complex in the cloud. You want to ensure you can get the number and capacity of systems you need for the turn-around time you need, while avoiding the risk of jobs growing to an enormous, costly scale.

Your workflow processor should also be able to handle triggers, so that when something important happens like the arrival of new data, the job launches on its own. For instance, AWS Data Pipeline lets you specify that a job starts whenever a particular file is uploaded to S3 storage. The open source Oozie project can also start a job based on the availability of data.

Scheduling should also be flexible. One site I talked to sometimes delays a workflow for a few hours when the servers are at capacity.

Having small workflows in place, you should be able to compose larger workflows from sub-workflows. In that way you can robustly construct a single workflow covering data acquisition, ingestion (putting it in the right repository), cleansing, format conversion, enrichment, and provisioning of the results.

environments, with different parameters appropriate for each environment. With such a system in place, you can have strong confidence that the programs your developers and testers work on will hold up in production.

Currently, most sites create workflows through a programming language. Some developers use Java because that’s the basic way of creating jobs for Hadoop and related tools. Most use popular scripting languages such as Python or simply the Unix shell. However, not all formats handled by Hadoop are supported by all languages. Libraries are continually being added to fill the gap, but you are likely to find a need to incorporate a Java program to format data into your workflow. One advantage of using a programming or scripting language is that you can use source control and testing as you would on any program.

Ideally, users without a technical background could construct and launch their own workflows. To enable this, Zaloni provides a graphical user interface where users can drag and drop predefined workflows, connect them by dragging arrows between them, and then schedule the job.

Job failures, as mentioned before, may sometimes be handled by rerunning the job at various levels of your system, but you’ll have to plan what to do if the job can’t recover from an error. Thus, workflows should send notifications on important events, particularly success or failure. They should also embody rules to decide when to skip a record, or when to stop entirely. For instance, suppose you have two rules during data preparation, one making sure that the input is a number and the other making sure it’s within an allowed range. If the input isn’t a number, it would be meaningless to check it against a range, and there is no point to running the second rule.

After a run, reports can include lots of useful statistics in addition to success or failure. How many records were dropped because they were corrupt? Were input files missing? What were the percentages of such errors, in relation to the whole job?
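The two-rule example and the run statistics can be sketched together. In this illustration the rules run in order and the first failure stops the chain for that record, so the range check never sees a non-number. The rule names, the 0 to 100 range, and the report keys are all assumptions made for the example.

```python
def is_number(value):
    """First rule: the input must be convertible to a number."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def in_range(value, low=0, high=100):
    """Second rule: the number must lie within the allowed (assumed) range."""
    return low <= float(value) <= high


# Rules run in order; the first failure stops the chain for that record.
RULES = [("is_number", is_number), ("in_range", in_range)]


def run_rules(values):
    """Apply the rules to each value and return (clean_values, report)."""
    clean = []
    failures = {name: 0 for name, _ in RULES}
    for value in values:
        for name, rule in RULES:
            if not rule(value):
                failures[name] += 1
                break  # later rules would be meaningless for this record
        else:
            clean.append(float(value))
    total = len(values)
    dropped = total - len(clean)
    report = {
        "total": total,
        "dropped": dropped,
        "dropped_pct": 100.0 * dropped / total if total else 0.0,
        "failures": failures,
    }
    return clean, report
```

A workflow could attach the report to its success or failure notification and also store it as operational metadata for later comparison between runs.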

way you can tell whether your input data is slipping in quality, and whether your tools are doing as good a job as they did on the data where you first ran them.

Your metadata catalog can prove valuable at the error stage. The operations team should be able to see from a log or other report where the problem occurred (which file, which record) and go back to the original data to diagnose the cause.

Access Control

We have seen that access control is crucial for organizational safety, privacy, and regulatory compliance. Large organizations achieve security by dividing users into groups — research teams, operations teams, etc. — and grouping data into resources with access rights.

Then you can grant users or groups access to particular data resources. For instance, one research team may be researching the effectiveness of a website, so you can grant it access to all logs and data about the website without being able to see other things such as sales data.
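Group-based grants of this kind reduce to a simple lookup, sketched below. The group names, resource names, and users are hypothetical, and a production system would delegate this to its security layer rather than to in-memory dictionaries.

```python
# Hypothetical groups, resources, and users, used only for illustration.
GRANTS = {
    "web_research": {"web_logs", "web_analytics"},
    "finance": {"sales"},
}
MEMBERSHIP = {
    "dana": {"web_research"},
    "lee": {"finance", "web_research"},
}


def can_access(user, resource):
    """True if any of the user's groups has been granted the resource."""
    groups = MEMBERSHIP.get(user, set())
    return any(resource in GRANTS.get(group, set()) for group in groups)


def submit_job(user, resources):
    """Reject the job unless the user may read every resource it touches."""
    denied = sorted(r for r in resources if not can_access(user, r))
    if denied:
        raise PermissionError(f"{user} may not read: {', '.join(denied)}")
    return "accepted"
```

Checking at submission time means a job that touches forbidden data is rejected before it consumes any cluster resources.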

One site I talked to isolated personally identifiable information (PII) through a hybrid solution. It’s often easy to tell by the column name whether data is personally identifiable, and route such columns to a different repository with different access rights. Sometimes a processor needs to tag data with special identifiers so that it is routed later to the secure repository. Each stage, including the analytics, can be restricted to the repositories that don’t contain PII.
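The hybrid approach might be sketched as follows: columns are routed to the secure side either because their name looks like PII or because an earlier stage tagged them. The name pattern is a crude assumption (it would also catch a column called filename, for instance), and the function is an illustration rather than a description of what that site actually runs.

```python
import re

# Column-name hints for PII; this list is an assumption, not a complete rule set.
PII_PATTERN = re.compile(r"(name|email|phone|ssn|address|birth)", re.IGNORECASE)


def split_pii(record, tagged=()):
    """Split one record into (general, secure) parts.

    A column goes to the secure part if its name looks like PII or if an
    earlier processing stage listed it in `tagged`.
    """
    general, secure = {}, {}
    for column, value in record.items():
        is_pii = column in tagged or PII_PATTERN.search(column) is not None
        (secure if is_pii else general)[column] = value
    return general, secure
```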

Conclusion

A recent report² found that governments and other organizations are opening up large quantities of data, but many of the companies who could benefit from it don’t know it exists. The same problem can happen within your own organization.

Hadoop, at its core, is a file system and a set of libraries to process large quantities of data. Management of that data (ingestion, data preparation, job scheduling, and access rights) must be addressed by other tools. Tools such as Sqoop and YARN are emerging in the open source community to pick off various pieces of the data management problem. You should use robust open source tools where they are available and keep data in transparent formats so that it can be submitted to these tools, while taking advantage of commercial products aimed at the data lake.

You’re spending a lot of money to accumulate and store data. Therefore, the people who need the data must be able to find it and combine it quickly into analytic jobs that produce useful insights.

Recognizing the specific tasks you need for acquisition and ingestion, cataloguing, data cleaning, and analytical jobs can help you prepare for the problems you’ll encounter in these phases and have production-ready solutions at hand. Workflows and access control contribute important management solutions across the entire system. All that shiny data is there for your users to enjoy; make it a pleasure for them.

About the Author
