The Hadoop Performance Myth

Why Best Practices Lead to Underutilized Clusters, and Which New Tools Can Help

The Hadoop Performance Myth

by Courtney Webster

Copyright © 2016 O’Reilly Media Inc. All rights reserved. Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Nicole Tache

Production Editor: Kristen Brown

Copyeditor: Amanda Kersey

Interior Designer: David Futato

Cover Designer: Randy Comer

Illustrator: Rebecca Demarest

Revision History for the First Edition

2016-03-15: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. The Hadoop Performance Myth, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

The Hadoop Performance Myth

Hadoop is a popular (if not de facto) framework for processing large data sets through distributed computing. YARN allowed Hadoop to evolve from a MapReduce engine into a big data ecosystem that can run heterogeneous workloads.

The Challenge of Predictable Performance

Hadoop breaks down a large computational problem into tiny, modular pieces across a cluster of commodity hardware. Each computational piece could be run almost anywhere within the cluster and, as a result, could be a little faster or slower based on that machine’s specifications. Hadoop was designed to include redundancy to keep this variability from impacting performance. If a particular task is running slower than expected, Hadoop may launch the same computation on another copy of the target data. Whichever task completes first wins.

Optimizing performance on Hadoop 1.0 wasn’t necessarily easy, but it had fewer variables to contend with than later versions. Version 1.0 only ran a specific workload (MapReduce) with known resource needs (CPU and I/O). And it was designed to address what its creators believed would be the primary performance challenge: machine variance.

Hadoop 2.0 was restructured to overcome key limitations of the first release in scalability, availability, and utilization. Version 2.0 expands Hadoop’s capabilities beyond just MapReduce so that a cluster can run different types of workloads at the same time. A resource scheduler (YARN) ensures that jobs can start only if the system has the resources the job estimates it needs.

RESOURCE AND PERFORMANCE NEEDS FROM MIXED WORKLOADS

In addition to Hadoop 2.0 (YARN) supporting more users and organizations in a single cluster, mixed workloads demand different resources and performance goals than MapReduce-only jobs:1

Workload                        Resource constraints    Performance goals
Long-running services           CPU time                Instantaneous availability
DAG-of-task (like MapReduce)    CPU- or I/O-bound       Low scheduler latency; high performance/throughput

Despite these enhancements, maintaining predictable cluster performance is a monumental challenge. The key contributors to the issue include:

Hadoop’s design, which includes redundancy to prevent single-machine failure

The tremendous complexity introduced by running mixed workloads with different requirements

Resource schedulers, like YARN, that prioritize fault tolerance over efficiency

YARN’s inability to adjust resources while a job is running: it can ensure that resources are available immediately before a job begins, but it cannot prevent resource contention once the job is underway

Optimizing Cluster Performance

Today’s clusters are managing more users, organizations, and resource requests than ever before. This increased demand makes performance critical while, at the same time, the added complexity makes it more difficult to achieve. A marketplace of tools and tricks has been developed to optimize performance as clusters grow in size and complexity. We’ll begin by examining the most common approaches: provisioning, application isolation, and tuning.

Provisioning

When performance issues occur, classic logic tells us to scale the cluster: if you double your nodes, job time should be cut in half. This can be disheartening after you’ve painstakingly planned the cluster for projected data, resource, and performance needs. Adding nodes increases cost and complexity, leading not only to added capital expenses but also to additional expert staffing needs.

Application Isolation

Another technique to improve performance is to isolate a workload within its own Hadoop cluster, especially if it is a job under a critical completion deadline.

There are a few cases when workload isolation may be sufficient. If you are a massive company with a few monolithic applications and the resources for duplicate hardware, this may be a viable strategy. Cluster isolation could also keep incompatible workloads apart (like HBase causing conflicts with MapReduce). Or perhaps certain data requires isolation for security reasons. Outside of these scenarios, this technique takes us further down the road of expensive overprovisioning, not to mention a regression from the traditional goals of a data-centric organization. Data siloed inside its own cluster becomes hard to access or plug into other workflows, requiring the company to rely on snapshots or stale data.

DRAWBACKS OF PROVISIONING AND ISOLATION

Overprovisioning adds complexity and cost.

Application isolation is expensive and results in siloed data.

Tuning

Tuning is an essential part of maintaining a Hadoop cluster. Performance benchmarking and adjustment are bona fide methods to identify bad or inefficient code and poorly configured parameters. Cluster administrators must interpret system metrics and optimize for specific workloads (e.g., high CPU utilization versus high I/O).

To know what to tune, Hadoop operators often rely on monitoring software for insight into cluster activity. Tools like Ganglia, Cloudera Manager, or Apache Ambari will give you near-real-time statistics at the node level, and many provide after-the-fact reports for particular jobs. The more visibility you have into all cluster resources (for example, by including a network monitoring tool like PathView), the better.

Good monitoring alerts you to errors that require immediate attention and helps identify problem areas where tuning can improve performance. Though not the focus of this report, there are myriad management and troubleshooting tools one can use to pinpoint troublesome jobs and inefficient code (White Elephant, Splunk, Starfish, Nagios, Cloudera Navigator Optimizer, Driven, and Unravel, to name only a few).

With reporting tools in place, tuning can start right after cluster configuration. Classic benchmarking workloads (TestDFSIO, TeraSort, PiTest, STREAM, NNBench, and MRBench) can be used to check configuration and baseline performance. Best-practices tuning guides recommend the following adjustments to optimize performance (a configuration sketch illustrating several of these knobs follows the checklist).

Tuning checklist

Number of mappers

If you find that mappers are only running for a few seconds, try to use fewer mappers that can run longer (a minute or so).2

Mapper output

Mappers should output as little data as possible.4

Try filtering out records on the mapper side.3

Use minimal data to form the map output key and map output value.3

Number of reducers

Reduce tasks should run for five minutes or so and produce at least a block’s worth of data.2

Combiners

Can you specify a combiner to cut the amount of data shuffled between the mappers and the reducers?2

Compression

Can you enable map output compression to improve job execution time?2,5

Custom serialization

Can you implement a RawComparator?2

Disks per node

Adjust the number of disks per node (mapred.local.dir, dfs.name.dir, dfs.data.dir) and test how scaling affects execution time.5

JVM reuse

Consider enabling JVM reuse (mapred.job.reuse.jvm.num.tasks) for workloads with lots of short-running tasks.5

Memory management

Make the mapred.child.java.opts property as large as possible for the amount of memory on the task nodes.

Minimize disk spilling.2 One spill to disk is optimal. The MapReduce counter spilled_records is a useful metric, as it counts the total number of records spilled to disk during a job.

Adjust memory allocation using the reference: Total Memory = Map Slots + Reduce Slots + TT (TaskTracker) + DN (DataNode) + Other Services + OS.4
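For concreteness, here is a minimal Java sketch (against the MapReduce Job/Configuration API) showing how several of the knobs above might be set on a single job. The property values are illustrative assumptions, not recommendations, and the identity Mapper/Reducer classes stand in for your own implementations.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Illustrative tuning sketch; all values are examples, not recommendations.
    public class TunedJobSketch {
      public static Job configure(Configuration conf) throws Exception {
        // Number of mappers: raise the minimum split size so each mapper
        // runs for roughly a minute instead of a few seconds.
        conf.setLong("mapreduce.input.fileinputformat.split.minsize",
            256L * 1024 * 1024);
        // Compression: compress intermediate map output to shrink the shuffle.
        conf.setBoolean("mapreduce.map.output.compress", true);
        // JVM reuse (MapReduce v1 property name, as in the checklist).
        conf.setInt("mapred.job.reuse.jvm.num.tasks", 10);
        // Memory management: heap for each task JVM (example value).
        conf.set("mapred.child.java.opts", "-Xmx2048m");

        Job job = Job.getInstance(conf, "tuned-job-sketch");
        job.setJarByClass(TunedJobSketch.class);
        // Replace the identity Mapper/Reducer with your own classes.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        // Combiner: pre-aggregate map output to cut shuffled data
        // (your combiner class would go here).
        job.setCombinerClass(Reducer.class);
        // Number of reducers: aim for ~5-minute tasks writing at least a block.
        job.setNumReduceTasks(8);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        return job;
      }
    }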

Data locality detection

After going through the classic tuning recommendations, you may want to check whether data locality issues are impacting your cluster’s performance. Hadoop is designed to prioritize data locality (processing computational functions on the same node where the data is stored). In practice, this may not be as optimized as one would think, especially for large data clusters.6 Identifying these issues is not a task for the faint-hearted, as you’d likely have to comb through various logs to determine which tasks access which data nodes.
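Before combing through logs, a cheaper first check is to read a finished job’s locality counters. A minimal sketch, assuming you hold a reference to a completed org.apache.hadoop.mapreduce.Job:

    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobCounter;

    // Rough data-locality check for a completed job: compare data-local map
    // tasks against rack-local and other-local ones. A low data-local share
    // suggests locality problems worth digging into.
    public class LocalityCheck {
      public static void report(Job job) throws Exception {
        Counters counters = job.getCounters();
        long dataLocal = counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
        long rackLocal = counters.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue();
        long otherLocal = counters.findCounter(JobCounter.OTHER_LOCAL_MAPS).getValue();
        long total = dataLocal + rackLocal + otherLocal;
        if (total > 0) {
          System.out.printf("Data-local maps: %d of %d (%.0f%%)%n",
              dataLocal, total, 100.0 * dataLocal / total);
        }
      }
    }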

YARN-specific tuning

For Hadoop 2.0 implementations, you should also tune some of YARN’s parameters. Start by determining the resources you can assign to YARN by subtracting hardware requirements from the total CPU cores and memory.7 Don’t forget to allocate resources for services that don’t subscribe to YARN (like Impala, HBase RegionServer, and Solr) and for task buffers (like the HDFS sort I/O buffer).7 It’s recommended to right-size YARN’s NodeManager and ResourceManager resources as well.7
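As a rough illustration of that arithmetic, the sketch below derives the per-node resources advertised to YARN. The node size, reservations, and container bounds are assumptions for the example, and in practice these properties would live in yarn-site.xml rather than be set in code:

    import org.apache.hadoop.conf.Configuration;

    // Back-of-the-envelope sizing for YARN on a hypothetical 64 GB / 16-core
    // worker node. Reserve memory and cores for the OS, HDFS/YARN daemons,
    // and non-YARN services, then advertise the remainder to YARN.
    public class YarnSizingSketch {
      public static Configuration size() {
        int totalMemMb = 64 * 1024;
        int reservedMemMb = 16 * 1024;   // OS + daemons + non-YARN services (assumed)
        int totalVcores = 16;
        int reservedVcores = 4;          // assumed reservation

        Configuration conf = new Configuration();
        conf.setInt("yarn.nodemanager.resource.memory-mb", totalMemMb - reservedMemMb);
        conf.setInt("yarn.nodemanager.resource.cpu-vcores", totalVcores - reservedVcores);
        // Bound individual container requests so one job cannot claim a whole node.
        conf.setInt("yarn.scheduler.minimum-allocation-mb", 1024);
        conf.setInt("yarn.scheduler.maximum-allocation-mb", 8192);
        return conf;
      }
    }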

The Limitations of Tuning

All tuning is retrospective.... Knowing where you’ve been doesn’t help you figure out where you’re going.

For stable clusters with consistent workloads, tuning may be all that’s required to wring out additional performance and improve cluster utilization. But subscribed “best practices” built from benchmarking workloads like TeraSort may or may not map perfectly to live applications or dynamic clusters.

More often than not, the complicated and retrospective nature of tuning does not solve performance problems:

Tuning delivers diminishing returns

Monitoring programs like Ganglia can provide subminute updates on individual node and cluster-wide performance (load chart, CPU utilization, and memory usage). This provides a “what,” but does not help diagnose a “why.” Even with these tools, parsing job history files, tuning parameters, and measuring the impact of minor changes is time-consuming and requires expert-level staff. This can result in a rabbit hole that eventually delivers diminishing returns.

You can’t tune what you can’t measure

Most Hadoop monitoring tools report on some resources (like CPU and memory), but not all cluster resources (like network and disk I/O). You may need a complex combination of tools to report all the data required to identify root issues.

Retrospection can’t guarantee better prediction

The final nail in the coffin is that all tuning is retrospective. You can’t expect trail maps of known territory to help you navigate an uncharted route. Knowing where you’ve been doesn’t help you figure out where you’re going. For example, what happens for a new job with a new usage profile? What if an old job starts using more or less resources (CPU/memory/disk/network)? What if the demands of other nonjob services change?

In clusters with dynamic activity (mixed workloads, multitenant applications, and variable ad hoc use), optimizing past issues simply does not future-proof cluster performance.

DRAWBACKS OF TUNING

Monitoring tools provide a limited “what” (coarse data that cannot provide visibility down to the process level) and not a “why.”

Parsing log files, modifying variables, and testing impact is time-consuming and requires expertise.

There are too many options to adjust on a cluster, and what works for one job may not work for the next.

You need a combination of tools to report on all cluster resources (CPU, memory, network, and disk I/O).

How Resource Managers Affect Performance and Utilization

At this point, you may be asking yourself, "Once I’ve tuned, isn’t my resource manager supposed to help coordinate requests and avoid resource contention? Surely the resource manager could provide consistent performance if it could perfectly allocate jobs." The answer to this question may surprise you.

YARN’s resource manager performs a vital function: ensuring that jobs complete in a fault-tolerant (though not necessarily high-performance) way. Before we can answer whether YARN can alleviate resource contention, we need to understand how the resource manager executes tasks and which resources it controls (and which it doesn’t), because that determines its impact on cluster performance.

How YARN works

A YARN cluster has a single resource manager that monitors the entire cluster’s resources. Applications submit requests, which include resources needed to complete the task (like memory and CPU) and other constraints (like preferred data locality), to the resource manager.

The resource manager queries the system and finds a node manager with resources that meet or exceed the application’s request. Node managers (which run on all nodes in the cluster) monitor and launch the containers that will execute the job. If the node manager seems to have enough resources available, it is directed to launch a new container by the resource manager. The container then executes the process using its resources.2
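To make that request/allocate flow concrete, here is a minimal sketch of an application master asking YARN for one container via the public AMRMClient API. The memory, vcore, node, and rack values are placeholders, and error handling and most of the AM lifecycle are omitted:

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    // Minimal application-master sketch: register with the resource manager,
    // then ask for one container. All sizes and hostnames are placeholders.
    public class ContainerRequestSketch {
      public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(new YarnConfiguration());
        rm.start();
        rm.registerApplicationMaster("", 0, "");

        Resource capability = Resource.newInstance(2048, 1); // MB, vcores
        Priority priority = Priority.newInstance(0);
        String[] nodes = {"worker-node-17"};   // hypothetical preferred node
        String[] racks = {"/rack-2"};          // hypothetical preferred rack

        // The request states what the task claims it needs; once granted, the
        // full amount is held for the container's lifetime, used or not.
        rm.addContainerRequest(new ContainerRequest(capability, nodes, racks, priority));

        // A real AM would now loop on rm.allocate(progress), launch granted
        // containers through an NMClient, and finally unregister.
        rm.stop();
      }
    }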

YARN prioritizes fault tolerance over performance

If an application’s preferred node is unavailable, for example, YARN can place the container on a node holding a replica of the data, then a container within the same rack, and finally a container not within the same rack. Or, it can halt the request entirely.2

Another example is the practice of speculative execution. Once an application is running, the application manager can identify tasks that are slower than expected. It can’t affect the pace of the slow task and doesn’t diagnose the issue (e.g., hardware degradation). Instead, it requests that YARN launch another instance of the task on a copy of the data as a backup. Once one task completes, redundant tasks are killed. This speculative execution increases fault tolerance at the cost of efficiency, especially for busy clusters. Knowing when to use speculative execution (and when not to use it, like for reduce tasks) has an impact on overall cluster performance.
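Speculative execution can be toggled per task type. A minimal sketch using the Hadoop 2.x property names, here leaving map-side speculation on while disabling it for reducers (a common choice, since duplicate reducers re-fetch map output and rewrite results):

    import org.apache.hadoop.conf.Configuration;

    // Toggle speculative execution per task type (Hadoop 2.x property names).
    public class SpeculationSketch {
      public static Configuration configure() {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", false);
        return conf;
      }
    }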

Applications overestimate needed resources

In an ideal world, the application would request only the resources it needed. Theoretically, you could predict resources for particular workloads by testing each cluster variable independently. For example, you could test how performance changes based on:

Number of cores

Number of nodes

Load changes (size and intensity)

Interference (for shared clusters)8

Gathering this data scientifically should allow you to perfectly estimate the resources you need. But you’d also have to repeat these experiments any time something in the cluster changed (versions, platforms, hardware, etc). As a result, resource requests are typically tuned for worst-case scenarios (upper-limit thresholds).

How YARN affects cluster utilization

Once a job begins, YARN reserves the maximum requested resources regardless of whether or not they are needed. For example, if a job requests memory resources with an upper limit of 3 GB, the cluster allocates the entire 3 GB, even if the running job is only using 1.5 GB. This results in your monitoring tools reporting submaximal utilization, and yet you are unable to start more jobs.9
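In MapReduce-on-YARN terms, that upper limit is simply the per-task container request. A hypothetical sketch of the 3 GB worst-case request described above (the heap value follows the common rule of thumb of roughly 80% of the container size):

    import org.apache.hadoop.conf.Configuration;

    // Hypothetical worst-case sizing: each map task asks YARN for a 3 GB
    // container (with a ~2.4 GB heap inside it). YARN reserves the full 3 GB
    // for the container's lifetime even if the task only ever touches 1.5 GB.
    public class WorstCaseRequestSketch {
      public static Configuration configure() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.memory.mb", 3072);
        conf.set("mapreduce.map.java.opts", "-Xmx2458m");
        return conf;
      }
    }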

YARN can’t prevent resource contention

Finally, though YARN controls starting and stopping processes across its cluster, in-progress jobs are allowed uncontrolled competition for available resources. Additionally, YARN manages CPU and memory allocations, but these are not the only resources subject to contention. Managing other elements, like disk and network I/O, is planned but not currently supported. This uncontrolled contention puts overall performance in a fragile state where one unexpected snag could jeopardize the entire cluster. For dynamic, mixed-workload, and/or multitenant clusters, resource allocation alone cannot guarantee consistent performance with high utilization.

DRAWBACKS OF RESOURCE SCHEDULERS (LIKE YARN)

Speculative execution improves fault tolerance, but can have a negative impact on cluster efficiency (especially for busy clusters).

Applications poorly estimate the resources they need, typically requesting maximum (upper-limit) thresholds.

YARN locks up the maximum requested resources for a job, regardless of whether they are actually needed while the job is running.

YARN manages CPU and memory allocations, but does not currently support all resources (like disk and network I/O).

While YARN controls when jobs start and stop based on available resources, it cannot manage resource contention once jobs are active.

Improving the performance of your cluster

One alternative is Apache Mesos, a two-level resource scheduler (meaning that the requestor has an opportunity to reject the resource scheduler’s “offer” for job placement). Two-level schedulers are more scalable than monolithic schedulers (like YARN) and allow the framework to decide whether an offer is a good fit for the job. Mesos can schedule resources that YARN doesn’t (like network and disk I/O).

Mesos and YARN can be misconstrued as an either/or choice, but Apache Myriad allows you to run YARN as part of the Mesos framework. This allows Mesos to elastically allocate resources to YARN (making it more dynamic), which should improve utilization of the entire data center. But unless Mesos revokes allocated resources, YARN still locks up maximum resource thresholds once a job is running, leaving resource contention and low utilization as persistent issues.

Virtualized Hadoop (either private or public cloud) enables elastic scaling, which means dynamically adding (and then removing) nodes as needed. This could assist during peak load or times of resource contention. Just like YARN and Mesos, though, the hypervisor will assign new nodes to meet maximum resource requests. More nodes do not lead to more utilized nodes.

Clusters are drastically underutilized

We know that the tactics of existing resource managers lead to overprovisioning in their efforts to improve performance, but we haven’t yet specified the severity of the problem. Best-in-class solutions are simply not providing perfect performance with full utilization. Industry-wide, cluster utilization averages only 6–12%.10 The most efficient clusters (like Google’s and Twitter’s, which co-locate workloads) still report only up to 50% utilization.

Trace analysis of Google’s cluster shows it allocating ~100% of its CPU and ~80% of its memory, but usage of these resources is much lower. Over the 29-day trace period, actual memory usage did not exceed 50% of capacity. CPU usage peaked at 60% but remained below 50% most of the time.1 Similar results have been observed in other large production clusters.

The Need for Improved Resource Prediction and Real-Time Allocation Tools

Traditional optimization techniques may temporarily improve performance but are expensive, time-consuming, and cannot manage the volatility of modern clusters. These performance improvements come with a cost of low utilization.

Reiss, Tumanov, et al. studied Google’s cluster as an example of a large cluster with heterogeneous workloads. They provide a nice summary of recommendations based on their observations in their 2012 SOCC report.1 We’d like to discuss two of those recommendations and the emerging products that can help your cluster meet the need for efficient, consistent performance:

Recommendation 1: better resource prediction

First, better prediction of resource needs would eliminate overallocation of resources. As Reiss states, “Resource schedulers need more sophisticated time-based models of resource usage.”1

Recommendation 2: real-time resource allocation

Secondly, managers must be able to dynamically adjust resources based on real-time usage (not allocation). “To achieve high utilization in the face of usage spikes, schedulers should not set aside resources but have resources that can be made available by stopping or mitigating fine-grained or low-priority workloads.”1

To accomplish the latter recommendation, the manager must allow the user to set priorities so that real-time allocation can function effectively.11 The manager must be able to dynamically decide where resources can be siphoned from and where they should be fed to. This can’t rely on human intervention: a problem that requires lots of tiny decisions per second requires a programmatic solution.

Quasar: A performance-constrained, workload-profiling cluster manager

In the spring of 2014, Christina Delimitrou and Christos Kozyrakis from Stanford University published a report on their new cluster manager (named Quasar) that provides programmatic resource prediction.8,10 Quasar allows users to specify performance constraints, then profiles incoming workloads to classify their resource needs. The resource classification is provided to a “greedy” scheduler that looks to allocate the least resources (a minimum threshold instead of a maximum threshold) that will satisfy the performance target.10

Their prediction/classification technique uses two models in tandem: wavelet transform prediction and workload pattern classification.8 The wavelet method decomposes a resource trace into individual wavelets and then predicts future needs based on each wavelet’s pattern. The workload classification method breaks a resource trace down into smaller models (e.g., spike, plateau, and sinusoidal). When a workload’s behavior changes, it is matched to a new model and resources are reallocated accordingly (albeit with a slow sampling rate of 10 minutes).10 On a 200-node cluster, Quasar achieved 62% overall utilization and met its performance constraints for batch and latency-critical workloads.10

At this time, Quasar is not open source or commercialized for immediate adoption. Christos Kozyrakis joined Mesosphere in the fall of 2014, and some Quasar code was released in July 2015 as part of Mesos 0.23.0. Termed “oversubscription,” this version provides “experimental support” for launching tasks with resources that can be revoked at any time. It’s reported that other features may be reserved for Mesosphere’s DCOS (Data Center Operating System) Enterprise Edition.12

Pepperdata: A real-time resource allocation performance optimizer

Pepperdata, founded in 2012, offers real-time, active performance optimization software that supports the major Hadoop distributions (Cloudera, Hortonworks, MapR, IBM BigInsights, Pivotal PHD, and Apache).13

Pepperdata installs a node agent on every node in the cluster (with a low 0.1% CPU overhead) that collects over 200 metrics at 3–5 second intervals.13 It monitors hardware (CPU, memory, disk I/O, and network) by process, job, and user; is aware of actual hardware usage across the entire cluster, second by second; and dynamically reshapes hardware usage in real time to adapt to the ever-changing conditions of a chaotic cluster.

Unlike most monitoring-only tools (which just report node metrics or provide tuning recommendations based on past behavior), Pepperdata dynamically allocates resources according to real-time usage. If YARN allocates 4 GB of memory to a particular job but only 2 GB are being used, Pepperdata re-informs YARN of the usage number and allows YARN to assign the remaining 2 GB elsewhere.9

Users specify simple priorities (for example, production gets 50% of the cluster, data science gets 25%, and so forth) without static partitioning. In times of nonpeak use, data science jobs can exceed those priority thresholds as needed. In times of resource contention, Pepperdata reallocates resources from lower-priority jobs to help high-priority jobs complete on time. It is reported that an average user can expect a 30–50% increase in throughput by adding Pepperdata to their cluster.11 Chartboost, a mobile-games platform, reports significant AWS savings after adding Pepperdata to its cluster.14

Conclusion

Traditional best practices can improve performance and may be enough for clusters with single workloads and consistent needs. More often, these methods are nonscalable stopgaps that cannot deliver QoS for dynamic, mixed-workload, or multitenant clusters. The conservative actions of resource managers and the practice of overprovisioning may help with peak resource contention, but they lead to drastic underutilization. In most clusters, 88–94% of the resources are left as dormant assets.

Trace analysis of a large, heterogeneous cluster pointed to a need for better resource prediction and real-time resource allocation to improve performance and increase utilization. Pepperdata allows a Hadoop cluster to allocate resources in real time by re-informing YARN of actual (not theoretical) resource usage.

References

1. Reiss, C., Tumanov, A., Ganger, G.R., Katz, R.H., Kozuch, M.A. “Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis.” SOCC, 2012.

2. White, Tom. Hadoop: The Definitive Guide, 4th Edition. Sebastopol, CA: O’Reilly Media, 2015.

3. Ren-Chyan Chern, F. (2014, March 5). Hadoop Performance Tuning Best Practices [Weblog post]. Retrieved from http://www.idryman.org/blog/2014/03/05/hadoop-performance-tuning-best-practices/.

4. “Pepperdata on a Highly Tuned Hadoop Cluster.” Pepperdata, June 2015.

5. “Hadoop Performance Tuning - A Pragmatic & Iterative Approach.” DHTechnologies, 2013.

6. Kopp, M. (2013, July 17). Top Performance Problems discussed at the Hadoop and Cassandra Summits [Weblog post]. Retrieved from http://apmblog.dynatrace.com/2013/07/17/top-performance-problems-discussed-at-the-hadoop-and-cassandra-summits/.

7. “Tuning the Cluster for MapReduce v2 (YARN)”. Cloudera, 13 January 2016.

8. [Twitter University]. (2014, April 8). Improving Resource Efficiency with Apache Mesos [Video file]. Retrieved from https://www.youtube.com/watch?v=YpmElyi94AA.

9. “4 Warning Signs That Your Hadoop Cluster Isn’t Optimized... And how Pepperdata Can Help.” Pepperdata.

10. Delimitrou, C., Kozyrakis, C. “Quasar: Resource-Efficient and QoS-Aware Cluster Management.” ASPLOS, 2014.
11. “Now Big Data Works for Every Enterprise: Pepperdata Adds Missing Performance QoS to Hadoop.” Taneja Group, 2015.

12. Morgan, T.P. (2015, June 9). Mesos Brings The Google Way To The Global 2000 [Weblog post]. Retrieved from http://www.nextplatform.com/2015/06/09/mesos-brings-the-google-way-to-the-global-2000/.

13. “Pepperdata Overview and Differentiators.” Pepperdata, 2014.

14. “Chartboost sees significant AWS savings with Pepperdata.”
