Mastering Apache Spark 2.x, Second Edition

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. The author is a member of the IBM Technical Expert Council and the IBM Academy of Technology, IBM's leading brain trust.

Apache Spark with Jupyter Notebooks on IBM DataScience

Apache Spark is an in-memory, cluster-based parallel processing system that provides a wide range of functions such as graph processing, machine learning, stream processing, and SQL. The book goes on to show how to incorporate H2O and Deeplearning4j for machine learning, and Jupyter Notebooks, Zeppelin, Docker, and Kubernetes for cloud-based Spark.

What this book covers

The chapter on classic MLlib introduces you to MLlib, the de facto standard for machine learning when using Apache Spark. We will also learn what deep learning is and how to use it on top of Apache Spark using third-party libraries.

What you need for this book

Who this book is for

Conventions

The words you see on the screen, for example, in menus or dialog boxes, appear in text like this: "To download new modules, we'll go to Files | Settings | Project Name | Project Interpreter."

Reader feedback

Customer support

Downloading the example code

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Apache-Spark-2x.

Downloading the color images of this book

Errata

To see the previously submitted errata, go to https://www.packtpub.com/books/content/support and type the name of the book in the search field.

Piracy

Questions

Fortunately, Hadoop stack vendors like IBM and Hortonworks offer you an open data platform, a Hadoop stack, and a cluster manager that integrates with Apache Spark, Hadoop, and most of the current stable of tools, all based on open source. In the following sections, we will discuss each of the aforementioned components in more detail before diving into the material that begins in the next chapter.

Spark machine learning

Apache SystemML is the first library we introduce that runs on top of Apache Spark but does not ship with the Apache Spark distribution. For deep learning on Apache Spark, H2O, Deeplearning4j, and Apache SystemML are further examples of very interesting third-party machine learning libraries that do not ship with the Apache Spark distribution.

Spark Streaming

It is nice that there is so much choice and that you are not locked into a single library when you use Apache Spark. While Apache Spark has integrated GraphX, another Apache Spark library, into its distribution, we don't expect the same to happen with these third-party libraries too soon.

Spark SQL

In addition, there are a plethora of additional connectors to access NoSQL databases outside of the Hadoop ecosystem directly from Apache Spark. In the Apache Spark SQL chapter, we'll see how the Cloudant connector can be used to access a remote Cloudant NoSQL database.

Spark graph processing

Extended ecosystem

We have used the H2O machine learning library, SystemML, and Deeplearning4j to extend Apache Spark's MLlib machine learning module. We have shown that deep learning and high-performance, cost-based optimized machine learning can be brought to Apache Spark.

What's new in Apache Spark V2?

Cluster design

However, how can the Apache Spark cluster itself, within the big data cluster, be configured? The next section will explore this and describe each type of Apache Spark cluster manager.

Cluster management

You should monitor edge nodes for resource usage and adjust the resources and/or application location as needed. This section briefly set the scene for big data clustering in terms of Apache Spark, Hadoop, and other tools.

Local

In general, the edge nodes acting as cluster NameNode servers or Spark master servers will require greater resources than the cluster processing nodes within the firewall. When many Hadoop ecosystem components are deployed on the cluster, all of them will require extra memory on the master server.

Standalone

Here, hostname is the name of the host that the Spark master is running on. You can consider allowing concurrent application scheduling by setting the resource configuration options for each application; for example, using spark.cores.max to share cores between applications.
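
A minimal sketch of such a configuration (the host name, port, and core count are placeholders, not values from the book):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("StandaloneExample")
  .setMaster("spark://hostname:7077")   // "hostname" is the Spark master host; 7077 is the default standalone port
  .set("spark.cores.max", "4")          // cap cores so that applications can be scheduled concurrently
val sc = new SparkContext(conf)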

Apache YARN

We've specified the default port here, but this is configurable.

Apache Mesos

Cloud-based deployments

Performance

The cluster structure

Hadoop Distributed File System

Finally, Cassandra is not a file system but a NoSQL key-value store that is tightly integrated with Apache Spark, and is therefore marketed as a valid and powerful alternative to HDFS--or indeed any other distributed file system--especially since it supports predicate push-down using Apache Spark SQL and the Catalyst optimizer, which we will cover in the following chapters.

Data locality

Memory

Can you change the storage fraction and memory used by the JVM for storage and caching of RDDs? Use the Storage page in the Apache Spark UI to see if this fraction is set to an optimal value.

Coding

Finally, I can recommend the following resource for Apache Spark performance-related issues: https://spark.apache.org/docs/latest/tuning.html.

Cloud

Summary

As structured data processing steadily becomes the standard way of analyzing data in Apache Spark, optimizers like Tungsten and Catalyst play an important role; so important that we have devoted two whole chapters to the subject. This chapter covers data import and storage, the processing of text files, the processing of JSON files, the processing of Parquet files, and the DataSource API.

The SparkSession--your gateway to structured data processing

Before we move on to SQL, DataFrames and Datasets, we will cover an overview of the SparkSession. Next, we'll examine some of the supported file formats available for importing and saving data.
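
A minimal sketch of creating the SparkSession, the gateway to structured data processing (the application name is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("StructuredDataProcessing")
  .getOrCreate()

// The older SparkContext is still reachable if needed.
val sc = spark.sparkContext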

Importing and saving data

Processing the text files

Processing JSON files

First, we still get the file wrapped in a subfolder, but that's not a problem since HDFS treats folders the same as files, and as long as the files they contain stay in the same format, there's no problem. The _SUCCESS marker file written alongside the data is a framework-independent way of telling users that the job of writing this file (or folder, respectively) has completed successfully.
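
A minimal sketch of reading and writing JSON (paths are placeholders; a SparkSession named spark is assumed). Spark writes the result as a folder containing part files plus the _SUCCESS marker:

val df = spark.read.json("hdfs:///data/input.json")    // read JSON into a DataFrame
df.write.json("hdfs:///data/output.json")               // written as a folder of part files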

Processing the Parquet files

For more information about the possible SparkContext and SparkSession methods, check the API documentation of the called classes. They were introduced in Spark 1.3 and became first-class citizens in Apache Spark 1.5 and 1.6.
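
A corresponding sketch for Parquet, continuing with the df DataFrame from the previous sketch (paths are placeholders). Parquet stores the schema with the data, so no schema needs to be supplied when reading it back:

df.write.parquet("hdfs:///data/output.parquet")
val parquetDF = spark.read.parquet("hdfs:///data/output.parquet")
parquetDF.printSchema()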

Understanding the DataSource API

Implicit schema discovery

If the JSON file gets too large and you want to use the lazy loading nature that every Apache Spark data object typically supports, you can specify a portion of the data to be sampled in order to extract the names and types of columns from a JSON file. Another example is the Java Database Connectivity (JDBC) data source, where the schema does not even need to be inferred, but is read directly from the source database.
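
A sketch of sampling-based schema inference using the JSON reader's samplingRatio option (the path and ratio are placeholders):

val sampledDF = spark.read
  .option("samplingRatio", "0.1")   // infer the schema from roughly 10% of the records
  .json("hdfs:///data/large.json")
sampledDF.printSchema()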

Predicate push-down on smart data sources

You don't necessarily need to understand the entire code snippet, but you can see that two parameters are passed to the buildScan method: requiredColumns and filters.

DataFrames

In this case, we passed 3 as an optional parameter that limits the output to the first three rows. Note that the result of a method on a DataFrame again returns a DataFrame, similar to the concept of RDD methods returning RDDs.

Using SQL

Thus, SQL-like actions can be performed against DataFrames, including select, filter, sort, groupby, and print. The next section shows how to create tables from DataFrames and perform SQL-based actions against these tables.
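
A minimal sketch of that workflow (the view name and column names are assumptions based on the schema shown later in this chapter; a SparkSession named spark and a DataFrame df are assumed):

df.createOrReplaceTempView("washing")
val result = spark.sql("SELECT temperature, hardness FROM washing WHERE temperature > 100")
result.show()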

Defining schemas manually

So we're fine; we lost the schedule information, but the rest of the information remains:

StructField(_id, StringType, true)
StructField(_rev, StringType, true)
StructField(count, LongType, true)
StructField(flowrate, LongType, true)
StructField(fluidlevel, StringType, true)
StructField(frequency, LongType, true)
StructField(hardness, LongType, true)
StructField(speed, LongType, true)
StructField(temperature, LongType, true)
StructField(ts, LongType, true)
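
A sketch of how such a schema can be defined manually and applied when reading the data (the path is a placeholder; the field names follow the listing above):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("_id", StringType, true),
  StructField("_rev", StringType, true),
  StructField("count", LongType, true),
  StructField("flowrate", LongType, true),
  StructField("fluidlevel", StringType, true),
  StructField("frequency", LongType, true),
  StructField("hardness", LongType, true),
  StructField("speed", LongType, true),
  StructField("temperature", LongType, true),
  StructField("ts", LongType, true)
))

// Apply the manual schema instead of relying on schema inference.
val dfWithSchema = spark.read.schema(schema).json("hdfs:///data/washing.json")
dfWithSchema.printSchema()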

Using SQL subqueries

Applying SQL table joins

Using Datasets

The Dataset API in action

User-defined functions

The newly registered function called toAgeRange can now be used in the select statement.
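
As a sketch of this step (only the toAgeRange name comes from the text; the function body, the people view, and the age column are hypothetical), a UDF is registered and then becomes usable inside SQL:

spark.udf.register("toAgeRange", (age: Long) =>
  if (age < 20) "young" else if (age < 65) "adult" else "senior"
)
spark.sql("SELECT toAgeRange(age) AS ageRange FROM people").show()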

RDDs versus DataFrames versus Datasets

This knowledge is important for writing efficient SQL queries or data transformations on top of a DataFrame or Dataset relational API. That's why it's extremely important to look at the Apache Spark optimizer called Catalyst, which actually takes your high-level program and turns it into efficient calls on top of the RDD API, and, in later chapters, at Tungsten, which is an integral part of Apache Spark.

Understanding the workings of the Catalyst Optimizer

Managing temporary views with the catalog API

The SQL abstract syntax tree

So, as you can see, at the bottom of the tree (those nodes are also called leaves since they have no further nodes associated with them) we have two integer literals: one and two. So in other words, this AST does nothing but add the number three to each of the values in the x column.

How to go from Unresolved Logical Execution Plan to Resolved Logical Execution Plan

The Add operation actually turns into another literal (three in this case), which is then used by another Add operation as one of its inputs. The other input, labeled as Attribute(x), now takes the value of x from that particular relation and exposes it as a materialized value to the Add operator.

Internal class and object representations of LEPs

As can be clearly seen in the previous figure, LeafExpression is the parent of most functional operators. Once the type is opened, the Open Type Hierarchy function (Navigate | Open Type Hierarchy...) can be used to generate those diagrams, as can be seen in the previous figures.

How to optimize the Resolved Logical Execution Plan

Physical Execution Plan generation and selection

Code generation

Practical examples

Then we register all of them as temporary tables so we can write regular SQL statements against them.

Using the explain method to obtain the PEP

Therefore, these fields are included in the set of column names to retrieve as well. This can be seen in the PushedFilters and ReadSchema sections of the explained plan.
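
A sketch of obtaining such a plan (the path is a placeholder and the column names are taken from the reconstructed schema shown earlier); explain() prints the PEP, including the PushedFilters and ReadSchema sections:

val sensorDF = spark.read.parquet("hdfs:///data/output.parquet")

sensorDF
  .select("temperature", "hardness")
  .filter("temperature > 100")
  .explain()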

How smart data sources work internally

This means that if, in a PEP, data is to be filtered, then this filter will be executed within the SQL statement on the RDBMS when the underlying data is read. First, a list of column names to include in the result is passed, and second, an array of filter objects that basically express the query predicate, which is pushed down to the underlying smart data source.

Memory management beyond the Java Virtual Machine Garbage Collector

It continuously monitors how many active references to a particular object are on the heap, and when no more references exist, those objects are destroyed and the allocated memory is released. References to an object can either exist in objects on the heap itself or on the stack.

Understanding the UnsafeRow object

So when you write something to an object's memory segment on the heap (for example, by updating a class property value of type Integer, you change 32 bits on the heap), you don't use the actual heap memory address to do so; you use the reference to the object and either access the object's property or use a set method. So we learned about newly allocated memory on the heap, but how is it ever freed?

The null bit set region

Please note that although the technology is called Unsafe, it is completely safe for you as a user, as Tungsten takes care of the memory management. In the following sections, we will explain the purpose of each memory region of the UnsafeRow object.

The fixed length values region

Note that all regions, as well as the contained fields within the regions, are 8-byte aligned. It's just a technology that allows people who are not JVM creators to do memory management.

The variable length values region

Understanding the BytesToBytesMap

A practical example on memory usage and performance

Note that while this particular run on an RDD is faster, Apache Spark jobs very rarely consist of just a single operation on an RDD. In theory, you could write highly efficient Apache Spark jobs on RDDs alone, but actually rewriting your Apache Spark application for performance tuning will take a lot of time.

Cache-friendly layout of data in memory

Cache eviction strategies and pre-fetching

So what Tungsten actually does is transform this expression into bytecode and send it to the executor thread. Nowadays, few developers think of creating bytecode on the fly, but that's exactly what's happening in code generation.

Understanding columnar storage

Understanding whole stage code generation

A practical example on whole stage code generation performance

As we can see, without whole-stage code generation, the operations (composed of task 0 and task 1) take almost 32 seconds, while with whole-stage code generation enabled they take just slightly more than half a second (task IDs 2 and 3). As we can see in Apache Spark V2.0, with whole-stage code generation enabled, multiple operators are merged.

Operator fusing versus the volcano iterator model

Being based on Spark, it is also highly fault tolerant, having the ability to rerun failed tasks by controlling the data stream being processed. For each topic, we will provide a worked example in Scala and show how the stream-based architecture can be set up and tested.

Overview

The size of each element in the stream is then based on a batch time, which might be, for example, two seconds. It is also possible to create a window, shown as the red box in the previous figure, over the DStream.

Errors and recovery

Checkpointing

It will also checkpoint the stream to an HDFS-based directory using the streaming context checkpoint method, which takes a directory path as a parameter. The Spark Streaming context, ssc, is then retrieved or created from the HDFS checkpoint directory via the StreamingContext getOrCreate method.
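
A sketch of this pattern (the checkpoint directory, host, port, and batch time are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointExample")
  val ssc  = new StreamingContext(conf, Seconds(2))
  ssc.checkpoint("hdfs:///checkpoint/dir")            // placeholder checkpoint directory
  // The DStream pipeline defined here is restored from the checkpoint on restart.
  ssc.socketTextStream("localhost", 10777).print()
  ssc
}

// Recreate the context from the checkpoint if it exists, otherwise build a new one.
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoint/dir", createContext _)
ssc.start()
ssc.awaitTermination()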

Streaming sources

TCP stream

Then a (key, value) pair is created in the form (word, 1), which is reduced by key, that is, by the word itself. This is interesting if you want to stream data into Apache Spark Streaming over TCP/IP from a host and port.
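
A minimal sketch of such a TCP-based word count (host, port, and batch time are placeholders; a SparkSession named spark is assumed):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(spark.sparkContext, Seconds(2))

val lines  = ssc.socketTextStream("localhost", 10777)
val counts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))   // the (word, 1) key-value pair described above
  .reduceByKey(_ + _)       // reduce by key, that is, by the word
counts.print()

ssc.start()
ssc.awaitTermination()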

File streams

What if you want to use some of the big data tools available today, such as Flume and Kafka? What if you want to use some of the built-in streaming functionality in Spark Streaming?

Flume

Then the Flume-based data stream is created using the streaming context host and port number. So, to demonstrate a potentially real-world approach to data processing, we'll modify the source of the Flume configuration file so that it uses a Perl script as an exec source:

agent1.sources.source1.type=exec
agent1.sources.source1.command=./rss.perl

Kafka

In the example below, the kafka-topics script is called with the create option. Finally, it has been shown, using Scala-based Apache Spark code in the stream application, that Kafka data can be consumed and stored in HDFS.

The concept of continuous applications

True unification - same code, same engine

Windowing

How streaming engines use windowing

It is important to note that the number of tuples per window can vary, as it just depends on how many messages arrived in a certain time frame; only those will be included in the respective window. Since the number of concurrent users varies at each time during the day, the size of the minute windows will also vary accordingly.

How Apache Spark improves windowing

The watermark is basically the threshold used to define how old a late arriving data point is allowed to be to still be included in the respective window. A four hour old data point simply wouldn't make sense to process, even if it could change the decision, since the decision has already been made.

Increased performance with good old friends

If, for whatever reason, a data tuple arrives that is more than 4 hours old, it may not make sense to include it in the windows; for example, if this application is used to run a time series forecasting model that provisions or withdraws additional HTTP servers from a group on an hourly basis.
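
A minimal Structured Streaming sketch of this idea (the rate source stands in for the real HTTP-request stream, and a SparkSession named spark is assumed): tuples arriving more than 4 hours late are dropped, and the rest are counted in hourly windows:

import spark.implicits._
import org.apache.spark.sql.functions.window

val events = spark.readStream.format("rate").load()    // stand-in source with a timestamp column

val hourlyCounts = events
  .withWatermark("timestamp", "4 hours")                // tolerate at most 4 hours of lateness
  .groupBy(window($"timestamp", "1 hour"))              // hourly windows
  .count()

val query = hourlyCounts.writeStream
  .outputMode("update")
  .format("console")
  .start()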

How transparent fault tolerance and exactly-once delivery guarantee is achieved

Replayable sources can replay streams from a given offset

Idempotent sinks prevent data duplication

State versioning guarantees consistent results after reruns

Since cloud services tend to change constantly and the cloud in general is introduced later in this book, the following tutorial explains how to configure the test data generator in the cloud and connect to the external MQTT message broker. The Maven dependency:resolve command, run in the folder containing the pom.xml file, pulls the required dependencies into our local Maven repository, where they can be accessed by the Apache Spark driver and pushed to the Apache Spark workers automatically.

Controlling continuous applications

More on stream life cycle management

In this chapter, we examine the functionality of the MLlib library in areas such as regression, classification, and neural networks. We'll examine the theory behind each algorithm before providing working examples that address real-world problems.

Architecture

MLlib is the original machine learning library provided with Apache Spark, the in-memory cluster-based open source data processing system. In a later chapter, we will also learn how machine learning works on the newer DataFrame and Dataset API.

The development environment

Note that if you were developing in Java, you would use a path of the form nbayes/src/main/java. The name, version, and scalaVersion options set the details of the project and the version of Scala to be used.
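
A minimal build.sbt sketch along these lines (the project name, versions, and dependencies are assumptions, not the book's exact file):

// build.sbt
name := "naive-bayes"
version := "1.0"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.1.0" % "provided"
)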

Classification with Naive Bayes

As we can see in the example in the sbt structure diagram, a JAR file for the naive-bayes project was created after successful compilation and packaging. This will be described later when we explore the functionality in the Apache Spark MLlib module.

Theory on Classification

P(Buy|Not Spam) = the probability that a non-spam email contains the word buy = 0.1. This means that an email containing the word buy is 92 percent likely to be spam.
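
The 92 percent figure follows from Bayes' theorem, written here in general form as a sketch using the terms above (the remaining probabilities come from the book's example and are not repeated here):

P(\mathrm{Spam} \mid \mathrm{Buy}) =
  \frac{P(\mathrm{Buy} \mid \mathrm{Spam})\, P(\mathrm{Spam})}
       {P(\mathrm{Buy} \mid \mathrm{Spam})\, P(\mathrm{Spam}) + P(\mathrm{Buy} \mid \mathrm{Not\,Spam})\, P(\mathrm{Not\,Spam})}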

Naive Bayes in practice

This structure only accepts double values, which means that the text values in the preceding data must be converted into numeric values. In the next example, we will use K-Means to try to determine what clusters exist within the data.

Clustering with K-Means

Theory on Clustering

K-Means in practice

We calculated the K-Means cost and the number of input data rows, and output the results via println statements. Finally, we use the prediction function of the K-Means model to create a list of cluster membership predictions.
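
A small sketch of these steps with made-up two-dimensional data (a SparkContext named sc is assumed; the value of k and the iteration count are arbitrary):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.parallelize(Seq(
  Vectors.dense(1.0, 1.0), Vectors.dense(1.5, 2.0),
  Vectors.dense(8.0, 8.0), Vectors.dense(9.0, 8.5)
))

val model = KMeans.train(data, 2, 20)                  // k = 2 clusters, 20 iterations
println("K-Means cost: " + model.computeCost(data))    // within-cluster sum of squared distances
println("Input rows: " + data.count())

val predictions = model.predict(data)                  // cluster membership per input vector
predictions.collect().foreach(println)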

Artificial neural networks

In the previous equation, the diagram and key show that the input values from a model P are passed to the neurons in the input layer of a network. This function has two constants: A and B; B affects the shape of the activation curve as shown in the previous graph.

ANN in practice

The first trains using the input data and then tests the ANN model with the same input data. Finally, by selecting one of the executors, it is possible to view its history and configuration, as well as links to the log file and error information.

What does the new API look like?

The concept of pipelines

Transformers

String indexer

OneHotEncoder

Intuitively, we would expect to get three extra columns in the encoded DataFrame, for example, colorIndexedRed, colorIndexedGreen, and so on. In contrast, we only get one extra column in the DataFrame, and its type is org.apache.spark.ml.linalg.Vector.

VectorAssembler

Pipelines

Estimators

RandomForestClassifier

As you can see, RandomForestClassifier has two parameters: the column name of the actual labels (remember that we are in a supervised learning environment) and the features column we created earlier. Then we call fit and transform, always passing our DataFrame as a parameter, and we obtain a final DataFrame result, which actually contains the predictions.
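
A sketch of this configuration (the DataFrame trainingDF and the label and features column names are assumptions):

import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setLabelCol("label")         // assumed name of the label column
  .setFeaturesCol("features")   // assumed name of the assembled feature vector column

val model       = rf.fit(trainingDF)          // trainingDF is an assumed DataFrame
val predictions = model.transform(trainingDF) // adds a prediction column
predictions.select("label", "prediction").show(5)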

Model evaluation

CrossValidation and hyperparameter tuning

CrossValidation

Hyperparameter tuning

Since CrossValidator is an Estimator that returns a model of type CrossValidatorModel, we can use it as a regular Apache SparkML model by simply calling transform on it to get the predictions. CrossValidatorModel automatically selects the learned hyperparameters of the underlying model (in this case, RandomForestClassifier) to perform the prediction.
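
A sketch of the surrounding setup (the pipeline, the rf classifier, the DataFrames, and the grid values are assumptions):

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val paramGrid = new ParamGridBuilder()
  .addGrid(rf.maxDepth, Array(5, 10))     // assumed grid values
  .addGrid(rf.numTrees, Array(20, 50))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)                              // an assumed Pipeline ending in rf
  .setEvaluator(new BinaryClassificationEvaluator())   // uses areaUnderROC by default
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel     = cv.fit(trainingDF)       // selects the best hyperparameters internally
val predictions = cvModel.transform(testDF)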

Winning a Kaggle competition with Apache SparkML

Data preparation

Feature engineering

Finally, let's examine some contents of the newly created column and compare it to the source column. Sparse vector representations: OneHotEncoder, like many other algorithms, returns a sparse vector.

Testing the feature engineering pipeline

Training the machine learning model

The evaluator used here is org.apache.spark.ml.evaluation.BinaryClassificationEvaluator, and there are some other classes for other forecasting use cases, such as RegressionEvaluator or MulticlassClassificationEvaluator. In the previous example, we use the areaUnderROC metric for evaluating binary classifiers.

Using the evaluator to assess the quality of the cross-validated and tuned model

Pipeline actually takes machine learning in Apache Spark to the next level by dramatically improving data scientist productivity. So far, we've only covered the components that come with the standard Apache Spark distribution (excluding HDFS, Kafka, and Flume, of course).

Why do we need just another library?

However, Apache Spark can also serve as a runtime environment for third-party components, making it a kind of operating system for big data applications. In other words, Apache SystemML is to machine learning what Catalyst and Tungsten are to DataFrames.

Why on Apache Spark?

The history of Apache SystemML

Thus, a scaled-up version of the data analysis pipeline had to be implemented after prototyping a small-scale version of it. Only the scaled-up version of the prototype gives results on the entire data set; the data scientist is then again in charge of modifying the prototype, and the whole cycle starts again.

A cost-based optimizer for machine learning algorithms

This means that the data scientist will prototype their application in a programming language of their choice, most likely MATLAB, R, or Python, and finally a systems programmer will pick this up and re-implement it in a JVM language such as Java or Scala, which usually turns out to provide better performance and also to scale linearly on a data-parallel framework like Apache Spark. So let's make the systems programmer redundant (or at least have them care only about our Apache Spark jobs) by using Apache SystemML.

An example - alternating least squares

Again, don't worry if you don't understand every line; the purpose of this figure is to show you that this algorithm takes only 27 lines of R code to express. Apache SystemML's DSL (domain-specific language) is a subset of R syntax, so you can just use the previous script nearly as is.

Apache SystemML architecture

Now, if we look at the ALS implementation in MLlib, we will see that it has more than 800 lines of code. The alternatives would be to either get rid of all performance optimizations in our algorithm implementation or to port our R code 1:1 to some parallel framework.

Language parsing

High-level operators are generated
