ANN in practice

The noise, created by adding extra zero (0) characters within the image, has been highlighted in the figure.
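The image files used here are plain text lines of space-separated values, so a noisy variant of a pattern can be produced simply by forcing some of those values to zero. The following sketch is only an illustration of that idea and is not part of the book's code; the helper name, values, and positions are made up:

// Hypothetical helper: overwrite the given positions of a clean pattern with 0.0
// to simulate the extra zero characters added as noise.
def addZeroNoise(pixels: Array[Double], zeroedIndices: Seq[Int]): Array[Double] = {
  val noisy = pixels.clone()
  zeroedIndices.filter(_ < noisy.length).foreach(i => noisy(i) = 0.0)
  noisy
}

// Example with made-up values: distort two positions and write the result back
// out in the space-separated format used by the .img files.
val clean = Array(1.0, 1.0, 0.0, 1.0, 1.0, 1.0)
val noisyLine = addZeroNoise(clean, Seq(1, 4)).mkString(" ") // "1.0 0.0 0.0 1.0 0.0 1.0"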

As before, the ANN code is developed using the Linux hadoop account in a subdirectory called spark/ann. The ann.sbt file exists in the ann directory:

[hadoop@hc2nn ann]$ pwd
/home/hadoop/spark/ann
[hadoop@hc2nn ann]$ ls
ann.sbt  project  src  target

The contents of the ann.sbt file have been changed to use full paths of JAR library files for the Spark dependencies:

name := "A N N"

libraryDependencies += "org.apache.spark" % "spark-core" % "2.6.0"
libraryDependencies += "org.apache.spark" % "spark-mllib" % "2.1.1"
libraryDependencies += "org.apache.spark" % "akka" % "2.5.3"

As in the previous examples, the actual Scala code to be compiled exists in a subdirectory named src/main/scala. We have created two Scala programs. The first trains using the input data and then tests the ANN model with the same input data. The second tests the trained model with noisy data to check the classification of distorted data:

[hadoop@hc2nn scala]$ pwd
/home/hadoop/spark/ann/src/main/scala
[hadoop@hc2nn scala]$ ls
test_ann1.scala  test_ann2.scala

We will examine the first Scala file and then just show the extra features of the second file, as the two examples are very similar up to the point of training the ANN. The code examples shown here can be found in the software package provided with this book under the path chapter\ANN. In the first Scala example, the import statements are similar to those of the previous examples. The Spark context, configuration, vectors, and LabeledPoint are imported. The RDD class for RDD processing is imported this time, along with the new ANN class, ANNClassifier. Note that the MLlib classification routines widely use the LabeledPoint structure for input data, which contains the features and labels to be trained against.
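For reference, a LabeledPoint simply pairs a Double label with a feature vector. The following standalone snippet, which uses made-up feature values rather than the book's image data, shows the shape of the structure:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Pair a class label (0.1) with a dense feature vector, mirroring how each
// image pattern is paired with its label later in this example.
val example = LabeledPoint( 0.1, Vectors.dense( 1.0, 0.0, 1.0, 1.0 ) )

println( example.label )    // 0.1
println( example.features ) // [1.0,0.0,1.0,1.0]

With that structure in mind, the import statements for the first example are as follows: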

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.classification.ANNClassifier
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg._
import org.apache.spark.rdd.RDD

object testann1 extends App {

The application class in this example has been called testann1. The HDFS files to be processed have been defined in terms of the HDFS server, path, and file names:

val server = "hdfs://localhost"   // HDFS server URL
val path   = "/data/spark/ann/"   // base path for the image data files

val data1 = server + path + "close_square.img"
val data2 = server + path + "close_triangle.img"
val data3 = server + path + "lines.img"
val data4 = server + path + "open_square.img"
val data5 = server + path + "open_triangle.img"
val data6 = server + path + "plus.img"

The Spark context has been created with the URL for the Spark instance, which now has a different port number: 8077. The application name is ANN 1. This will appear on the Spark web UI when the application is run:

val sparkMaster = "spark://localhost:8077"
val appName = "ANN 1"

val conf = new SparkConf()
conf.setMaster(sparkMaster)
conf.setAppName(appName)

val sparkCxt = new SparkContext(conf)

The HDFS-based input training and test data files are loaded. The values on each line are split by space characters, and the numeric values have been converted into doubles. The variables that contain this data are then stored in an array called inputs. At the same time, an array called outputs is created, containing the labels from 0.1 to 0.6. These values will be used to classify the input patterns:

val rData1 = sparkCxt.textFile(data1).map(_.split(" ").map(_.toDouble)).collect
val rData2 = sparkCxt.textFile(data2).map(_.split(" ").map(_.toDouble)).collect
val rData3 = sparkCxt.textFile(data3).map(_.split(" ").map(_.toDouble)).collect
val rData4 = sparkCxt.textFile(data4).map(_.split(" ").map(_.toDouble)).collect
val rData5 = sparkCxt.textFile(data5).map(_.split(" ").map(_.toDouble)).collect
val rData6 = sparkCxt.textFile(data6).map(_.split(" ").map(_.toDouble)).collect

val inputs = Array[Array[Double]] (
  rData1(0), rData2(0), rData3(0), rData4(0), rData5(0), rData6(0) )

val outputs = Array[Double]( 0.1, 0.2, 0.3, 0.4, 0.5, 0.6 )

The input and output data, representing the input data features and labels, are then combined and converted into a LabeledPoint structure. Finally, the data is parallelised in order to partition it for optimal parallel processing:

val ioData = inputs.zip( outputs )
val lpData = ioData.map{ case(features,label) =>
  LabeledPoint( label, Vectors.dense(features) ) }

// Partition the labelled data as an RDD for training.
val rddData = sparkCxt.parallelize( lpData )

Variables are created to define the hidden layer topology of the ANN. In this case, we have chosen two hidden layers, each with 100 neurons. The maximum number of iterations is defined, as well as a batch size (six patterns) and a convergence tolerance. The tolerance defines how small the training error must become before training can be considered to have converged. Then, an ANN model is created using these configuration parameters and the input data:

val hiddenTopology : Array[Int] = Array( 100, 100 )
val maxNumIterations = 1000
val convTolerance = 1e-4
val batchSize = 6

val annModel = ANNClassifier.train(rddData, batchSize, hiddenTopology,
  maxNumIterations, convTolerance)

In order to test the trained ANN model, the same input training data is used as testing data to obtain prediction labels. First, an input data variable is created called rPredictData. Then, the data is partitioned and, finally, the predictions are obtained using the trained ANN model. For this model to work, it must output the labels 0.1 to 0.6:

val rPredictData = inputs.map{ case(features) =>
  ( Vectors.dense(features) ) }

val rddPredictData = sparkCxt.parallelize( rPredictData )
val predictions = annModel.predict( rddPredictData )

The label predictions are printed and the script closes with a closing bracket:

predictions.toArray().foreach( value => println( "prediction > " + value ) )

} // end ann1

So, in order to run this code sample, it must first be compiled and packaged. By now, you should be familiar with the sbt command, executed from the ann subdirectory:

[hadoop@hc2nn ann]$ pwd
/home/hadoop/spark/ann
[hadoop@hc2nn ann]$ sbt package

The spark-submit command is then used from within the new spark/spark path, using the new Spark-based URL at port 8077, to run the application, testann1:

/home/hadoop/spark/spark/bin/spark-submit \
  --class testann1 \
  --master spark://localhost:8077 \
  --executor-memory 700M \
  --total-executor-cores 100 \
  /home/hadoop/spark/ann/target/scala-2.10/a-n-n_2.10-1.0.jar

By checking the Apache Spark web UI at http://localhost, it is now possible to see the application running. The following figure shows the ANN 1 application running, as well as the previously completed executions:

By selecting one of the cluster host worker instances, it is possible to see a list of executors that actually carry out cluster processing for that worker:

Finally, by selecting one of the executors, it is possible to see its history and configuration, as well as links to the log files and error information. At this level, with the log information provided, debugging is possible. These log files can be checked for error messages:

The ANN 1 application provides the following output to show that it has reclassified the same input data correctly. The reclassification has been successful, as each of the input patterns has been given the same label that it was trained with:

prediction > 0.1
prediction > 0.2

So, this shows that ANN training and test prediction will work with the same data. Now, we will train with the same data but test with distorted or noisy data, an example of which we have already demonstrated. This example can be found in the file called test_ann2.scala in your software package. It is very similar to the first example, so we will just demonstrate the changed code. The application is now called testann2:

object testann2 extends App

An extra set of testing data is created, after the ANN model has been created using the training data. This testing data contains noise:

val tData1 = server + path + "close_square_test.img"
val tData2 = server + path + "close_triangle_test.img"
val tData3 = server + path + "lines_test.img"
val tData4 = server + path + "open_square_test.img"
val tData5 = server + path + "open_triangle_test.img"
val tData6 = server + path + "plus_test.img"

This data is processed into input arrays and partitioned for cluster processing:

val rtData1 = sparkCxt.textFile(tData1).map(_.split(" ").map(_.toDouble)).collect
val rtData2 = sparkCxt.textFile(tData2).map(_.split(" ").map(_.toDouble)).collect
val rtData3 = sparkCxt.textFile(tData3).map(_.split(" ").map(_.toDouble)).collect
val rtData4 = sparkCxt.textFile(tData4).map(_.split(" ").map(_.toDouble)).collect
val rtData5 = sparkCxt.textFile(tData5).map(_.split(" ").map(_.toDouble)).collect
val rtData6 = sparkCxt.textFile(tData6).map(_.split(" ").map(_.toDouble)).collect

val tInputs = Array[Array[Double]] (
  rtData1(0), rtData2(0), rtData3(0), rtData4(0), rtData5(0), rtData6(0) )

val rTestPredictData = tInputs.map{ case(features) => ( Vectors.dense(features) ) }
val rddTestPredictData = sparkCxt.parallelize( rTestPredictData )

This RDD is then used to generate label predictions in the same way as in the first example. If the model classifies the data correctly, then the same label values, 0.1 to 0.6, should be printed:

val testPredictions = annModel.predict( rddTestPredictData )

testPredictions.toArray().foreach( value => println( "test prediction > " + value ) )

The code has already been compiled, so it can be run using the spark-submit command:

/home/hadoop/spark/spark/bin/spark-submit \
  --class testann2 \
  --master spark://localhost:8077 \
  --executor-memory 700M \
  --total-executor-cores 100 \
  /home/hadoop/spark/ann/target/scala-2.10/a-n-n_2.10-1.0.jar

Here is the cluster output from this script, which shows a successful classification using a trained ANN model and some noisy test data. The noisy data has been classified correctly. For instance, if the trained model had become confused, it might have given a value of 0.15 for the noisy close_square_test.img test image in position one, instead of returning 0.1 as it did:

test prediction > 0.1
test prediction > 0.2
test prediction > 0.3
test prediction > 0.4
test prediction > 0.5
test prediction > 0.6

Summary

This chapter has attempted to provide you with an overview of some of the functionality available within the Apache Spark MLlib module. It has also shown the functionality that will soon be available in terms of ANNs, or artificial neural networks. You might have been impressed by how well ANNs work, and there is a lot more on ANNs in a later chapter covering deep learning. It is not possible to cover all the areas of MLlib in the time and space allowed for this chapter. In addition, we now want to concentrate more on the SparkML library in the next chapter, which speeds up machine learning by supporting DataFrames and the underlying Catalyst and Tungsten optimizations.

We saw how to develop Scala-based examples for Naive Bayes classification, K-Means clustering, and ANNs. You learned how to prepare test data for these Spark MLlib routines.

You also saw that they all accept the LabeledPoint structure, which contains features and labels.

Additionally, each approach involves a training step and a prediction step, training and testing a model with different datasets. Using the approach shown in this chapter, you can now investigate the remaining functionality in the MLlib library. You can refer to http://spark.apache.org and ensure that you refer to the correct version when checking the documentation.

Having examined the Apache Spark MLlib machine learning library in this chapter, it is now time to consider Apache Spark's SparkML. The next chapter will examine machine learning on top of DataFrames.

8
Apache SparkML

So now that you've learned a lot about MLlib, why another ML API? First of all, it is a common task in data science to work with multiple frameworks and ML libraries, as there are always advantages and disadvantages; mostly, it is a trade-off between performance and functionality. R, for instance, is the king when it comes to functionality: there are more than 6,000 R add-on packages. However, R is also one of the slowest execution environments for data science. SparkML, on the other hand, currently has relatively limited functionality but is one of the fastest libraries. Why is this so? This brings us to the second reason why SparkML exists.

The duality between RDDs on the one hand and DataFrames and Datasets on the other runs like a common thread through this book, and it continues to influence the machine learning chapters. Whereas MLlib is designed to work on top of RDDs, SparkML works on top of DataFrames and Datasets, thereby making use of all the new performance benefits that Catalyst and Tungsten bring.
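To give a flavour of what this looks like in practice, here is a minimal sketch of a SparkML pipeline, based on the standard spark.ml text-classification pattern rather than this book's image data; the DataFrame contents and column names are made up for illustration:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("PipelineSketch").master("local[*]").getOrCreate()

// A tiny, made-up training DataFrame with a text column and a numeric label.
val training = spark.createDataFrame(Seq(
  (0L, "spark is fast", 1.0),
  (1L, "hadoop map reduce", 0.0),
  (2L, "spark rdd dataframe", 1.0)
)).toDF("id", "text", "label")

// Transformers turn one DataFrame into another; the estimator (LogisticRegression)
// is fitted by the pipeline, producing a model that is itself a transformer.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)

model.transform(training).select("id", "prediction").show()

Note that fit() triggers the estimator stages on the training DataFrame, while transform() applies the fitted pipeline to new DataFrames.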

We will cover the following topics in this chapter:

Introduction to the SparkML API
The concept of pipelines
Transformers and estimators
A working example
