JDBC stands for Java Database Connectivity. When talking about Apache Spark and JDBC, there is sometimes a bit of confusion, because JDBC can refer to a JDBC data source that Spark reads from, as well as to Apache Spark's capability to serve as a JDBC-compliant data source for other systems. The latter is not covered further in this book; the former is used here as one particular example where a data source (in this case, a relational database) can be used transparently for data pre-processing without the Apache SparkSQL user noticing it.
If you want to use Apache SparkSQL as a data source for other Java/JVM-based applications, you have to start the JDBC Thrift server, as explained here: https://developer.ibm.com/hadoop/how-to-run-queries-on-spark-sql-using-jdbc-via-thrift-server/. The following website explains how to connect to a MySQL database from Apache SparkSQL using JDBC: https://docs.databricks.com/spark/latest/data-sources/sql-databases.html. More on JDBC in general can be found here: http://www.oracle.com/technetwork/java/overview-141217.html.
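To set the stage, here is a minimal sketch of reading a relational table into a DataFrame through Spark's JDBC data source. The connection URL, table name, and credentials are placeholders, and the appropriate JDBC driver JAR is assumed to be on the classpath:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("JdbcSourceExample").getOrCreate()

// Read a (hypothetical) table named "customers" from a MySQL database.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/mydb") // placeholder connection URL
  .option("dbtable", "customers")                 // placeholder table name
  .option("user", "spark")                        // placeholder credentials
  .option("password", "secret")
  .load()

// Filters on this DataFrame are candidates for predicate push-down (see below).
df.filter("age > 30").select("name", "age").show()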
Now we want to show how Apache Spark internally implements smart data sources, exemplified by a filter pushed down to a JDBC data source. This means that if data has to be filtered in a physical execution plan (PEP), this filter will be executed in the SQL statement on the RDBMS when the underlying data is read. This way, reading unnecessary data is avoided. Depending on the selectivity of the filter predicate, the performance gain can be multiple orders of magnitude.
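You can observe push-down on any JDBC-backed DataFrame by inspecting its physical plan. As a sketch (reusing the hypothetical df from the previous example), the PushedFilters section lists the predicates handed over to the database:

// A pushed-down predicate shows up in the PushedFilters section of the
// physical plan instead of as a separate Filter operator inside Spark.
df.filter("age > 30").explain()
// Typical (abbreviated) output:
// == Physical Plan ==
// *Scan JDBCRelation(customers) ... PushedFilters: [IsNotNull(age), GreaterThan(age,30)] ...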
Let's have a look at the following trait (here's the link to the source code: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala):

trait PrunedFilteredScan {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}
This trait defines a method called buildScan, which takes two parameters: first, a list of the column names that have to be included in the result and, second, an array of Filter objects, basically expressing the query predicates pushed down to the underlying smart data source. The return type is RDD[Row].
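To make this contract concrete, the following is a minimal, hypothetical in-memory relation implementing PrunedFilteredScan; the class, schema, and the single supported filter type are made up for illustration, and a real smart source would translate the whole filter array into its native query language:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// A toy in-memory relation over (name, age) pairs. Spark hands us the
// required columns and the pushable filters; we return only matching
// data, as an RDD[Row], with only the requested columns.
class PeopleRelation(val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  private val data = Seq(("alice", 23), ("bob", 42))

  override def schema: StructType = StructType(Seq(
    StructField("name", StringType), StructField("age", IntegerType)))

  override def buildScan(requiredColumns: Array[String],
      filters: Array[Filter]): RDD[Row] = {
    // Apply the one filter type this sketch understands; unhandled filters
    // are simply ignored here because Spark re-evaluates them anyway.
    val kept = data.filter { case (name, _) =>
      filters.forall {
        case EqualTo("name", value) => name == value
        case _ => true
      }
    }
    // Column pruning: emit only the requested columns, in the requested order.
    val rows = kept.map { case (name, age) =>
      Row.fromSeq(requiredColumns.toSeq.map {
        case "name" => name
        case "age" => age
      })
    }
    sqlContext.sparkContext.parallelize(rows)
  }
}

If such a relation is registered through a RelationProvider, a query with an equality predicate on name would invoke buildScan with a corresponding EqualTo filter.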
One example is implemented in the JDBCRelation class (again, here is the link to the source code: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala):
private[sql] case class JDBCRelation(
    parts: Array[Partition], jdbcOptions: JDBCOptions)(@transient val sparkSession: SparkSession)
  extends BaseRelation
  with PrunedFilteredScan
  with InsertableRelation {
JDBCRelation implements the buildScan method in the following way:
override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
  // Rely on a type erasure hack to pass RDD[InternalRow] back as RDD[Row]
  JDBCRDD.scanTable(
    sparkSession.sparkContext,
    schema,
    requiredColumns,
    filters,
    parts,
    jdbcOptions).asInstanceOf[RDD[Row]]
}
As we can see, this method is just a delegate. Therefore, we have to have a look at the scanTable method of the JDBCRDD class. It is interesting to note that, in the case of JDBCRDD, support for predicate push-down is implemented directly in the RDD class. But we'll skip the scanTable method for now, since it just parameterizes and creates a new JDBCRDD object.
So the most interesting method in JDBCRDD is compute, which it inherits from the abstract RDD class. Through the compute method, Apache Spark tells this RDD to get out of lazy mode and materialize itself whenever appropriate during the computation of a data processing job. We'll show you two important fragments of this method after we have had a look at the method signature:
override def compute(thePart: Partition, context: TaskContext): Iterator[InternalRow] = {
Here you can see that the return type is Iterator[InternalRow], which allows a lazy underlying data source to be read lazily as well. As we can see, this is the case for this particular implementation:
val sqlText = s"SELECT $columnList FROM ${options.table} $myWhereClause"
stmt = conn.prepareStatement(sqlText,
    ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)
stmt.setFetchSize(options.fetchSize)
rs = stmt.executeQuery()
Note that the SQL statement created and stored in the sqlText constant references two interesting variables: columnList and myWhereClause. Both are derived from the requiredColumns and filters arguments passed to the JDBCRelation class.
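To illustrate how such a WHERE clause can be derived from the Filter objects, here is a simplified sketch; Spark's actual translation lives in JDBCRDD.compileFilter and additionally handles dialect-specific identifier quoting, value escaping, and many more filter types:

import org.apache.spark.sql.sources._

// Quote string literals; leave numbers as-is (a real implementation also
// escapes quotes and handles dates, timestamps, and so on).
def compileValue(value: Any): Any = value match {
  case s: String => s"'$s'"
  case other     => other
}

// Simplified translation of Spark Filter objects into SQL predicates.
def compile(f: Filter): Option[String] = f match {
  case EqualTo(attr, value)     => Some(s"$attr = ${compileValue(value)}")
  case GreaterThan(attr, value) => Some(s"$attr > ${compileValue(value)}")
  case IsNotNull(attr)          => Some(s"$attr IS NOT NULL")
  case _                        => None // unsupported filters stay in Spark
}

def whereClause(filters: Array[Filter]): String = {
  val predicates = filters.flatMap(compile)
  if (predicates.isEmpty) "" else predicates.mkString("WHERE ", " AND ", "")
}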
Therefore, this data source can be called a smart source, because the underlying storage technology (an SQL database in this case) can be told to only return the columns and rows that are actually requested. And, as already mentioned, the data source supports lazy data access patterns being pushed down to the underlying database as well. Here you can see that the JDBC result set is wrapped into a typed InternalRow iterator, Iterator[InternalRow]. Since this matches the return type of the compute method, we are done upon execution of the following code:
val rowsIterator = JdbcUtils.resultSetToSparkInternalRows(rs, schema, inputMetrics)

CompletionIterator[InternalRow, Iterator[InternalRow]](rowsIterator, close())
Note that the ResultSet rs obtained from the JDBC database query is passed to (wrapped into) a delegate object. Therefore, the JDBC connection stays open, and the RDBMS cursor doesn't get destroyed; it can be used to return subsequent data once requested by Apache Spark.
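The pattern behind this is simple; the following is a minimal re-implementation sketch of the CompletionIterator idea (not Spark's actual class), showing how a completion callback such as close() fires exactly once, after the last element has been consumed:

// Minimal sketch: delegate all calls to the underlying iterator and run a
// completion function once it is exhausted, for example to close a JDBC
// statement, result set, and connection.
class SimpleCompletionIterator[A](sub: Iterator[A], completion: => Unit)
  extends Iterator[A] {

  private var completed = false

  override def hasNext: Boolean = {
    val more = sub.hasNext
    if (!more && !completed) {
      completed = true
      completion // runs exactly once, after the last element
    }
    more
  }

  override def next(): A = sub.next()
}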
Summary
This chapter was basically a sales pitch for using DataFrames, Datasets, and SQL over RDDs, because in the majority of cases, the optimizer does such a great job that the performance is nearly equal to that of hand-optimized code on RDDs. You now know the internals of the optimizer, which gives you a solid foundation for using all the components that are based on it and that are introduced in the subsequent chapters.
The next chapter is dedicated to Tungsten Phase I and Phase II, a mind-blowing technology that can accelerate computations by a factor of 100, so stay tuned!
4
Project Tungsten
This chapter introduces Project Tungsten. CPU and main memory performance became the new bottlenecks in big data processing after the massive increase in I/O performance due to the usage of solid state disks (SSDs) and 10 Gbps Ethernet. Therefore, Project Tungsten, the core of the Apache Spark execution engine, aims at improving performance at the CPU and main memory level. This chapter will cover the following topics:
Memory management beyond the Java Virtual Machine (JVM) Garbage Collector (GC)
Cache-friendly layout of data in memory
Code generation
We will now have an in-depth look at all three of these topics.
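As a small, practical teaser for the first topic, Tungsten's off-heap memory can be enabled through configuration, so that Spark manages this memory explicitly instead of leaving it to the garbage collector; the size given here is just an example value:

import org.apache.spark.sql.SparkSession

// Let Tungsten allocate its working memory outside the JVM heap.
val spark = SparkSession.builder()
  .appName("TungstenOffHeap")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g") // example value
  .getOrCreate()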