What about state? Imagine that a machine learning algorithm maintains a count variable on all the workers. If you replay the exact same data twice, you end up counting the data multiple times. Therefore, the query planner also maintains a versioned key-value map within the workers, which in turn persist their state to HDFS--which is fault tolerant by design.
So, in case of a failure, if data has to be replayed, the planner makes sure that the correct version of the key-value map is used by the workers.
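From the user's perspective, this fault-tolerant state handling is activated by providing a checkpoint location, a fault-tolerant directory (for example, on HDFS) where offsets and state are persisted. The following is a minimal sketch, assuming a streaming DataFrame called df and a placeholder HDFS path:

// Persist offsets and operator state to a fault-tolerant checkpoint directory,
// so that the correct state version can be restored after a failure
val query = df.writeStream
  .outputMode("append")
  .format("console")
  .option("checkpointLocation", "hdfs:///tmp/checkpoints/iot-demo") // placeholder path
  .start()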
Example - connecting to an MQTT message broker
So, let's start with a sample use case by connecting to an Internet of Things (IoT) sensor data stream. As we haven't covered machine learning yet, we don't analyze the data; we just showcase the concept.
We are using the IBM Watson IoT Platform as a streaming source. At its core, the Watson IoT Platform is backed by an MQTT (Message Queue Telemetry Transport) message broker. MQTT is a lightweight telemetry protocol invented by IBM in 1999. In 2013, it became a standard of OASIS (the Organization for the Advancement of Structured Information Standards, a global nonprofit consortium that works on the development, convergence, and adoption of standards for security, the Internet of Things, energy, content technologies, emergency management, and other areas) and is now the de facto standard for IoT data integration.
Messaging between applications can be backed by a message queue, a middleware system that supports asynchronous point-to-point channels in various delivery modes such as first-in, first-out (FIFO), last-in, first-out (LIFO), or priority queue (where each message can be re-ordered by certain criteria).
This is already a very nice feature, but it still couples applications to a certain degree because, once a message is read, it is made unavailable to the other consumers.
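The following minimal sketch (with made-up messages) illustrates this point-to-point behavior using a plain in-memory queue: whichever consumer takes a message first removes it from the queue for everyone else.

import java.util.concurrent.LinkedBlockingQueue

// Point-to-point FIFO channel: a message taken by one consumer is gone for all the others
val queue = new LinkedBlockingQueue[String]()
queue.put("voltage=230.5")
queue.put("voltage=231.0")

val seenByConsumerA = queue.take() // "voltage=230.5" -- now removed from the queue
val seenByConsumerB = queue.take() // "voltage=231.0" -- consumer B never sees the first message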
This way, N-to-N communication is hard (but not impossible) to achieve. In a publish/subscribe model, applications are completely decoupled: there are no queues anymore; instead, the notion of topics is introduced.
Data providers publish messages to specific topics and data consumers subscribe to those topics. This way, N-to-N communication is very straightforward to achieve, since it is reflected by the underlying message delivery model. Such middleware is called a message broker, in contrast to a message queue.
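To make the publish/subscribe model concrete, here is a minimal sketch using the low-level Eclipse Paho MQTT client (the library used for low-level MQTT processing later in this chapter) against the public Mosquitto test broker mentioned below; the topic names and client IDs are made up for illustration:

import org.eclipse.paho.client.mqttv3.{IMqttDeliveryToken, MqttCallback, MqttClient, MqttMessage}
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence

// A consumer subscribes to a topic on the broker...
val subscriber = new MqttClient("tcp://test.mosquitto.org:1883", "demo-subscriber", new MemoryPersistence())
subscriber.setCallback(new MqttCallback {
  def connectionLost(cause: Throwable): Unit = ()
  def deliveryComplete(token: IMqttDeliveryToken): Unit = ()
  def messageArrived(topic: String, message: MqttMessage): Unit =
    println(s"$topic -> ${new String(message.getPayload, "UTF-8")}")
})
subscriber.connect()
subscriber.subscribe("demo/sensors/#")

// ...and any number of producers publish to that topic, without knowing who is listening
val publisher = new MqttClient("tcp://test.mosquitto.org:1883", "demo-publisher", new MemoryPersistence())
publisher.connect()
publisher.publish("demo/sensors/voltage", new MqttMessage("230.5".getBytes("UTF-8")))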
As cloud services tend to change constantly, and cloud in general is introduced later in this book, the following tutorial explains how to set up the test data generator in the cloud and how to connect to the remote MQTT message broker. In this example, we will use the IBM Watson IoT Platform, which is an MQTT message broker available in the cloud. Alternatively, one can install an open source message broker such as Mosquitto, which also provides a publicly available test installation at the following URL: http://test.mosquitto.org.
In order to replicate the example, steps (1) and (2) of the following tutorial are necessary: https://www.ibm.com/developerworks/library/iot-cognitive-iot-app-machine-learning/index.html. Please make sure to note down http_host, org, apiKey, and apiToken during execution of the tutorial; these are needed later in order to subscribe to data using Apache Spark Structured Streaming.
As the IBM Watson IoT Platform uses the open MQTT standard, no special IBM component is necessary to connect to the platform. Instead, we are using Apache Bahir as a connector between MQTT and Apache Spark Structured Streaming.
The goal of the Apache Bahir project is to provide a set of source and sink connectors for various data processing engines, including Apache Spark and Apache Flink, which otherwise lack them. In this case, we will use the Apache Bahir MQTT data source.
In order to use Apache Bahir, we need to add two dependencies to our local Maven repository. A complete pom.xml file is provided in the download section of this chapter.
Let's have a look at the dependency section of pom.xml:
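A minimal sketch of what this dependency section contains is shown here; the version numbers are assumptions, so please take the exact ones from the pom.xml file in the download package:

<dependencies>
    <!-- Apache Bahir MQTT adapter for Apache Spark Structured Streaming -->
    <dependency>
        <groupId>org.apache.bahir</groupId>
        <artifactId>spark-sql-streaming-mqtt_2.11</artifactId>
        <version>2.1.0</version> <!-- assumed version -->
    </dependency>
    <!-- Eclipse Paho client for low-level MQTT processing -->
    <dependency>
        <groupId>org.eclipse.paho</groupId>
        <artifactId>org.eclipse.paho.client.mqttv3</artifactId>
        <version>1.1.0</version> <!-- assumed version -->
    </dependency>
</dependencies>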
We are basically getting the MQTT Apache Spark Structured Streaming adapter of Apache Bahir and a dependent package for low-level MQTT processing. A simple mvn dependency:resolve command issued in the directory of the pom.xml file pulls the required dependencies into our local Maven repository, where they can be accessed by the Apache Spark driver and transferred to the Apache Spark workers automatically.
Another way of resolving the dependencies is to use the following command in order to start a spark-shell (spark-submit works the same way); the necessary dependencies are then automatically distributed to the workers:
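A sketch of such a command is shown here; the Apache Bahir version is an assumption and should match the one used in pom.xml:

spark-shell --packages org.apache.bahir:spark-sql-streaming-mqtt_2.11:2.1.0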
Now we need the MQTT credentials that we've obtained earlier. Let's set the values here:
val mqtt_host = "pcoyha.messaging.internetofthings.ibmcloud.com"
val org = "pcoyha"
val apiKey = "a-pcoyha-oaigckub"
val apiToken = "wuypVXyNgVLAcLr"
var randomSessionId = scala.util.Random.nextInt()
Now we can start creating a stream connecting to an MQTT message broker. We are telling Apache Spark to use the Apache Bahir MQTT streaming source:
val df = spark.readStream
  .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
We need to specify credentials such as username, password, and clientId in order to pull data from the MQTT message broker; the link to the tutorial mentioned earlier explains how to obtain these:
  .option("username", apiKey)
As we are using a publish/subscribe messaging model, we have to provide the topic that we are subscribing to--this topic is used by the test data generator that you've deployed to the cloud before:
  .option("topic", "iot-2/type/WashingMachine/id/Washer/evt/voltage/fmt/json")
Once everything is set on the configuration side, we have to provide the endpoint host and port in order to create the stream:
  .load("tcp://" + mqtt_host + ":1883")
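Putting these fragments together, and adding the password and clientId options mentioned previously (the clientId shown here follows the a:<org>:<appId> convention of Watson IoT application clients and is an assumption), a complete stream definition looks roughly like this sketch:

val df = spark.readStream
  .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
  .option("username", apiKey)
  .option("password", apiToken)                           // assumed: token obtained from the tutorial
  .option("clientId", "a:" + org + ":" + randomSessionId) // assumed: Watson IoT application client ID format
  .option("topic", "iot-2/type/WashingMachine/id/Washer/evt/voltage/fmt/json")
  .load("tcp://" + mqtt_host + ":1883")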
Interestingly, as can be seen in the following screenshot, this leads to the creation of a DataFrame:
Note that the schema is fixed to [String, Timestamp] and cannot be changed during stream creation--this is a limitation of the Apache Bahir library. However, using the rich DataFrame API, you can parse the value, a JSON string for example, and create new columns.
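A minimal sketch of such parsing might look like the following, assuming the source exposes the columns value and timestamp and that the washing machine event carries a numeric voltage field:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{DoubleType, StructType}

// Assumed payload schema of the test data generator's voltage events
val payloadSchema = new StructType().add("voltage", DoubleType)

// Parse the JSON string in the value column and expose its fields as proper columns
val parsed = df
  .select(from_json(col("value"), payloadSchema).alias("payload"), col("timestamp"))
  .select(col("payload.voltage").alias("voltage"), col("timestamp"))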
As discussed before, this is one of the powerful features of Apache Spark Structured Streaming, as the very same DataFrame (and Dataset) API can now be used to process historic and real-time data. So let's take a look at the contents of this stream by writing it to the console:
val query = df.writeStream
  .outputMode("append")
  .format("console")
  .start()
As the output mode, we choose append to enforce incremental display and to avoid having the complete contents of the historic stream written to the console again and again. As the format, we specify console, as we just want to debug what's happening on the stream:
Finally, the start method initiates query processing, as can be seen here: