Java for Data Science

The final chapter combines topics discussed in the book to create a comprehensive data science application. The field of natural language processing has made significant progress and is widely used in data science applications.

What you need for this book

In this chapter, we explain how they work and demonstrate the use of several different types of neural networks. Java provides significant support in this area, and we will illustrate the use of this approach.

Who this book is for

In this book, you'll find a variety of text styles that distinguish between different types of information. "... SpeechResult instance that holds the result of processing." Database table names, folder names, file names, file extensions, path names, dummy URLs, user input, and Twitter handles are shown as follows: "The KevinVoice directory contains two voices: kevin and kevin16."

Customer support

Reader feedback is important to us as it helps us develop titles that you'll really get the most out of. If there's a topic you have expertise in and are interested in writing or contributing to a book, check out our author guide at www.packtpub.com/authors.

Downloading the example code

A more in-depth discussion of the history of data science can be found at http://www.forbes.com/sites/gilpress a-very-short-history-of-data-science/#3d9ea08369fd. However, the purpose of this book is simply to introduce the reader to important data science topics and illustrate how they can be addressed using Java.

Problems solved using data science

The term data science has been used since 1974 and has evolved over time to include statistical analysis of data. This book aims to take a broad look at data science using Java and will briefly touch on many topics.

Understanding the data science problem-solving approach

By displaying the data in a visual form, a difficult-to-understand set of numbers can be grasped more easily. Video, image, and audio processing and analysis: This is a more specialized form of analysis that is becoming more common as better analysis techniques are discovered and faster processors become available.

Using Java to support data science

Text analysis: This is a common form of analysis that works with natural languages to identify features such as the names of people and places, the relationship between parts of text, and the implied meaning of text. Each section can only provide a brief introduction to the topics and support available.

Acquiring data for an application

The remainder of this chapter will provide an overview of the data science tasks and Java support demonstrated in the book.

The importance and process of cleaning data

Stop words are words such as and, or, and but, which do not always contribute to the analysis of text. A few years ago - never mind how long exactly - with little or no money in my wallet and nothing much to interest me on land, I thought I would sail around a bit and see the watery part of the world.

Visualizing data to enhance understanding

Visualization allows users to examine large data sets in ways that provide insights not present in the bulk of the data. Visualization tools help us identify potential problems or unexpected data results and develop meaningful interpretations of the data.

The use of statistical methods in data science

An equation representing the line is calculated and can be used to predict future behavior. Simple statistical techniques, such as mean and standard deviation, can be calculated using basic Java.
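As a rough illustration of that last point, the sketch below computes a mean and a population standard deviation with plain Java. The class name BasicStats and the sample values are assumptions made for this example and do not come from the book.

```java
import java.util.Arrays;

class BasicStats {
    // Arithmetic mean of the values; returns NaN for an empty array.
    static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    // Population standard deviation: square root of the mean squared deviation.
    static double populationStdDev(double[] values) {
        double m = mean(values);
        double sumSq = 0.0;
        for (double v : values) {
            sumSq += (v - m) * (v - m);
        }
        return Math.sqrt(sumSq / values.length);
    }

    public static void main(String[] args) {
        double[] sample = {2, 4, 4, 4, 5, 5, 7, 9}; // hypothetical data
        System.out.println("Mean: " + mean(sample));
        System.out.println("Standard deviation: " + populationStdDev(sample));
    }
}
```

Dividing by values.length - 1 instead of values.length would give the sample standard deviation.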

Machine learning applied to data science

In Chapter 5, Statistical Data Analysis Techniques, we will demonstrate how to perform regression analysis using the Apache Commons SimpleRegression class. In Chapter 6, Machine Learning, we will demonstrate how to use the approach using a dataset about individuals' propensity to camp.

Using neural networks in data science

Once the model is trained, the possible hyperplanes are considered and predictions can then be made using similar data. After the data is read, the MLP instance is created and initialized using the method to configure the model's characteristics, including how fast the model should learn and the amount of time spent training the model.

Deep learning approaches

Once the model is trained, it can be used for prediction and retrieval. By searching, you can use the compressed middle layer to match other compressed images to be sorted.

Performing text analysis

Parts of Speech (POS): This identifies the grammatical parts of a sentence, such as noun, verb, adjective and so on. Relationships: Here we are concerned with identifying how parts of text are related to each other, such as the subject and object of a sentence.

Visual and audio analysis

In the following example, explained in more detail in Chapter 11, Mathematical and Parallel Techniques for Data Analysis, we perform OCR using Tess4j (http://tess4j.sourceforge.net/), a Java JNA wrapper for the Tesseract OCR API. We build on this example in that chapter with a discussion of improvements and considerations to ensure that the OCR process is as efficient as possible.

Improving application performance using parallel techniques

Threads are used by APIs such as Aparapi when multiple CPUs or GPUs are not available. The Aparapi API provides high-level support for using multiple CPUs or GPUs, and we include a demo of Aparapi supporting matrix multiplication.

Assembling the pieces

Summary

In this chapter, we will demonstrate how to obtain data from several of these, including various Internet sites and several social media platforms. We will focus on how to download and extract these types of data as plain text for possible processing.

Understanding the data formats used in data science applications

We will briefly examine different data formats, followed by an examination of possible data sources. We need this knowledge to demonstrate how to obtain data using various data collection techniques.

Overview of CSV data

Overview of spreadsheets

For some of our examples, we'll use a simple Excel spreadsheet consisting of a series of rows with an ID, along with minimum, maximum, and average values.

Overview of databases

Next, we will show how to add information to the database and how to read it. Next, we add the URLs to the table using the setString method and the execute method.

Overview of PDF files

There are many resources available that will provide more in-depth coverage of this topic.

Overview of JSON

Overview of XML

The Java API for XML Processing (JAXP) consists of three interfaces for parsing XML data. The Document Object Model (DOM) interface parses an XML document and returns a tree structure that delineates the structure of the document.
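To make the DOM approach more concrete, here is a minimal sketch that parses a small XML string into a tree and walks one kind of element. The class name DomSketch and the sample document are assumptions for illustration; only standard JAXP classes are used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class DomSketch {
    // Parses an XML string into a DOM tree; wraps failures in an unchecked exception.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical document used only for illustration.
        String xml = "<music><song><title>First</title></song>"
                + "<song><title>Second</title></song></music>";
        Document doc = parse(xml);
        NodeList titles = doc.getElementsByTagName("title");
        for (int i = 0; i < titles.getLength(); i++) {
            System.out.println(titles.item(i).getTextContent());
        }
    }
}
```

Because the whole tree is held in memory, this style suits small to medium documents.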

Overview of streaming data

Overview of audio/video/images in Java

Data acquisition techniques

Using the HttpUrlConnection class

Assuming we are getting an HTML formatted page, the following sequence illustrates how to get this content. While this is feasible, there are easier methods of getting the content of a web page.
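One way such a sequence might look is sketched below. The class name PageFetcher and its method names are assumptions for this example, and the address is supplied on the command line rather than taken from the book.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class PageFetcher {
    // Reads the stream line by line, appending a newline after each line.
    static String readBody(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Opens an HTTP GET connection and returns the response body as text.
    static String fetch(String address) throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        connection.setRequestMethod("GET");
        try {
            return readBody(connection.getInputStream());
        } finally {
            connection.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("Usage: java PageFetcher <url>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```

The result is raw HTML, so a further parsing step is still needed to extract anything useful from it.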

Web crawlers in Java

We can create our own web crawler or use an existing crawler, and we'll examine both approaches in this chapter. We will show how to create a simple web crawler in Java to provide a better insight into how web crawlers work.

Creating your own web crawler

In the SimpleWebCrawler constructor, we initialize instance variables to start a Wikipedia search for Bishop Rock, a small island in the Isles of Scilly off the southwest coast of England. The urlLimiter variable is set to Bishop_Rock, which will limit embedded links to only those containing this string.

Using the crawler4j web crawler

Two methods do most of the work: the shouldVisit method, which determines whether a page will be visited, and the visit method, which actually handles the page. If any of the images match, the method returns false and the page is ignored.

Web scraping in Java

Link: https://en.wikipedia.org/wiki/Data_science Text: Data Science Link: https://en.wikipedia.org/wiki/Jsoup Text: Jsoup. A comparison of Java HTML parsers, among others, can be found at https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers.

Using API calls to access common social media sites

Using OAuth to authenticate users

Handling Twitter

There are other options for retrieving tweets from a specific user and for posting data to a specific account, but we won't go into those in this chapter. For the purposes of this example, we will simply read the messages we received from the stream and print them to the screen.

Handling Wikipedia

The first two arguments of the three-argument constructor are the user ID and password, respectively. It is also possible to obtain a list of references in the article and a list of titles.

Handling Flickr

The Flickr instance is then created, with apikey and secret supplied as the first two parameters. We get the first photo instance from the previous list and then call its getUrl method to get the image URL.

Handling YouTube

In this section, we will demonstrate how to search for videos with keywords and retrieve information of interest. To use the YouTube API, you will need a Google Account.

Searching by keyword

We will show several approaches to give the reader insight into how this can be done. Many of these formats were introduced in Chapter 2, Data Acquisition, but we will go into more detail in the following sections.

Handling data formats

For example, there are a number of GUI tools that support data cleaning, such as OpenRefine (http://openrefine.org/). Since we introduced how this is done in the Web Scraping in Java section of Chapter 2, Data Acquisition, we won't duplicate the effort here.

Handling CSV data

We will temporarily store the data in an ArrayList, because we don't always know how many rows our data consists of. We can then process the data as we did above, by splitting the data into a two-dimensional array using String class methods.
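The approach described above can be sketched as follows. The class name CsvSketch and the inline sample data are assumptions for illustration, and the naive split on commas does not handle quoted fields.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CsvSketch {
    // Collects rows in a list first, then converts them to a two-dimensional array.
    static String[][] parse(String csv) {
        List<String[]> rows = new ArrayList<>(); // row count is unknown up front
        try (BufferedReader reader = new BufferedReader(new StringReader(csv))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    rows.add(line.split(","));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows.toArray(new String[0][]);
    }

    public static void main(String[] args) {
        // Hypothetical data: an ID followed by minimum, maximum, and average values.
        String[][] table = parse("1,10,20,15\n2,5,25,12\n");
        for (String[] row : table) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Reading from a file instead only requires swapping the StringReader for a FileReader.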

Handling spreadsheets

If we want to read the entire file at once, we use the readAll method. The following example prints each token individually, but the tokens can also be stored in a two-dimensional array or other data structure if necessary.

Handling Excel spreadsheets

We use the getCellType method to determine its type and then use the appropriate method to extract the data in the cell.

Handling PDF files

It is a very powerful tool when we need to extract and otherwise manipulate PDF documents.

Handling JSON

Using JSON streaming API

It consists of four fields, with the last field containing a range of location data. In the while loop, the nextToken method returns the next token and the parser moves on.

Using the JSON tree API

The nitty gritty of cleaning text

For example, the string Hello, my name is Sally can be used as a regular expression to find the exact words in a given text. We will use several examples of more complex regular expressions, and some of the more useful syntax options are summarized in the following table.

Using Java tokenizers to extract words

Hello, my name is \\w will match any text that starts with Hello, my name is and ends with a word character. You may need to read data from a file, but for the sake of simplicity, we'll use a string containing the opening sentences of Herman Melville's Moby Dick for some examples within this chapter.
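A small sketch of this pattern in use is shown below, relying only on the standard java.util.regex classes. The class name RegexSketch, the helper firstMatch, and the sample sentence are assumptions for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSketch {
    // Returns the first substring that matches the pattern, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // \\w in a Java string literal is the regex \w, a single word character.
        String match = firstMatch("Hello, my name is \\w", "Oh. Hello, my name is Sally!");
        System.out.println(match); // prints "Hello, my name is S"
    }
}
```

Since \w matches exactly one word character, the match stops after the first letter of the name; \w+ would capture the whole word.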

Java core tokenizers

Third-party tokenizers and libraries

A few years ago - no matter how long exactly - having little or no money in my wallet.

Transforming data into a usable form

Simple text cleaning

In the following example, we use the join method to combine our words once we've cleaned them up. We use the same chaining methods as shown previously to clean up and split our text.
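A minimal sketch of this kind of chained cleaning followed by a join might look like the following. The class name TextCleaner and the exact cleaning rules (lowercasing and keeping only the letters a to z) are assumptions for this example.

```java
class TextCleaner {
    // Lowercases the text, replaces everything except a-z and whitespace
    // with spaces, splits on whitespace, and joins the words with single spaces.
    static String clean(String text) {
        String[] words = text.toLowerCase()
                .replaceAll("[^a-z\\s]", " ")
                .trim()
                .split("\\s+");
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        System.out.println(clean("Hello,  World! It's 2016."));
    }
}
```

Note that this simple rule also splits contractions such as "It's" and drops digits, which may or may not be what a given analysis needs.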

Removing stop words

We set our factory to use default English stop words and then tokenize the text. Second, the LingPipe list of stop words is different from the file we used in the previous example.

Finding words in text

We specify the file to search for and use a try-catch block to catch any IO exceptions.

Finding and replacing text

A few years ago - don't worry about how long exactly - Call me Ishmael A few years ago don't worry about how long exactly. In this example, we'll use the replace method to simply replace all occurrences of the word with a single space.

Data imputation

To prevent skewing and misalignment of data, many statisticians advocate replacing missing data with values that are representative of the mean or expected value for that data set. One alternative approach might be to calculate the average of the values in the array.
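The mean-based alternative can be sketched as below. The class name MeanImputer and the use of null to mark a missing value are assumptions for this example.

```java
import java.util.Arrays;
import java.util.Objects;

class MeanImputer {
    // Replaces null entries with the mean of the non-null entries.
    static double[] impute(Double[] values) {
        double mean = Arrays.stream(values)
                .filter(Objects::nonNull)
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0); // assumption: fall back to 0 when every value is missing
        return Arrays.stream(values)
                .mapToDouble(v -> v == null ? mean : v)
                .toArray();
    }

    public static void main(String[] args) {
        Double[] data = {1.0, null, 3.0, null, 5.0}; // hypothetical data
        System.out.println(Arrays.toString(impute(data)));
    }
}
```

Filling gaps with the mean keeps the overall average unchanged but reduces the variance of the data, so it should be applied with care.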

Subsetting data

Data will sometimes start with blank lines or header lines that we want to remove from our dataset to be analyzed. In our last example, we will read data from a file and remove all empty lines.
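A minimal sketch of dropping empty lines is given below. The class name BlankLineFilter is an assumption, and the lines are supplied as a list here; they could equally come from java.nio.file.Files.readAllLines when reading a file.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class BlankLineFilter {
    // Keeps only the lines that contain something other than whitespace.
    static List<String> removeBlankLines(List<String> lines) {
        return lines.stream()
                .filter(line -> !line.trim().isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("", "id,min,max", "   ", "1,10,20");
        removeBlankLines(lines).forEach(System.out::println);
    }
}
```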

Sorting text

First, we'll print the original version of each list, followed by a call to the sort method. Next, we'll replace the sort method with the reverse method of the Collections class in our integer data example.
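The difference between the two Collections methods can be seen in the short sketch below. The class name SortSketch and the sample numbers are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SortSketch {
    // Returns a copy of the list in ascending order.
    static List<Integer> sorted(List<Integer> input) {
        List<Integer> copy = new ArrayList<>(input);
        Collections.sort(copy);
        return copy;
    }

    // Returns a copy of the list with the element order flipped; no sorting is done.
    static List<Integer> reversed(List<Integer> input) {
        List<Integer> copy = new ArrayList<>(input);
        Collections.reverse(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(12, 46, 52, 34, 87); // hypothetical data
        System.out.println("Original: " + numbers);
        System.out.println("Sorted:   " + sorted(numbers));
        System.out.println("Reversed: " + reversed(numbers));
    }
}
```

To obtain descending order, sort first and then reverse the sorted list.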

Data validation

Validating data types

The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects.

Validating dates

We create an instance of the SimpleDateFormat class using the format specified in the parameter. Our second method call did contain a valid date, but it was not in the format we specified.
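A sketch of this style of date validation is shown below. The class name DateValidator, the format string, and the sample dates are assumptions for this example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

class DateValidator {
    // Returns true if the text can be parsed as a date in the given format.
    static boolean isValid(String text, String format) {
        SimpleDateFormat formatter = new SimpleDateFormat(format);
        formatter.setLenient(false); // reject impossible dates such as 02/30/2016
        try {
            formatter.parse(text);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("12/12/1982", "MM/dd/yyyy"));
        System.out.println(isValid("12-12-1982", "MM/dd/yyyy"));
    }
}
```

The second call uses a real date, but its separators do not match the requested format, so it is rejected.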

Validating e-mail addresses

When tested against the same data as in the previous example, our output is identical. This class's isValid method examines an email address and determines whether it is valid or not.

Validating ZIP codes

Validating names

Cleaning images

In the following examples we will use Wikipedia images as they are freely downloadable.

Changing the contrast of an image

Smoothing an image

This is the process of multiplying each image element by the weighted values of its neighbors.

Brightening an image

Resizing an image

Converting images to different formats

In this chapter, we will illustrate how to create different types of graphs, plots, and charts. Visualization is an important step in data analysis because it allows us to view large data sets in a practical and meaningful way.

Understanding plots and graphs

This type of graph can be particularly useful for identifying trends over time or suggesting correlations between data values and specific events over a period of time. In the scatter chart developed in Creating Scatter Charts, population along the X-axis is plotted against decade along the Y-axis.

Visual analysis goals

While classification sorts data into similar types of data sets, clustering deals with the data within the set. Sometimes our data analysis requires us to extract specific types of information from our data set.

Creating index charts

In JavaFX, the main method usually starts the application using the base class launch method. JavaFX uses the concept of a stage and scene to handle the internals of the application's GUI.

Creating bar charts

In the following series of examples, we will use a set of European country populations for three decades, as indicated in the following table. The first will show the countries as categories where the annual changes occur within the category on the X-axis and the population along the Y-axis.

Using country as the category

The three series are added to the scene and the scene is attached to the stage using the setScene method. The last of the two methods is the start method, which is called automatically when the window is displayed.

Using decade as the category

The data series are created in the same way as before, except that the country name is used for the series name and the year is used for the category.

Creating stacked graphs

The ranges are initialized with the year used for the range name and the country, and their population is added using the helper method addDataItem.

Creating pie charts

We can create and initialize the pie chart in the constructor using an observable List of pie chart data. The next sequence creates an ObservableList instance based on the European population data presented earlier.

Creating scatter charts

Creating histograms

We then create a Number array to specify the ages for which we expect to have data. We use the setInsets method to specify how much space to place around each side of the graph within the window.

Creating donut charts

We create a DataTable and use the add method to add our values from the array. Since a donut plot is essentially a pie chart, we will display a donut plot by calling the setInnerRadius method.

Creating bubble charts

In the next chapter we will introduce basic statistical processes, including linear regression, and use the techniques introduced in this chapter. Although there are a wide variety of data analysis techniques, in this chapter we will focus on the most common tasks.

Working with mean, mode, and median

Calculating the mean

Using simple Java techniques to find mean

Using Java 8 techniques to find mean

Using Google Guava to find mean

Using Apache Commons to find mean

For our next two average examples, we'll use third-party libraries and continue to use the array of doubles, testData. We'll use this later to demonstrate median and standard deviation, but first we'll start by calculating the mean.

Calculating the median

Using simple Java techniques to find median

There are two cases to consider: our list has an even number of values, or our list has an odd number of values. If our list has an even number of values, we divide the length of the list by 2.
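A compact sketch that handles both the even and the odd case follows. The class name MedianSketch and the sample values are assumptions for this example.

```java
import java.util.Arrays;

class MedianSketch {
    // Median of a non-empty array; the input array is left unmodified.
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        if (sorted.length % 2 == 0) {
            // Even count: average the two middle values.
            return (sorted[mid - 1] + sorted[mid]) / 2.0;
        }
        // Odd count: the middle value is the median.
        return sorted[mid];
    }

    public static void main(String[] args) {
        System.out.println(median(new double[]{4, 1, 3, 2}));
        System.out.println(median(new double[]{3, 1, 2}));
    }
}
```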

Using Apache Commons to find the median

Calculating the mode

The problem is that our testData array now contains two values that appear twice, 12.5 and 11.2. We can handle this with native Java code and with third-party libraries, as we'll show in a moment.
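One native-Java way to report every mode is sketched below using a HashMap of counts. The class name ModeSketch is an assumption, and the array contents are hypothetical apart from 12.5 and 11.2 each appearing twice.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ModeSketch {
    // Returns every value that occurs with the highest frequency, in ascending order.
    static List<Double> modes(double[] values) {
        Map<Double, Integer> counts = new HashMap<>();
        int max = 0;
        for (double v : values) {
            int count = counts.merge(v, 1, Integer::sum);
            max = Math.max(max, count);
        }
        List<Double> result = new ArrayList<>();
        for (Map.Entry<Double, Integer> entry : counts.entrySet()) {
            if (entry.getValue() == max) {
                result.add(entry.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical data in which 12.5 and 11.2 each appear twice.
        double[] testData = {12.5, 18.3, 11.2, 19.0, 22.1, 14.3, 16.2, 12.5, 17.8, 16.5, 11.2};
        System.out.println(modes(testData));
    }
}
```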

Using ArrayLists to find multiple modes

Using a HashMap to find multiple modes

Using Apache Commons to find multiple modes

Standard deviation

Sometimes it is better to calculate the standard deviation of a sample subset of the population to correct for potential bias. However, we can continue to use Apache Commons to calculate the standard deviation in any format.

Sample size determination

On the first line, we've created a new StandardDeviation object and set our constructor's parameter to false, which will produce the standard deviation of a population. This means that we first treated it as if it were a subset of a population of data.

Hypothesis testing

Specific methods for performing hypothesis testing are beyond the scope of this book and require a solid background in statistical processes and best practices. Apache Commons provides the org.apache.commons.math3.stat.inference package with tools for performing hypothesis testing.

Regression analysis

RSS is also known as the Sum of Squared Residuals (SSR) or the Sum of Squared Error (SSE) of the forecast. The Mean Square Error (MSE) is the sum of the squared residuals divided by the degrees of freedom.
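To connect these definitions to code, the sketch below fits a least-squares line with plain Java and then computes RSS and MSE. The class name LeastSquares and the data points are assumptions for this example, and the degrees of freedom are taken as n - 2 because a simple linear regression estimates two parameters.

```java
class LeastSquares {
    // Returns {intercept, slope} of the least-squares line through the points.
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0, sxx = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        return new double[] {meanY - slope * meanX, slope};
    }

    // Residual sum of squares: squared gaps between observed and fitted values.
    static double rss(double[] x, double[] y, double[] line) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double residual = y[i] - (line[0] + line[1] * x[i]);
            sum += residual * residual;
        }
        return sum;
    }

    // Mean square error: RSS divided by n - 2 degrees of freedom.
    static double mse(double[] x, double[] y, double[] line) {
        return rss(x, y, line) / (x.length - 2);
    }

    public static void main(String[] args) {
        double[] x = {0, 1, 2, 3}; // hypothetical data
        double[] y = {1, 3, 4, 8};
        double[] line = fit(x, y);
        System.out.println("Intercept: " + line[0] + ", slope: " + line[1]);
        System.out.println("RSS: " + rss(x, y, line) + ", MSE: " + mse(x, y, line));
    }
}
```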

Using simple linear regression

The SimpleRegression class has a predict method that is passed a value, in this case year, and returns the estimated population. The line fits the actual population values fairly well and shows the predicted population in the future.

Using multiple regression

We will use the Apache Commons OLSMultipleLinearRegression class to perform multiple regression using the cigarette data. In this chapter, we will be able to touch on just a few of these techniques.

Supervised learning techniques

For unsupervised learning, we show how association rule learning can be used to find relationships between elements of a dataset. We will review the elements of reinforcement learning and discuss some specific variations of this technique.

Decision trees

Decision tree types

Decision tree libraries

We will use the Waikato Environment for Knowledge Analysis (Weka) to demonstrate how to create a decision tree in Java. The second section is the data section; it consists of a comma-delimited array of data.

Using a decision tree with a book dataset

It can also be called from the command line or through a Java API, which we will use. We'll want to test the model, so we'll create an object that implements the Instance interface for each test case.

Testing the book decision tree

Based on the books file, we can calculate the entropy for the binding type as shown in the table below. We can calculate entropy for the use of color and genre in a similar way.
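For readers who want to reproduce such figures, the following sketch computes Shannon entropy in bits from class counts. The class name EntropySketch and the counts are assumptions for this example and are not taken from the books file.

```java
class EntropySketch {
    // Shannon entropy, in bits, of a distribution given as class counts.
    static double entropy(int... counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) { // a count of zero contributes nothing
                double p = (double) c / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    public static void main(String[] args) {
        // Hypothetical split: 5 purchases and 5 non-purchases.
        System.out.println(entropy(5, 5));
        // Hypothetical split: 9 purchases and 1 non-purchase.
        System.out.println(entropy(9, 1));
    }
}
```

An even split gives the maximum entropy of 1 bit for two classes, while a one-sided split gives a value close to 0, which is why attributes producing lower entropy make better early splits.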

Support vector machines

So it makes more sense to use the binding type for the first level of the tree. Part of the SVM process is finding the best hyperplane for the problem dataset.

Using an SVM for camping data

The Instances class holds a series of data instances, where each instance is an age, income, and camping value. The first 14 cases are used to train the model and the last 5 cases are used to test the model.

Testing individual instances

This is repeated 10 times using a different partition of the data set each time, and the average of the results is used.

Bayesian networks