Java for Data Science
Credits
About the Authors
About the Reviewers
Why subscribe?
Customer Feedback
Preface
What this book covers
This chapter begins with data acquisition and includes many of the techniques used in subsequent chapters to build a complete application.
What you need for this book
Who this book is for
Conventions
Note
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
Getting Started with Data Science
A more in-depth discussion of the history of data science can be found at http://www.forbes.com/sites/gilpress a-very-short-history-of-data-. However, the purpose of this book is simply to introduce the reader to the important data science topics and illustrate how they can be addressed using Java.
Problems solved using data science
Understanding the data science problem-solving approach
Using Java to support data science
Acquiring data for an application
The importance and process of cleaning data
Stop words are words such as and, or, and but, which do not always contribute to the analysis of text. A few years ago - never mind how long exactly - with little or no money in my wallet and nothing much to interest me on land, I thought I would sail around a bit and see the watery part of the world.
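To make the idea of stop words concrete, the following is a minimal sketch in standard Java. The class name StopWordSketch and the very short stop-word list are assumptions made for illustration; real stop-word lists are much longer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class StopWordSketch {
    // Illustrative stop-word list only (an assumption, not a standard list).
    static final Set<String> STOP_WORDS = new HashSet<>(
        Arrays.asList("a", "and", "or", "but", "the", "of", "in", "my", "i"));

    // Lowercases the text, splits it into words, and drops the stop words.
    static List<String> removeStopWords(String text) {
        List<String> kept = new ArrayList<>();
        // Split on any run of characters that is not a letter or apostrophe.
        for (String token : text.toLowerCase().split("[^a-z']+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Calling removeStopWords on a sentence returns only the words that are not in the list, in their original order.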
Visualizing data to enhance understanding
The use of statistical methods in data science
Simple statistical techniques, such as mean and standard deviation, can be calculated using standard Java. In Chapter 5, Statistical Data Analysis Techniques, we will demonstrate how to perform regression analysis using the Apache Commons SimpleRegression class.
Machine learning applied to data science
Using neural networks in data science
Deep learning approaches
Once the model is trained, it can be used for predictive and search tasks. With a search, the compressed middle layer can be used to match other compressed images.
Performing text analysis
Visual and audio analysis
As you can see, there are numerous errors that need to be fixed in this example. We conclude the chapter with a discussion of NeurophStudio, a Java-based neural network editor to classify images and perform image recognition.
Improving application performance using parallel techniques
Threads are used by APIs such as Aparapi when multiple CPUs or GPUs are unavailable. The Aparapi API provides high-level support for using multiple CPUs or GPUs, and we include a demonstration of Aparapi supporting matrix multiplication.
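As a point of comparison, matrix multiplication can also be parallelized with nothing more than standard Java parallel streams. The sketch below does not use Aparapi; the class name ParallelMatrixSketch is hypothetical, and the code assumes rectangular matrices with compatible dimensions.

```java
import java.util.stream.IntStream;

final class ParallelMatrixSketch {
    // Multiplies a (n x k) by b (k x m) using a parallel stream over rows.
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        int k = b.length;
        int m = b[0].length;
        double[][] c = new double[n][m];
        // Each row of the result is computed by exactly one task,
        // so no two threads ever write to the same element.
        IntStream.range(0, n).parallel().forEach(i -> {
            for (int j = 0; j < m; j++) {
                double sum = 0.0;
                for (int p = 0; p < k; p++) {
                    sum += a[i][p] * b[p][j];
                }
                c[i][j] = sum;
            }
        });
        return c;
    }
}
```

Splitting the work by row keeps the tasks independent, which is the same property that makes this problem a good fit for GPU-oriented libraries.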
Assembling the pieces
Summary
Data Acquisition
In this chapter we will focus on how to download and extract these types of text as plain text for possible processing. We need this knowledge to demonstrate how to obtain data using various data collection techniques.
Understanding the data formats used in data science applications
Overview of CSV data
Overview of spreadsheets
Overview of databases
There are many resources available that will provide more in-depth coverage of this topic. For example, Oracle provides a more in-depth introduction to this topic at https://docs.oracle.com/javase/tutorial/jdbc/.
Overview of PDF files
Overview of JSON
Overview of XML
For example, we can validate a specific XML document by using either a DTD document or XML schema written specifically for that XML document.
Overview of streaming data
Overview of audio/video/images in Java
Data acquisition techniques
Using the HttpUrlConnection class
Web crawlers in Java
Creating your own web crawler
In this simple web crawler, we will keep track of the pages visited using ArrayList instances. In this example, we did not store the results of the review in an external source.
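The bookkeeping behind such a crawler can be sketched without any network access. In the sketch below, the links map is a hypothetical stand-in for fetching a page and extracting its links, and the class name CrawlerSketch and the maxPages parameter are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class CrawlerSketch {
    // Visits pages breadth-first, starting from the seed page.
    // 'links' maps a page to the pages it links to (a stand-in for real fetching).
    static List<String> crawl(String seed, Map<String, List<String>> links, int maxPages) {
        List<String> visited = new ArrayList<>();
        List<String> toVisit = new ArrayList<>();
        toVisit.add(seed);
        while (!toVisit.isEmpty() && visited.size() < maxPages) {
            String page = toVisit.remove(0);
            if (visited.contains(page)) {
                continue; // already processed
            }
            visited.add(page);
            for (String link : links.getOrDefault(page, Collections.emptyList())) {
                // Queue a link only if it has been neither visited nor queued.
                if (!visited.contains(link) && !toVisit.contains(link)) {
                    toVisit.add(link);
                }
            }
        }
        return visited;
    }
}
```

ArrayList lookups are linear, which is acceptable for a small demonstration; a larger crawl would probably track visited pages in a HashSet instead.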
Using the crawler4j web crawler
If you place the for loop outside the if statement, the links for all pages are tested. If any of the images match, the method returns false and the page is ignored.
Web scraping in Java
Text: Data science From Wikipedia, the free encyclopedia Jump to: navigation, search Not to be confused with information science. Link: https://en.wikipedia.org/wiki/Data_science Text: Data Science Link: https://en.wikipedia.org/wiki/Jsoup Text: Jsoup.
Using API calls to access common social media sites
Using OAuth to authenticate users
Handling Twitter
The parameters to our method are the authentication information that Twitter gave us when we created our app. For the purposes of this example, we will simply read the messages received from the stream and print them to the screen.
Handling Wikipedia
The first two arguments of the three-argument constructor are the user ID and password, respectively. It is also possible to get a list of references in the article and a list of the headings.
Handling Flickr
One of the first steps to using the Flickr API is to request an API key. The list of images returned by this example will vary, as we used a fairly wide search range and images are added all the time.
Handling YouTube
Data Cleaning
Validity: Ensures that the data has the correct form or structure.
Accuracy: The values within the data are truly representative of the data set.
Completeness: There are no missing elements.
We will show several approaches to give the reader insights into how this can be done.
Handling data formats
Handling CSV data
We can have any number of elements on each row without worrying about exception handling. We can then process the data as we did above by splitting the data into a two-dimensional array using String class methods.
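One way to split CSV text into a two-dimensional array with String class methods is sketched below. The class name CsvSketch is hypothetical, and the sketch assumes simple data with no quoted fields or embedded commas.

```java
final class CsvSketch {
    // Splits CSV text into rows and fields; rows may have different lengths.
    static String[][] parse(String csvText) {
        String[] rows = csvText.split("\\r?\\n");
        String[][] table = new String[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            // A limit of -1 keeps trailing empty fields such as "a,b,".
            String[] fields = rows[i].split(",", -1);
            for (int j = 0; j < fields.length; j++) {
                fields[j] = fields[j].trim();
            }
            table[i] = fields;
        }
        return table;
    }
}
```

Because each row is stored as its own array, a row with fewer or more elements than its neighbors does not cause an exception.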
Handling spreadsheets
Handling Excel spreadsheets
Since a workbook can consist of multiple spreadsheets, we select the first one using the getSheetAt method. We use the getCellType method to determine the type and then use the appropriate method to extract the data in the cell.
Handling PDF files
Handling JSON
Using JSON streaming API
In the while loop, the nextToken method returns and moves the parser to the next token. However, we need to find the persons field and then extract each of its elements.
Using the JSON tree API
The nitty gritty of cleaning text
For example, you can use the string Hello, my name is Sally as a regular expression to find exact words in a given text. Hello, my name is \\w will match any text that starts with Hello, my name is and ends with a single word character.
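The difference between \\w and \\w+ is easy to overlook, so here is a brief sketch using the standard java.util.regex classes. The class name RegexSketch and the method names are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexSketch {
    // \\w+ matches one or more word characters, so the whole name is captured.
    private static final Pattern NAME =
        Pattern.compile("Hello, my name is (\\w+)");

    // Returns the name following the greeting, or null if the greeting is absent.
    static String findName(String text) {
        Matcher matcher = NAME.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    // \\w on its own matches exactly one word character.
    static boolean endsWithSingleWordChar(String text) {
        return text.matches("Hello, my name is \\w");
    }
}
```

The matches method requires the entire string to fit the pattern, while find searches for the pattern anywhere in the text.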
Using Java tokenizers to extract words
Java core tokenizers
Third-party tokenizers and libraries
A few years ago – however long exactly – I had little or no money in my wallet.
Transforming data into a usable form
Simple text cleaning
Removing stop words
We set our factory to use the default English stop words and then tokenize the text. Second, the list of LingPipe stopwords is different from the file we used in the previous example.
Finding words in text
Finding and replacing text
A few years ago - no matter how long exactly - Call me Ishmael A few years ago no matter how long exactly. In this example, we'll use the replace method to simply replace all occurrences of the word me with a single space.
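One caveat worth illustrating is that String.replace works on character sequences, not words, so replacing me also alters words that merely contain those letters. The sketch below contrasts it with a word-boundary regular expression; the class name ReplaceSketch and its method names are hypothetical.

```java
final class ReplaceSketch {
    // Replaces every occurrence of the letters "me", even inside other words.
    static String replaceEverywhere(String text) {
        return text.replace("me", " ");
    }

    // Replaces "me" only where it stands alone as a word (\\b is a word boundary).
    static String replaceWholeWord(String text) {
        return text.replaceAll("\\bme\\b", " ");
    }
}
```

With the first method a word such as name loses its last two letters, whereas the second leaves it intact.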
Data imputation
When making these decisions, it is important to consider the type of data and the purpose of the data analysis. We will use this to test whether the value in the array is null or not.
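A minimal sketch of one imputation strategy follows: missing values, represented here as null, are replaced with the mean of the values that are present. The class name ImputationSketch and the choice of the mean as the fill value are assumptions; other data may call for a different strategy.

```java
final class ImputationSketch {
    // Returns a copy of the data with each null replaced by the mean
    // of the non-null values (0.0 if every value is missing).
    static double[] imputeWithMean(Double[] values) {
        double sum = 0.0;
        int count = 0;
        for (Double value : values) {
            if (value != null) {
                sum += value;
                count++;
            }
        }
        double mean = count == 0 ? 0.0 : sum / count;
        double[] result = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            result[i] = values[i] != null ? values[i] : mean;
        }
        return result;
    }
}
```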
Subsetting data
Sometimes, the data will start with blank rows or header rows that we want to remove from our dataset for analysis. In our final example, we will read data from a file and remove all blank lines.
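Removing blank lines can be sketched with a Java 8 stream filter. The class name BlankLineSketch is hypothetical, and the sketch works on a list of lines so that it does not depend on any particular file.

```java
import java.util.List;
import java.util.stream.Collectors;

final class BlankLineSketch {
    // Keeps only the lines that contain something other than whitespace.
    static List<String> removeBlankLines(List<String> lines) {
        return lines.stream()
                    .filter(line -> !line.trim().isEmpty())
                    .collect(Collectors.toList());
    }
}
```

The lines could come from a file read with Files.readAllLines(Paths.get("data.txt")), where data.txt is a placeholder file name.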
Sorting text
Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. We will continue to use our numsList and assume that no sorting has occurred yet.
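The difference between sorting and reversing is worth seeing side by side: Collections.reverse only flips the existing order and does not sort. The sketch below uses a hypothetical class name, SortSketch, and works on copies so that the original list is left unsorted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SortSketch {
    // Returns a copy sorted in ascending order.
    static List<Integer> sortedAscending(List<Integer> nums) {
        List<Integer> copy = new ArrayList<>(nums);
        Collections.sort(copy);
        return copy;
    }

    // Returns a copy with the current order flipped; no sorting takes place.
    static List<Integer> reversed(List<Integer> nums) {
        List<Integer> copy = new ArrayList<>(nums);
        Collections.reverse(copy);
        return copy;
    }

    // Returns a copy sorted in descending order.
    static List<Integer> sortedDescending(List<Integer> nums) {
        List<Integer> copy = new ArrayList<>(nums);
        Collections.sort(copy, Collections.reverseOrder());
        return copy;
    }
}
```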
Data validation
Validating data types
Validating dates
Our second method call did contain a valid date, but it was not in the format we specified. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
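A date check of this kind might be sketched with the standard SimpleDateFormat class as follows. The class name DateValidationSketch and the MM/dd/yyyy pattern used in the usage note are assumptions for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

final class DateValidationSketch {
    // Returns true if the text parses as a date in the given pattern.
    static boolean isValidDate(String text, String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        // Non-lenient parsing rejects impossible dates such as February 30.
        format.setLenient(false);
        try {
            format.parse(text);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }
}
```

With the pattern MM/dd/yyyy, the string 12-06-1982 is rejected even though it describes a real date, which illustrates the trade-off between strict validation and discarding useful data.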
Validating e-mail addresses
This example highlights why it is important to consider the limitations you place on data. When tested against the same data as in the previous example, our output is identical.
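As an illustration of how a validation rule encodes limitations, here is a sketch of a deliberately simple e-mail check. The class name EmailValidationSketch and the pattern are assumptions; the pattern is far stricter than what the e-mail standards permit, so it will reject some legitimate addresses.

```java
import java.util.regex.Pattern;

final class EmailValidationSketch {
    // Simplified pattern: local part, @, domain, dot, and a top-level
    // domain of at least two letters. Not a complete e-mail grammar.
    private static final Pattern SIMPLE_EMAIL = Pattern.compile(
        "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$");

    static boolean isValidEmail(String candidate) {
        return candidate != null && SIMPLE_EMAIL.matcher(candidate).matches();
    }
}
```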
Validating ZIP codes
Validating names
Cleaning images
Changing the contrast of an image
Smoothing an image
Brightening an image
Resizing an image
Converting images to different formats
Data Visualization
Visualization is an important step in data analysis because it allows us to imagine large data sets in a practical and meaningful way. This process allows us to quickly spot errors and fix them before they become a problem later.
Understanding plots and graphs
The following is a simple example of an index chart showing the distribution of the population for. A scatterplot is simply a collection of points, and analysis techniques such as correlation or regression can be used to identify trends within these types of graphs.
Visual analysis goals
Creating index charts
The last part of the simpleLineChart method creates an instance of the Scene class that represents the contents of the work area. JavaFX uses the concept of a stage and scene to handle the internal workings of the application's GUI.
Creating bar charts
The first shows the countries as categories where the annual changes occur within the category on the X-axis and the population along the Y-axis.
Using country as the category
The last of the two methods is the launch method, which is automatically called when the window is displayed.
Using decade as the category
Creating stacked graphs
Creating pie charts
The next sequence creates an ObservableList instance based on the European population data presented earlier.
Creating scatter charts
Creating histograms
We've chosen to set our range for the X-axis to match our expected age range, but to leave some room on the side of the graph. Because we know the amount of sample data, we set our Y axis to go from 0 to 10.
Creating donut charts
Creating bubble charts
Statistical Data Analysis Techniques
Although there are a number of different data analysis techniques, this chapter will focus on the more common tasks. We will begin with an introduction to the basic techniques of statistical analysis, including calculating the mean, median, mode, and standard deviation of a data set.
Working with mean, mode, and median
Calculating the mean
Using simple Java techniques to find mean
Using Java 8 techniques to find mean
Another, more concise, technique using the OptionalDouble class involves lambda expressions and the ifPresent method. For our next two average examples, we'll use third-party libraries and continue to use the array of doubles, testData.
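The OptionalDouble approach can be sketched as follows. The class name MeanSketch and the values in testData are placeholders chosen for illustration, not the data used in the chapter.

```java
import java.util.Arrays;
import java.util.OptionalDouble;

final class MeanSketch {
    // average() returns an empty OptionalDouble when the array has no elements.
    static OptionalDouble mean(double[] data) {
        return Arrays.stream(data).average();
    }

    public static void main(String[] args) {
        // Placeholder sample values.
        double[] testData = {12.5, 18.3, 11.2, 19.0, 22.1, 14.3, 16.2, 12.5, 17.8, 16.5};
        // The lambda runs only if a mean could be computed.
        mean(testData).ifPresent(m -> System.out.println("The mean is " + m));
    }
}
```

Using ifPresent avoids an explicit check for the empty case, since nothing is printed when the array has no elements.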
Using Google Guava to find mean
Using Apache Commons to find mean
We'll use this later to show the median and standard deviation, but first we'll start by calculating the mean.
Calculating the median
Using simple Java techniques to find median
To test our code for an odd number of elements, we'll add the double 12.5 to the end of the array.
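A plain Java median calculation that handles both odd and even element counts might look like the sketch below. The class name MedianSketch is hypothetical.

```java
import java.util.Arrays;

final class MedianSketch {
    // Sorts a copy of the data and returns the middle value, or the
    // average of the two middle values when the count is even.
    static double median(double[] data) {
        if (data.length == 0) {
            throw new IllegalArgumentException("data must not be empty");
        }
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        if (sorted.length % 2 == 0) {
            return (sorted[mid - 1] + sorted[mid]) / 2.0;
        }
        return sorted[mid];
    }
}
```

Sorting a copy leaves the caller's array in its original order.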
Using Apache Commons to find the median
Calculating the mode
We can address this through basic Java code and through third-party libraries, as we'll show in a moment.
Using ArrayLists to find multiple modes
Using a HashMap to find multiple modes
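One possible HashMap-based approach is sketched here: count the occurrences of each value, then collect every value that reaches the highest count. The class name ModeSketch is an assumption, and the result is sorted only to make the output order predictable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ModeSketch {
    // Returns every value that occurs most often, in ascending order.
    static List<Double> modes(double[] data) {
        Map<Double, Integer> counts = new HashMap<>();
        int maxCount = 0;
        for (double value : data) {
            // merge adds 1 to the existing count, or starts the count at 1.
            int count = counts.merge(value, 1, Integer::sum);
            maxCount = Math.max(maxCount, count);
        }
        List<Double> result = new ArrayList<>();
        for (Map.Entry<Double, Integer> entry : counts.entrySet()) {
            if (entry.getValue() == maxCount) {
                result.add(entry.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }
}
```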
Using Apache Commons to find multiple modes
Standard deviation
Sometimes it is better to calculate the standard deviation of a sample subset of the population to correct for potential bias. However, we can continue to use Apache Commons to calculate the standard deviation in any format.
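The difference between the population and sample forms comes down to the divisor: n for the population and n - 1 for a sample. The sketch below uses plain Java rather than Apache Commons, and the class name StdDevSketch is hypothetical.

```java
final class StdDevSketch {
    // Computes the standard deviation; 'sample' selects the n - 1 divisor.
    static double standardDeviation(double[] data, boolean sample) {
        int n = data.length;
        if (n == 0 || (sample && n < 2)) {
            throw new IllegalArgumentException("not enough data");
        }
        double sum = 0.0;
        for (double value : data) {
            sum += value;
        }
        double mean = sum / n;
        double squaredDiffs = 0.0;
        for (double value : data) {
            squaredDiffs += (value - mean) * (value - mean);
        }
        int divisor = sample ? n - 1 : n;
        return Math.sqrt(squaredDiffs / divisor);
    }
}
```

For the same data, the sample form is always slightly larger than the population form, which is the correction for bias mentioned above.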
Sample size determination
Hypothesis testing
Regression analysis
Using simple linear regression
The SimpleRegression class has a predict method that takes a value, a year in this case, and returns the estimated population. The line matches the current population values fairly well and shows the projected future population.
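To show what such a prediction involves, the following sketch fits a least-squares line in plain Java. It is not the Apache Commons SimpleRegression class; the class name LinearRegressionSketch is hypothetical, and it assumes at least two points with differing x values.

```java
final class LinearRegressionSketch {
    final double slope;
    final double intercept;

    // Fits y = intercept + slope * x by ordinary least squares.
    LinearRegressionSketch(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two or more paired points");
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;
        double sxy = 0.0; // sum of (x - meanX) * (y - meanY)
        double sxx = 0.0; // sum of (x - meanX)^2
        for (int i = 0; i < x.length; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        slope = sxy / sxx;
        intercept = meanY - slope * meanX;
    }

    // Returns the fitted value for x, for example a population for a year.
    double predict(double x) {
        return intercept + slope * x;
    }
}
```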
Using multiple regression
We will use a JavaFX program to create the scatterplot and perform the analysis. In this example, we focus on the multiple regression code and do not include the JavaFX code used to create the scatter plot.