Java for Data Science

Игорь Ежов

Academic year: 2023

Using simple Java techniques to find mean. Using Java 8 techniques to find mean. Using Google Guava to find mean. Using Apache Commons to find mean. Calculating the median. Using simple Java techniques to find median. Using Apache Commons to find the median. Calculating the mode.

Java for Data Science

Credits

About the Authors

About the Reviewers

Why subscribe?

Customer Feedback

Preface

What this book covers

This chapter begins with data acquisition and includes many of the techniques used in subsequent chapters to build a complete application.

What you need for this book

Who this book is for

Conventions

Note

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

Getting Started with Data Science

A more in-depth discussion of the history of data science can be found at http://www.forbes.com/sites/gilpress a-very-short-history-of-data-. However, the purpose of this book is simply to introduce the reader to the important data science topics and illustrate how they can be addressed using Java.

Problems solved using data science

Understanding the data science problem - solving approach

Using Java to support data science

Acquiring data for an application

The importance and process of cleaning data

Stop words are words such as and, or, and but, which do not always contribute to the analysis of text. A few years ago - never mind how long exactly - with little or no money in my wallet and nothing much to interest me on land, I thought I would sail around a bit and see the watery part of the world.
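As a rough illustration of stop-word removal, the following sketch lowercases a sentence, splits it on non-word characters, and drops any token found in a stop-word set. The class name, method name, and the very short stop-word list are assumptions made for this example; they are not taken from the book's own code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class StopWordSketch {
    // Hypothetical, deliberately tiny stop-word list; real lists are much longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("and", "or", "but", "the", "a", "of", "to"));

    // Lowercase the text, split on runs of non-word characters,
    // and keep only the tokens that are not stop words.
    static List<String> removeStopWords(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(word -> !word.isEmpty())
                .filter(word -> !STOP_WORDS.contains(word))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [cat, dog, not, bird]
        System.out.println(removeStopWords("The cat and the dog, but not the bird"));
    }
}
```

Whether a word such as "not" should be treated as a stop word depends on the analysis, which is why the list is kept as a separate, replaceable set.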

Visualizing data to enhance understanding

The use of statistical methods in data science

Simple statistical techniques, such as mean and standard deviation, can be calculated using standard Java. In Chapter 5, Statistical Data Analysis Techniques, we will demonstrate how to perform regression analysis using the Apache Commons SimpleRegression class.

Machine learning applied to data science

Using neural networks in data science

Deep learning approaches

Once the model is trained, it can be used for predictive and search tasks. With a search, the compressed middle layer can be used to match other compressed images.

Performing text analysis

Visual and audio analysis

As you can see, there are numerous errors that need to be fixed in this example. We conclude the chapter with a discussion of NeurophStudio, a Java-based neural network editor to classify images and perform image recognition.

Improving application performance using parallel techniques

Threads are used by APIs such as Aparapi when multiple CPUs or GPUs are unavailable. The Aparapi API provides high-level support for using multiple CPUs or GPUs, and we include a demonstration of Aparapi supporting matrix multiplication.

Assembling the pieces

Summary

Data Acquisition

In this chapter we will focus on how to download and extract these types of text as plain text for possible processing. We need this knowledge to demonstrate how to obtain data using various data collection techniques.

Understanding the data formats used in data science applications

Overview of CSV data

Overview of spreadsheets

Overview of databases

There are many resources available that will provide more in-depth coverage of this topic. For example, Oracle provides a more in-depth introduction to this topic at https://docs.oracle.com/javase/tutorial/jdbc/.

Overview of PDF files

Overview of JSON

Overview of XML

For example, we can validate a specific XML document by using either a DTD document or XML schema written specifically for that XML document.

Overview of streaming data

Overview of audio/video/images in Java

Data acquisition techniques

Using the HttpUrlConnection class

Web crawlers in Java

Creating your own web crawler

In this simple web crawler, we will keep track of the pages visited using ArrayList instances. In this example, we did not store the results of the crawl in an external source.
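The bookkeeping described here can be sketched without any network access. In the example below, a hypothetical in-memory map stands in for fetching a page and extracting its links; the page names, the LINKS map, and the maxPages limit are all assumptions for illustration, not the book's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SimpleCrawlerSketch {
    // Hypothetical stand-in for the web: page name -> links found on that page.
    static final Map<String, List<String>> LINKS = new HashMap<>();
    static {
        LINKS.put("pageA", Arrays.asList("pageB", "pageC"));
        LINKS.put("pageB", Arrays.asList("pageA", "pageC"));
        LINKS.put("pageC", Arrays.asList("pageD"));
    }

    // Visit pages breadth-first, recording each visited page in an ArrayList
    // so that no page is processed twice.
    static List<String> crawl(String start, int maxPages) {
        List<String> visited = new ArrayList<>();
        List<String> pending = new ArrayList<>();
        pending.add(start);
        while (!pending.isEmpty() && visited.size() < maxPages) {
            String page = pending.remove(0);
            if (visited.contains(page)) {
                continue; // already handled through another link
            }
            visited.add(page);
            for (String link : LINKS.getOrDefault(page, Collections.emptyList())) {
                if (!visited.contains(link)) {
                    pending.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // prints [pageA, pageB, pageC, pageD]
        System.out.println(crawl("pageA", 10));
    }
}
```

An ArrayList keeps the visit order, which is convenient for display; for large crawls a HashSet would make the "already visited" check faster.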

Using the crawler4j web crawler

If you place the for loop outside the if statement, the links for all pages are tested. If any of the images match, the method returns false and the page is ignored.

Web scraping in Java

Text: Data science From Wikipedia, the free encyclopedia Jump to: navigation, search Not to be confused with information science.
Link: https://en.wikipedia.org/wiki/Data_science Text: Data Science
Link: https://en.wikipedia.org/wiki/Jsoup Text: Jsoup

Using API calls to access common social media sites

Using OAuth to authenticate users

Handling Twitter

The parameters to our method are the authentication information that Twitter gave us when we created our app. For the purposes of this example, we will simply read the messages received from the stream and print them to the screen.

Handling Wikipedia

The first two arguments of the three-argument constructor are the user ID and password, respectively. It is also possible to get a list of references in the article and a list of the headings.

Handling Flickr

One of the first steps to using the Flickr API is to request an API key. The list of images returned by this example will vary, as we used a fairly wide search range and images are added all the time.

Handling YouTube

Data Cleaning

Validity: Ensures that the data has the correct form or structure.
Accuracy: The values within the data are truly representative of the data set.
Completeness: There are no missing elements.

We will show several approaches to give the reader insights into how this can be done.

Handling data formats

Handling CSV data

We can have any number of elements on each row without worrying about exception handling. We can then process the data as we did above by splitting the data into a two-dimensional array using String class methods.
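A minimal sketch of this splitting step, using only String class methods, might look as follows. The class name and sample data are assumptions, and the code does not handle quoted fields or embedded commas, which a dedicated CSV library would.

```java
import java.util.Arrays;

class CsvSplitSketch {
    // Split the text into rows on line breaks, then split each row on commas.
    // Rows may have different lengths, so the result is a ragged 2-D array.
    static String[][] parse(String csv) {
        String[] rows = csv.split("\\R");
        String[][] table = new String[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            // The limit -1 keeps trailing empty fields such as "a,b,".
            table[i] = rows[i].split(",", -1);
        }
        return table;
    }

    public static void main(String[] args) {
        String[][] table = parse("name,age,city\nSally,34\nBob");
        for (String[] row : table) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```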

Handling spreadsheets

Handling Excel spreadsheets

Since a workbook can consist of multiple spreadsheets, we select the first one using the getSheetAt method. We use the getCellType method to determine the type and then use the appropriate method to extract the data in the cell.

Handling PDF files

Handling JSON

Using JSON streaming API

In the while loop, the nextToken method returns the next token and advances the parser to it. However, we need to find the persons field and then extract each of its elements.

Using the JSON tree API

The nitty gritty of cleaning text

For example, you can use the string Hello, my name is Sally as a regular expression to find that exact phrase in a given text. Hello, my name is \\w will match any text that starts with Hello, my name is and ends with a word character.
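The two expressions can be tried with the standard java.util.regex classes. This is a small sketch; the helper names and sample sentences are assumptions. Note that \\w matches a single word character, so \\w+ is used in the second helper call to capture a whole name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSketch {
    // True if the regular expression matches anywhere in the text.
    static boolean containsMatch(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    // Returns the first matching substring, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(containsMatch("Hello, my name is Sally", "Oh. Hello, my name is Sally!"));
        System.out.println(containsMatch("Hello, my name is \\w", "Hello, my name is Bob"));
        System.out.println(firstMatch("Hello, my name is \\w+", "Hello, my name is Bob."));
    }
}
```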

Using Java tokenizers to extract words

Java core tokenizers

Third-party tokenizers and libraries

A few years ago – however long exactly – I had little or no money in my wallet.

Transforming data into a usable form

Simple text cleaning

Removing stop words

We set our factory to use the default English stop words and then tokenize the text. Second, the list of LingPipe stopwords is different from the file we used in the previous example.

Finding words in text

Finding and replacing text

A few years ago - no matter how long exactly - Call me Ishmael A few years ago no matter how long exactly. In this example, we'll use the replace method to simply replace all occurrences of the word me with a single space.
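The behavior of the replace method is easy to check in isolation. In this sketch (class and method names are assumptions), replace substitutes every occurrence of the character sequence me, including occurrences inside longer words, so a word-boundary variant using replaceAll is shown alongside it for comparison.

```java
class ReplaceSketch {
    // Replaces every occurrence of the character sequence "me" with one space,
    // even when it appears inside a longer word such as "some".
    static String replaceEverywhere(String text) {
        return text.replace("me", " ");
    }

    // Replaces "me" only when it stands alone as a whole word.
    static String replaceWholeWord(String text) {
        return text.replaceAll("\\bme\\b", " ");
    }

    public static void main(String[] args) {
        System.out.println("[" + replaceEverywhere("Call me Ishmael") + "]");
        System.out.println("[" + replaceEverywhere("Some time") + "]");
        System.out.println("[" + replaceWholeWord("Some time, call me") + "]");
    }
}
```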

Data imputation

When making these decisions, it is important to consider the type of data and the purpose of the data analysis. We will use this to test whether the value in the array is null or not.
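One common imputation choice is to fill missing entries with the mean of the values that are present. The sketch below tests each array element for null and substitutes the mean; the class name, the use of a Double[] array, and the choice of mean imputation are assumptions for illustration, and other data may call for a different strategy.

```java
import java.util.Arrays;
import java.util.Objects;

class ImputationSketch {
    // Replace each null entry with the mean of the non-null entries.
    // If every entry is null, 0.0 is used as an arbitrary fallback.
    static double[] imputeWithMean(Double[] values) {
        double mean = Arrays.stream(values)
                .filter(Objects::nonNull)
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0);
        double[] result = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            result[i] = (values[i] == null) ? mean : values[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [1.0, 2.0, 3.0]
        System.out.println(Arrays.toString(imputeWithMean(new Double[] {1.0, null, 3.0})));
    }
}
```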

Subsetting data

Sometimes, the data will start with blank rows or header rows that we want to remove from our dataset for analysis. In our final example, we will read data from a file and remove all blank lines.
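Removing blank lines can be sketched with a stream filter. To keep the example self-contained, the lines are supplied as an in-memory list rather than read from a file; in practice they could come from java.nio.file.Files.readAllLines. The class and method names are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class BlankLineSketch {
    // Keep only lines that contain something other than whitespace.
    static List<String> removeBlankLines(List<String> lines) {
        return lines.stream()
                .filter(line -> !line.trim().isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("", "header", "   ", "row 1", "", "row 2");
        // prints [header, row 1, row 2]
        System.out.println(removeBlankLines(lines));
    }
}
```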

Sorting text

Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. We will continue to use our numsList and assume that no sorting has occurred yet.
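The difference between reversing and sorting is worth seeing directly. In this sketch the contents of numsList are hypothetical; Collections.reverse only flips the current order of the list, so on unsorted data it does not produce a descending sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ReverseSketch {
    // Returns a reversed copy so the caller's list is left unchanged.
    static List<Integer> reversedCopy(List<Integer> numsList) {
        List<Integer> copy = new ArrayList<>(numsList);
        Collections.reverse(copy);
        return copy;
    }

    // Returns a copy sorted in descending order, for comparison.
    static List<Integer> sortedDescending(List<Integer> numsList) {
        List<Integer> copy = new ArrayList<>(numsList);
        copy.sort(Collections.reverseOrder());
        return copy;
    }

    public static void main(String[] args) {
        List<Integer> numsList = Arrays.asList(3, 9, 1, 7);
        System.out.println(reversedCopy(numsList));     // prints [7, 1, 9, 3]
        System.out.println(sortedDescending(numsList)); // prints [9, 7, 3, 1]
    }
}
```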

Data validation

Validating data types

Validating dates

Our second method call did contain a valid date, but it was not in the format we specified. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
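The trade-off between strict and permissive validation can be illustrated with the java.time API. This sketch is an assumption about how such a check might be written, not the chapter's own method; the pattern MM/dd/uuuu and the sample dates are hypothetical. Strict resolution rejects both a valid date in the wrong format and an impossible date in the right format.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

class DateValidationSketch {
    // True only if the text matches the pattern and names a real calendar date.
    // Note: with STRICT resolution the year must be written as "uuuu", not "yyyy".
    static boolean isValidDate(String text, String pattern) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern)
                .withResolverStyle(ResolverStyle.STRICT);
        try {
            LocalDate.parse(text, formatter);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidDate("12/06/1990", "MM/dd/uuuu")); // prints true
        System.out.println(isValidDate("1990-12-06", "MM/dd/uuuu")); // prints false
        System.out.println(isValidDate("02/30/2016", "MM/dd/uuuu")); // prints false
    }
}
```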

Validating e-mail addresses

This example highlights why it is important to consider the limitations you place on data. When tested against the same data as in the previous example, our output is identical.

Validating ZIP codes

Validating names

Cleaning images

Changing the contrast of an image

Smoothing an image

Brightening an image

Resizing an image

Converting images to different formats

Data Visualization

Visualization is an important step in data analysis because it allows us to make sense of large data sets in a practical and meaningful way. This process allows us to quickly spot errors and fix them before they become a problem later.

Understanding plots and graphs

The following is a simple example of an index chart showing the distribution of the population for. A scatterplot is simply a collection of points, and analysis techniques such as correlation or regression can be used to identify trends within these types of graphs.

Visual analysis goals

Creating index charts

The last part of the simpleLineChart method creates an instance of the Scene class that represents the contents of the stage. JavaFX uses the concept of a stage and scene to handle the internal workings of the application's GUI.

Creating bar charts

The first uses the countries as the categories on the X-axis, with the changes by year shown within each category and the population along the Y-axis.

Using country as the category

The last of the two methods is the launch method, which is automatically called when the window is displayed.

Using decade as the category

Creating stacked graphs

Creating pie charts

The next sequence creates an ObservableList instance based on the European population data presented earlier.

Creating scatter charts

Creating histograms

We've chosen to set our range for the X-axis to match our expected age range, but to leave some room on the side of the graph. Because we know the amount of sample data, we set our Y-axis to go from 0 to 10.

Creating donut charts

Creating bubble charts

Statistical Data Analysis Techniques

Although there are a number of different data analysis techniques, this chapter will focus on the more common tasks. We will begin with an introduction to the basic techniques of statistical analysis, including calculating the mean, median, mode, and standard deviation of a data set.

Working with mean, mode, and median

Calculating the mean

Using simple Java techniques to find mean

Using Java 8 techniques to find mean

Another, more concise, technique using the OptionalDouble class involves lambda expressions and the ifPresent method. For our next two average examples, we'll use third-party libraries and continue to use the array of doubles, testData.
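A short sketch of the OptionalDouble approach is shown here. The values in testData are hypothetical. The average method of DoubleStream returns an OptionalDouble, which is empty when the array has no elements, and ifPresent runs the lambda only when a value exists.

```java
import java.util.Arrays;
import java.util.OptionalDouble;

class MeanSketch {
    // The result is empty for an empty array, so no division by zero can occur.
    static OptionalDouble mean(double[] data) {
        return Arrays.stream(data).average();
    }

    public static void main(String[] args) {
        double[] testData = {2.0, 4.0, 6.0}; // hypothetical sample values
        mean(testData).ifPresent(m -> System.out.println("The mean is " + m));
        // Nothing is printed for an empty array.
        mean(new double[0]).ifPresent(m -> System.out.println("The mean is " + m));
    }
}
```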

Using Google Guava to find mean

Using Apache Commons to find mean

We'll use this later to show the median and standard deviation, but first we'll start by calculating the mean.

Calculating the median

Using simple Java techniques to find median

To test our code for an odd number of elements, we'll add the double 12.5 to the end of the array.
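A plain-Java median calculation has to treat even and odd lengths differently, which is why testing both cases matters. The following is a minimal sketch with hypothetical data values; only the appended 12.5 comes from the text.

```java
import java.util.Arrays;

class MedianSketch {
    // Sort a copy, then take the middle element (odd length)
    // or the average of the two middle elements (even length).
    static double median(double[] data) {
        if (data.length == 0) {
            throw new IllegalArgumentException("median of an empty array is undefined");
        }
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return (sorted.length % 2 == 0)
                ? (sorted[mid - 1] + sorted[mid]) / 2.0
                : sorted[mid];
    }

    public static void main(String[] args) {
        System.out.println(median(new double[] {4.0, 1.0, 3.0, 2.0}));       // prints 2.5
        System.out.println(median(new double[] {4.0, 1.0, 3.0, 2.0, 12.5})); // prints 3.0
    }
}
```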

Using Apache Commons to find the median

Calculating the mode

We can address this through basic Java code and through third-party libraries, as we'll show in a moment.
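One basic-Java way to handle data sets with more than one mode is to count occurrences in a HashMap and return every value that reaches the highest count. This is a sketch under assumed names and sample data, not the book's listing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ModeSketch {
    // Count each value, track the highest count, and collect all values
    // that occur that many times. The result is sorted for stable output.
    static List<Double> modes(double[] data) {
        Map<Double, Integer> counts = new HashMap<>();
        int maxCount = 0;
        for (double value : data) {
            int count = counts.merge(value, 1, Integer::sum);
            maxCount = Math.max(maxCount, count);
        }
        List<Double> result = new ArrayList<>();
        for (Map.Entry<Double, Integer> entry : counts.entrySet()) {
            if (entry.getValue() == maxCount) {
                result.add(entry.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        // prints [2.0, 3.0]
        System.out.println(modes(new double[] {1.0, 2.0, 2.0, 3.0, 3.0}));
    }
}
```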

Using ArrayLists to find multiple modes

Using a HashMap to find multiple modes

Using Apache Commons to find multiple modes

Standard deviation

Sometimes it is better to calculate the standard deviation of a sample subset of the population to correct for potential bias. However, we can continue to use Apache Commons to calculate the standard deviation in either form.
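The only difference between the two forms is the divisor: n for a population and n - 1 for a sample. The sketch below computes both in plain Java instead of through Apache Commons; the class name, the boolean flag, and the data values are assumptions.

```java
class StdDevSketch {
    // sample == false: population standard deviation (divide by n).
    // sample == true:  sample standard deviation (divide by n - 1).
    static double standardDeviation(double[] data, boolean sample) {
        int n = data.length;
        if (n == 0 || (sample && n < 2)) {
            throw new IllegalArgumentException("not enough data points");
        }
        double sum = 0.0;
        for (double value : data) {
            sum += value;
        }
        double mean = sum / n;
        double squaredDiffs = 0.0;
        for (double value : data) {
            squaredDiffs += (value - mean) * (value - mean);
        }
        int divisor = sample ? n - 1 : n;
        return Math.sqrt(squaredDiffs / divisor);
    }

    public static void main(String[] args) {
        double[] data = {2, 4, 4, 4, 5, 5, 7, 9}; // hypothetical values
        System.out.println(standardDeviation(data, false)); // population form
        System.out.println(standardDeviation(data, true));  // sample form, slightly larger
    }
}
```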

Sample size determination

Hypothesis testing

Regression analysis

Using simple linear regression

The SimpleRegression class has a predict method that takes a value, a year in this case, and returns the estimated population. The line matches the current population values fairly well and shows the projected future population.
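To show what a prediction of this kind involves, the following sketch fits a least-squares line by hand in plain Java rather than calling the Apache Commons class. It is a stand-in for illustration only; the class name and the tiny data set are assumptions, not the book's population figures.

```java
class LinearRegressionSketch {
    private final double slope;
    private final double intercept;

    // Ordinary least squares for one predictor:
    // slope = sum((x - meanX)(y - meanY)) / sum((x - meanX)^2)
    // intercept = meanY - slope * meanX
    LinearRegressionSketch(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two or more paired points");
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;
        double sxy = 0.0;
        double sxx = 0.0;
        for (int i = 0; i < x.length; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        slope = sxy / sxx;
        intercept = meanY - slope * meanX;
    }

    // Estimated y for the given x, read off the fitted line.
    double predict(double xValue) {
        return intercept + slope * xValue;
    }

    public static void main(String[] args) {
        // Hypothetical data lying exactly on the line y = 2x.
        LinearRegressionSketch model =
                new LinearRegressionSketch(new double[] {1, 2, 3}, new double[] {2, 4, 6});
        System.out.println(model.predict(4));
    }
}
```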

Using multiple regression

We will use a JavaFX program to create the scatterplot and perform the analysis. In this example, we focus on the multiple regression code and do not include the JavaFX code used to create the scatter plot.
