Handling YouTube - Java for Data Science

In the following sequence, we obtain an instance of PhotosInterface using its constructor to illustrate an alternate approach:

PhotosInterface pi = new PhotosInterface(apikey, secret, new REST());

We get the first Photo instance from the previous list and then call its getUrl method to obtain the image's URL.

The PhotosInterface class's getImage method returns a BufferedImage object representing the image, as shown here:

Photo currentPhoto = list.get(0);

BufferedImage bufferedImage = pi.getImage(currentPhoto.getUrl());

The image is then saved to a file using the ImageIO class:

File outputfile = new File("image.jpg");

ImageIO.write(bufferedImage, "jpg", outputfile);
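The saving step can be tried without a Flickr account. The following is a minimal sketch that substitutes a generated in-memory image for the one returned by getImage. The class name ImageSaveSketch, the helper methods saveAsJpeg and sampleImage, and the temporary file are assumptions made for illustration; only the ImageIO.write and ImageIO.read calls correspond to the technique shown above.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

final class ImageSaveSketch {

    // Writes the image as a JPEG and reports whether a suitable writer was found.
    static boolean saveAsJpeg(BufferedImage image, File target) throws IOException {
        // ImageIO.write returns false when no registered writer supports the format.
        return ImageIO.write(image, "jpg", target);
    }

    // Hypothetical stand-in for the image that pi.getImage(...) would return.
    static BufferedImage sampleImage(int width, int height) {
        BufferedImage image =
            new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                image.setRGB(x, y, 0x3366CC); // fill with a single color
            }
        }
        return image;
    }

    public static void main(String[] args) throws IOException {
        File outputfile = File.createTempFile("image", ".jpg");
        boolean written = saveAsJpeg(sampleImage(64, 48), outputfile);
        BufferedImage reloaded = ImageIO.read(outputfile);
        System.out.println("Written: " + written + ", size: "
            + reloaded.getWidth() + "x" + reloaded.getHeight());
    }
}
```

Checking the boolean returned by ImageIO.write is worthwhile because an unsupported format name does not raise an exception.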

The getImage method is overloaded. Here, the Photo instance and the size of the image desired are used as arguments to get the BufferedImage instance:

bufferedImage = pi.getImage(currentPhoto, Size.SMALL);

The image can be saved to a file using the previous technique.

The Flickr4Java API supports a number of other techniques for working with Flickr images.

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search itself.

We start with a try block and the creation of a YouTube instance. This class provides the basic access to the API. The Javadocs for this API are found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

HttpTransport: The object used for HTTP communication

JsonFactory: Used to process JSON objects

HttpRequestInitializer: Initializes each request; no initialization is needed for this example

Many of the API's responses will be in the form of JSON objects. The YouTube.Builder class's setApplicationName method gives the application a name, and the build method creates a new YouTube instance:

try {
    YouTube youtube = new YouTube.Builder(
        Auth.HTTP_TRANSPORT, Auth.JSON_FACTORY,
        new HttpRequestInitializer() {
            public void initialize(HttpRequest request) throws IOException {
                // No initialization is needed for this example
            }
        })
        .setApplicationName("application_name") ...
} catch (GoogleJsonResponseException ex) {
    // Handle exceptions
} catch (IOException ex) {
    // Handle exceptions
}

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

String queryTerm = "cats";

The YouTube.Search.List class represents a search request. The YouTube class's search method provides access to the search operations, and its list method takes a string specifying the parts of the resource to be returned. In this case, the string specifies that the id and snippet portions of the search result are to be returned:

YouTube.Search.List search = youtube.search().list("id,snippet");

The search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/playlistItems#methods. In the previous sequence, only the id and snippet parts of a search result are requested, resulting in a more efficient operation:

{
    "kind": "youtube#searchResult",
    "etag": etag,
    "id": {
        "kind": string,
        "videoId": string,
        "channelId": string,
        "playlistId": string
    },
    "snippet": {
        "publishedAt": datetime,
        "channelId": string,
        "title": string,
        "description": string,
        "thumbnails": {
            (key): {
                "url": string,
                "width": unsigned integer,
                "height": unsigned integer
            }
        },
        "channelTitle": string,
        "liveBroadcastContent": string
    }
}

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of resource to be returned. In this case, only videos will be returned. The other two options are channel and playlist:

String apiKey = "Your API key";

search.setKey(apiKey);

search.setQ(queryTerm);

search.setType("video");

In addition, we further specify the fields to be returned as shown here. These correspond to fields of the JSON object:

search.setFields("items(id/kind,id/videoId,snippet/title," + "snippet/description,snippet/thumbnails/default/url)");

We also specify the maximum number of results to retrieve using the setMaxResults method:

search.setMaxResults(10L);
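It can help to see how these settings relate to the underlying REST request. The following is a rough sketch, assuming the standard YouTube Data API v3 search endpoint and its documented query parameter names (part, q, type, fields, maxResults, and key). The class SearchUrlSketch and the method buildSearchUrl are hypothetical, and this is not how the client library assembles requests internally; it only illustrates the mapping between the setters and the query string.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class SearchUrlSketch {

    // Assumed REST endpoint for the search operation.
    static final String ENDPOINT = "https://www.googleapis.com/youtube/v3/search";

    static String buildSearchUrl(String apiKey, String queryTerm, long maxResults) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("part", "id,snippet");            // list("id,snippet")
        params.put("q", queryTerm);                  // setQ
        params.put("type", "video");                 // setType
        params.put("fields", "items(id/kind,id/videoId,snippet/title,"
            + "snippet/description,snippet/thumbnails/default/url)"); // setFields
        params.put("maxResults", Long.toString(maxResults));          // setMaxResults
        params.put("key", apiKey);                   // setKey

        // URL-encode each value and join the pairs with '&'.
        return ENDPOINT + "?" + params.entrySet().stream()
            .map(e -> e.getKey() + "="
                + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // "YOUR_API_KEY" is a placeholder, not a working key.
        System.out.println(buildSearchUrl("YOUR_API_KEY", "cats", 10L));
    }
}
```

The URLEncoder.encode overload that accepts a Charset requires Java 10 or later.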

The execute method performs the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

SearchListResponse searchResponse = search.execute();

List<SearchResult> searchResultList = searchResponse.getItems();

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about the video. The SearchResult video variable allows us to access different parts of the JSON object, as shown here:

SearchResult video = searchResultList.iterator().next();

Thumbnail thumbnail = video.getSnippet().getThumbnails().getDefault();

out.println("Kind: " + video.getKind());

out.println("Video Id: " + video.getId().getVideoId());

out.println("Title: " + video.getSnippet().getTitle());

out.println("Description: " + video.getSnippet().getDescription());

out.println("Thumbnail: " + thumbnail.getUrl());

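One possible output follows, with parts of the output abbreviated. The Kind value is null because the fields setting requested id/kind rather than the top-level kind of each result: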

Kind: null

Video Id: tntO...

Title: Funny Cats ...

Description: Check out the ...

Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error-checking steps to simplify the example, but these should be considered when implementing this in a business application.
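One of the skipped checks concerns the call to iterator().next(), which throws a NoSuchElementException when the search returns no items. The following minimal sketch shows one way to guard against this. It uses strings in place of SearchResult objects so that it runs without the client library; the class FirstResultSketch and the method firstResult are hypothetical, and the null-list check is included only as a precaution.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Objects;
import java.util.Optional;

final class FirstResultSketch {

    // Returns the first non-null element, or an empty Optional
    // when the list is null or contains no usable element.
    static <T> Optional<T> firstResult(List<T> results) {
        if (results == null) {
            return Optional.empty();
        }
        return results.stream().filter(Objects::nonNull).findFirst();
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("Funny Cats", "More Cats");
        List<String> none = Collections.emptyList();

        System.out.println(firstResult(titles).orElse("no results")); // Funny Cats
        System.out.println(firstResult(none).orElse("no results"));   // no results
    }
}
```

With the real API, the same helper could be applied to the list returned by getItems before any of the getter chains are called.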

If we need to download the video, one of the simplest ways is to use axet/wget found at https://github.com/axet/wget. It provides an easy-to-use technique to download the video using its video ID.

In the following example, a URL is created using the video ID. You will need to provide a video ID for this to work properly. The file is saved to the current directory with the video's title as the filename:

String url = "http://www.youtube.com/watch?v=videoID";

String path = ".";

VGet vget = new VGet(new URL(url), new File(path));

vget.download();
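The URL string above contains the literal text videoID as a placeholder. As a small sketch of how the URL might be built from an ID obtained elsewhere, such as from the search results, consider the following. The class WatchUrlSketch and the method watchUrl are hypothetical helpers and are not part of the download library; the ID passed in main is a placeholder, not a real video.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class WatchUrlSketch {

    // Builds the watch-page URL for a video ID, rejecting empty input.
    static URL watchUrl(String videoId) throws MalformedURLException {
        if (videoId == null || videoId.trim().isEmpty()) {
            throw new IllegalArgumentException("videoId must not be empty");
        }
        // Encode the ID so that unexpected characters cannot alter the query string.
        String encoded = URLEncoder.encode(videoId.trim(), StandardCharsets.UTF_8);
        return URI.create("http://www.youtube.com/watch?v=" + encoded).toURL();
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(watchUrl("VIDEO_ID"));
    }
}
```

The resulting URL object can then be passed to the VGet constructor in place of new URL(url).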

There are other more sophisticated download techniques found at the GitHub site.

Summary

In this chapter, we discussed types of data that are useful for data science and readily accessible on the Internet. This discussion included details about file specifications and formats for the most common types of data sources.

We also examined Java APIs and other techniques for retrieving data, and illustrated this process with multiple sources. In particular, we focused on types of text-based document formats and multimedia files. We used web crawlers to access websites and then performed web scraping to retrieve data from the sites we encountered.

Finally, we extracted data from social media sites and examined the available Java support. We retrieved data from Twitter, Wikipedia, Flickr, and YouTube and examined the available API support.

Chapter 3. Data Cleaning

Real-world data is frequently dirty and unstructured, and must be reworked before it is usable.

Data may contain errors, have duplicate entries, exist in the wrong format, or be inconsistent. The process of addressing these types of issues is called data cleaning. Data cleaning is also referred to as data wrangling, massaging, reshaping, or munging. Data merging, where data from multiple sources is combined, is often considered to be a data cleaning activity.

We need to clean data because any analysis based on inaccurate data can produce misleading results. We want to ensure that the data we work with is quality data. Data quality involves:

Validity: Ensuring that the data possesses the correct form or structure

Accuracy: The values within the data are truly representative of the dataset

Completeness: There are no missing elements

Consistency: Changes to data are in sync

Uniformity: The same units of measurement are used
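Some of these qualities can be checked mechanically. The following is a minimal sketch of checks for validity, completeness, and uniformity. The date pattern, the unit names, and the class QualityChecksSketch with its methods are assumptions chosen for illustration; real datasets will need their own rules.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class QualityChecksSketch {

    // Validity: the value has the expected form (here, digits as yyyy-MM-dd).
    private static final Pattern DATE = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

    static boolean isValidDate(String value) {
        return value != null && DATE.matcher(value).matches();
    }

    // Completeness: no field is missing (null or blank).
    static boolean isComplete(List<String> fields) {
        return fields.stream().allMatch(f -> f != null && !f.trim().isEmpty());
    }

    // Uniformity: express every length in one unit (centimeters).
    static double toCentimeters(double value, String unit) {
        switch (unit) {
            case "cm": return value;
            case "m":  return value * 100.0;
            case "in": return value * 2.54;
            default:   throw new IllegalArgumentException("Unknown unit: " + unit);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidDate("2016-11-03"));              // true
        System.out.println(isValidDate("11/03/2016"));              // false
        System.out.println(isComplete(Arrays.asList("Ann", "")));   // false
        System.out.println(toCentimeters(2.0, "m"));                // 200.0
    }
}
```

Note that the pattern checks only the form of a date, not whether the date exists; accuracy and consistency generally require comparison against other data.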

There are several techniques and tools used to clean data. We will examine the following approaches:

Handling different types of data

Cleaning and manipulating text data

Filling in missing data

Validating data

In addition, we will briefly examine several image enhancement techniques.

There are often many ways to accomplish the same cleaning task. For example, there are a number of GUI tools that support data cleaning, such as OpenRefine (http://openrefine.org/). This tool allows a user to read in a dataset and clean it using a variety of techniques. However, it requires a user to interact with the application for each dataset that needs to be cleaned. It is not conducive to automation.

We will focus on how to clean data using Java code. Even then, there may be different techniques to clean the data. We will show multiple approaches to provide the reader with insights on how it can be done. Sometimes, this will use core Java string classes, and at other times, it may use specialized libraries.

These libraries are often more expressive and efficient. However, there are times when using a simple string function is more than adequate to address the problem. Showing complementary techniques will improve the reader's skill set.
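As an illustration of the simple end of that range, the following sketch normalizes a string using only core Java String methods. The class StringCleanSketch, the method normalize, and the particular rules (lowercasing, removing punctuation, collapsing whitespace) are assumptions chosen for the example.

```java
import java.util.Locale;

final class StringCleanSketch {

    // Trims, lowercases, removes punctuation, and collapses runs of whitespace.
    static String normalize(String raw) {
        return raw.trim()
            .toLowerCase(Locale.ROOT)
            .replaceAll("[^a-z0-9\\s]", "")  // keep only letters, digits, whitespace
            .replaceAll("\\s+", " ");        // collapse repeated whitespace
    }

    public static void main(String[] args) {
        System.out.println(normalize("  Hello,   WORLD!!  ")); // hello world
    }
}
```

Because the character class keeps only ASCII letters and digits, accented and non-Latin characters are removed as well; text in other languages would need a different pattern.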

The basic text-based tasks include:

Data transformation

Data imputation (handling missing data)

Subsetting data

Sorting data

Validating data
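Several of these tasks can be sketched with the Java streams API. The following example shows transformation, subsetting, and sorting applied to a list of names, and mean imputation applied to a list of numbers with missing values. The class TextTasksSketch, its methods, and the choice of the mean as the replacement value are assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

final class TextTasksSketch {

    // Imputation: replace missing (null) values with the mean of the present ones.
    static List<Double> imputeMean(List<Double> values) {
        double mean = values.stream()
            .filter(Objects::nonNull)
            .mapToDouble(Double::doubleValue)
            .average()
            .orElse(0.0); // used only when every value is missing
        return values.stream()
            .map(v -> v == null ? mean : v)
            .collect(Collectors.toList());
    }

    static List<String> cleanNames(List<String> names) {
        return names.stream()
            .map(String::trim)                      // transformation
            .filter(n -> !n.isEmpty())              // subsetting
            .sorted(String.CASE_INSENSITIVE_ORDER)  // sorting
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(imputeMean(Arrays.asList(1.0, null, 3.0)));      // [1.0, 2.0, 3.0]
        System.out.println(cleanNames(Arrays.asList(" bob", "", "Alice "))); // [Alice, bob]
    }
}
```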

In this chapter, we are interested in cleaning data. However, part of this process is extracting information from various data sources. The data may be stored in plaintext or in binary form. We need to understand the various formats used to store data before we can begin the cleaning process. Many of these formats were introduced in Chapter 2, Data Acquisition, but we will go into greater detail in the following sections.
