
Performing schema transformations


For the JacksonLineRecordReader example, you need to provide the directory location of irisdata.txt, which is located in this chapter's GitHub repository. In the irisdata.txt file, each line represents a JSON object.

There's more...

JacksonRecordReader is a record reader that uses the Jackson API. Just like JacksonLineRecordReader, it also supports the JSON, XML, and YAML formats. For JacksonRecordReader, the user needs to provide a list of fields to read from the JSON/XML/YAML file. This may look complicated, but it allows us to parse the files under the following conditions:

There is no consistent schema for the JSON/XML/YAML data. The order of the output fields can be specified using the FieldSelection object.

There are fields that are missing in some files; default values for these can be supplied using the FieldSelection object.

JacksonRecordReader can also be used with PathLabelGenerator to append the label based on the file path.
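To make this concrete, here is a minimal sketch of how a FieldSelection and a JacksonLineRecordReader might be wired together for the irisdata.txt example. The field names and the missing-value placeholder are assumptions made for illustration, not taken from the dataset:

import org.datavec.api.records.reader.impl.jackson.FieldSelection;
import org.datavec.api.records.reader.impl.jackson.JacksonLineRecordReader;
import org.datavec.api.split.FileSplit;
import org.datavec.api.writable.Text;
import org.nd4j.shade.jackson.core.JsonFactory;
import org.nd4j.shade.jackson.databind.ObjectMapper;
import java.io.File;

// Select the fields to read from each JSON object, in the order they should appear
// in the output. A default value can be supplied for fields that may be missing.
FieldSelection fieldSelection = new FieldSelection.Builder()
        .addField("sepalLength")
        .addField("sepalWidth")
        .addField(new Text("-1"), "petalLength")   // use -1 if the field is absent
        .build();

JacksonLineRecordReader recordReader =
        new JacksonLineRecordReader(fieldSelection, new ObjectMapper(new JsonFactory()));
recordReader.initialize(new FileSplit(new File("irisdata.txt")));

JacksonRecordReader is configured with a FieldSelection in the same way.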

1. Generate a PCA factor matrix from the input features and reduce the dimensions of the data:

// PCA refers to org.nd4j.linalg.dimensionalityreduction.PCA
INDArray factor = PCA.pca_factor(inputFeatures, projectedDimension, normalize);
INDArray reduced = inputFeatures.mmul(factor);

2. Use a schema to define the structure of the data: The following is an example of a basic schema for a customer churn dataset. You can download the dataset from https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling/downloads/bank-customer-churn-modeling.zip/1:

Schema schema = new Schema.Builder()
        .addColumnString("RowNumber")
        .addColumnInteger("CustomerId")
        .addColumnString("Surname")
        .addColumnInteger("CreditScore")
        .addColumnCategorical("Geography", Arrays.asList("France", "Germany", "Spain"))
        .addColumnCategorical("Gender", Arrays.asList("Male", "Female"))
        .addColumnsInteger("Age", "Tenure")
        .addColumnDouble("Balance")
        .addColumnsInteger("NumOfProducts", "HasCrCard", "IsActiveMember")
        .addColumnDouble("EstimatedSalary")
        .build();

How it works...

Before we start creating the schema, we need to examine all the features in our dataset. Then, we need to remove noisy features, such as name, which we can reasonably assume have no effect on the outcome. If some features are unclear to you, keep them as they are and include them in the schema. If you unknowingly remove a feature that is actually a signal, you'll degrade the performance of the neural network. This process of removing outliers and keeping signals (valid features) is what step 1 refers to. Principal Component Analysis (PCA) is an ideal choice for this, and it is implemented in ND4J. The PCA class can perform dimensionality reduction on a dataset with a large number of features, where you want to reduce the feature count to reduce the complexity. Reducing the features simply means removing irrelevant ones (outliers/noise).

In step 1, we generated a PCA factor matrix by calling pca_factor() with the following arguments:

inputFeatures: Input features as a matrix

projectedDimension: The number of features to project from the actual set of features (for example, 100 important features out of 1,000)

normalize: A Boolean variable (true/false) indicating whether the features are to be normalized (zero mean)

Matrix multiplication is performed by calling the mmul() method; the end result, reduced, is the feature matrix that we use after performing dimensionality reduction based on the PCA factor. Note that you may need to perform multiple training sessions using the input features (generated using the PCA factor) to understand the signals.
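To make the preceding explanation concrete, the following is a minimal sketch of the same two calls with made-up dimensions, reducing 1,000 raw features to 100. The input matrix is randomly generated purely for illustration:

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dimensionalityreduction.PCA;
import org.nd4j.linalg.factory.Nd4j;

// 500 examples with 1,000 raw features (random data, for illustration only)
INDArray inputFeatures = Nd4j.rand(500, 1000);

// Project down to 100 dimensions; true = normalize the features (zero mean) first
INDArray factor = PCA.pca_factor(inputFeatures, 100, true);

// reduced is a 500 x 100 matrix: the features used for training after dimensionality reduction
INDArray reduced = inputFeatures.mmul(factor);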

In step 2, we used the customer churn dataset (the same dataset that we use in the next chapter) to demonstrate the Schema creation process. The data types mentioned in the schema are those of the respective features or labels. For example, if you want to add a schema definition for an integer feature, you would use addColumnInteger(). Similarly, there are other Schema methods available to handle the other data types.

Categorical variables can be added using addColumnCategorical(), as shown in step 2, where we marked the categorical variables and supplied their possible values.

Even if we are given a masked set of features, we can still construct a schema for them as long as the features are arranged in a numbered format (for example, column1, column2, and so on).
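For example, here is a minimal sketch of such a schema, assuming ten masked double-valued features named column1 through column10 (both the names and the type are assumptions):

import org.datavec.api.transform.schema.Schema;

// The %d in the pattern is expanded for every index in the inclusive range,
// producing column1, column2, ..., column10
Schema maskedSchema = new Schema.Builder()
        .addColumnsDouble("column%d", 1, 10)
        .build();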

There's more...

In a nutshell, here is what you need to do to build the schema for your datasets:

Understand your data well. Identify the noise and signals.

Capture features and labels. Identify categorical variables.

Identify categorical features that one-hot encoding can be applied to.

Pay attention to missing or bad data.

Add features using type-specific methods such as addColumnInteger() and addColumnsInteger(), where the feature type is an integer. Apply the respective Builder methods to the other data types.

Add categorical variables using addColumnCategorical(). Call the build() method to build the schema.

Note that you cannot skip/ignore any features in the dataset without specifying them in the schema. You either need to remove the outlying features from the dataset, create a schema from the remaining features, and then move on to the transformation process for further processing, or keep all the features in the schema and remove the outliers during the transformation process.
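As an illustration of the second approach, here is a minimal sketch that keeps every column in the schema and then drops the noisy columns (and one-hot encodes the categorical variables) during the transformation process. The column names are taken from the churn schema above; which columns you actually remove or encode depends on your analysis:

import org.datavec.api.transform.TransformProcess;

TransformProcess transformProcess = new TransformProcess.Builder(schema)
        .removeColumns("RowNumber", "CustomerId", "Surname")   // drop noisy identifier columns
        .categoricalToOneHot("Geography", "Gender")            // one-hot encode categorical variables
        .build();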

When it comes to feature engineering/data analysis, DataVec comes with its own analytic engine to perform data analysis on feature/target variables. For local executions, we can use AnalyzeLocal to return a data analysis object that holds information about each column in the dataset. Here is how you can create a data analysis object from a record reader object:

DataAnalysis analysis = AnalyzeLocal.analyze(mySchema, csvRecordReader);

System.out.println(analysis);
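For reference, mySchema and csvRecordReader in the preceding snippet could be prepared along the following lines. This is only a sketch; the file name and the header handling are assumptions:

import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.datavec.api.transform.schema.Schema;
import java.io.File;

// Reuse the customer churn schema built in step 2
Schema mySchema = schema;

// Skip the header row and read comma-separated values (file name is an assumption)
CSVRecordReader csvRecordReader = new CSVRecordReader(1, ',');
csvRecordReader.initialize(new FileSplit(new File("Churn_Modelling.csv")));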

You can also analyze your dataset for missing values and check whether it is schema-compliant by calling analyzeQuality():

DataQualityAnalysis quality = AnalyzeLocal.analyzeQuality(mySchema, csvRecordReader);

System.out.println(quality);

For sequence data, you need to use analyzeQualitySequence() instead of analyzeQuality(). For data analysis on Spark, you can make use of the AnalyzeSpark utility class in place of AnalyzeLocal.
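For instance, a minimal sketch of the sequence variant might look like the following. CSVSequenceRecordReader, the file name, and mySequenceSchema are assumptions made for illustration:

import org.datavec.api.records.reader.SequenceRecordReader;
import org.datavec.api.records.reader.impl.csv.CSVSequenceRecordReader;
import org.datavec.api.split.FileSplit;
import org.datavec.api.transform.quality.DataQualityAnalysis;
import org.datavec.local.transforms.AnalyzeLocal;
import java.io.File;

SequenceRecordReader sequenceReader = new CSVSequenceRecordReader(0, ",");
sequenceReader.initialize(new FileSplit(new File("sequence_data.csv")));  // hypothetical file

// mySequenceSchema describes the columns of each time step in the sequence data
DataQualityAnalysis sequenceQuality = AnalyzeLocal.analyzeQualitySequence(mySequenceSchema, sequenceReader);
System.out.println(sequenceQuality);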
