When it comes to feature engineering and data analysis, DataVec comes with its own analytic engine to perform data analysis on feature/target variables. For local executions, we can use AnalyzeLocal to return a data analysis object that holds information about each column in the dataset. Here is how you can create a data analysis object from a record reader object:
DataAnalysis analysis = AnalyzeLocal.analyze(mySchema, csvRecordReader);
System.out.println(analysis);
You can also analyze your dataset for missing values and check whether it is schema-compliant by calling analyzeQuality():
DataQualityAnalysis quality = AnalyzeLocal.analyzeQuality(mySchema, csvRecordReader);
System.out.println(quality);
For sequence data, you need to use analyzeQualitySequence() instead of analyzeQuality(). For data analysis on Spark, you can make use of the AnalyzeSpark utility class in place of AnalyzeLocal.
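As a rough sketch (not taken from the recipe), Spark-based analysis could look as follows. It assumes the datavec-spark module is on the classpath, that mySchema is the same Schema object used earlier, and that the application name, the local master URL, and the file path are placeholders; imports and exception handling are omitted, as in the other snippets:
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("DataVecAnalysis").setMaster("local[*]"));

// Parse the raw CSV lines into DataVec writables
JavaRDD<List<Writable>> parsedData = sc.textFile("path/to/dataset.csv")
        .map(new StringToWritablesFunction(new CSVRecordReader()));

// Column statistics, the Spark equivalent of AnalyzeLocal.analyze()
DataAnalysis sparkAnalysis = AnalyzeSpark.analyze(mySchema, parsedData);
System.out.println(sparkAnalysis);

// Missing-value/schema-compliance checks, the Spark equivalent of AnalyzeLocal.analyzeQuality()
DataQualityAnalysis sparkQuality = AnalyzeSpark.analyzeQuality(mySchema, parsedData);
System.out.println(sparkQuality);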
.removeColumns("Geography[France]") .build();
2. Create a record reader using TransformProcessRecordReader to extract and transform the data:
TransformProcessRecordReader transformProcessRecordReader = new TransformProcessRecordReader(recordReader, transformProcess);
How it works...
In step 1, we added all the transformations that are needed for the dataset. TransformProcess defines an ordered list of all the transformations that we want to apply to the dataset. We removed the unnecessary features by calling removeColumns(). During schema creation, we marked the categorical features in the Schema. Now we can decide what kind of transformation is required for a particular categorical variable: categorical variables can be converted into integers by calling categoricalToInteger(), or one-hot encoded by calling categoricalToOneHot(). Note that the schema needs to be created prior to the transformation process, because we need the schema to create a TransformProcess.
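To make this concrete, here is a minimal sketch of what step 1 might look like for a customer churn style CSV. The column names and categorical states are illustrative assumptions rather than part of the recipe, so adapt them to your own data; imports are omitted, as in the other snippets:
// Schema creation: categorical features are declared here, before any transformation
Schema schema = new Schema.Builder()
        .addColumnsInteger("RowNumber", "CustomerId")   // identifiers we will drop later
        .addColumnString("Surname")
        .addColumnCategorical("Geography", "France", "Germany", "Spain")
        .addColumnCategorical("Gender", "Male", "Female")
        .addColumnsInteger("Age", "Tenure")
        .addColumnDouble("Balance")
        .addColumnInteger("Exited")                     // target variable
        .build();

// The TransformProcess applies these transformations in order
TransformProcess transformProcess = new TransformProcess.Builder(schema)
        .removeColumns("RowNumber", "CustomerId", "Surname")  // drop unnecessary features
        .categoricalToInteger("Gender")                       // integer-encode a categorical column
        .categoricalToOneHot("Geography")                     // one-hot encode a categorical column
        .removeColumns("Geography[France]")                   // drop one dummy column to avoid redundancy
        .build();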
In step 2, we apply the transformations that were added before with the help of TransformProcessRecordReader. All we need to do is create a basic record reader object over the raw data and pass it to TransformProcessRecordReader, along with the defined transformation process.
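The end-to-end wiring of step 2 might then look like the following sketch, continuing from the transformProcess defined above; the file path and the single header line to skip are assumptions about the raw CSV, and exception handling is omitted:
// Basic record reader over the raw data (skips one header line; the default delimiter is a comma)
RecordReader recordReader = new CSVRecordReader(1);
recordReader.initialize(new FileSplit(new File("path/to/dataset.csv")));

// Wrap the raw reader so that each record is transformed as it is read
TransformProcessRecordReader transformProcessRecordReader =
        new TransformProcessRecordReader(recordReader, transformProcess);

// Records returned here have already passed through the transform process
while (transformProcessRecordReader.hasNext()) {
    System.out.println(transformProcessRecordReader.next());
}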
There's more...
DataVec allows us to do much more within the transformation stage. Here are some of the other important transformation features that are available within TransformProcess (a consolidated example follows this list):
addConstantColumn(): Adds a new column to the dataset, where every value in the column is identical to the specified value. This method accepts three attributes: the new column name, the new column type, and the value.
appendStringColumnTransform(): Appends a string to the specified column. This method accepts two attributes: the column to append to and the string value to append.
conditionalCopyValueTransform(): Replaces the values in a column with the values from another column if a condition is satisfied. This method accepts three attributes: the column whose values are to be replaced, the column whose values are to be copied, and the condition to be used.
conditionalReplaceValueTransform(): Replaces the value in a column with the specified value if a condition is satisfied. This method accepts three attributes: the column whose values are to be replaced, the value to be used as a replacement, and the condition to be used.
conditionalReplaceValueTransformWithDefault(): Replaces the value in a column with the specified value if a condition is satisfied. Otherwise, it fills the column with another value. This method accepts four attributes: the column to replace the values, the value to be used if the condition is satisfied, the value to be used if the condition is not satisfied, and the condition to be used.
We can use the built-in conditions that ship with DataVec in the transformation or data cleaning process. For example, we can use NaNColumnCondition to target NaN values and NullWritableColumnCondition to target null values.
stringToTimeTransform(): Converts a string column into a time column. This targets date columns that are saved as a string/object in the dataset. This method accepts three attributes: the name of the column to be used, the time format to be followed, and the time zone.
reorderColumns(): Reorders the columns using the newly defined order. We can provide the column names in the specified order as attributes to this method.
filter(): Defines a filter process based on the specified condition. If the condition is satisfied, the example or sequence is removed; otherwise, it is kept. This method accepts only a single attribute, which is the condition/filter to be applied. The filter() method is very useful for the data cleaning process. If we want to remove examples that have NaN values in a specified column, we can create a filter as follows:
Filter filter = new ConditionFilter(new NaNColumnCondition("columnName"));
If we want to remove examples that have null values in a specified column, we can create a filter as follows:
Filter filter = new ConditionFilter(new NullWritableColumnCondition("columnName"));
stringRemoveWhitespaceTransform(): Removes whitespace characters from the values of a column. This method accepts only a single attribute, which is the column from which whitespace is to be trimmed.
integerMathOp(): This method is used to perform a mathematical operation on an integer column with a scalar value. Similar methods are available for types such as double and long. This method accepts three attributes: the integer column to apply the mathematical operation on, the mathematical operation itself, and the scalar value to be used for the mathematical operation.
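The following sketch pulls several of these methods together in a single builder chain so that the argument types are clear. The column names, values, date format, and time zone are illustrative assumptions, and the sketch presumes a schema that already contains the referenced columns:
TransformProcess exampleProcess = new TransformProcess.Builder(schema)
        // Add a constant integer column named "source" with the value 1
        .addConstantColumn("source", ColumnType.Integer, new IntWritable(1))
        // Append a suffix to every value in the "Surname" column
        .appendStringColumnTransform("Surname", "_customer")
        // Replace NaN values in the "Balance" column with 0.0
        .conditionalReplaceValueTransform("Balance", new DoubleWritable(0.0),
                new NaNColumnCondition("Balance"))
        // Convert the string column "JoinedDate" into a time column
        .stringToTimeTransform("JoinedDate", "yyyy-MM-dd", DateTimeZone.UTC)
        // Remove any example whose "Age" column holds a null value
        .filter(new ConditionFilter(new NullWritableColumnCondition("Age")))
        // Trim whitespace from the "Surname" column
        .stringRemoveWhitespaceTransform("Surname")
        // Multiply every value in the integer column "Tenure" by 12
        .integerMathOp("Tenure", MathOp.Multiply, 12)
        // Move the listed columns to the front; unlisted columns keep their relative order at the end
        .reorderColumns("CustomerId", "Surname", "Geography", "Age", "Tenure", "Balance")
        .build();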
TransformProcess is not just meant for data transformation; it can also help to mitigate memory bottlenecks, because unnecessary columns can be dropped early and the transformations are applied record by record as the data is read, rather than a fully transformed dataset being held in memory.
Refer to the DL4J API documentation to find more powerful DataVec features for your data analysis tasks. There are other interesting operations supported in TransformProcess, such as reduce() and convertToString(). If you're a data analyst, then you should know that many data normalization strategies can be applied during this stage. You can refer to the DL4J documentation for more information on the available normalization strategies at https://deeplearning4j.org/docs/latest/datavec-normalization.