    .addSourceFolder(unClassifiedResource.getFile())
    .build();
6. Store the weight lookup table:
InMemoryLookupTable<VocabWord> lookupTable =
      (InMemoryLookupTable<VocabWord>) paragraphVectors.getLookupTable();
7. Predict labels for every unclassified document, as shown in the following pseudocode:
while (unClassifiedIterator.hasNextDocument()) {
    // Calculate the document vector of each document.
    // Calculate the cosine similarity of the document vector with all the given labels.
    // Display the results.
}
8. Create the tokens from the document and use the iterator to retrieve the document instance:
LabelledDocument labelledDocument = unClassifiedIterator.nextDocument();
List<String> documentAsTokens =
      tokenizerFactory.create(labelledDocument.getContent()).getTokens();
9. Use the lookup table to get the vocabulary information (VocabCache):
VocabCache vocabCache = lookupTable.getVocab();
10. Count all the instances where the words are matched in VocabCache:
AtomicInteger cnt = new AtomicInteger(0);
for (String word : documentAsTokens) {
    if (vocabCache.containsWord(word)) {
        cnt.incrementAndGet();
    }
}
INDArray allWords = Nd4j.create(cnt.get(), lookupTable.layerSize());
11. Store word vectors of the matching words in the vocab:
cnt.set(0);
for (String word : documentAsTokens) {
    if (vocabCache.containsWord(word)) {
        allWords.putRow(cnt.getAndIncrement(), lookupTable.vector(word));
    }
}
12. Calculate the document vector by taking the mean of the word embeddings:
INDArray documentVector = allWords.mean(0);
13. Check the cosine similarity of the document vector with the labeled word vectors:
List<String> labels = labelAwareIterator.getLabelsSource().getLabels();
List<Pair<String, Double>> result = new ArrayList<>();
for (String label : labels) {
    INDArray vecLabel = lookupTable.vector(label);
    if (vecLabel == null) {
        throw new IllegalStateException("Label '" + label + "' has no known vector!");
    }
    double sim = Transforms.cosineSim(documentVector, vecLabel);
    result.add(new Pair<String, Double>(label, sim));
}
14. Display the results:
for (Pair<String, Double> score : result) {
    log.info(" " + score.getFirst() + ": " + score.getSecond());
}
How it works...
In step 1, we created a dataset iterator using FileLabelAwareIterator.
The FileLabelAwareIterator is a simple filesystem-based implementation of the LabelAwareIterator interface. It assumes that you have one or more folders organized in the following way:
First-level subfolder: the label name
Second-level subfolder: the documents for that label
Look at the following screenshot for an example of this data structure:
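As a rough sketch of how such an iterator can be built over that layout (the folder path below is a hypothetical placeholder, not the exact path used in the project):
import java.io.File;
import org.deeplearning4j.text.documentiterator.FileLabelAwareIterator;
import org.deeplearning4j.text.documentiterator.LabelAwareIterator;

// Hypothetical root of the labeled corpus; each first-level subfolder name becomes a label.
File labeledRoot = new File("/path/to/paravec/labeled");

// Build a filesystem-based, label-aware iterator over the corpus.
LabelAwareIterator labelAwareIterator = new FileLabelAwareIterator.Builder()
        .addSourceFolder(labeledRoot)
        .build();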
In step 3, we created ParagraphVectors by adding all of the required hyperparameters. The purpose of paragraph vectors is to associate arbitrary documents with labels. Paragraph vectors are an extension of Word2Vec that learns to correlate labels with words, whereas Word2Vec only correlates words with other words. We need to define labels for paragraph vectors to work.
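A minimal sketch of such a configuration is shown below; the hyperparameter values are illustrative placeholders rather than the exact ones used in the recipe:
// Hyperparameter values here are illustrative, not the exact ones from the recipe.
ParagraphVectors paragraphVectors = new ParagraphVectors.Builder()
        .learningRate(0.025)          // initial learning rate
        .minLearningRate(0.001)       // learning rate floor during decay
        .batchSize(1000)              // words processed per training batch
        .epochs(20)                   // passes over the corpus
        .iterate(labelAwareIterator)  // labeled corpus iterator from step 1
        .trainWordVectors(true)       // learn word vectors alongside label vectors
        .tokenizerFactory(tokenizerFactory)
        .build();

paragraphVectors.fit();               // train the model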
For more information on what we did in step 5, refer to the following directory structure (under the unlabeled directory in the project):
The directory names can be random and no specific labels are required. Our task is to find the proper labels (document classifications) for these documents. Word embeddings are stored in the lookup table. For any given word, a word vector of numbers will be returned.
In step 6, we created InMemoryLookupTable from paragraph vectors.
InMemoryLookupTable is the default word lookup table in DL4J. Basically, the lookup table acts as the hidden layer of the model, and the word/document vectors are its output.
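As a small illustration (the word "health" is just an example token, and a null return is possible for unknown words), a vector can be pulled straight from the table:
// Each row of the lookup table is an embedding of length layerSize().
INDArray healthVector = lookupTable.vector("health");   // null if the word is not in the vocabulary
int embeddingSize = lookupTable.layerSize();            // dimensionality of every word/label vector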
Steps 8 to 12 are solely used to calculate the document vector of each document.
In step 8, we created tokens for the document using the tokenizer that was created in step 2.
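Step 2 itself is not shown in this excerpt; a typical DL4J tokenizer setup for this purpose looks like the following sketch (the sample sentence is made up, and the exact recipe configuration may differ):
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

// Whitespace tokenization plus lowercasing and punctuation stripping.
TokenizerFactory tokenizerFactory = new DefaultTokenizerFactory();
tokenizerFactory.setTokenPreProcessor(new CommonPreprocessor());

// Tokenize a document's content into a list of normalized tokens.
List<String> tokens = tokenizerFactory.create("Patient glucose levels were elevated").getTokens();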
In step 9, we used the lookup table that was created in step 6 to obtain VocabCache. VocabCache stores the information needed to operate the lookup table. We can look up words in the lookup table using VocabCache.
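For illustration, here are a couple of the queries VocabCache supports (the word used is an arbitrary example):
VocabCache vocabCache = lookupTable.getVocab();

int vocabSize = vocabCache.numWords();              // number of words (and labels) in the vocabulary
boolean known = vocabCache.containsWord("health");  // true only if the token was seen during training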
In step 11, we stored the word vectors of every matching word (counted in step 10) in an INDArray.
In step 12, we calculated the mean of this INDArray across dimension zero to get the document vector. Taking the mean across dimension zero averages over all of the word vectors (the rows), producing a single vector whose length equals the embedding size.
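For example, with a tiny matrix of three two-dimensional "word vectors" (the numbers are made up), mean(0) collapses the rows into one averaged vector:
// Three word vectors of size 2, stacked as rows.
INDArray words = Nd4j.create(new double[][] {
        {1.0, 4.0},
        {2.0, 5.0},
        {3.0, 6.0}
});

INDArray docVector = words.mean(0);   // [2.0, 5.0] -- the column-wise average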
In step 13, the cosine similarity is calculated by calling the cosineSim() method provided by ND4J. We use cosine similarity to measure how similar two document vectors are. vecLabel represents the document vector for a label, learned from the classified documents.
Then, we compared vecLabel with our unlabeled document vector, documentVector.
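As a standalone illustration of the call (the vectors here are arbitrary examples):
INDArray a = Nd4j.create(new double[] {1.0, 0.0, 1.0});
INDArray b = Nd4j.create(new double[] {1.0, 1.0, 0.0});

// Transforms is org.nd4j.linalg.ops.transforms.Transforms, as used in step 13.
double sim = Transforms.cosineSim(a, b);   // 0.5 for these two vectors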
After step 14, you should see an output similar to the following:
We can choose the label that has the highest cosine similarity value. From the preceding output, we can infer that the first document is most likely finance-related content, with a similarity score of 69.7%, and the second document is most likely health-related content, with a similarity score of 53.2%.
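Picking the winning label from the result list built in step 13 is then a simple maximum search. The following is a sketch using the same Pair type and logger as the recipe code:
String bestLabel = null;
double bestScore = Double.NEGATIVE_INFINITY;

// Scan the (label, similarity) pairs and keep the highest-scoring label.
for (Pair<String, Double> score : result) {
    if (score.getSecond() > bestScore) {
        bestScore = score.getSecond();
        bestLabel = score.getFirst();
    }
}

log.info("Predicted label: " + bestLabel + " (cosine similarity " + bestScore + ")");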
6
Constructing an LSTM Network for Time Series
In this chapter, we will discuss how to construct a long short-term memory (LSTM) neural network to solve a medical time series problem. We will be using data from 4,000 intensive care unit (ICU) patients. Our goal is to predict the mortality of patients using a given set of generic and sequential features. We have six generic features, such as age, gender, and weight. Also, we have 37 sequential features, such as cholesterol level, temperature, pH, and glucose level. Each patient has multiple measurements recorded against these sequential features. The number of measurements taken from each patient differs.
Furthermore, the time between measurements also differs among patients.
LSTM is well-suited to this type of problem due to the sequential nature of the data. We could also solve it using a regular recurrent neural network (RNN), but the purpose of LSTM is to avoid vanishing and exploding gradients. LSTM is capable of capturing long-term dependencies because of its cell state.
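To give a flavor of what is coming, the following is a bare-bones sketch of a DL4J configuration with a single LSTM layer and a recurrent output layer. The layer sizes are placeholders (43 inputs here simply reflects the 6 generic plus 37 sequential features mentioned above); the actual recipe configuration is developed step by step later in this chapter.
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.LSTM;
import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.lossfunctions.LossFunctions;

// Placeholder sizes: 43 input features per time step, 2 output classes (survived/died).
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .list()
        .layer(0, new LSTM.Builder()
                .nIn(43)
                .nOut(100)
                .activation(Activation.TANH)
                .build())
        .layer(1, new RnnOutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                .nIn(100)
                .nOut(2)
                .activation(Activation.SOFTMAX)
                .build())
        .build();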
In this chapter, we will cover the following recipes:
Extracting and reading clinical data
Loading and transforming data
Constructing input layers for a network
Constructing output layers for a network
Training time series data
Evaluating the LSTM network's efficiency
Technical requirements
A concrete implementation of the use case discussed in this chapter can be found here: https://github.com/PacktPublishing/Java-Deep-Learning-Cookbook/blob/master/06_Constructing_LSTM_Network_for_time_series/sourceCode/cookbookapp-lstm-time-series/src/main/java/LstmTimeSeriesExample.java.
After cloning the GitHub repository, navigate to the Java-Deep-Learning-Cookbook/06_Constructing_LSTM_Network_for_time_series/sourceCode directory. Then, import the cookbookapp-lstm-time-series project as a Maven project by importing pom.xml.
Download the clinical time series data from here: https://skymindacademy.blob.core.windows.net/physionet2012/physionet2012.tar.gz. The dataset is from the PhysioNet Cardiology Challenge 2012.
Unzip the package after the download. You should see the following directory structure:
The features are contained in a directory called sequence and the labels are contained in a directory called mortality. Ignore the other directories for now. You need to update file paths to features/labels in the source code to run the example.
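For instance, the updated paths might look like the following sketch; the locations are hypothetical placeholders and should point to wherever you extracted the archive:
// Hypothetical locations of the extracted dataset -- adjust to your own machine.
String featureBaseDir = "/path/to/physionet2012/sequence";   // per-patient feature CSV files
String labelBaseDir   = "/path/to/physionet2012/mortality";  // per-patient mortality labels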