
4.2 Processing Pipeline

4.2.1 Dataset Construction

The dataset construction phase of the processing pipeline (discussed in section 4.3.2 and illustrated in figure 4.2) is executed only once per clinical dataset. It has therefore been executed exactly twice during this study – once for dataset 1 and once for dataset 2.

5 http://www.python.org/

6 http://www.gnu.org/software/bash/


Figure 4.2. Dataset construction phase of the processing pipeline.

The RegaDB application [11] (including source code) was downloaded from the RegaDB website and installed into the test environment. The Africa Centre database was then restored in order to establish an exact clone of the production system.

RegaDB Setup

RegaDB is a Java web application that runs inside the Tomcat Servlet container using the Postgres database management system for storage. The Spring MVC web framework7 is used to serve content and JWT8 is the user interface library. Hibernate9 is used as the persistence framework. The RegaDB application is represented by the cylinder labelled Clinical Data in figure 4.2.

Data Extraction

The initial step of the dataset construction phase of the processing pipeline is to invoke the custom-developed tool called CuRE (shown as the square box in figure 4.2). The tool was developed in order to extract data from the RegaDB database in the appropriate format and in a configurable and repeatable way. The design and implementation of the tool is discussed in detail in section 4.3.2. The tool is invoked via the web interface provided and results in an ARFF file or spreadsheet containing the data in tabular format being downloaded onto the filesystem of the user.

7 http://spring.io/guides/gs/serving-web-content/, MVC: Model/View/Controller

8 http://www.webtoolkit.eu/jwt

9 http://www.hibernate.org/
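To illustrate the extracted tabular format, a minimal hypothetical ARFF fragment is shown below. The attribute names and values are illustrative assumptions only and do not reflect the actual features exported by CuRE.

@relation clinical_extract

@attribute age numeric
@attribute cd4_count numeric
@attribute regimen {regimen_a, regimen_b, regimen_c}

@data
34, 250, regimen_a
41, 480, regimen_b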

Dataset Preparation

As illustrated in figure 4.2, there are three final steps that follow data extraction by the CuRE application.

These three processes comprise the dataset preparation phase and have been shown to be an important part of the classification procedure [86]. Each individual process is discussed below.

Normalisation Before the multi-label classification algorithms are used to train the models, the dataset is normalised. This involves converting categorical features to numeric values. This is necessary because some classification methods (such as support vector machines) support only feature vectors that consist of numeric values.

As an example, if there is a categorical feature that could have any one of the values {red, green, blue}, this would be mapped to three numeric features. The value of the three numeric features would be {1, 0, 0} if the original feature value was red, {0, 1, 0} for green and {0, 0, 1} for blue.

The Weka Explorer graphical tool was used to generate the normalised dataset. The Explorer tool is bundled with the Weka application and allows the user to import an ARFF file and apply filters. Filters are used to apply various modifications to dataset files, and in this case, the NominalToBinary filter was used to transform categorical features into multiple features with values in {0, 1} (i.e. binary features) as described in the above example.

The clinical data used in this study only contained numeric and categorical values, but if other data types, such as image, video or audio data existed, then the normalisation step would also include mapping these data to numeric values. The conversion of the categorical features resulted in the feature count increasing from 62 to 125.
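Although the Explorer GUI was used in this study, the same transformation can be applied programmatically. The following is a minimal sketch using the Weka Java API; the file names are hypothetical and the exact filter options used in the study are not stated here.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSink;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToBinary;

public class NormaliseDataset {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file produced by the extraction step (hypothetical file name)
        Instances raw = DataSource.read("clinical.arff");

        // Expand each categorical (nominal) attribute into one binary attribute per value
        NominalToBinary n2b = new NominalToBinary();
        n2b.setTransformAllValues(true);   // produce a {0,1} attribute for every category value
        n2b.setInputFormat(raw);
        Instances binarised = Filter.useFilter(raw, n2b);

        // Persist the normalised dataset for the next pipeline step
        DataSink.write("clinical-normalised.arff", binarised);
    }
}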


Scaling The second data preprocessing step involved in constructing the training dataset is scaling. Scaling entails mapping all numeric values to the same range (in this case [−1; 1]). There are two primary reasons for doing this. First, having all features in the same range eliminates the possibility of features with greater numeric values dominating the training process and thus having a larger influence during the classification step. Second, using a range of [−1; 1] helps prevent possible numerical errors during computation, such as overflow when calculating the product of two very large numbers.

Like normalisation, scaling was done using the Weka Explorer graphical tool. The dataset filter that was applied in this case was the Normalisation filter. The Normalisation filter takes two parameters, the scaling factor and the translation value, both of which are numeric. The scaling factor specifies the size of the numeric range to which the features should be mapped and the translation value specifies the starting point of the numeric range as an offset from zero. For example, if a scaling factor of 1.0 and a translation value of −1.0 were specified, then all features would be mapped to the range [−1; 0] by the Weka Normalisation filter.
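Assuming the filter applies the standard min-max mapping (an assumption; the exact formula is not stated here), a feature value x with observed minimum min and maximum max is transformed as

x' = ((x − min) / (max − min)) × scale + translation

so the smallest observed value maps to the translation value and the largest maps to translation + scale.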

In this study a scaling factor of 2.0 and a translation value of −1.0 were used, resulting in all numeric features being mapped to the range [−1; 1].
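A sketch of the equivalent programmatic step is shown below, using Weka's Normalize filter (the filter referred to above as the Normalisation filter) with the parameter values used in this study; the file names are again hypothetical.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSink;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

public class ScaleDataset {
    public static void main(String[] args) throws Exception {
        // Load the normalised (binarised) dataset produced by the previous step
        Instances data = DataSource.read("clinical-normalised.arff");

        // Map every numeric attribute to [-1, 1]: range width 2.0, starting at -1.0
        Normalize scale = new Normalize();
        scale.setScale(2.0);
        scale.setTranslation(-1.0);
        scale.setInputFormat(data);
        Instances scaled = Filter.useFilter(data, scale);

        // Persist the scaled dataset for stratification
        DataSink.write("clinical-scaled.arff", scaled);
    }
}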

Stratification The last data preparation step that must be performed before the dataset can be used for the evaluation of the multi-label classification techniques is stratification. Stratification in this context is the process of generating the dataset folds used for cross validation while maintaining the label distributions of the complete dataset. This is done to mitigate sampling bias and to reduce the variance of the evaluation results obtained.

Since the dataset used in this study is small (by machine learning standards), the Mulan implementation of the iterative stratification algorithm [79] was used to generate the 10 folds to be used for cross validation. A small Java utility, discussed in section 4.3.3, was developed to execute this task.
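The utility itself is described in section 4.3.3; the following is only a minimal sketch of what such a utility might look like with the Mulan API, assuming the multi-label dataset and its label definition file are available as dataset.arff and labels.xml (hypothetical names).

import mulan.data.IterativeStratification;
import mulan.data.MultiLabelInstances;

public class StratifyDataset {
    public static void main(String[] args) throws Exception {
        // Load the multi-label dataset (ARFF data plus XML label definitions)
        MultiLabelInstances dataset =
                new MultiLabelInstances("dataset.arff", "labels.xml");

        // Generate 10 folds whose label distributions approximate those of the full dataset
        IterativeStratification stratifier = new IterativeStratification();
        MultiLabelInstances[] folds = stratifier.stratify(dataset, 10);

        System.out.println("Generated " + folds.length + " stratified folds");
    }
}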
