McNemar's test rejects the hypothesis that the two algorithms have the same error rate at significance level $\alpha$ if the previously computed value is greater than $\chi^2_{\alpha,1}$. For example, for $\alpha = 0.05$, $\chi^2_{0.05,1} = 3.84$.
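As a quick illustration, the following sketch computes the value and compares it against the 3.84 threshold. It is a minimal example that assumes the standard McNemar statistic with continuity correction and hypothetical disagreement counts n01 and n10; none of these numbers come from the text.

```python
def mcnemar_statistic(n01, n10):
    """Chi-square statistic of McNemar's test with continuity correction.
    n01: samples misclassified by algorithm 1 but not by algorithm 2
    n10: samples misclassified by algorithm 2 but not by algorithm 1
    (Standard form of the test; the counts used below are hypothetical.)"""
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

chi2 = mcnemar_statistic(n01=25, n10=10)                     # about 5.6 for these counts
print("reject H0" if chi2 > 3.84 else "cannot reject H0")    # threshold for alpha = 0.05
```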
The other test is applied if we compare two classification models that are tested with the k-fold cross-validation process. The test starts with the results of k-fold cross-validation obtained from K training/validation set pairs. We compare the error percentages of the two classification algorithms based on the errors in the K validation sets, which are recorded for the two models as $p_i^1$ and $p_i^2$, $i = 1, \ldots, K$.
The difference in error rates on fold $i$ is $P_i = p_i^1 - p_i^2$. Then, we can compute

$$m = \frac{\sum_{i=1}^{K} P_i}{K} \qquad \text{and} \qquad S^2 = \frac{\sum_{i=1}^{K} (P_i - m)^2}{K - 1}$$

We have a statistic that is t-distributed with K − 1 degrees of freedom and the following test:

$$\frac{\sqrt{K}\, m}{S} \sim t_{K-1}$$
Thus, the k-fold cross-validation paired t-test rejects the hypothesis that the two algorithms have the same error rate at significance level $\alpha$ if the previous value is outside the interval $(-t_{\alpha/2,K-1}, t_{\alpha/2,K-1})$. For example, the threshold values for $\alpha = 0.05$ and $K = 10$ or $30$ are $t_{0.025,9} = 2.26$ and $t_{0.025,29} = 2.05$.
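The following sketch illustrates this paired t-test over hypothetical per-fold error rates for two classifiers on K = 10 folds; the fold values and the use of SciPy to look up the critical value are assumptions, not taken from the text.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold error rates for two classifiers on K = 10 folds
p1 = np.array([0.12, 0.15, 0.10, 0.14, 0.13, 0.11, 0.16, 0.12, 0.14, 0.13])
p2 = np.array([0.14, 0.16, 0.13, 0.15, 0.14, 0.13, 0.17, 0.15, 0.16, 0.14])

K = len(p1)
P = p1 - p2                    # per-fold differences P_i
m = P.mean()                   # mean difference
S = P.std(ddof=1)              # sample standard deviation (K - 1 in the denominator)

t_stat = np.sqrt(K) * m / S                     # statistic with K - 1 degrees of freedom
t_crit = stats.t.ppf(1 - 0.05 / 2, df=K - 1)    # two-sided threshold for alpha = 0.05

print(f"t = {t_stat:.3f}, critical value = {t_crit:.3f}")
if abs(t_stat) > t_crit:
    print("Reject the hypothesis that the two algorithms have the same error rate.")
else:
    print("No significant difference detected.")
```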
Over time, all systems evolve. Thus, from time to time, the model will have to be retested, retrained and possibly completely rebuilt. Charts of the residual differences between forecasted and observed values are an excellent way to monitor model results.
4.10 IMBALANCED DATA CLASSIFICATION

Many real-world applications lead to classification problems with imbalanced data, and by convention, we call the classes having more samples the majority classes, while the ones having fewer samples are the minority classes.
Imbalanced data sets exist in many application domains, such as recognition in advance of unreliable telecommunication customers, detection of oil spills using satellite radar images, detection of fraudulent telephone calls, filtering spam email, identifying rare diseases in medical diagnostics, and predicting natural disasters such as earthquakes. For example, if the system is classifying credit card transactions, most transactions are legitimate, and only a few of them are fraudulent. The acceptable balance between the majority and minority classes depends on the application. In practical applications, the ratio of the small to the large class can be drastic, such as 1:100, while some applications for fraud detection have reported an imbalance of 1:100,000. For example, in utilities fraud detection, about 20 fraudulent observations may typically be recognized for every 1,000 observations, so the minority event rate is about 2%. A similar case occurs in large medical record databases used to build classification models for rare diseases. The minority sample rate may be below 1% because of the large number of patients who do not have the rare disease. In such cases with imbalanced classes, standard classifiers tend to be overwhelmed by the large majority classes and ignore the small ones. Furthermore, in medical diagnostics of a certain rare disease such as cancer, cancer is regarded as the positive, minority class, and non-cancer as the negative, majority class.
In this kind of application, missing a cancer case with the classification model is a much more serious error than the false positive error of predicting cancer for a healthy person.
Therefore, data-mining professionals have addressed two additional issues of class imbalance:
(a) Assign distinct costs to majority and minority training samples, and, more often,
(b) Rebalance the original data set, either by oversampling the minority class or undersampling the majority class.
With undersampling, we randomly select a subset of samples from the majority class to match the number of samples in the minority class. The main disadvantage of undersampling is that we lose potentially relevant information contained in the left-out samples. With oversampling, we randomly duplicate or combine samples from the minority class to match the number of samples in the majority class. While we avoid losing information with this approach, we also run the risk of overfitting our model by repeating some samples.
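As a simple illustration of these two rebalancing strategies, the following sketch randomly undersamples a hypothetical majority class and randomly oversamples a hypothetical minority class with NumPy; the data, class sizes, and function names are illustrative assumptions, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_undersample(X_maj, n_min):
    """Randomly keep only n_min samples from the majority class (no replacement)."""
    idx = rng.choice(len(X_maj), size=n_min, replace=False)
    return X_maj[idx]

def random_oversample(X_min, n_maj):
    """Randomly duplicate minority samples (sampling with replacement) up to n_maj."""
    idx = rng.choice(len(X_min), size=n_maj, replace=True)
    return X_min[idx]

# Hypothetical imbalanced 2-D data: 1,000 majority vs. 20 minority samples
X_majority = rng.normal(0.0, 1.0, size=(1000, 2))
X_minority = rng.normal(3.0, 1.0, size=(20, 2))

X_under = random_undersample(X_majority, len(X_minority))   # 20 majority samples kept
X_over = random_oversample(X_minority, len(X_majority))     # 1,000 minority samples by duplication
print(X_under.shape, X_over.shape)
```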
One of the most popular oversampling techniques is the Synthetic Minority Oversampling Technique (SMOTE). Its main idea is to form new minority class samples by interpolating between several minority class samples that lie close together. New synthetic samples are generated in the following way (a minimal code sketch of these steps is given after the list):
• Randomly select a sample from a minority class, and find its nearest neighbor belonging also to the minority class.
• Take the difference between the n-dimensional feature vector representing the sample under consideration and the vector of its nearest neighbor.
• Multiply this difference by a random number between 0 and 1, and add it to the feature vector of the minority sample under consideration. This causes the selection of a random point along the line segment between the two samples.
• Repeat the previous steps until a balance between minority and majority classes is obtained.
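The sketch below implements these steps for numeric features only; the function name, the six 2-D minority samples, and the choice of a single nearest neighbor (k = 1) are illustrative assumptions rather than a reference implementation of SMOTE.

```python
import numpy as np

def smote_sample(minority_X, n_new, k=1, rng=None):
    """Generate n_new synthetic minority samples by interpolating between a randomly
    selected minority sample and one of its k nearest minority neighbors.
    A minimal sketch of the SMOTE idea for numeric features."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority_X))            # randomly select a minority sample
        x = minority_X[i]
        dists = np.linalg.norm(minority_X - x, axis=1)
        dists[i] = np.inf                            # exclude the sample itself
        nn_idx = np.argsort(dists)[:k]               # its k nearest minority neighbors
        neighbor = minority_X[rng.choice(nn_idx)]
        gap = rng.random()                           # random point on the line segment
        synthetic.append(x + gap * (neighbor - x))
    return np.array(synthetic)

# Example: six 2-D minority samples oversampled with five synthetic points
minority = np.array([[1, 2], [2, 1], [2, 3], [3, 2], [4, 4], [5, 3]], dtype=float)
new_points = smote_sample(minority, n_new=5, k=1, rng=42)
print(new_points)
```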
This approach effectively forces the decision region of the minority class to become more general. An illustrative example of the SMOTE approach is given in Figure 4.38. Minority class samples are presented with small circles, while majority class samples in 2D space are given with the X symbol. The minority class consists of only six 2D samples, and it is oversampled with five additional new synthetic samples on the lines between initial members of the minority class and their nearest neighbors.
The explanation of why synthetic minority oversampling improves performance, while minority oversampling with replacement does not, is evident from Figure 4.38. Consider the effect on the decision regions in n-dimensional feature space when minority oversampling is done by replication (sampling with replacement) instead of by introduction of synthetic samples. With replication, the decision region that results in a classification model for the minority class becomes smaller and more specific as the minority samples in the region are replicated. This is the opposite of the desired effect of model generality. On the other hand, the SMOTE method of synthetic oversampling enables the classifier to build larger decision regions that contain not only the original but also nearby minority class points.
Usually, the SMOTE approach combines the SMOTE oversampling algorithm for the minority class with undersampling of the majority class, randomly removing some of its samples and making the classification problem more balanced. Of course, it is not required that the balance be exactly a 50–50% representation of classes.
Figure 4.38. Generation of synthetic samples in the SMOTE approach.
The final effect of the SMOTE-based classification model is a combination of the positive aspects of both the undersampling and oversampling processes: better generalization of the minority class and no loss of useful information related to the majority class. The main problem with SMOTE is its weak effectiveness for high-dimensional data, but improved versions of the algorithm solve these problems satisfactorily.
While the initial SMOTE approach does not handle data sets with nominal features, a new version was developed to handle mixed data sets of continuous and nominal features. This approach is called the Synthetic Minority Over-sampling Technique-Nominal Continuous (SMOTE-NC). The main generalization is based on a median computation. In the first step, compute the median of the standard deviations of all continuous features for the minority class. If the nominal features differ between a sample and its potential nearest neighbors, then this median is included in the Euclidean distance computation. The median is used to penalize the difference in each nominal feature. For example, suppose it is necessary to determine the distance between two six-dimensional samples, F1 and F2, where the first three dimensions are numerical while the last three are nominal:
F1 = (1, 2, 3, A, B, C)
F2 = (4, 6, 5, A, D, E)
Initially, the value Med is determined, representing the median of the standard deviations of all continuous features of the minority class (the first three features). Then, the Euclidean distance between the samples F1 and F2 is determined as
$$\text{Distance}(F_1, F_2) = \sqrt{(1-4)^2 + (2-6)^2 + (3-5)^2 + 0 + \text{Med}^2 + \text{Med}^2}$$

The median term Med is included in the distance twice, for the two nominal features with different values in samples F1 and F2. For the first nominal feature, because the values are the same (both samples have value A), the contribution to the distance is 0.
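The same distance computation can be written as a short sketch; the feature layout (three numeric, three nominal) follows the example above, but the function name and the assumed value Med = 1.2 are illustrative only.

```python
import numpy as np

def smote_nc_distance(f1, f2, n_numeric, med):
    """Euclidean-style distance used in SMOTE-NC: squared differences for the numeric
    features plus Med^2 for every nominal feature whose values differ.
    (Illustrative sketch; Med is an assumed input here.)"""
    d2 = 0.0
    for i in range(n_numeric):                  # continuous part
        d2 += (f1[i] - f2[i]) ** 2
    for i in range(n_numeric, len(f1)):         # nominal part
        if f1[i] != f2[i]:
            d2 += med ** 2
    return np.sqrt(d2)

# The two samples from the example above, with an assumed Med value of 1.2
F1 = [1, 2, 3, "A", "B", "C"]
F2 = [4, 6, 5, "A", "D", "E"]
print(smote_nc_distance(F1, F2, n_numeric=3, med=1.2))   # sqrt(29 + 2 * 1.2**2)
```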
Traditionally, accuracy is the most commonly used measure of the quality of a classification model. However, for classification with the class imbalance problem, accuracy is no longer a proper measure, since the rare class has very little impact on accuracy compared with the majority class. For example, in a problem where the rare class is represented by only 1% of the training data, a simple classification strategy is to predict the majority class label for every sample. This simplified approach achieves a high accuracy of 99%. However, this measurement is meaningless for applications where the learning is concentrated on the identification of the rare minority cases. The nature of such applications requires a fairly high rate of correct detection in the minority class, allowing, as a trade-off, a small error rate in the majority class.
Because accuracy is not meaningful for imbalanced data classification problems, two additional measures are commonly used as a replacement: precision (also called positive predictive value) and recall (also called sensitivity). These two measures may be replaced with a single one, the F-measure (as given in Table 4.4):
$$\text{F-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
The F-measure represents the harmonic mean of precision and recall, and it tends to be closer to the smaller of the two values. As a consequence, a high F-measure value ensures that both recall and precision are reasonably high.
When the performance of both classes is of concern in imbalanced applications, both the true positive rate (TPrate) and the true negative rate (TNrate) are expected to be high simultaneously. An additional measure is suggested for these cases: the geometric mean or G-mean measure, defined as
$$\text{G-mean} = (\text{TP}_{\text{rate}} \times \text{TN}_{\text{rate}})^{1/2} = (\text{sensitivity} \times \text{specificity})^{1/2}$$
G-mean measures the balanced performance of a learning algorithm between the majority and minority classes. The final decision in model selection should consider a combination of different measures instead of relying on a single measure. To minimize imbalance-biased estimates of performance, it is recommended to report both the obtained metric values for the selected measures and the degree of imbalance in the data.
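To make these measures concrete, the following sketch computes precision, recall, specificity, F-measure, and G-mean directly from confusion-matrix counts for a hypothetical 1%-minority problem; the counts and the helper function are illustrative assumptions, not results from the text.

```python
import numpy as np

def imbalance_metrics(tp, fp, tn, fn):
    """Precision, recall (TP rate), specificity (TN rate), F-measure, and G-mean
    from confusion-matrix counts; a minimal sketch of the measures discussed above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # sensitivity, TP rate
    specificity = tn / (tn + fp) if tn + fp else 0.0     # TN rate
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    g_mean = np.sqrt(recall * specificity)
    return precision, recall, specificity, f_measure, g_mean

# Hypothetical results on a 1%-minority problem with 10,000 test samples:
# a trivial "always majority" model vs. a model that detects most minority cases
print(imbalance_metrics(tp=0,  fp=0,   tn=9900, fn=100))   # 99% accuracy, but F = G-mean = 0
print(imbalance_metrics(tp=80, fp=150, tn=9750, fn=20))    # lower accuracy, far better F and G-mean
```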
4.11 90% ACCURACY…NOW WHAT?
Often forgotten in texts on data mining is a discussion of the deployment process. Any data-mining student may produce a model with relatively high accuracy over some small data set using the tools available. However, an experienced data miner sees beyond the creation of a model during the planning stages. A plan needs to be created to evaluate how useful a data-mining model is to a business and how the model will be rolled out. In a business setting, the value of a data-mining model is not simply its accuracy, but how that model can impact the bottom line of a company. For example, in fraud detection, algorithm A may achieve an accuracy of 90%, while algorithm B achieves 85% on training data. However, an evaluation of the business impact of each may reveal that algorithm A would likely underperform algorithm B because of a larger number of very expensive false negative cases. Additional financial evaluation may recommend algorithm B for the final deployment because with this solution the company saves more money. A careful analysis of the business impacts of data-mining decisions gives much greater insight into a data-mining model.
In this section two case studies are summarized. The first case study details the deployment of a data-mining model that improved the efficiency of employees in finding fraudulent claims at an insurance company in Chile. The second case study involves a system deployed in hospitals to aid in tracking compliance with industry standards in caring for individuals with cardiovascular disease (CVD).
4.11.1 Insurance Fraud Detection
In 2005, insurance company Banmedica S.A. of Chile received 800 digital medical claims per day. The process of identifying fraud was entirely manual. Those responsible for identifying fraud had to look one by one at medical claims to find fraudulent cases. Instead, it was hoped that data-mining techniques would aid in a more efficient discovery of fraudulent claims.
The first step in the data-mining process required that the data-mining experts gain a better understanding of the processing of medical claims. After several meetings with medical experts, the data-mining experts were able to better understand the business process as it related to fraud detection. They were able to determine the current criteria used in manually discriminating between claims that were approved, rejected, and modified. A number of known fraud cases were discussed, along with the behavioral patterns that revealed these documented fraud cases.
Next, two data sets were supplied. The first data set contained 169 documented cases of fraud. Each fraudulent case took place over an extended period of time, showing that time was an important factor in these decisions as cases developed. The second data set contained 500,000 medical claims with labels supplied by the business of "approved," "rejected," or "reduced."
Both data sets were analyzed in detail. The smaller data set of known fraud cases revealed that these fraudulent cases all involved a small number of medical professionals, affiliates, and employers. From the original paper, "19 employers and 6 doctors were implicated with 152 medical claims." The labels of the larger data set were revealed to be insufficiently accurate for data mining. Contradictory data points were found. A lack of standards in recording these medical claims, together with a large number of missing values, contributed to the poorly labeled data set. Instead of the larger 500,000-point data set, the authors were "forced" to rebuild a subset of this data. This required manual labeling of the subset.
The manual labeling would require a much smaller set of data points to be used from the original 500,000. To cope with a smaller set of data points, the problem was split into four smaller problems, namely, identifying fraudulent medical claims, affiliates, medical professionals, and employers. Individual data sets were constructed for each of these four subtasks, ranging in size from 2838 samples in the medical claims task to 394 samples in the employer subtask. For each subtask a manual selection of features was performed. This involved selecting only one feature from highly correlated features, replacing categorical features with numerical features, and designing new features that "summarize temporal behavior over an extended time span." The original 125 features were pared down to between 12 and 25 features, depending on the subtask. Additionally, the output of all other subtasks became inputs to each subtask, thus providing feedback to each subtask. Lastly, 2% of outliers were removed and the features were normalized.
When modeling the data, it was found that initially the accuracy of a single neural network on these data sets could vary by as much as 8.4%. Instead of a single neural network for a particular data set, a committee of neural networks was used. Each data set was also divided into a training set, a validation set, and a testing set to avoid overfitting the data. At this point it was also decided that each of the four models would be retrained monthly to keep up with the ever-evolving process of fraud.
Neural networks and committees of neural networks output scores rather than an absolute fraud classification. It was therefore necessary that the output be thresholded. The threshold was decided after accounting for personnel costs, false alarm costs, and the cost of not detecting a particular instance of fraud. All of these factors figured into an ROC curve used to decide upon acceptable false and true positive rates. When the medical claims model, using the input of the other three subtasks, scored a medical claim above the chosen threshold, a classification of fraud was given to that claim. The system was tested on a historical data set of 8819 employers that contained 418 instances of fraud. After this historical data set was split into training, validation, and test sets, the results showed that the system identified 73.4% of the true fraudsters with a false positive rate of 6.9%.
The completed system was then run each night, giving each new medical claim a fraud probability. The claims were then reviewed, sorted by the given probabilities. There were previously very few documented cases of fraud. After implementation, there were approximately 75 rejected claims per month. These newly found cases of fraud accounted for nearly 10% of the raw overall costs to the company! Additionally, the culture of fraud detection changed. A taxonomy of the types of fraud was created, and further improvements were made to the manual revision process. The savings covered the operational costs and increased the quality of health coverage.
Overall this project was a big success. The authors spent a lot of time first understanding the problem and then analyzing the data in detail before the data was modeled. The final models produced were analyzed in terms of real business costs. In the end the results showed that the costs of the project were justified and Banmedica S.A.
greatly benefited from the final system.
4.11.2 Improving Cardiac Care
CVD leads to nearly one million deaths per year in the United States, or 38% of all deaths in the country. Additionally, in 2005 the estimated cost of CVD was $394 billion, compared with an estimated $190 billion for all cancers combined.
CVD is a real problem that appears to be growing in the number of lives claimed and the percentage of the population that will be directly affected by this disease. Certainly we can gain a better understanding of this disease. There already exist guidelines for the care of patients with CVD that were created by panels of experts. With the current load on the medical system, doctors are able to spend only a short amount of time with each patient. With the large number of guidelines that exist, it is not reasonable to expect that doctors will follow every guideline for every patient. Ideally, a system would aid a doctor in following the given guidelines without adding additional overhead.
This case study outlines the use and deployment of a system called REMIND, meant both to find patients in need within the system and to enable better tracking of whether patients are being cared for according to guidelines. Currently two main types of records are kept for each patient: financial and clinical. The financial records are used for billing. These records use standardized codes (ICD-9, for example) for doctor assessments and drugs prescribed. This standardization makes it straightforward for computer systems to extract information from these records for use in data-mining processes. However, it has been found that these codes are accurate only 60–80% of the time, for various reasons. One reason is that when these codes are used for billing, although two conditions may be nearly identical in symptoms and prescriptions, the amount of money that will be paid out by an insurer may be very different. The other form of records kept is clinical records. Clinical records are made up of unstructured text and allow for the transfer of knowledge about a patient's condition and treatments from one doctor to another. These records are much more accurate, but they are not in a form that is easily used by automated computer systems.
With the great demands on the time of doctors and nurses, it is not possible for additional data to be recorded specifically for this system. Instead, the REMIND system combines data from all available systems. This includes extracting knowledge from the unstructured clinical records. The REMIND system combines all available sources of data, and then, using redundancy in the data, the most likely state of the patient is found. For example, to determine whether a patient is diabetic, one may use any of the following pieces of data: billing code 250.xx for diabetes, a free-text dictation identifying a diabetic diagnosis, a blood sugar value >300, a treatment of insulin or an oral antidiabetic, or a common diabetic complication. The likelihood of the patient having diabetes increases as more relevant information is found. The REMIND system uses extracted information from all possible data sources, combined in a Bayesian network.
The various outputs of the network are used, along with temporal information, to find the most probable sequence of states within a predefined disease progression modeled as a Markov model. The probabilities and structure of the Bayesian network are provided beforehand by experts as domain knowledge and are tunable per deployment. The domain knowledge in the REMIND system is fairly simple, as stated by the author of the system. Additionally, by using a large amount of redundancy, the system performs well for a variety of probability settings and temporal settings for the disease progression. However, before a wide distribution of the REMIND system, a careful tuning of all the parameters must take place.
One example deployment was to the South Carolina Heart Center where the goal was to identify among 61,027 patients those at risk of sudden cardiac death (SCD).
Patients who have previously suffered a myocardial infarction (heart attack, abbreviated as MI) are at the highest risk of SCD. A study was performed on the efficacy of implantable cardioverter defibrillators (ICDs). It was found that patients with a prior MI and low ventricular function had their 20-month mortality rate drop from 19.8 to 14.2%. This implantation is now a standard recommendation. Previous to the REMIND system, one had two options to find who would require the implantation of an ICD. The first option was to manually review the records of all patients to identify