It is not assumed that the data or noise can be modeled or that additional training examples are available.
Learning and Generalization
Difficult examples and outliers
In the presence of difficult examples, the learning machine faces a hard task: it has to learn an unknown function from a finite number of examples, while some of those examples may be misleading or difficult in other ways.
Data pruning overview
- What is data pruning?
- Overview of data pruning method
- What is a difficult example?
- How to deal with outliers?
- Why is defining outliers difficult?
- Motivation: Why study outliers?
The reliability of an observation depends on the other observations, i.e. a difficult example must be defined in the context of the remaining data. A sample that is overly influential can change the estimate of the model.
Robust statistics
Accommodating outliers
Eliminating outliers
However, caution is needed because this may not always be the case: the value of h_ii should be considered in the context of the other h_ii values in the 'hat' matrix [43]. These diagnostics can help confirm the adequacy of the model, guide further data analysis, and identify cases that may need to be investigated further, rejected, or accommodated by the model [5].
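As an illustration of such a leverage diagnostic (a minimal sketch, not taken from the original text; the design matrix and the 2p/n cutoff are assumptions, the latter being a common rule of thumb):

```python
import numpy as np

def leverage_scores(X):
    """Diagonal h_ii of the 'hat' matrix H = X (X^T X)^{-1} X^T."""
    H = X @ np.linalg.pinv(X.T @ X) @ X.T   # pseudo-inverse for numerical stability
    return np.diag(H)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=20)])  # intercept + one feature
X[-1, 1] = 8.0                             # one observation far from the rest

h = leverage_scores(X)
n, p = X.shape
flagged = np.where(h > 2 * p / n)[0]       # flag h_ii that stand out relative to 2p/n
print("leverage of last point:", h[-1], "flagged indices:", flagged)
```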
RANSAC
Many estimators are designed specifically to improve the robustness of the mean or of other data statistics, but we will not study them here. Among many attempts to fit the model to subsets of the data, the fit that best agrees with the overall data is selected.
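A minimal RANSAC-style sketch for robust line fitting (illustrative only; the inlier tolerance and number of trials are assumptions): the model is fit to many minimal random subsets, and the fit supported by the most inliers is kept.

```python
import numpy as np

def ransac_line(x, y, n_trials=100, inlier_tol=0.5, rng=None):
    """Fit y = a*x + b robustly: try many minimal fits, keep the one with most inliers."""
    rng = rng or np.random.default_rng(0)
    best_model, best_inliers = None, -1
    for _ in range(n_trials):
        i, j = rng.choice(len(x), size=2, replace=False)   # minimal subset: two points
        if x[i] == x[j]:
            continue
        a = (y[j] - y[i]) / (x[j] - x[i])
        b = y[i] - a * x[i]
        inliers = np.sum(np.abs(y - (a * x + b)) < inlier_tol)
        if inliers > best_inliers:
            best_model, best_inliers = (a, b), inliers
    return best_model

x = np.linspace(0, 10, 50)
y = 2 * x + 1 + np.random.default_rng(1).normal(0, 0.2, size=50)
y[:5] = 30.0                      # a few gross outliers
print(ransac_line(x, y))          # close to (2, 1) despite the outliers
```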
Regularization methods
Regularization with penalties
The regularizing function penalizes large weights in the network, since they can lead to an overly complex discriminant function: Ω = (1/2) Σ_i w_i^2. Bartlett [3] showed that the generalization ability of a neural network depends on the size of its weights.
Introducing slack variables
The most popular form of regularization is weight decay for neural networks, where the penalty is the sum of the squares of all parameters in the network, namely its weights [6]. A theoretical rationale for preferring smaller weights, obtained with weight decay or early stopping in neural networks, was given by Bartlett [3].
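As a small sketch of the idea (the network weights, data loss, and decay coefficient below are illustrative assumptions), weight decay simply adds the penalty Ω = (1/2) Σ_i w_i^2, scaled by a coefficient λ, to the cost being minimized:

```python
import numpy as np

def loss_with_weight_decay(weights, data_loss, lam=1e-3):
    """Total cost = data term + lambda * (1/2) * sum of squared weights."""
    penalty = 0.5 * sum(np.sum(w ** 2) for w in weights)
    return data_loss + lam * penalty

# Illustrative use: two weight matrices of a small network and a dummy data loss.
W1, W2 = np.ones((4, 3)), np.ones((3, 1))
print(loss_with_weight_decay([W1, W2], data_loss=0.42))
```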
Case study: Regularizing AdaBoost
- Applying penalties
- Introducing slack variables
- Modifying the cost function
- Removing examples
The standard way of adding a penalty to the cost function being minimized can be applied directly to AdaBoost. Manipulating the cost function so that it penalizes more difficult examples is again a form of regularization for boosting techniques.
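One way to make this concrete, purely as an illustration and not the exact scheme discussed here: cap the example weights in AdaBoost's reweighting step so that persistently misclassified (difficult) examples cannot dominate the cost; the cap value below is an assumption.

```python
import numpy as np

def reweight(weights, margins, alpha, cap=5.0):
    """One AdaBoost-style reweighting step with an added limit on difficult examples.

    margins[i] = y_i * h(x_i) in {-1, +1}; the cap keeps persistently
    misclassified (difficult) examples from dominating the cost.
    """
    w = weights * np.exp(-alpha * margins)   # standard exponential reweighting
    w = np.minimum(w, cap * np.mean(w))      # limit the influence of difficult examples
    return w / np.sum(w)                     # renormalize to a distribution

w = np.full(6, 1 / 6)
margins = np.array([1, 1, 1, 1, -1, -1])     # last two examples are misclassified
print(reweight(w, margins, alpha=0.8))
```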
Learning with queries
Learning in the presence of classification noise
A validation set [30] is used to determine the most difficult cases to remove.
Learning in the presence of malicious noise
The authors provide hardness results for learning arbitrary concepts in this model as well as constructive learning algorithms for learning certain concept classes.
Active learning
Outliers in a probabilistic setting
Data valuation
Exhaustive learning
Since π is unknown, it is approximated by the leave-one-out error, i.e. the error measured when the sample in question is excluded from the training set. This gives information on whether the sample, on average, contradicts the rest of the samples with respect to all possible hypotheses. In this setting we want to identify examples that are difficult for the learning process for this particular training set and this particular model, and that could cause the model to overfit and thus degrade its generalization performance.
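A minimal sketch of this leave-one-out check (the nearest-class-mean learner and the toy data are placeholder assumptions; any learner from the model class could be plugged in):

```python
import numpy as np

def loo_disagreement(X, y, fit, predict):
    """Leave-one-out check: does example i contradict a model trained on the rest?"""
    flags = np.zeros(len(y), dtype=bool)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        model = fit(X[keep], y[keep])
        flags[i] = predict(model, X[i:i+1])[0] != y[i]   # misclassified when left out
    return flags

# Placeholder learner: nearest class mean (any learner could be substituted).
def fit(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(model, X):
    classes = list(model)
    d = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array(classes)[np.argmin(d, axis=0)]

X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [0.05]])
y = np.array([0, 0, 0, 1, 1, 1])             # last label contradicts its neighbors
print(loo_disagreement(X, y, fit, predict))  # flags the suspicious example
```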
In the next chapter we apply our method to recognizing object categories when the training data is contaminated, that is, the data may contain a large number of difficult or mislabeled examples, which are otherwise hard to model.
Problem formulation
The main conclusion is that using unlabeled data in addition to labeled data is quite useful, especially when labeling is expensive. These methods depend on the assumption that some reliably labeled data is available to start learning and evaluating from [8]. They also rely on the strong assumption that the data come from a particular distribution which can be modeled.
Nevertheless, we believe that in general we cannot assume that we know and can model the sources that generated the data, and that we want to use the available training data in less demanding discriminative models.
The essence of the data pruning problem
- When is data pruning necessary?
- Benefits of data pruning
- Difficult examples increase model complexity
- Regularization can benefit from data pruning
- Challenges for data pruning
Note that there are datasets where none of the examples are useless or harmful, as shown in Figure 3.1. In this section we have seen that it is possible to improve the generalization error by training on a (suitably chosen) subset of the training data, simply because difficult examples are present. We measure the observed complexity of the model after training on nested subsets of the training data, formed by adding more and more difficult examples.
We can observe that there are subsets of the data that could improve the generalization error.
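One way to make the nested-subset experiment concrete, purely as an illustration: sort examples from easy to hard by some difficulty score, train on growing prefixes, and record a complexity proxy for each subset. The toy data, the difficulty score, and the use of decision-tree size as a complexity proxy below are all assumptions, not the experiment reported here.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Toy 2-class data with a few flipped labels acting as difficult examples.
X = rng.normal(size=(200, 2)) + np.repeat([[0, 0], [2, 2]], 100, axis=0)
y = np.repeat([0, 1], 100)
flip = rng.choice(200, size=10, replace=False)
y[flip] = 1 - y[flip]

# Rough difficulty score: disagreement with a small tree trained on all the data.
difficulty = (DecisionTreeClassifier(max_depth=3, random_state=0)
              .fit(X, y).predict(X) != y).astype(float)
order = np.argsort(difficulty)                # easiest examples first

for n in (100, 150, 200):                     # nested subsets, growing difficulty
    idx = order[:n]
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    print(n, "examples -> tree size:", tree.tree_.node_count)
```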
Data pruning
- Overview of the approach
- Multiple models and data subsets are needed
- Collecting multiple learners’ opinions
- Pruning the data
- Combining classifiers’ votes
- Naive Bayes classification
- Why pruning the data?
- Experiments
Other ways to create classifier diversity are to select different subsets of the input features, to select slightly overlapping subsets of the data, or even to inject noise into the training data [10]. We are interested in finding the probability p(y|X) that an example has label y, given the data, modeled and estimated from the data or, in our case, from the predictions of the multiple learners.
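A hedged sketch of combining the learners' votes under a naive independence assumption; the per-learner accuracy estimates and the class prior below are placeholders, not values from the experiments.

```python
import numpy as np

def naive_bayes_vote(votes, accuracies, prior=0.5):
    """p(y=1 | votes) assuming the learners err independently.

    votes[k] in {0, 1} is learner k's predicted label;
    accuracies[k] estimates how often learner k is correct.
    """
    p1, p0 = prior, 1.0 - prior
    for v, a in zip(votes, accuracies):
        p1 *= a if v == 1 else 1.0 - a   # likelihood of this vote if the true label is 1
        p0 *= a if v == 0 else 1.0 - a   # likelihood of this vote if the true label is 0
    return p1 / (p1 + p0)

votes = np.array([1, 1, 0, 1, 1])
accuracies = np.array([0.8, 0.7, 0.6, 0.9, 0.75])
print(naive_bayes_vote(votes, accuracies))   # high probability that the label is 1
```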
Data pruning is applied to the original dataset as well as to data with artificially introduced label noise.
Discussion
The markers indicate the average test errors over 100 runs on the original data, and the lines show the test errors on the same data with 10% label noise. In data pruning we addressed a more challenging problem: determining which examples would be difficult to learn without using any kind of validation set, additional data, or prior information. The final learning is done on the pruned set using a single learner from the learning model, not a combination of learners.
It is no wonder that various methods that try to cope with noise use multiple classifiers for this purpose. If nothing else is known about the data and noisy samples are present, the only way to find out which examples are noisy, or to otherwise ignore them, is to use multiple projections of the data, hoping that the noisy observations will affect the classifiers in contradictory ways while the good examples will consistently support the same classification.
Conclusion
So, among the returned images, there would be many that contain the target category, but also many images that are not relevant to the target. In our case we have a set of examples that contains instances of the target but may also include many mislabeled instances. We can query for more data of the same type, but we cannot access the actual instance labels.
In this section, we apply the data pruning method to clean contaminated face data, and modify it to take advantage of the fact that we are learning visual categories.
Experimental setup
Training data
This is useful, especially for target images, because there may be examples with poor lighting or extreme pose variations that the learning machine cannot accommodate and will generalize better if trained without them. In active learning, we can have many unlabeled examples and assume that we can query the object label at any time, but this query is more expensive than retrieving an unlabeled example [13].
Feature projections
From each filter channel, sums of pixels in sub-regions of the image, represented as rectangular activity masks, are obtained (Figure 4.3). To select stable features, pixels in larger regions must be used in order to accommodate variability in terms of local translations, rotations, etc. The easiest way to achieve this is to sum the pixels in the region, as proposed by Viola and Jones [42].
Therefore, we used the sum of pixels in different filter channels in the original image instead.
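For illustration only (the image size and the mask coordinates are assumptions), such rectangular sums can be computed efficiently with an integral image, in the spirit of Viola and Jones [42]:

```python
import numpy as np

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of pixels in rows r0..r1-1, cols c0..c1-1 using an integral image ii."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

img = np.random.default_rng(0).random((32, 32))
# Integral image with a zero row/column prepended so indexing stays simple.
ii = np.pad(img, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)

# Feature: sum of pixels inside an example rectangular mask of the (filtered) image.
print(rect_sum(ii, 4, 4, 16, 20), img[4:16, 4:20].sum())  # the two values agree
```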
Learning algorithms
The face dataset, as seen in Figure 4.1, contains quite challenging images with great variation in terms of lighting, pose, image quality, etc. Regardless of whether the faces are aligned, pixel-wise correspondence, or even correspondence within small neighborhoods, is highly unlikely to yield a consistent response among the face samples. Note that we used only first-moment statistics (sums of pixels) in the rectangular regions, while Murphy et al. [31], who designed features to discriminate between many objects, used second- and fourth-moment statistics.
Data pruning for visual data
Generating semi-independent learners
Visual object categories allow more freedom in training semi-independent learners because they offer many potential feature projections. Note that this is more difficult to do with a dataset with a small number of input dimensions. To produce semi-independent learners, we can split the image into slightly overlapping areas.
We use regions defined similarly to those in Figure 4.3, but drop the rectangles that cover too large an area, such as the first four, so that we do not obtain fully dependent classifiers, resulting in a total of 23 learners; see Figure 4.4.
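A minimal sketch of how such region-based learners might be produced; the grid size, overlap, and the placeholder training call are assumptions rather than the exact regions of Figure 4.4.

```python
import numpy as np

def region_slices(h, w, grid=3, overlap=4):
    """Split an h x w image into a grid of slightly overlapping regions."""
    regions = []
    for i in range(grid):
        for j in range(grid):
            r0, r1 = max(0, i * h // grid - overlap), min(h, (i + 1) * h // grid + overlap)
            c0, c1 = max(0, j * w // grid - overlap), min(w, (j + 1) * w // grid + overlap)
            regions.append((slice(r0, r1), slice(c0, c1)))
    return regions

# Each region would feed its own learner; here we only show the feature extraction.
imgs = np.random.default_rng(0).random((5, 24, 24))    # 5 toy training images
for rs, cs in region_slices(24, 24):
    features = imgs[:, rs, cs].reshape(len(imgs), -1)  # per-learner feature projection
    # train_learner(features, labels)  # placeholder: one semi-independent learner per region
```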
Data pruning results
As we can see, pruning the data is useful, especially when a lot of noise is present. In the second comparison, both SVM and AdaBoost use the same pruning mechanism, and we can observe only the different responses of the algorithms to different noise levels. The number in the first column of Tables 4.2 and 4.3 is the proportion of detected mislabeled examples among those with artificially flipped labels, while the number of false alarms is measured using the full data.
In fact, some face examples marked for cropping (shown in red boxes in Figures 4.9 and 4.8) are quite difficult and should not be present in the training data in the first place.
Pruning very noisy data
Conclusion
With this work, we have shown that the quality of the training examples is also important for better generalization. We demonstrate the applicability of the approach to very challenging face datasets with large amounts of noise.