It is not assumed that the data or noise can be modeled or that additional training examples are available.
Learning and Generalization
Difficult examples and outliers
In the presence of difficult examples, the learning machine faces a hard task: it has to learn an unknown function from a finite number of examples, while some of those examples may be misleading or difficult in other ways.
Data pruning overview
- What is data pruning?
- Overview of data pruning method
- What is a difficult example?
- How to deal with outliers?
- Why is defining outliers difficult?
- Motivation: Why study outliers?
The reliability of an observation depends on the other observations, i.e. a difficult example must be defined in the context of the remaining data. A sample that is overly influential can change the estimate of the model.
Robust statistics
Accommodating outliers
Eliminating outliers
However, caution is needed because this may not always be the case: the value of h_ii should be considered in the context of the other h_ii values in the 'hat' matrix [43]. These diagnostics can help confirm the adequacy of the model, guide further data analysis, and identify cases that may need to be investigated further, rejected, or accommodated by the model [5].
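As an illustration of such a leverage diagnostic (a minimal sketch, not taken from the original text; the design matrix and the 2p/n cutoff are assumptions, the latter being a common rule of thumb):

```python
import numpy as np

def leverage_scores(X):
    """Diagonal h_ii of the 'hat' matrix H = X (X^T X)^{-1} X^T."""
    H = X @ np.linalg.pinv(X.T @ X) @ X.T   # pseudo-inverse for numerical stability
    return np.diag(H)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=20)])  # intercept + one feature
X[-1, 1] = 8.0                             # one observation far from the rest

h = leverage_scores(X)
n, p = X.shape
flagged = np.where(h > 2 * p / n)[0]       # flag h_ii that stand out relative to 2p/n
print("leverage of last point:", h[-1], "flagged indices:", flagged)
```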
RANSAC
Many estimators are designed specifically to improve the robustness of the mean or of other data statistics, but we will not study them here. Among many attempts to fit the model to subsets of the data, the fit that best agrees with the overall data is selected.
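A minimal RANSAC-style sketch for robust line fitting (illustrative only; the inlier tolerance and number of trials are assumptions): the model is fit to many minimal random subsets, and the fit supported by the most inliers is kept.

```python
import numpy as np

def ransac_line(x, y, n_trials=100, inlier_tol=0.5, rng=None):
    """Fit y = a*x + b robustly: try many minimal fits, keep the one with most inliers."""
    rng = rng or np.random.default_rng(0)
    best_model, best_inliers = None, -1
    for _ in range(n_trials):
        i, j = rng.choice(len(x), size=2, replace=False)   # minimal subset: two points
        if x[i] == x[j]:
            continue
        a = (y[j] - y[i]) / (x[j] - x[i])
        b = y[i] - a * x[i]
        inliers = np.sum(np.abs(y - (a * x + b)) < inlier_tol)
        if inliers > best_inliers:
            best_model, best_inliers = (a, b), inliers
    return best_model

x = np.linspace(0, 10, 50)
y = 2 * x + 1 + np.random.default_rng(1).normal(0, 0.2, size=50)
y[:5] = 30.0                      # a few gross outliers
print(ransac_line(x, y))          # close to (2, 1) despite the outliers
```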
Regularization methods
Regularization with penalties
The regularizing function penalizes large weights in the network, since they can lead to an overly complex discriminant function: Ω = (1/2) Σ_i w_i^2. Bartlett [3] showed that the generalization ability of a neural network depends on the size of its weights.
Introducing slack variables
The most popular form of regularization is weight decay for neural networks, where the penalty is the sum of the squares of all parameters in the network, namely its weights [6]. A theoretical rationale for preferring smaller weights, obtained with weight decay or early stopping in neural networks, was given by Bartlett [3].
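As a small sketch of the idea (the network weights, data loss, and decay coefficient below are illustrative assumptions), weight decay simply adds the penalty Ω = (1/2) Σ_i w_i^2, scaled by a coefficient λ, to the cost being minimized:

```python
import numpy as np

def loss_with_weight_decay(weights, data_loss, lam=1e-3):
    """Total cost = data term + lambda * (1/2) * sum of squared weights."""
    penalty = 0.5 * sum(np.sum(w ** 2) for w in weights)
    return data_loss + lam * penalty

# Illustrative use: two weight matrices of a small network and a dummy data loss.
W1, W2 = np.ones((4, 3)), np.ones((3, 1))
print(loss_with_weight_decay([W1, W2], data_loss=0.42))
```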
Case study: Regularizing AdaBoost
- Applying penalties
- Introducing slack variables
- Modifying the cost function
- Removing examples
The standard way of adding a penalty to the cost function being minimized can be applied directly to AdaBoost. Manipulating the cost function so that it penalizes more difficult examples is again a form of regularization for boosting techniques.
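One way to make this concrete, purely as an illustration and not the exact scheme discussed here: cap the example weights in AdaBoost's reweighting step so that persistently misclassified (difficult) examples cannot dominate the cost; the cap value below is an assumption.

```python
import numpy as np

def reweight(weights, margins, alpha, cap=5.0):
    """One AdaBoost-style reweighting step with an added limit on difficult examples.

    margins[i] = y_i * h(x_i) in {-1, +1}; the cap keeps persistently
    misclassified (difficult) examples from dominating the cost.
    """
    w = weights * np.exp(-alpha * margins)   # standard exponential reweighting
    w = np.minimum(w, cap * np.mean(w))      # limit the influence of difficult examples
    return w / np.sum(w)                     # renormalize to a distribution

w = np.full(6, 1 / 6)
margins = np.array([1, 1, 1, 1, -1, -1])     # last two examples are misclassified
print(reweight(w, margins, alpha=0.8))
```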
Learning with queries
Learning in the presence of classification noise
A validation set [30] is used to determine the most difficult cases to remove.
Learning in the presence of malicious noise
The authors provide hardness results for learning arbitrary concepts in this model as well as constructive learning algorithms for learning certain concept classes.
Active learning
Outliers in a probabilistic setting
Data valuation
Exhaustive learning
Since π is unknown, it is approximated by the leave-one-out error, i.e. the error measured when the sample in question is excluded from the training set. This gives information on whether the sample, on average, contradicts the rest of the samples with respect to all possible hypotheses. In this setting we want to identify examples that are difficult for the learning process for this particular training set and this particular model, and that could cause the model to overfit and thus degrade its generalization performance.
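A minimal sketch of this leave-one-out check (the nearest-class-mean learner and the toy data are placeholder assumptions; any learner from the model class could be plugged in):

```python
import numpy as np

def loo_disagreement(X, y, fit, predict):
    """Leave-one-out check: does example i contradict a model trained on the rest?"""
    flags = np.zeros(len(y), dtype=bool)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        model = fit(X[keep], y[keep])
        flags[i] = predict(model, X[i:i+1])[0] != y[i]   # misclassified when left out
    return flags

# Placeholder learner: nearest class mean (any learner could be substituted).
def fit(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(model, X):
    classes = list(model)
    d = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array(classes)[np.argmin(d, axis=0)]

X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [0.05]])
y = np.array([0, 0, 0, 1, 1, 1])             # last label contradicts its neighbors
print(loo_disagreement(X, y, fit, predict))  # flags the suspicious example
```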
In the next chapter we apply our method to recognizing object categories when the training data is contaminated, that is, the data may contain a large number of difficult or mislabeled examples, which are otherwise hard to model.
Problem formulation
The main conclusion is that using unlabeled data in addition to labeled data is quite useful, especially when labeling is expensive. These methods depend on the assumption that some reliably labeled data is available to start learning and evaluating from [8]. They also rely on the strong assumption that the data come from a particular distribution which can be modeled.
Nevertheless, we believe that in general we cannot assume that we know and can model the sources that generated the data, and that we want to use the available training data in less demanding discriminative models.
The essence of the data pruning problem
- When is data pruning necessary?
- Benefits of data pruning
- Difficult examples increase model complexity
- Regularization can benefit from data pruning
- Challenges for data pruning
Note that there are datasets where none of the examples are useless or harmful, as shown in Figure 3.1. In this section we have seen that it is possible to improve the generalization error by training on a (suitably chosen) subset of the training data, simply because difficult examples are present. We measure the observed complexity of the model after training on nested subsets of the training data, formed by adding more and more difficult examples.
We can observe that there are subsets of the data that could improve the generalization error.
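One way to make the nested-subset experiment concrete, purely as an illustration: sort examples from easy to hard by some difficulty score, train on growing prefixes, and record a complexity proxy for each subset. The toy data, the difficulty score, and the use of decision-tree size as a complexity proxy below are all assumptions, not the experiment reported here.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Toy 2-class data with a few flipped labels acting as difficult examples.
X = rng.normal(size=(200, 2)) + np.repeat([[0, 0], [2, 2]], 100, axis=0)
y = np.repeat([0, 1], 100)
flip = rng.choice(200, size=10, replace=False)
y[flip] = 1 - y[flip]

# Rough difficulty score: disagreement with a small tree trained on all the data.
difficulty = (DecisionTreeClassifier(max_depth=3, random_state=0)
              .fit(X, y).predict(X) != y).astype(float)
order = np.argsort(difficulty)                # easiest examples first

for n in (100, 150, 200):                     # nested subsets, growing difficulty
    idx = order[:n]
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    print(n, "examples -> tree size:", tree.tree_.node_count)
```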
Data pruning
- Overview of the approach
- Multiple models and data subsets are needed
- Collecting multiple learners’ opinions
- Pruning the data
- Combining classifiers’ votes
- Naive Bayes classification
- Why pruning the data?
- Experiments
Other ways to create classifier diversity are to select different subsets of the input features, to select slightly overlapping subsets of the data, or even to inject noise into the training data [10]. We are interested in finding the probability p(y|X) that an example has label y, given the data, modeled and estimated from the data or, in our case, from the predictions of the multiple learners.
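A hedged sketch of combining the learners' votes under a naive independence assumption; the per-learner accuracy estimates and the class prior below are placeholders, not values from the experiments.

```python
import numpy as np

def naive_bayes_vote(votes, accuracies, prior=0.5):
    """p(y=1 | votes) assuming the learners err independently.

    votes[k] in {0, 1} is learner k's predicted label;
    accuracies[k] estimates how often learner k is correct.
    """
    p1, p0 = prior, 1.0 - prior
    for v, a in zip(votes, accuracies):
        p1 *= a if v == 1 else 1.0 - a   # likelihood of this vote if the true label is 1
        p0 *= a if v == 0 else 1.0 - a   # likelihood of this vote if the true label is 0
    return p1 / (p1 + p0)

votes = np.array([1, 1, 0, 1, 1])
accuracies = np.array([0.8, 0.7, 0.6, 0.9, 0.75])
print(naive_bayes_vote(votes, accuracies))   # high probability that the label is 1
```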
Data pruning is applied to the original dataset as well as to data with artificially introduced label noise.
Discussion
The markers indicate the average test errors over 100 runs on the original data, and the lines show the test errors on the same data with 10% label noise. In data pruning we addressed a more challenging problem: determining which examples would be difficult to learn without using any kind of validation set, additional data, or prior information. The final learning is done on the pruned set using a single learner from the learning model, not a combination of learners.
It is no wonder that various methods that try to cope with noise use multiple classifiers for this purpose. If nothing else is known about the data and noisy samples are present, the only way to find out which examples are noisy, or to otherwise ignore them, is to use multiple projections of the data, hoping that the noisy observations will affect the classifiers in contradictory ways while the good examples will consistently support the same classification.
Conclusion
So, among the returned images, there would be many that contain the target category, but also many images that are not relevant to the target. In our case we have a set of examples that contains instances of the target but may also include many mislabeled instances. We can query for more data of the same type, but we cannot access the actual instance labels.
In this section, we apply the data pruning method to clean contaminated face data, and modify it to take advantage of the fact that we are learning visual categories.
Experimental setup
Training data
This is useful, especially for target images, because there may be examples with poor lighting or extreme pose variations that the learning machine cannot accommodate and will generalize better if trained without them. In active learning, we can have many unlabeled examples and assume that we can query the object label at any time, but this query is more expensive than retrieving an unlabeled example [13].
Feature projections
From each filter channel, sums of pixels in sub-regions of the image, represented as rectangular activity masks, are obtained (Figure 4.3). To select stable features, pixels in larger regions must be used in order to accommodate variability in terms of local translations, rotations, etc. The easiest way to achieve this is to sum the pixels in the region, as proposed by Viola and Jones [42].
Therefore, we used the sum of pixels in different filter channels in the original image instead.
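For illustration only (the image size and the mask coordinates are assumptions), such rectangular sums can be computed efficiently with an integral image, in the spirit of Viola and Jones [42]:

```python
import numpy as np

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of pixels in rows r0..r1-1, cols c0..c1-1 using an integral image ii."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

img = np.random.default_rng(0).random((32, 32))
# Integral image with a zero row/column prepended so indexing stays simple.
ii = np.pad(img, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)

# Feature: sum of pixels inside an example rectangular mask of the (filtered) image.
print(rect_sum(ii, 4, 4, 16, 20), img[4:16, 4:20].sum())  # the two values agree
```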
Learning algorithms
The face dataset, as seen in Figure 4.1, contains quite challenging images with great variation in terms of lighting, pose, image quality, etc. Regardless of whether the faces are aligned, pixel-wise correspondence, or even correspondence within small neighborhoods, is highly unlikely to yield a consistent response among the face samples. Note that we used only first-moment statistics (sums of pixels) in the rectangular regions, while Murphy et al. [31], who designed features to discriminate between many objects, used second- and fourth-moment statistics.
Data pruning for visual data
Generating semi-independent learners
Visual object categories allow more freedom in training semi-independent learners because they offer many potential feature projections. Note that this is more difficult to do with a dataset with a small number of input dimensions. To produce semi-independent learners, we can split the image into slightly overlapping areas.
We use regions defined similarly to those in Figure 4.3, but drop the rectangles that cover too large an area, such as the first four, so that we do not obtain fully dependent classifiers, resulting in a total of 23 learners; see Figure 4.4.
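A minimal sketch of how such region-based learners might be produced; the grid size, overlap, and the placeholder training call are assumptions rather than the exact regions of Figure 4.4.

```python
import numpy as np

def region_slices(h, w, grid=3, overlap=4):
    """Split an h x w image into a grid of slightly overlapping regions."""
    regions = []
    for i in range(grid):
        for j in range(grid):
            r0, r1 = max(0, i * h // grid - overlap), min(h, (i + 1) * h // grid + overlap)
            c0, c1 = max(0, j * w // grid - overlap), min(w, (j + 1) * w // grid + overlap)
            regions.append((slice(r0, r1), slice(c0, c1)))
    return regions

# Each region would feed its own learner; here we only show the feature extraction.
imgs = np.random.default_rng(0).random((5, 24, 24))    # 5 toy training images
for rs, cs in region_slices(24, 24):
    features = imgs[:, rs, cs].reshape(len(imgs), -1)  # per-learner feature projection
    # train_learner(features, labels)  # placeholder: one semi-independent learner per region
```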
Data pruning results
As we can see, pruning the data is useful, especially when a lot of noise is present. In the second comparison, both SVM and AdaBoost use the same pruning mechanism, and we can observe only the different responses of the algorithms to different noise levels. The number in the first column of Tables 4.2 and 4.3 is the proportion of detected mislabeled examples among those with artificially flipped labels, while the number of false alarms is measured using the full data.
In fact, some face examples marked for cropping (shown in red boxes in Figures 4.9 and 4.8) are quite difficult and should not be present in the training data in the first place.
Pruning very noisy data
Conclusion
With this work, we have shown that the quality of the training examples is also important for better generalization. We demonstrate the applicability of the approach to very challenging face datasets with large amounts of noise.