Combined Group and Exclusive Sparsity for Deep Neural Networks


Introduction

Related Work

Sparsity for Deep Neural Networks

The simplest yet most popular method for removing weights is pruning weak weights by simple thresholding (Reed, 1993; Han et al., 2015). Collins & Kohli (2014) applied ℓ1 regularization to convolutional neural networks, showing that it can yield a compact, memory-efficient network at the cost of a small reduction in prediction accuracy. A few recent works have applied group sparsity regularization (Yuan & Lin, 2006) to deep networks, as it has some attractive properties.

By removing an entire feature group, group sparsity can automatically determine the number of neurons (Alvarez & Salzmann, 2016). Furthermore, if applied between the weights at different layers, it can also be used to determine the optimal number of layers for a given network (Wen et al., 2016). In terms of efficiency, structured sparsity using the (2,1)-norm exhibits better data locality than non-structured sparsity, and leads to larger speedups (Wen et al., 2016; Alvarez & Salzmann, 2016).
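
As a reference point, the two regularizers discussed above can be written in a few lines. The sketch below (plain NumPy, with a hypothetical fully connected weight matrix) computes the ℓ1 penalty and the (2,1)-norm group penalty; grouping each row of the matrix as the weights attached to one input neuron is an illustrative convention, not a prescription from the paper.

```python
import numpy as np

def l1_penalty(W):
    """Element-wise l1 sparsity: sum of absolute values of all weights."""
    return np.abs(W).sum()

def group_l21_penalty(W):
    """(2,1)-norm group sparsity: sum of the l2 norms of the groups.
    Here each row of W (all weights attached to one input neuron) is one
    group, so driving a whole row to zero removes that neuron entirely."""
    return np.sqrt((W ** 2).sum(axis=1)).sum()

# Toy 4-input x 3-output fully connected layer (hypothetical values).
W = np.array([[0.5, -0.2, 0.1],
              [0.0,  0.0, 0.0],   # this input neuron is already pruned
              [0.3,  0.4, -0.1],
              [0.2, -0.3, 0.2]])
print(l1_penalty(W), group_l21_penalty(W))
```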

We also use group sparsity in our combined regularizer, but we mainly group the features across multiple filters to promote feature sharing between the filters. Although all the previously introduced models do help reduce the number of parameters and yield some speedup, such memory and time efficiency is mostly gained at the cost of reduced accuracy.

Exclusive Feature Learning

We consider the general learning objective min_𝑾 ℒ(𝑾) + 𝜆 ∑_{𝑙=1}^{𝐿} 𝛺(𝑾^(𝑙)), where {𝒙_i, 𝑦_i} is the training data, 𝑦_i ∈ {1, …, 𝐾} is its class label (one of the K classes), {𝑾^(𝑙)} is the set of weights over all layers, ℒ(𝑾) is the loss parameterized by 𝑾, 𝐿 is the total number of layers, 𝑾^(𝑙) is the weight matrix (or tensor) for layer 𝑙, 𝛺(𝑾^(𝑙)) is some regularization term on the network weights at layer 𝑙, and 𝜆 is the regularization parameter that balances the loss with the regularization. Yet group sparsity alone does not maximally utilize the capacity of the network, as there can still be redundancy among the features used by different weights. Exclusive sparsity can be applied directly to the fully connected layers of a deep network, by grouping the network weights of the same neuron at each layer into one group and applying the (1,2)-norm to these groups (see Figure 1(b)).
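
A minimal sketch of how the regularized objective above is assembled: the per-layer penalty Ω is summed over all L layers and added to the data loss with weight λ. The loss value, layer shapes, and the simple ℓ1 penalty plugged in as Ω are illustrative placeholders; only the structure follows the formulation in the text.

```python
import numpy as np

def regularized_objective(data_loss, layer_weights, omega, lam):
    """data_loss: the loss L(W) already computed on a mini-batch.
    layer_weights: list [W_1, ..., W_L] of per-layer weight arrays.
    omega: per-layer regularizer Omega, e.g. an l1, (2,1)- or (1,2)-norm penalty.
    lam: regularization parameter balancing loss and penalty."""
    penalty = sum(omega(W) for W in layer_weights)
    return data_loss + lam * penalty

# Example with hypothetical layer sizes and an l1 penalty as Omega.
layers = [np.random.randn(784, 300) * 0.01, np.random.randn(300, 10) * 0.01]
print(regularized_objective(0.42, layers, lambda W: np.abs(W).sum(), lam=1e-4))
```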

For the convolutional layers of a convolutional neural network, exclusive sparsity can be applied in the same way as in fully connected layers, by applying the same regularizer to the convolutional filters. We observe improvements on all models, demonstrating the effectiveness of the exclusive sparsity regularizer for all types of network weights. Visualization of the 1st convolutional-layer filters of the network trained on the CIFAR-10 dataset: (a) ℓ2 regularization leads to smooth, non-sparse filters.

The exclusive sparsity regularizer, when used alone, results in a separate feature selection for each class. To show that this effect is not limited to the fully connected layers, in Fig. 6 we also visualize the convolutional filters in the first convolutional layer of the network trained on the CIFAR-10 dataset.

Fig 1. Illustration of the regularizers: (a) When grouping weights from the same input neuron into each group, the group sparsity has an effect of completely removing some neurons that are not shared across different weights (highlighted in red)

Approach

Exclusive Sparsity Regularization for Deep Neural Networks

  • Exclusive sparsity on convolutional filters

Exclusive sparsity, or the exclusive lasso, was first introduced by Zhou et al. (2010) in the multi-task learning context. The main idea of that work is to force the model parameters for different tasks to compete for features, instead of sharing features as suggested by previous work on multi-task learning with the group lasso. For classification, this makes sense since the goal is to distinguish between classes, which can be achieved by identifying discriminative features for each class.

This norm is often called the (1,2)-norm, and is essentially the ℓ2 norm over ℓ1-norm groups. Sparsity is enforced within each group, unlike with the group sparsity regularizer, which promotes sparsity between groups. This forces each output neuron to compete for input neurons, resulting in largely dissimilar network weights at each layer.
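
A small sketch of the penalty described here, under the grouping of Figure 1(b): the weights feeding each output neuron form a group, the ℓ1 norm is taken inside each group, and the squared group norms are summed. Treating columns as output neurons and using one half of the squared (1,2)-norm are illustrative conventions, not taken verbatim from the paper.

```python
import numpy as np

def exclusive_sparsity_penalty(W):
    """Exclusive sparsity penalty based on the (1,2)-norm.
    Assumed grouping: each column of W holds the input weights of one
    output neuron, so every output neuron is one group.  The l1 norm is
    taken inside each group and the squared group norms are summed,
    pushing each output neuron to keep only a few input neurons."""
    group_l1 = np.abs(W).sum(axis=0)        # l1 norm per output neuron
    return 0.5 * (group_l1 ** 2).sum()      # half the squared (1,2)-norm

# Toy 3-input x 3-output weight matrix (hypothetical values).
W = np.array([[0.5, 0.0, 0.1],
              [0.0, 0.7, 0.0],
              [0.3, 0.0, 0.0]])
print(exclusive_sparsity_penalty(W))
```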

We apply Eq. 4 to the convolutional filters, where each group 𝑔 is defined as the same feature across multiple convolutional filters. This forces the convolutional filters to be as different from each other as possible, removing redundancies between them.
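
The same idea for convolutional layers, as a hedged sketch: assuming a filter bank of shape (num_filters, channels, height, width), each group collects the weights at one (channel, row, col) position across all filters, so the filters compete for each spatial input feature. The tensor layout is an assumption for illustration.

```python
import numpy as np

def exclusive_sparsity_conv(filters):
    """filters: array of shape (num_filters, channels, height, width).
    Each group gathers the same (channel, row, col) entry across all
    filters; the l1 norm within a group and the sum of squared group
    norms push different filters toward different input features."""
    num_filters = filters.shape[0]
    flat = filters.reshape(num_filters, -1)   # rows: filters, cols: positions
    group_l1 = np.abs(flat).sum(axis=0)       # l1 over filters, per position
    return 0.5 * (group_l1 ** 2).sum()

# Hypothetical bank of eight 3x5x5 filters.
filters = np.random.randn(8, 3, 5, 5) * 0.1
print(exclusive_sparsity_conv(filters))
```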

Combined Group and Exclusive Sparsity Regularization

While such a model can be applied to the final softmax layer, where each class has its own softmax weights, such a taxonomy does not exist for the intermediate-layer network weights, nor can one be efficiently obtained by clustering or other means. Therefore, we propose to simply combine group sparsity and exclusive sparsity, which yields a similar effect: the network weights show some degree of sharing where they are correlated, but learn to differ on the non-shared parts. One simple solution is to set all 𝜇_𝑙 to a single constant, but a better way is to set them differently at each layer, depending on the level of sharing and competition required at each layer.

At lower layers, the features are quite general and may need to be shared by all higher-level neurons to accurately represent the input data, while at the top softmax layer it is better to have the feature-selection weights be as disjoint as possible for the best discrimination between classes.
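
A sketch of the combined penalty under explicit assumptions: each layer mixes the exclusive sparsity term and the group sparsity term with weight 𝜇_𝑙, and 𝜇_𝑙 is increased toward the top layers so that lower layers lean on sharing and upper layers on competition. The convex-combination form, the row/column grouping, and the linear 𝜇_𝑙 schedule are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def group_penalty(W):
    # (2,1)-norm with rows (input neurons) as groups
    return np.sqrt((W ** 2).sum(axis=1)).sum()

def exclusive_penalty(W):
    # half the squared (1,2)-norm with columns (output neurons) as groups
    return 0.5 * (np.abs(W).sum(axis=0) ** 2).sum()

def combined_penalty(layer_weights):
    """Combined group + exclusive sparsity summed over all layers.
    mu grows linearly with depth: low layers favor group sparsity
    (feature sharing), top layers favor exclusive sparsity (competition)."""
    L = len(layer_weights)
    total = 0.0
    for l, W in enumerate(layer_weights):
        mu = (l + 1) / L                     # illustrative schedule in (0, 1]
        total += mu * exclusive_penalty(W) + (1.0 - mu) * group_penalty(W)
    return total

# Hypothetical two-layer network.
layers = [np.random.randn(784, 300) * 0.01, np.random.randn(300, 10) * 0.01]
print(combined_penalty(layers))
```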

Numerical Optimization

When Ω is a group sparsity regularizer or an exclusive sparsity regularizer, the proximal-update problem at each step has a closed-form solution. The exclusive sparsity improves performance over the base ℓ2-regularized model in the low-sparsity region, which is particularly evident in the CIFAR-10 results, but degrades performance as sparsity increases. Finally, our combined group and exclusive sparsity, which allows some degree of sharing between the weights/features while enforcing exclusivity, achieves the best accuracy/parameter trade-off, attaining a similar or better performance gain than exclusive sparsity alone while also greatly reducing the number of parameters.
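
The closed forms referred to here can be sketched per group as follows, assuming the standard proximal operators: block soft-thresholding for the group (ℓ2) term, and soft-thresholding with a data-dependent threshold for the exclusive (squared ℓ1) term. These are the textbook forms derived from the optimality conditions, not copied from the paper.

```python
import numpy as np

def prox_group(v, lam):
    """Proximal operator of lam * ||.||_2 at v (one group):
    block soft-thresholding that shrinks the whole group toward zero."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return v
    return max(0.0, 1.0 - lam / norm) * v

def prox_exclusive(v, lam):
    """Proximal operator of (lam / 2) * (sum_i |x_i|)^2 at v (one group).
    Solution: x_i = sign(v_i) * max(|v_i| - tau, 0) with
    tau = lam * S_k / (1 + lam * k), where S_k sums the k largest
    magnitudes that stay above the threshold."""
    a = np.sort(np.abs(v))[::-1]           # magnitudes, descending
    cumsum = np.cumsum(a)
    k = np.arange(1, len(a) + 1)
    taus = lam * cumsum / (1.0 + lam * k)
    active = a > taus                       # prefix of entries that survive
    tau = taus[active][-1] if active.any() else 0.0
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

# Toy group of weights (hypothetical values).
v = np.array([0.8, -0.3, 0.05, -0.6])
print(prox_group(v, lam=0.2), prox_exclusive(v, lam=0.2))
```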

The network regularized with exclusive sparsity at all layers performed better at higher sparsity than models that used ES only at the convolutional or only at the fully connected layers. Interestingly, we empirically found that our exclusive sparsity regularizer also helps the network reach the same error in far fewer iterations than base ℓ2 regularization (Figure 4(a)). The first model, ES-Increasing, is the actual combination we used in our method, which increases the effect of exclusive sparsity with increasing 𝑙.

We observe that the combined group and exclusive sparsity regularization results in filters that are much sharper than those obtained with ℓ1 or group sparsity regularization, with some spatial features completely removed due to competition with the other filters. Specifically, we propose to use an exclusive sparsity regularization based on the (1,2)-norm on the network weights, together with a group sparsity regularization using the (2,1)-norm, such that the exclusive sparsity forces the network weights to use input neurons that are as different as possible from those of the other weights, while the group sparsity allows some degree of sharing between them, since it is impossible for the network weights to fit completely disjoint sets of features.

Experiment

Baselines and Our Models

  • The network regularized with the ℓ2,1-norm on the weights, grouping each convolutional filter as one group at the convolutional layers.
  • The network using the same ℓ2,1 regularization as the previous baseline, but with each group defined as the same feature at different filters (the sketch below contrasts the two groupings).
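
To make the difference between these two ℓ2,1 groupings concrete, a small sketch under an assumed (num_filters, channels, height, width) layout: the first treats each whole filter as a group, the second treats the same position across all filters as a group.

```python
import numpy as np

def l21_per_filter(filters):
    """Group = one whole convolutional filter (can delete entire filters)."""
    flat = filters.reshape(filters.shape[0], -1)
    return np.sqrt((flat ** 2).sum(axis=1)).sum()

def l21_per_feature(filters):
    """Group = same (channel, row, col) feature across all filters
    (can delete one input feature from every filter at once)."""
    flat = filters.reshape(filters.shape[0], -1)
    return np.sqrt((flat ** 2).sum(axis=0)).sum()

# Hypothetical bank of eight 3x5x5 filters.
filters = np.random.randn(8, 3, 5, 5) * 0.1
print(l21_per_filter(filters), l21_per_feature(filters))
```
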

Datasets and Base Networks

The CIFAR-10 dataset consists of 60,000 images of size 32×32 from ten animal and vehicle classes (airplane, car, bird, cat, deer, dog, frog, horse, ship and truck). For the base network we use LeNet (LeCun et al., 1998), which has two convolutional layers followed by three fully connected layers. For CIFAR-100, the base network is a variant of Wide Residual Network (Zagoruyko & Komodakis, 2016), which has 16 layers with a widening factor of k = 10.

This is the dataset for the 2012 ImageNet Large Scale Visual Recognition Challenge (Deng et al., 2009), consisting of 1,281,167 training images from 1,000 generic object categories. For evaluation, we use the validation set of 50,000 images, following standard practice. For the MNIST and CIFAR-10 experiments, we train all networks from scratch; for the CIFAR-100 and ImageNet-1K experiments, where we use larger networks (WRN and AlexNet), we fine-tune from the ℓ2-regularized networks, since training them from scratch takes prohibitively long.

We report accuracy over the number of parameters and accuracy over FLOPs to see how each sparsity-inducing regularization affects model accuracy across different sparsity ranges. The reported results are the average accuracy over three runs (with random weight initialization), and the error bars indicate the standard errors for the 95% confidence interval.

Figure 3. Accuracy-efficiency trade-off. We report accuracy over the number of parameters and accuracy over FLOPs to see how each sparsity-inducing regularization at various sparsity ranges affects the model accuracy.

Quantitative Analysis

  • Iterative pruning
  • Convergence speed
  • Convolutional vs. fully connected layers
  • Sharing vs. competing for features

On the CIFAR-10 dataset, CGES improves classification accuracy over the ℓ2 baseline by 2.17%, using 13.72% fewer parameters and 35.67% fewer FLOPs. We report the accuracy of this combined model on the MNIST and CIFAR-10 datasets when using 10% of the parameters of the full network (Table 2). The results show that CGES combined with iterative pruning yields similar or even better results than the original model with only a fraction of the parameters, and significantly outperforms the basic pruning model, which suffers a significant accuracy loss.

To see what effect our combined regularizer has on different types of layers, we experiment with using it only on the fully connected layers or only on the convolutional layers, while using the regular ℓ2 regularizer for the other layers. Furthermore, ES yields larger improvements on the convolutional layers, which makes sense since lower-layer features are more important, being more general across classes than the features obtained at the fully connected layers. However, the convolutional layers achieved the best accuracy only in the low-sparsity region, since strict enforcement of exclusivity hurts the representational power of the features, while the fully connected layers achieved improvements even in the high-sparsity region; this may be because the loss of expressiveness can be compensated by better selection of high-level features.

We further investigate how varying degrees of sharing and competition affect the accuracy and efficiency of the model by experimenting with different configurations of 𝜇_𝑙 in the combined regularizer. The group sparsity regularizer results in the complete elimination of certain features that are not shared across multiple classes. However, when combined with the group lasso, the exclusive sparsity allows some degree of feature reuse while still yielding parameters that are largely different across classes.

In this work, we proposed a novel regularizer for general deep neural networks that efficiently utilizes network capacity by exploiting both the sharing and the competition relationships between different network weights.

Figure 4. Further Analysis of the exclusive sparsity on CIFAR-10 dataset. (a) Convergence speed:

Qualitative Analysis

Conclusion

Figures

Figure 2. Illustration of the effect of each regularizer on convolutional filters. (a) Group sparsity, when each group is defined as a filter, can result in complete elimination of some filters that are not shared among multiple high-level filters (Wen et al., 2016)

Table 1. Accuracy-efficiency trade-off on the ImageNet dataset.
