One of the key challenges is to automatically learn latent features from dynamically changing multivariate inputs. For high-dimensional multivariate time series, designing an appropriate CNN model structure is difficult because the kernels may need to extend over the entire depth of the input volume. The proposed EM method assumes a Gaussian Mixture Model (GMM), and the parameters of the GMM are updated together with the parameters of the deep neural network by end-to-end backpropagation learning.
Conv 1, Conv 2, and Conv 3 of the model include an additional subsampling layer after the convolution layer. Moreover, new and complicated fields that require more accurate analytical methods, such as bioinformatics, emerge as technology develops, while existing applications also become more complex and demanding. A convolutional neural network (CNN) can also be used as a feature extractor by computing 1D convolutions over segments of multiple time series.
The parameters of the GMM are updated together with the neural network parameters using a backpropagation algorithm. By adding the likelihood function of the EM algorithm to the cost function, the model can be optimized to maximize the GMM likelihood while minimizing the cost function of the network. This EM method learns the group structure of the connections between the network layers with an end-to-end learning algorithm and decouples the weights that do not belong to the group with the strongest connection to each output node. In Section II, we will briefly explain some of the background knowledge for our proposed method, followed by the related work in Section III.
By rearranging the ordering for the purpose of visualization, the resulting matrix can be described as a block matrix.
Convolutional Neural Network (CNN)
Second, the nonlinear subsampling layer samples small regions of the input layer using nonlinear functions that reduce the size of the layer. The most common choice for this local downsampling is max pooling, but averaging or other methods can also be used. The idea behind local subsampling is that once a feature has been detected, its absolute location is not as important as its location relative to other features.
By reducing the dimensionality, it makes the network less sensitive to locality and reduces computational complexity [10, 11]. After the convolutional and subsampling layers produce the feature maps, the fully connected layers perform reasoning on the extracted features to produce the actual result. Typically, a CNN consists of alternating convolutional and subsampling layers at the bottom, followed by several fully connected layers, as in the example in Figure 3.
The model receives input data consisting of three channels, such as generic image files, and extracts local features through two sets of convolution and subsampling layers. First, a cost function J(θ|x, t) is defined to be minimized, where θ is the set of parameters and t is the target output. The parameters of the last layer are updated using a gradient descent method that minimizes the cost function, and then the error is propagated back to update the parameters of the previous layer, until it reaches the first layer and all the parameters in the network have been updated.
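As a hypothetical illustration of this pipeline (the layer sizes and optimizer settings below are illustrative assumptions, not the model described above), the following PyTorch sketch stacks two convolution/subsampling stages and fully connected layers, and performs one gradient-descent update of a cost J(θ|x, t):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallCNN(nn.Module):
    """Two convolution/subsampling stages followed by fully connected layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=5)   # 3 input channels
        self.conv2 = nn.Conv2d(16, 32, kernel_size=5)
        self.fc1 = nn.Linear(32 * 5 * 5, 128)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # convolution + subsampling
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.flatten(1)                            # feature maps -> vector
        x = F.relu(self.fc1(x))
        return self.fc2(x)

model = SmallCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 3, 32, 32)          # a batch of 3-channel inputs
t = torch.randint(0, 10, (8,))         # target labels

cost = F.cross_entropy(model(x), t)    # J(theta | x, t)
optimizer.zero_grad()
cost.backward()                        # propagate the error backwards
optimizer.step()                       # gradient-descent parameter update
```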
The local computations of CNNs not only reduce the memory load but also improve the classification performance. Also, a recurrent variant of CNN, called recurrent convolutional neural network (RCNN) [5], showed state-of-the-art performance on object recognition problems [12].
Mixture Models
This is the weighted sum of the base distributions p_k, that is, p(x) = Σ_k π_k p_k(x), and the variables π_k are the mixture weights or mixture coefficients of the mixture model. Mixture models can be viewed in terms of a latent component assignment that is never explicitly observed: first, a component k is chosen with probability given by the mixing coefficient π_k, and then an observation is generated according to the corresponding base distribution.
Another way to look at mixture models is to treat the distribution of each component as a single cluster and to separate the data points by cluster. To achieve this clustering, the posterior distribution p(z_k = 1 | x_i) is used, which is the probability that the data point x_i belongs to the kth cluster.
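As a minimal sketch of this clustering view, assuming Gaussian base distributions and illustrative parameter values, the following NumPy code evaluates the posterior responsibilities p(z_k = 1 | x_i) by Bayes' rule and assigns each point to its most responsible component:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, covs):
    """Posterior p(z_k = 1 | x_i) for a mixture with K Gaussian components."""
    K = len(pis)
    # weighted densities pi_k * p_k(x_i) for every point and every component
    weighted = np.stack(
        [pis[k] * multivariate_normal.pdf(X, mus[k], covs[k]) for k in range(K)],
        axis=1,
    )
    # normalize over components (Bayes' rule)
    return weighted / weighted.sum(axis=1, keepdims=True)

# toy two-component example with illustrative parameters
X = np.random.randn(100, 2)
pis = np.array([0.5, 0.5])
mus = [np.array([-1.0, 0.0]), np.array([1.0, 0.0])]
covs = [np.eye(2), np.eye(2)]

gamma = responsibilities(X, pis, mus, covs)   # shape (100, 2)
clusters = gamma.argmax(axis=1)               # hard cluster assignment per point
```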
Gaussian Mixture Model
EM Algorithm
EM for GMMs
Expected Log Likelihood Function
E step
M step
Similarly, if we take derivatives with respect to Σ_k and π_k respectively and follow the same reasoning, we obtain the corresponding results. After computing the new estimates, the parameters are updated for k = 1, ..., K, and the updates are repeated until the log-likelihood estimate converges, as in Algorithm 1.

RNNs and LSTMs are popular choices when ANNs are used to solve temporal problems. However, there have recently been several works that used CNNs and variant models, such as RCNN, to deal with time series data.
They showed that deep convolutional neural networks can be used as an automatic feature extractor for multichannel time series. Ordóñez and Roggen [16] presented a deep neural network (DNN) called DeepConvLSTM that combined CNN and LSTM for wearable activity recognition. DeepConvLSTM used LSTM blocks instead of fully connected layers, so that the convolutional layers act as feature extractors and the subsequent LSTM layers model the temporal dynamics of the extracted features.
The authors of [24] proposed multi-channel deep convolutional neural networks (MC-DCNN) for multivariate time series classification. They separated the multivariate time series into multiple univariate ones and stacked individual convolutional and subsampling layers for each time series. The final activations of the individual submodels were concatenated and processed with fully connected layers.
Their model is similar to ours if we set the number of clusters equal to the number of time series. However, MC-DCNN does not allow information to be shared across series at the convolutional layer level, while our method allows information from variables in the same group to be mixed. Also, the grouping can change at each layer, and the group structure can be trained by end-to-end learning.
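For concreteness, the sketch below is a rough PyTorch-style reconstruction of the MC-DCNN idea described above, with one 1D convolutional branch per univariate series concatenated before the fully connected layers; the layer sizes are assumptions, not the configuration used in [24]:

```python
import torch
import torch.nn as nn

class MCDCNNLike(nn.Module):
    """One conv/subsampling stack per univariate series, merged by FC layers."""
    def __init__(self, n_series, seq_len, num_classes):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(1, 8, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.MaxPool1d(2),
            )
            for _ in range(n_series)
        )
        self.fc = nn.Linear(n_series * 8 * (seq_len // 2), num_classes)

    def forward(self, x):                           # x: (batch, n_series, seq_len)
        # each branch sees only its own univariate series
        feats = [branch(x[:, i:i + 1, :]) for i, branch in enumerate(self.branches)]
        return self.fc(torch.cat(feats, dim=1).flatten(1))

model = MCDCNNLike(n_series=3, seq_len=64, num_classes=2)
out = model(torch.randn(4, 3, 64))                  # output shape (4, 2)
```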
Lasso regularization, based on the least absolute shrinkage and selection operator (LASSO) algorithm [20], performs both regularization and variable selection by bounding the l1 norm of the weights. Yuan and Lin [22] considered the case in which the variables are structured into K groups and proposed the group lasso method. When the variables can be divided into K different groups and the group memberships are given, only a subset of the groups is selected and contributes to the output.
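As a small illustration of the group lasso penalty, the following sketch (with an assumed weight vector and given group memberships) sums the scaled l2 norms of the K groups; a group whose weights shrink to zero is effectively deselected, which is the group-level variable selection described above:

```python
import numpy as np

def group_lasso_penalty(w, groups, lam=0.1):
    """Sum over groups of sqrt(|g|) * ||w_g||_2, as in the group lasso."""
    penalty = 0.0
    for g in groups:                      # g is a list of indices into w
        penalty += np.sqrt(len(g)) * np.linalg.norm(w[g])
    return lam * penalty

w = np.array([0.5, -0.2, 0.0, 0.0, 1.3, 0.7])
groups = [[0, 1], [2, 3], [4, 5]]         # K = 3 given group memberships
print(group_lasso_penalty(w, groups))     # the all-zero group adds nothing
```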
EM Method
Optimize parameters
In the derivation, F is replaced by a term within the exponential function of the normal distribution.
Applying to CNNs
In the experiments, we tested our method on the MNIST handwritten digit dataset to compare the performance of CNNs that contain one or more group convolutional layers with that of a CNN that does not.
Experimental Setup
Results
The group convolution CNNs learn faster than the baseline CNN model: their training errors decrease faster, and their validation errors start to increase earlier than those of the baseline. To prevent this, the gradients from the NLL_Q terms are blocked after certain training periods: after 150 epochs for CNN-2, and after 80 epochs for CNN-3 and CNN-4.
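A rough sketch of this training schedule, assuming a PyTorch-style setup in which the model returns both the task output and the GMM negative log-likelihood term (written nll_q below; the names and the weighting coefficient are illustrative assumptions, not the exact implementation):

```python
import torch
import torch.nn.functional as F

# Illustrative settings; the model is assumed to return both the task logits
# and the NLL_Q term computed from its group convolutional layers.
NUM_EPOCHS, BLOCK_EPOCH, LAM = 200, 80, 0.1   # e.g. BLOCK_EPOCH = 150 for CNN-2

def train(model, optimizer, train_loader):
    for epoch in range(NUM_EPOCHS):
        for x, t in train_loader:
            logits, nll_q = model(x)
            if epoch >= BLOCK_EPOCH:
                nll_q = nll_q.detach()        # block the gradient from the NLL_Q term
            loss = F.cross_entropy(logits, t) + LAM * nll_q
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```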
The CNN and CNN-2 models were retrained for 500 epochs to observe the change in the parameters. The numbers in the first row are from the model whose loss plots are shown in Figure 8, and the numbers in the second row are from the model in which the gradient of the NLL_Q term is stopped after 80 epochs of training. Whether the gradient is stopped or not, the number of parameters decreases to the same level, and the network learns the group structure, as shown in Figure 12.
Since the number of kernels decreases later when the gradient is stopped, we can infer that the pruning happened too early when the gradient is not blocked. Machine learning methods are used to handle these problems, and CNNs can be used to extract features from multivariate sequential data. The EM method can learn the group structure of the convolutional layers by end-to-end backpropagation learning.
The group structure can be exploited by reducing redundant kernels and placing more emphasis on strong connections. In experiments with the MNIST dataset, CNNs with a group convolution layer trained faster than the baseline CNN model, building more connections in the early learning phase and trimming unnecessary connections based on the learned group structure. The experiments showed that deep neural networks can be expressed with fewer connections if the cluster structure is used.
The EM method was only applied to CNNs for the time series dataset, but in future work it can be applied to other types of ANNs, such as multilayer perceptrons (MLPs). The group convolution computation should also be optimized, as it takes much more time than normal convolution due to the probability distribution calculations of the GMM.
I would also like to thank all the members of the Statistical Artificial Intelligence Lab who spent a lot of time together in the same place. I would like to thank my friends who were there for me when I needed it and supported me to complete my research.