In particular, artificial intelligence technologies such as deep learning achieve state-of-the-art performance in the analysis of production big data, owing to their sophisticated structure and computing power. In this thesis, we propose deep learning techniques that can support and improve the monitoring systems currently found in most manufacturing lines. Deep learning models achieve this performance by automatically selecting important features.
We propose a deep learning model to diagnose the quality of an assembled product based on a measured audio signal or time series data.
Motivation
Research Objectives
Outline of the Thesis
Fault Detection Using Signal
The advantage of deep learning in error detection is that the user does not have to search for features. However, error detection must provide a basis for its judgment, and the decisions of a deep learning model are hard to justify because the model is difficult to interpret.
Motion Recognition
Cho and Chen [23] proposed a system for action recognition and visualization using a multi-layer perceptron. Another study [25] proposed an LSTM RNN for robust action recognition and obtained good results on the KTH dataset. A further study [27] proposed a pose-based CNN for action recognition and investigated the CNN features using automatically estimated and manually annotated human poses.
Studies using deep learning have tried to make the models more complex, rather than trying to find good features in the input data. However, this was done using a large data set with almost all the information the model needed to make a decision.
Artificial Neural Network
The most important characteristic of the sigmoid function is that its derivative can be expressed easily. However, as the network becomes deeper, the weights may stop updating or learning may slow down, because the value of the derivative shrinks [33].
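Both points can be illustrated with a short sketch. The sigmoid derivative can be written in terms of the function value itself, sigma'(x) = sigma(x)(1 - sigma(x)), and it never exceeds 0.25. The depth of 10 layers below is an arbitrary illustrative choice, and the bound ignores the weight terms that also enter each chain-rule factor.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    """Derivative expressed through the sigmoid value itself."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative is largest at x = 0, where it equals 0.25.
max_grad = sigmoid_grad(0.0)

# Upper bound on the product of activation derivatives across `depth`
# stacked sigmoid layers (depth is a hypothetical value for illustration).
depth = 10
bound = max_grad ** depth
print(f"max derivative: {max_grad}, bound after {depth} layers: {bound:.3e}")
```

Even in the best case, the contribution of the activation derivatives shrinks geometrically with depth, which is the slowdown described above.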
Convolutional Neural Network
Connectivity represents the complexity of the model: the sparser the connectivity, the less complex the model. In a CNN, each filter weight is used at every input position. By sharing these weights, convolution is more efficient than dense matrix multiplication in terms of memory requirements and statistical efficiency.
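The memory saving from weight sharing can be made concrete by counting parameters. The sizes below (a 28x28 single-channel input and one 3x3 filter, no padding, biases ignored) are hypothetical values chosen only for illustration.

```python
def dense_params(in_h: int, in_w: int, out_h: int, out_w: int) -> int:
    """Weights needed when every output unit connects to every input pixel."""
    return (in_h * in_w) * (out_h * out_w)

def conv_params(kernel_h: int, kernel_w: int) -> int:
    """Weights of one shared filter, reused at every input position."""
    return kernel_h * kernel_w

# Hypothetical sizes for illustration only.
in_h = in_w = 28
k = 3
out_h = in_h - k + 1   # "valid" convolution output height
out_w = in_w - k + 1   # "valid" convolution output width

dense = dense_params(in_h, in_w, out_h, out_w)
conv = conv_params(k, k)
print(f"dense: {dense} weights, convolution: {conv} weights")
```

Both layers produce a 26x26 output, but the convolution stores 9 weights where the dense layer stores 529,984.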
If the input image is shifted, the same shift appears in the output feature map. Pooling also reduces the dimensionality of the model, which helps avoid overfitting [36]. When we visually identify images, we do not look at the entire image; instead, we intuitively focus on its most important parts.
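A one-dimensional sketch shows both properties. It is a simplification of the two-dimensional case, and the kernel, signals, and pool size are made-up values for illustration.

```python
def correlate_valid(signal, kernel):
    """Slide the kernel over the signal without padding (as CNN layers do)."""
    n = len(signal) - len(kernel) + 1
    return [sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
            for i in range(n)]

def max_pool(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

kernel = [1, 0, -1]
signal = [0, 0, 1, 2, 1, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 2, 1, 0, 0]   # same pattern, moved two samples right

out = correlate_valid(signal, kernel)
out_shifted = correlate_valid(shifted, kernel)

# The response moves by the same two samples as the input.
print(out, out_shifted)

# Pooling shortens the feature map from 7 values to 3.
pooled = max_pool(out, 2)
print(pooled)
```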
As illustrated in Figure 7, the feature maps of the last convolutional layer can be interpreted as a collection of visual spatial locations that the model focused on. $M_c$ directly indicates the importance of the feature map at spatial grid location $(x, y)$ for class $c$. In the case of the CNN, the size of the feature map is reduced by the pooling layer.
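In the standard CAM formulation, the map is a weighted sum of the last convolutional feature maps, $M_c(x, y) = \sum_k w_k^c f_k(x, y)$, where $w_k^c$ is the weight connecting feature map $k$ to class $c$ after global average pooling. The following is a minimal NumPy sketch under that assumption, not the thesis implementation. The array sizes, the random values, and the nearest-neighbour upsampling factor are hypothetical.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Weighted sum of feature maps.

    feature_maps: array of shape (K, H, W) from the last convolutional layer.
    class_weights: array of shape (K,), weights of one class after
                   global average pooling.
    Returns an (H, W) map.
    """
    return np.tensordot(class_weights, feature_maps, axes=([0], [0]))

rng = np.random.default_rng(0)
K, H, W = 4, 5, 6                      # hypothetical sizes
features = rng.random((K, H, W))
weights = rng.normal(size=K)

cam = class_activation_map(features, weights)

# With global average pooling, the class score equals the mean of the CAM.
score = weights @ features.mean(axis=(1, 2))

# Because pooling shrank the feature map, the CAM is typically enlarged
# before being overlaid on the input; nearest-neighbour repetition is one
# simple option (scale is a hypothetical factor).
scale = 4
upsampled = np.kron(cam, np.ones((scale, scale)))
```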
Recurrent Neural Network
This loss of information leads to the vanishing gradient problem, which prevents the gradient from being transmitted correctly during learning. In addition, the previous hidden state information is overwritten through the activation function as the RNN processes the sequence. The LSTM uses a memory cell to store information, which helps it find and exploit long-range context.
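The memory cell and its gating can be sketched with the standard LSTM update. This is a minimal NumPy sketch, not the thesis implementation: the weight layout (`W`, `U`, `b`), the random initialization, and the input dimension of 30 are assumptions for illustration, while the 250-step sequence and 100 hidden nodes mirror the sizes reported in the motion recognition experiment.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(inputs, W, U, b, hidden_size):
    """Run a single-layer LSTM over a sequence.

    inputs: (T, D); W: (4H, D); U: (4H, H); b: (4H,).
    Returns the hidden states, shape (T, H).
    """
    h = np.zeros(hidden_size)   # hidden state
    c = np.zeros(hidden_size)   # memory cell
    outputs = []
    for x_t in inputs:
        z = W @ x_t + U @ h + b
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
        g = np.tanh(g)                                # candidate cell content
        c = f * c + i * g       # the cell is updated additively, not overwritten
        h = o * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
T, D, H = 250, 30, 100          # D is a hypothetical input dimension
x = rng.normal(size=(T, D))
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)

states = lstm_forward(x, W, U, b, H)
```

The forget gate `f` decides how much of the old cell content is kept, so information can survive many steps instead of being squashed by the activation function at every step.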
These gates can resolve the problem of information not being transmitted properly through the model [39]. For a motion classification problem, the model reads the motion sequence in chunks of one window size at a time. A bidirectional RNN uses both past and future context through forward and backward hidden layers [40].
The bidirectional RNN computes the forward hidden vector sequence, $\overrightarrow{h}$, the backward hidden vector sequence, $\overleftarrow{h}$, and the output sequence, $y$.
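In the usual formulation of a bidirectional RNN, the three sequences are computed as below. The symbol names ($W$ for weight matrices, $b$ for biases, $\mathcal{H}$ for the hidden layer function) are assumptions following common notation, since the thesis equations are not reproduced in this section.

```latex
\begin{align}
\overrightarrow{h}_t &= \mathcal{H}\left(W_{x\overrightarrow{h}}\, x_t
    + W_{\overrightarrow{h}\overrightarrow{h}}\, \overrightarrow{h}_{t-1}
    + b_{\overrightarrow{h}}\right) \\
\overleftarrow{h}_t &= \mathcal{H}\left(W_{x\overleftarrow{h}}\, x_t
    + W_{\overleftarrow{h}\overleftarrow{h}}\, \overleftarrow{h}_{t+1}
    + b_{\overleftarrow{h}}\right) \\
y_t &= W_{\overrightarrow{h}y}\, \overrightarrow{h}_t
    + W_{\overleftarrow{h}y}\, \overleftarrow{h}_t + b_y
\end{align}
```

The forward layer is iterated from $t = 1$ to $T$ and the backward layer from $t = T$ to $1$, so each output $y_t$ depends on both earlier and later frames.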
Problem Statement
Data Set
Among the various deep learning models, the CNN has shown excellent performance in feature detection and in the classification of characteristic parameters in images. The difference is obvious when the data is represented as an STFT spectrogram rather than as the raw signal. A CNN looks for patterns within an image in a manner similar to the visual system of an animal, so the model is expected to classify the STFT spectrogram more easily.
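A spectrogram of this kind can be sketched as follows: the signal is cut into overlapping frames, each frame is windowed, and the magnitude of its Fourier transform becomes one column of the image. The sampling rate, frame length, hop size, Hann window, and the 1 kHz test tone are hypothetical choices for illustration; the thesis settings are not given in this section.

```python
import numpy as np

def stft_magnitude(signal, frame_len, hop):
    """Magnitude STFT, returned as a (frequency, time) array."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft gives frame_len // 2 + 1 frequency bins per frame.
    return np.abs(np.fft.rfft(frames, axis=1)).T

fs = 8000                                   # hypothetical sampling rate in Hz
t = np.arange(fs) / fs                      # one second of samples
tone = np.sin(2 * np.pi * 1000 * t)         # synthetic 1 kHz test tone

spec = stft_magnitude(tone, frame_len=256, hop=128)
freqs = np.fft.rfftfreq(256, d=1 / fs)
peak_hz = freqs[spec.mean(axis=1).argmax()]
print(spec.shape, peak_hz)
```

The resulting two-dimensional array can be treated as a single-channel image and passed to a CNN.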
The A-weighting filter reflects the characteristics of human hearing and is commonly used for noise measurement in industry [43]. The weights emphasize the 2-5 kHz frequency range, to which the human ear is most sensitive [44]. Each frequency band is weighted according to the sensitivity of the ear, so that information in the frequency range humans can detect carries more weight.
The weighting function, $R_A(f)$, is applied to the amplitude spectrum of the unweighted sound level.
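For reference, the standard closed form of the A-weighting curve (as defined in IEC 61672) is sketched below. That the thesis uses exactly this form is an assumption; the function names are illustrative.

```python
import math

def r_a(f: float) -> float:
    """Standard A-weighting magnitude response R_A(f), with f in Hz."""
    f2 = f * f
    num = (12194.0 ** 2) * f2 * f2
    den = ((f2 + 20.6 ** 2)
           * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
           * (f2 + 12194.0 ** 2))
    return num / den

def a_weighting_db(f: float) -> float:
    """A-weighting gain in dB; the +2.00 dB offset sets the gain to 0 dB at 1 kHz."""
    return 20.0 * math.log10(r_a(f)) + 2.00

for freq in (100.0, 1000.0, 2500.0):
    print(f"{freq:7.1f} Hz: {a_weighting_db(freq):+.2f} dB")
```

Low frequencies are strongly attenuated, while the band around 2-5 kHz receives a small positive gain.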
Proposed Model
We set the learning rate to 0.0001 and decay it every 100 iterations so that the cost converges stably.
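One common way to realize such a schedule is exponential decay. The sketch below is an assumption about the form of the schedule: the decay rate of 0.96 and the `staircase` option are hypothetical, since only the initial rate and the 100-iteration interval are stated.

```python
def decayed_learning_rate(step: int,
                          base_lr: float = 1e-4,
                          decay_steps: int = 100,
                          decay_rate: float = 0.96,   # hypothetical value
                          staircase: bool = False) -> float:
    """Exponentially decayed learning rate.

    With staircase=True the rate drops once every `decay_steps` iterations;
    otherwise it shrinks smoothly and reaches the same values at multiples
    of `decay_steps`.
    """
    exponent = step // decay_steps if staircase else step / decay_steps
    return base_lr * decay_rate ** exponent

for step in (0, 100, 200, 1000):
    print(step, decayed_learning_rate(step))
```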
Experiment
We used the STFT image as input for all models, and the size of the input image was [...]. By repeating the NG data, we ensure that the NG data contributes multiple times to the loss function. The left side of Table 2 shows the confusion matrix of the SVM when the input is the signal.
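Repeating the NG data amounts to oversampling the minority class. A minimal sketch, assuming simple duplication of each NG sample, is given below; the sample names, class counts, and repeat factor are made up for illustration.

```python
def oversample_minority(samples, labels, minority_label, repeat):
    """Duplicate every minority-class sample `repeat` times in total."""
    out_samples, out_labels = [], []
    for x, y in zip(samples, labels):
        copies = repeat if y == minority_label else 1
        out_samples.extend([x] * copies)
        out_labels.extend([y] * copies)
    return out_samples, out_labels

# Hypothetical toy data set: four OK samples and one NG sample.
samples = ["ok_1", "ok_2", "ok_3", "ok_4", "ng_1"]
labels = ["OK", "OK", "OK", "OK", "NG"]

bal_samples, bal_labels = oversample_minority(samples, labels, "NG", repeat=4)
print(bal_labels.count("OK"), bal_labels.count("NG"))
```

Each repeated NG sample adds its own term to the summed loss, so a misclassified NG sample is penalized as heavily as several OK samples.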
This is the biggest problem with unbalanced data: the model cannot reflect the small part of the data. In the case of STFT images, the accuracy of the SVM is 93.35%, and the confusion matrix for each class is shown on the right side of Table 2. The overall accuracy of the SVM is as high as 93.35%, but the confusion matrix shows that the accuracy for NG is low while the accuracy for OK is high.
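The gap between overall accuracy and per-class accuracy can be reproduced with a toy confusion matrix. The counts below are hypothetical and are not the values of Table 2.

```python
def overall_accuracy(confusion):
    """Fraction of all samples on the diagonal of the confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

def per_class_accuracy(confusion):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]

# Hypothetical counts: rows are true classes (OK, NG),
# columns are predicted classes (OK, NG).
confusion = [[930, 20],
             [40, 10]]

overall = overall_accuracy(confusion)
ok_acc, ng_acc = per_class_accuracy(confusion)
print(f"overall {overall:.2%}, OK {ok_acc:.2%}, NG {ng_acc:.2%}")
```

With 950 OK and 50 NG samples, the overall accuracy is 94% even though only 20% of the NG samples are detected.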
The left side of Table 3 shows the result of the NN when the input is the signal. Although the accuracy for the signal is better, the confusion matrix shows that both cases still suffer from the imbalance problem. The class activation map (CAM) is an additional layer used to interpret the classification results.
Conclusion
Problem Statement
Data Set
The 3D position of a joint obtained from the depth image varies with the physical characteristics of the person and the viewing angle, even for the same movement. This variation significantly affects classification performance when the model is applied to a different person [18][49]. Therefore, using a spherical coordinate system, we converted the joint position vectors into joint angles, which are invariant with respect to position.
Before conversion, we translated the central hip joint to the origin to reduce viewpoint variability and defined it as the center of the spherical coordinates. To reduce the performance differences caused by human height in the training data, the height of the human body can be normalized to a common height, as shown in Figure 20. In image classification problems, classification performance is improved by data augmentation techniques that artificially generate data through flipping, cropping, translation, and rotation, which also helps avoid overfitting [50][51].
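The hip-centered conversion to spherical angles can be sketched as follows. The axis convention (polar angle measured from the z axis, azimuth in the x-y plane) and the function name are assumptions for illustration; the thesis convention is not stated in this section.

```python
import math

def to_joint_angles(joint, hip_center):
    """Return (theta, phi) of a joint relative to the central hip joint.

    joint, hip_center: (x, y, z) positions in the same coordinate frame.
    theta is the polar angle from the z axis, phi the azimuth in the
    x-y plane (an assumed convention).
    """
    # Translate so that the central hip joint becomes the origin.
    x = joint[0] - hip_center[0]
    y = joint[1] - hip_center[1]
    z = joint[2] - hip_center[2]
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return 0.0, 0.0   # the hip joint itself has no direction
    theta = math.acos(max(-1.0, min(1.0, z / r)))
    phi = math.atan2(y, x)
    return theta, phi

hip = (1.0, 1.0, 1.0)
angles = to_joint_angles((2.0, 1.0, 1.0), hip)

# Same direction, twice the distance from the hip, different position in space.
angles_far = to_joint_angles((13.0, 11.0, 11.0), (11.0, 11.0, 11.0))
print(angles, angles_far)
```

Because the radius is discarded, the angles do not change when the whole body is moved or when the distance of the joint from the hip changes.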
In this study, we use data augmentation to generate data with different body lengths so that the model becomes robust to differences in height. The augmentation changes the length of the motion, as shown in Figure 20.
Proposed Model
We set the learning rate to 0.001 and decayed it every 500 iterations so that the cost converges stably.
Experiment
The accuracy of the SVM was 77.32%, and the confusion matrix for each action is shown in Figure 23. The SVM results show a large performance gap across movement types. Unlike the NN, the RNN uses the output of the previous time step as input to the current time step.
Similar to the bidirectional RNN, the length of the sequence is 250 frames, and the hidden layer has 100 nodes. The number of hidden nodes in the LSTM cell was 100, and the length of the sequence was also 250 frames. For the LSTM RNN, the results were analyzed separately for the cases with and without data augmentation.
It can be seen that the performance is higher when data augmentation is not performed. Although the standard RNN is a deep learning model, it exhibited the lowest performance due to structural issues, whereas the LSTM RNN and the bidirectional RNN performed remarkably well. Therefore, we compared the pre- and post-augmentation results of the NN, RNN, LSTM RNN, and bidirectional RNN.
Conclusion
The RNN performed poorly whether or not data augmentation was used, due to the vanishing gradient problem. Data augmentation was used to address such problems, and the results showed that the performance fluctuation was reduced. In conclusion, the bidirectional RNN performed better than the other models, and the performance fluctuation could be reduced by increasing the amount of data.
In this study, we used a rich time-frequency representation of an audio signal obtained by applying the STFT. The STFT of the audio signals was used as input to the CNN model. We can use the results of the CAM to determine which components are important features, and these features can serve as prior information when creating a new classification model for similar systems.
In addition, a model is needed that can find the cause of a problem in the region on which the model focuses, so that recurrence can be prevented. The results showed that the bidirectional RNN performs better than the other models. This model is necessary to improve production efficiency and protect the physical health of workers.