System Methodology
3.1 Working Principle of Catch using Time-Distance Approximation
Figure 3.1.1 Framework of the speed trap system using time-distance approximation
Figure 3.1.1 above shows an overview of how Catch, a speed trap system designed to be difficult to evade, works.
Upon entering the speed enforcement section, vehicles are captured at both the check-in (entry) and check-out (exit) points, each equipped with an ALPR camera to register license plates.
These data are then sent to a matching server, which sorts, matches and compares the timestamps of each data set. The server calculates the average speed from the predefined distance and the registered timestamps. If the calculated speed exceeds the legal speed limit, the license plate images are marked and kept as evidence for issuing speeding violations.
For example, suppose the posted speed limit is 60 km/h and the distance between the entry and exit points is 60 km. A vehicle obeying the limit needs at least 1 hour to cover the distance. However, if a driver enters and exits the enforcement section at low speed but completes the whole distance in less than 1 hour, the driver must have exceeded 60 km/h somewhere in between, and is therefore identified as speeding.
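As an illustration of this check, here is a minimal Python sketch; the constants and function name are hypothetical, not part of the actual Catch implementation.

```python
from datetime import datetime

SECTION_KM = 60.0   # distance between entry and exit points (illustrative)
LIMIT_KMH = 60.0    # posted speed limit (illustrative)

def average_speed_kmh(entry: datetime, exit_time: datetime) -> float:
    """Average speed over the enforcement section from two timestamps."""
    hours = (exit_time - entry).total_seconds() / 3600.0
    return SECTION_KM / hours

# A plate seen entering at 10:00 and exiting at 10:50 covered 60 km in
# 50 minutes, i.e. an average of 72 km/h, so it is flagged as speeding.
speed = average_speed_kmh(datetime(2023, 1, 1, 10, 0), datetime(2023, 1, 1, 10, 50))
if speed > LIMIT_KMH:
    print(f"Violation: average speed {speed:.0f} km/h exceeds {LIMIT_KMH:.0f} km/h")
```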
3.2 The Architecture of Convolutional Recurrent Neural Network (CRNN) with Connectionist Temporal Classification (CTC)
Figure 3.2.1 below shows the network architecture of CRNN. It consists of three components, from bottom to top: convolutional layers, recurrent layers, and a transcription layer. First, the convolutional layers automatically extract a feature sequence from each input car plate image. This feature sequence is then fed into the recurrent layers.
Second, the recurrent layers make a prediction for each frame of the feature sequence, outputting a matrix of character scores, one distribution per timestep. Lastly, the transcription layer computes the loss function used to train the network, and decodes the output score matrix of CRNN into a label sequence, which is the extracted license plate number.
Figure 3.2.1 The network architecture of license plate recognition
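Before each component is described in detail, the following minimal PyTorch sketch shows how the three components could fit together; the layer sizes and shapes are illustrative assumptions, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Minimal CRNN sketch: conv layers -> feature sequence -> BiLSTM ->
    per-frame character scores. Sizes are illustrative only."""
    def __init__(self, num_classes: int):   # num_classes includes the CTC blank
        super().__init__()
        self.cnn = nn.Sequential(            # convolutional layers
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        self.rnn = nn.LSTM(128 * 8, 256, num_layers=2,   # recurrent layers
                           bidirectional=True)
        self.fc = nn.Linear(2 * 256, num_classes)        # per-frame scores

    def forward(self, x):                    # x: (batch, 1, 32, W) grayscale plate
        f = self.cnn(x)                      # (batch, 128, 8, W/4)
        b, c, h, w = f.shape
        seq = f.permute(3, 0, 1, 2).reshape(w, b, c * h)  # one vector per column
        out, _ = self.rnn(seq)               # (W/4, batch, 512)
        return self.fc(out)                  # (W/4, batch, num_classes)

scores = CRNN(num_classes=37)(torch.randn(1, 1, 32, 128))  # 37 = blank + 0-9 + A-Z
```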
3.2.1 Convolutional Layer – Feature Sequence Extraction
In the CRNN model, the fully-connected layers at the end of a standard CNN are removed. The feature maps of the convolutional layers are used to extract a sequence of feature vectors from an input car plate image. These feature vectors are then fed into the RNN for sequence labelling.
Specifically, each feature vector in the feature sequence is generated from left to right on the feature maps, column by column. To be precise, the i-th feature vector corresponds to the i-th column of the feature maps.
Since convolution, max-pooling and the activation function operate on local regions, they are translation invariant. As a result, each column of the feature maps corresponds to a receptive field on the original input image. As illustrated in Figure 3.2.1.1 below, each vector in the feature sequence is associated with a receptive field, and can thus be regarded as the image descriptor of that region.
Figure 3.2.1.1 Each vector in the feature sequence corresponds to a receptive field on the input image
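To make this correspondence concrete, the small sketch below maps feature-map columns back to input-image regions; the stride of 4 (following the two 2×2 poolings assumed in the sketch above) and the 10-pixel receptive-field width are illustrative assumptions.

```python
# Illustrative mapping from feature-map columns to input-image regions.
total_stride = 4        # two 2x2 max-poolings halve the width twice
receptive_field = 10    # example value; grows with conv kernel sizes
img_width = 128

for t in range(3):      # first three feature vectors
    centre = t * total_stride + total_stride // 2
    lo = max(0, centre - receptive_field // 2)
    hi = min(img_width, centre + receptive_field // 2)
    print(f"feature vector {t} describes pixels [{lo}, {hi}) of the input")
```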
3.2.2 Recurrent Layer – Sequence Labelling
In the CRNN model, the recurrent layers are built with a bidirectional RNN. The recurrent layers predict a label distribution y = y_1, …, y_T, one distribution y_t for each frame x_t in the feature sequence x = x_1, …, x_T.
Figure 3.2.2.1 (a) The structure of a basic LSTM unit.
(b) The structure of a bidirectional LSTM.
A traditional RNN has a self-connected hidden layer between its input and output layers. Every time it receives a frame x_t of the feature sequence, it updates its internal state h_t with a non-linear function that takes both the current input x_t and the past state h_{t-1} as its inputs: h_t = g(x_t, h_{t-1}).
Then, the prediction y_t is made based on h_t. In this way, the past contexts {x_{t'}}_{t' < t} are captured and utilized for prediction. As illustrated in Figure 3.2.2.1 (a), an LSTM consists of a memory cell and three multiplicative gates, namely the input, output and forget gates. Conceptually, the memory cell stores the past contexts, and the input and output gates allow the cell to store contexts for a long period of time. Meanwhile, the memory in the cell can be cleared by the forget gate. This special design allows LSTM to capture long-range dependencies, which often occur in image-based sequences.
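For reference, the gating just described corresponds to the standard LSTM update equations (the common formulation; a given implementation may use a slightly different variant):

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)   (input gate)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)   (forget gate)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)   (output gate)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c x_t + U_c h_{t-1} + b_c)   (memory cell)
h_t = o_t ⊙ tanh(c_t)   (hidden state)

where σ is the sigmoid function and ⊙ denotes element-wise multiplication.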
LSTM is directional; it uses only past contexts. Nevertheless, in image-based sequences, past and future contexts from both directions are useful and complementary to each other.
Therefore, we combine two LSTMs, one forward (left to right) and one backward (right to left), into a bidirectional LSTM. Furthermore, stacking multiple bidirectional LSTMs results in a deep bidirectional LSTM, as illustrated in Figure 3.2.2.1 (b). The deep structure allows a higher level of abstraction than a shallow one, and thus achieves significant performance improvements in the task of text recognition.
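The following minimal PyTorch sketch shows the forward/backward combination explicitly; in practice nn.LSTM with bidirectional=True performs the equivalent internally, and the sizes here are illustrative.

```python
import torch
import torch.nn as nn

T, N, C = 32, 1, 1024           # frames, batch, feature size (illustrative)
x = torch.randn(T, N, C)        # feature sequence from the conv layers

fwd = nn.LSTM(C, 256)           # left-to-right LSTM
bwd = nn.LSTM(C, 256)           # right-to-left LSTM

h_f, _ = fwd(x)                              # past context per frame
h_b, _ = bwd(torch.flip(x, dims=[0]))        # future context per frame
h_b = torch.flip(h_b, dims=[0])              # re-align to original order
h = torch.cat([h_f, h_b], dim=-1)            # (T, N, 512): both contexts
```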
There are three advantages to feeding the output of the convolutional layers into an RNN rather than into fully-connected layers. First, RNN has a strong capability of capturing contextual information within a sequence. Using contextual cues for image-based sequence recognition is more robust and helpful than treating each character independently. Taking Figure 3.2.1.1 as an example, a wide character such as 'B' may require several successive receptive fields to describe fully, while a narrow character such as '1' occupies only one receptive field. Besides, some ambiguous characters are easier to distinguish when their context is observed; for example, it is easier to tell 'O' from '0' by comparing the character positions in the car plate number than by recognizing each of them separately.
Figure 3.2.2.2 Generally, the first few characters of a car plate are the letter 'O', and the last few characters are the digit '0'
Second, RNN can backpropagate error differentials to its input, which is the convolutional layers. This allows us to jointly train the recurrent layers and the convolutional layers in a unified network.
Third, RNN is able to operate on sequences of arbitrary lengths, traversing from beginning to end. This is very helpful for recognizing Malaysian car plates, whose lengths vary.
Figure 3.2.2.3 Examples of Malaysian car plates of variable length
3.2.3 Transcription Layer – License Plate Label Sequence
At the end of the CRNN model, a transcription layer is added to convert the per-frame predictions output by the RNN into a label sequence. Mathematically, transcription finds the label sequence with the highest probability conditioned on the per-frame predictions.
Figure 3.2.3.1 Conversion of per-frame predictions into label sequence
We adopt a special loss function known as the Connectionist Temporal Classification (CTC) loss. Under this loss function, the output of CRNN is treated as a conditional probability distribution over all possible label sequences.
The formulation of the conditional probability is briefly described as follows:
The input is a sequence y = y_1, …, y_T, where T is the sequence length. Here, each y_t ∈ ℝ^{|L'|} is a probability distribution over the set L' = L ∪ {−}, where L contains all labels in the classification task (0-9 and A-Z) and '−' denotes the 'blank' label. A sequence-to-sequence mapping function B is defined on a sequence π ∈ L'^T, where T is the sequence length.
B maps π onto l by first removing the repeated labels, then removing the 'blank's. For example, B maps '−−BKF−16−6−31−' ('−' represents 'blank') onto 'BKF16631'; note the blank between the repeated '6's, which prevents them from being merged. Then, the conditional probability is defined as the sum of the probabilities of all π that are mapped by B onto l:
p(l | y) = Σ_{π : B(π) = l} p(π | y)

where the probability of π is defined as p(π | y) = ∏_{t=1}^{T} y^t_{π_t}, in which y^t_{π_t} is the probability of having label π_t at timestamp t.
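To make the mapping concrete, here is a minimal Python sketch of B under the definitions above; the function name and the use of '-' as the blank symbol are illustrative.

```python
BLANK = "-"  # stand-in symbol for the CTC 'blank' label

def ctc_map(pi: str) -> str:
    """Mapping function B: remove repeated labels, then remove blanks."""
    collapsed = []
    prev = None
    for label in pi:
        if label != prev:          # drop consecutive repeats
            collapsed.append(label)
        prev = label
    return "".join(c for c in collapsed if c != BLANK)   # drop blanks

assert ctc_map("--BKF-16-6-31-") == "BKF16631"
```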
In short, this defines a probability for the label sequence l conditioned on the per-frame predictions y = y_1, …, y_T, while ignoring where each label in l is located.
Consequently, when we use the negative log-likelihood of this probability as the objective to train the network, we only require images and their corresponding label sequences, avoiding the intensive labour of labelling the positions of individual characters.
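As a sketch of how this objective is used in practice, PyTorch's nn.CTCLoss implements exactly this negative log-likelihood; the tensor shapes and the label encoding below are illustrative assumptions, not the actual configuration of this work.

```python
import torch
import torch.nn as nn

# T frames, batch of 1, 37 classes: blank (index 0) + 0-9 + A-Z.
T, N, num_classes = 32, 1, 37
log_probs = torch.randn(T, N, num_classes, requires_grad=True).log_softmax(2)

# "BKF16631" under a hypothetical encoding: digits '0'-'9' -> 1-10,
# letters 'A'-'Z' -> 11-36 (so B=12, K=21, F=16, '1'=2, '6'=7, '3'=4).
targets = torch.tensor([[12, 21, 16, 2, 7, 7, 4, 2]])
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([8])

# Negative log-likelihood summed over all alignments pi with B(pi) = l;
# no per-character position annotations are needed.
criterion = nn.CTCLoss(blank=0)
loss = criterion(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```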