System Methodology
3.1 Working Principle of Catch using Time-Distance Approximation
Figure 3.1.1 Framework of the speed trap system using time-distance approximation
Figure 3.1.1 above shows an overview of how Catch, a speed trap system designed to be difficult to evade, works.
Upon entering the speed enforcement section, vehicles are captured at both the check-in (entry) and check-out (exit) points, each equipped with an ALPR camera to register license plates.
These data are then sent to a matching server, which sorts, matches and compares the timestamps of each data set. The server calculates the average speed from the predefined distance and the registered timestamps. If the calculated speed exceeds the legal speed limit, the license plate images are marked and kept as evidence for issuing speeding violations.
For example, suppose the posted speed limit is 60 km/h and the distance between the entry and exit points is 60 km. A vehicle obeying the limit needs at least 1 hour to cover the distance. However, if a driver enters and exits the enforcement section at low speed but completes the whole distance in less than 1 hour, the driver must have exceeded 60 km/h somewhere in between, and is therefore identified as speeding.
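As an illustration of this check, here is a minimal Python sketch; the constants and function name are hypothetical, not part of the actual Catch implementation.

```python
from datetime import datetime

SECTION_KM = 60.0   # distance between entry and exit points (illustrative)
LIMIT_KMH = 60.0    # posted speed limit (illustrative)

def average_speed_kmh(entry: datetime, exit_time: datetime) -> float:
    """Average speed over the enforcement section from two timestamps."""
    hours = (exit_time - entry).total_seconds() / 3600.0
    return SECTION_KM / hours

# A plate seen entering at 10:00 and exiting at 10:50 covered 60 km in
# 50 minutes, i.e. an average of 72 km/h, so it is flagged as speeding.
speed = average_speed_kmh(datetime(2023, 1, 1, 10, 0), datetime(2023, 1, 1, 10, 50))
if speed > LIMIT_KMH:
    print(f"Violation: average speed {speed:.0f} km/h exceeds {LIMIT_KMH:.0f} km/h")
```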
3.2 The Architecture of Convolutional Recurrent Neural Network (CRNN) with Connectionist Temporal Classification (CTC)
Figure 3.2.1 below shows the network architecture of CRNN. It consists of three components, from bottom to top: convolutional layers, recurrent layers, and a transcription layer. First, the convolutional layers automatically extract a feature sequence from each input car plate image. This feature sequence is then fed into the recurrent layers.
Second, the recurrent layers make a prediction for each frame of the feature sequence, outputting a matrix of character scores, one distribution per timestep. Lastly, the transcription layer computes the loss function used to train the network, and decodes the output score matrix of CRNN into a label sequence, which is the extracted license plate number.
Figure 3.2.1 The network architecture of license plate recognition
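Before each component is described in detail, the following minimal PyTorch sketch shows how the three components could fit together; the layer sizes and shapes are illustrative assumptions, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Minimal CRNN sketch: conv layers -> feature sequence -> BiLSTM ->
    per-frame character scores. Sizes are illustrative only."""
    def __init__(self, num_classes: int):   # num_classes includes the CTC blank
        super().__init__()
        self.cnn = nn.Sequential(            # convolutional layers
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        self.rnn = nn.LSTM(128 * 8, 256, num_layers=2,   # recurrent layers
                           bidirectional=True)
        self.fc = nn.Linear(2 * 256, num_classes)        # per-frame scores

    def forward(self, x):                    # x: (batch, 1, 32, W) grayscale plate
        f = self.cnn(x)                      # (batch, 128, 8, W/4)
        b, c, h, w = f.shape
        seq = f.permute(3, 0, 1, 2).reshape(w, b, c * h)  # one vector per column
        out, _ = self.rnn(seq)               # (W/4, batch, 512)
        return self.fc(out)                  # (W/4, batch, num_classes)

scores = CRNN(num_classes=37)(torch.randn(1, 1, 32, 128))  # 37 = blank + 0-9 + A-Z
```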
3.2.1 Convolutional Layer – Feature Sequence Extraction
In the CRNN model, the fully-connected layers at the end of a standard CNN are removed. The feature maps of the convolutional layers are used to extract a sequence of feature vectors from an input car plate image. These feature vectors are then fed into the RNN for sequence labelling.
Specifically, each feature vector in the feature sequence is generated from left to right on the feature maps, column by column. To be precise, the i-th feature vector corresponds to the i-th column of the feature maps.
Since convolution, max-pooling and the activation function operate on local regions, they are translation invariant. As a result, each column of the feature maps corresponds to a receptive field on the original input image. As illustrated in Figure 3.2.1.1 below, each vector in the feature sequence is associated with a receptive field, and can thus be regarded as the image descriptor of that region.
Figure 3.2.1.1 Each vector in the feature sequence corresponds to a receptive field on the input image
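To make this correspondence concrete, the small sketch below maps feature-map columns back to input-image regions; the stride of 4 (following the two 2×2 poolings assumed in the sketch above) and the 10-pixel receptive-field width are illustrative assumptions.

```python
# Illustrative mapping from feature-map columns to input-image regions.
total_stride = 4        # two 2x2 max-poolings halve the width twice
receptive_field = 10    # example value; grows with conv kernel sizes
img_width = 128

for t in range(3):      # first three feature vectors
    centre = t * total_stride + total_stride // 2
    lo = max(0, centre - receptive_field // 2)
    hi = min(img_width, centre + receptive_field // 2)
    print(f"feature vector {t} describes pixels [{lo}, {hi}) of the input")
```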
3.2.2 Recurrent Layer – Sequence Labelling
In the CRNN model, the recurrent layers are built with a bidirectional RNN. The recurrent layers predict a label distribution y = y_1, …, y_T, one distribution y_t for each frame x_t in the feature sequence x = x_1, …, x_T.
Figure 3.2.2.1 (a) The structure of a basic LSTM unit.
(b) The structure of a bidirectional LSTM.
A traditional RNN has a self-connected hidden layer between its input and output layers. Every time it receives a frame x_t of the feature sequence, it updates its internal state h_t with a non-linear function that takes both the current input x_t and the past state h_{t-1} as its inputs: h_t = g(x_t, h_{t-1}).
Then, the prediction y_t is made based on h_t. In this way, the past contexts {x_{t'}}_{t' < t} are captured and utilized for prediction. As illustrated in Figure 3.2.2.1 (a), an LSTM consists of a memory cell and three multiplicative gates, namely the input, output and forget gates. Conceptually, the memory cell stores the past contexts, and the input and output gates allow the cell to store contexts for a long period of time. Meanwhile, the memory in the cell can be cleared by the forget gate. This special design allows LSTM to capture long-range dependencies, which often occur in image-based sequences.
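For reference, the gating just described corresponds to the standard LSTM update equations (the common formulation; a given implementation may use a slightly different variant):

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)   (input gate)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)   (forget gate)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)   (output gate)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c x_t + U_c h_{t-1} + b_c)   (memory cell)
h_t = o_t ⊙ tanh(c_t)   (hidden state)

where σ is the sigmoid function and ⊙ denotes element-wise multiplication.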
LSTM is directional; it uses only past contexts. Nevertheless, in image-based sequences, past and future contexts from both directions are useful and complementary to each other.
Therefore, we combine two LSTMs, one forward (left to right) and one backward (right to left), into a bidirectional LSTM. Furthermore, stacking multiple bidirectional LSTMs results in a deep bidirectional LSTM, as illustrated in Figure 3.2.2.1 (b). The deep structure allows a higher level of abstraction than a shallow one, and thus achieves significant performance improvements in the task of text recognition.
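The following minimal PyTorch sketch shows the forward/backward combination explicitly; in practice nn.LSTM with bidirectional=True performs the equivalent internally, and the sizes here are illustrative.

```python
import torch
import torch.nn as nn

T, N, C = 32, 1, 1024           # frames, batch, feature size (illustrative)
x = torch.randn(T, N, C)        # feature sequence from the conv layers

fwd = nn.LSTM(C, 256)           # left-to-right LSTM
bwd = nn.LSTM(C, 256)           # right-to-left LSTM

h_f, _ = fwd(x)                              # past context per frame
h_b, _ = bwd(torch.flip(x, dims=[0]))        # future context per frame
h_b = torch.flip(h_b, dims=[0])              # re-align to original order
h = torch.cat([h_f, h_b], dim=-1)            # (T, N, 512): both contexts
```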
There are three advantages to feeding the output of the convolutional layers into an RNN rather than into fully-connected layers. First, RNN has a strong capability of capturing contextual information within a sequence. Using contextual cues for image-based sequence recognition is more robust and helpful than treating each character independently. Taking Figure 3.2.1.1 as an example, a wide character such as 'B' may require several successive receptive fields to describe fully, while a narrow character such as '1' occupies only one receptive field. Besides, some ambiguous characters are easier to distinguish when their context is observed; for example, it is easier to tell 'O' from '0' by comparing the character positions in the car plate number than by recognizing each of them separately.
Figure 3.2.2.2 Generally, the first few characters of a car plate are the letter 'O', and the last few characters are the digit '0'
Second, RNN can backpropagate error differentials to its input, which is the convolutional layers. This allows us to jointly train the recurrent layers and the convolutional layers in a unified network.
Third, RNN is able to operate on sequences of arbitrary lengths, traversing from beginning to end. This is very helpful for recognizing Malaysian car plates, whose lengths vary.
Figure 3.2.2.3 Examples of Malaysian car plates of variable length
3.2.3 Transcription Layer – License Plate Label Sequence
At the end of the CRNN model, a transcription layer is added to convert the per-frame predictions output by the RNN into a label sequence. Mathematically, transcription finds the label sequence with the highest probability conditioned on the per-frame predictions.
Figure 3.2.3.1 Conversion of per-frame predictions into label sequence
We adopt a special loss function known as the Connectionist Temporal Classification (CTC) loss. Under this loss function, the output of CRNN is treated as a conditional probability distribution over all possible label sequences.
The formulation of the conditional probability is briefly described as follows:
The input is a sequence y = y_1, …, y_T, where T is the sequence length. Here, each y_t ∈ ℝ^{|L'|} is a probability distribution over the set L' = L ∪ {−}, where L contains all labels in the classification task (0-9 and A-Z) and '−' denotes the 'blank' label. A sequence-to-sequence mapping function B is defined on a sequence π ∈ L'^T, where T is the sequence length.
B maps π onto l by first removing the repeated labels, then removing the 'blank's. For example, B maps '−−BKF−16−6−31−' ('−' represents 'blank') onto 'BKF16631'; note the blank between the repeated '6's, which prevents them from being merged. Then, the conditional probability is defined as the sum of the probabilities of all π that are mapped by B onto l:
p(l | y) = Σ_{π : B(π) = l} p(π | y)

where the probability of π is defined as p(π | y) = ∏_{t=1}^{T} y^t_{π_t}, in which y^t_{π_t} is the probability of having label π_t at timestamp t.
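To make the mapping concrete, here is a minimal Python sketch of B under the definitions above; the function name and the use of '-' as the blank symbol are illustrative.

```python
BLANK = "-"  # stand-in symbol for the CTC 'blank' label

def ctc_map(pi: str) -> str:
    """Mapping function B: remove repeated labels, then remove blanks."""
    collapsed = []
    prev = None
    for label in pi:
        if label != prev:          # drop consecutive repeats
            collapsed.append(label)
        prev = label
    return "".join(c for c in collapsed if c != BLANK)   # drop blanks

assert ctc_map("--BKF-16-6-31-") == "BKF16631"
```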
In short, this defines a probability for the label sequence l conditioned on the per-frame predictions y = y_1, …, y_T, while ignoring where each label in l is located.
Consequently, when we use the negative log-likelihood of this probability as the objective to train the network, we only require images and their corresponding label sequences, avoiding the intensive labour of labelling the positions of individual characters.
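As a sketch of how this objective is used in practice, PyTorch's nn.CTCLoss implements exactly this negative log-likelihood; the tensor shapes and the label encoding below are illustrative assumptions, not the actual configuration of this work.

```python
import torch
import torch.nn as nn

# T frames, batch of 1, 37 classes: blank (index 0) + 0-9 + A-Z.
T, N, num_classes = 32, 1, 37
log_probs = torch.randn(T, N, num_classes, requires_grad=True).log_softmax(2)

# "BKF16631" under a hypothetical encoding: digits '0'-'9' -> 1-10,
# letters 'A'-'Z' -> 11-36 (so B=12, K=21, F=16, '1'=2, '6'=7, '3'=4).
targets = torch.tensor([[12, 21, 16, 2, 7, 7, 4, 2]])
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([8])

# Negative log-likelihood summed over all alignments pi with B(pi) = l;
# no per-character position annotations are needed.
criterion = nn.CTCLoss(blank=0)
loss = criterion(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```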