https://doi.org/10.1007/s10044-020-00874-9

THEORETICAL ADVANCES
Local binary pattern‑based on‑road vehicle detection in urban traffic scene
M. Hassaballah1 · Mourad A. Kenk2 · Ibrahim M. El‑Henawy3
Received: 14 July 2018 / Accepted: 4 February 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020
Abstract
For intelligent traffic monitoring systems and related applications, detecting vehicles on roads is a vital step. However, robust and efficient vehicle detection remains a challenging problem due to variations in the appearance of vehicles and the complicated background of roads. In this paper, we propose a simple and effective vehicle detection method in which histograms of local vehicle texture and appearance are fed into clustering forests. The interdependency of the locations of vehicle parts is incorporated within a clustering-forest framework. Local binary pattern (LBP)-like descriptors are utilized for texture feature extraction; through these descriptors, the local structures of vehicles, such as edges, contours and flat regions, can be effectively depicted. The aligned set of histograms generated in concurrence with the spatial LBPs of randomly sampled local regions is used to measure the dissimilarity between regions of all training images. Evaluating the fit between histograms is built into the clustering forests; that is, discriminative codebooks of latent features are clustered by searching among the different LBP features of the random regions using the Chi-square dissimilarity measure. Besides, saliency maps built from the learnt latent features are adopted to determine vehicle locations in a test image. The effectiveness of the proposed method is evaluated on different car datasets stressing various imaging conditions, and the obtained results show that the method achieves significant improvements compared to published methods.
Keywords Intelligent traffic monitoring · Vehicle detection · Local binary pattern · Saliency map · Clustering forests
1 Introduction
In computer vision, the ability to detect road vehicles in digital images is key to a large number of applications [1], ranging from real traffic surveillance, transportation, traffic management, and traffic flow control systems at intersections to advanced driver assistance systems, as reported in [2–4]. For instance, video surveillance analysis systems for urban traffic roads rapidly provide effective information that increases safety and improves traffic flow, detecting stranded vehicles, lane crossing, vehicles parked across the roads, and traffic congestion, and counting passing vehicles [5, 6], as well as determining the plate number, type, speed, direction, or lane of vehicles [7, 8]. In particular, vehicle detection can be considered one of the core functions of advanced driver assistance systems: to give a distracted driver enough time to take an appropriate action to avoid driving conflicts, and thereby decrease the possibility of rear-end crashes, early vehicle detection is a must. Besides, current traffic monitoring systems that count vehicles using magnetic loops on the road, as demonstrated in [9–11], are expensive. All these issues make it necessary to investigate new and more favorable approaches to the vehicle detection task.
Commonly, the main purpose of vehicle detection is to identify possible vehicle locations in a given image and mark them as regions of interest (ROI) for further processing tasks. However, automatic detection of vehicles is a challenging and inherently difficult task [12]. The difficulty comes from the considerable diversity in the appearance of
* M. Hassaballah
* Mourad A. Kenk
1 Computer Science Department, Faculty of Computers and Information, South Valley University, Qena 83523, Egypt
2 Mathematics Department, Faculty of Science, South Valley University, Qena 83523, Egypt
3 Computer Science Department, Faculty of Computers and Information, Zagazig University, Zagazig, Egypt
the vehicles (e.g., size, shape, orientation) and the decrease in visibility due to camera noise or unfavorable climatic conditions (e.g., fog, rain). Besides, varying illumination, occlusion and cluttered backgrounds increase the difficulty compared to general object detection. In particular, satellite images may contain the complicated background of an urban or rural scene; thus, the vehicles appear very different, which makes it harder to find informative features in the scene [13]. In spite of these difficulties, both the scientific community and the transportation industry have contributed to the development of different types of vehicle detection and tracking methods to improve the efficiency of traffic monitoring systems and the vehicle-related applications mentioned earlier.
Therefore, several interesting vehicle detection approaches have been proposed in the literature, based mainly on handcrafted features or learned features [14, 15]. The handcrafted-features-based approaches include the method of [16], which combines HOG-like features, color probability maps and a partial least squares model. The method of Su et al. [17], which utilizes rotation-invariant HOG-like features, also belongs to this category. The methods of [18, 19] use Haar-like features to detect vehicles in aerial images. Besides, several well-known handcrafted features [20] such as Gabor filters, SIFT, and oriented gradients are used in [21–23]. Other handcrafted-features-based approaches are proposed in [24, 25]. Though these approaches are robust to intra-class variation in appearance, they are still sensitive to inter-class variation [26].
On the other hand, learned-features-based approaches have shown remarkable achievements in object detection problems thanks to their superb learning capacity (i.e., the ability to learn compact and discriminative features), so some researchers have investigated them for vehicle detection tasks. Most learned-features-based approaches transform vehicle detection into a classification problem by generating vehicle proposals and then classifying each proposal independently with an image classification technique such as convolutional neural networks (CNN) [27] or support vector machines (SVM) [28]. In these approaches, features are learned by approximating the maximum likelihood of the data distribution; to increase the distinctive or discriminative power of the features, sparsity can be used as a regularization criterion. For detecting vehicles in a tunnel surveillance application, Rios-Cabrera et al. [29] used Haar features and an AdaBoost cascade classifier, while Fritz et al. [30] integrated a generative model (i.e., an appearance codebook) with an SVM classifier and local kernels. In [31], a located hidden random field, an extension of the conditional random field, is utilized for learning discriminative parts of vehicles that are rich in information content. A multi-view vehicle detection method using a cluster/cascade boosted tree classifier is presented in [32, 33]. Also, boosted random ferns on a HOG-like feature space are used for object detection in [34], achieving a vehicle detection equal error rate of 98.2% on the UIUC car dataset.
For detecting vehicles in high-resolution aerial images, Lin et al. [35] followed a similar direction by combining SURF and edge features, with the classification step performed via probabilistic modeling (Gaussian-weighted voting); the edge information is employed to highlight the sides of the vehicles. In [36], edge and corner features are used with a dynamic Bayesian network classifier. Cao et al. [37] combine deep convolutional neural network (DCNN)-based feature learning, to learn discriminative features, with exemplar SVM classifiers to improve the robustness of the classification step. They reported that "the leverage of DCNN has achieved significant performance boost by comparing to a serial of handcraft designed features." Other methods use roadside LiDAR sensors with DCNNs [38, 39]. In [40], a vehicle detection method for high-resolution aerial images is proposed using sparse representation and superpixel segmentation, where meaningful patches for training and detection are obtained from the superpixel segmentation. To construct a compact training subset and select the most representative samples from the whole training dataset, a training sample selection method based on the sparse representation is proposed; HOG-like features are used during the training and detection steps. This method achieves promising results on the Toronto and OIRDS datasets. Unfortunately, sparse representation has high computational complexity compared to counterpart classifiers such as SVM.
Random forests (RF) [41, 42] (a decision-tree-structured classifier that makes a prediction by routing a feature sample through the whole tree to its leaves in an unsupervised way) and the local binary pattern (LBP) [43, 44] (a texture descriptor that characterizes the spatial structure of local image texture) have received significant attention due to their success in several computer vision applications [45–47]. Inspired by the success of RF and LBP, we propose a simple yet effective method for detecting road vehicles in urban traffic scenes, where clustering forests based on the Chi-square measure are employed to build visual codebooks for appearance features extracted using the LBP. In particular, three different LBP variants are utilized to extract vehicle features, and a combination of the local vehicle features extracted by the three LBP variants is proposed to build a powerful clustering-forest classifier. Experimental results prove that the proposed method overcomes the limitations of several previous methods.
The remainder of this paper is organized as follows. Section 2 describes feature extraction using the LBP descriptor and its variants. The concept of clustering forests is discussed in Sect. 3. Details of the proposed vehicle detection method are explained in Sect. 4. Experimental evaluations and discussions are presented in Sect. 5. Conclusions are finally given in Sect. 6.
2 Feature extraction and fusion
2.1 Local binary patterns (LBP)

The basic LBP [48] characterizes the spatial structure of image texture based on the assumption that a texture has two locally complementary aspects: a pattern and its strength (amplitude). Its proven high discriminative power, its invariance to monotonic grayscale changes, and its computational efficiency make the LBP suitable for several image-related applications [49, 50]. Thanks to the invariance property, there is no need to apply grayscale normalization before computing the LBP feature. Briefly, LBP labels image pixels by thresholding the 3×3 neighborhood of each pixel with the center value, producing an 8-bit binary number. In other words, it creates an order-based feature for every image pixel by comparing each pixel's intensity with those of its neighboring pixels: neighbors whose responses exceed the central one are labeled "1", while the others are labeled "0." The comparison results are recorded as a string of binary bits. After that, weights coming from a geometric sequence with a common ratio of 2 are assigned to these bits according to their indices in the string. The binary string with its weighted bits is then transformed into a decimal-valued index (the LBP code), which is the LBP feature response, as shown in Fig. 1.
Mathematically, a pixel c with intensity $g_c$ is labeled as

$$S(g_p - g_c) = \begin{cases} 1, & \text{if } g_p \geq g_c \\ 0, & \text{if } g_p < g_c \end{cases} \quad (1)$$

where the pixels p belong to the 3×3 neighborhood with gray levels $g_p$, $p = 0, 1, \ldots, 7$. The final LBP code of the pixel c is computed by summing the thresholded values $S(g_p - g_c)$ weighted by a binomial factor of $2^p$:

$$LBP = \sum_{p=0}^{7} S(g_p - g_c)\, 2^p \quad (2)$$

In this context, the binary numbers generated by the LBP response encode the mutual relationship between the central pixel and its neighboring pixels. Unfortunately, the features calculated in this local 3×3 neighborhood may not capture large-scale structures of the image. Therefore, the basic LBP descriptor is extended to neighborhoods of different sizes, allowing sub-pixel alterations. Using circular neighborhoods rather than the fixed 3×3 allows any radius and any number of pixels in the neighborhood; in this way, the circular LBP (CLBP) puts no limitation on the neighborhood size of the pixel or on the number of sampling points, as described in [51]. For a neighborhood (P, R) with P sampling points on a circle of radius R, the CLBP code can be computed using

$$CLBP_{P,R}(x, y) = \sum_{i=0}^{P-1} S(g_i - g_c)\, 2^i \quad (3)$$

where $g_c$ and $g_i$ are the gray values of the central pixel and its neighbors, respectively. In the case that a sampling point is not in the center of a pixel, the pixel values are bilinearly interpolated. The coordinates of sampling point $(x_i, y_i)$ in an evenly spaced circular neighborhood of N sampling points and radius R around point (x, y) are given by

$$x_i = x + R\cos(2\pi i / N), \qquad y_i = y - R\sin(2\pi i / N) \quad (4)$$

Another extension of the basic LBP, called LBP variance (LBPV) [52], is introduced to avoid the loss of global spatial information inherent in the basic LBP. The principal orientations of the texture image are estimated based on the LBP distribution to align the LBP histograms; the aligned histograms are then used to measure the dissimilarity between images. The LBPV encodes the local contrast information into a one-dimensional LBP histogram. The local variance $LV_{P,R}$, which can be considered invariant against shifts in gray level, is given by

$$LV_{P,R} = \frac{1}{P} \sum_{l=0}^{P-1} (g_l - V)^2, \quad \text{where } V = \frac{1}{P} \sum_{l=0}^{P-1} g_l \quad (5)$$

The variance $LV_{P,R}$ is utilized as an adaptive weight to adjust the contribution of each LBP code to the histogram calculation. The joint distribution $LBP_{P,R} / LV_{P,R}$ produces the final LBP variance (LBPV) response, which describes the local texture better than the LBP. The three LBP variants discussed above are shown in Fig. 2, and an illustrative example of computing the LBP features from an input vehicle image using LBP, CLBP, and LBPV is shown in Fig. 3.

Fig. 1 Computing LBP code for a pixel in a 3×3 neighborhood
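For concreteness, the following is a minimal Python sketch of the basic 3×3 operator of Eqs. (1)–(2). The function name and the NumPy-based vectorization are illustrative assumptions, not the paper's C++/OpenCV implementation.

```python
import numpy as np

def lbp_basic(img):
    """Basic 3x3 LBP (Eqs. 1-2): threshold the 8 neighbors of each pixel
    against the center and pack the bits into an 8-bit code."""
    img = img.astype(np.int32)
    h, w = img.shape
    center = img[1:-1, 1:-1]
    codes = np.zeros((h - 2, w - 2), dtype=np.int32)
    # The 8 neighbor offsets, enumerated p = 0..7 (each carries weight 2**p).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for p, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Eq. (1): S(g_p - g_c), then Eq. (2): weight by 2**p and sum.
        codes += (neighbor >= center).astype(np.int32) << p
    return codes.astype(np.uint8)
```

Feeding a grayscale region to `lbp_basic` and histogramming the returned code map reproduces the descriptor of Fig. 1.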
After computing the LBP pattern for each pixel c(i, j), the input image of size N×M is represented by creating histograms of these LBP patterns, as shown in Fig. 4, using

$$H_1(k) = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} F(LBP_{P,R}(i, j), k), \quad k \in [0, K] \quad (6)$$

where

$$F(a, b) = \begin{cases} 1, & a = b \\ 0, & \text{otherwise} \end{cases} \quad (7)$$

For the LBPV patterns, the histogram is created using

$$H_2(k) = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \tilde{F}(LBP_{P,R}(i, j), k), \quad k \in [0, K] \quad (8)$$

where

$$\tilde{F}(LBP_{P,R}(i, j), k) = \begin{cases} LV_{P,R}(i, j), & LBP_{P,R}(i, j) = k \\ 0, & \text{otherwise} \end{cases} \quad (9)$$

The histograms are computed independently for each image patch (region), and K is the maximum value of the LBP pattern. Hence, the created histograms contain information about the pattern description at the pixel level and at the strength level.
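As a sketch of Eqs. (6)–(9), both region histograms can be built in a single pass over the code map; using `np.bincount` weights to realize the LV weighting of Eq. (9) is our assumption about one way the LBPV histogram could be implemented.

```python
import numpy as np

def lbp_histograms(codes, lv=None, n_bins=256):
    """Region histograms of Sect. 2.1; codes is the per-pixel LBP code map,
    lv the per-pixel local variance LV of Eq. (5) (optional)."""
    # H1 (Eqs. 6-7): count how often each LBP code k occurs in the region.
    h1 = np.bincount(codes.ravel(), minlength=n_bins).astype(np.float64)
    if lv is None:
        return h1
    # H2 (Eqs. 8-9): accumulate each pixel's local variance LV into the
    # bin of its LBP code instead of a plain count.
    h2 = np.bincount(codes.ravel(), weights=lv.ravel(), minlength=n_bins)
    return h1, h2
```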
2.2 Feature fusion and dimensionality reduction

The main aim in developing texture descriptors is to find a stand-alone approach that can handle blur, illumination variations, rotation and different scales in images. Another major concern is to design feature descriptors with a small dimensionality [53]. Unfortunately, it is not easy to find a single descriptor that always works well, as results also vary according to the classifier used; it is therefore very difficult to find a stand-alone method that works better than all others. The most feasible way to improve results is to combine various descriptors whose different characteristics complement the features each extracts, as with the co-occurrence-matrix features combined by sum rules in the image classification problem [54].
In this context, the LBP labels used in the histogram contain information about the patterns at the pixel level, and the labels summed over a small region produce information at the regional level. As a result, the vehicle features are represented by the LBP by dividing the vehicle shape into small regions and recovering it from the different local histograms, which enriches the feature histograms. A drawback of some LBP variants reported in the literature is that they have been tested on only a few application problems, while LBP features suffer the further drawback of large dimensionality. An illustration is shown in Fig. 5: the OLBP histograms can be described as a multimodal or bimodal distribution covering a wide range of the feature dimension (Fig. 5a); the CLBP histograms can be described as a positively skewed distribution (right skewed, with some outliers) covering a shorter range of the feature dimension than the OLBP feature vector; and the LBPV histograms can be described as a negatively skewed distribution (left skewed, with few outliers), likewise covering a short range of the feature dimension.

Fig. 2 The LBP variants: a the basic LBP 3×3, b the circular CLBP (16,2) neighbors [51], c the LBPV (16,2) neighbors [52]

Fig. 3 An illustration example for computing LBP-like features

Fig. 4 Extracting LBP feature histograms from an input vehicle image

Fig. 5 An illustration of the differences between the LBP derivative histograms for the same region used by the local similarity pattern

The histogram of the CLBP (16,2) feature distribution is shown in Fig. 5b, where its median is 50, and the histogram of the LBPV (16,2) feature distribution is shown in Fig. 5c, with a median of 200. The CLBP (16,2) and LBPV (16,2) are considered short descriptors with lower feature dimensionality. Furthermore, the LBPV (16,2) histogram values cover a range of dimension values not covered by the CLBP (16,2).
Even though the CLBP and LBPV contain fewer feature dimensions, they can still be used as unique feature descriptors. This reduction in dimensionality can significantly improve the speed of feature similarity measurement, which is needed for high-speed and real-time applications.
Several dissimilarity measures have been proposed for histograms, such as the log-likelihood statistic and the Chi-square statistic (X²). The log-likelihood measure is suitable in many situations but may be unreliable for small regions, because their histograms may contain many zeros, whereas the Chi-square distance is usually stable for small regions, as reported in [43], and is thus suitable for the problem at hand.
The Chi-square dissimilarity measure between two histograms S and M can be written as

$$\chi^2(S, M) = \sum_{i=1}^{n} \frac{(S_i - M_i)^2}{S_i + M_i} \quad (10)$$

where the factor $\frac{1}{S_i + M_i}$ is inversely proportional to the sum of the two histogram bins; when this sum is very small, the clustering of more infrequent patterns into weighted clusters can still exert some influence.

The procedure consists of using the texture descriptor to build several local descriptions of the vehicle's features and combining them into a global description, instead of striving for a holistic description. With spatial histograms, the vehicle's features are effectively described at three different levels of representation: histograms carry pattern information at the pixel level; histograms over different neighbor distances provide information at the regional level; and histograms incorporating the effect of contrast describe patterns at the strength level. This enlarges the feature space of the vehicle and gives a chance to search deeply for latent features. In this way, each region in a vehicle image receives different LBP codes, and the most accurate information is obtained by using the joint distribution of these codes.
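Eq. (10) transcribes directly to code; the small `eps` guard for bins that are empty in both histograms is our addition, not part of the paper's formulation.

```python
import numpy as np

def chi_square(s, m, eps=1e-12):
    """Chi-square dissimilarity of Eq. (10) between histograms s and m.
    eps avoids division by zero when a bin is empty in both histograms."""
    s = np.asarray(s, dtype=np.float64)
    m = np.asarray(m, dtype=np.float64)
    return float(np.sum((s - m) ** 2 / (s + m + eps)))
```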
3 Clustering forests
A clustering forests (CF)-based framework is proposed for building visual codebooks. The clustering forest [47, 55] consists of randomized decision trees built recursively top down, as given in Algorithm 1. At each node i, corresponding to a set of regions $A_i = \{R^{train}_i\}$, two children nodes L and R are created by choosing a binary test $T(\Gamma) \rightarrow \{0, 1\}$ that divides $A_i$ into two disjoint sets of regions with

$$\Omega_i = \arg\max \Delta(A_i, A^R_i, A^L_i, T_i) \quad (11)$$

such that $A_i = A^L_i \cup A^R_i$ with $A^L_i \cap A^R_i = \phi$, and $\Delta = \Delta_1$ or $\Delta_2$. Here $\Delta_1$ measures the impurity of the displacement vectors $d_j$ of the regions, with $d_m$ the mean displacement vector over all vehicle-class regions in the set:

$$\Delta_1(A_i) = \sum_{j:\, l_j = 1} \| d_j - d_m \|^2 \quad (12)$$

The impurity of the class label $l_i$, i.e., the uncertainty between vehicle regions and the background, is measured using $\Delta_2$ as

$$\Delta_2(A_i) = |A_i| \cdot \Phi(l_i | A_i) \quad (13)$$
where

$$\Phi(l_i | A_i) = -\frac{\sum_h P(l_i | A_i) \log P(l_i | A_i)}{\sum_h P(l_i | A_i)} \quad (14)$$

The Aczél and Daróczy entropy $\Phi(l_i | A_i)$ is used to maximize the joint probability of the class label for the classification information gain. In this case, a probability density function (PDF) can be defined as the saliency map, where a high probability vote for a region represents high saliency. During the training phase, the splitting recursion continues until further subdivision is impossible, the maximum clustering level is reached, or all surviving training regions fall under one similarity-measurement threshold, i.e., all regions have identical values for all histograms. The Chi-square ($\chi^2$) dissimilarity measure defined in Eq. (10) can be considered the crux of the binary test:

$$T(S, M, \omega) = \begin{cases} 1, & \chi^2(S, M) < \omega \\ 0, & \text{otherwise} \end{cases} \quad (15)$$

Here S and M are two histograms indexed under the same LBP descriptor. The tests are selected randomly, such that the histogram index $I(H_i) = \{H_1, H_2, H_3\}$ for the appearance of regions extracted by the LBP channels $\Gamma = \{\gamma_1, \gamma_2, \gamma_3\}$ is chosen at random. A random elementary threshold $\omega$ is generated at the beginning of the training stage for the dissimilarity tests; thereafter, the threshold value $\omega$ for each test is chosen uniformly at random from the range of differences between histograms observed on the data.

4 The proposed method

4.1 Training phase

During the training phase, each tree is constructed recursively from the root in a top-down manner. Each splitting node j receives a set of histograms $H_j$ of the training regions $R^{train}_j$ of fixed size (36×36 pixels), defined by the extracted LBP feature $\Gamma_j$. The Chi-square similarity measure is assigned to each splitting node in the binary test function used to split the regions to a deeper level; this splitting process depends on the dissimilarity value found among the arriving histograms through a random choice of the measuring mode $\Delta_1$ or $\Delta_2$. If the number of regions is small ($N_{min} = 20$) or the last clustering depth level is reached ($d_{max} = 15$), the constructed splitting node is declared a cluster, and the cluster's probability vote information is stored in a saliency map together with a list of the displacement vectors $d_i$ from the vehicle regions to the vehicle centroid. Figure 6 shows the basic processes in the training phase of the proposed method.
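A compact sketch of the node-splitting machinery of Eqs. (12)–(15) follows, reusing `chi_square` from the previous sketch. With only two class labels, the normalized Aczél–Daróczy entropy of Eq. (14) reduces to the Shannon form, which is the simplification used here; all names and array layouts are illustrative rather than the paper's implementation.

```python
import numpy as np

def regression_impurity(d, labels):
    # Delta_1 (Eq. 12): scatter of the vehicle regions' displacement
    # vectors d (n x 2) around their mean; background regions (label 0)
    # are ignored.
    dv = d[labels == 1]
    if len(dv) == 0:
        return 0.0
    return float(np.sum(np.linalg.norm(dv - dv.mean(axis=0), axis=1) ** 2))

def classification_impurity(labels):
    # Delta_2 (Eqs. 13-14): |A| times the entropy of the class labels.
    p = float(np.mean(labels == 1))
    ent = -sum(q * np.log(q) for q in (p, 1.0 - p) if q > 0.0)
    return len(labels) * ent

def node_test(hist, ref_hist, omega):
    # Binary test T (Eq. 15): a region is routed to one child (returns 1)
    # when its histogram is chi-square-similar to the reference histogram
    # within the randomly drawn threshold omega, else to the other child.
    return 1 if chi_square(hist, ref_hist) < omega else 0
```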
Fig. 6 Schematic representation for the training process of the pro- posed method
4.2 Selective regions models (dense and sparse)

The first design choice concerns the feature sampling mechanism, which can be either dense or sparse. Dense sampling is used very commonly, mainly because it produces no information loss. Sparse sampling selects regions according to some convenient criteria, which has the potential advantage of reducing complexity and computational time, but may induce a significant loss of discriminative information. When an image is divided into regions, some regions can be expected to contain more useful information than others in terms of vehicle features: for example, the headlights, logo, side mirrors, wheels, and license plate are important cues in vehicle tracking for traffic management. To take advantage of this, a uniform distribution is utilized to divide the image into dense regions $R_j$ that sparsely cover the whole range of vehicle features (see the sketch after this paragraph). Let A be a set of regions $A = \{R^{train}_j(\Gamma_j, H_j, l_j, d_j)\}$, where each region is represented by three feature channels $\Gamma = \{\gamma_1, \gamma_2, \gamma_3\}$ extracted by different LBP derivatives; H is the set of histograms $H = \{H_1, H_2, H_3\}$ generated in concurrence with the spatial LBPs; the displacement vector $d_j$ is associated with each region; and the class label is $l_j = \{1, 0\}$ for vehicles and background.
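A minimal sketch of the uniform region sampling just described; the region count and the use of a NumPy random generator are assumptions made for illustration.

```python
import numpy as np

def sample_regions(img, n_regions=200, size=36, rng=None):
    """Draw fixed-size square regions uniformly over the image so that
    they densely cover the vehicle while remaining a sparse subset of
    all possible windows (36x36 is the training-phase region size)."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    ys = rng.integers(0, h - size + 1, n_regions)
    xs = rng.integers(0, w - size + 1, n_regions)
    return [(y, x, img[y:y + size, x:x + size]) for y, x in zip(ys, xs)]
```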
4.3 Detection phase
During the detection phase, each region histogram from the test image is matched to the clustered histograms, similar to appearance-based object detection [56]; when an entry in the clustered forest is activated, it casts votes for all possible vehicle centroids based on the associated displacement vectors. Candidate vehicle centroids appear as high-saliency modes of the saliency map, found by mean-shift mode estimation. For each member histogram $\{H_j\}_i$ of cluster i, we store the position of its linked region relative to the vehicle center. Thus, a cluster entry records the similarity-measure threshold value, the LBP derivative descriptor type, the number of similar region histograms, and the region positions relative to the vehicle center $(y_m, m = 1, \ldots, M_i)$.

The saliency map can be used for vehicle detection by measuring, with the Chi-square distance, the similarity of each test-image region histogram located at x to the clustered histograms in the forest. A cluster entry is declared similar if $\chi^2(\cdot) < \omega$. For each matched cluster entry J, we cast votes for the possible locations of the vehicle centers $(y_m, m = 1, \ldots, M_i)$. The saliency map, which represents a PDF, can be used as prior knowledge to detect vehicles; detection is then accomplished by searching the saliency map with mean-shift mode estimation. Formally, let $P(x_n)$ be the probability that a vehicle appears at candidate position $x_n$ in the test image. Since the voting is cast from each individual member of the activated cluster entries J, we cast votes for all possible locations of vehicle centers $y_m$ with bounding boxes of size W×H. Thus, $P(x_n | J)$ can be marginalized as

$$P(x_n | J) = \sum_m P(x_n | y_m)\, P(y_m | J) \quad (16)$$

Without prior knowledge of $y_m$, we treat them equally and assume that $P(y_m | J)$ is a uniform distribution (i.e., $P(y_m | J) = 1 / M_i$). The vote falls only at the locations of possible vehicle centers, so $P(x_n | y_m) = \delta(x_n - y_m)$, where $\delta(z)$ is the Dirac delta function. The vehicle detection hypothesis E(V) combining all trees in the forest $\{F_t\}_{t=1}^{T}$ is given by

$$E(V) = \sum_{x_n} P(x_n | J; \{F_t\}_{t=1}^{T}) \quad (17)$$
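The voting of Eq. (16) can be sketched as an accumulation of uniform per-entry vote mass into a saliency map; the data layout (a list of matched entries, each with its region position and stored center offsets) is an assumption.

```python
import numpy as np

def vote_saliency(matches, shape):
    """Accumulate centroid votes: each matched cluster entry casts
    P(y_m|J) = 1/M_i at every stored center offset y_m, measured from
    the matched region's position (ry, rx)."""
    saliency = np.zeros(shape, dtype=np.float64)
    for (ry, rx), offsets in matches:
        w = 1.0 / len(offsets)  # uniform P(y_m|J)
        for dy, dx in offsets:
            cy, cx = ry + dy, rx + dx
            if 0 <= cy < shape[0] and 0 <= cx < shape[1]:
                saliency[cy, cx] += w
    return saliency
```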
To handle scale variations, the hypothesized E(V) bounding box in the original image is centered at the point $x^s_n$ and re-scaled to $W_s \times H_s$, where s is the scale value of the detection. The detection process is illustrated in Fig. 7.
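The mode search over the saliency map can be sketched as a plain mean-shift iteration; the circular window and the bandwidth value are assumptions, since the paper does not specify these details.

```python
import numpy as np

def mean_shift_modes(saliency, seeds, bandwidth=16, n_iter=20):
    """Shift each seed toward the saliency-weighted centroid of a circular
    window until it settles on a local mode (a candidate vehicle center)."""
    h, w = saliency.shape
    ys, xs = np.mgrid[0:h, 0:w]
    modes = []
    for cy, cx in seeds:
        for _ in range(n_iter):
            mask = (ys - cy) ** 2 + (xs - cx) ** 2 <= bandwidth ** 2
            total = saliency[mask].sum()
            if total == 0:
                break
            cy = (ys[mask] * saliency[mask]).sum() / total
            cx = (xs[mask] * saliency[mask]).sum() / total
        modes.append((int(round(cy)), int(round(cx))))
    return modes
```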
5 Experiments and results
5.1 Evaluation metrics and datasets

Many contributions in the literature on object detection evaluation present a multiplicity of metrics measuring different aspects of the evaluation protocols [57].
Kasturi et al. [58] presented metrics and tools for evaluating the performance of object detection algorithms based
Fig. 7 Schematic representation for the detecting process of the pro- posed method
on the detection accuracy (DA), multiple object detection accuracy (MODA), and multiple object detection precision (MODP). The DA computes the spatial overlap between the ground truth (GT) and detected objects as the ratio of the spatial intersection of the two objects to their spatial union. The sum of all overlaps is normalized over the average number of ground-truth and detected objects for a single image z, where the predicted bounding boxes are $D^{(z)}_i$, the ground-truth bounding boxes are $G^{(z)}_i$, and a nonbinary threshold is applied to the overlap ratio between them, such that

$$DA(z) = \frac{O_{Ratio}}{\left[ \dfrac{N^{(z)}_G + N^{(z)}_D}{2} \right]}, \qquad O_{Ratio} = \sum_{i=1}^{N^{(z)}_{mapped}} \frac{|G^{(z)}_i \cap D^{(z)}_i|}{|G^{(z)}_i \cup D^{(z)}_i|} \quad (18)$$

where $G^{(z)}_i$ is the ith ground-truth vehicle, $D^{(z)}_i$ is the ith detected vehicle, $N^{(z)}_G$ and $N^{(z)}_D$ denote the numbers of ground-truth and detected vehicles in image z, respectively, and $N^{(z)}_{mapped}$ is the number of mapped ground-truth and detected vehicle pairs in image z. This yields an N×M matrix of metric scores, with N ground-truth objects and M detected objects, in which the entry $x_{ij}$ denotes the metric score for the detected object $D^{(z)}_j$ when mapped to the ground-truth reference object $G^{(z)}_i$. Figure 8 shows the scoring matrix and the computation of the nonbinary thresholding for the DA [58]. To evaluate the accuracy of the proposed method, the multiple object detection accuracy (MODA) is used to measure the missed-detection and false-positive counts. Assuming the number of misses is denoted by $m_z$ and the number of false positives by $fp_z$ for each image z, MODA can be computed using

$$MODA(z) = 1 - \frac{fp_z + m_z}{N^{(z)}_G} \quad (19)$$

Meanwhile, the normalized multiple object detection precision (NMODP), which gives the detection precision over the entire sequence of testing data, is computed as

$$NMODP = \frac{\sum_{z=1}^{N_{total}} MODP(z)}{N_{total}}, \qquad MODP(z) = \frac{O_{Ratio}}{N^{(z)}_{mapped}} \quad (20)$$
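A sketch of the per-image metrics of Eqs. (18)–(19); the box representation (x1, y1, x2, y2) and the assumption that the GT-to-detection mapping is given as index pairs are ours.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

def detection_accuracy(gt, det, mapping):
    """DA of Eq. (18): summed overlap of mapped GT/detection pairs,
    normalized by the average object count of the image."""
    o_ratio = sum(iou(gt[i], det[j]) for i, j in mapping)
    return o_ratio / ((len(gt) + len(det)) / 2.0)

def moda(n_gt, misses, false_pos):
    """MODA of Eq. (19) for a single image."""
    return 1.0 - (false_pos + misses) / float(n_gt)
```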
On the other hand, the proposed method is tested extensively on several challenging car datasets: the UIUC dataset [28], the Stanford dataset [59], the Pascal VOC 2012 [60] car class, the TU Graz-02 [61] car class, and the surveillance-nature data of the CompCars dataset [62]. Images of these datasets cover extensive variations of on-road vehicles (e.g., multiple views, scales, and models) in urban scenes (involving naturally high variability of cluttered backgrounds and different roads) taken from different points of view. Table 1 outlines the properties of these datasets. The learned clustering-forest detectors are evaluated on the UIUC dataset [28], which consists of two test sets: single-scale and multi-scale. The single-scale part has 170 images containing 200 car instances at a fixed resolution of 100×40, while the multi-scale part consists of 108 images with 139 cars at resolutions ranging from 89×36 to 212×85. The best learned detectors are then evaluated on three datasets with various imaging conditions: the Stanford cars dataset [59], which has 8041 testing images covering 196 classes of cars across a variety of coarse categories; the Pascal VOC 2012 [60] car class, with 1161 testing images of realistic scenes; and TU Graz-02 [61], whose car class contains 420 images of extreme complexity and high intra-class variability on highly cluttered urban backgrounds. Finally, the proposed method is compared with other methods on two different datasets: the UIUC car dataset [28] for urban traffic scenes and the surveillance-nature data of [62] for on-road vehicles.
Fig. 8 An illustration example for detection accuracy computation. a The nonbinary thresholding, b scoring matrix
Table 1 Details of the real, challenging car datasets used for evaluating the proposed method

Dataset | No. images | No. objects | Image size | Res.
UIUC dataset [28] | 278 | 339 | 188 × 112 | Low
Stanford cars [59] | 8041 | 8041 | 640 × 426 | Multi
Pascal VOC 2012 [60] | 1161 | 2017 | 500 × 375 | Multi
TU Graz-02 cars [61] | 420 | 770 | 640 × 480 | Multi
Surveillance nature set [62] | 10,000 | 10,000 | 778 × 744 | High
5.2 Parameter settings

The PC configuration is an Intel Core i5 at 2.47 GHz with 4 GB of RAM. The proposed method is implemented on the Windows 7 platform using C++ and the OpenCV image processing library. The efficiency of the proposed method is evaluated using the same clustering-forest training settings throughout: stopping criteria of 15 maximum clustering levels or 20 regions with the same features, and three different forest sizes {1, 3, 5}. At the detection phase, two and three scales are used to handle the variety of vehicle sizes in the test datasets. The result curves are generated by changing the acceptance threshold on the hypothesis vote strength $P(x_n | J)$. Each experiment is carried out ten times, and the standard deviation of the detection-rate values is considered in the comparisons.
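For reference, the settings of this subsection can be collected in a single configuration record; the structure and names below are illustrative, not taken from the paper's C++ code.

```python
# Clustering-forest settings used throughout the experiments (Sect. 5.2).
CF_CONFIG = {
    "max_depth": 15,           # stop condition: maximum clustering level
    "min_regions": 20,         # stop condition: regions with same features
    "forest_sizes": (1, 3, 5), # numbers of trees compared
    "region_size": (36, 36),   # training-phase region size, pixels
    "detection_scales": (2, 3),# scales used at the detection phase
    "n_repeats": 10,           # runs per experiment (std. dev. reported)
}
```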
For the training sets: first (training set-1), we collect 587 positive sample images from the training set-1 of the CompCars dataset (web-nature data) [62], re-scaled so that the largest bounding-box dimension is 200×100 and the smallest is 140×100, with an average size of about 181×100. We also carefully select vehicle images whose appearance contains at least the key vehicle features (wheels and headlights), rotated from an angle α° at which four features appear (2 wheels + 2 headlights) to an angle β° at which three features appear (2 wheels + 1 headlight). Negative training samples are randomly cropped from the backgrounds of 5329 images. Each clustering forest is trained on about 29,350 positive regions and 266,450 negative regions to build a training model that works robustly under hard conditions. The second training set (training set-2, the Vehicle Image Database-GTi (VID-GTi) [63]) is chosen to investigate the CF training ability at low resolution. Images of this dataset are of size 64×64, captured for on-road vehicles and extracted from sequences of 360×256 pixels recorded on highways; it is split into four parts according to the pose of the vehicle relative to the camera: middle-close range in {front, left, right} and far range. Additionally, the images are not cropped to fit the contours of the vehicles, to make the classifier more robust to hypotheses of partially occluded vehicles. Table 2 gives details of the two training sets.
5.3 Clustering splitting mode and LBP feature extractors
5.3.1 Classification versus regression model
The purpose of the first experiment is to investigate and analyze the following aspects of the clustering-forest learning algorithm:
1. Clustering splitting mode (classification and regression).
2. Feature extraction using the LBP variations.
Therefore, the experiment is designed with the following setting:
• Training a clustering forest using only the classification mode $\Delta_2(A_i) = |A_i| \cdot \Phi(l_i | A_i)$.
• Training another clustering forest using the regression mode $\Delta_1(A_i) = \sum_{j:\, l_j = 1} \| d_j - d_m \|^2$.
• Training a clustering forest using the random splitting mode $\Delta = \Delta_2$ or $\Delta_1$.
Each clustering forest is then trained individually on each feature-extractor LBP channel, namely the original 3×3 LBP (OLBP), the circular LBP (16,2) (CLBP), and the LBPV, used for dissimilarity measurement between regions with $\Gamma = \{\gamma_1, \gamma_2, \gamma_3\}$. For multi-LBP training, the clustering forest is constructed using a combination of all the LBP operators, with a random choice among them for feature extraction at each splitting node.
Figure 9 shows a visual comparison between the different clustering splitting models and the different LBP-based feature extractors. The horizontal axis represents the splitting node number, and the vertical axis the threshold value ω for splitting regions. The vertical axis ranges between positive and negative threshold values: positive values correspond to right splitting nodes and negative values to left splitting nodes. At the start of the clustering levels, splitting proceeds continuously through approximately the first quarter of the levels, which can be seen as a compression; this massive splitting is caused by the large dissimilarities and variations among the vast numbers of regions of different vehicles in the training datasets. On the other hand, gaps appear as rarefactions between the splitting nodes along the horizontal axis, where the threshold value ω = 0 corresponds to cluster entries J that have an acceptable similarity between positive training regions.
Analyzing the clustering splitting mode (i.e., classification or regression), one can note that the classification mode $\Delta_2(A_i)$ in column (a) of Fig. 9 has fewer clustering nodes than the regression mode $\Delta_1(A_i)$ in column (b). Under the other conditions, we analyze the feature extraction
Table 2 Details of the training datasets used in the training phase

Dataset | Pos. | Neg. | Avg. dim. | Res. | Scale
Web-nature set [62] | 587 | 5329 | 181 × 100 | High | Multi
VID-GTi [63] | 3425 | 3900 | 64 × 64 | Low | One
based on the LBP variations: the behavior of the clustering nodes fluctuates, starting from the first row (OLBP), which has a huge number of clustering nodes, through CLBP in the second row, which has fewer clustering nodes, to LBPV in the third row, with slightly compressed clustering nodes. The random choice mode $\Delta(A_i)$ yields a moderate number of clustering nodes when using OLBP, CLBP and LBPV, and even fewer clustering nodes when the LBP variants are merged in multi-LBP. Table 3 reports the training time for each model, together with the actual numbers of clustering nodes and cluster entries. It is clear that the OLBP feature type has a higher-dimensional histogram than the CLBP and LBPV feature types. Considering the training time, the OLBP features take longer than the CLBP and LBPV features, while CLBP and LBPV consume approximately the same, steady training time. The average and maximum numbers of regions per cluster entry are reported in Table 4: in all clustering models there is a large gap between the average values and the anomalous maximum values, which are caused by a small number of cluster entries; the average values, however, confirm the stability of the entire set of cluster entries. The behavior of the feature extraction is further investigated across the LBP variants using a statistical methodology (descriptive statistics) to analyze and measure the variation of region similarity produced in the clusters for each LBP operator. The results for the mean, standard deviation, and coefficient of variation are reported in Table 5, demonstrating significant variation between the cluster entries and region histograms across the LBP operators. Also, the ratio of cluster entries with PDF probability > 0.5 is computed, and the results for 500 cluster entries are drawn in Fig. 10.
The learned clustering forests with different LBP derivatives and splitting measurement models are evaluated for vehicle detection on the multi-scale part of the UIUC car dataset. The detection task is usually evaluated with the Recall/Precision curve, where Precision = TP/(TP + FP) and Recall = TP/(TP + FN). Thus, the LBP feature extraction and the different learning models are assessed by the true positive rate (TPR), the true negative rate (TNR), and the equal error rate (EER) point, where the TPR describes the proportion of positive samples that are classified correctly.
A higher TPR value indicates a better classifier for positive samples, while the TNR represents the proportion of negative samples classified correctly; a higher TNR value corresponds to a lower false-positive rate (FPR). A detection is counted as successful only when the intersection of the detection response and the ground-truth box is larger than 60% of their union.

Fig. 9 Splitting threshold of LBP derivatives among clustering levels for classification and regression impurity measurements

Table 6 lists the quantitative results of applying the learned clustering forests to vehicle detection: the classification impurity model $\Delta_2$ with multi-LBP achieves an EER of 69.21%, and the regression impurity model $\Delta_1$ with LBPV achieves an EER of 46.16%, while the random impurity model $\Delta$ with multi-LBP achieves the highest EER of 72.63%, outperforming the other LBP feature extractors.
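For completeness, the curve points used throughout Sect. 5 reduce to two small helpers; locating the EER as the point where precision and recall are closest is our assumption about the protocol's operating point.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN)."""
    return tp / float(tp + fp), tp / float(tp + fn)

def equal_error_rate(curve):
    """Approximate the EER from a list of (precision, recall) points by
    picking the point where the two values are closest."""
    return min(curve, key=lambda pr: abs(pr[0] - pr[1]))
```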
5.3.2 Impact of clustering forests parameters
A second experiment is carried out to investigate the impact of the clustering parameters, including the clustering forest size, the depth of the clustering levels, and the region size, so as to optimize the clustering forests for the best vehicle detection. To this end, two learned clustering forests of different sizes F = {3, 5} are implemented using the random splitting impurity Δ with the OLBP, CLBP, and LBPV feature extractors. The conjunction of LBP derivatives in multi-LBP is drawn randomly with sub-supervision, where OLBP and LBPV are favored over CLBP on account of the previous detection results. Figure 11a shows the results obtained in this experiment, and Table 7 gives detailed information about the clustering process: the OLBP channel scores the highest cluster count for all forest sizes, while CLBP and LBPV score the lowest cluster counts across the different forest sizes. All the learned detectors are evaluated on the single-scale part of the UIUC car dataset. Three different overlap ratios {40%, 50%, 60%} are considered to evaluate the detection accuracy of vehicles. The multi-LBP feature extractor outperforms any single LBP feature, as reported in Table 8.
The third experiment examines the depth of the clustering levels. We implement four different learned clustering forests with only the multi-LBP feature extractor at F = 1, varying the clustering depth level over {5, 8, 10, 12}. In this case, all learned detectors are evaluated on the single-scale part of the UIUC car dataset with a 60% overlap ratio. Figure 11b shows the comparison between the different clustering levels: the clustering forests are sensitive to the clustering level depth, and the clustering forest with 15 clustering levels achieves the best score among the others. The effect of region size on the overall detection rate is shown in Fig. 11c; clearly, performance is also sensitive to the region size.
5.4 Evaluation under different imaging conditions

Based on the previous analysis and findings, we use clustering forests of five trees with the two stop conditions, a maximum of 15 clustering levels, a region size of 16×16, and the multi-LBP as the feature extractor; the discriminative saliency map is then learned. The proposed method is evaluated on three different car datasets, Stanford cars, Pascal VOC 2012, and TU Graz-02 cars, using the (Recall/Precision) curves.
Table 3 Training time, number of clustering nodes and cluster entries for the impurity measures Δ1, Δ2 and Δ

Split measure | Regression (Δ1): Time (m) / Nodes / Clusters | Classification (Δ2): Time (m) / Nodes / Clusters | Random (Δ): Time (m) / Nodes / Clusters
OLBP | 115.17 / 32,764 / 4058 | 73.55 / 32,544 / 3536 | 119.45 / 32,773 / 4348
CLBP | 79.47 / 32,115 / 4663 | 64.26 / 29,491 / 1185 | 114.62 / 32,000 / 2914
LBPV | 77.86 / 32,755 / 2757 | 66.96 / 32,287 / 2171 | 101.20 / 29,707 / 2885
Multi-LBP | 109.53 / 32,208 / 3338 | 99.86 / 32,757 / 2521 | 120.66 / 31,839 / 3209
Table 4 Average and maximum regions in each cluster entry across different splitting impurity measurements

Split measure | Classification Δ2: Avg. / Max. | Regression Δ1: Avg. / Max. | Random Δ: Avg. / Max.
OLBP | 8 / 1571 | 7 / 961 | 6 / 777
CLBP | 24 / 3387 | 7 / 1088 | 10 / 934
LBPV | 13 / 625 | 10 / 1995 | 10 / 1120
Multi-LBP | 11 / 1206 | 8 / 2609 | 9 / 845
Table 5 Mean, standard deviation and coefficient of variation of region similarity for the different LBP variants, and ratio of cluster entries with probability > 0.5

LBP feature | Mean | SD (%) | Coeff. var. (%) | Entries > 0.5 (%)
OLBP | 0.64 | 33.5 | 51.9 | 61.4
CLBP | 0.43 | 26.6 | 60.6 | 38.2
LBPV | 0.65 | 33.1 | 50.4 | 70.8
Multi-LBP | 0.75 | 23.3 | 30.9 | 70.1
Fig. 10 Comparison of the LBP variants' PDF with coefficient of variation

Table 6 Evaluation of the impurity measures and different LBP derivatives on the multi-scale UIUC car dataset (overlap ratio threshold 60%)

Split measure | Classification Δ2: TPR (%) / TNR | Regression Δ1: TPR (%) / TNR | Random Δ: TPR (%) / TNR
OLBP | 44.52 / 0.68 | 23.70 / 0.37 | 60.18 / 0.85
CLBP | 46.15 / 0.83 | 26.92 / 0.85 | 38.46 / 0.87
LBPV | 27.07 / 0.79 | 46.16 / 0.46 | 49.55 / 0.61
Multi-LBP | 69.21 / 0.83 | 38.46 / 0.57 | 72.63 / 0.76
Fig. 11 Impact of clustering parameters. a Sharing ratio of LBP derivatives in the Multi-LBP training features, b performance versus clustering level depth, c performance versus region size
A detection is counted as successful only when the intersection of a detection response and a ground-truth box is larger than 50% of their union. The Recall/Precision curves are shown in Fig. 12, which demonstrates the efficiency of the proposed method on cluttered backgrounds and over the wide variation of car models and viewpoints in these datasets. Specifically, the multi-LBP achieves a precision of 79.2% on the Stanford dataset (examples of successful detections are shown in Fig. 13), a precision of 83.9% on the Pascal VOC 2012 car class (examples in Fig. 14), and a precision of 88.5% on the TU Graz-02 car class (examples in Fig. 15).
5.5 Comparison with other methods
Importantly, the proposed method is compared with other methods on both the single-scale and multi-scale parts of the UIUC car test set. The comparison is given in Table 9; the proposed method competes well with the state-of-the-art methods. From the experiments, the proposed method achieves high detection rates of 99.6% on the single-scale part and 98.9% on the multi-scale part of the UIUC car dataset. For the single-scale part, the method of Mutch and Lowe [21], based on SVM and Gabor features, achieves 99.9%, outperforming our method at 99.6%; for the multi-scale part with difficult imaging conditions, however, the proposed method achieves 98.9% against 90.6% for Mutch and Lowe's method [21]. Examples of detection results on the UIUC dataset obtained by the proposed method are shown in Fig. 16. Another experiment is conducted to
Table 7 Number of splitting nodes and clusters for three forest sizes F = 1, 3, 5

LBP feature | F=1: Nodes / Clusters | F=3: Nodes / Clusters | F=5: Nodes / Clusters
OLBP | 32,773 / 4348 | 32,776 / 4601 | 32,637 / 4714
CLBP | 32,000 / 2914 | 31,511 / 2700 | 31,913 / 2354
LBPV | 29,707 / 2885 | 32,451 / 2760 | 30,379 / 3026
Multi-LBP | 31,839 / 3209 | 31,496 / 2858 | 32,330 / 2192
Table 8 Detection accuracy of the LBP and its variants on the UIUC car dataset at overlap ratio thresholds of 40%, 50% and 60%

LBP feature | CF size | 40% | 50% | 60%
OLBP | F=1 | 72.5% | 69% | 60.18%
OLBP | F=3 | 75.9% | 74% | 62.7%
OLBP | F=5 | 77.2% | 76% | 65%
CLBP | F=1 | 43.8% | 41% | 38.46%
CLBP | F=3 | 45.8% | 43.7% | 42%
CLBP | F=5 | 50% | 48.2% | 45.3%
LBPV | F=1 | 61% | 52.3% | 49.55%
LBPV | F=3 | 68.6% | 64% | 57%
LBPV | F=5 | 75.6% | 72.7% | 62%
Multi-LBP | F=1 | 86% | 83% | 72.63%
Multi-LBP | F=3 | 93.1% | 91% | 88.4%
Multi-LBP | F=5 | 99.9% | 99.6% | 95.7%
Fig. 12 Performance of the proposed method under different imaging conditions: Precision vs Recall (50% overlap) curves for OLBP, CLBP, LBPV and Multi-LBP. a Stanford cars dataset, b Pascal VOC 2012 car class, c TU Graz-02 dataset (car class)
compare the proposed method with a deep learning-based approach. For this comparison, the "You Only Look Once" (YOLO) method [64] is re-implemented.
Fig. 13 Example of detection results on the Stanford cars dataset
Fig. 14 Example of detection results on the Pascal VOC 2012 car class
Fig. 15 Example of detection results on the TU Graz-02 dataset (car class)
Fig. 16 Example of detection results on the UIUC dataset using the proposed method
Fig. 17 Comparison of the proposed method with two pre-trained YOLO deep learning techniques: Precision vs Recall (50% overlap) curves for yolo-voc, yolo-coco and Multi-LBP
The re-implementation uses the open-source Darknet framework with the tiny network, using 1024 MB of GPU memory for computation. Two different pre-trained car classifiers are employed: the first (yolo-voc) is trained on the Pascal VOC dataset [60], and the second (yolo-coco) on the COCO dataset [65]. The proposed method is trained on training set-2 [63] with the same CF configuration and a 9×9 region size. For the testing stage, a random set of the surveillance-nature data (Table 1) is used. The Recall/Precision curves for this comparison are shown in Fig. 17. The proposed CPU-based method achieves a high precision of 87.2% compared to the GPU-based detectors, where yolo-voc scores a precision of 74.8% and yolo-coco a precision of 79.4%. Examples of detection results on the surveillance-nature dataset obtained by the proposed method are shown in Fig. 18.
6 Conclusions
In this paper, a method is proposed to detect road vehicles using a saliency map based on the combination of local binary pattern histograms and clustering forests. This is carried out within the framework of random forests, where a combination of local binary pattern variants for appearance-based detection and a random impurity measurement (classification and regression) for the clustering forests are introduced as the optimal choice for low-cost and reliable training and detection. Three different LBP variants utilized as feature extractors are analyzed for road vehicle detection, and the best performance is achieved by a combination of these LBP variants at medium region sizes (e.g., 12×12 to 16×16). On the other side, the clustering forests used as a training technique are investigated under several parameter settings, for the clustering forest size, the clustering level depth, and the splitting of the training set across the different measuring modes, on hard car datasets with uncontrolled background conditions, namely the UIUC car dataset, the Stanford cars dataset, the Pascal VOC 2012 car class, the TU Graz-02 cars dataset, and the surveillance-nature data of the CompCars dataset. The best performance is accomplished when the clustering forest size is more than three trees, with a clustering level depth between 12 and 15 and the random impurity measurement mode. The proposed method achieves high performance compared to many state-of-the-art methods on the UIUC car dataset, which is one of the most difficult real-world datasets. It is also compared with supervised deep learning techniques and achieves competitive results without the need for GPU computation. Moreover, the proposed method is easy to implement and interpret. For future work, the clustering process provides useful information that could be used to classify different types or models of vehicles, and the clusters need some form of pruning; thus, a further identification process
Fig. 18 Example of detection results on the surveillance nature set using the proposed method
Table 9 Comparison of the proposed method with the state of the art on the UIUC dataset

Method | Classifier | Features | Training dataset | Pos. | Neg. | Single (%) | Multi (%)
Mutch and Lowe [21] | Linear SVM | 2D Gabor | UIUC | 500 | 500 | 99.9 | 90.6
Agarwal et al. [28] | SNoW | BFV | UIUC | 500 | 500 | 76.5 | 39.6
Fritz et al. [30] | ISM + SVM | Local kernel | UIUC | 170 | – | 88.6 | 87.8
Kapoor and Winn [31] | LHRF | SIFT | TUD car | 35 | – | 94.0 | –
Wu and Nevatia [32] | CBT | Edgelets | MIT | 4000 | 7000 | 97.5 | 93.5
Kuo and Nevatia [33] | G. AdaBoost | HOG | MIT + Pascal | 2462 | 8427 | 98.5 | 95.0
Villamizar et al. [34] | BRF | HOG-LBF | UIUC | 200 | – | 98.5 | 98.2
Gall and Lempitsky [46] | Hough Forest | Intensity | UIUC | 550 | 450 | 98.5 | 98.6
Fergus et al. [66] | EM | PCA | UIUC + rear | 800 | – | 88.5 | –
Leibe et al. [67] | K-means & ALC | Hessian-Laplace | Own set | 100 | – | 97.5 | 95.0
Our proposed method | CF | LBP variants | CompCars | 587 | 5329 | 99.6 | 98.9