https://doi.org/10.1007/s10044-020-00874-9

THEORETICAL ADVANCES
Local binary pattern‑based on‑road vehicle detection in urban traffic scene
M. Hassaballah1 · Mourad A. Kenk2 · Ibrahim M. El‑Henawy3
Received: 14 July 2018 / Accepted: 4 February 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020
Abstract
For intelligent traffic monitoring systems and related applications, detecting vehicles on roads is a vital step. However, robust and efficient vehicle detection remains a challenging problem due to variations in the appearance of vehicles and the complicated background of roads. In this paper, we propose a simple and effective vehicle detection method in which histograms of local vehicle texture and appearance are fed into clustering forests. The interdependency of the locations of vehicle parts is incorporated within a clustering-forest framework. Local binary pattern (LBP)-like descriptors are utilized for texture feature extraction; through these descriptors, the local structures of vehicles, such as edges, contours and flat regions, can be effectively depicted. The aligned set of histograms generated in concurrence with the spatial LBPs of randomly sampled local regions is used to measure the dissimilarity between regions of all training images. Evaluating the fit between histograms is built into the clustering forests; that is, discriminative codebooks of latent features are clustered by searching among the different LBP features of the random regions using the Chi-square dissimilarity measure. Besides, saliency maps built from the learnt latent features are adopted to determine vehicle locations in a test image. The effectiveness of the proposed method is evaluated on different car datasets stressing various imaging conditions, and the obtained results show that the method achieves significant improvements compared to published methods.
Keywords Intelligent traffic monitoring · Vehicle detection · Local binary pattern · Saliency map · Clustering forests
1 Introduction
In computer vision, the ability to detect road vehicles in digital images is key to a large number of applications [1], ranging from real traffic surveillance, transportation, traffic management, and traffic flow control systems at intersections to advanced driver assistance systems, as reported in [2–4]. For instance, video surveillance analysis systems for urban traffic roads rapidly provide effective information that increases safety and improves traffic flow, detecting stranded vehicles, lane crossing, vehicles parked across the roads, and traffic congestion, and counting passing vehicles [5, 6], as well as determining the plate number, type, speed, direction, or lane of vehicles [7, 8]. In particular, vehicle detection can be considered one of the core functions of advanced driver assistance systems: to give a distracted driver enough time to take an appropriate action to avoid driving conflicts, and thereby decrease the possibility of rear-end crashes, early vehicle detection is a must. Besides, current traffic monitoring systems that count vehicles using magnetic loops on the road, as demonstrated in [9–11], are expensive. All these issues make it necessary to investigate new and more favorable approaches to the vehicle detection task.
Commonly, the main purpose of vehicle detection is to identify possible vehicle locations in a given image and mark them as regions of interest (ROI) for further processing tasks. However, automatic detection of vehicles is a challenging and inherently difficult task [12]. The difficulty comes from the considerable diversity in the appearance of
* M. Hassaballah
* Mourad A. Kenk
1 Computer Science Department, Faculty of Computers and Information, South Valley University, Qena 83523, Egypt
2 Mathematics Department, Faculty of Science, South Valley University, Qena 83523, Egypt
3 Computer Science Department, Faculty of Computers and Information, Zagazig University, Zagazig, Egypt
the vehicles (e.g., size, shape, orientation) and the decrease in visibility due to camera noise or unfavorable climatic conditions (e.g., fog, rain). Besides, varying illumination, occlusion and cluttered backgrounds increase the difficulty compared to general object detection. In particular, satellite images may contain the complicated background of an urban or rural scene; thus, the vehicles appear very different, which makes it harder to find informative features in the scene [13]. In spite of these difficulties, both the scientific community and the transportation industry have contributed to the development of different types of vehicle detection and tracking methods to improve the efficiency of traffic monitoring systems and the vehicle-related applications mentioned earlier.
Therefore, several interesting vehicle detection approaches have been proposed in the literature, based mainly on handcrafted features or learned features [14, 15]. The handcrafted-features-based approaches include the method of [16], which combines HOG-like features, color probability maps and a partial least squares model. The method of Su et al. [17], which utilizes rotation-invariant HOG-like features, also belongs to this category. The methods of [18, 19] use Haar-like features to detect vehicles in aerial images. Besides, several well-known handcrafted features [20] such as Gabor filters, SIFT, and oriented gradients are used in [21–23]. Other handcrafted-features-based approaches are proposed in [24, 25]. Though these approaches are robust to intra-class variation in appearance, they are still sensitive to inter-class variation [26].
On the other hand, learned-features-based approaches have shown remarkable achievements in object detection problems thanks to their superb learning capacity (i.e., the ability to learn compact and discriminative features), so some researchers have investigated them for vehicle detection tasks. Most learned-features-based approaches transform vehicle detection into a classification problem by generating vehicle proposals and then classifying each proposal independently with an image classification technique such as convolutional neural networks (CNN) [27] or support vector machines (SVM) [28]. In these approaches, features are learned by approximating the maximum likelihood of the data distribution; to increase the distinctive or discriminative power of the features, sparsity can be used as a regularization criterion. For detecting vehicles in a tunnel surveillance application, Rios-Cabrera et al. [29] used Haar features and an AdaBoost cascade classifier, while Fritz et al. [30] integrated a generative model (i.e., an appearance codebook) with an SVM classifier and local kernels. In [31], a located hidden random field, an extension of the conditional random field, is utilized for learning discriminative parts of vehicles that are rich in information content. A multi-view vehicle detection method using a cluster/cascade boosted tree classifier is presented in [32, 33]. Also, boosted random ferns on a HOG-like feature space are used for object detection in [34], achieving a vehicle detection equal error rate of 98.2% on the UIUC car dataset.
For detecting vehicles in high-resolution aerial images, Lin et al. [35] followed a similar direction by combining SURF and edge features, with the classification step performed via probabilistic modeling (Gaussian-weighted voting); the edge information is employed to highlight the sides of the vehicles. In [36], edge and corner features are used with a dynamic Bayesian network classifier. Cao et al. [37] combine deep convolutional neural network (DCNN)-based feature learning, to learn discriminative features, with exemplar SVM classifiers to improve the robustness of the classification step. They reported that "the leverage of DCNN has achieved significant performance boost by comparing to a serial of handcraft designed features." Other methods use roadside LiDAR sensors with DCNNs [38, 39]. In [40], a vehicle detection method for high-resolution aerial images is proposed using sparse representation and superpixel segmentation, where meaningful patches for training and detection are obtained from the superpixel segmentation. To construct a compact training subset and select the most representative samples from the whole training dataset, a training sample selection method based on the sparse representation is proposed; HOG-like features are used during the training and detection steps. This method achieves promising results on the Toronto and OIRDS datasets. Unfortunately, sparse representation has high computational complexity compared to counterpart classifiers such as SVM.
Random forests (RF) [41, 42] (a decision-tree-structured classifier that makes a prediction by routing a feature sample through the whole tree to its leaves in an unsupervised way) and the local binary pattern (LBP) [43, 44] (a texture descriptor that characterizes the spatial structure of local image texture) have received significant attention due to their success in several computer vision applications [45–47]. Inspired by the success of RF and LBP, we propose a simple yet effective method for detecting road vehicles in urban traffic scenes, where clustering forests based on the Chi-square measure are employed to build visual codebooks for appearance features extracted using the LBP. In particular, three different LBP variants are utilized to extract vehicle features, and a combination of the local vehicle features extracted by the three LBP variants is proposed to build a powerful clustering-forest classifier. Experimental results prove that the proposed method overcomes the limitations of several previous methods.
The remainder of this paper is organized as follows. Section 2 describes feature extraction using the LBP descriptor and its variants. The concept of clustering forests is discussed in Sect. 3. Details of the proposed vehicle detection method are explained in Sect. 4. Experimental evaluations and discussions are presented in Sect. 5. Conclusions are finally given in Sect. 6.
2 Feature extraction and fusion
2.1 Local binary patterns (LBP)

The basic LBP [48] characterizes the spatial structure of image texture based on the assumption that a texture has two locally complementary aspects: a pattern and its strength (amplitude). Its proven high discriminative power, its invariance to monotonic grayscale changes, and its computational efficiency make the LBP suitable for several image-related applications [49, 50]. Thanks to the invariance property, there is no need to apply grayscale normalization before computing the LBP feature. Briefly, LBP labels image pixels by thresholding the 3×3 neighborhood of each pixel with the center value, producing an 8-bit binary number. In other words, it creates an order-based feature for every image pixel by comparing each pixel's intensity with those of its neighboring pixels: neighbors whose responses exceed the central one are labeled "1", while the others are labeled "0." The comparison results are recorded as a string of binary bits. After that, weights coming from a geometric sequence with a common ratio of 2 are assigned to these bits according to their indices in the string. The binary string with its weighted bits is then transformed into a decimal-valued index (the LBP code), which is the LBP feature response, as shown in Fig. 1.
Mathematically, a pixel c with intensity $g_c$ is labeled as

$$S(g_p - g_c) = \begin{cases} 1, & \text{if } g_p \geq g_c \\ 0, & \text{if } g_p < g_c \end{cases} \quad (1)$$

where the pixels p belong to the 3×3 neighborhood with gray levels $g_p$, $p = 0, 1, \ldots, 7$. The final LBP code of the pixel c is computed by summing the thresholded values $S(g_p - g_c)$ weighted by a binomial factor of $2^p$:

$$LBP = \sum_{p=0}^{7} S(g_p - g_c)\, 2^p \quad (2)$$

In this context, the binary numbers generated by the LBP response encode the mutual relationship between the central pixel and its neighboring pixels. Unfortunately, the features calculated in this local 3×3 neighborhood may not capture large-scale structures of the image. Therefore, the basic LBP descriptor is extended to neighborhoods of different sizes, allowing sub-pixel alterations. Using circular neighborhoods rather than the fixed 3×3 allows any radius and any number of pixels in the neighborhood; in this way, the circular LBP (CLBP) puts no limitation on the neighborhood size of the pixel or on the number of sampling points, as described in [51]. For a neighborhood (P, R) with P sampling points on a circle of radius R, the CLBP code can be computed using

$$CLBP_{P,R}(x, y) = \sum_{i=0}^{P-1} S(g_i - g_c)\, 2^i \quad (3)$$

where $g_c$ and $g_i$ are the gray values of the central pixel and its neighbors, respectively. In the case that a sampling point is not in the center of a pixel, the pixel values are bilinearly interpolated. The coordinates of sampling point $(x_i, y_i)$ in an evenly spaced circular neighborhood of N sampling points and radius R around point (x, y) are given by

$$x_i = x + R\cos(2\pi i / N), \qquad y_i = y - R\sin(2\pi i / N) \quad (4)$$

Another extension of the basic LBP, called LBP variance (LBPV) [52], is introduced to avoid the loss of global spatial information inherent in the basic LBP. The principal orientations of the texture image are estimated based on the LBP distribution to align the LBP histograms; the aligned histograms are then used to measure the dissimilarity between images. The LBPV encodes the local contrast information into a one-dimensional LBP histogram. The local variance $LV_{P,R}$, which can be considered invariant against shifts in gray level, is given by

$$LV_{P,R} = \frac{1}{P} \sum_{l=0}^{P-1} (g_l - V)^2, \quad \text{where } V = \frac{1}{P} \sum_{l=0}^{P-1} g_l \quad (5)$$

The variance $LV_{P,R}$ is utilized as an adaptive weight to adjust the contribution of each LBP code to the histogram calculation. The joint distribution $LBP_{P,R} / LV_{P,R}$ produces the final LBP variance (LBPV) response, which describes the local texture better than the LBP. The three LBP variants discussed above are shown in Fig. 2, and an illustrative example of computing the LBP features from an input vehicle image using LBP, CLBP, and LBPV is shown in Fig. 3.

Fig. 1 Computing LBP code for a pixel in a 3×3 neighborhood
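For concreteness, the following is a minimal Python sketch of the basic 3×3 operator of Eqs. (1)–(2). The function name and the NumPy-based vectorization are illustrative assumptions, not the paper's C++/OpenCV implementation.

```python
import numpy as np

def lbp_basic(img):
    """Basic 3x3 LBP (Eqs. 1-2): threshold the 8 neighbors of each pixel
    against the center and pack the bits into an 8-bit code."""
    img = img.astype(np.int32)
    h, w = img.shape
    center = img[1:-1, 1:-1]
    codes = np.zeros((h - 2, w - 2), dtype=np.int32)
    # The 8 neighbor offsets, enumerated p = 0..7 (each carries weight 2**p).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for p, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Eq. (1): S(g_p - g_c), then Eq. (2): weight by 2**p and sum.
        codes += (neighbor >= center).astype(np.int32) << p
    return codes.astype(np.uint8)
```

Feeding a grayscale region to `lbp_basic` and histogramming the returned code map reproduces the descriptor of Fig. 1.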
After computing the LBP pattern for each pixel c(i, j), the input image of size N×M is represented by creating histograms of these LBP patterns, as shown in Fig. 4, using

$$H_1(k) = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} F(LBP_{P,R}(i, j), k), \quad k \in [0, K] \quad (6)$$

where

$$F(a, b) = \begin{cases} 1, & a = b \\ 0, & \text{otherwise} \end{cases} \quad (7)$$

For the LBPV patterns, the histogram is created using

$$H_2(k) = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \tilde{F}(LBP_{P,R}(i, j), k), \quad k \in [0, K] \quad (8)$$

where

$$\tilde{F}(LBP_{P,R}(i, j), k) = \begin{cases} LV_{P,R}(i, j), & LBP_{P,R}(i, j) = k \\ 0, & \text{otherwise} \end{cases} \quad (9)$$

The histograms are computed independently for each image patch (region), and K is the maximum value of the LBP pattern. Hence, the created histograms contain information about the pattern description at the pixel level and at the strength level.
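As a sketch of Eqs. (6)–(9), both region histograms can be built in a single pass over the code map; using `np.bincount` weights to realize the LV weighting of Eq. (9) is our assumption about one way the LBPV histogram could be implemented.

```python
import numpy as np

def lbp_histograms(codes, lv=None, n_bins=256):
    """Region histograms of Sect. 2.1; codes is the per-pixel LBP code map,
    lv the per-pixel local variance LV of Eq. (5) (optional)."""
    # H1 (Eqs. 6-7): count how often each LBP code k occurs in the region.
    h1 = np.bincount(codes.ravel(), minlength=n_bins).astype(np.float64)
    if lv is None:
        return h1
    # H2 (Eqs. 8-9): accumulate each pixel's local variance LV into the
    # bin of its LBP code instead of a plain count.
    h2 = np.bincount(codes.ravel(), weights=lv.ravel(), minlength=n_bins)
    return h1, h2
```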
2.2 Feature fusion and dimensionality reduction

The main aim in developing texture descriptors is to find a stand-alone approach that can handle blur, illumination variations, rotation and different scales in images. Another major concern is to design feature descriptors with a small dimensionality [53]. Unfortunately, it is not easy to find a single descriptor that always works well, as results also vary according to the classifier used; it is therefore very difficult to find a stand-alone method that works better than all others. The most feasible way to improve results is to combine various descriptors whose different characteristics complement the features each extracts, as with the co-occurrence-matrix features combined by sum rules in the image classification problem [54].
In this context, the LBP labels used in the histogram contain information about the patterns at the pixel level, and the labels summed over a small region produce information at the regional level. As a result, the vehicle features are represented by the LBP by dividing the vehicle shape into small regions and recovering it from the different local histograms, which enriches the feature histograms. A drawback of some LBP variants reported in the literature is that they have been tested on only a few application problems, while LBP features suffer the further drawback of large dimensionality. An illustration is shown in Fig. 5: the OLBP histograms can be described as a multimodal or bimodal distribution covering a wide range of the feature dimension (Fig. 5a); the CLBP histograms can be described as a positively skewed distribution (right skewed, with some outliers) covering a shorter range of the feature dimension than the OLBP feature vector; and the LBPV histograms can be described as a negatively skewed distribution (left skewed, with few outliers), likewise covering a short range of the feature dimension.

Fig. 2 The LBP variants: a the basic LBP 3×3, b the circular CLBP (16,2) neighbors [51], c the LBPV (16,2) neighbors [52]

Fig. 3 An illustration example for computing LBP-like features

Fig. 4 Extracting LBP feature histograms from an input vehicle image

Fig. 5 An illustration of the differences between the LBP derivative histograms for the same region used by the local similarity pattern

The histogram of the CLBP (16,2) feature distribution is shown in Fig. 5b, where its median is 50, and the histogram of the LBPV (16,2) feature distribution is shown in Fig. 5c, with a median of 200. The CLBP (16,2) and LBPV (16,2) are considered short descriptors with lower feature dimensionality. Furthermore, the LBPV (16,2) histogram values cover a range of dimension values not covered by the CLBP (16,2).
Even though the CLBP and LBPV contain fewer feature dimensions, they can still be used as unique feature descriptors. This reduction in dimensionality can significantly improve the speed of feature similarity measurement, which is needed for high-speed and real-time applications.
Several dissimilarity measures have been proposed for histograms, such as the log-likelihood statistic and the Chi-square statistic (X²). The log-likelihood measure is suitable in many situations but may be unreliable for small regions, because their histograms may contain many zeros, whereas the Chi-square distance is usually stable for small regions, as reported in [43], and is thus suitable for the problem at hand.
The Chi-square dissimilarity measure between two histograms S and M can be written as

$$\chi^2(S, M) = \sum_{i=1}^{n} \frac{(S_i - M_i)^2}{S_i + M_i} \quad (10)$$

where the factor $\frac{1}{S_i + M_i}$ is inversely proportional to the sum of the two histogram bins; when this sum is very small, the clustering of more infrequent patterns into weighted clusters can still exert some influence.

The procedure consists of using the texture descriptor to build several local descriptions of the vehicle's features and combining them into a global description, instead of striving for a holistic description. With spatial histograms, the vehicle's features are effectively described at three different levels of representation: histograms carry pattern information at the pixel level; histograms over different neighbor distances provide information at the regional level; and histograms incorporating the effect of contrast describe patterns at the strength level. This enlarges the feature space of the vehicle and gives a chance to search deeply for latent features. In this way, each region in a vehicle image receives different LBP codes, and the most accurate information is obtained by using the joint distribution of these codes.
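Eq. (10) transcribes directly to code; the small `eps` guard for bins that are empty in both histograms is our addition, not part of the paper's formulation.

```python
import numpy as np

def chi_square(s, m, eps=1e-12):
    """Chi-square dissimilarity of Eq. (10) between histograms s and m.
    eps avoids division by zero when a bin is empty in both histograms."""
    s = np.asarray(s, dtype=np.float64)
    m = np.asarray(m, dtype=np.float64)
    return float(np.sum((s - m) ** 2 / (s + m + eps)))
```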
3 Clustering forests
A clustering forests (CF)-based framework is proposed for building visual codebooks. The clustering forest [47, 55] consists of randomized decision trees built recursively top down, as given in Algorithm 1. At each node i, corresponding to a set of regions $A_i = \{R^{train}_i\}$, two children nodes L and R are created by choosing a binary test $T(\Gamma) \rightarrow \{0, 1\}$ that divides $A_i$ into two disjoint sets of regions with

$$\Omega_i = \arg\max \Delta(A_i, A^R_i, A^L_i, T_i) \quad (11)$$

such that $A_i = A^L_i \cup A^R_i$ with $A^L_i \cap A^R_i = \phi$, and $\Delta = \Delta_1$ or $\Delta_2$. Here $\Delta_1$ measures the impurity of the displacement vectors $d_j$ of the regions, with $d_m$ the mean displacement vector over all vehicle-class regions in the set:

$$\Delta_1(A_i) = \sum_{j:\, l_j = 1} \| d_j - d_m \|^2 \quad (12)$$

The impurity of the class label $l_i$, i.e., the uncertainty between vehicle regions and the background, is measured using $\Delta_2$ as

$$\Delta_2(A_i) = |A_i| \cdot \Phi(l_i | A_i) \quad (13)$$
where

$$\Phi(l_i | A_i) = -\frac{\sum_h P(l_i | A_i) \log P(l_i | A_i)}{\sum_h P(l_i | A_i)} \quad (14)$$

The Aczél and Daróczy entropy $\Phi(l_i | A_i)$ is used to maximize the joint probability of the class label for the classification information gain. In this case, a probability density function (PDF) can be defined as the saliency map, where a high probability vote for a region represents high saliency. During the training phase, the splitting recursion continues until further subdivision is impossible, the maximum clustering level is reached, or all surviving training regions fall under one similarity-measurement threshold, i.e., all regions have identical values for all histograms. The Chi-square ($\chi^2$) dissimilarity measure defined in Eq. (10) can be considered the crux of the binary test:

$$T(S, M, \omega) = \begin{cases} 1, & \chi^2(S, M) < \omega \\ 0, & \text{otherwise} \end{cases} \quad (15)$$

Here S and M are two histograms indexed under the same LBP descriptor. The tests are selected randomly, such that the histogram index $I(H_i) = \{H_1, H_2, H_3\}$ for the appearance of regions extracted by the LBP channels $\Gamma = \{\gamma_1, \gamma_2, \gamma_3\}$ is chosen at random. A random elementary threshold $\omega$ is generated at the beginning of the training stage for the dissimilarity tests; thereafter, the threshold value $\omega$ for each test is chosen uniformly at random from the range of differences between histograms observed on the data.

4 The proposed method

4.1 Training phase

During the training phase, each tree is constructed recursively from the root in a top-down manner. Each splitting node j receives a set of histograms $H_j$ of the training regions $R^{train}_j$ of fixed size (36×36 pixels), defined by the extracted LBP feature $\Gamma_j$. The Chi-square similarity measure is assigned to each splitting node in the binary test function used to split the regions to a deeper level; this splitting process depends on the dissimilarity value found among the arriving histograms through a random choice of the measuring mode $\Delta_1$ or $\Delta_2$. If the number of regions is small ($N_{min} = 20$) or the last clustering depth level is reached ($d_{max} = 15$), the constructed splitting node is declared a cluster, and the cluster's probability vote information is stored in a saliency map together with a list of the displacement vectors $d_i$ from the vehicle regions to the vehicle centroid. Figure 6 shows the basic processes in the training phase of the proposed method.
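A compact sketch of the node-splitting machinery of Eqs. (12)–(15) follows, reusing `chi_square` from the previous sketch. With only two class labels, the normalized Aczél–Daróczy entropy of Eq. (14) reduces to the Shannon form, which is the simplification used here; all names and array layouts are illustrative rather than the paper's implementation.

```python
import numpy as np

def regression_impurity(d, labels):
    # Delta_1 (Eq. 12): scatter of the vehicle regions' displacement
    # vectors d (n x 2) around their mean; background regions (label 0)
    # are ignored.
    dv = d[labels == 1]
    if len(dv) == 0:
        return 0.0
    return float(np.sum(np.linalg.norm(dv - dv.mean(axis=0), axis=1) ** 2))

def classification_impurity(labels):
    # Delta_2 (Eqs. 13-14): |A| times the entropy of the class labels.
    p = float(np.mean(labels == 1))
    ent = -sum(q * np.log(q) for q in (p, 1.0 - p) if q > 0.0)
    return len(labels) * ent

def node_test(hist, ref_hist, omega):
    # Binary test T (Eq. 15): a region is routed to one child (returns 1)
    # when its histogram is chi-square-similar to the reference histogram
    # within the randomly drawn threshold omega, else to the other child.
    return 1 if chi_square(hist, ref_hist) < omega else 0
```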
Fig. 6 Schematic representation for the training process of the pro- posed method
4.2 Selective regions models (dense and sparse)

The first design choice concerns the feature sampling mechanism, which can be either dense or sparse. Dense sampling is used very commonly, mainly because it produces no information loss. Sparse sampling selects regions according to some convenient criteria, which has the potential advantage of reducing complexity and computational time, but may induce a significant loss of discriminative information. When an image is divided into regions, some regions can be expected to contain more useful information than others in terms of vehicle features: for example, the headlights, logo, side mirrors, wheels, and license plate are important cues in vehicle tracking for traffic management. To take advantage of this, a uniform distribution is utilized to divide the image into dense regions $R_j$ that sparsely cover the whole range of vehicle features (see the sketch after this paragraph). Let A be a set of regions $A = \{R^{train}_j(\Gamma_j, H_j, l_j, d_j)\}$, where each region is represented by three feature channels $\Gamma = \{\gamma_1, \gamma_2, \gamma_3\}$ extracted by different LBP derivatives; H is the set of histograms $H = \{H_1, H_2, H_3\}$ generated in concurrence with the spatial LBPs; the displacement vector $d_j$ is associated with each region; and the class label is $l_j = \{1, 0\}$ for vehicles and background.
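A minimal sketch of the uniform region sampling just described; the region count and the use of a NumPy random generator are assumptions made for illustration.

```python
import numpy as np

def sample_regions(img, n_regions=200, size=36, rng=None):
    """Draw fixed-size square regions uniformly over the image so that
    they densely cover the vehicle while remaining a sparse subset of
    all possible windows (36x36 is the training-phase region size)."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    ys = rng.integers(0, h - size + 1, n_regions)
    xs = rng.integers(0, w - size + 1, n_regions)
    return [(y, x, img[y:y + size, x:x + size]) for y, x in zip(ys, xs)]
```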
4.3 Detection phase
During the detection phase, each region histogram from the test image is matched to the clustered histograms, similar to appearance-based object detection [56]; when an entry in the clustered forest is activated, it casts votes for all possible vehicle centroids based on the associated displacement vectors. Candidate vehicle centroids appear as high-saliency modes of the saliency map, found by mean-shift mode estimation. For each member histogram $\{H_j\}_i$ of cluster i, we store the position of its linked region relative to the vehicle center. Thus, a cluster entry records the similarity-measure threshold value, the LBP derivative descriptor type, the number of similar region histograms, and the region positions relative to the vehicle center $(y_m, m = 1, \ldots, M_i)$.

The saliency map can be used for vehicle detection by measuring, with the Chi-square distance, the similarity of each test-image region histogram located at x to the clustered histograms in the forest. A cluster entry is declared similar if $\chi^2(\cdot) < \omega$. For each matched cluster entry J, we cast votes for the possible locations of the vehicle centers $(y_m, m = 1, \ldots, M_i)$. The saliency map, which represents a PDF, can be used as prior knowledge to detect vehicles; detection is then accomplished by searching the saliency map with mean-shift mode estimation. Formally, let $P(x_n)$ be the probability that a vehicle appears at candidate position $x_n$ in the test image. Since the voting is cast from each individual member of the activated cluster entries J, we cast votes for all possible locations of vehicle centers $y_m$ with bounding boxes of size W×H. Thus, $P(x_n | J)$ can be marginalized as

$$P(x_n | J) = \sum_m P(x_n | y_m)\, P(y_m | J) \quad (16)$$

Without prior knowledge of $y_m$, we treat them equally and assume that $P(y_m | J)$ is a uniform distribution (i.e., $P(y_m | J) = 1 / M_i$). The vote falls only at the locations of possible vehicle centers, so $P(x_n | y_m) = \delta(x_n - y_m)$, where $\delta(z)$ is the Dirac delta function. The vehicle detection hypothesis E(V) combining all trees in the forest $\{F_t\}_{t=1}^{T}$ is given by

$$E(V) = \sum_{x_n} P(x_n | J; \{F_t\}_{t=1}^{T}) \quad (17)$$
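The voting of Eq. (16) can be sketched as an accumulation of uniform per-entry vote mass into a saliency map; the data layout (a list of matched entries, each with its region position and stored center offsets) is an assumption.

```python
import numpy as np

def vote_saliency(matches, shape):
    """Accumulate centroid votes: each matched cluster entry casts
    P(y_m|J) = 1/M_i at every stored center offset y_m, measured from
    the matched region's position (ry, rx)."""
    saliency = np.zeros(shape, dtype=np.float64)
    for (ry, rx), offsets in matches:
        w = 1.0 / len(offsets)  # uniform P(y_m|J)
        for dy, dx in offsets:
            cy, cx = ry + dy, rx + dx
            if 0 <= cy < shape[0] and 0 <= cx < shape[1]:
                saliency[cy, cx] += w
    return saliency
```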
To handle scale variations, the hypothesized E(V) bounding box in the original image is centered at the point $x^s_n$ and re-scaled to $W_s \times H_s$, where s is the scale value of the detection. The detection process is illustrated in Fig. 7.
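The mode search over the saliency map can be sketched as a plain mean-shift iteration; the circular window and the bandwidth value are assumptions, since the paper does not specify these details.

```python
import numpy as np

def mean_shift_modes(saliency, seeds, bandwidth=16, n_iter=20):
    """Shift each seed toward the saliency-weighted centroid of a circular
    window until it settles on a local mode (a candidate vehicle center)."""
    h, w = saliency.shape
    ys, xs = np.mgrid[0:h, 0:w]
    modes = []
    for cy, cx in seeds:
        for _ in range(n_iter):
            mask = (ys - cy) ** 2 + (xs - cx) ** 2 <= bandwidth ** 2
            total = saliency[mask].sum()
            if total == 0:
                break
            cy = (ys[mask] * saliency[mask]).sum() / total
            cx = (xs[mask] * saliency[mask]).sum() / total
        modes.append((int(round(cy)), int(round(cx))))
    return modes
```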
5 Experiments and results
5.1 Evaluation metrics and datasets

Many contributions in the literature on object detection evaluation present a multiplicity of metrics measuring different aspects of the evaluation protocols [57].
Kasturi et al. [58] presented metrics and tools for evaluating the performance of object detection algorithms based
Fig. 7 Schematic representation for the detecting process of the pro- posed method
on the detection accuracy (DA), multiple object detection accuracy (MODA), and multiple object detection precision (MODP). The DA computes the spatial overlap between the ground truth (GT) and detected objects as the ratio of the spatial intersection of the two objects to their spatial union. The sum of all overlaps is normalized over the average number of ground-truth and detected objects for a single image z, where the predicted bounding boxes are $D^{(z)}_i$, the ground-truth bounding boxes are $G^{(z)}_i$, and a nonbinary threshold is applied to the overlap ratio between them, such that

$$DA(z) = \frac{O_{Ratio}}{\left[ \dfrac{N^{(z)}_G + N^{(z)}_D}{2} \right]}, \qquad O_{Ratio} = \sum_{i=1}^{N^{(z)}_{mapped}} \frac{|G^{(z)}_i \cap D^{(z)}_i|}{|G^{(z)}_i \cup D^{(z)}_i|} \quad (18)$$

where $G^{(z)}_i$ is the ith ground-truth vehicle, $D^{(z)}_i$ is the ith detected vehicle, $N^{(z)}_G$ and $N^{(z)}_D$ denote the numbers of ground-truth and detected vehicles in image z, respectively, and $N^{(z)}_{mapped}$ is the number of mapped ground-truth and detected vehicle pairs in image z. This yields an N×M matrix of metric scores, with N ground-truth objects and M detected objects, in which the entry $x_{ij}$ denotes the metric score for the detected object $D^{(z)}_j$ when mapped to the ground-truth reference object $G^{(z)}_i$. Figure 8 shows the scoring matrix and the computation of the nonbinary thresholding for the DA [58]. To evaluate the accuracy of the proposed method, the multiple object detection accuracy (MODA) is used to measure the missed-detection and false-positive counts. Assuming the number of misses is denoted by $m_z$ and the number of false positives by $fp_z$ for each image z, MODA can be computed using

$$MODA(z) = 1 - \frac{fp_z + m_z}{N^{(z)}_G} \quad (19)$$

Meanwhile, the normalized multiple object detection precision (NMODP), which gives the detection precision over the entire sequence of testing data, is computed as

$$NMODP = \frac{\sum_{z=1}^{N_{total}} MODP(z)}{N_{total}}, \qquad MODP(z) = \frac{O_{Ratio}}{N^{(z)}_{mapped}} \quad (20)$$
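A sketch of the per-image metrics of Eqs. (18)–(19); the box representation (x1, y1, x2, y2) and the assumption that the GT-to-detection mapping is given as index pairs are ours.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

def detection_accuracy(gt, det, mapping):
    """DA of Eq. (18): summed overlap of mapped GT/detection pairs,
    normalized by the average object count of the image."""
    o_ratio = sum(iou(gt[i], det[j]) for i, j in mapping)
    return o_ratio / ((len(gt) + len(det)) / 2.0)

def moda(n_gt, misses, false_pos):
    """MODA of Eq. (19) for a single image."""
    return 1.0 - (false_pos + misses) / float(n_gt)
```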
On the other hand, the proposed method is tested extensively on several challenging car datasets: the UIUC dataset [28], the Stanford dataset [59], the Pascal VOC 2012 [60] car class, the TU Graz-02 [61] car class, and the surveillance-nature data of the CompCars dataset [62]. Images of these datasets cover extensive variations of on-road vehicles (e.g., multiple views, scales, and models) in urban scenes (involving naturally high variability of cluttered backgrounds and different roads) taken from different points of view. Table 1 outlines the properties of these datasets. The learned clustering-forest detectors are evaluated on the UIUC dataset [28], which consists of two test sets: single-scale and multi-scale. The single-scale part has 170 images containing 200 car instances at a fixed resolution of 100×40, while the multi-scale part consists of 108 images with 139 cars at resolutions ranging from 89×36 to 212×85. The best learned detectors are then evaluated on three datasets with various imaging conditions: the Stanford cars dataset [59], which has 8041 testing images covering 196 classes of cars across a variety of coarse categories; the Pascal VOC 2012 [60] car class, with 1161 testing images of realistic scenes; and TU Graz-02 [61], whose car class contains 420 images of extreme complexity and high intra-class variability on highly cluttered urban backgrounds. Finally, the proposed method is compared with other methods on two different datasets: the UIUC car dataset [28] for urban traffic scenes and the surveillance-nature data of [62] for on-road vehicles.
Fig. 8 An illustration example for detection accuracy computation. a The nonbinary thresholding, b scoring matrix
Table 1 Details of the real, challenging car datasets used for evaluating the proposed method

Dataset | No. images | No. objects | Image size | Res.
UIUC dataset [28] | 278 | 339 | 188 × 112 | Low
Stanford cars [59] | 8041 | 8041 | 640 × 426 | Multi
Pascal VOC 2012 [60] | 1161 | 2017 | 500 × 375 | Multi
TU Graz-02 cars [61] | 420 | 770 | 640 × 480 | Multi
Surveillance nature set [62] | 10,000 | 10,000 | 778 × 744 | High
5.2 Parameter settings

The PC configuration is an Intel Core i5 at 2.47 GHz with 4 GB of RAM. The proposed method is implemented on the Windows 7 platform using C++ and the OpenCV image processing library. The efficiency of the proposed method is evaluated using the same clustering-forest training settings throughout: stopping criteria of 15 maximum clustering levels or 20 regions with the same features, and three different forest sizes {1, 3, 5}. At the detection phase, two and three scales are used to handle the variety of vehicle sizes in the test datasets. The result curves are generated by changing the acceptance threshold on the hypothesis vote strength $P(x_n | J)$. Each experiment is carried out ten times, and the standard deviation of the detection-rate values is considered in the comparisons.
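For reference, the settings of this subsection can be collected in a single configuration record; the structure and names below are illustrative, not taken from the paper's C++ code.

```python
# Clustering-forest settings used throughout the experiments (Sect. 5.2).
CF_CONFIG = {
    "max_depth": 15,           # stop condition: maximum clustering level
    "min_regions": 20,         # stop condition: regions with same features
    "forest_sizes": (1, 3, 5), # numbers of trees compared
    "region_size": (36, 36),   # training-phase region size, pixels
    "detection_scales": (2, 3),# scales used at the detection phase
    "n_repeats": 10,           # runs per experiment (std. dev. reported)
}
```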
For the training sets: first (training set-1), we collect 587 positive sample images from the training set-1 of the CompCars dataset (web-nature data) [62], re-scaled so that the largest bounding-box dimension is 200×100 and the smallest is 140×100, with an average size of about 181×100. We also carefully select vehicle images whose appearance contains at least the key vehicle features (wheels and headlights), rotated from an angle α° at which four features appear (2 wheels + 2 headlights) to an angle β° at which three features appear (2 wheels + 1 headlight). Negative training samples are randomly cropped from the backgrounds of 5329 images. Each clustering forest is trained on about 29,350 positive regions and 266,450 negative regions to build a training model that works robustly under hard conditions. The second training set (training set-2, the Vehicle Image Database-GTi (VID-GTi) [63]) is chosen to investigate the CF training ability at low resolution. Images of this dataset are of size 64×64, captured for on-road vehicles and extracted from sequences of 360×256 pixels recorded on highways; it is split into four parts according to the pose of the vehicle relative to the camera: middle-close range in {front, left, right} and far range. Additionally, the images are not cropped to fit the contours of the vehicles, to make the classifier more robust to hypotheses of partially occluded vehicles. Table 2 gives details of the two training sets.
5.3 Clustering splitting mode and LBP feature extractors
5.3.1 Classification versus regression model
The purpose of the first experiment is to investigate and analyze the following aspects of the clustering-forest learning algorithm:
1. Clustering splitting mode (classification and regression).
2. Feature extraction using the LBP variations.
Therefore, the experiment is designed with the following setting:
• Training a clustering forest using only the classification mode $\Delta_2(A_i) = |A_i| \cdot \Phi(l_i | A_i)$.
• Training another clustering forest using the regression mode $\Delta_1(A_i) = \sum_{j:\, l_j = 1} \| d_j - d_m \|^2$.
• Training a clustering forest using the random splitting mode $\Delta = \Delta_2$ or $\Delta_1$.
Each clustering forest is then trained individually on each feature-extractor LBP channel, namely the original 3×3 LBP (OLBP), the circular LBP (16,2) (CLBP), and the LBPV, used for dissimilarity measurement between regions with $\Gamma = \{\gamma_1, \gamma_2, \gamma_3\}$. For multi-LBP training, the clustering forest is constructed using a combination of all the LBP operators, with a random choice among them for feature extraction at each splitting node.
Figure 9 shows a visual comparison between the different clustering splitting models and the different LBP-based feature extractors. The horizontal axis represents the splitting node number, and the vertical axis the threshold value ω for splitting regions. The vertical axis ranges between positive and negative threshold values: positive values correspond to right splitting nodes and negative values to left splitting nodes. At the start of the clustering levels, splitting proceeds continuously through approximately the first quarter of the levels, which can be seen as a compression; this massive splitting is caused by the large dissimilarities and variations among the vast numbers of regions of different vehicles in the training datasets. On the other hand, gaps appear as rarefactions between the splitting nodes along the horizontal axis, where the threshold value ω = 0 corresponds to cluster entries J that have an acceptable similarity between positive training regions.
Analyzing the clustering splitting mode (i.e., classification or regression), one can note that the classification mode $\Delta_2(A_i)$ in column (a) of Fig. 9 has fewer clustering nodes than the regression mode $\Delta_1(A_i)$ in column (b). Under the other conditions, we analyze the feature extraction
Table 2 Details of the training datasets used in the training phase

Dataset | Pos. | Neg. | Avg. dim. | Res. | Scale
Web-nature set [62] | 587 | 5329 | 181 × 100 | High | Multi
VID-GTi [63] | 3425 | 3900 | 64 × 64 | Low | One
based on the LBP variations: the behavior of the clustering nodes fluctuates, starting from the first row (OLBP), which has a huge number of clustering nodes, through CLBP in the second row, which has fewer clustering nodes, to LBPV in the third row, with slightly compressed clustering nodes. The random choice mode $\Delta(A_i)$ yields a moderate number of clustering nodes when using OLBP, CLBP and LBPV, and even fewer clustering nodes when the LBP variants are merged in multi-LBP. Table 3 reports the training time for each model, together with the actual numbers of clustering nodes and cluster entries. It is clear that the OLBP feature type has a higher-dimensional histogram than the CLBP and LBPV feature types. Considering the training time, the OLBP features take longer than the CLBP and LBPV features, while CLBP and LBPV consume approximately the same, steady training time. The average and maximum numbers of regions per cluster entry are reported in Table 4: in all clustering models there is a large gap between the average values and the anomalous maximum values, which are caused by a small number of cluster entries; the average values, however, confirm the stability of the entire set of cluster entries. The behavior of the feature extraction is further investigated across the LBP variants using a statistical methodology (descriptive statistics) to analyze and measure the variation of region similarity produced in the clusters for each LBP operator. The results for the mean, standard deviation, and coefficient of variation are reported in Table 5, demonstrating significant variation between the cluster entries and region histograms across the LBP operators. Also, the ratio of cluster entries with PDF probability > 0.5 is computed, and the results for 500 cluster entries are drawn in Fig. 10.
The learned clustering forests with different LBP derivatives and splitting measurement models are evaluated for vehicle detection on the multi-scale part of the UIUC car dataset. The detection task is usually evaluated with the Recall/Precision curve, where Precision = TP/(TP + FP) and Recall = TP/(TP + FN). Thus, the LBP feature extraction and the different learning models are assessed by the true positive rate (TPR), the true negative rate (TNR), and the equal error rate (EER) point, where the TPR describes the proportion of positive samples that are classified correctly.
A higher TPR value indicates a better classifier for positive samples, while the TNR represents the proportion of negative samples classified correctly; a higher TNR value corresponds to a lower false-positive rate (FPR). A detection is counted as successful only when the intersection of the detection response and the ground-truth box is larger than 60% of their union.

Fig. 9 Splitting threshold of LBP derivatives among clustering levels for classification and regression impurity measurements

Table 6 lists the quantitative results of applying the learned clustering forests to vehicle detection: the classification impurity model $\Delta_2$ with multi-LBP achieves an EER of 69.21%, and the regression impurity model $\Delta_1$ with LBPV achieves an EER of 46.16%, while the random impurity model $\Delta$ with multi-LBP achieves the highest EER of 72.63%, outperforming the other LBP feature extractors.
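For completeness, the curve points used throughout Sect. 5 reduce to two small helpers; locating the EER as the point where precision and recall are closest is our assumption about the protocol's operating point.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN)."""
    return tp / float(tp + fp), tp / float(tp + fn)

def equal_error_rate(curve):
    """Approximate the EER from a list of (precision, recall) points by
    picking the point where the two values are closest."""
    return min(curve, key=lambda pr: abs(pr[0] - pr[1]))
```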
5.3.2 Impact of clustering forests parameters
A second experiment is carried out to investigate the impact of the clustering parameters, including the clustering forest size, the depth of the clustering levels, and the region size, so as to optimize the clustering forests for the best vehicle detection. To this end, two learned clustering forests of different sizes F = {3, 5} are implemented using the random splitting impurity Δ with the OLBP, CLBP, and LBPV feature extractors. The conjunction of LBP derivatives in multi-LBP is drawn randomly with sub-supervision, where OLBP and LBPV are favored over CLBP on account of the previous detection results. Figure 11a shows the results obtained in this experiment, and Table 7 gives detailed information about the clustering process: the OLBP channel scores the highest cluster count for all forest sizes, while CLBP and LBPV score the lowest cluster counts across the different forest sizes. All the learned detectors are evaluated on the single-scale part of the UIUC car dataset. Three different overlap ratios {40%, 50%, 60%} are considered to evaluate the detection accuracy of vehicles. The multi-LBP feature extractor outperforms any single LBP feature, as reported in Table 8.
The third experiment examines the depth of the clustering levels. We implement four different learned clustering forests with only the multi-LBP feature extractor at F = 1, varying the clustering depth level over {5, 8, 10, 12}. In this case, all learned detectors are evaluated on the single-scale part of the UIUC car dataset with a 60% overlap ratio. Figure 11b shows the comparison between the different clustering levels: the clustering forests are sensitive to the clustering level depth, and the clustering forest with 15 clustering levels achieves the best score among the others. The effect of region size on the overall detection rate is shown in Fig. 11c; clearly, performance is also sensitive to the region size.
5.4 Evaluation under different imaging conditions

Based on the previous analysis and findings, we use clustering forests of five trees with the two stop conditions, a maximum of 15 clustering levels, a region size of 16×16, and the multi-LBP as the feature extractor; the discriminative saliency map is then learned. The proposed method is evaluated on three different car datasets, Stanford cars, Pascal VOC 2012, and TU Graz-02 cars, using the (Recall/Precision) curves.
Table 3 Training time, number of clustering nodes and cluster entries for the impurity measures Δ1, Δ2 and Δ

Split measure | Regression (Δ1): Time (m) / Nodes / Clusters | Classification (Δ2): Time (m) / Nodes / Clusters | Random (Δ): Time (m) / Nodes / Clusters
OLBP | 115.17 / 32,764 / 4058 | 73.55 / 32,544 / 3536 | 119.45 / 32,773 / 4348
CLBP | 79.47 / 32,115 / 4663 | 64.26 / 29,491 / 1185 | 114.62 / 32,000 / 2914
LBPV | 77.86 / 32,755 / 2757 | 66.96 / 32,287 / 2171 | 101.20 / 29,707 / 2885
Multi-LBP | 109.53 / 32,208 / 3338 | 99.86 / 32,757 / 2521 | 120.66 / 31,839 / 3209
Table 4 Average and maximum regions in each cluster entry across different splitting impurity measurements

Split measure | Classification Δ2: Avg. / Max. | Regression Δ1: Avg. / Max. | Random Δ: Avg. / Max.
OLBP | 8 / 1571 | 7 / 961 | 6 / 777
CLBP | 24 / 3387 | 7 / 1088 | 10 / 934
LBPV | 13 / 625 | 10 / 1995 | 10 / 1120
Multi-LBP | 11 / 1206 | 8 / 2609 | 9 / 845
Table 5 Mean, standard deviation and coefficient of variation of region similarity for the different LBP variants, and ratio of cluster entries with probability > 0.5

LBP feature | Mean | SD (%) | Coeff. var. (%) | Entries > 0.5 (%)
OLBP | 0.64 | 33.5 | 51.9 | 61.4
CLBP | 0.43 | 26.6 | 60.6 | 38.2
LBPV | 0.65 | 33.1 | 50.4 | 70.8
Multi-LBP | 0.75 | 23.3 | 30.9 | 70.1
Fig. 10 Comparison of the LBP variants' PDF with coefficient of variation

Table 6 Evaluation of the impurity measures and different LBP derivatives on the multi-scale UIUC car dataset (overlap ratio threshold 60%)

Split measure | Classification Δ2: TPR (%) / TNR | Regression Δ1: TPR (%) / TNR | Random Δ: TPR (%) / TNR
OLBP | 44.52 / 0.68 | 23.70 / 0.37 | 60.18 / 0.85
CLBP | 46.15 / 0.83 | 26.92 / 0.85 | 38.46 / 0.87
LBPV | 27.07 / 0.79 | 46.16 / 0.46 | 49.55 / 0.61
Multi-LBP | 69.21 / 0.83 | 38.46 / 0.57 | 72.63 / 0.76
Fig. 11 Impact of clustering parameters. a Sharing ratio of LBP derivatives in the Multi-LBP training features, b performance versus clustering level depth, c performance versus region size
A detection is counted as successful only when the intersection of a detection response and a ground-truth box is larger than 50% of their union. The Recall/Precision curves are shown in Fig. 12, which demonstrates the efficiency of the proposed method on cluttered backgrounds and over the wide variation of car models and viewpoints in these datasets. Specifically, the multi-LBP achieves a precision of 79.2% on the Stanford dataset (examples of successful detections are shown in Fig. 13), a precision of 83.9% on the Pascal VOC 2012 car class (examples in Fig. 14), and a precision of 88.5% on the TU Graz-02 car class (examples in Fig. 15).
5.5 Comparison with other methods
Importantly, the proposed method is compared with other methods on both the single-scale and multi-scale parts of the UIUC car test set. The comparison is given in Table 9; the proposed method competes well with the state-of-the-art methods. From the experiments, the proposed method achieves high detection rates of 99.6% on the single-scale part and 98.9% on the multi-scale part of the UIUC car dataset. For the single-scale part, the method of Mutch and Lowe [21], based on SVM and Gabor features, achieves 99.9%, outperforming our method at 99.6%; for the multi-scale part with difficult imaging conditions, however, the proposed method achieves 98.9% against 90.6% for Mutch and Lowe's method [21]. Examples of detection results on the UIUC dataset obtained by the proposed method are shown in Fig. 16. Another experiment is conducted to
Table 7 Number of splitting nodes and clusters for three forest sizes F = 1, 3, 5

LBP feature | F=1: Nodes / Clusters | F=3: Nodes / Clusters | F=5: Nodes / Clusters
OLBP | 32,773 / 4348 | 32,776 / 4601 | 32,637 / 4714
CLBP | 32,000 / 2914 | 31,511 / 2700 | 31,913 / 2354
LBPV | 29,707 / 2885 | 32,451 / 2760 | 30,379 / 3026
Multi-LBP | 31,839 / 3209 | 31,496 / 2858 | 32,330 / 2192
Table 8 Detection accuracy of the LBP and its variants on the UIUC car dataset at overlap ratio thresholds of 40%, 50% and 60%

LBP feature | CF size | 40% | 50% | 60%
OLBP | F=1 | 72.5% | 69% | 60.18%
OLBP | F=3 | 75.9% | 74% | 62.7%
OLBP | F=5 | 77.2% | 76% | 65%
CLBP | F=1 | 43.8% | 41% | 38.46%
CLBP | F=3 | 45.8% | 43.7% | 42%
CLBP | F=5 | 50% | 48.2% | 45.3%
LBPV | F=1 | 61% | 52.3% | 49.55%
LBPV | F=3 | 68.6% | 64% | 57%
LBPV | F=5 | 75.6% | 72.7% | 62%
Multi-LBP | F=1 | 86% | 83% | 72.63%
Multi-LBP | F=3 | 93.1% | 91% | 88.4%
Multi-LBP | F=5 | 99.9% | 99.6% | 95.7%
Fig. 12 Performance of the proposed method under different imaging conditions: Precision vs Recall (50% overlap) curves for OLBP, CLBP, LBPV and Multi-LBP. a Stanford cars dataset, b Pascal VOC 2012 car class, c TU Graz-02 dataset (car class)
compare the proposed method with a deep learning-based approach. For this comparison, the "You Only Look Once" (YOLO) method [64] is re-implemented.
Fig. 13 Example of detection results on the Stanford cars dataset
Fig. 14 Example of detection results on the Pascal VOC 2012 car class
Fig. 15 Example of detection results on the TU Graz-02 dataset (car class)
Fig. 16 Example of detection results on the UIUC dataset using the proposed method
Fig. 17 Comparison of the proposed method with two pre-trained YOLO deep learning techniques: Precision vs Recall (50% overlap) curves for yolo-voc, yolo-coco and Multi-LBP
The re-implementation uses the open-source Darknet framework with the tiny network, using 1024 MB of GPU memory for computation. Two different pre-trained car classifiers are employed: the first (yolo-voc) is trained on the Pascal VOC dataset [60], and the second (yolo-coco) on the COCO dataset [65]. The proposed method is trained on training set-2 [63] with the same CF configuration and a 9×9 region size. For the testing stage, a random set of the surveillance-nature data (Table 1) is used. The Recall/Precision curves for this comparison are shown in Fig. 17. The proposed CPU-based method achieves a high precision of 87.2% compared to the GPU-based detectors, where yolo-voc scores a precision of 74.8% and yolo-coco a precision of 79.4%. Examples of detection results on the surveillance-nature dataset obtained by the proposed method are shown in Fig. 18.
6 Conclusions
In this paper, a method is proposed to detect road vehicles using a saliency map based on the combination of local binary pattern histograms and clustering forests. This is carried out within the framework of random forests, where a combination of local binary pattern variants for appearance-based detection and a random impurity measurement (classification and regression) for the clustering forests are introduced as the optimal choice for low-cost and reliable training and detection. Three different LBP variants utilized as feature extractors are analyzed for road vehicle detection, and the best performance is achieved by a combination of these LBP variants at medium region sizes (e.g., 12×12 to 16×16). On the other side, the clustering forests used as a training technique are investigated under several parameter settings, for the clustering forest size, the clustering level depth, and the splitting of the training set across the different measuring modes, on hard car datasets with uncontrolled background conditions, namely the UIUC car dataset, the Stanford cars dataset, the Pascal VOC 2012 car class, the TU Graz-02 cars dataset, and the surveillance-nature data of the CompCars dataset. The best performance is accomplished when the clustering forest size is more than three trees, with a clustering level depth between 12 and 15 and the random impurity measurement mode. The proposed method achieves high performance compared to many state-of-the-art methods on the UIUC car dataset, which is one of the most difficult real-world datasets. It is also compared with supervised deep learning techniques and achieves competitive results without the need for GPU computation. Moreover, the proposed method is easy to implement and interpret. For future work, the clustering process provides useful information that could be used to classify different types or models of vehicles, and the clusters need some form of pruning; thus, a further identification process
Fig. 18 Example of detection results on the surveillance nature set using the proposed method
Table 9 Comparison of the proposed method with the state of the art on the UIUC dataset

Method | Classifier | Features | Training dataset | Pos. | Neg. | Single (%) | Multi (%)
Mutch and Lowe [21] | Linear SVM | 2D Gabor | UIUC | 500 | 500 | 99.9 | 90.6
Agarwal et al. [28] | SNoW | BFV | UIUC | 500 | 500 | 76.5 | 39.6
Fritz et al. [30] | ISM + SVM | Local kernel | UIUC | 170 | – | 88.6 | 87.8
Kapoor and Winn [31] | LHRF | SIFT | TUD car | 35 | – | 94.0 | –
Wu and Nevatia [32] | CBT | Edgelets | MIT | 4000 | 7000 | 97.5 | 93.5
Kuo and Nevatia [33] | G. AdaBoost | HOG | MIT + Pascal | 2462 | 8427 | 98.5 | 95.0
Villamizar et al. [34] | BRF | HOG-LBF | UIUC | 200 | – | 98.5 | 98.2
Gall and Lempitsky [46] | Hough Forest | Intensity | UIUC | 550 | 450 | 98.5 | 98.6
Fergus et al. [66] | EM | PCA | UIUC + rear | 800 | – | 88.5 | –
Leibe et al. [67] | K-means & ALC | Hessian-Laplace | Own set | 100 | – | 97.5 | 95.0
Our proposed method | CF | LBP variants | CompCars | 587 | 5329 | 99.6 | 98.9