
Unsupervised Image Dataset Annotation Framework for Snow Covered Road Networks

Mohamed Karaa1,3, Hakim Ghazzai1, Lokman Sboui2, Hichem Besbes3 and Yehia Massoud1

1Innovative Technologies Laboratories, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia

2Systems Engineering Department, École de Technologie Supérieure (ÉTS), Montreal, Canada

3Higher School of Communication of Tunis, University of Carthage, Tunis, Tunisia

Abstract—Road surface condition estimation plays a crucial role in road safety and maintenance, especially in adverse weather conditions such as snowfall. In this paper, we introduce a framework for the unsupervised annotation of a dataset describing road snow cover levels. This framework relies on feature learning using autoencoders and graph clustering using the Louvain community detection algorithm. We also incorporate time and weather data to facilitate the annotation process. We evaluate our method by assessing its different steps and comparing it to a density-based clustering method. We also present a large image dataset describing four road cover states in urban scenes, including different weather and visual conditions. The dataset comprises 41346 images collected from road monitoring cameras installed in Montreal, Canada, during the 2022 winter season. This dataset is intended to help integrate computer vision techniques into the planning of snow removal operations.

Index Terms—Intelligent Transportation, Road Surface Condition, Unsupervised Annotation, Autoencoders, Community Detection.

I. INTRODUCTION

Every year, many major cities experience heavy snowfalls during the winter. These occurrences can affect road surface conditions, causing roads to be blocked or slippery and severely impacting road users' mobility and safety. In order to prevent possible incidents, governments spend millions of dollars to plan and perform efficient snow removal operations. For example, the city of Montreal, QC, Canada, dedicates an annual budget of over 170 million dollars to keep its streets snow-free. These figures indicate the need for systems that recognize the state of the roads in order to perform adequate snow clearing or to alert road users.

Some existing systems are based on artificial intelligence (AI) and require large, appropriate datasets to train on the road surface condition estimation task. In [1], the authors collected images from weather station cameras installed only on main roads across Poland. The dataset, composed of 2100 images, described three road conditions (dry, wet, and snowy) and was annotated using sensor measurements from the weather stations. In [2], Pan et al. trained their models using only highway road images and distinguished three classes of road cover depending on the visibility of the pavement. Their datasets included almost 33000 annotated images. Similarly, highway images are collected and manually annotated using defined criteria in [3].

Fig. 1: Sample images of the different snow cover classes under different illumination and weather conditions: (a) clear surface, (b) light-covered surface, (c) medium-to-heavy-covered surface, (d) plowed surface.

The mentioned datasets were labeled manually or using external data, which requires considerable effort and time. AI can replace these traditional annotation methods with unsupervised annotation techniques. Such approaches rely on unsupervised learning to discover groups of similar images within unlabeled datasets. Caron et al. [4] introduced a clustering method that jointly learns the parameters of a neural network and the cluster assignment of the resulting features, iteratively using the clusters as pseudo-labels to train the network. Similarly, another framework that joins feature learning and clustering is introduced in [5]. In [6], researchers presented a deep autoencoder-based clustering method that first learns better representations of the input data with an autoencoder and then applies the k-means algorithm to extract clusters. In [7], Min et al. surveyed further methods in which various deep learning architectures are used for clustering.

This paper introduces an unsupervised data annotation technique for snow cover estimation in an urban scene. Unlike the previously mentioned datasets, we annotate our dataset using a novel unsupervised annotation framework that transforms the images and learns representations using a convolutional autoencoder (CAE) architecture.


Fig. 2: Overview of the different components of the annotation framework, labeled as in Section III-B. After collecting the images, we divide them by time (day/night) and weather (clear/snowy). Then, we binarize the images to represent the snow cover. Next, we extract features using an autoencoder, build a graph representing the distances between the features, and apply the Louvain community detection algorithm for clustering. Finally, we aggregate the clusters to obtain the annotated dataset.

We approach image clustering as a network clustering task, using the Louvain community detection algorithm to discover communities representing the different target classes. Additional metadata assists our method by facilitating the separation of very similar cluster members.

II. PROBLEM STATEMENT & METHODOLOGY

In this work, we aim to build an annotated dataset describing the snow level covering the road in an urban scene. The annotation framework takes images captured by traffic monitoring cameras as input and, in an unsupervised way, generates clusters of similar images reflecting the street cover state, assigning each image a label corresponding to its road snow cover level. The output classes are designated to fit the different snow levels specified by the city of Montreal. We define the following classes: "clear surface", "light-covered surface", "medium-to-heavy-covered surface", and "plowed surface". Fig. 1 illustrates samples of these classes.

Existing datasets for road surface conditions in the context of snowy weather suffer from different limitations. Some datasets are small, while others lack diversity in scenes and visual conditions. In some cases, the defined labels are not practical for real-world use cases as they only describe high-level categories. Moreover, most of these datasets are not publicly available.

To collect the dataset, we scrape images from traffic cameras installed and made accessible by the city of Montreal on ten different days of the 2022 winter season. Due to the large number of scraped images, manual labeling becomes a tedious and expensive process. Therefore, we propose an automatic labeling strategy based on feature learning and clustering to tackle this challenge. Fig. 2 presents an overview of our proposed framework, where each step is labeled as in Section III and multiple arrows indicate that the process is performed across all camera image sets. Initially, we leverage the associated metadata to split the images into different categories to simplify the annotation process: we divide images into day and night time, and then into clear and snowy surfaces.

The next step is transforming the images into a binary format that better describes the road surface cover. We use a CAE architecture to extract representations from the binary images. The CAE produces a latent feature space in which the input images' dimensionality is reduced and the features are suited for clustering, i.e., similar images are close to each other and well separated from dissimilar ones. We use these features to build a network on which we perform the Louvain community detection algorithm to extract clusters. We aggregate the detected communities for each camera and across all cameras to obtain the defined labels.

III. PROPOSED ARCHITECTURE / ALGORITHM

This section details the different steps of the framework, from dataset collection to unsupervised annotation.

A. Data Collection

This work uses images collected from over 500 traffic cameras installed across the city of Montreal and publicly accessible online¹. These pan-tilt-zoom cameras provide multiple views of various urban scenes (highways, main roads, neighborhoods). We perform image scraping between 30 January and 25 February of the 2022 winter season. This period coincides with the snowfall season in Montreal, which receives a yearly average of 209 cm of snow.

¹Montreal traffic cameras website, available at: https://ville.montreal.qc.ca/circulation/


TABLE I: Classes describing snow cover level

Class                            | Characteristic
Clear surface                    | The road is totally clear
Light-covered surface            | A thin layer of snow, or only a small part of the surface is covered
Medium-to-heavy-covered surface  | More than 2.5 cm of snow
Plowed surface                   | The road surface has been cleared; traces of snow are still visible

Collected images exhibit different settings, including time, weather, lighting, and traffic, providing rich and diverse data. We store images from different cameras separately and associate them with metadata files recording the timestamp and weather conditions for each image.
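A minimal sketch of such a collection loop is shown below. The camera URL and the JSON metadata layout are hypothetical placeholders for illustration, not the actual endpoints or storage format used for this dataset.

# Illustrative sketch only: CAMERA_URLS and the metadata file layout are
# hypothetical placeholders, not the paper's exact scraping setup.
import datetime
import json
import pathlib
import requests

CAMERA_URLS = {
    "cam_001": "https://example.org/cameras/cam_001.jpg",  # placeholder URL
}

def scrape_once(out_dir="dataset", weather="snow"):
    for cam_id, url in CAMERA_URLS.items():
        cam_dir = pathlib.Path(out_dir) / cam_id
        cam_dir.mkdir(parents=True, exist_ok=True)
        stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
        # Download the current camera frame and store it per camera.
        image = requests.get(url, timeout=10)
        (cam_dir / f"{stamp}.jpg").write_bytes(image.content)
        # Record timestamp and weather metadata alongside the image.
        with open(cam_dir / f"{stamp}.json", "w") as f:
            json.dump({"timestamp": stamp, "weather": weather}, f)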

B. Dataset Labeling

a) Label Definition: We define four target classes that describe the road surface in accordance with the snow levels defined by the city of Montreal to carry out snow removal operations. Table I lists the labels and their characteristics.

b) Dataset Splitting: The collected images are split into day and night images using timestamps, as the scenes' appearance differs between these two periods. Night images have similar illumination from artificial light and exhibit low traffic. On the other hand, day scenes experience medium to high traffic, and the illumination depends on the weather. In addition, we leverage weather information (state description and precipitation) to divide the images into clear roads, associated with clear weather and the absence of snowfall for long periods, and snowy roads, corresponding to snowy weather. This step reduces the clustering target to three classes. We select image sets with a fixed view to focus on the varying element across the images, i.e., snow levels, instead of generating clusters based on changes in the camera view.
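As an illustration, the sketch below splits an image using its per-image metadata. The field names ("timestamp", "weather") and the fixed day/night hours are assumptions, not the exact schema or thresholds used in this work.

# Sketch of the day/night and clear/snowy split. The metadata fields and the
# sunrise/sunset hours are illustrative assumptions.
from datetime import datetime

def split_image(meta, sunrise_hour=7, sunset_hour=17):
    ts = datetime.strptime(meta["timestamp"], "%Y%m%d_%H%M%S")
    period = "day" if sunrise_hour <= ts.hour < sunset_hour else "night"
    surface = "snowy" if "snow" in meta["weather"].lower() else "clear"
    return period, surface

# Example: group image paths into the four metadata-defined subsets.
subsets = {("day", "clear"): [], ("day", "snowy"): [],
           ("night", "clear"): [], ("night", "snowy"): []}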

c) Image Processing: We transform the images into a more representative form to ensure the good quality of the features passed to the clustering algorithm. First, we convert the color images to grayscale. Next, we use the Contrast Limited Adaptive Histogram Equalization (CLAHE) algorithm to enhance the contrast and accentuate details such as tire traces, road border level, and snow plowing traces. Finally, we binarize the images to simplify the scene and better represent the snow cover, as shown in step (c) of Fig. 2. For instance, clear roads would be all black, covered roads all white, and light-covered or plowed roads would be represented by both colors but with different textures.
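A minimal sketch of this preprocessing pipeline using OpenCV follows. The CLAHE clip limit, tile size, and the use of Otsu's method for binarization are assumptions made for illustration; the paper does not report the exact parameters.

# Grayscale -> CLAHE -> binarization sketch; parameter values are assumptions.
import cv2

def binarize_road_image(path):
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    # Contrast Limited Adaptive Histogram Equalization to accentuate details.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray)
    # Otsu's method picks a global threshold separating snow from pavement.
    _, binary = cv2.threshold(enhanced, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary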

d) Dimensionality Reduction and Feature Representation: The features should exist in a space suitable for clustering, meaning that this space is composed of dense regions of similar points and that dissimilar points are well separated. Images in their original format do not satisfy this requirement, hence we transform them into a more convenient space.

Fig. 3: The proposed convolutional autoencoder architecture used to extract feature representations from camera images. The input layer receives binary images and the network learns to reconstruct them. The bottleneck layer is flattened to extract representative features from the latent space.

Using autoencoders (AEs) to learn cluster representations can be more efficient than other feature extraction and dimensionality reduction techniques. In general, an autoencoder learns a latent encoding of its inputs by trying to regenerate them from a bottleneck space. An efficient AE transforms its input into a lower-dimensional space where dissimilar points are well separated.

Our objective is to learn a representation for each camera image set so that the different classes can be separated from each other. Fig. 3 shows the proposed AE architecture. The input layer shape corresponds to the binarized image size of 480×720 pixels. The autoencoder is trained to reconstruct the binarized images by minimizing the binary cross-entropy loss, as it learns to estimate a binary value for each pixel. The AE bottleneck layer captures the most important features while reducing the dimensions to 120×180×8, equivalent to half of the input size. After training, this layer is extracted and flattened to generate the lower-dimensional representation.
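The following Keras sketch shows one convolutional autoencoder consistent with the shapes described above (480×720 binary input, two pooling stages, a 120×180×8 bottleneck, binary cross-entropy loss, Adam optimizer). The filter counts and layer choices are illustrative assumptions and are not claimed to reproduce the exact M1 architecture evaluated in Section IV.

# CAE sketch consistent with the reported shapes; filter counts are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(480, 720, 1))
x = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(8, 3, padding="same", activation="relu")(x)
encoded = layers.MaxPooling2D(2)(x)              # bottleneck: 120 x 180 x 8
x = layers.Conv2D(8, 3, padding="same", activation="relu")(encoded)
x = layers.UpSampling2D(2)(x)
x = layers.Conv2D(16, 3, padding="same", activation="relu")(x)
x = layers.UpSampling2D(2)(x)
out = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(x)

autoencoder = Model(inp, out)
encoder = Model(inp, layers.Flatten()(encoded))  # flattened latent codes
autoencoder.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                    loss="binary_crossentropy", metrics=["accuracy"])
# Training settings reported in Section IV: batch size 8, 300 epochs, e.g.
# autoencoder.fit(binary_images, binary_images, batch_size=8, epochs=300)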

e) Clustering: Treating each camera image set separately does not guarantee that a given set includes all the target classes, which can be explained by cameras going down or snow not being cleared on some roads. This implies that the number of clusters is unknown prior to clustering and suggests the need for an algorithm that handles this requirement.

For this matter, we resort to a community detection method introduced by Blondel et al. in [8], named the Louvain method. In [9], the authors define community detection as the task of identifying groups (communities) of similar nodes within a network such that the nodes of a community are densely connected whereas nodes in different communities are sparsely linked. The Louvain algorithm is based on modularity optimization, where modularity measures the internal density of communities; higher modularity values imply a better partitioning of the data.

First, we build a network based on the Euclidean distances between the latent representations extracted by the CAE. Each network node represents a code, and the network edges link each node to its k nearest neighbors, which reduces the network's complexity compared to building a complete network.


Fig. 4: Variation of the k parameter of the k-nearest neighbors algorithm used to build networks representing the extracted codes.

To select the best network configuration, we experiment with different values of the k parameter of k-nearest neighbors. We evaluate the modularity of the graph for different resolutions, as shown in Fig. 4. We opt for the k value that maximizes the modularity, i.e., gives better partitions, while keeping the lowest number of discovered communities. Although k = 2 scores higher modularity values, it generates more communities than k = 3 due to the sparsity of the resulting graph. For this reason, we set k to 3.

We select the resolution parameter from a set of values to maximize modularity and ensure the quality of the discovered communities, as illustrated by the example in Fig. 2.
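The sketch below illustrates this step with scikit-learn and networkx (version 2.8 or later for louvain_communities), assuming codes is an array of flattened latent representations. Converting the Euclidean distances into edge weights via 1/(1+d) is an assumption made for illustration, not the paper's stated weighting.

# Build a k-NN graph over the latent codes and cluster it with Louvain.
import numpy as np
import networkx as nx
from sklearn.neighbors import NearestNeighbors

def louvain_clusters(codes, k=3, resolution=1.0, seed=0):
    # Connect each code to its k nearest neighbors (Euclidean distance)
    # instead of building a complete graph.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(codes)
    dist, idx = nn.kneighbors(codes)
    G = nx.Graph()
    G.add_nodes_from(range(len(codes)))
    for i, (neighbors, dists) in enumerate(zip(idx, dist)):
        for j, d in zip(neighbors[1:], dists[1:]):   # skip the point itself
            G.add_edge(i, int(j), weight=1.0 / (1.0 + d))  # assumed weighting
    communities = nx.community.louvain_communities(
        G, resolution=resolution, seed=seed)
    labels = np.empty(len(codes), dtype=int)
    for c, members in enumerate(communities):
        labels[list(members)] = c
    return G, communities, labels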

f) Cluster Aggregation: The previous step generates communities representing granular clusters of the different target classes, meaning that multiple clusters may represent a single road state. These granular clusters are aggregated to obtain groups corresponding to the desired labels. We assign each community a virtual center, defined as the mean of its members' features. We calculate the cosine similarity for every pair of centers and then group each cluster with its most similar one. In this phase, the light-covered class is often confused with the plowed or medium-to-heavy-covered classes because of their similar features. To solve this problem, we leverage the image timestamps, given the start time of snowstorms, to extract the clusters corresponding to light cover before the aggregation, if they exist. Hence, we aggregate the granular clusters into three classes, as per our objective.
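A minimal sketch of this aggregation step, assuming codes and communities come from the clustering sketch above (the light-covered clusters would be set aside beforehand using timestamps):

# Compute a virtual center (mean code) per community and find, for each
# granular cluster, the most cosine-similar cluster to merge with.
import numpy as np

def most_similar_clusters(codes, communities):
    centers = np.stack([codes[list(m)].mean(axis=0) for m in communities])
    normed = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    sim = normed @ normed.T                  # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)           # ignore self-similarity
    return sim.argmax(axis=1)                # index of the closest cluster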

IV. RESULTS & DISCUSSIONS

In this section, we evaluate the performance of the different parts of the framework step by step. We also compare our graph clustering method to a density-based clustering method.

A. Framework Evaluation

a) Autoencoder Performance: To select an adequate autoencoder architecture for dimensionality reduction and feature representation, we compare three architectures in terms of the number of parameters and depth. We define "d", the model's depth, as the number of pooling operations in the encoder, and "p" as the number of model parameters.

TABLE II: Autoencoder performance for different architectures across image sets

Metric                    | M1 (d=2, p=51k) | M2 (d=2, p=13k) | M3 (d=3, p=13k)
Day, average accuracy     | 0.97            | 0.96            | 0.90
Day, average loss         | 0.05            | 0.07            | 0.20
Night, average accuracy   | 0.97            | 0.96            | 0.91
Night, average loss       | 0.06            | 0.07            | 0.18

TABLE III: Evaluation of Louvain community detection partitioning across image sets

      | Average modularity | Average coverage | Average performance | Average Dunn's index
Day   | 0.79               | 0.92             | 0.92                | 0.51
Night | 0.76               | 0.92             | 0.91                | 0.42

All the models are trained with the Adam optimizer, a fixed learning rate of 0.001, and a batch size of 8 for 300 epochs. Table II displays the average performance of models M1, M2, and M3. The results show that model M1, with less depth and more parameters, gives the highest accuracy and the lowest error. This model is used for the remaining steps of the proposed data annotation process.

b) Clustering Performance: To evaluate the clustering method, we study metrics related to graph clustering, such as modularity, coverage, and performance. Modularity is a measure of the division of a network into groups: a network with higher modularity has dense connectivity within clusters and sparse connectivity between clusters. In [10], coverage and performance are defined as metrics of graph clustering quality. Coverage is the ratio of intra-cluster edges over the total number of network edges. The performance of a clustering reflects the fraction of correctly interpreted pairs of nodes.
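These graph metrics can be computed, for example, with networkx, assuming G and communities are the graph and partition produced by the Louvain sketch above. Dunn's index, which operates on the feature space rather than the graph, is not included in this sketch.

# Modularity, coverage, and performance of a partition via networkx.
import networkx as nx

def graph_clustering_metrics(G, communities):
    modularity = nx.community.modularity(G, communities)
    coverage, performance = nx.community.partition_quality(G, communities)
    return {"modularity": modularity,
            "coverage": coverage,        # intra-cluster edges / total edges
            "performance": performance}  # correctly interpreted node pairs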

Community detection performance: We evaluate the output of the Louvain algorithm by measuring the modularity, coverage, performance, and Dunn's index for all the image sets, as shown in Table III. Modularity values range between -½ and 1, where values closer to 1 mean a better partitioning of the nodes. Coverage and performance range between 0 and 1, where higher values indicate better graph clustering.

Higher values of Dunn’s index imply that a cluster is internally compact and well separated from other clusters.

Comparison with a density-based clustering method: We compare the results of our method with a density-based clustering method that relies on the HDBSCAN algorithm [11], [12] to extract dense point distributions in the feature space. We choose HDBSCAN as it fits the problem of not knowing the number of clusters beforehand. HDBSCAN has also been shown to outperform other density-based clustering methods such as DBSCAN and OPTICS.


TABLE IV: Benchmarking of our proposed method (community detection) against the density-based clustering method.

Metric        | Community detection based method (Day / Night) | Density-based clustering method (Day / Night)
Modularity    | 0.49 / 0.42                                    | 0.40 / 0.41
Coverage      | 0.97 / 0.99                                    | 0.84 / 0.85
Performance   | 0.55 / 0.45                                    | 0.60 / 0.59
Dunn's index  | 0.55 / 0.60                                    | 0.47 / 0.44

The HDBSCAN algorithm detects outlier points that are not identified as belonging to any cluster. In our case, images representing the light-covered surface class are often detected as outliers, as it is the underrepresented class. We perform a second clustering on the outliers, with a reduced minimum cluster size parameter, to preserve these points.

The final step is to filter points according to their cluster membership strength probability in order to obtain homogeneous clusters. The aggregation step is the same as in our developed method. To compare the performance of both clustering outputs, we calculate the previously defined metrics on the final output, i.e., on the three snowy classes, using the same network built for community detection. Table IV showcases the benchmarking of the two methods.
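The sketch below illustrates this baseline with the hdbscan package: cluster the codes, re-cluster the outliers with a smaller minimum cluster size, and keep only points with strong cluster membership. The minimum cluster sizes and the probability threshold are illustrative assumptions, not the values used in the experiments.

# Density-based baseline sketch: HDBSCAN, outlier re-clustering, and filtering.
import numpy as np
import hdbscan

def density_based_labels(codes, min_cluster_size=50, min_outlier_cluster=10,
                         min_probability=0.7):
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit(codes)
    labels = clusterer.labels_.copy()
    probs = clusterer.probabilities_.copy()
    outliers = np.where(labels == -1)[0]     # points assigned to no cluster
    if len(outliers) >= min_outlier_cluster:
        # Second pass over the outliers to recover small clusters such as the
        # underrepresented light-covered surface class.
        sub = hdbscan.HDBSCAN(min_cluster_size=min_outlier_cluster)
        sub_labels = sub.fit_predict(codes[outliers])
        keep = sub_labels != -1
        labels[outliers[keep]] = sub_labels[keep] + labels.max() + 1
        probs[outliers[keep]] = sub.probabilities_[keep]
    # Filter weak members to obtain homogeneous clusters.
    labels[probs < min_probability] = -1
    return labels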

B. Dataset Annotation Results

The resulting dataset consists of 41346 images describing four road surface cover classes. Fig. 5 shows the distribution of these classes compared to the dataset produced by the density-based clustering method. The dataset comprises:

Clear surface: 17422 images.

Light-covered surface: 3726 images.

Medium-to-heavy-covered surface: 14725 images.

Plowed surface: 3512 images.

The clear surface class has the same number of instances in both datasets, as it is extracted using metadata.

Fig. 5: Class distribution in the resulting dataset compared to the dataset from the density-based clustering method.

We notice that both the light-covered surface and plowed surface classes are less represented in the dataset. This is due to the short duration of light snow cover, as roads rapidly become covered in snow. The plowed class is mainly present in daytime images and rare at nighttime. Still, our method is better able to detect underrepresented clusters (light-covered surface clusters) than the density-based method.

V. CONCLUSION

In this paper, we have introduced an annotated dataset for road surface cover classification in the context of snowfall in an urban scene. We have developed a framework allowing the automatic annotation of the collected data, relying on unsupervised feature representation and community detection techniques to achieve the annotation task. The resulting dataset is composed of 41346 images describing four classes and presents a variety of scenes, weather, and illumination conditions. We have compared our method to a density-based clustering method relying on the HDBSCAN algorithm, and our method outperforms the latter according to the defined metrics. The proposed dataset will enable the integration of computer vision techniques into snow removal operations planning by training deep learning models to predict snow levels and the appropriate operations to perform.

REFERENCES

[1] D. Grabowski and A. Czyżewski, “System for monitoring road slippery based on CCTV cameras and convolutional neural networks,” Journal of Intelligent Information Systems, vol. 55, no. 3, pp. 521–534, 2020.

[2] G. Pan, L. Fu, R. Yu, and M. Muresan, “Evaluation of Alternative Pre-trained Convolutional Neural Networks for Winter Road Surface Condition Monitoring,” in Proc. of the 2019 5th International Conference on Transportation Information and Safety (ICTIS), 2019, pp. 614–620.

[3] M. Khan and M. Ahmed, “Weather and surface condition detection based on road-side webcams utilizing AlexNet, GoogleLeNet, and ResNet: Application of pre-trained Convolutional Neural Network,” International Journal of Transportation Science and Technology, 2021.

[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep Clustering for Unsupervised Learning of Visual Features,” in Computer Vision – ECCV 2018, vol. 11218, 2018, pp. 139–156.

[5] J. Yang, D. Parikh, and D. Batra, “Joint Unsupervised Learning of Deep Representations and Image Clusters,” in Proc. of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5147–5156.

[6] S. Lu and R. Li, “DAC–Deep Autoencoder-Based Clustering: A General Deep Learning Framework of Representation Learning,” in Intelligent Systems and Applications, vol. 294, 2022, pp. 205–216.

[7] E. Min, X. Guo, Q. Liu, G. Zhang, J. Cui, and J. Long, “A Survey of Clustering With Deep Learning: From the Perspective of Network Architecture,” IEEE Access, vol. 6, pp. 39501–39514, 2018.

[8] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks,” J. Stat. Mech., vol. 2008, no. 10, p. P10008, 2008.

[9] D. Jin et al., “A Survey of Community Detection Approaches: From Statistical Modeling to Deep Learning,” IEEE Trans. Knowl. Data Eng., pp. 1–1, 2021.

[10] U. Brandes, M. Gaertler, and D. Wagner, “Experiments on Graph Clustering Algorithms,” in Algorithms - ESA 2003, vol. 2832, 2003, pp. 568–579.

[11] R. J. G. B. Campello, D. Moulavi, and J. Sander, “Density-Based Clustering Based on Hierarchical Density Estimates,” in Advances in Knowledge Discovery and Data Mining, vol. 7819, 2013, pp. 160–172.

[12] L. McInnes and J. Healy, “Accelerated Hierarchical Density Clustering,” in Proc. of the 2017 IEEE International Conference on Data Mining Workshops (ICDMW), 2017, pp. 33–42.
