ELEKTRONIKA IR ELEKTROTECHNIKA, ISSN 1392-1215, VOL. 28, NO. 4, 2022

Vehicle Make Detection Using the Transfer Learning Approach

Dovile Komolovaite 1, 3, *, Andrius Krisciunas 1, Ingrida Lagzdinyte-Budnike 1, Aurelijus Budnikas 2, Dominykas Rentelis 3

1 Department of Applied Informatics, Kaunas University of Technology, Studentu g. 50-407, LT-51368 Kaunas, Lithuania
2 Department of Software Engineering, Kaunas University of Technology, Studentu g. 50-406, LT-51368 Kaunas, Lithuania
3 MB Kodo technologijos, Silagirio St. 73-1, Vijukai, LT-54306 Kaunas dist., Lithuania
dovile.komolovaite@ktu.edu

http://dx.doi.org/10.5755/j02.eie.31046

Manuscript received 25 January, 2022; accepted 31 March, 2022.
Abstract—Vehicle detection and classification is an
important part of an intelligent transportation surveillance
system. Although car detection is a trivial task for deep
learning models, studies have shown that when vehicles are
visible from different angles, more research is relevant for
brand classification. Furthermore, each year, more than 30
new car models are released to the United States market alone,
implying that the model needs to be updated with new classes,
and the task becomes more complex over time. As a result, a
transfer learning approach has been investigated that allows
the retraining of a model with a small amount of data. This
study proposes an efficient solution to develop an updatable
local vehicle brand monitoring system. The proposed
framework includes the dataset preparation, object detection,
and a view-independent make classification model that has
been tested using two efficient deep learning architectures,
EfficientNetV2 and MobileNetV2. The model was trained on
the dominant car brands in Lithuania and achieved 81.39 %
accuracy in classifying 19 classes, using 400 to 500 images per
class.
Index Terms—Image classification; Machine learning;
Vehicle detection.
I. INTRODUCTION
The ability to recognize a vehicle from a video can be
very helpful, from traffic monitoring to following a
fleeing driver from a crime scene. So far, Automatic License
Plate Recognition systems have been mostly utilized to
identify the vehicle under investigation. However, if the
criminal is driving fast and the recorded license plates are of
poor quality, the main tool for locating the suspect is
monitoring a car with certain attributes via city cameras and
manual searches by officers. Nonetheless, suspects can swap
license plates, or witnesses can write down only part of the
license plate or recall only a few facts about the car, such as
color, make, or model. As a result, an automatic vehicle
characteristics recognition system becomes critical in
assisting officers and improving the intelligent
transportation system [1].
The potential of classifying key characteristics of a
vehicle is constantly being examined due to the increasing
use and research of convolutional neural networks (CNNs)
and the development of the graphics processing unit. CNN
has already been used to detect and track cars or to count the
number of cars passing by on the road [2], [3]. However,
due to the most severe CNN constraint of requiring a large
dataset, the transfer learning approach [4] was designed to
tackle limited data resources. Deep learning technology
already helps individuals reduce the burden of visually
tracking multiple screens, so it can also be used to determine
the attributes of a car, thus helping automate toll collection,
intelligent parking systems [5], or law enforcement [6].
Filtering video segments by specific car attributes can be
simplified using a vehicle classification model.
The task of identifying vehicle characteristics has unique
challenges and problems. The first is the impact of weather,
time of day, and varied lighting exposures, all of which can
degrade model performance, as the classes are very
similar [1]. The lack of open-source datasets that include a
range of commonly used vehicles, viewing angles, diverse
image quality, and data with more than several images per
class is another barrier [7]–[9]. The third challenge is
that newer vehicle models are released on a regular basis,
requiring the deep learning model to be retrained with even
small amounts of data. According to statistics, 42 new
models were created in the United States alone in 2020, 34
in 2021, and 62 models are expected in 2025 [10],
demonstrating an upward trend in car manufacturing. To
address the last two challenges, a country-specific dataset is
collected and vehicle brand classification is explored using a
transfer learning technique that does not require a large
dataset for training, making it appropriate for constantly
upgrading systems.
The aim of this research is to create a retrainable vehicle
detection and classification framework that can accurately
classify vehicle images from various viewpoints to simulate
the effect of city cameras. To present the most popular cars
on the roads, the dataset of vehicle brands is taken from the
Lithuanian car marketplace website. This research uses
architectures designed for lower computing power,
MobileNetV2 and EfficientNetV2, to ensure that the
solution can be implemented in real time and retrained with
new vehicle models. The pre-trained feature vectors of
defined architectures are fine-tuned, and the study examines
multiple combinations of dense classification layers to
suggest a novel architectural design. Finally, the
generalization abilities of the model are evaluated by
comparing the classification performance of various
distribution datasets, and examples of the final proposed
system are presented. At the end of the study, the findings
are discussed, and conclusions and recommendations for
further research are presented.
II. COMPUTER VISION LITERATURE REVIEW
Convolutional Neural Networks are widely used in
computer vision tasks to classify objects such as cars,
pedestrians, and more [11]. To learn how to replicate human
intellect, this supervised machine learning technique
requires labeled data [12]. The basic CNN consists of
convolutional, pooling, and fully connected layers that
classify the final classes, where the convolutional layer
extracts low- and high-level visual features using a weight-
sharing mechanism [11].
A state-of-the-art CNN model named You Only Look
Once (YOLO) demonstrated exceptional performance in
real-time image processing due to its high speed and
accuracy [13]. The third version of the model includes 53
convolutional layers and 23 residual layers.
The YOLOv3 model has already been pre-trained on the
COCO dataset, which covers 80 object categories, including
the means of transportation such as bicycles, cars, buses,
trains, trucks, and so on. Furthermore, it has been used
successfully in vehicle detection applications and has
improved recognition capabilities for tiny objects [13].
A. Transfer Learning
CNNs, on the other hand, have the disadvantage of being
strongly reliant on large amounts of data to avoid overfitting
[14] and hence requiring massive computer resources [15].
As a result, a machine learning technique called transfer
learning [4] was established, in which a previously trained
model with stored weights is used to train a new model with
a smaller similar dataset. Because pre-trained models have
already learned to recognize edges, colors, and patterns, the
weights of the feature extraction network can be frozen to
convey this information [16]. Finally, the last layers can be
used to obtain additional detailed information about the
classified classes [15]. Another transfer learning method is
fine-tuning, which involves retraining the layers with
initialized pre-trained model weights rather than freezing
them [17]. In terms of time and computational resources,
freezing layers is the most effective strategy, whereas
training from scratch is the most expensive [18].
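For illustration, the two strategies can be expressed in a few lines of Keras code; a minimal sketch, assuming MobileNetV2 as the pre-trained base (the snippet is not taken from the paper):

```python
import tensorflow as tf

# Load a feature extractor pre-trained on ImageNet, without its classifier head.
base = tf.keras.applications.MobileNetV2(include_top=False, weights="imagenet",
                                         input_shape=(224, 224, 3), pooling="avg")

# Strategy 1: freeze the pre-trained weights and train only the new head.
base.trainable = False

# Strategy 2 (fine-tuning): keep the pre-trained weights trainable, so they
# act as an initialization instead of being frozen.
# base.trainable = True

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(19, activation="softmax"),  # one unit per vehicle brand
])
```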
B. Classification of Vehicle Attributes
Vehicle classification is one of the most interesting tasks
in an intelligent transportation system. It has been studied
using a logo-based technique, in which the vehicle logo is
first localized and then classification is performed [12].
Classification of car brands and models based on front
images reaches accuracies of 98.22 % (49 classes) [5],
98.7 % (107 classes) [19], 97.89 % (35 classes) [1], and
96.33 % (766 classes) [20]. However, the region of interest
is usually strongly cropped to remove the background and
requires a certain angle image; this strategy, for example, is
suited for parking lot cameras. There have been successful
experiments to determine the type of vehicle, such as a car,
van, truck, or another, with an accuracy of up to 99.68 %
using a modified pre-trained ResNet-152 model [3] or
76.28 % using ResNet34 [4]. Another example shows that
the AlexNet architecture can be used to produce a perfect
classifier to assess the position of a vehicle [16]. This type
of study can assist in the development of multiple models
for different viewpoints to improve the overall performance
of model recognition. The classification of vehicle brands
using the logo has been of interest for many years, but the
challenge of recognizing cars without relying solely on this
information has remained largely unsolved [21].
Another type of vehicle classification is the appearance-
based classification, which uses elements such as lights,
windows, and the vehicle body to classify the vehicle [12].
To date, the most used car data from different viewpoints
are the Stanford car dataset, which consists of 196 model
classes with 24 to 68 images in the training dataset and the
same amount of test data as it has been split 50/50 ratio [6],
[15], [16], [18]. The publications that use these data do not
take into consideration the impact of unbalanced data,
although the sample size of the test data is almost three
times smaller for certain classes. However, using this
dataset, the fully trained GoogLeNet architecture [9] and the
MobileNet transfer learning model [22] achieved the highest
make and model classification accuracies, at
80 % and 78.27 %, respectively. In 2020, a unique study
was carried out to actualize the classification of vehicle
models using real surveillance cameras, and the model
achieved an accuracy of 62.09 % in real environmental
settings [23]. This, together with the increasing number of
vehicles and city cameras [7], shows that vehicle
classification is relevant and more research is needed to
make it adaptive to real-world scenarios.
C. EfficientNet and MobileNet Architectures
This study selected two of the most efficient classification
architectures. The first architecture is the lightweight
MobileNetV2 [24], which, according to the reviewed
articles, achieves one of the highest accuracies. It employs a
slow downsampling strategy, so more layers operate on large
feature maps, which preserves detail [25]. Another
selected architecture is EfficientNetV2, a newer architecture
network that uses a compound scaling approach to
uniformly scale and balance the depth, width, and resolution
of the network to improve accuracy and efficiency [26]. The
MobileNetV2 architecture is pre-trained using the ImageNet
dataset, which contains 1000 classes covering high-level
categories such as animals and vehicles [27]. Meanwhile,
EfficientNetV2 is pre-trained on ImageNet-21k, which has
21,841 classes, and is simply a larger version of ImageNet
[27]. The pre-training on ImageNet-21k is claimed to
provide better performance than the use of ImageNet-1k
[28].
III. PROPOSED VEHICLE MAKE DETECTION SYSTEM
The suggested system architecture is illustrated in Fig. 1.
Here, the preparation phase covers data preparation and the
training of the classifier, while the usage phase covers the
adoption of the model by external applications containing
traffic camera data. The final
approach can locate the car in the frame, classify the make
for each bounding box, and be leveraged for rendering or
other purposes.
Fig. 1. The architecture of the proposed vehicle detection system.
Each of the three phases is described in further detail:
Data Preparation. The source material contains images
of vehicles for sale posted on the Internet. These images
contain a lot of noise, such as the gearbox, steering
wheel, engine, car interior, etc. The pre-trained YOLOv3
state-of-the-art detection model was utilized as an auto
labeler to filter out unnecessary data and crop the region
of interest. To ensure that no noisy data remained, the
photographs were reviewed using the designed interface
to exclude unrepresentative images. There is also a
problem with data-class imbalance; thus, only classes
with at least 400 images are taken, and those with more
than 500 images are truncated. Image pre-processing
includes resizing, converting to grayscale, and
normalizing such that the image input matches the pre-
trained data. Subsequently, data augmentation was used
to create a more diversified representation of the vehicles'
sides and dimensions. The last step splits the data into
training and testing sets.
Classification Model. The classification task is the
central objective of the study. The most efficient
architectures of the pre-trained models, MobileNetV2 and
EfficientNetV2, were investigated by fine-tuning their
feature vectors and using different dense classifiers. With
a limited dataset, these models could easily be retrained.
See the subsection Image Classification Using Various
Classifiers for further information.
Car Detection and Make Classification. The YOLOv3
model, which has previously been used for vehicle image
filtering, is combined with the best classifier, as sketched
below. This can be used for vehicle monitoring via video cameras.
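A hedged sketch of this usage-phase loop follows; the `detect_vehicles` helper and the model file name are hypothetical stand-ins, not the authors' code:

```python
import cv2
import numpy as np
import tensorflow as tf

# The 19 brand labels used in this study, in alphabetical order.
BRANDS = ["Audi", "BMW", "Citroen", "Ford", "Honda", "Hyundai", "Kia",
          "Lexus", "Mazda", "Mercedes-Benz", "Mitsubishi", "Nissan", "Opel",
          "Peugeot", "Renault", "Skoda", "Subaru", "Volkswagen", "Volvo"]

classifier = tf.keras.models.load_model("make_classifier.h5")  # hypothetical file name

def classify_frame(frame, detect_vehicles):
    """Detect vehicles in a frame, then classify the make of each bounding box.

    `detect_vehicles` is a hypothetical YOLOv3 wrapper that returns
    (x, y, w, h) boxes for the detected vehicles.
    """
    results = []
    for (x, y, w, h) in detect_vehicles(frame):
        crop = frame[y:y + h, x:x + w]
        gray = cv2.cvtColor(cv2.resize(crop, (224, 224)), cv2.COLOR_BGR2GRAY)
        inp = np.repeat(gray[..., None], 3, axis=-1)[None] / 255.0  # training pre-processing
        probs = classifier.predict(inp, verbose=0)[0]
        results.append(((x, y, w, h), BRANDS[int(np.argmax(probs))]))
    return results
```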
Image Classification Using Various Classifiers. The
suggested classification architecture has two primary
components: reusing a previously trained CNN model and
adding a new classifier. For initial feature extraction,
selected model architectures with pre-trained weights are
used. The MobileNetV2 architecture was trained using a
considerably larger and more general ImageNet dataset to
obtain the vector of image features. With a total of 2,257,984
parameters, the vector output size is 1280. Meanwhile, the
feature vector of the EfficientNetV2 neural network is
trained on the ImageNet-21k dataset, resulting in a vector
output size of 1536 and nearly 6 times more parameters:
12,930,622. Although the learning time increases, the feature
vector is fine-tuned with new classifier layers to obtain
greater accuracy. The classifier receives the resulting dense
1-D tensor feature vector output and learns to categorize the
input image according to the brands of the vehicle. Figure 2
illustrates the classification architecture by reusing pre-
trained feature vectors and adding new classifier layers.
Fig. 2. Image classification using fine-tuning.
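As a minimal sketch of this design, assuming the TensorFlow Hub feature-vector packaging of EfficientNetV2 (the hub handle is an assumption, not given in the paper):

```python
import tensorflow as tf
import tensorflow_hub as hub

NUM_CLASSES = 19

# Pre-trained feature extractor; trainable=True corresponds to the
# fine-tuning setup described above (assumed hub handle).
feature_vector = hub.KerasLayer(
    "https://tfhub.dev/google/imagenet/efficientnet_v2_imagenet21k_b3/feature_vector/2",
    trainable=True)

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(224, 224, 3)),
    feature_vector,                                            # 1536-D feature vector
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # baseline classifier (version 1)
])
```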
When the image feature vector is extracted using a network
with pre-trained weights rather than starting from a random
weight initialization, the classifier built on top of it is a
fully connected network. In this way, the knowledge
already acquired can be re-used instead of training from
scratch. The classifier can be as simple as a dense layer with
a Softmax activation function to obtain the probability of
each class. This classification is called the baseline or
version 1. Different classifier versions are tested for
greater accuracy. The 2nd version adds L1 and L2 weight
regularization to reduce the number of features. A batch
normalization layer is added in the 3rd version to allow the
network to train faster. Dropout at a rate of 0.5 is utilized in
the 4th version to avoid overfitting and learn more robust
features. Different combinations of dense layers are then
introduced with the rectified linear unit (ReLU) activation
function. All classifier architectures are presented in Fig. 3.
Fig. 3. Classifier versions 1-8. Here, newly added layers are colored in
gray, and n denotes the number of classes.
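Figure 3 itself is not reproduced here; as a hedged reconstruction, the 7th version, described in Section V as three dense layers with batch normalization, dropout, and L1 and L2 regularization, might look roughly as follows (layer widths and regularization strengths are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def make_classifier_v7(num_classes: int = 19) -> tf.keras.Sequential:
    """Rough sketch of classifier version 7; exact layer sizes are assumed."""
    reg = regularizers.l1_l2(l1=1e-5, l2=1e-4)  # assumed regularization strengths
    return tf.keras.Sequential([
        layers.Dense(512, activation="relu", kernel_regularizer=reg),
        layers.BatchNormalization(),
        layers.Dropout(0.5),  # dropout rate given in the paper
        layers.Dense(256, activation="relu", kernel_regularizer=reg),
        layers.Dense(num_classes, activation="softmax"),
    ])
```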
IV. IMPLEMENTATION RESULTS
This section describes the steps in building a vehicle
monitoring system. The best classification architecture,
which outperforms prior studies, is proposed. A detailed
assessment of the classifier is provided, examining the
performance of each class and providing some visual explanations
using gradient-weighted class activation mapping. Finally,
the generalizability of the model is investigated.
A. Data Description
The raw data for initial model training were collected from a
car sales website. The data also include pictures that are not
required for the model, such as the interior of the vehicle,
the wheels, and the engine. Additionally, each car has a
.json file with information about the ID, make/model, body
type, color, date of manufacture, and more. The images,
which are all in .jpg format, also vary in size and quality.
Examples of original photos of one car are shown in Fig. 4.
Fig. 4. Raw images.
B. Auto Labeler
A pre-trained YOLOv3 model on the COCO dataset with
80 classes and an input image size of 416 was used to detect
image files that do not show the exterior of the car. This
model was designed to search for classes such as car, bus, or
truck with a preset intersection over union (IoU) threshold of
0.5 and a score threshold of 0.89. If the area of the bounding
box was greater than 33 % of the image, it was considered a
vehicle and cropped. A sample result is shown in Fig. 5.
Fig. 5. Cropped images around the region of interest.
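The filtering rule above can be sketched as follows; `yolo_detect` is a hypothetical wrapper around the pre-trained YOLOv3 model, not the authors' code:

```python
VEHICLE_CLASSES = {"car", "bus", "truck"}
SCORE_THRESHOLD = 0.89
MIN_AREA_RATIO = 0.33  # the bounding box must cover more than 33 % of the image

def crop_vehicle(image, yolo_detect):
    """Return the cropped vehicle region, or None if the image is filtered out.

    `yolo_detect` is assumed to yield (label, score, (x, y, w, h)) tuples
    for an input image given as a NumPy array.
    """
    img_h, img_w = image.shape[:2]
    for label, score, (x, y, w, h) in yolo_detect(image, iou_threshold=0.5):
        if (label in VEHICLE_CLASSES and score >= SCORE_THRESHOLD
                and (w * h) / (img_w * img_h) > MIN_AREA_RATIO):
            return image[y:y + h, x:x + w]  # crop the region of interest
    return None  # no usable exterior view of a vehicle
```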
C. Validation Interface
Vehicle search and image cropping operations using the
YOLOv3 model did not yield perfect results. The data
contained images of the car with the doors open or very
close-up shots that had to be manually removed. To speed
up manual work, a graphical user interface has been
developed using the Tkinter Python library. Using the
interface, it is possible to view images and delete poorly
selected photos at the touch of a button. Examples of
acceptable and unacceptable images are shown in Fig. 6.
Fig. 6. Simplified validation interface. Here, the red “Delete” button
deletes the image, the green “Next” button proceeds to the next image, and
the white “Undo” button returns the user to the previous screen.
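A minimal sketch of such a Tkinter review tool (simplified; not the authors' exact implementation, and the folder name is hypothetical):

```python
import os
import tkinter as tk
from PIL import Image, ImageTk  # Pillow is assumed for image display

class Reviewer(tk.Tk):
    """Step through cropped images; mark bad crops for deletion at a button press."""

    def __init__(self, folder):
        super().__init__()
        self.files = sorted(os.path.join(folder, f) for f in os.listdir(folder))
        self.index, self.deleted = 0, []
        self.label = tk.Label(self)
        self.label.pack()
        tk.Button(self, text="Delete", bg="red", command=self.delete).pack(side="left")
        tk.Button(self, text="Next", bg="green", command=self.next).pack(side="left")
        tk.Button(self, text="Undo", bg="white", command=self.undo).pack(side="left")
        self.show()

    def show(self):
        photo = ImageTk.PhotoImage(Image.open(self.files[self.index]))
        self.label.configure(image=photo)
        self.label.image = photo  # keep a reference so it is not garbage-collected

    def next(self):
        self.index = min(self.index + 1, len(self.files) - 1)
        self.show()

    def delete(self):
        self.deleted.append(self.files[self.index])  # mark the current image for removal
        self.next()

    def undo(self):
        if self.deleted:
            self.deleted.pop()  # un-mark the last deletion
        self.index = max(self.index - 1, 0)
        self.show()

Reviewer("cropped_images").mainloop()
```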
D. Image Pre-Processing
The images have been resized to 224×224 dimensions to
represent the lower quality of the real-life image and to
speed up model training. Each image is converted from
RGB to grayscale to remove color information that is not a
decisive factor in classifying a car brand. However, the pre-
trained model uses 3 color channels as input, so the gray
channel is repeated 3 times. Finally, the data are normalized
from 0 to 1. The final dense 4-D tensor has the following
shape: [batch size, height, width, color channels].
Examples of pre-processed images are shown in Fig. 7.
Fig. 7. Pre-processed images.
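These pre-processing steps correspond to roughly the following function (a minimal sketch, assuming OpenCV and NumPy):

```python
import cv2
import numpy as np

def preprocess(image_bgr: np.ndarray) -> np.ndarray:
    """Resize to 224x224, convert to 3-channel grayscale, normalize to [0, 1]."""
    resized = cv2.resize(image_bgr, (224, 224))
    gray = cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)               # drop color information
    three_channel = np.repeat(gray[..., np.newaxis], 3, axis=-1)   # match the RGB input shape
    return three_channel.astype(np.float32) / 255.0                # normalize from 0 to 1
```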
According to the study in [29], 200 or more images per
class are preferred to achieve the best accuracy using the
transfer learning strategy. Therefore, only vehicle brands
with more than 400 images have been included, and excess
data with more than 500 photos per class were truncated
to avoid difficulties with imbalanced classes. The class
labels were one-hot encoded. Finally, a total of 9451 images
were divided in an 80/20 ratio for training and testing,
respectively, leaving a total of 7560 train and 1891 test
photos across the 19 remaining classes: Audi, BMW,
Citroen, Ford, Honda, Hyundai, Kia, Lexus, Mazda,
Mercedes-Benz, Mitsubishi, Nissan, Opel, Peugeot, Renault,
Skoda, Subaru, Volkswagen, Volvo.
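A hedged sketch of the split and label encoding, assuming scikit-learn and the images already loaded into memory (the stratified split is an assumption):

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

def split_dataset(images: np.ndarray, labels: np.ndarray):
    """80/20 train/test split with one-hot labels.

    `images` is an (N, 224, 224, 3) array and `labels` holds integer
    class ids 0..18; stratification by class is an assumption.
    """
    y = tf.keras.utils.to_categorical(labels, num_classes=19)  # one-hot encoding
    return train_test_split(images, y, test_size=0.20,
                            stratify=labels, random_state=42)

# x_train, x_test, y_train, y_test = split_dataset(images, labels)
```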
E. Data Augmentation
The data augmentation technique artificially creates
modified images and is used to reduce overfitting [29]. The
more data available, the better models can be developed
[14]. Two different augmentation techniques were used to
depict reality. The first flips images horizontally to make both
sides of the car visible. The second method uses a random
30 % scaling to make the model more robust to small
changes in object size. Figure 8 illustrates both
augmentations.
Fig. 8. Zooming and horizontal flipping augmentation.
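Both augmentations map directly onto standard Keras pre-processing layers; a minimal sketch with the parameters described above:

```python
import tensorflow as tf

# Applied to the training set only; the test set is left untouched.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),  # make both sides of the car visible
    tf.keras.layers.RandomZoom(0.3),           # random 30 % scaling
])
```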
Data augmentation is used for the training dataset only, so
the size of the test data remains the same (1891), while the
training data increase from 7560 to 11560; about 53 % of
the original data were augmented. As a result, 65.4 % of the
training data are real, while 34.6 % are artificially
augmented. Figure 9 shows the number of images by car
brand.
Fig. 9. Number of images in train and test datasets: (a) Number of cars in
training dataset; (b) Number of cars in testing dataset.
F. Training Environment
The model is built using the cloud-based Google Colab
environment. The code is written in the Python
programming language using the TensorFlow GPU library.
The models are trained on Ubuntu 18.04.5 LTS using
NVIDIA-SMI 495.44 with CUDA version 11.2. More
information about the experimental setup can be found in
Table I.
TABLE I. EXPERIMENTAL SETTINGS.
Operating system (OS): Linux 5.4.144 with Ubuntu 18.04.5 LTS
CPU: Intel(R) Xeon(R) CPU @ 2.30 GHz
GPU: NVIDIA-SMI 495.44
Driver version: 460.32.03
CuDNN/CUDA: 8.0.5/11.2
Python version: 3.7.12
Framework: TensorFlow 2.7.0
Furthermore, the initially established hyperparameters
used to regulate learning processes are presented in Table II.
TABLE II. TRAINING HYPERPARAMETERS.
Number of epochs: 200
Early stopping epochs: 15
Batch size: 16
Optimizer: Adam
Learning rate: 0.0001
The maximum number of training iterations, called
epochs, is 200. Since too many or too few can lead to
overfitting or underfitting, a monitoring system was
introduced with an early stop feature such that if the
validation loss did not improve in the last 15 epochs, the
training would be terminated. The categorical cross-entropy
loss function was used for model training as an objective
function to be minimized. The test loss measures the
distance between the predicted probability distribution of
belonging to each class and the target values. To avoid
memory overflow, the batch size is set to 16 to feed the
model with only a fraction of the training data at a time.
The Adam optimizer with a learning rate of 0.0001 was
selected.
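Taken together, the training configuration corresponds to roughly the following Keras calls (a sketch; the variable names carry over from the earlier sketches, and the validation split share is an assumption):

```python
import tensorflow as tf

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",  # objective to be minimized
              metrics=["accuracy"])

# Stop when the validation loss has not improved for 15 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=15,
                                              restore_best_weights=True)

history = model.fit(x_train, y_train, validation_split=0.1,  # validation share assumed
                    epochs=200, batch_size=16, callbacks=[early_stop])
```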
G. Testing Classifier Versions Using Pre-Trained
Models
Eight classifier versions were tested on top of the
pre-trained MobileNetV2 and EfficientNetV2 models.
All the classifier structures are defined in Section III. Model
architectures trained with random weights were also
evaluated. However, due to the small training dataset, the
model with random weights did not learn the influential
features, reaching an accuracy of only 8.83 % (see Fig. 10).
This demonstrates the benefits of employing pre-trained
weights, which have a 12.5 times greater accuracy on
average than randomly generated weights. The accuracy of
the test dataset, which estimates the number of times the
highest probability prediction matches the ground truth
labels, is shown in Fig. 10 and the model training duration is
shown in Fig. 11.
Fig. 10. Comparison of the accuracy of different classifier models and versions.
Fig. 11. Run-time comparison of different classifier models and versions.
The EfficientNet-B3 feature vector and the 7th classifier
had the best accuracy results of 81.39 %, although the 3rd
classifier was only 0.11 % less accurate. The 7th version
took 50 minutes to train, while the 3rd took 44 minutes.
Although both versions are usable, the 7th has been selected.
The summary of the model with the 7th classifier is shown in
Fig. 12.
The layer names, type, output shape, and number of
weight parameters are all listed in the model summary (see
Fig. 12). At the bottom of the table is the total number of
trainable and non-trainable model parameters. The “None”
output value refers to the batch size. The Keras layer in this
case is the object that loads the stored EfficientNet-B3 model
with its trainable parameter set to “True”.
Fig. 12. Summary of the best sequential classification model (version no.
7).
H. Training Performance of the Best Classifier
The training performance of the EfficientNet-B3 using
the 7th classifier version is shown in Fig. 13.
Here, the blue line indicates the training, and the orange
indicates the validation curve. The accuracy plot shows that
the model gains high accuracy in the early epochs because
the pre-trained weights are reused. To avoid overfitting and
generalization errors, the model was trained for less than 40
epochs, until the validation loss stopped improving.
However, the gap between the training and
validation curves indicates slight overfitting, so
increasing the dataset may improve the results. Additionally,
the shape of the training loss curve suggests that a good
learning pace was chosen.
Fig. 13. Performance of the classification model training.
At predicted class probability thresholds ranging from 0
to 1, the receiver operating characteristic curve (ROC)
displays the true positive rate on the y-axis versus the false
positive rate on the x-axis. The true positive rate indicates
the probability that a defined car label will be predicted
correctly, and the false positive rate shows how often the
prediction of a defined car is incorrect. It is usually plotted
for binary classifiers; therefore, the one-vs-all approach is
employed for multi-class analysis. The ROC curves and the
area under the curve are given in Fig. 14. Looking at the
ROC curves, the model is close to a perfect classifier, as the
curves lie well above the diagonal line.
Fig. 14. Receiver operating characteristic curves.
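The one-vs-all ROC computation can be sketched with scikit-learn as follows; `y_test` and `y_prob` denote the one-hot ground truth and the predicted probabilities:

```python
from sklearn.metrics import roc_curve, auc

# One-vs-all: treat each of the 19 brands in turn as the positive class.
for class_idx in range(y_test.shape[1]):
    fpr, tpr, _ = roc_curve(y_test[:, class_idx], y_prob[:, class_idx])
    print(f"class {class_idx}: AUC = {auc(fpr, tpr):.3f}")
```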
A contrast between real and predicted values is shown in
the confusion matrix (see Fig. 15).
Fig. 15. Normalized confusion matrix.
Using the confusion matrix, the most confused classes
could be identified and their interrelationships explored.
However, the model predictions are somewhat scattered, and
the diagonal reveals the strongest link between the predicted
and true classes.
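A matching sketch for the normalized confusion matrix, reusing the arrays from the previous snippet:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

true_ids = np.argmax(y_test, axis=1)  # decode one-hot ground truth
pred_ids = np.argmax(y_prob, axis=1)  # highest-probability prediction
cm = confusion_matrix(true_ids, pred_ids, normalize="true")  # each row sums to 1
```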
I. Classifier Visual Explanations Using Grad-CAM
Deep learning models are known for their excellent
accuracy, yet their complexity makes them hard to interpret.
Gradient-weighted class activation mapping (Grad-CAM) is
an approach to better understanding how the CNN model
works. Grad-CAM descends to the final convolutional layer
and uses gradient information to create a rough localization
map that identifies key discriminating areas in the image for
class prediction. No fully connected layers are used. The
class activation map is then upscaled to the size of the
input image so that a heat map can be presented [30].
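A generic Keras sketch of this procedure follows (not the authors' code; the name of the final convolutional layer must be looked up in the concrete model, which may require an unwrapped backbone rather than a hub layer):

```python
import tensorflow as tf

def grad_cam(model, image, last_conv_layer_name, class_idx):
    """Compute a Grad-CAM heat map for one image (a generic sketch)."""
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        class_score = preds[:, class_idx]
    grads = tape.gradient(class_score, conv_out)            # gradient information
    weights = tf.reduce_mean(grads, axis=(1, 2))            # global-average-pool the gradients
    cam = tf.reduce_sum(conv_out[0] * weights[0], axis=-1)  # weight the feature maps
    cam = tf.nn.relu(cam)                                   # keep positive evidence only
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()      # normalize to [0, 1]
```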
The most important properties that determine the
classifier decision are marked in red on the heat map (see
Fig. 16). In the images, where the car is visible from the
front, the model focuses on the brand symbol, the front
grille, and a little bit on the headlights. Meanwhile, looking
at the cars from the side, the model distinguishes the most
discriminative regions in the middle and upper parts of the
car (body, hood, and windows). To classify the car from the
rear view, the model searches for the brand symbol around
the license plate number. The angle of view is very
important for good performance of the model. If the car is
seen from above, the focus is on the body. And when the car
is at a 45-degree angle, the discriminating region includes a
side lamp or other car shapes.
Fig. 16. Grad-CAM from different car views: (a) the front, (b) the side, (c)
the back, and (d) at a 45-degree angle.
J. Generalization Testing
It is critical to check how the model works with real-
world data [31]. Even if the accuracy of a test dataset is
high, this does not guarantee that the model will work well
in every situation. There may be a generalization gap if the
model receives previously unseen data, such as different
backgrounds or different weather conditions. Three different
scenarios are used to verify the reliability of the model.
To start, 168 photos from the test dataset are randomly
selected such that the vehicles are seen only from the side,
where the car brand symbol is not visible. This results in 17
classes with a 76.19 % accuracy (see Table III). As a
reminder, the accuracy of the test on all data is 81.39 %,
which means that the side of the car is about 5 % less
predictable.
The second scenario combines both training and testing
sets from the Stanford Cars dataset. Because the data
contained brands that our model cannot recognize, those
classes were filtered out. This left 10 classes, the majority of which were
formed of Audi and BMW vehicles. A total of 4119 pictures
were acquired and an accuracy of 66.86 % was obtained.
In the third scenario, 12 photos are taken from a video
recorded in the evening with artificial lighting. Six vehicle
brand classes with two images each were identified with an
accuracy of 75.0 %. However, due to the tiny dataset, it is
advised to use more annotated video frames for validation.
TABLE III. COMPARISON OF CLASSIFICATION PERFORMANCES.
                     Side Car Images   Stanford Cars Dataset   Video Frames
Test Accuracy        76.19 %           66.86 %                 75.0 %
Test Loss            1.472             2.347                   0.928
Total Images         168               4119                    12
Number of Classes    17                10                      6
Generalization testing reveals that additional datasets
have a mean accuracy of 72.7 %, resulting in 8.7 % poorer
estimation than the initial test dataset (81.39 %). In general,
the model can be generalized, but the accuracy will not be
as good as for the same training distribution. As a result,
integrating diverse data distributions in the training phase
can help the model adapt to a broader range of scenarios.
V. DISCUSSION
In this paper, the collection of country-based datasets and
a strategy for vehicle localization and brand classification
are proposed. The data represent the most frequent cars on
the Lithuanian market and were balanced to avoid biases.
The data preparation stage was automated using the
YOLOv3 car detection model; however, the manual image
validation stage could be eliminated by optimizing the
model parameters and making it more conservative.
Analyzing the classifiers, it was discovered that
EfficientNetV2 is 7.14 % more accurate than MobileNetV2.
The 7th version of the classifier, consisting of three dense
layers, batch normalization, dropout, and L1 and L2
regularization, improved the most in accuracy. Compared to
the baseline classifier, the EfficientNetV2 7th classifier
improved its accuracy by 9 % (from 72.4 % to 81.4 %). This
was accomplished by classifying 19 automobile
manufacturer classes using a batch size of 16 and 38 epochs.
Furthermore, the created architecture outperforms the
maximum accuracy of the fully trained GoogLeNet
architecture of 80 % [9], although the performance cannot
be directly compared due to different datasets and specific
configurations.
After generalization tests using the EfficientNetV2 7th
classifier, the average accuracy of three different datasets
was 72.7 %. Compared to the accuracy of the original
dataset testing sample, the accuracy dropped by 8.7 %. The
reduction in accuracy is not significant, and some
fluctuation is expected. As observed from the results of the
Grad-CAM method, the model is trained to recognize the
brand of the car by the location of the logo, and if it is not
found, the car body, hood, or windows become an essential
predictor. On the other hand, mixing various data into a
training dataset can help strengthen model generalization
abilities.
The following are suggestions for future improvements:
More innovative augmentation techniques might be
proposed to increase generalization performance in
diverse weather and lighting settings. For example,
artificially manipulating photos to make them appear as
though they were taken at night, with a darker
background and light reflections on the car. In addition, it
would be useful to add more data from different sources,
representing images of a different nature, diverse quality,
viewpoints, and so on. As background clutter reduces
model performance [12], removing it improves vehicle
classification results [1]. By changing the eliminated
background with various representations of road pictures,
this technique might be utilized in augmentation.
As one study was able to classify the position of a car
with 100 % accuracy [16], it could be re-used effectively
to classify cars from different angles. The suggestion
would be to first classify the visible position of the car
and then, depending on the results, classify the features of
the car using several trained models. As a result, the
learnt parameters of the model would be unique for each
given viewpoint, although this requires labeled data on the
viewpoint of the vehicle.
The task could also be expanded to include other
vehicle attributes, such as vehicle type, make, model, or
color, so that the vehicle can be precisely identified. It
would also be beneficial to test the model using video
data.
VI. CONCLUSIONS
In this paper, a vehicle detection and classification system
was suggested and implemented, which was found to be
able to classify common local vehicle brands regardless of
the viewable angle. This is especially relevant as smart city
systems and the use of security and traffic monitoring
cameras become more widespread. To represent the
country's most popular car manufacturers, static images
from the Lithuanian car sales website were used. However,
this step of data collection has also become a barrier in the
system, as it requires manual validation, which should be
replaced by an automated and more efficient data labeling
solution. Effective deep learning architectures have been
investigated to make the model easier to use in real time and
to incorporate new car classes as production grows. The
proposed EfficientNetV2 architecture adjustment improves
the performance of the original classifier by 9 % and
achieves an acceptable classification score of 81.4 %.
However, to adjust the model to real-world conditions, data
from the environment in which it will be used should be
incorporated into the training. The findings imply that the
trained model can be used for urban vehicle monitoring, and
various improvements are proposed for future research.
CONFLICTS OF INTEREST
The authors declare that they have no conflicts of interest.
REFERENCES
[1] M. A. Manzoor, Y. Morgan, and A. Bais, “Real-time vehicle make and model recognition system”, Machine Learning and Knowledge Extraction, vol. 1, no. 2, pp. 611–629, 2019. DOI: 10.3390/make1020036.
[2] S. Benslimane, S. Tamayo, and A. de La Fortelle, “Classifying logistic vehicles in cities using deep learning”, in Proc. of 15th World Conference on Transport Research, 2019, vol. 33. arXiv: 1906.11895.
[3] M. A. Butt et al., “Convolutional neural network based vehicle classification in adverse illuminous conditions for intelligent transportation systems”, Complexity, vol. 2021, art. ID 6644861, 2021. DOI: 10.1155/2021/6644861.
[4] N. Yaras, “Vehicle type classification with deep learning”, M.S. thesis, İzmir Institute of Technology, İzmir, 2020.
[5] S. Naseer, S. M. A. Shah, S. Aziz, M. U. Khan, and K. Iqtidar, “Vehicle make and model recognition using deep transfer learning and support vector machines”, in Proc. of 2020 IEEE 23rd International Multitopic Conference (INMIC), 2020, pp. 1–6. DOI: 10.1109/INMIC50486.2020.9318063.
[6] A. B. Khalifa and H. Frigui, “A dataset for vehicle make and model recognition”, in Proc. of 3rd Workshop Fine-Grained Vis. Categorization (FGVC), 2015, pp. 1–2.
[7] S. Baghdadi and N. Aboutabit, “View-independent vehicle category classification system”, International Journal of Advanced Computer Science and Applications, vol. 12, no. 7, pp. 756–771, 2021. DOI: 10.14569/IJACSA.2021.0120786.
[8] B. Satar and A. E. Dirik, “Deep learning based vehicle make-model classification”, in Artificial Neural Networks and Machine Learning – ICANN 2018. Lecture Notes in Computer Science, vol. 11141. Springer, Cham, 2018. DOI: 10.1007/978-3-030-01424-7_53.
[9] D. Liu and Y. Wang, “Monza: Image classification of vehicle make and model using convolutional neural networks and transfer learning”, 2016. [Online]. Available: http://cs231n.stanford.edu/reports/2015/pdfs/lediurfinal.pdf
[10] “U.S. new car model market launches 2021 | Statista”. [Online]. Available: https://www.statista.com/statistics/200092/total-number-of-car-models-on-the-us-market-since-1990/
[11] M. M. Hasan, Z. Wang, M. A. I. Hussain, and K. Fatima, “Bangladeshi native vehicle classification based on transfer learning with deep convolutional neural network”, Sensors, vol. 21, no. 22, p. 7545, 2021. DOI: 10.3390/s21227545.
[12] S. Hashemi, H. Emami, and A. B. Sangar, “A new comparison framework to survey neural networks-based vehicle detection and classification approaches”, International Journal of Communication Systems, vol. 34, no. 14, p. e4928, Sep. 2021. DOI: 10.1002/dac.4928.
[13] M. Guerrieri and G. Parla, “Deep learning and YOLOv3 systems for automatic traffic data measurement by moving car observer technique”, Infrastructures, vol. 6, no. 9, p. 134, 2021. DOI: 10.3390/infrastructures6090134.
[14] C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning”, Journal of Big Data, vol. 6, no. 1, Dec. 2019. DOI: 10.1186/s40537-019-0197-0.
[15] C. Desai, “Image classification using transfer learning and deep learning”, International Journal of Engineering and Computer Science, vol. 10, no. 9, pp. 25394–25398, Sep. 2021. DOI: 10.18535/ijecs/v10i9.4622.
[16] S. Baghdadi and N. Aboutabit, “Transfer learning for classifying front and rear views of vehicles”, Journal of Physics: Conference Series, vol. 1743, no. 1, p. 012007, 2021. DOI: 10.1088/1742-6596/1743/1/012007.
[17] A. Gupta and M. Gupta, “Transfer learning for small and different datasets: Fine-tuning a pre-trained model affects performance”, Journal of Emerging Investigators, vol. 3, 2020.
[18] V. Taormina, D. Cascio, L. Abbene, and G. Raso, “Performance of fine-tuning convolutional neural networks for HEp-2 image classification”, Applied Sciences, vol. 10, no. 19, p. 6940, 2020. DOI: 10.3390/app10196940.
[19] Y. Gao and H. J. Lee, “Local tiled deep networks for recognition of vehicle make and model”, Sensors, vol. 16, no. 2, p. 226, 2016. DOI: 10.3390/s16020226.
[20] H. J. Lee, I. Ullah, W. Wan, Y. Gao, and Z. Fang, “Real-time vehicle make and model recognition with the residual SqueezeNet architecture”, Sensors, vol. 19, no. 5, p. 982, 2019. DOI: 10.3390/s19050982.
[21] F. Tafazzoli, “Vehicle make and model recognition for intelligent transportation monitoring and surveillance”, 2017. DOI: 10.18297/etd/2630.
[22] F. Xu, K. Han, Y. Ruan, and Y. Mao, “Car classification using neural networks”, pp. 1–6, 2000.
[23] H. Wang, Q. Xue, T. Cui, Y. Li, and H. Zeng, “Cold start problem of vehicle model recognition under cross-scenario based on transfer learning”, Computers, Materials and Continua, vol. 63, no. 1, pp. 337–351, 2020. DOI: 10.32604/cmc.2020.07290.
[24] F. Pereira dos Santos and M. Antonelli Ponti, “Features transfer learning for image and video recognition tasks”, in Proc. of Workshop de Teses e Dissertações – Conference on Graphics, Patterns and Images (SIBGRAPI), 2020, pp. 29–35. DOI: 10.5753/sibgrapi.est.2020.12980.
[25] Z. Qin, Z. Zhang, X. Chen, C. Wang, and Y. Peng, “FD-MobileNet: Improved MobileNet with a fast downsampling strategy”, in Proc. of 2018 25th IEEE International Conference on Image Processing (ICIP), 2018, pp. 1363–1367. DOI: 10.1109/ICIP.2018.8451355.
[26] M. Tan and Q. V. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks”, in Proc. of International Conference on Machine Learning, 2019. DOI: 10.48550/arXiv.1905.11946.
[27] M. Tan and Q. V. Le, “EfficientNetV2: Smaller models and faster training”, in Proc. of International Conference on Machine Learning, 2021. DOI: 10.48550/arXiv.2104.00298.
[28] T. Ridnik, E. Ben-Baruch, A. Noy, and L. Zelnik-Manor, “ImageNet-21K pretraining for the masses”, 2021. DOI: 10.48550/arXiv.2104.10972.
[29] Y. Kang, N. Cho, J. Yoon, S. Park, and J. Kim, “Transfer learning of a deep learning model for exploring tourists’ urban image using geotagged photos”, ISPRS International Journal of Geo-Information, vol. 10, no. 3, p. 137, 2021. DOI: 10.3390/ijgi10030137.
[30] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization”, International Journal of Computer Vision, vol. 128, pp. 336–359, 2020. DOI: 10.1007/s11263-019-01228-7.
[31] A. Harras, A. Tsuji, S. Karungaru, and K. Terada, “Enhanced vehicle classification using transfer learning and a novel duplication-based data augmentation technique”, International Journal of Innovative Computing, Information and Control, vol. 17, no. 6, pp. 2201–2216, 2021. DOI: 10.24507/ijicic.17.06.2201.
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution 4.0
(CC BY 4.0) license (http://creativecommons.org/licenses/by/4.0/).
... As defined by multiple authors [4], [5], [6], the vehicle reidentification task is to identify the same vehicle among different nonoverlapping cameras. The same vehicle can be defined by extracting specific vehicle attributes such as shape, color, and size, utilizing MMR (make, model recognition) [7] or ALPR (automatic license plate recognition) systems. An innovative solution for vehicle re-identification is the utilization of magnetic sensors. ...
Article
Full-text available
In Intelligent Transportation Systems the identification and tracking of vehicles play an important role in enhancing traffic management, security, and overall road safety. Traditional means for vehicle re-identification rely solely on video-based systems which are not resilient to harsh environment conditions, suffer from visual obstructions, and are facing other challenges. To address these shortcomings and provide a more robust solution, alternative methods can be employed. This study addresses the gap in vehicle re-identification accuracy under harsh environmental conditions and visual obstructions faced by traditional video-based systems by integrating magnetic sensors into the road surface. The essence of this study revolves around a comprehensive comparison of various algorithms employed for feature extraction from registered magnetic field distortions. These distortions are treated as transient time series and various distance metrics are applied to calculate their similarity. Useful features are extracted and their classification performance is compared using a single neighbor classifier also taking into account calculation time. The validation experiments demonstrate the efficacy of presented approach in extracting critical features that hold the potential for successfully re-identifying same vehicles. For tested subset up to 90 % re-identification accuracy can be reached. The main contribution of this work involves determining which magnetic sensor axis to use—whether single or in combination—and identifying the most effective methods for feature extraction from the registered magnetic field distortions.
... Usage of magnetic sensing in intelligent transportation systems (ITS) is rapidly gaining popularity [1][2][3] compared to more traditional video processing [4,5]. Various systems for road traffic surveillance largely utilize MEMS (micro-electromechanical systems) magnetometers for the purpose of vehicle detection and classification [6,7]. ...
Article
Full-text available
Intelligent transportation systems represent innovative solutions for traffic congestion minimization, mobility improvements and safety enhancement. These systems require various inputs about vehicles and traffic state. Vehicle re-identification systems based on video cameras are most popular; however, more strict privacy policy necessitates depersonalized vehicle re-identification systems. Promising research for depersonalized vehicle re-identification systems involves leveraging the captured unique distortions induced in the Earth’s magnetic field by passing vehicles. Employing anisotropic magneto-resistive sensors embedded in the road surface system captures vehicle magnetic signatures for similarity evaluation. A novel vehicle re-identification algorithm utilizing Euclidean distances and Pearson correlation coefficients is analyzed, and performance is evaluated. Initial processing is applied on registered magnetic signatures, useful features for decision making are extracted, different classification algorithms are applied and prediction accuracy is checked. The results demonstrate the effectiveness of our approach, achieving 97% accuracy in vehicle re-identification for a subset of 300 different vehicles passing the sensor a few times.
... Komolovaite et al. [33] then attempted to address the challenges in vehicle similarity appearance using the transfer-learning framework. They transferred the learned knowledge from pre-trained AlexNet to learn and classify the vehicles' front and rear views. ...
Article
Full-text available
The increase in security threats and a huge demand for smart transportation applications for vehicle identification and tracking with multiple non-overlapping cameras have gained a lot of attention. Moreover, extracting meaningful and semantic vehicle information has become an adventurous task, with frameworks deployed on different domains to scan features independently. Furthermore, approach identification and tracking processes have largely relied on one or two vehicle characteristics. They have managed to achieve a high detection quality rate and accuracy using Inception ResNet and pre-trained models but have had limitations on handling moving vehicle classes and were not suitable for real-time tracking. Additionally, the complexity and diverse characteristics of vehicles made the algorithms impossible to efficiently distinguish and match vehicle tracklets across non-overlapping cameras. Therefore, to disambiguate these features, we propose to implement a Ternion stream deep convolutional neural network (TSDCNN) over non-overlapping cameras and combine all key vehicle features such as shape, license plate number, and optical character recognition (OCR). Then jointly investigate the strategic analysis of visual vehicle information to find and identify vehicles in multiple non-overlapping views of algorithms. As a result, the proposed algorithm improved the recognition quality rate and recorded a remarkable overall performance, outperforming the current online state-of-the-art paradigm by 0.28% and 1.70%, respectively, on vehicle rear view (VRV) and Veri776 datasets.
Article
This research focuses on car classification and the use of the ResNet-50 neural network architecture to improve the accuracy and reliability of car detection systems. Indonesia, as one of the countries with high daily mobility, has a majority of the population using cars as the main mode of transportation. Along with the increasing use of cars in Indonesia, many automotive industries have built factories in this country, so the cars used are either local or imported. The importance of car classification in traffic management is a major concern, and vehicle make and model recognition plays an important role in traffic monitoring. This study uses the Vehicle images dataset which contains high-resolution images of cars taken from the highway with varying viewing angles and frame rates. This data is used to analyze the best- selling car brands and build car classifications based on output or categories that consumers are interested in. Digital image processing methods, machine learning, and artificial neural networks are used in the development of automatic and real-time car detection systems.The ResNet-50 architecture was chosen because of its ability to overcome performance degradation problems and study complex and abstract features from car images. Residual blocks in the ResNet architecture allow a direct flow of information from the input layer to the output layer, overcoming the performance degradation problem common in neural networks. In this paper, we explain the basic concepts of ResNet-50 in car detection and popular techniques such as optimization, augmentation, and learning rate to improve performance and accuracy. in this study, it is proved that ResNet has a fairly high accuracy of 95%, 92% precision, 93% recall, and 92% F1-Score.
Article
Full-text available
Vehicle make and model recognition (VMMR) is a crucial task for developing automatic vehicle recognition (AVR) systems, and has gained significant attention in the fields of computer vision and artificial intelligence in recent years. The ability to automatically identify a vehicle's make and model has numerous practical applications, such as traffic monitoring, vehicle re-identification, etc. This survey paper provides a comprehensive overview of the state-of-the-art techniques developed for VMMR problem. The survey begins with an introduction to the problem of AVR, followed by a discussion of the various factors that affect the accuracy of recognition, including lighting conditions, viewpoint variations, and occlusions. We then discuss a solution to this problem and provide an overview of the different approaches for VMMR, such as machine learning approaches and deep learning approaches. This survey also provides a comprehensive review of publicly available datasets that have been used for evaluating VMMR methods. Finally, the paper concludes with a discussion of some of the remaining challenges in VMMR, such as the need for large-scale datasets with more diverse vehicle models, the need for more robust methods that can handle variations in lighting and viewpoint, and the need for real-time methods that can operate in a variety of settings. This survey aims to serve as a valuable resource for researchers working in the field of computer vision that includes AVR.
Article
Full-text available
Recently, as computer vision and image processing technologies have rapidly advanced in the artificial intelligence (AI) field, deep learning technologies have been applied in the field of urban and regional study through transfer learning. In the tourism field, studies are emerging to analyze the tourists’ urban image by identifying the visual content of photos. However, previous studies have limitations in properly reflecting unique landscape, cultural characteristics, and traditional elements of the region that are prominent in tourism. With the purpose of going beyond these limitations of previous studies, we crawled 168,216 Flickr photos, created 75 scenes and 13 categories as a tourist’ photo classification by analyzing the characteristics of photos posted by tourists and developed a deep learning model by continuously re-training the Inception-v3 model. The final model shows high accuracy of 85.77% for the Top 1 and 95.69% for the Top 5. The final model was applied to the entire dataset to analyze the regions of attraction and the tourists’ urban image in Seoul. We found that tourists feel attracted to Seoul where the modern features such as skyscrapers and uniquely designed architectures and traditional features such as palaces and cultural elements are mixed together in the city. This work demonstrates a tourist photo classification suitable for local characteristics and the process of re-training a deep learning model to effectively classify a large volume of tourists’ photos.
Article
Full-text available
Vehicle type classification plays an essential role in developing an intelligent transportation system (ITS). Based on the modern accomplishments of deep learning (DL) on image classification, we proposed a model based on transfer learning, incorporating data augmentation, for the recognition and classification of Bangladeshi native vehicle types. An extensive dataset of Bangladeshi native vehicles, encompassing 10,440 images, was developed. Here, the images are categorized into 13 common vehicle classes in Bangladesh. The method utilized was a residual network (ResNet-50)-based model, with extra classification blocks added to improve performance. Here, vehicle type features were automatically extracted and categorized. While conducting the analysis, a variety of metrics was used for the evaluation, including accuracy, precision, recall, and F1 − Score. In spite of the changing physical properties of the vehicles, the proposed model achieved progressive accuracy. Our proposed method surpasses the existing baseline method as well as two pre-trained DL approaches, AlexNet and VGG-16. Based on result comparisons, we have seen that, in the classification of Bangladeshi native vehicle types, our suggested ResNet-50 pre-trained model achieves an accuracy of 98.00%.
Article
Full-text available
Deep learning models have demonstrated improved efficacy in image classification since the ImageNet Large Scale Visual Recognition Challenge started since 2010. Classification of images has further augmented in the field of computer vision with the dawn of transfer learning. To train a model on huge dataset demands huge computational resources and add a lot of cost to learning. Transfer learning allows to reduce on cost of learning and also help avoid reinventing the wheel. There are several pretrained models like VGG16, VGG19, ResNet50, Inceptionv3, EfficientNet etc which are widely used. This paper demonstrates image classification using pretrained deep neural network model VGG16 which is trained on images from ImageNet dataset. After obtaining the convolutional base model, a new deep neural network model is built on top of it for image classification based on fully connected network. This classifier will use features extracted from the convolutional base model.
Article
Full-text available
Macroscopic traffic flow variables estimation is of fundamental interest in the planning, designing and controlling of highway facilities. This article presents a novel automatic traffic data acquirement method, called MOM-DL, based on the moving observer method (MOM), deep learning and YOLOv3 algorithm. The proposed method is able to automatically detect vehicles in a traffic stream and estimate the traffic variables flow q, space mean speed vs. and vehicle density k for highways in stationary and homogeneous traffic conditions. The first application of the MOM-DL technique concerns a segment of an Italian highway. In the experiments, a survey vehicle equipped with a camera has been used. Using deep learning and YOLOv3 the vehicles detection and the counting processes have been carried out for the analyzed highway segment. The traffic flow variables have been calculated by the Wardrop relationships. The first results demonstrate that the MOM and MOM-DL methods are in good agreement with each other despite some errors arising with MOM-DL during the vehicle detection step due to a variety of reasons. However, the values of macroscopic traffic variables estimated by means of the Drakes’ traffic flow model together with the proposed method (MOM-DL) are very close to those obtained by the traditional one (MOM), being the maximum percentage variation less than 3%.
Article
Vehicle category classification is important, but it is a challenging task, especially when the vehicles are captured by a surveillance camera from different view angles. This paper aims to develop a view-independent vehicle category classification system. It proposes a two-phase system: one phase recognizes the view angle, helping the second phase recognize the vehicle category, including bus, car, motorcycle, and truck. In each phase, several descriptors and machine learning techniques, including traditional algorithms and deep neural networks, are employed. In particular, we used three descriptors, HOG (Histogram of Oriented Gradients), LBP (Local Binary Patterns), and Gabor filters, with two classifiers, SVM (Support Vector Machine) and k-NN (k-Nearest Neighbor). We also used a convolutional neural network (CNN, or ConvNet). Three experiments were conducted on several datasets. The first experiment is dedicated to choosing the best approach for recognizing the view: rear or front. The second experiment aims to classify the vehicle categories based on each view. In the third experiment, we developed the overall system, in which the categories were classified independently of the view. Experimental results reveal that CNN gives the highest recognition accuracy of 94.29% in the first experiment, and HOG with SVM or k-NN gives the best results (99.58%, 99.17%) in the second experiment. The system can robustly recognize vehicle categories with an accuracy of 95.77%.
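An illustrative HOG + SVM pipeline in the spirit of the second phase described above, sketched with scikit-image and scikit-learn; the descriptor parameters, image size, and label scheme are assumptions, and data loading is left as a placeholder.

```python
# Illustrative HOG + SVM category classifier; parameters and the label
# scheme are assumptions, and data loading is omitted.
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def hog_descriptor(image_gray):
    """Resize to a fixed shape and extract a HOG descriptor."""
    img = resize(image_gray, (128, 128))
    return hog(img, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

def train_hog_svm(images, labels):
    """images: grayscale arrays; labels: e.g., 0=bus, 1=car,
    2=motorcycle, 3=truck (label scheme assumed)."""
    X = np.array([hog_descriptor(im) for im in images])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.2, random_state=0)
    clf = SVC(kernel="rbf").fit(X_tr, y_tr)
    return clf, accuracy_score(y_te, clf.predict(X_te))
```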
Article
The vehicle detection and classification (VDC) problem has received much attention recently due to increased security threats and the need to develop intelligent transportation systems. A large number of approaches have been proposed for the VDC problem using neural networks. To determine how neural network-based approaches to VDC have developed in recent years, this paper surveys the VDC approaches through a literature review covering January 2012 through April 2021. To do this, we introduce a new comparison framework to classify and compare the VDC approaches. Our proposed framework is composed of nine comparison dimensions: input data type, vehicle type, scale, scope, dynamicity, vehicle detection method, vehicle classification method, application, and evaluation method. Next, using the proposed framework, we discuss the evolution of the VDC approaches and identify several open issues that have emerged in the field. This paper provides a guide for researchers to use or design robust VDC systems with proper characteristics based on their needs.
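The nine dimensions read naturally as a record schema for cataloguing reviewed approaches. The field names below follow the abstract; the commented example values are assumptions.

```python
# The survey's nine comparison dimensions expressed as a simple record
# type; field names follow the abstract, example values are assumptions.
from dataclasses import dataclass

@dataclass
class VDCApproach:
    input_data_type: str                # e.g., image, video
    vehicle_type: str                   # e.g., car, truck, bus
    scale: str                          # e.g., number of classes handled
    scope: str                          # e.g., detection, classification, both
    dynamicity: str                     # e.g., static images vs. moving traffic
    vehicle_detection_method: str       # e.g., YOLO, Faster R-CNN
    vehicle_classification_method: str  # e.g., CNN, SVM
    application: str                    # e.g., tolling, surveillance
    evaluation_method: str              # e.g., accuracy, mAP
```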
Preprint
ImageNet-1K serves as the primary dataset for pretraining deep learning models for computer vision tasks. The ImageNet-21K dataset, which contains more pictures and classes, is used less frequently for pretraining, mainly due to its complexity and an underestimation of its added value compared to standard ImageNet-1K pretraining. This paper aims to close this gap and make high-quality, efficient pretraining on ImageNet-21K available for everyone. Via a dedicated preprocessing stage, utilizing WordNet hierarchies, and a novel training scheme called semantic softmax, we show that various models, including small mobile-oriented models, significantly benefit from ImageNet-21K pretraining on numerous datasets and tasks. We also show that we outperform previous ImageNet-21K pretraining schemes for prominent new models like ViT. Our proposed pretraining pipeline is efficient, accessible, and leads to SoTA reproducible results from a publicly available dataset. The training code and pretrained models are available at: https://github.com/Alibaba-MIIL/ImageNet21K
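In practice, one accessible way to consume such ImageNet-21K pretrained checkpoints is through the timm library; the model name below is an assumption and must match a checkpoint actually available in the installed timm version.

```python
# Hypothetical usage sketch: load an ImageNet-21K pretrained backbone
# via timm and re-head it for a downstream task. The model name is an
# assumption; it must exist in your installed timm version.
import timm

# num_classes replaces the 21K-class head with a fresh 19-class head
model = timm.create_model("vit_base_patch16_224_in21k",
                          pretrained=True, num_classes=19)
```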
Article
In step with rapid advancements in computer vision, vehicle classification demonstrates considerable potential to reshape intelligent transportation systems. Over the last couple of decades, image processing and pattern recognition-based vehicle classification systems have been used to improve the effectiveness of automated highway toll collection and traffic monitoring systems. However, these methods are trained on limited handcrafted features extracted from small datasets, which do not cater to real-time road traffic conditions. Deep learning-based classification systems have been proposed to address these issues in traditional methods. However, convolutional neural networks require piles of data, including noise, weather, and illumination factors, to ensure robustness in real-time applications. Moreover, no generalized dataset is available to validate the efficacy of vehicle classification systems. To overcome these issues, we propose a convolutional neural network-based vehicle classification system to improve the robustness of vehicle classification in real-time applications. We present a vehicle dataset comprising 10,000 images categorized into six common vehicle classes, considering adverse illumination conditions, to achieve robustness in real-time vehicle classification systems. Initially, pretrained AlexNet, GoogleNet, Inception-v3, VGG, and ResNet are fine-tuned on the self-constructed vehicle dataset to evaluate their performance in terms of accuracy and convergence. Based on its better performance, the ResNet architecture is further improved by adding a new classification block to the network. To ensure generalization, we fine-tuned the network on the public VeRi dataset containing 50,000 images categorized into six vehicle classes. Finally, a comparison study has been carried out between the proposed and existing vehicle classification methods to evaluate the effectiveness of the proposed vehicle classification system. Consequently, our proposed system achieved 99.68%, 99.65%, and 99.56% accuracy, precision, and F1-score on our self-constructed dataset.
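The "new classification block" idea can be sketched as replacing a pretrained ResNet-50's final layer with a small block, as below. This is an assumed PyTorch rendering; the block sizes are chosen for illustration, with the six-class output following the abstract.

```python
# Assumed PyTorch rendering of the "new classification block" idea:
# replace a pretrained ResNet-50 head with a small block. Block sizes
# are illustrative; the six-class output follows the abstract.
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = nn.Sequential(              # swap out the original head
    nn.Linear(backbone.fc.in_features, 512),
    nn.ReLU(inplace=True),
    nn.Dropout(0.5),
    nn.Linear(512, 6),                    # six vehicle classes
)
```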
Article
Machine learning and deep learning algorithms are rapidly becoming integrated into everyday life, whether in the face ID that unlocks your phone or in the detection of deadly diseases like melanoma. Neural networks have traditionally been designed to work in isolation to achieve tasks once thought impossible for computers. However, these algorithms are trained to solve extremely specific tasks, and models have to be rebuilt from scratch once the source and target domains or the required task change. Transfer learning is a field that leverages the learnings and weights from one task for related tasks. This process is quite smooth if one has enough data and the new task is similar to the previously learned task; however, research on cases where these two conditions are not met is scarce. The purpose of this research is to investigate how fine-tuning a pre-trained image classification model affects accuracy on a binary image classification task. Image classification is widely used, and when only a small dataset is available, transfer learning becomes an important asset. Convolutional neural networks and the VGG-16 model trained on ImageNet are used. Through this study, I investigate whether there are specific trends in how fine-tuning affects accuracy for a small dataset that is dissimilar to ImageNet. This is a first step toward quantifiable methods for training a model when using transfer learning techniques.
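A sketch of the fine-tuning setup under study: unfreeze the top k layers of an ImageNet-pretrained VGG16 for a binary task. The value of k and the learning rate are assumptions here, and exactly the kind of knobs such an investigation varies.

```python
# Sketch of fine-tuning an ImageNet-pretrained VGG16 for a binary task,
# unfreezing only the top k layers; k and the learning rate are assumed.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_finetune_model(unfreeze_last_k=4):
    base = tf.keras.applications.VGG16(
        weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    base.trainable = True
    for layer in base.layers[:-unfreeze_last_k]:
        layer.trainable = False  # keep all but the last k layers frozen

    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(1, activation="sigmoid"),  # binary output
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),  # small LR
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```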
Conference Paper
Feature transfer learning aims to reuse knowledge previously acquired on some source dataset and apply it to another target dataset and/or task. A prerequisite for this transfer is the quality of the feature spaces obtained, for which deep learning methods are widely applied since they provide discriminative and general descriptors. In this context, the main questions are: what to transfer, how to transfer, and when to transfer. We address these questions through distinct learning paradigms, transfer learning techniques, and several datasets and tasks. Our contributions are: an analysis of multiple descriptors contained in supervised deep networks; a new generalization metric that can be applied to any model and evaluation system; and a new architecture with a loss function for semi-supervised deep networks, in which all available data contribute to the learning.
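As a concrete, assumed rendering of extracting descriptors at multiple depths of a supervised network, the following Keras snippet builds an extractor over several VGG16 stages; the choice of backbone and layers is an illustration, not the paper's architecture.

```python
# Assumed illustration of extracting descriptors at several depths of a
# supervised network; the VGG16 backbone and layer choice are assumptions.
import tensorflow as tf

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False)
layer_names = ["block3_pool", "block4_pool", "block5_pool"]
extractor = tf.keras.Model(
    inputs=base.input,
    outputs=[base.get_layer(name).output for name in layer_names])
# extractor(images) returns one feature map per chosen depth, which can
# then feed a downstream classifier or a generalization analysis.
```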