Vehicle re-identification with multiple discriminative features based on non-local-attention block
Lu Bai1 & Leilei Rong2
Vehicle re-identification (re-id) technology matches vehicles across non-overlapping camera views, that is, it confirms whether vehicle targets captured by cameras at different positions and at different times are the same vehicle. Different identities of the same type of vehicle are one of the most challenging factors in the field of vehicle re-identification. The key to solving this difficulty is to make full use of the multiple discriminative features of vehicles. Therefore, this paper proposes a multiple discriminative features extraction network (MDFE-Net) that uses non-local attention to strengthen the long-range dependence among a vehicle's multiple discriminative features, which in turn enhances the discriminative power of the network. Meanwhile, to represent the retrieval capability of the model more directly and make model evaluation more rigorous, we introduce a novel vehicle re-id evaluation metric called mean positive sample occupancy (mPSO). Comprehensive experiments on challenging vehicle evaluation datasets (including VeRi-776, VRIC, and VehicleID) show that our model robustly achieves state-of-the-art performance. Moreover, our novel metric mPSO further proves the powerful retrieval capability of MDFE-Net.
Keywords Vehicle re-identification, Multiple discriminative features, Non-local attention, mPSO
As a significant means of transportation in people's daily lives, vehicles play an extremely important role in modern transportation systems. The task of vehicle re-identification is to identify and retrieve the same vehicle under different cameras, so it is also called cross-camera tracking.
In the research field of re-identification in non-overlapping domains, the main research objects are pedestrians1,2 and vehicles3,4. Vehicle re-id is more challenging than pedestrian re-id for the following reasons: (1) vehicles show extreme perspective changes across 360-degree shooting angles; (2) due to limited vehicle types and colors, there are few fine-grained features on the vehicle body. In recent years, research on improving model performance in the field of vehicle re-identification has followed two paths: first, extracting more discriminative vehicle appearance features by designing new network models; and second, creating more effective model loss functions. Specifically, Wei et al. design a hierarchical attention model based on a recurrent neural network for vehicle re-id5. The hierarchical model is first used to establish the dependency relationships among vehicle features, and then the attention model is used to extract more subtle vehicle features. Guo et al. propose a two-level network composed of a rigid block attention module and a soft pixel-level attention module, which is adapted to extract highly differentiated vehicle features6. Lou et al. use generative adversarial networks to generate difficult samples for vehicle re-id model training7. Yan et al. propose a multi-task deep learning framework that uses multi-dimensional information to complete vehicle classification and similarity ranking, so as to identify the same vehicle8. Liu et al. improve the traditional triplet loss and propose a pair cluster loss function that pulls images of the same vehicle closer together9. Zhang et al. propose a triplet loss based on classification invariance and design a triplet sampling method based on paired images to better train the proposed triplet loss and strengthen the constraint on the same class of vehicle images10.
According to research findings, different identities of the same type of vehicle (intra-class similarity) are one of the most challenging factors in the field of vehicle re-identification. For example, as shown in Fig. 1, we enumerate four pairs of cases with the same vehicle type (SUV, cab, bus, and truck) but different vehicle identities. As can be seen from Fig. 1, the personalized characteristics of the vehicles (such as annual inspection marks and ornaments) are the key to distinguishing different vehicles. To deal with the problem of intra-class similarity, we propose a multiple discriminative features extraction network, which combines multiple personalized details of vehicles via a non-local-attention mechanism11.
1Shandong Maritime Vocation College, Weifang 261108, China. 2Weifang Education Investment Group Co., Ltd.,
Weifang 261108, China. email: rll5721@126.com
Scientific Reports (2024) 14:31386 | https://doi.org/10.1038/s41598-024-82755-3
Our main contributions in this paper are summarized as follows:
• A new deep neural network, MDFE-Net, is proposed to discover and capture more discriminative vehicle appearance features through non-local-attention blocks. We choose ResNet50 as the backbone network and embed three non-local-attention (NLA) blocks in ResNet50. NLA can enhance the long-range dependence among multiple discriminative aspects of vehicles, adaptively select non-local correlated regions within the image, mitigate interference factors, and ultimately improve the quality of feature reconstruction.
• A new re-identification model evaluation metric, mPSO, is introduced. Unlike mAP and Rank-1, this metric assesses the model's ability to identify all positive samples during the re-identification process. A higher mPSO value indicates that the model needs fewer retrievals to identify all positive samples, while a lower value indicates more. This directly reflects the 'retrieval performance' of the re-identification technology. Most importantly, this is the first application of such a metric in the field of vehicle re-id.
• In our approach, besides utilizing center loss12 and Softmax cross-entropy loss13, we also incorporate the Weighted Regularized Triplet (WRT) loss. By learning similarity metrics within a high-dimensional embedding space, representations of similar objects tend to cluster together, whereas those of dissimilar objects remain distant from each other. This effectively optimizes the distance between positive and negative samples. Additionally, WRT prevents model overfitting and bolsters its generalization capability through the introduction of regularization terms.
• Extensive experiments on three benchmark datasets (VeRi-77614, VRIC15 and VehicleID16) demonstrate the superiority of our proposed approach.
Our study is organized as follows: Section “Methods” presents the framework of our proposed MDFE-Net with implementation details. Section “Experiments results and discussion” conducts various experiments to verify the effectiveness of our model MDFE-Net. Finally, this paper is concluded and future work is proposed in Section “Conclusion”.
Methods
Baseline and multiple discriminative features extraction network
From Fig. 2a, we can see that our baseline consists of a ResNet5017 backbone, Global Average Pooling (GAP), a batch normalization layer, a fully connected (FC) layer and the Softmax cross-entropy loss. The multiple discriminative features extraction network is shown in Fig. 2b. Based on the backbone network, three non-local-attention (NLA) blocks are inserted after conv3_4, conv4_5, and conv4_6 respectively. The NLA can achieve information interaction between any two locations, not limited to adjacent points, and thus can retain more information. In the training phase, we first calculate the center loss and WRT loss on the output features after GAP. Finally, the total loss of the model is given by the weighted sum of the center loss, WRT loss, and Softmax cross-entropy loss.
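To make this architecture concrete, the following is a minimal PyTorch sketch of how the NLA blocks could be embedded in a ResNet50 backbone at roughly the positions described above (after conv3_4, conv4_5 and conv4_6), followed by the GAP, batch normalization and FC head of the baseline. This is a sketch under stated assumptions, not the authors' implementation: the class names, the use of torchvision's resnet50, and the exact wiring are illustrative, and NonLocalBlock refers to the module sketched in the next subsection.

```python
# Hedged sketch: embedding non-local-attention (NLA) blocks into a ResNet50
# backbone roughly where the paper places them, followed by the GAP -> BN -> FC
# head of the baseline. NonLocalBlock is the module sketched in the next
# subsection; names and wiring are illustrative, not the authors' exact code.
import torch.nn as nn
from torchvision.models import resnet50


class MDFENetSketch(nn.Module):
    def __init__(self, num_ids: int):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")
        # stem + stage1 (conv2_x)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  backbone.maxpool, backbone.layer1)
        # stage2 (conv3_x, 4 bottlenecks, 512 output channels): NLA after conv3_4
        self.stage2 = nn.Sequential(backbone.layer2, NonLocalBlock(512))
        # stage3 (conv4_x, 6 bottlenecks, 1024 output channels): NLA after conv4_5 and conv4_6
        blocks = list(backbone.layer3.children())
        self.stage3 = nn.Sequential(*blocks[:5], NonLocalBlock(1024),
                                    blocks[5], NonLocalBlock(1024))
        self.stage4 = backbone.layer4                    # conv5_x, 2048 channels
        self.gap = nn.AdaptiveAvgPool2d(1)               # Global Average Pooling
        self.bn = nn.BatchNorm1d(2048)                   # batch normalization layer
        self.fc = nn.Linear(2048, num_ids, bias=False)   # identity classifier

    def forward(self, x):
        x = self.stage4(self.stage3(self.stage2(self.stem(x))))
        feat = self.gap(x).flatten(1)     # features used by the center / WRT losses
        logits = self.fc(self.bn(feat))   # logits used by the Softmax cross-entropy loss
        return feat, logits
```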
Non-local-attention block
Fig. 1. Vehicle intra-class similarity problem. Four pairs of vehicles share the same exterior appearance but differ in their identities.

In recent years, the attention mechanism has been widely used in person re-identification research, but it has received less attention in the field of vehicle re-identification. Moreover, the state-of-the-art models that achieve the best performance on each person re-id dataset all adopt attention mechanisms. Hence, the attention mechanism is an essential component of a discriminative re-identification model. The attention mechanism aims to capture the relationships between different convolution channels, multiple feature maps, different attributes/areas of the vehicle body, and even multiple images.
In a word, the attention mechanism gives higher weight to discriminative/personalized vehicle features, which are then incorporated to enhance the feature learning ability of the network. Drawing on spatial and channel attention, several researchers have developed various styles of attention mechanisms to enhance the performance of vehicle re-identification within neural networks. Zhu et al.18 developed a dual self-attention module, comprising static self-attention and cross-region attention, to effectively capture diverse regional dependencies and address the challenges posed by high inter-class similarity and significant intra-class variation among vehicles. Lee et al.19 designed a Multi-Attention Soft Partition (MUSP) network, which employs multiple soft attention mechanisms in both the spatial and channel directions. This network is capable of learning distinct features from various discriminative regions and viewpoints, without the need for any artificial attention branches that are specific to local regions or dependent on specific views. Pang et al.20 proposed a global relational attention mechanism that integrates global dependencies to enhance the network's ability to discriminate personalized vehicle features and reduce computational complexity. Yu et al.21 constructed a Multi-Attention Guided Feature Enhancement Network (MAFEN) to learn the spatial structure information and channel dependence of multi-receptive-field features, and embedded them to enhance feature extraction performance.
In order to associate different vehicle attributes with personalized features, we adopt non-local-attention blocks to obtain a weighted sum of all discriminative/personalized features of the vehicle appearance, represented by

$$z_i = W_z \times \phi(x_i) + x_i \qquad (1)$$
Fig. 2. Illustration of the baseline and multiple discriminative features extraction network.
Here $i$ is the index of the output location whose response is to be computed, $W_z$ is a weight matrix to be learned, $\phi(\cdot)$ represents a non-local operation, and "$+\,x_i$" denotes a residual learning strategy13,17. Details can be found in Fig. 3.
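As a concrete reading of Eq. (1) and Fig. 3, the following is a minimal PyTorch sketch of a non-local-attention block with three 1 × 1 convolutions, a channel scaling factor r, and a residual connection. The softmax normalization of the pairwise-affinity matrix follows the embedded-Gaussian form of the non-local block11 and is an assumption here, as are the module and variable names.

```python
# Hedged sketch of the non-local-attention (NLA) block of Eq. (1) / Fig. 3:
# three 1x1 convolutions reduce the input (H x W x C) to C/r channels, two
# matrix multiplications produce the weighted feature map, and a final 1x1
# convolution (W_z) restores C channels before the residual addition "+ x_i".
# The softmax over the affinity matrix is an assumed detail from Wang et al. [11].
import torch.nn as nn
import torch.nn.functional as F


class NonLocalBlock(nn.Module):
    def __init__(self, in_channels: int, r: int = 8):
        super().__init__()
        self.inter = max(in_channels // r, 1)               # reduced channel number C/r
        self.theta = nn.Conv2d(in_channels, self.inter, 1)  # first 1x1 convolution
        self.phi = nn.Conv2d(in_channels, self.inter, 1)    # second 1x1 convolution
        self.g = nn.Conv2d(in_channels, self.inter, 1)      # third 1x1 convolution
        self.w_z = nn.Conv2d(self.inter, in_channels, 1)    # W_z: back to C channels
        nn.init.zeros_(self.w_z.weight)                     # start as an identity mapping
        nn.init.zeros_(self.w_z.bias)

    def forward(self, x):
        n, c, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)    # (N, HW, C/r)
        phi = self.phi(x).flatten(2)                        # (N, C/r, HW)
        g = self.g(x).flatten(2).transpose(1, 2)            # (N, HW, C/r)
        attn = F.softmax(theta @ phi, dim=-1)               # pairwise affinities (N, HW, HW)
        y = (attn @ g).transpose(1, 2).reshape(n, self.inter, h, w)
        return self.w_z(y) + x                              # z_i = W_z * phi(x_i) + x_i
```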
Loss functions design
From Fig. 2, we can see that our network applies three loss functions (Softmax cross-entropy loss, weighted regularization triplet (WRT) loss, and center loss) to optimize our model. The Softmax cross-entropy loss and center loss12 are formulated as follows:
$$L_{Softmax} = -\sum_{i=1}^{N_i} \log \frac{\exp(x_y)}{\sum_{j=1}^{N_{id}} \exp(x_j)} \qquad (2)$$

$$L_{center} = \frac{1}{2} \sum_{i=1}^{m} \| f_i - c_{y_i} \|_2^2 \qquad (3)$$
Here $N_i$ and $N_{id}$ respectively represent the number of vehicle images in the mini-batch and the number of vehicle identities in the whole training dataset. $y$ is the ground-truth identity of the input vehicle image and $x_j$ denotes the output of the fully connected layer for the $j$-th identity. $c_{y_i}$ represents the feature center of the $y_i$-th category, $f_i$ represents the feature before the fully connected layer, and $m$ represents the size of the mini-batch.
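For illustration, a minimal PyTorch sketch of Eqs. (2) and (3) is given below. Keeping the class centers as a learnable parameter table, and averaging the center loss over the mini-batch, are common implementation choices assumed here rather than details taken from the paper.

```python
# Hedged sketch of the Softmax cross-entropy loss (Eq. 2) and the center loss
# (Eq. 3). The learnable parameter table for the class centers c_{y_i} is an
# assumed implementation choice; in practice the centers are often optimized
# with their own learning rate.
import torch
import torch.nn as nn


class CenterLoss(nn.Module):
    def __init__(self, num_ids: int, feat_dim: int = 2048):
        super().__init__()
        # one feature center per vehicle identity
        self.centers = nn.Parameter(torch.randn(num_ids, feat_dim))

    def forward(self, feats, labels):
        # L_center = 1/2 * sum_i ||f_i - c_{y_i}||_2^2 (averaged over the mini-batch here)
        diff = feats - self.centers[labels]
        return 0.5 * diff.pow(2).sum(dim=1).mean()


softmax_ce = nn.CrossEntropyLoss()      # Eq. (2): cross-entropy over the FC logits
center_loss = CenterLoss(num_ids=576)   # e.g. the 576 training identities of VeRi-776
```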
The weighted regularization triplet (WRT) loss13 retains the advantage of triplet loss in optimizing the relative distance between positive and negative pairs, while avoiding the introduction of any additional margin parameters. The WRT loss function is formulated as follows:
$$L_{wrt}(i) = \log \left\{ 1 + \exp \left( \sum_{j} w_{ij}^{p} d_{ij}^{p} - \sum_{k} w_{ik}^{n} d_{ik}^{n} \right) \right\} \qquad (4)$$

$$w_{ij}^{p} = \frac{\exp(d_{ij}^{p})}{\sum_{d_{ij}^{p} \in P_i} \exp(d_{ij}^{p})}, \qquad w_{ik}^{n} = \frac{\exp(-d_{ik}^{n})}{\sum_{d_{ik}^{n} \in N_i} \exp(-d_{ik}^{n})} \qquad (5)$$
Fig. 3. Illustration of the non-local-attention block. (Best viewed in color). First, three 1 × 1 convolution operations are performed on the input feature map matrix (H × W × C, blue block) in parallel to obtain dimensionality-reduced feature map matrices (H × W × C/r, grey blocks); these feature map matrices are then multiplied twice to obtain the weighted feature map matrix (H × W × C/r, light orange block). Then, the feature map dimension is restored through a 1 × 1 convolution to obtain a weighted feature map (H × W × C, purple block) of the same dimension as the input, and finally the input feature matrix and the weighted feature matrix are added to obtain the output feature map matrix (H × W × C, blue-purple block). Here C = 2048 and r = 8 denote the channel number and the channel scaling factor, respectively.
Here $(i, j, k)$ represents a hard triplet within each training batch. For anchor $i$, $P_i$ and $N_i$ are the corresponding positive set and negative set, respectively. $d_{ij}^{p}$ / $d_{ik}^{n}$ denotes the pairwise distance of a positive/negative sample pair. Details are shown in Fig. 4.
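The following is a minimal PyTorch sketch of Eqs. (4)–(5). It assumes Euclidean distances between the pooled features and an identity-balanced P × K batch sampler, so that every anchor has at least one positive and one negative in the batch; these are assumptions, not details stated in the paper.

```python
# Hedged sketch of the weighted regularization triplet (WRT) loss of Eqs. (4)-(5).
# Distances are Euclidean distances between pooled features; positive pairs share
# the anchor's identity (self-pairs excluded), negative pairs do not. Assumes a
# P x K identity-balanced sampler so every anchor has positives and negatives.
import torch
import torch.nn.functional as F


def wrt_loss(feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    dist = torch.cdist(feats, feats)                      # pairwise distance matrix
    n = dist.size(0)
    same_id = labels.unsqueeze(0).eq(labels.unsqueeze(1))
    is_pos = same_id & ~torch.eye(n, dtype=torch.bool, device=feats.device)
    is_neg = ~same_id

    # w^p_ij: softmax of d^p_ij over the positive set P_i (farther positives weigh more)
    w_pos = torch.softmax(dist.masked_fill(~is_pos, float("-inf")), dim=1)
    # w^n_ik: softmax of -d^n_ik over the negative set N_i (closer negatives weigh more)
    w_neg = torch.softmax((-dist).masked_fill(~is_neg, float("-inf")), dim=1)

    pos_term = (w_pos * dist).sum(dim=1)   # sum_j w^p_ij d^p_ij (weights are 0 outside P_i)
    neg_term = (w_neg * dist).sum(dim=1)   # sum_k w^n_ik d^n_ik (weights are 0 outside N_i)
    # L_wrt(i) = log(1 + exp(pos_term - neg_term)), averaged over anchors
    return F.softplus(pos_term - neg_term).mean()
```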
The total loss, which combines the Softmax cross-entropy loss, weighted regularization triplet (WRT) loss, and center loss, is used to train our proposed method in an end-to-end manner.
$$L_{total} = \alpha L_{softmax} + \beta L_{center} + \gamma L_{wrt} \qquad (6)$$

Here the parameters $\alpha = 1$, $\beta = 0.0005$, and $\gamma = 1$ balance the contributions of the three loss functions.
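Assuming the loss sketches given above, the total objective of Eq. (6) can be assembled as follows.

```python
# Hedged sketch of the total loss of Eq. (6), reusing the sketched loss terms.
alpha, beta, gamma = 1.0, 0.0005, 1.0   # weights stated in the paper

def total_loss(feats, logits, labels):
    return (alpha * softmax_ce(logits, labels)
            + beta * center_loss(feats, labels)
            + gamma * wrt_loss(feats, labels))
```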
Experiments results and discussion
Datasets and evaluation metrics
To validate the superiority and effectiveness of the proposed multiple discriminative features extraction network, we conduct extensive experiments on three public benchmarks for vehicle re-id, namely VeRi-776, VRIC and VehicleID. The vehicle distribution of the three datasets is shown in Table 1.
Vehicle re-identification is a sub-task of image retrieval: the higher the ranking of correct vehicles in the retrieval results, the better the model's retrieval performance. Previously, Rank1 accuracy and mAP (mean Average Precision) were the most popular model evaluation metrics in the field of re-identification. Rank1 accuracy solely evaluates the correctness of the top-ranked prediction, without taking into account the predictive accuracy of the remaining positions. On imbalanced data, when the proportion of a certain class of samples is very low, Rank1 accuracy may suggest that the model performs well even though its predictions for minority classes are poor. In practical applications, because re-identification systems usually return a list of query results for manual filtering, a good re-identification system should rank all correct matches as close to the front of the list as possible. The mAP metric does not emphasize this point during evaluation, so a high mAP score can still be obtained even if some correctly matched results are ranked low in the matching list, which is not in line with practical requirements. The example in Fig. 5 shows that a high AP (Average Precision) does not necessarily imply an optimal retrieval result. To address the limitations of the aforementioned evaluation metrics, we introduce a novel re-identification evaluation metric, mPSO (mean positive sample occupancy), designed to align directly with the requirements of re-identification technology in practical applications.
As shown in Fig. 5, assuming that there are only four positive samples, model 1 needs ten retrievals to find all of them, while model 2 needs only six. Therefore, model 2 has better retrieval ability than model 1. If only AP is used to evaluate the models, model 1 appears better than model 2, which is contrary to the real situation. Therefore, PSO reflects the performance of a re-id model better than AP.
$$PSO_i = \frac{G_i}{X_i}, \quad i = 1, 2, \ldots, Q \qquad (7)$$
Dataset               Images/ID        Train/ID         Query/ID     Gallery/ID
VeRi-776              51,035/776       37,778/576       1678/200     11,579/200
VRIC                  60,430/5622      54,808/2811      2811/2811    2811/2811
VehicleID (Test800)   221,763/26,267   110,178/13,134   6532/800     800/800
VehicleID (Test1600)  221,763/26,267   110,178/13,134   11,395/1600  1600/1600
VehicleID (Test2400)  221,763/26,267   110,178/13,134   17,638/2400  2400/2400
Table 1. Vehicle distribution in the three datasets; the three VehicleID test sets share the same image pool and training split.
Fig. 4. Weighted regularization triplet (WRT) loss.
$$mPSO = \frac{1}{Q} \sum_{i=1}^{Q} \frac{G_i}{X_i} \qquad (8)$$
In vehicle re-id, $G_i$ and $X_i$ are the number of target vehicles in the retrieval results and the number of retrievals needed to reach the last target vehicle, respectively; mPSO is the mean occupancy rate of the target vehicles in the search results over the $Q$ queries.
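To make the new metric concrete, the following sketch computes PSO and mPSO from Eqs. (7)–(8), given each query's ranked gallery list as a boolean match vector. The input format is an assumed convention, and the two example orderings are only illustrative arrangements consistent with the Fig. 5 description (four positives, last one found at rank 10 for model 1 and at rank 6 for model 2).

```python
# Hedged sketch of the proposed mPSO metric (Eqs. 7-8). Each query is described
# by its ranked gallery list as a boolean vector: True where the retrieved image
# is a true match. G_i is the number of true matches, X_i the rank position of
# the last true match. The boolean-list input format is an assumed convention.
from typing import List, Sequence


def pso(matches: Sequence[bool]) -> float:
    hits = [rank for rank, m in enumerate(matches, start=1) if m]
    if not hits:
        return 0.0              # no positive sample retrieved at all
    g = len(hits)               # G_i: number of target vehicles found
    x = hits[-1]                # X_i: retrievals needed to reach the last one
    return g / x


def mpso(all_matches: List[Sequence[bool]]) -> float:
    return sum(pso(m) for m in all_matches) / len(all_matches)


# Illustrative orderings consistent with Fig. 5: both models have 4 positives;
# model 1 finds the last one at rank 10 (PSO = 0.4), model 2 at rank 6 (PSO = 0.67).
print(pso([1, 1, 0, 1, 0, 0, 0, 0, 0, 1]))   # 0.4
print(pso([1, 0, 1, 1, 0, 1, 0, 0, 0, 0]))   # 0.666...
```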
Implementation details
In this paper, all vehicle images are resized to 256 × 256. We adopt the PyTorch framework to train MDFE-Net on a server with Deepin V20 Linux and one NVIDIA GeForce RTX 3080 GPU. The batch sizes for training and testing are set to 32 and 128, respectively. During the training stage, we choose Adam as the model optimizer and adopt a linear warm-up strategy to gradually change the learning rate. The initial learning rate is 0.00035 and is decayed by a factor of 0.1 at the 10th and 50th epochs, respectively. The whole training stage lasts for 90 epochs.
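For reference, a minimal sketch of this training setup is shown below, assuming the MDFENetSketch and total_loss definitions from the earlier sketches. The warm-up length (10 epochs here) is an assumption, since the paper only states that a linear warm-up strategy is used.

```python
# Hedged sketch of the training schedule described above: Adam with base
# learning rate 3.5e-4, decayed by 0.1 at epochs 10 and 50, combined with a
# linear warm-up whose length (10 epochs) is an assumed choice.
import torch

model = MDFENetSketch(num_ids=576)          # sketch from the Methods section
optimizer = torch.optim.Adam(model.parameters(), lr=3.5e-4)

warmup_epochs = 10
def lr_factor(epoch: int) -> float:
    warmup = min(1.0, (epoch + 1) / warmup_epochs)     # linear warm-up
    decay = 0.1 ** sum(epoch >= m for m in (10, 50))   # step decay at epochs 10 and 50
    return warmup * decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(90):                     # whole training stage: 90 epochs
    # ... forward/backward passes over the training loader, minimizing
    # total_loss(feats, logits, labels) with optimizer.step() ...
    scheduler.step()
```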
Ablation study
We use Xception, Inceptionv4, Densenet169, ResNet50, Shufflenetv2, Squeezenet1-1, and Mobilenetv2, all pretrained on ImageNet, as candidate backbone networks. Without adding other network structures, training and testing are carried out on the VeRi-776 dataset. The experimental results are shown in Table 2. Comparing mAP, mPSO and Rank1, we can see that ResNet50 performs best, so ResNet50 is selected as our backbone network.
To validate the performance of our MDFE-Net, we conduct a series of ablation experiments on the VeRi-776 dataset. First, we choose the ResNet50 backbone, GAP, BN layer, FC layer and Softmax cross-entropy loss as the baseline network, as shown in Fig. 2a. Then, we add triplet loss, WRT, center loss and NLA to the baseline network one by one and finally obtain our MDFE-Net. The results are shown in Table 3.
The impact of triplet loss and WRT
From Table 3 and Fig. 5, we can see that even without the WRT loss, triplet loss still improves the model performance on mAP and mPSO. However, the Rank1 accuracy does not open a large margin.
Methods mAP mPSO Rank1
Baseline 76.65 36.41 94.59
Baseline + triplet loss 78.36 38.67 94.70
Baseline + WRT 78.84 39.25 95.55
Baseline + WRT + center loss 78.98 41.89 96.30
Baseline + WRT + center loss + NLA (ours) 80.33 43.47 97.01
Table 3. Ablation study on VeRi-776 (multiple discriminative features extraction network).
Backbone mAP mPSO Rank1
Squeezenet1-122 38.87 9.31 76.76
Mobilenetv223 44.86 9.65 80.69
Shufflenetv224 58.98 18.69 88.38
Xception25 60.40 17.82 90.23
Inceptionv426 60.70 19.58 88.80
Densenet16927 63.21 23.49 90.05
ResNet50 66.10 24.13 90.87
Table 2. Comparison experiment of backbone networks on VeRi-776.
Fig. 5. Illustration of retrieval results for model 1 and model 2. The green and gray boxes represent positive and negative samples, respectively.
After applying the WRT loss, “baseline + WRT” outperforms “baseline” by a large margin (2.19% mAP, 2.84% mPSO, and 0.96% Rank1). The results show that, compared with triplet loss, the WRT loss can more effectively optimize the distances between positive and negative sample pairs through weighting, that is, positive pairs are pulled closer while the margin between negative pairs becomes larger.
The impact of center loss
Compared with “baseline”, “baseline + WRT + center loss” improves mAP by 2.33%, mPSO by 5.48%, and Rank1 by 1.71% on the VeRi-776 dataset, as illustrated in Table 3. Besides, in Fig. 6, “baseline + WRT + center loss” has a much lower loss value than “baseline”. This result validates the effectiveness of the center loss in forcing our method to reach a stationary state in the shortest time while minimizing the loss.
The impact of NLA
To address the problem of intra-class similarity in vehicle re-id, we use the non-local-attention mechanism to obtain the dependencies among multiple features. Among the four processing stages of ResNet50, the feature maps in Stage2 and Stage3 exhibit a richer representation of vehicle information than those in Stage1 (Conv2_X) and Stage4 (Conv5_X). Consequently, we opted to incorporate the non-local-attention mechanism into Stage2 (Conv3_X) and Stage3 (Conv4_X). This integration enhances the learning capacity of the features within the non-local-attention mechanism. We performed ablation experiments on the location, number, and channel scaling factor of the non-local-attention (NLA) modules in ResNet50 on VeRi-776 to verify the effectiveness of the NLA module. As shown in Tables 4 and 5, the results show that:
(1) Compared with the other configurations, inserting the three NLA blocks after conv3_4, conv4_5 and conv4_6 yields the best overall model performance, with the three metrics mAP, mPSO and Rank1 reaching 80.33%, 43.47% and 97.01%, respectively.
(2) When the channel scaling factor is set to 8, the least vehicle feature information is lost during the scaling of the feature map, so the re-identification model performs best.
To further elucidate the role of the non-local-attention mechanism in the re-id process, we replaced the NLA module in MDFE-Net with a spatial attention module (SAM), a channel attention module (CAM), and a convolutional block attention module (CBAM), respectively, and conducted performance evaluations on the VeRi-776 dataset; the results are presented in Table 6.
Non_layers Numbers of NLA mAP mPSO Rank1
[0,4,5,0] 1 79.59 42.58 96.39
[0,3,4,0] 3 80.33 43.47 97.01
[0,2,3,0] 5 79.90 42.83 96.69
Table 4. Ablation study on VeRi-776 (the location and number of NLA blocks in ResNet50).
Fig. 6. Loss of different methods.
The results in Table 6 show that SAM and CAM exhibit suboptimal re-identification performance compared with CBAM and NLA. This discrepancy arises because these two kinds of attention focus primarily on the spatial-dimension information and the channel-dimension information of the feature maps, respectively, leading to an inadequate representation of information. In contrast, both CBAM and NLA are designed to integrate spatial and channel information for enhanced learning. The architecture of CBAM consists of two independent sub-modules applied sequentially, which significantly increases the number of network parameters as well as the computational time, thereby heightening the risk of model overfitting. Conversely, by assessing correlations among different positions within a sequence, NLA bolsters the model's capacity to capture global information while reducing the number of parameters to a certain extent, thus enhancing its feature representation capability. Therefore, NLA performs best among the four types of attention.
The impact of hyperparameter setup
From Tables 7 and 8, we can see that setting the initial learning rate to 0.00035, decaying it by a factor of 0.1 after the 10th and 50th epochs, and training until the 90th epoch gives our model its best performance.
Cross-validation
To further demonstrate the robustness and generalization of our method MDFE-Net, we also match the vehicle images in the Gallery against the vehicle images in the Query set. We call this cross-validation, as shown in Fig. 7. The results of the two validation methods are shown in Table 9.
Max-epoch Step-size mAP mPSO Rank1
70 [10,50] 78.31 42.85 95.83
70 [10,30] 79.86 42.80 96.90
90 [10,50] 80.33 43.47 97.01
90 [20,50] 79.19 43.95 96.31
120 [20,50] 80.20 43.26 96.42
Table 8. Performance comparison under different training epochs and step sizes on VeRi-776.
lr mAP mPSO Rank1
0.0001 78.44 41.98 96.07
0.00025 79.17 42.78 96.66
0.0003 79.94 42.54 96.60
0.00035 80.33 43.47 97.01
0.0005 77.50 38.01 95.89
Table 7. Performance comparison under different learning rates (lr) on VeRi-776.
Attention type mAP mPSO Rank1
SAM 76.90 36.70 95.72
CAM 77.00 37.69 95.88
CBAM 79.67 43.20 96.70
NLA 80.33 43.47 97.01
Table 6. Ablation study on VeRi-776 (four kinds of attention modules).
r mAP mPSO Rank1
1 79.28 41.53 96.62
2 78.49 40.77 96.31
4 78.29 40.35 96.20
8 80.33 43.47 97.01
16 78.42 40.83 96.40
Table 5. Ablation study on VeRi-776 (channel scaling factor of NLA).
We can see that after applying cross-validation, the margins between the two verification methods are small. This result validates that MDFE-Net generalizes robustly across vehicles.
Comparisons with state-of-the-art methods
Table 10 shows that, compared with the other methods, our method MDFE-Net achieves the best performance on the three datasets. Specifically, MDFE-Net obtains 80.33% mAP and 97.01% Rank1 on VeRi-776, 86.58% mAP and 80.75% Rank1 on VRIC, 89.24% mAP and 83.66% Rank1 on VehicleID (Test800), 86.56% mAP and 80.78% Rank1 on VehicleID (Test1600), and 83.70% mAP and 77.88% Rank1 on VehicleID (Test2400). To ensure the rigor of the comparison, we do not apply re-ranking in any of our experiments. The methods MAFEN21, URRNet39, DSN18, MBN43, and GRMS20 all introduce attention mechanisms based on ResNet50 to extract vehicle appearance features. Compared with these methods, the MDFE-Net proposed in this paper employs three techniques, NLA, center loss, and WRT loss, to comprehensively optimize ResNet50, achieving the best re-identification accuracy. Notably, compared to the three previous methods with the highest accuracy (TL + CL + SL42, MBN43, and GRMS20), MDFE-Net improves mAP and Rank1 on the VeRi-776 dataset by 0.03% and 0.71%, respectively. On the VRIC dataset, mAP and Rank1 increase by 2.01% and 0.78%, while on VehicleID (Test800) they rise by 1.54% and 0.26%. Furthermore, mAP and Rank1 on VehicleID (Test1600) improve by 2.30% and 1.88%, respectively; similarly, on VehicleID (Test2400) mAP increases by 2.01% and Rank1 by 1.61%. These results prove the powerful feature learning and representation capability of MDFE-Net.
Visualization of retrieval result and computation time
Figure 8 shows the retrieval results of DSN18, MBN43, GRMS20, and MDFE-Net on VeRi-776, VRIC, and VehicleID (Test2400), respectively. We can see that, for the same query vehicle, MDFE-Net achieves correct vehicle re-identification with fewer search rounds. On the VRIC dataset, whose vehicle image quality is poor, the performance advantage of MDFE-Net is even more pronounced. Table 11 shows the training time and inference time of the four methods DSN18, MBN43, GRMS20, and MDFE-Net on VeRi-776, VRIC and VehicleID (Test2400). The comparison shows that MDFE-Net has the highest training and inference efficiency on VeRi-776 and VRIC, and an efficiency similar to MBN43 on VehicleID (Test2400). This also reflects the contribution of NLA, center loss and WRT loss to model efficiency.
Conclusion
In this paper, we propose a multiple discriminative features extraction network to discover multiple personalized features of a vehicle. To locate multiple discriminative features, we introduce non-local attention, which realizes information interaction between long-distance features by computing the relationship between any two features. In addition, we introduce a novel evaluation metric called mean positive sample occupancy (mPSO) to evaluate the re-id model more comprehensively; mPSO reflects the retrieval performance of the model more intuitively. Our ablation study and comparative experiments show that our proposed method MDFE-Net outperforms a variety of state-of-the-art vehicle re-identification methods on the VeRi-776, VRIC and VehicleID datasets.
Method      VeRi-776 (mAP/mPSO/Rank1)   VRIC (mAP/mPSO/Rank1)   Test800 (mAP/mPSO/Rank1)   Test1600 (mAP/mPSO/Rank1)   Test2400 (mAP/mPSO/Rank1)
MDFE-Net    80.33/43.47/97.01           86.58/56.07/80.75       89.24/69.00/83.66          86.56/63.41/80.78           83.70/58.19/77.88
MDFE-Net*   81.55/58.12/95.96           84.00/53.98/78.55       87.82/73.80/95.25          83.65/66.31/93.75           82.32/63.95/93.08
Table 9. Performance comparison with two validation methods. * denotes cross-validation; Test800, Test1600 and Test2400 are the VehicleID test sets.
Fig. 7. Illustration of two validation methods.
The number of currently available vehicle datasets is limited, and they present various challenges: vehicle feature extraction is susceptible to weather conditions (rain, snow, haze, and other harsh environments that increase the difficulty of re-identification), and the datasets contain few vehicle types and lack emerging vehicle groups such as electric vehicles, which leads to a significant gap between real scenarios and the datasets. Therefore, constructing vehicle datasets that closely resemble real environments is crucial for vehicle re-identification research. Additionally, most current vehicle re-identification work is based on supervised learning, which requires highly accurate dataset labels, incurs high labor costs, and yields unsatisfactory domain adaptability. Therefore, designing better unsupervised algorithms to solve the problems of label creation and inter-domain differences will be one of the important research directions in the field of vehicle re-identification in the future.
Method            Backbone     VeRi-776       VRIC           Test800        Test1600       Test2400
                               (mAP/Rank1)    (mAP/Rank1)    (mAP/Rank1)    (mAP/Rank1)    (mAP/Rank1)
Siamese-Visual28  ResNet50     29.48/41.12    –/30.55        –/–            –/–            –/–
OIFE29            GoogLeNet    48.00/65.90    –/24.62        –/–            –/–            –/67.00
MSVR15            MobileNets   49.30/88.56    47.50/46.61    –/–            –/–            –/63.02
SCAN30            VGG16        49.87/82.24    –/–            –/–            –/–            –/65.44
FDA-Net31         Self-design  55.49/84.27    –/–            –/–            65.33/59.84    61.84/55.53
VAMI32            Self-design  61.32/85.92    43.80/30.50    –/63.12        –/52.87        –/47.34
RAM33             VGG_CNN_M    61.50/88.60    –/–            –/75.20        –/72.30        –/67.70
MSA34             ResNet50     62.89/92.07    –/–            80.31/77.55    77.11/74.41    75.55/72.91
AAVER35           ResNet101    66.35/90.17    –/–            –/74.69        –/68.62        –/63.54
BS36              Self-design  67.55/90.23    78.55/69.09    86.19/78.80    81.69/73.41    78.16/69.33
CCA37             ResNet50     68.05/91.71    –/–            78.89/75.51    76.53/73.60    73.11/70.08
TCL + SL38        Self-design  68.97/93.92    71.66/63.68    80.13/74.97    77.26/72.84    75.25/71.20
MAFEN21           ResNet50     71.00/95.53    –/–            –/77.18        –/76.07        –/72.94
URRNet39          ResNet50     72.20/93.10    –/–            –/76.50        –/73.70        –/68.20
MVAN40            VGG16        72.53/92.59    –/–            –/–            –/–            76.78/72.58
MsDeep41          ResNet50     74.50/95.10    –/–            84.30/81.20    81.00/78.00    78.60/75.60
DSN18             ResNet50     76.30/94.80    –/–            –/80.60        –/78.20        –/75.00
TL + CL + SL42    Self-design  76.95/93.62    84.57/78.37    86.84/81.36    83.71/77.94    81.69/76.27
MBN43             ResNet50     77.12/96.30    82.75/79.97    87.70/81.96    84.26/77.85    80.87/74.07
GRMS20            ResNet50     80.30/95.80    –/–            –/83.40        –/78.90        –/75.60
MDFE-Net          ResNet50     80.33/97.01    86.58/80.75    89.24/83.66    86.56/80.78    83.70/77.88
Table 10. Performance comparison with state-of-the-art methods. Test800, Test1600 and Test2400 are the VehicleID test sets; '–' indicates that the result is not reported.
Fig. 8. The top-10 rank comparisons of the visualized results of the state-of-the-art methods on the VeRi-776, VRIC and VehicleID (Test2400) datasets. The green and red boxes represent correctly matched and wrongly matched vehicles, respectively.
Data availability
The datasets analysed during the current study are not publicly available because this research will be submitted for university scientific research achievement evaluation in the future, but they are available from the corresponding author on reasonable request.
Received: 3 August 2024; Accepted: 9 December 2024
References
1. Zheng, Z., Zheng, L. & Yang, Y. Pedestrian alignment network for large-scale person re-identification. IEEE T. Circ. Syst. Vid. 29, 3037–3045 (2018).
2. Ning, E., Wang, C., Zhang, H., Ning, X. & Tiwari, P. Occluded person re-identification with deep learning: a survey and perspectives. Expert Syst. Appl. 239, 122419 (2023).
3. Guo, X. et al. A novel dual-pooling attention module for UAV vehicle re-identification. Sci. Rep. 14, 2027 (2024).
4. Wang, Q. et al. Dual similarity pre-training and domain difference encouragement learning for vehicle re-identification in the wild. Pattern Recognit. 139, 109513 (2023).
5. Wei, X. S., Zhang, C. L., Liu, L., Shen, C. & Wu, J. Coarse-to-fine: A RNN-based hierarchical attention model for vehicle re-identification. In Proceedings of the 14th Asian Conference on Computer Vision (ACCV) 575–591 (2018).
6. Guo, H., Zhu, K., Tang, M. & Wang, J. Two-level attention network with multi-grain ranking loss for vehicle re-identification. IEEE T. Image Process. 28, 4328–4338 (2019).
7. Lou, Y., Bai, Y., Liu, J., Wang, S. & Duan, L. Y. Embedding adversarial learning for vehicle re-identification. IEEE T. Image Process. 28, 3794–3807 (2019).
8. Yan, K., Tian, Y., Wang, Y., Zeng, W. & Huang, T. Exploiting multi-grain ranking constraints for precisely searching visually-similar vehicles. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) 562–570 (2017).
9. Liu, H., Tian, Y., Yang, Y., Pang, L. & Huang, T. Deep relative distance learning: tell the difference between similar vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2167–2175 (2016).
10. Zhang, Y., Liu, D. & Zha, Z. J. Improving triplet-wise training of convolutional neural network for vehicle re-identification. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME) 1386–1391 (2017).
11. Wang, X., Girshick, R., Gupta, A. & He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 7794–7803 (2018).
12. Lu, H., Zou, X. & Zhang, P. Learning progressive modality-shared transformers for effective visible-infrared person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) 1835–1843 (2023).
13. Ye, M. et al. Deep learning for person re-identification: A survey and outlook. IEEE T. Pattern Anal. 44(6), 2872–2893 (2021).
14. Liu, X., Liu, W., Mei, T. & Ma, H. A deep learning-based approach to progressive vehicle re-identification for urban surveillance. In Proceedings of the European Conference on Computer Vision (ECCV) 869–884 (2016).
15. Kanacı, A., Zhu, X. & Gong, S. Vehicle re-identification in context. In Proceedings of the German Conference on Pattern Recognition (GCPR) 377–390 (2018).
16. Khan, S. D. & Ullah, H. A survey of advances in vision-based vehicle re-identification. Comput. Vision Image Underst. 182, 50–63 (2019).
17. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (2016).
18. Zhu, W. et al. A dual self-attention mechanism for vehicle re-identification. Pattern Recognit. 137, 109258 (2023).
19. Lee, S., Woo, T. & Lee, S. H. Multi-attention-based soft partition network for vehicle re-identification. J. Comput. Des. Eng. 10(2), 488–502 (2023).
20. Pang, X., Yin, Y. & Tian, X. Global relational attention with a maximum suppression constraint for vehicle re-identification. Int. J. Mach. Learn. Cybern. 15(5), 1729–1742 (2024).
21. Yu, Y. et al. Multi-attention guided and feature enhancement network for vehicle re-identification. J. Intell. Fuzzy Syst. 44(1), 673–690 (2023).
22. Iandola, F. N. et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016).
23. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. & Chen, L. C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 4510–4520 (2018).
24. Ma, N., Zhang, X., Zheng, H. T. & Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV) 116–131 (2018).
25. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 1251–1258 (2017).
26. Szegedy, C., Ioffe, S., Vanhoucke, V. & Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) 1–12 (2017).
27. Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 4700–4708 (2017).
28. Shen, Y., Xiao, T., Li, H., Yi, S. & Wang, X. Learning deep neural networks for vehicle re-id with visual-spatio-temporal path proposals. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) 1900–1909 (2017).
29. Wang, Z. et al. Orientation invariant feature embedding and spatial temporal regularization for vehicle re-identification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) 379–387 (2017).
Method      VeRi-776 (train h / infer s)   VRIC (train h / infer s)   VehicleID Test2400 (train h / infer s)
DSN18       7.1 / 0.6576                   9.4 / 0.3219               12.2 / 0.9905
MBN43       6.3 / 0.4349                   8.73 / 0.2240              10.19 / 0.8318
GRMS20      5.7 / 0.4012                   7.9 / 0.1112               11.3 / 0.9001
MDFE-Net    5.6 / 0.2765                   7.2 / 0.0989               10.1 / 0.8979
Table 11. Comparison of computation time of the state-of-the-art methods. Inference time = TestingSize (img) ÷ BatchSize (img) × BatchTime (s).
30. Teng, S., Liu, X., Zhang, S. & Huang, Q. Scan: Spatial and channel attention network for vehicle re-identification. In Proceedings of the Pacific Rim Conference on Multimedia 350–361 (2018).
31. Lou, Y., Bai, Y., Liu, J., Wang, S. & Duan, L. Veri-wild: A large dataset and a new method for vehicle re-identification in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 3235–3243 (2019).
32. Zhou, Y. & Shao, L. Aware attentive multi-view inference for vehicle re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 6489–6498 (2018).
33. Liu, X., Zhang, S., Huang, Q. & Gao, W. Ram: A region-aware deep model for vehicle re-identification. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME) 1–6 (2018).
34. Zheng, A. et al. Multi-scale attention vehicle re-identification. Neural Comput. Appl. 32, 17489–17503 (2020).
35. Khorramshahi, P. et al. A dual-path model with adaptive attention for vehicle re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 6132–6141 (2019).
36. Kumar, R., Weill, E., Aghdasi, F. & Sriram, P. A strong and efficient baseline for vehicle re-identification using deep triplet embedding. J. Artif. Intell. Soft. 10, 27–45 (2020).
37. Peng, J. et al. Eliminating cross-camera bias for vehicle re-identification. Multimed. Tools Appl. 1–17 (2022).
38. He, X., Zhou, Y., Zhou, Z., Bai, S. & Bai, X. Triplet-center loss for multi-view 3d object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 1945–1954 (2018).
39. Qian, J., Pan, M., Tong, W., Law, R. & Wu, E. Q. URRNet: A unified relational reasoning network for vehicle re-identification. IEEE Trans. Veh. Technol. 72(9), 11156–11168 (2023).
40. Teng, S., Zhang, S., Huang, Q. & Sebe, N. Multi-view spatial attention embedding for vehicle re-identification. IEEE T. Circ. Syst. Vid. 31, 816–827 (2020).
41. Cheng, Y. et al. Multi-scale deep feature fusion for vehicle re-identification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 1928–1932 (2020).
42. Wen, Y., Zhang, K., Li, Z. & Qiao, Y. A discriminative feature learning approach for deep face recognition. In Proceedings of the European Conference on Computer Vision (ECCV) 499–515 (2016).
43. Rong, L. et al. A vehicle re-identification framework based on the improved multi-branch feature fusion network. Sci. Rep. 11, 1–12 (2021).
Author contributions
Conceptualization, L.B.; Methodology, L.R.; Validation, L.R.; Formal analysis, L.R.; Writing—original draft preparation, L.R.; Writing—review and editing, L.B.
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to L.R.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
© The Author(s) 2024