Vehicle re-identication with
multiple discriminative features
based on non-local-attention block
Lu Bai1 & Leilei Rong2
Vehicle re-identication (re-id) technology refers to a vehicle matching under a non-overlapping
domain, that is, to conrm whether the vehicle target taken by cameras in dierent positions at
dierent times is the same vehicle. Dierent identities of the same type of vehicles are one of the most
challenging factors in the eld of vehicle re-identication. The key to solve this diculty is to make
full use of the multiple discriminative features of vehicles. Therefore, this paper proposes a multiple
discriminative features extraction network (MDFE-Net) that can enhance the distance dependence
on the vehicle’s multiple discriminative features by non-local attention, which in turn enhances the
discriminative power of the network. Meanwhile, to more directly represent the retrieval capability
of the model and enhance the rigor of model evaluation, we introduce a novel vehicle re-id model
evaluation metric called mean positive sample occupancy (mPSO). Comprehensive experiments
implemented on challenging vehicle evaluation datasets (including VeRi-776, VRIC, and VehicleID)
show that our model robustly achieves state-of-the-art performances. Moreover, our novel metric
mPSO further proves the powerful retrieval capability of the MDFE-Net.
Keywords Vehicle re-identification, Multiple discriminative features, Non-local attention, mPSO
As a signicant mean of transportation in people’s daily life, vehicles play an extremely important role in modern
transportation systems. e task of vehicle re-identication is to identify and retrieve the same vehicle under
dierent cameras, so it is also called cross-camera-tracking.
In the research eld of re-identication in non-overlapping domain, the main research objects are
pedestrians1,2 and vehicles3,4. Vehicle re-id is more challenging than pedestrian re-id for the following reasons:
(1) vehicles show extreme perspective changes from a 360-degree shooting Angle; (2) due to limited vehicle
types and colors, there are few ne-grained features of the body. In recent years, research on improving model
performance in the eld of vehicle re-identication can be divided into two paths: rst, extracting more
discriminative vehicle appearance features by designing new network models; and second, creating more
eective model loss functions. Specically, Wei et al. designs a hierarchical attention model based on recurrent
neural network for vehicle re-id5. e hierarchical model is rst used to establish the dependency relationship
for vehicle features, and then the attention model is used to extract more subtle features of vehicles. Guo et al.
proposes a two-level network composed of rigid block attention module and so pixel level attention module,
which can be adapted to extract highly dierentiated vehicle features6. Lou et al. uses generative adversarial
networks to generate dicult samples for vehicle re-id model training7. Yan et al. proposes a multi-task deep
learning framework, which uses multi-dimensional information to complete vehicle classication and similarity
ranking, so as to achieve the purpose of identifying the same vehicle8. Liu et al. improves the traditional triplet
loss and proposes a pair cluster loss function to make the distance between the same vehicles become closer9.
Zhang et al. proposes a triplet loss based on classication invariance, and designs a triplet sampling method
based on paired images to better train the proposed triplet loss and strengthen the constraint on the same class
of vehicle images10.
According to research findings, different identities of the same type of vehicle (intra-class similarity) are one of the most challenging factors in the field of vehicle re-identification. For example, as shown in Fig. 1, we enumerate four pairs of cases with the same vehicle type (SUV, cab, bus, and truck) but different vehicle identities. As can be seen from Fig. 1, the personalized characteristics of the vehicles (such as annual inspection marks and ornaments) are the key to distinguishing the different vehicles. To deal with the problem of intra-class similarity, we propose a multiple discriminative features extraction network, which combines multiple personalized details of vehicles through a non-local-attention mechanism11.
1Shandong Maritime Vocation College, Weifang 261108, China. 2Weifang Education Investment Group Co., Ltd.,
Weifang 261108, China. email: rll5721@126.com
Our main contributions in this paper are summarized as follows:
• A new deep neural network, MDFE-Net, is proposed to discover and capture more discriminative vehicle appearance features by non-local-attention blocks. We choose ResNet50 as the backbone network and embed three non-local-attention (NLA) blocks in ResNet50. NLA can enhance the long-distance dependence among multiple discriminative regions of vehicles, adaptively select non-locally correlated regions within the image, mitigate interference factors, and ultimately improve the quality of the extracted features.
• A new re-identification model evaluation metric, mPSO, is introduced. Unlike mAP and Rank-1, this metric assesses the model's capability to identify all positive samples during the re-identification process. A higher mPSO value indicates that the model needs fewer retrievals to identify all positive samples, while a lower value indicates more. This directly reflects the ‘retrieval performance’ of the re-identification technology. Most importantly, this is the first application of such a metric in the field of vehicle re-id.
• In our approach, besides utilizing the center loss12 and the Softmax cross-entropy loss13, we also incorporate the Weighted Regularized Triplet (WRT) loss. By learning similarity metrics within a high-dimensional embedding space, representations of similar objects tend to cluster together, whereas those of dissimilar objects remain distant from each other. This effectively optimizes the distance between positive and negative samples. Additionally, WRT prevents model overfitting and bolsters its generalization capability through the introduction of regularization terms.
• Extensive experiments on three benchmark datasets (VeRi-77614, VRIC15 and VehicleID16) demonstrate the superiority of our proposed approach.
Our paper is organized as follows: Section "Methods" presents the framework of our proposed MDFE-Net with implementation details. Section "Experiments results and discussion" conducts various experiments to verify the effectiveness of our model MDFE-Net. Finally, this paper is concluded and future work is proposed in Section "Conclusion".
Methods
Baseline and multiple discriminative features extraction network
From Fig.2a, we can see that our baseline consists of ResNet5017 backbone, Global Average Pooling (GAP), a batch
normalization layer, a fully connected (FC) layer and Somax cross-entropy loss. e multiple discriminative
features extraction network is shown in Fig.2b. Based on the backbone network, the three non-local-attention
(NLA) blocks are inserted aer conv3_4, conv4_5, and conv4_6 respectively. e NLA can achieve information
interaction between any two locations, not limited to adjacent points, and thus can maintain more information.
In the training phase, we rst calculate the center loss and WRT loss for the output features aer GAP. Finally, the
total loss of the model is given by the weighted sum of the center loss, WRT loss, and Somax cross-entropy loss.
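For readers who want a concrete picture of the head in Fig. 2a, the following minimal PyTorch sketch shows the GAP, batch-normalization and FC stages described above. The class name BaselineHead and the choice of which feature (before or after BN) is used for retrieval are our assumptions, not details specified in the paper.

```python
import torch.nn as nn

class BaselineHead(nn.Module):
    """GAP -> BN -> FC head sketched from Fig. 2a. `num_ids` is the number of training
    identities; 2048 matches the channel width of ResNet50's last stage."""
    def __init__(self, num_ids, feat_dim=2048):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.bn = nn.BatchNorm1d(feat_dim)
        self.fc = nn.Linear(feat_dim, num_ids, bias=False)

    def forward(self, feat_map):
        f = self.gap(feat_map).flatten(1)   # pooled feature: used for the center and WRT losses
        logits = self.fc(self.bn(f))        # classification logits: used for the Softmax loss
        return f, logits
```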
Non-local-attention block
In recent years, the attention mechanism has been widely used in research on person re-identification, but it has received less attention in the field of vehicle re-identification. Moreover, the state-of-the-art models that achieve the best performance on person re-id datasets all adopt attention mechanisms. Hence, the attention mechanism is an essential component of a discriminative re-identification model. The attention mechanism aims to capture the relationships between different convolution channels, multiple feature maps, different attributes/areas of the vehicle body, and even multiple images. In a word, the attention mechanism gives higher weight to the discriminative/personalized vehicle features, and these features are incorporated to enhance the feature learning ability of the network. Drawing upon spatial and channel attention, several researchers have developed various styles of attention mechanisms to enhance the performance of vehicle re-identification within neural networks. Zhu et al.18 developed a dual self-attention module, comprising static self-attention and cross-region attention, to effectively capture diverse regional dependencies and address the challenges posed by high inter-class similarity and significant intra-class variation among vehicles. Lee et al.19 designed a Multi-Attention Soft Partition (MUSP) network, which employs multiple soft attention mechanisms in both spatial and channel directions. This network is capable of learning distinct features from various discriminative regions and viewpoints, without the need for any artificial attention branches that are specific to local regions or dependent on specific views. Pang et al.20 proposed a global relational attention mechanism that integrates global dependencies to enhance the network's ability to discriminate personalized vehicle features and reduce computational complexity. Yu et al.21 constructed a Multi-Attention Guided Feature Enhancement Network (MAFEN) to learn the spatial structure information and channel dependence of multi-receptive-field features, and embedded them to enhance feature extraction performance.
Fig. 1. Vehicle intra-class similarity problem. Four pairs of vehicles share the same exterior appearance but differ in their identities.
In order to associate different vehicle attributes with personalized features, we adopt the non-local-attention blocks to attain a weighted sum of all discriminative/personalized features of the vehicle appearance, represented by

z_i = W_z \phi(x_i) + x_i    (1)
Fig. 2. Illustration of the baseline and multiple discriminative features extraction network.
Scientic Reports | (2024) 14:31386 3
| https://doi.org/10.1038/s41598-024-82755-3
www.nature.com/scientificreports/
Content courtesy of Springer Nature, terms of use apply. Rights reserved
Here i is the index of the output location whose response is to be calculated, W_z is a weight matrix to be learned, \phi(\cdot) represents a non-local operation, and "+ x_i" denotes a residual learning strategy13,17. Details can be found in Fig. 3.
Loss functions design
From Fig.2, we can see that our network applies three loss functions (Somax cross-entropy loss, weighted
regularization triplet (WRT) loss, and center loss) to optimize our model. e Somax cross-entropy loss and
center loss12 are formulated as follows:
L_{Softmax} = -\sum_{i=1}^{N} \log\left( \frac{\exp(x_{y_i})}{\sum_{j=1}^{N_{id}} \exp(x_j)} \right)    (2)

L_{center} = \frac{1}{2} \sum_{i=1}^{m} \| f_i - c_{y_i} \|_2^2    (3)
Here N and N_{id} respectively represent the number of vehicle images in the mini-batch and the number of vehicle identities in the whole training dataset, y_i is the ground-truth identity of the i-th input vehicle image, and x_j denotes the output of the fully connected layer for the j-th identity. c_{y_i} represents the feature center of the y_i-th category, f_i represents the feature before the fully connected layer, and m represents the size of the mini-batch.
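As a concrete reference for Eq. (3), the sketch below keeps one learnable center per identity and penalizes the squared distance between each pooled feature and its center. It is our own minimal rendering (class name CenterLoss, random center initialization), not the authors' implementation.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Center loss (Eq. 3): each identity has a learnable feature center c_{y_i}."""
    def __init__(self, num_classes, feat_dim=2048):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        # features: (m, feat_dim) pooled features; labels: (m,) identity indices
        centers_batch = self.centers[labels]          # c_{y_i} for every sample in the batch
        # Eq. (3) sums over the mini-batch; some implementations divide by m for stability.
        return 0.5 * ((features - centers_batch) ** 2).sum()
```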
The weighted regularization triplet (WRT) loss13 retains the advantage of the triplet loss of optimizing the relative distance between positive and negative pairs, while avoiding the introduction of any additional margin parameters. The WRT loss function is formulated as follows:
L_{wrt}(i) = \log \left\{ 1 + \exp\left( \sum_{j} w_{ij}^{p} d_{ij}^{p} - \sum_{k} w_{ik}^{n} d_{ik}^{n} \right) \right\}    (4)

w_{ij}^{p} = \frac{\exp(d_{ij}^{p})}{\sum_{d_{ij}^{p} \in P_i} \exp(d_{ij}^{p})}, \quad w_{ik}^{n} = \frac{\exp(-d_{ik}^{n})}{\sum_{d_{ik}^{n} \in N_i} \exp(-d_{ik}^{n})}    (5)
Fig. 3. Illustration of the non-local-attention block (best viewed in color). First, three 1 × 1 convolution operations are performed on the input feature map matrix (H × W × C, blue block) simultaneously to obtain dimensionality-reduced feature map matrices (H × W × C/r, grey blocks), which are then multiplied twice to obtain the weighted feature map matrix (H × W × C/r, light orange block). Then, the dimension of the feature map is increased through a 1 × 1 convolution to obtain a weighted feature map (H × W × C, purple block) of the same dimension as the input feature map, and finally the input feature matrix and the weighted feature matrix are added to obtain the output feature map matrix (H × W × C, blue-purple block). Here C = 2048 and r = 8 represent the channel number and the channel scaling factor, respectively.
Scientic Reports | (2024) 14:31386 4
| https://doi.org/10.1038/s41598-024-82755-3
www.nature.com/scientificreports/
Content courtesy of Springer Nature, terms of use apply. Rights reserved
Here (i, j, k) represents a hard triplet within each training batch. For anchor i, P_i and N_i are the corresponding positive set and negative set respectively, and d_{ij}^{p} / d_{ik}^{n} denotes the pairwise distance of a positive/negative sample pair. Details are shown in Fig. 4.
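A minimal batch-wise sketch of Eqs. (4) and (5), following the formulation of ref. 13, is given below. It assumes PK-style sampling (every identity appears at least twice per batch) and averages the per-anchor loss over the batch; these choices, and the helper names, are ours.

```python
import torch
import torch.nn.functional as F

def _masked_softmax(scores, mask):
    # softmax over the entries where mask is True; masked-out entries get zero weight
    scores = scores.masked_fill(~mask, float('-inf'))
    return torch.softmax(scores, dim=1)

def wrt_loss(features, labels):
    """Weighted Regularized Triplet loss (Eqs. 4-5) over one mini-batch."""
    dist = torch.cdist(features, features)                   # pairwise Euclidean distances
    n = dist.size(0)
    same_id = labels.view(n, 1).eq(labels.view(1, n))
    is_pos = same_id & ~torch.eye(n, dtype=torch.bool, device=dist.device)
    is_neg = ~same_id

    w_p = _masked_softmax(dist, is_pos)      # harder (farther) positives get larger weights
    w_n = _masked_softmax(-dist, is_neg)     # harder (closer) negatives get larger weights

    d_ap = (w_p * dist).sum(dim=1)           # weighted positive distance per anchor
    d_an = (w_n * dist).sum(dim=1)           # weighted negative distance per anchor
    return F.softplus(d_ap - d_an).mean()    # log(1 + exp(d_ap - d_an)), averaged over anchors
```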
The total loss used to train our proposed method in an end-to-end manner combines the Softmax cross-entropy loss, the weighted regularization triplet (WRT) loss, and the center loss:

L_{total} = \alpha L_{softmax} + \beta L_{center} + \gamma L_{wrt}    (6)

Here the parameters \alpha = 1, \beta = 0.0005, and \gamma = 1 balance the contributions of the three loss functions.
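In practice, Eq. (6) reduces to a weighted sum per mini-batch; the snippet below assumes the three individual losses have already been computed as in the sketches above.

```python
# Eq. (6): weighted sum of the three losses for one mini-batch. loss_softmax, loss_center
# and loss_wrt are assumed to be the scalar losses computed for the current batch.
alpha, beta, gamma = 1.0, 0.0005, 1.0
loss_total = alpha * loss_softmax + beta * loss_center + gamma * loss_wrt
loss_total.backward()
```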
Experiments results and discussion
Datasets and evaluation metrics
To validate the superiority and effectiveness of the proposed multiple discriminative features extraction network, we conduct extensive experiments on three public benchmarks for vehicle re-id, namely VeRi-776, VRIC and VehicleID. The vehicle distribution of the three datasets is shown in Table 1.
Vehicle re-identication is a sub-task of image retrieval, the higher the position ranking of correct vehicles
in the retrieval results, the better the model retrieval eect. Prior to this, Rank1 accuracy and mAP (mean
Average Precision) were the most popular model performance evaluation metrics in the eld of re-identication.
Rank 1 accuracy solely evaluates the correctness of the top-ranked prediction, without taking into account the
predictive accuracy of the remaining digits. In imbalanced data, when the proportion of a certain class of samples
is very low, using Rank1 accuracy may result in the model performing well, but in reality, the model’s prediction
performance for minority classes is poor. In practical applications, due to the fact that re identication systems
usually return a list of query results for manual ltering, a good re-identication system should try to rank
all correct matching results at the front of the list as much as possible. e mAP standard did not emphasize
this point during evaluation, resulting in higher mAP scores even if some correctly matched results in the
matching list are ranked lower, which is not in line with practical application requirements. e example in
Fig.5 proves that the model retrieval result is not necessarily optimal when the AP (Average Precision) result is
high. To address the limitations of the aforementioned evaluation metrics, we introduce a novel re-identication
evaluation metric, mPSO (mean positive sample occupancy), designed to directly align with the requirements of
re-identication technology in practical applications.
As shown in Fig.5, assuming that there are only four positive samples, in order to nd all of them, model
1 needs to search ten times, while model 2 only needs six times. erefore, model 2 has better retrieval ability
than model 1. If only AP is used to evaluate the model, the performance of model 1 is better than that of model
2, which is contrary to the real situation. erefore, PSO can better reect the performance of the re-id model
than AP.
PSO_i = \frac{G_i}{X_i}, \quad i = 1, 2, \ldots, Q    (7)
Table 1. Vehicle distribution in the three datasets.
Dataset | Images/ID | Train/ID | Query/ID | Gallery/ID
VeRi-776 | 51,035/776 | 37,778/576 | 1678/200 | 11,579/200
VRIC | 60,430/5622 | 54,808/2811 | 2811/2811 | 2811/2811
VehicleID (Test800) | 221,763/26,267 | 110,178/13,134 | 6532/800 | 800/800
VehicleID (Test1600) | 221,763/26,267 | 110,178/13,134 | 11,395/1600 | 1600/1600
VehicleID (Test2400) | 221,763/26,267 | 110,178/13,134 | 17,638/2400 | 2400/2400
Fig. 4. Weighted regularization triplet (WRT) loss.
Scientic Reports | (2024) 14:31386 5
| https://doi.org/10.1038/s41598-024-82755-3
www.nature.com/scientificreports/
Content courtesy of Springer Nature, terms of use apply. Rights reserved
mPSO = \frac{1}{Q} \sum_{i=1}^{Q} \frac{G_i}{X_i}    (8)

In vehicle re-id, G_i and X_i are, respectively, the number of target vehicles in the retrieval results of the i-th query and the number of retrievals needed to reach the last target vehicle; mPSO represents the mean occupancy rate of the target vehicles in the search results over the Q queries.
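To make the metric concrete, the sketch below computes PSO and mPSO from binary ranking lists (1 marks a gallery image that shares the query identity). Returning 0.0 for a query without positives is our own edge-case choice; the example values reproduce the Fig. 5 scenario under one possible ordering of the positives.

```python
def positive_sample_occupancy(ranked_matches):
    """PSO for a single query (Eq. 7). `ranked_matches` is the full gallery ranking as
    0/1 flags (1 = positive sample), ordered by decreasing similarity."""
    positive_ranks = [rank for rank, hit in enumerate(ranked_matches, start=1) if hit]
    if not positive_ranks:        # no positives for this query (edge case, our assumption)
        return 0.0
    g = len(positive_ranks)       # G_i: number of target vehicles in the ranking
    x = positive_ranks[-1]        # X_i: rank at which the last target vehicle is retrieved
    return g / x

def mean_positive_sample_occupancy(rankings):
    """mPSO (Eq. 8): the mean PSO over all Q queries."""
    return sum(positive_sample_occupancy(r) for r in rankings) / len(rankings)

# Fig. 5 example: four positives, last one found at rank 10 (model 1) vs. rank 6 (model 2)
# positive_sample_occupancy([1, 0, 1, 0, 0, 1, 0, 0, 0, 1])  -> 0.4
# positive_sample_occupancy([1, 0, 1, 1, 0, 1])              -> 0.667
```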
Implementation details
In this paper, all vehicle images are resized to 256 × 256. We adopt the PyTorch framework to train MDFE-Net on a server with Deepin V20 Linux and one NVIDIA GeForce RTX 3080 GPU. The batch sizes for training and testing are set to 32 and 128, respectively. During the training stage, we choose Adam as the model optimizer and adopt a linear warm-up strategy to gradually change the learning rate. The initial learning rate is 0.00035 and is decreased by a factor of 0.1 at the 10th and 50th epochs, respectively. The whole training stage lasts for 90 epochs.
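The optimizer and learning-rate schedule described above can be sketched as follows. The 10-epoch warm-up length is our assumption (the paper only states that a linear warm-up strategy is used), and `model` stands in for the MDFE-Net instance.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=3.5e-4)

def lr_factor(epoch):
    if epoch < 10:                    # linear warm-up toward the initial rate of 0.00035
        return (epoch + 1) / 10.0
    if epoch < 50:                    # x0.1 at the 10th epoch
        return 0.1
    return 0.01                       # x0.1 again at the 50th epoch

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(90):               # the whole training stage lasts 90 epochs
    # ... one training pass over the mini-batches goes here ...
    scheduler.step()
```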
Ablation study
We use Xception, Inceptionv4, Densenet169, ResNet50, Shufflenetv2, Squeezenet1-1, and Mobilenetv2, pretrained on ImageNet, as candidate backbone networks. Without adding other network structures, training and testing are carried out on the VeRi-776 dataset. The experimental results are shown in Table 2. By comparing mAP, mPSO and Rank1, we can see that ResNet50 has the best performance, so ResNet50 is selected as our backbone network.
To validate the performance of our MDFE-Net, we conduct a series of ablation experiments on the VeRi-776 dataset. First, we choose the ResNet50 backbone, GAP, BN layer, FC layer and Softmax cross-entropy loss as the baseline network, as shown in Fig. 2a. Then, we add the triplet loss, WRT loss, center loss and NLA to the baseline network one by one and finally obtain our MDFE-Net. The results are shown in Table 3.
The impact of triplet loss and WRT
From Table 3 and Fig. 5, we can see that without the WRT loss, the triplet loss can still improve the model performance on mAP and mPSO. However, it does not improve Rank1 accuracy by a large margin.
Methods mAP mPSO Rank1
Baseline 76.65 36.41 94.59
Baseline + triplet loss 78.36 38.67 94.70
Baseline + WRT 78.84 39.25 95.55
Baseline + WRT + center loss 78.98 41.89 96.30
Baseline + WRT + center loss + NLA (ours) 80.33 43.47 97.01
Table 3. Ablation study on VeRi-776 (multiple discriminative features extraction network).
Backbone mAP mPSO Rank1
Squeezenet1-122 38.87 9.31 76.76
Mobilenetv223 44.86 9.65 80.69
Shufflenetv224 58.98 18.69 88.38
Xception25 60.40 17.82 90.23
Inceptionv426 60.70 19.58 88.80
Densenet16927 63.21 23.49 90.05
ResNet50 66.10 24.13 90.87
Table 2. Comparison experiment of backbone networks on VeRi-776.
Fig. 5. Illustration of retrieval results for model 1 and model 2. The green and gray boxes represent positive and negative samples, respectively.
Scientic Reports | (2024) 14:31386 6
| https://doi.org/10.1038/s41598-024-82755-3
www.nature.com/scientificreports/
Content courtesy of Springer Nature, terms of use apply. Rights reserved
After applying the WRT loss, “baseline + WRT” outperforms “baseline” by a large margin (2.19% mAP, 2.84% mPSO, and 0.96% Rank1). The results show that, compared with the triplet loss, the WRT loss can effectively optimize the distance between positive and negative sample pairs by weighting, that is, the distance between positive sample pairs becomes closer and the margin between negative sample pairs becomes larger.
The impact of center loss
Compared with “baseline”, “baseline + WRT + center loss” improves mAP, mPSO, and Rank1 by 2.33%, 5.48%, and 1.71% respectively on the VeRi-776 dataset, as illustrated in Table 3. Besides, in Fig. 6, “baseline + WRT + center loss” has a much lower loss value than “baseline”. This result validates the effectiveness of the center loss in driving our method to reach a stationary state in the shortest time while minimizing the loss.
The impact of NLA
To solve the problem of intra-class similarity in vehicle re-id, we use the non-local-attention mechanism to obtain the dependencies between multiple features. Among the four processing stages of ResNet50, the feature maps in Stage2 (Conv3_X) and Stage3 (Conv4_X) exhibit a richer representation of vehicle information compared to those in Stage1 (Conv2_X) and Stage4 (Conv5_X). Consequently, we opted to incorporate the non-local attention mechanism into Stage2 and Stage3; a code sketch of this insertion is given after the results below. This integration enhances the feature learning capacity of the non-local-attention mechanism. We performed ablation experiments on the location, number, and channel scaling factor of the non-local-attention (NLA) modules in ResNet50 on VeRi-776 to verify the effectiveness of the NLA module. As shown in Tables 4 and 5, the results show that:
(1) Compared with other configurations, the design in which the three NLA blocks are inserted after conv3_4, conv4_5 and conv4_6 gives the best overall performance of the model, and the accuracy of the three metrics mAP, mPSO and Rank1 reaches 80.33%, 43.47% and 97.01% respectively.
(2) When the channel scaling factor is set to 8, the vehicle feature information is least lost during the scaling of the feature map, so the performance of the re-identification model is the best.
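As one reading of how the three NLA blocks can be embedded, the snippet below wraps the corresponding bottlenecks of a torchvision ResNet50. Mapping conv3_4, conv4_5 and conv4_6 to layer2[3], layer3[4] and layer3[5] is our interpretation of the paper's notation, and NonLocalAttention refers to the module sketched earlier.

```python
import torch.nn as nn
from torchvision.models import resnet50

# ImageNet-pretrained backbone, as used in the paper's experiments
backbone = resnet50(weights="IMAGENET1K_V1")

# Append an NLA block after the chosen bottlenecks; channel widths are 512 (layer2)
# and 1024 (layer3) in torchvision's ResNet50.
backbone.layer2[3] = nn.Sequential(backbone.layer2[3], NonLocalAttention(512, reduction=8))
backbone.layer3[4] = nn.Sequential(backbone.layer3[4], NonLocalAttention(1024, reduction=8))
backbone.layer3[5] = nn.Sequential(backbone.layer3[5], NonLocalAttention(1024, reduction=8))
```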
To further elucidate the role of the non-local-attention mechanism in the re-id process, we replaced the NLA module in MDFE-Net with a spatial attention module (SAM), a channel attention module (CAM), and a convolutional block
Non_layers Numbers of NLA mAP mPSO Rank1
[0,4,5,0] 1 79.59 42.58 96.39
[0,3,4,0] 3 80.33 43.47 97.01
[0,2,3,0] 5 79.90 42.83 96.69
Table 4. Ablation study on VeRi-776 (the location and number of NLA blocks in ResNet50).
Fig. 6. Loss of different methods.
Scientic Reports | (2024) 14:31386 7
| https://doi.org/10.1038/s41598-024-82755-3
www.nature.com/scientificreports/
Content courtesy of Springer Nature, terms of use apply. Rights reserved
attention module (CBAM), respectively, and conducted performance evaluations on the VeRi-776 dataset. The experimental results are presented in Table 6.
The experimental results in Table 6 illustrate that SAM and CAM exhibit suboptimal re-identification performance when compared with CBAM and NLA. This discrepancy arises because these two kinds of attention focus primarily on the spatial-dimension information and the channel-dimension information of the feature maps, respectively, leading to an inadequate representation of information. In contrast, both CBAM and NLA are designed to integrate spatial and channel information for enhanced learning. The architecture of CBAM consists of two independent sub-modules applied sequentially, which significantly increases the number of network parameters as well as the computational time, thereby heightening the risk of model overfitting. Conversely, by assessing correlations among different positions within a sequence, NLA bolsters the model's capacity to capture global information while reducing the number of parameters to a certain extent, thus enhancing its feature representation capability. Therefore, NLA performs the best among the four types of attention.
The impact of hyperparameter setup
From Tables 7 and 8, we can see that when the initial learning rate is 0.00035 and is decreased by a factor of 0.1 after the 10th and 50th epochs, with training ending at the 90th epoch, this hyperparameter setup allows our model to achieve its best performance.
Cross-validation
To further demonstrate the robustness and generalization of our method MDFE-Net, we match the vehicle images in the gallery against the vehicle images in the query set. This method is called cross-validation, as shown in Fig. 7. The results of the two validation methods are shown in Table 9.
Max-epoch Step-size mAP mPSO Rank1
70 [10,50] 78.31 42.85 95.83
70 [10,30] 79.86 42.80 96.90
90 [10,50] 80.33 43.47 97.01
90 [20,50] 79.19 43.95 96.31
120 [20,50] 80.20 43.26 96.42
Table 8. Performance comparison experiment under different training epochs and step-sizes on VeRi-776.
lr mAP mPSO Rank1
0.0001 78.44 41.98 96.07
0.00025 79.17 42.78 96.66
0.0003 79.94 42.54 96.60
0.00035 80.33 43.47 97.01
0.0005 77.50 38.01 95.89
Table 7. Performance comparison experiment under different learning rates (lr) on VeRi-776.
Attention type mAP mPSO Rank1
SAM 76.90 36.70 95.72
CAM 77.00 37.69 95.88
CBAM 79.67 43.20 96.70
NLA 80.33 43.47 97.01
Table 6. Ablation study on VeRi-776 (four kinds of attention modules).
r mAP mPSO Rank1
1 79.28 41.53 96.62
2 78.49 40.77 96.31
4 78.29 40.35 96.20
8 80.33 43.47 97.01
16 78.42 40.83 96.40
Table 5. Ablation study on VeRi-776 (channel scaling factor of NLA).
Scientic Reports | (2024) 14:31386 8
| https://doi.org/10.1038/s41598-024-82755-3
www.nature.com/scientificreports/
Content courtesy of Springer Nature, terms of use apply. Rights reserved
We can see that after applying cross-validation, the margins between the two verification methods are small. This result validates that MDFE-Net has good robustness and generalization to vehicles.
Comparisons with state-of-the-art methods
Table 10 shows that, compared with the other methods, our method MDFE-Net achieves the best performance on the three datasets. Specifically, MDFE-Net obtains 80.33% mAP and 97.01% Rank1 on VeRi-776, 86.58% mAP and 80.75% Rank1 on VRIC, 89.24% mAP and 83.66% Rank1 on VehicleID (Test800), 86.56% mAP and 80.78% Rank1 on VehicleID (Test1600), and 83.70% mAP and 77.88% Rank1 on VehicleID (Test2400). In order to ensure the rigor of the comparative experiment, we do not use the re-ranking method for any of our experimental results. The methods MAFEN21, URRNet39, DSN18, MBN43, and GRMS20 all introduce attention mechanisms based on ResNet50 to extract vehicle appearance features. Compared with these methods, the MDFE-Net proposed in this paper employs three techniques, NLA, center loss, and WRT loss, to perform comprehensive structural optimization on ResNet50, achieving the best re-identification accuracy. Notably, when compared to the three previous methods with the highest accuracy (TL + CL + SL42, MBN43, and GRMS20), MDFE-Net achieves improvements in mAP and Rank1 on the VeRi-776 dataset of 0.03% and 0.71%, respectively. On the VRIC dataset, mAP and Rank1 increase by 2.01% and 0.78%, while for VehicleID (Test800), they rise by 1.54% and 0.26%. Furthermore, mAP and Rank1 on VehicleID (Test1600) improve by 2.30% and 1.88%, respectively; similarly, for VehicleID (Test2400), there are increases of 2.01% in mAP and 1.61% in Rank1. These results prove the powerful feature learning and representation capability of MDFE-Net.
Visualization of retrieval result and computation time
Figure 8 shows the retrieval results of DSN18, MBN43, GRMS20, and MDFE-Net on VeRi-776, VRIC, and
VehicleID (Test2400), respectively. We can see that facing the same vehicle situation, MDFE-Net can achieve
correct vehicle re-identication with fewer search rounds. On VRIC datasets with poor vehicle data quality,
the performance advantage of MDFE-Net is even more pronounced. Table 11 shows the training time and
reasoning time of the four methods DSN18, MBN43, GRMS20, and MDFE-Net on VeRi-776, VRIC and VehicleID
(Test2400). It can be seen from the comparison results that the learning and reasoning eciency of method
MDFE-Net is the highest on VeRi-776 and VRIC, and the eciency of method MBN43 is similar on VehicleID
(Test2400). It also reects the contribution of NLA, center loss and WRT loss to model eciency.
Conclusion
In this paper, we propose a multiple discriminative features extraction network to discover multiple personalized features of the vehicle. To locate multiple discriminative features, we introduce non-local attention, which can realize information interaction between long-distance features by calculating the relationship between any two features. In addition, we introduce a novel evaluation metric called mean positive sample occupancy (mPSO) to comprehensively evaluate re-id models. mPSO reflects the retrieval performance of a model more intuitively. Our ablation studies and comparative experiments show that our proposed method MDFE-Net outperforms a variety of state-of-the-art vehicle re-identification methods on the VeRi-776, VRIC and VehicleID datasets.
Method | VeRi-776 (mAP/mPSO/Rank1) | VRIC (mAP/mPSO/Rank1) | VehicleID Test800 (mAP/mPSO/Rank1) | VehicleID Test1600 (mAP/mPSO/Rank1) | VehicleID Test2400 (mAP/mPSO/Rank1)
MDFE-Net | 80.33/43.47/97.01 | 86.58/56.07/80.75 | 89.24/69.00/83.66 | 86.56/63.41/80.78 | 83.70/58.19/77.88
MDFE-Net* | 81.55/58.12/95.96 | 84.00/53.98/78.55 | 87.82/73.80/95.25 | 83.65/66.31/93.75 | 82.32/63.95/93.08
Table 9. Performance comparison with two validation methods. *denotes cross-validation.
Fig. 7. Illustration of two validation methods.
Scientic Reports | (2024) 14:31386 9
| https://doi.org/10.1038/s41598-024-82755-3
www.nature.com/scientificreports/
Content courtesy of Springer Nature, terms of use apply. Rights reserved
The number of currently available vehicle datasets is limited, and they present various challenges: vehicle feature extraction is susceptible to weather conditions (rain, snow, haze, and other harsh environments increase the difficulty of re-identification), and there are few vehicle types and a lack of emerging vehicle groups such as electric vehicles, which leads to a significant gap between real scenarios and the datasets. Therefore, constructing vehicle datasets that closely resemble the real environment is crucial for vehicle re-identification research. Additionally, most current vehicle re-identification tasks are based on supervised learning, which requires high labeling accuracy for the dataset and high labor costs, and yields unsatisfactory model domain adaptability. Therefore, designing better unsupervised algorithms to solve the problems of label creation and inter-domain differences will be one of the important research directions in the field of vehicle re-identification in the future.
Method | Backbone | VeRi-776 (mAP, Rank1) | VRIC (mAP, Rank1) | VehicleID Test800 (mAP, Rank1) | VehicleID Test1600 (mAP, Rank1) | VehicleID Test2400 (mAP, Rank1)
Siamese-Visual28 ResNet50 29.48 41.12 30.55
OIFE29 GoogLeNet 48.00 65.9 24.62 67.00
MSVR15 MobileNets 49.30 88.56 47.50 46.61 63.02
SCAN30 VGG16 49.87 82.24 65.44
FDA-Net31 Self-design 55.49 84.27 65.33 59.84 61.84 55.53
VAMI32 Self-design 61.32 85.92 43.80 30.50 63.12 52.87 47.34
RAM33 VGG_CNN_M 61.50 88.60 75.20 72.30 67.70
MSA34 ResNet50 62.89 92.07 80.31 77.55 77.11 74.41 75.55 72.91
AAVER35 ResNet101 66.35 90.17 74.69 68.62 63.54
BS36 Self-design 67.55 90.23 78.55 69.09 86.19 78.80 81.69 73.41 78.16 69.33
CCA37 ResNet50 68.05 91.71 78.89 75.51 76.53 73.60 73.11 70.08
TCL + SL38 Self-design 68.97 93.92 71.66 63.68 80.13 74.97 77.26 72.84 75.25 71.20
MAFEN21 ResNet50 71.00 95.53 77.18 76.07 72.94
URRNet39 ResNet50 72.20 93.10 76.50 73.70 68.20
MVAN40 VGG16 72.53 92.59 76.78 72.58
MsDeep41 ResNet50 74.50 95.10 84.30 81.20 81.00 78.00 78.60 75.60
DSN18 ResNet50 76.30 94.80 80.60 78.20 75.00
TL + CL + SL42 Self-design 76.95 93.62 84.57 78.37 86.84 81.36 83.71 77.94 81.69 76.27
MBN43 ResNet50 77.12 96.30 82.75 79.97 87.70 81.96 84.26 77.85 80.87 74.07
GRMS20 ResNet50 80.30 95.80 83.40 78.90 75.60
MDFE-Net ResNet50 80.33 97.01 86.58 80.75 89.24 83.66 86.56 80.78 83.70 77.88
Table 10. Performance comparison with state-of-the-art methods. '–' indicates a suboptimal result.
Scientic Reports | (2024) 14:31386 10
| https://doi.org/10.1038/s41598-024-82755-3
www.nature.com/scientificreports/
Content courtesy of Springer Nature, terms of use apply. Rights reserved
Fig. 8. The top-10 rank comparisons of visualized results of the state-of-the-art methods on the VeRi-776, VRIC and VehicleID (Test2400) datasets. The green and red boxes represent correctly matched vehicles and wrongly matched vehicles, respectively.
Scientic Reports | (2024) 14:31386 11
| https://doi.org/10.1038/s41598-024-82755-3
www.nature.com/scientificreports/
Content courtesy of Springer Nature, terms of use apply. Rights reserved
Data availability
The datasets analysed during the current study are not publicly available because this research will be submitted for university scientific research achievements in the future, but they are available from the corresponding author on reasonable request.
Received: 3 August 2024; Accepted: 9 December 2024
References
1. Zheng, Z., Zheng, L. & Yang, Y. Pedestrian alignment network for large-scale person re-identification. IEEE T. Circ. Syst. Vid. 29, 3037–3045 (2018).
2. Ning, E., Wang, C., Zhang, H., Ning, X. & Tiwari, P. Occluded person re-identification with deep learning: a survey and perspectives. Expert Syst. Appl. 239, 122419 (2023).
3. Guo, X. et al. A novel dual-pooling attention module for UAV vehicle re-identification. Sci. Rep. 14, 2027 (2024).
4. Wang, Q. et al. Dual similarity pre-training and domain difference encouragement learning for vehicle re-identification in the wild. Pattern Recognit. 139, 109513 (2023).
5. Wei, X. S., Zhang, C. L., Liu, L., Shen, C. & Wu, J. Coarse-to-fine: A RNN-based hierarchical attention model for vehicle re-identification. In Proceedings of 14th Asian Conference on Computer Vision (ACCV) 575–591 (2018).
6. Guo, H., Zhu, K., Tang, M. & Wang, J. Two-level attention network with multi-grain ranking loss for vehicle re-identification. IEEE T. Image Process. 28, 4328–4338 (2019).
7. Lou, Y., Bai, Y., Liu, J., Wang, S. & Duan, L. Y. Embedding adversarial learning for vehicle re-identification. IEEE T. Image Process. 28, 3794–3807 (2019).
8. Yan, K., Tian, Y., Wang, Y., Zeng, W. & Huang, T. Exploiting multi-grain ranking constraints for precisely searching visually-similar vehicles. In Proceedings of IEEE International Conference on Computer Vision (ICCV) 562–570 (2017).
9. Liu, H., Tian, Y., Yang, Y., Pang, L. & Huang, T. Deep relative distance learning: tell the difference between similar vehicles. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2167–2175 (2016).
10. Zhang, Y., Liu, D. & Zha, Z. J. Improving triplet-wise training of convolutional neural network for vehicle re-identification. In Proceedings of IEEE International Conference on Multimedia and Expo (ICME) 1386–1391 (2017).
11. Wang, X., Girshick, R., Gupta, A. & He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 7794–7803 (2018).
12. Lu, H., Zou, X. & Zhang, P. Learning progressive modality-shared transformers for effective visible-infrared person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) 1835–1843 (2023).
13. Ye, M. et al. Deep learning for person re-identification: A survey and outlook. IEEE T. Pattern Anal. 44(6), 2872–2893 (2021).
14. Liu, X., Liu, W., Mei, T. & Ma, H. A deep learning-based approach to progressive vehicle re-identification for urban surveillance. In Proceedings of European Conference on Computer Vision (ECCV) 869–884 (2016).
15. Kanacı, A., Zhu, X. & Gong, S. Vehicle re-identification in context. In Proceedings of German Conference on Pattern Recognition (GCPR) 377–390 (2018).
16. Khan, S. D. & Ullah, H. A survey of advances in vision-based vehicle re-identification. Comput. Vision Image Underst. 182, 50–63 (2019).
17. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (2016).
18. Zhu, W. et al. A dual self-attention mechanism for vehicle re-identification. Pattern Recognit. 137, 109258 (2023).
19. Lee, S., Woo, T. & Lee, S. H. Multi-attention-based soft partition network for vehicle re-identification. J. Comput. Des. Eng. 10(2), 488–502 (2023).
20. Pang, X., Yin, Y. & Tian, X. Global relational attention with a maximum suppression constraint for vehicle re-identification. Int. J. Mach. Learn. Cybern. 15(5), 1729–1742 (2024).
21. Yu, Y. et al. Multi-attention guided and feature enhancement network for vehicle re-identification. J. Intell. Fuzzy Syst. 44(1), 673–690 (2023).
22. Iandola, F. N. et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016).
23. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. & Chen, L. C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 4510–4520 (2018).
24. Ma, N., Zhang, X., Zheng, H. T. & Sun, J. Shufflenet v2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV) 116–131 (2018).
25. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 1251–1258 (2017).
26. Szegedy, C., Ioffe, S., Vanhoucke, V. & Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) 1–12 (2017).
27. Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 4700–4708 (2017).
28. Shen, Y., Xiao, T., Li, H., Yi, S. & Wang, X. Learning deep neural networks for vehicle re-id with visual-spatio-temporal path proposals. In Proceedings of IEEE International Conference on Computer Vision (ICCV) 1900–1909 (2017).
29. Wang, Z. et al. Orientation invariant feature embedding and spatial temporal regularization for vehicle re-identification. In Proceedings of IEEE International Conference on Computer Vision (ICCV) 379–387 (2017).
Method | VeRi-776: Training time (h), Inference time (s) | VRIC: Training time (h), Inference time (s) | VehicleID (Test2400): Training time (h), Inference time (s)
DSN18 7.1 0.6576 9.4 0.3219 12.2 0.9905
MBN43 6.3 0.4349 8.73 0.2240 10.19 0.8318
GRMS20 5.7 0.4012 7.9 0.1112 11.3 0.9001
MDFE-Net 5.6 0.2765 7.2 0.0989 10.1 0.8979
Table 11. Comparison of computation time of the state-of-the-art methods. Inference time = TestingSize (img) ÷ BatchSize (img) × BatchTime (s).
Scientic Reports | (2024) 14:31386 12
| https://doi.org/10.1038/s41598-024-82755-3
www.nature.com/scientificreports/
Content courtesy of Springer Nature, terms of use apply. Rights reserved
30. Teng, S., Liu, X., Zhang, S. & Huang, Q. Scan: Spatial and channel attention network for vehicle re-identification. In Proceedings of Pacific Rim Conference on Multimedia 350–361 (2018).
31. Lou, Y., Bai, Y., Liu, J., Wang, S. & Duan, L. Veri-wild: A large dataset and a new method for vehicle re-identification in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 3235–3243 (2019).
32. Zhou, Y. & Shao, L. Aware attentive multi-view inference for vehicle re-identification. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 6489–6498 (2018).
33. Liu, X., Zhang, S., Huang, Q. & Gao, W. Ram: A region-aware deep model for vehicle re-identification. In Proceedings of IEEE International Conference on Multimedia and Expo (ICME) 1–6 (2018).
34. Zheng, A. et al. Multi-scale attention vehicle re-identification. Neural Comput. Appl. 32, 17489–17503 (2020).
35. Khorramshahi, P. et al. A dual-path model with adaptive attention for vehicle re-identification. In Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV) 6132–6141 (2019).
36. Kumar, R., Weill, E., Aghdasi, F. & Sriram, P. A strong and efficient baseline for vehicle re-identification using deep triplet embedding. J. Artif. Intell. Soft. 10, 27–45 (2020).
37. Peng, J. et al. Eliminating cross-camera bias for vehicle re-identification. Multimed. Tools Appl. 1–17 (2022).
38. He, X., Zhou, Y., Zhou, Z., Bai, S. & Bai, X. Triplet-center loss for multi-view 3d object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 1945–1954 (2018).
39. Qian, J., Pan, M., Tong, W., Law, R. & Wu, E. Q. URRNet: A unified relational reasoning network for vehicle re-identification. IEEE Trans. Veh. Technol. 72(9), 11156–11168 (2023).
40. Teng, S., Zhang, S., Huang, Q. & Sebe, N. Multi-view spatial attention embedding for vehicle re-identification. IEEE T. Circ. Syst. Vid. 31, 816–827 (2020).
41. Cheng, Y. et al. Multi-scale deep feature fusion for vehicle re-identification. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 1928–1932 (2020).
42. Wen, Y., Zhang, K., Li, Z. & Qiao, Y. A discriminative feature learning approach for deep face recognition. In Proceedings of European Conference on Computer Vision (ECCV) 499–515 (2016).
43. Rong, L. et al. A vehicle re-identification framework based on the improved multi-branch feature fusion network. Sci. Rep. 11, 1–12 (2021).
Author contributions
Conceptualization, L.B.; Methodology, L.R.; Validation, L.R.; Formal analysis, L.R.; Writing—original draft preparation, L.R.; Writing—review and editing, L.B.
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to L.R.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
© The Author(s) 2024
Scientic Reports | (2024) 14:31386 13
| https://doi.org/10.1038/s41598-024-82755-3
www.nature.com/scientificreports/
Content courtesy of Springer Nature, terms of use apply. Rights reserved