Staged Encoder Training for Cross-Camera Person Re-Identification
Zhi Xu (xuzhi@guet.edu.cn)
School of Computer Information and Security, Guilin University of Electronic Technology
Jiawei Yang
School of Computer Information and Security, Guilin University of Electronic Technology
Yuxuan Liu
School of Mechanical and Electrical Engineering, Guilin University of Electronic Technology
Longyang Zhao
School of Computer Information and Security, Guilin University of Electronic Technology
Jiajia Liu
Institute of Electronic and Electrical Engineering, Civil Aviation Flight University of China
Research Article
Keywords: Camera variation, Contrastive learning, Unsupervised, Person Re-identification
Posted Date: November 2nd, 2023
DOI: https://doi.org/10.21203/rs.3.rs-3511084/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License. 
Read Full License
Additional Declarations: No competing interests reported.
Staged Encoder Training for Cross-Camera Person Re-Identification
Zhi Xu1 · Jiawei Yang1 · Yuxuan Liu2 · Longyang Zhao1 · Jiajia Liu3
Received: date / Accepted: date
Abstract As a cross-camera retrieval problem, person Re-identification (ReID) suffers from image style variations caused by camera parameters, lighting, and other factors, which seriously affect recognition accuracy. To address this problem, this paper proposes a two-stage contrastive learning method that gradually reduces the impact of camera variations. In the first stage, we train an encoder for each camera using only images from that camera. This ensures that each encoder recognizes images from its own camera well while remaining unaffected by camera variations. In the second stage, we encode the same image with all trained encoders to generate a new combination coding that is robust to camera variations. We also use the Cross-Camera Encouragement [12] distance, which complements the advantages of the combination coding, to further mitigate the impact of camera variations. Our method achieves high accuracy on several commonly used person ReID datasets, e.g., 90.8% rank-1 accuracy and 85.2% mAP on Market-1501, outperforming recent unsupervised works by more than 12%. Code is available at https://github.com/yjwyuanwu/SET.
Keywords Camera variation · Contrastive learning · Unsupervised · Person Re-identification
1 Introduction
Given a query image, person Re-identification (ReID) aims to match the person across multiple non-overlapping cameras [12,20].
Corresponding author: Zhi Xu, e-mail: xuzhi@guet.edu.cn
1 School of Computer Information and Security, Guilin University of Electronic Technology, China.
2 School of Mechanical and Electrical Engineering, Guilin University of Electronic Technology, China.
3 Institute of Electronic and Electrical Engineering, Civil Aviation Flight University of China.
Fig. 1 An illustration: the generation of combination coding in the inter-contrast learning stage using encoders trained in the intra-contrast learning stage
In ReID scenarios, each identity may be recorded by multiple cameras with different parameters and environments; these factors change the appearance of the image, making it challenging to recognize the same identities.
In previous studies, researchers have addressed the above
challenges through supervised methods, mainly focusing on
finding appropriate mapping functions based on the data dis-
tribution of images captured by different cameras [14,9].
However, such approaches require annotated training sam-
ples to learn the camera transfer model and are only appli-
cable to small datasets. In recent years, researchers have fo-
cused on studying unsupervised domain adaptation (UDA)
methods [3,18, 28, 10, 5, 23] and purely unsupervised meth-
ods [11,17, 19, 27, 1] to address this issue. UDA is complex
to train and requires that the difference between the source
and target domains is not significant. In this paper, we focus
on the fully unsupervised approach, which uses only unla-
beled data in the target domain and is trained using the gen-
erated pseudo-labels.
In research on fully unsupervised methods, it is common
to use data augmentation to make the model robust to cam-
era variations [27]. Alternatively, in the training step, sam-
ples are clustered and pseudo-labelled, and then a model is
designed to extract features that are robust to camera varia-
tions [11,17, 19, 1]. Unlike previous methods, this paper fo-
cuses on the pseudo-label prediction step in the fully un-
supervised setting. Most pseudo-label prediction algorithms
follow a similar process, which includes feature extraction,
similarity computation, and assigning the same label to sim-
ilar samples for training. The feature similarity calculation
is a crucial step in this process. However, camera variations increase the intra-class distance for samples of the same identity, which significantly affects the reliability of the similarity results.
In this paper, we address the above issues by investigat-
ing a more reasonable distance computation for generating
pseudo-labels. Since it is easier to identify pedestrians with
the same identity in the same camera than in different cam-
eras, as shown in Fig. 2, we decompose the distance cal-
culation between sample encodings into two stages, grad-
ually searching for reliable pseudo-labels. These stages are
trained alternately to jointly optimize the backbone network.
In the first stage, i.e., intra-contrast learning stage, multiple
branches are trained together, with branch k using samples from camera k for training. Since the samples of each branch
come from a single camera and are not affected by camera
variations, the similarity computation in this stage is per-
formed directly using the encodings obtained from the back-
bone network and the encoder. The contrast learning method
used for training is discussed in detail in Sec. 3.2.
In the second stage, i.e., inter-contrast learning stage, we
use all samples in the training set to jointly train an addi-
tional encoder. Since the samples in the training set come
from different cameras, we must take camera variations into
account during this stage. Inspired by studies such as [19,
4], which show that the classification probability is more ro-
bust to the domain gap than raw features, we consider the
feature obtained from the backbone as the "raw feature". As
shown in Fig. 1, the encoders trained in the first stage for
each camera are used to obtain the combined encoding of the
samples as the "classification". Furthermore, to avoid misiden-
tifying samples from different identities as the same identity
when their combined encodings are close, we further explic-
itly reduced the sample distance between different cameras
using the Cross-Camera Encouragement [12]. The distance
between sample encodings in the second stage is composed
of the original encoding distance (d1), the combined encoding distance (d2), and the Cross-Camera Encouragement distance (d3). We also employed contrastive learning for training in this stage. d2 and d3 will be introduced in Sec. 3.5.
The proposed method decomposes the distance calcula-
tion between sample encodings into two stages, and grad-
ually finds reliable pseudo-labels. This is more reliable than directly predicting pseudo-labels across cameras, and it effectively alleviates the impact of camera variations. Code is available at https://github.com/yjwyuanwu/SET.
Our contributions can be summarized as follows:
– We propose a two-stage contrastive learning framework to optimise the image coding extraction process, where the two stages mutually promote each other's performance.
– The proposed method for similarity computation effectively alleviates the challenge of camera variations, in which d2 and d3 have complementary advantages.
– At the pseudo-labelling stage, we present a method for reprocessing pseudo-labels to address the issue of over-labelling.
– Our method achieves high accuracy on three commonly used person re-identification datasets. It provides insights into improved similarity calculation for fully unsupervised person ReID.
2 Related work
The proposed method is inspired by domain adaptation meth-
ods and effectively mitigates the impact of camera variations
in a fully unsupervised setting. The work on these two topics
will be introduced in the following two subsections.
2.1 Domain adaptation
Domain adaptation can be summarized into three categories:
GAN-based style transfer, finding features that are robust
to camera variations, and mutual training. Zhong et al. [26]
proposed a triplet training sample construction method us-
ing style transfer and non-overlapping person ReID datasets.
Wei et al. [18] introduced a GAN-based approach that trans-
fers task images to match the style of the target domain
dataset while preserving the label information from the sour-
ce domain. For research on finding robust features, Zheng et
al. [25] proposed a method to separate features into appear-
ance and structural features, and Zou et al. [28] explored do-
main adaptation using appearance features as domain-invari-
ant features. There are also studies [19,4] showing that the
classification probability is more robust to the domain gap
than raw features, and our work was inspired by this re-
search result. Other methods, such as MMT [5] and NRMT
[23], focus on reducing the impact of low-quality pseudo-
labels through mutual training [22] to improve the model’s
recognition accuracy.
2.2 Fully unsupervised person ReID
Fully unsupervised methods related to mitigating camera
variations mainly focus on three aspects: data augmenta-
tion, extracting features that are robust to camera variations,
Fig. 2 Overall flowchart. The whole training process is divided into two parts, intra-contrast learning and inter-contrast learning, which share the backbone structure and are represented by the upper and lower parts of the box, respectively. Both parts undergo two stages of training sequentially: (1) Initialization stage (indicated by the red line): the clustering results of image encodings are used for dictionary feature initialization and pseudo-label initialization of the samples. (2) Training stage (indicated by the thin arrow lines): the thin solid arrow lines update features in the dictionary, while the thin dashed arrow lines calculate the loss for the current stage and update the backbone and encoder. During testing, we encode the images using the backbone and encoder from the inter-contrast learning stage and compute the Euclidean distance between the encodings to obtain the final query results
and generating reliable pseudo-labels. Zhong et al. [27] pro-
posed a method to improve model accuracy through data en-
hancement and using label smoothing regularization (LSR)
loss. Chen et al. [1] extracted features from the statistical
information of different camera images and performed fea-
ture fusion to generate cross-camera invariant features. For
research on generating reliable pseudo-labels, Lin et al. [11]
considered each image as an individual sample and gradu-
ally grouped them based on sample similarity. Wang et al.
[17] formulated ReID as a multi-classification problem and
employed optimized similarity computation to enhance the
accuracy of pseudo-label prediction. The work most simi-
lar to our study is [19], which produces feature vectors that
withstand differences in cameras by utilizing classification
outcomes from various camera classifiers to mitigate the cam-
era disparity issue. In contrast, we use image encoding to
produce camera-robust composite encodings directly. Addi-
tionally, we use d3 (the Cross-Camera Encouragement distance) to compensate for the shortcomings of d2 (the combined encoding distance) and improve the model's optimiza-
tion by using a memory dictionary rather than a classifier,
resulting in a better reduction of intra-class distance of the
samples whilst expanding the inter-class distance. Our method is proven to be more effective on multiple datasets.
3 Proposed Method
3.1 Formulation
Given an unlabelled dataset χ, we can consider it to consist of multiple sub-datasets, denoted as χ = {χ^c}, c = 1 : C, where the superscript c indicates that all of the images in this sub-dataset are from camera c, and C represents the total number of cameras. Our task is to train a model on χ such that, for each query image q, the ReID model generates a feature encoding to retrieve the pedestrian images in the gallery set G that contain the same identity. In other words, the feature encoding of q should have a smaller distance to the encoding of a gallery image g with the same identity as q compared to the distances to other images in G. The task can be defined as follows:

g^{*} = \arg\min_{g \in G} \mathrm{dist}(r_g, r_q)    (1)

where r represents the image encoding extracted by the ReID model, and dist(·) is the distance metric.
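To make Eq. 1 concrete, here is a minimal NumPy sketch of the retrieval step; the function name and array shapes are illustrative, not from the paper:

```python
import numpy as np

def retrieve(query_code: np.ndarray, gallery_codes: np.ndarray) -> np.ndarray:
    """Rank gallery images by Euclidean distance to the query encoding (Eq. 1).

    query_code: (D,) encoding r_q of the query image.
    gallery_codes: (N, D) encodings r_g of the gallery images.
    Returns gallery indices sorted from most to least similar.
    """
    dists = np.linalg.norm(gallery_codes - query_code[None, :], axis=1)
    return np.argsort(dists)  # position 0 holds the predicted match g*
```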
This paper aims to generate more accurate pseudo-labels by reducing the effect of camera variations on the sample distance calculation during training, so as to guide the model training and enable the model to extract encodings that satisfy Eq. 1. The model comprises two stages. As shown
in Fig. 2, the first stage, the intra-contrast learning stage, uses multiple branches for joint training; each branch uses only one sub-dataset for training, and the loss of branch c can be expressed as the sum of the contrastive losses of all samples from camera c:

L^{c}_{intra} = \sum_{I \in \chi^{c},\, I \in H^{c}_{m}} L^{c}_{contrast}(f, m)    (2)

where f represents the feature of image I after extraction by the backbone, m is the corresponding pseudo-label, and H represents the set of pseudo-labels generated by clustering.
In the second stage, inter-contrast learning, we share the
parameters of the first stage backbone and train an additional
encoder. This stage uses the whole training set for training,
including images from different cameras. To minimize the
impact of camera variations, we propose a combination cod-
ing that is more robust to the camera variations. As shown in
Fig. 1, we use all encoders trained in the first stage to encode
the images separately, and then use these encodings to gen-
erate the combination coding R. The combination coding R_i for image x_i can be denoted as:

R_i = \left[ r^{1}_{i}, \cdots, r^{k}_{i}, \cdots, r^{C}_{i} \right]    (3)

where r^k_i is the coding of image x_i obtained by the encoder corresponding to the k-th camera. We use d2 (the combination coding distance) and d3 (the Cross-Camera Encouragement distance [12]) to reduce the impact of camera variations. The distance between any two images x_i and x_j in the inter-contrast learning stage is represented as follows:

D(x_i, x_j) = d_1(x_i, x_j) + \mu d_2(x_i, x_j) + d_3(x_i, x_j)    (4)

where d_1(·) represents the Euclidean distance of the image coding. We use the clustering result H to calculate the loss in the inter-contrast learning phase to optimise the extraction of the coding r, i.e.,

L_{inter} = \sum_{I \in H_{m}} L_{contrast}(r, m)    (5)

In summary, these two stages share the backbone network while having their own encoders with the same structure, and the two stages are trained alternately. The d_2(·) and d_3(·) mentioned above are explained in detail in Sec. 3.5.
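As a sketch under stated assumptions (per-camera encoders treated as plain callables, and concatenation as the combination operator implied by Eq. 3), the combination coding can be assembled as follows; the stand-in linear encoders are purely illustrative:

```python
import numpy as np

def combination_coding(f: np.ndarray, encoders) -> np.ndarray:
    """Assemble the combination coding R_i of Eq. 3: the backbone feature f of
    one image is passed through every per-camera encoder trained in stage one,
    and the C resulting codings [r_i^1, ..., r_i^C] are concatenated."""
    return np.concatenate([E(f) for E in encoders])

# usage with stand-in linear encoders (illustrative only; C = 6 cameras)
rng = np.random.default_rng(0)
encoders = [lambda f, W=rng.standard_normal((4, 8)): W @ f for _ in range(6)]
R = combination_coding(rng.standard_normal(8), encoders)  # shape (24,)
```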
3.2 Contrast learning
Both stages of the model are trained using contrast learning,
including the initialization and training stages. In the ini-
tialization stage, samples are passed through the backbone
network and the encoder to obtain sample encodings. Then,
similarity is computed to perform clustering by assigning
the same pseudo-label to samples belonging to the same
cluster. After Real-ranking, which will be described in Sec. 3.3, the average encoding of samples with the same pseudo-label is used to initialize the memory dictionary. Each of the cluster centroids stored in the memory dictionary can be represented as:
\phi_k = \frac{1}{|H_k|} \sum_{\phi_i \in H_k} \phi_i    (6)

where H_k represents the set of sample encodings for the k-th cluster. During the training process, the cluster centroid encoding φ_k in the memory dictionary is updated with the sample encoding r using Eq. 7:

\phi_k \leftarrow \lambda \phi_k + (1 - \lambda) r    (7)

where λ ∈ [0, 1) represents the momentum update factor. λ controls the consistency between the sample coding r and the corresponding cluster mean: when λ approaches 0, the cluster mean φ_k stays closest to the coding r of the latest training sample. The contrastive loss for a sample coded as r and with a pseudo-label of m can be expressed as:

L_{contrast}(r, m) = -\log \frac{\exp(r \cdot \phi_m / \tau)}{\sum_{k=1}^{K} \exp(r \cdot \phi_k / \tau)}    (8)

where τ is a temperature hyperparameter, {φ_1, φ_2, ..., φ_K} represents the cluster centroids stored in the memory dictionary, and K represents the number of clusters. The contrastive loss reduces the intra-class distance while increasing the inter-class distance, which improves the discriminative ability of the model. The losses of all samples are used to update the backbone and encoder.
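The memory update of Eq. 7 and the loss of Eq. 8 can be sketched in a few lines of NumPy; λ = 0.99 follows Sec. 4.2, while the temperature τ = 0.05 is an assumed value, since the paper does not report it:

```python
import numpy as np

def update_centroid(phi_k: np.ndarray, r: np.ndarray, lam: float = 0.99) -> np.ndarray:
    """Momentum update of one cluster centroid in the memory dictionary (Eq. 7)."""
    return lam * phi_k + (1.0 - lam) * r

def contrast_loss(r: np.ndarray, centroids: np.ndarray, m: int, tau: float = 0.05) -> float:
    """Cluster-level contrastive loss of Eq. 8 for a single sample.

    r: (D,) sample encoding; centroids: (K, D) cluster centroids phi_1..phi_K
    from the memory dictionary; m: the sample's pseudo-label.
    """
    logits = centroids @ r / tau   # r . phi_k / tau for every centroid
    logits -= logits.max()         # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[m]))
```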
3.3 Real-ranking
We use a top-down hierarchical clustering method, which requires the number of clusters M to be specified at the outset. When the number of samples is small, the resulting number of clusters K may be fewer than the specified quantity. Nevertheless, the cluster labels m are randomly assigned by the clustering algorithm within a range bounded by M, i.e., it may produce the problem of over-labelling:

m = \mathrm{random}(M) > K    (9)
Since the cluster means in the memory dictionary are
stored in order of label, when an over-label problem occurs,
the cluster centroid corresponding to label m that exceeds
the actual number of clusters cannot be found in the memory
dictionary, which leads to the inability to compute the loss
by using Eq. 8. To solve this problem, as shown in Fig. 2,
we propose a method called Real-ranking to redistribute the
pseudo-labels by ranking them after clustering. The ranking
position of the given sample’s pseudo-label is then used as
its final pseudo-label, guaranteeing that no pseudo-label ex-
ceeds the actual number of classifications.
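A minimal sketch of one plausible reading of Real-ranking: each surviving cluster label is replaced by its rank among the labels that actually occur, so the final labels always lie in 0..K−1 (the exact tie-breaking used in the paper may differ):

```python
import numpy as np

def real_ranking(pseudo_labels: np.ndarray) -> np.ndarray:
    """Re-map pseudo-labels to the rank of each label among those in use,
    so no label exceeds the actual number of clusters K (Sec. 3.3).

    Example: [7, 2, 7, 5] -> [2, 0, 2, 1].
    """
    ranks = {lbl: i for i, lbl in enumerate(np.unique(pseudo_labels))}
    return np.array([ranks[lbl] for lbl in pseudo_labels])
```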
3.4 Intra-contrast learning
As illustrated in Fig. 2, we employ multiple branches for
joint training in the intra-contrast learning stage. According
to Eq. 8, we can derive the contrastive loss for the branch c
mentioned in Sec. 3.1 as follows:
L^{c}_{contrast}(f, m) = L_{contrast}(E(\theta_c, f), m) = -\log \frac{\exp(E(\theta_c, f) \cdot \phi_m / \tau)}{\sum_{k=1}^{K} \exp(E(\theta_c, f) \cdot \phi_k / \tau)}    (10)

where E(θ_c, ·) represents the encoder with parameters θ_c. The loss of the intra-contrast learning stage is equal to the sum of the losses of all branches in this stage and can be formulated as:

L_{intra} = \sum_{c=1}^{C} L^{c}_{intra}    (11)
Eq. 11 effectively improves the discriminative ability of the
encodings extracted by each camera encoder. In addition,
the optimization of multiple branches also improves the dis-
criminative ability of the model for images from different
cameras.
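Reusing contrast_loss from the sketch in Sec. 3.2, the intra-contrast stage loss of Eqs. 10–11 can be outlined as follows; the container names are assumptions for illustration:

```python
def intra_loss(batches_by_cam, encoders, memories, tau: float = 0.05) -> float:
    """Intra-contrast stage loss (Eqs. 10-11): branch c sees only backbone
    features from camera c and contrasts them against its own memory dictionary.

    batches_by_cam[c]: iterable of (f, m) pairs from camera c, where f is the
    backbone feature and m the pseudo-label; encoders[c] implements E(theta_c, .);
    memories[c] is the (K, D) centroid array of branch c.
    """
    total = 0.0
    for c, batch in enumerate(batches_by_cam):              # one branch per camera
        for f, m in batch:
            r = encoders[c](f)                              # E(theta_c, f)
            total += contrast_loss(r, memories[c], m, tau)  # Eq. 10
    return total                                            # Eq. 11: sum over branches
```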
3.5 Inter-contrast learning
In the inter-contrast learning stage, the encoding distance
between samples is determined using Eq. 4. Due to camera
variations, the encoding distance between different samples
of the same identity tends to increase. Therefore, we introduce d2 into the encoding distance calculation for samples from distinct cameras. d2 can be calculated as follows:

d_2(x_i, x_j) = \begin{cases} 0, & c_i = c_j \\ J(R_i, R_j), & c_i \neq c_j \end{cases}    (12)
where J(·) represents the Jaccard distance; the Jaccard distance between two samples is smaller when their combination codings are more similar. The Jaccard distance of the combination coding is calculated as:

J(R_i, R_j) = 1 - \frac{R_i \wedge R_j}{R_i \vee R_j}    (13)

where ∧ indicates that the combination coding R takes the smaller value at the corresponding location, and ∨ indicates that it takes the larger value. In order to prevent samples with different identities whose combination codings are similar from being mistakenly recognised as the same identity, we use d3 to further reduce the effect of camera variations. The d3 distance can be denoted as:

d_3(x_i, x_j) = \begin{cases} -\lambda_c, & c_i \neq c_j \\ 0, & c_i = c_j \end{cases}    (14)
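Putting Eqs. 4 and 12–14 together, a sketch of the stage-two pairwise distance follows. It assumes the ∧/∨ of Eq. 13 are summed elementwise minima/maxima over non-negative codings, and that d3 enters with a negative sign, consistent with Sec. 4.4's description of λc as explicitly decreasing cross-camera distances; µ = 0.02 and λc = 0.04 follow the values chosen there:

```python
import numpy as np

def inter_distance(r_i, r_j, R_i, R_j, cam_i, cam_j, mu=0.02, lam_c=0.04):
    """Pairwise distance D of Eq. 4 used for clustering in stage two.

    r_i, r_j: image encodings from the backbone + encoder;
    R_i, R_j: non-negative combination codings of Eq. 3.
    """
    d1 = np.linalg.norm(r_i - r_j)     # Euclidean coding distance
    if cam_i == cam_j:
        return d1                      # same camera: d2 = d3 = 0 (Eqs. 12, 14)
    d2 = 1.0 - np.minimum(R_i, R_j).sum() / np.maximum(R_i, R_j).sum()  # Eq. 13
    return d1 + mu * d2 - lam_c        # Eq. 4 with d3 = -lam_c across cameras
```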
4 Experiment
4.1 Datasets and Evaluation Protocols
We evaluated our method on three widely-used person ReID
datasets, including Market-1501 [24], PersonX [16] and Du-
keMTMC-ReID [15]. The details of these three datasets are
summarized in Table 1. During training, we only utilized
the images and camera information from the training sets of
each dataset, without using any other annotation informa-
tion. Note that the camera ID is automatically obtained at the moment of capture and requires no human labeling. Performance is evaluated by the Cumulative Matching Characteristic (CMC) and mean Average Precision (mAP).
4.2 Implementation details
To ensure a fair comparison with other methods, we used a
pre-trained ResNet50 [7] on ImageNet [2] as the backbone
network for feature extraction. After layer 5, we removed all sub-module layers and added a batch normalisation layer [8]; the combination of these two layers serves as the encoder and produces a 2048-dimensional coding. During testing
and clustering, we calculated the similarity between sam-
ples using the encodings obtained after passing through the
backbone and the encoder.
During training, the input images are resized to 256 ×
128. In each round, we perform intra-contrast learning and
inter-contrast learning in sequence. The training consists of
50 rounds. We use the Adam optimizer to train both stages
of the re-ID model with a weight decay of 0.0005. The initial learning rate is lr = 0.00035, which decays to 1/10 of its previous value every 20 rounds. The momentum update factor λ = 0.99. Every mini-batch contains 256 images of 16 pseudo person identities (16 images per identity).
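A sketch of this optimizer and schedule in PyTorch (assumed framework; the encoder head and the two-stage training body are omitted for brevity):

```python
import torch
import torchvision

# backbone as in this section: ResNet50 pre-trained on ImageNet
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
optimizer = torch.optim.Adam(model.parameters(), lr=3.5e-4, weight_decay=5e-4)
# decay the learning rate to 1/10 of its value every 20 rounds
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for round_idx in range(50):  # 50 training rounds
    # ... intra-contrast learning, then inter-contrast learning ...
    scheduler.step()
```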
Dataset          # train IDs  # train images  # test IDs  # query images  # total images  # cameras
Market-1501      751          12,936          750         3,368           32,668          6
PersonX          410          9,840           856         5,136           45,792          6
DukeMTMC-ReID    702          16,522          702         2,228           36,441          8
Table 1 Statistics of datasets used in the experimental section
Methods        Market1501                                DukeMTMC-ReID
               source   mAP    Rank-1  Rank-5  Rank-10   source   mAP    Rank-1  Rank-5  Rank-10
SPGAN [3]      Duke     26.9   58.1    76.0    82.7      Market   26.4   46.6    62.6    68.5
HHL [26]       Duke     31.4   62.2    78.8    84.0      Market   27.2   46.9    61.0    66.7
DGNet++ [28]   Duke     61.7   82.1    90.2    92.7      Market   61.8   78.9    87.8    90.4
PDA-Net [10]   Duke     47.6   75.2    86.3    90.2      Market   45.1   63.2    77.0    82.5
NRMT [23]      Duke     71.7   87.8    94.6    96.5      Market   62.2   77.8    86.9    89.5
MMT [5]        Duke     71.2   87.7    94.9    96.9      Market   63.1   76.8    88.0    92.2
BUC [11]       None     38.3   66.2    79.6    84.5      None     22.1   40.4    52.5    58.2
HCT [21]       None     56.4   80.0    91.6    95.2      None     50.7   69.6    83.4    87.4
MMCL [17]      None     45.5   80.3    89.4    92.3      None     40.2   65.2    75.9    80.0
IICS [19]      None     72.9   89.5    95.2    97.0      None     64.4   80.0    89.0    91.6
Ours           None     85.2   90.8    94.4    95.8      None     71.1   80.2    85.9    88.7
Table 2 Experiments on the Market-1501 and DukeMTMC-ReID datasets: comparison with recent person ReID methods, including domain adaptation methods and fully unsupervised methods. "None" denotes a fully unsupervised method; other values give the source domain dataset of the domain adaptation method
Methods     PersonX
            source   mAP    Rank-1  Rank-5  Rank-10
MMT [5]     Market   78.9   90.6    96.8    98.2
SPCL [6]    None     72.3   88.1    96.6    98.3
Ours        None     91.8   94.5    97.6    98.6
Table 3 Experiments on the PersonX dataset. "None" denotes a fully unsupervised method; other values give the source domain dataset of the domain adaptation method
For every round of training, we train the model for two
epochs at both stages. We use the standard hierarchical clustering method [13]. As done in [19], we set the number of clusters for each camera to 600 in the intra-contrast learning stage and to 800 in the inter-contrast learning stage.
4.3 Comparison with State-of-the-arts
We compare recent fully unsupervised methods and domain adaptation methods on Market-1501 [24], PersonX [16], and DukeMTMC-ReID [15]. The results of the comparison are summarised in Table 2 and Table 3. First, we compare domain adaptation methods, including methods that perform style transfer via GAN (e.g., SPGAN [3]), methods that reduce the effect of the domain gap by disentangling features (e.g., DGNet++ [28]), and methods that reduce the effect of low-quality pseudo-labelling through mutual training (e.g., NRMT [23]).
These domain adaptation techniques depend on manu-
ally annotated labels from the source domain, whereas our
methodology achieves better results even without such re-
liance. We also compared our method with some fully unsupervised methods (e.g., BUC [11]), and it is clear that our approach outperforms most of these methods on various metrics, owing to the more reliable calculation of the sample encoding distances used in the clustering process.
4.4 Ablation Studies
The impact of individual components. In this section, we
evaluate the effectiveness of the two stages of intra-contrast
learning and inter-contrast learning in our method. The ex-
perimental results are summarised in Table 5. As shown in
the table, relying solely on inter-contrast learning for train-
ing leads to poor performance, indicating that the distance
calculations between samples from different cameras are un-
reliable. On the other hand, when only intra-contrast learn-
ing is used, the rank-1 accuracy on the Market-1501 and Per-
sonX datasets can reach 86.9% and 93.0% respectively. This
shows that the distance calculation of the sample coding is
more accurate when it is not influenced by camera varia-
tions. However, without considering the distribution gap be-
tween the cameras, the addition of the inter-contrast learn-
ing stage results in a decrease in performance on PersonX.
This shows that although the sample coding produced by
the model improves after the intra-contrast learning stage,
the calculation of distances between samples from different
Dataset          Market-1501        PersonX
Settings         mAP    Rank-1      mAP    Rank-1
d1               83.1   87.4        88.0   92.5
d1 + µd2         84.6   90.1        91.0   93.6
d1 + d3          84.1   89.7        90.0   92.9
d1 + µd2 + d3    85.2   90.8        91.8   94.5
Table 4 Effect of using different parts of Eq. 4 in stage 2 on the results
Fig. 3 Parameter analysis on Market-1501
cameras remains unreliable. When we use Eq. 4 to calcu-
late the sample coding distance in the inter-contrast learn-
ing stage, there is a significant improvement in accuracy,
demonstrating that our proposed distance calculation method
successfully mitigates the effects of camera variations on
sample distance calculations.
The impact of different partial distances. In this section,
we investigate the effectiveness of the d2 and d3 distances
in Eq. 4. The experimental results are summarised in Ta-
ble 4. Taking the experimental results on the Market1501
dataset as an example, when we use d1 alone to calculate the sample coding distance, the rank-1 accuracy is only 87.4%. However, when we use d2 or d3 in addition to d1 for the sample encoding distance calculation, the rank-1 accuracy improves to 90.1% and 89.7%, respectively, indicating that both can reduce the effect of camera variations on the distance calculation. Furthermore, when we calculate the sample encoding distance using d1, d2, and d3 simultaneously, the rank-1 accuracy further improves to 90.8%. This suggests that d2 and d3 improve accuracy individually, and their advantages complement each other.
d2 compensates for d3's shortcoming of treating all inter-camera variations as equal, while d3 explicitly reduces inter-sample coding distances, compensating for d2's shortcoming of misidentifying pedestrians with different identities, whose combination codings are close to each other, as the same identity.
Influence of hyper-parameters. In this section, we investigate the effect of two important hyperparameters, µ and λc, as shown in Fig. 3. The parameter µ regulates the importance of d2. Increasing µ from 0 to 0.02 increases both mAP and Rank-1. However, raising µ further leads to a decline in mAP and Rank-1 to varying extents. Therefore, we set µ to 0.02.
Dataset                      Market-1501        PersonX
Settings                     mAP    Rank-1      mAP    Rank-1
Stage 1                      83.0   86.9        88.3   93.0
Stage 2*                     81.8   85.7        76.9   88.1
Stage 1 + Stage 2*           83.1   87.4        88.0   92.5
Stage 1 + Stage 2 + Eq. 4    85.2   90.8        91.8   94.5
Table 5 Ablation study on individual components. Stage 1 denotes the intra-contrast learning stage. Stage 2 denotes the inter-contrast learning stage. * denotes that only d1 in Eq. 4 is used in stage 2
The parameter λc is used to explicitly decrease the encoding distance between samples from different cameras. It can be observed that when λc is increased to 0.04, both mAP and Rank-1 reach their optimal values; further increasing λc produces a negative effect.
5 Conclusion
This paper introduces a two-stage contrastive learning approach for unsupervised person ReID, which aims to mitigate the impact of camera variations by improving the encoding distance calculation across cameras. First, in the intra-contrast learning stage, multi-branching is utilized to train an individual encoder for each camera separately. Subsequently, in the inter-contrast learning stage, the encoding results of all encoders are combined to generate a combination coding that is more robust to camera variations. The sample encoding distance is calculated by considering d1 (the original distance), d2 (the combination coding distance), and d3 (the Cross-Camera Encouragement distance). Extensive experiments have demonstrated the effectiveness of our proposed method in unsupervised person ReID tasks.
Declarations
Ethical approval Not applicable.
Funding This work was supported by the Guangxi Natural Science Foundation (No. 2020GXNSFAA297186), the Jiangsu Province Agricultural Science and Technology Innovation and Promotion Special Project (No. NJ2021-21), the Guilin Key Research and Development Program (No. 20210206-1), the Guangxi Key Laboratory of Precision Navigation Technology and Application (No. DH202227), and the Guangxi Key Laboratory of Image and Graphic Intelligent Processing (No. GIIP2301). There are no financial conflicts of interest to disclose.
Availability of data and materials The datasets are available at https://virutalbuy-public.oss-cn-hangzhou.aliyuncs.com/share/data.zip. Code is available at https://github.com/yjwyuanwu/SET.
References
1. Chen, Y.C., Zhu, X., Zheng, W.S., Lai, J.H.: Person re-identification by camera correlation aware feature augmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(2), 392–408 (2017)
2. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
3. Deng, W., Zheng, L., Ye, Q., Kang, G., Yang, Y., Jiao, J.: Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 994–1003 (2018)
4. Dou, Q., Coelho de Castro, D., Kamnitsas, K., Glocker, B.: Domain generalization via model-agnostic learning of semantic features. Advances in Neural Information Processing Systems 32 (2019)
5. Ge, Y., Chen, D., Li, H.: Mutual mean-teaching: Pseudo label refinery for unsupervised domain adaptation on person re-identification. arXiv preprint arXiv:2001.01526 (2020)
6. Ge, Y., Zhu, F., Chen, D., Zhao, R., et al.: Self-paced contrastive learning with hybrid memory for domain adaptive object re-id. Advances in Neural Information Processing Systems 33, 11309–11321 (2020)
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
8. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456. PMLR (2015)
9. Javed, O., Shafique, K., Rasheed, Z., Shah, M.: Modeling inter-camera space–time and appearance relationships for tracking across non-overlapping views. Computer Vision and Image Understanding 109(2), 146–162 (2008)
10. Li, Y.J., Lin, C.S., Lin, Y.B., Wang, Y.C.F.: Cross-dataset person re-identification via unsupervised pose disentanglement and adaptation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7919–7929 (2019)
11. Lin, Y., Dong, X., Zheng, L., Yan, Y., Yang, Y.: A bottom-up clustering approach to unsupervised person re-identification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 8738–8745 (2019)
12. Lin, Y., Xie, L., Wu, Y., Yan, C., Tian, Q.: Unsupervised person re-identification via softened similarity learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3390–3399 (2020)
13. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research 12, 2825–2830 (2011)
14. Porikli, F.: Inter-camera color calibration by correlation model function. In: Proceedings 2003 International Conference on Image Processing (Cat. No. 03CH37429), vol. 2, pp. II–133. IEEE (2003)
15. Ristani, E., Solera, F., Zou, R., Cucchiara, R., Tomasi, C.: Performance measures and a data set for multi-target, multi-camera tracking. In: European Conference on Computer Vision, pp. 17–35. Springer (2016)
16. Sun, X., Zheng, L.: Dissecting person re-identification from the viewpoint of viewpoint. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 608–617 (2019)
17. Wang, D., Zhang, S.: Unsupervised person re-identification via multi-label classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10981–10990 (2020)
18. Wei, L., Zhang, S., Gao, W., Tian, Q.: Person transfer GAN to bridge domain gap for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 79–88 (2018)
19. Xuan, S., Zhang, S.: Intra-inter camera similarity for unsupervised person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11926–11935 (2021)
20. Ye, M., Shen, J., Lin, G., Xiang, T., Shao, L., Hoi, S.C.: Deep learning for person re-identification: A survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(6), 2872–2893 (2021)
21. Zeng, K., Ning, M., Wang, Y., Guo, Y.: Hierarchical clustering with hard-batch triplet loss for person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13657–13665 (2020)
22. Zhang, Y., Xiang, T., Hospedales, T.M., Lu, H.: Deep mutual learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4320–4328 (2018)
23. Zhao, F., Liao, S., Xie, G.S., Zhao, J., Zhang, K., Shao, L.: Unsupervised domain adaptation with noise resistible mutual-training for person re-identification. In: Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI, pp. 526–544. Springer (2020)
24. Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., Tian, Q.: Scalable person re-identification: A benchmark. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1116–1124 (2015)
25. Zheng, Z., Yang, X., Yu, Z., Zheng, L., Yang, Y., Kautz, J.: Joint discriminative and generative learning for person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2138–2147 (2019)
26. Zhong, Z., Zheng, L., Li, S., Yang, Y.: Generalizing a person retrieval model hetero- and homogeneously. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–188 (2018)
27. Zhong, Z., Zheng, L., Zheng, Z., Li, S., Yang, Y.: Camera style adaptation for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5157–5166 (2018)
28. Zou, Y., Yang, X., Yu, Z., Kumar, B.V., Kautz, J.: Joint disentangling and adaptation for cross-domain person re-identification. In: Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II, pp. 87–104. Springer (2020)