Figure - available from: International Journal of Computer Vision
Overall architecture for one-step methods based on Faster R-CNN (Ren et al., 2017). Black arrows denote the forward pass and colored arrows denote different supervision signals. The Region Proposal Network is omitted for simplicity.

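As a rough illustration of the one-step design described in the caption above, the sketch below attaches a re-id embedding head next to the detection heads on top of shared ROI features. It is a minimal, hypothetical PyTorch module (class and layer names are ours, not from the paper); the backbone and Region Proposal Network producing the ROI features are assumed to exist upstream, mirroring their omission in the figure.

```python
import torch
import torch.nn as nn

class OneStepPersonSearchHead(nn.Module):
    """Minimal sketch: detection and re-id heads sharing ROI features.

    Hypothetical layer sizes; the backbone/RPN/ROI-Align stages that
    produce `roi_feats` are assumed to be provided elsewhere.
    """

    def __init__(self, roi_dim=2048, num_classes=2, embed_dim=256):
        super().__init__()
        # Detection branch: person-vs-background classification and box regression.
        self.cls_head = nn.Linear(roi_dim, num_classes)
        self.reg_head = nn.Linear(roi_dim, 4 * num_classes)
        # Re-id branch: projects the same ROI feature to an identity embedding.
        self.reid_head = nn.Linear(roi_dim, embed_dim)

    def forward(self, roi_feats):          # roi_feats: (num_rois, roi_dim)
        cls_logits = self.cls_head(roi_feats)
        box_deltas = self.reg_head(roi_feats)
        reid_embed = nn.functional.normalize(self.reid_head(roi_feats), dim=1)
        return cls_logits, box_deltas, reid_embed
```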

Source publication
Article
Full-text available
Person detection and Re-identification are two well-defined support tasks for practically relevant tasks such as Person Search and Multiple Person Tracking. Person Search aims to find and locate all instances with the same identity as the query person in a set of panoramic gallery images. Similarly, Multiple Person Tracking, especially when using t...

Similar publications

Article
Full-text available
Abstract: Multiple object tracking (MOT) frameworks based on a bifurcate strategy are usually challenged by data association across different model paths, which handle object localisation and appearance embedding independently. By incorporating re‐identification (re‐ID) as the appearance embedding model, more recent studies on task combination of a single...

Citations

... In this section, the proposed model is compared with SOTA methods such as BUFF [45], TCTS [46], BINet [47], NAE+ [48], TransReID [49] and DPM [50] in terms of network layers, model parameters, identification accuracy and test time. Table 2 shows that our method is not the best in recognition accuracy, but at similar accuracy the number of parameters in the proposed model is greatly reduced and its recognition time cost is the lowest. ...
Article
Full-text available
Person re‐identification aims to retrieve specific target pedestrians across non‐intersecting cameras. However, in real complex scenes, pedestrians are easily obscured, which makes the target pedestrian search task time‐consuming and challenging. To address pedestrians' susceptibility to occlusion, a person re‐identification method based on a deep compound eye network (CEN) and a pose repair module is proposed, which includes (1) a deep CEN based on multi‐camera logical topology, which adopts graph convolution and a Gated Recurrent Unit to capture the temporal and spatial information of pedestrian walking and finally carries out pedestrian global matching through a Siamese network; (2) an integrated spatial‐temporal information aggregation network designed to facilitate pose repair. The target pedestrian features under the multi‐level logic topology cameras are utilised as auxiliary information to repair the occluded target pedestrian image, so as to reduce the impact of pedestrian mismatch due to pose changes; (3) a joint optimisation mechanism of the CEN and the pose repair network, where multi‐camera logical topology inference provides auxiliary information and retrieval order for the pose repair network. The authors conducted experiments on multiple datasets, including Occluded‐DukeMTMC, CUHK‐SYSU, PRW, SLP, and UJS‐reID. The results indicate that the authors' method achieved strong performance across these datasets. Specifically, on the CUHK‐SYSU dataset, the authors' model achieved a top‐1 accuracy of 89.1% and a mean Average Precision of 83.1% in the recognition of occluded individuals.
... However, these methods encounter a challenging issue of detection and ReID conflicts during the training phase. To alleviate the discrepancies in the feature space, Chen et al. [59] introduce norm-aware embeddings (NAE) that simultaneously tackle detection and ReID. As depicted in Fig. 7b, NAE adopts a decomposition scheme, dividing the feature representation into a norm component for detection and an angle component for ReID. ...
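To make the norm–angle idea concrete, here is a minimal PyTorch sketch of the kind of decomposition NAE describes: the embedding's L2 norm is calibrated (here with BatchNorm1d, following our reading of the design) and squashed into a detection confidence, while the unit-length direction serves as the re-id feature. Layer choices and names are our assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class NormAwareDecomposition(nn.Module):
    """Sketch of a norm-aware embedding split (assumed simplification of NAE)."""

    def __init__(self, embed_dim=256):
        super().__init__()
        # Calibrates the scalar norm before it is mapped to a detection score.
        self.norm_bn = nn.BatchNorm1d(1)

    def forward(self, x):                              # x: (num_rois, embed_dim)
        norm = x.norm(p=2, dim=1, keepdim=True)        # magnitude -> detection cue
        angle = x / norm.clamp(min=1e-12)              # direction -> re-id feature
        det_score = torch.sigmoid(self.norm_bn(norm))  # person-vs-background confidence
        return det_score.squeeze(1), angle
```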
... We conduct a comparative evaluation of 34 recent state-of-the-art (SOTA) pipelines in the MOT community, including TransCenter [147], OC-SORT [14], HMM [183], CSTrack [24], PatchTrack [149], IQHAT [112], Sp_Con [143], MOTR [148], ByteTrack [18], NAE [59], TransMOT [25], TBC [111], ApLift [184], GMPHD-ReID [117], DAN [23], CRF-RNN [107], FairMOT [53], TransTrack [145], CenterTrack [137], JDE [57], CTracker [65], TubeTK [66], MPNTrack [97], FFT [185], MLT [186], Tracktor++ [68], FAMNet [130], CNNMTT [116], DCCRF [62], RAN [114], MOTDT [187], DeepSORT [13], Quad-CNN [188], POI [113], SORT [12]. Among these trackers, MOTDT, RAN, GMPHD-ReID, DAN, CNNMTT, DCCRF, POI, SORT, ByteTrack, OC-SORT, IQHAT, and DeepSORT are all based on the SDE paradigm, while the latter four are variants of SORT. ...
Article
Full-text available
Multiple object tracking (MOT), as a typical application scenario of computer vision, has attracted significant attention from both academic and industrial communities. With its rapid development, MOT has become a hot topic. However, maintaining robust MOT in complex scenarios still faces significant challenges, such as irregular motion patterns, similar appearances, and frequent occlusions. Based on an extensive investigation into the state of the art in MOT, this survey makes the following efforts: 1) listing preceding MOT approaches and current classifications; 2) surveying MOT metrics and benchmark databases; 3) evaluating frequently employed MOT approaches; 4) discussing the main challenges for MOT; and 5) putting forward potential directions for the development of future MOT approaches. By doing so, it strives to provide a systematic and comprehensive overview of existing MOT methods from SDE to TBA perspectives, thereby promoting further research into this emerging and important field.
... Comparison of Backbone Networks at the Efficiency Level. The proposed model is compared with methods such as BUFF [38], TCTS [39], BINet [40], and NAE+ [41] in terms of network layers, model parameters, identification accuracy, and test time. Table 2 shows that while our method may not achieve the highest recognition accuracy, it delivers comparable accuracy with significantly fewer model parameters. ...
Article
Full-text available
Pedestrian re-identification aims to identify the same target pedestrian among multiple non-overlapping cameras. However, in real scenarios, pedestrians often change their clothing features due to external factors such as weather and seasons, rendering traditional methods reliant on consistent clothing features ineffective. In this paper, we propose a Knowledge-Driven Cross-Period Network for Clothing Change Person Re-Identification, comprising three key components: (1) Knowledge-Driven Topology Inference Network: Leveraging knowledge graphs and graph convolution networks, this network captures spatio-temporal information between camera nodes. Knowledge embedding is introduced into the graph convolution network for effective topology inference. (2) Cross-Period Clothing Change Network: This network aggregates spatio-temporal information for clothing generation. By utilizing overall pedestrian clothing characteristics within logical topology cameras, it mitigates matching errors caused by external factors. (3) Joint Optimization Mechanism: A collaborative approach involving both the topology inference network and the cross-period clothing change network. Multi-camera logical topology offers auxiliary information and retrieval order for the clothing change network, while pedestrian re-identification results provide feedback to adjust the logical topology. Experimental analysis on the Celeb-ReID, PRCC, UJS-ReID, SLP, and DukeMTMC-ReID datasets demonstrates the effectiveness and robustness of our proposed model in addressing the challenges of pedestrian re-identification in scenarios involving changing clothing features.
... L_reid = L_r2 + L_nae (7), where L_r2 represents the classification loss, which is the same as L_r1. Following [42], the L_nae loss contains a classification loss and an identity classification loss [18]. L_c1 aims to ensure that the features extracted by the re-identification network have both high intra-class consistency and inter-class separability. ...
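A tiny sketch of how such a combined re-id objective might be assembled in code; this is purely illustrative of the quoted Eq. (7), and the individual loss terms here are placeholders rather than the citing paper's implementation:

```python
import torch
import torch.nn.functional as F

def reid_loss(logits_r2, labels, nae_cls_logits, nae_id_logits, id_labels):
    """Illustrative combination mirroring L_reid = L_r2 + L_nae (Eq. 7)."""
    l_r2 = F.cross_entropy(logits_r2, labels)             # classification loss (same form as L_r1)
    l_nae = (F.cross_entropy(nae_cls_logits, labels)       # detection/classification term
             + F.cross_entropy(nae_id_logits, id_labels))  # identity classification term
    return l_r2 + l_nae
```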
... We conducted a comparative study between MGCN and various state-of-the-art methods designed to solve the person search problem on the CUHK-SYSU and PRW datasets, which include DPM [16], MGTS [17], CLSA [49], RDLR [32], IGPN [50], TCTS [33], OIM [18], IAN [51], NPSM [52], RCAA [35], CTXG [53], QEEPS [19], HOIM [54], APNet [55], BINet [56], NAE [42], NAE+ [42], DMRNet [57], PGS [58], AlignPS [36], AlignPS+ [36], and SeqNet [37]. ...
Article
Full-text available
The key procedure in person search is to accurately identify pedestrians in complex scenes and effectively embed features from multiple vision cues. However, coordinating the two tasks in a unified framework remains difficult, leading to high computational overhead and unsatisfactory search performance. Furthermore, most methods do not take significant clues and key features of pedestrians into consideration. To remedy these issues, we introduce a novel method named Multi-Attention-Guided Cascading Network (MGCN) in this paper. Specifically, we obtain trusted bounding boxes through the detection head as label information for post-processing. Based on the end-to-end network, we demonstrate the advantages of jointly learning the bounding boxes and the attention modules by maximizing the complementary information from different attention modules, which achieves optimized person search performance. Meanwhile, an aligning module is imposed on the re-id feature extraction network to locate visual clues with semantic information, which restrains redundant background information. Extensive experimental results on the two benchmark person search datasets are provided to demonstrate that the proposed MGCN markedly outperforms state-of-the-art baselines.
... Although two-step models can obtain satisfactory results, the disentangled treatment of the two tasks is time- and resource-consuming. In contrast, the second category (Xiao et al., 2017; Liu et al., 2017; Chang et al., 2018; Munjal et al., 2019; Chen et al., 2021) provides a one-step solution that unifies detection and re-id in an end-to-end manner. As shown in Fig. 1b, one-step models first apply an ROI-Align layer to aggregate features in the detected bounding boxes. ...
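For reference, a minimal example of the ROI-Align aggregation step mentioned above, using torchvision's roi_align; the tensor shapes, box coordinates, and the 1/16 feature stride are illustrative assumptions:

```python
import torch
from torchvision.ops import roi_align

# Backbone feature map for a batch of 1 image: (N, C, H, W); stride-16 features assumed.
feat = torch.randn(1, 256, 50, 80)

# Detected boxes in image coordinates: one (num_boxes, 4) tensor per image, (x1, y1, x2, y2).
boxes = [torch.tensor([[32.0, 48.0, 96.0, 240.0],
                       [400.0, 60.0, 470.0, 300.0]])]

# Pool a fixed-size 14x14 feature per box; spatial_scale maps image coords to feature coords.
roi_feats = roi_align(feat, boxes, output_size=(14, 14),
                      spatial_scale=1.0 / 16, aligned=True)
print(roi_feats.shape)  # torch.Size([2, 256, 14, 14])
```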
... Specifically, a joint framework enabling end-to-end training of detection and re-id was proposed by stacking a re-id embedding layer after the detection features and proposing the Online Instance Matching (OIM) loss. So far, a number of improvements (Liu et al., 2017; Xiao et al., 2019; Chang et al., 2018; Yan et al., 2019; Munjal et al., 2019; Dong et al., 2020a; Chen et al., 2021) have been made based on this framework. In general, two-step models may achieve better performance, while one-step models have the advantages of simplicity and efficiency. ...
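As a concrete, simplified illustration of the OIM idea referenced here, the sketch below keeps a lookup table of identity prototypes, scores each normalized embedding against the table with a temperature-scaled softmax, and updates the matched prototype with a momentum rule. The circular queue for unlabeled identities used in the original formulation is omitted, and all hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F

class SimplifiedOIM(torch.nn.Module):
    """Simplified Online Instance Matching loss (labeled lookup table only)."""

    def __init__(self, num_ids, embed_dim=256, momentum=0.5, temperature=0.1):
        super().__init__()
        self.register_buffer("lut", torch.zeros(num_ids, embed_dim))  # one prototype per identity
        self.momentum = momentum
        self.temperature = temperature

    def forward(self, embeddings, id_labels):
        # embeddings: (B, D) re-id features; id_labels: (B,) identity indices.
        embeddings = F.normalize(embeddings, dim=1)
        logits = embeddings @ self.lut.t() / self.temperature  # similarity to every identity
        loss = F.cross_entropy(logits, id_labels)
        # Momentum update of the matched prototypes (no gradient through the table).
        with torch.no_grad():
            for emb, idx in zip(embeddings, id_labels):
                self.lut[idx] = F.normalize(
                    self.momentum * self.lut[idx] + (1 - self.momentum) * emb, dim=0)
        return loss
```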
... We compare our framework with state-of-the-art one-step models (Xiao et al., 2017, 2019; Liu et al., 2017; Chang et al., 2018; Chen et al., 2020a; Munjal et al., 2019; Dong et al., 2020a; Chen et al., 2021; Kim et al., 2021; Zhang et al., 2021b; Li and Miao, 2021) and two-step models (Chen et al., 2020b; Lan et al., 2018; Han et al., 2019a; Dong et al., 2020b; Wang et al., 2020a). ...
Article
Full-text available
Person search aims to simultaneously localize and identify a query person from uncropped images. To achieve this goal, state-of-the-art models typically add a re-id branch upon two-stage detectors like Faster R-CNN. Owing to the ROI-Align operation, this pipeline yields promising accuracy as re-id features are explicitly aligned with the corresponding object regions, but in the meantime, it introduces high computational overhead due to dense object anchors. In this work, we present an anchor-free approach to efficiently tackle this challenging task, by introducing the following dedicated designs. First, we select an anchor-free detector (i.e., FCOS) as the prototype of our framework. Due to the lack of dense object anchors, it exhibits significantly higher efficiency compared with existing person search models. Second, when directly accommodating this anchor-free detector for person search, there exist several misalignment issues at different levels (i.e., scale, region, and task). To address these issues, we propose an aligned feature aggregation module to generate more discriminative and robust feature embeddings. Accordingly, we name our framework the Feature-Aligned Person Search Network (AlignPS). Third, by investigating the advantages of both anchor-based and anchor-free models, we further augment AlignPS with an ROI-Align head, which significantly improves the robustness of re-id features while still keeping our model highly efficient. Our framework not only achieves state-of-the-art or competitive performance on two challenging person search benchmarks, but can also be extended to other challenging search tasks such as animal and object search. All the source codes, data, and trained models are available at: https://github.com/daodaofr/alignps.
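To illustrate the anchor-free direction described in this abstract (not the authors' actual AlignPS modules), here is a hypothetical FCOS-style head that predicts, at every feature-map location, a person score, box offsets, and an L2-normalized re-id embedding; layer names and sizes are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnchorFreeSearchHead(nn.Module):
    """Hypothetical FCOS-style head with an extra re-id embedding branch."""

    def __init__(self, in_channels=256, embed_dim=256):
        super().__init__()
        self.cls_conv = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)    # person score per location
        self.reg_conv = nn.Conv2d(in_channels, 4, kernel_size=3, padding=1)    # (l, t, r, b) distances
        self.embed_conv = nn.Conv2d(in_channels, embed_dim, kernel_size=3, padding=1)

    def forward(self, fpn_feat):                       # fpn_feat: (N, C, H, W)
        cls_map = self.cls_conv(fpn_feat).sigmoid()
        reg_map = F.relu(self.reg_conv(fpn_feat))      # distances are non-negative
        embed_map = F.normalize(self.embed_conv(fpn_feat), dim=1)
        return cls_map, reg_map, embed_map
```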
Article
Large-scale pre-training has proven to be an effective method for improving performance across different tasks. Current person search methods use ImageNet pre-trained models for feature extraction, yet this is not an optimal solution due to the gap between the pre-training task and the person search task (as a downstream task). Therefore, in this paper, we focus on pre-training for person search, which involves detecting and re-identifying individuals simultaneously. Although labeled data for person search is scarce, datasets for the two sub-tasks, person detection and re-identification, are relatively abundant. To this end, we propose a hybrid pre-training framework specifically designed for person search using sub-task data only. It consists of a hybrid learning paradigm that handles data with different kinds of supervision, and an intra-task alignment module that alleviates domain discrepancy under limited resources. To the best of our knowledge, this is the first work that investigates how to support full-task pre-training using sub-task data. Extensive experiments demonstrate that our pre-trained model can achieve significant improvements across diverse protocols, such as person search method, fine-tuning data, pre-training data and model backbone. For example, our model improves ResNet50-based NAE by a 10.3% relative improvement w.r.t. mAP. Our code and pre-trained models are released for plug-and-play usage by the person search community (https://github.com/personsearch/PretrainPS).
Article
Text-based person search is a challenging cross-modal retrieval task. Existing works reduce the inter-modality and intra-class gaps by aligning local features extracted from image and text modalities, which easily leads to mismatching problems due to the lack of annotation information. Besides, it is sub-optimal to reduce the two gaps simultaneously in the same feature space. This work proposes a novel joint token and feature alignment framework to reduce the inter-modality and intra-class gaps progressively. Specifically, we first build a dual-path feature learning network to extract features and conduct feature alignment to reduce the inter-modality gap. Second, we design a text generation module to generate token sequences using visual features, and then token alignment is performed to reduce the intra-class gap. Last, a fusion interaction module is introduced to further eliminate the modality heterogeneity using a strategy of multi-stage feature fusion. Extensive experiments on the CUHK-PEDES dataset demonstrate the effectiveness of our model, which significantly outperforms previous state-of-the-art methods.
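A minimal sketch of the dual-path feature-alignment idea described here, using generic image/text encoders and a symmetric contrastive objective; the projection sizes, temperature, and loss choice are assumptions rather than the paper's design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathAlignment(nn.Module):
    """Generic two-branch image/text alignment with a contrastive loss."""

    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=256, temperature=0.07):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)   # image branch projection
        self.txt_proj = nn.Linear(txt_dim, embed_dim)   # text branch projection
        self.temperature = temperature

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, img_dim) pooled visual features; txt_feats: (B, txt_dim) pooled text features.
        v = F.normalize(self.img_proj(img_feats), dim=1)
        t = F.normalize(self.txt_proj(txt_feats), dim=1)
        logits = v @ t.t() / self.temperature           # pairwise image-text similarities
        targets = torch.arange(v.size(0), device=v.device)
        # Symmetric InfoNCE: matched pairs lie on the diagonal.
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```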
Article
Person search is a time-consuming computer vision task that entails locating and recognizing query people in scene images. Body components are commonly mismatched during matching due to position variation, occlusions, and partially absent body parts, resulting in unsatisfactory person search results. Existing approaches that extract local characteristics of the human body using keypoint information are unable to handle the search task when distinct body parts are misaligned, and fail to exploit multiple granularities, which are crucial in the person search process. Moreover, alignment learning methods learn body part features with fixed and equal weights, ignoring beneficial contextual information, e.g., an umbrella carried by a pedestrian, which supplies compelling clues for identifying the person. In this paper, we propose a Coarse-to-Fine Adaptive Alignment Representation (CFA²R) network for learning multiple granular features in misaligned person search from a coarse-to-fine perspective. To exploit more beneficial body parts and related context of the cropped pedestrians, we design a Part-Attentional Progressive Module (PAPM) to guide the network to focus on informative body parts and positive accessorial regions. Besides, we propose a Re-weighting Alignment Module (RAM) that sheds light on more contributive parts instead of treating them equally. Specifically, adaptively re-weighted rather than fixed part features are reconstructed by the Re-weighting Reconstruction module, considering that different parts contribute unequally during image matching. Extensive experiments conducted on the CUHK-SYSU and PRW datasets demonstrate the competitive performance of our proposed method.
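To illustrate the re-weighting idea sketched in this abstract (not the paper's RAM module itself), the snippet below pools horizontal part strips from a feature map and fuses them with learned, softmax-normalized weights; all names and sizes are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartReweighting(nn.Module):
    """Hypothetical adaptive re-weighting of horizontal part features."""

    def __init__(self, channels=2048, num_parts=6):
        super().__init__()
        self.num_parts = num_parts
        # Small scorer that predicts one importance score per part from its pooled feature.
        self.scorer = nn.Linear(channels, 1)

    def forward(self, feat_map):                                      # feat_map: (N, C, H, W)
        # Split the map into horizontal strips and average-pool each into a part vector.
        parts = F.adaptive_avg_pool2d(feat_map, (self.num_parts, 1))  # (N, C, P, 1)
        parts = parts.squeeze(-1).permute(0, 2, 1)                    # (N, P, C)
        weights = torch.softmax(self.scorer(parts), dim=1)            # (N, P, 1), sums to 1 over parts
        fused = (weights * parts).sum(dim=1)                          # (N, C) weighted part fusion
        return fused, weights
```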