Conference Paper

PoseTrack: A Benchmark for Human Pose Estimation and Tracking

... We carried out thorough evaluations for video pose propagation and video pose estimation tasks on three popular benchmarks: PoseTrack2017 (Iqbal, Milan, and Gall 2017), PoseTrack2018 (Andriluka et al. 2018), and PoseTrack21 (Doering et al. 2022). The videos in these datasets feature diverse challenges, such as crowded scenes and rapid movements. ...
... It greatly surpasses other methods on wrists and ankles, showcasing its effectiveness in tackling challenging scenarios where these joints are often blurred or occluded due to pose occlusion and rapid movement. Table 2: Comparisons with the state-of-the-art methods for video pose estimation on the validation sets of the PoseTrack2017 (Iqbal, Milan, and Gall 2017), PoseTrack2018 (Andriluka et al. 2018), and PoseTrack2021 (Doering et al. 2022) datasets. Note that during training, we aggregate temporal information from neighboring frames (i.e., one frame to the left and one to the right), and during inference, the pose labels of neighboring frames are not provided. ...
... Each of these sequences is meticulously annotated with 15 key points, augmented by a visibility flag indicating the state of each joint. Expanding on its predecessor, PoseTrack2018 (Andriluka et al. 2018) introduces 1,138 video sequences with a notable rise to 153,615 annotations, divided into 593 for training, 170 for validation, and 375 for testing. Each individual is meticulously annotated with 15 joints and an added visibility flag. ...
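The annotation scheme described in this excerpt (15 joints per person, each with a visibility flag, plus a persistent identity for tracking) maps naturally onto a small record type. Below is a minimal illustrative sketch; the field names are assumptions for exposition, not the official PoseTrack JSON schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Keypoint:
    x: float       # image x-coordinate in pixels
    y: float       # image y-coordinate in pixels
    visible: bool  # per-joint visibility flag, as in the PoseTrack annotations

@dataclass
class PersonAnnotation:
    track_id: int  # identity kept consistent across frames for tracking
    keypoints: List[Keypoint] = field(default_factory=list)  # 15 joints per person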
Preprint
Human pose estimation in videos remains a challenge, largely due to the reliance on extensive manual annotation of large datasets, which is expensive and labor-intensive. Furthermore, existing approaches often struggle to capture long-range temporal dependencies and overlook the complementary relationship between temporal pose heatmaps and visual features. To address these limitations, we introduce STDPose, a novel framework that enhances human pose estimation by learning spatiotemporal dynamics in sparsely-labeled videos. STDPose incorporates two key innovations: 1) A novel Dynamic-Aware Mask to capture long-range motion context, allowing for a nuanced understanding of pose changes. 2) A system for encoding and aggregating spatiotemporal representations and motion dynamics to effectively model spatiotemporal relationships, improving the accuracy and robustness of pose estimation. STDPose establishes a new performance benchmark for both video pose propagation (i.e., propagating pose annotations from labeled frames to unlabeled frames) and pose estimation tasks, across three large-scale evaluation datasets. Additionally, utilizing pseudo-labels generated by pose propagation, STDPose achieves competitive performance with only 26.7% labeled data.
... • Surpassing state-of-the-art performance on challenging real-world datasets such as PoseTrack21 [20] and PoseTrack18 [21], underscoring the effectiveness of our approach in multi-frame pose estimation. ...
... In our experiments, we evaluated the proposed Poseidon model using the large-scale PoseTrack21 [20], PoseTrack18 [21], and Sub-JHMDB [26] datasets for multi-frame human pose estimation. ...
... Dataset. The PoseTrack18 dataset [21] is a large-scale benchmark for video-based human pose estimation and articulated multi-person tracking. It contains 550 video sequences with 66,374 frames. ...
Preprint
Full-text available
Human pose estimation, a vital task in computer vision, involves detecting and localising human joints in images and videos. While single-frame pose estimation has seen significant progress, it often fails to capture the temporal dynamics for understanding complex, continuous movements. We propose Poseidon, a novel multi-frame pose estimation architecture that extends the ViTPose model by integrating temporal information for enhanced accuracy and robustness to address these limitations. Poseidon introduces key innovations: (1) an Adaptive Frame Weighting (AFW) mechanism that dynamically prioritises frames based on their relevance, ensuring that the model focuses on the most informative data; (2) a Multi-Scale Feature Fusion (MSFF) module that aggregates features from different backbone layers to capture both fine-grained details and high-level semantics; and (3) a Cross-Attention module for effective information exchange between central and contextual frames, enhancing the model's temporal coherence. The proposed architecture improves performance in complex video scenarios and offers scalability and computational efficiency suitable for real-world applications. Our approach achieves state-of-the-art performance on the PoseTrack21 and PoseTrack18 datasets, achieving mAP scores of 88.3 and 87.8, respectively, outperforming existing methods.
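The Adaptive Frame Weighting mechanism summarized above, scoring frames by relevance and aggregating them accordingly, can be illustrated with a short sketch. This is a generic rendering of the idea under assumed tensor shapes, not the authors' released Poseidon implementation.

import torch
import torch.nn as nn

class FrameWeighting(nn.Module):
    """Scores each frame and aggregates features by softmax-normalized weights."""
    def __init__(self, channels: int):
        super().__init__()
        self.scorer = nn.Linear(channels, 1)  # one relevance score per frame

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, C) pooled per-frame features, T = number of frames
        weights = torch.softmax(self.scorer(feats), dim=0)  # (T, 1) frame weights
        return (weights * feats).sum(dim=0)  # (C,) weighted temporal aggregate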
... The mAP metric is increased from 60.5 (COCO 2016 Challenge winner [9,5]) to 72.1 (COCO 2017 Challenge winner [6,9]) in one year. With the quick maturity of pose estimation, a more challenging task of "simultaneous pose detection and tracking in the wild" has been introduced recently [2]. ...
... Comparison between such works is mostly at the system level and less informative. As for pose tracking, although there has not been much work [2], the system complexity can be expected to increase further due to the increased problem dimension and solution space. ...
... Our pose tracking follows a similar pipeline to that of the winner [11] of the ICCV'17 PoseTrack Challenge [2]. Single-person pose estimation uses our own method as above. ...
Preprint
There has been significant progress on pose estimation and increasing interest in pose tracking in recent years. At the same time, the overall algorithm and system complexity increases as well, making algorithm analysis and comparison more difficult. This work provides simple and effective baseline methods. They are helpful for inspiring and evaluating new ideas for the field. State-of-the-art results are achieved on challenging benchmarks. The code will be available at https://github.com/leoxiaobin/pose.pytorch.
... We perform extensive experiments to evaluate the efficacy of our method across three large-scale datasets: PoseTrack2017 [26], PoseTrack2018 [27], and PoseTrack2021 [28]. The input image size is fixed at 256×192 and the patch size is set to 16. ...
... We evaluate the performance of the proposed FTP-Pose against the latest state-of-the-art (SOTA) methods, including M-HANet [34], TDMI-ST [33], DSTA [21], and others, on three large-scale datasets: PoseTrack2017 [26], PoseTrack2018 [27], and PoseTrack2021 [28]. All experimental results on the three validation datasets are presented in Table I. ...
Preprint
Full-text available
Human pose estimation, with its broad applications in action recognition and motion capture, has experienced significant advancements. However, current Transformer-based methods for video pose estimation often face challenges in managing redundant temporal information and achieving fine-grained perception because they only focus on processing low-resolution features. To address these challenges, we propose a novel multi-scale resolution framework that encodes spatio-temporal representations at varying granularities and executes fine-grained perception compensation. Furthermore, we employ a density peaks clustering method to dynamically identify and prioritize tokens that offer important semantic information. This strategy effectively prunes redundant feature tokens, especially those arising from multi-frame features, thereby optimizing computational efficiency without sacrificing semantic richness. Empirically, it sets new benchmarks for both performance and efficiency on three large-scale datasets. Our method achieves a 93.8% improvement in inference speed compared to the baseline, while also enhancing pose estimation accuracy, reaching 87.4 mAP on the PoseTrack2017 dataset.
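The density-peaks token selection mentioned above follows the classic Rodriguez-Laio criterion: a token matters if it is locally dense and far from any denser token. The sketch below is one plausible reading of that criterion under assumed shapes, not the paper's code; tokens with the highest scores would be kept and the rest pruned or merged.

import torch

def density_peak_scores(tokens: torch.Tensor, cutoff: float) -> torch.Tensor:
    # tokens: (N, C) feature tokens; returns an (N,) importance score per token
    dist = torch.cdist(tokens, tokens)               # (N, N) pairwise distances
    rho = (dist < cutoff).float().sum(dim=1) - 1.0   # local density, excluding self
    delta = torch.full_like(rho, float("inf"))       # distance to nearest denser token
    for i in range(len(rho)):
        higher = rho > rho[i]
        if higher.any():
            delta[i] = dist[i, higher].min()
    delta[torch.isinf(delta)] = dist.max()           # the densest token gets the max
    return rho * delta                               # high density x high separation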
... Another approach is PoseTrack, proposed by Mykhaylo Andriluka et al. [26], which not only performs individual pose estimation for each person in an image but also uses temporal information from video sequences for cross-frame person tracking. PoseTrack leverages temporal models and association graphs to maintain consistent pose estimates across frames and effectively track individuals in dynamic multi-person scenes. ...
... We selected three widely adopted datasets for model training and evaluation: PoseTrack [26], Human3.6M [35], and Sports-1M [36]. ...
Article
Full-text available
Accurate and efficient human pose estimation is crucial for precise motion tracking and performance feedback in real-time sports analysis. This paper presents an innovative real-time pose estimation framework that integrates EfficientPose and T-GCN (Temporal Graph Convolutional Networks) to address the challenges of dynamic and complex sports scenarios. EfficientPose utilizes a highly efficient network architecture to achieve accurate 3D pose estimation from single-frame images, providing a robust foundation for subsequent temporal modeling. T-GCN further refines the motion trajectories by modeling temporal dependencies and spatial relationships across frames, ensuring temporal continuity and spatial consistency. Experimental results demonstrate the superior performance of the proposed framework, achieving the lowest Mean Absolute Error (MAE: 30.5 mm), the highest Multiple Object Tracking Accuracy (MOTA: 80.2%), and maintaining a real-time frame rate (45 FPS) across multiple benchmarks. Compared to traditional methods, the proposed approach exhibits significant advantages in handling high-speed motions, occlusions, and complex multi-agent interactions, enabling high-precision and temporally stable pose estimation. This framework provides an efficient and robust solution for real-time sports performance analysis, offering valuable scientific support for performance feedback and tactical decision-making.
... These datasets generally provide high-quality 3D annotations but are less flexible due to physical constraints, especially those built with optical motion capture systems. 2) pseudo-annotated datasets [6], [10], [17], [19], [20], [21], [24], [25], [29], [32], [33], [37], [39], [40], [41], [43], [61] that re-annotate existing image datasets with parametric human annotations. These datasets take advantage of the diversity of 2D datasets, and the pseudo-3D annotations, albeit typically not as high-quality, have been proven effective [58]. ...
... PoseTrack [32] (Fig. 17c) is a large-scale benchmark for multi-person pose estimation and tracking in videos. It contains 514 videos and includes 66,374 frames. ...
Preprint
Full-text available
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods focus on training innovative architectural designs on confined datasets. In this work, we investigate the impact of scaling up EHPS towards a family of generalist foundation models. 1) For data scaling, we perform a systematic investigation on 40 EHPS datasets, encompassing a wide range of scenarios that a model trained on any single dataset cannot handle. More importantly, capitalizing on insights obtained from the extensive benchmarking process, we optimize our training scheme and select datasets that lead to a significant leap in EHPS capabilities. Ultimately, we achieve diminishing returns at 10M training instances from diverse data sources. 2) For model scaling, we take advantage of vision transformers (up to ViT-Huge as the backbone) to study the scaling law of model sizes in EHPS. To exclude the influence of algorithmic design, we base our experiments on two minimalist architectures: SMPLer-X, which consists of an intermediate step for hand and face localization, and SMPLest-X, an even simpler version that reduces the network to its bare essentials and highlights significant advances in the capture of articulated hands. With big data and the large model, the foundation models exhibit strong performance across diverse test benchmarks and excellent transferability to even unseen environments. Moreover, our finetuning strategy turns the generalist into specialist models, allowing them to achieve further performance boosts. Notably, our foundation models consistently deliver state-of-the-art results on seven benchmarks such as AGORA, UBody, EgoBody, and our proposed SynHand dataset for comprehensive hand evaluation. (Code is available at: https://github.com/wqyin/SMPLest-X).
... The large-scale dataset is provided by paper [32], which contains 593 training videos, 74 validation videos, and 375 testing videos. It annotates skeleton points and unique track IDs for each person appearing in the videos. ...
... The metrics of the PoseTrack challenge comprise per-joint average precision (AP) [35] and multi-object tracking accuracy (MOTA) scores [32,36]. The MOTA score considers false negatives (FN), false positives (FP), and ID switches (IDSW). ...
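MOTA, as referenced in the excerpt above, has a standard closed form: the three error counts are summed over all frames and normalized by the number of ground-truth annotations. A minimal computation (the example numbers are invented for illustration):

def mota(fn: int, fp: int, idsw: int, num_gt: int) -> float:
    """Multiple Object Tracking Accuracy: 1 minus the normalized error count."""
    return 1.0 - (fn + fp + idsw) / num_gt

# e.g. 120 misses, 80 false positives, 15 identity switches over 2,000 GT poses
print(mota(120, 80, 15, 2000))  # 0.8925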
Article
Full-text available
The skeleton data of limbs are unreliable due to occlusion and camera viewpoints on intercity railway platforms. This hinders the acquisition of skeleton sequences and disturbs skeleton-based abnormal action recognition. To overcome these issues, this work proposes a framework consisting of a pose tracking module and an abnormal action recognition module. The proposed pose tracking module maintains the identities of multiple human poses across frames and provides skeleton sequences as input for recognition. Instead of utilizing the whole skeleton, the pose tracking method tracks the trunk for more stable identity association, as the estimations of the limbs are unreliable. In addition, a position embedding graph convolutional network (PEGCN) is proposed to recognize abnormal actions. PEGCN utilizes a simple cosine encoding as position embeddings for enhancing the differentiation of skeleton vertices and an SE layer for extracting temporal dynamics. The pose tracking method achieves 66.42% tracking accuracy and higher frame rates than previous methods on the PoseTrack dataset. Additionally, PEGCN achieves competitive results on the Intercity Railway Action Dataset (IRAD) and the public NTU-RGB+D dataset.
... Advances in deep neural network (DNN) models and GPU hardware accelerators have significantly advanced video analytics in edge intelligence applications, including object detection (Zou et al. 2023; Zhao et al. 2019), action recognition (Ghodrati, Bejnordi, and Habibian 2021; Jhuang et al. 2013), and pose estimation (Andriluka et al. 2014; Toshev and Szegedy 2014; Andriluka et al. 2018), etc. To protect data privacy and ensure low-latency quality of service (QoS), many of these applications are deployed on edge devices close to the data sources (Liang et al. 2023). ...
Preprint
Deep neural network (DNN) models are increasingly popular in edge video analytic applications. However, the compute-intensive nature of DNN models poses challenges for energy-efficient inference on resource-constrained edge devices. Most existing solutions focus on optimizing DNN inference latency and accuracy, often overlooking energy efficiency. They also fail to account for the varying complexity of video frames, leading to sub-optimal performance in edge video analytics. In this paper, we propose an Energy-Efficient Early-Exit (E4) framework that enhances DNN inference efficiency for edge video analytics by integrating a novel early-exit mechanism with dynamic voltage and frequency scaling (DVFS) governors. It employs an attention-based cascade module to analyze video frame diversity and automatically determine optimal DNN exit points. Additionally, E4 features a just-in-time (JIT) profiler that uses coordinate descent search to co-optimize CPU and GPU clock frequencies for each layer before the DNN exit points. Extensive evaluations demonstrate that E4 outperforms current state-of-the-art methods, achieving up to 2.8x speedup and 26% average energy saving while maintaining high accuracy.
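Early-exit inference of the kind E4 builds on follows a common pattern: run the input through successive network stages and stop as soon as an auxiliary head is confident enough. The sketch below illustrates that pattern only; the stage/head API and the fixed confidence threshold are assumptions, not E4's actual mechanism (which selects exit points with an attention-based cascade).

import torch

def early_exit_infer(stages, exit_heads, x, threshold=0.9):
    """Run stages in order; return early once an exit head is confident.
    Assumes batch size 1; stages and exit_heads are equal-length module lists."""
    pred = conf = None
    for stage, head in zip(stages, exit_heads):
        x = stage(x)                            # refine features at this stage
        probs = torch.softmax(head(x), dim=-1)  # this exit's class distribution
        conf, pred = probs.max(dim=-1)
        if conf.item() >= threshold:            # confident enough: skip later stages
            break
    return pred, conf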
... There are many datasets and benchmarks for human hand [28], whole-body [29], and animal [30] pose estimation. However, public datasets for articulated surgical tool pose estimation are limited. ...
Preprint
Full-text available
Accurate and efficient surgical robotic tool pose estimation is of fundamental significance to downstream applications such as augmented reality (AR) in surgical training and learning-based autonomous manipulation. While significant advancements have been made in pose estimation for humans and animals, it is still a challenge in surgical robotics due to the scarcity of published data. The relatively large absolute error of the da Vinci end effector kinematics and arduous calibration procedure make calibrated kinematics data collection expensive. Driven by this limitation, we collected a dataset, dubbed SurgPose, providing instance-aware semantic keypoints and skeletons for visual surgical tool pose estimation and tracking. By marking keypoints using ultraviolet (UV) reactive paint, which is invisible under white light and fluorescent under UV light, we execute the same trajectory under different lighting conditions to collect raw videos and keypoint annotations, respectively. The SurgPose dataset consists of approximately 120k surgical instrument instances (80k for training and 40k for validation) of 6 categories. Each instrument instance is labeled with 7 semantic keypoints. Since the videos are collected in stereo pairs, the 2D pose can be lifted to 3D based on stereo-matching depth. In addition to releasing the dataset, we test a few baseline approaches to surgical instrument tracking to demonstrate the utility of SurgPose. More details can be found at surgpose.github.io.
... The PoseTrack2018 [40] dataset is used for human pose estimation in videos, providing a challenging benchmark for multi-person pose estimation in real-world scenarios. The dataset consists of a total of 1,138 videos with 153,615 pose annotations. ...
Article
Full-text available
Due to the demands of performing human pose estimation tasks on edge devices with limited computational resources, more and more researchers have turned to the design of lightweight human pose estimation networks. As a typical lightweight network, Lite-HRNet achieves high performance in human pose estimation with relatively low model complexity, but its inference speed in practical applications is not ideal. To address this issue, we propose a lightweight and efficient network (LENet), which is not only capable of learning comprehensive information under a lightweight architecture for real-time human pose estimation, but also has a higher inference speed than other common lightweight networks. We design two blocks, the Recursive Fusion Block (RFB) and the Deep Shuffle Block (DSB), to construct the model architecture. The RFB implements multi-scale feature fusion in a more lightweight way, that is, it only carries out scale transformations between adjacent branches. The DSB fully utilizes computational resources to extract expressive information throughout processing. Experimental results demonstrate that with the same complexity as Lite-HRNet, our LENet yields inference speeds of 40.8 FPS and 150.4 FPS on CPU and GPU respectively, an improvement of 83% over Lite-HRNet. Furthermore, LENet is more competitive than Lite-HRNet in achieving a superior balance among model performance, complexity, and inference speed.
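Inference-speed figures like the FPS numbers quoted above are conventionally measured by timing repeated forward passes after a warm-up phase. A hedged sketch of such a benchmark (the 256x192 input crop is an assumption borrowed from the convention in this literature):

import time
import torch

def measure_fps(model: torch.nn.Module, input_size=(1, 3, 256, 192),
                runs: int = 100, warmup: int = 10) -> float:
    """Average frames per second over repeated forward passes (CPU timing;
    add torch.cuda.synchronize() around the timed loop for GPU measurements)."""
    model.eval()
    x = torch.randn(*input_size)
    with torch.no_grad():
        for _ in range(warmup):   # warm-up runs are excluded from timing
            model(x)
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        elapsed = time.perf_counter() - start
    return runs / elapsed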
... The encoder is trained to reconstruct 3D poses from corrupted 2D poses using various 2D and 3D human pose datasets, including Human3.6M [10], AMASS [50], PoseTrack [51], and InstaVariety [52]. Subsequently, it is fine-tuned with additional layers for downstream tasks. ...
Preprint
Recent advancements in deep learning methods have significantly improved the performance of 3D Human Pose Estimation (HPE). However, performance degradation caused by domain gaps between source and target domains remains a major challenge to generalization, necessitating extensive data augmentation and/or fine-tuning for each specific target domain. To address this issue more efficiently, we propose a novel canonical domain approach that maps both the source and target domains into a unified canonical domain, alleviating the need for additional fine-tuning in the target domain. To construct the canonical domain, we introduce a canonicalization process to generate a novel canonical 2D-3D pose mapping that ensures 2D-3D pose consistency and simplifies 2D-3D pose patterns, enabling more efficient training of lifting networks. The canonicalization of both domains is achieved through the following steps: (1) in the source domain, the lifting network is trained within the canonical domain; (2) in the target domain, input 2D poses are canonicalized prior to inference by leveraging the properties of perspective projection and known camera intrinsics. Consequently, the trained network can be directly applied to the target domain without requiring additional fine-tuning. Experiments conducted with various lifting networks and publicly available datasets (e.g., Human3.6M, Fit3D, MPI-INF-3DHP) demonstrate that the proposed method substantially improves generalization capability across datasets while using the same data volume.
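The target-domain canonicalization step described above leans on a standard property of perspective projection: with known intrinsics K, pixel keypoints can be mapped to intrinsics-free normalized coordinates via K^-1. The sketch below shows only that standard step; the paper's full canonical 2D-3D mapping involves more than this.

import numpy as np

def normalize_2d_pose(pose_px: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Map pixel keypoints to normalized image coordinates using K^-1.
    pose_px: (J, 2) pixel coordinates; K: (3, 3) camera intrinsics."""
    ones = np.ones((pose_px.shape[0], 1))
    homog = np.hstack([pose_px, ones])      # (J, 3) homogeneous pixel coords
    norm = (np.linalg.inv(K) @ homog.T).T   # divide out focal length / principal point
    return norm[:, :2]                      # (J, 2) intrinsics-free coordinates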
... Dataset. We evaluate the proposed CM-Pose for video-based human pose estimation on three widely used datasets: PoseTrack2017 (Iqbal, Milan, and Gall 2017), PoseTrack2018 (Andriluka et al. 2018), and PoseTrack2021 (Doering et al. 2022). PoseTrack2017 includes 80,144 pose annotations and has two subsets, i.e., training (train) and validation (val), with 250 videos and 50 videos (split according to the official protocol), respectively. ...
Preprint
Video-based human pose estimation has long been a fundamental yet challenging problem in computer vision. Previous studies focus on spatio-temporal modeling through the enhancement of architecture design and optimization strategies. However, they overlook the causal relationships in the joints, leading to models that may be overly tailored and thus estimate poorly in challenging scenes. Therefore, adequate causal reasoning capability, coupled with good interpretability of the model, are both indispensable and prerequisite for achieving reliable results. In this paper, we pioneer a causal perspective on pose estimation and introduce a causal-inspired multitask learning framework, consisting of two stages. In the first stage, we try to endow the model with causal spatio-temporal modeling ability by introducing two self-supervision auxiliary tasks. Specifically, these auxiliary tasks enable the network to infer challenging keypoints based on observed keypoint information, thereby imbuing causal reasoning capabilities into the model and making it robust to challenging scenes. In the second stage, we argue that not all feature tokens contribute equally to pose estimation. Prioritizing causal (keypoint-relevant) tokens is crucial to achieve reliable results, which could improve the interpretability of the model. To this end, we propose a Token Causal Importance Selection module to identify the causal tokens and non-causal tokens (e.g., background and objects). Additionally, non-causal tokens could provide potentially beneficial cues but may be redundant. We further introduce a non-causal tokens clustering module to merge the similar non-causal tokens. Extensive experiments show that our method outperforms state-of-the-art methods on three large-scale benchmark datasets.
... Dataset | Year | Size
LSP [62] | 2010 | 2,000 images
COCO [92] | 2014 | 328,000 images
Expressive hands and faces dataset [129][63] | 2015 | 297,000 frames
PennAction [192] | 2013 | 2,326 videos
PoseTrack [7] | 2018 | 1,337 videos
... temporal streams. Lastly, achieving real-time recognition with low latency while maintaining high accuracy is a persistent challenge, especially for early-stage gesture detection. ...
Preprint
Full-text available
Hand gesture recognition has become an important research area, driven by the growing demand for human-computer interaction in fields such as sign language recognition, virtual and augmented reality, and robotics. Despite the rapid growth of the field, there are few surveys that comprehensively cover recent research developments, available solutions, and benchmark datasets. This survey addresses this gap by examining the latest advancements in hand gesture and 3D hand pose recognition from various types of camera input data including RGB images, depth images, and videos from monocular or multiview cameras, examining the differing methodological requirements of each approach. Furthermore, an overview of widely used datasets is provided, detailing their main characteristics and application domains. Finally, open challenges such as achieving robust recognition in real-world environments, handling occlusions, ensuring generalization across diverse users, and addressing computational efficiency for real-time applications are highlighted to guide future research directions. By synthesizing the objectives, methodologies, and applications of recent studies, this survey offers valuable insights into current trends, challenges, and opportunities for future research in human hand gesture recognition.
... The dataset generation procedure is typically time-consuming and requires significant labor. In the field of object detection and pose estimation, CORe50 [33], Dex-YCB [34], and PoseTrack [35] are fully manually labeled datasets that necessitate a substantial amount of labor to complete. Semi-automated dataset generation methods are employed to reduce these efforts. ...
Preprint
Full-text available
EasyVis2 is a system designed for hands-free, real-time 3D visualization during laparoscopic surgery. It incorporates a surgical trocar equipped with a set of micro-cameras, which are inserted into the body cavity to provide an expanded field of view and a 3D perspective of the surgical procedure. A sophisticated deep neural network algorithm, YOLOv8-Pose, is tailored to estimate the position and orientation of surgical instruments in each individual camera view. Subsequently, 3D surgical tool pose estimation is performed using associated 2D key points across multiple views. This enables the rendering of a 3D surface model of the surgical tools overlaid on the observed background scene for real-time visualization. In this study, we explain the process of developing a training dataset for new surgical tools to customize YOLOv8-Pose while minimizing labeling efforts. Extensive experiments were conducted to compare EasyVis2 with the original EasyVis, revealing that, with the same number of cameras, the new system improves 3D reconstruction accuracy and reduces computation time. Additionally, experiments with 3D rendering on real animal tissue visually demonstrated the distance between surgical tools and tissues by displaying virtual side views, indicating potential applications in real surgeries in the future.
... Monocular 3D human shape and pose estimation is a critical task in computer vision, aimed at reconstructing the 3D pose and detailed body shape of a person from a single image [1][2][3][4][5]. This task has numerous real-world applications, including visual tracking [6][7][8][9][10], enhancing user experiences in virtual and augmented reality [11][12][13][14][15][16][17][18][19][20][21][22][23][24][25], generating realistic human motion for animation [19][20][21][22][23][24][25][26][27][28][29][30][31][32], improving image editing, and advancing neural radiance field rendering [33][34][35][36]. These fields benefit greatly from the ability to capture human form and motion with precision. ...
Article
Full-text available
Existing Transformers for 3D human pose and shape estimation models often struggle with computational complexity, particularly when handling high-resolution feature maps. These challenges limit their ability to efficiently utilize fine-grained features, leading to suboptimal performance in accurate body reconstruction. In this work, we propose TransSMPL, a novel Transformer framework built upon the SMPL model, specifically designed to address the challenges of computational complexity and inefficient utilization of high-resolution feature maps in 3D human pose and shape estimation. By replacing HRNet with MobileNetV3 for lightweight feature extraction, applying pruning and quantization techniques, and incorporating an early exit mechanism, TransSMPL significantly reduces both computational cost and memory usage. TransSMPL introduces two key innovations: (1) a multi-scale attention mechanism, reduced from four scales to two, allowing for more efficient global and local feature integration, and (2) a confidence-based early exit strategy, which enables the model to halt further computations when high-confidence predictions are achieved, further enhancing efficiency. Extensive pruning and dynamic quantization are also applied to reduce the model size while maintaining competitive performance. Quantitative and qualitative experiments on the Human3.6M dataset demonstrate the efficacy of TransSMPL. Our model achieves an MPJPE (Mean Per Joint Position Error) of 48.5 mm, reducing the model size by over 16% compared to existing methods while maintaining a similar level of accuracy.
... With the rapid development of image processing technology, three-dimensional human pose tracking [1] has become one of the core research topics in the field of computer vision and has been widely applied in many fields, such as automatic driving, human-computer interaction, video surveillance, intelligent security systems, and medical care. These application scenarios require highly accurate and robust human pose detection technology. ...
Article
Full-text available
At present, research on three-dimensional human pose tracking mainly focuses on multi-camera systems and lacks tracking algorithms for monocular cameras. Therefore, a 3D human pose tracking algorithm, FlexTrack3D, is proposed, which can track human 3D pose in monocular video. FlexTrack3D innovatively combines 2D human pose detection and monocular depth estimation. By integrating the pixel coordinates of human key points provided by FlexPoseNet and the depth data generated by the ZoeDepth algorithm, FlexTrack3D can accurately model key points in three dimensions and accurately track their dynamic trajectories. FlexTrack3D visualizes the pose and trajectory of human motion data collected in indoor and outdoor environments through a 3D point cloud model, which demonstrates its accuracy and robustness. For tracking human key points, a lightweight human pose detection model, FlexPoseNet, and a parameter-sharing detector with multi-scale feature information are proposed, which strengthen the learning of key features while reducing the parameters of the baseline model YOLO-Pose. A Multi-Head Self-Attention module and a Deformable Convolution Networks module are introduced into the backbone of the FlexPoseNet model, which enhance the network's modeling of the target's receptive field and improve its flexibility in modeling the target structure. A SlimNeck structure is introduced into the neck of the model, which effectively reduces the computation and parameters of the model while maintaining detection accuracy. In experiments on the COCO-Pose dataset, the processing speed of FlexPoseNet is 42.9% higher than that of the baseline model YOLO-Pose, and the detection accuracy in terms of mAP50 and mAP50-95 is increased by 12.46% and 10.21%, further improving the accuracy of the model while maintaining performance.
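Fusing 2D keypoint coordinates with monocular depth, as FlexTrack3D does, reduces at its core to pinhole back-projection: each pixel plus its estimated depth is lifted into camera-frame 3D coordinates. A generic sketch of that step (known intrinsics K are assumed; this is not the FlexTrack3D code):

import numpy as np

def backproject_keypoints(kps_px: np.ndarray, depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift 2D keypoints to 3D camera coordinates using per-point depth.
    kps_px: (J, 2) pixel coords; depth: (J,) metric depths; K: (3, 3) intrinsics."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (kps_px[:, 0] - cx) / fx * depth    # X = (u - cx) * Z / fx
    y = (kps_px[:, 1] - cy) / fy * depth    # Y = (v - cy) * Z / fy
    return np.stack([x, y, depth], axis=1)  # (J, 3) points in the camera frame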
... MediaPipe Pose, a popular framework developed by Google, uses deep learning models to estimate key body landmarks and infer skeletal poses from video input, making it suitable for analyzing posture stability in diverse settings [17]. The study in [18] focuses on the problem of human activity and posture recognition, which is important for various health and robotics applications. Using the Kinect Activity Recognition Dataset (KARD) and the MSR Pairs dataset, the paper proposes a new way of detecting motion with the help of CNNs. ...
Article
Full-text available
The analysis of hip and knee (HK) joint angles during single-leg stance (SLS) activity contributes to a better understanding of the biomechanical mechanisms of balance maintenance across different age groups. Comprehending how these joints operate in the dynamic state is critical for identifying age-related changes in joint control and stability, which can contribute to reducing the risk of falls and improving mobility. The significance of this work lies in the possibility of assessing the health of the HK joints without resorting to an MRI scan. By obtaining data on joint angles during the SLS test, we succeeded in identifying those with a higher risk of HK issues, opening the possibility of early intervention and treatment. Our proposed work uses a pose estimation technique that tracks the trajectories of HK angles and associates them with instabilities or compromised balance; in this way, we can detect the affected joint and evaluate the severity of the issue. The study found stable mean and standard deviation values for the HK joints of a young participant, for both hips (107.14 ± 5, 96.42 ± 7) and both knees (36.76 ± 7, 44.30 ± 4); these values align with the expected norms (110–120° for the hip and 45–65° for the knee), indicating stable results, while elderly participants showed high variability and low mean values for both hips (65.42 ± 77, 85 ± 76.67) and both knees (4.15 ± 10.8, 7 ± 18), indicating concerns about joint health and stability. The evaluation of HK joint angles through SLS activity offers new insights.
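Hip and knee angles of the kind reported above are conventionally computed from three pose-estimation keypoints as the angle at the middle joint. A small illustrative helper (the keypoint choice is an assumption; the paper does not publish its exact formula):

import numpy as np

def joint_angle(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Angle at joint b (degrees) formed by segments b->a and b->c,
    e.g. the knee angle from hip (a), knee (b), and ankle (c) keypoints."""
    v1, v2 = a - b, c - b
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))))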
... This benchmark sets a standard for evaluating models across a wide variety of real-world activities. The PoseTrack benchmark [174] focuses on video-based multi-person pose estimation and tracking. It introduces tasks for single-frame and video-based pose estimation, as well as articulated tracking, offering a large dataset with labeled person tracks. ...
Preprint
Full-text available
With the increasing application of large language models (LLMs) in the medical domain, evaluating these models' performance using benchmark datasets has become crucial. This paper presents a comprehensive survey of various benchmark datasets employed in medical LLM tasks. These datasets span multiple modalities including text, image, and multimodal benchmarks, focusing on different aspects of medical knowledge such as electronic health records (EHRs), doctor-patient dialogues, medical question-answering, and medical image captioning. The survey categorizes the datasets by modality, discussing their significance, data structure, and impact on the development of LLMs for clinical tasks such as diagnosis, report generation, and predictive decision support. Key benchmarks include MIMIC-III, MIMIC-IV, BioASQ, PubMedQA, and CheXpert, which have facilitated advancements in tasks like medical report generation, clinical summarization, and synthetic data generation. The paper summarizes the challenges and opportunities in leveraging these benchmarks for advancing multimodal medical intelligence, emphasizing the need for datasets with a greater degree of language diversity, structured omics data, and innovative approaches to synthesis. This work also provides a foundation for future research in the application of LLMs in medicine, contributing to the evolving field of medical artificial intelligence.
... Methods include cost volumes [7], point cloud representations, ... Table 1: Comparison of paradigms and mechanisms of SOTA tracking methods. Indication Types defines the representation used to indicate targets, with their corresponding datasets: TAP-Vid [18], PoseTrack [19,20], MOT [21,22,23], VOS [24], VIS [25], MOTS [26], KITTI [27], LaSOT [28], GroOT [29]. Methods in color gradient support both types of single- and multi-target benchmarks. ...
Preprint
Full-text available
Object tracking is a fundamental task in computer vision, requiring the localization of objects of interest across video frames. Diffusion models have shown remarkable capabilities in visual generation, making them well-suited for addressing several requirements of the tracking problem. This work proposes a novel diffusion-based methodology to formulate the tracking task. Firstly, their conditional process allows for injecting indications of the target object into the generation process. Secondly, diffusion mechanics can be developed to inherently model temporal correspondences, enabling the reconstruction of actual frames in video. However, existing diffusion models rely on extensive and unnecessary mapping to a Gaussian noise domain, which can be replaced by a more efficient and stable interpolation process. Our proposed interpolation mechanism draws inspiration from classic image-processing techniques, offering a more interpretable, stable, and faster approach tailored specifically for the object tracking task. By leveraging the strengths of diffusion models while circumventing their limitations, our Diffusion-based INterpolation TrackeR (DINTR) presents a promising new paradigm and achieves a superior multiplicity on seven benchmarks across five indicator representations.
... Human-centric perception remains a core focus within the computer vision and machine learning communities, encompassing a wide array of research tasks and applications such as person ReID [1][2][3][4][5][6][7], human parsing [8][9][10][11][12][13][14], human pose estimation [15][16][17][18][19][20][21][22], and pedestrian detection [23][24][25][26][27][28][29]. Despite significant advancements, these algorithms are inherently data-intensive and frequently encounter overfitting issues [30][31][32][33][34][35][36], where models excel with training data but falter on unseen test data. ...
Article
Full-text available
This survey presents a comprehensive analysis of data augmentation techniques in human-centric vision tasks, a first of its kind in the field. It delves into a wide range of research areas including person ReID, human parsing, human pose estimation, and pedestrian detection, addressing the significant challenges posed by overfitting and limited training data in these domains. Our work categorizes data augmentation methods into two main types: data generation and data perturbation. Data generation covers techniques like graphic engine-based generation, generative model-based generation, and data recombination, while data perturbation is divided into image-level and human-level perturbations. Each method is tailored to the unique requirements of human-centric tasks, with some applicable across multiple areas. Our contributions include an extensive literature review, providing deep insights into the influence of these augmentation techniques in human-centric vision and highlighting the nuances of each method. We also discuss open issues and future directions, such as the integration of advanced generative models like Latent Diffusion Models, for creating more realistic and diverse training data. This survey not only encapsulates the current state of data augmentation in human-centric vision but also charts a course for future research, aiming to develop more robust, accurate, and efficient human-centric vision systems.
... In the fine-tuning stage, we fine-tune the model with reinforcement learning on uneven terrains. ... and human videos (47, 48). We extract human poses from videos using computer vision techniques (49), which can be seen as noisy MoCap. ...
Preprint
Full-text available
Humanoid robots can, in principle, use their legs to go almost anywhere. Developing controllers capable of traversing diverse terrains, however, remains a considerable challenge. Classical controllers are hard to generalize broadly while the learning-based methods have primarily focused on gentle terrains. Here, we present a learning-based approach for blind humanoid locomotion capable of traversing challenging natural and man-made terrain. Our method uses a transformer model to predict the next action based on the history of proprioceptive observations and actions. The model is first pre-trained on a dataset of flat-ground trajectories with sequence modeling, and then fine-tuned on uneven terrain using reinforcement learning. We evaluate our model on a real humanoid robot across a variety of terrains, including rough, deformable, and sloped surfaces. The model demonstrates robust performance, in-context adaptation, and emergent terrain representations. In real-world case studies, our humanoid robot successfully traversed over 4 miles of hiking trails in Berkeley and climbed some of the steepest streets in San Francisco.
... PoseTrack: PoseTrack is a large-scale benchmark for video-based human pose estimation and articulated tracking. It is a diverse dataset that contains more than 153k poses in single frames and videos [37]. ...
... In the PoseTrack2017 dataset for human pose estimation, each teacher instance is annotated with 17 key points representing distinct body parts or feature points, accompanied by corresponding category labels [16], as depicted in Fig. 2(a). The DCPose algorithm predicts each teacher's human body key points and stores them in a JSON file, including label information, image pixel coordinates, and confidence levels of each key point. ...
Article
Full-text available
Computer vision, a scientific discipline that enables machines to perceive visual information, aims to supplant human eyes in tasks encompassing object recognition, localization, and tracking. In traditional educational settings, instructors or evaluators evaluate teaching performance based on subjective judgment. However, with the continuous advancements in computer vision technology, it becomes increasingly crucial for computers to take on the role of judges in obtaining vital information and making unbiased evaluations. Against this backdrop, this paper proposes a deep learning-based approach for evaluating lecture posture. First, feature information is extracted from various dimensions, including head position, hand gestures, and body posture, using a human pose estimation algorithm. Second, a machine learning-based regression model is employed to predict machine scores by comparing the extracted features with expert-assigned human scores. The correlation between machine scores and human scores is investigated through experiment and analysis, revealing a robust overall correlation (0.6420) between predicted machine scores and human scores. Under ideal scoring conditions (100 points), approximately 51.72% of predicted machine scores exhibited deviations within a range of 10 points, while around 81.87% displayed deviations within a range of 20 points; only a minimal percentage of 0.12% demonstrated deviations exceeding the threshold of 50 points. Finally, to further optimize performance, additional features related to bodily movements are extracted by introducing facial expression recognition and gesture recognition algorithms. The fusion of multiple models resulted in an overall average correlation improvement of 0.0226.
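The correlation figures quoted above appear to be coefficients between predicted and expert scores; the standard Pearson computation is a few lines. A minimal sketch (whether the paper used Pearson or another correlation measure is not stated, so treat this as illustrative):

import numpy as np

def pearson_r(machine: np.ndarray, human: np.ndarray) -> float:
    """Pearson correlation between machine-predicted and expert-assigned scores."""
    m = machine - machine.mean()
    h = human - human.mean()
    return float((m * h).sum() / (np.sqrt((m ** 2).sum() * (h ** 2).sum()) + 1e-12))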
... PoseTrack Dataset: PoseTrack [43] is ... In terms of tracking, MOTA (Multiple Object Tracking Accuracy) is a metric used to assess the performance of multiple object tracking and is commonly employed in multi-person pose tracking tasks. MOTA comprehensively considers several factors, including missed detections, false positives, and identity switches, providing a comprehensive performance evaluation. ...
Article
Full-text available
Multi-person pose estimation and tracking are crucial research directions in the field of artificial intelligence, with widespread applications in virtual reality, action recognition, and human-computer interaction. While existing pose tracking algorithms predominantly follow the top-down paradigm, they face challenges, such as pose occlusion and motion blur in complex scenes, leading to tracking inaccuracies. To address these challenges, we leverage enhanced keypoint information and pose-weighted re-identification (re-ID) features to improve the performance of multi-person pose estimation and tracking. Specifically, our proposed Decouple Heatmap Network decouples heatmaps into keypoint confidence and position. The refined keypoint information is utilized to reconstruct occluded poses. For the pose tracking task, we introduce a more efficient pipeline founded on pose-weighted re-ID features. This pipeline integrates a Pose Embedding Network to allocate weights to re-ID features and achieves the final pose tracking through a novel tracking matching algorithm. Extensive experiments indicate that our approach performs well in both multi-person pose estimation and tracking and achieves state-of-the-art results on the PoseTrack 2017 and 2018 datasets. Our source code is available at: https://github.com/TaoTaoPei/posetracking.
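The matching stage of such a tracking pipeline, assigning current detections to existing tracks using a cost that blends appearance (re-ID) similarity with pose distance, is commonly solved with the Hungarian algorithm. The following is a generic sketch of that pattern with an assumed mixing weight alpha; it is not the paper's Pose Embedding Network or its novel matching algorithm.

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_tracks(track_feats, det_feats, track_poses, det_poses, alpha=0.5):
    """Assign detections to tracks with a cost mixing re-ID and pose distance.
    *_feats: (N, D) / (M, D) L2-normalized embeddings; *_poses: (N, J, 2) / (M, J, 2)."""
    appearance = 1.0 - track_feats @ det_feats.T            # cosine distance (N, M)
    pose = np.linalg.norm(
        track_poses[:, None] - det_poses[None], axis=-1     # (N, M, J) joint distances
    ).mean(-1)
    pose = pose / (pose.max() + 1e-8)                       # scale to [0, 1]
    cost = alpha * appearance + (1 - alpha) * pose
    rows, cols = linear_sum_assignment(cost)                # Hungarian assignment
    return list(zip(rows.tolist(), cols.tolist()))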
Preprint
Full-text available
Temporal modeling and spatio-temporal collaboration are pivotal techniques for video-based human pose estimation. Most state-of-the-art methods adopt optical flow or temporal difference, learning local visual content correspondence across frames at the pixel level, to capture motion dynamics. However, such a paradigm essentially relies on localized pixel-to-pixel similarity, which neglects the semantical correlations among frames and is vulnerable to image quality degradations (e.g. occlusions or blur). Moreover, existing approaches often combine motion and spatial (appearance) features via simple concatenation or summation, leading to practical challenges in fully leveraging these distinct modalities. In this paper, we present a novel framework that learns multi-level semantical dynamics and dense spatio-temporal collaboration for multi-frame human pose estimation. Specifically, we first design a Multi-Level Semantic Motion Encoder using a multi-masked context and pose reconstruction strategy. This strategy stimulates the model to explore multi-granularity spatiotemporal semantic relationships among frames by progressively masking the features of (patch) cubes and frames. We further introduce a Spatial-Motion Mutual Learning module which densely propagates and consolidates context information from spatial and motion features to enhance the capability of the model. Extensive experiments demonstrate that our approach sets new state-of-the-art results on three benchmark datasets, PoseTrack2017, PoseTrack2018, and PoseTrack21.
Preprint
Human pose estimation has given rise to a broad spectrum of novel and compelling applications, including action recognition, sports analysis, as well as surveillance. However, accurate video pose estimation remains an open challenge. One aspect that has been overlooked so far is that existing methods learn motion clues from all pixels rather than focusing on the target human body, making them easily misled and disrupted by unimportant information such as background changes or movements of other people. Additionally, while current Transformer-based pose estimation methods have demonstrated impressive performance with global modeling, they struggle with local context perception and precise positional identification. In this paper, we try to tackle these challenges from three aspects: (1) We propose a bilayer Human-Keypoint Mask module that performs coarse-to-fine visual token refinement, which gradually zooms in on the target human body and keypoints while masking out unimportant figure regions. (2) We further introduce a novel deformable cross attention mechanism and a bidirectional separation strategy to adaptively aggregate spatial and temporal motion clues from constrained surrounding contexts. (3) We mathematically formulate the deformable cross attention, constraining the model to focus solely on regions centered on the target person's body. Empirically, our method achieves state-of-the-art performance on three large-scale benchmark datasets. A remarkable highlight is that our method achieves an 84.8 mean Average Precision (mAP) on the challenging wrist joint, which significantly outperforms the 81.5 mAP achieved by the current state-of-the-art method on the PoseTrack2017 dataset.
Article
Visual Object Tracking (VOT) aims to estimate the positions of target objects in a video sequence, which is an important vision task with various real-world applications. Depending on whether the initial states of target objects are specified by annotations provided in the first frame or by object categories, VOT can be classified into instance tracking (e.g., SOT and VOS) and category tracking (e.g., MOT, MOTS, and VIS) tasks. Different definitions have led to divergent solutions for these two types of tasks, resulting in redundant training expenses and parameter overhead. In this paper, combining the advantages of the best practices developed in both communities, we propose a novel tracking-with-detection paradigm, where tracking supplements appearance priors for detection and detection provides tracking with candidate bounding boxes for association. Equipped with such a design, a unified tracking model, OmniTracker, is further presented to resolve all the tracking tasks with a fully shared network architecture, model weights, and inference pipeline, eliminating the need for task-specific architectures and reducing redundancy in model parameters. We conduct extensive experimentation on seven prominent tracking datasets of different tracking tasks, including LaSOT, TrackingNet, DAVIS16-17, MOT17, MOTS20, and YTVIS19, and demonstrate that OmniTracker achieves on-par or even better results than both task-specific and unified tracking models.
Article
Full-text available
In recent years, human pose estimation has been widely studied as a branch task of computer vision. Human pose estimation plays an important role in the development of medicine, fitness, virtual reality, and other fields. Early human pose estimation technology used traditional manual modeling methods. Recently, human pose estimation technology has developed rapidly using deep learning. This study not only reviews the basic research of human pose estimation but also summarizes the latest cutting-edge technologies. In addition to systematically summarizing the human pose estimation technology, this article also extends to the upstream and downstream tasks of human pose estimation, which shows the positioning of human pose estimation technology more intuitively. In particular, considering the issues regarding computer resources and challenges concerning model performance faced by human pose estimation, the lightweight human pose estimation models and the transformer-based human pose estimation models are summarized in this paper. In general, this article classifies human pose estimation technology around types of methods, 2D or 3D representation of outputs, the number of people, views, and temporal information. Meanwhile, classic datasets and targeted datasets are mentioned in this paper, as well as metrics applied to these datasets. Finally, we generalize the current challenges and possible development of human pose estimation technology in the future.
Article
Full-text available
Human activity recognition is a critical task for various applications across healthcare, sports, security, gaming, and other fields. This paper presents BodyFlow, a comprehensive library that seamlessly integrates human pose estimation and multiple-person estimation and tracking, along with activity recognition modules. BodyFlow enables users to effortlessly identify common activities and 2D/3D body joints from input sources such as videos, image sets, or webcams. Additionally, the library can simultaneously process inertial sensor data, offering users the flexibility to choose their preferred input, thus facilitating multimodal human activity recognition. BodyFlow incorporates state-of-the-art algorithms for 2D and 3D pose estimation and three distinct models for human activity recognition.
Article
Full-text available
Single object tracking is a vital task of many applications in critical fields. However, it is still considered one of the most challenging vision tasks. In recent years, computer vision, especially object tracking, witnessed the introduction or adoption of many novel techniques, setting new fronts for performance. In this survey, we visit some of the cutting-edge techniques in vision, such as Sequence Models, Generative Models, Self-supervised Learning, Unsupervised Learning, Reinforcement Learning, Meta-Learning, Continual Learning, and Domain Adaptation, focusing on their application in single object tracking. We propose a novel categorization of single object tracking methods based on novel techniques and trends. Also, we conduct a comparative analysis of the performance reported by the methods presented on popular tracking benchmarks. Moreover, we analyze the pros and cons of the presented approaches and present a guide for non-traditional techniques in single object tracking. Finally, we suggest potential avenues for future research in single-object tracking.
Conference Paper
Full-text available
We propose a two-component fully-convolutional network for heatmap regression to perform multi-person pose estimation from images. The first component of the network predicts all body joints of all persons visible in an image, while the second component groups these body joints based on the position of the head of the person of interest. By applying the second component for all detected heads, the poses of all persons visible in an image are estimated. A subsequent geometric frame-by-frame tracker using distances of body joints tracks the poses of all detected persons throughout video sequences. Results on the PoseTrack challenge test set show good performance of our proposed method, with a mean average precision (mAP) of 50.4 and a multiple object tracking accuracy (MOTA) of 29.9.
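A geometric frame-by-frame tracker of the kind described, linking poses across adjacent frames by body-joint distances, can be approximated greedily in a few lines. This is an illustrative reconstruction under an assumed pixel-distance gate, not the authors' exact tracker:

import numpy as np

def link_frame(prev_poses, curr_poses, max_dist=50.0):
    """Greedy frame-by-frame association by mean per-joint pixel distance.
    prev_poses/curr_poses: lists of (J, 2) arrays; returns (prev_idx, curr_idx) pairs."""
    pairs, used = [], set()
    for i, p in enumerate(prev_poses):
        dists = [np.linalg.norm(p - c, axis=1).mean() if j not in used else np.inf
                 for j, c in enumerate(curr_poses)]
        if dists and np.min(dists) < max_dist:  # only link sufficiently close poses
            j = int(np.argmin(dists))
            pairs.append((i, j))
            used.add(j)                         # each detection joins at most one track
    return pairs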
Article
Full-text available
In this paper, we study the trade-off between accuracy and speed when building an object detection system based on convolutional neural networks. We consider three main families of detectors (Faster R-CNN, R-FCN and SSD) which we view as "meta-architectures". Each of these can be combined with different kinds of feature extractors, such as VGG, Inception or ResNet. In addition, we can vary other parameters, such as the image resolution, and the number of box proposals. We develop a unified framework (in Tensorflow) that enables us to perform a fair comparison between all of these variants. We analyze the performance of many different previously published model combinations, as well as some novel ones, and thus identify a set of models which achieve different points on the speed-accuracy tradeoff curve, ranging from fast models, suitable for use on a mobile phone, to a much slower model that achieves a new state of the art on the COCO detection challenge.
Conference Paper
Full-text available
In this work, we introduce the challenging problem of joint multi-person pose estimation and tracking of an unknown number of persons in unconstrained videos. Existing methods for multi-person pose estimation in images cannot be applied directly to this problem, since it also requires to solve the problem of person association over time in addition to the pose estimation for each person. We therefore propose a novel method that jointly models multi-person pose estimation and tracking in a single formulation. To this end, we represent body joint detections in a video by a spatio-temporal graph and solve an integer linear program to partition the graph into sub-graphs that correspond to plausible body pose trajectories for each person. The proposed approach implicitly handles occlusions and truncations of persons. Since the problem has not been addressed quantitatively in the literature, we introduce a challenging "Multi-Person Pose-Track" dataset, and also propose a completely unconstrained evaluation protocol that does not make any assumptions on the scale, size, location or the number of persons. Finally, we evaluate the proposed approach and several baseline methods on our new dataset.
Conference Paper
Full-text available
This paper is on human pose estimation using Convolutional Neural Networks. Our main contribution is a CNN cascaded architecture specifically designed for learning part relationships and spatial context, and robustly inferring pose even for the case of severe part occlusions. To this end, we propose a detection-followed-by-regression CNN cascade. The first part of our cascade outputs part detection heatmaps and the second part performs regression on these heatmaps. The benefits of the proposed architecture are multi-fold: It guides the network where to focus in the image and effectively encodes part constraints and context. More importantly, it can effectively cope with occlusions because part detection heatmaps for occluded parts provide low confidence scores which subsequently guide the regression part of our network to rely on contextual information in order to predict the location of these parts. Additionally, we show that the proposed cascade is flexible enough to readily allow the integration of various CNN architectures for both detection and regression, including recent ones based on residual learning. Finally, we illustrate that our cascade achieves top performance on the MPII and LSP data sets. Code can be downloaded from http://www.cs.nott.ac.uk/~psxab5/
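A schematic PyTorch sketch of the detection-followed-by-regression idea may help: a first subnetwork emits part detection heatmaps, and a second regresses refined heatmaps from the image together with those detections. Layer sizes are illustrative, not the paper's architecture:

```python
# Detection-followed-by-regression cascade: both stages are supervised,
# and the regressor can fall back on image context for occluded parts
# whose detection heatmaps have low confidence.
import torch
import torch.nn as nn

class DetectRegressCascade(nn.Module):
    def __init__(self, n_parts=16):
        super().__init__()
        self.detector = nn.Sequential(           # image -> part heatmaps
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, n_parts, 1))
        self.regressor = nn.Sequential(          # image + heatmaps -> refined
            nn.Conv2d(3 + n_parts, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, n_parts, 1))

    def forward(self, img):
        heat = self.detector(img)
        refined = self.regressor(torch.cat([img, heat], dim=1))
        return heat, refined   # supervise both outputs during training
```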
Conference Paper
Full-text available
Despite the recent success of neural networks for human pose estimation, current approaches are limited to pose estimation of a single person and cannot handle humans in groups or crowds. In this work, we propose a method that estimates the poses of multiple persons in an image, where a person may be occluded by another person or truncated. To this end, we consider multi-person pose estimation as a joint-to-person association problem. We construct a fully connected graph from a set of detected joint candidates in an image and resolve the joint-to-person association and outlier detection using integer linear programming. Since solving joint-to-person association jointly for all persons in an image is an NP-hard problem and even approximations are expensive, we solve the problem locally for each person. On the challenging MPII Human Pose Dataset for multiple persons, our approach achieves the accuracy of a state-of-the-art method, but is 6,000 to 19,000 times faster.
Article
Full-text available
Pose Machines provide a sequential prediction framework for learning rich implicit spatial models. In this work we show a systematic design for how convolutional networks can be incorporated into the pose machine framework for learning image features and image-dependent spatial models for the task of pose estimation. The contribution of this paper is to implicitly model long-range dependencies between variables in structured prediction tasks such as articulated pose estimation. We achieve this by designing a sequential architecture composed of convolutional networks that directly operate on belief maps from previous stages, producing increasingly refined estimates for part locations, without the need for explicit graphical model-style inference. Our approach addresses the characteristic difficulty of vanishing gradients during training by providing a natural learning objective function that enforces intermediate supervision, thereby replenishing back-propagated gradients and conditioning the learning procedure. We demonstrate state-of-the-art performance and outperform competing methods on standard benchmarks including the MPII, LSP, and FLIC datasets.
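The stage-wise refinement with intermediate supervision can be sketched as follows; this is an illustrative PyTorch reading of the idea with placeholder layer sizes, not the published architecture:

```python
# Pose-machine skeleton: each stage operates on shared image features
# plus the previous stage's belief maps, and each stage gets its own
# loss (intermediate supervision) so early-stage gradients stay healthy.
import torch
import torch.nn as nn

class PoseMachine(nn.Module):
    def __init__(self, n_parts=14, n_stages=3):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.stage1 = nn.Conv2d(32, n_parts, 1)
        self.stages = nn.ModuleList(
            nn.Conv2d(32 + n_parts, n_parts, 7, padding=3)
            for _ in range(n_stages - 1))

    def forward(self, img):
        f = self.features(img)
        beliefs = [self.stage1(f)]
        for stage in self.stages:     # refine previous belief maps
            beliefs.append(stage(torch.cat([f, beliefs[-1]], dim=1)))
        return beliefs

def intermediate_loss(beliefs, target):
    # One MSE term per stage, summed.
    return sum(nn.functional.mse_loss(b, target) for b in beliefs)
```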
Article
Full-text available
Standardized benchmarks are crucial for the majority of computer vision applications. Although leaderboards and ranking tables should not be over-claimed, benchmarks often provide the most objective measure of performance and are therefore important guides for research. Recently, a new benchmark for Multiple Object Tracking, MOTChallenge, was launched with the goal of collecting existing and new data and creating a framework for the standardized evaluation of multiple object tracking methods. The first release of the benchmark focuses on multiple people tracking, since pedestrians are by far the most studied object in the tracking community. This paper accompanies a new release of the MOTChallenge benchmark. Unlike the initial release, all videos of MOT16 have been carefully annotated following a consistent protocol. Moreover, it not only offers a significant increase in the number of labeled boxes, but also provides multiple object classes besides pedestrians and the level of visibility for every single object of interest.
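For reference, the headline MOTChallenge metric is a single formula: MOTA = 1 - (FN + FP + IDSW) / GT, where GT counts all ground-truth objects over all frames. A quick sanity-check helper:

```python
# MOTA combines misses (FN), false alarms (FP), and identity switches
# (IDSW) into one score; 1.0 is perfect, and it can go negative.
def mota(false_negatives, false_positives, id_switches, num_gt):
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

# e.g. 100 GT boxes, 20 misses, 10 false alarms, 2 identity switches:
assert mota(20, 10, 2, 100) == 0.68
```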
Conference Paper
Full-text available
In this work we propose to utilize information about human actions to improve pose estimation in monocular videos. To this end, we present a pictorial structure model that exploits high-level information about activities to incorporate higher-order part dependencies by modeling action-specific appearance models and pose priors. However, instead of using an additional expensive action recognition framework, the action priors are efficiently estimated by our pose estimation framework. This is achieved by starting with a uniform action prior and updating it during pose estimation. We also show that learning the right amount of appearance sharing among action classes improves pose estimation. Our proposed model achieves state-of-the-art performance on two challenging datasets for pose estimation and action recognition with over 80,000 test images.
Article
Full-text available
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of bounding box priors over different aspect ratios and scales per feature map location. At prediction time, the network generates confidences that each prior corresponds to an object of interest and produces adjustments to the prior to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. Our SSD model is simple relative to methods that require object proposals, such as R-CNN and MultiBox, because it completely discards the proposal generation step and encapsulates all the computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the ILSVRC DET and PASCAL VOC datasets confirm that SSD has comparable performance with methods that utilize an additional object proposal step and yet is 100-1000x faster. Compared to other single stage methods, SSD has similar or better performance, while providing a unified framework for both training and inference.
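The prior-box construction can be sketched compactly: one default box per scale and aspect ratio at every feature-map cell. The specific scales and ratios below are illustrative, not the paper's configuration:

```python
# SSD-style default (prior) boxes on a square feature map, expressed in
# normalized image coordinates as (cx, cy, w, h).
import numpy as np

def make_priors(fmap_size, scales=(0.2, 0.4), ratios=(1.0, 2.0, 0.5)):
    priors = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size
            for s in scales:
                for r in ratios:
                    # Aspect ratio r stretches width and shrinks height.
                    priors.append([cx, cy, s * np.sqrt(r), s / np.sqrt(r)])
    return np.array(priors)   # (fmap_size^2 * len(scales) * len(ratios), 4)
```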
Article
Full-text available
This paper considers the task of articulated human pose estimation of multiple people in real-world images. We propose an approach that jointly solves the tasks of detection and pose estimation: it infers the number of persons in a scene, identifies occluded body parts, and disambiguates body parts between people in close proximity to each other. This joint formulation is in contrast to previous strategies that address the problem by first detecting people and subsequently estimating their body pose. We propose a partitioning and labeling formulation of a set of body-part hypotheses generated with CNN-based part detectors. Our formulation, an instance of an integer linear program, implicitly performs non-maximum suppression on the set of part candidates and groups them to form configurations of body parts respecting geometric and appearance constraints. Experiments on four different datasets demonstrate state-of-the-art results for both single person and multi person pose estimation.
Article
Full-text available
Hierarchical feature extractors such as Convolutional Networks (ConvNets) have achieved strong performance on a variety of classification tasks using purely feedforward processing. Feedforward architectures can learn rich representations of the input space but do not explicitly model dependencies in the output spaces, which are quite structured for tasks such as articulated human pose estimation or object segmentation. Here we propose a framework that expands the expressive power of hierarchical feature extractors to encompass both input and output spaces, by introducing top-down feedback. Instead of directly predicting the target outputs in one go, we use a self-correcting model that progressively changes an initial solution by feeding back error predictions, in a process we call Iterative Error Feedback (IEF). We show that IEF improves over the state-of-the-art on the task of articulated human pose estimation on the challenging MPII dataset.
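A conceptual sketch of the IEF loop follows, with `model` and `render` as assumed placeholders: the network that predicts corrections, and a function that rasterizes the current pose estimate into extra input channels so the network can "see" its own output:

```python
# Iterative Error Feedback at inference: rather than predicting joints
# in one shot, the model repeatedly predicts a correction to its
# current estimate and applies it.
import torch

def ief_inference(model, image, init_pose, render, n_steps=4):
    """image: (C, H, W); init_pose: (J, 2) tensor of joint coordinates;
    render(pose) -> (J, H, W) channels encoding the current estimate."""
    pose = init_pose.clone()
    for _ in range(n_steps):
        inp = torch.cat([image, render(pose)], dim=0)    # image + pose
        correction = model(inp.unsqueeze(0)).squeeze(0)  # predicted error
        pose = pose + correction                         # self-correct
    return pose
```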
Article
Full-text available
The objective of this work is human pose estimation in videos, where multiple frames are available. We investigate a ConvNet architecture that is able to benefit from temporal context by combining information across the multiple frames using optical flow. To this end we propose a new network architecture that: (i) regresses a confidence heatmap of joint position predictions; (ii) incorporates optical flow at a mid-layer to align heatmap predictions from neighbouring frames; and (iii) includes a final parametric pooling layer which learns to combine the aligned heatmaps into a pooled confidence map. We show that this architecture outperforms a number of others, including one that uses optical flow solely at the input layers, and one that regresses joint coordinates directly. The new architecture outperforms the state of the art by a large margin on three video pose estimation datasets, including the very challenging Poses in the Wild dataset.
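The flow-based alignment step maps naturally onto `grid_sample`; the following is a rough PyTorch sketch in which the parametric pooling weights stand in for the learned pooling layer:

```python
# Warp a neighbouring frame's joint heatmaps into the current frame
# using optical flow, then combine aligned maps with learned weights.
import torch
import torch.nn.functional as F

def warp_with_flow(heatmap, flow):
    """heatmap: (1, J, H, W); flow: (1, 2, H, W) in pixel units."""
    _, _, h, w = heatmap.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float() + flow[0].permute(1, 2, 0)
    # Normalize sampling coordinates to [-1, 1] for grid_sample.
    grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1
    grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1
    return F.grid_sample(heatmap, grid.unsqueeze(0), align_corners=True)

def parametric_pool(current, warped_neighbours, weights):
    """weights: learnable (n_frames,) tensor, softmax-normalized."""
    maps = torch.stack([current] + warped_neighbours)    # (n, 1, J, H, W)
    w = torch.softmax(weights, dim=0).view(-1, 1, 1, 1, 1)
    return (w * maps).sum(dim=0)                         # pooled confidence
```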
Article
Full-text available
In this paper, we focus on two key aspects of the multiple target tracking problem: 1) designing an accurate affinity measure to associate detections and 2) implementing an efficient and accurate (near) online multiple target tracking algorithm. As the first contribution, we introduce a novel Aggregated Local Flow Descriptor (ALFD) that encodes the relative motion pattern between a pair of temporally distant detections using long term interest point trajectories (IPTs). Leveraging the IPTs, the ALFD provides a robust affinity measure for estimating the likelihood of matching detections regardless of the application scenario. As another contribution, we present a Near-Online Multi-target Tracking (NOMT) algorithm. The tracking problem is formulated as a data association between targets and detections in a temporal window, which is performed repeatedly at every frame. While being efficient, NOMT achieves robustness by integrating multiple cues, including the ALFD metric, target dynamics, appearance similarity, and long term trajectory regularization, into the model. Our ablative analysis verifies the superiority of the ALFD metric over conventional affinity metrics. We run a comprehensive experimental evaluation on two challenging tracking datasets, the KITTI and MOT datasets. The NOMT method combined with the ALFD metric achieves the best accuracy on both datasets with significant margins (about 10% higher MOTA) over the state of the art.
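As a simplified sketch of the association step, one can fuse several affinity cues into a single matrix and solve a bipartite matching per temporal window. The cue matrices and weights below are stand-ins for ALFD and the other cues, which are beyond a few lines:

```python
# Multi-cue data association: fuse affinity matrices, then assign
# detections to targets with the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def fuse_cues(motion, appearance, dynamics, weights=(0.5, 0.3, 0.2)):
    """Each cue: (n_targets, n_detections) affinity matrix in [0, 1]."""
    return (weights[0] * motion + weights[1] * appearance
            + weights[2] * dynamics)

def associate(affinity, min_affinity=0.3):
    rows, cols = linear_sum_assignment(-affinity)  # maximize total affinity
    # Unmatched detections can spawn new targets in the caller.
    return [(t, d) for t, d in zip(rows, cols)
            if affinity[t, d] >= min_affinity]
```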
Conference Paper
Full-text available
Human pose estimation has made significant progress during the last years. However, current datasets are limited in their coverage of the overall pose estimation challenges. Still, these serve as the common sources to evaluate, train and compare different models on. In this paper we introduce a novel benchmark, "MPII Human Pose", that makes a significant advance in terms of diversity and difficulty, a contribution that we feel is required for future developments in human body models. This comprehensive dataset was collected using an established taxonomy of over 800 human activities [1]. The collected images cover a wider variety of human activities than previous datasets, including various recreational, occupational and householding activities, and capture people from a wider range of viewpoints. We provide a rich set of labels including positions of body joints, full 3D torso and head orientation, occlusion labels for joints and body parts, and activity labels. For each image we provide adjacent video frames to facilitate the use of motion information. Given these rich annotations we perform a detailed analysis of leading human pose estimation approaches and gain insights into the successes and failures of these methods.
Article
Full-text available
Recent state-of-the-art performance on human-body pose estimation has been achieved with Deep Convolutional Networks (ConvNets). Traditional ConvNet architectures include pooling layers which reduce computational requirements, introduce invariance and prevent over-training. These benefits of pooling come at the cost of reduced localization accuracy. We introduce a novel architecture which includes an efficient 'position refinement' model that is trained to estimate the joint offset location within a small region of the image. This refinement model is jointly trained in cascade with a state-of-the-art ConvNet model to achieve improved accuracy in human joint location estimation. We show that the variance of our detector approaches the variance of human annotations on the FLIC dataset and outperforms all existing approaches on the MPII-human-pose dataset.
Article
Full-text available
This paper proposes a new hybrid architecture that consists of a deep Convolutional Network and a Markov Random Field. We show how this architecture is successfully applied to the challenging problem of articulated human pose estimation in monocular images. The architecture can exploit structural domain constraints such as geometric relationships between body joint locations. We show that joint training of these two model paradigms improves performance and allows us to significantly outperform existing state-of-the-art techniques.
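The spatial model admits a compact reading: pairwise priors between joints (e.g. "the elbow lies near the shoulder") are expressed as large convolution kernels applied to the part heatmaps, approximating a round of MRF message passing. A loose PyTorch sketch, in which the kernel size and the additive combination are assumptions:

```python
# MRF-flavoured spatial model over part heatmaps: a wide convolution
# carries displacement-based messages from every part to every other
# part; its kernels are learned jointly with the ConvNet detector.
import torch.nn as nn

class SpatialModel(nn.Module):
    def __init__(self, n_parts=14, ksize=15):
        super().__init__()
        self.pairwise = nn.Conv2d(n_parts, n_parts, ksize,
                                  padding=ksize // 2, bias=False)

    def forward(self, unary_heatmaps):
        messages = self.pairwise(unary_heatmaps)  # spatial messages
        return unary_heatmaps + messages          # refined part beliefs
```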
Conference Paper
Full-text available
Although action recognition in videos is widely studied, current methods often fail on real-world datasets. Many recent approaches improve accuracy and robustness to cope with challenging video sequences, but it is often unclear what affects the results most. This paper attempts to provide insights based on a systematic performance evaluation using thoroughly-annotated data of human actions. We annotate human joints for the HMDB dataset (J-HMDB). This annotation can be used to derive ground truth optical flow and segmentation. We evaluate current methods using this dataset and systematically replace the output of various algorithms with ground truth. This enables us to discover what is important - for example, should we work on improving flow algorithms, estimating human bounding boxes, or enabling pose estimation? In summary, we find that high-level pose features greatly outperform low/mid level features; in particular, pose over time is critical, but current pose estimation algorithms are not yet reliable enough to provide this information. We also find that the accuracy of a top-performing action recognition framework can be greatly increased by refining the underlying low/mid level features; this suggests it is important to improve optical flow and human detection algorithms. Our analysis and the J-HMDB dataset should facilitate a deeper understanding of action recognition algorithms.
Technical Report
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Conference Paper
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
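The RPN head itself is small enough to sketch directly: a shared 3x3 convolution over the backbone features, followed by two sibling 1x1 convolutions that, for k anchors per position, predict 2k objectness scores and 4k box regression offsets:

```python
# Region Proposal Network head, minus anchor generation and NMS.
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, 512, 3, padding=1)
        self.cls = nn.Conv2d(512, num_anchors * 2, 1)  # object / background
        self.reg = nn.Conv2d(512, num_anchors * 4, 1)  # box deltas

    def forward(self, features):
        x = self.shared(features).relu()
        return self.cls(x), self.reg(x)
```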
Article
The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provide an analysis on how current architectures fare on the task of action classification on this dataset and how much performance improves on the smaller benchmark datasets after pre-training on Kinetics. We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their parameters. We show that, after pre-training on Kinetics, I3D models considerably improve upon the state-of-the-art in action classification, reaching 80.7% on HMDB-51 and 98.0% on UCF-101.
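The inflation trick is essentially a two-liner: a pretrained 2D kernel is repeated along a new temporal axis and rescaled so that, initially, the 3D filter responds to a static (frozen) video exactly as the 2D filter responded to a single frame:

```python
# Inflate a 2D conv weight into a 3D one for video models (I3D-style).
import torch

def inflate_conv_weight(w2d, time_dim=3):
    """w2d: (out_c, in_c, kH, kW) -> (out_c, in_c, time_dim, kH, kW)."""
    w3d = w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
    return w3d / time_dim   # preserve the response on a static clip
```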
Article
We propose a method for multi-person detection and 2-D keypoint localization (human pose estimation) that achieves state-of-the-art results on the challenging COCO keypoints task. It is a simple, yet powerful, top-down approach consisting of two stages. In the first stage, we predict the location and scale of boxes which are likely to contain people; for this we use the Faster RCNN detector with an Inception-ResNet architecture. In the second stage, we estimate the keypoints of the person potentially contained in each proposed bounding box. For each keypoint type we predict dense heatmaps and offsets using a fully convolutional ResNet. To combine these outputs we introduce a novel aggregation procedure to obtain highly localized keypoint predictions. We also use a novel form of keypoint-based Non-Maximum-Suppression (NMS), instead of the cruder box-level NMS, and a novel form of keypoint-based confidence score estimation, instead of box-level scoring. Our final system achieves an average precision of 0.636 on the COCO test-dev set and 0.628 on the test-standard set, outperforming the CMU-Pose winner of the 2016 COCO keypoints challenge. Further, by using additional labeled data we obtain an even higher average precision of 0.668 on the test-dev set and 0.658 on the test-standard set, thus achieving a roughly 10% improvement over the previous best performing method on the same challenge.
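A simplified sketch of the heatmap-plus-offsets aggregation: each pixel casts a vote for the keypoint location its offset points at, weighted by its heatmap activation. The actual procedure uses dense sub-pixel voting; this integer-grid version is for illustration only:

```python
# Hough-style aggregation of a keypoint heatmap with per-pixel offsets.
import numpy as np

def aggregate(heatmap, offsets):
    """heatmap: (H, W); offsets: (H, W, 2) as (dy, dx) to the keypoint."""
    h, w = heatmap.shape
    votes = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            ty = int(round(y + offsets[y, x, 0]))
            tx = int(round(x + offsets[y, x, 1]))
            if 0 <= ty < h and 0 <= tx < w:
                votes[ty, tx] += heatmap[y, x]   # confidence-weighted vote
    return np.unravel_index(votes.argmax(), votes.shape)  # (y, x) estimate
```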
Conference Paper
In this work, we present an adaptation of the sequence-to-sequence model for structured vision tasks. In this model, the output variables for a given input are predicted sequentially using neural networks. The prediction for each output variable depends not only on the input but also on the previously predicted output variables. The model is applied to spatial localization tasks and uses convolutional neural networks (CNNs) for processing input images and a multi-scale deconvolutional architecture for making spatial predictions at each step. We explore the impact of weight sharing with a recurrent connection matrix between consecutive predictions, and compare it to a formulation where these weights are not tied. Untied weights are particularly suited for problems with a fixed-size structure, where different classes of output are predicted at different steps. We show that chain models achieve top performing results on human pose estimation from images and videos.
Conference Paper
The goal of this paper is to advance the state-of-the-art of articulated pose estimation in scenes with multiple people. To that end we contribute on three fronts. We propose (1) improved body part detectors that generate effective bottom-up proposals for body parts; (2) novel image-conditioned pairwise terms that allow assembling the proposals into a variable number of consistent body part configurations; and (3) an incremental optimization strategy that explores the search space more efficiently, leading both to better performance and significant speed-up factors. Evaluation is done on two single-person and two multi-person pose estimation benchmarks. The proposed approach significantly outperforms the best known multi-person pose estimation results while demonstrating competitive performance on the task of single person pose estimation (models and code available at http://pose.mpi-inf.mpg.de).
Conference Paper
This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods.
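The hourglass module can be written as a short recursion: pool down, process, upsample, and add a skip branch at every scale, so every resolution contributes to the output. Channel counts and layer choices below are illustrative:

```python
# One hourglass module in miniature; stacking several of these with
# intermediate supervision gives the "stacked hourglass" network.
import torch.nn as nn
import torch.nn.functional as F

class Hourglass(nn.Module):
    def __init__(self, depth=4, ch=64):
        super().__init__()
        self.skip = nn.Conv2d(ch, ch, 3, padding=1)   # same-scale branch
        self.down = nn.Conv2d(ch, ch, 3, padding=1)
        self.inner = Hourglass(depth - 1, ch) if depth > 1 \
                     else nn.Conv2d(ch, ch, 3, padding=1)
        self.up = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        skip = self.skip(x)
        y = self.down(F.max_pool2d(x, 2))             # bottom-up
        y = self.up(self.inner(y))
        y = F.interpolate(y, scale_factor=2, mode="nearest")  # top-down
        return skip + y                               # fuse both scales
```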
Conference Paper
We propose a personalized ConvNet pose estimator that automatically adapts itself to the uniqueness of a person’s appearance to improve pose estimation in long videos. We make the following contributions: (i) we show that given a few high-precision pose annotations, e.g. from a generic ConvNet pose estimator, additional annotations can be generated throughout the video using a combination of image-based matching for temporally distant frames, and dense optical flow for temporally local frames; (ii) we develop an occlusion aware self-evaluation model that is able to automatically select the high-quality and reject the erroneous additional annotations; and (iii) we demonstrate that these high-quality annotations can be used to fine-tune a ConvNet pose estimator and thereby personalize it to lock on to key discriminative features of the person’s appearance. The outcome is a substantial improvement in the pose estimates for the target video using the personalized ConvNet compared to the original generic ConvNet. Our method outperforms the state of the art (including top ConvNet methods) by a large margin on three standard benchmarks, as well as on a new challenging YouTube video dataset. Furthermore, we show that training from the automatically generated annotations can be used to improve the performance of a generic ConvNet on other benchmarks.
Article
In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or 'atrous convolution', as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-view, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but takes a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed "DeepLab" system sets the new state of the art on the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU on the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.
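ASPP itself is a handful of parallel dilated convolutions over the same feature map; a compact sketch, with dilation rates and channel widths as illustrative values:

```python
# Atrous spatial pyramid pooling: parallel 3x3 convolutions with
# different dilation rates sample multiple effective fields of view,
# then the branches are fused with a 1x1 convolution.
import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, in_ch=256, out_ch=256, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
            for r in rates)
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```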
Article
The goal of this paper is to advance the state-of-the-art of articulated pose estimation in scenes with multiple people. To that end we contribute on three fronts. We propose (1) improved body part detectors that generate effective bottom-up proposals for body parts; (2) novel image-conditioned pairwise terms that allow assembling the proposals into a variable number of consistent body part configurations; and (3) an incremental optimization strategy that explores the search space more efficiently, leading both to better performance and significant speed-up factors. We evaluate our approach on two single-person and two multi-person pose estimation benchmarks. The proposed approach significantly outperforms the best known multi-person pose estimation results while demonstrating competitive performance on the task of single person pose estimation. Models and code available at http://pose.mpi-inf.mpg.de
Article
In this paper, we present an adaptation of the sequence-to-sequence model for structured output prediction in vision tasks. In this model the output variables for a given input are predicted sequentially using neural networks. The prediction for each output variable depends not only on the input but also on the previously predicted output variables. The model is applied to spatial localization tasks and uses convolutional neural networks (CNNs) for processing input images and a multi-scale deconvolutional architecture for making spatial predictions at each time step. We explore the impact of weight sharing with a recurrent connection matrix between consecutive predictions, and compare it to a formulation where these weights are not tied. Untied weights are particularly suited for problems with a fixed-size structure, where different classes of output are predicted in different steps. We show that chained predictions achieve top performing results on human pose estimation from single images and videos.
Article
Very deep convolutional networks have been central to the largest advances in image recognition performance in recent years. One example is the Inception architecture, which has been shown to achieve very good performance at relatively low computational cost. Recently, the introduction of residual connections in conjunction with a more traditional architecture has yielded state-of-the-art performance in the 2015 ILSVRC challenge; its performance was similar to the latest generation Inception-v3 network. This raises the question of whether there is any benefit in combining the Inception architecture with residual connections. Here we give clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly. There is also some evidence of residual Inception networks outperforming similarly expensive Inception networks without residual connections by a thin margin. We also present several new streamlined architectures for both residual and non-residual Inception networks. These variations improve the single-frame recognition performance on the ILSVRC 2012 classification task significantly. We further demonstrate how proper activation scaling stabilizes the training of very wide residual Inception networks. With an ensemble of three residual networks and one Inception-v4 network, we achieve 3.08 percent top-5 error on the test set of the ImageNet classification (CLS) challenge.
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
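The core idea fits in a few lines: a block computes a residual F(x) and outputs F(x) + x, so the identity mapping is trivially representable and very deep stacks remain optimizable. A standard basic block (channel count illustrative):

```python
# Basic residual block: two 3x3 conv layers plus an identity shortcut.
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)   # F(x) + identity shortcut
```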
Article
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
Conference Paper
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old, along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.