Conference Paper

ArtTrack: Articulated Multi-Person Tracking in the Wild

... Most multi-person pose estimation and tracking methods can be categorized into two pipelines: bottom-up [4,5] and top-down [6][7][8][9]. Due to the absence of a global view of individual instances in the bottom-up pipeline, state-of-the-art top-down approaches excel in accuracy on large-scale benchmarks. ...
... However, bottom-up approaches may struggle with body part association in occluded scenes. In [4,5], the multi-person pose tracking challenge is introduced, and a spatial graph is extended to a spatiotemporal graph based on bottom-up methods [29]. While [4] achieves plausible results in complex videos by solving a minimum-cost multicut problem, the handcrafted features in probabilistic graphical models are not necessarily optimal for long video clips. ...
Article
Full-text available
Tracking the articulated poses of multiple individuals in complex videos is a highly challenging task due to a variety of factors that compromise the accuracy of estimation and tracking. Existing frameworks often rely on intricate propagation strategies and extensive exchange of flow data between video frames. In this context, we propose a spatiotemporal sampling framework that addresses the degradation of frames at the feature level, offering a simple yet effective network block. Our spatiotemporal sampling mechanism empowers the framework to extract meaningful features from neighboring video frames, thereby optimizing the accuracy of pose detection in the current frame. This approach results in significant improvements in running latency. When evaluated on the COCO dataset and the mixed dataset, our approach outperforms other methods in terms of average precision (AP), recall rate (AR), and acceleration ratio. Specifically, we achieve a 3.7% increase in AP, a 1.77% increase in AR, and a speedup of 1.51 times compared to mainstream state-of-the-art (SOTA) methods. Furthermore, when evaluated on the PoseTrack2018 dataset, our approach demonstrates superior accuracy in multi-object tracking, as measured by the multi-object tracking accuracy (MOTA) metric. Our method achieves an impressive 11.7% increase in MOTA compared to the prevailing SOTA methods.
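The abstract above describes sampling features from neighboring frames to compensate for degraded frames. As a rough, hypothetical PyTorch sketch of that general idea (the module name, offset predictor, and fusion layer are assumptions, not the paper's architecture), one can warp a neighbor frame's feature map with a learned offset field and fuse it with the current frame's features:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatiotemporalSampling(nn.Module):
    """Warp a neighboring frame's feature map toward the current frame
    using a learned offset field, then fuse the two (illustrative sketch)."""
    def __init__(self, channels):
        super().__init__()
        # Predict a per-pixel (dx, dy) sampling offset from both feature maps.
        self.offset = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat_cur, feat_nbr):
        n, _, h, w = feat_cur.shape
        offsets = self.offset(torch.cat([feat_cur, feat_nbr], dim=1))
        # Normalized sampling grid in [-1, 1], shifted by the learned offsets.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
        base = torch.stack([xs, ys], dim=-1).expand(n, h, w, 2).to(feat_cur)
        grid = base + offsets.permute(0, 2, 3, 1)
        warped = F.grid_sample(feat_nbr, grid, align_corners=True)
        return self.fuse(torch.cat([feat_cur, warped], dim=1))

block = SpatiotemporalSampling(64)
out = block(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```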
... In this section, the performance of the proposed ConsistencyTrack is evaluated on two popular datasets: MOT17 and DanceTrack [33,34,35,36]. First, the noise robustness and various characteristics of ConsistencyTrack are tested through experiments. ...
... DanceTrack Dataset. The DanceTrack dataset [35,36] is designed to evaluate tracking algorithms in dynamic and complex scenarios, specifically focusing on dance performances. It includes sequences with fast, non-linear movements and frequent occlusions. ...
Preprint
Full-text available
Multi-object tracking (MOT) is a critical technology in computer vision, designed to detect multiple targets in video sequences and assign each target a unique ID per frame. Existing MOT methods excel at accurately tracking multiple objects in real-time across various scenarios. However, these methods still face challenges such as poor noise resistance and frequent ID switches. In this research, we propose ConsistencyTrack, a novel joint detection and tracking (JDT) framework that formulates detection and association as a denoising diffusion process on perturbed bounding boxes. This progressive denoising strategy significantly improves the model's noise resistance. During the training phase, paired object boxes within two adjacent frames are diffused from ground-truth boxes to a random distribution, and the model then learns to detect and track by reversing this process. In inference, the model refines randomly generated boxes into detection and tracking results through minimal denoising steps. ConsistencyTrack also introduces an innovative target association strategy to address target occlusion. Experiments on the MOT17 and DanceTrack datasets demonstrate that ConsistencyTrack outperforms other compared methods, notably surpassing DiffusionTrack in inference speed and other performance metrics. Our code is available at https://github.com/Tankowa/ConsistencyTrack.
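The linked repository holds the authoritative implementation. Purely as an illustration of the "diffuse ground-truth boxes toward a random distribution" training step common to diffusion-style detectors, a minimal sketch with an assumed linear noise schedule might look like:

```python
import torch

def diffuse_boxes(gt_boxes, t, num_steps=1000, scale=2.0):
    """Forward-diffuse ground-truth boxes (cx, cy, w, h in [0, 1]) at step t.
    Illustrative only: a linear beta schedule, as in standard DDPM training."""
    betas = torch.linspace(1e-4, 2e-2, num_steps)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]
    # Map boxes to a signal range centered at 0 before adding noise.
    x0 = (gt_boxes * 2.0 - 1.0) * scale
    noise = torch.randn_like(x0)
    xt = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise
    return xt, noise  # the model learns to reverse this perturbation

boxes = torch.tensor([[0.5, 0.5, 0.2, 0.4]])
xt, eps = diffuse_boxes(boxes, t=500)
print(xt)
```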
... The top-down methods [14][15][16][17][18] detect all people in an image and then perform human pose estimation for each detected human. In contrast, the bottom-up methods [19][20][21][22][23][24][25] only run a joint detection network to obtain all the joint candidates and then robustly match them using specifically designed joint matching cues. The bottom-up method uses a single forward pass of the neural network in the detection phase and additionally requires only a postprocessing stage with less computation. ...
... In contrast to the top-down approach, bottom-up methods [20][21][22][23][50][51][52][53][54][55] apply a global joint detector to the whole image to detect all the joint candidates, cluster them according to the defined joint matching cues, and obtain the poses. As shown in Figure 2, the bottom-up method still implements the pose estimation task in a two-stage way: it completes joint detection and then performs joint grouping. ...
Article
Full-text available
To effectively solve the problems of human occlusion and motion blur in pose estimation algorithms, this paper proposes a bottom-up method for multi-person pose estimation based on human anchor joints and perception-enhancement networks. First, in terms of the detection task label, we divide human joint points into two groups—upper body and lower body—and then select two geometric anchor joint points from the two groups as joint point matching clues. Then, the other joint points of each group are represented by offset embedding of the joint point matching clues. Furthermore, two directional anchor joints that are rich in human orientation information are added to constitute a set of human anchor joints and form a new network detection target. Second, we design a perception-enhancement network based on the attention mechanism and feature fusion strategy, which can help the network effectively learn the unique features of each half-body and the inherent consistent features of the whole body. The proposed network has a stronger detection task modelling ability. In the test phase, based on the greedy strategy, the postprocessing algorithm is carried out to obtain the pose estimation results of multiple people by the final joint extraction and matching. The experimental results on the MPII dataset and CrowdPose dataset demonstrate the effectiveness of the proposed method. The code is open source and available online (https://github.com/Ozone-oo/perception_enhancement_network.git).
... Lifting is an important issue related to injuries of MMH workers; many researchers have used deep learning models applied to lifting videos or images to investigate lifting posture assessment or estimate lifting load on the lower back [10][11][12][13]. Several different algorithms for pose estimation have been published over the past decade, such as OpenPose [14,15], DeepLabCut [16], DeepPose [17], DeeperCut [18], Alpha Pose [19], and ArtTrack [20]. OpenPose is a well-known open-source library and has been adopted by many researchers and various applications in recent years [21][22][23][24][25]. OpenPose utilizes a unique architecture that combines convolutional neural networks with a part affinity field to accurately identify and track body parts across multiple individuals. ...
Article
Full-text available
Background: Occupational low back pain (LBP) is a pervasive health issue that significantly impacts productivity and contributes to work-related musculoskeletal disorders (WMSDs). Inadequate lifting postures are a primary, modifiable risk factor associated with LBP, making early detection of unsafe practices crucial to mitigating occupational injuries. Our study aims to address these limitations by developing a markerless, smartphone-based camera system integrated with a deep learning model capable of accurately classifying lifting postures. Material and Method: We recruited 50 healthy adults who participated in lifting tasks using correct and incorrect postures to build a robust dataset. Participants lifted boxes of varying sizes and weights while their movements were recorded from multiple angles and heights to ensure comprehensive data capture. We used the OpenPose algorithm to detect and extract key body points to calculate relevant biomechanical features. These extracted features served as inputs to a bidirectional long short-term memory (LSTM) model, which classified lifting postures into correct and incorrect categories. Results: Our model demonstrated high classification accuracy across all datasets, with accuracy rates of 96.9% for Tr, 95.6% for the testing set, and 94.4% for training. We observed that environmental factors, such as camera angle and height, slightly influenced the model’s accuracy, particularly in scenarios where the subject’s posture partially occluded key body points. Nonetheless, these variations were minor, confirming the robustness of our system across different conditions. Conclusions: This study demonstrates the feasibility and effectiveness of a smartphone camera and AI-based system for lifting posture classification. The system’s high accuracy, low setup cost, and ease of deployment make it a promising tool for enhancing workplace ergonomics. This approach highlights the potential of artificial intelligence to improve occupational safety and underscores the relevance of affordable, scalable solutions in the pursuit of healthier workplaces.
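The study's exact feature set and hyperparameters are not reproduced here; a minimal Keras sketch of a bidirectional LSTM that classifies fixed-length sequences of pose-derived features into correct/incorrect lifting postures (sequence length, feature count, and layer sizes are assumptions) could look like:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, N_FEATURES = 60, 12  # assumed: 60 frames, 12 biomechanical features

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN, N_FEATURES)),
    layers.Bidirectional(layers.LSTM(64, return_sequences=False)),
    layers.Dropout(0.3),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # correct vs. incorrect posture
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Placeholder data standing in for OpenPose-derived joint-angle sequences.
X = np.random.rand(8, SEQ_LEN, N_FEATURES).astype("float32")
y = np.random.randint(0, 2, size=(8, 1))
model.fit(X, y, epochs=1, verbose=0)
```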
... MAN is composed of a ResNet-50 [21] and eight deconvolutional layers. When ResNet was first proposed in 2015, it achieved first place in the ImageNet image classification task, and subsequently its excellent performance led to its widespread application in various object recognition tasks based on convolutional neural networks, such as ArtTrack [22], DeepCut [23], and DeepCut2 [24]. Therefore, in this study, we introduced ResNet-50 as a keypoint detector into MAN. ...
Article
Full-text available
Pain assessment in trigeminal neuralgia (TN) mouse models is essential for exploring its pathophysiology and developing effective analgesics. However, pain assessment methods for TN mouse models have not been widely studied, resulting in a critical gap in our understanding of TN. With the rapid advancement of deep learning, numerous pain assessment methods based on deep learning have emerged. Nonetheless, these methods have some limitations: (1) insufficiently objective supervision signals for training, (2) failure to account for the dynamic behavioral characteristics of mouse models in the constructed models and (3) inadequate generalization ability of the models. In this study, we initially constructed an objective pain grading dataset as the ground truth for model training, which remedies the limitations of prior studies that relied on subjective evaluation as supervisory signals. Then we proposed a novel deep neural network, named trigeminal neuralgia pain assessment network (TNPAN), which fuses the static texture characteristics and dynamic behavioral characteristics of mouse facial expressions. The promising experimental results demonstrate that TNPAN exhibits exceptional accuracy and generalization capability in pain assessment.
... Multi-object tracking via human pose estimation [35][36][37][38] offers advantages over traditional bounding box-based tracking methods in scenarios with occlusion and similar appearances. Bounding box-based tracking methods often struggle to maintain tracking continuity when the target is partially occluded. ...
Preprint
Full-text available
Multi-object tracking (MOT) is crucial for various multi-agent analyses such as evaluating team sports tactics, player movements, and performance. While pedestrian tracking has advanced with Tracking-by-Detection MOT, team sports like basketball pose unique challenges. These challenges include players' unpredictable movements, frequent close interactions, and visual similarities that complicate pose labeling and lead to significant occlusions, frequent ID switches, and high manual annotation costs. To address these challenges, we propose a novel pose-based virtual marker (VM) MOT method for team sports, named Sports-vmTracking. This method builds on the vmTracking approach developed for multi-animal tracking with active learning. First, we constructed a 3x3 basketball pose dataset for VMs and applied active learning to enhance model performance in generating VMs. Then, we overlaid the VMs on video to identify players, extract their poses with unique IDs, and convert these into bounding boxes for comparison with automated MOT methods. Using our 3x3 basketball dataset, we demonstrated that our VM configuration is highly effective, reducing the need for manual corrections and labeling during pose model training while maintaining high accuracy. Our approach achieved an average HOTA score of 72.3%, over 10 points higher than other state-of-the-art methods without VMs, and resulted in 0 ID switches. Beyond improving performance in handling occlusions and minimizing ID switches, our framework could substantially increase time and cost efficiency compared to traditional manual annotation.
... Top-down methods [5,10,26,30,41,43,44] first detect individual instances using an object detector [13,17,29,35,36] and then estimate the pose within each detected bounding box. Bottom-up methods [4,6,11,16,21,33] first detect all body parts in the image and then group them into individual instances. Recently, transformer-based methods [26,30,43,44] have shown promising results in pose estimation tasks. ...
Preprint
Pose estimation is a crucial task in computer vision, with wide applications in autonomous driving, human motion capture, and virtual reality. However, existing methods still face challenges in achieving high accuracy, particularly in complex scenes. This paper proposes a novel pose estimation method, GatedUniPose, which combines UniRepLKNet and Gated Convolution and introduces the GLACE module for embedding. Additionally, we enhance the feature map concatenation method in the head layer by using DySample upsampling. Compared to existing methods, GatedUniPose excels in handling complex scenes and occlusion challenges. Experimental results on the COCO, MPII, and CrowdPose datasets demonstrate that GatedUniPose achieves significant performance improvements with a relatively small number of parameters, yielding better or comparable results to models with similar or larger parameter sizes.
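Gated convolution itself has a standard form: a feature branch modulated elementwise by a learned sigmoid gate. A minimal PyTorch sketch of that generic block follows (not GatedUniPose's actual module):

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Gated convolution: a feature branch modulated by a learned sigmoid gate."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)

    def forward(self, x):
        # The gate learns where to pass or suppress the feature response.
        return self.feature(x) * torch.sigmoid(self.gate(x))

x = torch.randn(1, 32, 64, 64)
print(GatedConv2d(32, 32)(x).shape)  # torch.Size([1, 32, 64, 64])
```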
... In bottom-up methods, the temporal pose association can be solved as a linear programming problem utilizing a spatiotemporal graph [89,92], extending image-based multi-person HPE solutions [2,45]. However, these methods are slow, hindering real-time applications. ...
Preprint
Full-text available
Human modelling and pose estimation stands at the crossroads of Computer Vision, Computer Graphics, and Machine Learning. This paper presents a thorough investigation of this interdisciplinary field, examining various algorithms, methodologies, and practical applications. It explores the diverse range of sensor technologies relevant to this domain and delves into a wide array of application areas. Additionally, we discuss the challenges and advancements in 2D and 3D human modelling methodologies, along with popular datasets, metrics, and future research directions. The main contribution of this paper lies in its up-to-date comparison of state-of-the-art (SOTA) human pose estimation algorithms in both 2D and 3D domains. By providing this comprehensive overview, the paper aims to enhance understanding of 3D human modelling and pose estimation, offering insights into current SOTA achievements, challenges, and future prospects within the field.
... Human pose estimation is a popular subject in computer vision studies, with the goal of detecting and marking the positions of human key points (e.g., head and wrists) in an image. It has numerous applications in diverse domains, such as video surveillance, autonomous driving, and motion analysis (Insafutdinov et al., 2017; Li et al., 2018; Zheng et al., 2019; Fang ZJ and López, 2020). Human pose estimation has developed rapidly with the establishment of large datasets (Sapp and Taskar, 2013; Andriluka et al., 2014; Lin et al., 2014) and deep learning (Wang M et al., 2012; Chu et al., 2017; Martinez et al., 2017; Yang X et al., 2017, 2018; Liu et al., 2019). ...
Article
Due to factors such as motion blur, video out-of-focus, and occlusion, multi-frame human pose estimation is a challenging task. Exploiting temporal consistency between consecutive frames is an efficient approach for addressing this issue. Currently, most methods explore temporal consistency through refinements of the final heatmaps. The heatmaps contain the semantic information of keypoints and can improve the detection quality to a certain extent. However, they are generated from features, and feature-level refinements are rarely considered. In this paper, we propose a human pose estimation framework with refinements at both the feature and semantic levels. We align auxiliary features with the features of the current frame to reduce the loss caused by different feature distributions. An attention mechanism is then used to fuse auxiliary features with current features. At the semantic level, we use the difference information between adjacent heatmaps as auxiliary features to refine the current heatmaps. The method is validated on the large-scale benchmark datasets PoseTrack2017 and PoseTrack2018, and the results demonstrate the effectiveness of our method.
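As an illustration of fusing already-aligned auxiliary features with current-frame features through attention, here is a minimal channel-attention sketch in PyTorch; the gating design is an assumption, not the paper's exact mechanism:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse current-frame features with (already aligned) auxiliary features
    via a learned per-channel attention weight (illustrative sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat_cur, feat_aux):
        # w in (0, 1) decides per channel how much of each source to keep.
        w = self.attn(torch.cat([feat_cur, feat_aux], dim=1))
        return w * feat_cur + (1.0 - w) * feat_aux

fused = AttentionFusion(48)(torch.randn(2, 48, 24, 24), torch.randn(2, 48, 24, 24))
print(fused.shape)  # torch.Size([2, 48, 24, 24])
```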
... In most large-scale applications, manually cropping all of the human photos is infeasible. Person detection [60,61] or tracking techniques [62,63] are commonly used to obtain bounding boxes. ...
Article
Re-identification (Re-ID) is a process that seeks to identify individuals of concern from successive non-overlapping photographs. The area of computer vision has recently seen an uptick in the amount of attention focused on deep neural networks, especially given the popularity of smart monitoring systems and the development of sophisticated learning algorithms. We classified existing Re-ID technologies into closed-world and open-world contexts based on the components used. The closed-world scenario has been commonly used under a variety of data analysis hypotheses, and it has brought precise results when applied to a variety of datasets utilizing deep learning techniques. We began with a comprehensive overview of closed-world person Re-ID, considering deep metric learning, extensive feature representation learning, ranking optimization, and in-depth analysis. Given the performance accomplished in the closed-world scenario, the focus of Re-ID research has lately turned to the open-world setting, which brings new issues. This setting is more akin to what we would find in real-world circumstances. We summarized the unsupervised Re-ID literature as well as current research trends and proposed future studies.
... Recently, the research community has witnessed the ubiquitous intelligence of machine learning, which has advanced its application to feature extraction and representation. Some examples to support this are human identifiers such as MPII [8], COCO for human skeletons [9], DeepPose for human body part detection in images [10], the Stacked Hourglass network [11], ArtTrack [12], OpenPose [13], [14], DeepCut [15], and human pose detection [16]. However, machine learning cannot detect each target object as accurately as human beings. ...
Research
Full-text available
The applications of deep learning to livestock farming have in recent years gained wide acceptance from the computer vision community due to the continuous achievements of its applications to agricultural tasks. Moreover, the essentiality of deep learning lies in its practicality in detecting, segmenting, and classifying video and image objects, without which precision livestock farming would have been impossible. However, the applications of most state-of-the-art deep learning models to the segmentation of images containing multiple cows are not accurate and cannot generate colorimetric information, due to the poor pre-processing mechanisms inherent in the associated methods and the unequal training of their backbone layers. To overcome the above-mentioned limitations, an enhanced deep learning framework of Mask Region-based Convolutional Neural Network (Mask R-CNN) based on Generalized Color Fourier Descriptors (GCFD) is proposed. The enhanced model produced a 0.93 mean Average Precision (mAP). The result shows the performance capability of the proposed framework over state-of-the-art models for cow image segmentation.
... The top-down methods (Fang et al., 2017; Chen et al., 2018; Su et al., 2019; Papandreou et al., 2017; He et al., 2017; Xiao et al., 2018; Sun et al., 2019; Li et al., 2019b; Moon et al., 2019; Wang et al., 2020; Cai et al., 2020; Huang et al., 2020; Zhang et al., 2020a) first employ a human detector to outline every person in an image with a bounding box, then perform single person pose estimation (SPPE) within that bounding box. On the other hand, the bottom-up way of pose estimation (Iqbal & Gall, 2016; Pishchulin et al., 2016; Insafutdinov et al., 2016, 2017; Cao et al., 2021; Newell et al., 2017; Kreiss et al., 2019; Nie et al., 2019; Jin et al., 2020; Cheng et al., 2020a; Papandreou et al., 2018; Kocabas et al., 2018) first estimates every joint in an image, and then groups those joints into persons via grouping algorithms. Both top-down and bottom-up methods usually employ a strong CNN-based network (Newell et al., 2016; Wei et al., 2016; He et al., 2016) supervised by heatmap-based joint ground truths. ...
Article
Full-text available
Pose estimation in crowded scenes is key to understanding human behavior in real-life applications. Most existing CNN-based pose estimation methods often depend on the appearance of visible parts as cues to localize human joints. However, occlusion is typical in crowded scenes, and invisible body parts have no valid features for joint localization. Introducing prior information about the human pose structure to infer the locations of occluded parts is a natural solution to this problem. In this paper, we argue that learning structural information based on human joints alone is not enough to address human body variations and could be prone to overfitting. From a perspective on the human pose as a dual representation of joints and limbs, we propose a pose refinement network, coined as dual graph network (DGN), to jointly learn the structural information of body joints and limbs by incorporating the cooperative constraints between two branches. Specifically, our DGN has two coupled graph convolutional network (GCN) branches to model the structure information of joints and limbs. Each stage in a branch is composed of a feature aggregator and a GCN module for inter-branch information fusion and intra-branch context extraction, respectively. In addition, to enhance the modeling capacity of the GCN, we design an adaptive GCN layer (AGL) embedded in the GCN module to handle each pose instance based on its graph structure. We also propose heatmap-guided sampling to leverage the features of the body parts to provide rich visual features for the inference of occluded parts. We perform extensive experiments on five challenging datasets to demonstrate the effectiveness of our DGN on pose estimation. Our DGN obtains a significant performance improvement from 67.9 to 72.4 mAP on the CrowdPose dataset with the same CNN-based pose estimator and training strategy as the OPEC-Net. It shows that, compared to the OPEC-Net, which only considers joints, our DGN has a clear advantage due to the joint consideration of both joints and limbs. Meanwhile, our DGN is also helpful for pose estimation on general datasets (i.e., COCO and PoseTrack) with less occlusion and mutual interference, demonstrating the generalization power of DGN in refining human poses.
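To make the adaptive-graph idea concrete, a simplified PyTorch sketch of a graph convolution over body joints with a fixed skeleton adjacency plus a learned adaptive term follows; the real AGL adapts per pose instance, which this sketch omits:

```python
import torch
import torch.nn as nn

class AdaptiveGCNLayer(nn.Module):
    """Graph convolution over body joints: a base skeleton adjacency plus
    a learned adaptive term (a simplified, instance-agnostic sketch)."""
    def __init__(self, in_ch, out_ch, num_joints, base_adj):
        super().__init__()
        self.register_buffer("base_adj", base_adj)            # fixed skeleton graph
        self.adapt = nn.Parameter(torch.zeros(num_joints, num_joints))
        self.linear = nn.Linear(in_ch, out_ch)

    def forward(self, x):                  # x: (batch, joints, channels)
        adj = torch.softmax(self.base_adj + self.adapt, dim=-1)
        return torch.relu(adj @ self.linear(x))

J = 17
adj = torch.eye(J)  # stand-in for the COCO skeleton adjacency
layer = AdaptiveGCNLayer(64, 64, J, adj)
print(layer(torch.randn(4, J, 64)).shape)  # torch.Size([4, 17, 64])
```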
... Challenges in acquiring valid and repeatable data in human subjects can arise due to the relative motion and location of the skin, where markers are placed, with respect to the actual skeletal movement and location, as highlighted in previous research studies [15]. Recent advancements in computer vision and marker-less techniques have promoted the development of posture-estimation algorithms that can track human motion with high accuracy and minimal technical requirements [16][17][18][19][20][21]. As such, these algorithms can potentially revolutionize joint-angle assessments in clinical settings by overcoming the limitations of existing methods. ...
Article
Full-text available
Substantial advancements in markerless motion capture accuracy exist, but discrepancies persist when measuring joint angles compared to those taken with a goniometer. This study integrates machine learning techniques with markerless motion capture, with an aim to enhance this accuracy. Two artificial intelligence-based libraries, MediaPipe and LightGBM, were employed to perform markerless motion capture and shoulder abduction angle estimation. The motion of ten healthy volunteers was captured using smartphone cameras with right shoulder abduction angles ranging from 10° to 160°. The cameras were set diagonally at 45°, 30°, 15°, 0°, -15°, or -30° relative to the participant, situated at a distance of 3 m. To estimate the abduction angle, machine learning models were developed considering the angle data from the goniometer as the ground truth. The model performance was evaluated using the coefficient of determination R2 and the mean absolute percentage error, which were 0.988 and 1.539%, respectively, for the trained model. This approach could estimate the shoulder abduction angle even when the camera was positioned diagonally with respect to the subject. Thus, the proposed models can be utilized for the real-time estimation of shoulder motion during rehabilitation or sports motion.
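A rough Python sketch of such a pipeline, assuming MediaPipe Pose for landmark extraction and a LightGBM regressor trained against goniometer angles (the feature choices and the placeholder arrays are assumptions, not the study's protocol):

```python
import numpy as np
import cv2
import mediapipe as mp
import lightgbm as lgb
from sklearn.metrics import r2_score, mean_absolute_percentage_error

mp_pose = mp.solutions.pose

def shoulder_features(image_bgr):
    """Extract simple right-shoulder features from one frame with MediaPipe.
    Assumes a person is detected; real code should check pose_landmarks."""
    with mp_pose.Pose(static_image_mode=True) as pose:
        res = pose.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    lm = res.pose_landmarks.landmark
    hip = lm[mp_pose.PoseLandmark.RIGHT_HIP]
    sho = lm[mp_pose.PoseLandmark.RIGHT_SHOULDER]
    elb = lm[mp_pose.PoseLandmark.RIGHT_ELBOW]
    trunk = np.array([hip.x - sho.x, hip.y - sho.y])
    arm = np.array([elb.x - sho.x, elb.y - sho.y])
    cos = trunk @ arm / (np.linalg.norm(trunk) * np.linalg.norm(arm))
    raw_angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return [raw_angle, sho.x, sho.y, elb.x, elb.y]

# X: per-frame features, y: goniometer angles (placeholders for collected data).
X = np.random.rand(200, 5)
y = np.random.uniform(10, 160, size=200)
model = lgb.LGBMRegressor(n_estimators=200).fit(X[:150], y[:150])
pred = model.predict(X[150:])
print(r2_score(y[150:], pred), mean_absolute_percentage_error(y[150:], pred))
```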
... The proposed formulation allows us to recover 3D motion even if the number of detected landmarks is smaller than the number of DoFs. We assume that the 2D landmarks can be detected with a state-of-the-art 2D human pose detector [7,24]. Contributions and specificities. ...
Conference Paper
Full-text available
This work introduces a method to robustly reconstruct 3D human motion from the motion of 2D skeletal landmarks. We propose to use a lasso (least absolute shrinkage and selection operator) optimization framework where the 1-norm is computed over the vector of differential angular kinematics and the 2-norm is computed over the differential 2D reprojection error. The 1-norm term allows us to model sparse kinematic angular motion. The minimization of the reprojection error allows us to assume bounded noise in both the kinematic model and the 2D landmark detection. This bound is controlled by a scale factor associated with the 2-norm data term. An a posteriori verification condition is provided to check whether or not the lasso formulation has allowed us to recover the ground-truth 3D human motion. Results on publicly available data demonstrate the effectiveness of the proposed approach against state-of-the-art methods. They show that both the sparsity and bounded-noise assumptions encoded in the lasso formulation are robust priors for safely recovering 3D human motion.
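In symbols, the described objective can be rendered roughly as follows; the notation is assumed rather than the authors' exact formulation:

```latex
% Sparse kinematic recovery under bounded 2D reprojection noise (sketch).
% \Delta\theta_t : differential joint-angle (kinematic) update at frame t
% \Pi(\cdot)     : camera projection of the posed 3D skeleton
% x_t            : detected 2D landmarks, \sigma : noise scale factor
\min_{\Delta\theta_t}\; \|\Delta\theta_t\|_1
\quad \text{s.t.} \quad
\big\| \Pi\!\big(\theta_{t-1} + \Delta\theta_t\big) - x_t \big\|_2 \le \sigma
```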
... For HPE in videos, it is essential to exploit their temporal information [10], [11]. Several earlier methods [10], [11], [19] approached the video pose estimation task as a two-stage problem, first detecting the body joints in individual frames and then applying temporal filtering techniques. Later, recurrent networks, especially LSTM [20] and GRU [21], were proposed for pose estimation [22], [23]. ...
Article
We propose a framework for the integration of heterogeneous networks in human pose estimation (HPE) with the aim of balancing accuracy and computational complexity. Although many existing methods can improve the accuracy of HPE using multiple frames in videos, they also increase the computational complexity. The key difference here is that the proposed heterogeneous framework has various networks for different types of frames, while existing methods use the same networks for all frames. In particular, we propose to divide the video frames into two types, including key frames and non-key frames, and adopt three networks including slow networks, fast networks, and transfer networks in our heterogeneous framework. For key frames, a slow network is used that has high accuracy but high computational complexity. For non-key frames that follow a key frame, we propose to warp the heatmap of a slow network from a key frame via a transfer network and fuse it with a fast network that has low accuracy but low computational complexity. Furthermore, when extending to the usage of long-term frames where a large number of non-key frames follow a key frame, the temporal correlation decreases. Therefore, when necessary, we use an additional transfer network that warps the heatmap from a neighboring non-key frame. The experimental results on PoseTrack 2017 and PoseTrack 2018 datasets demonstrate that the proposed FSPose achieves a better balance between accuracy and computational complexity than the competitor method. Our source code is available at https://github.com/Fenax79/fspose.
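Schematically, the key/non-key frame dispatch could look like the following PyTorch sketch, where the three networks are stand-in stubs and the fusion is reduced to a single convolution (all shapes and the key-frame interval are assumptions):

```python
import torch
import torch.nn as nn

# Stub networks standing in for the three components (shapes assumed).
slow_net = nn.Conv2d(3, 17, 3, padding=1)       # accurate, expensive
fast_net = nn.Conv2d(3, 17, 3, padding=1)       # cheap, less accurate
transfer_net = nn.Conv2d(34, 17, 3, padding=1)  # warps/fuses a key-frame heatmap

def track_heatmaps(frames, key_interval=8):
    """Run the slow network on key frames; elsewhere, fuse a transferred
    key-frame heatmap with the fast network's output (illustrative only)."""
    heatmaps, key_hm = [], None
    for i, frame in enumerate(frames):
        if i % key_interval == 0:
            key_hm = slow_net(frame)
            heatmaps.append(key_hm)
        else:
            fused = transfer_net(torch.cat([key_hm, fast_net(frame)], dim=1))
            heatmaps.append(fused)
    return heatmaps

frames = [torch.randn(1, 3, 64, 64) for _ in range(10)]
hms = track_heatmaps(frames)
print(len(hms), hms[0].shape)  # 10 torch.Size([1, 17, 64, 64])
```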
... Human body estimation, the task of creating a precise 3D model of the human body [1,3], has been an active area of exploration for decades. The capability to estimate the human body in 3D is important for a wide range of applications such as virtual reality, human-computer interaction, motion analysis, biomechanical study, and motion capture [2,3]. Still, creating a precise 3D model of the human body is a challenging task due to the complexity of the human body and the variability of human motion [4]. ...
Article
Full-text available
This paper presents a novel method for estimating the human body in 3D using depth sensor data. The proposed method utilizes a deep neural network to predict the body joints and a kinematic model to estimate the body shape. The approach utilizes a combination of convolutional and recurrent neural networks, trained on a dataset of human subjects, to accurately predict the positions of key body joints. These joints are then used as input to a kinematic model, which estimates the body shape and pose. The method was evaluated on a dataset of human subjects, and the results show that it achieves high accuracy in body joint and shape estimation. The proposed approach outperforms existing methods in terms of both accuracy and robustness, and it is able to handle a wide range of body poses and movements. Moreover, the system is computationally efficient and can run in real time, making it suitable for a variety of applications such as virtual reality, human-computer interaction, and motion analysis. The ability to accurately estimate the human body in 3D is crucial for a wide range of applications, and this work makes significant contributions towards this goal. The proposed method is the first to demonstrate that it is possible to accurately estimate the body shape and pose using only depth sensor data, and it opens up new possibilities for a wide range of research and applications. In summary, this paper presents a real-time, robust, and accurate system for 3D human body estimation using depth sensor data, based on a deep neural network architecture and a kinematic model. The proposed method was evaluated on a dataset of human subjects and achieved high accuracy in body joint and shape estimation, outperforming existing methods in terms of both accuracy and robustness. The approach can be used for a variety of applications such as virtual reality, human-computer interaction, and motion analysis.
Preprint
This paper investigates the task of 2D whole-body human pose estimation, which aims to localize dense landmarks on the entire human body including body, feet, face, and hands. We propose a single-network approach, termed ZoomNet, to take into account the hierarchical structure of the full human body and solve the scale variation of different body parts. We further propose a neural architecture search framework, termed ZoomNAS, to promote both the accuracy and efficiency of whole-body pose estimation. ZoomNAS jointly searches the model architecture and the connections between different sub-modules, and automatically allocates computational complexity for searched sub-modules. To train and evaluate ZoomNAS, we introduce the first large-scale 2D human whole-body dataset, namely COCO-WholeBody V1.0, which annotates 133 keypoints for in-the-wild images. Extensive experiments demonstrate the effectiveness of ZoomNAS and the significance of COCO-WholeBody V1.0.
Article
Human pose estimation and detection are critical for understanding human activities in videos and images. This paper presents a novel approach to meet the advanced demands of human–computer interactions and assisted living systems through enhanced human pose estimation and activity recognition. We introduce IMPos-DNet, an innovative technique that integrates multi-person pose estimation and activity recognition using a 3D Dual Convolution Neural Network (CNN) applied to multiview video datasets. Our approach combines top-down and bottom-up models to improve performance. The top-down network focuses on evaluating human joints for each individual, enhancing robustness against inaccurate bounding boxes, while the bottom-up network employs normalized heatmaps based on human detection, improving resilience to scale variation. By synergizing the 3D poses estimated by both networks, IMPos-DNet produces precise final 3D poses. Our research objectives include advancing the accuracy and efficiency of pose estimation and activity recognition, as well as addressing the scarcity of 3D ground-truth data. To this end, we employ a semi-supervised method, broadening the model’s applicability. Comprehensive experiments on three publicly available datasets—Human3.6M, MuPoTs-3D, and MPI-INF-3DHP—demonstrate the model’s superior accuracy and efficiency. Evaluation results confirm the effectiveness of IMPos-DNet’s individual components, highlighting its potential for reliable human pose estimation and activity recognition.
Preprint
Full-text available
We propose an end-to-end architecture for joint 2D and 3D human pose estimation in natural images. Key to our approach is the generation and scoring of a number of pose proposals per image, which allows us to predict 2D and 3D poses of multiple people simultaneously. Hence, our approach does not require an approximate localization of the humans for initialization. Our Localization-Classification-Regression architecture, named LCR-Net, contains 3 main components: 1) the pose proposal generator that suggests candidate poses at different locations in the image; 2) a classifier that scores the different pose proposals; and 3) a regressor that refines pose proposals both in 2D and 3D. All three stages share the convolutional feature layers and are trained jointly. The final pose estimation is obtained by integrating over neighboring pose hypotheses, which is shown to improve over a standard non-maximum suppression algorithm. Our method recovers full-body 2D and 3D poses, hallucinating plausible body parts when the persons are partially occluded or truncated by the image boundary. Our approach significantly outperforms the state of the art in 3D pose estimation on Human3.6M, a controlled environment. Moreover, it shows promising results on real images for both single and multi-person subsets of the MPII 2D pose benchmark and demonstrates satisfying 3D pose results even for multi-person images.
Article
Full-text available
Recently, deep learning-based methods have been harnessed to address the multiple object tracking (MOT) problem. The tracking-by-detection approach to MOT involves two primary steps: object detection and data association. In the first step, objects of interest are detected in each frame of a video. The second step establishes the correspondence between these detected objects across different frames to track their trajectories. This paper proposes an efficient and unified data association method that utilizes a deep feature association network (deepFAN) to learn the associations. Additionally, the Structural Similarity Index Metric (SSIM) is employed to address uncertainties in the data association, complementing the deep feature association network. These combined association computations effectively link the current detections with the previous tracks, enhancing the overall tracking performance. To evaluate the efficiency of the proposed MOT framework, we conducted a comprehensive analysis of the popular MOT datasets, such as the MOT challenge and UA-DETRAC. The results showed that our technique performed substantially better than the current state-of-the-art methods in terms of standard MOT metrics.
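A minimal sketch of combining deep-feature similarity with SSIM into one association cost, solved with the Hungarian method, might look like this (the weighting and crop handling are assumptions, not the paper's deepFAN):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from skimage.metrics import structural_similarity

def associate(track_crops, det_crops, track_feats, det_feats, w=0.5):
    """Combine appearance-feature cosine similarity with SSIM over grayscale
    crops into one cost matrix, then solve the assignment (Hungarian method)."""
    n, m = len(track_feats), len(det_feats)
    cost = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            cos = track_feats[i] @ det_feats[j] / (
                np.linalg.norm(track_feats[i]) * np.linalg.norm(det_feats[j]))
            ssim = structural_similarity(track_crops[i], det_crops[j],
                                         data_range=1.0)
            cost[i, j] = -(w * cos + (1 - w) * ssim)  # negate: higher is better
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))

crops_a = [np.random.rand(32, 32) for _ in range(3)]
crops_b = [np.random.rand(32, 32) for _ in range(3)]
feats_a = [np.random.rand(128) for _ in range(3)]
feats_b = [np.random.rand(128) for _ in range(3)]
print(associate(crops_a, crops_b, feats_a, feats_b))
```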
Article
Visual perception of ships has been attracting increasing attention in the fields of computer vision and ocean engineering. Despite the extensive work related to landmark detection of common objects, the role of landmarks in ship perception has been overlooked. In this paper, we aim to fill this gap by focusing on ship landmarks. Specifically, we give a comprehensive analysis of both the physical structure and deep features of ships, which finds that highlighted areas in feature maps correspond with structurally significant parts of ships. By summarizing the locations of such areas in ships, we define 20 ship landmarks and build the Ship Landmark Dataset (SLAD), the first ship dataset with landmark annotations. We also provide a benchmark for ship landmark detection by evaluating state-of-the-art landmark detection methods on the newly built SLAD. Moreover, we showcased several applications of ship landmarks, including ship recognition, ship image generation, key area detection for ships, and ship detection. Project web page: https://vsislab.github.io/Ships_VSIS/.
Conference Paper
Full-text available
This article describes the development of a multiplayer exergame that uses a machine learning tool to track people's poses in real time. The system consists of conventional equipment: a computer, a webcam, and a video projector. After an exploratory search of open-source models, the Movenet tool was selected, which uses a convolutional neural network as the basis of its algorithm and can detect up to 6 people at a response rate above 30 frames per second, allowing the game to run without latency perceptible to the user.
Article
Full-text available
This paper delves into the crucial realm of posture assessment in contemporary work environments. The introduction underscores the increasing need for posture correction tools, particularly for individuals engrossed in sedentary jobs, who often overlook their anthropological constraints during work. Sedentary work, exceeding eight hours daily, is not only detrimental to health but can result in Work-Related Musculoskeletal Disorders (WMSDs) like Carpal Tunnel Syndrome, lower back pain, and cervical spondylitis. Ergonomics plays a pivotal role in assessing the well-being of working individuals. Methods such as REBA and RULA are commonly used to evaluate posture, although these methods demand manual readings conducted at various intervals. Different work environments require distinct assessment scores, making them intricate and time-consuming. For industries with dynamic working conditions such as mining, plumbing, construction, logistics, and maintenance, 3D posture recognition proves effective. This ergonomic approach, combined with manual assessment, enhances worker safety and productivity. Conversely, an increasing number of individuals lead sedentary lifestyles due to their profession, typified by prolonged computer use, desk work, and limited physical activity. These behaviours lead to a host of health issues, both physical and psychological. Moreover, the omnipresence of smartphones and attention-diverting content has extended the average time people spend sitting or lying down, causing long-term health problems, including cardiovascular diseases, vitamin deficiencies, and migraines, compounded by the persistence of WMSDs. In the subsequent sections, this paper delves into the advancements in posture assessment techniques, presenting a table of camera-based pose estimation techniques and AI models, categorized by year introduced, model name, strengths, weaknesses, applicability to single or multiple persons, and whether they operate in 2D or 3D space. These advancements represent innovative and efficient methods for assessing posture, making this a valuable resource for researchers, ergonomists, and individuals aiming to enhance their work environments and overall health.
Article
Understanding human posture is a challenging topic, which encompasses several tasks, e.g., pose estimation, body mesh recovery and pose tracking. In this paper, we propose a novel Distribution-Aware Single-stage (DAS) model for the pose-related tasks. The proposed DAS model estimates human position and localizes joints simultaneously, which requires only a single pass. Meanwhile, we utilize normalizing flow to enable DAS to learn the true distribution of joint locations, rather than making simple Gaussian or Laplacian assumptions. This provides a pivotal prior and greatly boosts the accuracy of regression-based methods, thus making DAS achieve comparable performance to the volumetric-based methods. We also introduce a recursive update strategy to progressively approach the regression target, reducing the difficulty of regression and improving the regression performance. We further adapt DAS to multi-person mesh recovery and pose tracking tasks and achieve considerable performance on both tasks. Comprehensive experiments on CMU Panoptic and MuPoTS-3D demonstrate the superior efficiency of DAS, specifically a 1.5 times speedup over the previous best method, and its state-of-the-art accuracy for multi-person pose estimation. Extensive experiments on 3DPW and PoseTrack2018 indicate the effectiveness and efficiency of DAS for human body mesh recovery and pose tracking, respectively, which proves the generality of our proposed DAS model.
Article
Full-text available
Human Pose Estimation (HPE) is the task that aims to predict the location of human joints from images and videos. This task is used in many applications, such as sports analysis and surveillance systems. Recently, several studies have embraced deep learning to enhance the performance of HPE tasks. However, building an efficient HPE model is difficult; many challenges, like crowded scenes and occlusion, must be handled. This paper followed a systematic procedure to review different HPE models comprehensively. About 100 articles published since 2014 on HPE using deep learning were selected using several selection criteria. Both image and video data types of methods were investigated. Furthermore, both single and multiple HPE methods were reviewed. In addition, the available datasets, different loss functions used in HPE, and pretrained feature extraction models were all covered. Our analysis revealed that Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are the most used in HPE. Moreover, occlusion and crowd scenes remain the main problems affecting models’ performance. Therefore, the paper presented various solutions to address these issues. Finally, this paper highlighted the potential opportunities for future work in this task.
Article
Non-contact heartbeat detection and heart rate estimation have been hot topics of research over the last few years. These tasks employ wireless sensors, such as Doppler sensors, to detect the subtle movements associated with heart activity. The detection is either used to reconstruct the R-peaks, and thus the heartbeats, or simply to identify the heart rate (HR). However, due to the faint nature of the chest movements caused by heartbeats, unless the person is still, such movements are overwhelmingly obscured by the motion of the person. That being the case, in this work we are interested in identifying instances where the person is not in motion, in which case we measure their heartbeats and heart rate. To do so, we use a combination of a 3D Light Detection and Ranging (LiDAR) sensor and a Multiple-Input Multiple-Output (MIMO) Doppler radar. The former's objective is to recognize when the person is still, in which case their position and distance from the sensors are reported. The latter's objective is to create a beam directed towards the person's detected chest and measure the reflected signal, for heartbeat detection and HR or respiration rate (RR) estimation. Our experiments demonstrate that determining the subject's location and distance and identifying their chest position using the LiDAR, then collecting the data accordingly using the Doppler radar, leads to better HR detection and RR interval (RRI) estimation. For 5 different scenarios, the RRI estimation error reaches values between 112 ms and 231 ms.
Article
Human pose estimation and tracking are fundamental tasks for understanding human behaviors in videos. Existing top-down framework-based methods usually perform three-stage tasks: human detection, pose estimation and tracking. Although promising results have been achieved, these methods rely heavily on high-performance detectors and may fail to track persons who are occluded or missed by the detector. To overcome these problems, in this paper, we develop a novel keypoint confidence network and a tracking pipeline to improve human detection and pose estimation in top-down approaches. Specifically, the keypoint confidence network is designed to determine whether each keypoint is occluded, and it is incorporated into the pose estimation module. In the tracking pipeline, we propose the Bboxrevision module to reduce missed detections and the ID-retrieve module to correct lost trajectories, improving the performance of the detection stage. Experimental results show that our approach is universal in human detection and pose estimation, achieving state-of-the-art performance on both PoseTrack 2017 and 2018 datasets.
Article
Recent multimedia and computer vision research has focused on analyzing human behavior and activity using images. Skeleton estimation, known as pose estimation, has received significant attention. For human pose estimation, deep learning approaches primarily emphasize the keypoint features. Conversely, in the case of occluded or incomplete poses, the keypoint feature alone is insufficient, especially when there are multiple humans in a single frame. Other features, such as the body border and visibility conditions, can contribute to pose estimation in addition to the keypoint feature. Our model framework integrates multiple features, namely the human body mask features, which can serve as a constraint on keypoint location estimation, the body keypoint features, and the keypoint visibility, via a mask region-based convolutional neural network (Mask-RCNN). A sequential multi-feature learning setup is formed to share multi-features across the structure, whereas, in Mask-RCNN, the only feature that can be shared through the system is the region-of-interest feature. By two-way up-scaling with a shared-weight process to produce the mask, we have addressed the problems of improper segmentation, small intrusions, and object loss when Mask-RCNN is used for instance segmentation. Accuracy is indicated by the percentage of correct keypoints, and our model can identify 86.1% of the correct keypoints.
Article
Full-text available
Human pose estimation (HPE) is a critical problem in computer vision, serving as a foundation for many downstream tasks. However, existing image-based methods tend to perform poorly when applied to video sequences, especially in complex scenes with motion blur and serious occlusion. Therefore, it is essential to develop a specialized pose estimation network for video. In this paper, we propose a human pose estimation network called the Dual Association Network (DANet), designed explicitly for video sequences. It can make full use of the temporal information between video frames and the correlation between joints. The overall framework consists of three modules. The Dual Fusion Network (DFN) utilizes temporal information from adjacent frames to compute position offsets and infer the positions of blurred joints in the current frame. The Joint Association Network (JAN) models the correlation between joints and infers invisible joints based on visible joints. The SpatioTemporal Fusion (STF) module applies deformable convolutions to fuse the outputs from DFN and JAN and refine the final prediction. The application of the three modules resulted in a 1.4 AP improvement in ankle joint detection, particularly in cases where the joint is occluded or blurred due to motion. Our method demonstrated competitive results on two large benchmark datasets, PoseTrack2017 and PoseTrack2018.
Article
Currently, most action recognition networks have deep overall structures, large model parameters, and high requirements for computer hardware. As a result, overly deep network layers easily overfit during recognition. Furthermore, it is also difficult to extract features because of interference in the video, such as illumination changes and occlusion. To solve the above problems, we propose a multiperson action recognition and tracking algorithm based on skeletal keypoint detection. First, a network combining an improved dense convolutional network and part affinity fields is used to extract the skeletal information points of the human body. Then, we present an improved DeepSort network for multiperson target tracking, which contains a Hungarian matching algorithm based on the generalized intersection over union and a pedestrian reidentification network combining GhostNet and a feature pyramid network. Finally, we construct a deep neural network model to classify the extracted human skeletal information and realize action recognition. Experimental results show that the multiperson action recognition and tracking algorithm achieves an action recognition accuracy of 98%. In addition, the multitarget tracking accuracy of the proposed algorithm is improved by 4.2% on the MOT16 dataset. Compared with other common algorithms, the proposed algorithm can achieve high accuracy in detecting keypoints of the human body and improve the accuracy of multiperson action recognition with fewer parameters and lower computational complexity.
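As an illustration of the GIoU-based Hungarian matching mentioned above, a small NumPy/SciPy sketch follows (standalone, not the paper's DeepSort variant):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def giou(a, b):
    """Generalized IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest box enclosing both; penalizes distant non-overlapping boxes.
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    hull = (cx2 - cx1) * (cy2 - cy1)
    return inter / union - (hull - union) / hull

def match(tracks, detections):
    """Hungarian matching on a 1 - GIoU cost matrix (illustrative sketch)."""
    cost = np.array([[1.0 - giou(t, d) for d in detections] for t in tracks])
    return list(zip(*linear_sum_assignment(cost)))

tracks = [(10, 10, 50, 80), (60, 20, 100, 90)]
dets = [(62, 22, 103, 88), (12, 9, 49, 82)]
print(match(tracks, dets))  # expect [(0, 1), (1, 0)]
```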
Chapter
In a photograph or video, human pose estimation attempts to predict the poses of several human body components. Since certain human motions frequently produce characteristic poses, understanding such poses is crucial for action recognition. This chapter focuses on recent developments in action recognition and their application to human pose estimation. With the use of pose estimation techniques and human pose depth estimates, the authors attempt to provide a detailed overview of the top-down and bottom-up models currently employed for action detection. In the framework of real-time posture estimation, this chapter explores traditional computer vision applications. Due to severe injuries, many fitness enthusiasts are unable to achieve their goals; poor training posture and poor mental health are the causes. These people can correct their posture in real time and perform better in a computer vision-based environment. This chapter sheds light on sports, the necessity of correcting posture, and the application of technology in this area.
Article
Objective: Video-based pose estimation is an emerging technology that shows significant promise for improving clinical gait analysis by enabling quantitative movement analysis with little cost in money, time, or effort. The objective of this study is to determine the accuracy of pose estimation-based gait analysis when video recordings are constrained to 3 common clinical or in-home settings (ie, frontal and sagittal views of overground walking, sagittal views of treadmill walking). Methods: Simultaneous video and motion capture recordings were collected from 30 persons after stroke during overground and treadmill walking. Spatiotemporal and kinematic gait parameters were calculated from videos using an open-source human pose estimation algorithm and from motion capture data using traditional gait analysis. Repeated-measures analyses of variance were then used to assess the accuracy of the pose estimation-based gait analysis across the different settings, and the authors examined Pearson and intraclass correlations with ground-truth motion capture data. Results: Sagittal videos of overground and treadmill walking led to more accurate measurements of spatiotemporal gait parameters versus frontal videos of overground walking. Sagittal videos of overground walking resulted in the strongest correlations between video-based and motion capture measurements of lower extremity joint kinematics. Video-based measurements of hip and knee kinematics showed stronger correlations with motion capture versus ankle kinematics for both overground and treadmill walking. Conclusions: Video-based gait analysis using pose estimation provides accurate measurements of step length, step time, and hip and knee kinematics during overground and treadmill walking in persons after stroke. Generally, sagittal videos of overground gait provide the most accurate results. Impact: Many clinicians lack access to expensive gait analysis tools that can help identify patient-specific gait deviations and guide therapy decisions. These findings show that video-based methods that require only common household devices provide accurate measurements of a variety of gait parameters in persons after stroke and could make quantitative gait analysis significantly more accessible.
Article
Due to the loss of light underwater, fish monitoring in underwater surveillance suffers from image distortion. Monitoring and observing fish in culture ponds around the world is becoming more and more important to prevent the fish from suffering uncertain damage. In this study, we propose a stereo underwater surveillance system for Oplegnathus punctatus by developing an underwater depth prediction model and an underwater fish skeleton model based on deep learning. The underwater depth prediction model is a convolutional neural network-based method for extracting underwater depth spatial features. The fish skeleton prediction model extracts 9 keypoints on the fish body. Additionally, since there is no established underwater Oplegnathus punctatus dataset for fish body analysis in culture ponds, we have collected and propose a depth and skeleton Oplegnathus punctatus dataset, which contains underwater information on Oplegnathus punctatus bodies. The experimental results on our self-collected dataset show that the fish body information measurement achieves 94% accuracy in weight. We also compared our proposed method with Mask-RCNN and stereo-matching methods, and our method proved to be the most effective.
Article
Full-text available
The MNIST dataset is a popular benchmark dataset in the field of machine learning and computer vision. The dataset has a training set of 60,000 examples and a test set of 10,000 examples, where the digits have been centered inside 28x28 pixel images. The dataset is commonly used for image classification tasks, where the goal is to train a model to correctly identify the digit represented in each image. The MNIST dataset has been widely used in academic research, with many researchers using it to develop and test new machine learning algorithms. It has also been used in industry, with many companies using it to train and evaluate image recognition systems. The MNIST dataset is an important resource for the machine learning and computer vision communities and has played a significant role in the development of these fields. The MNIST dataset has the advantage of striking a good balance in terms of problem scope. The images span only 10 classes and are only 28x28 pixels in size. However, just because the images are small does not mean that the dataset's digits lack significant variation. It should come as no surprise that some of the digits are challenging even for a human to classify accurately. The composite average of the class the classifier is most likely to choose is displayed alongside a selection of images that are likely to be extremely challenging for a classifier to identify. These images are challenging because they remarkably resemble the typical (or another prevalent) image of another class.
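For reference, a standard minimal baseline on this dataset (not taken from the article) loads the 60,000/10,000 split described above and trains a small classifier:

```python
import tensorflow as tf

# Load the 60,000/10,000 train/test split of 28x28 grayscale digit images.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # one unit per digit class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, verbose=0)
print(model.evaluate(x_test, y_test, verbose=0))  # [test loss, test accuracy]
```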
Preprint
Full-text available
Person identification is a problem that has received substantial attention, particularly in security domains. Gait recognition is one of the most convenient approaches, enabling person identification at a distance without the need for high-quality images. Several review studies address person identification using facial images, silhouette images, and wearable sensors. Although skeleton-based person identification is gaining popularity and overcomes many challenges of traditional approaches, existing surveys lack a comprehensive review of skeleton-based approaches to gait identification. We present a detailed review of the human pose estimation and gait analysis techniques that make skeleton-based approaches possible. The study covers various types of related datasets, tools, methodologies, and evaluation metrics with associated challenges, limitations, and application domains. Detailed comparisons are presented for each of these aspects with recommendations for potential research and alternatives. A common trend throughout this paper is the positive impact that deep learning techniques are beginning to have on topics such as human pose estimation and gait identification. The survey outcomes might be useful for the related research community and other stakeholders in terms of performance analysis of existing methodologies, potential research gaps, application domains, and possible contributions in the future.
Article
With the continuous development of satellite constellations worldwide, dynamic estimation of on-orbit spacecraft plays an increasingly important role in space situation awareness applications. Based on high-resolution inverse synthetic aperture radar (ISAR) imaging from the ground, some exploratory methods have been proposed to estimate target dynamic parameters, such as attitude pointing and spin period. Due to the limited observation view from the ground, most of them rely on long-term measurement, and the robust association of image features needs to be ensured. In this article, a novel approach based on spaceborne ISAR images is proposed to achieve the dynamic estimation of spin satellites. In order to build a synchronous, similar-resolution ISAR imaging system, two adjacent satellites are selected from low Earth orbit constellations according to the radar tracking parameters. Given the obtained pair of ISAR images, the explicit expression between target dynamic parameters and projection features is derived. Inspired by existing work on shape extraction of the human body, a feature extraction network is built on the framework of ResNet to enable automatic processing on spaceborne equipment. In this way, target dynamic parameters can be solved in two steps: instantaneous attitude optimization followed by spin motion optimization. Simulation experiments on a typical spinning spacecraft, Tiangong-I, illustrate the feasibility of the proposed method. A comparison experiment with an existing ground-based method is also made to analyze its advantage in practical applications.
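The abstract states only that the feature extractor is "built on the framework of ResNet". As a hypothetical sketch of that design choice, the snippet below truncates a stock torchvision ResNet-18 into a feature-map extractor; the authors' actual architecture, input formatting, and training are not specified, so everything beyond the ResNet backbone is an assumption.

# Common pattern for a ResNet-framework feature extractor: drop the
# classification head so the network emits spatial feature maps.
import torch
import torch.nn as nn
from torchvision.models import resnet18

backbone = resnet18(weights=None)          # train from scratch on ISAR data
features = nn.Sequential(*list(backbone.children())[:-2])  # drop pool + fc

x = torch.randn(1, 3, 256, 256)            # stand-in (replicated-channel) ISAR image
print(features(x).shape)                   # torch.Size([1, 512, 8, 8])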
Conference Paper
Full-text available
In this work, we introduce the challenging problem of joint multi-person pose estimation and tracking of an unknown number of persons in unconstrained videos. Existing methods for multi-person pose estimation in images cannot be applied directly to this problem, since it also requires solving the problem of person association over time in addition to pose estimation for each person. We therefore propose a novel method that jointly models multi-person pose estimation and tracking in a single formulation. To this end, we represent body joint detections in a video by a spatio-temporal graph and solve an integer linear program to partition the graph into sub-graphs that correspond to plausible body pose trajectories for each person. The proposed approach implicitly handles occlusions and truncations of persons. Since the problem has not been addressed quantitatively in the literature, we introduce a challenging "Multi-Person Pose-Track" dataset, and also propose a completely unconstrained evaluation protocol that does not make any assumptions on the scale, size, location or the number of persons. Finally, we evaluate the proposed approach and several baseline methods on our new dataset.
Conference Paper
Full-text available
This paper is on human pose estimation using Convolutional Neural Networks. Our main contribution is a CNN cascaded architecture specifically designed for learning part relationships and spatial context, and robustly inferring pose even for the case of severe part occlusions. To this end, we propose a detection-followed-by-regression CNN cascade. The first part of our cascade outputs part detection heatmaps and the second part performs regression on these heatmaps. The benefits of the proposed architecture are multi-fold: It guides the network where to focus in the image and effectively encodes part constraints and context. More importantly, it can effectively cope with occlusions because part detection heatmaps for occluded parts provide low confidence scores which subsequently guide the regression part of our network to rely on contextual information in order to predict the location of these parts. Additionally, we show that the proposed cascade is flexible enough to readily allow the integration of various CNN architectures for both detection and regression, including recent ones based on residual learning. Finally, we illustrate that our cascade achieves top performance on the MPII and LSP data sets. Code can be downloaded from http://www.cs.nott.ac.uk/~psxab5/
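A minimal sketch of the detection-followed-by-regression idea, assuming illustrative layer sizes rather than the paper's: the first subnetwork emits part-detection heatmaps, and the second regresses refined estimates from those heatmaps concatenated with image features, so low-confidence (occluded) detections can be overridden by context.

import torch
import torch.nn as nn

class DetectionRegressionCascade(nn.Module):
    def __init__(self, num_parts: int = 16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        # part 1: per-pixel part-detection heatmaps
        self.detector = nn.Conv2d(64, num_parts, 1)
        # part 2: regression conditioned on image features + heatmaps
        self.regressor = nn.Sequential(
            nn.Conv2d(64 + num_parts, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, num_parts, 1))

    def forward(self, x):
        f = self.features(x)
        heat = self.detector(f)                       # detection heatmaps
        refined = self.regressor(torch.cat([f, heat], dim=1))
        return heat, refined                          # supervise both outputs

model = DetectionRegressionCascade()
heat, refined = model(torch.randn(1, 3, 128, 128))
print(heat.shape, refined.shape)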
Conference Paper
Full-text available
Despite the recent success of neural networks for human pose estimation, current approaches are limited to pose estimation of a single person and cannot handle humans in groups or crowds. In this work, we propose a method that estimates the poses of multiple persons in an image in which a person can be occluded by another person or might be truncated. To this end, we consider multi-person pose estimation as a joint-to-person association problem. We construct a fully connected graph from a set of detected joint candidates in an image and resolve the joint-to-person association and outlier detection using integer linear programming. Since solving joint-to-person association jointly for all persons in an image is an NP-hard problem and even approximations are expensive, we solve the problem locally for each person. On the challenging MPII Human Pose Dataset for multiple persons, our approach achieves the accuracy of a state-of-the-art method, but it is 6,000 to 19,000 times faster.
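To make the joint-to-person association concrete, here is a toy integer linear program in the same spirit (not the paper's exact formulation or solver), written with the PuLP library: each joint candidate is either assigned to exactly one person or declared an outlier, and the total association cost is minimized. All names and costs are hypothetical.

import pulp

joints = ["head0", "head1", "wrist0", "wrist1"]
persons = ["p0", "p1"]
cost = {  # hypothetical unary association costs (lower = better fit)
    ("head0", "p0"): 0.1, ("head0", "p1"): 0.9,
    ("head1", "p0"): 0.8, ("head1", "p1"): 0.2,
    ("wrist0", "p0"): 0.3, ("wrist0", "p1"): 0.7,
    ("wrist1", "p0"): 0.9, ("wrist1", "p1"): 0.6,
}
outlier_cost = 0.5  # price of declaring a candidate a false positive

prob = pulp.LpProblem("joint_to_person", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (joints, persons), cat="Binary")
out = pulp.LpVariable.dicts("outlier", joints, cat="Binary")

prob += (pulp.lpSum(cost[j, p] * x[j][p] for j in joints for p in persons)
         + pulp.lpSum(outlier_cost * out[j] for j in joints))
for j in joints:  # every candidate is either assigned once or an outlier
    prob += pulp.lpSum(x[j][p] for p in persons) + out[j] == 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for j in joints:
    choice = next((p for p in persons if x[j][p].value() == 1), "outlier")
    print(j, "->", choice)   # wrist1 comes out cheapest as an outlier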
Patent
Full-text available
Methods and apparatus are described for monocular 3D human pose estimation and tracking, which are able to recover poses of people in realistic street conditions captured using a monocular, potentially moving camera. Embodiments of the present invention provide a three-stage process involving estimating (10, 60, 110) a 3D pose of each of the multiple objects using an output of 2D tracking-by-detection (50) and 2D viewpoint estimation (46). The present invention provides a sound Bayesian formulation to address the above problems, and can provide articulated 3D tracking in realistic street conditions. It provides methods and apparatus for people detection and 2D pose estimation combined with a dynamic motion prior. The invention not only provides 2D pose estimation for people in side views; it goes beyond this by estimating poses in 3D from multiple viewpoints. Pose estimation is done in monocular images and does not require stereo images, nor does the invention require detection of characteristic poses of people.
Article
Full-text available
Pose Machines provide a sequential prediction framework for learning rich implicit spatial models. In this work we show a systematic design for how convolutional networks can be incorporated into the pose machine framework for learning image features and image-dependent spatial models for the task of pose estimation. The contribution of this paper is to implicitly model long-range dependencies between variables in structured prediction tasks such as articulated pose estimation. We achieve this by designing a sequential architecture composed of convolutional networks that directly operate on belief maps from previous stages, producing increasingly refined estimates for part locations, without the need for explicit graphical model-style inference. Our approach addresses the characteristic difficulty of vanishing gradients during training by providing a natural learning objective function that enforces intermediate supervision, thereby replenishing back-propagated gradients and conditioning the learning procedure. We demonstrate state-of-the-art performance and outperform competing methods on standard benchmarks including the MPII, LSP, and FLIC datasets.
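A minimal PyTorch sketch of the sequential belief-map refinement described above, with illustrative (not the paper's) channel sizes: each later stage convolves over image features concatenated with the previous stage's belief maps, and every stage gets its own loss term, giving the intermediate supervision that replenishes gradients.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseMachineSketch(nn.Module):
    def __init__(self, parts=14, stages=3, feat=32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat, 9, padding=4), nn.ReLU())
        self.stage1 = nn.Conv2d(feat, parts, 1)
        self.refine = nn.ModuleList(
            nn.Sequential(nn.Conv2d(feat + parts, feat, 11, padding=5),
                          nn.ReLU(), nn.Conv2d(feat, parts, 1))
            for _ in range(stages - 1))

    def forward(self, x):
        f = self.backbone(x)
        beliefs = [self.stage1(f)]
        for stage in self.refine:        # operate on previous belief maps
            beliefs.append(stage(torch.cat([f, beliefs[-1]], dim=1)))
        return beliefs                   # one belief-map set per stage

model = PoseMachineSketch()
target = torch.rand(1, 14, 64, 64)       # stand-in ground-truth belief maps
loss = sum(F.mse_loss(b, target) for b in model(torch.randn(1, 3, 64, 64)))
loss.backward()                          # intermediate supervision: gradients reach every stage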
Article
Full-text available
The objective of this work is human pose estimation in videos, where multiple frames are available. We investigate a ConvNet architecture that is able to benefit from temporal context by combining information across the multiple frames using optical flow. To this end we propose a new network architecture that: (i) regresses a confidence heatmap of joint position predictions; (ii) incorporates optical flow at a mid-layer to align heatmap predictions from neighbouring frames; and (iii) includes a final parametric pooling layer which learns to combine the aligned heatmaps into a pooled confidence map. We show that this architecture outperforms a number of others, including one that uses optical flow solely at the input layers, and one that regresses joint coordinates directly. The new architecture outperforms the state of the art by a large margin on three video pose estimation datasets, including the very challenging Poses in the Wild dataset.
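The key mechanical step, warping a neighbouring frame's heatmaps into the current frame using dense optical flow, can be sketched with grid sampling as below; the flow here is random stand-in data, and the real system's flow estimator and parametric pooling layer are not reproduced.

import torch
import torch.nn.functional as F

def warp_with_flow(heatmap: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """heatmap: (N, C, H, W); flow: (N, H, W, 2) in pixels (dx, dy)."""
    n, _, h, w = heatmap.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).float()           # (H, W, 2)
    coords = base.unsqueeze(0) + flow                      # where to sample from
    # normalize pixel coordinates to [-1, 1] for grid_sample
    coords[..., 0] = 2 * coords[..., 0] / (w - 1) - 1
    coords[..., 1] = 2 * coords[..., 1] / (h - 1) - 1
    return F.grid_sample(heatmap, coords, align_corners=True)

prev_heat = torch.rand(1, 14, 64, 64)    # neighbouring frame's joint heatmaps
flow = torch.randn(1, 64, 64, 2)         # stand-in for estimated optical flow
print(warp_with_flow(prev_heat, flow).shape)   # torch.Size([1, 14, 64, 64])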
Article
Full-text available
Formulations of the image decomposition problem as a multicut problem (MP) w.r.t. a superpixel graph have received considerable attention. In contrast, instances of the MP w.r.t. a pixel grid graph have received little attention, firstly, because the MP is NP-hard and instances w.r.t. a pixel grid graph are hard to solve in practice, and, secondly, due to the lack of long-range terms in the objective function of the MP. We propose a generalization of the MP with long-range terms (LMP). We design and implement two efficient algorithms (primal feasible heuristics) for the MP and LMP which allow us to study instances of both problems w.r.t. the pixel grid graphs of the images in the BSDS-500 benchmark. The decompositions we obtain do not differ significantly from the state of the art at the time of writing, suggesting that the LMP is a competitive formulation of the image decomposition problem. To demonstrate the generality of the LMP formulation, we apply it also to the mesh decomposition problem posed by the Princeton benchmark, obtaining state-of-the-art decompositions.
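As a toy illustration of a primal feasible heuristic for the multicut objective (not the paper's exact algorithms), the following greedy-joining sketch repeatedly contracts the pair of clusters whose accumulated edge weight most rewards joining, summing parallel edges after each contraction; positive weights attract, negative weights repel.

def greedy_joining(num_nodes, edges):
    """edges: dict mapping (u, v) with u < v to a real join reward."""
    cluster = {v: v for v in range(num_nodes)}            # node -> cluster id
    weights = {tuple(sorted(e)): w for e, w in edges.items()}
    while weights:
        (a, b), w = max(weights.items(), key=lambda kv: kv[1])
        if w <= 0:                  # no pair left that rewards joining
            break
        del weights[(a, b)]         # contract cluster b into cluster a
        merged = {}
        for (u, v), wt in weights.items():
            u, v = (a if x == b else x for x in (u, v))
            if u == v:
                continue            # drop self-loops created by the merge
            key = (min(u, v), max(u, v))
            merged[key] = merged.get(key, 0.0) + wt       # sum parallel edges
        weights = merged
        for node, c in cluster.items():
            if c == b:
                cluster[node] = a
    return cluster

# 0-1-2 attract each other, 3-4 attract each other, the groups repel:
edges = {(0, 1): 2.0, (1, 2): 1.5, (0, 2): 0.5,
         (3, 4): 1.0, (2, 3): -2.0, (1, 4): -0.5}
print(greedy_joining(5, edges))     # {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}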
Conference Paper
Full-text available
Human pose estimation has made significant progress during the last years. However, current datasets are limited in their coverage of the overall pose estimation challenges. Still, these serve as the common sources to evaluate, train and compare different models on. In this paper we introduce a novel benchmark "MPII Human Pose" that makes a significant advance in terms of diversity and difficulty, a contribution that we feel is required for future developments in human body models. This comprehensive dataset was collected using an established taxonomy of over 800 human activities [1]. The collected images cover a wider variety of human activities than previous datasets, including various recreational, occupational and householding activities, and capture people from a wider range of viewpoints. We provide a rich set of labels including positions of body joints, full 3D torso and head orientation, occlusion labels for joints and body parts, and activity labels. For each image we provide adjacent video frames to facilitate the use of motion information. Given these rich annotations we perform a detailed analysis of leading human pose estimation approaches and gain insights into the successes and failures of these methods.
Article
Full-text available
This paper proposes a new hybrid architecture that consists of a deep Convolutional Network and a Markov Random Field. We show how this architecture is successfully applied to the challenging problem of articulated human pose estimation in monocular images. The architecture can exploit structural domain constraints such as geometric relationships between body joint locations. We show that joint training of these two model paradigms improves performance and allows us to significantly outperform existing state-of-the-art techniques.
Conference Paper
Full-text available
In this paper, we present a method for estimating articulated human poses in videos. We cast this as an optimization problem defined on body parts with spatio-temporal links between them. The resulting formulation is unfortunately intractable and previous approaches only provide approximate solutions. Although such methods perform well on certain body parts, e.g., head, their performance on lower arms, i.e., elbows and wrists, remains poor. We present a new approximate scheme with two steps dedicated to pose estimation. First, our approach takes into account temporal links with subsequent frames for the less-certain parts, namely elbows and wrists. Second, our method decomposes poses into limbs, generates limb sequences across time, and recomposes poses by mixing these body part sequences. We introduce a new dataset "Poses in the Wild", which is more challenging than the existing ones, with sequences containing background clutter, occlusions, and severe camera motion. We experimentally compare our method with recent approaches on this new dataset as well as on two other benchmark datasets, and show significant improvement.
Conference Paper
Full-text available
Both detecting and tracking people are challenging problems, especially in complex real world scenes that commonly involve multiple people, complicated occlusions, and cluttered or even moving backgrounds. People detectors have been shown to be able to locate pedestrians even in complex street scenes, but false positives have remained frequent. The identification of particular individuals has remained challenging as well. On the other hand, tracking methods are able to find a particular individual in image sequences, but are severely challenged by real-world scenarios such as crowded street scenes. In this paper, we combine the advantages of both detection and tracking in a single framework. The approximate articulation of each person is detected in every frame based on local features that model the appearance of individual body parts. Prior knowledge on possible articulations and temporal coherency within a walking cycle are modeled using a hierarchical Gaussian process latent variable model (hGPLVM). We show how the combination of these results improves hypotheses for position and articulation of each person in several subsequent frames. We present experimental results that demonstrate how this allows detecting and tracking multiple people in cluttered scenes with reoccurring occlusions.
Article
Full-text available
Simultaneous tracking of multiple persons in real-world environments is an active research field and several approaches have been proposed, based on a variety of features and algorithms. Recently, there has been a growing interest in organizing systematic evaluations to compare the various techniques. Unfortunately, the lack of common metrics for measuring the performance of multiple object trackers still makes it hard to compare their results. In this work, we introduce two intuitive and general metrics to allow for objective comparison of tracker characteristics, focusing on their precision in estimating object locations, their accuracy in recognizing object configurations and their ability to consistently label objects over time. These metrics have been extensively used in two large-scale international evaluations, the 2006 and 2007 CLEAR evaluations, to measure and compare the performance of multiple object trackers for a wide variety of tracking tasks. Selected performance results are presented and the advantages and drawbacks of the presented metrics are discussed based on the experience gained during the evaluations.
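One of the two metrics introduced here, multiple object tracking accuracy (MOTA), reduces to a single standard formula combining misses, false positives, and identity switches; the sketch below implements that formula directly on toy per-frame counts.

def mota(false_negatives, false_positives, id_switches, num_gt_objects):
    """MOTA = 1 - (sum of FN + FP + IDSW) / (total ground-truth objects)."""
    errors = sum(false_negatives) + sum(false_positives) + sum(id_switches)
    return 1.0 - errors / sum(num_gt_objects)

# per-frame counts over a 4-frame toy sequence
print(mota(false_negatives=[1, 0, 2, 0],
           false_positives=[0, 1, 0, 0],
           id_switches=[0, 0, 1, 0],
           num_gt_objects=[5, 5, 5, 5]))   # 1 - 5/20 = 0.75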
Article
We propose a personalized ConvNet pose estimator that automatically adapts itself to the uniqueness of a person's appearance to improve pose estimation in long videos. We make the following contributions: (i) we show that given a few high-precision pose annotations, e.g. from a generic ConvNet pose estimator, additional annotations can be generated throughout the video using a combination of image-based matching for temporally distant frames, and dense optical flow for temporally local frames; (ii) we develop an occlusion aware self-evaluation model that is able to automatically select the high-quality and reject the erroneous additional annotations; and (iii) we demonstrate that these high-quality annotations can be used to fine-tune a ConvNet pose estimator and thereby personalize it to lock on to key discriminative features of the person's appearance. The outcome is a substantial improvement in the pose estimates for the target video using the personalized ConvNet compared to the original generic ConvNet. Our method outperforms the state of the art (including top ConvNet methods) by a large margin on two standard benchmarks, as well as on a new challenging YouTube video dataset. Furthermore, we show that training from the automatically generated annotations can be used to improve the performance of a generic ConvNet on other benchmarks.
Conference Paper
In Tang et al. (2015), we proposed a graph-based formulation that links and clusters person hypotheses over time by solving a minimum cost subgraph multicut problem. In this paper, we modify and extend Tang et al. (2015) in three ways: (1) We introduce a novel local pairwise feature based on local appearance matching that is robust to partial occlusion and camera motion. (2) We perform extensive experiments to compare different pairwise potentials and to analyze the robustness of the tracking formulation. (3) We consider a plain multicut problem and remove outlying clusters from its solution. This allows us to employ an efficient primal feasible optimization algorithm that is not applicable to the subgraph multicut problem of Tang et al. (2015). Unlike the branch-and-cut algorithm used there, the efficient algorithm used here is applicable to long videos and many detections. Together with the novel pairwise feature, it eliminates the need for the intermediate tracklet representation of Tang et al. (2015). We demonstrate the effectiveness of our overall approach on the MOT16 benchmark (Milan et al. 2016), achieving state-of-the-art performance.
Conference Paper
The goal of this paper is to advance the state-of-the-art of articulated pose estimation in scenes with multiple people. To that end we contribute on three fronts. We propose (1) improved body part detectors that generate effective bottom-up proposals for body parts; (2) novel image-conditioned pairwise terms that allow assembling the proposals into a variable number of consistent body part configurations; and (3) an incremental optimization strategy that explores the search space more efficiently, thus leading both to better performance and significant speed-up factors. Evaluation is done on two single-person and two multi-person pose estimation benchmarks. The proposed approach significantly outperforms the best known multi-person pose estimation results while demonstrating competitive performance on the task of single person pose estimation (models and code available at http://pose.mpi-inf.mpg.de).
Conference Paper
This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods.
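A minimal sketch of a single hourglass module, using plain convolutions where the published architecture uses residual blocks: the module recursively pools down to a bottleneck and upsamples back, adding a skip branch at every resolution so features from all scales are consolidated.

import torch
import torch.nn as nn
import torch.nn.functional as F

def conv(c):  # stand-in for the residual blocks of the real architecture
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU())

class Hourglass(nn.Module):
    def __init__(self, depth: int, channels: int):
        super().__init__()
        self.skip = conv(channels)                    # full-resolution branch
        self.down = conv(channels)
        self.inner = (Hourglass(depth - 1, channels)  # recurse to bottleneck
                      if depth > 1 else conv(channels))
        self.up = conv(channels)

    def forward(self, x):
        lower = self.down(F.max_pool2d(x, 2))         # top-down (pool)
        lower = self.up(self.inner(lower))
        lower = F.interpolate(lower, scale_factor=2)  # bottom-up (upsample)
        return self.skip(x) + lower                   # merge the two scales

hg = Hourglass(depth=4, channels=32)
print(hg(torch.randn(1, 32, 64, 64)).shape)           # torch.Size([1, 32, 64, 64])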
Article
Marker-less motion capture has seen great progress, but most state-of-the-art approaches fail to reliably track articulated human body motion with a very low number of cameras, let alone when applied in outdoor scenes with general background. In this paper, we propose a method for accurate marker-less capture of articulated skeleton motion of several subjects in general scenes, indoors and outdoors, even from input filmed with as few as two cameras. The new algorithm combines the strengths of a discriminative image-based joint detection method with a model-based generative motion tracking algorithm through a unified pose optimization energy. The discriminative part-based pose detection method is implemented using Convolutional Networks (ConvNet) and estimates unary potentials for each joint of a kinematic skeleton model. These unary potentials serve as the basis of a probabilistic extraction of pose constraints for tracking by using weighted sampling from a pose posterior that is guided by the model. In the final energy, we combine these constraints with an appearance-based model-to-image similarity term. Poses can be computed very efficiently using iterative local optimization, since joint detection with a trained ConvNet is fast, and since our formulation yields a combined pose estimation energy with analytic derivatives. In combination, this enables tracking of full articulated joint angles at state-of-the-art accuracy and temporal stability with a very low number of cameras. Our method is efficient and lends itself to implementation on parallel computing hardware, such as GPUs. We test our method extensively and show its advantages over related work on many indoor and outdoor data sets captured by ourselves, as well as data sets made available to the community by other research labs. The availability of good evaluation data sets is paramount for scientific progress, and many existing test data sets focus on controlled indoor settings, do not feature much variety in the scenes, and often lack a large corpus of data with ground truth annotation. We therefore further contribute with a new extensive test data set called MPI-MARCOnI for indoor and outdoor marker-less motion capture that features 12 scenes of varying complexity and varying camera count, and that features ground truth reference data from different modalities, ranging from manual joint annotations to marker-based motion capture results. Our new method is tested on these data, and the data set will be made available to the community.
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
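The central construct is easy to state in code: a residual block computes a residual function F(x) and outputs F(x) + x, so representing an identity mapping only requires driving F toward zero. A minimal basic block follows; it is a sketch, not the bottleneck design of the deepest models.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)   # residual + identity shortcut

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])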
Article
Discriminative methods for learning structured models have enabled wide-spread use of very rich feature representations. However, the computational cost of feature extraction is prohibitive for large-scale or time-sensitive applications, often dominating the cost of inference in the models. Significant efforts have been devoted to sparsity-based model selection to decrease this cost. Such feature selection methods control computation statically and miss the opportunity to finetune feature extraction to each input at run-time. We address the key challenge of learning to control fine-grained feature extraction adaptively, exploiting nonhomogeneity of the data. We propose an architecture that uses a rich feedback loop between extraction and prediction. The run-time control policy is learned using efficient value-function approximation, which adaptively determines the value of information of features at the level of individual variables for each input. We demonstrate significant speedups over state-of-the-art methods on two challenging datasets. For articulated pose estimation in video, we achieve a more accurate state-of-the-art model that is also faster, with similar results on an OCR task.
Conference Paper
We present an approach to multi-target tracking that has expressive potential beyond the capabilities of chain-shaped hidden Markov models, yet has significantly reduced complexity. Our framework, which we call tracking-by-selection, is similar to tracking-by-detection in that it separates the tasks of detection and tracking, but it shifts temporal reasoning from the tracking stage to the detection stage. The core feature of tracking-by-selection is that it reasons about path hypotheses that traverse the entire video instead of a chain of single-frame object hypotheses. A traditional chain-shaped tracking-by-detection model is only able to promote consistency between one frame and the next. In tracking-by-selection, path hypotheses exist across time, and encouraging long-term temporal consistency is as simple as rewarding path hypotheses with consistent image features. One additional advantage of tracking-by-selection is that it results in a dramatically simplified model that can be solved exactly. We adapt an existing tracking-by-detection model to the tracking-by-selection framework, and show improved performance on a challenging dataset.
Conference Paper
Optical flow computation is a key component in many computer vision systems designed for tasks such as action detection or activity recognition. However, despite several major advances over the last decade, handling large displacement in optical flow remains an open problem. Inspired by the large displacement optical flow of Brox and Malik, our approach, termed Deep Flow, blends a matching algorithm with a variational approach for optical flow. We propose a descriptor matching algorithm, tailored to the optical flow problem, that boosts performance on fast motions. The matching algorithm builds upon a multi-stage architecture with 6 layers, interleaving convolutions and max-pooling, a construction akin to deep convolutional nets. Using dense sampling, it efficiently retrieves quasi-dense correspondences, and enjoys a built-in smoothing effect on descriptor matches, a valuable asset for integration into an energy minimization framework for optical flow estimation. Deep Flow efficiently handles large displacements occurring in realistic videos, and shows competitive performance on optical flow benchmarks. Furthermore, it sets a new state-of-the-art on the MPI-Sintel dataset.
Conference Paper
Automatic recovery of 3D human pose from monocular image sequences is a challenging and important research topic with numerous applications. Although current methods are able to recover 3D pose for a single person in controlled environments, they are severely challenged by real-world scenarios, such as crowded street scenes. To address this problem, we propose a three-stage process building on a number of recent advances. The first stage obtains an initial estimate of the 2D articulation and viewpoint of the person from single frames. The second stage allows early data association across frames based on tracking-by-detection. These two stages successfully accumulate the available 2D image evidence into robust estimates of 2D limb positions over short image sequences (= tracklets). The third and final stage uses those tracklet-based estimates as robust image observations to reliably recover 3D pose. We demonstrate state-of-the-art performance on the HumanEva II benchmark, and also show the applicability of our approach to articulated 3D tracking in realistic street conditions.
Conference Paper
We address the problem of articulated human pose estimation in videos using an ensemble of tractable models with rich appearance, shape, contour and motion cues. In previous articulated pose estimation work on unconstrained videos, using temporal coupling of limb positions has made little to no difference in performance over parsing frames individually [8, 28]. One crucial reason for this is that joint parsing of multiple articulated parts over time involves intractable inference and learning problems, and previous work has resorted to approximate inference and simplified models. We overcome these computational and modeling limitations using an ensemble of tractable submodels which couple locations of body joints within and across frames using expressive cues. Each submodel is responsible for tracking a single joint through time (e.g., left elbow) and also models the spatial arrangement of all joints in a single frame. Because of the tree structure of each submodel, we can perform efficient exact inference and use rich temporal features that depend on image appearance, e.g., color tracking and optical flow contours. We propose and experimentally investigate a hierarchy of submodel combination methods, and we find that a highly efficient max-marginal combination method outperforms much slower (by orders of magnitude) approximate inference using dual decomposition. We apply our pose model on a new video dataset of highly varied and articulated poses from TV shows. We show significant quantitative and qualitative improvements over state-of-the-art single-frame pose estimation approaches.
Conference Paper
We present a novel multi-person pose estimation framework, which extends pictorial structures (PS) to explicitly model interactions between people and to estimate their poses jointly. Interactions are modeled as occlusions between people. First, we propose an occlusion probability predictor, based on the location of persons automatically detected in the image, and incorporate the predictions as occlusion priors into our multi-person PS model. Moreover, our model includes an inter-people exclusion penalty, preventing body parts from different people from occupying the same image region. Thanks to these elements, our model has a global view of the scene, resulting in better pose estimates in group photos, where several persons stand nearby and occlude each other. In a comprehensive evaluation on a new, challenging group photo dataset we demonstrate the benefits of our multi-person model over a state-of-the-art single-person pose estimator which treats each person independently.