Improving the Accuracy of Intelligent Pose Estimation Systems Through Low Level Image Processing Operations


DISP '19, Oxford, United Kingdom
ISBN: 978-1-912532-09-4
Abstract
The development of powerful and popular machine-learning-driven pose estimation systems has been on the rise during the past years. In this research we have investigated how the accuracy of such a system can be increased by applying low level image processing techniques to the footage before it is submitted to the pose estimation system. The techniques used were high and low contrast adjustment, histogram equalization, sharpness and Canny edge detection. By applying them to datasets containing different environments and lighting conditions, the system's accuracy was increased by between 0.29% and 38.37%, depending on the context. These increases have the potential to make the pose estimation system less sensitive to lighting.
Keywords: OpenPose, image processing, limb estimation, histogram equalization, low-level operations, image contrast, sharpness, Canny edge detection.
1 Introduction
Human pose and limb estimation has a long tradition, ranging from the 1980s' usage of image processing operations such as edge detection and template matching [1], [2] to current machine-learning-enhanced computer vision systems [3]-[9]. The newer systems are especially interesting as their core consists of neural networks of different kinds. Convolutional neural networks (CNNs) have been especially prominent for computer vision, as the networks can be trained on datasets composed of pictures and frames to recognize different objects, thus enabling the machine to find high-level elements such as human body limbs with high accuracy. A system created for such a task is OpenPose [4]. OpenPose's architecture uses CNNs to classify where in the frame different body parts are, with different certainty, along with their direction, and afterwards recognizes through bipartite graphs which body parts belong together in order to create simple human skeletons. A particular strength of OpenPose is that it runs in real time and can recognize and distinguish multiple persons at once without confusing the limbs of one person with another's [4]. To prove the usefulness of OpenPose, Z. Cao et al. submitted it to two benchmark tests, using the datasets from MPII [10] and COCO [11]. The performance on these datasets was around 90 percent and 62.7 percent, respectively. According to Z. Cao et al., the lower accuracy found when tested on the COCO dataset was due to background confusion and imprecise limb localization [4].
To understand the challenges OpenPose faces within the images, it is important to learn what the distractors are and how to make the humans in the frames stand out. Challenging scenarios, such as complex backgrounds, background color values close to those of the people in the frame, and objects with human-limb-like features, have distracted OpenPose. OpenPose is not alone in having such difficulties, and previous studies have improved the recognition rate of other advanced algorithms through different means. Y. Wang et al. [12] improved the accuracy of multiple pedestrian detection algorithms by applying a non-linear motion-guided filter to the footage before it is submitted to the algorithms. K. G. Lore et al. [13] contrast-enhanced and de-noised low-light noisy images by training a deep neural net to recognize the issues in the images compared to their normally lit, low-noise counterparts. Milan et al. [14] further developed a multi-label conditional random field (CRF) algorithm to detect occluded humans in a multi-human tracking scenario, outperforming comparable state-of-the-art approaches.
Since [12] and [13] both help bring forth or enhance the objects in the scenes they inspect, their problem area lies close to that of OpenPose. Inspired by them, a number of low level, lightweight image processing operations will be explored to make it easier for OpenPose and its CNNs to extract the right human features from the frames. The techniques used are operators such as contrast adjustment [15], both higher and lower contrast, histogram equalization [16], sharpness [17] and Canny edge detection [18].
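As a rough illustration of what the contrast and equalization operators do to a frame, the following is a minimal NumPy sketch; the gain values and implementation details are illustrative assumptions, not the exact configuration used in this study.

```python
import numpy as np

def adjust_contrast(img, gain):
    """Scale pixel values around the mid-grey point: gain > 1 raises
    contrast, gain < 1 lowers it (results are clipped to 0-255)."""
    out = (img.astype(np.float64) - 128.0) * gain + 128.0
    return np.clip(out, 0, 255).astype(np.uint8)

def equalize_histogram(gray):
    """Classic histogram equalization: map intensities through the
    normalized cumulative histogram to stretch the intensity range."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255.0).astype(np.uint8)
    return lut[gray]

# demo on a synthetic low-contrast gradient frame (values 100-160)
frame = np.tile(np.linspace(100, 160, 64).astype(np.uint8), (64, 1))
high = adjust_contrast(frame, 2.0)   # wider value range
low = adjust_contrast(frame, 0.5)    # narrower value range
eq = equalize_histogram(frame)       # stretched toward the full 0-255 range
```

Sharpness (unsharp masking) and Canny edge detection follow the same pattern of transforming the frame before it reaches the estimator.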
2 Method
2.1 Hypothesis
By applying the different image processing operations, we expect that contrast adjustment and histogram equalization will make the differences in complex frames and frames with low contrast more obvious, thus heightening OpenPose's accuracy. However, noise and other unwanted items will be enhanced as well, so there is a chance that they will worsen the recognition results. Regarding sharpness and Canny edge detection, we expect that the low level edge features in complex scenes and frames with different entities will be heightened, thus improving the accuracy of OpenPose.
2.2 Video Analysis Procedure
In order to enable OpenPose to work with the low level techniques, and thus analyze the frames, the following software procedure is applied:
1. The video is loaded into the system.
2. The image processing operation is applied.
3. OpenPose with the TensorFlow architecture [19], [20] analyses the frame.
4. The number of limbs found is recorded and stored.
5. If more frames are available, the next one is chosen and steps 2-4 are repeated.
6. If there are no more frames, the system quits.
7. A graph showing the number of limbs found on a frame-by-frame basis is saved.
8. Each frame superimposed with the OpenPose limb estimation is saved.
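The per-frame loop of steps 2-6 can be sketched as follows. This is a minimal sketch: video loading and output steps are omitted, and `estimate_limbs` is a hypothetical stand-in for the call into the TensorFlow OpenPose port of [19], [20].

```python
def analyze_video(frames, preprocess, estimate_limbs):
    """Per-frame procedure (steps 2-6): preprocess each frame, run the
    pose estimator on it, and record the number of limbs found."""
    limb_counts = []
    for frame in frames:                   # steps 5-6: iterate until no frames remain
        processed = preprocess(frame)      # step 2: apply the image processing operation
        limbs = estimate_limbs(processed)  # step 3: the pose estimator analyses the frame
        limb_counts.append(len(limbs))     # step 4: record and store the number of limbs
    return limb_counts                     # basis for the frame-by-frame graph (step 7)

# usage with stand-in stubs: a real run would decode the video (step 1) and
# call the actual OpenPose wrapper instead of the lambdas below
frames = list(range(3))
counts = analyze_video(frames,
                       preprocess=lambda f: f,
                       estimate_limbs=lambda f: ["limb"] * 18)
```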
After the system has analyzed the videos, the frames are inspected in order to find false positives, i.e., non-human objects identified as limbs. These are subtracted from the total number of found limbs, making it possible to see which image processing operation optimizes the videos the most. For the sake of reliability, OpenPose is exposed to each of the pre-processed videos 20 times.
The computer used to run the procedure had the following specifications: GTX 950M GPU, Intel Core i5 (6th gen.) CPU, 8 GB RAM, 256 GB PCIe SSD and Windows 10 Home.
2.3 Datasets
To test how well the different pre-processing operations help OpenPose increase its limb recognition accuracy, they need to be exposed to different scenarios. Movie clips of 300 frames with different types of background complexity, lighting conditions and numbers of people (one or two persons) were chosen (see Figure 1 for the different datasets showcased with one person). When OpenPose recognizes a full human, it classifies 18 different body parts. In the video clips the persons are fully present, so the total number of recognizable limbs is 18 × 300 = 5400 limbs for datasets with one person and 10800 limbs for datasets with two persons. These numbers serve as ground truth when testing the systems.
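The ground-truth figures follow from simple arithmetic, which can be made explicit as below; the accuracy helper reflects the false-positive subtraction described in Section 2.2, and the function names are ours, not the paper's.

```python
# Ground truth: 18 recognizable body parts per person, 300 frames per clip.
PARTS_PER_PERSON = 18
FRAMES_PER_CLIP = 300

def ground_truth_limbs(num_people):
    """Total number of recognizable limbs in one clip."""
    return PARTS_PER_PERSON * FRAMES_PER_CLIP * num_people

def accuracy(found_limbs, false_positives, num_people):
    """Accuracy after subtracting manually identified false positives."""
    return (found_limbs - false_positives) / ground_truth_limbs(num_people)
```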
The datasets were recorded using a Canon 60D (shutter speed: 30) with a 16-55 mm lens, a 32 GB SanDisk SD card and a tripod, at a resolution of 640×480.
2.4 Pre-Processing Configurations
In this pilot research we chose to focus on a small subset of lightweight operations to test the feasibility of increasing OpenPose's accuracy by enhancing the videos. Contrast adjustment, histogram equalization, sharpness and Canny edge detection were chosen, as they have promising capabilities of altering frames in order to bring forth the human in the frame. Contrast and sharpness can be configured to different extents. Contrast will be used twice: in a low contrast and a high contrast setting. For sharpness, its algorithm, which is based on an unsharp masking procedure, will run with a 5, 4 configuration, meaning that the original image has a higher weight than the added blurred image. These values were chosen as they produce sharper edges without introducing too much noise into the images.
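One plausible reading of the 5, 4 configuration is that the original frame is weighted by 5 and the blurred frame by 4 in the unsharp combination (the weights differ by 1, so overall brightness is preserved). The sketch below assumes this interpretation and an arbitrary 3×3 box blur; neither the kernel nor the exact weighting scheme is stated in the paper.

```python
import numpy as np

def box_blur(img, k=3):
    """Simple box blur used as the low-pass step of unsharp masking
    (the kernel size k is an assumption)."""
    padded = np.pad(img.astype(np.float64), k // 2, mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def unsharp(img, w_orig=5.0, w_blur=4.0):
    """Unsharp masking with a 5/4 weighting: equivalent to
    img + 4 * (img - blur), i.e. the high-pass detail is amplified."""
    sharpened = w_orig * img.astype(np.float64) - w_blur * box_blur(img)
    return np.clip(sharpened, 0, 255).astype(np.uint8)

# demo on a vertical step edge: flat regions keep their values,
# while the edge overshoot is clipped to the 0-255 range
step = np.zeros((8, 8), dtype=np.uint8)
step[:, 4:] = 200
sharp = unsharp(step)
```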
3 Results
Figures 2 and 3 show the results from the test, showcasing how OpenPose and the different pre-processing operations behaved. It is immediately seen that OpenPose by itself performs rather well for datasets b and c, while for datasets a and d the pre-processing operations improved the accuracy significantly (the frames from the datasets can be found in Figure 1). The pre-processing operations that increased the accuracy compared to OpenPose alone are histogram equalization, sharpness and high contrast, while low contrast and Canny edge detection contributed to a decrease in accuracy. This tells us that lowering the contrast or highlighting the edges of the images only worsens the chance of limb recognition. Histogram equalization and high contrast, on the other hand, both made OpenPose perform better most of the time, especially in regard to datasets a and d, which both have a narrow range of histogram values. This tells us that OpenPose performs best when there is a clear difference between the background and the persons. This observation is
supported by OpenPose's high accuracy on dataset b, which is the perfect condition, with a clear difference between the background and the person in the frame.

Figure 1 shows the different kinds of datasets OpenPose is exposed to: a) a dark indoor scenario, b) a static simple background scenario, c) a dynamic background scenario, and d) a complex static background with a non-white tinted light.
4 Discussion
In this paper we have seen how OpenPose's limb estimation accuracy can be enhanced by applying different kinds of pre-processing techniques to the datasets. For all datasets the pre-processing operations helped OpenPose achieve a higher accuracy, ranging from a 0.29% increase to a 38.37% increase. Interestingly, the datasets OpenPose had the most difficulties with are the ones deviating the most from normal lighting situations, such as the dark environment where the human is barely visible and the environment with a complex background and non-white light. In the dark scenario, the high contrast pre-processing created the best frames for OpenPose to analyze. This is logical, since the low contrast dark frame is manipulated to have greater differences between the values in the frame. Histogram equalization performs a similar operation by stretching out the intensity range of the frame, thereby making a frame like d from Figure 1, with its unnatural light, both lighter and higher in contrast at the same time. This could tell us that the data OpenPose is trained to find limbs in is usually lit by normal white light of varying brightness, which does not hinder the visibility of the human in the frame. Furthermore, the fact that histogram equalization increased the accuracy for most of the datasets suggests that for OpenPose to perform as well as possible, the contrast between background and human needs to be adequate, ensuring that all the limbs are visible enough to minimize doubt as to whether all limbs or only a subset of them are present. Additionally, the non-linear filter used in [12] could prove useful in this case as well, as the humans can be in motion when captured.
As seen in Figures 2 and 3, different scenarios require different pre-processing techniques to enhance the accuracy of OpenPose. Seeing that the datasets used in this paper were made for this very occasion, the logical next step will be to test the different filters on the COCO [11] dataset, as the results from that dataset were the ones inspiring this pilot research.
Figure 2 shows the accuracies for when OpenPose was exposed to the datasets with one person, either raw or manipulated through the various pre-processing operations (Low Contrast, High Contrast, Canny Edge, Hist. Equalization and Sharpness). The datasets a, b, c and d refer to the same types of scenarios found in Figure 1, and the accuracies in bold are the highest achieved for each dataset. Figure 3 shows the corresponding accuracies for the datasets with two persons.

Additionally, these pilot results suggest an approach to improving the intelligent machine running OpenPose. Such an approach could first analyze the histogram of the frames and the values of the pixels, and the machine (similar to [13]) could then be trained to make an informed choice about which image processing operation would create the highest limb estimation accuracy. If such an addition to the architecture were made, the raw footage the user wants to analyze, whether for security, health or other socially relevant situations, would not need to be retouched before being submitted to OpenPose, as the system could be enhanced to improve the image for the highest and most reliable limb estimation possible.
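A crude, hand-set version of such a histogram-driven selector could look like the sketch below. The statistics and thresholds are illustrative assumptions, not values from this paper; the trained model suggested above would replace these rules.

```python
import numpy as np

def choose_operation(gray, dark_thresh=60.0, spread_thresh=80.0):
    """Rule-of-thumb selector based on simple histogram statistics
    (thresholds are assumptions; a learned model would replace them)."""
    mean = gray.mean()
    spread = np.percentile(gray, 95) - np.percentile(gray, 5)
    if mean < dark_thresh:
        return "high_contrast"           # dark frame: widen the value differences
    if spread < spread_thresh:
        return "histogram_equalization"  # narrow histogram: stretch the intensity range
    return "none"                        # well-exposed frame: leave untouched

# demo frames
dark = np.full((10, 10), 20, dtype=np.uint8)             # barely visible scene
flat = np.full((10, 10), 120, dtype=np.uint8)            # narrow histogram, normal brightness
wide = np.tile(np.arange(256, dtype=np.uint8), (4, 1))   # well-spread histogram
```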
5 Conclusion
In this research, low-level image processing operations were applied to datasets before submitting them to an intelligent pose estimation algorithm, to see if this allowed the algorithm to achieve higher recognition scores. For the purpose of this study, the popular engine OpenPose was chosen as the pose estimation system, and the image processing operations were: low contrast, high contrast, Canny edge detection, histogram equalization and sharpness. Applying them to different datasets, each containing 300 images, proved to increase OpenPose's accuracy by 0.29% to 38.37%, depending on the given dataset. Most noticeable was the increase when applying the high contrast operation to a dataset containing a dark environment in which a person is present: OpenPose's accuracy increased from 47.19% to 85.56%. When applying histogram equalization to frames containing low-lit, non-white lighting, the accuracy rose from 76.19% to 89.53%. Less noticeable was the accuracy increase when tested on datasets containing bright outdoor light and well-lit indoor scenarios; these accuracies only rose by 0.29% - 1.29%, utilizing high contrast, histogram equalization or sharpness as the pre-processing operation. These results show that OpenPose's accuracy can be enhanced by applying low level image processing to the datasets before submitting them to the algorithm. However, the effect is very context dependent, and an intelligent integration of the pre-processing operations into OpenPose could be an interesting next step in order to optimize the system to be less sensitive to different lighting conditions.
Acknowledgments
Our thanks to Prof. George Palamas for being available for consultation during the development of this research, and thanks to the OpenPose team for making their framework and code openly available to others.
References
[1] J. K. Aggarwal and Q. Cai, Human Motion Analysis: A Review, Comput. Vis. Image Underst., vol. 73, no. 3, pp. 428-440, 1999.
[2] T. B. Moeslund and E. Granum, A Survey of Computer Vision-Based Human Motion Capture, Comput. Vis. Image Underst., vol. 81, no. 3, pp. 231-268, 2001.
[3] A. Toshev and C. Szegedy, DeepPose: Human Pose Estimation via Deep Neural Networks, Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1653-1660, 2014.
[4] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields, in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 7291-7299.
[5] A. Newell, K. Yang, and J. Deng, Stacked Hourglass Networks for Human Pose Estimation, in European Conference on Computer Vision, Springer, 2016, pp. 483-499.
[6] M. Fieraru, A. Khoreva, L. Pishchulin, and B. Schiele, Learning to Refine Human Pose Estimation, Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, pp. 318-327, 2018.
[7] M. Omran, C. Lassner, G. Pons-Moll, P. V. Gehler, and B. Schiele, Neural Body Fitting: Unifying Deep Learning and Model-Based Human Pose and Shape Estimation, in 2018 International Conference on 3D Vision (3DV), IEEE, 2018, pp. 484-494.
[8] S. Li and A. B. Chan, 3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network, in 12th Asian Conference on Computer Vision, 2014, pp. 332-347.
[9] Y. Du, W. Wang, and L. Wang, Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition, Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1110-1118, 2015.
[10] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, 2D Human Pose Estimation: New Benchmark and State of the Art Analysis, in 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3686-3693.
[11] T.-Y. Lin et al., Microsoft COCO: Common Objects in Context, in European Conference on Computer Vision, Springer, 2014, pp. 740-755.
[12] Y. Wang, S. Piérard, S.-Z. Su, and P.-M. Jodoin, Improving pedestrian detection using motion-guided filtering, Pattern Recognit. Lett., vol. 96, pp. 106-112, 2017.
[13] K. G. Lore, A. Akintayo, and S. Sarkar, LLNet: A deep autoencoder approach to natural low-light image enhancement, Pattern Recognit., vol. 61, pp. 650-662, Jan. 2017.
[14] A. Milan, L. Leal-Taixé, K. Schindler, and I. Reid, Joint tracking and segmentation of multiple targets, in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 5397-5406.
[15] G. Konecny, Methods and possibilities for digital differential rectification, Photogramm. Eng. Remote Sens., vol. 45, no. 6, pp. 727-734, 1979.
[16] D. J. Ketcham, Real-Time Image Enhancement Techniques, July 1976, pp. 120-125.
[17] G. Deng, A generalized unsharp masking algorithm, IEEE Trans. Image Process., vol. 20, no. 5, pp. 1249-1261, 2011.
[18] J. F. Canny, A Computational Approach to Edge Detection, IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 6, pp. 679-698, 1986.
[19] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, Convolutional Pose Machines, Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 4724-4732, 2016.
[20] I. Kim, Deep Pose Estimation implemented using Tensorflow with Custom Architectures for fast inference, 2018. [Online]. Available: