DISP '19, Oxford, United Kingdom
The development of powerful and popular machine learning driven
pose estimation systems has been on the rise in recent years.
In this research we investigate how their accuracy can be
increased by applying low-level image processing techniques to
the footage before it is submitted to the pose estimation system.
The techniques used were high and low contrast adjustment, histogram
equalization, sharpening and Canny edge detection. Applied to
datasets containing different environments and lighting
conditions, they increased the system's accuracy by between
0.29% and 38.37%, depending on the context. These
increases have the potential to make the pose estimation system
less sensitive to lighting.
Keywords: OpenPose, image processing, limb estimation, histogram
equalization, low-level operations, image contrast, sharpness,
Canny edge detection.
Human pose and limb estimation has a long tradition, ranging from
the 1980s' use of image processing operations, such as edge
detection and template matching, to today's machine
learning enhanced computer vision systems. The newer
systems are especially interesting, as their core consists of neural networks
of different kinds. Convolutional neural networks (CNNs) have been
particularly prominent in computer vision, as they can
be trained on datasets composed of pictures and frames to recognize
different objects, and thus enable the machine to
find high-level elements, such as human body limbs, with high
accuracy. A system created for this task is OpenPose. OpenPose's architecture uses
CNNs to classify, with varying certainty, where in the frame the different body parts are
and their directions, and afterwards uses bipartite
graphs to recognize which body parts belong together in order to
create simple human skeletons. A particular strength of OpenPose
is that it runs in real time and can recognize and distinguish multiple
persons at once without confusing the limbs of one person with
another's. To demonstrate the usefulness of OpenPose, Z. Cao et al. submitted
it to two benchmark tests, using the datasets from
MPII and COCO. The performance on the two datasets was
around 90 percent and 62.7 percent respectively. According to
Z. Cao et al., the lower accuracy found on the COCO dataset
was due to background confusion and imprecise limb detection.
To understand the challenges OpenPose faces within images,
it is important to learn what the distractors are and how to make the
humans in the frames stand out. Challenging scenarios, such as
complex backgrounds, background color values close to
those of the people in the frame, and objects with human-limb-like
features, have distracted OpenPose. Other systems have had similar
difficulties, and previous studies have improved the
recognition rate of other advanced algorithms through different
means. Y. Wang et al. improved the accuracy of multiple
pedestrian detection algorithms by applying a non-linear motion-guided
filter to the footage before it is submitted to the algorithms.
K. G. Lore et al. contrast-enhanced and de-noised low-light
noisy images by training a deep neural net to recognize the
issues in such images compared to their normally lit, low-noise
counterparts. Milan et al. further developed a multi-label
conditional random field (CRF) algorithm to detect occluded
humans in a multi-human tracking scenario, improving on similar approaches.
Since these approaches all help bring forth or enhance the objects in the
scenes they are inspecting, their problem area lies close to that of
OpenPose's. Inspired by them, a number of low-level, lightweight
image processing operations will be explored to make it easier for
OpenPose and its CNNs to extract the right human features from
the frames. The techniques used are contrast
adjustment (both higher and lower contrast), histogram
equalization, sharpening and Canny edge detection.
By applying the different image processing operations, we expect
contrast adjustment and histogram equalization to make the
differences in complex frames and frames with low contrast more
obvious, thus raising OpenPose's accuracy. However, noise
and other unwanted items will be enhanced as well, so there is a
chance that they will worsen the recognition results. For
sharpening and Canny edge detection, we expect the low-level
edge features in complex scenes and frames with many entities
to be emphasized, thus improving the accuracy of OpenPose.
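As a concrete illustration of what these operations do to a frame, a minimal NumPy sketch of contrast adjustment, histogram equalization and the gradient-magnitude core of edge detection could look as follows. All parameter values are illustrative assumptions, not the paper's configuration, and the `gradient_edges` function shows only the core of what a full Canny detector computes:

```python
import numpy as np

def adjust_contrast(img, alpha):
    """alpha > 1 raises contrast, alpha < 1 lowers it, pivoting at mid-gray."""
    out = alpha * (img.astype(np.float32) - 128.0) + 128.0
    return np.clip(out, 0, 255).astype(np.uint8)

def equalize_histogram(img):
    """Stretch the intensity distribution to cover the full 0-255 range."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    cdf = hist.cumsum().astype(np.float64)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min()) * 255.0
    return cdf[img].astype(np.uint8)

def gradient_edges(img, threshold=64):
    """Gradient-magnitude core of edge detection; a full Canny detector
    adds Gaussian smoothing, non-maximum suppression and hysteresis."""
    f = img.astype(np.float32)
    gx = np.abs(np.diff(f, axis=1, prepend=f[:, :1]))
    gy = np.abs(np.diff(f, axis=0, prepend=f[:1, :]))
    return ((gx + gy) > threshold).astype(np.uint8) * 255
```

The contrast pivot at mid-gray and the simple gradient threshold are design choices for the sketch; production pipelines would typically use a library implementation such as OpenCV's.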
2.2 Video Analysis Procedure
In order to enable OpenPose to work with the low-level
techniques, and thus analyze the frames, the following software
procedure is applied.
1. The video is loaded into the system.
2. The image processing operation is applied to the current frame.
3. OpenPose, running on a TensorFlow architecture,
analyses the frame.
4. The number of limbs found is recorded and stored.
5. If more frames are available, the next one is chosen and
steps 2-4 are repeated.
6. If there are no more frames, the system quits.
7. A graph showing the number of limbs found on a
frame-by-frame basis is saved.
8. Each frame, superimposed with the OpenPose limb
estimation, is saved.
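The steps above can be sketched as a simple loop. Here `preprocess` and `estimate_limbs` are hypothetical placeholders for the chosen image processing operation and the OpenPose/TensorFlow inference call, and the saving of graphs and overlaid frames (steps 7-8) is omitted:

```python
def analyze_video(frames, preprocess, estimate_limbs):
    """Return the per-frame limb counts for one pre-processed video."""
    limb_counts = []
    for frame in frames:                   # steps 5-6: iterate over all frames
        processed = preprocess(frame)      # step 2: apply the operation
        limbs = estimate_limbs(processed)  # step 3: pose estimator analyses
        limb_counts.append(len(limbs))     # step 4: record the limb count
    return limb_counts                     # steps 7-8 (plot/frame saving) omitted
```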
After the system has analyzed the videos, the frames are
inspected in order to find false positives, i.e.
non-human objects identified as limbs. These are subtracted
from the total number of found limbs, revealing which image
processing operation optimizes the videos the most. For the sake
of reliability, OpenPose is exposed to each of the pre-processed
videos 20 times.
The computer used to run the procedure had the following
specifications: GTX 950M GPU, Intel Core i5 (6th gen.) CPU, 8 GB RAM, 256
GB PCIe SSD and Windows 10 Home.
To test how well the different pre-processing operations help
OpenPose increase its limb recognition accuracy, they need to be
exposed to different scenarios. Movie clips of 300 frames with
different types of background complexity, lighting conditions and
numbers of people (one or two persons) were chosen (see Figure 1
for the different datasets, showcased with one person). When
OpenPose recognizes a full human, it classifies 18 different body
parts. In the video clips the persons are fully present, so the total
number of recognizable limbs is 18 times 300, i.e. 5400 limbs for
datasets with one person and 10800 limbs for datasets with two
persons. These numbers serve as ground truth when testing.
The datasets were recorded at a resolution of 640×480 using a
Canon 60D (shutter speed: 30) with a 16-55mm lens, a 32 GB
SanDisk SD card and a tripod.
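The ground-truth totals and the resulting accuracy figure follow directly from these definitions; the subtraction of false positives matches the inspection procedure described above (function and variable names here are illustrative):

```python
def ground_truth_limbs(n_persons, n_frames=300, limbs_per_person=18):
    """Total recognizable limbs: 18 body parts per fully visible person,
    in every one of the 300 frames."""
    return limbs_per_person * n_frames * n_persons

def accuracy(found_limbs, false_positives, n_persons):
    """Percentage of ground-truth limbs found, after removing false positives
    (non-human objects identified as limbs)."""
    true_limbs = found_limbs - false_positives
    return 100.0 * true_limbs / ground_truth_limbs(n_persons)
```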
2.3 Pre-Processing Configurations
In this pilot research we chose to focus on a small subset of
lightweight operations to assess the feasibility of increasing OpenPose's
accuracy by enhancing the videos. Contrast adjustment,
histogram equalization, sharpening and Canny edge detection
were chosen, as they have promising capabilities for altering frames
in order to bring forth the human in the frame. Contrast and
sharpness can be configured to different extents. Contrast will be
used twice, in a low contrast and a high contrast setting. Sharpening,
whose algorithm is based on an unsharp masking
procedure, will run with a 5, 4 configuration, meaning that the
original image has a higher weight than the added blurred image.
These values were chosen because they produce sharper edges
without introducing too much noise into the images.
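A minimal sketch of such an unsharp masking step, assuming the "5, 4 configuration" means weight 5 on the original image and weight 4 on the subtracted blurred copy (so the original dominates, as described; the 3×3 box blur is also an illustrative stand-in for the blur step):

```python
import numpy as np

def box_blur3(img):
    """3x3 mean filter with edge padding, standing in for the blur step."""
    p = np.pad(img.astype(np.float32), 1, mode="edge")
    h, w = img.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def unsharp_mask(img, w_orig=5.0, w_blur=4.0):
    """Sharpen by over-weighting the original and subtracting the blur:
    out = 5*original - 4*blurred (assumed reading of the 5, 4 configuration)."""
    out = w_orig * img.astype(np.float32) - w_blur * box_blur3(img)
    return np.clip(out, 0, 255).astype(np.uint8)
```

Note that on flat regions the weighted difference reproduces the input exactly (5x - 4x = x), so only edges are amplified.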
Figures 2 and 3 show the results from the test, detailing how
OpenPose and the different pre-processing operations behaved.
It is immediately apparent that OpenPose by itself performs rather
well on datasets b and c, while for datasets a and d the
pre-processing operations improved the accuracy significantly
(frames from the datasets can be found in Figure 1). The
pre-processing operations that increased the accuracy compared
to OpenPose alone were histogram equalization, sharpening and
high contrast, while low contrast and Canny edge detection
decreased the accuracy. This tells us that lowering the contrast or
reducing the frames to their edges only worsens the chance of
limb recognition in the images. Histogram equalization and high
contrast, on the other hand, both made OpenPose perform better
most of the time, especially on datasets a and d, which
both have a narrow range of histogram values. This tells us that
OpenPose performs best when there is a clear difference
between the background and the persons. This observation is
supported by OpenPose's high accuracy on dataset b, which
represents the perfect condition, with a clear difference between
the background and the person in the frame.

Figure 1 shows the different kinds of datasets OpenPose is exposed to: a) a dark indoor scenario,
b) a static simple background scenario, c) a dynamic background scenario, and d) a complex static
background scenario with non-white tinted light.
In this paper we have seen how OpenPose's limb estimation
accuracy can be enhanced by applying different kinds of pre-processing
techniques to the datasets. For all datasets the pre-processing
operations helped OpenPose achieve a higher accuracy,
with increases ranging from 0.29% to 38.37%. Interestingly,
the datasets OpenPose had the most difficulty with are the ones
deviating most from normal lighting situations: the
dark environment where the human is barely visible, and the
environment with a complex background and non-white light.
In the dark scenario, high contrast pre-processing
created the best conditions for OpenPose to analyze. This is
logical, since the low-contrast dark frame is manipulated to
have greater differences between the values in the frame.
Histogram equalization performs a similar operation by stretching out
the intensity range of the frame, thereby making a frame like d
in Figure 1, with its unnatural light, both lighter and higher in
contrast. This could indicate that the datasets
OpenPose is trained on usually feature normal white
light of varying brightness, which does not hinder the visibility
of the human in the frame. Furthermore, the fact that histogram
equalization increased the accuracy for most of the datasets
suggests that, for OpenPose to perform as well as possible, the
contrast between background and human needs to be adequate,
ensuring that all the limbs are visible enough to
minimize doubt about whether all limbs, or only a
subset of them, are present. Additionally, the non-linear filter used by Y. Wang et al.
could prove useful in this case as well, as the humans can be
in motion when captured.
As seen in Figures 2 and 3, different scenarios require
different pre-processing techniques to enhance the accuracy of
OpenPose. Since the dataset used in this paper was made for
this very occasion, the logical next step will be to test the different
filters on the COCO dataset, as the results from that dataset
were what inspired this pilot research.
Figure 2 shows the accuracies when OpenPose was exposed to the datasets with one person, either
raw or manipulated through the various pre-processing operations (OpenPose alone, low contrast,
high contrast, Canny edge detection, histogram equalization, sharpness). Datasets a, b, c and d refer
to the same types of scenarios found in Figure 1. The accuracies in bold are the highest achieved for
the given dataset.

Figure 3 shows the accuracies when OpenPose was exposed to the datasets with two persons, either
raw or manipulated through the various pre-processing operations (OpenPose alone, low contrast,
high contrast, Canny edge detection, histogram equalization, sharpness). Datasets a, b, c and d refer
to the same types of scenarios found in Figure 1. The accuracies in bold are the highest achieved for
the given dataset.

Additionally, these pilot results suggest an approach to improving
the intelligent machine running OpenPose. Such an approach could
first analyze the histograms of the frames and the values of the pixels,
and then the machine (similar to the learning-based enhancement
approach discussed above) could be trained to make an
informed choice of which image processing operation would
create the highest limb estimation accuracy. If such an addition to the
architecture were made, the raw footage the user wants to analyze,
whether for security, health or other socially relevant situations,
would not need to be retouched before being submitted to OpenPose,
as the system itself could enhance the image for the highest
and most reliable limb estimation possible.
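A very rough stand-in for that decision step could inspect coarse histogram statistics of a grayscale frame and map them to an operation. The thresholds and the mapping below are illustrative assumptions, not the trained model the paper envisions:

```python
import numpy as np

def choose_operation(gray_frame):
    """Heuristic placeholder for a learned selector: pick a pre-processing
    operation from coarse histogram statistics of an 8-bit grayscale frame."""
    mean = float(gray_frame.mean())
    spread = float(gray_frame.max()) - float(gray_frame.min())
    if mean < 60:        # very dark frame: stretch the contrast
        return "high_contrast"
    if spread < 128:     # narrow histogram: equalize the intensities
        return "histogram_equalization"
    return "sharpness"   # otherwise gently emphasize edges
```

A trained classifier replacing this heuristic would learn both the statistics worth inspecting and the mapping to operations from data.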
In this research, low-level image processing operations were
applied to datasets before submitting them to an intelligent pose
estimation algorithm, to see whether they allowed the algorithm to achieve
higher recognition scores. For the purpose of this study, the
popular OpenPose engine was chosen as the pose estimation
system, and the image processing operations were: low contrast,
high contrast, Canny edge detection, histogram equalization and
sharpening. Applying them to different datasets of 300
images each proved to increase OpenPose's accuracy by 0.29%
to 38.37%, depending on the given dataset. Most
noticeable was the increase when applying the high contrast
operation to a dataset containing a dark environment in which a
person is present: OpenPose's accuracy increased from 47.19% to
85.56%. When applying histogram equalization to frames
containing low-lit, non-white lighting, the accuracy rose from
76.19% to 89.53%. Less noticeable were the accuracy increases
on datasets containing bright outdoor light and well-lit
indoor scenarios; these accuracies rose by only 0.29%-1.29%,
using high contrast, histogram equalization or sharpening as the
pre-processing operation. These results demonstrate that OpenPose's
accuracy can be enhanced by applying low-level image processing
to the datasets before submitting them to the algorithm. However,
the effect is very context dependent, and an intelligent integration
of the pre-processing operations into OpenPose could be an
interesting next step towards optimizing the system to be less
sensitive to different lighting conditions.
Our thanks to Prof. George Palamas for being available for
consultation during the development of this research, and thanks to
the OpenPose team for making their framework and code openly
available to others.
REFERENCES
[1] J. K. Aggarwal and Q. Cai, "Human Motion Analysis: A Review," vol. 73, no. 3, pp. 428–440, 1999.
[2] T. B. Moeslund and E. Granum, "A Survey of Computer Vision-Based Human Motion Capture," vol. 268, pp. 231–
[3] A. Toshev and C. Szegedy, "DeepPose: Human Pose Estimation via Deep Neural Networks," Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1653–1660, 2014.
[4] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields," Acta Phys. Pol. A, vol. 106, no. 5, pp. 709–713, Nov. 2016.
[5] A. Newell, K. Yang, and J. Deng, "Stacked Hourglass Networks for Human Pose Estimation," in European Conference on Computer Vision, Springer, 2016, pp. 483–499.
[6] M. Fieraru, A. Khoreva, L. Pishchulin, and B. Schiele, "Learning to Refine Human Pose Estimation," Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Work., pp. 318–327,
[7] M. Omran, C. Lassner, G. Pons-Moll, P. V. Gehler, and B. Schiele, "Neural Body Fitting: Unifying Deep Learning and Model-Based Human Pose and Shape Estimation," in 2018 International Conference on 3D Vision (3DV), IEEE, 2018,
[8] S. Li and A. B. Chan, "3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network," in 12th Asian Conference on Computer Vision, 2014, pp. 332–347.
[9] Y. Du, W. Wang, and L. Wang, "Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition," Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1110–
[10] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, "2D Human Pose Estimation: New Benchmark and State of the Art Analysis," in 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3686–3693.
[11] T.-Y. Lin et al., "Microsoft COCO: Common Objects in Context," in 2014 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2014, pp. 740–755.
[12] Y. Wang, S. Piérard, S. Z. Su, and P. M. Jodoin, "Improving pedestrian detection using motion-guided filtering," Pattern Recognit. Lett., vol. 96, pp. 106–112, 2017.
[13] K. G. Lore, A. Akintayo, and S. Sarkar, "LLNet: A deep autoencoder approach to natural low-light image enhancement," Pattern Recognit., vol. 61, no. 2, pp. 650–662,
[14] A. Milan, L. Leal-Taixe, K. Schindler, and I. Reid, "Joint tracking and segmentation of multiple targets," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 5397–5406.
[15] G. Konecny, "Methods and possibilities for digital differential rectification," Photogramm. Eng. Remote Sens., vol. 45, no. 6, pp. 727–734, 1979.
[16] D. J. Ketcham, "Real-Time Image Enhancement Techniques," Jul. 1976, pp. 120–125.
[17] G. Deng, "A generalized unsharp masking algorithm," IEEE Trans. Image Process., vol. 20, no. 5, pp. 1249–1261, 2011.
[18] J. F. Canny, "A Computational Approach to Edge Detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 8, no. 6, pp. 679–
[19] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, "Convolutional Pose Machines," Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 4724–4732, Jan. 2016.
[20] I. Kim, "Deep Pose Estimation implemented using Tensorflow with Custom Architectures for fast inference," 2018. [Online]. Available: https://github.com/ildoonet/tf-