Robust People Detection Using Depth Information from an Overhead
Time-of-Flight Camera
Carlos A. Lunaa, Cristina Losada-Gutierreza, David Fuentes-Jimeneza, Alvaro Fernandez-Rincona,
Manuel Mazoa, Javier Macias-Guarasaa,b
a Department of Electronics, University of Alcalá, Ctra. Madrid-Barcelona, km. 33,600, 28805 Alcalá de Henares, SPAIN
Emails: caluna@depeca.uah.es (C.A. Luna), losada@depeca.uah.es (C. Losada-Gutierrez),
david.fuentes@depeca.uah.es (D. Fuentes-Jimenez), alvaro.fernandez@depeca.uah.es (A. Fernandez-Rincon),
mazo@depeca.uah.es (M. Mazo), macias@depeca.uah.es (J. Macias-Guarasa)
b Corresponding author. Tel: +34 918856918, fax: +34 918856591.
Abstract
In this paper we describe a system for the automatic detection of multiple people in a scene, using only
depth information provided by a Time-of-Flight (ToF) camera placed in an overhead position. The main
contribution of this work lies in the proposal of a methodology for determining the Regions of Interest (ROIs)
and extracting features, which results in a robust discrimination between people (with or without accessories)
and objects (either static or dynamic), even when people and objects are close together. Since only depth
information is used, the developed system guarantees users’ privacy. The designed algorithm includes two
stages: an online stage, and an offline one. In the offline stage, a new depth image dataset has been recorded
and labeled, and the labeled images have been used to train a classifier. The online stage is based on
robustly detecting local maximums in the depth image (which are candidates to correspond to the heads of the
people present in the scene), from which a carefully selected Region of Interest (ROI) is defined around
each of them. For each ROI, a feature vector is extracted, providing information on the top view of people
and objects, including information related to the expected overhead morphology of the head and shoulders.
The online stage also includes a pre-filtering process, in order to reduce noise in the depth images. Finally,
there is a classification process based on Principal Components Analysis (PCA). The online stage works in
real time at an average of 150 fps. In order to evaluate the proposal, an extensive experimental validation has
been carried out, including different numbers of people simultaneously present in the scene, as well as people
with different heights, complexions, and accessories. The obtained results are very satisfactory, with a 3.1%
average error rate.
Keywords: People detection, Depth camera information, Interest regions estimation, Overhead camera,
Feature extraction
1. Introduction
Automatic people detection and counting is a highly interesting topic because of its multiple applications
in areas such as video-surveillance, access control, people flow analysis, behaviour analysis or event capacity
management. Given the need to prevent and detect potentially dangerous situations, there is no doubt that
these applications have become increasingly important in recent years.
There are several works in the literature aimed at achieving a robust and reliable detection and counting
of people in a non-invasive way (without adding turnstiles or other contact systems for access control). The
first works in this line were based on the use of RGB cameras. In (Ramanan et al., 2006) the authors propose
a system based on learning person models. Lefloch et al. (2008) describe a proposal for real-time people
counting based on robust background subtraction and subsequent people segmentation. Other approaches
are based on face detection (Chen et al., 2010) or interest point classification (Jeong et al., 2013). These
proposals achieve good results in controlled conditions, but they have problems in scenarios with occlusions.
In order to reduce these occlusions, there are other works that propose placing the camera in an overhead
position. This is the case of Antic et al. (2009) and Cai et al. (2014), whereas Dan et al. (2012) use the fusion
of RGB and depth information in order to improve the detection. Color and depth information are acquired
using a Kinect® v1 sensor (Smisek et al., 2011), which simultaneously includes an RGB camera and a depth
sensor that constructs a depth map by analyzing a speckle pattern of infrared laser light.
However, using an RGB camera implies that there is information that could allow knowing the identity
of the people in the scene. This possibility can be a relevant issue in applications where there are privacy
preservation requirements, due, among others, to legal considerations. This is the reason why, in the last few
years, researchers have looked for alternatives in order to preserve users’ privacy. This is the case of Chan
et al. (2008), who proposed the use of a low resolution camera, located far away from the users. This allows
monitoring people without being able to identify them, and, consequently, without invading their privacy.
Unfortunately, this solution can only be used in environments which allow placing the cameras in distant
positions, and the proposal exhibits some problems if there are occlusions, or people walk very close to each
other.
On the other hand, also in the last few years, several works have proposed the use of depth sensors or
2.5D cameras (Lange & Seitz, 2001; Sell & O'Connor, 2014), based on Time of Flight (ToF) (Bevilacqua
et al., 2006; Stahlschmidt et al., 2013, 2014; Jia & Radke, 2014) or structured light (Zhang et al., 2012;
Galčík & Gargalík, 2013; Rauter, 2013; Zhu & Wong, 2013; Del Pizzo et al., 2016; Vera et al., 2016) in order to
detect and count people. All these works are based on overhead cameras, with the objective of reducing
the occlusion effects. The use of depth sensors implies privacy preservation, but also provides a source of
information that allows achieving object segmentation in a more straightforward way than traditional
optical cameras (especially in the context of people counting from top-view cameras).
In (Bevilacqua et al., 2006) the use of a ToF depth sensor (Canesta EP205 with a resolution of 64 × 64
pixels) instead of a stereo pair of cameras is proposed, to allow people tracking even in low illumination
conditions. In that work, both the depth and the infrared intensity (grayscale) images provided by the
sensor are used, which implies that the users' privacy is not fully preserved. Moreover, the system only works properly if
people enter the scene separately, to avoid occlusion effects. Hence, their proposal does not work if the number
of people is high, or if they move close to each other, thus not being useful in realistic environments.
In (Zhang et al., 2012), the authors use a vertical Kinect® v1 sensor to obtain a depth image. Since they
assume that the head is always closer to the camera than other body parts, the proposal for people detection
is based on obtaining the local minimums of the depth image. This proposal is scale-invariant, but it cannot
handle situations where some moving object is closer to the sensor than the head, such as when raising
hands over it.
Another interesting proposal is described in (Stahlschmidt et al.,2013,2014), where the authors present
a system that also uses an overhead ToF camera. The proposal includes several stages: first, they detect
the maximums in the height image, and define a Region of Interest (ROI) around each maximum. Then, a
preprocessing stage is carried out, in which they remove the points that belong to the floor, and normalize
the measurements. Finally, they use a filter based on the normalized Mexican Hat Wavelet that allows
segmenting the different objects in the scene. This proposal improves the results of Bevilacqua et al. (2006)
when the number of people in the scene is high, but it still generates errors if people are very close to each
other.
An additional relevant drawback of the proposals by Zhang et al. (2012) and Stahlschmidt et al. (2013,
2014) is that they do not include a classifier, but only a detector (Zhang et al., 2012) or a tracker (Stahlschmidt
et al., 2013, 2014), so that they cannot discriminate between people and other (mobile or static) objects in the
scene, thus possibly leading to a large number of false positives in realistic scenarios.
Jia & Radke (2014) describe an alternative for people detection and tracking, as well as for posture
estimation (standing or sitting) in an intelligent space. The authors detect people in the scene and their pose
as a function of the heights of a set of segmented points. Just as in the previously cited works, a preprocessing
stage is necessary in order to reduce noise and depth measurement errors. After removing the background
(pixels belonging to the floor or furniture), any group of connected depth measurements (blob) with a height
over 90 cm is considered to be a person.
The proposal by Del Pizzo et al. (2016) allows real-time people counting and includes two stages: first,
there is a background subtraction step, which includes a dynamic background update. This stage allows
identifying the pixels in motion within the scene. Then, a second step interprets the results of the foreground
detection in order to count people.
Both Jia & Radke (2014) and Del Pizzo et al. (2016) allow people detection while preserving the users' privacy,
but again, since no classification stage is included, these proposals are not able to discriminate people from
other objects in the scene. In order to solve this issue, other works add a classification stage that allows
reducing the number of false positives (Galčík & Gargalík, 2013; Rauter, 2013; Zhu & Wong, 2013; Vera
et al., 2016).
The proposals in (Galčík & Gargalík, 2013), (Vera et al., 2016), and (Zhu & Wong, 2013) are based on
the human head and shoulders structure in order to obtain a descriptor that is used to detect people in
depth images. In particular, Galčík & Gargalík (2013) propose to first detect the areas that can be a head, and
then generate a descriptor with three components related to the head structure: the head area, its roundness,
and a box test component. Finally, the correctness of the descriptor is computed. An additional tracking
stage is included in order to reduce the number of false positives. The proposal by Vera et al. (2016) is
also based on using the head roundness, its area, and a tracker, but the authors also include a stage where
the tracklets obtained from several depth cameras are combined. On the other hand, Zhu & Wong (2013)
propose to use the Head and Shoulder Profile (HASP) as the input feature vector, and an AdaBoost-based
algorithm for classification. All these proposals are able to efficiently discriminate between people and other
objects or animals in the scene, but the detection rates decrease significantly if people are close to each other.
Additionally, since the proposals are based on the head and shoulder structure of humans, they may not
work properly if people wear accessories such as hats, caps, etc.
In summary, the works by Stahlschmidt et al. (2013), Stahlschmidt et al. (2014), Jia & Radke (2014),
Zhang et al. (2012), and Del Pizzo et al. (2016) allow detecting people while preserving their privacy, but since
these proposals do not include a classification process, they are not able to discriminate between people and
other objects, thus leading to a possibly high number of false positives in realistic scenarios. On the other
hand, the proposals by Galčík & Gargalík (2013), Zhu & Wong (2013), and Vera et al. (2016) include a
classification stage (based on the head, or head and shoulder, structure of humans), but they exhibit a lack of
robustness if people are close to each other, and they may not work properly if people wear accessories
that change their appearance, such as hats, caps, backpacks, etc.
In this paper, we propose a system for robust and reliable detection of multiple people, using only
depth data acquired with a ToF camera. The proposed solution works properly even if the number of people
is high and/or they are close to each other. This work improves the robustness of previous proposals in the
literature, as it is able to discriminate people from other objects in the environment. Moreover, our proposal
is able to detect people even if they are wearing accessories that change their appearance, such as hats,
caps, backpacks, etc.
The structure of the paper is as follows: Section 1 provides a general introduction and a critical
review of the literature, Section 2 describes the notation used, Section 3 describes the person detection
algorithm, Section 4 includes the experimental setup, results and discussion, and Section 5 contains the main
conclusions and some ideas for future work.
2. Notation
Real scalar values are represented by lowercase letters (e.g. $\delta$ or $n$). Vectors are organized by rows,
and they are represented using bold lowercase letters (e.g. $\mathbf{x}$). Capital letters are reserved for defining the
size of vectors and sets (e.g. vector $\mathbf{x} = [x_1, \cdots, x_N]^\top$ is of size $N$). Matrices are represented by bold capital
letters (e.g. $\mathbf{Z}$). Calligraphic types are reserved for representing ranges or sets, such as $\mathbb{R}$ for real numbers,
or $\mathcal{H}$ for generic sets, with $\overline{\mathcal{H}}$ being the complementary set of $\mathcal{H}$. The cardinality of a set $\mathcal{A}$ is defined as the
number of elements included in the set, and it will be referred to as $C(\mathcal{A})$.
Throughout this paper, we will use the concept of neighborhood area for a given region in the image plane
$\mathcal{P}$. This will be referred to as $\mathcal{V}^{\mathcal{R}}_v$, where $\mathcal{R}$ is the region to which the neighborhood refers (possibly
composed of just a single pixel), and $v$ is the neighborhood distance used.
If $\mathcal{P} = \left\{ p_{i,j}, \text{ with } 0 < i \leq M \text{ and } 0 < j \leq N \right\}$, where $p_{i,j}$ refers to the pixel in position $(i,j)$, formally:
$$\mathcal{V}^{\mathcal{R}}_v = \left\{ p_{k,l} \in \mathcal{P} \ / \ p_{k,l} \notin \mathcal{R} \ \text{ and } \ \forall p_{m,n} \in \mathcal{R}, \ |k-m| \leq v \text{ or } |l-n| \leq v \right\} \qquad (1)$$
Figure 1 shows an example of two different regions $\mathcal{R}_1$ and $\mathcal{R}_2$, and two neighborhood areas: $\mathcal{V}^{\mathcal{R}_1}_2$, associated to $\mathcal{R}_1$ with neighborhood distance 2, and $\mathcal{V}^{\mathcal{R}_2}_1$, associated to $\mathcal{R}_2$ with neighborhood distance 1.
In the same way, we specify the concept of neighborhood in a given radial direction, referred to as $\mathcal{V}^{\mathcal{R}}_{v,\delta_i}$
(with $1 \leq i \leq 8$), as shown in Figure 2, where neighborhoods are represented in radial directions (radial
subzones) for the examples in Figure 1. Arrows labeled $\delta_1$ to $\delta_4$ follow the direction of the compass points
(north, east, south and west, respectively), whereas $\delta_5$ to $\delta_8$ correspond to the four diagonals (northeast,
southeast, southwest and northwest, respectively).
We also define the concept of local neighborhood area of a region $\mathcal{R}$ for a given distance in a given direction $\delta_i$, as the one which excludes the neighborhood area of the immediately lower neighborhood distance:
$$\mathcal{L}^{\mathcal{R}}_{v,\delta_i} = \mathcal{V}^{\mathcal{R}}_{v,\delta_i} \cap \overline{\mathcal{V}^{\mathcal{R}}_{v-1,\delta_i}}, \quad \text{for } v \geq 2 \qquad (2)$$
Figure 1: Examples of neighborhood areas.
Figure 2: Examples of neighborhood areas in radial directions.
Figure 3: Examples of local neighborhood areas in radial directions.
Figure 4: Proposed system architecture for people detection using ToF cameras.
Figure 3 graphically shows some examples of local neighborhood areas in radial directions.
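To make this notation concrete, the following Python sketch (illustrative only; the function and variable names are ours and not part of the paper) computes the neighborhood area of equation (1) for a region given as a set of pixel coordinates, following the definition literally:

    # Sketch, assuming region and image_plane are sets of (i, j) pixel coordinates.
    def neighborhood_area(region, image_plane, v):
        """V_v^R of equation (1): pixels outside the region that are within
        neighborhood distance v of every pixel of the region."""
        area = set()
        for (k, l) in image_plane:
            if (k, l) in region:
                continue
            if all(abs(k - m) <= v or abs(l - n) <= v for (m, n) in region):
                area.add((k, l))
        return area

The local neighborhood of equation (2) can then be obtained as the set difference between the areas computed for distances v and v-1.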
3. Person Detection Algorithm
As previously mentioned in the introduction, the solution we propose for people detection only
uses depth information provided by a ToF camera in an overhead position (in order to guarantee the privacy
of the users in the environment). A general block diagram of the proposal is shown in Figure 4.
There are two processes, an offline process and an online one. For the offline process, we have recorded
and labeled a database of depth sequences (the GOTPD1 database, which is available to the academic
community for research purposes; full information can be found in (Macias-Guarasa et al., 2016)). This
dataset has been used to define two different classes: one class for people without any accessories, and
another one for people with accessories such as hats and caps.
The online process includes five stages: one to obtain the height matrix for each depth image acquired
by the ToF camera; another to carry out a filtering process to reduce the measurement noise; a third stage
for the detection of local maximums in the height matrix that could correspond to a person's head, and the
determination of the Region of Interest (ROI) around each of them (corresponding to the person's head, neck
and shoulders); a fourth module to extract a feature vector for each ROI; and, finally, a last stage that performs
the discrimination between people and other objects.
The algorithm proposed by the authors is based on the detection of isolated maximums in the height
matrix that can belong to the heads of the people in the scene (or to any other object), and then on the extraction of
features in the ROI around each maximum. In order to discriminate between people and other objects, we
use a classification stage based on PCA (Principal Component Analysis). This stage allows determining whether
a feature vector extracted from a ROI around a maximum belongs to any of the trained classes representing
people. The main difficulty of a realistic scenario, which the system has to face, is the high variability in the
shape of people's heads and shoulders (long/short hair, hats, caps, backpacks, etc.).
Next, each stage of the proposed solution shown in Figure 4 is described in detail.
3.1. Height Acquisition
The ToF camera is located in an overhead position, and its optical axis is perpendicular to the floor plane,
as shown in Figure 5. In this figure, the camera coordinate system is defined as $(X_c, Y_c, Z_c)$, and its origin ($O_c$)
is the optical center of the camera lens. $\mathbf{p}_{3D} = \left[ x_{p_{3D}}, y_{p_{3D}}, z_{p_{3D}} \right]^\top$ corresponds to a 3D point in the scene
whose coordinates $x_{p_{3D}}, y_{p_{3D}}, z_{p_{3D}}$ are related to the camera coordinate system. $d_{p_{3D}} = \sqrt{x_{p_{3D}}^2 + y_{p_{3D}}^2 + z_{p_{3D}}^2}$
is the distance between $\mathbf{p}_{3D}$ and $O_c$, and $h_{p_{3D}}$ is the height of $\mathbf{p}_{3D}$ with respect to the floor plane. Supposing
that the floor plane and the $X_c, Y_c$ plane are parallel, and that the distance between them is $h_{camera}$, then
$h_{p_{3D}} = h_{camera} - z_{p_{3D}} \ \forall \mathbf{p}_{3D}$.
A ToF camera provides, for each pixel $q_{i,j}$ in the depth image (being $i,j$ the pixel coordinates in the
image plane), the 3D coordinates of the point $\mathbf{p}_{q_{i,j}} = \left[ x_{p_{q_{i,j}}}, y_{p_{q_{i,j}}}, z_{p_{q_{i,j}}} \right]^\top$ in the 3D scene associated to the $q_{i,j}$
pixel, as well as the distance $d_{p_{q_{i,j}}}$, all of them related to the camera coordinate system. So, for each acquired
image, it is possible to obtain a height measurement matrix $\mathbf{H}^{mea}$, whose dimensions are the same as the
camera resolution (one measurement for each image pixel). Assuming a camera with a spatial resolution of
$M \times N$, then:
Figure 5: Definition of coordinates and measurements.
$$\mathbf{H}^{mea} = \begin{bmatrix} h^{mea}_{1,1} & \cdots & h^{mea}_{1,N} \\ \vdots & \ddots & \vdots \\ h^{mea}_{M,1} & \cdots & h^{mea}_{M,N} \end{bmatrix} \in \mathbb{R}^{M \times N}, \qquad (3)$$
where $h^{mea}_{i,j} = h_{p_{q_{i,j}}} = h_{camera} - z^{mea}_{p_{q_{i,j}}}$ represents the obtained height for pixel $q_{i,j}$ in the ToF camera with
respect to the floor. These heights have been obtained from the height of the camera with respect to the
floor ($h_{camera}$), and the value of the $z$ coordinate of the 3D point related to the camera coordinate system
($z^{mea}_{p_{q_{i,j}}} = z_{p_{q_{i,j}}}$).
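As a minimal illustration of this step (the array names below are ours and not from the paper), the height matrix of equation (3) is obtained directly from the $z$ coordinates reported by the camera:

    import numpy as np

    # z_mea: M x N matrix of z coordinates (cm) reported by the ToF camera;
    # h_camera: height of the camera over the floor (cm).
    def height_matrix(z_mea: np.ndarray, h_camera: float = 340.0) -> np.ndarray:
        """Per-pixel height over the floor, H_mea of equation (3)."""
        return h_camera - z_mea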
3.2. Noise Reduction
One of the fundamental problems of ToF cameras is the high noise level present in the measured
matrix $\mathbf{H}^{mea}$. This noise is especially significant if there are moving objects in the scene, leading to a great
number of invalid measurements along the objects' edges (motion artifacts) (Jiménez et al., 2014). Another
noise source is multipath interference (Jiménez et al., 2014).
ToF camera manufacturers detect if a measurement $z^{mea}_{p_{q_{i,j}}}$ is not valid, and they indicate this circumstance
by assigning a predetermined value to these invalid measurements (e.g. in the PMD S3 camera the invalid
measurements have a value $z^{mea}_{p_{q_{i,j}}} = 2\,$mm, whereas in the Microsoft Kinect® v2 the assigned value is $z^{mea}_{p_{q_{i,j}}} = 0\,$mm).
In order to reduce the noise as well as the number of invalid measurements, we have implemented a noise
reduction algorithm that includes two stages. In the first one, the invalid measurements are corrected using
the information from the nearest neighboring pixels. Then, a mean filter is used to smooth the detected surfaces.
In this work, we consider as invalid values those $h^{mea}_{i,j}$ which are reported by the camera as invalid
measurements (associated to invalid coordinate measurements $z^{mea}_{p_{q_{i,j}}}$), as well as those with a height greater
than the maximum height for a person. We define the set of pixels with an invalid measurement given by the
camera as $\mathcal{I}_{null\text{-}camera} = \left\{ q_{m,n} \ / \ z^{mea}_{p_{q_{m,n}}} \text{ is invalid} \right\}$, and $\mathcal{I}_{null\text{-}h_{pmax}} = \left\{ q_{m,n} \ / \ h^{mea}_{m,n} > h_{pmax} \right\}$ as the set of pixels
with a height greater than the maximum allowed value for a person ($h_{pmax} = 220\,$cm in this work). From
them, the full set of invalid measurements, referred to as $\mathcal{I}_{null}$, is $\mathcal{I}_{null} = \mathcal{I}_{null\text{-}camera} \cup \mathcal{I}_{null\text{-}h_{pmax}}$.
Next, for each $q_{i,j} \in \mathcal{I}_{null}$ a new height value is estimated, referred to as $\hat{h}^{mea}_{i,j}$, following the procedure
described next:
1. First, there is a search for valid height values in a neighborhood area around every pixel $q_{i,j} \in \mathcal{I}_{null}$.
This search is carried out in the 8 nearest neighbors, i.e. in the neighborhood area $\mathcal{V}^{q_{i,j}}_1$. If there are no valid
heights in this neighborhood area, the search continues in the neighborhood area of level two, $\mathcal{V}^{q_{i,j}}_2$. For
the first neighborhood level $v^*$ in which a valid pixel is found ($\mathcal{V}^{q_{i,j}}_{v^*}$, with $v^* = \arg\min_{1 \leq l \leq 2} \ l \ / \ \mathcal{V}^{q_{i,j}}_l \neq \emptyset$),
the estimated value $\hat{h}^{mea}_{i,j}$ is obtained as the average of the valid values $h^{mea}_{m,n}$ in that neighborhood level.
If $h^{mea,v^*}_{m,n}$ are the valid values in $\mathcal{V}^{q_{i,j}}_{v^*}$, then the value of $\hat{h}^{mea}_{i,j}$ is given by:
$$\hat{h}^{mea}_{i,j} = \underset{h^{mea,v^*}_{m,n} \in \mathcal{V}^{q_{i,j}}_{v^*}}{\text{average}} \ h^{mea,v^*}_{m,n} \qquad (4)$$
The height measurement matrix obtained after removing the invalid measurements will be referred to
as $\mathbf{H}^{val} \in \mathbb{R}^{M \times N}$.
If no valid height measurement is found for any $v \in \{1, 2\}$, we consider that the number of measurement
errors in the image is excessive, so the image is rejected and a new one is processed. This
condition may seem very stringent, but under the experimental conditions of this work we have confirmed
that the probability of having a 5x5 pixel area without any valid measurement is certainly low. In fact,
the created dataset does not include any image with a 5x5 pixel area containing only invalid pixels.
Nevertheless, in case an image is rejected, the people's positions could be recovered by including a
tracking stage in the proposal.
In the search for valid pixels, we use a maximum neighborhood level of two because we consider that,
up to that level, a high correlation between the pixel values can be guaranteed,
given the camera position and the image resolution used.
2. Mean filter: once the matrix $\mathbf{H}^{val}$ has been obtained, a nine-element mean filter is used to estimate a
new height value for each pixel, called $\hat{h}^{val}_{i,j}$ and calculated as:
$$\hat{h}^{val}_{i,j} = \underset{0 \leq \Delta q \leq 1, \ 0 \leq \Delta r \leq 1}{\text{average}} \ \hat{h}^{mea}_{i \pm \Delta q, j \pm \Delta r} \qquad (5)$$
After the filtering process, the obtained height measurement matrix is finally referred to as $\mathbf{H}$:
$$\mathbf{H} = \left[ \hat{h}^{val}_{i,j} \right] = \left[ h_{i,j} \right], \qquad (6)$$
where the notation $h_{i,j}$ is used to refer to the heights of each pixel in the image plane after the correction
of the invalid values and the application of the mean filter. This notation is used in order to simplify the
expressions below.
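A minimal sketch of this two-stage noise reduction is shown below (NumPy; the variable names, the encoding of invalid camera readings as NaN, and the border handling of the mean filter are our assumptions, not details given in the paper):

    import numpy as np

    def reduce_noise(h_mea: np.ndarray, h_pmax: float = 220.0):
        """Sketch of Section 3.2: correct invalid heights, then apply a 3x3 mean filter."""
        H = h_mea.astype(float).copy()
        M, N = H.shape
        H[H > h_pmax] = np.nan                    # I_null = I_null-camera U I_null-hpmax
        for i, j in zip(*np.nonzero(~np.isfinite(H))):
            for v in (1, 2):                      # neighborhood levels 1, then 2
                win = h_mea[max(i - v, 0):i + v + 1, max(j - v, 0):j + v + 1]
                valid = win[np.isfinite(win)]
                valid = valid[valid <= h_pmax]
                if valid.size:                    # equation (4): average of valid neighbors
                    H[i, j] = valid.mean()
                    break
            else:
                return None                       # too many invalid pixels: reject the image
        Hp = np.pad(H, 1, mode="edge")            # equation (5): 3x3 mean filter
        return sum(Hp[di:di + M, dj:dj + N]
                   for di in range(3) for dj in range(3)) / 9.0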
3.3. Regions of Interest (ROI’s) Estimation
In this work, the regions of interest (ROI's) are defined as the pixels around each detected isolated
maximum that belong to the same object (which may or may not be a person). Since the body parts of interest
for this solution are the head, neck and shoulders, the ROI's should include all of them. Additionally, their
estimation should be precise enough so that they do not include measurements from any other nearby people
or objects in the scene, even if the distance between them is small.
In order to determine the ROI associated to each isolated maximum, the initial criterion is that the height
difference between the highest point on the head and the shoulders should not be greater than an interest height
$h_{interest}$. In this work, we have selected $h_{interest} = 40\,$cm, based on anthropometric considerations (Bushby
et al., 1992; Matzner et al., 2015). Because of that (refer to Figure 6), once a maximum with a height
$h_{max}$ has been detected, the ROI around that maximum initially includes all the pixels whose heights $h_{i,j}$
fulfill:
$$h_{max} - h_{i,j} \leq h_{interest} = 40\,\text{cm} \qquad (7)$$
The initial ROI should be adjusted taking into account additional considerations in order to achieve a
correct segmentation for the extraction of reliable features that will be used in the classification process. It
should work even if people are close to each other or if there are partial occlusions. With this objective, we
have designed a robust algorithm for ROI's estimation that is composed of two stages: the robust detection of
maximums in the height matrix, which will provide a list of possible candidates to be identified as people; and
a second stage of robust ROI's adjustment, for facing situations of highly populated environments (where
people are very close to each other) and occlusions.
3.3.1. Detection of Local Maxima
In order to select which regions in the height matrix $\mathbf{H}$ can actually correspond to people or other objects,
we have developed a robust local maxima detection algorithm which is detailed below.
Figure 6: 3D projections and representations of a person in the $h_{interest}$ height range.
1. Division of the image plane into sub-regions: Assuming that the ToF camera has a spatial resolution of
$M \times N$, in this stage the image is divided into square sub-regions (SR's), with dimensions $D \times D$ pixels.
Thus, the number of sub-regions will be $N_r \times N_c$ (refer to Figure 7), with:
$$N_r = \frac{M}{D}; \quad N_c = \frac{N}{D} \qquad (8)$$
The value of $D$ is set as a function of the camera intrinsic parameters, the camera height with respect to
the floor plane ($h_{camera}$), the minimum height of people to be detected ($h_{pmin}$), and the minimum area
($l \times l$), in metric coordinates, that the top part of the head of a person with height $h_{pmin}$ can occupy. In this
work, the value of $D$ is given by:
$$D = \frac{f}{a} \cdot \frac{l}{h_{camera} - h_{pmin}} \qquad (9)$$
where $f$ is the camera focal length, and $a$ is the pixel dimension (assuming square pixels). In our
experimental tests with Kinect® v2 cameras, the used values are $\frac{f}{a} = 365.77$, $h_{camera} = 340\,$cm and
$h_{pmin} = 100\,$cm. Based on anthropometric characteristics of the human body (Bushby et al., 1992;
Matzner et al., 2015), we have set $l = 13\,$cm in this work. This value guarantees that the overhead view
of a person whose height is $h_{pmin}$ covers several SR's. From all these considerations, the calculated
value for $D$ is 20 pixels. For $l = 10\,$cm, $D$ would become 15 pixels, and the obtained results would
be similar.

Figure 7: Division of the original pixel space ($q_{i,j}$ pixels) in $SR_{r,c}$ subregions of dimension $D \times D$ ($3 \times 3$ in this example).

Therefore, each $SR_{r,c}$ (note that the coordinates $r,c$ of the subregion in the new plane
partition have already been included in the notation), with $1 \leq r \leq N_r$ and $1 \leq c \leq N_c$, includes the
corresponding pixels from the image plane:
$$SR_{r,c} = \left\{ q_{i,j} \ / \ (r-1)D + 1 \leq i \leq (r-1)D + D \ \text{ and } \ (c-1)D + 1 \leq j \leq (c-1)D + D \right\}, \qquad (10)$$
and will have assigned the corresponding heights from matrix $\mathbf{H}$, as shown in equation (11):
$$\mathbf{H}^{SR_{r,c}} = \begin{bmatrix} h_{(r-1)D+1,(c-1)D+1} & \cdots & h_{(r-1)D+1,(c-1)D+D} \\ \vdots & \ddots & \vdots \\ h_{(r-1)D+D,(c-1)D+1} & \cdots & h_{(r-1)D+D,(c-1)D+D} \end{bmatrix} \qquad (11)$$
where $\mathbf{H}^{SR_{r,c}} \in \mathbb{R}^{D \times D}$.
2. Calculation of Maximums: identifying as $h^{maxSR}_{r,c}$ the maximum height value associated to each $SR_{r,c}$
($h^{maxSR}_{r,c} = \max_{\forall q_{i,j} \in SR_{r,c}} h_{i,j}$), a matrix $\mathbf{H}^{maxSR} \in \mathbb{R}^{N_r \times N_c}$ is constructed as follows:
$$\mathbf{H}^{maxSR} = \begin{bmatrix} h^{maxSR}_{1,1} & \cdots & h^{maxSR}_{1,N_c} \\ \vdots & \ddots & \vdots \\ h^{maxSR}_{N_r,1} & \cdots & h^{maxSR}_{N_r,N_c} \end{bmatrix} \qquad (12)$$
Each value $h^{maxSR}_{r,c}$ in $\mathbf{H}^{maxSR}$ is considered as a candidate to correspond to a person if it initially fulfills
the following condition (see Figure 8):
$$h_{pmin} \leq h^{maxSR}_{r,c} \geq h^{maxSR}_{r^*,c^*}, \quad \forall h^{maxSR}_{r^*,c^*} \in \mathcal{V}^{SR_{r,c}}_1 \qquad (13)$$
Figure 8: Scheme to obtain $m^k_{r,c}$ in the $\mathbf{H}^{maxSR}$ matrix, where the thick border square shows the level 1 neighboring region ($\mathcal{V}^{SR_{r,c}}_1$).
In order to simplify the notation, all the values of $h^{maxSR}_{r,c}$ that fulfill the previous condition will be
referred to as $m^k_{r,c}$, with $1 \leq k \leq N_P$, where $k$ indexes each maximum height in a list of them, and $N_P$ is
the number of detected maximums.
Given that it is very probable that nearby SR's will belong to the same person, when nearby $m^{k^*}_{r^*,c^*}$ are
found, they will be substituted by a single one (the one with the highest value), which will represent all the
others. In our case, and taking into account the dimensions of the SR's, the camera placement and
characteristics, and the people height range, we consider as close SR's to a given $SR_{r,c}$ those that
are within its neighborhood area of order 2, that is, those included in $\mathcal{V}^{SR_{r,c}}_2$.
As a consequence, from all $m^{k^*}_{r^*,c^*} \in \mathcal{V}^{SR_{r,c}}_2$, the one with the highest value is chosen, assigning to it the
coordinates of the SR that is nearest to the centroid of the SR's that have an associated $m^k_{r,c}$. That
is, identifying as $\left\{ SR_{r_1,c_1}, SR_{r_2,c_2}, \ldots, SR_{r_{N_{SRp}},c_{N_{SRp}}} \right\}$ (with $N_{SRp}$ being the number of nearby subregions
with an associated $m^k_{r,c}$), the coordinates to be assigned to the associated maximum will be:
$$r = \text{round}\left\{ \frac{\sum_{l=1}^{N_{SRp}} r_l}{N_{SRp}} \right\}; \quad c = \text{round}\left\{ \frac{\sum_{l=1}^{N_{SRp}} c_l}{N_{SRp}} \right\} \qquad (14)$$
where $\text{round}\{\cdot\}$ is a function that rounds to the nearest integer. That maximum will take the value
$m^k_{r,c} = \max_{m^{k^*}_{r^*,c^*} \in \mathcal{V}^{SR_{r,c}}_2} m^{k^*}_{r^*,c^*}$. The left side of Figure 9 shows an example of a person with a depth
map with two nearby maximums, and the right side of the same figure shows the final result, with a
single maximum set to the highest value of the nearby ones.
After this process, the result is a set of $N_P$ maximums that are candidates to correspond to $N_P$ people.
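A compact sketch of the sub-region division and candidate detection described above could look as follows (NumPy; the names are ours, D = 20 as computed in the paper, and the merging of nearby maxima of equation (14) is omitted for brevity):

    import numpy as np

    def detect_candidate_maxima(H: np.ndarray, D: int = 20, h_pmin: float = 100.0):
        """Divide H into D x D sub-regions, build H_maxSR (equation (12)) and keep the
        sub-regions whose maximum exceeds h_pmin and is a local maximum over its
        level-1 neighbors (equation (13))."""
        M, N = H.shape
        Nr, Nc = M // D, N // D
        h_max_sr = H[:Nr * D, :Nc * D].reshape(Nr, D, Nc, D).max(axis=(1, 3))
        candidates = []
        for r in range(Nr):
            for c in range(Nc):
                win = h_max_sr[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
                if h_max_sr[r, c] >= h_pmin and h_max_sr[r, c] >= win.max():
                    candidates.append((r, c, float(h_max_sr[r, c])))
        return h_max_sr, candidates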
Figure 9: Scheme to obtain $m^k_{r,c}$ when nearby maximums are found. The person depth map shows two maximums (left graphic, $m^k_{r,c}$ and $m^{k^*}_{r^*,c^*}$), which will be merged (right graphic).
3.3.2. ROI’s Adjustment
Since the body parts of interest for the feature extraction and classification are the head, neck and shoulders,
the ROI's should include these three body parts. Thus, each ROI, associated to each $m^k_{r,c}$, will always
comprise several SR's. The set of all the SR's associated with an $m^k_{r,c}$ will be referred to as $ROI^k_{r,c}$. Figure 10
shows a diagram that includes, as an example, the SR's corresponding to the ROI's associated with two
people.
In order to robustly determine the SR's that belong to the $ROI^k_{r,c}$ associated to each $m^k_{r,c}$, we have
designed an algorithm that accurately searches for the boundaries between each candidate to be a person
and its environment. The proposal exhaustively and sequentially analyzes a set of neighborhood areas, in
different radial directions starting from the SR under study. The analyzed SR's are added to the $ROI^k_{r,c}$ if
they fulfill several conditions.
In what follows, we will use $\mathcal{M}$ to refer to the set of all the $SR^k_{r,c}$ that have an associated $m^k_{r,c}$. As seen in
Section 3.3.1, $C(\mathcal{M}) = N_P$ is satisfied in that set.
In order to select the SR's ($SR^{k^*}_{r^*,c^*}$) that belong to each $ROI^k_{r,c}$, $N_V$ local neighborhood area levels
and 8 radial directions $\delta_i$ ($1 \leq i \leq 8$) will be established around each $SR^k_{r,c} \in \mathcal{M}$, as shown in Figure 11.
Algorithm 1 describes the procedure for ROI's adjustment, for which we next provide some considerations
and comments (refer to the left side of Algorithm 1 for the locations of the comments discussed below):
• C1: The sub-region $SR^k_{r,c}$ where the $m^k_{r,c}$ has been detected always belongs to the $ROI^k_{r,c}$.
• C2: The SR's in the level 1 neighborhood area of $SR^k_{r,c}$ always belong to the $ROI^k_{r,c}$, provided they
are within the interest height limits.
Figure 10: Example of the SR’s subregion set forming the ROIs associated to two people.
Figure 11: Directions and neighboring levels used to decide which SR's belong to each $ROI^k_{r,c}$.
foreach $SR^k_{r,c} \in \mathcal{M}$ do
    $ROI^k_{r,c} \leftarrow \emptyset$
    for $SR^{k^*}_{r^*,c^*} \in \mathcal{V}^{SR^k_{r,c}}_1$ such that $h^{maxSR}_{r^*,c^*} \geq h^{maxSR}_{r,c} - h_{interest}$ do
        [C1,2]    $ROI^k_{r,c} \leftarrow ROI^k_{r,c} \cup SR^{k^*}_{r^*,c^*}$
    [C3]  for $i^* = 1..8$ do
    [C3]      for $v^* = 2..N_V$ do
    [C3]          foreach $SR^{k^*}_{r^*,c^*} \in \mathcal{L}^{SR^k_{r,c}}_{v^*-1,\delta_{i^*}}$ do
    [C4]              if $h^{maxSR}_{r^*,c^*} \geq h^{maxSR}_{r,c} - h_{interest}$ then
    [C5]                  if $1 \leq i^* \leq 4$ and $C\left(ROI^k_{r,c} \cap \mathcal{L}^{SR^k_{r,c}}_{v^*-1,\delta_{i^*}}\right) \geq v^*-1$ then
                              if checkDecreasingHeightsAndAdd() = FALSE then
                                  break to for $i^*$
    [C6]                  if $5 \leq i^* \leq 8$ and $C\left(ROI^k_{r,c} \cap \mathcal{L}^{SR^k_{r,c}}_{v^*-1,\delta_{i^*}}\right) = 1$ then
                              if checkDecreasingHeightsAndAdd() = FALSE then
                                  break to for $i^*$

Function checkDecreasingHeightsAndAdd()
    $r' = r^* - \Delta r$;  $c' = c^* - \Delta c$   (see Table 1)
    $r'' = r^* + \Delta r$;  $c'' = c^* + \Delta c$   (see Table 1)
    [C7]  if $h^{maxSR}_{r',c'} \geq h^{maxSR}_{r,c} - h_{interest}$ then
    [C8]      if $h^{maxSR}_{r',c'} \geq h^{maxSR}_{r^*,c^*} \geq h^{maxSR}_{r'',c''}$ then
                  $ROI^k_{r,c} \leftarrow ROI^k_{r,c} \cup SR^{k^*}_{r^*,c^*}$
                  return TRUE
              else
    [C9]          $ROI^k_{r,c} \leftarrow ROI^k_{r,c} \cup \frac{1}{2} SR^{k^*}_{r^*,c^*}$
                  return FALSE
          else
    [C10]     return FALSE

Algorithm 1: ROI's Adjustment algorithm.
Table 1: Values taken by the $\Delta r$, $\Delta c$ variables for each $\delta_i$ direction in Algorithm 1.

         δ1   δ2   δ3   δ4   δ5   δ6   δ7   δ8
    Δr   -1    0    1    0   -1    1    1   -1
    Δc    0    1    0   -1    1    1   -1   -1
The decision of including the level 1 neighborhood areas in the $ROI^k_{r,c}$ is based on the sub-region
size, the dimensions of people in the scene, and the camera characteristics. Thus, the sub-regions that
are adjacent to the considered one always belong to the ROI (if their height is adequate).
• C3: For all the directions $\delta_{i^*}$ ($1 \leq i^* \leq 8$), all the SR's belonging to the local neighborhood areas of
$SR^k_{r,c}$ with a neighborhood distance $2 \leq v \leq N_V$ are analyzed. These SR's are $SR^{k^*}_{r^*,c^*} \in \mathcal{L}^{SR^k_{r,c}}_{v^*-1,\delta_{i^*}}$,
with $2 \leq v^* \leq N_V$. Then each $SR^{k^*}_{r^*,c^*}$ will be added to the $ROI^k_{r,c}$ if it fulfills the conditions below
(C4 through C10).
• C4: The maximum height of the $SR^{k^*}_{r^*,c^*}$ considered for its inclusion in a given $ROI^k_{r,c}$ should be
within the interest height limits.
The number of non-shared sub-regions belonging to the ROI in the local neighborhood area with the
neighborhood distance immediately below the considered one must be:
– C5: for the horizontal and vertical directions $\delta_1$, $\delta_2$, $\delta_3$ and $\delta_4$, at least equal to the value of the
considered neighborhood distance $v^*$ minus 1.
– C6: for the diagonal directions $\delta_5$, $\delta_6$, $\delta_7$ and $\delta_8$, equal to 1.
The objective of these two previous conditions is to guarantee continuity in the ROI building
process, so that most of the neighborhood area with the distance immediately below the considered one
already belongs to that ROI. The condition does not force all of that area to belong to the ROI, in
order to increase the robustness of the ROI estimation process.
– C8: since the interest is focused on separating people that are close to each other (especially if
they are very close), for each considered direction we impose that the maximum height
in the considered $SR^{k^*}_{r^*,c^*}$ has to be lower than the maximum height of the $SR^{k'}_{r',c'}$ immediately
adjacent to it (in the local neighborhood area immediately below), and it has to be greater than
the maximum height of the $SR^{k''}_{r'',c''}$ immediately adjacent to it (in the local neighborhood area
immediately above). All of this applies provided that the immediately adjacent region in the
lower neighborhood level has its maximum height within the heights of interest (C7).
– C9: if condition [C8] is not fulfilled, then there is a minimum between two adjacent ROI's, and
the considered $SR^{k^*}_{r^*,c^*}$ is shared by the two adjacent ROI's. In this situation, half of the points
belonging to that $SR^{k^*}_{r^*,c^*}$ will be assigned to each of the adjacent ROI's, the dividing line being
the one perpendicular to the considered direction $\delta_{i^*}$.
– C10: if conditions [C7] or [C8] are not fulfilled, then the search for adjacent sub-regions of
interest is stopped (from the considered SR and for greater neighborhood distances in the
corresponding direction).
Figures 12.a and 12.d (as well as their zoomed-in versions in Figures 12.b and 12.e) show an example
of the selection of the SR's that belong to each ROI in a situation that includes two people very close to
each other, including their shared SR's. Figure 12.b shows a shared SR in the neighborhood distance with
level $v^* = 2$, and another one in the neighborhood distance with level $v^* = 1$, both of them in the direction $\delta_1$,
and also a shared SR in the neighborhood distance with level $v^* = 1$ in the direction $\delta_8$. Figure 12.e shows
shared SR's in the neighborhood distances $v^* = 2$ and $v^* = 3$ in the directions $\delta_6$ and $\delta_2$, respectively.
Finally, Figures 12.c and 12.f show two different real examples of the results of the ROI's adjustment
algorithm for a scene that includes two people. One of them wears a cap (in Figure 12.c) and the other one
wears a hat (in Figure 12.f), and both of them are marked with a blue square.
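For illustration only, the decision made for a single candidate sub-region in conditions C7-C9 (the decreasing-height test of Algorithm 1) can be sketched as follows; this is not the authors' implementation, boundary handling is omitted, and the offsets (dr, dc) are those of Table 1:

    # h_max_sr: matrix of per-sub-region maxima; (r, c): ROI seed; (rs, cs): candidate SR.
    def check_decreasing_heights(h_max_sr, r, c, rs, cs, dr, dc, h_interest=40.0):
        inner = h_max_sr[rs - dr, cs - dc]   # SR immediately closer to the seed
        outer = h_max_sr[rs + dr, cs + dc]   # SR immediately farther from the seed
        if inner < h_max_sr[r, c] - h_interest:
            return "stop"                    # C7/C10: stop searching in this direction
        if inner >= h_max_sr[rs, cs] >= outer:
            return "add"                     # C8: heights decrease away from the maximum
        return "share"                       # C9: local minimum, SR shared between ROI's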
3.4. Feature Extraction
Given that our objective is determining the presence of people considering the morphology of the head,
neck and shoulders areas, we have designed a feature set able to model such morphology, using the height
measurements from the ToF camera as the only input, and taking into account anthropometric considerations
(related to the visible person profile from an overhead view, the head geometry, etc.). Thus, the feature vector
components will be related to the pixel density associated to the person surface at different height levels
within the corresponding ROI. In this work, the feature vector is composed of six components that will
be extracted for each $ROI^k_{r,c}$, as these correspond to possible candidates to be people. Five of these features
will be related to the visible surfaces of people or objects at different heights, and the sixth component will
correspond to the relationship between the minor and major diameters of the top surface, providing an idea
of the eccentricity of the person's head. The calculation of the feature vector components is done following
the process described in the next sections and shown in Figure 13.
Figure 12: Example of the selection of SR's shared by two ROI's. (a) SR's of $ROI^{k=1}_{r,c}$. (b) Zoom of shared SR's of $ROI^{k=1}_{r,c}$. (c) ROI estimation for a scene with two people, one of them wearing a cap. (d) SR's of $ROI^{k=2}_{r,c}$. (e) Zoom of shared SR's of $ROI^{k=2}_{r,c}$. (f) ROI estimation for a scene with two people, one of them wearing a hat.
3.4.1. Pixel Assignment to Height Slices and Counting
In this first stage, features related to the pixels associated to the head, neck and shoulders areas are
calculated. As described in Section 3.3, we assume them to be included in a height segment $h_{interest}$ below
the person height. So, starting from that person height at each $ROI^k_{r,c}$, equally spaced slices are taken, with a
slice height $\Delta h$ (in our work, $\Delta h = 2\,$cm). Thus, $N_F = \frac{h_{interest}}{\Delta h}$ slices (referred to as $F^{ROI^k_{r,c}}_s$) will be considered
for analysis, with $1 \leq s \leq N_F$ ($N_F = 20$ in our case). The pixels included in the subregions belonging to the
given $ROI^k_{r,c}$ will be assigned to the corresponding slice, that is:
$$F^{ROI^k_{r,c}}_s = \left\{ q_{i,j} \ / \ q_{i,j} \in ROI^k_{r,c} \ \text{ and } \ h^{maxSR}_{r,c} - (s-1) \cdot \Delta h \geq h_{i,j} > h^{maxSR}_{r,c} - s \cdot \Delta h \right\} \qquad (15)$$
For each of these $F^{ROI^k_{r,c}}_s$ slices, the number of height measurements obtained by the ToF camera is
counted (the number of pixels), with $\varphi^{ROI^k_{r,c}}_s = C\left(F^{ROI^k_{r,c}}_s\right)$, providing information on the pixel density for
Figure 13: Feature extraction process: from $\boldsymbol{\varphi}^{ROI^k_{r,c}}$ to $\hat{\boldsymbol{\varphi}}^{ROI^k_{r,c}}$.
the identification of the head and shoulders areas of a person. With the values of these densities, a vector
$\boldsymbol{\varphi}^{ROI^k_{r,c}}$ of $N_F$ components is generated, where the value of each component coincides with the number of
height measurements in each slice: $\boldsymbol{\varphi}^{ROI^k_{r,c}} = \left[ \varphi^{ROI^k_{r,c}}_1 \cdots \varphi^{ROI^k_{r,c}}_{N_F} \right]^\top = \left[ C\left(F^{ROI^k_{r,c}}_1\right) \cdots C\left(F^{ROI^k_{r,c}}_{N_F}\right) \right]^\top$. In the
left part of Figure 14 an example of the values of the 20 components of a $\boldsymbol{\varphi}^{ROI^k_{r,c}}$ vector is shown, for a
person with a silhouette similar to that shown in the central part of the figure. The values in the upper section
of the figure correspond to the number of pixels in the head area, those in the intermediate section correspond
to the neck area, and those in the lower section correspond to the shoulders area.
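A minimal sketch of the slice assignment and counting of equation (15) is shown below (our naming; roi_heights holds the filtered heights of the pixels in one ROI):

    import numpy as np

    def slice_counts(roi_heights: np.ndarray, h_max: float,
                     delta_h: float = 2.0, n_f: int = 20) -> np.ndarray:
        """Number of ROI pixels falling in each of the N_F slices of thickness
        delta_h below the ROI maximum height (vector phi of Section 3.4.1)."""
        h = roi_heights[np.isfinite(roi_heights)].ravel()
        phi = np.zeros(n_f, dtype=int)
        for s in range(1, n_f + 1):
            upper = h_max - (s - 1) * delta_h
            lower = h_max - s * delta_h
            phi[s - 1] = np.count_nonzero((h <= upper) & (h > lower))
        return phi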
3.4.2. Count Accumulation
The components of the $\boldsymbol{\varphi}^{ROI^k_{r,c}}$ vector are very sensitive to appearance changes of a person (hair style,
hair length, neck height, etc.) and to the person height; additionally, their estimation will be affected by the
person's position within the scene (taking into account that the camera is in an overhead position, there will be
areas not seen by the camera). Moreover, the effects of distance on the measurement noise, and the multipath
propagation of ToF cameras, must also be taken into account.
To minimize the noise measurement errors and the appearance changes of people, an accumulation of
the distance measurements over several slices is carried out. Eccentricity information of the top section
of the head will also be added (as described in Section 3.4.4), building a new feature vector $\hat{\boldsymbol{\varphi}}^{ROI^k_{r,c}} = \left[ \hat{\varphi}^{ROI^k_{r,c}}_1 \cdots \hat{\varphi}^{ROI^k_{r,c}}_6 \right]^\top$, with 6 components in our case.
The first three components of $\hat{\boldsymbol{\varphi}}^{ROI^k_{r,c}}$ include information related to the head, each of them integrating the
number of pixels of three consecutive slices (each individual component spans 6 cm). To avoid problems of
a wrong estimation of the person height due to measurement noise, we assume that the first three components
of $\hat{\boldsymbol{\varphi}}^{ROI^k_{r,c}}$ comprise most of the pixels in the top section of the head. Then, depending on the slice with the
highest pixel density ($\mu_H = \arg\max_{1 \leq s \leq 3} \varphi^{ROI^k_{r,c}}_s$), the first three components of $\hat{\boldsymbol{\varphi}}^{ROI^k_{r,c}}$ will be generated
Figure 14: Example of slice segmentation for a person. The number of measured points in each slice is shown on the left ($\boldsymbol{\varphi}^{ROI^k_{r,c}}$). The values of the first 5 components of the feature vector are shown on the right ($\hat{\boldsymbol{\varphi}}^{ROI^k_{r,c}}$).
as follows:
$$\text{if } \mu_H = 1 \text{ OR } \mu_H = 2 \ \Rightarrow \ \begin{cases} \hat{\varphi}^{ROI^k_{r,c}}_1 = \sum_{s=1}^{3} \varphi^{ROI^k_{r,c}}_s \\ \hat{\varphi}^{ROI^k_{r,c}}_2 = \sum_{s=4}^{6} \varphi^{ROI^k_{r,c}}_s \\ \hat{\varphi}^{ROI^k_{r,c}}_3 = \sum_{s=7}^{9} \varphi^{ROI^k_{r,c}}_s \end{cases} \qquad (16)$$
$$\text{if } \mu_H = 3 \ \Rightarrow \ \begin{cases} \hat{\varphi}^{ROI^k_{r,c}}_1 = \sum_{s=2}^{4} \varphi^{ROI^k_{r,c}}_s \\ \hat{\varphi}^{ROI^k_{r,c}}_2 = \sum_{s=5}^{7} \varphi^{ROI^k_{r,c}}_s \\ \hat{\varphi}^{ROI^k_{r,c}}_3 = \sum_{s=8}^{10} \varphi^{ROI^k_{r,c}}_s \end{cases} \qquad (17)$$
The fourth and fifth components of $\hat{\boldsymbol{\varphi}}^{ROI^k_{r,c}}$ are related to the shoulders zone, each of them integrating
the number of pixels found in three consecutive slices (each individual component spans 6 cm). Again, to
increase the estimation robustness, the shoulders zone will be considered to start in any of the $F^{ROI^k_{r,c}}_s$ slices
in a given range of values for $s$:
$$S_{range} = \begin{cases} 10 \leq s \leq 16 & \text{if } \mu_H = 1 \text{ OR } \mu_H = 2 \\ 11 \leq s \leq 16 & \text{if } \mu_H = 3 \end{cases} \qquad (18)$$
so that the slice with the highest pixel density will be selected ($\mu_S = \arg\max_{s \in S_{range}} \varphi^{ROI^k_{r,c}}_s$), and from it, the
values of the fourth and fifth components of $\hat{\boldsymbol{\varphi}}^{ROI^k_{r,c}}$ will be generated, as follows:
$$\hat{\varphi}^{ROI^k_{r,c}}_4 = \sum_{s=\mu_S-1}^{\mu_S+1} \varphi^{ROI^k_{r,c}}_s \qquad \qquad \hat{\varphi}^{ROI^k_{r,c}}_5 = \sum_{s=\mu_S+2}^{\mu_S+4} \varphi^{ROI^k_{r,c}}_s \qquad (19)$$
Figure 14 includes an example in which the different height slices are shown, determining the values of
the feature vector components $\boldsymbol{\varphi}^{ROI^k_{r,c}}$ and $\hat{\boldsymbol{\varphi}}^{ROI^k_{r,c}}$.
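The accumulation of equations (16) to (19) can be sketched as follows (illustrative only, with 0-based indices; phi is the 20-component slice-count vector computed in Section 3.4.1):

    import numpy as np

    def accumulate_counts(phi: np.ndarray) -> np.ndarray:
        """First five components of the feature vector, per equations (16)-(19)."""
        mu_h = int(np.argmax(phi[:3]))               # head slice with most pixels (0..2)
        start = 0 if mu_h < 2 else 1                 # equation (16) vs. equation (17)
        head = [phi[start + 3 * b: start + 3 * b + 3].sum() for b in range(3)]
        s_lo = 9 if mu_h < 2 else 10                 # 0-based start of S_range (eq. (18))
        mu_s = s_lo + int(np.argmax(phi[s_lo:16]))   # shoulder slice with most pixels
        shoulders = [phi[mu_s - 1: mu_s + 2].sum(),  # fourth component (eq. (19))
                     phi[mu_s + 2: mu_s + 5].sum()]  # fifth component (eq. (19))
        return np.array(head + shoulders, dtype=float)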
3.4.3. Normalization
As can be seen from the calculation scheme shown in Figure 14, the number of pixels associated to
height measurements in the different $\Delta h$ slices depends on the person height. So, the components $\hat{\varphi}^{ROI^k_{r,c}}_1$ to
$\hat{\varphi}^{ROI^k_{r,c}}_5$ associated to people will also be dependent on height, thus making it necessary to normalize them.
To carry out the normalization, the relationship between the maximum height ($\hat{h}^{maxSR}_{r,c}$) and the number
of pixels detected by the camera in the top section of the head ($\hat{\varphi}^{ROI^k_{r,c}}_1$) will be calculated. As an initial
Figure 15: $\rho^{ROI^k_{r,c}}_1$ curve.
approximation, a quadratic relationship has been assumed, so that:
$$\varphi^{ROI^k_{r,c}}_1 \approx \rho^{ROI^k_{r,c}}_1 = a_0 \left( \hat{h}^{maxSR}_{r,c} \right)^2 + a_1 \hat{h}^{maxSR}_{r,c} + a_2 \qquad (20)$$
where $a_0$, $a_1$ and $a_2$ are the coefficients to estimate.
The Levenberg-Marquardt algorithm was used for the determination of those coefficients, as those that
best fit the input data set $\left( \hat{h}^{maxSR}_{r,c}, \hat{\varphi}^{ROI^k_{r,c}}_1 \right)$, selected from the training database, according to the non-linear
function $\rho^{ROI^k_{r,c}}_1$ in equation (20). For a sample set of people with heights between 140 cm and 213 cm,
and for each height, the average number of pixels of $\varphi^{ROI^k_{r,c}}_1$ was calculated, and from it, the normalization
curve was obtained, along with the coefficient values $a_0$, $a_1$ and $a_2$. Figure 15 includes a graphic in which
the training data and the fitted curve are shown (resulting in $a_0 = 0.138$, $a_1 = -36.94$ and $a_2 = 2997$), obtaining
a mean square error of approximately 45 pixels.
Finally, the normalized vector will be obtained by dividing the feature vector components of $\hat{\boldsymbol{\varphi}}^{ROI^k_{r,c}}$ by the
estimated $\rho^{ROI^k_{r,c}}_1$ (which will generate a new normalized vector $\hat{\boldsymbol{\varphi}}^{ROI^k_{r,c}}_{norm}$):
$$\hat{\varphi}^{ROI^k_{r,c}}_{i,norm} = \frac{\hat{\varphi}^{ROI^k_{r,c}}_i}{\rho^{ROI^k_{r,c}}_1}, \quad \text{with } 1 \leq i \leq 5 \qquad (21)$$
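A sketch of this fitting and normalization step is given below; it uses scipy's curve_fit, whose default solver for unbounded problems is of the Levenberg-Marquardt type, and assumes hypothetical arrays of training heights and head-top pixel counts:

    import numpy as np
    from scipy.optimize import curve_fit

    def rho(h, a0, a1, a2):
        """Quadratic model of equation (20)."""
        return a0 * h**2 + a1 * h + a2

    def fit_normalization_curve(heights_cm: np.ndarray, head_counts: np.ndarray):
        coeffs, _ = curve_fit(rho, heights_cm, head_counts)   # Levenberg-Marquardt fit
        return coeffs                                         # (a0, a1, a2)

    def normalize(features: np.ndarray, h_max: float, coeffs) -> np.ndarray:
        """Equation (21): divide the first five components by rho(h_max)."""
        return features / rho(h_max, *coeffs)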
3.4.4. Eccentricity Calculation
The five normalized components of the feature vector $\hat{\boldsymbol{\varphi}}^{ROI^k_{r,c}}_{norm}$ already described provide information
on the top view of people and objects with different heights, but initial experiments on people detection
showed the need to also include more information related to the expected overhead geometry of the head.
So, to incorporate information on the way that pixels are distributed in the top section of the head, a sixth
component has been added to the feature vector, at heights between $h^{maxSR}_{r,c}$ and $h^{maxSR}_{r,c} - 6\,$cm (as was
discussed above). If the function that calculates the relationship between the major and minor axes of the
region located within 6 cm ($3\Delta h$) of the maximum height is referred to as $rba$ (operating on a set of pixels), the
sixth component will be:
$$\hat{\varphi}^{ROI^k_{r,c}}_{6,norm} = rba\left( \left\{ q_{i,j} \ / \ q_{i,j} \in ROI^k_{r,c} \ \text{ and } \ h^{maxSR}_{r,c} \geq h_{i,j} > h^{maxSR}_{r,c} - 3\Delta h \right\} \right) \qquad (22)$$

Figure 16: Example of a frame with 8 people: (a) 3D point cloud measures, (b) 2D depth map, (c) sample feature vector values. In Subfigure (c), the top graphic corresponds to a person 165 cm tall with long hair (the one in the north-west position in the group of Subfigure (b)), and the bottom graphic to a person 202 cm tall with short hair (the one in the center position in the group of Subfigure (b)).
Figure 17: Example of a frame with two people, one of them wearing a hat: (a) 3D point cloud measures, (b) 2D depth map, (c) sample feature vector values.
To provide real examples from the used dataset, Figures 16, 17 and 18 show the 3D
representation of scenes with 8, 2 and 1 people, respectively, their corresponding 2D depth maps, and feature
vectors for selected elements. Figure 19 also shows an example of a person pushing a chair.
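The paper does not detail how the major/minor axis relationship rba is computed; purely as an illustrative assumption, one common choice is to derive it from the eigenvalues of the covariance matrix of the pixel coordinates of the head-top region:

    import numpy as np

    def rba(pixels_ij: np.ndarray) -> float:
        """Illustrative axis ratio for an N x 2 array of (i, j) pixel coordinates.
        The covariance-eigenvalue approach is our assumption, not the authors' code."""
        cov = np.cov(pixels_ij.T)
        minor, major = np.sqrt(np.maximum(np.sort(np.linalg.eigvalsh(cov)), 1e-12))
        return float(major / minor)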
3.5. People Classification
Given that the objective of this work is detecting people in a scene in which there may be other static or
moving objects (chairs, for example), it is necessary to implement a classifier able to differentiate the feature
vectors obtained for the different $ROI^k_{r,c}$, as corresponding or not to people.

Figure 18: Example of a frame with one person moving his fists up and down: (a) 3D point cloud measures, (b) 2D depth map, (c) sample feature vector values.
Figure 19: Example of a frame with one person pushing a chair: (a) 3D point cloud measures, (b) 2D depth map, (c) sample feature vector values.

From now on, and with the objective of simplifying the notation, the feature vector $\hat{\boldsymbol{\varphi}}^{ROI^k_{r,c}}_{norm} \in \mathbb{R}^6$ obtained
for each $ROI^k_{r,c}$ will be referred to as $\boldsymbol{\Psi}$ ($\boldsymbol{\Psi} \in \mathbb{R}^6$).
In our task, the values of the $\boldsymbol{\Psi}$ feature vector components change significantly when people carry accessories
occluding the head and shoulders (hats, caps, etc.). This is why two classes have been defined
($\alpha \in \{1, 2\}$), for people without and with accessories, respectively.
To carry out such classification, a classifier based on Principal Component Analysis (PCA) was
selected (Shlens, 2014; Jiménez et al., 2005), thus requiring an offline estimation of the models for each class,
prior to the online classification process.
3.5.1. Model Estimation (Offline Process)
In the offline process, the two transformation matrices $\mathbf{U}_\alpha$ are calculated. To do so, $N_\alpha$ training vectors
were used, associated to different people representative of each of the two classes, $\boldsymbol{\Psi}_{\alpha i}$ ($\alpha = 1, 2$; $i = 1 \cdots N_\alpha$).
From those $N_\alpha$ vectors, their average value and scatter matrices are calculated, for each class:
$$\overline{\boldsymbol{\Psi}}_\alpha = \frac{1}{N_\alpha} \sum_{i=1}^{N_\alpha} \boldsymbol{\Psi}_{\alpha i} \qquad \qquad \mathbf{S}_{T\alpha} = \sum_{i=1}^{N_\alpha} (\boldsymbol{\Psi}_{\alpha i} - \overline{\boldsymbol{\Psi}}_\alpha)(\boldsymbol{\Psi}_{\alpha i} - \overline{\boldsymbol{\Psi}}_\alpha)^T \qquad (23)$$
Matrices $\mathbf{U}_\alpha$ for each class $\alpha$ are formed by the eigenvectors associated to the highest eigenvalues of the
corresponding scatter matrices $\mathbf{S}_{T\alpha}$ (Shlens, 2014; Jiménez et al., 2005). In our case, three eigenvectors have
been chosen, following the criterion that the average normalized residual quadratic error (RMSE) is higher
than 90%, that is:
$$RMSE = \frac{\sum_{j=m+1}^{6} \lambda_{\alpha j}}{\sum_{j=1}^{6} \lambda_{\alpha j}} > 0.9 \qquad (24)$$
resulting in $m = 3$ and, consequently, $\mathbf{U}_\alpha \in \mathbb{R}^{6 \times 3}$.
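A minimal sketch of the per-class model estimation (equations (23) and (24)) with a plain eigendecomposition is shown below (our naming; m = 3 components retained, as in the paper):

    import numpy as np

    def train_pca_class(train_vectors: np.ndarray, m: int = 3):
        """train_vectors: N_alpha x 6 matrix of feature vectors of one class.
        Returns the class mean and the 6 x m projection matrix U_alpha."""
        mean = train_vectors.mean(axis=0)
        centered = train_vectors - mean
        scatter = centered.T @ centered             # S_T-alpha (equation (23))
        eigvals, eigvecs = np.linalg.eigh(scatter)  # ascending eigenvalues
        order = np.argsort(eigvals)[::-1]           # keep the m largest
        return mean, eigvecs[:, order[:m]]          # U_alpha in R^{6 x m}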
3.5.2. Person Detection (Online Process)
In the classification process (online process), the $\boldsymbol{\Psi}$ feature vector of each $ROI^k_{r,c}$ is calculated, and,
for each class $\alpha$, the difference between this vector and the class average vector $\overline{\boldsymbol{\Psi}}_\alpha$ is projected into the
transformed space ($\boldsymbol{\Phi}_\alpha = \boldsymbol{\Psi} - \overline{\boldsymbol{\Psi}}_\alpha$).
The projected vector will then be $\boldsymbol{\Omega}_\alpha = \mathbf{U}^T_\alpha \boldsymbol{\Phi}_\alpha$. Next, the projected vector is recovered in the original
space: $\hat{\boldsymbol{\Phi}}_\alpha = \mathbf{U}_\alpha \boldsymbol{\Omega}_\alpha$. The Euclidean distance between $\boldsymbol{\Phi}_\alpha$ and $\hat{\boldsymbol{\Phi}}_\alpha$ is called the reconstruction error $\epsilon_\alpha$. This
process is applied for each of the two classes.
Finally, a feature vector is classified as corresponding to a person if its reconstruction error is lower than
a given threshold for either of the two transformations. That is:
$$\boldsymbol{\Psi} \text{ is a person if } \ \epsilon_1 = ||\boldsymbol{\Phi}_1 - \hat{\boldsymbol{\Phi}}_1|| \leq Th_1 \ \text{ OR } \ \epsilon_2 = ||\boldsymbol{\Phi}_2 - \hat{\boldsymbol{\Phi}}_2|| \leq Th_2 \qquad (25)$$
where $Th_\alpha$ is the threshold for class $\alpha$, which in our case was determined experimentally for each class,
using the following equation:
$$Th_\alpha = \overline{\epsilon}_\alpha + 3\sigma_\alpha \qquad (26)$$
where $\overline{\epsilon}_\alpha$ is the average value of the reconstruction error and $\sigma_\alpha$ is its standard deviation, for $N_\alpha$ people with
different characteristics, and in different scene positions.
In case the condition in equation (25) does not hold, the feature vector is considered not to correspond
to a person.
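A sketch of the online decision (equations (25) and (26)) is shown below (our naming; the per-class thresholds are assumed to be computed offline from the training reconstruction errors):

    import numpy as np

    def reconstruction_error(psi: np.ndarray, mean: np.ndarray, U: np.ndarray) -> float:
        """Project psi onto the class subspace and measure the reconstruction error."""
        phi = psi - mean                 # Phi_alpha
        phi_hat = U @ (U.T @ phi)        # reconstructed vector
        return float(np.linalg.norm(phi - phi_hat))

    def threshold_from_training(errors: np.ndarray) -> float:
        """Equation (26): mean reconstruction error plus three standard deviations."""
        return float(errors.mean() + 3.0 * errors.std())

    def is_person(psi, models, thresholds) -> bool:
        """models: list of (mean, U) per class; thresholds: Th_alpha per class."""
        return any(reconstruction_error(psi, mean, U) <= th
                   for (mean, U), th in zip(models, thresholds))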
4. Experimental Work
4.1. Experimental Setup
In order to provide data for training and evaluating the proposal, we used part of the GOTPD1 database
(available at (Macias-Guarasa et al., 2016)), which was recorded with a Kinect® v2 device located at a height
of 3.4 m. The recordings tried to cover a broad variety of conditions, with scenarios comprising:
• Single and multiple people
• Single and multiple non-people (such as chairs)
• People with and without accessories (hats, caps)
• People with different complexions, heights, hair colors, and hair configurations
• People actively moving and performing additional actions (such as using their mobile phones, moving
their fists up and down, moving their arms, etc.).
The data used was split in two subsets, one for training and the other for testing. The subsets are fully
independent, so that no person present in the training database was present in the testing subset.
Table 2 and Table 3 show the details of the training and testing subsets, respectively. #Samples refers
to the number of all the heads over all the frames in the recorded sequences (in our recordings we used 39
Table 2: Training subset details.

    Sequence ID                          #Samples   Description                                     Class
    seq-P01-M02-A0001-G00-C00-S0041      141        Single person                                   Class 1: Person without accessories
    seq-P01-M02-A0001-G00-C00-S0042      299        Single person                                   Class 1: Person without accessories
    seq-P01-M02-A0001-G00-C00-S0043      566        Single person                                   Class 1: Person without accessories
    seq-P01-M02-A0001-G00-C00-S0044      149        Single person                                   Class 1: Person without accessories
    seq-P01-M04-A0001-G00-C00-S0045      226        Single person                                   Class 1: Person without accessories
    seq-P01-M04-A0001-G00-C00-S0046      301        Single person                                   Class 1: Person without accessories
    seq-P01-M02-A0001-G00-C02-S0047      221        Multiple people with accessories (hats, caps)   Class 2: Person with accessories (hats, caps)
    seq-P01-M02-A0001-G00-C02-S0048      152        Multiple people with accessories (hats, caps)   Class 2: Person with accessories (hats, caps)
Table 3: Testing subset details, specifying the total number of samples, and the number of samples for classes 1 and 2.

    Sequence type                                                #Samples   #Class1   #Class2
    Sequences with a single person                               5317       5317      0
    Sequences with two people                                    933        833       100
    Sequences with more than two people                          8577       6929      1648
    Sequences with chairs and people balancing fists facing up   830        830       0
    Totals                                                       15657      13909     1748
different people). The database contains sequences in which the users were instructed on how to move under
the camera (to allow for proper coverage of the recording area), and sequences where people moved freely
(to allow for a more natural behavior)2.
4.2. Comparison with other methods
4.2.1. Performance comparison
In order to evaluate the improvements of the proposed method compared with others in the literature,
we first chose the recent work described in (Stahlschmidt et al., 2014), given the similarity of the task. For
the comparison, we ran preliminary experiments on a subset of the testing database. In the comparison we
did not use a tracking module for any of the methods (to provide a fair comparison of their discrimination
2This is fully detailed in the documentation distributed with the database at (Macias-Guarasa et al.,2016)
Table 4: Comparison results with the strategy described in (Stahlschmidt et al., 2014).

    Sequence type                       #Samples   TPR (ST2014)      TPR (Proposal)   FPR (ST2014)     FPR (Proposal)
    Single person                       5757       100.00% ± 0.00%   98.06% ± 0.36%   0.21% ± 0.12%    0.23% ± 0.12%
    Two people                          973        60.43% ± 3.07%    97.01% ± 1.08%   0.10% ± 0.20%    0.32% ± 0.35%
    More than two people                2383       84.01% ± 1.47%    95.90% ± 0.80%   0.42% ± 0.26%    0.13% ± 0.14%
    Sequences with chairs and people
    balancing fists facing up           1042       98.37% ± 0.77%    97.57% ± 0.93%   21.02% ± 2.47%   0.12% ± 0.21%
    Total counts and
    average TPR/FPR rates               10155      92.29% ± 0.52%    97.40% ± 0.31%   2.38% ± 0.30%    0.20% ± 0.09%
capabilities without other improvements), and we evaluated the results in terms of correct and false detections (true
and false positive rates, TPR and FPR, respectively), with the results shown in Table 4, where #Samples is
the number of people heads labeled in the ground truth, ST2014 refers to the results in (Stahlschmidt et al.,
2014), and Proposal refers to our results. Table 4 also includes confidence interval values for a confidence
level of 95%.
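For reference, the rates and the confidence intervals reported here can be computed along the following lines (the paper does not state which interval estimator was used; the normal-approximation interval below is our assumption):

    import math

    def rates(tp: int, fn: int, fp: int, tn: int):
        """True and false positive rates from raw detection counts."""
        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        return tpr, fpr

    def ci95_halfwidth(p: float, n: int) -> float:
        """Half-width of a 95% normal-approximation confidence interval for a
        proportion p estimated from n samples."""
        return 1.96 * math.sqrt(p * (1.0 - p) / n) if n else 0.0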
As can be clearly seen, the proposed method outperforms the other one, except for very slight degradations
in the TPR for the single person sequences and for the sequences with chairs and people
balancing fists facing up (not statistically significant). There is also a very minor (again not statistically
significant) degradation in the FPR for sequences with one or two people, but this reduces to just 2+1 cases
out of 5757 + 973, respectively. These results are coherent with our expectations, as the Mexican hat strategy
used in (Stahlschmidt et al., 2014) will generate a detection for almost everything that can be assimilated
to this shape. This behaviour leads to very good performance for single person sequences, but it is accomplished
at the expense of generating a much higher number of false positive detections when there are
additional objects in the scene (as is the case for sequences with chairs and people balancing their fists facing
up).
It can also be observed that the improvements in the TPR are very significant for sequences of multiple
people, in which the Mexican hat strategy is not able to accurately separate nearby people. The
effect is particularly noticeable in the sequences of two people (with a 60.5% relative improvement), because
the recordings were done with both people being very close to each other. The observed improvement in
sequences with more than two people is lower than with just two people, because in the recordings with more
than two people the participants were not requested to remain close to each other, and they mainly moved freely in the scene, so that
the Mexican hat can do a better job in discriminating among the people in the scene.
Table 5: Comparison results with the strategy described in (Zhang et al., 2012).

    Strategy                  Recall            Accuracy          F-score
    (Zhang et al., 2012)      99.47% ± 0.42%    99.57% ± 0.38%    99.52% ± 0.40%
    Proposal (*)              99.57% ± 0.38%    99.57% ± 0.38%    99.57% ± 0.38%

    (*) Values of all metrics are equal in our results as the numbers of false positives and false negatives are equal.
The last row in Table 4 shows the average results, where we can see that our proposal clearly outperforms
the strategy described in (Stahlschmidt et al., 2014), and that the observed differences are statistically
significant in the evaluated metrics.
We also compared the performance of our proposal on a different dataset, the one used in (Zhang et al.,
2012), kindly provided to us by the authors. The application of our algorithm was challenging due to three
different facts. First of all, the provided data was generated with a Kinect® v1 sensor, which has a lower
resolution than the Kinect® v2 sensor on which we based our development (320x240 vs 512x424 pixels), and
also uses a technology that generates noisier data3. Second, the provided data had already been processed
by applying their background subtraction strategy, so we could not apply our noise reduction algorithm,
as we did not have access to the raw depth stream required to do so. Finally, we also had to adapt the PCA
models used in the classification stage to the new data, allowing the consideration of a new class to model
the high presence of people with backpacks and wearing coats with large hoods (either put on their heads, or
resting on their shoulders and backs). Given that our system needs to be trained, and to avoid training biases,
we generated the models using sequences from the dataset1 described in the paper, and tested our proposal
on the dataset2 sequences (which were recorded in a fully different scene than dataset1).
Table 5 shows the comparison results using the same metrics as those used in (Zhang et al., 2012) (to ease the comparison), where we have also added the confidence interval values for a confidence level of 95%: the recall rate is the fraction of people that are detected; accuracy is the fraction of detected results that are people4, and the F-score is the tradeoff between recall rate and accuracy. It can be clearly seen that the performance of both algorithms is very similar, with a very slight (not statistically significant) advantage for our proposal. It is important to note here that the strategy in (Zhang et al., 2012) assumes that only people are present in the sequences, so that all the detected regions are considered to be people
(the only validation check the authors use to discriminate between people and other elements in the scene is the area of the bounding box of the detected region, which is compared to a predefined threshold value). This is actually acknowledged by the authors, who state that "the water filling cannot handle the situation where some moving object is closer to the sensor than head, such as raising hands over head." The use of the classification stage in our proposal avoids such false detections.

3 The Kinect® v1 sensor uses structured light, as opposed to time of flight in the Kinect® v2.
4 In the literature, this is usually referred to as "Precision", but we kept the naming used in (Zhang et al., 2012) for easier comparison.

Table 6: Timing details of the proposed algorithm (average processing times per frame, in ms).

Sequence type                 Filter   Max     ROI's   FE      Classify   Total time   FPS
Sequence with 8 users         3.756    0.138   1.132   3.692   0.087      8.805        114
Sequence with 4 users         3.752    0.095   1.016   3.536   0.710      9.110        110
Single user, height=178 cm    3.772    0.007   0.356   1.331   0.241      5.707        175
Single user, height=150 cm    3.738    0.012   0.361   1.307   0.254      5.672        176
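For additional clarity on the metrics reported in Table 5, the following minimal sketch shows how recall, accuracy (precision) and F-score are obtained from raw detection counts; the counts used are illustrative placeholders, not the actual counts behind the table.

def detection_metrics(tp, fp, fn):
    # Recall: fraction of actual people that are detected.
    recall = tp / (tp + fn)
    # Accuracy (precision in the usual terminology): fraction of detections that are people.
    accuracy = tp / (tp + fp)
    # F-score: harmonic mean of recall and accuracy.
    f_score = 2 * recall * accuracy / (recall + accuracy)
    return recall, accuracy, f_score

# Illustrative counts only (not the real dataset2 counts):
print(detection_metrics(tp=2300, fp=10, fn=10))

Note that when the number of false positives equals the number of false negatives, recall and accuracy coincide, and so does the F-score, which is why the three values are identical for our proposal in Table 5.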
4.2.2. Computational demands comparison
Regarding the computational demands5, we also first compared our proposal with that of Stahlschmidt et al. (2014). Table 6 shows the average processing time per frame of our proposal for several sequence types (given that the execution time per frame depends on the number of maxima detected and the conditions found in each frame) and in each of the most relevant stages: noise reduction (column "Filter"), detection of local maxima (column "Max"), ROI estimation (column "ROI's"), feature extraction (column "FE") and classification (column "Classify"). Column "Total time" shows the average total time per frame, and column "FPS" shows the maximum number of frames per second that the algorithm can cope with in real time. It is clear that the most demanding stages are the noise reduction and the feature extraction, which account for 80% of the total execution time. We have also evaluated the best and worst timing cases in the sequences analyzed in Table 6, and found that the number of frames per second that the algorithm can process in real time varies between 43 and 309, thus proving its adequacy for real time processing.

5 All the experiments reported in this section were run on an Asus X553MA laptop, with an Intel BayTrail M Dual Core Celeron N2840 at 2.58 GHz and 4 GB of RAM.
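As a small illustration of how the columns in Table 6 relate to each other, the per-stage averages add up to the total time per frame, and the FPS figure is the inverse of that total; the sketch below reproduces the first row of the table.

# Average per-stage times (ms) for the "Sequence with 8 users" row of Table 6.
stage_ms = {"Filter": 3.756, "Max": 0.138, "ROI's": 1.132, "FE": 3.692, "Classify": 0.087}

total_ms = sum(stage_ms.values())          # 8.805 ms per frame
fps = 1000.0 / total_ms                    # ~114 frames per second
print(f"total = {total_ms:.3f} ms, fps = {fps:.0f}")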
Table 7 shows the average processing time per frame using the strategy described in (Stahlschmidt et al., 2014) in each of its most relevant stages: preprocessing (column "Prep."), application of the normalized Mexican hat wavelet (column "Wavelet"), and peak detection (column "Peak"). Columns "Total time" and "FPS" have the same meaning as in Table 6. In this case, the timing is very similar across sequence types, as the wavelet filter must be applied to the whole frame content, this stage also being the most time consuming. This is the reason why the strategy in (Stahlschmidt et al., 2014) is much slower than our proposal.

Table 7: Timing details of the strategy described in (Stahlschmidt et al., 2014) (average processing times per frame, in ms).

Sequence type                 Prep.    Wavelet   Peak     Total time   FPS
Sequence with 8 users         11.133   87.439    10.745   109.317      9
Sequence with 4 users         12.837   87.822    10.730   111.389      9
Single user, height=178 cm    12.959   86.776    10.537   110.272      9
Single user, height=150 cm    11.616   88.540    10.968   111.124      9
Regarding the computational complexity of the proposal described in (Zhang et al., 2012), the authors state in their paper that the speed of their water filling algorithm is about 30 frames per second (although they do not provide information on the hardware used), which is also lower than that of our proposal.
4.3. Results and discussion
To validate the performance of the people detection algorithm, the sequences used are those in which both the people and the accessories worn (hats and caps) are different from those used in the offline training process. Experimentation was carried out with sequences involving different numbers of people in the recordings. In all cases, the selected people had different characteristics with respect to hair style, complexion, height and presence or absence of accessories. Additional sequences also include people moving their fists up and down, and moving three different types of chairs around the scene.
Table 8 shows the classification results, where #Samples is the number of people occurrences, FP and FN are the numbers of false positives and false negatives respectively, and ERR is the system error rate (ERR = 100 · (FP + FN)/#Samples). The table also includes confidence interval values calculated on the ERR metric, for a confidence level of 95%. The last column (#Other) shows the number of other detected maxima in the scene (people's hands, chairs and other objects) that have not been labeled as corresponding to people by the classification stage. As can be clearly seen, this number is high for sequences with many non-people objects, proving the capacity of the proposal to avoid generating false detections without significantly impacting its performance.
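As a concrete check of the reported figures, the overall ERR and its 95% confidence interval can be reproduced from the totals of Table 8 as in the minimal sketch below; we assume a normal-approximation interval for the binomial proportion, which matches the intervals reported in the table.

import math

def err_with_ci(fp, fn, n_samples, z=1.96):
    # ERR = 100 * (FP + FN) / #Samples, with a normal-approximation 95% CI.
    p = (fp + fn) / n_samples
    half_width = z * math.sqrt(p * (1.0 - p) / n_samples)
    return 100.0 * p, 100.0 * half_width

err, ci = err_with_ci(fp=14, fn=471, n_samples=15657)
print(f"ERR = {err:.2f}% +/- {ci:.2f}%")   # ERR = 3.10% +/- 0.27%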
Table 8: Experimental results.

Sequence                                                       #Samples   FN    FP   ERR %            #Other
Single person                                                  5317       96    10   1.99% ± 0.38%    76
Two people                                                     933        28    0    3.00% ± 1.09%    1
More than two people                                           8577       327   4    3.86% ± 0.41%    26
Sequences with chairs and people balancing fists facing up     830        20    0    2.41% ± 1.04%    956
Total counts and average ERR                                   15657      471   14   3.10% ± 0.27%    1059

From Table 8, we can conclude that the performance is very high, with an average error rate of 3.1%. The best results are obtained in sequences with one or two people, with an average error rate of 2.14%. The error increases to 3.86% for sequences with more than two people, and remains at 2.41% in situations with increased difficulties due to the presence of chairs and users deliberately trying to confuse the system (by moving their fists up and down). Given these results, we can also state that the system is able to efficiently cope with the approximately 10% of samples of people with accessories (refer to Table 3 for details on the testing subset data).
When individually analyzing the frames in which the system made an error, we found that the errors occur mainly near the scene borders, where the capture of objects is less "zenithal" and the noise level of the measurements is higher (as the illumination intensity arriving at the infrared sensor is lower). This noise is higher for shorter people (further from the sensor) and for people with straight black hair (with a lower light reflection factor). These issues have a relevant impact on the generation of the feature vector values, so that, in some cases, the classification stage is not able to correctly identify whether they correspond to people or not. Most of the time, when these issues affect the classification they lead to false negatives, which are by far the most common error in our system, as can be clearly seen in Table 8. The false positives shown in Table 8 are due to the very few cases in which spurious peaks (caused by raised arms or hands) are incorrectly classified as corresponding to a person.
5. Conclusions
In this work we have presented a system for the robust real-time detection of people, using only the depth information provided by a ToF camera placed in an overhead position. The proposal comprises several stages that allow the detection of multiple people in the scene in an efficient and robust way, even when the number of people is high and they are very close to each other. It also allows discriminating people from other objects (moving or fixed) present in the scene. Additionally, since only depth information is used, people's privacy is guaranteed. This implies an additional advantage over solutions making use of standard RGB or gray-level information.
Due to the lack of publicly available datasets fulfilling the requirements of the target application (high quality labeled depth sequences acquired with an overhead ToF camera), we have recorded and labeled a specific database, which is available to the scientific community. This dataset has been used to train and evaluate the PCA models for the two classes defined: people with and without accessories (caps, hats).
For the people detection task on the depth images, we have proposed an algorithm to detect the isolated maxima in the scene (which are candidates to correspond to people), and to precisely define a Region of Interest (ROI) around each of them. From the pixels included in the ROI, a feature vector is extracted whose component values are related to pixel heights and pixel densities in given areas of the ROI, so that it properly characterizes the upper body geometry of people. The classification of such feature vectors using PCA techniques allows determining whether or not they belong to the two people classes that have been defined. The system evaluation has followed a rigorous experimental procedure, and has been carried out on previously labeled depth sequences, including a wide range of conditions with respect to the number of people present in the scene (considering cases in which they were very close to each other), people's complexion, people's height, positions of arms and hands (arms up, fists moving up and down, people using their mobile phones, etc.), accessories on the head (caps, hats), and the presence of additional objects such as chairs. The obtained results (3.1% average error rate) are very satisfactory, given the complexity of the task due to the high variability of the evaluated sequences.
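To make the above summary more concrete, the following is a minimal sketch of the kind of height- and density-based features that can be computed from the pixels of a ROI; the ring radii, height ranges and number of bins are illustrative assumptions, not the exact feature set used in our system.

import numpy as np

def roi_features(heights, center, radii=(10, 20, 30), n_height_bins=8):
    # heights: 2D array of heights (mm) for the ROI pixels; center: (row, col) of the local maximum.
    rows, cols = np.indices(heights.shape)
    dist = np.hypot(rows - center[0], cols - center[1])
    head_height = heights[center]

    features = []
    # Pixel density in concentric rings around the maximum (head/shoulders morphology).
    prev = 0
    for r in radii:
        ring = (dist >= prev) & (dist < r)
        features.append(np.count_nonzero(heights[ring] > head_height - 400) / max(ring.sum(), 1))
        prev = r
    # Normalized histogram of height drops below the maximum (upper body profile).
    hist, _ = np.histogram(head_height - heights, bins=n_height_bins, range=(0, 800))
    features.extend(hist / max(hist.sum(), 1))
    return np.asarray(features)

Feature vectors of this kind are then classified against the PCA models of the two people classes, as summarized above.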
We have also compared our proposal with other alternatives in the literature (both in terms of performance and computational demands), and evaluated our system performance on an alternative, challenging dataset, proving its ability to efficiently cope with different experimental scenarios.
As a general conclusion, the proposal described in this work allows the robust detection of the people present in a given scene, with high performance. It only uses the depth information provided by a ToF camera, thus being appropriate in those cases where privacy concerns are relevant, and also making the system work properly independently of the lighting conditions in the scene. Additionally, the people detection process can run in real time at an average of 150 fps.
6. Acknowledgments
This work has been supported by the Spanish Ministry of Economy and Competitiveness under project SPACES-UAH (TIN2013-47630-C2-1-R), and by the University of Alcalá under projects DETECTOR (CCG2015/EXP-019) and ARMIS (CCG2015/EXP-054). We also thank Xucong Zhang for providing us with their evaluation data, and special thanks are given to the anonymous reviewers for their careful reading of our manuscript and their many insightful comments and suggestions.
References
Antic, B., Letic, D., Culibrk, D., & Crnojevic, V. (2009). K-means based segmentation for real-time zenithal people counting. In Proceedings of the 16th IEEE International Conference on Image Processing ICIP'09 (pp. 2537–2540). Piscataway, NJ, USA: IEEE Press. doi:10.1109/ICIP.2009.5414001.
Bevilacqua, A., Di Stefano, L., & Azzari, P. (2006). People tracking using a time-of-flight depth sensor. In Video and Signal Based Surveillance, 2006. AVSS '06. IEEE International Conference on (pp. 89–89). doi:10.1109/AVSS.2006.92.
Bushby, K. M., Cole, T., Matthews, J. N., & Goodship, J. A. (1992). Centiles for adult head circumference. Archives of Disease in Childhood, 67, 1286. doi:10.1136/adc.67.10.1286.
Cai, Z., Yu, Z. L., Liu, H., & Zhang, K. (2014). Counting people in crowded scenes by video analyzing. In Industrial Electronics and Applications (ICIEA), 2014 IEEE 9th Conference on (pp. 1841–1845). doi:10.1109/ICIEA.2014.6931467.
Chan, A., Liang, Z.-S., & Vasconcelos, N. (2008). Privacy preserving crowd monitoring: Counting people without people models or tracking. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on (pp. 1–7). doi:10.1109/CVPR.2008.4587569.
Chen, T.-Y., Chen, C.-H., Wang, D.-J., & Kuo, Y.-L. (2010). A people counting system based on face-detection. In Genetic and Evolutionary Computing (ICGEC), 2010 Fourth International Conference on (pp. 699–702). doi:10.1109/ICGEC.2010.178.
Dan, B.-K., Kim, Y.-S., Suryanto, Jung, J.-Y., & Ko, S.-J. (2012). Robust people counting system based on sensor fusion. Consumer Electronics, IEEE Transactions on, 58, 1013–1021. doi:10.1109/TCE.2012.6311350.
Del Pizzo, L., Foggia, P., Greco, A., Percannella, G., & Vento, M. (2016). Counting people by RGB or depth overhead cameras. Pattern Recognition Letters.
Galčík, F., & Gargalík, R. (2013). Real-time depth map based people counting. In International Conference on Advanced Concepts for Intelligent Vision Systems (pp. 330–341). Springer.
Jeong, C. Y., Choi, S., & Han, S. W. (2013). A method for counting moving and stationary people by interest point classification. In Image Processing (ICIP), 2013 20th IEEE International Conference on (pp. 4545–4548). doi:10.1109/ICIP.2013.6738936.
Jia, L., & Radke, R. (2014). Using time-of-flight measurements for privacy-preserving tracking in a smart room. Industrial Informatics, IEEE Transactions on, 10, 689–696. doi:10.1109/TII.2013.2251892.
Jimenez, D., Pizarro, D., & Mazo, M. (2014). Single frame correction of motion artifacts in PMD-based time of flight cameras. Image and Vision Computing, 32, 1127–1143. doi:10.1016/j.imavis.2014.08.014.
Jiménez, D., Pizarro, D., Mazo, M., & Palazuelos, S. (2014). Modeling and correction of multipath interference in time of flight cameras. Image and Vision Computing, 32, 1–13. doi:10.1016/j.imavis.2013.10.008.
Jiménez, J. A., Mazo, M., Ureña, J., Hernández, A., Alvarez, F., García, J. J., & Santiso, E. (2005). Using PCA in time-of-flight vectors for reflector recognition and 3-D localization. Robotics, IEEE Transactions on, 21, 909–924. doi:10.1109/TRO.2005.851375.
Lange, R., & Seitz, P. (2001). Solid-state time-of-flight range camera. Quantum Electronics, IEEE Journal of, 37, 390–397. doi:10.1109/3.910448.
Lefloch, D., Cheikh, F. A., Hardeberg, J. Y., Gouton, P., & Picot-Clemente, R. (2008). Real-time people counting system using a single video camera. Volume 6811 (pp. 681109–681109–12). doi:10.1117/12.766499.
Macias-Guarasa, J., Losada-Gutierrez, C., Fuentes-Jimenez, D., Fernandez, R., Luna, C. A., Fernandez-Rincon, A., & Mazo, M. (2016). The GEINTRA Overhead ToF People Detection (GOTPD1) database. Available online. URL: http://www.geintra-uah.org/datasets/gotpd1 (accessed June 2016).
Matzner, S., Heredia-Langner, A., Amidan, B., Boettcher, E., Lochtefeld, D., & Webb, T. (2015). Stand-off human identification using body shape. In Technologies for Homeland Security (HST), 2015 IEEE International Symposium on (pp. 1–6). doi:10.1109/THS.2015.7225300.
Ramanan, D., Forsyth, D. A., & Zisserman, A. (2006). Tracking People by Learning Their Appearance. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29, 65–81. doi:10.1109/tpami.2007.250600.
Rauter, M. (2013). Reliable human detection and tracking in top-view depth images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 529–534).
Sell, J., & O'Connor, P. (2014). The Xbox One system on a chip and Kinect sensor. Micro, IEEE, 34, 44–53. doi:10.1109/MM.2014.9.
Shlens, J. (2014). A tutorial on principal component analysis. arXiv preprint arXiv:1404.1100. URL: https://arxiv.org/pdf/1404.1100.pdf (accessed June 2016).
Smisek, J., Jancosek, M., & Pajdla, T. (2011). 3D with Kinect. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on (pp. 1154–1160). doi:10.1109/ICCVW.2011.6130380.
Stahlschmidt, C., Gavriilidis, A., Velten, J., & Kummert, A. (2013). People detection and tracking from a top-view position using a time-of-flight camera. In A. Dziech, & A. Czyazwski (Eds.), Multimedia Communications, Services and Security (pp. 213–223). Springer Berlin Heidelberg, volume 368 of Communications in Computer and Information Science. doi:10.1007/978-3-642-38559-9_19.
Stahlschmidt, C., Gavriilidis, A., Velten, J., & Kummert, A. (2014). Applications for a people detection and tracking algorithm using a time-of-flight camera. Multimedia Tools and Applications, (pp. 1–18). doi:10.1007/s11042-014-2260-3.
Vera, P., Monjaraz, S., & Salas, J. (2016). Counting pedestrians with a zenithal arrangement of depth cameras. Machine Vision and Applications, 27, 303–315.
Zhang, X., Yan, J., Feng, S., Lei, Z., Yi, D., & Li, S. Z. (2012). Water filling: Unsupervised people counting via vertical Kinect sensor. In Advanced Video and Signal-Based Surveillance (AVSS), 2012 IEEE Ninth International Conference on (pp. 215–220). IEEE.
Zhu, L., & Wong, K.-H. (2013). Human tracking and counting using the Kinect range sensor based on Adaboost and Kalman filter. In International Symposium on Visual Computing (pp. 582–591). Springer.