Real-time Social Activity Detection for Mobile Service Robots
(Extended Abstract)
Billy Okal1, Rudolph Triebel2, Kai O. Arras1,3
I. INTRODUCTION
Interactive mobile robots performing tasks such as information provision, tour guidance or domestic care typically operate in environments populated by humans engaged in various social activities (hereinafter referred to simply as activities). The interplay between the activities of the different people in the environment gives rise to different social contexts. To operate effectively, such robots need to reason about the social contexts induced by these activities. Such reasoning allows the robots to approximately understand the social behavior embedded in the interactions. Robots with this kind of understanding can, for example, plan and respond with user-aware behavior that is both task-efficient and socially normative at the same time, because they can be equipped with context-specific socially normative behaviors that high-level planning executives trigger via the detectors developed in this work. In this paper, we use high-level perceptual cues (specifically, the outputs of a people detector) to model and detect such social activities on mobile platforms.
Previous work in human activity recognition has typically been carried out by the computer vision community, motivated by applications such as monitoring and surveillance. However, the sensors employed in such cases are typically static and overlook the entire scene, as in [1], [2], [3], [4], which makes it difficult to deploy these techniques on mobile robots operating in the wild. Such conditions are generally not met by mobile robots, which perceive the surrounding people from a first-person perspective and are typically subject to noisier perception due to occlusions. In this work, we focus on methods that use perceptual information available from a robot-centric viewpoint. As for the choice of activities, we focus on social activity classes that we consider relevant in various scenarios that robots in human environments encounter. In our case these are: waiting-in-a-queue, walking-in-a-flow, walking-against-a-flow, moving-in-a-group, standing and moving-individually, as illustrated in Fig. 1. The choice of these activities is motivated by the application of a service robot in airport-like environments, where such robots are required to navigate in a socially normative manner.
1 Billy and Kai are with the Department of Computer Science at the University of Freiburg, {okal, arras}@cs.uni-freiburg.de.
2 Rudolph is with the Computer Vision Group of the Technical University of Munich, rudolph.triebel@in.tum.de.
3 Kai is also associated with Bosch Corporate Research, Renningen.
This work has been partly supported by the European Commission under contract number FP7-ICT-600877 (SPENCER).
Fig. 1. Different simulated social activities relevant for mobile robots as shown in our simulator, with the activity labels as text. They include: waiting-in-a-queue, walking-in-a-flow, walking-against-a-flow, moving-in-a-group, standing and moving-individually. Numbers show confidence.
In particular, this set of activities was arrived at by looking at video data of people moving in an airport over long periods, during which these activities were most commonly observed.
II. APPROACH
In this work, we develop a histogram-based feature descriptor using combinations of information about the relative positions and orientations of people around the target person whose activity we want to detect. We call our descriptor the Activity Context (AC) descriptor, as it is largely inspired by the shape context descriptor [5] used for object recognition in the computer vision community. This choice is also motivated by the fact that the shape context descriptor is known to be robust against outliers, deformation, translation and noise [5]. In effect, our activity context descriptor models the “shapes” of social activities based on the relative positions and orientations of people tracks as a histogram $f_t$. Additionally, we use different information for binning such histograms, such as speed and direction in addition to densities or counts. In effect, we compare seven feature combinations obtained by concatenating the basic feature histograms: density, direction, speed, density–direction, density–speed, speed–direction and density–direction–speed.
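To make the descriptor concrete, the following is a minimal sketch of how such an activity-context histogram could be computed from tracked people around a target person. The number of bins, the ranges, the log-polar layout and the per-bin averaging are illustrative assumptions; the paper itself only specifies that relative positions and orientations are binned by density, speed and direction.

```python
import numpy as np

def activity_context_descriptor(target, neighbors,
                                n_dist_bins=4, n_angle_bins=8,
                                max_range=10.0):
    """Illustrative Activity Context (AC) descriptor.

    target:    (x, y, vx, vy) state of the person of interest
    neighbors: iterable of (x, y, vx, vy) states of surrounding people

    Bins each neighbor's relative position into a log-polar grid (as in
    the shape context descriptor) and accumulates three channels per bin:
    density (counts), mean speed and mean relative heading.
    """
    tx, ty, tvx, tvy = target
    density = np.zeros((n_dist_bins, n_angle_bins))
    speed = np.zeros_like(density)
    direction = np.zeros_like(density)

    # Log-spaced radial edges: finer resolution close to the target.
    dist_edges = np.logspace(-1, np.log10(max_range), n_dist_bins + 1)

    for (x, y, vx, vy) in neighbors:
        dx, dy = x - tx, y - ty
        r = np.hypot(dx, dy)
        if r > max_range:
            continue
        d_bin = np.clip(np.searchsorted(dist_edges, r) - 1, 0, n_dist_bins - 1)
        a_bin = int(((np.arctan2(dy, dx) + np.pi) / (2 * np.pi)) * n_angle_bins) % n_angle_bins

        density[d_bin, a_bin] += 1.0
        speed[d_bin, a_bin] += np.hypot(vx, vy)
        # Neighbor heading relative to the target's heading, wrapped to [-pi, pi].
        rel = np.arctan2(vy, vx) - np.arctan2(tvy, tvx)
        direction[d_bin, a_bin] += (rel + np.pi) % (2 * np.pi) - np.pi

    counts = np.maximum(density, 1.0)
    speed /= counts          # mean speed per bin
    direction /= counts      # mean relative heading per bin
    if density.sum() > 0:
        density /= density.sum()  # normalize counts to a distribution

    return np.concatenate([density.ravel(), direction.ravel(), speed.ravel()])
```

Concatenating different subsets of the three channels yields the seven feature combinations compared in the experiments.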
We develop a two-stage classification system for detecting the social activities: (i) a frame-level (instantaneous) classification step and (ii) a spatial smoothing step. We use gradient boosted trees (GBT) [6] to perform the instantaneous classification, as they provide a good trade-off between accuracy and fast performance in practice. The outputs of the GBT are the probabilities $\Pr(a^k_t \mid f_t, \theta)$ of each activity given the features and the model parameters $\theta$, which are determined by grid search.
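As a rough illustration, the frame-level stage could be realized with an off-the-shelf gradient boosting implementation; the use of scikit-learn, the parameter grid and the file names below are assumptions for illustration, not the authors' actual setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# X: one AC descriptor f_t per (person, frame); y: the activity label a_t.
# File names and the parameter grid are illustrative assumptions.
X_train = np.load("ac_descriptors_train.npy")   # hypothetical file
y_train = np.load("activity_labels_train.npy")  # hypothetical file

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)
gbt = search.best_estimator_

def frame_level_probabilities(f_t):
    """Per-frame activity probabilities Pr(a_t^k | f_t, theta), later
    used as node potentials in the CRF smoothing stage."""
    return gbt.predict_proba(f_t.reshape(1, -1))[0]
```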
Additionally, since we are interested in group-level activities, which are commonly believed to be spatially correlated, it is imperative to employ a method that is able to reason about such spatial correlations. We employ a conditional random field (CRF) model [7] to perform spatial smoothing of the frame-level activity predictions.
Fig. 2. Classification accuracy over the five activity classes for every feature combination against the three learning modes (GBC, Offline-CRF and Online-CRF), averaged over a test trajectory.
Our CRF model uses online belief updates that are localized within its network, i.e. belief update messages are not sent over the whole network but only to the most influenced nodes. This enables efficient and fast addition and updating of nodes in the network, resulting in real-time performance. The result is an online conditional random field (O-CRF). The node potentials used in the O-CRF are simply the activity probabilities from the GBT, while the edge potentials are
\[
\psi(\varphi_{t,i}, \varphi_{t,j}, \hat{a}^i_t, \hat{a}^j_t) =
\begin{cases}
0 & \text{if } \hat{a}^i_t \neq \hat{a}^j_t \\
\exp\!\left(-\dfrac{d(\varphi_{t,i}, \varphi_{t,j})}{\sigma_d^2}\right) & \text{otherwise}
\end{cases}
\tag{1}
\]
where $d(\cdot,\cdot)$ is the Euclidean distance and $\sigma_d$ acts as a range parameter. We compute the accuracy by counting correctly classified instances.
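The edge potential of Eq. (1) translates almost directly into code. The sketch below is a minimal illustration, assuming each track carries a feature vector phi (e.g. its position) used for the distance, and a hand-picked value for sigma_d; neither is fixed by the abstract itself.

```python
import numpy as np

def edge_potential(phi_i, phi_j, a_i, a_j, sigma_d=2.0):
    """Pairwise potential of Eq. (1): zero for disagreeing activity
    labels, otherwise a similarity that decays with the Euclidean
    distance between the two tracks' feature vectors.
    sigma_d=2.0 is an assumed, not reported, value."""
    if a_i != a_j:
        return 0.0
    d = np.linalg.norm(np.asarray(phi_i) - np.asarray(phi_j))
    return np.exp(-d / sigma_d ** 2)
```

In the O-CRF, such potentials would only be attached to edges between nearby tracks, so that belief updates stay local to the most influenced nodes.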
III. EXPERIMENTS
We conduct a series of experiments in order to analyze the performance of the different features (density, relative direction, speed), as well as combinations of these. We also compare three classification modes: (a) without smoothing, i.e. using only the gradient boosted trees classifier (GBC), (b) a GBC with the O-CRF, and (c) a GBC and a CRF without incremental belief updates, i.e. with full inference done at every step. Because of the limited availability of real-world datasets of social activities from a robot-centric viewpoint, we make use of an open-source pedestrian simulator described in [8].
From the experiments, we observe, as shown in Fig. 2, that adding more information into the feature histogram, i.e. speed and direction, helps to improve the performance compared to using only density information. This is consistent with the motivation of using coherent motion indicators [9], and reaffirms our confidence in using such information to discern different social activities. We also observe that using either of the CRF models (online and offline) improves the accuracy rates, confirming our intuition about the benefit of spatial smoothing.
Fig. 3. Inference times for the CRF in the online vs. offline cases, showing that the online case is generally faster. The times are computed for a CRF network with an average of 50 nodes at every time step.
This is in congruence with the fact that most social activities involve multiple people arranged in certain patterns. Furthermore, we observe that the online version of the CRF with incremental belief updates is considerably faster while achieving almost the same performance as the offline case. Altogether, we achieve an accuracy of up to 76% with no smoothing, which rises to above 85% with the best feature combination of density–direction when smoothing is enabled.
IV. CONCLUSIONS
We have addressed the problem of detecting social activities from high-level perceptual cues obtained from a first-person perspective and compared several features and classification models. The best feature combination was found to be the one encoding the combination of density and direction. We also tested different classification models and found that using our online CRF to smooth the classifications gives the best compromise between the speed, accuracy and scalability needed in robotic setups, owing to its efficient online belief updates. In the future, we aim to conduct experiments on real-world datasets and deploy the module on a service robot.
REFERENCES
[1] T. Lan, Y. Wang, W. Yang, and G. Mori, “Beyond actions: Discrimi-
native models for contextual group activities.” in Advances in Neural
Information Processing Systems (NIPS), 2010.
[2] R. Li, P. Porfilio, and T. Zickler, “Finding group interactions in social
clutter,” in Proc. of the IEEE Conf. on Computer Vision and Pattern
Recognition (CVPR), 2013.
[3] B. Ni, S. Yan, and A. Kassim, “Recognizing human group activities
with localized causalities,” in Proc. of the IEEE Conf. on Computer
Vision and Pattern Recognition (CVPR), 2009.
[4] W. Choi, K. Shahid, and S. Savarese, “Learning context for collective
activity recognition,” in Proc. of the IEEE Conf. on Computer Vision
and Pattern Recognition (CVPR), 2011.
[5] S. Belongie, J. Malik, and J. Puzicha, “Shape context: A new descriptor
for shape matching and object recognition,” in Advances in Neural
Information Processing Systems (NIPS), 2000.
[6] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” Annals of Statistics, pp. 1189–1232, 2001.
[7] J. Lafferty, A. McCallum, and F. Pereira, “Conditional random fields:
Probabilistic models for segmenting and labeling sequence data,” in
Int. Conf. on Machine Learning (ICML), 2001.
[8] B. Okal and K. O. Arras, “Towards group-level social activity recognition for mobile robots,” in IROS Workshop on Assistance and Service Robotics in a Human Environment, 2014.
[9] Z. Yücel, F. Zanlungo, T. Ikeda, T. Miyashita, and N. Hagita, “Modeling indicators of coherent motion,” in Int. Conf. on Intelligent Robots and Systems (IROS), 2012.