Real-time Social Activity Detection for Mobile Service Robots

(Extended Abstract)
Billy Okal¹, Rudolph Triebel², Kai O. Arras¹,³
Interactive mobile robots performing tasks such as information
provision, tour guidance or domestic care typically operate in
environments populated by humans engaged in various social
activities (hereinafter referred to simply as activities). The
interplay between the activities of different people in the
environment gives rise to different social contexts. To operate
effectively, such robots need to reason about the social contexts
induced by these activities. Such reasoning allows the robots to
approximately understand the social behavior embedded in the
interactions. Robots with this kind of understanding can, for
example, plan and respond with user-aware behavior that is both
task-efficient and socially normative. This is because the robots
could be equipped with context-specific socially normative
behaviors, which high-level planning executives can then trigger
using the detectors developed in this work. In this paper, we use
high-level perceptual cues (specifically, people detector outputs)
to model and detect such social activities on mobile platforms.
Previous work in human activity recognition has typically been
carried out by the computer vision community, motivated by
applications such as monitoring and surveillance. However, the
sensors employed in such cases are typically static and overlook
the entire scene, as in [1], [2], [3], [4], which makes it difficult
to deploy these techniques on mobile robots operating in the wild.
Such conditions are generally not met by mobile robots, which
perceive the surrounding people from a first-person perspective and
are typically subject to noisier perception due to occlusions. In
this work, we focus on methods that use perceptual information
available from a robot-centric viewpoint. As for the choice of
activities, we focus on social activity classes that we consider
relevant in the various scenarios that robots in human environments
encounter. In our case these are: waiting-in- [...] in-a-group,
standing and moving-individually, as illustrated in Fig. 1. The
choice of these activities is motivated by the application of
service robots in airport-like environments, where such robots are
required to navigate in a socially
normative manner. In particular, this set of activities was
arrived at by looking at video data of people moving in an
airport over long periods, during which these activities were
most commonly observed.

Fig. 1. Different simulated social activities relevant for mobile robots,
as shown in our simulator with the activity labels as text. They include:
[...] group, standing and moving-individually. Numbers show confidence.

¹ Billy and Kai are with the Department of Computer Science at the
University of Freiburg {okal, arras}
² Rudolph is with the Computer Vision Group of the Technical University
of Munich,
³ Kai is also associated with Bosch Corporate Research, Renningen.
This work has been partly supported by the European Commission under
contract number FP7-ICT-600877 (SPENCER).

In this work, we develop a histogram-based feature descriptor
using combinations of information about the relative positions
and orientations of people around the target person whose
activity we want to detect. We call our descriptor the activity
context (AC) descriptor, as it is largely inspired by the shape
context descriptor [5] used for object recognition in the
computer vision community. This choice is also motivated by the
fact that the shape context descriptor is known to be robust
against outliers, deformations, translations and noise [5]. In
effect, our activity context descriptor models "shapes" of social
activities based on the relative positions and orientations of
people tracks as a histogram $f_t$. Additionally, we use different
information for binning such histograms, such as speed and
direction in addition to densities or counts. In total, we compare
seven feature combinations obtained by concatenating the basic
feature histograms, for example density, direction and
density–direction–speed.
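
The descriptor described above can be sketched roughly as follows. This is a minimal illustration, not our implementation: the log-polar binning follows the spirit of the shape context descriptor [5], while the bin counts, the radii `r_min` and `r_max`, the rotation into the target's heading frame, and the per-bin accumulation of direction (cosine of the heading difference) and speed are all assumptions chosen for the example. Enumerating all non-empty subsets of the three basic histograms is likewise only one assumed way of arriving at seven combinations.

```python
import math
from itertools import combinations

# Names of the three basic histograms (labels chosen for this sketch).
BASE_FEATURES = ("density", "direction", "speed")


def log_polar_bin(dx, dy, n_r, n_theta, r_min, r_max):
    """Flat index of the (ring, sector) bin for an offset, or None if too far away."""
    r = max(math.hypot(dx, dy), r_min)  # clamp very close neighbours into ring 0
    if r >= r_max:
        return None
    ring = int(n_r * math.log(r / r_min) / math.log(r_max / r_min))
    ring = min(ring, n_r - 1)
    theta = math.atan2(dy, dx) % (2.0 * math.pi)
    sector = int(theta / (2.0 * math.pi) * n_theta) % n_theta
    return ring * n_theta + sector


def activity_context(target, others, features=BASE_FEATURES,
                     n_r=3, n_theta=8, r_min=0.5, r_max=8.0):
    """Concatenated histogram f_t for one target person.

    Every person is a tuple (x, y, heading, speed) in a common frame.
    """
    tx, ty, t_heading, _ = target
    hists = {name: [0.0] * (n_r * n_theta) for name in features}
    cos_h, sin_h = math.cos(-t_heading), math.sin(-t_heading)
    for x, y, heading, speed in others:
        dx, dy = x - tx, y - ty
        # Express the offset in the target's heading frame (assumption).
        rx = cos_h * dx - sin_h * dy
        ry = sin_h * dx + cos_h * dy
        b = log_polar_bin(rx, ry, n_r, n_theta, r_min, r_max)
        if b is None:
            continue
        # What each basic histogram accumulates in the bin (assumption).
        weights = {
            "density": 1.0,
            "direction": math.cos(heading - t_heading),
            "speed": speed,
        }
        for name in features:
            hists[name][b] += weights[name]
    f_t = []
    for name in features:  # concatenate the selected basic histograms
        f_t.extend(hists[name])
    return f_t


def feature_combinations():
    """All non-empty subsets of the basic histograms."""
    return [c for k in range(1, len(BASE_FEATURES) + 1)
            for c in combinations(BASE_FEATURES, k)]


target = (0.0, 0.0, 0.0, 0.0)
others = [(1.0, 0.5, 0.0, 1.2), (-1.0, 2.5, math.pi, 0.8), (20.0, 0.0, 0.0, 1.0)]
f_t = activity_context(target, others)
print(len(f_t), len(feature_combinations()))
```

Accumulating weights in shared spatial bins keeps every basic histogram the same length, so any combination is a plain concatenation; a finer scheme could instead bin direction and speed along their own axes.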
We develop a two-stage classification system for detecting
the social activities: (i) frame-level (instantaneous) classification
and (ii) a spatial smoothing step. We use gradient boosted
trees (GBT) [6] to perform instantaneous classification, as
they provide a good trade-off between accuracy and speed
in practice. The outputs of the GBT are the probabilities
$\Pr(a_t^k \mid f_t, \theta)$ of each activity $a_t^k$ given the
features $f_t$ and model parameters $\theta$, which are determined
by grid search.
Fig. 2. Classification accuracy over the five activity classes for every
feature combination against the three learning modes (GBC, Offline-CRF
and Online-CRF), averaged over a test trajectory.

Additionally, since we are interested in group-level activities,
which are commonly believed to be spatially correlated, it is
imperative to employ a method that is able to reason about such
spatial correlations. We employ a conditional random field (CRF)
model [7] to perform spatial smoothing of the frame-level activity
predictions. Our CRF model uses online belief updates that are
localized within its network, i.e. belief update messages are not
sent over the whole network but only to the most influenced nodes.
This enables efficient and fast addition and updating of nodes in
the network, resulting in real-time performance. The result is an
online conditional random field (O-CRF). The node potentials used
in the O-CRF are simply the activity probabilities from the GBT,
while the edge potentials are

$$\psi(\hat{a}_t^i, \hat{a}_t^j) = \begin{cases} 0 & \text{if } \hat{a}_t^i \neq \hat{a}_t^j \\ \exp\left(-\dfrac{d(\phi_t^i, \phi_t^j)}{\sigma_d}\right) & \text{otherwise} \end{cases}$$

using the Euclidean distance function $d(\cdot,\cdot)$, with $\sigma_d$
acting as a range parameter. We compute the accuracy by counting
correctly classified instances.
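
To make the smoothing step more concrete, the sketch below combines per-person class probabilities (standing in for the GBT outputs) with a pairwise term that is zero for differing labels and decays with Euclidean distance otherwise. Several things here are assumptions rather than our method: the exponent is taken to be `-d / sigma_d`, the pairwise term is added to the log node probability as a score, inference uses iterated conditional modes (ICM) instead of the localized belief updates of the O-CRF, and a fixed `radius` stands in for "the most influenced nodes". All function names and parameter values are hypothetical.

```python
import math


def edge_potential(label_i, label_j, pos_i, pos_j, sigma_d=2.0):
    """Pairwise term: 0 for differing labels, else exp(-d / sigma_d) (assumed form)."""
    if label_i != label_j:
        return 0.0
    return math.exp(-math.dist(pos_i, pos_j) / sigma_d)


def smooth_labels(probs, positions, sigma_d=2.0, radius=3.0, sweeps=5):
    """ICM-style spatial smoothing of frame-level predictions.

    probs[i] maps each activity label to its probability for person i;
    positions[i] is that person's (x, y). Each person is only influenced
    by neighbours within `radius`, so updates stay local.
    """
    n = len(probs)
    labels = [max(p, key=p.get) for p in probs]  # start from the classifier argmax
    neighbours = [
        [j for j in range(n)
         if j != i and math.dist(positions[i], positions[j]) <= radius]
        for i in range(n)
    ]
    for _ in range(sweeps):
        changed = False
        for i in range(n):
            def score(k, i=i):
                pair = sum(
                    edge_potential(k, labels[j], positions[i], positions[j], sigma_d)
                    for j in neighbours[i]
                )
                return math.log(probs[i][k] + 1e-12) + pair

            best = max(probs[i], key=score)
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:  # converged: no label changed in this sweep
            break
    return labels


def accuracy(predicted, truth):
    """Fraction of correctly classified instances."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Three people standing close together and one far away (made-up numbers).
probs = [
    {"group": 0.90, "standing": 0.08, "moving": 0.02},
    {"group": 0.90, "standing": 0.08, "moving": 0.02},
    {"group": 0.40, "standing": 0.55, "moving": 0.05},
    {"group": 0.30, "standing": 0.10, "moving": 0.60},
]
positions = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (10.0, 10.0)]
print(smooth_labels(probs, positions))
```

In this toy example the third person's weak "standing" prediction is pulled towards the label of the two nearby people, while the distant fourth person has no neighbours within `radius` and keeps its frame-level label.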
We conduct a series of experiments to analyze the performance
of the different features (density, relative direction, speed),
as well as combinations of these. We also compare three
classification modes: (a) without smoothing, i.e. using only the
gradient boosted trees classifier (GBC), (b) a GBC with the O-CRF
and (c) a GBC with a CRF without incremental belief updates,
i.e. full inference is done at every step. Because of the limited
availability of real-world datasets of social activities from a
robot-centric viewpoint, we make use of an open-source
pedestrian simulator described in [8].
From the experiments, we observe, as shown in Fig. 2, that
adding more information to the feature histogram, i.e. speed
and direction, improves performance compared to using only
density information. This is consistent with the motivation for
using coherent motion indicators [9], and reaffirms our
confidence in using such information to discern different social
activities. We also observe that using either of the CRF models
(online or offline) improves the accuracy, confirming our
intuition about the benefit of spatial smoothing. This is
consistent with the fact that most social activities involve
multiple people arranged in certain patterns. Furthermore, we
observe that the online version of the CRF with incremental
belief updates is considerably faster while achieving nearly the
same accuracy as the offline case. Altogether, we achieve an
accuracy of up to 76% with no smoothing, which rises to above
85% with the best feature combination, density–direction, when
smoothing is enabled.

Fig. 3. Inference times for the CRF in the online vs. offline cases, showing
that the online case is generally faster than the offline case. The times are
computed for a CRF network with an average of 50 nodes at every time step.

We have addressed the problem of detecting social activities
from high-level perceptual cues obtained from a first-person
perspective and compared several features and classification
models. The best feature combination was found to be the
combination of density and direction. We also tested different
classification models and found that using our online CRF to
smooth the classifications gives the best compromise between
the speed, accuracy and scalability needed in robotic setups,
owing to its efficient online belief updates. In the future, we
aim to conduct experiments on real-world datasets and deploy
the module on a service robot.
[1] T. Lan, Y. Wang, W. Yang, and G. Mori, “Beyond actions: Discriminative
models for contextual group activities,” in Advances in Neural
Information Processing Systems (NIPS), 2010.
[2] R. Li, P. Porfilio, and T. Zickler, “Finding group interactions in social
clutter,” in Proc. of the IEEE Conf. on Computer Vision and Pattern
Recognition (CVPR), 2013.
[3] B. Ni, S. Yan, and A. Kassim, “Recognizing human group activities
with localized causalities,” in Proc. of the IEEE Conf. on Computer
Vision and Pattern Recognition (CVPR), 2009.
[4] W. Choi, K. Shahid, and S. Savarese, “Learning context for collective
activity recognition,” in Proc. of the IEEE Conf. on Computer Vision
and Pattern Recognition (CVPR), 2011.
[5] S. Belongie, J. Malik, and J. Puzicha, “Shape context: A new descriptor
for shape matching and object recognition,” in Advances in Neural
Information Processing Systems (NIPS), 2000.
[6] J. H. Friedman, “Greedy function approximation: A gradient boosting
machine,” Annals of Statistics, pp. 1189–1232, 2001.
[7] J. Lafferty, A. McCallum, and F. Pereira, “Conditional random fields:
Probabilistic models for segmenting and labeling sequence data,” in
Int. Conf. on Machine Learning (ICML), 2001.
[8] B. Okal and K. O. Arras, “Towards group-level social activity
recognition for mobile robots,” in IROS Workshop on Assistance and Service
Robotics in a Human Environment, 2014.
[9] Z. Yücel, F. Zanlungo, T. Ikeda, T. Miyashita, and N. Hagita, “Modeling
indicators of coherent motion,” in Int. Conf. on Intelligent Robots and
Systems (IROS), 2012.