Informative Scene Decomposition for Crowd Analysis, Comparison and
Simulation Guidance
FEIXIANG HE, University of Leeds, United Kingdom
YUANHANG XIANG, Xi’an Jiaotong University, China
XI ZHAO, Xi'an Jiaotong University, China
HE WANG, University of Leeds, United Kingdom
Fig. 1. Overview of our framework.
Crowd simulation is a central topic in several fields including graphics. To achieve high-fidelity simulations, data has been increasingly relied upon for analysis and simulation guidance. However, the information in real-world data is often noisy, mixed and unstructured, making effective analysis difficult; it has therefore not been fully utilized. With the fast-growing volume of crowd data, such a bottleneck needs to be addressed. In this paper, we propose a new framework which comprehensively tackles this problem. It centers on an unsupervised method for analysis. The method takes as input raw and noisy data with highly mixed multi-dimensional (space, time and dynamics) information, and automatically structures it by learning the correlations among these dimensions. The dimensions together with their correlations fully describe the scene semantics, which consists of recurring activity patterns in a scene, manifested as space flows with temporal and dynamics profiles. The effectiveness and robustness of the analysis have been tested on datasets with great variations in volume, duration, environment and crowd dynamics. Based on the analysis, new methods for data visualization, simulation evaluation and simulation guidance are also proposed. Together, our framework establishes a highly automated pipeline from raw data to crowd analysis, comparison and simulation guidance. Extensive experiments and evaluations have been conducted to show the flexibility, versatility and intuitiveness of our framework.

* Corresponding author.
† Corresponding author.

Authors' addresses: Feixiang He, University of Leeds, School of Computing, United Kingdom, fxhe1992@gmail.com; Yuanhang Xiang, Xi'an Jiaotong University, School of Computer Science and Technology, China, xiangyuanhang@icloud.com; Xi Zhao, Xi'an Jiaotong University, School of Computer Science and Technology, China, zhaoxi.jade@gmail.com; He Wang, University of Leeds, School of Computing, United Kingdom, realcrane@gmail.com.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2020 Association for Computing Machinery.
0730-0301/2020/7-ART1 $15.00
https://doi.org/10.1145/3386569.3392407
CCS Concepts: • Computing methodologies → Animation; Topic modeling; Learning in probabilistic graphical models; Scene understanding; Activity recognition and understanding; Multi-agent planning; • Mathematics of computing → Probabilistic inference problems; Nonparametric statistics.
Additional Key Words and Phrases: Crowd Simulation, Simulation Evalua-
tion, Bayesian Inference
ACM Reference Format:
Feixiang He, Yuanhang Xiang, Xi Zhao, and He Wang. 2020. Informative
Scene Decomposition for Crowd Analysis, Comparison and Simulation
Guidance. ACM Trans. Graph. 39, 4, Article 1 (July 2020), 15 pages. https://doi.org/10.1145/3386569.3392407
1 INTRODUCTION
Crowd simulation has been intensively used in computer animation, as well as other fields such as architectural design and crowd management. The fidelity or realism of simulation has been a long-standing problem. The main complexity arises from its multifaceted nature. It could mean high-level global behaviors [Narain et al. 2009], mid-level flow information [Wang et al. 2016] or low-level individual motions [Guy et al. 2012]. It could also mean perceived realism [Ennis et al. 2011] or numerical accuracy [Wang et al. 2017]. In any case, analyzing real-world data is inevitable for evaluating and guiding simulations.
The main challenges in utilizing real-world data are data complexity, intrinsic motion randomness and the sheer volume. The data complexity makes structured analysis difficult. As the most prevalent form of crowd data, trajectories extracted from sensors contain rich but mixed and unstructured information of space, time and dynamics. Although high-level statistics such as density can be used for analysis, they are not well defined and cannot give structural insights [Wang et al. 2017]. Second, trajectories show intrinsic randomness of individual motions [Guy et al. 2012]. The randomness shows heterogeneity between different individuals and groups, and is influenced by internal factors such as state of mind and external factors such as collision avoidance. Hence a single representation is not likely to be able to capture all randomness for all people in a scene. This makes it difficult to guide simulation without systematically considering the randomness. Lastly, with more recording devices being installed and data being shared, the sheer volume of data in both space and time, with excessive noise, requires efficient and robust analysis.
Existing methods that use real-world data for purposes such as qualitative and quantitative comparisons [Wang et al. 2016], simulation guidance [Ren et al. 2018] or steering [López et al. 2019] mainly focus on one aspect of the data, e.g. space, time or dynamics, and tend to ignore the structural correlations between them. Also, during simulation and analysis, motion randomness is often ignored or uniformly modelled for all trajectories [Guy et al. 2012; Helbing et al. 1995]. Ignoring the randomness (e.g. only assuming the least-effort principle) makes simulated agents walk in straight lines whenever possible, which is rarely observed in real-world data; uniformly modelling the randomness fails to capture the heterogeneity of the data. Besides, most existing methods are not designed to deal with massive data with excessive noise. Many of them require full trajectories to be available [Wolinski et al. 2014], which cannot be guaranteed in the real world, and do not handle data at the scale of tens of thousands of people and several days long.
In this paper, we propose a new framework that addresses the three aforementioned challenges. This framework is centered at an analysis method which automatically decomposes a crowd scene of a large number of trajectories into a series of modes. Each mode comprehensively captures a unique pattern of spatial, temporal and dynamics information. Spatially, a mode represents a pedestrian flow which connects subspaces with specific functionalities, e.g. entrance, exit, information desk, etc.; temporally, it captures when this flow appears, crescendos, wanes and disappears; dynamically, it reveals the speed preferences on this flow. With space, time and dynamics information, each mode represents a unique recurring activity and all modes together describe the scene semantics. These modes serve as a highly flexible visualization tool for general and task-specific analysis. Next, they form a natural basis from which explicable evaluation metrics can be derived for quantitatively comparing simulated and real crowds, both holistically and per dimension (space, time and dynamics). Lastly, they can easily automate simulation guidance, especially in capturing the heterogeneous motion randomness in the data.
The analysis is done by a new unsupervised clustering method based on non-parametric Bayesian models, because manual labelling would be extremely laborious. Specifically, Hierarchical Dirichlet Processes (HDPs) are used to disentangle the spatial, temporal and dynamics information. Our model consists of three intertwined HDPs and is thus named Triplet HDPs (THDP). The outcome is a (potentially infinite) number of modes with weights. Spatially, each mode is a crowd flow represented by trajectories sharing spatial similarities. Temporally, it is a distribution of when the flow appears, crescendos, peaks, wanes and disappears. Dynamically, it shows the speed distribution of the flow. The whole data is then represented by a weighted combination of all modes. Besides, the power of THDP comes with an increased model complexity, which brings challenges for inference. We therefore propose a new method based on Markov Chain Monte Carlo (MCMC). The method is a major generalization of the Chinese Restaurant Franchise (CRF) method, which was originally developed for HDP. We refer to the new inference method as the Chinese Restaurant Franchise League (CRFL). THDP and CRFL are general and effective on datasets with great spatial, temporal and dynamics variations. They provide a versatile base for new methods for visualization, simulation evaluation and simulation guidance.
Formally, we propose the first, to our best knowledge, multi-purpose framework for crowd analysis, visualization, simulation evaluation and simulation guidance, which includes:
(1) a new activity analysis method by unsupervised clustering.
(2) a new visualization tool for highly complex crowd data.
(3) a set of new metrics for comparing simulated and real crowds.
(4) a new approach for automated simulation guidance.
To this end, we have technical contributions which include:
(1) the first, to our best knowledge, non-parametric method that holistically considers space, time and dynamics for crowd analysis, simulation evaluation and simulation guidance.
(2) a new Markov Chain Monte Carlo method which achieves effective inference on intertwined HDPs.
2 RELATED WORK
2.1 Crowd Simulation
Empirical modelling and data-driven methods have been the two mainstreams in simulation. Empirical modelling dominates early research, where observations of crowd motions are abstracted into mathematical equations and deterministic systems. Crowds can be modelled as fields or flows [Narain et al. 2009], as particle systems [Helbing et al. 1995], or by velocity and geometric optimization [van den Berg et al. 2008]. Social behaviors including queuing and grouping [Lemercier et al. 2012; Ren et al. 2016] have also been pursued. On the other hand, data-driven simulation has also been explored, e.g. using first-person vision to guide steering behaviors [López et al. 2019] or trajectories to extract features that describe motions [Karamouzas et al. 2018; Lee et al. 2007]. Our research is highly complementary to simulation research in providing analysis, guidance and evaluation metrics. It aims to work with existing steering and global planning methods.
2.2 Crowd Analysis
Crowd analysis has been a trendy topic in computer vision [Wang and O'Sullivan 2016; Wang et al. 2008]. These works aim to learn structured latent patterns in data, similar to our analysis method. However, they only consider limited information (e.g. space only, or space/time) compared to our method, because our method explicitly models space, time, dynamics and their correlations. In contrast, another way of scene analysis is to focus on anomalies [Charalambous et al. 2014]. Their perspective is different from ours and therefore complementary to our approach. Trajectory analysis also plays an important role in modern sports analysis [Sha et al. 2018, 2017], but these works do not deal with a large number of trajectories as our method does. Recently, deep learning has been used for crowd analysis in trajectory prediction [Xu et al. 2018], people counting [Wang et al. 2019], scene understanding [Lu et al. 2019] and anomaly detection [Sabokrou et al. 2017]. However, they either do not model low-level behaviors or can only do short-horizon prediction (seconds). Our research is orthogonal to theirs in focusing on the analysis and its applications in simulations.
Besides computer vision, crowd analysis has also been investigated in physics. In [Ali and Shah 2007], Lagrangian Particle Dynamics is exploited for the segmentation of high-density crowd flows and the detection of flow instabilities, where the target is similar to our analysis; however, they only consider space when separating flows, while our research explicitly models more comprehensive information, including space, time and dynamics. Physics-inspired approaches have also been applied to abnormal trajectory detection for surveillance [Chaker et al. 2017; Mehran et al. 2009]. An approach based on the social force model [Mehran et al. 2009] is introduced to describe individual movement at the microscopic level by placing grid particles over the image. Local and global social networks are built by constructing a set of spatio-temporal cuboids in [Chaker et al. 2017] to detect anomalies. Compared with these methods, our anomaly detection is more informative and versatile in indicating which attributes contribute to the abnormality.
2.3 Simulation Evaluation
How to evaluate simulations is a long-standing problem. One major approach is to compare simulated and real crowds. There are qualitative and quantitative methods. Qualitative methods include visual comparison [Lemercier et al. 2012] and perceptual experiments [Ennis et al. 2011]. Quantitative methods fall into model-based methods [Golas et al. 2013] and data-driven methods [Guy et al. 2012; Lerner et al. 2009; Wang et al. 2016, 2017]. Individual behaviors can be directly compared between simulation and reference data [Lerner et al. 2009]. However, this requires full trajectories to be available, which is difficult in practice. Our comparison is based on latent behavioral patterns instead of individual behaviors and does not require full trajectories. The methods in [Wang et al. 2016, 2017] are similar to ours but only consider space. In contrast, our approach is more comprehensive by considering space, time and dynamics. Different combinations of these factors result in different metrics focusing on comparing different aspects of the data. The comparisons can be spatially or temporally focused. They can also compare general situations or specific modes. Overall, our method provides greater flexibility and more intuitive results.
2.4 Simulation Guidance
Quantitative simulation guidance has been investigated before, through user control or real-world data. In the former, trajectory-based user control signals can be converted into guiding trajectories for simulation [Shen et al. 2018]. Predefined crowd motion 'patches' can be used to compose heterogeneous crowd motions [Jordao et al. 2014]. The purpose of this kind of guidance is to give the user full control to 'sculpt' crowd motions. The latter is to guide simulations using real-world data to mimic real crowd motions. Given data and a parameterized simulation model, optimizations are used to fit the model to the data [Wolinski et al. 2014]. Alternatively, features can be extracted and compared for different simulations, so that predictions can be made about different steering methods on a simulation task [Karamouzas et al. 2018]. Our approach also heavily relies on data and is thus similar to the latter. But instead of anchoring on the modelling of individual motions, it focuses on the analysis of scene semantics/activities. It also considers intrinsic motion randomness in a structured and principled way.
3 METHODOLOGY OVERVIEW
The overview of our framework is in Fig. 1. Without loss of generality, we assume that the input is raw trajectories/tracklets which can be extracted from videos by existing trackers, and from which we can estimate the temporal and velocity information. Naively modelling the trajectories/tracklets, e.g. by simple descriptive statistics such as average speed, will average out useful information and cannot capture the data heterogeneity. To capture the heterogeneity in the presence of noise and randomness, we seek an underlying invariant as the scene descriptor. Based on empirical observations, steady space flows, characterized by groups of geometrically similar trajectories, can be observed in many crowd scenes. Each flow is a recurring activity connecting subspaces with designated functionalities, e.g. a flow from the front entrance to the ticket office then to a platform in a train station. Further, this flow reveals certain semantic information, i.e. people buying tickets before going to the platforms. Overall, all flows in a scene form a good basis to describe the crowd activities and this basis is an underlying invariant. How to compute this basis is therefore vital in analysis.
However, computing such a basis is challenging. Naive statistics of trajectories are not descriptive enough because the basis consists of many flows, and is therefore highly heterogeneous and multi-modal. Further, the number of flows is not known a priori. Since the flows are formed by groups of geometrically similar trajectories/tracklets, a natural solution is to cluster them [Bian et al. 2018]. In this specific research context, unsupervised clustering is needed because the sheer data volume prohibits human labelling. In unsupervised clustering, popular methods such as K-means and Gaussian Mixture Models [Bishop 2007] require a pre-defined cluster number, which is hard to know in advance. Hierarchical Agglomerative Clustering [Kaufman and Rousseeuw 2005] does not require a predefined cluster number, but the user must decide when to stop merging, which is similarly problematic. Spectral clustering methods [Shi and Malik 2000] solve this problem, but require the computation of a similarity matrix whose space complexity is O(n²) in the number of trajectories. Too much memory is needed for large datasets and performance degrades quickly with increasing matrix size. Due to the aforementioned limitations, non-parametric Bayesian approaches were proposed [Wang et al. 2016, 2017]. However, a new approach is still needed because the previous approaches only consider space, and therefore cannot be reused or adapted for our purposes.
We propose a new non-parametric Bayesian method to cluster
the trajectories with the time and velocity information in an unsu-
pervised fashion, which requires neither manual labelling nor the
prior knowledge of cluster number. The outcome of clustering is a
series of modes, each being a unique distribution over space, time
and speed. Then we propose new methods for data visualization,
simulation evaluation and automated simulation guidance.
We rst introduce the background of one family of non-parametric
Bayesian models, Dirichlet Processes (DPs), and Hierarchical Dirich-
let Processes (HDP) (Sec. 4.1). We then introduce our new model
Triplet HDPs (Sec. 4.2) and new inference method Chinese Restau-
rant Franchise League (Sec. 5). Finally new methods are proposed
for visualization (Sec. 6.1), comparison (Sec. 6.2) and simulation
guidance (Sec. 6.3).
4 OUR METHOD
4.1 Background
Dirichlet Process. To understand DP, imagine there is a multi-modal 1D dataset with five high-density areas (modes). Then a classic five-component Gaussian Mixture Model (GMM) can fit the data via Expectation-Maximization [Bishop 2007]. Now further generalize the problem by assuming that there are an unknown number of high-density areas. In this case, an ideal solution would be to impose a prior distribution which can represent an infinite number of Gaussians, so that the number of Gaussians needed, their means and covariances can be automatically learnt. DP is such a prior.
A DP(γ, H) is a probabilistic measure on measures [Ferguson 1973], with a scaling parameter γ > 0 and a base probability measure H. A draw from the DP, G ∼ DP(γ, H), is G = Σ_{k=1}^{∞} β_k δ_{ϕ_k}, where β_k ∈ β is random and dependent on γ, ϕ_k ∈ ϕ is a variable distributed according to H (ϕ_k ∼ H), and δ_{ϕ_k} is called an atom at ϕ_k. Specifically for the example problem above, we can define H to be a Normal-Inverse-Gamma (NIG) distribution so that any draw ϕ_k from H is a Gaussian; then G becomes an Infinite Gaussian Mixture Model (IGMM) [Rasmussen 1999]. In practice, k is finite and computed during inference.
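To make the construction above concrete, the following minimal sketch (our own illustration, not code from the paper) draws a truncated approximation of G ∼ DP(γ, H) using the standard stick-breaking construction, with a Normal-Inverse-Gamma base measure so that every atom is a Gaussian, i.e. an (approximately) infinite Gaussian mixture. The truncation level and all hyper-parameter values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_dp_gaussian_mixture(gamma=1.0, k_max=50):
    """Truncated stick-breaking draw of G ~ DP(gamma, H), with H a
    Normal-Inverse-Gamma, so every atom is a 1D Gaussian (an IGMM prior)."""
    # Stick-breaking: v_k ~ Beta(1, gamma), beta_k = v_k * prod_{l<k} (1 - v_l)
    v = rng.beta(1.0, gamma, size=k_max)
    beta = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))

    # Atoms phi_k ~ H: variance from an inverse gamma, mean from a normal
    mu0, kappa0, a0, b0 = 0.0, 1.0, 2.0, 1.0          # illustrative NIG hyper-parameters
    var = 1.0 / rng.gamma(a0, 1.0 / b0, size=k_max)    # sigma_k^2 ~ InvGamma(a0, b0)
    mean = rng.normal(mu0, np.sqrt(var / kappa0))      # mu_k | sigma_k^2 ~ N(mu0, sigma_k^2 / kappa0)
    return beta, mean, var

beta, mean, var = draw_dp_gaussian_mixture()
# Generating data from the resulting (truncated) infinite Gaussian mixture:
k = rng.choice(len(beta), p=beta / beta.sum(), size=5)  # pick components by weight
x = rng.normal(mean[k], np.sqrt(var[k]))                # then sample from their Gaussians
```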
Hierarchical DPs. Now imagine that the multi-modal dataset in the example problem is observed in separate data groups. Although all the modes can be observed from the whole dataset, only a subset of the modes can be observed in any particular data group. To model this phenomenon, a parent DP is used to capture all the modes, with a child DP modelling the modes in each group:
G_j ∼ DP(α_j, G)  or  G_j = Σ_{i=1}^{∞} β_{ji} δ_{ψ_{ji}},  where  G = Σ_{k=1}^{∞} β_k δ_{ϕ_k}    (1)
where G_j is the modes in the j-th data group, α_j is the scaling factor and G is its base distribution. β_{ji} is the weight and δ_{ψ_{ji}} is the atom. Now we have the Hierarchical DPs, or HDP [Teh et al. 2006] (Fig. 2 Left). At the top level, the modes are captured by G ∼ DP(γ, H). In each data group j, the modes are captured by G_j, which is dependent on α_j and G. This way, the modes G_j in every data group come from the common set of modes in G, i.e. ψ_{ji} ∈ {ϕ_1, ϕ_2, ..., ϕ_k}. In Fig. 2 Left, there is also a variable θ_{ji}, called a factor, which indicates with which mode (ψ_{ji}, or equally ϕ_k) the data sample x_{ji} is associated. Finally, if H is again a NIG prior, then the HDP becomes a Hierarchical Infinite Gaussian Mixture Model (HIGMM).
Fig. 2. Left: HDP. Right: Triplet HDP.
4.2 Triplet-HDPs (THDP)
We now introduce THDP (Fig. 2 Right). There are three HDPs in THDP, to model space, time and speed. We name them Time-HDP (Green), Space-HDP (Yellow) and Speed-HDP (Blue). Space-HDP is to compute space modes. Time-HDP and Speed-HDP are to compute the time and speed modes associated with each space mode, which requires the three HDPs to be linked. The modelling choice of the links will be explained later. The only observed variable in THDP is w, an observation of a person in a frame. It includes a location-orientation (x_ji), a timestamp (y_kd) and a speed (z_kc). θ^s_ji, θ^t_kd and θ^e_kc are their factor variables. Given a single observation denoted as w, we denote one trajectory as w̄, a group of trajectories as w̌ and the whole data set as w. Our final goal is to compute the space, time and speed modes, given w:
G^s = Σ_{k=1}^{∞} β_k δ_{ϕ^s_k},   G^t = Σ_{l=1}^{∞} ζ_l δ_{ϕ^t_l},   G^e = Σ_{q=1}^{∞} ρ_q δ_{ϕ^e_q}    (2)
In THDP, a space mode is defined to be a group of geometrically similar trajectories w̌. Since these trajectories form a flow, we also refer to it as a space flow. A space flow's timestamps (y_kd's) and speeds (z_kc's) are both 1D data and can be modelled in similar ways. We first introduce the Time-HDP. One space flow w̌ might appear, crescendo, peak, wane and disappear several times. If a Gaussian distribution is used to represent one time peak on the timeline, multiple Gaussians are needed. Naturally, an IGMM is used to model the y_kd ∈ w̌. A possible alternative would be to use Poisson Processes to model the entry time, but IGMM is chosen due to its ability to fit complex multi-modal distributions; it can also model a flow for the entire duration. Next, since there are many space flows and the y_kd's of each space flow form a timestamp data group, we assume that there is a common set of time peaks shared by all space flows, of which each space flow shares only a subset. This way, we use a DP to represent all the time peaks and a child DP below the first DP to represent the peaks in each space flow. This is a HIGMM (for the Time-HDP) where H^t is a NIG. Similarly for the speed, z_kc ∈ w̌ can also have multiple peaks on the speed axis, so we use an IGMM for this. Further, there are many space flows. We again assume that there is a common set of speed peaks, of which each space flow only has a subset, and use another HIGMM for the Speed-HDP.
Fig. 3. From left to right: 1. A space flow. 2. Discretization and flow cell occupancy; darker means more occupants. 3. Codebook with normalized occupancy as probabilities indicated by color intensities. 4. Five colored orientation subdomains (pink indicates static).
After the Time-HDP and Speed-HDP, we introduce the Space-HDP. The Space-HDP is different because, unlike time and speed, space data (the x_ji's) is 4D (2D location + 2D orientation), which means its modes are also multi-dimensional. In contrast to time and speed, a 4D Gaussian cannot represent a group of similar trajectories well, so we need to use a different distribution. Similar to [Wang et al. 2017], we discretize the image domain (Fig. 3: 1) into an m × n grid (Fig. 3: 2). The discretization serves three purposes: 1. the cell occupancy serves as a good feature for a flow, since a space flow occupies a fixed group of cells; 2. it removes noise caused by frequent turns and tracking errors; 3. it eliminates the dependence on full trajectories. As long as instantaneous positions and velocities can be estimated, THDP can cluster observations. This is crucial in dealing with real-world data where full trajectories cannot be guaranteed. Next, since there is no orientation information yet and the representation cannot distinguish between flows from A-to-B and flows from B-to-A, we discretize the instantaneous orientation into 5 orientation subdomains (Fig. 3: 4). This makes the grid m × n × 5 (Fig. 3: 3), which now becomes a codebook, and every 4D x_ji can be converted into a cell occupancy. Note that although the grid resolution is problem-specific, it does not affect the validity of our method.
Next, since the cell occupancy on the grid (after normalization) can be seen as a Multinomial distribution, we use Multinomials to represent space flows. This way, a space flow has high probabilities in some cells and low probabilities in others (Fig. 3: 3). Further, we assume the data is observed in groups and any group could contain multiple flows. We use a DP to model all the space flows of the whole dataset, with child DPs representing the flows in individual data groups, e.g. video clips. This is a HDP (Space-HDP) with H^s being a Dirichlet distribution.
After the three HDPs are introduced separately, we need to link them, which is the key of THDP. For a space flow w̌_1, all x_ji ∈ w̌_1 are associated with the same space mode, denoted by ϕ^s_1, and all y_kd ∈ w̌_1 are associated with the time modes {ϕ^t_1}, which form a temporal profile of ϕ^s_1. This indicates that y_kd's time mode association is dependent on x_ji's space mode association. In other words, if x^1_ji ∈ w̌_1 (ϕ^s_1) and x^2_ji ∈ w̌_2 (ϕ^s_2), where x^1_ji = x^2_ji but w̌_1 ≠ w̌_2 (two flows can partially overlap), then their corresponding y^1_kd ∈ w̌_1 and y^2_kd ∈ w̌_2 should be associated with {ϕ^t_1} and {ϕ^t_2}, where {ϕ^t_1} ≠ {ϕ^t_2} when w̌_1 and w̌_2 have different temporal profiles. We therefore condition θ^t_kd on θ^s_ji (the left red arrow in Fig. 2 Right) so that y_kd's time mode association is dependent on x_ji's space mode association. Similarly, a conditioning is also added to θ^e_kc on θ^s_ji. This way, w's associations to space, time and speed modes are linked. This is the biggest feature that distinguishes THDP from just a simple collection of HDPs, which would otherwise require doing analysis on space, time and dynamics separately, instead of holistically.
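The linking can be made concrete with a heavily simplified, truncated generative sketch of our own: each observation first picks a space mode, and its timestamp and speed are then drawn from the time and speed mixtures attached to that space mode, which mirrors the conditioning of θ^t and θ^e on θ^s described above. All sizes and distributions below are illustrative placeholders rather than the learned model.

```python
import numpy as np

rng = np.random.default_rng(0)
K, L, Q = 4, 3, 3                 # truncated numbers of space, time and speed modes

beta = rng.dirichlet(np.ones(K))              # space-mode weights
space_modes = rng.dirichlet(np.ones(50), K)   # each a Multinomial over 50 codebook cells
zeta = rng.dirichlet(np.ones(L), K)           # per-space-mode weights over time modes
rho = rng.dirichlet(np.ones(Q), K)            # per-space-mode weights over speed modes
time_modes = [(rng.uniform(0, 24), 0.5) for _ in range(L)]    # (mean hour, std)
speed_modes = [(rng.uniform(0.5, 2.0), 0.2) for _ in range(Q)]  # (mean m/s, std)

def sample_observation():
    k = rng.choice(K, p=beta)                 # theta^s: which space flow
    cell = rng.choice(50, p=space_modes[k])   # x: codebook cell occupancy
    l = rng.choice(L, p=zeta[k])              # theta^t conditioned on theta^s
    q = rng.choice(Q, p=rho[k])               # theta^e conditioned on theta^s
    t = rng.normal(*time_modes[l])            # y: timestamp
    s = rng.normal(*speed_modes[q])           # z: speed
    return k, cell, t, s

obs = [sample_observation() for _ in range(10)]
```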
5 INFERENCE
Given data w, the goal is to compute the posterior distribution p(β, ϕ^s, ζ, ϕ^t, ρ, ϕ^e | w). Existing inference methods for DPs include MCMC [Teh et al. 2006], variational inference [Hoffman et al. 2013] and geometric optimization [Yurochkin and Nguyen 2016]. However, they are designed for simpler models (e.g. a single HDP). Further, both variational inference and geometric optimization suffer from local minima. We therefore propose a new MCMC method for THDP. The method is a major generalization of the Chinese Restaurant Franchise (CRF). Next, we first give the background of CRF, then introduce our method.
5.1 Chinese Restaurant Franchise (CRF)
A single DP has a Chinese Restaurant Process (CRP) representation. CRF is its extension onto HDPs. We refer the readers to [Teh et al. 2006] for details on CRP. Here we directly follow the CRF metaphor on HDP (Eq. 1, Fig. 2 Left) to compute the posterior distribution p(β, ϕ | x). In CRF, each observation x_ji is called a customer. Each data group is called a restaurant. Finally, since a customer is associated with a mode (indicated by θ_ji), the mode is called a dish and is to be learned, as if the customer ordered this dish. CRF dictates that, in every restaurant, there is a potentially infinite number of tables, each with only one dish and many customers sharing that dish. There can be multiple tables serving the same dish. All dishes are on a global menu shared by all restaurants. The global menu can also contain an infinite number of dishes. In summary, we have multiple restaurants with many tables where customers order dishes from a common menu.
CRF is a Gibbs sampling approach. The sampling process is conducted at both the customer and the table level alternately. At the customer level, each customer is treated, in turn, as a new customer, given all the other customers sitting at their tables. Then she needs to choose a table in her restaurant. There are two criteria influencing her decision: 1. how many customers are already at the table (table popularity) and 2. how much she likes the dish on that table (dish preference). If she decides not to sit at any existing table, she can create a new table and then order a dish. This dish can be from the menu, or she can create a new dish and add it to the menu. Next, at the table level, for each table, all the customers sitting at that table are treated as a new group of customers, and are asked to choose a dish together. Their collective dish preference and how frequently the dish is ordered in all restaurants (dish popularity) will influence their choice. They can choose a dish from the menu or create a new one and add it to the menu. We give the algorithm in Algorithm 1 and refer the readers to Appx. A for more details.

ALGORITHM 1: Chinese Restaurant Franchise
Result: β, ϕ (Eq. 1)
1   Input: x;
2   while Not converged do
3       for every restaurant j do
4           for every customer x_ji do
5               Sample a table t_ji (Eq. 11, Appx. A);
6               if a new table is chosen then
7                   Sample a dish or create a new dish (Eq. 12, Appx. A)
8               end
9           end
10          for every table and its customers x_jt do
11              Sample a new dish (Eq. 13, Appx. A)
12          end
13      end
14      Sample hyper-parameters [Teh et al. 2006]
15  end
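For readers who prefer code to the metaphor, here is a minimal sketch (our own simplification, not the paper's implementation) of the customer-level move in Algorithm 1: a customer picks an existing table in proportion to its popularity times her preference for its dish, or opens a new table with mass proportional to α. In the full CRF the new-table mass is additionally weighted by the prior predictive of the observation under the base measure; that detail, and the dish update for a new table, are omitted here.

```python
import numpy as np

def sample_table(x, tables, dish_of_table, dish_params, likelihood, alpha,
                 rng=np.random.default_rng(0)):
    """One simplified CRF customer-level move.

    tables[t]        : number of customers currently sitting at table t
    dish_of_table[t] : index of the dish served at table t
    likelihood(x, p) : how much the customer likes dish parameters p
    Returns the chosen table index, or -1 for "open a new table".
    """
    scores = [n * likelihood(x, dish_params[dish_of_table[t]])   # popularity * preference
              for t, n in enumerate(tables)]
    scores.append(alpha)   # new-table mass (the full CRF also weights this by the
                           # prior predictive of x under the base measure)
    scores = np.asarray(scores, dtype=float)
    choice = rng.choice(len(scores), p=scores / scores.sum())
    return -1 if choice == len(tables) else int(choice)

# Toy usage: Gaussian "dishes" and a scalar observation
gauss = lambda x, p: np.exp(-0.5 * ((x - p[0]) / p[1]) ** 2) / (p[1] * np.sqrt(2 * np.pi))
t = sample_table(x=4.7, tables=[3, 8], dish_of_table=[0, 1],
                 dish_params=[(0.0, 1.0), (5.0, 1.0)], likelihood=gauss, alpha=1.0)
```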
5.2 Chinese Restaurant Franchise League (CRFL)
We generalize CRF by proposing a new method called the Chinese Restaurant Franchise League. We first change the naming convention by adding the prefixes space-, time- and speed- to customers, restaurants and dishes to distinguish between the corresponding variables in the three HDPs. For instance, an observation w now contains a space-customer x_ji, a time-customer y_kd and a speed-customer z_kc. CRFL is a Gibbs sampling scheme, shown in Algorithm 2. The differences between CRF and CRFL are on two levels. At the top level, CRFL generalizes CRF by running CRF alternately on the three HDPs. This makes use of the conditional independence between the Time-HDP and the Speed-HDP given that the Space-HDP is fixed. At the bottom level, there are three major differences in the sampling: between Eq. 11 and Eq. 3, Eq. 12 and Eq. 4, and Eq. 13 and Eq. 5.

ALGORITHM 2: Chinese Restaurant Franchise League
Result: β, ϕ^s, ζ, ϕ^t, ρ, ϕ^e (Eq. 2)
1   Input: w;
2   while Not converged do
3       Fix all variables in Space-HDP;
4       Do one CRF iteration (lines 3-13, Algorithm 1) on Time-HDP;
5       Do one CRF iteration (lines 3-13, Algorithm 1) on Speed-HDP;
6       for every space-restaurant j in Space-HDP do
7           for every space-customer x_ji do
8               Sample a table t_ji (Eq. 3);
9               if a new table is chosen then
10                  Sample a dish or create a new dish (Eq. 4);
11              end
12          end
13          for every table and its space-customers x_jt do
14              Sample a new space-dish (Eq. 5);
15          end
16      end
17      Sample hyper-parameters (Appx. B.3);
18  end
The rst dierence is when we do customer-level sampling (line
8 in Algorithm 2), the left side of Eq. 11 in CRF becomes:
p(tji =t,xj i ,ykd ,zk c |xji
,tji
,k,ykd
,okd
,l,zkc
,pkc
,q)(3)
where
tji
is the new table for space-customer
xji
.
ykd
and
zkc
are
the time and speed customer.
xji
and
tji
are the other customers
(excluding
xji
) in the
j
th space-restaurant and their choices of tables.
k
is the space dishes. Correspondingly,
ykd
and
okd
are the other
time-customers (excluding
ykd
) in the
k
th time-restaurant and their
choices of tables.
l
is the time dishes. Similarly,
zkc
and
pkc
are the
other speed-customers (excluding
zkc
) in the
k
th speed-restaurant
and their choices of tables.
q
is the speed-dishes. The intuitive in-
terpretation of the dierences between Eq. 3 and Eq. 11 is: when a
space-customer
xji
chooses a table, the popularity and preference
are not the only criteria anymore. She has to also consider the prefer-
ences of her associated time-customer
ykd
and speed-customer
zkc
.
This is because when
xji
orders a dierent space-dish,
ykd
and
zkc
will be placed into a dierent time-restaurant and speed-restaurant,
due to that the organizations of time- and speed-restaurants are
dependent on the space-dishes (the dependence of
θt
kd
and
θe
kc
on
θs
ji
). Each space-dish corresponds to a time-restaurant and a
speed-restaurant (see Sec. 4.2). Since a space-customer’s choice of
space-dish can change during CRFL, the organization of time- and
speed-restaurants becomes dynamic! This is why CRF cannot be
directly applied to THDP.
The second dierence is when we need to sample a dish (line 10
in Algorithm 2), the left side of Eq. 12 in CRF becomes:
p(kjt new =k,xj i ,ykd ,zk c |kjtnew
,ykd
,okd
,
l,zkc
,pkc
,q) ∝
m·kp(xji | · · · )p(yk d | · · · )p(zk c | · · · )
γp(xji | · · · )p(yk d | · · · )p(zk c | · · · ) (4)
where
kjt ne w
is the new dish for customer
xji
.
· · ·
represents all
the conditional variables for simplicity.
p(ykd | · · · )
and
p(zkc | ··· )
are the major dierences. We refer the readers to Appx. B regarding
the computation of Eq. 3 and Eq. 4.
The last dierence is when we do the table-level sampling (line
14 in Algorithm 2), the left side of Eq. 13 in CRF changes to:
p(kjt =k,xjt,ykdjt ,zkcjt |kjt
,ykdjt
,okdjt
,
lko
,zkcjt
,pkcjt
,qkp) ∝
mjt
·kp(xjt | · · · )p(ykdjt | · · · )p(zkcjt | · · · )
γp(xjt | · · · )p(ykdjt | · · · )p(zkcjt | · · · ) (5)
where
xjt
is the space-customers at the
t
th table,
ykdjt
and
zkcjt
are
the associated time- and speed-customers.
kjt
,
ykdjt
,
okdjt
,
lko
,
zkcjt
,
pkcjt
,
qkp
are the rest and their table and dish choices in
three HDPs.
· · ·
represents all the conditional variables for sim-
plicity.
p(xjt | · · · )
is the Multinomial
f
as in Eq. 13. Unlike Eq. 4,
p(ykdjt | · · · )
and
p(zkcjt | · · · )
cannot be easily computed and needs
special treatment. We refer the readers to Appx. B for details.
Now we have fully derived CRFL. Given a data set w, we can compute the posterior distribution p(β, ϕ^s, ζ, ϕ^t, ρ, ϕ^e | w), where β, ζ and ρ are the weights of the space, time and speed dishes ϕ^s, ϕ^t and ϕ^e respectively. ϕ^s are Multinomials; ϕ^t and ϕ^e are Gaussians. As mentioned in Sec. 5.1, the number of ϕ^s is automatically learnt, so we do not need to know the space dish number in advance. Neither do we need it for ϕ^t and ϕ^e. This makes THDP non-parametric. Further, since one ϕ^s could be associated with a potentially infinite number of ϕ^t's and ϕ^e's and vice versa, the many-to-many associations are also automatically learnt.
5.3 Time Complexity of CRFL
For each sampling iteration in Algorithm 2, the time complexities of sampling on the Time-HDP, Speed-HDP and Space-HDP are O[W(N+L) + KNL], O[W(A+Q) + KAQ] and O[W(M+K) + 2W(K+1)η + JMK] respectively, where η = N + L + A + Q. W is the total observation number. K, L and Q are the dish numbers of space, time and speed. J is the number of space-restaurants. M, N and A are the average table numbers in space-, time- and speed-restaurants respectively. Note that K appears in all three time complexities because the number of space-dishes is also the number of time- and speed-restaurants.
The time complexity of CRFL is O[W(N+L) + KNL] + O[W(A+Q) + KAQ] + O[W(M+K) + 2W(K+1)η + JMK]. This time complexity is not high in practice. W can be large, depending on the dataset, in which case a sub-sampling could be used to reduce the observation number. In addition, K is normally smaller than 50 even for highly complex datasets. L and Q are even smaller. J is decided by the user and is in the range of 10-30. M, N and A are not large either, due to the high aggregation property of DPs, i.e. each table tends to be chosen by many customers, so the table number is low.
6 VISUALIZATION, METRICS AND SIMULATION
GUIDANCE BASED ON THDP
THDP provides a powerful and versatile base for new tools. In
this section, we present three tools for structured visualization,
quantitative comparison and simulation guidance.
6.1 Flexible and Structured Crowd Data Visualization
After inference, the highly rich but originally mixed and unstructured data is now structured. This is vital for visualization. It is immediately easy to visualize the time and speed modes as they are mixtures of univariate Gaussians. The space modes require further treatment because they are m × n × 5 Multinomials and hard to visualize. We therefore propose to use them as classifiers to classify trajectories. After classification, we select representative trajectories for a clear and intuitive visualization of flows. Given a trajectory w̄, we compute a softmax function:
p_k(w̄) = e^{p_k(w̄)} / Σ_{k=1}^{K} e^{p_k(w̄)},   k ∈ [1, K]    (6)
where p_k(w̄) = p(w̄ | β_k, ϕ^s_k, ζ_k, ϕ^t, ρ_k, ϕ^e). ϕ^s_k and β_k are the k-th space mode and its weight. The others are the associated time and speed modes. The time and speed modes (ϕ^t and ϕ^e) are associated with the space flow ϕ^s_k, with weights ζ_k and ρ_k. K is the total number of space flows. This way, we classify every trajectory into a space flow. Then we can visualize representative trajectories with high probabilities, or show anomaly trajectories with low probabilities.
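A minimal sketch of this classification step under our own assumptions about how the per-flow likelihoods are supplied: given one trajectory's likelihood under each flow, the softmax of Eq. 6 yields flow-membership probabilities, from which representative (high-probability) and anomalous (low-probability) trajectories can be selected. The sketch works with log-likelihoods for numerical stability, which differs slightly from the literal form of Eq. 6.

```python
import numpy as np

def classify_trajectory(log_lik_per_flow):
    """Softmax over per-flow (log-)likelihoods of one trajectory (cf. Eq. 6).

    log_lik_per_flow[k] is assumed to be log p_k(w_bar), i.e. the trajectory's
    log-likelihood under the k-th space mode and its time/speed profile.
    Returns the flow assignment and the membership probabilities.
    """
    s = np.asarray(log_lik_per_flow, dtype=float)
    s = s - s.max()                        # subtract max for numerical stability
    p = np.exp(s) / np.exp(s).sum()
    return int(p.argmax()), p

flow, probs = classify_trajectory([-120.3, -97.8, -101.2])
# A trajectory whose best likelihood is still very low can be flagged as an anomaly.
```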
In addition, since THDP captures all of space, time and dynamics, there is a variety of visualizations. A period of time can be represented by a weighted combination of time modes {ϕ^t}. Assuming that the user wants to see which space flows are prominent during this period, we can visualize trajectories based on Σ_{ρ,ϕ^e} p(β, ϕ^s | {ϕ^t}), which gives the space flows with weights. This is very useful: if for instance {ϕ^t} is the rush hours, Σ_{ρ,ϕ^e} p(β, ϕ^s | {ϕ^t}) shows us which flows are prominent and their relative importance during the rush hours. Similarly, if we visualize data based on Σ_{ζ,ϕ^t} p(ρ, ϕ^e | ϕ^s), it will tell us whether people walk fast or slowly on the space flow ϕ^s. A more complex visualization is p(ζ, ϕ^t, ρ, ϕ^e | ϕ^s), where the time-speed distribution is given for a space flow ϕ^s. This gives the speed change against time of this space flow, which could reveal congestion at times.
Through marginalizing and conditioning on different variables (as above), there are many possible ways of visualizing crowd data, and each of them reveals a certain aspect of the data. We do not enumerate all the possibilities for simplicity, but it is clear that THDP can provide highly flexible and insightful visualizations.
6.2 New antitative Evaluation Metrics
Being able to quantitatively compare simulated and real crowds is vital in evaluating the quality of crowd simulation. Trajectory-based [Guy et al. 2012] and flow-based [Wang et al. 2016] methods have been proposed. The first flow-based metrics, proposed in [Wang et al. 2016], are similar to our approach. In their work, the two metrics proposed were: average likelihood (AL) and distribution-pair distance (DPD) based on Kullback-Leibler (KL) divergence. The underlying idea is that a good simulation does not have to strictly reproduce the data but should have statistical similarities with the data. However, they only considered space. We show that THDP is a major generalization of their work and provides much more flexibility with a set of new AL and DPD metrics.
6.2.1 AL Metrics. Given a simulation data set ŵ = (x̂_ji, ŷ_kd, ẑ_kc) and p(β, ϕ^s, ζ, ϕ^t, ρ, ϕ^e | w) inferred from real-world data w, we can compute the AL metric based on space only, essentially computing the average space likelihood while marginalizing time and speed:
(1/|ŵ|) Σ_{j,i} Σ_{k=1}^{K} β_k ∫_z ∫_y p(x̂_ji | ϕ^s_k, ŷ_kd, ẑ_kc) p(ŷ_kd) p(ẑ_kc) dy dz    (7)
where |ŵ| is the number of observations in ŵ. The dependence on β, ϕ^s, ζ, ϕ^t, ρ, ϕ^e is omitted for simplicity. If we completely discard time and speed, Eq. 7 changes to the AL metric in [Wang et al. 2017], (1/|ŵ|) Σ_{j,i} Σ_k β_k p(x̂_ji | ϕ^s_k). However, that metric is just a special case of THDP. We give a list of AL metrics in Table 1, which all have similar forms as Eq. 7.
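As an illustration, the sketch below (our own, with assumed inputs) computes the space-only special case just mentioned, (1/|ŵ|) Σ_{j,i} Σ_k β_k p(x̂_ji | ϕ^s_k), assuming each space mode is a Multinomial over codebook cells and each simulated observation has already been mapped to a cell index.

```python
import numpy as np

def space_only_al(cell_indices, beta, space_modes):
    """Average space likelihood of simulated observations under the learned modes.

    cell_indices : codebook cell indices of the simulated observations
    beta         : (K,) space-mode weights
    space_modes  : (K, C) Multinomial parameters of the K space modes
    """
    cell_indices = np.asarray(cell_indices)
    # p(x | model) = sum_k beta_k * phi^s_k[cell], averaged over all observations
    per_obs = (beta[:, None] * space_modes[:, cell_indices]).sum(axis=0)
    return per_obs.mean()

# Toy example with 3 modes over 6 codebook cells
beta = np.array([0.5, 0.3, 0.2])
space_modes = np.random.default_rng(0).dirichlet(np.ones(6), size=3)
score = space_only_al([0, 2, 2, 5, 1], beta, space_modes)
```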
Metric                                      To compare
1. 1/|ŵ| Σ p(x̂_ji, ŷ_kd, ẑ_kc | •)         overall similarity
2. 1/|ŵ| Σ p(x̂_ji, ŷ_kd | •)               space & time, ignoring speed
3. 1/|ŵ| Σ p(x̂_ji, ẑ_kc | •)               space & speed, ignoring time
4. 1/|ŵ| Σ p(ŷ_kd, ẑ_kc | •)               time & speed, ignoring space
5. 1/|ŵ| Σ p(x̂_ji | •)                     space, ignoring time & speed
6. 1/|ŵ| Σ p(ŷ_kd | •)                     time, ignoring space & speed
7. 1/|ŵ| Σ p(ẑ_kc | •)                     speed, ignoring space & time
Table 1. AL metrics. • represents {β, ϕ^s, ζ, ϕ^t, ρ, ϕ^e}.

6.2.2 DPD Metrics. AL metrics are based on average likelihoods, summarizing the differences between two data sets into one number. To give more flexibility, we also propose distribution-pair metrics. We first learn two posterior distributions p(β̂, ϕ̂^s, ζ̂, ϕ̂^t, ρ̂, ϕ̂^e | ŵ) and p(β, ϕ^s, ζ, ϕ^t, ρ, ϕ^e | w). Then we can compare individual pairs of ϕ^s and ϕ̂^s, ϕ^t and ϕ̂^t, ϕ^e and ϕ̂^e. Since all space, time and speed modes are probability distributions, we propose to use the Jensen-Shannon divergence, as opposed to the KL divergence [Wang et al. 2017], due to KL's asymmetry:
JSD(P || Q) = (1/2) D(P || M) + (1/2) D(Q || M)    (8)
where D is the KL divergence and M = (1/2)(P + Q). P and Q are probability distributions. Again, in the DPD comparison, THDP provides many options, similar to the AL metrics in Table 1. We only give several examples here. Given two space flows ϕ^s and ϕ̂^s, JSD(ϕ^s || ϕ̂^s) directly compares the two space flows. Further, P and Q can be conditional distributions, as in JSD(p(ϕ^t | ϕ^s) || p(ϕ̂^t | ϕ̂^s)), where ϕ^t and ϕ̂^t are the associated time modes of ϕ^s and ϕ̂^s respectively. This compares the two temporal profiles, which is very useful when ϕ^s and ϕ̂^s are two spatially similar flows but we want to compare their temporal similarity. Similarly, we can also compare their speed profiles, JSD(p(ϕ^e | ϕ^s) || p(ϕ̂^e | ϕ̂^s)), or their time-speed profiles, JSD(p(ϕ^t, ϕ^e | ϕ^s) || p(ϕ̂^t, ϕ̂^e | ϕ̂^s)). In summary, similar to the AL metrics, different conditioning and marginalization choices result in different DPD metrics.
6.3 Simulation Guidance
We propose a new method to automate simulation guidance with real-world data, which works with existing simulators including steering and global planning methods. Assuming that we want to simulate crowds in a given environment based on data, there are still several key parameters which need to be estimated, including the starting/destination positions, the entry timing and the desired speed. After inference, we use a GMM to model both the starting and destination regions for every space flow. This way, we completely eliminate the need for manual labelling, which is difficult in spaces with no designated entrances/exits (e.g. a square). Also, we remove the one-to-one mapping requirement between the agents in simulation and the data. We can sample any number of agents based on the space flow weights (β) and still keep agent proportions on different flows similar to the data. In addition, since each flow comes with a temporal and speed profile, we sample the entry timing and desired speed for each agent, to mimic the randomness in these parameters. It is difficult to manually set the timing when the duration is long, and sampling the speed is necessary to capture the speed variety within a flow caused by latent factors such as different physical conditions.
Next, even with the right setting of all the aforementioned parameters, existing simulators tend to simulate straight lines whenever possible, while the real data shows otherwise. This is because no intrinsic motion randomness is introduced. Intrinsic motion randomness can be observed in that people rarely walk in straight lines and generate slightly different trajectories even when asked to walk several times between the same starting position and destination [Wang et al. 2017]. This is related to the state of the person as well as external factors such as collision avoidance. Individual motion randomness can be modelled by assuming the randomness is Gaussian-distributed [Guy et al. 2012]. Here, we do not assume that all people have the same distribution. Instead, we propose to do a structured modelling. We observe that people on different space flows show different dynamics but share similar dynamics within the same flow. This is because people on the same flow share the same starting/destination regions and walk through the same part of the environment. In other words, they started in similar positions, had similar goals and made similar navigation decisions. Although individual motion randomness still exists, their randomness is likely to be similarly distributed. However, this is not necessarily true across different flows. We therefore assume that each space flow can be seen as generated by a unique dynamic system which captures the within-group motion randomness and implicitly considers factors such as collision avoidance. Given a trajectory w̄ from a flow w̌, we assume that there is an underlying dynamic system:
x^w̄_t = A s_t + ω_t,   ω ∼ N(0, Σ)
s_t = B s_{t-1} + λ_t,   λ ∼ N(0, Λ)    (9)
where x^w̄_t is the observed location of a person at time t on trajectory w̄, and s_t is the latent state of the dynamic system at time t. ω_t and λ_t are the observational and dynamics randomness; both are white Gaussian noises. A and B are transition matrices. We assume that Σ is a known diagonal covariance matrix because it is intrinsic to the device (e.g. a camera) and can be trivially estimated. We also assume that A is an identity matrix so that there is no systematic bias and the observation is only subject to the state s_t and noise ω_t. The dynamic system then becomes x^w̄_t ∼ N(I s_t, Σ) and s_t ∼ N(B s_{t-1}, Λ), where we need to estimate s_t, B and Λ. Given the U trajectories in w̌, the total likelihood is:
p(w̌) = Π_{i=1}^{U} p(w̄_i),  where  p(w̄_i) = Π_{t=2}^{T_i-1} p(x^i_t | s_t) p(s_t | s_{t-1}),  s_1 = x^i_1, s_{T_i} = x^i_{T_i}    (10)
where T_i is the length of trajectory w̄_i. We maximize log p(w̌) via Expectation-Maximization [Bishop 2007]. Details can be found in Appx. C. After learning the dynamic system for a space flow, and given a starting and a destination location, s_1 and s_T, we can sample diversified trajectories while obeying the flow dynamics. During simulation guidance, one target trajectory is sampled for each agent and this trajectory reflects the motion randomness.
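Once B and Λ have been estimated for a flow, guiding trajectories can be drawn from the dynamic system. The sketch below (our own, deliberately simplified) rolls Eq. 9 forward from a start state with A = I; note that hitting an exact sampled destination s_T would additionally require conditioning on the endpoint (a bridge sampler), which is omitted here, as is the EM fitting itself.

```python
import numpy as np

def sample_flow_trajectory(s1, B, Lambda, Sigma, T, rng=np.random.default_rng(0)):
    """Roll the per-flow dynamic system of Eq. 9 forward from a start state.

    s_t = B s_{t-1} + lambda_t,  lambda_t ~ N(0, Lambda)
    x_t = s_t + omega_t,         omega_t  ~ N(0, Sigma)   (A = I as in the paper)

    NOTE: this is an unconditioned forward sample; it does not enforce a
    sampled destination s_T.
    """
    s = np.asarray(s1, dtype=float)
    xs = []
    for _ in range(T):
        s = B @ s + rng.multivariate_normal(np.zeros(2), Lambda)    # latent state update
        xs.append(s + rng.multivariate_normal(np.zeros(2), Sigma))  # noisy observed position
    return np.stack(xs)

# Toy usage: near-identity dynamics with small process and observation noise
traj = sample_flow_trajectory([0.0, 0.0], B=np.eye(2) * 1.001,
                              Lambda=0.01 * np.eye(2), Sigma=0.005 * np.eye(2), T=100)
```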
7 EXPERIMENTS
In this section, we rst introduce the datasets, then show our highly
informative and exible visualization tool. Next, we give quantitative
comparison results between simulated and real crowds by the newly
proposed metrics. Finally, we show that our automated simulation
ACM Trans. Graph., Vol. 39, No. 4, Article 1. Publication date: July 2020.
Informative Scene Decomposition for Crowd Analysis, Comparison and Simulation Guidance 1:9
Fig. 4. Forum (top), CarPark (Middle) and TrainStation (Boom) dataset. In each dataset, Top le: original data; P1-P9: the top 9 space modes; Top right: the
time modes of P1-P9; Boom right: the speed modes of P1-P9. Both time and speed profiles are scaled by their respective space model weights, with the y axis
indicating the likelihood.
guidance with high semantic delity. We only show representative
results in the paper and refer the readers to the supplementary video
and materials for details.
7.1 Datasets
We choose three publicly available datasets: Forum [Majecka 2009], CarPark [Wang et al. 2008] and TrainStation [Yi et al. 2015], to cover different data volumes, durations, environments and crowd dynamics. Forum is an indoor environment in a school building, recorded by a top-down camera, containing 664 trajectories and lasting 4.68 hours. Only people are tracked and they are mostly slow and casual. CarPark consists of videos of an outdoor car park with mixed pedestrians and cars, recorded by a far-distance camera, and contains 40,453 trajectories in total over five days. TrainStation is a big indoor environment with pedestrians and designated sub-spaces. It is from the New York Central Terminal and contains 120,000 frames in total, with 12,684 pedestrians, within approximately 45 minutes. The speed varies among pedestrians.
7.2 Visualization Results
We rst show a general, full-mode visualization in Fig. 4. Due to
the space limit, we only show the top 9 space modes and their cor-
responding time and speed proles. Overall, THDP is eective in
decomposing highly mixed and unstructured data into structured
results across dierent data sets. The top 9 space modes (with time
and speed) are the main activities. With the environment informa-
tion (e.g. where the doors/lifts/rooms are), the semantic meanings
of the activities can be inferred. In addition, the time and dynamics
are captured well. One peak of a space ow (indicated by color) in
the time proles indicates that this ow is likely to appear around
that time. Correspondingly, one peak of a space ow in the speed
prole indicates a major speed preference of the people on that ow.
Multiple space ows can peak near one point in both the time and
speed proles. The speed proles of Forum and TrainStation are
slightly dierent, with most of the former distributed in a smaller
region. This is understandable because people in TrainStation in
general walk faster. The speed prole of CarPark is quite dierent
in that it ranges more widely, up to 10m/s. This is because both
pedestrians and vehicles were recorded.
Besides, we show conditioned visualization. Suppose that the user is interested in a period (e.g. rush hours) or a speed range (e.g. to see where people generally walk fast/slowly); the associated flow weights can be visualized (Fig. 5). This allows users to see which space flows are prominent in the chosen period or speed range. Conversely, given a space flow of interest, we can visualize the time-speed distribution (Fig. 6), showing how the speed changes along time, which could help identify congestion on that flow at times.

Fig. 5. Left: TrainStation, Right: CarPark. The space flow prominence (indicated by bar heights) of P1-P9 in Fig. 4, respectively, given a time period (blue bars) or speed range (orange bars). The higher the bar, the more prominent the space flow is.

Fig. 6. Space flows from Forum, CarPark and TrainStation and their time-speed distributions. The y (up) axis is likelihood. The x and z axes are time and speed. The redder, the higher the likelihood.
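One possible way (our own interpretation, not the paper's exact computation) to obtain a flow-prominence bar as in Fig. 5 is to weight each space flow by the probability mass its temporal profile places inside the chosen time window; the Gaussian time modes and all numbers below are placeholders for the learned ζ and ϕ^t.

```python
import numpy as np
from scipy.stats import norm

def flow_prominence(beta, time_modes, window):
    """Relative prominence of each space flow within a time window.

    beta          : (K,) space-flow weights
    time_modes[k] : list of (weight, mean, std) of flow k's temporal profile
    window        : (t_start, t_end)
    """
    t0, t1 = window
    mass = np.array([sum(w * (norm.cdf(t1, m, s) - norm.cdf(t0, m, s))
                         for w, m, s in modes) for modes in time_modes])
    score = np.asarray(beta) * mass
    return score / score.sum()

# Toy example: two flows, rush-hour window 8:00-9:00
beta = [0.6, 0.4]
time_modes = [[(1.0, 8.5, 0.5)],                       # flow 1 peaks during rush hour
              [(0.5, 12.0, 1.0), (0.5, 18.0, 1.0)]]    # flow 2 peaks at noon and evening
print(flow_prominence(beta, time_modes, window=(8.0, 9.0)))
```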
Last but not least, we can identify anomalous trajectories and show unusual activities. The anomalies here refer to statistical anomalies. Although they are not necessarily suspicious behaviors or events, they can help the user to quickly reduce the number of cases that need to be investigated. Note that an anomaly is not only a spatial anomaly: it is possible for a spatially normal trajectory to be abnormal in time and/or speed. To distinguish between them, we first compute the probabilities of all trajectories and select anomalies. Then, for each anomalous trajectory, we compute its relative probabilities (its probability divided by the maximal trajectory probability) in space, time and speed, resulting in three probabilities in [0, 1]. We then use them (after normalization) as the barycentric coordinates of a point inside a colored triangle. This way, we can visualize what contributes to their abnormality (Fig. 7). Take T1 for example. It has a normal spatial pattern, and is therefore close to the 'space' vertex. It is far away from both the 'time' and 'speed' vertices, indicating that T1's time and speed patterns are very different from the others'. THDP can thus be used as a versatile and discriminative anomaly detector.
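The triangle placement of Fig. 7 can be reproduced in a few lines (our own sketch): the three normalized relative probabilities are used directly as barycentric coordinates with respect to the space, time and speed vertices; the vertex positions below are arbitrary.

```python
import numpy as np

def anomaly_point(p_space, p_time, p_speed,
                  v_space=(0.0, 1.0), v_time=(-0.87, -0.5), v_speed=(0.87, -0.5)):
    """Place an anomalous trajectory inside the space/time/speed triangle.

    p_* are the trajectory's relative probabilities in each dimension (in [0, 1]);
    after normalization they serve as barycentric coordinates, so the point drifts
    towards the vertex of the dimension in which the trajectory looks most normal.
    """
    w = np.array([p_space, p_time, p_speed], dtype=float)
    w = w / w.sum()
    verts = np.array([v_space, v_time, v_speed], dtype=float)
    return w @ verts      # 2D position inside the triangle

# T1-like case: spatially normal, abnormal in time and speed -> near the space vertex
print(anomaly_point(p_space=0.9, p_time=0.05, p_speed=0.1))
```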
Fig. 7. Representative anomalous trajectories. Every trajectory has a corresponding location in the triangle on the right, indicating which factors contribute more to its abnormality. For instance, T1 is close to the space vertex, meaning its spatial probability is relatively high and the main abnormality contribution comes from its time and speed. For T2, the contribution mainly comes from its speed.

Non-parametric Bayesian approaches have been used for crowd analysis [Wang et al. 2016, 2017]. However, existing methods can be seen as variants of the Space-HDP and cannot decompose information in time and dynamics. Consequently, they cannot show any results related to time and speed, as opposed to Fig. 4-7. A naive alternative would be to use the methods in [Wang et al. 2016, 2017] to first cluster the data regardless of time and dynamics, then do per-cluster time and dynamics analysis, equivalent to using the Space-HDP first, then the Time-HDP and Speed-HDP subsequently. However, this kind of sequential analysis fails due to one limitation: the spatial-only HDP misclassifies observations in the overlapped areas of flows [Wang and O'Sullivan 2016]. The subsequent time and dynamics analysis would then be based on wrong clustering. The simultaneity of considering all three types of information, accomplished by the links (red arrows in Fig. 2 Right) among the three HDPs in THDP, is therefore essential.
7.3 Compare Real and Simulated Crowds
To compare simulated and real crowds, we ask participants (Master's and PhD students whose expertise is in crowd analysis and simulation) to simulate crowds in Forum and TrainStation. We left CarPark out because its excessively long duration makes it extremely difficult for participants to observe. We built a simple UI for setting up simulation parameters including the starting/destination locations, the entry timing and the desired speed for every agent. For the simulator, our approach is agnostic to the simulation method. We chose ORCA in Menge [Curtis et al. 2016] for our experiments, but other simulation methods would work equally well. Initially, we provided the participants with only videos and asked them to do their best to replicate the crowd motions. They found it difficult because they had to watch the videos and try to remember a lot of information, which is also a real-world problem for simulation engineers. This suggests that different levels of detail of the information are needed to set up simulations. The information includes variables such as entry timings and start/end positions, which are readily available, or descriptive statistics such as average speed, which can be relatively easily computed. We systematically investigate their roles in reproducing the scene semantics. After several trials, we identified a set of key parameters including the starting/ending positions, entry timing and desired speed. Different simulation methods require different parameters, but these are the key parameters shared by all. We also identified four typical settings in which we gradually provide more and more information about these parameters. This design helps us to identify the qualitative and quantitative importance of the key parameters for the purpose of reproducing the scene semantics.
The rst setting, denoted as Random, is where only the start-
ing/destination regions are given. The participants have to estimate
ACM Trans. Graph., Vol. 39, No. 4, Article 1. Publication date: July 2020.
Informative Scene Decomposition for Crowd Analysis, Comparison and Simulation Guidance 1:11
Information / Setting Random SDR SDRT SDRTS
Starting/Dest. Areas ✓ ✓
Exact Starting/Dest. Positions ×✓ ✓
Trajectory Entry Timing × × ✓ ✓
Trajectory Average Speed × × ×
Table 2. Dierent simulation seings and the information provided.
Metric/Simulations Random SDR SDRT SDRTS Ours
Overall (×108) 7.11 20.67 37.08 40.55 57.9
Space-Only (×103) 2.7 5.3 5.3 5.5 5.1
Space-Time (×107) 1.23 2.96 5.56 5.77 6.02
Space-Speed (×103) 1.5 3.6 3.5 4.0 4.9
Overall (×107) 6.7 11.97 13.96 19.39 19.89
Space-Only (×103) 3.5 6.8 6.7 6.6 6.9
Space-Time (×107) 8.02 15.87 19.00 18.84 20.44
Space-Speed (×103) 2.9 5.0 4.9 6.9 6.7
Table 3. Comparison on Forum (Top) and TrainStation (Boom) based on
AL metrics.
Higher
is beer. Numbers should only compared within the
same row.)
the rest. Based on Random, we further give the exact starting/ending
positions, denoted by SDR. Next, we also give the entry timing for
each agent based on SDR, denoted by SDRT. Finally, we give the
average speed of each agent based on SDRT, denoted by SDRTS.
Random is the least-informed scenario where the users have to esti-
mate many parameters, while SDRTS is the most-informed situation.
A comparison between the four settings is shown in Table 2.
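To make the four settings concrete, the per-agent information available under each of them can be sketched as follows. The field names and the trajectory record layout are illustrative assumptions, not part of Menge or of our tool.

```python
def agent_spec(traj, setting):
    """Build a per-agent parameter dict for one recorded trajectory under a
    given setting (Random, SDR, SDRT or SDRTS). Fields that are absent have
    to be estimated or guessed by the participant."""
    spec = {"start_area": traj["start_area"], "dest_area": traj["dest_area"]}
    if setting in ("SDR", "SDRT", "SDRTS"):
        spec["start"] = traj["positions"][0]       # exact entry position
        spec["dest"] = traj["positions"][-1]       # exact exit position
    if setting in ("SDRT", "SDRTS"):
        spec["entry_time"] = traj["times"][0]      # entry timing
    if setting == "SDRTS":
        spec["desired_speed"] = traj["avg_speed"]  # average speed of the track
    return spec
```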
We use four AL metrics to compare simulations with data, as they
provide detailed and insightful comparisons: Overall (Table 1: 1),
Space-Only (Table 1: 5), Space-Time (Table 1: 2) and Space-Speed
(Table 1: 3) and show the comparisons in Table 3. In Random, the
users had to guess the exact entrance/exit locations, entry timing
and speed. It is very difficult to do by just watching videos and thus
has the lowest score across the board. When provided with exact
entrance/exit locations (SDR), the score is boosted in Overall and
Space-Only. But the scores in Space-Time and Space-Speed remain
relatively low. As more information is provided (SDRT & SDRTS),
the scores generally increase. This shows that our metrics are sensi-
tive to space, time and dynamics information during comparisons.
Further, each type of information can be isolated in the comparison.
The Space-Only scores are roughly the same between SDR, SDRT
and SDRTS. The Space-Time scores do not change much between
SDRT and SDRTS. The isolation in comparisons makes our AL metrics
ideal for evaluating simulations in different aspects, providing
great flexibility, which is necessary in practice.
Next, we show that it is possible to do more detailed comparisons
using DPD metrics. Due to the space limit, we show one space flow
from all simulation settings (Fig. 8), and compare them in space
only (DPD-Space), time only (DPD-Time) and time-speed (DPD-TS)
in Table 4. In DPD-Space, all settings perform similarly because
the space information is provided in all of them. In DPD-Time,
SDRT & SDRTS are better because they are both provided with the
timing information. What is interesting is that SDRTS is worse than
SDRT on the two flows in DPD-TS. Their main difference is that
the desired speed in SDRTS is set to be the average speed of that
trajectory, while the desired speed in SDRT is randomly drawn from
a Gaussian estimated from real data. The latter achieves slightly
better performance on both flows in DPD-TS.

Metric/Simulations    SDR      SDRT     SDRTS    Ours
DPD-Space             0.4751   0.3813   0.4374   0.2988
DPD-Time              0.3545   0.0795   0.064    0.0419
DPD-TS                1.0      0.8879   1.0      0.4443
DPD-Space             0.2753   0.2461   0.2423   0.1173
DPD-Time              0.0428   0.0319   0.0295   0.0213
DPD-TS                0.9970   0.8157   0.9724   0.5091
Table 4. Comparison on space flow P2 in Forum (Top) and space flow P1 in TrainStation (Bottom) based on DPD metrics, both shown in Fig. 4. Lower is better.
Quantitative metrics for comparing simulated and real crowds
have been proposed before. However, they either only compare
individual motions [Guy et al. 2012] or only space patterns [Wang
et al. 2016, 2017]. Holistically considering space, time & speed has
a combinatorial effect, leading to many explicable metrics evaluating
different aspects of crowds (the AL & DPD metrics). This makes
multi-faceted comparisons possible, which is unachievable in existing
methods. Technically, the flexible design of THDP allows
for different choices of marginalization, which greatly increases
the evaluation versatility. This shows the theoretical superiority of
THDP over existing methods.
7.4 Guided Simulations
Our automated simulation guidance proves to be superior to careful
manual settings. We first show the AL results in Table 3. Our guided
simulation outperforms all the other settings, which were carefully and
manually set up. The superior performance is achieved in the Overall
comparisons as well as in most dimension-specific comparisons.
Next, we show the same space flow of our guided simulation in
Fig. 8, in comparison with the other settings. Qualitatively, SDR, SDRT
and SDRTS generate narrower flows because straight lines are simulated.
In contrast, our simulation shows more realistic intra-flow
randomness, which leads to a wider flow that is much more similar to
the real data. Quantitatively, we show the DPD results in Table 4.
Again, our automated guidance outperforms all other settings.
Automated simulation guidance has only been attempted by a
few researchers before [Karamouzas et al. 2018; Wolinski et al. 2014].
However, their methods aim to guide simulators to reproduce low-level
motions for overall similarity with the data, whereas our approach
aims to inform simulators with structured scene semantics. Moreover,
it gives users the freedom to use either the full semantics
or partial semantics (e.g. the top n flows) to simulate
crowds, which no previous method can provide.
7.5 Implementation Details
For space discretization, we divide the image space of Forum, CarPark
and TrainStation uniformly into 40 × 40, 40 × 40 and 120 × 120 pixel
grids respectively. Since Forum is recorded by a top-down camera,
we directly estimate the velocity from two consecutive observations
in time. For CarPark and TrainStation, we estimate the velocity by
reconstructing a top-down view via perspective projection. THDP
also has hyper-parameters such as the scaling factors of every DP
Fig. 8. Space flow P2 in Forum (Top) and P1 in TrainStation (Bottom) in different simulations. The y axes of the time and speed profiles indicate likelihood.
(six in total). Our inference method is not very sensitive to
them because they are also sampled, as part of the CRFL sampling.
Please refer to Appx. B.3 for details. In inference, we have a burn-in
phase, during which we only use CRF on the Space-HDP and ignore
the other two HDPs. After the burn-in phase, we use CRFL on the
full THDP. We found that it can greatly help the convergence of the
inference. For crowd simulation, we use ORCA in Menge [Curtis
et al. 2016].
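As an illustration of the discretization and velocity estimation described above, here is a minimal sketch; the grid interpretation, the data layout and the function name are illustrative assumptions rather than our released code.

```python
import numpy as np

def discretize(points, times, image_size, grid=(40, 40)):
    """Convert a trajectory of (x, y) image points with timestamps into
    grid-cell indices and speeds estimated from consecutive observations."""
    pts = np.asarray(points, dtype=float)
    w, h = image_size
    gx, gy = grid
    cx = np.clip((pts[:, 0] / w * gx).astype(int), 0, gx - 1)
    cy = np.clip((pts[:, 1] / h * gy).astype(int), 0, gy - 1)
    cells = cy * gx + cx                                  # flattened cell index
    dt = np.diff(np.asarray(times, dtype=float))
    speeds = np.linalg.norm(np.diff(pts, axis=0), axis=1) / dt
    return cells, speeds
```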
We randomly select 664 trajectories in Forum, 1000 trajectories
in CarPark and 1000 trajectories in TrainStation for performance
tests. In each experiment, we split the data into segments in the time
domain to mimic fragmented video observations. The number of
segments is a user-defined hyper-parameter and depends on the
nature of the dataset. We chose the segment numbers to be 384, 87
and 28 for Forum, CarPark and TrainStation respectively, to cover
situations where the video is finely or roughly segmented. During
training, we first run 5k CRF iterations on the Space-HDP only in
the burn-in phase, then do the full CRFL on the whole THDP to
speed up the mixing. After training, the numbers of space, time and
speed modes are 25, 5 and 7 in Forum; 13, 6 and 6 in CarPark; and 16, 3
and 4 in TrainStation. The training took 85.1, 11.5 and 7.8 minutes
on Forum, CarPark and TrainStation respectively, on a PC with an Intel i7-6700
3.4GHz CPU and 16GB memory.
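The temporal segmentation can be sketched as follows; uniform segmentation by entry time is one reasonable reading of splitting the data in the time domain, and the record layout is illustrative.

```python
import numpy as np

def split_into_segments(trajectories, n_segments):
    """Assign each trajectory to a time segment ('document') by its entry time,
    mimicking fragmented video observations."""
    entry_times = np.array([traj["times"][0] for traj in trajectories])
    t0, t1 = entry_times.min(), entry_times.max()
    edges = np.linspace(t0, t1, n_segments + 1)
    # np.digitize returns indices in 1..n_segments; clip the last edge case
    seg_ids = np.clip(np.digitize(entry_times, edges) - 1, 0, n_segments - 1)
    segments = [[] for _ in range(n_segments)]
    for traj, s in zip(trajectories, seg_ids):
        segments[s].append(traj)
    return segments
```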
8 DISCUSSION
We chose MCMC to avoid the local minimum issue. (Stochastic)
Variational Inference (VI) [Hoffman et al. 2013] and Geometric
Optimization [Yurochkin and Nguyen 2016] are theoretically faster.
However, VI for a single HDP is already prone to local minima
[Wang et al. 2016], and we found the same issue with geometric
optimization. Also, could we use three independent HDPs? Using
independent HDPs essentially breaks the many-to-many associations
between the space, time and speed modes, and can cause mis-clustering
because the clustering is done on different dimensions separately
[Wang and O’Sullivan 2016].
The biggest limitation of our method is that it does not consider
cross-scene transferability. Since the analysis focuses on the semantics
of a given scene, it is unclear how the results can inspire simulation
settings in unseen environments. In addition, our metrics do
not directly reflect visual similarities at the individual level. We
deliberately avoid agent-level one-to-one comparison, to allow
greater flexibility in the simulation setting while maintaining statistical
similarities. Also, we currently do not model high-level behaviors
such as grouping, queuing, etc. This is because such information
can only be obtained through human labelling, which would
incur a massive workload and therefore be impractical on the chosen
datasets. We intentionally chose unsupervised learning to deal with
large datasets.
9 CONCLUSIONS AND FUTURE WORK
In this paper, we present the first, to the best of our knowledge, multi-purpose
framework for comprehensive crowd analysis, visualization,
comparison (between real and simulated crowds) and simulation
guidance. To this end, we proposed a new non-parametric Bayesian
model called Triplet-HDP and a new inference method called Chinese
Restaurant Franchise League. We have shown the effectiveness
of our method on datasets varying in volume, duration, environment
and crowd dynamics.
In the future, we would like to extend the work to cross-environment
prediction. It would be ideal if the modes learnt from given envi-
ronments can be used to predict crowd behaviors in unseen envi-
ronments. Preliminary results show that the semantics are tightly
coupled with the layout of sub-spaces with designated functionali-
ties. This means a subspace-functionality based semantic transfer is
possible. Besides, we will look into using semi-supervised learning
to identify and learn high-level social behaviors, such as grouping
and queuing.
ACKNOWLEDGEMENT
The project is partially supported by EPSRC (Ref:EP/R031193/1), the
Fundamental Research Funds for the Central Universities (xzy012019048)
and the National Natural Science Foundation of China (61602366).
REFERENCES
Saad Ali and Mubarak Shah. 2007. A Lagrangian particle dynamics approach for crowd
flow segmentation and stability analysis. In 2007 IEEE Conference on Computer
Vision and Pattern Recognition. IEEE, 1–6.
Jiang Bian, Dayong Tian, Yuanyan Tang, and Dacheng Tao. 2018. A survey on trajectory
clustering analysis. CoRR abs/1802.06971 (2018). arXiv:1802.06971
Christopher Bishop. 2007.
Pattern Recognition and Machine Learning
. Springer, New
York.
Rima Chaker, Zaher Al Aghbari, and Imran N Junejo. 2017. Social network model for
crowd anomaly detection and localization.
Pattern Recognition
61 (2017), 266–281.
Panayiotis Charalambous, Ioannis Karamouzas, Stephen J Guy, and Yiorgos Chrysan-
thou. 2014. A data-driven framework for visual crowd analysis. In
Computer
Graphics Forum, Vol. 33. Wiley Online Library, 41–50.
Sean Curtis, Andrew Best, and Dinesh Manocha. 2016. Menge: A Modular Framework
for Simulating Crowd Movement. Collective Dynamics 1, 0 (2016).
Cathy Ennis, Christopher Peters, and Carol O’Sullivan. 2011. Perceptual Effects of
Scene Context and Viewpoint for Virtual Pedestrian Crowds. ACM Transaction on
Applied Perception 8, 2, Article 10 (Feb. 2011), 22 pages.
Thomas S. Ferguson. 1973. A Bayesian Analysis of Some Nonparametric Problems.
The Annals of Statistics 1, 2 (1973), 209–230.
Abhinav Golas, Rahul Narain, and Ming Lin. 2013. Hybrid Long-range Collision
Avoidance for Crowd Simulation. In
ACM SIGGRAPH Symposium on Interactive
3D Graphics and Games. 29–36.
Stephen J. Guy, Jur van den Berg, Wenxi Liu, Rynson Lau, Ming C. Lin, and Dinesh
Manocha. 2012. A Statistical Similarity Measure for Aggregate Crowd Dynamics.
ACM Transaction on Graphics 31, 6 (2012), 190:1–190:11.
Dirk Helbing et al. 1995. Social Force Model for Pedestrian Dynamics. Physical Review
E (1995).
Matthew D. Homan, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic
Variational Inference.
Journal of Machine Learning Research
14, 1 (2013), 1303–
1347.
Kevin Jordao, Julien Pettré, Marc Christie, and Marie-Paule Cani. 2014. Crowd Sculpting:
A Space-time Sculpting Method for Populating Virtual Environments.
Computer
Graphics Forum (2014).
Ioannis Karamouzas, Nick Sohre, Ran Hu, and Stephen J. Guy. 2018. Crowd Space: A
Predictive Crowd Analysis Technique.
ACM Transaction on Graphics 37, 6, Article 186 (Dec. 2018), 14 pages.
Leonard Kauman and Peter J. Rousseeuw. 2005.
Finding Groups in Data: An
Introduction to Cluster Analysis. John Wiley & Sons.
Kang Hoon Lee, Myung Geol Choi, Qyoun Hong, and Jehee Lee. 2007. Group behavior
from video: a data-driven approach to crowd simulation. In
Proceedings of the 2007
ACM SIGGRAPH/Eurographics symposium on Computer animation. 109–118.
S. Lemercier, A. Jelic, R. Kulpa, J. Hua, J. Fehrenbach, P. Degond, C. Appert-Rolland, S.
Donikian, and J. Pettré. 2012. Realistic Following Behaviors for Crowd Simulation.
Computer Graphics Forum 31, 2 (2012), 489–498.
Alon Lerner, Yiorgos Chrysanthou, Ariel Shamir, and Daniel Cohen-Or. 2009. Data
driven evaluation of crowds. In
International Workshop on Motion in Games
.
Springer, 75–83.
Ning Lu et al. 2019. ADCrowdNet: An Attention-injective Deformable Convolutional
Network for Crowd Understanding.
IEEE Conference on Computer Vision and
Pattern Recognition (2019).
A López, F Chaumette, E Marchand, and J Pettré. 2019. Character navigation in dy-
namic environments based on optical flow. In
Proceedings of Eurographics 2019
(Eurographics 2019). Eurographics.
B. Majecka. 2009.
Statistical models of pedestrian behaviour in the Forum
. MSc Dis-
sertation. School of Informatics, University of Edinburgh, Edinburgh.
Ramin Mehran, Alexis Oyama, and Mubarak Shah. 2009. Abnormal crowd behavior
detection using social force model. In
2009 IEEE Conference on Computer Vision
and Pattern Recognition. IEEE, 935–942.
Rahul Narain, Abhinav Golas, Sean Curtis, and Ming C. Lin. 2009. Aggregate Dynamics
for Dense Crowd Simulation.
ACM Transaction on Graphics
28, 5 (2009), 122:1–
122:8.
Carl Edward Rasmussen. 1999. The Infinite Gaussian Mixture Model. In
International
Conference on Neural Information Processing Systems (NIPS’99)
. MIT Press, Cam-
bridge, MA, USA, 554–560.
Jiaping Ren, Wei Xiang, Yangxi Xiao, Ruigang Yang, Dinesh Manocha, and Xiaogang
Jin. 2018. Heter-Sim: Heterogeneous multi-agent systems simulation by interactive
data-driven optimization. CoRR abs/1812.00307 (2018). arXiv:1812.00307
Zeng Ren, P. Charalambous, J. Bruneau, Q. Peng, and J. Pettré. 2016. Group modelling:
A unied velocity-based approach. Computer Graphics Forum (2016).
Mohammad Sabokrou et al. 2017. Deep-cascade: Cascading 3D deep neural networks
for fast anomaly detection and localization in crowded scenes.
IEEE Transaction
on Image Processing (2017).
Long Sha, Patrick Lucey, Yisong Yue, Xinyu Wei, Jennifer Hobbs, Charlie Rohlf, and
Sridha Sridharan. 2018. Interactive sports analytics: An intelligent interface for utiliz-
ing trajectories for interactive sports play retrieval and analytics.
ACM Transactions
on Computer-Human Interaction (TOCHI) 25, 2 (2018), 1–32.
Long Sha, Patrick Lucey, Stephan Zheng, Taehwan Kim, Yisong Yue, and Sridha Srid-
haran. 2017. Fine-grained retrieval of sports plays using tree-based alignment of
trajectories. (2017). arXiv:1710.02255
Yijun Shen, Joseph Henry, He Wang, Edmond S. L. Ho, Taku Komura, and
Hubert P. H. Shum. 2018. Data-Driven Crowd Motion Control With
Multi-Touch Gestures.
Computer Graphics Forum
37, 6 (2018), 382–394.
arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1111/cgf.13333
Jianbo Shi and J. Malik. 2000. Normalized cuts and image segmentation.
IEEE
Transactions on Pattern Analysis and Machine Intelligence 22, 8 (2000), 888–905.
Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical
Dirichlet Processes.
Journal of American Statistical Association
101, 476 (2006),
1566–1581.
J. van den Berg, Ming C. Lin, and Dinesh Manocha. 2008. Reciprocal Velocity Obstacles
for real-time multi-agent navigation.
IEEE International Conference on Robotics
and Automation (2008).
He Wang, Jan Ondřej, and Carol O’Sullivan. 2016. Path Patterns: Analyzing and
Comparing Real and Simulated Crowds. In
Proceedings of the 20th ACM SIGGRAPH
Symposium on Interactive 3D Graphics and Games (I3D ’16)
. ACM, New York, NY,
USA, 49–57. https://doi.org/10.1145/2856400.2856410
He Wang, Jan Ondřej, and Carol O’Sullivan. 2017. Trending Paths: A New Semantic-
level Metric for Comparing Simulated and Real Crowd Data.
IEEE Transactions on
Visualization and Computer Graphics 23, 5 (2017), 1454–1464.
He Wang and Carol O’Sullivan. 2016.
Globally Continuous and Non-Markovian Crowd
Activity Analysis from Videos
. Springer International Publishing, Cham, 527–544.
Qi Wang et al. 2019. Learning from Synthetic Data for Crowd Counting in the Wild.
IEEE Conference on Computer Vision and Pattern Recognition (2019).
Xiaogang Wang, Keng Teck Ma, Gee-Wah Ng, and W. E. L. Grimson. 2008. Trajectory
analysis and semantic region modeling using a nonparametric Bayesian model. In
IEEE Conference on Computer Vision and Pattern Recognition. 1–8.
David Wolinski, Stephen J. Guy, Anne-Hélène Olivier, Ming C. Lin, Dinesh Manocha,
and Julien Pettré. 2014. Parameter estimation and comparative evaluation of crowd
simulations. Computer Graphics Forum 33, 2 (2014), 303–312.
Yanyu Xu et al. 2018. Encoding Crowd Interaction with Deep Neural Network for
Pedestrian Trajectory Prediction.
IEEE Conference on Computer Vision and Pattern
Recognition (2018).
S. Yi, H. Li, and X. Wang. 2015. Understanding pedestrian behaviors from stationary
crowd groups. In
IEEE Conference on Computer Vision and Pattern Recognition
.
3488–3496.
Mikhail Yurochkin and XuanLong Nguyen. 2016. Geometric Dirichlet Means Algorithm
for topic inference. In
International Conference on Neural Information Processing
Systems.
A CHINESE RESTAURANT FRANCHISE
To give the mathematical derivation of the sampling process described in Sec. 5.1, we first give meanings to the variables in Fig. 2 Left. $\theta_{ji}$ is the dish choice made by $x_{ji}$, the $i$th customer in the $j$th restaurant. $G_j$ is the tables with dishes, and the dishes are from the global menu $G$. Since $\theta_{ji}$ indicates the choice of tables and therefore dishes, we use some auxiliary variables to represent the process. We introduce $t_{ji}$ and $k_{jt}$ as the indices of the table and the dish on the table chosen by $x_{ji}$. We also denote $m_{jk}$ as the number of tables serving the $k$th dish in restaurant $j$, and $n_{jtk}$ as the number of customers at table $t$ in restaurant $j$ having the $k$th dish. We also use them to represent accumulative indicators, such as $m_{\cdot k}$, the total number of tables serving the $k$th dish. Superscripts indicate which customer or table is removed: if customer $x_{ji}$ is removed, then $n^{-ji}_{jtk}$ is the number of customers at table $t$ in restaurant $j$ having the $k$th dish, excluding $x_{ji}$.

Customer-level sampling. To choose a table for $x_{ji}$ (line 5 in Algorithm 1), we sample a table index $t_{ji}$:

$$p(t_{ji}=t \mid \mathbf{t}^{-ji}, \mathbf{k}) \propto \begin{cases} n^{-ji}_{jt\cdot}\, f^{-x_{ji}}_{k_{jt}}(x_{ji}) & \text{if } t \text{ already exists}\\ \alpha_j\, p(x_{ji} \mid \mathbf{t}^{-ji}, t_{ji}=t^{new}, \mathbf{k}) & \text{if } t=t^{new} \end{cases} \tag{11}$$

where $n^{-ji}_{jt\cdot}$ is the number of customers at table $t$ (table popularity), and $f^{-x_{ji}}_{k_{jt}}(x_{ji})$ is how much $x_{ji}$ likes the $k_{jt}$th dish, $f_{k_{jt}}$, served on that table (dish preference). $f_{k_{jt}}$ is the dish and thus a problem-specific probability distribution; $f^{-x_{ji}}_{k_{jt}}(x_{ji})$ is the likelihood of $x_{ji}$ on $f_{k_{jt}}$. In our problem, $f_{k_{jt}}$ is Multinomial in the Space-HDP and Normal otherwise. $\alpha_j$ is the parameter in Eq. 1, so it controls how likely $x_{ji}$ is to create a new table, after which she needs to choose a dish according to $p(x_{ji} \mid \mathbf{t}^{-ji}, t_{ji}=t^{new}, \mathbf{k})$. When a new table is created, $t_{ji}=t^{new}$, we need to sample a dish (line 7 in Algorithm 1), indexed by $k_{jt^{new}}$, according to:
$$p(k_{jt^{new}}=k \mid \mathbf{t}, \mathbf{k}^{-jt^{new}}) \propto \begin{cases} m_{\cdot k}\, f^{-x_{ji}}_{k}(x_{ji}) & \text{if } k \text{ already exists}\\ \gamma\, f^{-x_{ji}}_{k^{new}}(x_{ji}) & \text{if } k=k^{new} \end{cases} \tag{12}$$

where $m_{\cdot k}$ is the total number of tables across all restaurants serving the $k$th dish (dish popularity), and $f^{-x_{ji}}_{k}(x_{ji})$ is how much $x_{ji}$ likes the $k$th dish, again the likelihood of $x_{ji}$ on $f_k$. $\gamma$ is the parameter in Eq. 1, so it controls how likely a new dish will be created.

Table-level sampling. Next we sample a dish for a table (line 11 in Algorithm 1). We denote all customers at the $t$th table in the $j$th restaurant as $\mathbf{x}_{jt}$. Then we sample its dish $k_{jt}$ according to:

$$p(k_{jt}=k \mid \mathbf{t}, \mathbf{k}^{-jt}) \propto \begin{cases} m^{-jt}_{\cdot k}\, f^{-\mathbf{x}_{jt}}_{k}(\mathbf{x}_{jt}) & \text{if } k \text{ already exists}\\ \gamma\, f^{-\mathbf{x}_{jt}}_{k^{new}}(\mathbf{x}_{jt}) & \text{if } k=k^{new} \end{cases} \tag{13}$$

Similarly, $m^{-jt}_{\cdot k}$ is the total number of tables across all restaurants serving the $k$th dish, without $\mathbf{x}_{jt}$ (dish popularity), and $f^{-\mathbf{x}_{jt}}_{k}(\mathbf{x}_{jt})$ is how much the group of customers $\mathbf{x}_{jt}$ likes the $k$th dish (dish preference). This time, $f^{-\mathbf{x}_{jt}}_{k}(\mathbf{x}_{jt})$ is a joint probability over all $x_{ji} \in \mathbf{x}_{jt}$.

Finally, in both Eq. 12 and Eq. 13, we need to sample a new dish. This is done by sampling a new distribution from the base distribution $H$, $\phi_k \sim H$. After inference, the weights $\beta$ can be computed as $\beta \sim Dirichlet(m_{\cdot 1}, m_{\cdot 2}, \cdots, m_{\cdot k}, \gamma)$. The choice of $H$ is related to the data. In our metaphor, the dishes of the Space-HDP are flows, so we use a Dirichlet. In the Time-HDP and Speed-HDP, the dishes are modes of time and speed, which are Normals, so we use a Normal-Inverse-Gamma for $H$. These choices are made because the Dirichlet and the Normal-Inverse-Gamma are the conjugate priors of the Multinomial and the Normal respectively. The whole CRF sampling is done by iteratively computing Eq. 11 to Eq. 13. The number of dishes dynamically increases/decreases until the sampling mixes. In this way, we do not need to know in advance how many space flows, time modes or speed modes there are, because they are learnt automatically.
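For concreteness, the customer-level step (Eqs. 11 and 12) for a single restaurant can be sketched as follows. This is illustrative rather than our released implementation; `dish_likelihood(x, k)` stands in for the likelihood $f_k(x)$, with `None` denoting a brand-new dish evaluated under the prior predictive of $H$.

```python
import numpy as np

def sample_table(x, n_jt, k_jt, m_k, alpha, gamma, dish_likelihood, rng=None):
    """Sample a table index for customer x in one restaurant (Eq. 11).
    n_jt[t]: customers at table t (excluding x); k_jt[t]: dish served at table t;
    m_k[k]: number of tables serving dish k across all restaurants."""
    rng = rng or np.random.default_rng()
    weights = [n_jt[t] * dish_likelihood(x, k_jt[t]) for t in range(len(n_jt))]
    # A new table: alpha times the prior predictive of x, i.e. a mixture over
    # existing dishes and a brand-new dish drawn from H (Eq. 12).
    total_m = sum(m_k) + gamma
    p_new = alpha * (sum(m * dish_likelihood(x, k) for k, m in enumerate(m_k))
                     + gamma * dish_likelihood(x, None)) / total_m
    weights.append(p_new)
    probs = np.asarray(weights, dtype=float)
    t = int(rng.choice(len(probs), p=probs / probs.sum()))
    return t  # t == len(n_jt) means "open a new table"
```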
B CHINESE RESTAURANT FRANCHISE LEAGUE
B.1 Customer Level Sampling
When we do customer-level sampling to sample a new table (line 8 in Algorithm 2), the left side of Eq. 11 becomes:

$$p(t_{ji}=t, x_{ji}, y_{kd}, z_{kc} \mid \mathbf{x}^{-ji}, \mathbf{t}^{-ji}, \mathbf{k}, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l}, \mathbf{z}^{-kc}, \mathbf{p}^{-kc}, \mathbf{q}) \tag{14}$$

So whether $y_{kd}$ and $z_{kc}$ like the new restaurants should be taken into consideration. After applying Bayesian rules and factorization on Eq. 14, we have:

$$\begin{aligned} p(t_{ji}=t, x_{ji}, y_{kd}, z_{kc} \mid \bullet) = \; & p(t_{ji} \mid \mathbf{t}^{-ji}, \mathbf{k})\\ & p(x_{ji} \mid y_{kd}, z_{kc}, t_{ji}=t, k_{jt}=k, \bullet)\\ & p(y_{kd} \mid t_{ji}=t, k_{jt}=k, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l})\\ & p(z_{kc} \mid t_{ji}=t, k_{jt}=k, \mathbf{z}^{-kc}, \mathbf{p}^{-kc}, \mathbf{q}) \end{aligned} \tag{15}$$
where $\bullet$ is $\{\mathbf{x}^{-ji}, \mathbf{t}^{-ji}, \mathbf{k}, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l}, \mathbf{z}^{-kc}, \mathbf{p}^{-kc}, \mathbf{q}\}$. The four probabilities on the right-hand side of Eq. 15 have intuitive meanings. $p(t_{ji} \mid \mathbf{t}^{-ji}, \mathbf{k})$ and $p(x_{ji} \mid y_{kd}, z_{kc}, t_{ji}=t, k_{jt}=k, \bullet)$ are the table popularity and dish preference of $x_{ji}$ in the Space-HDP:

$$p(t_{ji} \mid \mathbf{t}^{-ji}, \mathbf{k}) \propto \begin{cases} n^{-ji}_{jt\cdot} & \text{if } t \text{ already exists}\\ \alpha_j & \text{if } t=t^{new} \end{cases} \tag{16}$$

$$p(x_{ji} \mid y_{kd}, z_{kc}, t_{ji}=t, k_{jt}=k, \bullet) \propto \begin{cases} f^{-x_{ji}}_{k_{jt}}(x_{ji}) & \text{if } t \text{ exists}\\ m_{\cdot k}\, f^{-x_{ji}}_{k}(x_{ji}) & \text{else if } k \text{ exists}\\ \gamma\, f^{-x_{ji}}_{k^{new}}(x_{ji}) & \text{if } k=k^{new} \end{cases} \tag{17}$$
Eq. 16 and Eq. 17 are just re-organizations of Eq. 11 and Eq. 12. The remaining $p(y_{kd} \mid t_{ji}=t, k_{jt}=k, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l})$ and $p(z_{kc} \mid t_{ji}=t, k_{jt}=k, \mathbf{z}^{-kc}, \mathbf{p}^{-kc}, \mathbf{q})$ can be seen as how much the time-customer $y_{kd}$ and the speed-customer $z_{kc}$ like the $k$th time and speed restaurants respectively (restaurant preference). This restaurant preference does not appear in single HDPs and thus needs special treatment. This is the first major difference between CRFL and CRF. Since we propose the same treatment for both, we only explain the time-restaurant preference treatment here.
If, every time we sample a $t_{ji}$, we compute $p(y_{kd} \mid t_{ji}=t, k_{jt}=k, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l})$ on every time table in every time-restaurant, it will be prohibitively slow. We therefore marginalize over all the time tables in a time-restaurant, to get a general restaurant preference of $y_{kd}$:

$$p(y_{kd} \mid t_{ji}=t, k_{jt}=k, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l}) = \sum_{o_{kd}=1}^{h_{k\cdot}} p(o_{kd}=o \mid t_{ji}=t, k_{jt}=k, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}) \, p(y_{kd} \mid o_{kd}=o, l_{ko}=l, \mathbf{l}) \tag{18}$$
where $o_{kd}$ is the table choice of $y_{kd}$ in the $k$th time-restaurant, $l_{ko}$ is the time-dish served on the $o$th table in the $k$th time-restaurant, and $h_{k\cdot}$ is the total number of tables in the $k$th time-restaurant. Similar to Eq. 16 and Eq. 17:

$$p(o_{kd}=o \mid t_{ji}=t, k_{jt}=k, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}) \propto \begin{cases} s^{-kd}_{ko} & \text{if } o \text{ exists}\\ \epsilon_k & \text{if } o_{kd}=o^{new} \end{cases} \tag{19}$$
where $s^{-kd}_{ko}$ is the number of time-customers already at the $o$th table and $\epsilon_k$ is the scaling factor.

$$p(y_{kd} \mid o_{kd}=o, l_{ko}=l, \mathbf{l}) \propto \begin{cases} g^{-y_{kd}}_{l_{ko}}(y_{kd}) & \text{if } o \text{ exists}\\ h_{\cdot l}\, g^{-y_{kd}}_{l}(y_{kd}) & \text{else if } l \text{ exists}\\ \varepsilon\, g^{-y_{kd}}_{l^{new}}(y_{kd}) & \text{if } l=l^{new} \end{cases} \tag{20}$$
where $h_{\cdot l}$ is the total number of tables serving time-dish $l$ and $g$ is the posterior predictive distribution of a Normal, i.e. a Student's t-distribution. $\varepsilon$ controls how likely a new time dish would be needed. We have now finished deriving the sampling for $p(y_{kd} \mid t_{ji}=t, k_{jt}=k, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l})$. Similar derivations can be done for $p(z_{kc} \mid t_{ji}=t, k_{jt}=k, \mathbf{z}^{-kc}, \mathbf{p}^{-kc}, \mathbf{q})$.
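A minimal sketch of this marginalized time-restaurant preference (Eqs. 18-20) is given below. The normalization of the proportionalities and the handling of the new-table/new-dish branches follow one natural reading of the equations, and the helper names `g_lik` and `g_new` are illustrative placeholders for the Student-t predictive likelihoods.

```python
def time_restaurant_preference(y, s_ko, l_ko, h_l, eps_k, eps, g_lik, g_new):
    """Marginalized preference of time-customer y for the k-th time-restaurant.
    s_ko[o]: customers at time-table o (excluding y); l_ko[o]: time-dish of
    table o; h_l[l]: tables serving time-dish l across all time-restaurants."""
    z_tables = sum(s_ko) + eps_k                # normalizer for Eq. 19
    pref = sum((s / z_tables) * g_lik(y, l_ko[o]) for o, s in enumerate(s_ko))
    # New-table branch: its dish is either an existing time-dish or a new one.
    z_dishes = sum(h_l) + eps
    new_table = sum(h * g_lik(y, l) for l, h in enumerate(h_l)) + eps * g_new(y)
    return pref + (eps_k / z_tables) * new_table / z_dishes
```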
After table sampling, we need to do dish sampling (line 10 in
Algorithm 2). The left side of Eq. 12 becomes:
$$p(k_{jt^{new}}=k, x_{ji}, y_{kd}, z_{kc} \mid \mathbf{k}^{-jt^{new}}, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l}, \mathbf{z}^{-kc}, \mathbf{p}^{-kc}, \mathbf{q}) \propto \begin{cases} m^{-jt}_{\cdot k}\, p(x_{ji} \mid \cdots)\, p(y_{kd} \mid \cdots)\, p(z_{kc} \mid \cdots) & \text{if } k \text{ already exists}\\ \gamma\, p(x_{ji} \mid \cdots)\, p(y_{kd} \mid \cdots)\, p(z_{kc} \mid \cdots) & \text{if } k=k^{new} \end{cases} \tag{21}$$

The differences between Eq. 21 and Eq. 12 are $p(y_{kd} \mid \cdots)$ and $p(z_{kc} \mid \cdots)$. Both are Infinite Gaussian Mixture Models, so the likelihoods can be easily computed. We have therefore given the whole sampling process for the customer-level sampling (Eq. 14). We still need to deal with the table-level sampling.
B.2 Table Level Sampling
Similarly, when we do the table-level sampling (line 14 in Algorithm 2), the left side of Eq. 13 changes to:

$$p(k_{jt}=k, \mathbf{x}_{jt}, \mathbf{y}_{kd_{jt}}, \mathbf{z}_{kc_{jt}} \mid \mathbf{k}^{-jt}, \mathbf{y}^{-kd_{jt}}, \mathbf{o}^{-kd_{jt}}, \mathbf{l}_{ko}, \mathbf{z}^{-kc_{jt}}, \mathbf{p}^{-kc_{jt}}, \mathbf{q}_{kp}) \propto \begin{cases} m^{-jt}_{\cdot k}\, p(\mathbf{x}_{jt} \mid \cdots)\, p(\mathbf{y}_{kd_{jt}} \mid \cdots)\, p(\mathbf{z}_{kc_{jt}} \mid \cdots) & \text{if } k \text{ already exists}\\ \gamma\, p(\mathbf{x}_{jt} \mid \cdots)\, p(\mathbf{y}_{kd_{jt}} \mid \cdots)\, p(\mathbf{z}_{kc_{jt}} \mid \cdots) & \text{if } k=k^{new} \end{cases} \tag{22}$$

where $\mathbf{x}_{jt}$ is the space-customers at table $t$, and $\mathbf{y}_{kd_{jt}}$ and $\mathbf{z}_{kc_{jt}}$ are the associated time and speed customers. $\mathbf{k}^{-jt}$, $\mathbf{y}^{-kd_{jt}}$, $\mathbf{o}^{-kd_{jt}}$, $\mathbf{l}_{ko}$, $\mathbf{z}^{-kc_{jt}}$, $\mathbf{p}^{-kc_{jt}}$ and $\mathbf{q}_{kp}$ are the rest of the customers and their choices of tables and dishes in the three HDPs. $\cdots$ represents all the conditional variables for simplicity. $p(\mathbf{x}_{jt} \mid \cdots)$ is the Multinomial $f$ as in Eq. 13.
$p(\mathbf{y}_{kd_{jt}} \mid \cdots)$ and $p(\mathbf{z}_{kc_{jt}} \mid \cdots)$ are not easy to compute. However, they can be treated in the same way, so we only explain how to compute $p(\mathbf{y}_{kd_{jt}} \mid \cdots)$ here. To fully compute $p(\mathbf{y}_{kd_{jt}} \mid \cdots) = p(\mathbf{y}_{kd_{jt}} \mid k_{jt}=k, \mathbf{o}^{-kd_{jt}}, \mathbf{l}_{ko})$, one needs to consider every $y_{kd} \in \mathbf{y}_{kd_{jt}}$, which is extremely expensive. This is because we deal with large datasets and there can easily be thousands, if not more, of customers in $\mathbf{y}_{kd_{jt}}$.

In Eq. 15, we already saw how $y_{kd}$'s time-restaurant preference influences the table choice of $x_{ji}$. Given a group $\mathbf{y}_{kd_{jt}}$, their collective time-restaurant preference, $p(\mathbf{y}_{kd_{jt}} \mid \cdots)$, will influence the dish choice of $\mathbf{x}_{jt}$. Since the distribution of individual time-restaurant preferences is hard to compute analytically, we approximate it: we randomly sample from $\mathbf{y}_{kd_{jt}}$ to approximate $p(\mathbf{y}_{kd_{jt}} \mid \cdots)$. The number of samples is a hyper-parameter, referred to as customer selection. For every sampled $y \in \mathbf{y}_{kd_{jt}}$ we can compute its probability in the same way as in Eq. 18, so we approximate $p(\mathbf{y}_{kd_{jt}} \mid \cdots)$ with the joint probability of the sampled time-customers.
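The customer-selection approximation can be sketched as follows. This is illustrative; working in log space is an implementation choice of this sketch rather than something prescribed above, and `pref_fn` stands in for the per-customer probability of Eq. 18.

```python
import numpy as np

def approx_group_preference(y_group, pref_fn, n_select=1000, rng=None):
    """Approximate the collective preference p(y_group | ...) by the joint
    probability of a random subset of the associated time-customers
    ('customer selection'). Returned in log space to avoid underflow."""
    rng = rng or np.random.default_rng()
    n = min(n_select, len(y_group))
    idx = rng.choice(len(y_group), size=n, replace=False)
    return float(sum(np.log(pref_fn(y_group[i]) + 1e-300) for i in idx))
```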
B.3 Sampling for Hyper-parameters
A Dirichlet Process contains two parameters, a base distribution and a concentration parameter. To make THDP more robust to these parameters, we impose a prior, a Gamma distribution, onto the concentration parameter: $\gamma \sim \Gamma(\alpha, \varpi)$, where $\alpha$ is the shape parameter and $\varpi$ is the rate parameter. There are six $\alpha$s and $\varpi$s in total for the six DPs in THDP. They are initialized as 0.1, and then updated during the optimization using the method in [Teh et al. 2006]. The update is done in every iteration of CRFL, after sampling all the other parameters. The customer selection parameter is set to 1000 across all experiments. Finally, after CRFL, the inference is done for the three distributions in Eq. 2:
$$\phi^s_k \sim H_s, \quad \beta \sim Dirichlet(m_{\cdot 1}, m_{\cdot 2}, \cdots, m_{\cdot k}, \gamma) \tag{23}$$
$$\phi^t_l \sim H_t, \quad \zeta \sim Dirichlet(h_{\cdot 1}, h_{\cdot 2}, \cdots, h_{\cdot l}, \varepsilon) \tag{24}$$
$$\phi^e_q \sim H_e, \quad \rho \sim Dirichlet(a_{\cdot 1}, a_{\cdot 2}, \cdots, a_{\cdot q}, \lambda) \tag{25}$$

where $m_{\cdot k}$ is the total number of space-tables choosing space-dish $k$; $h_{\cdot l}$ is the total number of time-tables choosing time-dish $l$; $a_{\cdot q}$ is the total number of speed-tables choosing speed-dish $q$. $\gamma$, $\varepsilon$ and $\lambda$ are the scaling factors of $G_s$, $G_t$ and $G_e$.
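After CRFL, the weights in Eqs. 23-25 can be drawn directly from the table counts; a minimal sketch with illustrative counts:

```python
import numpy as np

def mixture_weights(table_counts, scaling, rng=None):
    """Draw the dish weights of Eqs. 23-25: Dirichlet(m_1, ..., m_K, scaling),
    where the last component is the mass reserved for unseen dishes."""
    rng = rng or np.random.default_rng()
    return rng.dirichlet(np.append(np.asarray(table_counts, dtype=float), scaling))

# e.g. space-flow weights beta from three learnt flows with 12, 7 and 3 tables
beta = mixture_weights([12, 7, 3], scaling=0.1)
```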
C SIMULATION GUIDANCE
The dynamics of one trajectory, $\bar{w}$, is:

$$x^{\bar{w}}_t = A s_t + \omega_t, \quad \omega \sim N(0, \Sigma)$$
$$s_t = B s_{t-1} + \lambda_t, \quad \lambda \sim N(0, \Lambda)$$

Given the $U$ trajectories from a space flow $\check{w}$, the total likelihood is:

$$p(\check{w}) = \prod_{i=1}^{U} p(\bar{w}_i) \quad \text{where} \quad p(\bar{w}_i) = \prod_{t=2}^{T_i-1} p(x^i_t \mid s_t)\, p(s_t \mid s_{t-1}), \quad s_1 = x^i_1, \; s_{T_i} = x^i_{T_i} \tag{26}$$
where $A$ is an identity matrix and $\Sigma$ is a known diagonal matrix. $T_i$ is the length of trajectory $i$. We use homogeneous coordinates to represent both $x = [x_1, x_2, 1]^T$ and $s = [s_1, s_2, 1]^T$. Consequently, $A$ is a $\mathbb{R}^{3\times 3}$ identity matrix, $\Sigma$ is set to be a $\mathbb{R}^{3\times 3}$ diagonal matrix with its non-zero entries set to 0.001, and $B$ is a $\mathbb{R}^{3\times 3}$ transition matrix and $\Lambda$ a $\mathbb{R}^{3\times 3}$ covariance matrix, both to be learned.
We apply Expectation-Maximization (EM) [Bishop 2007] to estimate the parameters $B, \Lambda$ and the states $S$ by maximizing the log likelihood $\log P(u)$. Each iteration of EM consists of an E-step and an M-step. In the E-step, we fix the parameters and sample the states $s$ via the posterior distribution of $x$. The posterior distribution and the expectation of the complete-data likelihood are denoted as

$$\mathcal{L} = E_{S \mid X; \hat{B}, \hat{\Lambda}}\big(\log P(S, X; B, \Lambda)\big) = \sum_i \tau_i \, E_{s^i \mid x^i}\{p(s^i, x^i)\} \tag{27}$$
where $\tau_i$ is defined as

$$\tau_i = \frac{\frac{1}{T_i}\sum_{t=1}^{T_i} p(x^i_t \mid s^i_t)}{\sum_{i=1}^{U} \frac{1}{T_i}\sum_{t=1}^{T_i} p(x^i_t \mid s^i_t)}.$$

In the M-step, we maximize the complete-data likelihood and the model parameters are updated as:
$$B^{new} = \frac{\sum_i \tau_i \sum_{t=2}^{T_i} P^i_{t,t-1}}{\sum_i \tau_i \sum_{t=2}^{T_i} P^i_{t-1,t-1}} \tag{28}$$

$$\Lambda^{new} = \frac{\sum_i \tau_i \left(\sum_{t=2}^{T_i} P^i_{t,t} - B^{new} \sum_{t=2}^{T_i} P^i_{t,t-1}\right)}{\sum_i \tau_i (T_i - 2)} \tag{29}$$

$$P^i_{t,t} = E_{s^i \mid x^i}(s_t s^T_t) \tag{30}$$

$$P^i_{t,t-1} = E_{s^i \mid x^i}(s_t s^T_{t-1}) \tag{31}$$

During updating, we use $\Lambda = \frac{1}{2}(\Lambda + \Lambda^T)$ to ensure its symmetry.
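A minimal numpy sketch of the M-step updates (Eqs. 28-31) is given below. It assumes the E-step has already produced, for each trajectory, the sums over $t$ of the expected second moments (e.g. via Kalman smoothing with the end points clamped), uses the standard linear-dynamical-system form in which the cross term enters transposed, and symmetrizes $\Lambda$ as in the text. The helper names are illustrative.

```python
import numpy as np

def m_step(tau, P_tt, P_tt1, P_t1t1, lengths):
    """M-step of Eqs. 28-29. For trajectory i, P_tt[i], P_tt1[i] and P_t1t1[i]
    are the sums over t=2..T_i of E[s_t s_t^T], E[s_t s_{t-1}^T] and
    E[s_{t-1} s_{t-1}^T]; tau[i] is the trajectory weight; lengths[i] = T_i."""
    num_B = sum(w * P for w, P in zip(tau, P_tt1))
    den_B = sum(w * P for w, P in zip(tau, P_t1t1))
    B_new = num_B @ np.linalg.inv(den_B)
    # Eq. 29: residual covariance of the state transition, then symmetrize.
    num_L = sum(w * (Ptt - B_new @ Pt1.T) for w, Ptt, Pt1 in zip(tau, P_tt, P_tt1))
    den_L = sum(w * (T - 2) for w, T in zip(tau, lengths))
    Lam = num_L / den_L
    return B_new, 0.5 * (Lam + Lam.T)
```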