Informative Scene Decomposition for Crowd Analysis, Comparison and
Simulation Guidance
FEIXIANG HE, University of Leeds, United Kingdom
YUANHANG XIANG, Xi’an Jiaotong University, China
XI ZHAO∗, Xi’an Jiaotong University, China
HE WANG†, University of Leeds, United Kingdom
Fig. 1. Overview of our framework.
Crowd simulation is a central topic in several fields including graphics. To
achieve high-fidelity simulations, data has been increasingly relied upon for
analysis and simulation guidance. However, the information in real-world
data is often noisy, mixed and unstructured, making effective analysis
difficult; it has therefore not been fully utilized. With the fast-growing volume
of crowd data, this bottleneck needs to be addressed. In this paper, we
propose a new framework which comprehensively tackles this problem.
It centers on an unsupervised method for analysis. The method takes as
input raw and noisy data with highly mixed multi-dimensional (space, time
and dynamics) information, and automatically structures it by learning the
correlations among these dimensions. The dimensions together with their
correlations fully describe the scene semantics, which consist of recurring
activity patterns in a scene, manifested as space flows with temporal and
dynamics profiles. The effectiveness and robustness of the analysis have been
∗Corresponding author
†Corresponding author
Authors’ addresses: Feixiang He, University of Leeds, School of Computing, United
Kingdom, fxhe1992@gmail.com; Yuanhang Xiang, Xi’an Jiaotong University, School of
Computer Science and Technology, China, xiangyuanhang@icloud.com; Xi Zhao, Xi’an
Jiaotong University, School of Computer Science and Technology, China, zhaoxi.jade@
gmail.com; He Wang, University of Leeds, School of Computing, United Kingdom,
realcrane@gmail.com.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
© 2020 Association for Computing Machinery.
0730-0301/2020/7-ART1 $15.00
https://doi.org/10.1145/3386569.3392407
tested on datasets with great variations in volume, duration, environment
and crowd dynamics. Based on the analysis, new methods for data visualization,
simulation evaluation and simulation guidance are also proposed.
Together, our framework establishes a highly automated pipeline from raw
data to crowd analysis, comparison and simulation guidance. Extensive
experiments and evaluations have been conducted to show the flexibility,
versatility and intuitiveness of our framework.
CCS Concepts: • Computing methodologies → Animation; Topic modeling;
Learning in probabilistic graphical models; Scene understanding; Activity
recognition and understanding; Multi-agent planning; • Mathematics of
computing → Probabilistic inference problems; Non-parametric statistics.
Additional Key Words and Phrases: Crowd Simulation, Simulation Evaluation,
Bayesian Inference
ACM Reference Format:
Feixiang He, Yuanhang Xiang, Xi Zhao, and He Wang. 2020. Informative
Scene Decomposition for Crowd Analysis, Comparison and Simulation
Guidance. ACM Trans. Graph. 39, 4, Article 1 (July 2020), 15 pages.
https://doi.org/10.1145/3386569.3392407
1 INTRODUCTION
Crowd simulation has been intensively used in computer animation,
as well as other fields such as architectural design and crowd
management. The fidelity or realism of simulation has been a long-standing
problem. The main complexity arises from its multi-faceted
nature. It could mean high-level global behaviors [Narain et al. 2009],
mid-level flow information [Wang et al. 2016] or low-level individual
motions [Guy et al. 2012]. It could also mean perceived realism
[Ennis et al. 2011] or numerical accuracy [Wang et al. 2017]. In
ACM Trans. Graph., Vol. 39, No. 4, Article 1. Publication date: July 2020.
1:2 •Feixiang He, Yuanhang Xiang, Xi Zhao, and He Wang
any case, analyzing real-world data is inevitable for evaluating and
guiding simulations.
The main challenges in utilizing real-world data are data complexity,
intrinsic motion randomness and the sheer volume. The
data complexity makes structured analysis difficult. As the most
prevalent form of crowd data, trajectories extracted from sensors
contain rich but mixed and unstructured information about space, time
and dynamics. Although high-level statistics such as density can
be used for analysis, they are not well defined and cannot give
structural insights [Wang et al. 2017]. Second, trajectories show
intrinsic randomness of individual motions [Guy et al. 2012]. The
randomness is heterogeneous across different individuals and
groups, and is influenced by internal factors such as state of mind
and external factors such as collision avoidance. Hence a single
representation is not likely to capture all randomness
for all people in a scene. This makes it difficult to guide simulation
without systematically considering the randomness. Lastly, with
more recording devices being installed and data being shared, the
sheer volume of data in both space and time, with excessive noise,
requires efficient and robust analysis.
Existing methods that use real-world data for purposes such as
qualitative and quantitative comparisons [Wang et al. 2016], simulation
guidance [Ren et al. 2018] or steering [López et al. 2019]
mainly focus on one aspect of data, e.g. space, time or dynamics,
and tend to ignore the structural correlations between them. Also,
during simulation and analysis, motion randomness is often ignored
or uniformly modelled for all trajectories [Guy et al. 2012; Helbing
et al. 1995]. Ignoring the randomness (e.g. only assuming the least-effort
principle) makes simulated agents walk in straight lines
whenever possible, which is rarely observed in real-world data; uniformly
modelling the randomness fails to capture the heterogeneity
of the data. Besides, most existing methods are not designed to deal
with massive data with excessive noise. Many of them require
full trajectories to be available [Wolinski et al. 2014], which cannot
be guaranteed in the real world, and do not handle data at the scale of
tens of thousands of people and several days long.
In this paper, we propose a new framework that addresses the
three aforementioned challenges. This framework is centered on an
analysis method which automatically decomposes a crowd scene
of a large number of trajectories into a series of modes. Each mode
comprehensively captures a unique pattern of spatial, temporal and
dynamics information. Spatially, a mode represents a pedestrian flow
which connects subspaces with specific functionalities, e.g. entrance,
exit, information desk, etc.; temporally, it captures when this flow
appears, crescendos, wanes and disappears; dynamically, it reveals
the speed preferences on this flow. With space, time and dynamics
information, each mode represents a unique recurring activity and
all modes together describe the scene semantics. These modes serve
as a highly flexible visualization tool for general and task-specific
analysis. Next, they form a natural basis from which explicable evaluation
metrics can be derived for quantitatively comparing simulated and
real crowds, both holistically and dimension-specifically (space, time
and dynamics). Lastly, they can easily automate simulation guidance,
especially in capturing the heterogeneous motion randomness in
the data.
The analysis is done by a new unsupervised clustering method
based on non-parametric Bayesian models, because manual labelling
would be extremely laborious. Specifically, Hierarchical Dirichlet
Processes (HDP) are used to disentangle the spatial, temporal and
dynamics information. Our model consists of three intertwined
HDPs and is thus named Triplet HDPs (THDP). The outcome is a
(potentially infinite) number of modes with weights. Spatially, each
mode is a crowd flow represented by trajectories sharing spatial
similarities. Temporally, it is a distribution of when the flow appears,
crescendos, peaks, wanes and disappears. Dynamically, it shows the
speed distribution of the flow. The whole dataset is then represented by
a weighted combination of all modes. The power of THDP
comes with increased model complexity, which brings challenges
in inference. We therefore propose a new method based on Markov
Chain Monte Carlo (MCMC). The method is a major generalization
of the Chinese Restaurant Franchise (CRF) method, which was originally
developed for HDP. We refer to the new inference method as
Chinese Restaurant Franchise League (CRFL). THDP and CRFL are
general and effective on datasets with great spatial, temporal and
dynamics variations. They provide a versatile basis for new methods
for visualization, simulation evaluation and simulation guidance.
Formally, we propose the first, to our best knowledge, multi-purpose
framework for crowd analysis, visualization, simulation
evaluation and simulation guidance, which includes:
(1) a new activity analysis method by unsupervised clustering.
(2) a new visualization tool for highly complex crowd data.
(3) a set of new metrics for comparing simulated and real crowds.
(4) a new approach for automated simulation guidance.
To this end, we have technical contributions which include:
(1) the first, to our best knowledge, non-parametric method that
holistically considers space, time and dynamics for crowd
analysis, simulation evaluation and simulation guidance.
(2) a new Markov Chain Monte Carlo method which achieves
effective inference on intertwined HDPs.
2 RELATED WORK
2.1 Crowd Simulation
Empirical modelling and data-driven methods have been the two
mainstreams in simulation. Empirical modelling dominated early
research, where observations of crowd motions are abstracted into
mathematical equations and deterministic systems. Crowds can be
modelled as fields or flows [Narain et al. 2009], as particle systems
[Helbing et al. 1995], or by velocity and geometric optimization
[van den Berg et al. 2008]. Social behaviors including queuing and
grouping [Lemercier et al. 2012; Ren et al. 2016] have also been
pursued. On the other hand, data-driven simulation has also been
explored, using e.g. first-person vision to guide steering behaviors
[López et al. 2019] or trajectories to extract features that describe
motions [Karamouzas et al. 2018; Lee et al. 2007]. Our research is
highly complementary to simulation research in providing analysis,
guidance and evaluation metrics. It aims to work with existing
steering and global planning methods.
2.2 Crowd Analysis
Crowd analysis has been a trendy topic in computer vision [Wang
and O’Sullivan 2016; Wang et al. 2008]. These works aim to learn structured
latent patterns in data, similar to our analysis method. However, they
consider only limited information (e.g. space only, or space/time)
compared to our method, which explicitly models
space, time, dynamics and their correlations. In contrast, another
line of scene analysis focuses on anomalies [Charalambous
et al. 2014]. Their perspective is different from ours and therefore
complementary to our approach. Trajectory analysis also plays an
important role in modern sports analysis [Sha et al. 2018, 2017], but
these works do not deal with a large number of trajectories as our method
does. Recently, deep learning has been used for crowd analysis in
trajectory prediction [Xu et al. 2018], people counting [Wang et al.
2019], scene understanding [Lu et al. 2019] and anomaly detection
[Sabokrou et al. 2017]. However, these methods either do not model low-level
behaviors or can only do short-horizon prediction (seconds). Our
research is orthogonal to theirs in focusing on the analysis and its
applications in simulations.
Besides computer vision, crowd analysis has also been investigated
in physics. In [Ali and Shah 2007], Lagrangian Particle Dynamics
is exploited for the segmentation of high-density crowd
flows and the detection of flow instabilities, where the target is similar
to our analysis. However, they only consider space when separating
flows, while our research explicitly models more comprehensive
information, including space, time and dynamics. Physics-inspired
approaches have also been applied in abnormal trajectory detection
for surveillance [Chaker et al. 2017; Mehran et al. 2009]. An
approach based on the social force model [Mehran et al. 2009] describes
individual movement at the microscopic level by placing a
grid of particles over the image. In [Chaker et al. 2017], local and global
social networks are built by constructing a set of spatio-temporal cuboids
to detect anomalies. Compared with these methods, our
anomaly detection is more informative and versatile in revealing
which attributes contribute to the abnormality.
2.3 Simulation Evaluation
How to evaluate simulations is a long-standing problem. One major
approach is to compare simulated and real crowds, via qualitative
and quantitative methods. Qualitative methods include visual
comparison [Lemercier et al. 2012] and perceptual experiments [Ennis
et al. 2011]. Quantitative methods fall into model-based methods
[Golas et al. 2013] and data-driven methods [Guy et al. 2012; Lerner
et al. 2009; Wang et al. 2016, 2017]. Individual behaviors can be
directly compared between simulation and reference data [Lerner
et al. 2009]. However, this requires full trajectories to be available,
which is difficult in practice. Our comparison is based on latent
behavioral patterns instead of individual behaviors and does not
require full trajectories. The methods in [Wang et al. 2016, 2017]
are similar to ours but only consider space. In contrast, our
approach is more comprehensive in considering space, time and
dynamics. Different combinations of these factors result in different
metrics focusing on comparing different aspects of the data. The
comparisons can be spatially or temporally focused. They
can also compare general situations or specific modes. Overall,
our method provides greater flexibility and more intuitive results.
2.4 Simulation Guidance
Quantitative simulation guidance has been investigated before,
through user control or real-world data. In the former, trajectory-based
user control signals can be converted into guiding trajectories
for simulation [Shen et al. 2018]. Predefined crowd motion ‘patches’
can be used to compose heterogeneous crowd motions [Jordao et al.
2014]. The purpose of this kind of guidance is to give the user
full control to ‘sculpt’ crowd motions. The latter guides simulations
using real-world data to mimic real crowd motions. Given
data and a parameterized simulation model, optimization is used
to fit the model to the data [Wolinski et al. 2014]. Alternatively,
features can be extracted and compared across different simulations,
so that predictions can be made about different steering methods
on a simulation task [Karamouzas et al. 2018]. Our approach also
relies heavily on data and is thus similar to the latter. But instead
of anchoring on the modelling of individual motions, it focuses on
the analysis of scene semantics/activities. It also considers intrinsic
motion randomness in a structured and principled way.
3 METHODOLOGY OVERVIEW
The overview of our framework is in Fig. 1. Without loss of generality,
we assume that the input is raw trajectories/tracklets which
can be extracted from videos by existing trackers, and from which we can
estimate the temporal and velocity information. Naively modelling
the trajectories/tracklets, e.g. by simple descriptive statistics such
as average speed, will average out useful information and cannot
capture the data heterogeneity. To capture the heterogeneity in the
presence of noise and randomness, we seek an underlying invariant
as the scene descriptor. Based on empirical observations, steady
space flows, characterized by groups of geometrically similar trajectories,
can be observed in many crowd scenes. Each flow is a
recurring activity connecting subspaces with designated functionalities,
e.g. a flow from the front entrance to the ticket office then
to a platform in a train station. Further, this flow reveals certain
semantic information, i.e. people buying tickets before going to the
platforms. Overall, all flows in a scene form a good basis to describe
the crowd activities, and this basis is an underlying invariant. How
to compute this basis is therefore vital in analysis.
However, computing such a basis is challenging. Naive statistics of
trajectories are not descriptive enough because the basis consists of
many flows, and is therefore highly heterogeneous and multi-modal.
Further, the number of flows is not known a priori. Since the flows
are formed by groups of geometrically similar trajectories/tracklets,
a natural solution is to cluster them [Bian et al. 2018]. In this specific
research context, unsupervised clustering is needed because
the sheer data volume prohibits human labelling. Among unsupervised
clustering methods, popular choices such as K-means and Gaussian Mixture
Models [Bishop 2007] require a predefined cluster number, which
is hard to know in advance. Hierarchical Agglomerative Clustering
[Kaufman and Rousseeuw 2005] does not require a predefined cluster
number, but the user must decide when to stop merging, which
is similarly problematic. Spectral clustering methods [Shi and
Malik 2000] solve this problem, but require the computation of a
similarity matrix whose space complexity is O(n²) in the number of
trajectories. Too much memory is needed for large datasets, and performance
degrades quickly with increasing matrix size. Due to the
aforementioned limitations, non-parametric Bayesian approaches
were proposed [Wang et al. 2016, 2017]. However, a new approach
is still needed because the previous approaches only consider space,
and therefore cannot be reused or adapted for our purposes.
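To make the memory argument concrete, here is a quick back-of-envelope sketch (ours, not from the paper) of the dense similarity matrix's footprint, assuming one float64 entry per trajectory pair:

```python
def similarity_matrix_bytes(n: int) -> int:
    """Bytes for a dense n x n float64 pairwise-similarity matrix."""
    return n * n * 8

# Roughly 0.7 GiB for 10,000 trajectories, but around 75 GiB for
# 100,000 -- storing the matrix alone becomes infeasible at the scale
# of crowd datasets with tens of thousands of people.
for n in (10_000, 100_000):
    print(n, similarity_matrix_bytes(n) / 2**30, "GiB")
```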
We propose a new non-parametric Bayesian method to cluster
the trajectories with time and velocity information in an unsupervised
fashion, requiring neither manual labelling nor prior
knowledge of the cluster number. The outcome of clustering is a
series of modes, each being a unique distribution over space, time
and speed. We then propose new methods for data visualization,
simulation evaluation and automated simulation guidance.
We first introduce the background of one family of non-parametric
Bayesian models, Dirichlet Processes (DPs), and Hierarchical Dirichlet
Processes (HDP) (Sec. 4.1). We then introduce our new model,
Triplet HDPs (Sec. 4.2), and our new inference method, Chinese Restaurant
Franchise League (Sec. 5). Finally, new methods are proposed
for visualization (Sec. 6.1), comparison (Sec. 6.2) and simulation
guidance (Sec. 6.3).
4 OUR METHOD
4.1 Background
Dirichlet Process. To understand DP, imagine a multi-modal 1D
dataset with five high-density areas (modes). A classic five-component
Gaussian Mixture Model (GMM) can fit the data via
Expectation-Maximization [Bishop 2007]. Now generalize
the problem by assuming that there is an unknown number
of high-density areas. In this case, an ideal solution would be to
impose a prior distribution which can represent an infinite number
of Gaussians, so that the number of Gaussians needed, their means
and covariances can be learnt automatically. DP is such a prior.
A DP(γ, H) is a probabilistic measure on measures [Ferguson
1973], with a scaling parameter γ > 0 and a base probability measure
H. A draw from a DP, G ∼ DP(γ, H), is G = Σ_{k=1}^∞ β_k δ_{ϕ_k},
where β_k ∈ β is random and dependent on γ, and ϕ_k ∈ ϕ is a variable
distributed according to H, ϕ_k ∼ H. δ_{ϕ_k} is called an atom at ϕ_k.
Specifically, for the example problem above, we can define H to be a
Normal-Inverse-Gamma (NIG) distribution so that any draw ϕ_k from H
is a Gaussian; then G becomes an Infinite Gaussian Mixture Model (IGMM)
[Rasmussen 1999]. In practice, k is finite and computed during inference.
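As an illustration (our sketch, not the paper's implementation), a truncated stick-breaking construction makes the draw G = Σ_k β_k δ_{ϕ_k} concrete; the NIG base H is simplified here to independent normal means and inverse-gamma variances:

```python
import random

def stick_breaking_dp(gamma, K, rng):
    """Truncated stick-breaking weights beta_1..beta_K for DP(gamma, H)."""
    betas, remaining = [], 1.0
    for _ in range(K):
        v = rng.betavariate(1.0, gamma)   # stick proportion v_k ~ Beta(1, gamma)
        betas.append(v * remaining)       # beta_k = v_k * prod_{i<k} (1 - v_i)
        remaining *= (1.0 - v)
    return betas

def draw_igmm(gamma, K, rng):
    """One truncated draw G = sum_k beta_k * delta_{phi_k}: an IGMM.
    H (the NIG base) is sketched here as mu ~ N(0, 1), sigma^2 ~ InvGamma(2, 1)."""
    beta = stick_breaking_dp(gamma, K, rng)
    atoms = [(rng.gauss(0.0, 1.0),                # mean of the k-th Gaussian
              1.0 / rng.gammavariate(2.0, 1.0))   # variance ~ inverse-gamma
             for _ in range(K)]
    return beta, atoms

rng = random.Random(0)
beta, atoms = draw_igmm(gamma=1.0, K=50, rng=rng)
# The truncated weights capture almost all of the unit mass.
print(round(sum(beta), 3))
```

In practice the truncation level K only needs to exceed the number of modes the data supports; inference then concentrates the weight on the few atoms that explain the data.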
Hierarchical DPs. Now imagine that the multi-modal dataset in
the example problem is observed in separate data groups. Although
all the modes can be observed from the whole dataset, only a subset
of the modes can be observed in any particular data group. To model
this phenomenon, a parent DP is used to capture all the modes, with
a child DP modelling the modes in each group:

    G_j ∼ DP(α_j, G), or G_j = Σ_{i=1}^∞ β_{ji} δ_{ψ_{ji}}, where G = Σ_{k=1}^∞ β_k δ_{ϕ_k}    (1)

where G_j is the modes in the j-th data group, α_j is the scaling factor
and G is its base distribution. β_{ji} is the weight and δ_{ψ_{ji}} is the atom.
Now we have the Hierarchical DPs, or HDP [Teh et al. 2006] (Fig. 2
Fig. 2. Left: HDP. Right: Triplet HDP.
Left). At the top level, the modes are captured by G ∼ DP(γ, H). In
each data group j, the modes are captured by G_j, which is dependent
on α_j and G. This way, the modes G_j in every data group come
from the common set of modes G, i.e. ψ_{ji} ∈ {ϕ_1, ϕ_2, . . ., ϕ_k}. In Fig. 2
Left, there is also a variable θ_{ji} called a factor, which indicates with
which mode (ψ_{ji}, or equally ϕ_k) the data sample x_{ji} is associated.
Finally, if H is again a NIG prior, then the HDP becomes a Hierarchical
Infinite Gaussian Mixture Model (HIGMM).
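The sharing of atoms across groups can be sketched as follows (an illustrative finite approximation, not the paper's model): the parent weights β come from a truncated GEM(γ) draw, and each child re-weights the same atoms by drawing π_j ~ Dirichlet(α_j β):

```python
import random

def gem(gamma, K, rng):
    """Truncated GEM(gamma) weights for the parent DP over shared atoms."""
    w, rem = [], 1.0
    for _ in range(K):
        v = rng.betavariate(1.0, gamma)
        w.append(v * rem)
        rem *= 1.0 - v
    w.append(rem)              # leftover mass lumped into a final component
    return w

def dirichlet(alphas, rng):
    """Sample from Dirichlet(alphas) via normalized gamma draws."""
    g = [rng.gammavariate(max(a, 1e-9), 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

rng = random.Random(1)
K = 20
beta = gem(gamma=2.0, K=K, rng=rng)        # parent: G = sum_k beta_k delta_{phi_k}
# Each child DP G_j ~ DP(alpha_j, G) re-weights the SAME atoms phi_k;
# a standard finite approximation draws pi_j ~ Dirichlet(alpha_j * beta).
alpha_j = 5.0
pi_j = dirichlet([alpha_j * b for b in beta], rng)
# Every group distributes its mass over the common atom set, so a group
# can effectively "switch off" modes it never observes.
```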
4.2 Triplet-HDPs (THDP)
We now introduce THDP (Fig. 2 Right). There are three HDPs in
THDP, modelling space, time and speed. We name them Time-HDP
(green), Space-HDP (yellow) and Speed-HDP (blue). Space-HDP
computes space modes. Time-HDP and Speed-HDP compute
the time and speed modes associated with each space mode, which
requires the three HDPs to be linked. The modeling choice of the
links will be explained later. The only observed variable in THDP
is w, an observation of a person in a frame. It includes a location-orientation
(x_{ji}), a timestamp (y_{kd}) and a speed (z_{kc}). θ^s_{ji}, θ^t_{kd} and θ^e_{kc}
are their factor variables. Given a single observation denoted as w,
we denote one trajectory as w̄, a group of trajectories as w̌, and the
whole dataset as w. Our final goal is to compute the space, time
and speed modes, given w:

    G^s = Σ_{k=1}^∞ β_k δ_{ϕ^s_k},    G^t = Σ_{l=1}^∞ ζ_l δ_{ϕ^t_l},    G^e = Σ_{q=1}^∞ ρ_q δ_{ϕ^e_q}    (2)
In THDP, a space mode is defined to be a group of geometrically
similar trajectories w̌. Since these trajectories form a flow, we also
refer to it as a space flow. A space flow's timestamps (y_{kd}s) and
speeds (z_{kc}s) are both 1D data and can be modelled in similar ways.
We first introduce the Time-HDP. One space flow w̌ might appear,
crescendo, peak, wane and disappear several times. If a Gaussian
distribution is used to represent one time peak on the timeline,
multiple Gaussians are needed. Naturally, an IGMM is used to model
the y_{kd} ∈ w̌. A possible alternative is to use Poisson Processes to
model the entry time. But IGMM is chosen due to its ability to fit
complex multi-modal distributions. It can also model a flow for the
entire duration. Next, since there are many space flows and the y_{kd}s
of each space flow form a timestamp data group, we therefore
assume that there is a common set of time peaks shared by all space
Fig. 3. From left to right: 1. A space flow. 2. Discretization and flow cell
occupancy; darker means more occupants. 3. Codebook with normalized
occupancy as probabilities, indicated by color intensities. 4. Five colored
orientation subdomains (pink indicates static).
flows and each space flow shares only a subset. This way, we use a
DP to represent all the time peaks and a child DP below the first DP
to represent the peaks in each space flow. This is an HIGMM (for the
Time-HDP) where H^t is a NIG. Similarly for the speed: z_{kc} ∈ w̌
can also have multiple peaks on the speed axis, so we use an IGMM
for this. Further, there are many space flows. We again assume that
there is a common set of speed peaks, that each space flow only has
a subset of these peaks, and use another HIGMM for the Speed-HDP.
After Time-HDP and Speed-HDP, we introduce the Space-HDP.
The Space-HDP is different because, unlike time and speed, space
data (the x_{ji}s) is 4D (2D location + 2D orientation), which means its
modes are also multi-dimensional. In contrast to time and speed, a
4D Gaussian cannot represent a group of similar trajectories well, so
we need a different distribution. Similar to [Wang et al. 2017],
we discretize the image domain (Fig. 3: 1) into an m × n grid (Fig. 3:
2). The discretization serves three purposes: 1. the cell occupancy
serves as a good feature for a flow, since a space flow occupies a fixed
group of cells; 2. it removes noise caused by frequent turns and
tracking errors; 3. it eliminates the dependence on full trajectories.
As long as instantaneous positions and velocities can be estimated,
THDP can cluster observations. This is crucial in dealing with real-world
data where full trajectories cannot be guaranteed. Next, since
there is no orientation information, the representation cannot
distinguish between flows from A to B and flows from B to A; we
therefore discretize the instantaneous orientation into 5 cardinal subdomains
(Fig. 3: 4). This makes the grid m × n × 5 (Fig. 3: 3), which now
becomes a codebook, and every 4D x_{ji} can be converted into a cell
occupancy. Note that although the grid resolution is problem-specific, it
does not affect the validity of our method.
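The conversion from a 4D observation to a codebook cell can be sketched as below (a minimal illustration; the function name, the grid parameters and the static-speed threshold are our assumptions, not the paper's):

```python
import math

def codebook_index(x, y, vx, vy, m, n, width, height, static_eps=0.05):
    """Map a 4D observation (2D position + 2D velocity) to an m x n x 5
    codebook cell. Orientation is discretized into 4 cardinal subdomains
    plus a 'static' subdomain (index 4) for near-zero speed."""
    col = min(int(x / width * n), n - 1)       # grid column in [0, n)
    row = min(int(y / height * m), m - 1)      # grid row in [0, m)
    speed = math.hypot(vx, vy)
    if speed < static_eps:                     # nearly stationary pedestrian
        o = 4
    else:                                      # quantize heading to E/N/W/S
        ang = math.atan2(vy, vx)               # heading in (-pi, pi]
        o = int(((ang + math.pi / 4) % (2 * math.pi)) // (math.pi / 2))
    return (row * n + col) * 5 + o             # flat index into the codebook

# A pedestrian at (3.2, 1.1) walking east in a 16m x 8m scene, 10 x 20 grid:
idx = codebook_index(3.2, 1.1, 1.0, 0.0, m=10, n=20, width=16.0, height=8.0)
```

Accumulating these indices over a flow's observations, then normalizing, yields the cell-occupancy histogram that the Multinomial space modes are fit to.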
Next, since the cell occupancy on the grid (after normalization)
can be seen as a Multinomial distribution, we use Multinomials to
represent space flows. This way, a space flow has high probabilities
in some cells and low probabilities in others (Fig. 3: 3). Further, we
assume the data is observed in groups and any group could contain
multiple flows. We use a DP to model all the space flows of the
whole dataset, with child DPs representing the flows in individual
data groups, e.g. video clips. This is an HDP (Space-HDP) with H^s
being a Dirichlet distribution.
With the three HDPs introduced separately, we need to link them,
which is the key of THDP. For a space flow w̌_1, all x_{ji} ∈ w̌_1 are
associated with the same space mode, denoted by ϕ^s_1, and all y_{kd} ∈
w̌_1 are associated with the time modes {ϕ^t_1}, which form a temporal
profile of ϕ^s_1. This indicates that y_{kd}'s time mode association is
dependent on x_{ji}'s space mode association. In other words, if x^1_{ji} ∈
w̌_1 (ϕ^s_1) and x^2_{ji} ∈ w̌_2 (ϕ^s_2), where x^1_{ji} = x^2_{ji} but w̌_1 ≠ w̌_2 (two
flows can partially overlap), then their corresponding y^1_{kd} ∈ w̌_1
and y^2_{kd} ∈ w̌_2 should be associated with {ϕ^t_1} and {ϕ^t_2}, where {ϕ^t_1} ≠
{ϕ^t_2} when w̌_1 and w̌_2 have different temporal profiles. We therefore
condition θ^t_{kd} on θ^s_{ji} (the left red arrow in Fig. 2 Right) so that y_{kd}'s
time mode association is dependent on x_{ji}'s space mode association.
Similarly, a conditioning is also added to θ^e_{kc} on θ^s_{ji}. This way, w's
associations to space, time and speed modes are linked. This is the
biggest feature that distinguishes THDP from a simple collection
of HDPs, which would otherwise require analyzing space,
time and dynamics separately, instead of holistically.
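This conditioning can be pictured as a data layout (our illustration, not the paper's implementation): each space mode owns one group of timestamps, and changing an observation's space-mode assignment moves its timestamp to a different group.

```python
from collections import defaultdict

def build_time_groups(observations, space_assignment):
    """Group each observation's timestamp by the space mode its x_ji
    belongs to. This mirrors the THDP link: theta^t_kd is conditioned on
    theta^s_ji, so the k-th space mode owns one timestamp data group
    holding the y's of its flow. (Illustrative layout only.)"""
    groups = defaultdict(list)
    for obs_id, (x, y, z) in observations.items():   # (space, time, speed) parts of w
        k = space_assignment[obs_id]                 # current space-mode label
        groups[k].append(y)
    return dict(groups)

# Three observations: hypothetical cell features, timestamps and speeds.
obs = {0: ("cellA", 8.0, 1.2), 1: ("cellA", 8.5, 1.1), 2: ("cellB", 17.0, 0.9)}
assign = {0: 0, 1: 0, 2: 1}
groups = build_time_groups(obs, assign)
# If observation 2's space mode later flips to 0 during inference, its
# timestamp moves to group 0 -- the grouping itself is dynamic.
```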
5 INFERENCE
Given data w, the goal is to compute the posterior distribution
p(β, ϕ^s, ζ, ϕ^t, ρ, ϕ^e | w). Existing inference methods for DPs include
MCMC [Teh et al. 2006], variational inference [Hoffman et al. 2013]
and geometric optimization [Yurochkin and Nguyen 2016]. However,
they are designed for simpler models (e.g. a single HDP). Further,
both variational inference and geometric optimization suffer from
local minima. We therefore propose a new MCMC method for
THDP. The method is a major generalization of the Chinese Restaurant
Franchise (CRF). We first give the background of CRF, then
introduce our method.
5.1 Chinese Restaurant Franchise (CRF)
A single DP has a Chinese Restaurant Process (CRP) representation.
CRF is its extension onto HDPs. We refer the readers to [Teh et al.
2006] for details on CRP. Here we directly follow the CRF metaphor
on HDP (Eq. 1, Fig. 2 Left) to compute the posterior distribution
p(β, ϕ | x). In CRF, each observation x_{ji} is called a customer. Each data
group is called a restaurant. Finally, since a customer is associated
with a mode (indicated by θ_{ji}), the mode is called a dish and is to
be learned, as if the customer ordered this dish. CRF dictates that,
in every restaurant, there is a potentially infinite number of tables,
each with only one dish and many customers sharing that dish.
There can be multiple tables serving the same dish. All dishes are on
a global menu shared by all restaurants. The global menu can also
contain an infinite number of dishes. In summary, we have multiple
restaurants with many tables where customers order dishes from a
common menu.
CRF is a Gibbs sampling approach. The sampling process is conducted
alternately at the customer and table levels. At the customer
level, each customer is treated, in turn, as a new customer,
given all the other customers sitting at their tables. Then she needs
to choose a table in her restaurant. There are two criteria influencing
her decision: 1. how many customers are already at the table
(table popularity) and 2. how much she likes the dish on that table
(dish preference). If she decides not to sit at any existing table, she
can create a new table and then order a dish. This dish can be from the
menu, or she can create a new dish and add it to the menu. Next, at
the table level, for each table, all the customers sitting at that table
are treated as a new group of customers and are asked to choose a
dish together. Their collective dish preference and how frequently
the dish is ordered in all restaurants (dish popularity) will influence
their choice. They can choose a dish from the menu or create a new
ALGORITHM 1: Chinese Restaurant Franchise
Result: β, ϕ (Eq. 1)
1  Input: x;
2  while Not converged do
3      for every restaurant j do
4          for every customer x_{ji} do
5              Sample a table t_{ji} (Eq. 11, Appx. A);
6              if a new table is chosen then
7                  Sample a dish or create a new dish (Eq. 12, Appx. A);
8              end
9          end
10         for every table and its customers x_{jt} do
11             Sample a new dish (Eq. 13, Appx. A);
12         end
13     end
14     Sample hyperparameters [Teh et al. 2006];
15 end
one and add it to the menu. We give the algorithm in Algorithm 1
and refer the readers to Appx. A for more details.
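The customer-level seating rule can be sketched numerically as follows (a schematic of the structure of the CRP/CRF choice with made-up likelihood values, not the actual terms of Eq. 11):

```python
def table_probs(n_customers, dish_like, alpha, new_like):
    """Normalized CRF seating probabilities for one customer:
    existing table t: (#customers at t) * likelihood of its dish for x
                      (popularity times preference);
    new table:        alpha * prior-predictive likelihood of x.
    All likelihood values here are invented for illustration."""
    weights = [n * f for n, f in zip(n_customers, dish_like)]
    weights.append(alpha * new_like)           # the "new table" option
    total = sum(weights)
    return [w / total for w in weights]

# Three existing tables with 5, 2 and 1 customers; the customer likes
# table 2's dish (0.6) far more than the others.
p = table_probs([5, 2, 1], [0.1, 0.6, 0.2], alpha=1.0, new_like=0.05)
# Table 2 wins: moderate popularity (2) times strong preference (0.6)
# outweighs the crowded table 1 with its disliked dish.
```

The table-level step works the same way, except a whole table's customers jointly re-choose a dish, weighted by dish popularity across all restaurants.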
5.2 Chinese Restaurant Franchise League (CRFL)
We generalize CRF by proposing a new method called Chinese
Restaurant Franchise League. We first change the naming convention
by adding the prefixes space-, time- and speed- to customers, restaurants
and dishes to distinguish between corresponding variables in
the three HDPs. For instance, an observation w now contains a
space-customer x_{ji}, a time-customer y_{kd} and a speed-customer z_{kc}.
CRFL is a Gibbs sampling scheme, shown in Algorithm 2. The differences
between CRF and CRFL are on two levels. At the top level,
CRFL generalizes CRF by running CRF alternately on the three HDPs.
This makes use of the conditional independence between the Time-HDP
and the Speed-HDP given a fixed Space-HDP. At the bottom
level, there are three major differences in the sampling: between
Eq. 11 and Eq. 3, Eq. 12 and Eq. 4, and Eq. 13 and Eq. 5.
ALGORITHM 2: Chinese Restaurant Franchise League
Result: β, ϕ^s, ζ, ϕ^t, ρ, ϕ^e (Eq. 2)
1  Input: w;
2  while Not converged do
3      Fix all variables in Space-HDP;
4      Do one CRF iteration (lines 3-13, Algorithm 1) on Time-HDP;
5      Do one CRF iteration (lines 3-13, Algorithm 1) on Speed-HDP;
6      for every space-restaurant j in Space-HDP do
7          for every space-customer x_{ji} do
8              Sample a table t_{ji} (Eq. 3);
9              if a new table is chosen then
10                 Sample a dish or create a new dish (Eq. 4);
11             end
12         end
13         for every table and its space-customers x_{jt} do
14             Sample a new space-dish (Eq. 5);
15         end
16     end
17     Sample hyperparameters (Appx. B.3);
18 end
The first difference is that when we do the customer-level sampling (line
8 in Algorithm 2), the left side of Eq. 11 in CRF becomes:

p(t_{ji} = t, x_{ji}, y_{kd}, z_{kc} | x^{−ji}, t^{−ji}, k, y^{−kd}, o^{−kd}, l, z^{−kc}, p^{−kc}, q)   (3)

where t_{ji} is the new table for space-customer x_{ji}, and y_{kd} and z_{kc} are
the associated time- and speed-customers. x^{−ji} and t^{−ji} are the other customers
(excluding x_{ji}) in the j-th space-restaurant and their choices of tables, and
k is the space dishes. Correspondingly, y^{−kd} and o^{−kd} are the other
time-customers (excluding y_{kd}) in the k-th time-restaurant and their
choices of tables, and l is the time dishes. Similarly, z^{−kc} and p^{−kc} are the
other speed-customers (excluding z_{kc}) in the k-th speed-restaurant
and their choices of tables, and q is the speed dishes. The intuitive
interpretation of the difference between Eq. 3 and Eq. 11 is: when a
space-customer x_{ji} chooses a table, the popularity and preference
are no longer the only criteria. She also has to consider the preferences
of her associated time-customer y_{kd} and speed-customer z_{kc}.
This is because when x_{ji} orders a different space-dish, y_{kd} and z_{kc}
will be placed into a different time-restaurant and speed-restaurant,
because the organizations of the time- and speed-restaurants
depend on the space-dishes (the dependence of θ^t_{kd} and θ^e_{kc} on
θ^s_{ji}). Each space-dish corresponds to a time-restaurant and a
speed-restaurant (see Sec. 4.2). Since a space-customer's choice of
space-dish can change during CRFL, the organization of the time- and
speed-restaurants becomes dynamic! This is why CRF cannot be
directly applied to THDP.
The second difference is that when we need to sample a dish (line 10
in Algorithm 2), the left side of Eq. 12 in CRF becomes:

p(k_{jt^{new}} = k, x_{ji}, y_{kd}, z_{kc} | k^{−jt^{new}}, y^{−kd}, o^{−kd}, l, z^{−kc}, p^{−kc}, q)
    ∝ m_{·k} p(x_{ji} | ···) p(y_{kd} | ···) p(z_{kc} | ···)   for an existing dish k, or
    ∝ γ p(x_{ji} | ···) p(y_{kd} | ···) p(z_{kc} | ···)        for a new dish   (4)

where k_{jt^{new}} is the new dish for customer x_{ji}. ··· represents all
the conditional variables for simplicity. p(y_{kd} | ···) and p(z_{kc} | ···)
are the major differences. We refer the readers to Appx. B regarding
the computation of Eq. 3 and Eq. 4.
The last difference is that when we do the table-level sampling (line
14 in Algorithm 2), the left side of Eq. 13 in CRF changes to:

p(k_{jt} = k, x_{jt}, y_{kd_{jt}}, z_{kc_{jt}} | k^{−jt}, y^{−kd_{jt}}, o^{−kd_{jt}}, l^{−ko}, z^{−kc_{jt}}, p^{−kc_{jt}}, q^{−kp})
    ∝ m^{−jt}_{·k} p(x_{jt} | ···) p(y_{kd_{jt}} | ···) p(z_{kc_{jt}} | ···)   for an existing dish k, or
    ∝ γ p(x_{jt} | ···) p(y_{kd_{jt}} | ···) p(z_{kc_{jt}} | ···)              for a new dish   (5)

where x_{jt} is the space-customers at the t-th table, and y_{kd_{jt}} and z_{kc_{jt}} are
the associated time- and speed-customers. k^{−jt}, y^{−kd_{jt}}, o^{−kd_{jt}}, l^{−ko},
z^{−kc_{jt}}, p^{−kc_{jt}} and q^{−kp} are the rest and their table and dish choices in
the three HDPs. ··· represents all the conditional variables for simplicity.
p(x_{jt} | ···) is the Multinomial f as in Eq. 13. Unlike Eq. 4,
p(y_{kd_{jt}} | ···) and p(z_{kc_{jt}} | ···) cannot be easily computed and need
special treatment. We refer the readers to Appx. B for details.
Now we have fully derived CRFL. Given a data set w, we can compute
the posterior distribution p(β, ϕ^s, ζ, ϕ^t, ρ, ϕ^e | w), where β, ζ
and ρ are the weights of the space, time and speed dishes ϕ^s, ϕ^t and
ϕ^e respectively. ϕ^s are Multinomials; ϕ^t and ϕ^e are Gaussians. As
mentioned in Sec. 5.1, the number of ϕ^s is automatically learnt, so
we do not need to know the number of space dishes in advance; neither
do we need it for ϕ^t and ϕ^e. This makes THDP non-parametric.
Further, since one ϕ^s could be associated with a potentially infinite
number of ϕ^t s and ϕ^e s and vice versa, the many-to-many
associations are also automatically learnt.
5.3 Time Complexity of CRFL
For each sampling iteration in Algorithm 2, the time complexities of
sampling on the Time-HDP, Speed-HDP and Space-HDP are O[W(N+L) + KNL],
O[W(A+Q) + KAQ] and O[W(M+K) + 2W(K+1)η + JMK]
respectively, where η = N + L + A + Q. W is the total observation
number. K, L and Q are the dish numbers of space, time and speed. J
is the number of space-restaurants. M, N and A are the average table
numbers in space-, time- and speed-restaurants respectively. Note
that K appears in all three time complexities because the number of
space-dishes is also the number of time- and speed-restaurants.

The time complexity of CRFL is therefore O[W(N+L) + KNL] + O[W(A+Q) + KAQ]
+ O[W(M+K) + 2W(K+1)η + JMK]. This time complexity
is not high in practice. W can be large, depending on the dataset,
in which case subsampling could be used to reduce the observation
number. In addition, K is normally smaller than 50 even for highly
complex datasets, and L and Q are even smaller. J is decided by the user
and lies in the range of 10-30. M, N and A are not large either, due to the
high aggregation property of DPs: each table tends to be chosen
by many customers, so the table number is low.
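As a quick sanity check of the analysis above, the per-iteration cost can be evaluated directly. The helper below and the plugged-in magnitudes are illustrative only, using the ranges the section quotes (K below 50, J in 10-30):

```python
def crfl_cost(W, K, L, Q, J, M, N, A):
    """Per-iteration operation count of CRFL (Sec. 5.3, constants dropped)."""
    eta = N + L + A + Q
    time_hdp = W * (N + L) + K * N * L
    speed_hdp = W * (A + Q) + K * A * Q
    space_hdp = W * (M + K) + 2 * W * (K + 1) * eta + J * M * K
    return time_hdp + speed_hdp + space_hdp

# Typical magnitudes from the text: the cost is dominated by the W-linear terms.
cost = crfl_cost(W=40_000, K=50, L=10, Q=10, J=30, M=20, N=20, A=20)
```

Since every W-dependent term is linear, doubling the observation number roughly doubles the cost, which is why subsampling W is the effective lever on large datasets.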
6 VISUALIZATION, METRICS AND SIMULATION
GUIDANCE BASED ON THDP
THDP provides a powerful and versatile base for new tools. In
this section, we present three tools for structured visualization,
quantitative comparison and simulation guidance.
6.1 Flexible and Structured Crowd Data Visualization
After inference, the highly rich but originally mixed and unstructured
data is now structured. This is vital for visualization. The time
and speed modes are immediately easy to visualize, as they are
mixtures of univariate Gaussians. The space modes require further
treatment because they are m × n × 5 Multinomials and hard to visualize.
We therefore propose to use them as classifiers to classify
trajectories. After classification, we select representative trajectories
for a clear and intuitive visualization of flows. Given a trajectory w̄,
we compute a softmax function:

p_k(w̄) = e^{p_k(w̄)} / ∑_{k'=1}^{K} e^{p_{k'}(w̄)},   k ∈ [1, K]   (6)

where p_k(w̄) = p(w̄ | β_k, ϕ^s_k, ζ_k, ϕ^t, ρ_k, ϕ^e). ϕ^s_k and β_k are the k-th
space mode and its weight; the others are the associated time and
speed modes. The time and speed modes (ϕ^t and ϕ^e) are associated
with the space flow ϕ^s_k with weights ζ_k and ρ_k. K is the total number
of space flows. This way, we classify every trajectory into a space
flow. We can then visualize representative trajectories with high
probabilities, or show anomalous trajectories with low probabilities.
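Numerically, Eq. 6 is best evaluated in log space because the trajectory likelihoods p_k(w̄) are tiny. A small illustrative sketch (the helper is our own, not the paper's code):

```python
import numpy as np

def classify_trajectory(log_liks):
    """Softmax over per-flow log-likelihoods (a log-space version of Eq. 6),
    returning the winning space flow and the full posterior over flows."""
    z = np.asarray(log_liks, dtype=float)
    z -= z.max()            # stabilize: exponentiate only non-positive values
    p = np.exp(z)
    p /= p.sum()
    return int(p.argmax()), p

# Three candidate flows; this trajectory clearly fits flow 1 best.
k, p = classify_trajectory([-120.4, -35.2, -98.7])
```

Trajectories whose best posterior probability is still low under every flow are the natural candidates for the anomaly visualization of Sec. 7.2.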
In addition, since THDP captures all of space, time and dynamics,
a variety of visualizations is possible. A period of time can be represented
by a weighted combination of time modes {ϕ^t}. Assuming that the
user wants to see which space flows are prominent during this period,
we can visualize trajectories based on p(β, ϕ^s | {ϕ^t}), marginalizing
over ρ and ϕ^e, which gives the space flows with weights. This is very useful
if, for instance, {ϕ^t} is the rush hours: p(β, ϕ^s | {ϕ^t}) shows us which flows are
prominent and their relative importance during the rush hours.
Similarly, if we visualize data based on p(ρ, ϕ^e | ϕ^s), marginalizing over
ζ and ϕ^t, it will tell us whether people walk fast or slowly on the space flow
ϕ^s. A more complex visualization is p(ζ, ϕ^t, ρ, ϕ^e | ϕ^s), where the time-speed
distribution is given for a space flow ϕ^s. This gives the speed change against
time on this space flow, which could reveal congestion at times.

Through marginalizing and conditioning on different variables
(as above), there are many possible ways of visualizing crowd data,
and each of them reveals a certain aspect of the data. We do not
enumerate all the possibilities for simplicity, but it is clear
that THDP can provide highly flexible and insightful visualizations.
6.2 New Quantitative Evaluation Metrics
Being able to quantitatively compare simulated and real crowds is
vital in evaluating the quality of crowd simulation. Trajectory-based
[Guy et al. 2012] and flow-based [Wang et al. 2016] methods have
been proposed. The first flow-based metrics were proposed in [Wang
et al. 2016], which is similar to our approach. In their work, the two
metrics proposed were average likelihood (AL) and distribution
pair distance (DPD), based on Kullback-Leibler (KL) divergence. The
underlying idea is that a good simulation does not have to strictly
reproduce the data but should have statistical similarities with it.
However, they only considered space. We show that THDP
is a major generalization of their work and provides much more
flexibility with a set of new AL and DPD metrics.
6.2.1 AL Metrics. Given a simulated data set ŵ = (x̂_{ji}, ŷ_{kd}, ẑ_{kc})
and p(β, ϕ^s, ζ, ϕ^t, ρ, ϕ^e | w) inferred from real-world data w, we can
compute the AL metric based on space only, essentially computing
the average space likelihood while marginalizing time and speed:

(1/|ŵ|) ∑_{j,i} ∑_{k=1}^{K} β_k ∫_z ∫_y p(x̂_{ji} | ϕ^s_k, ŷ_{kd}, ẑ_{kc}) p(ŷ_{kd}) p(ẑ_{kc}) dy dz   (7)

where |ŵ| is the number of observations in ŵ. The dependence on β,
ϕ^s, ζ, ϕ^t, ρ and ϕ^e is omitted for simplicity. If we completely discard
time and speed, Eq. 7 reduces to the AL metric in [Wang et al. 2017],
(1/|ŵ|) ∑_{j,i} ∑_k β_k p(x̂_{ji} | ϕ^s_k). That metric is therefore just a special case
of THDP's. We give a list of AL metrics in Table 1, which all have
similar forms to Eq. 7.
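As a concrete instance, the space-only special case (1/|ŵ|) Σ_{j,i} Σ_k β_k p(x̂_ji | ϕ^s_k) takes only a few lines. Representing a Multinomial space mode as a dictionary from discretized grid cell to probability is our own simplification:

```python
def al_space_only(x_hat, beta, space_modes):
    """Space-only AL metric: average over observations of the
    beta-weighted mixture likelihood under the Multinomial space modes."""
    total = 0.0
    for x in x_hat:  # x is a discretized grid-cell index
        total += sum(b * phi.get(x, 0.0) for b, phi in zip(beta, space_modes))
    return total / len(x_hat)

# Two toy space modes over a 2-cell "scene", with equal weights.
score = al_space_only(x_hat=[0, 1],
                      beta=[0.5, 0.5],
                      space_modes=[{0: 0.8, 1: 0.2}, {0: 0.2, 1: 0.8}])
```

Here every observation scores 0.5; a simulation concentrated in cells where the modes place little mass would score lower, which is exactly the behavior the metric rewards and penalizes.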
6.2.2 DPD Metrics. AL metrics are based on average likelihoods,
summarizing the differences between two data sets into one number.
To give more flexibility, we also propose distribution-pair metrics.
We first learn two posterior distributions p(β̂, ϕ̂^s, ζ̂, ϕ̂^t, ρ̂, ϕ̂^e | ŵ)
and p(β, ϕ^s, ζ, ϕ^t, ρ, ϕ^e | w). Then we can compare individual
Metric                                          To compare
1. (1/|ŵ|) ∑ p(x̂_{ji}, ŷ_{kd}, ẑ_{kc} | •)      overall similarity
2. (1/|ŵ|) ∑ p(x̂_{ji}, ŷ_{kd} | •)              space & time, ignoring speed
3. (1/|ŵ|) ∑ p(x̂_{ji}, ẑ_{kc} | •)              space & speed, ignoring time
4. (1/|ŵ|) ∑ p(ŷ_{kd}, ẑ_{kc} | •)              time & speed, ignoring space
5. (1/|ŵ|) ∑ p(x̂_{ji} | •)                      space, ignoring time & speed
6. (1/|ŵ|) ∑ p(ŷ_{kd} | •)                      time, ignoring space & speed
7. (1/|ŵ|) ∑ p(ẑ_{kc} | •)                      speed, ignoring space & time
Table 1. AL Metrics. • represents {β, ϕ^s, ζ, ϕ^t, ρ, ϕ^e}.
pairs of ϕ^s and ϕ̂^s, ϕ^t and ϕ̂^t, and ϕ^e and ϕ̂^e. Since all the space, time and
speed modes are probability distributions, we propose to use the Jensen-Shannon
divergence, as opposed to the KL divergence [Wang et al. 2017],
due to KL's asymmetry:

JSD(P ∥ Q) = (1/2) D(P ∥ M) + (1/2) D(Q ∥ M)   (8)

where D is the KL divergence and M = (1/2)(P + Q). P and Q are probability
distributions. Again, in the DPD comparison, THDP provides
many options, similar to the AL metrics in Table 1. We only give
several examples here. Given two space flows ϕ^s and ϕ̂^s, JSD(ϕ^s ∥ ϕ̂^s)
directly compares the two space flows. Further, P and Q can be
conditional distributions: we can compute JSD(p(ϕ^t | ϕ^s) ∥ p(ϕ̂^t | ϕ̂^s)),
where ϕ^t and ϕ̂^t are the associated time modes of ϕ^s and ϕ̂^s respectively,
to compare the two temporal profiles. This is very
useful when ϕ^s and ϕ̂^s are two spatially similar flows but we want
to compare their temporal similarity. Similarly, we can also compare
their speed profiles JSD(p(ϕ^e | ϕ^s) ∥ p(ϕ̂^e | ϕ̂^s)) or their time-speed
profiles JSD(p(ϕ^t, ϕ^e | ϕ^s) ∥ p(ϕ̂^t, ϕ̂^e | ϕ̂^s)). In summary, similar
to the AL metrics, different conditioning and marginalization choices
result in different DPD metrics.
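For discrete distributions (e.g. the Multinomial space modes, flattened to vectors), JSD is a few lines of NumPy; a minimal sketch:

```python
import numpy as np

def kl(p, q):
    """KL divergence D(P || Q) for discrete distributions, with 0 log 0 := 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def jsd(p, q):
    """Jensen-Shannon divergence (Eq. 8): symmetric and always finite,
    because the mixture M has support wherever P or Q does."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mix = 0.5 * (p + q)
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)
```

Unlike plain KL, this never diverges when one mode assigns zero mass where the other does not, which is common for flows occupying disjoint parts of the scene; its maximum (for natural log) is log 2.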
6.3 Simulation Guidance
We propose a new method to automate simulation guidance with
real-world data, which works with existing simulators including
steering and global planning methods. Assuming that we want to
simulate crowds in a given environment based on data, several
key parameters still need to be estimated, including
starting/destination positions, entry timing and desired
speed. After inference, we use a GMM to model both the starting and
destination regions for every space flow. This way, we completely
eliminate the need for manual labelling, which is difficult in spaces
with no designated entrances/exits (e.g. a square). We also remove
the one-to-one mapping requirement between the agents in simulation
and in data: we can sample any number of agents based on the space flow
weights (β) and still keep agent proportions on the different
flows similar to the data. In addition, since each flow comes with a temporal
and a speed profile, we sample the entry timing and desired speed
for each agent, to mimic the randomness in these parameters. It is
difficult to manually set the timing when the duration is long, and
sampling the speed is necessary to capture the speed variety within
a flow caused by latent factors such as different physical conditions.
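The per-agent sampling described above could look as follows. This is a hedged sketch under simplifying assumptions: flow endpoints are modelled as one univariate GMM per axis rather than a full 2-D GMM, and all field names (`start_gmm`, `time_mixture`, ...) are our own:

```python
import numpy as np

def draw_gmm(gmm, rng):
    """Draw from a univariate Gaussian mixture given (weights, means, stds)."""
    w, mu, sd = (np.asarray(a, float) for a in gmm)
    i = rng.choice(len(w), p=w / w.sum())
    return rng.normal(mu[i], sd[i])

def sample_agent(flows, rng):
    """One agent: pick a flow by its weight beta, then draw start/goal,
    entry time and desired speed from that flow's learnt profiles."""
    betas = np.array([f["beta"] for f in flows])
    flow = flows[rng.choice(len(flows), p=betas / betas.sum())]
    return {
        "start": [draw_gmm(g, rng) for g in flow["start_gmm"]],  # per-axis GMMs
        "goal": [draw_gmm(g, rng) for g in flow["goal_gmm"]],
        "entry_time": draw_gmm(flow["time_mixture"], rng),
        "speed": abs(draw_gmm(flow["speed_mixture"], rng)),
    }

rng = np.random.default_rng(1)
flow = {"beta": 1.0,
        "start_gmm": [([1.0], [0.0], [0.5]), ([1.0], [0.0], [0.5])],
        "goal_gmm": [([1.0], [10.0], [0.5]), ([1.0], [8.0], [0.5])],
        "time_mixture": ([0.7, 0.3], [9.0, 17.5], [0.5, 0.5]),  # e.g. two peaks
        "speed_mixture": ([1.0], [1.4], [0.2])}
agent = sample_agent([flow], rng)
```

Because agents are drawn by flow weight rather than matched one-to-one to tracked people, any crowd size can be generated while preserving the data's flow proportions.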
Next, even with the right settings of all the aforementioned parameters,
existing simulators tend to simulate straight lines whenever
possible, while the real data shows otherwise. This is because no
intrinsic motion randomness is introduced. Intrinsic motion randomness
can be observed in that people rarely walk in straight lines,
and they generate slightly different trajectories even when asked
to walk several times between the same starting position and destination
[Wang et al. 2017]. This is related to the state of the person
as well as external factors such as collision avoidance. Individual
motion randomness can be modelled by assuming the randomness
is Gaussian-distributed [Guy et al. 2012]. Here, we do not assume
that all people have the same distribution. Instead, we propose
structured modelling. We observe that people on different space
flows show different dynamics but share similar dynamics within
the same flow. This is because people on the same flow share the
same starting/destination regions and walk through the same part
of the environment. In other words, they start in similar positions,
have similar goals and make similar navigation decisions. Although
individual motion randomness still exists, their randomness is likely
to be similarly distributed. However, this is not necessarily true
across different flows. We therefore assume that each space flow can
be seen as generated by a unique dynamic system which captures
the within-group motion randomness and implicitly considers
factors such as collision avoidance. Given a trajectory w̄ from a
flow w̌, we assume that there is an underlying dynamic system:

x^{w̄}_t = A s_t + ω_t,   ω ∼ N(0, Ω)
s_t = B s_{t−1} + λ_t,   λ ∼ N(0, Λ)   (9)
where x^{w̄}_t is the observed location of a person at time t on trajectory
w̄, and s_t is the latent state of the dynamic system at time t. ω_t and λ_t
are the observational and dynamics randomness; both are white
Gaussian noises. A and B are transition matrices. We assume that Ω
is a known diagonal covariance matrix because it is intrinsic to the
device (e.g. a camera) and can be trivially estimated. We also assume
that A is an identity matrix, so that there is no systematic bias and the
observation is only subject to the state s_t and the noise ω_t. The dynamic
system then becomes x^{w̄}_t ∼ N(I s_t, Ω) and s_t ∼ N(B s_{t−1}, Λ), where
we need to estimate s_t, B and Λ. Given the U trajectories in w̌, the
total likelihood is:

p(w̌) = ∏_{i=1}^{U} p(w̄_i),  where  p(w̄_i) = ∏_{t=2}^{T_i−1} p(x^i_t | s_t) p(s_t | s_{t−1}),  s_1 = x^i_1, s_{T_i} = x^i_{T_i}   (10)

where T_i is the length of trajectory w̄_i. We maximize log p(w̌) via
Expectation-Maximization [Bishop 2007]; details can be found in
Appx. C. After learning the dynamic system for a space flow, and
given a starting and a destination location s_1 and s_T, we can sample
diversified trajectories while obeying the flow dynamics. During
simulation guidance, one target trajectory is sampled for each agent,
and this trajectory reflects the motion randomness.
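Once B and Λ are estimated, rolling the system forward is straightforward. The sketch below simply pins the first and last states to the requested endpoints, a crude stand-in for properly conditioning the LDS on both s_1 and s_T (which the paper handles in Appx. C); all names are ours:

```python
import numpy as np

def sample_trajectory(B, Lam, Omega, s1, sT, T, rng):
    """Roll out Eq. 9 with A = I: s_t = B s_{t-1} + lambda_t and
    x_t = s_t + omega_t, pinning the endpoints to s1 and sT."""
    d = len(s1)
    s = np.asarray(s1, float)
    xs = [np.asarray(s1, float)]
    for _ in range(T - 2):
        s = B @ s + rng.multivariate_normal(np.zeros(d), Lam)   # latent step
        xs.append(s + rng.multivariate_normal(np.zeros(d), Omega))  # observation
    xs.append(np.asarray(sT, float))
    return np.stack(xs)

rng = np.random.default_rng(0)
traj = sample_trajectory(B=np.eye(2), Lam=0.01 * np.eye(2),
                         Omega=0.01 * np.eye(2),
                         s1=[0.0, 0.0], sT=[5.0, 5.0], T=20, rng=rng)
```

Repeated calls produce visibly different but similarly distributed paths, which is exactly the within-flow motion randomness the guidance feeds to each agent.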
7 EXPERIMENTS
In this section, we first introduce the datasets, then show our highly
informative and flexible visualization tool. Next, we give quantitative
comparison results between simulated and real crowds using the newly
proposed metrics. Finally, we show that our automated simulation
Fig. 4. Forum (top), CarPark (middle) and TrainStation (bottom) datasets. In each dataset: top left: original data; P1-P9: the top 9 space modes; top right: the time modes of P1-P9; bottom right: the speed modes of P1-P9. Both time and speed profiles are scaled by their respective space mode weights, with the y axis indicating the likelihood.
guidance achieves high semantic fidelity. We only show representative
results in the paper and refer the readers to the supplementary video
and materials for details.
7.1 Datasets
We choose three publicly available datasets: Forum [Majecka 2009],
CarPark [Wang et al. 2008] and TrainStation [Yi et al. 2015], to
cover different data volumes, durations, environments and crowd
dynamics. Forum is an indoor environment in a school building,
recorded by a top-down camera, containing 664 trajectories and
lasting 4.68 hours. Only people are tracked and they are mostly
slow and casual. CarPark consists of videos of an outdoor car park
with mixed pedestrians and cars, recorded by a far-distance camera, and
contains 40,453 trajectories in total over five days. TrainStation is a big
indoor environment with pedestrians and designated sub-spaces.
It is from the New York Central Terminal and contains 120,000
frames in total, with 12,684 pedestrians within approximately 45 minutes.
The speed varies among pedestrians.
7.2 Visualization Results
We first show a general, full-mode visualization in Fig. 4. Due to
the space limit, we only show the top 9 space modes and their corresponding
time and speed profiles. Overall, THDP is effective in
decomposing highly mixed and unstructured data into structured
results across different data sets. The top 9 space modes (with time
and speed) are the main activities. With the environment information
(e.g. where the doors/lifts/rooms are), the semantic meanings
of the activities can be inferred. In addition, the time and dynamics
are captured well. A peak of a space flow (indicated by color) in
the time profiles indicates that this flow is likely to appear around
that time. Correspondingly, a peak of a space flow in the speed
profile indicates a major speed preference of the people on that flow.
Multiple space flows can peak near one point in both the time and
speed profiles. The speed profiles of Forum and TrainStation are
slightly different, with most of the former distributed in a smaller
region. This is understandable because people in TrainStation generally
walk faster. The speed profile of CarPark is quite different
in that it ranges more widely, up to 10 m/s. This is because both
pedestrians and vehicles were recorded.
Besides, we show conditioned visualization. Suppose that the
user is interested in a period (e.g. rush hours) or speed range (e.g.
Fig. 5. Left: TrainStation, Right: CarPark. The space flow prominence (indicated by bar heights) of P1-P9 in Fig. 4, given a time period (blue bars) or a speed range (orange bars). The higher the bar, the more prominent the space flow is.
Fig. 6. Space flows from Forum, CarPark and TrainStation and their time-speed distributions. The y (up) axis is likelihood. The x and z axes are time and speed. The redder, the higher the likelihood.
to see where people generally walk fast/slowly), the associated flow
weights can be visualized (Fig. 5). This allows users to see which
space flows are prominent in the chosen period or speed range.
Conversely, given a space flow of interest, we can visualize its time-speed
distribution (Fig. 6), showing how the speed changes along
time, which could help identify congestion on that flow at times.
Last but not least, we can identify anomalous trajectories and show
unusual activities. The anomalies here refer to statistical anomalies.
Although they are not necessarily suspicious behaviors or events,
they can help the user quickly reduce the number of cases that need
to be investigated. Note that an anomaly is not only a spatial
anomaly: a spatially normal trajectory may be abnormal
in time and/or speed. To distinguish between them, we first
compute the probabilities of all trajectories and select anomalies.
Then for each anomalous trajectory, we compute its relative probabilities
(its probability divided by the maximal trajectory probability) in
space, time and speed, resulting in three probabilities in [0, 1]. We then
use them (after normalization) as the barycentric coordinates of
a point inside a colored triangle. This way, we can visualize what
contributes to a trajectory's abnormality (Fig. 7). Take T1 for example. It has
a normal spatial pattern, and is therefore close to the 'space' vertex.
It is far away from both the 'time' and 'speed' vertices, indicating that T1's
time and speed patterns are very different from the others'. THDP
can thus be used as a versatile and discriminative anomaly detector.
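The triangle plot in Fig. 7 can be reproduced directly from the three relative probabilities; the choice of an equilateral triangle and the vertex placement below are our own:

```python
import numpy as np

def triangle_point(p_space, p_time, p_speed):
    """Map a trajectory's relative space/time/speed probabilities to a point
    in the triangle: normalize them and use them as barycentric weights."""
    w = np.array([p_space, p_time, p_speed], float)
    w /= w.sum()
    vertices = np.array([[0.0, 0.0],                  # 'space' vertex
                         [1.0, 0.0],                  # 'time' vertex
                         [0.5, np.sqrt(3.0) / 2.0]])  # 'speed' vertex
    return w @ vertices

# A T1-like case: spatially normal, abnormal in time and speed,
# so its point sits near the 'space' vertex.
pt = triangle_point(0.9, 0.05, 0.05)
```

A high weight on one dimension pulls the point toward that vertex, so proximity to a vertex means the trajectory is normal in that dimension and its abnormality comes from the other two.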
Non-parametric Bayesian approaches have been used for crowd
analysis before [Wang et al. 2016, 2017]. However, existing methods can
be seen as variants of the Space-HDP and cannot decompose information
in time and dynamics. Consequently, they cannot show any
results related to time & speed, as opposed to Fig. 4-7. A naive alternative
would be to use the methods in [Wang et al. 2016, 2017] to
Fig. 7. Representative anomalous trajectories. Every trajectory has a corresponding location in the triangle on the right, indicating which factors contribute more to its abnormality. For instance, T1 is close to the space vertex, meaning its spatial probability is relatively high and the main abnormality contribution comes from its time and speed. For T2, the contribution mainly comes from its speed.
first cluster the data regardless of time and dynamics, then do per-cluster
time and dynamics analysis, equivalent to using the Space-HDP
first, then the Time-HDP & Speed-HDP subsequently. However, this
kind of sequential analysis fails due to one limitation: the
spatial-only HDP misclassifies observations in the overlapped areas
of flows [Wang and O'Sullivan 2016]. The subsequent time and
dynamics analysis would then be based on wrong clustering. The simultaneity
of considering all three types of information, accomplished
by the links (red arrows in Fig. 2 Right) among the three HDPs in THDP,
is therefore essential.
7.3 Compare Real and Simulated Crowds
To compare simulated and real crowds, we ask participants (Master
and PhD students whose expertise is in crowd analysis and simulation)
to simulate crowds in Forum and TrainStation. We left CarPark
out because its excessively long duration makes it extremely difficult
for participants to observe. We built a simple UI for setting
up simulation parameters including starting/destination locations,
the entry timing and the desired speed for every agent. As for the
simulator, our approach is agnostic to the simulation method. We chose
ORCA in Menge [Curtis et al. 2016] for our experiments, but other
simulation methods would work equally well. Initially, we provided
the participants with only videos and asked them to do their best to
replicate the crowd motions. They found it difficult because they
had to watch the videos and try to remember a lot of information,
which is also a real-world problem for simulation engineers. This
suggests that different levels of detail of the information are needed
to set up simulations. The information includes variables such as
entry timings and start/end positions, which are readily available,
or descriptive statistics such as average speed, which can be relatively
easily computed. We systematically investigate their roles in
producing scene semantics. After several trials, we identified a set
of key parameters including starting/ending positions, entry timing
and desired speed. Different simulation methods require different
parameters, but these are the key parameters shared by all. We also
identified four typical settings in which we gradually provide more
and more information about these parameters. This design helps us
to identify the qualitative and quantitative importance of the key
parameters for the purpose of reproducing the scene semantics.

The first setting, denoted as Random, is where only the starting/destination
regions are given. The participants have to estimate
Information / Setting            Random   SDR   SDRT   SDRTS
Starting/Dest. Areas               ✓       ✓     ✓      ✓
Exact Starting/Dest. Positions     ×       ✓     ✓      ✓
Trajectory Entry Timing            ×       ×     ✓      ✓
Trajectory Average Speed           ×       ×     ×      ✓
Table 2. Different simulation settings and the information provided.
Metric/Simulations      Random   SDR     SDRT    SDRTS   Ours
Overall (×10⁻⁸)          7.11     20.67   37.08   40.55   57.9
Space-Only (×10⁻³)       2.7      5.3     5.3     5.5     5.1
Space-Time (×10⁻⁷)       1.23     2.96    5.56    5.77    6.02
Space-Speed (×10⁻³)      1.5      3.6     3.5     4.0     4.9
Overall (×10⁻⁷)          6.7      11.97   13.96   19.39   19.89
Space-Only (×10⁻³)       3.5      6.8     6.7     6.6     6.9
Space-Time (×10⁻⁷)       8.02     15.87   19.00   18.84   20.44
Space-Speed (×10⁻³)      2.9      5.0     4.9     6.9     6.7
Table 3. Comparison on Forum (top, first four rows) and TrainStation (bottom, last four rows) based on AL metrics. Higher is better. Numbers should only be compared within the same row.
the rest. Based on Random, we further give the exact starting/ending
positions, denoted by SDR. Next, we also give the entry timing for
each agent based on SDR, denoted by SDRT. Finally, we give the
average speed of each agent based on SDRT, denoted by SDRTS.
Random is the least-informed scenario, where the users have to estimate
many parameters, while SDRTS is the most-informed one.
A comparison between the four settings is shown in Table 2.
We use four AL metrics to compare simulations with data, as they
provide detailed and insightful comparisons: Overall (Table 1: 1),
Space-Only (Table 1: 5), Space-Time (Table 1: 2) and Space-Speed
(Table 1: 3), and show the comparisons in Table 3. In Random, the
users had to guess the exact entrance/exit locations, entry timing
and speed. This is very difficult to do by just watching videos, and thus
Random has the lowest score across the board. When provided with exact
entrance/exit locations (SDR), the score is boosted in Overall and
Space-Only, but the scores in Space-Time and Space-Speed remain
relatively low. As more information is provided (SDRT & SDRTS),
the scores generally increase. This shows that our metrics are sensitive
to space, time and dynamics information during comparisons.
Further, each type of information is isolated out in the comparison.
The Space-Only scores are roughly the same between SDR, SDRT
and SDRTS. The Space-Time scores do not change much between
SDRT and SDRTS. This isolation in comparisons makes our AL metrics
ideal for evaluating simulations in different aspects, providing
great flexibility, which is necessary in practice.
Next, we show that it is possible to do more detailed comparisons
using DPD metrics. Due to the space limit, we show one space flow
from all simulation settings (Fig. 8), and compare them in space
only (DPD-Space), time only (DPD-Time) and time-speed (DPD-TS)
in Table 4. In DPD-Space, all settings perform similarly because
the space information is provided in all of them. In DPD-Time,
SDRT & SDRTS are better because they are both provided with the
timing information. What is interesting is that SDRTS is worse than
SDRT on the two flows in DPD-TS. Their main difference is that
the desired speed in SDRTS is set to be the average speed of that
trajectory, while the desired speed in SDRT is randomly drawn from
Metric/Simulations   SDR      SDRT     SDRTS    Ours
DPD-Space            0.4751   0.3813   0.4374   0.2988
DPD-Time             0.3545   0.0795   0.064    0.0419
DPD-TS               1.0      0.8879   1.0      0.4443
DPD-Space            0.2753   0.2461   0.2423   0.1173
DPD-Time             0.0428   0.0319   0.0295   0.0213
DPD-TS               0.9970   0.8157   0.9724   0.5091
Table 4. Comparison on space flow P2 in Forum (top, first three rows) and space flow P1 in TrainStation (bottom, last three rows) based on DPD metrics, both shown in Fig. 4. Lower is better.
a Gaussian estimated from real data. The latter achieves slightly
better performance on both flows in DPD-TS.
Quantitative metrics for comparing simulated and real crowds
have been proposed before. However, they either only compare
individual motions [Guy et al. 2012] or only space patterns [Wang
et al. 2016, 2017]. Holistically considering space, time & speed has
a combinatorial effect, leading to many explicable metrics evaluating
different aspects of crowds (the AL & DPD metrics). This makes
multi-faceted comparisons possible, which is unachievable with
existing methods. Technically, the flexible design of THDP allows
for different choices of marginalization, which greatly increases
the evaluation versatility. This shows the theoretical superiority of
THDP over existing methods.
7.4 Guided Simulations
Our automated simulation guidance proves superior to careful
manual settings. We first show the AL results in Table 3. Our guided
simulation outperforms all other settings that were carefully and
manually set up. The superior performance is achieved in the Overall
comparisons as well as in most dimension-specific comparisons.
Next, we show the same space flow of our guided simulation in
Fig. 8, in comparison with the other settings. Qualitatively, SDR, SDRT
and SDRTS generate narrower flows because straight lines are simulated.
In contrast, our simulation shows more realistic intra-flow
randomness, which leads to a wider flow. It is much more similar to
the real data. Quantitatively, we show the DPD results in Table 4.
Again, our automated guidance outperforms all other settings.

Automated simulation guidance has only been attempted by a
few researchers before [Karamouzas et al. 2018; Wolinski et al. 2014].
However, their methods aim to guide simulators to reproduce low-level
motions for overall similarity with the data. Our approach
aims to inform simulators with structured scene semantics. Moreover,
it gives freedom to the users so that the full semantics
or partial semantics (e.g. the top n flows) can be used to simulate
crowds, which no previous method can provide.
7.5 Implementation Details
For space discretization, we divide the image space of Forum, CarPark
and TrainStation uniformly into 40 × 40, 40 × 40 and 120 × 120 pixel
grids respectively. Since Forum is recorded by a top-down camera,
we directly estimate the velocity from two consecutive observations
in time. For CarPark and TrainStation, we estimate the velocity by
reconstructing a top-down view via perspective projection. THDP
also has hyperparameters, such as the scaling factors of every DP
(6 in total).

Fig. 8. Space flow P2 in Forum (top) and P1 in TrainStation (bottom) in different simulations. The y axes of the time and speed profiles indicate likelihood.

Our inference method is not very sensitive to
them because they are also sampled, as part of the CRFL sampling.
Please refer to Appx. B.3 for details. In inference, we have a burn-in
phase, during which we only use CRF on the Space-HDP and ignore
the other two HDPs. After the burn-in phase, we use CRFL on the
full THDP. We found that this greatly helps the convergence of the
inference. For crowd simulation, we use ORCA in Menge [Curtis
et al. 2016].
We randomly select 664 trajectories in Forum, 1,000 trajectories
in CarPark and 1,000 trajectories in TrainStation for performance
tests. In each experiment, we split the data into segments in the time
domain to mimic fragmented video observations. The number of
segments is a user-defined hyperparameter and depends on the
nature of the dataset. We chose the segment numbers to be 384, 87
and 28 for Forum, CarPark and TrainStation respectively, to cover
situations where the video is finely or roughly segmented. During
training, we first run 5k CRF iterations on the Space-HDP only in
the burn-in phase, then do the full CRFL on the whole THDP to
speed up the mixing. After training, the numbers of space, time and
speed modes are 25, 5 and 7 in Forum; 13, 6 and 6 in CarPark; and 16, 3
and 4 in TrainStation. The training took 85.1, 11.5 and 7.8 minutes
on Forum, CarPark and TrainStation respectively, on a PC with an Intel i7-6700
3.4GHz CPU and 16GB memory.
8 DISCUSSION
We chose MCMC to avoid the local minimum issue. (Stochastic) Variational Inference (VI) [Hoffman et al. 2013] and Geometric Optimization [Yurochkin and Nguyen 2016] are theoretically faster. However, VI for a single HDP is already prone to local minima [Wang et al. 2016]. We also found the same issue with geometric optimization. Also, can we use three independent HDPs? Using independent HDPs essentially breaks the many-to-many associations between the space, time and speed modes. It can cause mis-clustering because the clustering is done on different dimensions separately [Wang and O'Sullivan 2016].
The biggest limitation of our method is that it does not consider cross-scene transferability. Since the analysis focuses on the semantics of a given scene, it is unclear how the results can inspire simulation settings in unseen environments. In addition, our metrics do not directly reflect visual similarities at the individual level. We deliberately avoid agent-level one-to-one comparison, to allow greater flexibility in simulation settings while maintaining statistical similarities. Also, we currently do not model high-level behaviors such as grouping, queuing, etc. This is because such information can only be obtained through human labelling, which would incur a massive workload and therefore be impractical on the chosen datasets. We intentionally chose unsupervised learning to deal with large datasets.
9 CONCLUSIONS AND FUTURE WORK
In this paper, we present the first, to our best knowledge, multi-purpose framework for comprehensive crowd analysis, visualization, comparison (between real and simulated crowds) and simulation guidance. To this end, we proposed a new non-parametric Bayesian model called Triplet-HDP and a new inference method called Chinese Restaurant Franchise League. We have shown the effectiveness of our method on datasets varying in volume, duration, environment and crowd dynamics.
In the future, we would like to extend the work to cross-environment prediction. It would be ideal if the modes learnt from given environments could be used to predict crowd behaviors in unseen environments. Preliminary results show that the semantics are tightly coupled with the layout of subspaces with designated functionalities. This means a subspace-functionality based semantic transfer is possible. Besides, we will look into using semi-supervised learning to identify and learn high-level social behaviors, such as grouping and queuing.
ACKNOWLEDGEMENT
The project is partially supported by EPSRC (Ref: EP/R031193/1), the Fundamental Research Funds for the Central Universities (xzy012019048) and the National Natural Science Foundation of China (61602366).
REFERENCES
Saad Ali and Mubarak Shah. 2007. A Lagrangian particle dynamics approach for crowd flow segmentation and stability analysis. In 2007 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1–6.
Jiang Bian, Dayong Tian, Yuanyan Tang, and Dacheng Tao. 2018. A survey on trajectory clustering analysis. CoRR abs/1802.06971 (2018). arXiv:1802.06971
Christopher Bishop. 2007. Pattern Recognition and Machine Learning. Springer, New York.
Rima Chaker, Zaher Al Aghbari, and Imran N Junejo. 2017. Social network model for crowd anomaly detection and localization. Pattern Recognition 61 (2017), 266–281.
Panayiotis Charalambous, Ioannis Karamouzas, Stephen J Guy, and Yiorgos Chrysanthou. 2014. A data-driven framework for visual crowd analysis. In Computer Graphics Forum, Vol. 33. Wiley Online Library, 41–50.
Sean Curtis, Andrew Best, and Dinesh Manocha. 2016. Menge: A Modular Framework for Simulating Crowd Movement. Collective Dynamics 1, 0 (2016).
Cathy Ennis, Christopher Peters, and Carol O’Sullivan. 2011. Perceptual Effects of Scene Context and Viewpoint for Virtual Pedestrian Crowds. ACM Transactions on Applied Perception 8, 2, Article 10 (Feb. 2011), 22 pages.
Thomas S. Ferguson. 1973. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics 1, 2 (1973), 209–230.
Abhinav Golas, Rahul Narain, and Ming Lin. 2013. Hybrid Long-range Collision Avoidance for Crowd Simulation. In ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games. 29–36.
Stephen J. Guy, Jur van den Berg, Wenxi Liu, Rynson Lau, Ming C. Lin, and Dinesh Manocha. 2012. A Statistical Similarity Measure for Aggregate Crowd Dynamics. ACM Transactions on Graphics 31, 6 (2012), 190:1–190:11.
Dirk Helbing et al. 1995. Social Force Model for Pedestrian Dynamics. Physical Review E (1995).
Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic Variational Inference. Journal of Machine Learning Research 14, 1 (2013), 1303–1347.
Kevin Jordao, Julien Pettré, Marc Christie, and Marie-Paule Cani. 2014. Crowd Sculpting: A Space-time Sculpting Method for Populating Virtual Environments. Computer Graphics Forum (2014).
Ioannis Karamouzas, Nick Sohre, Ran Hu, and Stephen J. Guy. 2018. Crowd Space: A Predictive Crowd Analysis Technique. ACM Transactions on Graphics 37, 6, Article 186 (Dec. 2018), 14 pages.
Leonard Kaufman and Peter J. Rousseeuw. 2005. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons.
Kang Hoon Lee, Myung Geol Choi, Qyoun Hong, and Jehee Lee. 2007. Group behavior from video: a data-driven approach to crowd simulation. In Proceedings of the 2007 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. 109–118.
S. Lemercier, A. Jelic, R. Kulpa, J. Hua, J. Fehrenbach, P. Degond, C. Appert-Rolland, S. Donikian, and J. Pettré. 2012. Realistic Following Behaviors for Crowd Simulation. Computer Graphics Forum 31, 2 (2012), 489–498.
Alon Lerner, Yiorgos Chrysanthou, Ariel Shamir, and Daniel Cohen-Or. 2009. Data-driven evaluation of crowds. In International Workshop on Motion in Games. Springer, 75–83.
Ning Lu et al. 2019. ADCrowdNet: An Attention-injective Deformable Convolutional Network for Crowd Understanding. IEEE Conference on Computer Vision and Pattern Recognition (2019).
A. López, F. Chaumette, E. Marchand, and J. Pettré. 2019. Character navigation in dynamic environments based on optical flow. In Proceedings of Eurographics 2019. Eurographics.
B. Majecka. 2009. Statistical models of pedestrian behaviour in the Forum. MSc Dissertation. School of Informatics, University of Edinburgh, Edinburgh.
Ramin Mehran, Alexis Oyama, and Mubarak Shah. 2009. Abnormal crowd behavior detection using social force model. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 935–942.
Rahul Narain, Abhinav Golas, Sean Curtis, and Ming C. Lin. 2009. Aggregate Dynamics for Dense Crowd Simulation. ACM Transactions on Graphics 28, 5 (2009), 122:1–122:8.
Carl Edward Rasmussen. 1999. The Infinite Gaussian Mixture Model. In International Conference on Neural Information Processing Systems (NIPS’99). MIT Press, Cambridge, MA, USA, 554–560.
Jiaping Ren, Wei Xiang, Yangxi Xiao, Ruigang Yang, Dinesh Manocha, and Xiaogang Jin. 2018. HeterSim: Heterogeneous multi-agent systems simulation by interactive data-driven optimization. CoRR abs/1812.00307 (2018). arXiv:1812.00307
Zeng Ren, P. Charalambous, J. Bruneau, Q. Peng, and J. Pettré. 2016. Group modelling: A unified velocity-based approach. Computer Graphics Forum (2016).
Mohammad Sabokrou et al. 2017. Deep-cascade: Cascading 3D deep neural networks for fast anomaly detection and localization in crowded scenes. IEEE Transactions on Image Processing (2017).
Long Sha, Patrick Lucey, Yisong Yue, Xinyu Wei, Jennifer Hobbs, Charlie Rohlf, and Sridha Sridharan. 2018. Interactive sports analytics: An intelligent interface for utilizing trajectories for interactive sports play retrieval and analytics. ACM Transactions on Computer-Human Interaction (TOCHI) 25, 2 (2018), 1–32.
Long Sha, Patrick Lucey, Stephan Zheng, Taehwan Kim, Yisong Yue, and Sridha Sridharan. 2017. Fine-grained retrieval of sports plays using tree-based alignment of trajectories. (2017). arXiv:1710.02255
Yijun Shen, Joseph Henry, He Wang, Edmond S. L. Ho, Taku Komura, and Hubert P. H. Shum. 2018. Data-Driven Crowd Motion Control With Multi-Touch Gestures. Computer Graphics Forum 37, 6 (2018), 382–394.
Jianbo Shi and J. Malik. 2000. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 8 (2000), 888–905.
Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical Dirichlet Processes. Journal of the American Statistical Association 101, 476 (2006), 1566–1581.
J. van den Berg, Ming C. Lin, and Dinesh Manocha. 2008. Reciprocal Velocity Obstacles for real-time multi-agent navigation. IEEE International Conference on Robotics and Automation (2008).
He Wang, Jan Ondřej, and Carol O’Sullivan. 2016. Path Patterns: Analyzing and Comparing Real and Simulated Crowds. In Proceedings of the 20th ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D ’16). ACM, New York, NY, USA, 49–57. https://doi.org/10.1145/2856400.2856410
He Wang, Jan Ondřej, and Carol O’Sullivan. 2017. Trending Paths: A New Semantic-level Metric for Comparing Simulated and Real Crowd Data. IEEE Transactions on Visualization and Computer Graphics 23, 5 (2017), 1454–1464.
He Wang and Carol O’Sullivan. 2016. Globally Continuous and Non-Markovian Crowd Activity Analysis from Videos. Springer International Publishing, Cham, 527–544.
Qi Wang et al. 2019. Learning from Synthetic Data for Crowd Counting in the Wild. IEEE Conference on Computer Vision and Pattern Recognition (2019).
Xiaogang Wang, Keng Teck Ma, Gee-Wah Ng, and W. E. L. Grimson. 2008. Trajectory analysis and semantic region modeling using a nonparametric Bayesian model. In IEEE Conference on Computer Vision and Pattern Recognition. 1–8.
David Wolinski, Stephen J. Guy, Anne-Hélène Olivier, Ming C. Lin, Dinesh Manocha, and Julien Pettré. 2014. Parameter estimation and comparative evaluation of crowd simulations. Computer Graphics Forum 33, 2 (2014), 303–312.
Yanyu Xu et al. 2018. Encoding Crowd Interaction with Deep Neural Network for Pedestrian Trajectory Prediction. IEEE Conference on Computer Vision and Pattern Recognition (2018).
S. Yi, H. Li, and X. Wang. 2015. Understanding pedestrian behaviors from stationary crowd groups. In IEEE Conference on Computer Vision and Pattern Recognition. 3488–3496.
Mikhail Yurochkin and XuanLong Nguyen. 2016. Geometric Dirichlet Means Algorithm for topic inference. In International Conference on Neural Information Processing Systems.
A CHINESE RESTAURANT FRANCHISE
To give the mathematical derivation of the sampling process described in Sec. 5.1, we first give meanings to the variables in Fig. 2 Left. $\theta_{ji}$ is the dish choice made by $x_{ji}$, the $i$-th customer in the $j$-th restaurant. $G_j$ is the tables with dishes, and the dishes are from the global menu $G$. Since $\theta_{ji}$ indicates the choice of tables and therefore dishes, we use some auxiliary variables to represent the process. We introduce $t_{ji}$ and $k_{jt}$ as the indices of the table and the dish on the table chosen by $x_{ji}$. We also denote $m_{jk}$ as the number of tables serving the $k$-th dish in restaurant $j$, and $n_{jtk}$ as the number of customers at table $t$ in restaurant $j$ having the $k$-th dish. We also use them to represent accumulative indicators, such as $m_{\cdot k}$ representing the total number of tables serving the $k$-th dish. Finally, we use superscripts to indicate which customer or table is removed: if customer $x_{ji}$ is removed, then $n^{-ji}_{jtk}$ is the number of customers at table $t$ in restaurant $j$ having the $k$-th dish, without the customer $x_{ji}$.
Customer-level sampling. To choose a table for $x_{ji}$ (line 5 in Algorithm 1), we sample a table index $t_{ji}$:

$$p(t_{ji}=t \mid \mathbf{t}^{-ji}, \mathbf{k}) \propto \begin{cases} n^{-ji}_{jt\cdot}\, f^{-x_{ji}}_{k_{jt}}(x_{ji}) & \text{if } t \text{ already exists}\\ \alpha_j\, p(x_{ji} \mid \mathbf{t}^{-ji}, t_{ji}=t^{new}, \mathbf{k}) & \text{if } t=t^{new} \end{cases} \qquad (11)$$
where $n^{-ji}_{jt\cdot}$ is the number of customers at table $t$ (table popularity), and $f^{-x_{ji}}_{k_{jt}}(x_{ji})$ is how much $x_{ji}$ likes the $k_{jt}$-th dish, $f_{k_{jt}}$, served on that table (dish preference). $f_{k_{jt}}$ is the dish and thus is a problem-specific probability distribution. $f^{-x_{ji}}_{k_{jt}}(x_{ji})$ is the likelihood of $x_{ji}$ on $f_{k_{jt}}$. In our problem, $f_{k_{jt}}$ is Multinomial if it is the Space-HDP, or otherwise Normal. $\alpha_j$ is the parameter in Eq. 1, so it controls how likely $x_{ji}$ will create a new table, after which she needs to choose a dish according to $p(x_{ji} \mid \mathbf{t}^{-ji}, t_{ji}=t^{new}, \mathbf{k})$. When a new table is created, $t_{ji}=t^{new}$, we need to sample a dish (line 7 in Algorithm 1), indexed by $k_{jt^{new}}$, according to:
$$p(k_{jt^{new}}=k \mid \mathbf{t}, \mathbf{k}^{-jt^{new}}) \propto \begin{cases} m_{\cdot k}\, f^{-x_{ji}}_{k}(x_{ji}) & \text{if } k \text{ already exists}\\ \gamma\, f^{-x_{ji}}_{k^{new}}(x_{ji}) & \text{if } k=k^{new} \end{cases} \qquad (12)$$

where $m_{\cdot k}$ is the total number of tables across all restaurants serving the $k$-th dish (dish popularity). $f^{-x_{ji}}_{k}(x_{ji})$ is how much $x_{ji}$ likes the $k$-th dish, again the likelihood of $x_{ji}$ on $f_k$. $\gamma$ is the parameter in Eq. 1, so it controls how likely a new dish will be created.
Table-level sampling. Next we sample a dish for a table (line 11 in Algorithm 1). We denote all customers at the $t$-th table in the $j$-th restaurant as $\mathbf{x}_{jt}$. Then we sample its dish $k_{jt}$ according to:

$$p(k_{jt}=k \mid \mathbf{t}, \mathbf{k}^{-jt}) \propto \begin{cases} m^{-jt}_{\cdot k}\, f^{-\mathbf{x}_{jt}}_{k}(\mathbf{x}_{jt}) & \text{if } k \text{ already exists}\\ \gamma\, f^{-\mathbf{x}_{jt}}_{k^{new}}(\mathbf{x}_{jt}) & \text{if } k=k^{new} \end{cases} \qquad (13)$$

Similarly, $m^{-jt}_{\cdot k}$ is the total number of tables across all restaurants serving the $k$-th dish, without $\mathbf{x}_{jt}$ (dish popularity). $f^{-\mathbf{x}_{jt}}_{k}(\mathbf{x}_{jt})$ is how much the group of customers $\mathbf{x}_{jt}$ likes the $k$-th dish (dish preference). This time, $f^{-\mathbf{x}_{jt}}_{k}(\mathbf{x}_{jt})$ is a joint probability of all $x_{ji} \in \mathbf{x}_{jt}$.
Finally, in both Eq. 12 and Eq. 13, we need to sample a new dish. This is done by sampling a new distribution from the base distribution $H$, $\phi_k \sim H$. After inference, the weights $\beta$ can be computed as $\beta \sim Dirichlet(m_{\cdot 1}, m_{\cdot 2}, \cdots, m_{\cdot k}, \gamma)$. The choice of $H$ is related to the data. In our metaphor, the dishes of the Space-HDP are flows, so we use a Dirichlet. In the Time-HDP and Speed-HDP, the dishes are modes of time and speed, which are Normals, so we use a Normal-Inverse-Gamma for $H$. These choices are made because the Dirichlet and Normal-Inverse-Gamma are the conjugate priors of the Multinomial and Normal respectively. The whole CRF sampling is done by iteratively computing Eq. 11 to Eq. 13. The dish number will dynamically increase/decrease until the sampling mixes. In this way, we do not need to know in advance how many space flows, time modes or speed modes there are, because they will be automatically learnt.
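To make the table-choice step of Eq. 11 concrete, below is a minimal, self-contained sketch reduced to a single restaurant with toy 1D Gaussian dishes; the data structures and `dish_lik` are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def table_weights(x, tables, alpha, dish_lik):
    """Unnormalized table-choice weights of Eq. 11 for one restaurant:
    an existing table gets n_t * f_{k_t}(x) (popularity times dish
    preference); opening a new table gets weight alpha."""
    w = [tb["n"] * dish_lik(tb["dish"], x) for tb in tables]
    w.append(alpha)
    return np.array(w, float)

def sample_table(x, tables, alpha, dish_lik, rng):
    """Draw a table index; the last index means 'open a new table'."""
    w = table_weights(x, tables, alpha, dish_lik)
    return int(rng.choice(len(w), p=w / w.sum()))

# toy 1D Gaussian "dishes", purely for illustration
def dish_lik(k, x):
    means = {0: 0.0, 1: 5.0}
    return float(np.exp(-0.5 * (x - means[k]) ** 2))

tables = [{"n": 8, "dish": 0}, {"n": 2, "dish": 1}]
t = sample_table(0.1, tables, alpha=1.0, dish_lik=dish_lik,
                 rng=np.random.default_rng(0))
```

A datum near dish 0's mean is pulled toward the popular table serving that dish, which is exactly the rich-get-richer behavior the CRF metaphor encodes.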
B CHINESE RESTAURANT FRANCHISE LEAGUE
B.1 Customer Level Sampling
When we do customer-level sampling to sample a new table (line 8 in Algorithm 2), the left side of Eq. 11 becomes:

$$p(t_{ji}=t, x_{ji}, y_{kd}, z_{kc} \mid \mathbf{x}^{-ji}, \mathbf{t}^{-ji}, \mathbf{k}, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l}, \mathbf{z}^{-kc}, \mathbf{p}^{-kc}, \mathbf{q}) \qquad (14)$$
So whether $y_{kd}$ and $z_{kc}$ like the new restaurants should be taken into consideration. After applying Bayesian rules and factorization on Eq. 14, we have:

$$p(t_{ji}=t, x_{ji}, y_{kd}, z_{kc} \mid \bullet) = p(t_{ji} \mid \mathbf{t}^{-ji}, \mathbf{k})\; p(x_{ji} \mid y_{kd}, z_{kc}, t_{ji}=t, k_{jt}=k, \bullet)\; p(y_{kd} \mid t_{ji}=t, k_{jt}=k, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l})\; p(z_{kc} \mid t_{ji}=t, k_{jt}=k, \mathbf{z}^{-kc}, \mathbf{p}^{-kc}, \mathbf{q}) \qquad (15)$$
where $\bullet$ is $\{\mathbf{x}^{-ji}, \mathbf{t}^{-ji}, \mathbf{k}, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l}, \mathbf{z}^{-kc}, \mathbf{p}^{-kc}, \mathbf{q}\}$. The four probabilities on the right-hand side of Eq. 15 have intuitive meanings. $p(t_{ji} \mid \mathbf{t}^{-ji}, \mathbf{k})$ and $p(x_{ji} \mid y_{kd}, z_{kc}, t_{ji}=t, k_{jt}=k, \bullet)$ are the table popularity and dish preference of $x_{ji}$ in the Space-HDP:

$$p(t_{ji} \mid \mathbf{t}^{-ji}, \mathbf{k}) \propto \begin{cases} n^{-ji}_{jt} & \text{if } t \text{ already exists}\\ \alpha_j & \text{if } t=t^{new} \end{cases} \qquad (16)$$

$$p(x_{ji} \mid y_{kd}, z_{kc}, t_{ji}=t, k_{jt}=k, \bullet) \propto \begin{cases} f^{-x_{ji}}_{k_{jt}}(x_{ji}) & \text{if } t \text{ exists}\\ m_{\cdot k}\, f^{-x_{ji}}_{k}(x_{ji}) & \text{else if } k \text{ exists}\\ \gamma\, f^{-x_{ji}}_{k^{new}}(x_{ji}) & \text{if } k=k^{new} \end{cases} \qquad (17)$$
Eq. 16 and Eq. 17 are just reorganizations of Eq. 11 and Eq. 12. The remaining $p(y_{kd} \mid t_{ji}=t, k_{jt}=k, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l})$ and $p(z_{kc} \mid t_{ji}=t, k_{jt}=k, \mathbf{z}^{-kc}, \mathbf{p}^{-kc}, \mathbf{q})$ can be seen as how much the time-customer $y_{kd}$ and the speed-customer $z_{kc}$ like the $k$-th time and speed restaurant respectively (restaurant preference). This restaurant preference does not appear in single HDPs and thus needs special treatment. This is the first major difference between CRFL and CRF. Since we propose the same treatment for both, we only explain the time-restaurant preference treatment here.
If, every time we sample a $t_{ji}$, we compute $p(y_{kd} \mid t_{ji}=t, k_{jt}=k, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l})$ on every time table in every time-restaurant, it will be prohibitively slow. We therefore marginalize over all the time tables in a time-restaurant, to get a general restaurant preference of $y_{kd}$:

$$p(y_{kd} \mid t_{ji}=t, k_{jt}=k, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l}) = \sum_{o_{kd}=1}^{h_{k\cdot}} p(o_{kd}=o \mid t_{ji}=t, k_{jt}=k, \mathbf{y}^{-kd}, \mathbf{o}^{-kd})\; p(y_{kd} \mid o_{kd}=o, l_{ko}=l, \mathbf{l}) \qquad (18)$$
where $o_{kd}$ is the table choice of $y_{kd}$ in the $k$-th time-restaurant, $l_{ko}$ is the time-dish served on the $o$-th table in the $k$-th time-restaurant, and $h_{k\cdot}$ is the total number of tables in the $k$-th time-restaurant. Similar to Eq. 16 and Eq. 17:

$$p(o_{kd}=o \mid t_{ji}=t, k_{jt}=k, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}) \propto \begin{cases} s^{-kd}_{ko} & \text{if } o \text{ exists}\\ \epsilon_k & \text{if } o_{kd}=o^{new} \end{cases} \qquad (19)$$

where $s^{-kd}_{ko}$ is the number of time-customers already at the $o$-th table and $\epsilon_k$ is the scaling factor.
$$p(y_{kd} \mid o_{kd}=o, l_{ko}=l, \mathbf{l}) \propto \begin{cases} g^{-y_{kd}}_{l_{ko}}(y_{kd}) & \text{if } o \text{ exists}\\ h_{\cdot l}\, g^{-y_{kd}}_{l}(y_{kd}) & \text{else if } l \text{ exists}\\ \varepsilon\, g^{-y_{kd}}_{l^{new}}(y_{kd}) & \text{if } l=l^{new} \end{cases} \qquad (20)$$

where $h_{\cdot l}$ is the total number of tables serving time-dish $l$ and $g$ is the posterior predictive distribution of a Normal, a Student's t-distribution. $\varepsilon$ controls how likely a new time dish will be needed. We have now finished deriving the sampling for $p(y_{kd} \mid t_{ji}=t, k_{jt}=k, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l})$. Similar derivations can be done for $p(z_{kc} \mid t_{ji}=t, k_{jt}=k, \mathbf{z}^{-kc}, \mathbf{p}^{-kc}, \mathbf{q})$.
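The marginalization of Eq. 18 can be sketched as below. The table counts, the scaling factor and the toy dish likelihood (including the `None` stand-in for the new-dish prior of Eq. 20) are illustrative assumptions rather than the paper's actual Student's-t predictive:

```python
import numpy as np

def restaurant_preference(y, tables, eps_k, lik):
    """Restaurant preference of a time-customer y, marginalized over the
    tables of one time-restaurant (the shape of Eq. 18): sum over tables o
    of p(o_kd = o) * p(y | dish l_ko on table o)."""
    counts = np.array([s for s, _ in tables] + [eps_k], float)
    p_table = counts / counts.sum()      # table-choice term (Eq. 19), normalized
    p_dish = np.array([lik(l, y) for _, l in tables] + [lik(None, y)])
    return float(p_table @ p_dish)

# toy likelihood: a dish is (mean, std); None stands in for the new-dish prior
def lik(dish, y):
    m, s = (0.0, 10.0) if dish is None else dish
    return float(np.exp(-0.5 * ((y - m) / s) ** 2) / (s * np.sqrt(2 * np.pi)))

pref = restaurant_preference(1.0, [(5, (0.0, 1.0)), (3, (4.0, 1.0))],
                             eps_k=0.5, lik=lik)
```

This single scalar per time-restaurant is what makes the per-$t_{ji}$ update affordable compared with evaluating every table individually.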
After table sampling, we need to do dish sampling (line 10 in Algorithm 2). The left side of Eq. 12 becomes:

$$p(k_{jt^{new}}=k, x_{ji}, y_{kd}, z_{kc} \mid \mathbf{k}^{-jt^{new}}, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l}, \mathbf{z}^{-kc}, \mathbf{p}^{-kc}, \mathbf{q}) \propto \begin{cases} m^{-jt}_{\cdot k}\, p(x_{ji} \mid \cdots)\, p(y_{kd} \mid \cdots)\, p(z_{kc} \mid \cdots) & \text{if } k \text{ exists}\\ \gamma\, p(x_{ji} \mid \cdots)\, p(y_{kd} \mid \cdots)\, p(z_{kc} \mid \cdots) & \text{if } k=k^{new} \end{cases} \qquad (21)$$
The differences between Eq. 21 and Eq. 12 are $p(y_{kd} \mid \cdots)$ and $p(z_{kc} \mid \cdots)$. Both are Infinite Gaussian Mixture Models, so the likelihoods can be easily computed. We have therefore given the whole sampling process for the customer-level sampling (Eq. 14). We still need to deal with the table-level sampling.
B.2 Table Level Sampling
Similarly, when we do the table-level sampling (line 14 in Algorithm 2), the left side of Eq. 13 changes to:

$$p(k_{jt}=k, \mathbf{x}_{jt}, \mathbf{y}_{kd_{jt}}, \mathbf{z}_{kc_{jt}} \mid \mathbf{k}^{-jt}, \mathbf{y}^{-kd_{jt}}, \mathbf{o}^{-kd_{jt}}, \mathbf{l}^{-ko}, \mathbf{z}^{-kc_{jt}}, \mathbf{p}^{-kc_{jt}}, \mathbf{q}^{-kp}) \propto \begin{cases} m^{-jt}_{\cdot k}\, p(\mathbf{x}_{jt} \mid \cdots)\, p(\mathbf{y}_{kd_{jt}} \mid \cdots)\, p(\mathbf{z}_{kc_{jt}} \mid \cdots) & \text{if } k \text{ exists}\\ \gamma\, p(\mathbf{x}_{jt} \mid \cdots)\, p(\mathbf{y}_{kd_{jt}} \mid \cdots)\, p(\mathbf{z}_{kc_{jt}} \mid \cdots) & \text{if } k=k^{new} \end{cases} \qquad (22)$$
where $\mathbf{x}_{jt}$ is the space-customers at the table $t$, and $\mathbf{y}_{kd_{jt}}$ and $\mathbf{z}_{kc_{jt}}$ are the associated time and speed customers. $\mathbf{k}^{-jt}$, $\mathbf{y}^{-kd_{jt}}$, $\mathbf{o}^{-kd_{jt}}$, $\mathbf{l}^{-ko}$, $\mathbf{z}^{-kc_{jt}}$, $\mathbf{p}^{-kc_{jt}}$ and $\mathbf{q}^{-kp}$ are the rest of the customers and their choices of tables and dishes in the three HDPs. $\cdots$ represents all the conditional variables for simplicity. $p(\mathbf{x}_{jt} \mid \cdots)$ is the Multinomial $f$ as in Eq. 13.
$p(\mathbf{y}_{kd_{jt}} \mid \cdots)$ and $p(\mathbf{z}_{kc_{jt}} \mid \cdots)$ are not easy to compute. However, they can be treated in the same way, so we only explain how to compute $p(\mathbf{y}_{kd_{jt}} \mid \cdots)$ here. To fully compute $p(\mathbf{y}_{kd_{jt}} \mid \cdots) = p(\mathbf{y}_{kd_{jt}} \mid k_{jt}=k, \mathbf{o}^{-kd_{jt}}, \mathbf{l}^{-ko})$, one needs to consider it for every $y_{kd_{jt}} \in \mathbf{y}_{kd_{jt}}$, which is extremely expensive. This is because we deal with large datasets, and there can easily be thousands, if not more, of customers in $\mathbf{y}_{kd_{jt}}$.
In Eq. 15, we already saw how $y_{kd}$'s time-restaurant preference influences the table choice of $x_{ji}$. Given a group $\mathbf{y}_{kd_{jt}}$, their collective time-restaurant preference, $p(\mathbf{y}_{kd_{jt}} \mid \cdots)$, will influence the dish choice of $\mathbf{x}_{jt}$. Since the distribution of individual time-restaurant preferences is hard to compute analytically, we approximate it. We do a random sampling over $\mathbf{y}_{kd_{jt}}$ to approximate $p(\mathbf{y}_{kd_{jt}} \mid \cdots)$. The number of samples is a hyperparameter, referred to as customer selection. For every single $y \in \mathbf{y}_{kd_{jt}}$, we can compute its probability in the same way as in Eq. 18. So we approximate $p(\mathbf{y}_{kd_{jt}} \mid \cdots)$ with the joint probability of the sampled time-customers.
B.3 Sampling for Hyperparameters
A Dirichlet Process contains two parameters, a base distribution and a concentration parameter. To make THDP more robust to these parameters, we impose a prior, a Gamma distribution, onto the concentration parameter, $\gamma \sim \Gamma(\alpha, \varpi)$, where $\alpha$ is the shape parameter and $\varpi$ is the rate parameter. There are in total six $\alpha$s and $\varpi$s for the six DPs in THDP. They are initialized as 0.1. Then they are updated during the optimization using the method in [Teh et al. 2006]. The update is done in every iteration of CRFL, after sampling all the other parameters. The customer selection parameter is set to 1000 across all experiments. Finally, after CRFL, the inference is done for the three distributions in Eq. 2:

$$\phi^s_k \sim H^s, \quad \beta \sim Dirichlet(m_{\cdot 1}, m_{\cdot 2}, \cdots, m_{\cdot k}, \gamma) \qquad (23)$$

$$\phi^t_l \sim H^t, \quad \zeta \sim Dirichlet(h_{\cdot 1}, h_{\cdot 2}, \cdots, h_{\cdot l}, \varepsilon) \qquad (24)$$

$$\phi^e_q \sim H^e, \quad \rho \sim Dirichlet(a_{\cdot 1}, a_{\cdot 2}, \cdots, a_{\cdot q}, \lambda) \qquad (25)$$
where $m_{\cdot k}$ is the total number of space-tables choosing space-dish $k$; $h_{\cdot l}$ is the total number of time-tables choosing time-dish $l$; and $a_{\cdot q}$ is the total number of speed-tables choosing speed-dish $q$. $\gamma$, $\varepsilon$ and $\lambda$ are the scaling factors of $G^s$, $G^t$ and $G^e$.
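For illustration, once the table counts are gathered, sampling the posterior flow weights of Eq. 23 is a one-liner; the counts and γ below are made-up values:

```python
import numpy as np

def flow_weights(table_counts, gamma, rng):
    """Posterior global weights of the space flows (Eq. 23):
    beta ~ Dirichlet(m_1, ..., m_K, gamma). The last component is the
    mass reserved for a yet-unseen flow."""
    return rng.dirichlet(np.append(np.asarray(table_counts, float), gamma))

beta = flow_weights([30, 12, 5], gamma=0.1, rng=np.random.default_rng(0))
```

Eqs. 24 and 25 are sampled identically from $h_{\cdot l}$ and $a_{\cdot q}$ with $\varepsilon$ and $\lambda$.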
C SIMULATION GUIDANCE
The dynamics of one trajectory, $\bar{w}$, is:

$$x^{\bar{w}}_t = A s_t + \omega_t, \quad \omega \sim N(0, \Omega)$$
$$s_t = B s_{t-1} + \lambda_t, \quad \lambda \sim N(0, \Lambda)$$
Given the $U$ trajectories from a space flow $\check{w}$, the total likelihood is:

$$p(\check{w}) = \prod^{U}_{i=1} p(\bar{w}_i) \quad \text{where} \quad p(\bar{w}_i) = \prod^{T_i-1}_{t=2} p(x^i_t \mid s_t)\, P(s_t \mid s_{t-1}), \quad s_1 = x^i_1, \; s_T = x^i_{T_i} \qquad (26)$$
where $A$ is an identity matrix and $\Omega$ is a known diagonal matrix. $T_i$ is the length of the trajectory $i$. We use homogeneous coordinates to represent both $x=[x_1, x_2, 1]^T$ and $s=[s_1, s_2, 1]^T$. Consequently, $A$ is a $\mathbb{R}^{3\times3}$ identity matrix. $\Omega$ is set to be a $\mathbb{R}^{3\times3}$ diagonal matrix with its non-zero entries set to 0.001. $B$ is a $\mathbb{R}^{3\times3}$ transition matrix and $\Lambda$ is a $\mathbb{R}^{3\times3}$ covariance matrix, both to be learned.
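Under these settings, the generative model can be rolled out as a short sketch; the particular $B$, horizon and noise values below are arbitrary illustrative choices, not learned parameters:

```python
import numpy as np

rng = np.random.default_rng(2)

def rollout(B, Lam, Omega, s1, T):
    """Roll out s_t = B s_{t-1} + lambda_t and x_t = A s_t + omega_t with
    A = I, in homogeneous coordinates [x1, x2, 1]."""
    S = [np.asarray(s1, float)]
    for _ in range(T - 1):
        s = B @ S[-1] + rng.multivariate_normal(np.zeros(3), Lam)
        s[2] = 1.0                    # keep the homogeneous coordinate fixed
        S.append(s)
    S = np.array(S)
    X = S + rng.multivariate_normal(np.zeros(3), Omega, size=T)  # A is identity
    return S, X

B = np.array([[1.0, 0.0, 0.5],        # constant drift encoded via the
              [0.0, 1.0, 0.2],        # homogeneous column (illustrative)
              [0.0, 0.0, 1.0]])
Omega = 0.001 * np.diag([1.0, 1.0, 0.0])
Lam = 0.01 * np.diag([1.0, 1.0, 0.0])
S, X = rollout(B, Lam, Omega, [0.0, 0.0, 1.0], T=10)
```

The homogeneous third column of $B$ is what lets a linear transition encode a constant translation per step.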
We apply Expectation-Maximization (EM) [Bishop 2007] to estimate the parameters $B, \Lambda$ and the states $S$ by maximizing the log likelihood $\log P(u)$. Each iteration of EM consists of an E-step and an M-step. In the E-step, we fix the parameters and sample the states $s$ via the posterior distribution of $x$. The posterior distribution and the expectation of the complete-data likelihood are denoted as

$$L = E_{S|X; \hat{B}, \hat{\Lambda}}(\log P(S, X; B, \Lambda)) = \sum_i \tau_i\, E_{s^i|x^i}\{p(s^i, x^i)\} \qquad (27)$$

where $\tau_i$ is defined as $\tau_i = \dfrac{\frac{1}{T_i}\sum_{t=1}^{T_i} p(x^i_t \mid s^i_t)}{\sum_{i=1}^{U} \frac{1}{T_i}\sum_{t=1}^{T_i} p(x^i_t \mid s^i_t)}$. In the M-step, we maximize the complete-data likelihood and the model parameters are updated as:
$$B^{new} = \frac{\sum_i \tau_i \sum_{t=2}^{T_i} P^i_{t,t-1}}{\sum_i \tau_i \sum_{t=2}^{T_i} P^i_{t-1,t-1}} \qquad (28)$$

$$\Lambda^{new} = \frac{\sum_i \tau_i \left(\sum_{t=2}^{T_i} P^i_{t,t} - B^{new} \sum_{t=2}^{T_i} P^i_{t,t-1}\right)}{\sum_i \tau_i (T_i - 2)} \qquad (29)$$

$$P^i_{t,t} = E_{s^i|x^i}(s_t s_t^T) \qquad (30)$$

$$P^i_{t,t-1} = E_{s^i|x^i}(s_t s_{t-1}^T) \qquad (31)$$
During the update, we use $\Lambda = \frac{1}{2}(\Lambda + \Lambda^T)$ to ensure its symmetry.
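Given the smoothed second moments of Eq. 30 and Eq. 31, the M-step updates of Eq. 28 and Eq. 29, including the symmetrization, can be sketched as follows; the per-trajectory sums over $t$ are assumed precomputed by the E-step:

```python
import numpy as np

def m_step(tau, P_tt, P_ttm1, P_tm1tm1, T):
    """M-step of Eq. 28-29. Per trajectory i, the inputs are the smoothed
    second moments already summed over t = 2..T_i:
    P_tt[i] = sum E[s_t s_t^T], P_ttm1[i] = sum E[s_t s_{t-1}^T],
    P_tm1tm1[i] = sum E[s_{t-1} s_{t-1}^T]; tau[i] are trajectory weights."""
    num = sum(tau[i] * P_ttm1[i] for i in range(len(tau)))
    den = sum(tau[i] * P_tm1tm1[i] for i in range(len(tau)))
    B_new = num @ np.linalg.inv(den)                  # Eq. 28
    Lam_new = sum(tau[i] * (P_tt[i] - B_new @ P_ttm1[i])
                  for i in range(len(tau)))           # Eq. 29 numerator
    Lam_new = Lam_new / sum(tau[i] * (T[i] - 2) for i in range(len(tau)))
    return B_new, 0.5 * (Lam_new + Lam_new.T)         # enforce symmetry
```

The matrix "division" in Eq. 28 is realized as a right-multiplication by the inverse of the denominator matrix.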