Who are my neighbors?
A perception model for selecting neighbors of
pedestrians in crowds
Fangkai Yang
KTH Royal Institute of Technology
Stockholm, Sweden
fangkai@kth.se
Himangshu Saikia
KTH Royal Institute of Technology
Stockholm, Sweden
saikia@kth.se
Christopher Peters
KTH Royal Institute of Technology
Stockholm, Sweden
chpeters@kth.se
ABSTRACT
Pedestrian trajectory prediction is a challenging problem. One of the aspects that makes it so challenging is the fact that the future positions of an agent are not only determined by its previous positions, but also by the interaction of the agent with its neighbors. Previous methods, like Social Attention, have considered the interactions with all agents as neighbors. However, this ends up assigning high attention weights to agents who are far away from the queried agent and/or moving in the opposite direction, even though such agents might have little to no impact on the queried agent's trajectory. Furthermore, trajectory prediction of a queried agent involving all agents in a large crowded scenario is not efficient. In this paper, we propose a novel approach for selecting neighbors of an agent by modeling its perception as a combination of a location and a locomotion model. We demonstrate the performance of our method by comparing it with the existing state-of-the-art method on publicly available datasets. The results show that our neighbor selection model overall improves the accuracy of trajectory prediction and enables prediction in scenarios with large numbers of agents in which other methods do not scale well.
KEYWORDS
perception, virtual agents, trajectory prediction, machine learning
1 INTRODUCTION
Pedestrians perceive and act upon a variety of information present in the environment. Such information includes the surroundings as well as the relative positions, velocities and perceived intent of other pedestrians - some located quite far away - all of which is processed in a streaming fashion to successfully navigate their path. Designing virtual agents that perform similar processing of their surroundings and evaluate their path trajectories correctly is of great importance in social robotics [11], abnormality detection in crowds [12], urban planning for public safety [2] and realistic simulation of virtual crowds [10], among others.
A seminal work on modeling the behavior of agents in a crowd is the Social Forces model [6], where the interactions of an agent with its surrounding agents are modeled by means of invisible (social) forces. Interacting Gaussian Processes (IGP) is another method, used to model the joint distribution of the trajectories of all interacting agents in a crowd [22]. Following recent advances in machine learning, Recurrent Neural Networks (RNNs) became very popular for sequence prediction problems.
In particular, RNNs equipped with Long Short-Term Memory (LSTM) cells perform much better than traditional RNNs at remembering important features far back in time [18]. LSTM networks were used to predict human trajectories in crowds with Social-LSTM [1]. Learning useful features from data in this way is more general than hand-coding such features. However, Social-LSTM only used a local discretized neighborhood around an agent and ignored all other agents outside this neighborhood. The Social Attention model was introduced in [24], which improved trajectory prediction compared with Social-LSTM by considering a spatio-temporal graph representation of the relationship between all pairs of agents.
All of these methods achieve varying degrees of accuracy but suffer from assumptions which detract from generality. Methods which do not learn from human trajectory data suffer from insufficient information in their hand-coded features. Methods which do learn from human trajectories either only consider very local regions or naïvely cover the entire space, which in turn does not scale well.
We present a learning-based approach related to the Social Attention method, which uses a spatio-temporal representation between all pairs of agents. An LSTM is then trained using this representation to predict future trajectories of humans in real scenarios. Although it performs better than Social-LSTM, considering all agents as neighbors is not optimal, as it does not scale to large crowds. To this end, we propose a novel perception model based on the human visual system [19, 20] which combines a location and a locomotion model to determine the agents neighboring a queried one. Our approach prunes out unimportant neighboring agents, thereby also making trajectory prediction scalable to larger datasets.
2 RELATED WORK
2.1 Pedestrian Trajectory Prediction
The Social Forces model was presented in [6] to capture the interactions of an agent with its surrounding agents by means of invisible (social) forces - attractive forces towards destinations and repulsive forces from obstacles and other agents - which leads to an efficient collision avoidance strategy during navigation. However, this is only enough to simulate very simple behavior and cannot take into account complex interactions, especially those further back in time. In [21, 22], it was argued that agents exhibit cooperative behavior with regard to collision avoidance. The authors used Interacting Gaussian Processes (IGP) to model the joint distribution of the trajectories of all interacting agents in the crowd. Again, this method only accounted for relative positioning, not relative velocities or accelerations.
Following recent advances in learning-based methods using neural networks, automatic feature detection without having to handcraft individual features became a huge success in many diverse fields. For sequence prediction problems such as speech recognition and synthesis, RNNs became very popular [3]. However, RNNs suffer from the vanishing gradient problem [14] and are hence difficult to train. LSTM cells were then introduced as a specific building unit for RNNs and performed much better than traditional RNNs [18]. A key feature of LSTMs is the ability to learn from features observed long ago in the sequence. In [1], LSTM networks were used to predict multiple correlated sequences corresponding to human trajectories in crowds, with an approach called Social-LSTM. Using a neural network to learn useful features of social interaction from real data is indeed more general than hand-coding such features as in Social Forces or IGP. However, Social-LSTM in its current form, and some of its derivatives, only used a local discretized neighborhood around an agent and ignored all other agents outside this neighborhood. The Social Attention model was introduced in [24], which improved upon Social-LSTM by considering a spatio-temporal graph representation of the relationship between all pairs of agents in a crowd.
3 METHOD
When pedestrians walk in crowds, their trajectories are affected by the motions of other pedestrians. Some work considers this influence to be local [1, 11]. However, as shown in [24], not only positions but also other features like velocity and acceleration play important roles in influencing the queried pedestrian's trajectory. Keeping this in mind, we propose a neighbor selection model which considers not only the position of an agent, but also its speed, forward orientation, and bearing angles with respect to other agents.
3.1 The Social Attention Model
Vemula et al. [24] proposed the Social Attention model, which uses a Structural RNN (S-RNN) [7] to model both the spatial and temporal dynamics of trajectories in crowds. Human-human interactions are modeled using a soft attention model over all pedestrians in the crowd. When predicting the future trajectory of a target pedestrian, other pedestrians in the crowd who have a higher attention weight should have a larger influence on the prediction. By computing a soft attention over the hidden states of the spatial edges of each agent, they trained an LSTM network to predict future trajectories. Since they aimed to find out which surrounding agents humans attend to, they built spatial edges between all pairs of agents. This, however, is expensive and (as we show later) does not scale to large crowds. In some cases, their model assigned a high attention weight to agents who are far away from the queried agent or to agents who are almost static. Also, the bearing angle between agents did not seem to influence the attention weights in their model. To address these issues, we propose a perception model based on the human visual system to select the important agents from all surrounding agents rather than attending to every pedestrian in the crowd. In this paper, we use the same S-RNN architecture to train an LSTM network as in [24], but we use our perception model (see Section 3.2) to prune out the unimportant spatial edges.
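As an illustration of the kind of soft attention described above, the following is a minimal sketch (not the exact S-RNN architecture of [24]) of computing attention weights over the hidden states of a queried agent's spatial edges. The tensor shapes and the scaled dot-product scoring function are assumptions made for illustration only.

```python
import numpy as np

def soft_attention(node_hidden, edge_hidden):
    """Toy soft attention over spatial-edge hidden states.

    node_hidden: (d,) hidden state of the queried agent's node.
    edge_hidden: (n, d) hidden states of the spatial edges to n other agents.
    Returns the attention weights (n,) and the attended context vector (d,).
    """
    # Scaled dot-product scores (an assumed scoring function; the original
    # model learns its own attention mechanism).
    scores = edge_hidden @ node_hidden / np.sqrt(node_hidden.size)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    context = weights @ edge_hidden
    return weights, context
```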
3.2 Model Architecture
The overall architecture of our model consists of two parts. The location model selects interesting neighbors from among all the agents based on their proximity and bearing angles. The locomotion model selects the agents with a high risk of future collision based on their angular and tangential velocities.
3.2.1 Location Model. People perceive their surroundings through vision and proximity. Therefore, we applied the unified agent-sensing model proposed in [17]. As shown in Figure 1(a), for an agent A_i it consists of an ellipse E_i and a sector S_i. Unlike multiple vision cones used to simulate human vision [8, 15], which leave blind spots near the cones' intersections, the ellipse covers these blind spots and simulates the reduction of vision sensitivity as distance increases. The ellipse foci F_1, F_2 are calculated as:

F_1, F_2 = x_i + f_i (a − d ± c)    (1)

where x_i is the position and f_i the forward direction of A_i, d is the intimate distance within which the agent can sense neighbors from behind, a is the semi-major axis of the ellipse and c the focal distance. We take the semi-minor axis to be b = a tan(π/6), and hence c is given by c = √(a² − b²) ≈ 0.817a.
Agents within the ellipse are marked as fully perceived. For agents outside the ellipse, however, the probability of being perceived varies based on their proximity and orientation with respect to the queried agent. The perceived probability of the Location Model is therefore modeled as:

p^i_LM(x_j; α, β) =
    1,                                                                        for A_j ∈ E_i
    cos^α((π/2) · ||x_j − x_i|| / R_S) · cos^β((π/2) · θ(A_j, A_i) / Θ_S),    for A_j ∈ S_i and A_j ∉ E_i
    0,                                                                        for A_j ∉ S_i    (2)

where R_S is the sector radius, Θ_S is the central angle of the sector, and ||x_j − x_i|| and θ(A_j, A_i) are the distance term and the orientation term between agents A_j and A_i respectively. The orientation term θ(A_j, A_i), or simply θ_ji, is the bearing angle between the two agents, i.e. the angle between (x_j − x_i) and f_i. α and β are parameters which control the influence of the distance term and the orientation term respectively. An example of the Location Model is shown in Figure 2.
The proximity parameters in Equation 1 and Equation 2 were chosen from Proxemics Theory [5]. Although the perception of proximities is culturally determined, to simplify the model we set these parameters to constants (d = 0.15 m, a = 1.2 m, R_S = 3.5 m). Also, human eyes have a vision angle of around 200° [25], thus we set Θ_S = 200°.
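To make the Location Model concrete, the following is a minimal sketch of Equations 1 and 2 using the constants above (d = 0.15 m, a = 1.2 m, R_S = 3.5 m, Θ_S = 200°). The vector conventions, the point-in-ellipse test and the sector test are our own assumptions for illustration; the authors' implementation may differ.

```python
import numpy as np

D, A, R_S = 0.15, 1.2, 3.5               # proxemics constants (meters)
THETA_S = np.deg2rad(200.0)              # central angle of the vision sector
B = A * np.tan(np.pi / 6)                # semi-minor axis
C = np.sqrt(A**2 - B**2)                 # focal distance, roughly 0.817 * A

def p_location(x_i, f_i, x_j, alpha=2.0, beta=2.0):
    """Perception probability p^i_LM(x_j) of agent j as seen by agent i (Eq. 2)."""
    f_i = f_i / np.linalg.norm(f_i)
    r = x_j - x_i
    dist = np.linalg.norm(r)
    # Bearing angle between the line of sight to agent j and i's forward direction.
    theta = np.arccos(np.clip(np.dot(r, f_i) / max(dist, 1e-9), -1.0, 1.0))

    # Ellipse E_i (Eq. 1): fully perceived if the sum of distances to the foci <= 2a.
    f1 = x_i + f_i * (A - D + C)
    f2 = x_i + f_i * (A - D - C)
    if np.linalg.norm(x_j - f1) + np.linalg.norm(x_j - f2) <= 2 * A:
        return 1.0

    # Vision sector S_i: within radius R_S and within half the central angle.
    if dist <= R_S and theta <= THETA_S / 2:
        return (np.cos(np.pi / 2 * dist / R_S) ** alpha *
                np.cos(np.pi / 2 * theta / THETA_S) ** beta)
    return 0.0

# Example: an agent about 2 m ahead and slightly to the side of the queried agent.
w = p_location(np.array([0.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 2.0]))
```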
3.2.2 Locomotion Model. As stated in [24], agents in the immediate vicinity of the queried agent and moving in the same direction are sometimes not as important as agents located far away but moving towards the queried agent. In such cases, the location model alone would fail to consider potential neighbors which may have an influence on the queried agent's trajectory. [13] evaluated the risk and dangerousness of future collisions by using the bearing angle velocity and the remaining time-to-interaction relative to the agent, defined below.
Figure 1: (a) The Location Model. For an agent A_i, this consists of an ellipse E_i and a sector S_i. (b) The effect of the bearing angle velocity on the risk of collision. Assume two agents (represented by the green and blue circles), one moving vertically up and the other horizontally towards the left. If the bearing angle between the two reduces with time (i.e. θ̇ < 0), the blue agent will pass in front of the green agent and not collide. If the bearing angle increases with time (i.e. θ̇ > 0), the green agent passes the blue agent and again there is no collision. However, if θ̇ ≈ 0, the two agents will probably collide.
Figure 2: Example of the Location Model. (Left) Three Location Models calculated using different α and β. (Right) The probability along two horizontal lines (a) and (b).
The bearing angle velocity is given by θ̇_ij = θ_ij^(2) − θ_ij^(1), where θ_ij^(2) and θ_ij^(1) are the bearing angles in the current frame and the previous one respectively, and the remaining time-to-interaction by t_ij = ||x_j − x_i|| / |v^r_ij|, where v^r_ij is the relative tangential velocity, which points towards agent A_i (the queried agent).
Figure 3: Trajectories for 20 time steps in the ETH Hotel dataset. The queried agent, whose trajectory is being predicted, is shown in red. The blue diamond markers represent the current positions of the various agents. The circular radii represent the weights from our neighbor selection model. The current frame of the original video of the dataset is superimposed in the background.
As shown in Figure 1(b), if |θ̇_ij| is low, agents A_i and A_j have a high risk of collision in the future [13]. Also, a smaller t_ij means a higher dangerousness of future collision. Thus, the influence probability of the Locomotion Model is given as:

p^i_CM(x_j; γ) = exp(−γ t_ij^2 − (1 − γ) |θ̇_ij|^2)    (3)

where γ is a weighting parameter.
The nal combined model is thus represented based on the
agent’s location and locomotion by combining the corresponding
models pLM and pC M as follows :
Pi(xj;Θ)=λpi
LM (xj;α,β)+(1λ)pi
C M (xj;γ)(4)
where
Θ=[α,β,γ,λ]
are weighting parameters for the dierent
terms. For the queried agent
Ai
, we only select agent
Aj
if
j,i
and
with
Pi(xj)
above a specied threshold
τ
. As shown in Figure 3,
our model assigns higher attention weights to those agents which
might be involved in future collisions with high dangerousness.
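A companion sketch of the Locomotion Model (Equation 3) and the combined score with thresholded neighbor selection (Equation 4) is given below, using the parameter values reported later in Section 4.2 (γ = 0.5, λ = 0.4, τ = 0.2). The finite-difference estimate of θ̇_ij and the projection used for the relative tangential velocity are our own reading of the definitions above, not necessarily the authors' implementation.

```python
import numpy as np

def bearing_angle(x_i, f_i, x_j):
    """Bearing angle between the line of sight (x_j - x_i) and i's forward direction f_i."""
    r = x_j - x_i
    cosang = np.dot(r, f_i) / (np.linalg.norm(r) * np.linalg.norm(f_i) + 1e-9)
    return np.arccos(np.clip(cosang, -1.0, 1.0))

def p_locomotion(x_i, f_i, v_i, x_j, v_j, x_i_prev, f_i_prev, x_j_prev, gamma=0.5):
    """Influence probability p^i_CM(x_j) of agent j on agent i (Eq. 3)."""
    # Bearing angle velocity: difference between the current and previous frame.
    theta_dot = (bearing_angle(x_i, f_i, x_j)
                 - bearing_angle(x_i_prev, f_i_prev, x_j_prev))
    # Relative velocity of j with respect to i projected onto the line of sight
    # towards i (our reading of the "relative tangential velocity").
    r = x_i - x_j
    v_rel = np.dot(v_j - v_i, r / (np.linalg.norm(r) + 1e-9))
    t_ij = np.linalg.norm(r) / (abs(v_rel) + 1e-9)   # remaining time-to-interaction
    return np.exp(-gamma * t_ij**2 - (1.0 - gamma) * theta_dot**2)

def combined_score(p_lm, p_cm, lam=0.4):
    """Combined perception score P^i(x_j) from Eq. 4."""
    return lam * p_lm + (1.0 - lam) * p_cm

def select_neighbors(scores, tau=0.2):
    """Indices of agents whose combined score exceeds the threshold tau."""
    return [j for j, s in enumerate(scores) if s > tau]
```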
4 EVALUATION
4.1 Datasets and Metrics
We evaluated our model on three publicly available datasets: ETH [16], UCY [9], and Pedestrian Walking Path (PWP) [27]. The ETH and UCY datasets contain 5 crowd sets with a total of 1536 pedestrians. The PWP dataset contains the labeled walking paths of 12684 pedestrians. As shown in [24], Social Attention performs better than other methods such as LSTM and Social-LSTM [1] on the ETH and UCY datasets. Thus, we chose Social Attention as the baseline against which to compare our model. However, unlike Social Attention, which modeled the influence of all agents in the crowd, we used our model to select only those neighboring agents with potential influence. Hence, we also tested a new dataset, PWP, in our work. Compared with the ETH and UCY datasets, which have roughly 10 agents per frame, PWP has roughly 100 agents per frame, which results in a higher computational overhead if all agents in the environment are considered. We preprocessed the datasets using the homography matrices from [26] to normalize all datasets to a perspective top-down view.
To compute the prediction error, we used two metrics: Average Displacement Error (ADE) [1] and Final Displacement Error (FDE) [24]. ADE calculates the mean squared error between all the estimated points of a trajectory and the ground truth. FDE calculates the Euclidean distance between the final predicted position of a trajectory and the ground truth.
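A minimal sketch of these two metrics under the definitions above (mean squared error over all predicted points for ADE, final-point Euclidean distance for FDE); the (T, 2) array layout for a single trajectory is an assumption.

```python
import numpy as np

def ade(pred, gt):
    """Average Displacement Error: mean squared error over all predicted points.

    pred, gt: arrays of shape (T, 2) with predicted and ground-truth positions.
    """
    return float(np.mean(np.sum((pred - gt) ** 2, axis=1)))

def fde(pred, gt):
    """Final Displacement Error: Euclidean distance between the final positions."""
    return float(np.linalg.norm(pred[-1] - gt[-1]))
```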
4.2 Implementation
In the process of training and testing, we take a two-level approach. We set ETH and UCY as the low-density level and PWP as the high-density level. Similar to [24], we used the same leave-one-out method, training and validating on 4 sets from the low-density datasets and testing on the remaining set. This was repeated for all 5 sets in ETH and UCY. For validation, each set was divided in a 4:1 ratio between training and validation. As for PWP, we divided it into training (80%) and testing (20%) parts. To match the annotation frequency of the low-density datasets (annotated every 0.4 seconds), we interpolated the PWP dataset. We also set the same time-steps for the observed trajectory (T_obs = 8 time-steps, 3.2 seconds) and the predicted trajectory (T_pred = 12 time-steps, 4.8 seconds). The dimension of the hidden states of the temporal edges was set to 128 and that of the spatial edges to 256. The embedding layers embedded the input into a 64-dimensional vector with ReLU nonlinearity. The model was trained on a single GTX-1070 GPU on a personal computer with 16 GB RAM.
The weighting parameters Θ = [α, β, γ, λ] used in this paper were set to [2, 2, 0.5, 0.4]. To find a good threshold τ, we tested values from 0.1 to 0.7 in steps of 0.1 on the ETH-Hotel dataset. The Final Displacement Error was used to select the best threshold among these values. The FDE was observed to be small near τ = 0.2, which was the value chosen for the entire training process.
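The threshold search described above amounts to a small sweep over candidate values of τ; a minimal sketch is shown below, where fde_of_tau stands in for training and evaluating the model at a given threshold (a hypothetical hook, not the authors' code).

```python
import numpy as np

def select_threshold(fde_of_tau, taus=np.linspace(0.1, 0.7, 7)):
    """Pick the neighbor-selection threshold tau with the lowest validation FDE."""
    fdes = [fde_of_tau(t) for t in taus]
    best = int(np.argmin(fdes))
    return float(taus[best]), float(fdes[best])

# Example with a dummy FDE curve whose minimum lies near tau = 0.2,
# mirroring the behavior observed on ETH-Hotel.
tau, err = select_threshold(lambda t: (t - 0.2) ** 2 + 0.5)
```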
5 RESULTS
5.1 Quantitative Results
The prediction errors of the two models are shown in Figure 4. Our model performed better on most of the low-density datasets, the exception being UCY Zara 1. It might be the case that on this set our model over-pruned neighbors which actually exerted some influence. On datasets with pedestrians standing still or crossing paths, like ETH-Hotel and UCY Zara 2, our model performed much better than the Social Attention model, since the latter assigned higher attention to static and distant agents than to the agents walking nearby.

A high-density dataset (PWP) tested the scalability of our model. On this dataset, the Social Attention model ran out of memory during training, because it tried to build spatial edges between every pair of agents in an environment containing almost 100 agents per frame. Our model, however, enabled the process to scale to large crowds while preserving the important interactions.
5.2 Qualitative Results
Figure 5 shows an exemplar scenario where the Social Attention model did not perform optimally, but where the trajectories could be predicted successfully with the neighbor selection model. The trajectories predicted by the Social Attention model diverge further than the ones predicted using our model, and it falsely predicts motion for static agents (purple trajectory and dots in Figure 5(a)). From the weights of the two methods, we can see that the Social Attention model assigned high attention to an agent moving backwards relative to the queried agent (green trajectory and dots in Figure 5(a)),
(a) Average Displacement Error
(b) Final Displacement Error
Figure 4: The average and final displacement errors (in meters) on several datasets for the Social Attention method and ours. The Social Attention method runs out of memory in the training stage on the PWP dataset. As can be seen, our method performs better than Social Attention on most datasets and even scales to larger datasets.
and to the agent who stands still (purple trajectory in Figure 5(a)). However, these agents should not exert much influence on the queried agent.
In Figure 6, we list three representative cases (Figure 6(a), (c) and (e)) where the Social Attention model did not perform optimally. Figure 6(a) and Figure 6(c) show that the Social Attention model assigned high attention to an agent far behind the queried agent, but very low attention to an agent close by. Figure 6(e) shows that the Social Attention model assigned similar attention weights to both close-by and far-away neighbors who could hardly exert any influence. Figures 6(b), (d) and (f) show the weights from our model, which assigned relatively high attention to the pedestrians who could exert an important influence on the trajectory to be predicted, in the exact same scenarios.
6 DISCUSSION
We observe that our model performs better than the Social Attention model, which in turn outperforms state-of-the-art methods such as Social-LSTM, as shown in [24]. This leads us to believe that our model performs better than the current state of the art. Since our model is a neighbor selection model, it could be integrated with other methods that select neighbors to predict trajectories, and the resulting performance tested.
(a) The Social Attention model. (b) Our model.
Figure 5: An example illustrating the difference between the Social Attention model and our model. (1) Prediction accuracy: The solid dots represent the ground truth positions and the '+' markers represent the predicted positions. As can be clearly observed, our method predicts future positions more accurately, and there is less deviation between the true positions and the predicted positions. (2) Neighbor importance: The queried agent, whose neighbors are being estimated for importance, is shown in red. The blue diamond markers represent the current positions of the various agents. The circular radii represent their attention weights. The Social Attention model assigns high attention weights to agents who are far away from the queried agent and/or moving in the opposite direction. Our model successfully prunes such agents and assigns high weights only to those agents who are likely to influence the queried agent's trajectory.
Figure 6: The weights in the Social Attention model ((a), (c), (e)) and our model ((b), (d), (f)). The solid dots represent the ground truth positions. The queried agent, whose trajectory is being predicted, is shown in red. The blue diamond markers represent the current positions of the various agents. The circular radii represent the attention weights.
For example, it could replace the grid-based neighbor selection model in Social-LSTM [1] in order to select the neighbors with a potentially higher effect on trajectory prediction. Moreover, our method could be extended beyond trajectory prediction. For example, in crowd simulation, state-of-the-art methods (e.g., ORCA [23] and the Social Force model [6]) consider neighbors (normally those within a circle centered on the queried agent) in order to avoid collisions. Our method could be integrated to better select neighbors and assign attention weights based on perception. For simplicity, our model is homogeneous: it assumes all individuals have the same ability to perceive neighbors. One possible extension is to learn the personality of an agent based on its trajectory [4], which in turn would give agents heterogeneous perceptual abilities. Also, the weighting parameters (as given in Section 4.2) could potentially be tuned to give better performance. It would be interesting to train optimal weighting parameters based on trajectory prediction feedback.
7 CONCLUSIONS
In this paper, we presented a novel method for selecting the neighbors of a pedestrian by modeling its vision and perception. It consists of a Location Model and a Locomotion Model, which account for both relative positions and velocities. The model was used to prune out unimportant agents and was shown to perform better than the Social Attention model, the current state-of-the-art method using LSTMs for trajectory prediction. We showed that our model performs better on most low-density datasets and also scales to larger datasets. As discussed in Section 6, one direction for future work is to extend our model to other methods requiring neighbor selection, for example in trajectory prediction and crowd simulation, and to compare performance. Another is to train generally optimal weighting parameters and to incorporate the personalities of pedestrians, based on gait and other features, to enable better prediction accuracy.
REFERENCES
[1] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese. 2016. Social LSTM: Human Trajectory Prediction in Crowded Spaces. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 961–971. https://doi.org/10.1109/CVPR.2016.110
[2] Michael Batty, Jake Desyllas, and Elspeth Duxbury. 2003. Safety in Numbers? Modelling Crowds and Designing Control for the Notting Hill Carnival. Urban Studies 40, 8 (2003), 1573–1590. https://doi.org/10.1080/0042098032000094432
[3] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6645–6649.
[4] Stephen J Guy, Sujeong Kim, Ming C Lin, and Dinesh Manocha. 2011. Simulating heterogeneous crowd behaviors using personality trait theory. In Proceedings of the 2011 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. ACM, 43–52.
[5] Edward Twitchell Hall. 1966. The Hidden Dimension.
[6] Dirk Helbing and Peter Molnar. 1995. Social force model for pedestrian dynamics. Physical Review E 51, 5 (1995), 4282.
[7] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena. 2016. Structural-RNN: Deep Learning on Spatio-Temporal Graphs. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5308–5317. https://doi.org/10.1109/CVPR.2016.573
[8] Tom Leonard. 2003. Building an AI Sensory System: Examining the Design of Thief: The Dark Project. Game Development Conference (GDC 2003) (2003).
[9] Alon Lerner, Yiorgos Chrysanthou, and Dani Lischinski. 2007. Crowds by Example. Computer Graphics Forum 26, 3 (2007), 655–664. https://doi.org/10.1111/j.1467-8659.2007.01089.x
[10] C. Loscos, D. Marchal, and A. Meyer. 2003. Intuitive Crowd Behavior in Dense Urban Environments using Local Laws. In Proceedings of Theory and Practice of Computer Graphics, 2003. 122–129. https://doi.org/10.1109/TPCG.2003.1206939
[11] Matthias Luber, Johannes A Stork, Gian Diego Tipaldi, and Kai O Arras. 2010. People tracking with human motion predictions from social forces. In 2010 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 464–469.
[12] Ramin Mehran, Alexis Oyama, and Mubarak Shah. 2009. Abnormal crowd behavior detection using social force model. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 935–942.
[13] Jan Ondřej, Julien Pettré, Anne-Hélène Olivier, and Stéphane Donikian. 2010. A synthetic-vision based steering approach for crowd simulation. ACM Transactions on Graphics 29, 4 (2010), 1. https://doi.org/10.1145/1778765.1778860
[14] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning. 1310–1318.
[15] Claudio Pedica and Hannes Högni Vilhjálmsson. 2010. Spontaneous avatar behavior for human territoriality. Applied Artificial Intelligence 24, 6 (2010), 575–593.
[16] Stefano Pellegrini, Andreas Ess, Konrad Schindler, and Luc Van Gool. 2009. You'll never walk alone: Modeling social behavior for multi-target tracking. In 2009 IEEE 12th International Conference on Computer Vision. IEEE, 261–268.
[17] Steve Rabin and Michael Delp. 2008. Designing a Realistic and Unified Agent-Sensing Model. Game Programming Gems 7 (2008), 217–228.
[18] Haşim Sak, Andrew Senior, and Françoise Beaufays. 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth Annual Conference of the International Speech Communication Association.
[19] Jill Sardegna. 2002. The Encyclopedia of Blindness and Vision Impairment. Infobase Publishing.
[20] Hans Strasburger, Ingo Rentschler, and Martin Jüttner. 2011. Peripheral vision and pattern recognition: A review. Journal of Vision 11, 5 (2011), 13–13.
[21] Peter Trautman and Andreas Krause. 2010. Unfreezing the robot: Navigation in dense, interacting crowds. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 797–803.
[22] Peter Trautman, Jeremy Ma, Richard M Murray, and Andreas Krause. 2013. Robot navigation in dense human crowds: the case for cooperation. In 2013 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2153–2160.
[23] Jur Van Den Berg, Stephen J Guy, Ming Lin, and Dinesh Manocha. 2011. Reciprocal n-body collision avoidance. In Robotics Research. Springer, 3–19.
[24] Anirudh Vemula, Katharina Mülling, and Jean Oh. 2017. Social Attention: Modeling Attention in Human Crowds. CoRR abs/1710.04689 (2017). arXiv:1710.04689 http://arxiv.org/abs/1710.04689
[25] Brian A Wandell. 1995. Foundations of Vision. Sinauer Associates.
[26] Kota Yamaguchi, Alexander C Berg, Luis E Ortiz, and Tamara L Berg. 2011. Who are you with and where are you going?. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1345–1352.
[27] Shuai Yi, Hongsheng Li, and Xiaogang Wang. 2015. Understanding pedestrian behaviors from stationary crowd groups. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3488–3496.