Beyond Actions: Discriminative Models for Contextual Group Activities.
Tian Lan
School of Computing Science
Simon Fraser University
tla58@sfu.ca
Yang Wang
Department of Computer Science
University of Illinois at Urbana-Champaign
yangwang@uiuc.edu
Weilong Yang
School of Computing Science
Simon Fraser University
wya16@sfu.ca
Greg Mori
School of Computing Science
Simon Fraser University
mori@cs.sfu.ca
Abstract
We propose a discriminative model for recognizing group activities. Our model
jointly captures the group activity, the individual person actions, and the interac-
tions among them. Two new types of contextual information, group-person inter-
action and person-person interaction, are explored in a latent variable framework.
Different from most of the previous latent structured models which assume a pre-
defined structure for the hidden layer, e.g. a tree structure, we treat the structure of
the hidden layer as a latent variable and implicitly infer it during learning and in-
ference. Our experimental results demonstrate that by inferring this contextual in-
formation together with adaptive structures, the proposed model can significantly
improve activity recognition performance.
1 Introduction
Look at the two persons in Fig. 1(a). Can you tell that they are performing two different actions? Once the
entire contexts of these two images are revealed (Fig. 1(b)) and we observe the interaction of the
person with other persons in the group, it is immediately clear that the first person is queuing, while
the second person is talking. In this paper, we argue that actions of individual humans often cannot
be inferred alone. We instead focus on developing methods for recognizing group activities by
modeling the collective behaviors of individuals in the group.
Before we proceed, we first clarify some terminology used throughout the rest of the paper. We use
action to denote a simple, atomic movement performed by a single person. We use activity to refer
to a more complex scenario that involves a group of people. Consider the examples in Fig. 1(b):
each frame depicts a group activity (queuing or talking), while each person in a frame performs
a lower-level action (talking and facing right, talking and facing left, etc.).
Our proposed approach is based on exploiting two types of contextual information in group activ-
ities. First, the activity of a group and the collective actions of all the individuals serve as context
(we call it the group-person interaction) for each other, hence should be modeled jointly in a unified
framework. As shown in Fig. 1, knowing the group activity (queuing or talking) helps disambiguate
individual human actions which are otherwise hard to recognize. Similarly, knowing most of the
persons in the scene are talking (whether facing right or left) allows us to infer the overall group
activity (i.e. talking). Second, the action of an individual can also benefit from knowing the actions
of other surrounding persons (which we call the person-person interaction). For example, consider
Fig. 1(c). The fact that the first two persons are facing the same direction provides a strong cue that
both of them are queuing. Similarly, the fact that the last two persons are facing each other indicates
that they are more likely to be talking.
Figure 1: Role of context in group activities. It is often hard to distinguish actions from each individual person
alone (a). However, if we look at the whole scene (b), we can easily recognize the activity of the group and the
action of each individual. In this paper, we operationalize this intuition and introduce a model for recognizing
group activities by jointly considering the group activity, the action of each individual, and the interaction among
certain pairs of individual actions (c).
Related work: Using context to aid visual recognition has received much attention recently. Most
of the work on context is in scene and object recognition. For example, work has been done on ex-
ploiting contextual information between scenes and objects [13], objects and objects [5, 16], objects
and so-called “stuff” (amorphous spatial extent, e.g. trees, sky) [11], etc.
Most of the previous work in human action recognition focuses on recognizing actions performed
by a single person in a video (e.g. [2, 17]). In this setting, there has been work on exploiting contexts
provided by scenes [12] or objects [10] to help action recognition. In still image action recognition,
object-action context [6, 9, 23, 24] is a popular type of context used for human-object interaction.
The work in [3] is the closest to ours. In that work, person-person context is exploited by a new
feature descriptor extracted from a person and its surrounding area.
Our model is directly inspired by some recent work on learning discriminative models that allow
the use of latent variables [1, 6, 15, 19, 25], particularly when the latent variables have complex
structures. These models have been successfully applied in many applications in computer vision,
e.g. object detection [8, 18], action recognition [14, 19], human-object interaction [6], objects and
attributes [21], human poses and actions [22], image region and tag correspondence [20], etc. So
far only applications where the structures of latent variables are fixed have been considered, e.g. a
tree structure in [8, 19]. However, in our application, the structures of latent variables are not fixed
and have to be inferred automatically.
Our contributions: In this paper, we develop a discriminative model for recognizing group ac-
tivities. We highlight the main contributions of our model. (1) Group activity: most of the work
in human activity understanding focuses on single-person action recognition. Instead, we present
a model for group activities that dynamically decides on interactions among group members. (2)
Group-person and person-person interaction: although contextual information has been exploited
for visual recognition problems, our work introduces two new types of contextual information that have
not been explored before. (3) Adaptive structures: the person-person interaction poses a challenging
problem for both learning and inference. If we naively consider the interaction between every pair of
persons, the model might try to force two persons to take certain pairs of labels even though
these two persons have nothing to do with each other. In addition, selecting a subset of connec-
tions allows one to remove “clutter” in the form of people performing irrelevant actions. Ideally, we
would like to consider only those person-person interactions that are strong. To this end, we propose
to use adaptive structures that automatically decide on whether the interaction of two persons should
be considered. Our experimental results show that our adaptive structures significantly outperform
other alternatives.
2 Contextual Representation of Group Activities
Our goal is to learn a model that jointly captures the group activity, the individual person actions, and
the interactions among them. We introduce two new types of contextual information, group-person
interaction and person-person interaction. Group-person interaction represents the co-occurrence
between the activity of a group and the actions of all the individuals. Person-person interaction
indicates that the action of an individual can benefit from knowing the actions of other people in the
same scene. We present a graphical model representing all the information in a unified framework.
One important difference between our model and previous work is that in addition to learning the
parameters in the graphical model, we also automatically infer the graph structures (see Sec. 3).
Figure 2: Graphical illustration of the model in (a). The edges represented by dashed lines indicate the
connections are latent. Different types of potentials are denoted by lines with different colors in the example
shown in (b).
We assume an image has been pre-processed (i.e. by running a person detector) so the persons in the
image have been found. On the training data, each image is associated with a group activity label,
and each person in the image is associated with an action label.
2.1 Model Formulation
A graphical representation of the model is shown in Fig. 2. We now describe how we model an
image $I$. Let $I_1, I_2, \ldots, I_m$ be the set of persons found in the image $I$. We extract features $x$ from
the image $I$ in the form of $x = (x_0, x_1, \ldots, x_m)$, where $x_0$ is the aggregation of feature descriptors
of all the persons in the image (we call it the root feature vector), and $x_i$ ($i = 1, 2, \ldots, m$) is the feature
vector extracted from the person $I_i$. We denote the collective actions of all the persons in the image
as $h = (h_1, h_2, \ldots, h_m)$, where $h_i \in \mathcal{H}$ is the action label of the person $I_i$ and $\mathcal{H}$ is the set of all
possible action labels. The image $I$ is associated with a group activity label $y \in \mathcal{Y}$, where $\mathcal{Y}$ is the
set of all possible activity labels.
We assume there are connections between some pairs of action labels $(h_j, h_k)$. Intuitively speaking,
this allows the model to capture important correlations between action labels. We use an undirected
graph $G = (V, E)$ to represent $(h_1, h_2, \ldots, h_m)$, where a vertex $v_i \in V$ corresponds to the action
label $h_i$, and an edge $(v_j, v_k) \in E$ corresponds to the interactions between $h_j$ and $h_k$.
We use $f_w(x, h, y; G)$ to denote the compatibility of the image feature $x$, the collective action labels
$h$, the group activity label $y$, and the graph $G = (V, E)$. We assume $f_w(x, h, y; G)$ is parameterized
by $w$ and is defined as follows:
$$f_w(x, h, y; G) = w^\top \Psi(x, h, y; G) \tag{1a}$$
$$= w_0^\top \phi_0(y, x_0) + \sum_{j \in V} w_1^\top \phi_1(x_j, h_j) + \sum_{j \in V} w_2^\top \phi_2(y, h_j) + \sum_{(j,k) \in E} w_3^\top \phi_3(y, h_j, h_k) \tag{1b}$$
The model parameters $w$ are simply the concatenation of four parts, $w = \{w_0, w_1, w_2, w_3\}$, one for
each term in Eq. 1b. The details of the potential functions in Eq. 1 are described in the following:
Image-Action Potential $w_1^\top \phi_1(x_j, h_j)$: This potential function models the compatibility between
the j-th person's action label $h_j$ and its image feature $x_j$. It is parameterized as:
$$w_1^\top \phi_1(x_j, h_j) = \sum_{b \in \mathcal{H}} w_{1b}^\top \, \mathbb{1}(h_j = b) \cdot x_j \tag{2}$$
where $x_j$ is the feature vector extracted from the j-th person and we use $\mathbb{1}(\cdot)$ to denote the indicator
function. The parameter $w_1$ is simply the concatenation of $w_{1b}$ for all $b \in \mathcal{H}$.
Action-Activity Potential $w_2^\top \phi_2(y, h_j)$: This potential function models the compatibility between
the group activity label $y$ and the j-th person's action label $h_j$. It is parameterized as:
$$w_2^\top \phi_2(y, h_j) = \sum_{a \in \mathcal{Y}} \sum_{b \in \mathcal{H}} w_{2ab} \cdot \mathbb{1}(y = a) \cdot \mathbb{1}(h_j = b) \tag{3}$$
Action-Action Potential $w_3^\top \phi_3(y, h_j, h_k)$: This potential function models the compatibility
between a pair of individuals' action labels $(h_j, h_k)$ under the group activity label $y$, where $(j, k) \in E$
corresponds to an edge in the graph. It is parameterized as:
$$w_3^\top \phi_3(y, h_j, h_k) = \sum_{a \in \mathcal{Y}} \sum_{b \in \mathcal{H}} \sum_{c \in \mathcal{H}} w_{3abc} \cdot \mathbb{1}(y = a) \cdot \mathbb{1}(h_j = b) \cdot \mathbb{1}(h_k = c) \tag{4}$$
Image-Activity Potential $w_0^\top \phi_0(y, x_0)$: This potential function is a root model which measures the
compatibility between the activity label $y$ and the root feature vector $x_0$ of the whole image. It is
parameterized as:
$$w_0^\top \phi_0(y, x_0) = \sum_{a \in \mathcal{Y}} w_{0a}^\top \, \mathbb{1}(y = a) \cdot x_0 \tag{5}$$
The parameter $w_{0a}$ can be interpreted as a root filter that measures the compatibility of the class
label $a$ and the root feature vector $x_0$.
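To make the four potentials concrete, the following is a minimal sketch of how the score in Eq. 1 could be evaluated for one image, one labeling and one graph. It is an illustration under stated assumptions, not the authors' implementation: the function names, the nested-list parameter layout (`w0[a]`, `w1[b]`, `w2[a][b]`, `w3[a][b][c]`) and the toy numbers are all hypothetical, and labels are encoded as integer indices.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def model_score(x0: Sequence[float],
                xs: List[Sequence[float]],
                h: List[int],
                y: int,
                edges: List[Tuple[int, int]],
                w0, w1, w2, w3) -> float:
    """Evaluate f_w(x, h, y; G) as the sum of the four potentials in Eq. 1b.

    Assumed (hypothetical) parameter layout:
      w0[a]       : root filter for activity a            (Eq. 5)
      w1[b]       : per-person filter for action b        (Eq. 2)
      w2[a][b]    : activity/action co-occurrence weight  (Eq. 3)
      w3[a][b][c] : weight for action pair (b, c) under a (Eq. 4)
    """
    total = dot(w0[y], x0)                 # image-activity potential
    for j, xj in enumerate(xs):
        total += dot(w1[h[j]], xj)         # image-action potential
        total += w2[y][h[j]]               # action-activity potential
    for j, k in edges:
        total += w3[y][h[j]][h[k]]         # action-action potential on edge (j, k)
    return total


# Toy example: 2 activities, 2 actions, 2-dimensional features, 2 persons.
w0 = [[1.0, 0.0], [0.0, 1.0]]
w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[0.5, 0.0], [0.0, 0.5]]
w3 = [[[0.2, 0.0], [0.0, 0.0]],
      [[0.0, 0.0], [0.0, 0.2]]]
x0 = [1.0, 0.0]
xs = [[1.0, 0.0], [1.0, 0.0]]

with_edge = model_score(x0, xs, h=[0, 0], y=0, edges=[(0, 1)],
                        w0=w0, w1=w1, w2=w2, w3=w3)
without_edge = model_score(x0, xs, h=[0, 0], y=0, edges=[],
                           w0=w0, w1=w1, w2=w2, w3=w3)
print(with_edge, without_edge)
```

Because the indicator functions in Eqs. 2 to 5 select exactly one block of each parameter vector, the sums over labels reduce to simple table lookups, which is what the indexing above does. Adding or removing an edge changes only the pairwise term.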
3 Learning and Inference
We now describe how to infer the label given the model parameters (Sec. 3.1), and how to learn the
model parameters from a set of training data (Sec. 3.2). If the graph structure G is known and fixed,
we can apply standard learning and inference techniques of latent SVMs. For our application, a
good graph structure turns out to be crucial, since it determines which person interacts (i.e. provides
action context) with another person. The interaction of individuals turns out to be important for
group activity recognition, and fixing the interaction (i.e. graph structure) using heuristics does not
work well. We will demonstrate this experimentally in Sec. 4. We instead develop our own inference
and learning algorithms that automatically infer the best graph structure from a particular set.
3.1 Inference
Given the model parameters $w$, the inference problem is to find the best group activity label $y^*$ for a
new image $x$. Inspired by the latent SVM [8], we define the following function to score an image $x$
and a group activity label $y$:
$$F_w(x, y) = \max_{G_y} \max_{h_y} f_w(x, h_y, y; G_y) = \max_{G_y} \max_{h_y} w^\top \Psi(x, h_y, y; G_y) \tag{6}$$
We use the subscript $y$ in the notations $h_y$ and $G_y$ to emphasize that we are now fixing on a particular
activity label $y$. The group activity label of the image $x$ can be inferred as $y^* = \arg\max_y F_w(x, y)$.
Since we can enumerate all the possible $y \in \mathcal{Y}$ and predict the activity label $y^*$ of $x$, the main
difficulty of solving the inference problem is the maximization over $G_y$ and $h_y$ according to Eq. 6.
Note that in Eq. 6, we explicitly maximize over the graph $G$. This is very different from previous
work which typically assumes the graph structure is fixed.
The optimization problem in Eq. 6 is in general NP-hard since it involves a combinatorial search.
We instead use a coordinate ascent style algorithm to approximately solve Eq. 6 by iterating the
following two steps:
1. Holding the graph structure $G_y$ fixed, optimize the action labels $h_y$ for the $\langle x, y \rangle$ pair:
$$h_y = \arg\max_{h'} w^\top \Psi(x, h', y; G_y) \tag{7}$$
2. Holding $h_y$ fixed, optimize the graph structure $G_y$ for the $\langle x, y \rangle$ pair:
$$G_y = \arg\max_{G'} w^\top \Psi(x, h_y, y; G') \tag{8}$$
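The alternation between Eq. 7 and Eq. 8 can be sketched as follows for a single fixed activity label $y$. This is a simplified illustration under explicit assumptions and not the paper's algorithm: the label step uses iterated conditional modes (a simple local search) in place of approximate max-inference, the graph step uses a greedy degree-capped edge selection, edges are treated as unordered pairs, and `unary`, `pair` and all function names are hypothetical. Here `unary[j][b]` stands for the sum of the terms in Eq. 1b that depend only on person j taking action b, and `pair[b][c]` for the action-action weight under the fixed $y$.

```python
from itertools import combinations


def total_score(h, edges, unary, pair):
    """Score of a labeling h and an edge set for one fixed activity label."""
    s = sum(unary[j][h[j]] for j in range(len(h)))
    return s + sum(pair[h[j]][h[k]] for j, k in edges)


def local_score(j, b, h, edges, unary, pair):
    """Terms of the score that change when person j takes action b."""
    s = unary[j][b]
    for p, q in edges:
        if p == j:
            s += pair[b][h[q]]
        elif q == j:
            s += pair[h[p]][b]
    return s


def greedy_graph(h, pair, max_degree):
    """Greedy stand-in for Eq. 8: add positive edges, best first, under a degree cap."""
    cands = sorted(combinations(range(len(h)), 2),
                   key=lambda e: pair[h[e[0]]][h[e[1]]], reverse=True)
    degree = [0] * len(h)
    edges = set()
    for j, k in cands:
        if pair[h[j]][h[k]] <= 0:
            break  # remaining candidates cannot increase the score
        if degree[j] < max_degree and degree[k] < max_degree:
            edges.add((j, k))
            degree[j] += 1
            degree[k] += 1
    return edges


def coordinate_ascent(unary, pair, max_degree, n_iters=10):
    m, n_actions = len(unary), len(unary[0])
    # Initialize labels from unary terms only, with an empty graph.
    h = [max(range(n_actions), key=lambda b, j=j: unary[j][b]) for j in range(m)]
    edges = set()
    best = total_score(h, edges, unary, pair)
    for _ in range(n_iters):
        # Step 1 (cf. Eq. 7): update labels with the graph held fixed.
        for j in range(m):
            h[j] = max(range(n_actions),
                       key=lambda b: local_score(j, b, h, edges, unary, pair))
        # Step 2 (cf. Eq. 8): update the graph with labels held fixed,
        # keeping the old graph if the greedy candidate is worse.
        candidate = greedy_graph(h, pair, max_degree)
        if total_score(h, candidate, unary, pair) >= total_score(h, edges, unary, pair):
            edges = candidate
        current = total_score(h, edges, unary, pair)
        if current <= best:
            break  # no further improvement
        best = current
    return h, edges, best


# Toy example: 3 persons, 2 actions; same-action pairs are rewarded.
unary = [[1.0, 0.0], [0.4, 0.5], [0.0, 1.0]]
pair = [[1.0, -1.0], [-1.0, 1.0]]
labels, graph, value = coordinate_ascent(unary, pair, max_degree=1)
print(labels, graph, value)
```

Both steps in this sketch never decrease the score, so the loop stops at a local optimum after a few iterations. Running it once per candidate activity label and keeping the label with the highest returned value corresponds to the outer maximization $y^* = \arg\max_y F_w(x, y)$.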
The problem in Eq. 7 is a standard max-inference problem in an undirected graphical model. Here
we use loopy belief propagation to approximately solve it. The problem in Eq. 8 is still an NP-hard
problem since it involves enumerating all the possible graph structures. Even if we can enumerate
all the graph structures, we might want to restrict ourselves to a subset of graph structures that will
lead to efficient inference (e.g. when using loopy BP in Eq. 7). One obvious choice is to restrict
$G'$ to be a tree-structured graph, since BP is exact and tractable for tree-structured models.
However, as we will demonstrate in Sec. 4, a tree-structured graph built from a simple heuristic (e.g.
minimum spanning tree) does not work that well. Another choice is to choose graph structures that
are "sparse", since sparse graphs tend to have fewer cycles, and loopy BP tends to be efficient in
graphs with fewer cycles. In this paper, we enforce the graph sparsity by setting a threshold $d$ on
the maximum degree of any vertex in the graph. When $h_y$ is fixed, we can formulate an integer
linear program (ILP) to find the optimal graph structure (Eq. 8) with the additional constraint that
the maximum vertex degree is at most $d$. Let $z_{jk} = 1$ indicate that the edge $(j, k)$ is included in the
graph, and $z_{jk} = 0$ otherwise. The ILP can be written as:
$$\max_{z} \sum_{j \in V} \sum_{k \in V} z_{jk} \psi_{jk}, \quad \text{s.t.} \quad \sum_{j \in V} z_{jk} \le d, \;\; \sum_{k \in V} z_{jk} \le d, \;\; z_{jk} = z_{kj}, \;\; z_{jk} \in \{0, 1\}, \;\; \forall j, k \tag{9}$$
where we use $\psi_{jk}$ to collectively represent the summation of all the pairwise potential functions in
Eq. 1 for the pair of vertices $(j, k)$. Of course, the optimization problem in Eq. 9 is still hard due
to the integrality constraint $z_{jk} \in \{0, 1\}$. But we can relax the value of $z_{jk}$ to a real value in the range
$[0, 1]$. The solution of the LP relaxation might have fractional values. To get integral solutions,
we simply round them to the closest integers.
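For intuition about what Eq. 9 optimizes, the sketch below solves the integer program exactly by enumerating edge subsets. This is only feasible for a handful of persons and is not the LP relaxation with rounding described above; it is meant as a reference against which a relaxed or greedy solution could be compared on toy inputs. The function name and the example matrix `psi` are hypothetical. Because $z_{jk} = z_{kj}$, each undirected edge contributes $\psi_{jk} + \psi_{kj}$ to the objective.

```python
from itertools import combinations


def solve_graph_ilp_bruteforce(psi, d):
    """Exhaustively maximize Eq. 9 over symmetric 0/1 edge indicators.

    psi[j][k] is the pairwise score of the ordered pair (j, k); selecting the
    undirected edge {j, k} sets z_jk = z_kj = 1 and adds psi[j][k] + psi[k][j].
    Returns (best objective value, tuple of selected edges). The empty graph
    (value 0.0) is the baseline. Cost is exponential in the number of pairs.
    """
    m = len(psi)
    pairs = list(combinations(range(m), 2))
    best_val, best_edges = 0.0, ()
    for r in range(1, len(pairs) + 1):
        for subset in combinations(pairs, r):
            degree = [0] * m
            for j, k in subset:
                degree[j] += 1
                degree[k] += 1
            if max(degree) > d:
                continue  # violates the maximum-degree constraint
            val = sum(psi[j][k] + psi[k][j] for j, k in subset)
            if val > best_val:
                best_val, best_edges = val, subset
    return best_val, best_edges


# Toy example with 4 persons and a degree cap of 1 (i.e. a matching).
psi = [[0.0, 2.0, 1.0, -1.0],
       [2.0, 0.0, 3.0, 0.5],
       [1.0, 3.0, 0.0, 2.0],
       [-1.0, 0.5, 2.0, 0.0]]
best_val, best_edges = solve_graph_ilp_bruteforce(psi, d=1)
print(best_val, best_edges)
```

In this example the single strongest edge is (1, 2), but selecting it blocks both of its endpoints when $d = 1$; the optimum instead pairs (0, 1) with (2, 3). This illustrates why the edge variables have to be optimized jointly under the degree constraint rather than chosen one at a time.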
3.2 Learning
Given a set of $N$ training examples $\langle x^n, h^n, y^n \rangle$ ($n = 1, 2, \ldots, N$), we would like to train the model
parameter $w$ that tends to produce the correct group activity $y$ for a new test image $x$. Note that the
action labels $h$ are observed on training data, but the graph structure $G$ (or equivalently the variables
$z$) is unobserved and will be automatically inferred. A natural way of learning the model is to adopt
the latent SVM formulation [8, 25] as follows:
$$\min_{w, \, \xi \ge 0, \, G_y} \; \frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} \xi_n \tag{10a}$$
$$\text{s.t.} \quad \max_{G_{y^n}} f_w(x^n, h^n, y^n; G_{y^n}) - \max_{G_y} \max_{h_y} f_w(x^n, h_y, y; G_y) \ge \Delta(y, y^n) - \xi_n, \quad \forall n, \forall y \tag{10b}$$
where $\Delta(y, y^n)$ is a loss function measuring the cost incurred by predicting $y$ when the ground-truth
label is $y^n$. In standard multi-class classification problems, we typically use the 0-1 loss $\Delta_{0/1}$
defined as:
$$\Delta_{0/1}(y, y^n) = \begin{cases} 1 & \text{if } y \ne y^n \\ 0 & \text{otherwise} \end{cases} \tag{11}$$
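The constrained optimization problem in Eq. 10 can be equivalently written as an unconstrained
problem:
$$\min_{w} \; \frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} (L_n - R_n) \tag{12a}$$
$$\text{where} \quad L_n = \max_{y} \max_{h_y} \max_{G_y} \left( \Delta(y, y^n) + f_w(x^n, h_y, y; G_y) \right), \quad R_n = \max_{G_{y^n}} f_w(x^n, h^n, y^n; G_{y^n}) \tag{12b}$$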
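As a small worked example of the per-example term $L_n - R_n$ in Eq. 12 under the 0-1 loss of Eq. 11, the sketch below assumes that the inner maximizations have already been carried out by some inference routine. The inputs are hypothetical: `best_score[y]` stands for $\max_{h_y} \max_{G_y} f_w(x^n, h_y, y; G_y)$ for each candidate label, and `gt_score` stands for $R_n$.

```python
def zero_one_loss(y, y_true):
    """0-1 loss of Eq. 11."""
    return 0.0 if y == y_true else 1.0


def per_example_term(best_score, gt_score, y_true):
    """Compute L_n - R_n from Eq. 12b given precomputed inner maxima.

    best_score: dict mapping each activity label y to the best score over
                action labels and graphs for that y (assumed precomputed).
    gt_score:   R_n, the best score over graphs with the ground-truth
                action labels and the ground-truth activity label.
    """
    loss_augmented = max(zero_one_loss(y, y_true) + s
                         for y, s in best_score.items())   # L_n
    return loss_augmented - gt_score


# Toy numbers: the correct label "queuing" scores highest, but "talking"
# is within the margin of 1, so the example still incurs a positive term.
best_score = {"queuing": 2.0, "talking": 1.5}
term = per_example_term(best_score, gt_score=1.75, y_true="queuing")
print(term)
```

When the maximizations are exact, $L_n \ge R_n$ because the ground-truth labeling is itself one of the candidates in $L_n$, so the term is non-negative. With approximate inference, as used in this paper, that guarantee need not hold.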
We use the non-convex bundle optimization in [7] to solve Eq. 12. In a nutshell, the algorithm
iteratively builds an increasingly accurate piecewise quadratic approximation to the objective func-
tion. During each iteration, a new linear cutting plane is found via a subgradient of the objective
function and added to the piecewise quadratic approximation. Now the key issue is to compute two
subgradients $\partial_w L_n$ and $\partial_w R_n$ for a particular $w$, which we describe in detail below.
First we describe how to compute $\partial_w L_n$. Let $(y^*, h^*, G^*)$ be the solution to the following
optimization problem:
$$\max_{y} \max_{h} \max_{G} \; \Delta(y, y^n) + f_w(x^n, h, y; G) \tag{13}$$