IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. X, X 201X
The Action Similarity Labeling Challenge
Orit Kliper-Gross, Tal Hassner, Lior Wolf
Abstract—Recognizing actions in videos is rapidly becoming a topic of much research. To facilitate the development of methods for
action recognition, several video collections, along with benchmark protocols, have previously been proposed. In this paper we present
a novel video database, the “Action Similarity LAbeliNg” (ASLAN) database, along with benchmark protocols. The ASLAN set includes
thousands of videos collected from the web, in over 400 complex action classes. Our benchmark protocols focus on action similarity
(same/not-same), rather than action classification, and testing is performed on never-before-seen actions. We propose this data set
and benchmark as a means for gaining a more principled understanding of what makes actions different or similar, rather than learning
the properties of particular action classes. We present baseline results on our benchmark, and compare them to human performance.
To promote further study of action similarity techniques, we make the ASLAN database, benchmarks, and descriptor encodings publicly
available to the research community.
Index Terms—Action recognition, Action similarity, Video database, Web videos, Benchmark.
INTRODUCTION
Recognizing human actions in videos is an important problem in Computer Vision with a wide range of applications, including video retrieval, surveillance, man-machine interaction, and more. With the availability of high bandwidth communication, large storage space, and affordable hardware, digital video is now everywhere. Consequently, the demand for video processing, particularly effective action recognition techniques, is rapidly growing. Unsurprisingly, action recognition has recently been the focus of much research.
Human actions are complex entities taking place over
time and over different body parts. Actions are either
connected to a context (e.g., swimming) or context free
(e.g., walking). What constitutes an “action” is often
undefined, and so the number of actions being per-
formed is typically uncertain. Actions can vary greatly in
duration; some are instantaneous whereas others are prolonged. They can involve interactions with other people or with static objects. Finally, they may include the
whole body or be limited to one limb. Figure 1 provides
examples, from our database, of these variabilities.
To facilitate the development of action recognition
methods, many video sets, along with benchmark pro-
tocols, have been assembled in the past. These attempt
to capture the many challenges of action recognition.
Some examples include the KTH and Weizmann databases, and the more recent Hollywood, Hollywood2, and YouTube-actions databases.
This growing number of benchmarks and data sets is
reminiscent of the data sets used for image classification
and face recognition. However, there is one important
difference: image sets for classification and recognition
now typically contain hundreds, if not thousands, of object classes or subject identities, whereas existing video data sets typically provide only around 10 classes (see Section 2).

• O. Kliper-Gross is with the Department of Mathematics and Computer Science, The Weizmann Institute of Science, Israel.
• T. Hassner is with the Computer Science Division, The Open University of Israel.
• L. Wolf is with the Blavatnik School of Computer Science, Tel-Aviv University, Israel.

Fig. 1: Examples of the diversity of “real-world” actions
We believe one reason for this disparity between
image and action classification is the following. Once
many action classes are assembled, classification be-
comes ambiguous. Consider, for example, a high jump.
Is it “running”? “jumping”? “falling”? Of course, it can
be all three and possibly more. Consequently, labels
assigned to such complex actions can be subjective and
may vary from one person to the next. To avoid this
problem, existing data sets for action classification offer
only a small set of well-defined, atomic actions, which
are either periodic (e.g., walking) or instantaneous (e.g., answering the phone).
In this paper we present a new action recognition data
set, the “Action Similarity LAbeliNg” (ASLAN) collec-
tion. This set includes thousands of videos collected from
the web, in over 400 complex action classes.
To standardize testing with this data, we provide a
“same/not-same” benchmark, which addresses the ac-
tion recognition problem as a non class-specific similarity
problem and which is different from more traditional multi-class recognition challenges. The rationale is that such a benchmark requires that methods learn to evaluate the similarity of actions rather than be able to recognize particular actions.

1. Our video collection, benchmarks, and related additional information are available on the project web page.

Fig. 2: Examples of “same” pairs from our database.
Specifically, the goal is to answer the following binary
question – “does a pair of videos present the same action,
or not?”. This problem is sometimes referred to as the
“unseen pair matching problem” (see for example ).
Figures 2 and 3 show some examples of “same” and
“not-same” labeled pairs from our database.
The power of the same/not-same formulation is in reducing a multi-class task to a manageable binary classification problem. Specifically, this same/not-same approach
has the following important advantages over multi-class
action labeling: (a) It relaxes the problem of ambiguous
action classes - it is certainly easier to label pairs as
same/not-same rather than pick one class out of over
a hundred, especially when working with videos. Class
label ambiguities make this problem worse. (b) By re-
moving from the test set all the actions provided for
training, we focus on learning action similarity, rather
than the distinguishing features of particular actions.
Thus, the benchmark aims for a generalization ability that is not limited to a predefined set of actions.
Finally, (c) besides providing insights towards better
action classification, pair-matching has interesting appli-
cations in its own right. Specifically, given a video of
an (unknown) action, one may wish to retrieve videos
of a similar action, without learning a specific model of
that action, and without relying on text attached to the
video. Such applications are now standard features in
image search engines (e.g., Google images).
To validate our data set and benchmarks, we encode
the videos in our database using state-of-the-art action
features, and present baseline results on our benchmark
using these descriptors. We further present a human
survey on our database. This demonstrates that our
benchmark, although challenging to modern Computer
Vision techniques, is well within human capabilities.

Fig. 3: Examples of “not-same” pairs from our database.
To summarize, we make the following contributions:
1) We make available a novel collection of videos
and benchmark tests for developing action simi-
larity techniques. This set is unique in the number
of categories it provides (an order of magnitude
more than existing collections), its associated pair-
matching benchmark and the realistic, uncontrolled
settings used to produce the videos.
2) We report performance scores obtained with a vari-
ety of leading action descriptors on our benchmark.
3) We have conducted an extensive human survey
which demonstrates the gap between current state-
of-the-art performance and human performance.
EXISTING DATA-SETS AND BENCHMARKS
In the last decade, image and video databases have be-
come standard tools for benchmarking the performance
of methods developed for many Computer Vision tasks.
Action recognition performance in particular has greatly
improved due to the availability of such data sets. We
present a list highlighting several popular data sets in
Table 1. All these sets typically contain around ten action
classes and vary in the number of videos available, the
video source, and the video quality.
TABLE 1: Popular Action Recognition Databases
Database | Classes | Videos | Setting | Data description
KTH | 6 | 600 | Laboratory: 25 actors, 4 conditions, 4 repetitions (2,391 sub-sequences) | Homogeneous background, static camera, 25fps, 160x120px, ~4s duration, AVI DIVX-compressed
Weizmann | - | - | Laboratory: 9 different actors | Static background, 180x144px, 25fps
- | - | - | Laboratory: 1 actor, many repetitions | Resolution 100-200px, very short sequences
- | - | - | Laboratory: 10 actors, arbitrary orientation, 5 view points with multiple cameras, 30 sub-sequences | -
UCF-sports | 9 | 200 | Real sports broadcasts | Unconstrained: wide range of scenes and viewpoints, simple background, 720x480px
- | - | - | - | 5,065 manually annotated frames: high intra-class variability, background clutter, large camera motion, motion blur, occlusions and appearance variations
Hollywood, Hollywood2 | 8/12 | 430/3,669 | 32/69 movies | Large intra-class variability, label ambiguity, multiple persons, challenging camera motion, rapid scene changes, unconstrained and cluttered background, high quality, 240x450px, 24fps
UFC | 2 | 20min | Broadcast videos | Varying views, appearance, camera motion, action frequencies, simultaneous actions by different actors
Olympic-games | 17 | 166 | Video footage from Olympic games | Complex activities, 50 sequences per class, labeled by Amazon Mechanical Turk
Youtube-actions | 11 | 1,168 | Youtube videos and personal videos | 25 sub-groups: different environments and photographers; mix of steady and shaky cameras, cluttered background, variation in object scale, viewpoints and illumination, low resolution (mpeg4 codec)
ADL | 10 | 150 | Laboratory: 5 actors, 3 repetitions | Complex activities in a living environment, static background, 1280x720px, 30fps, duration 10-60s
High-Five | 4 | 300 | 23 different TV shows | 30-600 frames, realistic human interactions: varying number of actors, scale, and views
MSR II | 3 | 54 | Recorded in a crowded environment | Multiple actions for the purpose of action detection, 203 action instances, 320x240px, 15fps

Early sets, such as KTH and Weizmann, have been used extensively to report action recognition performance. These sets contain a few, “atomic” classes such as walking, jogging, running, and boxing. The videos in both sets were acquired under controlled settings: a static camera and an uncluttered, static background.
Over the last decade the recognition performance on these sets has saturated. Consequently, there is a growing need for new sets, reflecting general action recognition tasks with a wider range of actions. Attempts have been made to manipulate acquisition parameters in the laboratory. This was usually done for specific purposes, such as studying viewing variations, occlusions, or
recognizing daily actions in static scenes. Although
these databases have contributed much to specific as-
pects of action recognition, one may wish to develop
algorithms for more realistic videos and diverse actions.
TV and motion picture videos have been used as
alternatives to controlled sets. The biggest such database
to date was constructed by Laptev et al. . Its authors,
recognizing the lack of realistic annotated data sets for
action recognition, proposed a method for automatic
annotation of human actions in motion pictures based
on script alignment and classification. They have thus
constructed a large data set of 8 action classes from
32 movies. In a subsequent work , an extended set was
presented, containing 3,669 action samples of 12 action
and 10 scene classes acquired from 69 motion pictures.
The videos included in it are of high quality and contain
no unintended camera motion. In addition, the actions
they include are non-periodic and well defined in time.
These sets, although new, have already drawn a lot of
attention (see for example , , ).
Other data sets employing videos from such sources
are the data set made available in , which includes
actions extracted from a TV series, the work of ,
which classifies actions in broadcast sports videos, and
the recent work of , which explores human interac-
tions in TV shows. All these sets offer only a limited
number of well defined action categories.
While most action recognition research has focused on
atomic actions, the recent works of  and  address complex activities, i.e., actions composed of a few simpler or shorter actions. Ikizler and Forsyth  suggest learning complex activity models by joining atomic action models built separately across time and across the body.
Their method has been tested on a controlled set of
complex motions and on challenging data from the TV
series Friends. Niebles et al.  propose a general
framework for modeling activities as temporal compo-
sition of motion segments. The authors have collected
a new data set of 16 complex Olympic Sports activities
downloaded from YouTube.
Websites such as YouTube make huge amounts of
video footage easily accessible. Videos available on these
websites are produced under diverse, realistic conditions
and have the advantage of having a huge variability of
actions. This naturally brings to light new opportunities
for constructing action recognition benchmarks. Such
web data is increasingly being used for action recogni-
tion-related problems. These include works performing automatic categorization of web videos, and others which categorize events in web videos. These do not
directly address action recognition but inspire further
research in using web data for action recognition.
Most closely related to our ASLAN set, is the YouTube
Action Data Set . As far as we know, it is the first
action recognition database containing videos “in the
wild”. This database, already used in a number of re-
cent publications (for example, , , ), contains
1,168 complex and challenging video sequences from
YouTube and personal home-videos. Since the videos’
source is mainly the web, there was no control over
the filming and therefore the database contains large
variations in camera motion, scale, view, background,
illumination conditions, etc. In this sense, this database
is similar to our own. However, unlike the ASLAN set,
the YouTube set contains only 11 action categories, which
although exhibiting large intra-class variation, are still
relatively well separated.
Most research on action recognition focuses either on
multi-label action classification or on action detection.
Existing methods for action similarity such as , ,
 mainly focus on spatiotemporal action detection or
on action classification. Action recognition has addition-
ally been considered for never-before-seen views of a
given action class (see, e.g., the work of , , ).
None of these provide data or standard tests for the
purpose of matching pairs of never-before-seen actions.
The benchmark proposed here attempts to address
another shortcoming of existing benchmarks, namely, the
lack of established, standard testing protocols. Different
researchers use varying sizes of training and testing sets,
different ways of averaging over experiments, etc. We
hope that by providing a unified testing protocol we may
provide an easy means of measuring and comparing
performance of different methods.
Our work has been motivated by recent image sets,
such as the Labeled Faces in the Wild (LFW)  for
face recognition, and the extensive Scene Understanding
(SUN) database  for scene recognition. In both cases,
very large image collections were presented, answering
a need for larger scope in complementary vision prob-
lems. The unseen pair-matching protocol presented in 
motivated the one proposed here.
We note that same/not-same benchmarks such as the
one described here have been successfully employed
for different tasks in the past. Face recognition “in the
wild” is one such example . Others include historical
document analysis , face recognition from YouTube
videos , and object classification (e.g., ).
GOALS OF THE PROPOSED BENCHMARK
The same/not-same challenge
In a same/not-same setting the goal is to decide if two
videos present the same action or not, following training
with same and not-same labeled video pairs. The actions
in the test set are not available during training, but rather
belong to separate classes. This means that there is no
opportunity during training to learn models for actions
presented for testing.
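To make this setting concrete, the following is a minimal sketch (not part of the released benchmark code) of one way to make a same/not-same decision: score a pair of fixed-length video descriptors with a similarity measure and threshold it, with the threshold learned from the labeled training pairs. The descriptor representation, the choice of cosine similarity, and all names below are illustrative assumptions.

```python
import numpy as np

def pair_similarity(desc_a, desc_b):
    # Cosine similarity between two fixed-length video descriptors
    # (e.g., bag-of-features histograms). Any similarity measure could be used.
    a = np.asarray(desc_a, dtype=float)
    b = np.asarray(desc_b, dtype=float)
    a /= (np.linalg.norm(a) + 1e-12)
    b /= (np.linalg.norm(b) + 1e-12)
    return float(np.dot(a, b))

def fit_threshold(train_pairs, train_labels):
    # Choose the similarity threshold that maximizes accuracy on the
    # training pairs; labels are 1 for "same" and 0 for "not-same".
    scores = np.array([pair_similarity(a, b) for a, b in train_pairs])
    labels = np.asarray(train_labels)
    return max(np.unique(scores),
               key=lambda t: np.mean((scores >= t) == labels))

def predict_same(desc_a, desc_b, threshold):
    # Declare "same" when the similarity exceeds the learned threshold.
    return pair_similarity(desc_a, desc_b) >= threshold
```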
We favor a same/not-same benchmark over multi-
label classification as its simple binary structure makes it
far easier to design and evaluate tests. However, we note
that typical action recognition applications label videos
using one of several different labels rather than making
similarity decisions. The relevance of a same/not-same
benchmark to these tasks is therefore not obvious. Recent
evidence obtained using the LFW benchmark suggests,
however, that successful pair-matching methods may
be applied for multi-label classification with equal success.

The testing paradigm
The setting of our testing protocol is similar to the one
proposed by the LFW benchmark  for face recognition.
The benchmarks for the ASLAN database are organized
into two “Views”. View-1 is for algorithm development
and general experimentation, prior to formal evaluation.
View-2 is for reporting performance and should only be
used for the final evaluation of a method.
View-1: Model selection and algorithm development.
This view of the data consists of two independent
subsets of the database, one for training, and one for
testing. The training set consists of 1,200 video pairs:
600 pairs with similar actions, and 600 pairs of different
actions. The test set consists of 600 pairs: 300 “same” and
300 “not-same”-labeled pairs. The purpose of this view is
for researchers to freely experiment with algorithms and
parameter settings without worrying about over-fitting.
View-2: Reporting performance. This view consists of
10 subsets of the database, mutually exclusive in the ac-
tions they contain. Each of the subsets contains 600 video
pairs: 300 same and 300 not-same. Once the parameters
for an algorithm have been selected the performance of
that algorithm can be measured using View-2.
ASLAN performance should be reported by aggregat-
ing scores on 10 separate experiments in a leave-one-
out cross validation scheme. In each experiment, nine of
the subsets are used for training, with the tenth used
for testing. It is critical that the final parameters of the
classifier under each experiment be set using only the
training data for that experiment, resulting in 10 separate
classifiers (one for each test set).
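The leave-one-out protocol over the ten View-2 subsets can be summarized by the following sketch; the data layout and function signatures are assumptions made for illustration and do not correspond to any released benchmark code.

```python
import numpy as np

def run_view2_protocol(subsets, train_fn, predict_fn):
    """Leave-one-subset-out evaluation over the 10 View-2 splits.

    subsets    : list of 10 (pairs, labels) tuples, one per subset.
    train_fn   : callable(pairs, labels) -> model, trained on 9 subsets only.
    predict_fn : callable(model, pairs) -> predicted same/not-same labels.
    Returns the per-experiment accuracies P_i, in percent.
    """
    accuracies = []
    for i in range(len(subsets)):
        test_pairs, test_labels = subsets[i]
        train_pairs = [p for j, (ps, _) in enumerate(subsets) if j != i for p in ps]
        train_labels = [y for j, (_, ys) in enumerate(subsets) if j != i for y in ys]
        model = train_fn(train_pairs, train_labels)  # parameters fixed per fold
        preds = np.asarray(predict_fn(model, test_pairs))
        accuracies.append(100.0 * np.mean(preds == np.asarray(test_labels)))
    return accuracies
```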
For reporting final performance of the classifier, we use the same method as in  and ask that each experimenter report the estimated mean accuracy and the standard error of the mean (SE) for View-2 of the database. Namely, the estimated mean accuracy $\hat{\mu}$ is given by $\hat{\mu} = \frac{1}{10}\sum_{i=1}^{10} P_i$, where $P_i$ is the percentage of correct classifications on View-2, using subset $i$ for testing. The standard error of the mean is given by $S_E = \hat{\sigma} / \sqrt{10}$, where $\hat{\sigma} = \sqrt{\frac{1}{9}\sum_{i=1}^{10}\left(P_i - \hat{\mu}\right)^2}$ is the estimate of the standard deviation of the $P_i$.
In our experiments (see Section 5) we also report
the Area Under the Curve (AUC) of the ROC curve
produced for classifiers used on the 10 test sets.
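As a worked example, the reported statistics can be computed from the ten per-fold accuracies as follows. This is a small sketch assuming NumPy arrays of pooled pair scores and ground-truth labels are available; scikit-learn's roc_auc_score is used here for the AUC, which is one possible implementation choice.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def aslan_report(fold_accuracies, pooled_scores, pooled_labels):
    # fold_accuracies: the 10 values P_i (percent correct on each test subset).
    # pooled_scores/pooled_labels: similarity scores and 0/1 labels over all folds.
    p = np.asarray(fold_accuracies, dtype=float)
    mu_hat = p.mean()                        # estimated mean accuracy
    sigma_hat = p.std(ddof=1)                # sample standard deviation of the P_i
    std_error = sigma_hat / np.sqrt(len(p))  # standard error of the mean
    auc = roc_auc_score(pooled_labels, pooled_scores)  # area under the ROC curve
    return mu_hat, std_error, auc
```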
ASLAN was assembled in over 5 months of work,
which included the downloading and the processing of
around 10,000 videos from YouTube. Construction was
performed in two phases. In each phase we followed the
following steps: (1) defining search terms, (2) collecting raw data, (3) extracting action samples, (4) labeling, and (5) manual validation. After the database was assembled we defined the two views by randomly selecting video pairs. We next describe the main construction details. For further details please refer to the project web page.

TABLE 2: ASLAN Database Statistics
♯ action samples: -
♯ unique samples: -
♯ unique urls: -
♯ unique titles: -
average ♯ of samples per class: -
♯ classes with > 1 samples: -
Largest number of samples: “Handstand”, 91 seq. / 64 urls
long samples (duration > 10sec): -
short samples (duration < 1sec): -
♯ test pairs / ♯ training pairs: 600 / 5,400
min ♯ STIP / max ♯ STIP: 3 / 26,052
Average ♯ STIP: -
2. Numbers relate to View-2, for each of the 10 experiments.

Main construction details
Our original search terms were based on the terms
defined by the CMU Graphics Lab Motion Capture
Database. The CMU database is organized as a tree,
where the final description of an action sequence is at the
leaf. Our basic search terms were based on individual ac-
tion terms from the CMU leaves. For some of the search
terms we also added a context term (usually taken from
a higher level in the CMU tree). For example: one search
term could be climb and another could be playground-
climb. This way, several query terms can retrieve the
same action in different contexts.
In the first phase, we defined 235 such terms, and automatically downloaded the top
20 YouTube video results for each term, resulting in
∼ 3,000 videos. Action labels were defined by the search
terms, and we validated these labels manually.
Following the validation, only ∼ 10% of the down-
loaded videos contained at least one action, demon-
strating the poor quality of keyword-based search as
noted also in , . We further dismissed cartoons,
static images, and very low quality videos. The intra-
class variability was extremely large and search terms
only generally described the actions in each category.
We were consequently required to use more subtle action
definitions, and a more careful labeling process.
In the second phase, 174 new search terms were
defined based on first phase videos. 50 videos were
downloaded for each new term, totaling ∼ 6,400 videos.
YouTube videos often present more than one action.
Since ASLAN is designed for action similarity, not detec-
tion, we manually cropped the videos into action sam-
ples. An action sample is defined as a sub-sequence of a
shot presenting a detected action, that is, a consecutive
set of frames taken by the same camera presenting one
action. The action samples were then manually labeled
according to their content; a new category was defined
for each new action encountered. We allowed each action
sample to fall into several categories whenever the action
could be described in more than one way.
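For illustration only, an action sample and its (possibly multiple) category labels might be represented with a record such as the following; the field names and the label-intersection rule are assumptions, not the database's released format (same/not-same ground truth is defined by the published benchmark pair lists).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ActionSample:
    # Hypothetical record for one manually cropped action sample: a
    # consecutive set of frames from a single shot, possibly carrying
    # more than one category label.
    source_url: str
    first_frame: int
    last_frame: int
    labels: List[str] = field(default_factory=list)

def share_label(a: ActionSample, b: ActionSample) -> bool:
    # One possible convention: a pair is "same" when the label sets of
    # the two samples intersect.
    return bool(set(a.labels) & set(b.labels))
```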
The final database contains 3,631 unique action sam-
ples from 1,571 unique urls, and 1,561 unique titles,
in 432 action classes. Table 2 provides some statistical
information on our database. Additional information
may be found on our website.
All the action samples are encoded in mp4 format (h264 codec) at the highest resolution available for download, as well as in AVI format (xvid codec). The database
contains videos of different resolution, frame size, aspect
ratio, and frame rate. Most videos are in color, but some are in gray-scale.

Building the views
Before detailing the views’ construction we note the
following. Action recognition is often used for video
analysis and/or scene understanding. The term itself
sometimes refers to action detection, which may involve
selecting a bounding box around the actor, or marking
the time an action is performed. Here we avoid de-
tection by constructing our database from short video
samples that could in principle be the output of an
action detector. In particular, since every action sample
in our database is manually extracted, there is no need
to temporally localize the action. We thus separate action
detection from action similarity and minimize the ambi-
guity that may arise in determining action durations.
To produce the views for our database, we begin by
defining a list of valid pairs. Valid pairs are any two
distinct samples which were not originally cut from the
same video; pairs of samples originating from the same
video were ignored. The idea was to avoid biases for
certain video context/background in same-labeled pairs,
and to reduce confusion due to similar backgrounds in not-same-labeled pairs.
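A minimal sketch of this filtering step is given below; it assumes sample records expose a source_url field, as in the illustrative structure above, and is not the actual construction code.

```python
from itertools import combinations

def valid_pairs(samples):
    # Enumerate candidate pairs while skipping any two samples that were
    # cut from the same source video.
    for a, b in combinations(samples, 2):
        if a.source_url != b.source_url:
            yield a, b
```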
View-1 test pairs were chosen out of the valid pairs in
40 randomly selected categories. The pairs in the training
set of View-1 were chosen out of the valid pairs in the remaining categories.
To define View-2, we randomly split the categories
into 10 subsets, ensuring that each has at least 300 valid
same pairs. To balance each subset’s categories, we allow
only up to 30 same pairs from each label. Once the
categories of the subsets were defined, we randomly selected the same and not-same pairs of each subset from among its valid pairs.