Machine Learning (2025) 114:44
https://doi.org/10.1007/s10994-024-06647-3
Masked autoencoder formultiagent trajectories
YannickRudolph1 · UlfBrefeld1
Received: 5 October 2023 / Revised: 31 May 2024 / Accepted: 3 October 2024 /
Published online: 27 January 2025
© The Author(s) 2025
Abstract
Automatically labeling trajectories of multiple agents is key to behavioral analyses but
usually requires a large amount of manual annotations. This also applies to the domain of
team sport analyses. In this paper, we specifically show how pretraining transformer mod-
els improves the classification performance on tracking data from professional soccer. For
this purpose, we propose a novel self-supervised masked autoencoder for multiagent tra-
jectories to effectively learn from only a few labeled sequences. Our approach builds upon
a factorized transformer architecture for multiagent trajectory data and employs a mask-
ing scheme on the level of individual agent trajectories. As a result, our model allows for
a reconstruction of masked trajectory segments while being permutation equivariant with
respect to the agent trajectories. In addition to experiments on soccer, we demonstrate the
usefulness of the proposed pretraining approach on multiagent pose data from entomology.
In contrast to related work, our approach is conceptually much simpler, does not require
handcrafted features and naturally allows for permutation invariance in downstream tasks.
Keywords Self-supervised learning · Multiagent trajectories · Masked autoencoder · Transformer · Tracking data · Soccer
1 Introduction
Classification of multiagent trajectory segments is central to sense-making, categorization, and retrieval in behavioral analyses. Applications range from educational data mining on student interactions and behavioral analysis of animals to tactical analyses in sports analytics. An exemplary task for the latter domain is automatically labeling instances of multiagent tracking data from team sports with respect to the occurrence of selected in-game events (such as on-ball actions). While supervised learning for this task is in
Editors: Philippe Lopes, Werner Dubitzky, Daniel Berrar, Jesse Davis.
* Yannick Rudolph
yannick.rudolph@leuphana.de
Ulf Brefeld
ulf.brefeld@leuphana.de
¹ Leuphana University of Lüneburg, Universitätsallee 1, 21335 Lüneburg, Germany
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
principle straightforward, more recent deep learning models might require large amounts of manual annotations, even if we make use of symmetries in the data.
So far, supervised models for multiagent trajectories have either been trained on handcrafted or static features (Anzer and Bauer, 2022; Chawla et al., 2017; Power et al., 2017) or involve self-supervised and semi-supervised methods with autoregressive reconstruction tasks (Sun et al., 2021; Fassmeyer et al., 2021). Regarding self-supervised pretraining, however, it is unclear whether autoregressive models can achieve state-of-the-art representations for downstream tasks. While the performance of autoregressive models on some generative tasks is impressive (van den Oord et al., 2016; Brown et al., 2020), recent studies on language and vision suggest that data denoising tasks may be better suited as a pretext task for pretraining representations that help with classification (Devlin et al., 2018; He et al., 2022). Regarding learning from sequential data, self-supervised approaches for multiagent trajectory data so far do not account for obvious symmetries in the data, i.e. models and objectives are not permutation equivariant or permutation invariant with respect to the ordering of agents (Sun et al., 2021).
In this paper, we introduce a novel self-supervised pretraining approach for classifying multiagent trajectory instances: the trajectory masked autoencoder (T-MAE). In our approach, we reconstruct randomly masked trajectory segments by incorporating a novel masking scheme into a transformer architecture that factorizes over the time and agent dimensions. The factorization renders the encoder (i.e. the feature extractor) permutation equivariant with respect to the ordering of the trajectories and allows for applications that are permutation invariant with respect to the agents. Empirically, our approach makes the modeling capacity of vastly over-parameterized modern transformer architectures available for downstream tasks with only a few labeled instances, and also improves performance in scenarios with abundant training data. We observe that pretrained models consistently outperform un-pretrained baseline models on the task of classifying instances of multiagent trajectories with respect to events in professional soccer matches. Furthermore, our method compares favorably to related self-supervised approaches on multiagent pose data from entomology.
2 Preliminaries and problem formulation
We focus on trajectories with a fixed length of T timesteps and a fixed number of K agents for all instances of multiagent trajectories. We further do not assume any partial or total order on the set of agents. Observations for agent $1 \le k \le K$ at timestep $1 \le t \le T$ are given by $x_t^k$. Observations may, for example, include two-dimensional positions in space as well as the speed and/or rotation angles of agents or parts of agents. A complete trajectory for an agent k is given by $x_{1:T}^k$ and a single timestep t for the set of all agents is given by $\{x_t^1, \ldots, x_t^K\}$. Each instance consists of a set of trajectories and we let $x = \{x_{1:T}^1, \ldots, x_{1:T}^K\}$ denote a complete multiagent trajectory instance.
Let $\mathcal{X}$ be the space of possible multiagent trajectories. We aim to solve the following multi-label classification problem: Given an N-sample of labeled multiagent trajectory instances $\mathcal{D} = \{(x^{(n)}, y^{(n)})\}_{n=1}^N$ with $x^{(n)} \in \mathcal{X}$ and binary label vectors $y^{(n)} \in \{0,1\}^L$, where L is the number of classes, the goal is to learn a function $f: \mathcal{X} \to \{0,1\}^L$ that generalizes well on unseen data.
Since we must impose some (possibly random) ordering of agents and their trajectories within the input to our model, a direct application of standard supervised classification networks is rendered suboptimal. We instead require f to be a permutation invariant mapping with respect to the ordering of trajectories. Specifically: let $\pi$ be a K-tuple denoting a permutation of the integers 1 through K, i.e. let $(\pi_1, \pi_2, \ldots, \pi_K)$ represent a permutation of $(1, 2, \ldots, K)$. If $f([x_{1:T}^1, \ldots, x_{1:T}^K]) = y$, we require that for any permutation $\pi$ we obtain $f([x_{1:T}^{\pi_1}, \ldots, x_{1:T}^{\pi_K}]) = y$ as well.
Since labels for multiagent trajectories are usually expensive, it is further desirable that the classifier f can be trained efficiently with respect to labeled data. In this paper we are concerned with pretraining a representation $\phi(x)$ that can be learned on purely unlabeled data, where we assume unlabeled data to be cheap and thus available in large quantity. A classifier f can subsequently be applied to this pretrained representation. Hence, the feature extraction $\phi(\cdot)$ becomes part of the overall classification model. To ensure that the classifier f applied to representation $\phi(x)$ is permutation invariant with respect to the ordering of trajectories in x, we require the representation $\phi(x)$ to be permutation equivariant regarding the ordering of the trajectories.
We formalize both the requirement and its adequacy in the following exposition. Assuming we can decompose the representation $\phi(x)$ into parts corresponding to the individual trajectories, $[\phi(x_{1:T}^1), \ldots, \phi(x_{1:T}^K)]$, permutation equivariance with respect to the ordering of trajectories implies that for any permutation $\pi$ of our K-tuple, we require that $\phi([x_{1:T}^{\pi_1}, \ldots, x_{1:T}^{\pi_K}]) = [\phi(x_{1:T}^{\pi_1}), \ldots, \phi(x_{1:T}^{\pi_K})]$.
Proposition 1 If the representation $\phi(x)$ satisfies permutation equivariance with respect to the ordering of trajectories within x, a permutation invariant classifier f which is applied to representation $\phi(x)$ can be considered permutation invariant with respect to the ordering of trajectories within x.

Proof sketch Since permutation equivariance of $\phi(x)$ guarantees that the set of trajectory representations in $\phi(x)$ is invariant to the permutation of x, any permutation invariant classifier that is applied to $\phi(x)$ will also be permutation invariant with respect to x.
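Proposition 1 can be checked numerically. The sketch below uses a toy per-trajectory feature map $\phi$ (a stand-in for an encoder, not the architecture proposed here) and sum-pooling over agents as the permutation invariant aggregation inside f; all shapes and the tanh map are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, F = 5, 10, 2                     # agents, timesteps, features per timestep
x = rng.normal(size=(K, T, F))         # one multiagent trajectory instance
W = rng.normal(size=(T * F, 8))        # weights of a toy feature map phi

def phi(x):
    # applied independently per trajectory -> permutation equivariant in K
    return np.tanh(x.reshape(-1, T * F) @ W)

def f(z):
    # sum over the agent axis -> invariant to the ordering of trajectories
    return z.sum(axis=0)

perm = rng.permutation(K)
assert np.allclose(f(phi(x)), f(phi(x[perm])))   # holds for any permutation
```

Any other symmetric aggregation (mean, max) would serve equally well as the invariant pooling step.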
3 Trajectory masked autoencoder

In this section, we introduce the trajectory masked autoencoder (T-MAE), a novel self-supervised pretraining approach for multiagent trajectories. Specifically, we propose to train a model that reconstructs multiagent trajectory instances given that we randomly mask out segments of individual trajectories. The proposed model consists of an encoder and a decoder, and can thus be regarded as a denoising autoencoder (Vincent et al., 2008, 2010), with the noise being random masking, i.e. the random removal of parts of the input data. The approach is related to the recently proposed masked autoencoder (He et al., 2022) for image data. For both the encoder and the decoder of the pretraining architecture we propose to use a factorized transformer architecture (Vaswani et al., 2017) that is permutation equivariant with respect to trajectory ordering. Since we directly apply (and finetune) the encoder for downstream tasks, we leverage the transformer's modeling capacity for the classification task.

Self-supervised pretraining naturally comes with additional compute. The presentation of the masked autoencoding approach in this section is limited to a factorized transformer architecture, which is most sensible if there are interactions between the trajectories. Furthermore, we present the T-MAE under the assumption that unlabeled data is cheap and abundant.

We proceed by first introducing details regarding the factorized transformer as applied in the T-MAE. We then introduce the autoencoding scheme of the T-MAE, including its application to downstream tasks.¹
3.1 Factorized transformer encoder (FTE)

To update representations within the T-MAE we employ a factorized transformer architecture. We will refer to this model as the factorized transformer encoder (FTE), given its relation to the encoder of the standard transformer model (Vaswani et al., 2017). Within the FTE we factorize self-attention over time and agents. The FTE comprises stacks of two standard transformer encoder layers, where one layer operates over trajectory segments in time and is applied separately to each trajectory, while the other layer operates over trajectory segments per agent and is applied separately to all temporal positions.

The explicit distinction between model components that operate over the time dimension and model components that operate over the agent dimension has been a cornerstone of recent multiagent trajectory models (Yeh et al., 2019; Casas et al., 2020). Using a transformer for both operations is conceptually simple and comes with a relatively high model capacity. In principle, the FTE has been introduced as a component of a variational autoencoder for trajectory generation in Girgis et al. (2022), where it is referred to as a multi-head attention block (MAB). While the FTE and the MAB are architecturally identical, they operate on different representations. Instead of focusing on individual timesteps, we generalize the input data to trajectory segments (consisting of one or more consecutive timesteps). Furthermore, our model operates on a masked (and shuffled) representation, where masking causes the segments to not be temporally aligned for updates along the agent dimension. Hence, the model must rely on the positional encoding to decide which other segments to attend to. See Fig. 1 for a visualization of one composite layer of the FTE applied to shuffled segment representations.
Although the FTE is applied to different inputs in the encoder and decoder, for simplicity we denote all input to the FTE as x in the following exposition. In principle, the FTE architecture consists of stacks of two different types of standard transformer encoder layers as proposed in Vaswani et al. (2017), which themselves comprise multi-head self-attention layers, residual connections, dropout (Srivastava et al., 2014), layer normalization (Ba et al., 2016) and feed-forward neural networks.

Fig. 1 Sketch of the factorized transformer encoder (FTE) applied to trajectory segments which are shuffled along the temporal dimension (numbered 1 through 4; the agent dimension is color-coded). $TL_{enc}$ denotes a standard transformer encoder layer as proposed in Vaswani et al. (2017)

¹ We provide a link to a PyTorch implementation of T-MAEs at https://ml3.leuphana.de/projects.html

We refer to either
Vaswani etal. (2017) or Girgis etal. (2022) for further details on standard transformer
encoder layers. Within one stack of the FTE the first transformer encoder layers oper-
ates over segments in time and is applied separately to each trajectory, whereas the sec-
ond layer operates over the segments per agent and is applied separately to all temporal
positions. That is, we basically extend the batch dimension of the input to trajectories
for the first operation and extend the batch dimension of the input to temporal positions
in the second operation. Together, the encoder layer with temporal self-attention and the
encoder layer with self-attention with respect to trajectory interactions constitute one
composite layer in the FTE, of which several can be stacked within the model archi-
tecture. In our experiments in Sect.4 we use three composite FTE layers both in the
encoder and decoder of T-MAEs as well as for the baseline transformer models.
Importantly, the application of an FTE to x is permutation equivariant with respect to the order of trajectories. This property of the FTE follows directly from the fact that the standard transformer encoder is itself permutation equivariant with respect to its input, and that the temporal self-attention (which is provided with positional information through encodings) is applied separately to each trajectory. We refer to Girgis et al. (2022) for a formal proof of the FTE's permutation equivariance.
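One composite FTE layer can be sketched in PyTorch as follows. Class and variable names are ours, and the layer sizes mirror the experiments in Sect. 4 (model dimension 32, 4 heads, feed-forward dimension 128) but are otherwise illustrative; this is a sketch of the batch-extension idea, not the released implementation.

```python
import torch
import torch.nn as nn

class FTECompositeLayer(nn.Module):
    """Temporal self-attention per trajectory, then agent self-attention
    per segment position (cf. Fig. 1)."""

    def __init__(self, d=32, nhead=4, dim_ff=128):
        super().__init__()
        self.time_layer = nn.TransformerEncoderLayer(
            d, nhead, dim_ff, dropout=0.0, batch_first=True)
        self.agent_layer = nn.TransformerEncoderLayer(
            d, nhead, dim_ff, dropout=0.0, batch_first=True)

    def forward(self, x):                 # x: (batch, K agents, S segments, d)
        B, K, S, d = x.shape
        # extend the batch dimension by the trajectories: attend over time
        x = self.time_layer(x.reshape(B * K, S, d)).reshape(B, K, S, d)
        # extend the batch dimension by the temporal positions: attend over agents
        x = x.transpose(1, 2).reshape(B * S, K, d)
        x = self.agent_layer(x).reshape(B, S, K, d).transpose(1, 2)
        return x

# permutation equivariance with respect to the agent axis
layer = FTECompositeLayer().eval()
x = torch.randn(2, 5, 4, 32)
perm = torch.randperm(5)
with torch.no_grad():
    assert torch.allclose(layer(x)[:, perm], layer(x[:, perm]), atol=1e-4)
```

The final assertion mirrors the equivariance property discussed above: since the agent-axis attention uses no positional information over agents, permuting trajectories merely permutes the output.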
3.2 Masked autoencoder for multiagent trajectories
We now describe the details of our masked autoencoder. We want to point out that each step of the autoencoder is permutation equivariant with respect to the order of the trajectories. Since a composition of permutation equivariant functions is itself permutation equivariant (Zaheer et al., 2017), the trajectory masked autoencoder is thus permutation equivariant. See Fig. 2 for a complete sketch of the approach.
3.2.1 Encoder

We begin by describing the encoder $\mathrm{enc}_\phi(x)$ of the proposed trajectory masked autoencoder (T-MAE), which operates on a multiagent trajectory instance $x = \{x_{1:T}^1, \ldots, x_{1:T}^K\}$.

Fig. 2 Sketch of the proposed trajectory masked autoencoder (T-MAE). In the encoder (top), we highlight segments which are masked only in a later step with dotted lines for illustration purposes

We first split each individual trajectory $x_{1:T}^k$ into S segments $x_s^k$ of an equal length of $l_S$ timesteps. Each segment thus comprises several timesteps (with $l_S = 1$ we have one segment per timestep and S being equal to T). We let $\mathrm{SEG}_{l_S}$ denote the segmentation and can formalize the operation as $\mathrm{SEG}_{l_S}(x_{1:T}^k) = x_{1:S}^k$, where each trajectory is segmented separately.
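With five-second instances at 25 Hz (as in Sect. 4.1), the segmentation amounts to a simple reshape; the per-timestep feature count below is an illustrative assumption.

```python
import numpy as np

T, l_S, n_feat = 125, 5, 3             # 125 timesteps; e.g. xy position plus speed
S = T // l_S                           # S = 25 segments per trajectory
x_k = np.random.default_rng(0).normal(size=(T, n_feat))
x_seg = x_k.reshape(S, l_S * n_feat)   # SEG_{l_S}: one flat vector per segment
assert x_seg.shape == (25, 15)
```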
Every segment is separately processed with a shared fully-connected feed-forward network (FFN) and thus embedded into model dimension d. At this point, we add positional encodings (PE) and (optional) trajectory embeddings (TE) to our representation. We denote this as $\mathrm{ADD}_{\mathrm{emb}}$, where "emb" denotes the embeddings that we add to each trajectory segment. We encode the temporal position of each segment by sinusoidal PE as introduced in Vaswani et al. (2017) and previously used by Girgis et al. (2022) for multiagent trajectories (alternatively, one could also employ learnable embeddings for positional encoding). The encoding has dimensionality d and is added elementwise to each segment, i.e. $\mathrm{ADD}_{\mathrm{PE}}(x_s^k) = \mathrm{PE}_s + x_s^k$. Trajectory embeddings are learnable embeddings, also of dimensionality d, which encode properties of the individual trajectories and are likewise added to segments elementwise. In this case, we add the same embedding to each segment of any individual trajectory: $\mathrm{ADD}_{\mathrm{TE}}(x_s^k) = \mathrm{TE}_k + x_s^k$. In an application to soccer data, we may encode whether the trajectory originates from the ball or a player.
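The two additive embeddings can be sketched as follows; the random segment embeddings and type assignments are placeholders (in practice they come from the shared FFN and learned parameters), and the five trajectory types follow the soccer setup in Sect. 4.1.

```python
import numpy as np

def sinusoidal_pe(S, d):
    # standard sinusoidal positional encoding (Vaswani et al., 2017)
    pos = np.arange(S)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / 10000.0 ** (2 * i / d)
    pe = np.zeros((S, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
K, S, d = 23, 25, 32                  # 22 players + ball; 25 segments; model dim
z = rng.normal(size=(K, S, d))        # segment embeddings after the shared FFN
TE = rng.normal(size=(5, d))          # learnable type embeddings (ball, keepers, ...)
types = rng.integers(0, 5, size=K)    # type id per trajectory

z = z + sinusoidal_pe(S, d)[None, :, :]   # ADD_PE: same PE_s for every trajectory
z = z + TE[types][:, None, :]             # ADD_TE: same TE_k for every segment of k
assert z.shape == (K, S, d)
```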
For each segmented and embedded trajectory with additive positional and trajectory embeddings, we uniformly choose random segments to be masked out, according to a predetermined masking ratio r. Masking out means that we remove segments from the current representation of the trajectory. For example, a trajectory representation with $S = 3$ segments and masking ratio $r = \frac{1}{3}$ will have one of these segments removed at random during masking. The self-supervised learning objective of the masked autoencoder is to reconstruct these removed segments.
To implement the masking efficiently, we follow the conceptual approach proposed in He et al. (2022), albeit generalized to multiagent trajectories. We separately shuffle the segments of each trajectory and apply the mask by slicing off the final $S \cdot r$ shuffled segments of all trajectories. We denote the involved operations as SHUFFLE and $\mathrm{MASK}_r$ respectively. The composite application of both operations has the same effect as if we had masked each individual trajectory uniformly, removed the masked segments and shuffled the remaining segments within each trajectory. Removing masked segments leads to efficiency gains, since we reduce the input to a transformer model, which has quadratic computational cost in the number of segments per trajectory.
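A common way to implement this shuffle-and-slice scheme (used in the image MAE of He et al., 2022, here generalized to a per-trajectory shuffle) is to argsort random noise; the inverse permutation needed later for unshuffling comes for free. Shapes follow the soccer experiments; variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
K, S, d, r = 23, 25, 32, 0.8
z = rng.normal(size=(K, S, d))                 # embedded segments incl. PE and TE

noise = rng.random((K, S))
shuffle_idx = np.argsort(noise, axis=1)        # SHUFFLE: independent per trajectory
restore_idx = np.argsort(shuffle_idx, axis=1)  # inverse permutation for UNSHUFFLE

n_keep = S - int(S * r)                        # MASK_r: drop the final S*r segments
z_shuf = np.take_along_axis(z, shuffle_idx[..., None], axis=1)
z_visible = z_shuf[:, :n_keep]                 # encoder input, shape (K, S', d)
assert z_visible.shape == (23, 5, 32)
```

With r = 0.8, the FTE in the encoder thus attends over only 5 instead of 25 segments per trajectory, which is where the quadratic-cost savings arise.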
In the final encoding step, the segments are transformed by a factorized transformer encoder (FTE), which comprises multi-head self-attention that is factorized over the agent and the time dimension. We refer to Sect. 3.1 for details regarding the architecture and its properties. At its core, the FTE architecture includes two different types of standard transformer encoder layers, one of which operates over segments in time and is applied separately to each trajectory, whereas the other operates over segments per agent and is applied separately to all temporal positions. Given the notation above, temporal self-attention is applied to $x_{1:S'}^k$ for all $k \in [1, \ldots, K]$, where $S' = S - S \cdot r$ in the T-MAE encoder during self-supervised pretraining (and $S' = S$ in the T-MAE decoder and the T-MAE encoder during testing). Self-attention with respect to trajectory interactions is applied to $x_s^{1:K}$, for all $s \le S'$. Notably, these operations distribute information over time as well as account for interactions of trajectories throughout the representation. Furthermore, the application of the FTE is permutation equivariant with respect to the trajectories.
In sum, the encoder $\mathrm{enc}_\phi$ is given by
$$\mathrm{enc}_\phi(x) = \mathrm{FTE} \circ \mathrm{MASK}_r \circ \mathrm{SHUFFLE} \circ \mathrm{ADD}_{\mathrm{PE,TE}} \circ \mathrm{FFN} \circ \mathrm{SEG}_{l_S}(x).$$
The output of the encoder consists of segment representations, where each segment could in principle have been updated with respect to all other segment representations remaining after masking, due to the application of the FTE. Since the encoder is a composition of permutation equivariant functions with respect to the ordering of the trajectories, the encoder is itself permutation equivariant.
3.2.2 Decoder

We proceed to describe the pretraining approach, in which we apply a decoder $\mathrm{dec}_\theta$ to the representation $\mathrm{enc}_\phi(x)$. In a first decoding step, we concatenate $S \cdot r$ query tokens Q to the end of each trajectory (we denote this by $\mathrm{CAT}_Q$). These query tokens are thus placed at the positions previously taken by the removed segments. The tokens will subsequently be trained to reconstruct just these segments.
We proceed to UNSHUFFLE all segments by applying the inverse of the encoder shuffle function to all trajectories. The resulting representation comprises ordered segments and query tokens of the original segment shape, number of agents K by number of segments per trajectory S. We add positional encoding PE to each segment analogously to the positional encoding in the encoder via the $\mathrm{ADD}_{\mathrm{PE}}$ operation described above. Notably, the positional encoding informs each query token Q about its position within a trajectory.

We use a second FTE to transform the query tokens with respect to all other information within the instance. As in the encoder, this is a result of iterative transformations in the FTE applied to input segments and query tokens along the trajectory and time dimensions (for details regarding the FTE we refer to Sect. 3.1). A final FFN projects the transformed query tokens to the input space, where each query token is supposed to reconstruct the features of its respective masked trajectory segment. We arrive at
$$\mathrm{dec}_\theta(\mathrm{enc}_\phi(x)) = \mathrm{FFN} \circ \mathrm{FTE} \circ \mathrm{ADD}_{\mathrm{PE}} \circ \mathrm{UNSHUFFLE} \circ \mathrm{CAT}_Q(\mathrm{enc}_\phi(x)).$$
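The first two decoding steps, $\mathrm{CAT}_Q$ and UNSHUFFLE, can be sketched as follows, reusing the per-trajectory shuffle indices from the encoder; the decoder dimension follows the experiments and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, S, d, n_keep = 23, 25, 64, 5                 # decoder model dimension 64
shuffle_idx = np.argsort(rng.random((K, S)), axis=1)  # encoder's per-trajectory shuffle
restore_idx = np.argsort(shuffle_idx, axis=1)         # its inverse

enc_out = rng.normal(size=(K, n_keep, d))       # encoder output (projected to dim 64)
Q = rng.normal(size=(1, 1, d))                  # shared learnable query token

# CAT_Q: append one query token per removed segment to each trajectory
tokens = np.concatenate([enc_out, np.broadcast_to(Q, (K, S - n_keep, d))], axis=1)

# UNSHUFFLE: query tokens end up exactly at the masked-out positions
tokens = np.take_along_axis(tokens, restore_idx[..., None], axis=1)   # (K, S, d)

# visible segments are back at their original temporal positions
kept = np.take_along_axis(tokens, shuffle_idx[:, :n_keep, None], axis=1)
assert np.allclose(kept, enc_out)
```

After this step, the positional encoding added by $\mathrm{ADD}_{\mathrm{PE}}$ tells each query token which temporal position it must reconstruct.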
3.2.3 Reconstruction objective

We enforce the reconstruction of masked segments by their respective query tokens by minimizing the mean squared error (MSE) between the predicted query token features and the respective features of the masked segments in input space. While minimizing the MSE is a simple reconstruction objective, it has been shown to be very effective for pretraining in He et al. (2022).
3.2.4 Application to downstream tasks

If, after pretraining, we aim to apply the encoder for a downstream task, we set the masking ratio r to zero to encode all information inherent in the multiagent trajectory instance. For most use cases it is sensible to unshuffle the extracted representation as described in the section on the decoder. We can either train the classifier $f(\mathrm{enc}_\phi(x))$ with the weights in $\mathrm{enc}_\phi$ fixed or finetune them. Given that $\mathrm{enc}_\phi(x)$ is permutation equivariant with respect to x, if f is permutation invariant regarding $\mathrm{enc}_\phi(x)$, f is also permutation invariant with respect to x according to Proposition 1.
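A permutation invariant classification head of the kind used in the experiments (sum over agents, concatenation over segments, a single linear layer; cf. Sect. 4.1) can be sketched with random weights; the shapes and initialization are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
K, S, d, L = 23, 25, 32, 10                 # agents, segments, model dim, labels
z = rng.normal(size=(K, S, d))              # unshuffled encoder output with r = 0
W = rng.normal(size=(S * d, L)) * 0.01      # single linear classification layer

def scores(z):
    pooled = z.sum(axis=0)                  # aggregate (sum) over agents
    return pooled.reshape(-1) @ W           # concatenate segments, project to L scores

assert np.allclose(scores(z), scores(z[rng.permutation(K)]))
```

Because the sum over agents is taken before the linear layer, the label scores do not depend on how the trajectories were ordered in the input.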
3.2.5 Note on shuffling

Since we randomly remove segments prior to the application of the FTE in the encoder, the model cannot in general rely on the alignment of segments according to their temporal position. Self-attention along the agent dimension must instead also consider the positional encoding to decide to what degree to attend to other segments. This is independent of the fact that we include a shuffling operation to implement random masking efficiently. In fact, we refrain from unshuffling the segments prior to the FTE to prevent spurious alignments, and we shuffle the input during downstream tasks for consistency with pretraining. We empirically show in Sect. 4 that shuffling segments along the temporal dimension as implemented in the trajectory masked autoencoder results in good performance on downstream tasks.
4 Experiments

In this section, we evaluate self-supervised pretraining with the trajectory masked autoencoder (T-MAE) for multi-label classification on two different data sets. In both cases, we investigate the effect of pretraining for factorized transformer architectures in four different data regimes, where we have access to 5, 10, 50 and 100% of the labeled training data respectively. Since positive labels are sparse in both tasks, we evaluate the classification by mean average precision (mAP), which is suited to unbalanced datasets. We do not evaluate the T-MAE with respect to different quantities of unlabeled data. As stated in Sect. 2, we rather assume unlabeled data to be cheap and use all available data for pretraining.

On professional soccer data, we show how the proposed approach scales to a set of several unordered trajectories, we show that the approach consistently outperforms un-pretrained transformer architectures, and we show that the T-MAE is robust to variations in its hyperparameters (segment length and mask ratio). On a data set capturing interactions of fruit flies, we show that the T-MAE compares favorably to a recent SSL approach for multiagent trajectories (Sun et al., 2021) that is more involved, takes longer to train, and requires domain experts to devise handcrafted features.
4.1 Predicting events in professional soccer

In this section, we evaluate our trajectory masked autoencoder (T-MAE) on proprietary trajectory data from professional soccer matches.² Agent features are xy-coordinates on the pitch plus the current speed in km/h for all players and the ball, recorded at 25 Hz (frames per second). Multiagent trajectory instances are five-second intervals which we extract with an overlap of one second each. For simplicity, we only consider intervals where all 22 players and the ball are consistently tracked. We use data from 54 games (totaling 280,000 instances). We experiment with respect to four data folds over halftimes, always training the classification on 54 halftimes and validating and testing on 27 halftimes each. Pretraining is performed over different data splits without access to the labels.

² All matches are taken from season 2017/18 of the German Bundesliga and can be acquired from the German league (DFL). Only rather small datasets of similar kind are publicly available, for example the soccer video and player position data set (Pettersen et al., 2014).

We transform the data so that the home team always plays from left to right and provide type embeddings which encode whether a trajectory belongs to (i) the ball, (ii) the home keeper, (iii) a home field player, (iv) the guest keeper or (v) a guest field player. We work with minimal data augmentation, mirroring the instances along horizontal and vertical lines going through the middle point of the pitch. Instances are labeled with an event if it occurs within the 75 central frames. We consider multi-label classification with ten labels, which are (sorted from most frequent to least frequent): pass, other ball action, tackling game, throw-in, free kick, foul, shot at goal, cross, goal kick and corner kick. Sparsity in the positive labels and noise in the manual annotations make this a hard task. Random guessing results in a mAP of only 0.067. In Fig. 3, we provide visualizations of two ground truth, masked and reconstructed instances. In both cases, the reconstruction results from the application of a trajectory masked autoencoder with factorized transformer architecture, $l_S = 5$ and $r = 0.8$. Even though large parts of the input data are masked, the model apparently learned to account for turns, twists and possibly interactions in the players' movements.
In our experiments, we compare the classification performance of a factorized transformer model pretrained with T-MAE to models of similar architecture without pretraining. An overview of the classification architecture, which was chosen based on the performance of transformer models in preliminary experiments, is provided in Fig. 4. In the following, we provide more details. The embedding dimension (i.e. model dimension d) for all models has been fixed to 32. For all architectures we employed three FTE composite layers, each of which consists of two standard transformer encoder layers (see Sect. 3.1 and Vaswani et al. (2017) for reference). For the transformer encoder layers we used the standard implementation provided with the PyTorch framework (Paszke et al., 2019). We used 4 heads and 128 dimensions in the feed-forward networks. For normalization we used layer normalization (Ba et al., 2016) and we did not use any dropout (Srivastava et al., 2014) if not otherwise specified. The feature representation extracted with the T-MAE's encoder is unshuffled, aggregated (i.e. summed) over agents and concatenated along the segment dimension. While the un-pretrained models feature the same factorized transformer architecture, we omitted shuffling the segments within trajectories, as it is not required. We apply a single linear layer to assign scores for multi-label classification to the extracted representation. During pretraining with T-MAE, we use three additional FTE layers in the decoder. Different from the encoder transformer layers, we used an embedding dimension of 64 (we transform the representations segment-wise with a single linear layer to account for the change in model dimension) and a dimensionality of 256 for the feed-forward network within the transformer layers.

Fig. 3 Visualization of T-MAE masking (middle) and reconstruction (right) for two ground truth multiagent trajectory instances (left). The reconstruction to the right is a combination of ground truth timesteps provided to the model (i.e. dots in the middle column, with $l_S = 5$ and $r = 0.8$) and masked timesteps predicted by the model. The top/bottom row shows a training/validation instance respectively
We used a batch size of 256 for training the classification task and a batch size of 64 for pretraining. In general, optimization has been performed with Adam (Kingma and Ba, 2014) and early stopping on validation performance. For early stopping we use a tolerance of 5 batches without improvement for pretraining and a tolerance of 10 batches without improvement while training the classification task. Gradient norm clipping (Pascanu et al., 2013) is used throughout the experiments (with the threshold set to 5.0). Notably, we finetuned the weights of the pretrained model during training of the classification task. While we used a standard learning rate of 0.001 and no dropout in the FTEs for the pretrained models, un-pretrained models are additionally trained with learning rate 0.0001 and dropout 0.2 to provide stability and regularization in the absence of pretraining. Regarding these additional runs, we choose the hyperparameter configuration according to validation mAP separately for each data regime.

Fig. 4 Supervised classification architecture used for the experiments in this paper. FTE composite layers in blue consist of two standard transformer encoder layers as detailed in Fig. 1. Striped-green procedures are only applied to models pretrained with a T-MAE. During pretraining we mask the input and employ three additional FTE composite layers for decoding; see Fig. 2 for details on masking and decoding
Overall, we have trained both pretrained and un-pretrained models with three different segment lengths lS ∈ {1, 5, 25}. For all experiments on the soccer data, mean and standard error of the mean are calculated with respect to the four data splits.
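The reported numbers aggregate over splits as mean ± standard error of the mean (SEM); the per-split mAP values below are made up for illustration.

```python
import numpy as np

# Mean and SEM over the four data splits; the mAP values are illustrative.
map_per_split = np.array([0.451, 0.462, 0.455, 0.468])
mean = map_per_split.mean()
sem = map_per_split.std(ddof=1) / np.sqrt(len(map_per_split))
print(f"{mean:.3f} ± {sem:.3f}")  # 0.459 ± 0.004
```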
The main results of our experiment are given in Fig. 5. As can be seen, the model pretrained with the trajectory masked autoencoder (with segment length lS = 5 and mask ratio r = 0.8) consistently outperforms the models without pretraining in all four data regimes. Notably, this implies that the model is more data efficient and that we can reach the same performance with less labeled data. The pretraining approach, however, also results in the best overall performance with 100% of the labeled data. As such, the trajectory masked autoencoder suggests itself also for cases where labels for training data are abundant. Additionally, we observe that pretrained models are robust to variations in the hyperparameters lS and r (as can be seen in Table 1). A multi-layer perceptron (MLP) simply operating on raw features of the whole instance (with ordered agents) is provided as a further baseline.
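For reference, mean average precision can be computed per label and averaged, as in this self-contained sketch with toy scores. This is a plain implementation of average precision (assuming distinct scores), not the exact evaluation code used in the paper.

```python
import numpy as np

# Average precision per label, then the mean over labels (mAP).
def average_precision(y_true, y_score):
    order = np.argsort(-y_score)          # rank frames by decreasing score
    y = y_true[order]
    tp = np.cumsum(y)                     # true positives at each rank
    precision = tp / np.arange(1, len(y) + 1)
    return (precision * y).sum() / y.sum()  # mean precision at positive hits

# Toy multi-label data: 4 instances, 2 labels.
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_score = np.array([[0.9, 0.6], [0.3, 0.8], [0.7, 0.5], [0.1, 0.4]])
ap = [average_precision(y_true[:, k], y_score[:, k]) for k in range(2)]
print(round(float(np.mean(ap)), 3))  # 0.917
```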
Fig. 5 Comparing T-MAE (with segment length lS = 5 and mask ratio r = 0.8) to un-pretrained transformer models with similar architecture and an MLP baseline. Metric is mean average precision (mAP, higher is better); error bars show the standard error
Table 1 Results of T-MAE pretraining model variations for the downstream classification task on soccer data for different fractions of classification training data. Metric is mean average precision (mAP, higher is better), SE = standard error; best results are highlighted in bold

Model                      0.05           0.1            0.5            1.0
Variations in r
T-MAE (lS = 5, r = 0.2)    0.367 ± 0.006  0.395 ± 0.004  0.441 ± 0.006  0.453 ± 0.006
T-MAE (lS = 5, r = 0.4)    0.370 ± 0.004  0.400 ± 0.006  0.445 ± 0.006  0.456 ± 0.004
T-MAE (lS = 5, r = 0.6)    0.362 ± 0.004  0.394 ± 0.007  0.446 ± 0.007  0.458 ± 0.009
T-MAE (lS = 5, r = 0.8)    0.367 ± 0.001  0.402 ± 0.005  0.447 ± 0.007  0.459 ± 0.007
Variations in lS
T-MAE (lS = 1, r = 0.8)    0.371 ± 0.004  0.399 ± 0.002  0.442 ± 0.007  0.456 ± 0.007
T-MAE (lS = 5, r = 0.8)    0.367 ± 0.001  0.402 ± 0.005  0.447 ± 0.007  0.459 ± 0.007
T-MAE (lS = 25, r = 0.8)   0.348 ± 0.003  0.382 ± 0.002  0.438 ± 0.007  0.453 ± 0.004
Here, we evaluated MLP models of different modeling capacity and chose the one with the best validation performance. The poor performance of the MLP baseline
indicates the importance of permutation invariance with respect to agents and the need for
increased modeling capacity.
In Table2 we provide evidence that the factorized transformer architecture is a sensible
choice for classification models on the given data: We compare baseline factorized trans-
former models against model architectures without interaction attention. That is, for these
additional models we replaced each FTE layer by two transformer encoder layers attending
over the temporal dimension only. Both model classes are un-pretrained and have the same
respective parameter counts. While the evidence for a factorized architecture is not very
strong, models with factorized architectures achieved the best overall results.
4.2 Predicting interactions of fruit flies
In this section, we report empirical results on multi-label classification of fly interactions based on trajectory data of two flies, which has been extracted from video data (Eyjolfsdottir et al., 2014). We closely follow the experimental setup provided and implemented in Sun et al. (2021) for predicting six interaction types (such as lunge or wing threat) based on raw trajectory data (referred to as keypoints in Sun et al. (2021)). Specifically, the classification task is to predict the labels of each individual frame, of which there are about 1.5 million in the data set. The input data consists of 21 frames surrounding the frame of interest (the centerframe). Each frame has ten features per fly, including the xy-position of the fly's centroid, sine and cosine of its orientation, as well as its wing positions.
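The input construction described above can be sketched as follows; the array sizes are illustrative and the feature values are random placeholders rather than real fly tracks.

```python
import numpy as np

# 21 frames centred on the frame of interest, 10 features per fly
# (xy centroid, sin/cos of orientation, wing positions, ...).
n_frames, n_flies, n_feat, half = 1000, 2, 10, 10
track = np.random.randn(n_frames, n_flies, n_feat)  # placeholder data

def window(center):
    """Return the (21, 2, 10) classifier input around a centerframe."""
    assert half <= center < n_frames - half
    return track[center - half : center + half + 1]

x = window(500)
print(x.shape)  # (21, 2, 10)
```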
In our experiments, we compare against an approach for self-supervised pretraining for multiagent trajectories recently proposed in Sun et al. (2021) as trajectory embedding for behavior analysis (TREBA). The approach involves encoding the multiagent trajectory instances via a variational autoencoder for trajectory data (Co-Reyes et al., 2018; Zhan et al., 2020) with an autoregressive reconstruction loss and three additional losses (an attribute consistency loss, an attribute decoding loss and a contrastive loss), all of which are defined with respect to so-called programmed tasks. The process of task programming as proposed in Sun et al. (2021) involves an expert writing a function to extract deterministic features from the raw data. As such, task programming amounts to generating handcrafted features which are then used to guide the self-supervised pretraining. Due to the combination of four losses, the approach is more involved than our proposed trajectory masked autoencoder, and due to the use of handcrafted features it must be adapted anew to each novel task.

Table 2 Ablation study for the transformer architecture. In the model without interaction attention, we replaced each FTE layer by two transformer encoder layers attending over the temporal dimension only (resulting in the same parameter count). Results are for different segment lengths lS ∈ {1, 5, 25}. Models are trained with 100% of the training data. Metric is mean average precision (mAP, higher is better), SE = standard error; best results are highlighted in bold

Model                                   1              5              25
FTE transformer architecture (Fig. 4)   0.443 ± 0.003  0.451 ± 0.007  0.437 ± 0.003
Transformer w/o interaction attention   0.426 ± 0.007  0.447 ± 0.005  0.437 ± 0.003
To directly compare our trajectory masked autoencoder to TREBA, we use the same data splits and classifier architectures as in Sun et al. (2021).3 However, instead of training one classifier for each of the six labels, we train a single classifier with six classification heads for convenience. As in Sun et al. (2021), we train a total of nine classifiers per architecture and hyperparameter setting for each data regime. Reported mean results and standard errors are calculated with respect to these nine repetitions.
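A classifier with six classification heads on a shared representation might look as follows. The representation width is an assumption, and this is only a sketch of the idea, not the classifier architecture from Sun et al. (2021).

```python
import torch
import torch.nn as nn

# Hypothetical sizes: shared representation of width 64, six binary labels.
REPR_DIM, N_LABELS = 64, 6

class MultiHead(nn.Module):
    def __init__(self):
        super().__init__()
        # One linear head per interaction type on the shared representation.
        self.heads = nn.ModuleList(nn.Linear(REPR_DIM, 1) for _ in range(N_LABELS))

    def forward(self, h):
        # One score per label: per-frame multi-label prediction.
        return torch.cat([head(h) for head in self.heads], dim=-1)

model = MultiHead()
scores = model(torch.randn(8, REPR_DIM))
print(scores.shape)  # torch.Size([8, 6])
```

The six scores can then be trained jointly with a binary cross-entropy loss per label, which replaces the six separate classifiers.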
In Sun etal. (2021) the TREBA performance is compared to an MLP operating only on
the centerframe features. We deem this to be not a fair comparison. Since the pretrained
models have access to all 21 trajectory frames, we provide an additional MLP baseline
which operates on raw features of the whole instance (i.e. all 21 trajectory frames).
Note that the two flies in the trajectory data are ordered according to which of the two
flies acts as an intruder. Since TREBA is not a permutation equivariant model, it naturally
needs (and is provided with) information about the ordering of the flies. To account for this
ordering in the trajectory masked autoencoder, we sum extracted features over the segment
dimension and concatenate the agent dimension in the classifier. We thus use the opposite
aggregations for the agent and temporal dimensions compared to the experiments on soc-
cer data in Sect.4.1. This notably and by design removes permutation invariance from the
resulting classifier.
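The two read-outs can be contrasted in a short sketch on a feature tensor of shape (agents, segments, dim); the tensor sizes are illustrative.

```python
import torch

feats = torch.randn(2, 7, 16)  # 2 flies, 7 segments, 16-dim features

# Ordered flies: sum over the segment dimension, then concatenate agents.
# The result depends on the agent order (order-sensitive read-out).
ordered = feats.sum(dim=1).reshape(-1)    # shape (2 * 16,)

# Unordered flies / soccer setup: sum over the agent dimension, then
# concatenate segments. The result is permutation invariant w.r.t. agents.
invariant = feats.sum(dim=0).reshape(-1)  # shape (7 * 16,)

# Swapping the two flies changes the first read-out but not the second.
swapped = feats[[1, 0]]
print(torch.equal(swapped.sum(dim=0).reshape(-1), invariant))  # True
print(torch.equal(swapped.sum(dim=1).reshape(-1), ordered))    # False
```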
In addition to the experimental setup provided in Sun et al. (2021), we repeat the experiment with random permutations of fly orderings per instance. To account for this, we sum extracted features over the agent dimension and concatenate unshuffled segments per trajectory in the classifier for the T-MAE representation. These transformations mirror our transformations in the experiments on soccer data, and the T-MAE based classifiers used for this setting are thus permutation invariant. We deem the additional experiment highly relevant, as in most instances the 21 input frames are hardly indicative of which fly is acting as the intruder.
The trajectory masked autoencoder used to compare against TREBA in both experiments is configured with segment length lS = 3 and mask ratio r = 0.4, with finetuning during classification. Otherwise, the T-MAE model architecture, hyperparameters and training schemes are mostly the same as for the experiments on soccer data. We only deviate by using a batch size of 512 for training the classification task and a feed-forward dimension of 256 for the transformer encoder layers in the FTE encoder. Please refer to Sect. 4.1 and Fig. 4 for more details.
We do not adjust the classifier architecture devised for TREBA representations to the
T-MAE model, i.e. we simply replace the final classification layer in Fig.4 with the clas-
sification architecture proposed in Sun etal. (2021). However, we discard regularization
via dropout for our approach, as it does not seem required (we ran an ablation for TREBA
classification without dropout too, to validate that dropout indeed helps for the TREBA
model). While the TREBA training makes use of data augmentation, we do not augment
the data for T-MAE. Notably, on an Nvidia A100 GPU, training the TREBA representa-
tions required more than 12h, whereas pretraining the trajectory masked autoencoder on
the same data took less than half an hour, which is faster by a factor of 24.
3 TREBA code as well as a link to the dataset can be found at https://github.com/neuroethology/TREBA.
The results of both our experiments are depicted in Fig. 6. While the T-MAE based classifier outperforms the TREBA approach only in the data regime with all training labels in the experiment with ordered flies (left), it generally performs best in the experiment with unordered flies (right), where it can make use of permutation invariance. Both MLP baselines are clearly outperformed. We attribute the relatively good performance of TREBA and the MLP baselines in the regime with all training labels on the unordered dataset to the implicit regularization that comes with the unordered input (notably, the full dataset has many instances with overlapping frames).
A summary of both experiments is provided in Table 3, where we report test results of the best performing models over both ordered and unordered flies according to validation performance. For reference, we include the highest mAP values provided in Sun et al. (2021). These latter results are, however, not directly comparable, as we picked them from a variety of model configurations with noticeable differences in test performance and could not select them based on validation performance, as we have done for all models evaluated by us.
Fig. 6 Results on multi-label classification of fly interactions (with two flies) based on raw trajectory features. Metric is mean average precision (mAP, higher is better); error bars show the standard error
Table 3 Summary of the fruit fly classification task, including results as provided in Sun et al. (2021). Results by Sun et al. (2021) are marked with an asterisk; they are provided as a reference but are not directly comparable. The metric is mean average precision (mAP, higher is better), SE = standard error; best results are highlighted in bold

Model                     0.05           0.1            0.5            1.0
MLP (centerfr.)*          –              0.348          0.519          0.586
TREBA*                    –              0.666          0.738          0.775
MLP (centerfr., our)      0.379 ± 0.011  0.450 ± 0.010  0.547 ± 0.003  0.606 ± 0.002
MLP (trajectory, our)     0.380 ± 0.011  0.467 ± 0.008  0.648 ± 0.008  0.729 ± 0.007
TREBA (our impl.)         0.607 ± 0.009  0.623 ± 0.006  0.743 ± 0.003  0.791 ± 0.002
T-MAE (lS = 3, r = 0.4)   0.549 ± 0.010  0.649 ± 0.010  0.749 ± 0.008  0.798 ± 0.001
5 Related work

Unsupervised or self-supervised pretraining prior to supervised finetuning is a promising concept that has so far been especially fruitful for language and vision tasks (Dai and Le, 2015; Devlin et al., 2018; He et al., 2020; Henaff, 2020). Only recently have self-supervised and semi-supervised methods for multiagent trajectories been established that work with related reconstruction tasks (Sun et al., 2021; Fassmeyer et al., 2021). The relation of self-supervised learning to semi-supervised learning has been investigated in Chen et al. (2020), where it is argued that large self-supervised models can be applied very successfully to semi-supervised learning tasks. The relation of self-supervised pretraining and a semi-supervised approach to the classification of multiagent trajectories with access to unlabeled data for a specific task is implicitly investigated in Sun et al. (2021). We see our paper in this line of research.
Regarding different pretext tasks for self-supervised learning, our work continues to investigate the denoising autoencoder approach established in Vincent et al. (2008, 2010). Recently, He et al. (2022) introduced a very efficient masking scheme in the context of computer vision. Empirically, the proposed masked autoencoder enables training transformer-based classifiers for image data (Dosovitskiy et al., 2021), which previously needed very large amounts of labeled data, to achieve good performance with much less labeled data. The simplicity of the reconstruction objective and denoising task in He et al. (2022) is attributable to the deviation from convolutional neural networks and the use of a transformer architecture. As such, the work connects self-supervised pretraining for vision models directly to very influential denoising autoencoder models for language (Devlin et al., 2018). Our work generalizes the masking scheme of He et al. (2022) to multiagent trajectories and hence allows us to use the modeling capacity of vastly overparameterized modern transformer architectures for downstream tasks such as classification, for which these architectures otherwise might not be applicable.
Factorized transformer architectures for multiagent trajectories have been independently proposed by Aksan et al. (2021) and Girgis et al. (2022). While our transformer architecture is closely related to Girgis et al. (2022), it operates on different representations: instead of focusing on individual timesteps, we generalize the input data to trajectory segments. Most importantly, however, our model also operates on shuffled representations. With strong empirical performance, we demonstrate that we do not need to rely on temporal alignment of the segments and that positional encoding is sufficient to inform self-attention.
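The underlying intuition can be sketched as follows: if positional encodings are attached to the segments before shuffling, the permutation destroys no temporal information. The fixed encoding below is a crude stand-in, and the tensor sizes are illustrative.

```python
import torch

n_seg, dim = 8, 16
segments = torch.randn(n_seg, dim)

# Crude fixed positional encoding of width `dim` (illustrative only).
pos = torch.arange(n_seg, dtype=torch.float32).unsqueeze(1)
pe = torch.cat([torch.sin(pos / 10.0), torch.cos(pos / 10.0)], dim=1)
pe = pe.repeat(1, dim // 2)

tokens = segments + pe        # position attached to content
perm = torch.randperm(n_seg)
shuffled = tokens[perm]       # temporal order destroyed ...
# ... but each token still carries its original position, so a
# permutation-equivariant transformer can recover temporal structure.
print(torch.equal(shuffled[perm.argsort()], tokens))  # True
```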
Classification of multiagent data has been studied mainly in sports analytics. For example, Sanford et al. (2020) use transformer architectures to classify on-ball events from soccer sequences. However, they focus only on the ball and its K-nearest players and use distances to overcome the inherent permutation problem of players; their solution may swap player identities across temporal windows. Related to classifying multiagent trajectories is the detection of a priori known patterns in multiagent trajectory sequences, such as counterpressing (Bauer and Anzer, 2021), or pass risk quantification (Anzer and Bauer, 2022; Chawla et al., 2017; Power et al., 2017). While all these approaches rely on handcrafted feature representations, convolutional neural networks have been deployed to overcome the manual labeling effort (Dick et al., 2021; Bauer et al., 2023). Recently, permutation invariance has been addressed with graph neural networks (Stöckl et al., 2021; Fassmeyer et al., 2021; Anzer et al., 2022).
Concurrently and independently, Chen et al. (2023) propose a masked autoencoder pretraining approach for the task of multiagent trajectory prediction in autonomous driving and for pedestrian movement. The authors propose a combination of different reconstruction tasks for multiagent trajectories involving different masking schemes. As in our trajectory masked autoencoder, Chen et al. (2023) make use of the factorized transformer architecture proposed by Girgis et al. (2022) to distribute information over time and individual trajectories. Their approach includes an additional reconstruction task on map information (i.e. context information for the trajectory data, which is not applicable to our experiments). In contrast to our paper, the downstream task under consideration is trajectory prediction and not classification. Empirically, Chen et al. (2023) report very good results with masked trajectory autoencoding on multiagent trajectory forecasting benchmarks.
6 Conclusion
We proposed a novel self-supervised approach for multiagent trajectories, the trajectory masked autoencoder (T-MAE). The approach is built upon a factorized transformer architecture for multiagent trajectory data and employs a masking scheme on the level of individual agent trajectories. As a result, the encoder of our pretraining model is permutation equivariant with respect to the order of trajectories and is thus naturally applicable to downstream tasks that require permutation invariance with respect to the order of individual trajectories. Empirically, pretraining with our approach improved the performance of multi-label classification of multiagent trajectory instances regarding in-game events on tracking data from professional soccer matches, and compares favorably to a recently published self-supervised learning approach for multiagent trajectories on multiagent pose data from entomology.
Acknowledgements We would like to thank Eraldo Rezende Fernandes, Marius Lehne and Marco Spinaci
for their input and support while writing this article.
Author Contributions The initial algorithm, its implementation and all experiments were devised and con-
ducted by Yannick Rudolph, who also wrote the first draft of the manuscript. Ulf Brefeld supervised all
steps along the way, discussed modifications to algorithm and manuscript, helped correct the manuscript
and commented on previous versions. All authors read and approved the final manuscript.
Funding Open Access funding enabled and organized by Projekt DEAL. No funds, grants, or other support
was received.
Data Availability The soccer data is proprietary and can be acquired from the German league (DFL). The fly data has been previously published by Eyjolfsdottir et al. (2014). We used a version of the dataset which (alongside dataset splits and baseline code) can be accessed at https://github.com/neuroethology/TREBA.
Code availability We provide details of our experimental setup, model architectures, hyperparameters and implementation details in Sect. 4 of the paper and provide a link to a PyTorch implementation of T-MAEs at https://ml3.leuphana.de/projects.html.
Declarations
Conflict of interest The authors have no conflicts of interest to declare that are relevant to the content of this article.
Ethics approval Not applicable.
Consent to participate Not applicable.
Consent for publication Not applicable.
Employment Yannick Rudolph performed a large part of his work on this article while employed at SAP SE, Berlin.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License,
which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long
as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Com-
mons licence, and indicate if changes were made. The images or other third party material in this article
are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly
from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
Aksan, E., Kaufmann, M., Cao, P., & Hilliges, O. (2021). A spatio-temporal transformer for 3D human motion prediction. In International Conference on 3D Vision. arXiv:2004.08692 [cs.CV].
Anzer, G., Bauer, P., Brefeld, U., & Fassmeyer, D. (2022). Detection of tactical patterns using semi-super-
vised graph neural networks. In MIT Sloan Sports Analytics Conference.
Anzer, G., & Bauer, P. (2022). Expected passes: Determining the difficulty of a pass in football (soccer)
using spatio-temporal data. Data Mining And Knowledge Discovery, 36, 295–317.
Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
Bauer, P., & Anzer, G. (2021). Data-driven detection of counterpressing in professional football: A super-
vised machine learning task based on synchronized positional and event data with expert-based feature
extraction. Data Mining and Knowledge Discovery, 35, 2009–2049.
Bauer, P., Anzer, G., & Shaw, L. (2023). Putting team formations in association football into context. Jour-
nal of Sports Analytics, 9(1), 39–59.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P.,
Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh,
A., Ziegler, D., Wu, J., Winter, C., … Amodei, D. (2020). Language models are few-shot learners.
Advances in Neural Information Processing Systems, 33, 1877–1901.
Casas, S., Gulino, C., Suo, S., Luo, K., Liao, R., & Urtasun, R. (2020). Implicit Latent Variable Model for
Scene-Consistent Motion Forecasting. In European Conference on Computer Vision.
Chawla, S., Estephan, J., Gudmundsson, J., & Horton, M. (2017). Classification of passes in football
matches using spatiotemporal data. ACM Transactions on Spatial Algorithms and Systems, 3(2), 1–30.
Chen, T., Kornblith, S., Swersky, K., Norouzi, M., & Hinton, G. (2020). Big Self-Supervised Models are
Strong Semi-Supervised Learners. In Advances in Neural Information Processing Systems.
Chen, H., Wang, J., Shao, K., Liu, F., Hao, J., Guan, C., Chen, G., & Heng, P. A. (2023). Traj-MAE: Masked
Autoencoders for Trajectory Prediction. In International Conference on Computer Vision.
Co-Reyes, J., Liu, Y., Gupta, A., Eysenbach, B., Abbeel, P., & Levine, S. (2018). Self-consistent trajectory
autoencoder: Hierarchical reinforcement learning with trajectory embeddings. In International Confer-
ence on Machine Learning, pp. 1009–1018. PMLR.
Dai, A. M., & Le, Q. V. (2015). Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, Volume 28.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 [cs.CL].
Dick, U., Tavakol, M., & Brefeld, U. (2021). Rating player actions in soccer. Frontiers in Sports and Active
Living, 3, 682986.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M.,
Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16
Words: Transformers for Image Recognition at Scale. In International Conference on Learning
Representations.
Eyjolfsdottir, E., Branson, S., Burgos-Artizzu, X. P., Hoopfer, E. D., Schor, J., Anderson, D. J., & Perona, P.
(2014). Detecting social actions of fruit flies. In European Conference on Computer Vision.
Fassmeyer, D., Anzer, G., Bauer, P., & Brefeld, U. (2021). Toward Automatically Labeling Situations in
Soccer. Frontiers in Sports and Active Living, 3, 725431.
Girgis, R., Golemo, F., Codevilla, F., Weiss, M., D’Souza, J. A., Kahou, S. E., Heide, F., & Pal, C. (2022).
Latent variable sequential set transformers for joint multi-agent motion prediction.
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable
Vision Learners. In IEEE Conference on Computer Vision and Pattern Recognition.
He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual repre-
sentation learning. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 9729–9738.
Henaff, O. (2020). Data-efficient image recognition with contrastive predictive coding. In International
Conference on Machine Learning, pp. 4182–4192. PMLR.
Kingma, D. P., & Ba, J. L. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In
International Conference on Machine Learning, pp. 1310–1318.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N.,
Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S.,
Steiner, B., Fang, L., … Chintala, S. (2019). PyTorch: An imperative style, high-performance deep
learning library. Advances in Neural Information Processing Systems, 32, 8024–8035.
Pettersen, S. A., Halvorsen, P., Johansen, D., Johansen, H., Berg-Johansen, V., Gaddam, V. R., Mortensen,
A., Langseth, R., Griwodz, C., & Stensland, H. K. (2014). Soccer Video and Player Position Dataset.
In ACM Multimedia Systems Conference, pp. 18–23.
Power, P., Ruiz, H., Wei, X., & Lucey, P. (2017). Not all passes are created equal: Objectively measuring
the risk and reward of passes in soccer from tracking data. In SIGKDD International Conference on
Knowledge Discovery and Data Mining, pp. 1605–1613.
Sanford, R., Gorji, S., Hafemann, L. G., Pourbabaee, B., & Javan, M. (2020). Group Activity Detection
From Trajectory and Video Data in Soccer. In (Workshop) IEEE Conference on Computer Vision and
Pattern Recognition, pp. 898–899.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple
way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1),
1929–1958.
Stöckl, M., Seidl, T., Marley, D., & Power, P. (2021). Making Offensive Play Predictable – Using a Graph
Convolutional Network to Understand Defensive Performance in Soccer. In MIT Sloan Sports Analyt-
ics Conference.
Sun, J. J., Kennedy, A., Zhan, E., Anderson, D. J., Yue, Y., & Perona, P. (2021). Task Programming: Learn-
ing Data Efficient Behavior Representations. In IEEE Conference on Computer Vision and Pattern
Recognition.
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł, & Polosukhin, I.
(2017). Attention Is all you need. Advances in Neural Information Processing Systems, 30, 5998–6008.
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P. A. (2008). Extracting and composing robust fea-
tures with denoising autoencoders. In International Conference on Machine Learning, pp. 1096–1103.
ACM.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P. A. (2010). Stacked denoising autoencod-
ers: Learning useful representations in a deep network with a local denoising criterion. Journal of
Machine Learning Research, 11, 3371–3408.
Yeh, R. A., Schwing, A. G., Huang, J., & Murphy, K. (2019). Diverse Generation for Multi-Agent Sports
Games. In IEEE Conference on Computer Vision and Pattern Recognition.
Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczós, B., Salakhutdinov, R. R., & Smola, A. J. (2017). Deep sets.
Advances in Neural Information Processing Systems, 30, 3391–3401.
Zhan, E., Tseng, A., Yue, Y., Swaminathan, A., & Hausknecht, M. (2020). Learning Calibratable Policies
using Programmatic Style-Consistency.
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
1.
2.
3.
4.
5.
6.
Terms and Conditions
Springer Nature journal content, brought to you courtesy of Springer Nature Customer Service Center
GmbH (“Springer Nature”).
Springer Nature supports a reasonable amount of sharing of research papers by authors, subscribers
and authorised users (“Users”), for small-scale personal, non-commercial use provided that all
copyright, trade and service marks and other proprietary notices are maintained. By accessing,
sharing, receiving or otherwise using the Springer Nature journal content you agree to these terms of
Article
Choosing the right formation is one of the coach’s most important decisions in football. Teams change formation dynamically throughout matches to achieve their immediate objective: to retain possession, progress the ball up-field and create (or prevent) goal-scoring opportunities. In this work we identify the unique formations used by teams in distinct phases of play in a large sample of tracking data. We achieve this in two steps: first, we train a convolutional neural network to decompose each game into non-overlapping segments and classify these segments into phases with an average F1-score of 0.76. We then measure and contextualize the unique formations used in each distinct phase of play. While conventional discussion tends to reduce a team’s formation over an entire match to a single three-digit code (e.g. 4-4-2; 4 defenders, 4 midfielders, 2 strikers), we provide an objective representation of teams’ formations per phase of play. Using the most frequently occurring phase of play, the mid-block, we identify and contextualize six unique formations. A long-term analysis in the German Bundesliga allows us to quantify the efficiency of each formation and to present a helpful scouting tool to identify how well a coach’s preferred playing style is suited to a potential club.
Conference Paper
Overlapping runs are a widely used group-tactical pattern in soccer. By combining a variational autoencoder with a graph neural network representation of positional data, we are able to detect overlapping runs using only a very limited amount of hand-labeled data. Based on this detection, we show practical applications using data of the German national team during the European Championship 2021. Using the same methodology, we outperform state-of-the-art approaches on the prediction of player trajectories using a publicly available basketball dataset.
Article
Passes are by far football’s (soccer) most frequent event, yet surprisingly little meaningful research has been devoted to quantifying them. With the increase in availability of so-called positional data, which describe the positioning of players and ball at every moment of the game, our work aims to determine the difficulty of every pass by calculating its success probability based on its surrounding circumstances. As most experts will agree, not all passes are of equal difficulty; however, most traditional metrics count them as such. With our work we can quantify how well players execute passes, assess their risk profile, and even compute completion probabilities for hypothetical passes by combining physical and machine learning models. Our model uses the first 0.4 seconds of a ball trajectory and the movement vectors of all players to predict the intended target of a pass with an accuracy of 93.0% for successful and 72.0% for unsuccessful passes, much higher than any previously published work. Our extreme gradient boosting model can then quantify the likelihood of a successful pass completion towards the identified target with an area under the curve (AUC) of 93.4%. Finally, we discuss several potential applications, like player scouting or evaluating pass decisions.
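To illustrate the general recipe described above (a gradient-boosted classifier for pass-completion probability, evaluated by AUC), here is a minimal sketch. It is not the authors' implementation: the features and labels are synthetic placeholders, and scikit-learn's GradientBoostingClassifier stands in for the extreme gradient boosting model.

```python
# Hedged sketch: gradient boosting for pass-completion probability.
# Features and data are synthetic; only the modeling recipe is illustrated.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# Hypothetical per-pass features: pass length, initial ball speed,
# distance to nearest defender, opponents in the passing lane.
X = rng.normal(size=(n, 4))
# Synthetic label: completion is less likely for long, pressured passes.
logits = 1.0 - 0.8 * X[:, 0] + 0.6 * X[:, 2] - 0.7 * X[:, 3]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
# predict_proba yields the pass-completion probability per attempt.
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"AUC: {auc:.3f}")
```

The same probabilities could then be queried for hypothetical passes by feeding in constructed feature vectors instead of observed ones.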
Article
We study the automatic annotation of situations in soccer games. At first sight, this translates nicely into a standard supervised learning problem. However, in a fully supervised setting, predictive accuracies are supposed to correlate positively with the amount of labeled situations: more labeled training data simply promise better performance. Unfortunately, non-trivially annotated situations in soccer games are scarce and expensive, and almost always require human experts; a fully supervised approach appears infeasible. Hence, we split the problem into two parts and learn (i) a meaningful feature representation using variational autoencoders on unlabeled data at large scales and (ii) a large-margin classifier acting in this feature space that utilizes only a few (manually) annotated examples of the situation of interest. We propose four different architectures of the variational autoencoder and empirically study the detection of corner kicks, crosses and counterattacks. We observe high predictive accuracies above 90% AUC irrespective of the task.
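The two-stage idea above can be sketched in a few lines. This is not the authors' code: PCA stands in for the variational autoencoder, and the data are random placeholders; only the split into (i) unsupervised representation learning at scale and (ii) a large-margin classifier trained on a handful of labels is illustrated.

```python
# Hedged sketch of the two-stage pipeline:
# (i) learn a representation on plentiful unlabeled data,
# (ii) fit a large-margin classifier on only a few labeled examples.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# (i) Plentiful unlabeled "situations" as high-dimensional feature vectors.
X_unlabeled = rng.normal(size=(5000, 100))
encoder = PCA(n_components=16).fit(X_unlabeled)  # stand-in for the VAE encoder

# (ii) Only a few annotated situations (e.g. corner kick vs. rest).
X_labeled = rng.normal(size=(40, 100))
y_labeled = rng.integers(0, 2, size=40)
clf = LinearSVC(max_iter=5000).fit(encoder.transform(X_labeled), y_labeled)

# New situations are classified in the learned feature space.
pred = clf.predict(encoder.transform(X_labeled))
print("predictions:", pred[:5])
```

Because the representation is fixed before the classifier is trained, the expensive annotation effort is spent only on step (ii).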
Article
We present a data-driven model that rates the actions of soccer players with respect to their contribution to ball possession phases. The approach consists of two interconnected parts: (i) a trajectory prediction model that is learned from real tracking data and predicts the movements of players, and (ii) a prediction model for the outcome of a ball possession phase. Interactions between players and the ball are captured by a graph recurrent neural network (GRNN), and we show empirically that the network reliably predicts both player trajectories and the outcomes of ball possession phases. We derive a set of aggregated performance indicators to compare players with respect to their contribution to the success of their team.
Article
Detecting counterpressing is an important task for any professional match-analyst in football (soccer), but it is currently done exclusively manually by observing video footage. The purpose of this paper is not only to automatically identify this strategy, but also to derive metrics that support coaches with the analysis of transition situations. Additionally, we want to infer objective influence factors for its success and assess the validity of peer-created rules of thumb established by practitioners. Based on a combination of positional and event data, we detect counterpressing situations as a supervised machine learning task. Together with professional match-analysis experts, we discussed and consolidated a consistent definition, extracted 134 features and manually labeled more than 20,000 defensive transition situations from 97 professional football matches. The extreme gradient boosting model, with an area under the curve of 87.4% on the labeled test data, enabled us to judge how quickly teams can win the ball back with counterpressing strategies, how many shots they create or allow immediately afterwards, and to determine the most important success drivers. We applied this automatic detection to all matches from six full seasons of the German Bundesliga and quantified the defensive and offensive consequences of counterpressing for each team. Automating the task saves analysts a tremendous amount of time, standardizes the otherwise subjective task, and allows trends to be identified within larger datasets. We present an effective way of integrating the detection and the lessons learned from this investigation into common match-analysis processes.
Conference Paper
Specialized domain knowledge is often necessary to accurately annotate training sets for in-depth analysis, but can be burdensome and time-consuming to acquire from domain experts. This issue arises prominently in automated behavior analysis, in which agent movements or actions of interest are detected from video tracking data. To reduce annotation effort, we present TREBA: a method to learn annotation-sample efficient trajectory embedding for behavior analysis, based on multi-task self-supervised learning. The tasks in our method can be efficiently engineered by domain experts through a process we call "task programming", which uses programs to explicitly encode structured knowledge from domain experts. Total domain expert effort can be reduced by exchanging data annotation time for the construction of a small number of programmed tasks. We evaluate this trade-off using data from behavioral neuroscience, in which specialized domain knowledge is used to identify behaviors. We present experimental results in three datasets across two domains: mice and fruit flies. Using embeddings from TREBA, we reduce annotation burden by up to a factor of 10 without compromising accuracy compared to state-of-the-art features. Our results thus suggest that task programming and self-supervision can be an effective way to reduce annotation effort for domain experts.
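The main article's masked autoencoder pursues the same goal of reducing annotation effort through self-supervision, but via masking on the level of individual agent trajectories. A minimal sketch of such an agent-level masking scheme is given below; the shapes, mask ratio, segment length and zero placeholder are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: agent-level masking for multiagent trajectory data.
# Contiguous segments of individual agent trajectories are masked;
# a masked autoencoder would be trained to reconstruct them.
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_steps, n_dims = 22, 50, 2  # e.g. 22 players, (x, y) positions
trajectories = rng.normal(size=(n_agents, n_steps, n_dims))

def mask_agent_segments(traj, seg_len=10, mask_ratio=0.5, rng=rng):
    """Mask one contiguous segment per selected agent.

    Returns the masked copy and a boolean mask of shape (agents, steps).
    Because masking operates per agent, the scheme is compatible with
    permutation equivariance over the agent axis.
    """
    masked = traj.copy()
    mask = np.zeros(traj.shape[:2], dtype=bool)
    n_masked = int(mask_ratio * traj.shape[0])
    for agent in rng.choice(traj.shape[0], size=n_masked, replace=False):
        start = rng.integers(0, traj.shape[1] - seg_len + 1)
        mask[agent, start:start + seg_len] = True
        masked[agent, start:start + seg_len] = 0.0  # placeholder "mask token"
    return masked, mask

masked, mask = mask_agent_segments(trajectories)
# A reconstruction loss would be computed on masked positions only, e.g.
# loss = ((decoder_output - trajectories)[mask] ** 2).mean()
print("masked timesteps:", mask.sum())  # 11 agents * 10 steps = 110
```

Unmasked positions stay untouched, so the pretraining signal comes entirely from reconstructing the hidden trajectory segments.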