JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015
Player Tracking and Identification in Ice Hockey
Kanav Vats, Systems Design Engineering, University of Waterloo
Pascale Walters, Stathletes Inc.
Mehrnaz Fani, Systems Design Engineering, University of Waterloo
David A. Clausi, Systems Design Engineering, University of Waterloo
John Zelek, Systems Design Engineering, University of Waterloo
Abstract—Tracking and identifying players is a fundamental
step in computer vision-based ice hockey analytics. The data
generated by tracking is used in many other downstream tasks,
such as game event detection and game strategy analysis. Player
tracking and identification is a challenging problem since the
motion of players in hockey is fast-paced and non-linear when
compared to pedestrians. There is also significant camera panning
and zooming in hockey broadcast video. Identifying players in
ice hockey is challenging since the players of the same team look
almost identical, with the jersey number the only discriminating
factor between players. In this paper, an automated system to
track and identify players in broadcast NHL hockey videos
is introduced. The system is composed of three components
(1) Player tracking, (2) Team identification and (3) Player
identification. Due to the absence of publicly available datasets,
the datasets used to train the three components are annotated
manually. Player tracking is performed with the help of a state-of-the-art
tracking algorithm, obtaining a Multi-Object Tracking
Accuracy (MOTA) score of 94.5%. For team identification, the
away-team jerseys are grouped into a single class and home-
team jerseys are grouped in classes according to their jersey
color. A convolutional neural network is then trained on the
team identification dataset. The team identification network gets
an accuracy of 97% on the test set. A novel player identification
model is introduced that utilizes a temporal one-dimensional
convolutional network to identify players from player bounding
box sequences. The player identification model further takes
advantage of the available NHL game roster data to obtain a
player identification accuracy of 83%.
Index Terms—ice hockey, convolutional neural network, player
tracking, team identification, player identification
I. INTRODUCTION
Ice hockey is a popular sport played by millions of people
[21]. Being a team sport, knowing the location of players
on the ice rink is essential for analyzing the game strategy
and player performance. The locations of the players on
the rink during the game are used by coaches, scouts, and
statisticians for analyzing the play. Although player location
data can be obtained manually, the process of labelling data
by hand on a per-game basis can be extremely tedious and
time consuming. Therefore, a computer vision-based player
tracking and identification system is of high utility.
In this paper, we introduce an automated system to track
and identify players in broadcast National Hockey League
(NHL) videos. The input to the system is broadcast NHL clips
from the main camera view (i.e., camera located in the stands
above the centre ice line) and the output is a set of player trajectories
along with their identities. Since there are no publicly available
datasets for ice hockey player tracking, team identification, and
player identification, we annotate our own datasets for each
of these problems. The previous papers in ice hockey player
tracking [9, 35] make use of hand-crafted features for detection
and re-identification. Therefore, we perform experiments with
five state-of-the-art tracking algorithms [4, 6, 8, 50, 52] on our
hockey player tracking dataset and evaluate their performance.
The output of the player tracking algorithm is a sequence of
player bounding boxes, called player tracklets.
Posing team identification as a classification problem with
each team treated as a separate class would be impractical
since (1) This will result in a large number of classes, and
(2) The same NHL team wears two different colors based
on whether it is the home or away team (Fig. 2). Therefore,
instead of treating each team as a separate class, we treat the
away (light) jerseys of all teams as a single class and cluster
home jerseys based on their jersey color. For example, the
Toronto Maple Leafs and the Tampa Bay Lightning both have
dark blue home jerseys and therefore can be put into a single
‘Blue’ class (Fig. 9). Since referees are easily distinguishable
from players, they are treated as a separate class. Based on
this simple training data formation, hockey players can be
classified into home and away teams. The team identification
network obtains an accuracy of 96.6% on the test set and does
not require additional fine tuning on new games.
Unlike soccer and basketball [41] where player facial fea-
tures and skin color are visible, a big challenge in player
identification in hockey is that the players of the same team
look almost identical. Therefore, we use jersey number for
identifying players since it is the most prominent feature
present on all player jerseys. Instead of classifying jersey
numbers from static images [14, 26, 29], we identify a player’s
jersey number from a sequence of player bounding boxes in
a video (also called tracklets). Player tracklets allow a model
to process more temporal context to identify a jersey number
since it is likely to be visible in multiple frames of the tracklet.
We introduce a temporal 1-dimensional Convolutional Neural
Network (1D CNN)-based network for identifying players
from their tracklets. The network outperforms the previous
work by Chan et al. [10] by 9.9% without requiring any
additional probability score aggregation model for inference.
The tracking, team identification, and player identification
models are combined to form a holistic offline system to
track and identify players and referees in the broadcast videos.
Player tracking helps team identification by removing team
identification errors in player tracklets through a simple
majority vote. Additionally, based on the team identification
output, we use the game roster data to further improve the
identification performance of the automated system by an
additional 5%. The overall system is depicted in Fig. 1. The
system is able to identify players from video with an accuracy
of 82.8% with a Multi-Object Tracking Accuracy (MOTA)
score of 94.5% and an Identification F1 (IDF1) score of
62.9%.
arXiv:2110.03090v1 [cs.CV] 6 Oct 2021
[Figure 1 panels: (a) Input video, (b) Player tracks, (c) Team id, (d) Player tracks with jersey number ID. Pipeline stages: player tracking, team identification, player identification (with player roster). Insets show the zoomed bounding box for the rightmost player.]
Fig. 1: Overview of the player tracking and identification system. The tracking model takes a hockey broadcast video clip as
input and outputs player tracks. The team identification model takes the player track bounding boxes as input and identifies
the team of each player along with identifying the referees. The player identification model utilizes the player tracks, team
data and game roster data to output player tracks with jersey number identities.
Fig. 2: Home (dark) and away (white) jerseys worn by the
Montreal Canadiens of the National Hockey League [33].
Five contributions are recognized:
1) New datasets are introduced for player tracking, team
identification, and player identification from tracklets.
2) We compare and contrast several state-of-the-art tracking
algorithms and analyze their performance and failure
modes on the ice hockey dataset.
3) A simple but efficient team identification algorithm for
ice hockey is introduced.
4) A temporal 1D CNN based player identification model is
introduced and implemented that outperforms the current
state of the art [10] by 9.9%.
5) A holistic system that combines tracking, team iden-
tification, and player identification models, along with
making use of the team roster data, to track and identify
players in broadcast ice hockey videos is introduced.
II. BACKGROUND
A. Tracking
The objective of Multi-Object Tracking (MOT) is to de-
tect objects of interest in video frames and associate the
detections with appropriate trajectories. Player tracking is an
important problem in computer vision-based sports analytics,
since player tracking combined with an automatic homography
estimation system [24] is used to obtain absolute player
locations on the sports rink. Also, various computer vision-
based tasks, such as sports event detection [39, 46, 47], can
be improved with player tracking data.
Tracking By Detection (TBD) is the most widely used
approach for multi-object tracking. Tracking by detection
consists of two steps: (1) Detecting objects of interest (hockey
players in our case) frame-by-frame in the video, then (2)
Linking player detections to produce tracks using a tracking
algorithm. Detection is usually done with the help of a deep
detector, such as Faster R-CNN [37] or YOLO [36]. For
associating detections with trajectories, techniques such as
Kalman filtering with the Hungarian algorithm [6, 50, 52] and
graphical inference [8, 42] are used. In recent literature, re-
identification in tracking is commonly carried out with the help
of deep CNNs using appearance [8, 50, 52] and pose features
[42].
For sports player tracking, Sanguesa et al. [40] demon-
strated that deep features perform better than classical hand
crafted features for basketball player tracking. Lu et al. [31]
perform player tracking in basketball using a Kalman filter.
Theagarajan et al. [43] track players in soccer videos using the
DeepSORT algorithm [50]. Hurault et al. [20] introduce a
self-supervised detection algorithm to detect small soccer players
and track players in non-broadcast settings using a triplet loss
trained re-identification mechanism, with embeddings obtained
from the detector itself.
In ice hockey, Okuma et al. [35] track hockey players by
introducing a particle filter combined with mixture particle
filter (MPF) framework [48], along with an Adaboost [49]
player detector. The MPF framework [48] allows the particle
filter framework to handle multi-modality by modelling the
posterior state distributions of M objects as an M-component
mixture. A disadvantage of the MPF framework is that the
particles merge and split in the process, which leads to loss of
identities. Moreover, the algorithm did not have any mecha-
nism to prevent identity switches and lost identities of players
after occlusions. Cai et al. [9] improved upon [35] by using
a bipartite matching for associating observations with targets
instead of using the mixture particle framework. However, the
algorithm is not trained or tested on broadcast videos, but
performs tracking in the rink coordinate system after a manual
homography calculation.
In ice hockey, prior published research [9, 35] performs
player tracking with the help of hand-crafted features for
player detection and re-identification. In this paper we track
and identify hockey players in broadcast NHL videos and
analyze performance of several state-of-the-art deep tracking
models on the ice hockey dataset.
B. Player Identification
Identifying players and referees is one of the most im-
portant problems in computer vision-based sports analytics.
Analyzing individual player actions and player performance
from broadcast video is not feasible without detecting and
identifying the player. Before the advent of deep learning
methods, player identification was performed with the help of
hand-crafted features [53]. Although techniques for identifying
players from body appearance exist [41], jersey number is the
primary and most widely used feature for player identification,
since it is observable and consistent throughout a game.
Most deep learning based player identification approaches in
the literature focus on identifying the player jersey number
from single frames using a CNN [14, 26, 29]. Gerke et al.
[14] were one of the first to use CNNs for soccer jersey
number identification and found that a deep learning approach
outperforms hand-crafted features. Li et al. [26] employed a
semi-supervised spatial transformer network to help the CNN
localize the jersey number in the player image. Liu et al.
[29] use a pose-guided R-CNN for jersey digit localization
and classification by introducing a human keypoint prediction
branch to the network and a pose-guided regressor to generate
digit proposals. Gerke et al. [15] also combined their single-
frame based jersey classifier with soccer field constellation
features to identify players.
Zhang et al. [51] track and identify players in a multi-camera
setting using a distinguishable deep representation
of player identity within a coarse-to-fine framework. Chan
et al. [10] use a combination of a CNN and a Long Short-Term
Memory (LSTM) network [19], similar to the long-term
recurrent convolutional network (LRCN) by Donahue et al.
[12], for identifying players from player sequences. The final
inference in Chan et al. [10] is carried out using another
CNN applied over the probability scores obtained from
the CNN-LSTM network.
In this paper, we identify players using player sequences
(tracklets) with the help of a temporal 1D CNN. Our proposed
inference scheme does not require the use of an additional
network.
C. Team Identification
Beyond knowing the identity of a player, they must also
be assigned to a team. Many sports analytics, such as “shot
attempts” and “team formations”, require knowing the team
to which each individual belongs. In sports leagues, teams
differentiate themselves based on the colour and design of
the jerseys worn by the players. In ice hockey, formulating
team identification as a classification problem with each team
treated as a separate class proves problematic, as
hockey teams wear light- and dark-coloured jerseys depending
on whether they are playing at their home venue or away venue
(Fig. 2). Furthermore, each game in which new teams play
would require fine-tuning [25].
Early work used colour histograms or colour features with a
clustering approach to differentiate between teams [1, 3, 7, 13,
16, 23, 30, 32, 34, 44]. This approach, while being lightweight,
does not handle occlusions, changes in illumination, and teams
wearing similar jersey colours well [3, 25]. Deep learning
approaches have increased the performance and generalizability
of player classification models [22].
Istasse et al. [22] simultaneously segment and classify
players in indoor basketball games. Players are segmented
and classified in a system where no prior is known about the
visual appearance of each team with associative embedding. A
trained CNN outputs a player segmentation mask and, for each
pixel, a feature vector that is similar for players belonging to
the same team. Theagarajan and Bhanu [43] classify soccer
players by team as part of a pipeline for generating tactical
performance statistics by using triplet CNNs.
In ice hockey, Guo et al. [17] perform team identification
using the color features of the hockey players’ uniforms. For
this purpose, the uniform region (central region) of the player’s
bounding box is cropped. From this region, hue, saturation,
and lightness (HSL) pixel values are extracted, and histograms
of pixels in five essential color channels (i.e., green, yellow,
blue, red, and white) are constructed. Finally, the player’s
team identification is determined by the channel that contains
the maximum proportion of pixels. Koshkina et al. [25]
use contrastive learning to classify player bounding boxes in
hockey games. This self-supervised learning approach uses a
CNN trained with triplet loss to learn a feature space that
best separates players into two teams. Over a sequence of
initial frames, they first learn two k-means cluster centres, then
associate players to teams.
Fig. 3: Network architecture for the player identification model. The network accepts a player tracklet as input. Each tracklet
image is passed through a ResNet18 to obtain time-ordered features $F$. The features $F$ are input into three 1D convolutional
blocks, each consisting of a 1D convolutional layer, batch normalization, and ReLU activation. In this figure, $k$ and $s$ are the
kernel size and stride of the convolution operation. The activations obtained from the convolutional blocks are mean-pooled and
passed through a fully connected layer and a softmax layer to output the probability distribution of the jersey number, $p_{jn}$.
III. TECHNICAL APPROACH
A. Player Tracking
1) Dataset: The player tracking dataset consists of a total of
84 broadcast NHL game clips with a frame rate of 30 frames
per second (fps) and resolution of 1280 × 720 pixels. The
average clip length is 36 seconds. The 84 video clips in the
dataset are extracted from 25 NHL games. The length of the
clips is shown in Fig. 8. Each frame in a clip is annotated
with player and referee bounding boxes and player identity
consisting of player name and jersey number. The annotation
is carried out with the help of the open source CVAT tool¹. The
dataset is split such that 58 clips are used for training, 13 clips
for validation, and 13 clips for testing. In order to prevent any
game-level bias from affecting the results, the split is made at the
game level, such that the training clips are obtained from 17 games,
validation clips from 4 games and test clips from 4 games.
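A game-level split of this kind can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `split_by_game`, the `(clip_id, game_id)` layout, and the random seed are assumptions, and the toy data does not reproduce the 58/13/13 clip counts reported above.

```python
import random
from collections import defaultdict


def split_by_game(clips, n_train_games, n_val_games, seed=0):
    """Assign whole games (not individual clips) to train/val/test.

    `clips` is a list of (clip_id, game_id) pairs. All names here are
    hypothetical and only illustrate the game-level split idea.
    """
    games = sorted({game_id for _, game_id in clips})
    rng = random.Random(seed)
    rng.shuffle(games)
    train_games = set(games[:n_train_games])
    val_games = set(games[n_train_games:n_train_games + n_val_games])

    splits = defaultdict(list)
    for clip_id, game_id in clips:
        if game_id in train_games:
            splits["train"].append(clip_id)
        elif game_id in val_games:
            splits["val"].append(clip_id)
        else:
            splits["test"].append(clip_id)
    return dict(splits)


# Toy data: 84 clips spread over 25 games (clip-to-game mapping is made up).
clips = [(f"clip{i}", f"game{i % 25}") for i in range(84)]
splits = split_by_game(clips, n_train_games=17, n_val_games=4)
```

Because games are shuffled and assigned as units, no game contributes clips to more than one split.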
2) Methodology: We experimented with five state of the
art tracking algorithms on the hockey player tracking dataset.
The algorithms include four online tracking algorithms [4,
6, 50, 52] and one offline tracking algorithm [8]. The best
tracking performance is achieved using the MOT Neural
Solver tracking model [8] re-trained on the hockey dataset.
MOT Neural Solver uses the popular tracking-by-detection
paradigm.
¹ Found online at: https://github.com/openvinotoolkit/cvat
Fig. 4: Distribution of tracklet lengths in frames of the player
identification dataset. The distribution is positively skewed,
with an average player tracklet length of 191 frames.
TABLE I: Network architecture for the player identification
model. k, s, d and p denote kernel dimension, stride, dilation
size and padding, respectively. Ch_i, Ch_o and b denote the
number of channels going into and out of a block, and the batch
size, respectively.
Input: player tracklet, b × 30 × 3 × 300 × 300
ResNet18 backbone
Layer     Operation                        Ch_i   Ch_o   (k, s, p, d)
Layer 1   Conv1D, Batch Norm 1D, ReLU      512    512    (3, 3, 0, 1)
Layer 2   Conv1D, Batch Norm 1D, ReLU      512    512    (3, 3, 1, 1)
Layer 3   Conv1D, Batch Norm 1D, ReLU      512    128    (3, 1, 0, 1)
Layer 4   Fully connected                  128    86     -
Output: b × 86
Fig. 5: Examples of two tracklets in the player identification dataset. (a) A tracklet in which the jersey number 12
is visible in only a subset of frames. (b) A tracklet in which the jersey number is never visible over the whole tracklet.
Fig. 6: Class distribution in the player tracklet identification
dataset. The dataset is heavily imbalanced, with the null
class (denoted by class 100) accounting for 50.4% of tracklet
examples.
In tracking by detection, the input is a set of object
detections $O = \{o_1, \ldots, o_n\}$, where $n$ denotes the total number
of detections in all video frames. A detection $o_i$ is represented
by $\{x_i, y_i, w_i, h_i, I_i, t_i\}$, where $x_i, y_i, w_i, h_i$ denote
the coordinates, width, and height of the detection bounding
box, and $I_i$ and $t_i$ represent the image pixels and timestamp
corresponding to the detection. The goal is to find a set of
trajectories $T = \{T_1, T_2, \ldots, T_m\}$ that best explains $O$, where
each $T_i$ is a time-ordered set of observations. The MOT Neural
Solver models the tracking problem as an undirected graph
$G = (V, E)$, where $V = \{1, 2, \ldots, n\}$ is the set of $n$ nodes
for $n$ player detections for all video frames. In the edge set $E$,
every pair of detections is connected so that trajectories with
missed detections can be recovered. The problem of tracking is
now posed as splitting the graph into disconnected components
where each component is a trajectory $T_i$. After computing
each node (detection) embedding and edge embedding using a
CNN, the model then solves a graph message passing problem.
The message passing algorithm classifies whether an edge
between two nodes in the graph belongs to the same player
trajectory.
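The last step of this formulation, turning edge classifications into trajectories, can be illustrated with a small union-find sketch. This is not the MOT Neural Solver itself: the message passing network and its edge rounding are omitted, the function name `trajectories_from_edges` is hypothetical, and the edges passed in are simply assumed to have been classified as "same player". Detections are indexed from 0 here rather than from 1 as in the text.

```python
def trajectories_from_edges(num_detections, active_edges):
    """Group detections into trajectories (connected components).

    `active_edges` holds (i, j) index pairs that a classifier has marked
    as belonging to the same player trajectory (assumed given).
    """
    parent = list(range(num_detections))

    def find(i):
        # Follow parent links to the component root, halving paths as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in active_edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for i in range(num_detections):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Six detections; edges (0,2), (2,4) and (1,3) were classified as active.
tracks = trajectories_from_edges(6, [(0, 2), (2, 4), (1, 3)])
```

Each returned list is one disconnected component, i.e. one candidate trajectory; a detection with no active edge forms a trajectory of length one.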
B. Team Identification
1) Dataset: The team identification dataset is obtained from
the same games and clips used in the player tracking dataset.
The train/validation/test splits are also identical to those of the
player tracking data. We take advantage of the fact that the away team
in NHL games usually wears a predominantly white jersey
with color stripes and patches, and the home team wears
a dark colored jersey. We therefore build a dataset with six
classes (blue, red, yellow, white, red-blue and referees) with
each class composed of images with the same dominant color. The
data-class distribution is shown in Fig. 10. Fig. 9 shows some
examples from the dataset. The training set consists of 32419
images. The validation and testing sets contain 6292 and 7898
images, respectively.
2) Methodology: For team identification, we use a
ResNet18 [18] pretrained on the ImageNet dataset [11], and
train the network on the team identification dataset by re-
placing the final fully connected layer to output six classes.
The image resolution used for training is 224 ×224 pixels.
During inference, the network classifies whether a bounding
box belongs to the away team (white color), the home team
(dark color), or the referee class. For inferring the team for
a player tracklet, the team identification model is applied
to each image of the tracklet and a simple majority vote is
used to assign a team to the tracklet. This way, the tracking
algorithm helps team identification by resolving errors in team
prediction.
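The tracklet-level majority vote can be sketched in a few lines. The per-frame labels are assumed to come from the team identification network; the function name `tracklet_team` and the tie-breaking rule (first label encountered wins) are assumptions, since the text does not specify how ties are handled.

```python
from collections import Counter


def tracklet_team(frame_predictions):
    """Return the most frequent per-frame team label for one tracklet."""
    if not frame_predictions:
        raise ValueError("tracklet has no frames")
    # most_common(1) returns the label with the highest count.
    return Counter(frame_predictions).most_common(1)[0][0]


# Per-frame predictions for one tracklet, with two isolated errors.
preds = ["white", "white", "blue", "white", "referee", "white"]
team = tracklet_team(preds)
```

Isolated per-frame misclassifications are outvoted as long as the correct label is predicted in most frames of the tracklet.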
3) Training Details: We use the Adam optimizer with
an initial learning rate of 0.001 and a weight decay of 0.001
for optimization. The learning rate is reduced by a factor
of 1/3 at regular intervals during the training process. We
do not perform data augmentation since performing color
augmentation on white away jerseys makes them resemble colored
home jerseys.
C. Player Identification
1) Image Dataset: The player identification image dataset
[45] consists of 54,251 player bounding boxes obtained from
25 NHL games. The NHL game videos are of resolution
1280 × 720 pixels. The dataset contains a total of 81 jersey
number classes, including an additional null class for no jersey
number visible. The player head and bottom of the images
are cropped such that only the jersey number (player torso)
is visible. Images from 17 games are used for training, four
games for validation and four games for testing. The dataset
is highly imbalanced such that the ratio between the most
frequent and least frequent class is 92. The dataset covers a
range of real-game scenarios such as occlusions, motion blur
and self-occlusions.
Fig. 7: Team identification results from four different games, none of which is present in the team identification dataset. The
model performs well on data not present in the dataset, which demonstrates the ability to generalize well to out-of-sample data
points.
Fig. 8: Length of the videos in the player tracking dataset. The
average clip length is 36 seconds.
2) Tracklet Dataset: The player identification tracklet
dataset consists of 3510 player tracklets. The tracklet bounding
boxes and identities are annotated manually. The manually
annotated tracklets simulate the output of a tracking algorithm.
The tracklet length distribution is shown in Fig. 4. The average
length of a player tracklet is 191 frames. It is important to
note that the player jersey number is visible in only a subset
of tracklet frames. Fig. 5 illustrates two tracklet examples
from the dataset. The dataset is divided into 86 jersey number
classes with one null class representing no jersey number
visible. The class distribution is shown in Fig. 6. The dataset
is heavily imbalanced with the null class consisting of 50.4%
of tracklet examples. The training set contains 2829 tracklets,
and the validation and test sets contain 176 and 505 tracklets,
respectively. The game-wise training/testing data split is identical
in all four datasets discussed.
3) Network Architecture: Let $T = \{o_1, o_2, \ldots, o_n\}$ denote a
player tracklet where each $o_i$ represents a player bounding
box. The player head and bottom in the bounding box $o_i$ are
cropped such that only the jersey number is visible. Each resized
image $I_i \in \mathbb{R}^{300 \times 300 \times 3}$ corresponding to the bounding
box $o_i$ is input into a backbone 2D CNN, which outputs a set
of time-ordered features $F = \{f_1, f_2, \ldots, f_n\}$, $f_i \in \mathbb{R}^{512}$. The
features $F$ are input into a 1D temporal convolutional network
that outputs the probability $p \in \mathbb{R}^{86}$ of the tracklet belonging to
a particular jersey number class. The architecture of the 1D
CNN is shown in Fig. 3.
Fig. 9: Examples of the ‘blue’ class in the team identification
dataset. Home jerseys of teams such as (a) Vancouver Canucks,
(b) Toronto Maple Leafs and (c) Tampa Bay Lightning are
blue in appearance and hence are put in the same class.
The network consists of a ResNet18 [18] based 2D CNN
backbone pretrained on the player identification image dataset
(Section III-C1). The weights of the ResNet18 backbone
network are kept frozen while training. The 2D CNN backbone
is followed by three 1D convolutional blocks, each consisting
of a 1D convolutional layer, batch normalization, and ReLU
activation. Each block has a kernel size of three and dilation of
one. The first two blocks have a larger stride of three, so that
the initial layers have a larger receptive field to take advantage
of a large temporal context. Residual skip connections are
added to aid learning. The exact architecture is shown in
Table I. Finally, the activations obtained are pooled using
mean pooling and passed through a fully connected layer
with 128 input units. The logits obtained are softmaxed to obtain
jersey number probabilities. Note that the model accepts fixed
length training sequences of length $n = 30$ as input, but the
training tracklets are hundreds of frames in length (Fig. 4).
Therefore, $n = 30$ tracklet frames are sampled with a random
starting frame from the training tracklet. This serves as a form
of data augmentation since, at every training iteration,
the network processes a random set of frames from an input
tracklet.
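As a worked check of the temporal dimensions, the sketch below applies the standard 1D convolution output-length formula to the (k, s, p, d) values listed in Table I, and illustrates the random-start sampling of a 30-frame window. The helper names are hypothetical, the sampled frames are assumed to be consecutive, and the handling of tracklets shorter than 30 frames is not specified in the text, so the sketch simply raises an error in that case.

```python
import random


def conv1d_out_len(length, k, s, p=0, d=1):
    """Output length of a 1D convolution (standard formula)."""
    return (length + 2 * p - d * (k - 1) - 1) // s + 1


# (k, s, p, d) for the three convolutional blocks listed in Table I.
blocks = [(3, 3, 0, 1), (3, 3, 1, 1), (3, 1, 0, 1)]
lengths = [30]  # temporal length of the input feature sequence
for k, s, p, d in blocks:
    lengths.append(conv1d_out_len(lengths[-1], k, s, p, d))


def sample_window(tracklet_frames, n=30, rng=random):
    """Pick n consecutive frames with a random starting frame."""
    if len(tracklet_frames) < n:
        # Short tracklets are not discussed in the text (assumption).
        raise ValueError("tracklet shorter than window")
    start = rng.randrange(len(tracklet_frames) - n + 1)
    return tracklet_frames[start:start + n]
```

Under these settings the temporal length shrinks from 30 to 10, then 4, then 2 before mean pooling, so the pooled vector summarises the whole 30-frame window.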
4) Training Details: In order to handle the severe class
imbalance present in the tracklet dataset, the tracklets are
sampled such that the null class is sampled with
a probability $p_0 = 0.1$. The network is trained with the
cross entropy loss. We use the Adam optimizer for training with
an initial learning rate of 0.001 and a batch size of 15. The
learning rate is reduced by a factor of 1/5 after iteration numbers
2500, 5000, and 7500. Several data augmentation techniques
such as random cropping, color jittering, and random rotation
are also used. All experiments are performed on two Nvidia
P-100 GPUs.
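The sampling rule and the step schedule described above can be sketched as follows. Both functions are hypothetical illustrations: the text states only that the null class is drawn with probability 0.1, so drawing uniformly among the remaining tracklets is an assumption, and "after iteration 2500" is interpreted as applying from iteration 2501 onward.

```python
import random


def sample_tracklet(null_tracklets, numbered_tracklets, p_null=0.1, rng=random):
    """Draw one training tracklet; the null class is chosen with probability p_null."""
    pool = null_tracklets if rng.random() < p_null else numbered_tracklets
    # Uniform choice within the pool is an assumption, not stated in the text.
    return rng.choice(pool)


def learning_rate(iteration, base_lr=0.001, milestones=(2500, 5000, 7500), factor=0.2):
    """Step schedule: multiply the rate by `factor` once each milestone has passed."""
    passed = sum(iteration > m for m in milestones)
    return base_lr * factor ** passed
```

With these defaults the rate starts at 0.001 and is scaled by 1/5 three times over the course of training.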
Fig. 10: Classes in team identification and their distribution.
The ‘ref’ class denotes referees.
5) Inference: During inference, we need to assign a single
jersey number label to a test tracklet of $k$ bounding boxes
$T_{test} = \{o_1, o_2, \ldots, o_k\}$. Here $k$ can be much greater than $n = 30$.
So, a sliding window technique is used where the network
is applied to the whole test tracklet $T_{test}$ with a stride of one
frame to obtain window probabilities $P = \{p_1, p_2, \ldots, p_k\}$ with
each $p_i \in \mathbb{R}^{86}$. The probabilities $P$ are aggregated to assign a
single jersey number class to a tracklet. In order to aggregate
the probabilities $P$, we first identify the tracklets in which the jersey
number is visible. To do this we train a ResNet18 classifier
$C^{im}$ (same as the backbone discussed in Section III-C3) on
the player identification image dataset. The classifier $C^{im}$ is
run on every image of the tracklet. A jersey number is assumed
to be absent from a tracklet if the probability of the absence of a
jersey number, $C^{im}_{null}$, is greater than a threshold $\theta$ for each
image in the tracklet. The threshold $\theta$ is determined using the
player identification validation set. For the tracklets in which
the jersey number is visible, the probabilities are averaged to
obtain a single probability vector $p_{avg}$, which represents the
probability distribution of the jersey number in the test tracklet
$T_{test}$. As post-processing, only those probability vectors $p_i$ are
averaged for which $\mathrm{argmax}(p_i) \neq null$. This post-processing
step leads to an accuracy improvement of 2.37%.
The rationale behind using visibility filtering and the post-processing
step is that a large tracklet with hundreds of
frames may have the number visible in only a few frames
and therefore, a simple averaging of probabilities $P$ will
often output null. The proposed inference technique allows
the network to ignore the window probabilities corresponding
to the null class if a number is visible in the tracklet. The
proposed inference method shows an improvement of 7.53%
over simply obtaining the final prediction by averaging all
$p_i \in P$. The whole algorithm is illustrated in Algorithm 1.
Algorithm 1: Algorithm for inference on a tracklet.
Input: Tracklet $T_{test} = \{o_1, o_2, \ldots, o_k\}$, jersey image classifier $C^{im}$, tracklet id model $\mathcal{P}$
Output: Identity $Id$
Initialize: $vis = false$
$P = \mathcal{P}(T_{test})$  // using sliding window
for $o_i$ in $T_{test}$ do
    if $C^{im}_{null}(o_i) < \theta$ then
        $vis = true$
        break
    end
end
if $vis == true$ then
    $P' = \{p_i \in P : \mathrm{argmax}(p_i) \neq null\}$  // post-processing
    $Id = \mathrm{argmax}(\mathrm{mean}(P'))$
else
    $Id = null$
end
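Algorithm 1 can be transcribed into a short Python sketch. The function name `identify_tracklet` and its argument layout are hypothetical: the window probabilities and the per-image null probabilities are assumed to have been computed already by the tracklet model and the image classifier. Returning null when every window predicts null despite the tracklet being marked visible is an added assumption, since Algorithm 1 does not cover that case.

```python
def identify_tracklet(window_probs, null_probs, theta, null_index):
    """Assign one class index to a tracklet, following Algorithm 1.

    window_probs: list of per-window probability vectors p_i.
    null_probs:   per-image probability of 'no jersey number visible'.
    theta:        visibility threshold.
    null_index:   index of the null class inside each p_i.
    """
    # The number counts as visible if any image falls below the threshold.
    visible = any(p < theta for p in null_probs)
    if not visible:
        return null_index

    def argmax(vec):
        return max(range(len(vec)), key=vec.__getitem__)

    # Post-processing: drop windows whose top prediction is the null class.
    kept = [p for p in window_probs if argmax(p) != null_index]
    if not kept:
        return null_index  # assumption: case not covered by Algorithm 1
    mean = [sum(column) / len(kept) for column in zip(*kept)]
    return argmax(mean)


# Toy example with three classes; index 2 plays the role of the null class.
window_probs = [[0.1, 0.1, 0.8], [0.6, 0.3, 0.1], [0.2, 0.1, 0.7], [0.5, 0.4, 0.1]]
null_probs = [0.9, 0.2, 0.95, 0.3]
identity = identify_tracklet(window_probs, null_probs, theta=0.5, null_index=2)
```

In this toy input, averaging all four windows would favour the null class, whereas discarding the two null-dominated windows selects class 0, which is the behaviour the visibility filtering and post-processing are designed to produce.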
TABLE II: Tracking performance of the MOT Neural Solver model for the 13 test videos. (↓ means lower is better, ↑ means
higher is better)
Video number   IDF1 ↑   MOTA ↑   ID-switches ↓   False positives (FP) ↓   False negatives (FN) ↓
1              78.53    94.95    23              100                      269
2              61.49    93.29    26              48                       519
3              55.83    95.85    43              197                      189
4              67.22    95.50    31              77                       501
5              72.60    91.42    40              222                      510
6              66.66    90.93    38              301                      419
7              49.02    94.89    59              125                      465
8              50.06    92.02    31              267                      220
9              53.33    96.67    30              48                       128
10             55.91    95.30    26              65                       193
11             56.52    96.03    40              31                       477
12             87.41    94.98    14              141                      252
13             62.98    94.77    30              31                       252
TABLE III: Comparison of the overall tracking performance on the test videos of the hockey player tracking dataset. (↓ means lower
is better, ↑ means higher is better)
Method                  IDF1 ↑   MOTA ↑   ID-switches ↓   False positives (FP) ↓   False negatives (FN) ↓
SORT [6]                53.7     92.4     673             2403                     5826
Deep SORT [50]          59.3     94.2     528             1881                     4334
Tracktor [4]            56.5     94.4     687             1706                     4216
FairMOT [52]            61.5     91.9     768             1179                     7568
MOT Neural Solver [8]   62.9     94.5     431             1653                     4394
D. Overall System
The player tracking, team identification, and player identi-
fication methods discussed are combined together for tracking
and identifying players and referees in broadcast video shots.
Given a test video shot, we first run player detection and
tracking to obtain a set of player tracklets $\tau = \{T_1, T_2, \ldots, T_n\}$.
For each tracklet $T_i$ obtained, we run the player identification
model to obtain the player identity. We take advantage of the
fact that the player roster is available for NHL games through
play-by-play data, hence we can focus only on players actually
present in the team. To do this, we construct vectors $v_a$ and
$v_h$ that contain information about which jersey numbers are
present in the away and home teams, respectively. We refer
to the vectors $v_h$ and $v_a$ as the roster vectors. Assuming we
know the home and away roster, let $H$ be the set of jersey
numbers present in the home team and $A$ be the set of jersey
numbers present in the away team. Let $null$ denote the no-jersey-number
class and $j$ denote the index associated with jersey
number $n_j$ in the $p_{jn}$ vector.
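$$v_h[j] = 1, \quad \text{if } n_j \in H \cup \{null\} \tag{1}$$
$$v_h[j] = 0, \quad \text{otherwise,} \tag{2}$$
similarly,
$$v_a[j] = 1, \quad \text{if } n_j \in A \cup \{null\} \tag{3}$$
$$v_a[j] = 0, \quad \text{otherwise.} \tag{4}$$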
We multiply the probability scores $p_{jn} \in \mathbb{R}^{86}$ obtained from
the player identification model by $v_h \in \mathbb{R}^{86}$ if the player belongs to
the home team or by $v_a \in \mathbb{R}^{86}$ if the player belongs to the away team.
The player's team is determined by the trained
team identification model. The player identity $Id$ is determined
through
$$Id = argmax(p_{jn} \odot v_h) \tag{5}$$
(where $\odot$ denotes element-wise multiplication) if the player
belongs to the home team, and
$$Id = argmax(p_{jn} \odot v_a) \tag{6}$$
if the player belongs to the away team. The overall algorithm is
summarized in Algorithm 2. Fig. 1 depicts the overall system
visually.
Algorithm 2: Holistic algorithm for player tracking and identification.
Input: Input video V, tracking model Tr, team ID model T, player ID model P, v_h, v_a
Output: Identities ID = {Id_1, Id_2, ..., Id_n}
Initialize: ID = ∅
τ = {T_1, T_2, ..., T_n} = Tr(V)
for T_i in τ do
    team = T(T_i)
    p_jn = P(T_i)
    if team == home then
        Id = argmax(p_jn ⊙ v_h)
    else if team == away then
        Id = argmax(p_jn ⊙ v_a)
    else
        Id = ref
    end
    ID = ID ∪ Id
end
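The roster masking of Eqs. (1)-(6) can be illustrated with a short Python sketch. The class labels, the dictionary of rosters, and the function names are hypothetical and chosen only for the example; the paper's probability vector has 86 entries, whereas four classes are used here for brevity.

```python
from typing import Dict, List, Sequence, Set

NULL = "null"  # label of the no-jersey-number class


def roster_vector(classes: Sequence[str], roster: Set[str]) -> List[int]:
    """v[j] = 1 if class j is on the roster or is the null class, else 0."""
    return [1 if c in roster or c == NULL else 0 for c in classes]


def identify(
    probs: Sequence[float],
    classes: Sequence[str],
    team: str,
    rosters: Dict[str, Set[str]],
) -> str:
    """Mask the class probabilities with the team's roster vector and argmax."""
    if team not in rosters:  # neither home nor away: treated as a referee
        return "ref"
    mask = roster_vector(classes, rosters[team])
    masked = [p * m for p, m in zip(probs, mask)]  # element-wise product
    best = max(range(len(masked)), key=lambda j: masked[j])
    return classes[best]
```

For example, with classes `["null", "8", "18", "88"]` and probabilities `[0.05, 0.15, 0.35, 0.45]`, a home roster of `{"8", "18"}` yields "18" even though "88" has the highest raw probability, because "88" is masked out for the home team.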
IV. RESULTS
A. Player Tracking
The MOT Neural Solver algorithm is compared with four
state-of-the-art tracking algorithms: Tracktor [4], FairMOT [52],
Deep SORT [50] and SORT [6]. Player detection is performed
using a Faster-RCNN network [37] with a ResNet50-based
Feature Pyramid Network (FPN) backbone [27], pre-trained on
the COCO dataset [28] and fine-tuned on the hockey tracking
dataset. The object detector obtains an Average Precision (AP)
of 70.2 on the test videos (Table V). The accuracy metrics used
for tracking are the CLEAR MOT metrics [5] and the
Identification F1 score (IDF1) [38]. An important metric is the
number of identity switches (IDSW). An identity switch occurs
when a ground truth ID $i$ is assigned a tracked ID $j$ when the
last known assignment was $k \neq j$. A low number of identity
switches is an indicator of good tracking performance. For
sports player tracking, IDF1 is a better accuracy measure than
Multi-Object Tracking Accuracy (MOTA) since it measures how
consistently the identity of a tracked object is preserved with
respect to the ground truth identity. The overall results are
shown in Table III. The MOT Neural Solver model obtains the
highest MOTA score of 94.5 and the highest IDF1 score of 62.9
on the test videos.
1) Analysis: From Table III it can be seen that the MOTA
score of all methods is above 90%. This is because MOTA is
calculated as
$$MOTA = 1 - \frac{\sum_t (FN_t + FP_t + IDSW_t)}{\sum_t GT_t} \tag{7}$$
where $t$ is the frame index and $GT_t$ is the number of ground
truth objects in frame $t$. The MOTA metric counts detection errors
through the sum $FP + FN$ and association errors through $IDSW$s.
Since false positives (FP) and false negatives (FN) heavily rely
on the performance of the player detector, the MOTA metric
depends strongly on the performance of the detector. For hockey
player tracking, the player detection accuracy is high because
of the large size of players in broadcast video and the limited
number of players to detect on the screen. Therefore, the MOTA
score for all methods is very high.
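As a small illustration of Eq. (7), the following Python sketch computes MOTA from per-frame counts. The function name and the toy counts in the usage note are assumptions for illustration only and are not taken from the hockey dataset.

```python
from typing import Sequence


def mota(
    fn: Sequence[int],
    fp: Sequence[int],
    idsw: Sequence[int],
    gt: Sequence[int],
) -> float:
    """MOTA = 1 - sum_t(FN_t + FP_t + IDSW_t) / sum_t(GT_t).

    Each argument holds one count per frame t.
    """
    total_gt = sum(gt)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground truth objects")
    errors = sum(fn) + sum(fp) + sum(idsw)
    return 1.0 - errors / total_gt
```

For three hypothetical frames with 10 ground truth objects each, `mota([1, 0, 2], [0, 1, 0], [0, 0, 1], [10, 10, 10])` gives 1 - 5/30. Because identity switches are typically far fewer than detection errors, the detection terms dominate the numerator, which is why a strong detector alone produces a high MOTA.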
The MOT Neural Solver method achieves the highest IDF1
score of 62.9 and significantly fewer identity switches than
the other methods. This is because pedestrian trackers assume
a linear motion model, which does not perform well with the
motion of hockey players. Sharp changes in player motion often
lead to identity switches. The MOT Neural Solver model, on the
other hand, makes no such assumption since it poses tracking
as a graph edge classification problem.
Table II shows the performance of the MOT Neural Solver
for each of the 13 test videos. We perform a failure analysis to
determine the cause of identity switches and low IDF1 scores in
some videos. The major sources of identity switches are severe
occlusions and players going out of the field of view due to
camera panning. We define a pan-identity switch as an identity
switch resulting from a player leaving and re-entering the camera
field of view due to panning. It is very difficult for the tracking
model to maintain identity in these situations since players of
the same team look identical and a player going out of the camera
field of view at a particular point in screen coordinates can
re-enter at any other point. We estimate the proportion of
pan-identity switches to determine the contribution of panning
to total identity switches.
Fig. 11: Proportion of pan-identity switches for all videos at
a threshold of $\delta = 40$. On average, pan-identity switches
account for 65% of identity switches.
To estimate the number of pan-identity switches, since
we have quality annotations, we assume that the ground truth
annotations are accurate and that no annotations are missing.
Under this assumption, there is a significant time gap between
two consecutive annotated detections of a player only when the
player leaves the camera field of view and comes back again.
Let $T_{gt} = \{o_1, o_2, \ldots, o_n\}$ be a ground truth tracklet,
where $o_i = \{x_i, y_i, w_i, h_i, I_i, t_i\}$ represents a ground
truth detection. A pan-identity switch is expected to occur during
tracking when the difference between the timestamps (in frames)
of two consecutive ground truth detections $i$ and $j$ is greater
than a sufficiently large threshold $\delta$. That is,
$$(t_i - t_j) > \delta \tag{8}$$
Therefore, the total number of pan-identity switches in a video
is approximately
$$\sum_{G} \mathbf{1}(t_i - t_j > \delta) \tag{9}$$
where the summation is carried out over all ground truth
trajectories and $\mathbf{1}$ is an indicator function. Consider
video number 9, which has 30 identity switches and an IDF1 of
53.33. We plot the proportion of pan-identity switches (Fig. 12),
that is,
$$\frac{\sum_{G} \mathbf{1}(t_i - t_j > \delta)}{IDSWs} \tag{10}$$
against $\delta$, where $\delta$ varies between 40 and 80 frames.
In video number 9, $IDSWs = 30$. From Fig. 12 it can be seen
that the majority of the identity switches (approximately 90% at a
threshold of $\delta = 40$ frames) occur due to camera panning,
which is the main source of error. Visual inspection of the video
confirmed this. Fig. 11 shows the proportion of pan-identity
switches for all videos at a threshold of $\delta = 40$. On
average, pan-identity switches account for 65% of identity
switches in the videos. This shows that the tracking model
is able to handle occlusions and missing detections, with the
exception of extremely cluttered scenes.
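The estimate in Eqs. (8)-(10) amounts to counting large gaps between consecutive ground truth timestamps. A minimal Python sketch follows, assuming each ground truth tracklet is reduced to a list of frame indices; the function names and the toy data in the usage note are hypothetical.

```python
from typing import Sequence


def count_pan_switches(tracklets: Sequence[Sequence[int]], delta: int) -> int:
    """Count gaps larger than delta between consecutive ground truth
    detections, summed over all ground truth tracklets (Eq. 9)."""
    count = 0
    for timestamps in tracklets:
        ordered = sorted(timestamps)
        for earlier, later in zip(ordered, ordered[1:]):
            if later - earlier > delta:
                count += 1
    return count


def pan_switch_proportion(
    tracklets: Sequence[Sequence[int]], delta: int, idsw: int
) -> float:
    """Estimated share of identity switches caused by panning (Eq. 10)."""
    if idsw <= 0:
        raise ValueError("the number of identity switches must be positive")
    return count_pan_switches(tracklets, delta) / idsw
```

With the toy tracklets `[[1, 2, 3, 60, 61], [5, 6, 7], [10, 100, 200]]`, a threshold of 40 frames finds three gaps and a threshold of 80 frames finds two, so the estimated proportion can only stay the same or fall as $\delta$ grows.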
TABLE IV: Ablation study on different methods of aggregating probabilities for tracklet confidence scores.
Method                              Accuracy   F1 score   Visibility filtering   Postprocessing
Majority voting                     80.59%     80.40%     Yes                    Yes
Probability averaging               75.64%     75.07%     No                     No
Proposed w/o postprocessing         80.80%     79.12%     Yes                    No
Proposed w/o visibility filtering   50.10%     48.00%     No                     Yes
Proposed                            83.17%     83.19%     Yes                    Yes
Fig. 12: Proportion of pan-identity switches vs. $\delta$ plot for
video number 9. The majority of the identity switches (approximately
90% at a threshold of $\delta = 40$ frames) occur due to camera
panning, which is the main source of error.
TABLE V: Player detection results on the test videos. AP
stands for Average Precision. AP50 and AP75 are the average
precision at an Intersection over Union (IoU) of 0.5 and 0.75
respectively.
AP AP50 AP75
70.2 95.9 87.5
B. Team Identification
The team identification model obtains an accuracy of 96.6%
on the team identification test set. Table VI shows the
macro-averaged precision, recall and F1 score for the results. The
model is also able to correctly classify teams in the test set
that are not present in the training set. Fig. 7 shows some
qualitative results where the network is able to generalize
to videos absent from the training and testing data. We compare
the model to color histogram features as a baseline. Each image
in the dataset was cropped such that only the upper half of the
jersey is visible. A color histogram was obtained from the
RGB representation of each image, with $n_{bins}$ bins per image
channel. Finally, a Support Vector Machine (SVM) with a
Radial Basis Function (RBF) kernel was trained on the normalized
histogram features. The optimal SVM hyperparameters
and number of histogram bins were determined using grid
search with five-fold cross validation on the combination
of the training and validation sets. The optimal hyperparameters
obtained were $C = 10$, $\gamma = 0.01$ and $n_{bins} = 12$. Compared
to the SVM model, the deep network based approach performs
14.6% better on the test set, demonstrating that the CNN-based
approach is superior to simple hand-crafted color histogram
features.
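The feature extraction step of this baseline can be sketched as follows. This is an illustrative sketch only: representing an image crop as a list of `(r, g, b)` tuples, mapping 8-bit values to equal-width bins, and normalizing each channel to sum to one are assumptions, since the text does not specify these details. Cropping to the upper half of the jersey and training the RBF-kernel SVM are not shown.

```python
from typing import Iterable, List, Tuple

Pixel = Tuple[int, int, int]  # (r, g, b), each in 0..255


def color_histogram(pixels: Iterable[Pixel], n_bins: int = 12) -> List[float]:
    """Concatenate per-channel histograms into a 3 * n_bins feature vector.

    Each channel's histogram is normalized to sum to 1 (an assumption).
    """
    hist = [0.0] * (3 * n_bins)
    count = 0
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            # Map an 8-bit value to one of n_bins equal-width bins.
            bin_index = min(value * n_bins // 256, n_bins - 1)
            hist[channel * n_bins + bin_index] += 1.0
        count += 1
    if count == 0:
        raise ValueError("cannot build a histogram from an empty image")
    return [h / count for h in hist]
```

With the reported optimum of 12 bins per channel, each image is described by a 36-dimensional vector, which would then be passed to the SVM with $C = 10$ and $\gamma = 0.01$.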
C. Player Identification
The proposed player identification network attains an
accuracy of 83.17% on the test set. We compare the network
with Chan et al. [10], who use a secondary CNN model for
aggregating probabilities on top of a CNN+LSTM model.
Our proposed inference scheme, on the other hand, does not
require any additional network. Since the code and dataset for
Chan et al. [10] are not publicly available, we re-implemented
the model from scratch and trained and evaluated it
on our dataset. The proposed network performs 9.9% better
than Chan et al. [10]. The network proposed by Chan et al.
[10] processes shorter sequences of length 16 during training
and testing, and therefore exploits less temporal context than
the proposed model with sequence length 30. Also, the secondary
CNN used by Chan et al. [10] for aggregating tracklet
probability scores easily overfits on our dataset. Adding L2
regularization while training the secondary CNN proposed in
Chan et al. [10] on our dataset also did not improve the
performance. This is because our dataset is half the size and
is more skewed than the one used in Chan et al. [10], with
the null class making up half the examples in our case.
The superior performance indicates that the proposed network,
the training methodology involving intelligent sampling of the
null class, and the proposed inference scheme work better
on our dataset. Additionally, temporal 1D CNNs have been
reported to perform better than LSTMs in handling long range
dependencies [2], which is consistent with our results.
TABLE VI: Team identification accuracy on the team-
identification test set.
Method                     Accuracy   Precision   Recall   F1 score
Proposed                   96.6       97.0        96.5     96.7
SVM with color histogram   82.0       81.7        81.5     81.5
TABLE VII: Overall player identification accuracy for 13 test
videos. The mean accuracy for the videos increases by 4.9%
after including the player roster information.
Video number   Without roster vectors   With roster vectors
1              90.6%                    95.34%
2              57.1%                    71.4%
3              84.2%                    85.9%
4              74.0%                    78.0%
5              79.6%                    81.4%
6              88.0%                    88.0%
7              68.6%                    74.6%
8              91.6%                    93.75%
9              88.6%                    90.9%
10             86.04%                   88.37%
11             44.44%                   68.88%
12             84.84%                   84.84%
13             75.0%                    75.0%
Mean           77.9%                    82.8%
The network is able to identify digits despite motion blur
and unusual angles (Fig. 14). Upon inspecting the error cases,
it is seen that when a two-digit jersey number is misclassified,
the predicted number and ground truth often share one digit.
This phenomenon is observed in 85% of misclassified two-digit
jersey numbers. For example, 55 is misclassified as 65
and 26 is misclassified as 28, since 6 often looks like 8 (Fig.
16) because of occlusions and folds in player jerseys.
Fig. 13: Example of a tracklet where the team is misclassified. Here, the away team player is occluded by the home team
player, which causes the team identification model to output the incorrect result. Since the original tracklet contains hundreds
of frames, only a subset of tracklet frames is shown.
Fig. 14: Some frames from a tracklet where the model is able to identify the number 20 where 0 is at a tilted angle in the majority
of bounding boxes.
Fig. 15: Jersey number presence accuracy vs. $\theta$ on
the validation set. The values of $\theta$ tested are $\theta =
\{0.0033, 0.01, 0.03, 0.09, 0.27, 0.81\}$. The highest accuracy is
attained at $\theta = 0.01$.
The value of $\theta$ (the threshold for filtering out tracklets where
the jersey number is absent) is determined using the validation
set. In Fig. 15, we plot the percentage of validation
tracklets correctly classified for presence of a jersey number
versus the parameter $\theta$. The values of $\theta$ tested are
$\theta = \{0.0033, 0.01, 0.03, 0.09, 0.27, 0.81\}$. The highest
accuracy of 95.64% is attained at $\theta = 0.01$. A higher value of
$\theta$ results in more false positives for jersey number presence.
A $\theta$ lower than 0.01 results in more false negatives. We
therefore use $\theta = 0.01$ for inference on the test set.
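The threshold search described above can be sketched as a simple grid search on the validation set. The function names and the toy validation data in the usage note are hypothetical; the presence rule (a number is deemed present if at least one frame has a null probability below $\theta$) follows Algorithm 1.

```python
from typing import Sequence


def presence_accuracy(
    null_probs_per_tracklet: Sequence[Sequence[float]],
    has_number: Sequence[bool],
    theta: float,
) -> float:
    """Fraction of tracklets whose jersey number presence is predicted correctly."""
    correct = 0
    for null_probs, truth in zip(null_probs_per_tracklet, has_number):
        predicted = any(p < theta for p in null_probs)
        if predicted == truth:
            correct += 1
    return correct / len(has_number)


def select_theta(
    null_probs_per_tracklet: Sequence[Sequence[float]],
    has_number: Sequence[bool],
    candidates: Sequence[float],
) -> float:
    """Return the candidate with the highest accuracy (first one on ties)."""
    return max(
        candidates,
        key=lambda t: presence_accuracy(null_probs_per_tracklet, has_number, t),
    )
```

On a toy validation set such as `[[0.5, 0.005], [0.9, 0.8], [0.02, 0.6], [0.2, 0.3]]` with labels `[True, False, True, False]`, very small thresholds miss visible numbers (false negatives) while large thresholds flag number-free tracklets as visible (false positives), mirroring the trade-off described above.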
1) Ablation studies: We perform ablation studies in order to
study how data augmentation and inference techniques affect
the player identification network performance:
Data augmentation: We apply several data augmentation
techniques to boost player identification performance, such as
color jittering, random cropping, and random rotation by
rotating each image in a tracklet by ±10 degrees. Note that
since we are dealing with temporal data, these augmentation
techniques are applied per tracklet instead of per image. In
this section we investigate the contribution of each augmentation
technique to the overall accuracy. Table VIII shows the
accuracy and weighted macro F1 score values after removing
these augmentation techniques. It is observed that removing
any one of the applied augmentation techniques decreases the
overall accuracy and F1 score.
Inference technique: We perform an ablation study to
determine how our tracklet score aggregation scheme of averaging
probabilities after filtering out tracklets based on jersey
number presence compares with other techniques. Recall from
Section III-C5 that for inference, we perform visibility filtering
of tracklets and evaluate the model only on tracklets where
the jersey number is visible. We also include a post-processing
step where only those window probability vectors $p_i$ for which
$argmax(p_i) \neq null$ are averaged. The other baselines
tested are described below:
1) Majority voting: after filtering tracklets based on jersey
number presence, the argmax of each window probability vector
$p_i \in P$ of a tracklet is taken to obtain window predictions,
after which a simple majority vote is taken to obtain the final
prediction. For post-processing, the majority vote is only
taken over those window predictions which are not the null
class.
2) Only averaging probabilities: this is equivalent to our
proposed approach without visibility filtering and post-
processing.
Fig. 16: Some frames from a tracklet where 6 appears as 8 due to motion blur and folds in the player jersey, leading to an error
in classification.
Fig. 17: Example of a tracklet where the same identity is assigned to two different players due to an identity switch. Errors of
this kind in player tracking get carried over to player identification, since a single jersey number cannot be associated with
this tracklet.
TABLE VIII: Ablation study on different kinds of data augmentations applied during training. Removing any one of the applied
augmentation techniques decreases the overall accuracy and F1 score.
Accuracy F1 score Color Rotation Random cropping
83.17% 83.19% X X X
81.58% 82.00% X X
81.58% 81.64% X X
81.00% 81.87% X X
The results are shown in Table IV. We observe that our
proposed aggregation technique performs best, with an
accuracy of 83.17% and a macro weighted F1 score of 83.19%.
Majority voting shows inferior performance, with an accuracy of
80.59% even after visibility filtering and post-processing
are applied. This is because majority voting does not take into
account the overall window-level probabilities to obtain the
final prediction, since it applies the argmax operation to each
probability vector $p_i$ separately. Simple probability averaging
without visibility filtering and post-processing obtains a 7.53%
lower accuracy, demonstrating the advantage of the visibility
filtering and post-processing steps. Removing the post-processing
step from the proposed method lowers the accuracy by 2.37%,
indicating that the post-processing step is of integral importance
to the overall inference pipeline. The proposed inference technique
without visibility filtering, but with post-processing, performs
poorly, with an accuracy of just 50.10%. This is because
performing post-processing on every tracklet irrespective of
jersey number visibility prevents the model from assigning the null
class to any tracklet, since the logits of the null class are never
included in the aggregation. Hence, tracklet filtering is an essential
precursor to the post-processing step.
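The difference between majority voting and probability averaging can be seen on a toy example. The sketch below is illustrative only; the function names and the three-class window probabilities in the usage note are made up for the example and assume that visibility filtering and post-processing have already been applied.

```python
from collections import Counter
from typing import List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def majority_vote(window_probs: Sequence[Sequence[float]]) -> int:
    """Argmax each window separately, then take the most common prediction."""
    votes = Counter(argmax(p) for p in window_probs)
    return votes.most_common(1)[0][0]


def average_probs(window_probs: Sequence[Sequence[float]]) -> int:
    """Average the window probability vectors, then take a single argmax."""
    n_classes = len(window_probs[0])
    mean: List[float] = [
        sum(p[c] for p in window_probs) / len(window_probs)
        for c in range(n_classes)
    ]
    return argmax(mean)
```

For the windows `[[0.0, 0.51, 0.49], [0.0, 0.51, 0.49], [0.0, 0.05, 0.95]]`, majority voting returns class 1 because two windows favour it by a narrow margin, while averaging returns class 2 because the third window is far more confident. Averaging keeps this confidence information, which majority voting discards.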
D. Overall system
We now evaluate the holistic pipeline consisting of player
tracking, team identification, and player identification. This
evaluation differs from the evaluation in Section IV-C
since the player tracklets are now obtained from the player
tracking algorithm (rather than being manually annotated).
The accuracy metric is the percentage of tracklets correctly
classified by the algorithm.
Table VII shows the accuracy of the holistic pipeline. Taking
advantage of the player roster improves the overall accuracy for
the test videos by 4.9%. For video number 11, the improvement in
accuracy is 24.44%. This is because the vectors $v_a$ and $v_h$ help
the model focus only on the players present in the home and
away rosters. There are three main sources of error:
1) Tracking identity switches, where the same ID is as-
signed to two different player tracks. These are illus-
trated in Fig. 17;
2) Misclassification of the player’s team, as shown in Fig.
13, which causes the player jersey number probabilities
to get multiplied by the incorrect roster vector; and
3) Incorrect jersey number prediction by the network.
V. CONCLUSION
In this paper, we have introduced and implemented an
automated offline system for the challenging problem of player
tracking and identification in ice hockey. The system takes
as input broadcast hockey video clips from the main camera
view and outputs player trajectories on screen along with their
teams and identities. However, there is room for improvement.
Tracking players when they leave the camera view and identifying
players when their jersey number is not visible remain major
challenges. In future work, identity switches resulting from
camera panning can be reduced by tracking players directly
in ice-rink coordinates using an automatic homography
registration model [24]. Additionally, player locations on the
ice rink can be used as a feature for identifying players.
ACKNOWLEDGMENT
This work was supported by Stathletes through the Mitacs
Accelerate Program and Natural Sciences and Engineering
Research Council of Canada (NSERC). We also acknowledge
Compute Canada for hardware support.
REFERENCES
[1] Omar Ajmeri and Ali Shah. Using computer vision and machine
learning to automatically classify nfl game film and develop a
player tracking system. In 2018 MIT Sloan Sports Analytics
Conference, 2018.
[2] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An empirical
evaluation of generic convolutional and recurrent networks for
sequence modeling. arXiv:1803.01271, 2018.
[3] Horesh Ben Shitrit, Jérôme Berclaz, François Fleuret, and
Pascal Fua. Tracking multiple people under global appearance
constraints. In 2011 International Conference on Computer
Vision, pages 137–144, 2011.
[4] Philipp Bergmann, Tim Meinhardt, and Laura Leal-Taixé.
Tracking without bells and whistles. In The IEEE International
Conference on Computer Vision (ICCV), October 2019.
[5] Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple
object tracking performance: The clear mot metrics. EURASIP
Journal on Image and Video Processing, 2008, 01 2008.
[6] Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben
Upcroft. Simple online and realtime tracking. In 2016 IEEE
International Conference on Image Processing (ICIP), pages
3464–3468, 2016.
[7] Alina Bialkowski, Patrick Lucey, Peter Carr, Sridha Sridharan,
and Iain Matthews. Representing team behaviours from noisy
data using player role. Computer Vision in Sports, pages 247–
269, 2014.
[8] Guillem Braso and Laura Leal-Taixe. Learning a neural
solver for multiple object tracking. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition (CVPR), June 2020.
[9] Yizheng Cai, Nando de Freitas, and James J. Little. Robust
visual tracking for multiple targets. In Aleš Leonardis, Horst
Bischof, and Axel Pinz, editors, Computer Vision – ECCV
2006, pages 107–118, Berlin, Heidelberg, 2006. Springer Berlin
Heidelberg.
[10] Alvin Chan, Martin D. Levine, and Mehrsan Javan. Player
identification in hockey broadcast videos. Expert Systems with
Applications, 165:113891, 2021.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and
Li Fei-Fei. Imagenet: A large-scale hierarchical image database.
In 2009 IEEE conference on computer vision and pattern
recognition, pages 248–255. Ieee, 2009.
[12] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Mar-
cus Rohrbach, Subhashini Venugopalan, Kate Saenko, and
Trevor Darrell. Long-term recurrent convolutional networks for
visual recognition and description. In CVPR, 2015.
[13] Tiziana D’Orazio, Marco Leo, Paolo Spagnolo, Pier Luigi
Mazzeo, Nicola Mosca, Massimiliano Nitti, and Arcangelo
Distante. An investigation into the feasibility of real-time
soccer offside detection from a multiple camera system. IEEE
Transactions on Circuits and Systems for Video Technology,
19(12):1804–1818, 2009.
[14] S. Gerke, K. Müller, and R. Schäfer. Soccer jersey number
recognition using convolutional neural networks. In 2015
IEEE International Conference on Computer Vision Workshop
(ICCVW), pages 734–741, 2015.
[15] Sebastian Gerke, Antje Linnemann, and Karsten Müller. Soccer
player recognition using spatial constellation features and jersey
number recognition. Computer Vision and Image Understand-
ing, 159:105 – 115, 2017. Computer Vision in Sports.
[16] Tianxiao Guo, Kuan Tao, Qingrui Hu, and Yanfei Shen. Detec-
tion of ice hockey players and teams via a two-phase cascaded
cnn model. IEEE Access, 8:195062–195073, 2020.
[17] Tianxiao Guo, Kuan Tao, Qingrui Hu, and Yanfei Shen. Detec-
tion of ice hockey players and teams via a two-phase cascaded
cnn model. IEEE Access, 8:195062–195073, 2020.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. In 2016
IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 770–778, 2016.
[19] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term
memory. Neural Comput., 9(8):1735–1780, November 1997.
[20] Samuel Hurault, Coloma Ballester, and Gloria Haro. Self-
supervised small soccer player detection and tracking. In
Proceedings of the 3rd International Workshop on Multimedia
Content Analysis in Sports, MMSports ’20, page 9–18, New
York, NY, USA, 2020. Association for Computing Machinery.
[21] IIHF. Survey of Players, 2018. Available online: https://www.
iihf.com/en/static/5324/survey-of-players.
[22] Maxime Istasse, Julien Moreau, and Christophe
De Vleeschouwer. Associative embedding for team
discrimination. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR) Workshops,
June 2019.
[23] Zdravko Ivankovic, Milos Rackovic, and Miodrag Ivkovic.
Automatic player position detection in basketball games. Mul-
timedia Tools and Applications, 72:2741–2767, 10 2014.
[24] Wei Jiang, Juan Camilo Gamboa Higuera, Baptiste Angles,
Weiwei Sun, Mehrsan Javan, and Kwang Moo Yi. Optimizing
through learned errors for accurate sports field registration. In
2020 IEEE Winter Conference on Applications of Computer
Vision (WACV). IEEE, 2020.
[25] Maria Koshkina, Hemanth Pidaparthy, and James H. Elder.
Contrastive learning for sports video: Unsupervised player
classification. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR) Workshops,
pages 4528–4536, June 2021.
[26] G. Li, S. Xu, X. Liu, L. Li, and C. Wang. Jersey number
recognition with semi-supervised spatial transformer network.
In 2018 IEEE/CVF Conference on Computer Vision and Pattern
Recognition Workshops (CVPRW), pages 1864–18647, 2018.
[27] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath
Hariharan, and Serge Belongie. Feature pyramid networks for
object detection. In 2017 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 936–944, 2017.
[28] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays,
Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence
Zitnick. Microsoft coco: Common objects in context. In
David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars,
editors, Computer Vision – ECCV 2014, pages 740–755, Cham,
2014. Springer International Publishing.
[29] H. Liu and B. Bhanu. Pose-guided r-cnn for jersey number
recognition in sports. In 2019 IEEE/CVF Conference on
Computer Vision and Pattern Recognition Workshops (CVPRW),
pages 2457–2466, 2019.
[30] Jingchen Liu and Peter Carr. Detecting and tracking sports
players with random forests and context-conditioned motion
models. Computer Vision in Sports, pages 113–132, 2014.
[31] W. Lu, J. Ting, J. J. Little, and K. P. Murphy. Learning to
track and identify players from broadcast sports videos. IEEE
Transactions on Pattern Analysis and Machine Intelligence,
35(7):1704–1716, July 2013.
[32] Wei-Lwun Lu, Jo-Anne Ting, James J. Little, and Kevin P.
Murphy. Learning to track and identify players from broadcast
sports videos. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 35(7):1704–1716, 2013.
[33] Fernando Martello, Apr 2020.
[34] P. L. Mazzeo, P. Spagnolo, M. Leo, and T. D’Orazio. Visual
players detection and tracking in soccer matches. In 2008 IEEE
Fifth International Conference on Advanced Video and Signal
Based Surveillance, 2008.
[35] Kenji Okuma, Ali Taleghani, Nando de Freitas, James J. Little,
and David G. Lowe. A boosted particle filter: Multitarget
detection and tracking. In Tomás Pajdla and Jiří Matas,
editors, Computer Vision - ECCV 2004, pages 28–39, Berlin,
Heidelberg, 2004. Springer Berlin Heidelberg.
[36] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali
Farhadi. You only look once: Unified, real-time object detection.
In 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 779–788, 2016.
[37] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster
r-cnn: Towards real-time object detection with region proposal
networks. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and
R. Garnett, editors, Advances in Neural Information Processing
Systems, volume 28. Curran Associates, Inc., 2015.
[38] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara,
and Carlo Tomasi. Performance measures and a data set
for multi-target, multi-camera tracking. In Gang Hua and Hervé
Jégou, editors, Computer Vision – ECCV 2016 Workshops,
pages 17–35, Cham, 2016. Springer International Publishing.
[39] Ryan Sanford, Siavash Gorji, Luiz G. Hafemann, Bahareh
Pourbabaee, and Mehrsan Javan. Group activity detection
from trajectory and video data in soccer. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR) Workshops, June 2020.
[40] Adrià Arbués Sangüesa, C. Ballester, and G. Haro. Single-camera
basketball tracker through pose and semantic feature
fusion. ArXiv, abs/1906.02042, 2019.
[41] Arda Senocak, Tae-Hyun Oh, Junsik Kim, and In So Kweon.
Part-based player identification using deep convolutional rep-
resentation and multi-scale pooling. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) Workshops, June 2018.
[42] S. Tang, M. Andriluka, B. Andres, and B. Schiele. Multiple
people tracking by lifted multicut and person re-identification.
In 2017 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 3701–3710, July 2017.
[43] Rajkumar Theagarajan and Bir Bhanu. An automated system for
generating tactical performance statistics for individual soccer
players from videos. IEEE Transactions on Circuits and Systems
for Video Technology, 31(2):632–646, 2021.
[44] Xiaofeng Tong, Jia Liu, Tao Wang, and Yimin Zhang. Au-
tomatic player labeling, tracking and field registration and
trajectory mapping in broadcast soccer video. ACM Trans.
Intell. Syst. Technol., 2(2), February 2011.
[45] Kanav Vats, Mehrnaz Fani, D.A. Clausi, and John S. Zelek.
Multi-task learning for jersey number recognition in ice hockey.
ArXiv, abs/2108.07848, 2021.
[46] Kanav Vats, Mehrnaz Fani, David A. Clausi, and John Zelek.
Puck localization and multi-task event recognition in broadcast
hockey videos. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR) Workshops,
pages 4567–4575, June 2021.
[47] Kanav Vats, Mehrnaz Fani, Pascale Walters, David A. Clausi,
and John Zelek. Event detection in coarsely annotated sports
videos via parallel multi-receptive field 1d convolutions. In
Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR) Workshops, June 2020.
[48] Vermaak, Doucet, and Perez. Maintaining multimodality
through mixture tracking. In Proceedings Ninth IEEE Interna-
tional Conference on Computer Vision, pages 1110–1116 vol.2,
Oct 2003.
[49] P. Viola and M. Jones. Rapid object detection using a boosted
cascade of simple features. In Proceedings of the 2001 IEEE
Computer Society Conference on Computer Vision and Pattern
Recognition. CVPR 2001, volume 1, pages I–I, Dec 2001.
[50] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online
and realtime tracking with a deep association metric. In 2017
IEEE International Conference on Image Processing (ICIP),
pages 3645–3649. IEEE, 2017.
[51] Ruiheng Zhang, Lingxiang Wu, Yukun Yang, Wanneng Wu,
Yueqiang Chen, and Min Xu. Multi-camera multi-player track-
ing with deep player identification in sports video. Pattern
Recognition, 102:107260, 2020.
[52] Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng,
and Wenyu Liu. FairMOT: On the fairness of detection and
re-identification in multiple object tracking. arXiv preprint
arXiv:2004.01888, 2020.
[53] Matko Šarić, Hrvoje Dujmić, Vladan Papić, and Nikola Rožić.
Player number localization and recognition in soccer video
using HSV color space and internal contours. International
Journal of Electrical and Computer Engineering, 2(7):1408–1412,
2008.