Movement Tracks for the Automatic Detection of Fish Behavior in Videos
Declan McIntosh 1Tunai Porto Marques 1Alexandra Branzan Albu 1Rodney Rountree 2Fabio De Leo 3
Global warming is predicted to profoundly impact
ocean ecosystems. Fish behavior is an important
indicator of changes in such marine environments.
Thus, the automatic identiﬁcation of key ﬁsh be-
havior in videos represents a much needed tool
for marine researchers, enabling them to study
climate change-related phenomena. We offer a
dataset of sableﬁsh (Anoplopoma ﬁmbria) star-
tle behaviors in underwater videos, and investi-
gate the use of deep learning (DL) methods for
behavior detection on it. Our proposed detec-
tion system identiﬁes ﬁsh instances using DL-
based frameworks, determines trajectory tracks,
derives novel behavior-speciﬁc features, and em-
ploys Long Short-Term Memory (LSTM) net-
works to identify startle behavior in sableﬁsh. Its
performance is studied by comparing it with a
state-of-the-art DL-based video event detector.
Among the negative impacts of climate change in marine
ecosystems predicted for global warming levels of
C (e.g. signiﬁcant global mean sea level rise [
free Artic Oceans [
], interruption of ocean-based services)
are the acidiﬁcation and temperature rise of waters.
The behavioral disturbance in ﬁsh species resulting from
climate change can be studied with the use of underwater
optical systems, which have become increasingly prevalent
over the last six decades [
]. However, advancements
in automated video processing methodologies have not kept
pace with advances in the video technology itself. The
manual interpretation of visual data requires prohibitive
amounts of time, highlighting the necessity of semi- and
fully-automated methods for the enhancement [
annotation of marine imagery.
University of Victoria, BC, Canada
University of Victoria, BC, Canada
Ocean Networks Canada,
BC, Canada. Correspondence to: Alexandra Branzan Albu
Tackling Climate Change with Machine Learning Workshop at
Conference on Neural Information Processing Systems
As a result, the ﬁeld of automatic interpretation of underwa-
ter imagery for biological purposes has experienced a surge
of activity in the last decade [
]. While numerous works
propose the automatic detection and counting of specimen
], ecological applications require more complex
insights. Video data provides critical information on ﬁsh be-
havior and interactions such as predation events, aggressive
interactions between individuals, activities related to repro-
duction and startle responses. The ability to detect such
behavior represents an important shift in the semantic rich-
ness of data and scientiﬁc value of computer vision-based
analysis of underwater videos: from the focused detection
and counting of individual specimens, to the context-aware
identiﬁcation of ﬁsh behavior.
Given the heterogeneous visual appearance of diverse be-
haviors observed in ﬁsh, we initially focus our study on a
particular target: startle motion patterns observed in sable-
ﬁsh (Anoplopoma ﬁmbria). Such behavior is characterized
by sudden changes in the speed and trajectory of sableﬁsh
We propose a novel end-to-end behavior detection frame-
work that considers 4-second clips to 1) detect the presence
of sableﬁsh using DL-based object detectors [
]; 2) uses
the Hungarian algorithm [
] to determine trajectory tracks
between subsequent frames; 3) measures four handcrafted
and behavior-speciﬁc features and 4) employs such features
in conjunction with LSTM networks [
] to determine the
presence of startle behavior and describe it (i.e. travelling
direction, speed, and trajectory). The remainder of this arti-
cle is structured as follows. In Section 2we discuss works
of relevance to the proposed system. Section 3describes
the proposed approach. In Section 4we present a dataset
of sableﬁsh startle behaviors and use it to compare the per-
formance of our system with that of a state-of-the-art event
]. Section 5draws conclusions and outlines
2. Related Works
Related works to our approach include DL-based methods
for object detection in images and events in videos.
Deep learning-based object detection for static images
Krizhevsky et al. [
] demonstrated the potential of us-
ing Convolutional Neural Networks (CNNs) to extract and
Object Detection and Tracking with Domain Specific Metrics for LSTM classification
Startle No Startle
Computational pipeline of the ﬁsh behavior detection system proposed. The framework is able to provide clip-wise and
movement track-wise classiﬁcations alike (see 4.2).
classify visual features from large datasets. Their work
motivated the use of CNNs in object detection, where frame-
works perform both localization and classiﬁcation of regions
of interest (RoI). Girshick et al. [
] introduced R-CNN, a
system that uses a traditional computer vision-based tech-
nique (selective search [
]) to derive RoIs where individual
classiﬁcation tasks take place. Processing times are further
reduced in Fast R-CNN [
] and Faster R-CNN [
group of frameworks [
] referred to as “one-stage”
detectors proposed the use of a single trainable network for
both creating RoIs and performing classiﬁcation. This re-
duces processing times, but often leads to a drop in accuracy
when compared to two-stage detectors. Recent advance-
ments in one-stage object detectors (e.g. a loss measure
that accounts for extreme class imbalance in training sets
]) have resulted in frameworks such as YOLOv3 [
and RetinaNet [
], which offer fast inference times and
performances comparable with that of two-stage detectors.
DL-based event detection in videos
. Although object de-
tectors such as YOLOv3 [
] can be used in each frame of
a video, they often ignore important temporal relationships.
Rather than individual images employed by aforementioned
methods, recent works [
video’s inter-frame temporal relationship to detect relevant
events. Saha et al. [
] use Fast R-CNN [
] to identify
motion from RGB and optical ﬂow inputs. The outputs
from these networks are fused resulting in action tubes that
encompass the temporal length of each action. Kang et
] offered a video querying system that trains special-
ized models out of larger and more general CNNs to be
able to efﬁciently recognize only speciﬁc visual targets un-
der constrained view perspectives with claimed processing
speed-ups of up to
ar et al. [
] combined an
object-tracking detector [
], trajectory- and pixel- based
methods to detect abnormal activities. Ionescu et al. [
offered a system that not only recognizes motion in videos,
but also considers context to differentiate between normal
(e.g. a truck driving on a road) and abnormal (e.g. a truck
driving on a pedestrian lane) events.
Yu et al. [
] proposed ReMotENet, a light-weight event
detector that leverages spatial-temporal relationships be-
tween objects in adjacent frames. It uses 3D CNNs (“spatial-
temporal attention modules”) to jointly model these video
characteristics using the same trainable network. A frame
differencing process allows for the network to focus exclu-
sively on relevant, motion-triggered regions of the input
frames. This simple yet effective architecture results in fast
processing speeds and reduced model sizes .
3. Proposed approach
We propose a hybrid method for the automatic detection of
context-aware, ecologically relevant behavior in sableﬁsh.
We ﬁrst describe our method for tracking sableﬁsh in video,
then propose the use of 4 track-wise features to constrain
sableﬁsh startle behaviors. Finally, we describe a Long
Short Term Memory (LSTM) architecture that performs
classiﬁcation using the aforementioned features.
Object detection and tracking
. We use the YOLOv3 [
end-to-end object detector as the ﬁrst step of this hybrid
method. The detector was completely re-trained to perform
a simpliﬁed detection task were only the class ﬁsh is tar-
geted. We use a novel 600-image dataset (detailed in 4.1) of
sableﬁsh instances composed of data from Ocean Networks
Canada (ONC) to train the object detector.
The detection of each frame offer a set of bounding boxes
and spatial centers. In order to track organisms we asso-
ciate these detection between frames. Our association loss
value consists simply of the distance between detection cen-
ters in two subsequent frames. We employ the Hungarian
] to generate a loss minimizing associations
between detection in two consecutive frames. We then re-
move any associations where the distance between various
detection is greater than 15% of the frame resolution. Tracks
are terminated if no new detection is associated with them
for 5 frames (i.e. 0.5 seconds—see 4.1).
Behavior Speciﬁc Features
. We propose the use of a series
Sample movement tracks and object detection conﬁdence
scores. The bounding boxes highlight the ﬁsh detection in the
current frame. Each color represents an individual track.
of four domain-speciﬁc features that describe the startle be-
havior of sableﬁsh. Each feature conveys independent and
complementary information, and the limited number of fea-
) prevents over-constraining the behavior detection
The ﬁrst two features quantify the direction of travel and
speed from a track. These track characteristics were selected
because often the direction of travel changes and the ﬁsh
accelerates at the moment of a startle motion.
LMCM kernel designed to extract fast changes in sequen-
A third metric considers the aspect ratio of the detection
bounding box of a ﬁsh instance over time. The reasoning
behind this feature is the empirical observation that sableﬁsh
generally take on a “c” shape when startling, in preparation
for moving away from their current location. The ﬁnal Lo-
cal Momentary Change Metric (LMCM) feature seeks to
ﬁnd fast and unstained motion, or temporal visual impulses,
associated with startle events. This feature is obtained by
convolving the novel 3-dimensional LMCM kernel, depicted
in Figure 3, over three temporally adjacent frames. This
spatially symmetric kernel was designed to produce high
output values where impulse changes occur between frames.
Given its zero-sum and isotropic properties, the kernel out-
puts zero when none or only constant temporal changes are
occurring. We observe that the LMCM kernel efﬁciently
detects leading and lagging edges of motion. In order to
associate this feature with a track we average the LMCM
output magnitude inside a region encompassed by a given
ﬁsh detection bounding box.
3.1. LSTM classiﬁer
LSTM-based network for the classiﬁcation of movement
tracks. The network receives four features conveying speed, direc-
tion and rate of change from each track as input.
In order to classify an individual track, we ﬁrst combine
its four features as a tensor data structure of dimensions
; tensors associated with tracks of less than
are end-padded with zeros. A set of normalization coefﬁ-
cients calculated empirically using the training set (see 4.1)
is then used to normalize each value in the input time-series
to the range
. A custom-trained long short-term mem-
ory (LSTM) classiﬁer receives the normalized tensors as
input, and outputs a track classiﬁcation of non-startle back-
ground or startle, as well as a conﬁdence score. This is done
by considering underlying temporal relationships between
their values. We chose to use LSTM networks because the
temporal progression of values from the features extracted
(along 40 frames or 4 seconds) conveys important informa-
tion for the classiﬁcation of individual clips/tracks. Figure 4
details the architecture of the LSTM network employed. All
convolutional layers employ three-layered 1D kernels.
4. Results and Discussion
We compare our method with a state-of-the-art event de-
tection algorithm [
]. Section Section 4.1 describes our
dataset. Our comparison considers standard performance
metrics in Section 4.2 and discusses the potential advantages
of semantically richer data outputs in ecological research.
4.1. Sableﬁsh Startle Dataset
The data used in this work was acquired with a stationary
video camera permanently deployed at
m of depth at the
Barkley Node of Ocean Networks Canada’s
marine cabled observatory infrastructure. All videos sam-
ples were collected between September 17
2019 because this temporal window contains high sable-
ﬁsh activity. The original monitoring videos are ﬁrst divided
into units of 4-second clips (a sableﬁsh startle is expected to
last roughly one second) and down-sampled to 10 frames per
second for processing reasons. An initial ﬁltering is carried
out using Gaussian Mixture Models [
], resulting in a set
composed only by clips which contain motion. For training
purposes, these motion clips are then manually classiﬁed as
possessing startle or not. The Sableﬁsh Startle dataset con-
positive (i.e. presenting startle events) clips, as
well as 446 randomly selected negative samples (i.e. without
startle events). Table 1details the dataset usage for training,
validation and testing.
Data Split Clips Startle Clips Tracks Startle Tracks
Train 642 321 1533 323
Validation 150 75 421 80
Test 100 50 286 50
Division of the 4-second clips of the Sableﬁsh Startle
Dataset for training, validation and testing purposes.
A second dataset composed of 600 images of sabbleﬁsh
was created to train the YOLOV3 [
] object detector (i.e.
ﬁrst step of the proposed approach). In order to assess
the track-creation performance of the proposed system, we
use this custom-trained object detector to derive movement
tracks from each of the 892 clips composing the Sableﬁsh
Startle Dataset. Tracks with length shorter than 2 seconds
are discarded. The remaining tracks are manually annotated
as startle or non-startle (see Table 1).
This dual annotation approach (i.e. clip- and track-wise)
employed with the Sableﬁsh Startle Dataset allows for a
two-fold performance analysis: 1) clip-wise classiﬁcation,
where an entire clip is categorized as possessing startle
or not, and 2) track-wise classiﬁcation, which reﬂects the
accuracy in the classiﬁcation of each candidate track as
startle or non-startle.
4.2. Experimental Results
We calculate the Average Precision (AP), Binary Cross En-
tropy (BCE) loss and Recall for both track- and clip-wise
outputs of the proposed system using only the Test portion
of the Sableﬁsh Startle Dataset. A threshold of
range) is set to classify a candidate movement track
as positive or negative with respect to all of its constituent
points. In order to measure the performance of the clip-wise
classiﬁcation and compare it with the baseline method (Re-
]), we consider that the “detection score” of a
clip is that of its highest-conﬁdence movement track (if any).
Thus, any clip where at least one positive startle movement
track is identiﬁed will be classiﬁed as positive.
The conversion from track-wise classiﬁcation to clip-wise
classiﬁcation is expected to lower the overall accuracy of
our proposed approach. A “true” startle event might create
only short, invalid tracks, or sometimes no tracks at all. This
situation would lower the clip-wise classiﬁcation perfor-
mance, but would not interfere with the track-wise one. The
track-wise metrics are applicable only to our approach and
they mainly reﬂect the difference between the manual and
automatic classiﬁcation of the tracks created in the dataset
by the proposed system, thus evaluating the ability of the
LSTM network to classify tracks. Table 2shows that the
LSTM portion of our method performs well for classifying
startle tracks (AP of
). Clip-wise, our method outper-
formed a state-of-the-art DL-based event detector [
an AP of 0.67.
Method Track AP Track BCE Clip AP Clip Recall
Ours 0.85 0.412 0.67 0.58
ReMotENet  N/A1N/A10.61 0.5
1: ReMotENetdoes not perform track-wise classiﬁcation.
Performance comparison between the proposed method
and a state-of-the-art event detector.
Our proposed system generates more semantically rich data
by detecting and describing behavior rather than just mark-
ing a clip as containing the behavior; this may be helpful for
ecological and biological research. Our approach provides
instance-level information such as track-speciﬁc average
speed, direction and rate of change. These extra data may al-
low for further analyses considering inter- and intra-species
behaviors. Also, there are clips that contain more than one
startle, and our approach is able to identify all startle in-
stances in such clips. Simply classifying a clip as startle or
non-startle would, of course, not allow for the detection of
multiple startle instances within the same clip.
We propose an automatic detector of ﬁsh behavior in videos
that performs semantically richer tasks than typical com-
puter vision-based analyses: instead of specimens counting,
it identiﬁes and describes a complex biological event (star-
tle events of sableﬁsh). Our intent is to enable long-term
studies on changes in ﬁsh behavior that could be caused by
climate change (e.g. temperature rise and acidiﬁcaton).
A dataset composed of 892 4-second positive (startle) and
negative (non-startle) clips, and associated tracks were man-
ually annotated. Experiments using this data show that the
proposed detector identiﬁes and classiﬁes well individual
tracks of motion as startle or not (AP of
the performance of our clip-wise classiﬁcation is compared
to that of a state-of-the-art event detector, ReMotENet [
Our system outperforms ReMotENet with an AP of
Future work will address more ﬁsh behaviors (e.g. predation,
spawining) and will adapt DL-based event detectors such as
ReMotENet  and NoScope  to that end.
Thomas F Stocker, Dahe Qin, G-K Plattner,
Melinda MB Tignor, Simon K Allen, Judith Boschung,
Alexander Nauels, Yu Xia, Vincent Bex, and
Pauline M Midgley. Climate change 2013: The physi-
cal science basis. contribution of working group i to
the ﬁfth assessment report of ipcc the intergovernmen-
tal panel on climate change, 2014.
Nathaniel L Bindoff, Peter A Stott, Krishna Mirle
AchutaRao, Myles R Allen, Nathan Gillett, David
Gutzler, Kabumbwe Hansingo, G Hegerl, Yongyun
Hu, Suman Jain, et al. Detection and attribution of
climate change: from global to regional. 2013.
Delphine Mallet and Dominique Pelletier. Underwater
video techniques for observing coastal marine biodi-
versity: a review of sixty years of publications (1952–
2012). Fisheries Research, 154:44–62, 2014.
Tunai Porto Marques, Alexandra Branzan Albu, and
Maia Hoeberechts. A contrast-guided approach for
the enhancement of low-lighting underwater images.
Journal of Imaging, 5(10):79, 2019.
Jacopo Aguzzi, Carolina Doya, Samuele Tecchio,
Fabio De Leo, Ernesto Azzurro, Cynthia Costa, Vale-
rio Sbragaglia, Joaquin del Rio, Joan Navarro, Henry
Ruhl, Paolo Favali, Autun Purser, Laurenz Thomsen,
and Ignacio Catal
an. Coastal observatories for mon-
itoring of ﬁsh behaviour and their responses to en-
vironmental changes. Reviews in Fish Biology and
Fisheries, 25:463–483, 2015.
Tunai Porto Marques and Alexandra Branzan Albu.
L2uwe: A framework for the efﬁcient enhancement of
low-light underwater images using local contrast and
multi-scale fusion. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion Workshops, pages 538–539, 2020.
Cosmin Ancuti, Codruta Orniana Ancuti, Tom Haber,
and Philippe Bekaert. Enhancing underwater images
and videos by fusion. In 2012 IEEE Conference on
Computer Vision and Pattern Recognition, pages 81–
88. IEEE, 2012.
J. Aguzzi1, D. Chatzievangelou, J.B. Company,
L. Thomsen, S. Marini, F. Bonoﬁglio, F. Juanes,
R. Rountree, A. Berry, R. Chumbinho, C. Lordan,
J. Doyle, J. del Rio, J. Navarro, F.C. De Leo, N. Ba-
hamon, J.A. Garc´
ıa, R. Danovaro, M. Francescangeli,
V. Lopez-Vazquez1, and Ps Gaughan. The potential
of video imagery from worldwide cabled observatory
networks to provide information supporting ﬁsh-stock
and biodiversity assessment. ICES Journal of Marine
Science, In press.
YH Toh, TM Ng, and BK Liew. Automated ﬁsh count-
ing using image processing. In 2009 International
Conference on Computational Intelligence and Soft-
ware Engineering, pages 1–5. IEEE, 2009.
Concetto Spampinato, Yun-Heh Chen-Burger, Gay-
athri Nadarajan, and Robert B Fisher. Detecting, track-
ing and counting ﬁsh in low quality unconstrained
underwater videos. VISAPP (2), 2008(514-519):1,
Song Zhang, Xinting Yang, Yizhong Wang, Zhenxi
Zhao, Jintao Liu, Yang Liu, Chuanheng Sun, and Chao
Zhou. Automatic ﬁsh population counting by machine
vision and a hybrid deep neural network model. Ani-
mals, 10(2):364, 2020.
Joseph Redmon and Ali Farhadi. Yolov3:
An incremental improvement. arXiv preprint
Harold W Kuhn. The hungarian method for the as-
signment problem. Naval research logistics quarterly,
Sepp Hochreiter and J
urgen Schmidhuber. Long short-
term memory. Neural computation, 9(8):1735–1780,
Ruichi Yu, Hongcheng Wang, and Larry S Davis. Re-
motenet: Efﬁcient relevant motion event detection for
large-scale home surveillance videos. In 2018 IEEE
Winter Conference on Applications of Computer Vision
(WACV), pages 1642–1651. IEEE, 2018.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hin-
ton. Imagenet classiﬁcation with deep convolutional
neural networks. In Advances in neural information
processing systems, pages 1097–1105, 2012.
Ross Girshick, Jeff Donahue, Trevor Darrell, and Jiten-
dra Malik. Rich feature hierarchies for accurate object
detection and semantic segmentation. In Proceedings
of the IEEE conference on computer vision and pattern
recognition, pages 580–587, 2014.
Jasper RR Uijlings, Koen EA Van De Sande, Theo
Gevers, and Arnold WM Smeulders. Selective search
for object recognition. International journal of com-
puter vision, 104(2):154–171, 2013.
Ross Girshick. Fast r-cnn. In Proceedings of the IEEE
international conference on computer vision, pages
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian
Sun. Faster r-cnn: Towards real-time object detection
with region proposal networks. In Advances in neural
information processing systems, pages 91–99, 2015.
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Chris-
tian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexan-
der C Berg. Ssd: Single shot multibox detector. In
European conference on computer vision, pages 21–37.
Joseph Redmon, Santosh Divvala, Ross Girshick, and
Ali Farhadi. You only look once: Uniﬁed, real-time
object detection. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 779–788, 2016.
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming
He, and Piotr Doll
ar. Focal loss for dense object de-
tection. In Proceedings of the IEEE international con-
ference on computer vision, pages 2980–2988, 2017.
Daniel Kang, John Emmons, Firas Abuzaid, Peter
Bailis, and Matei Zaharia. Noscope: optimizing neural
network queries over video at scale. arXiv preprint
Suman Saha, Gurkirt Singh, Michael Sapienza,
Philip HS Torr, and Fabio Cuzzolin. Deep learning for
detecting multiple space-time action tubes in videos.
arXiv preprint arXiv:1608.01529, 2016.
Dan Xu, Elisa Ricci, Yan Yan, Jingkuan Song, and
Nicu Sebe. Learning deep representations of appear-
ance and motion for anomalous event detection. arXiv
preprint arXiv:1510.01553, 2015.
Radu Tudor Ionescu, Fahad Shahbaz Khan, Mariana-
Iuliana Georgescu, and Ling Shao. Object-centric
auto-encoders and dummy anomalies for abnormal
event detection in video. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recogni-
tion, pages 7842–7851, 2019.
ar, Giuseppe Donatiello, Vania Bogorny,
Carolina Garate, Luis Otavio Alvares, and Fran
emond. Toward abnormal trajectory and event de-
tection in video surveillance. IEEE Transactions on
Circuits and Systems for Video Technology, 27(3):683–
Huijuan Xu, Boyang Li, Vasili Ramanishka, Leonid
Sigal, and Kate Saenko. Joint event detection and
description in continuous video streams. In 2019 IEEE
Winter Conference on Applications of Computer Vision
(WACV), pages 396–405. IEEE, 2019.
Duc Phu Chau, Fran
emond, Monique Thonnat,
and Etienne Corv
ee. Robust mobile object tracking
based on multiple feature similarity and trajectory ﬁl-
tering. arXiv preprint arXiv:1106.2695, 2011.
Chris Stauffer and W Eric L Grimson. Adaptive back-
ground mixture models for real-time tracking. In Pro-
ceedings. 1999 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (Cat. No
PR00149), volume 2, pages 246–252. IEEE, 1999.