
Movement Tracks for the Automatic Detection of Fish Behavior in Videos



Global warming is predicted to profoundly impact ocean ecosystems. Fish behavior is an important indicator of changes in such marine environments. Thus, the automatic identification of key fish behavior in videos represents a much needed tool for marine researchers, enabling them to study climate change-related phenomena. We offer a dataset of sablefish (Anoplopoma fimbria) startle behaviors in underwater videos, and investigate the use of deep learning (DL) methods for behavior detection on it. Our proposed detection system identifies fish instances using DL-based frameworks, determines trajectory tracks, derives novel behavior-specific features, and employs Long Short-Term Memory (LSTM) networks to identify startle behavior in sablefish. Its performance is studied by comparing it with a state-of-the-art DL-based video event detector.
Declan McIntosh¹, Tunai Porto Marques¹, Alexandra Branzan Albu¹, Rodney Rountree², Fabio De Leo³
1. Introduction
Among the negative impacts of climate change in marine
ecosystems predicted for global warming levels of
C to
C (e.g. significant global mean sea level rise [ ], sea-ice-free Arctic Oceans [ ], interruption of ocean-based services)
are the acidification and temperature rise of waters.
The behavioral disturbance in fish species resulting from
climate change can be studied with the use of underwater
optical systems, which have become increasingly prevalent
over the last six decades [
]. However, advancements
in automated video processing methodologies have not kept
pace with advances in the video technology itself. The
manual interpretation of visual data requires prohibitive
amounts of time, highlighting the necessity of semi- and
fully-automated methods for the enhancement [
] and
annotation of marine imagery.
University of Victoria, BC, Canada
Biology Department,
University of Victoria, BC, Canada
Ocean Networks Canada,
BC, Canada. Correspondence to: Alexandra Branzan Albu
Tackling Climate Change with Machine Learning Workshop at
Conference on Neural Information Processing Systems
(NeurIPS 2020).
As a result, the field of automatic interpretation of underwa-
ter imagery for biological purposes has experienced a surge
of activity in the last decade [
]. While numerous works
propose the automatic detection and counting of specimens
[ ], ecological applications require more complex
insights. Video data provides critical information on fish be-
havior and interactions such as predation events, aggressive
interactions between individuals, activities related to repro-
duction and startle responses. The ability to detect such
behavior represents an important shift in the semantic rich-
ness of data and scientific value of computer vision-based
analysis of underwater videos: from the focused detection
and counting of individual specimens, to the context-aware
identification of fish behavior.
Given the heterogeneous visual appearance of diverse be-
haviors observed in fish, we initially focus our study on a
particular target: startle motion patterns observed in sable-
fish (Anoplopoma fimbria). Such behavior is characterized
by sudden changes in the speed and trajectory of sablefish
movement tracks.
We propose a novel end-to-end behavior detection framework that considers 4-second clips and 1) detects the presence of sablefish using DL-based object detectors [ ]; 2) uses the Hungarian algorithm [ ] to determine trajectory tracks between subsequent frames; 3) measures four handcrafted, behavior-specific features; and 4) employs these features in conjunction with LSTM networks [ ] to determine the presence of startle behavior and describe it (i.e. travelling direction, speed, and trajectory). The remainder of this article is structured as follows. In Section 2, we discuss works of relevance to the proposed system. Section 3 describes the proposed approach. In Section 4, we present a dataset of sablefish startle behaviors and use it to compare the performance of our system with that of a state-of-the-art event detector [ ]. Section 5 draws conclusions and outlines future work.
2. Related Works
Works related to our approach include DL-based methods
for object detection in images and event detection in videos.
Figure 1. Computational pipeline of the fish behavior detection system proposed. Stages shown: input 4-second clip; object detection and tracking with YOLOv3; domain-specific metrics (e.g. aspect ratio change metric); LSTM classification; clip-wise output (startle / no startle). The framework is able to provide clip-wise and movement track-wise classifications alike (see 4.2).
Deep learning-based object detection for static images. Krizhevsky et al. [ ] demonstrated the potential of using Convolutional Neural Networks (CNNs) to extract and classify visual features from large datasets. Their work
motivated the use of CNNs in object detection, where frame-
works perform both localization and classification of regions
of interest (RoI). Girshick et al. [
] introduced R-CNN, a
system that uses a traditional computer vision-based tech-
nique (selective search [
]) to derive RoIs where individual
classification tasks take place. Processing times are further
reduced in Fast R-CNN [
] and Faster R-CNN [
]. A
group of frameworks [
] referred to as “one-stage”
detectors proposed the use of a single trainable network for
both creating RoIs and performing classification. This re-
duces processing times, but often leads to a drop in accuracy
when compared to two-stage detectors. Recent advancements in one-stage object detectors (e.g. a loss measure that accounts for extreme class imbalance in training sets [ ]) have resulted in frameworks such as YOLOv3 [ ] and RetinaNet [ ], which offer fast inference times and performance comparable with that of two-stage detectors.
DL-based event detection in videos. Although object detectors such as YOLOv3 [ ] can be used in each frame of a video, they often ignore important temporal relationships. Rather than the individual images employed by the aforementioned methods, recent works [ ] used the inter-frame temporal relationships of videos to detect relevant
events. Saha et al. [
] use Fast R-CNN [
] to identify
motion from RGB and optical flow inputs. The outputs
from these networks are fused resulting in action tubes that
encompass the temporal length of each action. Kang et
al. [
] offered a video querying system that trains special-
ized models out of larger and more general CNNs to be
able to efficiently recognize only specific visual targets un-
der constrained view perspectives with claimed processing
speed-ups of up to
. Coşar et al. [ ] combined an object-tracking detector [ ] with trajectory- and pixel-based
methods to detect abnormal activities. Ionescu et al. [ ]
offered a system that not only recognizes motion in videos,
but also considers context to differentiate between normal
(e.g. a truck driving on a road) and abnormal (e.g. a truck
driving on a pedestrian lane) events.
Yu et al. [
] proposed ReMotENet, a light-weight event
detector that leverages spatial-temporal relationships be-
tween objects in adjacent frames. It uses 3D CNNs (“spatial-
temporal attention modules”) to jointly model these video
characteristics using the same trainable network. A frame
differencing process allows for the network to focus exclu-
sively on relevant, motion-triggered regions of the input
frames. This simple yet effective architecture results in fast
processing speeds and reduced model sizes [15].
3. Proposed approach
We propose a hybrid method for the automatic detection of
context-aware, ecologically relevant behavior in sablefish.
We first describe our method for tracking sablefish in video,
then propose the use of four track-wise features to constrain
sablefish startle behaviors. Finally, we describe a Long
Short-Term Memory (LSTM) architecture that performs
classification using the aforementioned features.
Object detection and tracking. We use the YOLOv3 [ ] end-to-end object detector as the first step of this hybrid method. The detector was completely re-trained to perform a simplified detection task where only the class fish is targeted. We use a novel 600-image dataset (detailed in 4.1) of sablefish instances composed of data from Ocean Networks Canada (ONC) to train the object detector.
The detections in each frame provide a set of bounding boxes and spatial centers. In order to track organisms, we associate these detections between frames. Our association loss value consists simply of the distance between detection centers in two subsequent frames. We employ the Hungarian algorithm [ ] to generate loss-minimizing associations between the detections in two consecutive frames. We then remove any association where the distance between the two detections is greater than 15% of the frame resolution. Tracks are terminated if no new detection is associated with them for 5 frames (i.e. 0.5 seconds; see 4.1).
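The association step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: an exhaustive search over assignments stands in for the Hungarian algorithm (both return a minimum-cost matching, but the exhaustive version only scales to a handful of fish), and the function name `associate` and the parameter `frame_size` (the reference length against which the 15% gate is measured) are assumptions made for this example.

```python
from itertools import permutations
from math import hypot


def associate(prev_centers, curr_centers, frame_size, max_frac=0.15):
    """Match detections of two consecutive frames by minimum total centre distance.

    Returns sorted (prev_index, curr_index) pairs. Exhaustive search is a
    stand-in for the Hungarian algorithm; `frame_size` is an assumed
    reference length (for example the frame width in pixels).
    """
    n_prev, n_curr = len(prev_centers), len(curr_centers)
    if n_prev == 0 or n_curr == 0:
        return []
    # cost[i][j] = Euclidean distance between previous centre i and current centre j
    cost = [[hypot(px - cx, py - cy) for (cx, cy) in curr_centers]
            for (px, py) in prev_centers]
    # Assign every item of the smaller set to a distinct item of the larger set.
    if n_prev <= n_curr:
        best = min(permutations(range(n_curr), n_prev),
                   key=lambda p: sum(cost[i][j] for i, j in enumerate(p)))
        pairs = list(enumerate(best))
    else:
        best = min(permutations(range(n_prev), n_curr),
                   key=lambda p: sum(cost[i][j] for j, i in enumerate(p)))
        pairs = [(i, j) for j, i in enumerate(best)]
    # Gate: discard associations that jump farther than max_frac of the frame size.
    limit = max_frac * frame_size
    return sorted((i, j) for i, j in pairs if cost[i][j] <= limit)
```

Detections left unmatched in the current frame would start new tracks, and a track that stays unmatched for 5 consecutive frames would be closed, as stated in the text.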
Figure 2. Sample movement tracks and object detection confidence scores. The bounding boxes highlight the fish detection in the current frame. Each color represents an individual track.
Behavior Specific Features. We propose the use of a series of four domain-specific features that describe the startle behavior of sablefish. Each feature conveys independent and complementary information, and the limited number of features (four) prevents over-constraining the behavior detection.
The first two features quantify the direction of travel and
speed from a track. These track characteristics were selected
because often the direction of travel changes and the fish
accelerates at the moment of a startle motion.
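As an illustration of the first two features, the sketch below derives a per-step speed and direction of travel from the detection centers of one track. The exact definitions used by the authors are not given in the text, so the units (pixels per second, degrees) and the function name `speed_and_direction` are assumptions; the 10 frames-per-second rate is the one reported in Section 4.1.

```python
from math import atan2, degrees, hypot


def speed_and_direction(centers, fps=10.0):
    """Per-step speed (pixels/second) and heading (degrees) along one track.

    `centers` is a list of (x, y) detection centres in consecutive frames.
    """
    speeds, headings = [], []
    for (x0, y0), (x1, y1) in zip(centers, centers[1:]):
        dx, dy = x1 - x0, y1 - y0
        speeds.append(hypot(dx, dy) * fps)       # displacement per frame times frame rate
        headings.append(degrees(atan2(dy, dx)))  # angle of the displacement vector
    return speeds, headings
```

A startle would then show up as a sudden jump in both sequences, for example a step from 50 to 100 pixels per second together with a change of heading.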
Figure 3.
LMCM kernel designed to extract fast changes in sequen-
tial images.
A third metric considers the aspect ratio of the detection
bounding box of a fish instance over time. The reasoning
behind this feature is the empirical observation that sablefish
generally take on a “c” shape when startling, in preparation
for moving away from their current location. The final Local Momentary Change Metric (LMCM) feature seeks to
find fast and unsustained motion, or temporal visual impulses,
associated with startle events. This feature is obtained by
convolving the novel 3-dimensional LMCM kernel, depicted
in Figure 3, over three temporally adjacent frames. This
spatially symmetric kernel was designed to produce high
output values where impulse changes occur between frames.
Given its zero-sum and isotropic properties, the kernel out-
puts zero when none or only constant temporal changes are
occurring. We observe that the LMCM kernel efficiently
detects leading and lagging edges of motion. In order to
associate this feature with a track we average the LMCM
output magnitude inside a region encompassed by a given
fish detection bounding box.
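The weights of the LMCM kernel are given only in Figure 3 and are not reproduced here. The sketch below therefore uses a hypothetical stand-in with the properties the text states: temporal weights (1, -2, 1) over three adjacent frames sum to zero, so the response vanishes when there is no change or a constant change between frames, and it is large for an impulse. The spatial extent of the real 3-dimensional kernel is omitted, and the function names are illustrative.

```python
def lmcm_response(prev_frame, frame, next_frame):
    """Per-pixel second-order temporal difference of three grayscale frames.

    Frames are lists of rows of numbers. The weights (1, -2, 1) are an
    assumed stand-in for the kernel shown in Figure 3.
    """
    return [[p - 2 * c + n for p, c, n in zip(row_p, row_c, row_n)]
            for row_p, row_c, row_n in zip(prev_frame, frame, next_frame)]


def track_lmcm(response, box):
    """Mean response magnitude inside a bounding box (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = box
    values = [abs(v) for row in response[y0:y1] for v in row[x0:x1]]
    return sum(values) / len(values) if values else 0.0
```

Averaging the magnitude inside the detection box is what ties the image-level response to an individual track.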
3.1. LSTM classifier
Figure 4.
LSTM-based network for the classification of movement
tracks. The network receives four features conveying speed, direc-
tion and rate of change from each track as input.
In order to classify an individual track, we first combine
its four features as a tensor data structure of dimensions
; tensors associated with tracks of less than
are end-padded with zeros. A set of normalization coeffi-
cients calculated empirically using the training set (see 4.1)
is then used to normalize each value in the input time-series
to the range
. A custom-trained long short-term mem-
ory (LSTM) classifier receives the normalized tensors as
input, and outputs a track classification of non-startle back-
ground or startle, as well as a confidence score. This is done
by considering underlying temporal relationships between
their values. We chose to use LSTM networks because the
temporal progression of values from the features extracted
(along 40 frames or 4 seconds) conveys important informa-
tion for the classification of individual clips/tracks. Figure 4
details the architecture of the LSTM network employed. All
convolutional layers employ three-layered 1D kernels.
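The input preparation described above can be sketched as follows. The tensor dimensions and the target range are missing from the text as extracted, so this example assumes 40 time steps (4 seconds at 10 frames per second) with four features each, and min-max scaling with per-feature coefficients `mins` and `maxs` taken from the training set; all of these, and the function name `prepare_track`, are assumptions. The order follows the text: zero end-padding first, then normalization of every value.

```python
def prepare_track(features, mins, maxs, length=40):
    """End-pad a (T x F) feature sequence with zeros to `length` steps, then scale it.

    `mins` and `maxs` are per-feature normalization coefficients, assumed to be
    computed on the training set.
    """
    width = len(mins)
    padded = [list(step) for step in features[:length]]
    padded += [[0.0] * width for _ in range(length - len(padded))]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(step, mins, maxs)]
            for step in padded]
```

The resulting fixed-size sequence is what a recurrent classifier such as the LSTM of Figure 4 would consume.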
4. Results and Discussion
We compare our method with a state-of-the-art event detection algorithm [ ]. Section 4.1 describes our
dataset. Section 4.2 compares the methods using standard performance
metrics and discusses the potential advantages
of semantically richer data outputs in ecological research.
4.1. Sablefish Startle Dataset
The data used in this work was acquired with a stationary
video camera permanently deployed at
m of depth at the
Barkley Node of Ocean Networks Canada’s
marine cabled observatory infrastructure. All video samples
were collected between September 17
and October
2019 because this temporal window contains high sable-
fish activity. The original monitoring videos are first divided
into units of 4-second clips (a sablefish startle is expected to
last roughly one second) and down-sampled to 10 frames per
second for processing reasons. An initial filtering is carried
out using Gaussian Mixture Models [ ], resulting in a set composed only of clips which contain motion. For training purposes, these motion clips are then manually classified as possessing startle or not. The Sablefish Startle dataset consists of 446 positive (i.e. presenting startle events) clips, as well as 446 randomly selected negative samples (i.e. without startle events). Table 1 details the dataset usage for training, validation and testing.
Data Split    Clips   Startle Clips   Tracks   Startle Tracks
Train         642     321             1533     323
Validation    150     75              421      80
Test          100     50              286      50
Table 1. Division of the 4-second clips of the Sablefish Startle Dataset for training, validation and testing purposes.
A second dataset composed of 600 images of sablefish
was created to train the YOLOv3 [ ] object detector (i.e.
first step of the proposed approach). In order to assess
the track-creation performance of the proposed system, we
use this custom-trained object detector to derive movement
tracks from each of the 892 clips composing the Sablefish
Startle Dataset. Tracks with length shorter than 2 seconds
are discarded. The remaining tracks are manually annotated
as startle or non-startle (see Table 1).
This dual annotation approach (i.e. clip- and track-wise)
employed with the Sablefish Startle Dataset allows for a
two-fold performance analysis: 1) clip-wise classification,
where an entire clip is categorized as possessing startle
or not, and 2) track-wise classification, which reflects the
accuracy in the classification of each candidate track as
startle or non-startle.
4.2. Experimental Results
We calculate the Average Precision (AP), Binary Cross En-
tropy (BCE) loss and Recall for both track- and clip-wise
outputs of the proposed system using only the Test portion
of the Sablefish Startle Dataset. A threshold of
(in a
range) is set to classify a candidate movement track
as positive or negative with respect to all of its constituent
points. In order to measure the performance of the clip-wise
classification and compare it with the baseline method (ReMotENet [15]),
we consider that the “detection score” of a
clip is that of its highest-confidence movement track (if any).
Thus, any clip where at least one positive startle movement
track is identified will be classified as positive.
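The track-to-clip conversion described above reduces to taking a maximum over track confidences. In the sketch below the threshold value is a placeholder, since the value used by the authors is missing from the text as extracted, and the score of 0.0 for a clip without any track is an assumption.

```python
def clip_score(track_scores):
    """Detection score of a clip: the highest track confidence, or 0.0 if there are no tracks."""
    return max(track_scores, default=0.0)


def classify_clip(track_scores, threshold=0.5):
    """A clip is positive when at least one track reaches the (placeholder) threshold."""
    return clip_score(track_scores) >= threshold
```

For instance, a clip whose tracks score 0.1, 0.8 and 0.3 receives a clip score of 0.8 and is labelled as startle under this placeholder threshold.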
The conversion from track-wise classification to clip-wise
classification is expected to lower the overall accuracy of
our proposed approach. A “true” startle event might create
only short, invalid tracks, or sometimes no tracks at all. This
situation would lower the clip-wise classification perfor-
mance, but would not interfere with the track-wise one. The
track-wise metrics are applicable only to our approach and
they mainly reflect the difference between the manual and
automatic classification of the tracks created in the dataset
by the proposed system, thus evaluating the ability of the
LSTM network to classify tracks. Table 2 shows that the LSTM portion of our method performs well for classifying startle tracks (AP of 0.85). Clip-wise, our method outperformed a state-of-the-art DL-based event detector [ ] with an AP of 0.67.
Method            Track AP   Track BCE   Clip AP   Clip Recall
Ours              0.85       0.412       0.67      0.58
ReMotENet [15]    N/A¹       N/A¹        0.61      0.5
¹ ReMotENet does not perform track-wise classification.
Table 2. Performance comparison between the proposed method and a state-of-the-art event detector.
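For reference, two of the reported metrics can be computed as in the sketch below, which uses the standard definitions of binary cross entropy and recall. The clipping constant `eps` and the default threshold are assumptions of this example; Average Precision is left out because the text does not say which interpolation variant was used.

```python
from math import log


def binary_cross_entropy(labels, scores, eps=1e-7):
    """Mean binary cross entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, scores):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite at 0 and 1
        total -= y * log(p) + (1 - y) * log(1.0 - p)
    return total / len(labels)


def recall(labels, scores, threshold=0.5):
    """Fraction of positive samples whose score reaches the threshold."""
    tp = sum(1 for y, p in zip(labels, scores) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(labels, scores) if y == 1 and p < threshold)
    return tp / (tp + fn) if tp + fn else 0.0
```

The same functions apply to track-wise and clip-wise outputs alike, given the matching labels and scores.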
Our proposed system generates more semantically rich data
by detecting and describing behavior rather than just mark-
ing a clip as containing the behavior; this may be helpful for
ecological and biological research. Our approach provides
instance-level information such as track-specific average
speed, direction and rate of change. These extra data may al-
low for further analyses considering inter- and intra-species
behaviors. Also, there are clips that contain more than one
startle, and our approach is able to identify all startle in-
stances in such clips. Simply classifying a clip as startle or
non-startle would, of course, not allow for the detection of
multiple startle instances within the same clip.
5. Conclusion
We propose an automatic detector of fish behavior in videos that performs semantically richer tasks than typical computer vision-based analyses: instead of counting specimens, it identifies and describes a complex biological event (startle events of sablefish). Our intent is to enable long-term studies on changes in fish behavior that could be caused by climate change (e.g. temperature rise and acidification).
A dataset composed of 892 4-second positive (startle) and negative (non-startle) clips, together with their associated tracks, was manually annotated. Experiments using this data show that the proposed detector performs well at classifying individual tracks of motion as startle or not (AP of 0.85). Furthermore, the performance of our clip-wise classification is compared to that of a state-of-the-art event detector, ReMotENet [15]. Our system outperforms ReMotENet with an AP of 0.67 (against 0.61).
Future work will address more fish behaviors (e.g. predation,
spawning) and will adapt DL-based event detectors such as
ReMotENet [15] and NoScope [24] to that end.
References
[1] Thomas F Stocker, Dahe Qin, G-K Plattner, Melinda MB Tignor, Simon K Allen, Judith Boschung, Alexander Nauels, Yu Xia, Vincent Bex, and Pauline M Midgley. Climate change 2013: The physical science basis. Contribution of Working Group I to the Fifth Assessment Report of IPCC, the Intergovernmental Panel on Climate Change, 2014.
[2] Nathaniel L Bindoff, Peter A Stott, Krishna Mirle AchutaRao, Myles R Allen, Nathan Gillett, David Gutzler, Kabumbwe Hansingo, G Hegerl, Yongyun Hu, Suman Jain, et al. Detection and attribution of climate change: from global to regional. 2013.
[3] Delphine Mallet and Dominique Pelletier. Underwater video techniques for observing coastal marine biodiversity: a review of sixty years of publications (1952–2012). Fisheries Research, 154:44–62, 2014.
[4] Tunai Porto Marques, Alexandra Branzan Albu, and Maia Hoeberechts. A contrast-guided approach for the enhancement of low-lighting underwater images. Journal of Imaging, 5(10):79, 2019.
[5] Jacopo Aguzzi, Carolina Doya, Samuele Tecchio, Fabio De Leo, Ernesto Azzurro, Cynthia Costa, Valerio Sbragaglia, Joaquin del Rio, Joan Navarro, Henry Ruhl, Paolo Favali, Autun Purser, Laurenz Thomsen, and Ignacio Catalán. Coastal observatories for monitoring of fish behaviour and their responses to environmental changes. Reviews in Fish Biology and Fisheries, 25:463–483, 2015.
[6] Tunai Porto Marques and Alexandra Branzan Albu. L2UWE: A framework for the efficient enhancement of low-light underwater images using local contrast and multi-scale fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 538–539, 2020.
[7] Cosmin Ancuti, Codruta Orniana Ancuti, Tom Haber, and Philippe Bekaert. Enhancing underwater images and videos by fusion. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 81–88. IEEE, 2012.
[8] J. Aguzzi, D. Chatzievangelou, J.B. Company, L. Thomsen, S. Marini, F. Bonofiglio, F. Juanes, R. Rountree, A. Berry, R. Chumbinho, C. Lordan, J. Doyle, J. del Rio, J. Navarro, F.C. De Leo, N. Bahamon, J.A. García, R. Danovaro, M. Francescangeli, V. Lopez-Vazquez, and Ps Gaughan. The potential of video imagery from worldwide cabled observatory networks to provide information supporting fish-stock and biodiversity assessment. ICES Journal of Marine Science, In press.
[9] YH Toh, TM Ng, and BK Liew. Automated fish counting using image processing. In 2009 International Conference on Computational Intelligence and Software Engineering, pages 1–5. IEEE, 2009.
[10] Concetto Spampinato, Yun-Heh Chen-Burger, Gayathri Nadarajan, and Robert B Fisher. Detecting, tracking and counting fish in low quality unconstrained underwater videos. VISAPP (2), 2008(514-519):1, 2008.
[11] Song Zhang, Xinting Yang, Yizhong Wang, Zhenxi Zhao, Jintao Liu, Yang Liu, Chuanheng Sun, and Chao Zhou. Automatic fish population counting by machine vision and a hybrid deep neural network model. Animals, 10(2):364, 2020.
[12] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[13] Harold W Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
[14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[15] Ruichi Yu, Hongcheng Wang, and Larry S Davis. ReMotENet: Efficient relevant motion event detection for large-scale home surveillance videos. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1642–1651. IEEE, 2018.
[16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[17] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[18] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
[19] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[20] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[21] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[22] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[23] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
[24] Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. NoScope: optimizing neural network queries over video at scale. arXiv preprint arXiv:1703.02529, 2017.
[25] Suman Saha, Gurkirt Singh, Michael Sapienza, Philip HS Torr, and Fabio Cuzzolin. Deep learning for detecting multiple space-time action tubes in videos. arXiv preprint arXiv:1608.01529, 2016.
[26] Dan Xu, Elisa Ricci, Yan Yan, Jingkuan Song, and Nicu Sebe. Learning deep representations of appearance and motion for anomalous event detection. arXiv preprint arXiv:1510.01553, 2015.
[27] Radu Tudor Ionescu, Fahad Shahbaz Khan, Mariana-Iuliana Georgescu, and Ling Shao. Object-centric auto-encoders and dummy anomalies for abnormal event detection in video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7842–7851, 2019.
[28] Serhan Coşar, Giuseppe Donatiello, Vania Bogorny, Carolina Garate, Luis Otavio Alvares, and François Brémond. Toward abnormal trajectory and event detection in video surveillance. IEEE Transactions on Circuits and Systems for Video Technology, 27(3):683–695, 2016.
[29] Huijuan Xu, Boyang Li, Vasili Ramanishka, Leonid Sigal, and Kate Saenko. Joint event detection and description in continuous video streams. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 396–405. IEEE, 2019.
[30] Duc Phu Chau, François Brémond, Monique Thonnat, and Etienne Corvée. Robust mobile object tracking based on multiple feature similarity and trajectory filtering. arXiv preprint arXiv:1106.2695, 2011.
[31] Chris Stauffer and W Eric L Grimson. Adaptive background mixture models for real-time tracking. In Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149), volume 2, pages 246–252. IEEE, 1999.
... An efficient method for the automatic detection of some of these behaviours holds an enormous biological value, as it allows for a specific interpretation of long-term video recordings from offshore observatories [2], [13], e.g., identifying predation events of a given species, rather than just its presence. As a result, this capability represents a change in the level of abstraction of machine learning-and computer visionbased interpretation of underwater data [14]: from a narrow and context-agnostic counting of specimens to a broader and biologically complex identification of specific behaviours. ...
... In order to have a comparison baseline, the proposed system, TempNet, focuses on the startle motion patterns (similarly to [14]) observed in sablefish (Anoplopoma fimbria), see the example startle event frames of Figure 1. Prevalence of sablefish startle behaviours can be useful to biological studies of population stress levels. ...
... Our detector utilizes temporal attention modules based on previous channel attention modules, residual convolutional blocks, wavelet downsampling, and a custom architecture to improve performance over previous methods while achieving faster than real-time efficiency [15], [16]. TempNet not only outperforms the baseline method proposed in [14], but since it does not rely on species-specific visual features, it can be applied to any other behavioural event for which sufficient annotations have been compiled. To the best of our knowledge, our system 1 ...
... An efficient method for the automatic detection of some of these behaviours holds an enormous biological value, as it allows for a specific interpretation of long-term video recordings from offshore observatories [2], [13], e.g., identifying predation events of a given species, rather than just its presence. As a result, this capability represents a change in the level of abstraction of machine learning-and computer visionbased interpretation of underwater data [14]: from a narrow and context-agnostic counting of specimens to a broader and biologically complex identification of specific behaviours. ...
... In order to have a comparison baseline, the proposed system, TempNet, focuses on the startle motion patterns (similarly to [14]) observed in sablefish (Anoplopoma fimbria), see the example startle event frames of Figure 1. Prevalence of sablefish startle behaviours can be useful to biological studies of population stress levels. ...
... Our detector utilizes temporal attention modules based on previous channel attention modules, residual convolutional blocks, wavelet downsampling, and a custom architecture to improve performance over previous methods while achieving faster than real-time efficiency [15], [16]. TempNet not only outperforms the baseline method proposed in [14], but since it does not rely on species-specific visual features, it can be applied to any other behavioural event for which sufficient annotations have been compiled. To the best of our knowledge, our system Fig. 1. ...
Full-text available
Recent advancements in cabled ocean observatories have increased the quality and prevalence of underwater videos; this data enables the extraction of high-level biologically relevant information such as species' behaviours. Despite this increase in capability, most modern methods for the automatic interpretation of underwater videos focus only on the detection and counting organisms. We propose an efficient computer vision- and deep learning-based method for the detection of biological behaviours in videos. TempNet uses an encoder bridge and residual blocks to maintain model performance with a two-staged, spatial, then temporal, encoder. TempNet also presents temporal attention during spatial encoding as well as Wavelet Down-Sampling pre-processing to improve model accuracy. Although our system is designed for applications to diverse fish behaviours (i.e, is generic), we demonstrate its application to the detection of sablefish (Anoplopoma fimbria) startle events. We compare the proposed approach with a state-of-the-art end-to-end video detection method (ReMotENet) and a hybrid method previously offered exclusively for the detection of sablefish's startle events in videos from an existing dataset. Results show that our novel method comfortably outperforms the comparison baselines in multiple metrics, reaching a per-clip accuracy and precision of 80% and 0.81, respectively. This represents a relative improvement of 31% in accuracy and 27% in precision over the compared methods using this dataset. Our computational pipeline is also highly efficient, as it can process each 4-second video clip in only 38ms. Furthermore, since it does not employ features specific to sablefish startle events, our system can be easily extended to other behaviours in future works.
... The authors intend to encourage the behaviour analysis of fishes as these fishes are used to study drug addiction, neurological disorders etc. In [31], the authors developed the Sablefish startle behaviour dataset for automatic behaviour detection. Their purpose is to aid the researchers to study climate change by analysing the behaviour of Sablefish. ...
... CNN based methods can perform well in these situations. Recently, there has been an upsurge in the usage of these models to track animals [30], [31], [60], [61]. Generally, motion or appearance models are popularly adopted to track the target objects. ...
Full-text available
Aquaculture provides food security to many developing countries and enhances the socio-economic conditions of the fishermen. To enhance the productivity of the aquaculture, it is necessary to maintain stress free controlled eco-system for the fishes. For recognising the stress in fishes, behaviour analysis of fishes via tracking is imperative. Early detection of stress in fish facilitates fishermen to take precautionary measures promptly. Computer vision-based fish behaviour analysis of economically important fish species is challenging due to the lack of datasets, occlusions, rapid changes in swim directions etc. The present study proposes a multiple fish video dataset of an economically important species in a controlled environment, namely Sillago Sihama-Vid with accurate annotations. The study emulates the natural environment of Sillago Sihama in a large aquarium. This work proposes a novel fish tracking algorithm that incorporates swim direction information in addition to temporal, appearance, and spatial information. The inclusion of swim direction information reduces the number of identity switches. Comparative performance analysis of the proposed tracking algorithm with the conventional methods on the developed dataset highlights the performance efficiency. The proposed method has a clear performance improvement in MOTA, MOTP, IDSW and MT with respect to the other compared methods. The study also presents a novel unsupervised continual behaviour modelling strategy to model the evolving behaviours of the fishes. Further, interpretation of fish behaviour from the proposed behaviour modelling is performed to highlight the reliability of the proposed method. The significance of the proposed method is that, it is independent of training and labelled data. In addition, the method represents an innovative alternative to capture all the non observable behaviours of the fishes. 
The proposed tracking and behaviour modelling strategy act as a benchmark for developing algorithms to study fish behaviour via tracking. Finally, the dataset provides an opportunity for developing computer vision-based models to analyse the different behaviours of fish Sillago Sihama.
... Recently, there has been a growing interest in computer vision methods for animal behavior analysis [20][21][22]. However, such systems have yet to be developed for larger animals in the wild with complex body structures and vast degrees of movement freedom, such as chimpanzees, gorillas, and macaques. ...
Multi-instance object tracking is an active research problem in computer vision, where most novel methods analyze and locate targets in videos taken from static camera set-ups, as in many existing monitoring systems worldwide. These methods have proved efficient and effective in established monitoring applications such as animal behavior studies and human and road traffic monitoring. However, despite the growing success of computer vision in animal monitoring and behavior analysis, such a system has yet to be developed for free-ranging Japanese macaques. With this, our study aims to establish a tracking system for Japanese macaques in their natural habitat. We begin by training a monkey detector using You Only Look Once (YOLOv4) and investigating the effect of different transfer learning techniques, curriculum learning, and dataset heterogeneity to improve the model’s accuracy. Using the resulting box detections from our monkey detection model, we use SuperGlue and Murty’s algorithm for re-identifying the monkey individuals across the succeeding frames. With a mean AP50 of 96.59%, a precision score of 93%, a recall of 96%, and a mean IOUAP@50 of 77.2%, our Japanese macaque detection model trained using a YOLO-v4 architecture with spatial attention module, and Mish activation function based on 3-stage training curriculum yielded the best performance. For animal behavior studies, our tracking system can prove effective and reliable with our achieved 91.35% MOTA even on our heterogeneous dataset.
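For readers unfamiliar with MOTA, the tracking metric reported in the abstracts above, the following is a minimal sketch of its standard CLEAR MOT definition: one minus the ratio of summed errors (missed targets, false positives, identity switches) to the total number of ground-truth objects. The function name, argument names, and example counts are illustrative assumptions and are not taken from any of the cited works.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """Multiple Object Tracking Accuracy (standard CLEAR MOT definition).

    All counts are summed over every frame of the sequence.
    The names used here are hypothetical, chosen for illustration only.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    # MOTA can be negative when the tracker makes more errors than there are objects.
    return 1.0 - errors / num_ground_truth


# Made-up counts: 5 misses, 3 false positives, 2 identity switches, 100 ground-truth objects.
example_score = mota(5, 3, 2, 100)
```

Because identity switches enter the numerator directly, a tracker that reduces them (for example by using swim direction or re-identification cues) improves MOTA even if its detections are unchanged.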
Advances in visual perceptual tasks have been mainly driven by the amount, and types, of annotations of large-scale datasets. Researchers have focused on fully-supervised settings to train models using offline epoch-based schemes. Despite the evident advancements, limitations and cost of manually annotated datasets have hindered further development for event perceptual tasks, such as detection and localization of objects and events in videos. The problem is more apparent in zoological applications due to the scarcity of annotations and length of videos-most videos are at most ten minutes long. Inspired by cognitive theories, we present a self-supervised perceptual prediction framework to tackle the problem of temporal event segmentation by building a stable representation of event-related objects. The approach is simple but effective. We rely on LSTM predictions of high-level features computed by a standard deep learning backbone. For spatial segmentation, the stable representation of the object is used by an attention mechanism to filter the input features before the prediction step. The self-learned attention maps effectively localize the object as a side effect of perceptual prediction. We demonstrate our approach on long videos from continuous wildlife video monitoring, spanning multiple days at 25 FPS. We aim to facilitate automated ethogramming by detecting and localizing events without the need for labels. Our approach is trained in an online manner on streaming input and requires only a single pass through the video, with no separate training set. Given the lack of long and realistic (includes real-world challenges) datasets, we introduce a new wildlife video dataset–nest monitoring of the Kagu (a flightless bird from New Caledonia)–to benchmark our approach. Our dataset features a video from 10 days (over 23 million frames) of continuous monitoring of the Kagu in its natural habitat. We annotate every frame with bounding boxes and event labels. 
Additionally, each frame is annotated with time-of-day and illumination conditions. We will make the dataset, which is the first of its kind, and the code available to the research community. We find that the approach significantly outperforms other self-supervised, traditional (e.g., Optical Flow, Background Subtraction) and NN-based (e.g., PA-DPC, DINO, iBOT), baselines and performs on par with supervised boundary detection approaches (i.e., PC). At a recall rate of 80%, our best performing model detects one false positive activity every 50 min of training. On average, we at least double the performance of self-supervised approaches for spatial segmentation. Additionally, we show that our approach is robust to various environmental conditions (e.g., moving shadows). We also benchmark the framework on other datasets (i.e., Kinetics-GEBD, TAPOS) from different domains to demonstrate its generalizability. The data and code are available on our project page:
Through the advancement of observation systems, our vision has far extended its reach into the world of fishes, and how they interact with fishing gears—breaking through physical boundaries and visually adapting to challenging conditions in marine environments. As marine sciences step into the era of artificial intelligence (AI), deep learning models now provide tools for researchers to process a large amount of imagery data (i.e., image sequence, video) on fish behavior in a more time-efficient and cost-effective manner. The latest AI models to detect fish and categorize species are now reaching human-like accuracy. Nevertheless, robust tools to track fish movements in situ are under development and primarily focused on tropical species. Data to accurately interpret fish interactions with fishing gears is still lacking, especially for temperate fishes. At the same time, this is an essential step for selectivity studies to advance and integrate AI methods in assessing the effectiveness of modified gears. We here conduct a bibliometric analysis to review the recent advances and applications of AI in automated tools for fish tracking, classification, and behavior recognition, highlighting how they may ultimately help improve gear selectivity. We further show how transforming external stimuli that influence fish behavior, such as sensory cues and gears as background, into interpretable features that models learn to distinguish remains challenging. By presenting the recent advances in AI on fish behavior applied to fishing gear improvements (e.g., Long Short-Term Memory (LSTM), Generative Adversarial Network (GAN), coupled networks), we discuss the advances, potential and limits of AI to help meet the demands of fishing policies and sustainable goals, as scientists and developers continue to collaborate in building the database needed to train deep learning models.
Biological data sets are increasingly becoming information-dense, making it effective to use a computer science-based analysis. We used convolution neural networks (CNN) and the specific CNN architecture Unet to study sponge behavior over time. We analyzed a large time series of hourly high-resolution still images of a marine sponge, Suberites concinnus (Demospongiae, Suberitidae) captured between 2012 and 2015 using the NEPTUNE seafloor cabled observatory, off the west coast of Vancouver Island, Canada. We applied semantic segmentation with the Unet architecture with some modifications, including adapting parts of the architecture to be more applicable to three-channel images (RGB). Some alterations that made this model successful were the use of a dice-loss coefficient, Adam optimizer and a dropout function after each convolutional layer which provided losses, accuracies and dice scores of up to 0.03, 0.98 and 0.97, respectively. The model was tested with five-fold cross-validation. This study is a first step towards analyzing trends in the behavior of a demosponge in an environment that experiences severe seasonal and inter-annual changes in climate. The end objective is to correlate changes in sponge size (activity) over seasons and years with environmental variables collected from the same observatory platform. Our work provides a roadmap for others who seek to cross the interdisciplinary boundaries between biology and computer science.
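The Dice score mentioned in the abstract above is commonly defined as twice the overlap between predicted and reference masks divided by their combined size, with the Dice loss taken as one minus that score. The sketch below illustrates this common definition on flat binary masks in plain Python; it is not the cited authors' implementation, and the function names and the convention for two empty masks are assumptions.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice coefficient for two equal-length binary masks (sequences of 0/1)."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


def dice_loss(pred_mask, true_mask):
    """Dice loss, taken here as one minus the Dice coefficient."""
    return 1.0 - dice_coefficient(pred_mask, true_mask)


# Toy 4-pixel masks: one pixel overlaps, each mask has two foreground pixels.
toy_score = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
```

In practice, segmentation frameworks usually compute a differentiable variant of this on predicted probabilities rather than hard 0/1 masks.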
Conference Paper
Images captured underwater often suffer from sub-optimal illumination settings that can hide important visual features, reducing their quality. We present a novel single-image low-light underwater image enhancer, L^2UWE, that builds on our observation that an efficient model of atmospheric lighting can be derived from local contrast information. We create two distinct models and generate two enhanced images from them: one that highlights finer details, the other focused on darkness removal. A multi-scale fusion process is employed to combine these images while emphasizing regions of higher luminance, saliency and local contrast. We demonstrate the performance of L^2UWE by using seven metrics to test it against seven state-of-the-art enhancement methods specific to underwater and low-light scenes. Code available at
In intensive aquaculture, the number of fish in a shoal can provide valuable input for the development of intelligent production management systems. However, the traditional artificial sampling method is not only time consuming and laborious, but also may put pressure on the fish. To solve the above problems, this paper proposes an automatic fish counting method based on a hybrid neural network model to realize the real-time, accurate, objective, and lossless counting of fish population in far offshore salmon mariculture. A multi-column convolution neural network (MCNN) is used as the front end to capture the feature information of different receptive fields. Convolution kernels of different sizes are used to adapt to the changes in angle, shape, and size caused by the motion of fish. Simultaneously, a wider and deeper dilated convolution neural network (DCNN) is used as the back end to reduce the loss of spatial structure information during network transmission. Finally, a hybrid neural network model is constructed. The experimental results show that the counting accuracy of the proposed hybrid neural network model is up to 95.06%, and the Pearson correlation coefficient between the estimation and the ground truth is 0.99. Compared with CNN- and MCNN-based methods, the accuracy and other evaluation indices are also improved. Therefore, the proposed method can provide an essential reference for feeding and other breeding operations.
Underwater images are often acquired in sub-optimal lighting conditions, in particular at profound depths where the absence of natural light demands the use of artificial lighting. Low-lighting images impose a challenge for both manual and automated analysis, since regions of interest can have low visibility. A new framework capable of significantly enhancing these images is proposed in this article. The framework is based on a novel dehazing mechanism that considers local contrast information in the input images, and offers a solution to three common disadvantages of current single image dehazing methods: oversaturation of radiance, lack of scale-invariance and creation of halos. A novel low-lighting underwater image dataset, OceanDark, is introduced to assist in the development and evaluation of the proposed framework. Experimental results and a comparison with other underwater-specific image enhancement methods show that the proposed framework can be used for significantly improving the visibility in low-lighting underwater images of different scales, without creating undesired dehazing artifacts.
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Recent advances in computer vision---in the form of deep neural networks---have made it possible to query increasing volumes of video data with high accuracy. However, neural network inference is computationally expensive at scale: applying a state-of-the-art object detector in real time (i.e., 30+ frames per second) to a single video requires a $4000 GPU. In response, we present NoScope, a system for querying videos that can reduce the cost of neural network video analysis by up to three orders of magnitude via inference-optimized model search. Given a target video, object to detect, and reference neural network, NoScope automatically searches for and trains a sequence, or cascade, of models that preserves the accuracy of the reference network but is specialized to the target video and are therefore far less computationally expensive. NoScope cascades two types of models: specialized models that forego the full generality of the reference model but faithfully mimic its behavior for the target video and object; and difference detectors that highlight temporal differences across frames. We show that the optimal cascade architecture differs across videos and objects, so NoScope uses an efficient cost-based optimizer to search across models and cascades. With this approach, NoScope achieves two to three order of magnitude speed-ups (265-15,500x real-time) on binary classification tasks over fixed-angle webcam and surveillance video while maintaining accuracy within 1--5% of state-of-the-art neural networks.