Scientic Reports | (2021) 11:23209 |
Counting using deep learning
regression gives value to ecological
Jeroen P. A. Hoekendijk1,2*, Benjamin Kellenberger3, Geert Aarts1,4,5, Sophie Brasseur4,
Suzanne S. H. Poiesz1,6 & Devis Tuia3
Many ecological studies rely on count data and involve manual counting of objects of interest, which is
time-consuming and especially disadvantageous when time in the field or lab is limited. However, an
increasing number of works uses digital imagery, which opens opportunities to automatise counting
tasks. In this study, we use machine learning to automate counting objects of interest without the
need to label individual objects. By leveraging already existing image-level annotations, this approach
can also give value to historical data that were collected and annotated over longer time series
(typical for many ecological studies), without the aim of deep learning applications. We demonstrate
deep learning regression on two fundamentally different counting tasks: (i) daily growth rings from
microscopic images of fish otolith (i.e., hearing stone) and (ii) hauled out seals from highly variable
aerial imagery. In the otolith images, our deep learning-based regressor yields an RMSE of 3.40 day-rings
and an R² of 0.92. Initial performance in the seal images is lower (RMSE of 23.46 seals and R² of
0.72), which can be attributed to a lack of images with a high number of seals in the initial training set,
compared to the test set. We then show how to improve performance substantially (RMSE of 19.03
seals and R² of 0.77) by carefully selecting and relabelling just 100 additional training images based on
initial model prediction discrepancy. The regression-based approach used here returns accurate counts
(R² of 0.92 and 0.77 for the rings and seals, respectively), directly usable in ecological research.
Ecological studies aim to unravel the interactions between organisms and their environment at various spatial
scales. In order to quantify these intricate relationships, many ecological studies rely on count data: for instance,
during animal surveys, individuals are counted to estimate and monitor population size1,2 or to predict the
spatial distribution of animals3. On smaller scales, counting physical traits is widely used in, for example, plant
phenotyping, where the number of leaves of a plant is a key trait to describe development and growth4,5. The
usage of count data in ecology is also common on microscopic scales, for example to estimate the age of fish by
counting daily growth rings that are visible in otoliths (i.e., hearing stones)6,7.
Irrespective of the scale, counting objects of interest can be tedious and time-consuming, especially when
objects occur in large numbers and/or densities (e.g., wildlife that clusters in colonies8), when they overlap (e.g.,
leaves of plants4,5), or when they are less well-defined and cryptic (e.g., otolith rings6). Historically, many of these
traits were counted directly by eye. Later, objects of interest were photographed, which allowed for optimisation of
the time in the field (or lab) and repeatability of the counts. Nowadays, many studies increasingly take advantage
of digital photography, allowing for more efficient ways of archiving the data. Crucially, these archived images
can now potentially be used for digital processing and automated counting.
To this end, recent ecological studies have shown promising potential of using computer vision to count
objects of interest from digital imagery9,10: they employ Machine Learning (ML) models, which are trained
on a set of manually annotated (labelled) images to learn to recognise patterns (e.g., colours and shapes), and
eventually objects, in those training images. Once trained, these ML models can be used to automatically
recognise similar patterns in new images and perform tasks like species classification, animal detection, and more11.
1NIOZ Royal Netherlands Institute for Sea Research, 1790AB Den Burg, The Netherlands. 2Wageningen University
and Research, 6708PB Wageningen, The Netherlands. 3Ecole Polytechnique Fédérale de Lausanne (EPFL),
1950 Sion, Switzerland. 4Wageningen Marine Research, Wageningen University and Research, 1781AG Den
Helder, The Netherlands. 5Wageningen University and Research, Wildlife Ecology and Conservation Group, 6708
PB Wageningen, The Netherlands. 6Groningen Institute of Evolutionary Life Sciences, University of Groningen,
9700 CC Groningen, The Netherlands. *email:
Content courtesy of Springer Nature, terms of use apply. Rights reserved
Scientic Reports | (2021) 11:23209 |
Most successful ML models belong to the family of Deep Learning (DL)12, in particular Convolutional Neural
Networks (CNNs)13.
Most ecological studies that use computer vision for counting apply CNNs designed for object detection14–16.
These object detectors are trained on images in which every object of interest is annotated individually, most
commonly by a bounding box drawn around the object, or a location point at its centre. Alternatively, objects of
interest can be counted using detectors based on image segmentation17, which require even more extensive
annotations, as every pixel in the image must be labelled. Annotating training images for object detection and image
segmentation can therefore be labour-intensive, especially for images where object counts are high. Hence, this
could potentially undermine the time (and cost) reduction advantage promised by ML models in the first place.
An alternative is to instead annotate training images with a single value that represents the number of objects
in an image. ese image-level annotations pose signicantly reduced annotation time and can directly be used to
train regression-based CNNs. Perhaps more importantly, image-level counts are an oen used annotation format
in ecological studies, for example in cases where objects are manually counted from digital imagery over longer
time series. Furthermore, image-level annotations provide a viable solution for scenarios that are complicated
to annotate otherwise, such as for overlapping objects5, complex and atypically shaped objects like concentric
rings18, or continuous variables like an individual’s size or age19.
In this study we highlight the value of regression-based CNNs for ecological studies. We present a relatively
lightweight DL model for counting objects in digital imagery and evaluate it on two fundamentally different
real-world datasets that were originally collected without the aim of training DL models. The first dataset consists of
microscopic images of plaice (Pleuronectes platessa) otoliths (i.e., hearing stones) in which concentric rings are
visible. These rings represent daily growth layers and are used to estimate the age of the fish to reconstruct egg
and larval drift and calculate the contribution of various spawning grounds to different settling areas6,7. Plaice
eggs and larvae are transported from their North Sea spawning grounds towards the coast of the North Sea and
into the Wadden Sea (pelagic phase), where they settle (benthic phase). The transition from the pelagic phase to
the benthic phase is visible in the otoliths. For this application, only the post-settlement benthic phase growth
rings (visible directly after the pelagic phase centre) are counted. The already existing image-level annotations
in this dataset are of high quality and are directly usable for DL applications. The second dataset consists of
aerial images of grey seals (Halichoerus grypus) and harbour seals (Phoca vitulina) hauled out on land, which
are collected from an aircraft using a hand-held camera during annual surveys monitoring population size and
distribution8. These images are highly variable in light conditions, distance towards the seals, focal length and
angle of view. For this second dataset, some of the existing image-level annotations were not directly usable for
DL applications (see “Methods” section). Instead of recounting the seals and correcting the annotations for all
images in this dataset, we propose a multi-step model building approach to handle scenarios where the quality
of existing image-level annotations is insufficient to train a CNN. This approach can also be used to adapt the
CNN to dataset variations that appear over time or with new acquisition conditions.
These two real-world applications show that regression-based CNNs have the potential to greatly facilitate
counting tasks in ecology. They allow researchers to reassign valuable resources and scale up their surveying
effort, while potentially leveraging existing image-level annotations from archived datasets directly for the
automation of counting.
For the results reported in this section, we used a pre-trained ResNet-18 CNN20 and modified it for the task of
regression. After various experiments with other architectures and hyperparameters (Supplementary S1), we
found that this relatively lightweight (i.e., shallow) ResNet-18, trained with a Huber loss function21 and with the
largest possible batch size (limited by hardware,
images for the otolith and seal application,
respectively) gave the best performance on the validation set, for both the seal and otolith ring counting
application. Details on the CNN architecture selection and training are provided in the “Methods” section.
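As an illustration of the loss function named above, the following minimal sketch implements the standard Huber loss in plain Python. The threshold `delta` and the function names are assumptions made for this example; the text does not state the threshold used during training, and this is not the authors' implementation.

```python
def huber_loss(prediction, target, delta=1.0):
    """Huber loss for a single prediction.

    Quadratic for small errors (|error| <= delta) and linear for large ones,
    which makes it less sensitive to outlying counts than the squared error.
    `delta` is a hypothetical default chosen for illustration.
    """
    error = abs(prediction - target)
    if error <= delta:
        return 0.5 * error ** 2
    return delta * (error - 0.5 * delta)


def mean_huber_loss(predictions, targets, delta=1.0):
    """Average Huber loss over a batch of image-level counts."""
    losses = [huber_loss(p, t, delta) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)
```

With `delta=1.0`, an error of half a ring is penalised quadratically, whereas an error of three rings is penalised linearly, so a few badly mispredicted images dominate the batch loss less than they would under a squared error.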
Otolith daily growth rings from microscopic images. For the otolith growth ring counting application,
the regression CNN was trained on 3465 microscopic images of otoliths. The results are provided in Fig. 1.
Here, the predicted counts on the randomly selected test set (n = 120) are plotted against the labels (i.e., the
manual counts of the post-settlement growth rings). The CNN achieved an R² of 0.92, an RMSE of 3.40 day-rings
and an MAE of 2.60 day-rings (Table 1), which corresponds to an average error of 9.9%.
Hauled out seals from aerial images. For the seal counting application, the existing image-level annotations
were of insufficient quality (see “Methods” section) and manual recounting was required before training
the CNN. Instead of recounting all the seals and correcting the annotations for all 11,087 aerial images in the
main dataset, we applied a multi-step model building approach. First, two smaller subsets from the main dataset
were selected, recounted and used for (i) a stratified random test set (n = 100) and for (ii) training/validation
(named ‘seal subset 1’, n = 787) (see “Methods” section). Unlike the stratified random test set (which reflects
the full distribution of available annotations from the main dataset), the images in ‘seal subset 1’ were selected
(visually) for their high quality, which led to an under-representation of images with a high number of seals
(which were generally of poorer quality). This first step greatly reduced the number of images that needed to be
recounted and relabelled. Figure 2 (open dots, panels A and B) illustrates the predicted counts versus the real
counts of the resulting model. This Step 1 model achieved an R² of 0.72, an RMSE of 23.46 seals and an MAE of
10.47 seals on the seal test set (Table 1). The next step allowed us to focus on images where the CNN was most
incorrect. Here, the Step 1 model was used to predict counts on the 10,200 remaining images from the main
dataset (that still include noisy labels). To train the model further, the images from the main dataset in which
the number of seals was most overestimated (
) and most underestimated (
) with respect to the
original (noisy) labels were selected (‘seal subset 2’), manually recounted and relabelled and used to supplement
‘seal subset 1’. By further tuning the model using this extended training/validation set, the performance on the
test set improved (Fig. 2, solid dots, panels A and C), with the model achieving an R² of 0.77, an RMSE of 19.03
seals and an MAE of 8.14 seals (Table 1). This can be attributed mostly to improved predictions for images with
a higher number of seals. Experiments with a random sampling on the whole distribution of labels (i.e., 787
images randomly selected from ‘seal subset 1’ and ‘seal subset 2’ combined, including images with a high
number of seals) did not lead to better performance of the Step 1 model (see Supplementary S3). Thus, the two-step
strategy allowed us to significantly improve the model performance on the seals with only 100 images to be
re-annotated, thereby reducing labelling efforts to a minimum.
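The selection step of this two-step strategy can be sketched as follows: predictions of the Step 1 model are compared with the original (noisy) labels, and the images with the largest positive and negative differences are set aside for manual recounting. The function name, the dictionary-based data layout and the group sizes `n_over` and `n_under` are assumptions made for this illustration, not the authors' code.

```python
def select_for_relabelling(noisy_labels, predictions, n_over, n_under):
    """Pick the images whose predicted count deviates most from the noisy label.

    noisy_labels, predictions: dicts mapping an image id to a count.
    n_over, n_under: hypothetical numbers of most overestimated and most
    underestimated images to return.
    """
    # Positive difference: the model predicts more seals than the label says.
    diffs = {img: predictions[img] - noisy_labels[img] for img in noisy_labels}
    ranked = sorted(diffs, key=diffs.get)      # most underestimated first
    most_under = ranked[:n_under]
    most_over = ranked[::-1][:n_over]          # most overestimated first
    return most_over, most_under


# Toy usage with made-up counts.
labels = {"a": 10, "b": 50, "c": 5, "d": 120}
preds = {"a": 12.0, "b": 20.0, "c": 30.0, "d": 60.0}
over, under = select_for_relabelling(labels, preds, n_over=1, n_under=2)
```

Only the returned images would then be recounted by hand and added to the training/validation set before tuning the model further.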
In the test set, a total of 3300 seals were annotated. With our multi-step approach, the predicted total number
of seals on the test set increased from 2372 (71.9% of the total) to 2986 (90.5%) for the Step 1 and Step 2 model,
respectively.
Visualising counts. Class activation maps (CAM)22,23 of images from the test sets were used to further
examine model performance. These heatmaps represent the regions of the original image that contributed the
most to the final prediction of the CNN. The heatmaps of the otolith images (Fig. 3) were less informative than
those of the seal images. However, they illustrate that areas with more contrasting post-settlement rings were
highlighted, while the accessory growth centre (containing pre-settlement growth rings that are not targeted by
this application) did not seem to contribute to the prediction (i.e., it remained darker). This underlines that the
model is indeed focusing on the task of counting post-settlement growth rings.
For most seal images, the heatmaps show that the regions containing seals contributed the most to the final
prediction. Unlike the cryptic concentric otolith rings, seals are clearly picked up by the model, according to
the heatmaps (Fig. 4).
Figure1. Numerical results on the otolith test set (
), where the labels (i.e., manual counts of post-
settlement growth rings) are plotted against the predicted counts. e dotted line corresponds to the optimum
Table 1. Numerical performance of the proposed method on the randomly selected test sets for both
applications. The performance of the seal counting application increased after fine-tuning the Step 1 model
using ‘seal subset 2’ (Step 2 model).
        Otolith rings    Seals (Step 1 model)    Seals (Step 2 model)
R²      0.92             0.72                    0.77
RMSE    3.40             23.46                   19.03
MAE     2.60             10.47                   8.14
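The metrics in Table 1 can be computed from paired labels and predictions as in the sketch below. It assumes that R² denotes the coefficient of determination (the text does not state which definition was used), and the function names and toy numbers are illustrative only.

```python
import math


def regression_metrics(labels, predictions):
    """Return R² (coefficient of determination), RMSE and MAE."""
    n = len(labels)
    errors = [p - y for y, p in zip(labels, predictions)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
    }


def total_count_ratio(labels, predictions):
    """Share of the total manual count recovered by the summed predictions."""
    return sum(predictions) / sum(labels)


# Toy usage with made-up counts.
metrics = regression_metrics([10, 20, 30], [12, 18, 33])
ratio = total_count_ratio([10, 20, 30], [12, 18, 33])
```

The second function mirrors the comparison of total predicted and total annotated seals: for example, 2986 predicted out of 3300 annotated seals corresponds to a ratio of about 0.905.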
The regression-based CNN presented here performed well when trained on the two fundamentally different
datasets. This was achieved without making any modifications to the architecture of the CNN between the two
cases, except for training hyperparameters like the learning rate and number of epochs (see “Methods” section).
By automating the counting tasks, the processing time of newly acquired images is dramatically reduced:
processing 100 images using our trained CNN takes less than a minute, while manually processing the same number
of images is estimated to require at least one hour for the seals and three hours for the otoliths.
The accuracies reported here are directly usable in ecological research. For harbour seals, a correction factor
of 0.68 is routinely used to extrapolate the survey counts to a population size estimate24. The 95% confidence
interval of this correction factor is [0.47, 0.85]. In other words, the uncertainty in the population size estimate is
minus 21% or plus 17%, which is substantially larger than the 9.5% underestimate in the total predicted counts
of our Step 2 model. For the ring counting application, a coefficient of variation between multiple human experts
was not available for daily growth rings of plaice. However, these are reported for yearly growth rings of Greenland
halibut as 12% (ref. 25) and 16.3% (ref. 26), which is higher than the reported 9.9% average error obtained by our deep
counting regression approach.
The two datasets feature different challenges regarding both the quality of the existing annotations and the task
complexity. In the case of the otoliths, the existing annotations were of good quality and could be used directly
to train the model. These image-level annotations provide a solution to label the complex concentric growth
rings, which would be extremely difficult to annotate using other approaches, such as bounding boxes. A DL
regression-based approach was applied in previous research to count otolith growth rings18, which achieved a
higher accuracy on their test set (MSE of 2.99). However, the tasks considered in that study were radically different
from ours: in their paper, Moen and colleagues18 considered year rings, which are less cryptic than the
post-settlement day rings considered in this paper. Furthermore, our model was trained with fewer images (
instead of
), to make predictions on a wider range of counts (1 to 63 day-rings instead of 1 to 26 year-rings).
Finally, we evaluated the performance using a stratified random test set, which covers the ensemble of
the distribution of possible values, while Moen and colleagues used a non-stratified test set, therefore reducing
the number of occurrences of rare out-of-distribution cases in the test set.
Figure 2. Numerical results on the seal test set (n = 100), where the labels (i.e., the manual counts of hauled
out seals) are plotted against the predicted counts. The black dotted lines resemble y = x. (A) The accuracy
of the model trained on ‘seal subset 1’ (white dots) strongly improved after fine-tuning using training subset 2
(black dots). (B,C) Zoomed in (range 0–30) on predicted counts made by the Step 1 model (B) and the Step 2
model (C).
In the case of the seals, the counting task was complex due to the high variability of the images (e.g., lighting
conditions, distance from the seals and angle of view). Additionally, some of the existing count labels were
not directly usable for training a CNN (see “Methods” section). However, this provided an opportunity to
demonstrate the use of an iterative approach, in which the required re-annotation efforts could be minimised
and focused on images where the model performed poorly. The CNN was first trained using only a subset
containing recounted high quality images (‘seal subset 1’). As is common among DL applications, the resulting
Step 1 model performed relatively poorly when it needed to make predictions that fell outside the range of the
training images. This was the case for images in which a high number of seals were visible and/or when the seals
appeared smaller (i.e., were photographed from a larger distance or a smaller focal length was used). The poor
performance on these types of images could be attributed to ‘seal subset 1’ containing only images with clearly
visible seals, ranging from zero to 99 individuals (see “Methods” section). By using the Step 1 model predictions
to guide the selection of images that need to be reviewed, a relatively small number of images (‘seal subset
2’) was selected from the remaining images in the main database, to supplement ‘seal subset 1’. This multi-step
approach makes it possible to focus on images with a large potential for improvement for the Step 2 model: many of the
images in ‘seal subset 2’ contained a high number of seals and/or seals that appeared smaller. This approach can
therefore also be used to cope with dataset variations that appear over time or with new acquisition conditions.
Figure 3. Examples of images (left) and CAMs (right), with good performance from the ring test set.
Figure 4. Examples of images (left) and CAMs (right), with good performance from the seal test set. Notice
that in the top example some birds are visible (yellow dotted line), which are not counted by the model, which
has specialised on seals.
The high variability in the seal dataset (i.e., distance towards seals, angle of view and zoom level) suggests that
a regression-based approach based on this data can also provide solutions for scenarios where the objects of
interest move through a three-dimensional space (e.g., flocks of birds, schools of fish), provided that the model
is trained with a wide variety of input data covering the expected variations.
In contrast with an object detection approach, it is not possible to evaluate the predicted location of single
objects in our regression-based approach, as the predictions are given as image-level counts. However, by using
CAMs as presented here, model decisions can be visualised and used to evaluate the model performance in more
detail. In case of the seals, these heatmaps were used to further compare the performance of the Step 1 and Step
2 model on the test set. The Step 2 model generally performed better, especially for images where seals appeared
smaller (e.g., Fig. 5, case A). For some images, however, the model predictions deteriorated. This was for instance
the case for an image with birds present adjacent to the seals, which contributed to the predicted counts for
the Step 2 model (Fig. 5, case B). For some images that were particularly difficult (e.g., due to blur or extremely
small seals), the Step 2 model remained unable to count seals adequately (Fig. 5, case C).
For future applications, automated counts based on the regression approach presented here could potentially
be further improved by changing the survey design to have lower variability in the images. In the case of
the seals for instance, this could be obtained by photographing the seals from a more constant distance with a
single focal length, although in practice this might be challenging. For existing data sets, the model could also
deliberately be exposed to more appearance variability. This could for instance be done by resorting to un- or
semi-supervised domain adaptation routines27. This requires no or only a few extra annotated images but results
in more robustness of the model to the appearance variations inherent in the data. Alternatively, in cases such as
the seals, where many images remain unused due to noisy labels, the iterative approach presented in this study
could be repeated, which is expected to further improve the model’s performance.
Figure 5. Examples of CAMs for cases with unsatisfactory performance. The first column shows the unedited
aerial images, where the red dotted line marks the area where the seals are visible. The second and third columns
show the heatmaps when predictions are made using the Step 1 and Step 2 model, respectively. For case (A)
(small seals) the performance increased, but is still unsatisfactory, as seals remain only partially detected. For
case (B) the performance decreased as birds (yellow dotted line) start to contribute to the predictions, while for
case (C) (blurry and extremely small seals) the performance was poor for both models.
In many computer vision disciplines, regression-based CNNs similar to the one employed here are commonly
used for counting tasks, especially when the objects of interest occur in high densities and high numbers,
such as human crowds28,29 or buildings30. They have also been used in some ecological applications, particularly
when the objects of interest are hard to annotate using bounding boxes, for instance in the case of overlapping
plant leaves5,31,32. Wildlife counting is a domain that is typically addressed with spatially explicit object detection
approaches14,15. Few other works have addressed this task using regression-based CNNs33,34, but they either had
no explicit focus on wildlife detection34 or used it to approximate spatial locations33. Nonetheless, proceeding
with a regression approach makes it possible to process surveys where only global counts are provided, rather than precise
annotations of individuals that would be required by object detection approaches. But even in the presence of
individual annotations, the regression approach remains competitive in terms of final counts: when compared
to a traditional deep object detection approach (Faster R-CNN35) on a manually annotated subset of the seals
dataset, our regression approach remained more accurate (details in Supplementary S2). Furthermore, it took
approximately one hour to obtain the image-level annotations required to train the regression-based CNN,
while it took over 8 hours to create the individual bounding boxes required to train the Faster R-CNN model.
Our study illustrates how a relatively lightweight regression CNN can be used to automatically count objects
of interest from digital imagery in fundamentally different kinds of ecological applications. We have shown that it
is well-suited to count wildlife (especially when individuals occur in high densities) and to count cryptic objects
that are extremely difficult to annotate individually. Previous ecological studies have shown that by automating
detection tasks, time and resources can be reassigned, allowing for an increase in sampling effort14. By using
annotations at the image-level, labelling efforts and costs can be reduced. Finally, a unique advantage of using
a regression-based approach is that it has the potential to leverage already existing labels, collected without the
aim of DL applications, thereby reducing labelling efforts and costs to zero.
Datasets. In this study, datasets from two fundamentally different real-world ecological use cases were
employed. The objects of interest in these images were manually counted in previous studies2,8,36,37, without the
aim of DL applications.
Microscopic images of otolith rings. The first dataset consists of 3585 microscopic images of otoliths (i.e.,
hearing stones) of plaice (Pleuronectes platessa). Newly settled juvenile plaice of various length classes were collected
at stations along the North Sea and Wadden Sea coast during 23 sampling campaigns conducted over 6 years.
Each individual fish was measured, the sagittal otoliths were removed and microscopic images at two zoom
levels (10 × 20 and 10 × 10, depending on fish length) were made. Post-settlement daily growth rings outside
the accessory growth centre were then counted by eye6,7. In this dataset, images of otoliths with fewer than 16 or
more than 45 rings were scarce (Fig. 6). Therefore, a stratified random design was used to select 120 images to
evaluate the model performance over the full range of ring counts: all 3585 images were grouped in eight bins
according to their label (Fig. 6) and from each bin 15 images were randomly selected for the test set. Out of the
remaining 3465 images, 80% of the images were randomly selected for training and 20% were used as a validation
set, which is used to estimate the model performance and optimise hyperparameters during training.
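The stratified test-set selection and the subsequent 80/20 split can be sketched as follows. The bin edges, the random seed and the function name are assumptions made for illustration (the actual bins are those shown in the label distribution figure), so this is a sketch of the sampling design rather than the authors' code.

```python
import bisect
import random


def stratified_split(labels, bin_edges, per_bin, train_fraction=0.8, seed=0):
    """Draw `per_bin` test images from each label bin, then split the rest.

    labels: dict mapping an image id to its count label.
    bin_edges: sorted, hypothetical edges; k edges define k + 1 bins.
    Returns (train_ids, validation_ids, test_ids).
    """
    rng = random.Random(seed)
    bins = {}
    for image_id, count in labels.items():
        bins.setdefault(bisect.bisect_right(bin_edges, count), []).append(image_id)

    test_ids = []
    for _, members in sorted(bins.items()):
        members = sorted(members)  # fixed order so the seed gives a repeatable draw
        test_ids.extend(rng.sample(members, min(per_bin, len(members))))

    chosen = set(test_ids)
    rest = sorted(i for i in labels if i not in chosen)
    rng.shuffle(rest)
    cut = int(train_fraction * len(rest))
    return rest[:cut], rest[cut:], test_ids


# Toy usage: 640 synthetic images with labels 0..63, eight bins, 15 per bin.
toy_labels = {i: i % 64 for i in range(640)}
train, val, test = stratified_split(toy_labels, [8, 16, 24, 32, 40, 48, 56], per_bin=15)
```

Sampling a fixed number of images per bin ensures that scarce labels (very low or very high counts) are represented in the test set, which a purely random draw would rarely achieve.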
Figure 6. Distribution of the labels (i.e., number of post-settlement rings) of all images in the otolith dataset
Aerial images of seals. The second dataset consists of 11,087 aerial images (named ‘main dataset’ from now
onwards) of hauled out grey seals (Halichoerus grypus) and harbour seals (Phoca vitulina), collected between
2005 and 2019 in the Dutch part of the Wadden Sea2,36. Surveys for both species were performed multiple times
each year: approximately three times during pupping season and twice during the moult8. During these periods,
seals haul out on land in larger numbers. Images were taken manually through the airplane window whenever
seals were sighted, while flying at a fixed height of approximately 150 m, using different focal lengths
(80–400 mm). Due to variations in survey conditions (e.g., weather, lighting) and image composition (e.g., angle of
view, distance towards seals), this main dataset is highly variable. Noisy labels further complicated the use of this
dataset: seals present in multiple (partially) overlapping images were counted only once, and were therefore not
included in the count label of each image. Recounting the seals on all images in this dataset to deal with these
noisy labels would be a tedious task, compromising one of the main aims of this study of reducing annotation
efforts. Instead, only a selection of the main dataset was recounted and used for training and testing. First, 100
images were randomly selected (and recounted) for the test set. In the main dataset, images with a high number
of seals were scarce, while images with a low number of seals were abundant (Fig. 7, panel A). Therefore, as with
the otoliths, all 11,087 images were grouped into 20 bins according to their label (Fig. 7, panel A), after which
five images were randomly selected from each bin for the test set. Second, images of sufficient quality and
containing easily identifiable seals were selected from the main dataset (and recounted) for training and validation, until
787 images were retained (named ‘seal subset 1’). In order to create images with zero seals (i.e., just containing
the background) and to remove seals that are only partly photographed along the image borders, some of these
images were cropped. The dimensions of those cropped images were preserved and, if required, the image-level
annotation was modified accordingly. The resulting ‘seal subset 1’ only contains images with zero to 99 seals
(Fig. 7, panel B). These 787 images were then randomly split into a training (80%) and validation set (20%). In
order to still take advantage of the remaining 10,200 images from the main dataset, a two-step label refinement
was performed (see the section “Dealing with noisy labels: two-step label refinement” below).
Figure 7. Distribution of the labels (i.e., number of seals) in (A) the seal main dataset, (B) ‘seal subset 1’ and (C) ‘seal subset 2’.

Convolutional neural networks. CNNs are a particular type of artificial neural network. Similar to a
biological neural network, where many neurons are connected by synapses, these models consist of a series of
connected artificial neurons (i.e., nodes), grouped into layers that are applied one by one. In a CNN, each layer
receives an input and produces an output by performing a convolution between the neurons (now organised into
a rectangular filter) and each spatial input location and its surroundings. This convolution operator computes a
dot product at each location in the input (image or previous layer’s output), encoding the correlation between
the local input values and the learnable filter weights (i.e., neurons). After this convolution, an activation
function is applied so that the final output of the network can represent more than just a linear combination of the
inputs. Each layer performs calculations on the inputs it receives from the previous layer, before sending its output to the
next layer. Regular layers that ingest all previous outputs rather than a local neighbourhood are sometimes also
employed at the end; these are called “fully-connected” layers. The number of layers determines the depth of the
network. More layers introduce a larger number of free (learnable) parameters, as does a higher number of
convolutional filters per layer or larger filter sizes. A final layer usually projects the intermediate, high-dimensional
outputs into a vector of size C (the number of categories) in the case of classification, into a single number in
the case of regression (ours), or into a custom number of outputs representing arbitrarily complex parameters,
such as the class label and coordinates of a bounding box in the case of object detection. During training, the
model is fed with many labelled examples to learn the task at hand: the parameters of the neurons are updated
to minimise a loss (provided by an error function measuring the discrepancy between predictions and labels; in
our case this is the Huber loss as described below). To do so, the gradient of the loss with respect to each
neuron in the last layer is computed; modifying neurons by following their gradients downwards reduces
the loss (and thereby improves model prediction) for the current image. Since the series of layers
in a CNN can be seen as a set of nested, differentiable functions, the chain rule can be applied to also compute
gradients for the intermediate, hidden layers and modify neurons therein backwards until the first layer. This
process is known as backpropagation³⁸. With the recent increase of computational power and labelled dataset
sizes, these models are now of increasing complexity (i.e., they have higher numbers of learnable parameters in
the convolutional filters and layers).
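To make the convolution step concrete, the following minimal sketch applies one filter to a small single-channel image and then an activation. It is written in plain Python for readability; the function name is hypothetical, the choice of ReLU as activation is an assumption (the text does not name one), and real CNN libraries use optimised multi-channel implementations.

```python
def conv2d_relu(image, kernel, bias=0.0):
    """Slide `kernel` over `image` (no padding, stride 1) and apply ReLU.

    As in most CNN libraries, the filter is not flipped, so this is
    strictly a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Dot product between the filter weights and the local image patch.
            s = bias
            for i in range(kh):
                for j in range(kw):
                    s += image[r + i][c + j] * kernel[i][j]
            row.append(max(0.0, s))  # ReLU activation (assumed)
        output.append(row)
    return output
```

Each output value is large where the local patch resembles the filter, which is the sense in which the dot product encodes the correlation between input values and filter weights.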
CNNs come in many layer configurations, or architectures. One of the most widely used CNN architectures
is the ResNet²⁰, which introduced the concept of residual blocks: in ResNets, the input to a residual block (i.e.,
a group of convolutional layers with nonlinear activations) is added to its output in an element-wise manner.
This allows the block to focus on learning residual patterns on top of its inputs. Also, it enables learning signals
to by-pass entire blocks, which stabilises training by avoiding the problem of vanishing gradients³⁹. As a
consequence, ResNets were the first models that could be trained even with many layers in series and provided a
significant increase in accuracy.
Model selection and training. For the otolith dataset, we employed ResNet²⁰ architectures of various
depths (i.e., ResNet18, ResNet34, ResNet50, ResNet101 and ResNet152, where the number corresponds to
the number of hidden layers in the model, see Supplementary S1). These ResNet models were pretrained on
ImageNet⁴⁰, which is a large benchmark dataset containing millions of natural images annotated with thousands
of categories. Pre-training on ImageNet is a commonly employed methodology to train a CNN efficiently, as it
will already have learned how to recognise common recurring features, such as edges and basic geometrical
patterns, which would have to be learned from scratch otherwise. Therefore, pre-training reduces the required amount
of training data significantly.
Figure 8. Schematic representation of the CNN used in this study. The classification output layer of the
pretrained ResNet18 is replaced by two fully-connected layers. The model is trained with a Huber loss.

We modified the ResNet architecture to perform a regression task. To do so, we replaced the classification
output layer with two fully-connected layers that map to 512 neurons after the first layer and to a single
continuous variable after the second layer²³ (Fig. 8). Since the final task to be performed is regression, the loss function
must be one that is tailored for regression. In our experiments we tested both a Mean Squared Error and
a Smooth L1 (i.e., Huber) loss²¹ (see Supplementary S1). The Huber loss is more robust against outliers and is
defined as follows:

$$L(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} z_i ,$$

where $z_i$ is given by

$$z_i = \begin{cases} 0.5 \times (y_i - \hat{y}_i)^2, & \text{if } |y_i - \hat{y}_i| < 1 \\ |y_i - \hat{y}_i| - 0.5, & \text{otherwise,} \end{cases}$$

$\hat{y}$ is the value predicted by the model, $y$ is the true (ground truth) value (i.e., the label) and $n$ is the batch
size. Intuitively, the Huber loss assigns a strong (squared) penalty for predictions that are close to the target
value, but not perfect (i.e., an absolute error below 1) and a smaller (linear) penalty for predictions far off, which increases
tolerance towards potential outliers both in prediction and target.
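The piecewise definition of the Huber loss can be checked numerically with a short sketch. The function names are hypothetical, and the transition point of 1 is an assumption that matches the default of PyTorch's `SmoothL1Loss`; the mean squared error is included only for comparison.

```python
def huber_loss(y_true, y_pred):
    """Mean Smooth L1 (Huber) loss with the transition at an absolute error of 1."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        err = abs(y - y_hat)
        # Squared branch for small errors, linear branch for large ones.
        total += 0.5 * err ** 2 if err < 1 else err - 0.5
    return total / len(y_true)


def mse_loss(y_true, y_pred):
    """Mean squared error, for comparison with the Huber loss."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)
```

For a batch with one near-perfect prediction and one prediction that is off by 10, the squared error is dominated by the outlier far more strongly than the Huber loss, which is the robustness property referred to in the text.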
Computations were performed on a Linux server with four Nvidia GeForce GTX 1080 Ti graphics cards.
The CNNs were trained using the FastAI library²³ (version 2.0.13) in PyTorch⁴¹ (version 1.6.0). FastAI’s default
settings were used for image normalisation, dropout⁴², weight decay and momentum²³, and a batch size of 84
images was used for the otolith dataset. Whenever an image was used in a model iteration during training, a series
of transformations was applied randomly to it for data augmentation (including resizing to 1040 × 770 pixels,
random horizontal flips, lighting, warping, zooming and zero-padding). When using image-level annotations,
only limited degrees of zooming can be used, otherwise objects of interest might be cut out of the image, making
the image-level annotations incorrect. For the same reason, images were squeezed instead of cropped whenever
necessary to account for different image dimensions.

Various Learning Rates (LR) and Batch Sizes (BS) were
evaluated (see Supplementary S1). An LR finder⁴³ was used to determine the initial LR values, and FastAI’s default
settings for discriminative LR were applied²³. In discriminative LR, a lower LR is used to train the early layers of
the model, while the later layers are trained using a higher LR. For this purpose, our model was divided into three
sections (the pretrained part of the network was split into two sections, while the third section comprised the added
fully-connected layers), each of which had a different LR (specified below) during training. Additionally, we applied
‘1cycle training’²³,⁴⁴. Here, training is divided into two phases, one where the LR grows towards a maximum,
followed by a phase where the LR is reduced to the original value again.

First, only the two fully-connected layers
added for regression (i.e., the third section) were trained for 25 epochs (of which the best performing epoch, the 24th,
was saved) with an LR of […], while the rest of the network remained frozen. After this, the entire network was
unfrozen and all layers were further tuned using a discriminative LR ranging from […], for another
50 epochs, of which the best performing epoch (the 50th) was saved. The same model architecture, training
approach and hyperparameters were used for the seal images, with the following exceptions. The batch size was
100 and images were resized to 1064 × 708 pixels. First, only the added layers were trained (analogous to the
rings), with an LR of […], for 50 epochs (of which the best performing epoch, the 45th, was saved). After this, the
entire network was unfrozen and further tuned for 50 epochs (of which the best performing epoch, the 49th,
was saved), using a discriminative LR ranging from […].
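The two learning-rate ideas described above, a rise-then-fall schedule and a separate LR per layer group, can be sketched as follows. This is a simplified illustration under stated assumptions: the function names, the linear ramps, the 25% warm-up fraction and the geometric spacing between groups are all hypothetical choices, and FastAI's own implementation may differ in shape and defaults.

```python
def one_cycle_lr(step, total_steps, lr_max, lr_start, warmup_frac=0.25):
    """Linear ramp from lr_start up to lr_max, then linearly back down to lr_start."""
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")
    peak = max(1, int(total_steps * warmup_frac))
    if step <= peak:
        frac = step / peak  # phase 1: LR grows towards the maximum
    else:
        frac = 1 - (step - peak) / (total_steps - 1 - peak)  # phase 2: LR decays
    return lr_start + frac * (lr_max - lr_start)


def discriminative_lrs(lr_low, lr_high, n_groups=3):
    """Geometrically spaced learning rates, lowest for the earliest layer group."""
    if n_groups == 1:
        return [lr_high]
    ratio = (lr_high / lr_low) ** (1 / (n_groups - 1))
    return [lr_low * ratio ** g for g in range(n_groups)]
```

With three groups, the first two values would apply to the two pretrained sections and the last to the added fully-connected layers, so the pretrained features change more slowly than the new regression head.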
For both the otolith and seal cases, the trained models were evaluated on their respective test sets (described
above). These test sets represent unseen data that is not used during the training and validation of the model.
R², RMSE and MAE were used as performance metrics, and predicted counts were plotted against the labels.
Additionally, Class Activation Maps (CAM) were made to aid with interpreting the model’s predictions²²,²³.
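The three performance metrics can be computed from paired labels and predictions as in the sketch below. The function name is hypothetical, and R² is computed here as the coefficient of determination, which is an assumption: the text does not state whether this definition or the squared correlation was used.

```python
import math


def regression_metrics(labels, predictions):
    """Return (R2, RMSE, MAE) for paired lists of true and predicted counts."""
    n = len(labels)
    if n == 0 or n != len(predictions):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_label = sum(labels) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    # R2 is undefined when all labels are identical.
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(y - p) for y, p in zip(labels, predictions)) / n
    return r2, rmse, mae
```

RMSE weights large errors more heavily than MAE, so reporting both gives a sense of whether a few poorly predicted images dominate the overall error.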
Dealing with noisy labels: two-step label refinement. In order to take advantage of the additionally
available noisy data during training, a two-step approach was employed that avoids the need to recount tens of
thousands of seals. Using the predictions of the Step 1 model (trained using ‘seal subset 1’), an additional 100 images
were selected (and recounted) from the remaining main dataset (see “Results” section). For 35 images, the seals
were not clearly identifiable by eye (i.e., they appeared too small); each of these images was discarded and replaced by
the next most poorly predicted image. The resulting 100 images (named ‘seal subset 2’, Fig. 7, panel C) were
expected to include cases with noisy labels, but also cases that were challenging for the model to predict (e.g.,
images with a high number of seals). After this, the entire model (i.e., all layers) was retrained using ‘seal subset
1’ supplemented with ‘seal subset 2’, randomly split into a training set (80%) and a validation set (20%), for an additional
50 epochs using the same hyperparameters as before, except for the LR. Various LR were evaluated and a
discriminative LR ranging from […] gave the best performance on the validation set, in the 48th epoch.
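The selection step of the label refinement can be sketched as a ranking by discrepancy between the Step 1 prediction and the existing (noisy) label. The function name, the use of the absolute difference as the discrepancy measure and the `usable` callback (standing in for the manual check that seals are identifiable) are assumptions made for illustration.

```python
def select_for_relabelling(image_ids, noisy_labels, predictions, usable, n_select=100):
    """Return ids of the `n_select` usable images with the largest |prediction - label|."""
    ranked = sorted(
        zip(image_ids, noisy_labels, predictions),
        key=lambda t: abs(t[2] - t[1]),
        reverse=True,
    )
    selected = []
    for image_id, _, _ in ranked:
        if len(selected) == n_select:
            break
        # Unusable images are skipped, so the next most poorly predicted one takes their place.
        if usable(image_id):
            selected.append(image_id)
    return selected
```

For example, `select_for_relabelling(ids, labels, preds, usable=lambda i: True, n_select=100)` would return the 100 images on which the model and the existing labels disagree most, which are then recounted by hand.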
Data availability
The data used in this study are open-source and publicly available. Code and data associated with this study can
be obtained at https://doi.org/10.25850/nioz/7b.b0c.
Received: 18 August 2021; Accepted: 10 November 2021
References
1. Buckland, S. T. et al. Introduction to Distance Sampling: Estimating Abundance of Biological Populations (Oxford University Press,
2. Brasseur, S. M. J. M. et al. Echoes from the past: Regional variations in recovery within a harbour seal population. PLoS One 13,
e0189674 (2018).
3. Matthiopoulos, J., Fieberg, J. & Aarts, G. Species-Habitat Associations: Spatial Data, Predictive Models, and Ecological Insights
(University of Minnesota Libraries Publishing, 2020).
4. Walter, A. & Schurr, U. The modular character of growth in Nicotiana tabacum plants under steady-state nutrition. J. Exp. Bot. 50,
1169–1177 (1999).
5. Dobrescu, A., Valerio Giuffrida, M. & Tsaftaris, S. A. Leveraging multiple datasets for deep leaf counting. In Proceedings of the
IEEE International Conference on Computer Vision Workshops, 2072–2079 (2017).
6. Poiesz, S. S. H. et al. A comparison of growth in two juvenile flatfish species in the Dutch Wadden Sea: Searching for a mechanism
for summer growth reduction in flatfish nurseries. J. Sea Res. 144, 39–48 (2019).
7. Poiesz, S. S. H. et al. Is summer growth reduction related to feeding guild? A test for a benthic juvenile flatfish sole (Solea solea) in
a temperate coastal area, the western Wadden Sea. Estuar. Coast. Shelf Sci. 235, 106570 (2020).
8. Cremer, J. S. M., Brasseur, S. M. J. M., Meijboom, A., Schop, J. & Verdaat, J. P. Monitoring van gewone en grijze zeehonden in de
nederlandse waddenzee, 2002-2017. Tech. Rep., Wettelijke Onderzoekstaken Natuur & Milieu (2017).
9. Weinstein, B. G. A computer vision for animal ecology. J. Anim. Ecol. 87, 533–545 (2018).
10. Christin, S., Hervet, É. & Lecomte, N. Applications for deep learning in ecology. Methods Ecol. Evol. 10, 1632–1644 (2019).
11. Thessen, A. E. Adoption of machine learning techniques in ecology and earth science. One Ecosystem 1 (2016).
12. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 61, 85–117 (2015).
13. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
14. Eikelboom, J. A. J. et al. Improving the precision and accuracy of animal population estimates with aerial image object detection.
Methods Ecol. Evol. 10, 1875–1887 (2019).
15. Kellenberger, B., Marcos, D. & Tuia, D. Detecting mammals in UAV images: Best practices to address a substantially imbalanced
dataset with deep learning. Remote Sens. Environ. 216, 139–153 (2018).
16. Corcoran, E., Denman, S., Hanger, J., Wilson, B. & Hamilton, G. Automated detection of koalas using low-level aerial surveillance
and machine learning. Sci. Rep. 9, 1–9 (2019).
17. Zabawa, L. et al. Counting of grapevine berries in images via semantic segmentation using convolutional neural networks. ISPRS
J. Photogram. Remote Sens. 164, 73–83 (2020).
18. Moen, E. et al. Automatic interpretation of otoliths using deep learning. PLoS One 13, e0204713 (2018).
19. Vabø, R. et al. Automatic interpretation of salmon scales using deep learning. Ecol. Inform. 63, 101322 (2021).
20. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 770–778 (2016).
21. Girshick, R. Fast R-CNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1440–1448 (2015).
22. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. & Torralba, A. Learning deep features for discriminative localization. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, 2921–2929 (2016).
23. Howard, J. & Gugger, S. Deep Learning for Coders with Fastai and Pytorch: AI Applications Without a PhD (O’Reilly Media
Incorporated, 2020).
24. Ries, E. H., Hiby, L. R. & Reijnders, P. J. H. Maximum likelihood population size estimation of harbour seals in the Dutch Wadden
Sea based on a mark-recapture experiment. J. Appl. Ecol. 35, 332–339 (1998).
25. Albert, O. T., Kvalsund, M., Vollen, T. & Salberg, A.-B. Towards accurate age determination of Greenland halibut. J. Northwest Atl.
Fish. Sci. 40, 81–95 (2008).
26. Albert, O. T. Growth and formation of annual zones in whole otoliths of Greenland halibut, a slow-growing deep-water fish. Mar.
Freshw. Res. 67, 937–942 (2016).
27. Wang, M. & Deng, W. Deep visual domain adaptation: A survey. Neurocomputing 312, 135–153 (2018).
28. Ryan, D., Denman, S., Sridharan, S. & Fookes, C. An evaluation of crowd counting methods, features and regression models.
Comput. Vis. Image Underst. 130, 1–17 (2015).
29. Sindagi, V. A. & Patel, V. M. A survey of recent advances in CNN-based single image crowd counting and density estimation.
Pattern Recognit. Lett. 107, 3–16 (2018).
30. Lobry, S. & Tuia, D. Deep learning models to count buildings in high-resolution overhead images. In 2019 Joint Urban Remote
Sensing Event (JURSE), 1–4 (IEEE, 2019).
31. Jiang, Y. & Li, C. Convolutional neural networks for image-based high-throughput plant phenotyping: A review. Plant Phenomics
2020 (2020).
32. Li, Z., Guo, R., Li, M., Chen, Y. & Li, G. A review of computer vision technologies for plant phenotyping. Comput. Electron. Agric.
176 (2020).
33. Kellenberger, B., Marcos, D. & Tuia, D. When a few clicks make all the difference: Improving weakly-supervised wildlife detection
in UAV images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2019).
34. Marsden, M., McGuinness, K., Little, S., Keogh, C. E. & O’Connor, N. E. People, penguins and petri dishes: Adapting object counting
models to new visual domains and object types without forgetting. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 8070–8079 (2018).
35. Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural
Inf. Process. Syst. 28, 91–99 (2015).
36. Brasseur, S. M. J. M. et al. Rapid recovery of Dutch gray seal colonies fueled by immigration. Mar. Mamm. Sci. 31, 405–426 (2015).
37. Van der Veer, H. W., Witte, J. I., Flege, P., van der Molen, J. & Poiesz, S. S. H. Synchrony in plaice larval supply to European coastal
nurseries by different North Sea spawning grounds (in prep.)
38. LeCun, Y. et al. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1, 541–551 (1989).
39. Hochreiter, S., Bengio, Y., Frasconi, P. & Schmidhuber, J. Gradient flow in recurrent nets: The difficulty of learning long-term
dependencies. In A Field Guide to Dynamical Recurrent Neural Networks, chap. 14 (eds Kremer, S. C. & Kolen, J. F.) 237–244 (IEEE
Press, 2001).
40. Deng, J. et al. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 248–255 (IEEE, 2009).
41. Paszke, A. et al. Automatic differentiation in pytorch. In NIPS Workshop (2017).
42. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: A simple way to prevent neural networks
from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).
43. Smith, L.N. A disciplined approach to neural network hyper-parameters: Part 1—Learning rate, batch size, momentum, and
weight decay. arXiv:1803.09820 (2018).
44. Smith, L. N. & Topin, N. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial Intelligence
and Machine Learning for Multi-Domain Operations Applications, vol. 11006 (International Society for Optics and Photonics, 2019).
Acknowledgements
The authors would like to thank Peter Reijnders, André Meijboom, Hans Verdaat and Jessica Schop for their
help collecting the aerial seal images, Hans Witte and Henk van der Veer for their work on the otoliths and
Dainius Masiliunas for his assistance with the server.
Author contributions
S.B. and S.P. collected the data, J.H., G.A., S.B., B.K. and D.T. conceived the experiments, J.H. and B.K. conducted
the experiments and analysed the results. All authors wrote and reviewed the manuscript.
Funding
This study was funded by Nederlandse Organisatie voor Wetenschappelijk Onderzoek (project ALWPP.2017.003).
Competing interests
The authors declare no competing interests.
Additional information
Supplementary Information The online version contains supplementary material available at
https://doi.org/10.1038/s41598-021-02387-9.
Correspondence and requests for materials should be addressed to J.P.A.H.
Reprints and permissions information is available at
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2021
Content courtesy of Springer Nature, terms of use apply. Rights reserved
Terms and Conditions
Springer Nature journal content, brought to you courtesy of Springer Nature Customer Service Center GmbH (“Springer Nature”).
Springer Nature supports a reasonable amount of sharing of research papers by authors, subscribers and authorised users (“Users”), for small-
scale personal, non-commercial use provided that all copyright, trade and service marks and other proprietary notices are maintained. By
accessing, sharing, receiving or otherwise using the Springer Nature journal content you agree to these terms of use (“Terms”). For these
purposes, Springer Nature considers academic use (by researchers and students) to be non-commercial.
These Terms are supplementary and will apply in addition to any applicable website terms and conditions, a relevant site licence or a personal
subscription. These Terms will prevail over any conflict or ambiguity with regards to the relevant terms, a site licence or a personal subscription
(to the extent of the conflict or ambiguity only). For Creative Commons-licensed articles, the terms of the Creative Commons license used will
We collect and use personal data to provide access to the Springer Nature journal content. We may also use these personal data internally within
ResearchGate and Springer Nature and as agreed share it, in an anonymised way, for purposes of tracking, analysis and reporting. We will not
otherwise disclose your personal data outside the ResearchGate or the Springer Nature group of companies unless we have your permission as
detailed in the Privacy Policy.
While Users may use the Springer Nature journal content for small scale, personal non-commercial use, it is important to note that Users may
use such content for the purpose of providing other users with access on a regular or large scale basis or as a means to circumvent access
use such content where to do so would be considered a criminal or statutory offence in any jurisdiction, or gives rise to civil liability, or is
otherwise unlawful;
falsely or misleadingly imply or suggest endorsement, approval , sponsorship, or association unless explicitly agreed to by Springer Nature in
use bots or other automated methods to access the content or redirect messages
override any security feature or exclusionary protocol; or
share the content in order to create substitute for Springer Nature products or services or a systematic database of Springer Nature journal
In line with the restriction against commercial use, Springer Nature does not permit the creation of a product or service that creates revenue,
royalties, rent or income from our content or its inclusion as part of a paid for service or for other commercial gain. Springer Nature journal
content cannot be used for inter-library loans and librarians may not upload Springer Nature journal content on a large scale into their, or any
other, institutional repository.
These terms of use are reviewed regularly and may be amended at any time. Springer Nature is not obligated to publish any information or
content on this website and may remove it or features or functionality at our sole discretion, at any time with or without notice. Springer Nature
may revoke this licence to you at any time and remove access to any copies of the Springer Nature journal content which have been saved.
To the fullest extent permitted by law, Springer Nature makes no warranties, representations or guarantees to Users, either express or implied
with respect to the Springer nature journal content and all parties disclaim and waive any implied warranties or warranties imposed by law,
including merchantability or fitness for any particular purpose.
Please note that these rights do not automatically extend to content, data or other material published by Springer Nature that may be licensed
from third parties.
If you would like to use or distribute our Springer Nature journal content to a wider audience or on a regular basis or in any other manner not
expressly permitted by these Terms, please contact Springer Nature at
... Likewise, vehicle counting [4,15,16] provides civil engineers and city planners with tools for analyzing traffic patterns and street parking patterns. Object counting has also been applied to ecological surveying [17], where the population levels of animals like Penguins [2] and Steller Sea Lions [18] have been measured from images. Other applications include yield assessment and growth forecasting in agriculture via counting plant organs (such as fruits and leaves) [3,[19][20][21], and biomedical applications such as cell counting [22][23][24][25]. ...
... Due to arctic locations being challenging to access, methods that allow for remote penguin counting are highly valued. Similarly, object counting has been explored for Seal population analysis [17,18]. While the number of ecological surveying benchmark datasets is small, the Penguins dataset [2] is the largest commonly used object counting dataset available. ...
... Synthetic data from GCC dataset [12] (b) Real data from ShanghaiTech B dataset [1] Fig. 17: Zoomed in crops of crowd counting images from a real and a synthetic dataset. The underlying domain gap makes task transfer from synthetic to real domains a challenging problem. ...
Full-text available
Object counting is an important computer vision application and research topic, which typically involves enumerating the number of objects in an image. Methodologies spanning a broad set of strategies have been proposed for solving object counting problems. These methods have seen an increase in relevance with the recent emergence of several highly successful deep learning techniques, which have led to significant performance improvements on a growing number of annotated counting benchmark datasets. However, despite the recent advancements in deep learning and computer vision, object counting remains a challenging problem with several open research directions. Datasets often contain objects that are highly occluded and which occur across a range of scales and perspectives. Further, popular annotation strategies, like density map annotations, suffer from annotator noise and inconsistency, which creates a performance bottleneck. These annotation strategies also have a high annotation burden, which leads to datasets that are very small when compared to common benchmark datasets in domains like image classification. Given both the significant progress and continued challenges of object counting, this task continues to be an interesting and ongoing research problem. This overview explores the historical context of object counting methods, the fundamental methodologies driving progress, the state of the art methods, and the significant open problems. In particular, we focus on recent trends that attempt to alleviate the problem of the annotation burden for object counting problems.
... Some current drone applications with pinnipeds leverage thermal or multispectral imagery to facilitate detection by high contrast in drone imagery (Seymour et al. 2017, Sweeney et al. 2019, Larsen et al. 2022b), but many more studies rely exclusively on visible-light photography to detect pinnipeds. With visible-light aerial imagery, deep learning techniques have already been applied to estimate aggregate pinniped counts (Hoekendijk et al. 2021), detect individual pinnipeds (Dujon et al. 2021), and classify pinnipeds by age class (Salberg 2015, Infantes et al. 2022, though success and generalisability vary widely between examples. Upcoming applications also include deep learning for photogrammetry, as has already been demonstrated with dronebased photography of cetaceans (Gray et al. 2019) and more recently with harbour seals (Infantes et al. 2022), and deep learning for individual identification, as has been demonstrated with ground-based photography of harbour seals (Nepovinnykh et al. 2018, Birenbaum et al. 2022. ...
Full-text available
Pinniped species undergo uniquely amphibious life histories that make them valuable subjects for many domains of research. Pinniped research has often progressed hand‐in‐hand with technological frontiers of wildlife biology, and drones represent a leap forward for methods of aerial remote sensing, enabling data collection, and integration at new scales of biological importance. Drone methods and data types provide four key opportunities for wildlife surveillance that are already advancing pinniped research and management: 1) repeat and on‐demand surveillance, 2) high‐resolution coverage at large extents, 3) morphometric photogrammetry, and 4) computer vision and deep learning applications. Drone methods for pinniped research represent early stages of technological adoption and can reshape the field as they scale towards the full potential of their techniques.
... Class-specific counting methods aim to enumerate the instances of a single or small set of known classes [1,5,8,23]. These methods struggle to adapt to novel classes and need new data and retraining for each type of object. ...
Full-text available
Class-agnostic counting methods enumerate objects of an arbitrary class, providing tremendous utility in many fields. Prior works have limited usefulness as they require either a set of examples of the type to be counted or that the image contains only a single type of object. A significant factor in these shortcomings is the lack of a dataset to properly address counting in settings with more than one kind of object present. To address these issues, we propose the first Multi-class, Class-Agnostic Counting dataset (MCAC) and A Blind Counter (ABC123), a method that can count multiple types of objects simultaneously without using examples of type during training or inference. ABC123 introduces a new paradigm where instead of requiring exemplars to guide the enumeration, examples are found after the counting stage to help a user understand the generated outputs. We show that ABC123 outperforms contemporary methods on MCAC without the requirement of human in-the-loop annotations. We also show that this performance transfers to FSC-147, the standard class-agnostic counting dataset.
... Deep neural networks for regression (DNNR) have been broadly employed in various fields, such as climate [3], agricultural planting [4], and renewable energy forecasting [5]. Several studies have revealed that adjusting the models' hyperparameters through algorithms can achieve better performance. ...
Full-text available
Due to rapid development in information technology in both hardware and software, deep neural networks for regression have become widely used in many fields. The optimization of deep neural networks for regression (DNNR), including selections of data preprocessing, network architectures, optimizers, and hyperparameters, greatly influence the performance of regression tasks. Thus, this study aimed to collect and analyze the recent literature surrounding DNNR from the aspect of optimization. In addition, various platforms used for conducting DNNR models were investigated. This study has a number of contributions. First, it provides sections for the optimization of DNNR models. Then, elements of the optimization of each section are listed and analyzed. Furthermore, this study delivers insights and critical issues related to DNNR optimization. Optimizing elements of sections simultaneously instead of individually or sequentially could improve the performance of DNNR models. Finally, possible and potential directions for future study are provided.
... For both datasets, AED [6] and SD [7], the ground-truth density maps are generated using dot-annotated images (which symbolize the centre of each object) convolved with a Gaussian filter with a fixed variance (σ 2 ). The variance of the Gaussian filter is commonly based either on the size of the object of interest (for low density image) [5,11] or the distance between the objects of interest (for high density images) [12]. We now describe these two techniques. ...
Anthropogenic activities pose threats to wildlife and marine fauna, prompting the need for efficient animal counting methods. This research study utilizes deep learning techniques to automate counting tasks. Inspired by previous studies on crowd and animal counting, a UNet model with various backbones is implemented, which uses Gaussian density maps for training, bypassing the need of training a detector. The new model is applied to the task of counting dolphins and elephants in aerial images. Quantitative evaluation shows promising results, with the EfficientNet-B5 backbone achieving the best performance for African elephants and the ResNet18 backbone for dolphins. The model accurately locates animals despite complex image background conditions. By leveraging artificial intelligence, this research contributes to wildlife conservation efforts and enhances coexistence between humans and wildlife through efficient object counting without detection from aerial remote sensing.
... Recently, Hoekendijk et al. (2021) proposed a deep CNN model to regress the count of objects of interest in the image. Their model is composed of a ResNet and two fully connected layers. ...
Animal abundance estimation is increasingly based on drone or aerial survey photography. Manual postprocessing has been used extensively; however, volumes of such data are increasing, necessitating some level of automation, either for complete counting, or as a labour-saving tool. Any automated processing can be challenging when using such tools on species that nest in close formation such as Pygoscelis penguins. We present here a customized CNN-based density map estimation method for counting penguins from low-resolution aerial photography. Our model, an indirect regression algorithm, performed significantly better in terms of counting accuracy than a standard detection algorithm (Faster-RCNN) when counting small objects from low-resolution images and gave an error rate of only 0.8 percent. Density map estimation methods as demonstrated here can vastly improve our ability to count animals in tight aggregations and demonstrably improve monitoring efforts from aerial imagery.
... Machine learning models such as convolutional neural networks (CNNs) have been effective at image processing [42][43][44][45][46][47][48][49]. Several early attempts at using CNNs to detect whales [25,27] and seals [50] have been made, although these studies all used down-sampled three-band RGB imagery rather than the full range of spectral bands collected by WorldView-2 and WorldView-3 sensors. In addition to advances in detection algorithms, methodology for estimating whale density is needed to derive abundance estimates. ...
Monitoring marine mammals is of broad interest to governments and individuals around the globe. Very high-resolution (VHR) satellites hold the promise of reaching remote and challenging locations to fill gaps in our knowledge of marine mammal distribution. The time has come to create an operational platform that leverages the increased resolution of satellite imagery, proof-of-concept research, advances in cloud computing, and machine learning to monitor the world’s oceans. The Geospatial Artificial Intelligence for Animals (GAIA) initiative was formed to address this challenge with collaborative innovation from government agencies, academia, and the private sector. In this paper, we share lessons learned, challenges faced, and our vision for how VHR satellite imagery can enhance our understanding of cetacean distribution in the future.
... Techniques such as the scale invariant feature transform (SIFT) [1] and speeded up robust features (SURF) [2] are examples of identifying objects through image processing by searching for dominant features. A more modern approach to image-based detection uses machine learning (ML) and deep neural networks (DNN) [3][4]. ML- and DNN-based techniques depend heavily on the data collection and on the training process. ...
A measuring tool that can count stacked items automatically and quickly is in high demand in industry and business. Coin counting generally relies on machines that can cause physical damage and produce pollution and, if the coins carry viruses or bacteria, can spread disease. This paper therefore proposes a counter for the number and type of stacked coins that uses row-wise light intensity as the object detector. The object type is determined from the pixel width of the coin's illumination. The number of coins can then be counted from the objects identified. On 70 sample images of 1, 5, 10, and 50 NTD coins, an accuracy of up to 98.98% was obtained, and coins were correctly identified in 88.6% of cases.
... The eye can identify a cementum annulus, ignore artifacts, search for discontinuous cementum lines and complete discontinuities that the software is unable to interpret, and adapt. Advances in analytical tools such as machine learning already applied in sclerochronology (Hoekendijk et al., 2021; Moen et al., 2018) and improvement of open-source plugins for analyzing micrographs may represent a valuable prospect for future applications. ...
It has been repeatedly acknowledged that age‐at‐death estimation based on dental cementum represents a partial and time‐consuming method that hinders adoption of this histological approach. User‐friendly micrograph analysis is an increasingly frequent request in cementochronology. This article evaluates the feasibility of using a module to accurately quantify cementum deposits and compares the module's performance to that of a human expert. On a dental collection (n = 200) of known‐age individuals, precision and accuracy of estimates performed by a developed program (101 counts/tooth; n = 20,200 counts) were compared to counts performed manually (5 counts/tooth; n = 975 counts). Reliability of the software and agreement between the two approaches were assessed by intraclass correlation coefficient and Bland‐Altman analysis. The automated module produced reliable and reproducible counts with a higher global precision than the human expert. Although the software is slightly more precise, it shows higher sensitivity to taphonomic damage and does not avoid the trajectory effect described for age‐at‐death estimation in adults. Likewise, for human counts, global accuracy is acceptable, but underestimations increase with age. The quantification of the agreement between the two approaches shows a minor bias, and 94% of individuals fall within the intervals of agreement. Automation gives an impression of objectivity even though the region of interest, profile position and parameters are defined manually. The automated system may represent a time‐saving module that can allow an increase in sample size, which is particularly stimulating for population‐based studies.
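The agreement analysis mentioned in this abstract can be sketched as follows. This is a generic Bland-Altman computation under stated assumptions, not the cited study's procedure: the function name `bland_altman`, the 1.96 multiplier for the limits of agreement, and the sample counts are all hypothetical.

```python
import statistics

def bland_altman(automated, manual):
    """Return (bias, lower limit, upper limit, fraction within limits).

    bias   : mean of the paired differences (automated - manual)
    limits : bias +/- 1.96 * sample standard deviation of the differences
    """
    diffs = [a - m for a, m in zip(automated, manual)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    lower = bias - 1.96 * sd
    upper = bias + 1.96 * sd
    within = sum(lower <= d <= upper for d in diffs) / len(diffs)
    return bias, lower, upper, within

# Hypothetical paired counts for five specimens (illustrative values only).
automated_counts = [10, 12, 15, 20, 22]
manual_counts = [11, 12, 14, 21, 20]
bias, lower, upper, within = bland_altman(automated_counts, manual_counts)
```

A bias near zero with narrow limits indicates that the two counting approaches can be used interchangeably; the share of pairs inside the limits corresponds to the kind of figure reported above (94% of individuals).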
For several fish species, age and other important biological information is manually inferred from visual scrutinization of scales, and reliable automatic methods are not widely available. Here, we apply Convolutional Neural Networks (CNN) with transfer learning on a novel dataset of 9056 images of Atlantic salmon scales for four different prediction tasks. We predicted fish origin (wild/farmed), spawning history (previous spawner/non-spawner), river age, and sea age. We obtained high prediction accuracy for fish origin (96.70%), spawning history (96.40%), and sea age (86.99%), but lower accuracy for river age (63.20%). Against six human expert readers with an additional dataset of 150 scales, the CNN showed the second-highest percentage agreement for sea age (94.00%, range 87.25–97.30%), but the lowest agreement for river age (66.00%, range 66.00–84.68%). Estimates of river age by expert readers exhibited higher variance and lower levels of agreement compared to sea age and may indicate why this task is also more difficult for the CNN. Automatic interpretation of scales may provide a cost- and time-efficient method of predicting fish age and life-history traits.
Ecologists develop species-habitat association (SHA) models to understand where species occur, why they are there and where else they might be. This knowledge can be used to designate protected areas, estimate anthropogenic impacts on living organisms and assess risks from invasive species or disease spill-over from wildlife to humans. Here, we describe the state of the art in SHA models, looking beyond the apparent correlations between the positions of organisms and their local environment. We highlight the importance of ecological mechanisms, synthesize diverse modelling frameworks and motivate the development of new analytical methods. Above all, we aim to be synthetic, bringing together several of the apparently disconnected pieces of ecological theory, taxonomy, spatiotemporal scales, and mathematical and statistical technique in our field. The first edition of this ebook reviews the ecology of species-habitat associations, the mechanistic interpretation of existing empirical models and their shared statistical foundations that can help us draw scientific insights from field data. It will be of interest to graduate students and professionals looking for an introduction to the ecological and statistical literature of SHAs, practitioners seeking to analyse their data on animal movements or species distributions and quantitative ecologists looking to contribute new methods addressing the limitations of the current incarnations of SHA models.
The number of leaves a plant has is one of the key traits (phenotypes) describing its development and growth. Here, we propose an automated, deep learning based approach for counting leaves in model rosette plants. While state-of-the-art results on leaf counting with deep learning methods have recently been reported, they obtain the count as a result of leaf segmentation and thus require per-leaf (instance) segmentation to train the models (a rather strong annotation). Instead, our method treats leaf counting as a direct regression problem and thus only requires as annotation the total leaf count per plant. We argue that combining different datasets when training a deep neural network is beneficial and improves the results of the proposed approach. We evaluate our method on the CVPPP 2017 Leaf Counting Challenge dataset, which contains images of Arabidopsis and tobacco plants. Experimental results show that the proposed method significantly outperforms the winner of the previous CVPPP challenge, improving the results by a minimum of 50% on each of the test datasets, and can achieve this performance without knowing the experimental origin of the data (i.e. “in the wild” setting of the challenge). We also compare the counting accuracy of our model with that of per leaf segmentation algorithms, achieving a 20% decrease in mean absolute difference in count (|DiC|).
Plant phenotyping has been recognized as a bottleneck for improving the efficiency of breeding programs, understanding plant-environment interactions, and managing agricultural systems. In the past five years, imaging approaches have shown great potential for high-throughput plant phenotyping, resulting in more attention paid to imaging-based plant phenotyping. With this increased amount of image data, it has become urgent to develop robust analytical tools that can extract phenotypic traits accurately and rapidly. The goal of this review is to provide a comprehensive overview of the latest studies using deep convolutional neural networks (CNNs) in plant phenotyping applications. We specifically review the use of various CNN architectures for plant stress evaluation, plant development, and postharvest quality assessment. We systematically organize the studies based on technical developments resulting from image classification, object detection, and image segmentation, thereby identifying state-of-the-art solutions for certain phenotyping applications. Finally, we provide several directions for future research in the use of CNN architectures for plant phenotyping purposes.
Animal population sizes are often estimated using aerial sample counts by human observers, both for wildlife and livestock. The associated counting methods have remained more or less the same since the 1970s, but suffer from low precision and low accuracy of population estimates. Aerial counts using cost‐efficient Unmanned Aerial Vehicles or microlight aircraft with cameras and an automated animal detection algorithm can potentially improve this precision and accuracy. Therefore, we evaluated the performance of the multi‐class convolutional neural network RetinaNet in detecting elephants, giraffes and zebras in aerial images from two Kenyan animal counts. The algorithm detected 95% of the number of elephants, 91% of giraffes and 90% of zebras that were found by four layers of human annotation, of which it correctly detected an extra 2.8% of elephants, 3.8% giraffes and 4.0% zebras that were missed by all humans, while detecting only 1.6 to 5.0 false positives per true positive. Furthermore, the animal detections by the algorithm were less sensitive to the sighting distance than humans were. With such a high recall and precision, we posit it is feasible to replace manual aerial animal count methods (from images and/or directly) by only the manual identification of image bounding boxes selected by the algorithm and then use a correction factor equal to the inverse of the undercounting bias in the calculation of the population estimates. This correction factor causes the standard error of the population estimate to increase slightly compared to a manual method, but this increase can be compensated for if the sampling effort is increased by 23%. However, an increase in sampling effort of 160% to 1,050% can be attained with the same expenses for equipment and personnel using our proposed semi‐automatic method compared to a manual method. Therefore, we conclude that our proposed aerial count method will improve the accuracy of population estimates and will decrease the standard error of population estimates by 31% to 67%. Most importantly, this animal detection algorithm has the potential to outperform humans in detecting animals from the air when supplied with images taken at a fixed rate.
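The correction factor described in this abstract amounts to simple arithmetic, sketched below. The sketch assumes the undercounting bias is expressed as the fraction of animals the algorithm detects (for example 0.95 for elephants); the function name `corrected_count` and the input count of 95 are hypothetical and do not come from the cited study.

```python
def corrected_count(detected, detection_rate):
    """Scale a raw detection count by the inverse of the detection rate.

    detection_rate is the assumed fraction of animals actually present
    that the algorithm detects (0 < detection_rate <= 1).
    """
    if not 0.0 < detection_rate <= 1.0:
        raise ValueError("detection_rate must be in (0, 1]")
    correction_factor = 1.0 / detection_rate
    return detected * correction_factor

# Hypothetical survey: 95 confirmed elephant detections at a 95% detection rate.
estimate = corrected_count(95, 0.95)
```

With these illustrative numbers the corrected estimate is approximately 100 animals. The correction removes the systematic undercount but, as the abstract notes, slightly inflates the standard error of the population estimate.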
Plant phenotype plays an important role in genetics, botany, and agronomy, while the currently popular methods for phenotypic trait measurement have some limitations in terms of cost, performance, and space-time coverage. With the rapid development of imaging technology, computing power, and algorithms, computer vision has thoroughly revolutionized plant phenotyping and is now a major tool for phenotypic analysis. For these reasons, researchers are devoted to developing image-based plant phenotyping methods as a complement or even an alternative to manual measurement. However, the use of computer vision technology to analyze plant phenotypic traits can be affected by many factors such as research environment, imaging system, research object, feature extraction, model selection, and so on. Currently, there is no review paper that compares and analyzes these methods thoroughly. Therefore, this review introduces the typical plant phenotyping methods based on computer vision in detail, with their principle, applicable range, results, and comparison. This paper extensively reviews 200+ papers on plant phenotyping in the light of its technical evolution, spanning over twenty years (from 2000 to 2020). A number of topics are covered in this paper, including imaging technologies, plant datasets, and state-of-the-art phenotyping methods. In this review, we categorize plant phenotyping into two main groups: plant organ phenotyping and whole-plant phenotyping. Furthermore, for each group, we analyze the individual studies and discuss the limitations of the current approaches and future research directions.
The extraction of phenotypic traits is often very time- and labour-intensive. In viticulture in particular, investigation is restricted to on-site analysis due to the perennial nature of grapevine. Traditionally, skilled experts examine small samples and extrapolate the results to a whole plot. Different grapevine varieties and training systems, e.g. vertical shoot positioning (VSP) and semi minimal pruned hedges (SMPH), pose different challenges. In this paper we present an objective framework based on automatic image analysis which works on two different training systems. The images are collected semi-automatically by a camera system which is installed in a modified grape harvester. The system produces overlapping images from the sides of the plants. Our framework uses a convolutional neural network to detect single berries in images by performing a semantic segmentation. Each berry is then counted with a connected component algorithm. We compare our results with Mask-RCNN, a state-of-the-art network for instance segmentation, and with a regression approach for counting. The experiments presented in this paper show that we are able to detect green berries in images despite the different training systems. We achieve a berry detection accuracy of 94.0% in the VSP and 85.6% in the SMPH.
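The counting step in this abstract (segment first, then count connected components) can be illustrated with a minimal sketch. It assumes a binary segmentation mask is already available and uses 4-connectivity with a breadth-first search; the function name `count_components` and the toy mask are hypothetical, and the cited framework may use a different connectivity or library routine.

```python
from collections import deque

def count_components(mask):
    """Count 4-connected foreground regions in a binary mask (list of rows)."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    count = 0
    for r in range(height):
        for c in range(width):
            if mask[r][c] and not seen[r][c]:
                count += 1  # found a new, unvisited region
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count

# Toy mask with three separate foreground regions standing in for berries.
toy_mask = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
]
berry_count = count_components(toy_mask)
```

Touching berries merge into a single component under this scheme, which is one reason such pipelines are compared against instance segmentation and direct count regression.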