Complex image classification via crowdsourcing for conservation: a viable solution?
Edgar Santos-Fernandez
School of Mathematical Sciences. Queensland University of Technology. Australia.
Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS)
email: santosfe@qut.edu.au; edgar.santosfdez@gmail.com
and
Julie Vercelloni
School of Mathematical Sciences. Queensland University of Technology. Australia.
Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS)
email: j.vercelloni@qut.edu.au
and
Bryce Christensen
Institute for Future Environments. Queensland University of Technology. Australia.
email: bryce.christensen@qut.edu.au
and
Erin E. Peterson
School of Mathematical Sciences. Queensland University of Technology. Australia.
Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS)
email: erin@peterson-consulting.com
and
Kerrie Mengersen
School of Mathematical Sciences. Queensland University of Technology. Australia.
Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS)
email: k.mengersen@qut.edu.au
Summary: Crowdsourcing methods allow the production of scientific information by non-experts and are becoming a key tool for addressing complex challenges in ecological research. In some cases the participants in crowdsourcing programs are familiar with the ecological species or categories involved in the task, whereas in many others substantial training and qualifications
are required to produce suitable data. Many tasks remain difficult, even for well-qualified participants. The feasibility of
crowdsourcing for ecological problems is assessed, specifically in the presence of difficult classification tasks. We present a case
study involving the classification of hard corals, from underwater images from the Great Barrier Reef, Australia. We compare
several majority vote algorithms and introduce a Bayesian approach based on item response models, which achieves superior
results. These statistical models produce estimates of participants’ abilities and allow participants to be clustered into groups.
For difficult tasks, the weighted consensus method that uses expert and experienced participants was found to produce better performance measures. This method of aggregating participants' answers correctly identified three out of four points containing hard corals (sensitivity) and achieved nearly 80% accuracy. We also found that participants learn as they get more classification opportunities, which substantially increases their abilities over time. This study demonstrates the feasibility of crowdsourcing for answering complex and challenging ecological questions when the data are appropriately analysed.
Key words: Bayesian inference; citizen science; item response model.
Preprint, 1–13, Apr 2021
1. Introduction
Crowdsourcing methods are booming in ecological research,
engaging millions of collaborators worldwide, producing a vast
volume of information in a timely manner, and raising aware-
ness of environmental conservation (Hsu et al., 2014; Fritz
et al., 2019). Citizen science platforms such as iNaturalist,
eButterfly and Zooniverse have gained popularity in recent
years because they have the potential to reduce the workload
of conservation experts. However, considerable concern has
been raised regarding research involving data elicited by non-
expert citizen scientists. Much of this concern focuses on
the potential for bias arising from the unstructured nature
of the data and the differing abilities of the participants. A
substantial body of research is emerging on the development
of methods for correcting these biases and eliminating non-
informative data, thereby improving the quality of the data
and enabling the production of valuable scientific outputs.
One set of problems that is particularly popular for crowd-
sourcing is the classification of objects in images. Many clas-
sification projects can be found in online platforms such as
Zooniverse, iNaturalist and eButterfly.
While in some citizen science projects the individual classification accuracy is high, ranging from 70 to 95% (e.g. Kosmala et al., 2016), aggregation via consensus or more complex methods is often required to produce reliable classifications and estimates (Santos-Fernandez et al., 2021; Santos-Fernández and Mengersen, 2021). Many aggregation methods, including majority voting, enjoy great success in the literature. These methods rely on the assumption that participants have a greater than 50% chance of answering correctly. However, such an assumption is not always satisfied in
the presence of difficult tasks, in which the majority of the
classifications can be incorrect (Raykar et al., 2010).
In this article, we investigate the feasibility of crowdsourc-
ing methods as a viable solution for manual classifications
of images when the task is difficult, in the sense that spe-
cific skills are required from participants to obtain satisfying
results. This may occur, for example, if the images repre-
sent complex ecosystems, depicting aggregations of diverse
communities that change in space and time. Difficult tasks
can also be related to images produced by different sources,
such as environmental monitoring programs and cameras. We
study how a suitable statistical design can improve infor-
mation from participants, to help answer relevant ecological
problems and estimate complex measures such as ecosystem
health. We then assess changes in participants’ abilities over
time and whether they learn with more classification op-
portunities. We explore methods for aggregating the partici-
pants’ answers via majority voting. Finally, we identify those
categories that are harder to classify and the factors affecting the difficulty of the images.
We present a case study of the classification of underwater
images from the Great Barrier Reef, Australia. In the exper-
iment we show coral reef images to non-expert participants
and ask them to classify five broad categories of benthic
communities using the Amazon Mechanical Turk platform.
The statistical approach developed in this paper allows us
to estimate the most difficult benthic categories, investigate
different types of difficulty, including those related to images
and cameras, and detect different groups of participants,
based on their abilities to perform the required tasks. We
also examine the relationship between classification time and
the participants’ latent abilities. The classification dataset is
made available for further research.
We refer to records such as audio recordings, videos and
images as items. Although this study deals with the manual classification of images, the methodology can be extended to other sources of citizen science data. We use the term label to refer to the true underlying category on the images.
1.1 An overview of foundational IRT models
In this section, we briefly introduce the item response models
that underpin the methods we propose.
The foundations of item response theory (IRT) were laid by
Rasch (Rasch, 1960). A canonical example of this approach
is the estimation of the ability of participants who answer a
set of questions (items), taking into account the difficulty of
the questions. The basic one-parameter logistic model (1PL),
also known as the Rasch model, involves a binary response
variable and is similar to a logistic regression model.
Several extensions and variations were proposed soon after
publication of Rasch’s article. For example, the two-parameter
logistic model (2PL) incorporates a discrimination parameter
or slope, that determines the discriminatory power of the
item, and the three-parameter logistic model (3PL) allows
for a degree of guessing, which results in participants having
a greater than zero chance of answering correctly regardless
of their abilities (Birnbaum, 1968).
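As a minimal illustration of these foundational models, the sketch below fits 1PL, 2PL and 3PL models to simulated binary responses using the mirt package (Chalmers, 2012), which is also used later in this paper; the simulated data and settings are illustrative only.

```r
# Minimal sketch: fitting 1PL (Rasch), 2PL and 3PL models to simulated
# binary responses with the 'mirt' package. Data are simulated, not the
# case-study dataset.
library(mirt)

set.seed(1)
resp <- simdata(a = rep(1, 10),       # common discrimination
                d = rnorm(10),        # item intercepts (related to difficulty)
                N = 200,              # 200 simulated participants
                itemtype = "dich")    # dichotomous items

fit_1pl <- mirt(resp, model = 1, itemtype = "Rasch")  # one-parameter (Rasch) model
fit_2pl <- mirt(resp, model = 1, itemtype = "2PL")    # adds a discrimination (slope) parameter
fit_3pl <- mirt(resp, model = 1, itemtype = "3PL")    # adds a pseudo-guessing parameter

coef(fit_2pl, simplify = TRUE)$items  # estimated item parameters
head(fscores(fit_2pl))                # estimated latent abilities
```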
Many other extensions, examples and implementations of
these models can be found in published textbooks e.g. Baker
and Kim (2004); van der Linden and Hambleton (2013); Em-
bretson and Reise (2013); De Ayala (2013). IRT modelling has
been approached from the generalized linear model (GLM)
perspective. In a spatial context, Can¸cado et al. (2016) has
suggested a model that borrows some principles of spatial
statistics for the identification of spatial clusters. Another
model that considers temporal and spatial dependence was
proposed by Juhl (2019). Wang et al. (2013) introduced a
dynamic item response approach aimed at characteristics
measured longitudinally. This dynamic approach produces
growth curves showing how the participants’ abilities change
over time. Within the Bayesian philosophy, multiple model
formulations have been suggested e.g. Patz and Junker (1999);
Fox (2010); Albert (2015).
2. Materials and methods
In this section, we discuss majority voting algorithms that are
generally used to estimate the true labels in manual image
classification.
We consider a set of images j = 1, 2, ..., J, each composed of k = 1, 2, ..., K elicitation points selected using a spatially balanced random sampling approach. We deal with a binary classification task, in which a participant i is asked whether a category is present at a point belonging to a given image.
Let z_{ijk} be the answer of participant i on the kth point from the jth image:

z_{ijk} = \begin{cases} 1 & \text{if the participant considered the category to be present} \\ 0 & \text{otherwise} \end{cases} \qquad (1)
2.1 Majority vote (MV)
By design, multiple participants classify the same point k. Based on these answers, we obtain the majority vote by aggregating the answers, so that the category with the highest proportion of votes (i.e. the mode) wins:

\hat{z}_{jk} = \begin{cases} 1 & \text{if the proportion of votes for “species present”} > 0.5 \\ 0 & \text{otherwise} \end{cases} \qquad (2)
In general, this approach performs poorly for difficult tasks.
This is because only expert participants are likely to respond
correctly, and they can be outvoted by beginners (Raykar
et al., 2010).
A variation of this method is obtained using a weighted
majority voting (WMV) which has been discussed, among
others, by Littlestone et al. (1989); Lintott et al. (2008); Hines
et al. (2015). In this method, each participant has a vote
proportional to some weights, based on their knowledge, skills
or past performance.
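As a minimal sketch of these two aggregation rules (the vote vector and weights below are illustrative values, not taken from the case study):

```r
# Minimal sketch: majority vote (MV) and weighted majority vote (WMV) for a
# single classification point; votes and weights are illustrative values.
majority_vote <- function(votes, weights = NULL) {
  # votes: binary vector (1 = "category present") from the participants
  # weights: optional non-negative weights, e.g. based on skill or past performance
  if (is.null(weights)) weights <- rep(1, length(votes))
  as.integer(weighted.mean(votes, weights) > 0.5)
}

votes   <- c(1, 0, 1, 1, 0)              # five participants classify one point
ability <- c(2.1, -0.5, 1.3, 0.2, -1.0)  # hypothetical skill scores
majority_vote(votes)                          # unweighted MV
majority_vote(votes, weights = exp(ability))  # WMV: votes weighted by skill
```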
2.2 Consensus based on a Bayesian item response model
The estimation of participants’ abilities in crowdsourcing data
has been widely discussed in the literature (Whitehill et al.,
2009; Welinder and Perona, 2010; Paun et al., 2018). In this
research we develop an item response model with the aim of
informing a weighted consensus voting approach. This model
is set in the context of a broader workflow, described as
follows.
(1) Produce a representative set of gold standard images, e.g. 33% of the total number of images. For this set of images, the true labels or answers are obtained from expert elicitation or another suitable method. Images in this set will be scored by most of the participants.
(2) Fit an item response model and obtain estimates of the participants' abilities, accounting for difficulties, guessing, etc.
(3) Obtain weights for every participant proportional to the ability estimates.
(4) Perform the weighted consensus vote to obtain the estimated labels (see the sketch after this list).
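A minimal sketch of steps (3) and (4), assuming posterior mean abilities theta_hat (named by participant) from step (2) and a long-format data frame of classifications; the column names and the exponential weighting are illustrative choices, not the authors' exact implementation:

```r
# Minimal sketch of steps (3)-(4): ability-based weights and weighted consensus.
# 'dat' is assumed to hold one row per classification with columns
# annotator, media, x, y and a binary column 'present' (names illustrative).
library(dplyr)

weights <- tibble(annotator = names(theta_hat),
                  w = exp(theta_hat))            # weights proportional to ability (step 3)

labels_hat <- dat %>%
  left_join(weights, by = "annotator") %>%
  group_by(media, x, y) %>%                      # one group per elicitation point
  summarise(z_hat = as.integer(weighted.mean(present, w) > 0.5),  # step 4
            .groups = "drop")
```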
We embed the description of the models in the context
of classification of images. These images can vary in quality,
for example, they can be captured by volunteers, camera
traps or professional monitoring programs. The images are
manually classified by volunteers to obtain the latent labels,
which can be used for example to estimate the occupancy
of species. A popular setting is to assess whether images
contain a target category or species. Another option for high
abundance species or coverage on images is whether species
are present at elicited points on the images. The number of points used in practice generally ranges between 5 and 50 per image; the optimal number of points can be determined using a sampling theory approach based on the size, abundance and distributional patterns of the ecological groups of interest (Perkins et al., 2016). In our case study, we use 15 classification points per image on a total of 514 images, on which five possible benthic categories were classified by the participants. This number of classification points ensures adequate sampling per image to detect at least a 10% change in coral cover, as suggested by Stoddart and Stoddart (2005), Beijbom et al. (2015) and Williams et al. (2019).
2.3 Item response model
Let the binary response variable Y_{ijk} represent whether a question associated with the kth point (k = 1, ..., K) on the jth image (j = 1, ..., J) is correctly answered or not by the ith participant (i = 1, ..., I). A full list of symbols and definitions can be found in Table 1 of the Supporting Information. We assume that Y_{ijk} follows a Bernoulli distribution with parameter p_{ijk}:

Y_{ijk} \sim \mathrm{Bern}(p_{ijk}) \qquad (3)
We use an extension of the item response model, namely
the linear logistic test model (LLTM) (Fischer, 1973; Wilson
et al., 2008). The probability that participant i correctly identifies the category on the kth point from the jth image taken using the lth camera is given by:

p_{ijkl} = \eta_k + (1 - \eta_k)\,\frac{1}{1 + \exp\{-\alpha_k(\theta_i - \beta_k - \beta_l)\}} \qquad (4)

where \beta_k and \beta_l are difficulties associated with the point and the camera, respectively. The parameter \alpha_k gives the slope of the logistic curve and \eta_k is a pseudo-guessing parameter associated with the point, indicating the probability of answering correctly due to guessing.
To avoid identifiability issues with the model, the ability of an average participant was anchored by setting it equal to zero. Using the mirt package (Chalmers, 2012) in R, we identified this reference participant as the one with an ability score closest to zero.
We use Bayesian inference and therefore need to define prior distributions for the parameters of interest in Eq. 4:

\theta_i \sim N(0, \sigma_\theta) (hierarchical prior on the abilities)
\sigma_\theta \sim \mathrm{Uniform}(0, 10) (flat prior on a weakly informative range for the s.d. of the users' abilities)
\beta_k \sim N(\mu_{\beta_k}, \sigma_{\beta_k}) (hierarchical prior on the item difficulties)
\mu_{\beta_k} \sim N(0, 5) (weakly informative prior for the mean of the item difficulties)
\sigma_{\beta_k} \sim \mathrm{Cauchy}(0, 5)\,T(0,) (informative prior for the s.d. of the item difficulties, allowing for substantially complex tasks)
\beta_l \sim N(\mu_{\beta_l}, \sigma_{\beta_l}) (hierarchical prior on the camera difficulties)
\mu_{\beta_l} \sim N(0, 5) (weakly informative prior for the mean of the camera difficulties)
\sigma_{\beta_l} \sim \mathrm{Cauchy}(0, 5)\,T(0,) (informative prior for the s.d. of the camera difficulties)
\alpha_k \sim N(1, \sigma_\alpha) (normal prior with mean 1 on the slope)
\sigma_\alpha \sim \mathrm{Cauchy}(0, 5)\,T(0,) (half-Cauchy prior on the slope s.d., truncated at 0)
\eta_k \sim \mathrm{Beta}(1, 5) (weakly informative prior on the pseudo-guessing)
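A minimal rstan sketch of the model in Eq. (4) with these priors is given below; the data layout and variable names are illustrative, and the authors' full implementation is available in the GitHub repository cited in the Discussion.

```r
# Minimal sketch: the LLTM-style model of Eq. (4) coded for rstan.
# Data structure and names are illustrative only.
library(rstan)

stan_code <- "
data {
  int<lower=1> N;                    // number of classifications
  int<lower=1> I;                    // number of participants
  int<lower=1> K;                    // number of points (items)
  int<lower=1> L;                    // number of cameras
  int<lower=1, upper=I> ii[N];       // participant index
  int<lower=1, upper=K> kk[N];       // point index
  int<lower=1, upper=L> ll[N];       // camera index
  int<lower=0, upper=1> y[N];        // 1 = correct classification
}
parameters {
  vector[I] theta;                       // abilities
  real<lower=0, upper=10> sigma_theta;   // bounds imply a flat prior on (0, 10)
  vector[K] beta_k;                      // point difficulties
  real mu_beta_k;
  real<lower=0> sigma_beta_k;
  vector[L] beta_l;                      // camera difficulties
  real mu_beta_l;
  real<lower=0> sigma_beta_l;
  vector[K] alpha;                       // slopes (discrimination)
  real<lower=0> sigma_alpha;
  vector<lower=0, upper=1>[K] eta;       // pseudo-guessing
}
model {
  theta ~ normal(0, sigma_theta);
  beta_k ~ normal(mu_beta_k, sigma_beta_k);
  mu_beta_k ~ normal(0, 5);
  sigma_beta_k ~ cauchy(0, 5);           // half-Cauchy via the lower bound
  beta_l ~ normal(mu_beta_l, sigma_beta_l);
  mu_beta_l ~ normal(0, 5);
  sigma_beta_l ~ cauchy(0, 5);
  alpha ~ normal(1, sigma_alpha);
  sigma_alpha ~ cauchy(0, 5);
  eta ~ beta(1, 5);
  for (n in 1:N) {
    real p = eta[kk[n]] + (1 - eta[kk[n]]) *
             inv_logit(alpha[kk[n]] * (theta[ii[n]] - beta_k[kk[n]] - beta_l[ll[n]]));
    y[n] ~ bernoulli(p);
  }
}
"
# fit <- stan(model_code = stan_code, data = stan_data, chains = 4, iter = 2000)
```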
2.3.1 Changes in ability across classification opportunities. Several authors have suggested the implementation of dynamic item response models that account for temporal variation in the answers, under the principle that subjects' abilities change over time following a learning curve (e.g. Wang et al., 2013).
To capture the learning in the process, we add a temporally
dependent component to the model:
p_{ijl} = \eta_j + (1 - \eta_j)\,\frac{1}{1 + \exp\{-\alpha_j(\theta_i + \phi_t - \beta_j I_j - \beta_l I_l)\}} \qquad (5)

where \phi_t is a common learning parameter that captures the change in abilities across the daily occasions on which participants performed classifications, t = 1, 2, ..., 15. For participant i, t = 1 represents the first classification day, t = 2 the second, and so on.
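As a small data-preparation sketch, the daily occasion index t for each participant could be derived from the classification timestamps along the following lines (column names are illustrative):

```r
# Minimal sketch: deriving the daily classification occasion t per participant
# from timestamps; 'annotator' and 'datetime' are illustrative column names.
library(dplyr)

dat <- dat %>%
  group_by(annotator) %>%
  mutate(occasion = as.integer(factor(as.Date(datetime)))) %>%  # 1 = first day, 2 = second, ...
  ungroup()
```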
3. Case study: coral reefs
“How inappropriate to call this planet Earth when it is quite clearly
Ocean.” - Arthur C. Clarke
The Great Barrier Reef (GBR) is located on Australia’s
north eastern coast and is among the largest and most com-
plex ecosystems in the world (Great Barrier Reef Marine Park
Authority, 2009). Two climate change-related disturbances, coral bleaching events and cyclones, are negatively affecting this ecosystem, causing an unprecedented decline in the prevalence of hard corals (Hughes et al., 2017; De'ath et al., 2012; Ainsworth et al., 2016; Vercelloni et al., 2020). This decline is difficult and expensive to quantify using traditional marine surveys (González-Rivero et al., 2014). For this reason, some researchers are harnessing
the strength of citizen science to produce estimates of reef-
health indicators (e.g. proportion of hard corals in the benthic
zone) across large spatial and temporal scales (Peterson et al.,
2020; Santos-Fernandez et al., 2021). This information can
then be used by reef managers and scientists to make data-
enabled management decisions and inform future research.
We performed an experiment using Amazon Mechanical
Turk (https://www.mturk.com/) to assess the feasibility of
using crowdsourced data for estimation of a common reef-
health indicator, hard coral cover, represented as the two-
dimensional proportion of the benthic zone (i.e. seafloor)
covered in hard corals. Hard corals play an important role in
reef ecosystems; their rigid structures provide critical habitat
for many organisms and they are vulnerable to a range of
impacts such as tourism and recreation, storm damage, and
climate change (Hill et al., 2004). The dataset consisted of
514 geotagged images obtained from the XL Catlin Seaview
Survey (González-Rivero et al., 2014) and the University of
Queensland’s Remote Sensing Research Centre (Roelfsema
et al., 2018), which we used to assess the participants’ abilities
to identify hard corals. In practice, coral cover estimates from
images are often based on a subset of classification points,
rather than the whole image (Thompson et al., 2016; Sweat-
man et al., 2005). In these two programs, 40 to 50 spatially
balanced, random classification points were selected on each
image and classified by coral reef scientists. We consider the
classifications from the experts as a gold standard (i.e. the
ground truth).
We engaged participants and provided instructions in an 11-page training document (https://github.com/EdgarSantos-Fernandez/reef_misclassification/HelpGuide_MTurk20200203.pdf), describing how to identify the different benthic categories, which included: hard and soft corals, algae, sand, water and other. Participants
were also given the option to select unsure if they were
uncertain about which category to select. Several image
classification examples were included in this guide and the
differences between commonly misclassified benthic groups
were highlighted.
After studying the training document, a qualification task
was used to assess the proficiency of the participants to accu-
rately complete the task. More specifically, the participants
were shown five images containing one classification point
each and asked to select the correct class from the five possible
choices. The qualification was granted to those scoring at least
three correct classifications out of five.
We designed an experiment to select images representative
of the GBR in terms of benthic composition (proportion of
hard and soft corals,algae, and sand) and camera types
(Canon, Lumix, Olympus, Sony, and Canon EOS). Details
of the cameras are given in Table 3. This produced a dataset
composed of 514 images. We produced 514 human intelligence
tasks (HITS) with a maximum number of 70 assignments
per HIT (i.e. maximum number of times each image can
be classified). Images were randomly assigned to the partic-
ipants. We were concerned that classifying 40 or 50 points
per image was too time consuming and that it would reduce
participation. In addition, previous research has shown that
accurate estimates of coral cover can be obtained with ap-
proximately 10 points Stoddart and Stoddart (2005); Beijbom
et al. (2015); Williams et al. (2019). Therefore, we asked
participants to classify 15 classification points on each image,
randomly selected from the 40-50 points previously classified
by reef scientists. See the example in Fig.1.
Participants were required to select a classification category
for all of the points before submitting the classification. Every
assignment (i.e. an image) was expected to take approxi-
mately one minute to complete. The payment was set to 0.10
USD per image and participants reported earning more than
the federal minimum wage in the United States ($7.25 per
hour) for their contributions.
Honeypots were used to assess the quality of the classifications and prevent low-quality participants from contributing. Participants achieving less than 40% accumulated daily accuracy on the gold standard images were not allowed to continue working on the project. Those who scored less than 60% accuracy in the daily assessment had to repeat the qualification task before continuing. Details of the dataset
contributed by the participants can be found in Section 4.
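A minimal sketch of how such daily honeypot checks could be computed from the gold-standard classifications (thresholds follow the rules above; column names are illustrative):

```r
# Minimal sketch: flagging participants from their daily accuracy on the
# gold-standard (honeypot) points; column names are illustrative.
library(dplyr)

daily_flags <- gold %>%                                   # gold-standard classifications only
  group_by(annotator, date) %>%
  summarise(daily_acc = mean(answer_given == answer_actual), .groups = "drop") %>%
  mutate(status = case_when(
    daily_acc < 0.40 ~ "blocked",       # removed from the project
    daily_acc < 0.60 ~ "re-qualify",    # must repeat the qualification task
    TRUE             ~ "ok"
  ))
```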
3.1 Performance measures
Our category of interest in the analyses is hard corals. We used
a suite of performance measures to describe the ability of the
participants, which are based on the true positive (TP), true
negative (TN), false positive (FP), and false negative (FN). In
this context, TP are the points classified as hard coral given
Figure 1: Example of an underwater image from the Great
Barrier Reef, Australia used for classification. Participants
were asked to classify what they observed within the points
into multiple categories.
that the point is truly occupied by hard corals. A TN occurs
when the point is correctly classified as something other than
hard coral when there is no hard coral present. FP represent
points classified as hard coral when hard coral is absent,
while FN occur when hard coral is incorrectly classified as
something else. This information was then used to generate
other performance measures such as:
• Sensitivity: measures the ability of a participant to identify a category when it is present: se = TP/(TP + FN).
• Specificity: measures the ability of a participant to correctly identify a category when it is absent: sp = TN/(TN + FP).
• Classification accuracy: the proportion of correctly identified classification points: acc = (TP + TN)/(TP + TN + FN + FP).
• Precision: the proportion of points classified as hard coral that truly contain hard coral: pre = TP/(TP + FP).
• Matthews correlation coefficient (MCC) (Matthews, 1975): a measure of the effectiveness of the classifier using all the elements of the confusion matrix.
• Positive likelihood ratio (lr+): gives the number of true positives for every false positive.
• Negative likelihood ratio (lr−): gives the number of false negatives for every true negative.
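A minimal sketch computing the measures listed above from the four confusion-matrix counts (the function name is illustrative):

```r
# Minimal sketch: performance measures from TP, FP, TN, FN counts,
# with hard coral as the positive class.
perf_measures <- function(tp, fp, tn, fn) {
  se  <- tp / (tp + fn)                        # sensitivity
  sp  <- tn / (tn + fp)                        # specificity
  acc <- (tp + tn) / (tp + tn + fp + fn)       # accuracy
  pre <- tp / (tp + fp)                        # precision
  mcc <- (tp * tn - fp * fn) /
         sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  c(se = se, sp = sp, acc = acc, pre = pre, mcc = mcc,
    lr_pos = se / (1 - sp), lr_neg = (1 - se) / sp)
}

# Example: the 'raw' counts reported in Table 1
perf_measures(tp = 132752, fp = 155482, tn = 251849, fn = 74077)
```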
We clustered the subjects based on the ability estimates θ_i from the item response model. We then used these clusters to construct several model variations based on majority voting, restricted to experts or experts/experienced subjects. Three replicates were generated for each gold standard proportion and the results were averaged.
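A minimal sketch of this quantile-based grouping, assuming a fitted rstan object fit with an ability vector theta (object and parameter names are illustrative):

```r
# Minimal sketch: grouping participants into four ability classes from the
# quantiles of the posterior mean abilities.
theta_hat <- colMeans(as.matrix(fit, pars = "theta"))   # posterior means of theta_i

groups <- cut(theta_hat,
              breaks = quantile(theta_hat, probs = c(0, 0.25, 0.5, 0.75, 1)),
              labels = c("beginner", "competent", "experienced", "expert"),
              include.lowest = TRUE)
table(groups)
```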
3.2 Results
The data contributed by participants were aggregated using
the five different methods. First, we considered the raw esti-
mates obtained directly from the participants’ classifications
(i.e. raw), without any grouping or consensus applied (Table 1). The columns in this table represent the number of classification points, n, and the performance measures.
In Table 1, we also considered a weighted consensus based on the item response model. The robustness of the results was assessed for the proposed consensus method using different proportions of images where the truth was known (10%, 20%, 33% and 50%), noting that images were randomly selected without replacement. Obtaining the true labels is often a tedious process and incurs the cost of experts' time, which can be prohibitive.
Using the raw data without combining the subjects’ an-
swers produced relatively low-quality performance compared
to other aggregation methods, se = 0.642 and sp = 0.618
with an accuracy of 0.626 and MCC = 0.246. On average,
the subjects identified 1.682 TP hard coral points for every
FP hard coral classification. However, using a traditional
consensus approach combining the all of the participants’
responses substantially increased the performance measures
e.g. se = 0.825 and MCC = 0.487. We note that that this
method produced far better se and sp improvement compared
to the raw data method.
These item response based majority voting methods outperform the simple consensus (MCC > 0.53) in all the variants explored. The ratio of TP to FP points identified under these methods is above 3. However, the negative likelihood ratio (FN/TN) was similar to the one obtained using the consensus approach. Fig 2 compares the statistical performance measures. For the MV methods, we show four values per method, representing the proportion of images in the gold standard set (10, 20, 33 and 50%).
The weighted variation performed well compared to the item response approaches based on expert and experienced participants, achieving marginal improvements in acc and pre. The item response model captures the abilities of the subjects well, even with a small training dataset. This indicates that a minimal training set (e.g. 50 images) is enough to cluster subjects based on proficiency (i.e. expert, experienced, etc.).
A straightforward indicator of the difficulty of a point is the evenness of the vote. We would expect large disagreement for points with high difficulty. In Table 2 of the Supporting Information, we compare the methods for points with an evenness greater than and smaller than 66%.
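A minimal sketch of one way to compute such a per-point indicator, here taken as the share of votes received by the most-voted category (an assumption about how the evenness cut-off is operationalised; column names are illustrative):

```r
# Minimal sketch: per-point agreement/evenness indicator, computed as the
# proportion of votes for the most-voted category at each point.
library(dplyr)

point_evenness <- dat %>%
  group_by(media, x, y) %>%
  summarise(evenness = max(table(answer_given)) / n(), .groups = "drop")
```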
Four groups of participants were obtained from the quantiles of the posterior means of the abilities (Figure 3): beginners, competent, experienced and experts. The vertical axis gives the latent ability score and the x-axis the proportion of correctly classified points. Skillful participants have large ability scores. The size of the point gives the number of classification points and indicates engagement in the project. The vertical bar is the 95% posterior highest density interval and represents the dispersion around the posterior ability estimate. In black, we represent a reference participant who self-identified as having diving experience. This participant falls within the expert category, yet remarkably was not the best performing one.
Table 1: Method 1. Performance measures obtained from the participants' classifications using raw data, classic consensus, and item response consensus estimates. We considered several proportions of images where the ground truth is known (10, 20, 33 and 50%). The Matthews correlation coefficient is shown in the MCC column.
method n TP FP TN FN se sp acc pre MCC lr+ lr−
raw 614,160 132,752 155,482 251,849 74,077 0.642 0.618 0.626 0.461 0.246 1.682 0.579
consensus 23,488 6,779 4,798 10,470 1,441 0.825 0.686 0.734 0.586 0.487 2.624 0.256
experts, GS:10%,n= 51 23,488 6,468 3,468 11,794 1,750 0.787 0.773 0.778 0.652 0.541 3.480 0.275
experts, GS:20%,n= 102 23,488 6,520 3,510 11,752 1,699 0.793 0.770 0.778 0.650 0.543 3.451 0.268
experts, GS:33%,n= 171 23,488 6,541 3,513 11,747 1,677 0.796 0.770 0.779 0.651 0.545 3.457 0.265
experts, GS:50%,n= 257 23,488 6,588 3,495 11,767 1,632 0.801 0.771 0.782 0.653 0.552 3.500 0.258
experts/experienced, GS:10%,n= 51 23,488 6,637 3,883 11,385 1,583 0.807 0.746 0.767 0.631 0.531 3.186 0.258
experts/experienced, GS:20%,n= 102 23,488 6,665 3,906 11,362 1,555 0.811 0.744 0.767 0.631 0.532 3.174 0.254
experts/experienced, GS:33%,n= 171 23,488 6,704 3,969 11,299 1,516 0.816 0.740 0.766 0.628 0.532 3.137 0.249
experts/experienced, GS:50%,n= 257 23,488 6,675 3,925 11,343 1,545 0.812 0.743 0.767 0.630 0.532 3.160 0.253
weighted, exp/exp,GS:10%,n= 51 23,488 6,155 3,043 12,219 2,063 0.749 0.801 0.783 0.671 0.539 3.810 0.311
weighted, exp/exp,GS:20%,n= 102 23,488 6,057 2,806 12,456 2,162 0.737 0.816 0.788 0.684 0.545 4.023 0.322
weighted, exp/exp,GS:33%,n= 171 23,488 6,182 2,852 12,409 2,036 0.752 0.813 0.792 0.684 0.554 4.029 0.305
weighted, exp/exp,GS:50%,n= 257 23,488 6,262 2,898 12,364 1,957 0.762 0.810 0.793 0.684 0.559 4.013 0.294
Figure 2: Comparison of the raw performance measures and those obtained from the different methods: consensus (con), item response consensus based on experts/experienced (exp) and expert participants (exp2), and weighted consensus (con_w). (a) sensitivity (se) vs specificity (sp), (b) accuracy (acc) vs precision (pre), (c) negative likelihood ratio (lr−) vs positive likelihood ratio (lr+), (d) Matthews correlation coefficient (MCC).
Citizen science project managers can benefit from Figure 3 in many ways: (1) Clustering participants provides
a way to weight the evidence provided; (2) the ability to
identify beginners means that they can be asked to re-qualify
before contributing additional data; (3) gamification, such
as leaderboards, can be constructed using the latent ability
values and the number of classifications; and (4) a priori
knowledge about which images are the most difficult could
be used to assign them to the most skillful participants.
In Fig 4 we assess the rate of learning as the participants accumulated daily classification occasions (i.e. they worked on the task a second, third, fourth time, etc.). Fitting a linear regression to the posterior means at each occasion t produced a slope significantly different from 0 (p-value = 0.019), which indicates that the participants' skills increase with participation and they become better at classifying the points. After controlling for pseudo-guessing and discrimination, the average participant increased their probability of correct classification by approximately 4% after five occasions, and by 8% and 12% by the 10th and 15th occasions, respectively.
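A minimal sketch of this check, assuming posterior means of the learning parameter are available as a vector phi_hat indexed by occasion (object and parameter names are illustrative):

```r
# Minimal sketch: regressing the posterior means of the learning parameter
# on the classification occasion to test for a learning trend.
phi_hat  <- colMeans(as.matrix(fit, pars = "phi"))  # posterior means of phi_t
occasion <- seq_along(phi_hat)

summary(lm(phi_hat ~ occasion))  # slope and its p-value
```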
There were also differences in the difficulty of classifying
each of the classes (Fig 5). The results showed that the
soft coral category was hardest to identify in the images. As
expected, points containing sand had the highest chances of
correct classification.
Substantial differences were found among the cameras used to take the images. Images taken by the Canon EOS camera represented the vast majority and were substantially more difficult (Fig 6). The Canon EOS camera was used to take images in the northern section of the GBR as part of the XL Catlin Seaview survey (González-Rivero et al., 2014). Images from the other Canon camera, taken as part of the Heron survey, were easier for the participants to classify (Roelfsema and Phinn, 2010). However, the difficulty associated with the camera might be confounded with the regions where the images were taken (north versus south of the GBR), where reef composition and diversity differ.
3.3 Classification time and quality
The classification time has the potential to be used as a simple indicator of classification quality. The boxplots in Fig 8 show the distributions of the classification times for each of the participant groups. We note that those participants identified as experts required significantly longer classification times
Figure 3: Posterior estimates of the abilities for four groups of participants (beginner, competent, experienced and expert) using a gold standard dataset (n = 171), with 95% highest density intervals, as a function of the proportion of correct answers. The size of the dot represents the number of points classified. The black dot represents a reference diver who engaged in the project as a participant.
Figure 4: Posterior estimates of the learning parameter φ and 95% highest density intervals as a function of the daily classification occasion. These estimates were obtained by fitting the whole dataset (n = 514 images). The size of the dot is relative to the number of points classified. The error bars represent the 95% highest density interval of φ.
compared to those in the beginner and competent groups, as shown in Table 2. The comparison was made using the non-parametric Wilcoxon test, with the alternative hypothesis that the sampled elements from the group in the rows had substantially greater mean rank values than those in the columns (Table 2). Participants with average classification times of less than 12 secs/image had a mean posterior ability score falling within the lowest quantile (beginner). This, therefore, can be used as a straightforward way of detecting low-performing respondents. It also allows initial inferences to be made before fitting statistical models, especially in the
Figure 5: Violin/box plots of the posterior difficulties of the benthic categories.
Table 2: p-values obtained using a pairwise Wilcoxon test
comparing the classification times between groups.
beginner competent experienced
competent 1.0000
experienced 0.5932 0.2955
expert 0.0037 0.0009 0.0781
context where participants are getting paid for completed images and are thus trying to maximize their effort.
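A minimal sketch of such pairwise comparisons of classification times between ability groups (column names are illustrative; the direction of the alternative depends on how the groups are ordered):

```r
# Minimal sketch: pairwise one-sided Wilcoxon tests comparing classification
# times across ability groups; column names are illustrative.
pairwise.wilcox.test(dat$time_secs, dat$ability_group,
                     alternative = "greater", p.adjust.method = "none")
```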
Figure 6: Posterior estimates of the difficulties and 95% highest density intervals of the five cameras used in the study, as a function of the proportion of correct answers. The size of the dot represents the number of points classified in images taken with each camera.
Figure 7: Violin/box plots of the posterior estimates of
the pseudoguessing parameter associated with the benthic
categories.
Figure 8: Box plots of the classification time as a function of the participants' ability groups.
4. Discussion and conclusions
Citizen science (CS) is becoming a very active area of research,
especially in ecology and conservation. However, the validity of research using these sources of data is often questioned when the task is difficult. Tasks such as the identification of benthic categories in images are challenging for most citizen scientists, who may have little knowledge of what a hard coral looks like.
Similarly, not all participants have the same commitment and skills, and they engage differently in CS projects. This is critical when dealing with CS data, and statistical models need to weight the evidence based on these factors.
The current Amazon Mechanical Turk framework does not consider the contribution of the participants or their expertise, but instead allows requesters to withhold payment for poor-quality work. A system based on ability scores, as developed in this research, constitutes a better approach to compensating participants and is more effective than the current binary system.
We show that multiple factors affect the difficulty of the
task, including the underlying category and the camera type.
Identifying those categories and images that produce greater
misclassification errors is critical to producing useful training
and qualification materials. In our experience, this was critical
to achieving good classification performance. The project
increased engagement and knowledge about the Great Barrier
Reef and created more awareness about current ecological
challenges. The valuable feedback we received from the par-
ticipants allowed us to improve project design, training, and
compensation processes.
We found that for easy tasks, with broad evenness in the
responses of the participants, most approaches will perform
relatively well. However, when the task is difficult, aggregat-
ing the answers of the participants, using simple consensus,
produces poor results. Identifying and combining expert responses tends to provide the optimal solution, and this can be done even with a reduced gold standard dataset. We found that performing weighted majority voting, based on latent abilities, produces the best performance outcomes.
Obtaining the true classes using marine biologists or expert elicitation is expensive when large numbers of images need to be classified. Our results show that crowdsourcing is a viable option when there are budget constraints. The item response model also allowed us to identify careless or low-skilled respondents, who generally fall into the beginner category. Responses from these participants are generally noisy; therefore our voting algorithm does not include them.
The implementation of our Bayesian models can be found in the GitHub repository https://github.com/EdgarSantos-Fernandez/staircase. These methods also include the possibility of accounting for spatial autocorrelation in the parameters associated with the images.
Author Contributions
ESF, JV, BC, EP and KM conceptualized and designed the
study. BC designed and wrote the Amazon Mechanical Turk
interface. ESF, JV and BC carried out the experiment. ESF
wrote the R/Stan codes. All the authors made a substantial
contribution to the draft and approved the final submission.
Data availability
The dataset used in the case study can be found in the repos-
itory: https://github.com/EdgarSantos-Fernandez/reef.
Acknowledgement
This research was supported by the Australian Research
Council (ARC) Laureate Fellowship Program under the
project “Bayesian Learning for Decision Making in the Big
Data Era” (ID: FL150100150) and the Centre of Excel-
lence for Mathematical and Statistical Frontiers (ACEMS).
Thanks to the members of the VRD team (https://www.virtualreef.org.au/about/). We also thank all the participants who contributed to the classification of images.
Ethical approval was granted for the collection of this data
by the Research Ethics Advisory Team, Queensland Univer-
sity of Technology (QUT). Approval Number: 1600000830.
Computations were performed through the QUT High Per-
formance Computing (HPC) infrastructure. Data analysis
and computations were undertaken in R software (R Core Team, 2018) using the package rstan (Stan Development Team, 2018). Data visualizations were made with the packages tidyverse (Wickham, 2017), bayesplot (Gabry and Mahr, 2018), ggvoronoi (Garrett et al., 2018) and ggrepel (Slowikowski, 2019).
References
Ainsworth, T. D., Heron, S. F., Ortiz, J. C., Mumby, P. J.,
Grech, A., Ogawa, D., Eakin, C. M., and Leggat, W.
(2016). Climate change disables coral bleaching protec-
tion on the Great Barrier Reef. Science 352, 338–342.
Albert, J. (2015). Introduction to Bayesian item response
modelling. International Journal of Quantitative Re-
search in Education 2, 178–193.
Baker, F. B. and Kim, S.-H. (2004). Item response theory:
Parameter estimation techniques. CRC Press.
Beijbom, O., Edmunds, P. J., Roelfsema, C., Smith, J., Kline,
D. I., Neal, B. P., Dunlap, M. J., Moriarty, V., Fan, T.-
Y., Tan, C.-J., et al. (2015). Towards automated an-
notation of benthic survey images: Variability of human
experts and operational modes of automation. PloS one
10, e0130312.
Birnbaum, A. L. (1968). Some latent trait models and their
use in inferring an examinee’s ability. In Statistical
theories of mental test scores. Addison-Wesley.
Cançado, A. L., Gomes, A. E., da Silva, C. Q., Oliveira, F. L.,
and Duczmal, L. H. (2016). An item response theory
approach to spatial cluster estimation and visualization.
Environmental and ecological statistics 23, 435–451.
Chalmers, R. P. (2012). mirt: A multidimensional item
response theory package for the R environment. Journal
of Statistical Software 48, 1–29.
De Ayala, R. J. (2013). The theory and practice of item
response theory. Guilford Publications.
De’ath, G., Fabricius, K. E., Sweatman, H., and Puotinen, M.
(2012). The 27–year decline of coral cover on the Great
Barrier Reef and its causes. Proceedings of the National
Academy of Sciences 109, 17995–17999.
Embretson, S. E. and Reise, S. P. (2013). Item response
theory. Psychology Press.
Fischer, G. H. (1973). The linear logistic test model as an
instrument in educational research. Acta psychologica
37, 359–374.
Fox, J.-P. (2010). Bayesian item response modeling: Theory
and applications. Springer Science & Business Media.
Fritz, S., See, L., Carlson, T., Haklay, M. M., Oliver, J. L.,
Fraisl, D., Mondardini, R., Brocklehurst, M., Shanley,
L. A., Schade, S., et al. (2019). Citizen science and the
United Nations Sustainable Development Goals. Nature
Sustainability 2, 922–930.
Gabry, J. and Mahr, T. (2018). bayesplot: Plotting for
Bayesian Models. R package version 1.6.0.
Garrett, R. C., Nar, A., and Fisher, T. J. (2018). ggvoronoi:
Voronoi Diagrams and Heatmaps with ’ggplot2’. R pack-
age version 0.8.2.
González-Rivero, M., Bongaerts, P., Beijbom, O., Pizarro,
O., Friedman, A., Rodriguez-Ramirez, A., Upcroft, B.,
Laffoley, D., Kline, D., Bailhache, C., et al. (2014). The
Catlin Seaview Survey–kilometre-scale seascape assess-
ment, and monitoring of coral reef ecosystems. Aquatic
Conservation: Marine and Freshwater Ecosystems 24,
184–198.
Great Barrier Reef Marine Park Authority (2009). Great
barrier reef outlook report 2009: In brief. Great Barrier
Reef Marine Park Authority.
Hill, J. J., Wilkinson, C. C., et al. (2004). Methods for ecolog-
ical monitoring of coral reefs: a resource for managers.
Australian Institute of Marine Science (AIMS).
Hines, G., Swanson, A., Kosmala, M., and Lintott, C.
(2015). Aggregating user input in ecology citizen science
projects. In Twenty-Seventh IAAI Conference.
Hsu, A., Malik, O., Johnson, L., and Esty, D. C. (2014).
Development: Mobilize citizens to track sustainability.
Nature News 508, 33.
Hughes, T. P., Barnes, M. L., Bellwood, D. R., Cinner,
J. E., Cumming, G. S., Jackson, J. B., Kleypas, J., Van
De Leemput, I. A., Lough, J. M., Morrison, T. H., et al.
(2017). Coral reefs in the anthropocene. Nature 546,
82–90.
Juhl, S. (2019). Measurement uncertainty in spatial mod-
els: A Bayesian dynamic measurement model. Political
Analysis 27, 302–319.
Kosmala, M., Wiggins, A., Swanson, A., and Simmons, B.
(2016). Assessing data quality in citizen science. Fron-
tiers in Ecology and the Environment 14, 551–560.
Lintott, C. J., Schawinski, K., Slosar, A., Land, K., Bamford,
S., Thomas, D., Raddick, M. J., Nichol, R. C., Szalay,
A., Andreescu, D., et al. (2008). Galaxy zoo: morpholo-
gies derived from visual inspection of galaxies from the
sloan digital sky survey. Monthly Notices of the Royal
Astronomical Society 389, 1179–1189.
Littlestone, N., Warmuth, M. K., et al. (1989). The weighted
majority algorithm. University of California, Santa Cruz,
Computer Research Laboratory.
Matthews, B. W. (1975). Comparison of the predicted
and observed secondary structure of t4 phage lysozyme.
Biochimica et Biophysica Acta (BBA)-Protein Structure
405, 442–451.
Patz, R. J. and Junker, B. W. (1999). Applications and
extensions of mcmc in irt: Multiple item types, missing
data, and rated responses. Journal of educational and
behavioral statistics 24, 342–366.
Paun, S., Carpenter, B., Chamberlain, J., Hovy, D., Kr-
uschwitz, U., and Poesio, M. (2018). Comparing bayesian
models of annotation. Transactions of the Association
for Computational Linguistics 6, 571–585.
Perkins, N. R., Foster, S. D., Hill, N. A., and Barrett, N. S.
(2016). Image subsampling and point scoring approaches
for large-scale marine benthic monitoring programs. Es-
tuarine, Coastal and Shelf Science 176, 36–46.
Peterson, E. E., Santos-Fern´andez, E., Chen, C., Clifford,
S., Vercelloni, J., Pearse, A., Brown, R., Christensen,
B., James, A., Anthony, K., et al. (2020). Monitoring
through many eyes: Integrating disparate datasets to im-
prove monitoring of the great barrier reef. Environmental
Modelling & Software 124, 104557.
R Core Team (2018). R: A Language and Environment
for Statistical Computing. R Foundation for Statistical
Computing, Vienna, Austria.
Rasch, G. (1960). Studies in mathematical psychology: I.
Probabilistic models for some intelligence and attainment
tests. Nielsen & Lydiche. Oxford, England.
Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C.,
Bogoni, L., and Moy, L. (2010). Learning from crowds.
Journal of Machine Learning Research 11, 1297–1322.
Roelfsema, C., Kovacs, E., Ortiz, J. C., Wolff, N. H.,
Callaghan, D., Wettle, M., Ronan, M., Hamylton, S. M.,
Mumby, P. J., and Phinn, S. (2018). Coral reef habitat
mapping: A combination of object-based image analysis
and ecological modelling. Remote sensing of environment
208, 27–41.
Roelfsema, C. and Phinn, S. (2010). Calibration and val-
idation of coral reef benthic community maps derived
from high spatial resolution satellite imagery. Journal of
Applied Remote Sensing 4, 043527.
Santos-Fernández, E. and Mengersen, K. (2021). Understand-
ing the reliability of citizen science observational data
using item response models. Methods in Ecology and
Evolution, to appear.
Santos-Fernandez, E., Peterson, E. E., Vercelloni, J., Rush-
worth, E., and Mengersen, K. (2021). Correcting mis-
classification errors in crowdsourced ecological data: A
bayesian perspective. Journal of the Royal Statistical
Society: Series C (Applied Statistics) 70, 147–173.
Slowikowski, K. (2019). ggrepel: Automatically Position Non-
Overlapping Text Labels with ’ggplot2’. R package version
0.8.1.
Stan Development Team (2018). RStan: the R interface to
Stan. R package version 2.18.2.
Stoddart, J. and Stoddart, S. (2005). Corals of the Dampier Harbour: their survival and reproduction during the dredging programs of 2004. MScience Pty Ltd., University of Western Australia, Perth, Western Australia.
Sweatman, H., Burgess, S., Cheal, A., Coleman, G., Delean,
J. S. C., Emslie, M., Miller, I., Osborne, K., McDonald,
A., and Thompson, A. (2005). Long-term monitoring of
the great barrier reef.
Thompson, A., Costello, P., Davidson, J., Logan, M., Cole-
man, G., Gunn, K., and Schaffelke, B. (2016). Marine
monitoring program: Annual report for inshore coral reef
monitoring 2014-2015.
van der Linden, W. J. and Hambleton, R. K. (2013). Handbook
of modern item response theory. Springer Science &
Business Media.
Vercelloni, J., Liquet, B., Kennedy, E. V., Gonz´alez-Rivero,
M., Caley, M. J., Peterson, E. E., Puotinen, M., Hoegh-
Guldberg, O., and Mengersen, K. (2020). Forecasting
intensifying disturbance effects on coral reefs. Global
change biology 26, 2785–2797.
Wang, X., Berger, J. O., Burdick, D. S., et al. (2013). Bayesian
analysis of dynamic item response models in educational
testing. The Annals of Applied Statistics 7, 126–153.
Welinder, P. and Perona, P. (2010). Online crowdsourcing:
rating annotators and obtaining cost-effective labels. In
2010 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition-Workshops, pages 25–
32. IEEE.
Whitehill, J., Wu, T.-f., Bergsma, J., Movellan, J. R., and
Ruvolo, P. L. (2009). Whose vote should count more:
Optimal integration of labels from labelers of unknown
expertise. In Advances in neural information processing
systems, pages 2035–2043.
Wickham, H. (2017). tidyverse: Easily Install and Load the
’Tidyverse’. R package version 1.2.1.
Williams, I. D., Couch, C. S., Beijbom, O., Oliver, T. A.,
Vargas-Angel, B., Schumacher, B. D., and Brainard,
R. E. (2019). Leveraging automated image analysis tools
to transform our capacity to assess status and trends of
coral reefs. Frontiers in Marine Science 6, 222.
Wilson, M., De Boeck, P., and Carstensen, C. H. (2008).
Explanatory item response models: A brief introduc-
tion. Assessment of competencies in educational contexts
pages 91–120.
Supporting Information
Table 1: Symbols and definitions
i: participant (subject) id, i = 1, 2, \ldots, I
j: image number, j = 1, 2, \ldots, J
k: elicitation point, k = 1, 2, \ldots, K
I: number of participants
J: number of images
K: number of elicited points in the images
z_{ijk} \in \{0, 1\}: 1 indicates that the kth point in image j was classified as hard coral by subject i
C = 1 if hard corals are present at the point; 0 otherwise
\bar{C} = 1 if the point is classified as hard coral; 0 otherwise
N = 1 if the point does not contain hard corals; 0 otherwise
\bar{N} = 1 if the point is classified as not hard coral; 0 otherwise
TP_{ijk} \in \{0, 1\}: true positive; TP = 1 if \bar{C} = 1 \mid C = 1
TN_{ijk} \in \{0, 1\}: true negative; TN = 1 if \bar{N} = 1 \mid N = 1
FP_{ijk} \in \{0, 1\}: false positive; FP = 1 if \bar{C} = 1 \mid C = 0
FN_{ijk} \in \{0, 1\}: false negative; FN = 1 if \bar{C} = 0 \mid C = 1
se = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} TP_{ijk}}{\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} TP_{ijk} + \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} FN_{ijk}} (sensitivity)
sp = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} TN_{ijk}}{\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} TN_{ijk} + \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} FP_{ijk}} (specificity)
acc = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} (TP_{ijk} + TN_{ijk})}{\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} (TP_{ijk} + FN_{ijk} + TN_{ijk} + FP_{ijk})} (accuracy)
pre = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} TP_{ijk}}{\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} TP_{ijk} + \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} FP_{ijk}} (precision)
MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} (Matthews correlation coefficient)
lr^{+} = \frac{se}{1 - sp} (positive likelihood ratio)
lr^{-} = \frac{1 - se}{sp} (negative likelihood ratio)
Table 2: Method 2. Performance measures obtained from the participants' classifications using raw data, classic consensus, and item response consensus estimates. We considered several proportions of points where the ground truth is known (10, 20, 33 and 50%), selected from the points with less than 66% evenness in the vote. The Matthews correlation coefficient is shown in the last column (MCC).
method n TP FP TN FN se sp acc pre MCC
mturk 614160 132752 155482 251849 74077 0.642 0.618 0.626 0.461 0.246
mturk con 23488 12750 6410 24126 3690 0.776 0.790 0.785 0.665 0.549
mturk, eve > 0.67 12239 4152 1299 6448 340 0.924 0.832 0.866 0.762 0.734
mturk con experts, eve ≤ 0.67, GS:10% 11249 1947 1547 5973 1781 0.522 0.794 0.704 0.557 0.322
mturk con experts, eve ≤ 0.67, GS:20% 11249 1898 1512 6008 1830 0.509 0.799 0.703 0.557 0.315
mturk con experts, eve ≤ 0.67, GS:33% 11249 1787 1364 6155 1940 0.479 0.819 0.706 0.567 0.312
mturk con experts, eve ≤ 0.67, GS:50% 11249 1841 1417 6102 1886 0.494 0.812 0.706 0.565 0.317
mturk con experts/experienced, GS:10%, eve ≤ 0.66 11249 1981 1760 5761 1747 0.531 0.766 0.688 0.530 0.297
mturk con experts/experienced, GS:20%, eve ≤ 0.66 11249 1847 1590 5931 1881 0.495 0.789 0.691 0.537 0.290
mturk con experts/experienced, GS:33%, eve ≤ 0.66 11249 1828 1546 5975 1900 0.490 0.794 0.694 0.542 0.293
mturk con experts/experienced, GS:66%, eve ≤ 0.66 11249 1847 1563 5958 1881 0.495 0.792 0.694 0.542 0.295
Table 3: Cameras used in the surveys.
Camera Type Specifications
Sony Cyber Shot PC10 Compact Resolution 5 MP, Focal length 7.9-23.7mm, Maximum resolution 2592 ×1944, Aperture f2.8-5.6
Lumix DMC-LC43 Compact Resolution 4 MP, Focal length 35-105mm, Maximum resolution 2304 ×1728, Aperture f/2.8–4.9
Canon PowerShot A2400 Compact Resolution 16 MP, Focal length 28-140mm, Maximum resolution 4608 ×3456, Aperture f2.8–6.9
Olympus TG-4 Compact Resolution 16 MP, Focal length 25-100mm, Maximum resolution 4608 ×3456, Aperture f2-4.9
Canon EOS 5d Mk II w/ Fisheye Zoom Digital SLR Resolution 30.4 MP, Focal length 24-105mm, Maximum resolution 5613 ×3744, Aperture f8
Figure 1: Classification time in seconds.
Appendix Description of the reef dataset
In this section we introduce the dataset produced during the experiment. These data can be useful in machine learning and citizen science research.
A set of 514 images from unique locations in the Great Barrier Reef, Australia were used for classification. Spatially balanced random points (40 or 50 depending on the source of the image) were selected from the XL Catlin Seaview Survey (González-Rivero et al., 2014) and the University of Queensland's Remote Sensing Research Centre (Roelfsema et al., 2018).
Images were captured between 2008 and 2017 using five cameras (Sony, Canon, Lumix, Olympus and Canon EOS). They are the result of a sampling design and were selected to maintain a balance between the benthic compositions.
A total of 482 participants contributed to the project, producing 614,385 classification points using seven categories: hard corals, soft corals, algae, sand, water, other and unsure. Participants classified 15 random points on each image, which was designed to take approximately 1 min.
This spatio-temporal dataset contains the latitude and longitude of where the images were taken and the time stamp (date-time) associated with each classification. The true answer is also provided.
Table 4 contains the number of points in each category and their proportions, obtained from marine science expert elicitation. Images were randomly presented for classification in batches and retired after 70 classifications per image.
The dataset is part of the R package reef (https://github.com/EdgarSantos-Fernandez/reef) and it includes the following variables:
•media: numeric image identifier (de-identified).
•annotator: numeric participant identifier (de-identified).
•assignment: numeric unique identifier for the assignment
(de-identified).
•answer given: classification value (benthic category).
•answer actual: true benthic category value.
•x: horizontal location of the point centre in pixels.
•y: vertical location of the point centre in pixels.
•class: binary classification.
•class true: binary true value.
Figure 2: Participants' contributions over time. The patterns clearly show that the tasks were sent in batches.
Table 4: Count of the number of points on each category and
their proportion obtained from expert elicitation.
algae hard other sand soft
count 9,576 8,220 649 2,749 2,294
proportion 0.408 0.350 0.028 0.117 0.098
•Camera: The brand of the camera. Sony, Canon, Lumix,
Olympus and Canon EOS.
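A minimal sketch of how the companion package can be installed and its bundled datasets listed (we do not assume specific dataset object names here):

```r
# Minimal sketch: installing the companion 'reef' package from GitHub and
# listing the datasets it ships.
# install.packages("remotes")
remotes::install_github("EdgarSantos-Fernandez/reef")
library(reef)
data(package = "reef")   # lists the bundled datasets and their names
```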