Data Labeling: An Empirical Investigation into
Industrial Challenges and Mitigation Strategies
Teodor Fredriksson1, David Issa Mattos1[0000−0002−2501−9926], Jan
Bosch1[0000−0003−2854−722X], and Helena Holmström Olsson2
1Chalmers University of Technology, Hörselgången 11, 417 56, Gothenburg, Sweden
2Malmö University, Nordenskiöldsgatan 1, 211 19 Malmö, Sweden
Abstract. Labeling is a cornerstone of supervised machine learning. However, in industrial applications data is often not labeled, which complicates the use of this data for machine learning. Although there are well-established labeling techniques such as crowdsourcing, active learning and semi-supervised learning, these still do not provide accurate and reliable labels for every machine learning use case in industry. In this context, industry still relies heavily on manually annotating and labeling its own data. This study investigates the challenges that companies experience when annotating and labeling their data. We performed a case study using semi-structured interviews with data scientists at two companies to explore what problems they experience when labeling and annotating their data. This paper provides two contributions: we identify industry challenges in the labeling process, and we propose mitigation strategies for these challenges.
Keywords: Data Labeling · Machine Learning · Case Study
1 Introduction
Current research estimates that over 80% of the engineering tasks in a machine learning (ML) project concern data preparation and labeling, and that the third-party data labeling market is expected to almost triple by 2024 [1, 2]. This large effort spent on data preparation and labeling often arises because, in industry, datasets are frequently incomplete in the sense that some or all instances are missing labels. In addition, in some cases the labels that are available are of poor quality, meaning that the label associated with a data entry is incorrect or only partially correct. Labels of sufficient quality are a prerequisite for supervised machine learning, as the performance of the model in operation is directly influenced by the quality of the training data.
To overcome the lack of labels in both quantity and quality, crowdsourcing has been a common strategy for acquiring quality labels with human supervision [4, 5], in particular for computer vision and natural language processing applications. However, for other industry applications, crowdsourcing has several limitations, such as allowing unknown third parties access to company data and a lack of people with the in-depth understanding of the problem or the business needed to create quality labels. Moreover, in-house labeling can be half as expensive as crowdsourced labeling while providing higher quality. Due to these factors, companies still perform in-house labeling. Despite the large body of research on crowdsourcing and machine learning systems that can overcome different label quality problems, to the best of our knowledge, there is no research that investigates the challenges faced and strategies adopted by data scientists and human labelers in the labeling process of company-specific applications. In particular, we focus on the problems seen in applications where labeling is non-trivial and requires understanding of the problem domain.
Utilizing case study research based on semi-structured interviews with practitioners in two companies, one of which has extensive labeling experience, we study the challenges and the adopted mitigation strategies in the data labeling process that these companies employ. The contribution of this paper is twofold. First, we identify the key challenges that these companies experience in relation to labeling data. Second, we present an overview of the mitigation strategies that companies employ regularly, or potential solutions, to address these challenges.
The remainder of the paper is organized as follows. In the next section, we provide a more in-depth overview of the background of our research. Subsequently, in Section 3 we present the research method that we employed in the paper as well as an overview of the case companies. Section 4 presents the challenges that we identified during the case study, observations and interviews at the company, the results from the expert interviews to validate the challenges, as well as the mitigation strategies. Finally, the paper is concluded in Section 6.
2 Background
Crowdsourcing is defined as the process of acquiring required information or results by requesting assistance from a group of many people available through online communities. It is a way of dividing and distributing a large project among people; after each task is completed, the people involved are rewarded. Crowdsourcing is often regarded as the primary way of obtaining labels. In the context of machine learning, however, crowdsourcing has its own set of problems. The primary problem is annotators who produce bad labels. An annotator might not be able to label instances correctly, and even if an annotator is an expert, the quality of the labels will potentially decrease over time due to the human factor. Examples of crowdsourcing platforms are Amazon Mechanical Turk and Lionbridge AI.
Allowing a third-party company to label your data has its benefits, such as not having to develop your own annotation tools and labeling infrastructure. In-house labeling also requires investing time in training your annotators, which is not optimal if you do not have enough time and resources. A downside is that sensitive and confidential company data has to be shared with the crowdsourcing platforms. Therefore, there are essential factors to consider before selecting a crowdsourcing platform: How many and what kinds of projects has the platform been successful with previously? Does the platform have high-quality labeling technologies so that high-quality labels can be obtained? How does the platform ensure that the annotators can produce labels of sufficient quality? What security measures are taken to ensure the safety of your data? A tool
to be used in crowdsourcing when noisy labels are cheap to obtain is repeated labeling. Repeated labeling should be exercised when labeling can be repeated and the labels are noisy. This approach can improve the quality of the labels, which leads to improved quality of the machine learning model. It seems to work especially well when the repeated labeling is done selectively, taking into account label uncertainty and machine learning model uncertainty. However, this approach does not guarantee that the quality is improved. Sheshadri and Lease provide an empirical evaluation study that compares different algorithms that compute the crowd consensus on benchmark crowdsourced data sets, using the Statistical Quality Assurance Robustness Evaluation (SQUARE) benchmark. Their conclusion is that no matter which algorithm you choose, there is no significant difference in accuracy. These algorithms include majority voting (MV), ZenCrowd (ZC) and Dawid and Skene (DS)/Naive Bayes (NB). There are also other ways to handle noisy labels: for example, the accuracy of a deep neural network trained with noisy labels can be improved by incorporating a noise layer. So rather than correcting noisy labels, there are ways to change the machine learning models so that they can handle noisy labels. The downside of this approach is that you need to know which instances are clean and which instances are noisy, which can be difficult with industrial data. Another strategy to detect noisy labels is confident learning, which can be used to identify noisy labels as well as learn from noisy labels.
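To make the consensus discussion concrete, the simplest of the algorithms mentioned above, majority voting (MV), can be sketched in a few lines of Python. This is an illustrative sketch, not the SQUARE implementation; the dictionary of votes is invented for the example.

```python
from collections import Counter

def majority_vote(annotations):
    """Aggregate repeated labels per instance by majority vote (MV).

    `annotations` maps an instance id to the list of labels it
    received from different annotators.
    """
    consensus = {}
    for instance, labels in annotations.items():
        # most_common(1) returns the (label, count) pair with the highest count.
        consensus[instance] = Counter(labels).most_common(1)[0][0]
    return consensus

# Three annotators labeled two instances; MV keeps the most frequent label.
votes = {"x1": ["Yes", "Yes", "No"], "x2": ["No", "No", "Yes"]}
print(majority_vote(votes))  # {'x1': 'Yes', 'x2': 'No'}
```

More sophisticated consensus algorithms such as Dawid and Skene additionally estimate per-annotator reliability, but they reduce to this scheme when all annotators are treated as equally trustworthy.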
3 Research Method
In this paper, we report on case study research in which we explored the challenges related to labeling data for machine learning and the strategies that can be employed to mitigate them. In this section we present the data we collected and how we analyzed it to identify the challenges.
A case study is a research method that investigates real-world phenomena through empirical investigation. The aim of such studies is to identify challenges and find mitigation strategies through action, reflection, theory and practice [13–15]. A case study suits our purpose well because of its exploratory nature, since we are trying to learn more about certain processes at Companies A and B. The two main research questions are:
– RQ1: What are the key challenges that practitioners face in the process of labeling data?
– RQ2: What are the mitigation strategies that practitioners use to overcome these challenges?
3.1 Data Collection
Our case study was conducted in collaboration with two companies. Company A is a worldwide telecommunications provider and one of the leading providers in Information and Communication Technology (ICT). Company B is a company specialized in labeling. They have developed an annotation platform in order to provide the autonomous vehicle industry with labeled training data of top quality. Their clients include software companies and research institutes.
– Phase I: Exploration - The empirical data collected during this phase is based on an internship from November 18, 2019 to February 28, 2020, during which the first author spent two to three days a week at Company A's office. The data was collected by observing how the data scientists were working with machine learning and how they deal with data where labels are missing, as well as by having access to data sets. We held discussions with the data scientist working with each particular dataset to collect data regarding the origin of the data, what they wish to use it for in the future, and how often it is updated. Using Python, we could investigate how skewed the label distribution is, as well as examine the data to potentially find any clustering structure in the labels. The datasets studied in Phase I came from Participants I and II.
– Phase II: Validation - After the challenges had been identified during Phase I, both internal and external confirmation interviews were conducted to validate whether the challenges found in the previous phase were general. Four participants in the interviews were from Company A and one participant was from Company B. Company A had several data scientists, but we only included those who had issues with labeling. Each participant was interviewed separately, and the interviews lasted between 25 and 55 minutes. All interviews but one were conducted in English; the remaining interview was conducted in Swedish and then translated to English by the first author. During the interviews we asked questions such as What is the purpose of your labels?, How do you get annotated data? and How do you assess the quality of the data/labels?
Based on the meetings and interviews, we evaluated and planned strategies to mitigate the challenges that we observed during our study.
3.2 Data Analysis
The interviews were analyzed by taking notes during the interviews and the internship. We then performed a thematic analysis. A thematic analysis is defined as "a method for identifying, analyzing and reporting patterns" and was used to identify the different themes and patterns in the data we collected. From the analysis we were able to identify themes and define the industrial challenges based on the notes. For each interview we identified different themes, such as topics that came up during the interview. Several of these themes were present in more than one interview, so we combined the data from the interviews and drew conclusions based on the combined information.
Table 1. List of the interview participants of phase II
Company Participant Nr Title/role Experience
A I Data Scientist 4 years
A II Senior Data Scientist 8 years
A III Data Scientist 3 years
A IV Senior Data Scientist 2 years
B V Senior Data Scientist 7 years
3.3 Threats to Validity
There are four different concepts of validity to consider: construct validity, internal validity, external validity and reliability. To achieve construct validity, we provided every participant from Company A with an e-mail containing the definitions of all concepts and some sample questions to be asked during the interview. Before the interviews we also provided a lecture on how to use machine learning to label data, so that the participants could reflect and prepare for the interview. We argue that we achieved internal validity through data triangulation, since we interviewed every person at Company A who had experience with labels. Therefore it is very unlikely that we missed any necessary information when collecting data.
4 Results
In this section we present the results from our study. We begin by listing the key problems that we found in Phase I of the study. Next, we state the problems we found in Phase II. The interview we held with Participant V was then used as inspiration for formulating mitigation strategies for the problems faced by the data scientists at Company A.
4.1 Phase I: Exploration
Here we list the problems that we found during Phase I of the case study.
1. Lack of a systematic approach to labeling data for specific features:
It was clear that automated labeling processes were needed. The data scientists working at Company A had all kinds of needs for automated labeling. Currently, they do not know how to approach the problem.
2. Unclear responsibility for labeling: Data scientists do not have the time to label instances manually. Their stakeholders could label the data by hand, but they do not want to do it either. Thus the data scientists are expected to come up with a way to do the labeling.
3. Noisy labels: Participant I has a small subset of his data labeled. These labels come from experiments conducted in a lab. The label noise seems to be negligible, but that is not the case: there is a difference between the generated data and the true data, as the generated data has features that are continuous while the true data is discrete. Participant II works on a data set that contains tens of thousands of rows and columns. The column of interest contains two class labels, "Yes" and "No". The first problem with the labels is that they are noisy. A "Yes" can be caused by one of two errors, I and II. Only a "Yes" based on error I is of interest; if the "Yes" is based on error II, it should be relabeled as a "No". Furthermore, the stakeholders do not know whether the "Yes" instances are due to error I or error II.
4. Difficulty to find a correlation between labels and features: Participant I works with a dataset whose label distribution contains five classes that describe grades from "best" to "worst", where 1 is "best" and 5 is "worst". Cluster analysis reveals that there is no particular cluster structure for some of the labels: labels of grade 5 seem to be in one cluster, but grades 1-4 seem to be randomly scattered in one cluster. Analysis of the data from Participant II reveals that there is no way of telling whether a "Yes" is based on error I or error II. This means that many of the "Yes" instances may be incorrectly labeled.
5. Skewed label distributions: The label distributions of both datasets are highly skewed. The dataset from Participant I has fewer instances with a high grade than with low grades. For Participant II, the number of instances labeled "No" is greater than the number labeled "Yes". A model trained on this data will overfit.
6. Time dependence: Due to the nature of Participant II's data, it is possible that some of the "No" instances can become "Yes" in the future, so the "No" labels are possibly incorrect too.
7. Difficulty to predict future uses for datasets: The purpose of the labels in both datasets was to predict labels for future instances provided by the stakeholders on an irregular basis. For Participant I, the labels might be used for other purposes later, but there are no current plans to use the labels for other machine learning purposes.
4.2 Phase II: Validation
The problems that appeared during the interviews can be categorized as follows:
1. Label distribution related. Questions regarding the label distribution.
2. Multiple-task related. Questions regarding the purpose of the labels.
3. Annotation related. Questions regarding the oracle and noisy labels.
4. Model and data reuse related. Questions regarding reuse of a trained model on new data.
Below we discuss each category in more detail.
1. Label Distribution: We found several issues related to the label distribution. Participant I's data has a label distribution that is unknown. The current labels are measured in percentages and need to be translated into at least two classes, but more classes can be created if needed. Participant II has a label distribution that contains two classes, "Yes" and "No". Participant III's data has a label distribution that contains at least three labels. Participant IV has more than three thousand labels, so it is hard to get a clear picture of what the distribution is. Participants I-III all have skewed label distributions. If a dataset has a skewed label distribution, then the machine learning model will overfit. This means that in a binary classification problem with 80% of class A and 20% of class B, the model might just predict A the majority of the time, even when an actual case is labeled as B.
2. Multiple tasks: Participants I, II and III say that, for now, the only purpose of their labels is to find labels for new data, but chances are that the labels will be reused for something else later on. Participant IV does not use the labels for machine learning purposes but for other practical reasons. The problem here is that if you do not plan ahead and only train a model with respect to one task, and you later need the labels for something else, you will have to re-label the instances for each new task.
3. Annotation: Participant I has some labeled data that comes from laboratory experiments. However, these labels are only used to help label new instances to be labeled manually. Participant II gets labels from the stakeholders, but since these are noisy, the instances need to be relabeled. Participant III has labeled data coming from stakeholders, and these labels are expected to be 100% correct. Participant IV defines all labels personally and does not consult the stakeholders at all. The problem here is that the data scientists are often tasked with doing the labeling on their own. Even if the data scientists get labeled instances from the stakeholders, the labels are often of insufficient quantity and/or quality.
4. Data Reuse: Participant III has had problems with reusing a model. First the data was labeled into two classes, "Yes" and "No". Later, the "Yes" category was divided into sub-categories "YesA" and "YesB". When running the model on this new data, it would predict the old "Yes" instances as "No" instances. Participant III does not know why this happens.
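The overfitting behavior described under Label Distribution is easy to reproduce: on an 80/20 class split, a model that ignores the features and always predicts the majority class already reaches 80% accuracy, which is why accuracy alone hides the problem. The following is a toy sketch with invented data, not code from the case companies.

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    """Accuracy of a classifier that always predicts the most common
    training label, ignoring the features entirely."""
    majority = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for y in test_labels if y == majority)
    return hits / len(test_labels)

# 80% "No" / 20% "Yes", as in the binary example above.
train = ["No"] * 80 + ["Yes"] * 20
test = ["No"] * 8 + ["Yes"] * 2
print(majority_baseline(train, test))  # 0.8
```

Any trained model must beat this baseline to be useful, which is why metrics such as precision and recall per class are preferable to plain accuracy on skewed data.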
4.3 Summary from Company B
Participant V of Company B has prior experience with automatic labeling. Therefore, interview V was used to verify some actual labeling issues from industry. According to Participant V, Company B has worked on and studied automatic labeling for at least seven years. Company B uses crowdsourcing with 1000 people to label data. Participant V confirms that, thanks to active learning, the labeling task takes 200 times less time than if active learning was not used. The main problem Company B has with labeling is that it is hard to evaluate the quality of the labels and to assess the quality of the human annotators. A final remark from Participant V is that they have experienced a correlation between automation and quality: the more automation included in the process, the less accurate the labels will be. Three of the authors of this paper performed a systematic literature review on automated labeling using machine learning. Based on that paper, we can draw the conclusion that active learning and semi-supervised learning can be used to label instances.
4.4 Machine Learning Methods for Data Labeling
Here we present and discuss active learning and semi-supervised learning methods in terms of how they can be used in practice for labeling problems.
Active Learning: Traditionally, instances to be labeled for machine learning would be chosen at random. However, choosing instances to be labeled randomly can lead to a model with low predictive accuracy, since non-informative instances could be selected for labeling. Active learning (AL) was proposed to mitigate the issue of choosing non-informative instances. Active learning queries instances by informativeness and then labels them. The different methods used to pose queries are known as query strategies. The most commonly used query strategies are uncertainty sampling, error/variance reduction, query-by-committee (QBC) and query-by-disagreement (QBD). After instances are queried and labeled, they are added to the training set. A machine learning algorithm is then trained and evaluated. If the results are not satisfactory, more instances are queried and the model is retrained and evaluated. This iterative procedure proceeds until the learner decides it is time to stop. Active learning has proven to outperform passive learning if the query strategy is properly selected based on the learning algorithm. Most importantly, active learning is a great way to make sure that time is not wasted on labeling non-informative instances, thus saving both time and money in crowdsourcing.
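The query step of uncertainty sampling, the simplest of the query strategies listed above, can be sketched as follows. The toy probability model and the pool of numbers are invented for illustration.

```python
def least_confident(predict_proba, unlabeled):
    """Uncertainty sampling: return the unlabeled instance whose most
    probable class has the lowest predicted probability."""
    def confidence(x):
        return max(predict_proba(x))
    return min(unlabeled, key=confidence)

# Toy model: probability of class "Yes" grows linearly with the feature.
def toy_proba(x):
    p_yes = min(max(x / 10.0, 0.0), 1.0)
    return [p_yes, 1.0 - p_yes]

pool = [1.0, 4.8, 9.0]
# 4.8 yields probabilities closest to 0.5, so it is the most informative query.
print(least_confident(toy_proba, pool))  # 4.8
```

In a full active learning loop, the queried instance would be labeled by an oracle, added to the training set, and the model retrained before the next query.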
Semi-supervised learning: Semi-supervised learning (SSL) is concerned with a set of algorithms that can be used in the scenario where most of the data is unlabeled but a small subset of it is labeled. Semi-supervised learning is mainly divided into semi-supervised classification and constrained clustering.
Semi-supervised classification is when a classifier is trained on training data that contains both labeled and unlabeled instances. Sometimes semi-supervised classification outperforms supervised classification.
Constrained clustering is an extension of unsupervised clustering. Constrained clustering requires unlabeled instances as well as some supervised information about the clusters; its objective is to improve upon unsupervised clustering. The most popular semi-supervised classification methods are mixture models using the EM algorithm, co-training/multi-view learning, graph-based SSL and semi-supervised support vector machines (S3VM).
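As an illustration of semi-supervised classification, the sketch below implements a naive nearest-neighbor self-training scheme on one-dimensional data. The data and the distance threshold are invented; real SSL methods such as those listed above are considerably more sophisticated.

```python
def self_train(labeled, unlabeled, threshold=2.0):
    """Naive self-training sketch: repeatedly move the unlabeled point
    closest to any labeled point into the labeled set, adopting that
    point's label, as long as the distance is below `threshold`.

    `labeled` is a list of (x, label) pairs; `unlabeled` is a list of x.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    while pool:
        # Find the most confidently predictable (closest) unlabeled point.
        dist, x, label = min(
            (abs(x - lx), x, ly) for x in pool for lx, ly in labeled
        )
        if dist > threshold:
            break  # Remaining points are too far away to pseudo-label safely.
        labeled.append((x, label))
        pool.remove(x)
    return labeled

seed = [(0.0, "No"), (10.0, "Yes")]
# 1.0 and 9.0 are absorbed; 5.1 stays unlabeled because it is too ambiguous.
print(self_train(seed, [1.0, 9.0, 5.1]))
```

The threshold plays the role of a confidence cutoff: pseudo-labeling everything, including ambiguous points, is exactly how self-training propagates label noise.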
Below we list eight practical considerations for active learning.
1. Data exploration to determine which algorithm is best: When starting a new project involving machine learning, it is hard to know which algorithm will yield the best result. Often there is no way of knowing beforehand what the best choice is. There are empirical studies on which one to choose, but the results are fairly mixed [23–25]. Since the selection of algorithm varies so much, it is essential to understand the problem beforehand. If the goal is to reduce the error, then expected error or variance reduction are the best query strategies to choose from. If the density of the sample is easy to use and there is strong evidence that supports a correlation between the cluster structure and the labels, then density-weighted methods should be used. If using large probabilistic models, uncertainty sampling is the only viable option. If there is no time to test different query strategies, it is best to use the simpler approaches based on uncertainty. From our investigation it is clear that Company A is in need of labels in their projects. However, since they have never implemented an automatic labeling process before, it is important that it is done right from the beginning. That is, the data scientists must carefully examine the distribution of the data set, check whether there are any cluster structures, and check whether there are any relationships between the clusters and the labels. If the data exploration is done in a detailed and correct manner, then finding the correct machine learning approach is easy and no time needs to be spent testing different machine learning algorithms.
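The first part of the exploration described above, examining the label distribution, can start with something as simple as the following sketch (the labels are invented for illustration):

```python
from collections import Counter

def label_skew_report(labels):
    """Class frequencies plus the imbalance ratio (majority / minority):
    a quick check before choosing a query strategy or model."""
    frequencies = Counter(labels).most_common()
    ratio = frequencies[0][1] / frequencies[-1][1]
    return frequencies, ratio

labels = ["No"] * 80 + ["Yes"] * 20
frequencies, ratio = label_skew_report(labels)
print(frequencies)  # [('No', 80), ('Yes', 20)]
print(ratio)        # 4.0
```

A high ratio is a warning sign that both passive and active learning will be biased toward the majority class; checking cluster structure against the labels would be the natural next step.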
2. Alternative query types: A traditional active learner queries instances to be labeled by an oracle. However, there are other ways of querying; e.g., human domain knowledge has been incorporated into machine learning algorithms. This means the learner builds models based on human advice, such as rules and constraints, as well as labeled and unlabeled data.
An example of combining domain knowledge with active learning is to use information about the features. This approach is referred to as tandem learning and incorporates feature feedback in traditional classification problems. Active dual supervision is an area of active learning where features are labeled: oracles label features that are judged to be good predictors of one or more classes. The big question is how to actively query these feature labels.
3. Multi-task active learning: From our interviews we can see that there are cases where labels are needed to predict labels for future instances. In other cases the labels are not even needed for machine learning. In one case the data scientist thinks that the labels will be used for another prediction task but is unsure. The most basic way in which active learning operates is that a machine learner is trying to solve a single task. From the interviews it is clear that the same data needs to be annotated in several ways for several future tasks. This means that the data scientist will have to spend even more time annotating, at least once per task. It would be more economical to label a single instance for all sub-tasks simultaneously. This can be done with the help of multi-task active learning.
4. Data reuse and the unknown model class: The labeled training set collected after performing active learning always has a biased distribution. The bias is connected to the class of model used to select the queries. If it is necessary to switch to an improved learner, then it might be troublesome to reuse the training data with models of a different class. This is an important issue in practical uses of active learning. If you know the best model class and feature set beforehand, then active learning can safely be used. Otherwise, active learning will be outperformed by passive learning.
5. Unreliable oracles: It is important to have access to top-quality labeled data. If the labels come from some experiment, there is almost always some noise present. In one of the data sets from Company A, a small subset of the data was labeled. The labels of that particular data set come from experiments conducted in a lab. The label noise seems to be negligible, but that is not the case: there is a difference between the generated data and the true data, as the generated data has features that are continuous while the true data is discrete. Another data set that we studied has labels that came from customer data. The labels were coded "Yes" and "No". However, the "Yes" labels were due to two factors, A and B. The problem is to find a model that can predict the labels, but we are only interested in the "Yes" instances that are due to factor A; the "Yes" instances that are due to factor B need to be relabeled to "No". This is difficult, since the customer data does not state whether a "Yes" is due to factor A or B. The second problem was that some of the "No" instances could develop into a "Yes" over time. It was up to the data scientist to find a way to relabel the data correctly. The data scientist had a solution to the problem but realized that it was faulty and therefore asked us for help. We took a look at the data and the current solution. We saw two large clusters, but there was no noteworthy relationship between the different labels and the features: both clusters contained almost equally many "Yes" and "No" instances. Let us say that the first cluster contained about 60% "Yes" and 40% "No", and the second cluster 60% "No" and 40% "Yes". In the existing solution, all of the instances in the first cluster were re-labeled as "Yes" and all instances in the second cluster were re-labeled as "No". We conclude that this is an approach that will yield noisy labels. The same goes if the labels come from a human annotator, because some instances might be difficult to label, and people can easily be distracted and get tired over time, so the quality of the labels will vary. Thanks to crowdsourcing, one can let several people annotate the same data; that way it is easier to determine which label is the correct one and produce "gold-standard" quality training sets. This approach can also be used to evaluate learning algorithms on training sets that are not gold standard. The big question is: how do we use noisy oracles in active learning? When should the learner query new unlabeled instances rather than update currently labeled instances in case we suspect an error? Studies where estimates of both oracle and model uncertainty were taken into account show that data can be improved by selective repeated labeling. How do we evaluate the annotators? How might payment influence annotation quality? And what should be done if some instances are noisy no matter which oracle is used and repeated labeling does not improve the situation?
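The selective repeated labeling mentioned above, which spends relabeling budget only where the current labels are uncertain, can be approximated by flagging instances with low annotator agreement. This is a sketch; the agreement threshold is an invented parameter, and the studies cited in the text additionally weigh in model uncertainty.

```python
from collections import Counter

def needs_relabel(annotations, agreement=0.75):
    """Selective repeated labeling: flag instances whose current
    annotator agreement falls below `agreement` for another round."""
    flagged = []
    for instance, labels in annotations.items():
        top_count = Counter(labels).most_common(1)[0][1]
        if top_count / len(labels) < agreement:
            flagged.append(instance)
    return flagged

# x1 has 3/4 agreement (kept); x2 has only 2/4 (sent for relabeling).
votes = {"x1": ["Yes", "Yes", "Yes", "No"], "x2": ["Yes", "No", "No", "Yes"]}
print(needs_relabel(votes))  # ['x2']
```

Only the flagged instances are then sent out for additional annotations, which concentrates the labeling budget on the cases where the consensus is genuinely in doubt.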
6. Skewed label distributions: In two of the data sets we studied, the distributions of the labels are skewed; that is, there are more instances of one label than of another. In the "Yes" and "No" labeled example, there are far more "No" instances. When the label distribution is skewed, active learning might not give much better results than passive learning, because if the labels are not balanced, active learning might query more of one label than another. Not only is the skewed distribution a problem, but the lack of labeled data is also a problem. This is the case in the data set where instances are labeled from an experiment: very few instances are labeled from the beginning and new unlabeled data arrives every fifteen minutes. "Guided learning" has been proposed to mitigate this problem. Guided learning allows the human annotator to search for class-representative instances in addition to just querying for labels. Empirical studies indicate that guided learning performs better than active learning as long as its annotation costs are less than eight times those of active learning.
7. Real labeling costs and cost reduction: From observing the data scientists at Company A, we would say that they spend about 80% of their data science time preprocessing the data. We therefore recognize that they do not have time to label too many instances, and it is crucial to reduce the time it takes to label things manually. If the possibility exists, manual labeling should be avoided.
Assume that the cost of labeling is uniform. The smaller the training set used, the lower the associated costs will be. However, in some applications the cost may vary, so simply reducing the number of labeled instances in the training data does not necessarily reduce the cost. This problem is studied within cost-sensitive active learning. To reduce effort in active learning, automatic pre-annotation can help: in automatic pre-annotation, the current model's predictions help to query the labels [27, 28]. This can often reduce the labeling effort. If the model makes many classification mistakes,
then there will be extra work for the human annotator to correct these. To
mitigate these problems correlation propagation can be used. In correlation
propagation the local edits are used to interactively update the prediction.
In general automatic pre-annotation and correction propagation does not
deal with labeling cost themselves. However they do try to reduce the costs
indirectly by minimizing the number of labeling actions performed by the
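A minimal sketch of such a pre-annotation loop (the function names and the toy model below are assumptions for illustration, not the method of [27, 28]): the model proposes a label for every instance and the human only fixes the wrong proposals, so each correct prediction saves one labeling action.

```python
def preannotate(instances, model_predict, human_label):
    """Pre-annotation loop: the model proposes a label, the human reviews
    it, and only wrong proposals require a correction action."""
    labels, corrections = [], 0
    for x in instances:
        proposed = model_predict(x)
        truth = human_label(x)      # human reviews the proposal
        if proposed != truth:
            corrections += 1        # extra work: fix the model's mistake
        labels.append(truth)
    return labels, corrections

# Toy data (assumption for illustration): the model is right 8 times out of 10.
instances = list(range(10))
truth = {x: "Yes" if x < 5 else "No" for x in instances}
model = lambda x: "Yes" if x < 4 or x == 9 else "No"  # wrong on 4 and 9
labels, corrections = preannotate(instances, model, truth.get)
print(corrections)  # manual corrections needed
```

The point of the sketch is the bookkeeping: the better the current model, the fewer labeling actions reach the human, which is exactly the indirect cost reduction described above.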
Other cost-sensitive active learning methods take varying labeling costs into
account. Both current labeling costs and expected future misclassification
costs can be incorporated. The costs might not even be deterministic. In many
applications the costs are not known beforehand, but they might be describable
as a function of annotation time. To find such a function, a regression cost
model can be trained that predicts the annotation costs.
Studies involving real human annotation costs show the following results:
– Annotation costs are not constant across instances [31–34].
– Active learners that ignore costs might not perform better than passive
learners.
– The annotation costs may vary depending on the person doing the annotation.
– The annotation costs can include stochastic components. Jitter and
pause are two types of noise that affect the annotation speed.
– Annotation costs can be predicted after seeing only a few labeled instances.
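The regression cost model mentioned above can be sketched with ordinary least squares; the assumed linear relation between instance length and annotation time, and all the timing data, are hypothetical:

```python
def fit_linear(xs, ys):
    """Ordinary least squares for a single predictor: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Hypothetical timing data: (instance length in tokens, seconds to annotate).
lengths = [10, 20, 30, 40, 50]
seconds = [ 6, 11, 16, 21, 26]   # roughly 0.5 s per token plus 1 s overhead

slope, intercept = fit_linear(lengths, seconds)

def predicted_cost(length):
    """Predict annotation time (seconds) for an unseen instance."""
    return slope * length + intercept

print(round(predicted_cost(100), 1))  # extrapolated cost for a long instance
```

This matches the last finding in the list: a usable cost predictor can be fitted after timing only a handful of labeled instances, and a cost-sensitive learner can then weigh informativeness against predicted cost.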
8. Stopping criteria: This is related to cost reduction. Since active learning
is an iterative process, it is relevant to know when to stop learning. Based
on our empirical findings, the data scientists have no interest in doing any
manual labeling, and if they have to, they want to do as little as possible.
So when the cost of gathering more training data is higher than the cost of
the errors made by the current system, it is time to stop extending the
training set and hence stop training the machine learning algorithm. From
our experience at Company A, the data scientists have so little time free
from other data preprocessing that time is the most common stopping factor.
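The cost-based stopping rule above can be sketched as follows; the error curve and all cost figures are invented for illustration:

```python
def should_stop(error_rate, n_predictions, cost_per_error, cost_per_batch):
    """Stop when acquiring another batch of labels costs at least as much
    as the errors the current system is expected to make."""
    expected_error_cost = error_rate * n_predictions * cost_per_error
    return cost_per_batch >= expected_error_cost

# Hypothetical learning curve: error rate after each labeled batch.
error_curve = [0.30, 0.18, 0.11, 0.07, 0.05, 0.045]

batches = 0
for err in error_curve:
    if should_stop(err, n_predictions=1000, cost_per_error=1.0,
                   cost_per_batch=60.0):
        break
    batches += 1
print(batches)  # batches labeled before the rule says stop
```

The same comparison can use time instead of money: if a batch costs an hour of a data scientist's time, labeling stops as soon as an hour buys less error reduction than it is worth, which matches the time-driven stopping we observed at Company A.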
4.5 Challenges and Mitigation Strategies:
Many of the problems identified during phase I and phase II overlap to a certain
degree. Therefore, we summarized all the problems into three challenges (C1-C3)
that were later mapped to three mitigation strategies (MS1-MS3). These mitigation
strategies are derived from the practical considerations above. Finally, we map
MS1 to C1, MS2 to C2 and MS3 to C3.
C1: Pre-processing: This challenge represents all that needs to be done during
the planning stage of the labeling procedure. This includes creating a
systematic approach for labeling (problem 1 of phase I), doing an exploratory
data analysis to find correlations between labels and features (problem 4 of
phase I), choosing a model that can be reused on new data (problem 6 of
phase I), and labeling instances with respect to multiple tasks (problem
7 of phase I, problem 4 of phase II).
MS1: Planning: This strategy contains all the solution frameworks from practical
considerations 1, 2, 3, 4, 7 and 8, because they all involve the steps
necessary to plan an active learning strategy for labeling.
C2: Annotation: This challenge represents the problems concerning choosing an
annotator as well as evaluating and reducing the label noise (problems 2 and 3
from phase I and problem 3 from phase II).
MS2: Oracle selection: This strategy contains only solution frameworks from
practical consideration 5. It describes how to choose oracles that produce
top-quality labels.
C3: Label distribution: This challenge represents all the problems concerning
the symmetry of the label distributions, such as learning with a skewed label
distribution (problem 5 of phase I and problem 1 of phase II).
MS3: Label distribution: This strategy contains solution frameworks from
practical consideration 6. It describes how to do labeling when the label
distribution is skewed.
From our verification interview with Company B we learned that active learning
is a popular tool for acquiring labels. Thanks to active learning, the labeling
task takes 200 times less time than it would without it.
In the background we presented some current practices that can help with
labeling, the most popular practice being crowdsourcing. However, crowdsourcing
has its own set of problems. The primary concern is that bad annotators produce
noisy labels due to inexperience or the human factor. The benefit of allowing a
third-party company to label data is that you do not have to spend time training
your employees to do the job, nor do you need to develop your own annotation
tools and infrastructure. The big downside is that you have to share
confidential company data with the crowdsourcing platform. Repeated labeling
can be used to improve the quality of the labels, but there is no guarantee
that it will. Rather than correcting noisy labels, there are ways to change
the machine learning models so that they can handle noisy labels. The downside
is that you need to know beforehand which instances are noisy, which can be
difficult in an industrial setting.
None of the techniques discussed in the background utilizes automated
labeling using machine learning. Through our efforts we formulated
three labeling challenges and provided mitigation strategies based on active
machine learning. These challenges are related to questions such as: How can a
labeling process be structured? Who labels the instances, and how? Can
correlations between labels and features be found, so that labels can be
determined from the features? Both manual and automatic labeling introduce some
noise in the labels; how should these noisy labels be used? What do we do if the
distribution of the labels is skewed? How do we take into account the fact that
some of the labels might change over time, due to the nature of the data? How
do we label instances so that the labels can be useful for several future tasks?
Three mitigation strategies that could possibly solve the three challenges
were proposed. The goal of this study is to provide a detailed overview of the
challenges that the industry faces with labeling, as well as to outline
mitigation strategies for these challenges.
To the best of our knowledge, 95% of all the machine learning algorithms
deployed in industry are supervised. Therefore, it is important that every
dataset is complete with labeled instances; otherwise, the data would be
insufficient and supervised learning would not be possible.
It proves challenging to find and structure a labeling process. You need
to define a systematic approach for labeling and examine the data to choose
the optimal model. Finally, you need to choose an oracle that produces
top-quality labels, as well as plan how to handle skewed label distributions.
The contribution of this paper is twofold. First, based on a case study
involving two companies, we identified problems that companies experience in
relation to labeling data. We validated these problems using interviews at both
companies and summarized all problems into challenges. Second, we present an
overview of the mitigation strategies that companies employ (or could employ)
to address them.
In our future work, we aim to further verify the challenges as well as
the mitigation strategies with more companies. In addition, we intend to develop
solutions to simplify the use of automated labeling in industrial contexts.
This work was partially supported by the Wallenberg AI, Autonomous Systems
and Software Program (WASP), funded by the Knut and Alice Wallenberg
Foundation.
1. Cognilytica Research, “Data Preparation & Labeling for AI 2020,” tech. rep., Cog-
nilytica Research, 2020.
2. Y. Roh, G. Heo, and S. E. Whang, “A survey on data collection for machine
learning: a big data-ai integration perspective,” IEEE Transactions on Knowledge
and Data Engineering, 2019.
3. AzatiSoftware, Automated Data Labeling with Machine Learning.
4. J. C. Chang, S. Amershi, and E. Kamar, “Revolt: Collaborative crowdsourcing for
labeling machine learning datasets,” in Proceedings of the 2017 CHI Conference
on Human Factors in Computing Systems, pp. 2334–2346, 2017.
5. J. Zhang, V. S. Sheng, T. Li, and X. Wu, “Improving crowdsourced label quality
using noise correction,” IEEE Transactions on Neural Networks and Learning
Systems, vol. 29, no. 5, pp. 1675–1688, 2017.
6. CloudFactory, Crowd vs. Managed Team: A Study on Quality Data Processing
at Scale, 2020. https://go.cloudfactory.com/hubfs/02-Contents/3-Reports/Crowd-
7. J. Zhang, X. Wu, and V. S. Sheng, “Learning from crowdsourced labeled data: a
survey,” Artificial Intelligence Review, vol. 46, no. 4, pp. 543–576, 2016.
8. hackernoon.com, Crowdsourcing Data Labeling for Machine Learning Projects,
9. P. G. Ipeirotis, F. Provost, V. S. Sheng, and J. Wang, “Repeated labeling using
multiple noisy labelers,” Data Mining and Knowledge Discovery, vol. 28, no. 2,
pp. 402–441, 2014.
10. A. Sheshadri and M. Lease, “Square: A benchmark for research on computing
crowd consensus,” in First AAAI Conference on Human Computation and
Crowdsourcing, 2013.
11. S. Sukhbaatar and R. Fergus, “Learning from noisy labels with deep neural net-
works,” arXiv preprint arXiv:1406.2080, vol. 2, no. 3, p. 4, 2014.
12. C. G. Northcutt, L. Jiang, and I. L. Chuang, “Confident learning: Estimating
uncertainty in dataset labels,” arXiv preprint arXiv:1911.00068, 2019.
13. P. Reason and H. Bradbury, Handbook of action research: Participative inquiry and
practice. Sage, 2001.
14. P. Runeson and M. Höst, “Guidelines for conducting and reporting case study
research in software engineering,” Empirical Software Engineering, vol. 14, no. 2,
p. 131, 2009.
15. M. Staron, Action Research in Software Engineering: Theory and Applications.
Springer Nature, 2019.
16. V. Braun and V. Clarke, “Using thematic analysis in psychology,” Qualitative
research in psychology, vol. 3, no. 2, pp. 77–101, 2006.
17. Towards Data Science, What To Do When Your Classification Data is Imbalanced?, 2019.
18. T. Fredriksson, J. Bosch, and H. Holmström-Olsson, “Machine learning models for
automatic labeling: A systematic literature review,” 2020.
19. B. Settles, Active Learning. Synthesis Lectures on Artificial Intelligence and
Machine Learning, Morgan & Claypool, 2012.
20. X. J. Zhu, “Semi-supervised learning literature survey,” tech. rep., University of
Wisconsin-Madison Department of Computer Sciences, 2005.
21. N. N. Pise and P. Kulkarni, “A survey of semi-supervised learning methods,” in
2008 International Conference on Computational Intelligence and Security, vol. 2,
pp. 30–34, IEEE, 2008.
22. E. Bair, “Semi-supervised clustering methods,” Wiley Interdisciplinary Reviews:
Computational Statistics, vol. 5, no. 5, pp. 349–361, 2013.
23. C. Körner and S. Wrobel, “Multi-class ensemble-based active learning,” in
European Conference on Machine Learning, pp. 687–694, Springer, 2006.
24. A. I. Schein and L. H. Ungar, “Active learning for logistic regression: an evalua-
tion,” Machine Learning, vol. 68, no. 3, pp. 235–265, 2007.
25. B. Settles and M. Craven, “An analysis of active learning strategies for sequence
labeling tasks,” in Proceedings of the 2008 Conference on Empirical Methods in
Natural Language Processing, pp. 1070–1079, 2008.
26. A. Harpale, Multi-task active learning. PhD thesis, Carnegie Mellon University,
27. J. Baldridge and M. Osborne, “Active learning and the total cost of annotation,”
in Proceedings of the 2004 Conference on Empirical Methods in Natural Language
Processing, pp. 9–16, 2004.
28. A. Culotta and A. McCallum, “Reducing labeling effort for structured prediction
tasks,” in AAAI, vol. 5, pp. 746–751, 2005.
29. A. Kapoor, E. Horvitz, and S. Basu, “Selective supervision: Guiding supervised
learning with decision-theoretic active learning.,” in IJCAI, vol. 7, pp. 877–882,
30. B. Settles, M. Craven, and L. Friedland, “Active learning with real annotation
costs,” in Proceedings of the NIPS Workshop on Cost-Sensitive Learning, pp. 1–10,
Vancouver, CA, 2008.
31. S. Arora, E. Nyberg, and C. Rose, “Estimating annotation cost for active learn-
ing in a multi-annotator environment,” in Proceedings of the NAACL HLT 2009
Workshop on Active Learning for Natural Language Processing, pp. 18–26, 2009.
32. E. K. Ringger, M. Carmen, R. Haertel, K. D. Seppi, D. Lonsdale, P. McClana-
han, J. L. Carroll, and N. Ellison, “Assessing the costs of machine-assisted corpus
annotation through a user study.,” in LREC, vol. 8, pp. 3318–3324, 2008.
33. S. Vijayanarasimhan and K. Grauman, “What’s it going to cost you?: Predicting
effort vs. informativeness for multi-label image annotations,” in 2009 IEEE
Conference on Computer Vision and Pattern Recognition, pp. 2262–2269, IEEE, 2009.
34. B. C. Wallace, K. Small, C. E. Brodley, J. Lau, and T. A. Trikalinos, “Modeling
annotation time to reduce workload in comparative eﬀectiveness reviews,” in Pro-
ceedings of the 1st ACM International Health Informatics Symposium, pp. 28–35,
35. R. A. Haertel, K. D. Seppi, E. K. Ringger, and J. L. Carroll, “Return on invest-
ment for active learning,” in Proceedings of the NIPS Workshop on Cost-Sensitive
Learning, vol. 72, 2008.