Guidelines for Quality Assurance of Machine Learning-Based Artificial Intelligence
Gaku Fujii
Cinnamon, Inc., Tokyo, Japan
gfujii@cinnamon.is
Koichi Hamada
DeNA Co., Ltd., Tokyo, Japan
koichi.hamada@dena.com
Fuyuki Ishikawa
National Institute of Informatics
Tokyo, Japan
f-ishikawa@nii.ac.jp
Satoshi Masuda
IBM Research, Tokyo, Japan
smasuda@jp.ibm.com
Mineo Matsuya
LIFULL Co., Ltd.
Tokyo, Japan
matsuyamineo@lifull.com
Tomoyuki Myojin
Hitachi, Ltd., Yokohama, Japan
tomoyuki.myojin.fs@hitachi.com
Yasuharu Nishi
University of Electro-Communications
Tokyo, Japan
yasuharu.nishi@uec.ac.jp
Hideto Ogawa
Hitachi, Ltd., Yokohama, Japan
hideto.ogawa.cp@hitachi.com
This is an Open Access article published by World Scientific Publishing Company. It is distributed under the terms of the Creative Commons Attribution 4.0 (CC BY) License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
OPEN ACCESS
International Journal of Software Engineering and Knowledge Engineering
Vol. 30, Nos. 11&12 (2020) 1589–1606
© The Author(s)
DOI: 10.1142/S0218194020400227
Takahiro Toku
OMRON Corporation, Kyoto, Japan
takahiro.toku@omron.com
Susumu Tokumoto
FUJITSU LABORATORIES LTD.
Kawasaki, Japan
tokumoto.susumu@fujitsu.com
Kazunori Tsuchiya
FUJITSU LTD.
Kawasaki, Japan
ktsuchiya@fujitsu.com
Yasuhiro Ujita
OMRON Corporation
Kyoto, Japan
yasuhiro.ujita@omron.com
Received 14 July 2020
Revised 9 August 2020
Accepted 30 August 2020
Significant effort is being put into developing industrial applications for artificial intelligence (AI), especially those using machine learning (ML) techniques. Despite the intensive support for building ML applications, there are still challenges when it comes to evaluating, assuring, and improving their quality or dependability. The difficulty stems from the unique nature of ML, namely, that system behavior is derived from training data, not from logical design by human engineers. This leads to black-box and intrinsically imperfect implementations that invalidate many principles and techniques in traditional software engineering. In light of this situation, the Japanese industry has jointly worked on a set of guidelines for the quality assurance of AI systems (in the Consortium of Quality Assurance for AI-based Products and Services) from the viewpoint of traditional quality-assurance engineers and test engineers. We report on the second version of these guidelines, which cover a list of quality evaluation aspects, a catalog of current state-of-the-art techniques, and domain-specific discussions in five representative domains. The guidelines provide significant insights for engineers in terms of methodologies and designs for tests driven by application-specific requirements.

Keywords: Software quality; testing; artificial intelligence; machine learning; guidelines.
1. Introduction
Machine learning (ML) is a key driving force for industrial innovation in the form of artificial intelligence (AI) systems. ML-based AI systems have unique engineering characteristics because their components (models) are constructed by training with data in an inductive manner. The obtained components are
intrinsically imperfect, i.e. they tend to have limited accuracy and are black-box in the sense that the learned behavior is too complex to understand or reason about, especially in the case of deep learning. Further difficulties emerge as such AI systems work with fuzzy requirements regarding human perception or the open real world. One survey showed that more than 40% of engineers find the quality assurance of AI systems difficult at the highest level of the scale, in the sense that current approaches no longer work [1].
At the same time, there is an increasing demand for high-quality and dependable AI systems as more and more effort is put into industrial applications. It is therefore crucial to provide clear guidance for understanding and tackling the difficulties inherent in building high-quality AI systems. In response to such industry demands, we established the Consortium of Quality Assurance for AI-based Products and Services (QA4AI Consortium), made up of experts from both industry and academia. The objectives of the consortium are to form a societal consensus on the quality of AI systems by researching the relevant issues and solutions, and to contribute to the diffusion of ML developments into a safe and secure society.
In this paper, we report on the second version of the guidelines for the quality assurance of ML-based AI systems [2]. These guidelines define the general concept and technologies for the quality assurance of AI systems, including concrete guidelines relating to the quality characteristics, test architecture, and test viewpoints in each typical domain. This paper extends our previous version [3] with the latest status of the guidelines, specifically, the addition of a discussion on AI-OCR (optical character recognition).
The remainder of this paper is organized as follows: In Sec. 2, we first describe the consortium and the methodology for working on the guidelines. In Secs. 3 and 4, we describe the guidelines in terms of the common core part and domain-specific parts, respectively. We discuss the evaluation of the guidelines in Sec. 5 and the threats to validity of the evaluation in Sec. 6. In Sec. 7, we introduce related work. We conclude the paper with future perspectives in Sec. 8.
2. Methodology
2.1. The QA4AI consortium
The QA4AI Consortium is a voluntary group for discussing the quality assurance of ML-based AI systems in Japan. Its objectives are to promote the application of ML-based AI systems by reducing the risks associated with AI/ML and to foster a common social understanding of their quality, including its limitations.
When the first version of the guidelines was released, the consortium consisted of 39 experts and three organizations from both academia and industry. The experts include researchers and practitioners in various technical fields including software engineering, system safety, ML, and quality assurance. The application domains of the participants are also diverse, covering the entertainment, automotive, factory
automation, electrics and electronics, communications, software, IT solutions, consumer devices, web systems, aerospace, and more.
2.2. Structure of guidelines
The consortium facilitated two types of discussion to formulate the guidelines. In the first, quality-assurance-related issues in specific application domains were discussed. The purpose was to derive concrete insights, since general insights might be too abstract for the various domains with different demands. For the first version of the guidelines, there were four working groups: one each for generative systems, operational data in process systems, voice user interfaces (VUIs), and autonomous driving (AD). Later, a new domain was added for OCR for the second version reported in this paper.
The second type of discussion was for organizing and summarizing the common
core concepts of the quality assurance of ML-based AI systems. These discussions
were facilitated by expert members, and their output was reviewed by the entire
consortium. There are two core parts to the guidelines: axes of quality evaluation and
a technical catalog.
The first version of the guidelines was published on the QA4AI Consortium's website (http://www.qa4ai.jp) in May 2019. The second version was published in February 2020. The content of the guidelines includes the above core parts and domain-specific parts as follows:

- Guidelines for specific domains: (1) Generative Systems; (2) Operational Data in Process Systems; (3) VUI; (4) AD; and (5) OCR.
3. Core Parts of Guidelines
3.1. Axes of quality evaluation
The quality assurance of ML-based systems has unique aspects in contrast to the quality assurance of traditional non-ML-based systems. Specifically, ML-based systems usually include a complex, nonlinear model constructed in the inductive development style, for stakeholders who may be unfamiliar with ML-based system development.
Software development can be divided into deductive and inductive styles. The former applies to traditional software, for which engineers have extensive development knowledge drawn from experience; quality assurance applies this knowledge through process assessment, measurement, reviews, and testing. The latter applies to ML-based systems, because the existing knowledge for software programs often does not carry over to ML-based systems, whose components are automatically generated, nonlinear, and too complex [1]. Therefore, traditional process assessment, measurement, and reviews
may be ineffective, although frequent, entire, and exhaustive testing (FEET) still works. Engineers have to adopt both the inductive development style for the core ML models and the deductive development style for the entire ML-based system.
The QA4AI guidelines extract five aspects of quality evaluation for ML-based systems: Data Integrity, Model Robustness, System Quality, Process Agility, and Customer Expectation.
Data Integrity deals with the data aspect, i.e. the samples of inputs and outputs used for training and testing. This aspect contains 11 general checkpoints covering statistical considerations and the quality of the data generator: volume and cost; meaningfulness and requirements; relationships between population and sample; bias and contamination; complexity; multicollinearity; outliers and missing values; privacy and confidentiality; intellectual property rights; independence of validation data; and the effect of online learning.
Model Robustness relates to the quality assurance of a model generated automatically. This aspect has 11 general checkpoints for the characteristics of neural networks, model performance, generalization, noise, local optima, architecture, hyperparameters, cross validation, data diversity, and degradation.
System Quality addresses the quality assurance of the whole system. This aspect has eight general checkpoints for system-level quality: system performance, validation scope, criticality and frequency of accidents, controllability of the system in accidents, functional safety, security, contribution and localizability of ML components, and explainability and assurability.
Process Agility captures quality assurance from the viewpoint of the development process. This aspect has 11 general checkpoints for quickness of exploration, including short iterations and immediate feedback, scalability, automatability, FEET, appropriate skills and deep understanding, and teamwork.
Finally, Customer Expectation allows quality assurance to be argued for various stakeholders who may be unfamiliar with ML-based system development. This aspect has eight general checkpoints: extravagant expectations for AI, acceptance of probabilistic behavior, severity of expectation, optimism about huge data, ambiguity of requirements, compliance, linear and deterministic thinking, and bureaucracy. This axis is the baseline for the others: the higher Customer Expectation is, the higher the other axes need to be.
The overall quality of ML systems should be evaluated from the viewpoint of the balance among the axes according to Customer Expectation. The development organization of ML-based systems should also establish a well-balanced quality-assurance fabric, organization structure, and quality-management system. Figure 1 shows examples of well-balanced and ill-balanced quality pentagons consisting of the axes. Furthermore, the overall quality of ML systems usually depends on the development phase, such as proof of concept (PoC), beta release, and deployment of the service to a large number of users. The later the phase of development, the better the quality should be.
3.2. Technical catalog
Technical guidelines typically generalize and summarize techniques and practices being successfully used in industry, at least in leading companies. However, for the quality assurance of ML models or ML-based systems, techniques and practices are only just emerging and remain under active investigation. We therefore collected trends from state-of-the-art research papers in the software engineering community. We also listed the standard concepts established in the ML community, primarily for performance evaluation, e.g. precision/recall, over/under-fitting, and cross validation.
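These standard concepts are directly available in common libraries. The following is a minimal scikit-learn illustration of precision, recall, and cross validation; the dataset and model are arbitrary choices for demonstration, not part of the guidelines.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split

# Arbitrary binary-classification dataset and model, for demonstration only.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
pred = model.predict(X_te)
print("precision:", precision_score(y_te, pred))
print("recall:", recall_score(y_te, pred))

# Five-fold cross validation guards against over/under-fitting judgments
# that rest on a single train/test split.
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
print("cross-validated accuracy:", scores.mean())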
The state-of-the-art trends included in the guidelines are as follows:

- Use of a pseudo oracle (e.g. [4]).
- Metamorphic testing (e.g. [5, 6]).
- Robustness evaluation and search for adversarial examples (e.g. [5, 7]).
- Structural coverage for neural networks [4, 8].
- Methods for explainable AI, including local explanation for each output (e.g. [9, 10]) and global explanation of the trained model (e.g. [11]).

Note that we endeavor to generalize the concepts as well as decompose multiple aspects combined in one research paper or tool (e.g. [4]).
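To make the metamorphic-testing entry of the catalog concrete, the following Python sketch (our illustration, not taken from the guidelines) tests an image classifier without a ground-truth oracle: small brightness shifts should not change the predicted label, so each violation marks a suspicious behavior. The model object and its predict() method are hypothetical placeholders.

import numpy as np

def brightness_shift(image, delta):
    # Metamorphic transformation: shift pixel intensities of an image
    # in [0, 1], clipping back into the valid range.
    return np.clip(image + delta, 0.0, 1.0)

def metamorphic_label_test(model, images, deltas=(-0.05, 0.05)):
    # Check the metamorphic relation "predicted label is invariant under
    # small brightness shifts"; return the (index, delta) pairs violating it.
    violations = []
    for i, img in enumerate(images):
        base = model.predict(img)  # hypothetical classifier interface
        for d in deltas:
            if model.predict(brightness_shift(img, d)) != base:
                violations.append((i, d))
    return violations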
4. Guidelines for Specific Domains

The five axes provide common guidelines for the quality assurance of ML-based systems, but it is necessary to design a concrete scheme of quality assurance with an appropriate understanding of the characteristics of each system. Therefore, we examined five popular domains in which ML-based systems are used to discuss the required quality and the quality-assurance viewpoints for each domain.
Fig. 1. Examples of well-balanced and ill-balanced quality pentagons.
4.1. Generative systems

There have been outstanding advances in techniques for generative models, which learn "what happens with what probability", particularly in techniques for generative adversarial networks (GANs) [12]. With these techniques, applications that create images, videos, essays, or dialog can be constructed. We focus on such emerging applications because they have a unique focus when it comes to quality, for example, how natural and diverse the outputs are. Such quality attributes are intrinsically fuzzy and difficult to assess automatically.
Our objective in this domain is to uncover potential approaches to the automated evaluation of such quality attributes for emerging generative systems. We defined a concrete application that generates an image or video of an anime character, inspired by the technique presented by Hamada et al. [13]. Such functions help create attractive interface agents and videos. We defined five use cases for this application. Two use cases are shown in what follows, and Fig. 2 illustrates the first example.

(1) Generate diverse natural character images of a specified pose given as 2D-coordinates of key body parts.
(2) Generate a natural character video given two images for the start and end points.
For these use cases, we enumerated the quality attributes that should be investigated, which are summarized as follows:

- Naturalness: e.g. the outputs let human users feel they were created by human creators.
- Clearness and smoothness: e.g. there is no noise, collapse, or discontinuity in the outputs.
- Diversity: e.g. poses (when not specified) or clothing in the outputs have a certain degree of diversity.
- Social appropriateness: e.g. no discriminatory or obscene output is generated.
- Specification conformance: the output follows the given instruction, such as gender or clothing color.

Fig. 2. Image generation from given pose specification.
Although they are intrinsically fuzzy due to human perception, the possibilities of
automated evaluation should be explored. Three primary approaches for evaluating
these quality attributes and some examples are given in the following.
Approach 1: Metrics

Define and use metrics that represent the target quality attribute, even approximately. For example, we can leverage the evaluation metrics of GANs for Naturalness and Diversity [14, 15]. As another example, we can evaluate statistical values and distributions of optical flow, which captures the movement of each part across the frames of the video, to detect obviously too drastic movement.
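A minimal sketch of the optical-flow check, assuming OpenCV's Farneback dense flow on grayscale frames; the threshold value is an illustrative assumption.

import cv2
import numpy as np

def max_flow_magnitudes(frames):
    # Per-transition peak optical-flow magnitude for a list of grayscale
    # uint8 frames (H x W arrays), using Farneback dense optical flow.
    mags = []
    for prev, nxt in zip(frames, frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(float(np.linalg.norm(flow, axis=2).max()))
    return mags

def flag_drastic_movement(frames, threshold=30.0):
    # Flag frame transitions whose peak flow (in pixels per frame) exceeds
    # a tunable threshold; such jumps suggest discontinuity in the video.
    return [i for i, m in enumerate(max_flow_magnitudes(frames)) if m > threshold]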
Approach 2: Evaluation AI

Construct an AI that evaluates the target quality attribute. Pose-estimation techniques [16] can be used to judge whether a generated output matches the specified pose. We can also build our own model for pose estimation, since the training data for the generative model originally include mappings between poses and images, which can be used as training data for a pose-estimation model. We can also investigate a dedicated model and data for the target quality attributes. For example, we may construct a classifier to detect noisy images by creating training data that include images with noise automatically added.
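The training data for such a noise detector can be produced mechanically from clean outputs. A minimal sketch follows; the Gaussian noise model and the binary labels are illustrative assumptions.

import numpy as np

def add_gaussian_noise(image, sigma=0.1):
    # Corrupt a float image in [0, 1] with Gaussian pixel noise.
    noisy = image + np.random.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0.0, 1.0)

def build_noise_dataset(clean_images):
    # Label clean images 0 and automatically corrupted copies 1, yielding
    # training data for a binary "noisy output" detector (the evaluation AI).
    xs, ys = [], []
    for img in clean_images:
        xs.append(img); ys.append(0)
        xs.append(add_gaussian_noise(img)); ys.append(1)
    return np.stack(xs), np.array(ys)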
Approach 3: Evaluation rules

Construct a rule-based AI or traditional software to evaluate the target quality attribute. For example, we can implement an analyzer that checks whether the specified clothing color is dominant inside the character in the output image.
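A minimal sketch of such a rule-based check; the character mask, target color, and dominance threshold are illustrative assumptions.

import numpy as np

def color_dominance(image, mask, target_rgb, tolerance=0.15):
    # Fraction of pixels inside the character mask (boolean H x W array)
    # whose RGB value, for an image in [0, 1], lies within the per-channel
    # tolerance of the specified clothing color.
    region = image[mask]  # (N, 3) array of character pixels
    close = np.all(np.abs(region - np.asarray(target_rgb)) < tolerance, axis=1)
    return float(close.mean())

def check_clothing_color(image, mask, target_rgb, min_fraction=0.3):
    # Rule-based check: the specified color should cover at least
    # min_fraction of the character region.
    return color_dominance(image, mask, target_rgb) >= min_fraction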
4.2. Operational data in process systems
In industrial systems, ML technologies have been applied and practically used in various fields, such as abnormality detection, parameter recommendation, and visual inspection. Quality assurance must account for the following three characteristics:

- Stakeholder diversity: Industrial systems consist of multiple subsystems. Data Integrity depends on various stakeholders, operations, and contracts.
- Environmental dependency: Systems are exposed to unrepeatable and unpredictable changes of 5M + E (man, machine, method, material, measure, and environment).
- Accountability: To operate the whole system, we need to endorse the validity of all system standards and rules.
Considering these three characteristics and the inductive manner of building an ML model, we defined a development process model for ML systems called the
intelligent experimental integration (IXI) model, as shown in Fig. 3. This model is divided into three phases: PoC, development, and operation. Major risks should be identified and verified in the PoC phase. In the development phase, industrial systems using ML are developed on the basis of the PoC results. In the operation phase, the output of the deployed ML and the behavior of the system are monitored and quality is maintained. The results from each phase are collected for explanation to stakeholders, and the ML model of the system should be evaluated using risk-identified data during operation and updated as necessary. Proceeding through each phase is difficult because there are no rational evaluation guidelines for each activity. Therefore, we modeled all mandatory development and operation activities in the IXI model and defined the following evaluation criteria:
- Customer Expectation: Coordinate intangible assets, such as software, and the various stakeholders involved.
- Data Integrity: Repeat the data-confirmation process to handle environmental changes or deterioration of facilities.
- Model Robustness: Check the conditions of data collection and the evaluation process and measurements.
- System Quality: System quality depends on data and model quality. This criterion covers the evaluation process for each change and the explanation to stakeholders.
- Process Agility: We emphasized the importance of adapting to any changes and selected important agile practices.
We specified the relationship between the quality-model criteria and each activity in the defined process model (IXI model: Fig. 3). All criteria are concrete for practical development, and the process model was checked against a real case reported in a previous study [17]. The model works not only for verifying quality but also for planning ML system development.

Fig. 3. IXI model: Intelligent eXperimental Integration model.
We also discussed a real example and added the results to the guideline content [17]. The system has the above three characteristics, so we take it as a good example for our guidelines (quality-model criteria and IXI model). We discussed the criteria, the review process, and test viewpoints for this example and found the following pros (+) and cons (−) in our guidelines:

- (+) Easy to cover quality criteria. The guidelines cover all ML-related review points and test viewpoints.
- (+) Easy to plan using the IXI model. This model helps in understanding the necessity of iteration.
- (−) Difficult to conclude specific thresholds as criteria and a methodology for conducting measurement with the metrics.

Finally, we conclude that our guidelines help cover and plan the quality assurance for industrial systems. For the concrete criteria in the IXI model, please refer to our original guidelines.
4.3. Voice user interface system
A VUI system, such as a smart speaker, recognizes the user's voiced sentences, understands the intent, and performs the requested actions by using the following ML technologies:

- Speech recognition converts speech signals captured with a microphone into text.
- Natural language understanding interprets the converted text to generate the commands to act on.
- Speech synthesis converts the text resulting from the commands into speech signals.
We discussed the quality of VUI systems according to the axes of quality evaluation presented in Sec. 3.1. For Data Integrity, the system is required to perform the same action for the same intention expressed with different voices or expressions. For Model Robustness, the quality of model updates is typically important, since new words are created daily. For System Quality, the profiles and daily lives of users are important because smart speakers are usually located in homes. For Customer Expectation, it is necessary to determine the target users for each function and evaluate whether those users are satisfied.

The test architecture for smart speakers consists of several test viewpoints at several test levels, as shown in Table 1. It is, however, difficult to clearly evaluate conformity to requirements because VUI systems have a wide variety of requirements.
The n-level evaluation method addresses this difficulty: various engineers evaluate whether output behaviors are suitable for various intentions and specifications. An example of a five-level evaluation for a smart speaker is shown as follows (a sketch of aggregating such ratings follows the list):

(1) An unintended, different function is performed.
(2) The intended function is performed, but the content is unintended.
(3) The intended function is performed, but unintended information is returned.
(4) The intended function is performed and the intended content is returned, but the content is incorrect.
(5) The intended function is performed and the intended content is returned.
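A minimal sketch of aggregating such per-engineer level ratings into an acceptance decision; the thresholds are illustrative assumptions, not values from the guidelines.

from statistics import mean

def passes_acceptance(ratings, min_mean=4.0, min_worst=3):
    # ratings: one level (1-5) per evaluating engineer for one utterance.
    # Accept only if both the average rating and the worst single rating
    # clear their thresholds.
    return mean(ratings) >= min_mean and min(ratings) >= min_worst

# Example: three engineers rate the behavior for one utterance.
print(passes_acceptance([5, 4, 4]))  # True
print(passes_acceptance([5, 5, 2]))  # False: one rater observed level 2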
Quality-assurance levels of the whole system of a smart speaker can be defined with the following two levels:

(1) Behavior level: The results of tests that can be answered with yes/no meet the specified acceptance criteria.
(2) Contents level: The results of tests for evaluating the attractiveness of the product meet the defined acceptance criteria.
4.4. Autonomous driving

AD uses ML-based systems as a core technology for object recognition, path planning, and manipulation decisions. We investigated ideas, approaches, technologies, and methodologies that assure the quality of ML-based systems, focusing on object recognition for autonomous emergency braking (AEB) as a concrete function of AD. The target system supports automated steering and acceleration capabilities, which correspond to level 2 of the Society of Automotive Engineers (SAE) standard [18].

We identified three challenges with the quality assurance of AD: AD is expected to reduce crashes compared with human driving; scenes encountered in real driving can go beyond the scope assumed when the AD system was first developed; and ML models are difficult to replace on the fly after AD is deployed to real driving. As a solution to these challenges, we developed a methodology consisting of the following phases:

(1) Analyze a use case for object recognition of AEB based on a framework to manage the uncertainty of AD [19] and structuring validation targets [20].
Table 1. Example of test architecture for VUI systems.

Test level | Test target | Test viewpoint
Unit test | System modules other than ML | Unit test for each module
Unit test | Speech recognition, natural language understanding, and speech synthesis | Accuracy test for data and ML models
Integration test | APIs | Functional test of integrated modules
System test | Features | Specification-based testing; exploratory testing; scenario-based testing
(2) Develop fail-safe functions against misrecognition based on failure analysis.
(3) Store knowledge of uncertainty gaps on real roads and feed it back to the analysis at the beginning of development.

An example of this methodology, which includes an AD development process, analysis of uncertainty, and items to be verified in the development process, is shown in Fig. 4.
This methodology helps in creating test cases for AEB. From the results of analyzing the uncertainty of AEB for pedestrians, test cases are represented as a pedestrian who does not look like a pedestrian (false negative) and an object that looks like a pedestrian (false positive). False negatives include, for example, pedestrians wearing a coat that is the same color as the wall or standing behind a pole, and false positives include a pedestrian reflected in a window and a painting that looks like a pedestrian. The test cases require expected results. A false negative means AEB will not work, so the driver needs to operate the brake. A false positive means that AEB will work (the car will decelerate) against the driver's expectation, so the driver cannot prevent the deceleration.
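The enumerated misrecognition scenarios and their expected system-level results can be recorded as structured test cases; the following sketch paraphrases the examples above into an illustrative (assumed) data structure.

from dataclasses import dataclass

@dataclass
class AEBTestCase:
    scenario: str            # concrete driving scene to reproduce
    error_type: str          # "false_negative" or "false_positive"
    expected_behavior: str   # system-level expected result

TEST_CASES = [
    AEBTestCase("pedestrian wearing a coat the same color as the wall",
                "false_negative", "AEB does not brake; driver must brake"),
    AEBTestCase("pedestrian standing behind a pole",
                "false_negative", "AEB does not brake; driver must brake"),
    AEBTestCase("pedestrian reflected in a window",
                "false_positive", "AEB decelerates against driver expectation"),
    AEBTestCase("painting that looks like a pedestrian",
                "false_positive", "AEB decelerates against driver expectation"),
]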
Fig. 4. Methodology for analysis of uncertainty and items to be verified in the AD development process.

4.5. AI-OCR

The AI-OCR system converts characters in images into character codes with ML techniques. To clarify the discussion scope in this guideline, we defined the AI-OCR system as being composed of the following four modules:

(1) Preprocessing module: This is a module for removing noise, skew, and so on from input image data.
(2) Character-position-detection module: This is a module for identifying the characters' positions in the preprocessed data to define the OCR target.
(3) OCR module: This is a module for converting the image at the position identified by the detection module into character codes.
(4) Information-extraction module: This is a module for recognizing the OCR results as specific information.
We discussed the quality characteristics of AI-OCR systems based on the axes of quality evaluation presented in Sec. 3.1.

The target of AI-OCR is often images of forms. Therefore, for Data Integrity, we defined four form properties, i.e. layout, character, noise, and images; examples are listed in Table 2.
For Model Robustness, we presented an example of metamorphic testing for the AI-OCR system. We proposed the following three steps to identify test cases (a sketch of one such metamorphic relation follows the list):

(1) Identify recognition patterns for which OCR without ML has been weak.
(2) Analyze the properties and frequency of the forms generated in the customer's business operations.
(3) Identify the significant impacts on customer operations if the AI-OCR output is unexpected.
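As one concrete instance of such a metamorphic relation, a small skew applied to a form image should leave the recognized text unchanged. The following sketch assumes a hypothetical ocr(image) -> str recognizer interface and uses OpenCV for the rotation.

import cv2

def skew(image, angle_deg=2.0):
    # Metamorphic transformation: rotate the form image by a small angle,
    # mimicking a slightly skewed scan (white border fill).
    h, w = image.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    return cv2.warpAffine(image, m, (w, h), borderValue=255)

def test_skew_invariance(ocr, form_images, angle_deg=2.0):
    # Metamorphic relation: recognized text should not change under a
    # small skew; return indices of forms that violate the relation.
    return [i for i, img in enumerate(form_images)
            if ocr(img) != ocr(skew(img, angle_deg))]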
The AI-OCR system is applied in many business processes, such as input–output business tasks like reconciliation. Thus, Customer Expectation is evaluated by the contribution to business-process efficiency and is defined as in Table 3.
Table 2. Form-property examples.

Property | Characteristics
Layout | Fixed or free layout; amount of characters in one image; table structure; vertical or horizontal lines; single page or several pages
Character | Handwritten or printed; decoration (bold, italic, and so on); symbols; font (Gothic, Mincho, and so on); logos
Noise | Blotted characters; stamps on characters; shadows; skewed forms; dirty background
Images | Dots per inch (DPI); image size; monochrome or color; image brightness; image saturation

Table 3. Customer-expectation examples.

Expectation type | Expectation level
Task-replacement level | Completely replace the user's task; help with the user's task; use OCR results for preservation
Task-processing-speed level | Faster than users; equal to users
Task-quality level | AI-OCR makes mistakes that are impossible for humans to make; quality of the task varies from human to human
Quality-assurance levels of the AI-OCR system can be defined with the following four levels:

(1) Initial level: The system is evaluated without a real data set.
(2) AI-development level: The system is evaluated with a real data set, but without consideration of the frequency and priority of each data set.
(3) Business-ready level: The system is evaluated with a real data set with consideration of the frequency and priority of each data set.
(4) Business-efficiency level: The system is evaluated with the same real data set as the operation data set.
5. Evaluation
We administered a questionnaire survey to evaluate the usefulness of the guidelines. The respondents were 31 guideline readers, including 13 who had participated in developing the guidelines, since they are also users of the guidelines. Table 4 shows the professions of the respondents.

Table 4. Professions of respondents.

Target | Research | Development | Testing | Quality | Other | Total
ML/AI | 1 | 11 | 3 | 2 | 0 | 17
Software | 1 | 0 | 4 | 8 | 0 | 13
Procurement | 0 | 0 | 0 | 0 | 1 | 1
Total | 2 | 11 | 7 | 10 | 1 | 31

The questionnaire was based on a 5-point Likert scale ranging from "strongly deny" to "strongly agree." A summary of the questions is shown in Fig. 5. The results of question 1 indicate that all users could understand the characteristics of ML-based products, which completely differ from those of conventional software, as well as the proposed concept of their quality assurance.

Questions 2–9 addressed the usefulness of the guidelines at each phase of AI-based system development. Over 77% of respondents agreed or strongly agreed with the usefulness at every phase, and 94% agreed or strongly agreed at the
quality-assurance phase. These results indicate that the QA4AI guidelines meet
the objective of clarifying the general concept of quality assurance of AI-based
systems.
6. Threats to Validity
We now discuss external validity as the core threat to validity. There were few respondents to the questionnaire, so more readers are needed to properly evaluate the guidelines. Moreover, because this was an open web questionnaire, it is possible that only readers who felt positively responded. Figure 6 shows the differences in the responses to question 1 between the authors and the others. The authors rated the understandability of the guidelines more highly than the others did; the authors may have a more positive assessment of the guidelines than the others.
7. Related Work
Reports on practices and case studies are emerging from the industry. Most are general (e.g. [21, 22]), and their coverage of quality assurance or testing is very limited. Simple questions for evaluating testing activities have been provided [23]. These questions provide significant guidance on which aspects should be considered, e.g. monitoring input features. Our guidelines, which cover these questions, provide more detailed guidance, including in-depth investigation of specific domains.
Fig. 5. Responses to the questionnaire.

Fig. 6. Differences in responses to question 1 between authors and others.
8. Concluding Remarks
We reported on the active efforts for the quality assurance of ML models and ML-based systems in the QA4AI Consortium, driven by the Japanese industry. The second version of a set of guidelines was published, including five axes of evaluation, a technical catalog, and specific insights for five application domains. Testing is the most significant aspect of the guidelines since it is the most significant activity in practice. The guidelines provide insights from quality-assurance engineers and test engineers. This direction complements specific testing techniques, which have been actively investigated and are also introduced in the guidelines.
Given the high demands of the industry, we opted for a quick release and frequent cycles of updates. We are aware that the current guidelines are insufficient for certain aspects of the industry. The current version was constructed in a bottom-up, best-effort manner to identify what is missing in the guidelines or in the knowledge from research communities. For example, we found there is very little discussion on how to use explainability tools, such as LIME [9], in engineering activities.
We are continuously working to extend and enhance the guidelines. Current activities include case studies to uncover more insights in each domain as well as to clarify the mapping with other standards, such as the Ethics Guidelines of the European Commission (https://ec.europa.eu/futurium/en/ai-alliance-consultation/guidelines) and quality standards for general software systems (SQuaRE, ISO/IEC 250XX series).
Acknowledgments
The authors are grateful to all members of the QA4AI Consortium who contributed to the first and second versions of the guidelines. The authors are listed in alphabetical order, with no difference in their contributions to the paper, as representatives of the consortium.
References
1. F. Ishikawa and N. Yoshioka, How do engineers perceive difficulties in engineering of machine-learning systems? Questionnaire survey, in Joint Int. Workshop on Conducting Empirical Studies in Industry and 6th Int. Workshop on Software Engineering Research and Industrial Practice, 2018, pp. 2–9.
2. QA4AI Consortium, Guideline for quality assurance of AI-based products, 2019, http://www.qa4ai.jp/QA4AI.Guideline.201905.pdf (in Japanese).
3. K. Hamada, F. Ishikawa, S. Masuda, T. Myojin, Y. Nishi, H. Ogawa, T. Toku, S. Tokumoto, K. Tsuchiya, Y. Ujita and M. Matsuya, Guidelines for quality assurance of machine learning-based artificial intelligence, in The 32nd Int. Conf. Software Engineering & Knowledge Engineering, 2020, pp. 335–341.
4. K. Pei, Y. Cao, J. Yang and S. Jana, DeepXplore: Automated whitebox testing of deep learning systems, in The 26th Symp. Operating Systems Principles, 2017, pp. 1–18.
5. Y. Tian, K. Pei, S. Jana and B. Ray, DeepTest: Automated testing of deep-neural-network-driven autonomous cars, in The 40th Int. Conf. Software Engineering, 2018, pp. 303–314.
6. A. Dwarakanath, M. Ahuja, S. Sikand, R. M. Rao, R. P. J. C. Bose, N. Dubash and S. Podder, Identifying implementation bugs in machine learning based image classifiers using metamorphic testing, in The 27th ACM SIGSOFT Int. Symp. Software Testing and Analysis, 2018, pp. 118–120.
7. X. Huang, M. Kwiatkowska, S. Wang and M. Wu, Safety verification of deep neural networks, in The 29th Int. Conf. Computer Aided Verification, 2017, pp. 3–29.
8. L. Ma, F. Juefei-Xu, F. Zhang, J. Sun, M. Xue, B. Li, C. Chen, T. Su, L. Li, Y. Liu, J. Zhao and Y. Wang, DeepGauge: Multi-granularity testing criteria for deep learning systems, in The 33rd ACM/IEEE Int. Conf. Automated Software Engineering, 2018, pp. 120–131.
9. M. T. Ribeiro, S. Singh and C. Guestrin, "Why should I trust you?": Explaining the predictions of any classifier, in The 22nd ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
10. P. W. Koh and P. Liang, Understanding black-box predictions via influence functions, in The 34th Int. Conf. Machine Learning, 2018, pp. 1885–1894.
11. E. Angelino, N. Larus-Stone, D. Alabi, M. Seltzer and C. Rudin, Learning certifiably optimal rule lists for categorical data, in The 23rd ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2017, pp. 35–44.
12. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville and Y. Bengio, Generative adversarial nets, in Advances in Neural Information Processing Systems 27, 2014, pp. 2672–2680.
13. K. Hamada, K. Tachibana, T. Li, H. Honda and Y. Uchida, Full-body high-resolution anime generation with progressive structure-conditional generative adversarial networks, in The 1st Workshop on Computer Vision for Fashion, Art and Design, 2018, pp. 67–74.
14. T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford and X. Chen, Improved techniques for training GANs, in The 29th Int. Conf. Neural Information Processing Systems, 2016, pp. 2234–2242.
15. M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler and S. Hochreiter, GANs trained by a two time-scale update rule converge to a local Nash equilibrium, in The 30th Int. Conf. Neural Information Processing Systems, 2017, pp. 6626–6637.
16. K. Sun, B. Xiao, D. Liu and J. Wang, Deep high-resolution representation learning for human pose estimation, in The IEEE Conf. Computer Vision and Pattern Recognition, 2019, pp. 2672–2680.
17. Y. H. Tsuruta Kosuke and M. Toshifumi, Development of AI technology for machine automation controller (1), Technical report, OMRON TECHNICS (2018).
18. S. O.-R. A. D. Committee et al., SAE J3016: Taxonomy and definitions for terms related to driving automation systems for on-road motor vehicles, Technical report, SAE International (2016).
19. K. Czarnecki and R. Salay, Towards a framework to manage perceptual uncertainty for safe automated driving, in SAFECOMP 2018 Workshops, ASSURE, DECSoS, SASSUR, STRIVE, and WAISE, Proc., 2018, pp. 439–445.
20. L. Gauerhof, P. Munk and S. Burton, Structuring validation targets of a machine learning function applied to automated driving, in Proc. 37th Int. Conf. Computer Safety, Reliability, and Security, 2018, pp. 45–58.
21. M. Zinkevich, Rules for reliable machine learning: Best practices for ML engineering, in NIPS 2016 Workshop on Reliable Machine Learning in the Wild, 2017, http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf.
22. S. Amershi, A. Begel, C. Bird, R. DeLine, H. Gall, E. Kamar, N. Nagappan, B. Nushi and T. Zimmermann, Software engineering for machine learning: A case study, in The 41st Int. Conf. Software Engineering: Software Engineering in Practice, 2019, pp. 291–300.
23. E. Breck, S. Cai, E. Nielsen, M. Salib and D. Sculley, What's your ML test score? A rubric for ML production systems, in NIPS 2016 Workshop on Reliable Machine Learning in the Wild, 2017, pp. 1123–1132.
... Another potential challenge is convincing customers about the value to be provided by ML capabilities Nguyen-Duc et al., 2020). Nevertheless, customers may ask for ''perfect ML systems'' and have unrealistic expectations (Fujii et al., 2020;Ishikawa and Yoshioka, 2019;Kim et al., 2017). It is important to inform customers about the benefits of ML systems even without being imperfect (Ishikawa and Yoshioka, 2019) and explain the possibility of change in the level of success of ML systems on production in time (Wan et al., 2019). ...
... Dealing with data: Data preparation is a vital and inevitable group of activities for developing ML systems (Sapp, 2017). Discovering, accessing, collecting, cleaning, and transforming data is challenging and time-consuming Baier et al., 2019;Correia et al., 2020;Fredriksson et al., 2020;Fujii et al., 2020;Hill et al., 2016;Kim et al., 2017;Lwakatare et al., 2019;Sankaran et al., 2017). There may be various types of data sources, such as transactional systems, data warehouses, data lakes, data meshes, and real-time data streams (Sato et al., 2019). ...
... The discovery of adversarial examples and evaluation of model robustness are fundamental problems in engineering ML systems Fujii et al., 2020;. Adversarial examples cause ML models to produce an error output with high confidence, even though their difference from natural inputs are subtle (Szegedy et al., 2014). ...
Article
Context: Advancements in machine learning (ML) lead to a shift from the traditional view of software development, where algorithms are hard-coded by humans, to ML systems materialized through learning from data. Therefore, we need to revisit our ways of developing software systems and consider the particularities required by these new types of systems. Objective: The purpose of this study is to systematically identify, analyze, summarize, and synthesize the current state of software engineering (SE) research for engineering ML systems. Method: I performed a systematic literature review (SLR). I systematically selected a pool of 141 studies from SE venues and then conducted a quantitative and qualitative analysis using the data extracted from these studies. Results: The non-deterministic nature of ML systems complicates all SE aspects of engineering ML systems. Despite increasing interest from 2018 onwards, the results reveal that none of the SE aspects have a mature set of tools and techniques. Testing is by far the most popular area among researchers. Even for testing ML systems, engineers have only some tool prototypes and solution proposals with weak experimental proof. Many of the challenges of ML systems engineering were identified through surveys and interviews. Researchers should conduct experiments and case studies, ideally in industrial environments, to further understand these challenges and propose solutions. Conclusion: The results may benefit (1) practitioners in foreseeing the challenges of ML systems engineering; (2) researchers and academicians in identifying potential research questions; and (3) educators in designing or updating SE courses to cover ML systems engineering.
... Nevertheless, customers may ask for "perfect ML systems" and have unrealistic expectations (Fujii et al., 2020;Ishikawa and Yoshioka, 2019;Kim et al., 2017). It is important to inform customers about the benefits of ML systems even without being imperfect (Ishikawa and Yoshioka, 2019) and explain the possibility of change in the level of success of ML systems on production in time (Wan et al., 2019). ...
... Dealing with data: Data preparation is a vital and inevitable group of activities for developing ML systems (Sapp, 2017). Discovering, accessing, collecting, cleaning, and transforming data is challenging and time-consuming (Amershi et al., 2019;Baier et al., 2019;Correia et al., 2020;Fredriksson et al., 2020;Fujii et al., 2020;Hill et al., 2016;Kim et al., 2017;Lwakatare et al., 2019;Sankaran et al., 2017). There may be various types of data sources, such as transactional systems, data warehouses, data lakes, data meshes, and real-time data streams (Sato et al., 2019). ...
... Quality models for ML systems: Current systems and software quality models (such as ISO/IEC 25010) should be revisited to address the different characteristics of ML systems, such as trustworthiness and fairness. As an example of the fulfillment of this need, the QA4AI Consortium was established in 2018 in Japan to discuss the quality assurance of ML systems (Fujii et al., 2020). A set of guidelines was published to ensure ML systems' quality (QA4AI, 2020 Keeping pace with the latest versions of ML libraries and their dependencies is a challenge . ...
Preprint
Full-text available
Context: Advancements in machine learning (ML) lead to a shift from the traditional view of software development, where algorithms are hard-coded by humans, to ML systems materialized through learning from data. Therefore, we need to revisit our ways of developing software systems and consider the particularities required by these new types of systems. Objective: The purpose of this study is to systematically identify, analyze, summarize, and synthesize the current state of software engineering (SE) research for engineering ML systems. Method: I performed a systematic literature review (SLR). I systematically selected a pool of 141 studies from SE venues and then conducted a quantitative and qualitative analysis using the data extracted from these studies. Results: The non-deterministic nature of ML systems complicates all SE aspects of engineering ML systems. Despite increasing interest from 2018 onwards, the results reveal that none of the SE aspects have a mature set of tools and techniques. Testing is by far the most popular area among researchers. Even for testing ML systems, engineers have only some tool prototypes and solution proposals with weak experimental proof. Many of the challenges of ML systems engineering were identified through surveys and interviews. Researchers should conduct experiments and case studies, ideally in industrial environments, to further understand these challenges and propose solutions. Conclusion: The results may benefit (1) practitioners in foreseeing the challenges of ML systems engineering; (2) researchers and academicians in identifying potential research questions; and (3) educators in designing or updating SE courses to cover ML systems engineering.
... Most prominent examples of such domains are Autonomous Driving and medical applications. Especially in these domains calls for quality assurance measures for AI applications were recently expressed (Tian et al., 2018;Hamada et al., 2020;Challen et al., 2019). ...
... Lenarduzzi et al. focus their consideration on AI Software Development i.e. the software-driven definition, training, testing, and deployment of AI systems (Lenarduzzi et al., 2021). Hamada et al. provide general guidelines for AI QA for different application domains, one of which being Autonomous Driving (Hamada et al., 2020). Similar to the method proposed in this report, the method proposed by Hamada et al. "helps to create test cases", which, in their description, appear similar to examples shown in Section 4. ...
Article
This work introduces a method for Quality Assurance of Artificial Intelligence (AI) Systems, which identifies and characterizes "corner cases". Here, corner cases are intuitively defined as "inputs yielding an unexpectedly bad AI performance". While relying on automated methods for corner case selection, the method relies also on human work. Specifically, the method structures the work of data scientists in an iterative process which formalizes the expectations towards an AI under test. The method is applied in a use case in Autonomous Driving, and validation experiments, which point at a general effectiveness of the method, are reported on. Besides allowing insights on the AI under test, the method seems to be particularly suited to structure a constructive critique of the quality of a test dataset. As this work reports on a first application of the method, a special focus lies on limitations and possible extensions of the method.
... For each phase, the authors identify a set of functional quality properties to guarantee the quality of such systems. Similarly, the Q4AI consortium proposed a set of guidelines [25] for the quality assurance of ML systems for specific domains: generative systems, operational data in process systems, voice user interface system, autonomous driving and AI OCR. For each domain, the authors identify a set of properties and metrics to ensure quality. ...
Preprint
Full-text available
The recently increased complexity of Machine Learning (ML) methods, led to the necessity to lighten both the research and industry development processes. ML pipelines have become an essential tool for experts of many domains, data scientists and researchers, allowing them to easily put together several ML models to cover the full analytic process starting from raw datasets. Over the years, several solutions have been proposed to automate the building of ML pipelines, most of them focused on semantic aspects and characteristics of the input dataset. However, an approach taking into account the new quality concerns needed by ML systems (like fairness, interpretability, privacy, etc.) is still missing. In this paper, we first identify, from the literature, key quality attributes of ML systems. Further, we propose a new engineering approach for quality ML pipeline by properly extending the Feature Models meta-model. The presented approach allows to model ML pipelines, their quality requirements (on the whole pipeline and on single phases), and quality characteristics of algorithms used to implement each pipeline phase. Finally, we demonstrate the expressiveness of our model considering the classification problem.
... Finally, the third stream of related works focuses on quality assurance and quality models for ML-based software systems. Fujii et al. [5] conducted a survey to evaluate the usefulness of quality guidelines for ML-based software systems. These quality guidelines address the handling of quality characteristics, test architecture, and test viewpoints for different domains. ...
Preprint
Full-text available
As AI-enabled software systems become more prevalent in smart manufacturing, their role shifts from a reactive to a proactive one that provides context-specific support to machine operators. In the context of an international research project, we develop an AI-based software platform that shall facilitate the collaboration between human operators and manufacturing machines. We conducted 14 structured interviews with stakeholders of the prospective software platform in order to determine the individual relevance of selected quality characteristics for human-AI teaming in smart manufacturing. These characteristics include the ISO 25010:2011 standard for software quality and AI-specific quality characteristics such as trustworthiness, explicability, and auditability. The interviewees rated trustworthiness, functional suitability, reliability, and security as the most important quality characteristics for this context, and portability, compatibility, and maintainability as the least important. Also, we observed agreement regarding the relevance of the quality characteristics among interviewees having the same role. On the other hand, the relevance of each quality characteristics varied depending on the concrete use case of the prospective software platform. The interviewees also were asked about the key success factors related to human-AI teaming in smart manufacturing. They identified improving the production cycle, increasing operator efficiency, reducing scrap, and reducing ergonomic risks as key success criteria. In this paper, we also discuss metrics for measuring the fulfillment of these quality characteristics, which we intend to operationalize and monitor during operation of the prospective software platform.
Conference Paper
Full-text available
Deep learning (DL) defines a new data-driven programming paradigm that constructs the internal system logic of a crafted neuron network through a set of training data. We have seen wide adoption of DL in many safety-critical scenarios. However, a plethora of studies have shown that the state-of-the-art DL systems suffer from various vulnerabilities which can lead to severe consequences when applied to real-world applications. Currently, the testing adequacy of a DL system is usually measured by the accuracy of test data. Considering the limitation of accessible high quality test data, good accuracy performance on test data can hardly provide confidence to the testing adequacy and generality of DL systems. Unlike traditional software systems that have clear and controllable logic and functionality, the lack of interpretability in a DL system makes system analysis and defect detection difficult, which could potentially hinder its real-world deployment. In this paper, we propose DeepGauge, a set of multi-granularity testing criteria for DL systems, which aims at rendering a multi-faceted portrayal of the testbed. The in-depth evaluation of our proposed testing criteria is demonstrated on two well-known datasets, five DL systems, and with four state-of-the-art adversarial attack techniques against DL. The potential usefulness of DeepGauge sheds light on the construction of more generic and robust DL systems.
Conference Paper
Great efforts are currently underway to develop industrial applications for artificial intelligence (AI), especially those using machine learning (ML) techniques. Despite the intensive support for building ML applications, there are still challenges when it comes to evaluating, assuring, and improving the quality or dependability. The difficulty stems from the unique nature of ML: namely, that the system behavior is derived from training data, not from logical design by human engineers. This leads to black-box and intrinsically imperfect implementations that invalidate many of the existing principles and techniques in traditional software engineering. In light of this situation, the Japanese industry has jointly worked on a set of guidelines for the quality assurance of AI systems (in the QA4AI consortium) from the viewpoint of traditional quality-assurance engineers and test engineers. We report the initial version of these guidelines, which cover a list of the quality evaluation aspects, a catalogue of current state-of-the-art techniques, and domain-specific discussions in four representative domains. The guidelines provide significant insights for engineers in terms of methodologies and designs for tests driven by application-specific requirements.
Conference Paper
Recent advances in Deep Neural Networks (DNNs) have led to the development of DNN-driven autonomous cars that, using sensors such as cameras and LiDAR, can drive without any human intervention. Most major manufacturers, including Tesla, GM, Ford, BMW, and Waymo/Google, are working on building and testing different types of autonomous vehicles. The lawmakers of several US states, including California, Texas, and New York, have passed new legislation to fast-track the process of testing and deploying autonomous vehicles on their roads. However, despite this spectacular progress, DNNs, just like traditional software, often demonstrate incorrect or unexpected corner-case behaviors that can lead to potentially fatal collisions. Several such real-world accidents involving autonomous cars have already happened, including one that resulted in a fatality. Most existing testing techniques for DNN-driven vehicles depend heavily on the manual collection of test data under different driving conditions, which becomes prohibitively expensive as the number of test conditions increases. In this paper, we design, implement, and evaluate DeepTest, a systematic testing tool for automatically detecting erroneous behaviors of DNN-driven vehicles that can potentially lead to fatal crashes. First, the tool automatically generates test cases leveraging real-world changes in driving conditions such as rain, fog, and lighting. DeepTest systematically explores different parts of the DNN logic by generating test inputs that maximize the number of activated neurons. DeepTest found thousands of erroneous behaviors under different realistic driving conditions (e.g., blurring, rain, and fog), many of which can lead to potentially fatal crashes, in the three top-performing DNNs in the Udacity self-driving car challenge.
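A rough sketch of this style of testing is shown below: synthetic driving-condition transforms (here, a brightness shift and a naive box blur) are applied to seed images, and a variant counts as erroneous when the predicted steering angle deviates beyond a tolerance. The `steering_model` callable, the two transforms, and the 2-degree tolerance are illustrative assumptions; DeepTest itself additionally guides generation by neuron coverage.

```python
import numpy as np

def brightness(img, delta):
    """Uniform lighting change on an HxWxC uint8 image."""
    return np.clip(img.astype(np.int32) + delta, 0, 255).astype(np.uint8)

def blur(img, k=5):
    """Naive box blur (no external dependencies), same shape out."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return (out / (k * k)).astype(np.uint8)

def erroneous_behaviors(images, steering_model, tol_deg=2.0):
    """Flag seed images whose transformed variants change the
    predicted steering angle by more than tol_deg degrees."""
    failures = []
    for i, img in enumerate(images):
        base = steering_model(img)  # predicted angle on the seed image
        for variant in (brightness(img, 60), brightness(img, -60), blur(img)):
            angle = steering_model(variant)
            if abs(angle - base) > tol_deg:
                failures.append((i, base, angle))
    return failures
```

The underlying assumption is a metamorphic one: mild weather and lighting changes should not substantially alter the steering decision, so any large deviation is a candidate erroneous behavior worth manual review.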
Conference Paper
We have recently witnessed tremendous success of Machine Learning (ML) in practical applications. Computer vision, speech recognition, and language translation have all seen near human-level performance. We expect that, in the near future, most business applications will have some form of ML. However, testing such applications is extremely challenging and would be very expensive with today's methodologies. In this work, we present an articulation of the challenges in testing ML-based applications. We then present our solution approach, based on the concept of Metamorphic Testing, which aims to identify implementation bugs in ML-based image classifiers. We have developed metamorphic relations for an application based on a Support Vector Machine and for a Deep Learning-based application. Empirical validation showed that our approach was able to catch 71% of the implementation bugs in the ML applications.
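The sketch below illustrates the flavor of such checks with two simple relations for an image classifier: shuffling a batch must shuffle the predictions identically (a cheap probe for state leaking across inputs in the inference code), and a small uniform brightness shift is expected to leave most top-1 predictions unchanged. These two relations and the `classifier` callable are illustrative, not the specific relations developed in the cited paper.

```python
import numpy as np

def predictions(classifier, batch):
    """Predicted class label per image, as a 1-D array."""
    return np.asarray([classifier(x) for x in batch])

def check_batch_order_relation(classifier, batch, seed=0):
    """MR 1 (illustrative): permuting the batch must permute the
    predictions identically; a violation points to an
    implementation bug, not a model-quality issue."""
    base = predictions(classifier, batch)
    perm = np.random.default_rng(seed).permutation(len(batch))
    shuffled = predictions(classifier, [batch[i] for i in perm])
    return np.array_equal(shuffled, base[perm])

def check_brightness_relation(classifier, batch, delta=5):
    """MR 2 (illustrative): a small uniform brightness shift should
    usually leave the top class unchanged; report the violation
    rate rather than a hard pass/fail."""
    base = predictions(classifier, batch)
    shifted_batch = [
        np.clip(x.astype(np.int32) + delta, 0, 255).astype(np.uint8)
        for x in batch
    ]
    shifted = predictions(classifier, shifted_batch)
    return float(np.mean(shifted != base))
```

The appeal of this approach is that neither relation needs ground-truth labels: the expected output of the follow-up input is derived from the output on the source input, sidestepping the test-oracle problem.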
Conference Paper
Deep learning (DL) systems are increasingly deployed in safety- and security-critical domains, including self-driving cars and malware detection, where the correctness and predictability of a system's behavior for corner-case inputs are of great importance. Existing DL testing depends heavily on manually labeled data and therefore often fails to expose erroneous behaviors for rare inputs. We design, implement, and evaluate DeepXplore, the first white-box framework for systematically testing real-world DL systems. First, we introduce neuron coverage for systematically measuring the parts of a DL system exercised by test inputs. Next, we leverage multiple DL systems with similar functionality as cross-referencing oracles to avoid manual checking. Finally, we demonstrate how finding inputs for DL systems that both trigger many differential behaviors and achieve high neuron coverage can be represented as a joint optimization problem and solved efficiently using gradient-based search techniques. DeepXplore efficiently finds thousands of incorrect corner-case behaviors (e.g., self-driving cars crashing into guard rails and malware masquerading as benign software) in state-of-the-art DL models with thousands of neurons trained on five popular datasets, including ImageNet and Udacity self-driving challenge data. For all tested DL models, on average, DeepXplore generated one test input demonstrating incorrect behavior within one second while running only on a commodity laptop. We further show that the test inputs generated by DeepXplore can also be used to retrain the corresponding DL model to improve the model's accuracy by up to 3%.
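The neuron-coverage metric introduced here can be sketched as the fraction of neurons whose scaled activation exceeds a threshold for at least one test input. The per-neuron min-max scaling over the whole test set below is a simplification (DeepXplore scales activations within each layer per input), and the `acts` array is assumed to be pre-extracted from the network.

```python
import numpy as np

def neuron_coverage(acts, threshold=0.25):
    """Sketch of neuron coverage in the DeepXplore sense.

    acts: array of shape (n_inputs, n_neurons) of raw activations.
    Returns the share of neurons whose scaled activation exceeds
    the threshold for at least one input.
    """
    lo = acts.min(axis=0)
    hi = acts.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard constant neurons
    scaled = (acts - lo) / span              # min-max scale per neuron
    activated = (scaled > threshold).any(axis=0)
    return float(activated.mean())
```

Maximizing this quantity jointly with differential behavior across models is what turns test generation into the optimization problem the abstract describes.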
Article
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
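The minimax two-player game described above is commonly written as the following value function, with G recovering the training data distribution and D(x) = 1/2 everywhere at the unique equilibrium:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

Here D is trained to assign high probability to training samples and low probability to generated ones, while G is trained to make D's task as hard as possible.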