
AI safety: state of the field through quantitative lens


Abstract

The last decade has seen major improvements in the performance of artificial intelligence, which has driven widespread applications. Unforeseen effects of such mass adoption have put the notion of AI safety into the public eye. AI safety is a relatively new field of research focused on techniques for building AI that is beneficial for humans. While survey papers exist for the field of AI safety, a quantitative look at the research being conducted is lacking. The quantitative aspect gives data-driven insight into emerging trends, knowledge gaps, and potential areas for future research. In this paper, a bibliometric analysis of the literature finds a significant increase in research activity since 2015. Also, the field is so new that most of the technical issues are open, including explainability and its long-term utility, and value alignment, which we have identified as the most important long-term research topic. Equally, there is a severe lack of research into concrete policies regarding AI. As we expect AI to be one of the main driving forces of change, AI safety is the field within which we need to decide the direction of humanity's future.
Mislav Juric*, Agneza Sandic** and Mario Brcic**
* Student at University of Zagreb Faculty of Electrical Engineering and Computing, Zagreb, Croatia
** University of Zagreb Faculty of Electrical Engineering and Computing, Zagreb, Croatia
Keywords - AI safety; technical AI safety; research;
surveys; bibliometrics
The field of Artificial Intelligence (AI) safety is concerned with answering a very important question (along with its variations): “How can we make AI safe for humans?” As the field has evolved over the years, many developments have been published on topics related to AI safety. Although many survey papers give an overview of different aspects of the field, quantitative insight into its state is lacking. Namely, there is a lack of data that tells a story about current and past trends and gives us empirical evidence as to where research contributions would be most valuable.
In this paper we divide AI safety into the following hierarchy of sub-fields:
1. technical AI safety – deals with the technical issues of achieving safety and utility. It is subdivided according to the SRA classification [1]:
   a) specification (S) – defines the purpose of the system
   b) robustness (R) – designing systems that withstand perturbations
   c) assurance (A) – monitoring, understanding, and controlling the system during operation
2. AI ethics – mainly deals with the questions of moral responsibility and utility for the
3. AI policy – deals with the question of how legal and governance systems need to be set up with respect to the new AI technologies (not only on the national, but also on the supranational level)
This paper is structured as follows. We explain our research methodology in section II, and present and elaborate upon our results in section III. In section IV we give our view on current and possible future developments in the field of AI safety. Finally, in section V we give our conclusion and state the limitations of our research.
In this paper, we use bibliometrics as the basis of our research. There are plenty of surveys in AI safety (see Table 1), but most of them are specific to particular sub-fields. There is a general survey [2], but it is enumerative and descriptive in nature, and we aim to supplement it with a quantitative overview.
We used online databases to identify indexed work sampled across different subfields of AI safety: SCOPUS, Web of Science (WoS), and Google Scholar (GS). Since the area of AI safety is rather new, the number of published works is smaller than in well-established areas. This is especially the case for WoS. GS offers a high volume of work, including important pre-print sources, but the quality varies substantially and result exploration seems to be limited. SCOPUS offers a good trade-off between volume and quality, paired with flexible searching and result exploration. We therefore manually fine-tuned our search queries on SCOPUS and then used them across all three databases. We fine-tuned the queries until we reached a purity of 90% over the returned results, with purity established empirically by sampling across the results. We did our best to look into as many subfields as possible, but we leave improvements in sub-field coverage to future work.
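The text does not describe the sampling procedure behind the purity estimate in detail. As a minimal sketch, assuming purity means the share of returned results that a human reader judges relevant, it could be estimated from a random sample together with a Wilson score interval. The function names, the sample size, and the toy result list below are all hypothetical and are not taken from the paper.

```python
import math
import random


def estimate_purity(results, is_relevant, sample_size, seed=0):
    """Draw a random sample of results and count how many are relevant."""
    rng = random.Random(seed)
    sample = rng.sample(results, min(sample_size, len(results)))
    hits = sum(1 for item in sample if is_relevant(item))
    return hits, len(sample)


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        raise ValueError("empty sample")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical result list: (title, manually assigned relevance flag).
results = [("paper %d" % i, i % 10 != 0) for i in range(500)]
hits, n = estimate_purity(results, lambda r: r[1], sample_size=100)
low, high = wilson_interval(hits, n)
print(f"purity ~ {hits / n:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

Reporting an interval alongside the point estimate makes it visible how much a 90% purity figure depends on the number of sampled results.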
We searched the following areas: AI ethics, AI policy, robustness (R), explainability and interpretability (A), fairness (S), value alignment (S), reward hacking (S), interruptibility (A), safe exploration (R), distributional shift (R), and AI privacy (A). Other areas, such as verifiability, proved elusive for search queries in terms of the purity and relevance of the results. The respective search queries for each selected area are given in Appendix VI.
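For readers who want to work with this assignment programmatically, the mapping of searched technical areas to SRA categories listed above can be written down as a small lookup table. This is only an illustrative encoding of the labels given in the text; the variable and function names are hypothetical.

```python
from collections import defaultdict

# Searched technical areas and their SRA letter, as listed in the text.
SRA_LABEL = {
    "robustness": "R",
    "explainability and interpretability": "A",
    "fairness": "S",
    "value alignment": "S",
    "reward hacking": "S",
    "interruptibility": "A",
    "safe exploration": "R",
    "distributional shift": "R",
    "AI privacy": "A",
}
SRA_NAME = {"S": "specification", "R": "robustness", "A": "assurance"}


def group_by_category(labels):
    """Group area names under the full name of their SRA category."""
    groups = defaultdict(list)
    for area, letter in labels.items():
        groups[SRA_NAME[letter]].append(area)
    return dict(groups)


for category, areas in group_by_category(SRA_LABEL).items():
    print(category, "->", ", ".join(areas))
```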
Table 1 AI safety surveys
Subject                  References
General AI safety        [2]–[9]
AI ethics                [10]–[18]
AI policy                [19]–[24]
Interpretability/XAI     [25]–[32]
Adversarial robustness   [33]–[37]
Fairness/bias            [38]–[40]
Value alignment          [41]–[44]
Safe exploration         [45]
Interruptibility         [46]
The identified documents have the following distribution: conference papers (47.71%), journal articles (39.89%), reviews (3.89%), books (3.57%), book chapters (3.11%), and other (2.83%). We chose to focus on papers from 1985 to 2019, with the relevant data retrieved on January 26, 2020.
Figure 1 shows the growth trends in AI ethics and AI policy. AI ethics has seen steady growth since 2003, with a visible explosion of interest since 2010. AI policy, on the other hand, saw no significant amount of work until 2018 and has experienced strong growth over the last two years.
Figure 1 Number of AI ethics and AI policy papers published each year
Figure 2 Number of papers in technical AI safety, high-volume topics
Figure 3 Number of papers in technical AI safety, low-volume topics
Figure 2 and Figure 3 show the change in popularity of different subfields of technical AI safety over the last 20 years, grouped by their popularity level. The whole field of AI safety is seeing strong growth, which is mainly driven by strongly growing sub-fields. Interpretability (together with explainable AI (XAI)) is the single most important growth generator for the field. Strong growth is shown by AI ethics and adversarial robustness, and by the smaller-volume topics of value alignment and safe exploration. These fields are driven mainly by near-term applications in a diversity of areas, such as transportation, medicine, biology, and robotics. Fairness and privacy show slight growth. Finally, the topics of safe exploration, distributional shift, interruptibility, and reward hacking seem to be emerging and are likely to become more active research avenues in the future. Interruptibility and reward hacking are more focused on long-term research goals and do not get as much attention as other topics. These are also areas that could be affected by the non-publication decisions of some research organizations.
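Trend statements of this kind rest on counting publications per year and comparing consecutive years. A minimal sketch of that computation follows, assuming each retrieved record carries a publication year. The input list is a toy example and is not the paper's data; the function names are hypothetical.

```python
from collections import Counter


def papers_per_year(years, first, last):
    """Count records per publication year, filling missing years with zero."""
    counts = Counter(years)
    return {year: counts.get(year, 0) for year in range(first, last + 1)}


def yearly_growth(series):
    """Relative year-over-year change; None where the previous year had no papers."""
    growth = {}
    ordered = sorted(series)
    for prev, cur in zip(ordered, ordered[1:]):
        if series[prev] == 0:
            growth[cur] = None
        else:
            growth[cur] = (series[cur] - series[prev]) / series[prev]
    return growth


# Toy publication years, for illustration only.
toy_years = [2015, 2016, 2016, 2017, 2017, 2017, 2017]
series = papers_per_year(toy_years, 2014, 2017)
print(series)
print(yearly_growth(series))
```

Returning None for years that follow an empty year avoids dividing by zero, which matters for low-volume topics where early years often have no publications.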
Rank  Journal (technical AI safety)  #    Journal (AI ethics)                               #
1     Expert systems with            249  AI and Society                                    32
2     IEEE Access                    237  Ethics and Information
3     Neurocomputing                 200  Futures                                           17
4     Information sciences           199  Minds and machines                                16
5     Applied Soft Computing         154  Phil. and Tech. / Science and Engineering Ethics
Table 2 Published articles per journal
Table 2 lists the top 5 journals for technical AI safety and for AI ethics, respectively. We must emphasize that a considerable amount of interesting research and ideas in the area is not published in peer-reviewed outlets. At best, it is published on pre-print services; otherwise it appears in blogs and their comment sections. This makes idea sharing harder and reinvention more likely. Some organizations, such as MIRI, have decided not to publish most of their work for security reasons.
AI governance has such a small volume of work that a journal ranking makes no sense. Some of the journals publishing on AI policy are “Computer Law And Security Review”, “Contemporary Security Policy”, and “Ethics And International Affairs”.
Tool/Technique            #
Machine Learning          3052
Classification            2089
Deep Learning             1910
Forecasting               1289
Feature Extraction        1112
Decision Trees            1017
Fuzzy Systems             929
Optimization              899
Clustering Algorithms     763
Genetic Algorithms        738
Knowledge Based Systems   636
Table 3 Popular AI safety research tools and techniques
Table 3 summarizes the most used tools and techniques in the research, based on their frequency as an article keyword. Machine learning tops the list, owing to the latest surge in applications and proliferation. Classification seems to be the most studied problem setting, followed by forecasting. Deep learning leads among algorithmic approaches, followed by decision trees, which are a more transparent model. Optimization is a tool used in learning systems, and it can aim at different criteria. Genetic algorithms are an approach to the optimization of hard problems. Fuzzy systems and knowledge-based systems seem to be growing in popularity and are starting to be proposed as supplementary techniques that bring complementary benefits where deep learning seems lacking.
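A ranking by keyword frequency, as described above, can be sketched as follows. The sketch assumes that each article contributes a list of keywords and that a keyword is counted at most once per article after case and whitespace normalization; the paper does not state its normalization rules, so these are assumptions. The records are toy data, not the paper's.

```python
from collections import Counter


def keyword_frequencies(records, top_n=None):
    """Rank keywords by the number of articles in which they appear."""
    counts = Counter()
    for keywords in records:
        # Count each keyword once per article, ignoring case and padding.
        counts.update({k.strip().lower() for k in keywords if k.strip()})
    return counts.most_common(top_n)


# Toy keyword lists, one per article, for illustration only.
toy_records = [
    ["Machine Learning", "Classification"],
    ["machine learning", "Deep Learning"],
    ["Deep learning ", "machine learning", "Machine Learning"],
]
print(keyword_frequencies(toy_records, top_n=2))
```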
Generally, we welcome more concrete empirical work, which would bring much-needed information to fuel future development. This calls for necessarily multidisciplinary work with both computational experimentation (such as [47]) and real-world testing (such as [48], [49]). The creation of many rich and diverse simulation environments and benchmarks would improve both system building and safety testing, since learning algorithms are data-hungry and the real world offers too little experience bandwidth.
A. Explainability
One of the most important open problems in explainability is that there is no agreement on what an explanation is. Some works define a decision tree, a set of rules, or an image as a good explanation [31], with most of the work appealing to mere intuition. Evaluation of the comprehensibility of explanations to humans is underexplored [26]. There is still no algorithm that provides both high accuracy and explainability. In that regard, new hybrid techniques hold potential to achieve more effective explanations [32]. Looking further, the long-term utility of explainability for safety is unclear, since such approaches lack guarantees against adversarial schemes. Namely, plausible but possibly harmfully incorrect explanations can be generated. Moreover, the size and dynamics of the gap between the true explanation and the most incorrect plausible explanation are interesting questions. Also, the nature of the limits to the explainability of advanced concepts (for example, in future science) to humans is not yet understood. Overstepping such limits makes explanations too far removed from reality, and their utility fades out.
B. Value alignment
Work on reinforcement learning (RL) and on learning a reward function from actions and preferences will be very important for advancing the field of value alignment [2]. A more advanced prospect than value learning, value discovery through aligned algorithms could enable finding better reward functions that unlock new opportunities. Research into reward corruption and into the side effects of optimizing for a goal that does not fully capture human values is just beginning. Currently, the most interesting work consists of approaches combining recursive factorization and bootstrapping for scalable reward learning and value alignment, in both cooperative [50], [51] and adversarial [52] settings. There are many open questions about the feasibility of such ideas with respect to error bounding, the automated factorization process, the validity of assumptions, etc. The mesa-optimizer [53] is an important concept, especially for long-term research. It introduces multiple levels of value alignment within learned optimization. However, its relevance to current practice is yet to be demonstrated.
C. AI policy
Transparency, accountability, reliability, security, corrigibility, interpretability, value specification, the ability to limit capability, and performance and safety guarantees for particular AI systems are all important for future widespread applications and should be important parts and aims of future policies [21], [24]. Regulating AI-related research seems to be a possible long-term approach to regulating AI developments, but it should not come at the cost of restricting scientific progress [5]. AI regulation could work globally only if there is consensus among the major AI research organizations. Although there are many papers related to AI governance, there is a severe lack of concrete AI policy suggestions.
D. Corrigibility
Corrigibility, that is, reasoning that reflects that an agent is incomplete and potentially flawed in dangerous ways, needs further work [54]. Safety measures that are easily integrated into the AI development environment would be tremendously valuable for improving AI safety. One promising method of improving AI safety in the short term, during development, is containment through virtualization [55].
E. Safe exploration and distributional shift
Preventing catastrophic mistakes while training a reinforcement learning model is a non-negotiable need when the system interacts with the real world rather than a simulated environment [8]. Detecting and overriding an agent's action when it seems too dangerous looks like a good approach to reducing the number of catastrophic mistakes [2]. Proposing new conceptual solutions to the problems of safe exploration and distributional shift could bring a lot of value, but testing and/or extending existing solutions may be valuable as well. Hybridization of symbolic and sub-symbolic approaches might be a valid approach here.
F. Adversarial robustness
Deep neural network models are highly vulnerable to adversarial attacks, which curtails their applications. There are special methods that reduce the success rates of different types of adversarial attacks, but there are no general defensive methods successful against all attacks [34]. Open questions include why adversarial examples exist in the first place and why they are transferable [36]. Security verification of models against adversarial attacks is an open research challenge [35].
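To make the notion of an adversarial example concrete, the following is a minimal sketch of a single signed-gradient step (the fast gradient sign method) against a logistic-regression classifier. The technique, the weights, the input, and the perturbation size are chosen here purely for illustration and are not taken from the surveyed works; a linear model stands in for a deep network only to keep the example self-contained.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(value):
    return (value > 0) - (value < 0)


def fgsm(w, b, x, y, eps):
    """Move each input coordinate by eps in the direction that raises the loss."""
    p = predict(w, b, x)
    # Gradient of the cross-entropy loss with respect to the input x.
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


# Hypothetical model and input, for illustration only.
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1
x_adv = fgsm(w, b, x, y, eps=0.5)
print("clean prediction:", predict(w, b, x))
print("adversarial prediction:", predict(w, b, x_adv))
```

In this toy setting the bounded perturbation is enough to move the prediction across the 0.5 decision threshold, which is the behavior the open questions above are concerned with.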
G. AI ethics
Further work on human trust is necessary, especially from the aspect of human-AI interaction [13]. In applications, privacy concerns need to be taken seriously if we are to construct AI that is ethical [18], [56]. Ethics vary across the globe and evolve over time [12], [48], [49]. More research into human moral preferences is important to guide our development and policies.
In this paper, we have surveyed the field of AI safety through a quantitative lens and found various trends in the field. We identified interpretability and explainability as the strongest research topic in the near term, but we have raised the question of its long-term utility, since such approaches lack guarantees against adversarial schemes [25]. Also, there are limits to explaining increasingly advanced concepts (for example, in future science) in a way comprehensible to humans. Beyond such limits, the utility of explanations would fade out. Robustness-based topics are growing in importance with incoming technical applications. AI policy is seeing its first major contributions, but there is a severe lack of concrete policies. However, we would like to point out value alignment as the most important subfield in the long term, as it continues its sudden growth from relative obscurity. It is the most promising research direction for achieving coexistence and cooperation with agents computationally more capable than ourselves. If their goals are aligned with ours, they have no incentive to harm us. In the case of misalignment, it is hard to obtain general guarantees through other approaches against incentivized, computationally superior agents.
This research has limitations, among them: underestimation or overestimation due to the search queries; non-publication bias of some research fields in peer-reviewed outlets; bias due to the increase in the number of publishing outlets; and reliance on descriptive statistics to derive facts and insights about AI safety, as we did not read all of the covered papers.
[1] “Building safe artificial intelligence: specification, robustness,
and assurance.” [Online]. Available:
artificial-intelligence-52f5f75058f1. [Accessed: 27-Jan-2020].
[2] T. Everitt, G. Lea, and M. Hutter, “AGI Safety Literature
Review,” ArXiv180501109 Cs, May 2018.
[3] R. Yampolskiy and J. Fox, “Safety Engineering for Artificial
General Intelligence,” Topoi, vol. 32, no. 2, pp. 217–226, Oct.
2013, doi: 10.1007/s11245-012-9128-9.
[4] P. J. Scott and R. V. Yampolskiy, “Classification Schemas for
Artificial Intelligence Failures,” ArXiv190707771 Cs, Jul. 2019.
[5] K. Sotala and R. V. Yampolskiy, “Responses to catastrophic
AGI risk: a survey,” Phys. Scr., vol. 90, no. 1, p. 018001, Dec.
2014, doi: 10.1088/0031-8949/90/1/018001.
[6] J. M. Zhang, M. Harman, L. Ma, and Y. Liu, “Machine
Learning Testing: Survey, Landscapes and Horizons,”
ArXiv190610742 Cs Stat, Dec. 2019.
[7] Y. K. Dwivedi et al., “Artificial Intelligence (AI):
Multidisciplinary perspectives on emerging challenges,
opportunities, and agenda for research, practice and policy,” Int.
J. Inf. Manag., p. 101994, Aug. 2019, doi:
[8] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman,
and D. Mané, “Concrete Problems in AI Safety,”
ArXiv160606565 Cs, Jul. 2016.
[9] S. Russell, D. Dewey, and M. Tegmark, “Research Priorities for
Robust and Beneficial Artificial Intelligence,” AI Mag., vol. 36,
no. 4, pp. 105–114, Dec. 2015, doi: 10.1609/aimag.v36i4.2577.
[10] B. D. Mittelstadt, P. Allo, M. Taddeo, S. Wachter, and L.
Floridi, “The ethics of algorithms: Mapping the debate,” Big
Data Soc., vol. 3, no. 2, p. 2053951716679679, Dec. 2016, doi:
[11] J.-A. Cervantes, S. López, L.-F. Rodríguez, S. Cervantes, F.
Cervantes, and F. Ramos, “Artificial Moral Agents: A Survey of
the Current Status,” Sci. Eng. Ethics, Nov. 2019, doi:
[12] A. Hagerty and I. Rubinov, “Global AI Ethics: A Review of the
Social Impacts and Ethical Implications of Artificial
Intelligence,” ArXiv190707892 Cs, Jul. 2019.
[13] B. W. Israelsen and N. R. Ahmed, “‘Dave...I can assure
you ...that it’s going to be all right ...’ A Definition, Case for,
and Survey of Algorithmic Assurances in Human-Autonomy
Trust Relationships,” ACM Comput. Surv. CSUR, vol. 51, no. 6,
pp. 113:1–113:37, Jan. 2019, doi: 10.1145/3267338.
[14] A. Jobin, M. Ienca, and E. Vayena, “The global landscape of AI
ethics guidelines,” Nat. Mach. Intell., vol. 1, no. 9, pp. 389–399,
Sep. 2019, doi: 10.1038/s42256-019-0088-2.
[15] J. Morley, C. Machado, C. Burr, J. Cowls, M. Taddeo, and L.
Floridi, “The Debate on the Ethics of AI in Health Care: A
Reconstruction and Critical Review,” Social Science Research
Network, Rochester, NY, SSRN Scholarly Paper ID 3486518,
Nov. 2019.
[16] B. S. Barn, “Mapping the public debate on ethical concerns:
algorithms in mainstream media,” J. Inf. Commun. Ethics Soc.,
vol. ahead-of-print, no. ahead-of-print, Jan. 2019, doi:
[17] J. Morley, L. Floridi, L. Kinsey, and A. Elhalal, “From What to
How: An Initial Review of Publicly Available AI Ethics Tools,
Methods and Research to Translate Principles into Practices,”
Sci. Eng. Ethics, pp. 1–28, Dec. 2019, doi: 10.1007/s11948-019-
[18] S. Spiekermann, J. Korunovska, and M. Langheinrich, “Inside
the Organization: Why Privacy and Security Engineering Is a
Challenge for Engineers,” Proc. IEEE, vol. 107, no. 3, pp. 600–
615, Mar. 2019, doi: 10.1109/JPROC.2018.2866769.
[19] A. Ramamoorthy and R. Yampolskiy, “Beyond MAD?: The
race for artificial general intelligence,” ITU J. ICT Discov., vol.
2018, no. 1, pp. 77–84, Mar. 2018, doi:
[20] C. Coglianese and D. Lehr, “Transparency and Algorithmic
Governance,” Adm. Law Rev., vol. 71, p. 1, 2019.
[21] C. Cath, S. Wachter, B. Mittelstadt, M. Taddeo, and L. Floridi,
“Artificial Intelligence and the ‘Good Society’: the US, EU, and
UK approach,” Sci. Eng. Ethics, vol. 24, no. 2, pp. 505–528,
Apr. 2018, doi: 10.1007/s11948-017-9901-7.
[22] M. Brundage and J. Bryson, “Smart Policies for Artificial
Intelligence,” ArXiv160808196 Cs, Aug. 2016.
[23] C. Cath, “Governing artificial intelligence: ethical, legal and
technical opportunities and challenges,” Philos. Trans. R. Soc.
Math. Phys. Eng. Sci., vol. 376, no. 2133, p. 20180080, Nov.
2018, doi: 10.1098/rsta.2018.0080.
[24] A. Dafoe, “AI Governance: A Research Agenda.” Governance
of AI Program, Future of Humanity Institute, University of
Oxford, 2018.
[25] R. V. Yampolskiy, “Unexplainability and Incomprehensibility
of Artificial Intelligence,” ArXiv190703869 Cs, Jun. 2019.
[26] F. K. Došilović, M. Brčić, and N. Hlupić, “Explainable artificial
intelligence: A survey,” in 2018 41st International Convention
on Information and Communication Technology, Electronics
and Microelectronics (MIPRO), 2018, pp. 0210–0215, doi:
[27] A. Adadi and M. Berrada, “Peeking Inside the Black-Box: A
Survey on Explainable Artificial Intelligence (XAI),” IEEE
Access, vol. 6, pp. 52138–52160, 2018, doi:
[28] J. Townsend, T. Chaton, and J. M. Monteiro, “Extracting
Relational Explanations From Deep Neural Networks: A Survey
From a Neural-Symbolic Perspective,” IEEE Trans. Neural
Netw. Learn. Syst., pp. 1–15, 2019, doi:
[29] A. Barredo Arrieta et al., “Explainable Artificial Intelligence
(XAI): Concepts, taxonomies, opportunities and challenges
toward responsible AI,” Inf. Fusion, vol. 58, pp. 82–115, Jun.
2020, doi: 10.1016/j.inffus.2019.12.012.
[30] A. Preece, “Asking ‘Why’ in AI: Explainability of intelligent
systems perspectives and challenges,” Intell. Syst. Account.
Finance Manag., vol. 25, no. 2, pp. 63–72, 2018, doi:
[31] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti,
and D. Pedreschi, “A Survey of Methods for Explaining Black
Box Models,” ACM Comput. Surv. CSUR, vol. 51, no. 5, pp.
93:1–93:42, Aug. 2018, doi: 10.1145/3236009.
[32] A. Rosenfeld and A. Richardson, “Explainability in human–
agent systems,” Auton. Agents Multi-Agent Syst., vol. 33, no. 6,
pp. 673–705, Nov. 2019, doi: 10.1007/s10458-019-09408-y.
[33] N. Akhtar and A. Mian, “Threat of Adversarial Attacks on Deep
Learning in Computer Vision: A Survey,” IEEE Access, vol. 6,
pp. 14410–14430, 2018, doi: 10.1109/ACCESS.2018.2807385.
[34] A. Chakraborty, M. Alam, V. Dey, A. Chattopadhyay, and D.
Mukhopadhyay, “Adversarial Attacks and Defences: A Survey,”
ArXiv181000069 Cs Stat, Sep. 2018.
[35] X. Wang, J. Li, X. Kuang, Y. Tan, and J. Li, “The security of
machine learning in an adversarial setting: A survey,” J.
Parallel Distrib. Comput., vol. 130, pp. 12–23, Aug. 2019, doi:
[36] X. Yuan, P. He, Q. Zhu, and X. Li, “Adversarial Examples:
Attacks and Defenses for Deep Learning,” IEEE Trans. Neural
Netw. Learn. Syst., vol. 30, no. 9, pp. 2805–2824, Sep. 2019,
doi: 10.1109/TNNLS.2018.2886017.
[37] J. Zhang and C. Li, “Adversarial Examples: Opportunities and
Challenges,” IEEE Trans. Neural Netw. Learn. Syst., pp. 1–16,
2019, doi: 10.1109/TNNLS.2019.2933524.
[38] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A.
Galstyan, “A Survey on Bias and Fairness in Machine
Learning,” ArXiv190809635 Cs, Sep. 2019.
[39] X. Zhang and M. Liu, “Fairness in Learning-Based Sequential
Decision Algorithms: A Survey,” ArXiv200104861 Cs, Jan.
[40] D. L. Coates and A. Martin, “An instrument to evaluate the
maturity of bias governance capability in artificial intelligence
projects,” IBM J. Res. Dev., vol. 63, no. 4/5, pp. 7:1-7:15, Jul.
2019, doi: 10.1147/JRD.2019.2915062.
[41] D. Abel, J. MacGlashan, and M. L. Littman, “Reinforcement
Learning as a Framework for Ethical Decision Making,” in
AAAI Workshop: AI, Ethics, and Society, 2016.
[42] J. Taylor, E. Yudkowsky, P. LaVictoire, and A. Critch,
“Alignment for Advanced Machine Learning Systems,” 2016.
[43] T. Arnold, D. Kasenberg, and M. Scheutz, “Value Alignment or
Misalignment - What Will Keep Systems Accountable?,” in
AAAI Workshops, 2017.
[44] T. LaCroix and Y. Bengio, “Learning from Learning Machines:
Optimisation, Rules, and Social Norms,” ArXiv200100006 Cs
Stat, Dec. 2019.
[45] J. García and F. Fernández, “A Comprehensive Survey
on Safe Reinforcement Learning,” J. Mach. Learn. Res., vol. 16,
no. 42, pp. 1437–1480, 2015.
[46] N. Soares, B. Fallenstein, S. Armstrong, and E. Yudkowsky,
“Corrigibility,” in Workshops at the Twenty-Ninth AAAI
Conference on Artificial Intelligence, 2015.
[47] J. Leike et al., “AI Safety Gridworlds,” ArXiv171109883 Cs,
Nov. 2017.
[48] E. Awad et al., “The Moral Machine experiment,” Nature, vol.
563, no. 7729, pp. 59–64, Nov. 2018, doi: 10.1038/s41586-018-
[49] E. Awad, S. Dsouza, A. Shariff, I. Rahwan, and J.-F. Bonnefon,
“Universals and variations in moral decisions made in 42
countries by 70,000 participants,” Proc. Natl. Acad. Sci., Jan.
2020, doi: 10.1073/pnas.1911517117.
[50] P. Christiano, B. Shlegeris, and D. Amodei, “Supervising strong
learners by amplifying weak experts,” ArXiv181008575 Cs Stat,
Oct. 2018.
[51] J. Leike, D. Krueger, T. Everitt, M. Martic, V. Maini, and S.
Legg, “Scalable agent alignment via reward modeling: a
research direction,” ArXiv181107871 Cs Stat, Nov. 2018.
[52] G. Irving, P. Christiano, and D. Amodei, “AI safety via debate,”
ArXiv180500899 Cs Stat, Oct. 2018.
[53] E. Hubinger, C. van Merwijk, V. Mikulik, J. Skalse, and S.
Garrabrant, “Risks from Learned Optimization in Advanced
Machine Learning Systems,” ArXiv190601820 Cs, Jun. 2019.
[54] N. Soares and B. Fallenstein, “Agent foundations for aligning
machine intelligence with human interests: a technical research
agenda,” in The Technological Singularity, Springer, 2017, pp.
[55] J. Babcock, J. Kramar, and R. V. Yampolskiy, “Guidelines for
Artificial Intelligence Containment,” ArXiv170708476 Cs, Jul.
[56] L. Rothenberger, B. Fabian, and E. Arunov, “RELEVANCE OF
The queries in Table 4 were used to search across all fields. The only exception is the query for AI privacy, which was applied only to the title, abstract, and keywords fields.
Table 4 Search queries used for bibliometric research
Topic                    Search query
Explainability           ("artificial intelligence" OR "machine learning" OR "AI") AND "explainable"
Interpretability         ("interpretable" OR "interpretability") AND "artificial intelligence"
<general>                "AI safety" OR "safe AI"
Adversarial robustness   "adversarial examples"
AI ethics                "AI ethics" OR "machine ethics" OR "friendly AI" OR "good AI" OR ("superintelligence" AND "risk") OR ("existential risk" AND "AI")
AI policy                "AI policy" OR "AI governance"
Fairness                 ("AI" OR "algorithmic") AND ("bias" OR "fairness" OR "discrimination") AND "ethics"
Interruptibility         "AI" AND ("interruptibility" OR "corrigibility") AND "risk"
Safe exploration         "AI" AND "safe exploration"
Distributional shift     "Artificial intelligence" AND "distributional shift" AND ("generalization" OR "safety")
Reward hacking           ("reward hacking" OR "wireheading" OR "reward tampering" OR "reward gaming") AND "AI"
AI privacy*              "privacy" AND ("AI" OR "artificial intelligence")
Value alignment          ("value alignment" OR "alignment problem") AND "artificial" AND "ethics"
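As a sketch of how the field restriction described in this appendix could be expressed, the snippet below wraps a few of the Table 4 queries in Scopus advanced-search field codes: ALL for a search across all fields and TITLE-ABS-KEY for title, abstract, and keywords. The paper does not give the exact strings submitted to each database, so this wrapping is an assumption. Web of Science and Google Scholar use different syntaxes that are not covered here, and the helper name is hypothetical.

```python
# A subset of the queries from Table 4, copied verbatim.
QUERIES = {
    "AI policy": '"AI policy" OR "AI governance"',
    "Safe exploration": '"AI" AND "safe exploration"',
    "AI privacy": '"privacy" AND ("AI" OR "artificial intelligence")',
}

# Topics searched only in title, abstract, and keywords (marked * in Table 4).
RESTRICTED_TOPICS = {"AI privacy"}


def scopus_query(topic):
    """Wrap a topic's query in the Scopus field code matching its search scope."""
    field = "TITLE-ABS-KEY" if topic in RESTRICTED_TOPICS else "ALL"
    return f"{field}({QUERIES[topic]})"


for topic in QUERIES:
    print(f"{topic}: {scopus_query(topic)}")
```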
... Typical NFRs include performance (whether software performs the desired behavior, ex. accuracy), fairness [1][2][3][4][5][6], security [7][8][9], safety (whether the execution result of the software is safe) [10][11][12], and transparency (whether software can be trusted) [13][14][15]. It is essential to verify not only functional requirements but also NFRs for high-quality software. ...
... In [13], the authors proposed a method for understanding the behavior of AI by visualizing the activation of each layer of neural network models. • Some researchers [10][11][12] defined the safety of AI, proposed safety methods, and studied safety-violation cases. ...
... There are other studies [11,12] described the concept and definition of safety in AI and proposed general methods for achieving safety. ...
Full-text available
Artificial intelligence (AI) is one of the most important topics that implements symmetry in computer science. As like humans, most AI also learns by trial-and-error approach which requires appropriate adversarial examples. In this study, we prove that adversarial training can be useful to verify the safety of classification model in early stage of development. We experimented with various amount of adversarial data and found that the safety can be significantly improved by appropriate ratio of adversarial training.
... It deals with issues of security and the usefulness of AI for people and civilization in general. According to [7] AI safety is divided in the following sub-fields: ...
... In the absence of a model that is both transparent and accurate, it should be possible to formally verify the correctness of the algorithm and perhaps have a certification body that guarantees the accuracy of predictive models. Also, AI methods are subject to limitations that should definitely be considered to ensure AI safety and security [7] with especially concerning the recent impossibility result that shows inherent unfairness in explainability [72]. Currently, one of the biggest problems is the unregulated use of job selection programs that show bias in a variety of ways. ...
Full-text available
Artificial intelligence has become mainstream and its applications will only proliferate. Specific measures must be done to integrate such systems into society for the general benefit. One of the tools for improving that is explainability which boosts trust and understanding of decisions between humans and machines. This research offers an update on the current state of explainable AI (XAI). Recent XAI surveys in supervised learning show convergence of main conceptual ideas. We list the applications of XAI in the real world with concrete impact. The list is short and we call to action - to validate all the hard work done in the field with applications that go beyond experiments on datasets, but drive decisions and changes. We identify new frontiers of research, explainability of reinforcement learning and graph neural networks. For the latter, we give a detailed overview of the field.
... Possible explanations for this fact could be: (a) the AI community does not find these problems real; (b) the AI community does not find these problems urgent; (c) the AI community thinks we have more urgent problems at hand; or even (d) that the AI community does not know about issues like "alignment" or "corrigibility." In fact, if we look at the distribution of papers submitted in the NeurIPS 2021, 31 approximately 2% were safety-related (e.g., AI safety, ML-Fairness, Privacy, Interpretability). ...
In the last decade, a great number of organizations have produced documents intended to standardize, in the normative sense, and provide guidance for our recent and rapid AI development. However, the full content and divergence of ideas presented in these documents have not yet been analyzed, except for a few meta-analyses and critical reviews of the field. In this work, we seek to expand on the work done by past researchers and create a tool for better data visualization of the contents and nature of these documents. We also provide our critical analysis of the results obtained by applying our tool to a sample of 200 documents.
... Norms as sets of rules and conventions could also be a correlating device if they are simple enough to follow (Axelrod, 1984). They should at least be explainable and comprehensible (Dosilovic et al., 2018; Juric et al., 2020; Krajna et al., 2022). However, trying to follow many rules is certainly computationally hard, as constraint satisfaction problems from computer science can attest (Gallardo et al., 2009). ...
We shall have a hard look at ethics and try to extract insights in the form of abstract properties that might become tools. We want to connect ethics to games, talk about the performance of ethics, introduce curiosity into the interplay between competing and coordinating in well-performing ethics, and offer a view of possible developments that could unify increasing aggregates of entities. All this is under a long shadow cast by computational complexity that is quite negative about games. This analysis is the first step toward finding modeling aspects that might be used in AI ethics for integrating modern AI systems into human society.
... There is a special branch of AI called AI safety that deals with the question of how to make AI safe for humans. This field [6] is divided into three subfields: technical AI safety, AI ethics and AI policy. The field of explainability itself is located in the subcategory of technical safety called assurance. ...
Artificial intelligence (AI) has been embedded into many aspects of people's daily lives and it has become normal for people to have AI make decisions for them. Reinforcement learning (RL) models increase the space of solvable problems with respect to other machine learning paradigms. Some of the most interesting applications are in situations with non-differentiable expected reward function, operating in unknown or underdefined environment, as well as for algorithmic discovery that surpasses performance of any teacher, whereby agent learns from experimental experience through simple feedback. The range of applications and their social impact is vast, just to name a few: genomics, game-playing (chess, Go, etc.), general optimization, financial investment, governmental policies, self-driving cars, recommendation systems, etc. It is therefore essential to improve the trust and transparency of RL-based systems through explanations. Most articles dealing with explainability in artificial intelligence provide methods that concern supervised learning and there are very few articles dealing with this in the area of RL. The reasons for this are the credit assignment problem, delayed rewards, and the inability to assume that data is independently and identically distributed (i.i.d.). This position paper attempts to give a systematic overview of existing methods in the explainable RL area and propose a novel unified taxonomy, building and expanding on the existing ones. The position section describes pragmatic aspects of how explainability can be observed. The gap between the parties receiving and generating the explanation is especially emphasized. To reduce the gap and achieve honesty and truthfulness of explanations, we set up three pillars: proactivity, risk attitudes, and epistemological constraints. To this end, we illustrate our proposal on simple variants of the shortest path problem.
Autonomous systems are gaining more and more interest in research and society. However, they bring new challenges in safeguarding these systems. This contribution orders those new challenges and provides an overview of existing concepts and approaches for solving them. Moreover, existing metrics for safeguarding autonomous systems are systematically reviewed. The presented concepts and approaches come from different domains, namely ground, nautical and aerial vehicles, industrial robots and smart manufacturing, and medical and healthcare. Finally, the concepts and approaches are discussed with respect to the following points: main ideas, parallels existing in different domains, which ideas can be transferred from one domain to another, which high-level tasks were addressed, and which assumptions were made.
The future of computation is massively parallel and heterogeneous with specialized accelerator devices and instruction sets in both edge- and cluster-computing. However, software development is bound to become the bottleneck. To extract the potential of hardware wonders, the software would have to solve the following problems: heterogeneous device mapping, capability discovery, parallelization, adaptation to new ISAs, and many others. This systematic complexity will be impossible to manually tame for human developers. These problems need to be offloaded to intelligent compilers. In this paper, we present the current research that utilizes deep learning, polyhedral optimization, reinforcement learning, etc. We envision the future of compilers as consisting of empirical testing, automatic statistics collection, continual learning, device capability discovery, multiphase compiling (precompiling and JIT tuning), and classification of workloads. We devise a simple classification experiment to demonstrate the power of simple graph neural networks (GNNs) paired with program graphs. The test performance demonstrates the effectiveness and representational appropriateness of GNNs for compiler optimizations in heterogeneous systems. The benefits of intelligent compilers are time savings for the economy, energy savings for the environment, and greater democratization of software development.
Scheduling is a family of combinatorial problems where we need to find optimal time arrangements for activities. Scheduling problems in applications are usually notoriously hard to solve exactly. Existing exact solving procedures, based on mathematical programming and constraint programming, usually make manually-tuned heuristic choices. These heuristics can be improved by machine learning. In this paper, we apply a graph convolutional neural network from the literature to speed up a general branch-and-bound solver by learning its branching decisions. We test the augmented solver on job-shop scheduling problems and specific delivery scheduling problems in the supply chain of a local retailer. We get promising results and point to possible improvements. We discuss the interesting question of how much we can accelerate solving NP-hard problems in the light of the known limits and impossibility results in AI.
In the last few years, Artificial Intelligence (AI) has achieved a notable momentum that, if harnessed appropriately, may deliver the best of expectations over many application sectors across the field. For this to occur shortly in Machine Learning, the entire community stands in front of the barrier of explainability, an inherent problem of the latest techniques brought by sub-symbolism (e.g. ensembles or Deep Neural Networks) that were not present in the last hype of AI (namely, expert systems and rule based models). Paradigms underlying this problem fall within the so-called eXplainable AI (XAI) field, which is widely acknowledged as a crucial feature for the practical deployment of AI models. The overview presented in this article examines the existing literature and contributions already done in the field of XAI, including a prospect toward what is yet to be reached. For this purpose we summarize previous efforts made to define explainability in Machine Learning, establishing a novel definition of explainable Machine Learning that covers such prior conceptual propositions with a major focus on the audience for which the explainability is sought. Departing from this definition, we propose and discuss a taxonomy of recent contributions related to the explainability of different Machine Learning models, including those aimed at explaining Deep Learning methods for which a second dedicated taxonomy is built and examined in detail. This critical literature analysis serves as the motivating background for a series of challenges faced by XAI, such as the interesting crossroads of data fusion and explainability. Our prospects lead toward the concept of Responsible Artificial Intelligence, namely, a methodology for the large-scale implementation of AI methods in real organizations with fairness, model explainability and accountability at its core.
Our ultimate goal is to provide newcomers to the field of XAI with a thorough taxonomy that can serve as reference material in order to stimulate future research advances, but also to encourage experts and professionals from other disciplines to embrace the benefits of AI in their activity sectors, without any prior bias for its lack of interpretability.
When do people find it acceptable to sacrifice one life to save many? Cross-cultural studies suggested a complex pattern of universals and variations in the way people approach this question, but data were often based on small samples from a small number of countries outside of the Western world. Here we analyze responses to three sacrificial dilemmas by 70,000 participants in 10 languages and 42 countries. In every country, the three dilemmas displayed the same qualitative ordering of sacrifice acceptability, suggesting that this ordering is best explained by basic cognitive processes, rather than cultural norms. The quantitative acceptability of each sacrifice, though, showed substantial country-level variations. We show that low relational mobility (where people are more cautious about not alienating their current social partners) is strongly associated with the rejection of sacrifices for the greater good (especially for Eastern countries), which may be explained by the signaling value of this rejection. We make our dataset fully available as a public resource for researchers studying universals and variations in human morality: all the data and code used in this article can be downloaded at
The debate about the ethical implications of Artificial Intelligence dates from the 1960s (Samuel in Science, 132(3429):741-742, 1960; Wiener in Cybernetics: or control and communication in the animal and the machine, MIT Press, New York, 1961). However, in recent years symbolic AI has been complemented and sometimes replaced by (Deep) Neural Networks and Machine Learning (ML) techniques. This has vastly increased its potential utility and impact on society, with the consequence that the ethical debate has gone mainstream. Such a debate has primarily focused on principles, the 'what' of AI ethics (beneficence, non-maleficence, autonomy, justice and explicability), rather than on practices, the 'how.' Awareness of the potential issues is increasing at a fast rate, but the AI community's ability to take action to mitigate the associated risks is still in its infancy. Our intention in presenting this research is to contribute to closing the gap between principles and practices by constructing a typology that may help practically-minded developers apply ethics at each stage of the Machine Learning development pipeline, and to signal to researchers where further work is needed. The focus is exclusively on Machine Learning, but it is hoped that the results of this research may be easily applicable to other branches of AI. The article outlines the research method for creating this typology, the initial findings, and provides a summary of future research needs.
One of the objectives in the field of artificial intelligence for some decades has been the development of artificial agents capable of coexisting in harmony with people and other systems. The computing research community has made efforts to design artificial agents capable of doing tasks the way people do, tasks requiring cognitive mechanisms such as planning, decision-making, and learning. The application domains of such software agents are evident nowadays. Humans are experiencing the inclusion of artificial agents in their environment as unmanned vehicles, intelligent houses, and humanoid robots capable of caring for people. In this context, research in the field of machine ethics has become more than a hot topic. Machine ethics focuses on developing ethical mechanisms for artificial agents to be capable of engaging in moral behavior. However, there are still crucial challenges in the development of truly Artificial Moral Agents. This paper aims to show the current status of Artificial Moral Agents by analyzing models proposed over the past two decades. As a result of this review, a taxonomy to classify Artificial Moral Agents according to the strategies and criteria used to deal with ethical problems is proposed. The presented review aims to illustrate (1) the complexity of designing and developing ethical mechanisms for this type of agent, and (2) that there is a long way to go (from a technological perspective) before this type of artificial agent can replace human judgment in difficult, surprising or ambiguous moral situations.
With the widespread use of artificial intelligence (AI) systems and applications in our everyday lives, accounting for fairness has gained significant importance in designing and engineering of such systems. AI systems can be used in many sensitive environments to make important and life-changing decisions; thus, it is crucial to ensure that these decisions do not reflect discriminatory behavior toward certain groups or populations. More recently some work has been developed in traditional machine learning and deep learning that address such challenges in different subdomains. With the commercialization of these systems, researchers are becoming more aware of the biases that these applications can contain and are attempting to address them. In this survey, we investigated different real-world applications that have shown biases in various ways, and we listed different sources of biases that can affect AI applications. We then created a taxonomy for fairness definitions that machine learning researchers have defined to avoid the existing bias in AI systems. In addition to that, we examined different domains and subdomains in AI showing what researchers have observed with regard to unfair outcomes in the state-of-the-art methods and ways they have tried to address them. There are still many future directions and solutions that can be taken to mitigate the problem of bias in AI systems. We are hoping that this survey will motivate researchers to tackle these issues in the near future by observing existing work in their respective fields.
This chapter surveys eight research areas organized around one question: As learning systems become increasingly intelligent and autonomous, what design principles can best ensure that their behavior is aligned with the interests of the operators? The chapter focuses on two major technical obstacles to AI alignment: the challenge of specifying the right kind of objective functions and the challenge of designing AI systems that avoid unintended consequences and undesirable behavior even in cases where the objective function does not line up perfectly with the intentions of the designers. The questions surveyed include the following: How can we train reinforcement learners to take actions that are more amenable to meaningful assessment by intelligent overseers? What kinds of objective functions incentivize a system to “not have an overly large impact” or “not have many side effects”? The chapter discusses these questions, related work, and potential directions for future research, with the goal of highlighting relevant research topics in machine learning that appear tractable today.
This paper provides a comprehensive survey of Machine Learning Testing (ML testing) research. It covers 138 papers on testing properties (e.g., correctness, robustness, and fairness), testing components (e.g., the data, learning program, and framework), testing workflow (e.g., test generation and test evaluation), and application scenarios (e.g., autonomous driving, machine translation). The paper also analyses trends concerning datasets, research trends, and research focus, concluding with research challenges and promising research directions in machine learning testing. The full paper is available at:
Deep neural networks (DNNs) have shown huge superiority over humans in image recognition, speech processing, autonomous vehicles, and medical diagnosis. However, recent studies indicate that DNNs are vulnerable to adversarial examples (AEs), which are designed by attackers to fool deep learning models. Unlike real examples, AEs can mislead the model into predicting incorrect outputs while being hardly distinguishable by human eyes, and therefore threaten security-critical deep-learning applications. In recent years, the generation of and defense against AEs have become a research hotspot in the field of artificial intelligence (AI) security. This article reviews the latest research progress on AEs. First, we introduce the concept, cause, characteristics, and evaluation metrics of AEs, then give a survey of the state-of-the-art AE generation methods with a discussion of their advantages and disadvantages. After that, we review the existing defenses and discuss their limitations. Finally, future research opportunities and challenges on AEs are outlined. Index Terms: adversarial examples (AEs), artificial intelligence (AI), deep neural networks (DNNs).
The term "explainable AI" refers to the goal of producing artificially intelligent agents that are capable of providing explanations for their decisions. Some models (e.g., rule-based systems) are designed to be explainable, while others are less explicit "black boxes" for which their reasoning remains a mystery. One example of the latter is the neural network, and over the past few decades, researchers in the field of neural-symbolic integration (NSI) have sought to extract relational knowledge from such networks. Extraction from deep neural networks, however, has remained a challenge until recent years in which many methods of extracting distinct, salient features from input or hidden feature spaces of deep neural networks have been proposed. Furthermore, methods of identifying relationships between these features have also emerged. This article presents examples of old and new developments in extracting relational explanations in order to argue that the latter have analogies in the former and, as such, can be described in terms of long-established taxonomies and frameworks presented in early neural-symbolic literature. We also outline potential future research directions that come to light from this refreshed perspective.