Combining Machine Learning and Semantic Web: A
Systematic Mapping Study
ANNA BREIT, Semantic Web Company
LAURA WALTERSDORFER, TU Wien
FAJAR J. EKAPUTRA, Vienna University of Economics and Business (WU) and TU Wien
MARTA SABOU, Vienna University of Economics and Business (WU)
ANDREAS EKELHART, University of Vienna and SBA Research
ANDREEA IANA, HEIKO PAULHEIM, and JAN PORTISCH, University of Mannheim
ARTEM REVENKO, Semantic Web Company
ANNETTE TEN TEIJE and FRANK VAN HARMELEN, Vrije Universiteit (VU) Amsterdam
In line with the general trend in artificial intelligence research to create intelligent systems that combine learning and symbolic components, a new sub-area has emerged that focuses on combining Machine Learning components with techniques developed by the Semantic Web community—Semantic Web Machine Learning (SWeML). Due to its rapid growth and impact on several communities in the past two decades, there is a need to better understand the space of these SWeML Systems, their characteristics, and trends. Yet, surveys that adopt principled and unbiased approaches are missing. To fill this gap, we performed a systematic study and analyzed nearly 500 papers published in the past decade in this area, where we focused on evaluating architectural and application-specific features. Our analysis identified a rapidly growing interest in SWeML Systems, with a high impact on several application domains and tasks. Catalysts for this rapid growth are the increased application of deep learning and knowledge graph technologies. By leveraging the in-depth understanding of this area acquired through this study, a further key contribution of this article is a classification system for SWeML Systems that we publish as an ontology.
This work was supported in part by the research project OBARIS, which received funding from the Austrian Research Promotion Agency (FFG) under grant 877389. SBA Research (SBA-K1) is a COMET Centre within the framework of COMET—Competence Centers for Excellent Technologies Programme and funded by BMK, BMDW, and the federal state of Vienna; COMET is managed by FFG. Moreover, this work was supported by the Christian Doppler Research Association, the Austrian Federal Ministry for Digital and Economic Affairs, and the National Foundation for Research, Technology and Development. M. Sabou was funded through the FWF HOnEst project (V 754-N).
Authors’ addresses: A. Breit and A. Revenko, Semantic Web Company, Neubaugasse 1/8, 1070 Vienna, Austria; emails: {anna.breit, artem.revenko}@semantic-web.com; L. Waltersdorfer, TU Wien, Favoritenstrasse 9-11/194-01, 1040 Vienna, Austria; email: laura.waltersdorfer@tuwien.ac.at; F. J. Ekaputra, Vienna University of Economics and Business (WU), Institute for Data, Process and Knowledge Management (DPKM), Building D2/C, 2nd floor, Welthandelsplatz 1, 1020 Vienna, Austria; email: fajar.ekaputra@wu.ac.at; M. Sabou, University of Vienna, Kolingasse 14-16, 5. OG, 1090 Vienna, Austria; email: marta.sabou@wu.ac.at; A. Ekelhart, University of Vienna and SBA Research, Floragasse 7, 1040 Vienna, Austria; email: andreas.ekelhart@univie.ac.at; A. Iana, H. Paulheim, and J. Portisch, University of Mannheim, L 1, 1, 68131 Mannheim, Germany; emails: andreea.iana@uni-mannheim.de, {heiko, jan}@informatik.uni-mannheim.de; A. ten Teije and F. van Harmelen, Vrije Universiteit (VU) Amsterdam, Dept. of Computer Science, De Boelelaan 1111, 1085 HV Amsterdam, Netherlands; emails: {annette.ten.teije, frank.van.harmelen}@vu.nl.
Author’s current address: J. Portisch, SAP SE, Dietmar-Hopp-Allee 16, 69190 Walldorf, Germany; email: jan.portisch@sap.com.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
© 2023 Copyright held by the owner/author(s).
0360-0300/2023/07-ART313 $15.00
https://doi.org/10.1145/3586163
CCS Concepts: • Information systems → Semantic web description languages; • Computing methodologies → Knowledge representation and reasoning; Machine learning;
Additional Key Words and Phrases: Semantic Web, Machine Learning, Artificial Intelligence, knowledge graph, Knowledge Representation and Reasoning, neuro-symbolic integration, Systematic Mapping Study
ACM Reference format:
Anna Breit, Laura Waltersdorfer, Fajar J. Ekaputra, Marta Sabou, Andreas Ekelhart, Andreea Iana, Heiko Paulheim, Jan Portisch, Artem Revenko, Annette ten Teije, and Frank van Harmelen. 2023. Combining Machine Learning and Semantic Web: A Systematic Mapping Study. ACM Comput. Surv. 55, 14s, Article 313 (July 2023), 41 pages. https://doi.org/10.1145/3586163
1 INTRODUCTION
For a system to be perceived as “intelligent,” it has to fulfill certain properties: it needs to be able to adapt and react to unknown situations, and it needs to have some understanding of the world in which it acts, subjected to constant refinement over time while it obtains access to new information. Although Artificial Intelligence (AI) and human intelligence are unarguably different, and especially the latter is still not fully understood, researchers have drawn parallels between the two in the past. A recent AAAI paper [6] relates the building blocks of AI to Daniel Kahneman’s theory of human intelligence, which is divided into an (intuitive) system 1 and a (rational) system 2 [10]. The authors state that system 1 is comparable to Machine Learning (ML), whereas system 2 rather resembles Knowledge Representation and Reasoning (KR). They further argue that AI, just like human intelligence, needs the combination of both, also called neuro-symbolic AI.
This symbiotic use of ML and KR techniques is a strongly emerging trend in AI. Indeed, recent
years have seen an increased interest and fast-paced developments in techniques that make use
of this combination to build intelligent systems in the vein of neuro-symbolic AI. At the same
time, the Semantic Web (SW) research community has popularized knowledge representation
techniques and resources in the past two decades [15], leading to a great interest in and uptake
of SW resources such as knowledge graphs, ontologies, thesauri, and linked datasets outside of
the SW research community [17]. These two trends have led to the development of systems that
rely on both SW resources and ML components, known as Semantic Web Machine Learning
(SWeML) Systems.
This research area of SWeML has gained a lot of traction in the past few years, as shown by a rapidly growing number of publications in different outlets, as well as by SWeML techniques being employed to solve problems in various domains. At the same time, this growth poses two main challenges that threaten to hamper further development of the field.
First, keeping up with the main trends in the field has become infeasible, not only because of its fast pace and the large volume of published papers but also because papers require understanding techniques from the two diverse research sub-areas of AI. In an attempt to address this challenge, several works aimed to provide overviews of SWeML Systems and related systems (see Table 1). However, reviewing those, we conclude that existing work either (1) focuses on a wider or related field [34] or, on the contrary, (2) is scoped around a very specific sub-field of SWeML [25, 26, 29]. Additionally, none of the reviewed surveys adopts a principled and reproducible methodology that would guarantee unbiased and representative data collection. We therefore conclude that there is a need for a survey that adopts a solid review methodology to complement current insights with evidence-based findings.
The second challenge, which amplifies the first one, is the lack of a standardized way to report SWeML Systems, which hampers understanding all key aspects of these systems. On the one hand, authors of SWeML Systems would benefit from a structured way to describe their system and
its key characteristics. Readers, on the other hand, would benefit from a structured way of interpreting such systems. This would not only facilitate the understandability for those coming from other communities but also improve the comparability of different systems. An early work in this direction was proposed by Van Harmelen and ten Teije [34], who introduced patterns for representing hybrid AI systems in terms of their components and information flows, with the aim to facilitate a more schematic representation of the system. Although these patterns were derived from a large number of papers, there is currently no insight into their adoption in the field (e.g., about the completeness of the introduced system patterns) or their usage frequency.
To address the preceding challenges, in this article we investigate the following main research
questions:
• What are the state of the art and trends related to systems that combine SW and ML components?
• How can these systems be classified into a systematic taxonomy?
To that end, in contrast to previous work, we perform a Systematic Mapping Study (SMS) [20]
of the SWeML Systems field. SMS is an established method in evidence-based research because (1) it follows a well-defined paper selection process to identify a representative set of primary studies, reducing selection bias in comparison to ad hoc study selection, and (2) it adopts a standard, well-
documented process allowing for the study to be reproduced.
Based on this methodology, we provide two main contributions:
• A trends landscape, to capture the tendencies of SWeML Systems with respect to the level of adoption, maturity, and reproducibility, with an in-depth focus on structural aspects, their processing flows, and the characteristics of their ML and SW components, derived from a systematic survey.
• A classification system for SWeML Systems that can be used as a template for analyzing existing systems and describing new ones. This can be seen as a controlled vocabulary for the different building blocks of those systems. A key aspect of this classification system is manifested as a framework for documenting and classifying processing flows in SWeML Systems.
With these contributions, we aim to address a large and diverse audience and facilitate their
understanding of this rapidly emerging area both within the scope of this article and beyond. To
ensure transparency and reproducibility, we share our research material including the study protocol, the list of collected papers, and additional analysis.1 Furthermore, we share the classification system in the form of an ontology to enable the creation of human-understandable yet machine-
actionable documentation.
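As an illustration of what such machine-actionable documentation could look like, the following minimal sketch (Python with rdflib) describes a single system instance; the namespace and all class and property names here are hypothetical placeholders, not the actual terms of the published ontology.

```python
# A minimal sketch of machine-actionable system documentation in RDF.
# The namespace and all class/property names below are hypothetical
# placeholders; the published ontology defines the actual vocabulary.
from rdflib import Graph, Literal, Namespace, RDF

SWEMLS = Namespace("https://example.org/swemls#")  # hypothetical namespace

g = Graph()
g.bind("swemls", SWEMLS)

system = SWEMLS["ExampleSystem"]
g.add((system, RDF.type, SWEMLS.SWeMLSystem))
g.add((system, SWEMLS.hasPattern, Literal("d/s-M-s")))  # processing pattern (cf. Section 5.2)
g.add((system, SWEMLS.targetsTask, Literal("Text Analysis")))
g.add((system, SWEMLS.usesSWResource, Literal("domain ontology")))
g.add((system, SWEMLS.usesMLModel, Literal("CNN")))

print(g.serialize(format="turtle"))
```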
This article is structured as follows. Section 2 defines and positions SWeML Systems, whereas Section 3 reviews existing surveys and classification systems in related and neighboring fields. Section 4 describes the survey methodology, including a refinement of the overall research question of this survey into more detailed ones, Section 5 presents the findings from the survey, and Section 6 derives a new classification system for SWeML Systems. We conclude with a discussion of our main findings and an outlook on future work.
2 SEMANTIC WEB MACHINE LEARNING SYSTEMS
2.1 Definition
SWeML Systems are the result of combining SW technologies and an inductive model. More precisely, they describe a system that makes use of an SW knowledge structure as well as an ML sub-system to solve a specific task.
1 https://swemls.github.io/swemls/.
Fig. 1. Relation of different research areas around SWeML.
For the purposes of this survey, we define such SW knowledge structures as a symbolic representation of a conceptual domain model and data complying with such domain models. These
resources can have varying levels of formalization ranging from formal logical foundations (in
particular, description logic) to lightweight semantic structures. Examples include vocabularies,
taxonomies, ontologies, linked datasets, and knowledge graphs. The ML sub-system consists of an
inductive model that can generalize over a given set of examples. These models include rule learn-
ing systems, traditional ML models such as support vector machines, random forests, or multi-layer
perceptrons, as well as more recent deep learning models. SWeML Systems are tangible systems
with a software implementation. These implementations might be of different maturity, ranging from prototypes to enterprise-ready systems. However, all of these systems aim at solving specific tasks as opposed to being generic-use components such as libraries and conceptual frameworks.
Although SWeML Systems are required to include the usage of both a semantic module and an
ML module, they are not restricted in the number of these modules, nor in the incorporation of
components that do not fall under the definition of either of these modules. This means that SW
knowledge structures as well as ML models can be used in various parts of the system—either in a
pure form or complemented with other models or knowledge structures—but they are not required
to be applied in all of the parts. Furthermore, no assumptions about the patterns of interactions
between the semantic and the ML module are made, yielding a wide variety of possible design
patterns for SWeML Systems.
2.2 Background and Positioning
To further deepen the understanding and definition of the SWeML field, it is helpful to draw connections to related fields. An overview of the connection of these research fields is shown in Figure 1.
Neuro-Symbolic Integration. Recent developments in AI mostly rely on an underlying ML sys-
tem; however, a more traditional approach is based on symbolic KR. Neuro-Symbolic Systems
(NeSy) aim to integrate both these approaches to combine and exploit the advantages of an in-
ductive and a deductive system [12]. As KR and SW technologies are tightly connected, SWeML overlaps significantly with the area of neuro-symbolic integration, despite having a different focus: SWeML Systems may incorporate deductive reasoners, and NeSy systems frequently work with SW data representations.
Explainable Artificial Intelligence. Explainable Artificial Intelligence (XAI) systems aim to increase the interpretability and comprehensibility of AI systems [4], where interpretable refers to
the ability to understand how inputs are processed on a systems level, whereas comprehensible systems provide the user with symbols that enable them to draw conclusions on how properties of the input influence the output [11]. A SWeML System can improve the interpretability and comprehensibility of the ML sub-system through its incorporation of a semantic symbolic sub-system. Such explainable SWeML Systems have been used in different domains to solve a wide variety of tasks [29]. However, not every SWeML System results in an XAI system.
SW Mining. SW mining addresses the combination of Web mining and SW technologies. Web
mining describes the application of data mining techniques to Web resources to extract useful pat-
terns and information [32]. The patterns of interaction between Web mining and SW are diverse—
that is, data mining can be used to construct SW resources, or SW data can be exploited for Web
mining [30]. Even though there are parallels between a SWeML System and a SW mining sys-
tem, they have a significantly different focus: SW mining systems aim at Web-based data, whereas SWeML Systems do not restrict the type of their data sources. Moreover, data mining techniques used in Web mining do not necessarily have to be based on inductive systems.
3 RELATED WORK
3.1 Related Surveys
The vision of combining symbolic knowledge and learning has a long history [12], with intensified research activities over the past 5 years [28]. As a result, a number of survey papers have addressed the areas of SWeML and neuro-symbolic integration, as synthesized in Table 1 in terms of the system type they cover (e.g., NeSy, SWeML System) and their paper selection procedure.
Starting with the works investigating NeSy, Besold et al. [5] survey learning and reasoning approaches from a holistic perspective. The paper investigates the intersection of computer science, cognitive science, and neuroscience on a general level. Coming from the ML perspective, Von Rueden et al. [35] coin the term informed ML, which is one approach to NeSy. They propose a taxonomy of methods to integrate knowledge into learning systems [35]. Although knowledge graphs and other approaches are mentioned, SW data and symbolic representation methods are not the key topics.
Two surveys on design patterns for NeSy propose a taxonomically organized vocabulary to describe both processes and data structures [33, 34]. However, their paper selection is ad hoc, without an attempt to be exhaustive. Hitzler et al. [16] provide an initial overview of neuro-symbolic AI for SW and discuss their mutual benefits. Examples include deductive reasoning and knowledge graph embeddings [16]. This paper offers first insights into techniques and examples but does not discuss common architectures or frequencies of used models. Recently, Sarker et al. [28] surveyed 43 papers from well-established AI conferences to characterize neuro-symbolic AI using two different taxonomies. Although they offer a first perspective on trends, details concerning dataflows, used models, and architectures are not discussed.
Finally, Seeliger et al. [29] and D’Amato [8] focus on SWeML Systems, although they do not define SWeML Systems as a concept. Seeliger et al. [29] investigate in a systematic literature review how to make opaque ML algorithms explainable through SW technologies, exploring which combinations of SW and ML techniques are used to obtain explanations, in which domains they are mainly used, and how these explanations are evaluated. From an SW perspective, D’Amato [8] describes research directions for incorporating ML techniques into symbolic approaches. Examples include instance retrieval, concept learning, knowledge completion, or learning disjointedness, but an overview of these techniques is offered rather than a deep analysis.
We conclude that existing surveys mostly focus on the broader category of NeSy systems and only a few target SWeML Systems. Additionally, with the exception of the work of Seeliger et al. [29],
Table 1. Related Survey Papers Focusing on the Intersection of SW and ML

Ref. | Authors | Venue | Type | Year | Paper Selection | Contribution
[5] | Besold et al. | Neuro-Symbolic Artificial Intelligence: The State of the Art | NeSy | 2021 | Custom | Overview of NeSy from different perspectives: computer science, cognitive science, cognitive neuroscience
[35] | Von Rueden et al. | IEEE Transactions on Knowledge & Data Engineering | NeSy | 2021 | Custom | Overview of informed ML: taxonomy, overview of knowledge types
[33, 34] | Van Bekkum et al.; Van Harmelen and ten Teije | Applied Intelligence; Journal of Web Engineering | NeSy | 2019, 2021 | Custom | Design patterns for hybrid AI: taxonomy to describe processes and data structures, case study
[16] | Hitzler et al. | Semantic Web | NeSy | 2020 | Custom (vision paper) | Overview of NeSy for SW: knowledge graph embeddings, explainable deep learning, deductive reasoning
[28] | Sarker et al. | arXiv | NeSy | 2021 | Survey papers collected from NeurIPS, ICML, AAAI, ICLR, IJCAI | Overview of NeSy: grouping into a taxonomy [19] and into dimensions [3]
[29] | Seeliger et al. | SemEx ISWC | XAI SWeML System | 2019 | Systematic literature review: (Q1) “deep learning” OR “data mining”; (Q2) “explanation*” OR “interpret*” OR “transparent*”; (Q3) “Semantic Web” OR “ontolog*” OR “background knowledge” OR “knowledge graph*” | Overview of SW for XAI: application domains and tasks important to this research field, forms of explanations, evaluation
[8] | D’Amato | Semantic Web | SW SWeML System | 2020 | Custom | Overview of ML methods for SW: probabilistic latent variable models, embedding models, vector space embeddings
all surveys adopt a custom approach to collecting the surveyed papers. Therefore, there is a need
for a systematic survey that adopts a solid review methodology and focuses on SWeML Systems,
which we aim to address with this article.
3.2 Existing Classification Systems
Several works aim to characterize NeSy: Bader and Hitzler [3] made an early attempt to propose eight dimensions for classification purposes. More recently, Van Harmelen and ten Teije [34] proposed a set of 13 design patterns, similar to design patterns in software engineering. This taxonomy is extended with processes and models in the work of Van Bekkum et al. [33]. Kautz [19] introduced a neuro-symbolic taxonomy of six different types of systems. Although the goal is similar, his taxonomy does not reflect the internal architectures of the investigated systems. The taxonomies proposed by Sarker et al. [28] and Von Rueden et al. [35] are more fine-grained but with less focus on how to combine different system architectures.
For SWeML Systems, no works that target their classification could be found. To fill this gap, we propose a classification system that uses ideas from the taxonomically organized vocabulary for describing both processes and data structures for NeSy in the work of Van Harmelen and ten Teije [34]. However, our proposed classification system (1) takes a more coarse-grained system-level view as opposed to the fine-grained view in characterizing data structures/processes;
Fig. 2. Overview of the SMS process.
(2) focuses on a particular type of neuro-symbolic AI systems, namely SWeML Systems; and
(3) has been derived as part of the SMS from a large number of papers.
4 METHODOLOGY
To gain an overview of existing research that falls under the term of SWeML Systems, we conducted an SMS [20], which is well suited to structure broad research areas. A more detailed explanation of the methodology (e.g., details on keyword selection and selection criteria) can be found in the study protocol.2 The SMS consists of three consecutive phases (cf. Figure 2) as follows:
(1) Study Planning, where we design and develop the Study Protocol, as detailed in Section 4.1;
(2) Study Execution, consisting of (2.1) Literature Search, (2.2–2.3) Literature Selection, and (2.4) Data Extraction, which rely on details specified in the Study Protocol (Section 4.2); and
(3) Analysis & Reporting, focusing on the analysis of the extracted data and the reporting thereof (Section 5).
4.1 Study Planning
As the first phase of the study, planning focuses on scoping the study and, accordingly, proposing the
methodology for each step of the study as documented in the study protocol. Study scoping includes
positioning the planned work in the context of related research areas and related work in terms of
similar surveys. This forms a basis for deriving pertinent research questions (Section 4.1.1), which
are then translated into appropriate search queries (Section 4.1.2), a number of paper selection
criteria (Section 4.1.3) used to identify relevant papers, and a data extraction form that facilitates
the objective and unbiased extraction of data. The methodological details captured in the study
protocol aim to make the study process transparent and reproducible.
4.1.1 Detailed Research Questions. We refine our two overall research questions (announced in Section 1) into a number of more detailed research questions, which (1) help identify emerging trends in the area and (2) provide insights into the characteristics of SWeML Systems for deriving a classification scheme thereof:
RQ1 Bibliographic characteristics: How are the publications temporally and geographically
distributed? How are the systems positioned, and which keywords are used to describe
them?
RQ2 System architecture: What processing patterns are used in terms of inputs/outputs and
the order of processing units?
RQ3 Application areas: What kind of tasks are solved (e.g., text analysis)? In which domains
are SWeML Systems applied (e.g., life sciences)?
2 https://swemls.github.io/swemls/.
Table 2. Sub-Queries for the Search Query Q = Q1 ∩ Q2 ∩ Q3
Sub-Query Search Strings
Q1 (SW module) knowledge graph, linked data, semantic web, ontolog*, RDF, OWL,
SPARQL, SHACL
Q2 (ML module) deep learning, neural network, embedding, representation learning,
feature learning, language model, language representation model,
rule mining, rule learning, rule induction, genetic programming,
genetic algorithm, kernel method
Q3 (system) Natural Language Processing, Computer Vision, Information
Retrieval, Data Mining, Information integration, Knowledge
management, Pattern recognition, Speech recognition
Each sub-query consists of a disjunction (OR) of search strings.
RQ4 Characteristics of the ML module: What ML models are incorporated (e.g., SVM)? Which ML components can be identified (e.g., attention)? What training type(s) are used during the system training phase?
RQ5 Characteristics of the SW module: What type of SW structure is used (e.g., taxonomy)?
What is the degree of semantic exploitation? What are the size and the formalism of the
resources? Does the system integrate semantic processing modules (i.e., KR)?
RQ6 Maturity, transparency, and auditability: What is the level of maturity of the systems?
How transparent are the systems in terms of sharing source code, details of infrastructure,
and evaluation setup? Does the system have a provenance-capturing mechanism?
4.1.2 Digital Libraries and Search Queries.
Digital Libraries. We performed a query-based search in the following digital libraries to retrieve important conference and journal papers, as they are referred to as good sources for software engineering publications [7, 20]: (i) Web of Science, (ii) ACM Digital Library, (iii) IEEE Xplore, and (iv) Scopus.3
Search Query. The search query was derived from the study research questions and iteratively refined to obtain a high number of relevant papers while keeping the number of retrieved papers manageable. The query consists of three sub-queries targeting the SW module (Q1), the ML module (Q2), and the system aspect (Q3) of SWeML Systems, respectively. The search strings contained in these queries are presented in Table 2. Each sub-query consists of a union of the listed search terms; the final query used for the study search is an intersection of the three sub-queries. The collection of the search terms for the sub-queries followed a systematic methodology, which is described in more detail in Appendix A.1.
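For illustration, the following sketch assembles the final query from the three sub-queries of Table 2 as a generic boolean string; the concrete syntax accepted by each digital library differs, so this is a simplified rendering rather than the exact queries submitted to each interface.

```python
# Sketch: assemble the search query Q = Q1 AND Q2 AND Q3, where each
# sub-query is a disjunction (OR) of its search strings (cf. Table 2).
# Each digital library requires its own query syntax in practice.
Q1_SW = ["knowledge graph", "linked data", "semantic web", "ontolog*",
         "RDF", "OWL", "SPARQL", "SHACL"]
Q2_ML = ["deep learning", "neural network", "embedding",
         "representation learning", "feature learning", "language model",
         "language representation model", "rule mining", "rule learning",
         "rule induction", "genetic programming", "genetic algorithm",
         "kernel method"]
Q3_SYS = ["Natural Language Processing", "Computer Vision",
          "Information Retrieval", "Data Mining", "Information integration",
          "Knowledge management", "Pattern recognition", "Speech recognition"]

def disjunction(terms):
    # Quote each search string and join with OR.
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

query = " AND ".join(disjunction(q) for q in (Q1_SW, Q2_ML, Q3_SYS))
print(query)
```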
4.1.3 Study Selection Criteria. We have selected eight study selection criteria. For each criterion (C), an inclusion criterion (IC) and a complementary exclusion criterion (EC) are given. This improves the specificity of the criteria. Inclusion criteria 1 through 5 concern metadata of the publications, such as publication date (2010–2020), language (English), publication type (peer reviewed), accessibility (accessible to authors), and duplicates (latest version). C6 and C7 refer to the SWeML Systems definition: whether described systems have an interconnection between the SW and ML component (C6), and whether the system solves a task (C7). C8 filters out papers of low linguistic and/or scientific quality that cannot be fully understood.
3 (i) http://www.webofknowledge.com/, (ii) https://dl.acm.org/, (iii) https://ieeexplore.ieee.org/, (iv) https://www.scopus.com/.
4.2 Study Execution
Based on the study protocol, the study is executed through the steps described in the next sections.
Literature Search. The execution of search queries in the four digital libraries returned 2,865 papers.4 After merging the four result sets, 1,986 papers remained (cf. box 2.1 of Figure 2). We used a combination of automatic merging of bibtex entries5 and manual checking to ensure the correctness of the merged results.
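A minimal sketch of such a merge step is given below; keying on the DOI or a normalized title is our simplifying assumption, as the actual merge combined automatic bibtex tooling with manual checks.

```python
# Sketch: merge result sets from the four digital libraries and drop
# duplicates. Keying on DOI (if present) or a normalized title is a
# simplifying assumption; the study additionally used manual checking.
import re

def dedup_key(entry):
    if entry.get("doi"):
        return entry["doi"].lower()
    # Fall back to a normalized title: lowercase, alphanumerics only.
    return re.sub(r"[^a-z0-9]", "", entry["title"].lower())

def merge(result_sets):
    merged = {}
    for results in result_sets:          # one list per digital library
        for entry in results:
            merged.setdefault(dedup_key(entry), entry)  # keep first copy
    return list(merged.values())

wos = [{"title": "Combining ML and SW", "doi": "10.1145/3586163"}]
scopus = [{"title": "Combining ML and SW.", "doi": "10.1145/3586163"}]
print(len(merge([wos, scopus])))  # -> 1
```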
Literature Selection. Literature selection includes two separate selection steps. The first step focuses on metadata, titles, and abstracts (cf. box 2.2 of Figure 2). We divided the retrieved papers into 10 batches of 200 papers each and assigned two researchers to each batch. The first researcher decided on inclusion or exclusion considering the criteria C6 and C7. The second assignee checked unclear cases plus 10% of the decisions of the first researcher. This step reduced the number of papers to 987. In the second step (cf. box 2.3 of Figure 2), all papers were investigated more thoroughly based on their content and in terms of the study selection criteria C1 through C8, leading to 476 papers selected for data extraction.
Data Extraction. This step (cf. box 2.4 of Figure 2) was conducted with the help of a shared data
extraction form that defines how and which data is collected from papers to answer the study
research questions (Section 4.1.1). The form was prepared prior to study execution to reduce re-
searcher bias and allow multiple researchers to extract data objectively [20]. The detailed data
extraction form is available in the study protocol.
5 DATA ANALYSIS
Because our study criteria led to 476 publications being included in the data analysis, it is not possible to perform a publication-wise analysis. We will therefore limit ourselves to meta-analysis. However, to provide a better understanding of the concrete outcomes, we will exemplify
the extracted values with seven example papers included in the analysis: PUB1 [13], PUB2 [39],
PUB3 [2], PUB4 [24], PUB5 [38], PUB6 [9], and PUB7 [31].
5.1 RQ1 Bibliographic Characteristics
5.1.1 Temporal Distribution. Figure 3 shows the non-exclusive distribution per year of the 476 publications.6 We observed two trends in the publication count over the years. Starting from 2016, there is a surge in the number of papers in all digital libraries. Furthermore, a large portion of the selected papers were retrieved from Scopus. Between 2010 and 2016, the published papers account yearly for less than 5% of the total number of selected publications. From 2016 onward, 15% to 20% were retrieved yearly, increasing in 2019 and 2020 to more than 35% of all publications selected for data extraction. An important aspect we take into account in the remainder of the data analysis is that the decrease from 2019 might be because the set of papers from 2020 is incomplete.7
5.1.2 Thematic Distribution. For identifying the positioning and focus of a paper, we concentrate on author-defined keywords. If no keywords were provided in the paper (e.g., PUB4
4 The literature search was executed on November 2, 2020.
5 Using bibliographic data management software from Mendeley (https://www.mendeley.com/) and Zotero (https://www.zotero.org/).
6 Note for Figure 3 that several papers were counted multiple times in the graph due to being available in more than one digital library.
7 The search for papers was performed on November 2, 2020, and many digital libraries have delays of several months for indexing publications, whereas some relevant conferences only take place in December.
Fig. 3. Number of selected publications in individual digital libraries per year (non-exclusive).
Fig. 4. Popularity of the top 10 keywords.
Fig. 5. Thematic distribution of the selected papers based on the author keywords.
and PUB5), they were automatically generated by extracting the terms with the highest TF-IDF score [22] from the title and abstract of the paper. To increase quality, only those generated keywords were considered that also appeared in the global list of all author-provided keywords. With this procedure, we, for example, generated for PUB5 (titled Using Distributional Semantics for Automatic Taxonomy Induction) the keywords “taxonomy,” “textual entailment,” and “natural language processing.”
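The following sketch illustrates this fallback with scikit-learn; treating each paper's title plus abstract as one document and the exact vectorizer settings are our assumptions, not the study's documented configuration.

```python
# Sketch: generate keywords for papers without author keywords by taking
# the highest-TF-IDF terms from title + abstract, keeping only terms that
# also occur in the global list of author-provided keywords.
from sklearn.feature_extraction.text import TfidfVectorizer

def generate_keywords(docs, global_author_keywords, top_n=3):
    vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
    tfidf = vectorizer.fit_transform(docs)      # docs: "title + abstract" texts
    terms = vectorizer.get_feature_names_out()
    allowed = {kw.lower() for kw in global_author_keywords}
    results = []
    for row in tfidf.toarray():
        # Rank terms by TF-IDF score; keep those that are author keywords.
        ranked = sorted(zip(row, terms), reverse=True)
        kws = [term for score, term in ranked if score > 0 and term in allowed]
        results.append(kws[:top_n])
    return results

# Example call (illustrative):
# generate_keywords(["<title + abstract text>"], ["taxonomy", "deep learning"])
```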
Keywords. Figure 4 shows the evolution of the top 10 keywords over the years. Two of the top 3 keywords describe types of semantic resources: (1) knowledge graph (110 papers) and (2) ontology (54 papers), followed by (3) deep learning (39 papers). From a semantics perspective, until 2016, ontology was used as a frequent term, whereas from 2017 on, a substantial switch toward knowledge graphs can be seen (for a detailed analysis of semantic resources and their types, cf. Section 5.5). Deep learning gained traction from 2016 on, as well as embeddings from 2017 on (both word and knowledge graph embeddings, cf. Section 5.4 for details on the ML components). However, for both assertions, it is important to note that the number of included papers in our study significantly increased from 2016 on, with 57 papers from 2010 to 2015 compared to 419 papers from 2016 to 2020. Until 2015, the small number of selected papers and thus few keywords do not support conclusive insights for this period. As the field matures, future studies might show a drift toward more common system tasks and application areas. However, until now, mainly ML and SW components are used for the definition of systems.
Positioning. Figure 5 depicts the positioning of papers according to their specified area. To that end, we categorized keywords that appeared in at least two papers8 into Machine Learning (ML), Semantic Web (SW), and Systems (SYS).
Each keyword can be associated with one or multiple categories—for example, knowledge graph is in the SW category, deep learning is in the ML category, and knowledge graph embedding is associated with both SW and ML. Question answering and information retrieval are examples of SYS keywords. The majority of papers fall into all three areas of ML, SW, and SYS (144 papers); second is the intersection between ML and SW (83 papers); and third, SW and SYS (76 papers). The SYS area has been assigned less often, which could be related either to our choice of system keywords or could indicate that SWeML Systems remains an evolving field with a high variety in use cases and domains, without clearly dominant tasks or system-related keywords.
5.1.3 Geographical Distribution. Figure 6 illustrates the regional distribution of the publishing institutes of the included publications. We found three major regional clusters of publishing institutions in the domain of SWeML Systems. More specifically, 43% of the surveyed papers have
8 Twenty-eight of the 476 papers used only unique keywords and hence were not considered in this analysis.
Fig. 6. Regional distribution of affiliation of paper authors. The area of the circles corresponds to the number of publications. The distribution is non-exclusive (e.g., PUB5 appears in both Germany and Pakistan). The raw data can be found in Appendix A.2.
an author affiliated with an institution in Asia, approximately 29% have one affiliated in Europe, and nearly 19% have an author based in North America. Among the Asian countries, in 71% of the cases, the author is based in China, whereas in North America, the authors are located in the United States in 86% of the cases. The geographical distribution in Europe is less skewed, with Germany, France, the United Kingdom, and Italy being the most frequent countries of affiliation of the authors, each in more than 10% of the cases. In only 1% of the cases, publications have an author from Africa or from South and Central America. Similarly, the Middle East and Oceania are also underrepresented, as in only 2.5% to 3.5% of the cases, respectively, publications have an author affiliated with an institution from one of the two regions.
5.1.4 Conclusions: RQ1. We conclude that there is a recent and accelerated growth in interest (and corresponding published papers) in the area of SWeML that is present worldwide, with a strong cluster in China and a generally weak representation of the Global South. Furthermore, based on their keywords, papers reporting on SWeML Systems mostly relate to the SW area (in particular, ontology, knowledge graph) or ML area (most frequently deep learning) or a combination thereof, and less to a particular domain (although natural language processing is a prominently used keyword).
5.2 RQ2 System Architecture
To gain deeper insights into how SW and ML modules are combined in a SWeML System, we analyze the overall processing flows and the roles these modules play. To depict the processing flows in SWeML Systems, we use a boxology notation framework to define interaction patterns.
5.2.1 SWeML Systems Boxology. To efficiently analyze the internals of SWeML Systems, it is necessary to abstract their processing flows under a common framework. Such a framework would not only enable comparability of system architectures but would also facilitate the common understanding of these systems, as it provides a unified and intuitive way of describing, documenting, and visualizing.
In this survey, we introduce a visual framework that focuses on depicting the flow of information through SWeML Systems. To this end, a set of basic elements is defined, which can be combined into reusable design patterns. Our work builds on top of the boxology for NeSy introduced by Van Harmelen and ten Teije [34].
Fig. 7. Visual (top) and flat (bottom) notation of system patterns according to the SWeML Systems boxology. The patterns (from left to right) correspond to the processing workflows from PUB5, PUB1, and PUB2.
This boxology proposes two base elements: algorithmic modules (i.e., objects that perform some computation), which can be of type inductive (ML) or deductive (KR), and data structures, which are the input and output of such modules and can be of symbolic (sym) nature (e.g., semantic entities or relations) or non-symbolic (data) nature (e.g., text, images, or embeddings) (Figure 7).
Based on our definition, each pattern that describes a SWeML System needs to incorporate at least one ML module and one sym (= Semantic Web resource) data structure. Although deductive KR modules are not necessary for a SWeML System, their presence and therefore the documentation of their participation in a processing flow is of great interest. Each of the algorithmic modules ingests some input and produces some output, meaning that each must have at least one incoming and one outgoing data structure; the chaining of algorithmic modules without intermediate data structures is not permitted. However, an algorithmic module can consist of a combination of model parts (e.g., a Transformer model and an attached classification layer can be represented using one ML module).
The processing pattern for a specific system can most intuitively be represented visually, as shown in Figure 8; however, to be able to also efficiently refer to patterns in writing, we further introduce a flat notation. Here, we use the symbols M, K, s, and d for ML, KR, sym, and data, respectively. Furthermore, the flow arrows are simplified into dashes, whereas parallel sub-flows are separated by slashes and enclosed by curly brackets. After the closing curly bracket, the parallel processes fuse. We would like to point out that this notation has limitations, as highly complex patterns (e.g., those including loops) cannot be represented; however, for most of the patterns, we found this notation useful for easily comprehensible textual reference.
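To make the notation concrete, the following sketch (our own illustration, not tooling from this study) represents flows as nested sequences and serializes them to the flat notation:

```python
# Sketch: represent boxology flows and serialize them to flat notation.
# Modules are 'M' (ML) or 'K' (KR); data structures are 's' (sym) or
# 'd' (data). A tuple element denotes parallel sub-flows that fuse
# after the closing curly bracket.

def to_flat(flow):
    """Serialize a flow given as a list of elements; tuples become
    parallel sub-flows enclosed in curly brackets."""
    parts = []
    for element in flow:
        if isinstance(element, tuple):  # parallel sub-flows: {a/b/...}
            parts.append("{" + "/".join(to_flat(sub) for sub in element) + "}")
        else:
            parts.append(element)
    return "-".join(parts)

# The patterns of Figure 7 in increasing complexity (PUB5, PUB1, and a
# Y-pattern as discussed in Section 5.2.2):
print(to_flat(["d", "M", "s"]))                                  # d-M-s
print(to_flat([(["s", "M", "d"], ["d"]), "M", "s"]))             # {s-M-d/d}-M-s
print(to_flat([(["s", "M", "d"], ["d", "M", "d"]), "M", "s"]))   # {s-M-d/d-M-d}-M-s
```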
Figure 7 provides an overview and some examples of processing patterns. The first depicted pattern shows a low complexity and corresponds to the processing flow in PUB5, where the authors use a textual corpus (data) on which they applied a distributional word embedding model (ML) to induce a taxonomy (sym). The second pattern is more complex, depicting the creation of graph embeddings (s-M-d) that are, together with image data (d), fed into a CNN model (M) to create image classifications (s) from PUB1. The third illustrated, highly complex pattern is derived from PUB2, where the authors build a visual question answering system: a language model creates embeddings from the question and a CNN model from the image (both d-M-d). Both kinds of embedding are used as input to a complex neural network to produce the answer (upper part of the diagram). Furthermore, the CNN is deployed to provide attribute and object labels s that are—together with the input question—used to query ConceptNet and retrieve concept descriptions ({s/d/s}-K-d) that are further fed to a language model to produce embeddings that serve as additional input to the complex neural network.
Although in the original boxology the introduced design patterns are based on the task of their underlying system, in this work we aim to separate these concerns. The purpose of a design pattern is to show the structural characteristics of a system only; therefore, it focuses on the input consumed and output produced by the different modules, as well as the connection between these modules, to aggregate common paths rather than perfectly depicting a single system. Although
Fig. 8. For each pattern type, the most popular patterns are shown in boxology and flat notation.
Fig. 9. Overall distribution of patterns and pattern types.
this abstraction naturally leads to the loss of some details, it facilitates the understandability of the
patterns, as it reduces complexity, while providing an overview summarizing the most important
processing information, and enables their aggregation into pattern types.
5.2.2 Pattern Types. We started off with the existing 11 patterns presented in the original paper.9 During the annotation process, the annotators were tasked to reuse already discovered patterns; however, if none of the existing patterns captured the processing flow of a given paper, the annotator introduced a new pattern. This resulted in 33 new patterns, summing up to a total of 41 different processing flow patterns that we include in this analysis.10 To allow a conclusive analysis, we further classified these patterns into pattern types. As the boxology itself focuses on the architectural properties of the processing flows, we based our classification schema on the structural characteristics of the pattern shape as a feature of its complexity, resulting in the following six types (see examples in Figure 8, and the classification sketch after this list):
Atomic Pattern: A single algorithmic module consumes a single input.
Fusion Pattern: A single algorithmic module consumes more than one input.
I-Pattern: A chain of Atomic Patterns.
T-Pattern: A chain of Atomic and Fusion Patterns (usually, an I-Pattern with one Fusion
Pattern).
Y-Pattern: Combination of two (or more) Atomic Patterns via a Fusion Pattern.
Other Pattern: Patterns that do not fall in any of the previous types. These patterns are typically quite complex; however, a further reduction would lead to loss of essential insights into the processing workflow. An example of this is the third pattern in Figure 7.
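The following sketch classifies flat-notation strings into these types; the structural checks are our simplified reading of the definitions above, and flows too complex for the flat notation (e.g., those with loops) would fall under Other Patterns.

```python
# Sketch: classify flat-notation patterns into the pattern types above.
# This is our simplified reading of the definitions; flows that the flat
# notation cannot express (e.g., loops) belong to "Other Pattern".
def classify(pattern):
    if "{" in pattern:
        inner = pattern[pattern.index("{") + 1:pattern.index("}")]
        # Count parallel branches that contain an algorithmic module.
        branched = sum(1 for b in inner.split("/") if "M" in b or "K" in b)
        return "Y-Pattern" if branched >= 2 else "T-Pattern"
    modules = [x for x in pattern.split("-") if x in ("M", "K")]
    fused = "/" in pattern              # multi-input step, e.g., d/s-M-d
    if len(modules) == 1:
        return "Fusion Pattern" if fused else "Atomic Pattern"
    return "T-Pattern" if fused else "I-Pattern"

for p in ["s-M-s", "d/s-M-s", "s-M-d-M-s", "{s-M-d/d-M-d}-M-s"]:
    print(p, "->", classify(p))  # Atomic, Fusion, I-, and Y-Pattern
```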
As can be seen in Figure 9, the majority of systems (more than 63%) exploit rather simple design patterns (i.e., Atomic and Fusion Patterns). The most often used pattern is s-M-s (A1), which is, for example, the classical pattern for link prediction in a KG based on cosine similarity of graph embeddings; however, rule-learning systems such as presented in PUB2 also use this pattern. Within
9 The original paper introduced 15 patterns; however, patterns (1) and (2) were excluded because they do not represent a SWeML System, (10) forms a duplication of (6), and (14) does not follow our definition of processing flow.
10 The original patterns (8), (13), and (15) were not assigned to any analyzed system and therefore not considered in the analysis.
Fig. 10. Mean number of KR and ML modules in analyzed systems.
Fig. 11. Overall input types consumed and output types produced by analyzed systems.
Fig. 12. Intermediate representation type of systems with at least two algorithmic modules.
the fusion patterns, d/s-M-d (F1) and d/s-M-s (F2) are equally important, showing the flexibility of a system that consumes both sym and data input.
Among the more complex patterns, T-Patterns are the most prominent, without a clear trend for a specific pattern. A different picture is drawn for I- and Y-Patterns: in I-Patterns, s-M-d-M-s (I1) is most prominent (using sym input to produce sym output with an intermediate data representation), whereas for Y-Patterns, {s-M-d/d-M-d}-M-s (Y1) and {s-M-d/d-M-d}-M-d (Y2) represent the vast majority. These Y-Patterns include (but are not limited to) creating embeddings for both sym and data input, and combining these embeddings in a further inference step. A concrete example of such a processing flow can be seen in PUB6, where a system for the extraction of adverse drug events and related information (e.g., drugs, their attributes, and reason for administration) from unstructured medical documents is introduced: a linguistic model generates semantic word embeddings from the textual input while, at the same time, a graph embedding model calculates embeddings from an accompanying medical knowledge structure for entities identified in the text. Both these embedding types are then fed into a neural network to create the predictions.
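A minimal sketch of this final fusion step is shown below (PyTorch); the dimensions and the plain concatenate-then-feed-forward design are illustrative assumptions, not the actual architecture of PUB6.

```python
# Sketch of a Y-pattern fusion step: word embeddings (data branch) and
# KG entity embeddings (sym branch) are concatenated and fed to a small
# feed-forward network. Dimensions and layers are illustrative only.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, word_dim=300, kg_dim=100, hidden=128, n_classes=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(word_dim + kg_dim, hidden),  # fuse both branches
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, word_emb, kg_emb):
        return self.net(torch.cat([word_emb, kg_emb], dim=-1))

model = FusionClassifier()
word_emb = torch.randn(8, 300)    # batch of token/span embeddings
kg_emb = torch.randn(8, 100)      # matching KG entity embeddings
logits = model(word_emb, kg_emb)  # shape: (8, 5)
```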
Over time, systems tend to grow larger in terms of the number of modules per pattern: in Figure 10, we observe that the number of ML modules grows continuously starting from 2014, whereas the usage of KR modules remains at a low frequency over the years.
5.2.3 Pattern Abstraction and Meta-Flow Analysis. By abstracting the patterns to the highest degree possible, an input-output-centric view can be achieved (Figure 11). From this representation, we can see that in 92.4% of the cases, sym input is consumed by the SWeML System (where it is used as the sole input type in 35.5%), whereas data is taken as input in 64.5% of the systems (7.6% use data as sole input). In fact, most of the systems (56.9%) use both sym and data as input.
However, about one-third of the papers produce only data output compared to 65.1% producing only sym output (1.9% produce both). Therefore, although consuming SW resources seems to be quite essential to a SWeML System, the creation of new or extension of existing SW resources is targeted in two-thirds of the cases.
To perform a deeper analysis of how inputs and outputs are connected, and which intermediate data structures are used (Figure 12), the paths from all patterns used in the analyzed papers were aggregated into a meta processing flow (Figure 13), which shows the popularity of different paths. It can be seen that sym and data are most often fused in the very first step via an ML module. This processing path is taken by the very frequently used patterns F1 and F2, but also by some T-Patterns such as T2. By comparing the incoming and outgoing arrows of the second ML and KR modules, we can further see that also for combinations of sym and data in later steps, ML modules are used more frequently than KR modules (e.g., patterns T1, T3, T4, T6, and Y1-Y2 use the path via the second ML module, T5 via the second KR module).
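Conceptually, this aggregation amounts to counting, for every edge (input, module, output), how many systems traverse it, as in the following sketch with made-up counts:

```python
# Sketch: aggregate per-system flows into a meta flow by counting how many
# systems traverse each step (data structure -> module -> data structure).
# The example counts are made up for illustration.
from collections import Counter

def edges(flow):
    # flow: alternating data structures and modules, e.g., s-M-d-M-s.
    steps = flow.split("-")
    return [tuple(steps[i:i + 3]) for i in range(0, len(steps) - 2, 2)]

meta = Counter()
observed = {"s-M-s": 120, "d/s-M-s": 45, "s-M-d-M-s": 30}  # made-up counts
for flow, n_systems in observed.items():
    for edge in edges(flow):
        meta[edge] += n_systems

for edge, count in meta.most_common():
    print(" -> ".join(edge), count)
```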
Fig. 13. Meta processing flow of interaction patterns. The most common patterns are fused into a meta pattern to illustrate the most popular paths. The thickness and color of the connections show the number of systems taking this path. A circle represents a fusion.
The output of type sym is in most cases either produced in one processing step (e.g., s-M-s) or via an intermediate data representation (e.g., s-M-d-M-s). This conclusion is also applicable to data output, although with an overall lower frequency. Generally, data enjoys significantly higher popularity as an intermediate representation type compared to sym (patterns T1-T4, T6, I1-I2, and Y1-Y2 exclusively use the path via the intermediate data representation, but only T5 via intermediate sym). The trend can also be observed in the temporal analysis in Figure 12, showing a steep increase of publications with intermediate representations starting from 2016. In those papers, data is used more frequently as intermediate representation compared to sym. This development could be explained by the increased incorporation of embedding and representation learning methods as pre-processors.
5.2.4 Conclusions: RQ2. The usage of the boxology framework allowed us to gain deeper insights into the processing patterns of SWeML Systems. Overall, we discovered 41 different patterns, where simple patterns that only incorporate one ML module are more often used than more complex ones; however, we observed that the number of modules used in SWeML Systems is growing over time.
In terms of input-output analysis, we observe that sym data structures are almost always used as input, and often as output of SWeML Systems, whereas data is often used as intermediate representation. Combining both data and sym as input is quite popular, especially through an ML module.
5.3 RQ3 Application Areas
SWeML Systems can be characterized in terms of the kind of tasks they aim to solve and the
domains in which they are applied.
5.3.1 Targeted Tasks. From the papers in this survey, we identified several tasks targeted by SWeML Systems, which we grouped into four main task categories: tasks based on Natural Language Processing (NLP), on Graphs, on Image, and Other tasks. Tasks and task categories can be seen in Figure 14. PUB6 and PUB7 target Named Entity Recognition and relation extraction and therefore fall under the NLP-based tasks Annotation and Information Extraction. Another NLP system is the geological text classification approach introduced in PUB4, which solves a Text Analysis task. The Creation task of automatic taxonomy induction from PUB5 and the Extension
Fig. 14. Distribution of papers per targeted task. The dashed line represents the expected distribution E(x_T) = (N_T / N) · x_t for task T, where N is the total number of papers published, N_T is the total number of papers published that target task T, and x_t is the total number of publications in year t.
task of identifying missing links from PUB3 are examples for the Graph-based task category.
Finally, PUB1 and PUB2 target Image tasks with their artwork analysis model and visual question
answering system, respectively.
Distribution of Tasks. Figure 14 shows the distribution of systems in terms of their tasks. The task
categories NLP and Graph are the most frequent, covering 40% and 38.9% of systems, respectively.
In contrast, only 5.7% of the SWeML Systems are concerned with the least common category of
Image-related tasks (i.e., image or video annotation and classification, image segmentation, action
recognition, object detection), whereas 15.4% of the surveyed papers target Other tasks, such as
recommender systems, data augmentation, or association rule learning. When taking a closer look
at the Graph-based tasks, we see that almost 55% focus on Graph Extension. All remaining sub-tasks
each represent 7% to 22% of the total number of Graph-based tasks. In comparison to Graph-based
tasks, the distribution of NLP sub-tasks is not as skewed. In this category, Text Analysis is targeted
most often, in roughly 30% of the papers, followed by Annotation in 21.7% of the publications, and
by QA & conversational and Information Extraction sub-tasks in nearly 16% of the papers.
Tasks over Time. The evolution of targeted tasks over the years is also shown in Figure 14. To facilitate the interpretation when correcting for the overall number of published papers, an expected temporal distribution was added that indicates increased interest (actual publications higher than expected value) and decreased interest (actual publications lower than expected value) in a specific task for a given year.
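As a worked example of this baseline: with N = 476 papers in total, of which roughly 40% (about 190) target NLP tasks, a year with 100 selected publications yields an expected value of (190/476) × 100 ≈ 40 NLP papers; observing, say, 50 actual NLP papers in that year would thus indicate increased interest. (The per-year figures here are illustrative, not extracted values.)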
Fig. 15. Overall distribution of systems across domains.
Fig. 16. Temporal evolution of interest for domains. The total number of systems published in a specific domain is shown.
Although before 2016 we observe an increased interest in Information Retrieval tasks, this seems to slowly decline in more recent years. The current trends of the last 2 years include QA & conversational, Text Analysis, and Graph Alignment tasks. In contrast, solving Image tasks seems to
have become less relevant to the SWeML community recently.
5.3.2 Application Domain. SWeML Systems can either be domain independent (e.g., PUB2 and
PUB5) or used in a specific application domain, such as Natural Sciences (e.g., biology for PUB3 and
health for PUB6), Culture & Education (e.g., art for PUB1), or Geography & Economics (e.g., geology
for PUB4).
Distribution of Application Domains. More than half of the surveyed SWeML Systems are General-Domain ones (Figure 15). However, some papers target specific application domains. Among the domain-dependent systems, the largest share (26.7%) is from Natural Sciences domains, such as biology, medicine, or chemistry, followed by Culture & Education (including education, academia, digital humanities, art) and Geography & Economics. The least popular application domains for SWeML Systems appear to be Production of Goods (e.g., manufacturing, transportation, and logistics) and Administration & Politics, represented in only 1.5% and 1.2% of the surveyed papers, respectively. Later, Figure 28 shows an overview of concrete SW resources specific to these domains.
Application Domains over Time. Figure 16 shows the evolution of application domains over time. On the one hand, the distribution of papers applied to domains such as Administration & Politics, Production of Goods, or News & Social Media has remained relatively constant over the years. On the other hand, the number of General Domain papers, as well as of those targeting Natural Sciences, Culture & Education, and Geography & Economics, showed a steep increase from 2016. The number of General Domain papers has not only increased significantly in the past 5 years (by 466% between 2015 and 2019) but has kept growing, although at a slower pace, throughout the last 2 years. In contrast, the number of domain-dependent papers from Natural Sciences and Culture & Education started to decrease in 2019, by approximately 12% and 50%, respectively. A similar trend can be observed in the case of the Geography & Economics domain, which suffered a slow decrease since 2019. However, this observed decrease might be influenced by the incomplete number of papers from 2020 included in the survey.
5.3.3 Correlation of Tasks and Application Domain. The majority of NLP and Image tasks
are applied in General Domain scenarios, as illustrated in Figure 17. Among domain-dependent
Fig. 17. Normalized distribution of domains per targeted tasks. In the bars, the absolute numbers are provided.
Table 3. Distribution of Input Types Consumed by Systems for Domains with at Least 10 Publications
Domain | Data | Data & Sym | Sym
General Domain | 3% | 50% | 46%
Natural Science | 5% | 68% | 27%
Culture & Education | 22.5% | 55% | 22.5%
News & Soc. Media | 14% | 79% | 7%
Geo. & Economics | 21% | 56% | 23%
Software & Tech | 8% | 67% | 25%
All systems | 8% | 57% | 35%
Fig. 18. Task categories solved by systems per pattern type.
applications, NLP and Image tasks are most often encountered in the field of Natural Sciences. One exception is QA & conversational tasks, which are applied in the Culture & Education domain more often than in the Natural Sciences. Another interesting observation is that both Text Analysis and Information Extraction are quite popular in the Geography & Economics field. In comparison, the distribution of Graph tasks shows much more variation: whereas Graph Extension and Graph Alignment are mostly applied in General Domain scenarios, Graph Creation and Other Graph tasks are heavily applied in specific domains. More specifically, Graph Creation was most widely utilized in the field of Culture & Education, followed by the General Domain, Natural Sciences, and Geography & Economics. Similarly, Other Graph tasks, such as node clustering or graph pattern mining, are also mainly domain dependent and appear most often in the Natural Sciences domain. Furthermore, Other tasks are most often applied either in General Domain scenarios or in the Natural Sciences domain. The field of Software & Technology generally represents a rarely targeted application domain for all tasks and is not even encountered in combination with Text Analysis, Graph Alignment, or Graph Creation, whereas 10% of Image- or Video-based tasks are applied to this domain.
5.3.4 Correlations of Application Areas and Patterns. In the analysis of input types per targeted task, we identify that for Graph tasks, sym is used as sole input in 58% of the cases. Conversely, for NLP (77%) and Image (90%) tasks, the majority of systems take both sym and data as inputs. This observation is also reflected in the analysis by pattern types (Figure 18): Fusion and Y-Patterns are often used for NLP tasks, whereas Atomic and I-Patterns are more common for Graph-related tasks. Image tasks are mostly solved with T-, Y-, and Fusion Patterns combining image/video and sym inputs in the first step.
When analyzing the input types per application domain, we can see different distributions for general-domain and domain-specific areas: Table 3 shows that domain-independent systems disproportionately often consume only sym, whereas domain-bound systems tend to incorporate data more often than average, either as sole input (Culture & Education and Geography & Economics) or in combination with sym (Natural Sciences, News & Social Media, and Software & Tech). A particular outlier is News & Social Media, where the usage of only sym is 30 percentage points less likely than on average.
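To make this kind of domain-by-input-type breakdown concrete, the following minimal Python sketch shows how a table like Table 3 could be derived from per-system survey annotations. The data frame contents and column names are illustrative stand-ins, not the study's actual extraction sheet.

# A minimal sketch (hypothetical annotation fields) of deriving a
# normalized domain-by-input-type table such as Table 3.
import pandas as pd

# Each row represents one surveyed system.
systems = pd.DataFrame({
    "domain": ["General Domain", "Natural Science", "General Domain",
               "News & Soc. Media", "Natural Science", "General Domain"],
    "input_type": ["Sym", "Data & Sym", "Data & Sym",
                   "Data & Sym", "Sym", "Sym"],
})

# Cross-tabulate and normalize within each domain so rows sum to 100%.
shares = (pd.crosstab(systems["domain"], systems["input_type"],
                      normalize="index") * 100).round(1)
print(shares)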
5.3.5 Conclusions: RQ3. Overall, we observed that SWeML Systems are used in a wide range of domains, as well as for solving a variety of tasks, which contributes to their increasing importance for authors of numerous scientific disciplines. Furthermore, SWeML Systems are versatile, offering solutions both in the general domain and in specific application domains, particularly in those that are data intensive, such as Natural Sciences or Culture & Education. Last, in terms of addressed tasks, NLP- and Graph-based tasks are the most frequent ones, the latter being facilitated by the growing interest in knowledge graphs in recent years. We further observed that different Targeted Tasks and Application Domains have preferences for certain pattern types, especially with respect to their input types. This insight is useful from a systems engineering perspective since it helps the engineer select an appropriate processing pattern given a task, a domain, or an SW resource.
5.4 RQ4 Characteristics of the ML Module
Each ML module of a SWeML System can consist of multiple parts: not only can it be a combination of different ML models or categories, but this combination can also introduce further ML components. In the ML module of PUB4, for example, a word2vec and a Bi-LSTM model are applied, which are further extended by an attention mechanism component. Investigating these module-based characteristics helps to map architectural preferences in the field. Additionally, the systems can be analyzed based on their overall training type, providing possible insight into the effort needed to develop such systems.
5.4.1 Model Categories. Due to the great variety of ML models used in the analyzed systems, an abstraction into ML categories was necessary to enable meaningful analysis. ML categories summarize related families of ML models—for example, the modular co-attention network introduced in PUB2 goes together with BERT-based models (among others) in the category Transformer, whereas word2vec, fasttext, and RDF2vec are all part of the category Plain Encoder. As it was not possible to construct a global taxonomy for these ML categories,11 we only introduce the shallow separation of (1) Classical ML (i.e., not neural network based) and (2) Deep Learning (DL) (i.e., neural network based). As the emerging field of neural networks that can directly operate on and exploit structural information of graph data is of special interest when analyzing SWeML Systems, we introduce a third super-category for (3) Graph Deep Learning (Graph DL). Examples of Classical ML are the association rule mining approach introduced in PUB3 or the SVM applied in PUB7, whereas models such as word2vec (PUB5-7), CNNs (PUB1), LSTMs (PUB2, 4, and 6), and approaches such as LINE (PUB7) and node2vec (PUB1) are classified as DL. The raw data of ML categories can be found in Appendix A.3.
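As an illustration of the Plain Encoder category, the following minimal gensim sketch trains a word2vec model on a toy corpus in a self-supervised fashion. The corpus and hyperparameters are illustrative only and do not reproduce any surveyed system.

# A minimal sketch of a "Plain Encoder": word2vec trained on unlabeled
# text, producing embeddings that later modules can consume.
from gensim.models import Word2Vec

corpus = [
    ["semantic", "web", "meets", "machine", "learning"],
    ["knowledge", "graphs", "support", "machine", "learning"],
]
model = Word2Vec(sentences=corpus, vector_size=50, window=2,
                 min_count=1, epochs=10)
print(model.wv["machine"].shape)  # (50,)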
Distribution of ML Categories. Figure 19 summarizes the temporal evolution of ML categories, sorted by their usage frequency. The overall frequency is also shown later in Figure 21. The three most frequently used ML categories are Encoders, Plain Feed Forward Neural Networks (FFNNs), and Translational Distance Models, where the large occurrence of Encoders is mainly due to the heavy usage of the word2vec algorithm in the surveyed publications. The top five ML categories are all DL, among which there is only one dedicated Graph DL category.
Since the number of publications varies significantly over time, we add the expected distribution of ML categories if the categories had the same share of publications each year. This helps identify usage trends while correcting for overall growing publication numbers. With the exception of Matrix Factorization and Recurrent GNNs, the number of systems incorporating DL models has been rapidly growing in recent years, showing an increasing interest by the research
11 In fact, we found that there is no overarching taxonomy of ML algorithms, as existing classifications are either too coarse-grained regarding modern DL approaches (e.g., [21]) or only focus on (sub-areas of) DL (e.g., [27], [40]). Since constructing one ourselves would be biased toward the papers selected in this survey, we decided to create appropriate categories for this survey for which we do not claim universal completeness.
Fig. 19. Temporal evolution of ML categories that appear in at least 10 papers. Plots are sorted by their aggregated usage frequency. The dashed line represents the expected distribution E(x_C) = (N_C / N) · x for ML category C, where N is the total number of papers published, N_C is the total number of papers published that contain the ML category C, x is the number of papers published per year, and x_C represents the publications per year for ML category C.
Fig. 20. ML categories per domain, for those domains that characterize at least 10 papers. In parentheses,
the number of models (not systems) deployed for the specific domain is shown.
community. In contrast, publications of Rule Learning approaches, which were the most popular
type of ML category before 2014, stay at a constant rate.
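The expected-distribution baseline from the caption of Figure 19 can be computed directly. The following minimal Python sketch evaluates E(x_C) = (N_C / N) · x using illustrative per-year counts, not the survey's actual numbers.

# A minimal sketch of the uniform-share baseline: the per-year paper
# count a category would have if it held a constant share of all papers.
papers_per_year = {2015: 30, 2016: 45, 2017: 60, 2018: 80, 2019: 100}
category_papers_per_year = {2015: 3, 2016: 5, 2017: 9, 2018: 14, 2019: 20}

N = sum(papers_per_year.values())             # total papers published
N_C = sum(category_papers_per_year.values())  # total papers with category C

expected = {year: (N_C / N) * x for year, x in papers_per_year.items()}
for year in sorted(expected):
    print(year, category_papers_per_year[year], round(expected[year], 1))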
ML Category Usage in Application Domains. The distribution of ML super-categories for each of the six largest application domains is presented in Figure 20. In all domains, DL is most prominent; it is used on average in around two-thirds of the papers. Our analysis shows that Classical ML is most prevalent in Natural Sciences, whereas the relative share of DL models is the greatest in News & Social Media. In the General Domain, models from Graph DL are notably more often applied compared to their usage in specific domains. This could be seen as an indicator that Graph DL is still in the phase of developing and establishing models, which are evaluated on established domain-independent benchmarks (e.g., based on DBpedia, WordNet, or Freebase)—only slowly transitioning toward domain-specific applications.
Fig. 21. Relation between ML categories (those that are used in at least 10 papers) and tasks performed
by the analyzed systems. Both task categories and individual tasks are shown via color coding, and ML
categories are sorted by total frequency.
ML Category Usage per Task. The individual ML categories were analyzed in combination with the Targeted Tasks. Figure 21 shows the total number of occurrences for each category together with the most frequent tasks and task category. Graph and NLP tasks appear in every ML category, typically fairly balanced. ML categories that focus on solving NLP tasks include Plain Encoders, Transformer models, and kNN, whereas Graph DL model categories such as Translational Distance models and Convolutional GNNs, but also Rule Learning models, mainly focus on Graph tasks. Image tasks are addressed particularly with CNN algorithms. Overall, we do not see domination of a task category for a specific ML category; instead, ML categories are generally used across tasks—even presumably task-bound algorithms such as Translational Distance models or Transformer models are broadly used in multiple tasks.
5.4.2 Deep Learning Components. Since DL models may be composed of multiple architectural patterns, we further analyzed which basic building blocks were used in these models. We identified the following components, which we assigned on the ML module level: (1) Feed Forward (FF), (2) Recurrent (Rec), (3) Convolution (Conv), and (4) Attention (Att). FF components can be found, for example, in the word2vec algorithm of PUB4-6; Rec components are present in the LSTM models (including ELMo) in PUB2, PUB4, and PUB6; and PUB1-2 also incorporate Conv components. Attention can be seen as a standard component of a deployed model, such as the modular co-attention network introduced in PUB2, or as an extension to other models, such as in PUB4, where a word2vec and a BiLSTM model are combined with an attention mechanism.
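To illustrate how these components can co-occur in a single module, the following minimal PyTorch sketch combines an embedding lookup, a recurrent BiLSTM (Rec), a simple additive attention layer (Att), and a feed-forward classification head (FF). It is a generic illustration with made-up dimensions, not the PUB4 architecture.

# A minimal sketch combining the Rec, Att, and FF components named above.
import torch
import torch.nn as nn

class BiLSTMAttentionClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, emb_dim=100, hidden=128, n_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # could be initialized from word2vec
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)          # Rec component
        self.att_score = nn.Linear(2 * hidden, 1)        # Att component (additive scoring)
        self.classifier = nn.Linear(2 * hidden, n_classes)  # FF classification head

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))          # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.att_score(h), dim=1)  # attention weights over tokens
        context = (weights * h).sum(dim=1)               # weighted sum of hidden states
        return self.classifier(context)                  # class logits

logits = BiLSTMAttentionClassifier()(torch.randint(0, 10_000, (2, 12)))
print(logits.shape)  # torch.Size([2, 5])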
Figure 22 summarizes the main statistical properties of the DL components of SWeML Systems—that is, their total frequency as well as their combined usage. Given the overall usage, FFs are clearly the main DL component in SWeML Systems, whereas the remaining components (Rec, Conv, and Att) are used equally often. We found that all possible combinations of components occur. The combination of Conv and Rec is quite unpopular; however, both of these components are frequently (and almost equally likely) combined with FF and Att, which reflects the universality of these components—for example, to add a classification layer or to improve prediction quality, respectively. Furthermore, almost 70% of systems that incorporate Att also use FF components, where the 28 identified Transformer-based models play a major role.
An analysis of DL components per task reveals that FFs dominate each task (Figure 23). FF is a broad group of many simple DL architectures, and therefore its broader usage is not surprising.
Fig. 22. Co-occurrence of DL components within a system. In parentheses, the total frequency is provided.
Fig. 23. Tasks solved by systems containing a certain DL component.
The remaining components are relatively evenly spread. An exception is the dominance of Convs
for Image tasks.
5.4.3 Training Type. We identified five distinct training types, which are analyzed independently of the model to provide a higher-level understanding. In the following, we introduce each training type, whereby the total number of papers falling into each category is provided in parentheses (five papers did not provide sufficient information about the training type used):
Supervised ML (194): A supervised model is trained based on labeled data—for example, a fraud detection system that recognizes fraudulent financial transactions based on 10,000 manually labeled transactions.
Self-supervised ML (188): A self-supervised model uses unlabeled data in combination with a training objective—for example, a word2vec algorithm that uses an unlabeled text corpus as input. Given a sentence from the corpus, the algorithm is trained to predict the surrounding words for each word in the sentence.
Semi-supervised ML (24): A semi-supervised model is trained based on a small set of labeled and a large set of unlabeled data. A classifier is typically trained on the labeled data and then used to create further pseudo-labeled data. The entirety of labeled and pseudo-labeled data is used to train the final classifier (see the sketch after this list). This approach is often used, for example, for text classification.
Reinforcement learning (7): Reinforcement learning does not require labeled data but instead requires a feedback loop where the machine making decisions obtains feedback and optimizes its strategy over time. An example would be an algorithm that is tasked to play a game such as Go or chess.
Unsupervised ML (58): In unsupervised learning situations, a model is obtained without any labeled data. An example would be an algorithm that clusters products of a company.
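The pseudo-labeling procedure described for semi-supervised ML can be sketched in a few lines. The following minimal scikit-learn example uses synthetic data; the model choice and the absence of confidence thresholding are simplifications, not the approach of any surveyed paper.

# A minimal sketch of semi-supervised pseudo-labeling on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(50, 10))
y_labeled = rng.integers(0, 2, size=50)
X_unlabeled = rng.normal(size=(500, 10))

# Step 1: train an initial classifier on the small labeled set.
clf = LogisticRegression().fit(X_labeled, y_labeled)

# Step 2: use it to create pseudo-labels for the unlabeled set.
pseudo_labels = clf.predict(X_unlabeled)

# Step 3: retrain the final classifier on labeled + pseudo-labeled data.
X_all = np.vstack([X_labeled, X_unlabeled])
y_all = np.concatenate([y_labeled, pseudo_labels])
final_clf = LogisticRegression().fit(X_all, y_all)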
Note that these training types are applied on the system level rather than on the individual modules or models. For example, if a system—such as PUB6—uses models that are trained in a self-supervised fashion (e.g., word2vec) to train a supervised approach (the BiLSTM-CRF classifier), we consider the entire system supervised. With regard to the overall distribution of the different training types, we found that supervised and self-supervised models account for approximately 40% of the systems each, no supervision is used in approximately 12% of the systems, and models based on semi-supervision or reinforcement are rare. Examples of semi-supervision are the iterative hierarchy induction approach from PUB5 and the surrogate learning approach from PUB7. This finding is intuitive since on the one hand supervised systems are the most widely
Fig. 24. Combination frequency of ML categories (i.e., their co-occurrence within a system). ML categories are grouped by super-categories.
Fig. 25. Position distribution of ML categories in systems with more than one algorithmic module. Shown is the relative frequency of usage within the first or a later module.
adopted and established type of system (e.g., PUB1, PUB2, PUB4, and PUB6), and on the other hand a wide range of models that are most commonly trained in a self-supervised fashion, such as different kinds of embedding models (e.g., for systems targeting Graph Extension tasks) or rule learning approaches (cf. PUB3), are heavily applied in the analyzed systems.
5.4.4 Combination of ML Categories and Patterns. The analysis of ML categories combined in a single SWeML System is shown in Figure 24. It can be seen that, in general, a combination of DL models is more common than a combination of Classical ML models. Plain Encoders (e.g., word2vec) are combined with a wide variety of models, including Classical ML, DL, and Graph DL categories. As depicted in Figure 25, Plain Encoders mostly take the role of a pre-processor, meaning that they are applied in the first processing step of the SWeML System. The same observation can be made for Translational Distance models; however, these graph embedding models are less likely to be combined with Classical ML models.
Plain neural network architectures (e.g., Plain FFNN, Plain RNN, Plain CNN) are more often combined than advanced architectures (e.g., Transformer), where plain neural networks show a slight tendency to be used in a later step, whereas Transformer models are slightly more often used in the first processing step. The trend of combining simpler models is also reflected in the Graph DL area: complex models such as Recurrent GNNs are far less often combined than simpler ones such as Translational Distance models (e.g., TransE).
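For readers unfamiliar with Translational Distance models, the following minimal numpy sketch shows the scoring idea behind TransE: a triple (h, r, t) is considered plausible when the head embedding translated by the relation embedding lands near the tail embedding. The embeddings here are random stand-ins rather than trained vectors.

# A minimal sketch of the TransE scoring function: h + r should be close to t.
import numpy as np

rng = np.random.default_rng(42)
dim = 50
h = rng.normal(size=dim)  # head entity embedding
r = rng.normal(size=dim)  # relation embedding
t = rng.normal(size=dim)  # tail entity embedding

def transe_score(h, r, t, norm=1):
    """Distance between translated head and tail; lower means more plausible."""
    return np.linalg.norm(h + r - t, ord=norm)

print(transe_score(h, r, t))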
Many DL and Graph DL categories show a rather balanced position distribution, except for those categories mainly consisting of simple graph and text embedding methods (see Figure 25). Classical ML categories seem to have a more established position: Genetic Algorithms, for example, are always used in the first step, whereas Decision Trees, SVMs, and Regression models are exclusively used in later steps when combined with other ML models.
5.4.5 Conclusions: RQ4. Our analysis showed that SWeML Systems primarily use supervised and self-supervised models. ML categories are used across applications without a significant exposure of one category to exclusively one task. Throughout all application domains, DL categories are most prevalent (with a share of 60%), followed by Classical ML categories. Graph DL categories occur with low frequency (<20%) in all domains except the General Domain, where they occur significantly more often. The popularity in the General Domain potentially indicates that the models are still being explored and have not yet transitioned into more specific applications.
Fig. 26. Comparison of semantic resource types as assigned by authors (left) and the study team (right).
Fig. 27. Temporal evolution of semantic resource types. Types correspond to those assigned by the study team.
Interestingly, we found that even presumably task-bound ML categories (e.g., Transformers) are used in a variety of tasks (i.e., are not purely dependent on a particular task). Furthermore, DL categories—especially those that include typical and rather simple embedding models—are most likely to be combined with other ML models and used as system pre-processors. Component-wise, we found that the most broadly used DL component in SWeML Systems is FF, which is most often combined with Att, whereas the combination of Conv and Rec is least popular.
5.5 RQ5 Characteristics of the SW Module
SW knowledge structures (aka SW resources) play an important role in SWeML Systems. Of the 476 reviewed papers, a large number (307) make use of already existing (i.e., predefined) semantic resources such as general domain resources (e.g., DBpedia, YAGO) that usually play the role of inputs. Furthermore, some of the reported systems create custom SW resources as their output or as internal, intermediary representations. Since in the reviewed papers the information about semantic resources, such as type, size, and representation formalism, is generally weakly specified, we focus our analysis on the predefined semantic resources, for which we can collect this information from the sources that describe the resources as such (i.e., resource websites, relevant papers). As some papers rely on several SW resources, we identified 516 non-unique (corresponding to 139 unique) predefined resources in the 307 papers that make use of such resources.
5.5.1 Types of Semantic Resources.
Author-Assigned Type. The left part of Figure 26 depicts the types of the 516 predefined SW resources as mentioned by the authors of the analyzed papers. Several papers fail to name the type of the resource; however, others most often mention employing resources of type knowledge graph, followed by ontology, dataset, and knowledge base. Less frequent mentions of semantic resource types are taxonomy, thesaurus, hierarchy, controlled vocabulary, linked dataset, and entity catalog. This indicates that, on the one hand, there is a preference for using rather generic terms (ontology, knowledge graph, dataset) as opposed to more specific terms that describe more specialized types of resources (e.g., controlled vocabulary, taxonomy), and on the other hand, terms novel to the SW community are introduced (e.g., database, entity catalog). Both could be explained by the fact that several of the studies originate from communities other than the SW community, where such terminological knowledge is present to a more limited extent.
Type Assigned by the Study Team. As our study includes papers from several communities, we noticed an inconsistent use of terminology among these communities as well as over time (with the same SW resource being categorized as ontology, linked dataset, or knowledge graph depending on the trending terminology at the time the corresponding paper was written). For example, the term knowledge graph was popularized only in 2012 by Google, although some of the resources that would be called knowledge graphs today were already in use before. Therefore, besides collecting the types of these resources as mentioned by the paper authors, we also evaluated the semantic resources and assigned them a type12 based on the following predefined glossary13:
Thesaurus: A controlled vocabulary connected with relations that express linguistic relations such as equivalency (synonyms), and broader/narrower relations without strict logical semantics such as subsumption. Examples are Agrovoc, WordNet, and ConceptNet.
Taxonomy: A domain model containing terminological (T-Box) information limited to concepts and their subsumption hierarchy.
Ontology: A terminological model richer than a taxonomy, also containing additional named relations and axioms (T-Box).
Dataset: Contains semantic instance data (or metadata), corresponding to an A-Box in logics. A collection of triples describing instances can be considered a dataset.
Knowledge base: Contains both terminological and instance knowledge (T-Box + A-Box).
Linked dataset: A differentiating feature of linked datasets from the semantic structures described earlier is that they contain links (in terms of URI references) to other semantic resources. Additionally, such datasets contain large amounts of instance data, although they may include (lightweight) terminological knowledge as well. The canonical example is DBpedia.
Knowledge graph: Knowledge graph was most recently defined as "a graph of data intended to accumulate and convey knowledge of the real world, whose nodes represent entities of interest and whose edges represent potentially different relations between these entities" [17]. This is a very broad definition that actually subsumes all the semantic resource type definitions presented earlier. However, when a resource could be clearly classified into one of the preceding categories, we did so. This still left a number of resources that represented graph data using a non-semantic formalization (e.g., JSON, XML, proprietary formats), thus making it impossible to classify them along the A-Box/T-Box distinction which underlies the description of the preceding semantic resource types. Such resources were classified as knowledge graphs (a small code sketch of the A-Box/T-Box distinction follows this glossary).
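The T-Box/A-Box distinction underlying this glossary can be operationalized heuristically. The following minimal rdflib sketch is our illustration, not the study team's actual assignment procedure: it counts schema-level versus instance-level triples in a resource. The file name is hypothetical, and the predicate lists are a deliberately coarse heuristic.

# A minimal, heuristic sketch of gauging whether a resource leans toward
# an ontology (T-Box-heavy) or a dataset (A-Box-heavy).
from rdflib import Graph, RDF, RDFS, OWL

g = Graph()
g.parse("resource.ttl")  # hypothetical local copy of an SW resource

tbox_predicates = {RDFS.subClassOf, RDFS.domain, RDFS.range, OWL.equivalentClass}
tbox = sum(1 for _, p, o in g
           if p in tbox_predicates
           or (p == RDF.type and o in (OWL.Class, RDFS.Class)))  # schema statements
abox = sum(1 for _, p, o in g
           if p == RDF.type and o not in (OWL.Class, RDFS.Class))  # instance typing

print(f"T-Box-style triples: {tbox}, A-Box-style triples: {abox}")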
The right side of Figure 26 shows the types (and corresponding numbers) of the 516 predefined resources in terms of the types assigned by the study team. Accordingly, we can confirm that knowledge graphs are most frequently used. However, contrary to the author-based classification, thesauri and linked datasets precede ontologies in terms of the frequency of their use, indicating a focus on instance data as opposed to terminological information.
We can also draw conclusions on some aspects of terminological (mis)use. The terms that are most often used incorrectly are dataset (to describe resources that are ontologies, knowledge graphs, linked datasets, or knowledge bases), knowledge graph (almost half of the uses of this type can be mapped to other resource types), and knowledge base. Interestingly, the term linked dataset is seldom used by paper authors.
12 The assigned type reflects the type of the resource per se, independently of its use by the SWeML System—for example, a semantically rich ontology would be assigned the type ontology even if the system only made use of its taxonomic structure.
13 Although the definition of the semantic resource types reflects the joint understanding of the study's author team, we are aware that, as with all terminologies, variations in definitions exist. Nevertheless, our goal here was to establish a common baseline to define resources that would allow having a consistent view of the analyzed resources and account for terminology variations across communities.
Fig. 28. Concrete SW resources, the frequency of their use, and their domain classification. The raw data
can be found in Appendix A.4.
From a temporal perspective (Figure 27), whereas ontologies appear as the most frequently used resource type until 2014, from 2016 onward there is, besides ontologies, a sharply increased interest in other resource types, particularly thesauri, linked datasets, and knowledge graphs. Knowledge graphs dominate the scene from 2018 onward.
5.5.2 Concrete Semantic Resources. Figure 28 depicts the concrete semantic resources used by the reviewed papers, the frequency of their use, and their domain classification (it covers the 516 non-unique predefined resources). Reconfirming findings related to the application domains of the papers (see Figure 15 in Section 5.3.2), the majority of studies make use of domain-agnostic resources, whereas most domain-specific resources stem from the most popular domains—that is, Natural Sciences, Culture & Education, and Geography & Economics.
Analyzing the concrete resources themselves, we conclude that a large number of different resources are used, and there is a mix between frequently used resources and more specialized/niche resources. Among the domain-agnostic resources, DBpedia, YAGO, Freebase, and WordNet are the most frequently used, either as the resources themselves or as benchmarks based on them. Among the domain-specific resources, the most frequent are those from Natural Sciences, particularly Medicine/Health (with often-used reference resources such as UMLS, MeSH, and ICD) and Biology (with the Gene Ontology being the most frequently used resource). In the Culture & Education domain, resources related to Movies, Music, and Academic Publishing are the most frequent (e.g., the AIFB research ontology being the most frequently used).
5.5.3 Resource Size and Representation Formalism. Figure 29 depicts the size of the used resources (in terms of number of triples) over time, considering the resource size as collected by the reviewing team in 2021 (which, in some cases, might differ from the resource size when it was used by the reporting study). We conclude that although in the first part of the past decade (up to 2014) there is a more or less equal mix of resource sizes, from 2014 onward larger resources (with > 1M triples) are more prevalent and there is a sharp rise in their use. This could correlate with the increased popularity of ML methods that require larger training datasets, the availability of large-scale linked datasets and knowledge graphs, and the increased scalability of processing tools as well as the available compute power. Indeed, several of the frequently used resources, such as DBpedia, Freebase, or YAGO, exceed 1M triples. Additionally, several very large domain-specific resources are employed—for example, UMLS (Health), MusicBrainz (Music), and Microsoft Academic Graph (Academia/Publishing). In terms of the (knowledge) representation formalisms used (Figure 30), whereas until 2014 there is no clear trend, from 2014 onward RDF/RDF-S is most frequently employed, followed by a widespread practice of serializing SW resources in non-semantic formats such as XML, JSON, and a number of non-mainstream/proprietary formats. This is in line with