Received February 22, 2022, accepted March 15, 2022, date of publication April 1, 2022, date of current version May 2, 2022.
Digital Object Identifier 10.1109/ACCESS.2022.3164067
Matching Network of Ontologies: A Random
Walk and Frequent Itemsets Approach
FABIO SANTOS 1,2, AND CARLOS E. MELLO1
1Programa de Pós-Graduação em Informática (PPGI), Universidade Federal do Estado do Rio de Janeiro, Rio de Janeiro 22290-255, Brazil
2School of Informatics, Computing, and Cyber Systems (SICCS), Northern Arizona University, Flagstaff, AZ 86011, USA
Corresponding author: Fabio Santos (fabiomarcos.santos@uniriotec.br)
This work was supported by the Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro (FAPERJ) under
Grant E-26/210.231/2021.
ABSTRACT A System of Systems (SoS) is a complex set of Information Systems (IS) created by the aggregation and interconnection of ISs. An SoS exhibits behavior and functionality, unplanned during its construction, that benefit both its users and the SoS itself. The integration of SoSs is only a matter of time. However, since an SoS can behave as an organization, it may be inappropriate to integrate its individual member ISs separately. On the other hand, manually integrating the SoS as a whole can be unfeasible due to its complexity. If an SoS has ontologies modeling its knowledge, the integration of SoSs can be translated into the problem of aligning networks of ontologies. This, however, creates another challenge: computing every possible pair of entities across the ontologies of each network can have unfeasible execution time, even with the best matchers available. In this article, we propose to mine the data from the networks using random walks and a frequent itemsets algorithm to discover relevant nodes, which are elected as candidate entities. Next, the networks are pruned by an algebraic method that eliminates identical entities. The relevant nodes are then reinserted into the network to avoid losing essential correspondences. After this pre-processing step, the data is sent to two matchers to obtain metrics, and the results are compared with the pairwise brute-force approach and with previous work. We identified relevant nodes with recall of up to 0.75. The results are promising: precision and recall are close to those of the brute-force approach, and execution time is shorter, increasingly so as the size of the networks and the number of ontologies to be compared grow. We validate our approach using ontologies created from the OAEI (Ontology Alignment Evaluation Initiative).
INDEX TERMS System integration, system of systems, ontology matching, network of ontologies.
I. INTRODUCTION
System of Systems (SoS) became a natural evolution of Information Systems (IS). This evolution began with the enhancement of network protocols and the popularization of the Internet, and has continued to the present day, in which IS are unable to operate alone. This evolving process has been driven especially by user demands for new requirements. Many ecosystems, such as e-commerce, m-commerce, and social networks, are interesting examples of SoS that have been evolving together so quickly that no boundaries among them can be perceived. Many important applications of SoS have been proposed in different domains, such as Smart Cities, Integrated Healthcare, Emergency Response, and Crisis Management Systems [11].
The associate editor coordinating the review of this manuscript and
approving it for publication was Francisco J. Garcia-Penalvo.
In this context, integration has become a highly impor-
tant feature of SoS, especially due to requirements such as
built-in authentication and cross-platform data sharing solu-
tions in order to connect different resources from different
applications. Despite these needs for integration, the IS must remain operating with independence and management capacity. On the other hand, the SoS must support the collaboration among the IS by providing interoperability. The SoS works as one entire system, able to address not only the distinct features held by its component IS, but also the new ones that emerge from the resulting behavior provided by such collaboration [29].
For instance, when one company acquires another that
operates in the same business domain, usually both have
similar IS or SoS. Therefore, in order to keep the operations running smoothly and steadily, one should conduct an integration process between these SoS. However, this can be very
challenging due to the multiple possibilities of knowledge and
44638 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ VOLUME 10, 2022
concept representation shared by both SoS. Then, one may
appeal to ontologies or network of ontologies that represent
pieces of knowledge required for the integration of the SoS.
In this way, the integration process can begin with the match-
ing of these networks of ontologies as it may offer a common
ground for the posterior SoS integration.
Accordingly, the goal is to first integrate the pieces of
knowledge throughout the concepts related to each associate
SoS, and later utilize the resulting mapping to conduct the
SoS integration itself. The matching task of ontologies and
network of ontologies essentially consists of finding equiva-
lent concepts, a.k.a. correspondences or matches.
This task of matching networks of ontologies can be handled as an ontology matching problem, which basically consists
of finding correspondences between two or more ontologies.
However, to extend it to networks of ontologies one should
deal with computational issues due to the large number of
possible correspondences to be considered between two or
more networks. Hence, we may address this task as a different
one, in which one should overcome the exhaustive computa-
tions by taking advantage of underlying structural properties
held by the networks of ontologies [10], [39]–[41].
We conducted a case study [47] using the conference domain ontologies of the OAEI project [2]. OAEI
is an initiative to help improve the work on ontology align-
ment/matching, provide communication between developers,
promote conferences, assess the strengths and weaknesses
of alignment/matching systems, compare the performance of
techniques, and improve evaluation techniques. OAEI provides datasets¹ of ontologies with reference alignments to create an environment for comparing research involving ontology matching. Since 2004, the OAEI has evaluated tracks oriented to different ontologies, sizes, types of alignment, and modalities; the most recent edition comprised 13 evaluations. The conference domain track had ten competitors in 2020 and is one of the essential datasets, refined since 2006. We chose
the conference domain dataset for our case study because
of the reliability of the reference alignments and as a way
to compare our research results with many results available
from the OAEI competitions.
A. PROBLEM DEFINITION
This study addresses the characteristics of ontology matching
in networks with enormous numbers of entities by trying to
avoid computing all possible matchings.
Some problems are too complex to be processed exhaustively by computers and, consequently, do not admit an analytical solution. A possible strategy is to approximate such a problem by randomly sampling its possible results. The intuition
behind this research comes from the idea that the ontology
structure contains information about the relevance of entities.
The relative importance of a concept in one ontology might be
the same as its counterpart in the other ontology. Therefore,
¹ http://oaei.ontologymatching.org/2021/
a concept more central in one ontology tends to also appear
more centrally in the other ontology. On the other hand,
a more peripheral concept in one ontology also tends to be
more peripheral in another ontology. Also, the corresponding
concepts are structurally represented in two ontologies in a
similar way, and therefore will be present with similar rela-
tive frequencies within samples generated from a sampling
process that respects the distribution of the structure between
the concepts. As the research aims to avoid computing all
similarity metrics between each pair of concepts (pairwise
approach), we can sample the most relevant concepts by
randomly selecting a subset of the ontology.
The method's goal is to reduce the comparison effort without pruning relevant entities, which would reduce accuracy.
Our proposed method does not compete with the large-scale
matching strategies, which are orthogonal to it. While large-scale techniques try to identify similar modules to restrict the comparison scope using a heuristic, our approach combines the identification of regions containing identical entities, which are pruned to avoid comparing them, with a stochastic search method that keeps the relevant concepts. They are complementary strategies: we can use our method to prune the identical entities in both networks, reinsert the relevant ones, and finally send the networks to a large-scale matcher.
The research questions we want to answer are:
RQ1: To what extent can the method identify the relevant entities? To answer RQ1, we employed a random walk and frequent itemsets approach to identify those relevant entities. We also explored the influence of configurations (i.e., quartile limits to keep a visited path and a semantic threshold to keep a mapping suggestion). Overall, we found that pre-processing the ontology can discover relevant nodes with an average recall of 75%.
RQ2: To what extent do the relevant entities improve the matching of the network of ontologies? To answer RQ2, we ran experiments matching networks and comparing the results with the study from [41]. We also compared them with two classic matchers. The metrics obtained are compatible with the classic matchers, and processing time is reduced in large networks. The results also showed that the approach still cannot compare complex structures with many sources of reference alignments.
We evaluated the proposed approach in a preliminary
experiment using an OAEI dataset.
The contribution of this study is a method able to match extensive networks of ontologies with better execution time than the pairwise approach and similar metrics. We also created open-source code, available to researchers addressing large-scale and network matching.
II. BACKGROUND
A. ONTOLOGY
Ontology is an explicit specification of a conceptualiza-
tion [12] and a set of representational primitives with which
to model a domain of knowledge or discourse [13]. The
Computer Science community realized the importance of ontologies for constructing a foundation on which Information Systems may be developed with stable meaning. Once a domain or task has an ontology to refer to, all practitioners may precisely understand the kinds of entities involved. Guarino and Giaretta called this ‘‘ontological engineering’’ [14].
Once the value of ontologies was understood, they were built for many domains. This myriad of ontologies led to multiple definitions of the same concepts, which caused problems for resolving queries and difficulties for integrating systems in which ontologies support the knowledge model in the background. Thus, to integrate IS, ontologies need to be aligned, enabling a translation of concepts. Aligning concepts or entities becomes crucial in situations such as systems integration within or outside a company, as in electronic commerce, when a company needs to map concepts between its systems and the systems belonging to partners, suppliers, and customers [26].
B. ONTOLOGY MATCHING
The process of aligning ontologies, finding and describing their matches, is called ontology matching [44]. Thus, an alignment is a set of correspondences that express the translation semantics between concepts belonging to different ontologies. Given two ontologies, a correspondence is a 5-tuple ⟨id, e1, e2, r, c⟩, where id is a unique identifier for the correspondence; e1 and e2 are classes or properties (entities in general); r is the relationship, with r ∈ {=, ⊑, ⊒, ⊥}, denoting equivalence, subsumption (less and more general), and disjointness; and c is a confidence value in the interval from 0 to 1 [44]. The relationship implies that alignments are subject to structural constraints, which must be obeyed to maintain consistency [25].
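Such a correspondence can be sketched as a small data structure; the Python below is an illustrative sketch (class name, relation symbols, and entity URIs are our own shorthand, not from the paper's implementation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Correspondence:
    """One correspondence <id, e1, e2, r, c> of an alignment (sketch)."""
    id: str    # unique identifier of the correspondence
    e1: str    # entity (class or property) from the first ontology
    e2: str    # entity from the second ontology
    r: str     # relation: "=", "<=" (less general), ">=" (more general), "_|_" (disjoint)
    c: float   # confidence value in the interval [0, 1]

    def __post_init__(self):
        assert self.r in {"=", "<=", ">=", "_|_"}
        assert 0.0 <= self.c <= 1.0

# An alignment is a set of correspondences:
alignment = {
    Correspondence("m1", "o1#Paper", "o2#Article", "=", 0.92),
    Correspondence("m2", "o1#Author", "o2#Person", "<=", 0.70),
}
```

Making the class frozen keeps correspondences hashable, so an alignment can be represented directly as a set, which matches the definition above.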
C. PAIRWISE MATCHING
Given a set of ontologies, if the pairwise matching is applied
to all ontologies from this set, it will sequentially compute the
alignment of each pair of ontologies from the set. For exam-
ple, given a set of ontologies 0=< , 3 >, in which =
{O1,O2,O3}of Figure 1, the pairwise matching of all ontolo-
gies in the set is obtained by computing (((O1×O2)(O1×
O3))(O2×O3)). Thus, the pairwise matching approach com-
putes one pair each time. The final result may have duplicate
alignments.
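The pairwise scheme can be sketched as below; the toy matcher, which simply pairs entities sharing a local name, is an assumption for illustration, not one of the matchers used in the paper:

```python
from itertools import combinations

def match_pair(o1, o2):
    # Toy matcher (assumption): match entities with identical local names.
    return {(e1, e2) for e1 in o1 for e2 in o2
            if e1.split("#")[-1] == e2.split("#")[-1]}

def pairwise_matching(ontologies):
    # Compute each pair sequentially: (O1 x O2), (O1 x O3), (O2 x O3), ...
    return [match_pair(o1, o2) for o1, o2 in combinations(ontologies, 2)]

O1 = {"o1#Paper", "o1#Author"}
O2 = {"o2#Paper", "o2#Review"}
O3 = {"o3#Paper", "o3#Author"}
alignments = pairwise_matching([O1, O2, O3])
print(len(alignments))  # 3 pair results for 3 ontologies; duplicates are not removed
```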
D. HOLISTIC MATCHING
Given a set of ontologies, if holistic matching is applied to all ontologies in this set, it will compute the alignment of all ontologies from the set at once. For example, given a network of ontologies Γ = ⟨Ω, Λ⟩, in which Ω = {O1, O2, O3} as in Figure 1, the holistic matching is obtained by computing (O1 × O2 ∪ O1 × O3 ∪ O2 × O3). Thus, the holistic matching approach computes all matchings between all ontologies inside the set and merges the results, deleting duplicate alignments.
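A holistic sketch differs from the pairwise one only in merging all pair results at once into a single set, so duplicate correspondences collapse; the toy matcher is again an assumption for illustration:

```python
from itertools import combinations

def match_pair(o1, o2):
    # Toy matcher (assumption): match entities with identical local names.
    return {(e1, e2) for e1 in o1 for e2 in o2
            if e1.split("#")[-1] == e2.split("#")[-1]}

def holistic_matching(ontologies):
    # Compute O1xO2 U O1xO3 U O2xO3 and merge, deleting duplicate alignments.
    merged = set()
    for o1, o2 in combinations(ontologies, 2):
        merged |= match_pair(o1, o2)
    return merged

# O2 and O3 share the entity "shared#Paper", so the correspondence
# ("o1#Paper", "shared#Paper") arises in two pair computations
# but is kept only once in the merged result:
O1 = {"o1#Paper"}
O2 = {"shared#Paper", "o2#Review"}
O3 = {"shared#Paper", "o3#Author"}
merged = holistic_matching([O1, O2, O3])
print(len(merged))  # 2
```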
E. NETWORK MATCHING
A network of ontologies is a set of two or more aligned ontologies. We define a network of ontologies Γ = ⟨Ω, Λ⟩ as a finite set Ω of ontologies and a finite set Λ of alignments between these ontologies, where Λ(O, O′) represents the set of alignments between O and O′ [10]. Given a set of two or more networks of ontologies Ψ = {Γ1, Γ2, . . . , Γn}, the network matching problem searches for a final network of ontologies Γf resulting from the alignments of the networks in Ψ. For instance, Figure 1 depicts two networks of ontologies, Γ and Γ′, each one with 3 ontologies (Γ = {o1, o2, o3} and Γ′ = {o1′, o2′, o3′}), describing two Systems-of-Systems. The goal is to match these two networks, finding all the alignments between them. The set of alignments between the networks (or inter-network alignments) is: Λ = {A2,2′, A3,2′, A3,1′}.
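A minimal data structure for Γ = ⟨Ω, Λ⟩ might look as follows (field and method names are illustrative, not from the paper's code):

```python
from dataclasses import dataclass, field

@dataclass
class OntologyNetwork:
    """A network of ontologies: a set of ontologies plus their alignments."""
    ontologies: dict                                 # Omega: name -> set of entities
    alignments: dict = field(default_factory=dict)   # Lambda: (name, name) -> correspondences

    def alignment(self, o, o_prime):
        # Lambda(O, O'): the set of alignments between O and O'.
        return self.alignments.get((o, o_prime), set())

net = OntologyNetwork(ontologies={"o1": {"a1", "b1"}, "o2": {"a2"}, "o3": {"a3", "b3"}})
net.alignments[("o1", "o2")] = {("a1", "a2")}
print(net.alignment("o1", "o2"))  # {('a1', 'a2')}
print(net.alignment("o2", "o3"))  # set(): no alignment recorded for this pair
```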
F. LARGE-SCALE ONTOLOGY MATCHING
Large-scale matching and efficient matching techniques were
defined as challenges by [44]. In this context, the alignment
of networks can be considered a variation of large-scale
matching with efficiency. In general, these challenges seek to
improve the alignments of large ontologies while maintaining
good metrics and scalable computational complexity. In pairwise and holistic approaches, each entity of the source ontology must be compared with all possible entities of the destination ontology, which usually makes them not scalable when aligning large ontologies. The main issues faced are high memory consumption, increasing complexity of the alignment process, and increasing time taken to achieve alignment [34].
The techniques to address both efficiency and scalability can be summarized as reducing the search space and parallelizing the matching [37].
To reduce the ontology search space, ontology partitioning
techniques can be used. Partitioning aims to decrease pro-
cessing time while allowing space complexity to be reduced.
In a pairwise approach, the search space is O(n²), considering n the number of entities to be compared. Partitioning aims to create regions that limit the scope of comparisons. Thus, partitioning can reduce this space down to O(n²/k), where k is the number of partitions created. Partitioning also influences the processing time. Assuming that the computation time of the similarity metrics for each entity pair is t, a naive pairwise approach would use O(n²) · t time units to get all the metrics. By partitioning the ontologies, we can reduce this time to O(n²/k) · t. However, partitioning introduces a risk: if the partitioning algorithm creates poor partitions, it will impact the final metrics (precision, recall, and F-measure) and limit the comparison of entities between similar partitions.
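The reduction can be checked numerically; a sketch assuming k equally sized partitions that evenly divide n:

```python
def naive_comparisons(n):
    # Pairwise brute force: every source entity against every target entity.
    return n * n

def partitioned_comparisons(n, k):
    # k matched partition pairs, each comparing (n/k) x (n/k) entities: n^2 / k.
    size = n // k
    return k * size * size

n, k = 1000, 10
print(naive_comparisons(n))           # 1000000 comparisons
print(partitioned_comparisons(n, k))  # 100000 comparisons, a factor-of-k reduction
```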
Two common approaches are those based on graphs and those based on ontology logic. The graph-based approach uses only the ontology structure and is usually more scalable, but it ignores part of the semantics that extrapolates the
TABLE 1. Weakness of network alignment approaches.
information obtained from the structure. The logic-based
technique allows the use of description logic and reasoning
with axioms, better capturing the semantics described by
the ontology, and therefore, as it depends on the reasoning
processing, it is less scalable.
G. NETWORK OF ONTOLOGIES AND SYSTEM OF SYSTEMS
Networks of ontologies can be used to support a System
of Systems (SoS). SoS is defined as a set of independent
systems, providing functionalities derived from the interop-
erability between them [5].
An extensive network of ontologies can contain several of
the same entities. Each ontology in the network describes
its domain or complements the knowledge of another ontol-
ogy in the same domain. When we integrate systems from
two companies, we may have networks with ontologies that
describe the same or similar domains. These domains can be
supported by the same ontologies or by different ontologies.
When we think of an environment with multiple IS, and therefore multiple ontologies, the alignment process can be highly complex and inefficient if one naively tries to compare all possible pairs of entities of all ontologies inside two networks of ontologies.
Holistic matching addresses the duplication in results when
matching many different networks. However, both pairwise
and holistic matching are unable to address the exponential increase in the number of comparisons (Table 1).
SubInterNM [41] addresses the case when ontologies share the same entities and prunes them. Indeed, it avoids the Cartesian product. However, it creates a side effect: when the networks of ontologies are converted to graphs, the entities identified as identical, based on their URIs, are pruned from the graphs. Thus, they are excluded from the final alignment result, and the structural metrics gathered from the graphs might be harmed. In Figure 1, the entire ontology o2 does not need to be compared with the identical o2′, since they share the same entities. The same occurs for entities a1 and b1 from ontology o1 and a1′ and b1′ from ontology o1′, and for a3, b3, c3, d3, e3 from ontology o3 and their counterparts in ontology o3′.
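The pruning step described above can be sketched as set algebra over entity URIs; this simplified illustration of the behavior is ours, not SubInterNM's actual code:

```python
def prune_identical(entities_a, entities_b):
    # Entities with the same URI in both networks need not be compared.
    shared = entities_a & entities_b
    return entities_a - shared, entities_b - shared, shared

net_a = {"o1#a1", "o1#b1", "o2#a2", "o3#a3"}
net_b = {"o1#a1", "o1#b1", "o2#a2", "o3p#c3"}
pruned_a, pruned_b, shared = prune_identical(net_a, net_b)
print(sorted(shared))  # ['o1#a1', 'o1#b1', 'o2#a2']

# Relevant nodes discovered by the random walks are later reinserted so that
# essential correspondences are not lost:
relevant = {"o1#a1"}
pruned_a |= relevant & net_a
print("o1#a1" in pruned_a)  # True
```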
H. MARKOV CHAINS
Markov chains are stochastic models that can be used to represent complex systems impacted by random changes. The Markov property states that the current state depends solely on the immediately preceding state. Mathematically, one can define it as follows. Let V1, . . . , Vn be random variables that satisfy the conditional dependency property, and suppose s1, . . . , sn are all the possible states the random variables can assume. Then, according to the Markov property, P(Vn = sn | Vn−1 = sn−1) = P(Vn = sn | Vn−1 = sn−1, Vn−2 = sn−2, . . . , V0 = s0).
A transition matrix T for a Markov chain V at time t is a matrix that stores the transition probabilities among its states. With the matrix's rows and columns indexed by the state space, the element Ti,j is given by Ti,j = P(Vn = sj | Vn−1 = si). It is important to highlight that each line of the matrix sums to 1 and is a probability vector.
Using these concepts in ontologies, the transition matrix can be represented by the adjacency matrix used in graphs to determine the adjacent nodes. Assuming that a node has four child nodes and that the probability of a transition is equal for each node, the transition row for this node will be [0.2, 0.2, 0.2, 0.2, 0.2], where the first 0.2 is the probability that the next state is the same node, while the other 0.2 values are the probabilities of each child node being selected as the next state.
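The five-state example can be reproduced with a short sketch (the function is our own illustration):

```python
def transition_row(node, children, self_loop=True):
    # Uniform transition probabilities over the node itself and its children.
    states = ([node] if self_loop else []) + list(children)
    p = 1.0 / len(states)
    return {s: p for s in states}

row = transition_row("n0", ["n1", "n2", "n3", "n4"])
print(row["n0"])                        # 0.2: probability of staying at the node
print(round(sum(row.values()), 10))     # 1.0: each row of the matrix sums to 1
```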
I. RANDOM WALK
A random walk (RW) is a finite Markov chain with a finite set of states. A random walk starts at some state; at a given time step, if it is in state x, the next state y is selected randomly with probability pxy. The node occupied by the random walk after sufficiently many steps is therefore essentially uniformly distributed [27].
Random walks are directly related to different kinds of Markov chains, depending on the underlying data structure to be explored. Indeed, an RW is a model of a stochastic process that serves to solve a wide variety of problems and is applied in a vast range of domains, such as probability theory, computer science, statistical physics, and operations research [30]. Example problems include financial markets, social network affinities, and ranking systems [30].
RWs and graph data structures can be combined in a computational environment. Both have been used to model problems in several domains, and both can be customized to model complex systems' behavior and the corresponding data model [27]. The authors in [30] discuss many RW types and applications according to the type of graph representing the networks. They also state that the RW is good at ‘‘uncover[ing] various types of structural properties of networks, identifying ‘central’ nodes, edges, or other substructures in networks.’’
Let G = (V, E) be a connected graph with n nodes and m edges. Suppose that at time t a movement is made from node vt to vt+1, and consider dt the degree of node vt. The probability of reaching vt+1 is 1/dt if G is a graph (and 1/dt+ if G is a digraph, with dt+ the out-degree) when (vt, vt+1) ∈ E, and 0 otherwise.
The sequence of steps can be seen as a Markov chain where the probability distribution of being at node vt is pt. The transition matrix M contains all Pt,t+1, each representing the probability of reaching node vt+1.
Since ontologies can be represented as graphs, it is possible to use an RW to go through the ontology structure.
FIGURE 1. Matching network of ontologies example adapted from [10].
RW algorithms can also deal with specific situations and problems intrinsic to network structures. In fact, variations in networks created fertile terrain for enriching RW research, and that symbiotic environment encourages both areas to grow together. The work of [30] provides a brief survey listing many of them: RW variations, types of networks, and RW algorithms. Although the authors did not mention the domain of networks of ontologies, it is possible to represent the network structure with a graph, as done in [40]. Therefore, RW may be used to address the network alignment problem as well.
Consider an RW moving inside a structure that represents an ontology or a network, with transition probability matrix Mt. Suppose that at a given moment a movement will be made from position vt to position vt+1 with probability Pij. Then each element (i, j) of Mt is Mt(i, j) = Pij = 1/di.
The Markov property holds for the network RW, so the following remains true: Σ(j=1..N) Mt(i, j) = 1.
Thus, it is possible to gather all the visits by running many random walks through the structure and form expectations about the most visited nodes and edges; this information can also reveal hints about the position of the nodes and edges in the whole structure. In fact, because of the transition probabilities, some nodes are visited more than others, which suggests a notion of node relevancy. The work of [45] discusses node importance based on this centrality and on the spectrum of the adjacency matrix.
Following the random web surfer model and the PageRank algorithm [16], [35], the rank of a node can be calculated based on its degree of connectivity: the higher the number of incoming edges, the more important a node may be. Likewise, the higher the position in the graph hierarchy, the more important the node. Since each RW execution starts at a random node, the hierarchy position is not used in this study.
The formalization of this problem defines the concept of the neighborhood n(i), which is the set of all indexes j pointed to by node i. So, n(i) = {j : (i, j) ∈ E}. Let N(i) be the number of neighbors of i and n−1(i) the neighborhood that points to i.
The rank of a node is given using a notion of visit time. Suppose a node i is visited at time t = 0 and at time t = 1 one of its neighbors is visited. The time spent at the node defines its rank. We define the indicator function:
I(Wt = i) = 1 if walker W visits node i at time t, and 0 otherwise. (1)
So, the rank r1 of node i is:
r1 = lim(n→∞) (1/n) Σ(m=1..n) I(Wm = i). (2)
The method was implemented using the same transition probability for each neighbor in each transition: the walker W moves with probability Pij = 1/di, as a function of the degree of node i (equivalently, 1/N(i)).
Then the method performs the RW a large number of times and keeps track of the visits. The rank r1 is approximated by:
r1 ≈ (1/n) Σ(m=1..n) I(Wm = i). (3)
This study proposes a variation of the PageRank / random web surfer algorithm to rank the nodes [16], [35]: count each node's number of visits and statistically analyze the paths and the patterns gathered, using random walks and frequent itemsets, to determine each node's relative importance in the graph structure. As the graph structure represents an ontology, a highly visited entity, considering the data from the paths and the patterns, can be selected as a candidate for reference alignment. Information about the paths in which the entity was visited is used to reinforce the choice of entities.
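The visit-count ranking of Eq. (3) can be sketched as follows; the graph, walk parameters, and function names are illustrative assumptions, not the paper's implementation:

```python
import random
from collections import Counter

def approximate_rank(graph, walks=2000, steps=10, seed=42):
    """Estimate node ranks as relative visit frequencies over many random walks."""
    rng = random.Random(seed)
    visits = Counter()
    nodes = list(graph)
    for _ in range(walks):
        v = rng.choice(nodes)            # each walk starts at a random node
        for _ in range(steps):
            visits[v] += 1
            if not graph[v]:             # dead end: stop this walk
                break
            v = rng.choice(graph[v])     # uniform transition: Pij = 1/di
    total = sum(visits.values())
    return {n: visits[n] / total for n in graph}

# A hub node 'a' connected to three leaves that all point back to it:
graph = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"]}
ranks = approximate_rank(graph)
print(max(ranks, key=ranks.get))  # 'a': the most central node is visited most often
```

In this toy graph the walk alternates between the hub and a leaf, so roughly half of all visits land on the hub, illustrating how visit frequency singles out central candidate entities.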
III. RELATED WORK
The ontology matching problem has been researched for a decade. However, despite the evolution of the field, to the best of our knowledge, matchers still work pairwise, i.e., computing the alignments for each pair of ontologies. Thus, there are few studies dealing with networks of ontologies and, consequently, with the problem of ontology network matching, although there are open challenges to deal with [39].
We found that the applications that usually need some matching of networks were: biomedicine, biological networks, link prediction, ontology matching, multi-layer networks, and social networks. Thus, we looked for similar domains where matching among networks and the complexity of structures have been addressed, including large-scale ontology matching.
A. NETWORK MATCHING
We identified approaches and clustered them into three domains or groups: AI, databases, and network science. The network science group transforms the problem into a graph representation and tries to employ the best ways to solve the graph isomorphism problem, or goes beyond it, addressing a multiple-network problem, i.e., not limited to only two networks, and trying to predict links to avoid the massive computation of all possibilities.
Prediction is a way to avoid the massive computation of all possibilities. Sometimes it is needed because there is little or no information about the nodes; this may be caused by security and privacy constraints, or because the network to align is a very new structure, e.g., a new social network. The papers from the AI group are concerned with training models, creating neural networks to refine the algorithms to obtain gains in near-future alignments, or formally modeling the structures to better define possible operations over networks. Finally, we have a small group of database-domain articles that deal with distribution, algebra operations, and modular structures to reduce the final computation. The main goal is ‘‘divide and conquer’’: to transform the problem into smaller pieces and, thus, deal with large and intricate structures.
Prime Align evolved the PageRank algorithm, balancing a weight representation of the edges and the topology using a random walk and a Markov chain [22]. The work from [7] predicts links between networks after determining some consistencies between them, using the fact that two very different domains (social networks and PPI networks in the study) might keep some degree of consistency. In [46], the random walk algorithm is used after the generation of an associated graph, which has an affinity ranking system; the random walk highlights the probably most crucial nodes in the graph structure while submerging the others, to select the final set of candidate entities. The work of [19] uses a random walk with a state that defines whether or not the next step will occur in a different network, using the topological similarity to compute a transitional probability matrix. The matrix is optimized, and a greedy algorithm outputs the multiple network alignments. The approach in [31] uses a clustering strategy to reduce the processing step in alignments by dividing and distributing processing parts using a distributed solution called Spark.
Our study mixes strategies from [46] with the approach presented by [41], trying to refine the preprocessing step with three phases.
B. LARGE-SCALE MATCHING
Previous studies tried to address the complexity of large-scale
matching by using diverse strategies. We focus on the
graph-based ones since our method explores them.
The work from [42] tries to define a dependency func-
tion to create modules who capture similar nodes. Left-
overs are connected with the best modules. [33] uses a
query-based approach similar to database views to define
similar regions starting from some concepts. [24] proposes a
taxonomy-based partitioning for gene ontologies which can
split partitions by terms. The approach uses up and down
propagation to define the boundaries of each partition. The
study from [15], computes coupling and cohesion of entities
in each proposed cluster (or partition) by computing the value
of the entities connected.
Similarly, [18] uses the notion of cohesion and coupling to create blocks. However, after defining them, the blocks are connected by anchors, avoiding the Cartesian product. The more anchors found between blocks, the more likely they are to be matched. Anchor-flood is presented in [43], exploring
anchors to create segments using a greedy algorithm. Once all
alignments, parents, and descendants are explored, they have
the overall alignment. The segments are eventually created
after and not before the alignment.
Finally, LogMap [21] uses the ISUB algorithm to expand
anchors using a string-based approach. The context grows,
including new neighbors without comparing them more than
once and only above a predefined threshold. The anchors are chosen using the locality principle: if the hierarchy neighbors of the classes in an anchor match with low confidence, then the anchor is abandoned, thus avoiding the Cartesian product.
VOLUME 10, 2022 44643
F. Santos, C. E. Mello: Matching Network of Ontologies: Random Walk and Frequent Itemsets Approach
The previous studies that deal with large ontologies usually try to modularize them in order to avoid explicitly comparing each entity [34]. However, large-scale related work does not consider a previous network structure, and the modularization process clusters regions of the ontology to reduce the problem size but does not, in fact, avoid computing each pair inside the regions. Besides that, an ontology network may be composed of large ontologies, which raises the problem to an even larger number of entity comparisons. Thus, modularization and our proposed method may be complementary.
The proposed approach aims to reduce the search space while pursuing alignment efficiency. This is possible by recovering possible alignments lost after pruning the graph of networks. The SubInterNM approach creates partitions not to obtain a segmentation that produces clusters or modules reducing the scope of entity comparison; it seeks to discover partitions of equal content and remove them, thus reducing the search space. Our method works at this point. As SubInterNM partially destroys the graphs by pruning similar entities, we discover possibly relevant entities by mining potentially crucial nodes, returning them to the graph, and thus recovering the lost structure.
We identified many places where parallelization could help reduce the time complexity. Comparing our method with the taxonomy of large-scale matching approaches [34], since the reference classifies modules as created by reasoning, we can classify our method as no modular extraction and no complete partitioning, with possible parallelization, using graph-based search space reduction; and it is scalable. Indeed, it is a partial partitioning or a structure-based modular extraction.
The novelty of this study lies in the proposed method to address large-scale matching involving networks of ontologies, by taking advantage of modeling ontologies as Markov chains, from which observations are drawn to be statistically evaluated by data mining techniques, eventually recommending pair matches.
IV. METHOD
This study comprises three phases, as summarized in
Figure 2: gathering information from the structures through
a random walk, selecting the most relevant entities, and opti-
mizing the network to be matched. To foster reproducibility,
we provide a publicly available dataset² containing the raw
data, the random walker used in phase 1, the Jupyter notebook
scripts used in phase 2, and the network optimizer used in
phase 3.
A. PHASE 1 - GATHERING INFORMATION FROM THE
STRUCTURES
We converted the OWL structures from each one of the dataset ontologies into graphs. Table 2 shows the graph representation created for each ontology in the dataset.
² https://zenodo.org/record/5573204
TABLE 2. Ontologies graphs.
Afterwards, the random walk algorithm ran ten times the number of nodes in each ontology to retrieve data from the visited nodes. This information includes the frequency, paths, and distribution of each random walk, including the mean, median, quartiles, max, and min values.
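The walk-and-count procedure described above can be sketched as follows. This is a minimal illustration in Python over a toy adjacency list, not the implementation published in the dataset; the graph, function names, and the walk-length cap are invented for the sketch, while the "ten times the number of nodes" walk count follows the text.

```python
import random
import statistics

def random_walk_visits(adjacency, num_nodes_factor=10, seed=42):
    """Run walks over `adjacency` and count how often each node is visited.

    `adjacency` maps each node to its list of neighbours. The number of
    walks is `num_nodes_factor` times the node count, mirroring the
    "ten times the number of nodes" setting described in the text.
    """
    rng = random.Random(seed)
    nodes = list(adjacency)
    visits = {n: 0 for n in nodes}
    paths = []
    for _ in range(num_nodes_factor * len(nodes)):
        node = rng.choice(nodes)
        path = [node]
        visits[node] += 1
        # walk until a node with no outgoing edges (or an arbitrary cap)
        while adjacency[node] and len(path) < 20:
            node = rng.choice(adjacency[node])
            visits[node] += 1
            path.append(node)
        paths.append(path)
    counts = list(visits.values())
    stats = {
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "max": max(counts),
        "min": min(counts),
    }
    return visits, paths, stats

# toy graph standing in for an ontology converted to a graph
graph = {"Person": ["Author", "Reviewer"], "Author": [], "Reviewer": []}
visits, paths, stats = random_walk_visits(graph)
```

The per-node visit counts and the collected paths correspond, respectively, to the visit and binary output files described below.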
We filtered out paths during the data collection using a sliding window algorithm, limiting the data gathered from the random walks. The sliding windows were set up to start at (1,1) (window size, shift size) and grow until the window's maximum size could still retrieve a valid path. We also filtered out paths from the random walks based on the number of visits, considering quartiles one, two, and three of the statistical distribution for each possible sliding window configuration. We kept only paths that had at least one node inside the quartile. Thus, we generated the number of possible random walk configurations × 3 output files for each ontology. For example, one ontology with valid random walks from (1,1) to (5,5) (i.e., (1,1), (2,1), (2,2), (3,1), (3,2), (3,3), (4,1), (4,2), (4,3), (4,4), (5,1), (5,2), (5,3), (5,4), (5,5), × 3) generated 45 different output files considering the three defined quartiles.
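The count of 45 files in the example above follows from a shift never exceeding its window size. A minimal sketch of the enumeration (function names are invented):

```python
# Enumerate every (window, shift) pair up to a maximum window size,
# with the shift never larger than the window, as in the example
# where configurations (1,1) through (5,5) are valid.
def window_configs(max_window):
    return [(w, s) for w in range(1, max_window + 1)
                   for s in range(1, w + 1)]

configs = window_configs(5)   # 15 (window, shift) configurations
quartiles = [1, 2, 3]
output_files = [(w, s, q) for (w, s) in configs for q in quartiles]
# len(output_files) == 45, matching the 45 output files in the text
```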
The main output files were the visit file, summarizing each node with its total number of visits and the paths where it was present, and the binary file, containing for each valid path a 1 or 0 indicating whether or not a specific node was visited.
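The binary file format can be sketched as follows; node names are invented for the example.

```python
# Encode each walk path as a binary row over all node entities:
# 1 if the node was visited in that walk, 0 otherwise.
def to_binary_rows(paths, all_nodes):
    return [[1 if n in set(path) else 0 for n in all_nodes]
            for path in paths]

nodes = ["Person", "Author", "Reviewer", "Paper"]
paths = [["Person", "Author"], ["Paper"]]
rows = to_binary_rows(paths, nodes)
# rows[0] == [1, 1, 0, 0]; rows[1] == [0, 0, 0, 1]
```

Rows in this shape are exactly the transactions consumed by the frequent itemsets step in phase 2.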
Figure 3 shows the random walk sliding window algorithm gathering data from the sigkdd ontology and creating the main output files: visits and binary. This experiment focused only on the Class node types for feasibility when manually analyzing the results of each step in each phase.
B. PHASE 2 - SELECTING THE MOST RELEVANT ENTITIES
Once we have information about the nodes with potential
more relevancy in the graph structure representation of the
ontology, we need to discover the best setup to maximize the
metrics and persist the better overall setup for each ontology
and the entire domain.
The random walker made a large amount of mined data available for analysis: specifically, in our case, the many sequences of paths visited within the graph structure, in the form of unstructured text, for each different configuration, used to identify possibly relevant nodes. The complexity of unstructured text has created new challenges in data analysis [6], and data aggregation algorithms may help to identify relevant data. In order to evaluate all possible
FIGURE 2. The Framework of Research Design.
FIGURE 3. Phase 1 - sigkdd ontology example.
configurations, we aggregated the data, identified patterns,
and applied natural language processing to evaluate the pos-
sible candidate hits by comparing the chosen nodes to the
reference alignment.
1) FREQUENT ITEMSET EVALUATION
To select the most relevant nodes to the next phase, we ran
a frequent itemsets algorithm using all the binary outputs
produced in phase 1.
We used the Python Mlxtend package [1] to run the apriori algorithm and obtain the frequent itemsets ordered by support. The apriori algorithm is simpler than its counterparts, Eclat and FP-Growth; however, it can overload the memory when bags have more than 60 items [17], which did not happen in our dataset. We built an evaluation experiment for each pair of ontologies present in the OAEI dataset with a reference alignment.
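The study relies on Mlxtend's apriori implementation; to make the frequent-itemset idea concrete, a naive brute-force equivalent over toy walk transactions can be sketched as follows (node names and the support value are invented for the example):

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Naive frequent-itemset search (same result as apriori, brute force).

    `transactions` are sets of nodes visited in a walk; support is the
    fraction of transactions that contain the itemset.
    """
    items = sorted(set().union(*transactions))
    result = []
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            support = sum(set(combo) <= t for t in transactions) / len(transactions)
            if support >= min_support:
                result.append((frozenset(combo), support))
    # ordered by support, as the candidate bags are consumed in the text
    return sorted(result, key=lambda x: -x[1])

walks = [{"Person", "Author"}, {"Person", "Author"}, {"Person", "Paper"}]
itemsets = frequent_itemsets(walks, min_support=0.6)
```

Unlike this brute-force sketch, apriori prunes supersets of infrequent itemsets, which is what keeps the real run feasible.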
Each ontology was evaluated separately and incrementally, including the next bag sorted by support. After each bag was collected, we verified whether or not the bag produced by the apriori algorithm was unique, to avoid comparing the same bags multiple times: each bag is inserted into a collection that does not allow duplicates. We called this the candidate check.
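The candidate check can be sketched with a set of frozensets (names are invented for the sketch):

```python
# A sketch of the candidate check: each bag of entities is evaluated
# only once, using a set of frozensets to reject duplicate bags.
seen = set()

def candidate_check(bag):
    key = frozenset(bag)  # order-insensitive identity of the bag
    if key in seen:
        return False      # duplicate: skip this bag
    seen.add(key)
    return True           # new candidate: evaluate it

first = candidate_check(["Person", "Author"])
second = candidate_check(["Author", "Person"])  # same bag, different order
```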
Once the candidacy was approved, we individually compared the bag with the reference alignment of both ontologies (each side). Then we retrieved the recall, precision, and F-measure evolution for each side separately. For example, a pair of ontologies that ran the random walk from (1,1) to (5,5) with three quartiles has 45 (RW output files) × 2 (ontologies) × number-of-bags side evaluations.
From each side evaluation and the final evaluation, we gathered the evolution of the recall, precision, and F-measure to track the progress of the metrics after each bag was incorporated into the process.
We generated a data set for each sliding window + quartile + semantic threshold combination to compare the results and select the best setup for each pair of ontologies.
Finally, we compared all the results from each pair of
ontologies to define the best overall setup to feed the network
matching optimizer in the next phase.
2) SEMANTIC THRESHOLD
Next, to discard 'noisy' candidates, we defined three levels of semantic threshold: 0.7, 0.8, and 0.9, mimicking the range used by LogMap and Alin. LogMap uses a default expansion threshold of 0.70 and a mapping threshold of 0.95 [21], while Alin, on the Conference dataset, defines the similarity value as 0.9 [9]. We applied the spaCy NLP Python package to compare each bag of frequent itemsets between the ontologies. The spaCy package [3] was used with the pretrained model en_core_web_lg, the largest English model of spaCy, with a size of 788 MB. Then we combined each
possible pair of selected tokens from the unique bags created
for each ontology and compared them with the complete
reference alignment.
The final evaluation considered the semantic threshold defined above, keeping only pairs above the defined limit (0.7, 0.8, or 0.9).
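The thresholding logic can be sketched as follows. The study uses spaCy's en_core_web_lg vector similarity; here a simple string ratio from the standard library stands in for that similarity so the filter itself is clear — the function name, token pairs, and the stand-in similarity are all assumptions of the sketch.

```python
from difflib import SequenceMatcher

# Stand-in for spaCy's vector similarity: a character-level ratio.
def semantic_check(token1, token2, threshold):
    similarity = SequenceMatcher(None, token1.lower(), token2.lower()).ratio()
    return similarity >= threshold

pairs = [("Author", "author"), ("Author", "Venue")]
kept = [(a, b) for a, b in pairs if semantic_check(a, b, 0.7)]
# only the ("Author", "author") pair survives the 0.7 threshold
```

With spaCy, the same filter would compare `nlp(token1).similarity(nlp(token2))` against the threshold instead.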
Figure 4 shows an example of phase 2.
3) DATA EVALUATION
To evaluate the classifiers, we employed the following metrics:
- Precision measures the proportion between the number of correctly predicted labels and the total number of predicted labels.
- Recall corresponds to the percentage of correctly predicted labels among all truly relevant labels.
- F-measure calculates the harmonic mean of precision and recall; it is a weighted measure of how many relevant labels are predicted and how many predicted labels are relevant.
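The three metrics can be computed from the sets of predicted and reference correspondences; the example pairs below are invented for the sketch.

```python
def precision_recall_f1(predicted, reference):
    """Compute matching metrics from sets of predicted and reference pairs."""
    tp = len(predicted & reference)                      # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

predicted = {("Author", "author"), ("Paper", "Document")}
reference = {("Author", "author"), ("Topic", "Subject")}
p, r, f = precision_recall_f1(predicted, reference)
# p == 0.5, r == 0.5, f == 0.5
```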
4) DATA ANALYSIS
We conducted the data analysis using the evaluation metrics mentioned earlier. We used the Mann-Whitney U test to compare the metrics from the outcome of the frequent itemsets, followed by Cliff's delta effect size test. The Cliff's delta magnitude was assessed using the thresholds provided by [38], i.e., d < 0.147 'negligible,' d < 0.33 'small,' d < 0.474 'medium,' otherwise 'large.'
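Cliff's delta and the magnitude thresholds of [38] can be sketched in a few lines (a minimal pure-Python illustration, not the statistical package actually used):

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross pairs."""
    gt = sum(x > y for x in xs for y in ys)
    lt = sum(x < y for x in xs for y in ys)
    return (gt - lt) / (len(xs) * len(ys))

def magnitude(d):
    """Thresholds from [38]."""
    d = abs(d)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"
```

For example, `cliffs_delta([2, 3, 4], [1, 1, 1])` is 1.0 ('large'), while identical samples yield 0.0 ('negligible').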
5) DATASET ANALYSIS
Data mining techniques and machine learning are frequently employed in data analytics applications. Frequent itemsets
mining is widely used among the unsupervised techniques
to find information about association rules, correlations, and
dependencies hidden in datasets. A frequent itemset is a combination of interconnected items in transaction data, and the matches can be found using several data mining techniques, one of which is the apriori algorithm [4], [32]. One advantage of frequent itemsets is the
flexibility to be employed with many domains due to the
simplicity of the required dataset format. The dataset consists
of the visited nodes produced by the random walk algorithm,
where each row has columns for all the node entities, marked with '1' if the node was visited in that random walk or '0' if it was not.
C. PHASE 3 - MATCHING NETWORK OF ONTOLOGIES
Entering phase 3, we need to use the information discovered in phases 1 and 2 in the upcoming alignment of networks of ontologies. Using different setups, we could analyze and decide the best one for this case study. We claim the final set of nodes suggests the structurally most relevant candidates. However, structural information alone may not retrieve all nodes relevant to the network match. Thus, it seems more important to increase recall by including more nodes, discarding them only in the final matcher computation.
1) EXTRACTING THE RELEVANT NODES
Using the best setup from the frequent itemsets, after the data analysis step, we used a projection algorithm defined in [8] to extract the nodes from the network of ontologies. Those nodes are kept safe for future use. We used the Ontology Manager Tab [28] and SubInterNM (defined in [40]), modified with the projection operation.
FIGURE 4. Phase 2 - sigkdd ontology example.
2) OPTIMIZING THE MATCHING
Using the graph representing the network structure, we employed the SubInterNM approach to preprocess the networks, pruning redundant entities found in both networks and reducing the final matcher's effort. Indeed, it applies a set of algebraic operations (union, intersection, and difference) to the graph representation of the ontologies. The results discussed in [41] reduced the number of comparisons, avoiding the complete Cartesian product to verify all possible entity matches. The study also showed reduced overall processing time and more balanced metrics compared with traditional matchers used alone to deal with networks.
To complement the outcome of SubInterNM, we inserted the persisted nodes retrieved by the projection operation back into the graph, using the union operation, sending them back to their respective networks and, consequently, restoring some relevant nodes that the SubInterNM approach had pruned.
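Viewed at the level of entity sets (a deliberate simplification: the real operations work on the graph representation with its edges), the prune-then-restore step can be sketched as follows; the networks and the relevant set are invented for the example.

```python
# Set-algebra view of the pruning step: entities shared by both networks
# are removed (intersection + difference), then the relevant nodes
# elected by the random walk and frequent itemsets are restored (union).
net1 = {"Person", "Author", "Paper", "Review"}
net2 = {"Person", "Author", "Topic"}
relevant = {"Author"}          # nodes elected in phase 2

shared = net1 & net2                       # intersection
pruned1 = net1 - shared                    # difference
pruned2 = net2 - shared
restored1 = pruned1 | (relevant & net1)    # union with projected nodes
restored2 = pruned2 | (relevant & net2)
```

After restoration, "Author" is present again in both pruned networks, so a correspondence involving it is not lost in the final matching.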
To evaluate the experiment, we created networks using the conference domain ontologies. That domain was chosen to compose the networks, despite not reflecting an ideal real-world situation, because of the reliability of the reference alignments available to evaluate the results considering many different possibilities. OAEI provides 15 different reference alignments for combinations of seven ontologies (conference, cmt, confof, edas, ekaw, sigkdd, and iasted). We excluded the cmt ontology from the experiment due to errors processing the RDF file.
We evaluated the networks listed in Table 3.
TABLE 3. Phase 3 experiments.
Each experiment operation was defined with a string to
help the execution. For instance, the 2 × 2 experiment ran
the following projections:
- confof_P_conferenceXconfof
- confof_P_confofXsigkdd
The first item means: using the confof ontology, run a (P)rojection using the frequent itemsets obtained in phase 2 from the combination of the conference and confof ontologies. The second line instructs to use the confof ontology and run a projection using the frequent itemsets from confof and sigkdd (confofXsigkdd).
The union operation followed a similar notation:
- net1_220_U_confof_P_conferenceXconfof
Thus, the line above represents the sequenced operation: using the first network = {sigkdd, confof} in experiment 2 × 2 with the default configuration in SubInterNM (net1220), make a (U)nion with the result of the (P)rojection from the confof ontology and an outcome from the frequent itemsets that evaluated the conference and confof ontologies.
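A hypothetical parser for the projection strings above illustrates the notation (it only handles the three-part projection form, not the longer union strings; all names are assumptions of the sketch):

```python
# Parse a projection string such as "confof_P_conferenceXconfof" into
# its parts: target ontology, operation letter, and itemset source pair.
def parse_projection(op):
    ontology, op_letter, source = op.split("_")
    left, right = source.split("X")
    return {"ontology": ontology,
            "operation": op_letter,
            "itemsets_from": (left, right)}

parsed = parse_projection("confof_P_conferenceXconfof")
# {'ontology': 'confof', 'operation': 'P',
#  'itemsets_from': ('conference', 'confof')}
```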
The notation aims to automate the entire experiment, connecting all the phases in a row in the future. Table 11 summarizes the operations in some experiments.
Finally, we projected the final union results with all the ontology fragments, returning to the pairwise problem. This step aims to verify to what extent the results improved the metrics obtained in [41], without the problems caused by the limitations of current matchers in identifying alignments from more than two ontologies.
Section V will discuss which frequent itemsets results were used.
3) FINAL MATCHING
Following the optimizing step, we sent the results to two
external matchers: Alin and LogMap [20]. These match-
ers were selected because they represent some of the best
results in the OAEI campaign, considering the conference
domain. In addition, Alin can compare multiple ontologies
(but not networks), and LogMap has excellent results for large
datasets, which is the situation we face in our experiment
comparing networks of ontologies.
4) RESULT ANALYSIS
To the best of our knowledge, there is no network ontology matcher against which to compare our results. However, we created a baseline selecting random entities from the ontologies to compare with the Random Walk and the Relevant Nodes Discovery results. Also, the final results can be sent to the two selected matchers, using the outcome after the union operation and after the fragment projections. Indeed, this addresses the problem
that classic matchers do not usually handle more than two
ontologies simultaneously as input.
5) EXPERIMENT SETUP
Phase 1 ran the random walk using as input the ontologies from the Conference domain available in OAEI. We used all possible sliding window and offset setups, limited by the height of the graph representation of each ontology. The visits files were limited by the quartile threshold parameter, responsible for deleting paths where at least one of the entities visited in each path is under the defined threshold. Thus, the influence of the parameters (sliding window size, offset size, and quartile threshold) had to be analyzed with all the combinations reflected in the visits files. The goal was to keep only the paths with nodes important to the next phase.
Phase 2 ingested the visits files produced in phase 1. Next, we ran the frequent itemsets algorithm and deleted the duplicate entities, creating the candidate set. Those sets (from each pair of ontologies in both networks) were compared using the semantic threshold. The semantic threshold parameter limits candidates by comparing some candidates in advance. The final selected candidates from each ontology produced the projection set for phase 3. The setups were statistically compared, and only the best setup was sent to phase 3.
Phases 1 and 2 used two baselines as control experiments: the first selects some entities by chance (a number slightly larger than the reference alignment). The second selects entities at random (10× the size of the graph). Both baselines aim to check, after phase 2, whether or not the candidates picked by the baselines have better metrics than our method.
Phase 3 uses the previous results from [41]. There, the 'gray' (All or Brute Force) and the 'blue' with 'red' experiments were studied (Figure 5). The 'gray' experiment ran the matchers over all ontologies in both networks, running the Cartesian product. The 'blue' (Naive) experiment ran the union operation [8], creating a single OWL file for each network. The single file was sent to the matchers to retrieve the alignments and metrics. The goal was to evaluate the ability of the state-of-the-art matchers to deal with a complex set of ontologies, as we have in networks. The 'blue' experiment can reduce the effort by transforming alignments of the type sigkdd × ekaw and ekaw × sigkdd into only one: sigkdd × ekaw or ekaw × sigkdd. The 'red' experiment is the SubInterNM implementation running the union, intersection, and difference operations defined in [8].
Our study creates two new experiments: the 'green' and the 'purple.' The 'green' (Sub+RW+FIS) experiment uses the results from phases 1 and 2 to project the relevant nodes and insert them again into the final network obtained after the SubInterNM approach (extending the 'red' (Sub) approach). Finally, the 'purple' (Sub+RW+FIS+Pairwise) approach extends the 'green' (Sub+RW+FIS) approach by projecting the final union with the fragments of the original ontologies and sending them to the matcher in a pairwise way. Thus, we verified how the alignments with the fragments compare with the alignments using the original ontologies and how the metrics compare with the Brute Force.
We measured the execution time of each phase of the experiment. For SubInterNM, we got the processing time of each operation. We selected the best setup for the RW and FIS (discovered in phase 2) and ran it ten times, calculating the average for each ontology used in the experiment. The final results were compared with the similar analysis done in [41] as the baseline. All (Brute Force), Naive, and SubInterNM were compared using LogMap and Alin. Both had good results matching anatomy (Alin and LogMap) and large biomedical ontologies (LogMap) in the OAEI competition [36]. Finally, the influence of the parameters is discussed in the results section.
D. COMPLEXITY
The complexity analysis was divided into phases. As mentioned in Section III, we did not implement parallelization, but we will mention where it can be used to reduce the time and space complexity as future work. In phase 1, the random walk procedure runs in r·O(n), where r is the number of random walks over the graph structure. In phase 2, the method uses O((n·s)²), where s is the support threshold used to compare the frequent itemsets created for each pair of entities. However, we can run it without
FIGURE 5. Phase 3 - 5 ×2 example.
FIGURE 6. Phases 1 and 2 - windows shift comparison.
the semantic threshold over each pair before the alignment. In this case, we may harm the precision by offering more entities as candidates, since we disregard the target ontology. On the other hand, the frequent itemsets will run with O(2·(n·s)) complexity.
In phase 3, we have several operations: union, intersection, and difference (from SubInterNM), preceded by the projection (to save the relevant entities discovered by the random walk and frequent itemsets) and, finally, the union of the results coming from SubInterNM with the projection. Casanova et al. [8] state that the algebraic operations use O(n²) in acyclic graphs and are NP-hard when applied to strongly connected graphs. The union uses O(n²)·t, where t is the time to compare the URIs of two different entities. Indeed, the minimization procedures to create the graph and the union can be run before the alignment process. The intersection needs O(n²)·t, and it can be run in parallel, reducing its complexity to O(n²/m²)·t, where m is the number of ontologies (supposing both networks have the same number of ontologies).
The difference uses O(n²/2) on average, since the intersections calculated before are compared with both networks, and each difference can be run in parallel. The projection uses O(n), and the final union complexity is O(2n/j), where j is the number of projected ontologies; the projection can be run in parallel as well. Finally, the selected matchers can align the pruned networks with O((n−p)²)·c complexity, where p is the number of pruned entities and c is the time to calculate all the similarity metrics between a pair of ontologies.
The total time complexity equals the most time-consuming step. The most impactful executions are the frequent itemsets comparison of entities, O((n·s)²), in phase 2, and, in phase 3, the O(n²/2) of the difference plus the matching time O((n−p)²)·c. The Cartesian product used by the pairwise or holistic approaches requires O(n²)·c.
It is worth mentioning that the more similar entities the networks share, the larger the number of pruned entities p, and thus the larger the gap between O((n−p)²)·c (matching after applying the method) and O(n²)·c using the naive Cartesian product. As discussed in Section II-G, this is the expected situation when companies merge.
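The gap between the two costs can be made concrete with a short calculation (the entity counts are invented for the illustration):

```python
# With n entities and p pruned duplicates, the matcher compares
# (n - p)^2 pairs instead of the full Cartesian n^2.
def comparisons_saved(n, p):
    return n * n - (n - p) * (n - p)

# e.g. 1000 entities with 400 pruned duplicates:
saved = comparisons_saved(1000, 400)   # 1,000,000 - 360,000 = 640,000
```

As the example shows, pruning 40% of the entities removes well over half of the pairwise comparisons, which is why the saving grows with the overlap between the merged networks.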
E. PSEUDO CODE
This section summarizes the pseudo code corresponding to each phase of the method (Figure 2).
The pseudo-code of Algorithm 1 receives a set of window sizes and offsets (shifts) and starts producing a random walk from each possible entity e^j_i, where j ∈ {1, 2} and i ∈ {1..n}, over the entities of {O1, O2}. The random walk checks the adjacency list to randomly select the next node to be visited. For each node visited, a bag of nodes with their paths is stored in Bj. Once bags B1 and B2 are created, the quartile check verifies whether the path includes all nodes above the quartile limit Q.
The relevant nodes discovery procedure starts from the bags of each random walk created by Algorithm 1, considering all the window sizes, shifts, and quartile outputs. For each defined setup (window size, shift, and quartile output), the apriori function returns all the frequent itemsets Fi; duplicates are discarded into Ii, and the metrics are calculated for the results, considering only one ontology with respect to the matches with one side of the reference alignment R. From the unique candidates, and using the defined semantic thresholds T, we check the semantic similarity, and the pairs above the threshold are persisted to the next phase in E. The proposed matches are returned by the procedure to be used as input to the network matcher optimizer. We also collected the match metrics using both ontologies to analyze the results in M.
The optimized network matcher starts by computing the projection of the elements considered relevant, E, from Algorithm 2, stored in P. The resulting set of networks Ψ′ and each final network Γ′ are initialized. Next, for each network and each ontology, we perform the Union operation; this way, all ontologies of a network are united inside a temporary variable Γ′i. Then, for each network, we calculate the Intersection of all its ontologies, storing it momentarily in ϒi. The Difference procedure prunes every intermediate network Γ′i with the identified intersections ϒi.
Algorithm 1: Random Walk
Input: Two ontologies O1 and O2 with sets of entities E1 = {e1_1, e1_2, ..., e1_k} and E2 = {e2_1, e2_2, ..., e2_k}; a sliding window setup of window size, shift, and quartile (W, O, Q); a set of reference alignments R
Output: Two sets of binary bags B1, B2 with the visited nodes and paths for each window size and shift

Random Walk (O1, O2, R, W, O, Q)
    B1 ← ∅; B2 ← ∅
    while e1_i (i = 1 to n_O1) ∈ E1 and e2_i (i = 1 to n_O2) ∈ E2 do
        while random walk on O1 do
            node ← e1_i
            select adjacent node a1_i from the finite set A1 = {a1_1, a1_2, ..., a1_n} of the adjacency matrix of node
            while path and new window shift do
                V1 ← visit(e1_i(a1_i))
                select adjacent node a1_i from the finite set A1 of the adjacency matrix of a1_i
                i++; node ← e1_i
            B1 ← createBag(V1)
        while random walk on O2 do
            root ← e2_i
            select adjacent node a2_i from the finite set A2 = {a2_1, a2_2, ..., a2_n} of the adjacency matrix of root
            while path and new window shift do
                V2 ← visit(e2_i(a2_i))
                select adjacent node a2_i from the finite set A2 of the adjacency matrix of a2_i
                i++; node ← e2_i
            B2 ← createBags(V2)
    for b1 ∈ B1 do
        for e1 ∈ b1 do
            if e1 ∉ Q then
                B1 ← B1 − b1
    for b2 ∈ B2 do
        for e2 ∈ b2 do
            if e2 ∉ Q then
                B2 ← B2 − b2
    Nodes Discovery(B1, B2)
Finally, the entities saved by the projection are restored using the union operation Γ′i ← Union(Γ′i, ei) before
Algorithm 2: Relevant Nodes Discovery
Input: Two sets of binary bags B1, B2 with the visited nodes and paths for each window size, shift, and quartile; a semantic threshold T; a reference alignment R
Output: A list of suggested candidate entities E = {e1_1, e1_2, ..., e1_k} ∈ O1 × {e2_1, e2_2, ..., e2_k} ∈ O2; a set of metrics SM1, SM2, and M

Nodes Discovery (B1, B2)
    E ← ∅
    for each window size and offset do
        F1 ← apriori(b1_i ∈ B1)
        for f1_i ∈ F1 ordered by support do
            I1 ← checkUnicity(f1_i)
            SM1 ← calcSideMetrics(I1)
        F2 ← apriori(b2_i ∈ B2)
        for f2_i ∈ F2 ordered by support do
            I2 ← checkUnicity(f2_i)
            SM2 ← calcSideMetrics(I2)
        for e1_i ∈ I1 and e1_i ∈ R do
            for e2_i ∈ I2 and e2_i ∈ R do
                E ← E + semantic_check(e1, e2, R, T)
    M ← calcFinalMetrics(F, R)
    return E, SM1, SM2, M
calling the matcher with the final networks Ψ′. A final set of correspondences C is returned.
The results section explores the retrieval of the relevant entities, returning to the original pairwise problem. Indeed, the explored matchers could not compute alignments when given more than two ontologies as input. Thus, returning to pairwise matching can prove the method's validity. The pseudo-code above does not contain that step, since it was not in our original method and is only an adaptation to show the validity of the proposed method. Therefore, a set of projections should be executed over the network to retrieve the ontology fragments; the projection function is the same presented in the third pseudo code.
V. RESULTS
We present the results grouped by research question. Phases 1 and 2 will be presented together, and the phase 3 results will be presented separately. Within phases 1 and 2, we present the results of the overall approach for the conference domain, mentioning specific characteristics of an experiment involving a pair of ontologies when relevant. Those pairs were defined by the existence of a reference alignment to support the data analysis.
A. RQ1: TO WHAT EXTENT CAN THE METHOD IDENTIFY THE RELEVANT NODES?
The goal of RQ1 is to determine the best setup to feed the network matcher optimizer in phase 3. Therefore, we analyzed the results of the random walk and the frequent itemsets.
Algorithm 3: Modified Optimized Network Matching
Input: A finite set E = {e1, e2, ..., en} of entities; two networks of ontologies Ψ = {Γ1, Γ2}
Output: A list of correspondences C = {c1, c2, ..., ck}

P ← ∅
for ei ← 1 to n ∈ E do
    for Γi ← 1 to n ∈ Ψ do
        if ei ∈ Γi then
            P ← P + Projection(Γi, ei)
        end if
    end for
end for
Ψ′ ← ∅; Γ′ ← ∅
for Γi ← 1 to n ∈ Ψ do
    for Oi ← 1 to n ∈ Γi do
        Γ′i ← Union(Γ′i, Oi)
    end for
    Ψ′ ← Ψ′ + Γ′i
end for
Set of intersections ϒ ← ∅
for Γi ← 1 to n ∈ Ψ do
    for Oi, Oj ← 1 to n ∈ Γi do
        ϒ ← Intersection(Oi, Oj)
    end for
end for
for Γ′i ← 1 to n ∈ Ψ′ do
    for ϒi ← 1 to n ∈ ϒ do
        if ϒi ∈ Γ′i then
            Γ′i ← Difference(Γ′i, ϒi)
        end if
        Ψ′ ← (Γ′i)
    end for
end for
for pi ← 1 to n ∈ P do
    for Γ′i ← 1 to n ∈ Ψ′ do
        if pi ∈ Γ′i then
            Γ′i ← Union(Γ′i, ei)
        end if
        Ψ′ ← (Γ′i)
    end for
end for
return C ← alignment_tool(Ψ′)
1) RANDOM WALKS AND FREQUENT ITEMSETS
Our first investigation concerned the parameters that could change the dataset results in phases 1 and 2. We started with the configuration of the sliding window parameters: window and shift. The highest F-measure results came from window 2 and shift 2 (2-2). The window and shift configurations started losing metrics when the window went beyond 3 (Figure 6). Table 4 shows the statistical comparison between the F-measures of selected window-shift configurations. Despite the huge variance, it shows a significant difference between the configurations, with large Cliff's delta, except when comparing (2-2) with (1-1), where we found a medium effect size.
TABLE 4. Window and shift configuration comparison (2-2) x (1-1), (3-1),
(4-1) and (5-1).
TABLE 5. Threshold precision configuration comparison 0.7 x 0.8 and 0.9.
FIGURE 7. Phases 1 and 2 - semantic threshold comparison.
Next, we compared the influence of the semantic threshold on our results. Looking at Figure 7, we can observe the effect of the semantic threshold on the results. Table 5 shows the Mann-Whitney U test with the Cliff's delta effect size. Comparing the 0.7 threshold with 0.8 and 0.9 using precision, we found a statistical difference with medium and large effect sizes (0.7 × 0.8: p < 0.7e-19, -0.423, 'medium'; 0.7 × 0.9: p < 0.3e-67, -0.753, 'large'). Using recall, we also found a statistical difference with negligible and small effect sizes (0.7 × 0.8: p < 0.015, 0.093, 'negligible'; 0.7 × 0.9: p < 0.9e-5, 0.161, 'small').
We can obtain better precision by using the 0.9 limit when comparing the candidates from each side (ontology), and better recall when applying the 0.7 threshold. Figure 9 presents a comparison between setups from the iasted and ekaw ontologies. We can see some exceptions to the general rule above. In setup 2-1-1-0.9, the ontologies had opposite behavior: ekaw (ontology 1) had the most hits, while iasted had the fewest. On the other hand, both had fewer misses. The setup 2-2-2-0.7 has 9 hits (ontologies 1 and 2), with 32 and 48 misses, respectively.
Finally, we compared the quartile setup from the random walk. The quartile value decides whether a bag from each path created with the sliding-window configuration will prevail or not.
VOLUME 10, 2022 44651
F. Santos, C. E. Mello: Matching Network of Ontologies: Random Walk and Frequent Itemsets Approach
FIGURE 8. Phases 1 and 2 - quartile comparison.
FIGURE 9. Phases 1 and 2 - ekaw - iasted: semantic threshold comparison and standard deviation.
Bags prevail when at least one element has visits beyond the defined quartile, considering the statistics gathered over all random walks using this ontology. Looking at figure 8 and running the test, we did not find different distributions when comparing quartiles 1 x 2 and 2 x 3, thus failing to reject H0 (H0: the quartiles have similar distributions).
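The quartile filter described above can be sketched as follows. This is a toy illustration, not the paper's implementation: the entity names and visit counts are invented, and a bag survives only if at least one of its entities was visited more often than the chosen quartile of the visit-count distribution.

```python
from statistics import quantiles

def prevailing_bags(bags, visit_counts, quartile=1):
    """Keep bags with at least one entity visited beyond the chosen quartile."""
    q1, q2, q3 = quantiles(visit_counts.values(), n=4)  # three cut points
    threshold = (q1, q2, q3)[quartile - 1]
    return [bag for bag in bags
            if any(visit_counts.get(e, 0) > threshold for e in bag)]

visits = {"Paper": 12, "Author": 9, "Review": 3, "Chair": 1}   # illustrative
bags = [("Paper", "Author"), ("Review", "Chair"), ("Chair",)]  # sliding-window bags
print(prevailing_bags(bags, visits, quartile=2))   # only the bag with 'Paper' survives
```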
We compared phases 1 and 2 with two baselines. Baseline 1 selects a number of entities close to the number of entities in the reference alignment, to mimic a user's guess. The second baseline selects 10x that number of entities randomly, to verify the effectiveness of the random walk at selecting relevant nodes based on the graph structure. The results were then sent to the apriori algorithm. Figure 10 compares the number of hits with the baseline 1 results; it shows one of the best setups by recall in table 6, since those are the ones we send to the next phase. Randomly selecting a number of possible alignments close to the number of entities in the reference alignment, as a user would probably do, proved to have worse results than our approach. Next, we verified the random walk performance using the second baseline.
FIGURE 10. Phases 1 and 2 - ekaw - iasted: first baseline corrects comparison and standard deviation for setups.
FIGURE 11. Phases 1 and 2 - second baseline misses comparison.
Looking at figure 11, we can see that selecting the same number of nodes randomly tends to miss more entities, with a statistical difference from the random walk in phase 1 (misses from ontology 1: p < 0.8e-20, -0.769, 'large'; misses from ontology 2: p < 0.7e-17, -0.701, 'large').
Still measuring the difference from baseline 2, we compared the final precision, recall, and F-measure of the experiment against the baseline (p < 0.461, effect size: 0.008, 'negligible'; p < 0.3e-19, -0.802, 'large'; p < 0.9e-10, -0.527, 'large'). This suggests the precision had no statistical difference, while recall and F-measure showed a statistical difference in favor of our approach.
Considering the entire conference domain, the ten best recall configurations are shown in table 6.
Using the setup (2-2-1-0.7), we obtained the hits and misses after the semantic test. The reason to prioritize the best recall is that the final set selected before the semantic test, using the defined threshold, often suggests the same entities. For instance, ('Organizing_committee', '=', 'Organizing_Committee') is one selected
TABLE 6. Best recall setups.
match considered correct by the reference alignment. However, ('Organizing_committee', '=', 'Committee'), ('Organizing_committee', '=', 'Program_Committee'), and ('Organizing_committee', '=', 'Organizing_Committee_member') are also selected and are wrong matches according to the same reference. The three wrong matches harmed the calculated precision; therefore, at first glance, they seem a bad choice. Indeed, they are not, since they will be used to project only one entity from their original ontology: 'Organizing_committee'. Thus, many misses in the set of persisted entities may be repeated across pairs of candidate entities. This leads to our recall metric being underestimated at this point of the processing pipeline. On the other hand, prioritizing precision could select a set with only one (correct) suggestion, hence precision = 1.0, but projecting only one value to use later.
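The point above can be made concrete with a small calculation: several "wrong" candidate pairs that share a left-hand entity still project a single entity for the later phases, so pair-level precision understates the value of the selection. The pairs below reuse the example from the text; the reference set here is a one-element stand-in.

```python
candidates = [
    ("Organizing_committee", "Organizing_Committee"),         # correct
    ("Organizing_committee", "Committee"),                    # wrong
    ("Organizing_committee", "Program_Committee"),            # wrong
    ("Organizing_committee", "Organizing_Committee_member"),  # wrong
]
reference = {("Organizing_committee", "Organizing_Committee")}

# Pair-level precision counts every wrong pair against us...
pair_precision = len(set(candidates) & reference) / len(candidates)

# ...but the projection only keeps the distinct left-hand entities.
projected_entities = {left for left, _ in candidates}

print(pair_precision)        # 0.25: looks poor at pair level
print(projected_entities)    # yet only one entity is projected
```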
RQ1 Summary. Our findings suggest that it is possible to discover relevant entities using statistical data from the ontologies' structure. The best setup is the window-shift (2-2). The quartile showed no statistical influence, while the semantic threshold increased the recall when set to 0.7. We found that pre-processing the ontology can discover relevant nodes with an average recall of 75%.
B. RQ2: TO WHAT EXTENT DO THE RELEVANT ENTITIES IMPROVE THE MATCHING OF NETWORKS OF ONTOLOGIES?
Phase 3 carried out the experiments using some of the network configurations in table 3 and compared them with the results presented in [41] (the "blue," "red," and "gray" experiments).
1) GREEN EXPERIMENT
This section presents the results from the "green" experiment. The goal is to verify whether or not reinserting the projected entities back into the network improves the alignment metrics obtained in the "red" experiment. After running phase 1 (random walk) and phase 2 (frequent itemsets), we sent the results using the setup 2-2-1-0.7. We calculated the projections from each ontology in the networks that the SubInterNM approach would prune. For instance, in the 2 × 2 experiment we matched network 1 = {sigkdd, confof} with network 2 = {conference, confof}; since 'confof' was
TABLE 7. Processing time pairwise - individual cases (seconds.milliseconds).
pruned, we had first obtained projections from the 'confof' ontology using the matching suggestions from phase 2. Hence, 'confof' must be projected with the suggested entities from confof X conference and confof X sigkdd. Therefore, we executed the confof_P_conferenceXconfof and confof_P_confofXsigkdd operations (table 11). The same reasoning applies to the remaining experiments.
Next, we ran the unions on each network. Since network 1 had 'confof' pruned, we needed a union operation with the entities suggested as relevant by phase 2 to recover, at least partially, the missing results from 'confof X conference'. Thus, the operation needed is the union between network 1 and the result of the projection using 'confof' and the suggested entities from 'conference X confof', i.e., 'net1_220_U_confof_P_conferenceXconfof'.
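The projection and union steps above can be sketched set-theoretically. This is an illustrative stand-in for the SubInterNM algebra, not its actual implementation: the entity names and suggestion pairs below are invented, and ontologies are reduced to flat sets of entities.

```python
def projection(ontology_entities, suggested_pairs, side=0):
    """Project an ontology onto the entities appearing in candidate pairs."""
    suggested = {pair[side] for pair in suggested_pairs}
    return ontology_entities & suggested

confof = {"Paper", "Review", "Topic", "Chair"}                 # hypothetical ontology
suggestions = [("Contribution", "Paper"), ("Review", "Review")]  # phase-2 candidates

# Projection keeps only the entities phase 2 flagged as relevant...
confof_p = projection(confof, suggestions, side=1)

# ...and the union reinserts them into the pruned network fragment.
network1 = {"Author", "Document"}
recovered = network1 | confof_p
print(recovered)
```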
To compare the results with the metrics (precision, recall, and F-measure) obtained in [41], we created new rows in the tables starting with 'RW'. The final results were submitted to LogMap [21] and Alin [9] to compute the alignments. Tables 12 and 8 summarize the metrics and processing time for the "green" experiment.
2) PURPLE EXPERIMENT
We can observe in the results (from the "green," "red," and "blue" experiments) that the matchers performed poorly when computing alignments where entities from more than two ontologies are present. Therefore, we ran one more projection to split each ontology's entities. The goal of the "purple" experiment is to demonstrate the viability of the method and to explain why the "green" experiment did not overcome the "red" one. Indeed, we came back to pairwise matching, where the matchers only had to ingest a pair of ontologies. The items below show the projection operations executed in the RW 2 × 2 experiment. Therefore, we had to compute the execution time used to match the ontology fragments after the projection, as shown in table 7. For instance, from the final results 'net1_220_U_confof_P_conferenceXconfof' and 'net2_220_U_confof_P_confofXsigkdd' we created 'net1220_confof', which is the projection of 'net1_220_U_confof_P_conferenceXconfof' using 'confof'.
(net1_220_U_confof_P_conferenceXconfof) _P_confof
(net1_220_U_confof_P_conferenceXconfof) _P_sigkdd
(net2_220_U_confof_P_confofXsigkdd) _P_confof
(net2_220_U_confof_P_confofXsigkdd) _P_conference
The results in table 12 confirmed that the matchers are unable to compute alignments over networks of ontologies, probably because of the complexity of the structure. Looking at the rows 2 × 2 and RW 2 × 2, the metrics were even worse after the projections. However, when we split the entities in the final projection and send only pairs to the matchers, we surpass the metrics both individually and on average. The 2 × 2 case is an exception, possibly because its complexity is not enough to harm the results of the network matching. Also, the 2 × 2 experiment did not compute any correspondences from 'confof' X 'sigkdd' or 'confof' X 'conference'.
The matchers with brute force had better metrics. However, this was expected, since the network approaches compare fragments of the ontologies with some entities missing, some of which may be part of the reference alignment. While brute force had better metrics, it needed to compare every entity, whereas the proposed method avoided many comparisons. After the last projections returned the problem to pairwise matching, the results show the approach is feasible, and that current matchers must be improved to deal with more than pairs of ontologies at once.
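The final "purple" step above, which returns the problem to pairwise matching, amounts to enumerating the cross-network ontology pairs that the matcher will receive. This is a schematic sketch, not the paper's tooling; it reuses the 2 × 2 networks from the example.

```python
def pairwise_tasks(network1_ontologies, network2_ontologies):
    """Enumerate the cross-network ontology pairs sent to the matcher,
    skipping an ontology matched against itself (e.g. confof X confof)."""
    return [(o1, o2)
            for o1 in network1_ontologies
            for o2 in network2_ontologies
            if o1 != o2]

net1 = ["sigkdd", "confof"]
net2 = ["conference", "confof"]
print(pairwise_tasks(net1, net2))
```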
3) PROCESSING TIME ANALYSIS
Next, we compared the execution times. The goal is to show how the method might be helpful for extensive network matching tasks. Looking at figure 12, we can compare the execution time sliced by operation (the experiments' operations are shown in figure 5). As expected, the RW+FIS (RW) and RW+FIS+pairwise (RWPair) times are always higher than the corresponding SubInterNM (Sub) and brute force (All) times: the overhead caused by the many preprocessing steps impacted the execution time.
Nevertheless, due to the nature of the experiments and the need to track and manually check each intermediate result, our most extensive network had ten nodes, five in each network. Figure 12 shows that the brute force experiment (All) uses the matcher (blue bar - LogMap) intensively to compute all the possible alignments. The intersection and union operations can be run in advance, and both can also be run in parallel. The difference operations may likewise start in parallel for both networks, just after the union and intersection finish. The RW operation is also independent: RW results may be stored beforehand based on the best setup for the domain. The FIS may also run without the semantic check and be stored awaiting a future alignment. Even running the FIS with the semantic check, all the semantic checks can be run in parallel. Considering this particular case, we eliminated the union and the RW+FIS to observe the behavior of the execution time (figure 13).
FIGURE 12. Processing time experiments LogMap.
In the 2 × 2 experiment, it is notable that brute force (All), which had the second-best time, now has the worst time. The 5 × 5 experiment did not repeat this result, although brute force is in third place, surpassing only RW+FIS+Pairwise (SubRWPair). The projections are executed pairwise in a few milliseconds and are invisible in the figure. Tables 9 and 8 show the processing times for some experiments. Running the method with previously stored data from the union, RW, and FIS helps to decrease the total processing time. Despite saving 27 s on average, it is not enough to overcome brute force.
The fastest approach is running the matcher with the entire network grouped by the join ("blue") operation. However, it produces terrible metrics. SubInterNM ("red") is faster than the random walk + frequent itemsets ("green") approach, but can miss many pruned entities, and the matchers struggle to deal with more than two ontologies at a time. Both struggled to maintain good metrics. RW+FIS+Pairwise ("purple") had metrics close to brute force ("gray"), notably when using Alin [9]. On the other hand, the processing times suggest that Alin with RW+FIS outperforms Alin with brute force as networks grow. The same did not happen with LogMap. We limited our experiments to 5 × 5 networks due to the need to manually check the projections and network joins in the "green" and "purple" experiments; for more extensive networks, we must predict the results. LogMap is clearly faster than Alin and will be used for the predictions.
4) PREDICTION FOR LARGER NETWORK SIZES
Considering the increase in network sizes, we can predict how much those operations will increase the execution time. We aim to show that the proposed method may outperform
FIGURE 13. Processing time experiments LogMap without Union, Random Walk and Frequent Itemsets.
TABLE 8. Processing Time (seconds.milliseconds).
TABLE 9. Processing Time Proposed Method (using previously stored RW+FIS data) X Brute Force (All) (seconds.milliseconds).
the existing matchers when aligning the networks by brute force. We ran a regression to observe the evolution of the total execution time for more extensive networks. Our dataset consists of the number of nodes (nodes), the total number of network comparisons (net1Xnet2All), the execution time used by the matcher (LogTime or AlinTime), the number of entity
TABLE 10. Predicted processing times (ms). Orange = Purple experiment running some steps in parallel. (Figure 14).
comparisons (comparisonsAll), and the total execution time (LogAll or AlinAll).
We ran regressions using the faster matcher, LogMap, for the brute force experiment (All), the RW+FIS+Pairwise (without the union), and the SubInterNM. Our dataset consists of the data gathered in phases 1, 2, and 3. For each experiment, we registered the processing times for each step, including the matcher's, plus the number of ontologies to be compared, the number of nodes in the networks, and the number of entities being compared. The general regression model used was: lm <- lm(totalExecutionTime ~ networksNodes + ontologiesCompared + entitiesCompared + matcherTime). We found p-values of 0.003, 0.01, and 0.05, respectively, and adjusted R-squared values of 0.95, 0.77, and 0.74, respectively.
We discarded the entitiesCompared and matcherTime covariates, since we could not reject the null hypotheses H0: entitiesCompared = 0 and H0: matcherTime = 0 when using the ANOVA test to compare the full and reduced models. We then predicted the total execution time of the selected experiments for networks with 6 × 6, 7 × 7, 8 × 8, 9 × 9, 10 × 10, 20 × 20, and 30 × 30 nodes, using the predict function. For example, the 6 × 6 prediction was carried out with predict(lm, data.frame(ontologiesCompared = 36, networksNodes = 12)). Figure 14 shows the graphical comparison of the predicted execution times against the number of nodes. Table 10 shows the predicted processing times for the experiments varying from 6 × 6 to 30 × 30.
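The extrapolation above can be mirrored with an ordinary least-squares fit over the two retained covariates. This Python sketch stands in for the R lm/predict calls; the training rows are illustrative placeholders, not our measured times, and the fitted coefficients carry no meaning beyond the example.

```python
import numpy as np

# Design matrix columns: [intercept, networksNodes, ontologiesCompared]
X = np.array([[1, 4, 4], [1, 6, 9], [1, 8, 16], [1, 10, 25]], dtype=float)
y = np.array([20.0, 45.0, 80.0, 130.0])   # hypothetical total execution times (s)

# Least-squares estimate of the regression coefficients.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(nodes, compared):
    """Predicted total execution time for a given network size."""
    return float(np.array([1, nodes, compared]) @ beta)

print(predict(12, 36))   # e.g. a 6 x 6 network comparison, as in the text
```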
As the networks grow, the proposed approach becomes faster than brute force once the number of nodes exceeds 10. The SubInterNM is faster; however, it may suffer from the pruned entities. The method can be made even faster by identifying the relevant nodes before pruning them, running the union, RW, and FIS with previously stored results, and running the difference and intersection operations in parallel.
RQ2 Summary. Our findings suggest that finding relevant entities improved the matching, with metrics (precision, recall, and F-measure) close to those of the brute force approach and a shorter execution time. The metrics outperformed the SubInterNM method.
VI. DISCUSSION
What is the role of the random walk in discovering an entity's relevance?
TABLE 11. Projection and union operations in some experiments.
TABLE 12. Precision Recall and F-Measure (RW)+SubInterNM+LogMap (RW)+SubInterNM+Alin LogMap naive Alin naive.
The RW provides a way to understand the ontology's structure through the visited nodes (entities). The connections and centrality of a node tell us the node's relative importance to the structure [23], [45]: the more connected and central a node is, the more likely it is to be visited. However, there is no guarantee that a relevant node will participate in a reference alignment; the data provided by the structure alone is not enough, and the result depends on the semantics between the ontologies to be aligned. The RW seriously impacted the processing time. However, once we have the best setup for a domain of ontologies, we can store the result of the RW and run only the FIS once the ontology to be aligned is known.
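A minimal random-walk sampler over an ontology graph illustrates the intuition: central, well-connected entities accumulate more visits. The graph below is a toy adjacency dictionary, and restart/damping details of the actual implementation may differ.

```python
import random
from collections import Counter

def random_walk_visits(graph, start, steps, seed=0):
    """Count node visits along a simple random walk; restart on dead ends."""
    rng = random.Random(seed)
    node, visits = start, Counter()
    for _ in range(steps):
        visits[node] += 1
        neighbors = graph.get(node) or [start]
        node = rng.choice(neighbors)
    return visits

graph = {                      # toy ontology structure
    "Conference": ["Paper", "Chair"],
    "Paper": ["Author", "Conference"],
    "Author": ["Paper"],
    "Chair": ["Conference"],
}
visits = random_walk_visits(graph, "Conference", steps=1000)
print(visits.most_common(2))   # central nodes are visited more often
```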
What is the role of the frequent itemsets in discovering an entity's relevance?
The frequent itemsets (FIS) step compares entities from pairs of ontologies using different RW setups to complement the RW method. The comparisons can be
TABLE 13. Precision, recall, and F-measure - pairwise individual cases.
FIGURE 14. Predicted Processing time and number of nodes - Brute Force x SubInterNM x LogMap Pairwise.
made in two ways: with or without the semantic check. The semantic check provides a way to filter out pairs of candidates using a semantic threshold, improving the precision of the persisted pairs. As the RW can select many nodes, this is an important step to avoid using all the discovered ones. The semantic check, however, poses a performance problem: since we need to know which ontology we will match, the operation cannot be run in advance. It is possible, though, to gather the relevant entities using the RW and FIS without the semantic check, which permits running them in advance. The side effect is a bigger set of candidates, improving the recall but harming the precision. Figure 12 shows the average execution time for each operation; the RW+FIS execution time is in orange. In future work, we can explore to what extent bypassing the semantic check harms the precision.
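The frequent-itemset step can be sketched as follows. This is a toy, pure-Python support count restricted to 2-itemsets for illustration (the paper uses mlxtend's apriori implementation [1]): the bags produced by the random walks are treated as transactions, and itemsets above a minimum support are kept as candidate co-occurrences.

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions, min_support):
    """Return 2-itemsets whose support meets min_support."""
    counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Hypothetical bags gathered from random-walk paths.
bags = [("Paper", "Author"), ("Paper", "Author", "Review"), ("Review", "Chair")]
print(frequent_pairs(bags, min_support=0.5))   # only ('Author', 'Paper') is frequent
```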
Which metric must we prioritize?
Baseline 2 showed that selecting entities by randomizing the choice can retrieve a representative set of relevant entities. Nevertheless, the price is poor precision and an extensive set to be used in the projection, increasing the execution time of subsequent pipeline steps. The RW and the semantic check provide a way to keep the set smaller than baseline 2's. Interestingly, this may suggest that precision is the most critical metric to pursue. Surprisingly, it is not precision that should be prioritized when using the semantic check in FIS. When indicating pairs of candidate entities for semantic verification, an entity invariably appears in several pairs (e.g., (document x conference-document), (document x paper), (document x abstract)). Thus, a single indication of 'document' as a candidate entity impacts the precision. However, if we try to prioritize precision alone, dropping all the 'document' suggestions, for instance, we will have cases where we indicate few pairs for phase 3, achieving an almost perfect precision, close to 1, but with few relevant entities to save from pruning by the subsequent intersection/difference operations. So, tuning the semantic check close to 0.9 had this side effect of discarding too many candidates: it reached higher levels of precision but limited the number of entities in the projection operation too much.
Is the processing overhead worth it?
The random walk and the frequent itemsets may run in parallel with the SubInterNM, since they do not depend on each other. Even the SubInterNM can parallelize the union and intersection operations and then join to process the differences. We ran the projections and the unions (after the SubInterNM) using the ontology manager tab interface and took the average of six operations using the ontologies and partial results. The RW processing time was averaged over all domain ontologies. The projections are fast operations and will not increase the processing time substantially.
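The independence argued for above can be sketched with a thread pool: the RW+FIS preprocessing and the algebraic operations are dispatched concurrently and joined before the difference step. The task functions are stand-ins, not the actual implementations.

```python
from concurrent.futures import ThreadPoolExecutor

def run_rw_fis(ontology):
    """Stand-in for the random walk + frequent itemsets preprocessing."""
    return f"relevant({ontology})"

def run_union(net):
    """Stand-in for the SubInterNM union operation."""
    return f"union({net})"

# Both tasks are independent, so they can run concurrently; the results
# are joined before the (dependent) difference operations would start.
with ThreadPoolExecutor() as pool:
    rw = pool.submit(run_rw_fis, "confof")
    un = pool.submit(run_union, "net1")
    results = (rw.result(), un.result())

print(results)
```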
The RW+FIS can be parallelized independently of the number of ontologies in each network. Thus, an increased number of ontologies will not affect the final processing time.
Also, the proposed method used three phases with different tools and was compared with robust, reliable matchers that have competed in the OAEI for many years. The random walk (RW), frequent itemsets (FIS), and SubInterNM can be integrated into a single application to improve the processing time by running all possible steps in parallel.
As the number of nodes increases, the approach reduces the processing time and keeps the metrics close to those of the brute force approach. Current matchers are not prepared to handle alignments with more than two ontologies. Indeed, this limits our proposal, forcing one more operation to return to the pairwise comparison. Thus, as soon as the matchers surpass that limitation, the proposed approach will be even better, as can be observed by comparing LogMap's contribution to the execution time (in light blue) across the SubRW x SubRWPair experiments (figure 12). Even though the final pairwise operation (in fact, a projection) does not harm the overall execution time, the matcher needs to return to the brute force approach, at least with smaller (pruned) ontologies.
VII. LIMITATIONS
This study is limited to ontologies from the conference domain. Since a structural approach is used to discover the relevant nodes, and the subsequent selection is carried out by the frequent itemsets algorithm based on the possible bags, the approach is expected to generalize to other domains, as it is not tied to domain characteristics. Variations in graph shape and number of axioms should cause no harm to the RW+FIS; nevertheless, verifying this is an excellent direction for future work. Another restriction must be noted: ontologies like Mouse and Human, used by the OAEI initiative, should not respond to the semantic check because they have entities with numeric IDs.
The approach is only helpful when entities are shared across the networks; networks without entities in common will not benefit from it. Small networks, or more extensive ones composed of small ontologies, may not be worth the processing overhead caused by the implemented pipeline. This leads us to another interesting direction for future work: how can we know in advance when the approach is worth using? The implemented semantic check can be improved to better select the candidates. However, improving it too much may lead us to implement a whole matcher; so, to what extent may the approach be improved before it becomes a matcher?
Even though the predictions showed the method is worthwhile when more extensive networks need to be aligned, many different network shapes and ontologies can change the results, as would probably occur in real cases where one company buys another and wants to integrate their complex systems. We must investigate this more deeply in future work.
VIII. CONCLUSION
In this article, we carried out a case study to investigate the extent to which we can identify relevant entities in the context of aligning networks of ontologies. To support our research, we needed a set of well-defined ontologies, reliable reference alignments, and proven matchers. We used a dataset composed of ontologies from the "conference" domain and matchers that participate in the OAEI competitions.
The method prevents relevant concepts from being discarded by using the notion of the relative importance of a concept within the ontology's structure. Structural relevance is discovered by a sampling process using random walks, and the results are submitted to an association-rule learning algorithm: frequent itemsets.
A previous study pruned identical entities to save time. We could identify and keep the relevant entities to present them again to the matcher. The method showed significant results compared with the brute force and algebraic approaches, retrieving balanced metrics while presenting a shorter execution time than the brute force approaches for more extensive networks. The SubInterNM was faster but pruned some relevant nodes that may be present in the reference alignment.
In future work, we will test the method with networks of ontologies from diverse domains to verify to what extent it generalizes. Identifying the relevant nodes before the SubInterNM allows us not to prune them in the difference operation, helping to save processing time.
Three tools were created and are available as free, open-source software in the replication package. We employed six ontologies from the conference domain, used by the OAEI initiative to compare matchers' performance, in our experiments. The results suggest the method's best setup to identify the relevant entities in the conference domain.
The results can be applied to system integration problems where many entities must be compared, avoiding the full Cartesian product (pairwise) check. Incidentally, we discovered that current matchers cannot retrieve good metrics when comparing more than two ontologies. As real-world integration problems typically involve more than a couple of systems, the results suggest an opportunity to be explored by researchers working on ontology matching problems.
REFERENCES
[1] Mlxtend Machine Learning Extensions. Accessed: May 29, 2021. [Online].
Available: http://rasbt.github.io/mlxtend/
[2] OAEI Ontology Alignment Evaluation Initiative. Accessed: May 29, 2021.
[Online]. Available: http://oaei.ontologymatching.org/
[3] Spacy Industrial-Strength NLP. Accessed: May 29, 2021. [Online].
Available: https://pypi.org/project/spacy/
[4] D. Apiletti, E. Baralis, T. Cerquitelli, P. Garza, F. Pulvirenti, and
L. Venturini, ‘‘Frequent itemsets mining for big data: A comparative anal-
ysis,’ Big Data Res., vol. 9, pp. 67–83, Sep. 2017.
[5] B. Boehm, ‘‘A view of 20th and 21st century software engineer-
ing,’ in Proc. 28th Int. Conf. Softw. Eng., May 2006, pp. 12–29, doi:
10.1145/1134285.1134288.
44658 VOLUME 10, 2022
F. Santos, C. E. Mello: Matching Network of Ontologies: Random Walk and Frequent Itemsets Approach
[6] M. Bouakkaz, Y. Ouinten, S. Loudcher, and P. Fournier-Viger, ‘Efficiently
mining frequent itemsets applied for textual aggregation,’’ Int. J. Speech
Technol., vol. 48, no. 4, pp. 1013–1019, Apr. 2018.
[7] X. Cao, H. Chen, X. Wang, W. Zhang, and Y. Yu, ‘‘Neural link prediction
over aligned networks,’’ in Proc. 32nd AAAI Conf. Artif. Intell., 2018,
pp. 249–256.
[8] M. A. Casanova and R. C. Magalhães, ``Operations over lightweight ontologies and their implementation,'' in Implicit and Explicit Semantics Integration in Proof-Based Developments of Discrete Systems. Tokyo, Japan: Springer, 2020, pp. 61–82.
[9] J. Da Silva, K. Revoredo, F. Baião, and J. Euzenat, ‘‘ALIN: Improving
interactive ontology matching by interactively revising mapping sugges-
tions,’ Knowl. Eng. Rev., vol. 35, p. e1 and 22, Jan. 2020.
[10] J. Euzenat, ‘‘Revision in networks of ontologies,’’ Artif. Intell., vol. 228,
pp. 195–216, Apr. 2015. [Online]. Available: https://ftp.inrialpes.fr/pub/
exmo/publications/euzenat2015a.pdf
[11] J. Fitzgerald, S. Foster, C. Ingram, P. G. Larsen, and J. Woodcock, ‘‘Model-
based engineering for systems of systems: The compass manifesto,’ COM-
PASS Interest Group, Tech. Rep. 1, 2013.
[12] T. R. Gruber, ‘Toward principles for the design of ontologies used for
knowledge sharing?’’ Int. J. Hum.-Comput. Stud., vol. 43, nos. 5–6,
pp. 907–928, Nov. 1995.
[13] T. Gruber, L. Ling, and O. M. Tanner, Ontology Definition Encyclopedia
of Database Systems. New York, NY, USA: Springer Verlag, 2008.
[14] N. Guarino and P. Giaretta, ‘Ontologies and knowledge bases,’ Towards
Very Large Knowl. Bases, vol. 4, pp. 1–2, Oct. 1995.
[15] F. Hamdi, B. Safar, C. Reynaud, and H. Zargayouna, ‘Alignment-based
partitioning of large-scale ontologies,’ in Advances in Knowledge Discov-
ery and Management. Springer, 2010, pp. 251–269.
[16] N. Hareshkumar and D. Garg, ‘‘Random web surfer pagerank algorithm,’’
Int. J. Comput. Appl., vol. 975, p. 8887, Oct. 2011.
[17] J. Heaton, ‘‘Comparing dataset characteristics that favor the Apriori, Eclat
or FP-Growth frequent itemset mining algorithms,’ in Proc. SoutheastCon,
2016, pp. 1–7.
[18] W. Hu, Y. Qu, and G. Cheng, ‘‘Matching large ontologies: A divide-and-
conquer approach,’ Data Knowl. Eng., vol. 67, no. 1, pp. 140–160, 2008.
[19] H. Jeong and B.-J. Yoon, ‘Accurate multiple network alignment through
context-sensitive random walk,’’ BMC Syst. Biol., vol. 9, no. 1, p. S7, 2015,
doi: 10.1186/1752-0509-9-S1-S7.
[20] E. Jiménez-Ruiz, ‘‘Logmap family participation in the OAEI 2019,’’ in
Proc. CEUR Workshop, 2019, pp. 1–4.
[21] E. Jiménez-Ruiz and G. B. Cuenca, ‘‘LogMap: Logic-based and scalable
ontology matching,’ in Proc. Int. Semantic Web Conf. Bonn, Germany:
Springer, 2011, pp. 273–288.
[22] K. Kalecky and Y.-R. Cho, ‘‘PrimAlign: PageRank-inspired Markovian
alignment for large biological networks,’’ Bioinformatics, vol. 34, no. 13,
pp. i537–i546, Jul. 2018, doi: 10.1093/bioinformatics/bty288.
[23] K. Kempf-Leonard, ‘‘Encyclopedia of social measurement,’’ Tech. Rep.,
2004.
[24] W. Kuśnierczyk, ``Taxonomy-based partitioning of the gene ontology,'' J. Biomed. Informat., vol. 41, no. 2, pp. 282–292, Apr. 2008.
[25] S. Lambrini and K. Achilles, ‘‘Composable relations induced in networks
of aligned ontologies: A category theoretic approach,’ Axiomathes, vol. 25,
no. 3, pp. 285–311, Sep. 2015.
[26] M. Lenzerini, ‘‘Data integration: A theoretical perspective,’ in Proc.
21st ACM SIGMOD-SIGACT-SIGART Symp. Princ. Database Syst., 2002,
pp. 233–246.
[27] L. Lovász, ‘‘Random walks on graphs,’’ Combinatorics, vol. 2, pp. 1–46,
Apr. 1993.
[28] R. C. Magalhães, M. A. Casanova, B. P. Nunes, and G. R. Lopes, ‘‘On the
implementation of an algebra of lightweight ontologies,’ in Proc. 21st Int.
Database Eng. Appl. Symp., 2017, pp. 169–175.
[29] M. W. Maier, ‘‘Architecting principles for systems-of-systems,’’Syst. Eng.,
vol. 1, no. 4, pp. 267–284, 1998.
[30] N. Masuda, M. A. Porter, and R. Lambiotte, ``Random walks and diffusion on networks,'' Phys. Rep., vols. 716–717, pp. 1–58, Nov. 2017.
[31] I. O. B. Mountasser and B. Frikh, ‘‘Parallel Markov-based clustering
strategy for large-scale ontology partitioning,’’ in Proc. KEOD, 2017,
pp. 195–202.
[32] R. Muliono, Muhathir, N. Khairina, and M. K. Harahap, ``Analysis of frequent itemsets mining algorithm against models of different datasets,'' J. Phys., Conf. Ser., vol. 1361, no. 1, Nov. 2019, Art. no. 012036.
[33] N. F. Noy and M. A. Musen, ‘Specifying ontology views by traversal,’’ in
Proc. 3rd Int. Semantic Web Conf., vol. 3298. Hiroshima, Japan: Springer,
2004, pp. 713–725.
[34] P. Ochieng and S. Kyanda, ‘‘Large-scale ontology matching: State-of-the-
art analysis,’ ACM Comput. Surveys, vol. 51, no. 4, pp. 1–35, Jul. 2019.
[35] L. Page, S. Brin, R. Motwani, and T. Winograd, ‘‘The pagerank citation
ranking: Bringing order to the web,’ Stanford InfoLab, Stanford, CA,
USA, Tech. Rep. 422, 1999.
[36] M. A. N. Pour, ‘‘Results of the ontology alignment evaluation initia-
tive 2021,’’ in Proc. 16th Int. Workshop Ontol. Matching Co-Located,
vol. 3063, P. Shvaiko, J. Euzenat, E. Jiménez-Ruiz, O. Hassanzadeh,
and C. Trojahn, Eds., 2021, pp. 62–108. [Online]. Available: http://ceur-
ws.org/Vol-3063/oaei21_paper0.pdf
[37] E. Rahm, ‘‘Towards large-scale schema and ontology matching,’ in
Schema Matching Mapping. Berlin, Germany: Springer, 2011, pp. 3–27.
[38] J. Romano, J. Kromrey, J. Coraggio, and J. Skowronek, ``Appropriate statistics for ordinal level data: Should we really be using t-test and Cohen's d for evaluating group differences on the NSSE and other surveys,'' in Proc. Annu. Meeting Florida Assoc. Inst. Res., 2006, pp. 1–3.
[39] F. Santos, K. Revoredo, and A. F. Bai, ‘‘Paving a research roadmap on
network of ontologies,’ in Proc. 12th Int. Workshop Ontol. Matching
Co-Located 16th Int. Semantic Web Conf. (ISWC), 2017, vol. 7, no. 4,
pp. 396–420.
[40] F. Santos, K. Revoredo, and F. Baiao, ‘‘A proposal for optimizing internet-
work matching of ontologies,’ in Proc. ISWC Workshop, 2018, p. 71.
[41] F. Santos, K. Revoredo, and F. Baiao, ‘‘SUBINTERNM: Optimizing the
matching of networks of ontologies,’ in Proc. Matching, 2020, p. 77.
[42] A. Schlicht and H. Stuckenschmidt, ‘‘Criteria-based partitioning of large
ontologies,’ in Proc. 4th Int. Conf. Knowl. Capture, 2007, pp. 171–172.
[43] M. H. Seddiqui and M. Aono, ‘‘An efficient and scalable algorithm for
segmented alignment of ontologies of arbitrary size,’ J. Web Semantics,
vol. 7, no. 4, pp. 344–356, Dec. 2009.
[44] P. Shvaiko and J. Euzenat, ‘‘Ontology matching: State of the art and future
challenges,’ IEEE Trans. Knowl. Data Eng., vol. 25, no. 1, pp. 158–176,
Jan. 2013.
[45] Y. Wang, Z. Di, and Y. Fan, ‘‘Identifying and characterizing nodes impor-
tant to community structure using the spectrum of the graph,’ PLoS ONE,
vol. 6, no. 11, Nov. 2011, Art. no. e27418.
[46] C. Xiang, B. Chang, and Z. Sui, ‘‘An ontology matching approach based
on affinity-preserving random walks,’’ in Proc. 24th Int. Joint Conf. Artif.
Intell., 2015, pp. 1–7.
[47] R. K. Yin, Case Study Research: Design and Methods (Applied Social Research Methods Series). Beverly Hills, CA, USA: Sage, 1984.
FABIO SANTOS received the B.S. and M.S. degrees in informatics (databases) from the Pontifícia Universidade Católica do Rio de Janeiro, Brazil, in 2002. He is currently pursuing the Ph.D. degree in computer science (information systems, knowledge modeling and reasoning) with the Universidade Federal do Estado do Rio de Janeiro, Brazil. He was a developer, DBA, project manager, and IT superintendent in the Brazilian Navy. His research interests include knowledge modeling to support the integration of systems of systems and networks of ontologies.
CARLOS E. MELLO received the M.Sc. degree in engineering from the Universidade Federal do Estado do Rio de Janeiro, Brazil, in 2008, and the Ph.D. degree in computer science from the École Centrale Paris, France, in 2013. He is currently an Associate Professor at the Universidade Federal do Estado do Rio de Janeiro, with more than ten years of experience conducting data science projects. He has advised and co-advised more than 15 graduate students at the master’s and Ph.D. levels. In recent years, he has focused on developing data science for social good projects with the Brazilian government and the private sector. He leads the Research Group on Data Science for Social Welfare at the Universidade Federal do Estado do Rio de Janeiro (UNIRIO).