A Probabilistic-Statistical Approach to Detection of Semantic Relations between Indexing Terms
I.V. Kozlova [0000-0003-0940-942X] and M.B. Saidakhmedova [0000-0002-1063-6825]
Plekhanov Russian University of Economics, 36 Stremyanny lane, Moscow, Russia
Abstract. An approach to dynamic structurization of the information space based on the analysis of semantic relations between indexing terms is proposed. Mathematical methods of pattern recognition are adapted to the problem of semantic relation analysis. A justification and verification of the hypothesis about the statistical distinctiveness of semantic relation classes in a subject-oriented database are presented. The main types (classes) of semantic relations between indexing terms in a problem-oriented database are defined, and the statistical characteristics of the selected classes of semantic relations are given. The experiment revealed the following property: in the classes of hierarchical relations, the distribution of probabilities of occurrence of term pairs with relative frequency fij obeys the lognormal law. Analytical expressions for the lognormal equalizing distribution curves were derived. Based on these expressions, a probabilistic-statistical model of semantic relations and an algorithm for their detection were developed. An experimental study has shown the effectiveness of the proposed approach for identifying semantic relations of the hierarchical type.
Keywords: Information resources, Statistical methods, Semantic relations, Indexing terms, Probabilistic-statistical model, Detection algorithm, Lognormal distribution.
1 Introduction
The effectiveness of managerial decision-making on the development of science should rest on an objective analysis of data about the structure of modern science and its components, the regularities of its development, and the characteristics of existing and emerging scientific trends. Such analysis is based on the system properties of documentary information resources, which serve as an "information model" of a scientific research area and accumulate in databases the most valuable experience and knowledge of the world scientific community.
The results of such analysis can be used for structuring areas of scientific research and academic disciplines, in the development of curricula and programs, and in the construction of thesauruses.
At present, the most widespread methods are those of structural-thematic analysis of information resources, in which connections between objects, correlations and classifications are revealed. The articles [1, 2] consider approaches based on the formal analysis of the world publication flow and joint citation to build "maps of science". These maps are sets of clusters of thematically similar terms. A significant limitation of these methods is the use of only statistical relations between terms and the lack of analysis of semantic relations reflecting the objectively existing links and relations between the objects of the subject area.
The information model of the subject area is represented in the form of a semantic network that reflects the subject area semantics in the form of concepts and relations. The number of relationship types in the semantic network is determined by the specific goals and objectives. Such relations are not explicitly present in modern information resources.
The following tasks must be solved to analyze the semantic relations between the indexing terms of documentary information resources and databases:
- to identify the main classes of semantic relations necessary for structuring the field of scientific research;
- to develop a method for detecting semantic relations;
- to determine a rule or a function assigning a term pair to a certain class of relations.
Solving these tasks will make it possible, on the one hand, to extract groups of terms belonging to a certain class of semantic relations, such as object − its parts, object − its areas of application, object − method, etc. On the other hand, it will increase the efficiency of information retrieval based on the storage of the identified semantic relations between terms.
We present a new approach to the evaluation of semantic relations between indexing terms in a SODB based on the developed probabilistic-statistical model. The proposed model considers two main aspects: (1) the property, proved in this paper, of the statistical distinctiveness of classes of relations between terms in the SODB; (2) modeling the behavior of a term pair within a selected class of relations, characterized by the frequency of term co-occurrence fij and the relative frequencies n(fij) of the occurrence of fij. To develop the model, we defined classes of semantic relations, calculated the a priori probabilities P(Kj), and studied the joint occurrence probabilities of terms within a given class. As a result, analytical expressions in the form of lognormal curves were obtained for the conditional probabilities that a term pair belongs to a certain class of semantic relations. This made it possible to determine the a posteriori probability of each class (the product of the a priori probability of a class and the conditional probability that a pair of terms belongs to this class) and to apply a Bayes decision rule to linguistic objects. A promising aspect of this approach is the relative ease of automating the detection of hierarchical semantic relations based on Algorithm 1 proposed in this paper.
The originality of our approach lies in the application of the mathematical apparatus of pattern recognition to the analysis of semantic relationships between indexing terms. A significant advantage of this study is that no external information about semantic relationship classes, such as WordNet, is used. The indexing terms are already present in the database documents, and the thematic structure of the subject domain is implicitly contained in the SODB. Note also that the analyzed list of semantic relations necessary for structuring the subject domain differs from the types of relations in the sources discussed below.
2 Related works and theoretical analysis
The task of constructing the thematic structure of a subject area using databases is based on the following principles [1]:
- formalized representation of the thematic structure as a semantic network;
- the use of subject-oriented databases (SODB);
- evaluation of the joint occurrence of indexing terms in the SODB;
- determination of the probability characteristics of the co-occurrence of terms in the SODB;
- the use of two types of semantic relations: hierarchical and non-hierarchical.
The analysis of term pairs connected by certain semantic relations amounts to finding and extracting the features that characterize these pairs and then assigning each pair to one of the given classes of relations. This resembles the situation in pattern recognition systems.
There are two types of methods for extracting semantic relationships: statistical
and linguistic. The first ones calculate the statistical characteristics of the terms ac-
cording to their occurrence in the texts, then apply mathematical methods to deter-
mine the degree of proximity of the terms. In most cases, statistical methods find only
associative relations between terms.
The book [3] discusses the application of various statistical methods to texts. To
emphasize the quantitative aspects of semantics and the possibility to study them, the
authors introduce the concept of statistical semantics and give an overview of the
methods used in this field. Thus, the focus is on the methods of investigating semantic
relations between words based on empirical corpus data. As an example of the appli-
cation of statistical methods to texts, the authors cite statistical tests to determine the
differences in two sets of texts, prediction of variables from the text, generalization,
and graphical illustration of the text.
A statistical method for the automatic construction of associative relationships for thesauruses of several languages is proposed in [4, 5]. In [5], an overview of approaches to discovering semantic relations, such as distributional similarity, is given. Similarity is evaluated based on frequency characteristics, and the semantic relations obtained are of the associative type. The application of NLP methods to the creation of a thesaurus by considering the information contained in external domain-oriented semantic sets is discussed. Frequency characteristics, ranking (Zipf's law) and expert evaluations are used to evaluate the semantic significance of terms, and the contextual environment of a term is used to identify semantic relations (associative, synonymy and causative relations). The authors use the BERT neural network as a binary classifier for term prediction: next sentence prediction, question answering, named entity recognition, etc. However, it is noted that its training is expensive and time-consuming.
The detection of semantic relations between words based on co-occurrence fre-
quency using statistical methods and several measures of semantic similarity is con-
sidered in [6]. The word co-occurrence statistics is provided by the "document-term"
matrix, which describes the distribution of words in the texts of the training collection.
Every individual word or word combination is represented as a vector. This approach provides a qualitative way of identifying only associative relationships and makes it possible to build a vocabulary in the form of an associative network, while identifying synonymy and hierarchical relationships requires the use of linguistic methods.
Based on statistics, a translation system SMT (Statistical Machine Translation) [7]
was developed, which analyzes the most frequent fragments of sentences in the orig-
inal and in the translation. Based on the method of automatic identification of seman-
tic relations between terms and statistical information from the Internet, an associative
network using the English dictionary is automatically constructed [8].
Recently, linguistic methods have been used to identify the semantic relations necessary for semantic networks, ontologies, and text sentiment analysis, as well as in machine translation systems. The WordNet system predominates here. The basic unit in WordNet is the synset, which combines words with similar meanings into a node of the semantic network. Each synset contains a list of synonyms and pointers to relationships with other synsets. Synonymy and associative relationships are also extracted from resources (full texts, dictionaries, thesauruses, ontologies, etc.), or linguistic rules or templates are applied [9].
In the paper [10] WordNet is complemented by a deep semantic topology function
based on the Neo4J graph database, which provides a visualization of its internal data
and a three-dimensional storage structure in the form of graphs with proprietary algo-
rithms for data retrieval.
Many studies on the semantic analysis of text using neural networks have appeared recently. Semantic analysis of text is a multi-step process implemented using artificial
intelligence and machine learning algorithms. Semantic analysis refers to the process
of understanding natural language (text) by extracting useful information, such as
context, emotions, feelings, from unstructured data [11]. It analyzes the grammatical
format of sentences, including the arrangement of words, phrases, and sentences, to
determine the relationships between independent terms in a particular context. This is
the most important task of NLP systems.
The process of semantic analysis begins with the analysis of dictionary definitions
and meanings of individual words. The relationships between words in a sentence
provide a clear understanding of the context. The main relationships for understand-
ing the meaning of a text are hyponymy, meronomy, polysemy, synonymy, anton-
ymy, and homonymy [12]. Understanding these terms is critical to NLP programs that
seek to extract meaning and information from textual data.
The paper [13] describes how to detect semantic relations, including synonym, hy-
ponym and hypernym relations based on WordNet and entities of Knowledge Graph.
The authors used NLP and Deep Learning for processing data before putting into
Knowledge Graph.
Of interest are methods of text tonality classification based on machine learning and tonality dictionaries using support vector machines and the Bayesian classifier [14, 15]. Various statistical characteristics are used: TF-IDF, mutual information, the Gini coefficient, the Kullback-Leibler distance, the χ2 criterion, etc. [16]. Associative connectivity
measures and their effectiveness are investigated when calculating the strength of
connectivity of the word combinations components within bigrams and trigrams.
The article [17] shows how semantic analysis is used by search engines to extract
semantic meaning from a query with the TF-IDF method and Zipf's law. The statistical measure TF-IDF (term frequency − inverse document frequency) is used to analyze the significance of a word in a document and to reduce the weight of words that occur often in the text and do not carry a meaningful load. A word matrix is built, in which each cell denotes a word frequency in a contextual phrase. Each word is represented by a vector, and the semantic properties of a word are determined by its contexts. This makes it possible to establish the synonymy and association of words, improve search efficiency, and compare arbitrary texts. In addition, the TF-IDF algorithm makes it possible to determine the weight of a word and is used to extract keywords from text [18].
In the past few years, neural word embeddings have been widely used in text min-
ing. Relationship identification plays an important role in many NLP applications,
such as question answering, text matching recognition, and discourse comprehension.
A new approach for evaluating the semantic relations in word embeddings using
external knowledge bases: Wikipedia, WordNet, and Unified Medical Language Sys-
tem (UMLS) was proposed in [19]. Most of the embedding models used the words
surrounding the target word as the context based on the assumption that words with a
similar context have the similar meanings. The authors trained multiple word embed-
dings using health-related articles in Wikipedia. They used the semantic relations in
WordNet and UMLS, which covered most of the commonly used semantic relations
(synonyms, antonyms, hypernyms, hyponyms, homonyms, meronyms). Also, they
compared the distribution of relation terms in the nearest neighbors of a word in dif-
ferent word embeddings.
Recently, research has been conducted on the detection of hypernymy/hyponymy
as an example of lexical-semantic relations. Relationship classification is performed
in a supervised way, using a dataset labeled with a certain number of relationship
classes. The type of semantic relationship between pairs of words is recognized ac-
cording to the information about their co-occurrence in the corpus. These methods
use the dependency path between word pairs as an input feature and have some limi-
tations for word pairs not occurring in context. However, other types of relations have
been less studied, and the results of semantic relation classification are not sufficiently
acceptable.
An ensemble of models to improve the predictive performance of relationship clas-
sification is presented in [20]. To gain an advantage among predictions of separately
trained models, the authors use an ensemble of two artificial neural networks. The
input of the model is a concatenation of embedding vectors of words corresponding
to target word pairs, and the output is a class label predicted based on the learned
representation of the semantic relation distribution between words in vector space.
The paper [21] discusses latent relational analysis (LRA) based on the latent rela-
tion hypothesis. It argues that pairs of words that occur together in similar lexico-
syntactic models have similar semantic relations. Authors presented NLRA, which
learns the distributed representation of word pairs capturing semantic relational infor-
mation through co-occurring patterns encoded by the neural network LSTM. This
model jointly learns the mapping from the word embedding space into the word-pair
embedding space to generalize co-occurrences of word pairs and patterns. With this
mapping, NLRA can generalize joint embeddings of word pairs and patterns and ob-
tain relational embeddings for arbitrary word pairs even if they do not occur in the
corpus.
The paper [22] proposed a method similar to [21] for lexical semantic relation detection. This neural method models the co-occurrences of word pairs and the dependency paths connecting two words to alleviate the data sparseness problem of pattern-based lexical semantic relation detection. While [22] assigns randomly initialized embeddings to each dependency path, [21] encodes co-occurring patterns with an LSTM neural network for better generalization.
As shown in the article [23], lexical relations in the vector space of words are de-
fined by a set of vector shifts between pairs of words. The existence of these relations
was established by means of confirmatory analysis, in which a pair of relations con-
structed based on previous knowledge confirms its existence in the space by analogy.
The authors investigate the clustering of lexical relations using word vectors extracted
from the neural word embedding techniques (BERT, Word2Vec, FastText). In addi-
tion, the paper shows that using a simple nearest-neighbor approach, new examples
of known relations can be discovered in the displacement space. This approach fo-
cuses on the study of emerging patterns of trends in social networks and related mes-
sages.
Of interest is an article [24] in which the author explored the distinction between
the three paradigmatic semantic relations of synonymy, antonymy, and hypernymy.
She expected that differences in the naturalness of relations across word classes
should be reflected by how humans perceive and distinguish semantic relatedness,
and by how successful standard distributional approaches are in modelling semantic
relatedness. For the calculations, the author uses distributional similarity estimates
derived from standard vector space models derived from a large data set for German
(GermaNet). Two types of vector space models using window and lexico-syntactic
patterns were created, a vector space for the word-relation pairs was induced. The
paper shows that the most significant types of features and their success varied ac-
cording to word classes and types of paradigmatic relations. Thus, for both human
and automatic approaches, a reliable distinction between relations depends on the combination of word class and relation type.
Some papers publish data on the estimated accuracy of the proposed models. For
example, in [21] the accuracy of the model reaches 0.47. According to [20], the model
accuracy estimate was 0.894 when predicting relationships between FarsNet word
pairs and 0.768 when predicting Princeton Wordnet relationship classes. In this study,
the accuracy of Algorithm 1: Detection of semantic relations of hierarchical type
reaches a value of 0.74, which is comparable with the accuracy results of the other
researchers’ models.
The analysis of publications has shown that the variety of tasks in semantic relation analysis is very large, and the set of relation types considered is equally diverse. One thing is clear: the types of semantic relations depend on the task set by the researcher and the semantic network being developed.
This article presents a probabilistic-statistical approach to the analysis of hierar-
chical semantic relationships between information resource indexing terms, based on
the ideas of pattern recognition. In this study, the problem statement is as follows. We
proceed from the probabilistic nature of recognizable objects − terms pairs, charac-
terized by co-occurrence frequency x=fij in SODB and the corresponding probabilities
of occurrence of these frequencies. The statistical distinguishability of relationship
classes in the SODB emphasizes the usefulness of using statistical methods to recognize the types of relationships between terms [25].
In this case, the decision on whether a term pair belongs to one of the two distinguished classes of semantic relations is based on a Bayes decision rule minimizing the average risk [26]. It includes P(Kj), the a priori probability of occurrence of class Kj, and P(X/Kj), the conditional probability of occurrence of the term pair πi in class Kj. The decision rule for the two classes K1 and K2 takes the form:

πi ∈ K1 if and only if P(K1) P(X/K1) > P(K2) P(X/K2).   (1)
A statistical study of sample populations of term pairs allowed us to identify the main
classes of stable semantic relations in SODB, to obtain quantitative estimates of prob-
abilities of occurrence of the selected relationship classes and distributions of proba-
bilities of co-occurrence frequencies in the relationship classes of the hierarchical
type.
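As a minimal illustration of rule (1), assuming only the four probability values as inputs, the decision reduces to comparing two products (the function name below is ours, not the authors'):

def bayes_decision(p_k1, p_x_given_k1, p_k2, p_x_given_k2):
    # Decision rule (1): assign the term pair to class K1 if
    # P(K1) * P(X/K1) > P(K2) * P(X/K2), otherwise to class K2.
    return "K1" if p_k1 * p_x_given_k1 > p_k2 * p_x_given_k2 else "K2"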
3 Probabilistic-statistical model
This article continues our research related to the construction of thematic structures
of subject areas, using documentary information resources [1,2]. During this study,
the possibilities of various methods of cluster analysis for the selection of thematic
areas were studied. Hence, the problem of analyzing semantic relationships between
indexing terms within a selected thematic area in the SODB was posed. The set of documents from which the candidate terms have been extracted represents the SODB. For the experiment, a SODB in Robotics with a volume of 4255 documents was generated from the INSPEC C database. A term similarity matrix was automatically created, and the frequency of co-occurrence for each term pair was calculated with the ICSTI Cluster program. The volume of the incomplete similarity matrix for cluster analysis was 16300 term pairs.
At the stage of semantic-statistical analysis of relations between indexing terms the following problems are solved:
- the class probabilities P(Kj) and the conditional probabilities P(X/Kj) of joint occurrence of term pairs belonging to each class are estimated;
- the a posteriori probability is determined and a decision is made on whether a pair belongs to a certain class.
As a random variable X we consider the frequency of occurrence of a term pair belonging to a certain type of semantic relationship. Based on the statistical definition of probability

f(A) = F / N ≈ p(A),   (2)

where f(A) is the relative frequency of event A, F is the absolute frequency of realization of event A, N is the number of trials (N → ∞, i.e. a sufficiently large positive number), and p(A) is the probability of event A, let us test the hypothesis of the statistical distinctiveness of semantic relation classes in the SODB.
Testing this hypothesis requires solving the following tasks:
- identifying the main classes of semantic relations between indexing terms in the database;
- determining the relative frequency (probability) of occurrence of term pairs belonging to each class;
- estimating the minimum sample size needed to obtain reliable data about the occurrence frequency of different relationship types in the database.
In this study, methods of mathematical statistics for linguistic objects were used to estimate the reliability of the obtained results and the minimum sample size [28, 30].
To describe relations between terms, we use linguistic theories that describe and systematize semantic relationships. At the same time, a semantic-statistical analysis of the contents of the "descriptors" field of SODB documents, as well as an automated analysis of term similarity matrices, allowed us to identify the following semantic relationships, common to several technical fields, as the main ones (Table 1). The results of the analysis were evaluated by experts.
Table 1. The main types of semantic relations in the subject area "Robotics".

№ | Relation type | Short description
1 | Problem – aspect | the topic expressed by the term t2 is an aspect of considering the topic expressed by the term t1
2 | Genus – species | t1 is a generic term for t2, t2 is a species of t1
3 | Whole – part | t2 is a part of t1 as an object or as a process
4 | Application area – object | sectors of the economy, technological processes and technological operations are considered as areas of application
5 | Adjacency | terms t1 and t2 are two aspects of the same topic
6 | Synonymy | t1 Rsyn t2 – the meanings of terms t1 and t2 are the same
7 | Equality | terms t1 and t2 denote the same thing (singular and plural of the same term)
8, 9, 10, 11 | Reverse relationships to 1, 2, 3, 4 |
12 | Object – firm | term t2 indicates the name of the firm
13 | Object – method | term t2 indicates the name of the method
14 | Object – research area | term t2 indicates the name of the research area
15 | Object – country | term t2 indicates the name of the country

Relations 2, 3, 6, 9, 10 belong to the main paradigmatic relations.
The frequency characteristics of each relationship class are determined based on
the analysis of the training sample; the relative frequency of a particular relationship
class is equal to the number of term pairs related by that relationship type divided by
the total number of pairs in the sample. With a small number of experiments, the
frequencies of occurrence of pairs of a certain type are variable and random and can
vary from one group to another. For example, in one random sample of 10 term pairs
the "problem – aspect" relation did not appear once, but in other samples of 20 and 50
pairs this relation appeared 3 and 17 times respectively with relative frequencies of
0.105 and 0.340. However, with a successive increase in the sample volume the rela-
tive frequency of pairs occurrence of the ''problem – aspect'' type acquires a certain
stability, approaching the value of 0.310. Values of absolute and relative frequencies
for the "problem – aspect" relation at various sample volumes are summarized in Ta-
ble 2.
Table 2. Relative frequency of occurrence of the "problem – aspect" relation in the SODB for Robotics.

Sample volume | 10 | 20 | 50 | 100 | 150 | 200 | 250 | 282
F | 0 | 3 | 17 | 31 | 47 | 62 | 78 | 86
f(A) | 0.000 | 0.105 | 0.340 | 0.310 | 0.313 | 0.310 | 0.312 | 0.305
Thus, behind the random changes of the relative frequency of appearance of term pairs of a certain type as the sample size grows, an objective property manifests itself: the frequency values concentrate around a certain value, and this tendency becomes more and more pronounced as the volume of the studied material increases.
On the basis of J. Bernoulli's theorem, it can be argued with a probability arbitrarily close to 1 that for a sufficiently large number of trials, i.e. as N → ∞, the relative frequency F/N of the event A (the appearance of the j-th relation type) differs arbitrarily little from the probability p:

P(|F/N − p| < ε) → 1 as N → ∞.   (3)
It should be noted that the distribution of frequencies of various types of relations, and, consequently, of their probabilities in the sample is similar to their distribution in the entire database [27].
The reliability of the results obtained in the statistical analysis of the frequency distribution of different relations is ensured by the minimum required sample size drawn from the general population. The required sample size depends on the reliability P with which the conclusions are drawn, on the acceptable relative error δ, and on the probability of occurrence of the event p.
According to expression (4), to obtain reliable results in the statistical study of term pairs related by the "problem – aspect" relationship, which occur in a sample with a probability of 0.310, with a given relative error δ = 0.2 and reliability P = 0.95, a sample of at least 213 term pairs is required for linguistic research [28]:

N ≥ zp² (1 − p) / (δ² p),   (4)

where zp = 1.96 is determined from the table for the given value of reliability P = 0.95.
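For illustration, a short Python check of formula (4) under the stated values (a sketch reproducing the figure reported in the text; the function name is ours):

import math

def min_sample_size(p, delta, z_p=1.96):
    # Formula (4): minimum sample size for estimating an event of probability p
    # with relative error delta at the reliability level fixed by z_p.
    return z_p ** 2 * (1.0 - p) / (delta ** 2 * p)

# "problem - aspect" relation: p = 0.310, delta = 0.2, P = 0.95 (z_p = 1.96)
print(min_sample_size(0.310, 0.2))   # ~213.8, i.e. at least about 213 term pairs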
The results of semantic-statistical analysis of term pairs - the main types of seman-
tic relations and characteristics of their occurrence in the sample of N=282 term pairs
from the Database on Robotics are summarized in Table 3.
Table 3. Classes of semantic relations and their frequency characteristics in a sample of volume N = 282.

Name of the semantic relations class | Absolute frequency of term pairs | Relative frequency
Problem – aspect | 86 | 0.310
Genus – species | 51 | 0.181
Whole – part | 49 | 0.174
Application area – object | 39 | 0.138
Adjacency | 21 | 0.074
Synonymy | 9 | 0.032
Object – application area | 8 | 0.028
Equality | 5 | 0.018
Aspect – problem | 4 | 0.014
Species – genus | 4 | 0.014
Part – whole | 3 | 0.011
Object – firm | 3 | 0.011
Object – method | 2 | 0.007
Object – research area | 1 | 0.004
Object – country | 1 | 0.004
Examples of the "problem – aspect" and "genus – species" relationships from the
SODB on Robotics:
- artificial intelligence − cognitive systems, artificial intelligence − adaptive systems,
artificial intelligence − knowledge engineering, artificial intelligence − brain models;
- robots − industrial robots, robots − medical robots, robots − household robots.
Analysis of the obtained results (Table 3) confirms the hypothesis about the statistical distinctiveness of the classes of semantic relations between indexing terms in a SODB (here, the SODB in Robotics) and makes it possible to estimate the values of the a priori probabilities of the selected relation classes, P(Kj).
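A small sketch of how the a priori probabilities P(Kj) are estimated from the class counts of Table 3 (only the four largest classes are listed here for brevity):

counts = {               # absolute frequencies of the largest classes (Table 3)
    "problem - aspect": 86,
    "genus - species": 51,
    "whole - part": 49,
    "application area - object": 39,
}
N = 282                  # total number of term pairs in the sample
priors = {name: c / N for name, c in counts.items()}
print(priors["genus - species"])   # 0.1808... ~ 0.181, as in Table 3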
The next step in the construction of a probabilistic-statistical model of semantic
relations is to study the distribution of frequency of term pairs occurrence inside a
given class of semantic relations to estimate the distribution of conditional probabili-
ties P(X/Kj).
The random variable X is the linguistic feature: the co-occurrence frequency fij of the terms forming a pair belonging to a certain class of semantic relations; we analyze the distribution of the relative frequencies of the values of this feature in a sample of volume N*. To obtain reliable data on the distribution of the conditional probabilities of the linguistic feature fij with the given reliability P = 0.95 and an admissible error δ = 0.1, we find the value N* = 96 [29].
In the course of the experiment on the conditional probability distributions of the random variable fij for the classes of hierarchical relations "problem – aspect" (P(K1) = 0.310) and "genus – species" (P(K2) = 0.181), two random samples were formed: of volume N1 = 107 (pairs connected by the "problem – aspect" relation) and N2 = 97 (pairs connected by the "genus – species" relation). In each sample, the pairs are ordered by increasing frequency of joint occurrence of terms, and the absolute and relative frequencies for each value of the linguistic feature fij are determined.
Taking the relative frequency of occurrence of a linguistic feature in the sample as an estimate of its conditional probability, we obtain an empirical distribution of the frequencies of term co-occurrence, provided that the pairs belong either to the "problem – aspect" or to the "genus – species" class of relations. The results of the experimental estimation of the probabilities of the linguistic feature fij are summarized for the "problem – aspect" and "genus – species" relation classes in Tables 4 and 5, respectively. The experimental data were obtained using the ICSTI Cluster program.
The whole range of frequencies of joint occurrence of terms is divided into S partial intervals, the average values of the frequencies in the intervals are calculated, and the absolute and relative frequencies of occurrence of the average values fij in the sample are determined [28]; a computational sketch of this procedure is given below.
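The interval construction can be sketched as follows, assuming equal-width intervals on the logarithmic scale (the function is illustrative; the exact interval boundaries of Tables 4 and 5 were chosen by the authors):

import numpy as np

def empirical_log_distribution(f_ij, n_intervals):
    # Split the observed co-occurrence frequencies f_ij into S equal intervals on
    # the log scale and return the interval midpoints together with the absolute
    # and relative frequencies (the procedure behind Tables 4 and 5).
    log_f = np.log(np.asarray(f_ij, dtype=float))
    counts, edges = np.histogram(log_f, bins=n_intervals)   # absolute frequencies per interval
    midpoints = np.exp((edges[:-1] + edges[1:]) / 2.0)      # interval midpoints on the f scale
    return midpoints, counts, counts / counts.sum()         # relative frequencies sum to 1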
Table 4. Empirical distribution of the average frequencies of term co-occurrence for the "Problem – Aspect" relation.

Interval no. | Frequency limits | Midpoint | ln of interval boundaries | ln of interval midpoint | Absolute frequency | Relative frequency
1 | 7 – 12 | 9.5 | 2.00 – 2.46 | 2.23 | 1 | 0.009
2 | 12 – 19 | 15.5 | 2.46 – 2.92 | 2.69 | 6 | 0.056
3 | 19 – 29 | 24 | 2.92 – 3.38 | 3.15 | 25 | 0.234
4 | 29 – 47 | 38 | 3.38 – 3.84 | 3.61 | 34 | 0.318
5 | 47 – 92 | 59.5 | 3.84 – 4.27 | 4.07 | 27 | 0.252
6 | 92 – 113 | 92.5 | 4.30 – 4.76 | 4.53 | 12 | 0.112
7 | 113 – 179 | 146 | 4.76 – 5.22 | 4.99 | 2 | 0.019
Table 5. Empirical distribution of the average frequencies of term co-occurrence for the "Genus – Species" relation.

Interval no. | Frequency limits | Midpoint | ln of interval boundaries | ln of interval midpoint | Absolute frequency | Relative frequency
1 | 3 – 5 | 4 | 1.10 – 1.62 | 1.36 | 0 | 0.000
2 | 5 – 8 | 6.5 | 1.62 – 2.14 | 1.88 | 4 | 0.041
3 | 8 – 14 | 11 | 2.14 – 2.66 | 2.40 | 8 | 0.082
4 | 14 – 24 | 19 | 2.66 – 3.18 | 2.92 | 17 | 0.175
5 | 24 – 40 | 32 | 3.18 – 3.70 | 3.44 | 21 | 0.216
6 | 40 – 68 | 54 | 3.70 – 4.22 | 3.96 | 25 | 0.258
7 | 68 – 114 | 91 | 4.22 – 4.74 | 4.48 | 15 | 0.155
8 | 114 – 192 | 153 | 4.74 – 5.26 | 5.00 | 5 | 0.052
9 | 192 – 324 | 258 | 5.26 – 5.78 | 5.52 | 2 | 0.021
10 | 324 – 545 | 434.5 | 5.78 – 6.30 | 6.04 | 0 | 0.000
11 | 545 – 916 | 730.5 | 6.30 – 6.82 | 6.56 | 0 | 0.000
From these data, empirical curves of the distribution of conditional probabilities of the mean frequencies of joint occurrence of terms, and of the logarithms of the mean frequencies, are constructed for the "problem – aspect" and "genus – species" relation classes (see Fig. 1, 2, 3, 4). The observed frequencies fij or their logarithms log fij are plotted on the abscissa axis, and the relative frequencies of occurrence of fij, n(fij), on the ordinate axis. Calculations and plotting were performed using MS Excel.
Fig. 1. Empirical distribution of the conditional probabilities of the mean values of the term co-occurrence frequencies for the "problem – aspect" relation.
Fig. 2. Empirical distribution of the conditional probabilities of the mean values of the term co-occurrence frequencies for the "genus – species" relation.
Fig. 3. Empirical distribution of the conditional probabilities of the logarithms of the mean values of the term co-occurrence frequencies for the "problem – aspect" relation.
Fig. 4. Empirical distribution of the conditional probabilities of the logarithms of the mean values of the term co-occurrence frequencies for the "genus – species" relation.
The goal in examining the distribution of a random variable is to establish the equation of the distribution curve of the values of this variable. The resulting empirical curves are characterized by right-hand asymmetry, which indicates that they cannot be approximated by normal curves. In addition, the normal distribution arises when a random variable is the result of the interaction of many independent small effects. The studied random variable X = fij is the result of the action of impulses, the frequencies fi and fj, and depends on their magnitude. In this case, it is not the random variable X itself that is normally distributed, but its logarithm log X [30].
It is always theoretically possible to define a function f(x) that transforms an asymmetric distribution into a normal one. In many cases, a linear function of a known transforming function g(x) is proposed for this purpose [31]:

f(x) = α + β g(x),   (5)

where the function g(x) contains no unknown parameters.
Thus, although the random variable X itself is not normally distributed, its function f(x) is normally distributed.
The transforming function in this case is the function g(x) = log x. The function g(x) can be obtained from the distribution graph by taking the transformed observed values of the frequencies, log fij, as the abscissa. The analysis of the distribution of the natural logarithms of the mean frequencies of term co-occurrence in the sample allowed us to formulate a hypothesis about the normal distribution of the logarithms of the mean frequencies fij. The paper attempts to model the obtained empirical distributions by means of the lognormal distribution.
The hypothesis concerning the distribution of a random variable is a statistical hypothesis, which is tested and evaluated using methods and criteria developed in mathematical statistics. As a criterion for testing the null
hypothesis of the lognormal distribution of the mean frequency values fij, let us take the random variable

χ² = Σ (ni − ni′)² / ni′,   (6)

where ni are the empirical frequencies of the distribution of the random variable X, and ni′ are the theoretical frequencies.
Decision rule: if χ²obs < χ²cr, the null hypothesis is accepted; otherwise the null hypothesis is rejected.
Let us compare the mean frequency values obtained from the experiment with their theoretical probabilities or mathematical expectations. The calculation of the theoretical frequencies is based on the differential form of the normal law. Since the random variable X = log fij is distributed according to the normal law,

f(x) = (1 / (σ √(2π))) exp(−(x − µ)² / (2σ²)).   (7)

Since the values of µ and σ² are unknown, we replace them with the estimates obtained from the empirical distribution. Setting z = (xi − x̄) / s, we introduce an auxiliary function φ(z) whose values are tabulated [31]:

φ(z) = (1 / √(2π)) exp(−z² / 2).   (8)

Using the auxiliary function φ(z), we write the expression for the theoretical frequencies ni′ in the form

ni′ = (N Δx / s) φ(zi),   (9)

where N is the number of trials, Δx is the interval width, (Δx / s) φ(zi) = pi, and f(x) = φ(z) / s is the theoretical curve.
We check the hypothesis of normality of the distribution of the natural logarithms of the mean frequencies of term co-occurrence in pairs belonging to the "problem – aspect" and "genus – species" classes of semantic relations according to the above methodology, using the empirical curves (see Fig. 3, 4) and spreadsheets.
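A compact sketch of this check (formulas (6)–(9)) in Python, assuming the logarithmic interval midpoints, absolute frequencies and interval width of Table 4 as inputs; the critical value χ²cr would still be taken from the χ² table as in [28, 30]:

import math

def chi_square_lognormal(log_midpoints, abs_freq, delta_logx):
    # Chi-square statistic (6) comparing empirical interval frequencies with the
    # theoretical frequencies of a normal fit to log f_ij (formulas (7)-(9)).
    N = sum(abs_freq)
    mean = sum(x * n for x, n in zip(log_midpoints, abs_freq)) / N
    s = math.sqrt(sum(n * (x - mean) ** 2 for x, n in zip(log_midpoints, abs_freq)) / N)
    chi2 = 0.0
    for x, n_emp in zip(log_midpoints, abs_freq):
        z = (x - mean) / s
        phi = math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)   # auxiliary function (8)
        n_theor = N * delta_logx / s * phi                        # theoretical frequency (9)
        chi2 += (n_emp - n_theor) ** 2 / n_theor
    return chi2

# "problem - aspect" class, Table 4 (interval width in log units is about 0.46)
print(chi_square_lognormal([2.23, 2.69, 3.15, 3.61, 4.07, 4.53, 4.99],
                           [1, 6, 25, 34, 27, 12, 2], 0.46))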
A comparison of the empirical and theoretical curves by the Pearson χ² test [28, 30] showed their agreement with probability P = 0.95. This means that, together with the hypothesis of the normal distribution of the logarithms log fij of the mean co-occurrence frequencies of indexing terms in the "problem – aspect" and "genus – species" classes of semantic relations, the hypothesis of the lognormal distribution of the value fij is confirmed.
The points corresponding to the frequencies of the empirical distribution of the linguistic feature log fij are grouped near the theoretical curve. The construction of the equalizing curves of the normal and lognormal distributions is carried out according to the following scheme [28, 30]:
1. An auxiliary curve φ(z) of the normal distribution is plotted (see Fig. 5), with zi = (xi − x̄) / s.
2. A theoretical curve f(x) of the normal distribution N(µ, σ²) = N(x̄; s²) = N(3.61; 0.64) is constructed (see Fig. 5) with the parameters found in the experiment.
3. Using the normalizing factor Δx, the equalizing curve p(x) = Δx f(x) is obtained from the theoretical curve f(x) (see Fig. 6).
Fig. 5. Auxiliary φ(z) and theoretical f(x) normal distribution curves for the genus – species
relation
Fig. 6. Equalization curves plogn(x) of the lognormal frequency distribution of the co-occurrence
of indexing terms for the relations problem – aspect (1) and genus – species (2)
As a result, the equations of the equalization curves for the empirical distributions corresponding to the "problem – aspect" and "genus – species" classes of semantic relations were obtained:
p1,logn(x) = (Δx1 / (x σ1 √(2π))) exp(−(ln x − µ1)² / (2σ1²))   (10)

is the equalization lognormal distribution curve for the "problem – aspect" relation, and

p2,logn(x) = (Δx2 / (x σ2 √(2π))) exp(−(ln x − µ2)² / (2σ2²))   (11)

is the equalization lognormal distribution curve for the "genus – species" relation, where µk and σk² are the mean and variance of ln fij estimated from the training sample of class Kk, and Δxk is the normalizing interval width.
The equalization lognormal distribution curves are shown in Fig. 6.
An analysis of the a posteriori probability distribution shows that when the frequency of term co-occurrence is less than 10, the probability of the pair belonging to the classes of hierarchical relations is close to 0. In this regard, we can assume that term pairs with fij < 10 belong to classes of non-hierarchical relations.
Using expressions (10) and (11), which represent in analytical form the distributions of the conditional probabilities P1(X/K1) and P2(X/K2), we calculate the a posteriori probabilities P(K1/X) and P(K2/X) of a term pair belonging to classes K1 and K2 by multiplying the conditional probabilities by the given values of the a priori probabilities P(K1) = 0.310 and P(K2) = 0.181 (see formula (1)). Then we decide to which of the given classes the given pair of terms belongs (see Algorithm 1).
Let us illustrate Algorithm 1 with the graph in Fig. 6. Take the pair "robots – robot learning". This pair occurred 40 times in the sample of 263 pairs. The probability values are P(X/K1) = 0.34 ("problem – aspect") and P(X/K2) = 0.258 ("genus – species"). Multiplying these values by P(K1) = 0.310 and P(K2) = 0.181, we obtain 0.105 and 0.047, respectively. The ratio of these numbers is 2.26, which is greater than 1. It follows that this pair belongs to the "problem – aspect" class.
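The same computation in a few lines of Python, with the probability values taken from the worked example above:

# Worked example for the pair "robots - robot learning" (f_ij = 40).
p_x_k1, p_x_k2 = 0.34, 0.258        # conditional probabilities from curves (10) and (11)
p_k1, p_k2 = 0.310, 0.181           # a priori class probabilities (Table 3)
theta = (p_k1 * p_x_k1) / (p_k2 * p_x_k2)
print(round(theta, 2))              # 2.26 > 1, so the pair is assigned to "problem - aspect"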
The obtained results allow us to develop an algorithm for detecting semantic rela-
tions of hierarchical type based on the analysis of co-occurrence frequencies of terms.
Algorithm 1: Detection of semantic relations of hierarchical type
Input: the initial terms of the zero level of the tree (roots); their number m is determined from the frequency dictionary of the database
Output: lists of term pairs belonging to classes K1 and K2 with the corresponding probability values; expert evaluation of the recognition results
1 begin
2 for k = 1 to m do
3   select from the similarity matrix all pairs containing the term tk
4   compute for each term pair the co-occurrence frequency fij as the number of matching document numbers indexed by these terms in the database
5   select the pairs of terms with joint occurrence frequency fij > 10
6   for each value of fij calculate the conditional probabilities P1(X/K1) and P2(X/K2) according to formulas (10) and (11)
7   calculate the a posteriori probabilities P(K1/X) and P(K2/X) of the term pairs belonging to classes K1 and K2 by multiplying the conditional probabilities by the given values of the a priori probabilities P(K1) = 0.310 and P(K2) = 0.181
8   for each pair of terms calculate the probability ratio Θ = P(K1/X) / P(K2/X)
9   compare the calculated value Θ with 1: if Θ > 1, then the term pair titj ∈ K1, otherwise titj ∈ K2
10 print the list of term pairs belonging to classes K1 and K2
11 print the list of terms which serve as the vertices of the next level
12 end.
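A possible Python rendering of Algorithm 1 is sketched below. It assumes the similarity matrix is given as a dictionary mapping term pairs to their co-occurrence frequencies fij, and that the lognormal parameters (µ, σ, Δx) of curves (10) and (11) have already been estimated from the training samples as described above; the names and data structures are ours, not the authors'.

import math

P_K1, P_K2 = 0.310, 0.181            # a priori class probabilities (Table 3)

def p_logn(f, mu, sigma, dx):
    # Equalization lognormal curve (general form of (10) and (11)):
    # conditional probability of co-occurrence frequency f within a class.
    return dx / (f * sigma * math.sqrt(2.0 * math.pi)) * \
        math.exp(-(math.log(f) - mu) ** 2 / (2.0 * sigma ** 2))

def detect_hierarchical_relations(similarity, roots, params_k1, params_k2):
    # Sketch of Algorithm 1: for every root term, classify the pairs containing it
    # into K1 ("problem - aspect") or K2 ("genus - species") by the Bayes rule (1).
    # `similarity` maps (t_i, t_j) -> f_ij; params are (mu, sigma, dx) per class.
    k1_pairs, k2_pairs = [], []
    for root in roots:
        for (ti, tj), f_ij in similarity.items():
            if root not in (ti, tj) or f_ij <= 10:     # pairs with f_ij <= 10 are treated as non-hierarchical
                continue
            post1 = P_K1 * p_logn(f_ij, *params_k1)    # a posteriori weight for K1
            post2 = P_K2 * p_logn(f_ij, *params_k2)    # a posteriori weight for K2
            (k1_pairs if post1 / post2 > 1.0 else k2_pairs).append(((ti, tj), post1, post2))
    return k1_pairs, k2_pairs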
4 Conclusions
The article proposes an approach to the semantic analysis of textual indicators in problem-oriented databases. The authors analyze the statistical characteristics of the distribution of words and phrases and the structural and probabilistic dependencies for certain classes of relations.
The advantage of this approach is that, for the first time, the mathematical apparatus of pattern recognition is applied to the analysis of hierarchical relationships between terms in problem-oriented databases. It is shown that the classes of hierarchical relations are statistically distinguishable. This is confirmed by the statistical characteristics found: the probabilities of occurrence of the classes. The statistical characteristics of term pairs in certain classes were investigated, which makes it possible to decide, based on expression (1), whether a pair belongs to one or another class.
It is also a theoretical advantage that the study of the empirical distributions of the mean frequencies of co-occurrence of terms in the hierarchical relationship classes has confirmed the hypothesis of the lognormal nature of the distribution. A comparison of the empirical and theoretical distributions using the Pearson χ² test showed their consistency with probability P = 0.95.
The expressions (10) and (11), obtained in analytical form for the conditional probability distributions P(X/K1) and P(X/K2), are used in Algorithm 1. This algorithm for detecting semantic relations of the hierarchical type is based on the results of the analysis and makes it possible to determine, via expression (1), whether each term pair belongs to one of the given relationship classes.
From an applied point of view, the advantage of this work is the possibility of obtaining lists of term pairs belonging to a given class. However, further research is needed to construct a semantic network of the studied area with subsequent visualization.
It is worth noting the universality of the proposed approach, as the identified patterns do not depend on a particular language.
The results obtained in this work:
- The existence of statistically stable patterns of the frequency distribution of co-occurrence of indexing terms in a SODB in different classes of semantic relations is revealed. Quantitative estimates of the probabilities of occurrence of the selected classes have been obtained.
- It is found that in the hierarchical classes the probability distribution of occurrence of term pairs with frequency fij obeys a lognormal law. Analytic expressions for the lognormal distribution curves in the "problem – aspect" and "genus – species" relation classes are obtained.
- An algorithm for revealing semantic relations of the hierarchical type, which belongs to the class of heuristic algorithms, is proposed. An experimental evaluation of the accuracy of the algorithm was carried out; its value was 0.74, which confirms the effectiveness of the proposed approach to identifying semantic relations of the hierarchical type.
Identification of semantic relationships between indexing terms and the construction of a semantic network of the SODB combines the possibilities of structuring the field of research with means of intellectualization of automated information retrieval systems (AIRS).
Acknowledgments. The authors would like to express their deep gratitude to Pro-
fessor T. Sadykov for valuable comments and additions that helped to improve the
article.
References
1. Kozlova I.V. Structural and thematic analysis of documentary information resources.
International research journal. No 1, vol. 43, pp. 38-40. URL:
https://doi.org/10.18454/IRJ.2016.43.002. (2016).
2. Kozlova I.V. About approaches to creating science maps. International research journal.
No.10, vol. 41, pp. 76-78. URL: https://doi.org/10.18454/IRJ.2015.41.192. (2015).
3. Sikström S., Garcia D. Statistical semantics: Methods and applications. 249 p. Springer.
(2020).
4. Panchenko A., Ustalov D., Arefyev N., Paperno D., Konstantinova N., Loukachevitch N.
Human and machine judgements for Russian semantic relatedness. Analysis of Images, So-
cial Networks and Texts. 5th International Conference, AIST 2016, pp. 221-235. Springer
(2016).
5. Lanza C., Amir Hazem A., Daille B. Towards Automatic Thesaurus Construction and
Enrichment. Proceedings of the 6th International Workshop on Computational Terminol-
ogy (COMPUTERM 2020), pp. 62-71 Language Resources and Evaluation Conference
(LREC 2020). (2020).
6. Nugumanova A. B., Bessmertnyj I. A., Pecina P., Bajburin E. M. Semantic relations in
text classification based on bag-of-words model. Software products and systems.No.2,
vol.114, pp. 89-99, Russia (2016).
7. Mitrenina O. V. Back to 47th: to the 70th anniversary of machine translation as a scien-
tific field. Novosibirsk State University Bulletin: Linguistics and intercultural communica-
tion. No.3, vol.15, pp. 5-12 (2017).
8. Borzì V., Faro S., Pavone A. Automatic Extraction of Semantic Relations by Using Web
Statistical Information. In: Hernandez N., Jäschke R., Croitoru M. (eds) Graph-Based Rep-
resentation and Reasoning. ICCS 2014. Lecture Notes in Computer Science, vol 8577, pp.
174-187. Springer, Cham (2014).
9. Ustalov D.A. Detecting concepts in a synonym graph. Computing technology. No.1,
vol.22. pp. 99-112 (2017).
10. Li W., Wang T., Cao J., Tao S. A Visual Semantic Relations Detecting Method Based
on WordNet. In: Machine learning and intelligent communication. MLICOM 2019. Lec-
ture Notes from the Institute of Computer Science, Social Informatics, and Telecommuni-
cations Engineering, vol. 294, pp. 465-476. Springer. URL: https://doi.org/10.1007/978-3-
030-32388-2_40 (2019).
11. Kanade V.A. What is semantic analysis? Definition, examples, and applications in
2022. SpiceWorks Portal. Artificial Intelligence. URL: www.spiceworks.com. (2022).
12. Selig J. What semantic analysis means for natural language processing? Expert.AI Por-
tal. URL: www.expert.ai. (2022).
13. Ta Duy Cong Chien. Detection of Semantic Relations Based on Knowledge Graph. Jour-
nal of Science and Technology, Vol. 52B. URL: file:///C:/Users/Usser/Down-
loads/Bài+1.pdf. (2021).
14. Grimmer J., Stewart B. Text as data: The promise and pitfalls of automatic content anal-
ysis methods for political texts, Political analysis. No. 21, vol. 3, pp. 267-297(2013).
15. Ravi K., Ravi V. A survey on opinion mining and sentiment analysis: tasks, approaches
and applications, Knowledge-Based Systems. No. 89, pp. 14-46 (2015).
16. Lagutina N.S., Lagutina K.V., Shchitov I.A., Paramonov I.V. Analysis of the use of var-
ious types of relationships between the terms of the thesaurus generated using hybrid meth-
ods in the problems of text classification. Modeling and analysis of information systems.
No.6. vol. 24, pp. 772-787 (2017).
17. How search engines use semantic analysis of text to assess relevance. Sape portal. Se-
mantic analysis of text: Zipf law, TF-IDF methods for text quality analysis. URL:
blog.sape.ru. (2019).
18. Gorkovenko. D. K. Application of text mining methods for classification of information
distributed in social networks. Young scientist. № 14 (118). p. 66-72. URL:
https://moluch.ru/archive/118/32878/. (2016).
19. Chen Z., He Z., Liu X. Evaluating semantic relations in neural word embeddings with
biomedical and general domain knowledge bases. BMC Med Inform Decis Mak 18, 65.
URL: https://doi.org/10.1186/s12911-018-0630-x. (2018).
20. Hosseinipour R., Shamsfard M. EoANN: Lexical Semantic Relation Classification Us-
ing an Ensemble of Artificial Neural Networks. Proceedings of Recent Advances in Natural
Language Processing, pp. 481-486, Varna, Bulgaria. URL: https://doi.org/10.26615/978-
954-452-056-4_057. (2019).
21. Washio K., Kato T. Neural Latent Relational Analysis to Capture Lexical Semantic Re-
lations in a Vector Space. Proceedings of the 2018 Conference on Empirical Methods in
Natural Language Processing, pp. 594-600. Brussels, Belgium. URL: https://aclanthol-
ogy.org. (2018).
22. Washio K., Kato T. Filling missing paths: Modeling co-occurrences of word pairs and
dependency paths for recognizing lexical semantic relations. In Proceedings of the 2018
Conference of the North American Chapter of the Association for Computational Linguis-
tics: Human Language Technologies, Vol. 1, pp. 1123-1133. URL: https://aclanthol-
ogy.org. (2018).
23. Jadhav A., Amir Y., Pardos Z. Lexical Relation Mining in Neural Word Embeddings.
Proceedings of the 28th International Conference on Computational Linguistics, pp. 1299-
1311. Barcelona, Spain. URL: https://doi.org/10.18653/V1/2020.COLING-MAIN.112.
(2020).
24. Sabine Schulte im Walde. Distinguishing between paradigmatic semantic relations
across word classes: human ratings and distributional similarity, Journal of Language Mod-
elling, 8(1). pp. 53-101. URL: https://dx.doi.org/10.15398/jlm.v8i1.199. (2020).
25. Vasiliev V.I. Recognition systems. Directory.
URL:https://obuchalka.org/2015072785919/raspoznauschie-sistemi-spravochnik-vasilev-
v-i-1983.html. (1983).
26. Gikhman I.I. etc. Theory of Probability and Mathematical Statistics. I.I. Gikhman, A.V.
Skorokhod, M.I. Yadrenko. - 2nd ed. Vyscha school. Parent Publishing House, 439 p. URL:
https://www.litmir.me/bd/?b=566028. (1988).
27. Nechaev V.V., Trofimenko V.M. Analysis of Semantic Search Methods for Information
Resources. Educational resources and technologies. No.5, vol. 8, pp. 127-135 (2014).
28. Piotrovsky R.G., Bektayev K.B., Piotrovskaya A.A. Mathematical linguistics. Textbook
for ped. universities. Higher school, 383 p. URL: https://studfile.net/ preview/4474761/.
(1977).
29. Osipov G.V. Methods and techniques for statistical processing of primary sociological
information. Nauka, 327 p. (1968).
30. Mitropolskiy A.K. Statistical computing techniques. - Ed. 2nd. Nauka, 568 p. URL:
https://biblioclub.ru/index.php?page=book&id=458351. (1971).
31. Yanko Y. Mathematical and statistical tables: Trans. from Czech. Gosstatizdat, 243 p.
(1961).