* This is a preprint version. The final publication is available at
Springer via https://doi.org/10.1007/978-3-319-93931-5_24
Comparative analysis of the informativeness and
encyclopedic style of the popular Web information
sources *
Nina Khairova1, Włodzimierz Lewoniewski2, Krzysztof Węcel2, Mamyrbayev Orken3, Mukhsina Kuralai4
1National Technical University “Kharkiv Polytechnic Institute”, Ukraine
2Poznań University of Economics and Business, Poland
3 Institute of Information and Computational Technologies, Kazakhstan
4Al-Farabi Kazakh National University, Kazakhstan
khairova@kpi.kharkov.ua, wlodzimierz.lewoniewski@ue.poznan.pl,
krzysztof.wecel@ue.poznan.pl, morkenj@mail.ru, kuka_ai@mail.ru
Abstract. Nowadays, decision making very often relies on information found in various Internet sources. Texts of the encyclopedic style, which contain mostly factual information, are preferred. We propose to combine a logic-linguistic model with the Universal Dependencies treebank to extract facts of various quality levels from texts. Using Random Forest as a classification algorithm, we show the types of facts and the types of words that most strongly affect the encyclopedic style of a text. We evaluate our approach on four corpora based on Wikipedia, social media and mass media texts. Our classifier achieves over 90% F-measure.
Keywords: encyclopedic style, informativeness, universal dependencies, random forest, fact extraction, Wikipedia, mass media
1 Introduction
Decision making very often depends on information found in various Internet sources. Enterprises increasingly use external big data sources in order to extract and integrate information into their own business systems [1]. Meanwhile, the Internet is flooded with meaningless blogs, computer-generated spam, and texts that convey no useful information for business purposes. Firms, organizations, and individual users publish texts with different purposes, and the quality of information about the same subject can vary greatly across them. For business purposes, however, organizations and individual users need a condensed presentation of material that identifies the subject accurately, completely and authentically. At the same time, the subject matter should be presented in a clear and understandable manner.
In other words, decision making should prefer texts of an encyclopedic style, which is directly related to the concept of informativeness, i.e. the amount of useful information contained in a document. Obviously, the amount of knowledge that human consciousness can extract from a text correlates with the quality and quantity of facts in the text. Based on the definitions of an encyclopedia and of encyclopedia articles [2], we suggest that an encyclopedic text has to focus on factual information concerning a particular, well-defined field. We propose to combine our logic-linguistic model [3] with the Universal Dependencies treebank [4] to extract facts of various quality levels from texts.
The model that we described in our previous studies defines complex and simple facts via correlations between grammatical and semantic features of the words in a sentence. In order to identify these grammatical and semantic features correctly, we employ a Universal Dependencies parser, which can analyze the syntax of verb groups, subordinate clauses, and multi-word expressions most adequately.
Additionally, we take into account the use of proper nouns, numerals, foreign words and some other morphological and semantic types of words provided by POS-tagging, which can have an impact on the brevity and concreteness of particular information.
In our study, we focus on using information about the quality and quantity of facts
and morphological and semantic types of the words in a text to evaluate the
encyclopedic style of the text.
In order to estimate the influence of the quality and quantity of factual information and of the semantic types of words on the encyclopedic style of a text, we use four different corpora. The first one comprises Wikipedia articles that are manually divided into several classes according to their quality. The second Wikipedia corpus comprises only the best Wikipedia articles. The third corpus is The Blog Authorship Corpus¹, which contains posts of 19,320 bloggers gathered from blogger.com. The corpus incorporates a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per person [5]. The fourth corpus comprises news reports from various topic sections of “The New York Times” and “The Guardian”, extracted in January 2018. We apply the Random Forest algorithm of the Weka 3 data mining software in order to estimate the importance of the investigated features in the obtained classification models.
2 Related work
Nowadays, determining the informativeness of a text has become one of the most important tasks of NLP, and in recent years many articles devoted to this problem have appeared.
Usually, the informativeness of a text is considered at three levels of the linguistic system: (1) the sentence level, (2) the discourse level and (3) the level of the entire text. In traditional linguistics, the definition of sentence informativeness is based on the division of an utterance into two parts: the topic (or theme) and the comment (or rheme) [6]. At the discourse level, the traditional approach involves the anchoring of information units or events to descriptives and interpretives within a narrative frame [7].

¹ http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm
Many studies determine the informativeness of a text via the informativeness of the words in it, i.e. 'term informativeness'. Most known approaches to measuring term informativeness fall into the statistics-based category: they estimate the informativeness of words by their distributional characteristics in a corpus. For instance, a study by Rennie and Jaakkola [8] introduced term informativeness based on the fit of a word's frequency to a mixture of two unigram distributions.
Considerably fewer studies measure the semantic side of term informativeness. In our opinion, an interesting one is Kireyev's research [9], which defined the informativeness of a term via the ratio of the term's LSA vector length to its document frequency. More recently, Wu and Giles [10] defined a context-aware term informativeness based on the semantic relatedness between the context and the term's featured contexts (the top important contexts that cover most of a term's semantics).
Most approaches use statistical information in corpora. For instance, Shams [11] explored possibilities to determine the informativeness of a text using a set of natural language attributes or features related to stylometry, the linguistic analysis of writing style. However, he focused mainly on informativeness in specific biomedical texts.
Huang et al. [12] studied the information content of analyst report texts. The authors suggested that the informativeness of a text is determined by its topics, writing style, and other signals in the reports that have important implications for investors. At the same time, they emphasized that more informative texts are more assertive and concise. Their inferences about the informativeness of a text are based on investors' reactions to the analyst reports over up to five years.
In [13], the informativeness of language is analyzed using text-based electronic negotiations, i.e. negotiations conducted by text messages and numerical offers sent through electronic means. Applying machine learning methods allowed the authors to build sets of the most informative words and n-grams related to successful negotiations.
Lex et al. hypothesized that the informativeness of a document can be measured through its factual density, i.e. the number of facts contained in a document, normalized by its length [14].
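The factual-density measure of Lex et al. [14] reduces to a simple ratio; a minimal sketch (the fact count itself would come from a separate fact-extraction step):

```python
def factual_density(num_facts, num_words):
    """Factual density in the sense of Lex et al. [14]: the number of
    facts in a document normalized by the document's length in words."""
    return num_facts / num_words

# e.g. 12 extracted facts in a 480-word document
print(factual_density(12, 480))  # 0.025
```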
Although the concept of encyclopedicness is closely related to informativeness, it also includes such notions as brevity and correspondence to a given subject matter. We suppose that the notion of encyclopedicness of a text is more interesting and more useful than its informativeness because it is based on knowledge concerning the particular subject matter. Therefore, in our study, we consider the influence of both the various types of facts and the semantic types of words on the encyclopedic style of the text.
3 Methodology
3.1 Logical-Linguistic Model
We argue that the encyclopedic style of an article can be represented explicitly by
the various linguistic means and semantic characteristics in the text. We have defined
four possible semantic types of facts in a sentence, which, in our opinion, might help
to determine the encyclopedicness of the text. Figure 1 shows the structural scheme
for distinguishing four types of facts from simple English sentences.
Fig. 1. Structural scheme for distinguishing four types of facts from simple English sentences:
subj-fact, obj-fact, subj-obj fact and complex fact.
We call the first type of fact a subj-fact. We define this type of fact in an English sentence as the smallest grammatical clause that includes a verb and a noun. In this case, the verb represents the Predicate of an action and the noun represents the Subject² of an action. According to our model of fact extraction from English texts [3], the semantic relations that denote the Subject of the fact can be determined by the following logical-linguistic equation:

γ1(z, y, x, m, p, f, n) = yout ((fcan ˅ fmay ˅ fmust ˅ fshould ˅ fcould ˅ fneed ˅ fmight ˅ fwould ˅ fout) (nnot ˅ nout) ((pI ˅ ped ˅ pIII) xf mout ˅ (xl (mis ˅ mare ˅ mhavb ˅ mhasb ˅ mhadb ˅ mwas ˅ mwere ˅ mbe ˅ mout) zby))), (1)

where the subject variable z defines the syntactic feature of a preposition after the verb in English phrases; the subject variable y (yap ˅ yaps ˅ yout = 1) defines whether there is an apostrophe at the end of the word; the subject variable x defines the position of the noun with respect to the verb; the subject variable m defines whether there is a form of the verb "to be" in the phrase; and the subject variable p defines the basic forms of the verb in English.

² We use 'Predicate', 'Subject' and 'Object' with initial upper-case letters to emphasize the semantic meaning of words in a sentence.
Additionally, in this study, we have appended two subject variables f and n to
account for modality and negation. The subject variable f defines the possible forms
of modal verbs:
fcan ˅ fmay ˅ fmust ˅ fshould ˅ fcould ˅ fneed ˅ fmight ˅ fwould ˅ fout =1
Using the subject variable n we can take into account the negation in a sentence:
nnot ˅ nout=1
Definition 1. The subj-fact in an English sentence is the smallest grammatical clause that includes a verb and a noun (or personal pronoun) that represents the Subject of the fact according to equation (1).
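Equation (1) can be read as a boolean predicate over the grammatical features of a verb-noun pair. The sketch below is our illustrative interpretation, not the authors' implementation; in particular, the feature encoding and the reading of xf/xl as "noun before/after the verb" are assumptions:

```python
# Allowed values of f in equation (1); None encodes f_out (no modal).
ALLOWED_MODALS = {"can", "may", "must", "should", "could",
                  "need", "might", "would", None}
# Forms of "to be" admitted by the passive branch; None encodes m_out.
BE_FORMS = {"is", "are", "was", "were", "be",
            "have been", "has been", "had been", None}

def is_subj_fact(feat):
    """True if the noun can fill the Subject role of its verb,
    following the structure of equation (1)."""
    if feat["y"] != "out":               # an apostrophe excludes the Subject role
        return False
    if feat["f"] not in ALLOWED_MODALS:  # modal verb must be allowed or absent
        return False
    # n (negation) may be "not" or "out": equation (1) admits both.
    # Active branch: basic verb form, noun before the verb, no "to be".
    active = (feat["p"] in {"I", "ed", "III"}
              and feat["x"] == "first" and feat["m"] is None)
    # Passive branch: noun after the verb, a form of "to be", "by"-phrase.
    passive = (feat["x"] == "last" and feat["m"] in BE_FORMS
               and feat["z"] == "by")
    return active or passive

# "The Marines reported ..." -> "Marines" precedes the past-form verb
active_example = {"y": "out", "f": None, "n": "out", "p": "ed",
                  "x": "first", "m": None, "z": None}
print(is_subj_fact(active_example))  # True
```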
The Object of a verb is the second most important argument of a verb after the Subject. We can define the grammatical and syntactic characteristics of the Object in an English text by the following logical-linguistic equation:

γ2(z, y, x, m, p, f, n) = yout (nnot ˅ nout) (fcan ˅ fmay ˅ fmust ˅ fshould ˅ fcould ˅ fneed ˅ fmight ˅ fwould ˅ fout) (zout xl mout (pI ˅ ped ˅ pIII) ˅ xf (zout ˅ zby) (mis ˅ mare ˅ mhavb ˅ mhasb ˅ mhadb ˅ mwas ˅ mwere ˅ mbe ˅ mout) (ped ˅ pIII)), (2)

Definition 2. The obj-fact in an English sentence is the smallest grammatical clause that includes a verb and a noun (or personal pronoun) representing the Object of the fact according to the conjunction of grammatical features in equation (2).
The third group of facts comprises main clauses in which one noun plays the semantic role of the Subject and the other is the Object of the fact.
Definition 3. The subj-obj fact in an English sentence is the main clause that includes a verb and two nouns (or personal pronouns), one of which represents the Subject of the fact in accordance with equation (1) and the other represents the Object of the fact in accordance with equation (2).
We also define the complex fact as a grammatically simple English sentence that includes a verb and several nouns (or personal pronouns). In that case, the verb again represents the Predicate, but among the nouns, one has to play the semantic role of the Subject, another has to be the Object, and the others are attributes of the action.
Definition 4. The complex fact in an English sentence is the simple sentence that includes a verb and several nouns (or personal pronouns), one of which represents the Subject, another represents the Object of the fact in accordance with equations (1) and (2) respectively, and the others represent attributes of the action.
These can be attributes of time, location, mode of action, affiliation with the Subject or the Object, etc. According to our previous articles, the attributes of an action in an English simple sentence can be represented by nouns that are defined by logical-linguistic equations [3].
Additionally, we distinguish a few semantic types of words that can be identified by POS-tagging labels. These are NNP* (singular and plural proper nouns), CD (numeral), DT (determiner, which marks such words as "all", "any", "each", "many", etc.), FW (foreign word), LS (list item marker) and MD (modal auxiliary). The approach is based on our hypothesis that the occurrence of proper names, numerals, determiners, foreign words, list item markers and modal auxiliary words in a text can influence its encyclopedicness. For instance, the occurrence of proper nouns, numerals, foreign words and list item markers in a sentence can indicate that a statement is formulated more precisely and concisely. On the contrary, we can expect that the occurrence of modal auxiliary words in a sentence makes the statement vaguer and more implicit.
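Counting these POS-based features from already-tagged text is straightforward; a minimal sketch over Penn Treebank tags (the tagger itself is assumed to run beforehand):

```python
from collections import Counter

# Penn Treebank tags mapped to the features used in this study
FEATURE_TAGS = {"NNP": "NNP*", "NNPS": "NNP*", "CD": "CD", "DT": "DT",
                "FW": "FW", "LS": "LS", "MD": "MD"}

def pos_features(tagged_tokens):
    """Count occurrences of the selected POS-based features in a
    sequence of (word, tag) pairs."""
    counts = Counter()
    for _word, tag in tagged_tokens:
        feature = FEATURE_TAGS.get(tag)
        if feature:
            counts[feature] += 1
    return counts

# Hand-tagged fragment of the Figure 2 sentence
tagged = [("Marines", "NNPS"), ("reported", "VBD"), ("that", "IN"),
          ("ten", "CD"), ("Marines", "NNPS"), ("and", "CC"),
          ("139", "CD"), ("insurgents", "NNS"), ("died", "VBD")]
print(pos_features(tagged))  # Counter({'NNP*': 2, 'CD': 2})
```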
3.2 Using Universal Dependencies
In order to correctly pick facts out and properly distinguish their types, we employ syntactic dependency relations. We use a Universal Dependencies parser because its treebanks allow the most adequate analysis of verb groups, subordinate clauses, and multi-word expressions for many languages. The dependency representation of UD evolves out of Stanford Dependencies (SD), which itself follows ideas of grammatical relations-focused description that can be found in many linguistic frameworks. That is, it is centrally organized around notions of subject, object, clausal complement, noun determiner, noun modifier, etc. [4], [15]. These syntactic relations, which connect the words of a sentence to each other, often express semantic content. In Dependency grammar, the verb is taken to be the structural center of clause structure, and all other words are either directly or indirectly connected to it. Similarly, in our model the verb is considered the central component of a fact, and all participants of the action depend on the Predicate, which expresses the fact and is represented by a verb [16]. For example, Figure 2 shows the graphical representation of Universal Dependencies for the sentence "The Marines reported that ten Marines and 139 insurgents died in the offensive", which is obtained using a special visualization tool for dependency parses, DependenSee³.
For our analysis we used 7 out of the 40 grammatical relations between words in English sentences that UD v1 contains⁴. In order to pick out the subj-fact, we distinguish three types of dependencies: nsubj, nsubjpass and csubj. The nsubj label denotes the syntactic subject dependent on the root verb of a sentence, the csubj label denotes the clausal syntactic subject of a clause, and the nsubjpass label denotes the syntactic subject of a passive clause. Additionally, in order to pick out the obj-fact, we distinguish four types of dependencies: obj, iobj, dobj and ccomp. Obj denotes the entity acted upon or which undergoes a change of state or motion. The labels iobj, dobj and ccomp are used for more specific notation of the dependencies of an action's object on the verb.
³ http://chaoticity.com/dependensee-a-dependency-parse-visualisation-tool/
⁴ http://universaldependencies.org/en/dep/
Fig. 2. Graphical representation of Universal Dependencies for the sentence "The Marines
reported that ten Marines and 139 insurgents died in the offensive". Source: DependenSee
Considering that the root verb is the structural center of a sentence in Dependency grammar, we distinguish additional types of facts that we can extract from a text. The root grammatical relation points to the root of the sentence [15]. These are the fact types that are formed from the root verb (root_fact, subj_fact_root, obj_fact_root, subj_obj_fact_root, complex_fact_root) and the ones in which the action Predicate is not a root verb (subj_fact_notroot, obj_fact_notroot, subj_obj_fact_notroot, complex_fact_notroot).
For completeness of the study, we attribute sentences with copular verbs to a special type of facts, which we call copular_fact. We do this for the following reason: although, as is widely known, a copular verb is not an action verb, it can often be used as an existential verb, meaning "to exist".
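The mapping from dependency relations to fact types can be sketched as follows, assuming a simplified (id, head, deprel) token representation; the parse of the Figure 2 sentence is hand-made for illustration and is not the authors' code:

```python
# Each token is (id, head_id, deprel); the root verb has head 0 and
# deprel "root".
SUBJ = {"nsubj", "nsubjpass", "csubj"}   # relations marking a Subject
OBJ = {"obj", "iobj", "dobj", "ccomp"}   # relations marking an Object

def fact_types(tokens):
    """Label the fact anchored at each verb by its dependents'
    relations and by whether the verb is the sentence root."""
    deps = {}
    for tid, head, rel in tokens:
        deps.setdefault(head, set()).add(rel)
    facts = []
    for tid, head, rel in tokens:
        kids = deps.get(tid, set())
        has_subj, has_obj = bool(kids & SUBJ), bool(kids & OBJ)
        if not (has_subj or has_obj):
            continue
        kind = ("subj_obj_fact" if has_subj and has_obj
                else "subj_fact" if has_subj else "obj_fact")
        facts.append(kind + ("_root" if rel == "root" else "_notroot"))
    return facts

# "The Marines reported that ten Marines and 139 insurgents died in
# the offensive" (hand-made UD-style analysis)
parse = [(1, 2, "det"), (2, 3, "nsubj"), (3, 0, "root"), (4, 10, "mark"),
         (5, 6, "nummod"), (6, 10, "nsubj"), (7, 9, "cc"), (8, 9, "nummod"),
         (9, 6, "conj"), (10, 3, "ccomp"), (11, 13, "case"),
         (12, 13, "det"), (13, 10, "obl")]
print(fact_types(parse))  # ['subj_obj_fact_root', 'subj_fact_notroot']
```

Here "reported" forms a subj_obj_fact_root (it has an nsubj and a ccomp dependent), while "died" forms a subj_fact_notroot.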
4 Source Data and experimental results
Our dataset comprises four corpora, two of which include articles from the English Wikipedia. We consider texts from Wikipedia for a few reasons. First, we assume that since Wikipedia is the biggest public universal encyclopedia, its articles must be well-written and must follow the encyclopedic style guidelines. Furthermore, Wikipedia articles can be divided into different quality classes [16], [17], [18]; hence the best Wikipedia articles have a greater degree of encyclopedicness than most other texts do. These hypotheses allow us to use the dataset of Wikipedia articles in order to evaluate the impact of our proposed linguistic features on the encyclopedic style of texts.
The first Wikipedia corpus, which we call Wikipedia_6C, comprises 3000 randomly selected articles from the 6 quality classes of the English Wikipedia (from the highest): Featured articles (FA), Good articles (GA), B-class, C-class, Start, Stub. We exclude A-class articles since this quality grade is usually used in conjunction with other ones (most often with FA and GA), as was done in previous studies [17], [18]. The second Wikipedia corpus, called Wikipedia_FA, comprises 3000 articles randomly selected from the best quality class, FA.
In order to process the plain texts of the corpora described above, we use the Wikipedia database dump from January 2018 and the special framework WikiExtractor⁵, which extracts and cleans text from Wikipedia database dumps.
In addition, in order to compare the encyclopedic style of texts from Wikipedia with texts from other information sources, we have produced two further corpora. The first one is created on the basis of The Blog Authorship Corpus [5]. The corpus collected the posts of 19,320 bloggers gathered from blogger.com on one day in August 2004. The bloggers' ages range from 13 to 47 years⁶. For our purposes, we extract all texts of 3000 randomly selected bloggers (authors) from two age groups: "20s" bloggers (ages 23-27) and "30s" bloggers (ages 33-47). Each age group in our dataset has the same number of bloggers (1500 each). Since bloggers have different numbers of posts of various sizes, in our corpus we consider all texts of one blogger as a separate item. Hence we get in total 3000 items for our Blogs corpus.
Table 1. The distribution of the analyzed texts across our four corpora

Corpus name  | Categories                                                     | Total items
Wikipedia_6C | FA, GA, B-class, C-class, Start, Stub                          | 3000
Wikipedia_FA | FA                                                             | 3000
Blogs        | "20s" blogs, "30s" blogs                                       | 3000
News         | Business, Health, N.Y., Opinion, Politics, Science, Sports, Tech, U.S., World topics of “The New York Times”; UK news, World, Sport, Opinion, Culture, Business, Lifestyle, Technology, Environment, Travel topics of “The Guardian” | 3000
The second supplementary corpus, called News, is created on the basis of articles and news from popular mass media portals, “The New York Times”⁷ and “The Guardian”⁸. For a more comprehensive analysis, we extracted an equal number of news items from different topics. For our experiment, we selected 10 topics for each news source and extracted 150 recent news items from each topic of each news source in January 2018. In total, we have 3000 items for our News corpus.
Thus, we have four corpora with the same number of items for our experiments. Table 1 shows the distribution of the analyzed texts according to the categories. By categories, we mean the topics of newspaper posts in the News corpus, the age groups of bloggers in the Blogs corpus, and the manually assessed quality grades of Wikipedia articles in the Wikipedia_6C and Wikipedia_FA corpora.

⁵ https://github.com/attardi/wikiextractor
⁶ The group descriptions are available at http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm
⁷ https://www.nytimes.com/
⁸ https://www.theguardian.com
Additionally, following Corpus Linguistics approaches [19], in order to compare the frequencies of linguistic feature occurrences in the different corpora, we normalized the frequencies per million words. This allows comparing the frequencies of various characteristics across corpora of different sizes.
Definition 5. The frequency of each feature in a corpus is defined as the number of occurrences of the feature in the corpus divided by the number of words in the corpus and multiplied by one million.
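Definition 5 corresponds to the standard corpus-linguistic normalization; a one-line sketch:

```python
def per_million(feature_count, corpus_word_count):
    """Definition 5: the frequency of a feature normalized per million
    words, making corpora of different sizes comparable."""
    return feature_count / corpus_word_count * 1_000_000

# e.g. 15 occurrences of FW in a 3-million-word corpus
print(per_million(15, 3_000_000))  # 5.0 occurrences per million words
```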
In order to assess the impact of the various types of facts in a sentence and of some types of words in a text on the degree of encyclopedicness of the text, we conducted two experiments. Both of them classify texts from the Blogs, News and Wikipedia corpora. The difference lies in the selected Wikipedia corpus. In the first experiment, we used texts from the Wikipedia_6C corpus, which includes Wikipedia articles of different quality. We call this experiment the BNW6 model. In the other experiment, we use texts from the Wikipedia_FA corpus, which consists only of the best Wikipedia articles. We call the second experiment the BNWF model.
The Random Forest classifier of the Weka 3 data mining software⁹ allows us to determine the probability that an article belongs to one of the three corpora. Table 2 shows the detailed accuracy of the two models.
Table 2. Detailed accuracy by model

Model | TP Rate | FP Rate | Precision | Recall | F-Measure | MCC   | ROC Area | PRC Area
BNW6  | 0.887   | 0.057   | 0.888     | 0.887  | 0.887     | 0.831 | 0.974    | 0.950
BNWF  | 0.903   | 0.048   | 0.904     | 0.903  | 0.904     | 0.856 | 0.981    | 0.962
Additionally, Weka 3 allows us to construct a confusion matrix that visualizes the performance of a model. Such matrices for both classification models are shown in Table 3. Each row of a matrix represents the instances of an actual class, while each column represents the instances of a predicted class. This makes it possible to see which classes were predicted correctly by our models.
Obviously, the best Wikipedia articles must be well-written and consequently must follow the encyclopedic style guidelines. This is confirmed by the higher recall and precision of the BNWF classification model compared to the BNW6 one.

⁹ https://www.cs.waikato.ac.nz/ml/weka/index.html
Table 3. Confusion matrices of the obtained models

BNW6         | Blogs | News | Wikipedia_6C
Blogs        | 2691  | 274  | 34
News         | 199   | 2575 | 227
Wikipedia_6C | 10    | 276  | 2714

BNWF         | Blogs | News | Wikipedia_FA
Blogs        | 2683  | 283  | 33
News         | 206   | 2655 | 140
Wikipedia_FA | 24    | 183  | 2793
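As a sanity check, the aggregate metrics in Table 2 can be recomputed from the BNW6 confusion matrix in Table 3. The sketch below assumes Weka's convention that rows are actual classes and columns are predicted classes:

```python
# BNW6 confusion matrix from Table 3 (rows: actual, columns: predicted)
cm = [[2691, 274, 34],    # actual Blogs
      [199, 2575, 227],   # actual News
      [10, 276, 2714]]    # actual Wikipedia_6C

n = sum(sum(row) for row in cm)
# Overall accuracy (equals the weighted TP rate / recall here)
accuracy = sum(cm[i][i] for i in range(3)) / n

# Per-class precision: diagonal cell divided by its column sum
precisions = [cm[i][i] / sum(cm[r][i] for r in range(3)) for i in range(3)]
# Weighted precision: per-class values averaged by class support
supports = [sum(row) for row in cm]
weighted_precision = sum(p * s for p, s in zip(precisions, supports)) / n

print(round(accuracy, 3), round(weighted_precision, 3))  # 0.887 0.888
```

These values match the TP rate and precision reported for the BNW6 model in Table 2.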
The Random Forest classifier can show the importance of features in the models. It provides two straightforward methods for feature selection: mean decrease impurity and mean decrease accuracy. Table 4 shows the most significant features, ranked by average impurity decrease and by the number of nodes using each feature.
Table 4. The most significant features of our models based on average impurity decrease (AID) and the number of nodes using the feature (NNF)

Feature              | BNW6 AID | BNW6 NNF | BNWF AID | BNWF NNF
root_fact            | 0.53     | 7526     | 0.52     | 6354
subj_fact_root       | 0.47     | 7614     | 0.48     | 6287
subj_fact_notroot    | 0.45     | 6537     | 0.45     | 5786
obj_fact_notroot     | 0.42     | 5678     | 0.41     | 4772
obj_fact_root        | 0.40     | 5155     | 0.40     | 5354
subj_obj_fact_root   | 0.38     | 5270     | 0.39     | 4479
complex_fact_root    | 0.39     | 3994     | 0.38     | 3882
complex_fact_notroot | 0.35     | 5541     | 0.35     | 4413
copular_fact         | 0.34     | 3924     | 0.33     | 3254
CD                   | 0.33     | 4262     | 0.32     | 3668
DT                   | 0.32     | 4646     | 0.31     | 4061
FW                   | 0.29     | 2083     | 0.31     | 2087
MD                   | 0.31     | 4388     | 0.30     | 3702
NNP*                 | 0.29     | 4893     | 0.29     | 4112
LS                   | 0.22     | 775      | 0.23     | 664
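The "average impurity decrease" in Table 4 aggregates, over all tree nodes splitting on a feature, how much each split reduces Gini impurity. A minimal sketch of the per-split computation (with hypothetical class counts):

```python
def gini(class_counts):
    """Gini impurity of a node holding the given class counts."""
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def impurity_decrease(parent, left, right):
    """Impurity decrease of one split; averaging this quantity over all
    nodes that split on a feature yields its importance (AID)."""
    n, nl, nr = sum(parent), sum(left), sum(right)
    return gini(parent) - (nl / n) * gini(left) - (nr / n) * gini(right)

# Hypothetical node holding 50 Blogs / 50 Wikipedia texts; a threshold
# on root_fact frequency separates the classes almost perfectly.
print(impurity_decrease([50, 50], [48, 5], [2, 45]))  # ~0.371
```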
5 Conclusions and Future Work
We consider the problem of determining the encyclopedic style and informativeness of texts from different sources as a classification task. We have four corpora of texts; some comprise more encyclopedic texts or articles and others include less encyclopedic ones.
Our study shows that factual information has the greatest impact on the encyclopedicness of a text. As Figure 1 shows, we distinguish several types of facts in a sentence: the complex fact, subj fact, obj fact, subj-obj fact and copular fact. Additionally, we highlight the main fact that is represented by a sentence.
Table 4 shows that the most significant features affecting the encyclopedic style of a text are (1) the frequency of the main facts (root_fact), (2) the frequency of the subj facts, (3) the frequency of the obj facts and (4) the frequency of the subj-obj facts in a corpus. We define all these types of facts on the basis of our logical-linguistic model and using a Universal Dependencies parser.
The Random Forest classifier, which is based on our features, achieves sufficiently high recall, precision and F-measure. We obtain recall = 0.887 and precision = 0.888 in the case of the classification of texts from the Blogs, News and Wikipedia corpora. When considering only the best Wikipedia articles in the last corpus, we obtain recall = 0.903 and precision = 0.904. Moreover, using the Random Forest classifier allowed us to show the most important features related to informativeness and the encyclopedic style in our classification models.
In future work, we plan to extend the obtained approach to compare the encyclopedic style of texts of Wikipedia and of various Web information sources in different languages. In our opinion, it is possible to implement the method in commercial or corporate search engines to provide users with more informative and encyclopedic texts. Such tools could be valuable for making important decisions based on textual information from various Internet sources. On the other hand, firms and organizations would get the opportunity to evaluate the informativeness of the texts placed on their Web sites and make changes to provide more valuable information to potential users. Additionally, more encyclopedic texts can be used to enrich various open knowledge bases (such as Wikipedia and DBpedia) and business information systems in enterprises.
6 References
1. Cai, L., Zhu, Y. (2015). The challenges of data quality and data quality assessment in
the big data era. Data Science Journal, 14.
2. Béjoint, H. (2000). Modern Lexicography: An Introduction. Oxford University Press, pp. 30-31.
3. Khairova, N., Petrasova, S., Gautam, A., (2016), The Logical-Linguistic Model of
Fact Extraction from English Texts. International Conference on Information and
Software Technologies. CCIS 2016: Communications in Computer and Information
Science, pp. 625-635.
4. Joakim Nivre et al. Universal Dependencies v1: A Multilingual Treebank Collection.
In Proceedings of the Tenth International Conference on Language Resources and
Evaluation (LREC 2016), Paris, France, May. European Language Resources
Association (ELRA).
5. Schler J., Koppel M., Argamon S. and Pennebaker J. (2006). Effects of Age and
Gender on Blogging in Proceedings of 2006 AAAI Spring Symposium on
Computational Approaches for Analyzing Weblogs, 191-197 pp.
6. Leafgren, John. Degrees of explicitness: Information structure and the packaging of
Bulgarian subjects and objects. Amsterdam & Philadelphia: John Benjamins, 2002.
7. Ruth A. Berman and Dorit Ravid. Analyzing Narrative Informativeness in Speech
and Writing In A. Tyler, Y. Kim, & M. Takada, eds. Language in the Context of
Use: Cognitive Approaches to Language and Language Learning. Mouton de
Gruyter: The Hague [Cognitive Linguistics Research Series], 2008. pp. 79-101.
8. J. D. M. Rennie, and T. Jaakkola. 2005. Using Term Informativeness for Named
Entity Detection. In Proceedings of SIGIR 2005, pp. 353-360.
9. K. Kireyev. Semantic-based Estimation of Term Informativeness. Human Language
Technologies: The 2009 Annual Conference of the North American Chapter of the
ACL, pages 530-538.
10. Zhaohui Wu and Lee C. Giles. 2013. Measuring term informativeness in context. In
Proceedings of NAACL ’13, pages 259–269, Atlanta, Georgia.
11. Shams, Rushdi, "Identification of Informativeness in Text using Natural Language
Stylometry" (2014). Electronic Thesis and Dissertation Repository. 2365.
12. Allen H. Huang, Amy Y. Zang, and Rong Zheng (2014) Evidence on the Information
Content of Text in Analyst Reports. The Accounting Review: November 2014, Vol.
89, No. 6, pp. 2151-2180.
13. Marina Sokolova, Guy Lapalme. How Much Do We Say? Using Informativeness of
Negotiation Text Records for Early Prediction of Negotiation Outcomes. Group
Decision and Negotiation, 21(3):363-379.
14. Lex, E., Voelske, M., Errecalde, M., Ferretti, E., Cagnina, L., Horn, C., Granitzer, M.,
(2012), Measuring the quality of web content using factual information, Proceedings
of the 2nd joint WICOW/AIRWeb workshop on web quality, ACM, pp. 7-10.
15. De Marneffe, M. C., & Manning, C. D. (2008). Stanford typed dependencies manual
(pp. 338-345). Technical report, Stanford University.
16. Lewoniewski W. (2017) Enrichment of Information in Multilingual Wikipedia Based
on Quality Analysis. In: Abramowicz W. (eds) Business Information Systems
Workshops. BIS 2017. Lecture Notes in Business Information Processing, vol 303.
Springer, Cham.
17. Węcel, K., Lewoniewski, W., (2015), Modelling the Quality of Attributes in
Wikipedia Infoboxes. In Business Information Systems Workshops. Volume 228 of
Lecture Notes in Business Information Processing. Springer International
Publishing, pp. 308-320.
18. Lewoniewski,W., Węcel, K., Abramowicz, W., (2016), Quality and importance
of Wikipedia articles in different languages. In: Information and Software
Technologies: 22nd International Conference, ICIST 2016, Druskininkai, Lithuania,
October 13-15, 2016, Proceedings. Springer International Publishing, Cham (2016)
pp. 613-624.
19. McEnery, T. and Hardie, A. (2012) Corpus Linguistics: Method, Theory and Practice,
Cambridge University Press, Cambridge, pp. 48–52.