Analysis of References Across Wikipedia Languages

Abstract

Reliable information sources are important for assessing content quality in Wikipedia. Using references, readers can verify facts or find more details about the described topic. Each Wikipedia article can have over 290 language versions. As articles can be edited independently in any language, even by anonymous users, the information about the same topic may be inconsistent. This also applies to the sources found in the various language versions of a particular article, so the same statement can have different sources. In some cases, Wikipedia users who speak two or more languages transfer information with references between language versions. This paper presents an analysis of the use of common references in over 10 million articles in several Wikipedia language editions: English, German, French, Russian, Polish, Ukrainian, and Belarusian. The study also shows the use of similar sources and their number in language-sensitive topics.
* This is a preprint version. The final publication is available at
Springer via https://doi.org/10.1007/978-3-319-67642-5_47
Włodzimierz Lewoniewski, Krzysztof Węcel, Witold Abramowicz
Poznań University of Economics and Business, Poland
{wlodzimierz.lewoniewski, krzysztof.wecel, witold.abramowicz}@ue.poznan.pl
Keywords: Wikipedia, reference, source, citation.
1 Introduction
Wikipedia is a popular large collection of human knowledge. In April 2017 this free
online encyclopedia was the fifth most visited website in the world.¹ There are now
over 44 million articles in almost 300 language versions of Wikipedia. The biggest
language version is English, which has more than 5 million articles.
Wikipedia offers an innovative way to read and edit information online for
people around the world. Even anonymous users, without confirming their skills and
experience, can collaborate on article creation in this community knowledge base.
Despite the fact that Wikipedia is often criticized for poor quality of information,
over the last 10 years its articles have been cited in over 80 thousand scientific
publications.² This is almost 10 times more than the number of scientific publications
citing Encyclopaedia Britannica in the same period.
One of the most important quality measures for Wikipedia is verifiability. Different
language versions of the same topic in Wikipedia can be created and edited
independently. Therefore, there are often differences in quality between the various
language versions of the same article. Wikipedia users who speak several languages try
to translate some content between more and less developed language versions. Often,
along with the content, users also transfer information about references. Referencing
verifiable resources enhances the quality of Wikipedia articles [10].
¹ http://www.alexa.com/siteinfo/wikipedia.org
² Information about the number of scientific publications is taken from https://www.scopus.com,
where the search query was REF(wikipedia.org/wiki) in works published in 2008-2017.
In this paper we analyze the number of references included in Wikipedia articles in
various languages, the most popular information sources, and the number of common
references in different pairs of Wikipedia language editions. To compare identical
references that have different descriptions, we used a unification method based on
special identifiers. In this study we analyze all articles with references from some of
the most developed Wikipedia editions and some less developed ones: English
(EN), German (DE), French (FR), Russian (RU), Polish (PL), Ukrainian (UK), and
Belarusian (BE).
2 Sources in Wikipedia
High-quality Wikipedia articles must be well researched and provide a representative
survey of the relevant literature.³ When adding or editing article content, authors must
also add reliable, published sources. As a result, people using the encyclopedia can
check where the information comes from and verify the facts described in it.
A large number of Wikipedia articles are unassessed or have a low quality grade [1].
Differences between language versions on the same topic make it even more difficult
to assess the quality of articles.
There is a series of studies that use references for assessing the quality of Wikipedia
articles. One group of scientific works examined how references affect article
quality. Experiments showed that the number of references and derived measures (e.g. the
ratio of references to article length) were among the most important predictors in article
quality models [2,3]. The online service WikiRank⁴ uses the number of references, together
with other features, to assess and compare the quality of Wikipedia articles in different
languages.
A second group of studies focused on the quality of references in Wikipedia. One of the
first studies in this direction suggested that Wikipedia articles tend to cite articles in
high-impact journals such as the New England Journal of Medicine, Nature, and Science
[8]. At the same time, the number of peer-reviewed academic papers in the health sciences
that cite Wikipedia is increasing [4]. References can cover a wide range of
subjects, but are particularly focused on articles from ecology, evolution and other topics
that can enrich the encyclopedia with scholarly sources [6]. More than half of the
references used in the history articles of the encyclopedia are internet sources, such as
news, media, and government websites [7]. When users add references to academic
publications, they prefer to use books as sources rather than articles [5]. Wikipedia
is especially valuable due to the potential direct linkages to other primary
sources through special identifiers such as DOIs or PubMed IDs [9]. Additionally, the
academic status of a work is the most important predictor of its appearance in Wikipedia
references [12].
³ https://en.wikipedia.org/wiki/Wikipedia:Featured_article_criteria
⁴ http://wikirank.net
Wikipedia has also developed a set of templates for flagging articles that do not have
enough references or have no references at all.⁵ Of the more than 300 specific quality
flaw templates, that template is the most frequent in English Wikipedia [11]. We
can therefore conclude that the Wikipedia community pays special attention to the
availability of references in articles.
3 Extraction of References
Using Wikipedia dumps from May, 2017, we have extracted all references from over
10 million articles in 7 language editions (BE, DE, EN, FR, PL, RU, UK).
In wiki-code, references are usually placed between the special tags <ref>…</ref>.⁶ In
general, we can divide these references into two groups: those with a special template and
those without one. References without a special template usually contain the URL of the
source and an optional description (e.g. a title).
References with special templates can carry different data describing the source.
Here, separate fields can hold information about the author(s), title, URL, format,
access date, publisher and others. Additionally, these templates can contain special
identifiers such as DOI, JSTOR, PMC, PMID, arXiv, ISBN, ISSN, and OCLC. The set
of possible parameters depends on the type of template, which can describe a web
source, book, journal, news item, conference, act and others. It is important to note that
each language version of Wikipedia can use its own group of templates, with their own
names and sets of parameters describing information sources.
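The two groups of references described above can be told apart with a few regular expressions. The following Python sketch is only an illustration and not the parser used in this study: the function name `extract_references`, the patterns, and the returned dictionary layout are assumptions, and real wiki-code has many edge cases (nested templates, attributes containing slashes) that it does not handle.

```python
import re

# Matches <ref>...</ref> and <ref name="...">...</ref>. The self-closing form
# <ref name="..." /> only repeats an earlier reference, so it is not matched.
REF_RE = re.compile(r"<ref(?:\s[^>/]*)?>(.*?)</ref>", re.DOTALL | re.IGNORECASE)
# Name of the first template inside a reference body, e.g. "cite web".
TEMPLATE_RE = re.compile(r"\{\{\s*([^|}]+?)\s*[|}]")
# First URL inside a reference body.
URL_RE = re.compile(r"https?://[^\s\]|}<]+")


def extract_references(wikitext):
    """Return one dict per reference: its template name (or None) and URL (or None)."""
    refs = []
    for body in REF_RE.findall(wikitext):
        template = TEMPLATE_RE.search(body)
        url = URL_RE.search(body)
        refs.append({
            "template": template.group(1).lower() if template else None,
            "url": url.group(0) if url else None,
        })
    return refs


if __name__ == "__main__":
    sample = (
        'Fact one.<ref>[http://example.com/page Some title]</ref> '
        'Fact two.<ref name="x">{{cite web |url=http://example.org/a |title=T}}</ref> '
        'Fact two again.<ref name="x" />'
    )
    for ref in extract_references(sample):
        print(ref)
```

A reference whose `template` is `None` falls into the "without special template" group. Template and parameter names differ between language editions, so a real parser would need a per-language mapping on top of this.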
Table 1. Articles and references count in different language versions of Wikipedia in
May 2017.

| Lang. | Number of articles | Articles with ref. | Number of references | Unique ref. | Unique ref. domains |
|-------|--------------------|--------------------|----------------------|-------------|---------------------|
| BE    | 143,023            | 31,522             | 111,961              | 82,295      | 22,042              |
| DE    | 2,057,871          | 874,370            | 3,777,825            | 2,988,443   | 500,560             |
| EN    | 5,396,615          | 3,540,201          | 25,534,467           | 18,470,122  | 1,588,692           |
| FR    | 1,866,412          | 818,909            | 4,510,703            | 3,364,408   | 389,588             |
| PL    | 1,219,709          | 611,247            | 2,468,167            | 1,548,696   | 184,909             |
| RU    | 1,391,120          | 714,599            | 3,852,470            | 2,873,069   | 356,896             |
| UK    | 693,969            | 260,913            | 1,010,965            | 635,149     | 114,109             |
| Total | 12,768,719         | 6,851,761          | 41,266,558           | 29,962,182  | 3,156,796           |

Source: own calculation based on Wikipedia dumps.
To extract information about sources, we created our own parser, which takes
into account the different names of reference templates and parameters in each
Wikipedia language edition. We investigated about 12.7 million articles (excluding
redirects to other articles) and found over 41 million references from over 3 million
website domains in 7 language versions. More detailed statistics are given in Table 1.
The Zipfian distribution of the frequency of source domains in each language is shown in
Figure 1.
⁵ https://en.wikipedia.org/wiki/Template:Unreferenced
⁶ Can also be <ref name="…">…</ref> or <ref name="…" />
Figure 1. Zipf's law: frequency vs. frequency rank for domains in each language
version of Wikipedia
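The rank-frequency data behind a plot like Figure 1 can be obtained by counting how often each domain occurs among the reference URLs and sorting the counts. The Python sketch below is an illustration under stated assumptions: the helper names `domain_of` and `rank_frequency` are hypothetical, and stripping a leading "www." is our own normalization choice, not necessarily the one used for the figure.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url):
    """Return the host part of a URL, without a leading 'www.' (an assumed normalization)."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def rank_frequency(urls):
    """Return (rank, domain, frequency) tuples, most frequent domain first."""
    counts = Counter(domain_of(u) for u in urls)
    counts.pop("", None)  # drop URLs from which no host could be parsed
    return [
        (rank, domain, freq)
        for rank, (domain, freq) in enumerate(counts.most_common(), start=1)
    ]


if __name__ == "__main__":
    urls = [
        "http://doi.org/10.1000/a",
        "https://doi.org/10.1000/b",
        "http://www.nytimes.com/some-article",
        "https://books.google.com/books?vid=ISBN9783319462547",
        "http://doi.org/10.1000/c",
    ]
    for row in rank_frequency(urls):
        print(row)
```

Plotting frequency against rank on log-log axes for each language would then show whether the counts follow an approximately straight line, as expected for a Zipfian distribution.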
It is important to note that for references with the same special identifiers we can
determine equivalence even when their descriptions differ (e.g. titles in other
languages). We can also unify their URLs. For example, if a reference
has the ISBN "978-3-319-46254-7", we assign it the URL
"books.google.com/books?vid=ISBN9783319462547". More detailed information
about the identifiers we used to unify the references is shown in Table 2.
Table 2. Identifiers used for URL unification of references.

| Identifier | Description                               | New URL                                     |
|------------|-------------------------------------------|---------------------------------------------|
| arXiv      | arXiv repository identifier               | http://arxiv.org/abs/...                    |
| DOI        | Digital object identifier                 | http://doi.org/...                          |
| ISBN       | International Standard Book Number        | http://books.google.com/books?vid=ISBN...   |
| ISSN       | International Standard Serial Number      | https://worldcat.org/ISSN/...               |
| JSTOR      | Journal Storage number                    | https://jstor.org/stable/...                |
| PMC        | PubMed Central                            | https://ncbi.nlm.nih.gov/pmc/articles/PMC... |
| PMID       | PubMed                                    | https://ncbi.nlm.nih.gov/pubmed/...         |
| OCLC       | WorldCat's Online Computer Library Center | https://worldcat.org/oclc/...               |
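The unification step can be sketched as a lookup from identifier type to URL prefix. In the Python sketch below the prefixes are taken from Table 2, but everything else is an assumption made for illustration: the function name `unify_url`, the lowercase parameter keys, the order in which identifiers are tried when a reference has several, and the cleaning of ISBN and PMC values.

```python
import re

# URL prefixes as listed in Table 2. The order of this dict decides which
# identifier wins when a reference has several; that order is an assumption.
URL_PREFIX = {
    "arxiv": "http://arxiv.org/abs/",
    "doi": "http://doi.org/",
    "isbn": "http://books.google.com/books?vid=ISBN",
    "issn": "https://worldcat.org/ISSN/",
    "jstor": "https://jstor.org/stable/",
    "pmc": "https://ncbi.nlm.nih.gov/pmc/articles/PMC",
    "pmid": "https://ncbi.nlm.nih.gov/pubmed/",
    "oclc": "https://worldcat.org/oclc/",
}


def unify_url(params):
    """Map a reference (dict of template parameters, string values) to one canonical URL.

    Falls back to the reference's own "url" parameter when no identifier is present.
    """
    for key, prefix in URL_PREFIX.items():
        value = params.get(key, "").strip()
        if not value:
            continue
        if key == "isbn":
            value = re.sub(r"[\s-]", "", value)  # 978-3-319-46254-7 -> 9783319462547
        elif key == "pmc":
            value = re.sub(r"^PMC", "", value, flags=re.IGNORECASE)
        return prefix + value
    return params.get("url")


if __name__ == "__main__":
    english = {"title": "Some book", "isbn": "978-3-319-46254-7"}
    polish = {"title": "Pewna książka", "isbn": "9783319462547"}
    print(unify_url(english))
    print(unify_url(english) == unify_url(polish))
```

Two references that describe the same book with different titles or ISBN formatting end up with the same URL, which is what allows them to be counted as one unique reference.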
Table 3 presents the number of unique references with a particular identifier in each
language version of Wikipedia.
Table 3. Number of references with particular identifier in Wikipedia articles

| Lang. | arXiv  | DOI       | ISBN      | ISSN    | JSTOR  | PMC    | PMID   | OCLC   |
|-------|--------|-----------|-----------|---------|--------|--------|--------|--------|
| BE    | 90     | 1,185     | 13,656    | 78      | 28     | 53     | 198    | 19     |
| DE    | 2,416  | 31,014    | 171,073   | 12,696  | 1,591  | 1,022  | 3,481  | 2,671  |
| EN    | 4,226  | 1,014,602 | 1,670,495 | 79,442  | 35,709 | 16,384 | 52,387 | 54,995 |
| FR    | 842    | 50,381    | 332,593   | 25,297  | 2,045  | 782    | 7,406  | 7,598  |
| PL    | 577    | 41,796    | 245,833   | 23,319  | 781    | 338    | 11,157 | 1,131  |
| RU    | 1,577  | 33,956    | 232,427   | 3,045   | 785    | 1,236  | 5,164  | 977    |
| UK    | 301    | 2,562     | 37,628    | 618     | 96     | 160    | 313    | 400    |
| Total | 10,029 | 1,175,496 | 2,703,705 | 144,495 | 41,035 | 19,975 | 80,106 | 67,791 |

Source: own calculations.
Unification of URLs based on identifiers was used for counting the number of unique
references and will be used for comparison of similarity of references in different
language versions of Wikipedia articles.
4 Similarity of Sources
To examine the similarity of sources across different Wikipedia language
versions, we created three datasets with articles covering different topics (Wiki, Wiki7,
Wiki5) and three datasets with language-sensitive topics (LST). All data were extracted
from Wikipedia dumps from May 2017.
4.1. Wiki
We first chose all the articles with references (about 6.9 million) in the 7 considered
languages. After extraction we had almost 30 million unique references. Table 4 presents
the number of common sources for each pair of languages.
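Table 4. Number of common references used in Wikipedia language versions in Wiki
dataset.

| Lang. | BE     | DE        | EN         | FR        | PL        | RU        | UK      |
|-------|--------|-----------|------------|-----------|-----------|-----------|---------|
| BE    | 82,295 | 3,522     | 19,116     | 6,127     | 5,043     | 47,931    | 13,100  |
| DE    | -      | 2,988,443 | 345,202    | 81,572    | 41,558    | 69,634    | 21,097  |
| EN    | -      | -         | 18,470,130 | 584,037   | 244,120   | 635,546   | 160,408 |
| FR    | -      | -         | -          | 3,364,409 | 61,104    | 118,700   | 32,470  |
| PL    | -      | -         | -          | -         | 1,548,696 | 71,221    | 26,022  |
| RU    | -      | -         | -          | -         | -         | 2,873,070 | 185,473 |
| UK    | -      | -         | -          | -         | -         | -         | 635,149 |

Source: own calculations.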
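A table of this shape can be produced by intersecting, for every pair of languages, the sets of unified reference URLs. The Python sketch below illustrates the idea on toy data; the function name `common_reference_counts` and the input layout (a dict mapping a language code to a set of URLs) are assumptions made for the example, not the code used in this study.

```python
from itertools import combinations


def common_reference_counts(refs_by_lang):
    """Count shared references for every language pair.

    refs_by_lang maps a language code to the set of its unified reference URLs.
    The result maps (lang_a, lang_b) to the size of the intersection; the
    diagonal (lang, lang) holds the number of unique references in that language.
    """
    table = {(lang, lang): len(urls) for lang, urls in refs_by_lang.items()}
    for a, b in combinations(sorted(refs_by_lang), 2):
        table[(a, b)] = len(refs_by_lang[a] & refs_by_lang[b])
    return table


if __name__ == "__main__":
    toy = {
        "EN": {"http://doi.org/10.1/a", "http://doi.org/10.1/b", "http://example.com/x"},
        "PL": {"http://doi.org/10.1/a", "http://stat.gov.pl/y"},
        "RU": {"http://doi.org/10.1/a", "http://doi.org/10.1/b"},
    }
    for pair, count in sorted(common_reference_counts(toy).items()):
        print(pair, count)
```

Only the upper triangle is stored because the intersection is symmetric, which matches the layout of the table above.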
The largest number of references in the English Wikipedia can be explained by the
largest number of articles in it. In the next datasets we take an equal number of articles
in each language. We show the overlap of unique references between selected language
versions in Figure 2.
Figure 2. Unique references overlap between selected language versions of
Wikipedia. Source: own calculations.
It is noticeable that there are more common sources among Slavic language versions
(PL, RU, UK).
Table 5. Top 10 most popular reference domains in various Wikipedia language
versions in Wiki dataset⁷

| BE               | DE               | EN              | FR                 |
|------------------|------------------|-----------------|--------------------|
| books.google.com | books.google.com | books.google.com | books.google.com  |
| pravo.by         | books.google.de  | doi.org         | doi.org            |
| football.by      | spiegel.de       | nytimes.com     | books.google.fr    |
| doi.org          | doi.org          | news.bbc.co.uk  | worldcat.org       |
| cuetracker.net   | welt.de          | bbc.co.uk       | lemonde.fr         |
| naviny.org       | zeit.de          | theguardian.com | legifrance.gouv.fr |
| by.tribuna.com   | faz.net          | worldcat.org    | lefigaro.fr        |
| worldsnooker.com | worldcat.org     | news.google.com | insee.fr           |
| web.archive.org  | youtube.com      | youtube.com     | gallica.bnf.fr     |
| gks.ru           | sueddeutsche.de  | census.gov      | interieur.gouv.fr  |

| PL                           | RU                | UK                |
|------------------------------|-------------------|-------------------|
| books.google.com             | books.google.com  | insee.fr          |
| web.archive.org              | doi.org           | books.google.com  |
| doi.org                      | insee.fr          | kia.hu            |
| sports-reference.com         | billboard.com     | w1.c1.rada.gov.ua |
| archive.is                   | textual.ru        | demo.istat.it     |
| worldcat.org                 | int.soccerway.com | nsi.bg            |
| stat.gov.pl                  | lenta.ru          | cvk.gov.ua        |
| discogs.com                  | web.archive.org   | pravda.com.ua     |
| allmusic.com                 | youtube.com       | youtube.com       |
| getamap.ordnancesurvey.co.uk | kommersant.ru     | web.archive.org   |

Source: own calculations.
⁷ The top 100 most popular reference domains, with the number of references in each language
version of Wikipedia, can be found at:
http://en.lewoniewski.info/2017/top-100-domains-in-wikipedia-references/
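Table 6. Number of common references’ domains used in Wikipedia language versions
in Wiki dataset.

| Lang. | BE     | DE      | EN        | FR      | PL      | RU      | UK      |
|-------|--------|---------|-----------|---------|---------|---------|---------|
| BE    | 22,042 | 10,563  | 15,393    | 10,475  | 9,783   | 19,030  | 12,485  |
| DE    | -      | 500,560 | 219,536   | 104,212 | 62,595  | 90,361  | 41,407  |
| EN    | -      | -       | 1,588,692 | 201,601 | 101,495 | 183,234 | 69,437  |
| FR    | -      | -       | -         | 389,588 | 56,693  | 86,071  | 39,426  |
| PL    | -      | -       | -         | -       | 184,909 | 60,130  | 32,382  |
| RU    | -      | -       | -         | -       | -       | 356,896 | 73,254  |
| UK    | -      | -       | -         | -       | -       | -       | 114,109 |

Source: own calculations.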
Figure 3. References’ domains overlap between selected language versions of
Wikipedia. Source: own calculations.
Comparing Figures 2 and 3, we find that reference domains are more
international: relative to individual references, they are more often shared across
language versions of Wikipedia.
4.2. Wiki5
This dataset contains 273,878 articles that exist in all five language versions:
DE, EN, FR, PL, RU. The numbers of articles and extracted references are shown in
Table 7.
Table 7. Articles and references count in different language versions of Wikipedia
in Wiki5 dataset.

| Lang. | Number of articles | Articles with ref. | Number of references | Ref. with template | Unique ref. domains |
|-------|--------------------|--------------------|----------------------|--------------------|---------------------|
| DE    | 273,878            | 149,664            | 917,936              | 326,514            | 155,869             |
| EN    | 273,878            | 205,503            | 3,897,533            | 3,232,357          | 383,766             |
| FR    | 273,878            | 147,655            | 1,276,342            | 821,887            | 148,614             |
| PL    | 273,878            | 129,118            | 745,196              | 615,556            | 83,519              |
| RU    | 273,878            | 154,936            | 1,154,815            | 712,284            | 151,549             |
| Total | 1,369,390          | 786,876            | 7,991,822            | 5,708,598          | 923,317             |

Source: own calculations.
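Table 8. Number of common references used in Wikipedia language versions in
Wiki5 dataset.

| Lang. | DE      | EN        | FR        | PL      | RU      |
|-------|---------|-----------|-----------|---------|---------|
| DE    | 792,077 | 90,863    | 26,797    | 19,345  | 31,043  |
| EN    | -       | 3,261,658 | 170,200   | 104,595 | 236,229 |
| FR    | -       | -         | 1,056,170 | 29,015  | 49,156  |
| PL    | -       | -         | -         | 561,213 | 36,239  |
| RU    | -       | -         | -         | -       | 963,546 |

Source: own calculations.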
4.3. Wiki7
This dataset contains 46,957 articles that exist in all seven analyzed
languages: BE, DE, EN, FR, PL, RU, UK. The numbers of articles and extracted references
are shown in Table 9.
Table 9. Articles and references count in different language versions of Wikipedia
in Wiki7 dataset.

| Lang. | Number of articles | Articles with ref. | Number of references | Ref. with template | Unique ref. domains |
|-------|--------------------|--------------------|----------------------|--------------------|---------------------|
| BE    | 46,957             | 10,538             | 51,387               | 28,016             | 13,497              |
| DE    | 46,957             | 27,278             | 239,520              | 86,902             | 54,640              |
| EN    | 46,957             | 37,884             | 1,089,035            | 918,726            | 152,324             |
| FR    | 46,957             | 33,589             | 415,599              | 272,618            | 61,427              |
| PL    | 46,957             | 24,493             | 203,567              | 169,139            | 31,853              |
| RU    | 46,957             | 27,959             | 353,592              | 202,034            | 65,567              |
| UK    | 46,957             | 20,431             | 111,213              | 60,023             | 26,268              |
| Total | 328,699            | 182,172            | 2,463,913            | 1,737,458          | 405,576             |

Source: own calculations.
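Table 10. Number of common references used in Wikipedia language versions in
Wiki7 dataset.

| Lang. | BE     | DE      | EN      | FR      | PL      | RU      | UK     |
|-------|--------|---------|---------|---------|---------|---------|--------|
| BE    | 43,778 | 1,378   | 9,733   | 2,757   | 2,637   | 27,378  | 6,794  |
| DE    | -      | 217,236 | 17,768  | 5,467   | 3,572   | 5,377   | 2,585  |
| EN    | -      | -       | 955,305 | 44,528  | 26,139  | 47,782  | 21,066 |
| FR    | -      | -       | -       | 354,607 | 7,262   | 11,134  | 4,532  |
| PL    | -      | -       | -       | -       | 159,002 | 8,320   | 3,711  |
| RU    | -      | -       | -       | -       | -       | 308,500 | 28,619 |
| UK    | -      | -       | -       | -       | -       | -       | 91,191 |

Source: own calculations.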
4.4. LST
In addition to the above analyses, we carried out a further analysis
concerning the “nationality” of sources. We chose three sub-datasets, each describing
cities in a particular country: Poland, Germany, and France. These datasets are therefore
language sensitive. We further restricted them to cities described in at least five
languages: DE, EN, FR, PL, RU. As a result we obtained datasets with articles about
10,516 German cities, 10,092 French cities, and 904 Polish cities.
German cities (LST DE)
As for the previous datasets, Table 11 presents the number of articles with
references and the number of references in each language. It is noticeable that German
Wikipedia has the highest number of articles with references and the highest total
number of references. So, information about German cities is the most verifiable in
German Wikipedia.
Table 11. Articles and references count in different language versions of
Wikipedia in LST DE dataset.

| Lang. | Number of articles | Articles with ref. | Number of references | Ref. with template | Unique ref. domains |
|-------|--------------------|--------------------|----------------------|--------------------|---------------------|
| DE    | 10,516             | 9,532              | 64,305               | 18,893             | 16,541              |
| EN    | 10,516             | 2,540              | 11,744               | 3,168              | 3,359               |
| FR    | 10,516             | 1,129              | 2,752                | 484                | 956                 |
| PL    | 10,516             | 2,805              | 5,087                | 1,204              | 1,155               |
| RU    | 10,516             | 8,820              | 9,875                | 292                | 607                 |
| Total | 52,580             | 24,826             | 93,763               | 24,041             | 22,618              |

Source: own calculations.
From Table 12 we can argue that, when describing German cities, the German and English
Wikipedias share the most common sources.
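Table 12. Number of common references used in Wikipedia language versions in
LST DE dataset.

| Lang. | DE     | EN    | FR    | PL    | RU  |
|-------|--------|-------|-------|-------|-----|
| DE    | 49,436 | 1,045 | 234   | 80    | 90  |
| EN    | -      | 7,936 | 77    | 49    | 75  |
| FR    | -      | -     | 1,719 | 16    | 24  |
| PL    | -      | -     | -     | 1,572 | 25  |
| RU    | -      | -     | -     | -     | 961 |

Source: own calculations.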
French cities (LST FR)
Based on Tables 13 and 14 we can draw a similar conclusion: French cities
have the most verifiable description in French Wikipedia, and this language version
shares the most common references with English Wikipedia.
Table 13. Articles and references count in different language versions of
Wikipedia in LST FR dataset.

| Lang. | Number of articles | Articles with ref. | Number of references | Ref. with template | Unique ref. domains |
|-------|--------------------|--------------------|----------------------|--------------------|---------------------|
| DE    | 10,092             | 2,568              | 8,167                | 3,460              | 1,902               |
| EN    | 10,092             | 1,738              | 11,896               | 5,830              | 3,342               |
| FR    | 10,092             | 8,763              | 101,325              | 52,003             | 15,700              |
| PL    | 10,092             | 643                | 1,144                | 954                | 179                 |
| RU    | 10,092             | 8,157              | 38,007               | 34,844             | 1,103               |
| Total | 50,460             | 21,869             | 160,539              | 97,091             | 22,226              |

Source: own calculations.
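Table 14. Number of common references used in Wikipedia language versions in
LST FR dataset.

| Lang. | DE    | EN    | FR     | PL  | RU     |
|-------|-------|-------|--------|-----|--------|
| DE    | 6,959 | 128   | 368    | 14  | 408    |
| EN    | -     | 9,652 | 2,076  | 10  | 87     |
| FR    | -     | -     | 70,817 | 27  | 683    |
| PL    | -     | -     | -      | 497 | 6      |
| RU    | -     | -     | -      | -   | 21,930 |

Source: own calculations.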
Polish cities (LST PL)
Finally, in the case of Polish cities, Table 15 demonstrates a similar tendency: Polish
Wikipedia has the highest number of references and is therefore the most prominent
for this dataset. However, Table 16 shows that the EN&PL pair does not have the biggest
number of common references (99); the EN&FR pair has slightly more (101).
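Table 15. Articles and references count in different language versions of
Wikipedia in LST PL dataset.

| Lang. | Number of articles | Articles with ref. | Number of references | Ref. with template | Unique ref. domains |
|-------|--------------------|--------------------|----------------------|--------------------|---------------------|
| DE    | 904                | 608                | 2,439                | 387                | 932                 |
| EN    | 904                | 476                | 2,747                | 1,930              | 1,320               |
| FR    | 904                | 253                | 541                  | 179                | 350                 |
| PL    | 904                | 904                | 14,804               | 9,471              | 4,451               |
| RU    | 904                | 158                | 394                  | 151                | 235                 |
| Total | 4,520              | 2,399              | 20,925               | 12,118             | 7,288               |

Source: own calculations.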
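Table 16. Number of common references used in Wikipedia language versions in
LST PL dataset.

| Lang. | DE    | EN    | FR  | PL     | RU  |
|-------|-------|-------|-----|--------|-----|
| DE    | 2,116 | 81    | 13  | 58     | 9   |
| EN    | -     | 2,382 | 101 | 99     | 53  |
| FR    | -     | -     | 472 | 37     | 10  |
| PL    | -     | -     | -   | 11,098 | 40  |
| RU    | -     | -     | -   | -      | 339 |

Source: own calculations.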
We can see that in each language-sensitive dataset the total number of references is
always the biggest in the country's own language. If we look at the biggest number of
common sources between two languages, the English version always comes first. This could
mean that users who translate content from one language to another often choose the
English version as a source or a destination.
5 Conclusions and Future Work
The Wikipedia community puts great emphasis on the verifiability of information contained
in the articles. Using special identifiers, we can unify the same references that are
present in various Wikipedia editions.
This study shows that different language versions of Wikipedia use common sources
in different ways depending on the topic. The English and German versions have the
biggest number of common references: 345,202. However, we need to take into account the
total number of articles in these languages: they are the biggest Wikipedia editions. If we
consider only articles that are represented in at least 5 considered languages, then the
biggest number of common references is shared by the Russian and English Wikipedia editions.
For language-sensitive topics we always get the same result: the most verifiable
information is available in the respective language. In this case, these topics often have
more common references with the biggest language version of Wikipedia, English.
Our future work will be devoted to more in-depth research on the similarity of
references. We plan to use some external open citation databases (e.g. WorldCat⁸,
Google Scholar⁹, Microsoft Academic¹⁰) to find different data about the same sources
(URLs, titles, identifiers, etc.). These databases can also be helpful for finding
information about the importance of a particular source (e.g. citation index, impact
factor). We plan to include this analysis in assessing the quality of articles and of
parameters in special templates (infoboxes). This can help to improve article quality in
less developed language versions of Wikipedia and also enrich other popular open knowledge
databases such as DBpedia¹¹, Wikidata¹², YAGO, Freebase and others.
References
1. Węcel, K., Lewoniewski, W., (2015), Modelling the Quality of Attributes in Wikipedia
Infoboxes. Business Information Systems Workshops. Volume 228 of Lecture Notes in
Business Information Processing. Springer International Publishing, pp. 308-320.
2. Warncke-Wang, M., Cosley, D., Riedl, J., (2013), Tell me more: an actionable quality model
for Wikipedia, Proceedings of the 9th International Symposium on Open Collaboration.
3. Lewoniewski, W., Węcel, K., Abramowicz, W., (2016), Quality and importance of Wikipedia
articles in different languages, Information and Software Technologies: 22nd International
Conference, ICIST 2016, Druskininkai, Lithuania, October 13-15, 2016, Proceedings.
Springer International Publishing, Cham, pp. 613-624.
4. Bould, M. D., Hladkowicz, E. S., Pigford, A. A. E., Ufholz, L. A., Postonogova, T., Shin, E.,
Boet, S. (2014), References that anyone can edit: review of Wikipedia citations in peer
reviewed health science literature, BMJ, 348.
5. Kousha, K., Thelwall, M., (2017), Are Wikipedia citations important evidence of the impact
of scholarly articles and books?, Journal of the Association for Information Science and
Technology, 68(3), pp. 762-779.
6. Lin, J., Fenner, M. (2014), An analysis of Wikipedia references across PLOS publications, 14:
Expanding impacts and metrics, An ACM Web Science Conference 2014 Workshop, pp. 23-26.
7. Luyt, B., & Tan, D. (2010). Improving Wikipedia's credibility: References and citations in a
sample of history articles. Journal of the American Society for Information Science and
Technology, 61(4), 715-722.
8. Nielsen, F. Å., (2007), Scientific citations in Wikipedia, First Monday, 12(8)
9. Page, R. D., (2010), Wikipedia as an encyclopaedia of life, Organisms Diversity & Evolution,
10(4), pp. 343-349.
10. Mesgari, M., Okoli, C., Mehdi, M., Nielsen, F. Å., & Lanamäki, A., (2015), “The sum of all
human knowledge”: A systematic review of scholarly research on the content of Wikipedia,
Journal of the Association for Information Science and Technology, 66(2), pp. 219-245.
11. Anderka, M., (2013), Analyzing and Predicting Quality Flaws in User-generated Content:
The Case of Wikipedia, Doctoral dissertation, Bauhaus-Universität Weimar Germany.
12. Teplitskiy, M., Lu, G., Duede, E., (2016), Amplifying the impact of Open Access: Wikipedia
and the diffusion of science, Journal of the Association for Information Science and
Technology.
⁸ http://www.worldcat.org
⁹ https://scholar.google.com
¹⁰ https://academic.microsoft.com/
¹¹ http://www.dbpedia.org
¹² https://www.wikidata.org
... A reliable source is defined, in turn, as a secondary and published, ideally scholarly, one. 3 Despite the community's best efforts to add all the needed citations, the majority of articles in Wikipedia might still contain unverified claims, in particular lower-quality ones (Lewoniewski, Wecel et al., 2017). The citation practices of editors might also not be systematic at times (Chen & Roth, 2012;Forte, Andalibi et al., 2018). ...
... A high portion of citations to sources in Wikipedia refer to scientific or scholarly literature (Nielsen, Mietchen, & Willighagen, 2017), as Wikipedia is instrumental in providing access to scientific information and in fostering the public understanding of science (Heilman, Kemmann et al., 2011;Laurent & Vickers, 2009;Lewoniewski et al., 2017;Maggio, Steinberg et al., 2020;Maggio, Willinsky et al., 2017;Shafee, Masukume et al., 2017;Smith, 2020;Torres-Salinas, Romero-Frías, & Arroyo-Machado, 2019). Citations in Wikipedia are also useful for users browsing low-quality or underdeveloped articles, as they allow them to look for information outside of the platform (Piccardi, Redi et al., 2020). ...
Article
Full-text available
Wikipedia’s content is based on reliable and published sources. To this date, relatively little is known about what sources Wikipedia relies on, in part because extracting citations and identifying cited sources is challenging. To close this gap, we release Wikipedia Citations, a comprehensive data set of citations extracted from Wikipedia. A total of 29.3 million citations were extracted from 6.1 million English Wikipedia articles as of May 2020, and classified as being books, journal articles, or Web content. We were thus able to extract 4.0 million citations to scholarly publications with known identifiers—including DOI, PMC, PMID, and ISBN—and further equip an extra 261 thousand citations with DOIs from Crossref. As a result, we find that 6.7% of Wikipedia articles cite at least one journal article with an associated DOI, and that Wikipedia cites just 2% of all articles with a DOI currently indexed in the Web of Science. We release our code to allow the community to extend upon our work and update the data set in the future.
... Blumenstock (2008) found that the number of words in an article is a strong predictor of whether the article will be featured on Wikipedia. Similarly, a number of studies (Lewoniewski et al., 2016(Lewoniewski et al., , 2017aWarncke-Wang et al., 2013) found that the number of references in an article is a strong indicator of the articles' quality. ...
... Wikipedia guidelines recommend editors to support their edits with "good references from independent sources" 14 and an article with significant number of references to support its content is considered to be a Good Article (GA). 15 The number of references in an article has been found to be one of the most important predictors of article quality (Lewoniewski et al., 2016(Lewoniewski et al., , 2017aWarncke-Wang et al., 2013). Comparing the number of references in articles in different languages on the same topic can thus provide insights into the relative quality and rigor of the content about the topic in different editions. ...
Article
Wikipedia is the largest web‐based open encyclopedia covering more than 300 languages. Different language editions of Wikipedia differ significantly in terms of their information coverage. In this article, we compare the information coverage in English Wikipedia (most exhaustive) and Wikipedias in 8 other widely spoken languages, namely Arabic, German, Hindi, Korean, Portuguese, Russian, Spanish, and Turkish. We analyze variations in different language editions of Wikipedia in terms of the number of topics covered as well as the amount of information discussed about different topics. Further, as a step towards bridging the information gap, we present WikiCompare—a browser plugin that allows Wikipedia readers to have a comprehensive overview of topics by incorporating missing information from Wikipedia page in other language.
... Wikipedia incorporates one of the largest reference repositories in existence. This is primarily due to its guidelines strongly encouraging that all content have to be verifiable, which is mostly achieved by providing a pointer to a reliable source that supports content added to the article text. 1 Thus, Wikipedia articles usually include reference lists; and overall, the English Wikipedia contains more than 55 million references. 2 Cited sources can be different types of publications, including for example formally published scientific papers, books, and news media articles, but also links to websites or any other type of Web documents (Lewoniewski et al., 2017). ...
... According to Mesgari et al. (2015), the quality of content and of referenced sources, in particular, was one of the major study objects on Wikipedia. For example, Lewoniewski et al. (2017) studied the similarity of sources from different Wikipedia language editions. They found that URLs in references shared many domain names between language versions, but there were not many cases of exact matches of URLs in references across languages. ...
Preprint
Full-text available
With this work, we present a publicly available dataset of the history of all the references (more than 55 million) ever used in the English Wikipedia until June 2019. We have applied a new method for identifying and monitoring references in Wikipedia, so that for each reference we can provide data about associated actions: creation, modifications, deletions, and reinsertions. The high accuracy of this method and the resulting dataset was confirmed via a comprehensive crowdworker labelling campaign. We use the dataset to study the temporal evolution of Wikipedia references as well as users' editing behaviour. We find evidence of a mostly productive and continuous effort to improve the quality of references: (1) there is a persistent increase of reference and document identifiers (DOI, PubMedID, PMC, ISBN, ISSN, ArXiv ID), and (2) most of the reference curation work is done by registered humans (not bots or anonymous editors). We conclude that the evolution of Wikipedia references, including the dynamics of the community processes that tend to them should be leveraged in the design of relevance indexes for altmetrics, and our dataset can be pivotal for such effort.
... Wikipedia articles are created, improved and maintained by the efforts of the community of volunteer editors [Priedhorsky et al., 2007;Chen and Roth, 2012], and they are used in a variety of ways by a wide user base [Singer et al., 2017;Lemmerich et al., 2019;. The information Wikipedia contains is generally considered to be of high-quality and up-to-date [Priedhorsky et al., 2007;Keegan et al., 2011;Geiger and Halfaker, 2013;Kumar et al., 2016;Piscopo and Simperl, 2019;Adams et al., 2020;Smith, 2020], notwithstanding margins for improvement and the need for constant knowledge maintenance [Chen and Roth, 2012;Lewoniewski et al., 2017;Forte et al., 2018]. ...
... Following Wikipedia's editorial guidelines, the community of editors creates content often relying on scientific and scholarly literature [Nielsen et al., 2017; Arroyo-Machado et al., 2020], and therefore Wikipedia can be considered a mainstream gateway to scientific information [Laurent and Vickers, 2009; Heilman et al., 2011; Lewoniewski et al., 2017; Shafee et al., 2017; Maggio et al., 2019]. Unfortunately, few studies have considered the representativeness and reliability of Wikipedia's scientific sources. ...
Article
Wikipedia is one of the main sources of free knowledge on the Web. During the first few months of the pandemic, over 5,200 new Wikipedia pages on COVID-19 were created, accumulating over 400M pageviews by mid-June 2020. At the same time, an unprecedented number of scientific articles on COVID-19 and the ongoing pandemic have been published online. Wikipedia's contents are based on reliable sources such as scientific literature. Given its public function, it is crucial for Wikipedia to rely on representative and reliable scientific results, especially so in a time of crisis. We assess the coverage of COVID-19-related research in Wikipedia via citations to a corpus of over 160,000 articles. We find that Wikipedia editors are integrating new research at a fast pace, and have cited close to 2% of the COVID-19 literature under consideration. While doing so, they are able to provide a representative coverage of COVID-19-related research. We show that all the main topics discussed in this literature are proportionally represented in Wikipedia, after accounting for article-level effects. We further use regression analyses to model citations from Wikipedia and show that Wikipedia editors on average rely on literature which is highly cited, widely shared on social media, and has been peer-reviewed.
... Finally, the existence of a 'source' asymmetry, conceived as an unequal use of references in the construction of content about people of different genders, has been examined. The relevance of this aspect lies in the idea that the use of appropriate sources is a significant factor in the quality of encyclopaedic articles (Nielsen, 2007; Lewoniewski et al., 2017). One study compared article sources among male and female CEOs, finding that women's biographies have more references and more diverse sources (Young et al., 2016). ...
Article
The gender gap in Wikipedia content is a complex phenomenon that comprises several asymmetries, discursive dimensions, and social concerns. However, there is no theoretical framework to organise this complexity consistently. Based on writings by Foucault, Deleuze and Tkacz, we interpret Wikipedia as a 'field of visibility' and provide a framework to systemise its content gaps. Then we use that model to organise the complexity of the content gender gap on Wikipedia, performing a systematic overview of the asymmetries tested in empirical research. We suggest that this analysis is relevant for the effective planning of governance processes that seek to avoid female or non-male subordination in digital platforms' discourses.
... According to Mesgari et al. (2015), the quality of content and of referenced sources was one of the major study objects on Wikipedia. For example, Lewoniewski et al. (2017) studied the similarity of sources from different Wikipedia language editions. They found that URLs in references shared many domain names between language versions, but there were not many cases of exact matches of URLs in references across languages. ...
Article
With this work, we present a publicly available dataset of the history of all the references (more than 55 million) ever used in the English Wikipedia until June 2019. We have applied a new method for identifying and monitoring references in Wikipedia, so that for each reference we can provide data about associated actions: creation, modifications, deletions, and reinsertions. The high accuracy of this method and the resulting dataset was confirmed via a comprehensive crowdworker labelling campaign. We use the dataset to study the temporal evolution of Wikipedia references as well as users’ editing behaviour. We find evidence of a mostly productive and continuous effort to improve the quality of references: (1) there is a persistent increase of reference and document identifiers (DOI, PubMedID, PMC, ISBN, ISSN, ArXiv ID), and (2) most of the reference curation work is done by registered humans (not bots or anonymous editors). We conclude that the evolution of Wikipedia references, including the dynamics of the community processes that tend to them, should be leveraged in the design of relevance indexes for altmetrics, and our dataset can be pivotal for such an effort. Peer Review https://publons.com/publon/10.1162/qss_a_00171
... Finally, the existence of a 'source' asymmetry, conceived as an unequal use of references in the construction of content about people of different genders, has been examined. The interest in this aspect lies in the idea that the use of appropriate sources is a relevant factor in the quality of encyclopaedic articles (Nielsen, 2007; Lewoniewski et al., 2017). One study compared article sources among male and female CEOs, finding that women's biographies have more references and draw on more diverse sources (Young et al., 2016). ...
Article
The gender gap in Wikipedia content is a complex phenomenon that comprises several asymmetries, discursive dimensions, and social concerns. However, there is no theoretical framework to organise this complexity consistently. Based on writings by Foucault, Deleuze and Tkacz, we interpret Wikipedia as a 'field of visibility' and provide a framework to analyse its content gaps. Then we use that model to organise the complexity of the content gender gap, performing a systematic overview of the asymmetries tested in empirical research. We suggest that this analysis is relevant for the effective planning of governance processes that seek to avoid women's subordination in digital platforms' discourses. Teaser: we provide a theoretical framework to analyse content gaps in Wikipedia and then use it to examine how this platform is shaping women's visibility.
... Singh et al. [7] provide a dataset of English Wikipedia references, focusing on scholarly publications. Lewoniewski et al. [4] analyse usage of references on Wikipedia across multiple language versions. ...
Preprint
References are an essential part of Wikipedia. Each statement in Wikipedia should be referenced. In this paper, we explore the creation and collection of references for new Wikipedia articles from an editors' perspective. We map out the workflow of editors when creating a new article, emphasising how they select references.
... The types of sources consulted and the number of references are different in Wikipedia than they are in published academic texts; in Wikipedia, they vary between different areas, fields, and language versions. In general, more references are provided for articles about objects and phenomena related to the country where the specific language edition originated, as exemplified by the entries on cities from five versions of Wikipedia (Lewoniewski et al., 2017). The study by Ford et al. (2013) examined the types of sources used in English Wikipedia, with 500 randomly selected citations as its data sample. ...
Article
Purpose/Thesis: The paper aims to describe the types and structure of references to different sources as cited by the selected Polish Wikipedia articles from the category of people related to the Austrian Partition and all the categories below. Approach/Methods: The research data consisted of references from 50 randomly selected articles from Polish Wikipedia, including 1007 citations and 758 references. The references have been gathered, processed, and analyzed mainly using the R language. They have been categorized, and then the descriptive statistics for the chosen elements have been provided and analyzed. Results and conclusions: The study shows that the majority of sources used in the research sample were of primary nature. Consequently, it demonstrates that the analyzed articles about historical persons can, to a certain extent, be regarded more as a product of research than as simple imitative work. Polish Wikipedians mainly utilized government directories and newspaper or magazine articles, often from digital libraries. Secondary sources, on the other hand, chiefly consisted of books, webpages, and book sections. The structure of references was diverse, and bibliographic descriptions sometimes lacked important elements. The findings confirm difficulties in analyzing sources in Wikipedia. Moreover, they support the need for researching different editions and subject areas of the largest online encyclopedia. Research limitations: Due to the exploratory character of research, which focuses on references from selected articles about historical persons from Poland, one should not readily extrapolate its results to other parts of Polish Wikipedia. The research sample only comprised citations and references, which were collected at one specific point of time. Additionally, the categorization of references has been done by a single researcher, and intercoder reliability has not been checked.
Originality/Value: Most of the studies into sources used in Wikipedia articles have been limited to its English edition so far. Moreover, articles about historical persons in this encyclopedia have not been analyzed from the perspective of utilized sources, their types, and reference patterns. The paper broadens the understanding of sources usage in Wikipedia by focusing on the Polish edition of the encyclopedia.
Article
Wiki-librarian is a multiyear project to train librarians and students of librarianship and information science to use wiki tools, including writing articles on Wikipedia. The project has existed since the beginning of 2015 at the University Library “Svetozar Markovic” and is logistically and financially supported by Wikimedia Serbia. The part of the project that deals with the training of librarians is officially accredited. Moreover, the project has expanded its scope to work with students, as well as with librarians outside the training process, through editathons, digitization of content, and participation in other wiki activities such as 1Lib1Ref and Wikipedian in Residence. Project activities were presented at several conferences and partially in several publications. On the one hand, this project gives librarians better awareness of the IT capabilities of wiki software and of methods for presenting their knowledge in the digital wiki environment; on the other hand, these activities significantly increase the textual resources of Serbian Wikipedia and their quality.
Conference Paper
This article aims to analyse the importance of the Wikipedia articles in different languages (English, French, Russian, Polish) and the impact of the importance on the quality of articles. Based on the analysis of literature and our own experience we collected measures related to articles, specifying various aspects of quality that will be used to build the models of articles’ importance. For each language version, the influential parameters are selected that may allow automatic assessment of the validity of the article. Links between articles in different languages offer opportunities in terms of comparison and verification of the quality of information provided by various Wikipedia communities. Therefore, the model can be used not only for a relative assessment of the content of the whole article, but also for a relative assessment of the quality of data contained in their structural parts, the so-called infoboxes.
Conference Paper
The quality of data in DBpedia depends on the underlying information provided in Wikipedia’s infoboxes. Various language editions can provide different information about a given subject with respect to the set of attributes and the values of these attributes. Our research question is which language editions provide correct values for each attribute so that data fusion can be carried out. Initial experiments proved that the quality of attributes is correlated with the overall quality of the Wikipedia article providing them. Wikipedia offers functionality to assign a quality class to an article, but unfortunately the majority of articles have not been graded by the community or the grades are not reliable. In this paper we analyse the features and models that can be used to evaluate the quality of articles, providing a foundation for the relative quality assessment of infobox attributes, with the purpose of improving the quality of DBpedia.
Article
Individual academics and research evaluators often need to assess the value of published research. Whilst citation counts are a recognised indicator of scholarly impact, alternative data is needed to provide evidence of other types of impact, including within education and wider society. Wikipedia is a logical choice for both of these because the role of a general encyclopaedia is to be an understandable repository of facts about a diverse array of topics and hence it may cite research to support its claims. To test whether Wikipedia could provide new evidence about the impact of scholarly research, this article counted citations to 302,328 articles and 18,735 monographs in English indexed by Scopus in the period 2005 to 2012. The results show that citations from Wikipedia to articles are too rare for most research evaluation purposes, with only 5% of articles being cited in all fields. In contrast, a third of monographs have at least one citation from Wikipedia, with the most in the arts and humanities. Hence, Wikipedia citations can provide extra impact evidence for academic monographs. Nevertheless, the results may be relatively easily manipulated and so Wikipedia is not recommended for evaluations affecting stakeholder interests.
Article
With the rise of Wikipedia as a first-stop source for scientific knowledge, it is important to compare its representation of that knowledge to that of the academic literature. This article approaches such a comparison through academic references made within the world's 50 largest Wikipedias. Previous studies have raised concerns that Wikipedia editors may simply use the most easily accessible academic sources rather than sources of the highest academic status. We test this claim by identifying the 250 most heavily used journals in each of 26 research fields (4,721 journals, 19.4M articles in total) indexed by the Scopus database, and modeling whether topic, academic status, and accessibility make articles from these journals more or less likely to be referenced on Wikipedia. We find that, controlling for field and impact factor, the odds that an open access journal is referenced on the English Wikipedia are 47% higher compared to closed access journals. Moreover, in most of the world's Wikipedias a journal's high status (impact factor) and accessibility (open access policy) both greatly increase the probability of referencing. Among the implications of this study is that the chief effect of open access policies may be to significantly amplify the diffusion of science, through an intermediary like Wikipedia, to a broad public audience.
Article
Wikipedia may be the best-developed attempt thus far to gather all human knowledge in one place. Its accomplishments in this regard have made it a point of inquiry for researchers from different fields of knowledge. A decade of research has thrown light on many aspects of the Wikipedia community, its processes, and its content. However, due to the variety of fields inquiring about Wikipedia and the limited synthesis of the extensive research, there is little consensus on many aspects of Wikipedia's content as an encyclopedic collection of human knowledge. This study addresses the issue by systematically reviewing 110 peer-reviewed publications on Wikipedia content, summarizing the current findings, and highlighting the major research trends. Two major streams of research are identified: the quality of Wikipedia content (including comprehensiveness, currency, readability, and reliability) and the size of Wikipedia. Moreover, we present the key research trends in terms of the domains of inquiry, research design, data source, and data gathering methods. This review synthesizes scholarly understanding of Wikipedia content and paves the way for future studies. Open access version: http://spectrum.library.concordia.ca/978652/
Article
To examine indexed health science journals to evaluate the prevalence of Wikipedia citations, identify the journals that publish articles with Wikipedia citations, and determine how Wikipedia is being cited. Bibliometric analysis. Publications in the English language that included citations to Wikipedia were retrieved using the online databases Scopus and Web of Science. To identify health science journals, results were refined using Ulrich's database, selecting for citations from journals indexed in Medline, PubMed, or Embase. Using Thomson Reuters Journal Citation Reports, 2011 impact factors were collected for all journals included in the search. Resulting citations were thematically coded, and descriptive statistics were calculated. 1433 full text articles from 1008 journals indexed in Medline, PubMed, or Embase with 2049 Wikipedia citations were accessed. The frequency of Wikipedia citations has increased over time; most citations occurred after December 2010. More than half of the citations were coded as definitions (n=648; 31.6%) or descriptions (n=482; 23.5%). Citations were not limited to journals with a low or no impact factor; the search found Wikipedia citations in many journals with high impact factors. Many publications are citing information from a tertiary source that can be edited by anyone, although permanent, evidence based sources are available. We encourage journal editors and reviewers to use caution when publishing articles that cite Wikipedia.
Article
The detection and improvement of low-quality information is a key concern in Web applications that are based on user-generated content; a popular example is the online encyclopedia Wikipedia. Existing research on quality assessment of user-generated content deals with the classification as to whether the content is high-quality or low-quality. This paper goes one step further: it targets the prediction of quality flaws, this way providing specific indications in which respects low-quality content needs improvement. The prediction is based on user-defined cleanup tags, which are commonly used in many Web applications to tag content that has some shortcomings. We apply this approach to the English Wikipedia, which is the largest and most popular user-generated knowledge source on the Web. We present an automatic mining approach to identify the existing cleanup tags, which provides us with a training corpus of labeled Wikipedia articles. We argue that common binary or multiclass classification approaches are ineffective for the prediction of quality flaws and hence cast quality flaw prediction as a one-class classification problem. We develop a quality flaw model and employ a dedicated machine learning approach to predict Wikipedia's most important quality flaws. Since in the Wikipedia setting the acquisition of significant test data is intricate, we analyze the effects of a biased sample selection. In this regard we illustrate the classifier effectiveness as a function of the flaw distribution in order to cope with the unknown (real-world) flaw-specific class imbalances. The flaw prediction performance is evaluated with 10,000 Wikipedia articles that have been tagged with the ten most frequent quality flaws: provided test data with little noise, four flaws can be detected with a precision close to 1.
Article
This study evaluates how well the authors of Wikipedia history articles adhere to the site's policy of assuring verifiability through citations. It does so by examining the references and citations of a subset of country histories. The findings paint a dismal picture. Not only are many claims not verified through citations, those that are suffer from the choice of references used. Many of these are from only a few US government Websites or news media and few are to academic journal material. Given these results, one response would be to declare Wikipedia unsuitable for serious reference work. But another option emerges when we jettison technological determinism and look at Wikipedia as a product of a wider social context. Key to this context is a world in which information is bottled up as commodities requiring payment for access. Equally important is the problematic assumption that texts are undifferentiated bearers of knowledge. Those involved in instructional programs can draw attention to the social nature of texts to counter these assumptions and by so doing create an awareness for a new generation of Wikipedians and Wikipedia users of the need to evaluate texts (and hence citations) in light of the social context of their production and use.
Thesis
Web applications that are based on user-generated content are often criticized for containing low-quality information; a popular example is the online encyclopedia Wikipedia. The major points of criticism pertain to the accuracy, neutrality, and reliability of information. The identification of low-quality information is an important task since for a huge number of people around the world it has become a habit to first visit Wikipedia in case of an information need. Existing research on quality assessment in Wikipedia either investigates only small samples of articles, or else deals with the classification of content into high-quality or low-quality. This thesis goes further: it targets the investigation of quality flaws, thus providing specific indications of the respects in which low-quality content needs improvement. The original contributions of this thesis, which relate to the fields of user-generated content analysis, data mining, and machine learning, can be summarized as follows: (1) We propose the investigation of quality flaws in Wikipedia based on user-defined cleanup tags. Cleanup tags are commonly used in the Wikipedia community to tag content that has some shortcomings. Our approach is based on the hypothesis that each cleanup tag defines a particular quality flaw. (2) We provide the first comprehensive breakdown of Wikipedia's quality flaw structure. We present a flaw organization schema, and we conduct an extensive exploratory data analysis which reveals (a) the flaws that actually exist, (b) the distribution of flaws in Wikipedia, and (c) the extent of flawed content. (3) We present the first breakdown of Wikipedia's quality flaw evolution. We consider the entire history of the English Wikipedia from 2001 to 2012, which comprises more than 508 million page revisions, summing up to 7.9 TB. Our analysis reveals (a) how the incidence and the extent of flaws have evolved, and (b) how the handling and the perception of flaws have changed over time.
(4) We are the first to operationalize an algorithmic prediction of quality flaws in Wikipedia. We cast quality flaw prediction as a one-class classification problem, develop a tailored quality flaw model, and employ a dedicated one-class machine learning approach. A comprehensive evaluation based on human-labeled Wikipedia articles underlines the practical applicability of our approach.
Conference Paper
In this paper we address the problem of developing actionable quality models for Wikipedia, models whose features directly suggest strategies for improving the quality of a given article. We first survey the literature in order to understand the notion of article quality in the context of Wikipedia and existing approaches to automatically assess article quality. We then develop classification models with varying combinations of more or less actionable features, and find that a model that only contains clearly actionable features delivers solid performance. Lastly we discuss the implications of these results in terms of how they can help improve the quality of articles across Wikipedia.