WikiLyzer: Interactive Information Quality Assessment in
Wikipedia
Cecilia di Sciascio
Know-Center GmbH
Graz, Austria
cdisciascio@know-center.at
David Strohmaier
Know-Center GmbH
Graz, Austria
david.strohmaier@gmx.at
Marcelo Errecalde
National University San Luis
San Luis, Argentina
merrecalde@gmail.com
Eduardo Veas
Know-Center GmbH
Graz, Austria
eveas@know-center.at
ABSTRACT
Innovations in digital libraries and services enable users to ac-
cess large amounts of data on demand. Yet, quality assessment
of information encountered on the Internet remains an elusive
open issue. For example, Wikipedia, one of the most visited
platforms on the Web, hosts thousands of user-generated arti-
cles and undergoes 12 million edits/contributions per month.
User-generated content is undoubtedly one of the keys to its
success, but also a hindrance to good quality: contributions can
be of poor quality because everyone, even anonymous users,
can participate. Though Wikipedia has defined guidelines as
to what makes the perfect article, authors find it difficult to
assert whether their contributions comply with them and re-
viewers cannot cope with the ever growing amount of articles
pending review. Great efforts have been invested in algorith-
mic methods for automatic classification of Wikipedia articles
(as featured or non-featured) and for quality flaw detection.
However, little has been done to support quality assessment
of user-generated content through interactive tools that allow
for combining automatic methods and human intelligence. We
developed WikiLyzer, a toolkit comprising three Web-based
interactive graphic tools designed to assist (i) knowledge dis-
covery experts in creating and testing metrics for quality mea-
surement, (ii) users searching for good articles, and (iii) users
that need to identify weaknesses to improve a particular article.
A case study suggests that experts are able to create complex
quality metrics with our tool, and we report on a user study showing its usefulness for identifying high-quality content.
ACM Classification Keywords
H.5.m. Information Interfaces and Presentation (e.g. HCI):
Miscellaneous
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
IUI 2017, March 13-16, 2017, Limassol, Cyprus
© 2017 ACM. ISBN 978-1-4503-4348-0/17/03.. .$15.00
DOI: http://dx.doi.org/10.1145/3025171.3025201
Author Keywords
Text Analytics; Text Quality; User Generated Content;
Wikipedia; Visual Analytics
INTRODUCTION
Information Quality (IQ) assessment is a topic of growing
interest, mainly due to the popularity of user-generated Web
content and the unavoidable divergence in the quality thereof
[9]. In this context, Wikipedia, the largest and most popular
user-generated source of knowledge on the Web, presents
several challenges related to quality assurance. In particular,
its size and dynamic nature render a manual quality assurance
completely unfeasible. This has resulted in increasing research
efforts towards automatic IQ assessment in Wikipedia that
can be roughly categorized into three main lines of research,
namely: (a) featured article identification [23, 26, 10]; (b)
development of metrics for quality measurement [24, 32]; and
(c) quality flaw detection [5, 6, 7, 15, 16, 17, 14].
These research lines heavily focus on identifying which pieces
of information, among all the information that can be extracted
from a Wikipedia article, are more relevant to automatically
determine different quality aspects. Approaches range from
simple word [10] and character n-gram [26] counts, to more elaborate approaches based on factual information [23] or
information combined from the content and structure of the
article, network information and edit history [4]. Automatic
methods have proven to be effective at tackling automatic IQ assessment in research scenarios, but translating these advances into
concrete improvements remains an open issue. Automatic IQ
assessment methods remain mostly inaccessible to the general
public and it is even hard for researchers to compare their own
methods against the state of the art and share them with the
rest of the community.
Our motivation is to provide a visual interactive system that
brings these methods to users with specific needs. In this con-
text, we can identify three types of users: a) the “consumer”,
who is only interested in retrieving high-quality articles, b)
the “repairman”, who could use the system to visualize flaws
in a specific article and improve it; and c) the knowledge dis-
covery expert (KDE), who is interested in identifying patterns
and extracting useful new knowledge to improve automatic
IQ assessment algorithms. We hereby introduce WikiLyzer,
a toolkit comprising three Web-based applications that sup-
port the tasks of the above-mentioned users. First, the Quality
Agent (QAg) provides a work environment for the KDE where
they can interactively create, compare and share quality met-
rics (QMs). Secondly, the Quality Detective (QD) presents a
straightforward UI for Wikipedians to rank and filter articles
by quality based on built-in QMs. After choosing a specific
article, a Wikipedian can leverage the Quality Assistant (QAs)
to assess and improve the quality of each section. Moreover,
a first exploratory analysis has been carried out to validate
individual parts of the toolkit proposed in this paper. The eval-
uations focused on validating that, on the one hand, experts can
compose quality metrics interactively, and that, on the other
hand, users can identify good quality content in an article.
INFORMATION QUALITY IN WIKIPEDIA
The concept of information quality (IQ) is inherently fuzzy
because it is subject to endless interpretations as to what makes
an article perfect. Thus, Wikipedia defined its own featured
article criteria (FA-criteria), i.e. a series of standards for high-quality articles (https://en.wikipedia.org/wiki/Wikipedia:Featured_article_criteria). They include, among others: neutral point of view, verifiability, and no original research (all material must be
attributable to a reliable published source).
In 2006 the community started developing a general grading
scheme consisting of 8 classes (https://en.wikipedia.org/wiki/Template:Grading_scheme), namely (from best to worst): FA (Featured Article), A, GA, B+, B, C, Start and Stub. Al-
though general descriptions are defined for each class, there is
still room for interpretation depending on the specific domain.
For example, articles about Mathematics have totally differ-
ent standards than biographies concerning the well-written
standard. Therefore, the community started to build different
quality assessment groups for different domains.
Nevertheless, most of Wikipedia’s content remains un-tagged
in relation to quality aspects and the modification and im-
provement of articles is essentially a manual task with little
(or no) computer support. For instance, featured articles
constitute less than 0.1% of the English Wikipedia. In 2014 alone, 269K new articles were created, of which only 602 were peer-reviewed and 298 were promoted to featured articles (https://en.wikipedia.org/wiki/Wikipedia:Featured_article_statistics). Clearly, human power alone is not enough to carry out
an extensive and thorough review over all Wikipedia’s arti-
cles. Other approaches are needed in order to assist users in
identifying high quality information to be used or low quality
information to be improved.
Out of the research lines on automatic IQ assessment, FA clas-
sification and IQ measurement using quality metrics (QMs)
are the paths that pertain to us the most. WikiLyzer makes use of state-of-the-art QMs for different purposes: the QAg for comparison against new QMs, the QD to allow the user to rank articles by quality, and the QAs to detect sections that need
improvement. The remainder of this section reviews state-of-
the-art methods for automatic IQ assessment and applications
for assessing and visualizing quality aspects in Wikipedia.
Automatic IQ Measurement
Beyond the characterization specified by Wikipedia, differ-
ent researchers have posed their own basic requirements for
quantifying good IQ. For instance, Alexander and Tate [3]
proposed objectivity, completeness, and pluralism. In a more recent study, Arazy et al. [8] used accuracy, completeness,
objectivity, and representation as the main dimensions to deter-
mine the quality of articles. Wikipedia provides three channels
for data extraction: (i) the article's content contains the text, media files, references, links, etc. of an article; (ii) the page history encloses the collaboration network and works like a version control system, such that every edit is saved as a new revision; and (iii) the talk page contains the quality grade (if it exists) and improvement suggestions. These channels pro-
vide several measurable attributes. Examples of content-based
attributes are article length, number of links and number of
images. In turn, the edit history can supply attributes such as
currency, article age or number of edits.
Research on QMs for Wikipedia articles places much importance on predicting whether an article potentially meets the FA criteria. QMs are defined through mathematical combinations of a few of these directly retrievable attributes or other post-computed measurements, e.g. readability scores. Moreover, each QM usually concentrates on a specific aspect of the FA-criteria, e.g. well-written can be quantified by Complexity [32], which is computed as a function of the Flesch and Kincaid readability scores.
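For reference, the two readability scores mentioned here (and reused later for section scoring) can be computed from word, sentence and syllable counts. The following Python sketch shows the standard formulas; it is an illustration, not code from the toolkit, and the counts are assumed to be given.

def flesch_reading_ease(words, sentences, syllables):
    # Standard Flesch Reading Ease: higher values mean easier-to-read text.
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade_level(words, sentences, syllables):
    # Standard Flesch-Kincaid Grade Level: maps text to a U.S. school grade.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Example: 1200 words, 60 sentences, 1900 syllables
# flesch_reading_ease(1200, 60, 1900)        -> ~52.6
# flesch_kincaid_grade_level(1200, 60, 1900) -> ~10.9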
Blumenstock [10] focused on article length for FA classifi-
cation, as articles longer than 2000 words are much more
likely to be featured articles than shorter ones. This method
reached 96% accuracy (depending on the classifier). Other ap-
proaches rely on readability scores. Hasan et al. [20] use the Flesch-Kincaid readability score, whereas Stvilia et al. [32] additionally employ other common scores, e.g. SMOG grading,
Gunning Fog, etc. Other features beyond readability can be
used to classify articles as well. Stvilia et al. [32] calculated
seven QMs mostly based on the page history. Hu et al. [21]
combine author authority with article length, while Lim et al.
[25] further improve the latter approach by adding collabora-
tion information from the authors.
In turn, Adler et al. [2] emphasize that author reputation is a
good indicator for predicting the lifespan of a written text and
also an important factor for trustworthiness. The downside is
that article context is not taken into account. Thus, the fact
that an author is able to write good mathematical articles does not mean that she is also able to write good history articles. Lih [24] discovered that the quality of an article is linked to the public interest in its topic: after an article is cited in the press, the number of edits and the number of unique editors increase strongly. Wöhner and
Peters [34] can classify an article into low or high quality, with
an accuracy of 87% using lifecycle-based QMs.
Brandes et al. [11] address quality assessment by exploring the
collaboration network among users. Nodes represent authors
and edges represent negative actions, e.g. one author undoes
the edits of another author. This network-oriented approach
defines three parameters, namely: bipolarity, groupstore and
autbalance, and finds a correlation between these parameters
and high-quality articles.
Lipka et al. [26] developed a machine learning method that
focuses on writing style. By analyzing plain text and article
length, classification accuracy reaches 96%, with the restric-
tion that only articles from the same domain are compared. It
is also possible to use this method for more than one domain,
however the accuracy diminishes.
Accurately predicting the FA criteria may be a good approx-
imation to article quality. Yet, their interpretation is rather
deceptive: this article has enough words to be good (word
count), there were no disputes while writing it, the authors
wrote FA before. More importantly, many of these metrics
carry little information for the author to improve the article
(except, perhaps, write more words).
Visual Tools for IQ Assessment
The MetadataScript [33] is a Wikipedia plugin that displays
the article’s quality indicator right below its headline. Quality
measurement is based on information extracted from the talk
page, thus it entirely relies on human maintenance.
WikipediaViz [12] measures five quality metrics (word count,
number of contributors, number and lengths of edits, number
of references and internal links, length and activity of the dis-
cussion) and visually represents them in the article page using
different types of charts. A disadvantage of these visualizations is that novice Wikipedians cannot interpret these charts.
GreenWiki [13] calculates two QMs: Coverage, which de-
scribes how citations are spread over the whole article, and
Stability, indicating the number of edits per month. Fur-
thermore, the tool provides heatmaps and different diagrams
to visualize these metrics and to track possible edit wars.
IChase [28] is an administrative tool that visualizes an article’s
activities and contributions in two different heatmaps and
synchronizes them in a timeline.
In fact, the importance of quality visualizations is their capac-
ity to influence credibility. Pirolli et al. [27] developed the
WikiDashboard, which shows the article and author editing
history, while Adler et al. [1] mainly focus on the contributors’
reputation and highlight the parts of the text influencing the
trust level. Both came to the conclusion that an article gains
credibility and trust when IQ visualizations are shown.
REQUIREMENTS
The design of the WikiLyzer toolkit is based on the analysis of gaps and shortcomings in existing tools and methods. We subdivide the requirements according to user type: the knowledge discovery expert is mostly focused on the creation and evaluation of IQ metrics, whereas the Wikipedian user plays either
the consumer or repairman role, i.e. consumes information or
edits/improves Wikipedia articles.
Figure 1: The WikiLyzer toolkit comprises three applications:
the Quality Agent (QAg), developed for experts; Quality De-
tective (QD) for information seekers and the Quality Assistant
(QAs), for Wikipedians performing editing tasks.
Requirements for Expert Users
RE1: Create QMs.
Experts should be able to interactively
compose equations leveraging available attributes and/or other
QMs. Furthermore, the GUI should facilitate constructions
with any level of complexity, from basic arithmetic to more
advanced operations such as n-th root.
RE2: Compare QMs.
To date, if researches want to com-
pare their approach to state-of-the-art methods they have to
implement these approaches themselves[20]. Our application
should include a pre-built set of state-of-the-art QMs that ex-
perts can readily access to compare against their new QMs.
RE3: Share QMs.
Experts should be able to share their QMs
with other experts and Wikipedian users (see RW2).
RE4: Provide test datasets.
Ground-truth datasets of pre-classified articles (FA/NFA) are necessary to meet RE2. Additional datasets covering a variety of topics would also facilitate
the optimization of metrics for specific domains.
RE5: Extract article attributes.
The system should retrieve
all available attributes through the data channels and pre-
compute other necessary measures, e.g. reading scores.
RE6: Provide normalization methods.
Composing QMs requires comparable measures. For example, when combining article length (word count normally > 1000) and number of images ([0, 10]), the gap between them would nullify the contribution of the latter. Fair comparisons require that measures are normalized before composing a QM with them.
RE7: Rank articles by QMs.
The user should be able to
select combinations of QMs to rank articles in a test dataset.
RE8: Track evolution of an article.
By selecting an article
it should be possible to see how its quality status (assigned by
the community) evolved over the last 10 years.
RE9: Store article ranking scores.
Allow for storing the
current article ranking and the article scores in CSV-format.
Figure 2: Quality Agent UI. (A) QM Panel: contains built-in and custom QMs. (B) Attributes Panel: contains all available
attributes to create QMs. (C) Ranking View: ranks articles contained in the current dataset by the selected QM(s). (D) Equation
Composer: area where experts can create QMs through mathematical combinations of attributes and/or other QMs. (E) Article
Timeline: shows the quality evolution of the selected article.
Requirements for Wikipedia Consumers
RW1: Search in Wikipedia.
The application should retrieve
articles from the live version of Wikipedia upon user queries.
The retrieval process should also include measurement
extraction and all necessary pre-computations.
RW2: Detect potential high-quality articles.
The user
should be able to interactively rank the retrieved articles
by one or more built-in QMs. Built-in QMs can be either
state-of-the-art or created by an expert user. Quality scores for
each article should be computed on demand and a ranking
visualization should represent these scores accordingly.
Requirements for Wikipedia Authors
RW3: Track evolution of an article. Same as RE8.
RW4: Analyze IQ score composition of one article in de-
tail.
It should be possible to visualize IQ scores for sections
individually, as well as their influence on the global IQ score
of the article. Based on this analysis the user could decide the
appropriate strategy to improve the article.
RW5: Compare two revisions of a specific article.
Users should be able to check how the sections of an article changed their quality between two points in time.
THE WIKILYZER TOOLKIT
The above defined requirements set the foundations for the three Web-based applications composing the toolkit. The QAg is the tool designed to support KDEs and hence attempts to fulfill RE1-RE9. In turn, the QD targets Wikipedians navigating the platform in search of high-quality content related to a topic of interest; thus, this tool tackles RW1-RW3. Finally, the QAs addresses RW4-RW5 and is tailored for Wikipedians that wish to edit and improve a specific article. Figure 1 illustrates the main components and the two workspaces of the WikiLyzer toolkit, where experts and Wikipedian users share a common global database.
Quality Agent (QAg)
The QAg's main use is to create new QMs and test them against
built-in state-of-the-art methods. Through the QAg, expert
users can access pre-loaded datasets of Wikipedia articles,
including a series of associated attributes and pre-computed
values. They use these datasets to compare QMs. Furthermore,
any new QM can be shared with other experts and saved to be
accessed from the wikipedian’s workspace.
A session starts as the expert selects a predefined dataset of
Wikipedia articles. The QAg automatically extracts the corre-
sponding attributes from each article and computes additional
scores. The QAg then loads QMs from the global database
and displays them ordered by their accuracy score.
At this point experts can interact with the advanced UI and
perform a variety of actions explained below.
The QM Panel (Figure 2.A) is the area containing all QMs, both built-in and user-created ones. The default view is a tag-cloud-like arrangement, and the alternative view arranges the QMs in a ranked list. In both cases, the QMs are sorted by accuracy (Figure 3a) to facilitate comparing QMs at a glance (partially fulfilling RE2). Accuracy is computed by performing a binary classification that takes a test dataset comprising 100 articles as ground truth, of which 50 are known to be FA and the other 50 are known to be NFA (RE4).
The ranked list is built upon the idea that experts create QMs
to identify potential featured articles. Thus, the better a QM
can distinguish FA from NFA, the higher it will be ranked.
The Attributes Panel (Figure 2.B) provides extracted attributes and pre-computed measures (RE5) that can be selected to create new QMs. Furthermore, this panel provides a selector to change the normalization method. The default method is the taxicab norm, but the user can choose between the Euclidean, p- or maximum norm, or no normalization (RE6).
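A plausible reading of these options, sketched below in Python (an assumption about how the norms are applied, not the toolkit's actual code), is that each attribute is divided by its vector norm over all articles in the dataset, so that attributes on very different scales become comparable.

def normalize(values, method="taxicab", p=3):
    # Divide a list of attribute values (one per article) by a vector norm,
    # so that attributes with different ranges become comparable.
    if method == "taxicab":        # L1 norm (default)
        norm = sum(abs(v) for v in values)
    elif method == "euclidean":    # L2 norm
        norm = sum(v * v for v in values) ** 0.5
    elif method == "p":            # general p-norm
        norm = sum(abs(v) ** p for v in values) ** (1.0 / p)
    elif method == "maximum":      # maximum norm
        norm = max(abs(v) for v in values)
    else:                          # no normalization
        return list(values)
    return [v / norm for v in values] if norm else list(values)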
The Equation Composer (Figure 2.D) is the area used to interactively create new QMs by selecting attributes and/or existing QMs (from the aforementioned panels) and building a mathematical formula through the operation buttons displayed at the bottom (RE1). The selected parameters are represented by green boxes augmented with sliders for weight adjustment. For example, to create the QM Completeness [32], one needs to select and mathematically combine the necessary attributes to compose the formula 0.4 × #InternalLinks + 0.4 × #InternalBrokenLinks + 0.2 × ArticleLength. This is done by clicking on the boxes representing the corresponding attributes followed by the summation operation. By default, the Equation Composer normalizes measures, so the last step is to assign the weights to each box. The created QMs can be stored and shared in the global database (RE3).
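To make the example concrete, the sketch below evaluates Completeness for a single article as a weighted sum of already-normalized attributes, mirroring the formula above; the attribute names and values are illustrative assumptions.

# Weighted combination of (already normalized) attributes, as in Completeness [32].
COMPLETENESS_WEIGHTS = {
    "internal_links": 0.4,
    "internal_broken_links": 0.4,
    "article_length": 0.2,
}

def quality_score(normalized_attributes, weights):
    # normalized_attributes maps attribute name -> normalized value for one article.
    return sum(w * normalized_attributes[name] for name, w in weights.items())

article = {"internal_links": 0.8, "internal_broken_links": 0.1, "article_length": 0.6}
print(quality_score(article, COMPLETENESS_WEIGHTS))  # 0.4*0.8 + 0.4*0.1 + 0.2*0.6 = 0.48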
The QM Comparator (Figure 3b) appears upon selecting QMs from the QM Panel (button on the right-bottom corner), replacing the area indicated in Figure 2.C. This view allows for thorough comparisons of up to 4 QMs (RE4). QMs are rendered using a "set" visual metaphor indicating the proportion of true-positive, false-positive, true-negative and false-negative classifications. Beside the set figure, three color-coded horizontal bars (using a traffic-light palette) indicate precision, recall and the F1 measure [18] (RE2).
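The set metaphor corresponds to the usual confusion-matrix counts over the 100-article test set. Below is a minimal sketch of how accuracy, precision, recall and F1 could be derived from QM scores; the threshold-based decision rule is our assumption.

def evaluate_qm(scores, labels, threshold):
    # scores: QM score per article; labels: True for FA, False for NFA.
    # An article is predicted FA when its score exceeds the threshold.
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y)
    tn = sum(1 for s, y in zip(scores, labels) if s <= threshold and not y)
    accuracy = (tp + tn) / len(scores)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1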
As the expert selects a QM or edits its formula, the Ranking View (Figure 2.C) updates the article ranking in real-time. Article scores are computed by first breaking down the QM into its simple components, i.e. attributes, and then solving the equation with the article attributes as variables (Figure 4). Experts can also select several QMs as ranking criteria (QM's icon at right-top), in which case the tool provides a stacked-bar view (Figure 5a) by default and an alternative split-bar view (Figure 5b) (RE7). Stacked bars allow for visualizing the influence of each QM on the articles' overall IQ score, while grouped baseline-aligned bars like in the split view enable comparisons of values across different QMs [19, 31]. Finally, experts can export a snapshot of the current ranking state in CSV-format by clicking on the icon (RE9).

Figure 3: Features for QM comparisons. (a) Alternative disposition of the QM Panel as a ranked list. (b) The QM Comparator shows recall, precision and F1 values for each QM.

Figure 4: Computing article scores implies breaking down the selected QMs into their lowest-level components.

Figure 5: Bar charts for visualizing score compositions in the Ranking View: (a) stacked-bar visualization, (b) split-bar visualization.
As the user selects an article in the Ranking View, the Article Timeline (Figure 2.E) provides an overview of the evolution experienced by the article over the last ten years, in terms of the IQ score computed for the currently loaded QM (RE8). The QAg fetches the last revision of each year and calculates the corresponding QM score. Scores are represented by ordered bar charts linked by a trend line. The mobile version of the article is additionally displayed underneath the timeline.
Quality Detective (QD)
The QD targets "consumer" users, who are interested in finding potential FA within a topic of interest and are not expected to create or compare new metrics. The graphical layout of the Quality Detective is similar to that of the QAg but with limited functionality. Components such as the Equation Composer, Attributes Panel and QM Comparator are not available. This tool allows users to query the live version of the platform by specifying a set of search keywords and the number of articles to be retrieved (RW1). The retrieval process starts with a typical query to obtain a list of articles related to the typed terms, followed by several requests per article to crawl their content, page histories and talk pages.
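A retrieval pipeline of this kind can be reproduced against the public MediaWiki action API. The Python sketch below (our illustration; the toolkit itself is implemented in JavaScript and PHP) issues a search query and then fetches the revision history of each hit; the requests library is assumed.

import requests

API = "https://en.wikipedia.org/w/api.php"

def search_articles(keywords, limit=10):
    # Full-text search returning titles of articles related to the query terms.
    params = {"action": "query", "list": "search", "srsearch": keywords,
              "srlimit": limit, "format": "json"}
    hits = requests.get(API, params=params).json()["query"]["search"]
    return [hit["title"] for hit in hits]

def revision_history(title, limit=50):
    # Page history: timestamp, editor and size of the most recent revisions.
    params = {"action": "query", "prop": "revisions", "titles": title,
              "rvprop": "timestamp|user|size", "rvlimit": limit, "format": "json"}
    pages = requests.get(API, params=params).json()["query"]["pages"]
    return next(iter(pages.values())).get("revisions", [])

for title in search_articles("coffee", limit=5):
    print(title, len(revision_history(title)))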
In the QD, the "Quality Metric" header is renamed "Ranking Methods" and built-in QMs have more descriptive names. The QD works as a live system: when clicking on a ranking method (QM), the article ranking is updated and the Ranking View is refreshed, allowing for immediate inspection of potential FA (RW2). Articles classified as potential FA are indicated with green bars in the "Ranking" column; otherwise the bar is colored red. Since Wikipedians are familiar with Wikipedia's grading scheme, we added an extra column displaying the articles' class (toggled by clicking on ). The color encoding for class bars matches the palette used in Wikipedia's grading scheme. Figure 6 illustrates an example of the UI with articles retrieved for "coffee" ranked by article length. As an article is selected, the Article Timeline shows its evolution (RW3) and its mobile version appears below. Finally, a user that wants to edit an article can directly switch to the Quality Assistant by clicking on its edit button ( ).

Figure 6: Portion of the Quality Detective's UI showing articles about "Coffee" ranked by ArticleLengthQM. The actual class of each article appears in the right-most column.
Quality Assistant (QAs)
The QAs is designed to assist the editing task by visually high-
lighting the parts of an article that likely need improvement
and those that seem to be already of high quality. It can also
be used to compare two revisions.
First the user has to indicate the title of the article to be ana-
lyzed. The QAs then automatically crawls its data channels
to extract the necessary attributes, pre-computes additional
measures and computes a set of predefined QMs for each sec-
tion individually. Furthermore, a sentiment score per section
and for the whole article is calculated. Subsequently, the QAs
computes section-wise scores that are then aggregated in order
to obtain an overall quality score for the article.
The Augmented Table of Content (ATOC) in Figure 7.A
provides an overview of the article’s structure, augmenting
section titles with color-coded bars. Bar width indicates the
length of the section while color encoding conveys the IQ
score thereof. Again, the traffic-light metaphor is leveraged to
pre-attentively denote sections that need improvement (red),
those that seem to comply with high quality standards (green)
and those in the fuzzy area (yellow/orange).
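The mapping from a section score to a color can be as simple as two cut-off values; the sketch below is illustrative, and the thresholds and [0, 1] scaling are our assumptions rather than the toolkit's actual boundaries.

def traffic_light_color(iq_score, low=0.4, high=0.7):
    # Map a section IQ score (assumed here to be scaled to [0, 1]) to the
    # traffic-light palette; the cut-offs are illustrative, not the toolkit's.
    if iq_score < low:
        return "red"       # section likely needs improvement
    if iq_score < high:
        return "orange"    # fuzzy area
    return "green"         # seems to comply with high-quality standards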
Clicking on the "show-statistic" button ( ) beside a section
title expands the Section Quality Panel, as shown in Figure 7.B. This area breaks down the section's score into the composing scores used to measure different quality aspects, among others: readability, number of images, references and links (RW4). These quality scores are represented with traffic-
light-coded bars. The exception is the sentiment score, which
indicates whether a section is written in a rather positive, neg-
ative or neutral tone. Since Wikipedia is an encyclopedia,
articles should be neutral, thus the central range of the bar is
colored in green and softly degrades to red towards the edges.
An overlaid pointer conveys the sentiment score of the section.
At the bottom, small thumbnails show the images attached
to the current section. The user has the option to include or
exclude the section from the overall article score computation
by clicking on the checkbox.
The overall IQ computed for the article is conveyed as a color-
coded bar in the Article IQ Panel (Figure 7.D), at the top of
the UI. Additionally, the overall sentiment score is reported as
a smiley icon and a tooltip – visible by hovering over the bar –
provides the same information as any Section Quality Panel
but for the whole article.
The Article Viewer (Figure 7.E) displays the article in an iframe. The panel can be switched into a markup editor through the "Show Wikipage" toggle. Edits are immediately incorporated into the page after switching the toggle on.
Finally, when the Revision Panel (Figure 7.C) is enabled ( button), the user can choose an older revision from the drop-down list. A secondary ATOC representing the structure of the article for the chosen revision is added. The user can use this feature to understand where and how quality changed from that revision to the present version (RW5).
Calculating Quality Scores
By default, the IQ values computed for each section are two
readability scores, namely: Flesch Reading Ease and Flesch-
Kincaid Grade Level [22], along with the attributes word count,
number of images, references and links. The IQ score for section i, s_i, is formed by the combination of these six values as depicted in Formula 1. Sections that do not bear any content, such as References and See Also, are ignored.

s_i = FleschReadingEase × WordCount + FleschKincaidGradeLevel + #Images + #References + #Links    (1)

All terms in the above sum are normalized beforehand, in order to balance their influence on the section score s_i. Thus, we first defined thresholds indicating the boundary between low and high values and then compute normalized values as denoted in Formula 2, where q̂_j is the normalized value for term j in Formula 1.

q̂_j = q_j / threshold_j  if q_j ≤ threshold_j,  and  q̂_j = 1  otherwise    (2)
Figure 7: User interface of the Quality Assistant. (A) Augmented Table of Content (ATOC): shows quality score and length of
each section. (B) Section Quality Panel: allows for visualizing quality scores composing the section score, as well as thumbnails of
referenced images. (C) Revision Panel: unfolds the ATOC of a past revision of the article. (D) Article IQ Panel: shows the overall
quality and sentiment score of the article. (E) Article Viewer: displays the article’s Wikipedia page and can be used as an editor.
Finally, the overall article score is recursively computed as a
bottom-up aggregation. In other words, scores for non-leaf
sections are obtained by averaging over their children's scores
and the process continues until the root section is reached.
Since QMs created with the QAg can be shared through the
global database, it is also possible to choose other QMs or
attributes, as long as the corresponding thresholds are defined.
These values can be either set by experts or calculated by
taking the average values of several featured articles.
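Putting Formulas 1 and 2 and the bottom-up aggregation together, a minimal Python sketch could look as follows; the threshold values are placeholders, since the actual boundaries are set by experts or derived from featured articles as described above.

THRESHOLDS = {  # placeholder boundaries between "low" and "high" values
    "flesch_reading_ease": 60, "flesch_kincaid_grade": 12,
    "word_count": 1000, "images": 5, "references": 20, "links": 50,
}

def normalize(value, threshold):
    # Formula 2: scale by the threshold, capped at 1.
    return value / threshold if value <= threshold else 1.0

def section_score(attrs):
    # Formula 1 over normalized terms.
    n = {k: normalize(attrs[k], THRESHOLDS[k]) for k in THRESHOLDS}
    return (n["flesch_reading_ease"] * n["word_count"]
            + n["flesch_kincaid_grade"] + n["images"]
            + n["references"] + n["links"])

def article_score(section):
    # Bottom-up aggregation: non-leaf sections average over their children's scores.
    children = section.get("children", [])
    if not children:
        return section_score(section["attrs"])
    return sum(article_score(c) for c in children) / len(children)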
Implementation
The three Web applications composing the WikiLyzer toolkit
are implemented in JavaScript and PHP. Article metadata are retrieved via REST requests to the MediaWiki API (https://www.mediawiki.org/wiki/API:Main_page/de) and sentiment scores in the QAs are calculated with Sensium (https://www.sensium.io/index.html). The tools and repository are publicly available (http://wikilyzer.know-center.tugraz.at/), as well as the datasets used in the figures of this paper.
DESIGN STUDY: QUALITY AGENT
Design studies are widely used in the Information Visual-
ization community to validate design choices with first-hand
experts [30]. In this case we were interested in assessing whether the implemented features of the Quality Agent are
suitable for knowledge discovery experts dedicated to auto-
matic IQ assessment. The goals of the study reflect the re-
quirements elicited in section 3.1.
Methodology
We recruited two researchers familiar with the development and testing of QMs and asked them to perform 4 tasks with increasing level of difficulty, namely:
Task 1 (inspect):
Researchers had to analyze the equations
of a set of QMs and answer 6 multiple choice questions.
3 questions asked about the purpose of the QM and the
remaining 3 about which attribute had the most or least
influence on QM’s computation.
Task 2 (combine and compare):
Researchers had to compare recall, precision and F1 scores for built-in QMs and report the one with the highest accuracy. Then they had to combine QMs to rank articles in one of the provided datasets and analyze their influence on article scores.
Task 3 (create with constraints):
Researchers had to create a new QM based on the following requirements: "The article should not be too short, it should be well-written and mature, and administrators should have done some edits."
Task 4 (create freely):
Researchers created a QM based on
their personal preferences.
Personal meetings were not feasible due to geographic rea-
sons, thus sessions were carried out online. Interviews and
observations were replaced by surveys comprising 7-point Likert scales, multiple-choice and open questions. A logger
captured all user interactions for later analysis. After each task
the participant had to answer workload-related questions. A
final post-study survey gathered feedback about the different
features of the QAg via linear-scaled and open questions.
A session started with an introductory video followed by a
training phase where participants could try all the system
features and read a cheatsheet (built as a result of a previous
pilot study). Participants could contact the experimenter if
they had further questions or encountered problems during the
session. Recorded logs revealed that Expert 1 (E1) and 2 (E2)
did very long training sessions and tested all features of the
QAg. Both had questions at this phase, which were discussed
via email and online calls. Once they understood all the main
features, they executed the 4 tasks seamlessly.
Outcomes
Task 1 (inspect). The experts' responses matched and were correct for 5 out of the 6 questions. Only the question addressing the
purpose of the QM Complexity differed: E2 chose the expected
response whereas E1 provided a more detailed answer that
was not in the listed options. Action logs revealed that they
used the Equation Composer to load the QMs and analyze
them in detail. No further features were used for this task.
Task 2 (compare and combine). Questions 1-2 asked experts to compare 4 QMs and determine the most accurate one, questions 3-4 asked about the best and worst QM to find FA, and questions 5-6 required them to combine two QMs and find the most influential one for a given document. All responses were
correct, except for question 4, possibly because E1 looked at
the wrong row in the ranked list in the QM Panel. Action logs
expose that the experts successfully employed the views in
the QM Panel and the QM Comparator to answer questions
1-4 and the Ranking View, including stacked- and split-bars
displays for questions 5-6.
Task 3 (create with constraints). Based on the given requirement, we expected experts to create a QM using the attributes ArticleLength, FleschReadingEase or FleschKincaidGradeLevel, AdministratorEditShare and ArticleAge. Here experts followed different strategies. E1 took the most obvious path by building the new QM entirely with attributes selected from the Attributes Panel. Though AdministratorEditShare is not present in the final equation, the recorded logs unveil that E1 first included this attribute but later replaced it with #UniqueEditors. E2's decisions, on the other hand, were rather "unorthodox" and revealed a deep understanding of how QMs work. Instead of creating a QM from scratch, E2 combined built-in QMs into a new one. These QMs comprise exactly the attributes specified in the task requirements. These outcomes denote that the QAg is flexible enough for experts to follow their preferred methodology and still fulfill the task successfully. Table 1 presents the final equations composed by E1 and E2.
Task 4 (create freely). We noticed a similar situation as in
Task 3. While E1 created a new QM from scratch by combin-
ing attributes only, E2 attempted to combine different built-in
QMs and attributes. In both cases, only basic arithmetic oper-
ations were used, i.e. addition and multiplication. However,
E1 employed a total of 6 attributes and E2 even tuned weights
for two parameters. Final equations are available in Table 1.
By analyzing the log-files and post-study surveys we were
able to identify 4 features that were positively mentioned: i)
feature to combine QM and attributes (Equation Composer),
ii) use of split and stacked bars to make score compositions
visible, iii) visualization of recall, precision and F1 values, and
iv) ranked list of QMs in the QM Panel.
Subjective Feedback: Usability and Workload
The experts stated that the Quality Agent is intuitive, consistent
and that all functions are well integrated. They were able to
learn how to use the main features and emphasized its ease of
use. One of the experts even mentioned that being able to see
the results of the interactions in real-time enhanced its usability.
Task  Expert  Equation
3     E1      ((FleschReadingEase^a + #UniqueEditors^a) × ArticleLength^a) × ArticleAge^a
3     E2      ArticleLength^q + Complexity^q + Consistency^q
4     E1      (FleschReadingEase^a × #UniqueEditors^a × ArticleLength^a) + #Images^a + Diversity^a + #InternalLinks^a
4     E2      ArticleLength^q + 0.2 × #Images^a + 0.2 × #ArticleAge^a

Table 1: Equations created by experts during the case study. Superscript a indicates an attribute and q a QM.
However, they also claimed that some specific features need
to be better explained. Although the experts agreed that the
interface of the Equation Composer was "OK", they
sometimes had difficulties creating the desired equations. They
also provided some improvement suggestions, e.g. a drag-and-
drop interaction would be preferable over creating a slot in the
Equation Composer and filling it by clicking on an attribute.
Regarding workload, we collected feedback after each task
through a 7-point Likert scale. Experts had to report on per-
ceived performance, effort and task difficulty. Responses were
quite positive, as illustrated in Figure 8.
Figure 8: Perceived performance, effort and task difficulty reported in the Case Study (higher is better).
Discussion and Limitations
The evaluation followed a design study format to target active KDEs in the research area. However, the two major limitations of this kind of study are, on the one hand, its online nature,
which prevented us from collecting first-hand observations,
and on the other hand, the limited pool of participants.
Despite the limitations, the insights gathered after this study
are invaluable from the formative point of view. Though some
improvements are desirable, we believe the interactive features
of the QAg, particularly the Equation Composer and the views for QM comparison, have great potential for further adoption
in the automatic IQ assessment realm. Indeed, both experts
reported that they would like to effectively use the tool in their
future experiments.
USER STUDY: QAS VS METADATA SCRIPT
The goal of this evaluation was to ascertain the usefulness of the QAs for assessing quality aspects of Wikipedia articles.
Although the tool is designed to support users that want to edit
an article, the scope was not to assess the participants’ writing
ability and expertise on a topic, but whether the features of
the tool helped them to identify good and poor quality aspects
in an article. Intuitively, detecting strengths and weaknesses
of a piece of text is a step that comes before attempting any
improvement on it. Thus, we did not ask participants to edit
an article but to analyze its current quality.
Methodology
We chose the MetadataScript (MS) [33] as baseline tool. Al-
though more sophisticated tools do exist, such as WikipediaViz
[12], most of them are no longer available or only work with a
pre-defined set of articles, e.g. GreenWiki [13] only supports
three articles. None of the available tools provides features
for assessing quality of sections or smaller parts in an article.
However, we consider the MS a fair benchmark, as it is currently accessible as a Wikipedia plugin.
The task goal was to answer a total of 29 questions on a 7-point Likert scale addressing quality aspects. Since the basis
for quality assessment in Wikipedia is the "featured article
criteria", each question targeted aspects thereof, namely: ap-
propriate length, comprehensive, enough media items, good
structure, neutrally written, well-researched, well-written,
well-summarized. We employed two articles, one FA (Moon)
and a non-FA (Dr. Phosphorus), hence we prepared sepa-
rate questionnaires but with the same structure. Questions
1-21 were section-specific while the rest concerned the whole
article. Section-specific questions were actually the same
7 questions replicated for 3 different sections (Introduction
was always the first one). Each participant had to perform
two tasks, one with the QAs and one with the MS. To coun-
terbalance learning effects, tool-article combinations were
randomized in the following 4 groups:
Group 1: (task 1) QAs–FA, (task 2) MS–NFA
Group 2: (task 1) QAs–NFA, (task 2) MS–FA
Group 3: (task 1) MS–FA, (task 2) QAs–NFA
Group 4: (task 1) MS–NFA, (task 2) QAs–FA
Twenty four participants took part in the study: 2 female
and 22 male, between 20 and 40 years old. None of the
participants was a regular contributor to Wikipedia, though
everyone had read some articles before. Furthermore, all
participants had a scientific background and a consolidated
opinion about quality aspects that should be fulfilled by an
encyclopedia. Participants were randomly assigned to one of
the above mentioned groups (6 participants/group).
At the beginning of a session, a video introduced the features
of the QAs to the participant. Then the participant had to
perform two short trial tasks, one with each tool. Next, the
first task started and once finished, the participant had to an-
swer 3 questions about workload (performance, effort and task
difficulty). Then the second task proceeded in the same way.
Additionally, the participant had to answer a System Usability Scale (SUS) questionnaire [29] regarding the QAs. Finally, participants were asked in a short interview about their experience and feedback.
Figure 9: Density distribution of quality assessment responses in the User Study.
Hypotheses
H1 (FA):
Since high-quality articles have to fulfill all items
in the featured article criteria, we expected that for the FA
(Moon), participants using the QAs would rate questions
higher than those using the MS.
H2 (NFA):
For the NFA (Dr. Phosphorus), participants will
rate questions higher with the QAs than with the MS.
H3 (workload):
Participants working with the QAs will ex-
perience lower workload than those using MS.
H4 (SUS): The QAs will reach a SUS score above C.
Results
In this section we report the outcomes of the study by breaking
them down into performance, workload and usability.
Performance
Due to violations of the normal distribution assumption (tested with Kolmogorov-Smirnov), we executed Mann-Whitney tests rather than Student t-tests to analyze the quality assessment responses. A Mann-Whitney test for the FA (Moon) revealed that overall answers were significantly higher with the QAs, U = 73438.5, p < .01, r = .20, thus supporting H1. Density
distributions plotted in Figure 9 show that ratings for FA-
related questions were positively skewed with both tools, but
rather higher for the QAs. More than 40% of responses from
QAs participants received the highest rating (7), in contrast to
roughly 25% of MS participants’ responses.
Conversely, a Mann-Whitney test for answers regarding the NFA (Dr. Phosphorus) did not reveal a significant difference, U = 58514.5, p > .05, r = .05. H2 is hence not supported. Figure 9 supports these results, as responses with both tools for the NFA follow a similar bimodal distribution.
Workload
Pair-wise t-tests addressing the workload measures turned out significantly lower across all dimensions, i.e. H3 was met, namely: (i) Performance (inverted): t(23) = 2.48, p < .01, r = .46; (ii) Effort: t(23) = 6.26, p < .001, r = .79; and (iii) Task difficulty: t(23) = 3.67, p < .001, r = .61.
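The statistical procedure described in this section can be reproduced with standard routines; the SciPy sketch below uses made-up response arrays (not the study data) purely to illustrate the analysis steps.

import numpy as np
from scipy import stats

# Hypothetical 7-point ratings for FA-related questions, one value per response.
qas_fa = np.array([7, 6, 7, 5, 7, 6, 7, 4, 6, 7])
ms_fa = np.array([5, 6, 4, 5, 7, 3, 6, 5, 4, 6])

# Normality check; a small p-value argues for a non-parametric test.
print(stats.kstest((qas_fa - qas_fa.mean()) / qas_fa.std(ddof=1), "norm"))

# Between-group comparison of quality-assessment responses (H1/H2).
u, p = stats.mannwhitneyu(qas_fa, ms_fa, alternative="two-sided")
print(u, p)

# Within-subject workload comparison (H3): paired t-test plus effect size r.
effort_qas = np.array([2, 3, 2, 1, 2, 3])
effort_ms = np.array([4, 5, 4, 3, 5, 4])
t, p = stats.ttest_rel(effort_ms, effort_qas)
r = np.sqrt(t ** 2 / (t ** 2 + len(effort_qas) - 1))  # r from t and degrees of freedom
print(t, p, r)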
System Usability Scale
We adapted the usual 5-point Likert scale to a 7-point one (1 = strongly disagree, 7 = strongly agree), in order to be consistent with the quality assessment questionnaires, and leveraged the positive-tone version available in Sauro et al. [29]. User responses were multiplied by 1.66 instead of 2.5 to obtain overall SUS scores in a range between 0 and 100. Thus, the score s_i for question i with response x_i was computed as s_i = (x_i - 1) × 1.66. The QAs achieved an overall score of 79.31, which translates into a B+ grade, supporting H4. The Usable and Learnable sub-scales produced similar scores (adjusting the multipliers to 2.08 and 8.33, respectively): Usable obtained 79.17 while Learnable obtained 79.86, i.e. both achieved a B+ grade.
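For clarity, the adapted scoring boils down to the following one-liner (assuming the positive-tone items, so every response is shifted and scaled the same way):

def sus_score_7point(responses):
    # responses: ten ratings on a 1-7 scale (positive-tone SUS items).
    # Each item contributes (x - 1) * 1.66, so ten ratings of 7 map to ~100.
    return sum((x - 1) * 1.66 for x in responses)

print(sus_score_7point([7] * 10))  # 99.6, the maximum attainable score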
Discussion & Limitations
This study shed light on the usefulness of the QAs to support
IQ assessment of user-generated content. The system proved
most useful when participants had to analyze a FA, in contrast
to NFA. In the latter case, answers were quite widespread with
both tools and it was not possible to detect any differences.
However, it is indeed encouraging that users were able to
detect good quality content more efficiently using the QAs
than the baseline tool, especially considering the fact that
participants working with the MetadataScript were aware of
the real classification of the article under revision, in contrast
to those working with the QAs.
The findings concerning H3 suggest that the QAs helps to
reduce workload in terms of perceived performance, effort
and task difficulty. Although there is a possibility of bias in-
troduced by our system, our observations and user feedback
suggest otherwise. None of the participants answered the qual-
ity assessment questionnaire by blindly following the system
suggestions or without reading the text. Perhaps the presence
of an evaluator prevented a careless behavior. This is also
underpinned by several comments collected in the post-study
interviews, e.g. "if the tool shares your opinion, you feel much
more confident about your decisions". In general, participants
stated that the usefulness of the tool lies in that they felt more
confident when their own assessments matched the system sug-
gestions and that the suggestions provided valuable hints when
they at first could not decide on the rating for a particular ques-
tion. Moreover, it is no minor detail that participants working
with the baseline tool (MetadataScript) were aware of the type
of article they were analyzing (recall that this plugin makes
the title appear in a color matching the article’s category). On
the other hand, participants working with WikiLyzer did not
know whether the article was featured or not.
The major limitation of this study is that participants did not
transfer their quality assessment into tangible content improve-
ments. We consider that a writing task would be more suitable
for actual Wikipedians used to making frequent contributions.
We invested some efforts in contacting some of them but un-
fortunately did not receive much support.
CONCLUSIONS AND FUTURE WORK
We presented WikiLyzer, a Web-based toolkit assisting a va-
riety of tasks, ranging from creating metrics for quality mea-
surement (QMs) of user-generated content to applying them
in the analysis and improvement of articles in Wikipedia. The
toolkit consists of three applications: (i) the Quality Agent,
designed for experts that need to create and compare QMs; (ii)
the Quality Detective, for users conducting query-based searches and interested in high-quality content, and (iii) the Quality Assistant, targeting Wikipedians that wish to contribute to and improve a specific article. Our tools fill a gap in automatic information quality (IQ) assessment, where efforts have concentrated on
developing unsupervised and semi-supervised methods but
little has been done to date with regards to interactive tools
that facilitate the creation and comparison of methods and,
more importantly, bring these methods to Wikipedia users that
could actually benefit from them.
The case study in this paper provided initial evidence that the
QAg is suitable for knowledge discovery experts to understand,
compare and create QMs in a visual analytics environment.
In turn, the user study concerning the QAs demonstrated that
average users, who are not actual Wikipedia contributors, are
able to spot high-quality content more effectively and with less
effort with the QAs, in comparison to an existing Wikipedia
plugin. Notwithstanding, more extensive and thorough eval-
uation of all the proposed features is necessary. Thus, future
work includes a longitudinal study as well as stepwise analysis
of each different tool.
The workflow defined in this article to investigate, propose,
and apply automatic methods for quality assessment was tested
with Wikipedia content, but is transferable to other domains
dealing with user-generated content. As an example, the
generic workflow could be adapted to communities of contrib-
utors and reviewers. In the near future we will concentrate
on extending WikiLyzer into broader domains, i.e. outside
of the Wikipedia realm. Moreover, we intend to enhance the
usability of different interactive features based on the feedback
obtained and ultimately leverage machine learning to improve
the tool assessment over time.
ACKNOWLEDGMENTS
This work was partially funded by the EC FP7-PEOPLE pro-
gram within the WIQ-EI project (grant 269180). The Know-
Center is funded within the Austrian COMET Program – Com-
petence Centers for Excellent Technologies – of the Austrian
Federal Ministry of Transport, Innovation and Technology, the
Austrian Federal Ministry of Economy, Family and Youth and
by the State of Styria. COMET is managed by the Austrian
Research Promotion Agency (FFG).
REFERENCES
1. Adler, B. T., Chatterjee, K., De Alfaro, L., Faella, M.,
Pye, I., and Raman, V. Assigning trust to wikipedia
content. In Proceedings of the 4th International
Symposium on Wikis, ACM (2008), 26.
2. Adler, B. T., and De Alfaro, L. A content-driven
reputation system for the wikipedia. In Proceedings of the
16th international conference on World Wide Web, ACM
(2007), 261–270.
3. Alexander, J. E., and Tate, M. A. Web Wisdom; How to
Evaluate and Create Information Quality on the Web,
1st ed. L. Erlbaum Associates Inc., Hillsdale, NJ, USA,
1999.
4. Anderka, M. Analyzing and Predicting Quality Flaws in
User-generated Content: The Case of Wikipedia.
Dissertation, Bauhaus-Universität Weimar, June 2013.
5. Anderka, M., Stein, B., and Lipka, N. Detection of text
quality flaws as a one-class classification problem. In
20th ACM International Conference on Information and
Knowledge Management (CIKM’11), ACM (2011),
2313–2316.
6. Anderka, M., Stein, B., and Lipka, N. Towards Automatic
Quality Assurance in Wikipedia. In 20th International
Conference on World Wide Web, ACM (2011).
7. Anderka, M., Stein, B., and Lipka, N. Predicting Quality
Flaws in User-generated Content: The Case of Wikipedia.
In 35th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval,
ACM (2012).
8. Arazy, O., Nov, O., Patterson, R., and Yeo, L.
Information quality in wikipedia: The effects of group
composition and task conflict. Journal of Management
Information Systems 27, 4 (2011), 71–98.
9. Baeza-Yates, R. User generated content: how good is it?
In 3rd Workshop on Information Credibility on the Web
(WICOW’09), ACM (2009), 1–2.
10. Blumenstock, J. E. Size matters: word count as a measure
of quality on wikipedia. In Proceedings of the 17th
international conference on World Wide Web, ACM
(2008), 1095–1096.
11. Brandes, U., Kenis, P., Lerner, J., and van Raaij, D.
Network analysis of collaboration structure in wikipedia.
In Proceedings of the 18th international conference on
World wide web, ACM (2009), 731–740.
12. Chevalier, F., Huot, S., and Fekete, J.-D. Wikipediaviz:
Conveying article quality for casual wikipedia readers. In
Pacific Visualization Symposium (PacificVis), 2010 IEEE,
IEEE (2010), 49–56.
13. Dalip, D. H., Santos, R. L., Oliveira, D. R., Amaral, V. F.,
Gonçalves, M. A., Prates, R. O., Minardi, R., and
de Almeida, J. M. Greenwiki: a tool to support users’
assessment of the quality of wikipedia articles. In
Proceedings of the 11th annual international ACM/IEEE
joint conference on Digital libraries, ACM (2011),
469–470.
14. Ferretti, E., Errecalde, M., Anderka, M., and Stein, B. On
the Use of Reliable-Negatives Selection Strategies in the
PU Learning Approach for Quality Flaws Prediction in
Wikipedia. In Proceedings of the 11th International
Workshop on Text-based Information Retrieval (TIR
2014), held in conjunction with DEXA 2014, IEEE
(2014), 211–215.
15. Ferretti, E., Fusilier, D., Cabrera, R., Montes-y-Gómez,
M., Errecalde, M., and Rosso, P. On the use of PU
Learning for quality flaw prediction in Wikipedia:
notebook for PAN at CLEF 2012. In Notebook Papers of
CLEF 2012 LABs and Workshops (2012).
16. Ferschke, O., Gurevych, I., and Rittberger., M.
FlawFinder: a modular system for predicting quality
flaws in Wikipedia: notebook for PAN at CLEF 2012. In
Notebook Papers of CLEF 2012 LABs and Workshops
(2012).
17. Ferschke, O., Gurevych, I., and Rittberger, M. The impact
of topic bias on quality flaw prediction in wikipedia. In
Proceedings of the 51st Annual Meeting of the
Association for Computational Linguistics (ACL 2013),
vol. 1, Association for Computational Linguistics
(Stroudsburg, PA, USA, Aug. 2013), 721–730.
18. Goutte, C., and Gaussier, E. A probabilistic interpretation
of precision, recall and f-score, with implication for
evaluation. In Advances in information retrieval.
Springer, 2005, 345–359.
19. Gratzl, S., Lex, A., Gehlenborg, N., Pfister, H., and Streit,
M. LineUp: visual analysis of multi-attribute rankings.
IEEE transactions on visualization and computer
graphics 19, 12 (dec 2013), 2277–86.
20. Hasan Dalip, D., André Gonçalves, M., Cristo, M., and
Calado, P. Automatic quality assessment of content
created collaboratively by web communities: a case study
of wikipedia. In Proceedings of the 9th ACM/IEEE-CS
joint conference on Digital libraries, ACM (2009),
295–304.
21. Hu, M., Lim, E.-P., Sun, A., Lauw, H. W., and Vuong,
B.-Q. Measuring article quality in wikipedia: models and
evaluation. In Proceedings of the sixteenth ACM
conference on Conference on information and knowledge
management, ACM (2007), 243–252.
22. Kincaid, J. P., Fishburne Jr, R. P., Rogers, R. L., and
Chissom, B. S. Derivation of new readability formulas
(automated readability index, fog count and flesch
reading ease formula) for navy enlisted personnel. Tech.
rep., DTIC Document, 1975.
23. Lex, E., Völske, M., Errecalde, M., Ferretti, E., Cagnina,
L., Horn, C., Stein, B., and Granitzer, M. Measuring the
quality of web content using factual information. In 2nd
joint WICOW/AIRWeb Workshop on Web quality, ACM
(2012).
24. Lih, A. Wikipedia as participatory journalism: Reliable
sources? Metrics for evaluating collaborative media as a
news resource. In 5th International Symposium on Online
Journalism (2004).
25. Lim, E.-P., Vuong, B.-Q., Lauw, H. W., and Sun, A.
Measuring qualities of articles contributed by online
communities. In Web Intelligence (2006), 81–87.
26. Lipka, N., and Stein, B. Identifying featured articles in
wikipedia: writing style matters. In Proceedings of the
19th international conference on World wide web, ACM
(2010), 1147–1148.
27. Pirolli, P., Wollny, E., and Suh, B. So you know you’re
getting the best possible information: a tool that increases
wikipedia credibility. In Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems,
ACM (2009), 1505–1508.
28. Riche, N. H., Lee, B., and Chevalier, F. ichase:
Supporting exploration and awareness of editing
activities on wikipedia. In Proceedings of the
International Conference on Advanced Visual Interfaces,
ACM (2010), 59–66.
29. Sauro, J., and Lewis, J. R. Quantifying the user
experience: Practical statistics for user research.
Elsevier, 2012.
30. Sedlmair, M., Meyer, M., and Munzner, T. Design study
methodology: Reflections from the trenches and the
stacks. IEEE Transactions on Visualization and
Computer Graphics 18, 12 (2012), 2431–2440.
31. Streit, M., and Gehlenborg, N. Bar charts and box plots.
Nature Methods 11, 2 (Feb. 2014), 117.
32. Stvilia, B., Twidale, M., Smith, L., and Gasser, L.
Assessing information quality of a community-based
encyclopedia. In 10th International Conference on
Information Quality (ICIQ’05), MIT (2005), 442–454.
33. User:Pyrospirit. User:Pyrospirit/metadata.
https://en.wikipedia.org/wiki/User:Pyrospirit/metadata,
2015. [Online; accessed 26-May-2015].
34. Wöhner, T., and Peters, R. Assessing the quality of
wikipedia articles with lifecycle based metrics. In
Proceedings of the 5th International Symposium on Wikis
and Open Collaboration, ACM (2009), 16.