Proceedings of Recent Advances in Natural Language Processing, pages 473–482,
Varna, Bulgaria, Sep 4–6 2017.
https://doi.org/10.26615/978-954-452-049-6_063
Inforex — a Collaborative System
for Text Corpora Annotation and Analysis
Michał Marcińczuk, Marcin Oleksy, Jan Kocoń
G4.19 Research Group
Department of Computational Intelligence
Faculty of Computer Science and Management
Wrocław University of Technology, Wrocław, Poland
{michal.marcinczuk,marcin.oleksy,jan.kocon}@pwr.edu.pl
Abstract
We report the first major upgrade of Inforex, a web-based
system for qualitative and collaborative annotation and
analysis of text corpora. Inforex is part of the Polish
CLARIN infrastructure¹. It is integrated with a digital
repository for storing and publishing language resources²
and allows users to visualize, browse and annotate text
corpora stored in the repository. As a result of a series
of workshops for researchers in the humanities and social
sciences, we improved the graphical interface to make the
system friendlier and more readable for inexperienced
users. We also implemented new functionality for gold
standard annotation, which includes private annotations
and annotation agreement by a super-annotator.
1 Introduction
Digital humanities (DH) create new demands and challenges
for the development of new and existing tools and systems
for manipulating, processing, analysing and visualizing
text documents. CLARIN-PL, the Polish part of the CLARIN
infrastructure, tries to rise to the challenges that DH
poses for the Polish language. Among many other issues,
there is a need for an intuitive and easy-to-use system
for qualitative text corpora management, annotation,
analysis and visualization. To meet these needs we are
developing such a system, called Inforex. In this article
we present the current state of its development.
The decision to create a system for text corpora
annotation was taken in 2009, when no systems supporting
collaborative work were available. At that time the only
existing tools were desktop applications for individual
work, such as GATE (Cunningham et al., 2011) or
Manufakturzysta Luna (Marciniak et al., 2010). Since 2010
several systems have emerged, such as WebAnno 3 (Eckart de
Castilho et al., 2016) and GATE Teamware (Bontcheva et
al., 2013).
¹ http://clarin-pl.eu
² http://clarin-pl.eu/dspace
The first version of the Inforex system was released in
2010. Its initial role was to support the construction of
corpus-based linguistic resources for various natural
language processing tasks, including named entity
recognition (Marcińczuk et al., 2011), shallow parsing
(Radziszewski and Piasecki, 2010), word sense
disambiguation (Bas et al., 2008) and recognition of
semantic relations between named entities (Marcińczuk and
Ptak, 2012). It was used to develop two resources for
Polish that were major at the time: the Corpus of Wrocław
University of Technology, called KPWr (Broda et al., 2012)
(within the NEKST³ project), and the Corpus of Economic
News (CEN) (Marcińczuk et al., 2013) (within the SyNaT
project⁴). Later, in 2013, Inforex was used to construct
another major resource, the Polish Corpus of Suicide Notes
(PCSN)⁵ (Marcińczuk et al., 2011), guided by Monika
Zaśko-Zielińska (2013). The system is still used to access
this corpus. Access is granted on request after obtaining
permission from Wrocław University.
In 2013 Poland joined CLARIN, the European Research
Infrastructure for Language Resources and Technology. The
goal of CLARIN is to make language technologies more
accessible to researchers from the humanities and social
sciences, who in most cases do not have the technical
skills to use many of the tools on their own. At that time
we decided to make Inforex a part of the Polish CLARIN
infrastructure. In 2015–2017 we organized several
workshops for researchers in the humanities and social
sciences. The workshops revealed several user experience
issues. First, the system GUI turned out not to be
intuitive enough for inexperienced users and needed to be
simplified. The second problem was connected with
methodology. Researchers use various tools for corpus
analysis (including spreadsheets), and Inforex may be
treated as a kind of pre-processing tool for preparing a
corpus for further analysis. Data export was possible but
complicated and required access to a database. User
feedback showed that an easy form of data export is one of
the crucial needs. Through the workshops (and accompanying
questionnaires) we gathered information about other
important needs, such as the ability to define custom
annotation schemas or data visualisation. Some of these
have already been implemented and the others are under
development.
³ http://nekst.ipipan.waw.pl/
⁴ http://www.synat.pl/
⁵ http://pcsn.uni.wroc.pl/
2 Inforex Features Overview
In the following sections we present the main
functionalities and features of the Inforex system.
2.1 Web-based Access
Inforex is a web-based tool which does not require
installation. It can be accessed with any web browser
which supports JavaScript. Although Inforex is built on
several universal JavaScript libraries and frameworks
(jQuery, jQuery extensions and Bootstrap), we suggest
using Chrome or Firefox. These two web browsers are used
to test the system on a daily basis. Users may use other
browsers as well; however, we are not able to validate all
functions in each of the available web browsers, so some
minor issues might occur.
2.2 Authorized and Public Access
Corpora stored in Inforex can be accessed by au-
thorized and unauthorized users. The manager of
the corpus (the owner or a user with specific priv-
ileges) decides what type of information from the
corpora can be publicly available. For instance,
only authorized users can have access to docu-
ments’ content and can modify the corpus anno-
tations while unauthorized users may have access
to some statistics or annotation frequency lists.
2.3 Integration with DSpace as a Part of
Polish CLARIN Infrastructure
The Inforex system is available at
http://inforex.clarin-pl.eu and is part of the Polish
CLARIN infrastructure. This installation is integrated
with the official repository for language resources in
Polish CLARIN⁶. The repository runs on the DSpace system⁷.
When a user registers at https://clarin-pl.eu/dspace/,
they also gain access to the Inforex system. At this stage
accounts are synchronized automatically. In the future
both systems will use unified federated authorization.
2.4 Collaboration
Inforex offers several ways to work collaboratively on a
single corpus. The first is access to the same corpora for
different authorized users. The second is selective,
task-oriented access to the same document. For instance,
different groups of users can have access to a document's
metadata. The third is "2+1" annotation, i.e. two or more
users annotate the same set of documents independently and
a super-annotator creates the final set of annotations
based on their input. This type of collaboration is
described in more detail in Section 3.2.
2.5 Qualitative Document Annotation
Inforex was designed for qualitative document annotation.
This means that it does not offer fast and robust search
functions over large corpora with millions of documents.
Such functionality can be obtained using other existing
tools designed for that purpose, for instance Sketch
Engine (Kilgarriff et al., 2014) or NoSketch Engine
(Rychlý, 2007). Inforex is suited to medium-size corpora
(containing thousands of small documents) and to manually
describing documents in terms of their metadata,
annotations (types of phrases organized in a hierarchy),
annotation attributes, relations between annotations and
annotation frames.
2.6 Language-independent
Inforex is language-independent in the sense that
it can handle documents in any natural language.
So far it has been used to annotate Polish, English
and Hebrew texts (see Section 4).
⁶ https://clarin-pl.eu/dspace/
⁷ https://github.com/ufal/clarin-dspace
Figure 1: Corpus overview
2.7 Document Visualisation
Inforex can handle documents in two formats: plain text
and XML. XML documents can be displayed in a visually
formatted way. This makes it possible to highlight the
document structure, which improves the user experience
while browsing and annotating documents. Sample
visualizations of different types of documents are
presented in Figure 3.
2.8 Document Description
Inforex supports four types of information units which can
be used to describe document content:
1. Metadata: an information unit assigned to the whole
document (author name, document creation time, source,
etc.).
2. Annotation: an information unit assigned to a sequence
of words in the document content. Each annotation is
described by a category (categories can be organized in a
hierarchy) and a set of attributes. The set of attributes
depends on the semantic interpretation of the annotation
category. For instance, for named entities it can be a
lemma, for temporal expressions a normalized value of the
expression, and for event mentions an event modality.
3. Relation: an information unit assigned to a pair of
annotations. It is a directed link of some category
between two annotations.
4. Frame: an information unit assigned to a set of
annotations. A frame consists of a set of annotations with
roles assigned to them. This type of structure can be used
for event annotations (LDC, 2005).
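The four information units can be pictured as a small data model. The following Python sketch is only an illustration: the class names, field names and category labels (such as "person" or "visit") are assumptions made for this example and do not reflect the actual Inforex database schema or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    """A categorized span of the document content (hypothetical model)."""
    start: int      # offset of the first character (inclusive)
    end: int        # offset just past the last character (exclusive)
    category: str   # categories may be organized in a hierarchy
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Relation:
    """A directed link of some category between two annotations."""
    source: Annotation
    target: Annotation
    category: str


@dataclass
class Frame:
    """A set of annotations, each filling a named role."""
    category: str
    roles: Dict[str, Annotation] = field(default_factory=dict)


@dataclass
class Document:
    content: str
    metadata: Dict[str, str] = field(default_factory=dict)
    annotations: List[Annotation] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)
    frames: List[Frame] = field(default_factory=list)

    def text_of(self, ann: Annotation) -> str:
        """Return the fragment of the content covered by an annotation."""
        return self.content[ann.start:ann.end]


# Illustrative data only; none of it comes from a real corpus.
doc = Document(content="Jan Kowalski visited Wrocław in 2017.",
               metadata={"source": "example"})
person = Annotation(0, 12, "person", {"lemma": "Jan Kowalski"})
city = Annotation(21, 28, "city", {"lemma": "Wrocław"})
doc.annotations += [person, city]
doc.relations.append(Relation(person, city, "location"))
doc.frames.append(Frame("visit", {"agent": person, "place": city}))
```

In this sketch relations and frames refer to annotation objects directly, which mirrors the description above: both are defined over annotations rather than over raw text.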
3 Recent Improvements
In the following sections we present the recent major
improvements to the Inforex system.
3.1 Modern Layout
A set of workshops carried out from 2015 to 2017 showed
that the user interface needed to be adjusted for a new
group of users: researchers in the humanities and social
sciences who are not involved in NLP tool development. New
users reported confusion caused by the large amount of
information and the number of available functions. The
interface had to be simplified while the functionality of
the system remained unchanged. The Inforex layout was
therefore upgraded and modernized. This involved not only
a facelift of the user interface but also changes to the
navigation panels. A comparison of the old and new layouts
is presented in Figure 4.
Figure 2: Document annotation view
3.2 Annotation Agreement
Reliability is a key value in the creation of good-quality
corpora for training and testing NLP tools. The current
version of Inforex enables simultaneous and independent
annotation of the same text sample by more than one
annotator. Moreover, the coordinator of the annotation
process can keep track of inter-annotator agreement
between two raters thanks to the Agreement module, which
uses the Positive Specific Agreement (PSA) measure
(Hripcsak and Rothschild, 2005) to calculate reliability
(see Figure 5). The view configuration makes it possible
to select the annotation layers, subsets or categories,
the users and the set of documents to be analysed. The
coordinator may also specify a comparison mode: whether
the system should take into consideration annotation
boundaries only, or boundaries and categories. The
comparison may also include annotation lemmas.
Inter-annotator agreement is a very important indicator of
the clarity or coherence of the annotation guidelines.
Keeping track of changes in inter-annotator agreement
between subsequent annotation iterations helps to improve
the quality of the guidelines. The Agreement module makes
that process easier and faster.
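For two raters, PSA can be written as 2a / (2a + b + c), where a is the number of annotations made by both raters, b the number made only by the first and c the number made only by the second (Hripcsak and Rothschild, 2005). The Python sketch below illustrates the two comparison modes under stated assumptions: annotations are represented as hypothetical (start, end, category) tuples, matching is exact, and the function names are invented for this example. It is not the code of the Agreement module.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation of an annotation: (start, end, category).
Span = Tuple[int, int, str]


def _keys(annotations: Iterable[Span], with_category: bool) -> Set[tuple]:
    """Reduce annotations to the parts that are compared in the chosen mode."""
    if with_category:
        return {(start, end, category) for start, end, category in annotations}
    return {(start, end) for start, end, _ in annotations}


def positive_specific_agreement(ann_a: Iterable[Span],
                                ann_b: Iterable[Span],
                                with_category: bool = True) -> float:
    """PSA = 2a / (2a + b + c) for two annotators, using exact matching."""
    keys_a = _keys(ann_a, with_category)
    keys_b = _keys(ann_b, with_category)
    both = len(keys_a & keys_b)      # a: marked by both annotators
    only_a = len(keys_a - keys_b)    # b: marked only by the first annotator
    only_b = len(keys_b - keys_a)    # c: marked only by the second annotator
    denominator = 2 * both + only_a + only_b
    # Assumption of this sketch: two empty annotation sets agree fully.
    return 2 * both / denominator if denominator else 1.0


annotator_a = [(0, 12, "person"), (21, 28, "city"), (32, 36, "date")]
annotator_b = [(0, 12, "person"), (21, 28, "country")]

psa_boundaries = positive_specific_agreement(annotator_a, annotator_b,
                                             with_category=False)
psa_categories = positive_specific_agreement(annotator_a, annotator_b,
                                             with_category=True)
print(psa_boundaries)  # 0.8
print(psa_categories)  # 0.4
```

In the example the two annotators agree on two of three spans but on the category of only one, so the score drops from 0.8 to 0.4 when categories are taken into account.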
The Inforex system also supports curation of the
annotation process (see Figure 6). The curator can choose
between the differing decisions of two annotators, or even
reject annotations that are consistent but incorrect.
Thanks to this module several gold standard projects have
been carried out, e.g. the Polish Coreference Corpus
(Ogrodniczuk et al., 2015) for the annotation of definite
descriptions and the Polish Spatial Texts corpus for the
annotation of dynamic spatial expressions.
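The curation step can be sketched as follows. This is a simplified illustration and not the Inforex implementation: annotations are hypothetical (start, end, category) tuples, and the function names split_for_curation and build_gold are invented for this example.

```python
from typing import Iterable, List, Tuple

# Hypothetical representation of an annotation: (start, end, category).
Span = Tuple[int, int, str]


def split_for_curation(ann_a: Iterable[Span], ann_b: Iterable[Span]):
    """Separate annotations both annotators made from those only one made."""
    set_a, set_b = set(ann_a), set(ann_b)
    agreed = sorted(set_a & set_b)
    only_a = sorted(set_a - set_b)
    only_b = sorted(set_b - set_a)
    return agreed, only_a, only_b


def build_gold(agreed: Iterable[Span],
               accepted: Iterable[Span],
               rejected: Iterable[Span] = ()) -> List[Span]:
    """Keep agreed annotations unless the curator rejects them,
    and add the disputed annotations the curator accepts."""
    rejected_set = set(rejected)
    gold = {ann for ann in agreed if ann not in rejected_set}
    gold.update(accepted)
    return sorted(gold)


annotator_a = [(0, 12, "person"), (21, 28, "city"), (32, 36, "date")]
annotator_b = [(0, 12, "person"), (21, 28, "country"), (32, 36, "date")]

agreed, only_a, only_b = split_for_curation(annotator_a, annotator_b)
# The curator sides with annotator A on the disputed span and rejects
# one annotation that both annotators made consistently.
gold = build_gold(agreed, accepted=[(21, 28, "city")],
                  rejected=[(32, 36, "date")])
print(gold)  # [(0, 12, 'person'), (21, 28, 'city')]
```

Keeping agreed annotations by default reflects the idea that the curator mainly resolves disagreements, while the rejected argument covers the case of consistent but incorrect annotations.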
4 Applications
In the following sections we present several prac-
tical applications of the Inforex system.
4.1 KPWr
KPWr (Polish Corpus of Wrocław University of Technology)
(Broda et al., 2012) is a corpus of written and spoken
documents available under a Creative Commons license. It
is intended primarily as training and testing material for
NLP tools being developed at Wrocław University of Science
and Technology, and it is successively enriched with
annotation layers. Inforex has recently supported manual
text annotation in layers such as temporal expressions and
their normalizations, events (with a description of event
attributes), spatial expressions and semantic roles. To
prepare the temporal expression annotation (Kocoń et al.,
2015), a new annotation scheme based on TimeML was added.
Its categories refer to a date, a time of day, a duration
and the frequency of an event. The annotation lemmas
perspective was used to provide normalized temporal
expressions, which shows that the term 'lemma' in Inforex
may function as a broad concept. The Annotator perspective
also supports event annotation (Marcińczuk et al., 2015).
There are seven coarse-grained categories of events:
action, state, reporting, perception, aspectual,
intensional action and intensional state. The
categorization was based on the TimeML guidelines with
some modifications, and it also involved the creation of a
new annotation scheme. The flexibility in adding new
annotation layers (and defining new annotation categories)
is one of the most important features of the system. The
possibility of establishing relations between annotated
fragments is no less relevant. It was crucial, for
example, for the annotation of spatial expressions, whose
main goal was to capture the different ways in which
spatial information is distributed throughout a sentence
by reviewing the lexical and grammatical signals of
various relations between objects (Marcińczuk et al.,
2016).
(a) Facebook conversation. (b) Wikipedia article. (c) Hebrew document.
Figure 3: Sample document visualizations
4.2 European Legal Texts
As practice shows, although Inforex was primarily
developed for the Polish language, it can also be used to
work with documents written in other languages. Inforex
features and functionalities proved useful, for example,
in examining current official EU literature related to
territorial development and urban planning. The authors of
this analysis first uploaded EU Territorial Policy
Documents 2007-2016⁸ to the CLARIN-PL DSpace repository
and then imported the corpus into the Inforex system. The
corpus was divided into 4 subcorpora and prepared for
qualitative and quantitative analysis. The review of the
key strands enabled the identification of its 8 core
values (or principles) for further statistical and
contextual analysis. After textual triggers (word forms)
had been assigned to each category, a quantitative
analysis was performed using word frequency lists
generated by Inforex. Manual annotation with a newly
defined set of annotations, together with the Annotation
Browser and its data export option, greatly supported the
qualitative analysis: a detailed contextual analysis of
the corpus focused on two crucial categories,
Participation and Communication.
⁸ http://hdl.handle.net/11321/316
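The quantitative step, counting occurrences of category triggers in a word frequency list, can be illustrated with a short Python sketch. The category names Participation and Communication come from the study described above; the trigger word forms, the sample sentences and the function name count_triggers are invented for this example and are not taken from the actual analysis or from Inforex.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List


def count_triggers(documents: Iterable[str],
                   triggers: Dict[str, List[str]]) -> Dict[str, int]:
    """Sum the frequencies of each category's trigger word forms."""
    word_counts = Counter()
    for text in documents:
        # Build a simple lowercase word frequency list.
        word_counts.update(re.findall(r"\w+", text.lower()))
    return {
        category: sum(word_counts[form.lower()] for form in forms)
        for category, forms in triggers.items()
    }


# Hypothetical trigger lists; the real study defined its own word forms.
triggers = {
    "Participation": ["participation", "participate", "participatory"],
    "Communication": ["communication", "communicate", "dialogue"],
}
documents = [
    "Citizens participate in planning; participation requires dialogue.",
    "Communication between regions supports participatory governance.",
]

category_counts = count_triggers(documents, triggers)
print(category_counts)  # {'Participation': 3, 'Communication': 2}
```

Matching exact word forms rather than lemmas follows the description of triggers as word forms; a lemma-based count would need a morphological analyser, which this sketch does not assume.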
(a) Inforex layout before modernization
(b) Inforex layout after modernization
Figure 4: Inforex layouts comparison
4.3 Hebrew Corpus
Inforex supports manual annotation even if the text is
written in a non-Latin alphabet and a right-to-left
script. One application of the system concerned a corpus
of Hebrew gravestone inscriptions. It also involved the
creation of a new annotation schema. The categories
referred mainly to the pragmatic level of communication
(e.g. initial and final expressions, laudations, death
circumstances). The annotation lemmas perspective was used
to enter Polish translations of the annotated fragments,
which also showed that the lemma attribute may be a broad
term, especially in practical applications of the system.
4.4 Other Corpora
Inforex was used to prepare the training data for our
participation in the BSNLP 2017 shared task on
multilingual named entity recognition, which was aimed at
recognizing mentions of named entities in web documents in
Slavic languages, their normalization/lemmatization, and
cross-language matching (Marcińczuk et al., 2017). The
system also supported the annotation of corpora
constructed for specific natural language processing
tasks, e.g. the Polish Coreference Corpus for the
annotation of definite descriptions and the Polish Spatial
Texts corpus for the annotation of dynamic spatial
expressions. This involved the creation of dedicated
annotation layers. Importantly, the new module of the
system (Annotation Agreement and "2+1" annotation) was
used in these tasks for the first time, which
significantly shortened the preparation of the annotated
training and testing corpora.
5 Summary
The Inforex system, as a part of the CLARIN-PL
infrastructure, is being developed gradually. Although its
initial role was to construct qualitative linguistic
resources for various natural language processing tasks,
it has recently also been used by scientists for other
purposes. We received important and constructive feedback
from users during and after workshops on CLARIN-PL tools
and resources. As users have different needs, we identify
the common functionalities and implement them as soon as
possible in order to support their research tasks and
provide new possibilities. We also faced the fact that
many researchers in the field of digital humanities are
not experienced users of such systems, so we have tried to
make Inforex as easy and intuitive as possible.
Acknowledgments
Work financed as part of the investment in the
CLARIN-PL research infrastructure funded by the
Polish Ministry of Science and Higher Education.
References
Dominik Bas, Bartosz Broda, and Maciej Piasecki. 2008.
Towards Word Sense Disambiguation of Polish. In
Proceedings of the International Multiconference on
Computer Science and Information Technology, IMCSIT 2008,
Wisla, Poland, 20-22 October 2008. IEEE, pages 73–78.
https://doi.org/10.1109/IMCSIT.2008.4747220.
Kalina Bontcheva, Hamish Cunningham, Ian Roberts, Angus
Roberts, Valentin Tablan, Niraj Aswani, and Genevieve
Gorrell. 2013. GATE Teamware: a web-based, collaborative
text annotation framework. Language Resources and
Evaluation 47(4):1007–1029.
Bartosz Broda, Michał Marcińczuk, Marek Maziarz, Adam
Radziszewski, and Adam Wardyński. 2012. KPWr: Towards a
Free Corpus of Polish. In Nicoletta Calzolari, Khalid
Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente
Maegaard, Joseph Mariani, Jan Odijk, and Stelios
Piperidis, editors, Proceedings of LREC'12. ELRA,
Istanbul, Turkey.
Hamish Cunningham, Diana Maynard, Kalina
Bontcheva, Valentin Tablan, Niraj Aswani, Ian
Roberts, Genevieve Gorrell, Adam Funk, An-
gus Roberts, Danica Damljanovic, Thomas
Heitz, Mark A. Greenwood, Horacio Saggion,
Johann Petrak, Yaoyong Li, and Wim Peters.
2011. Text Processing with GATE (Version 6).
http://tinyurl.com/gatebook.
Richard Eckart de Castilho, Eva Mujdricza-Maydt,
Seid Muhie Yimam, Silvana Hartmann, Iryna
Gurevych, Anette Frank, and Chris Biemann. 2016.
A web-based tool for the integrated annotation of
semantic and syntactic structures. In Proceedings
of the workshop on Language Technology Resources
and Tools for Digital Humanities (LT4DH) at COL-
ING 2016. pages 76–84.
George Hripcsak and Adam S. Rothschild. 2005.
Agreement, the f-measure, and reliability in infor-
mation retrieval. J. of Am. Medical Informatics As-
sociation 12(3):296–298.
Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček,
Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít
Suchomel. 2014. The Sketch Engine: ten years on.
Lexicography.
Jan Kocoń, Michał Marcińczuk, Marcin Oleksy, Tomasz
Bernaś, and Michał Wolski. 2015. Temporal expressions in
Polish corpus KPWr. Cognitive Studies | Études cognitives
(15):293–317.
LDC. 2005. ACE (Automatic Content Extraction) English
Annotation Guidelines for Events. Technical report,
Linguistic Data Consortium.
M. Marcińczuk and M. Ptak. 2012. Preliminary study on
automatic induction of rules for recognition of semantic
relations between proper names in Polish texts, volume
7499 LNAI.
Michał Marcińczuk, Jan Kocoń, and Marcin Oleksy. 2017.
Liner2 – a generic framework for named entity recognition.
In Proceedings of the 6th Workshop on Balto-Slavic Natural
Language Processing. Association for Computational
Linguistics, Valencia, Spain, pages 86–91.
http://www.aclweb.org/anthology/W17-1413.
Michał Marcińczuk, Marcin Oleksy, Tomasz Bernaś, Jan
Kocoń, and Michał Wolski. 2015. Towards an event annotated
corpus of Polish. Cognitive Studies | Études cognitives
(15):253–267.
Michał Marcińczuk, Michał Stanek, Maciej Piasecki, and
Adam Musiał. 2011. Rich Set of Features for Proper Name
Recognition in Polish Texts. In SIIS 2011. Springer.
Michał Mirosław Marcińczuk, Marcin Oleksy, and Jan
Wieczorek. 2016. Towards recognition of spatial relations
between entities for Polish. Cognitive Studies | Études
cognitives (16):119–132.
Małgorzata Marciniak, Agnieszka Mykowiecka, and Katarzyna
Głowińska. 2010. Anotowany korpus dialogów telefonicznych.
In Małgorzata Marciniak, editor, Anotowany korpus dialogów
telefonicznych, Akademicka Oficyna Wydawnicza EXIT,
Warsaw, chapter Anotacja korpusu LUNA–WOZ.PL, pages
217–230.
Michał Marcińczuk, Jan Kocoń, and Maciej Janicki. 2013.
Liner2 – a customizable framework for proper names
recognition for Polish. In Robert Bembenik, Lukasz
Skonieczny, Henryk Rybinski, Marzena Kryszkiewicz, and
Marek Niezgodka, editors, Intelligent Tools for Building a
Scientific Information Platform, pages 231–253.
Michał Marcińczuk, Monika Zaśko-Zielińska, and Maciej
Piasecki. 2011. Structure annotation in the Polish Corpus
of Suicide Notes. In Ivan Habernal and Václav Matoušek,
editors, Text, Speech and Dialogue, Springer Berlin
Heidelberg, volume 6836 of Lecture Notes in Computer
Science, pages 419–426.
Maciej Ogrodniczuk, Katarzyna Głowińska, Mateusz Kopeć,
Agata Savary, and Magdalena Zawisławska. 2015. Coreference
in Polish: Annotation, Resolution and Evaluation. Walter
De Gruyter.
http://www.degruyter.com/view/product/428667.
Adam Radziszewski and Maciej Piasecki. 2010. A Pre-
liminary Noun Phrase Chunker for Polish. Proceed-
ings of the Intelligent Information Systems pages
169–180.
Pavel Rychlý. 2007. Manatee/Bonito - a modular corpus
manager. In 1st Workshop on Recent Advances in Slavonic
Natural Language Processing. Masarykova univerzita, Brno,
pages 65–70.
M. Zaśko-Zielińska. 2013. Listy pożegnalne: w poszukiwaniu
lingwistycznych wyznaczników autentyczności tekstu.
Quaestio.
https://books.google.pl/books?id=QG60ngEACAAJ.
Figure 5: Summary of annotation agreement for a set of documents
Figure 6: User agreement verification for a single document