Inforex—a Collaborative System for Text Corpora Annotation and Analysis

Proceedings of Recent Advances in Natural Language Processing, pages 473–482,
Varna, Bulgaria, Sep 4–6 2017.
https://doi.org/10.26615/978-954-452-049-6_063
Michał Marcińczuk, Marcin Oleksy, Jan Kocoń
G4.19 Research Group
Department of Computational Intelligence
Faculty of Computer Science and Management
Wrocław University of Technology, Wrocław, Poland
{michal.marcinczuk,marcin.oleksy,jan.kocon}@pwr.edu.pl
Abstract
We report the first major upgrade of Inforex, a web-based system for qualitative and collaborative text corpora annotation and analysis. Inforex is a part of the Polish CLARIN infrastructure¹. It is integrated with a digital repository for storing and publishing language resources² and it allows users to visualize, browse and annotate text corpora stored in the repository. As a result of a series of workshops for researchers in the humanities and social sciences, we improved the graphical interface to make the system friendlier and more readable for inexperienced users. We also implemented new functionality for gold standard annotation, which includes private annotations and annotation agreement by a super-annotator.
1 Introduction
Digital humanities (DH) create new demands and challenges for the development of new tools and systems, and the extension of existing ones, for text document manipulation, processing, analysis and visualization. CLARIN-PL, the Polish part of the CLARIN infrastructure, tries to rise to the challenges associated with DH for the Polish language. Among many other issues, there is a need for an intuitive and easy-to-use system for qualitative text corpora management, annotation, analysis and visualization. To fulfill these needs we are developing such a system, called Inforex. In this article we present the current state of the system's development.
The decision to create a system for text corpora annotation was taken in 2009, when there were no such systems supporting collaborative work. At that time the only existing tools were desktop applications for individual work, such as GATE (Cunningham et al., 2011) or Manufakturzysta Luna (Marciniak et al., 2010). Since 2010 several systems have emerged, such as WebAnno 3 (Eckart de Castilho et al., 2016) or GATE Teamware (Bontcheva et al., 2013).
¹ http://clarin-pl.eu
² http://clarin-pl.eu/dspace
The first version of the Inforex system was released in 2010 and its initial role was to construct corpus-based linguistic resources for various tasks from the field of natural language processing, including named entity recognition (Marcińczuk et al., 2011), shallow parsing (Radziszewski and Piasecki, 2010), word sense disambiguation (Bas et al., 2008) and recognition of semantic relations between named entities (Marcińczuk and Ptak, 2012). It was used to develop two resources for Polish that were major at that time: the Corpus of Wrocław University of Technology, called KPWr (Broda et al., 2012) (within the NEKST³ project), and the Corpus of Economic News (CEN) (Marcińczuk et al., 2013) (within the SyNaT project⁴). Later, in 2013, Inforex was used to construct another major resource, the Polish Corpus of Suicide Notes (PCSN)⁵ (Marcińczuk et al., 2011), guided by Monika Zaśko-Zielińska (2013). Until now the system has been used to access the corpus. Access is granted on demand after obtaining permission from Wrocław University.
In 2013 Poland joined CLARIN, the European Research Infrastructure for Language Resources and Technology. The goal of CLARIN is to make language technologies more accessible to researchers from the humanities and social sciences, who in most cases do not have the technical skills to use many of the tools on their own. At that time we decided to make Inforex a part of the Polish CLARIN infrastructure. In 2015–2017 we organized several workshops for researchers in the humanities and social sciences. The workshops revealed several user experience issues. First, the system GUI turned out to be not intuitive enough for inexperienced users and needed to be simplified. The second problem was connected with methodology. The researchers use various tools for corpus analysis (including spreadsheets), and Inforex may be treated as a kind of pre-processing tool that allows them to prepare a corpus for further analysis. Data export was possible but complicated, and it required access to a database. User feedback showed that an easy form of data export is one of the crucial needs. Through the workshops we also gathered information (partly in the form of questionnaires) about other important needs, such as access to the definition of custom annotation schemas or data visualisation. Some of them have already been implemented and the others are under construction.
³ http://nekst.ipipan.waw.pl/
⁴ http://www.synat.pl/
⁵ http://pcsn.uni.wroc.pl/
2 Inforex Features Overview
In the following sections we present the main
functionalities and features of the Inforex system.
2.1 Web-based Access
Inforex is a web-based tool which does not require installation. It can be accessed with any web browser which supports JavaScript. Although Inforex is built on several universal JavaScript libraries and frameworks (jQuery, jQuery extensions and Bootstrap), we suggest using Chrome or Firefox. These two web browsers are used to test the system on a daily basis. Users might use other browsers as well; however, we are not able to validate all functions in each of the available web browsers, so some minor issues might occur.
2.2 Authorized and Public Access
Corpora stored in Inforex can be accessed by authorized and unauthorized users. The manager of the corpus (the owner or a user with specific privileges) decides what type of information from the corpus can be publicly available. For instance, only authorized users can have access to the documents' content and can modify the corpus annotations, while unauthorized users may have access to some statistics or annotation frequency lists.
2.3 Integration with DSpace as a Part of
Polish CLARIN Infrastructure
The Inforex system is available at http://inforex.clarin-pl.eu and it is part of the Polish CLARIN infrastructure. This installation is integrated with the official repository for language resources in Polish CLARIN⁶. The repository runs on the DSpace system⁷. When a user registers at https://clarin-pl.eu/dspace/, they also gain access to the Inforex system. At this stage accounts are automatically synchronized. In the future both systems will use unified federated authorization.
2.4 Collaboration
Inforex offers several ways of working collaboratively on a single corpus. One of them is access to the same corpora for different authorized users. Another is selective, task-oriented access to the same document. For instance, different groups of users can have access to a document's metadata. The last one is "2+1" annotation, i.e. two or more users annotate the same set of documents independently and a super-annotator creates the final set of annotations based on their input. This type of collaboration is described in more detail in Section 3.2.
2.5 Qualitative Document Annotation
Inforex was designed for qualitative document annotation. This means it does not offer fast and robust search functions over large corpora with millions of documents. Such functionality can be obtained using other existing tools designed for it, for instance Sketch Engine (Kilgarriff et al., 2014) or NoSketch Engine (Rychlý, 2007). Inforex is suited to medium-size corpora (containing thousands of small documents) and to manually describing documents in terms of their metadata, annotations (types of phrases organized in a hierarchy), annotation attributes, relations between annotations and annotation frames.
2.6 Language-independent
Inforex is language-independent in the sense that it can handle documents in any natural language. So far it has been used to annotate Polish, English and Hebrew texts (see Section 4).
⁶ https://clarin-pl.eu/dspace/
⁷ https://github.com/ufal/clarin-dspace
Figure 1: Corpus overview
2.7 Document Visualisation
Inforex can handle documents in two formats: plain text and XML. For XML documents it is possible to display their content in a visually formatted way. This makes it possible to highlight the document structure, which improves the user experience while browsing and annotating documents. Sample visualizations of different types of documents are presented in Figure 3.
2.8 Document Description
Inforex supports four types of information units which can be used to describe a document's content:
1. Metadata: an information unit which is assigned to the whole document (author name, document creation time, source, etc.).
2. Annotation: an information unit which is assigned to a sequence of words in the document content. Each annotation is described with a category (categories can be organized in a hierarchy) and a set of attributes. The set of attributes depends on the semantic interpretation of the annotation category. For instance, for named entities it can be a lemma, for temporal expressions it can be a normalized value of the expression and for event mentions it can be an event modality.
3. Relation: an information unit which is assigned to a pair of annotations. It is a directed link of some category between two annotations.
4. Frame: an information unit which is assigned to a set of annotations. A frame consists of a set of annotations with roles assigned to them. This type of structure can be used for event annotations (LCD, 2005).
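The four information units can be pictured as a small data model. The following is only an illustrative sketch: the class names, field names, the character-offset convention and the sample attribute values are assumptions made for this example and do not describe the actual Inforex database schema or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    """A span of the document content with a category and attributes."""
    start: int      # character offset, inclusive (assumed convention)
    end: int        # character offset, exclusive (assumed convention)
    category: str   # categories may be organized in a hierarchy
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Relation:
    """A directed, categorized link between two annotations."""
    category: str
    source: Annotation
    target: Annotation


@dataclass
class Frame:
    """A set of annotations, each one filling a named role."""
    category: str
    roles: Dict[str, Annotation]


@dataclass
class Document:
    """Document content plus the four kinds of information units."""
    content: str
    metadata: Dict[str, str] = field(default_factory=dict)
    annotations: List[Annotation] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)
    frames: List[Frame] = field(default_factory=list)

    def text_of(self, ann: Annotation) -> str:
        """Return the fragment of the content covered by an annotation."""
        return self.content[ann.start:ann.end]


# Hypothetical example document and units.
doc = Document(
    content="Jan Kowalski visited Wrocław yesterday.",
    metadata={"author": "unknown", "source": "example"},
)
person = Annotation(0, 12, "person", {"lemma": "Jan Kowalski"})
event = Annotation(13, 20, "event", {"modality": "asserted"})
city = Annotation(21, 28, "city", {"lemma": "Wrocław"})
when = Annotation(29, 38, "date", {"normalized": "hypothetical-value"})
doc.annotations += [person, event, city, when]
doc.relations.append(Relation("located_in", person, city))
doc.frames.append(Frame("visit", {"agent": person, "place": city, "time": when}))
```

Keeping attributes as a free-form mapping mirrors the statement that the attribute set depends on the annotation category.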
3 Recent Improvements
In the following sections we present the recent major improvements to the Inforex system.
3.1 Modern Layout
A series of workshops carried out from 2015 to 2017 showed that the user interface needed to be adjusted to a new group of users: researchers in the humanities and social sciences not involved in NLP tool development. New users reported confusion caused by the large amount of information and the number of available functions. The interface needed to be simplified while the functionality of the system remained unchanged. Thus, the Inforex layout has been upgraded and modernized. This involved not only a facelift of the user interface but also changes to the navigation panels. A comparison of the old and new layouts is presented in Figure 4.
Figure 2: Document annotation view
3.2 Annotation Agreement
Reliability is a key value in the creation of good-quality corpora for training and testing NLP tools. The current version of Inforex enables simultaneous and independent annotation of the same text sample by more than one annotator. Moreover, the annotation process coordinator may keep track of inter-annotator agreement between two raters thanks to the Agreement module, which uses the Positive Specific Agreement (PSA) measure (Hripcsak and Rothschild, 2005) to calculate reliability (see Figure 5). The view configuration makes it possible to define the annotation layers, subsets or categories, the users and the set of documents to be analysed. The coordinator may also specify a comparison mode: whether the system should take into consideration the annotation boundaries only, or both boundaries and categories. It may also include annotation lemmas. Inter-annotator agreement is a very important indicator of the clarity and coherence of the annotation guidelines. Keeping track of changes in inter-annotator agreement between subsequent annotation iterations helps to improve the quality of the annotation guidelines. The Agreement module makes that process easier and faster.
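For two annotators, positive specific agreement as defined by Hripcsak and Rothschild (2005) is PSA = 2a / (2a + b + c), where a is the number of annotations made by both annotators, and b and c are the numbers made by only one of them. The sketch below illustrates how the comparison modes change what counts as "the same annotation". The tuple layout, the mode names and the handling of the empty case are assumptions for illustration, not the Inforex implementation.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class Ann(NamedTuple):
    start: int
    end: int
    category: str
    lemma: Optional[str] = None


def match_key(ann: Ann, mode: str) -> Tuple:
    """Reduce an annotation to the fields compared in the given mode."""
    if mode == "borders":
        return (ann.start, ann.end)
    if mode == "borders+categories":
        return (ann.start, ann.end, ann.category)
    if mode == "borders+categories+lemmas":
        return (ann.start, ann.end, ann.category, ann.lemma)
    raise ValueError(f"unknown comparison mode: {mode}")


def psa(anns_a: Iterable[Ann], anns_b: Iterable[Ann],
        mode: str = "borders+categories") -> Optional[float]:
    """Positive specific agreement: 2a / (2a + b + c)."""
    keys_a = {match_key(a, mode) for a in anns_a}
    keys_b = {match_key(b, mode) for b in anns_b}
    both = len(keys_a & keys_b)      # a: made by both annotators
    only_a = len(keys_a - keys_b)    # b: made by annotator A only
    only_b = len(keys_b - keys_a)    # c: made by annotator B only
    denominator = 2 * both + only_a + only_b
    if denominator == 0:
        return None  # undefined when neither annotator marked anything
    return 2 * both / denominator


annotator_a = [Ann(0, 5, "person"), Ann(10, 15, "city"), Ann(20, 25, "date")]
annotator_b = [Ann(0, 5, "person"), Ann(10, 15, "country"), Ann(30, 35, "date")]

borders_only = psa(annotator_a, annotator_b, mode="borders")
with_categories = psa(annotator_a, annotator_b, mode="borders+categories")
```

In this example the boundaries-only score is 4/6 and the stricter boundaries-and-categories score is 2/6, because the annotators agree on the span at offsets 10 to 15 but not on its category.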
The Inforex system also supports curation of the annotation process (see Figure 6). The curator can choose between two annotators' differing decisions, or even reject consistent but incorrect annotations. Thanks to that module several gold standard projects were carried out, e.g. the Polish Coreference Corpus (Ogrodniczuk et al., 2015) for definite descriptions annotation and the Polish Spatial Texts corpus for the annotation of dynamic spatial expressions.
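The curation step can be sketched as two operations: splitting the annotators' work into agreed and disputed annotations, and then applying the curator's decisions. The function names and the (start, end, category) tuple layout below are hypothetical and serve only to illustrate the workflow described above.

```python
from typing import Iterable, List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, category), an assumed layout


def split_for_curation(anns_a: Iterable[Span], anns_b: Iterable[Span]):
    """Separate annotations both annotators made from the disputed ones."""
    set_a, set_b = set(anns_a), set(anns_b)
    agreed = sorted(set_a & set_b)
    only_a = sorted(set_a - set_b)
    only_b = sorted(set_b - set_a)
    return agreed, only_a, only_b


def curate(agreed: List[Span], only_a: List[Span], only_b: List[Span],
           accept: Set[Span], reject: Set[Span] = frozenset()) -> List[Span]:
    """Build the final set from the curator's decisions.

    Agreed annotations are kept unless explicitly rejected; disputed
    annotations are kept only if explicitly accepted.
    """
    kept_agreed = set(agreed) - set(reject)
    kept_disputed = (set(only_a) | set(only_b)) & set(accept)
    return sorted(kept_agreed | kept_disputed)


annotator_a = [(0, 5, "person"), (10, 15, "city"), (20, 25, "date")]
annotator_b = [(0, 5, "person"), (10, 15, "country"), (20, 25, "date")]

agreed, only_a, only_b = split_for_curation(annotator_a, annotator_b)
final = curate(
    agreed, only_a, only_b,
    accept={(10, 15, "city")},   # the curator sides with annotator A
    reject={(20, 25, "date")},   # consistent, but judged incorrect
)
```

The explicit reject set reflects the point that agreement between annotators does not by itself guarantee a correct annotation.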
4 Applications
In the following sections we present several prac-
tical applications of the Inforex system.
4.1 KPWr
KPWr (Polish Corpus of Wrocław University of Technology) (Broda et al., 2012) is a corpus of written and spoken documents available under a Creative Commons license, intended primarily as training and testing material for NLP tools being developed at Wrocław University of Science and Technology. It is successively enriched with annotation layers. Inforex recently supported manual text annotation within such layers as temporal expressions and their normalizations, events (and descriptions of event attributes), spatial expressions and semantic roles. In order to prepare the temporal expressions annotation (Kocoń et al., 2015), a new annotation scheme based on TimeML was added. Its categories refer to a date, a time of day, a duration and the frequency of an event. The annotation lemmas perspective was used to provide normalized temporal expressions, which shows that the term 'lemma' in Inforex may function as a broad concept. The Annotator perspective of the system also supports event annotation (Marcińczuk et al., 2015). There are seven coarse-grained categories of events, i.e. action, state, reporting, perception, aspectual, intensional action and intensional state. The categorization was based on the TimeML guidelines with some modifications. It also involved the creation of a new annotation scheme. The flexibility in adding new annotation layers (defining new annotation categories) is one of the most important features. The possibility of establishing relations between annotated fragments is no less relevant. It was crucial, e.g., for spatial expressions annotation, whose main goal was to extract different ways of distributing spatial information throughout a sentence by reviewing the lexical and grammatical signals of various relations between objects (Marcińczuk et al., 2016).
Figure 3: Sample document visualizations: (a) Facebook conversation, (b) Wikipedia article, (c) Hebrew document.
4.2 European Legal Texts
As practice shows, although Inforex was primarily developed for the Polish language, it can also be used to work with documents written in other languages. Inforex features and functionalities are useful, e.g., in examining current official EU literature related to territorial development and urban planning. The authors of this analysis first uploaded EU Territorial Policy Documents 2007-2016⁸ to the CLARIN-PL DSpace repository and then imported them into the Inforex system. The corpus was divided into 4 subcorpora and prepared for qualitative and quantitative analysis. The review of the key strands enabled the identification of its 8 core values (or principles) for further statistical and contextual analysis. After assigning textual triggers (word forms) to each category, a quantitative analysis using word frequency lists generated by Inforex was performed. Manual annotation with a newly defined set of annotations, together with the Annotation Browser and its data export option, greatly supported the qualitative analysis: a detailed contextual analysis of the corpus focused on two crucial categories, Participation and Communication.
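The quantitative step, counting how often the trigger word forms of each category occur, can be sketched as follows. The tokenization, the sample sentences and the trigger lists are invented for illustration; only the category names Participation and Communication come from the study, and in practice the frequency list would be the one generated by Inforex.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Set


def word_frequencies(texts: Iterable[str]) -> Counter:
    """Build a case-insensitive word-form frequency list."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(word.lower() for word in re.findall(r"\w+", text))
    return counts


def category_counts(frequencies: Counter,
                    triggers: Dict[str, Set[str]]) -> Dict[str, int]:
    """Sum the frequencies of each category's trigger word forms."""
    return {
        category: sum(frequencies[form] for form in forms)
        for category, forms in triggers.items()
    }


# Hypothetical sample texts and trigger word forms (lowercase).
texts = [
    "Participation of citizens requires communication.",
    "Citizens participate; authorities communicate and support participation.",
]
triggers = {
    "Participation": {"participation", "participate", "participatory"},
    "Communication": {"communication", "communicate"},
}

frequencies = word_frequencies(texts)
totals = category_counts(frequencies, triggers)
```

Because a `Counter` returns zero for unseen keys, trigger forms that never occur in the corpus simply contribute nothing to the category total.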
⁸ http://hdl.handle.net/11321/316
(a) Inforex layout before modernization
(b) Inforex layout after modernization
Figure 4: Inforex layouts comparison
4.3 Hebrew Corpus
Inforex supports manual annotation even if the text is written in a non-Latin alphabet and a right-to-left script. One application of the system involved a corpus of Hebrew gravestone inscriptions. It also required the creation of a new annotation schema. The categories referred mainly to the pragmatic level of communication (e.g. initial and final expressions, laudations, death circumstances). The annotation lemmas perspective was used to enter Polish translations of the annotated fragments, which also showed that the lemma attribute may be a broad term, especially in practical applications of the system.
4.4 Other Corpora
Inforex was used to prepare the training data for our participation in the BSNLP 2017 shared task on multilingual named entity recognition, which aimed at recognizing mentions of named entities in web documents in Slavic languages, their normalization/lemmatization, and cross-language matching (Marcińczuk et al., 2017). The system also supported the annotation of corpora constructed for specific tasks from the field of natural language processing, e.g. the Polish Coreference Corpus for definite descriptions annotation and the Polish Spatial Texts corpus for the annotation of dynamic spatial expressions. This involved the creation of dedicated annotation layers. Importantly, in these tasks the new module of the system (Annotation Agreement and "2+1" annotation) was used for the first time, which significantly reduced the time needed to prepare the annotated training and testing corpora.
5 Summary
The Inforex system, as a part of the CLARIN-PL infrastructure, is being gradually developed. Although its initial role was to construct qualitative linguistic resources for various tasks from the field of natural language processing, recently it has also been used by scientists for other purposes. We received important and constructive feedback from users during and after workshops related to CLARIN-PL tools and resources. As users have different needs, we identified the common functionalities and are implementing them as soon as possible in order to support their research tasks and provide new possibilities. We also faced the fact that many researchers from the field of digital humanities are not experienced users of such systems, so we made Inforex as easy and intuitive as possible.
Acknowledgments
Work financed as part of the investment in the
CLARIN-PL research infrastructure funded by the
Polish Ministry of Science and Higher Education.
References
Dominik Bas, Bartosz Broda, and Maciej Piasecki. 2008. Towards Word Sense Disambiguation of Polish. In Proceedings of the International Multiconference on Computer Science and Information Technology, IMCSIT 2008, Wisla, Poland, 20-22 October 2008. IEEE, pages 73–78. https://doi.org/10.1109/IMCSIT.2008.4747220.
Kalina Bontcheva, Hamish Cunningham, Ian Roberts, Angus Roberts, Valentin Tablan, Niraj Aswani, and Genevieve Gorrell. 2013. GATE Teamware: a web-based, collaborative text annotation framework. Language Resources and Evaluation 47(4):1007–1029.
Bartosz Broda, Michał Marcińczuk, Marek Maziarz, Adam Radziszewski, and Adam Wardyński. 2012. KPWr: Towards a Free Corpus of Polish. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of LREC'12. ELRA, Istanbul, Turkey.
Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Niraj Aswani, Ian Roberts, Genevieve Gorrell, Adam Funk, Angus Roberts, Danica Damljanovic, Thomas Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, Yaoyong Li, and Wim Peters. 2011. Text Processing with GATE (Version 6). http://tinyurl.com/gatebook.
Richard Eckart de Castilho, Eva Mujdricza-Maydt, Seid Muhie Yimam, Silvana Hartmann, Iryna Gurevych, Anette Frank, and Chris Biemann. 2016. A web-based tool for the integrated annotation of semantic and syntactic structures. In Proceedings of the workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH) at COLING 2016. pages 76–84.
George Hripcsak and Adam S. Rothschild. 2005. Agreement, the f-measure, and reliability in information retrieval. J. of Am. Medical Informatics Association 12(3):296–298.
Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. 2014. The Sketch Engine: ten years on. Lexicography.
Jan Kocoń, Michał Marcińczuk, Marcin Oleksy, Tomasz Bernaś, and Michał Wolski. 2015. Temporal expressions in Polish corpus KPWr. Cognitive Studies | Études cognitives (15):293–317.
LCD. 2005. ACE (Automatic Content Extraction) English Annotation Guidelines for Events. Technical report, Linguistic Data Consortium.
M. Marcińczuk and M. Ptak. 2012. Preliminary study on automatic induction of rules for recognition of semantic relations between proper names in Polish texts, volume 7499 LNAI.
Michał Marcińczuk, Jan Kocoń, and Marcin Oleksy. 2017. Liner2 – a generic framework for named entity recognition. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing. Association for Computational Linguistics, Valencia, Spain, pages 86–91. http://www.aclweb.org/anthology/W17-1413.
Michał Marcińczuk, Marcin Oleksy, Tomasz Bernaś, Jan Kocoń, and Michał Wolski. 2015. Towards an event annotated corpus of Polish. Cognitive Studies | Études cognitives (15):253–267.
Michał Marcińczuk, Michał Stanek, Maciej Piasecki, and Adam Musiał. 2011. Rich Set of Features for Proper Name Recognition in Polish Texts. In SIIS 2011. Springer.
Michał Mirosław Marcińczuk, Marcin Oleksy, and Jan Wieczorek. 2016. Towards recognition of spatial relations between entities for Polish. Cognitive Studies | Études cognitives (16):119–132.
Małgorzata Marciniak, Agnieszka Mykowiecka, and Katarzyna Głowińska. 2010. Anotowany korpus dialogów telefonicznych. In Małgorzata Marciniak, editor, Anotowany korpus dialogów telefonicznych, Akademicka Oficyna Wydawnicza EXIT, Warsaw, chapter Anotacja korpusu LUNA–WOZ.PL, pages 217–230.
Michał Marcińczuk, Jan Kocoń, and Maciej Janicki. 2013. Liner2 – a customizable framework for proper names recognition for Polish. In Robert Bembenik, Lukasz Skonieczny, Henryk Rybinski, Marzena Kryszkiewicz, and Marek Niezgodka, editors, Intelligent Tools for Building a Scientific Information Platform, pages 231–253.
Michał Marcińczuk, Monika Zaśko-Zielińska, and Maciej Piasecki. 2011. Structure annotation in the Polish corpus of suicide notes. In Ivan Habernal and Václav Matoušek, editors, Text, Speech and Dialogue, Springer Berlin Heidelberg, volume 6836 of Lecture Notes in Computer Science, pages 419–426.
Maciej Ogrodniczuk, Katarzyna Głowińska, Mateusz Kopeć, Agata Savary, and Magdalena Zawisławska. 2015. Coreference in Polish: Annotation, Resolution and Evaluation. Walter De Gruyter. http://www.degruyter.com/view/product/428667.
Adam Radziszewski and Maciej Piasecki. 2010. A Preliminary Noun Phrase Chunker for Polish. Proceedings of the Intelligent Information Systems pages 169–180.
Pavel Rychlý. 2007. Manatee/bonito - a modular corpus manager. In 1st Workshop on Recent Advances in Slavonic Natural Language Processing. Masarykova univerzita, Brno, pages 65–70.
M. Zaśko-Zielińska. 2013. Listy pożegnalne: w poszukiwaniu lingwistycznych wyznaczników autentyczności tekstu. Quaestio. https://books.google.pl/books?id=QG60ngEACAAJ.
Figure 5: Summary of annotation agreement for a set of documents
Figure 6: User agreement verification for a single document
482
... To systematically and accurately annotate discourse relations in Polish, the project employs Inforex, a web-based annotation platform (Marcińczuk et al., 2012(Marcińczuk et al., , 2017Marcińczuk and Oleksy, 2019). The system has not been prepared specifically for this work, but has been configured to meet its objectives. ...
... The annotation process, outlined in 3.2, was executed using Inforex. Inforex 6 is an online platform for constructing text corpora, developed as an integral part of the CLARIN-PL infrastructure (Marcińczuk et al., 2012(Marcińczuk et al., , 2017Marcińczuk and Oleksy, 2019). It allows parallel online access and resource sharing among multiple users. ...
Chapter
Full-text available
This paper explores a discourse relations annotation project carried out under the CLARIN-PL initiative, leveraging the ISO 24617-8 standard. The goal is to boost research interoper-ability and foster multilingual research. Our team of three linguist-annotators tackled the annotation of a corpus spanning several gen-res, including e.g., literature and press articles in the Polish language. This effort was guided by a project expert and external linguists from the CLARIN-PL language technology research infrastructure. Several significant challenges emerged during the process. Ambiguities within the ISO standard's relation categories, poorly-defined definitions for certain relation categories, and the difficulty of identifying and annotating implicit discourse relations, which lack explicit discourse con-nectives or signaling devices, were among the key issues. To overcome these problems , we implemented strategies such as regular team meetings, collaborative annotation forms, and preliminary revisions to the annotation scheme. This paper presents the project, the annotation process, and offers initial annotation data on the discourse relations and con-nectives identified within the corpus. Looking forward, we discuss potential enhancements to the process, including additional revisions to the guidelines and conclude with an overview of the project's contributions and a discussion of our future development plans.
... The annotation procedure was implemented using the Inforex platform. Inforex 1 is a web-based tool designed for building text corpora and an important component of the CLARIN-PL infrastructure (Marcińczuk et al., 2012;Marcińczuk et al., 2017;Marcińczuk and Oleksy, 2019). Inforex supports simultaneous online access and facilitates resource collaboration among its users. ...
Chapter
Full-text available
This paper presents the Polish Discourse Corpus, a pioneering resource of this kind for Polish and the first corpus in Poland to employ the ISO standard for discourse relation annotation. The Polish Discourse Corpus adopts ISO 24617-8, a segment of the Language Resource Management-Semantic Annotation Framework (SemAF), which outlines a set of core discourse relations adaptable for diverse languages and genres. The paper overviews the corpus architecture, annotation procedures, the challenges that the annotators have encountered, as well as key statistical data concerning discourse relations and connectives in the corpus. It further discusses the initial phases of the discourse parser tailored for the ISO 24617-8 framework. Evaluations on the efficacy and potential refinement areas of the corpus annotation and parsing strategies are also presented. The final part of the paper touches upon anticipated research plans to improve discourse analysis techniques in the project and to conduct discourse studies involving multiple languages.
... (1) initial phase and (2) the final annotation of Dia-Biz.Kom corpus. Both stages were performed by a team of qualified linguists with the use of the Inforex system (Marcińczuk et al., 2017). ...
Conference Paper
Full-text available
This article presents the specification and evaluation of DiaBiz.Kom-the corpus of dialogue texts in Polish. The corpus contains transcriptions of telephone conversations conducted according to a prepared scenario. The transcripts of conversations have been manually annotated with a layer of information concerning communicative functions. DiaBiz.Kom is the first corpus of this type prepared for the Polish language and will be used to develop a system of dialogue analysis and modules for creating advanced chatbots.
... In the last several years, the NLP community has shown growing interest in tools that are web-based, open source, and multi-purpose: WebAnno [Yimam and Gurevych, 2013], Inforex [Marcińczuk et al., 2017], and Anafora [Chen and Styler, 2013]. Other popular non web-based annotation systems include GATE [Cunningham et al., 2011] and AnCoraPipe [Bertrán et al., 2008]. ...
Preprint
Full-text available
This dissertation explores the linguistic and computational aspects of the meaning relations that can hold between two or more complex linguistic expressions (phrases, clauses, sentences, paragraphs). In particular, it focuses on Paraphrasing, Textual Entailment, Contradiction, and Semantic Similarity. In Part I: "Similarity at the Level of Words and Phrases", I study the Distributional Hypothesis (DH) and explore several different methodologies for quantifying semantic similarity at the levels of words and short phrases. In Part II: "Paraphrase Typology and Paraphrase Identification", I focus on the meaning relation of paraphrasing and the empirical task of automated Paraphrase Identification (PI). In Part III: "Paraphrasing, Textual Entailment, and Semantic Similarity", I present a novel direction in the research on textual meaning relations, resulting from joint research carried out on on paraphrasing, textual entailment, contradiction, and semantic similarity.
... Polish Corpus of Wrocław University of Technology (KPWr) [19]- [21] is a corpus of both written and spoken texts that have been semi-automatically and manually annotated at multiple semantic and grammatical levels. Additionally, metadata such as text domain, keywords, text type, and functional style have been manually assigned to the documents. ...
Conference Paper
Full-text available
Many publications prove that the creation of a multiobjective machine learning model is possible and reasonable. Moreover, we can see significant gains in expanding the knowledge domain, increasing prediction quality, and reducing the inference time. New developments in cross-lingual knowledge transfer open up a range of possibilities, particularly in working with low-resource languages. With a motivation to explore the latest subfields of natural language processing and their interactions, we decided to create a multi-task multilingual model for the following text classification tasks: functional style, domain, readability, and sentiment. The paper discusses the effectiveness of particular language-agnostic approaches to Polish and English and the effectiveness and validity of the multi-task model. https://sentic.net/sentire2021gawron.pdf
... The annotation was performed by two independent annotators and then reviewed by a super-annotator with the help of Inforex system [11]. The corpus contains roughly 4 hundred thousand tokens and its size is comparable with the manually annotated part of the National Corpus of Polish (1.2 million) [14]. ...
... ent classification, summarization, etc. This task follows on from previous TempEval events organized for evaluating time expressions for English and Spanish like SemEval-2013 (UzZaman i in. 2013). This time a corpus of Polish documents fully annotated with temporal expressions was provided. The annotation process was performed using Inforex system (Marcinczuk i in. 2017) and it covered texts boundaries, classes and normalized values of temporal expressions. The annotation for Polish texts is based on a modified version of original TIMEX3 annotation guidelines 3 at the level of annotating boundaries/types 4 and local/global normalization 5 . ...
Conference Paper
Full-text available
This article presents the research in the recognition and normalization of Polish temporal expressions as the result of the first PolEval 2019 shared task. Temporal information extracted from the text plays a significant role in many information extraction systems, like question answering, event recognition or text summarization. A specification for annotating Polish temporal expressions (PLIMEX) was used to prepare a completely new test dataset for the competition. PLIMEX is based on state-of-the-art solutions for English, mostly TimeML. The training data provided for the task is Polish Corpus of Wrocław University of Technology (KPWr) fully annotated using PLIMEX guidelines.
... It supports text cleanup, mention annotation, relations between annotations, morphological tagging, annotation attributes, metadata and many others. A more comprehensive list of functions can be found in (Marcinczuk et al., 2017). ...
Conference Paper
In the paper we present the latest changes introduced to Inforex, a web-based system for qualitative and collaborative text corpora annotation and analysis. One of the most important changes is the release of the source code. The system is now available in the GitHub repository (https://github.com/CLARIN-PL/Inforex) as an open source project. The system can be easily set up and run in a Docker container, which simplifies the installation process. The major improvements include: semi-automatic text annotation, multilingual text preprocessing using CLARIN-PL web services, morphological tagging of XML documents, an improved editor for annotation attributes, a batch annotation attribute editor, morphological disambiguation, and extended word sense annotation. This paper contains a brief description of the mentioned improvements. We also present two use cases in which various Inforex features were used and tested in real-life projects.
Article
Motivation: Annotation tools are applied to build training and test corpora, which are essential for the development and evaluation of new natural language processing algorithms. Further, annotation tools are also used to extract new information for a particular use case. However, owing to the high number of existing annotation tools, finding the one that best fits particular needs is a demanding task that requires searching the scientific literature followed by installing and trying various tools. Methods: We searched for annotation tools and selected a subset of them according to five requirements with which they should comply, such as being Web-based or supporting the definition of a schema. We installed the selected tools (when necessary), carried out hands-on experiments and evaluated them using 26 criteria that covered functional and technical aspects. We defined each criterion on three levels of matches and a score for the final evaluation of the tools. Results: We evaluated 78 tools and selected the following 15 for a detailed evaluation: BioQRator, brat, Catma, Djangology, ezTag, FLAT, LightTag, MAT, MyMiner, PDFAnno, prodigy, tagtog, TextAE, WAT-SL and WebAnno. Full compliance with our 26 criteria ranged from only 9 up to 20 criteria, which demonstrated that some tools are comprehensive and mature enough to be used on most annotation projects. The highest score of 0.81 was obtained by WebAnno (of a maximum value of 1.0).
Chapter
In this paper, we introduce an open-source tool, YEDDA, supported by a pre-annotation module based on deep learning. EPAD proposes a novel annotation workflow, combining pre-annotation and manual annotation, which improves the efficiency and quality of annotation. The pre-annotation module can effectively reduce the annotation time while also improving the precision and recall of annotation. EPAD also contains some mechanisms to facilitate the usage of the pre-annotation module. As a collaborative design, EPAD provides administrators with annotation statistics and analysis functions. Experiments showed that EPAD shortened the total annotation time by almost 60.0% and improved the F-measure for annotation quality by 12.7%.
Conference Paper
We introduce the third major release of WebAnno, a generic web-based annotation tool for distributed teams. New features in this release focus on semantic annotation tasks (e.g. semantic role labelling or event annotation) and allow the tight integration of semantic annotations with syntactic annotations. In particular, we introduce the concept of slot features, a novel constraint mechanism that allows modelling the interaction between semantic and syntactic annotations, as well as a new annotation user interface. The new features were developed and used in an annotation project for semantic roles on German texts. The paper briefly introduces this project and reports on experiences performing annotations with the new tool. On a comparative evaluation, our tool reaches significant speedups over WebAnno 2 for a semantic annotation task. http://aclweb.org/anthology/W16-4011.pdf
Conference Paper
In the paper we present an adaptation of the Liner2 framework to solve the BSNLP 2017 shared task on multilingual named entity recognition. The tool is tuned to recognize and lemmatize named entities for Polish.
Article
In this paper, the problem of spatial relation recognition in Polish is examined. We present the different ways of distributing spatial information throughout a sentence by reviewing the lexical and grammatical signals of various relations between objects. We focus on the spatial usage of prepositions and their meaning, determined by the ‘conceptual’ schemes they constitute. We also discuss the feasibility of a comprehensive recognition of spatial relations between objects expressed in different ways by reviewing the existing tools and resources for text processing in Polish. As a result, we propose a heuristic method for the recognition of spatial relations expressed in various phrase structures called spatial expressions. We propose a definition of spatial expressions by taking into account the limitations of the available tools for the Polish language. A set of rules is used to generate candidates of spatial expressions which are later tested against a set of semantic constraints. The results of our work on recognition of spatial expressions in Polish texts were partially presented in (Marcińczuk, Oleksy, & Wieczorek, 2016). In that paper we focused on a detailed analysis of errors obtained using a set of basic morphosyntactic patterns for generating spatial expression candidates - we identified and described the most common sources of errors, i.e. incorrectly recognized or unrecognized expressions. In this paper we focused mainly on the preliminary stages of spatial expression recognition. We presented an extensive review on how the spatial information can be encoded in the text, types of spatial triggers in Polish and a detailed evaluation of morphosyntactic patterns which can be used to generate spatial expression candidates.
Article
Towards an event annotated corpus of Polish. The paper presents a typology of events built on the basis of the TimeML specification adapted to the Polish language. Some changes were introduced to the definition of the event categories and a motivation for event categorization was formulated. The event annotation task is presented on two levels: the ontology level (language independent) and text mentions (language dependent). The various types of event mentions in Polish text are discussed. A procedure for annotation of event mentions in Polish texts is presented and evaluated. In the evaluation a randomly selected set of documents from the Corpus of Wrocław University of Technology (called KPWr) was annotated by two linguists and the annotator agreement was calculated. The evaluation was done in two iterations. After the first evaluation we revised and improved the annotation procedure. The second evaluation showed a significant improvement of the agreement between annotators. The current work was focused on annotation and categorisation of event mentions in text. Future work will focus on the description of events with a set of attributes, arguments and relations.
Article
In the paper we present a customizable and open-source framework for proper names recognition called Liner2. The framework consists of several universal methods for sequence chunking which include: dictionary look-up, pattern matching and statistical processing. The statistical processing is performed using Conditional Random Fields and a rich set of features including morphological, lexical and semantic information. We present an application of the framework to the task of recognizing proper names in Polish texts (5 common categories of proper names, i.e. first names, surnames, city names, road names and country names). The Liner2 framework was also used to train an extended model to recognize 56 categories of proper names which was used to bootstrap the manual annotation of the KPWr corpus. We also present the CRF-based model integrated with a heterogeneous named entity similarity function. We show that the similarity function added to the best configuration improved the final result for cross-domain evaluation. The last section presents NER-WS, a web service for proper names recognition in Polish texts utilizing the Liner2 framework and the model for 56 categories of proper names. The web service can be tested using a web-based demo available at http://nlp.pwr.wroc.pl/inforex/.
Article
This paper presents GATE Teamware—an open-source, web-based, collaborative text annotation framework. It enables users to carry out complex corpus annotation projects, involving distributed annotator teams. Different user roles are provided (annotator, manager, administrator) with customisable user interface functionalities, in order to support the complex workflows and user interactions that occur in corpus annotation projects. Documents may be pre-processed automatically, so that human annotators can begin with text that has already been pre-annotated and thus making them more efficient. The user interface is simple to learn, aimed at non-experts, and runs in an ordinary web browser, without need of additional software installation. GATE Teamware has been evaluated through the creation of several gold standard corpora and internal projects, as well as through external evaluation in commercial and EU text annotation projects. It is available as on-demand service on GateCloud.net, as well as open-source for self-installation.
Conference Paper
We compare three different methods of word sense disambiguation applied to the disambiguation of a selected set of 13 Polish words. The selected words express different problems for sense disambiguation. As it is hard to find works for Polish in this area, our goal was to analyse applicability and limitations of known methods in relation to Polish and Polish language resources and tools. The obtained results are very positive, as using limited resources, we achieved the accuracy of sense disambiguation greatly exceeding the baseline of the most frequent sense. For the needs of experiments a small corpus of representative examples was manually collected and annotated with senses drawn from plWordNet. Different representations of context of word occurrences were also experimentally tested. Examples of limitations and advantages of the applied methods are discussed.
Conference Paper
In the paper we present preliminary work on automatic construction of rules for recognition of semantic relations between pairs of proper names in Polish texts. Our goal was to check the feasibility of automatic rule construction using an existing inductive logic programming (ILP) system as an alternative or supporting method for manual rule creation. We present a set of predicates in first-order logic that is used to represent the semantic relation recognition task. The background knowledge encodes morphological, orthographic and named entity-based features. We applied ILP to the proposed representation to generate rules for relation extraction. We have utilized an existing ILP system called Aleph [1]. The performance of automatically generated rules was compared with a set of hand-crafted rules developed on the basis of a training set for 8 categories of relations (affiliation, alias, creator, composition, location, nationality, neighbourhood, origin). Finally, we proposed several ways to improve the preliminary results in future work.
Conference Paper
In this paper we analyse the importance of data generalisation and usage of local context in the problem of the Proper Name recognition. We present an extended set of features that provide generalised description of the data and encode linguistic information. To utilize the rich set of features we applied Conditional Random Fields (CRF) — a modern approach for sequence labelling. We present results of the evaluation on a single domain following the cross-validation scheme and cross-domain evaluation based on training and testing on different corpora. We show that the extended set of features improves the final results for CRF and also this approach outperforms Hidden Markov Models (HMM). On the single domain CRF obtained 92.53% of F-measure for 5 categories of proper names, and 67.72% and 72.62% of F-measure for other two corpora in cross-domain evaluation.
Article
Information retrieval studies that involve searching the Internet or marking phrases usually lack a well-defined number of negative cases. This prevents the use of traditional interrater reliability metrics like the kappa statistic to assess the quality of expert-generated gold standards. Such studies often quantify system performance as precision, recall, and F-measure, or as agreement. It can be shown that the average F-measure among pairs of experts is numerically identical to the average positive specific agreement among experts and that kappa approaches these measures as the number of negative cases grows large. Positive specific agreement, or the equivalent F-measure, may be an appropriate way to quantify interrater reliability and therefore to assess the reliability of a gold standard in these studies.
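The equivalence stated in this abstract can be checked numerically for a single pair of raters. The sketch below is illustrative only and is not taken from the cited article: the counts a, b, c, d and the function names are hypothetical, with a the number of items both raters marked, b and c the items marked by only one rater, and d the (usually unknown) number of items neither marked.

```python
# Illustrative sketch with made-up counts; not code from the cited article.

def positive_specific_agreement(a, b, c):
    """PSA = 2a / (2a + b + c); does not need the number of negatives."""
    return 2 * a / (2 * a + b + c)

def f_measure(a, b, c):
    """F-measure of rater 1 scored against rater 2 as the reference."""
    precision = a / (a + b)  # share of rater 1's marks confirmed by rater 2
    recall = a / (a + c)     # share of rater 2's marks found by rater 1
    return 2 * precision * recall / (precision + recall)

def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 table; requires the negative count d."""
    n = a + b + c + d
    p_observed = (a + d) / n
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical counts for one pair of annotators.
a, b, c = 40, 10, 5
psa = positive_specific_agreement(a, b, c)
f1 = f_measure(a, b, c)

# Kappa for an increasing (assumed) number of negative cases.
kappas = [cohen_kappa(a, b, c, d) for d in (10, 1_000, 1_000_000)]

print(f"PSA = {psa:.4f}, F-measure = {f1:.4f}")
for k in kappas:
    print(f"kappa = {k:.4f}")
```

With these counts the two agreement measures coincide, and kappa moves toward the same value as d grows, which is why positive specific agreement is usable when the number of negative cases is undefined.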