Inforex – a web-based tool for text corpus management and semantic annotation

Michał Marcińczuk, Jan Kocoń, Bartosz Broda
Wrocław University of Technology, Wrocław, Poland
The aim of this paper is to present a system for semantic text annotation called Inforex. Inforex is a web-based system designed for
managing and annotating text corpora on the semantic level including annotation of Named Entities (NE), anaphora, Word Sense
Disambiguation (WSD) and relations between named entities. The system also supports manual text clean-up and automatic text
pre-processing including text segmentation, morphosyntactic analysis and word selection for word sense annotation.
Keywords: corpus management, corpus annotation, bootstrapping, Inforex
1. Introduction
Large text corpora are central to statistics-based Natural Language Processing (NLP) (Manning and Schütze, 2001). The literature offers many approaches based on supervised Machine Learning (ML) for solving NLP-related problems. Training ML algorithms requires a manually annotated corpus, that is, a domain expert has to mark certain parts of the text with appropriate labels. The annotation process is usually hard, costly and time consuming. The problem is even more pronounced when multiple people are working simultaneously on the same corpus.
However, usage of supporting Language Technology (LT)
can improve the process of manual corpus annotation con-
siderably. In this paper we describe Inforex – an example
of LT that helps in this process.
Inforex is a web-based system for text corpora management and semantic annotation. The construction of the system started in early 2010, at the beginning of the NEKST project. At that time we needed to gather and prepare data for the task of named entity recognition (Marcińczuk and Piasecki, 2011). In the second half of the year another project, SyNaT, started. One of its tasks was to build a manually annotated corpus with semantic information. New requirements emerged and we decided to extend our system with the new functionality. The system was also used in another project, started at the beginning of 2011 (Marcińczuk et al., 2011), in the construction of the Polish Corpus of Suicide Notes (PCSN).
We decided to construct the system from scratch because we could not find a system that: (a) is open-source and freely available, (b) is platform independent, (c) stores all the data (text and annotations) in a central repository integrated with the application, (d) provides transparent deployment of new versions of the system, and (e) can be run on any computer without the need to download and install additional software.
This paper is organised as follows: we start with a description of existing systems for corpus annotation. Next, descriptions of Inforex (Sec. 3.) and the annotation workflow (Sec. 4.) are given. A detailed description of the tasks supported by Inforex is given in Section 5. The paper finishes with a brief discussion of the licensing status (Sec. 7.), system applications (Sec. 6.) and conclusions in Sec. 8.
2. Existing Annotation Environments
As corpus annotation is not a new NLP task, some systems have already been built. Before deciding to develop Inforex from scratch we investigated several existing systems. Most of them have severe limitations in terms of our requirements. Nevertheless, the analysis helped us refine our design goals and the architecture of Inforex. The list of examined systems includes:
GATE (Cunningham et al., 2011) is a widely known and used system for corpus management and text annotation that has been developed for over 15 years. It is a desktop application written in Java and can be run under almost any operating system. It provides much of the functionality we required, but we decided not to use it because we stumbled upon many problems while developing a Java-based desktop application for wordnet construction called WordnetLoom (Piasecki et al., 2011). Among the decisive factors were frequent upgrades, which are very inconvenient for the users, and difficulties in quickly reproducing bugs on developers' computers, leading to a high cost of bug fixing.
Manufakturzysta 2.0 Luna (Marcińczuk, 2010) is a desktop application written in C# that was used to annotate transcriptions of phone calls within the LUNA project (Mykowiecka et al., 2010). The system was designed to annotate text on the semantic level, including named entities and binary relations between the entities. The system works only under the Windows operating system and does not support parallel access to central data by different users — every instance of the application works on local data.
GATE Teamware (LLC, 2010) is a web-based version of GATE, also implemented in Java. Information about the system had been available since 2010, but the source code was published only after the development of Inforex had started.
Annotatornia (Przepiórkowski and Murzynowski, 2009) is a web-based system developed to annotate text on four levels: word-level segmentation, sentence-level segmentation, morphosyntax and WSD. Annotation of named entities, binary relations and events was not included. Implementation started in 2009 and the source code was published in July.
3. Inforex Characteristics
Inforex can be accessed from any standards-compliant web browser supporting JavaScript.1 The user interface has the form of dynamic HTML pages using the AJAX technology. The server part of the system is written in PHP and the data is stored in a MySQL database. The system makes use of some external tools that are installed on the server or can be accessed via web services.
The documents are stored in the database in their original format — either plain text, XML or HTML. Tokenization and sentence segmentation are optional and are stored in a separate table. Tokens are stored as pairs of values representing the indexes of the first and last character of the token, together with sets of features representing the morpho-syntactic information. Annotations2 created by users are stored in the same way as tokens (pairs of character indexes) but in an additional table. Character indexes omit all white spaces and XML/HTML tags. In addition, HTML entities are counted as one character.
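The indexing scheme described above can be sketched as follows. This is an illustrative reconstruction in Python, not Inforex's actual (PHP) code, and the helper name is ours:

```python
import re

def annotation_index_map(raw):
    """Map raw-text positions to annotation character indexes: whitespace
    and XML/HTML tags are skipped, and each HTML entity (e.g. &amp;)
    counts as a single character. Hypothetical helper for illustration."""
    mapping = {}   # raw position -> annotation character index
    idx = 0
    pos = 0
    while pos < len(raw):
        ch = raw[pos]
        if ch == "<":                      # skip an XML/HTML tag entirely
            end = raw.find(">", pos)
            pos = end + 1 if end != -1 else len(raw)
            continue
        if ch.isspace():                   # whitespace is not counted
            pos += 1
            continue
        if ch == "&":                      # an entity counts as one character
            entity = re.match(r"&[#\w]+;", raw[pos:])
            if entity:
                mapping[pos] = idx
                idx += 1
                pos += entity.end()
                continue
        mapping[pos] = idx
        idx += 1
        pos += 1
    return mapping

text = "<p>Anna &amp; Jan</p>"
m = annotation_index_map(text)
# "Anna" occupies annotation indexes 0..3, "&amp;" is index 4, "Jan" is 5..7
```

Under this scheme an annotation's boundaries stay meaningful even if the markup or whitespace of the stored document differs from what the annotator saw.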
4. Annotation Workflow
The corpus annotation workflow in Inforex starts with the
creation and configuration of a new corpus. This involves
definition of subcorpora, flags (described in the next para-
graph), selection of document perspectives, selection of ex-
isting or creation of new schemas of annotations and re-
lations and uploading documents. When the corpus con-
figuration is set up one can add new or existing users and
grant access and permissions to the document perspectives.
Users, that have appropriate permissions can perform cer-
tain actions. When user logs in to the system he or she sees
only the corpora that were assigned to her/him or are pub-
Flags (see Figure 1), mentioned in the previous paragraph, are used to track work progress. The mechanism allows a set of named flags to be defined that describe the work state of every document within a given corpus. Every flag can be set to one of several predefined states, i.e., not ready, ready to process, being processed, ready to check, needs correction, checked.
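A flag mechanism of this kind could be modelled as follows (a hypothetical sketch in Python; Inforex itself is written in PHP, and the flag names below are invented for illustration):

```python
from enum import Enum

class FlagState(Enum):
    """The predefined flag states listed above."""
    NOT_READY = "not ready"
    READY_TO_PROCESS = "ready to process"
    BEING_PROCESSED = "being processed"
    READY_TO_CHECK = "ready to check"
    NEEDS_CORRECTION = "needs correction"
    CHECKED = "checked"

# The work state of one document is described by a set of named flags:
document_flags = {
    "clean-up": FlagState.CHECKED,
    "NE annotation": FlagState.BEING_PROCESSED,
}
```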
1 In practice we have only enough resources to properly test the system in Firefox. Thus, some of the complex dynamic functions might not work properly under other web browsers.
2 An annotation is understood as a label attached to a continuous piece of text. Additional information can be attached to the annotation as a pair of strings: {argument; value}.
5. Tasks
This section presents how the system supports different
kind of tasks related to the corpus construction.
5.1. Document Browsing
The XML tags in a document are used to encode the document structure. While browsing, they are not displayed to the user directly but influence how the text blocks are displayed on the screen. The HTML tags (h1, em, p, li, etc.) are displayed in the default way. For custom tags, the user can define the formatting using CSS.
5.2. Document Content Editing
A common operation performed on every document is its clean-up. Documents can be edited in the Edit Content perspective. Every modification of the content is tracked by the system, and a difference with the previous version is generated and stored in the database. In addition, the user can add a comment in order to motivate the introduced changes.
A complete revision history of the document is presented in the History of Changes perspective (see Figure 2). Every modification is displayed as a diff with the previous version, together with the date, time, user name and user comment. This mechanism is used to track down potential errors introduced during document clean-up.
As the annotations are stored as pairs of character indexes representing the annotation boundaries, the modification of annotated documents needs special treatment. If a document contains annotations, special markers are inserted into the document content to indicate the annotation boundaries. When the document content is changed, Inforex automatically calculates the changes in annotations (and possible deletions of annotations). The user sees the list of changes that will be applied to the document and can either reject or confirm them. The users can also backtrack to a previous version of the document.
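The core of such a recalculation can be sketched as follows. This is a simplified, hypothetical model in Python of how one annotation span might be updated after an edit; the real system additionally builds the confirmation list shown to the user:

```python
def shift_span(start, end, edit_pos, removed, inserted):
    """Recompute one annotation span (inclusive character indexes) after an
    in-place edit that replaces `removed` characters at `edit_pos` with
    `inserted` characters. Returns the new span, or None when the edit
    touches the span and the change must be confirmed by the user.
    Illustrative sketch, not Inforex's actual code."""
    delta = inserted - removed
    if edit_pos + removed <= start:    # edit lies entirely before the span
        return (start + delta, end + delta)
    if edit_pos > end:                 # edit lies entirely after the span
        return (start, end)
    return None                        # overlap: needs manual confirmation

# Example: inserting 3 characters at position 2 shifts a span at (10, 14)
shift_span(10, 14, 2, 0, 3)  # -> (13, 17)
```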
5.3. Document Segmentation
Document tokenization is stored independently of the document content, in the same way as annotations — as pairs of character indexes representing each token's range. Sentence segmentation is stored indirectly: sentence boundaries are stored together with the tokenization by marking the tokens that end sentences.
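The storage layout described above (tokens as character index pairs, sentence boundaries as a flag on sentence-final tokens) can be sketched as follows, assuming a toy whitespace tokenizer invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Token:
    first: int           # index of the token's first character
    last: int            # index of its last character (inclusive)
    ends_sentence: bool  # sentence-boundary marker, as described above

def tokenize(text):
    """Toy whitespace tokenizer producing the layout described above;
    sentence ends are guessed from a final '.', '!' or '?'."""
    tokens = []
    pos = 0
    for word in text.split():
        first = text.index(word, pos)
        last = first + len(word) - 1
        tokens.append(Token(first, last, word[-1] in ".!?"))
        pos = last + 1
    return tokens

toks = tokenize("Ala ma kota. Kot śpi.")
```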
For Polish, two tokenizers were integrated with the system. The first one is accessible directly through the Tokenization perspective, utilizing the TaKIPI-WS web service (Broda et al., 2010b). The other tool, Maca (Radziszewski and Śniatowski, 2011), is executed as a batch script that can reside on an external server. The script inserts the tokenization directly into the database through the API provided by Inforex. Using such an approach we do not tie Inforex to one tokenization schema, as any external tool can be utilised for this purpose.
The current version of the perspective does not allow the segmentation to be modified by hand. However, after recurring reports from linguists about errors in the automatic segmentation that cause problems during annotation, we decided to extend this perspective and allow users to modify the sentence boundaries.
Figure 1: List of documents — contains basic information about the documents (left part of the table) and flags indicating the document work progress (right part of the table).
Figure 2: History of Changes — perspective used to track changes in the document content.
5.4. Named Entities Annotation
Annotation of named entities is an example of an annotation-based task, i.e., a task whose goal is to assign a set of predefined labels to the text. The annotation is performed within the Semantic Annotator perspective (see Figure 3). It was challenging to design and develop an HTML-based interface for text annotation. The following questions had to be answered:
1. How to display annotations in a formatted HTML document in a compact way?
2. How to organize annotation of tokenized and untokenized texts?
3. How to handle nested, overlapping and discontinuous annotations?
4. How to simplify, support and automate the creation of common annotations?
We wanted to display the annotations in a compact way in order to fit as much text on the screen as possible — some annotation tasks require access to a wide document context. The best way was to display the annotations as HTML formatting tags (i.e., span elements). However, this solution has some limitations, i.e., a problem with displaying overlapping and discontinuous annotations (nested annotations can be easily displayed with this approach).
To solve the problem with overlapping annotations, we assumed that annotations within the same group of annotations (layer) cannot overlap. Annotations from different groups can overlap but cannot be displayed in the same panel at the same time. In order to display annotations from overlapping groups, the screen is split, forming twin panels (see Figure 4). The idea is to display the same document in two parallel panels and allow the user to choose which group of annotations should be displayed in each panel. This way, overlapping groups of annotations are displayed side-by-side. In addition, the user can show/hide selected subgroups of annotations.
To solve the problem with discontinuous annotations, we decided to use the relation mechanism (described in Section 5.7.). Every continuous part of an annotation is represented by a single annotation. Then, all the annotations are connected with a special type of relation in a continuous chain, that is, the first part with the second one, the second one with the third one, and so on.
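Under this scheme, a discontinuous annotation reduces to its continuous parts plus a chain of pairwise links (illustrative Python, with example spans):

```python
# Continuous fragments of one discontinuous annotation, stored as ordinary
# annotations, i.e. (first, last) character-index pairs (example values):
parts = [(5, 9), (15, 18), (24, 30)]

# Each consecutive pair is connected by a special "continuation" relation:
continuation_links = list(zip(parts, parts[1:]))
# first part -> second part, second part -> third part, ...
```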
Figure 3: Semantic Annotator — perspective used to create, modify and delete semantic annotations.
Figure 4: Twin-panel with named entities on the left and
agreement chunks on the right.
During annotation, if word segmentation is provided, the system automatically expands the text selection to capture whole tokens. The process of annotation is supported in three additional ways:
quick mode — the user selects one type of annotation, and after every selection of text this annotation type is automatically added (in normal mode the user has to choose the annotation type after every text selection);
common annotations — displays selected types of annotations instead of the full list of annotations. This is useful for groups with lots of rare annotation types, which would otherwise require a lot of scrolling;
sentence segmentation highlight — displays every sentence starting from a new line and separated by a horizontal line to clearly indicate the sentence boundaries (see Figure 5). This mode is useful in sentence-context annotation tasks; for example, syntactic chunks cannot cross sentence boundaries, and semantic relations between proper names are contained within one sentence.
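Expanding a selection to whole tokens can be sketched as follows (a hypothetical helper, assuming tokens are given as inclusive character spans as described in Section 5.3.):

```python
def snap_to_tokens(sel_start, sel_end, tokens):
    """Expand a character selection so that it covers whole tokens, as the
    annotator does when word segmentation is available. `tokens` is a list
    of (first, last) inclusive character spans. Illustrative sketch."""
    for first, last in tokens:
        if first <= sel_start <= last:   # selection starts mid-token
            sel_start = first
        if first <= sel_end <= last:     # selection ends mid-token
            sel_end = last
    return sel_start, sel_end

tokens = [(0, 2), (4, 8), (10, 13)]
snap_to_tokens(5, 11, tokens)  # -> (4, 13)
```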
Figure 5: Sentence segmentation highlight mode.
5.5. Annotation Bootstrapping
The Bootstrapping perspective (see Figure 6) allows the user to run an external module that recognizes named entities and to verify the results of the automatic recognition. The automatically recognized annotations are presented to the user for verification. For every proposition the user can choose one of four options: accept if the annotation is correct, discard if the annotation is incorrect (the annotation border is incorrect), change annotation type if the annotation border is correct but the annotation type is wrong, and the last option, later, leaves the proposition unchanged for later verification. The missing annotations (not recognized in bootstrapping) must be added manually in the Semantic Annotator perspective. The discarded annotations are stored in the database to prevent the system from repeating wrong decisions. Storing the mistakes of the system also enables calculation of the performance of the bootstrapping module.
5.6. Word Sense Annotation
The perspective for word sense annotation (WSD Annotator) is based on the system presented in (Broda et al., 2010c). The perspective consists of three parts: (1) a list of words to be annotated, (2) a document view with the words marked for annotation, and (3) a list of senses for the selected word. The perspective allows the user to browse the instances of a selected word in a predefined order or to jump directly to the first unannotated word.
Figure 6: Bootstrapping — perspective for manual verification of bootstrapped annotation.
Figure 7: WSD Annotator — perspective for word sense annotation.
5.7. Relation Annotation
Annotation of relations is performed in the Semantic Annotator perspective. Relations can be created between any types of annotations according to a predefined schema. The schema defines groups of relations, the annotation layers to which the relations are assigned, and constraints on the annotation types that can be connected with a given relation. The constraints can be set on the level of annotation layers, annotation groups and single annotation types.
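Such schema constraints could be checked as follows; the schema contents, relation names and annotation type names below are invented for illustration:

```python
# Hypothetical schema: each relation type lists the pairs of annotation
# types it is allowed to connect (source type, target type).
SCHEMA = {
    "affiliation": {("person_nam", "organization_nam")},
    "location": {("organization_nam", "city_nam"),
                 ("person_nam", "city_nam")},
}

def relation_allowed(relation, source_type, target_type):
    """Check a proposed relation against the predefined schema."""
    return (source_type, target_type) in SCHEMA.get(relation, set())
```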
5.8. Anaphora Annotation
Anaphora is a kind of relation that connects two elements. In general, anaphora could be annotated using the general mechanism for relations. However, the number of operations required to create an anaphora relation that way is too large. In order to simplify and speed up the annotation of anaphora, a dedicated perspective was designed and implemented, namely the Anaphora Annotator (see Figure 8). The perspective consists of three parts: (1) the left part is a document view with tokenization, (2) the middle part is a document view with selected named entities, and (3) the right part holds a list of anaphora types. The process of creating a new relation requires three operations: (1) selection of a source word or named entity, (2) selection of a target named entity and (3) selection of the anaphora type.
5.9. Annotation of Events
Annotation of events can be done in the Semantic Annotator perspective. Events are defined as sets of pairs {attribute; value}, where attribute is the name of a slot defined in the schema and value is an annotation of a defined type or category. One can add several types of events to one document. For every created event the user can add several slots, and for every slot one annotation can be selected.
5.10. Data Export
The document content, tokenization, sentence segmentation, annotations (syntactic chunks, proper names, WSD) and relations between annotations (syntactic relations between chunks, semantic relations between named entities, and anaphora) can be exported to an XML-based corpus format called CCL. The CCL format is based on XCES (Ide et al., 2000) with a few simple extensions that enable simple encoding of all the required annotation levels.
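A simplified sketch of what such a token-level XML export might look like, using Python's standard XML library; the element and attribute names below only approximate the real CCL format and should not be taken as its specification:

```python
import xml.etree.ElementTree as ET

# One sentence with two tokens; the second token carries a named-entity
# annotation on a (hypothetical) "person_nam" channel.
sentence = ET.Element("sentence")
for orth, is_person in [("pan", 0), ("Jan", 1)]:
    tok = ET.SubElement(sentence, "tok")
    ET.SubElement(tok, "orth").text = orth
    ET.SubElement(tok, "ann", chan="person_nam").text = str(is_person)

xml = ET.tostring(sentence, encoding="unicode")
```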
6. Applications
Inforex is being used to construct and annotate corpora within three ongoing projects:
NEKST3 — two corpora of Polish stock exchange reports (1215 documents) and economic news from Polish Wikinews (797 documents) annotated with named entities (Marcińczuk and Piasecki, 2011);
SyNaT4 — a Wrocław University of Technology Corpus (KPWr; Korpus Politechniki Wrocławskiej) containing samples of documents from various domains (blogs, science, stenographic recordings, dialogue, contemporary prose, etc.) annotated with named entities, semantic chunks, word senses, syntactic relations between chunks, semantic relations between named entities and anaphora relations (Broda et al., 2010a). At the moment of writing the corpus consists of more than 1300 documents;
3 Project home page:
4 Project home page:
PCSN5 — a Polish Corpus of Suicide Notes annotated with named entities and semantic and pragmatic information (Marcińczuk et al., 2011). At the moment of writing the corpus consists of 626 genuine suicide notes and 51 simulated suicide notes.
7. Access and License
Inforex is hosted at Wrocław University of Technology and is available at the following address. To test the major features of the application one can log in using the demo account (both user name and password are demo).
We plan to release the source code of Inforex under a free license as soon as the system has been tested enough and is relatively stable. The source code and further information will be posted on the Inforex web page.
8. Conclusion
Inforex is a web-based system for the semantic annotation of text corpora. The major functions of the system are already implemented and used in a couple of projects by several users. However, the system is still under development and new features are being added as required. The list of features to be implemented contains, for example, a perspective to fix the automatic sentence-level segmentation.
Work co-financed by the Innovative Economy Programme project POIG.01.01.02-14-013/09 and NCBiR NrU.: SP/I/1/77065/10.
9. References
Bartosz Broda, Michał Marcińczuk, Marek Maziarz, Adam Radziszewski, and Adam Wardyński. 2010a. WUTC: Towards a Free Corpus of Polish. In Proceedings of the Eighth Conference on International Language Resources and Evaluation (LREC'12), Istanbul, Turkey, May 23–25, 2012.
Bartosz Broda, Michał Marcińczuk, and Maciej Piasecki. 2010b. Building a Node of the Accessible Language Technology Infrastructure. In Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odjik, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May 19–21, 2010.
Bartosz Broda, Marek Maziarz, and Maciej Piasecki. 2010c. Evaluating LexCSD — a Weakly-Supervised Method on Improved Semantically Annotated Corpus in a Large Scale Experiment. In S. T. Wierzchoń, M. A. Kłopotek, A. Przepiórkowski and K. Trojanowski, editors, Proceedings of Intelligent Information Systems.
Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Niraj Aswani, Ian Roberts, Genevieve Gorrell, Adam Funk, Angus Roberts, Danica Damljanovic, Thomas Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, Yaoyong Li, and Wim Peters. 2011. Text Processing with GATE (Version 6).
5 Project home page:
Figure 8: Anaphora Annotator — perspective used to create and delete anaphora between named entities and words.
Nancy Ide, Patrice Bonhomme, and Laurent Romary. 2000. XCES: An XML-based Encoding Standard for Linguistic Corpora. In Proceedings of the Second International Language Resources and Evaluation Conference. Paris: European Language Resources Association.
Fairview Research LLC. 2010. GATE Teamware 1.3 User Guide. Technical report.
C. D. Manning and H. Schütze. 2001. Foundations of Statistical Natural Language Processing. The MIT Press.
Michał Marcińczuk and Maciej Piasecki. 2011. Statistical Proper Name Recognition in Polish Economic Texts. Control and Cybernetics, 40(2).
Michał Marcińczuk, Monika Zaśko-Zielińska, and Maciej Piasecki. 2011. Structure Annotation in the Polish Corpus of Suicide Notes. In Ivan Habernal and Vaclav Matousek, editors, Text, Speech and Dialogue, 14th International Conference, TSD 2011, volume 6836 of Lecture Notes in Computer Science, pages 419–426. Springer.
Michał Marcińczuk. 2010. Manufakturzysta 2.0 Luna. Dokumentacja techniczna. Technical report.
Agnieszka Mykowiecka, Katarzyna Głowińska, and Joanna Rabiega-Wiśniewska. 2010. Domain-related Annotation of Polish Spoken Dialogue Corpus LUNA.PL. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May. European Language Resources Association.
Maciej Piasecki, Michał Marcińczuk, Radosław Ramocki, and Marek Maziarz. 2011. WordnetLoom: a Wordnet Development System Integrating Form-based and Graph-based Perspectives. Int. J. of Data Mining, Modelling and Management. To appear.
Adam Przepiórkowski and Grzegorz Murzynowski. 2009. Manual annotation of the National Corpus of Polish with Anotatornia. In Stanisław Goźdź-Roszkowski, editor, The Proceedings of Practical Applications in Language and Computers PALC 2009, Frankfurt am Main. Peter Lang. Forthcoming.
Adam Radziszewski and Tomasz Śniatowski. 2011. Maca — a configurable tool to integrate Polish morphological data. In Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation, Barcelona, Spain. Universitat Oberta de Catalunya, January 20, 2011.
... The annotation process was carried out using the Inforex tool [41]. One of the features of Inforex, in which the multi-genre corpus was annotated, is a module for annotation reconciliation called the Agreement module. ...
Conference Paper
Aspect-based sentiment analysis (ABSA) is a text analysis method that categorizes data by aspects and identifies the sentiment assigned to each aspect. Aspect-based sentiment analysis can be used to analyze customer opinions by associating specific sentiments with different aspects of a product or service. Most of the work in this topic is thoroughly performed for English, but many low-resource languages still lack adequate annotated data to create automatic methods for the ABSA task. In this work, we present annotation guidelines for the ABSA task for Polish and preliminary annotation results in the form of the AspectEmo corpus, containing over 1.5k consumer reviews annotated with over 63k annotations. We present an agreement analysis on the resulting annotated corpus and preliminary results using transformer-based models trained on AspectEmo.
... Each text was manually annotated by two annotators: a psychologist and a linguist, who worked according to the general guidelines. The annotation tool used for this task was Inforex 10 (Marcińczuk et al., 2012; Oleksy, 2019) -a web-based system for text corpora management, annotation and analysis, available as an open source project. In the pilot project, we decided to deal with the sentiment annotation of the entire text. ...
Conference Paper
Full-text available
In this article we present an extended version of PolEmo – a corpus of consumer reviews from 4 domains: medicine, hotels, products and school. Current version (PolEmo 2.0) contains 8,216 reviews having 57,466 sentences. Each text and sentence was manually annotated with sentiment in 2+1 scheme, which gives a total of 197,046 annotations. We obtained a high value of Positive Specific Agreement, which is 0.91 for texts and 0.88 for sentences. PolEmo 2.0 is publicly available under a Creative Commons copyright license. We explored recent deep learning approaches for the recognition of sentiment, such as Bi-directional Long Short-Term Memory (BiLSTM) and Bidirectional Encoder Representations from Transformers (BERT).
... Commands are designed to be intuitive, e.g. arrow (Stenetorp et al., 2012) Classify, Link Yes apache Python, Javascript GUI GATE (Bontcheva et al., 2013) Classify, Link Yes -Java GUI YEDDA (Yang et al., 2018) Classify --Python GUI ANALEC (Landragin et al., 2012) Classify, Link --Java GUI Anafora (Chen and Styler, 2013) Classify, Link Yes -Python GUI CAT (Lenzi et al., 2012) Classify -apache Java GUI Chooser (Koeva et al., 2008) Classify, Link --C++, Python, Perl GUI CorA (Bollmann et al., 2014) Classify --PHP, JavaScript GUI Djangology (Apostolova et al., 2010) Classify Yes Django Python GUI eHost (South et al., 2012) Classify, Link Yes -Java GUI Glozz (Widlöcher and Mathet, 2012) Classify, Link Yes -Java GUI GraphAnno (Gast et al., 2015) Classify, Link --Ruby GUI Inforex (Marcinczuk et al., 2012) Classify, Link --JavaScript GUI Knowtator (Ogren, 2006) Classify, Link Yes Protégé Java GUI MAE and MAI (Stubbs, 2011) Classify, Link Yes -Java GUI MMAX2 (Müller and Strube, 2006) Classify, Link --Java GUI PALinkA (Orasan, 2003) Classify, Link --Java GUI SAWT (Samih et al., 2016) Classify --Python, PHP GUI SYNC3 (Petasis, 2012) Classify Yes Ellogon C GUI Stanford (Lee, 2004) Classify --Java GUI UAM (O'Donnell, 2008) Classify Yes -Java GUI WAT-SL (Kiesel et al., 2017) Classify Yes apache Java GUI WebAnno (Yimam et al., 2013) Classify, Link Yes -Java GUI WordFreak (Morton and LaCivita, 2003) Classify, Link Yes -Java GUI keys change which item is selected. Keybindings can be modified to suit user preferences and multikey commands can be defined, like the labels in Figure 1a (e.g. ...
Many annotation tools have been developed, covering a wide variety of tasks and providing features like user management, pre-processing, and automatic labeling. However, all of these tools use Graphical User Interfaces, and often require substantial effort to install and configure. This paper presents a new annotation tool that is designed to fill the niche of a lightweight interface for users with a terminal-based workflow. Slate supports annotation at different scales (spans of characters, tokens, and lines, or a document) and of different types (free text, labels, and links), with easily customisable keybindings, and unicode support. In a user study comparing with other tools it was consistently the easiest to install and use. Slate fills a need not met by existing systems, and has already been used to annotate two corpora, one of which involved over 250 hours of annotation effort.
... Inforex (Marcińczuk, Kocoń and Broda 2012) is a web-based system designed for managing and annotating text corpora at the semantic level including annotation of Named Entities, anaphora, Word Sense Disambiguation (WSD) and relations between named entities. The system also supports manual text clean-up and automatic text pre-processing including text segmentation, morphosyntactic analysis and word selection for WSD annotation. ...
A key challenge of the Information Extraction in Natural Language Processing is the ability to recognise and classify temporal expressions (timexes). It is a crucial source of information about when something happens, how often something occurs or how long something lasts. Timexes extracted automatically from text, play a major role in many Information Extraction systems, such as question answering or event recognition. We prepared a broad specification of Polish timexes ? PLIMEX. It is based on the state-of-the-art annotation guidelines for English, mainly TIMEX2 and TIMEX3 (a part of TimeML ? Markup Language for Temporal and Event Expressions). We have expanded our specification for a description of the local meaning of timexes, based on LTIMEX annotation guidelines for English. Temporal description supports further event identification and extends event description model, focussing on anchoring events in time, events ordering and reasoning about the persistence of events. We prepared the specification, which is designed to address these issues, and we annotated all documents in Polish Corpus of Wroclaw University of Technology (KPWr) using our annotation guidelines. We also adapted our Liner2 machine learning system to recognise Polish timexes and we propose two-phase method to select a subset of features for Conditional Random Fields sequence labelling method. This article presents the whole process of corpus annotation, evaluation of inter-annotator agreement, extending Liner2 system with new features and evaluation of the recognition models before and after feature selection with the analysis of statistical significance of differences. Liner2 with presented models is available as open source software under the GNU General Public License.
... We used the positive specific agreement (psa) (Hripcsak & Rothschild, 2005) as there are no negative decisions to count to measure the agreement between two linguists. The documents were annotated using the Annotator perspective from the Inforex system 8 (see Figure 4) (Marcińczuk, Kocoń, & Broda, 2012). In the first iteration we randomly selected 100 documents. ...
Towards an event annotated corpus of Polish. The paper presents a typology of events built on the basis of the TimeML specification adapted to the Polish language. Some changes were introduced to the definitions of the event categories and a motivation for the event categorisation was formulated. The event annotation task is presented on two levels – the ontology level (language independent) and text mentions (language dependent). The various types of event mentions in Polish texts are discussed. A procedure for the annotation of event mentions in Polish texts is presented and evaluated. In the evaluation, a randomly selected set of documents from the Corpus of Wrocław University of Technology (called KPWr) was annotated by two linguists and the inter-annotator agreement was calculated. The evaluation was done in two iterations. After the first iteration we revised and improved the annotation procedure. The second evaluation showed a significant improvement of the agreement between annotators. The current work focused on the annotation and categorisation of event mentions in text. Future work will focus on describing events with a set of attributes, arguments and relations.
Motivation: Annotation tools are applied to build training and test corpora, which are essential for the development and evaluation of new natural language processing algorithms. Further, annotation tools are also used to extract new information for a particular use case. However, owing to the high number of existing annotation tools, finding the one that best fits particular needs is a demanding task that requires searching the scientific literature followed by installing and trying various tools. Methods: We searched for annotation tools and selected a subset of them according to five requirements with which they should comply, such as being Web-based or supporting the definition of a schema. We installed the selected tools (when necessary), carried out hands-on experiments and evaluated them using 26 criteria that covered functional and technical aspects. We defined three levels of matches for each criterion and a score for the final evaluation of the tools. Results: We evaluated 78 tools and selected the following 15 for a detailed evaluation: BioQRator, brat, Catma, Djangology, ezTag, FLAT, LightTag, MAT, MyMiner, PDFAnno, prodigy, tagtog, TextAE, WAT-SL and WebAnno. Full compliance ranged from only 9 up to 20 of our 26 criteria, which demonstrated that some tools are comprehensive and mature enough to be used on most annotation projects. The highest score, 0.81 out of a maximum of 1.0, was obtained by WebAnno.
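The abstract does not give the exact scoring formula. One plausible reconstruction, assuming the three match levels map to weights 0 / 0.5 / 1 (an assumption, not the paper's definition) and the final score is the mean over all 26 criteria, is:

```python
# Hypothetical reconstruction of the tool-scoring scheme: each of the 26
# criteria is matched at one of three levels; the final score is the mean
# of the per-criterion weights, so it lies between 0.0 and 1.0.

LEVEL_WEIGHT = {"none": 0.0, "partial": 0.5, "full": 1.0}

def tool_score(levels):
    """levels: one match level per criterion for a single tool."""
    return sum(LEVEL_WEIGHT[l] for l in levels) / len(levels)

# A made-up tool fully meeting 20 criteria and partially meeting 2:
levels = ["full"] * 20 + ["partial"] * 2 + ["none"] * 4
print(round(tool_score(levels), 2))  # 0.81
```

Under this mapping, a tool with 20 fully met criteria and 2 partial matches would land at 0.81 out of 1.0, consistent in magnitude with the top result reported above.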
This article introduces the recognition and normalisation of temporal expressions for the Polish language. We describe what temporal information is and present the TimeML specification, adapted to Polish, as a model for the description of temporal expressions. Classes of temporal expressions are presented, as well as guidelines for their annotation and normalisation, and our approach to corpus annotation and temporal expression recognition. A key aspect of the work is the description of the features used for recognition and of the method for selecting and creating feature templates for the Conditional Random Fields model. We present the experiments and the conclusions drawn from them.
Semantic technologies provide meaning to information resources in the form of machine-accessible structured data. Research over the past two decades has commonly focused on tools and interfaces for technical experts, leading to various usability problems for users unfamiliar with the underlying technologies – so-called nontechnical experts. Existing approaches to semantic technologies mostly consider consumers of structured data and leave out the creation perspective. In this work, we focus on the usability of creating structured data from textual resources, especially the creation of relations between entities. The research was conducted in collaboration with scholars from the humanities. We review existing research on the usability of semantic technologies and the state of the art of annotation tools to identify shortcomings. Subsequently, we use the knowledge gained to propose a new interaction design for creating relations between entities, producing structured data in the subject-predicate-object form. We implemented our interaction design and conducted a user study, which showed that the proposal performed well, contributing to enhanced overall usability in this field. This research also illustrates how technically sophisticated technology needs to be "translated" to make it usable for nontechnical experts. In future work we need to extend this perspective by providing more insight into the internal functioning of semantic technologies.
This chapter describes our recent attempts to explore a methodology for evaluating annotated corpora through analysing annotator behaviour during annotation. We first describe the details of an experiment for collecting annotator behaviour while annotating predicate-argument relations in Japanese texts. In this experiment, every annotation tool operation and annotator eye gaze were collected from three annotators. We discuss the relationship between the collected data and the annotation agreement between multiple annotators, distinguishing two types of disagreement: explicit annotation disagreement (EAD) and missing annotation disagreement (MAD). We further report preliminary results of an attempt at detecting missing annotation disagreement by analysing the collected data. The chapter concludes with some remarks on future research directions.
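The EAD/MAD distinction above can be sketched with a simple set comparison: a span both annotators marked but labelled differently is an explicit disagreement, while a span only one annotator marked is a missing-annotation disagreement. The span-to-label representation below is an assumption for illustration, not the chapter's data model:

```python
# Illustrative sketch: annotations are dicts mapping a (start, end) span
# to a label. EAD = same span, different labels; MAD = span marked by
# exactly one of the two annotators.

def disagreements(ann_a, ann_b):
    ead = {s for s in ann_a.keys() & ann_b.keys() if ann_a[s] != ann_b[s]}
    mad = ann_a.keys() ^ ann_b.keys()  # symmetric difference of spans
    return ead, mad

a = {(0, 4): "AGENT", (10, 15): "THEME", (20, 25): "AGENT"}
b = {(0, 4): "AGENT", (10, 15): "AGENT", (30, 35): "THEME"}
ead, mad = disagreements(a, b)
print(sorted(ead))  # [(10, 15)]
print(sorted(mad))  # [(20, 25), (30, 35)]
```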
In the paper we present a Proper Name Recognition algorithm based on the Hidden Markov Model (HMM). Recognition of Proper Names (PNs) is treated as the basis for the Named Entity Recognition problem in general. The proposed method combines a domain-dependent approach based on HMM with domain-independent methods based on gazetteers and hand-written rules for recognition and post-processing that capture the general properties of Polish PN structure. A large gazetteer with morphologically described entries was acquired from the web. An HMM re-scoring mechanism was applied as the basis for integrating different knowledge sources in PN recognition. Results of experiments on a domain corpus of Polish stock exchange reports, used for training and testing, are presented, together with a cross-domain evaluation on two other corpora. The adaptability of the method was analysed by applying the trained model to the two other domain corpora.
This paper presents our efforts aimed at collecting and annotating a free Polish corpus. The corpus will serve for us as training and testing material for experiments with Machine Learning algorithms. As others may also benefit from the resource, we are going to release it under a Creative Commons licence, which is hoped to remove unnecessary usage restrictions, but also to facilitate reproduction of our experimental results. The corpus is being annotated with various types of linguistic entities: chunks and named entities, selected syntactic and semantic relations, word senses and anaphora. We report on the current state of the project as well as our ultimate goals.
The paper presents WordNetLoom – an application for WordNet development used in the construction of a Polish WordNet called plWordNet. WordNetLoom provides two means of interaction: a form-based one, implemented initially, and a graph-based one, introduced recently. The graphical, active presentation of the WordNet structure enables direct work on the structure of synsets and lexico-semantic relations. In the paper, both means of interaction are compared and the results of a usability evaluation performed on a group of experienced WordNetLoom users are presented. Directions for the application's development were identified. A new version of WordNetWeaver – a tool supporting semi-automated WordNet expansion – is also presented. The new version is based on a user interface similar to WordNetLoom, utilises all types of WordNet relations and is embedded in WordNetLoom. The paper also discusses the role of the application in WordNet development and the extent to which it can be used for other WordNets. A set of Web-based tools supporting teamwork coordination and verification is presented as well.
The Polish Corpus of Suicide Notes (henceforth PCSN) is constructed to meet the needs of forensic linguistics. Suicide notes are messages created in a borderline situation, shortly before death. Hence the annotation schema requires a complex description of the document structure and the textual content, as well as its linguistic properties. TEI was selected as the basis for the document encoding schema. The TEI adaptation and its extension with respect to such aspects of encoding as letter structure, various layers of changes and omissions, error correction, and extra-linguistic elements are discussed with examples. Keywords: forensic linguistics, suicide note, structure annotation
In this paper we present a corpus of Polish spoken dialogues annotated on several levels, from the transcription of dialogues and their morphosyntactic analysis to semantic annotation. The corpus is one of the results of the LUNA project. The description concentrates on the semantic annotation at the levels of concepts (attribute-value pairs) and predicates (frame sets).
The Corpus Encoding Standard (CES) is a part of the EAGLES Guidelines developed by the Expert Advisory Group on Language Engineering Standards (EAGLES) that provides a set of encoding standards for corpus-based work in natural language processing applications. We have instantiated the CES as an XML application called XCES, based on the same data architecture comprising a primary encoded text and "standoff" annotation in separate documents. Conversion to XML enables the use of some of the more powerful mechanisms provided in the XML framework, including the XSLT transformation language, XML Schemas, and support for inter-resource references together with an extensive path syntax for pointers. In this paper, we describe the differences between the CES and XCES DTDs and demonstrate how XML mechanisms can be used to select from and manipulate annotated corpora encoded according to the XCES specifications. We also provide a general overview of XML and the XML mechanisms that are most relevant to language engineering research and applications.