Article

Building an NLP pipeline within a digital publishing workflow


Abstract

Outside the laboratory environment, NLP tool developers have always been obliged to use robust techniques in order to clean and streamline the ubiquitous formats of authentic texts. In most cases, the cleaned version simply consisted of the bare text, stripped of all typographical information and tokenised in such a way that even the reconstruction of a simple sentence resulted in a displeasing layout. In order to integrate the NLP output within the production workflow of digital publications, it is necessary to keep track of the original layout. In this paper, we present an example of an NLP pipeline developed to meet the requirements of real-world applications of digital publications. The NLP pipeline presented here was developed within the framework of the iRead+ project, a cooperative research project between several industrial and academic partners in Flanders. The pipeline aims to enable automatic enrichment of texts with word-specific and contextual information in order to create an enhanced reading experience on tablets and to support automatic generation of grammatical exercises. The enriched documents contain both linguistic annotations (part-of-speech and lemmata) and semantic annotations based on the recognition and disambiguation of named entities. The whole enrichment process, provided via a web service, can be integrated into an XML-based production flow. The input of the NLP enrichment engine consists of two documents: a well-formed XML source file and a control file containing XPath expressions describing the nodes in the source file to be annotated and enriched. As nodes may contain a pre-defined set of mixed data, the original document (with selected enrichments) can be reconstructed.
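As a rough illustration of this two-document interface (not the iRead+ implementation: the control-file layout with <select xpath="..."/> entries, the <w pos lemma> output elements and the stand-in tagger are assumptions made for the example), the sketch below uses lxml to apply XPath expressions from a control file and wrap the tokens of the selected nodes in annotated elements, leaving all other mark-up untouched.

# Illustrative sketch of XPath-driven enrichment of selected nodes in an XML source.
from lxml import etree

SOURCE = """<book><chapter><p>Reading on tablets is fun</p><p skip="yes">Colofon</p></chapter></book>"""
CONTROL = """<control><select xpath="//p[not(@skip)]"/></control>"""   # hypothetical control-file layout

def enrich_node(node, tag_fn):
    # Only the node's leading text is tokenised here; existing child elements
    # stay untouched (a full implementation would also process their tails).
    tokens = (node.text or "").split()
    node.text = None
    for tok in tokens:
        pos, lemma = tag_fn(tok)
        w = etree.SubElement(node, "w", pos=pos, lemma=lemma)
        w.text = tok
        w.tail = " "

def enrich(source, control, tag_fn):
    doc = etree.fromstring(source)
    for select in etree.fromstring(control).iter("select"):
        for node in doc.xpath(select.get("xpath")):
            enrich_node(node, tag_fn)
    return doc

enriched = enrich(SOURCE, CONTROL, lambda t: ("NOUN", t.lower()))   # stand-in tagger
print(etree.tostring(enriched, pretty_print=True).decode())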


Article
In recent years, thematic route planning has been gaining popularity in recreational navigation. A growing number of people are starting to use route-planning services to prepare, ride, explore, and log their activities, with a particular focus on where they want to ride and what they want to see. In the context of cultural heritage, however, route planners still suffer from a lack of data and of route weighting/scoring mechanisms to achieve end-user satisfaction. In this article, we take advantage of the mobile sensing and geotagging (r)evolution to tackle both issues and propose a novel framework for cultural heritage routing on top of RouteYou's existing recreational navigation platform. Our first improvement focuses on the automatic collection and multimodal enrichment of thematic cultural heritage points of interest. Second, we introduce a weighting procedure for these points of interest and analyze their (meta)data quality and spatial coverage in our route databases. Finally, we present a novel routing algorithm targeted at cultural heritage exploration. Experimental results show that the proposed framework improves cultural heritage POI coverage and quality with respect to traditional recreational navigation routing algorithms. Furthermore, the proposed framework can easily be used in other thematic routing applications thanks to its generic architecture, making it a widely applicable approach.
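A hedged sketch of what a POI weighting and route-scoring step could look like (the weighting formula, field names and buffer distance are invented for illustration and are not RouteYou's actual mechanism):

from math import radians, sin, cos, asin, sqrt

def haversine_km(p, q):
    # Great-circle distance in km between two (lat, lon) points.
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def poi_weight(poi):
    # Invented weighting: richer metadata and more geotagged photos -> higher weight.
    return 0.5 * poi.get("metadata_completeness", 0) + 0.5 * min(poi.get("photo_count", 0) / 100, 1)

def route_score(waypoints, pois, buffer_km=0.5):
    # A route scores the summed weight of every POI within buffer_km of some waypoint.
    score = 0.0
    for poi in pois:
        if any(haversine_km(wp, poi["location"]) <= buffer_km for wp in waypoints):
            score += poi_weight(poi)
    return score

pois = [{"location": (51.054, 3.722), "metadata_completeness": 0.8, "photo_count": 40}]
print(route_score([(51.053, 3.721), (51.060, 3.740)], pois))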
Article
This paper presents the LeTs Preprocess Toolkit, a suite of robust, high-performance preprocessing modules including Part-of-Speech Taggers, Lemmatizers and Named Entity Recognizers. The currently supported languages are Dutch, English, French and German. We give a detailed description of the architecture of the LeTs Preprocess pipeline and describe the data and methods used to train each component. Ten-fold cross-validation results are also presented. To assess the performance of each module on different domains, we collected real-world textual data from companies covering various domains (among others automotive, dredging and human resources) for all four supported languages. For this multi-domain corpus, a manually verified gold standard was created for each of the three preprocessing steps. We present the performance of our preprocessing components on this corpus and compare it to the performance of other existing tools.
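A schematic of the kind of tagger-lemmatizer-recognizer pipeline described above might look as follows (stand-in components only; this is not the LeTs Preprocess API):

# Toy preprocessing pipeline: tokenize -> POS-tag -> lemmatize -> NER.
def tokenize(text):
    return text.split()

def pos_tag(tokens):
    return [(t, "NOUN") for t in tokens]                 # stand-in tagger

def lemmatize(tagged):
    return [(t, pos, t.lower()) for t, pos in tagged]    # stand-in lemmatizer

def ner(lemmatized):
    # Stand-in recognizer: capitalised tokens become single-token entities.
    return [(t, pos, lem, "B-MISC" if t[0].isupper() else "O")
            for t, pos, lem in lemmatized]

def preprocess(text, language="nl"):
    # A real toolkit would dispatch to language-specific models (nl/en/fr/de);
    # here the language argument is only carried along for illustration.
    return ner(lemmatize(pos_tag(tokenize(text))))

print(preprocess("Gent ligt in Vlaanderen"))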
Chapter
This chapter presents the Dutch Parallel Corpus (DPC), a 10-million-word, high-quality, sentence-aligned parallel corpus for the language pairs Dutch-English and Dutch-French. The corpus contains five different text types and is balanced with respect to text type and translation direction. Rich metadata information is stored for each text sample. All texts included in the corpus have been cleared of copyright. The entire corpus is aligned at sentence level and enriched with linguistic annotations. Twenty-five thousand words of the Dutch-English part have been manually aligned at the sub-sentential level. The corpus is released as full texts in XML format and can also be queried via a web concordancer.
Article
This paper describes the creation of a fine-grained named entity annotation scheme and corpus for Dutch, and experiments on automatic main type and subtype named entity recognition. We give an overview of existing named entity annotation schemes, and motivate our own, which describes six main types (persons, organizations, locations, products, events and miscellaneous named entities) and finer-grained information on subtypes and metonymic usage. This was applied to a one-million-word subset of the Dutch SoNaR reference corpus. The classifier for main type named entities achieves a micro-averaged F-score of 84.91%, and is publicly available, along with the corpus and annotations.
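For readers unfamiliar with the metric, micro-averaging pools true positives, false positives and false negatives over all entity types before computing precision, recall and F1 once; a minimal illustration with made-up counts:

from __future__ import annotations

# Micro-averaged F1: pool TP/FP/FN over all entity types, then compute
# precision, recall and F1 on the pooled counts.
def micro_f1(counts):
    """counts: {entity_type: (tp, fp, fn)}"""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(micro_f1({"PER": (90, 10, 12), "LOC": (70, 8, 15), "ORG": (40, 12, 20)}))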
Article
This paper proposes a minor but significant modification to the TEI ODD language and explores some of its implications. Can we improve on the present compromise whereby TEI content models are expressed in RELAX NG? A very small set of additional elements would permit the ODD language to cut its ties with any existing schema language, and thus permit it to support exactly and only the subset or intersection of their facilities which makes sense in the TEI context. It would make the ODD language an integrated and independent whole rather than an uneasy hybrid, and pave the way for future developments in the management of structured text beyond the XML paradigm.
Article
In this paper we present FoLiA, a Format for Linguistic Annotation, and conduct a comparative study with other annotation schemes, including the Linguistic Annotation Framework (LAF), the Text Encoding Initiative (TEI) and Text Corpus Format (TCF). An additional point of focus is the interoperability between FoLiA and metadata standards such as the Component MetaData Infrastructure (CMDI), as well as data category registries such as ISOcat. The aim of the paper is to present a clear image of the capabilities of FoLiA and how it relates to other formats. This should open discussion and aid users in their decision for a particular format. FoLiA is a practically-oriented XML-based annotation format for the representation of language resources, explicitly supporting a wide variety of annotation types. It introduces a flexible and uniform paradigm and a representation independent of language or label set. It is designed to be highly expressive, generic, and formalised, whilst at the same time focussing on being as practical as possible to ease its adoption and implementation. The aspiration is to offer a generic format for storage, exchange, and machine-processing of linguistically annotated documents, preventing users as well as software tools from having to cope with a wide variety of different formats, which in the field regularly causes convertibility issues and proliferation of ad-hoc formats. FoLiA emerged from such a practical need in the context of Computational Linguistics in the Netherlands and Flanders. It has been successfully adopted by numerous projects within this community. FoLiA was developed in a bottom-up fashion, with special emphasis on software libraries and tools to handle it.
Article
This paper investigates how best to couple hand-annotated data with information extracted from an external lexical resource to improve part-of-speech tagging performance. Focusing mostly on French tagging, we introduce a maximum entropy Markov model-based tagging system that is enriched with information extracted from a morphological resource. This system achieves 97.75% accuracy on the French Treebank, an error reduction of 25% (38% on unknown words) over the same tagger without lexical information. We perform a series of experiments that help explain how this lexical information improves tagging accuracy. We also conduct experiments on datasets and lexicons of varying sizes in order to assess the best trade-off between annotating data and developing a lexicon. We find that the use of a lexicon improves the quality of the tagger at any stage of development of either resource, and that for fixed performance levels the availability of the full lexicon consistently reduces the need for supervised data by at least one half.
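The core idea of coupling a tagger with an external lexicon can be illustrated by adding the lexicon's ambiguity class for each word to an otherwise standard feature vector; the feature names and toy lexicon below are illustrative, not the authors' feature set:

# Sketch of lexicon-enriched features for a maxent-style tagger: alongside the
# usual word-shape and context features, each token receives the set of tags
# the external morphological lexicon allows for it.
LEXICON = {"porte": {"NC", "V"}, "la": {"DET", "PRO"}, "ferme": {"NC", "ADJ", "V"}}

def features(tokens, i):
    w = tokens[i]
    feats = {
        "word=" + w.lower(): 1,
        "suffix3=" + w.lower()[-3:]: 1,
        "is_capitalized": int(w[0].isupper()),
        "prev=" + (tokens[i - 1].lower() if i > 0 else "<s>"): 1,
    }
    # Lexicon-derived features: the ambiguity class of the current word.
    for tag in sorted(LEXICON.get(w.lower(), {"UNK"})):
        feats["lex=" + tag] = 1
    return feats

print(features("la porte ferme mal".split(), 1))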
Article
This paper presents a distributed platform for Natural Language Processing called PyPLN. PyPLN leverages a vast array of open source NLP and text processing tools, managing the distribution of the workload on a variety of configurations: from a single server to a cluster of Linux servers. PyPLN is developed using Python 2.7.3 but makes it easy to incorporate other software for specific tasks as long as a Linux version is available. PyPLN facilitates analyses at both document and corpus level, simplifying management and publication of corpora and analytical results through an easy-to-use web interface. The current (beta) release supports English and Portuguese, with support for other languages planned for future releases. To support Portuguese, PyPLN uses the PALAVRAS parser (Bick, 2000). Currently PyPLN offers the following features: text extraction with encoding normalization (to UTF-8), part-of-speech tagging, token frequency, semantic annotation, n-gram extraction, word and sentence repertoire, and full-text search across corpora. The platform is licensed under GPL-v3.
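A much simpler stand-in for the idea of farming out per-document analyses to parallel workers (illustrative only; this is not PyPLN's architecture or API):

# Toy distribution of per-document NLP jobs over local worker processes.
from multiprocessing import Pool
from collections import Counter

def analyse(doc):
    tokens = doc.lower().split()
    return {"n_tokens": len(tokens), "freq": Counter(tokens).most_common(3)}

if __name__ == "__main__":
    corpus = ["the cat sat on the mat", "a distributed platform for NLP"]
    with Pool(processes=2) as pool:
        for result in pool.map(analyse, corpus):
            print(result)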
Article
The TEI has served for many years as a mature annotation format for corpora of different types, including linguistically annotated data. Although it is based on the consensus of a large community, it does not have the legal status of a standard. During the last decade, efforts have been undertaken to develop definitive de jure standards for linguistic data that not only act as a normative basis for the exchange of language corpora but also address recent advancements in technology, such as web-based standards and the use of large and multiply annotated corpora. In this article we provide an overview of the process of international standardization and discuss some of the international standards currently being developed under the auspices of ISO/TC 37, the technical committee "Terminology and other Language and Content Resources". We then discuss the relationship between the TEI Guidelines and these specifications with respect to their formal model, notation format, and annotation model. The paper concludes with recommendations for dealing with language corpora.
Chapter
The paper presents the main ideas and the architecture of the open source PSI-Toolkit, a set of linguistic tools being developed within a project financed by the Polish Ministry of Science and Higher Education. The toolkit is intended for experienced language engineers as well as casual users without any technological background. The former group is provided with a set of libraries that may be included in their Perl, Python or Java applications. The needs of the latter group should be satisfied by a user-friendly web interface. The main feature of the toolkit is its data structure, the so-called PSI-lattice, which assembles annotations delivered by all PSI tools. This cohesive architecture allows the user to invoke a series of processes with one command. The command has the form of a pipeline of instructions resembling the shell command pipelines known from Linux-based systems.
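A minimal sketch of a lattice-like structure in which several tools deposit annotations as edges over character spans of the same text (the representation is invented for illustration and is not the actual PSI-lattice):

# Every tool adds (start, end, layer, label) edges over the same underlying text.
class Lattice:
    def __init__(self, text):
        self.text = text
        self.edges = []          # (start, end, layer, label)

    def add(self, start, end, layer, label):
        self.edges.append((start, end, layer, label))

    def layer(self, name):
        return [e for e in self.edges if e[2] == name]

lat = Lattice("Ala ma kota")
for start, end in [(0, 3), (4, 6), (7, 11)]:          # tokenizer output
    lat.add(start, end, "token", lat.text[start:end])
lat.add(0, 3, "lemma", "Ala")                          # lemmatizer output
print(lat.layer("token"))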
Article
We describe LT TTT, a recently developed software system which provides tools to perform text tokenisation and mark-up. The system includes ready-made components to segment text into paragraphs, sentences, words and other kinds of tokens but, crucially, it also allows users to tailor rule-sets to produce mark-up appropriate for particular applications. We present three case studies of our use of LT TTT: named-entity recognition (MUC-7), citation recognition and mark-up, and the preparation of a corpus in the medical domain. We conclude with a discussion of the use of browsers to visualise marked-up text.
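The tailorable rule-set idea can be illustrated with a few regular-expression rules that wrap matches in XML mark-up; the patterns and element names below are made up and are not LT TTT's own rules:

# Rule-based tokenisation with XML mark-up: rules are tried in order at each position.
import re
from xml.sax.saxutils import escape

RULES = [
    ("citation", re.compile(r"\([A-Z][a-z]+,? \d{4}\)")),
    ("number",   re.compile(r"\d+(?:\.\d+)?")),
    ("word",     re.compile(r"[A-Za-z]+")),
]

def mark_up(text):
    out, i = [], 0
    while i < len(text):
        for name, pattern in RULES:
            m = pattern.match(text, i)
            if m:
                out.append("<%s>%s</%s>" % (name, escape(m.group()), name))
                i = m.end()
                break
        else:
            out.append(escape(text[i]))   # punctuation and whitespace pass through
            i += 1
    return "".join(out)

print(mark_up("Tokenisation was evaluated on MUC-7 (Chinchor, 1998)."))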
Chapter
Memory-based language processing, a machine learning and problem-solving method for language technology, is based on the idea that the direct re-use of examples using analogical reasoning is better suited to solving language processing problems than the application of rules extracted from those examples. This book discusses the theory and practice of memory-based language processing, showing its comparative strengths over alternative methods of language modelling. Language is complex, with few generalizations, many sub-regularities and exceptions, and the advantage of memory-based language processing is that it does not abstract away from this valuable low-frequency information.
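A tiny nearest-neighbour classifier shows the underlying idea of re-using stored examples by analogy (the toy task, features and overlap metric are illustrative, not a specific TiMBL configuration):

# Memory-based classification: store training examples verbatim and label new
# instances by the majority class of their most similar neighbours.
from collections import Counter

def overlap(a, b):
    return sum(x == y for x, y in zip(a, b))

def classify(instance, memory, k=3):
    neighbours = sorted(memory, key=lambda ex: overlap(instance, ex[0]), reverse=True)[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Toy task: predict the POS of a word from (previous POS, suffix, capitalised).
memory = [
    (("DET", "ing", False), "N"),
    (("DET", "ers", False), "N"),
    (("PRO", "ing", False), "V"),
    (("PRO", "eat", False), "V"),
]
print(classify(("DET", "ing", False), memory))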
Conference Paper
The explicit introduction of morphosyntactic information into statistical machine translation approaches is receiving considerable attention. The currently freely available part-of-speech (POS) taggers for French are based on a limited tagset which does not account for some inflectional particularities. Moreover, there is a lack of a unified framework for training and evaluating this kind of linguistic resource. Therefore, in this paper, three standard POS taggers (TreeTagger, Brill's tagger and the standard HMM POS tagger) are trained and evaluated under the same conditions on the French MULTITAG corpus. This POS-tagged corpus provides a richer tagset than the usual ones, including gender and number distinctions, for example. Experimental results show significant differences in performance between the taggers. According to the tagging accuracy estimated with a tagset of 300 items, the taggers may be ranked as follows: TreeTagger (95.7%), Brill's tagger (94.6%), HMM tagger (93.4%). Examples of translation outputs illustrate how considering gender and number distinctions in the POS tagset can be relevant.
Article
This paper describes a simple, efficient algorithm to locate all occurrences of any of a finite number of keywords in a string of text. The algorithm consists of constructing a finite state pattern matching machine from the keywords and then using the pattern matching machine to process the text string in a single pass. Construction of the pattern matching machine takes time proportional to the sum of the lengths of the keywords. The number of state transitions made by the pattern matching machine in processing the text string is independent of the number of keywords. The algorithm has been used to improve the speed of a library bibliographic search program by a factor of 5 to 10.
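A compact sketch of the algorithm described above: build a keyword trie, add failure links by breadth-first search, then report every occurrence in a single pass over the text.

# Aho-Corasick pattern matching: trie + failure links + single-pass search.
from collections import deque

def build(keywords):
    trie = [{"next": {}, "fail": 0, "out": []}]
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in trie[state]["next"]:
                trie.append({"next": {}, "fail": 0, "out": []})
                trie[state]["next"][ch] = len(trie) - 1
            state = trie[state]["next"][ch]
        trie[state]["out"].append(word)
    queue = deque(trie[0]["next"].values())
    while queue:                                  # BFS to set failure links
        s = queue.popleft()
        for ch, t in trie[s]["next"].items():
            queue.append(t)
            f = trie[s]["fail"]
            while f and ch not in trie[f]["next"]:
                f = trie[f]["fail"]
            trie[t]["fail"] = trie[f]["next"].get(ch, 0)
            trie[t]["out"] += trie[trie[t]["fail"]]["out"]
    return trie

def search(text, trie):
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in trie[state]["next"]:
            state = trie[state]["fail"]
        state = trie[state]["next"].get(ch, 0)
        for word in trie[state]["out"]:
            hits.append((i - len(word) + 1, word))
    return hits

trie = build(["he", "she", "his", "hers"])
print(search("ushers", trie))   # [(1, 'she'), (2, 'he'), (2, 'hers')]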
Conference Paper
Most current statistical natural language processing models use only local features so as to permit dynamic programming in inference, but this makes them unable to fully account for the long distance structure that is prevalent in language use. We show how to solve this dilemma with Gibbs sampling, a simple Monte Carlo method used to perform approximate inference in factored probabilistic models. By using simulated annealing in place of Viterbi decoding in sequence models such as HMMs, CMMs, and CRFs, it is possible to incorporate non-local structure while preserving tractable inference. We use this technique to augment an existing CRF-based information extraction system with long-distance dependency models, enforcing label consistency and extraction template consistency constraints. This technique results in an error reduction of up to 9% over state-of-the-art systems on two established information extraction tasks.
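A toy version of the sampling idea: resample one label at a time from a distribution proportional to a local score, lowering the temperature between sweeps (the scoring function and schedule are made up for the sketch; this is not the authors' CRF system):

# Gibbs sampling over a label sequence with simulated annealing.
import math, random

LABELS = ["O", "ENT"]

def local_score(tokens, labels, i, label):
    # Toy evidence: capitalised tokens prefer ENT, lower-case tokens prefer O.
    score = 2.0 if (tokens[i][0].isupper()) == (label == "ENT") else 0.0
    # Crude label-consistency term: repeated tokens prefer repeated labels.
    if i > 0 and tokens[i - 1] == tokens[i] and labels[i - 1] == label:
        score += 1.0
    return score

def gibbs(tokens, sweeps=50, temperature=1.0, cooling=0.95):
    labels = [random.choice(LABELS) for _ in tokens]
    for _ in range(sweeps):
        for i in range(len(tokens)):
            weights = [math.exp(local_score(tokens, labels, i, l) / temperature) for l in LABELS]
            labels[i] = random.choices(LABELS, weights=weights)[0]
        temperature *= cooling     # anneal towards the mode
    return labels

random.seed(0)
print(gibbs("The iRead project enriches Flanders texts".split()))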
Article
We show that we can acquire satisfactory parsing results for French from data induced from the French Treebank using an unlexicalised parsing algorithm.
Article
We show that we can acquire satisfactory parsing results for French from data induced from the French Treebank using an unlexicalised parsing algorithm that learns a probabilistic context-free grammar with latent annotations. We investigate various instantiations of the treebank in order to improve the performance of the learnt parser.
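Unlexicalised PCFG parsing can be illustrated with a minimal probabilistic CKY recogniser over a toy grammar in Chomsky normal form (rules and probabilities invented; the latent-annotation refinement described in the paper is not modelled):

# Probabilistic CKY over a toy CNF grammar; returns the best S probability.
TERMINALS = {  # A -> word
    ("D", "the"): 0.6, ("D", "a"): 0.4, ("N", "dog"): 0.5, ("N", "bone"): 0.5, ("V", "eats"): 1.0,
}
BINARY = {  # A -> B C
    ("S", "NP", "VP"): 1.0, ("NP", "D", "N"): 1.0, ("VP", "V", "NP"): 1.0,
}

def cky(words):
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for (a, word), p in TERMINALS.items():
            if word == w:
                chart[i][i + 1][a] = max(chart[i][i + 1].get(a, 0.0), p)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            k = i + span
            for j in range(i + 1, k):
                for (a, b, c), p in BINARY.items():
                    if b in chart[i][j] and c in chart[j][k]:
                        cand = p * chart[i][j][b] * chart[j][k][c]
                        if cand > chart[i][k].get(a, 0.0):
                            chart[i][k][a] = cand
    return chart[0][n].get("S", 0.0)

print(cky("the dog eats a bone".split()))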
Article
In this paper, we show how the paradigm of evaluation can function as a producer of high-quality, low-cost validated language resources. First the paradigm of evaluation is presented and the main points of its history are recalled, from the first deployments in the USA during the DARPA/NIST evaluation campaigns up to the latest efforts in Europe (SENSEVAL2/ROMANSEVAL2, CLEF, CLASS, etc.). Then the principle behind the method used to produce high-quality validated language resources at low cost from the by-products of an evaluation campaign is exposed. It was inspired by the ROVER experiments (Recognizer Output Voting Error Reduction) performed during speech recognition evaluation campaigns in the USA and consists of combining the outputs of the participating systems with a simple voting strategy to obtain higher performance results. Here we make a link with existing strategies for system combination studied in machine learning. As an illustration we describe how the MU...
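The voting strategy can be illustrated in a few lines: given several systems' outputs over identically tokenised text, keep the majority label per token (alignment of diverging tokenisations, which real ROVER-style combination must handle, is ignored here):

# Majority-vote combination of several taggers' output sequences.
from collections import Counter

def combine(system_outputs):
    combined = []
    for labels in zip(*system_outputs):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined

tagger_a = ["DET", "NOUN", "VERB", "ADV"]
tagger_b = ["DET", "NOUN", "NOUN", "ADV"]
tagger_c = ["DET", "ADJ",  "VERB", "ADV"]
print(combine([tagger_a, tagger_b, tagger_c]))  # ['DET', 'NOUN', 'VERB', 'ADV']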
Article
In this paper we discuss a rule-based approach to chunking implemented using the LT-XML2 and LT-TTT2 tools. We describe the tools and the pipeline and grammars that have been developed for the task of chunking. We show that our rule-based approach is easy to adapt to different chunking styles and that the mark-up of further linguistic information such as nominal and verbal heads can be added to the rules at little extra cost. We evaluate our chunker against the CoNLL 2000 data and discuss discrepancies between our output and the CoNLL mark-up as well as discrepancies within the CoNLL data itself. We contrast our results with the higher scores obtained using machine learning and argue that the portability and flexibility of our approach still make it a more practical solution.
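A minimal example of the rule-based chunking style: group POS-tagged tokens into noun-phrase chunks with one hand-written pattern and mark the head noun (the pattern is illustrative, not the LT-TTT2 grammar used in the paper):

# Tiny rule-based NP chunker: an NP is an optional determiner, any adjectives,
# then one or more nouns; the last noun is marked as the head.
def chunk_np(tagged):
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if j < len(tagged) and tagged[j][1] == "DET":
            j += 1
        while j < len(tagged) and tagged[j][1] == "ADJ":
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] == "NOUN":
            k += 1
        if k > j:   # at least one noun: emit an NP whose head is the last noun
            chunks.append({"span": tagged[i:k], "head": tagged[k - 1][0]})
            i = k
        else:
            i += 1
    return chunks

sent = [("the", "DET"), ("old", "ADJ"), ("reading", "NOUN"), ("device", "NOUN"), ("works", "VERB")]
print(chunk_np(sent))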
Article
This paper presents a couple of extensions to a basic Markov Model tagger (called TreeTagger) which improve its accuracy when trained on small corpora. The basic tagger was originally developed for English [Schmid, 1994]. The extensions together reduced error rates on a German test corpus by more than a third.
Article
The GRACE evaluation program aims at applying the Evaluation Paradigm to the evaluation of part-of-speech taggers for French. An interesting by-product of GRACE is the production of the validated language resources necessary for the evaluation. After a brief recall of the origins and the nature of the Evaluation Paradigm, we show how it relates to other national and international initiatives. We then present the now-ending GRACE evaluation campaign and describe its four main components (corpus building, tagging procedure, lexicon building, evaluation procedure), as well as its internal organization. 1. The Evaluation Paradigm: The Evaluation Paradigm has been proposed as a means to foster development in research and technology in the field of language engineering. Up to now, it has been used mostly in the United States in the framework of the ARPA and NIST projects on automatic processing of spoken and written language. The paradigm is based on a two-step process: first, create textual...
Nasr, Alexis, Frédéric Béchet, Jean-François Rey, Benoît Favre, and Joseph Le Roux (2011), MACAON: An NLP tool suite for processing word lattices, ACL (System Demonstrations), The Association for Computational Linguistics, pp. 86–91.
Allauzen, Alexandre and Hélène Bonneau-Maynard (2008), Training and evaluation of POS taggers on the French MULTITAG corpus, LREC, European Language Resources Association.
Grover, Claire, Colin Matheson, Andrei Mikheev, and Marc Moens (2000), LT TTT - a flexible tokenisation tool, Proceedings of the Second International Conference on Language Resources and Evaluation, pp. 1147–1154.