Roman Yangarber

University of Helsinki | HY · Department of Computer Science

About

99
Publications
17,617
Reads
2,374
Citations

Publications (99)
Preprint
Full-text available
Assessment of proficiency of the learner is an essential part of Intelligent Tutoring Systems (ITS). We use Item Response Theory (IRT) in computer-aided language learning for assessment of student ability in two contexts: in test sessions, and in exercises during practice sessions. Exhaustive testing across a wide range of skills can provide a deta...
Preprint
Full-text available
We investigate how pretrained language models (PLM) encode the grammatical category of verbal aspect in Russian. Encoding of aspect in transformer LMs has not been studied previously in any language. A particular challenge is posed by "alternative contexts": where either the perfective or the imperfective aspect is suitable grammatically and semant...
Preprint
Full-text available
Language modeling is a fundamental task in natural language processing, which has been thoroughly explored with various architectures and hyperparameters. However, few studies focus on the effect of sub-word segmentation on the performance of language models (LMs). In this paper, we compare GPT and BERT models trained with the statistical segmentat...
Preprint
This paper presents the development of an AI-based language learning platform Revita. It is a freely available intelligent online tutor, developed to support learners of multiple languages, from low-intermediate to advanced levels. It has been in pilot use by hundreds of students at several universities, whose feedback and needs are shaping the dev...
Conference Paper
Full-text available
We present experiments on assessing the grammatical correctness of learner answers in the Revita language-learning platform. In particular, we explore the problem of detecting alternative-correct answers: when more than one inflected form of a lemma fits syntactically and semantically in a given context. This problem was formulated as Multiple Ad...
Conference Paper
Full-text available
This paper describes Slav-NER: the 3rd Multilingual Named Entity Challenge in Slavic languages. The tasks involve recognizing mentions of named entities in Web documents, normalization of the names, and cross-lingual linking. The Challenge covers six languages and five entity types, and is organized as part of the 8th Balto-Slavic Natural Language...
Preprint
We consider the problem of disambiguating the lemma and part of speech of ambiguous words in morphologically rich languages. We propose a method for disambiguating ambiguous words in context, using a large un-annotated corpus of text, and a morphological analyser -- with no manual disambiguation or data annotation. We assume that the morphological...
Conference Paper
Full-text available
We present the first version of the longitudinal Revita Learner Corpus (ReLCo), for Russian. In contrast to traditional learner corpora, ReLCo is collected and annotated fully automatically, while students perform exercises using the Revita language-learning platform. The corpus currently contains 8 422 sentences exhibiting several types of errors-...
Conference Paper
Full-text available
We describe the Second Multilingual Named Entity Challenge in Slavic languages. The task is recognizing mentions of named entities in Web documents, their normalization, and cross-lingual linking. The Challenge was organized as part of the 7th Balto-Slavic Natural Language Processing Workshop, co-located with the ACL-2019 conference. Eight teams pa...
Article
Full-text available
The preservation of linguistic diversity has long been recognized as a crucial, integral part of supporting our cultural heritage. Yet many “minority” languages — those that lack official state status — are in decline, many severely endangered. We present a prototype system aimed at “heritage” speakers of endangered Finno-Ugric languages. Heritage...
Conference Paper
We present algorithms that learn to segment words in morphologically rich languages, in an unsupervised fashion. Morphology of many languages can be modeled by finite state machines (FSMs). We start with a baseline MDL-based learning algorithm. We then formulate well-motivated and general linguistic principles about morphology, and incorporate them...
Conference Paper
Full-text available
We describe a computational system for language learning and supporting endangered languages. The platform provides the user an opportunity to improve her competency through active language use. The platform currently works with several endangered Finno-Ugric languages, as well as with Yakut, and Finnish, Swedish, and Russian. This paper describes...
Conference Paper
Full-text available
Unsupervised learning of morphological segmentation of words in a language, based only on a large corpus of words, is a challenging task. Evaluation of the learned segmentations is a challenge in itself, due to the inherent ambiguity of the segmentation task. There is no way to posit unique “correct” segmentation for a set of data in an objective w...
Conference Paper
In this paper we study the interactions between how companies are mentioned in news, their presence on social media, and daily fluctuation in their stock prices. Our experiments demonstrate that for some entities these time series can be correlated in interesting ways, though for others the correspondences are more opaque. In this study, social med...
Conference Paper
Full-text available
This paper presents a method for linking models for aligning linguistic etymological data with models for phylogenetic inference from population genetics. We begin with a large database of genetically related words—sets of cognates—from languages in a language family. We process the cognate sets to obtain a complete alignment of the data. We use th...
Conference Paper
Full-text available
We explore supervised learning for multi-class, multi-label text classification, focusing on real-world settings, where the distribution of labels changes dynamically over time. We use the PULS Information Extraction system to collect information about the distribution of class labels over named entities found in text. We then combine a knowledge-b...
Conference Paper
Full-text available
We examine supervised learning for multi-class, multi-label text classification. We are interested in exploring classification in a real-world setting, where the distribution of labels may change dynamically over time. First, we compare the performance of an array of binary classifiers trained on the label distribution found in the original corpus...
Conference Paper
Full-text available
This paper presents models for automatic transliteration of proper names between languages that use different alphabets. The models are an extension of our work on automatic discovery of patterns of etymological sound change, based on the Minimum Description Length Principle. The models for pairwise alignment are extended with algorithms for predic...
Chapter
We present work on estimating the relevance of the results of an Event Extraction system to the end-user’s needs. Our aim is to develop user-oriented measures of utility of the extracted events, i.e., how useful is the factual information found in the document for the end user. We introduce discourse and lexical features, and build classifiers that...
Article
Internet biosurveillance utilizes unstructured data from diverse web-based sources to provide early warning and situational awareness of public health threats. The scope of source coverage ranges from local media in the vernacular to international media in widely read languages. Internet biosurveillance is a timely modality that is available to gov...
Article
Full-text available
The objective of Web-based expert epidemic intelligence systems is to detect health threats. The Global Health Security Initiative (GHSI) Early Alerting and Reporting (EAR) project was launched to assess the feasibility and opportunity for pooling epidemic intelligence data from seven expert systems. EAR participants completed a qualitative survey...
Book
Information extraction (IE) and text summarization (TS) are powerful technologies for finding relevant pieces of information in text and presenting them to the user in condensed form. The ongoing information explosion makes IE and TS critical for successful functioning within the information society. These technologies face particular challenges du...
Article
This chapter presents a number of techniques for multilingual event extraction; the main task is to accurately and efficiently detect key information about security-related events from electronic news media and summarize it in the form of database-like structures. Gathering such information over time is an important task for developing global news...
Book
Full-text available
Information extraction (IE) and text summarization (TS) are powerful technologies for finding relevant pieces of information in text and presenting them to the user in condensed form. The ongoing information explosion makes IE and TS critical for successful functioning within the information society. These technologies face particular challenges d...
Chapter
In this chapter we present a brief overview of Information Extraction, which is an area of natural language processing that deals with finding factual information in free text. In formal terms, facts are structured objects, such as database records. Such a record may capture a real-world entity with its attributes mentioned in text, or a real-world...
Conference Paper
This paper presents a novel method for aligning etymological data, which models context-sensitive rules governing sound change, and utilizes phonetic features of the sounds. The goal is, for a given corpus of cognate sets, to find the best alignment at the sound level. We introduce an imputation procedure to compare the goodness of the resulting mo...
Article
We describe a system that tracks the spread of epidemics by automatically extracting content from the Web. The system continuously monitors a large set of news sources, extracts information from new articles, and accumulates the extracted facts in a database in real time. The system provides functionality for visualizing results, as well as alertin...
Conference Paper
When faced with the need for analyzing vast streams of on-line text data, we require methods that go well beyond keyword-based queries. Large-scale surveillance of on-line news streams requires an understanding of the text on a deeper level than is afforded by names and keywords alone; it becomes essential to understand complex interactions among t...
Chapter
This chapter gives an overview of tools developed for Frontex, the European Agency for the Management of Operational Cooperation at the External Borders of the Member States of the European Union, to facilitate the process of extracting structured information on events related to border security from on-line news articles, with a particular focus o...
Conference Paper
There is currently a paucity of publicly available NLP tools to support analysis of Russian-language text. This especially concerns higher-level applications, such as Information Extraction. We present work on tools for information extraction from text in Russian in the domain of on-line news. On the lower level we employ the AOT toolkit for natura...
Conference Paper
We introduce several models for alignment of etymological data, that is, for finding the best alignment, given a set of etymological data, at the sound or symbol level. This is intended to obtain a means of measuring the quality of the etymological data sets, in terms of their internal consistency. One of our main goals is to devise automatic metho...
Article
Full-text available
Global medical and epidemic surveillance is an essential function of Public Health agencies, whose mandate is to protect the public from major health threats. To perform this function effectively one requires timely and accurate medical information from a wide range of sources. In this work we present a freely accessible system designed to moni...
Conference Paper
Full-text available
Ontologies lend themselves for resolving ambiguities in a wide range of applications, including mashups from diverse third-party information sources, and human- and machine-readable specifications of electronic business services (eBS). While tool support exists for the development and maintenance of ontologies, the question remains unanswered what i...
Conference Paper
Full-text available
Processing content for security becomes more and more important since every local danger can have global consequences. Being able to collect and analyse information in different languages is a great issue. This paper addresses multilingual solutions for analysis of press articles for epidemiological surveillance. The system described here relies on...
Conference Paper
This presentation gives an overview of an effort to construct OSINT (Open-Source Intelligence) tools for Frontex, the European Agency for the Management of Operational Cooperation at the External Borders of the Member States of the European Union, to facilitate automating the process of extracting structured knowledge from on-line news articles on...
Conference Paper
This paper presents an endeavor aiming at construction of a real-time event extraction system for border security-related intelligence gathering from online news. First, the background and motivation behind the presented work is given. Next, the paper describes the event extraction processing chain, the specifics of the domain, i.e., illegal migrat...
Conference Paper
This paper presents ongoing work on application of Information Extraction (IE) technology to domain of Public Health, in a real-world scenario. A central issue in IE is the quality of the results. We present two novel points. First, we distinguish the criteria for quality: the objective criteria that measure correctness of the system's analysis in...
Article
Full-text available
Event-based biosurveillance is a scientific discipline in which diverse sources of data, many of which are available from the Internet, are characterized prospectively to provide information on infectious disease events. Biosurveillance complements traditional public health surveillance to provide both early warning of infectious disease events and...
Chapter
Full-text available
The Medical Information System (MedISys) is a fully automatic 24/7 public health surveillance system monitoring human and animal infectious diseases and chemical, biological, radiological and nuclear (CBRN) threats in open-source media. In this article, we explain the technology behind MedISys, describing the processing chain from the definition of...
Article
Full-text available
Event-based biosurveillance is a scientific discipline in which diverse streams of data, available from the Internet, are characterized prospectively to provide information on infectious disease events. Biosurveillance complements traditional public health surveillance to provide both early warning of infectious disease events as well as situationa...
Article
Resolving semantic ambiguities in electronic business-services (eBS) during their discovery in broker systems is essential for the setup and enactment of business-to-business (B2B) collaboration. These ambiguities do not merely result from differently understood terminology, but also from specifications formulated in other human languages than...
Conference Paper
This paper gives an overview of an ongoing effort to construct tools for automating the process of extracting structured information about border-security related events from on-line news. The paper describes our overall approach to the problem, the system architecture and event information access and moderation. Keywords: event extraction from on-l...
Article
Full-text available
In order to gather a comprehensive picture of potential epidemic threats, public health authorities increasingly rely on systems that perform epidemic intelligence (EI). EI makes use of information that originates from official sources such as national public health surveillance systems as well as from informal sources such as electronic media and...
Article
Full-text available
Information extraction (IE) and text summarization (TS) are key technologies aiming at extracting relevant information from texts and presenting the information to the user in a condensed form. The ongoing information explosion makes IE and TS particularly critical for successful functioning within the information society. These technologies, howev...
Conference Paper
We report on a set of experiments in text mining, specifically, finding semantic patterns given only a few keywords. The experiments employ the Counter-training framework for discovery of semantic knowledge from raw text in a weakly supervised fashion. The experiments indicate that the framework is suitable for efficient acquisition of semantic wor...
Conference Paper
The accuracy of event extraction is limited by a number of complicating factors, with errors compounded at all stages inside the Information Extraction pipeline. In this paper, we present methods for recovering automatically from errors committed in the pipeline processing. Recovery is achieved via post-processing facts aggregated over a l...
Conference Paper
Full-text available
This work demonstrates the ProMED-PLUS Epidemiological Fact Base. The facts are automatically extracted from plain-text reports about outbreaks of infectious epidemics around the world. The system collects new reports, extracts new facts, and updates the database, in real time. The extracted database is available on-line through a Web server.
Article
This paper presents a method for unsupervised discovery of semantic patterns.
Article
This paper presents new Information Extraction scenarios which are linguistically and structurally more challenging than the traditional MUC scenarios. Traditional views on event structure and template design are not adequate for the more complex scenarios.
Chapter
Linguistic knowledge in Natural Language understanding systems is commonly stratified across several levels. This is true of Information Extraction as well. Typical state-of-the-art Information Extraction systems require syntactic-semantic patterns for locating facts or events in text; domain-specific word or concept classes for semantic generaliza...
Article
We present an algorithm for unsupervised learning and semantic classification of names and terms. Given a small number of seed examples and an unlabeled training corpus, the algorithm learns patterns that identify more examples, in a bootstrapping cycle. Multiple classes are learned simultaneously, including negative classes that serve to provide...
Article
This paper presents a method for unsupervised discovery of semantic patterns.
Article
Full-text available
This paper describes how NOMLEX, a dictionary of nominalizations, can be used in Information Extraction (IE). This paper details a procedure which maps syntactic and semantic information designed for writing an IE pattern for an active clause (IBM appointed Alice Smith as vice president) into a set of patterns for nominalizations (e.g., IBM's...
Article
Research in example-based machine translation (EBMT) has been hampered by the lack of efficient tree alignment algorithms for bilingual corpora. This paper describes an alignment algorithm for EBMT whose running time is quadratic in the size of the input parse trees. The algorithm uses dynamic programming to score all possible matching nodes betwee...
Article
Full-text available
We describe a system for creating and automatically updating a data base of information on infectious disease outbreaks. A web crawler is used to retrieve current news stories; potentially relevant stories are fed to an information extraction engine, whose output is used to update the data base. A web-based browser allows users to examine the data...
Article
This paper discusses an efficient algorithm for aligning parse trees, and its application to the automatic acquisition of transfer rules for machine translation. Although the general problem of finding an optimal tree alignment is NP-complete, the problem becomes tractable if we consider only alignments that are restricted to preserve a dominance r...
Article
Full-text available
This paper presents problems of template structure for Information Extraction. We investigate these problems in the context of two new Information Extraction scenarios which are linguistically and structurally more challenging than the traditional MUC scenarios. By a scenario we mean a predefined set of facts to be extracted from text. Traditional...
Article
In developing an Information Extraction (IE) system for a new class of events or relations, one of the major tasks is identifying the many ways in which these events or relations may be expressed in text. This has generally involved the manual analysis and, in some cases, the annotation of large quantities of text involving these events. This paper...
Article
We present an algorithm, Nomen, for learning generalized names in text. Examples of these are names of diseases and infectious agents, such as bacteria and viruses. These names exhibit certain properties that make their identification more complex than that of regular proper names. Nomen uses a novel form of bootstrapping to grow sets of textual in...
Article
Document search is generally based on individual terms in the document. However, for collections within limited domains it is possible to provide more powerful access tools. This paper describes a system designed for collections of reports of infectious disease outbreaks. The system, Proteus-BIO, automatically creates a table of outbreaks, with eac...
Conference Paper
Full-text available
This paper presents new Information Extraction scenarios which are linguistically and structurally more challenging than the traditional MUC scenarios. Traditional views on event structure and template design are not adequate for the more complex scenarios. The focus of this paper is to show the complexity of the scenarios, and propose a way to reco...
Conference Paper
We present an algorithm, NOMEN, for learning generalized names in text. Examples of these are names of diseases and infectious agents, such as bacteria and viruses. These names exhibit certain properties that make their identification more complex than that of regular proper names. NOMEN uses a novel form of bootstrapping to grow sets of textual instances and of the...
Conference Paper
Linguistic knowledge in Natural Language understanding systems is commonly stratified across several levels. This is true of Information Extraction as well. Typical state-of-the-art Information Extraction systems require syntactic-semantic patterns for locating facts or events in text; domain-specific word or concept classes for semantic generaliza...
Article
Information Extraction (IE) is an emerging NLP technology, whose function is to process unstructured, natural language text, to locate specific pieces of information, or facts, in the text, and to use these facts to fill a database. IE systems today are commonly based on pattern matching. The core IE engine uses a cascade of sets of patterns of inc...
Article
In developing an Information Extraction (IE) system for a new class of events or relations, one of the major tasks is identifying the many ways in which these events or relations may be expressed in text. This has generally involved the manual analysis and, in some cases, the annotation of large quantities of text involving these events. This paper...
Article
Full-text available
Information Extraction (IE) systems are commonly based on pattern matching. Adapting an IE system to a new scenario entails the construction of a new pattern base, a time-consuming and expensive process. We have implemented a system for finding patterns automatically from un-annotated text. Starting with a small initial set of seed patterns propose...
Article
Information Extraction (IE) systems today are commonly based on pattern matching. The patterns are regular expressions stored in a customizable knowledge base. Adapting an IE system to a new subject domain entails the construction of a new pattern base, a time-consuming and expensive task. We describe a strategy for building patterns from examples...
Article
This paper reports on the development of the Japanese Information Extraction system and the Japanese Information Extraction customization tool. These systems are based on the corresponding English systems, the Proteus Information Extraction system (Grishman 97) and the Proteus Extraction Tool (Yangarber and Grishman 97), developed at NYU. In this p...
Article
In this paper we discuss the system's performance on the MUC-7 Scenario Template task (ST). The topics covered in the following sections are: the Proteus core extraction engine; the example-based PET interface to Proteus; a discussion of how these were used to accommodate the MUC-7 Space Launch scenario task. We conclude with the evaluation of the sys...
Article
Information Extraction (IE) is becoming an increasingly fast and accurate technology for extracting information about specific relationships and events from free natural language text. However, adapting an IE system to a new class of events remains a time-consuming and expensive task. We describe a suite of tools for rapidly adapting a system to ne...
Conference Paper
Full-text available
In this paper we present Morphy, an integrated tool for German morphology, part-of-speech tagging and context-sensitive lemmatization. Its large lexicon of more than 320,000 word forms plus its ability to process German compound nouns guarantee a wide ...
Conference Paper
In this paper we discuss the system's performance on the MUC-7 Scenario Template task (ST). The topics covered in the following sections are: the Proteus core extraction engine; the example-based PET interface to Proteus; a discussion of how these were used to accommodate the MUC-7 Space Launch scenario task. We conclude with the evaluation of the system...
Conference Paper
We introduce the problem of mining association rules in large relational tables containing both quantitative and categorical attributes. An example of such an association might be "10% of married people between age 50 and 60 have at least 2 cars". We ...
Article
Full-text available
This work demonstrates the ProMED-PLUS Epidemiological Fact Base. The facts are automatically extracted from plain-text reports about outbreaks of infectious epidemics around the world. The system collects new reports, extracts new facts, and updates the database, in real time. The extracted database is available on-line through a Web server.
Article
This paper describes an on-going effort to combine Information Retrieval (IR) and Information Extraction (IE) technologies, to leverage the benefits provided by both approaches to add value for the end-user, as compared with IR or IE in isolation. The main aim of the combined system is to pool together information from multiple sources to imp...