Ralf Steinberger's research while affiliated with European Commission and other places

Publications (106)

Chapter
Full-text available
All information-seeking professionals need to sieve through large amounts of text to retrieve the information they need so that they can stay up to date with developments in their field. Language Technology tools can help make the analyst’s work more efficient by increasing the amount of data analysed and by speeding up the process. Software tools a...
Article
Full-text available
Since 2004 the European Commission's Joint Research Centre (JRC) has been analysing the online version of printed media in over twenty languages and has automatically recognised and compiled large amounts of named entities (persons and organisations) and their many name variants. The collected variants not only include standard spellings in various...
Article
Full-text available
Any large organisation, be it public or private, monitors the media for information to keep abreast of developments in their field of interest, and usually also to become aware of positive or negative opinions expressed towards them. At least for the written media, computer programs have become very efficient at helping the human analysts significa...
Article
This paper reports on an approach and experiments to automatically build a cross-lingual multi-word entity resource. Starting from a collection of millions of acronym/expansion pairs for 22 languages where expansion variants were grouped into monolingual clusters, we experiment with several aggregation strategies to link these clusters across langu...
Article
Full-text available
Starting in 2006, the European Commission's Joint Research Centre and other European Union organisations have made available a number of large-scale highly-multilingual parallel language resources. In this article, we give a comparative overview of these resources and we explain the specific nature of each of them. This article provides answers to...
Conference Paper
Full-text available
Social media texts are significant information sources for several application areas including trend analysis, event monitoring, and opinion mining. Unfortunately, existing solutions for tasks such as named entity recognition that perform well on formal texts usually perform poorly when applied to social media texts. In this paper, we report on exp...
Article
Multilingual text processing is useful because the information content found in different languages is complementary, both regarding facts and opinions. While Information Extraction and other text mining software can, in principle, be developed for many languages, most text analysis tools have only been applied to small sets of languages because th...
Article
Full-text available
We propose a real-time machine translation system that allows users to select a news category and to translate the related live news articles from Arabic, Czech, Danish, Farsi, French, German, Italian, Polish, Portuguese, Spanish and Turkish into English. The Moses-based system was optimised for the news domain and differs from other available syst...
Article
We describe an existing multilingual information extraction system that automatically detects event information on disasters, conflicts and health threats in near-real time from a continuous flow of on-line news articles. We illustrate a number of strategies for customizing the system to process social media texts such as Twitter messages, which are...
Conference Paper
Full-text available
Various recent studies show that the performance of named entity recognition (NER) systems developed for well-formed text types drops significantly when applied to tweets. The only existing study for the highly inflected agglutinative language Turkish reports a drop in F-Measure from 91% to 19% when ported from news articles to tweets. In this stud...
Chapter
In this chapter, the authors discuss several pertinent aspects of an automatic system that generates summaries in multiple languages for sets of topic-related news articles (multilingual multi-document summarisation), gathered by news aggregation systems. The discussion follows a framework based on Latent Semantic Analysis (LSA) because LSA was sho...
Conference Paper
We give an overview of the highly multilingual news analysis system Europe Media Monitor (EMM), which gathers an average of 175,000 online news articles per day in tens of languages, categorises the news items and extracts named entities and various other information from them. We explain how users benefit from media monitoring and why it is so impo...
Article
Full-text available
This paper describes a new, freely available, highly multilingual named entity resource for person and organisation names that has been compiled over seven years of large-scale multilingual news analysis combined with Wikipedia mining, resulting in 205,000 person and organisation names plus about the same number of spelling variants written in ove...
Article
Full-text available
We are presenting work on recognising acronyms of the form Long-Form (Short-Form) such as "International Monetary Fund (IMF)" in millions of news articles in twenty-two languages, as part of our more general effort to recognise entities and their variants in news text and to use them for the automatic analysis of the news, including the linking of...
Article
Full-text available
Recent years have brought a significant growth in the volume of research in sentiment analysis, mostly on highly subjective text types (movie or product reviews). The main difference these texts have with news articles is that their target is clearly defined and unique across the text. Following different annotation efforts and the analysis of the...
Article
Full-text available
Most large organizations have dedicated departments that monitor the media to keep up-to-date with relevant developments and to keep an eye on how they are represented in the news. Part of this media monitoring work can be automated. In the European Union with its 23 official languages, it is particularly important to cover media reports in many la...
Article
Full-text available
EuroVoc (2012) is a highly multilingual thesaurus consisting of over 6,700 hierarchically organised subject domains used by European Institutions and many authorities in Member States of the European Union (EU) for the classification and retrieval of official documents. JEX is JRC-developed multi-label classification software that learns from manua...
Article
Full-text available
The European Commission's (EC) Directorate General for Translation, together with the EC's Joint Research Centre, is making available a large translation memory (TM; i.e. sentences and their professionally produced translations) covering twenty-two official European Union (EU) languages and their 231 language pairs. Such a resource is typically use...
Conference Paper
We present the highly multilingual news analysis system Europe Media Monitor (EMM), which gathers an average of 175,000 online news articles per day in tens of languages, categorises the news items and extracts named entities and various other information from them. We also give an overview of EMM’s text mining tool set, focusing on the issue of ho...
Article
In this chapter we present a generic approach for summarizing clusters of multilingual news articles such as the ones produced by the Europe Media Monitor (EMM) system. Our approach uses robust statistical techniques as well as multilingual tools for named entity recognition and disambiguation to produce entity-centered summaries. We run experiment...
Article
The paper presents a semi-automatic approach to creating sentiment dictionaries in many languages. We first produced high-level gold-standard sentiment dictionaries for two languages and then translated them automatically into third languages. Those words that can be found in both target language word lists are likely to be useful because their wor...
Article
The present is marked by the influence of the Social Web on societies and people worldwide. In this context, users generate large amounts of data, especially containing opinion, which has been proven useful for many real-world applications. In order to extract knowledge from user-generated content, automatic methods must be developed. In this paper...
Conference Paper
We propose a real-time machine translation system that allows users to select a news category and to translate the related live news articles from Arabic, Czech, Danish, Farsi, French, German, Italian, Polish, Portuguese, Spanish and Turkish into English. The Moses-based system was optimised for the news domain and differs from other available syst...
Article
We describe a system that tracks the spread of epidemics by automatically extracting content from the Web. The system continuously monitors a large set of news sources, extracts information from new articles, and accumulates the extracted facts in a database in real time. The system provides functionality for visualizing results, as well as alertin...
Article
Full-text available
The Europe Media Monitor (EMM) family of applications is a set of multilingual tools that gather, cluster and classify news in currently fifty languages and that extract named entities and quotations (reported speech) from twenty languages. In this paper, we describe the recent effort of adding the African Bantu language Swahili to EMM. EMM is desi...
Conference Paper
Full-text available
As developers of a highly multilingual named entity recognition (NER) system, we face an evaluation resource bottleneck problem: we need evaluation data in many languages, the annotation should not be too time-consuming, and the evaluation results across languages should be comparable. We solve the problem by automatically annotating the English ve...
Conference Paper
Full-text available
In this paper we present an approach to large-scale coreference resolution for an ample set of human languages, with a particular emphasis on time performance and precision. One of the distinctive features of our approach is the use of a mature multilingual named entity repository (persons and organizations) gradually compiled over the past few yea...
Article
Full-text available
A lot of information is hidden in foreign language text, making multilingual information extraction tools – and applications that allow cross-lingual information access – particularly useful. Only a few system developers offer their products for more than two or three languages. Typically, they develop the tools for one language and then adapt th...
Article
Full-text available
Global medical and epidemic surveillance is an essential function of Public Health agencies, whose mandate is to protect the public from major health threats. To perform this function effectively one requires timely and accurate medical information from a wide range of sources. In this work we present a freely accessible system designed to moni...
Conference Paper
In this paper we present NewsGist, a multilingual, multi-document news summarization system underpinned by the Singular Value Decomposition (SVD) paradigm for document summarization and purpose-built for the Europe Media Monitor (EMM). The summarization method employed yielded state-of-the-art performance for English at the Update Summarization tas...
Conference Paper
We are presenting a method for the evaluation of multilingual multi-document summarisation that allows saving precious annotation time and that makes the evaluation results across languages directly comparable. The approach is based on the manual selection of the most important sentences in a cluster of documents from a sentence-aligned parallel co...
Conference Paper
Full-text available
We present a working Arabic information extraction (IE) system that is used to analyze large volumes of news texts every day to extract the named entity (NE) types person, organization, location, date and number, as well as quotations (direct reported speech) by and about people. The Named Entity Recognition (NER) system was not developed for Arabi...
Article
Multilingual text processing is useful because the information content found in different languages is complementary, both regarding facts and opinions. While Information Extraction and other text mining software can, in principle, be developed for many languages, most text analysis tools have only been applied to small sets of languages because th...
Chapter
Full-text available
In this chapter, we present an approach to learn a signed social network automatically from online news articles. The vertices in this network represent people and the edges are labeled with the polarity of the attitudes among them (positive, negative, and neutral). Our algorithm accepts as its input two social networks extracted via unsupervised a...
Chapter
Full-text available
The Medical Information System (MedISys) is a fully automatic 24/7 public health surveillance system monitoring human and animal infectious diseases and chemical, biological, radiological and nuclear (CBRN) threats in open-source media. In this article, we explain the technology behind MedISys, describing the processing chain from the definition of...
Conference Paper
Full-text available
We describe a methodology for building event extraction systems. The approach is based on multilingual domain-specific grammars and exploits weakly supervised machine learning algorithms for lexical acquisition. We report on the process of adapting an already existing event extraction system for the domain of conflicts and crises to the Portuguese...
Conference Paper
Full-text available
In this paper we propose a novel information-theoretic metric for automatic summary evaluation when model summaries are available as in the setting of the AESOP task of the Update Summarization track of the Text Analysis Conference (TAC). The metric is based on the concept of information content operationalized by using a taxonomy. Hereby, we prese...
Conference Paper
Full-text available
The main focus of this work is to investigate robust ways for generating summaries from summary representations without resorting to simple sentence extraction and aiming at more human-like summaries. This is motivated by empirical evidence from TAC 2009 data showing that human summaries contain on average more and shorter sentences than the system...
Conference Paper
The publicly accessible Europe Media Monitor (EMM) family of applications (http://press.jrc.it/overview.html) gather and analyse an average of 80,000 to 100,000 online news articles per day in up to 43 languages. Through the extraction of meta-information in these articles, they provide an aggregated view of the news; they allow to monitor trends a...
Article
Full-text available
In order to gather a comprehensive picture of potential epidemic threats, public health authorities increasingly rely on systems that perform epidemic intelligence (EI). EI makes use of information that originates from official sources such as national public health surveillance systems as well as from informal sources such as electronic media and...
Article
Full-text available
In this paper we present an approach to summarizing positive and negative opinions in blog threads. We first run a sentiment analysis system and consequently pass its output through a standard LSA-based text summarization system. Further on, we evaluate our approach and present the results obtained, which we believe are promising in the context o...
Article
Most sentiment analysis work has been carried out on highly subjective text types where the target is clearly defined and unique across the text (movie or product reviews). However, when applying sentiment analysis to the news domain, it is necessary to clearly define the scope of the task, in a more specific manner than it has been done in the fie...
Conference Paper
Full-text available
In this paper we present a generic approach for summarising multilingual news clusters such as the ones produced by the Europe Media Monitor (EMM) system. It is generic because it uses robust statistical techniques to perform the summarisation step and its multilinguality is inherited from the multilingual entity disambiguation system used...
Article
Full-text available
We built 462 machine translation systems for all language pairs of the Acquis Communautaire corpus. We report and analyse the performance of these systems, and compare them against pivot translation and a number of system combination methods (multi-pivot, multi-source) that are possible due to the available systems.
Conference Paper
Full-text available
Opinion mining is the task of extracting from a set of documents opinions expressed by a source on a specified target. This article presents a comparative study on the methods and resources that can be employed for mining opinions from quotations (reported speech) in newspaper articles. We show the difficulty of this task, motivated by the presence...
Chapter
With the emergence of the World Wide Web, analyzing and improving Web communication has become essential to adapt the Web content to the visitors’ expectations. Web communication analysis is traditionally performed by Web analytics software, which produce long lists of page-based audience metrics. These results suffer from page synonymy, page polys...
Article
Full-text available
The Europe Media Monitor system (EMM) gathers and aggregates an average of 50,000 newspaper articles per day in over 40 languages. To manage the information overflow, it was decided to group similar articles per day and per language into clusters and to link daily clusters over time into stories. A story automatically comes into existence when...
Chapter
This chapter presents on-going efforts at the Joint-Research Center of the European Commission for automating event extraction from news articles collected through the Internet with the Europe Media Monitor system. Event extraction builds on techniques developed over several years in the fields of information extraction, whose basic goal is to deri...
Conference Paper
Full-text available
This paper presents a fully operational real-time event extraction system which is capable of accurately and efficiently extracting violent and natural disaster events from vast amounts of online news articles per day in different languages. Due to the requirement that the system must be multilingual and easily extendable, it is based on a shall...
Article
Named Entity Recognition and Classification (NERC) is a known and well-explored text analysis application that has been applied to various languages. We are presenting an automatic, highly multilingual news analysis system that fully integrates NERC for locations, persons and organisations with document clustering, multi-label categorisation, name...
Article
Full-text available
We present a new, unique and freely available parallel corpus containing European Union (EU) documents of mostly legal nature. It is available in all 20 official EU languages, with additional documents being available in the languages of the EU candidate countries. The corpus consists of almost 8,000 documents per language, with an average size of ne...
Article
Full-text available
We are presenting a method to recognise geographical references in free text. Our tool must work on various languages with a minimum of language-dependent resources, except a gazetteer. The main difficulty is to disambiguate these place names by distinguishing places from persons and by selecting the most likely place out of a list of homographic p...
Article
Full-text available
We are proposing a simple, but efficient basic approach for a number of multilingual and cross-lingual language technology applications that are not limited to the usual two or three languages, but that can be applied with relatively little effort to larger sets of languages. The approach consists of using existing multilingual linguistic resources...
Article
Full-text available
We present a tool that, from automatically recognised names, tries to infer inter-person relations in order to present associated people on maps. Based on an in-house Named Entity Recognition tool, applied on clusters of an average of 15,000 news articles per day, in 15 different languages, we build a knowledge base that allows extracting statistic...
Article
Full-text available
In a highly multilingual and multicultural environment such as in the European Commission with soon over twenty official languages, there is an urgent need for text analysis tools that use minimal linguistic knowledge so that they can be adapted to many languages without much human effort. We are presenting two such Information Extraction tools tha...
Article
Full-text available
Automatic annotation of documents with controlled vocabulary terms (descriptors) from a conceptual thesaurus is not only useful for document indexing and retrieval. The mapping of texts onto the same thesaurus furthermore allows to establish links between similar documents. This is also a substantial requirement of the Semantic Web. This paper pres...
Article
Full-text available
Texts and their translations are a rich linguistic resource that can be used to train and test statistics-based Machine Translation systems and many other applications. In this paper, we present a working system that can identify translations and other very similar documents among a large number of candidates, by representing the document contents...
Article
Full-text available
We are presenting a set of multilingual text analysis tools that can help analysts in any field to explore large document collections quickly in order to determine whether the documents contain information of interest, and to find the relevant text passages. The automatic tool, which currently exists as a fully functional prototype, is expected to...