Ralf Steinberger's research while affiliated with European Commission and other places
Publications (108)
All information-seeking professionals need to sieve through large amounts of text to retrieve the information they need so that they can stay up to date with developments in their field. Language Technology tools can help make the analyst’s work more efficient by increasing the amount of data analysed and by speeding up the process. Software tools a...
Since 2004 the European Commission's Joint Research Centre (JRC) has been analysing the online version of printed media in over twenty languages and has automatically recognised and compiled large amounts of named entities (persons and organisations) and their many name variants. The collected variants not only include standard spellings in various...
Any large organisation, be it public or private, monitors the media for information to keep abreast of developments in their field of interest, and usually also to become aware of positive or negative opinions expressed towards them. At least for the written media, computer programs have become very efficient at helping the human analysts significa...
This paper reports on an approach and experiments to automatically build a cross-lingual multi-word entity resource. Starting from a collection of millions of acronym/expansion pairs for 22 languages where expansion variants were grouped into monolingual clusters, we experiment with several aggregation strategies to link these clusters across langu...
Starting in 2006, the European Commission's Joint Research Centre and other European Union organisations have made available a number of large-scale highly-multilingual parallel language resources. In this article, we give a comparative overview of these resources and we explain the specific nature of each of them. This article provides answers to...
Social media texts are significant information sources for several
application areas including trend analysis, event monitoring, and opinion
mining. Unfortunately, existing solutions for tasks such as named entity
recognition that perform well on formal texts usually perform poorly when
applied to social media texts. In this paper, we report on exp...
Multilingual text processing is useful because the information content found in different languages is complementary, both
regarding facts and opinions. While Information Extraction and other text mining software can, in principle, be developed
for many languages, most text analysis tools have only been applied to small sets of languages because th...
We propose a real-time machine translation system that allows users to select
a news category and to translate the related live news articles from Arabic,
Czech, Danish, Farsi, French, German, Italian, Polish, Portuguese, Spanish and
Turkish into English. The Moses-based system was optimised for the news domain
and differs from other available syst...
We describe an existing multilingual information extraction system that automatically detects event information on disasters, conflicts and health threats in near-real time from a continuous flow of on-line news articles. We illustrate a number of strategies for customizing the system to process social media texts such as Twitter messages, which are...
Various recent studies show that the performance of named entity recognition (NER) systems developed for well-formed text types drops significantly when applied to tweets. The only existing study for the highly inflected agglutinative language Turkish reports a drop in F-Measure from 91% to 19% when ported from news articles to tweets. In this stud...
In this chapter, the authors discuss several pertinent aspects of an automatic system that generates summaries in multiple languages for sets of topic-related news articles (multilingual multi-document summarisation), gathered by news aggregation systems. The discussion follows a framework based on Latent Semantic Analysis (LSA) because LSA was sho...
We give an overview of the highly multilingual news analysis system Europe Media Monitor (EMM), which gathers an average of 175,000 online news articles per day in tens of languages, categorises the news items and extracts named entities and various other information from them. We explain how users benefit from media monitoring and why it is so impo...
This paper describes a new, freely available, highly multilingual named
entity resource for person and organisation names that has been compiled over
seven years of large-scale multilingual news analysis combined with Wikipedia
mining, resulting in 205,000 person and organisation names plus about the same
number of spelling variants written in ove...
We are presenting work on recognising acronyms of the form Long-Form
(Short-Form) such as "International Monetary Fund (IMF)" in millions of news
articles in twenty-two languages, as part of our more general effort to
recognise entities and their variants in news text and to use them for the
automatic analysis of the news, including the linking of...
Recent years have brought a significant growth in the volume of research in
sentiment analysis, mostly on highly subjective text types (movie or product
reviews). The main difference these texts have with news articles is that their
target is clearly defined and unique across the text. Following different
annotation efforts and the analysis of the...
Most large organizations have dedicated departments that monitor the media to
keep up-to-date with relevant developments and to keep an eye on how they are
represented in the news. Part of this media monitoring work can be automated.
In the European Union with its 23 official languages, it is particularly
important to cover media reports in many la...
EuroVoc (2012) is a highly multilingual thesaurus consisting of over 6,700
hierarchically organised subject domains used by European Institutions and many
authorities in Member States of the European Union (EU) for the classification
and retrieval of official documents. JEX is JRC-developed multi-label
classification software that learns from manua...
The European Commission's (EC) Directorate General for Translation, together
with the EC's Joint Research Centre, is making available a large translation
memory (TM; i.e. sentences and their professionally produced translations)
covering twenty-two official European Union (EU) languages and their 231
language pairs. Such a resource is typically use...
We present the highly multilingual news analysis system Europe Media Monitor (EMM), which gathers an average of 175,000 online news articles per day in tens of languages, categorises the news items and extracts named entities and various other information from them. We also give an overview of EMM’s text mining tool set, focusing on the issue of ho...
In this chapter we present a generic approach for summarizing clusters of multilingual news articles such as the ones produced by the Europe Media Monitor (EMM) system. Our approach uses robust statistical techniques as well as multilingual tools for named entity recognition and disambiguation to produce entity-centered summaries. We run experiment...
The paper presents a semi-automatic approach to creating sentiment dictionaries in many languages. We first produced high-level gold-standard sentiment dictionaries for two languages and then translated them automatically into third languages. Those words that can be found in both target language word lists are likely to be useful because their wor...
The present is marked by the influence of the Social Web on societies and people worldwide. In this context, users generate large amounts of data, especially containing opinion, which has been proven useful for many real-world applications. In order to extract knowledge from user-generated content, automatic methods must be developed. In this paper...
We propose a real-time machine translation system that allows users to select a news category and to translate the related live news articles from Arabic, Czech, Danish, Farsi, French, German, Italian, Polish, Portuguese, Spanish and Turkish into English. The Moses-based system was optimised for the news domain and differs from other available syst...
We describe a system that tracks the spread of epidemics by automatically extracting content from the Web. The system continuously monitors a large set of news sources, extracts information from new articles, and accumulates the extracted facts in a database in real time. The system provides functionality for visualizing results, as well as alertin...
The Europe Media Monitor (EMM) family of applications is a set of multilingual tools that gather, cluster and classify news
in currently fifty languages and that extract named entities and quotations (reported speech) from twenty languages. In this
paper, we describe the recent effort of adding the African Bantu language Swahili to EMM. EMM is desi...
As developers of a highly multilingual named entity recognition (NER) system, we face an evaluation resource bottleneck problem: we need evaluation data in many languages, the annotation should not be too time-consuming, and the evaluation results across languages should be comparable. We solve the problem by automatically annotating the English ve...
In this paper we present an approach to large-scale coreference resolution for an ample set of human languages, with a particular emphasis on time performance and precision. One of the distinctive features of our approach is the use of a mature multilingual named entity repository (persons and organizations) gradually compiled over the past few yea...
A lot of information is hidden in foreign language text, making multilingual information extraction tools – and applications that allow cross-lingual information access – particularly useful. Only a few system developers offer their products for more than two or three languages. Typically, they develop the tools for one language and then adapt th...
Global medical and epidemic surveillance is an essential function of Public Health agencies, whose mandate is to protect the public from major health threats. To perform this function effectively one requires timely and accurate medical information from a wide range of sources. In this work we present a freely accessible system designed to moni...
In this paper we present NewsGist, a multilingual, multi-document news summarization system underpinned by the Singular Value
Decomposition (SVD) paradigm for document summarization and purpose-built for the Europe Media Monitor (EMM). The summarization
method employed yielded state-of-the-art performance for English at the Update Summarization tas...
We are presenting a method for the evaluation of multilingual multi-document summarisation that allows saving precious annotation
time and that makes the evaluation results across languages directly comparable. The approach is based on the manual selection
of the most important sentences in a cluster of documents from a sentence-aligned parallel co...
We present a working Arabic information extraction (IE) system that is used to analyze large volumes of news texts every day to extract the named entity (NE) types person, organization, location, date and number, as well as quotations (direct reported speech) by and about people. The Named Entity Recognition (NER) system was not developed for Arabi...
Multilingual text processing is useful because the information content found in different languages is complementary, both regarding facts and opinions. While Information Extraction and other text mining software can, in principle, be developed for many languages, most text analysis tools have only been applied to small sets of languages because th...
In this chapter, we present an approach to learn a signed social network automatically from online news articles. The vertices in this network represent people and the edges are labeled with the
polarity of the attitudes among them (positive, negative, and neutral). Our algorithm accepts as its input two social networks
extracted via unsupervised a...
The Medical Information System (MedISys) is a fully automatic 24/7 public health surveillance system monitoring human and animal infectious diseases and chemical, biological, radiological and nuclear (CBRN) threats in open-source media. In this article, we explain the technology behind MedISys, describing the processing chain from the definition of...
We describe a methodology for building event extraction systems. The approach is based on multilingual domain-specific grammars
and exploits weakly supervised machine learning algorithms for lexical acquisition. We report on the process of adapting an
already existing event extraction system for the domain of conflicts and crises to the Portuguese...
In this paper we propose a novel information-theoretic metric for automatic summary evaluation when model summaries are available
as in the setting of the AESOP task of the Update Summarization track of the Text Analysis Conference (TAC). The metric is
based on the concept of information content operationalized by using a taxonomy. Hereby, we prese...
The main focus of this work is to investigate robust ways for generating summaries from summary representations without resorting to simple sentence extraction and aiming at more human-like summaries. This is motivated by empirical evidence from TAC 2009 data showing that human summaries contain on average more and shorter sentences than the system...
The publicly accessible Europe Media Monitor (EMM) family of applications (http://press.jrc.it/overview.html) gather and analyse an average of 80,000 to 100,000 online news articles per day in up to 43 languages. Through the extraction of meta-information in these articles, they provide an aggregated view of the news; they allow to monitor trends a...
In order to gather a comprehensive picture of potential epidemic threats, public health authorities increasingly rely on systems that perform epidemic intelligence (EI). EI makes use of information that originates from official sources such as national public health surveillance systems as well as from informal sources such as electronic media and...
In this paper we present an approach to summarizing positive and negative opinions in blog threads. We first run a sentiment analysis system and consequently pass its output through a standard LSA-based text summarization system. Further on, we evaluate our approach and present the results obtained, which we believe are promising in the context o...
Most sentiment analysis work has been carried out on highly subjective text types where the target is clearly defined and unique across the text (movie or product reviews). However, when applying sentiment analysis to the news domain, it is necessary to clearly define the scope of the task, in a more specific manner than it has been done in the fie...
In this paper we present a generic approach for summarising multilingual news clusters such as the ones produced by the Europe Media Monitor (EMM) system. It is generic because it uses robust statistical techniques to perform the summarisation step and its multilinguality is inherited from the multilingual entity disambiguation system used...
We built 462 machine translation systems for all language pairs of the Acquis Communautaire corpus. We report and analyse the performance of these systems, and compare them against pivot translation and a number of system combination methods (multi-pivot, multi-source) that are possible due to the available systems.
Opinion mining is the task of extracting from a set of documents opinions expressed by a source on a specified target. This article presents a comparative study on the methods and resources that can be employed for mining opinions from quotations (reported speech) in newspaper articles. We show the difficulty of this task, motivated by the presence...
With the emergence of the World Wide Web, analyzing and improving Web communication has become essential to adapt the Web
content to the visitors’ expectations. Web communication analysis is traditionally performed by Web analytics software, which
produce long lists of page-based audience metrics. These results suffer from page synonymy, page polys...
The Internet gives us access to a wealth of information in languages we don't understand. The investigation of automated or semi-automated approaches to translation has become a thriving research field with enormous commercial potential. This volume investigates how Machine Learning techniques can improve Statistical Machine Translation, currently...
The Europe Media Monitor system (EMM) gathers and aggregates an average of 50,000 newspaper articles per day in over 40 languages. To manage the information overflow, it was decided to group similar articles per day and per language into clusters and to link daily clusters over time into stories. A story automatically comes into existence when...
This chapter presents on-going efforts at the Joint Research Centre of the European Commission for automating event extraction
from news articles collected through the Internet with the Europe Media Monitor system. Event extraction builds on techniques
developed over several years in the fields of information extraction, whose basic goal is to deri...
This paper presents a fully operational real-time event extraction system which is capable of accurately and efficiently extracting violent and natural disaster events from vast amounts of online news articles per day in different languages. Due to the requirement that the system must be multilingual and easily extendable, it is based on a shall...
Named Entity Recognition and Classification (NERC) is a known and well-explored text analysis application that has been applied to various languages. We are presenting an automatic, highly multilingual news analysis system that fully integrates NERC for locations, persons and organisations with document clustering, multi-label categorisation, name...
We present a new, unique and freely available parallel corpus containing European Union (EU) documents of mostly legal nature. It is available in all 20 official EU languages, with additional documents being available in the languages of the EU candidate countries. The corpus consists of almost 8,000 documents per language, with an average size of ne...
We are presenting a method to recognise geographical references in free text. Our tool must work on various languages with a minimum of language-dependent resources, except a gazetteer. The main difficulty is to disambiguate these place names by distinguishing places from persons and by selecting the most likely place out of a list of homographic p...
We are proposing a simple, but efficient basic approach for a number of multilingual and cross-lingual language technology applications that are not limited to the usual two or three languages, but that can be applied with relatively little effort to larger sets of languages. The approach consists of using existing multilingual linguistic resources...
We present a tool that, from automatically recognised names, tries to infer inter-person relations in order to present associated people on maps. Based on an in-house Named Entity Recognition tool, applied on clusters of an average of 15,000 news articles per day, in 15 different languages, we build a knowledge base that allows extracting statistic...
In a highly multilingual and multicultural environment such as in the European Commission with soon over twenty official languages, there is an urgent need for text analysis tools that use minimal linguistic knowledge so that they can be adapted to many languages without much human effort. We are presenting two such Information Extraction tools tha...
Automatic annotation of documents with controlled vocabulary terms (descriptors) from a conceptual thesaurus is not only useful for document indexing and retrieval. The mapping of texts onto the same thesaurus furthermore makes it possible to establish links between similar documents. This is also a substantial requirement of the Semantic Web. This paper pres...
Texts and their translations are a rich linguistic resource that can be used to train and test statistics-based Machine Translation systems and many other applications. In this paper, we present a working system that can identify translations and other very similar documents among a large number of candidates, by representing the document contents...
We are presenting a set of multilingual text analysis tools that can help analysts in any field to explore large document collections quickly in order to determine whether the documents contain information of interest, and to find the relevant text passages. The automatic tool, which currently exists as a fully functional prototype, is expected to...
We present an exploratory tool that extracts person names from multilingual news collections, matches name variants referring to the same person, and infers relationships between people based on the co-occurrence of their names in related news. A novel feature is the matching of name variants across languages and writing systems, including names wr...
We are presenting a text analysis tool set that allows analysts in various fields to sieve through large collections of multilingual news items quickly and to find information that is of relevance to them. For a given document collection, the tool set automatically clusters the texts into groups of similar articles, extracts names of places, people...
An automatic news tracking and analysis system which records world events over long time periods is described. It allows users to
track country-specific news and the activities of individual persons and groups, to derive trends, and to provide data for further
analysis and research. The data source is the Europe Media Monitor (EMM) which monitors news from...
This paper presents a language-independent approach to controlled vocabulary keyword assignment using the EUROVOC thesaurus. Due to the multilingual nature of EUROVOC, the keywords for a document written in one language can be displayed in all eleven official European Union languages. The mapping of documents written in different languages to...
With the emergence of the World Wide Web, analyzing and improving Web communication has become essential to adapt the Web content to the visitors' expectations. Web communication analysis is traditionally performed by Web analytics software, which produce long lists of page-based audience metrics. These results suffer from page synonymy, page polys...
With the emergence of the World Wide Web, Web sites have become a key communication channel for organizations. In this context, analyzing and improving Web communication is essential to better satisfy the objectives of the target audience. Web communication analysis is traditionally performed by Web analytics software, which produce long lists of...
The European Commission has a freely accessible news monitoring system called the Europe Media Monitor NewsBrief (http://press.jrc.it/), which is available for all twenty official languages of the European Union, plus some more languages. Among other things, NewsBrief categorizes articles through routing procedures and it alerts users interested in...
This paper describes new methods used for mapping news events gathered from around the world. Web based graphical map displays are used to monitor both the real time situation, and longer term historical trends. The results are derived from a synthesis of world events based on 20,000 news reports collected from the Internet each day by the Europe M...
The paper discusses the compilation of massively multilingual corpora, the EU ACQUIS corpus, and the corpus annotation tool "totale". The ACQUIS text collection has recently become available on the Web, and contains EU law texts (the Acquis Communautaire) in all the languages of the current EU, and more, i.e. parallel texts in over twenty different...
In this paper we present the problem found when studying an automated text categorization system for a collection of High
Energy Physics (HEP) papers, which shows a very large number of possible classes (over 1,000) with highly imbalanced distribution.
The collection is introduced to the scientific community and its imbalance is studied applying a...
We are presenting a working system for automated news analysis that ingests an average total of 7600 news articles per day in five languages. For each language, the system detects the major news stories of the day using a group-average unsupervised agglomerative clustering process. It also tracks, for each cluster, related groups of articles publis...