Article

Information Extraction, Multilinguality and Portability


Abstract

The growing availability of online textual sources and the many potential applications of knowledge acquisition from textual data, such as Information Extraction (IE), have led to an increase in IE research. Examples of these applications include the generation of databases from documents and the acquisition of knowledge useful for emerging technologies such as question answering and information integration, among others related to text mining. However, one of the main drawbacks of applying IE is its intrinsic language and domain dependence. To reduce the high cost of manually adapting IE applications to new domains and languages, the research community has applied a variety of Machine Learning (ML) techniques. This survey describes and compares the main approaches to IE and the ML techniques used to date to achieve adaptable IE technology.


... It is important to mention that IE systems are highly specific to their application domain, and therefore the experts' contributions cannot be reused in new scenarios. For this reason, IE research focuses mainly on the automatic discovery of extraction patterns [Muslea, 1999; Peng, 1999; Stevenson & Greenwood, 2006; Turno, 2003]. In particular, modern IE approaches rely on machine learning (ML) techniques [Ireson et al., 2005]. ...
Article
Disasters caused by natural phenomena have been present throughout human history; nevertheless, their consequences grow greater each time. This tendency will not be reversed in the coming years; on the contrary, natural phenomena are expected to increase in number and intensity due to global warming. It is therefore of great interest to have sufficient data on natural disasters, since such data are essential for analyzing their impact and for establishing links between their occurrence and their effects. To address this need, this paper describes a system based on Machine Learning methods that improves the acquisition of natural disaster data. The system automatically populates a natural disaster database by extracting information from online news reports. In particular, it extracts information about five types of natural disasters: hurricanes, earthquakes, forest fires, floods, and droughts. Experimental results on a collection of Spanish news show the effectiveness of the proposed system both for detecting relevant documents about natural disasters (reaching an F-measure of 98%) and for extracting relevant facts to be inserted into a given database (reaching an F-measure of 76%).
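As a rough illustration of the two-stage design this abstract describes, the sketch below pairs a bag-of-words relevance classifier with a toy slot extractor. All of the data, feature choices, and helper names are invented for illustration and are not taken from the authors' system.

```python
# Minimal sketch of a two-stage disaster-news pipeline (hypothetical):
# stage 1 filters relevant reports, stage 2 fills database fields.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy supervision: 1 = disaster-related report, 0 = unrelated news.
train_docs = [
    "A hurricane struck the coast, leaving thousands homeless.",
    "The earthquake measured 6.1 and damaged several buildings.",
    "The national team won the championship final yesterday.",
    "Stock markets closed higher after the central bank meeting.",
]
train_labels = [1, 1, 0, 0]

# Stage 1: bag-of-words relevance classifier.
relevance = make_pipeline(TfidfVectorizer(), LogisticRegression())
relevance.fit(train_docs, train_labels)

DISASTER_TYPES = ["hurricane", "earthquake", "forest fire", "flood", "drought"]

def extract_facts(text):
    """Stage 2 (toy): pull the disaster type and an affected-area figure."""
    facts = {}
    for dtype in DISASTER_TYPES:
        if dtype in text.lower():
            facts["type"] = dtype
            break
    area = re.search(r"(\d[\d,.]*)\s*hectares", text)
    if area:
        facts["area_hectares"] = area.group(1)
    return facts

report = "A strong earthquake damaged several buildings and 2,000 hectares of crops."
if relevance.predict([report])[0] == 1:
    print(extract_facts(report))  # {'type': 'earthquake', 'area_hectares': '2,000'}
```

In a real system, stage 2 would of course be learned as well; the point of the sketch is only the division of labor between document filtering and field extraction.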
Conference Paper
Information extraction is concerned with applying natural language processing to automatically extract the essential details from text documents. A great disadvantage of current approaches is their intrinsic dependence on the application domain and the target language. Several machine learning techniques have been applied to facilitate the portability of information extraction systems. This paper describes a general method for building an information extraction system using regular expressions along with supervised learning algorithms. In this method, the extraction decisions are led by a set of classifiers rather than by sophisticated linguistic analyses. The paper also presents a system called TOPO that extracts information related to natural disasters from newspaper articles in Spanish. Experimental results for this system indicate that the proposed method can be a practical solution for building information extraction systems, reaching an F-measure as high as 72%.
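The regex-plus-classifier idea sketched in this abstract might look roughly like the following, where regular expressions propose candidate slot fills and a supervised classifier accepts or rejects each one based on its local context. The features, training data, and slot name are invented for illustration, not taken from TOPO.

```python
# Hypothetical sketch: regular expressions propose candidates, a classifier
# decides whether each candidate fills the target slot.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

NUMBER = re.compile(r"\b\d[\d,.]*\b")

def candidates(text, window=4):
    """Yield every number in the text together with its local context."""
    for m in NUMBER.finditer(text):
        left = text[:m.start()].split()[-window:]
        right = text[m.end():].split()[:window]
        yield m.group(), " ".join(left + right)

# Toy supervision: does a numeric candidate in this context fill the
# "victims" slot of a natural-disaster template?
contexts = ["people were killed by the", "dollars in damages were reported",
            "residents were evacuated after", "on page of the newspaper"]
labels = [1, 0, 1, 0]
decide = make_pipeline(CountVectorizer(), MultinomialNB())
decide.fit(contexts, labels)

report = "At least 300 people were killed by the flood in the coastal region."
for value, context in candidates(report):
    if decide.predict([context])[0] == 1:
        print("victims =", value)  # victims = 300
```

The appeal of this design is that porting it to a new domain or language mostly means retraining the classifiers rather than rewriting linguistic analyses.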
Article
... engineering. LaSIE is a single, integrated system that builds up a unified model of a text which is then used to produce outputs for all four of the MUC-6 tasks. Of course this model may also be used for other purposes aside from MUC-6 results generation; for example, we currently generate natural language summaries of the MUC-6 scenario results. Put most broadly, and superficially, our approach involves compositionally constructing semantic representations of individual sentences in a text according to semantic rules attached to phrase structure constituents which have been obtained by syntactic parsing using a corpus-derived context-free grammar. The semantic representations of successive sentences are then integrated into a 'discourse model' which, once the entire text has been processed, may be viewed as a specialisation of a general world model with which the system sets out to process each text. LaSIE has a historical connection with the University of Sussex MUC-5 system (GCE93), from which it derives its approach to world modelling and coreference resolution and its approach to recombining fragmented semantic representations which result from partial grammatical coverage. However, the parser and grammar differ significantly from those used in the Sussex system. In its approach to named entity identification LaSIE borrows to some extent from the approach adopted in the MUC-5 Diderot system (CGJ+93). Virtually all of the code in LaSIE is new and has been developed since January 1995 with about 20 person-months of effort.
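To make the compositional strategy concrete, here is a deliberately tiny sketch of the general idea: each phrase-structure rule carries a semantic rule that combines its children's representations, and successive sentence representations are added to a discourse model. The grammar, lexicon, and sentence are invented; LaSIE's actual grammar, world model, and coreference machinery are far richer.

```python
# Toy illustration of compositional semantics over a parse tree
# (invented example, not LaSIE's actual rules or representations).

LEXICON = {
    "Acme":  {"entity": "Acme", "type": "company"},
    "hired": {"relation": "hire"},
    "Smith": {"entity": "Smith", "type": "person"},
}

RULES = {
    # S -> NP VP : plug the subject into the verb phrase's frame
    "S":  lambda subj, vp: {**vp, "agent": subj["entity"]},
    # VP -> V NP : attach the object to the verb's relation
    "VP": lambda verb, obj: {**verb, "patient": obj["entity"]},
}

def sem(node):
    """Compose a semantic representation bottom-up over a parse tree."""
    label, children = node[0], node[1:]
    if isinstance(children[0], str):              # lexical leaf
        return LEXICON[children[0]]
    return RULES[label](*(sem(child) for child in children))

tree = ("S", ("NP", "Acme"), ("VP", ("V", "hired"), ("NP", "Smith")))

discourse_model = []            # grows as successive sentences are processed
discourse_model.append(sem(tree))
print(discourse_model)  # [{'relation': 'hire', 'patient': 'Smith', 'agent': 'Acme'}]
```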
Article
MetLife processes over 260,000 life insurance applications a year. Underwriting of these applications is labor intensive. Automation is difficult because the applications include many free-form text fields. MetLife's intelligent text analyzer (MITA) uses the information-extraction technique of natural language processing to structure the extensive textual fields on a life insurance application. Knowledge engineering, with the help of underwriters as domain experts, was performed to elicit significant concepts for both medical and occupational textual fields. A corpus of 20,000 life insurance applications provided the syntactical and semantic patterns in which these underwriting concepts occur. These patterns, in conjunction with the concepts, formed the frameworks for information extraction. An extension of the information-extraction work developed by Wendy Lehnert was used to populate these frameworks with classes obtained from the Systematized Nomenclature of Human and Veterinary Medicine (SNOMED) and the Dictionary of Occupational Titles ontologies. These structured frameworks can then be analyzed by conventional knowledge-based systems. MITA is currently processing 20,000 life insurance applications a month. Eighty-nine percent of the textual fields processed by MITA exceed the established confidence-level threshold and are potentially available for further analysis by domain-specific analyzers.
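The frame-filling and confidence-threshold behaviour described here can be pictured with a sketch like the one below. The concept table, identifiers, scores, and threshold value are all invented for illustration; MITA's actual extraction machinery and the SNOMED and Dictionary of Occupational Titles ontologies are far larger.

```python
# Illustrative sketch of mapping free-text fields to ontology classes and
# gating frames on a confidence threshold (all values are invented).

CONFIDENCE_THRESHOLD = 0.85   # assumed cut-off, not MetLife's actual value

# Toy mapping from textual cues to ontology classes with match confidences.
CONCEPTS = {
    "chest pain":   ("SNOMED:finding/chest-pain", 0.95),
    "diabetes":     ("SNOMED:disorder/diabetes-mellitus", 0.92),
    "truck driver": ("DOT:occupation/truck-driver", 0.90),
}

def fill_frame(field_text):
    """Map free-text cues in an application field to ontology classes."""
    slots, scores = {}, []
    text = field_text.lower()
    for cue, (ontology_class, confidence) in CONCEPTS.items():
        if cue in text:
            slots[cue] = ontology_class
            scores.append(confidence)
    return {"slots": slots, "confidence": min(scores) if scores else 0.0}

frame = fill_frame("Applicant reports chest pain; works as a truck driver.")
if frame["confidence"] >= CONFIDENCE_THRESHOLD:
    # Frame is reliable enough to pass to downstream knowledge-based analyzers.
    print(frame["slots"])
```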