Toward a Document Model for
Question Answering Systems
M. Pérez-Coutiño, T. Solorio, M. Montes-y-Gómez, A. López-López and L. Villaseñor-Pineda
Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE)
Luis Enrique Erro No. 1, Sta Ma Tonantzintla, 72840, Puebla, Pue, México.
{mapco,thamy,mmontesg,allopez,villasen}@inaoep.mx
Abstract. The problem of acquiring valuable information from the large amounts available today in electronic media requires automated mechanisms more natural and efficient than those already existing. The trend in the evolution of information retrieval systems is toward systems capable of answering specific questions formulated by the user in her/his own language. The expected answers from such systems are short and accurate sentences instead of long document lists. However, the state of the art of these systems focuses mainly on the resolution of factual questions, whose answers are named entities (dates, quantities, proper nouns, etc.). This paper proposes a model to represent the source documents that are then used by question answering systems. The model represents a document as a set of named entities (NEs) and their local lexical contexts. These NEs are extracted and classified automatically by an off-line process. The entities are then taken as instances of concepts in an upper ontology and stored as a set of DAML+OIL resources that can later be used by question answering engines. The paper presents a case study with a news collection in Spanish and some preliminary results.
Keywords: Question Answering, Ontology, Semantic Web, Named Entity
Classification.
1 Introduction
Technological advances have brought us the possibility of accessing large amounts of information automatically, either on the Internet or in specialized collections. However, such information becomes useless without appropriate mechanisms that help users find what they require when they need it. Traditionally, searching for information in unstructured or semi-structured sources has been performed by search engines that return a ranked list of documents containing all or some of the terms in the user's query. Such engines are incapable of returning a concise answer to a specific information request [4].
(This work was done while visiting the Dept. of Information Systems and Computation, Polytechnic University of Valencia, Spain.)
An alternative to information retrieval systems for resolving specific questions is Question Answering (QA) systems, which are capable of answering questions formulated by the user in natural language. Research in QA has increased as a result of the inclusion of QA evaluations as part of the Text REtrieval Conference (TREC, http://trec.nist.gov/) in 1999 and, more recently [5], of multilingual question answering as part of the Cross Language Evaluation Forum (CLEF, http://clef-qa.itc.it/).
The goal of QA systems is to respond to a natural language question stated by the user, replying with a concrete answer to the given question and, in some cases, a context for its validation. Current operational QA systems focus on factual questions [1, 15] that require a named entity (date, quantity, proper noun, locality, etc.) as response. For instance, the question "¿Dónde nació Benito Juárez?" (Where was Benito Juárez born?) demands as answer 'San Pablo Guelatao', a locality of Mexico. Several QA systems, such as [8, 14], use named entities at different degrees of refinement in order to find a candidate answer. Other systems, such as [3, 9], use ontologies and contextual patterns of named entities to represent knowledge about question and answer contents. Thus, it is clear that named entity identification plays a central role in the resolution of factual questions.
On this basis, we propose in this paper a model for the representation of the source
documents that are then used by QA systems. The proposed model represents text
documents as a set of classified named entities and their local lexical context (nouns
and verbs). The representation is automatically gathered by an off-line process that
generates instances of concepts from a top level ontology and stores them as resources
in DAML+OIL.
The rest of this paper is organized as follows: section two describes the proposed model, both at the conceptual and implementation levels; section three details the process of named entity extraction and classification; section four presents a case study answering some questions over a set of news documents in Spanish using the model; finally, section five presents our conclusions and discusses further work.
2 Model Description
The aim of modeling source documents for QA systems is to provide a preprocessed set of resources containing valuable information that makes it easier to accomplish answer retrieval and extraction tasks. An important feature of the proposed model is that it implies a uniform format for data sources; as mentioned in [1], "…it is also necessary that the data sources become more heterogeneous and of larger size…" Developing a document model makes it possible for several heterogeneous data sources to be expressed in a standardized format, or at least makes feasible the transformation and mapping between equivalent sources.
To reach these goals, the following key assumptions were made in developing the proposed model:
1. The collection of documents to be used by the QA system contains documents about facts, such as published news articles, without domain restriction.
2. The model must reuse an upper ontology in order to allow further refinement
and reasoning on the named entities.
3. The model must be encoded in some ontological language for the Semantic Web
in order to allow future applications such as specialized QA engines or web
agents that make use of the document representation, instead of the document
itself, to achieve their goals.
The next subsection details the conceptual and implementation levels of the model.
2.1 Conceptual Level
Figure 1 shows the model. At the conceptual level, a document is seen as a factual text object whose content refers to several named entities, even when it is focused on a central topic. Named entities can be one of these objects: persons, organizations, locations, dates and quantities. The model assumes that named entities are strongly related to their lexical context, especially to nouns (subjects) and verbs (actions). Thus, a document can be seen as a set of entities and their contexts. Moreover, each named entity can be further refined by means of ontologies [6]; this is the motivation for instantiating an upper-level ontology instead of developing one from scratch.
Figure 1. The proposed model. A document is a FactualText object that refers to named entities (Person, Locality and Organization as physical entities; Date and Quantity as abstract entities), each related to its local lexical context of nouns and verbs.
The model is based on the Suggested Upper Merged Ontology (SUMO, http://ontology.teknowledge.com:8080/) [7], an existing framework specifically designed to provide a basis for more specific domain ontologies. SUMO combines a number of top-level ontologies to achieve wide conceptual coverage, already has a strong basis of semiotic and linguistic concepts, and is being developed by an IEEE working group (http://suo.ieee.org/) that includes a number of experts from a variety of fields.
2.2 Implementation Level
As mentioned earlier, the model is implemented as a set of instances of SUMO concepts. The mapping between NEs and SUMO concepts, as well as the slots or axioms used, is shown in Table 1.
The "refers" and "cooccurs" slots allow the mapping between NEs and SUMO concepts to be refined. For instance, with an improved version of the extraction process, we could refer to a "City" instead of a "GeographicArea", or to a "Government" instead of an "Organization".
Table 1. Mapping between NEs and SUMO concepts.

NE             SUMO concept       Slot
(document)     FactualText        refers
Person         Human              cooccurs
Organization   Organization       cooccurs
Locality       GeographicArea     cooccurs
Date           TemporalRelation   refers
Quantity       Quantity           refers

FactualText is the top concept of the model; its refers slot means that a factual text may make reference to other entities (such as our NEs). Human, Organization and GeographicArea can be in co-occurrence with other entities (such as verbs and nouns). Date and Quantity are a special case because they are considered abstract entities in SUMO; their relation with other, physical entities is therefore established by the "refers" slot.
Context, in turn, is mapped to the SUMO concepts "Noun" and "Verb", in accordance with the information gathered from the SL tagger (see section 3). According to [1], the study of the effect of context in QA is one of the complex issues that requires formal models as well as experimentation in order to improve the performance of QA systems. The context considered in our preliminary experiments consists of the four verbs or nouns on each side (left and right) of the corresponding NE. Despite the fact that this parameter was chosen empirically, the results over the test collection are encouraging (see section 4.2).
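To make the context definition concrete, the following Python sketch collects, for each NE in a POS-tagged document, up to four nouns or verbs on each side. The token format, tag names and window handling are our own illustrative assumptions; they are not the actual interface of the SL tagger.

from typing import Dict, List, Tuple

# A tagged token: (lemma, pos, ne_class); ne_class is None for ordinary words.
# This token format is an assumption for illustration, not the SL tagger's real output.
Token = Tuple[str, str, str]

CONTENT_POS = {"NOUN", "VERB"}   # POS tags kept as lexical context
WINDOW = 4                       # four content words on each side of the NE

def extract_contexts(tokens: List[Token]) -> Dict[str, List[str]]:
    """Return, for every named entity, its local lexical context."""
    contexts: Dict[str, List[str]] = {}
    for i, (lemma, pos, ne_class) in enumerate(tokens):
        if ne_class is None:
            continue
        left = [t[0] for t in tokens[:i] if t[1] in CONTENT_POS][-WINDOW:]
        right = [t[0] for t in tokens[i + 1:] if t[1] in CONTENT_POS][:WINDOW]
        contexts.setdefault(lemma, []).extend(left + right)
    return contexts

if __name__ == "__main__":
    sample = [
        ("Cárdenas", "PROPN", "Person"), ("como", "ADP", None),
        ("presidente", "NOUN", None), ("de", "ADP", None),
        ("PNR", "PROPN", "Organization"), ("echar", "VERB", None),
        ("mano", "NOUN", None),
    ]
    print(extract_contexts(sample))
    # {'Cárdenas': ['presidente', 'echar', 'mano'], 'PNR': ['presidente', 'echar', 'mano']}

In the actual model, neighboring NEs themselves may also be kept as part of a context (as in Table 2 below); the sketch restricts itself to nouns and verbs for simplicity.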
Table 2 shows a subset of the instances collected from a sample document; each block corresponds to an instance of the concept given in its opening tag.
Table 2. An extract of SUMO instances gathered from a sample document.
<sumo:FactualText rdf:about="#010698-1Lunes">
  <sumo:refers rdf:resource="#Cárdenas"/>
  <sumo:refers rdf:resource="#PNR"/>
  <sumo:refers rdf:resource="#Tamaulipas"/>
  <sumo:refers rdf:resource="#1931"/>
</sumo:FactualText>
<sumo:Human rdf:about="#Cárdenas">
  <sumo:cooccur rdf:resource="#presidente"/>
  <sumo:cooccur rdf:resource="#PNR"/>
  <sumo:cooccur rdf:resource="#echar"/>
  <sumo:cooccur rdf:resource="#mano"/>
</sumo:Human>
<sumo:Organization rdf:about="#PNR">
  <sumo:cooccur rdf:resource="#presidente"/>
  <sumo:cooccur rdf:resource="#echar"/>
  <sumo:cooccur rdf:resource="#mano"/>
  <sumo:cooccur rdf:resource="#Ersatz"/>
  <sumo:cooccur rdf:resource="#democracia"/>
</sumo:Organization>
<sumo:GeographicArea rdf:about="#Tamaulipas">
  <sumo:cooccur rdf:resource="#gobierno"/>
  <sumo:cooccur rdf:resource="#subir"/>
  <sumo:cooccur rdf:resource="#partido"/>
  <sumo:cooccur rdf:resource="#Partido_Social_Fronterizo"/>
</sumo:GeographicArea>
<sumo:TemporalRelation rdf:about="#1931">
  <sumo:refers rdf:resource="#echar"/>
  <sumo:refers rdf:resource="#mano"/>
  <sumo:refers rdf:resource="#Ersatz"/>
  <sumo:refers rdf:resource="#democracia"/>
  <sumo:refers rdf:resource="#vez"/>
  <sumo:refers rdf:resource="#selección"/>
  <sumo:refers rdf:resource="#candidato"/>
  <sumo:refers rdf:resource="#gobernador"/>
</sumo:TemporalRelation>
<sumo:Verb rdf:about="#echar"/>
<sumo:Verb rdf:about="#subir"/>
<sumo:Noun rdf:about="#presidente"/>
<sumo:Noun rdf:about="#gobierno"/>
<sumo:Noun rdf:about="#partido"/>
<sumo:Noun rdf:about="#Partido_Social_Fronterizo"/>
<sumo:Noun rdf:about="#mano"/>
<sumo:Noun rdf:about="#Ersatz"/>
<sumo:Noun rdf:about="#democracia"/>
<sumo:Noun rdf:about="#vez"/>
<sumo:Noun rdf:about="#selección"/>
<sumo:Noun rdf:about="#candidato"/>
<sumo:Noun rdf:about="#gobernador"/>
3 Extraction Process
This section describes the NE tagger used to extract the entities and their contexts that represent the documents. The same tagger is also used to extract NEs from the questions, which helps us exploit the representation model for question resolution. As mentioned earlier, this extraction process is performed off-line. Once the entities and their contexts have been extracted, they are taken as instances of an upper-level ontology as described in section 2.2.
An NE is a word or sequence of words that falls into one of five categories: names of persons, organizations, locations, dates and quantities. There has been a considerable amount of work aimed at developing NE taggers with human-level performance. However, this goal is difficult to achieve due to a problem common to all natural language processing tasks: ambiguity. Another difficulty is that documents are not uniform; their writing style and vocabulary change dramatically from one collection to another.
The NE tagger used in this work is the one proposed by [11]. It is based on training a Support Vector Machine (SVM) [10, 12, 13] classifier using as features the outputs of a handcrafted system together with information acquired automatically from the document, such as Part-of-Speech (POS) tags and capitalization information. The goal of this method is to reduce the effort of adapting a handcrafted NE extractor to a new domain. Instead of redesigning the grammars or regular expressions and revising the lists of trigger words and gazetteers, we only need to build a training set by correcting, when needed, the outputs of the handcrafted system.
The tagger of Solorio and López (SL) treats the underlying handcrafted system as a black box; in particular, the system developed by [2] was used. That system classifies the words in the documents into the following six categories: Persons, Organizations, Locations, Dates, Numeric Expressions, and "none of the above". Each word in the documents then has as features the output of the handcrafted system, its POS tag and its capitalization information (first letter capitalized, all letters capitalized, etc.). A previously trained SVM assigns the final NE tags using the features mentioned above. This process can be considered a stacked classifier: in the first stage a handcrafted system assigns NE tags to the document, and these tags (corrected if necessary) are then used as inputs to the SVM classifier, which decides the final NE tags.
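A minimal sketch of this stacking scheme, written with scikit-learn instead of the SVM implementation of [10, 12, 13], is given below. Each token is described by the handcrafted system's tag, its POS tag and a capitalization feature, and a linear SVM learns the corrected NE class; the feature names and the toy training data are illustrative assumptions only.

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def features(hs_tag: str, pos: str, word: str) -> dict:
    """Features of one token: handcrafted tag, POS tag and capitalization shape."""
    return {
        "hs_tag": hs_tag,
        "pos": pos,
        "cap": "all_caps" if word.isupper()
               else "init_cap" if word[:1].isupper()
               else "lower",
    }

# Toy training data: (handcrafted tag, POS, word) -> corrected NE tag.
train = [
    (("Organization", "PROPN", "Irán"), "Location"),
    (("Person", "PROPN", "Aeroméxico"), "Organization"),
    (("Location", "PROPN", "Ruanda"), "Location"),
    (("Organization", "PROPN", "OTAN"), "Organization"),
]

X = [features(*x) for x, _ in train]
y = [label for _, label in train]

# Stacked classifier: the SVM corrects the handcrafted system's output.
clf = make_pipeline(DictVectorizer(), LinearSVC())
clf.fit(X, y)

print(clf.predict([features("Organization", "PROPN", "Estados_Unidos")]))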
To show how this NE tagger performs, we present a comparison between the handcrafted system and the tagger of Solorio and López. Table 3 shows the results of tagging questions that can be answered using the model proposed here; only the named entities from the questions are shown. As can be seen, the SL tagger improves on the accuracy of the handcrafted system: in this example, it corrects six tags that were originally misclassified by the handcrafted system.
Table 3. Comparison between the handcrafted system (HS) and that of Solorio and López (SL). Cases where the HS tagger misclassifies an NE that the SL tagger classifies correctly can be identified by comparing each column with the true tag. The asterisk (*) marks cases where the SL tagger misclassifies an NE correctly classified by the HS tagger.
Named Entity HS tags SL tags True NE tag
Unión_de_Cineastas_de_Rusia Organization Organization Organization
Director_de_Aeroméxico* Person Organization Person
Irán Organization Location Location
Copa_Mundial_de_Fútbol Person Organization Organization
Irán Organization Location Location
Irán-Estados_Unidos* Location Organization Location
Aeroméxico Person Organization Organization
Aeroméxico Person Organization Organization
Ruanda Location Location Location
Mundial_Francia Organization Organization Organization
Consejo_de_Ministros_de_Líbano Organization Organization Organization
OTAN Organization Organization Organization
Estados_Unidos Organization Location Location
Accuracy 53% 84%
4 Case Study
This section presents a schema for applying the proposed model in an experimental, and still simple, QA system. In this case the search process uses only the information considered by the model. The algorithm shows the appropriateness of the representation for finding answers to factual questions. The following subsection describes the general algorithm and its application to a sample collection of news in Spanish. Owing to space limitations, no implementation details are given.
4.1 The Algorithm
The algorithm is based on two key assumptions:
First, the kind of question defines the class of NE to search for. Generally speaking, factual questions do not rely on the predicate of the sentence, but on the subject, on the characteristics of the question, or on some other sentence element. In this way, from the interrogative adverb (wh-word) employed in the question, it is possible to infer the role of the NE required as an answer. For instance, "¿Quién es el presidente de México?" (Who is the president of Mexico?) requires as answer an NE of the class person (human). Of course, not all interrogative adverbs define the kind of NE of the answer, e.g. "¿Cuál es el nombre del presidente de México?" (What is the name of the president of Mexico?). For now, the algorithm is restricted to partial interrogative questions whose answer role can be immediately identified from the interrogative adverb employed.
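For the questions handled so far, this first assumption reduces to a small lookup from the interrogative adverb to the expected NE class, roughly as sketched below; the word list is an illustrative guess, not an exhaustive inventory.

# Expected NE class by interrogative adverb (illustrative, Spanish).
ANSWER_CLASS = {
    "quién": "Person",      # who
    "quiénes": "Person",
    "dónde": "Locality",    # where
    "cuándo": "Date",       # when
    "cuánto": "Quantity",   # how much / how many
    "cuántos": "Quantity",
    "cuántas": "Quantity",
}

def expected_ne_class(question: str) -> str:
    """Return the NE class suggested by the question's interrogative adverb."""
    first_word = question.strip("¿ ").split()[0].lower()
    return ANSWER_CLASS.get(first_word, "Unknown")

print(expected_ne_class("¿Quién era el presidente del PNR en 1931?"))  # Person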
Second, two kinds of information can be extracted from the question itself: its NEs and its lexical context. With the proposed model, all the NEs mentioned in a given document are known beforehand. Thus, the NEs from the question become key elements for defining the subset of documents most likely to provide the answer. For instance, in any of the sample questions above, the NE "Mexico" narrows the set of documents to only those containing that NE. A further assumption is that the context in the neighborhood of the answer has to be similar to the lexical context of the question. Again from the sample question, the fragment "even before his inauguration as president of Mexico, Vicente Fox…" contains a lexical context near the answer which is similar to that of the question.
The algorithm in detail is as follows (a sketch of the whole procedure is given after the list):
1. Identify the type of NE-answer for a given question. We are limited to the set of NE classes in the model (persons, organizations, localities, dates & times, and quantities).
2. Extract the NEs contained in the question and, starting from them, identify the appropriate document subset.
3. Retrieve all candidate NEs and their local lexical contexts (as detailed by the model) from the documents identified in step 2.
4. Compute the similarity between the question context and those of the candidate NEs.
5. Rank the candidate NEs in decreasing order of similarity.
6. Report the top NEs as possible answers.
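The Python sketch below puts the six steps together over the pre-extracted data of the model. The in-memory structures (a per-document list of NEs with their classes and contexts) and the overlap-based similarity are assumptions chosen to be consistent with the worked example of section 4.2, not the actual implementation.

from typing import Dict, List, Set, Tuple

# Pre-extracted model: doc_id -> list of (ne_name, ne_class, context terms).
DocModel = Dict[str, List[Tuple[str, str, Set[str]]]]

def answer(question_class: str, question_nes: Set[str],
           question_context: Set[str], model: DocModel, top_n: int = 5):
    # Step 2: keep only documents mentioning every NE of the question.
    docs = [d for d, nes in model.items()
            if question_nes <= {name for name, _, _ in nes}]
    # Step 3: candidate NEs of the expected class in those documents.
    candidates = [(name, ctx) for d in docs
                  for name, cls, ctx in model[d] if cls == question_class]
    # Steps 4-5: rank by context overlap with the question.
    ranked = sorted(candidates,
                    key=lambda c: len(c[1] & question_context), reverse=True)
    # Step 6: report the top candidates.
    return [name for name, _ in ranked[:top_n]]

if __name__ == "__main__":
    model = {"0": [("PNR", "Organization", {"presidente", "echar", "mano"}),
                   ("1931", "Date", {"echar", "mano", "vez"}),
                   ("Cárdenas", "Person", {"presidente", "PNR", "echar", "mano"}),
                   ("Portes_Gil", "Person", {"creador", "año", "plebiscito"})]}
    print(answer("Person", {"PNR", "1931"},
                 {"ser", "presidente", "PNR", "1931"}, model))
    # ['Cárdenas', 'Portes_Gil']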
4.2 Results
This subsection shows the application of the algorithm just described to a small collection of news in Spanish. The News94 collection consists of 94 news articles (see Table 4 for collection details) containing national and international news from the years 1998 to 2000. Regarding extracted information, the total number of NEs obtained from this collection was 3,191 (Table 4 also shows totals by class).
Table 4. Main data of the News94 collection.

Collection size   Documents   Avg. document size   Pages   Lexical forms   Terms
372 Kb            94          3.44 Kb              124     11,562          29,611

Entities by class:
Date & Time   Locality   Organization   Person   Quantity   Others
266           570        1,094          973      155        133
The processing of the question "¿Quién era el presidente del PNR en 1931?" (Who was the president of the PNR in 1931?) was done as follows:
1. Identify the class of the NE to search for. Given that the interrogative adverb is "quién" (who), the class is person (human).
2. Extract the NEs in the question. These are: PNR (Organization), present in the document set {0, 13, 86}; and 1931 (Date), found in the document set {0}. As a consequence, the resulting document subset is {0}.
3. Retrieve all NEs of class person (human) from document '0'.
4. Compute the similarity between the question context, that is {ser (be), presidente (president), PNR, 1931}, and the candidate NE contexts. Table 5 shows the computed similarity.
5 and 6. Report {Cárdenas, Cárdenas_Presidente}.
Table 5. Candidate NEs, their contexts and similarity.

Context                                                                NE                        Sim.
{creador, 30, año, plebiscito, añoranza, embellecer, muerte}           Portes_Gil                0
{presidente, PNR, echar, mano}                                         Cárdenas                  2
{prm, fundar, carácter, otorgar, ser, forma, permanecer, candidatura}  Cárdenas_Presidente       1
{arribar, 1964, pri, presidencia, Carlos, juventud, líder, camisa}     Madrazo                   0
{Madrazo, arribar, 1964, pri, juventud, líder, traer}                  Carlos                    0
{pri, faltar, mano, gato}                                              Madrazo                   0
{zurrar, Sinaloa, Madrazo, corto}                                      Polo_Sánchez_Celis        0
{Sinaloa, zurrar, Polo_Sánchez_Celis, corto, 11, mes, salir}           Madrazo                   0
{pierna, cola, pri, salir, diputado, pelea, deber}                     Polo_Martínez_Domínguez   0
In this example, "Cárdenas" is the correct answer; the original text passage is shown in Table 6.
Table 6. Passage with the answer to the sample question.
"Cárdenas como presidente del PNR echó mano del Ersatz de democracia en 1931 por vez primera en la selección de candidatos a gobernadores" (Cárdenas, as president of the PNR, made use of the Ersatz of democracy in 1931 for the first time in the selection of candidates for governor).
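The similarity values in Table 5 are consistent with a plain count of terms shared between the question context and each candidate context; assuming that measure, the snippet below reproduces three of the rows.

# Question context for "¿Quién era el presidente del PNR en 1931?"
question = {"ser", "presidente", "PNR", "1931"}

# A few candidate contexts taken from Table 5.
candidates = {
    "Portes_Gil": {"creador", "30", "año", "plebiscito", "añoranza",
                   "embellecer", "muerte"},
    "Cárdenas": {"presidente", "PNR", "echar", "mano"},
    "Cárdenas_Presidente": {"prm", "fundar", "carácter", "otorgar", "ser",
                            "forma", "permanecer", "candidatura"},
}

# Similarity as the number of shared terms (set overlap).
for name, context in candidates.items():
    print(name, len(context & question))
# Portes_Gil 0, Cárdenas 2, Cárdenas_Presidente 1 (matching Table 5)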
Table 7 shows a subset of the questions used in our preliminary experiments. A total of 30 questions were proposed by 5 assessors for this experiment; of these, only the 22 classified as factoid were evaluated. Results show that for 55% of the questions the answer is found in the first NE, and that 82% of the questions are correctly answered within the top five NEs.
Despite the informal evaluation of the algorithm and the small size of the collection, we find these results very encouraging, hinting at the appropriateness of the proposed model and the likely robustness of the QA algorithm.
Table 7. Subset of test questions.

¿Quién es el presidente de la Unión de Cineastas de Rusia?
(Who is the president of the Moviemakers Union of Russia?)
    1st answer: El Barbero de Siberia. Correct answer: Nikita Mijalkov.
¿Quién ha impulsado el desmantelamiento del presidencialismo?
(Who encouraged the dismantling of presidentialism?)
    1st answer: Zedillo. Correct answer: Zedillo.
¿Cuándo calificó por última vez Irán para una Copa Mundial de Futbol?
(When was the last time Iran qualified for a Soccer World Cup?)
    1st answer: 1978. Correct answer: 1978.
¿Quién es el presidente de Irán?
(Who is the president of Iran?)
    1st answer: Muhamad Khatami. Correct answer: Muhamad Khatami.
¿Quién es la dirigente del sindicato de sobrecargos de Aeroméxico?
(Who is the leader of the flight attendants' union of Aeroméxico?)
    1st answer: Carlos_Ruíz_Sacristán. Correct answer: Alejandra Barrales Magdaleno.
¿Cuántas personas fueron asesinadas en Ruanda durante 1994?
(How many people were murdered in Rwanda during 1994?)
    1st answer: 500,000. Correct answer: Más de 500 mil (more than 500 thousand).
¿Cuántos jugadores de futbol participarán en el Mundial de Futbol Francia 1998?
(How many soccer players will participate in the 1998 World Soccer Cup in France?)
    1st answer: ------. Correct answer: 704.
¿Cuándo aprobó el senado la ampliación de la OTAN?
(When did the senate approve the expansion of NATO?)
    1st answer: 30 de abril. Correct answer: 30 de abril de 1998.

Correct answer in the first NE: 55%
Correct answer within the top-5 NEs: 82%
5 Conclusions
The proposed model can be an initial step toward document representation for specific tasks such as QA, as detailed above. The representation is functional because it captures valuable information that allows the retrieval and extraction processes for QA to be performed in an easier and more practical way. Some important features of the model are that it considers a broader classification of NEs, which improves the precision of the system, and that it accelerates the whole process by searching only within the named entity class believed to contain the answer, instead of searching for the answer in the whole document.
Besides, the representation is expressed in a standardized language, DAML+OIL (and soon it could be OWL), in the direction of the next Web generation. This could lead to the exploitation of this document representation in multilingual QA settings, either in stand-alone collections or on the Semantic Web.
Preliminary results exploiting the document representation proposed by the model were very encouraging. The context similarity assessment method still has to be refined, and additional information can be taken into account, e.g. proximity. We are also in the process of experimenting with larger text collections and question sets supplied by international conferences on question answering such as TREC and CLEF. In further developments of this model we intend to refine the classification of named entities in order to take full advantage of the ontology.
Acknowledgements. This work was done under partial support of CONACYT
(Project Grant U39957-Y), SNI-Mexico, and the Human Language Technologies
Laboratory of INAOE.
References
1. Burger, J. et al. Issues, Tasks and Program Structures to Roadmap Research in
Question & Answering (Q&A). NIST 2001.
2. Carreras, X. and Padró, L. A Flexible Distributed Architecture for Natural
Language Analyzers. In Proceedings of the LREC’02, Las Palmas de Gran
Canaria, Spain, 2002.
3. Cowie J., et al. Automatic Question Answering. Proceedings of the International Conference on Multimedia Information Retrieval (RIAO 2000), 2000.
4. Hirschman L. and Gaizauskas R. Natural Language Question Answering: The View from Here. Natural Language Engineering 7, 2001.
5. Magnini B., Romagnoli S., Vallin A., Herrera J., Peñas A., Peinado V., Verdejo F.
and Rijke M. The Multiple Language Question Answering Track at CLEF 2003.
CLEF 2003 Workshop, Springer-Verlag.
6. Mann, G.S. Fine-Grained Proper Noun Ontologies for Question Answering,
SemaNet'02: Building and Using Semantic Networks, 2002.
7. Niles, I. and Pease A., Toward a Standard Upper Ontology, in Proceedings of the
2nd International Conference on Formal Ontology in Information Systems (FOIS-
2001), 2001.
8. Prager J., Radev D., Brown E., Coden A. and Samn V. The Use of Predictive
Annotation for Question Answering in TREC8. NIST 1999.
9. Ravichandran D. and Hovy E. Learning Surface Text Patterns for a Question
Answering System. In ACL Conference, 2002.
10. Schölkopf, B. and Smola A.J. Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond, MIT Press, 2001.
11. Solorio, T. and López López A. Learning Named Entity Classifiers using Support
Vector Machines, CICLing 2004, LNCS Springer-Verlag, Feb. 2004, (to appear).
12. Stitson, M.O., Weston J.A.E., Gammerman A., Vovk V., and Vapnik V. Theory of Support Vector Machines. Technical Report CSD-TR-96-17, Royal Holloway University of London, England, December 1996.
13. Vapnik, V. The Nature of Statistical Learning Theory, Springer, 1995.
14. Vicedo, J.L., Izquierdo R., Llopis F. and Muñoz R., Question Answering in
Spanish. CLEF 2003 Workshop, Springer-Verlag.
15. Vicedo, J.L., Rodríguez, H., Peñas, A. and Massot, M. Los sistemas de Búsqueda
de Respuestas desde una perspectiva actual. Revista de la Sociedad Española para
el Procesamiento del Lenguaje Natural, n.31, 2003.