QA@L2F, First Steps at QA@CLEF.
Ana Mendes, Luísa Coheur, Nuno J. Mamede
Ricardo Ribeiro, Fernando Batista, David Martins de Matos
Rua Alves Redol, 9, 1000-029 Lisboa, Portugal
Abstract. This paper presents QA@L2F, the question-answering system
developed at L2F, INESC-ID. QA@L2F follows different strategies according to
the question type, and relies strongly on named entity recognition and on the
pre-detection of linguistic patterns. Each question type is mapped into a single
strategy; however, if no answer is found, the system proceeds and tries to find
an answer using one of the other strategies.
In this paper we present QA@L2F, the question-answering system from L2F,
INESC-ID, as well as the results obtained at CLEF 2007.
In general terms, we can say that QA@L2F executes the following tasks:
– Information Extraction: information sources are processed in order to extract
potentially relevant information (such as named entities or relations between
concepts), which is stored in a database;
– Question Interpretation: the question is interpreted and mapped into an SQL
query;
– Answer Finding: according to the question type, different strategies are
followed in order to find the answer.
Regarding information extraction, if a QA system focuses on a particular
domain, or if it is going to be used in an evaluation where the information
sources are known, it makes sense to process all that information off-line in
order to gather potentially relevant information. Thus, for the CLEF
competition, QA@L2F pre-processes the available corpora and obtains structured
information (such as named entities or noun phrases) that might be the answer to
potential questions. This task is performed by many systems, for instance Senso [14,
Either to extract information or to interpret the question, some systems use
natural language processing techniques [5,8]; some perform named entity
recognition and co-reference resolution. Many systems also make use of thesauri
[2,5,8,9] or ontologies [14,17]. The Internet may also be used as a resource [4,6].
QA@L2F relies on a Natural Language Processing (NLP) chain, which performs
morpho-syntactic analysis, named entity recognition and shallow semantic
analysis based on the named entities [10,16]. This NLP chain uses the following
tools:
– Palavroso, responsible for the morphological analysis, and MARv for
– Rudrico (an improved version of PAsMo), which not only recognizes
multi-word terms and collapses them into single tokens, but also splits tokens;
– XIP, which returns the input organized in chunks and connected by de-
This chain is used both in the information extraction step and in question
interpretation.
In order to find the answer, systems such as INAOE focus on the question type
and follow different strategies according to it. QA@L2F also applies different
strategies depending on the question type. However, if no answer is found, the
system relaxes its constraints and tries to find an answer using one of the
other strategies. Typically, several snippets are answer candidates and the QA
system has to choose one of them. Although there are systems such as QUASAR that
combine frequency with the confidence given to both the answer candidate and the
text passage in which the answer can be found, many systems choose the most
frequent of all possible answers. QA@L2F uses a confidence level in one of its
strategies; all the others take only frequency into consideration.
This paper is organized as follows: section 2 focuses on the information
extraction step; section 3 details the question interpretation; section 4
describes the different methods used to find the answer; section 5 presents and
discusses the evaluation results; finally, section 6 concludes and points to
future work.
2 Information Extraction
In order to extract information from newspaper corpora, a morpho-syntactic
analysis is used to identify named entities, such as PEOPLE, which refers to
people's names, CULTURE, to works of art, and TITLE, to people's professions
and titles. With this information, together with a set of manually built
linguistic patterns, relations between concepts are captured by the same NLP
chain and stored in a database (from now on, the “relation-concepts” database).
Every named entity recognized is also stored in a database (from now on, the
“named entities” database).
For instance, consider the sentence “Land and Freedom, de Ken Loach, evocação
da Guerra Civil Espanhola” (“Land and Freedom, by Ken Loach, an evocation
of the Spanish Civil War”). This piece of information may hold the answer to
the question “Who directed Land and Freedom?”. Therefore, by using linguistic
patterns, the entry in the relation-concepts database shown in Table 1 is built.
1 As should be clear, in both cases a reference to the text snippet holding
those relations/entities is also kept.
1 Land and Freedom Ken Loach
culture author confidence count
Table 1. Entry representing the relation between Ken Loach and Land and Freedom.
Note that these relation-concepts tables hold the confidence given to each
relation. It depends on the confidence level given to the linguistic patterns,
which is assigned manually. Note also that “count” represents the frequency of
the relation in the processed corpus.
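The bookkeeping behind such an entry can be sketched as follows. This is only an illustration under stated assumptions: the table name culture_author, the use of SQLite, the confidence values and the rule of keeping the highest confidence when a relation is seen again are all hypothetical, since the paper only states that each relation carries a confidence and a count.

```python
import sqlite3

# Hypothetical schema: the column names follow Table 1, everything else is assumed.
SCHEMA = """
CREATE TABLE culture_author (
    culture    TEXT NOT NULL,
    author     TEXT NOT NULL,
    confidence REAL NOT NULL,
    count      INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (culture, author)
)
"""


def store_relation(conn, culture, author, pattern_confidence):
    """Insert a relation or, if it is already known, increase its frequency."""
    cur = conn.execute(
        "UPDATE culture_author "
        "SET count = count + 1, confidence = MAX(confidence, ?) "
        "WHERE culture = ? AND author = ?",
        (pattern_confidence, culture, author),
    )
    if cur.rowcount == 0:
        # First time this relation is captured by a linguistic pattern.
        conn.execute(
            "INSERT INTO culture_author (culture, author, confidence, count) "
            "VALUES (?, ?, ?, 1)",
            (culture, author, pattern_confidence),
        )


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# The same relation captured twice, by patterns with made-up confidence levels.
store_relation(conn, "Land and Freedom", "Ken Loach", 0.8)
store_relation(conn, "Land and Freedom", "Ken Loach", 0.6)
row = conn.execute(
    "SELECT culture, author, confidence, count FROM culture_author"
).fetchone()
print(row)
```

Keeping the count separate from the confidence lets the answer finding step sort candidates by either attribute.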
As for Wikipedia, QA@L2F used the WikiXML collection provided by the
Information and Language Processing Systems group at the Informatics Institute,
University of Amsterdam, as well as its database structure2. A new table
containing only the XML article nodes from every Wikipedia page, with no
linguistic processing, was also created. The aim was to answer definition
questions.
In QA@L2F, the question interpretation step is responsible for the
transformation of the question into a SQL query.
The question is processed by: a) the NLP chain, which recovers the type of
the question, as well as other information considered relevant (such as named
entities and the question focus); b) a SQL generator.
Considering the question “Quem é Boaventura Kloppenburg?” (“Who is Boaventura
Kloppenburg?”), after the NLP chain, both the type (WHO PEOPLE) and
the focus (Boaventura Kloppenburg) of the question are identified:
<PARAMETER ind="0" num="11" word="Boaventura Kloppenburg"/>
The SQL generator comprises the steps shown in Figure 1.
Fig. 1. SQL generation.
The frame builder is responsible for choosing:
– the answer extraction script to be called next (depending on the type of the
question);
– the question focus;
– all the named entities identified in the question.
The SQL generation is performed by a set of scripts that map the frames
into a SQL query.
Considering the previous example, the following frame is built:
"Boaventura Kloppenburg" PEOPLE
This frame is then mapped into the following MySQL query, which may
retrieve the answer to the question:
GROUP BY confidence DESC, count DESC
title, confidence, count
The “relation-concepts” database is queried and every title (or profession)
connected with Boaventura Kloppenburg is retrieved, in descending order of
confidence and frequency.
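The mapping from a frame to a query can be sketched as below. This is a minimal sketch, not the scripts used by QA@L2F. The table name people_title, its key column, the TABLES mapping and the placeholder rows are hypothetical. It also uses SQLite instead of MySQL so that it runs without a server, and an explicit ORDER BY clause to obtain the descending order of confidence and frequency.

```python
import sqlite3

# Hypothetical mapping from the type of the question focus to
# (table, key column, answer column); none of these names come from the paper.
TABLES = {"PEOPLE": ("people_title", "people", "title")}


def frame_to_query(focus, focus_type):
    """Build a parameterized query that retrieves candidate answers for a frame."""
    table, key_col, answer_col = TABLES[focus_type]
    sql = (
        f"SELECT {answer_col}, confidence, count FROM {table} "
        f"WHERE {key_col} = ? "
        "ORDER BY confidence DESC, count DESC"
    )
    return sql, (focus,)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people_title "
    "(people TEXT, title TEXT, confidence REAL, count INTEGER)"
)
# Placeholder rows, made up only to show the ordering.
conn.executemany(
    "INSERT INTO people_title VALUES (?, ?, ?, ?)",
    [
        ("Boaventura Kloppenburg", "placeholder title A", 0.5, 7),
        ("Boaventura Kloppenburg", "placeholder title B", 0.9, 1),
        ("Boaventura Kloppenburg", "placeholder title C", 0.9, 3),
    ],
)

sql, params = frame_to_query("Boaventura Kloppenburg", "PEOPLE")
titles = [row[0] for row in conn.execute(sql, params)]
print(titles)
```

The focus is passed as a query parameter rather than concatenated into the SQL string, which avoids quoting problems with names containing apostrophes.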
QA@L2F has a set of answer finding strategies. Within this set, the system
has a preferred one to apply to each question, depending on its type. The
system expects this strategy to give the correct answer.
As an example, if the submitted question can be answered directly using
the “relation-concepts” database, the system will just query that database. If
not, the system adopts one of the following strategies, depending on the type of
the question:
– Linguistic Reordering: the answer is searched for in Wikipedia, after
reordering some of the question elements;
– Named Entities Matching: the answer is searched for in the named entities
database;
– Brute Force plus NLP: some text snippets are chosen and processed at run-time;
the obtained information gives QA@L2F a last chance to find an answer.
After detecting the type of question, one of these strategies is followed. If no
answer is found, the system tries to answer it by using another strategy.
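This fallback behaviour can be sketched as a simple loop over strategies. The sketch assumes that each strategy is a function returning either an answer or None; the function names and the toy strategies are hypothetical and only stand in for the real ones.

```python
from typing import Callable, Optional, Sequence, Tuple

# A strategy takes the question and returns an answer, or None if it finds nothing.
Strategy = Callable[[str], Optional[str]]


def find_answer(
    question: str, preferred: Strategy, fallbacks: Sequence[Strategy]
) -> Tuple[Optional[str], Optional[str]]:
    """Try the preferred strategy first, then the remaining ones in order."""
    for strategy in (preferred, *fallbacks):
        result = strategy(question)
        if result is not None:
            return result, strategy.__name__
    return None, None  # no strategy produced an answer


# Toy strategies, used only to exercise the control flow.
def direct_query(question: str) -> Optional[str]:
    return None  # pretend the relation was not captured off-line


def linguistic_reordering(question: str) -> Optional[str]:
    return "toy answer" if question.startswith("Quem foi") else None


answer, used = find_answer("Quem foi Pirro?", direct_query, [linguistic_reordering])
print(answer, used)
```

Returning the name of the strategy that succeeded makes it possible to count, afterwards, how many answers came from the relaxing mechanism.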
Allowing the system to jump to another strategy when the first one applied
does not succeed implicitly makes it relax its constraints: it applies a
strategy even if it is not the one on which it relies the most to use on
4.1 Linguistic Reordering
This strategy is used mainly for answering definition questions, like Quem foi
Pirro? (Who was Pirro?) and O que é a Igreja Maronita? (What is the Maronite
Church?), or list questions, like Diga uma escritora sarda. (Mention a Sardinian
writer).
QA@L2F uses Wikipedia in order to answer that group of questions. First,
the question interpretation step recovers the question focus (Pirro, Igreja
Maronita and escritora sarda, considering the above examples). Then, it performs
a search over the articles and applies the patterns inferred from the question
structure to find the answer.
For definition questions, patterns are of the form: question focus plus the
inflected verb to be. For instance, Pirro foi... (Pirro was...) or Igreja
Maronita é... (Maronite Church is...). For list questions, on the other hand,
the patterns are of the form: the inflected verb to be plus the question focus.
For instance, ...é uma escritora sarda (...is a Sardinian writer).
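Both pattern shapes can be approximated with regular expressions, as in the sketch below. It is only an approximation under stated assumptions: the list of inflected forms of the verb to be (SER_FORMS), the optional article and the two example sentences are ours, and the real patterns are applied over Wikipedia articles rather than over single strings.

```python
import re

# Assumed, non-exhaustive list of inflected forms of the Portuguese verb "ser".
SER_FORMS = ("é", "foi", "era", "são", "foram")
VERB = "|".join(SER_FORMS)


def definition_pattern(focus):
    """Question focus followed by the inflected verb: 'Pirro foi ...'."""
    return re.compile(
        rf"\b{re.escape(focus)}\s+(?:{VERB})\s+(?P<answer>[^.]+)",
        re.IGNORECASE,
    )


def list_pattern(focus):
    """Inflected verb followed by the question focus: '... é uma escritora sarda'."""
    return re.compile(
        rf"(?P<answer>[^.,;]+?)\s+(?:{VERB})\s+(?:(?:um|uma)\s+)?{re.escape(focus)}\b",
        re.IGNORECASE,
    )


# Illustrative sentences written for this sketch, not taken from the corpus.
definition = definition_pattern("Pirro").search("Pirro foi um rei do Epiro.")
listed = list_pattern("escritora sarda").search("Grazia Deledda é uma escritora sarda.")
print(definition.group("answer"))
print(listed.group("answer"))
```

In the definition case the answer is what follows the verb; in the list case it is what precedes it, which is why the two patterns capture opposite sides of the match.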
This strategy is also used for questions for which the system could not find an
answer using the linguistic patterns matching technique. Consider, for instance,
the question Quem foi Ésquilo? (“Who was Aeschylus?”). The relation between
Ésquilo and his title was not captured using linguistic patterns. Thus, the
system searched Wikipedia for the page having Ésquilo as its title. The
definition of Ésquilo, a Greek tragic poet, was found by processing the first
line of this Wikipedia article and, finally, returned as the answer to the
question.
4.2 Named Entities Matching
This method queries the named entities database. A set of text snippets
containing the named entities of the question is retrieved.
For instance, during the question interpretation of Quem sucedeu a Augusto?
(Who succeeded Augustus?), the following frame was built:
AUXILIARES "sucedeu" ACTION "a Augusto"
"Augusto " PEOPLE
With this information, QA@L2F searches the database for snippets containing
the named entity Augusto, of type PEOPLE, and the words sucedeu and a Augusto.
Since the last two are not classified as named entities, the system performs a
full-text query against the text snippets. The system gathers all the named
entities of types PEOPLE and PROPER (name) in those snippets, ranks them by
frequency and returns the most frequent one. Because the system discards every
candidate answer matching any word in the built frame, the named entity Augusto
is not chosen as the final answer.
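The selection step can be sketched as a frequency count over the entities found in the retrieved snippets. The data layout (one list of (text, type) pairs per snippet), the word-level matching against the frame and the annotated snippets are assumptions made for the example; they are not the actual database contents.

```python
from collections import Counter


def most_frequent_entity(snippets, wanted_types, frame_phrases):
    """Return the most frequent entity of a wanted type that shares no word with the frame."""
    banned = {word.lower() for phrase in frame_phrases for word in phrase.split()}
    counts = Counter()
    for entities in snippets:
        for text, ne_type in entities:
            if ne_type not in wanted_types:
                continue
            # Discard candidates matching any word of the built frame.
            if any(token.lower() in banned for token in text.split()):
                continue
            counts[text] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Made-up named entity annotations for three retrieved snippets.
snippets = [
    [("Augusto", "PEOPLE"), ("Tibério", "PEOPLE")],
    [("Tibério", "PEOPLE"), ("Roma", "LOCATION")],
    [("Augusto", "PEOPLE"), ("Calígula", "PEOPLE")],
]
frame = ["sucedeu", "a Augusto", "Augusto "]

best = most_frequent_entity(snippets, {"PEOPLE", "PROPER"}, frame)
print(best)
```

Without the filter on frame words, Augusto would tie with the correct candidate in this toy data, which illustrates why the question's own entities have to be discarded.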
4.3 Brute-Force plus NLP
If none of the previously described strategies finds an answer, the system
performs a full-text query against the raw text snippets database, returning the
ten best-ranked snippets. Those snippets are processed by the NLP chain and
the most frequent concept matching the wanted answer type is returned.
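The retrieval half of this strategy can be illustrated with a naive ranking function. Note the substitution: the system relies on the full-text search of its database, whereas this sketch simply counts occurrences of the query terms in each snippet. The function name, the scoring rule and the snippets are hypothetical.

```python
import re


def top_snippets(snippets, query_terms, limit=10):
    """Rank snippets by the number of query-term occurrences and keep the best ones."""
    terms = {term.lower() for term in query_terms}
    scored = []
    for snippet in snippets:
        tokens = re.findall(r"\w+", snippet.lower())
        score = sum(1 for token in tokens if token in terms)
        if score > 0:
            scored.append((score, snippet))
    # The sort is stable, so snippets with equal scores keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet for _, snippet in scored[:limit]]


# Made-up raw snippets.
raw = [
    "O tempo estava bom.",
    "Augusto nasceu em Roma.",
    "Tibério sucedeu a Augusto.",
]
ranked = top_snippets(raw, ["sucedeu", "Augusto"])
print(ranked)
```

The snippets returned by such a step would then go through the NLP chain at run-time, as described above.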
Note that this strategy is also needed because we did not apply the information
extraction module to the entire corpora. Thus, although all the information is
in the database, sometimes it is only in the form of a text snippet, without any
processing. This technique allows us to extract information at run-time from
paragraphs considered relevant.
4.4 Choosing the Answer
The system uses two main approaches in order to retrieve the final answer,
depending on the strategy followed.
If the chosen strategy is either the linguistic patterns matching or the
linguistic reordering, the system simply returns the answers found, taking into
consideration the confidence and count attributes of each table (if they exist).
On the other hand, if the chosen strategy is either the named entities matching
or the brute-force plus NLP, the answer extraction step depends on the type
of the question. Bearing in mind that we are dealing with large corpora (564MB
of newspaper text, in both European and Brazilian Portuguese, as well as the
Wikipedia pages from the November 2006 version), the system assumes that the
correct answer is repeated in more than one text snippet. Under this assumption,
QA@L2F returns the most frequent named entity that matches the type of the
question.
QA@L2F participated and was evaluated at CLEF with Portuguese as both the query
and the target language. Table 2 presents the results obtained.
Right  Wrong  ineXact  Unsupported  Total  Accuracy (%)
28     166    4        2            200    28/200 = 14%
Table 2. QA@L2F results at CLEF 2007.
Considering the correct answers:
– 11 were NIL;
– 3 followed the direct query of the “relation-concepts” database;
– 14 followed the linguistic reordering;
– of these 17, 2 used the relaxing mechanism.
Note that only 114 questions were interpreted (anaphora, ellipsis and some
question types were not addressed).
As for the ineXact answers, QA@L2F returned only the identified named entity,
which resulted in an ineXact answer. Nevertheless, it is difficult to be
objective in deciding what the exact answer should be.
For instance, for the question “Quem é George Vassiliou?” (“Who is George
Vassiliou?”) it is obvious that the answer “presidente de Chipre” (“Cypriot
president”) is incomplete, as he was “presidente de Chipre entre 88 e 93”
(“Cypriot president between 88 and 93”). However, given the following paragraph –
“...norueguês, Henrik Ibsen, dramaturgo que escreveu Peer Gynt.” (“...Norwegian,
Henrik Ibsen, playwright who wrote Peer Gynt”) – it is not so obvious what
the right answer to “Quem foi Henrik Ibsen?” (“Who was Henrik Ibsen?”) should be.
If “dramaturgo” is incomplete, is “dramaturgo norueguês” enough? Or should the
right answer be “dramaturgo norueguês que escreveu Peer Gynt”? It is
difficult to decide.
Details about the evaluation can be found in .
6 Conclusions and future work
This paper presents the first steps of QA@L2F. The system follows different
strategies according to the type of the submitted question and bases its
performance on named entity recognition; if no answer is found, the system
relaxes its constraints and tries to find the answer using another strategy.
Many improvements are yet to be made to QA@L2F. Improvements to all of the
steps and techniques described in this paper are already scheduled; the
introduction of new strategies is also a goal.
Besides the existing linguistic patterns matching approach, we would like to
explore a syntactic pattern matching strategy, using patterns at the syntactic
level.
We would also like to explore in detail Wikipedia’s standard structure (namely
how it stores birth and death dates and places, for instance), as it allows an
easy retrieval of miscellaneous information.
1. Salah Aït-Mokhtar, Jean-Pierre Chanod, and Claude Roux. A multi-input
dependency parser. In Proceedings of the Seventh IWPT (International Workshop on
Parsing Technologies), Beijing, China, October 2001.
2. Carlos Amaral, Adán Cassan, Helena Figueira, André Martins, Afonso Mendes,
Pedro Mendes, Cláudia Pinto, and Daniel Vidal. Priberam’s question answering
system in QA@CLEF 2007. Working Notes for the CLEF 2007 Workshop, 2007.
3. Davide Buscaldi, Yassine Benajiba, Paolo Rosso, and Emilio Sanchis. The UPV
at QA@CLEF 2007. Working Notes for the CLEF 2007 Workshop, 2007.
4. Luís Miguel Cabral, Luís Fernando Costa, and Diana Santos. Esfinge at CLEF
2007: First steps in a multiple question and multiple answer approach. Working
Notes for the CLEF 2007 Workshop, 2007.
5. Adán Cassan, Helena Figueira, André Martins, Afonso Mendes, Pedro Mendes,
Cláudia Pinto, and Daniel Vidal. Priberam’s question answering system in a
cross-language environment. Working Notes for the CLEF 2006 Workshop, 2006.
6. Luís Costa. Esfinge - a modular question answering system for Portuguese.
Working Notes for the CLEF 2006 Workshop, 2006.
7. Antonio Juárez-González, Alberto Téllez-Valero, Claudia Denicia-Carral,
Manuel Montes y Gómez, and Luis Villaseñor Pineda. INAOE at CLEF 2006:
Experiments in Spanish Question Answering. Working Notes for the CLEF 2006
Workshop, 2006.
8. Dominique Laurent, Patrick Séguéla, and Sophie Nègre. Cross Lingual Question
Answering using QRISTAL for CLEF 2006. Working Notes for the CLEF 2006
Workshop, 2006.
9. Dominique Laurent, Patrick Séguéla, and Sophie Nègre. Cross Lingual Question
Answering using QRISTAL for CLEF 2007. Working Notes for the CLEF 2007
Workshop, 2007.
10. João Loureiro. NER - Reconhecimento de Pessoas, Organizações e Tempo.
Master’s thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa,
Portugal, 2007.
11. José Carlos Medeiros. Análise morfológica e correcção ortográfica do
português. Master’s thesis, Instituto Superior Técnico, Universidade Técnica de
Lisboa, Por-
12. Ana Mendes. Clefomania, QA@L2F: Primeiros Passos. Master’s thesis, Instituto
Superior Técnico, Universidade Técnica de Lisboa, Portugal, 2007.
13. Joana Paulo Pardal and Nuno J. Mamede. Terms spotting with linguistics and
statistics. In Proceedings of the international workshop “Taller de Herramientas
y Recursos Lingüísticos para el Español y el Portugués”, IX Iberoamerican
Conference on Artificial Intelligence (IBERAMIA 2004), pages 298–304, November
14. Paulo Quaresma and Irene Rodrigues. A logic programming based approach to
the QA@CLEF05 track. Working Notes for the CLEF 2005 Workshop, 2005.
15. Ricardo Ribeiro, Nuno J. Mamede, and Isabel Trancoso. Using Morphossyntactic
Information in TTS Systems: comparing strategies for European Portuguese. In
Computational Processing of the Portuguese Language: 6th International
Workshop, PROPOR 2003, Faro, Portugal, June 26-27, 2003. Proceedings, volume
2721 of Lecture Notes in Computer Science. Springer, 2003.
16. Luis Romão. NER - Reconhecimento de Locais e Eventos. Master’s thesis,
Instituto Superior Técnico, Universidade Técnica de Lisboa, Portugal, 2007.
17. José Saias and Paulo Quaresma. The Senso Question Answering Approach to
Portuguese QA@CLEF-2007. Working Notes for the CLEF 2007 Workshop, 2007.
18. Luís Sarmento. Hunting answers with RAPOSA (FOX). Working Notes for the
CLEF 2006 Workshop, 2006.