Demonstration of the CROSSMARC System
Vangelis Karkaletsis, Constantine D. Spyropoulos, Dimitris Souflis, Claire Grover,
Ben Hachey, Maria Teresa Pazienza, Michele Vindigni, Emmanuel Cartier, José Coch
Institute for Informatics and Telecommunications, NCSR “Demokritos”
{vangelis, costass}@iit.demokritos.gr
Velti S.A.
Dsouflis@velti.net
Division of Informatics, University of Edinburgh
{grover, bhachey}@ed.ac.uk
D.I.S.P., Università di Roma Tor Vergata
{pazienza, vindigni}@info.uniroma2.it
Lingway
{emmanuel.cartier, Jose.Coch}@lingway.com
1 Introduction
The EC-funded R&D project CROSSMARC is developing technology for extracting information from domain-specific web pages, employing language technology methods as well as machine learning methods in order to facilitate porting of the technology to new domains. CROSSMARC also employs localisation methodologies and user modelling techniques in order to provide the results of extraction in accordance with the user’s personal preferences and constraints. The system’s implementation is based on a multi-agent architecture, which ensures a clear separation of responsibilities and provides the system with clear interfaces and robust and intelligent information processing capabilities.
2 System Architecture
The CROSSMARC architecture consists of the following
main processing stages:
Collection of domain-specific web pages, involving two sub-stages:
- domain-specific web crawling (focused crawling) for the identification of web sites that are of relevance to the particular domain (e.g. retailers of electronic products);
- domain-specific spidering of the retrieved web sites in order to identify web pages of interest (e.g. laptop product descriptions).
Information extraction from the domain-specific web pages, which involves two main sub-stages:
- named entity recognition, to identify named entities such as product manufacturer name or company name in descriptions inside the web page, written in any of the project’s four languages (English, Greek, French, Italian) (Grover et al., 2002). Cross-lingual name matching techniques are also employed in order to link expressions referring to the same named entities across languages;
- fact extraction, to identify those named entities that fill the slots of the template specifying the information to be extracted from each web page. To achieve this, the project combines wrapper-induction approaches for fact extraction with language-based information extraction in order to develop site-independent wrappers for the domain examined (a sketch of such a template is given after this list).
Data Storage, to store the extracted information (from the web page descriptions in any of the project’s four languages) into a common database.
Data Presentation, to present the extracted information to the end-user through a multilingual user interface, in accordance with the user’s language and preferences.
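As an illustration of the fact extraction sub-stage, the following minimal sketch shows how recognised named entities might fill the slots of a laptop-domain template. The slot names, entity types and helper function are hypothetical simplifications introduced here, not the actual CROSSMARC template definition.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LaptopOffer:
    """Hypothetical, simplified fact template for the laptop domain."""
    manufacturer: Optional[str] = None   # filled by a MANUFACTURER entity
    model: Optional[str] = None          # filled by a MODEL entity
    price: Optional[str] = None          # amount plus currency, as on the page
    source_url: str = ""
    language: str = ""                   # en / el / fr / it

def fill_template(entities: List[dict], url: str, lang: str) -> LaptopOffer:
    """Toy slot filling: map recognised named entities to template slots."""
    offer = LaptopOffer(source_url=url, language=lang)
    for entity in entities:
        if entity["type"] == "MANUFACTURER":
            offer.manufacturer = entity["text"]
        elif entity["type"] == "MODEL":
            offer.model = entity["text"]
        elif entity["type"] == "PRICE":
            offer.price = entity["text"]
    return offer
```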
As a cross-lingual multi-domain system, the goal of CROSSMARC is to cover a wide area of possible knowledge domains and a wide range of conceivable facts in each domain. To achieve this we construct an ontology of each domain which reflects a certain degree of domain expert knowledge (Pazienza et al., 2003). Cross-linguality is achieved with the lexica, which provide language-specific synonyms for all the ontology entries. During information extraction, web pages are matched against the domain ontology and an abstract representation of this real-world information (facts) is generated.
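To make the role of the lexica concrete, the sketch below attaches per-language synonym lists to an ontology entry and matches them against page text. The concept name, the terms and the matching strategy are assumptions for illustration only; the real CROSSMARC ontology and lexica are considerably richer.

```python
# Hypothetical ontology fragment: one abstract concept with
# language-specific synonym lists supplied by the lexica.
ontology = {
    "PROCESSOR_SPEED": {
        "en": ["processor speed", "cpu speed", "clock speed"],
        "fr": ["vitesse du processeur"],
        "it": ["velocità del processore"],
        "el": ["ταχύτητα επεξεργαστή"],
    },
}

def match_concepts(page_text: str, lang: str):
    """Toy matcher: report which concepts are mentioned on a page."""
    text = page_text.lower()
    return [
        (concept, term)
        for concept, lexica in ontology.items()
        for term in lexica.get(lang, [])
        if term in text
    ]
```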
Figure 1: Architecture of the CROSSMARC system

As shown in Figure 1, the CROSSMARC multi-agent architecture includes agents for web page collection (crawling agent, spidering agent), information extraction, data storage and data presentation. These agents communicate through the blackboard. The Crawling Agent defines a schedule for invoking the focused crawler, which is written to the blackboard and can be refined by the human administrator. The Spidering Agent is an autonomous software component, which retrieves sites to spider from the blackboard and locates interesting web pages within them by traversing their links. Again, status information is written to the blackboard.
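The blackboard-based coordination described above can be pictured with the following minimal sketch; the entry types and fields are invented for illustration and do not reflect the actual CROSSMARC blackboard schema.

```python
import queue

class Blackboard:
    """Toy blackboard: agents post typed entries and consume them later."""
    def __init__(self):
        self._entries = queue.Queue()

    def post(self, entry_type: str, payload: dict):
        self._entries.put((entry_type, payload))

    def take(self, entry_type: str):
        """Return the payload of the next entry of the given type, if any."""
        pending, found = [], None
        while not self._entries.empty():
            kind, payload = self._entries.get()
            if found is None and kind == entry_type:
                found = payload
            else:
                pending.append((kind, payload))
        for entry in pending:            # re-post everything we skipped
            self._entries.put(entry)
        return found

board = Blackboard()
# Crawling Agent posts a relevant site; Spidering Agent consumes it
# and posts the product pages it finds for the IE Agent.
board.post("SITE_TO_SPIDER", {"url": "http://retailer.example.com"})
site = board.take("SITE_TO_SPIDER")
board.post("PAGE_TO_EXTRACT", {"url": site["url"] + "/laptops/offer1.html"})
```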
The multi-lingual IE system is a distributed one, in which the individual monolingual components are autonomous processors that need not all be installed on the same machine. (These components have been developed using a wide range of base technologies: see, for example, Petasis et al. (2002), Mikheev et al. (1998), Pazienza and Vindigni (2000).) Since the IE systems are not themselves offered as web services, a proxy mechanism is required, utilising established remote access mechanisms (e.g. HTTP) to act as a front-end for every IE system in the project. In effect, this proxy mechanism turns every IE system into a web service. For this purpose, we have developed an Information Extraction Remote Invocation module (IERI), which takes XHTML pages as input and routes them to the corresponding monolingual IE system according to the language they are written in. The Information Extraction Agent retrieves pages stored on the blackboard by the Spidering Agent, invokes the Information Extraction system (through IERI) for each language and writes the extracted facts (or error messages) on the blackboard. This information is then used by the Data Storage Agent, which reads the extracted facts and stores them in the product database.
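A minimal sketch of IERI-style language routing is given below; the endpoint URLs and the use of HTTP POST are assumptions made for illustration, not the project's actual interfaces.

```python
import urllib.request

# Hypothetical endpoints for the four monolingual IE systems.
IE_ENDPOINTS = {
    "en": "http://ie-english.example.org/extract",
    "el": "http://ie-greek.example.org/extract",
    "fr": "http://ie-french.example.org/extract",
    "it": "http://ie-italian.example.org/extract",
}

def route_page(xhtml: bytes, lang: str) -> bytes:
    """Forward an XHTML page to the IE system for its language over HTTP."""
    request = urllib.request.Request(
        IE_ENDPOINTS[lang],
        data=xhtml,
        headers={"Content-Type": "application/xhtml+xml"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()   # extracted facts, or an error message
```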
3 The CROSSMARC Demonstration
The first part of the CROSSMARC demonstration is the user interface, accessed via a web page. The user is presented with the prototype user interface, which supports menu-driven querying of the product databases for the two domains. The user enters his/her preferences and is presented with information about matching products, including links to the pages which contain the offers (a sketch of such a query is given below).
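Purely as an illustration, a user preference query against the product database could look roughly like the following; the table and column names are a hypothetical schema, not the actual CROSSMARC database.

```python
import sqlite3

# Hypothetical schema: one row per extracted laptop offer.
connection = sqlite3.connect("crossmarc_products.db")
query = """
    SELECT manufacturer, model, price_eur, source_url
    FROM laptop_offers
    WHERE ram_mb >= ? AND price_eur <= ?
    ORDER BY price_eur
"""
for row in connection.execute(query, (256, 1500)):
    print(row)   # matching products, with links back to the offer pages
```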
The main part of the demonstration shows the full information extraction system, including web crawling, site spidering and information extraction. The demonstration shows the results of the individual modules, including real-time spidering of web sites to find pages which contain product offers and real-time information extraction from the pages in the four project languages: English, French, Italian and Greek. Screen shots of various parts of the system are available at http://www.iit.demokritos.gr/skel/crossmarc/demo-images.htm.
Acknowledgments
This research is funded by the European Commission (IST2000-25366). Further information about the CROSSMARC project can be found at http://www.iit.demokritos.gr/skel/crossmarc/.
References
C. Grover, S. McDonald, V. Karkaletsis, D. Farmakiotou, G. Samaritakis, G. Petasis, M. T. Pazienza, M. Vindigni, F. Vichot and F. Wolinski. 2002. Multilingual XML-Based Named Entity Recognition. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2002).

A. Mikheev, C. Grover, and M. Moens. 1998. Description of the LTG system used for MUC-7. In Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference held in Fairfax, Virginia, 29 April–1 May, 1998. http://www.muc.saic.com/proceedings/muc_7_toc.html.

M. T. Pazienza, A. Stellato, M. Vindigni, A. Valarakos, and V. Karkaletsis. 2003. Ontology integration in a multilingual e-retail system. In Proceedings of Human Computer Interaction International (HCII’2003), Special Session on “Ontologies and Multilinguality in User Interfaces”.

M. T. Pazienza and M. Vindigni. 2000. Identification and classification of Italian complex proper names. In Proceedings of the ACIDCA 2000 International Conference.

G. Petasis, V. Karkaletsis, G. Paliouras, I. Androutsopoulos, and C. D. Spyropoulos. 2002. Ellogon: A new text engineering platform. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002).