Demonstration of the CROSSMARC System
Vangelis Karkaletsis, Constantine D. Spyropoulos, Dimitris Souflis, Claire Grover,
Ben Hachey, Maria Teresa Pazienza, Michele Vindigni, Emmanuel Cartier, José Coch

Institute for Informatics and Telecommunications, NCSR "Demokritos"
{vangelis, costass}@iit.demokritos.gr
Velti S.A.
Dsouflis@velti.net
Division of Informatics, University of Edinburgh
{grover, bhachey}@ed.ac.uk
D.I.S.P., Università di Roma Tor Vergata
{pazienza, vindigni}@info.uniroma2.it
Lingway
{emmanuel.cartier, Jose.Coch}@lingway.com
1 Introduction
The EC-funded R&D project, CROSSMARC, is developing technology for extracting information from domain-specific web pages, employing language technology methods as well as machine learning methods in order to facilitate technology porting to new domains. CROSSMARC also employs localisation methodologies and user modelling techniques in order to provide the results of extraction in accordance with the user's personal preferences and constraints. The system's implementation is based on a multi-agent architecture, which ensures a clear separation of responsibilities and provides the system with clear interfaces and robust and intelligent information processing capabilities.
2 System Architecture
The CROSSMARC architecture consists of the following main processing stages:

Collection of domain-specific web pages, involving two sub-stages:
- domain-specific web crawling (focused crawling) for the identification of web sites that are of relevance to the particular domain (e.g. retailers of electronic products).
- domain-specific spidering of the retrieved web sites in order to identify web pages of interest (e.g. laptop product descriptions).

Information extraction from the domain-specific web pages, which involves two main sub-stages:
- named entity recognition to identify named entities such as product manufacturer name or company name in descriptions inside the web page written in any of the project's four languages (English, Greek, French, Italian) (Grover et al. 2002). Cross-lingual name matching techniques are also employed in order to link expressions referring to the same named entities across languages.
- fact extraction to identify those named entities that fill the slots of the template specifying the information to be extracted from each web page. To achieve this the project combines wrapper-induction approaches for fact extraction with language-based information extraction in order to develop site-independent wrappers for the domain examined.

Data Storage, to store the extracted information (from the web page descriptions in any of the project's four languages) into a common database.

Data Presentation, to present the extracted information to the end-user through a multilingual user interface, in accordance with the user's language and preferences.
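The flow through these stages can be sketched as a simple pipeline. This is only an illustration of the stage ordering described above: all function names are hypothetical, and the crawling, spidering and extraction steps are stubbed rather than real implementations.

```python
# Hypothetical sketch of the CROSSMARC processing stages.
# The real system is multi-agent and distributed; this only shows the data flow.

def crawl(seed_sites, is_relevant):
    """Focused crawling: keep only sites relevant to the domain."""
    return [s for s in seed_sites if is_relevant(s)]

def spider(site):
    """Site spidering: return pages of interest within a site (stubbed)."""
    return [f"{site}/laptop-{i}.html" for i in range(2)]

def extract(page):
    """Information extraction: fill a fact template for the page (stubbed)."""
    return {"page": page, "manufacturer": None, "price": None}

def run_pipeline(seed_sites, is_relevant):
    database = []                       # Data Storage stage
    for site in crawl(seed_sites, is_relevant):
        for page in spider(site):
            database.append(extract(page))
    return database                     # handed on to Data Presentation

facts = run_pipeline(["shopA.example", "blog.example"],
                     lambda s: s.startswith("shop"))
print(len(facts))  # 2: two pages spidered from the one relevant site
```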
As a cross-lingual multi-domain system, the goal of CROSSMARC is to cover a wide area of possible knowledge domains and a wide range of conceivable facts in each domain. To achieve this we construct an ontology of each domain which reflects a certain degree of domain expert knowledge (Pazienza et al. 2003). Cross-linguality is achieved with the lexica, which provide language-specific synonyms for all the ontology entries. During information extraction, web pages are matched against the domain ontology and an abstract representation of this real-world information (facts) is generated.
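A minimal sketch of how a language-neutral ontology entry plus per-language lexica could support such matching follows; the concept name, dictionary layout and matching logic are our own illustration, not the CROSSMARC ontology format.

```python
# Hypothetical ontology fragment: one language-neutral concept, with
# language-specific realisations supplied by per-language lexica.
lexica = {
    "en": {"HARD_DISK_CAPACITY": ["hard disk", "HDD"]},
    "fr": {"HARD_DISK_CAPACITY": ["disque dur"]},
    "it": {"HARD_DISK_CAPACITY": ["disco rigido"]},
    "el": {"HARD_DISK_CAPACITY": ["σκληρός δίσκος"]},
}

def match_concepts(text, lang):
    """Return ontology concepts whose lexical realisations occur in the text."""
    found = []
    for concept, terms in lexica[lang].items():
        if any(term.lower() in text.lower() for term in terms):
            found.append(concept)
    return found

print(match_concepts("Laptop avec disque dur de 20 Go", "fr"))
# ['HARD_DISK_CAPACITY']
```

Because the concept identifier is shared across the four lexica, facts matched in different languages map to the same abstract representation.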
Figure 1: Architecture of the CROSSMARC system

As shown in Figure 1, the CROSSMARC multi-agent architecture includes agents for web page collection (crawling agent, spidering agent), information extraction, data storage and data presentation. These agents communicate through the blackboard. The Crawling Agent defines a schedule for invoking the focused crawler, which is written to the blackboard and can be refined by the human administrator. The Spidering Agent is an autonomous software component, which retrieves sites to spider from the blackboard and locates interesting web pages within them by traversing their links. Again, status information is written to the blackboard.
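The blackboard interaction between the crawling and spidering agents can be pictured as a shared store that agents post to and take from. The class and key names below are illustrative only; they are not the CROSSMARC blackboard API.

```python
# Toy blackboard: a shared store through which agents communicate.
class Blackboard:
    def __init__(self):
        self.entries = {}

    def post(self, key, value):
        self.entries.setdefault(key, []).append(value)

    def take(self, key):
        return self.entries.pop(key, [])

bb = Blackboard()

# Crawling agent: writes sites judged relevant to the domain.
bb.post("sites", "http://laptops.example")

# Spidering agent: consumes sites, writes interesting pages and its status.
for site in bb.take("sites"):
    bb.post("pages", site + "/offers.html")
    bb.post("status", ("spider", site, "done"))

print(bb.entries["pages"])
```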
The multi-lingual IE system is a distributed one where the individual monolingual components are autonomous processors, which need not all be installed on the same machine. (These components have been developed using a wide range of base technologies: see, for example, Petasis et al. (2002), Mikheev et al. (1998), Pazienza and Vindigni (2000)). The IE systems are not offered as web services, therefore a proxy mechanism is required, utilising established remote access mechanisms (e.g. HTTP) to act as a front-end for every IE system in the project. In effect, this proxy mechanism turns every IE system into a web service. For this purpose, we have developed an Information Extraction Remote Invocation module (IERI) which takes XHTML pages as input and routes them to the corresponding monolingual IE system according to the language they are written in. The Information Extraction Agent retrieves pages stored on the blackboard by the Spidering Agent, invokes the Information Extraction system (through IERI) for each language and writes the extracted facts (or error messages) on the blackboard. The Data Storage Agent then reads the extracted facts and stores them in the product database.
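The routing behaviour of such a module can be sketched as a dispatch table keyed by language. This is a hypothetical sketch: the registry of in-process functions below stands in for the remote (e.g. HTTP) front-ends of the actual monolingual IE systems, and the error shape is our own.

```python
# Hypothetical IERI-style routing: dispatch an XHTML page to the
# monolingual IE system for its language, or report an error.
extractors = {
    "en": lambda page: {"lang": "en", "facts": []},
    "fr": lambda page: {"lang": "fr", "facts": []},
}

def ieri_route(page, lang):
    """Route a page to the IE system for `lang`; return facts or an error."""
    if lang not in extractors:
        return {"error": f"no IE system registered for language '{lang}'"}
    return extractors[lang](page)

result = ieri_route("<html>...</html>", "fr")
print(result["lang"])  # fr
```

Either outcome (facts or an error message) would then be written to the blackboard, matching the behaviour described above.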
3 The CROSSMARC Demonstration
The first part of the CROSSMARC demonstration is the user interface, accessed via a web page. The user is presented with the prototype user interface, which supports menu-driven querying of the product databases for the two domains. The user enters his/her preferences and is presented with information about matching products, including links to the pages which contain the offers.

The main part of the demonstration shows the full information extraction system, including web crawling, site spidering and information extraction. The demonstration shows the results of the individual modules, including real-time spidering of web sites to find pages which contain product offers and real-time information extraction from the pages in the four project languages, English, French, Italian and Greek. Screen shots of various parts of the system are available at http://www.iit.demokritos.gr/skel/crossmarc/demo-images.htm
Acknowledgments
This research is funded by the European Commission (IST2000-25366). Further information about the CROSSMARC project can be found at http://www.iit.demokritos.gr/skel/crossmarc/.
References
C. Grover, S. McDonald, V. Karkaletsis, D. Farmakiotou, G. Samaritakis, G. Petasis, M.T. Pazienza, M. Vindigni, F. Vichot and F. Wolinski. 2002. Multilingual XML-Based Named Entity Recognition. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2002).

A. Mikheev, C. Grover, and M. Moens. 1998. Description of the LTG system used for MUC-7. In Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference held in Fairfax, Virginia, 29 April–1 May, 1998. http://www.muc.saic.com/proceedings/muc_7_toc.html.

M. T. Pazienza, A. Stellato, M. Vindigni, A. Valarakos, and V. Karkaletsis. 2003. Ontology integration in a multilingual e-retail system. In Proceedings of the Human Computer Interaction International (HCII'2003), Special Session on "Ontologies and Multilinguality in User Interfaces".

M. T. Pazienza and M. Vindigni. 2000. Identification and classification of Italian complex proper names. In Proceedings of ACIDCA 2000 International Conference.

G. Petasis, V. Karkaletsis, G. Paliouras, I. Androutsopoulos, and C. D. Spyropoulos. 2002. Ellogon: A new text engineering platform. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002).