BUILDING SCALABLE MEDIATOR SYSTEMS
Chantal Reynaud
University of Paris-Sud, CNRS (L.R.I.) & INRIA (Futurs), Orsay, France; University of Paris
X-Nanterre, France.
Abstract: This paper deals with PICSEL mediator systems which integrate services. We
propose a scalable approach which exploits standardized specifications
provided by standardization bodies. The paper focuses on the use of such
specifications to automate the construction of the mediator. An illustration in
the tourism domain with OTA specifications is given.
Key words: Service integration, mediation, scalability.
1. INTRODUCTION
In recent years, considerable research has been done on data
mediation systems between users and multiple data sources, leading to
integration systems related to the same domain. A mediator system provides a
uniform interface for querying collections of pre-existing data sources that
were created independently. Several data mediation systems have been
implemented. They have proved to be suitable for building specialized data
servers over a reasonable number of data sources (Chawathe et al., 1994;
Genesereth et al., 1997; Kirk et al. 1995).
A mediator system is composed of two parts, a part which contains
knowledge corresponding to the application domain of the system and a
query engine which is generic. The knowledge part is composed first of a
single mediated schema, also called ontology, which is a description of the
application domain. Second, it is composed of a set of source descriptions
expressed as views over the ontology. They model the correspondence
between the ontology and the schemas of the data sources.
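As a rough illustration of this knowledge part, the following Python sketch models a mediated schema and a set of source descriptions as plain data structures. All class, field and relation names are hypothetical, and the representation is only an assumption made for illustration; PICSEL itself expresses this knowledge in CARIN.

```python
from dataclasses import dataclass, field


@dataclass
class SourceDescription:
    """One view over the ontology: `source` can supply instances of `relation`."""
    source: str
    view: str
    relation: str


@dataclass
class KnowledgeBase:
    """Domain-specific part of a mediator: ontology plus source descriptions."""
    ontology: set = field(default_factory=set)        # relation names of the mediated schema
    descriptions: list = field(default_factory=list)  # views over the ontology

    def add_description(self, desc: SourceDescription) -> None:
        # A view may only refer to a relation that exists in the ontology.
        if desc.relation not in self.ontology:
            raise ValueError(f"unknown relation: {desc.relation}")
        self.descriptions.append(desc)

    def sources_for(self, relation: str) -> list:
        """Return the sources whose views give access to `relation`."""
        return sorted({d.source for d in self.descriptions if d.relation == relation})


# Hypothetical content, for illustration only.
kb = KnowledgeBase(ontology={"Hotel", "Flight"})
kb.add_description(SourceDescription("S1", "v1", "Flight"))
kb.add_description(SourceDescription("S2", "v1", "Flight"))
kb.add_description(SourceDescription("S2", "v2", "Hotel"))
```

The generic query engine is deliberately absent from the sketch: only the domain-dependent part is modeled here.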
Research works have developed languages for describing the mediated
schema, the content of data sources and the users’ queries (Etzioni et al.,
1994; Papakonstantinou et al., 1995). Other research works addressed the
issue of providing sound and complete algorithms for rewriting queries using
views over data sources. Information integration is a typical setting in
which queries must be rewritten using views, because users of data
integration systems do not pose queries directly to the sources in which
data are stored but to a set of virtual relations designed to provide
uniform and homogeneous access to a domain of interest. The rewriting
problem has been studied extensively in the pure relational setting
(Halevy, 2001).
Improving scalability is a problem now studied in the setting of
peer-to-peer computing (Milojicic et al., 2002), but it has not received much
attention in the setting of centralized mediator systems. This paper deals
with this problem in the setting of the PICSEL project (Goasdoue et al.,
2000), a collaboration with France Telecom R&D. A first project, PICSEL1,
was dedicated to the construction of a declarative development platform for
mediator systems. A very rich knowledge representation language, CARIN,
was proposed to model application domains and the content of the
sources. Algorithms were designed for rewriting queries over data
sources efficiently. We also proposed a representation in CARIN of the
ontology of a real application domain, tourism, and defined
methodological guidelines for representing an ontology in this
language. In spite of these results, however, building the ontology
remains a difficult and very time-consuming task that hinders the deployment
and scalability of mediator systems.
Consequently, in a new project, PICSEL2, we aimed at automating the
construction of the ontology in the PICSEL setting. We considered that
resources were XML documents, and we used PICSEL to build a mediator in
the tourism domain by automating the construction of the ontology from
XML documents corresponding to the standardized message specifications
provided for the travel industry by the Open Travel Alliance, OTA
(www.opentravel.com). This application was very interesting because it
allowed us to go beyond the initial objective of the project. Reusing domain
standards allows the automation not only of the construction of the ontology
but of the whole knowledge part. Moreover, suitable user interfaces were
automatically generated from the ontology, and XML messages were
automatically written from query plans. As a result, the approach improves
scalability by avoiding strong dependency between the system and the
available resources.
The paper is organized as follows. Section 2 provides the general
description of the PICSEL2 approach. In section 3, we present the automation
of the construction of the mediator. Section 4 focuses on the interface part.
2. GENERAL DESCRIPTION OF THE PICSEL2
APPROACH
The PICSEL2 approach aims at building a scalable mediator-based integration
system, PICSEL, with available resources restricted to XML documents.
Improving scalability requires automation. As the query engine is generic, the
part of the system concerned by automation is the knowledge part,
that is, the construction of the ontology and the description of the
contents of the available resources.
The specificity of the approach is that automation is based on
standardized XML specifications provided by organizations for the exchange of
structured e-business messages. Thus, the approach supports service
integration rather than data source integration. Consequently, the ontology
whose construction is automated (cf. 3.1) describes the services of a domain.
The second component of the knowledge part of the mediator describes the
functionalities of the available services; it is also automatically
generated (cf. 3.2).
Moreover, the goal of the approach is to increase independence between
available service providers and the system. Thus, the description of each
available service is split into two parts, an internal business part and a
public message exchange protocol interface, and the internal part is hidden.
The mediator system is coupled with services only via these interfaces. The
names of the messages a service can perform are listed in the public part of
its interface using a service description language (for example WSDL).
In this setting, a user query, formulated on the ontology through an
interface dynamically generated from the ontology (cf. 4.1), relates to a
service. It is translated into a set of queries using the service
descriptions. The queries are then performed by available service providers.
Wrappers are usually needed at this step in data source integration systems.
In our approach, they are replaced by a generic module which automatically
translates the set of queries into XML documents corresponding to
standardized messages (cf. 4.2).
Following this approach, we built a mediator system integrating
services (air booking, hotel reservation, …) from standardized specifications
defined by the Open Travel Alliance. OTA is a non-profit organization
working to define industry-wide e-business specifications. Our application
exploited 115 OTA XML-Schemas defining the elements to be used in
messages when searching for availability and booking in the travel industry.
3. AUTOMATION OF THE CONSTRUCTION
OF THE MEDIATOR
We first describe the building of the ontology, then the
generation of the service descriptions.
3.1 Semi-automated construction of the ontology
The building of the ontology is a two-step process. First, a very simple
version, guided by standardized specifications, is built manually. We
consider all the messages (e.g. AirAvailabilityRequirement) whose
content is defined by the standardization body under consideration. Then,
we group the messages into categories (e.g. AirBookingService). As a
result, we obtain the first two levels of a class hierarchy. Names of
the classes in level 2 are names of standardized messages. Names of the
classes in level 1 are names of categories. The name of the root class denotes
the domain of interest (e.g. tourism service).
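The initial two-level hierarchy can be pictured with a short Python sketch. It is only an illustration under stated assumptions: the child-to-parent map, the function names and the hotel-related names are hypothetical, and only AirBookingService and AirAvailabilityRequirement come from the text.

```python
def build_initial_hierarchy(root, categories):
    """Return a child -> parent map for the root, category and message classes."""
    parent = {root: None}
    for category, messages in categories.items():
        parent[category] = root          # level 1: categories
        for message in messages:
            parent[message] = category   # level 2: standardized messages
    return parent


def level(parent, cls):
    """Depth of a class in the hierarchy (the root is at level 0)."""
    depth = 0
    while parent[cls] is not None:
        cls = parent[cls]
        depth += 1
    return depth


# Names other than the two taken from the text are hypothetical.
hierarchy = build_initial_hierarchy(
    "TourismService",
    {
        "AirBookingService": ["AirAvailabilityRequirement"],
        "HotelBookingService": ["HotelAvailabilityRequirement"],
    },
)
```

In this sketch the grouping of messages into categories is supplied by hand, which mirrors the manual nature of the first step.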
In a second step, the initial hierarchy is enriched from standardized
specifications in a semi-automatic way: the ontology designer is asked
to validate and, if needed, modify the enrichment. A set of classes, a set of
properties which characterize classes and a set of relations among classes are
extracted from the XML-Schemas associated with the XML documents provided by
the standardization body. More details about the heuristics used in this
process are given in (Giraldo and Reynaud, 2002). Then, the extracted
elements are structured, which means connecting the two-level initial class
hierarchy with the output of the extraction phase. Thanks to common classes
(terms of level 2 in the initial hierarchy correspond to names of messages
and are root elements in the XML-Schemas), the connection can be
automated. Finally, the model is automatically represented in CARIN.
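To make the extraction step more concrete, here is a minimal Python sketch that walks a toy XML Schema. The heuristic it applies is an assumption made for illustration, not the one described in (Giraldo and Reynaud, 2002): an element with a nested complex type becomes a class, a simple element becomes a property of its enclosing class, and nesting between complex elements becomes a relation. The schema fragment and every element name below the root are invented.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Toy schema; the element names below the root are invented for illustration.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="AirAvailabilityRequirement">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="OriginDestination">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="DepartureAirport" type="xs:string"/>
              <xs:element name="ArrivalAirport" type="xs:string" minOccurs="0"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def extract(schema_text):
    """Return (classes, properties, relations) found in an XML Schema string."""
    classes, properties, relations = [], [], []

    def visit(element, owner):
        name = element.get("name")
        complex_type = element.find(XS + "complexType")
        if complex_type is None:
            # Simple element: treated as a property of the enclosing class.
            properties.append((owner, name))
            return
        classes.append(name)
        if owner is not None:
            relations.append((owner, name))
        # Only direct children of the sequence are visited at this level.
        for child in complex_type.findall(f"{XS}sequence/{XS}element"):
            visit(child, name)

    root = ET.fromstring(schema_text)
    for top in root.findall(XS + "element"):
        visit(top, None)
    return classes, properties, relations


classes, properties, relations = extract(SCHEMA)
```

Because the root element of the schema carries the message name, the first extracted class is also a level-2 class of the initial hierarchy, which is what lets the two structures be connected automatically.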
3.2 Automated generation of service descriptions
In PICSEL the description of the content of each data source is given in
terms of a set of logical implications $v_i(X) \Rightarrow p(X)$, establishing a link
between a view $v_i$ and the domain relation $p$ whose instances can be found in
the source. Using PICSEL to integrate services leads us to define, for each
service, as many views as there are messages the service is able to perform. The
logical implications are generated automatically from the names of the
messages extracted from the public part of the services. For example,
S1-v1 $\Rightarrow$ AirAvailabilityRequirement could be one implication defining a view,
with v1 the name of the view that is associated with
AirAvailabilityRequirement in the ontology, a class corresponding to the name
of a message the service S1 is able to perform. This implication says that the
service S1 supports searching for flight availability.
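The generation of one view per message can be sketched in a few lines of Python. This is an illustration only: the input format (a map from service names to the message names found in their public part), the function name, the service S2 and the hotel message are assumptions, and the implication is rendered as a plain string rather than as a CARIN statement.

```python
def generate_views(services, ontology_classes):
    """Return one implication string per message a service can perform."""
    implications = []
    for service, messages in services.items():
        # Messages unknown to the ontology yield no view.
        known = [m for m in messages if m in ontology_classes]
        for index, message in enumerate(known, start=1):
            implications.append(f"{service}-v{index} => {message}")
    return implications


# Hypothetical ontology classes and public service descriptions.
ONTOLOGY = {"AirAvailabilityRequirement", "HotelAvailabilityRequirement"}
SERVICES = {
    "S1": ["AirAvailabilityRequirement"],
    "S2": ["AirAvailabilityRequirement", "HotelAvailabilityRequirement"],
}

views = generate_views(SERVICES, ONTOLOGY)
```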
These view definitions are not sufficient. A user must be able to define
precisely the service he is looking for. For example, he must be able to express
that he wants a flight departing from the airport CDG. Thus, a view associated
with the relation corresponding to the departure airport in the ontology must
be defined. More generally, we have to define views for all the elements
composing the messages that service providers can perform. As all the data
composing the messages are already represented in the ontology, such views
are automatically generated.
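A possible way to picture these element-level views is the following Python sketch. It assumes that the ontology records, for each message, the elements composing it. The element names, the view naming scheme and the string rendering of implications are all hypothetical.

```python
# Hypothetical data: the elements composing each message, as recorded in the ontology.
MESSAGE_ELEMENTS = {
    "AirAvailabilityRequirement": ["DepartureAirport", "ArrivalAirport"],
}


def element_views(service, messages, message_elements):
    """Return one implication per element of each message the service performs."""
    views = []
    for message in messages:
        for element in message_elements.get(message, []):
            # Deriving the view name from the element name is an assumption.
            views.append(f"{service}-{element} => {element}")
    return views


views = element_views("S1", ["AirAvailabilityRequirement"], MESSAGE_ELEMENTS)
```

With such views available, a constraint like "departure airport is CDG" has a relation in the ontology that a query can refer to.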
4. INTERFACING THE PICSEL2 MEDIATOR SYSTEM
This section deals with both the end-user and the service-provider interfaces.
4.1 End-user interfaces for querying
End-user interfaces are automatically generated from the ontology.
Their design is based on the fact that the ontology contains optional terms,
indicated by the cardinalities in the XML-Schemas, and that these terms are
not all of the same technical level. Three ordered levels of difficulty are
distinguished: low, medium and high. When a user wants to query the mediator
system, he must specify his level in order to access the system through a
suitable interface which presents only the terms of the ontology he is able to
understand, i.e. terms with a level lower than his level. By reducing the
number of terms that a low-level user can use in his query, the system becomes
usable by non-expert users. The terms of the ontology are displayed in
graphical form and each user navigates the ontology as he likes. Moreover, the
graphically specified query is automatically translated into CARIN.
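The level-based filtering of terms can be sketched as follows in Python. The term names and their assigned levels are hypothetical, and the sketch reads "lower than his level" inclusively, so that a low-level user still sees low-level terms; that reading is an assumption.

```python
# The three ordered levels of difficulty named in the text.
LEVELS = {"low": 0, "medium": 1, "high": 2}

# Hypothetical assignment of ontology terms to difficulty levels.
TERM_LEVELS = {
    "DepartureAirport": "low",
    "ArrivalAirport": "low",
    "CabinType": "medium",
    "FareBasisCode": "high",
}


def visible_terms(term_levels, user_level):
    """Return the terms whose level does not exceed the user's level (assumed inclusive)."""
    limit = LEVELS[user_level]
    return sorted(t for t, lvl in term_levels.items() if LEVELS[lvl] <= limit)
```

For instance, `visible_terms(TERM_LEVELS, "low")` keeps only the two airport terms, while a high-level user sees all four.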
4.2 Provider interfaces for translating query plans
We designed an interface to generate the XML documents corresponding to
the messages that must be performed by the service providers mentioned in
the query plans of the PICSEL mediator. The interface is the equivalent of a
wrapper but, in our setting, it is generic and usable for all service
providers. The components and the structure of the messages come from the
XML-Schemas. The rewritings given by the PICSEL query engine provide
the values of the elements to insert into the various XML documents
understood by the service providers mentioned in the query plans.
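The idea of a generic message generator can be illustrated with a small Python sketch. It assumes that the message layout has already been derived from an XML Schema and is available as a nested dictionary, and that a rewriting supplies a flat map of element values. Both representations, the function name and the element names below the root are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout, as it might be derived from an XML Schema:
# message name -> nested elements, with None marking a leaf.
STRUCTURE = {
    "AirAvailabilityRequirement": {
        "OriginDestination": {"DepartureAirport": None, "ArrivalAirport": None}
    }
}


def build_message(name, layout, values):
    """Serialize one message, filling leaves with the values given by a rewriting."""

    def fill(parent, children):
        for tag, sub in children.items():
            if sub is None:
                # Leaf element: emitted only when the rewriting provides a value.
                if tag in values:
                    ET.SubElement(parent, tag).text = values[tag]
            else:
                fill(ET.SubElement(parent, tag), sub)

    root = ET.Element(name)
    fill(root, layout)
    return ET.tostring(root, encoding="unicode")


message = build_message(
    "AirAvailabilityRequirement",
    STRUCTURE["AirAvailabilityRequirement"],
    {"DepartureAirport": "CDG"},
)
```

The same function serves every provider because nothing in it depends on a particular service: only the layout and the values change.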
5. CONCLUSION
We presented a scalable approach, based on the reuse of standards, to integrate
multiple services using the PICSEL mediator system. The main point is to
increase automation. We showed that the PICSEL2 approach makes it possible to
automate the construction of the knowledge part of PICSEL. Moreover, the
building of interfaces with end-users and service providers can also be
automated. Together with the easy and fast construction of such systems that
the approach allows, this is an important factor in increasing their
deployment.
REFERENCES
Chawathe S., Garcia-Molina H., Hammer J., Ireland K., Papakonstantinou Y., Ullman J. D.,
Widom J., 1994, The TSIMMIS project: Integration of heterogeneous information sources,
In 16th Meeting of the Information Processing Society of Japan, Tokyo, p 7-18.
Etzioni O., Weld D., 1994, A Softbot-Based Interface to the Internet, Communications of
the ACM, v 37, n 7, p 72-76.
Genesereth M. R., Keller A. M., Duschka O. M., 1997, Infomaster: an information integration
system, In Joan M. Peckman (ed), proceedings, ACM SIGMOD International Conference
on Management of Data: SIGMOD 1997: May 13-15, 1997, Tucson, Arizona, USA, v 26,
n 2, SIGMOD Record (ACM Special Interest Group on Management of Data), p 539-542,
New-York, NY 10036, USA. ACM Press.
Giraldo G., Reynaud C., 2002, Construction semi-automatique d'ontologies à partir de DTDs
relatives à un même domaine, Journées Francophones d'Ingénierie des Connaissances.
Goasdoue F., Lattes V., Rousset M.-C., 2000, The use of CARIN language and algorithms for
Integration Information: the PICSEL system, International Journal of Cooperative
Information Systems, v 9, n 4, p 383-401.
Halevy A., 2001, Answering queries using views: A Survey, VLDB Journal.
Kirk T., Levy A. Y., Sagiv Y., Srivastava D., 1995, The Information Manifold, In C.
Knoblock & A. Levy (eds), Information Gathering from Heterogeneous, Distributed
Environments, AAAI Spring Symposium Series, Stanford University, California.
Milojicic D. J., Kalogeraki V., Lukose R., Nagaraja K., Pruyne J., Richard B., Rollins S., Xu
Z., 2002, Peer-to-Peer Computing, Technical report HPL, 57, HP Labs.
Papakonstantinou Y., Garcia-Molina H., Widom J., 1995, Object exchange across
heterogeneous information sources, In P. S. Yu and A. L. Chen (eds.), 11th Conference on
Data Engineering, Taipei, Taiwan, IEEE Computer Society, p 251-260.