Schema Mapping in P2P Networks based on
Classification and Probing
Guoliang Li1, Beng Chin Ooi2, Bei Yu2, and Lizhu Zhou1
1Department of Computer Science and Technology
Tsinghua University, Beijing 100084, China
2School of Computing,
National University of Singapore, Singapore
Abstract. In this paper, we address the problems of adaptive schema
mapping between different peers in a peer-to-peer network and of searching
for interesting data residing at different peers based on such mappings.
We begin by classifying the shared schema of each peer into a taxonomy
of relation categories and attribute categories. We then propose an
adaptive schema mapping method that selectively probes the shared schema
with query probes generated from the classification rules. To improve
the accuracy of schema mapping, we introduce the notions of a confusion
matrix and prior-knowledge. Finally, we present a query reformulation
strategy for retrieving and integrating data from all relevant peers. We
have implemented our proposed schema mapping and query processing
methods in real settings with real datasets. The experimental results
show that our method can be adopted effectively in practice.
Sharing data among multiple sources is crucial in a wide range of applications,
including enterprise data management, large-scale scientific projects, government
agencies and the World-Wide Web in general. Data integration approaches offer
an architecture for data sharing in which data is queried through a mediated
schema, but physically stored at the source locations based on their own schemas.
Recent data integration systems have been successful at enabling data sharing,
but on a relatively small scale, due to the expensive cost of constructing the
Recently, peer data management systems (PDMS) have been proposed as an
architecture for decentralized data sharing [1,2,9,19,20,23]. A PDMS consists
of a set of (physical) peers, and each peer has an associated schema, denoted as
peer schema, that represents its domain of interest. Some peers store actual data,
with mappings from their physical schemas to their relevant peer schemas.
However, a peer may not have complete data instances for its peer schema, since
individual peers typically do not contain complete information about a domain.
This calls for schema mappings in order to tap into relevant peers for more
complete answers. Mapping all data sources to a single global schema (or mediator)
in a PDMS is not feasible due to the decentralization and scalability requirements
of P2P systems. Therefore, in a PDMS, mappings between disparate schemas
are built directly and stored locally, such that when a query is posed at a peer,
the answers are obtained by integrating retrieved results of reformulated queries
from relevant peers, which are generated by exploring the mappings.
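
The reformulation step described above can be illustrated with a minimal sketch. It assumes a mapping that renames a local relation and its attributes to those of one relevant peer; the `Query` type, the `reformulate` and `to_sql` helpers, and the rule that a peer is skipped when it cannot evaluate a selection condition are all hypothetical choices made for illustration, not the strategy defined in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """A simple select-project query (hypothetical representation)."""
    relation: str
    select: tuple
    where: tuple = ()  # each condition is (attribute, operator, literal)


def reformulate(query, mapping):
    """Rewrite a local query for one remote peer, or return None.

    `mapping` maps a local relation name to a pair
    (remote relation name, {local attribute: remote attribute}).
    Assumption: a peer that lacks a counterpart for any condition
    attribute is skipped; unmapped projection attributes are dropped.
    """
    if query.relation not in mapping:
        return None
    remote_relation, attr_map = mapping[query.relation]
    if any(attr not in attr_map for attr, _, _ in query.where):
        return None
    select = tuple(attr_map[a] for a in query.select if a in attr_map)
    if not select:
        return None
    where = tuple((attr_map[a], op, v) for a, op, v in query.where)
    return Query(remote_relation, select, where)


def to_sql(query):
    """Render the query as SQL text (literals shown via repr, for display only)."""
    sql = f"SELECT {', '.join(query.select)} FROM {query.relation}"
    if query.where:
        conds = " AND ".join(f"{a} {op} {v!r}" for a, op, v in query.where)
        sql += f" WHERE {conds}"
    return sql


# Hypothetical local query and mapping to one relevant peer.
local = Query("paper", ("title", "year"), (("author", "=", "Ooi"),))
peer_mapping = {"paper": ("article", {"title": "name", "author": "writer"})}
remote = reformulate(local, peer_mapping)
print(to_sql(remote))
```

In such a scheme, the answers returned by each reformulated query would be translated back through the same mapping and unioned at the querying peer.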
Schema mapping in most existing PDMS proposals, such as Hyperion [2,15],
Piazza [9,23], and PeerDB [19,20], requires human intervention, which is
inefficient and ineffective for large networks and dynamic sources. Therefore, an
adaptive way of generating schema mappings is highly desirable. In this paper,
we propose such a schema mapping method based on classification. We classify
the shared schemas (relational tables and attributes) of individual peers into a
taxonomy of relation categories and associated attribute categories, which es-
sentially represent various conceptual domains. For all peers that have relations
belonging to the same category, schema mappings are generated among them. When
a new peer joins, its shared schema is classified by probing its
relations with query probes generated from classification rules; consequently,
it is assigned to the one or more relation categories that best match the probing
results. Subsequently, its schema is mapped to peers in the
The advantage of our classification-based schema mapping is that its sim-
plicity and modeling uniformity allow integrating the contents of several sources
without having to tackle complex structural differences. Another advantage is
that query evaluation in classification-based sources can be done efficiently.
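
The probing-based classification described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: each classification rule is assumed to be a set of keywords that acts as one query probe, a probe is assumed to match a tuple when every keyword occurs in some attribute value, and a relation is assigned to every category whose share of all probe matches reaches a threshold. The names `probe`, `classify_relation`, and `threshold` are hypothetical.

```python
def probe(rows, keywords):
    """Count tuples in which every probe keyword occurs in some attribute value."""
    count = 0
    for row in rows:
        text = " ".join(str(value).lower() for value in row.values())
        if all(keyword.lower() in text for keyword in keywords):
            count += 1
    return count


def classify_relation(rows, rules, threshold=0.5):
    """Return the categories whose probes account for enough of the matches.

    `rules` maps a category name to a list of probes (keyword lists).
    The threshold on the share of matches is an assumed decision rule.
    """
    matches = {
        category: sum(probe(rows, keywords) for keywords in probes)
        for category, probes in rules.items()
    }
    total = sum(matches.values())
    if total == 0:
        return []
    return sorted(c for c, m in matches.items() if m / total >= threshold)


# Hypothetical shared relation of a newly joined peer.
relation = [
    {"name": "Database Systems", "info": "book published by Wiley"},
    {"name": "Learning Theory", "info": "book published by Springer"},
    {"name": "Heat", "info": "film directed by Mann"},
]
# Hypothetical classification rules, one probe per keyword list.
rules = {
    "publication": [["book"], ["published"]],
    "movie": [["film"], ["directed"]],
}
print(classify_relation(relation, rules))
```

Lowering the threshold lets a relation fall into several categories, which corresponds to the "one or more relation categories" case mentioned in the text.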
Our system is based on a super-peer P2P network in which the super peers
themselves are organized in a structured overlay, such as BATON [12], and the normal
peers within the cluster managed by a super peer are unstructured. The categories
are distributed among the super peers, through which normal peers build
schema mappings. Our category structure is distinct from a global schema (or
mediator), since it is distributed among all the super peers and is used by
peers to generate schema mappings, not by users to pose queries.
In this paper, we make the following contributions:
– We propose a method for schema mapping based on classification and prob-
ing in PDMS.
– We adopt the notion of a confusion matrix and apply prior-knowledge to
improve the accuracy of schema mapping whenever there are overlapping
instances among the shared schemas.
– We present query reformulation strategies for rewriting local queries for
relevant peers to achieve efficient query answering.
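
To make the role of a confusion matrix concrete, the sketch below shows one common way such a matrix is used in probe-based classification, in the style of QProber [7]. This excerpt does not give the paper's exact formulation, so the adjustment is an assumption: entry M[i][j] is taken to be the fraction of category-j instances that probes for category i match on training data, the observed match counts are modeled as M times the true counts, and the true counts are recovered by solving that linear system. The function names are hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("confusion matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def adjust_counts(confusion, observed):
    """Estimate true per-category counts from observed probe match counts.

    Assumed model: observed = confusion * true, so true is the solution
    of the linear system defined by the confusion matrix.
    """
    return solve_linear(confusion, observed)


# Hypothetical 2-category example: probes for category 0 also match
# 20% of category-1 instances, and vice versa for 10%.
confusion = [[0.9, 0.2],
             [0.1, 0.8]]
observed = [94.0, 26.0]
print(adjust_counts(confusion, observed))
```

The adjusted counts, rather than the raw probe counts, would then drive the category assignment, which reduces the effect of probes that systematically match instances of the wrong category.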
The paper is organized as follows. We discuss the related work in Section 2.
Section 3 presents how to create the schema mapping, and Section 4 describes the
query reformulation and evaluation strategies. In Section 5, we provide extensive
experimental evaluations of our method and we conclude the paper in Section 6.
There is a long line of research on schema mapping, and we briefly
review recent and relevant proposals. Kang et al. [14] investigated schema
matching techniques that work in the presence of opaque column names and
data values. Yu et al. [27] proposed a method for constraint-based XML
data integration. Dhamankar et al. [4] described the iMAP system, which
semi-automatically discovers both 1-1 and complex matches. These three methods
are efficient only in centralized environments.
More recently, the database community has begun to exploit P2P technologies
for database applications [2,6,8,9,13,15,22,23,26]. In [8], the problem of
data placement for P2P systems was addressed, and how data management could
be applied to P2P was presented. In [26], the class of “hybrid” P2P systems,
where some functionality is still centralized, was studied. In [13], caching of
OLAP queries was addressed in the context of a P2P network. Ooi et al. [18–20]
introduced an IR technique into schema mapping in PDMS. Halevy et al.
addressed the issue of schema mediation and proposed a language for mediating
between peer schemas in [9]. The Hyperion project was proposed in [2,15,22],
In this experiment, we first evaluate the quality of schema mappings gen-
erated with the two approaches, then we compare the effectiveness of query
processing in PDMS with the two methods.
Quality of schema mapping: Similar to Section 5.1, we use precision and recall
to evaluate the quality of schema mappings for each of the schemas S1
to S5. Figure 5 shows the experimental results of our method (with and without
prior-knowledge) compared with those of PeerDB (with 2 keywords annotated for
each relation and with 5 keywords annotated for each relation). Not surprisingly,
in a PDMS our schema mapping method with prior-knowledge is
more effective than that without prior-knowledge. The precision and recall with
prior-knowledge are larger than 80% for most schemas. Our method is also
superior to the PeerDB approach for most schemas (all except S3).
Generally, the precision and recall of our method exceed those of PeerDB by 10% to
20%. Moreover, PeerDB depends on the keywords annotated to a schema, which
must be generated manually. Annotating more keywords to a schema could improve
the recall, but degrades the precision. The experimental results show that
our method has good schema mapping performance in a PDMS whenever there
are overlapping instances among the schemas.
Effectiveness of query processing: With the created schema mappings, we
evaluate the effectiveness of query processing of the two approaches. We also use
the notions of precision and recall for our evaluation. Here precision is defined
as the fraction of the number of correct returned answers to the total number
of returned answers, and recall is the fraction of the number of correct returned
answers to the total number of correct answers.
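
The two measures defined above can be computed directly from the set of returned answers and the set of correct answers. The sketch below is a straightforward rendering of those definitions; the function name and the convention of returning 0.0 when a denominator is empty are assumptions made for illustration.

```python
def precision_recall(returned, correct):
    """Compute (precision, recall) for a set of returned answers.

    precision = |returned and correct| / |returned|
    recall    = |returned and correct| / |correct|
    An empty denominator yields 0.0 (an assumed convention).
    """
    returned, correct = set(returned), set(correct)
    hits = len(returned & correct)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall


# Hypothetical answer identifiers: 4 answers returned, 2 of them correct,
# out of 3 correct answers overall.
p, r = precision_recall({"a", "b", "c", "d"}, {"a", "b", "e"})
print(f"precision={p:.2f} recall={r:.2f}")
```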
We generate six queries to evaluate the two methods, of which four
are based on Amalgam schemas and two are based on THALIA schemas. Two of the
queries contain join operations. Figure 6 shows the experimental results.
Again, we can see that our method is more effective than the PeerDB approach.
Fig. 6. Query processing in PDMS. Panels: Query Processing Precision (%) and
Query Processing Recall (%); legend entries: PeerDB (# of keywords = 2) and
PeerDB (# of keywords = 5).
In this paper, we propose a method for effective schema mapping based on
classification and probing in a PDMS. We classify each peer schema into certain
categories through probing, and the relations in the same category can be
mapped to each other. We enhance the classification-based mapping by applying
a confusion matrix and prior-knowledge. We also present a strategy for
reformulating a query over a local peer schema into queries over the relevant peer
schemas for effective query answering. Our experimental results show that our
method achieves high accuracy for schema mapping on real datasets.
This work is supported by the National Natural Science Foundation of China
under Grant No.60573094, the National Grand Fundamental Research 973 Pro-
gram of China under Grant No.2006CB303103, the National High Technology
Development 863 Program of China under Grant No.2006AA01A101, Tsinghua
Basic Research Foundation under Grant No. JCqn2005022, and Zhejiang Natural
Science Foundation under Grant No. Y105230.
1. K. Aberer, P. Cudre-Mauroux, and M. Hauswirth. A framework for semantic
gossiping. SIGMOD Record, 31(4):505–516, 2002.
2. M. Arenas, V. Kantere, A. Kementsietsidis, I. Kiringa, R. J. Miller, and
J. Mylopoulos. The Hyperion project: From data integration to data coordination.
SIGMOD Record, 32(3):53–58, 2003.
3. W. W. Cohen. Learning trees and rules with set-valued features. In AAAI, pages
4. R. Dhamankar, Y. Lee, A. Doan, et al. iMAP: Discovering complex semantic
matches between database schemas. In SIGMOD, 2004.
5. R. O. Duda and P. E. Hart. Pattern classification and scene analysis. In Wiley,
6. E. Franconi, G. Kuper, A. Lopatenko, and I. Zaihrayeu. Queries and updates in
the coDB peer to peer database. In VLDB, 2004.
7. L. Gravano, P. G. Ipeirotis, and M. Sahami. QProber: A system for automatic
classification of hidden-web databases. 21(1):1–41, 2003.
8. S. Gribble, A. Halevy, Z. Ives, M. Rodrig, and D. Suciu. What can databases do
for peer-to-peer. In WebDB, 2001.
9. A. Halevy, Z. Ives, D. Suciu, and I. Tatarinov. Schema mediation in peer data
management systems. In ICDE, pages 505–516, 2003.
10. J. Hammer, M. Stonebraker, and O. Topsakal. THALIA: Test harness for the
assessment of legacy information integration approaches. In ICDE, 2005.
11. P. G. Ipeirotis, L. Gravano, and M. Sahami. Probe, count, and classify:
Categorizing hidden-web databases. pages 61–78, 2001.
12. H. V. Jagadish, B. C. Ooi, and Q. H. Vu. BATON: A balanced tree structure for
peer-to-peer networks. In VLDB, pages 661–672, 2005.
13. P. Kalnis, W. S. Ng, B. C. Ooi, D. Papadias, and K. L. Tan. An adaptive
peer-to-peer network for distributed caching of OLAP results. In SIGMOD, 2002.
14. J. Kang and J. Naughton. On schema matching with opaque column names and
data values. In SIGMOD, 2003.
15. A. Kementsietsidis, M. Arenas, and R. J. Miller. Mapping data in peer to peer
systems: Semantics and algorithmic issues. In SIGMOD, 2003.
16. R. Kohavi and F. Provost. Glossary of terms. 30(2/3):271–274, 1998.
17. R. J. Miller, D. Fisla, M. Huang, D. Kymlicka, F. Ku, and V. Lee. Amalgam schema
and data integration test suite. www.cs.toronto.edu/∼miller/amalgam, 2001.
18. W. S. Ng, B. C. Ooi, and K. L. Tan. Bestpeer: A self-configurable peer-to-peer
system. In ICDE, 2002.
19. W. S. Ng, B. C. Ooi, K.-L. Tan, and A. Zhou. PeerDB: A p2p-based system for
distributed data sharing. In ICDE, 2003.
20. B. C. Ooi, Y. Shu, and K.-L. Tan. Relational data sharing in peer-based data
management systems. SIGMOD Record, 32(3):59–64, 2003.
21. J. R. Quinlan. C4.5: Programs for machine learning. In Morgan Kaufmann
Publishers, Inc., 1992.
22. P. Rodriguez-Gianolli, M. Garzetti, L. Jiang, et al. Data sharing in the
Hyperion peer database system. In VLDB, 2005.
23. I. Tatarinov, Z. Ives, J. Madhavan, and A. H. et al. The Piazza peer data
management project. SIGMOD Record, 32(3):47–52, 2003.
24. V. N. Vapnik. Statistical learning theory. In Wiley-Interscience, 1996.
25. J. Wang, J.-R. Wen, F. H. Lochovsky, and W.-Y. Ma. Instance-based schema
matching for web databases by domain-specific query probing. In VLDB, 2004.
26. B. Yang and H. Garcia-Molina. Comparing hybrid peer-to-peer systems. In VLDB,
27. C. Yu and L. Popa. Constraint-based XML query rewriting for data integration.
In SIGMOD, 2004.