Towards improving XML search by using structure clustering technique

Article
Journal of Information Science
1–21
© The Author(s) 2014
Reprints and permissions:
sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/0165551514560523
jis.sagepub.com
Rehab Shalabi
Institute of Graduate Studies and Research, Alexandria University, Egypt
Ahmed Elfatatry
Institute of Graduate Studies and Research, Alexandria University, Egypt
Abstract
Searching large XML repositories is a challenging research problem. Applying clustering to a large repository before performing a search enhances the search process significantly. Clustering reduces a search space into smaller XML collections that can be better searched. In this work, we present an enhanced method for XML clustering by structure. We also introduce a new representation of XML structure that keeps all characteristics of XML structure without summarization. Then, we perform a benchmark comparison between the search results of our improved method and those of the SAXON and Qizx XQuery processors. The comparison focuses on search processing time and accuracy of the results using different sizes of datasets for both homogeneous and heterogeneous XML documents. The attained results show better accuracy at the same level of performance.
Keywords
Clustering; EXCLS; information retrieval; XCLS; XEdge; XML search engine; XQuery
1. Introduction
The Extensible Mark-up Language (XML) is now a pervasive technology for information representation, data storage
and exchange. XML can be used in different applications and domains such as e-commerce, digital libraries, integration
systems, bioinformatics [1,2] and laboratories [3]. Currently, there are large collections of structured and semistructured
data both on the Web and in enterprise applications. Available techniques provide less than satisfactory solutions for the
management of such large collections. The continuous growth of XML repositories has been matched by increasing
efforts in the development of XML retrieval systems. For this reason, the Initiative for the Evaluation of XML-Retrieval
(INEX) was founded in 2002 to provide a platform for evaluating XML retrieval algorithms. The main goal of INEX
is the evaluation of focused retrieval, which identifies the relevant parts of relevant documents. To this end, INEX
provides large test collections of structured documents and uniform evaluation measures.
In unstructured information retrieval, getting the right result is usually straightforward. The challenge lies in semistructured information retrieval. The retrieval unit in a text document search is an entire document, whereas keyword search in XML data treats every XML node as a retrievable unit [4]. An XML document consists of nested elements starting with the root [5]. Each element can have attributes and values, in addition to nested sub-elements. Figure 1 shows an example of an XML document in a tree representation.
There are different approaches to XML search. Some are keyword-based: the query algorithm cannot identify the exact meaning of each keyword and has to generate a sub-tree of the XML document that includes each input keyword of the query. These sub-trees can be retrieved using different concepts such as lowest common ancestors (LCAs), smallest lowest common ancestors (SLCAs) or the meaningful lowest common ancestor structure (MLCA) to answer keyword queries in XML documents [6–9]. A number of approaches focus on keyword search by identifying connected trees with minimal cost in labelled graphs as ‘Steiner trees’ [10]. Other approaches have proposed mapping XML documents to relational databases to improve the response time [11,12]. The work reported in Nayak [13] used a new approach for top-K data source selection by summarizing the data sources for rapid filtering of non-promising sources.
Corresponding author:
Rehab Shalabi, Department of Information Technology, Institute of Graduate Studies and Research, Alexandria University, Alexandria, Egypt.
Email: rehab.desoki@gmail.com
XML clustering by structure plays a crucial role in searching XML collections as it organizes large and heterogeneous
collections of documents into smaller homogeneous collections. Retrieving information using a search engine from a
large repository is time-consuming. If we can access only the parts of the data that we need, the query processing can be
conducted more efficiently. Hence, the search space will be reduced by avoiding searching in unnecessary data [5].
The quality of a clustering process involves two aspects: accuracy and performance. Optimizing one aspect has a negative impact on the other. Hence, there is a need to consider the trade-off between accuracy and performance of the clustering process. In this paper, we take advantage of the Enhanced XML Clustering by Structure (EXCLS) algorithm, which is an enhancement of the XML clustering by structure (XCLS) algorithm described in Xu and Papakonstantinou [7], to optimize both the performance and the accuracy of clustering. By using EXCLS in our proposed method framework, we can reduce the size of the search space to improve performance and accuracy.
1.1. Research objective
It is evident that there is a need for enhancing both the search performance and the accuracy of search results of XML documents. The incremental clustering algorithms are non-iterative and their time and space requirements are small [5]. Based on this fact, we focus in this work on the following:
• Introduce an enhancement (EXCLS) [8] of the XCLS [7] incremental clustering algorithm.
• Present a new XML representation that preserves all XML characteristics and hierarchical relationships without structure summarization to enhance the accuracy of the clustering.
• Modify the similarity equation that is used in XCLS based on the new XML representation that has been introduced to calculate the similarity between each pair of documents.
• Enhance the matching procedure to be able to find all common nodes between each pair of documents.
• Enhance XML search performance and result quality by the usage of EXCLS clustering before XML search.
Finally, EXCLS is compared with two major incremental clustering algorithms (XCLS and XEdge). XEdge clusters homogeneous and heterogeneous XML documents using edge summaries [9]. Our proposed search method is also compared with two major open source query processors (SAXON and Qizx). The comparison is based on two aspects: performance (run-time) and accuracy of the results. Our experiments use datasets of different sizes and types, covering both homogeneous (same nodes and same/different edges) and heterogeneous (different nodes and different edges) XML documents.
1.2. Outline
In this work, we introduce an XML search method that applies incremental XML clustering before the XML search, and we compare our proposed method with XQuery processors. The comparison focuses on the search performance and the quality of the search results.
This paper is structured as follows. Section 2 discusses existing work on XML search. In Section 3, we present our proposed method for XML search in detail. Section 4 evaluates the proposed XML search method by comparing it with the Saxon [10] and Qizx [11] XQuery processors. In addition, the results of the comparison of EXCLS to the XCLS [7] and XEdge [9] clustering algorithms are analysed. Section 5 discusses the conclusions and future work.
Figure 1. An example of an XML document.
2. Related work
A search in semistructured data usually starts by computing LCAs [12,13]. The aim is to return the set of smallest trees
containing all keywords. The SLCAs of a query consist of nodes that contain all keywords in their sub-trees and none of
whose children contains all keywords in its sub-tree. Figure 2 shows a query for (XML, EDI); node 1.1.2.1 and node
1.1.2.2 contain the two keywords separately, and node 1.1.2 is the ancestor connecting them. Therefore, node 1.1.2 is
expected to be the result.
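The lowest common ancestor in this example can be read directly off the node numbers. The following is a minimal sketch, assuming that the dotted numbers in Figure 2 are Dewey-style labels in which every prefix identifies an ancestor; the function name `lca` is hypothetical and is not part of any cited system.

```python
def lca(dewey_labels):
    """Return the longest common prefix of dotted labels, i.e. their LCA."""
    split = [label.split(".") for label in dewey_labels]
    prefix = []
    # zip stops at the shortest label, so the prefix never overruns a label.
    for parts in zip(*split):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return ".".join(prefix)


# The two keyword nodes from the (XML, EDI) example.
answer = lca(["1.1.2.1", "1.1.2.2"])
```

This only computes the LCA of one pair of matches; a full SLCA computation would additionally discard any candidate that has a descendant which also contains all keywords.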
The problem is to decide which sub-trees are more appropriate for answering the query. The MLCA [14] is an XML segment that meaningfully relates the nodes together, corresponding to the relevant variables in the XQuery. Usually, users have only a limited knowledge of the XML structure, and may be unable to write a correct XQuery expression, especially for heterogeneous XML collections. MLCA enables users to take advantage of XQuery without having good knowledge of the document structure. Schema-Free XQuery is used by casual users with partial knowledge of the schema, and also by expert users who work in data integration. MLCA helps users mix keywords and structured queries to find relevant results. MLCA nodes are computed using a stack data structure. The computation begins by retrieving a list of all matches to each keyword. Then it visits all the keyword matches in document order and keeps a stack in which each node is a descendant of the node below it. A potential MLCA is a node that contains all of the keywords queried by the user in its sub-tree. A schema-free query is written once and can be applied to multiple data sources that supply similar content under different schemas [14].
Another work on keyword search in structured data identifies connected trees with minimal cost in labelled graphs as
answers called ‘Steiner trees’ [15]. Rows are modelled as nodes in a graph connected by links induced by foreign keys
and other relationships. Query answers are modelled as rooted trees connecting rows that match individual keywords in
the query. A heuristic algorithm is used to rank query results. However, it is fairly difficult to extract all of the Steiner
trees in a large graph [16]. Moreover, the Steiner tree is difficult to adapt for complicated graph databases as it can discover
simple tree structures but cannot identify the meaningful graph structures with rich structural relationships.
Pixware [17] propose a data topology search to improve search efficiency; their method is constrained by inputting
two keywords only. Traditionally, the inverted index is employed to answer keyword queries. It is effective for text- and
document-based retrieval. However, it is inadequate for supporting keyword queries in structured, semistructured and
graph data because it is difficult to identify answers that capture rich structural relationships through the inverted index.
In Li et al. [18], a technique is proposed for mapping an XML document to a relational database to enhance response time. The approach preserves the nested structure of the XML documents. In Guo et al. [19] a lossless schema mapping algorithm that generates a database schema from a DTD for storing and querying ordered XML documents is presented. They also propose two linear data mapping algorithms to map ordered XML data into relational data. Normalized schemas convert an XML document into a large number of tuples of different relations. Performance becomes an issue if an SQL query involves a large number of joins. Although many research initiatives have examined how to map and store XML data in a relational database system, none has specified which approach is best suited for which business needs. It is very difficult to find a solution that can fit all users' requirements because of XML’s flexibility and extensibility [20]. An evaluation of the performance of various mapping approaches in terms of database file size, creation time and query processing time on both real and synthetic datasets has been reported in Norwati et al. [21].
Figure 2. XML document.
The Edge approach has the lowest storage requirement. However, the time taken to create a database is too long for a
large dataset. In terms of query processing time, the schema-based approach is faster. The schema-based approach
requires a smaller number of joins compared with the edge and the attribute approaches. Replacing joins with selections
is an advantage in enhancing the query processing. Index utilization allows the schema-based and attribute approaches
to improve query performance, which is a critical issue in large datasets [20].
Approximation techniques have been proposed to deal with heterogeneous data queries and to evaluate twig patterns [22]. A core problem in XML query processing is twig pattern matching, that is, finding in an XML data tree the matches that satisfy a specified twig (path) query pattern. Gou and Chirkova [22] present a survey on twig pattern matching. They focus on two techniques of XML query processing: the relational and the native techniques. Twig patterns are applicable to information retrieval as well as to databases. Database queries return all results that precisely match the query string. Information retrieval (IR) queries allow fuzzy results, which are ranked based on their relevance to the query. Although twig pattern matching has become an important research area, it suffers from several drawbacks in many cases, such as large XML data repositories and complex twig patterns where data related to the query appear in a small part of the whole XML document. If we can access only the parts of the data that we need, the query processing can be more efficient because the search space is reduced by eliminating less important data (not relevant to the query) during the query processing [5]. One solution in this case is to use clustering approaches in order to partition the whole XML data based on content, structures or both [23].
Result snippets and result clustering are the most commonly used result analysis techniques in XML repository search.
Result snippets are used to resolve the ambiguity of keyword search. In order to manage the inaccuracy of ranking functions,
web search engines use snippets for helping users to quickly judge the relevance of query results by providing a
brief path of each one so that users can choose and explore relevant results among many others. The result snippets in
XML keyword search have been discussed in eXtract [24]. Result clustering is also used to improve the effectiveness of
information retrieval. It is desirable to cluster query results based on their similarity so that users can quickly browse all
possible interpretations of a query and choose the sets of results that are relevant [25]. The work in Liu et al. [26] shows
that efficiency is increased by using snippets in result clustering.
There are different approaches for integrating data from heterogeneous data sources where a common assumption is that the data sources remain stable or changes are known. In practice there is always a demand for supporting ad hoc information needs concerning unexpected changes from autonomous data sources. Harmonization is often necessary for data collected from autonomous and heterogeneous XML data sources: it provides a uniform representation, which is used to remove all forms of heterogeneity from the collected data. De Vries et al. [27] propose the harmonization of semantically similar data when extracting data from XML data sources. The harmonization is based on a novel representation of XML data, called an XML relation, which is able to represent the structure and order among XML data. They introduce a constructor algebra used for the harmonization of XML data. This technique is able to form a unique relational representation for any XML data source. They demonstrate that the XML relation representation supports the grouping and aggregation of data needed, for example, in OLAP (online analytical processing) applications.
The XML data mining track addresses the challenges of mining semistructured documents and of applying generic Machine Learning (ML) techniques in the structured domain, that is, classification and clustering of semistructured documents. The XML data mining track ran for six editions, from INEX 2005 to 2010. In De Vries et al. [28] the focus is on INEX 2010 and the tasks for evaluating unsupervised clustering. Any clustering of XML documents can be evaluated with respect to their predefined categorization. The standard measures of purity, entropy and F-measure are used to determine the quality of clusters with regard to the categories.
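To make two of these measures concrete, the sketch below computes purity and a size-weighted entropy for a clustering, using their common textbook definitions rather than the exact INEX evaluation code. Each cluster is assumed to be given as a list of the predefined category labels of its documents; the function names and the sample labels are hypothetical.

```python
from collections import Counter
from math import log2


def purity(clusters):
    """Fraction of documents that belong to the majority category of their cluster."""
    total = sum(len(c) for c in clusters)
    return sum(max(Counter(c).values()) for c in clusters) / total


def entropy(clusters):
    """Size-weighted average of the category entropy inside each cluster (unnormalised)."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        counts = Counter(c)
        h = -sum((n / len(c)) * log2(n / len(c)) for n in counts.values())
        result += (len(c) / total) * h
    return result


# Two clusters over the categories "a" and "b" (illustrative data only).
sample = [["a", "a", "b"], ["b", "b", "b"]]
sample_purity = purity(sample)
sample_entropy = entropy(sample)
```

Higher purity and lower entropy both indicate clusters that agree more closely with the predefined categories.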
In Tovar et al. [29] the authors propose two iterative clustering methods for grouping Wikipedia documents into clusters. They apply a recursive clustering process iteratively on a subset of the complete set. In each iteration, they select representative items for each group, which are then used for the next step of clustering. The obtained results outperformed the random baseline presented in the INEX 2010 clustering task of the XML-Mining track. Clustering analysis refers to the partitioning of a dataset into clusters, so that the data in each subset (ideally) share some common features according to a defined distance measure. They report the results of using two different approaches for clustering the INEX 2010 collection. A recursive method based on the K-biX/K-biN and K-Means clustering methods has been proposed. The aim of the two presented approaches is to allow high scalability. Traditional clustering of huge volumes of data requires the calculation of a two-dimensional similarity matrix, a process that needs quadratic time complexity with respect to the number of documents. The lower the dimensionality of the similarity matrix, the faster the clustering algorithm will be executed. However, the performance of both approaches was not as expected, because they only slightly improved on the baseline.
Searching XML data coming from thousands of different data sources by sequentially querying each single source is
extremely costly and impractical. In Vutukuru et al. [30] the authors propose a new approach for selecting the top-K data
sources by relying on their relevance to the query, to avoid the cost of searching in irrelevant data sources. The data
sources are summarized as succinct synopses for the rapid filtering of non-promising sources. They also maintain both
structural and value distribution information of each data source, and propose a new ranking function to measure the
relevance of the data source to the query effectively.
3. Proposed XML search framework
Executing a query to search for specific information in a large number of XML documents is a time-consuming and error-prone task. Clustering an XML repository can be used to improve information retrieval in different ways. The first technique is to cluster (group) together XML documents that are similar in structure, content or both before the search process. Then, the query is executed on the most relevant clusters. Another technique is to use clustering after the search process on the search results for ranking the resulting documents. This process is done by taking the search results of an XML search engine, clustering them and presenting them to the user in distinct groups [31]. Hence, we can define unique clusters containing all relevant documents or a set of clusters addressing the different aspects of the relevant information to the user [5].
In our XML search method, we focus on the technique of clustering XML documents before the search process. As shown in Figure 3, the search process in our proposed method consists of two steps. The first step is clustering the XML document repository. In the second step, the similarity between the required document and the clusters’ centres is calculated. Then, clusters are sorted by their similarity to the required document in descending order (equivalently, by their distance in ascending order). Finally, a search is performed in the clusters that are most similar to the required document.
When new data are added to the XML repository, the new documents are assigned to the most similar cluster. Then, a refinement phase is applied to check whether any documents have to be moved between clusters. If the elements of a cluster change, a new cluster centre has to be calculated. The clustering algorithm used in our search method, EXCLS [8], retains the performance advantage of incremental XML clustering and includes the following enhancements:
• Introducing a new XML representation based on a hash table data structure, which improves the clustering performance. The introduced representation preserves the characteristics of the XML documents and the hierarchical relationships between nodes, which improves the clustering accuracy.
• Avoiding XML summarization by structure, which negatively affects clustering accuracy.
• Eliminating user-supplied input parameters, which affect clustering results, by splitting the clustering process into two phases. The first phase is the preparation phase, which is applied to a sample dataset before clustering the whole set in order to evaluate the threshold (the input parameter to the clustering algorithm). The second phase is the application of the XCLS incremental clustering algorithm to the whole document set with the predefined threshold.
Figure 4 shows the proposed XML search method. The method consists of two phases.
Figure 3. Proposed search XML document process.
3.1. Clustering phase (EXCLS)
The first phase is the pre-processing (EXCLS clustering [8]) phase, which is done offline and consists of the following
tasks.
3.1.1. XML representation. Before an XML document is put into the proposed representation, each node name is encoded as an integer using a hash table. Handling node names as integer values saves memory and processing time. Our representation consists of a vector of the levels in the XML document. Each element of the vector (a level) links to a hash table whose keys are the nodes of that level and whose values are the lists of parents of each node in that level, without any summarization of the document structure, as shown in Figure 5. The use of a hash table in this representation reduces the time complexity of the matching procedure.
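The representation described above can be sketched in a few lines of Python. This is an illustrative reading, not the authors' implementation: it assumes that a Python list plays the role of the vector of levels, that a `dict` plays the role of each hash table, that every occurrence of a node appends one parent id to the node's list, and that the root's parent is recorded as `None`. The names `build_level_structure` and `name_ids` are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_level_structure(xml_text, name_ids):
    """Return a list of levels; each level maps node id -> list of parent ids."""
    root = ET.fromstring(xml_text)
    levels = []
    # Breadth-first walk; each queue entry is (element, depth, parent id).
    queue = [(root, 0, None)]
    while queue:
        elem, depth, parent_id = queue.pop(0)
        # Encode the node name as an integer on first sight.
        node_id = name_ids.setdefault(elem.tag, len(name_ids))
        if depth == len(levels):
            levels.append({})
        # One parent entry per occurrence, so nothing is summarized away.
        levels[depth].setdefault(node_id, []).append(parent_id)
        for child in elem:
            queue.append((child, depth + 1, node_id))
    return levels


name_ids = {}
levels = build_level_structure("<a><b><c/></b><b><d/></b><c/></a>", name_ids)
```

Sharing one `name_ids` table across all documents keeps the integer codes comparable between documents, which is what the matching step relies on.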
Regarding space complexity, the representation proposed in EXCLS [8] keeps only the necessary information of the XML data in RAM. If the average number of elements in a document is N and the number of levels is L, the total memory required for the level structure in a cluster is approximately (2*N+L)*4 bytes (2 for the element id and its parent), in addition to the memory needed by the hash table. However, the XML representation in XCLS needs L*M*8 bytes (4 bytes for the element id and 4 bytes for its occurrences), where M is the average number of nodes in each level [7].
Figure 4. Proposed XML documents search.
Figure 5. XML representation in EXCLS.
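As a quick arithmetic illustration of the two estimates, the sketch below evaluates both formulas for one made-up document profile. The values N = 100, L = 5 and M = 12 are hypothetical and are not taken from the paper's datasets; the hash table overhead mentioned in the text is not included.

```python
def excls_level_bytes(n_elements, n_levels):
    """(2*N + L) * 4 bytes: an id and a parent id per element, plus one entry per level."""
    return (2 * n_elements + n_levels) * 4


def xcls_level_bytes(n_levels, nodes_per_level):
    """L * M * 8 bytes: a 4-byte id and a 4-byte occurrence count per node."""
    return n_levels * nodes_per_level * 8


excls_bytes = excls_level_bytes(n_elements=100, n_levels=5)
xcls_bytes = xcls_level_bytes(n_levels=5, nodes_per_level=12)
```

For this profile the estimates are 820 bytes and 480 bytes respectively, which reflects that the EXCLS structure stores every element with its parent instead of a per-level summary.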
Figure 6 presents the level structure of the cluster’s centre which is a combination of two level structures in the
EXCLS algorithm. The level structure of a cluster is the union of elements of the two objects at each level.
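A level-wise union of two level structures could look like the following sketch. It assumes that each level is a `dict` from node id to a list of parent ids and that the parent lists of a shared node are concatenated; whether EXCLS concatenates or deduplicates these lists is not stated here, so that choice is an assumption, and the name `merge_level_structures` is hypothetical.

```python
def merge_level_structures(a, b):
    """Union two level structures level by level into a cluster centre."""
    merged = []
    for depth in range(max(len(a), len(b))):
        level = {}
        for source in (a, b):
            if depth < len(source):
                for node_id, parents in source[depth].items():
                    # Copy into a fresh list so the inputs stay untouched.
                    level.setdefault(node_id, []).extend(parents)
        merged.append(level)
    return merged


doc_a = [{0: [None]}, {1: [0]}]
doc_b = [{0: [None]}, {2: [0]}, {3: [2]}]
centre = merge_level_structures(doc_a, doc_b)
```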
3.1.2. XML matching. The similarity measure of the XCLS clustering algorithm [7], given in equation (1), has been modified to include the new structural information used in the representation, which yields the level similarity shown in equation (2). Equation (1) specifies the similarity between two XML objects (each either a document or a cluster) and focuses on the common nodes on each level.
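$$
\mathrm{LevelSim}_{1,2} = \frac{0.5 \times \sum_{i=0}^{L-1} CN_1^i \times r^{L-i-1} + 0.5 \times \sum_{j=0}^{L-1} CN_2^j \times r^{L-j-1}}{\left[ \sum_{k=0}^{L-1} N^k \times r^{L-k-1} \right] \times Z} \tag{1}
$$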
$CN_1^i$ is the sum of occurrences of every common element in level $i$ of object 1.
$CN_2^j$ is the sum of occurrences of every common element in level $j$ of object 2.
$N^k$ is the number of elements in level $k$ of the document.
$r$ is the base weight, that is, the increasing factor of weight. It is usually larger than 1 to indicate that the higher-level elements have more importance than the lower-level elements.
$L$ is the number of levels in the document.
$Z$ is the number of documents within the cluster.
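As a sanity check on equation (1), the sketch below evaluates it from per-level counts. It assumes the reading in which the bracketed sum over $N^k$ multiplied by $Z$ forms the denominator; the function name `level_sim_xcls` and the sample counts are hypothetical.

```python
def level_sim_xcls(cn1, cn2, n, r=2.0, z=1):
    """Equation (1): cn1[i], cn2[j] are common-element occurrences per level,
    n[k] is the number of elements in level k, z the documents in the cluster."""
    levels = len(n)
    numerator = (
        0.5 * sum(c * r ** (levels - i - 1) for i, c in enumerate(cn1))
        + 0.5 * sum(c * r ** (levels - j - 1) for j, c in enumerate(cn2))
    )
    denominator = sum(nk * r ** (levels - k - 1) for k, nk in enumerate(n)) * z
    return numerator / denominator


# Identical three-level objects: every element is common, so the result is 1.
identical = level_sim_xcls([1, 2, 3], [1, 2, 3], [1, 2, 3])
# Only the two upper levels match: (4 + 2) / (4 + 4 + 3) = 6/11.
partial = level_sim_xcls([1, 1, 0], [1, 1, 0], [1, 2, 3])
```

With $r = 2$ the root level carries weight 4, the next level 2 and the leaves 1, so mismatches near the root lower the similarity more than mismatches near the leaves.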
In the new equation, we count the occurrences of common elements and common elements with their parents in each corresponding level. Finding matches between nodes and their parents will show higher similarity than matching nodes only without matching their parents. Elements in different level positions are allocated different weights. The use of this similarity equation with our proposed matching algorithm is sufficient and there is no need to calculate $\mathrm{LevelSim}_{1,2}$ and $\mathrm{LevelSim}_{2,1}$ and then take the maximum, because $\mathrm{LevelSim}_{1,2} = \mathrm{LevelSim}_{2,1}$.
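$$
\mathrm{LevelSim}_{1,2} = \frac{0.5 \times \sum_{i=0}^{L_1-1} \left( CN_1^i + CP_1^i \right) \times r^{L_1-i-1} + 0.5 \times \sum_{j=0}^{L_2-1} \left( CN_2^j + CP_2^j \right) \times r^{L_2-j-1}}{\left[ \sum_{k=0}^{M-1} N^k \times r^{M-k-1} \right] + 0.5 \times \left[ \sum_{i=0}^{L_1-1} CP_1^i \times r^{L_1-i-1} + \sum_{j=0}^{L_2-1} CP_2^j \times r^{L_2-j-1} \right]} \tag{2}
$$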
Figure 6. The new level structure of cluster centre in EXCLS.
Most of the terms used in equation (2) are exactly the same as the corresponding terms in equation (1). However, there are some changes, listed in the following:
$CP_1^i$ ($CP_2^j$) is the number of occurrences of all common elements in level $i$ ($j$) of object 1 (2) that have the same parent.
$L_1$ ($L_2$) is the number of levels in the first (second) document.
$M$ is the maximum number of levels in both documents.
The value of the proposed similarity metric in equation (2) varies between 0 and 1. A value of 0 indicates completely structurally different XML documents and a value of 1 indicates completely structurally similar XML documents. The following example (shown in Figure 7) calculates the similarity using equation (2), where $L_1 = 5$, $L_2 = 5$ and $M = 5$, assuming $r = 2$. Equation (1) is used to calculate the similarity between two XML documents in the matching algorithm shown in
Figure 8.
Figure 7. Similarity measure between two objects in EXCLS.
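Equation (2) can be checked numerically with a short sketch. It assumes one reading of the equation, in which the weighted common-parent sums are added to the denominator, and it assumes that the root counts as having a common parent so that identical documents score 1. The function name `level_sim_excls` and the sample counts are hypothetical; they are not the values of Figure 7.

```python
def level_sim_excls(cn1, cp1, cn2, cp2, n, r=2.0):
    """Equation (2): cn/cp are per-level common-node and common-parent counts
    for each object; n[k] is the number of elements in level k (len(n) == M)."""
    l1, l2, m = len(cn1), len(cn2), len(n)
    w1 = [r ** (l1 - i - 1) for i in range(l1)]
    w2 = [r ** (l2 - j - 1) for j in range(l2)]
    numerator = (
        0.5 * sum((cn + cp) * w for cn, cp, w in zip(cn1, cp1, w1))
        + 0.5 * sum((cn + cp) * w for cn, cp, w in zip(cn2, cp2, w2))
    )
    denominator = sum(nk * r ** (m - k - 1) for k, nk in enumerate(n)) + 0.5 * (
        sum(cp * w for cp, w in zip(cp1, w1))
        + sum(cp * w for cp, w in zip(cp2, w2))
    )
    return numerator / denominator


# Identical objects: numerator 22, denominator 11 + 11 = 22.
same = level_sim_excls([1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3])
# Partial match: numerator 14, denominator 11 + 6 = 17.
partial = level_sim_excls([1, 2, 0], [1, 1, 0], [1, 2, 0], [1, 1, 0], [1, 2, 3])
```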
Input: two XML documents in EXCLS representation {D1, D2}
Output: maximum similarity between {D1, D2}
Initialization:
    Set k = 0 (the first level of the first document)
    Set MaxSim = 0 (the similarity between {D1, D2})
Method:
    While k < L1 (L1 is the number of levels of the 1st document)
        Set i = k
        Set j = 0 (the first level of the second document)
        While j < L2 (L2 is the number of levels of the 2nd document) and i < L1
            Search for common elements in level i of D1 and level j of D2.
            If a common element is found:
                Set CN_1^i and CN_2^j equal to the number of common elements found.
                Set CP_1^i and CP_2^j equal to the number of common elements which have the same parent.
                Move D1 and D2 to the next level (increase i, j by 1).
            Else
                Move D2 to the next level (increase j by 1).
        End while j.
        Calculate the similarity between the two XML documents.
        If (Similarity > MaxSim)
            Set MaxSim = Similarity.
        Move D1 to the next level (increase k by 1).
    End while k.
Figure 8. Matching algorithm.
Assuming that $L_1 > 0$ and $L_2 > 0$ (the numbers of levels in the XML objects $D_1$ and $D_2$), we search for all common nodes between the two XML documents:
(1) Set $i$, $k$ and $j$ to zero to start the search in the first level of $D_1$ and $D_2$.
    If matching nodes are found, record the number of common nodes for the current level in $D_1$ ($CN_1^0$) and $D_2$ ($CN_2^0$), record the number of common parents for the current level in $D_1$ ($CP_1^0$) and $D_2$ ($CP_2^0$), and increase both level pointers $i$ and $j$ of the two documents.
    Otherwise, increase only the level pointer $j$ of $D_2$.
    Go to step 2.
(2) If $i < L_1$ and $j < L_2$:
    Search for matched nodes in the new levels.
    If at least one common node is found, record the number of common nodes and common parents for the new levels in $D_1$ and $D_2$ ($CN_1^i$, $CN_2^j$, $CP_1^i$ and $CP_2^j$) and increase both level pointers $i$ and $j$ of the two documents.
    Otherwise, increase only the level pointer $j$ of $D_2$.
(3) Else:
    Calculate the similarity between $D_1$ and $D_2$; compare the similarity values obtained in the different iterations and save the maximum.
    Increase the value of $k$ by 1 (where $k$ points to the latest level visited of $D_1$).
    Set $i = k$ and $j = 0$ to start the search in the latest visited level of $D_1$ and the first level of $D_2$.
    Go to step 2.
Figure 9 shows an example of the matching algorithm in all iterations. As shown in the second iteration of the matching procedure, the similarity is equal to the maximum value where it has the maximum number of matching nodes. By using the EXCLS algorithm we can find all common nodes. The matching algorithm has a complexity of $O(L_1 \times L_2)$, where $L_1$ is the number of levels in the first XML document and $L_2$ is the number of levels in the second XML document, with $1 \le L_1 \le N$ and $1 \le L_2 \le M$, where $N$ and $M$ are the numbers of nodes in the first and second documents, respectively.
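The matching loop can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' code: each document is a list of levels, each level a `dict` from node id to a list of parent ids (one entry per occurrence); the common-node counts are taken as occurrence counts; and the similarity function is passed in as a parameter so that the loop structure can be read on its own. The name `match_levels` and the toy scoring function are hypothetical.

```python
def match_levels(d1, d2, similarity):
    """Try every start level k of d1 against d2 and keep the best similarity."""
    max_sim = 0.0
    for k in range(len(d1)):
        i, j = k, 0
        cn1, cp1 = [0] * len(d1), [0] * len(d1)
        cn2, cp2 = [0] * len(d2), [0] * len(d2)
        while j < len(d2) and i < len(d1):
            common = d1[i].keys() & d2[j].keys()
            for node in common:
                p1, p2 = d1[i][node], d2[j][node]
                cn1[i] += len(p1)
                cn2[j] += len(p2)
                # Occurrences whose parent also appears for this node in the other document.
                cp1[i] += sum(1 for p in p1 if p in p2)
                cp2[j] += sum(1 for p in p2 if p in p1)
            if common:
                i += 1  # advance d1 only when the levels matched
            j += 1      # d2 advances in both branches
        max_sim = max(max_sim, similarity(cn1, cp1, cn2, cp2))
    return max_sim


# Toy score for illustration: matched nodes plus matched parents in d1.
def toy_score(cn1, cp1, cn2, cp2):
    return sum(cn1) + sum(cp1)


d1 = [{0: [None]}, {1: [0], 2: [0]}, {3: [1]}]
d2 = [{1: [None]}, {3: [1]}]
best = match_levels(d1, d2, toy_score)
```

In this example the best alignment starts at level 1 of the first document, which is the behaviour the restart on `k` is designed to capture: a document whose root corresponds to an inner node of another document is still matched.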
3.1.3. Cluster XML documents by hierarchical algorithm. In this step, a percentage of the XML documents is selected randomly. Then, a hierarchical clustering algorithm is applied to cluster this sample. The next step is to calculate the distance between each document and the centre of the cluster it was assigned to. The threshold is evaluated in this step as the minimum similarity (maximum distance) between a document and its corresponding cluster’s centre. This value will be used in the clustering phase.
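The threshold evaluation amounts to taking a minimum over the sample. The sketch below assumes that the hierarchical clustering of the sample has already been done and is given as a list of `(centre, documents)` pairs; the Jaccard similarity on sets is only a stand-in for the level similarity, and all names are hypothetical.

```python
def estimate_threshold(sample_clusters, similarity):
    """Minimum similarity between any sampled document and its own cluster centre."""
    return min(
        similarity(doc, centre)
        for centre, docs in sample_clusters
        for doc in docs
    )


def jaccard(a, b):
    """Stand-in similarity on sets of node ids; not the paper's LevelSim."""
    return len(a & b) / len(a | b)


sample_clusters = [
    ({1, 2, 3}, [{1, 2}, {1, 2, 3}]),
    ({7, 8}, [{7}]),
]
threshold = estimate_threshold(sample_clusters, jaccard)
```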
Figure 9. Example for the matching algorithm.
3.1.4. Cluster XML documents by incremental algorithm. In this step, we use the XCLS clustering algorithm [7] to cluster the entire set of XML documents. The user does not need to specify the threshold input parameter, because it was calculated in the previous step. Figure 10 shows the details of the XCLS clustering algorithm.
3.2. Search phase
The second phase is the search phase which consists of the following tasks.
3.2.1. Map XPath to XML representation pattern. In this step, the XPath that is required in the XQuery is mapped to the structure representation presented in the clustering phase. Figure 11 shows an example of the XPath used in the query and its mapping to the XML representation.
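As a rough illustration of this mapping, the sketch below splits a simple XPath into per-level tags. The actual pattern format is the one shown in Figure 11, which is not reproduced here; the `(level, tag)` pairs, the numbering of the root as level 0 and the function name `xpath_to_levels` are assumptions made only for this example.

```python
def xpath_to_levels(xpath):
    """Split a simple absolute XPath (child steps only, no predicates)
    into (level, tag) pairs, with the root element at level 0."""
    steps = [step for step in xpath.strip().split("/") if step]
    return list(enumerate(steps))


# The path used by Query 1 in the experiments.
pattern = xpath_to_levels("/W4F_DOC/Movie/Cast/Actor/FirstName")
```

A pattern in this form can be compared level by level with a cluster centre in the same way as a document.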
Input: (1) XML documents
       (2) LevelSim_Threshold
Output: (1) A set of clusters
Method:
/* Phase 1 - Allocation */
For all XML documents to be clustered
    Assign the first document to the first cluster;
    Read the next document (represented as the level structure);
    Compute the LevelSim between the document and each existing cluster;
    Assign the document to an existing cluster if the maximum of the LevelSim(s) found between the two objects > LevelSim_Threshold;
    Otherwise, form a new cluster containing the document.
/* Phase 2 - Reassignment */
For all XML documents
    Read the randomly selected document (i.e. level structure);
    Compute the LevelSim between the document and each existing cluster;
    Reassign the document to an existing cluster if the maximum of the LevelSim(s) found between the two objects > LevelSim_Threshold;
    Otherwise, form a new cluster containing the document.
/* Stop if there is no improvement in two iterations */
Figure 10. XCLS clustering algorithm.
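The two phases of Figure 10 can be sketched in Python as follows. This is an illustrative sketch rather than the XCLS implementation: `jaccard` over tag sets stands in for LevelSim, a cluster is summarized by the union of its members' tags, and the loop stops as soon as one full pass moves no document, which simplifies the stop rule in the figure. All function names are hypothetical.

```python
import random


def jaccard(a, b):
    """Stand-in for LevelSim: overlap of two tag collections, in [0, 1]."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def group(docs, labels, skip=None):
    """Map each cluster label to its member documents, optionally leaving one out."""
    members = {}
    for index, (doc, label) in enumerate(zip(docs, labels)):
        if index != skip:
            members.setdefault(label, []).append(doc)
    return members


def most_similar(doc, members, threshold):
    """Label of the most similar cluster, or None if none exceeds the threshold."""
    best_label, best_sim = None, threshold
    for label, cluster in members.items():
        sim = jaccard(doc, set().union(*cluster))  # toy summary: union of tags
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label


def allocate(docs, threshold):
    """Phase 1: put each document in the best existing cluster or open a new one."""
    labels = []
    for index, doc in enumerate(docs):
        members = group(docs[:index], labels)
        label = most_similar(doc, members, threshold)
        labels.append(len(members) if label is None else label)
    return labels


def reassign(docs, labels, threshold, max_passes=10, seed=0):
    """Phase 2: revisit documents in random order until a pass moves none."""
    labels = list(labels)
    rng = random.Random(seed)
    order = list(range(len(docs)))
    for _ in range(max_passes):
        rng.shuffle(order)
        moved = False
        for index in order:
            members = group(docs, labels, skip=index)
            label = most_similar(docs[index], members, threshold)
            if label is None:
                # Stay alone if already alone, otherwise open a new cluster.
                label = max(labels) + 1 if labels[index] in members else labels[index]
            if label != labels[index]:
                labels[index] = label
                moved = True
        if not moved:
            break
    return labels


docs = [
    ["movie", "title", "actor"],
    ["article", "author", "year"],
    ["movie", "title", "actor", "genre"],
    ["article", "author", "year", "pages"],
]
labels = reassign(docs, allocate(docs, 0.5), 0.5)
```

In this toy run the two movie-like documents share one label and the two article-like documents share another.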
Figure 11. Mapping XPath to XML representation.
3.2.2. Find the most similar cluster to the pattern. In this step, the distance between the created pattern and the centre of every cluster is calculated to find the clusters with the smallest distances (the most similar clusters).
3.2.3. Search within the most similar cluster. The distances between the required XPath and the centres of all formed clusters are calculated, and the clusters are sorted by distance in ascending order. The search starts in the cluster whose centre is closest to the required XPath, then moves to the next cluster, and so on. This step is repeated until a cluster returns no matches, as shown in Figure 12.
The algorithm summarizes how the search process is carried out, assuming that the number of clusters is greater than 0 and that $CN_i$ (the number of documents in each cluster) is greater than 0. The searching algorithm has a complexity of $O(C + C \log C + C \times CN_i)$, where $C$ is the number of clusters for the dataset and $CN_i$ is the average number of documents in each cluster.
The time complexity consists of three parts. The first is the initialization of the distance array. The second is the sorting of the array of distances. The third is the search for the required documents within the clusters listed in the array of distances. Figure 13 shows an example of the XML search process: the distances between the required document and the centres of the clusters of the dataset are calculated and stored in the Clusters_Distance array. After that, we sort the Clusters_Distance array by distance in ascending order. The required documents are searched for within the documents of the least-distance cluster (in this case $C_1$) and, if matches are found, the search continues into the next cluster, $C_3$. Otherwise, the search is stopped because no more results will be found.
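The search procedure and its three cost terms can be sketched as follows. This is a simplified illustration, not the authors' code: patterns, cluster centres and documents are plain lists of tag names, `jaccard_distance` stands in for the paper's distance measure, and `contains_all` is a hypothetical match test.

```python
def jaccard_distance(a, b):
    """Stand-in distance in [0, 1] between two tag collections."""
    a, b = set(a), set(b)
    return 1.0 - (len(a & b) / len(a | b) if (a or b) else 1.0)


def contains_all(doc, pattern):
    """Hypothetical match test: the document has every tag of the pattern."""
    return set(pattern) <= set(doc)


def search(pattern, clusters, distance=jaccard_distance, matches=contains_all):
    """Search clusters from nearest to farthest until one yields no match.

    `clusters` is a list of (centre, documents) pairs.
    """
    # O(C): distance from the pattern to every cluster centre.
    clusters_distance = [
        (distance(pattern, centre), index)
        for index, (centre, _) in enumerate(clusters)
    ]
    # O(C log C): nearest cluster first.
    clusters_distance.sort()
    results = []
    # O(C * CN_i) in the worst case: scan the documents of each visited cluster.
    for _, index in clusters_distance:
        hits = [doc for doc in clusters[index][1] if matches(doc, pattern)]
        if not hits:
            break  # no match here, so farther clusters are not searched
        results.extend(hits)
    return results


clusters = [
    (["order", "O_ORDERKEY"], [["order", "O_ORDERKEY"]]),
    (["customer", "C_NAME", "C_NATIONKEY"],
     [["customer", "C_NAME", "C_NATIONKEY"], ["customer", "C_NAME"]]),
    (["part", "P_NAME"], [["part", "P_NAME"]]),
]
found = search(["customer", "C_NATIONKEY"], clusters)
```

Only the customer cluster and one further cluster are scanned in this run; the remaining cluster is never visited.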
4. Comparison
In the first sub-section, we compare the proposed EXCLS algorithm with the XCLS and XEdge algorithms under the same conditions. The experiments were performed on the same datasets of homogeneous (same nodes and same edges, or same nodes but different edges) and heterogeneous (different nodes and different edges) XML documents. Two aspects are considered in the comparison: clustering quality and run-time. The EXCLS, XCLS and XEdge algorithms were implemented in Java, and all the documents of the datasets are stored as the XML data type in a Microsoft SQL Server 2008 database. All of the experiments were performed on the same machine.
In the second sub-section, we compare the performance of the proposed XML search method with two widely used open-source XQuery processors developed in Java [32]:
Figure 12. The process of searching XML documents.
(1) SAXON (http://saxon.sourceforge.net), an XSLT and XQuery processor written in Java by Michael Kay of Saxonica.
(2) Qizx (http://www.axyana.com/qizxopen), an XQuery processor written in Java from Axyana Software.
Our experiments include both homogeneous and heterogeneous datasets of XML documents. Two aspects are considered in the comparison: search accuracy and query execution time. SAXON is an implementation of XQuery written in Java. It implements XSLT 3.0, XQuery 3.0, XPath 3.0 and XSD 1.1 and has several features that speed up query execution [33]. Qizx implements the XQuery language, including XQuery Full-Text, Update, Scripting and most XQuery 1.1 features. It is able to store, search, transform and index XML documents of any type and size efficiently. Qizx can be directly integrated with a stand-alone Java application. XML documents in Qizx must be parsed in memory before use because there is no persistent storage [34].
4.1. Hardware
Our XML search method, Qizx and SAXON were all tested on the same machine, with an Intel Core 2 Duo CPU T6600 (2.20 GHz, 2.20 GHz) and 3 GB of memory.
4.2. Software
Our proposed method was implemented using Java. The tests in this section were performed using version 4.1 of the free
engine edition of Qizx and Saxon home edition release 9.4.0.4.
4.3. Dataset
The first experiment aimed to compare the accuracy and the performance of our proposed method with both SAXON and
Qizx in the case of homogeneous XML documents. We used the movie DTD from Ley et al. [35] as a dataset of homoge-
neous data. The movie DTD shown in Figure 14 was used to create five sub-DTDs. We created the sub-DTDs in such a
way that the derived synthetic XML documents would share the same node tags and the same edges/relationships at each
level but a different number of nodes at each level. As shown in Table 1, we generated five sub-DTDs which have differ-
ent numbers of directors, genres and actors attributes.
The Digital Bibliography & Library Project (DBLP) DTD was used in the second experiment. The DBLP DTD contains bibliographic records as described in Ley [36]. DBLP is a computer science bibliography website hosted at Universität Trier,
Figure 13. Example of XML document search process.
in Germany. It lists millions of articles and proceedings papers from different conferences on computer science. The work in Miklau [37] documents the technical details of this DTD. The XML root element <dblp> contains a long sequence of bibliographic records. The DTD lists several elements to be used as a bibliographic record.
The DBLP dataset consists of XML documents with a balanced structure. This experiment also focuses on a homogeneous dataset, but one with different features from those used in the first experiment. The XML documents derived from the sub-DTDs share the same node tags, but contain different edges/relationships in some of the levels. The first sub-DTD contains article data. The second contains the incollections data. The third contains books data and the fourth contains the inproceedings data. Finally, the last sub-DTD contains the www data.
In the third experiment, we used real heterogeneous XML documents from the TPC-H relational database benchmark,
downloaded from Internet Movie Database [38]. The documents contain different types of data, such as Customer, Order,
Course, Part, and Sub-Supplier, and we use equal numbers of documents from each type. The main feature of this dataset
is that it utilizes different node tags and different edges/relationships.
In the fourth experiment, performance evaluation was carried out using very large real-life data to test scalability. We used the INEX 2010 IMDB collection [39], a structured movie database. All technical details about IMDB are found in Delbrua [40]. There are two kinds of objects in the IMDB data collection: movies and persons involved in movies, such as actors, directors and producers. Each movie has a title, rating, directors, actors, plot, keywords, genres, release dates, trivia, etc., and each person has a name, birth date, height, spouse, biography, filmography, salaries, etc., as shown in Figure 15. We used the actors (person) XML documents in our test.
4.4. EXCLS Comparison
4.4.1. Performance results. Figure 16 shows a run-time comparison (in the first experiment) between EXCLS, XCLS and XEdge in the case of homogeneous XML documents that share the same nodes and same edges.
Figure 17 shows the run-time comparison between EXCLS, XEdge and XCLS in the case of the homogeneous XML documents defined in the second experiment, which share the same node tags but contain different edges/relationships.
Figure 18 shows the run-time comparison between EXCLS, XCLS and XEdge in the case of the heterogeneous XML documents defined in the third experiment, where the dataset has different node tags and edges/relationships.
Table 1. The first experiment statistics
Set no. ([attributes], no. of occurrences)
1 ([director, genres and actors], 1)
2 ([directors, genres and actors], 3)
3 ([directors, genres and actors], 5)
4 ([directors, genres and actors], 7)
5 ([directors, genres and actors], 9)
Figure 14. Movie DTD.
As shown in Figures 16–18, there is a significant difference in run-time between EXCLS and XEdge and a very small difference between EXCLS and XCLS. The difference between EXCLS and XCLS stems from the characteristics of the XML structure in the dataset, which play a critical role in the run-time evaluation. Repeated nodes and edges in the XML documents require more memory for storing the XML representation in EXCLS than in XCLS, which summarizes all repeated nodes, and also more processing time for the similarity calculation. For both EXCLS and XCLS the run-time increases slowly as the number of XML documents increases, whereas for XEdge the run-time increases faster, which makes XEdge unsuitable for very large datasets.
Figure 15. Person DTD.
Figure 16. Run-time comparison in the case of homogeneous XML documents that share the same nodes and same edges. [Plot: run time (milliseconds, 0–18,000) against number of XML documents (0–500) for XEdge, EXCLS and XCLS.]
Figure 17. Run-time comparison in the case of homogeneous XML documents that share the same nodes but different edges. [Plot: run time (milliseconds, 0–10,000) against number of XML documents (0–500) for XEdge, EXCLS and XCLS.]
As shown in Figure 19, we used the actors XML documents from the IMDB dataset, with sizes varying from 10,000 to 50,000 documents; each run-time is an average of five runs. We compared only XCLS with EXCLS because XEdge consumed an extremely long time and did not produce any results. The figure shows no significant difference between their run-times, while EXCLS achieves better accuracy.
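Averaging each measurement over five runs can be done with a small timing harness such as the one below. It is only a sketch of the measurement procedure: the authors' Java harness is not described in the text, and `average_run_time_ms` and the timed task are hypothetical.

```python
import time


def average_run_time_ms(task, runs=5):
    """Run `task` several times and return the mean wall-clock time in milliseconds."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        task()
        total += time.perf_counter() - start
    return total / runs * 1000.0


# Placeholder workload; in the experiments this would be a clustering run.
elapsed_ms = average_run_time_ms(lambda: sorted(range(10_000)))
```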
4.4.2. Effectiveness results. The precision is the number of correct results divided by the number of all results returned from a query. The recall is the number of correct results divided by the number of results that should have been returned from the query. We use the precision, recall and F-measure metrics to evaluate the quality of the query results of our proposed method, Saxon and Qizx. The F-measure combines precision and recall with equal weights, as shown in equation (3). The F-measure is defined as

$$F = \frac{2 \times n_r^i}{n_r + n_i} \qquad (3)$$

where $n_r^i$ is the number of documents in the generated cluster $C_i$ belonging to the actual modelled category, $n_r$ is the number of XML documents belonging to the modelled category, and $n_i$ is the number of XML documents categorized by the clustering algorithm. The result of the accuracy comparison between EXCLS, XCLS and XEdge for the first experiment is shown in Table 2. The input parameter for the XCLS algorithm is a threshold value of 0.9 and the input parameter for XEdge is the number of clusters (k), which equals 4. It is evident that XCLS failed to assign the documents to their clusters correctly, while EXCLS and XEdge clustered all the documents correctly. However, XEdge needs the correct number of clusters to be specified and takes more processing time. As XCLS is based on node summaries, these results were expected, because the documents share the same node tags and edges.
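A short worked example of these metrics is given below. The counts are hypothetical and were chosen only so that they reproduce the EXCLS values reported for Dblp-2 in Table 3 (recall 0.62, precision 1, F-score 0.7654); the function names are illustrative.

```python
def precision(n_ri, n_i):
    """Correctly clustered documents divided by all documents placed in the cluster."""
    return n_ri / n_i if n_i else 0.0


def recall(n_ri, n_r):
    """Correctly clustered documents divided by all documents of the category."""
    return n_ri / n_r if n_r else 0.0


def f_measure(n_ri, n_r, n_i):
    """Equation (3): 2 * n_r^i / (n_r + n_i)."""
    denominator = n_r + n_i
    return 2 * n_ri / denominator if denominator else 0.0


# Hypothetical counts: 100 documents in the category, 62 of them placed
# together in one generated cluster that contains nothing else.
n_ri, n_r, n_i = 62, 100, 62
scores = (precision(n_ri, n_i), recall(n_ri, n_r), round(f_measure(n_ri, n_r, n_i), 4))
```

Equation (3) is the harmonic mean of precision and recall: substituting $P = n_r^i / n_i$ and $R = n_r^i / n_r$ into $2PR/(P + R)$ gives the same expression.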
The result of the accuracy comparison between EXCLS, XCLS and XEdge in the second experiment, for homogeneous documents that share the same node tags but contain different edges/relationships, is shown in Table 3.
Figure 18. Run-time comparison in the case of heterogeneous XML documents. [Plot: run time (milliseconds, 0–13,000) against number of XML documents (0–500) for XEdge, EXCLS and XCLS.]
Figure 19. Run-time comparison in the case of the IMDB dataset. [Plot: run time (milliseconds, 0–50,000) against number of XML documents (0–50,000) for EXCLS and XCLS.]
The value of the threshold produced by EXCLS and used in XCLS is 0.5856. In such cases, the EXCLS algorithm clus-
ters the documents to a number of clusters that exceeds the expected number because it is more sensitive to the number
of nodes matched as it is not based on structure summarization. XCLS can distinguish documents correctly as they are
based on edge summaries.
In the third experiment, the result of comparing the accuracy of EXCLS, XCLS and XEdge is the same for F-measure-
ment, precision and recall. All of them have a value of 1 because the data is heterogeneous.
In the fourth experiment, XCLS failed to cluster the documents and put them all in one cluster, while EXCLS clus-
tered the documents correctly based on their structure. EXCLS was run first to calculate the threshold and then, we set
the same value of the threshold to XCLS clustering because the value of the threshold has an effect on the run-time and
the accuracy of clustering.
4.5. Proposed XML search method comparison
4.5.1. Performance results. The following are the queries used in the experiments:
Query 1 was used in the first experiment to search for movies that have a specific actor’s first name.
for $doc in
collection('src/XQuery/xmlDocuments/*.xml')
where $doc/W4F_DOC/Movie/Cast/Actor/FirstName='?'
return $doc/W4F_DOC/Movie/Title
Query 2 was used in the second experiment to search for incollections published in a specific year.
for $doc in
collection('src/XQuery/xmlDocuments/*.xml')
where $doc/incollection/year='?'
return $doc/incollection/title
Table 2. Results for homogeneous XML documents that share the same nodes and edges
XML domain  Number of   XEdge                        XCLS                         EXCLS
            documents   Recall  Precision  F-score   Recall  Precision  F-score   Recall  Precision  F-score
Movie-1     500         1       1          1         1       1          1         1       1          1
Movie-2     500         1       1          1         1       0.5        0.667     1       1          1
Movie-3     500         1       1          1         1       1          1         1       1          1
Movie-4     500         1       1          1         0       0          0         1       1          1
Table 3. Results for homogeneous XML documents that share the same nodes but different edges
XML domain  Number of   XEdge                        XCLS                         EXCLS
            documents   Recall  Precision  F-score   Recall  Precision  F-score   Recall  Precision  F-score
Dblp-1      100         1       1          1         1       1          1         1       1          1
Dblp-2      100         1       1          1         1       1          1         0.62    1          0.7654
Dblp-3      100         1       1          1         1       1          1         0.63    1          0.773
Dblp-4      100         1       1          1         1       1          1         1       1          1
Dblp-5      100         1       1          1         1       1          1         1       1          1
Query 3 was used in the third experiment to search for a customer that has a specific nation key.
for $doc in
collection('src/XQuery/xmlDocuments/*.xml')
where $doc/customer/C_NATIONKEY='?'
return $doc/customer/C_NAME
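To make the semantics of these queries concrete, the sketch below evaluates the equivalent of Query 3 over a few in-memory documents with Python's standard `xml.etree.ElementTree` module. It is not how SAXON, Qizx or the proposed method evaluates the query; the sample documents and the function name `customers_in_nation` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up documents standing in for the files of the collection.
documents = [
    "<customer><C_NAME>Customer#1</C_NAME><C_NATIONKEY>15</C_NATIONKEY></customer>",
    "<customer><C_NAME>Customer#2</C_NAME><C_NATIONKEY>7</C_NATIONKEY></customer>",
    "<order><O_ORDERKEY>1</O_ORDERKEY></order>",
]


def customers_in_nation(xml_strings, nation_key):
    """Return C_NAME values of customer documents whose C_NATIONKEY equals nation_key."""
    names = []
    for text in xml_strings:
        root = ET.fromstring(text)
        if root.tag != "customer":
            continue  # the path /customer/... selects nothing in this document
        keys = [element.text for element in root.findall("C_NATIONKEY")]
        if nation_key in keys:  # '=' in XQuery is true if any selected value matches
            names.extend(element.text for element in root.findall("C_NAME"))
    return names


matching_names = customers_in_nation(documents, "15")
```

Every document in the collection is parsed and tested here, which is the exhaustive scan that searching only the most similar clusters is meant to avoid.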
As shown in Figures 20–23, the run-time graphs compare the run-time of applying the XQuery in the proposed XML search with that of SAXON and Qizx for numbers of XML documents that vary from 1000 to 10,000. Each result is the average of five runs. It is evident from the graphs that the proposed search gives a significant improvement in run-time.
As shown in Figure 21 and Table 5, the run-time and the accuracy metrics were calculated only for searches in the most similar cluster. If paging had been used for displaying the results, the results of the most similar cluster would have been displayed in the first page of the search. However, Figure 22 shows that processing time increases when searching in all similar clusters that return results. Additionally, the run-time of the search in all clusters is sometimes equal to that of the search in the most similar cluster only, as in the cases of 1000 and 3000 documents in Figure 22, for which the most similar cluster already achieves a recall of 1.0 (Table 5). Although the run-time has increased, it is still less than that for SAXON and Qizx.
Figure 20. Query run-time comparison for the first experiment. [Plot: run time (milliseconds, 0–32,000) against number of XML documents (1000, 3000, 5000, 7000, 10,000) for the proposed search, Saxon and Qizx.]
Figure 21. Query run-time comparison for the second experiment in the most similar cluster only. [Plot: run time (milliseconds, 0–32,000) against number of XML documents (1000, 3000, 5000, 7000, 10,000) for the proposed search, Saxon and Qizx.]
Figure 23 shows the run-time graph, which compares the run-time of applying the XQuery in the proposed XML search, SAXON and Qizx for the data defined in the third experiment.
4.5.2. Effectiveness results. As shown in Table 4, both queries in the first and third experiments return the same expected
results for each of the search engines: SAXON, Qizx and our proposed method. In the second experiment, it is clear
from Table 5 that our proposed method has lower recall and F-measurement values in the case of searching in one clus-
ter (most similar cluster). In this case, the count of the returned results is less than that of SAXON and Qizx.
Examining the accuracy measures, it is evident that the type and heterogeneity of the data have an effect on the query results of our proposed method. In the case of heterogeneous data, the results of searching in similar clusters are accurate. However, in the case of homogeneous data, the clustering algorithm is more sensitive to the data because no summarization of the XML documents has been done. In such a case, clustering divides the XML collection into more clusters than required. Consequently, when only the most similar cluster is searched, the number of returned results is lower than expected, as shown in Table 5.
In the case of using paging for the returned results, the first page will have the results from the most similar cluster.
When the user requests more results, it is possible to search for the required document within the next similar cluster
Figure 22. Query run-time comparison for the second experiment.
Figure 23. Query run-time comparison for the third experiment. [Plot: run time (milliseconds, 0–34,000) against number of XML documents (2000, 4000, 6000, 8000, 10,000) for the proposed search, Saxon and Qizx.]
until a cluster is found that does not return any matched results. Hence, it is possible to get fully accurate results in less time than with the SAXON and Qizx XQuery processors.
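The paging behaviour described here maps naturally onto lazy evaluation: each request for another page triggers a search of the next most similar cluster. The generator below is a sketch under the same toy assumptions as before (tag lists, a Jaccard-based distance and a hypothetical `contains_all` match test), not the authors' implementation.

```python
def jaccard_distance(a, b):
    """Stand-in distance in [0, 1] between two tag collections."""
    a, b = set(a), set(b)
    return 1.0 - (len(a & b) / len(a | b) if (a or b) else 1.0)


def contains_all(doc, pattern):
    """Hypothetical match test: the document has every tag of the pattern."""
    return set(pattern) <= set(doc)


def paged_search(pattern, clusters, distance=jaccard_distance, matches=contains_all):
    """Yield one page of results per cluster, nearest cluster first,
    stopping at the first cluster that returns no match."""
    ranked = sorted(clusters, key=lambda cluster: distance(pattern, cluster[0]))
    for _, docs in ranked:
        page = [doc for doc in docs if matches(doc, pattern)]
        if not page:
            return
        yield page


clusters = [
    (["incollection", "title", "year"], [["incollection", "title", "year"]]),
    (["incollection", "title", "year", "pages"],
     [["incollection", "title", "year", "pages"]]),
    (["book", "title"], [["book", "title"]]),
]
pages = paged_search(["incollection", "year"], clusters)
first_page = next(pages)      # only the most similar cluster is scanned so far
remaining_pages = list(pages)  # further clusters are scanned on demand
```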
5. Conclusions and future work
In this work we presented a method for searching XML documents using clustering to enhance search performance. The
method consists of two phases. In the first phase, an enhanced clustering algorithm (EXCLS) is applied to a collection
of XML documents. In the second phase, a search process is applied to the formed clusters. We introduced a new XML
structure representation that preserves all of the characteristics of the XML data. This representation has been used in
the EXCLS clustering algorithm and in the search process, which led to enhancement of the clustering results and the
search performance without the need for summarization.
In order to evaluate both the proposed enhancement in XML clustering and the new XML structure representation,
various comparisons with XCLS and XEdge were applied. To evaluate our proposed method, we performed a compari-
son between it, SAXON and Qizx XQuery engines. As shown in the results of the comparison, our proposed XML search
method attained better performance without the need to summarize the XML structure, which causes the loss of the
XML documents’ characteristics, and decreases the quality of clustering and consequently the search results.
Clustering XML collections by EXCLS before applying the search process leads to improved search efficiency. The
EXCLS clustering quality leads to enhancement of the quality of the search process. Our future plans include adding
XML content to the clustering process in addition to structure and investigating the effect on the performance and accu-
racy of the search process.
Funding
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
Table 4. Quality results for the first and third experiments
Method           Quality measure  No. of documents
                                  1000   3000   5000   7000   10,000
Proposed method  Precision        1.0    1.0    1.0    1.0    1.0
                 Recall           1.0    1.0    1.0    1.0    1.0
                 F-measurement    1.0    1.0    1.0    1.0    1.0
SAXON            Precision        1.0    1.0    1.0    1.0    1.0
                 Recall           1.0    1.0    1.0    1.0    1.0
                 F-measurement    1.0    1.0    1.0    1.0    1.0
Qizx             Precision        1.0    1.0    1.0    1.0    1.0
                 Recall           1.0    1.0    1.0    1.0    1.0
                 F-measurement    1.0    1.0    1.0    1.0    1.0
Table 5. Quality results for the second experiment
Method           Quality measure  No. of documents
                                  1000   3000   5000   7000   10,000
Proposed method  Precision        1.0    1.0    1.0    1.0    1.0
                 Recall           1.0    1.0    0.564  0.677  0.618
                 F-measurement    1.0    1.0    0.721  0.807  0.763
SAXON            Precision        1.0    1.0    1.0    1.0    1.0
                 Recall           1.0    1.0    1.0    1.0    1.0
                 F-measurement    1.0    1.0    1.0    1.0    1.0
Qizx             Precision        1.0    1.0    1.0    1.0    1.0
                 Recall           1.0    1.0    1.0    1.0    1.0
                 F-measurement    1.0    1.0    1.0    1.0    1.0
References
[1] Cerami E. XML for bioinformatics, 1st edn. New York: Springer, 2005, vol. 1.
[2] Hanisch D, Zimmer R and Lengauer T. ProML – the protein markup language for specification of protein sequences, structures
and families. Silico Biology 2002; 2(3): 313–324.
[3] Russo MF and Rubin AE. An introduction to using XML for the management of laboratory data. Journal of Laboratory
Automation 2001; 6.
[4] Liu Z, Sun P, Huang Y, Cai Y and Chen Y. Challenges, techniques and directions in building XSeek: an XML search engine.
IEEE Data Engineering Bulletin 2009; 32: 36–43.
[5] Algergawy A, Mesiti M, Nayak R and Saake G. XML data clustering: an overview. ACM Computing Surveys (CSUR) 2011;
43(4).
[6] Guo L, Shao F, Botev C and Shanmugasundaram J. XRANK: ranked keyword search over XML documents. In: (MOD) inter-
national conference on management of data, 2003, pp. 16–27.
[7] Xu Y and Papakonstantinou Y. Efficient keyword search for smallest LCAs in XML databases. In: SIGMOD '05: Proceedings of the 2005 ACM SIGMOD international conference on management of data, 2005, pp. 527–538.
[8] Li Y, Yu C and Jagadish HV. Schema-free XQuery. Very Large Data Bases 2004; 30: 72–83.
[9] Hulgeri A and Nakhe C. Keyword Searching and Browsing in Databases using BANKS. In: ICDE ’02: Proceedings of the 18th
international conference on data engineering, 2002, p. 431.
[10] Khan L and Rao Y. A performance evaluation of storing XML data in relational database management systems. In: CIKM:
Conference on information and knowledge management, 2001, pp. 31–38.
[11] Atay M, Chebotko A, Liu D, Lu S and Fotouhi F. Efficient schema-based XML-to-Relational data mapping. Information
Systems 2007; 32(3): 458–476.
[12] Nguyen K and Cao J. Top-K data source selection for keyword queries over multiple XML data sources. Journal of Information Science 2012; 38(2): 156–175.
[13] Nayak J. Fast and effective clustering of XML data using structural information. Knowledge and Information Systems 2008;
14(2): 197–215.
[14] Desoki R and Elfatatry A. EXCLS: Enhanced XML clustering by level structure accuracy. International Journal of Web
Engineering and Technology, 2014, 9(4).
[15] Antonellis P, Makris C and Tsirakis N. XEdge: Clustering homogeneous and heterogeneous XML documents using edge sum-
maries. In: Proceedings of the 2008 ACM symposium on applied computing (SAC), New York, 2008, pp. 1081–1088.
[16] Kay M. The SAXON XSLT and XQuery Processor, 2012, http://saxon.sourceforge.net/
[17] Pixware. XMLMind: Qizx, 2012, http://www.xmlmind.com/qizx/
[18] Li G, Feng J, Ooi BC, Wang J and Zhou L. An effective 3-in-1 keyword search method over heterogeneous data sources.
Information Systems, 2011, pp. 248–266.
[19] Guo L, Shanmugasundaram J and Yona G. Topology search over biological databases. In: IEEE 23rd international conference
on data engineering, 2007: ICDE, 2007, pp. 556–565.
[20] Haw S-C and Lee C-S. Data storage practices and query processing in XML databases: A survey. Knowledge-Based Systems
2011, pp. 1317–1340.
[21] Norwati M, Lee C-S and Haw S-C. Bridging XML and relational databases: Mapping choices and performance evaluation.
IETE Technical Review 2010; 27(4): 308–317.
[22] Gou G and Chirkova R. Efficiently querying large XML data repositories: A survey. IEEE Transactions on Knowledge and
Data Engineering 2007; 19: 1381–1403.
[23] Lian W, Cheung DW, Mamoulis N and Yiu S-M. An efficient and scalable algorithm for clustering xml documents by structure.
IEEE Transactions on Knowledge and Data Engineering 2004; 16: 82–96.
[24] Huang Y, Liu Z and Chen Y. eXtract: A snippet generation system for XML search. In: Proceedings of the VLDB endowment,
vol. 1, 2008, pp. 1392–1395.
[25] Hristidis V, Koudas N, Papakonstantinou Y and Srivastava D. Keyword proximity search in XML trees. IEEE Transactions on
Knowledge and Data Engineering 2006; 18(4): 525–539.
[26] Liu Z, Huang Y and Chen Y. Improving XML search by generating and utilizing informative result snippets. ACM
Transactions on Database Systems 2010; 35(3).
[27] Niemi T, Näppilä T and Järvelin K. A relational data harmonization approach to XML. Journal of Information Science 2009; 35(5): 571–601.
[28] De Vries CM, Nayak R, Kutty S, Geva S and Tagarelli A. Overview of the INEX 2010 XML mining track: Clustering and classification of XML documents. In: Comparative evaluation of focused retrieval, vol. 6932. Berlin: Springer, 2011, pp. 363–376.
[29] Tovar M, Cruz A, Vázquez B, Pinto D and Vilariño D. An iterative clustering method for the XML-mining task of the INEX 2010. In: Comparative evaluation of focused retrieval, vol. 6932. Berlin: Springer, 2011, pp. 377–382.
[30] Vutukuru V, Pasupuleti K, Khare A and Garg A. Conceptemy: An issue in XML information retrieval. In: Proceedings of the international world wide web conference (WWW), 2002.
[31] Marchiori M. W3C XML query (XQuery), December 2013, http://www.w3.org/XML/Query/
[32] Kay M. Ten reasons why saxon xquery is fast. IEEE Data Engineering Bulletin 2008, pp. 65–74.
[33] Axyana Software and Axyana Software, November 2012, http://www.axyana.com/qizxopen/
[34] University of Wisconsin – Madison Department of Computer Sciences. The Wisconsin XML data bank, http://www.cs.wisc.edu/hiagara/data.html
[35] Ley M et al. http://www.informatik.uni-trier.de/~ley/db/about/dblp.dtd
[36] Ley M. DBLP: some lessons learned. Proceedings of the VLDB Endowment 2009; 2(2): 1493–1500.
[37] Miklau G. The XML data repository, http://www.cs.washington.edu/research/xmldatasets/
[38] Internet Movie Database, 2013, http://www.imdb.com
[39] INEX 2010 Data-Centric Track, 2014, http://www.inex.otago.ac.nz/tracks/strong/strong.asp
[40] Delbrua R, Campinas S and Tummarello G. Searching web data: An entity retrieval and high-performance indexing model.
Web Semantics: Science, Services and Agents on the World Wide Web 2012; 10: 33–58.
Shalabi and Elfatatry 21
Journal of Information Science, 2014, pp. 1–21 ÓThe Author(s), DOI: 10.1177/0165551514560523
at Cairo University on January 7, 2015jis.sagepub.comDownloaded from
... Create a mediated schema for integrating approach for XML structures has discussed by Saleem et al. [15], they used linguistic matchers that extract semantics of all node labels and tree-mining data structure and label clusters to find node context. Structure method for enhancing XML clustering without summarize characteristics of XML structure is used by Shalabi and Elfatatry [16], the technique treats with different sizes of homogeneous and heterogeneous XML documents datasets. Al Hamad [17] developed a mediate schema for integrating heterogeneous XML, the technique decomposes the original schema into subschemas using three levels ancestor, root, and leaf. ...
Conference Paper
Full-text available
Extensible Markup Language (XML) becomes widely used over the web to exchange and share the data, its operations and tags help to reduce memory, storage and processing of the data; these features and more were the reason behind rapid spread and adoption using of XML model by many companies. The main contribution of this work is to present a literature survey of different conversion techniques and methods between relational and XML databases models, as well as raising the awareness of these techniques and methods. We review the different researches approaches and techniques that developed for XML conversions. These techniques include but not limited to Document Type Definition (DTD), Document Object Model (DOM), clustering and matching, query languages Structured Query Language (SQL), XPath, XQuery, relational storage, relational catalog and other methods.
... The authors in [27] clustered XML documents via PathXP algorithms. PathXP allows to groups documents according to their characteristic features rather than their direct similarity. ...
Conference Paper
The main objective of the work is to improve the clustering efficiency and performance when we deal with very big datasets. This paper aims to improve the quality of XML data clustering by exploiting more features extracted from source schemas. In particular, it proposes clustering approach that gathers both content and structure of XML documents to determine similarity between them. The content and structure information are concluded using two different similarity methods that are then grouped via weight factor to compute the overall document similarity. The structural similarity of XML data are derived from edge summaries while content features similarity are derived from aggregate of set of similarity measures; Jaccard, Cosine measure and Jensen-Shannon divergence in one algorithm. However, we also experimented using Jaccard distance as content measure with edge summaries to prove that using an aggregation of content similarity measures can further improve the results. The experiments prove that clustering of XML documents based on structure only information produce worse solution in homogenous environment, while in heterogeneous environment clustering of XML document produce better result when the structure and the content are combined. Results have shown that performance and quality of the proposed approach is better in comparison of both XEdge and XCLSC approaches.
Article
Full-text available
Extensible Markup Language (XML) and the XML Path Language (XPath) are introduced with software examples demonstrating how one can use them to write laboratory data management programs. Topics explored include XML document creation, manipulation, and searching. Programming examples make use of the Microsoft® XML Parser library and the Visual Basic programming language. The problem of managing microplate screening data is used as an illustration. Source code for all examples can be downloaded from http://www.labprogrammer.net.
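As a rough illustration of the three topics this abstract lists (creation, manipulation and searching), the sketch below uses Python's standard `xml.etree.ElementTree` module in place of the Microsoft XML Parser and Visual Basic used in the cited article. The element and attribute names (`plate`, `well`, `row`, `col`, `signal`), the sample values and the 0.5 hit threshold are hypothetical and are not taken from the article.

```python
import xml.etree.ElementTree as ET

# Creation: build a small, hypothetical microplate document in memory.
plate = ET.Element("plate", {"id": "P001"})
for row, col, signal in [("A", "1", "0.12"), ("A", "2", "0.95"), ("B", "1", "0.40")]:
    well = ET.SubElement(plate, "well", {"row": row, "col": col})
    ET.SubElement(well, "signal").text = signal

# Searching: ElementTree supports a subset of XPath, including
# attribute predicates such as [@row='A'].
row_a = plate.findall("./well[@row='A']")
signals = [float(w.findtext("signal")) for w in row_a]

# Manipulation: flag wells whose signal exceeds the (hypothetical) threshold.
hits = []
for well in plate.iter("well"):
    if float(well.findtext("signal")) > 0.5:
        well.set("hit", "true")
        hits.append(well.get("row") + well.get("col"))

# Serialize the modified document back to text.
xml_text = ET.tostring(plate, encoding="unicode")
```

ElementTree implements only part of XPath 1.0, so queries that need numeric comparisons or functions would require a fuller XPath engine.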
Article
Full-text available
The increasing popularity of XML on the internet has brought about a number of research problems regarding methods of data management, indexing, and retrieval in large repositories. XML clustering is used to partition the large collections of XML documents in a repository into smaller groups to facilitate retrieval operations. Most clustering approaches focus on improving performance by using structure summaries, but at the cost of accuracy. A major drawback of summarisation techniques is the loss of the XML documents' characteristics. The main objective of this work is to improve the accuracy of XML document clustering, specifically in the case of homogeneous datasets, while preserving performance. To this end, we propose a new XML document structure and present an enhanced matching procedure to calculate the similarity between XML documents. The proposed method is implemented and evaluated using homogeneous and heterogeneous datasets. The experimental results show a significant improvement in clustering accuracy, especially for homogeneous XML documents, without a significant impact on processing time.
Article
Full-text available
This report explains the objectives, datasets and evaluation criteria of both the clustering and classification tasks set in the INEX 2010 XML Mining track. The report also describes the approaches and results obtained by participants.
Conference Paper
With the growth of the Web, there has been a rapid increase in the number of users who need to access online databases without having a detailed knowledge of the schema or of query languages; even relatively simple query languages designed for non-experts are too complicated for them. We describe BANKS, a system which enables keyword-based search on relational databases, together with data and schema browsing. BANKS enables users to extract information in a simple manner without any knowledge of the schema or any need for writing complex queries. A user can get information by typing a few keywords, following hyperlinks, and interacting with controls on the displayed results. BANKS models tuples as nodes in a graph, connected by links induced by foreign key and other relationships. Answers to a query are modeled as rooted trees connecting tuples that match individual keywords in the query. Answers are ranked using a notion of proximity coupled with a notion of prestige of nodes based on inlinks, similar to techniques developed for Web search. We present an efficient heuristic algorithm for finding and ranking query results.
Article
With the proliferation of XML data, searching XML data using keyword queries has attracted much attention. However, most current approaches focus on keyword-based searches over a single XML document. Searching over a system integrating hundreds or even thousands of data sources by sequentially querying every single source is extremely costly, and thus may be impractical. In this article we propose a novel approach for selecting the top-K data sources according to their relevance to a given query, to avoid the high cost of searching in numerous, potentially irrelevant data sources. Our approach summarizes the data sources as succinct synopses for the rapid filtering of non-promising sources. We maintain both structural and value distribution information for each data source, and propose a novel ranking function to measure effectively the relevance of a data source to the given query. We conducted experiments with real datasets, and the results show that our approach achieves high performance on all evaluation metrics (recall, precision and Spearman’s rank correlation coefficient) under different experimental parameters.