Figure 1 - uploaded by Ahmed Elfatatry
Source publication
Searching large XML repositories is a challenging research problem. The application of clustering on a large repository before performing a search enhances the search process significantly. Clustering reduces a search space into smaller XML collections that can be better searched. In this work, we present an enhanced XML clustering by structure met...
Contexts in source publication
Context 1
... element can have attributes and values, in addition to nested sub-elements. Figure 1 shows an example of an XML document in a tree representation. ...
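To make the tree view of an XML document concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The sample document and the helper name tree_lines are hypothetical illustrations and are not taken from Figure 1 of the source publication.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not the one shown in Figure 1.
SAMPLE_XML = """
<movie year="1999">
  <title>Example Title</title>
  <cast>
    <actor>First Actor</actor>
    <actor>Second Actor</actor>
  </cast>
</movie>
"""


def tree_lines(element, depth=0):
    """Return one indented line per element: tag, attributes, and text value."""
    line = "  " * depth + element.tag
    if element.attrib:
        line += " " + str(dict(element.attrib))
    text = (element.text or "").strip()
    if text:
        line += " = " + repr(text)
    lines = [line]
    # Nested sub-elements become children one level deeper in the tree.
    for child in element:
        lines.extend(tree_lines(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE_XML)
print("\n".join(tree_lines(root)))
```

Each line of the output corresponds to one node of the tree, with indentation showing its depth below the root element.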
Context 2
... this step, we use the XCLS clustering algorithm to cluster all sets of XML documents. The user does not need to specify the threshold input parameter, because it is calculated in the previous step [7]. Figure 10 shows the details of the XCLS clustering algorithm. ...
Context 3
... this step, the XPath that is required in the XQuery is mapped to the structure representation presented in the clustering phase. Figure 11 shows an example of the XPath used in the query and its mapping to the XML representation. ...
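The excerpt above does not spell out the structure representation, so the following is only a sketch under a stated assumption: that a path can be represented as (level, tag) pairs, with the root at level 0. The function name xpath_to_levels is hypothetical, and the sketch handles only simple absolute paths whose predicates contain no slashes.

```python
import re


def xpath_to_levels(xpath):
    """Map a simple absolute XPath to a list of (level, tag) pairs.

    Assumption: the structure representation is level-based. Only child
    steps with optional predicates (for example /a/b[1]/c) are handled;
    the // shorthand, axes, and predicates containing '/' are out of scope.
    """
    if not xpath.startswith("/") or "//" in xpath:
        raise ValueError("only simple absolute paths are supported in this sketch")
    levels = []
    for level, step in enumerate(xpath.strip("/").split("/")):
        tag = re.sub(r"\[.*?\]", "", step)  # drop predicates such as [1]
        if not tag:
            raise ValueError("empty step in path: " + repr(xpath))
        levels.append((level, tag))
    return levels


print(xpath_to_levels("/movie/cast/actor[1]"))
```

The resulting pairs can then be compared against a cluster representation built from the same kind of pairs.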
Context 4
... a search is performed first in the elements of the cluster whose centre has the least distance to the required XPath, then in the next cluster, and so on. This step is repeated until no matches are found in the cluster, as shown in Figure 12. ...
Context 5
... third part is used for searching for the required document within the array of distances. Figure 13 shows an example of the XML search process: the distance between the required document and the centre of each cluster in the dataset is calculated and stored in the Clusters_Distance array. We then sort the Clusters_Distance array by distance value. ...
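The search order described in Contexts 4 and 5 can be sketched as follows. This is an illustration, not the authors' implementation: the Jaccard distance over (level, tag) sets is a stand-in for the paper's distance measure, the ascending sort follows the "least distance first" wording, and all function names and sample data are hypothetical.

```python
def jaccard_distance(a, b):
    """Stand-in distance between two sets of structural features (an assumption)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def rank_clusters(query, centres, distance=jaccard_distance):
    """Build the Clusters_Distance array and sort it, nearest centre first."""
    clusters_distance = [(distance(query, centre), cid) for cid, centre in centres.items()]
    clusters_distance.sort()
    return clusters_distance


def search_nearest_first(query, centres, members, is_match):
    """Search clusters in ranked order; stop at the first cluster with no match."""
    results = []
    for _distance, cid in rank_clusters(query, centres):
        hits = [doc_id for doc_id, doc in members[cid].items() if is_match(doc, query)]
        if not hits:
            break
        results.extend(hits)
    return results


def contains_query(doc, query):
    """Hypothetical match test: the document contains every queried pair."""
    return query <= doc


# Hypothetical data: each structure is a set of (level, tag) pairs.
query = {(0, "movie"), (1, "cast"), (2, "actor")}
centres = {
    "cluster_a": {(0, "person"), (1, "name")},
    "cluster_b": {(0, "movie"), (1, "cast"), (2, "actor")},
    "cluster_c": {(0, "movie"), (1, "title")},
}
members = {
    "cluster_a": {"doc1": {(0, "person"), (1, "name")}},
    "cluster_b": {"doc2": {(0, "movie"), (1, "cast"), (2, "actor"), (1, "title")}},
    "cluster_c": {"doc3": {(0, "movie"), (1, "title")}},
}

print(rank_clusters(query, centres))
print(search_nearest_first(query, centres, members, contains_query))
```

With this stopping rule the search never reaches cluster_a, because cluster_c, which is ranked before it, yields no match.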
Context 6
... used the movie DTD from Ley et al. [35] as a dataset of homogeneous data. The movie DTD shown in Figure 14 was used to create five sub-DTDs. We created the sub-DTDs in such a way that the derived synthetic XML documents would share the same node tags and the same edges/relationships at each level but a different number of nodes at each level. ...
Context 7
... technical details about IMDB are found in Delbrua [40]. There are two kinds of objects in the IMDB data collection: movies and persons involved in movies, such as actors, directors and producers. Each movie has a title, rating, directors, actors, plot, keywords, genres, release dates, trivia, etc.; and each person has a name, birth date, height, spouse, biography, filmography, salaries, etc., as shown in Figure 15. We used actors (person) XML documents in our test. ...
Context 8
... results. Figure 16 shows a run-time comparison (in the first experiment) between EXCLS, XCLS and XEdge in the case of homogeneous XML documents that share the same node tags and the same edges. Figure 17 shows the run-time comparison between EXCLS, XEdge and XCLS in the case of the homogeneous XML documents defined in the second experiment, which share the same node tags but contain different edges/relationships. Figure 18 shows the run-time comparison between EXCLS, XCLS and XEdge in the case of the heterogeneous XML documents defined in the third experiment, where the dataset has different node tags and edges/relationships. ...
Context 9
... 16 shows a run-time comparison (in the first experiment) between EXCLS, XCLS and XEdge in the case of homogeneous XML documents that share the same node tags and the same edges. Figure 17 shows the run-time comparison between EXCLS, XEdge and XCLS in the case of the homogeneous XML documents defined in the second experiment, which share the same node tags but contain different edges/relationships. Figure 18 shows the run-time comparison between EXCLS, XCLS and XEdge in the case of the heterogeneous XML documents defined in the third experiment, where the dataset has different node tags and edges/relationships. As shown in Figures 16-18, there is a significant variance in run-time between EXCLS and XEdge and a very small variance between EXCLS and XCLS. ...
Context 10
... 16 shows a run-time comparison (in the first experiment) between EXCLS, XCLS and XEdge in the case of homogeneous XML documents that share the same node tags and the same edges. Figure 17 shows the run-time comparison between EXCLS, XEdge and XCLS in the case of the homogeneous XML documents defined in the second experiment, which share the same node tags but contain different edges/relationships. Figure 18 shows the run-time comparison between EXCLS, XCLS and XEdge in the case of the heterogeneous XML documents defined in the third experiment, where the dataset has different node tags and edges/relationships. As shown in Figures 16-18, there is a significant variance in run-time between EXCLS and XEdge and a very small variance between EXCLS and XCLS. ...
Context 11
... 17 shows the run-time comparison between EXCLS, XEdge and XCLS in the case of the homogeneous XML documents defined in the second experiment, which share the same node tags but contain different edges/relationships. Figure 18 shows the run-time comparison between EXCLS, XCLS and XEdge in the case of the heterogeneous XML documents defined in the third experiment, where the dataset has different node tags and edges/relationships. As shown in Figures 16-18, there is a significant variance in run-time between EXCLS and XEdge and a very small variance between EXCLS and XCLS. The variance in run-time between EXCLS and XCLS is due to the characteristics of the XML structure in the dataset, which play a critical role in the run-time evaluation. ...
Context 12
... both cases of EXCLS and XCLS, the run-time increases slowly as the number of XML documents increases, while in the case of XEdge the run-time increases faster, which makes XEdge unsuitable for very large datasets. As shown in Figure 19, we used the actors XML documents from the IMDB dataset, with sizes varying from 10,000 to 50,000 documents, where each run-time is an average of five runs. We compared XCLS with EXCLS only, because XEdge consumed an extremely long time and did not produce any results. ...
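The measurement protocol mentioned above (each run-time is an average of five runs) could be reproduced with a small harness like the one below. This is a sketch: average_runtime and cluster_placeholder are hypothetical names, the workload is a stand-in rather than a clustering algorithm, and the step of 10,000 documents between sizes is an assumption, since the excerpt only gives the range.

```python
import time


def average_runtime(task, runs=5):
    """Call task() `runs` times and return the mean wall-clock time in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        task()
        total += time.perf_counter() - start
    return total / runs


def cluster_placeholder(n_documents):
    """Hypothetical stand-in workload; a real test would cluster n_documents XML files."""
    return sorted(range(n_documents, 0, -1))


# Assumed step of 10,000 documents between the stated bounds.
for size in (10_000, 20_000, 30_000, 40_000, 50_000):
    seconds = average_runtime(lambda: cluster_placeholder(size))
    print(f"{size} documents: {seconds:.6f} s (mean of 5 runs)")
```

Averaging over several runs smooths out timing noise from the operating system, which matters most for the smaller dataset sizes.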
Context 13
... result of the accuracy comparison between EXCLS, XCLS and XEdge in the second experiment, for homogeneous documents that share the same node tags but contain different edges/relationships, is shown in Table 3. [Figure 18. Run-time comparison in the case of heterogeneous XML documents; series: XEdge, EXCLS, XCLS.] ...
Context 14
... shown in Figure 21 and Table 5, the run-time and the accuracy metrics were calculated only for searches in the most similar cluster. If paging had been used for displaying the results, the results of the most similar cluster would have been displayed on the first page of the search. ...
Similar publications
Music is a temporal organization of sounds, and we can therefore assume that any music representation has a structure that reflects some conceptual principles. This structure is hardly explicitly accessible in many encodings, such as, for instance, audio files. However, it appears much more clearly in the language of music notation.
We propose to u...
XML has become an important medium for data exchange, and is frequently used as an interface to – i.e. a view of – a relational database. Although much attention has been paid to the problem of querying relational databases through XML views, the problem of updating relational databases through XML views has not been addressed. In this paper we inv...
The cityEHR is an example of an open source EHR system which stores clinical data as collections of XML documents. The records gathered in routine clinical care are a rich source of longitudinal data for use in clinical studies. We describe how the standard language XQuery can be used to identify cohorts of patients, matching specified criteria. We...
As storage (main memory as well as disk) becomes cheaper, the amount of available information is increasing and it is a challenge to organize it. Our broader aim is to provide a unified framework for efficiently versioning and querying data, documents, as well as any kind of semi-structured information between data and documents, which can be stored...
Citations
... An approach for creating a mediated schema to integrate XML structures is discussed by Saleem et al. [15]; they used linguistic matchers that extract the semantics of all node labels, a tree-mining data structure, and label clusters to find node context. A structure-based method for enhancing XML clustering without summarizing the characteristics of the XML structure is used by Shalabi and Elfatatry [16]; the technique handles homogeneous and heterogeneous XML document datasets of different sizes. Al Hamad [17] developed a mediated schema for integrating heterogeneous XML; the technique decomposes the original schema into subschemas using three levels: ancestor, root, and leaf. ...
Extensible Markup Language (XML) has become widely used on the web to exchange and share data. Its operations and tags help to reduce the memory, storage and processing that the data requires; these features, among others, are the reason behind the rapid spread and adoption of the XML model by many companies. The main contribution of this work is a literature survey of conversion techniques and methods between relational and XML database models, which also raises awareness of these techniques and methods. We review the research approaches and techniques that have been developed for XML conversion. These techniques include, but are not limited to, Document Type Definition (DTD), Document Object Model (DOM), clustering and matching, the query languages Structured Query Language (SQL), XPath and XQuery, relational storage, relational catalogs and other methods.
... The authors in [27] clustered XML documents via PathXP algorithms. PathXP groups documents according to their characteristic features rather than their direct similarity. ...
The main objective of this work is to improve clustering efficiency and performance on very large datasets. This paper aims to improve the quality of XML data clustering by exploiting more features extracted from source schemas. In particular, it proposes a clustering approach that uses both the content and the structure of XML documents to determine the similarity between them. Content similarity and structure similarity are computed with two different similarity methods and are then combined via a weight factor to compute the overall document similarity. The structural similarity of XML data is derived from edge summaries, while content similarity is derived from an aggregate of a set of similarity measures (Jaccard, cosine and Jensen-Shannon divergence) in one algorithm. We also experimented with Jaccard distance alone as the content measure, together with edge summaries, to show that aggregating content similarity measures can further improve the results. The experiments show that clustering XML documents based on structure-only information produces worse solutions in a homogeneous environment, while in a heterogeneous environment clustering XML documents produces better results when the structure and the content are combined. The results show that the performance and quality of the proposed approach are better than those of both the XEdge and XCLSC approaches.
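The weighted combination of structure and content similarity described in this abstract can be sketched as follows. Several details are assumptions, because the abstract does not give them: the three content measures are aggregated by a plain mean, Jensen-Shannon divergence (base 2) is turned into a similarity as one minus the divergence, the structural similarity is passed in as a number rather than computed from edge summaries, and all function names are hypothetical.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of the two token sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def cosine(a, b):
    """Cosine similarity of the two term-frequency vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def js_similarity(a, b):
    """1 - Jensen-Shannon divergence (base 2, so the divergence lies in [0, 1])."""
    ca, cb = Counter(a), Counter(b)
    na, nb = sum(ca.values()), sum(cb.values())
    if na == 0 or nb == 0:
        return 0.0
    divergence = 0.0
    for token in ca.keys() | cb.keys():
        p, q = ca[token] / na, cb[token] / nb
        m = (p + q) / 2
        if p:
            divergence += 0.5 * p * math.log2(p / m)
        if q:
            divergence += 0.5 * q * math.log2(q / m)
    return 1.0 - divergence


def content_similarity(a, b):
    """Assumed aggregate: the unweighted mean of the three measures."""
    return (jaccard(a, b) + cosine(a, b) + js_similarity(a, b)) / 3


def overall_similarity(structure_sim, content_sim, weight=0.5):
    """Combine the two similarities; `weight` is the share given to structure."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * structure_sim + (1.0 - weight) * content_sim


doc_a = "drama film directed by example director".split()
doc_b = "comedy film directed by another director".split()
print(overall_similarity(structure_sim=0.8, content_sim=content_similarity(doc_a, doc_b), weight=0.5))
```

Setting weight to 1.0 reduces the measure to structure-only clustering, which gives a simple way to compare the two settings contrasted in the abstract.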