Conference Paper

Clustering of XML documents based on structure and aggregated content


Abstract

The main objective of this work is to improve clustering efficiency and performance on very large datasets. The paper aims to improve the quality of XML data clustering by exploiting more features extracted from source schemas. In particular, it proposes a clustering approach that combines both the content and the structure of XML documents to determine the similarity between them. The content and structure information are computed using two different similarity methods, which are then combined via a weight factor to obtain the overall document similarity. The structural similarity of XML data is derived from edge summaries, while the content similarity is derived from an aggregation of similarity measures (Jaccard, cosine, and Jensen-Shannon divergence) in one algorithm. We also experimented with the Jaccard distance alone as the content measure, together with edge summaries, to show that aggregating several content similarity measures further improves the results. The experiments show that clustering XML documents on structure-only information produces a worse solution in a homogeneous environment, while in a heterogeneous environment clustering produces better results when structure and content are combined. The results show that the performance and quality of the proposed approach are better than those of both the XEdge and XCLSC approaches.
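
The abstract describes a weighted combination of an edge-summary structural similarity with an aggregated content similarity. The following is only a hedged sketch of that scheme, not the authors' code: the function names, token-list inputs, and the plain average of the three content measures are assumptions.

    # Sketch: weighted structure + aggregated content similarity (assumed form)
    import math
    from collections import Counter

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 1.0

    def cosine(a, b):
        va, vb = Counter(a), Counter(b)
        dot = sum(va[t] * vb[t] for t in va)
        na = math.sqrt(sum(v * v for v in va.values()))
        nb = math.sqrt(sum(v * v for v in vb.values()))
        return dot / (na * nb) if na and nb else 0.0

    def js_similarity(a, b):
        # 1 minus the Jensen-Shannon divergence (base 2) of term distributions
        va, vb = Counter(a), Counter(b)
        ta, tb = sum(va.values()), sum(vb.values())
        if not ta or not tb:
            return 0.0
        jsd = 0.0
        for t in set(va) | set(vb):
            p, q = va[t] / ta, vb[t] / tb
            m = (p + q) / 2
            if p: jsd += 0.5 * p * math.log2(p / m)
            if q: jsd += 0.5 * q * math.log2(q / m)
        return 1.0 - jsd

    def overall_similarity(edges1, edges2, terms1, terms2, w=0.5):
        s_struct = jaccard(edges1, edges2)            # edge-summary overlap
        s_content = (jaccard(terms1, terms2) +        # aggregated content
                     cosine(terms1, terms2) +
                     js_similarity(terms1, terms2)) / 3.0
        return w * s_struct + (1 - w) * s_content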


... This fact has driven researchers to create solutions that minimize loads on networks and increase the speed at which services may be provided. One of the newest suggested improvements is a web-based messages aggregator, which works to combine several messages into a single message by removing duplicate content [20], [21]. However, the aggregation's efficacy depends on the degree of similarity between aggregated messages. ...
Article
Full-text available
div align=""> In recent years, web services run by big corporations and different application-specific data centers have all been embraced by several companies worldwide. Web services provide several benefits when compared to other communication technologies. However, it still suffers from congestion and bottlenecks as well as a significant delay due to the tremendous load caused by a large number of web service requests from end users. Clustering and then aggregating similar web services as one compressed message can potentially achieve network traffic reduction. This paper proposes a dynamic Hilbert clustering as a new model for clustering web services based on convex set similarity. Mathematically, the suggested models compute the degree of similarity between simple object access protocol (SOAP) messages and then cluster them into groups with high similarity. Next, each cluster is aggregated as a compact message that is finally encoded by fixed-length or Huffman. The experiment results have shown the suggested model performs better than the conventional clustering techniques in terms of compression ratio. The suggested model has produced the best results, reaching up to 15 with fixed-length and up to 20 with Huffman </div
... aggregated SOAP messages [17], [18]. Moreover, it would improve performance by reducing the size of the transmitted data. ...
Article
Full-text available
Web services' high levels of duplicate textual structures have caused network bottlenecks and congestion. Clustering and then aggregating similar web services as one compressed message can potentially achieve network traffic reduction. In this paper, a static Hilbert clustering is proposed as a new model for clustering web services based on convex set similarity. Mathematically, the proposed model calculates similarity among simple object access protocol (SOAP) messages and then clusters them based on higher similarity values. Next, each cluster is aggregated as a compact message. The experiments show that the proposed model outperforms conventional clustering strategies in both compression ratio and clustering time, with the best results reaching up to 15 with fixed-length and up to 21 with Huffman encoding.
Chapter
Many collaboratively built resources, such as Wikipedia, Weibo and Quora, exist in the form of semi-structured data, and semi-structured data classification plays an important role in many data analysis applications. In addition to content information, semi-structured data also contain structural information. Thus, combining the structure and content features is a crucial issue in semi-structured data classification. In this paper, we propose a supervised semi-structured data classification approach that utilizes both the structural and content information. In this approach, generalized tag sequences are extracted from the structural information, and nGrams are extracted from the content information. Then the tag sequences and nGrams are combined into features called TSGrams according to their link relation, and each semi-structured document is represented as a vector of TSGram features. Based on the TSGram features, a classification model is devised to improve the performance of semi-structured data classification. Because TSGram features retain the association between the structural and content information, they help improve classification performance. Our experimental results on two real datasets show that the proposed approach is effective.
Article
Full-text available
Data clustering techniques are valuable tools for researchers working with large databases of multivariate data. In this tutorial, we present a simple yet powerful one: the k-means clustering technique, through three different algorithms: the Forgy/Lloyd algorithm, the MacQueen algorithm, and the Hartigan & Wong algorithm. We then present an implementation in Mathematica and various examples of the different options available to illustrate the application of the technique. Data clustering techniques are descriptive data analysis techniques that can be applied to multivariate data sets to uncover the structure present in the data. They are particularly useful when classical second order statistics (the sample mean and covariance) cannot be used. Namely, in exploratory data analysis, one of the assumptions made is that no prior knowledge about the dataset, and therefore the dataset's distribution, is available. In such a situation, data clustering can be a valuable tool. Data clustering is a form of unsupervised classification, as the clusters are formed by evaluating similarities and dissimilarities of intrinsic characteristics between different cases, and the grouping of cases is based on those emergent similarities and not on an external criterion. These techniques are also useful for datasets of any dimensionality over three, as it is very difficult for humans to compare items of such complexity reliably without support to aid the comparison.
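
The tutorial's own implementation is in Mathematica; as an illustration only, a minimal NumPy sketch of the Forgy/Lloyd variant it describes (initialization scheme and stopping test are standard choices, not taken from the tutorial):

    # Minimal Forgy/Lloyd k-means sketch for illustration
    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), k, replace=False)]  # Forgy init
        for _ in range(iters):
            # assignment step: nearest center for each point
            labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
            # update step: each center moves to the mean of its cluster
            new = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
        return labels, centers

    labels, centers = kmeans(np.random.rand(100, 2), 3)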
Article
Full-text available
Every day, more digital data in semi-structured format become available on the World Wide Web, corporate intranets, and other media. Knowledge management using information search and processing is essential in the field of academic writing. This task becomes increasingly complex and challenging, mainly because collections of documents are usually heterogeneous, large, diverse, and dynamic. To address these challenges, it is essential to improve the management of the time needed to process scientific information. In this paper, we propose a new method for automatic clustering of XML documents based on their content and structure, as well as a new similarity function, OverallSimSUX, which facilitates capturing the degree of similarity among documents. Evaluation of our proposal through experiments with several data sets showed better results than previous work.
Conference Paper
Full-text available
This paper proposes a novel clustering approach for XML documents that combines both their content and structure information using tree structural-content summaries in order to reduce the size of the documents. This reduction has a twofold purpose: first, it reduces the size of the XML tree by eliminating redundant nodes; second, it gathers similar content. The clustering is performed according to a similarity measure that takes into account the structure and the content between levels. Several experiments explore the effectiveness of using tree structural summaries and constrained content in the clustering process. Empirical analysis reveals that the designed clustering approach, using content within structure and tree structural summaries, gives a better solution for XML clustering while improving runtime, which makes it particularly suitable for large data sets.
Article
Full-text available
Searching large XML repositories is a challenging research problem. The application of clustering to a large repository before performing a search enhances the search process significantly. Clustering reduces a search space into smaller XML collections that can be better searched. In this work, we present an enhanced XML clustering-by-structure method. We also introduce a new representation of XML structure that keeps all characteristics of the XML structure without summarization. Then, we perform a benchmark comparison between the search results of our improved method and those of the SAXON and Qizx XQuery processors. The comparison focuses on search processing time and accuracy of the results, using different sizes of datasets for both homogeneous and heterogeneous XML documents. The attained results show better accuracy at the same level of performance.
Article
Full-text available
Mining web data is one of the emerging research areas in data mining. HTML can be used for maintaining web data, but it is hard to obtain accurate web mining results from HTML documents. XML documents make it more convenient to find properties for web mining. Association rule based mining discovers temporal associations among XML documents, but this kind of mining alone is not sufficient to retrieve the properties of every XML document. Finding the properties of a set of similar documents is a better idea than finding the properties of a single document. Hence, the key contribution of this work is to find meaningful cluster-based associations by association rule based clustering. This paper therefore proposes a hybrid approach that discovers frequent XML documents by association rule mining and then clusters the XML documents with the classical k-means algorithm. The proposed approach was tested on real data from Wikipedia. A comparative study and result analysis are discussed in the paper to establish the importance of the proposed work.
Article
Full-text available
With its presence in data integration, chemistry, biological, and geographic systems, eXtensible Markup Language (XML) has become an important standard not only in computer science. A common problem among the mentioned applications involves structural clustering of XML documents—an issue that has been thoroughly studied and led to the creation of a myriad of approaches. In this paper, we present a comprehensive review of structural XML clustering. First, we provide a basic introduction to the problem and highlight the main challenges in this research area. Subsequently, we divide the problem into three subtasks and discuss the most common document representations, structural similarity measures, and clustering algorithms. In addition, we present the most popular evaluation measures, which can be used to estimate clustering quality. Finally, we analyze and compare 23 state-of-the-art approaches and arrange them in an original taxonomy. By providing an up-to-date analysis of existing structural XML clustering algorithms, we hope to showcase methods suitable for current applications and draw lines of future research.
Article
Full-text available
Now that the use of XML is prevalent, methods for mining semi-structured documents have become even more important. In particular, one of the areas that could greatly benefit from in-depth analysis of XML's semi-structured nature is cluster analysis. Most of the XML clustering approaches developed so far employ pairwise similarity measures. In this paper, we study clustering algorithms which use patterns to cluster documents without the need for pairwise comparisons. We investigate the shortcomings of existing approaches and establish a new pattern-based clustering framework called XPattern, which tries to address these shortcomings. The proposed framework consists of four steps: choosing a pattern definition, pattern mining, pattern clustering, and document assignment. The framework's distinguishing feature is the combination of pattern clustering and document-cluster assignment, which makes it possible to group documents according to their characteristic features rather than their direct similarity. We experimentally evaluate the proposed approach by implementing an algorithm called PathXP, which mines maximal frequent paths and groups them into profiles. PathXP was found to match, in terms of accuracy, other XML clustering approaches, while requiring less parametrization and providing easily interpretable cluster representatives. Additionally, the results of an in-depth experimental study lead to general suggestions concerning pattern-based XML clustering.
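
PathXP itself mines maximal frequent paths across a collection; as a hedged illustration of the underlying path representation only (the helper name and output format are assumptions), extracting root-to-node label paths from a single document with the standard library:

    # Sketch: root-to-node label paths of one XML document
    import xml.etree.ElementTree as ET

    def label_paths(xml_text):
        root = ET.fromstring(xml_text)
        paths = set()
        def walk(node, prefix):
            path = prefix + "/" + node.tag
            paths.add(path)
            for child in node:
                walk(child, path)
        walk(root, "")
        return paths

    print(label_paths("<a><b><c/></b><b/></a>"))
    # e.g. {'/a', '/a/b', '/a/b/c'} (set order may vary)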
Article
Full-text available
There is plenty of public content available on the Internet, especially in online communities, enabling researchers to study society in new ways. Since qualitative content analysis is very time consuming, the following problem arises: how should the content to be analyzed be selected? This paper introduces a new process to support solving this problem. The process is based on unsupervised machine learning techniques and provides consolidated and structured results, including measurements and a content exploration method. A real-world study is presented to contextualize its application, along with a tool that was created to help apply the proposed process.
Article
Full-text available
With the vastly growing data resources on the Internet, XML is one of the most important standards for document management. Not only does it provide enhancements to document exchange and storage, but it is also helpful in a variety of information retrieval tasks. Document clustering is one of the most interesting research areas that utilize XML's semi-structural nature. In this paper, we put forward a new XML clustering algorithm that relies solely on document structure. We propose the use of maximal frequent subtrees and an operator called Satisfy/Violate to divide documents into groups. The algorithm is experimentally evaluated on real and synthetic data sets with promising results.
Article
Full-text available
Similarity is an important and widely used concept in many applications such as Document Summarisation, Question Answering, Information Retrieval, Document Clustering and Categorisation. This paper presents a comparison of various similarity measures for comparing the content of text documents. We attempt to find the measure best suited to computing document similarity for newspaper reports.
Conference Paper
Full-text available
In this paper, we address the problem of finding common structures in a collection of XML documents. Since an XML document can be represented as a tree structure, the problem of clustering a collection of XML documents can be treated as clustering a collection of tree-structured documents. First, we used a SOM (Self-Organizing Map) with the Jaccard coefficient to cluster XML documents. Then, an efficient sequential mining method called GST was applied to find maximum frequent sequences. Finally, we merged the maximum frequent sequences to produce the common structures in a cluster.
Article
Full-text available
Use of traditional k-means-type algorithms is limited to numeric data. This paper presents a clustering algorithm based on the k-means paradigm that works well for data with mixed numeric and categorical features. We propose a new cost function and distance measure based on co-occurrence of values. The measures also take into account the significance of an attribute towards the clustering process. We present a modified description of the cluster center to overcome the numeric-only limitation of the k-means algorithm and provide a better characterization of clusters. The performance of this algorithm has been studied on real-world data sets. Comparisons with other clustering algorithms illustrate the effectiveness of this approach.
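
As a rough sketch of the mixed-data idea only: squared Euclidean distance on numeric features plus a weighted mismatch cost on categorical features. The paper's actual co-occurrence-based measure and attribute-significance weighting are more elaborate; the simple mismatch and the gamma parameter here are assumptions.

    # Sketch: a naive mixed numeric/categorical distance (not the paper's measure)
    def mixed_distance(x, y, numeric_idx, categorical_idx, gamma=1.0):
        d_num = sum((x[i] - y[i]) ** 2 for i in numeric_idx)
        d_cat = sum(1.0 for i in categorical_idx if x[i] != y[i])
        return d_num + gamma * d_cat

    row_a = (1.70, 65.0, "red", "yes")
    row_b = (1.65, 70.0, "blue", "yes")
    print(mixed_distance(row_a, row_b, [0, 1], [2, 3]))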
Conference Paper
Full-text available
XML has become a popular method of data representation both on the web and in databases in recent years. One of the reasons for the popularity of XML has been its ability to encode structural information about data records. However, this structural characteristic of data sets also makes it a challenging problem for a variety of data mining problems. One such problem is that of clustering, in which the structural aspects of the data result in a high implicit dimensionality of the data representation. As a result, it becomes more difficult to cluster the data in a meaningful way. In this paper, we propose an effective clustering algorithm for XML data which uses substructures of the documents in order to gain insights about the important underlying structures. We propose new ways of using multiple sub-structural information in XML documents to evaluate the quality of intermediate cluster solutions, and guide the algorithms to a final solution which reflects the true structural behavior in individual partitions. We test the algorithm on a variety of real and synthetic data sets.
Conference Paper
Full-text available
With the increasing use of XML in many domains, XML document clustering has been a central research topic in semistructured data management and mining. Due to the semistructured nature of XML data, the clustering problem becomes particularly challenging, mainly because structural similarity measures specifically designed to deal with tree/graph-shaped data can be quite expensive. Specialized clustering techniques are being developed to account for this difficulty; however, most of them still assume that XML documents are represented using a semistructured data model. In this paper we take a simpler approach whereby XML structural aspects are extracted from the documents to generate a flat data format to which well-established clustering methods can be directly applied. Hence, the expensive process of tree/graph data mining is avoided, while the structural properties are still preserved. Our experimental evaluation, using a number of real world datasets and comparing with existing structural clustering methods, has demonstrated the significance of our approach.
Conference Paper
Full-text available
Doucet, A. and Ahonen-Myka, H., "Naive clustering of a large XML document collection," in Proceedings of the First Workshop of the Initiative for the Evaluation of XML Retrieval (INEX'02), ERCIM Workshop Proceedings, Schloss Dagstuhl, Germany, December 2002, pp. 81-87.
Conference Paper
Full-text available
In this paper we propose a unified clustering algorithm for both homogeneous and heterogeneous XML documents. Depending on the type of the XML documents, the proposed algorithm modifies its distance metric in order to properly adapt to the special structural characteristics of homogeneous and heterogeneous XML documents. We compare the quality of the formed clusters with those of one of the latest XML clustering algorithms and show that our algorithm outperforms it in the case of both homogeneous and heterogeneous XML documents.
Article
Full-text available
The goal of graph clustering is to partition vertices in a large graph into different clusters based on various criteria such as vertex connectivity or neighborhood similarity. Graph clustering techniques are very useful for detecting densely connected groups in a large graph. Many existing graph clustering methods mainly focus on the topological structure for clustering, but largely ignore the vertex properties, which are often heterogeneous. In this paper, we propose a novel graph clustering algorithm, SA-Cluster, based on both structural and attribute similarities through a unified distance measure. Our method partitions a large graph associated with attributes into k clusters so that each cluster contains a densely connected subgraph with homogeneous attribute values. An effective method is proposed to automatically learn the degree of contributions of structural similarity and attribute similarity. Theoretical analysis is provided to show that SA-Cluster is converging. Extensive experimental results demonstrate the effectiveness of SA-Cluster through comparison with the state-of-the-art graph clustering and summarization methods.
Article
Full-text available
This paper proposes a clustering approach that explores both the content and the structure of XML documents for determining similarity among them. Assuming that the content and the structure of XML documents play different roles and have different importance depending on the use and purpose of a dataset, the content and structure information of the documents are handled using two different similarity measuring methods. The similarity values produced by these two methods are then combined with weightings to measure the overall document similarity. The effect of structure similarity and content similarity on the clustering solution is thoroughly analysed. The experiments prove that clustering of text-centric XML documents based on content-only information produces a better solution in a homogeneous environment, that is, with documents derived from a single structural definition; however, in a heterogeneous environment, with documents derived from two or more structural definitions, clustering of text-centric XML documents produces a better result when the structure and the content similarities of the documents are combined with different strengths.
Conference Paper
Full-text available
We present a novel clustering algorithm to group XML documents by similar structures. We introduce a Level structure format to represent the XML documents for efficient processing. We develop a global criterion function that does not require the pairwise similarity to be computed between two individual documents; rather, it measures the similarity at the clustering level, utilising structural information of the XML documents. The experimental analysis shows the method to be fast and accurate.
Conference Paper
Full-text available
This paper describes the Jensen-Shannon divergence (JSD) and its Hilbert space embedding. With natural definitions making these considerations precise, one finds that the general Jensen-Shannon divergence related to the mixture is the minimum redundancy that can be achieved by the observer. The set of distributions with the metric √JSD can even be embedded isometrically into Hilbert space, and the embedding can be identified.
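
For reference, the standard textbook definition behind this abstract (not quoted from the paper): with M the equal-weight mixture of the two distributions,

    \mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M),
    \qquad M = \tfrac{1}{2}(P + Q),

and the isometric-embedding statement concerns the metric \sqrt{\mathrm{JSD}}, not JSD itself.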
Conference Paper
Full-text available
Recently, schema matching has found considerable interest in both research and practice. Determining matching components of database or XML schemas is needed in many applications, e.g. for E-business and data integration. Various schema matching systems have been developed to solve the problem semi-automatically. While there have been some evaluations, the overall effectiveness of currently available automatic schema matching systems is largely unclear. This is because the evaluations were conducted in diverse ways, making it difficult to assess the effectiveness of each single system, let alone to compare their effectiveness. In this paper we survey recently published schema matching evaluations. For this purpose, we introduce the major criteria that influence the effectiveness of a schema matching approach and use these criteria to compare the various systems. Based on our observations, we discuss the requirements for future match implementations and evaluations.
Article
Full-text available
XML is a new standard for exchanging and representing information on the Internet. Documents can be hierarchically represented by XML-elements. In this paper, we propose that an XML document collection be represented and indexed using a bitmap indexing technique. We define the similarity and popularity operations suitable for bitmap indexes. We also define statistical measurements in the BitCube: center and radius. Based on these measurements, we describe a new bitmap-indexing-based technique to cluster XML documents. The techniques for clustering are motivated by the fact that the bitmap indexes are expected to be very sparse. Furthermore, a 2-dimensional bitmap index is extended to a 3-dimensional bitmap index, called the BitCube. Sophisticated querying of XML document collections can be performed using primitive operations such as slice, project, and dice. Experiments show that the BitCube can be created efficiently and the primitive operations can be performed more efficiently with the BitCube than with other alternatives.
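
A toy illustration of the 3-dimensional bitmap idea, in the spirit of the abstract only: the axis layout (documents × paths × terms) and the expressions for "slice" and "project" are assumptions, not the paper's API.

    # Sketch: a 3-D boolean bitmap with slice/project as array operations
    import numpy as np

    docs, paths, terms = 4, 3, 5
    cube = np.zeros((docs, paths, terms), dtype=bool)
    cube[0, 1, 2] = True           # doc 0 contains term 2 under path 1
    cube[2, 1, 2] = True

    slice_p1 = cube[:, 1, :]                 # "slice": fix one path -> docs x terms
    nonempty = cube.any(axis=(1, 2))         # "project": which docs have any bit set
    shares_with_0 = (cube & cube[0]).any(axis=(1, 2))  # docs overlapping doc 0
    print(slice_p1.shape, nonempty, shares_with_0)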
Article
Full-text available
Jensen-Shannon divergence (JD) is a symmetrized and smoothed version of the most important divergence measure of information theory, Kullback divergence. As opposed to Kullback divergence, it determines in a very direct way a metric; indeed, it is the square of a metric. We consider a family of divergence measures (JD_alpha for alpha > 0), the Jensen divergences of order alpha, which generalize JD as JD_1 = JD. Using a result of Schoenberg, we prove that JD_alpha is the square of a metric for alpha in the interval (0,2], and that the resulting metric space of probability distributions can be isometrically embedded in a real Hilbert space. Quantum Jensen-Shannon divergence (QJD) is a symmetrized and smoothed version of quantum relative entropy and can be extended to a family of quantum Jensen divergences of order alpha (QJD_alpha). We strengthen results by Lamberti et al. by proving that for qubits and pure states, QJD_alpha^{1/2} yields a metric space which can be isometrically embedded in a real Hilbert space when alpha lies in the interval (0,2]. In analogy with Burbea and Rao's generalization of JD, we also define the general QJD by associating a Jensen-type quantity to any weighted family of states. Appropriate interpretations of the quantities introduced are discussed, and bounds are derived in terms of the total variation and trace distance.
Article
XML similarity evaluation has become a central issue in the database and information communities, its applications ranging over document clustering, version control, data integration and ranked retrieval. Various algorithms for comparing hierarchically structured data, XML documents in particular, have been proposed in the literature. Most of them make use of techniques for finding the edit distance between tree structures, XML documents being commonly modeled as Ordered Labeled Trees. Yet, a thorough investigation of current approaches led us to identify several similarity aspects, i.e., sub-tree related structural and semantic similarities, which are not sufficiently addressed while comparing XML documents. In this paper, we provide an integrated and fine-grained comparison framework to deal with both structural and semantic similarities in XML documents (detecting the occurrences and repetitions of structurally and semantically similar sub-trees), and to allow the end-user to adjust the comparison process according to her requirements. Our framework consists of four main modules for (i) discovering the structural commonalities between sub-trees, (ii) identifying sub-tree semantic resemblances, (iii) computing tree-based edit operations costs, and (iv) computing tree edit distance. Experimental results demonstrate higher comparison accuracy with respect to alternative methods, while timing experiments reflect the impact of semantic similarity on overall system performance.
Conference Paper
In this paper, we propose a novel similarity measure for soft sets which is based on the Jaccard similarity coefficient. The proposed similarity measure takes into consideration two components, i.e., similarity due to the compared parameter sets, and similarity between the approximate value sets of the overlapping parameters. The efficiency of the proposed measure is compared with existing soft set similarity measures through numerical examples. An application of the new similarity measure in solving a financial diagnostic problem is also illustrated in the paper.
Conference Paper
XML (eXtensible Markup Language) has been adopted by a number of software vendors; it has become the standard for data interchange over the web and is platform- and application-independent. An XML document consists of a number of attributes such as document data, structure, and style sheet. Clustering is a method of creating groups of similar objects. In this paper, a weighted similarity measurement approach for detecting the similarity between homogeneous XML documents is suggested, and a new clustering technique based on this similarity measurement is proposed. Methods for calculating the similarity of a document's structure and styling have been given by a number of researchers, mostly based on tree edit distances, while for calculating the distance between documents' contents there are a number of text similarity techniques such as cosine, Jaccard, and tf-idf. In this paper, both kinds of similarity techniques are combined to propose a new distance measurement technique for calculating the distance between a pair of homogeneous XML documents. The proposed clustering model is implemented using open-source Java technology and is validated experimentally: given a collection of XML documents, the distances between documents are calculated, stored in Java collections, and then used to cluster the XML documents.
Article
Many studies on developing technologies have been published as articles, papers, or patents. We use and analyze these documents to find scientific and technological trends. In this paper, we consider document clustering as a method of document data analysis. In general, we have trouble analyzing documents directly because document data are not suitable for statistical and machine learning methods of analysis. Therefore, we have to transform document data into structured data for analytical purposes. For this process, we use text mining techniques. The structured data are very sparse, and hence, it is difficult to analyze them. This study proposes a new method to overcome the sparsity problem of document clustering. We build a combined clustering method using dimension reduction and K-means clustering based on support vector clustering and Silhouette measure. In particular, we attempt to overcome the sparseness in patent document clustering. To verify the efficacy of our work, we first conduct an experiment using news data from the machine learning repository of the University of California at Irvine. Second, using patent documents retrieved from the United States Patent and Trademark Office, we carry out patent clustering for technology forecasting.
Article
We propose and experimentally evaluate different approaches for measuring the structural similarity of semistructured documents based on information-theoretic concepts. Common to all approaches is a two-step procedure: first, we extract and linearize the structural information from documents, and then, we use similarity measures that are based on, respectively, Kolmogorov complexity and Shannon entropy to determine the distance between the documents. Compared to other approaches, we are able to achieve a linear run-time complexity and demonstrate in an experimental evaluation that the results of our technique in terms of clustering quality are on a par with or even better than those of other, slower approaches.
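
The paper's measures are information-theoretic; as a hedged illustration of the general flavor, here is the standard compression-based approximation (normalized compression distance) commonly used to approximate Kolmogorov-complexity-based similarity. This is a well-known generic technique, not necessarily the authors' exact method.

    # Sketch: normalized compression distance over linearized structure strings
    import zlib

    def ncd(x: bytes, y: bytes) -> float:
        cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
        cxy = len(zlib.compress(x + y))
        return (cxy - min(cx, cy)) / max(cx, cy)

    a = b"<library><book><title/></book></library>"
    b = b"<library><book><title/><author/></book></library>"
    print(round(ncd(a, b), 3))  # smaller value means structurally more similar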
Article
In this paper, we describe a method for clustering XML documents. Its goal is to group documents sharing similar structures. Our approach is two-step. We first automatically extract the structure from each XML document to be classified. This extracted structure is then used as a representation model to classify the corresponding XML document. The idea behind the clustering is that if XML documents share similar structures, they are more likely to correspond to the structural part of the same query. Finally, for the experimentation purpose, we tested our algorithms on both real (ACM SIGMOD Record corpus) and synthetic data. The results clearly demonstrate the interest of our approach.
Article
This paper presents the results of an experimental study of some similarity measures used for both information retrieval and document clustering. Our results indicate that the cosine similarity measure is superior to the other measures that we tested, such as the Jaccard and Euclidean measures. The cosine similarity measure is particularly well suited to text documents. Previously, these measures have been compared on conventional text datasets, but the proposed system collects its datasets with the help of an API that returns a collection of XML pages. These XML pages are parsed and filtered to obtain the web document datasets. In this paper, we compare and analyze the effectiveness of these measures for these web document datasets.
Article
XML has become a de facto standard for data representation and exchange over the Internet. With the emergence of more and more XML documents, the clustering of XML documents has become an active research area. XML documents lie between structured data and unstructured data, describing both content and structure, so how to effectively cluster XML documents is a huge challenge. However, most existing clustering algorithms are based on the structural similarities between XML documents and take little or no account of the content of the XML documents. In this paper, we develop a novel method for measuring similarities between XML documents which effectively combines the structure and contents of the XML documents. Based on this similarity model, we adopt a hierarchical clustering algorithm to cluster XML documents. The experiments show that this method achieves better clustering quality.
Article
Extensible markup language (XML) is emerging as a de facto standard for information exchange among various applications on the World Wide Web. There has been a growing need for developing high-performance techniques to query large XML data repositories efficiently. One important problem in XML query processing is twig pattern matching, that is, finding in an XML data tree D all matches that satisfy a specified twig (or path) query pattern Q. In this survey, we review, classify, and compare major techniques for twig pattern matching. Specifically, we consider two classes of major XML query processing techniques: the relational approach and the native approach. The relational approach directly utilizes existing relational database systems to store and query XML data, which enables the use of all important techniques that have been developed for relational databases, whereas in the native approach, specialized storage and query processing systems tailored for XML data are developed from scratch to further improve XML query performance. As implied by existing work, XML data querying and management are developing in the direction of integrating the relational approach with the native approach, which could result in higher query processing performance and also significantly reduce system reengineering costs.
Article
In recent years, XML has been established as a major means for information management, and has been broadly utilized for complex data representation (e.g. multimedia objects). Owing to an unparalleled increasing use of the XML standard, developing efficient techniques for comparing XML-based documents becomes essential in the database and information retrieval communities. In this paper, we provide an overview of XML similarity/comparison by presenting existing research related to XML similarity. We also detail the possible applications of XML comparison processes in various fields, ranging over data warehousing, data integration, classification/clustering and XML querying, and discuss some required and emergent future research directions.
Article
Current studies on the storage of XML data are focused on either the efficient mapping of XML data onto an existing RDBMS or the development of a native XML storage. Some native XML storages store each XML node in a parsed object form. Clustering, which means the physical arrangement of objects, can be an important factor in improving the performance in this storage model. In this paper, we propose a clustering method that stores data nodes in an XML document into the native XML storage. The proposed clustering method uses path similarities between data nodes, which can reduce page I/Os required for query processing. In addition, we propose a query processing method using signatures that facilitate the cluster-level access on the stored data to benefit from the proposed clustering method. This method can process a path query by accessing only a small number of clusters and thus need not use all of the clusters, hence enabling the path query to be processed efficiently by skipping unnecessary data. Finally, we compare the performance of the proposed method with that of the existing ones. Our results show that the performance of XML storage can be improved by using a proper clustering method.
Article
XML has recently become very popular as a means of representing semistructured data and as a standard for data exchange over the Web, because of its varied applicability in numerous applications. Therefore, XML documents constitute an important data mining domain. In this paper, we propose a new method of XML document clustering by a global criterion function, considering the weight of common structures. Our approach initially extracts representative structures of frequent patterns from schemaless XML documents using a sequential pattern mining algorithm. Then, we perform clustering of an XML document by the weight of common structures, without a measure of pairwise similarity, assuming that an XML document is a transaction and frequent structures extracted from documents are items of the transaction. We conducted experiments to compare our method with previous methods. The experimental results show the effectiveness of our approach.
Conference Paper
Clustering of XML documents is an important data mining method, the aim of which is the grouping of similar XML documents. The issue of clustering XML documents by structure is considered in this paper. Two different and independent methods of clustering XML documents by structure are proposed. The first method represents a set of XML documents as a set of labels. The second method introduces a new representation of a set of XML documents, called the SuperTree. It is suggested that the proposed methods may improve the accuracy of XML clustering by structure. This thesis is supported by tests, conducted on heterogeneous and homogeneous data sets respectively, whose aim is to assess the advantages of the proposals.
Conference Paper
This report explains our plagiarism detection method using a fuzzy semantic-based string similarity approach. The algorithm was developed through four main stages. First is pre-processing, which includes tokenisation, stemming and stop-word removal. Second is retrieving a list of candidate documents for each suspicious document using shingling and the Jaccard coefficient. Suspicious documents are then compared sentence-wise with the associated candidate documents. This stage entails the computation of a fuzzy degree of similarity that ranges between two edges: 0 for completely different sentences and 1 for exactly identical sentences. Two sentences are marked as similar (i.e. plagiarised) if they gain a fuzzy similarity score above a certain threshold. The last step is post-processing, whereby consecutive sentences are joined to form single paragraphs/sections. Our performance measures on the PAN'09 training corpus for the external plagiarism detection task (recall = 0.3097, precision = 0.5424, granularity = 7.8867) indicate that about 54% of our detections are correct, while we detect only 30% of the plagiarism cases. The performance measures on the PAN'10 test collection are lower (recall = 0.1259, precision = 0.5761, granularity = 3.5828), due to the fact that our algorithm handles external plagiarism detection but neither intrinsic nor cross-lingual detection. Although our fuzzy semantic-based method can detect some means of obfuscation, it might not work at all levels. Our future work is to improve it for more detection efficiency and less time complexity. In particular, we need to advance the post-processing stage to attain more ideal granularity.
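
A minimal sketch of the candidate-retrieval step described above, using word shingles and the Jaccard coefficient; the shingle size and the threshold value are assumptions, not taken from the report.

    # Sketch: shingling + Jaccard candidate retrieval
    def shingles(text, n=3):
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    suspicious = "the quick brown fox jumps over the lazy dog"
    candidate = "a quick brown fox jumps over a sleeping dog"
    score = jaccard(shingles(suspicious), shingles(candidate))
    print(round(score, 3), score > 0.2)  # keep candidate if above threshold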
Conference Paper
The amount of XML documents is increasing rapidly. In order to analyze the information represented in XML documents efficiently, research on XML document clustering is actively in progress. The key issue is how to devise the similarity measure between XML documents to be used for clustering. Since XML documents have a hierarchical structure, it is not appropriate to cluster them using a general document similarity measure. In this paper, we propose a novel similarity calculation measure that reduces nesting and repetition in the whole XML document, and then propose an improved edge-set comparison algorithm to calculate the similarity of two XML documents. Our experiments show that the proposed method improves clustering accuracy compared to previous works.
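
A hedged sketch of the basic edge-set representation only: collect parent/child tag pairs and compare two documents by set overlap. The paper's improved algorithm additionally reduces nesting and repetition, which this sketch omits, and the plain Jaccard overlap is an assumption.

    # Sketch: edge-set similarity between two XML documents
    import xml.etree.ElementTree as ET

    def edge_set(xml_text):
        root = ET.fromstring(xml_text)
        edges = set()
        def walk(node):
            for child in node:
                edges.add((node.tag, child.tag))
                walk(child)
        walk(root)
        return edges

    e1 = edge_set("<a><b><c/></b></a>")
    e2 = edge_set("<a><b><d/></b></a>")
    print(len(e1 & e2) / len(e1 | e2))  # 0.333... edge-overlap similarity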
Conference Paper
The increasing availability of heterogeneous XML informative sources has raised a number of issues concerning how to represent and manage semistructured data. Although XML sources can exhibit proper structures and contents, differently annotated XML documents may in principle encode related semantics due to subjective definitions of markup tags. Discovering knowledge to infer the semantic organization of XML documents has become a major challenge in XML data management. In this context, we address the problem of clustering XML data according to structure as well as content features enriched with lexical ontology knowledge. We propose a framework for clustering semantically cohesive XML structures based on a transactional representation model. Experiments on large real datasets give evidence that the proposed approach is highly effective in detecting
Conference Paper
This work presents a methodology for grouping structurally similar XML documents using clustering algorithms. Modeling XML documents with tree-like structures, we face the 'clustering XML documents by structure' problem as a 'tree clustering' problem, exploiting distances that estimate the similarity between those trees in terms of the hierarchical relationships of their nodes. We suggest the usage of tree structural summaries to improve the performance of the distance calculation and at the same time to maintain or even improve its quality. Experimental results are provided using a prototype testbed.
Article
Representation of the features of items and user feedback, and reasoning about their relationships, are major problems in recommender systems. This is because item features and user feedback are subjective, imprecise and vague. The paper presents a fuzzy set theoretic method (FTM) for recommender systems that handles the non-stochastic uncertainty induced from subjectivity, vagueness and imprecision in the data, the domain knowledge and the task under consideration. The research further advances the application of fuzzy modeling for content-based recommender systems initially presented by Ronald Yager. The paper defines a representation method, similarity measures and aggregation methods, and empirically evaluates their performance through simulation using benchmark movie data. FTM consists of a representation method for items' features and user feedback using fuzzy sets, and a content-based algorithm based on various fuzzy set theoretic similarity measures (the fuzzy set extensions of the Jaccard index, cosine, proximity or correlation similarity measures) and aggregation methods for computing recommendation confidence scores (the maximum-minimum or weighted-sum fuzzy set theoretic aggregation methods). Compared to the baseline crisp set based method (CSM) presented, the empirical evaluation of the FTM using the movie data and simulation shows an improvement in precision without loss of recall. Moreover, the paper provides a guideline for recommender systems designers that will help in
Article
This paper proposes a novel Hybrid Clustering approach for XML documents (HCX) that first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. The empirical analysis reveals that the proposed method is scalable and accurate.