Jianyong Wang

  • Tsinghua University

About

97
Publications
21,400
Reads
11,113
Citations
Current institution
Tsinghua University

Publications (97)
Article
The volume of Web videos has increased sharply over the past several years because of the evolution of Web video sites. Enhanced algorithms for retrieval, classification, and TDT (Topic Detection and Tracking) can bring considerable convenience to Web users and relieve administrators of tedious work. Nevertheless, due to t...
Article
Mining frequent subsequence patterns is a typical data-mining problem and various efficient sequential pattern mining algorithms have been proposed. In many application domains (e.g., biology), the frequent subsequences confined by the predefined gap requirements are more meaningful than the general sequential patterns. In this article, we propose...
Article
Existing studies on keyword search over relational databases usually find Steiner trees composed of connected database tuples as answers. They identify Steiner trees on the fly by discovering rich structural relationships between database tuples, and neglect the fact that such structural relationships can be precomputed and indexed. Recently, tuple...
Conference Paper
As a typical data mining research topic, sequential pattern mining has been studied extensively for the past decade. Recently, mining various sequential patterns incrementally over stream data has raised great interest. Due to the challenges of mining stream data, many difficulties not so obvious in static data mining have to be reconsidered carefu...
Article
This paper studies the problem of XML message brokering with user subscribed profiles of keyword queries and presents a KEyword-based XML Message Broker (KEMB) to address this problem. In contrast to traditional path-expression-based XML message brokers, KEMB stores a large number of user profiles, in the form of keyword queries, which capture the...
Conference Paper
Full-text available
The problem of privacy-preserving data mining has attracted considerable attention in recent years because of increasing concerns about the privacy of the underlying data. In recent years, an important data domain which has emerged is that of graphs and structured data. Many data sets such as XML data, transportation networks, traffic in IP network...
Conference Paper
Recently, mining sequential patterns, especially closed sequential patterns and generator patterns, has attracted much attention from both academic and industrial communities. In recent years, incremental mining of all sequential patterns (all closed sequential patterns) has been widely studied. However, to the best of our knowledge, there has not been an...
Article
Conventional keyword search engines are restricted to a given data model and cannot easily adapt to unstructured, semi-structured or structured data. In this paper, we propose an efficient and adaptive keyword search method, called EASE, for indexing and querying large collections of heterogeneous data. To achieve high efficiency in processing keyw...
Article
Name ambiguity stems from the fact that many people or objects share identical names in the real world. Such name ambiguity decreases the performance of document retrieval, Web search, information integration, and may cause confusion in other applications. Due to the same name spellings and lack of information, it is a nontrivial task to distinguis...
Article
A common approach to performing keyword search over relational databases is to find the minimum Steiner trees in database graphs transformed from relational data. These methods, however, are rather expensive as the minimum Steiner tree problem is known to be NP-hard. Further, these methods are independent of the underlying relational database mana...
Conference Paper
Full-text available
Web-page recommendation is to predict the next request of pages that Web users are potentially interested in when surfing the Web. This technique can guide Web users to find more useful pages without asking for them explicitly and has attracted much attention in the community of Web mining. However, few studies on Web page recommendation consider p...
Article
IL-17AA, IL-17FF, and IL-17AF are proinflammatory cytokines that have been implicated in the pathogenesis of autoimmune diseases such as rheumatoid arthritis (RA). In order to measure the levels of these cytokines in synovial fluid and serum samples from RA patients, immunoassays specific for IL-17AA, FF, and AF were developed. Although these assay...
Conference Paper
Classification is one of the most essential tasks in data mining. Unlike other methods, associative classification tries to find all the frequent patterns existing in the input categorical data satisfying a user-specified minimum support and/or other discrimination measures like minimum confidence or information-gain. Those patterns are used later...
Article
In this paper, we study the problem of keyword proximity search in XML documents. We take the disjunctive semantics among the keywords into consideration and find top-k relevant compact connected trees (CCTrees) as the answers of keyword proximity queries. We first introduce the notions of compact lowest common ancestor (CLCA) and maximal CLCA (MCL...
Conference Paper
In this demonstration, we propose an interactive query completion system on structural data like DBLP, called SEQUEL. It is novel in several aspects: with patterns mined on the structural data using a newly devised algorithm, SEQUEL offers high-utility completions composed of not only words but also phrases, and requires no explicit indications of...
Article
Nanoparticles have received a great deal of attention for producing new engineering applications due to their novel physicochemical characteristics. However, the broad application of nanomaterials has also produced concern for nanoparticle toxicity due to increased exposure from large-scale industry production. This study was conducted to investiga...
Article
Graph data have become ubiquitous and manipulating them based on similarity is essential for many applications. Graph edit distance is one of the most widely accepted measures to determine similarities between graphs and has extensive applications in the fields of pattern recognition, computer vision etc. Unfortunately, the problem of graph edit di...
Article
Existing algorithms for mining frequent XML query patterns (XQPs) employ a candidate generate-and-test strategy. They involve expensive candidate enumeration and costly tree-containment checking. Further, most existing methods compute the frequencies of candidate query patterns from scratch periodically by checking the entire transaction database...
Article
Full-text available
In recent years we have witnessed several applications of frequent sequence mining, such as feature selection for protein sequence classification and mining block correlations in storage systems. In typical applications such as clustering, it is not the complete set but only a subset of discriminating frequent subsequences which is of interest. One...
Article
Full-text available
Graph data have become ubiquitous and manipulating them based on similarity is essential for many applications. Graph edit distance is one of the most widely accepted measures to determine similarities between graphs and has extensive applications in the fields of pattern recognition, computer vision etc. Unfortunately, the problem of graph edit...
Conference Paper
To the best of our knowledge, all existing graph pattern mining algorithms can only mine either closed, maximal or the complete set of frequent subgraphs instead of graph generators, which are preferable to the closed subgraphs according to the Minimum Description Length principle in some applications. In this paper, we study a new problem of frequen...
Conference Paper
Mining generator patterns has raised great research interest in recent years. The main purpose of mining itemset generators is that they can form equivalence classes together with closed itemsets, and can be used to generate simple classification rules according to the MDL principle. In this paper, we devise an efficient algorithm called StreamGen...
Conference Paper
Full-text available
Most existing methods of keyword search over relational databases find the Steiner trees composed of relevant tuples as the answers. They identify the Steiner trees by discovering the rich structural relationships between tuples, and neglect the fact that such structural relationships can be pre-computed and indexed. Tuple units that are compose...
Conference Paper
Full-text available
This paper studies the problem of frequent pattern mining with uncertain data. We will show how broad classes of algorithms can be extended to the uncertain data setting. In particular, we will study candidate generate-and-test algorithms, hyper-structure algorithms and pattern growth based algorithms. One of our insightful observations is that t...
Conference Paper
Graphs or networks can be used to model complex systems. Detecting community structures from large network data is a classic and challenging task. In this paper, we propose a novel community detection algorithm, which utilizes a dynamic process by contradicting the network topology and the topology-based propinquity, where the propinquity is a meas...
Article
In our previous study (Wang et al., 2004, Toxicol. Sci. 82: 124-128), we observed that the cII gene mutant frequency (MF) in the bone marrow of Big Blue mice showed significant increase as early as day 1, reached the maximum at day 3 and then decreased to a plateau by day 15 after a single dose of carcinogen N-ethyl-N-nitrosourea (ENU) treatment, w...
Article
We present EASE, an effective and versatile keyword search engine that enables users to easily access the heterogeneous data composed of unstructured, semi-structured and structured data, without the need of learning XPath/XQuery or SQL languages. EASE addresses a challenge in keyword search that has been neglected in the literature: how to efficien...
Article
Full-text available
Graph data have become ubiquitous and manipulating them based on similarity is essential for many applications. Graph edit distance is one of the most widely accepted measures to determine similarities between graphs and has extensive applications in the fields of pattern recognition, computer vision etc. Unfortunately, the problem of graph edit dist...
Conference Paper
Full-text available
Conventional keyword search engines are restricted to a given data model and cannot easily adapt to unstructured, semi-structured or structured data. In this paper, we propose an efficient and adaptive keyword search method, called EASE, for indexing and querying large collections of heterogeneous data. To achieve high efficiency in processing keywo...
Article
Parkinson's disease (PD) is a common neurodegenerative disease characterized by progressive loss of midbrain dopaminergic neurons with unknown etiology. MPP+ (1-methyl-4-phenylpyridinium ion) is the active metabolite of the neurotoxin 1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine (MPTP), which induces Parkinson's-like symptoms in humans and animals....
Conference Paper
Full-text available
Distinguishing patterns represent strong distinguishing knowledge and are very useful for constructing powerful, accurate and robust classifiers. Distinguishing graph patterns (DGPs) are able to capture structure differences between any two categories of graph datasets. However, few previous studies have worked on the discovery of DGPs. In this...
Conference Paper
In this paper, we study the problem of mining high confidence fragment-based classification rules from the imbalanced HIV data whose class distribution is extremely skewed. We propose an efficient approach to mining frequent fragments in different classes of compounds that can provide best hints of the characteristic of each class and can be used t...
Conference Paper
Mining frequent subsequence patterns from sequence databases is a typical data mining problem and various efficient sequential pattern mining algorithms have been proposed. In many problem domains (e.g., biology), the frequent subsequences confined by the predefined gap requirements are more meaningful than the general sequential patterns. In this pap...
Conference Paper
Full-text available
Graph structure can model the relationships among a set of objects. Mining quasi-clique patterns from large dense graph data makes sense with respect to both statistics and applications. The applications of frequent quasi-cliques include stock price correlation discovery, gene function prediction and protein molecular analysis. Although the graph mi...
Conference Paper
Sequential pattern mining has raised great interest in the data mining research field in recent years. However, to the best of our knowledge, no existing work studies the problem of frequent sequence generator mining. In this paper we present a novel algorithm, FEAT (abbr. Frequent sEquence generATor miner), to perform this task. Experimental results show...
Conference Paper
Full-text available
This paper studies the problem of unified ranked retrieval of heterogeneous XML documents and Web data. We propose an effective search engine called Sailer to adaptively and versatilely answer keyword queries over the heterogeneous data. We model the Web pages and XML documents as graphs. We propose the concept of pivotal trees to effectively...
Conference Paper
In this paper, we study the problem of keyword proximity search over XML documents and leverage the efficiency and effectiveness. We take the disjunctive semantics among input keywords into consideration and identify meaningful compact connected trees as the answers of keyword proximity queries. We introduce the notions of Compact Lowest Common...
Conference Paper
Full-text available
This paper proposes several vectorial operators for processing XML twig queries, which are easy to perform and inherently efficient for both Ancestor-Descendant (A-D) and Parent-Child (P-C) relationships. We develop optimizations on the vectorial operators to improve the efficiency of answering twig queries holistically. We propose an algor...
Conference Paper
Name ambiguity stems from the fact that many people or objects share identical names. In this paper, we focus on investigating the problem in digital libraries to distinguish publications written by authors with identical names. We present an effective graph-based framework, GHOST (abbr. GrapH-based framewOrk for name diStincTion), to solve the pro...
Article
Parallel frequent pattern discovery algorithms exploit parallel and distributed computing resources to relieve the sequential bottlenecks of current frequent pattern mining (FPM) algorithms. Thus, parallel FPM algorithms achieve better scalability and performance, so they are attracting much attention in the data mining research community. This pap...
Article
Parkinson's disease (PD) is a common neurodegenerative disease characterized by progressive loss of midbrain dopaminergic neurons with unknown etiology. MPP+ (1-methyl-4-phenylpyridinium) is the active metabolite of the neurotoxin 1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine (MPTP), which induces Parkinson's-like syndromes in humans and animals. MP...
Article
Previous research works have presented convincing arguments that a frequent pattern mining algorithm should not mine all frequent patterns but only the closed ones because the latter leads to not only a more compact yet complete result set but also better efficiency. Upon discovery of frequent closed XML query patterns, indexing and caching can be effectively...
Article
Previous studies have presented convincing arguments that a frequent pattern mining algorithm should not mine all frequent patterns but only the closed ones because the latter leads to not only a more compact yet complete result set but also better efficiency. However, most of the previously developed closed pattern mining algorithms work under the...
Article
Full-text available
Due to the ability of graphs to represent more generic and more complicated relationships among different objects, graph mining has played a significant role in data mining, attracting increasing attention in the data mining community. In addition, frequent coherent subgraphs can provide valuable knowledge about the underlying internal structure...
Conference Paper
Full-text available
In this paper, we explore the discriminating subsequence-based clustering problem. First, several effective optimization techniques are proposed to accelerate the sequence mining process and a new algorithm, CONTOUR, is developed to efficiently and directly mine a subset of discriminating frequent subsequences which can be used to cluster the...
Article
3'-Azido-3'-deoxythymidine (AZT), a nucleoside analogue used for the treatment of acquired immunodeficiency syndrome (AIDS), induced a significant dose-related increase in the thymidine kinase (Tk) mutant frequency (MF) in L5178Y/Tk(+/-) 3.7.2C mouse lymphoma cells. Treatment with 1 mg/ml (3,742 µM) AZT for 24 hr resulted in a MF of 407 × 10⁻⁶ c...
Chapter
Full-text available
Large volumes of dynamic stream data pose great challenges to its analysis. Besides its dynamic and transient behavior, stream data has another important characteristic: multi-dimensionality. Much of stream data resides at a multidimensional space and at rather low level of abstraction, whereas most analysts are interested in relatively high-level...
Article
The mouse lymphoma assay (MLA) is the most widely used in vitro mammalian gene mutation assay. It detects various mutation events involving the thymidine kinase (Tk) gene in L5178Y/Tk+/- -3.7.2C mouse lymphoma cells. Mutants are detected using a thymidine analogue that arrests the growth of cells containing a functional Tk gene. However, there are...
Conference Paper
Full-text available
In this paper, we study the problem of effective keyword search over XML documents. We begin by introducing the notion of Valuable Lowest Common Ancestor (VLCA) to accurately and effectively answer keyword queries over XML documents. We then propose the concept of Compact VLCA (CVLCA) and compute the meaningful compact connected trees rooted as...
Conference Paper
Full-text available
XML has become a popular method of data representation both on the web and in databases in recent years. One of the reasons for the popularity of XML has been its ability to encode structural information about data records. However, this structural characteristic of data sets also makes it a challenging problem for a variety of data mining proble...
Conference Paper
Full-text available
Existing studies for mining frequent XML query patterns mainly introduce a straightforward candidate generate-and-test strategy and compute frequencies of candidate query patterns from scratch periodically by checking the entire transaction database, which consists of XML query patterns transformed from user queries. However, it is nontrivial to ma...
Conference Paper
In this paper, we propose a novel and general approach for time-series data mining. As an alternative to traditional ways of designing a specific algorithm to mine a certain kind of pattern directly from the data, our approach extracts the temporal structure of the time-series data by learning Markovian models, and then uses well established methods...
Article
Full-text available
As OLAP engines are widely used to support multidimensional data analysis, it is desirable to support in data cubes advanced statistical measures, such as regression and filtering, in addition to the traditional simple measures such as count and average. Such new measures will allow users to model, smooth, and predict the trends and patterns of dat...
Article
Full-text available
Many studies have shown that rule-based classifiers perform well in classifying categorical and sparse high-dimensional databases. However, a fundamental limitation with many rule-based classifiers is that they find the rules by employing various heuristic methods to prune the search space and select the rules based on the sequential database cover...
Article
Mining knowledge about ordering from sequence data is an important problem with many applications, such as bioinformatics, Web mining, network management, and intrusion detection. For example, if many customers follow a partial order in their purchases of a series of products, the partial order can be used to predict other related customers' future...
Article
Full-text available
The Thymidine kinase (Tk) mutants generated from the widely used L5178Y mouse lymphoma assay fall into two categories, small colony and large colony. Cells from the large colonies grow at a normal rate while cells from the small colonies grow slower than normal. The relative proportion of large and small colonies after mutagen treatment is associat...
Article
Incorporating constraints into frequent itemset mining not only improves data mining efficiency, but also leads to concise and meaningful results. In this paper, a framework for closed constrained gradient itemset mining in retail databases is proposed by introducing the concept of gradient constraint into closed itemset mining. A tailored version...
Conference Paper
Full-text available
Most previously proposed frequent graph mining algorithms are intended to find the complete set of all frequent, closed subgraphs. However, in many cases only a subset of the frequent subgraphs with a certain topology is of special interest. Thus, the method of mining the complete set of all frequent subgraphs is not suitable for mining these frequ...
Article
Full-text available
Current models of the classification problem do not effectively handle bursts of particular classes coming in at different times. In fact, the current model of the classification problem simply concentrates on methods for one-pass classification modeling of very large data sets. Our model for data stream classification views the data stream classifi...
Article
Full-text available
Frequent itemset mining was initially proposed and has been studied extensively in the context of association rule mining. In recent years, several studies have also extended its application to transaction or document clustering. However, most of the frequent itemset based clustering algorithms need to first mine a large intermediate set of frequen...
Article
The L5178Y/Tk+/- -3.7.2C mouse lymphoma cell line is characterized, at the cytogenetic level, by a karyotype involving both numerical and complex structural aberrations. While the karyotype is remarkably normal for a transformed cell line that has been in culture for almost half a century, there are a number of chromosomal alterations that because...
Conference Paper
Speeding up query evaluation in large XML repositories becomes a challenging and all-important problem with vast XML-related applications arising. In this paper, we present SCALER, an efficient algorithm for XML query answering based on UDFTS and effective twig structure matching scheme. UDFTS not only constructs a one-to-one correspondence between...
Conference Paper
The advances of video technology and video-related applications demand appropriate video semantic models for representing video data and their semantics, and supporting powerful semantic queries on them. In this paper, we propose such a model named SemTTE. The model incorporates features of temporal structure and typed events of video contents. It...
Conference Paper
Full-text available
Speeding up query evaluation in large XML repositories becomes a challenging and all-important problem with vast XML-related applications arising. Upon discovery of hot XML query patterns, indexing and caching can be effectively adopted for query performance enhancement. Previous algorithms for finding hot query patterns basically introduced a stra...
Conference Paper
Full-text available
Frequent coherent subgraphs can provide valuable knowledge about the underlying internal structure of a graph database, and mining frequently occurring coherent subgraphs from large dense graph databases has witnessed several applications and received considerable attention in the graph mining community recently. In this paper, we study how...
Conference Paper
Full-text available
Mining ordering information from sequence data is an important data mining task. Sequential pattern mining (Agrawal and Srikant, 1995) can be regarded as mining frequent segments of total orders from sequence data. However, sequential patterns are often insufficient to concisely capture the general ordering information.
Article
Full-text available
Real-time surveillance systems, telecommunication systems, and other dynamic environments often generate tremendous (potentially infinite) volume of stream data: the volume is too huge to be scanned multiple times. Much of such data resides at rather low level of abstraction, whereas most analysts are interested in relatively high-level dynamic cha...
Article
Frequent itemset mining has been studied extensively in literature. Most previous studies require the specification of a min_support threshold and aim at mining a complete set of frequent itemsets satisfying min_support. However, in practice, it is difficult for users to provide an appropriate min_support threshold. In addition, a complete set of f...
Article
Full-text available
The data stream problem has been studied extensively in recent years, because of the great ease in collection of stream data. The nature of stream data makes it essential to use algorithms which require only one pass over the data. Recently, single-scan, stream analysis methods have been proposed in this context. However, a lot of stream data is hi...
Conference Paper
Full-text available
Many studies have shown that rule-based classifiers perform well in classifying categorical and sparse high-dimensional databases. However, a fundamental limitation with many rule-based classifiers is that they find the rules by employing various heuristic methods to prune the search space, and select the rules based on the sequential database c...
Article
Full-text available
Many studies have shown that rule-based classification algorithms perform well in classifying categorical and sparse high-dimensional databases. However, a fundamental limitation with many rule-based classifiers is that they find the classification rules in a coarse-grained manner. They usually use heuristic methods to prune the search space, and sele...
Conference Paper
The data stream problem has been studied extensively in recent years, because of the great ease in collection of stream data. The nature of stream data makes it essential to use algorithms which require only one pass over the data. Recently, single-scan, stream analysis methods have been proposed in this context. However, a lot of stream data is hi...
Article
Sequential pattern mining is an important data mining problem with broad applications. However, it is also a difficult problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Most of the previously developed sequential pattern mining methods, such as GSP, explore a candidate generati...
Conference Paper
Full-text available
Frequent itemset mining was initially proposed and has been studied extensively in the context of association rule mining. In recent years, several studies have also extended its application to the transaction (or document) classification and clustering. However, most of the frequent-itemset based clustering algorithms need to first mine a large in...
Article
Full-text available
The mouse lymphoma L5178Y Tk+/- 3.7.2C assay is a well-characterized in vitro system used for the study of somatic cell mutation. It was determined that this cell line has a heterozygous mutation in exon 5 of Trp53. Based on this assumption that the cell line is heterozygous for the Trp53 gene, it was postulated that the small colony thymidine kina...
Article
Full-text available
Previous study has shown that mining frequent patterns with length-decreasing support constraint is very helpful in removing some uninteresting patterns based on the observation that short patterns will tend to be interesting if they have a high support, whereas long patterns can still be very interesting even if their support is relatively low. Ho...
Article
Full-text available
In recent years, various constrained frequent pattern mining problem formulations and associated algorithms have been developed that enable the user to specify various itemset-based constraints that better capture the underlying application requirements and characteristics. In this paper we introduce a new class of block constraints that determine t...
Conference Paper
Previous studies have presented convincing arguments that a frequent pattern mining algorithm should not mine all frequent patterns but only the closed ones because the latter leads to not only more compact yet complete result set but also better efficiency. However, most of the previously developed closed pattern mining algorithms work under the c...
Conference Paper
Full-text available
Current models of the classification problem do not effectively handle bursts of particular classes coming in at different times. In fact, the current model of the classification problem simply concentrates on methods for one-pass classification modeling of very large data sets. Our model for data stream classification views the data stream classif...
Conference Paper
This chapter discusses a framework for clustering evolving data streams. The clustering problem is a difficult problem for the data stream domain. This is because the large volumes of data arriving in a stream render most traditional algorithms too inefficient. In recent years, a few one-pass clustering algorithms have been developed for the data s...
Article
Mining frequent closed itemsets provides complete and nonredundant results for frequent pattern analysis. Extensive studies have proposed various strategies for efficient frequent closed itemset mining, such as depth-first search vs. breadth-first search, vertical formats vs. horizontal formats, tree structure vs. other data structures, top-down vs....
Article
Full-text available
The clustering problem is a difficult problem for the data stream domain. This is because the large volumes of data arriving in a stream render most traditional algorithms too inefficient. In recent years, a...
Article
In this paper, we propose a new mining task: mining top-k frequent closed patterns of length no less than min_l, where k is the desired number of frequent closed patterns to be mined, and min_l is the minimal length of each pattern. An efficient algorithm, called TFP, is developed for mining such patterns without minimum support. Two methods, close...
Article
This chapter illustrates methods for online, multidimensional regression analysis of time-series stream data. Real-time production systems and other dynamic environments often generate tremendous (potentially infinite) amount of stream data; the volume of data is too huge to be stored on disks or scanned multiple times. With years of research and d...
Article
This paper presents an architectural design and evaluation result of an efficient Web-crawling system. The design involves a fully distributed architecture, a URL allocating algorithm, and a method to assure system scalability and dynamic reconfigurability. Simulation experiment shows that load balance, scalability and efficiency can be achieved in...
Conference Paper
In this paper, we propose a new mining task: mining top-k frequent closed patterns of length no less than min_ℓ, where k is the desired number of frequent closed patterns to be mined, and min_ℓ is the minimal length of each pattern. An efficient algorithm, called TFP, is developed for mining such patterns without minimum support. Two methods, close...
Article
Traditional distributed file systems do not provide clusters with a strict single-system image, and cannot fully meet cluster application requirements, such as I/O performance, scalability, reliability, and availability. COSMOS is a scalable single-image file system designed for the Dawning2000 superserver, a typical cluster system. This paper discu...
Conference Paper
Data cube enables fast online analysis of large data repositories which is attractive in many applications. Although there are several kinds of available cube-based OLAP products, users may still encounter challenges on effectiveness and efficiency in the exploration of large data cubes due to the huge computation space as well as the huge observat...
Conference Paper
Full-text available
Real-time production systems and other dynamic environments often generate tremendous (potentially infinite) amount of stream data; the volume of data is too huge to be stored on disks or scanned multiple times. Can we perform on-line, multi-dimensional analysis and data mining of such data to alert people about dramatic changes of situations and t...
Conference Paper
Full-text available
Real-time surveillance systems and other dynamic environments often generate tremendous (potentially infinite) volume of stream data: the volume is too huge to be scanned multiple times. However, much of such data resides at rather low level of abstraction, whereas most analysts are interested in dynamic changes (such as trends and outliers) at...
Conference Paper
Data cube enables fast online analysis of large data repositories which is attractive in many applications. Although there are several kinds of available cube-based OLAP products, users may still encounter challenges on effectiveness and efficiency in the exploration of large data cubes due to the huge computation space as well as the huge observat...
Article
This paper first studies the distribution characteristics of user behaviors based on log data from a massive web search engine. Analysis shows that the stochastic distribution of user queries accords with the characteristics of a power-law function and exhibits strong similarity, and the user’s queries and clicked URLs present dramatic locality,...
Conference Paper
A Web search engine is a powerful tool to find useful information for users on the resourceful World Wide Web. WebGather is a search engine system with a focus on Chinese information discovery, indexing and searching. The authors briefly describe the technology used in WebGather such as heuristic resource discovery algorithm, efficient indexing alg...
Article
Full-text available
Frequent itemset mining was initially proposed and has been studied extensively in the context of association rule mining. In recent years, several studies have also extended its application to transaction or document clustering. However, most of the frequent itemset based clustering algorithms need to first mine a large intermediate set of frequen...
