Baile Shi's research while affiliated with Fudan University and other places

Publications (105)

Article
Data exchange is the problem of taking data structured under a source schema and creating an instance of a target schema, by following a mapping between the two schemas. There is a rich literature on problems related to data exchange, e.g., the design of a schema mapping language, the consistency of schema mappings, operations on mappings, and quer...
Conference Paper
Serial episode mining is one of the hot spots in temporal data mining with broad applications such as user-browsing behavior prediction, telecommunication alarm analysis, road traffic monitoring, and root cause diagnostics from fault log data in manufacturing. In this paper, as a step forward to analyzing patterns within an event sequence, we propose...
Article
Due to the promising features of web services, their deployment and research are booming. Among them, various techniques for web service composition have been developed. In this paper, we propose a new composition framework. We use automata to describe behaviors of web services. Each of underlying web services can interact with others through async...
Conference Paper
Stream prediction based on episode rules of the form "whenever a series of antecedent event types occurs, another series of consequent event types appears eventually" has received intensive attention due to its broad applications such as reading sequence forecasting, stock trend analyzing, road traffic monitoring, and software fault preventing. Many...
Conference Paper
Marginal publication is one of important techniques to help researchers to improve the understanding about correlation between published attributes. However, without careful treatment, it’s of high risk of privacy leakage for marginal publications. Solution like ANGEL has been available to eliminate such risks of privacy leakage. But, unfortunately...
Conference Paper
Anatomy is a popular technique for privacy preserving in data publication. However, anatomy is fragile under background knowledge attack and can only be applied into limited applications. To overcome these drawbacks, we develop an improved version of anatomy: permutation anonymization, a new anonymization technique that is more effective than anato...
Conference Paper
Blogs have become an important media of self-expression recently. Millions of people write blog posts, share their interests, give suggestions and form groups in blogspace. An important way to understand the development of blogspace is to identify topic experts as well as blog communities and to further find how they interact with each other. Topic...
Article
When data sources are virtually integrated, there is no common and centralized method to maintain global consistency, so inconsistencies with regard to global integrity constraints are very likely to occur. In this paper, we consider the problem of defining and computing consistent query answers when queries are posed to virtual XML data integratio...
Conference Paper
Frequent serial episodes within an event sequence describe the behavior of users or systems about the application. Existing mining algorithms calculate the frequency of an episode based on overlapping or non-minimal occurrences, which is prone to over-counting the support of long episodes or poorly characterizing the followed-by-closely relationshi...
Conference Paper
We consider the problem of using query transformation to compute consistent answers when queries are posed to virtual XML data integration systems, which are specified following the local-as-view approach. This is achieved in two steps. First the given query is transformed to a new query with global constraints considered, then the new query is rew...
Article
HMM (Hidden Markov model) has been used successfully to analyze various types of time series. To fit time series with HMM, the number of hidden states should be determined before learning other parameters, since it has great impact on the complexity and precision of the fitting HMM. However this becomes too difficult when there is not enough prior...
Conference Paper
Exploring social media resources, such as Flickr and Wikipedia to mitigate the difficulty of semantic gap has attracted much attention from both academia and industry. In this paper, we first propose a novel approach to derive semantic correlation matrix from Flickr's related tags resource. We then develop a novel conditional random field model for...
Conference Paper
The correlation between keywords has been exploited to improve Automatic Image Annotation (AIA). Differing from the traditional lexicon or training data based keyword correlation estimation, we propose using Web-scale image semantic space learning to explore the keyword correlation for automatic Web image annotation. Specifically, we use the Social...
Article
In recent years, evaluating graph distance has become more and more important in a variety of real applications and many graph distance measures have been proposed. Among all of those measures, structure-based graph distance measures have become the research focus due to their independence of the definition of cost functions. However, existing stru...
Conference Paper
Information sharing has become more frequent and easier than before. However, it also brings serious threats to individuals' privacy. There is no doubt that sharing personal data can cause privacy breaches. Moreover, sharing the knowledge discovered by data mining may also pose threats to personal privacy. In this paper, we consider the anonymity...
Conference Paper
In recent years, the wireless sensor network (WSN) has been employed in a wide range of applications. But existing communication protocols for WSN ignore the characteristics of collected data and set routes only according to the mutual distance and residual energy of sensors. In this paper we propose a Data-Aware Clustering Hierarchy (DACH), which organizes...
Conference Paper
In this paper, we study the problem of making use of target constraints to integrate XML data from different sources under a target schema. We recognize that target constraints are necessary in data integration, as the constraints are essential part of data semantics, and should be satisfied by integrated data. When integrating data from multiple d...
Conference Paper
Search engine technology plays an important role in Web information retrieval. However, with Internet information explosion, traditional searching techniques cannot provide satisfactory result due to problems such as huge number of result Web pages, unintuitive ranking, etc. Therefore, the reorganization and post-processing of Web search results ha...
Article
Unstructured P2P networks dominate in practice due to their small maintenance overhead. However, the high volume of search traffic threatens its continued growth. The focus of this paper is to study how to improve the search efficiency in a non-DHT P2P network without a distributed indexing structure. We identify possible performance problems in Ka...
Conference Paper
The problem of processing XML streaming data fits a large class of new applications and has been a hot spot recently. The XML streams on the World Wide Web are usually autonomous, distributed and heterogeneous. All existing methods, however, do not support filtering a collection of XML streams from heterogeneous sources. In this paper, we present a...
Conference Paper
In this paper, we present an approach to process XML stream based on keyword search. We define the keyword search semantic as Meaning Independent Smallest Lowest Common Ancestors (MISLCAs) and query results as Meaning Independent Minimum Connecting Trees (MIMCTs), provide a user-friendly query interface based on keywords and a little XML schema kno...
Article
Application-level multicast is a promising alternative to IP multicast due to its independence from the IP routing infrastructure and its flexibility in constructing the delivery trees. The existing overlay multicast systems either support a single data source or have high maintenance overhead when multiple sources are allowed. They are ineffi...
Article
Many applications track streaming data for actionable alerts, which may include, for example, network intrusions, transaction frauds, bio-surveillance abnormalities, and so forth. Some stream classification models are built for this purpose. Due to concept drifts, maintaining a model's up-to-dateness has become one of the most challenging tasks in m...
Article
XML document may contain inconsistencies that violate predefined integrity constraints, which causes the data inconsistency problem. In this paper, we consider how to get the consistent data from an inconsistent XML document. There are two basic concepts for this problem: Repair is the data consistent with the integrity constraints, and also minima...
Conference Paper
Active learning is a promising tool to improve the performance of content-based image retrieval (CBIR). As a commonly used active learning approach, angle-diversity provides the most informative images to user for feedback. However, it suffers from the problem that the query concept is diverse and the numbers of the positive and the negative images...
Conference Paper
Automatic image annotation automatically labels image content with semantic keywords. For instance, the Relevance Model estimates the joint probability of the keyword and the image (3). Most of the previous annotation methods assign keywords separately. Recently the correlation between annotated keywords has been used to improve image annotat...
Conference Paper
An XML document is inconsistent if it violates predefined integrity constraints. In this paper, we consider how to compute repairs for an inconsistent XML document. Here repair is defined as the data consistent with the integrity constraints, and also minimally differs from the original document. Based on a repair framework by introducing a chase m...
Conference Paper
Some of the knowledge discovered by data mining may contain sensitive information, which should be hidden before sharing the result of data mining. In this paper, we consider that the knowledge for sharing is discovered by frequent pattern mining, and some of the frequent patterns are private, which cannot be shared. Our problem of privacy-preservi...
Conference Paper
The knowledge discovered by frequent pattern mining is represented in the form of a collection of frequent patterns with their supports. Sharing the frequent patterns without discrimination may bring threats against privacy and security, because some of frequent patterns themselves may be sensitive and should not be disclosed. Furthermore, due to t...
Conference Paper
The knowledge discovered by data mining may contain sensitive information, which may cause potential threats towards privacy and security. In this paper, we address the problem of better preserving private knowledge by proposing an Item-based Pattern Sanitization to prevent the disclosure of private patterns. We also present two strategies to gener...
Conference Paper
An avalanche of data available in the stream form is overstretching our data analyzing ability. In this paper, we propose a novel load shedding method that enables fast and accurate stream data classification. We transform input data so that its class information concentrates on a few features, and we introduce a progressive classifier that makes p...
Article
As the number of available Web services is steadily increasing, there is a growing interest for reusing basic Web services in new, composite Web services. However, most current Web services choreography proposals, such as BPEL4WS or WSCI, need a fixed execution flow previously designed by human, thus the adaptability of Web services can not be full...
Article
Privacy becomes a more and more serious concern in applications involving microdata. Recently, efficient anonymization has attracted much research work. Most of the previous methods use global recoding, which maps the domains of the quasi-identifier attributes to generalized or changed values. However, global recoding may not always achieve effecti...
Conference Paper
Time series similarity search is of growing importance in many applications. Wavelet transforms are used as a dimensionality reduction technique to permit efficient similarity search over high-dimensional time series data. This paper proposes the tight upper and lower bounds on the estimation distance using wavelet transform, and we show that the t...
Conference Paper
Data broadcast is an efficient method of disseminating information in a wireless environment to a large number of clients with mobile devices. It requires the clients to be actively listening to the broadcast channels for their interested information, which increases the battery consumption. However, up to now the battery of the mobile devices is s...
Conference Paper
In recent years, traditional computing systems face the problems of scalability as the need for information processing services is ever increasing. Grid, as a pool of computing resources, solves the problem in some degree by providing an integrated computing and resources environment. Thus, there emerge many strategies related to grid resource allo...
Conference Paper
Similarity search is of importance in many new database applications. These operations can generally be referred as similarity search in metric space. In this paper, a new index construction algorithm is proposed for similarity search in metric space. The new data structure, called bu-tree (bottom-up tree), is based on constructing the index tree f...
Conference Paper
XML document may contain inconsistencies that violate predefined integrity constraints, and there are two basic concepts for this problem: Repair is the data consistent with the integrity constraints, and also minimally differs from the original one. Consistent data is the data common for every possible repair. In this paper, first we give a genera...
Conference Paper
Time series data naturally arise in many application domains, and the similarity search for time series under dynamic time shifting is prevailing. But most recent research focused on the full length similarity match of two time series. In this paper a basic subsequence similarity search algorithm based on dynamic programming is proposed. For a give...
Conference Paper
Incremental ETL processes are used for the incremental maintenance of data warehouses, which is generally designed by users with ETL tools. Using existing methods of incremental maintenance of materialized views for reference, we put forward an approach to generate an incremental ETL process automatically from the full ETL process in this paper. Ex...
Conference Paper
Time and Space complexity is a critical factor for a successful stream-based continuous query processing system. In this paper, we address the optimization of complex XML queries over XML Streams by using the semantic and structural constraints of its DTD. The optimization is preprocessed before the runtime of matching XML stream against user's qu...
Conference Paper
In many real world applications, with the databases frequent insertions and deletions, the ability of a data mining technique to detect and react quickly to dynamic changes in the data distribution and clustering over time is highly desired. Data summarizations (e.g., data bubbles) have been proposed to compress large databases into representative...
Conference Paper
Wireless sensor networks have emerged recently as an effective way of gathering useful information from areas of interest. Prolonging the network lifetime has become the primary concern in data gathering due to the limited battery power of sensors. An underlying assumption of most existing work is that all the sensors are working simultaneously dur...
Conference Paper
Currently technologies of Web services have greatly developed and provided a new application platform for CSCW. For a business process across enterprises, cooperation of multiple Web services is usually required to achieve the final goal. However, most current Web services choreography proposals, such as BPEL4WS or WSCI, only provide a fixed execut...
Conference Paper
Data may contain inconsistencies that violate integrity constraints, the consistent query answering problem attempts to find answers common for every possible repair. In this paper, we study how to handle the inconsistent XML document, which conforms to the DTD, while violates constraints. We consider three types of constraints, including functiona...
Conference Paper
Keyword search is an effective approach for most users to search for information because they do not need to learn complex query languages or the underlying structures of the data. This paper focuses on effective keyword search in XML documents which are modeled as labeled trees. We first analyze the problems caused by the refinement of result gran...
Conference Paper
Emerging web services standards enable the development of large-scale applications in open environments. With the increasingly emerging web services, one of the main problems is not to find out the required services, but to select the optimum one from a set of requirements-satisfying services. In this paper, we propose a workflow-organization-model...
Conference Paper
Image retrieval has found more and more applications. Due to the well recognized semantic gap problem, the accuracy and the recall of image similarity search are often still low. As an effective method to improve the quality of image retrieval, the relevance feedback approach actively applies users' feedback to refine the search. As searching a lar...
Conference Paper
Web services composition is a key issue in web service research area. Substitution of service is closely related with composition and important to robustness of service composition. In this paper, we use process algebra as formalism foundation modeling and specifying web services and reasoning on behavioral features of web services composition. We...
Conference Paper
Privacy becomes a more and more serious concern in applications involving microdata. Recently, efficient anonymization has attracted much research work. Most of the previous methods use global recoding, which maps the domains of the quasi-identifier attributes to generalized or changed values. However, global recoding may not always achieve eff...
Conference Paper
In the research of transaction management of mobile database, concurrency control is one of the key problems. In a high-quality mobile environment, the given approaches to concurrency control, i.e. 2PL and OCC, produced either too much blocks or validating failure. Based on our previous achievement, ASGT (Active Serialization Graph Technique), we p...
Conference Paper
Many applications use classification models on streaming data to detect actionable alerts. Due to concept drifts in the underlying data, how to maintain a model's up-to-dateness has become one of the most challenging tasks in mining data streams. State of the art approaches, including both the incrementally updated classifiers and the ensemble clas...
Conference Paper
Model-based clustering is one of the most important ways for time series data mining. However, the process of clustering may encounter several problems. In this paper, a novel clustering algorithm of time-series which incorporates recursive hidden Markov model (HMM) training is proposed. Our contributions are as follows: 1) We recursively train mode...
Conference Paper
Probabilistic latent semantic analysis (pLSA) is a powerful statistical technique to analyze relation between factors in dyadic data. Although various pLSA-based applications, ranging from information retrieval, information filtering, to text-mining and visualization, have been successfully conducted, they can not afford dynamic revising of model w...
Conference Paper
Due to the promising features of Web services, their deployment and research are booming. Among them, various techniques for Web service composition have been developed. In this paper, we propose a new composition framework. We use automata to describe behaviors of Web services. Each of underlying Web services can interact with others through async...
Conference Paper
The growth of bioinformatics has resulted in datasets with new characteristics. The DNA sequences typically contain a large number of items. From them biologists assemble a whole genome of species based on frequent concatenate sequences, which ordinarily have hundreds of items. Such datasets pose a great challenge for existing frequent pattern disc...
Conference Paper
This paper studies the XML storage in relations. Unlike traditional techniques, it considers the semantics expressed by functional dependencies. We propose an algorithm for mapping DTD to relational schema, which preserves not only the content and structure but also the semantics of original XML documents. To tackle the problem of constraints expre...
Conference Paper
The compatibility analysis is absolutely necessary for guaranteeing the correct composition of Web services, no matter what styles the composition takes, statically or dynamically. In this paper, we provide a formalization of Web services behavior using the approach of automata. With this understanding, we propose a definition of role among Web ser...
Conference Paper
It is evidenced that formal analyses are helpful for web services interactions. However, most current web services choreography proposals, such as BPEL4WS or WSCI, only provide notations for describing the message flows in web service collaboration, lacking of reasoning mechanisms to verify the process of interacting among them. In this paper, we p...
Conference Paper
In mobile computing environments, the continuous range query is one of the most important types of queries required to support various location-based services. With a large number of queries, the real-time response to query answers and the concurrent execution are two major challenges. In this paper, we propose a novel approach cGridex for efficien...
Conference Paper
Compatibility of Web services states the fitness of service peers that interact each other. It covers both static properties and dynamic behavior of Web services. Most researches deal with compatibility issue in the context of static checking. In this paper, we use a formal method, say CCS, to describe dynamic behavior of Web services. The formaliz...
Conference Paper
To preserve private information while providing thorough analysis is one of the significant issues in OLAP systems. One of the challenges in it is to prevent inferring the sensitive value through the more aggregated non-sensitive data. This paper presents a novel algorithm FMC to eliminate the inference problem by hiding additional data besides the...
Conference Paper
This paper presents a novel method of automatic image semantic annotation. Our approach is based on the Image-Keyword Document Model (IKDM) with image features discretization. According to IKDM, the image keyword annotation is conducted using image similarity measurement based on language model from text information retrieval domain. Throug...
Conference Paper
Automatic image annotation has attracted much attention recently, due to its wide applicability (such as image retrieval by semantics). Most of the known statistical model-based annotation methods learn the joint distribution of the keywords and the image blobs decomposed by segmentation or grid approaches. The effects of these methods suffer from...
Conference Paper
Privacy-preserving classification mining is one of the fast-growing sub-areas of data mining. How to perturb original data and then build a decision tree based on perturbed data is the key research challenge. By applying transition probability matrix this paper proposes a novel privacy-preserving classification mining algorithm which suits all dat...
Article
To combine XML with relations is a hotspot in research field. This paper studies the functional dependency and normalization propagation between relations and XML. First the paper gives the definition of functional dependencies and keys for XML; based on it, the concepts of redundancy and DTD normalization are defined. The paper then discusses the...
Conference Paper
Clustering is a common technique in data mining to discover hidden patterns from massive datasets. With the development of privacy-maintaining data mining application, clustering incomplete high-dimensional data has become more and more useful. Motivated by these limits, we develop a novel algorithm CLINCH, which could produce fine clusters...
Conference Paper
This paper proposes a query routing infrastructure that aims at the Web text information retrieval. The routing information is distributed in each query routing node, and needs no central infrastructure, so it can be used in large distributed system to determine which node need to be queried to achieve the function of query routing in semantic net...
Conference Paper
Wireless sensor networks are envisioned to be promising in gathering useful information from areas of interest. Due to the limited battery power of sensors, one critical issue in designing a wireless sensor network is to maximize its lifetime. Many efforts have been made to deal with this problem. However, most existing algorithms are not well opti...
Conference Paper
Currently, constraints are increasingly considered as a kind of means of user- or expert-control for filtering those unsatisfied and redundant patterns rapidly during the web mining process. Recent work has highlighted the importance of constraint-based mining paradigm in the context of frequent itemsets, sequences, and many other interesting patte...
Conference Paper
Mining frequent structural patterns from graph databases is an important research problem with broad applications. Recently, we developed an effective index structure, ADI, and efficient algorithms for mining frequent patterns from large, disk-based graph databases [5], as well as constraint-based mining techniques. The techniques have been integra...
Conference Paper
In many fields and applications, it is critical for users to make decisions through OLAP queries. How to promote accuracy and efficiency while answering multiple aggregate queries, e.g. COUNT, SUM, AVG, MAX, MIN and MEDIAN? It has been the urgent problem in the fields of OLAP and data summarization recently. There have been a few solutions such as...
Conference Paper
Sequencing genomes is a fundamental aspect of biological research. Shotgun sequencing, since introduced by Sanger et al [2], has remained the mainstay in the research field of genome sequence assembly. This method randomly obtains sequence reads (e.g. a subsequence including about 500 characters) from a genome and then assembles them into contigs...
Article
Set type is an important data type in object-oriented database system and object-relational database system. An index structure of set type Set_struc is presented in this paper. In Set_struc all sets are organized as a tree, and the sets with common prefix are merged. So the size of the index will be decreased for the data set with a large number o...
Article
This paper addresses the semi-structured query rewriting problem for TSL (tree specification language), a language for querying semi-structured data. An algorithm that can find the maximally-contained rewriting query is presented, when a semi-structured query and a set of semi-structured views are given. The idea is borrowed from MiniCon, a scalabl...
Conference Paper
Nowadays, semantic integration of data resources is becoming a hot topic. Many techniques have been developed to map XML sources to ontologies, but most of mapping processes are manually accomplished with the help of experts. In this paper, we propose a new method that automatically derives such mappings by using DTD information, i.e. mappings XML...
Conference Paper
A DTD or XML schema in its current textual form commonly lacks clarity and readability, therefore erroneous, poor quality design and usage are inevitable. A canonical conceptual model for XML documents will provide an effective mean of designing XML documents. The DTD is an early standard for XML and used in legacy systems widely. This paper presen...
Conference Paper
When solving problems such as enterprise information integration and interoperation across heterogeneous repositories, Web services and workflow are extensively used. Due to the nature of modularity, openness and encapsulation, traditional exception handling strategies can hardly meet the demands in the Web service setting. In this paper, we put fo...
Conference Paper
Service composition is a powerful tool to create new services rapidly by reusing existing ones. Previous research mainly focuses on the wired infrastructure-based environment. With the developments in mobile devices and wireless communication technology in recent years, mobile ad hoc network has received an increasing attention as a new communicati...
Conference Paper
Mining frequent tree patterns is an important research problem with broad applications in bioinformatics, digital library, e-commerce, and so on. Previous studies highly suggested that pattern-growth methods are efficient in frequent pattern mining. In this paper, we systematically develop the pattern growth methods for mining frequent tree pa...
Conference Paper
ROLAP is used to answer queries for analysis support on the data stored in the data warehouse. To fulfill this purpose rapidly and correctly, ROLAP always precompute some views in the data warehouse. However, selecting views to materialize is an NP-hard problem. Some previous works, such as the Greedy Algorithm, BPUS and PBS, have focused on it wit...