Chuan Xiao

  • PhD
  • Professor (Assistant) at Nagoya University

About

Publications: 76
Reads: 10,474
Citations: 2,423
Current institution
Nagoya University
Current position
  • Professor (Assistant)
Additional affiliations
October 2011 - present
Nagoya University
Position
  • Research Associate
August 2005 - February 2007
Northeastern University
Position
  • Research Assistant
Education
March 2007 - August 2010
UNSW Sydney
Field of study
  • Computer Science and Engineering
September 2001 - July 2005
Northeastern University
Field of study
  • Computer Science and Technology

Publications

Preprint
Full-text available
In this paper, we propose CKGAN, a novel generative adversarial network (GAN) variant based on an integral probability metrics framework with characteristic kernel (CKIPM). CKIPM, as a distance between two probability distributions, is designed to optimize the lower bound of the maximum mean discrepancy (MMD) in a reproducing kernel Hilbert space, a...
Article
Full-text available
Benchmarking is crucial for evaluating a DBMS, yet existing benchmarks often fail to reflect the varied nature of user workloads. As a result, there is increasing momentum toward creating databases that incorporate real-world user data to more accurately mirror business environments. However, privacy concerns deter users from directly sharing their...
Article
Existing what-if analysis systems are predominantly tailored to operate on either only the application layer or only the database layer of software. This isolated approach limits their effectiveness in scenarios where intensive interaction between applications and database systems occurs. To address this gap, we introduce Ultraverse, a what-if anal...
Chapter
This chapter considers a transaction management problem in collaborative data management. We propose a new approach corresponding to a distributed version of conservative two-phase lock (C2PL). This approach is efficient when the contention rate is high, because it effectively prevents distributed deadlock. First, we introduce two notions to expres...
Chapter
This chapter discusses various architectures for data integration among multiple database servers, which we refer to as peers. Specifically, we emphasize an expanded concept of data integration, termed Bidirectional Collaborative Data Management, wherein multiple peers exchange database updates via bidirectional updatable views. We provide an overv...
Preprint
In enterprise data pipelines, data insertions occur periodically and may impact downstream services if data quality issues are not addressed. Typically, such problems can be investigated and fixed by on-call engineers, but locating the cause of such problems and fixing errors are often time-consuming. Therefore, automatic data validation is a bette...
Preprint
In recent years, machine learning-based cardinality estimation methods have been replacing traditional methods. This change is expected to contribute to one of the most important applications of cardinality estimation, the query optimizer, to speed up query processing. However, none of the existing methods precisely estimate cardinalities when rel...
Article
Full-text available
Federated learning is a distributed machine learning approach that allows a single server to collaboratively build machine learning models with multiple clients without sharing datasets. Since data distributions may differ across clients, data heterogeneity is a challenging issue in federated learning. To address this issue, numerous federated lear...
Preprint
Full-text available
A retroactive operation is an operation that changes a past operation in a series of committed ones (e.g., cancelling the past insertion of '5' into a queue committed at t=3). Retroactive operations have many important security applications such as attack recovery or private data removal (e.g., for GDPR compliance). While prior efforts designed retroac...
Preprint
Full-text available
Federated learning is a distributed machine learning approach in which a single server and multiple clients collaboratively build machine learning models without sharing datasets on clients. A challenging issue of federated learning is data heterogeneity (i.e., data distributions may differ across clients). To cope with this issue, numerous federat...
Preprint
Full-text available
Computational notebook software such as Jupyter Notebook is popular for data science tasks. Numerous computational notebooks are available on the Web and reusable; however, searching for computational notebooks manually is a tedious task, and so far, there are no tools to search for computational notebooks effectively and efficiently. In this paper...
Chapter
Full-text available
Along with the continuous evolution of data management systems for the new market requirements, we are moving from centralized systems towards decentralized systems, where data are maintained in different sites with autonomous storage and computation capabilities. There are two fundamental issues with such decentralized systems: local privacy and g...
Preprint
Full-text available
Federated learning is a distributed machine learning method in which a single server and multiple clients collaboratively build machine learning models without sharing datasets on clients. Numerous methods have been proposed to cope with the data heterogeneity issue in federated learning. Existing solutions require a model architecture tuned by the...
Article
In this paper, we address a similarity search problem for spatial trajectories in road networks. In particular, we focus on the subtrajectory similarity search problem, which involves finding in a database the subtrajectories similar to a query trajectory. A key feature of our approach is that we do not focus on a specific similarity function; inst...
Article
Similarity query processing has been an active research topic for several decades. It is an essential procedure in a wide range of applications. Recently, embedding and auto-encoding methods as well as pre-trained models have gained popularity. They basically deal with high-dimensional data, and this trend brings new opportunities and challenges to...
Article
Full-text available
Query autocompletion is an important feature that saves users many keystrokes by sparing them from typing the entire query. In this paper, we study the problem of query autocompletion that tolerates errors in users’ input using edit distance constraints. Previous approaches index data strings in a trie, and continuously maintain all the prefixes of data strings whose...
Preprint
In this paper, we address a similarity search problem for spatial trajectories in road networks. In particular, we focus on the subtrajectory similarity search problem, which involves finding in a database the subtrajectories similar to a query trajectory. A key feature of our approach is that we do not focus on a specific similarity function; inst...
Preprint
Selectivity estimation aims at estimating the number of database objects that satisfy a selection criterion. Answering this problem accurately and efficiently is essential to applications, such as density estimation, outlier detection, query optimization, and data integration. The estimation problem is especially challenging for large-scale high-di...
Preprint
Due to the outstanding capability of capturing underlying data distributions, deep learning techniques have been recently utilized for a series of traditional database problems. In this paper, we investigate the possibilities of utilizing deep learning for cardinality estimation of similarity selection. Answering this problem accurately and efficie...
Article
Code completion is a traditionally popular feature for API access in integrated development environments (IDEs). It not only frees programmers from remembering specific details about an API but also saves keystrokes and corrects typographical errors. Existing methods for code completion usually suggest APIs based on statistics in code bases described...
Conference Paper
Full-text available
Query autocompletion (QAC) is an important interactive feature that assists users in formulating queries and saving keystrokes. Due to the convenience it brings to users, QAC has been adopted in many applications, including Web search engines, integrated development environments (IDEs), and mobile devices. For existing QAC methods, users have to manu...
Article
Full-text available
As big data attracts attention in a variety of fields, research on data exploration for analyzing large-scale scientific data has gained popularity. To support exploratory analysis of scientific data, effective summarization and visualization of the target data as well as seamless cooperation with modern data management systems are in demand. In th...
Article
A similarity search in Hamming space finds binary vectors whose Hamming distances are no more than a threshold from a query vector. It is a fundamental problem in many applications, such as image retrieval, near-duplicate Web page detection, and scientific databases. State-of-the-art approaches to answering such queries are mainly based on the pigeo...
Article
In this article, we propose a novel indexing and querying method for trajectories constrained in a road network. We aim to provide efficient algorithms for various types of spatiotemporal queries that involve routing in road networks, such as (1) finding moving objects that have traveled along a given path during a given time interval, (2) extracti...
Chapter
Full-text available
To analyze scientific data, there are frequent demands for comparing multiple datasets on the same subject to detect any differences between them. For instance, comparison of observation datasets in a certain spatial area at different times or comparison of spatial simulation datasets with different parameters is considered to be important. Theref...
Article
Full-text available
The pigeonhole principle states that if n items are contained in m boxes, then at least one box has no more than n/m items. It is utilized to solve many data management problems, especially for thresholded similarity searches. Despite many pigeonhole principle-based solutions proposed in the last few decades, the condition stated by the principle i...
Preprint
The pigeonhole principle states that if $n$ items are contained in $m$ boxes, then at least one box has no more than $n / m$ items. It is utilized to solve many data management problems, especially for thresholded similarity searches. Despite many pigeonhole principle-based solutions proposed in the last few decades, the condition stated by the pri...
Article
Full-text available
Graphs are widely used to model complex data in many applications, such as bioinformatics, chemistry, social networks, and pattern recognition. A fundamental and critical query primitive is to efficiently search similar structures in a large collection of graphs. This article mainly studies threshold-based graph similarity search with edit distance con...
Article
Full-text available
Query autocompletion is an important and practical technique when users want to search for desirable information. As mobile devices become more and more popular, one of the main applications is location-aware service, such as Web mapping. In this paper, we propose a new solution to location-aware query autocompletion. We devise a trie-based index s...
Conference Paper
How can we compress a data structure containing millions of vehicular trajectories enough to be processed by a small memory device, such as an in-vehicle navigation system or IoT device? Can we retrieve the trajectories along a given path from such a compressed representation? To address these questions, this paper considers a spatial index compres...
Preprint
In this paper, we present a compressed data structure for moving object trajectories in a road network, which are represented as sequences of road edges. Unlike existing compression methods for trajectories in a network, our method supports pattern matching and decompression from an arbitrary position while retaining a high compressibility with the...
Conference Paper
Full-text available
Presentation slide composition is an important job for knowledge workers. Many researchers propose slide generation and composition methods for presentation slides. A primary challenge of the proposed methods is to generate presentation slides automatically; users have no choice about the structure of the presentation, and cannot participate in the...
Conference Paper
Full-text available
With the growing popularity of electronic documents, replication can occur for many reasons. People may copy text segments from various sources and make modifications. In this paper, we study the problem of local similarity search to find partially replicated text. Unlike existing studies on similarity search which find entirely duplicated document...
Article
Full-text available
Query autocompletion has become a standard feature in many search applications, especially for search engines. A recent trend is to support error-tolerant autocompletion, which increases the usability significantly by matching prefixes of database strings and allowing a small number of errors. In this article, we systematically study the query...
Article
Full-text available
The problem of similarity search is a crucial task in many real-world applications such as multimedia databases, data mining, and bioinformatics. In this work, we investigate the similarity search on uncertain data modeled in Gaussian distributions. By employing Kullback-Leibler divergence (KL-divergence) to measure the dissimilarity between two Ga...
Article
Full-text available
Graphs are an increasingly popular way to model complex data, and single graphs are growing to massive sizes. Nonetheless, executing graph algorithms efficiently and at scale is surprisingly challenging. As a consequence, distributed programming frameworks have emerged to empower large graph processing. Pregel, as a popular computational mod...
Conference Paper
Full-text available
In this paper, we study the problem of discovering all movement patterns from semantic trajectory databases. We propose a two-step method to solve this problem efficiently. We first retrieve frequent movement patterns of categories from the transformed database of sequential categories, and then cluster dense trajectories in a growth-type way for a...
Conference Paper
Full-text available
Slide presentations have become a ubiquitous tool for business and educational purposes. Instead of starting from scratch, slide composers tend to make new presentation slides by browsing existing slides and reusing materials from them. In this paper, we investigate the problem of reused element detection in presentation slides. We develop respecti...
Conference Paper
Full-text available
Slide presentations have become a ubiquitous tool for business and educational purposes. Instead of starting from scratch, slide composers tend to make new presentation slides by reusing materials from existing slides. Understanding how slide elements are copied from one presentation file to another and how presentation files are related to each ot...
Article
Full-text available
Along with the emergence of massive graph-modeled data, it is of great importance to investigate graph similarity joins due to their wide applications for multiple purposes, including data cleaning and near-duplicate detection. This paper considers graph similarity joins with edit distance constraints, which return pairs of graphs such that their...
Conference Paper
Along with the emergence of massive graph-modeled data, it is of great importance to investigate graph similarity join due to its wide applications for multiple purposes, including data cleaning, near duplicate detection, etc. This paper considers graph similarity joins with edit distance constraints, which return pairs of graphs such that their ed...
Conference Paper
Similarity join of complex structures is an important operation in managing graph data. In this paper, we investigate the problem of graph similarity join with edit distance constraints. Existing algorithms extract substructures – either rooted trees or simple paths – as features, and transform the edit distance constraint into a weaker count filte...
Article
Probabilistic range query is an important type of query in the area of uncertain data management. A probabilistic range query returns all the data objects within a specific range from the query object with a probability no less than a given threshold. In this paper, we assume that each uncertain object stored in the database is associated with a mu...
Article
Graphs are widely used to model complicated data semantics in many applications in bioinformatics, chemistry, social networks, pattern recognition, etc. A recent trend is to tolerate noise arising from various sources such as erroneous data entries and find similarity matches. In this paper, we study graph similarity queries with edit distance cons...
Article
Graphs are widely used to model complex data in many applications, such as bioinformatics, chemistry, social networks, pattern recognition, etc. A fundamental and critical query primitive is to efficiently search similar structures in a large collection of graphs. This paper studies the graph similarity queries with edit distance constraints. Exis...
Conference Paper
Probabilistic range query is an important type of query in the area of uncertain data management. A probabilistic range query returns all the objects within a specific range from the query object with a probability no less than a given threshold. In this paper we assume that each uncertain object stored in the databases is associated with a multi-d...
Article
Full-text available
Similarity joins play an important role in many application areas, such as data integration and cleaning, record linkage, and pattern recognition. In this paper, we study efficient algorithms for similarity joins with an edit distance constraint. Currently, the most prevalent approach is based on extracting overlapping grams from strings and consid...
Article
Full-text available
Query autocompletion is an important feature that saves users many keystrokes by sparing them from typing the entire query. In this paper we study the problem of query autocompletion that tolerates errors in users' input using edit distance constraints. Previous approaches index data strings in a trie, and continuously maintain all the prefixes of data strings whose e...
Conference Paper
Full-text available
Driven by the increasing demands from applications such as data cleansing, integration, and bioinformatics, approximate string matching queries have gained much attention recently. In this paper, we present the design and implementation of a trie-based system which supports both string similarity search and join based on our recent work [23].
Conference Paper
Full-text available
Inverted indexes are the fundamental index for information retrieval systems. Due to the correlation between terms, inverted lists in the index may have substantial overlap and hence redundancy. In this paper, we propose a new approach that reduces the size of inverted lists while retaining time-efficiency. Our solution is based on merging inverted...
Conference Paper
Being a fundamental problem in managing graph data, subgraph exact all-matching enumerates all isomorphic matches of a query graph q in a large data graph G. The existing techniques focus on pruning non-promising data graph vertices against q. However, the reduction and sharing of intermediate matches have not received adequate attention. These two...
Article
Graphs are widely used to model complicated data semantics in many applications in bioinformatics, chemistry, social networks, pattern recognition, etc. A recent trend is to tolerate noise arising from various sources, such as erroneous data entry, and find similarity matches. In this paper, we study the graph similarity join problem that returns p...
Article
Given a collection of data objects, the skyline problem is to select the objects which are not dominated by any others. In this paper, we propose a new variation of the skyline problem, called the combination skyline problem. The goal is to find the fixed-size combinations of objects which are skyline among all possible combinations. Our problem i...
Article
With the increasing amount of data and the need to integrate data from multiple data sources, one of the challenging issues is to identify near-duplicate records efficiently. In this article, we focus on efficient algorithms to find a pair of records such that their similarities are no less than a given threshold. Several existing algorithms rely o...
Conference Paper
Full-text available
Given a query string Q, an edit similarity search finds all strings in a database whose edit distance with Q is no more than a given threshold τ. Most existing methods answering edit similarity queries employ schemes to generate string subsequences as signatures and generate candidates by set overlap queries on query and data signatures. In this ar...
Conference Paper
Finding the most accessible locations has a number of applications. For example, a user may want to find accommodation that is close to different amenities such as schools, supermarkets, and hospitals. In this paper, we study the problem of finding the most accessible locations among a set of possible sites. The task is converted to a top-k...
Conference Paper
Named entity recognition aims at extracting named entities from unstructured text. A recent trend of named entity recognition is finding approximate matches in the text with respect to a large dictionary of known entities, as the domain knowledge encoded in the dictionary helps to improve the extraction performance. In this paper, we study the...
Conference Paper
Similarity join is a useful primitive operation underlying many applications, such as near duplicate Web page detection, data integration, and pattern recognition. Traditional similarity joins require a user to specify a similarity threshold. In this paper, we study a variant of the similarity join, termed top-k set similarity join. It returns the...
Article
There has been considerable interest in similarity join in the research community recently. Similarity join is a fundamental operation in many application areas, such as data integration and cleaning, bioinformatics, and pattern recognition. We focus on efficient algorithms for similarity join with edit distance constraints. Existing approaches are...
Conference Paper
With the increasing amount of data and the need to integrate data from multiple data sources, a challenging issue is to find near duplicate records efficiently. In this paper, we focus on efficient algorithms to find pairs of records such that their similarities are above a given threshold. Several existing algorithms rely on the prefix filtering...
Conference Paper
Full-text available
Maintaining the quality of queries over streaming data is often thought to be a tremendous challenge since the data arrival rate and average per-tuple CPU processing cost are highly unpredictable. In this paper, we address a novel buffer-preposed QoS adaptation framework on the basis of control theory and present several load shedding techniques and s...
Conference Paper
Full-text available
The recently proposed Hole-Filler model is promising for transmitting and evaluating streamed XML fragments. However, by simply matching filler IDs with hole IDs, associating all the correlated fragments to complete the query path would result in blocking. Taking advantage of a region-based coding scheme, this paper models the query expression into query...
Conference Paper
Full-text available
Recently, XML fragment processing has become prevalent on the Web due to its flexibility and manageability. In this paper, we propose two techniques for document fragmentation considering the query statistics over XML data: path frequency tree (PFT) and Markov tables. Both techniques work by merging the nodes of low query frequency to enhance fragment ut...
Conference Paper
Full-text available
We address several load shedding techniques over sliding window joins. We first construct a dual window architectural model including aux-windows and join-windows, and build statistics on aux-windows. With the statistics, we develop an effective load shedding strategy producing maximum subset join outputs. In order to accelerate the load shedding p...
Conference Paper
Full-text available
With the prevalence of Web applications, expediting multiple queries over streaming XML has become a core challenge due to one-pass processing and limited resources. The recently proposed Hole-Filler model has low overhead for XML fragment transmission and evaluation; however, existing work addressed the multiple query problem over XML tuple streams in...
Conference Paper
Full-text available
Unlike in traditional databases, queries on XML streams are bounded not only by memory but also by real-time processing. The recently proposed Hole-Filler model is promising for information transmission and publication, as it slices XML data into low-overhead, easily synchronized fragments. However, XPath queries evaluate the elements in streamed XML data...
