About
269 Publications · 36,776 Reads · 8,527 Citations (since 2017)
Publications (269)
Modern location-based systems have stimulated explosive growth of urban trajectory data and promoted many real-world applications, e.g., trajectory prediction. However, heavy big data processing overhead and privacy concerns hinder trajectory acquisition and utilization. Inspired by regular trajectory distribution on transportation road networks,...
In this paper, we revisit the problem of route travel time estimation on a road network and aim to boost its accuracy by capturing and utilizing spatio-temporal features from four significant aspects: heterogeneity, proximity, periodicity and dynamicity.
On the spatial side, we consider two forms of link-level heterogeneity in a road network: the turni...
Traditional cost-based optimizers can efficiently and reliably generate optimal plans for simple SQL queries, but they may not generate high-quality plans for complicated queries. Thus, learning-based optimizers have recently been proposed that can learn high-quality plans from past experience. However, learning-based optimizers cannot work well...
Knowledge bases (KBs), which store high-quality information, are crucial for many applications, such as enhancing search results and serving as external sources for data cleaning. Not surprisingly, there exist outdated facts in most KBs due to the rapid change of information. Naturally, it is important to keep KBs up-to-date. Traditional wisdom has...
Successful machine learning (ML) needs to learn from good data. However, one common issue with training data for ML practitioners is the lack of good features. To mitigate this problem, feature augmentation is often employed by joining with (or enriching features from) multiple tables, so as to obtain feature-rich training data. A consequent problem is that the...
Entity resolution (ER) is a core data integration problem that identifies pairs of data instances referring to the same real-world entities, and the state-of-the-art results of ER are achieved by deep learning (DL) based approaches. However, DL-based approaches typically require a large amount of labeled training data (i.e., matching and non-match...
Data exploration—the problem of extracting knowledge from a database even when we do not know exactly what we are looking for—is important for data discovery and analysis. However, precisely specifying SQL queries is not always practical, such as “finding and ranking off-road cars based on a combination of Price, Make, Model, Age, Mileage, etc”—not on...
Computing the shortest paths and shortest path distances between two vertices on road networks is a core operation in many real-world applications, e.g., finding the closest taxi/hotel. However, existing techniques have several limitations. First, traditional Dijkstra-based methods have long latency and cannot meet the high-performance requirement....
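As background for the Dijkstra-based baselines this abstract mentions, here is a minimal textbook sketch of Dijkstra's algorithm on an adjacency-list graph; the toy graph and vertex names are illustrative, not from the paper. The point the abstract makes is that this per-query scan over the graph has long latency on large road networks, which motivates index-based alternatives.

```python
import heapq

def dijkstra(graph, source):
    """Single-source shortest-path distances on a weighted graph.

    graph: dict mapping vertex -> list of (neighbor, edge_weight) pairs.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Toy road network: vertices are intersections, weights are travel costs.
g = {"a": [("b", 2.0), ("c", 5.0)], "b": [("c", 1.0)], "c": []}
print(dijkstra(g, "a"))  # {'a': 0.0, 'b': 2.0, 'c': 3.0}
```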
The lack of sufficient labeled data is a key bottleneck for practitioners in many real-world supervised machine learning (ML) tasks. In this paper, we study a new problem, namely selective data acquisition in the wild for model charging: given a supervised ML task and data in the wild (e.g., enterprise data warehouses, online data repositories, da...
Machine learning (ML) has widespread applications and has revolutionized many industries, but it suffers from several challenges. First, sufficient high-quality training data is indispensable for producing a well-performing model, but such data is expensive to acquire through human effort. Second, a large amount of training data and complicated model structures lead...
Materialized views (MVs) can significantly optimize query processing in databases. However, it is hard for ordinary users to generate MVs because doing so relies on background knowledge, and existing methods rely on DBAs to generate and maintain MVs. Yet DBAs cannot handle large-scale databases, especially cloud databases that have millions of d...
NL2VIS - which translates natural language (NL) queries to corresponding visualizations (VIS) - has attracted increasing attention from both commercial visualization vendors and academic researchers. In the last few years, advanced deep learning-based models have achieved human-like abilities in many natural language processing (NLP) tasks, wh...
Supporting the translation from natural language (NL) query to visualization (NL2VIS) can simplify the creation of data visualizations because, if successful, anyone can generate visualizations from tabular data using natural language. The state-of-the-art NL2VIS approaches (e.g., NL4DV and FlowSense) are based on semantic parsers and heur...
Cardinality estimation is one of the most important problems in query optimization. Recently, machine learning based techniques have been proposed to effectively estimate cardinality, which can be broadly classified into query-driven and data-driven approaches. Query-driven approaches learn a regression model from a query to its cardinality; while...
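To make the query-driven idea concrete, here is a minimal sketch, assuming a single-column table and a synthetic workload of range predicates: each query is featurized as (lo, hi, width) and a regressor is fit from query features to true cardinality. The feature encoding, data, and model choice are illustrative, not the paper's.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
column = rng.normal(50, 15, size=10_000)  # one numeric table column

def true_cardinality(lo, hi):
    return int(((column >= lo) & (column <= hi)).sum())

# Training workload: random range predicates labeled with true cardinalities.
bounds = np.sort(rng.uniform(0, 100, size=(500, 2)), axis=1)
X = np.column_stack([bounds, bounds[:, 1] - bounds[:, 0]])  # (lo, hi, width)
y = np.array([true_cardinality(lo, hi) for lo, hi in bounds])

model = LinearRegression().fit(X, y)  # learn: query features -> cardinality
lo, hi = 40.0, 60.0
est = model.predict([[lo, hi, hi - lo]])[0]
print(f"estimated {est:.0f} vs true {true_cardinality(lo, hi)}")
```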
Query rewrite transforms a SQL query into an equivalent one but with higher performance. However, SQL rewrite is an NP-hard problem, and existing approaches adopt heuristics to rewrite the queries. These heuristics have two main limitations. First, the order of applying different rewrite rules significantly affects the query performance. However, t...
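A toy illustration of the order-sensitivity noted above, assuming invented string-level "plans" and two made-up rewrite rules: applying the same rules in different orders produces different final queries, which is why rule ordering matters for performance.

```python
from itertools import permutations

# Each rewrite rule is an expression -> expression function. The plans and
# rules below are invented purely to show that application order matters.
def merge_filters(q):
    return q.replace("FILTER(p1, FILTER(p2, T))", "FILTER(p1 AND p2, T)")

def push_filter_into_scan(q):
    return q.replace("FILTER(p2, T)", "SCAN(T, p2)")

rules = [("merge_filters", merge_filters),
         ("push_filter", push_filter_into_scan)]
query = "FILTER(p1, FILTER(p2, T))"

for order in permutations(rules):
    q = query
    for _, rule in order:
        q = rule(q)
    print([name for name, _ in order], "->", q)
# merge-then-push yields FILTER(p1 AND p2, T); push-then-merge yields
# FILTER(p1, SCAN(T, p2)) -- two different rewrites of the same query.
```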
Cardinality estimation is core to the query optimizers of DBMSs. Non-learned methods, especially based on histograms and samplings, have been widely used in commercial and open-source DBMSs. Nevertheless, histograms and samplings can only be used to summarize one or few columns, which fall short of capturing the joint data distribution over an arbi...
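For contrast with learned estimators, here is a minimal sketch of the classic single-column equi-width histogram estimate, assuming uniformity within each bucket; the data and bucket count are illustrative. Exactly because each histogram summarizes one column this way, it cannot capture joint distributions across columns.

```python
import numpy as np

rng = np.random.default_rng(1)
col = rng.exponential(scale=20.0, size=100_000)   # one skewed column
counts, edges = np.histogram(col, bins=64)        # equi-width histogram

def estimate_range(lo, hi):
    """Estimated row count for lo <= col <= hi, assuming per-bucket uniformity."""
    est = 0.0
    for c, b_lo, b_hi in zip(counts, edges[:-1], edges[1:]):
        overlap = max(0.0, min(hi, b_hi) - max(lo, b_lo))
        if overlap > 0:
            est += c * overlap / (b_hi - b_lo)
    return est

lo, hi = 10.0, 30.0
true = int(((col >= lo) & (col <= hi)).sum())
print(f"histogram estimate {estimate_range(lo, hi):.0f} vs true {true}")
```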
We study the problem of utilizing human intelligence to categorize a large number of objects. In this problem, given a category hierarchy and a set of objects, we can ask humans to check whether an object belongs to a category, and our goal is to find the most cost-effective strategy to locate the appropriate category in the hierarchy for each obje...
Entity categorization, the process of categorizing entities into groups, is an important problem with many applications. However, in practice, many entities are mis-categorized, for example, on Google Scholar and in Amazon's product catalog. In this paper, we study the problem of discovering mis-categorized entities from a given group of categorized entities. This pro...
Machine learning techniques have been proposed to optimize databases. For example, traditional empirical database optimization techniques (e.g., cost estimation, join order selection, knob tuning, index and view advisor) cannot meet the high-performance requirement for large-scale database instances, various applications and diversified users,...
Differential privacy promises to enable data sharing and general data analytics while protecting individual privacy. Because private data is often stored in relational databases that support SQL queries, making SQL-based analytics differentially private is critical. However, the existing SQL-based differentially private...
Although learning-based database optimization techniques have been studied in academia in recent years, they have not been widely deployed in commercial database systems. In this work, we build an autonomous database framework and integrate our proposed learning-based database techniques into an open-source database system, openGauss. We propose e...
Can AI help automate human-easy but computer-hard data preparation tasks that burden data scientists, practitioners, and crowd workers? We answer this question by presenting RPT, a denoising autoencoder for tuple-to-X models ("X" could be tuple, token, label, JSON, and so on). RPT is pre-trained for a tuple-to-tuple model by corrupting the input...
Real-world data is dirty, which causes serious problems in (supervised) machine learning (ML). The widely used practice in such scenarios is to first repair the labeled source (a.k.a. train) data using rule-, statistical- or ML-based methods and then use the "repaired" source to train an ML model. During production, unlabeled target (a.k.a. test) da...
Spatial crowdsourcing (SC) allows requesters to crowdsource tasks to workers based on location proximity. To preserve privacy, the location should not be disclosed to untrustworthy entities (even the SC platform). Previous solutions to preserve workers' location privacy require an online trusted third party (TTP), which is not practical in reality....
Intelligent transportation (e.g., intelligent traffic lights) makes our travel more convenient and efficient. With the development of the mobile Internet and positioning technologies, it is reasonable to collect spatio-temporal data and then leverage these data to achieve the goal of intelligent transportation; here, traffic prediction plays an importa...
Can AI help automate human-easy but computer-hard data preparation tasks (for example, data cleaning, data integration, and information extraction), which currently heavily involve data scientists, practitioners, and crowd workers? We envision that human-easy data preparation for relational data can be automated. To this end, we first identify the...
Data labeling, which assigns data with multiple classes, is indispensable for many applications, such as machine learning and data integration. However, existing labeling solutions either incur expensive cost for large datasets or produce noisy results. This paper introduces a cost-effective labeling approach and focuses on the labeling rule genera...
The proliferation of big data has brought an urgent demand for privacy-preserving data publishing. Traditional solutions to this demand have limitations on effectively balancing the tradeoff between privacy and utility of the released data. Thus, the database community and machine learning community have recently studied a new problem of relational...
Spatio-temporal data analysis is very important in many time-critical applications. We take Coronavirus disease (COVID-19) as an example, and the key questions that everyone asks every day are: how does Coronavirus spread? where are the high-risk areas? where are the confirmed cases around me? Interactive data analytics, which allows general users...
Data visualization is crucial in data-driven decision making. However, bad visualizations generated from dirty data often mislead the users to understand the data and to draw wrong decisions. We present VisClean, a system that can progressively visualize data with improved quality through interactive and visualization-aware data cleaning. We will d...
In this article, we propose and study the problem of trajectory-driven influential billboard placement: given a set of billboards U (each with a location and a cost), a database of trajectories T, and a budget L, we find a set of billboards within the budget to influence the largest number of trajectories. One core challenge is to identify and re...
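Since the problem is a budgeted coverage variant, a standard cost-aware greedy heuristic conveys its flavor: repeatedly pick the billboard with the best marginal-influence-per-cost ratio that still fits the budget. This sketch is illustrative only and is not the paper's algorithm, which must additionally handle overlapping influence among billboards efficiently; the data is invented.

```python
def greedy_placement(billboards, budget):
    """billboards: dict id -> (cost, set of influenced trajectory ids)."""
    chosen, covered, spent = [], set(), 0.0
    while True:
        best, best_ratio = None, 0.0
        for b, (cost, trajs) in billboards.items():
            if b in chosen or spent + cost > budget:
                continue
            gain = len(trajs - covered)       # marginal influence
            if gain and gain / cost > best_ratio:
                best, best_ratio = b, gain / cost
        if best is None:
            return chosen, covered
        cost, trajs = billboards[best]
        chosen.append(best)
        covered |= trajs
        spent += cost

boards = {
    "b1": (3.0, {1, 2, 3}),
    "b2": (2.0, {3, 4}),
    "b3": (4.0, {5, 6, 7, 8}),
}
print(greedy_placement(boards, budget=6.0))   # (['b1', 'b2'], {1, 2, 3, 4})
```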
Database and Artificial Intelligence (AI) can benefit from each other. On one hand, AI can make database more intelligent (AI4DB). For example, traditional empirical database optimization techniques (e.g., cost estimation, join order selection, knob tuning, index and view selection) cannot meet the high-performance requirement for large-scale datab...
Query performance prediction is vital to many database tasks (e.g., database monitoring and query scheduling). Existing methods focus on predicting the performance for a single query but cannot effectively predict the performance for concurrent queries, because it is rather hard to capture the correlations between different queries, e.g., lock conf...
Outlier detection is critical to a large number of applications, from financial fraud detection to health care. Although numerous approaches have been proposed to automatically detect outliers, outliers detected on the basis of statistical rarity do not necessarily correspond to the outliers of interest to an application. In this work, we propose...
Na Ta, Kaiyu Li, Yi Yang, [...], Guoliang Li
Although individual anxiety evaluation has been well studied, there is still not much work on evaluating the public anxiety of groups, especially communities on social networks, which can be leveraged to assess the mental health of a society. However, we cannot simply average individual anxiety scores to evaluate a community's public anx...
In this work, we present a self-driving data visualization system, called DeepEye, that automatically generates and recommends visualizations based on the idea of visualization by examples. We propose effective visualization recognition techniques to decide which visualizations are meaningful and visualization ranking techniques to rank the go...
Visualization charts are widely utilized for presenting structured data. Under many circumstances, people want to digitalize the data in the charts collected from various sources (e.g., papers and websites) to further analyze the data or create new charts. However, existing automatic and semi-automatic approaches are not always effective due to the...
Data visualization is crucial in today’s data-driven business world and has been widely used to help decision making that is closely related to major revenues of many industrial companies. However, due to the high demand for data processing w.r.t. the volume, velocity, and veracity of data, there is an emerging need for database experts to he...
In this paper, we propose a Deep Reinforcement Learning (RL) framework for task arrangement, which is a critical problem for the success of crowdsourcing platforms. Previous works conduct the personalized recommendation of tasks to workers via supervised learning methods. However, the majority of them only consider the benefit of either workers or...
Cost and cardinality estimation is vital to the query optimizer, as it guides query plan selection. However, traditional empirical cost and cardinality estimation techniques cannot provide high-quality estimates, because they may not effectively capture the correlation between multiple tables. Recently, the database community has shown that the learn...
An end-to-end data integration system requires human feedback in several phases, including collecting training data for entity matching, debugging the resulting clusters, confirming transformations applied on these clusters for data standardization, and finally, reducing each cluster to a single, canonical representation (or "golden record"). The t...
We study interactive graph search (IGS), with the conceptual objective of departing from the conventional "top-down" strategy in searching a poly-hierarchy, a.k.a. a decision graph. In IGS, a machine assists a human in looking for a target node z in an acyclic directed graph G, by repetitively asking questions. In each question, the machin...
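A minimal sketch of the interaction model, restricted to trees for simplicity: an oracle (the human) answers whether the target lies in the subtree of a node, and the machine descends accordingly. The naive descent below asks one question per child; the paper's contribution is asking provably few questions on general DAGs. The tree and oracle are invented for illustration.

```python
def locate(tree, root, reaches_target):
    """tree: dict node -> list of children; reaches_target(u) -> bool
    answers "is the target in the subtree rooted at u?" (one question)."""
    node = root
    while True:
        for child in tree.get(node, []):
            if reaches_target(child):   # one "question" to the human
                node = child
                break
        else:
            return node                 # no child contains the target

tree = {"root": ["animals", "plants"], "animals": ["dogs", "cats"], "dogs": []}
target_path = {"root", "animals", "dogs"}   # ancestors of the target
print(locate(tree, "root", lambda u: u in target_path))  # -> dogs
```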
Trajectory data analytics plays an important role in many applications, such as transportation optimization, urban planning, taxi scheduling, and so on. However, trajectory data analytics faces a great challenge: the time cost of processing queries over big datasets is too high. In this paper, we demonstrate a distributed in-memory framework Ratel...
The problem of data visualization is to transform data into a visual context such that people can easily understand the significance of data. Nowadays, data visualization becomes especially important, because it is the de facto standard for modern business intelligence and successful data science. This tutorial will cover three specific topics: vis...
Large-scale data labeling has become a major bottleneck for many applications, such as machine learning and data integration. This paper presents CrowdGame, a crowdsourcing system that harnesses the crowd to gather data labels in a cost-effective way. CrowdGame focuses on generating high-quality labeling rules to largely reduce the labeling cost wh...
In crowdsourcing, human workers are employed to tackle problems that are traditionally difficult for computers (e.g., data cleaning, missing value filling, and sentiment analysis). In this paper, we study the effective use of crowdsourcing in filling missing values in a given relation (e.g., a table containing different attributes of celebrity star...
Data analysts spend more than 80% of their time on data cleaning and integration in the whole process of data analytics, due to data errors and inconsistencies. Similarity-based query processing is an important way to tolerate these errors and inconsistencies. However, similarity-based query processing is rather costly, and traditional databases cannot afford...
Crowdsourced entity resolution has recently attracted significant attention because it can harness the wisdom of the crowd to improve the quality of entity resolution. However, existing techniques either cannot achieve high quality or incur huge monetary costs. To address these problems, we propose a cost-effective crowdsourced entity resolution frame...
Recently, approximate query processing (AQP) has been proposed to enable online approximate OLAP. However, existing AQP methods have some limitations. First, they may incur unacceptable errors on skewed data (e.g., long-tail distributions). Second, they require storing a large amount of data yet offer no significant performance improvement. Third, t...
Web tables have become very popular and important in many real applications, such as search engines and knowledge base enrichment. Given these benefits, it is important to understand web tables. An important task in web table understanding is column-type detection, which detects the most likely types (categories) to describe the columns in the...
The results collected from crowd workers may not be reliable because (1) some malicious workers return random answers and (2) some tasks are hard, and workers may not be good at them. Thus, it is important to exploit the different characteristics of workers and tasks and to control quality in crowdsourcing. Existing studi...
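One common building block for such quality control is accuracy-weighted majority voting: workers with higher estimated accuracy get more say in the aggregated answer. This is a generic sketch with invented worker accuracies and answers, illustrating the idea rather than the specific techniques studied in the paper.

```python
from collections import defaultdict

def weighted_vote(answers, accuracy):
    """answers: list of (worker, label); accuracy: dict worker -> [0, 1]."""
    score = defaultdict(float)
    for worker, label in answers:
        score[label] += accuracy.get(worker, 0.5)  # unknown worker: neutral
    return max(score, key=score.get)

acc = {"w1": 0.9, "w2": 0.6, "w3": 0.55}
task_answers = [("w1", "cat"), ("w2", "dog"), ("w3", "dog")]
# w1 alone (weight 0.9) is outvoted by w2 + w3 (combined weight 1.15).
print(weighted_vote(task_answers, acc))  # -> dog
```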
Online analytical processing (OLAP) is a core functionality in database systems. The performance of OLAP is crucial to make online decisions in many applications. However, it is rather costly to support OLAP on large datasets, especially big data, and the methods that compute exact answers cannot meet the high-performance requirement. To alleviate...
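A minimal sketch of the sampling flavor of approximate OLAP: estimate SUM over a table from a uniform sample and attach a CLT-based confidence interval. The data, sample size, and column are illustrative.

```python
import math
import random

random.seed(0)
table = [random.expovariate(1 / 50) for _ in range(1_000_000)]  # one column

n, N = 10_000, len(table)
sample = random.sample(table, n)              # uniform sample, no replacement
mean = sum(sample) / n
var = sum((x - mean) ** 2 for x in sample) / (n - 1)
sum_estimate = N * mean                       # scale sample mean up to SUM
half_width = 1.96 * N * math.sqrt(var / n)    # ~95% CLT confidence interval

print(f"SUM ~ {sum_estimate:,.0f} +/- {half_width:,.0f} "
      f"(true {sum(table):,.0f})")
```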
Large-scale data annotation is indispensable for many applications, such as machine learning and data integration. However, existing annotation solutions either incur expensive cost for large datasets or produce noisy results. This paper introduces a cost-effective annotation approach, and focuses on the labeling rule generation problem that aims t...
Given a relational table, we study the problem of detecting and repairing erroneous data, as well as marking correct data, using well curated knowledge bases (KBs). We propose detective rules (DRs), a new type of data cleaning rules that can make actionable decisions on relational data, by building connections between a relation and a KB. The main...
Crowd-powered database systems can leverage the crowd's ability to address machine-hard problems, e.g., data integration. Existing crowdsourcing systems adopt the traditional tree model to select a good query plan. However, the tree model can optimize the I/O cost but cannot optimize the monetary cost, latency and quality, which are three important...
Map matching is an important operation of location-based services, which matches raw GPS trajectories onto real road networks and facilitates urban-computing tasks such as intelligent traffic systems. More than ten algorithms have been proposed to address this problem in the past decade. However, existing algorithms have not been thorou...
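As a baseline intuition for map matching, here is a deliberately simple point-to-segment snapping sketch: each GPS point is matched to the nearest road segment by perpendicular distance. Real matchers evaluated in such studies (e.g., HMM-based ones) also model transitions between consecutive points; the coordinates and road names below are invented.

```python
import math

def project(p, a, b):
    """Distance from point p to segment a-b, plus the closest point on it."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    t = 0.0 if dx == dy == 0 else max(
        0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    cx, cy = ax + t * dx, ay + t * dy
    return math.hypot(px - cx, py - cy), (cx, cy)

segments = {"road1": ((0, 0), (10, 0)), "road2": ((0, 5), (10, 5))}
trajectory = [(1, 1), (4, 2), (8, 4)]
for p in trajectory:
    road, (dist, snapped) = min(
        ((r, project(p, a, b)) for r, (a, b) in segments.items()),
        key=lambda x: x[1][0])          # nearest segment wins
    print(p, "->", road, "at", snapped)
```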
In this paper we propose and study the problem of trajectory-driven influential billboard placement: given a set of billboards U (each with a location and a cost), a database of trajectories T and a budget L, find a set of billboards within the budget to influence the largest number of trajectories. One core challenge is to identify...
Many data mining tasks cannot be completely addressed by automated processes, such as sentiment analysis and image classification. Crowdsourcing is an effective way to harness the human cognitive ability to process these machine-hard tasks. Thanks to public crowdsourcing platforms, e.g., Amazon Mechanical Turk and CrowdFlower, we can easily involve...
Trajectory analytics can benefit many real-world applications, e.g., frequent-trajectory-based navigation systems, road planning, car pooling, and transportation optimization. In this paper, we demonstrate a distributed in-memory trajectory analytics system, DITA, to support large-scale trajectory data analytics. DITA exhibits three unique features....
Creating good visualizations is hard for ordinary users, even with the help of state-of-the-art interactive data visualization tools such as Tableau and Qlik, because these tools require users to understand the data and visualizations very well. DeepEye is an innovative visualization system that aims at helping everyone create good visualizations si...