Guoliang Li

Tsinghua University · Department of Computer Science and Technology

About

269
Publications
36,776
Reads
8,527
Citations
Citations since 2017
126 Research Items
6,245 Citations
[Chart: citations per year, 2017–2023]


Publications (269)
Article
Modern location-based systems have stimulated explosive growth of urban trajectory data and promoted many real-world applications, e.g., trajectory prediction. However, heavy big data processing overhead and privacy concerns hinder trajectory acquisition and utilization. Inspired by regular trajectory distribution on transportation road networks,...
Article
In this paper, we revisit the problem of route travel time estimation on a road network and aim to boost its accuracy by capturing and utilizing spatio-temporal features from four significant aspects: heterogeneity, proximity, periodicity and dynamicity. Spatial-wise, we consider two forms of heterogeneity at link level in a road network: the turni...
Article
Traditional cost-based optimizers are efficient and stable to generate optimal plans for simple SQL queries, but they may not generate high-quality plans for complicated queries. Thus learning-based optimizers have been proposed recently that can learn high-quality plans based on past experiences. However, learning-based optimizers cannot work well...
Article
Knowledge bases (KBs), which store high-quality information, are crucial for many applications, such as enhancing search results and serving as external sources for data cleaning. Not surprisingly, there exist outdated facts in most KBs due to the rapid change of information. Naturally, it is important to keep KBs up-to-date. Traditional wisdom has...
Article
Successful machine learning (ML) needs to learn from good data. However, a common issue ML practitioners face with training data is the lack of good features. To mitigate this problem, feature augmentation is often employed by joining with (or enriching features from) multiple tables, so as to obtain feature-rich training data. A consequent problem is that the...
Article
Entity resolution (ER) is a core data integration problem that identifies pairs of data instances referring to the same real-world entities, and the state-of-the-art results of ER are achieved by deep learning (DL) based approaches. However, DL-based approaches typically require a large amount of labeled training data (i.e., matching and non-match...
Article
Full-text available
Data exploration—the problem of extracting knowledge from a database even if we do not know exactly what we are looking for—is important for data discovery and analysis. However, precisely specifying SQL queries is not always practical, such as “finding and ranking off-road cars based on a combination of Price, Make, Model, Age, Mileage, etc”—not on...
Article
Full-text available
Computing the shortest paths and shortest path distances between two vertices on road networks is a core operation in many real-world applications, e.g., finding the closest taxi/hotel. However, existing techniques have several limitations. First, traditional Dijkstra-based methods have long latency and cannot meet the high-performance requirement....
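As context for the limitation noted above, the classical Dijkstra-based baseline can be sketched in a few lines (a minimal Python sketch with an illustrative graph encoding, not the paper's actual technique):

```python
import heapq

def dijkstra(graph, source):
    """Single-source shortest path distances on a weighted graph.

    graph: dict mapping vertex -> list of (neighbor, edge_weight) pairs.
    Returns a dict of shortest distances from source.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Toy road network: edges weighted by travel cost.
g = {"a": [("b", 2), ("c", 5)], "b": [("c", 1)], "c": []}
print(dijkstra(g, "a"))  # {'a': 0, 'b': 2, 'c': 3}
```

The latency issue the abstract points to comes from this algorithm visiting a large fraction of the network per query, which is why index-based methods are studied instead.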
Article
The lack of sufficient labeled data is a key bottleneck for practitioners in many real-world supervised machine learning (ML) tasks. In this paper, we study a new problem, namely selective data acquisition in the wild for model charging: given a supervised ML task and data in the wild (e.g., enterprise data warehouses, online data repositories, da...
Article
Machine learning (ML) has widespread applications and has revolutionized many industries, but it suffers from several challenges. First, sufficient high-quality training data is indispensable for producing a well-performing model, but such data is expensive for humans to acquire. Second, a large amount of training data and complicated model structures lead...
Article
Materialized views (MVs) can significantly optimize query processing in databases. However, it is hard for ordinary users to generate MVs because doing so relies on background knowledge, so existing methods rely on DBAs to generate and maintain MVs. However, DBAs cannot handle large-scale databases, especially cloud databases that have millions of d...
Preprint
Full-text available
NL2VIS, which translates natural language (NL) queries to corresponding visualizations (VIS), has attracted increasing attention from both commercial visualization vendors and academic researchers. In the last few years, advanced deep learning-based models have achieved human-like abilities in many natural language processing (NLP) tasks, wh...
Article
Supporting the translation from a natural language (NL) query to a visualization (NL2VIS) can simplify the creation of data visualizations because, if successful, anyone can generate visualizations from tabular data using natural language. The state-of-the-art NL2VIS approaches (e.g., NL4DV and FlowSense) are based on semantic parsers and heur...
Article
Cardinality estimation is one of the most important problems in query optimization. Recently, machine learning based techniques have been proposed to effectively estimate cardinality, which can be broadly classified into query-driven and data-driven approaches. Query-driven approaches learn a regression model from a query to its cardinality; while...
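To illustrate the query-driven idea in the simplest possible terms, here is a toy regression from range-query features to log-cardinality (a hedged sketch with made-up data and featurization; real query-driven estimators use far richer features and models):

```python
import numpy as np

# Toy query-driven estimator: each query is a range predicate (lo, hi)
# on a single column; features are the bounds, target is log(cardinality).
train_queries = [(0, 10), (0, 50), (20, 80), (40, 100)]
train_cards = [100, 520, 590, 610]  # made-up observed cardinalities

X = np.array([[lo, hi, 1.0] for lo, hi in train_queries])
y = np.log(np.array(train_cards, dtype=float))
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares regression fit

def estimate(lo, hi):
    """Predict the cardinality of a new range query from the fitted model."""
    return float(np.exp(np.array([lo, hi, 1.0]) @ w))

print(round(estimate(10, 60)))
```

Data-driven approaches instead model the data distribution itself, so they can answer queries never seen at training time; the two paradigms trade off in exactly the way the abstract contrasts.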
Article
Query rewrite transforms a SQL query into an equivalent one but with higher performance. However, SQL rewrite is an NP-hard problem, and existing approaches adopt heuristics to rewrite the queries. These heuristics have two main limitations. First, the order of applying different rewrite rules significantly affects the query performance. However, t...
Article
Cardinality estimation is core to the query optimizers of DBMSs. Non-learned methods, especially those based on histograms and samplings, have been widely used in commercial and open-source DBMSs. Nevertheless, histograms and samplings can only be used to summarize one or a few columns, which fall short of capturing the joint data distribution over an arbi...
Article
Full-text available
We study the problem of utilizing human intelligence to categorize a large number of objects. In this problem, given a category hierarchy and a set of objects, we can ask humans to check whether an object belongs to a category, and our goal is to find the most cost-effective strategy to locate the appropriate category in the hierarchy for each obje...
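The conventional top-down baseline for this task can be sketched as follows (a minimal illustration with a hypothetical hierarchy; the paper's contribution is a more cost-effective questioning strategy than this naive one):

```python
def categorize(hierarchy, root, belongs):
    """Top-down search for the most specific category of an object.

    hierarchy: dict mapping category -> list of child categories.
    belongs(category) -> bool simulates asking a human worker
    "does the object belong to this category?".
    Returns (category, number of questions asked).
    """
    node, questions = root, 0
    while True:
        for child in hierarchy.get(node, []):
            questions += 1
            if belongs(child):  # one question per human check
                node = child
                break
        else:
            return node, questions  # no child matched: most specific category

# Toy hierarchy; the object to be categorized is a "laptop".
h = {"products": ["electronics", "books"], "electronics": ["laptop", "phone"]}
path = {"electronics", "laptop"}
print(categorize(h, "products", lambda c: c in path))  # ('laptop', 2)
```

Each `belongs` call stands in for one paid crowdsourcing question, so minimizing the question count directly minimizes monetary cost.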
Article
Full-text available
Entity categorization, the process of categorizing entities into groups, is an important problem with many applications. However, in practice, many entities are mis-categorized, such as Google Scholar and Amazon products. In this paper, we study the problem of discovering mis-categorized entities from a given group of categorized entities. This pro...
Article
Machine learning techniques have been proposed to optimize the databases. For example, traditional empirical database optimization techniques (e.g., cost estimation, join order selection, knob tuning, index and view advisor) cannot meet the high-performance requirement for large-scale database instances, various applications and diversified users,...
Article
Differential privacy promises to enable data sharing and general data analytics while protecting individual privacy. Because private data is often stored in relational databases that support SQL queries, making SQL-based analytics differentially private is critical. However, the existing SQL-based differentially private...
Article
Although learning-based database optimization techniques have been studied in academia in recent years, they have not been widely deployed in commercial database systems. In this work, we build an autonomous database framework and integrate our proposed learning-based database techniques into the open-source database system openGauss. We propose e...
Article
Can AI help automate human-easy but computer-hard data preparation tasks that burden data scientists, practitioners, and crowd workers? We answer this question by presenting RPT, a denoising autoencoder for tuple-to-X models ("X" could be tuple, token, label, JSON, and so on). RPT is pre-trained for a tuple-to-tuple model by corrupting the input...
Article
Real-world data is dirty, which causes serious problems in (supervised) machine learning (ML). The widely used practice in such scenarios is to first repair the labeled source (a.k.a. train) data using rule-, statistical- or ML-based methods and then use the "repaired" source to train an ML model. During production, unlabeled target (a.k.a. test) da...
Article
Spatial crowdsourcing (SC) allows requesters to crowdsource tasks to workers based on location proximity. To preserve privacy, the location should not be disclosed to untrustworthy entities (even the SC platform). Previous solutions to preserve workers' location privacy require an online trusted third party (TTP), which is not practical in reality....
Article
Full-text available
Intelligent transportation (e.g., intelligent traffic lights) makes our travel more convenient and efficient. With the development of the mobile Internet and positioning technologies, it is reasonable to collect spatio-temporal data and then leverage these data to achieve the goal of intelligent transportation; here, traffic prediction plays an importa...
Preprint
Full-text available
Can AI help automate human-easy but computer-hard data preparation tasks (for example, data cleaning, data integration, and information extraction), which currently heavily involve data scientists, practitioners, and crowd workers? We envision that human-easy data preparation for relational data can be automated. To this end, we first identify the...
Article
Full-text available
Data labeling, which assigns data with multiple classes, is indispensable for many applications, such as machine learning and data integration. However, existing labeling solutions either incur expensive cost for large datasets or produce noisy results. This paper introduces a cost-effective labeling approach and focuses on the labeling rule genera...
Preprint
The proliferation of big data has brought an urgent demand for privacy-preserving data publishing. Traditional solutions to this demand have limitations on effectively balancing the tradeoff between privacy and utility of the released data. Thus, the database community and machine learning community have recently studied a new problem of relational...
Article
The proliferation of big data has brought an urgent demand for privacy-preserving data publishing. Traditional solutions to this demand have limitations on effectively balancing the tradeoff between privacy and utility of the released data. Thus, the database community and machine learning community have recently studied a new problem of relational...
Article
Spatio-temporal data analysis is very important in many time-critical applications. Take Coronavirus disease (COVID-19) as an example: the key questions that everyone asks every day are: how does Coronavirus spread? where are the high-risk areas? where are the confirmed cases around me? Interactive data analytics, which allows general users...
Article
Data visualization is crucial in data-driven decision making. However, bad visualizations generated from dirty data often mislead the users to understand the data and to draw wrong decisions. We present VisClean, a system that can progressively visualize data with improved quality through interactive and visualization-aware data cleaning. We will d...
Article
In this article, we propose and study the problem of trajectory-driven influential billboard placement: given a set of billboards U (each with a location and a cost), a database of trajectories T, and a budget L, we find a set of billboards within the budget to influence the largest number of trajectories. One core challenge is to identify and re...
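A standard greedy baseline for this kind of budgeted coverage problem looks like the following (an illustrative sketch with made-up data, not the paper's algorithm, which must additionally handle the overlap challenge between billboards):

```python
def greedy_placement(billboards, budget):
    """Greedy budgeted max-coverage: repeatedly pick the billboard with the
    best marginal gain (newly influenced trajectories) per unit cost.

    billboards: dict name -> (cost, set of trajectory ids it influences).
    Returns (chosen billboard names, set of influenced trajectories).
    """
    chosen, covered, spent = [], set(), 0
    while True:
        best, best_ratio = None, 0.0
        for name, (cost, traj) in billboards.items():
            if name in chosen or spent + cost > budget:
                continue  # already picked, or would exceed the budget
            gain = len(traj - covered)  # only newly covered trajectories count
            if gain and gain / cost > best_ratio:
                best, best_ratio = name, gain / cost
        if best is None:
            return chosen, covered
        chosen.append(best)
        cost, traj = billboards[best]
        spent += cost
        covered |= traj

bbs = {"b1": (3, {1, 2, 3}), "b2": (2, {3, 4}), "b3": (4, {1, 2, 3, 4, 5})}
print(greedy_placement(bbs, 5))
```

Because the influence function is submodular, such greedy selection gives a constant-factor approximation in the classical setting; the hard part studied in the paper is computing and reusing the overlap between billboards efficiently.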
Article
Full-text available
Database and Artificial Intelligence (AI) can benefit from each other. On one hand, AI can make database more intelligent (AI4DB). For example, traditional empirical database optimization techniques (e.g., cost estimation, join order selection, knob tuning, index and view selection) cannot meet the high-performance requirement for large-scale datab...
Article
Query performance prediction is vital to many database tasks (e.g., database monitoring and query scheduling). Existing methods focus on predicting the performance for a single query but cannot effectively predict the performance for concurrent queries, because it is rather hard to capture the correlations between different queries, e.g., lock conf...
Article
Outlier detection is critical to a large number of applications, from finance fraud detection to health care. Although numerous approaches have been proposed to automatically detect outliers, outliers detected based on statistical rarity do not necessarily correspond to the true outliers of interest to applications. In this work, we propose...
Article
Although individual anxiety evaluation has been well studied, there is still not much work on evaluating the public anxiety of groups, especially in the form of communities on social networks, which can be leveraged to assess the mental health of a society. However, we cannot simply average individual anxiety scores to evaluate a community's public anx...
Article
We study the problem of utilizing human intelligence to categorize a large number of objects. In this problem, given a category hierarchy and a set of objects, we can ask humans to check whether an object belongs to a category, and our goal is to find the most cost-effective strategy to locate the appropriate category in the hierarchy for each obje...
Article
In this work, we present a self-driving data visualization system, called DeepEye, that automatically generates and recommends visualizations based on the idea of visualization by examples. We propose effective visualization recognition techniques to decide which visualizations are meaningful and visualization ranking techniques to rank the go...
Article
Visualization charts are widely utilized for presenting structured data. Under many circumstances, people want to digitalize the data in the charts collected from various sources (e.g., papers and websites) to further analyze the data or create new charts. However, existing automatic and semi-automatic approaches are not always effective due to the...
Article
Full-text available
Data visualization is crucial in today’s data-driven business world, which has been widely used for helping decision making that is closely related to major revenues of many industrial companies. However, due to the high demand of data processing w.r.t. the volume, velocity, and veracity of data, there is an emerging need for database experts to he...
Preprint
In this paper, we propose a Deep Reinforcement Learning (RL) framework for task arrangement, which is a critical problem for the success of crowdsourcing platforms. Previous works conduct the personalized recommendation of tasks to workers via supervised learning methods. However, the majority of them only consider the benefit of either workers or...
Article
Cost and cardinality estimation is vital to the query optimizer, as it guides query plan selection. However, traditional empirical cost and cardinality estimation techniques cannot provide high-quality estimates, because they may not effectively capture the correlation between multiple tables. Recently, the database community has shown that the learn...
Preprint
An end-to-end data integration system requires human feedback in several phases, including collecting training data for entity matching, debugging the resulting clusters, confirming transformations applied on these clusters for data standardization, and finally, reducing each cluster to a single, canonical representation (or "golden record"). The t...
Conference Paper
We study interactive graph search (IGS), with the conceptual objective of departing from the conventional "top-down" strategy in searching a poly-hierarchy, a.k.a. a decision graph. In IGS, a machine assists a human in looking for a target node z in an acyclic directed graph G by repetitively asking questions. In each question, the machin...
Conference Paper
Trajectory data analytics plays an important role in many applications, such as transportation optimization, urban planning, and taxi scheduling. However, trajectory data analytics faces a major challenge: the time cost of processing queries on big datasets is too high. In this paper, we demonstrate a distributed in-memory framework Ratel...
Conference Paper
The problem of data visualization is to transform data into a visual context such that people can easily understand the significance of data. Nowadays, data visualization becomes especially important, because it is the de facto standard for modern business intelligence and successful data science. This tutorial will cover three specific topics: vis...
Conference Paper
Large-scale data labeling has become a major bottleneck for many applications, such as machine learning and data integration. This paper presents CrowdGame, a crowdsourcing system that harnesses the crowd to gather data labels in a cost-effective way. CrowdGame focuses on generating high-quality labeling rules to largely reduce the labeling cost wh...
Article
In crowdsourcing, human workers are employed to tackle problems that are traditionally difficult for computers (e.g., data cleaning, missing value filling, and sentiment analysis). In this paper, we study the effective use of crowdsourcing in filling missing values in a given relation (e.g., a table containing different attributes of celebrity star...
Article
Data analysts spend more than 80% of their time on data cleaning and integration in the whole process of data analytics, due to data errors and inconsistencies. Similarity-based query processing is an important way to tolerate these errors and inconsistencies. However, similarity-based query processing is rather costly and traditional databases cannot afford...
Article
Full-text available
Crowdsourced entity resolution has recently attracted significant attention because it can harness the wisdom of the crowd to improve the quality of entity resolution. However, existing techniques either cannot achieve high quality or incur huge monetary costs. To address these problems, we propose a cost-effective crowdsourced entity resolution frame...
Article
Recently, approximate query processing (AQP) has been proposed to enable online approximate OLAP. However, existing AQP methods have some limitations. First, they may involve unacceptable errors on skewed data (e.g., long-tail distributions). Second, they need to store a large amount of data yet offer no significant performance improvement. Third, t...
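The sampling-based AQP baseline being critiqued can be sketched in a few lines (a toy illustration; the error term here is exactly what degrades on skewed, long-tail data, since the sample variance blows up):

```python
import random
import statistics

def approx_sum(data, sample_size, seed=0):
    """Estimate SUM(column) from a uniform sample by scaling up the sample sum.

    Classic sampling-based AQP; returns (estimate, standard error).
    """
    random.seed(seed)
    sample = random.sample(data, sample_size)
    scale = len(data) / sample_size
    est = scale * sum(sample)
    # Standard error of the scaled-up estimate: N * s / sqrt(n) = scale * s * sqrt(n).
    se = scale * statistics.stdev(sample) * (sample_size ** 0.5)
    return est, se

data = list(range(10_000))  # exact SUM = 49,995,000
est, se = approx_sum(data, 500)
print(f"estimate={est:.0f} +/- {1.96 * se:.0f}")
```

On uniform data like this the relative error is small, but replacing `data` with a long-tail distribution makes `se` explode, which is the first limitation listed above.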
Conference Paper
Web tables have become very popular and important in many real applications, such as search engines and knowledge base enrichment. Given these benefits, it is important to understand web tables. An important task in web table understanding is column-type detection, which detects the most likely types (categories) to describe the columns in the...
Chapter
The results collected from crowd workers may not be reliable because (1) there are some malicious workers that randomly return the answers and (2) some tasks are hard and workers may not be good at these tasks. Thus it is important to exploit the different characteristics of workers and tasks and control the quality in crowdsourcing. Existing studi...
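A common building block for such quality control is weighted majority voting, where each worker's vote is scaled by an estimated quality score (a minimal sketch with made-up workers and weights, not the specific model studied in the chapter):

```python
from collections import defaultdict

def weighted_majority(answers, quality):
    """Aggregate crowd answers per task, weighting each worker by quality.

    answers: list of (worker, task, label) triples.
    quality: dict worker -> weight in (0, 1]; unknown workers default to 0.5.
    Returns dict task -> winning label.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for worker, task, label in answers:
        votes[task][label] += quality.get(worker, 0.5)
    return {task: max(labels, key=labels.get) for task, labels in votes.items()}

ans = [("w1", "t1", "yes"), ("w2", "t1", "no"), ("w3", "t1", "no")]
q = {"w1": 0.9, "w2": 0.3, "w3": 0.4}  # w1 is a reliable worker
print(weighted_majority(ans, q))  # {'t1': 'yes'}
```

Note how one reliable worker (0.9) outvotes two unreliable ones (0.3 + 0.4 = 0.7), whereas an unweighted majority vote would have returned "no"; estimating those per-worker weights well is the crux of quality control.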
Article
Full-text available
Online analytical processing (OLAP) is a core functionality in database systems. The performance of OLAP is crucial to make online decisions in many applications. However, it is rather costly to support OLAP on large datasets, especially big data, and the methods that compute exact answers cannot meet the high-performance requirement. To alleviate...
Article
Large-scale data annotation is indispensable for many applications, such as machine learning and data integration. However, existing annotation solutions either incur expensive cost for large datasets or produce noisy results. This paper introduces a cost-effective annotation approach, and focuses on the labeling rule generation problem that aims t...
Article
Full-text available
Given a relational table, we study the problem of detecting and repairing erroneous data, as well as marking correct data, using well curated knowledge bases (KBs). We propose detective rules (DRs), a new type of data cleaning rules that can make actionable decisions on relational data, by building connections between a relation and a KB. The main...
Article
Crowd-powered database systems can leverage the crowd's ability to address machine-hard problems, e.g., data integration. Existing crowdsourcing systems adopt the traditional tree model to select a good query plan. However, the tree model can optimize the I/O cost but cannot optimize the monetary cost, latency and quality, which are three important...
Chapter
Map matching is an important operation in location-based services, which matches raw GPS trajectories onto real road networks and facilitates urban computing tasks such as intelligent traffic systems. More than ten algorithms have been proposed to address this problem in the past decade. However, existing algorithms have not been thorou...
Conference Paper
In this paper, we propose and study the problem of trajectory-driven influential billboard placement: given a set of billboards U (each with a location and a cost), a database of trajectories T, and a budget L, find a set of billboards within the budget to influence the largest number of trajectories. One core challenge is to identify...
Preprint
Many data mining tasks cannot be completely addressed by automated processes, such as sentiment analysis and image classification. Crowdsourcing is an effective way to harness the human cognitive ability to process these machine-hard tasks. Thanks to public crowdsourcing platforms, e.g., Amazon Mechanical Turk and CrowdFlower, we can easily involve...
Conference Paper
Trajectory analytics can benefit many real-world applications, e.g., frequent-trajectory-based navigation systems, road planning, car pooling, and transportation optimization. In this paper, we demonstrate a distributed in-memory trajectory analytics system, DITA, to support large-scale trajectory data analytics. DITA exhibits three unique features....
Conference Paper
Creating good visualizations for ordinary users is hard, even with the help of state-of-the-art interactive data visualization tools such as Tableau and Qlik, because they require users to understand the data and visualizations very well. DeepEye is an innovative visualization system that aims at helping everyone create good visualizations si...