Aoying Zhou

Fudan University

About

587
Publications
67,910
Reads
7,389
Citations

Publications (587)
Article
The modern in-memory database (IMDB) can support highly concurrent on-line transaction processing (OLTP) workloads and generate massive transactional logs per second. Quorum-based replication protocols such as Paxos or Raft have been widely used in the distributed databases to offer higher availability and fault-tolerance. However, it is non-trivia...
Article
High-quality news recommendation heavily relies on accurate and timely representations of news documents and user interests. Social information, which usually contains the most recent information about the activities of users and their friends, naturally reflects the dynamics and diversities of user interests. However, existing news recommendation...
Preprint
Programming-based Pre-trained Language Models (PPLMs) such as CodeBERT have achieved great success in many downstream code-related tasks. Since the memory and computational complexity of self-attention in the Transformer grow quadratically with the sequence length, PPLMs typically limit the code length to 512. However, codes in real-world applicati...
Article
Full-text available
Since consensus protocol and execution mechanism act as two key factors for the overall throughput of blockchain systems, how to execute smart contracts efficiently becomes an emergent bottleneck as many high-performance consensus protocols have been proposed in recent years. Due to the existence of Byzantine nodes, existing concurrency approaches...
Conference Paper
Full-text available
Permissioned blockchain is increasingly being used as a collaborative platform for sharing data. However, current blockchain-based data sharing is unable to balance privacy protection and query functionality, limiting its application scenarios. Order-preserving encryption/encoding (OPE) allows encrypting data to prevent privacy leakage while still...
Preprint
Full-text available
Code summarization with deep learning has been widely studied in recent years. Current deep learning models for code summarization generally follow the principle in neural machine translation and adopt the encoder-decoder framework, where the encoder learns the semantic representations from source code and the decoder transforms the learnt represen...
Article
Full-text available
Balanced clustering, which generates clusters of similar sizes, can be useful in a variety of applications. However, existing clustering algorithms either cannot guarantee balanced clustering results or require relatively high time complexities for balanced clustering. In this work, we propose a constrained balanced clustering method, which is refe...
Article
With the rapid development of distributed transactional databases in recent years, there is an urgent need for fair performance evaluation and comparison. Though there are various open-source benchmarks built for databases, there is a lack of a comprehensive study of their applicability to distributed transactional databases. This paper presents a rev...
Preprint
In this paper, we study knowledge tracing in the domain of programming education and make two important contributions. First, we harvest and publish the most comprehensive dataset so far, namely BePKT, which covers various online behaviors in an OJ system, including programming text problems, knowledge annotations, user-submitted code and system-lo...
Article
Many key-value stores use RDMA to optimize the messaging and data transmission between application layer and the storage layer, most of which only provide point-wise operations. Skiplist-based store can support both point operations and range queries, but its CPU-intensive access operations combined with the high-speed network will easily lead to t...
Chapter
We present LinKV, a novel distributed key-value store that can leverage RDMA network to simultaneously provide high performance and strict consistency (i.e., per-key linearizability) for skewed workloads. To avoid the potential performance loss caused by load imbalance under skew, existing solutions will replicate popular items into different nodes...
Chapter
The wide popularity and the maturity of cloud platform promote the development of Cloud Native database systems. On-demand resource configuration is an attractive feature of cloud platforms, but its complexity in resource management challenges the benchmarking of database performance, which is no longer in a stand-alone test environment. Sharing or...
Preprint
Full-text available
We present InferWiki, a Knowledge Graph Completion (KGC) dataset that improves upon existing benchmarks in inferential ability, assumptions, and patterns. First, each testing sample is predictable with supportive data in the training set. To ensure it, we propose to utilize rule-guided train/test generation, instead of conventional random split. Se...
Article
Modern database systems desperately need the ability to support highly scalable transactions and efficient queries simultaneously for real-time applications. One solution is to utilize query optimization techniques on on-line transaction processing (OLTP) systems. The materialized view is considered a panacea for decreasing query latency. However,...
Article
Relation Extraction (RE) is a vital step to complete Knowledge Graph (KG) by extracting entity relations from texts. However, it usually suffers from the long-tail issue. This paper proposes a novel approach to learn relation prototypes from unlabeled texts, to facilitate long-tail RE by transferring knowledge from relation types with sufficient tr...
Article
Full-text available
Aspect-based sentiment analysis has received considerable attention in recent years because it can provide more detailed and specific user opinion information. Most existing methods based on recurrent neural networks usually suffer from two drawbacks: information loss for long sequences and a high time consumption. To address such issues, a hybrid...
Article
Full-text available
Online sellers often produce redundant and lengthy product titles with extra information on e-commerce platforms to attract the attention of customers. Such overlength product titles become a problem when they are displayed on mobile applications. In this paper, the problem of refining redundant and overlength product titles is studied to...
Conference Paper
Full-text available
We demonstrate SChain, a consortium blockchain that scales transaction processing to support large-scale enterprise applications. The unique advantage of SChain stems from the exploitation of both intra- and inter-block concurrency. The intra-block concurrency not only takes advantage of the multi-core processor on a single peer but also leverages t...
Chapter
Intel Optane DC Persistent Memory (PM) is the first commercially available PM product. Although it meets many hypotheses about PM in previous studies, some other design considerations are observed in subsequent tests. For instance, 1) the internal data access granularity in Optane DC PM is 256B, so accesses smaller than 256B will cause read/write am...
Article
State machine replication has been widely used in modern cluster-based database systems. Most commonly deployed configurations adopt the Raft-like consensus protocol, which has a single strong leader which replicates the log to other followers. Since the followers can handle read requests and many real workloads are usually read-intensive, the reco...
Chapter
NER is challenging because of the semantic ambiguities in academic literature, especially for non-Latin languages. Besides, recognizing Chinese named entities needs to consider word boundary information, as words contained in Chinese texts are not separated with spaces. Leveraging word boundary information could help to determine entity boundaries...
Article
Although the emergence of the programmable smart contract makes blockchain systems easily embrace a wide range of industrial services, how to execute smart contracts efficiently becomes a big challenge nowadays. Due to the existence of Byzantine nodes, existing mature concurrency control protocols in database cannot be employed directly, since the...
Article
Owing to the wide deployment of GPS-enabled devices, tremendous numbers of trajectories have been generated in a distributed streaming manner. This opens up new opportunities to track and analyze the movement behaviors of entities. In this work, we focus on the issue of outlier detection over distributed trajectory streams, where the outliers...
Book
This book constitutes the refereed post-conference proceedings of the Second BenchCouncil International Federated Intelligent Computing and Block Chain Conferences, FICC 2020, held in Qingdao, China, in October/November 2020. The 32 full papers and 6 short papers presented were carefully reviewed and selected from 103 submissions. The papers of th...
Preprint
Author disambiguation arises when different authors share the same name, which is a critical task in digital libraries, such as DBLP, CiteULike, CiteSeerX, etc. While the state-of-the-art methods have developed various paper embedding-based methods performing in a top-down manner, they primarily focus on the ego-network of a target name and overloo...
Preprint
Full-text available
Relation Extraction (RE) is a vital step to complete Knowledge Graph (KG) by extracting entity relations from texts. However, it usually suffers from the long-tail issue. The training data mainly concentrates on a few types of relations, leading to the lack of sufficient annotations for the remaining types of relations. In this paper, we propose a ge...
Chapter
The materialized view is considered as a panacea that significantly facilitates query by trading space cost for execution time. However, it also involves the expensive cost of maintenance which trades away the latency. Disk IO cost is an important factor restricting view maintenance performance. To solve this problem, we decouple a complete procedu...
Chapter
With growing data volumes and user numbers, big data applications require higher transaction throughput but lower query latency from database systems. The materialized view accelerates analytical queries by trading space for query efficiency. Nevertheless, it has to be updated under transactional workloads to obtain up-to-second results. U...
Chapter
Deep Neural Networks (DNNs) have been widely adopted in video analysis applications. The computation involved in DNNs is more efficient on GPUs than on CPUs. However, recent serving systems suffer from low GPU utilization, due to limited process parallelism and the storage overhead of DNN models. We propose Euge, which introduces multi-process service (MP...
Chapter
Although we have achieved significant progress in improving the scalability of transactional database systems (OLTP), the presence of contention operations in workloads is still the fundamental limitation in improving throughput. The reason is that the overhead of managing conflict transactions with concurrency control mechanism is proportional to...
Chapter
Since failures in large-scale clusters can lead to severe performance degradation and break system availability, fault tolerance is critical for distributed stream processing systems (DSPSs). Plenty of fault tolerance approaches have been proposed over the last decade. However, there is no systematic work to evaluate and compare them in detail. Pre...
Article
Benchmarks play a crucial role in database performance evaluation, and have been effectively promoting the development of database management systems. With critical transaction processing requirements of new applications, we see an explosion of innovative database technologies for dealing with highly intensive transaction workloads (OLTP) with the...
Chapter
In-memory query processing can be accelerated by caching intermediate query results. Among various types of intermediate results, hash tables used by hash join are ideal objects for caching, as they can benefit a wide range of queries. In this paper, we introduce a fine-grained hash table caching method to benefit the hash-join operator. Our insigh...
Chapter
Redis is a popular key-value store built upon the socket interface, which incurs heavy memory copy overhead within the kernel and considerable CPU overhead to maintain socket connections. The adoption of Remote Direct Memory Access (RDMA), which incorporates outstanding features such as low latency, high throughput, and CPU bypass, makes it practical to sol...
Chapter
Many key-value stores use RDMA to optimize the messaging and data transmission between application layer and storage layer, most of which only provide point-wise operations. Skiplist-based store can support both point operations and range queries, but its CPU-intensive access operations combined with the high-speed network will easily lead to the s...
Chapter
Cargo distribution is one of the most critical issues for the steel logistics industry, whose core task is to determine the cargo loading plan for each truck. Because cargos far outnumber the available transport capacity in the steel logistics industry, traditional policies treat all cargos equally and distribute them to each arriving truck with the aim of maximizing t...
Chapter
Global web services or storage systems have to respond to changes in clients’ access characteristics for lower latency and higher throughput. As access locality is very common, deploying more servers in datacenters close to clients or moving related data between datacenters is the common practice to respond to those changes. Now Paxos-based protoco...
Chapter
To improve the performance for high-contention workloads, modern main-memory database systems seek to design efficient lock managers. However, OLTP engines adopt the classic FCFS strategy to process operations, where the generated execution order does not take current and future conflicts into consideration. In this case, lock dependencies will hap...
Chapter
Spatial data is characterized by spatial location, lack of structure, spatial relationships, and massive volume. However, general commercial databases struggle to meet these requirements, and adding spatial extensions is non-trivial because spatial data in KVS brings new challenges. First, the Key-Value database itself does not have a way...
Article
Logging and replication are commonly used recovery approaches in database systems. To guarantee that the database state is not corrupted due to system crash, database systems rely on a centralized logging method to persist log entries into a stable storage device; to prevent data loss due to device failure, a primary server in the database system p...
Preprint
Full-text available
This paper presents LinSBFT, a Byzantine Fault Tolerance (BFT) protocol with the capacity of processing over 2000 smart contract transactions per second in production. LinSBFT applies to a permissionless, public blockchain system, in which there is no public-key infrastructure, based on the classic PBFT with 4 improvements: (i) LinSBFT...
Article
Log-structured merge tree (LSM-tree) is adopted by many distributed storage systems. It contains a Memtable and a number of SSTables. The Memtable is an in-memory structure and the SSTable is a disk-based structure. Data records are horizontally partitioned over the primary key and stored in different SSTables. Data writes on records are first serv...
Article
Recent years have witnessed a widespread increase of interest in network representation learning (NRL). By far most research efforts have focused on NRL for homogeneous networks like social networks where vertices are of the same type, or heterogeneous networks like knowledge graphs where vertices (and/or edges) are of different types. There has be...
Article
Full-text available
A heterogeneous information network (HIN) is a ubiquitous data model, consisting of multiple types of entities and relations. Names of entities in HINs are inherently ambiguous, making it difficult to fully disambiguate a HIN. In this paper, we introduce the task of exploratory entity linking for HINs. Given a partially disambiguated HIN, we aim at...
Article
The full-replication data storage mechanism, as commonly utilized in existing blockchains, is the barrier to the system's scalability, since it retains a copy of entire blockchain at each node so that the overall storage consumption per block is $O(n)$ with $n$ participants. Yet another drawback is that this mechanism may limit the throughput i...
Preprint
Full-text available
The synthetic workload is essential and critical to the performance evaluation of database systems. When evaluating the database performance for a specific application, the similarity between synthetic workload and real application workload determines the credibility of evaluation results. However, the workload currently used for performance evalua...