Li Zha’s research while affiliated with Chinese Academy of Sciences and other places


Publications (48)


CirroData: Yet Another SQL-on-Hadoop Data Analytics Engine with High Performance
  • Article

January 2020 · 49 Reads · 5 Citations

Journal of Computer Science and Technology

Zheng-Hao Jin · Ying-Xin Hu · [...]

This paper presents CirroData, a high-performance SQL-on-Hadoop system designed for Big Data analytics workloads. As a home-grown, enterprise-level online analytical processing (OLAP) system backed by more than seven years of research and development (R&D), CirroData gives us the opportunity to share with the community the design details behind its high performance. Multiple optimization techniques are discussed in the paper; their effectiveness and efficiency have been validated by our customers' daily usage. We present benchmark-level studies as well as several real application case studies of CirroData. Our evaluations show that CirroData outperforms various counterpart database systems in the community, such as "Spark+Hive", "Spark+HBase", Impala, DB-X/Y, Greenplum, HAWQ, and others. CirroData achieves up to 4.99x speedup over Greenplum, HAWQ, and Spark on the standard TPC-H queries. Application-level evaluations demonstrate that CirroData outperforms "Spark+Hive" and "Spark+HBase" by up to 8.4x and 38.8x, respectively. Meanwhile, CirroData achieves speedups on some application workloads of up to 20x, 100x, 182.5x, 92.6x, and 55.5x compared with Greenplum, DB-X, Impala, DB-Y, and HAWQ, respectively.


A Survey on Deep Learning Benchmarks: Do We Still Need New Ones?
  • Chapter
  • Full-text available

October 2019 · 141 Reads · 10 Citations

Lecture Notes in Computer Science

Deep Learning has recently been gaining popularity. From the micro-architecture field up to end applications, a great deal of research work has been proposed in the literature to advance knowledge of Deep Learning, and Deep Learning benchmarking is one of the hot spots in the community. A number of Deep Learning benchmarks are already available, and new ones keep appearing. However, we find few survey works that give an overview of these useful benchmarks, and little discussion of what has been done for Deep Learning benchmarking and what is still missing. To fill this gap, this paper provides a survey of multiple high-impact Deep Learning benchmarks with training and inference support, and shares our observations and discussions on them. We believe the community still needs more benchmarks to capture different perspectives, while these benchmarks also need a path toward converging on a standard.



Characterizing and accelerating indexing techniques on distributed ordered tables

[Figure 5: Parallel Insert · Figure 6: System Architecture]

December 2017 · 136 Reads · 2 Citations
In recent years, most Web 2.0/3.0 applications have been built on top of distributed systems that allow data to be modeled as Distributed Ordered Tables (DOTs), such as Apache HBase. To analyze the stored data, SQL-like range queries over a DOT are a fundamental requirement. However, range queries over existing DOT implementations are highly inefficient. Several secondary index techniques have been proposed to alleviate this issue, but they introduce additional overhead while creating and updating the index. Moreover, index techniques introduce several additional challenges for DOTs, particularly network communication and thread models for concurrent request processing. In this paper, we first characterize the performance of index techniques on DOTs from a networking perspective. We then propose an RDMA-based high-performance communication framework, using HBase as the underlying DOT implementation, to accelerate these techniques. We propose several thread models for our RDMA-based design and compare their performance. We design a parallel insert operation to reduce index creation overhead, and we design several benchmarks to evaluate DOT-based systems. Experimental evaluations with state-of-the-art index techniques (CCIndex and Apache Phoenix) show that our design can reduce the insert overhead for secondary indices to just 23%. Evaluation with TPC-H queries demonstrates an increase in query throughput of up to 2x, while application evaluation with real-world workloads and data (100M records) provided by AdMaster Inc. shows up to a 35% reduction in execution time.
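To illustrate why secondary indexes help range and value queries but inflate write cost, here is a minimal single-node sketch in Python. This is a toy model, not HBase, CCIndex, or Phoenix; the class and method names are purely illustrative. The primary table is ordered by row key, so key-range scans are cheap, while queries on a value column must either scan everything or consult a secondary index that has to be maintained on every write.

```python
import bisect

class OrderedTable:
    """Toy model of a distributed ordered table (DOT): rows sorted by
    primary key, plus a secondary index mapping values back to keys."""

    def __init__(self):
        self.keys = []    # sorted primary keys
        self.rows = {}    # key -> value
        self.index = {}   # value -> set of keys (secondary index)

    def put(self, key, value):
        # Maintain the secondary index on every write: this extra work
        # is the insert overhead that index techniques introduce.
        old = self.rows.get(key)
        if old is not None:
            self.index[old].discard(key)
        else:
            bisect.insort(self.keys, key)
        self.rows[key] = value
        self.index.setdefault(value, set()).add(key)

    def range_scan(self, lo, hi):
        # Primary-key range query: efficient because keys are ordered.
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_right(self.keys, hi)
        return [(k, self.rows[k]) for k in self.keys[i:j]]

    def query_by_value(self, value):
        # Secondary-index lookup avoids a full table scan.
        return sorted(self.index.get(value, ()))

t = OrderedTable()
for k, v in [("r1", "ad"), ("r2", "news"), ("r3", "ad")]:
    t.put(k, v)
print(t.range_scan("r1", "r2"))   # [('r1', 'ad'), ('r2', 'news')]
print(t.query_by_value("ad"))     # ['r1', 'r3']
```

The `put` path shows why index maintenance inflates insert cost: every write touches both the primary table and the index structure, which is the overhead a parallel insert design can attack.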



Learning Cost-Effective Social Embedding for Cascade Prediction

October 2016 · 24 Reads · 6 Citations

Communications in Computer and Information Science

Given a message, cascade prediction aims to predict the individuals who will potentially retweet it. Most existing methods either exploit demographic, structural, and temporal features for prediction, or explicitly rely on particular information diffusion models. Recently, researchers have attempted to design fully data-driven methods for cascade prediction (i.e., methods requiring neither human-defined features nor information diffusion models), directly leveraging historical cascades to learn interpersonal proximity and then making predictions based on the learned proximity. One widely used way to represent interpersonal proximity is social embedding, i.e., embedding each individual into a low-dimensional latent metric space. A challenging problem is designing a cost-effective method to learn social embeddings from cascades. In this paper, we propose a position-aware asymmetric embedding method to effectively learn social embeddings for cascade prediction. Unlike existing methods, where individuals are embedded into a single latent space, our method embeds each individual into two latent spaces: a latent influence space and a latent susceptibility space. Furthermore, our method employs the occurrence position of individuals in cascades to improve the learning efficiency of social embedding. We validate the proposed method on a dataset extracted from Sina Weibo. Experimental results demonstrate that the proposed model outperforms state-of-the-art social embedding methods in both learning efficiency and prediction accuracy.
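The two-space idea can be sketched as a small logistic-loss trainer in plain Python. This is a hedged toy, not the paper's actual algorithm: the function names, dimension, learning rate, and epoch count are illustrative assumptions, and the position-aware component is omitted for brevity. The key point it demonstrates is asymmetry: each user u gets an influence vector and a susceptibility vector, and P(v retweets after u) is scored by infl[u] . susc[v], so score(u, v) and score(v, u) generally differ.

```python
import math
import random

def train_asymmetric_embedding(pairs, users, dim=8, lr=0.1, epochs=200, seed=0):
    """Toy sketch of asymmetric social embedding: embed each user into an
    influence space and a susceptibility space, trained by SGD on
    logistic loss over (influencer, follower, label) triples."""
    rng = random.Random(seed)
    infl = {u: [rng.gauss(0, 0.1) for _ in range(dim)] for u in users}
    susc = {u: [rng.gauss(0, 0.1) for _ in range(dim)] for u in users}
    for _ in range(epochs):
        for u, v, label in pairs:   # label 1: v followed u in a cascade
            s = sum(a * b for a, b in zip(infl[u], susc[v]))
            p = 1.0 / (1.0 + math.exp(-s))
            g = p - label           # gradient of logistic loss w.r.t. score
            for i in range(dim):
                iu, sv = infl[u][i], susc[v][i]
                infl[u][i] -= lr * g * sv
                susc[v][i] -= lr * g * iu
    return infl, susc

def score(infl, susc, u, v):
    """Predicted probability that v retweets a message originated by u."""
    s = sum(a * b for a, b in zip(infl[u], susc[v]))
    return 1.0 / (1.0 + math.exp(-s))

# Toy cascade data: "b" retweeted "a"; "c" did not.
cascade = [("a", "b", 1), ("a", "c", 0)]
infl, susc = train_asymmetric_embedding(cascade, users=["a", "b", "c"])
```

After training, `score(infl, susc, "a", "b")` should exceed `score(infl, susc, "a", "c")`, reflecting the learned influence/susceptibility asymmetry.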



An Uncoupled Data Process and Transfer Model for MapReduce

January 2015 · 28 Reads · 3 Citations

Lecture Notes in Computer Science

In the original MapReduce model, reduce tasks fetch the output data of map tasks in a "pull" manner: reduce tasks occupying reduce slots cannot start executing until all the corresponding map tasks are completed. This forms a dependence between map and reduce tasks, which we call the coupled relationship in this paper. The coupled relationship leads to two problems: reduce slot hoarding and underutilized network bandwidth. Meanwhile, storing the result data is costly, especially when the system keeps replicas, which leads to an inefficient storage problem. We propose an uncoupled data process and transfer model to address these problems. Four core techniques, namely weighted mapping, data pushing, partial data backup, and data compression, are introduced and applied in Apache Hadoop, the mainstream open-source implementation of the MapReduce model. This work has been put into practice at Baidu, the biggest search engine company in China. A real-world application for web data processing shows that our model can improve system throughput by 29.5%, reduce the total wall time by 22.8%, provide a weighted wall-time acceleration of 26.3%, and reduce the result data stored on disk by 70%. Moreover, the implementation of this model is transparent to users and compatible with the original Hadoop.
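The "data pushing" idea can be simulated on one machine with a queue per reducer. This is a minimal sketch under stated assumptions, not the paper's Hadoop implementation: mappers push each partitioned record as soon as it is produced, so reducers can begin consuming before all map tasks finish, instead of pulling only after the map phase completes.

```python
import queue
import threading
from collections import defaultdict

NUM_MAPPERS, NUM_REDUCERS = 2, 2
SENTINEL = None
# One queue per reducer: mappers push output eagerly ("data pushing")
# rather than reducers pulling after all maps complete (the coupled model).
outboxes = [queue.Queue() for _ in range(NUM_REDUCERS)]
results = [defaultdict(int) for _ in range(NUM_REDUCERS)]

def map_task(records):
    for key, value in records:
        outboxes[hash(key) % NUM_REDUCERS].put((key, value))  # push eagerly
    for q in outboxes:               # signal every reducer this mapper is done
        q.put(SENTINEL)

def reduce_task(rid):
    done = 0
    while done < NUM_MAPPERS:        # consume until all mappers signalled
        item = outboxes[rid].get()
        if item is SENTINEL:
            done += 1
        else:
            key, value = item
            results[rid][key] += value   # word-count-style aggregation

splits = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
reducers = [threading.Thread(target=reduce_task, args=(r,))
            for r in range(NUM_REDUCERS)]
for r in reducers:
    r.start()                        # reducers start BEFORE maps finish
for split in splits:
    map_task(split)                  # run mappers (serially here for brevity)
for r in reducers:
    r.join()
merged = {k: v for part in results for k, v in part.items()}
print(merged)                        # counts: a=2, b=1, c=1
```

Because the reducer threads start before any map output exists and drain the queues concurrently, the shuffle overlaps with map execution, which is the bandwidth-utilization benefit the uncoupled model targets.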


SeaBase: An Implementation of Cloud Database

November 2014 · 49 Reads · 3 Citations

A cloud database usually refers to a database built on cloud computing technology. However, as far as we know, pre-existing cloud database solutions cannot integrate data from multi-sourced heterogeneous databases, supplying only an isolated homogeneous database cluster. This paper presents a new implementation approach for a cloud database, SeaBase, which integrates various data types into a unified one based on the CCEVP (Cloud Computing-based Effective-Virtual-Physical) model. The results of our experiments show that SeaBase is feasible and practical.


DataMPI: Extending MPI to Hadoop-like Big Data Computing

May 2014 · 130 Reads · 67 Citations

MPI has been widely used in High Performance Computing. In contrast, such efficient communication support is lacking in the field of Big Data Computing, where communication is realized by time-consuming techniques such as HTTP/RPC. This paper takes a step toward bridging these two fields by extending MPI to support Hadoop-like Big Data Computing jobs, in which large numbers of key-value pair instances must be processed and communicated through distributed computation models such as MapReduce, Iteration, and Streaming. We abstract the characteristics of key-value communication patterns into a bipartite communication model, which reveals four distinctions from MPI: Dichotomic, Dynamic, Data-centric, and Diversified features. Utilizing this model, we propose the specification of a minimalistic extension to MPI. An open-source communication library, DataMPI, is developed to implement this specification. Performance experiments show that DataMPI has significant advantages in performance and flexibility, while maintaining the high productivity, scalability, and fault tolerance of Hadoop.
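The bipartite key-value exchange can be modeled in a few lines of plain Python. This is a single-process simulation of the Dichotomic/Data-centric pattern, not the DataMPI API; the function name and data are invented for illustration. Every O (origin) process buckets its key-value pairs by hashing the key to choose an A (acceptor) process, and each A process then groups the values it receives by key, which a real implementation would realize with an alltoall-style MPI exchange.

```python
from collections import defaultdict

def bipartite_exchange(o_outputs, num_a):
    """Toy model of a bipartite key-value exchange: each (key, value)
    produced by an O process is routed to the A process chosen by
    hashing the key; A processes group received values by key."""
    # Phase 1: each O process buckets its pairs per destination A process.
    messages = [[[] for _ in range(num_a)] for _ in o_outputs]
    for o_rank, pairs in enumerate(o_outputs):
        for key, value in pairs:
            messages[o_rank][hash(key) % num_a].append((key, value))
    # Phase 2: each A process receives from every O process and groups
    # by key (an alltoall-style exchange in a real MPI library).
    grouped = [defaultdict(list) for _ in range(num_a)]
    for a_rank in range(num_a):
        for o_rank in range(len(o_outputs)):
            for key, value in messages[o_rank][a_rank]:
                grouped[a_rank][key].append(value)
    return grouped

# Two O processes emit key-value pairs; two A processes accept them.
o_outputs = [[("x", 1), ("y", 2)], [("x", 3)]]
grouped = bipartite_exchange(o_outputs, num_a=2)
merged = {k: sorted(v) for g in grouped for k, v in g.items()}
print(merged)   # {'x': [1, 3], 'y': [2]}
```

The model captures the Dichotomic feature (two disjoint process groups) and the Data-centric feature (routing is determined by key content, not by fixed ranks), which is where the pattern departs from classic point-to-point MPI.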


Citations (39)


... In this paper, the Aadhaar data analysis is carried out on different distributed computing frameworks mainly MapReduce [1], Hive [3] and Apache Spark [4] on top of Hadoop. ...

Reference:

Aadhaar Data Analysis Comparison in MapReduce, Hive and Spark
CirroData: Yet Another SQL-on-Hadoop Data Analytics Engine with High Performance
  • Citing Article
  • January 2020

Journal of Computer Science and Technology

... For example, MLPerf [50] is a comprehensive benchmark for measuring ML inference performance across a spectrum of use cases. Architecture-oriented DNN benchmarks [39], [32], [38] target on analyzing the architectural features of DNNs on computing systems of different sizes. MDLBench [39], Embench [32] and AIoTBench [23] are representative benchmarks that characterize the features of different AI models on edge or mobile devices while NNBench-X [57], GNNMark [46] target on acceleration hardware design for different DNNs. ...

A Survey on Deep Learning Benchmarks: Do We Still Need New Ones?

Lecture Notes in Computer Science

... Many large-scale distributed computing organizations that need to store and maintain continuous amounts of data deploy distributed storage systems, such as HDFS [4,5], GFS [2,3] (which were mentioned above), Ceph [12], Microsoft Azure [13,14], Amazon S3 [15], Alluxio [16] etc., which comprise multiple nodes, often organized into groups called racks. Currently, most of these systems write and store large data as blocks of fixed size, which are distributed almost evenly among the system's nodes using random block placement or load balancing policies. ...

The Performance Analysis of Cache Architecture Based on Alluxio over Virtualized Infrastructure
  • Citing Conference Paper
  • May 2018

... Here, the "Spark+HBase" scheme performs the worst which is mainly because HBase is not a good storage engine for processing complex queries with a large number of data accesses. HBase is mainly designed for point query and some range queries with the help of additional index techniques [17] . Compared with "Spark+Hive", CirroData has a better distributed query execution plan, and the tasks are scheduled in a more load-balanced manner. ...

Characterizing and accelerating indexing techniques on distributed ordered tables

... For example, when executing an UPDATE statement, MariaDB updates a table's secondary indexes in the same transaction as the table rows [91]. However, in systems that implement asynchronous (lazy) derived state maintenance policies [111,130,135] derived state can become stale with respect to corpus. ...

The Consistency Analysis of Secondary Index on Distributed Ordered Tables

... [30] models the information propagation as heat diffusion in a high dimensional latent space through which the node's representation is learned. Following this work, [31,32] make modifications to improve the performance. The introduction of the deep learning method makes the use of network topology, temporal order and other features much more convenient, giving rise to a quick shift in designing the model. ...

Learning Cost-Effective Social Embedding for Cascade Prediction
  • Citing Conference Paper
  • October 2016

Communications in Computer and Information Science

... Sometimes not all mobiles have cameras. Some devices can play video and some cannot [42]. Quality assurance of mobile apps ensures that high-quality apps are produced based on standards and that high-quality products are delivered to consumers. ...

Modeling and Designing Fault-Tolerance Mechanisms for MPI-Based MapReduce Data Computing Framework

... After acquiring data and understanding the data structure, the process that distinguishes a standard analytics project from a big data project is the data split process, which applies the map and reduce concept. In the map and reduce concept, the master distributes idle tasks to the slaves [25]. A large file that is infeasible for standard processing tools is distributed into several files. ...

An Uncoupled Data Process and Transfer Model for MapReduce

Lecture Notes in Computer Science

... Although Google has captured the search technology market globally, its search experience feels like being lost in a maze in contrast to what libraries do – i.e. select, acquire and organise information in a systematic way. As Walker (2009) stated: "although most students today would cite Google as their premier source of research, to date, Google – and other such search engines – is unlikely to fully satisfy the researcher who seeks credible and unbiased scholarly content, which is not always available free of charge". Open content platforms and systems have gained wide attention as they do not hold data hostage, but stimulate a collaborative, contributive and open content development environment. ...

Research on resource discovery mechanisms in grids
  • Citing Article
  • December 2003

... The research of Grid-Cloud based IT platforms mainly includes three sub-fields: system architecture, software mechanism, and resource allocation. A great deal of existing research concerning the software mechanisms that will be necessary to bring Grid and Cloud to fruition is underway [9-18]. Thereby, we are able to abstract away implementation details and focus on the scalable service system architecture and the effective Grid-Cloud resource allocation. ...

Service oriented VEGA grid system software design and evaluation
  • Citing Article
  • April 2005