Lixin Zhang

ICT University, Baton Rouge, Louisiana, United States

Publications (60) · 7.02 Total Impact

  • ABSTRACT: This paper presents PARD, a programmable architecture for resourcing-on-demand that provides a new programming interface to convey an application's high-level information like quality-of-service requirements to the hardware. PARD enables new functionalities like fully hardware-supported virtualization and differentiated services in computers. PARD is inspired by the observation that a computer is inherently a network in which hardware components communicate via packets (e.g., over the NoC or PCIe). We apply principles of software-defined networking to this intra-computer network and address three major challenges. First, to deal with the semantic gap between high-level applications and underlying hardware packets, PARD attaches a high-level semantic tag (e.g., a virtual machine or thread ID) to each memory-access, I/O, or interrupt packet. Second, to make hardware components more manageable, PARD implements programmable control planes that can be integrated into various shared resources (e.g., cache, DRAM, and I/O devices) and can differentially process packets according to tag-based rules. Third, to facilitate programming, PARD abstracts all control planes as a device file tree to provide a uniform programming interface via which users create and apply tag-based rules. Full-system simulation results show that by co-locating latency-critical memcached applications with other workloads PARD can improve a four-core computer's CPU utilization by up to a factor of four without significantly increasing tail latency. FPGA emulation based on a preliminary RTL implementation demonstrates that the cache control plane introduces no extra latency and that the memory control plane can reduce queueing delay for high-priority memory-access requests by up to a factor of 5.6.
    ACM SIGPLAN Notices 05/2015; 50(4):131-143. DOI:10.1145/2775054.2694382 · 0.66 Impact Factor
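    A hedged illustration of the kind of tag-based rule programming the abstract describes: control planes are exposed as a device file tree, and rules are applied per tag. The mount point, file names, and rule fields below are hypothetical assumptions for illustration, not PARD's actual interface.

    # Illustrative sketch only: the file-tree layout and rule fields are
    # assumptions inspired by the abstract, not the real PARD API.
    from pathlib import Path

    CONTROL_PLANE_ROOT = Path("/sys/cpa")   # hypothetical mount point of the device file tree

    def apply_rule(control_plane: str, tag: int, **params) -> None:
        """Attach tag-based parameters (e.g., a cache way mask or a memory
        priority) to one control plane by writing files under its directory."""
        rule_dir = CONTROL_PLANE_ROOT / control_plane / f"tag{tag}"
        rule_dir.mkdir(parents=True, exist_ok=True)
        for name, value in params.items():
            (rule_dir / name).write_text(str(value))

    # Give the latency-critical memcached VM (tag 1) most of the LLC and
    # high-priority memory scheduling; a best-effort VM (tag 2) gets the rest.
    apply_rule("last_level_cache", tag=1, waymask="0xFFF0")
    apply_rule("memory_controller", tag=1, priority="high")
    apply_rule("last_level_cache", tag=2, waymask="0x000F")
    apply_rule("memory_controller", tag=2, priority="low")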
  • Zhen Jia · Lei Wang · Jianfeng Zhan · Lixin Zhang · Chunjie Luo · Ninghui Sun ·
    ABSTRACT: Big data analytics applications play a significant role in data centers, and hence it has become increasingly important to understand their behaviors in order to further improve the performance of data center computer systems, in which characterizing representative workloads is a key practical problem. In this paper, after investigating the three most important application domains in terms of page views and daily visitors, we chose 11 representative data analytics workloads and characterized their micro-architectural behaviors by using hardware performance counters, so as to understand the impacts and implications of data analytics workloads on systems equipped with modern superscalar out-of-order processors. Our study reveals that big data analytics applications share many inherent characteristics, which place them in a different class from traditional workloads and scale-out services. To further understand the characteristics of big data analytics workloads, we performed a correlation analysis of CPI (cycles per instruction) with other micro-architecture-level characteristics and an investigation of the impact of the big data software stack on application behaviors. Our correlation analysis showed that even though big data analytics workloads exhibit notable pipeline front-end stalls, the main factor affecting CPI is long-latency data accesses rather than front-end stalls. Our software stack investigation found that the typical big data software stack contributes significantly to the front-end stalls and incurs a bigger working set. Finally, we give several recommendations for architects, programmers, and big data system designers based on the knowledge acquired in this paper.
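    A minimal sketch of the CPI correlation analysis mentioned above: given per-workload hardware-counter measurements, derive CPI and other micro-architectural metrics and correlate them. The CSV file and its column names are assumptions for illustration.

    import pandas as pd

    df = pd.read_csv("counters.csv")   # hypothetical file: one row of raw counters per workload
    df["cpi"] = df["cycles"] / df["instructions"]
    kilo_insns = df["instructions"] / 1000.0
    df["llc_mpki"] = df["llc_misses"] / kilo_insns
    df["branch_mpki"] = df["branch_misses"] / kilo_insns
    df["dtlb_mpki"] = df["dtlb_misses"] / kilo_insns
    df["frontend_stall_ratio"] = df["frontend_stall_cycles"] / df["cycles"]

    # Pearson correlation of CPI against candidate explanatory metrics,
    # strongest first: long-latency data accesses should dominate per the paper.
    metrics = ["llc_mpki", "branch_mpki", "dtlb_mpki", "frontend_stall_ratio"]
    print(df[["cpi"] + metrics].corr()["cpi"].drop("cpi").sort_values(ascending=False))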

  • Fei Xia · Dejun Jiang · Jin Xiong · Mingyu Chen · Lixin Zhang · Ninghui Sun ·
    ABSTRACT: Phase change memory (PCM) is a promising alternative to DRAM for main memory thanks to its better scalability and lower leakage. However, the long write latency of PCM puts it at a severe disadvantage against DRAM. In this paper, we propose a Dynamic Write Consolidation (DWC) scheme to improve PCM memory system performance while reducing energy consumption. This work is motivated by the observation that a large fraction of a cache line being written back to memory is not actually modified. DWC exploits the unnecessary burst writes of unmodified data by consolidating multiple writes targeting the same row into one write. By doing so, DWC enables multiple writes to be served as one. DWC incurs low implementation overhead and shows significant efficiency. The evaluation results show that DWC achieves up to 35.7% performance improvement, and 17.9% on average. The effective write latency is reduced by up to 27.7%, and by 16.0% on average. Moreover, DWC reduces energy consumption by up to 35.3%, and by 13.9% on average.
    ACM 28th International Conference on Supercomputing (ICS 2014), Munich, Germany; 06/2014
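    A toy sketch of the write-consolidation idea described above: track which words of a write-back were actually modified and coalesce write-backs that target the same DRAM row, so unmodified bursts are not re-sent. The line/row sizes and grouping policy are illustrative assumptions, not the paper's hardware design.

    from collections import defaultdict

    WORDS_PER_LINE = 8     # assume 64 B cache lines of 8 B words
    LINES_PER_ROW = 128    # assume 8 KB DRAM rows

    def dirty_mask(old_line, new_line):
        """Bit i is set iff word i changed and actually needs to be written."""
        return sum(1 << i for i, (o, n) in enumerate(zip(old_line, new_line)) if o != n)

    def consolidate(writebacks):
        """Group pending write-backs by DRAM row, keeping only dirty words, so
        several line write-backs to one row can be serviced as a single write."""
        per_row = defaultdict(list)
        for line_addr, old, new in writebacks:
            mask = dirty_mask(old, new)
            if mask:                          # fully unmodified lines are dropped entirely
                per_row[line_addr // LINES_PER_ROW].append((line_addr, mask))
        return per_row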
  • ABSTRACT: In 2005, as chip multiprocessors started to appear widely, it became possible for the on-chip cores to share the last-level cache. At the time, architects considered the last-level cache either divided into per-core private segments or wholly shared. The shared cache utilized the capacity more efficiently but suffered from high, uniform latencies. This paper proposed a new direction: allowing the caches to be non-uniform, with a varying number of processors sharing each section of the cache. Sharing degree, the number of cores sharing a last-level cache, determines the level of replication in on-chip caches and also affects the capacity and latency of each shared cache. Building on our previous work that introduced non-uniform cache architectures (NUCA), this study explored the design space for shared multi-core caches, focusing on the effect of sharing degree. Our observation of a per-application optimal sharing degree led to a static NUCA design with a reconfigurable sharing degree. This work on multicore NUCA cache architectures has been influential in contemporary systems, including the level-3 cache in the IBM POWER7 and POWER8 processors.
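    For readers unfamiliar with the term, "sharing degree" is simply the number of cores that share one last-level cache partition. The toy helper below enumerates the core groupings a 16-core chip would have at a few sharing degrees; the core count is an assumption for the example.

    def cache_sharing_groups(num_cores: int, sharing_degree: int):
        """Return the groups of cores, each group sharing one LLC partition."""
        assert num_cores % sharing_degree == 0
        return [list(range(start, start + sharing_degree))
                for start in range(0, num_cores, sharing_degree)]

    for degree in (1, 4, 16):   # private LLC, clustered LLC, fully shared LLC
        print(degree, cache_sharing_groups(16, degree))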
  • ABSTRACT: Write-optimized data structures like the Log-Structured Merge-tree (LSM-tree) and its variants are widely used in key-value storage systems like Bigtable and Cassandra. Due to deferral and batching, LSM-tree based storage systems need background compactions to merge key-value entries and keep them sorted for future queries and scans. Background compactions play a key role in the performance of LSM-tree based storage systems. Existing studies of background compaction focus on decreasing the compaction frequency, reducing I/Os, or confining compactions to hot key ranges. They do not pay much attention to the computation time in background compactions. However, the computation time is no longer negligible; computation takes more than 60% of the total compaction time in storage systems using flash-based SSDs. Therefore, an alternative way to speed up compaction is to make good use of the parallelism of the underlying hardware, including CPUs and I/O devices. In this paper, we analyze the compaction procedure, identify the performance bottleneck, and propose the Pipelined Compaction Procedure (PCP) to better utilize the parallelism of CPUs and I/O devices. Theoretical analysis shows that PCP can improve the compaction bandwidth. Furthermore, we implement PCP in a real system and conduct extensive experiments. The experimental results show that the pipelined compaction procedure can increase the compaction bandwidth and storage system throughput by 77% and 62%, respectively.
    2014 IEEE International Parallel & Distributed Processing Symposium (IPDPS); 05/2014
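    A minimal sketch of the pipelined-compaction idea described above: read, merge (CPU), and write stages run in separate threads connected by bounded queues so that I/O and computation overlap. The block granularity and the merge step are simplified assumptions, not the paper's implementation.

    import queue
    import threading

    def read_stage(blocks, out_q):
        for block in blocks:              # stand-in for reading blocks from input SSTables
            out_q.put(block)
        out_q.put(None)

    def merge_stage(in_q, out_q):
        while (block := in_q.get()) is not None:
            out_q.put(sorted(block))      # stand-in for merging/sorting key-value entries
        out_q.put(None)

    def write_stage(in_q, output):
        while (block := in_q.get()) is not None:
            output.append(block)          # stand-in for writing a new SSTable block

    def pipelined_compaction(blocks):
        q1, q2, output = queue.Queue(4), queue.Queue(4), []
        stages = [threading.Thread(target=read_stage, args=(blocks, q1)),
                  threading.Thread(target=merge_stage, args=(q1, q2)),
                  threading.Thread(target=write_stage, args=(q2, output))]
        for t in stages:
            t.start()
        for t in stages:
            t.join()
        return output

    print(pipelined_compaction([[3, 1, 2], [9, 7, 8]]))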
  • Yongbing Huang · Zhongbin Zha · Mingyu Chen · Lixin Zhang ·
    ABSTRACT: Mobile devices such as smartphones and tablets have become the primary consumer computing devices, and their rate of adoption continues to grow. The applications that run on these mobile platforms vary in how they use hardware resources, and their diversity is increasing. Performance and power limitations also vary widely across mobile platforms. Thus there is a growing need for tools to help computer architects design systems to meet the needs of mobile workloads. Full-system simulators are invaluable tools for designing new architectures, but we still need appropriate benchmark suites that capture the behaviors of emerging mobile applications. Current benchmark suites cover only a small range of mobile applications, and many cannot run directly in simulators due to their user interaction requirements. In this paper, we introduce and characterize Moby, a benchmark suite designed to make it easier to use full-system architectural simulators to evaluate microarchitectures for mobile processors. Moby contains popular Android applications, including a web browser, a social networking application, an email client, a music player, a video player, a document processing application, and a map program. To facilitate microarchitectural exploration, we port the Moby benchmark suite to the popular gem5 simulator. We characterize the architecture-independent features of Moby applications on the simulator and analyze the architecture-dependent features on a current-generation mobile platform. Our results show that mobile applications exhibit complex instruction execution behaviors and poor code locality, and that current mobile platforms, especially their instruction-related components, cannot meet these requirements.
    2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS); 03/2014
  • Wentao Bao · Binzhang Fu · Mingyu Chen · Lixin Zhang ·
    ABSTRACT: Flourishing large-scale and high-throughput web applications have emphasized the importance of high-density servers for their distinct advantages, such as high computing density, low power, and low space requirements. To achieve these advantages, an efficient intra-server interconnection network is necessary. Most state-of-the-art high-density servers adopt a fully-connected intra-server network to achieve high network performance. Unfortunately, this solution is very expensive due to the high node degree. To address this problem, we exploit the theoretically optimal Moore graph to interconnect the chips within a server. Considering the size of applications, the 50-node Moore graph, namely the Hoffman-Singleton graph, is extensively discussed in this paper. The simulation results show that it attains performance comparable to the fully-connected network at much lower cost. In practice, however, chips may be integrated onto multiple boards, so the graph should be divided into connected sub-graphs of the same size. Unfortunately, state-of-the-art solutions do not consider this production problem and generate heterogeneous sub-graphs. To address this problem, we propose two equivalent-partition solutions for the Hoffman-Singleton graph depending on the density of boards. Finally, we propose and evaluate a deadlock-free routing algorithm for each partition scheme.
    2013 IEEE International Conference on High Performance Computing and Communications (HPCC) & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (EUC); 11/2013
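    For context, the Hoffman-Singleton graph mentioned above can be built with the standard pentagon/pentagram construction: vertex j of pentagon P_h is joined to vertex (h*i + j) mod 5 of pentagram Q_i. The sketch below builds the 50-node graph and checks its degree and diameter; it is not the paper's partitioning or routing code.

    from collections import deque

    def hoffman_singleton():
        adj = {("P", h, j): set() for h in range(5) for j in range(5)}
        adj.update({("Q", i, j): set() for i in range(5) for j in range(5)})

        def link(a, b):
            adj[a].add(b)
            adj[b].add(a)

        for h in range(5):
            for j in range(5):
                link(("P", h, j), ("P", h, (j + 1) % 5))   # pentagon edges
                link(("Q", h, j), ("Q", h, (j + 2) % 5))   # pentagram edges
        for h in range(5):
            for i in range(5):
                for j in range(5):
                    link(("P", h, j), ("Q", i, (h * i + j) % 5))
        return adj

    def eccentricity(adj, src):
        dist, frontier = {src: 0}, deque([src])
        while frontier:
            u = frontier.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    frontier.append(v)
        return max(dist.values())

    g = hoffman_singleton()
    assert len(g) == 50 and all(len(neighbors) == 7 for neighbors in g.values())
    assert max(eccentricity(g, v) for v in g) == 2   # diameter 2: any two chips are two hops apart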
  • Tianni Xu · Xiufeng Sui · Zhicheng Yao · Jiuyue Ma · Yungang Bao · Lixin Zhang ·
    ABSTRACT: Data centers increasingly employ virtualization as a means to ensure performance isolation for latency-sensitive applications while allowing co-location of multiple applications. Previous research has shown that virtualization can offer excellent resource isolation. However, whether virtualization can mitigate interference in micro-architectural resources has not been well studied. This paper presents an in-depth analysis of the performance isolation effect of virtualization technology on various micro-architectural resources (i.e., L1 D-Cache, L2 cache, last-level cache (LLC), hardware prefetchers, and Non-Uniform Memory Access) by mapping the CloudSuite benchmarks to different sockets, different cores of one chip, and different threads of one core. For each resource, we investigate the correlation between performance variations and contention by changing VM mapping policies according to different application characteristics. Our experiments show that virtualization has rather limited micro-architectural isolation effects. Specifically, LLC interference can degrade application performance by as much as 28%. For intra-core resources, application performance degradation can be as much as 27%. Additionally, we outline several opportunities to improve performance by reducing interference from misbehaving VMs.
    2013 IEEE International Conference on High Performance Computing and Communications (HPCC) & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (EUC); 11/2013
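    A rough sketch of the co-location methodology described above (not the paper's exact harness): pin two VM processes to chosen CPUs, then sample shared-resource counters with Linux perf. The CPU numbers, placements, and PIDs are assumptions that depend on the machine's topology.

    import subprocess

    PLACEMENTS = {
        "smt_siblings": (0, 16),   # two hardware threads of one core (hypothetical numbering)
        "same_socket":  (0, 1),    # different cores sharing one LLC
        "cross_socket": (0, 8),    # different sockets (NUMA effects)
    }

    def pin(pid: int, cpu: int) -> None:
        subprocess.run(["taskset", "-cp", str(cpu), str(pid)], check=True)

    def measure(pid: int, seconds: int = 10) -> str:
        result = subprocess.run(
            ["perf", "stat", "-e", "cycles,instructions,LLC-loads,LLC-load-misses",
             "-p", str(pid), "--", "sleep", str(seconds)],
            capture_output=True, text=True)
        return result.stderr       # perf stat prints counter totals to stderr

    def run_experiment(vm_a_pid: int, vm_b_pid: int) -> None:
        for name, (cpu_a, cpu_b) in PLACEMENTS.items():
            pin(vm_a_pid, cpu_a)
            pin(vm_b_pid, cpu_b)
            print(name, measure(vm_a_pid))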
  • Zhen Jia · Lei Wang · Jianfeng Zhan · Lixin Zhang · Chunjie Luo ·
    ABSTRACT: As the amount of data explodes rapidly, more and more corporations are using data centers to make effective decisions and gain a competitive edge. Data analysis applications play a significant role in data centers, and hence it has become increasingly important to understand their behaviors in order to further improve the performance of data center computer systems. In this paper, after investigating the three most important application domains in terms of page views and daily visitors, we choose eleven representative data analysis workloads and characterize their micro-architectural behaviors by using hardware performance counters, in order to understand the impacts and implications of data analysis workloads on systems equipped with modern superscalar out-of-order processors. Our study of the workloads reveals that data analysis applications share many inherent characteristics, which place them in a different class from desktop (SPEC CPU2006), HPC (HPCC), and service workloads, including traditional server workloads (SPECweb2005) and scale-out service workloads (four of the six benchmarks in CloudSuite), and accordingly we give several recommendations for architecture and system optimizations. On the basis of our workload characterization work, we released a benchmark suite named DCBench for typical datacenter workloads, including data analysis and service workloads, with an open-source license on our project home page. We hope that DCBench is helpful for architecture and small-to-medium-scale system research for datacenter computing.
  • ABSTRACT: We now live in an era of big data, and big data applications are becoming more and more pervasive. How to benchmark data center computer systems running big data applications (in short, big data systems) is a hot topic. In this paper, we focus on measuring the performance impacts of diverse applications and scalable volumes of data sets on big data systems. For four typical data analysis applications (an important class of big data applications), we find two major results through experiments: first, the data scale has a significant impact on the performance of big data systems, so we must provide scalable volumes of data sets in big data benchmarks. Second, even though all four applications use simple algorithms, their performance trends differ as data scales increase, and hence we must consider not only variety of data sets but also variety of applications in benchmarking big data systems.
  • Conference Paper: The ARMv8 simulator
    ABSTRACT: In this work, we implement a functional and performance simulator for ARMv8 based on the gem5 infrastructure, which is the first open-source ARMv8 simulator. All ARMv8 A64 instructions other than SIMD are implemented using the gem5 ISA description language. The ARMv8 simulator supports multiple CPU models, multiple memory systems, and the McPAT power model.
    Proceedings of the 27th international ACM conference on International conference on supercomputing; 06/2013
  • Nongda Hu · Binzhang Fu · Xiufeng Sui · Long Li · Tao Li · Lixin Zhang ·
    ABSTRACT: Within today's large-scale data centers, inter-node communication is often the major bottleneck. This fact has recently spurred data center network (DCN) research. Since building a real data center is cost-prohibitive, most DCN studies rely on simulations. Unfortunately, state-of-the-art network simulators have limited support for real-world applications, which prevents researchers from first-hand investigation. To address this issue, we developed a unified, cross-layer simulation framework named DCNSim. By leveraging two widely deployed simulators, DCNSim introduces computer architecture solutions into DCN research. With DCNSim, one can run packet-level network simulation driven by commercial applications while varying computer and network parameters, such as CPU frequency, memory access latency, network topology, and protocols. With extensive validations, we show that DCNSim can accurately capture performance trends caused by changing computer and network parameters. Finally, we argue through several case studies that future DCN research should consider computer architecture factors.
    Proceedings of the ACM International Conference on Computing Frontiers; 05/2013
  • Xiufeng Sui · Tao Sun · Tao Li · Lixin Zhang ·
    ABSTRACT: Cloud computing has demonstrated tremendous capability in a wide spectrum of online services. Virtualization provides an efficient solution to the utilization of modern multicore processor systems while affording significant flexibility. The growing popularity of virtualized datacenters motivates deeper understanding of the interactions between virtual machine management and the micro-architecture behaviors of the privileged domain. We argue that these behaviors must be factored into the design of processor microarchitecture in virtualized datacenters. In this work, we use performance counters on modern servers to study the micro-architectural execution characteristics of the privileged domain while performing various VM management operations. Our study shows that today's state-of-the-art processor still has room for further optimizations when executing virtualized cloud workloads, particularly in the organization of last level caches and on-chip cache coherence protocol. Specifically, our analysis shows that: shared caches could be partitioned to eliminate interference between the privileged domain and guest domains; the cache coherence protocol could support a high degree of data sharing of the privileged domain; and cache capacity or CPU utilization occupied by the privileged domain could be effectively managed when performing management workflows to achieve high system throughput.
    Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on; 01/2013
  • ABSTRACT: Inability to hide main memory latency has been increasingly limiting the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller (AMC) that can execute Active Memory Operations (AMOs). AMOs are select operations sent to and executed on the home memory controller of data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips. In this paper, we present the microarchitecture design of AMC, and the programming model of AMOs. We compare AMOs' performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50× faster barriers, 12× faster spinlocks, 8.5×–15× faster stream/array operations, and 3× faster database queries. We also present an analytical model that can predict the performance benefits of using AMOs with decent accuracy. The silicon cost required to support AMOs is less than 1% of the die area of a typical high performance processor, based on a standard cell implementation.
    The Journal of Supercomputing 10/2012; 62(1). DOI:10.1007/s11227-011-0735-9 · 0.86 Impact Factor
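    A toy software model of the Active Memory Operation idea described above: instead of migrating the cache line to the requesting node, a small operation packet is shipped to the data's home memory controller and executed there. This is purely a conceptual illustration, not the AMC hardware or its real programming interface.

    class HomeMemoryController:
        def __init__(self):
            self.memory = {}   # address -> value held in this controller's home memory

        def amo_fetch_add(self, addr: int, delta: int) -> int:
            """Execute fetch-and-add at the home node; only the old value travels
            back, so the cache line never bounces between requesting nodes."""
            old = self.memory.get(addr, 0)
            self.memory[addr] = old + delta
            return old

    # A barrier counter updated by eight nodes: each node sends one small AMO
    # packet instead of acquiring exclusive ownership of the line.
    home = HomeMemoryController()
    for node in range(8):
        home.amo_fetch_add(addr=0x1000, delta=1)
    assert home.memory[0x1000] == 8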
  • ABSTRACT: With the explosive growth of information, more and more organizations are deploying private cloud systems or renting public cloud systems to process big data. However, no existing benchmark suite evaluates cloud performance at the whole-system level. To the best of our knowledge, this paper proposes the first benchmark suite, CloudRank-D, to benchmark and rank cloud computing systems that are shared for running big data applications. We analyze the limitations of previous metrics, e.g., floating point operations, for evaluating a cloud computing system, and propose two simple, complementary metrics: data processed per second and data processed per Joule. We detail the design of CloudRank-D, which considers representative applications, diversity of data characteristics, and dynamic behaviors of both applications and system software platforms. Through experiments, we demonstrate the advantages of our proposed metrics. In several case studies, we evaluate two small-scale deployments of cloud computing systems using CloudRank-D.
    Frontiers of Computer Science (print) 08/2012; 6(4). DOI:10.1007/s11704-012-2118-7 · 0.43 Impact Factor
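    A worked example of the two metrics named above, data processed per second and data processed per Joule. The input numbers are made-up placeholders purely to show the arithmetic.

    def cloudrank_metrics(data_bytes: float, elapsed_s: float, energy_j: float):
        return {
            "data_per_second": data_bytes / elapsed_s,   # throughput metric
            "data_per_joule":  data_bytes / energy_j,    # energy-efficiency metric
        }

    # e.g., a run that processed 500 GB in 1200 s while the cluster drew 4 kW:
    print(cloudrank_metrics(data_bytes=500e9, elapsed_s=1200.0, energy_j=4000.0 * 1200.0))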
  • Jianfeng Zhan · Lixin Zhang · Ninghui Sun · Lei Wang · Zhen Jia · Chunjie Luo ·
    ABSTRACT: For the first time, this paper systematically identifies three categories of throughput-oriented workloads in data centers: services, data processing applications, and interactive real-time applications, whose respective targets are to increase the volume of throughput in terms of processed requests, processed data, or the maximum number of simultaneous subscribers supported. We coin a new term, high volume computing (HVC), to describe these workloads and the data center computer systems designed for them. We characterize and compare HVC with other computing paradigms, e.g., high throughput computing, warehouse-scale computing, and cloud computing, in terms of levels, workloads, metrics, coupling degree, data scales, and number of jobs or service instances. We also preliminarily report our ongoing work on metrics and benchmarks for HVC systems, which is the foundation of designing innovative data center computer systems for HVC workloads.
    02/2012; DOI:10.1109/IPDPSW.2012.213

Publication Stats

885 Citations
7.02 Total Impact Points


  • 2014
    • ICT University
      Baton Rouge, Louisiana, United States
  • 2012-2014
    • Chinese Academy of Sciences
      • Key Laboratory of Computer System and Architecture
      Beijing, China
  • 2011-2013
    • Technical Institute of Physics and Chemistry
      Beijing, China
  • 2004-2010
    • IBM
      Armonk, New York, United States
  • 2009
    • Pennsylvania State University
      • Department of Computer Science and Engineering
      University Park, Pennsylvania, United States
  • 1998-2001
    • University of Utah
      • School of Computing
      Salt Lake City, UT, United States