Lixin Zhang

Technical Institute of Physics and Chemistry, Beijing, China


Publications (54) · 2.7 Total impact

  • ABSTRACT: Phase change memory (PCM) is a promising alternative to DRAM for main memory thanks to its better scalability and lower leakage. However, the long write latency of PCM puts it at a severe disadvantage against DRAM. In this paper, we propose a Dynamic Write Consolidation (DWC) scheme that improves PCM memory system performance while reducing energy consumption. The work is motivated by the observation that a large fraction of a cache line being written back to memory is not actually modified. DWC exploits the unnecessary burst writes of unmodified data to consolidate multiple writes targeting the same row into one, allowing multiple writebacks to be sent as a single write operation (a simplified sketch of the idea follows this entry). DWC incurs low implementation overhead and is highly effective: evaluation results show that DWC improves performance by up to 35.7%, and by 17.9% on average; the effective write latency is reduced by up to 27.7%, and by 16.0% on average; and energy consumption is reduced by up to 35.3%, and by 13.9% on average.
    ACM 28th International Conference on Supercomputing (ICS 2014), Munich, Germany; 06/2014
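
A minimal sketch of the consolidation idea described above, under my own simplifying assumptions (the paper's queue organization and scheduling microarchitecture are not reproduced): the open row is modeled as an array of word slots, each pending writeback carries a dirty-word mask from the cache, and writebacks whose modified words occupy disjoint slots of the same row merge into a single burst write.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define SLOTS_PER_ROW 16          /* assumed: words one burst can carry   */

typedef struct {
    uint64_t row;                 /* destination row                      */
    uint64_t data[SLOTS_PER_ROW];
    uint32_t dirty_mask;          /* bit i set => slot i actually changed */
} writeback_t;

/* Merge b into a when both target the same row and their modified slots
 * do not collide; one row activation then serves both writebacks. */
static bool try_consolidate(writeback_t *a, const writeback_t *b)
{
    if (a->row != b->row || (a->dirty_mask & b->dirty_mask))
        return false;             /* different row, or conflicting slots  */
    for (int i = 0; i < SLOTS_PER_ROW; i++)
        if (b->dirty_mask & (1u << i))
            a->data[i] = b->data[i];  /* fill a's unused burst slots      */
    a->dirty_mask |= b->dirty_mask;   /* a single write now carries both  */
    return true;
}

int main(void)
{
    writeback_t a = { .row = 42, .dirty_mask = 0x000F };
    writeback_t b = { .row = 42, .dirty_mask = 0x00F0 };
    printf(try_consolidate(&a, &b) ? "consolidated\n" : "sent separately\n");
    return 0;
}
```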
  • ABSTRACT: In 2005, as chip multiprocessors started to appear widely, it became possible for the on-chip cores to share the last-level cache. At the time, architects considered the last-level cache either as divided into per-core private segments or as wholly shared. The shared cache utilized the capacity more efficiently but suffered from high, uniform latencies. This paper proposed a new direction: allowing the cache to be non-uniform, with a varying number of processors sharing each section of the cache. Sharing degree, the number of cores sharing a last-level cache, determines the level of replication in on-chip caches and also affects the capacity and latency of each shared cache. Building on our previous work that introduced non-uniform cache architectures (NUCA), this study explored the design space for shared multi-core caches, focusing on the effect of sharing degree. Our observation of a per-application optimal sharing degree led to a static NUCA design with a reconfigurable sharing degree (a mapping sketch follows this entry). This work on multicore NUCA cache architectures has been influential in contemporary systems, including the level-3 cache in the IBM POWER7 and POWER8 processors.
    06/2014;
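
A minimal sketch of what a reconfigurable sharing degree means for request routing; the mapping function and names below are my own illustration, not the paper's hardware. With N cores and sharing degree S, the last-level cache splits into N/S partitions, each shared by S adjacent cores: S = 1 yields private caches, S = N a fully shared cache.

```c
#include <stdint.h>
#include <stdio.h>

/* Index of the LLC partition serving 'core' when S cores share each one. */
static unsigned llc_partition(unsigned core, unsigned sharing_degree)
{
    return core / sharing_degree;           /* cores 0..S-1 -> partition 0 */
}

/* Bank selection inside a partition: hash the line address across the
 * banks belonging to the partition, so usable capacity scales with S. */
static unsigned llc_bank(uint64_t line_addr, unsigned core,
                         unsigned sharing_degree, unsigned banks_per_core)
{
    unsigned part  = llc_partition(core, sharing_degree);
    unsigned banks = sharing_degree * banks_per_core;  /* banks per partition */
    return part * banks + (unsigned)(line_addr % banks);
}

int main(void)
{
    /* With S = 4, core 5 maps to partition 1 and hashes across its banks. */
    printf("partition %u, bank %u\n",
           llc_partition(5, 4), llc_bank(0xdeadbeefULL, 5, 4, 2));
    return 0;
}
```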
  • ABSTRACT: Write-optimized data structures like the Log-Structured Merge-tree (LSM-tree) and its variants are widely used in key-value storage systems such as Bigtable and Cassandra. Due to deferral and batching, LSM-tree based storage systems need background compactions to merge key-value entries and keep them sorted for future queries and scans. Background compactions play a key role in the performance of LSM-tree based storage systems. Existing studies of background compaction focus on decreasing the compaction frequency, reducing I/Os, or confining compactions to hot key ranges; they pay little attention to the computation time spent in compactions. However, the computation time is no longer negligible: it can take more than 60% of the total compaction time in storage systems using flash-based SSDs. An alternative way to speed up compaction is therefore to make good use of the parallelism of the underlying hardware, including CPUs and I/O devices. In this paper, we analyze the compaction procedure, identify the performance bottleneck, and propose the Pipelined Compaction Procedure (PCP) to better utilize the parallelism of CPUs and I/O devices (a simplified pipeline sketch follows this entry). Theoretical analysis shows that PCP improves compaction bandwidth. Furthermore, we implement PCP in a real system and conduct extensive experiments. The results show that the pipelined compaction procedure increases compaction bandwidth and storage system throughput by 77% and 62%, respectively.
    2014 IEEE International Parallel & Distributed Processing Symposium (IPDPS); 05/2014
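
A minimal sketch of a pipelined compaction in the spirit of PCP, under assumptions of my own: three stages (read, merge, write) connected by bounded queues so that CPU-heavy merging overlaps the read and write I/O. Block contents are reduced to integers; a real compaction would carry key-value buffers.

```c
#include <pthread.h>
#include <stdio.h>

#define QCAP    4
#define NBLOCKS 16
#define DONE   -1

typedef struct {
    int buf[QCAP], head, count;
    pthread_mutex_t m;
    pthread_cond_t not_full, not_empty;
} queue_t;

static void q_put(queue_t *q, int v)
{
    pthread_mutex_lock(&q->m);
    while (q->count == QCAP) pthread_cond_wait(&q->not_full, &q->m);
    q->buf[(q->head + q->count++) % QCAP] = v;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->m);
}

static int q_get(queue_t *q)
{
    pthread_mutex_lock(&q->m);
    while (q->count == 0) pthread_cond_wait(&q->not_empty, &q->m);
    int v = q->buf[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->m);
    return v;
}

static queue_t read2merge  = { .m = PTHREAD_MUTEX_INITIALIZER,
                               .not_full = PTHREAD_COND_INITIALIZER,
                               .not_empty = PTHREAD_COND_INITIALIZER };
static queue_t merge2write = { .m = PTHREAD_MUTEX_INITIALIZER,
                               .not_full = PTHREAD_COND_INITIALIZER,
                               .not_empty = PTHREAD_COND_INITIALIZER };

static void *read_stage(void *arg)   /* stands in for reading table blocks */
{
    (void)arg;
    for (int b = 0; b < NBLOCKS; b++) q_put(&read2merge, b);
    q_put(&read2merge, DONE);
    return NULL;
}

static void *merge_stage(void *arg)  /* stands in for CPU-side merge-sort */
{
    (void)arg;
    int b;
    while ((b = q_get(&read2merge)) != DONE) q_put(&merge2write, b);
    q_put(&merge2write, DONE);
    return NULL;
}

static void *write_stage(void *arg)  /* stands in for writing merged blocks */
{
    (void)arg;
    int b;
    while ((b = q_get(&merge2write)) != DONE) printf("wrote block %d\n", b);
    return NULL;
}

int main(void)
{
    pthread_t t[3];
    pthread_create(&t[0], NULL, read_stage, NULL);
    pthread_create(&t[1], NULL, merge_stage, NULL);
    pthread_create(&t[2], NULL, write_stage, NULL);
    for (int i = 0; i < 3; i++) pthread_join(t[i], NULL);
    return 0;
}
```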
  • ABSTRACT: Mobile devices such as smartphones and tablets have become the primary consumer computing devices, and their rate of adoption continues to grow. The applications that run on these mobile platforms vary in how they use hardware resources, and their diversity is increasing. Performance and power limitations also vary widely across mobile platforms. There is thus a growing need for tools that help computer architects design systems to meet the needs of mobile workloads. Full-system simulators are invaluable for designing new architectures, but we still need benchmark suites that capture the behaviors of emerging mobile applications. Current benchmark suites cover only a small range of mobile applications, and many cannot run directly in simulators because they require user interaction. In this paper, we introduce and characterize Moby, a benchmark suite designed to make it easier to use full-system architectural simulators to evaluate microarchitectures for mobile processors. Moby contains popular Android applications: a web browser, a social networking application, an email client, a music player, a video player, a document processing application, and a map program. To facilitate microarchitectural exploration, we port Moby to the popular gem5 simulator. We characterize the architecture-independent features of Moby applications on the simulator and analyze the architecture-dependent features on a current-generation mobile platform. Our results show that mobile applications exhibit complex instruction-execution behavior and poor code locality, and that current mobile platforms, especially their instruction-related components, cannot meet these requirements.
    2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS); 03/2014
  • ABSTRACT: Flourishing large-scale, high-throughput web applications have emphasized the importance of high-density servers for their distinct advantages, such as high computing density, low power, and low space requirements. Achieving these advantages requires an efficient intra-server interconnection network. Most state-of-the-art high-density servers adopt a fully connected intra-server network, which achieves high network performance but is very expensive due to the high node degree. To address this problem, we exploit the theoretically optimal Moore graph to interconnect the chips within a server. Considering the size of typical applications, this paper focuses on the 50-node Moore graph, namely the Hoffman-Singleton graph (the degree/size arithmetic is sketched after this entry). Simulation results show that it attains performance comparable to the fully connected network at much lower cost. In practice, however, the chips may be integrated onto multiple boards, so the graph must be divided into connected subgraphs of equal size. Existing partitioning solutions do not consider this production constraint and generate heterogeneous subgraphs. To address this problem, we propose two equal-partition solutions for the Hoffman-Singleton graph, depending on the density of the boards. Finally, we propose and evaluate a deadlock-free routing algorithm for each partition scheme.
    2013 IEEE International Conference on High Performance Computing and Communications (HPCC) & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (EUC); 11/2013
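
A small check of the Moore bound behind the 50-node choice: a graph of degree d and diameter 2 has at most 1 + d + d(d-1) nodes, and the Hoffman-Singleton graph attains that bound at d = 7. The snippet verifies the arithmetic only; it does not construct the graph.

```c
#include <stdio.h>

/* Maximum node count of a degree-d graph with diameter 2 (Moore bound). */
static int moore_bound_diam2(int d) { return 1 + d + d * (d - 1); }

int main(void)
{
    for (int d = 3; d <= 8; d++)
        printf("degree %d -> at most %d nodes\n", d, moore_bound_diam2(d));
    /* degree 7 -> at most 50 nodes: each of the 50 chips needs only 7
     * links, versus 49 links per chip in a fully connected network. */
    return 0;
}
```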
  • ABSTRACT: As the amount of data explodes rapidly, more and more corporations are using data centers to make effective decisions and gain a competitive edge. Data analysis applications play a significant role in data centers, so it has become increasingly important to understand their behavior in order to further improve the performance of data center computer systems. In this paper, after investigating the three most important application domains in terms of page views and daily visitors, we choose eleven representative data analysis workloads and characterize their microarchitectural behavior using hardware performance counters (a measurement sketch follows this entry), in order to understand the impact of data analysis workloads on systems equipped with modern superscalar out-of-order processors. Our study reveals that data analysis applications share many inherent characteristics that place them in a class apart from desktop (SPEC CPU2006), HPC (HPCC), and service workloads, including traditional server workloads (SPECweb2005) and scale-out service workloads (four of the six benchmarks in CloudSuite), and accordingly we give several recommendations for architecture and system optimizations. On the basis of this characterization work, we have released a benchmark suite named DCBench for typical datacenter workloads, including data analysis and service workloads, under an open-source license on our project home page at http://prof.ict.ac.cn/DCBench. We hope that DCBench is helpful for architecture and small-to-medium-scale system research for datacenter computing.
    07/2013;
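
A minimal sketch of the kind of hardware-performance-counter measurement such a characterization relies on, here via the Linux perf_event_open interface (an assumption; the abstract does not name the tooling). It counts cycles and instructions around a stand-in workload and reports IPC.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

static int open_counter(unsigned long long config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 0;   /* kernel time matters for these workloads */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int cyc = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    int ins = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    long long c = 0, n = 0;

    ioctl(cyc, PERF_EVENT_IOC_RESET, 0);  ioctl(ins, PERF_EVENT_IOC_RESET, 0);
    ioctl(cyc, PERF_EVENT_IOC_ENABLE, 0); ioctl(ins, PERF_EVENT_IOC_ENABLE, 0);

    volatile double x = 0;                /* stand-in for the real workload */
    for (int i = 0; i < 10000000; i++) x += i * 0.5;

    ioctl(cyc, PERF_EVENT_IOC_DISABLE, 0); ioctl(ins, PERF_EVENT_IOC_DISABLE, 0);
    read(cyc, &c, sizeof(c)); read(ins, &n, sizeof(n));
    printf("IPC = %.2f (%lld instructions / %lld cycles)\n",
           c ? (double)n / (double)c : 0.0, n, c);
    return 0;
}
```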
  • ABSTRACT: We live in an era of big data, and big data applications are becoming more and more pervasive. How to benchmark data center computer systems running big data applications (in short, big data systems) is a hot topic. In this paper, we focus on measuring the performance impact of diverse applications and scalable volumes of data sets on big data systems. For four typical data analysis applications, an important class of big data applications, we find two major results through experiments. First, the data scale has a significant impact on the performance of big data systems, so big data benchmarks must provide scalable volumes of data sets. Second, even though all four applications use simple algorithms, their performance trends differ as the data scale increases, so benchmarking big data systems must consider not only the variety of data sets but also the variety of applications.
    07/2013;
  • Conference Paper: The ARMv8 simulator
    ABSTRACT: In this work, we implement an ARMv8 functional and performance simulator based on the gem5 infrastructure; it is the first open-source ARMv8 simulator. All ARMv8 A64 instructions other than SIMD are implemented using the gem5 ISA description language. The ARMv8 simulator supports multiple CPU models, multiple memory systems, and the McPAT power model.
    Proceedings of the 27th international ACM conference on International conference on supercomputing; 06/2013
  • ABSTRACT: Inability to hide main memory latency has been increasingly limiting the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller (AMC) that can execute Active Memory Operations (AMOs). AMOs are selected operations sent to, and executed on, the home memory controller of the data they touch (a conceptual sketch follows this entry). AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips. In this paper, we present the microarchitecture design of the AMC and the programming model of AMOs. We compare the performance of AMOs to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50× faster barriers, 12× faster spinlocks, 8.5×–15× faster stream/array operations, and 3× faster database queries. We also present an analytical model that predicts the performance benefits of using AMOs with good accuracy. The silicon cost required to support AMOs is less than 1% of the die area of a typical high-performance processor, based on a standard cell implementation.
    The Journal of Supercomputing 10/2012; 62(1). · 0.92 Impact Factor
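
A conceptual sketch of the AMO idea with a made-up software-visible interface (amo_request_t and memory_controller_execute are my names, not the paper's): instead of migrating the cache line to the requesting core, the operation itself travels to the data's home memory controller and executes there, so the only interconnect traffic is one request/response pair.

```c
#include <stdint.h>
#include <stdatomic.h>
#include <stdio.h>

/* Conventional path: the line holding *addr migrates into the requesting
 * core's cache and ping-pongs between nodes under contention. */
static inline uint64_t fetch_add_cacheline(_Atomic uint64_t *addr, uint64_t v)
{
    return atomic_fetch_add(addr, v);
}

/* AMO path (conceptual): the request carries the operation to the home
 * memory controller, which applies it in place; the line never moves. */
typedef struct { uint64_t opcode, addr, operand; } amo_request_t;

static uint64_t memory_controller_execute(amo_request_t req, uint64_t *dram)
{
    uint64_t old = dram[req.addr];       /* read at the home node          */
    dram[req.addr] = old + req.operand;  /* update in place, no line moves */
    return old;                          /* single response message        */
}

int main(void)
{
    uint64_t dram[8] = { [3] = 100 };    /* toy stand-in for home memory  */
    amo_request_t req = { .opcode = 0 /* hypothetical ADD */,
                          .addr = 3, .operand = 5 };
    uint64_t old = memory_controller_execute(req, dram);
    printf("old=%llu new=%llu\n",
           (unsigned long long)old, (unsigned long long)dram[3]);
    return 0;
}
```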
  • ABSTRACT: For the first time, this paper systematically identifies three categories of throughput-oriented workloads in data centers: services, data processing applications, and interactive real-time applications. Their goals are to increase the throughput of processed requests or processed data, or the maximum number of simultaneous subscribers supported, respectively. We coin the term high volume computing (HVC) to describe these workloads and the data center computer systems designed for them. We characterize and compare HVC with other computing paradigms, e.g., high throughput computing, warehouse-scale computing, and cloud computing, in terms of levels, workloads, metrics, coupling degree, data scales, and number of jobs or service instances. We also give a preliminary report on our ongoing work on metrics and benchmarks for HVC systems, which is the foundation for designing innovative data center computer systems for HVC workloads.
    02/2012;
  • ABSTRACT: Desktop cloud replaces traditional desktop computers with completely virtualized systems from the cloud, and it is becoming one of the fastest growing segments in the cloud computing market. However, as far as we know, little work has been done to understand its behavior. On one hand, desktop cloud workloads differ from conventional data center workloads in that they are rich in interactive operations, and from traditional non-virtualized desktop workloads in that they run on an extra layer of software, the hypervisor. On the other hand, desktop cloud servers are mostly built with conventional commodity processors. While such processors are well optimized for traditional desktop and high performance computing workloads, their effectiveness for desktop cloud workloads remains to be studied. As an attempt to shed some light on this question, we have studied the behavior of desktop cloud workloads and compared it with that of SPEC CPU2006, TPC-C, PARSEC, and CloudSuite, evaluating a Xen-based virtualization platform. The performance results reveal that desktop cloud workloads have significantly different characteristics from SPEC CPU2006, TPC-C, and PARSEC, but behave similarly to the data center scale-out benchmarks from CloudSuite. In particular, desktop cloud workloads have a high instruction cache miss rate (12.7% on average), a high percentage of kernel instructions (23% on average), and low IPC (0.36 on average), along with much higher TLB miss rates and lower utilization of off-chip memory bandwidth than traditional benchmarks. Our experimental numbers indicate that the effectiveness of existing commodity processors is quite low for desktop cloud workloads. We provide some preliminary discussion of potential architectural and microarchitectural enhancements, and we hope the performance numbers presented in this paper will give some insights to the designers of desktop cloud systems.
    Workload Characterization (IISWC), 2012 IEEE International Symposium on; 01/2012
  • ABSTRACT: Building energy-efficient systems is critical for big data applications. This paper investigates and compares the energy consumption and execution time of a typical Hadoop-based big data application running on a traditional Xeon-based cluster and on an Atom-based (micro-server) cluster. Our experimental results show that the micro-server platform is more energy-efficient than the Xeon-based platform. They also reveal that data compression and decompression account for a considerable fraction of the total execution time: 7-11% of the execution time of the map tasks and 37.9-41.2% of the execution time of the reduce tasks. Based on these findings, we demonstrate the case for a heterogeneous architecture for energy-efficient big data processing, one that combines the advantages of micro-server processors with hardware compression/decompression accelerators. In addition, we propose a mechanism that enables the accelerators to perform data compression/decompression more efficiently.
    10/2011;
  • ABSTRACT: The High-Performance Computing ecosystem consists of a large variety of execution platforms that demonstrate a wide diversity in hardware characteristics such as CPU architecture, memory organization, interconnection network, and accelerators. This environment also presents a number of hard boundaries (walls) for applications, which limit software development (the parallel programming wall), performance (the memory wall, the communication wall), and viability (the power wall). The only way to survive in such a demanding environment is adaptation. In this paper we discuss how dynamic information collected during the execution of an application can be used to adapt the execution context, potentially yielding performance gains beyond those provided by static information and compile-time adaptation. We consider specialization based on dynamic information such as user input, architectural characteristics like the memory hierarchy organization, and the execution profile of the application as obtained from the execution platform's performance monitoring units. One of the challenges for future execution platforms is to allow the seamless integration of these kinds of information with information obtained from static analysis (either ahead-of-time or just-in-time compilation). We extend the notion of information-driven adaptation and outline the architecture of an infrastructure designed to enable information flow and adaptation throughout the life-cycle of an application.
    06/2011;
  • ABSTRACT: The transistor density of microprocessors continues to increase as technology scales. Microprocessor designers have taken advantage of the additional transistors by integrating a significant number of cores onto a single die. However, large numbers of cores bring diminishing returns due to software and hardware scalability issues, so designers have started integrating on-chip special-purpose logic units (i.e., accelerators) that were previously available as PCI-attached units. More accelerators are expected to be integrated on-chip, given the increasing abundance of transistors and the fact that not all logic can be powered at all times within the power budget. Thus, on-chip accelerator architectures deserve more attention from the research community, and there is a wide spectrum of research opportunities for the design and optimization of accelerators. This paper attempts to bring out some insights by studying the data access streams of on-chip accelerators, in the hope of fostering future research in this area. Specifically, it uses a few simple case studies to show some of the common characteristics of the data streams introduced by on-chip accelerators (a sketch of one such pattern follows this entry), discusses challenges and opportunities in exploiting these characteristics to optimize the power and performance of accelerators, and then analyzes the effectiveness of some simple optimizing extensions that we propose.
    17th International Conference on High-Performance Computer Architecture (HPCA-17 2011), February 12-16 2011, San Antonio, Texas, USA; 01/2011
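
A minimal sketch of one common characteristic the paper points to: accelerator data streams are often long strided sequences that a very simple detector can learn and prefetch. The detector below is my own illustration, not one of the paper's proposed extensions.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    uint64_t last_addr;
    int64_t  stride;
    int      confidence;   /* consecutive matches of the same stride */
} stream_detector_t;

/* Feed each accelerator memory access; returns true (with a prefetch
 * address) once the same stride has been seen twice in a row. */
static bool observe(stream_detector_t *d, uint64_t addr, uint64_t *prefetch)
{
    int64_t s = (int64_t)(addr - d->last_addr);
    d->confidence = (s == d->stride) ? d->confidence + 1 : 0;
    d->stride = s;
    d->last_addr = addr;
    if (d->confidence >= 2) {
        *prefetch = addr + (uint64_t)s;   /* next element of the stream */
        return true;
    }
    return false;
}

int main(void)
{
    stream_detector_t d = { 0, 0, 0 };
    uint64_t accesses[] = { 0x1000, 0x1040, 0x1080, 0x10c0 }, pf;
    for (int i = 0; i < 4; i++)
        if (observe(&d, accesses[i], &pf))
            printf("prefetch 0x%llx\n", (unsigned long long)pf);
    return 0;
}
```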
  • ABSTRACT: Search is the most heavily used web application in the world and is still growing at an extraordinary rate. Understanding the behavior of web search engines is therefore becoming increasingly important to the design and deployment of the data center systems that host them. In this paper, we study three search query traces collected from real-world web search engines at three different search service providers. The first part of our study uncovers the patterns hidden in the query traces by analyzing the variations, frequencies, and locality of query requests. Our analysis reveals that, contrary to some previous studies, real-world query traces do not follow well-defined probability models such as the Poisson distribution or the log-normal distribution (a sketch of one such test follows this entry). The second part of our study deploys the real query traces, together with three synthetic traces generated using probability models proposed by other researchers, on a Nutch-based search engine. The measured performance data further confirm that synthetic traces do not accurately reflect the real ones. We develop an evaluation tool that collects performance metrics online with negligible overhead, including average response time, CPU utilization, disk accesses, and cycles per instruction. The third part of our study compares the search engine with representative benchmarks, namely GridMix, SPECweb2005, TPC-C, SPEC CPU2006, and HPCC, with respect to basic architecture-level characteristics and performance metrics such as instruction mix, processor pipeline stall breakdown, memory access latency, and disk accesses. The experimental results show that web search engines have a high percentage of load/store instructions but good cache/memory performance. We hope the results presented in this paper will enable system designers to gain insights into optimizing systems that host search engines.
    01/2011;
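
A minimal sketch of one way to test a trace against a Poisson model, assuming a list of request timestamps (the abstract does not detail the statistical method used). For a Poisson process the inter-arrival times are exponential, so their coefficient of variation should be close to 1; bursty real traces deviate markedly.

```c
#include <math.h>
#include <stdio.h>

/* Coefficient of variation (stddev/mean) of the inter-arrival times. */
static double interarrival_cv(const double *ts, int n)
{
    double mean = 0, var = 0;
    for (int i = 1; i < n; i++) mean += ts[i] - ts[i - 1];
    mean /= (n - 1);
    for (int i = 1; i < n; i++) {
        double d = (ts[i] - ts[i - 1]) - mean;
        var += d * d;
    }
    return sqrt(var / (n - 2)) / mean;   /* sample stddev over mean */
}

int main(void)
{
    /* Hypothetical timestamps in seconds; replace with a real trace. */
    double ts[] = { 0.00, 0.01, 0.02, 0.03, 1.50, 1.51, 1.52, 3.00 };
    printf("CV = %.2f (a Poisson arrival process would give ~1.0)\n",
           interarrival_cv(ts, sizeof ts / sizeof ts[0]));
    return 0;
}
```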
  • ABSTRACT: This paper presents two complementary techniques to manage the power consumption of large-scale systems with a packet-switched interconnection network. First, we propose the Thrifty Interconnection Network (TIN), in which network links are activated and de-activated dynamically with little or no overhead, using inherent system events to trigger link activation or de-activation at the right time. Second, we propose Network Power Shifting (NPS), which dynamically shifts the power budget between the compute nodes and their attached network components. TIN activates and trains the links just in time, before network communication is about to happen, and thriftily puts them into a low-power mode when communication finishes, reducing unnecessary network power consumption. With NPS, the compute nodes can then absorb the extra power budget shifted from their attached network components and increase their processor frequency for higher performance (a simplified sketch follows this entry). Our simulation results on a set of real-world workload traces show that TIN achieves on average 60% network power reduction with the support of only one low-power mode. When NPS is enabled, the two together achieve a 12% application performance improvement and a 13% overall system energy reduction. Further performance improvement is possible if the compute nodes can speed up further and fully utilize the extra power budget reinvested from the thrifty network, with more aggressive cooling support.
    17th International Conference on High-Performance Computer Architecture (HPCA-17 2011), February 12-16 2011, San Antonio, Texas, USA; 01/2011
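
A minimal sketch of the TIN + NPS interplay with invented numbers: a link activated just before a communication phase and parked afterward, with the parked link's unused power budget shifted to the cores. The trigger events, wattages, and function names are illustrative assumptions only, not the paper's mechanism.

```c
#include <stdio.h>
#include <stdbool.h>

#define LINK_ACTIVE_W 5.0   /* assumed link power when active (watts)    */
#define LINK_PARKED_W 0.5   /* assumed link power in the low-power mode  */

typedef struct {
    bool   link_active;
    double core_budget_w;   /* power budget left for the cores           */
} node_t;

/* TIN: a system event (e.g., a collective about to start) activates the
 * link just in time; the end of the phase parks it again.
 * NPS: whatever the parked link does not burn is shifted to the cores,
 * which can then raise their frequency within the node budget. */
static void on_comm_phase(node_t *n, bool starting, double node_budget_w)
{
    n->link_active   = starting;
    n->core_budget_w = node_budget_w -
                       (starting ? LINK_ACTIVE_W : LINK_PARKED_W);
    printf("link %s, core budget %.1f W\n",
           starting ? "active" : "parked", n->core_budget_w);
}

int main(void)
{
    node_t n = { false, 0 };
    on_comm_phase(&n, true, 100.0);   /* compute ends, communication begins */
    on_comm_phase(&n, false, 100.0);  /* comm done: power shifts to cores   */
    return 0;
}
```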