Lixin Zhang

Technical Institute of Physics and Chemistry, Beijing, China

Publications (47) · 2.3 Total impact

  • ABSTRACT: Phase change memory (PCM) is a promising alternative to DRAM for main memory thanks to its better scalability and lower leakage. However, the long write latency of PCM puts it at a severe disadvantage against DRAM. In this paper, we propose a Dynamic Write Consolidation (DWC) scheme to improve PCM memory system performance while reducing energy consumption. This work is motivated by the observation that a large fraction of a cache line being written back to memory is not actually modified. DWC exploits the unnecessary burst writes of unmodified data by consolidating multiple writes targeting the same row into one write, enabling multiple writes to be serviced as one. DWC incurs low implementation overhead and is highly effective. The evaluation results show that DWC achieves up to a 35.7% performance improvement, 17.9% on average. The effective write latency is reduced by up to 27.7%, 16.0% on average. Moreover, DWC reduces energy consumption by up to 35.3%, 13.9% on average.
    ACM 28th International Conference on Supercomputing (ICS 2014), Munich, Germany; 06/2014
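    A minimal sketch of the consolidation idea from the abstract above, not the authors' hardware design: the line/word/row geometry, the per-word dirty masks, and the consolidate helper are all illustrative assumptions.

    ```python
    # Toy model of Dynamic Write Consolidation: queued write-backs carry
    # per-word dirty masks, and writes to the same DRAM row merge into one
    # row-level write that transfers only the modified words.
    from collections import defaultdict

    WORDS_PER_LINE = 8     # assumed 64B line, 8B words
    LINES_PER_ROW = 128    # assumed 8KB DRAM row

    class WriteBack:
        def __init__(self, line_addr, data, dirty_mask):
            self.line_addr = line_addr    # cache-line address
            self.data = data              # WORDS_PER_LINE words
            self.dirty_mask = dirty_mask  # bit i set => word i modified

    def consolidate(write_queue):
        """Group write-backs by DRAM row; each group becomes a single
        row write carrying only the dirty words."""
        by_row = defaultdict(list)
        for wb in write_queue:
            by_row[wb.line_addr // LINES_PER_ROW].append(wb)
        consolidated = []
        for row, wbs in by_row.items():
            words = [(wb.line_addr, i, wb.data[i])
                     for wb in wbs
                     for i in range(WORDS_PER_LINE)
                     if wb.dirty_mask & (1 << i)]
            consolidated.append((row, words))  # one write service per row
        return consolidated

    q = [WriteBack(0, list(range(8)), 0b00000001),  # only word 0 dirty
         WriteBack(1, list(range(8)), 0b10000000)]  # same row, only word 7
    print(len(consolidate(q)), "row write(s) for", len(q), "write-backs")
    ```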
  • ABSTRACT: As the amount of data explodes rapidly, more and more corporations are using data centers to make effective decisions and gain a competitive edge. Data analysis applications play a significant role in data centers, and hence it has become increasingly important to understand their behavior in order to further improve the performance of data center computer systems. In this paper, after investigating the three most important application domains in terms of page views and daily visitors, we choose eleven representative data analysis workloads and characterize their microarchitectural behavior using hardware performance counters, in order to understand the impacts and implications of data analysis workloads on systems equipped with modern superscalar out-of-order processors. Our study reveals that data analysis applications share many inherent characteristics that place them in a different class from desktop (SPEC CPU2006), HPC (HPCC), and service workloads, including traditional server workloads (SPECweb2005) and scale-out service workloads (four of the six benchmarks in CloudSuite); accordingly, we give several recommendations for architecture and system optimizations. On the basis of this characterization work, we released a benchmark suite named DCBench for typical datacenter workloads, including data analysis and service workloads, under an open-source license on our project home page at http://prof.ict.ac.cn/DCBench. We hope that DCBench will be helpful for architecture and small-to-medium-scale system research on datacenter computing.
    07/2013;
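    A hedged sketch of the kind of derived metrics such counter-based characterization computes; the counter names, the characterize helper, and the sample values are invented for illustration, not DCBench's actual methodology.

    ```python
    # Derive common microarchitectural metrics from raw counter values.
    def characterize(c):
        insts, cycles = c["instructions"], c["cycles"]
        return {
            "IPC":         insts / cycles,
            "L1I MPKI":    1000 * c["l1i_misses"] / insts,
            "LLC MPKI":    1000 * c["llc_misses"] / insts,
            "branch MPKI": 1000 * c["branch_mispredicts"] / insts,
        }

    sample = {"instructions": 2.0e9, "cycles": 3.1e9, "l1i_misses": 4.2e6,
              "llc_misses": 9.0e6, "branch_mispredicts": 1.1e7}
    for name, value in characterize(sample).items():
        print(f"{name}: {value:.2f}")
    ```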
  • ABSTRACT: We live in an era of big data, and big data applications are becoming more and more pervasive. How to benchmark data center computer systems running big data applications (big data systems for short) is a hot topic. In this paper, we focus on measuring the performance impacts of diverse applications and scalable volumes of data sets on big data systems. For four typical data analysis applications, an important class of big data applications, our experiments yield two major results. First, data scale has a significant impact on the performance of big data systems, so big data benchmarks must provide scalable volumes of data sets. Second, even though all four applications use simple algorithms, their performance trends differ as the data scale increases; hence benchmarking big data systems must consider not only a variety of data sets but also a variety of applications.
    07/2013;
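    A minimal sketch of the methodological point above: measure the same simple analysis kernel at several data scales and compare trends rather than assuming one fixed input size. The kernel and the scales are illustrative assumptions.

    ```python
    # Time one kernel across increasing data scales and report throughput.
    import random
    import time

    def sort_kernel(n):
        data = [random.random() for _ in range(n)]
        t0 = time.perf_counter()
        data.sort()
        return time.perf_counter() - t0

    for scale in (10_000, 100_000, 1_000_000):
        secs = sort_kernel(scale)
        print(f"n={scale:>9}: {secs:.4f}s  ({scale / secs / 1e6:.2f} M items/s)")
    ```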
  • Conference Paper: The ARMv8 simulator
    ABSTRACT: In this work, we implement a functional and performance simulator for ARMv8 based on the gem5 infrastructure; it is the first open-source ARMv8 simulator. All ARMv8 A64 instructions other than SIMD are implemented using the gem5 ISA description language. The ARMv8 simulator supports multiple CPU models, multiple memory systems, and the McPAT power model.
    Proceedings of the 27th international ACM conference on International conference on supercomputing; 06/2013
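    For flavor, a minimal syscall-emulation configuration in the style of current upstream gem5, whose Python API postdates the 2013 simulator described above; the binary path is a placeholder and the whole script is an assumption-laden sketch, not the paper's code.

    ```python
    # Minimal gem5 SE-mode script (recent upstream API); on an ARM build of
    # gem5 this simulates an AArch64 binary.
    import m5
    from m5.objects import (AddrRange, DDR3_1600_8x8, MemCtrl, Process, Root,
                            SEWorkload, SrcClockDomain, System, SystemXBar,
                            TimingSimpleCPU, VoltageDomain)

    system = System()
    system.clk_domain = SrcClockDomain(clock="1GHz",
                                       voltage_domain=VoltageDomain())
    system.mem_mode = "timing"
    system.mem_ranges = [AddrRange("512MB")]

    system.cpu = TimingSimpleCPU()
    system.membus = SystemXBar()
    system.cpu.icache_port = system.membus.cpu_side_ports
    system.cpu.dcache_port = system.membus.cpu_side_ports
    system.cpu.createInterruptController()

    system.mem_ctrl = MemCtrl(dram=DDR3_1600_8x8(range=system.mem_ranges[0]))
    system.mem_ctrl.port = system.membus.mem_side_ports
    system.system_port = system.membus.cpu_side_ports

    binary = "tests/test-progs/hello/bin/arm/linux/hello"  # placeholder path
    system.workload = SEWorkload.init_compatible(binary)
    system.cpu.workload = Process(cmd=[binary])
    system.cpu.createThreads()

    root = Root(full_system=False, system=system)
    m5.instantiate()
    exit_event = m5.simulate()
    print(f"exited @ tick {m5.curTick()}: {exit_event.getCause()}")
    ```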
  • ABSTRACT: Inability to hide main memory latency has been increasingly limiting the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller (AMC) that can execute Active Memory Operations (AMOs). AMOs are select operations sent to and executed on the data's home memory controller. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips. In this paper, we present the microarchitecture design of the AMC and the programming model of AMOs. We compare the performance of AMOs to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50× faster barriers, 12× faster spinlocks, 8.5×–15× faster stream/array operations, and 3× faster database queries. We also present an analytical model that can predict the performance benefits of using AMOs with good accuracy. The silicon cost required to support AMOs is less than 1% of the die area of a typical high-performance processor, based on a standard cell implementation.
    The Journal of Supercomputing 10/2012; 62(1). · 0.92 Impact Factor
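    A toy model of the AMO idea above, not the paper's AMC design: the amo_fetch_add operation name and the barrier built on it are hypothetical, and a Python lock stands in for the controller's serialized ALU.

    ```python
    # Selected operations execute at the data's home memory controller, so a
    # barrier needs one request per participant instead of a coherence
    # round trip per participant.
    import threading

    class HomeMemoryController:
        def __init__(self):
            self.mem = {}
            self.lock = threading.Lock()  # stands in for controller-side serialization

        def amo_fetch_add(self, addr, delta):
            with self.lock:
                old = self.mem.get(addr, 0)
                self.mem[addr] = old + delta
                return old

    def barrier(amc, addr, n_threads):
        """Each thread increments one counter at its home controller, then spins."""
        amc.amo_fetch_add(addr, 1)
        while amc.mem.get(addr, 0) < n_threads:
            pass

    amc = HomeMemoryController()
    threads = [threading.Thread(target=barrier, args=(amc, 0x40, 4))
               for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print("all threads passed the AMO barrier")
    ```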
  • ABSTRACT: For the first time, this paper systematically identifies three categories of throughput-oriented workloads in data centers: services, data processing applications, and interactive real-time applications. Their goals are, respectively, to increase the throughput in terms of processed requests, processed data, or the maximum number of simultaneously supported subscribers. We coin a new term, high volume computing (HVC for short), to describe these workloads and the data center computer systems designed for them. We characterize and compare HVC with other computing paradigms, e.g., high throughput computing, warehouse-scale computing, and cloud computing, in terms of levels, workloads, metrics, coupling degree, data scales, and number of jobs or service instances. We also preliminarily report our ongoing work on metrics and benchmarks for HVC systems, which is the foundation for designing innovative data center computer systems for HVC workloads.
    02/2012;
  • ABSTRACT: Desktop cloud replaces traditional desktop computers with completely virtualized systems from the cloud. It is becoming one of the fastest growing segments of the cloud computing market. However, as far as we know, little work has been done to understand the behavior of desktop cloud. On one hand, desktop cloud workloads differ from conventional data center workloads in that they are rich in interactive operations, and from traditional non-virtualized desktop workloads in that they have an extra layer of software stack, the hypervisor. On the other hand, desktop cloud servers are mostly built with conventional commodity processors. While such processors are well optimized for traditional desktop and high performance computing workloads, their effectiveness for desktop cloud workloads remains to be studied. As an attempt to shed some light on this question, we have studied the behavior of desktop cloud workloads and compared it with that of SPEC CPU2006, TPC-C, PARSEC, and CloudSuite, evaluating a Xen-based virtualization platform. The performance results reveal that desktop cloud workloads have significantly different characteristics from SPEC CPU2006, TPC-C, and PARSEC, but perform similarly to the data center scale-out benchmarks from CloudSuite. In particular, desktop cloud workloads have high instruction cache miss rates (12.7% on average), a high percentage of kernel instructions (23% on average), and low IPC (0.36 on average), along with much higher TLB miss rates and lower utilization of off-chip memory bandwidth than traditional benchmarks. Our experimental numbers indicate that the effectiveness of existing commodity processors is quite low for desktop cloud workloads. We provide some preliminary discussion of potential architectural and microarchitectural enhancements, and we hope the performance numbers presented in this paper will give some insights to the designers of desktop cloud systems.
    Workload Characterization (IISWC), 2012 IEEE International Symposium on; 01/2012
  • ABSTRACT: The High-Performance Computing ecosystem consists of a large variety of execution platforms that demonstrate a wide diversity in hardware characteristics such as CPU architecture, memory organization, interconnection network, and accelerators. This environment also presents a number of hard boundaries (walls) for applications, which limit software development (parallel programming wall), performance (memory wall, communication wall), and viability (power wall). The only way to survive in such a demanding environment is by adaptation. In this paper we discuss how dynamic information collected during the execution of an application can be utilized to adapt the execution context, and may lead to performance gains beyond those provided by static information and compile-time adaptation. We consider specialization based on dynamic information such as user input, architectural characteristics like the memory hierarchy organization, and the execution profile of the application as obtained from the execution platform's performance monitoring units. One of the challenges for future execution platforms is to allow the seamless integration of these various kinds of information with information obtained from static analysis (in either ahead-of-time or just-in-time compilation). We extend the notion of information-driven adaptation and outline the architecture of an infrastructure designed to enable information flow and adaptation throughout the life cycle of an application.
    06/2011;
  • ABSTRACT: The transistor density of microprocessors continues to increase as technology scales. Microprocessor designers have taken advantage of the increased transistor count by integrating a significant number of cores onto a single die. However, large numbers of cores are met with diminishing returns due to software and hardware scalability issues, and hence designers have started integrating on-chip special-purpose logic units (i.e., accelerators) that were previously available as PCI-attached units. It is anticipated that more accelerators will be integrated on-chip due to the increasing abundance of transistors and the fact that not all logic can be powered at all times under power budget limits. Thus, on-chip accelerator architectures deserve more attention from the research community, and there is a wide spectrum of research opportunities for design and optimization of accelerators. This paper attempts to bring out some insights by studying the data access streams of on-chip accelerators, in the hope of fostering future research in this area. Specifically, this paper uses a few simple case studies to show some of the common characteristics of the data streams introduced by on-chip accelerators, discusses challenges and opportunities in exploiting these characteristics to optimize the power and performance of accelerators, and then analyzes the effectiveness of some simple optimizing extensions we propose.
    17th International Conference on High-Performance Computer Architecture (HPCA-17 2011), February 12-16 2011, San Antonio, Texas, USA; 01/2011
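    A small illustrative analysis in the spirit of the study above: accelerator access streams are often long and strided, which a stride histogram over an address trace exposes. The stride_histogram helper and the DMA-like trace are invented for illustration.

    ```python
    # Histogram of successive address deltas in an access trace.
    from collections import Counter

    def stride_histogram(addrs):
        return Counter(b - a for a, b in zip(addrs, addrs[1:]))

    # e.g. an accelerator sweeping a buffer 64 bytes at a time
    trace = list(range(0x1000, 0x1000 + 64 * 1024, 64))
    hist = stride_histogram(trace)
    dominant, count = hist.most_common(1)[0]
    print(f"dominant stride: {dominant}B covers {count / len(trace):.0%} of accesses")
    ```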
  • ABSTRACT: Search is the most heavily used web application in the world and is still growing at an extraordinary rate. Understanding the behavior of web search engines is therefore becoming increasingly important to the design and deployment of the data center systems hosting them. In this paper, we study three search query traces collected from real-world web search engines at three different search service providers. The first part of our study uncovers the patterns hidden in the query traces by analyzing the variations, frequencies, and locality of query requests. Our analysis reveals that, contrary to some previous studies, real-world query traces do not follow well-defined probability models such as the Poisson distribution or the log-normal distribution. The second part of our study deploys the real query traces, together with three synthetic traces generated using probability models proposed by other researchers, on a Nutch-based search engine. The measured performance data from these deployments further confirm that synthetic traces do not accurately reflect real traces. We develop an evaluation tool that can collect performance metrics online with negligible overhead; the metrics include average response time, CPU utilization, disk accesses, and cycles per instruction. The third part of our study compares the search engine with representative benchmarks, namely Gridmix, SPECweb2005, TPC-C, SPEC CPU2006, and HPCC, with respect to basic architecture-level characteristics and performance metrics such as instruction mix, processor pipeline stall breakdown, memory access latency, and disk accesses. The experimental results show that web search engines have a high percentage of load/store instructions but good cache/memory performance. We hope the results presented in this paper will enable system designers to gain insights into optimizing systems hosting search engines.
    01/2011;
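    A hedged sketch of the trace-analysis step described above: test whether per-interval query counts fit a Poisson model. The data here is synthetic (the paper used real traces), and the KS test on discrete counts is only a rough goodness-of-fit check.

    ```python
    # Fit a Poisson model to arrival counts and test the fit.
    import numpy as np
    from scipy import stats

    # Synthetic per-second query counts standing in for a real trace; heavy
    # tails like these are what real traces tend to exhibit.
    rng = np.random.default_rng(0)
    counts = rng.lognormal(mean=3.0, sigma=0.8, size=3600).astype(int)

    lam = counts.mean()                       # Poisson MLE is the sample mean
    ks = stats.kstest(counts, stats.poisson(lam).cdf)
    print(f"lambda={lam:.1f}, KS statistic={ks.statistic:.3f}, p={ks.pvalue:.3g}")
    # A large KS statistic / tiny p-value rejects the Poisson model,
    # mirroring the paper's finding that real traces defy simple models.
    ```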
  • ABSTRACT: This paper presents two complementary techniques to manage the power consumption of large-scale systems with a packet-switched interconnection network. First, we propose the Thrifty Interconnection Network (TIN), where network links are activated and de-activated dynamically with little or no overhead, using inherent system events to trigger link activation or de-activation in a timely manner. Second, we propose Network Power Shifting (NPS), which dynamically shifts the power budget between the compute nodes and their corresponding network components. TIN activates and trains the links in the interconnection network just in time, before network communication is about to happen, and thriftily puts them into a low-power mode when communication is finished, hence reducing unnecessary network power consumption. Furthermore, with NPS the compute nodes can absorb the extra power budget shifted from their attached network components and increase their processor frequency for higher performance. Our simulation results on a set of real-world workload traces show that TIN can achieve on average a 60% network power reduction with the support of only one low-power mode. When NPS is enabled, the two together can achieve a 12% application performance improvement and a 13% overall system energy reduction. Further performance improvement is possible if the compute nodes can speed up more and fully utilize the extra power budget reinvested from the thrifty network, given more aggressive cooling support.
    17th International Conference on High-Performance Computer Architecture (HPCA-17 2011), February 12-16 2011, San Antonio, Texas, USA; 01/2011
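    A toy event-driven sketch of the TIN/NPS interplay described above, with invented power numbers and an assumed trace format of (start, duration) communication phases; it illustrates the accounting, not the paper's simulator.

    ```python
    # One link sleeps between communication phases; the power saved while
    # sleeping is the budget NPS can shift to the compute node.
    LINK_ACTIVE_W, LINK_SLEEP_W = 10.0, 1.0  # illustrative link power levels

    class Link:
        def __init__(self):
            self.active = False
            self.energy = 0.0

        def advance(self, dt):
            self.energy += dt * (LINK_ACTIVE_W if self.active else LINK_SLEEP_W)

    def run(events, horizon):
        """events: sorted (start, duration) communication phases on one link."""
        link, now, saved_budget = Link(), 0.0, 0.0
        for start, dur in events:
            link.advance(start - now)  # sleeping between phases
            saved_budget += (start - now) * (LINK_ACTIVE_W - LINK_SLEEP_W)
            link.active = True         # just-in-time activation
            link.advance(dur)
            link.active = False        # thrifty de-activation
            now = start + dur
        link.advance(horizon - now)
        saved_budget += (horizon - now) * (LINK_ACTIVE_W - LINK_SLEEP_W)
        return link.energy, saved_budget

    energy, budget = run([(2, 1), (7, 2)], horizon=10)
    print(f"link energy: {energy} J, budget shifted to cores: {budget} J")
    ```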
  • ABSTRACT: Most modern microprocessors provide hardware support for rapidly translating a program's logical address to a system physical address (PA). Translation typically sits on the critical path of every memory access, since an access usually cannot be performed until it has been translated. Enigma is a novel approach to address translation that defers the bulk of the work associated with translation until data must be retrieved from physical memory. Enigma replaces the address translation unit in each conventional core with a simpler unit that translates from the logical address space to a new intermediate address (IA) space. Intermediate addresses are unique across the entire system except where sharing is required or desired, and their use sidesteps the "synonym" problem present in logically tagged caches. All cache addressing, as well as I/O and coherence traffic, is carried out using IAs. Enigma translates an IA to a PA only when no cache in the entire CMP can satisfy the request and memory or I/O must be accessed; a central translation unit attached to the system bus performs these translations. Deferring the bulk of address translation work and removing it from each individual processor core in this manner affords many benefits.
    Proceedings of the 24th International Conference on Supercomputing, 2010, Tsukuba, Ibaraki, Japan, June 2-4, 2010; 01/2010
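    A schematic model of the two-level translation just described, heavily simplified from my reading of the abstract: the per-core IA mapping and the demand-allocating central unit below are assumptions, not the paper's mechanism.

    ```python
    # Cores map logical addresses into a system-unique IA space cheaply;
    # IA -> PA translation happens once, centrally, only on a cache miss.
    class Core:
        def __init__(self, asid):
            self.asid = asid

        def to_ia(self, logical_addr):
            # Simple per-core mapping into a unique IA space; shared segments
            # would map to a common IA range (not shown).
            return (self.asid << 48) | logical_addr

    class CentralTranslationUnit:
        def __init__(self):
            self.ia_to_pa = {}
            self.next_frame = 0

        def translate(self, ia):  # invoked only when no cache can serve the IA
            page = ia >> 12
            if page not in self.ia_to_pa:
                self.ia_to_pa[page] = self.next_frame  # demand allocation
                self.next_frame += 1
            return (self.ia_to_pa[page] << 12) | (ia & 0xFFF)

    ctu, core = CentralTranslationUnit(), Core(asid=7)
    ia = core.to_ia(0x1234)        # fast, on-core step
    print(hex(ia), "->", hex(ctu.translate(ia)))  # deferred, off-core step
    ```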
  • ABSTRACT: Traditional multilevel SRAM-based cache hierarchies, especially in the context of chip multiprocessors (CMPs), present many challenges in area requirements, core-to-cache balance, power consumption, and design complexity. New advancements in technology enable caches to be built from other technologies, such as Embedded DRAM (EDRAM), Magnetic RAM (MRAM), and Phase-change RAM (PRAM), in both 2D and 3D-stacked chips. Caches fabricated in these technologies offer dramatically different power-performance characteristics compared with SRAM-based caches, particularly in the areas of access latency, cell density, and overall power consumption. In this article, we propose to take advantage of the best characteristics that each technology has to offer through the use of Hybrid Cache Architecture (HCA) designs. We discuss and evaluate two types of hybrid cache architectures: inter-cache-level HCA (LHCA), in which the levels in a cache hierarchy can be made of disparate memory technologies; and intra-cache-level, or cache-region-based, HCA (RHCA), where a single level of cache can be partitioned into multiple regions, each of a different memory technology. We have studied a number of different HCA architectures and explored the potential of hardware support for intra-cache data movement and power consumption management within HCA caches. Utilizing a full-system simulator that has been validated against real hardware, we demonstrate that an LHCA design can provide a geometric mean 6% IPC improvement over a baseline 3-level SRAM cache design under the same area constraint across a collection of 30 workloads. A more aggressive RHCA-based design provides a 10% IPC improvement over the baseline, and a 2-layer 3D cache stack (3DHCA) of high-density memory technology within the same chip footprint gives a 16% IPC improvement. We also achieve up to a 72% reduction in power consumption over a baseline SRAM-only design. Energy-delay and thermal evaluations for 3DHCA are also presented. In addition to the fast/slow-region-based RHCA, we further evaluate read/write-region-based RHCA designs.
ACM Transactions on Architecture and Code Optimization (TACO) 01/2010; 7:15.
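    A toy sketch of the RHCA intra-cache movement policy described above, with invented parameters (region sizes, HOT_THRESHOLD): a small fast SRAM region backs a large slow region, and lines that get hot in the slow region are swapped into the fast one.

    ```python
    # Region-based hybrid cache: promote hot lines from slow to fast region.
    HOT_THRESHOLD = 4

    class HybridCache:
        def __init__(self, fast_slots=2):
            self.fast = {}   # addr -> data (e.g. SRAM region)
            self.slow = {}   # addr -> data (e.g. EDRAM/MRAM region)
            self.hits = {}   # per-line access counts in the slow region
            self.fast_slots = fast_slots

        def access(self, addr):
            if addr in self.fast:
                return "fast hit"
            if addr in self.slow:
                self.hits[addr] = self.hits.get(addr, 0) + 1
                if self.hits[addr] >= HOT_THRESHOLD:
                    self.promote(addr)
                return "slow hit"
            self.slow[addr] = f"line@{addr:#x}"  # fill into the slow region
            return "miss"

        def promote(self, addr):
            if len(self.fast) >= self.fast_slots:  # swap out a victim
                victim, data = self.fast.popitem()
                self.slow[victim] = data
            self.fast[addr] = self.slow.pop(addr)
            self.hits.pop(addr, None)

    c = HybridCache()
    for _ in range(6):
        print(c.access(0x80))  # miss, then slow hits, then fast hits
    ```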
  • M. Stephenson, Lixin Zhang, R. Rangan
    ABSTRACT: The benefits of Out-of-Order (OOO) processing are well known, as is the effectiveness of predicated execution for unpredictable control flow. However, as previous research has demonstrated, these techniques are at odds with one another. One common approach to reconciling their differences is to simplify the form of predication supported by the architecture; for instance, the only form of predication supported by modern OOO processors is a simple conditional move. We argue that it is the simplicity of conditional move that has allowed its widespread adoption, but we also show that this simplicity compromises its effectiveness as a compilation target. In this paper, we introduce a generalized form of hammock predication, called predicated mutually exclusive groups, that requires few modifications to an existing processor pipeline yet presents the compiler with abundant predication opportunities. In comparison to non-predicated code running on an aggressively clocked baseline system, our technique achieves an 8% speedup averaged across three important benchmark suites.
    High Performance Computer Architecture, 2009. HPCA 2009. IEEE 15th International Symposium on; 03/2009
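    An entirely schematic if-conversion pass in the spirit of the technique above; the tuple-based instruction representation is a made-up toy IR, not the paper's compiler.

    ```python
    # Both arms of a branch hammock are guarded by complementary predicates,
    # so the two guarded definitions of a register are mutually exclusive.
    def if_convert(cond, then_block, else_block):
        """Turn a branch hammock into a predicated instruction stream."""
        out = [("setp", "p", cond)]                      # p = cond
        out += [("pred", "p", inst) for inst in then_block]
        out += [("pred", "!p", inst) for inst in else_block]
        return out

    hammock = if_convert("r1 < r2",
                         then_block=[("mov", "r3", "r4")],
                         else_block=[("add", "r3", "r5", "1")])
    for inst in hammock:
        print(inst)
    # The 'p' and '!p' instructions both write r3, but under complementary
    # predicates exactly one definition is live, which is the property an
    # OOO renamer can exploit in the paper's scheme.
    ```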
  • ABSTRACT: We propose the Thrifty Interconnection Network (TIN), where network links are activated and de-activated dynamically to save power with little or no overhead, using inherent system events to overlap the link activation and de-activation time. Our simulation results on a set of real-world HPC workload traces show on average a 35% network power reduction.
    Proceedings of the 23rd international conference on Supercomputing, 2009, Yorktown Heights, NY, USA, June 8-12, 2009; 01/2009
  • ABSTRACT: Caching techniques have been an efficient mechanism for mitigating the effects of the processor-memory speed gap. Traditional multi-level SRAM-based cache hierarchies, especially in the context of chip multiprocessors (CMPs), present many challenges in area requirements, core-to-cache balance, power consumption, and design complexity. New advancements in technology enable caches to be built from other technologies, such as Embedded DRAM (EDRAM), Magnetic RAM (MRAM), and Phase-change RAM (PRAM), in both 2D and 3D-stacked chips. Caches fabricated in these technologies offer dramatically different power and performance characteristics compared with SRAM-based caches, particularly in the areas of access latency, cell density, and overall power consumption. In this paper, we propose to take advantage of the best characteristics that each technology offers through the use of Hybrid Cache Architecture (HCA) designs. We discuss and evaluate two types of hybrid cache architectures: inter-cache-level HCA (LHCA), in which the levels in a cache hierarchy can be made of disparate memory technologies; and intra-cache-level, or cache-region-based, HCA (RHCA), where a single level of cache can be partitioned into multiple regions, each of a different memory technology. We have studied a number of different HCA architectures and explored the potential of hardware support for intra-cache data movement and power consumption management within HCA caches. Utilizing a full-system simulator that has been validated against real hardware, we demonstrate that an LHCA design can provide a geometric mean 7% IPC improvement over a baseline 3-level SRAM cache design under the same area constraint across a collection of 25 workloads. A more aggressive RHCA-based design provides a 12% IPC improvement over the baseline. Finally, a 2-layer 3D cache stack (3DHCA) of high-density memory technology within the same chip footprint gives an 18% IPC improvement over the baseline. Furthermore, up to a 70% reduction in power consumption over a baseline SRAM-only design is achieved.
    36th International Symposium on Computer Architecture (ISCA 2009), June 20-24, 2009, Austin, TX, USA; 01/2009
  • ABSTRACT: Caches made of non-volatile memory technologies, such as Magnetic RAM (MRAM) and Phase-change RAM (PRAM), offer dramatically different power-performance characteristics when compared with SRAM-based caches, particularly in the areas of static/dynamic power consumption, read and write access latency, and cell density. In this paper, we propose to take advantage of the best characteristics that each technology has to offer through the use of read-write-aware Hybrid Cache Architecture (RWHCA) designs, where a single level of cache can be partitioned into read and write regions, each of a different memory technology with disparate read and write characteristics. We explore the potential of hardware support for intra-cache data movement within RWHCA caches. Utilizing a full-system simulator that has been validated against real hardware, we demonstrate that an RWHCA design with a conservative setup can provide a geometric mean 55% power reduction and yet a 5% IPC improvement over a baseline SRAM cache design across a collection of 30 workloads. Furthermore, a 2-layer 3D cache stack (3DRWHCA) of high-density memory technology with the same chip footprint still gives a 10% power reduction and boosts performance by a 16% IPC improvement over the baseline.
    Design, Automation and Test in Europe, DATE 2009, Nice, France, April 20-24, 2009; 01/2009
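    A toy placement policy sketch for the read-write-aware idea above, with an invented write_bias threshold: read-dominated lines go to the non-volatile region (cheap reads, expensive writes), write-heavy lines stay in SRAM.

    ```python
    # Choose a cache region for a line from its observed read/write mix.
    def choose_region(reads, writes, write_bias=0.25):
        """Return 'NVM' for read-dominated lines, 'SRAM' for write-heavy ones."""
        total = reads + writes
        if total == 0 or writes / total > write_bias:
            return "SRAM"
        return "NVM"

    for r, w in [(100, 2), (10, 30), (0, 0)]:
        print(f"reads={r:>3} writes={w:>3} -> {choose_region(r, w)}")
    ```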