Lixin Zhang

Technical Institute of Physics and Chemistry, Beijing, China

Publications (57) · 4.92 Total impact

  • ABSTRACT: Phase change memory (PCM) is a promising alternative to DRAM for main memory thanks to its better scalability and lower leakage. However, the long write latency of PCM puts it at a severe disadvantage against DRAM. In this paper, we propose a Dynamic Write Consolidation (DWC) scheme to improve PCM memory system performance while reducing energy consumption. The work is motivated by the observation that a large fraction of a cache line being written back to memory has not actually been modified. DWC exploits the unnecessary burst writes of unmodified data to consolidate multiple writes targeting the same row into one write, allowing several writes to be sent as one. DWC incurs low implementation overhead and shows significant efficiency. The evaluation results show that DWC achieves up to 35.7% performance improvement, and 17.9% on average. The effective write latency is reduced by up to 27.7%, and 16.0% on average. Moreover, DWC reduces energy consumption by up to 35.3%, and 13.9% on average. (A toy sketch of the consolidation idea follows this entry.)
    ACM 28th International Conference on Supercomputing (ICS 2014), Munich, Germany; 06/2014
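    The core idea lends itself to a short sketch. The following toy model in Python assumes 64-byte lines split into eight 8-byte words and a hypothetical row-mapping function; it illustrates the dirty-word masking and row-level merging described above, not the paper's actual hardware design.

      # Illustrative model of Dynamic Write Consolidation (DWC): only the
      # modified words of a written-back line need burst writes, so writes
      # targeting the same PCM row can be merged into one row-level write.

      def dirty_mask(old_line, new_line):
          """Per-word flags marking which words of the line actually changed."""
          return [o != n for o, n in zip(old_line, new_line)]

      def consolidate(writebacks, row_of):
          """Group pending write-backs by PCM row and merge their dirty words.

          writebacks: list of (addr, old_line, new_line) tuples
          row_of:     maps a line address to its PCM row (assumed mapping)
          Returns {row: dirty_words_sent}, i.e., one write per row.
          """
          rows = {}
          for addr, old, new in writebacks:
              rows.setdefault(row_of(addr), []).append(sum(dirty_mask(old, new)))
          return {row: sum(counts) for row, counts in rows.items()}

      # Two lines in the same row, each with one modified word, become a
      # single row write carrying 2 dirty words instead of 16 words total.
      wb = [(0x00, [0] * 8, [0, 1, 0, 0, 0, 0, 0, 0]),
            (0x40, [0] * 8, [0, 0, 0, 0, 0, 0, 0, 1])]
      print(consolidate(wb, row_of=lambda addr: addr >> 12))   # {0: 2}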
  • ABSTRACT: In 2005, as chip multiprocessors started to appear widely, it became possible for the on-chip cores to share the last-level cache. At the time, architects considered the last-level cache either as divided into per-core private segments or as wholly shared. A shared cache utilizes capacity more efficiently but suffers from high, uniform latencies. This paper proposed a new direction: allowing caches to be non-uniform, with a varying number of processors sharing each section of the cache. Sharing degree, the number of cores sharing a last-level cache, determines the level of replication in on-chip caches and also affects the capacity and latency of each shared cache. Building on our previous work that introduced non-uniform cache architectures (NUCA), this study explored the design space of shared multi-core caches, focusing on the effect of sharing degree. Our observation of a per-application optimal sharing degree led to a static NUCA design with a reconfigurable sharing degree. This work on multicore NUCA cache architectures has been influential in contemporary systems, including the level-3 caches of the IBM POWER7 and POWER8 processors.
    06/2014;
  • ABSTRACT: Write-optimized data structures like the Log-Structured Merge-tree (LSM-tree) and its variants are widely used in key-value storage systems like Bigtable and Cassandra. Due to deferral and batching, LSM-tree based storage systems need background compactions to merge key-value entries and keep them sorted for future queries and scans. Background compactions play a key role in the performance of LSM-tree based storage systems. Existing studies of background compaction focus on decreasing the compaction frequency, reducing I/Os, or confining compactions to hot key-ranges; they pay little attention to the computation time spent in compactions. However, the computation time is no longer negligible: it can exceed 60% of the total compaction time in storage systems using flash-based SSDs. An alternative way to speed up compaction is therefore to make good use of the parallelism of the underlying hardware, including CPUs and I/O devices. In this paper, we analyze the compaction procedure, identify the performance bottleneck, and propose the Pipelined Compaction Procedure (PCP) to better utilize the parallelism of CPUs and I/O devices. Theoretical analysis proves that PCP can improve the compaction bandwidth. Furthermore, we implement PCP in a real system and conduct extensive experiments. The experimental results show that the pipelined compaction procedure increases the compaction bandwidth and storage system throughput by 77% and 62%, respectively. (A toy pipeline sketch follows this entry.)
    2014 IEEE International Parallel & Distributed Processing Symposium (IPDPS); 05/2014
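    The pipelining idea can be pictured with a small thread-and-queue sketch. The stage boundaries (read, merge, write) and granularity here are assumptions for illustration; this is not PCP's implementation, only the general shape of overlapping I/O with computation.

      # Toy pipelined compaction: rather than running read -> merge -> write
      # serially per block, the stages run concurrently on different blocks.
      import queue
      import threading

      def stage(fn, inq, outq=None):
          """Apply fn to items from inq until a None sentinel arrives."""
          while (item := inq.get()) is not None:
              result = fn(item)
              if outq is not None:
                  outq.put(result)
          if outq is not None:
              outq.put(None)                 # pass the sentinel downstream

      def read_block(block_id):              # stand-in for reading input blocks
          return list(range(block_id + 3, block_id - 1, -1))

      def merge_block(entries):              # stand-in for CPU-bound merging
          return sorted(entries)

      def write_block(entries):              # stand-in for writing output blocks
          print("wrote", entries)

      reads = queue.Queue(maxsize=4)         # bounded queues give back-pressure
      merged = queue.Queue(maxsize=4)
      workers = [threading.Thread(target=stage, args=(merge_block, reads, merged)),
                 threading.Thread(target=stage, args=(write_block, merged))]
      for w in workers:
          w.start()
      for block_id in range(0, 16, 4):       # the read stage feeds the pipeline
          reads.put(read_block(block_id))
      reads.put(None)                        # no more input
      for w in workers:
          w.join()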
  • ABSTRACT: Mobile devices such as smartphones and tablets have become the primary consumer computing devices, and their rate of adoption continues to grow. The applications that run on these mobile platforms vary in how they use hardware resources, and their diversity is increasing. Performance and power limitations also vary widely across mobile platforms. Thus there is a growing need for tools to help computer architects design systems that meet the needs of mobile workloads. Full-system simulators are invaluable tools for designing new architectures, but we still need appropriate benchmark suites that capture the behaviors of emerging mobile applications. Current benchmark suites cover only a small range of mobile applications, and many cannot run directly in simulators because they require user interaction. In this paper, we introduce and characterize Moby, a benchmark suite designed to make it easier to use full-system architectural simulators to evaluate microarchitectures for mobile processors. Moby contains popular Android applications, including a web browser, a social networking application, an email client, a music player, a video player, a document processing application, and a map program. To facilitate microarchitectural exploration, we port the Moby benchmark suite to the popular gem5 simulator. We characterize the architecture-independent features of Moby applications on the simulator and analyze the architecture-dependent features on a current-generation mobile platform. Our results show that mobile applications exhibit complex instruction-execution behaviors and poor code locality, and that current mobile platforms, especially their instruction-related components, cannot meet these requirements.
    2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS); 03/2014
  • ABSTRACT: Data centers are increasingly employing virtualization to ensure performance isolation for latency-sensitive applications while allowing co-location of multiple applications. Previous research has shown that virtualization offers excellent resource isolation. However, whether virtualization can mitigate interference in micro-architectural resources has not been well studied. This paper presents an in-depth analysis of the performance isolation effect of virtualization technology on various micro-architectural resources (i.e., L1 D-cache, L2 cache, last-level cache (LLC), hardware prefetchers, and non-uniform memory access (NUMA)) by mapping the CloudSuite benchmarks to different sockets, different cores of one chip, and different threads of one core. For each resource, we investigate the correlation between performance variation and contention by changing VM mapping policies according to different application characteristics. Our experiments show that virtualization has rather limited micro-architectural isolation effects. Specifically, LLC interference can degrade application performance by as much as 28%, and contention for intra-core resources can degrade application performance by as much as 27%. Additionally, we outline several opportunities to improve performance by reducing interference from misbehaving VMs.
    2013 IEEE International Conference on High Performance Computing and Communications (HPCC) & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (EUC); 11/2013
  • ABSTRACT: Flourishing large-scale and high-throughput web applications have emphasized the importance of high-density servers for their distinct advantages, such as high computing density, low power, and low space requirements. Achieving these advantages requires an efficient intra-server interconnection network. Most state-of-the-art high-density servers adopt a fully-connected intra-server network to achieve high network performance. Unfortunately, this solution is very expensive due to the high node degree. To address this problem, we exploit the theoretically optimal Moore graph to interconnect the chips within a server. Considering the size of typical applications, this paper focuses on the order-50 Moore graph, namely the Hoffman-Singleton graph (a construction sketch follows this entry). Simulation results show that it attains performance comparable to the fully-connected network at much lower cost. In practice, however, the chips may be integrated onto multiple boards, so the graph must be divided into self-connected sub-graphs of equal size. Unfortunately, state-of-the-art partitioning solutions do not consider this production problem and generate heterogeneous sub-graphs. To address this, we propose two equivalent-partition solutions for the Hoffman-Singleton graph, depending on the density of boards. Finally, we propose and evaluate a deadlock-free routing algorithm for each partition scheme.
    2013 IEEE International Conference on High Performance Computing and Communications (HPCC) & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (EUC); 11/2013
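    The Hoffman-Singleton graph is small enough to construct and sanity-check directly. The sketch below uses the classical pentagon/pentagram construction (not the paper's partitioning scheme): five pentagons P_h, five pentagrams Q_k, and vertex i of P_h joined to vertex (h*k + i) mod 5 of Q_k.

      # Build the Hoffman-Singleton graph: 50 nodes, 7-regular, diameter 2,
      # girth 5, i.e., the Moore graph of degree 7.
      from itertools import product

      def hoffman_singleton():
          adj = {(part, j, i): set()
                 for part, j, i in product("PQ", range(5), range(5))}

          def link(u, v):
              adj[u].add(v)
              adj[v].add(u)

          for j, i in product(range(5), range(5)):
              link(("P", j, i), ("P", j, (i + 1) % 5))   # pentagon edges
              link(("Q", j, i), ("Q", j, (i + 2) % 5))   # pentagram edges
          for h, k, i in product(range(5), range(5), range(5)):
              link(("P", h, i), ("Q", k, (h * k + i) % 5))
          return adj

      g = hoffman_singleton()
      assert len(g) == 50 and all(len(nbrs) == 7 for nbrs in g.values())
      # Diameter 2: any two distinct, non-adjacent vertices share a neighbour.
      for u in g:
          for v in g:
              assert u == v or v in g[u] or g[u] & g[v]
      print("Hoffman-Singleton graph verified: 50 nodes, degree 7, diameter 2")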
  • ABSTRACT: As the amount of data explodes rapidly, more and more corporations are using data centers to make effective decisions and gain a competitive edge. Data analysis applications play a significant role in data centers, and hence it has become increasingly important to understand their behaviors in order to further improve the performance of data center computer systems. In this paper, after investigating the three most important application domains in terms of page views and daily visitors, we choose eleven representative data analysis workloads and characterize their micro-architectural behavior using hardware performance counters, in order to understand the impacts and implications of data analysis workloads on systems equipped with modern superscalar out-of-order processors. Our study reveals that data analysis applications share many inherent characteristics that place them in a class apart from desktop (SPEC CPU2006), HPC (HPCC), and service workloads, including traditional server workloads (SPECweb2005) and scale-out service workloads (four of the six benchmarks in CloudSuite); accordingly, we give several recommendations for architecture and system optimizations. On the basis of this characterization work, we released a benchmark suite named DCBench for typical datacenter workloads, including data analysis and service workloads, under an open-source license on our project home page at http://prof.ict.ac.cn/DCBench. We hope that DCBench is helpful for architecture and small-to-medium-scale system research for datacenter computing.
    07/2013;
  • ABSTRACT: We live in an era of big data, and big data applications are becoming more and more pervasive. How to benchmark data center computer systems running big data applications (in short, big data systems) is a hot topic. In this paper, we focus on measuring the performance impact of diverse applications and scalable volumes of data sets on big data systems. For four typical data analysis applications, an important class of big data applications, we find two major results through experiments. First, the data scale has a significant impact on the performance of big data systems, so big data benchmarks must provide scalable volumes of data sets. Second, even though all four applications use simple algorithms, their performance trends differ as the data scale increases; hence benchmarking big data systems must consider not only a variety of data sets but also a variety of applications.
    07/2013;
  • Conference Paper: The ARMv8 simulator
    ABSTRACT: In this work, we implement an ARMv8 functional and performance simulator based on the gem5 infrastructure; it is the first open-source ARMv8 simulator. All ARMv8 A64 instructions other than SIMD are implemented using the gem5 ISA description language. The simulator supports multiple CPU models, multiple memory systems, and the McPAT power model.
    Proceedings of the 27th international ACM conference on International conference on supercomputing; 06/2013
  • ABSTRACT: Within today's large-scale data centers, inter-node communication is often the major bottleneck, a fact that has recently spurred data center network (DCN) research. Since building a real data center is cost-prohibitive, most DCN studies rely on simulation. Unfortunately, state-of-the-art network simulators have limited support for real-world applications, which prevents researchers from first-hand investigation. To address this issue, we developed a unified, cross-layer simulation framework named DCNSim. By leveraging two widely deployed simulators, DCNSim brings computer architecture considerations into DCN research. With DCNSim, one can run packet-level network simulations driven by commercial applications while varying computer and network parameters, such as CPU frequency, memory access latency, network topology, and protocols (a sketch of such a cross-layer parameter sweep follows this entry). Extensive validations show that DCNSim accurately captures performance trends caused by changing computer and network parameters. Finally, we argue through several case studies that future DCN research should take computer architecture factors into account.
    Proceedings of the ACM International Conference on Computing Frontiers; 05/2013
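    The kind of cross-layer experiment the abstract describes can be pictured as a single configuration with both architectural and network knobs. Every field name below is hypothetical; DCNSim's real interface is not shown here.

      # Hypothetical cross-layer experiment description: architectural and
      # network parameters varied side by side in one sweep.
      import copy

      base = {
          "compute": {"cpu_freq_ghz": 2.0, "mem_latency_ns": 90},
          "network": {"topology": "fat-tree", "protocol": "TCP", "link_gbps": 10},
          "workload": "memcached",  # packet-level simulation driven by a real app
      }

      def sweep(cfg, section, key, values):
          """Yield one experiment per value of a single swept parameter."""
          for v in values:
              variant = copy.deepcopy(cfg)
              variant[section][key] = v
              yield variant

      # Sweep an architectural knob while the network setup stays fixed:
      # the cross-layer question a pure network simulator cannot ask.
      for exp in sweep(base, "compute", "cpu_freq_ghz", [1.0, 2.0, 3.0]):
          print(exp["compute"]["cpu_freq_ghz"], "GHz on", exp["network"]["topology"])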
  • Xiufeng Sui, Tao Sun, Tao Li, Lixin Zhang
    ABSTRACT: Cloud computing has demonstrated tremendous capability in a wide spectrum of online services. Virtualization provides an efficient solution for utilizing modern multicore processor systems while affording significant flexibility. The growing popularity of virtualized datacenters motivates a deeper understanding of the interactions between virtual machine management and the micro-architectural behavior of the privileged domain. We argue that these behaviors must be factored into the design of processor microarchitectures for virtualized datacenters. In this work, we use performance counters on modern servers to study the micro-architectural execution characteristics of the privileged domain while it performs various VM management operations. Our study shows that today's state-of-the-art processors still have room for further optimization when executing virtualized cloud workloads, particularly in the organization of last-level caches and the on-chip cache coherence protocol. Specifically, our analysis shows that shared caches could be partitioned to eliminate interference between the privileged domain and guest domains; that the cache coherence protocol could support the privileged domain's high degree of data sharing; and that the cache capacity and CPU utilization occupied by the privileged domain could be managed effectively during management workflows to achieve high system throughput.
    Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on; 01/2013
  • ABSTRACT: The inability to hide main memory latency has been increasingly limiting the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller (AMC) that can execute Active Memory Operations (AMOs). AMOs are select operations sent to and executed on the home memory controller of the data (a sketch of the programming model follows this entry). AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips. In this paper, we present the microarchitecture design of the AMC and the programming model of AMOs. We compare the performance of AMOs to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50× faster barriers, 12× faster spinlocks, 8.5×–15× faster stream/array operations, and 3× faster database queries. We also present an analytical model that predicts the performance benefits of using AMOs with good accuracy. The silicon cost required to support AMOs is less than 1% of the die area of a typical high-performance processor, based on a standard cell implementation.
    The Journal of Supercomputing 62(1); 10/2012. 0.84 Impact Factor
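    The programming model can be sketched in a few lines. The class and method names below are illustrative assumptions, not the AMC interface; the point is that the operation travels to the data's home node instead of the cache line travelling to the core.

      # Hypothetical sketch of an Active Memory Operation: the home memory
      # controller applies the update where the data lives and returns the
      # old value, so the line never migrates to the requesting core.
      import operator

      class HomeMemoryController:
          def __init__(self):
              self.memory = {}

          def amo(self, addr, op, operand):
              old = self.memory.get(addr, 0)
              self.memory[addr] = op(old, operand)   # executed memory-side
              return old                             # old value back to the core

      # A barrier built from AMOs: each arriving thread performs one
      # fetch-and-add at the home node instead of bouncing the counter
      # line through the coherence protocol.
      mc = HomeMemoryController()
      N = 8
      for _ in range(N):                             # N threads arrive
          mc.amo("barrier_count", operator.add, 1)
      assert mc.memory["barrier_count"] == N         # all arrived; barrier opens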
  • ABSTRACT: For the first time, this paper systematically identifies three categories of throughput-oriented workloads in data centers: services, data processing applications, and interactive real-time applications. Their targets are to increase the volume of throughput in terms of processed requests, processed data, or the maximum number of simultaneously supported subscribers, respectively. We coin a new term, high volume computing (in short, HVC), to describe these workloads and the data center computer systems designed for them. We characterize and compare HVC with other computing paradigms, e.g., high throughput computing, warehouse-scale computing, and cloud computing, in terms of levels, workloads, metrics, coupling degree, data scales, and number of jobs or service instances. We also give a preliminary report of our ongoing work on metrics and benchmarks for HVC systems, which is the foundation for designing innovative data center computer systems for HVC workloads.
    02/2012;
  • ABSTRACT: A desktop cloud replaces traditional desktop computers with completely virtualized systems from the cloud. It is becoming one of the fastest growing segments in the cloud computing market. However, as far as we know, little work has been done to understand the behavior of desktop clouds. On one hand, desktop cloud workloads differ from conventional data center workloads in that they are rich in interactive operations, and they differ from traditional non-virtualized desktop workloads in that they have an extra layer of software stack: the hypervisor. On the other hand, desktop cloud servers are mostly built with conventional commodity processors. While such processors are well optimized for traditional desktop and high performance computing workloads, their effectiveness for desktop cloud workloads remains to be studied. As an attempt to shed some light on the effectiveness of conventional general-purpose processors for desktop cloud workloads, we have studied the behavior of desktop cloud workloads on a Xen-based virtualization platform and compared it with that of SPEC CPU2006, TPC-C, PARSEC, and CloudSuite. The performance results reveal that desktop cloud workloads have significantly different characteristics from SPEC CPU2006, TPC-C, and PARSEC, but perform similarly to the data center scale-out benchmarks from CloudSuite. In particular, desktop cloud workloads have a high instruction cache miss rate (12.7% on average), a high percentage of kernel instructions (23% on average), and low IPC (0.36 on average), as well as much higher TLB miss rates and lower utilization of off-chip memory bandwidth than traditional benchmarks. Our experimental numbers indicate that the effectiveness of existing commodity processors is quite low for desktop cloud workloads. We provide some preliminary discussion of potential architectural and micro-architectural enhancements, and we hope that the performance numbers presented in this paper will give some insights to the designers of desktop cloud systems.
    Workload Characterization (IISWC), 2012 IEEE International Symposium on; 01/2012
  • ABSTRACT: Building energy-efficient systems is critical for big data applications. This paper investigates and compares the energy consumption and execution time of a typical Hadoop-based big data application running on a traditional Xeon-based cluster and on an Atom-based (micro-server) cluster. Our experimental results show that the micro-server platform is more energy-efficient than the Xeon-based platform. They also reveal that data compression and decompression account for a considerable share of the total execution time: compression/decompression occupies 7-11% of the execution time of map tasks and 37.9-41.2% of the execution time of reduce tasks (a miniature version of this measurement follows this entry). Based on our findings, we demonstrate the need for a heterogeneous architecture for energy-efficient big data processing, one that takes advantage of both micro-server processors and hardware compression/decompression accelerators. In addition, we propose a mechanism that enables the accelerators to perform data compression/decompression more efficiently.
    10/2011;
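    The measurement behind these numbers is easy to reproduce in miniature. The sketch below times zlib as a stand-in codec against an assumed total task time; both the payload and the 2-second task time are illustrative, not the paper's workload.

      # Miniature version of the measurement: what fraction of a task's time
      # goes to compression/decompression? zlib stands in for Hadoop's codec.
      import time
      import zlib

      payload = b"key\tvalue\n" * 500_000   # stand-in intermediate map output

      t0 = time.perf_counter()
      compressed = zlib.compress(payload, level=6)
      t1 = time.perf_counter()
      zlib.decompress(compressed)
      t2 = time.perf_counter()

      task_time = 2.0                        # assumed total task time, seconds
      codec_time = (t1 - t0) + (t2 - t1)
      print(f"compress {t1 - t0:.3f}s, decompress {t2 - t1:.3f}s, "
            f"codec share of a {task_time:.0f}s task: {codec_time / task_time:.1%}")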
  • ABSTRACT: The High-Performance Computing ecosystem consists of a large variety of execution platforms that demonstrate a wide diversity in hardware characteristics such as CPU architecture, memory organization, interconnection network, and accelerators. This environment also presents a number of hard boundaries (walls) that limit software development (the parallel programming wall), performance (the memory and communication walls), and viability (the power wall). The only way to survive in such a demanding environment is by adaptation. In this paper we discuss how dynamic information collected during the execution of an application can be used to adapt the execution context, potentially yielding performance gains beyond those provided by static information and compile-time adaptation. We consider specialization based on dynamic information such as user input, architectural characteristics like the memory hierarchy organization, and the execution profile of the application as obtained from the execution platform's performance monitoring units. One challenge for future execution platforms is to allow the seamless integration of these kinds of information with information obtained from static analysis during ahead-of-time or just-in-time compilation. We extend the notion of information-driven adaptation and outline the architecture of an infrastructure designed to enable information flow and adaptation throughout the life-cycle of an application.
    06/2011;