ABSTRACT: Enabling multiple paths in datacenter networks is a common practice to improve performance and robustness. Multi-path TCP (MPTCP) explores multiple paths by splitting a single flow into multiple subflows. The number of subflows in MPTCP is determined before a connection is established, and it usually remains unchanged during the lifetime of that connection. While MPTCP improves both bandwidth efficiency and network reliability, more subflows incur additional overhead, especially for small (so-called mice) subflows. Additionally, it is difficult to choose the appropriate number of subflows for each TCP connection to achieve good performance without incurring significant overhead. To address this problem, we propose an adaptive multi-path transmission control protocol, AMTCP, which dynamically adjusts the number of subflows according to application workloads. Specifically, AMTCP divides time into small intervals and measures the throughput of each subflow over the latest interval, then adjusts the number of subflows dynamically with the goal of reducing resource and scheduling overheads for mice flows and achieving a higher throughput for elephant flows. Our evaluations show that AMTCP increases throughput by over 30% compared to conventional TCP. Meanwhile, AMTCP decreases the average number of subflows by more than 37.5% while achieving a throughput similar to that of MPTCP.
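The interval-based adjustment described in this abstract can be illustrated with a small sketch. The abstract does not state the actual decision rule, so everything below is an assumption made for illustration: the function name `next_subflow_count`, the two throughput thresholds, and the limits on the subflow count are hypothetical and are not taken from AMTCP itself.

```python
def next_subflow_count(throughputs, shrink_below=1e6, grow_above=50e6,
                       min_n=1, max_n=8):
    """Pick the subflow count for the next interval (illustrative rule only).

    throughputs  -- bytes/s measured on each active subflow in the latest interval
    shrink_below -- assumed threshold: a subflow slower than this is wasted overhead
    grow_above   -- assumed threshold: if every subflow is faster than this,
                    the flow looks like an elephant and may use another path
    """
    n = len(throughputs)
    if n == 0:
        raise ValueError("at least one subflow is required")
    slowest = min(throughputs)
    # Mice-like behaviour: some subflow carries almost nothing, so drop one.
    if n > min_n and slowest < shrink_below:
        return n - 1
    # Elephant-like behaviour: all subflows are busy, so open one more.
    if n < max_n and slowest > grow_above:
        return n + 1
    # Otherwise keep the current number of subflows.
    return n
```

Calling this once per interval with fresh measurements gives a controller that shrinks toward a single subflow for short transfers and grows for sustained ones, which is the qualitative behaviour the abstract describes.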
ABSTRACT: This paper presents PARD, a programmable architecture for resourcing-on-demand that provides a new programming interface to convey an application's high-level information like quality-of-service requirements to the hardware. PARD enables new functionalities like fully hardware-supported virtualization and differentiated services in computers. PARD is inspired by the observation that a computer is inherently a network in which hardware components communicate via packets (e.g., over the NoC or PCIe). We apply principles of software-defined networking to this intra-computer network and address three major challenges. First, to deal with the semantic gap between high-level applications and underlying hardware packets, PARD attaches a high-level semantic tag (e.g., a virtual machine or thread ID) to each memory-access, I/O, or interrupt packet. Second, to make hardware components more manageable, PARD implements programmable control planes that can be integrated into various shared resources (e.g., cache, DRAM, and I/O devices) and can differentially process packets according to tag-based rules. Third, to facilitate programming, PARD abstracts all control planes as a device file tree to provide a uniform programming interface via which users create and apply tag-based rules. Full-system simulation results show that by co-locating latency-critical memcached applications with other workloads PARD can improve a four-core computer's CPU utilization by up to a factor of four without significantly increasing tail latency. FPGA emulation based on a preliminary RTL implementation demonstrates that the cache control plane introduces no extra latency and that the memory control plane can reduce queueing delay for high-priority memory-access requests by up to a factor of 5.6.
No preview · Article · May 2015 · ACM SIGPLAN Notices
ABSTRACT: Big data analytics applications play a significant role in data centers, and hence it has become increasingly important to understand their behaviors in order to further improve the performance of data center computer systems, in which characterizing representative workloads is a key practical problem. In this paper, after investigating the three most important application domains in terms of page views and daily visitors, we chose 11 representative data analytics workloads and characterized their micro-architectural behaviors by using hardware performance counters, so as to understand the impacts and implications of data analytics workloads on the systems equipped with modern superscalar out-of-order processors. Our study reveals that big data analytics applications themselves share many inherent characteristics, which place them in a different class from traditional workloads and scale-out services. To further understand the characteristics of big data analytics workloads, we performed a correlation analysis of CPI (cycles per instruction) with other micro-architecture level characteristics and an investigation of the big data software stack impacts on application behaviors. Our correlation analysis showed that even though big data analytics workloads exhibit notable pipeline front-end stalls, the main factors affecting the CPI performance are long-latency data accesses rather than the front-end stalls. Our software stack investigation found that the typical big data software stack significantly contributes to the front-end stalls and incurs a larger working set. Finally, we gave several recommendations for architects, programmers, and big data system designers with the knowledge acquired from this paper.
ABSTRACT: Increasing customer demand is driving modern data centers to embrace freely-expandable network architectures. Unfortunately, state-of-the-art freely-expandable networks suffer from either a large granularity of expansion or a prohibitive implementation cost. Furthermore, recent research has shown that data center traffic tends to be highly clustered. Based on these observations, this paper proposes a freely-expandable network architecture named Dandelion. Dandelion is a two-level hierarchical network, where the first level aims at 'high performance' and the second level aims at 'high scalability'. The resulting network has two distinct advantages. First, it can expand arbitrarily with a reasonable granularity. Second, the router architecture is efficient as well as highly scalable since 1) the routing table is significantly compressed and 2) a fixed number of virtual channels per physical channel is required regardless of the network size. Finally, the traffic characteristics of four typical cloud applications are analyzed, and the generated traffic patterns are used to evaluate the proposed network architecture. Simulation results show that Dandelion is a promising network architecture for future data centers.
ABSTRACT: Phase change memory (PCM) is a promising alternative main memory thanks to its better scalability and lower leakage than DRAM. However, the long write latency of PCM puts it at a severe disadvantage against DRAM. In this paper, we propose a Dynamic Write Consolidation (DWC) scheme to improve PCM memory system performance while reducing energy consumption. This paper is motivated by the observation that a large fraction of a cache line being written back to memory is not actually modified. DWC exploits the unnecessary burst writes of unmodified data to consolidate multiple writes targeting the same row into one write. By doing so, DWC enables multiple writes to be sent within one. DWC incurs low implementation overhead and shows significant efficiency. The evaluation results show that DWC achieves up to 35.7% performance improvement, and 17.9% on average. The effective write latency is reduced by up to 27.7%, and by 16.0% on average. Moreover, DWC reduces the energy consumption by up to 35.3%, and by 13.9% on average.
ABSTRACT: In 2005, as chip multiprocessors started to appear widely, it became possible for the on-chip cores to share the last-level cache. At the time, architects either considered the last-level cache to be divided into per-core private segments, or wholly shared. The shared cache utilized the capacity more efficiently but suffered from high, uniform latencies. This paper proposed a new direction: allowing the caches to be non-uniform, with a varying number of processors sharing each section of the cache. Sharing degree, the number of cores sharing a last-level cache, determines the level of replication in on-chip caches and also affects the capacity and latency for each shared cache. Building on our previous work that introduced non-uniform cache architectures (NUCA), this study explored the design space for shared multi-core caches, focusing on the effect of sharing degree. Our observation of a per-application optimal sharing degree led to a static NUCA design with a reconfigurable sharing degree. This work in multicore NUCA cache architectures has been influential in contemporary systems, including the level-3 cache in the IBM Power 7 and Power 8 processors.
ABSTRACT: Write-optimized data structures like the Log-Structured Merge-tree (LSM-tree) and its variants are widely used in key-value storage systems like Bigtable and Cassandra. Due to deferral and batching, LSM-tree based storage systems need background compactions to merge key-value entries and keep them sorted for future queries and scans. Background compactions play a key role in the performance of LSM-tree based storage systems. Existing studies of background compaction focus on decreasing the compaction frequency, reducing I/Os, or confining compactions to hot data key-ranges. They do not pay much attention to the computation time in background compactions. However, the computation time is no longer negligible: computation can take more than 60% of the total compaction time in storage systems using flash-based SSDs. Therefore, an alternative method to speed up compaction is to make good use of the parallelism of the underlying hardware, including CPUs and I/O devices. In this paper, we analyze the compaction procedure, recognize the performance bottleneck, and propose the Pipelined Compaction Procedure (PCP) to better utilize the parallelism of CPUs and I/O devices. Theoretical analysis proves that PCP can improve the compaction bandwidth. Furthermore, we implement PCP in a real system and conduct extensive experiments. The experimental results show that the pipelined compaction procedure can increase the compaction bandwidth and storage system throughput by 77% and 62%, respectively.
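To see why overlapping I/O and computation can raise compaction bandwidth, a back-of-the-envelope timing model helps. The sketch below is not the paper's theoretical analysis; it assumes a simple three-stage pipeline (read, compute, write) with one worker per stage and identical per-block stage times, and the function names and example numbers are made up for illustration.

```python
def sequential_time(blocks, read, compute, write):
    """Total time when each block is read, merged, and written one after another."""
    return blocks * (read + compute + write)


def pipelined_time(blocks, read, compute, write):
    """Total time when the three stages overlap across consecutive blocks.

    Assumes one worker per stage and uniform stage times: the first block
    pays the full latency, and every later block finishes one
    bottleneck-stage interval after its predecessor.
    """
    if blocks == 0:
        return 0
    return (read + compute + write) + (blocks - 1) * max(read, compute, write)


# Hypothetical workload: 10 blocks, compute-bound as on a flash-based SSD.
seq = sequential_time(10, read=2, compute=6, write=2)   # 100 time units
pipe = pipelined_time(10, read=2, compute=6, write=2)   # 64 time units
print(seq / pipe)                                       # prints 1.5625
```

Under these assumptions the achievable speedup is bounded by the ratio of the summed stage times to the slowest stage, so the benefit is largest when I/O and computation times are balanced.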
ABSTRACT: Mobile devices such as smartphones and tablets have become the primary consumer computing devices, and their rate of adoption continues to grow. The applications that run on these mobile platforms vary in how they use hardware resources, and their diversity is increasing. Performance and power limitations also vary widely across mobile platforms. Thus there is a growing need for tools to help computer architects design systems to meet the needs of mobile workloads. Full-system simulators are invaluable tools for designing new architectures, but we still need appropriate benchmark suites that capture the behaviors of emerging mobile applications. Current benchmark suites cover only a small range of mobile applications, and many cannot run directly in simulators due to their user interaction requirements. In this paper, we introduce and characterize Moby, a benchmark suite designed to make it easier to use full-system architectural simulators to evaluate microarchitectures for mobile processors. Moby contains popular Android applications, including a web browser, a social networking application, an email client, a music player, a video player, a document processing application, and a map program. To facilitate microarchitectural exploration, we port the Moby benchmark suite to the popular gem5 simulator. We characterize the architecture-independent features of Moby applications on the simulator and analyze the architecture-dependent features on a current-generation mobile platform. Our results show that mobile applications exhibit complex instruction execution behaviors and poor code locality, but current mobile platforms, especially their instruction-related components, cannot meet these requirements.
ABSTRACT: The flourishing of large-scale, high-throughput web applications has emphasized the importance of high-density servers for their distinct advantages, such as high computing density, low power, and low space requirements. To achieve these advantages, an efficient intra-server interconnection network is necessary. Most state-of-the-art high-density servers adopt a fully-connected intra-server network to achieve high network performance. Unfortunately, this solution is very expensive due to the high degree of nodes. To address this problem, we exploit the theoretically optimal Moore graph to interconnect the chips within a server. Considering the size of applications, the 50-vertex Moore graph, namely the Hoffman-Singleton graph, is extensively discussed in this paper. The simulation results show that it can attain performance comparable to the fully-connected network at much lower cost. In practice, however, chips could be integrated onto multiple boards. Thus, the graph should be divided into self-connected sub-graphs of the same size. Unfortunately, state-of-the-art solutions do not consider the production problem and generate heterogeneous sub-graphs. To address this problem, we propose two equivalent-partition solutions for the Hoffman-Singleton graph depending on the density of boards. Finally, we propose and evaluate a deadlock-free routing algorithm for each partition scheme.
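For readers unfamiliar with the Hoffman-Singleton graph, the sketch below builds it with the well-known pentagon/pentagram construction and checks the properties that make it attractive as an interconnect: 50 vertices, degree 7, and diameter 2, which meets the Moore bound 1 + 7 + 7·6 = 50. This is a generic construction, not the partitioning or routing scheme proposed in the paper; the vertex labels and function names are illustrative choices.

```python
from collections import deque
from itertools import product


def hoffman_singleton():
    """Return an adjacency map of the Hoffman-Singleton graph.

    Vertices are ("P", h, j) for pentagon h and ("Q", i, j) for pentagram i,
    with all indices taken modulo 5.
    """
    adj = {(kind, g, j): set()
           for kind in ("P", "Q") for g, j in product(range(5), repeat=2)}

    def link(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for g, j in product(range(5), repeat=2):
        link(("P", g, j), ("P", g, (j + 1) % 5))  # pentagon edges
        link(("Q", g, j), ("Q", g, (j + 2) % 5))  # pentagram edges
    for h, i, j in product(range(5), repeat=3):
        # Vertex j of pentagon h is joined to vertex h*i + j of pentagram i.
        link(("P", h, j), ("Q", i, (h * i + j) % 5))
    return adj


def diameter(adj):
    """Longest shortest-path length, computed by BFS from every vertex."""
    worst = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) != len(adj):
            raise ValueError("graph is disconnected")
        worst = max(worst, max(dist.values()))
    return worst


graph = hoffman_singleton()
```

With only 7 links per chip, any chip reaches any other in at most two hops, whereas a fully-connected network of 50 chips would need 49 links per chip.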
ABSTRACT: Data centers are increasingly employing virtualization as a means to ensure performance isolation for latency-sensitive applications while allowing co-location of multiple applications. Previous research has shown that virtualization can offer excellent resource isolation. However, whether virtualization can mitigate the interference among micro-architectural resources has not been well studied. This paper presents an in-depth analysis of the performance isolation effect of virtualization technology on various micro-architectural resources (i.e., L1 D-Cache, L2 Cache, last level cache (LLC), hardware prefetchers, and Non-Uniform Memory Access) by mapping the CloudSuite benchmarks to different sockets, different cores of one chip, and different threads of one core. For each resource, we investigate the correlation between performance variations and contention by changing VM mapping policies according to different application characteristics. Our experiments show that virtualization has rather limited micro-architectural isolation effects. Specifically, LLC interference can degrade application performance by as much as 28%. When it comes to intra-core resources, the application performance degradation can be as much as 27%. Additionally, we outline several opportunities to improve performance by reducing interference from misbehaving VMs.
ABSTRACT: As the amount of data explodes rapidly, more and more corporations are using data centers to make effective decisions and gain a competitive edge. Data analysis applications play a significant role in data centers, and hence it has become increasingly important to understand their behaviors in order to further improve the performance of data center computer systems. In this paper, after investigating the three most important application domains in terms of page views and daily visitors, we choose eleven representative data analysis workloads and characterize their micro-architectural characteristics by using hardware performance counters, in order to understand the impacts and implications of data analysis workloads on the systems equipped with modern superscalar out-of-order processors. Our study on the workloads reveals that data analysis applications share many inherent characteristics, which place them in a different class from desktop (SPEC CPU2006), HPC (HPCC), and service workloads, including traditional server workloads (SPECweb2005) and scale-out service workloads (four among six benchmarks in CloudSuite), and accordingly we give several recommendations for architecture and system optimizations. On the basis of our workload characterization work, we released a benchmark suite named DCBench for typical datacenter workloads, including data analysis and service workloads, with an open-source license on our project home page at http://prof.ict.ac.cn/DCBench. We hope that DCBench is helpful for performing architecture and small-to-medium scale system research for datacenter
ABSTRACT: We now live in an era of big data, and big data applications are becoming more and more pervasive. How to benchmark data center computer systems running big data applications (in short, big data systems) is a hot topic. In this paper, we focus on measuring the performance impacts of diverse applications and scalable volumes of data sets on big data systems. For four typical data analysis applications, an important class of big data applications, we find two major results through experiments. First, the data scale has a significant impact on the performance of big data systems, so we must provide scalable volumes of data sets in big data benchmarks. Second, for the four applications, even though all of them use simple algorithms, the performance trends differ with increasing data scales, and hence we must consider not only variety of data sets but also variety of applications in benchmarking big data
ABSTRACT: In this work, we implement an ARMv8 functional and performance simulator based on the gem5 infrastructure, which is the first open-source ARMv8 simulator. All the ARMv8 A64 instructions other than SIMD are implemented using the gem5 ISA description language. The ARMv8 simulator supports multiple CPU models, multiple memory systems, and the McPAT power model.
ABSTRACT: Within today's large-scale data centers, inter-node communication is often the major bottleneck. This fact has recently spurred data center network (DCN) research. Since building a real data center is cost-prohibitive, most DCN studies rely on simulations. Unfortunately, state-of-the-art network simulators have limited support for real-world applications, which prevents researchers from first-hand investigation. To address this issue, we developed a unified and cross-layer simulation framework named DCNSim. By leveraging two widely deployed simulators, DCNSim introduces computer architecture solutions into DCN research. With DCNSim, one can run packet-level network simulation driven by commercial applications while varying computer and network parameters, such as CPU frequency, memory access latency, network topology, and protocols. With extensive validations, we show that DCNSim can accurately capture performance trends caused by changing computer and network parameters. Finally, through several case studies, we argue that future DCN research should consider computer architecture factors.
ABSTRACT: Cloud computing has demonstrated tremendous capability in a wide spectrum of online services. Virtualization provides an efficient solution to the utilization of modern multicore processor systems while affording significant flexibility. The growing popularity of virtualized datacenters motivates a deeper understanding of the interactions between virtual machine management and the micro-architectural behaviors of the privileged domain. We argue that these behaviors must be factored into the design of processor microarchitecture in virtualized datacenters. In this work, we use performance counters on modern servers to study the micro-architectural execution characteristics of the privileged domain while performing various VM management operations. Our study shows that today's state-of-the-art processors still have room for further optimization when executing virtualized cloud workloads, particularly in the organization of last level caches and the on-chip cache coherence protocol. Specifically, our analysis shows that: shared caches could be partitioned to eliminate interference between the privileged domain and guest domains; the cache coherence protocol could support a high degree of data sharing of the privileged domain; and cache capacity or CPU utilization occupied by the privileged domain could be effectively managed when performing management workflows to achieve high system throughput.
ABSTRACT: Desktop cloud replaces traditional desktop computers with completely virtualized systems from the cloud. It is becoming one of the fastest growing segments in the cloud computing market. However, as far as we know, little work has been done to understand the behavior of desktop cloud. On one hand, desktop cloud workloads are different from conventional data center workloads in that they are rich with interactive operations. Desktop cloud workloads are also different from traditional non-virtualized desktop workloads in that they have an extra layer in the software stack, the hypervisor. On the other hand, desktop cloud servers are mostly built with conventional commodity processors. While such processors are well optimized for traditional desktops and high performance computing workloads, their effectiveness for desktop cloud workloads remains to be studied. As an attempt to shed some light on the effectiveness of conventional general-purpose processors on desktop cloud workloads, we have studied the behavior of desktop cloud workloads and compared it with that of SPEC CPU2006, TPC-C, PARSEC, and CloudSuite. We evaluate a Xen-based virtualization platform. The performance results reveal that desktop cloud workloads have significantly different characteristics from SPEC CPU2006, TPC-C, and PARSEC, but they perform similarly to data center scale-out benchmarks from CloudSuite. In particular, desktop cloud workloads have a high instruction cache miss rate (12.7% on average), a high percentage of kernel instructions (23% on average), and low IPC (0.36 on average). They also have much higher TLB miss rates and lower utilization of off-chip memory bandwidth than traditional benchmarks. Our experimental numbers indicate that the effectiveness of existing commodity processors is quite low for desktop cloud workloads. In this paper, we provide some preliminary discussions on some potential architectural and micro-architectural enhancements.
We hope that the performance numbers presented in this paper will give some insights to the designers of desktop cloud systems.
ABSTRACT: Inability to hide main memory latency has been increasingly limiting the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller (AMC) that can execute Active Memory Operations (AMOs). AMOs are select operations sent to and executed on the home memory controller of data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips.
In this paper, we present the microarchitecture design of AMC, and the programming model of AMOs. We compare AMOs’ performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50× faster barriers, 12× faster spinlocks, 8.5×–15× faster stream/array operations, and 3× faster database queries. We also present an analytical model that can predict the performance benefits of using AMOs with decent accuracy. The silicon cost required to support AMOs is less than 1% of the die area of a typical high performance processor, based on a standard cell implementation.
No preview · Article · Oct 2012 · The Journal of Supercomputing