Article

COLORIS: A dynamic cache partitioning system using page coloring

Abstract

Shared caches in multicore processors are subject to contention from co-running threads. The resultant interference can lead to highly variable performance for individual applications. This is particularly problematic for real-time applications, which require predictable timing guarantees. Previous work has applied page coloring techniques to partition a shared cache, so that conflict misses are minimized amongst co-running workloads. However, prior page coloring techniques have not addressed the problem of partitioning a cache on over-committed processors where there are more executable threads than cores. Similarly, page coloring techniques have not proven efficient at adapting the cache partition sizes for threads with varying memory demands. This paper presents a memory management framework called COLORIS, which provides support for both static and dynamic cache partitioning using page coloring. COLORIS supports novel policies to reconfigure the assignment of page colors amongst application threads in over-committed systems. For quality-of-service (QoS), COLORIS monitors the cache miss rates of running applications and triggers re-partitioning of the cache to prevent miss rates from exceeding application-specific ranges. This paper presents the design and evaluation of COLORIS as applied to Linux. We show the efficiency and effectiveness of COLORIS in coloring memory pages for a set of SPEC CPU2006 workloads, thereby enhancing performance isolation over existing page coloring techniques.
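
Page coloring works because, on a physically indexed cache, some of the set-index bits lie above the page offset and therefore come from the physical frame number that the OS chooses for a page. The sketch below illustrates that arithmetic only; the cache geometry and the addresses are assumed values for illustration, not parameters taken from COLORIS.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative parameters only -- not taken from the paper. */
#define CACHE_SIZE     (4u * 1024 * 1024)   /* 4 MiB shared LLC        */
#define ASSOCIATIVITY  16u                  /* 16-way set associative  */
#define PAGE_SIZE      4096u                /* 4 KiB base pages        */

/* Number of distinct page colors = (cache size per way) / page size.  */
#define NUM_COLORS     (CACHE_SIZE / ASSOCIATIVITY / PAGE_SIZE)  /* 64 */

/* The color of a physical page comes from the set-index bits that lie
 * above the page offset, i.e. the low bits of the page frame number.  */
static unsigned page_color(uint64_t phys_addr)
{
    return (unsigned)((phys_addr / PAGE_SIZE) % NUM_COLORS);
}

int main(void)
{
    uint64_t a = 0x12345000ull;
    uint64_t b = a + (uint64_t)NUM_COLORS * PAGE_SIZE;

    printf("colors available: %u\n", NUM_COLORS);
    printf("color(0x%llx) = %u\n", (unsigned long long)a, page_color(a));
    /* Frames whose numbers differ by a multiple of NUM_COLORS map to
     * the same cache sets and therefore share a color.                */
    printf("color(0x%llx) = %u\n", (unsigned long long)b, page_color(b));
    return 0;
}
```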

... Page coloring, as a software technique, does not require hardware support but has its own drawbacks [24]. For example, page coloring is incompatible with huge pages, since a huge page requires numerous contiguous base pages in both virtual and physical memory, which forcibly occupies all available page colors [41]. ...
... This section provides an introduction to prior techniques for implementing cache partitioning [22][23][24] and to different cache partitioning methods based on these techniques [10, 13, 14, 22-27, 29-32, 43]. These techniques have advantages and drawbacks of their own. Page coloring is a software-based technique and does not require hardware support. ...
... These techniques have advantages and drawbacks of their own. Page coloring is a software-based technique and does not require hardware support. It partitions cache resources along cache sets so that memory accesses are not arbitrarily mapped to any physical page [24]. However, it is not widely used because of its incompatibility with the huge-page mechanism [41] and because it is difficult to implement and incurs high page-recoloring costs. ...
Article
Full-text available
In the modern cloud environment, considering the cost of hardware and software resources, applications are often co-located on a platform and share such resources. However, co-located execution and resource sharing bring memory access conflicts, especially in the Last Level Cache (LLC). In this paper, a lightweight method for partitioning the LLC, named Classification-and-Allocation (C&A), is proposed. Specifically, a Support Vector Machine (SVM) is used to classify applications into three classes based on their performance change characteristic (PCC), and a Bayesian Optimizer (BO) is leveraged to schedule the LLC so that applications with the same PCC share the same part of the LLC. Since a near-optimal partition can be found efficiently by BO-based scheduling with a few sampling steps, C&A can handle unseen and versatile workloads with low overhead. We evaluate the proposed method on several workloads. Experimental results show that C&A outperforms the state-of-the-art method KPart (El-Sayed et al., in Proceedings of the 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), 104-117, 2018) by 7.45% and 22.50% in overall system throughput and fairness, respectively, and reduces allocation overhead by 20.60%.
... Several studies have been based on set partitioning, which does not require hardware support and does not decrease associativity [10], [11]. The entire mechanism uses software but requires changes to the operating system's memory management. ...
... COLORIS [11] proposed keeping free pages of the same color together in linked lists to address these problems. Applications can request multiple colors, and these allocations are performed round-robin. ...
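
As the preceding excerpt describes, per-color free lists plus round-robin allocation across a process's assigned colors are the core bookkeeping. Below is a minimal user-space sketch of that idea; the structure names, fixed color count, and the demo in main() are hypothetical and are not code from COLORIS or Linux.

```c
#include <stdlib.h>

#define NUM_COLORS 64                     /* assumed; platform-dependent */

struct page {
    struct page *next;
    unsigned long pfn;                    /* physical frame number       */
};

/* One free list per color, as suggested in the cited description.      */
static struct page *free_lists[NUM_COLORS];

static unsigned color_of(unsigned long pfn) { return pfn % NUM_COLORS; }

static void free_page_colored(struct page *pg)
{
    unsigned c = color_of(pg->pfn);
    pg->next = free_lists[c];
    free_lists[c] = pg;
}

/* Allocate one page for a process that owns `ncolors` colors listed in
 * `colors[]`; successive allocations rotate through the owned colors.  */
static struct page *alloc_page_colored(const unsigned *colors, int ncolors,
                                       unsigned *rr_cursor)
{
    for (int tried = 0; tried < ncolors; tried++) {
        unsigned c = colors[(*rr_cursor + tried) % ncolors];
        if (free_lists[c]) {
            struct page *pg = free_lists[c];
            free_lists[c] = pg->next;
            *rr_cursor = (*rr_cursor + tried + 1) % ncolors;
            return pg;
        }
    }
    return NULL;                          /* all owned colors exhausted  */
}

int main(void)
{
    static struct page pool[256];
    for (unsigned i = 0; i < 256; i++) {
        pool[i].pfn = i;
        free_page_colored(&pool[i]);
    }
    unsigned owned[] = { 3, 7 };          /* this process owns two colors */
    unsigned cursor = 0;
    struct page *p1 = alloc_page_colored(owned, 2, &cursor);
    struct page *p2 = alloc_page_colored(owned, 2, &cursor);
    /* Round-robin allocation should alternate between the two colors.  */
    return (p1 && p2 && color_of(p1->pfn) != color_of(p2->pfn)) ? 0 : 1;
}
```
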
... It is well known that contention over shared hardware resources leads to substantial violation of temporal properties when workload developed and tested in isolation is consolidated on the same multi-core platform. Effects such as shared cache contention [1], [2], DRAM bank conflicts [3], [4], and contention at the DDR controller [5] have significantly slowed down the adoption of multi-core solutions in the safety-critical domain. The presence of performance interference channels has been acknowledged by certification authorities [6], which have mandated methodologies to "account and bound" the temporal effect of interference channels for the certification of avionic systems. ...
... The last decade has produced seminal results [7] on techniques to manage contention at the different levels of the memory hierarchy. Unfortunately, there is a substantial lack of frameworks and methodologies that can be applied system-wide to (1) take into account realistic applications, (2) consider that processing workload does not occur only on CPUs, since accelerators (e.g., DMAs, video encoders, GPUs) are also fundamental components in real systems, and (3) be deployed on existing platforms while ensuring that the models assumed to derive analytical guarantees match the true behavior of the hardware. ...
Conference Paper
Full-text available
The proliferation of multi-core, accelerator-enabled embedded systems has introduced new opportunities to consolidate real-time systems of increasing complexity. But the road to building confidence in the temporal behavior of co-running applications has presented formidable challenges. Most prominently, the main memory subsystem represents a performance bottleneck for both CPUs and accelerators, and industry-viable frameworks for full-system main memory management and performance analysis are past due. In this paper, we propose our Envelope-aWare Predictive model, or E-WarP for short. E-WarP is a methodology and technological framework to: (1) analyze the memory demand of applications following a profile-driven approach; (2) make realistic predictions on the temporal behavior of workload deployed on CPUs and accelerators; and (3) perform saturation-aware system consolidation. This work aims at providing the technological foundations as well as the theoretical groundwork for truly workload-aware analysis of real-time systems. We provide a full implementation of our techniques on a commercial platform (NXP S32V234) and make two key observations. First, we achieve, on average, a 6% overprediction on the runtime of bandwidth-regulated applications. Second, we experimentally validate that the calculated bounds hold if the main memory subsystem operates below saturation.
... In this way, even though interference on the access bus remains, conflicts in the memory banks and the cache can be minimized, improving the temporal response of the system while also exploiting a degree of parallelism. Page coloring is used in multiprocessor systems by numerous works (Zhang et al., 2009; Liu et al., 2012; Suzuki et al., 2013; Ye et al., 2014; Yun et al., 2014; Mancuso et al., 2015; Kim and Rajkumar, 2016; Kloda et al., 2019; Lim and Kim, 2019; Park et al., 2020), among others, as an effective solution to mitigate memory interference. Focusing on mitigation for platforms with GPUs, a technique called FractionalGPU (Jain et al., 2019) has been developed, which carries the ideas of page coloring for multiprocessors over to the GPU. ...
Article
Full-text available
Autonomous driving is attracting growing interest in industry, not only in the automotive sector but also in the transport of people and goods by road or rail and in more controlled manufacturing environments. The cyber-physical systems being proposed for this kind of application require a large computing capacity (hardware architectures with several cores, GPUs, NPUs, etc.) in order to handle and react to a large and complex set of sensors (cameras, radar, LiDAR, distance measurement, etc.). At the same time, this kind of system must satisfy functional-safety and real-time requirements. The latter poses challenges that are being worked on intensively and in which many questions remain open. This work reviews the most recent literature on the use of heterogeneous architectures with GPUs in real-time applications. These works propose solutions for estimating execution-time bounds and temporal response, proposing different optimization strategies with an emphasis on mitigating memory interference.
... Colored Lockdown [26] combines coloring and locking. Other works have proposed dynamic re-coloring schemes [27]-[29]. Cache coloring has been implemented in several hypervisors such as Bao [4], Jailhouse [13], and XVisor [14]. ...
Preprint
Full-text available
Integrating workloads with differing criticality levels presents a formidable challenge in achieving the stringent spatial and temporal isolation requirements imposed by safety-critical standards such as ISO26262. The shift towards high-performance multicore platforms has been posing increasing issues to so-called mixed-criticality systems (MCS) due to the reciprocal interference created by consolidated subsystems vying for access to shared (microarchitectural) resources (e.g., caches, bus interconnect, memory controller). The research community has acknowledged all these challenges. Thus, several techniques, such as cache partitioning and memory throttling, have been proposed to mitigate such interference; however, these techniques have some drawbacks and limitations that impact performance, memory footprint, and availability. In this work, we look from a different perspective. Departing from the observation that safety-critical workloads are typically event- and thus interrupt-driven, we mask "colored" interrupts based on a quality-of-service (QoS) assessment, providing fine-grained control to mitigate interference on critical workloads without entirely suspending non-critical workloads. We propose the so-called IRQ coloring technique. We implement and evaluate IRQ coloring on a reference high-performance multicore platform, the Xilinx ZCU102. Results demonstrate negligible performance overhead, i.e., <1% for a 100-microsecond period, and reasonable throughput guarantees for medium-critical workloads. We argue that the IRQ coloring technique offers predictability and intermediate-guarantee advantages compared to state-of-the-art mechanisms.
... Much effort has been devoted to addressing the problem of shared-resource contention in multicore systems, and many mitigation solutions have been proposed in the real-time systems research community, including both software- and hardware-based mechanisms [25,19,40,20,31,41,35,12,38,33]. Recently, the need for greater isolation in accessing shared hardware resources has also been recognized by major industry players. ...
Preprint
In this paper, we address the industrial challenge put forth by ARM in ECRTS 2022. We systematically analyze the effect of shared resource contention on an augmented reality head-up display (AR-HUD) case-study application of the industrial challenge on a heterogeneous multicore platform, the NVIDIA Jetson Nano. We configure the AR-HUD application such that it can process incoming image frames in real-time at 20 Hz on the platform. We use micro-architectural denial-of-service (DoS) attacks as the aggressor tasks of the challenge and show that they can dramatically impact the latency and accuracy of the AR-HUD application, resulting in significant deviations of the estimated trajectories from the ground truth, despite our best effort to mitigate their influence by using cache partitioning and real-time scheduling of the AR-HUD application. We show that dynamic LLC (or DRAM, depending on the aggressor) bandwidth throttling of the aggressor tasks is an effective means of ensuring real-time performance of the AR-HUD application without resorting to over-provisioning the system.
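
Bandwidth throttling of the kind mentioned above is commonly realized by giving an aggressor core a per-period budget of memory transactions (counted with a performance counter) and stalling the core once the budget is spent. The toy simulation below illustrates only that regulation loop; the period, budget, and demand figures are assumptions for illustration and this is not the authors' implementation or measurements.

```c
#include <stdio.h>
#include <stdbool.h>

/* Toy per-period bandwidth regulation: an aggressor core may issue at
 * most BUDGET memory transactions per regulation period; once the
 * budget is exhausted it is stalled until the next period begins.
 * All numbers are illustrative, not taken from the paper.              */
#define PERIOD_US     1000        /* regulation period in microseconds  */
#define BUDGET        200         /* allowed transactions per period    */
#define DEMAND_PER_US 1           /* aggressor wants 1 transaction/us   */

int main(void)
{
    long served = 0, stalled_us = 0;
    long used_this_period = 0;

    for (int t = 0; t < 10 * PERIOD_US; t++) {
        if (t % PERIOD_US == 0)
            used_this_period = 0;          /* budget replenished         */

        bool wants_access = true;          /* aggressor is always ready  */
        if (wants_access && used_this_period < BUDGET) {
            used_this_period += DEMAND_PER_US;
            served += DEMAND_PER_US;
        } else if (wants_access) {
            stalled_us++;                  /* throttled: core is stalled */
        }
    }
    printf("served %ld transactions, stalled %ld us of %d us\n",
           served, stalled_us, 10 * PERIOD_US);
    return 0;
}
```
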
... However, many server systems are still equipped with older generations of processors, which are vulnerable to cache side-channel attacks. Hence, software-based mitigations are the only possible approach for such systems, even though the software patches lead to significant performance drops [16]-[19]. Another possible solution is to detect the processes that perform cache side-channel attacks. ...
Article
Full-text available
Cache side-channel attacks have been serious security threats to server computer systems, and researchers have therefore proposed software-based defense approaches that can detect these attacks. Profiling-based detectors are lightweight detection solutions that rely on hardware performance counters to identify the unique cache performance behaviors of cache side-channel attacks. The detectors typically need to set appropriate criteria to differentiate between attack processes and normal applications. In this paper, we explore the limitations of profiling-based detectors that rely on hardware performance counters. We present an attack scenario, called Vizard, that can bypass existing profiling-based detectors by manipulating the cache performance behavior of an attack process. Our analysis discloses that cache side-channel attacks include idle periods that can be exploited as attack windows for creating cache events. Vizard generates counterbalancing cache events within the attack windows to hide particular cache performance behaviors of cache side-channel attacks. Our evaluation shows that Vizard can effectively bypass profiling-based detectors while maintaining high attack success rates. Our work demonstrates that attackers can bypass existing detection approaches by manipulating performance counters.
... It is well known that contention over shared hardware resources leads to substantial violation of temporal properties when workload developed and tested in isolation is consolidated on the same multi-core platform. Effects such as shared cache contention (Ward et al. 2013; Ye et al. 2014), DRAM bank conflicts (Yun et al. 2014; Kim et al. 2014), and contention at the DDR controller have significantly slowed down the adoption of multi-core solutions in the safety-critical domain. The presence of performance interference channels has been acknowledged by certification authorities (C. A. S. Team 2016), which have mandated methodologies to "account and bound" the temporal effect of interference channels for the certification of avionic systems. ...
Article
Full-text available
The proliferation of multi-core, accelerator-enabled embedded systems has introduced new opportunities to consolidate real-time systems of increasing complexity. But the road to building confidence in the temporal behavior of co-running applications has presented formidable challenges. Most prominently, the main memory subsystem represents a performance bottleneck for both CPUs and accelerators, and industry-viable frameworks for full-system main memory management and performance analysis are past due. In this paper, we propose our Envelope-aWare Predictive model, or E-WarP for short. E-WarP is a methodology and technological framework to: (1) analyze the memory demand of applications following a profile-driven approach; (2) make realistic predictions on the temporal behavior of workload deployed on CPUs and accelerators; and (3) perform saturation-aware system consolidation. This work aims at providing the technological foundations as well as the theoretical groundwork for truly workload-aware analysis of real-time systems. This work combines traditional CPU-centric bandwidth regulation techniques with state-of-the-art hardware support for memory traffic shaping via the ARM QoS extensions. We make three key observations. First, our profile-driven methodology achieves, on average, 6% over-prediction on the runtime of bandwidth-regulated applications. Second, we experimentally validate that the calculated bounds hold system-wide if the main memory subsystem operates below saturation. Third, we show that the E-WarP methodology is practical even when applications exhibit input-dependent memory access patterns. We provide a full implementation of our techniques on a commercial platform (NXP S32V234).
... A performance monitoring subsystem employing hardware counters provides FlyOS with the ability to predict last-level shared cache occupancy [87,88]. Such estimates are then used by static page coloring techniques to partition shared caches between sandboxes [52,91]. Consequently, a guest kernel is isolated from any temporal and spatial interference in its execution by another guest. ...
Conference Paper
Full-text available
Autonomous multicopters often feature federated architectures, which incur relatively high communication costs between separate hardware components. These costs limit the ability to react quickly to new mission objectives. Additionally, federated architectures are not easily upgraded without introducing new hardware that impacts size, weight, power and cost (SWaP-C) constraints. In turn, such constraints restrict the use of redundant hardware to handle faults. In response to these challenges, we propose FlyOS, an Integrated Modular Avionics (IMA) approach to consolidate mixed-criticality flight functions in software on heterogeneous multicore aerial platforms. FlyOS is based on a separation kernel that statically partitions resources among virtualized sandboxed OSes. We present a dual-sandbox prototype configuration, where timing- and safety-critical flight control tasks execute in a real-time OS alongside mission-critical vision-based navigation tasks in a Linux sandbox. Low-latency shared-memory communication allows flight commands and data to be relayed in real time between sandboxes. A hypervisor-based fault-tolerance mechanism is also deployed to ensure failover flight control in case of critical function or timing failures. We validate FlyOS's performance and showcase its benefits when compared against traditional architectures in terms of predictable, extensible and efficient flight control.
... Once the cache and bank colors are assigned to a task, it only has access to the physical pages of the corresponding memory cell. In 2014, Ye et al. [101] introduced a memory management framework called COLORIS. COLORIS can create static and dynamic cache partitions using the page coloring technique. ...
Article
Full-text available
This survey reviews the scientific literature on techniques for reducing interference in real-time multicore systems, focusing on the approaches proposed between 2015 and 2020. It also presents proposals that use interference reduction techniques without considering the predictability issue. The survey highlights interference sources and categorizes proposals from the perspective of the shared resource. It covers techniques for reducing contention in main memory, cache memory, and the memory bus, as well as the integration of interference effects into schedulability analysis. Every section contains an overview of each proposal and an assessment of its advantages and disadvantages.
... However, this would cause lower utilization and require more servers to be purchased. Alternatively, providers could leverage LLC partitioning schemes proposed in the literature, such as those that use software techniques [31,61,66,78,80], hardware mechanisms [6,57,59,75], and combinations of software and hardware [11,62,70,72]. However, the prior work has one or more serious drawbacks when it comes to public cloud adoption. ...
... Mancuso et al. [31] proposed the Colored Lockdown technique, which combines page coloring and cache lockdown to better keep the frequently accessed pages of tasks in the cache. Page coloring has also been used for general-purpose multicore systems [11], [42]. Valsan et al. [49] studied the additional delay caused by miss status holding registers in non-blocking caches. ...
Chapter
Full-text available
With the increasing complexity of recent autonomous platforms, there is a strong demand to better utilize system resources while satisfying stringent real-time requirements. Embedded virtualization is an appealing technology to meet this demand. It enables the consolidation of real-time systems with different criticality levels on a single hardware platform by enforcing temporal isolation. On multi-core platforms, however, shared hardware resources, such as caches and memory buses, weaken this isolation. In particular, a large last-level cache in recent processors can easily jeopardize the timing predictability of real-time tasks due to the resulting cache interference. While researchers in the real-time systems community have developed solutions to tackle this problem, existing cache management schemes reveal two major limitations when used in a clustered multi-core embedded system. The first is the cache co-partitioning problem, which can lead to wrong cache allocation and cache underutilization. The second is cache interference from inter-virtual-machine (VM) communication, because prior work has considered only independent tasks. This paper presents a cluster-aware real-time cache allocation scheme to address these problems. The proposed scheme takes into account the cluster information of the system and finds a cache allocation that satisfies the timing and memory requirements of tasks. The scheme also maximizes slack time to meet task deadlines, which brings flexibility and resilience to unexpected events. Tasks using inter-VM communication are also provided with guaranteed blocking time and cache isolation. We have implemented a prototype of our scheme on an Nvidia TX2 clustered multi-core platform and evaluated its effectiveness over cluster-unaware approaches.
... The previously mentioned issue of interference through caching can be addressed with cache coloring (e.g., [5]), exploiting the fact that (depending on the organization of the cache) certain address ranges will map to the same cache line. By choosing the mapping of virtual memory pages to physical pages with this in mind, performance-optimal memory allocation as well as cache partitioning can be achieved. ...
Conference Paper
Full-text available
Due to the trends of centralizing the E/E architecture and new computing-intensive applications, high-performance hardware platforms are currently finding their way into automotive systems. However, the Systems-on-Chip (SoCs) currently available on the market have significant weaknesses when it comes to providing predictable performance for time-critical applications. The main reason for this is that these platforms are optimized for average-case performance. This shortcoming represents one major risk in the development of current and future automotive systems. In this paper we describe how high performance and predictability could (and should) be reconciled in future HW/SW platforms. We believe that this goal can only be reached via a close collaboration among system suppliers, IP providers, semiconductor companies, and OS/hypervisor vendors. Furthermore, academic input will be needed to solve remaining challenges and to further improve initial solutions.
... Cache coloring maps page addresses at the OS, compiler, or application level, requiring modifications to the OS's virtual memory [11]. In particular, when coloring dynamically at runtime, page colors have to be adjusted on a context switch [38]. ...
Article
Full-text available
Multicore architecture is applied to contemporary avionics systems to deal with complex tasks. However, multicore architectures can cause interference by contention because the cores share hardware resources. This interference reduces the predictable execution time of safety-critical systems, such as avionics systems. To reduce this interference, methods of separating hardware resources or limiting capacity by core have been proposed. Existing studies have modified kernels to control hardware resources. Additionally, an execution model has been proposed that can reduce interference by adjusting the execution order of tasks without software modification. Avionics systems require several rigorous software verification procedures. Therefore, modifying existing software can be costly and time-consuming. In this work, we propose a method to apply execution models proposed in existing studies without modifying commercial real-time operating systems. We implemented the time-division multiple access (TDMA) and acquisition execution restitution (AER) execution models with pseudo-partition and message queuing on VxWorks 653. Moreover, we propose a multi-TDMA model considering the characteristics of the target hardware. For the interference analysis, we measured the L1 and L2 cache misses and the number of main memory requests. We demonstrated that the interference caused by memory sharing was reduced by at least 60% in the execution model. In particular, multi-TDMA doubled utilization compared to TDMA and also reduced the execution time by 20% compared to the AER model.
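
A TDMA execution model of the kind evaluated here can be reduced to a very small schedule function: time is divided into fixed slots and a core may run its memory-intensive phases only during its own slot. The sketch below is a generic illustration of that idea; the slot length and core count are assumptions, and this is not the VxWorks 653 implementation described in the paper.

```c
#include <stdio.h>
#include <stdbool.h>

#define NUM_CORES    4
#define SLOT_LEN_US  500   /* assumed slot length in microseconds       */

/* In a TDMA execution model, core c may run its memory (acquisition/
 * restitution) phase only while the rotating slot index equals c.      */
static bool memory_phase_allowed(long now_us, int core)
{
    long slot = (now_us / SLOT_LEN_US) % NUM_CORES;
    return slot == core;
}

int main(void)
{
    /* Print which core owns the memory slot at the start of each slot. */
    for (long t = 0; t < 4 * SLOT_LEN_US; t += SLOT_LEN_US)
        for (int c = 0; c < NUM_CORES; c++)
            if (memory_phase_allowed(t, c))
                printf("t=%ldus: core %d owns the memory slot\n", t, c);
    return 0;
}
```
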
Chapter
After years of development, FPGAs finally made an appearance on multi-tenant cloud servers in the late 2010s. Research in micro-architectural attacks has uncovered a variety of vulnerabilities in shared compute devices like CPUs and GPUs which pose a substantial threat to cloud service providers and customers alike, but heterogeneous FPGA-CPU microarchitectures require reassessment of common assumptions about isolation and security boundaries, as they introduce new attack vectors and vulnerabilities. The FPGAs now available from major cloud services use technologies like direct memory access and coherent caching to offer high-throughput, low-latency, and highly scalable FPGA-FPGA and FPGA-CPU coprocessing for heavy workloads. This chapter explores how FPGAs with access to these microarchitectural features can accelerate attacks against host memory. It points out cache timing side channels and demonstrates a performant Rowhammer attack against a well-known RSA variant through direct memory access.
Article
Multicore PC-class embedded systems present an opportunity to consolidate separate microcontrollers as software-defined functions. For instance, an automotive system with more than 100 electronic control units (ECUs) could be replaced with one or, at most, several multicore PCs running software tasks for chassis, body, powertrain, infotainment and advanced driver assistance system (ADAS) services. However, a key challenge is how to handle real-time device input and output (I/O) and host-level networking as part of sensor data processing and control. A traditional microcontroller would commonly feature one or more Controller Area Network (CAN) buses for real-time I/O. CAN buses are usually absent in PCs, which instead feature higher bandwidth Universal Serial Bus (USB) interfaces. This paper shows how to achieve real-time device I/O and host-to-host communication over USB, using suitably written device drivers and a time-aware POSIX-like “tuned pipe” abstraction. This allows developers to establish task pipelines spanning one or more hosts, with end-to-end latency and throughput guarantees for sensor data processing, control and actuation.
Article
Non-volatile memory express (NVMe) solid-state drives (SSDs) have been widely adopted in multi-tenant cloud computing environments or multi-programming systems. The on-board DRAM cache inside NVMe SSDs can efficiently reduce the disk accesses and extend the lifetime of SSDs. Current SSD cache management research either improves cache hit ratio while ignoring fairness, or improves fairness while sacrificing overall performance. In this paper, we present MLCache, a space-efficient shared cache management scheme for NVMe SSDs. By learning the impact of reuse distance on cache allocation, a workload-generic neural network model is built. At runtime, MLCache continuously monitors the reuse distance distribution for the neural network module to obtain space-efficient allocation decisions. MLCache also proposes an efficient parallel writing back strategy based on hit ratio and response time, to improve fairness. Experimental results show MLCache improves the write hit ratio when compared to baseline, and MLCache strongly safeguards the fairness of SSDs with parallel write-back and maintains a low level of degradation.
Article
In this paper, we identify that memory performance plays a crucial role in the feasibility and effectiveness of denial-of-service attacks on a shared cache. Based on this insight, we introduce new cache DoS attacks, which can be mounted from user-space and can cause extreme worst-case execution time (WCET) impacts to cross-core victims—even if the shared cache is partitioned—by taking advantage of the platform's memory address mapping information and HugePage support. We deploy these enhanced attacks on two popular embedded out-of-order multicore platforms using both synthetic and real-world benchmarks. The proposed DoS attacks achieve up to 111X WCET increases on the tested platforms.
Article
Cache management policies should consider workloads' contention behavior when managing a shared cache. Prior art estimates shared cache behavior by adding extra logic or time to isolate per-workload cache statistics. These approaches provide per-workload analysis but do not provide a holistic understanding of the utilization and effectiveness of caches under the ever-growing contention that comes standard with scaling cores. We present Contention Analysis in Shared Hierarchies using Thefts, or CASHT, a framework for capturing cache contention information both offline and online. CASHT takes advantage of cache statistics made richer by observing a consequence of cache contention: inter-core evictions, or what we call thefts. We use thefts to complement more familiar cache statistics to train a learning model based on Gradient-boosting Trees (GBT) to predict the best ways to partition the last-level cache. GBT achieves 90+% accuracy with trained models as small as 100 B and at least 95% accuracy at a 1 kB model size when predicting the best way to partition two workloads. CASHT employs a novel run-time framework for collecting theft-based metrics despite partition intervention, and enables per-access sampling rather than set sampling that could add overhead but may not capture true workload behavior. Coupling CASHT and GBT as a dynamic policy results in a very lightweight and dynamic partitioning scheme that performs within a margin of error of Utility-based Cache Partitioning at 1/8 the overhead.
Article
Commodity multicore systems are increasingly adopting hardware support that enables the system software to partition the last-level cache (LLC). This support makes it possible for the operating system (OS) to mitigate shared-resource contention effects on multicores by assigning different co-running applications to various cache partitions. Cache-clustering strategies have emerged as a way to improve throughput and fairness on platforms with cache-partitioning support. Unlike strict cache-partitioning, which allocates separate cache partitions to each application, cache-clustering allows partitions to be shared by several applications. In this article we propose LFOC+, a fair OS-level cache-clustering policy for commodity multicores. LFOC+ tries to mimic the behavior of the optimal cache-clustering solution for fairness, which we could obtain for different workloads by using a simulation tool. Our strategy continuously gathers data from performance counters to classify applications based on their degree of cache sensitivity and contentiousness, and separates cache-sensitive applications from aggressor programs to improve fairness, while providing acceptable throughput. We implemented LFOC+ in the Linux kernel and evaluated it on a system featuring an Intel Skylake processor, where we compare its effectiveness to that of four state-of-the-art policies. Our analysis reveals that LFOC+ brings a higher reduction in unfairness and constitutes a lightweight cache-clustering policy.
Article
Many Cyber-Physical Systems (CPSs) in industrial applications are embedded control systems. With the increasing scale and complexity of such systems, resource efficiency in the system design has become an important issue. The resources of CPSs mainly include computation resources, communication resources and memory resources. CPSs closely combine computation, communication and instruction/data storage, and realize the combination and coordination of computing resources and physical resources. This paper classifies the existing mainstream research results into three aspects (computation resources, communication resources and memory resources), reviews the research hotspots of each field, and discusses the urgent problems related to computation, communication and memory resources in CPSs, as well as possible future research directions.
Conference Paper
Full-text available
Shared last-level caches, widely used in chip-multi-processors (CMPs), face two fundamental limitations. First, the latency and energy of shared caches degrade as the system scales up. Second, when multiple workloads share the CMP, they suffer from interference in shared cache accesses. Unfortunately, prior research addressing one issue either ignores or worsens the other: NUCA techniques reduce access latency but are prone to hotspots and interference, and cache partitioning techniques only provide isolation but do not reduce access latency.
Conference Paper
Full-text available
Multi-core architectures are shaking the fundamental assumption that in real-time systems the WCET, used to analyze the schedulability of the complete system, is calculated on individual tasks. This is not even true in an approximate sense in a modern multi-core chip, due to interference caused by hardware resource sharing. In this work we propose (1) a complete framework to analyze and profile task memory access patterns and (2) a novel kernel-level cache management technique to enforce an efficient and deterministic cache allocation of the most frequently accessed memory areas. In this way, we provide a powerful tool to address one of the main sources of interference in a system where the last level of cache is shared among two or more CPUs. The technique has been implemented on commercial hardware and our evaluations show that it can be used to significantly improve the predictability of a given set of critical tasks.
Article
Full-text available
A simple modification to an operating system's page allocation algorithm can give physically addressed caches the speed of virtually addressed caches. Colored page allocation reduces the number of bits that need to be translated before cache access, allowing large low-associativity caches to be indexed before address translation, which reduces the latency to the processor. The colored allocation also has other benefits: caches miss less (in general) and more uniformly, and the inclusion principle holds for second level caches with less associativity. However, the colored allocation requires main memory partitioning, and more common bits for shared virtual addresses. Simulation results show high non-uniformity of cache miss rates for normal allocation. Analysis demonstrates the extent of second-level cache inclusion, and the reduction in effective main-memory due to partitioning.
Article
Full-text available
When several applications are co-scheduled to run on a system with multiple shared LLCs, there is opportunity to improve system performance. This opportunity can be exploited by the hardware, software, or a combination of both hardware and software. The software, i.e., an operating system or hypervisor, can improve system performance by co-scheduling jobs on LLCs to minimize shared cache contention. The hardware can improve system throughput through better replacement policies by allocating more cache resources to applications that benefit from the cache and less to those applications that do not. This study presents a detailed analysis on the interactions between intelligent scheduling and smart cache replacement policies. We find that smart cache replacement reduces the burden on software to provide intelligent scheduling decisions. However, under smart cache replacement, there is still room to improve performance from better application co-scheduling. We find that co-scheduling decisions are a function of the underlying LLC replacement policy. We propose Cache Replacement and Utility-aware Scheduling (CRUISE)-a hardware/software co-designed approach for shared cache management. For 4-core and 8-core CMPs, we find that CRUISE approaches the performance of an ideal job co-scheduling policy under different LLC replacement policies.
Article
Full-text available
Most of today's multi-core processors feature shared L2 caches. A major problem faced by such architectures is cache contention, where multiple cores compete for usage of the single shared L2 cache. Uncontrolled sharing leads to scenarios where one core evicts useful L2 cache content belonging to another core. To address this problem, we have implemented a software mechanism in the operating system that allows for partitioning of the shared L2 cache by guiding the allocation of physical pages. This mechanism, which can also be applied to virtual machine monitors, provides isolation capabilities that lead to reduced contention. We show that this mechanism is effective in reducing cache contention in multiprogrammed SPECcpu2000 and SPECjbb2000 workloads. Performance improvements of up to 17% were achieved without adversely affecting co-scheduled applications. In order to effectively size L2 cache partitions, a quantifiable metric is needed to properly predict performance as a function of L2 cache size. For page management, Miss Rate Curves (MRCs) have proven to be useful for this purpose. However, for L2 cache sizing, we have found L2 MRCs to be inadequate and have found instruction retirement Stall Rate Curves (SRCs) to be more effective, where the stalls are caused by memory latencies.
Conference Paper
Full-text available
Cache partitioning and sharing is critical to the effective utilization of multicore processors. However, almost all existing studies have been evaluated by simulation that often has several limitations, such as excessive simulation time, absence of OS activities and proneness to simulation inaccuracy. To address these issues, we have taken an efficient software approach to supporting both static and dynamic cache partitioning in OS through memory address mapping. We have comprehensively evaluated several representative cache partitioning schemes with different optimization objectives, including performance, fairness, and quality of service (QoS). Our software approach makes it possible to run the SPEC CPU2006 benchmark suite to completion. Besides confirming important conclusions from previous work, we are able to gain several insights from whole-program executions, which are infeasible from simulation. For example, giving up some cache space in one program to help another one may improve the performance of both programs for certain workloads due to reduced contention for memory bandwidth. Our evaluation of previously proposed fairness metrics is also significantly different from a simulation-based study. The contributions of this study are threefold. (1) To the best of our knowledge, this is a highly comprehensive execution- and measurement-based study on multicore cache partitioning. This paper not only confirms important conclusions from simulation-based studies, but also provides new insights into dynamic behaviors and interaction effects. (2) Our approach provides a unique and efficient option for evaluating multicore cache partitioning. The implemented software layer can be used as a tool in multicore performance evaluation and hardware design. (3) The proposed schemes can be further refined for OS kernels to improve performance.
Conference Paper
Full-text available
Buffer caches in operating systems keep active file blocks in memory to reduce disk accesses. Related studies have focused on how to minimize buffer misses and the resulting performance degradation. However, the side effects and performance implications of accessing the data in buffer caches (i.e., buffer cache hits) have received little attention. In this paper, we show that accessing buffer caches can cause serious performance degradation on multicores, particularly with shared last-level caches (LLCs). There are two reasons for this problem. First, data in files normally have weaker locality than data objects in virtual memory spaces. Second, due to the shared structure of LLCs on multicore processors, an application accessing the data in a buffer cache may flush the to-be-reused data of its co-running applications from the shared LLC and significantly slow down these applications. The paper proposes a buffer cache design called Selected Region Mapping Buffer (SRM-buffer) for multicore systems to effectively address the cache pollution problem caused by the OS buffer. SRM-buffer improves existing OS buffer management with an enhanced page allocation policy that carefully selects mapping physical pages upon buffer misses. For a sequence of blocks accessed by an application, SRM-buffer allocates physical pages that are mapped to a selected region consisting of a small portion of sets in the LLC. Thus, when these blocks are accessed, cache pollution is effectively limited within the small cache region. We have implemented a prototype of SRM-buffer in the Linux kernel, and tested it with extensive workloads. Performance evaluation shows SRM-buffer can improve system performance and decrease the execution times of workloads by up to 36%.
Conference Paper
Full-text available
It is well recognized that LRU cache-line replacement can be ineffective for applications with large working sets or non-localized memory access patterns. Specifically, in last-level processor caches, LRU can cause cache pollution by inserting non-reusable elements into the cache while evicting reusable ones. The work presented in this paper addresses last-level cache pollution through a dynamic operating system mechanism, called ROCS, requiring no change to underlying hardware and no change to applications. ROCS employs hardware performance counters on a commodity processor to characterize application cache behavior at run-time. Using this online profiling, cache-unfriendly pages are dynamically mapped to a pollute buffer in the cache, eliminating competition between reusable and non-reusable cache lines. The operating system implements the pollute buffer through a page-coloring based technique, by dedicating a small slice of the last-level cache to store non-reusable pages. Measurements show that ROCS, implemented in the Linux 2.6.24 kernel and running on a 2.3GHz PowerPC 970FX, improves performance of memory-intensive SPEC CPU 2000 and NAS benchmarks by up to 34%, and 16% on average.
Conference Paper
Full-text available
This paper presents a new technique, compiler-directed page coloring, that eliminates conflict misses in multiprocessor applications. It enables applications to make better use of the increased aggregate cache size available in a multiprocessor. This technique uses the compiler's knowledge of the access patterns of the parallelized applications to direct the operating system's virtual memory page mapping strategy. We demonstrate that this technique can lead to significant performance improvements over two commonly used page mapping strategies for machines with either direct-mapped or two-way set-associative caches. We also show that it is complementary to latency-hiding techniques such as prefetching. We implemented compiler-directed page coloring in the SUIF parallelizing compiler and on two commercial operating systems. We applied the technique to the SPEC95fp benchmark suite, a representative set of numeric programs. We used the SimOS machine simulator to analyze the applications and isolate their performance bottlenecks. We also validated these results on a real machine, an eight-processor 350MHz Digital AlphaServer. Compiler-directed page coloring leads to significant performance improvements for several applications. Overall, our technique improves the SPEC95fp rating for eight processors by 8% over Digital UNIX's page mapping policy and by 20% over page coloring, a standard page mapping policy. The SUIF compiler achieves a SPEC95fp ratio of 57.4, the highest ratio to date.
Conference Paper
Full-text available
Memory can be efficiently utilized if the dynamic memory demands of applications can be determined and analyzed at run-time. The page miss ratio curve (MRC), i.e., the page miss rate vs. memory size curve, is a good performance-directed metric to serve this purpose. However, dynamically tracking MRCs at run time is challenging in systems with virtual memory because not every memory reference passes through the operating system (OS). This paper proposes two methods to dynamically track the MRCs of applications at run time. The first method uses a hardware MRC monitor that can track MRCs at fine time granularity. Our simulation results show that this monitor has negligible performance and energy overheads. The second method is an OS-only implementation that can track MRCs at coarse time granularity. Our implementation results on Linux show that it adds only 7-10% overhead. We have also used the dynamic MRC to guide both memory allocation for multiprogramming systems and memory energy management. Our real-system experiments on Linux with applications including the Apache Web Server show that MRC-directed memory allocation can speed up the applications' execution/response time by up to a factor of 5.86 and reduce the number of page faults by up to 63.1%. Our execution-driven simulation results with SPEC2000 benchmarks show that MRC-directed memory energy management can improve the Energy*Delay metric by 27-58% over previously proposed static and dynamic schemes.
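
To make the miss-ratio curve concrete, the sketch below derives one offline from a page-reference trace with the classic LRU stack-distance (Mattson) algorithm: the miss ratio at capacity C is the fraction of references whose reuse distance is at least C. It only illustrates the metric these methods track, using a toy trace; it is not the hardware or OS mechanisms proposed in the paper.

```c
#include <stdio.h>

/* Compute LRU stack distances for a page-reference trace and derive a
 * miss-ratio curve.  The O(N*S) list walk is fine for an illustration,
 * not for production use.                                              */
#define MAX_PAGES 4096

static long lru_stack[MAX_PAGES];         /* most recent at index 0     */
static int  stack_len;

static int stack_distance(long page)      /* returns -1 on a cold miss  */
{
    int pos = -1;
    for (int i = 0; i < stack_len; i++)
        if (lru_stack[i] == page) { pos = i; break; }

    int top = (pos >= 0) ? pos
                         : (stack_len < MAX_PAGES ? stack_len : MAX_PAGES - 1);
    for (int i = top; i > 0; i--)          /* shift down, move page to top */
        lru_stack[i] = lru_stack[i - 1];
    lru_stack[0] = page;
    if (pos < 0 && stack_len < MAX_PAGES) stack_len++;
    return pos;
}

int main(void)
{
    /* Toy trace: a loop over 6 pages visited twice. */
    long trace[] = { 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6 };
    int n = sizeof trace / sizeof trace[0];
    static long hist[MAX_PAGES + 1];       /* hist[MAX_PAGES] = cold misses */

    for (int i = 0; i < n; i++) {
        int d = stack_distance(trace[i]);
        hist[d < 0 ? MAX_PAGES : d]++;
    }
    for (int cap = 1; cap <= 8; cap++) {   /* cache capacity in pages    */
        long misses = hist[MAX_PAGES];     /* cold misses always count   */
        for (int d = cap; d < MAX_PAGES; d++)
            misses += hist[d];             /* reuse distance >= capacity */
        printf("capacity %d pages -> miss ratio %.2f\n",
               cap, (double)misses / n);
    }
    return 0;
}
```
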
Conference Paper
Full-text available
Miss rate curves (MRCs) are useful in a number of contexts. In our research, online L2 cache MRCs enable us to dynamically identify optimal cache sizes when cache-partitioning a shared-cache multicore processor. Obtaining L2 MRCs has generally been assumed to be expensive when done in software and consequently, their usage for online optimizations has been limited. To address these problems and opportunities, we have developed a low-overhead software technique to obtain L2 MRCs online on current processors, exploiting features available in their performance monitoring units so that no changes to the application source code or binaries are required. Our technique, called RapidMRC, requires a single probing period of roughly 221 million processor cycles (147 ms), and subsequently 124 million cycles (83 ms) to process the data. We demonstrate its accuracy by comparing the obtained MRCs to the actual L2 MRCs of 30 applications taken from SPECcpu2006, SPECcpu2000, and SPECjbb2000. We show that RapidMRC can be applied to sizing cache partitions, helping to achieve performance improvements of up to 27%.
Conference Paper
Full-text available
Performance degradation of memory-intensive programs caused by the LRU policy's inability to handle weak-locality data accesses in the last level cache is increasingly serious for two reasons. First, the last-level cache remains in the CPU's critical path, where only simple management mechanisms, such as LRU, can be used, precluding some sophisticated hardware mechanisms to address the problem. Second, the commonly used shared cache structure of multi-core processors has made this critical path even more performance-sensitive due to intensive inter-thread contention for shared cache resources. Researchers have recently made efforts to address the problem with the LRU policy by partitioning the cache using hardware or OS facilities guided by run-time locality information. Such approaches often rely on special hardware support or lack enough accuracy. In contrast, for a large class of programs, the locality information can be accurately predicted if access patterns are recognized through small training runs at the data object level. To achieve this goal, we present a system-software framework referred to as Soft-OLP (Software-based Object-Level cache Partitioning). We first collect per-object reuse distance histograms and inter-object interference histograms via memory-trace sampling. With several low-cost training runs, we are able to determine the locality patterns of data objects. For the actual runs, we categorize data objects into different locality types and partition the cache space among data objects with a heuristic algorithm, in order to reduce cache misses through segregation of contending objects. The object-level cache partitioning framework has been implemented with a modified Linux kernel, and tested on a commodity multi-core processor. Experimental results show that in comparison with a standard L2 cache managed by LRU, Soft-OLP significantly reduces the execution time by reducing L2 cache misses across inputs for a set of single- and multi-threaded programs from the SPEC CPU2000 benchmark suite, NAS benchmarks and a computational kernel set.
Conference Paper
Full-text available
This paper presents a detailed study of fairness in cache sharing between threads in a chip multiprocessor (CMP) architecture. Prior work in CMP architectures has only studied throughput optimization techniques for a shared cache. The issue of fairness in cache sharing, and its relation to throughput, has not been studied. Fairness is a critical issue because the operating system (OS) thread scheduler's effectiveness depends on the hardware to provide fair cache sharing to co-scheduled threads. Without such hardware, serious problems, such as thread starvation and priority inversion, can arise and render the OS scheduler ineffective. This paper makes several contributions. First, it proposes and evaluates five cache fairness metrics that measure the degree of fairness in cache sharing, and shows that two of them correlate very strongly with the execution-time fairness. Execution-time fairness is defined as how uniform the execution times of co-scheduled threads are changed, where each change is relative to the execution time of the same thread running alone. Secondly, using the metrics, the paper proposes static and dynamic L2 cache partitioning algorithms that optimize fairness. The dynamic partitioning algorithm is easy to implement, requires little or no profiling, has low overhead, and does not restrict the cache replacement algorithm to LRU. The static algorithm, although requiring the cache to maintain LRU stack information, can help the OS thread scheduler to avoid cache thrashing. Finally, this paper studies the relationship between fairness and throughput in detail. We found that optimizing fairness usually increases throughput, while maximizing throughput does not necessarily improve fairness. Using a set of co-scheduled pairs of benchmarks, on average our algorithms improve fairness by a factor of 4×, while increasing the throughput by 15%, compared to a nonpartitioned shared cache.
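
Execution-time fairness as defined above compares each thread's slowdown under sharing against its solo run and asks how uniform those slowdowns are. The sketch below computes one generic instance of such a metric (the sum of pairwise slowdown differences); it is illustrative only and is not necessarily one of the five metrics proposed in the paper.

```c
#include <math.h>
#include <stdio.h>

/* Each thread's slowdown is T_shared / T_alone.  Perfect fairness means
 * all slowdowns are equal; the pairwise-difference sum below is zero in
 * that case and grows as sharing treats threads less uniformly.        */
static double unfairness_pairwise(const double *t_alone,
                                  const double *t_shared, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            double si = t_shared[i] / t_alone[i];
            double sj = t_shared[j] / t_alone[j];
            total += fabs(si - sj);        /* 0 means perfectly fair     */
        }
    return total;
}

int main(void)
{
    double t_alone[]  = { 10.0, 20.0 };    /* seconds when run alone     */
    double t_shared[] = { 15.0, 22.0 };    /* seconds when co-scheduled  */
    printf("unfairness = %.3f\n", unfairness_pairwise(t_alone, t_shared, 2));
    return 0;
}
```
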
Conference Paper
Full-text available
Existing cache partitioning schemes are designed in a manner oblivious to the implicit processor partitioning enforced by the operating system. This paper examines an operating system directed integrated processor-cache partitioning scheme that partitions both the available processors and the shared cache in a chip multiprocessor among different multi-threaded applications. Extensive simulations using a set of multiprogrammed workloads show that our integrated processor-cache partitioning scheme facilitates achieving better performance isolation as compared to state of the art hardware/software based solutions. Specifically, our integrated processor-cache partitioning approach performs, on an average, 20.83% and 14.14% better than equal partitioning and the implicit partitioning enforced by the underlying operating system, respectively, on the fair speedup metric on an 8 core system. We also compare our approach to processor partitioning alone and a state-of-the-art cache partitioning scheme and our scheme fares 8.21% and 9.19% better than these schemes on a 16 core system.
Conference Paper
Full-text available
High performance general-purpose processors are increasingly being used for a variety of application domains - scientific, engineering, databases, and more recently, media processing. It is therefore important to ensure that architectural features that use a significant fraction of the on-chip transistors are applicable across these different domains. For example, current processor designs often devote the largest fraction of on-chip transistors (up to 80%) to caches. Many workloads, however, do not make effective use of large caches; e.g., media processing workloads which often have streaming data access patterns and large working sets. This paper proposes a new reconfigurable cache design. This design enables the cache SRAM arrays to be dynamically divided into multiple partitions that can be used for different processor activities. These activities can benefit applications that would otherwise not use the storage allocated to large conventional caches. Our design involves relatively few modifications to conventional cache design, and analysis using a modification of the CACTI analytical model shows a small impact on cache access time. We evaluate one representative use of reconfigurable caches - instruction reuse for media processing. We find this use gives IPC improvements ranging from 1.04X to 1.20X in simulation across eight media processing benchmarks.
Article
Full-text available
Modern chip-level multiprocessors (CMPs) contain multiple processor cores sharing a common last-level cache, memory interconnects, and other hardware resources. Workloads running on separate cores compete for these resources, often resulting in highly variable performance. It is generally desirable to co-schedule workloads that have minimal resource contention, in order to improve both performance and fairness. Unfortunately, commodity processors expose only limited information about the state of shared resources such as caches to the software responsible for scheduling workloads that execute concurrently. To make informed resource-management decisions, it is important to obtain accurate measurements of per-workload cache occupancies and their impact on performance, often summarized by utility functions such as miss-ratio curves (MRCs). In this paper, we first introduce an efficient online technique for estimating the cache occupancy of individual software threads using only commonly-available hardware performance counters. We derive an analytical model as the basis of our occupancy estimation, and extend it for improved accuracy on modern cache configurations, considering the impact of set-associativity, line replacement policy, and memory locality effects. We demonstrate the effectiveness of occupancy estimation with a series of CMP simulations in which SPEC benchmarks execute concurrently on multiple cores. Leveraging our occupancy estimation technique, we also introduce a lightweight approach for online MRC construction, and demonstrate its effectiveness using a prototype implementation in the VMware ESX Server hypervisor. We present a series of experiments involving SPEC benchmarks, comparing the MRCs we construct online with MRCs generated offline in which various cache sizes are enforced via static page coloring.
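
A simple first-order occupancy model of the kind this line of work builds on updates a thread's expected cache occupancy each sampling interval from its own misses and its co-runners' misses. The sketch below implements only that linear approximation, with made-up counter values; the paper's actual model refines it for associativity, replacement policy, and locality, so this is an assumption-laden illustration rather than the authors' model.

```c
#include <stdio.h>

/* First-order occupancy update: a thread's expected occupancy E (in
 * cache lines) grows with its own misses that land in lines it does
 * not yet own and shrinks with co-runners' misses evicting its lines. */
static double update_occupancy(double E, double cache_lines,
                               double self_misses, double other_misses)
{
    double frac = E / cache_lines;          /* share currently owned    */
    E += (1.0 - frac) * self_misses;        /* fills into others' lines */
    E -= frac * other_misses;               /* evictions by co-runners  */
    if (E < 0.0) E = 0.0;
    if (E > cache_lines) E = cache_lines;
    return E;
}

int main(void)
{
    const double lines = 65536.0;           /* e.g. 4 MiB / 64 B lines  */
    double E = 0.0;
    for (int interval = 0; interval < 10; interval++) {
        /* Hypothetical per-interval miss counts, as if read from
         * hardware performance counters. */
        E = update_occupancy(E, lines, 8000.0, 5000.0);
        printf("interval %d: estimated occupancy = %.0f lines\n",
               interval, E);
    }
    return 0;
}
```
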
Article
Full-text available
This paper proposes dynamic cache partitioning amongst simultaneously executing processes/threads. We present a general partitioning scheme that can be applied to set-associative caches. Since memory reference characteristics of processes/threads can change over time, our method collects the cache miss characteristics of processes/threads at run-time. Also, the workload is determined at run-time by the operating system scheduler. Our scheme combines the information, and partitions the cache amongst the executing processes/threads. Partition sizes are varied dynamically to reduce the total number of misses. The partitioning scheme has been evaluated using a processor simulator modeling a two-processor CMP system. The results show that the scheme can improve the total IPC significantly over the standard least recently used (LRU) replacement policy. In a certain case, partitioning doubles the total IPC over standard LRU. Our results show that smart cache management and scheduling is essential to achieve high performance with shared cache memory.
Conference Paper
Full-text available
In this paper we address the problem of on-chip memory selection for computationally intensive applications, by proposing scratch pad memory as an alternative to cache. Area and energy for different scratch pad and cache sizes are computed using the CACTI tool while performance was evaluated using the trace results of the simulator. The target processor chosen for evaluation was AT91M40400. The results clearly establish scratchpad memory as a low power alternative in most situations with an average energy reduction of 40%. Further the average area-time reduction for the scratchpad memory was 46% of the cache memory
Conference Paper
Full-text available
We propose a low overhead, online memory monitoring scheme utilizing a set of novel hardware counters. The counters indicate the marginal gain in cache hits as the size of the cache is increased, which gives the cache miss-rate as a function of cache size. Using the counters, we describe a scheme that enables an accurate estimate of the isolated miss-rates of each process as a function of cache size under the standard LRU replacement policy. This information can be used to schedule jobs or to partition the cache to minimize the overall miss-rate. The data collected by the monitors can also be used by an analytical model of cache and memory behavior to produce a more accurate overall miss-rate for the collection of processes sharing a cache in both time and space. This overall miss-rate can be used to improve scheduling and partitioning schemes.
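To illustrate how such marginal-gain counters yield a miss-rate curve, the sketch below uses made-up counter values (it is not the paper's hardware design): counter k records the extra hits gained by growing the cache from k to k+1 allocation units, so the isolated miss count at size s is the miss count at size zero minus the cumulative gains up to s.

```python
def miss_rate_curve(total_refs, misses_at_zero, marginal_gains):
    """Turn marginal-gain counters into miss rate as a function of cache size.

    marginal_gains[k] -- extra hits from growing the cache from size k to k+1
    """
    misses = [misses_at_zero]
    for gain in marginal_gains:
        misses.append(misses[-1] - gain)
    return [m / total_refs for m in misses]


if __name__ == "__main__":
    # hypothetical counters for one process, cache sizes 0..8 units
    gains = [4000, 2500, 1500, 900, 500, 250, 100, 50]
    mrc = miss_rate_curve(total_refs=100_000, misses_at_zero=12_000,
                          marginal_gains=gains)
    for size, rate in enumerate(mrc):
        print(f"cache size {size}: miss rate {rate:.3%}")
```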
Article
Full-text available
As the gap between memory and processor speeds continues to widen, cache efficiency is an increasingly important component of processor performance. Compiler techniques have been used to improve instruction and data cache performance for virtually indexed caches by mapping code and data with temporal locality to different cache blocks. In this paper we examine the performance of compiler and hardware approaches for reordering pages in physically addressed caches to eliminate cache misses. The software approach provides a color mapping at compile-time for code and data pages, which can then be used by the operating system to guide its allocation of physical pages. The hardware approach works by adding a page remap field to the TLB, which is used to allow a page to be remapped to a different color in the physically indexed cache while keeping the same physical page in memory. The results show that software page placement provided a 28% speedup and hardware page placement provided a 21% speedup on average for a superscalar processor. For a 4 processor single-chip multiprocessor, the miss rate was reduced from 8.7% down to 7.2% on average.
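For the physically addressed case, the "color" of a page is simply the portion of the cache index that lies above the page offset. A minimal sketch of that mapping, with parameter values chosen only for illustration (not taken from the paper):

```python
def num_page_colors(cache_bytes, associativity, line_bytes, page_bytes=4096):
    """Number of distinct page colors in a physically indexed cache."""
    sets = cache_bytes // (associativity * line_bytes)
    way_bytes = sets * line_bytes           # bytes indexed within one way
    return max(1, way_bytes // page_bytes)


def page_color(phys_addr, cache_bytes, associativity, line_bytes, page_bytes=4096):
    """Color of the physical page containing phys_addr."""
    colors = num_page_colors(cache_bytes, associativity, line_bytes, page_bytes)
    return (phys_addr // page_bytes) % colors


if __name__ == "__main__":
    # e.g. a 2 MB, 16-way cache with 64-byte lines has 32 page colors
    print(num_page_colors(2 * 1024 * 1024, 16, 64))         # -> 32
    print(page_color(0x12345678, 2 * 1024 * 1024, 16, 64))  # -> 5
```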
Article
Full-text available
This paper proposes a dynamic cache partitioning method for simultaneous multithreading systems. We present a general partitioning scheme that can be applied to set-associative caches at any partition granularity. Furthermore, in our scheme threads can have overlapping partitions, which provides more degrees of freedom when partitioning caches with low associativity. Since memory reference characteristics of threads can change very quickly, our method collects the miss-rate characteristics of simultaneously executing threads at runtime, and partitions the cache among the executing threads. Partition sizes are varied dynamically to improve hit rates. Trace-driven simulation results show a relative improvement in the L2 hit-rate of up to 40.5% over those generated by the standard least recently used replacement policy, and IPC improvements of up to 17%. Our results show that smart cache management and scheduling is important for SMT systems to achieve high performance.
Article
When several applications are co-scheduled to run on a system with multiple shared LLCs, there is opportunity to improve system performance. This opportunity can be exploited by the hardware, software, or a combination of both hardware and software. The software, i.e., an operating system or hypervisor, can improve system performance by co-scheduling jobs on LLCs to minimize shared cache contention. The hardware can improve system throughput through better replacement policies by allocating more cache resources to applications that benefit from the cache and less to those applications that do not. This study presents a detailed analysis on the interactions between intelligent scheduling and smart cache replacement policies. We find that smart cache replacement reduces the burden on software to provide intelligent scheduling decisions. However, under smart cache replacement, there is still room to improve performance from better application co-scheduling. We find that co-scheduling decisions are a function of the underlying LLC replacement policy. We propose Cache Replacement and Utility-aware Scheduling (CRUISE)-a hardware/software co-designed approach for shared cache management. For 4-core and 8-core CMPs, we find that CRUISE approaches the performance of an ideal job co-scheduling policy under different LLC replacement policies.
Article
Miss rate curves (MRCs) are useful in a number of contexts. In our research, online L2 cache MRCs enable us to dynamically identify optimal cache sizes when cache-partitioning a shared-cache multicore processor. Obtaining L2 MRCs has generally been assumed to be expensive when done in software and consequently, their usage for online optimizations has been limited. To address these problems and opportunities, we have developed a low-overhead software technique to obtain L2 MRCs online on current processors, exploiting features available in their performance monitoring units so that no changes to the application source code or binaries are required. Our technique, called RapidMRC, requires a single probing period of roughly 221 million processor cycles (147 ms), and subsequently 124 million cycles (83 ms) to process the data. We demonstrate its accuracy by comparing the obtained MRCs to the actual L2 MRCs of 30 applications taken from SPECcpu2006, SPECcpu2000, and SPECjbb2000. We show that RapidMRC can be applied to sizing cache partitions, helping to achieve performance improvements of up to 27%.
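The principle behind any MRC construction is the LRU stack distance: the reuse distance of a reference is the smallest cache size at which it would have hit. The toy sketch below computes an MRC from an explicit address trace purely in software; it only illustrates the math, whereas RapidMRC obtains its trace from the performance monitoring unit rather than from instrumentation.

```python
def mrc_from_trace(trace, max_size):
    """Miss counts for LRU caches of size 1..max_size, via stack distances."""
    stack = []                                   # most recently used at the end
    hits_at_distance = [0] * (max_size + 1)
    for line in trace:
        if line in stack:
            depth = len(stack) - stack.index(line)   # 1 = most recently used
            if depth <= max_size:
                hits_at_distance[depth] += 1
            stack.remove(line)
        stack.append(line)
    total, misses, cumulative_hits = len(trace), [], 0
    for size in range(1, max_size + 1):
        cumulative_hits += hits_at_distance[size]
        misses.append(total - cumulative_hits)
    return misses        # misses[s-1] = misses with an s-line LRU cache


if __name__ == "__main__":
    trace = [0, 1, 2, 0, 1, 3, 0, 1, 2, 3] * 100   # synthetic reuse pattern
    for size, m in enumerate(mrc_from_trace(trace, 4), start=1):
        print(f"LRU cache of {size} lines: {m} misses out of {len(trace)}")
```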
Article
Cache partitioning has a wide range of uses in CMPs, from guaranteeing quality of service and controlled sharing to security-related techniques. However, existing cache partitioning schemes (such as way-partitioning) are limited to coarse-grain allocations, can only support few partitions, and reduce cache associativity, hurting performance. Hence, these techniques can only be applied to CMPs with 2-4 cores, but fail to scale to tens of cores. We present Vantage, a novel cache partitioning technique that overcomes the limitations of existing schemes: caches can have tens of partitions with sizes specified at cache line granularity, while maintaining high associativity and strong isolation among partitions. Vantage leverages cache arrays with good hashing and associativity, which enable soft-pinning a large portion of cache lines. It enforces capacity allocations by controlling the replacement process. Unlike prior schemes, Vantage provides strict isolation guarantees by partitioning most (e.g. 90%) of the cache instead of all of it. Vantage is derived from analytical models, which allow us to provide strong guarantees and bounds on associativity and sizing independent of the number of partitions and their behaviors. It is simple to implement, requiring around 1.5% state overhead and simple changes to the cache controller. We evaluate Vantage using extensive simulations. On a 32-core system, using 350 multiprogrammed workloads and one partition per core, partitioning the last-level cache with conventional techniques degrades throughput for 71% of the workloads versus an unpartitioned cache (by 7% average, 25% maximum degradation), even when using 64-way caches. In contrast, Vantage improves throughput for 98% of the workloads, by 8% on average (up to 20%), using a 4-way cache.
Article
We propose a way to improve the performance of embedded processors running data-intensive applications by allowing software to allocate on-chip memory on an application-specific basis. On-chip memory in the form of cache can be made to act like scratchpad memory via a novel hardware mechanism, which we call column caching. Column caching enables dynamic cache partitioning in software, by mapping data regions to specified sets of cache "columns" or "ways." When a region of memory is...
Article
Multi-core processors with shared L2 caches can suffer from performance degradations when co-scheduled programs contend for cache resources in a destructive manner. In this work, we propose a new classification algorithm for determining the "personalities" of the programs with respect to their cache sharing behaviors. We first demonstrate that our scheme can more accurately predict when cache sharing problems may arise (and therefore when dynamic cache partitioning techniques are needed) than other previously proposed approaches. This may be useful in the creation of better workloads for future multi-core shared-cache simulation studies. Furthermore, our proposed scheme can be implemented directly in hardware to provide dynamic, on-the-fly classification of program behaviors (other classification techniques require, for example, performance comparisons against solo-executions where a program uses the entire L2 cache, which cannot be trivially derived in an online fashion). Using this dynamic classification ability, we propose a very simple dynamic cache partitioning scheme that performs slightly better than the Utility-based Cache Partitioning scheme while incurring a lower implementation cost.
Conference Paper
The effectiveness of the last-level shared cache is crucial to the performance of a multi-core system. In this paper, we observe and make use of the DelinquentPC - Next-Use characteristic to improve shared cache performance. We propose a new PC-centric cache organization, NUcache, for the shared last level cache of multi-cores. NUcache logically partitions the associative ways of a cache set into MainWays and DeliWays. While all lines have access to the MainWays, only lines brought in by a subset of delinquent PCs, selected by a PC selection mechanism, are allowed to enter the DeliWays. The PC selection mechanism is an intelligent cost-benefit analysis based algorithm that utilizes Next-Use information to select the set of PCs that can maximize the hits experienced in DeliWays. Performance evaluation reveals that NUcache improves the performance over a baseline design by 9.6%, 30% and 33% respectively for dual, quad and eight core workloads comprised of SPEC benchmarks. We also show that NUcache is more effective than other well-known cache-partitioning algorithms.
Conference Paper
The significant speed gap between processor and memory and the limited chip memory bandwidth make last-level cache performance crucial for future chip multiprocessors. To use the capacity of shared last-level caches efficiently and to allow for a short access time, proposed non-uniform cache architectures (NUCAs) are organized into per-core partitions. If a core runs out of cache space, blocks are typically relocated to nearby partitions, thus managing the cache as a shared cache. This uncontrolled sharing of all resources may unfortunately result in pollution that degrades performance. We propose a novel non-uniform cache architecture in which the amount of cache space that can be shared among the cores is controlled dynamically. The adaptive scheme continuously estimates the effect of increasing or decreasing the shared partition size on the overall performance. We show that our scheme outperforms a private and shared cache organization as well as a hybrid NUCA organization in which blocks in a local partition can spill over to neighbor core partitions.
Conference Paper
Modern multi-core processors present new resource management challenges due to the subtle interactions of simultaneously executing processes sharing on-chip resources (particularly the L2 cache). Recent research demonstrates that the operating system may use the page coloring mechanism to control cache partitioning, and consequently to achieve fair and efficient cache utilization. However, page coloring places additional constraints on memory space allocation, which may conflict with application memory needs. Further, adaptive adjustments of cache partitioning policies in a multi-programmed execution environment may incur substantial overhead for page recoloring (or copying). This paper proposes a hot-page coloring approach enforcing coloring on only a small set of frequently accessed (or hot) pages for each process. The cost of identifying hot pages online is reduced by leveraging the knowledge of spatial locality during a page table scan of access bits. Our results demonstrate that hot page identification and selective coloring can significantly alleviate the coloring-induced adverse effects in practice. However, we also reach the somewhat negative conclusion that without additional hardware support, adaptive page coloring is only beneficial when recoloring is performed infrequently (meaning long scheduling time quanta in multi-programmed executions).
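The bookkeeping behind hot-page selection can be sketched as follows: on every periodic scan, pages whose access bit was found set get their (exponentially decayed) score bumped, and only pages whose score stays above a threshold become recoloring candidates. The class, decay factor, and threshold below are all hypothetical; a real implementation would gather the access bits from a kernel page-table walk, as the paper describes.

```python
from collections import defaultdict

DECAY = 0.5          # weight given to history on each scan (assumed value)
HOT_THRESHOLD = 1.8  # decayed score above which a page counts as hot (assumed)


class HotPageTracker:
    """Tracks decayed per-page access scores across periodic scans."""

    def __init__(self):
        self.score = defaultdict(float)          # page frame number -> score

    def scan(self, accessed_pfns):
        """One scan: accessed_pfns are pages whose access bit was set
        (and then cleared) during the simulated page-table walk."""
        accessed = set(accessed_pfns)
        for pfn in set(self.score) | accessed:
            self.score[pfn] = DECAY * self.score[pfn] + (pfn in accessed)

    def hot_pages(self):
        return sorted(p for p, s in self.score.items() if s >= HOT_THRESHOLD)


if __name__ == "__main__":
    tracker = HotPageTracker()
    for accessed in [[1, 2, 3, 7], [1, 2, 7], [1, 2, 7, 9], [1, 2, 7], [1, 7]]:
        tracker.scan(accessed)
    print("hot pages (candidates for selective coloring):", tracker.hot_pages())
```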
Conference Paper
Cache hierarchies have been traditionally designed for usage by a single application, thread or core. As multi-threaded (MT) and multi-core (CMP) platform architectures emerge and their workloads range from single-threaded and multithreaded applications to complex virtual machines (VMs), a shared cache resource will be consumed by these different entities generating heterogeneous memory access streams exhibiting different locality properties and varying memory sensitivity. As a result, conventional cache management approaches that treat all memory accesses equally are bound to result in inefficient space utilization and poor performance even for applications with good locality properties. To address this problem, this paper presents a new cache management framework (CQoS) that (1) recognizes the heterogeneity in memory access streams, (2) introduces the notion of QoS to handle the varying degrees of locality and latency sensitivity and (3) assigns and enforces priorities to streams based on latency sensitivity, locality degree and application performance needs. To achieve this, we propose CQoS options for priority classification, priority assignment and priority enforcement. We briefly describe CQoS priority classification and assignment options -- ranging from user-driven and developer-driven to compiler-detected and flow-based approaches. Our focus in this paper is on CQoS mechanisms for priority enforcement -- these include (1) selective cache allocation, (2) static/dynamic set partitioning and (3) heterogeneous cache regions. We discuss the architectural design and implementation complexity of these CQoS options. To evaluate the performance trade-offs for these options, we have modeled these CQoS options in a cache simulator and evaluated their performance in CMP platforms running network-intensive server workloads. Our simulation results show the effectiveness of our proposed options and make the case for CQoS in future multi-threaded/multi-core platforms since it improves shared cache efficiency and increases overall system performance as a result.
Conference Paper
This paper presents and studies a distributed L2 cache management approach through OS-level page allocation for future many-core processors. L2 cache management is a crucial multicore processor design aspect to overcome non-uniform cache access latency for good program performance and to reduce on-chip network traffic and related power consumption. Unlike previously studied hardware-based private and shared cache designs implementing a "fixed" caching policy, the proposed OS-microarchitecture approach is flexible; it can easily implement a wide spectrum of L2 caching policies without complex hardware support. Furthermore, our approach can provide a differentiated execution environment to running programs by dynamically controlling data placement and cache sharing degrees. We discuss key design issues of the proposed approach and present preliminary experimental results showing the promise of our approach.
Conference Paper
This paper investigates the problem of partitioning a shared cache between multiple concurrently executing applications. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, giving more cache resources to the application that has a high demand and fewer cache resources to the application that has a low demand. However, a higher demand for cache resources does not always correlate with a higher performance from additional cache resources. It is beneficial for performance to invest cache resources in the application that benefits more from the cache resources rather than in the application that has more demand for the cache resources. This paper proposes utility-based cache partitioning (UCP), a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources. The proposed mechanism monitors each application at runtime using a novel, cost-effective, hardware circuit that requires less than 2kB of storage. The information collected by the monitoring circuits is used by a partitioning algorithm to decide the amount of cache resources allocated to each application. Our evaluation, with 20 multiprogrammed workloads, shows that UCP improves performance of a dual-core system by up to 23% and on average 11% over LRU-based cache partitioning.
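The allocation step can be pictured as a greedy loop over per-application utility curves: repeatedly hand the next way to whichever application promises the largest drop in misses. The sketch below uses invented miss numbers and a plain greedy policy, which is a simplification of UCP's lookahead allocation for non-convex curves.

```python
def greedy_way_allocation(utility, total_ways, min_ways=1):
    """utility[app][w] = misses of app when given w ways (w = 0..total_ways)."""
    apps = list(utility)
    alloc = {a: min_ways for a in apps}

    def marginal_gain(a):
        w = alloc[a]
        return utility[a][w] - utility[a][w + 1]   # misses saved by one more way

    for _ in range(total_ways - min_ways * len(apps)):
        alloc[max(apps, key=marginal_gain)] += 1
    return alloc


if __name__ == "__main__":
    # hypothetical misses (per million instructions) versus allocated ways
    utility = {
        "cache_friendly": [900, 700, 550, 430, 340, 280, 240, 210, 195],
        "streaming":      [500, 495, 492, 490, 489, 488, 488, 488, 488],
    }
    print(greedy_way_allocation(utility, total_ways=8))
    # nearly all ways go to the application that benefits from extra capacity
```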
Conference Paper
The role of the operating system (OS) in managing shared resources such as CPU time, memory, peripherals, and even energy is well motivated and understood (23). Unfortunately, one key resource, the lower-level shared cache in chip multiprocessors, is commonly managed purely in hardware by rudimentary replacement policies such as least-recently-used (LRU). The rigid nature of the hardware cache management policy poses a serious problem since there is no single best cache management policy across all sharing scenarios. For example, the cache management policy for a scenario where applications from a single organization are running under a "best effort" performance expectation is likely to be different from the policy for a scenario where applications from competing business entities (say, at a third-party data center) are running under a minimum service level expectation. When it comes to managing shared caches, there is an inherent tension between flexibility and performance. On one hand, managing the shared cache in the OS offers immense policy flexibility since it may be implemented in software. Unfortunately, it is prohibitively expensive in terms of performance for the OS to be involved in managing temporally fine-grain events such as cache allocation. On the other hand, sophisticated hardware-only cache management techniques to achieve fair sharing or throughput maximization have been proposed, but they offer no policy flexibility. This paper addresses this problem by designing architectural support for the OS to efficiently manage shared caches with a wide variety of policies. Our scheme consists of a hardware cache-quota management mechanism, an OS interface, and a set of OS-level quota orchestration policies. The hardware mechanism guarantees that OS-specified quotas are enforced in shared caches, thus eliminating the need for (and the performance penalty of) temporally fine-grained OS intervention. The OS retains policy flexibility since it can tune the quotas during regularly scheduled OS interventions. We demonstrate that our scheme can support a wide range of policies, including policies that provide (a) passive per-...
Article
When a computer system supports both paged virtual memory and large real-indexed caches, cache performance depends in part on the main memory page placement. To date, most operating systems place pages by selecting an arbitrary page frame from a pool of page frames that have been made available by the page replacement algorithm. We give a simple model that shows that this naive (arbitrary) page placement leads to up to 30% unnecessary cache conflicts. We develop several page placement algorithms, called careful-mapping algorithms, that try to select a page frame (from the pool of available page frames) that is likely to reduce cache contention. Using trace-driven simulation, we find that careful mapping results in 10–20% fewer (dynamic) cache misses than naive mapping (for a direct-mapped real-indexed multimegabyte cache). Thus, our results suggest that careful mapping by the operating system can get about half the cache miss reduction that a cache size (or associativity) doubling can.
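The simplest careful-mapping policy, page coloring, can be sketched as per-color free lists plus a preference for frames whose color matches the faulting virtual page, so that consecutive virtual pages spread across cache sets instead of colliding. The data structures and fallback rule below are illustrative, not the paper's exact algorithms.

```python
from collections import deque


class ColoredFrameAllocator:
    """Per-color free lists with color-matching allocation."""

    def __init__(self, free_pfns, num_colors):
        self.num_colors = num_colors
        self.free = [deque() for _ in range(num_colors)]
        for pfn in free_pfns:
            self.free[pfn % num_colors].append(pfn)

    def alloc(self, virt_page_number):
        """Prefer a frame of the same color as the virtual page; fall back to
        the nearest non-empty color (naive placement would ignore color)."""
        want = virt_page_number % self.num_colors
        for offset in range(self.num_colors):
            color = (want + offset) % self.num_colors
            if self.free[color]:
                return self.free[color].popleft()
        raise MemoryError("no free page frames")


if __name__ == "__main__":
    allocator = ColoredFrameAllocator(free_pfns=range(1000, 1064), num_colors=16)
    for vpn in range(8):
        pfn = allocator.alloc(vpn)
        print(f"vpn {vpn} (color {vpn % 16}) -> pfn {pfn} (color {pfn % 16})")
```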
Article
We propose an organization for the on-chip memory system of a chip multiprocessor, in which 16 processors share a 16MB pool of 256 L2 cache banks. The L2 cache is organized as a non-uniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support the spectrum of degrees of sharing: unshared, in which each processor has a private portion of the cache, thus reducing hit latency, completely shared, in which every processor shares the entire cache, thus minimizing misses, and every point in between. We find the optimal degree of sharing for a number of cache bank mapping policies, and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of two or four work best across a suite of commercial and scientific parallel workloads. We also demonstrate that migratory, dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased power consumption and complexity, especially as per-application cache partitioning strategies are applied.
Article
As CMP platforms are widely adopted, more and more cores are integrated on to the die. To reduce the off-chip memory access, the last level cache is usually organized as a distributed shared cache. In order to avoid hot-spots, cache lines are interleaved across the distributed shared cache slices using a hash function. However, as we increase the number of cores and cache slices in the platform, this also implies that most of data references go to remote cache slices, thereby increasing the access latency significantly. In this paper, we propose a hybrid last level cache, which has some amount of private space and some amount of shared space on each cache slice. For workloads with no sharing, the goal is to provide more hits into the local slice while still keeping the overall miss rate low. For workloads with sufficient sharing, the goal is to allow more sharing in the last-level cache slice. We present hybrid last-level cache design options and study its hit/miss rate behavior for a number of important server applications and multi-programmed workloads. Our simulation results on running multi- programmed workloads based on SPEC CINT2000 as well as multithreaded workloads based on commercial server benchmarks (TPCC, SPECjbb, SAP and TPCE) show that this architecture is advantageous especially since it can improve the local hit rate significantly while keeping the overall miss rate similar to the shared cache.
Conference Paper
The last line of defense in the cache hierarchy before going to off-chip memory is very critical in chip multiprocessors (CMPs) from both the performance and power perspectives. We investigate different organizations for this last line of defense (assumed to be L2 in this article) towards reducing off-chip memory accesses. We evaluate the trade-offs between private L2 and address-interleaved shared L2 designs, noting their individual benefits and drawbacks. The possible imbalance between the L2 demands across the CPUs favors a shared L2 organization, while the interference between these demands can favor a private L2 organization. We propose a new architecture, called Shared Processor-Based Split L2, that captures the benefits of these two organizations, while avoiding many of their drawbacks. Using several applications from the SPEC OMP suite and a commercial benchmark, Specjbb, on a complete system simulator, we demonstrate the benefits of this shared processor-based L2 organization. Our results show as much as 42.50% improvement in IPC over the private organization (with 11.52% on the average), and as much as 42.22% improvement over the shared interleaved organization (with 9.76% on the average).
Conference Paper
The MIPS R6000 microprocessor relies on a new type of translation lookaside buffer — called a TLB slice — which is less than one-tenth the size of a conventional TLB and as fast as one multiplexer delay, yet has a high enough hit rate to be practical. The fast translation makes it possible to use a physical cache without adding a translation stage to the processor's pipeline. The small size makes it possible to include address translation on-chip, even in a technology with a limited number of devices. The key idea behind the TLB slice is to have both a virtual tag and a physical tag on a physically-indexed cache. Because of the virtual tag, the TLB slice needs to hold only enough physical page number bits — typically 4 to 8 — to complete the physical cache index, in contrast with a conventional TLB, which needs to hold both a virtual page number and a physical page number. The virtual page number is unnecessary because the TLB slice needs to provide only a hint for the translated physical address rather than a guarantee. The full physical page number is unnecessary because the cache hit logic is based on the virtual tag. Furthermore, if the cache is multi-level and references to the TLB slice are “shielded” by hits in a virtually indexed primary cache, the slice can get by with very few entries, once again lowering its cost and increasing its speed. With this mechanism, the simplicity of a physical cache can been combined with the speed of a virtual cache.
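The width of such a slice falls out of the cache geometry: a physically indexed cache needs index bits that extend above the page offset, and only those bits of the physical page number have to be stored per entry. A small worked calculation with illustrative parameters (not the R6000's actual configuration):

```python
import math


def tlb_slice_bits(cache_bytes, associativity, page_bytes=4096):
    """Physical page-number bits needed to complete a physical cache index."""
    way_bytes = cache_bytes // associativity     # bytes indexed within one way
    return max(0, int(math.log2(way_bytes)) - int(math.log2(page_bytes)))


if __name__ == "__main__":
    print(tlb_slice_bits(512 * 1024, 1))   # 512 KB direct-mapped -> 7 bits
    print(tlb_slice_bits(64 * 1024, 2))    # 64 KB, 2-way         -> 3 bits
```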
Article
We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a nonuniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support a spectrum of degrees of sharing: unshared, in which each processor owns a private portion of the cache, thus reducing hit latency, and completely shared, in which every processor shares the entire cache, thus minimizing misses, and every point in between. We measure the optimal degree of sharing for different cache bank mapping policies and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of 2 or 4 works best across a suite of commercial and scientific parallel workloads. We demonstrate that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity, especially as per-application cache partitioning strategies are applied. We also evaluate the energy efficiency of each design point in terms of network traffic, bank accesses, and external memory accesses.
Café: Cache-aware fair and efficient scheduling for CMPs
R. West, P. Zaroo, C. A. Waldspurger, and X. Zhang. CAFÉ: Cache-aware fair and efficient scheduling for CMPs. In Multicore Technology: Architecture, Reconfiguration and Modeling. CRC Press, 2013.
Application-specific memory management for embedded systems using software-controlled caches
D. Chiou, P. Jain, L. Rudolph, and S. Devadas. Application-specific memory management for embedded systems using software-controlled caches. In Proceedings of the 37th Annual Design Automation Conference, pages 416-419, New York, NY, USA, 2000.