Conference Paper

Reducing Shared Cache Contention by Scheduling Order Adjustment on Commodity Multi-cores

DOI: 10.1109/IPDPS.2011.248. In: Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW)
Source: IEEE Xplore

ABSTRACT Due to the power and complexity limitations of traditional single-core processors, multi-core processors have become mainstream. One key feature of commodity multi-cores is that the last level cache (LLC) is usually shared. However, shared-cache contention can significantly degrade application performance. Several existing proposals demonstrate that task co-scheduling has the potential to alleviate this contention, but it is challenging to make co-scheduling practical in commodity operating systems. In this paper, we propose two lightweight, practical cache-aware co-scheduling methods, namely static SOA and dynamic SOA, to address the cache contention problem on commodity multi-cores. The central idea of both methods is that cache contention can be reduced by properly adjusting the scheduling order. The two methods differ mainly in how they acquire each process's cache requirement: the static SOA (static scheduling order adjustment) method acquires the cache requirement information through offline profiling, while the dynamic SOA (dynamic scheduling order adjustment) captures cache requirement statistics at runtime using performance counters. Experimental results using multi-programmed NAS workloads suggest that the proposed methods can greatly reduce the effect of cache contention on multi-core systems. Specifically, for the static SOA method, execution time can be reduced by up to 15.7% and the number of cache misses by up to 11.8%, and the improvement holds across different cache sizes and time-slice lengths. For the dynamic SOA method, execution time can be reduced by up to 7.09%.
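The abstract does not show how a scheduling order would actually be adjusted. As a rough, illustrative sketch of the central idea (interleave cache-hungry and cache-light tasks so that consecutive time slices impose complementary LLC demands), assuming a simple list of (name, cache demand in MB) pairs in place of the paper's profiled or counter-derived statistics:

```python
def adjust_order(tasks):
    """Reorder tasks so high- and low-cache-demand tasks alternate.

    tasks: list of (name, cache_demand_mb) tuples.  The demand values
    stand in for the offline-profiled (static SOA) or counter-sampled
    (dynamic SOA) cache-requirement statistics described in the paper.
    """
    ordered = sorted(tasks, key=lambda t: t[1])  # ascending by demand
    result = []
    lo, hi = 0, len(ordered) - 1
    while lo <= hi:
        result.append(ordered[hi])      # heaviest remaining task...
        if lo < hi:
            result.append(ordered[lo])  # ...followed by the lightest
        lo += 1
        hi -= 1
    return result
```

With tasks A:1 MB, B:8 MB, C:2 MB and D:6 MB this yields the order B, A, D, C, so each heavy cache consumer runs adjacent to a light one; the function name and the MB-based demand representation are our own assumptions for illustration.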

Related publications:
  • ABSTRACT: We investigated how operating system design should be adapted for multithreaded chip multiprocessors (CMT) - a new generation of processors that exploit thread-level parallelism to mask the memory latency in modern workloads. We determined that the L2 cache is a critical shared resource on CMT and that an insufficient amount of L2 cache can undermine the ability to hide memory latency on these processors. To use the L2 cache as efficiently as possible, we propose an L2-conscious scheduling algorithm and quantify its performance potential. Using this algorithm it is possible to reduce miss ratios in the L2 cache by 25-37% and improve processor throughput by 27-45%.
    Proceedings of the 2005 USENIX Annual Technical Conference, April 10-15, 2005, Anaheim, CA, USA; 01/2005
  • ABSTRACT: The shared-cache contention on Chip Multiprocessors causes performance degradation to applications and hurts system fairness. Many previously proposed solutions schedule programs according to runtime-sampled cache performance to reduce cache contention. The strong dependence on runtime sampling inherently limits the scalability and effectiveness of those techniques. This work explores the combination of program locality analysis with job co-scheduling. The rationale is that program locality analysis typically offers a large-scope view of various facets of an application, including data access patterns and cache requirement. That knowledge complements the local behaviors sampled by runtime systems. The combination offers the key to overcoming the limitations of prior co-scheduling techniques. Specifically, this work develops a lightweight locality model that enables efficient, proactive prediction of the performance of co-running processes, offering the potential for integration in online scheduling systems. Compared to existing multicore scheduling systems, the technique reduces performance degradation by 34% (7% performance improvement) and unfairness by 47%. Its proactivity makes it resilient to the scalability issues that constrain the applicability of previous techniques.
    High Performance Embedded Architectures and Compilers, 5th International Conference, HiPEAC 2010, Pisa, Italy, January 25-27, 2010. Proceedings; 01/2010
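The HiPEAC abstract above centers on proactively predicting co-run performance from cache requirements instead of runtime sampling. As a toy sketch of that idea only, not the paper's actual locality model: rank pairings of tasks by how far their combined demand overflows a shared LLC. The capacity-overflow heuristic, function names, and MB units are all our own simplifying assumptions, and an even number of tasks sharing dual-core LLCs is assumed.

```python
from itertools import permutations

def predicted_contention(demand_a, demand_b, llc_size):
    """Toy contention estimate: how far two co-runners' combined cache
    demand overflows a shared LLC (0.0 means predicted to fit)."""
    return max(0.0, demand_a + demand_b - llc_size)

def best_pairing(tasks, llc_size):
    """Exhaustively choose the co-run pairing with the least total
    predicted overflow.  tasks: dict name -> demand in MB, even count."""
    best, best_cost = None, float("inf")
    for perm in permutations(tasks):
        pairs = [tuple(sorted(perm[i:i + 2])) for i in range(0, len(perm), 2)]
        cost = sum(predicted_contention(tasks[a], tasks[b], llc_size)
                   for a, b in pairs)
        if cost < best_cost:
            best, best_cost = tuple(sorted(pairs)), cost
    return best, best_cost
```

For demands A:1, B:8, C:2, D:6 and an 8 MB LLC, this heuristic pairs A with B and C with D (total overflow 1 MB), illustrating why a proactive model can pick co-runners without first sampling every combination online.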
  • ABSTRACT: Today's operating systems don't adequately handle the complexities of multicore processors. Architectural features confound existing OS techniques for task scheduling, load balancing, and power management. This article shows that the OS can use data obtained from dynamic runtime observation of task behavior to ameliorate performance variability and more effectively exploit multicore processor resources. The authors' research prototypes demonstrate the utility of observation-based policy.
    IEEE Micro 06/2008