Yan Cui

Tsinghua University, Beijing, Beijing Shi, China

Publications (9); total impact: 0

  • Conference Proceeding: Reducing Shared Cache Contention by Scheduling Order Adjustment on Commodity Multi-cores
    ABSTRACT: Due to the power and complexity limits of traditional single-core processors, multi-core processors have become mainstream. One key feature of commodity multi-cores is that the last level cache (LLC) is usually shared. However, shared cache contention can significantly affect application performance. Several existing proposals demonstrate that task co-scheduling has the potential to alleviate the contention, but it is challenging to make co-scheduling practical in commodity operating systems. In this paper, we propose two lightweight, practical cache-aware co-scheduling methods, namely static SOA and dynamic SOA, to solve the cache contention problem on commodity multi-cores. The central idea of both methods is that cache contention can be reduced by adjusting the scheduling order properly. The two methods differ mainly in how they acquire a process's cache requirement. The static SOA (static scheduling order adjustment) method acquires the cache requirement information statically by offline profiling, while the dynamic SOA (dynamic scheduling order adjustment) method captures the cache requirement statistics by using performance counters. Experimental results using multi-programmed NAS workloads suggest that the proposed methods can greatly reduce the effect of cache contention on multi-core systems. Specifically, for the static SOA method, the execution time can be reduced by up to 15.7%, the number of cache misses can be reduced by up to 11.8%, and the performance improvement remains evident across different cache sizes and time-slice lengths. For the dynamic SOA method, the execution time can be reduced by up to 7.09%.
    2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW); 06/2011
  • Conference Proceeding: Experience on Comparison of Operating Systems Scalability on the Multi-core Architecture.
    2011 IEEE International Conference on Cluster Computing (CLUSTER), Austin, TX, USA, September 26-30, 2011; 01/2011
  • Conference Proceeding: Scaling OLTP applications on commodity multi-core platforms
    Yan Cui, Yu Chen, Yuanchun Shi
    ABSTRACT: Multi-core processor architectures can have a significant performance advantage over traditional single-core designs, which are limited by power and processor complexity. Predictions based on Moore's Law state that a processor chip may accommodate thousands of cores in 5-10 years. Can software scale with the number of cores and achieve the performance potential? This paper uses two OLTP (online transaction processing) applications (TPCC-UVa and Sysbench-OLTP) as a case study to investigate this question and determine what the performance bottlenecks are. On an Intel 8-core platform, these applications (with slight modifications to run well on a many-core platform) achieve speedups (in terms of transaction throughput) of 3.68 and 5.26, respectively. To find the scalability bottlenecks, the paper proposes a method based on a per-function scalability value metric. Functions with a high scalability value limit the scalability. By looking at the functions with the highest scalability value across all functions in the kernel, libraries, and application processes, the paper finds that database buffer pool contention, database synchronization primitives, scheduler overhead, and lock contention in System V IPC are the main bottlenecks for TPCC-UVa. In Sysbench-OLTP, database synchronization primitives and the kernel scheduler limit scalability. The paper also explores several ideas, such as a scalable database lock, a scalable spin lock, and an RCU-based IDR API, to improve the scalability.
    2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS); 04/2010
  • Conference Proceeding: Scalability comparison of commodity operating systems on multi-cores.
    IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2010, www.ispass.org, 28-30 March 2010, White Plains, NY, USA; 01/2010
  • Conference Proceeding: A Scheduling Method for Avoiding Kernel Lock Thrashing on Multi-cores.
    IEEE 16th International Conference on Parallel and Distributed Systems, ICPADS 2010, 8-10 Dec. 2010, Shanghai, China; 01/2010
  • Conference Proceeding: A Discrete Event Simulation Model for Understanding Kernel Lock Thrashing on Multi-core Architectures.
    IEEE 16th International Conference on Parallel and Distributed Systems, ICPADS 2010, 8-10 Dec. 2010, Shanghai, China; 01/2010
  • Conference Proceeding: Reinventing Lock Modeling for Multi-Core Systems.
    MASCOTS 2010, 18th Annual IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, Miami, Florida, USA, August 17-19, 2010; 01/2010
  • Article: OSMark: A benchmark suite for understanding parallel scalability of operating systems on large scale multi-cores
    Yan Cui, Yu Chen, Yuanchun Shi
    ABSTRACT: With the development of transistor technology, multi-core has become mainstream. Predictions based on Moore's Law state that a processor chip can accommodate thousands of cores in 5–10 years. As the system software between applications and hardware, can operating systems scale with the number of cores and achieve the performance potential? To answer this question, a micro-benchmark suite, OSMark, is designed to understand the parallel scalability of operating systems on large-scale multi-cores. Unlike application-oriented benchmarks, OSMark works by stressing separate parts of an operating system, including process management, memory management, network, file descriptor operations, and System V IPC. Evaluations on an AMD 32-core machine running Linux indicate that most of the benchmarks in OSMark scale poorly. Linux kernel source code analysis and performance data reveal that kernel synchronization primitives protecting shared data are the main bottlenecks limiting parallel scalability.
    International Conference on Computer Science and Information Technology; 08/2009
  • Conference Proceeding: CFS Optimizations to KVM Threads on Multi-Core Environment.
    IEEE 15th International Conference on Parallel and Distributed Systems, ICPADS 2009, 8-11 December 2009, Shenzhen, China; 01/2009