-
International Conference on Parallel Processing, ICPP 2011, Taipei, Taiwan, September 13-16, 2011; 01/2011
-
IEEE Micro. 01/2011; 31:119-127.
-
37th International Symposium on Computer Architecture (ISCA 2010), June 19-23, 2010, Saint-Malo, France; 01/2010
-
16th International Conference on High-Performance Computer Architecture (HPCA-16 2010), 9-14 January 2010, Bangalore, India; 01/2010
-
43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2010, 4-8 December 2010, Atlanta, Georgia, USA; 01/2010
-
Proceedings of the 2009 International Symposium on Low Power Electronics and Design, 2009, San Francisco, CA, USA, August 19-21, 2009; 01/2009
-
ABSTRACT: DRAMs require periodic refresh for preserving data stored in them. The refresh interval for DRAMs depends on the vendor and the design technology they use. For each refresh in a DRAM row, the stored information in each cell is read out and then written back to itself as each DRAM bit read is self-destructive. The refresh process is inevitable for maintaining data correctness, unfortunately, at the expense of power and bandwidth overhead. The future trend to integrate layers of 3D die-stacked DRAMs on top of a processor further exacerbates the situation as accesses to these DRAMs will be more frequent and hiding refresh cycles in the available slack becomes increasingly difficult. Moreover, due to the implication of temperature increase, the refresh interval of 3D die-stacked DRAMs will become shorter than those of conventional ones. This paper proposes an innovative scheme to alleviate the energy consumed in DRAMs. By employing a time-out counter for each memory row of a DRAM module, all the unnecessary periodic refresh operations can be eliminated. The basic concept behind our scheme is that a DRAM row that was recently read or written to by the processor (or other devices that share the same DRAM) does not need to be refreshed again by the periodic refresh operation, thereby eliminating excessive refreshes and the energy dissipated. Based on this concept, we propose a low-cost technique in the memory controller for DRAM power reduction. The simulation results show that our technique can reduce up to 86% of all refresh operations and 59.3% on the average for a 2GB DRAM. This in turn results in a 52.6% energy savings for refresh operations. The overall energy saving in the DRAM is up to 25.7% with an average of 12.13% obtained for SPLASH-2, SPECint2000, and Biobench benchmark programs simulated on a 2GB DRAM. For a 64MB 3D DRAM, the energy saving is up to 21% and 9.37% on an average when the refresh rate is 64 ms. For a faster 32ms refresh rate the maximum and average savings are 12% and 6.8% respectively.
Microarchitecture, 2007. MICRO 2007. 40th Annual IEEE/ACM International Symposium on; 01/2008
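The per-row time-out counter described in the preceding abstract can be illustrated with a toy model. This is only a sketch under stated assumptions, not the authors' hardware design: the class name `RowRefreshTracker`, the tick-based notion of time, and the `simulate` helper are all hypothetical, and a real memory controller would implement the counters in hardware with staggered refresh timing.

```python
class RowRefreshTracker:
    """Toy model: one time-out counter per DRAM row (hypothetical names)."""

    def __init__(self, num_rows, interval_ticks):
        self.interval = interval_ticks
        # Each counter holds the ticks left before its row must be refreshed.
        self.counters = [interval_ticks] * num_rows
        self.refreshes = 0

    def access(self, row):
        # A read or write rewrites the row's cells, so restart its counter.
        self.counters[row] = self.interval

    def tick(self):
        # Advance time by one tick and refresh only rows whose counter expired.
        for row in range(len(self.counters)):
            self.counters[row] -= 1
            if self.counters[row] == 0:
                self.refreshes += 1
                self.counters[row] = self.interval


def simulate(num_rows=4, interval_ticks=4, total_ticks=8):
    """Row 0 is accessed every other tick; the remaining rows stay idle."""
    tracker = RowRefreshTracker(num_rows, interval_ticks)
    for t in range(total_ticks):
        if t % 2 == 0:
            tracker.access(0)
        tracker.tick()
    return tracker.refreshes
```

In this toy run, a purely periodic scheme would refresh each of the 4 rows twice over 8 ticks, while the tracker skips the refreshes of the frequently accessed row.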
-
ABSTRACT: In order to bridge the gap of the growing speed disparity between processors and their memory subsystems, aggressive prefetch mechanisms, either hardware-based or compiler-assisted, are employed to hide memory latencies. As the first-level cache gets smaller in deep submicron processor design for fast cache accesses, data cache pollution caused by overly aggressive prefetch mechanisms will become a major performance concern. Ineffective prefetches not only offset the benefits of benign prefetches due to pollution but also throttle bus bandwidth, leading to an overall performance degradation. In this paper, we propose and analyze a number of hardware-based prefetch pollution filtering mechanisms to differentiate good and bad prefetches dynamically based on history information. We designed three prefetch pollution filters organized as a one-level, two-level, or gshare style. In addition, we examine two table indexing schemes: per-address (PA) based and program counter (PC) based. Our prefetch pollution filters work in tandem with both hardware and software prefetchers. As our analysis shows, the cache pollution filters can reduce the ineffective prefetches by more than 90 percent and alleviate the excessive memory bandwidth induced by them. Also, the performance can be improved by up to 16 percent when our filtering mechanism is incorporated with aggressive prefetch filters as a result of reduced cache pollution and less competition for the limited number of cache ports. In addition, a number of sensitivity studies are performed to provide a better understanding of the prefetch pollution filter design.
IEEE Transactions on Computers 02/2007; 56(1):18-31. · 1.10 Impact Factor
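A one-level, history-based prefetch filter of the kind the preceding abstract mentions might look roughly like the sketch below. The abstract does not give the table format, so the 2-bit saturating counters, the modulo index, the initial counter value, and the names `OneLevelPrefetchFilter`, `allow`, and `train` are all assumptions made for illustration. The key passed in can be a prefetch address (PA indexing) or the program counter of the triggering instruction (PC indexing).

```python
class OneLevelPrefetchFilter:
    """Hypothetical one-level history table of 2-bit saturating counters."""

    def __init__(self, entries=1024):
        self.entries = entries
        # Assumption: start every entry at 2 ("weakly good") so prefetches
        # are allowed until history says otherwise.
        self.table = [2] * entries

    def _index(self, key):
        # Assumption: simple modulo hashing of the address or PC.
        return key % self.entries

    def allow(self, key):
        # Issue the prefetch only if past prefetches for this key were useful.
        return self.table[self._index(key)] >= 2

    def train(self, key, was_used):
        # Update history when a prefetched line leaves the cache:
        # increment if it was referenced, decrement if it only polluted.
        i = self._index(key)
        if was_used:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

A two-level or gshare-style variant would change only how the table index is formed, for example by combining the key with recent usefulness history.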
-
13th International Conference on Parallel and Distributed Systems (ICPADS 2007), December 5-7, 2007, Hsinchu, Taiwan; 01/2007
-
33rd International Symposium on Computer Architecture (ISCA 2006), June 17-21, 2006, Boston, MA, USA; 01/2006
-
Proceedings of the 2006 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, CASES 2006, Seoul, Korea, October 22-25, 2006; 01/2006
-
Architecture of Computing Systems - ARCS 2006, 19th International Conference, Frankfurt/Main, Germany, March 13-16, 2006, Proceedings; 01/2006
-
ABSTRACT: This paper presents a new security architecture for protecting software confidentiality and integrity. Different from the
previous process-centric systems designed for the same purpose, the new architecture ties cryptographic properties and security
attributes to memory instead of each individual user process. The advantages of such a memory-centric design are manifold.
First, it provides a better security model and access control on software privacy that supports both selective and mixed tamper
resistant protection on software components from heterogeneous sources. Second, the new model supports and facilitates tamper
resistant secure information sharing in an open software system where both data and code components could be shared by different
user processes. Third, the proposed security model and secure processor design allow software components protected with different
security policies to inter-operate within the same memory space efficiently. Our new architectural support requires small
silicon resources and its performance impact is minimal based on our experimental results using commercial MS Windows workloads
and cycle based out-of-order processor simulation.
10/2005: pages 153-168;
-
32nd International Symposium on Computer Architecture (ISCA 2005), 4-8 June 2005, Madison, Wisconsin, USA; 01/2005
-
13th International Conference on Parallel Architectures and Compilation Techniques (PACT 2004), 29 September - 3 October 2004, Antibes Juan-les-Pins, France; 01/2004
-
Proceedings of the 2004 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, CASES 2004, Washington DC, USA, September 22 - 25, 2004; 01/2004
-
ABSTRACT: Predicated execution enables the removal of branches wherein segments of branching code are converted into straight-line segments of conditional operations. An important, but generally ignored side effect of this transformation is that the compiler must assign distinct resources to all the predicated operations at a given time to ensure that those resources are available at run-time. However, a resource is only put to productive use when the predicates associated with its operations evaluate to True. We propose predicate-aware scheduling to reduce the superfluous commitment of resources to operations whose predicates evaluate to False at run-time. The central idea is to assign multiple operations to the same resource at the same time, thereby oversubscribing its use. This assignment is intelligently performed to ensure that no two operations simultaneously assigned to the same resource will have both of their predicates evaluate to True. Thus, no resource is dynamically oversubscribed. The overall effect of predicate aware scheduling is to use resources more efficiently, thereby increasing performance when resource constraints are a bottleneck.
02/2003;
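The central idea of the preceding abstract, letting several predicated operations share one resource in the same cycle as long as no two of their predicates can both be True, can be sketched as a greedy packing step. This is a simplified illustration, not the paper's scheduler: predicates are modeled as hypothetical `(name, polarity)` pairs, and two predicates are treated as mutually exclusive only when they are complements of the same name. The function names `disjoint` and `pack_cycle` are likewise made up for this example.

```python
def disjoint(p, q):
    """Assumption: predicates are (name, polarity) pairs, and only a
    predicate and its complement are known to be mutually exclusive."""
    return p[0] == q[0] and p[1] != q[1]


def pack_cycle(ops, num_units):
    """Greedily place (op_name, predicate) pairs onto units for one cycle.

    An operation may join a unit only if its predicate is disjoint from the
    predicate of every operation already there, so the unit is never used
    by two operations at run time. Returns (units, leftover).
    """
    units = [[] for _ in range(num_units)]
    leftover = []
    for op in ops:
        for unit in units:
            if all(disjoint(op[1], other[1]) for other in unit):
                unit.append(op)
                break
        else:
            # No unit can take this operation in the current cycle.
            leftover.append(op)
    return units, leftover
```

With a single unit, operations guarded by `p` and by its complement share the unit, while an operation guarded by an unrelated predicate has to wait for another unit or cycle.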
-
1st IEEE / ACM International Symposium on Code Generation and Optimization (CGO 2003), 23-26 March 2003, San Francisco, CA, USA; 01/2003
-
32nd International Conference on Parallel Processing (ICPP 2003), 6-9 October 2003, Kaohsiung, Taiwan; 01/2003
-
ABSTRACT: Power has become the first class design constraint in modern processor design. To reduce the power density caused by aggressive, speculative execution seen in previous processor generations, computer architects have turned to a multi-core design strategy with each core substantially simplified. Additionally, different power-saving features have been proposed and integrated into each core to adapt dynamic execution scenarios. Due in part to the independent nature of these cores, the power management has also become more flexible to further reduce the overall power consumption. With careful adaptation schemes, the system can save power by entering different idle states dynamically with minimal performance impact. Given the simultaneous emergence of virtualization technologies, the question, then, is how to effectively leverage these idle states in the context of multiple virtual machines (VMs) executing on multicore parts. Towards this end, we develop the IdlePower approach to managing idle states in virtualized systems. Our approach combines a novel batching algorithm that creates improved opportunities to enter deep idle states by removing unnecessary system wakeups depending upon monitored behavior of workloads. IdlePower also provides application awareness in another fashion by not only entering deep idle states based upon transition latencies, but also factoring in the performance degradation that can occur due to secondary effects such as data loss in cache structures. We extend the use of Bloom filters with IdlePower to detect application characteristics for dynamically predicting whether deep idle states are worthwhile based upon possible performance implications. Overall, IdlePower is shown to improve residencies in the deepest C3 idle state by up to 10%, and to avoid performance degradations in workloads of up to 26%.