ABSTRACT: It is known that the execution of programs exhibits repetitive phases; in other words, program execution can be partitioned into segments during which the application exhibits distinct architectural properties. This property has been exploited for various optimization goals, and phase information is also used to reduce the run time of architectural simulation. Conventionally, an application is profiled with architecture-independent metrics (such as the number of times each basic block is executed) to extract its phases, and then only the representative execution intervals are simulated to evaluate architectural choices. We claim that such approaches are becoming inadequate in the many-core era, because application execution is no longer dominated by instruction behavior alone: the communication structure of the application is becoming as important as its instruction behavior. Hence, we propose to utilize communication behavior to determine the phases of an application. Our results reveal that including communication information significantly increases the accuracy of phase detection. Specifically, for the SPLASH-2 and MineBench applications, the average (geometric mean) CPI error rate of instruction-based phase detection is 11.01%, while our phase detection scheme has an average error rate of 3.41%, relative to simulations that run the applications to completion.
IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2009); 05/2009
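As a rough illustration of why communication information can change phase detection, the following SimPoint-style sketch clusters execution intervals with k-means and estimates whole-program CPI from one representative interval per cluster. It is a toy under stated assumptions, not the authors' method: each interval's signature is assumed to be its normalized basic-block vector concatenated with its normalized communication counts (for example, messages between core pairs) scaled by a hypothetical `comm_weight`; the function names and the deterministic farthest-point initialization are likewise illustrative choices.

```python
import math
from typing import List, Sequence


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a count vector so its entries sum to 1 (an all-zero vector stays zero)."""
    total = sum(vec)
    return [v / total for v in vec] if total else [0.0] * len(vec)


def signature(bbv, comm, comm_weight):
    """Hypothetical interval signature: normalized basic-block counts followed by
    normalized communication counts scaled by comm_weight (0.0 = instructions only)."""
    return normalize(bbv) + [comm_weight * c for c in normalize(comm)]


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def init_centroids(points, k):
    # Deterministic farthest-point initialization: start from the first interval,
    # then repeatedly add the point farthest from all chosen centroids.
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(dist(p, c) for c in centroids))
        centroids.append(list(far))
    return centroids


def kmeans(points, k, iters=20):
    centroids = init_centroids(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each interval to its nearest centroid (ties go to the lower index).
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Move each non-empty cluster's centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def estimate_cpi(bbvs, comms, cpis, k, comm_weight=1.0):
    """Weight each phase's representative CPI by the fraction of intervals in that phase."""
    points = [signature(b, m, comm_weight) for b, m in zip(bbvs, comms)]
    labels, centroids = kmeans(points, k)
    estimate = 0.0
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if not members:
            continue
        # Representative = interval closest to the cluster centroid.
        rep = min(members, key=lambda i: dist(points[i], centroids[c]))
        estimate += cpis[rep] * len(members) / len(points)
    return estimate


if __name__ == "__main__":
    bbvs = [[10, 10]] * 4                      # identical instruction profiles
    comms = [[1, 0], [1, 0], [0, 1], [0, 1]]   # two distinct communication patterns
    cpis = [1.0, 1.0, 3.0, 3.0]                # true mean CPI is 2.0
    print(estimate_cpi(bbvs, comms, cpis, k=2, comm_weight=0.0))  # 1.0
    print(estimate_cpi(bbvs, comms, cpis, k=2, comm_weight=1.0))  # 2.0
```

In this toy input every interval executes the same basic blocks, so the instruction-only signature (`comm_weight=0.0`) collapses all four intervals into one phase and misestimates CPI as 1.0; adding the communication counts separates the two phases and recovers the full-run mean of 2.0.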
ABSTRACT: Future many-core processors will require high-performance yet energy-efficient on-chip networks to provide a communication substrate for the increasing number of cores. Recent advances in silicon nanophotonics create new opportunities for on-chip networks. To efficiently exploit the benefits of nanophotonics, we propose Firefly, a hybrid, hierarchical network architecture. Firefly consists of clusters of nodes that are connected using conventional electrical signaling, while inter-cluster communication uses nanophotonics. This design exploits electrical signaling for short, local communication and reserves nanophotonics for global communication, yielding an efficient on-chip network. A crossbar architecture is used for inter-cluster communication; however, to avoid global arbitration, the crossbar is partitioned into multiple logical crossbars whose arbitration is localized. Our evaluations show that Firefly improves performance by up to 57% compared to an all-electrical concentrated mesh (CMESH) topology on adversarial traffic patterns, and by up to 54% compared to an all-optical crossbar (OP XBAR) on traffic patterns with locality. In terms of energy-delay product, Firefly improves the efficiency of the on-chip network by up to 51% and 38% compared to CMESH and OP XBAR, respectively.
36th International Symposium on Computer Architecture (ISCA 2009), June 20-24, 2009, Austin, TX, USA; 01/2009
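The hierarchy described above (electrical links inside a cluster, a photonic crossbar between clusters, split into logical crossbars with local arbitration) can be sketched as a routing function plus a per-crossbar arbiter. The abstract does not say how routers map onto logical crossbars, so this sketch assumes one plausible partitioning: node IDs are numbered cluster by cluster, and logical crossbar `i` links the router with local index `i` in every cluster. A packet therefore takes an electrical hop to the router in its own cluster that shares the destination's local index, then one photonic hop. `Topology`, `route`, and `LocalArbiter` are hypothetical names, and the round-robin arbiter is a generic stand-in rather than Firefly's actual arbitration scheme.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Topology:
    clusters: int             # number of clusters
    routers_per_cluster: int  # routers per cluster (= number of logical crossbars, by assumption)

    def locate(self, node: int) -> Tuple[int, int]:
        """Return (cluster id, local index) under an assumed cluster-major numbering."""
        if not 0 <= node < self.clusters * self.routers_per_cluster:
            raise ValueError(f"node {node} out of range")
        return divmod(node, self.routers_per_cluster)


def route(topo: Topology, src: int, dst: int) -> List[Tuple[str, int, int]]:
    """List of (medium, from, to) hops from src to dst."""
    s_cluster, _ = topo.locate(src)
    d_cluster, d_local = topo.locate(dst)
    hops: List[Tuple[str, int, int]] = []
    if s_cluster == d_cluster:
        # Short, local traffic stays on the electrical intra-cluster network.
        if src != dst:
            hops.append(("electrical", src, dst))
        return hops
    # Assumed partitioning: logical crossbar d_local connects router d_local of every
    # cluster, so first move electrically to that router in the source cluster.
    gateway = s_cluster * topo.routers_per_cluster + d_local
    if gateway != src:
        hops.append(("electrical", src, gateway))
    hops.append(("photonic", gateway, dst))
    return hops


class LocalArbiter:
    """Round-robin arbiter for ONE logical crossbar: it only sees that crossbar's
    requesters, so no chip-wide (global) arbitration is needed."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.last = n - 1  # index granted most recently

    def grant(self, requests: Set[int]) -> Optional[int]:
        # Search starting just after the last winner so every requester gets a turn.
        for offset in range(1, self.n + 1):
            cand = (self.last + offset) % self.n
            if cand in requests:
                self.last = cand
                return cand
        return None
```

For example, with `Topology(clusters=4, routers_per_cluster=4)`, `route(topo, 1, 14)` returns `[("electrical", 1, 2), ("photonic", 2, 14)]`: node 14 sits at local index 2 of cluster 3, so the packet first reaches router 2 of its own cluster. Because each arbiter covers only one logical crossbar, the arbiters can operate independently of one another.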
ABSTRACT: Technological advances enable modern processors to use increasingly large DRAMs with rising access frequencies. This leads to high power consumption and operating temperatures in DRAM chips. As a result, temperature management has become a real and pressing issue in high-performance DRAM systems. Traditional low-power techniques are not suitable for high-performance, high-bandwidth DRAM systems. In this paper, we propose and evaluate a customized DRAM low-power technique based on a Page Hit Aware Write Buffer (PHA-WB). Our proposed approach reduces DRAM system power consumption and temperature without any performance penalty. Our experiments show that a system with a 64-entry PHA-WB could reduce total DRAM power consumption by up to 22.0% (9.6% on average). The peak and average temperature reductions are 6.1°C and 2.1°C, respectively.
Proceedings of the 45th Design Automation Conference, DAC 2008, Anaheim, CA, USA, June 8-13, 2008; 01/2008
ABSTRACT: With rising capacities and higher access frequencies, high-performance DRAMs provide increasing memory access bandwidth to processors. However, this increased DRAM performance comes at the price of higher power consumption and temperature in DRAM chips. Traditional low-power approaches for DRAM systems focus on utilizing low-power modes, which are not always suitable for high-performance systems. Existing DRAM temperature management techniques, on the other hand, use generic temperature management methods inherited from those applied to processor cores. These methods reduce DRAM temperature by limiting the number of DRAM accesses, similar to throttling a processor core, which incurs a significant performance penalty. In this paper, we propose a customized low-power technique for high-performance DRAM systems, namely the Page Hit Aware Write Buffer (PHA-WB). The PHA-WB improves the DRAM page hit rate by buffering write operations that may incur page misses. This approach reduces DRAM system power consumption and temperature without any performance penalty. Our proposed Throughput-Aware PHA-WB (TAP) dynamically configures the write buffer for different applications and workloads, thus achieving the best trade-off between DRAM power reduction and buffer power overhead. Our experiments show that a system with TAP could reduce total DRAM power consumption by up to 18.36% (8.64% on average). The steady-state temperature can be reduced by as much as 5.10°C, and by 1.93°C on average, across eight representative workloads.
Proceedings of the 22nd Annual International Conference on Supercomputing, ICS 2008, Island of Kos, Greece, June 7-12, 2008; 01/2008
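The core PHA-WB idea stated above, holding back writes that would cause a page (row) miss until their row is opened anyway, can be illustrated with a small hit/miss counter. This is a simplified sketch under explicit assumptions, not the published design: one open row per bank (open-page policy), a FIFO buffer whose oldest entry is forced out on overflow, and no modeling of timing, power, read-after-write hazards to the same address, or TAP's dynamic reconfiguration. The class and function names are hypothetical.

```python
from typing import Dict, List, Tuple


class PageHitAwareWriteBuffer:
    """Toy model: counts row hits/misses when page-missing writes are deferred."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.open_row: Dict[int, int] = {}       # bank -> currently open row
        self.buffer: List[Tuple[int, int]] = []  # deferred (bank, row) writes, oldest first
        self.hits = 0
        self.misses = 0

    def _access(self, bank: int, row: int) -> None:
        # Open-page policy: a hit if the row is already open, otherwise a miss that opens it.
        if self.open_row.get(bank) == row:
            self.hits += 1
        else:
            self.misses += 1
            self.open_row[bank] = row
        self._drain(bank)

    def _drain(self, bank: int) -> None:
        # Issue every deferred write that now targets the open row: each is a page hit.
        row = self.open_row[bank]
        keep = []
        for b, r in self.buffer:
            if b == bank and r == row:
                self.hits += 1
            else:
                keep.append((b, r))
        self.buffer = keep

    def read(self, bank: int, row: int) -> None:
        self._access(bank, row)  # reads are never deferred

    def write(self, bank: int, row: int) -> None:
        if self.open_row.get(bank) == row:
            self.hits += 1  # row already open: issue the write immediately
            return
        if self.capacity == 0:
            self._access(bank, row)  # no buffer: accept the page miss now
            return
        if len(self.buffer) >= self.capacity:
            # Buffer full: force out the oldest deferred write and take its miss.
            old_bank, old_row = self.buffer.pop(0)
            self._access(old_bank, old_row)
            if self.open_row.get(bank) == row:
                self.hits += 1
                return
        self.buffer.append((bank, row))  # defer the write that would miss

    def flush(self) -> None:
        while self.buffer:
            bank, row = self.buffer.pop(0)
            self._access(bank, row)


def run(trace, capacity: int) -> Tuple[int, int]:
    wb = PageHitAwareWriteBuffer(capacity)
    for op, bank, row in trace:
        (wb.read if op == "R" else wb.write)(bank, row)
    wb.flush()
    return wb.hits, wb.misses


if __name__ == "__main__":
    # Reads to row 1 interleaved with writes to row 2, all in bank 0.
    trace = [("R", 0, 1), ("W", 0, 2), ("R", 0, 1), ("W", 0, 2), ("R", 0, 1)]
    print(run(trace, capacity=0))  # (0, 5)
    print(run(trace, capacity=4))  # (3, 2)
```

Without a buffer, every access in the alternating trace forces a row change (five misses). With buffering, both writes to row 2 wait until that row is opened once, so the reads stay page hits and the run finishes with three hits and two misses. Fewer row activations is the mechanism the abstract credits for the power and temperature reduction.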