ABSTRACT: Bloom Filters are a technique to reduce the effects of conflicts/interference in hash table-like structures. Conventional hash tables store information in a single location, which is susceptible to destructive interference through hash conflicts. A Bloom Filter uses multiple hash functions to store information in several locations, and recombines the information through some voting mechanism. Many microarchitectural predictors use simple single-index hash tables to make binary 0/1 predictions, and Bloom Filters help improve predictor accuracy. However, implementing a true Bloom Filter requires k hash functions, which in turn implies a k-ported hash table, or k sequential accesses. Unfortunately, the area of a hardware table increases quadratically with the port count, increasing costs of area, latency and power consumption. We propose a simple but elegant modification to the Bloom Filter algorithm that uses banking combined with special hash functions that guarantee all hash indexes fall into non-conflicting banks. We evaluate several applications of our Banked Bloom Filter (BBF) prediction in processors: BBF branch prediction, BBF load hit/miss prediction, and BBF last-tag prediction. We show that BBF predictors can provide accurate predictions with substantially less cost than previous techniques. Bloom Filters (2) are commonly used in the network and database domains to provide approximately correct answers to set membership queries. The algorithm is easily extended to binary predictions. While Bloom Filters are structurally similar to a table of counters, they differ by employing multiple hash functions to help tolerate conflicts. The Bloom Filter stores each prediction in multiple locations, and a combining function (usually a unanimous vote for set membership queries) converts the multiple predictions into the Bloom Filter's final prediction.
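The classic Bloom Filter behavior the abstract describes (k hashed locations, unanimous vote, false positives possible but no false negatives) can be sketched roughly as follows; the class name, sizes, and salted-SHA-256 hashing are illustrative choices of mine, not from the paper:

```python
import hashlib

class BloomFilter:
    """Minimal sketch of a Bloom Filter for set-membership queries.

    Assumptions (mine): k hash functions are derived by salting one
    cryptographic hash; a real hardware filter would use cheap
    independent hash circuits instead.
    """
    def __init__(self, size=1024, k=4):
        self.size = size
        self.k = k
        self.bits = [0] * size

    def _indexes(self, key):
        # Derive k indexes by salting the key with the hash number;
        # any k (approximately) independent hash functions would do.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def insert(self, key):
        # Store membership in all k hashed locations.
        for idx in self._indexes(key):
            self.bits[idx] = 1

    def query(self, key):
        # Unanimous vote: all k locations must be set.
        # Conflicts can cause a false positive, never a false negative.
        return all(self.bits[idx] for idx in self._indexes(key))
```

An inserted key always queries true; a key never inserted queries false unless every one of its k locations happens to collide with set bits, which is how the structure tolerates individual conflicts.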
The Bloom Filter algorithm can potentially improve the accuracy of microarchitectural binary predictors. However, the latency constraints of most predictor implementations make conventional Bloom Filters impractical. A Bloom Filter with k hash functions would require a table of counters with at least k ports (likely 2k ports, k for reading the table and k more for writing), and the area of a multi-ported memory cell increases quadratically with the port count. This paper presents a new version of the Bloom Filter algorithm which is well suited for hardware and parallel implementations. In particular, banking provides a means of reading multiple entries in parallel without requiring multiple ports, and a special, simple class of hash functions guarantees that bank conflicts cannot occur. We show the generality of our Banked Bloom Filters with several example applications: branch prediction, load hit-miss prediction (24), and last-tag prediction (5). Section 2 reviews the basic Bloom Filter algorithm and its applicability to hardware predictors. Section 3 explains our Banked Bloom Filter algorithm and its corresponding implementation. Section 4 describes several applications of Banked Bloom Filters, and Section 5 presents our experimental results. Section 6 provides additional analysis of Banked Bloom Filters, and Section 7 concludes the paper.
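The banking idea can be sketched as follows: the table is split into k equal banks and hash function i is constrained to index only bank i, so the k lookups can never collide on a bank and each bank needs only a single port. This is a rough functional model under my own naming and parameter choices, not the paper's implementation:

```python
import hashlib

class BankedBloomFilter:
    """Functional sketch of a banked Bloom Filter predictor.

    Assumption (mine): each of the k hash functions is restricted to
    its own bank, which is the property that rules out bank conflicts
    and lets all k reads proceed in parallel on single-ported banks.
    """
    def __init__(self, bank_size=256, k=4):
        self.k = k
        self.bank_size = bank_size
        self.banks = [[0] * bank_size for _ in range(k)]

    def _bank_index(self, key, i):
        # Hash i only ever produces an offset within bank i.
        h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
        return int(h, 16) % self.bank_size

    def update(self, key, outcome):
        # Record the binary outcome in one location per bank.
        for i in range(self.k):
            self.banks[i][self._bank_index(key, i)] = int(outcome)

    def predict(self, key):
        # Unanimous vote across banks, as in a conventional filter.
        return all(self.banks[i][self._bank_index(key, i)]
                   for i in range(self.k))
```

The k reads here touch k distinct banks by construction, which is the software analogue of replacing a k-ported table with k single-ported banks.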
20th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2008, October 29 - November 1, 2008, Campo Grande, MS, Brazil; 01/2008
ABSTRACT: 3D die stacking is an exciting new technology that increases transistor density by vertically integrating two or more die with a dense, high-speed interface. The result of 3D die stacking is a significant reduction of interconnect both within a die and across dies in a system. For instance, blocks within a microprocessor can be placed vertically on multiple die to reduce block-to-block wire distance, latency, and power. Disparate Si technologies can also be combined in a 3D die stack, such as DRAM stacked on a CPU, resulting in lower-power, higher-bandwidth, and lower-latency interfaces, without concern for technology integration into a single process flow. 3D has the potential to change processor design constraints by providing substantial power and performance benefits. Despite the promising advantages of 3D, there is significant concern for thermal impact. In this research, we study the performance advantages and thermal challenges of two forms of die stacking: stacking a large DRAM or SRAM cache on a microprocessor, and dividing a traditional microarchitecture between two die in a stack.
Microarchitecture, 2006. MICRO-39. 39th Annual IEEE/ACM International Symposium on; 01/2007
ABSTRACT: From multiprocessor scale-up to cache sizes to the number of reorder-buffer entries, microarchitects wish to reap the benefits of more computing resources while staying within power and latency bounds. This tension is quite evident in schedulers, which need to be large and single-cycle for maximum performance on out-of-order cores. In this work we present two straightforward modifications to a matrix scheduler implementation which greatly strengthen its scalability. Both are based on the simple observation that the wakeup and picker matrices are sparse, even at small sizes; thus small indirection tables can be used to greatly reduce their width and latency. This technique can be used to create quicker iso-performance schedulers (17-58% reduced critical path) or larger iso-timing schedulers (7-26% IPC increase). Importantly, the power and area requirements of the additional hardware are likely offset by the greatly reduced matrix sizes and subsuming the functionality of the power-hungry allocation CAMs.
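The indirection idea can be illustrated with a toy model: instead of giving the wakeup matrix one column per possible producer tag, a small indirection table maps the few tags actually in flight to short column indices, shrinking matrix width. All names and sizes below are my own illustration, not the paper's design:

```python
class IndirectedWakeupMatrix:
    """Toy sketch of a wakeup matrix compressed via an indirection table.

    Assumption (mine): only a handful of producers are live at once
    (the matrix is sparse), so a small tag-to-column table suffices
    and the matrix needs far fewer columns than there are tags.
    """
    def __init__(self, entries=32, live_slots=8):
        self.entries = entries
        self.tag_to_col = {}                      # indirection table
        self.free_cols = list(range(live_slots))  # unused short columns
        # entries x live_slots dependence matrix (vs. entries x num_tags)
        self.matrix = [[0] * live_slots for _ in range(entries)]

    def allocate_producer(self, tag):
        # Map a newly in-flight producer tag to a short column index.
        col = self.free_cols.pop()
        self.tag_to_col[tag] = col
        return col

    def add_dependence(self, entry, tag):
        # Scheduler entry waits on the producer's short column.
        self.matrix[entry][self.tag_to_col[tag]] = 1

    def wakeup(self, tag):
        # Broadcast completion on the short column, collect waiters,
        # then recycle the column for a future producer.
        col = self.tag_to_col.pop(tag)
        ready = [e for e in range(self.entries) if self.matrix[e][col]]
        for e in ready:
            self.matrix[e][col] = 0
        self.free_cols.append(col)
        return ready
```

The matrix here is 32x8 rather than 32-by-number-of-tags, which mirrors the width (and hence latency) reduction the abstract claims for the hardware structures.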
34th International Symposium on Computer Architecture (ISCA 2007), June 9-13, 2007, San Diego, California, USA; 01/2007