Conference Paper

Heterogeneous memory assembly exploration using a floorplan and interconnect aware framework

... Efficient memory bank integration methods have been implemented. Gupta et al. proposed the use of heterogeneous memory banks to optimize the overall area of chips [9], while Yan et al. proposed custom optimization techniques [10]. The second aspect is the optimization of memory bank placement. ...
... It had an AXI bus protocol interface to interconnect with other components on the chip. Based on a 28 nm technology, four implementations were carried out for comparison: the traditional method, this paper's method, and the methods of Refs. [9,12]. In all implementations, the design was run to the post-route stage. ...
Article
Full-text available
Co-optimization for memory bank compilation and placement was suggested as a way to improve performance and power and reduce the size of a memory subsystem. First, a multi-configuration SRAM compiler was realized that could generate memory banks with different PPA by splitting or merging, upsizing or downsizing, threshold swapping, and aspect ratio deformation. Then, a timing margin estimation method was proposed for the memory bank based on placed positions. Through an exhaustive enumeration of various configuration parameters under the constraint of timing margins, the best SRAM memory compilation configuration was found. This method could be integrated into the existing physical design flow. The experimental results showed that this method achieved up to an 11.1% power reduction and a 7.6% critical path delay reduction compared with the traditional design method.
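The exhaustive enumeration step described above can be sketched with a toy model; the configuration space (split factor, threshold flavor, aspect ratio) and all PPA numbers below are invented for illustration, not the paper's compiler data:

```python
from itertools import product

# Hypothetical PPA model for a compiled SRAM bank: each configuration is a
# (split_factor, vt_flavor, aspect_ratio) tuple with illustrative numbers.
def bank_ppa(split, vt, aspect):
    # toy trends: more splitting lowers delay but raises area and power
    delay = 1.0 / split + (0.3 if vt == "hvt" else 0.1) + 0.05 * aspect
    power = 0.2 * split + (0.05 if vt == "hvt" else 0.15) + 0.02 * aspect
    area = 1.0 * split * (1.0 + 0.1 * abs(aspect - 2))
    return delay, power, area

def best_config(timing_margin):
    """Enumerate all configurations; keep the lowest-power one whose
    delay fits inside the placement-derived timing margin."""
    best = None
    for split, vt, aspect in product((1, 2, 4), ("hvt", "lvt"), (1, 2, 4)):
        delay, power, area = bank_ppa(split, vt, aspect)
        if delay <= timing_margin and (best is None or power < best[1][1]):
            best = ((split, vt, aspect), (delay, power, area))
    return best
```

With a loose margin the search trades splitting for lower power; with a margin tighter than the fastest configuration it returns `None`, signaling that no compilation meets the placed position's budget.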
Conference Paper
Full-text available
Embedded memories are the key contributor to chip area and dynamic power dissipation, and also form a significant part of the critical path for high-performance advanced SoCs. Therefore, optimal selection of memory instances becomes imperative for SoC designers. While EDA tools have evolved over the past years to optimally select standard logic cells depending on the timing and power constraints, optimal memory selection is largely a manual process. We propose a framework to optimize power, performance, and area (PPA) of a memory subsystem (MSS) by including floorplan-dependent delays and power consumption in interconnects and glue logic of the MSS at the pre-RTL stage. Through this framework, we demonstrate that for a 4 Mb assembly of SRAM instances, dynamic power is reduced by 44%, area by 49%, and leakage by 71% with floorplan-aware selection. The framework has the capability to use different estimates when routing congestion is important (for example, in low-cost processes with fewer metal layers). We also show that interconnect delays are reduced by about 68% and dynamic power by 58% if additional metal layers are available for routing, compared to a low-cost 6-metal process.
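The floorplan-aware selection this abstract describes can be illustrated with a minimal sketch; the linear wire-delay model, the per-mm delay figure, and the candidate data are assumptions for illustration, not the paper's actual framework:

```python
def manhattan_wirelength(src, dst):
    """Manhattan distance between a memory instance and the controller
    pin (placement coordinates assumed in mm)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def interconnect_delay_ps(src, dst, ps_per_mm=150.0):
    # assumed linear (buffered-wire) delay model; ps_per_mm is illustrative
    return ps_per_mm * manhattan_wirelength(src, dst)

def select_instance(candidates, controller, margin_ps):
    """Pick the lowest-power instance whose access time plus
    floorplan-dependent wire delay fits the timing margin.
    candidates: list of (name, power_mw, access_ps, (x, y))."""
    feasible = [c for c in candidates
                if c[2] + interconnect_delay_ps(c[3], controller) <= margin_ps]
    return min(feasible, key=lambda c: c[1], default=None)
```

The point of the sketch is that a slower, lower-power instance placed closer to the controller can win once interconnect delay is counted, which a floorplan-blind selection would miss.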
Conference Paper
Full-text available
This work proposes a new method for approximating the Pareto front of a multi-objective simulation optimization problem (MOP) where the explicit forms of the objective functions are not available. The method iteratively approximates each objective function using a metamodeling scheme and employs a weighted sum method to convert the MOP into a set of single objective optimization problems. The weight on each single objective function is adaptively determined by assessing newly introduced points at the current iteration and the non-dominated points so far. A trust region algorithm is applied to the single objective problems to search for the points on the Pareto front. The numerical results show that the proposed algorithm efficiently generates evenly distributed points for various types of Pareto fronts.
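Two ingredients named above, weighted-sum scalarization and non-domination filtering, can be sketched as follows (minimization assumed in every objective; the adaptive weight update, metamodels, and trust-region search are omitted):

```python
def weighted_sum(fs, w):
    """Scalarize one multi-objective point with a weight vector,
    turning the MOP into a single-objective value."""
    return sum(wi * fi for wi, fi in zip(w, fs))

def pareto_front(points):
    """Keep the non-dominated points (minimization in every objective):
    a point survives unless some other point is no worse everywhere
    and strictly better somewhere."""
    def dominates(a, b):
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```

Sweeping the weight vector and re-solving the scalarized problem traces out candidate points, which the non-domination filter then reduces to a front approximation.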
Conference Paper
Full-text available
Today's feature-rich multimedia products require embedded system solutions with complex System-on-Chip (SoC) designs to meet market expectations of high performance at low cost and low energy consumption. The memory architecture of the embedded system strongly influences these parameters, so the embedded system designer performs a complete memory architecture exploration. This is a multi-objective optimization problem and can be tackled as a two-level optimization problem: the outer level explores various memory architectures, while the inner level explores the placement of data sections (the data layout problem) to minimize memory stalls. Further, the designer is interested in multiple optimal design points to address various market segments, yet tight time-to-market constraints enforce a short design cycle. In this paper we address the multi-level, multi-objective memory architecture exploration problem through a combination of a multi-objective genetic algorithm (for memory architecture exploration) and an efficient heuristic data placement algorithm. At the outer level, the memory architecture exploration is done by picking memory modules directly from an ASIC memory library. This allows the exploration to proceed in an integrated framework, where memory allocation, memory exploration, and data layout work in a tightly coupled way to yield optimal design points with respect to area, power, and performance. We evaluated our approach on 3 embedded applications; it explores several thousand memory architectures for each application, yielding a few hundred optimal design points in a few hours of computation time on a standard desktop.
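To make the two-level structure concrete, here is a toy sketch in which a greedy data-layout heuristic serves as the inner level and plain random search stands in for the outer-level genetic algorithm; the library entries, latencies, and cost model are all invented for illustration:

```python
import random

def data_layout_cost(arch, sections):
    """Inner level (assumed greedy heuristic): place the most-accessed
    data sections into the fastest memory module with room left.
    arch: list of (size, latency) modules; sections: (size, accesses)."""
    mods = sorted(([size, lat] for size, lat in arch), key=lambda m: m[1])
    stalls = 0
    for size, accesses in sorted(sections, key=lambda s: -s[1]):
        for m in mods:
            if m[0] >= size:
                m[0] -= size
                stalls += accesses * m[1]
                break
        else:
            return float("inf")  # section does not fit anywhere
    return stalls

def explore(library, sections, n_mods=2, iters=200, seed=0):
    """Toy outer level: random search over module picks from the
    library (a multi-objective GA in the paper); returns the best
    (stall_cost, architecture) found."""
    rng = random.Random(seed)
    best_cost, best_arch = float("inf"), None
    for _ in range(iters):
        arch = [rng.choice(library) for _ in range(n_mods)]
        cost = data_layout_cost(arch, sections)
        if cost < best_cost:
            best_cost, best_arch = cost, arch
    return best_cost, best_arch
```

The tight coupling the abstract mentions shows up here as the outer search scoring every candidate architecture by actually running the inner data-layout heuristic on it.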
Conference Paper
Full-text available
Fast and accurate routing congestion estimation is essential for optimizations such as floorplanning, placement, buffering, and physical synthesis that need to avoid routing congestion. Using a probabilistic technique instead of a global router has the advantage of speed and easy updating. Previously proposed probabilistic models (1) (2) do not account for wiring that may already be fixed in the design, e.g., due to macro blocks or power rails. These "partial wiring blockages" certainly influence the global router, so they should also influence a probabilistic routing prediction algorithm. This work proposes a probabilistic congestion prediction metric that extends the work of (2) to model partial wiring blockages. We also show a new fast algorithm to efficiently generate the congestion map and demonstrate the effectiveness of our methods on real routing problems.
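The blockage-aware idea can be sketched with a toy demand model; spreading each net's demand uniformly over the unblocked tiles of its bounding box is a deliberate simplification of the paper's probabilistic metric:

```python
def congestion_map(nets, blocked, w, h):
    """Toy probabilistic demand map on a w-by-h tile grid: each 2-pin
    net spreads one unit of routing demand uniformly over the
    unblocked tiles of its bounding box, so fixed wiring (macros,
    power rails) pushes expected demand onto the remaining tiles."""
    demand = [[0.0] * w for _ in range(h)]
    for (x1, y1), (x2, y2) in nets:
        xs = range(min(x1, x2), max(x1, x2) + 1)
        ys = range(min(y1, y2), max(y1, y2) + 1)
        tiles = [(x, y) for y in ys for x in xs if (x, y) not in blocked]
        for x, y in tiles:
            demand[y][x] += 1.0 / len(tiles)
    return demand
```

Blocked tiles receive zero demand and their share is redistributed, which is the qualitative effect a global router would also exhibit around fixed wiring.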
Article
Full-text available
The desire for large-size, high-speed, and low-power on-chip memory necessitates early and accurate estimates of memory performance. A new performance model as well as an early cache design tool and predictor of access and cycle time for cache stack (PRACTICS) has been developed for on-chip static random access memory (SRAM) cache design that includes both delay and dynamic-power models. Efficient models for distributed interconnect delays, verified by Cadence simulations, are introduced, and their necessity is demonstrated. In the delay model, the access time is estimated by decomposing each component into several equivalent lumped resistance-capacitance (RC) circuits and using an appropriate order pi model to approximate the distributed wire delays of each stage. The dynamic-power model calculates the charging power dissipation of the load capacitances using the same equivalent lumped RC circuits. The delay model has been validated with an Intel 18-Mb SRAM at the 180-nm node, achieving accuracy to within 10% of the measured results. The dynamic-power model has been validated with an International Business Machines Corporation (IBM) 18-Mb SRAM at the 180-nm node, to within 13% of the measured power consumption. Detailed comparisons between PRACTICS and cache access and cycle time model (CACTI) in both validation cases indicate that an improved wire delay, appropriate circuit structures, and technology dependent parameters are necessary to accurately predict large cache memory performance at deep submicrometer technology nodes. PRACTICS is used to analyze the access time and power consumption in terms of cache sizes and various degrees of associativity for architectural studies. In addition, the PRACTICS simulation results show that repeater insertion reduces the access time significantly, with a small overhead in dynamic-power consumption for large-size cache design at deep submicrometer technology nodes.
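The single-pi wire approximation and the repeater-insertion effect described above can be sketched with an Elmore-style delay estimate; the component values in the test are illustrative, not taken from the paper:

```python
def pi_model_delay(r_drv, r_wire, c_wire, c_load):
    """Elmore delay through a single-pi approximation of a distributed
    wire: driver resistance Rd, wire lumped as Cw/2 - Rw - Cw/2,
    terminated by load capacitance CL."""
    return r_drv * (c_wire + c_load) + r_wire * (c_wire / 2 + c_load)

def repeated_wire_delay(n, r_drv, r_tot, c_tot, c_in):
    """Delay of a wire split into n equal segments, each driven by a
    repeater modeled only by its output resistance r_drv and input
    capacitance c_in (a coarse simplification)."""
    seg = pi_model_delay(r_drv, r_tot / n, c_tot / n, c_in)
    return n * seg
```

Because the distributed term scales with the square of segment length, splitting the wire into n repeated segments cuts the quadratic wire component roughly by n, which is the mechanism behind the repeater-insertion gains the abstract reports.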
Article
A framework is developed for estimation of power at the pre-register-transfer-level (RTL) stage for structured memory sub-systems. A power estimation model is proposed specifically targeting the power consumed by the clock network and interconnect. The model is validated with VCD-based simulation on a back-annotated netlist of an 8 MB memory sub-system used as video RAM (VRAM) for high-end graphics applications. This methodology also forms the basis for low-power exploration driving floorplan choice and the gating structure of the data and clock networks. We demonstrate a 57% reduction in dynamic power by using low-power techniques for the 8 MB VRAM used as a frame buffer in a graphics processor. FALPEM can be extended to other applications such as processor caches and ASIC designs.
Article
Three major principles for the selection of indicator data normalization methods in multi-attribute evaluation are presented in this paper. Principle 1: the relative gap between the data for the same indicator should remain constant; Principle 2: the relative gap between different indicators should remain variable; and Principle 3: the maximum values after normalization should be equal. According to these three principles, a normalization method for positive indicators is screened out from several alternatives, and a new normalization method for negative indicators is proposed. These two methods are well suited to comparisons among panel data. The requirements for data normalization methods differ when the evaluation goals differ; ranking-order-based evaluation is insensitive to the choice of normalization method.
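One ratio-preserving instantiation consistent with the three principles can be sketched as follows; this is an illustrative choice, not necessarily the paper's exact formulas:

```python
def normalize_positive(xs):
    """Benefit (positive) indicators: x / max(x). Relative gaps
    x_i / x_j are unchanged (Principle 1) and the maximum maps to 1
    (Principle 3)."""
    m = max(xs)
    return [x / m for x in xs]

def normalize_negative(xs):
    """Cost (negative) indicators, one assumed ratio-preserving
    choice: min(x) / x. Smaller raw values (better) map closer to 1,
    and the best value maps exactly to 1."""
    m = min(xs)
    return [m / x for x in xs]
```

Because both maps are pure ratios against a single anchor value, data from different panels normalized this way remain directly comparable, which is the panel-data property the abstract highlights.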
Conference Paper
A static random access memory (SRAM) compiler is a silicon compiler that generates SRAM IP cores with various specifications. A high-performance and flexible SRAM compiler is proposed in this paper. Our compiler uses block assembly techniques with a uniform physical data syntax, which insulates the compiler from low-level module information; hence it is self-adaptive to the migration of technology nodes. Our experimental results show that this compiler can generate SRAMs over a wide capacity range with relatively high performance.
Conference Paper
We present a library mapping technique that synthesizes a source memory module from a library of target memory modules. We define the library mapping problem for memories, identify and solve the three subproblems of port, bit-width and size (word) mapping associated with this task, and finally combine these solutions into an efficient memory mapping algorithm. Experimental results on a number of memory-intensive designs demonstrate that our memory mapping approach generates a wide variety of cost-effective designs, often counter-intuitive ones, based on a user-given cost function and the target library.
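In the simplest case, the bit-width and word (size) subproblems reduce to a tiling count, sketched below; port mapping and the user-given cost function are omitted, and the function name is a hypothetical helper:

```python
import math

def modules_needed(src_words, src_bits, tgt_words, tgt_bits):
    """Bit-width and word (depth) mapping: tile target modules
    side-by-side to cover the source memory's word width, then stack
    those rows to cover its depth."""
    width_tiles = math.ceil(src_bits / tgt_bits)   # widen the word
    depth_tiles = math.ceil(src_words / tgt_words) # extend the depth
    return width_tiles * depth_tiles
```

For example, realizing a 2048 x 36 source memory from 1024 x 16 target modules takes 3 modules across the word and 2 down the depth, 6 in total; a cost-driven mapper would compare such tilings across all target modules in the library.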
Conference Paper
With the proliferation of portable consumer electronics, power consumption becomes a key design criterion, while the speed of memories is still the bottleneck for high-speed applications. In this paper, we discuss the development of an SRAM compiler with the capability to choose between a low-power and a high-speed SRAM. Experimental results show that the low-power version of our 1-kB SRAM can function at a minimum operating voltage of 2.1 V and dissipates 17.4 mW of average power at 20 MHz.
Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0
  • N. Muralimanohar
  • R. Balasubramonian
  • N. Jouppi