Balanced instruction cache: reducing conflict misses of direct-mapped caches through balanced subarray accesses
ABSTRACT: It is observed that the limited memory space of direct-mapped caches is not used in a balanced way, which incurs extra conflict misses. We propose a novel cache organization, the balanced cache, which balances accesses to cache sets at the granularity of cache subarrays. The key technique of the balanced cache is a programmable subarray decoder through which the mapping of memory reference addresses to cache subarrays can be optimized, so that conflict misses of direct-mapped caches can be resolved. The experimental results show that the miss rate of the balanced cache is lower than that of same-sized two-way set-associative caches on average and can be as low as that of same-sized four-way set-associative caches for particular applications. Compared with previous techniques, the balanced cache requires only one cycle to access all cache hits and has the same access time as direct-mapped caches.
Department of Electrical and Computer Engineering
San Diego State University
I. Introduction
The increasing gap between memory latency and processor
speeds is a critical bottleneck in achieving a high-performance
computing system. To bridge the gap, multi-level memory
hierarchies have been exploited to hide the memory latency.
Level-one caches normally reside on a processor’s critical
path, which determines the clock frequency; therefore, fast
access to level-one caches is an important issue for improved
performance.
A conventional direct-mapped cache accesses only one tag
array and one data array per cache access, whereas a
conventional set-associative cache accesses multiple tag
arrays and data arrays per cache access. Thus, a direct-mapped
cache does not require a multiplexor to combine multiple
accessed data items and can therefore have a faster access
time. A direct-mapped cache is 26.7% and 16.1% faster than a
same-sized two-way set-associative cache at cache sizes of
8 KB and 16 KB, respectively.
Direct-mapped caches also have the advantages of being
simple to design and easy to implement. However, a direct-
mapped cache may have a higher miss rate than a set-
associative cache, depending on the access patterns of the
executing application, with a higher miss rate meaning more
waiting time for next level memory accesses. Therefore, a
direct-mapped cache may or may not result in better overall
performance for a particular application.
Manuscript submitted: 14 Mar. 2005. Manuscript accepted: 2 May
2005. Final manuscript received: 14 May 2005.
A desirable cache design has the access time of a
direct-mapped cache but a miss rate as low as that of a
set-associative cache. In conventional direct-mapped caches,
sequences of memory reference addresses are mapped to cache
sets based on index decoding. Because of the well-known
locality exhibited in instruction caches, some cache sets are
accessed more frequently than others and therefore generate
more conflict misses. Other cache sets are accessed less
frequently and are not used efficiently. The miss rate of
direct-mapped caches would be reduced if the accesses to
frequently used cache sets were reduced and the less
frequently used cache sets were used more efficiently.
This paper presents the design of the balanced cache,
which is a direct-mapped cache that resolves conflict
misses by balancing the mappings of memory reference
addresses to the cache sets. Cache accesses to the overused
sets are reduced, and hence conflict misses are reduced.
Cache accesses to underused sets are increased, so the
limited cache memory is used more efficiently to reduce the
miss rate.
Using execution driven simulations, we demonstrate that
the miss rate of the balanced cache is lower than that of a
conventional two-way set associative cache on average and
can be as low as the miss rate of a four-way set-associative
cache for some particular applications.
Furthermore, in contrast to other techniques that target the
reduction of the miss rate of direct-mapped caches, the
access time of the balanced cache is the same as that of a
conventional direct-mapped cache.
Last, the balanced cache requires only one cycle to access
all cache hits, while other techniques either need a second
cycle to access part of the cache hits or must have a longer
access time than direct-mapped caches. It should be noted
that the presented scheme would not work for a data cache or
for an instruction cache that must support self-modifying code.
Related work targets the reduction of either the access
time of set-associative caches or the miss rate of
direct-mapped caches. Partial address matching requires an
extra cycle to fetch the desired data when there is a
misprediction in the PAD comparison. The difference-bit
cache has a longer access time than the balanced cache.
Peir et al. attempt to use cache space intelligently by
relocating cache lines and taking advantage of the cache
holes during the execution of a program. However, the
relocated cache lines need a second cycle to access. Agarwal
et al. proposed to avoid conflict misses by using two
different mapping functions. An extra cycle is needed when
the first search is a miss. A victim buffer reduces conflict
misses of direct-mapped caches at the cost of either a longer
access time or an extra cycle. When both the level-one cache
and the victim buffer are checked concurrently, an extra mux
is required to choose the output from the level-one cache or
the victim buffer, so the access time is prolonged. If the
victim buffer is checked only after a level-one cache miss,
an extra cycle is required to access the victim buffer.
II. New Observation
Figure 1 shows the numbers of cache hits and misses on each
cache set for a direct-mapped 8 KB cache with a cache line
size of 32 bytes running benchmark parser from SPEC2K. We
categorize the cache sets into three types according to
their access patterns: frequent hit sets, frequent miss sets,
and less accessed sets. The frequent hit sets have many more
cache hits than other sets. Cache misses occur more
frequently on the frequent miss sets; the misses on these
sets account for around 90% of the total cache misses in
this particular example. The less accessed sets receive less
than 1% of the total cache references while they account for
around 35% of the total cache size. This means that the
limited cache space has not been used efficiently.
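As a minimal sketch of how per-set counts like those in
Figure 1 could be gathered, the following Python function
replays an address trace through a direct-mapped 8 KB cache
with 32-byte lines and tallies hits and misses per set. The
name `per_set_stats` and the short trace in the usage note
are hypothetical illustrations, not the authors' simulation
setup.

```python
from collections import Counter

CACHE_BYTES = 8 * 1024                 # 8 KB cache, as in Figure 1
LINE_BYTES = 32                        # 32-byte cache lines
NUM_SETS = CACHE_BYTES // LINE_BYTES   # 256 sets


def per_set_stats(addresses):
    """Replay byte addresses through a direct-mapped cache.

    Returns two Counters (hits, misses) keyed by set index.
    """
    tags = [None] * NUM_SETS           # tag stored in each set
    hits, misses = Counter(), Counter()
    for addr in addresses:
        block = addr // LINE_BYTES
        index = block % NUM_SETS
        tag = block // NUM_SETS
        if tags[index] == tag:
            hits[index] += 1
        else:
            misses[index] += 1
            tags[index] = tag          # replace the conflicting line
    return hits, misses
```

For example, `per_set_stats([0, 0, 8192, 0, 32])` records one
hit and three misses on set 0, because addresses 0 and 8192
share index 0 and evict each other, and one miss on set 1.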
From Figure 1, we conclude that if the cache sets are
accessed in a balanced manner, that is, if all sets are
accessed more evenly, then cache misses can be reduced
significantly. Set-associative caches normally have a higher
hit rate than direct-mapped caches. To verify that balancing
the accesses to cache sets would generate a higher hit rate,
we use the standard deviation of the per-set access counts to
describe the distribution of cache accesses for conventional
direct-mapped and set-associative caches.
The standard deviation is a statistic that tells how
tightly the values in a data set are clustered around the
mean. When the data are tightly bunched together, the
standard deviation is small. When the data are spread apart,
the standard deviation is relatively large.
Figure 2 shows the standard deviations (stdev) of the cache
set access numbers and the miss rates of instruction caches.
The standard deviations and the miss rates are normalized,
using the standard deviations and the miss rate of the direct-
mapped cache as 100%, respectively. For each benchmark,
the standard deviations and the miss rates at associativity of
eight, four, two, and direct-mapped caches are shown from
left to right.
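The metric used in Figure 2 can be illustrated with a small
Python sketch. It assumes the population standard deviation
(the text does not say whether the population or the sample
form is used), and the access counts are hypothetical values
chosen only to contrast a skewed and a balanced distribution.

```python
import statistics


def access_stdev(accesses_per_set):
    """Population standard deviation of per-set access counts.

    ASSUMPTION: population form; the paper does not specify.
    """
    return statistics.pstdev(accesses_per_set)


def normalize_to_baseline(value, baseline):
    """Express value as a percentage of the direct-mapped baseline."""
    return 100.0 * value / baseline


# Hypothetical counts for four sets with the same total accesses.
skewed = [70, 10, 10, 10]
balanced = [25, 25, 25, 25]
```

Here `access_stdev(balanced)` is zero because every set is
accessed equally often, while `access_stdev(skewed)` is the
square root of 675, roughly 26 accesses.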
From Figure 2, we observe that the changes in the standard
deviations generally track the changes in the miss rates.
The standard deviation of a higher-associativity cache is
smaller than that of a lower-associativity cache. The only
exception is benchmark mesa, for which the cache accesses
are more evenly distributed when the cache associativity is
increased but the miss rate increases.
From the above observation, we can see that set-associative
caches have a lower miss rate because they use the cache
space in a more balanced way and therefore more efficiently.
If we can find a way to balance the mappings of a
direct-mapped cache such that the accesses are more evenly
distributed over all cache sets, then we can improve the hit
rate of direct-mapped caches without increasing the cache’s
associativity or the cache access time.
In fact, balancing the mappings at the granularity of cache
sets may be much more involved than necessary. To simplify
the problem, we examine the behavior of cache accesses at
the cache subarray level. Cache memories are divided into
subarrays to achieve the best tradeoff of area, performance
(access time), and power consumption. For 8 KB, 16 KB,
and 32 KB direct-mapped caches, the cache memory is
divided into four subarrays based on the CACTI model.
We collected cache access information at the cache subarray
level and the results are shown in Figure 1.
From Figure 1, we observe that even at the cache subarray
level the cache accesses are still not balanced: the number
of cache hits on subarray 1 is 7 times higher than that on
subarray 2, and the number of misses on subarray 1 is 58
times higher than that on subarray 2. By balancing the
accesses to each cache subarray, the total miss rate can
still be reduced significantly.
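A hedged sketch of the subarray-level view: the Python
helpers below fold per-set counts into per-subarray totals
and report how uneven they are. Grouping consecutive sets
into one subarray is an assumption that matches a decoder
driven by the top index bits; the function names are
hypothetical.

```python
def per_subarray(counts_per_set, num_subarrays=4):
    """Sum per-set counts into per-subarray totals.

    ASSUMPTION: sets are numbered so that each subarray holds a
    contiguous block of len(counts_per_set) // num_subarrays sets.
    """
    sets_per_subarray = len(counts_per_set) // num_subarrays
    return [
        sum(counts_per_set[s * sets_per_subarray:(s + 1) * sets_per_subarray])
        for s in range(num_subarrays)
    ]


def imbalance_ratio(subarray_counts):
    """Busiest subarray's count divided by the least busy one's.

    Raises ZeroDivisionError if some subarray is never accessed.
    """
    return max(subarray_counts) / min(subarray_counts)
```

A ratio near 1 indicates balanced subarrays; the ratios of 7
for hits and 58 for misses quoted above correspond to a
strongly unbalanced cache.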
III. Balanced Cache Organization
A. Conventional Direct-Mapped Cache
Figure 1: Instruction cache hits and misses on each cache set
of benchmark parser. [Plot labels: frequent hit sets and
miss sets.]
Figure 2: Instruction cache miss rate and standard deviation
(stdev) for benchmarks ammp, equake, mesa, parser, art, mcf,
vortex, vpr, gzip, bzip, and gcc. Both the miss rate and the
standard deviation are normalized, using the stdev and miss
rate of the direct-mapped cache as 100%. The stdev and miss
rate of 8-way (leftmost bar), 4-way, 2-way, and direct-mapped
(rightmost bar) caches are shown.
Figure 3 shows the organization of an 8 KB direct-mapped
cache with a line size of 32 bytes. The divided word line
(DWL) technique is adopted to achieve both fast access time
and low per-access energy consumption. The cache has four
subarrays based on the CACTI model. The eight index bits,
I7 to I0, control the selection of the cache line for a
memory reference address. The two most significant bits of
the index, I7 and I6, determine the selection of the subarray
through the subarray decoder. The other six bits, I5 to I0,
determine the cache line selection within a subarray. The
combination of subarray and cache line selection locates a
particular cache line for a memory reference address.
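The index split of the conventional cache in Section III-A
can be written out as a short Python sketch. The field
widths follow from the 8 KB size and 32-byte lines (5 offset
bits, 8 index bits); the function name `split_address` and
the use of plain integers for addresses are illustrative
assumptions.

```python
OFFSET_BITS = 5        # 32-byte line -> 5 byte-offset bits
INDEX_BITS = 8         # 256 lines -> index bits I7..I0
LINE_INDEX_BITS = 6    # I5..I0 select one of 64 lines in a subarray


def split_address(addr):
    """Return (tag, subarray, line, offset) for a byte address."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    subarray = index >> LINE_INDEX_BITS           # I7, I6
    line = index & ((1 << LINE_INDEX_BITS) - 1)   # I5..I0
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, subarray, line, offset
```

In this fixed mapping the subarray is a pure function of I7
and I6, so a program whose hot code shares those two bits
concentrates its accesses in one subarray.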
B. Programmable Subarray Decoder
The core technique of the balanced cache design is the
programmable decoder, which can map cache accesses to
subarrays in a balanced way. Figure 3 shows the original
subarray decoders, four two-input AND gates whose inputs are
I7 and I6. Figure 3 also shows the new programmable decoder.
The decoder length and width are determined through
experiments and will be discussed in Section IV.
We extend the length of the decoder from two bits, as in the
original design, to six bits in order to reduce the accesses
to frequently accessed cache sets. We use content
addressable memory (CAM) instead of static logic gates, such
as AND or NOR gates, because the addresses mapped to a
subarray are no longer fixed as they are in the original
direct-mapped cache. Therefore, we need a programmable
decoder, which makes CAM a good candidate. To increase the
accesses to less frequently accessed cache sets, we use two
rows of CAM decoders per subarray, each of which holds a
different desired address pattern. This means that two
different address groups, such as addresses 010100 and
001110, can be mapped to the same subarray.
Since there are multiple decoders for one subarray, the tag
has to be extended to include the subarray index, I7 and I6,
as shown in Figure 3, to avoid false cache hits. For an
8 KB direct-mapped cache with a line size of 32 bytes, there
are four subarrays. Therefore, the two most significant
index bits are used for subarray decoding.
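To make the decoder behavior concrete, here is a minimal
Python model of a programmable subarray decoder with two CAM
rows per subarray. It is only a behavioral sketch: the class
and method names are hypothetical, and it assumes that a
given six-bit pattern is programmed into at most one
subarray at a time.

```python
class ProgrammableDecoder:
    """Behavioral model of CAM-based subarray selection."""

    def __init__(self, num_subarrays=4, rows=2, key_bits=6):
        self.key_mask = (1 << key_bits) - 1
        # rows[s][r] is the pattern stored in CAM row r of subarray s
        # (None means the row has not been programmed yet).
        self.rows = [[None] * rows for _ in range(num_subarrays)]

    def program(self, subarray, row, key):
        """Store a six-bit pattern in one CAM row."""
        self.rows[subarray][row] = key & self.key_mask

    def select(self, key):
        """Return the subarray whose CAM rows match key, else None.

        Checking membership in the row list plays the role of
        OR-ing the match lines of that subarray.
        """
        key &= self.key_mask
        for subarray, patterns in enumerate(self.rows):
            if key in patterns:
                return subarray
        return None
```

With `program(1, 0, 0b010100)` and `program(1, 1, 0b001110)`,
both address groups from the example above select subarray
1, which a fixed two-input AND decoder could not express.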
C. Subarray’s Replacement Policy
The replacement policy of the proposed balanced cache on a
cache miss differs from that of both conventional
direct-mapped caches and conventional set-associative caches.
In conventional direct-mapped caches, a memory
reference address corresponds to one known and fixed
location in the cache memory. On a cache miss, the newly
fetched cache line replaces the cache line that has the
same index as the desired address. In the balanced cache,
there is still only one location for a memory reference
address, but this location is not fixed; its selection
depends on the cache space usage. We propose to use the
least recently used (LRU) policy to allocate cache lines to
subarrays. Two situations should be taken into consideration
during a cache miss.
First, the subarray index of the desired memory reference
address is not in the decoders of any subarray, which means
a subarray decoder miss. In this case, the newly fetched
block is placed in the least recently used cache subarray.
This subarray has two programmable decoders; we replace the
subarray index in the least recently used decoder with the
corresponding subarray index of the newly fetched cache
line.
The second situation is that the subarray index is in one of
the two decoders of a cache subarray, which means that we
have a subarray decoder hit although we have a cache miss.
In this situation, the newly fetched data is stored in this
subarray, replacing the cache line that has the same cache
line index as the newly fetched data. The LRU states of both
the subarray and the two decoders in the subarray are
updated accordingly.
It should be noted that the replacement in the subarray
decoder does not require the invalidation of all the cache
blocks sharing the same subarray decoder tag bits, since we
target an instruction cache instead of a data cache.
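The two miss situations above can be sketched as a toy
Python model. This is an illustration under stated
assumptions rather than the authors' implementation: the
decoder key is taken to be the six bits directly above the
line index, the decoders start out empty, and the whole
block number is stored per line as a stand-in for the
extended tag. The class name `BalancedCache` is hypothetical.

```python
class BalancedCache:
    """Toy model of balanced-cache lookup and miss handling."""

    def __init__(self, num_subarrays=4, lines_per_subarray=64,
                 rows=2, key_bits=6):
        self.lines_per_subarray = lines_per_subarray
        self.rows = rows
        self.key_mask = (1 << key_bits) - 1
        # decoders[s]: keys programmed into subarray s, LRU first.
        self.decoders = [[] for _ in range(num_subarrays)]
        # subarray_lru: subarray ids, least recently used first.
        self.subarray_lru = list(range(num_subarrays))
        # lines[s][i]: block number held in line i of subarray s.
        self.lines = [[None] * lines_per_subarray
                      for _ in range(num_subarrays)]

    def _find_subarray(self, key):
        for s, keys in enumerate(self.decoders):
            if key in keys:
                return s
        return None

    def access(self, block):
        """Access one cache block number; return True on a hit."""
        line = block % self.lines_per_subarray
        # ASSUMPTION: key = six bits above the line index.
        key = (block // self.lines_per_subarray) & self.key_mask
        s = self._find_subarray(key)
        if s is None:
            # Situation 1: decoder miss. Take the LRU subarray and
            # overwrite its LRU decoder row; old lines stay valid.
            s = self.subarray_lru[0]
            if len(self.decoders[s]) == self.rows:
                self.decoders[s].pop(0)
            self.decoders[s].append(key)
        # Update the LRU state of the subarray and its decoders.
        self.subarray_lru.remove(s)
        self.subarray_lru.append(s)
        self.decoders[s].remove(key)
        self.decoders[s].append(key)
        # Situation 2 (or after reprogramming): the line with the
        # same line index is replaced unless it already matches.
        hit = self.lines[s][line] == block
        self.lines[s][line] = block
        return hit
```

Blocks 0 and 256 would evict each other in a conventional
256-line direct-mapped cache. In this sketch they receive
different decoder keys and are placed in different
subarrays, so the access sequence 0, 256, 0, 256 misses
twice and then hits twice.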
D. Balanced Cache Circuit Design
We analyze the timing, area, and power consumption issues
of the balanced cache organization using HSPICE simulation
and the CACTI model. The results show that the balanced
cache has no access time overhead, less than 2% area
overhead and 0.22% power overhead compared with a
conventional direct-mapped cache.
The critical path of conventional direct-mapped caches
resides on the tag side instead of the data side. The
original subarray decoding is much faster than the cache
line decoding within a particular subarray; therefore,
subarray decoding is not on the critical path. To avoid
lengthening the critical path, the new programmable decoder
should run as fast as the cache line decoding.
The programmable decoder consists of two rows of
decoders. Each row has six CAM cells. We use standard
ten-transistor CAM cells, each of which contains an SRAM
cell and a dynamic XOR gate used for comparison. The match
line is precharged high and conditionally discharged on a
mismatch. The two match lines are OR-ed to generate the
subarray decoding activation signal. In the programmable
decoder, there are only two rows and six CAM cells per row.
HSPICE simulation shows that we can easily choose
appropriate transistor parameters in the CAM cells to make
the comparison in the programmable decoder run faster than
the 6-to-64 cache line decoder.
Figure 3: The organization of a conventional direct-mapped
cache and the balanced cache, which uses a programmable
subarray decoder to replace the original subarray decoder.
[Diagram labels: original subarray decoder; I9 I8 I7 I6 I5;
cache line index.]
To avoid false cache hits, the tag is extended to include
the subarray index. In a four-subarray direct-mapped cache,
the proposed cache tag is two bits longer than the original
tag. This increases the time spent on activating the word
line and bit lines of the tags and the time needed to
compare the tags. We collect access times for both the tag
side and the data side of the balanced cache with four
subarrays at cache sizes of 8 KB, 16 KB, and 32 KB using the
CACTI model at 0.18 µm technology. The results show that the
data side access time is still longer than that of the tag
side after we extend the tag by two bits, so some time slack
remains.
The area overhead includes replacing the original four
two-input AND gates of the subarray decoders with eight
six-bit CAM decoders and LRU registers. The area overhead
is measured using an 8 KB cache layout at 0.18 µm
technology. The total area overhead is less than 2% of the
total cache area.
We ran ten SPEC2000 benchmarks through the SimpleScalar
tool set. We used a 4-issue out-of-order processor
simulator with 8 KB, 16 KB, and 32 KB L1 instruction
and data caches. The benchmarks were fast-forwarded for
one billion instructions and then executed for 500 million
instructions using reference inputs.
Figure 4 shows the miss rate reduction of the proposed
balanced cache and of conventional set-associative caches
relative to a conventional direct-mapped cache, when the
decoder length is four, five, and six bits and the decoder
width is two rows, for an 8 KB instruction cache. On average,
two rows of decoders with a length of four bits achieve a 1%
better miss rate reduction than a conventional two-way
set-associative cache. When the length of the decoder is
increased to five bits, the miss rate reduction is 2% less
than that of a conventional two-way set-associative cache.
On the other hand, when the length of the decoder is six
bits, the miss rate reduction is 3.1% higher than that of a
conventional two-way set-associative cache. For benchmarks
equake and vpr, a six-bit decoder achieves a much better
miss rate reduction than five-bit and four-bit decoders. For
equake, the miss rate is as low as that of a four-way
set-associative cache. For benchmark mesa, increasing the
associativity from direct-mapped to set-associative
increases the miss rate instead of decreasing it as
expected, whereas the balanced cache reduces the
direct-mapped cache miss rate by 5%. For benchmarks ammp and
mcf, the balanced cache achieves a miss rate as low as that
of conventional four-way set-associative caches.
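For readers who want to reproduce this kind of comparison,
a miss rate reduction relative to the direct-mapped baseline
can be computed as in the Python sketch below. The formula
is an assumption about how the normalization in Figure 4 is
defined, and the numbers in the usage note are hypothetical,
not measured results.

```python
def miss_rate_reduction(baseline_miss_rate, miss_rate):
    """Percent reduction in miss rate relative to the baseline.

    ASSUMPTION: reduction = (baseline - new) / baseline * 100.
    A negative result means the miss rate got worse.
    """
    return 100.0 * (baseline_miss_rate - miss_rate) / baseline_miss_rate
```

For instance, going from a hypothetical 4% direct-mapped
miss rate to 3% is a 25% reduction, while going from 2% to
2.5% yields a negative value, the kind of outcome reported
for mesa when associativity is increased.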
Through experiments, we find that further increasing the
subarray decoder width to three or even four rows does not
reduce the miss rate significantly but may instead increase
the access time and power overhead. Therefore, we use two
decoder rows per subarray in our balanced cache design.
In this paper, we observe that the cache memory space of a
conventional direct-mapped cache is not used in a balanced
way. We propose a balanced cache organization, which is
still a direct-mapped cache but whose mapping to cache
memory is balanced automatically. The miss rate of the
balanced cache is greatly reduced: it is lower than that of
same-sized two-way set-associative caches on average and as
low as that of four-way set-associative caches for
particular applications at cache sizes of 8 KB, 16 KB, and
32 KB. Compared with previous techniques, the balanced cache
requires only one cycle to access the cache and has the
access time of direct-mapped caches.
References
A. Agarwal and S. D. Pudar, “Column-Associative Caches: A
Technique for Reducing the Miss Rate of Direct-Mapped
Caches,” Proc. of the 20th International Symposium on
Computer Architecture, pp. 179–180, May 1993.
D. Burger and T. M. Austin, “The SimpleScalar Tool Set,
Version 2.0,” Univ. of Wisconsin-Madison Computer Sciences
Dept. Technical Report #1342, June 1997.
T. Hirose, et al., “A 20ns 4Mb CMOS SRAM with
Hierarchical Word Decoding Architecture,” International
Solid-State Circuits Conference, 1990.
T. Juan, T. Lang, and J. Navarro, “The Difference-bit Cache,”
in Proceedings of the 27th Annual International Symposium on
Computer Architecture, 1996.
T. L. Johnson and W. W. Hwu, “Run-time Adaptive Cache
Hierarchy Management via Reference Analysis,” in
Proceedings of International Symposium on Computer
N. Jouppi, “Improving Direct-Mapped Cache Performance by
the Addition of a Small Fully-Associative Cache and Prefetch
Buffers,” in Proceedings of International Symposium on
Computer Architecture, 1990.
L. Liu, “Cache Design with Partial Address Matching,” in
Proceedings of the 27th Annual International Symposium on
Microarchitecture, December 1994.
G. Reinmann and N. P. Jouppi, “CACTI2.0: An Integrated
Cache Timing and Power Model,” 1999. COMPAQ western
S. Santhanam, et al., “A Low-Cost, 300-MHz, RISC CPU with
Attached Media Processor,” IEEE Journal of Solid-State
Circuit, Vol. 33, No. 11, November 1998.
J. K. Peir, Y. Lee, and W. W. Hsu, “Capturing Dynamic
Memory Reference Behavior with Adaptive Cache Topology,”
Proc. of the 8th International Conference on Architectural
Support for Programming Language and Operating Systems,
pp. 240–250, Oct. 1998.
M. Yoshimoto, et al., “A Divided Word-Line Structure in the
Static RAM and its Application to a 64k Full CMOS RAM,”
IEEE J. Solid-State Circuits, Vol. SC-21, pp. 479–485, Oct.,
Figure 4: Instruction cache miss rate improvement of two-,
four-, and eight-way set-associative caches and the proposed
balanced cache, normalized with respect to the direct-mapped
cache. W denotes associativity; e.g., 2W represents a
conventional two-way set-associative cache. 2_4 gives the
width and length of the programmable decoder: 2 means the
decoder width is two rows and 4 means the decoder length is
four bits.