A Memory Interface for Multi-Purpose Multi-Stream Accelerators
Embedded Systems Lab, Thales Research and Technology
ABSTRACT
Power and programming challenges make heterogeneous multi-
cores composed of cores and ASICs an attractive alternative to
homogeneous multi-cores. Recently, multi-purpose loop-based
generated accelerators have emerged as an especially attractive
accelerator option. They have several assets: short design time
(automatic generation), flexibility (multi-purpose) but low con-
figuration and routing overhead (unlike FPGAs), computational
performance (operations are directly mapped to hardware), and a
focus on memory throughput by leveraging loop constructs. How-
ever, with multiple streams, the memory behavior of such accel-
erators can become at least as complex as that of superscalar
processors, while they still need to retain the memory ordering
predictability and throughput efficiency of DMAs. In this article,
we show how to design a memory interface for multi-purpose ac-
celerators which combines the ordering predictability of DMAs,
retains key efficiency features of memory systems for complex pro-
cessors, and requires only a fraction of their cost by leveraging the
properties of stream references. We evaluate the approach with
a synthesizable version of the memory interface for an example
9-task generated loop-based accelerator.
Categories and Subject Descriptors
C.1.2 [Processor Architectures]: Multiple Data Stream
Architectures (Multiprocessors); C.1.3 [Processor Archi-
tectures]: Other Architecture Styles—Heterogeneous (hybrid) systems
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
CASES’10, October 24–29, 2010, Scottsdale, Arizona, USA.
Copyright 2010 ACM 978-1-60558-903-9/10/10 ...$10.00.
Keywords
Memory interface, accelerators, multi-stream
1. INTRODUCTION
Even though CMPs have emerged as the architecture of
choice for most manufacturers, there is a consensus that ef-
ficiently exploiting a large number of cores for a broad range
of programs will be a daunting task. Moreover, ever strin-
gent power constraints may impose in the future that not
all transistors, and thus not all cores, operate at the same time.
Consequently, accelerators, i.e., specialized circuits/ASICs,
are becoming an increasingly popular alternative. For cost
and efficiency reasons, they have been a fixture in embed-
ded systems where SoCs can include tens of accelerators.
In high-performance general-purpose systems, they would
enable low-power high-performance execution of important
tasks. Their footprint is far smaller than a core's, making it possible
to cram a large number of accelerators on a chip, trading off some
of the cores of a many-core. Such a set of accelerators would
become akin to a hardware library, and the larger the library
the more likely a programmer will find algorithms useful for
his/her program. Moreover, the programming support for
accelerators is far simpler than parallelization; it is in-
deed more akin to a library call. Finally, accelerators can even
speed up non-thread-parallel tasks thanks to circuit-level parallelism.
While accelerators have many assets, their obvious weak-
ness is flexibility. As a result, a trend is emerging for flexible
accelerators: either accelerators which implement the most
frequent computational patterns for a set of programs,
or accelerators which efficiently merge together the circuits
for multiple programs. Such flexible accelerators are
configurable, but dedicate far less on-chip real estate to configu-
ration logic than FPGAs, and are thus much closer to ASICs
than FPGAs in terms of cost, power and efficiency.
In the trend towards more customization, loop-based ac-
celerators are becoming especially popular [13, 24, 12, 2]
because they not only speed up computations through cus-
tomization but also achieve high memory bandwidth by lever-
aging loop constructs to efficiently stream data into accel-
erators. As a result, we may soon see complex loop accel-
erators with a large number of streams to feed. For now,
there has been little focus on the memory interface (includ-
ing the detailed stream implementation) required to achieve
the expected memory bandwidth.
Such a memory interface cannot just consist of multi-
plying the number of DMAs, nor can it correspond to the
memory interface used for high-performance processors. A
DMA is typically used to feed data into an accelerator, and
usually, one DMA handles one stream of data. If the ac-
celerator contains multiple streams, the task of the DMA
becomes significantly more complex: it must load balance
streams and multiplex the memory bandwidth among the
different accelerators. Moreover, as the number of streams
scales up, multiple reuse opportunities occur that, if not ex-
ploited, would result in sub-par performance. At the same
time, it must strictly preserve the ordering of data fed to
the accelerator because a custom circuit behaves in a fun-
damentally different way than a processor: data is pushed
to the accelerator which expects data to arrive in the right
order, as opposed to being pulled by the processor when
requested (using addresses). Still, the memory systems of
general-purpose processors have the desirable property of
being designed to achieve both high bandwidth and reuse
for multiple concurrent and out-of-order memory accesses,
through a combination of non-blocking caches and prefetch-
ers. But this approach is not compatible with accelerators
because of its aforementioned pull vs. push mode of opera-
tion, and because of its steep cost.
Moreover, the memory interface of multi-stream acceler-
ators is not only key for their performance, it is also the
most important part of the accelerator in terms of area and
power. For an example 9-task generated loop-based accel-
erator synthesized using the Synopsys Design Compiler and
the TSMC 90nm library, the accelerator streams alone ac-
count for more than 8 times the area and 16 times the power
of the computational and storage (registers) logic of the accelerator.
In this article, we propose a memory interface for multi-
stream accelerators which can realize the execution correct-
ness and determinism of DMAs, while retaining many of
the performance advantages of general-purpose processors
memory systems (reuse, multiple concurrent requests, out-
of-order requests), at a small area and power cost. We show
how to design streams capable of sustaining 1-word issue per
cycle to the accelerator in the presence of complex memory
patterns (e.g., short loops, irregular accesses, etc), which is
key to expand the scope of such accelerators. We also show
how to complement streams with a Stream Table, which
has a small area and power footprint compared to streams,
but which boosts the average accelerator speedup over a
core from 5 to 10 by taking advantage of short-term cross-
reference temporal reuse, and by augmenting the apparent memory bandwidth.
2. MEMORY INTERFACE
In this section, we describe the memory interface struc-
ture, which includes the streams and a table called the Stream
Table, see Figure 1. For that purpose, we go through the
different memory access issues raised by multi-stream accel-
erators and how they are handled by the interface.
Figure 1: Overview of the multi-stream memory interface.
2.1 In-order accesses and reloads using pre-allocation
As mentioned before, an accelerator can often be consid-
ered as a passive circuit which receives data from memory
and immediately processes them; it does not “request” data
using memory addresses. For instance, it is not possible to
use Stream Buffers [18], as proposed for fetching streams
into caches, because the circuit does not send an address
when it needs data, and because the circuit cannot filter
out speculatively fetched data.
In a single-stream accelerator with a memory ensuring
in-order responses, this is a non-issue: data comes back in
the order it is requested. If the accelerator is plugged into a
NoC, memory requests may no longer arrive in order. For
that reason, the stream controller (i.e., the DMA) must pre-
allocate an entry in the stream before sending a request to
memory. Even if the data comes back out of order, it is
stored in the appropriate stream entry.
In loop-based accelerators, we denote as a stream the con-
trol logic required to generate addresses to memory and the
fifo used to store data for the circuit. For loops, a stream
includes a counter, which is initiated with start and end
addresses, and a stride, and it then fetches and feeds the
data continuously to the circuit. A handshaking protocol
between the stream and the circuit is required to steer data
consumption: the stream signals to the circuit that data is
available in the buffer with a ready signal, and the circuit
signals to the stream with a shift signal when it can discard
the current data and move to the next one; the stream also
signals when it has fetched all requested data (stop signal),
so that the circuit can compute its own global stop signal.
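To make this protocol concrete, the following Verilog sketch models a 1-word-wide read stream combining the handshake above with the pre-allocation of Section 2.1. It is a minimal behavioral illustration with our own module and signal names, not the generated RTL evaluated in this article; requests carry the pre-allocated slot index as a tag so that reloads may return out of order while words are still consumed in order (multi-word entries are simplified away here).

  module read_stream #(parameter AW = 32, DW = 32, E = 8) (
    input                    clk, rst,
    input  [AW-1:0]          start_addr, end_addr, stride,
    // circuit side: handshaking protocol described above
    output                   ready,         // data available for the circuit
    output [DW-1:0]          data,
    input                    shift,         // circuit discards the current word
    output reg               stop,          // all requested data fetched
    // memory side: the request is tagged with the pre-allocated slot
    output reg               req_valid,
    output reg [AW-1:0]      req_addr,
    output [$clog2(E)-1:0]   req_slot,
    input                    req_ack,
    input                    rsp_valid,     // reload, possibly out of order
    input  [$clog2(E)-1:0]   rsp_slot,
    input  [DW-1:0]          rsp_data
  );
    reg [DW-1:0]        entry [0:E-1];
    reg [E-1:0]         valid;
    reg [$clog2(E)-1:0] alloc_ptr, read_ptr;      // fifo-order pointers
    reg [AW-1:0]        next_addr;
    wire full = (alloc_ptr + 1'b1 == read_ptr);   // one slot kept free

    assign ready    = valid[read_ptr];            // stalling is implicit
    assign data     = entry[read_ptr];
    assign req_slot = alloc_ptr;

    always @(posedge clk) begin
      if (rst) begin
        valid <= 0; alloc_ptr <= 0; read_ptr <= 0;
        next_addr <= start_addr; stop <= 1'b0; req_valid <= 1'b0;
      end else begin
        // pre-allocate the next slot and request its address
        if (!stop && !req_valid && !full) begin
          req_valid <= 1'b1;
          req_addr  <= next_addr;
        end
        if (req_valid && req_ack) begin
          req_valid <= 1'b0;
          alloc_ptr <= alloc_ptr + 1'b1;
          next_addr <= next_addr + stride;
          if (next_addr + stride > end_addr) stop <= 1'b1;
        end
        // out-of-order reload lands in the slot reserved for it
        if (rsp_valid) begin
          entry[rsp_slot] <= rsp_data;
          valid[rsp_slot] <= 1'b1;
        end
        // in-order consumption by the circuit
        if (shift && valid[read_ptr]) begin
          valid[read_ptr] <= 1'b0;
          read_ptr        <= read_ptr + 1'b1;
        end
      end
    end
  endmodule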
The latency tolerance capability of the stream is naturally
correlated to its size. The longer the latency, the longer the
time between the pre-allocation and the moment the data
is used. If the stream size is properly dimensioned for the
latency, then in steady state, a filled buffer can feed the cir-
cuit without stalling for the time it takes to fetch data from
memory. If the top entry (next data to be fed to the circuit)
is not ready, the circuit is simply stalled until the appropriate
data arrives.
Narrow streams for large strides.
Along the same principles as cache blocks for exploiting
spatial locality, it is more efficient to have multi-word stream
entries (wide streams) in order to minimize the stream cost.
Unlike cache blocks, streams need not all have the same
width. And in a multi-stream accelerator, it is possible to of-
fer a mix of streams, best suited to the accelerator. Stride-1
accesses favor the widest possible streams, while large strides
favor narrower, if not 1-word-wide, streams. While stride-1
accesses are the most frequent, a quantitative study of loop
locality showed that large strides were also frequent [20],
typically in array column-wise accesses; small-value strides
other than +1/-1, e.g., 2 or 3, are infrequent. The maximum
stream width is somewhat constrained by the rest of the
memory system. Since memory systems in multi-cores (ho-
mogeneous or heterogeneous) are likely to include shared L2
or L3 caches, data is usually fetched by L2 blocks (B words),
and thus stream reloads are simpler if the maximum stream
width does not exceed B.
Complex access patterns.
As the scope of accelerators expands, the task at hand is
more likely to contain complex memory access patterns. For
instance, the widespread use of small loops in SpecInt-type
programs, in signal processing applications (radio, sound,
image, video), or even in scientific applications, makes
it impossible to restrict accelerators to singly-nested loops.
Yehia et al. [24] recently demonstrated the performance ben-
efit of considering multi-loop accelerators in order to com-
pensate for start-up overhead. And 2-deep or more loop
nests can already induce non-trivial memory accesses. Con-
sider Figure 2, where a few example access patterns are
shown. Assuming an entry width of W words, in case (a),
the inner loop stride is 1, so a W-wide stream should be
used, but the loop bounds are smaller than the first matrix
dimension, resulting in stream entries being partially used;
case (b) is a column-wise access in a 2-D array where the
first dimension is smaller than the stream width; case (c) is
an example of indirect addressing.
Figure 2: Stream entries allocation depending on reference patterns.
All these access cases must be handled by the accelerator
streams. The general issue is that not all words within a
stream entry may be used (all examples), that all words in
a stream entry may not be accessed in a monotonic order
(column-wise access and indirect addressing examples), and
that some words may have to be accessed multiple times
within a short time period (the repeated word in the indirect
addressing example). In order to keep the conceptually simple and fast
mode of operation that a word is discarded after being con-
sumed by the accelerator, we forbid the latter case, i.e., each
word in a stream entry can only be used once. We explain
below how to design the stream to allow all other cases.
Forbidding multiple same-word same-entry accesses.
Because access patterns can be complex, words are allo-
cated one by one in the stream, as soon as the stream con-
troller has computed the next address. If the next word can-
not be allocated in the current stream entry, a new stream
entry is allocated; each stream entry contains an allocated
bit and a tag, see Figure 3. In order to enforce single usage
per word per entry, each stream entry also contains a word
mask and a locked bit. When a word is allocated, the cor-
responding bit is set in the mask. If the stream controller
wants to allocate the same word a second time, then the
entry is locked, and a new stream entry is allocated. Lock-
ing the stream entry is necessary to ensure in-order access
again. Consider an address reference sequence of the form
a0, a1, a0, where the same word a0 is referenced twice, as in
Figure 2: the first stream entry will be locked upon the second
access to a0, which is allocated in the next stream entry.
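The allocation test can be sketched as follows (our naming; addresses are word-granular and W is the entry width in words). This is an illustrative fragment, not the generated circuit:

  module alloc_decision #(parameter AW = 32, W = 8, OFF = $clog2(W)) (
    input  [AW-1:0]     next_addr,      // from the stream controller
    input  [AW-OFF-1:0] entry_tag,      // block tag of the entry under allocation
    input  [W-1:0]      word_mask,      // words already allocated in the entry
    input               locked,
    output              same_word,      // word already present: lock + new entry
    output              open_new_entry
  );
    wire [OFF-1:0] woff      = next_addr[OFF-1:0];             // word offset
    wire           tag_match = (next_addr[AW-1:OFF] == entry_tag);
    assign same_word      = tag_match && word_mask[woff];      // second use of a word
    assign open_new_entry = locked || !tag_match || same_word;
  endmodule

When same_word is raised, the controller sets the locked bit of the current entry and allocates the word in the freshly opened entry, reproducing the behavior on the repeated-reference sequence above.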
Figure 3: Detailed design of a read stream buffer.
In order to allow maximum freedom in the order in which
words are accessed within an entry, we build the hardware
equivalent of a chained list of these words. One major con-
straint is that word access must be very fast in order to avoid
slowing down the circuit; as a result, a two-step indirect ac-
cess, as usually performed in software, is not tolerable; a
word should be issued every cycle to the circuit.
For that purpose, we add a next-word field to each en-
try. Assuming a W-word entry, this field contains W sets
of log2W bits, each set acting as a word pointer. The next-
word at position i indicates the offset of the word to be read
after word i. Any delay due to the indirection is avoided
by reading next-word at the same time as the word is read.
Next-Word is then fed to the combinational circuit which
drives the multiplexor used to select one word among W for
the circuit, see Figure 3.
In order to determine when the last word in the chain for
an entry has been reached, a has-next-word mask of W bits
and a first-word set of log2W bits are also necessary. When
the has-next-word bit is 0, the stream controller knows it has
read the last word of an entry, and must shift to the next
entry. The first-word bits indicate the offset of the first word
to be read in the next entry. The inputs to the multiplexor
are thus the current word offset, the corresponding next-
word bits, the corresponding has-next-word bit and the first-
word bits of the next entry, see Figure 3.
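The corresponding readout path can be sketched as below (again with our own names); the next offset is computed purely combinationally, so one word per cycle is issued with no indirection delay:

  module entry_readout #(parameter W = 8, DW = 32, PW = $clog2(W)) (
    input  [W*DW-1:0] entry_data,      // W words of the current entry
    input  [W*PW-1:0] next_word,       // per-word pointer to the next offset
    input  [W-1:0]    has_next_word,   // 0 => last word of the entry
    input  [PW-1:0]   cur_off,         // offset of the word being read
    input  [PW-1:0]   first_word_next, // first offset of the next entry
    output [DW-1:0]   word_out,        // word issued to the circuit this cycle
    output            entry_done,      // shift the readout register
    output [PW-1:0]   next_off         // offset to read on the next cycle
  );
    assign word_out   = entry_data[cur_off*DW +: DW];
    assign entry_done = !has_next_word[cur_off];
    // follow the chain inside the entry, or jump to the next entry's head
    assign next_off   = entry_done ? first_word_next
                                   : next_word[cur_off*PW +: PW];
  endmodule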
This chained indirect addressing approach induces no tim-
ing overhead compared to a direct stride-1 access and en-
ables both sparse and out-of-order word access within a
stream entry. For a W-word stream of 32-bit words, the
bit storage overhead is (W·log2(W) + W + log2(W)) / (32·W),
i.e., 13.67% for W = 8.
2.3 Concurrent stream accesses at a low cost:
readout, allocation, selection, reload
A stream may have to perform all four operations con-
currently. For that purpose, there are four registers in each
stream, each register pointing to the target entry for one
of the aforementioned operations; each register is log2(E)
bits wide, where E is the number of stream entries. The read-
out register has already been mentioned as used to control
the multiplexor to the circuit. The allocation register points
to the entry being currently allocated. It is used to select
the entry where the allocated, next-word, has-next-word,
first-word and tag bits are written. The select register
points to the entry whose tag will be sent to memory. The
reload register is actually an E-bit mask because several en-
tries can be reloaded simultaneously, as later explained. It is
used to select in which entries the incoming data bits should be written.
Except for the reload register which is set by the memory
interface, the other three registers behave like fifo pointers:
they shift from one entry to the next and back to the top. All
these registers are shifted upon different events. The readout
register is shifted as soon as all words in an entry have been
read. The allocation register is shifted when a word can no
longer be allocated in the currently pointed entry. And the
select register is shifted as soon as the memory interface has
acknowledged the request.
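A minimal sketch of the three fifo-like pointers (our event names; the reload mask, set by the memory interface, is omitted):

  module stream_pointers #(parameter E = 4) (
    input                      clk, rst,
    input                      readout_done, // all words of the entry read
    input                      alloc_done,   // next word no longer fits the entry
    input                      select_done,  // memory interface acknowledged
    output reg [$clog2(E)-1:0] readout, alloc, select
  );
    always @(posedge clk)
      if (rst) begin
        readout <= 0; alloc <= 0; select <= 0;
      end else begin
        // each pointer shifts on its own event and wraps like a fifo pointer
        if (readout_done) readout <= readout + 1'b1;
        if (alloc_done)   alloc   <= alloc   + 1'b1;
        if (select_done)  select  <= select  + 1'b1;
      end
  endmodule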
While four different operations would normally require four
access ports to the storage structure, and would thus be exceed-
ingly costly, only one read and one write port are actually necessary for
each stream storage structure, provided one carefully con-
siders when each bit is being read and written. The bits
used for allocation (allocated, next-word, has-next-word,
first-word and tag) are not written in the readout, selec-
tion nor reload phase. The only exception is the allocated
bit which is reset after readout has read all words in the
entry; but when the readout register points to an entry, it is
impossible that the allocation register points to that entry
as well (we assume a minimum of two entries per stream).
Similarly, the data bits are only written upon reload. A
signaling bit to the memory interface (introduced later) is
also written upon selection and readout but again both op-
erations can never occur simultaneously on the same entry
(an entry cannot be read to the circuit if it is just being re-
quested to the memory interface). As a result, only a single
write port for each sub-structure is required.
2.4 Load balancing stream requests to the memory interface
Figure 4: Combinatorial sorter for T = 2 table ports and R requesting streams.
There is a handshaking protocol between the stream and
the memory interface. Upon allocation, the notify bit of the
stream entry is set, signaling a request to the memory inter-
face when the select register rotates to that entry. The mem-
ory interface can accommodate T requests simultaneously, and
it must then pick T streams among the pending ones. We
choose to select streams based on the number of filled words
in each stream, using a log2(E × B)-bit counter. The number of filled
words provides an indication of the stream “needs”: if the
circuit consumes the stream words slowly, or if this stream
has been lately privileged by the memory interface, it will
have many words available. This fairness strategy is more
robust than round-robin: if a stream is under-privileged, the
number of filled words will decrease and its priority will nat-
urally shoot up. And the memory interface randomly selects
among the streams with the same number of filled words.
Let us now assume that R streams are sending a request
to the Stream Table. If R ≤ T, then all requests can be
handled by the table. If R > T, we need to pick the T
among R streams with the fewest filled words.
For that purpose, we need to sort the streams according to
their number of filled words, and to do so very rapidly and
cost-efficiently. We resort to a combinatorial sorter derived
from Batcher’s odd-even merge sort circuit [7], see Figure 4.
This algorithm splits the list to be sorted into pairs of sets
of 2^k ordered elements each, with k = 0 initially and incre-
mented at each stage, and orders and merges the elements
in two sets. Sorting N = 2^n elements requires n merge
phases, corresponding in total to N/4 × log2(N) × (log2(N) − 1) + N − 1
comparators. We alter Batcher’s combinatorial sort circuit
in two ways: each comparator is fed with a simple 1-bit
pseudo-random number [8] used in case of equality (ran-
dom decision if two streams have the same number of filled
words), and we remove all the comparators of the last (and
largest) merge stage which are not necessary to find the top
T numbers, see Figure 4. Since we later find that the maxi-
mum number of words in a stream is 32 (4 8-word entries),
we use 5-bit comparators.
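As an illustration, here is a possible coding (ours, for N = 4 and T = 2) of the modified compare-exchange network: each cell routes the more starved input (fewer filled words, with a random tie-break) to its high-priority output. For N = 4 all (4/4)×2×1 + 3 = 5 cells are needed; for larger N, the comparators of the last merge stage that only order the bottom N − T outputs would be removed.

  module cmp_exch #(parameter CW = 5, IW = 2) (
    input  [CW-1:0] a_cnt, b_cnt,     // filled-word counts
    input  [IW-1:0] a_id,  b_id,      // stream identifiers carried along
    input           rnd,              // 1-bit pseudo-random tie-break
    output [CW-1:0] hi_cnt, lo_cnt,   // hi = more starved (fewer words)
    output [IW-1:0] hi_id,  lo_id
  );
    wire a_wins = (a_cnt < b_cnt) || ((a_cnt == b_cnt) && rnd);
    assign {hi_cnt, hi_id} = a_wins ? {a_cnt, a_id} : {b_cnt, b_id};
    assign {lo_cnt, lo_id} = a_wins ? {b_cnt, b_id} : {a_cnt, a_id};
  endmodule

  module sorter_4_top2 #(parameter CW = 5, IW = 2) (
    input  [CW-1:0] c0, c1, c2, c3,   // filled-word counts of 4 streams
    input  [IW-1:0] i0, i1, i2, i3,
    input  [4:0]    rnd,              // one random bit per comparator
    output [IW-1:0] sel0, sel1        // the two streams granted a table port
  );
    wire [CW-1:0] h01c, l01c, h23c, l23c, t0c, m0c, m1c, bc, t1c, xc;
    wire [IW-1:0] h01i, l01i, h23i, l23i, t0i, m0i, m1i, bi, t1i, xi;
    // stage 1: order the two input pairs
    cmp_exch #(CW, IW) s10 (c0, c1, i0, i1, rnd[0], h01c, l01c, h01i, l01i);
    cmp_exch #(CW, IW) s11 (c2, c3, i2, i3, rnd[1], h23c, l23c, h23i, l23i);
    // stage 2: merge; t0 is the most starved stream overall
    cmp_exch #(CW, IW) s20 (h01c, h23c, h01i, h23i, rnd[2], t0c, m0c, t0i, m0i);
    cmp_exch #(CW, IW) s21 (l01c, l23c, l01i, l23i, rnd[3], m1c, bc, m1i, bi);
    // stage 3: middle pair; its low output (position 2) is unused for T = 2
    cmp_exch #(CW, IW) s30 (m0c, m1c, m0i, m1i, rnd[4], t1c, xc, t1i, xi);
    assign sel0 = t0i;
    assign sel1 = t1i;
  endmodule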
2.5 Stream Table: multiple requests, data reuse
Figure 5: Table structure and data paths to streams.
In a multi-stream accelerator, there are usually multiple
pending requests. Therefore, we need to implement a table
similar to the MSHR (Miss Status Holding Register) of non-
blocking caches, with streams, rather than registers, as des-
tinations. Moreover, it can often happen that two streams
miss almost simultaneously on the same data. Consider for
instance typical references such as A(i,j),A(i,j+1). Rather
than issuing two misses, an MSHR would typically record
the second request as a hit on the pending miss. We proceed the
same way, and add a stream mask and a pending bit to
the Stream Table. Whenever a stream hits on a pending
miss, the corresponding stream bit is set in the stream mask.
When the requested data arrives, it is simultaneously writ-
ten back to all target streams. Note that writing the same
data back to multiple streams simultaneously requires no
additional logic or datapath since a bus must anyway con-
nect the Stream Table to all the streams, as shown in Figure 5.
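The behavioral sketch below (our field and port names; a single lookup port is shown, and the data bank and a safe replacement policy are omitted) illustrates the hit-on-pending-miss mechanism: a matching pending entry simply accumulates the requester's bit, and the reload is broadcast to every stream in the mask.

  module stream_table #(parameter N = 16, TAGW = 26, S = 12) (
    input                  clk, rst,
    // one lookup port shown; the evaluated design accommodates T = 4
    input                  lkp_valid,
    input  [TAGW-1:0]      lkp_tag,
    input  [S-1:0]         lkp_stream,    // one-hot requesting stream
    output                 lkp_hit_valid, // data already present (reuse)
    // miss path to memory
    output reg             miss_valid,
    output reg [TAGW-1:0]  miss_tag,
    // reload from memory
    input                  rld_valid,
    input  [$clog2(N)-1:0] rld_slot,
    output [S-1:0]         rld_streams    // broadcast reload to waiting streams
  );
    reg [TAGW-1:0]      tags  [0:N-1];
    reg [S-1:0]         smask [0:N-1];
    reg [N-1:0]         valid, pending;
    reg [$clog2(N)-1:0] alloc;            // rotating allocation; a real design
                                          // must avoid evicting pending entries
    reg [N-1:0] match;
    integer k, j;
    always @*                             // combinational CAM match
      for (k = 0; k < N; k = k + 1)
        match[k] = (valid[k] | pending[k]) && (tags[k] == lkp_tag);

    assign lkp_hit_valid = lkp_valid && |(match & valid);
    assign rld_streams   = smask[rld_slot];

    always @(posedge clk) begin
      if (rst) begin
        valid <= 0; pending <= 0; alloc <= 0; miss_valid <= 1'b0;
      end else begin
        miss_valid <= 1'b0;
        if (rld_valid) begin              // reload: the entry turns valid
          valid[rld_slot]   <= 1'b1;
          pending[rld_slot] <= 1'b0;
        end
        if (lkp_valid) begin
          if (!(|match)) begin            // miss: allocate and request memory
            tags[alloc]    <= lkp_tag;
            smask[alloc]   <= lkp_stream;
            pending[alloc] <= 1'b1;
            valid[alloc]   <= 1'b0;
            miss_valid     <= 1'b1;
            miss_tag       <= lkp_tag;
            alloc          <= alloc + 1'b1;
          end else                        // hit (pending or valid): merge masks
            for (j = 0; j < N; j = j + 1)
              if (match[j]) smask[j] <= smask[j] | lkp_stream;
        end
      end
    end
  endmodule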
Beyond hits on pending misses, the Stream Table can also
fulfill another classic role of exploiting temporal reuse, es-
pecially reuse across streams. For instance, with A(i,j),
A(i+1,j), the reuse distance is too large for a hit on pend-
ing miss to occur, but if the matrix dimension of A is small
enough, A(i,j) may still reside in the Stream Table when
needed. For that reason, we also allow the table to behave
like a cache, and store data along with each entry. That
naturally increases the table size; however, our goal is not to
achieve the same reuse capabilities as traditional cache hier-
archies; our focus on streaming data to the circuit makes it
unnecessary for many accesses. Our main goal is to take ad-
vantage of the frequent short-distance temporal reuse oppor-
tunities [20], and that requires only a modestly-sized table,
as later shown in Section 4.
Finally, small loops are frequently found in many codes
(e.g., SpecInt, signal processing,...), which might result in
the same address being requested for several entries of the
same stream. Even if the same data appear in two entries
in the same stream, these entries should not be merged as
data must be delivered in order. Therefore, the table must
be able to deliver the data to multiple stream entries. For
that purpose, we add an entry mask to the Stream Table,
besides the stream mask, see Figure 5. Both masks (stream
and entry) account for S ×E bits assuming all streams have
the same number of entries E. This entry mask also reduces
stream cost and speeds up stream reload by saving stream-
level tag checks upon reload.
While reloading several distinct data words into a buffer requires
multiple ports, reloading the same data several times into a
buffer requires no additional support, the same as for writing
the same data to multiple streams. The write port is already
connected to all stream entries; the only modification is to
allow the simultaneous activation of multiple write signals,
see Figure 5.
2.6 Write streams
Write streams play the same role as write buffers in stan-
dard caches, by avoiding stalling the processor or delaying miss
requests. However, we choose to implement one write stream
per circuit output, instead of a common write stream in or-
der to avoid a costly multi-ported stream buffer, and to in-
crease coalescing opportunities (the ability to merge multiple
consecutive words in a single write request).
A write stream is composed of two parts: a simple B-
word, one-word-wide fifo which buffers incoming write requests,
and a B-word latch which also plays the role of a coalescing
buffer. The write is sent to memory when a word from the
fifo cannot be written in the buffer, because a word at that
position is already written, or because the buffer is full. For
that reason, the write latch also includes a word bit mask,
just like the entries of read streams. The write to memory
is delayed until the word fifo is at least half full, in order
to strike the right balance between coalescing opportunities and
the risk of stalling the stream. Note that writes can be
delayed by misses, hence the half-full threshold precaution.
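A possible sketch of the coalescing half (our naming; the word fifo itself and the end-of-stream flush are omitted, and memory is assumed to always accept the flush). Note that a full latch needs no special case: any further same-block word collides with an occupied position, and a different-block word fails the tag test, so either triggers the flush.

  module write_coalesce #(parameter B = 8, DW = 32, AW = 32, OFF = $clog2(B)) (
    input                   clk, rst,
    // head of the word fifo
    input                   w_valid,
    input  [AW-1:0]         w_addr,       // word-granular address
    input  [DW-1:0]         w_data,
    input                   fifo_half,    // fifo at least half full: drain
    output                  w_pop,        // consume the fifo head
    // coalesced write to memory
    output reg              flush_valid,
    output reg [AW-OFF-1:0] flush_tag,
    output reg [B-1:0]      flush_mask,
    output reg [B*DW-1:0]   flush_data
  );
    reg [AW-OFF-1:0] tag;
    reg [B-1:0]      mask;                // word bit mask, as in read streams
    reg [B*DW-1:0]   latch;
    wire [OFF-1:0]   woff       = w_addr[OFF-1:0];
    wire             same_block = (mask == 0) || (w_addr[AW-1:OFF] == tag);
    // the head word cannot be merged: position taken or different block
    wire             must_flush = w_valid && fifo_half &&
                                  (!same_block || mask[woff]);
    assign w_pop = w_valid && fifo_half && !must_flush;

    always @(posedge clk) begin
      if (rst) begin
        mask <= 0; flush_valid <= 1'b0;
      end else begin
        flush_valid <= 1'b0;
        if (must_flush) begin             // send the latch; merge next cycle
          flush_valid <= 1'b1;
          flush_tag   <= tag;
          flush_mask  <= mask;
          flush_data  <= latch;
          mask        <= 0;
        end else if (w_pop) begin         // merge the head word into the latch
          latch[woff*DW +: DW] <= w_data;
          mask[woff]           <= 1'b1;
          tag                  <= w_addr[AW-1:OFF];
        end
      end
    end
  endmodule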
There is a handshaking protocol between write streams
and the memory interface, similar to the one used for read
streams. In addition to arbitrating among multiple read streams,
the memory interface must also arbitrate between read and
write streams. By replacing the “number of filled words” for
read streams with “fifo size − number of words in the word
fifo” for write streams, we can uniformly consider read
and write streams in the load balancing strategy. Indeed,
this criterion is the dual of the “number of filled words” cri-
terion: if a word fifo is full, the write stream should be given
the utmost priority since the next write will stall the circuit,
much like having no filled words in a read stream will stall the circuit.
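Under this convention the arbitration metric is uniform, with smaller meaning more urgent; a sketch (hypothetical names) of the per-stream count fed to the sorter of Figure 4:

  module arb_count #(parameter CW = 6, FIFO_WORDS = 16) (
    input           is_write,
    input  [CW-1:0] filled_words, // read stream: words ready for the circuit
    input  [CW-1:0] fifo_words,   // write stream: words waiting in the fifo
    output [CW-1:0] count         // fewer = more urgent, reads and writes alike
  );
    // a full write fifo yields 0, i.e., utmost priority,
    // mirroring an empty read stream
    assign count = is_write ? (FIFO_WORDS - fifo_words) : filled_words;
  endmodule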
The Stream Table operates in a write-through mode, with
one additional tag being used for write streams. Note, though,
that there is no hardware support for memory disambigua-
tion; it is considered part of the circuit control task. Most
state-of-the-art loop-based circuit generation approaches [13,
24] still do not automatically handle memory disambiguation.
3. EXPERIMENTAL FRAMEWORK
Simulated Architecture. Our architecture is shown in
Figure 6(a) and consists of an IBM PowerPC 405 [17] core,
a simple 32-bit embedded RISC core with a 5-stage pipeline
and 32 registers, but no floating-point unit.
We consider a regular 90nm version running at a frequency
of 800MHz, with a 20-cycle memory latency (corresponding
to an L2 access). To simulate this architecture, we used the
UNISIM [4] simulation infrastructure.
(b) Memory hierarchy parameters
Figure 6: Simulated architecture.
The memory sub-system is composed of two write-back L1
data and instruction caches and a main memory. Their pa-
rameters are described in Figure 6(b).
Circuit synthesis. As mentioned before, automatically
generating a hardware representation from source code has
been previously addressed in research and in existing industrial
tools [3, 11]. We developed a tool chain which automatically
creates loop-based multi-purpose accelerators down to the
Verilog HDL. We synthesize all circuits using the Synopsys
Design Compiler [1] and the TSMC 90nm standard cell library, with
the highest mapping effort of the design compiler (-map_effort
high -area_effort high options).
We use a 9-task accelerator corresponding to the UTDSP
benchmarks of Table 1, as a driving example; the accelera-
tor only includes 32-bit operators for now. To support the
fixed-point precision arithmetic which is frequently used in
embedded systems for cost and power reasons, we modified
all the benchmarks and we chose 12-bit precision for all ex-
periments. The accelerator has been generated using the
compound circuit process proposed by Yehia et al. [24]; a
similar accelerator could also be obtained using the process
proposed by Fan et al. [15]. This compound circuit can
be configured to execute each of the individual tasks while
having a cost significantly smaller than the union of the 9
circuits; the accelerator is configured for a task through the
processor-to-accelerator interface. At any time, only a single
task is executed on the accelerator. The number of accel-
erator operators of each type (adders, multipliers, registers,
muxes, read and write streams) is detailed in Table 2; the
32-bit operators are used for computations, while 1-bit op-
erators are usually used for control.
Discrete Cosine Transform
1024-point Complex FFT
256-tap FIR filter
Image enhancement using histogram
equalization (gray level mapping loop)
4-cascaded IIR biquad filter processing
32nd-order Normalized Lattice filter
32-tap LMS adaptive FIR filter
Table 1: Benchmark description.
Table 2: Operators of the compound circuit.
4. EVALUATION
We now want to show that it is possible to design a mem-
ory interface for our example multi-purpose multi-stream ac-
celerator, using a combination of streams and a table, which
achieves high performance at low area and power costs. We
first seek the maximal performance that can be achieved
with the best possible configuration of the memory interface
(stream and table characteristics), independently of cost and
power. Then, we minimize area and power by optimizing the
different memory interface characteristics without degrading performance.
The performance is defined as the average speedup of each
individual task over the same task executed on the compan-
ion core. While many characteristics come into play (e.g.,
the ability to simultaneously update or not several streams
or entries masks, the arbitration policy for selecting streams
which can access the table, etc), we focus on the two char-
acteristics which will most affect execution time, cost and
power: the number of stream entries and the number of table entries.
The most appropriate number of stream entries is highly
correlated to both the latency and the stride of the mem-
ory reference mapped to the stream. As mentioned before,
words have to be pre-allocated in the streams upon request,
and since up to one word can be allocated per cycle, the op-
timal number of words in a stream depends on the memory
latency. The stride further complicates this criterion as not
all words within an entry may be useful (e.g., 1 useful word
per entry for an 8-word entry and a stride ≥ 8): a stream's
optimal performance is reached when the number of useful
words in a stream is greater than or equal to the memory re-
quest delay. Several issues can further affect the optimal
number of stream entries: the task may not always con-
sume one stream word per cycle, or delays incurred by other
streams may relieve the pressure on a stream; conversely,
a stream may not be able to immediately issue a miss re-
quest due to the single memory port (other system issues can
naturally have an impact: the variable latency of SDRAM
operations, or the presence of an interconnect between the
accelerator and the memory, etc).
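As a back-of-the-envelope illustration of this criterion under the parameters of Section 3 (our own arithmetic): with a 20-cycle memory latency and a circuit consuming up to one word per cycle, a stride-1 stream needs at least 20 useful words in flight, i.e., 3 full 8-word entries plus the entry being read out; this is consistent with the 4-entry configuration found to be sufficient below.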
Figure 7: Speedup vs. area for various memory interface
configurations (#streams entries, #table entries).
Figure 8: Speedup vs. power for various memory interface
configurations (#streams entries, #table entries).
In Figures 7 and 8, we show the average performance for
all possible (# stream entries, # table entries) pairs, assum-
ing 8-word stream entries, and vary the number of stream
entries from 2 to 16, and the number of table entries from
1 (equivalent to no table) to 64; the speedup is measured
against respectively the area and power cost of the whole
memory interface using the corresponding streams and ta-
ble configurations. On average, increasing the number of
stream entries beyond 4 has little impact on performance.
And, with 4-entry streams, increasing the number of table
entries from 16 to 32 gives a 3.7% increase in performance
at the cost of an 18.9% and 12.9% increase in area and power
consumption respectively. Therefore, we consider 4-entry
streams and a 16-entry fully-associative table as achieving
near-maximal performance, and call this configuration near-optimal.
A small table size is sufficient because the streams are in
charge of hiding the latency of long-distance temporal reuse,
while the role of the table is to avoid short-distance temporal
reuse, either due to small loops or reuse across streams, e.g.,
A(i),A(i + 1) types of references. The fraction of hits on
valid data per cycle, in Figure 9, quantifies the significant
short-reuse benefits of the table.
Figure 9: Average number of table events per cycle.
The table contains several sub-banks: the tags (includ-
ing the pending/valid bits), the streams masks, the entries
masks, the data. Depending on the required number of si-
multaneous accesses, it may be necessary to multi-port them
if concurrent accesses are frequent. In Figure 10, we plot the
distribution of the number of simultaneous streams requests
to the table per cycle. While the average number of simulta-
neous requests per cycle is 0.75, the distribution is actually
very irregular, with 2 or even 4 requests per cycle being fre-
quent for some tasks with many active streams, e.g., fft
or iir. As a result, we design the table so as to accommo-
date four stream requests per cycle. For that purpose, we
have four comparators per table entry. On the other hand,
we only need two output ports per data bank, as there are
rarely more than two hits per cycle. These output ports are
implemented as two arrays of tri-states to buses connecting
the banks to the streams. The area and power cost of Fig-
ures 7 and 8 already factor in these four tag ports and two data output ports.
Finally, the low average number of simultaneous updates
on either the streams/entries banks (see updates_streams)
in case of hits on pending misses, or the data bank in case of
multiple writes, or simultaneous memory reload and write(s)
(see updates_data) in Figure 9 calls for only single-ported
streams/entries and data banks.
Besides latency tolerance, the role of the table is also to
increase the apparent memory bandwidth. The sometimes
high number of simultaneous stream requests already men-
tioned and highlighted by Figure 10, as well as the high num-
ber of hits on valid data in Figure 9, shows that the table
fulfills a significant role in improving the apparent memory bandwidth.
Figure 10: Distribution of the number of stream requests to the table per cycle.
Memory interface configuration for other accelerators.
While the streams and table configurations are specific
to our example accelerator, the process for finding the best
stream and table configurations is generic. Based on the
reference properties of each task, and the characteristics of
the table accesses, it is possible to similarly dimension the
streams and table for each accelerator.
5. RELATED WORK
With increasingly stringent power and performance con-
straints, throughput computing is becoming increasingly pop-
ular. Designs vary in flexibility, performance and power.
They range from homogeneous multi-cores or GPU archi-
tectures [21] to hardware frameworks for more specialized
accelerators such as ARM OptimoDE [12] or Tensilica [2].
However, the best performance/power ratio is achieved
with ASICs combined with DMAs, though their long de-
velopment time and narrow application scope have long re-
stricted them to application-specific hardware. The recent
advent of efficient loop-based accelerator generation approaches,
as proposed by Synfora [11], and even more recently of multi-
purpose loop-based accelerators, as proposed by Clark et al.
[13], Fan et al. [15] or Yehia et al. [24], makes it possi-
ble to address the design time and flexibility limitations of
ASICs+DMAs while retaining most of their power and per-
formance assets. As a result, they increasingly emerge as
a design of choice for both application-specific and general-
purpose heterogeneous multi-cores.
However, these multi-purpose multi-stream accelerators
are still in their infancy, and their interaction with memory has
not yet been studied in great detail. Naturally, there are
many ASIC designs which require multiple reference streams,
but few need to accommodate a wide range of reference
patterns or complex load balancing among streams. Most
accelerators used in the embedded domain rely on tightly
coupled scratchpads [6] to bring data closer to the accelera-
tor through DMA transfers. Several management techniques
to efficiently use these scratchpads and program the DMAs
have been proposed [16, 23]. Still, these scratchpads usually
have to be explicitly managed by the programmer [23, 9].
DMAs for managing multiple streams have been proposed
[14], including high-performance streaming processors [19],
though most of these designs usually assume the streams are
independent, with no cross-reference or temporal reuse.
Finally, there is naturally a large body of work on exploit-
ing locality in high-performance memory systems. Some
of the prefetching mechanisms are also designed to handle
multiple concurrent streams of references [5, 22], but they are
usually more sophisticated and also more costly than the
streams proposed in our study, and more importantly,
they are speculative mechanisms, and are thus not com-
patible with accelerators. However, they could still act as
complementary mechanisms, in tandem with the Stream Ta-
ble, for instance.
6. CONCLUSIONS
In this study we investigate the design of a memory in-
terface for multi-purpose multi-stream accelerators. While
streams for long stride-1 references are simple to design, the
detailed design of a stream buffer capable of handling com-
plex memory references (short loops, multi-stride or irreg-
ular references), while still ensuring in-order word delivery
and issuing one word per cycle to the accelerator, raises non-
trivial design issues. We also show the potential synergy
between such streams and a Stream Table which captures
most short-distance temporal reuses and increases apparent
bandwidth, with only a fraction of the size of traditional
caches. The memory interface composed of such streams
and a Stream Table forms a generic template for efficiently
interfacing multi-purpose loop-based accelerators with mem-
ory, and a necessary building block for generalizing the use
of such accelerators within heterogeneous multi-cores.
7. REFERENCES
[1] Synopsys Design Compiler. http://www.synopsys.com.
[2] Tensilica. http://www.tensilica.com/.
[3] Designing high-performance DSP hardware using Catapult C synthesis and the Altera accelerated libraries. Mentor Graphics Technical Library, October.
[4] David August, Jonathan Chang, Sylvain Girbal, Daniel Gracia-Perez, Gilles Mouchard, David A. Penry, Olivier Temam, and Neil Vachharajani. UNISIM: An open simulation environment and library for complex architecture design and collaborative development. IEEE Comput. Archit. Lett., 6(2):45–48, 2007.
[5] Jean-Loup Baer and Tien-Fu Chen. An effective on-chip preloading scheme to reduce data access penalty. In Supercomputing '91: Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, pages 176–186, New York, NY, USA, 1991. ACM.
[6] Rajeshwari Banakar, Stefan Steinke, Bo-Sik Lee, M. Balakrishnan, and Peter Marwedel. Scratchpad memory: Design alternative for cache on-chip memory in embedded systems. In CODES '02: Proceedings of the Tenth International Symposium on Hardware/Software Codesign, pages 73–78, New York, NY, USA, 2002. ACM.
[7] Kenneth E. Batcher. Sorting networks and their applications. In AFIPS Spring Joint Computing Conference, pages 307–314, 1968.
[8] M. J. Bellido, A. J. Acosta, M. Valencia, A. Barriga, and J. L. Huertas. A simple binary random number generator: New approaches for CMOS VLSI. In Proceedings of the 35th Midwest Symposium on Circuits and Systems, pages 127–129 vol. 1, August 1992.
[9] Partha Biswas, Nikil D. Dutt, Laura Pozzi, and Paolo Ienne. Introduction of architecturally visible storage in instruction set extensions. IEEE Trans. on CAD of Integrated Circuits and Systems, 26(3):435–446, 2007.
[10] Koushik Chakraborty, Philip M. Wells, and Gurindar S. Sohi. Over-provisioned multicore processor, September 2009. Patent application.
[11] Nitin Chawla, Roberto Guizzetti, Yan Meroth, Arnaud Deleule, Vishal Gupta, Vinod Kathail, and Pascal Urard. Multimedia application specific engine design using high level synthesis.
[12] Nathan Clark et al. OptimoDE: Programmable accelerator engines through retargetable customization. In Proc. of Hot Chips 16, 2004.
[13] Nathan Clark, Amir Hormati, and Scott Mahlke. VEAL: Virtualized execution accelerator for loops. In ISCA '08: Proceedings of the 35th International Symposium on Computer Architecture, pages 389–400, Washington, DC, USA, 2008. IEEE Computer Society.
[14] Dave Comisky, Sanjive Agarwala, and Charles Fuoco. A scalable high-performance DMA architecture for DSP applications. In Proceedings of the International Conference on Computer Design, page 414, 2000.
[15] Kevin Fan, Manjunath Kudlur, Ganesh S. Dasika, and Scott A. Mahlke. Bridging the computation gap between programmable processors and hardwired accelerators. In 15th International Conference on High-Performance Computer Architecture (HPCA-15 2009), 14-18 February 2009, Raleigh, North Carolina, USA, pages 313–322. IEEE Computer Society, 2009.
[16] Francesco Poletti, Paul Marchal, David Atienza, Luca Benini, Francky Catthoor, and Jose M. Mendias. An integrated hardware/software approach for run-time scratchpad management. In DAC '04: Proceedings of the 41st Annual Design Automation Conference, pages 238–243, New York, NY, USA, 2004. ACM.
[17] IBM. PowerPC 405 CPU Core. September 2006.
[18] Norman P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. SIGARCH Comput. Archit. News, 18(3a):364–373, 1990.
[19] Brucek Khailany, William J. Dally, Scott Rixner, Ujval J. Kapasi, John D. Owens, and Brian Towles. Exploring the VLSI scalability of stream processors. In International Conference on High-Performance Computer Architecture, Anaheim, California, pages 153–164, February 2003.
[20] Kathryn S. McKinley and Olivier Temam. A quantitative analysis of loop nest locality. SIGPLAN Not., 31(9):94–104, 1996.
[21] Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep Dubey, Stephen Junkins, Adam Lake, Jeremy Sugerman, Robert Cavin, Roger Espasa, Ed Grochowski, Toni Juan, and Pat Hanrahan. Larrabee: A many-core x86 architecture for visual computing. ACM Trans. Graph., 27(3):1–15, 2008.
[22] Stephen Somogyi, Thomas F. Wenisch, Anastasia Ailamaki, and Babak Falsafi. Spatio-temporal memory streaming. In 36th International Symposium on Computer Architecture (ISCA 2009), June 20-24, 2009, Austin, TX, USA, pages 69–80, 2009.
[23] S. Steinke, L. Wehmeyer, B. Lee, and P. Marwedel. Assigning program and data objects to scratchpad for energy reduction. In DATE '02: Proceedings of the Conference on Design, Automation and Test in Europe, page 409, Washington, DC, USA, 2002. IEEE Computer Society.
[24] Sami Yehia, Sylvain Girbal, Hugues Berry, and Olivier Temam. Reconciling specialization and flexibility through compound circuits. In 15th International Conference on High-Performance Computer Architecture (HPCA-15 2009), 14-18 February 2009, Raleigh, North Carolina, USA, pages 277–288. IEEE Computer Society, 2009.