
SHRIMP: Efficient Instruction Delivery with Domain Wall Memory

Joonas Multanen, Pekka Jääskeläinen
Tampere University, Finland
Email: joonas.multanen@tuni.fi, pekka.jaaskelainen@tuni.fi

Asif Ali Khan, Fazal Hameed, Jeronimo Castrillon
Technische Universität Dresden, Germany
Email: asif_ali.khan@tu-dresden.de, fazal.hameed@tu-dresden.de, jeronimo.castrillon@tu-dresden.de
Abstract—Domain Wall Memory (DWM) is a promising emerging memory technology, but suffers from the expensive shifts needed to align memory locations with access ports. Previous work on DWM concentrates on data while, to the best of our knowledge, techniques specifically targeting instruction streams have not yet been studied. In this paper, we propose Shift-Reducing Instruction Memory Placement (SHRIMP), the first instruction placement strategy suited for DWM, accompanied by a supporting instruction fetch and memory architecture. The proposed approach reduces the number of shifts by 40% in the best case with a small memory overhead. In addition, SHRIMP achieves a best-case reduction of 23% in total cycle counts.
I. INTRODUCTION
It is estimated that the information and communication technology sector will consume up to 20% of the global electricity production by 2025 [1]. This is due to the ever-increasing complexity of computational workloads and the era of the Internet of Things (IoT), which introduces billions of compute devices into novel contexts. While processing density and efficiency have followed Moore's law until recently, memory systems have not improved at a similar pace to provide adequate bandwidth and latency – a phenomenon known as the memory wall [2]. In addition to limiting processing speed, DRAM power consumption in contemporary computing systems often accounts for as much as half of the total consumption [3].
The scaling difficulties of traditional memory technologies have motivated research efforts on different emerging memory technologies, such as Phase-Change Memory (PCM), Spin-Transfer Torque RAM (STT-RAM) and Resistive RAM (ReRAM). Although a clear winner among the candidates is yet to be determined, these memories are expected to provide major improvements in power consumption, density and speed while often being non-volatile, reducing the need for a separate persistent backing store in many use cases.
An emerging technology that has received wide interest thanks to its promises of extreme density improvement and power reduction is domain wall memory (DWM) [4], [5]. Its efficiency is achieved by a structure that allows costly access ports to be shared by multiple memory locations, instead of requiring a separate access transistor for each memory cell. DWM uses thin nanotapes to store data in magnetic domains, which are moved by passing a current along the tape. As the tapes are small compared to the access ports and can be 3D-fabricated on top of them, DWM features high area efficiency. However, the higher density comes at the cost of the additional energy and time required to shift the domains to seek the desired address.
Memory access patterns have a major impact on the number of shifts required; consecutive accesses require only a single shift in between. In conjunction with access patterns, careful consideration of design parameters, such as the number of access ports and the number of domains sharing a port, is required to obtain optimal returns from DWMs.
Previous work [6]–[11] has proposed hardware architectures and placement strategies for data streams. What has received less attention is the fact that in software-programmable processors, instruction streams account for a large share of the overall memory accesses. In comparison to data, instruction streams have a mostly compile-time-analyzable structure, presenting an interesting target for offline optimizations that reduce costly shifting on DWMs.
This paper proposes the first instruction-optimized place-
ment technique for DWM. We show how to reduce the shifting
penalty and reduce total cycle counts by exploiting the fact that
instructions in program basic blocks are fetched in order from
the memory hierarchy. Concrete contributions are:
- An instruction placement method optimized for DWM technology.
- An accompanying hardware design for the DWM and the instruction fetch unit.
We evaluate our proposed approach, shift-reducing instruction memory placement (SHRIMP), with 12 CHStone [12] benchmarks using the RISC-V [13] instruction set architecture and the Spike simulator. Compared to a linear baseline placement, SHRIMP reduces the number of shifts on average by 40% in the best case, with a worst-case average overhead in memory usage of 2.5%. The total cycle count averaged over the 12 benchmarks is reduced by 23%.
II. DOMAIN WALL MEMORY
Domain wall memory, also called racetrack memory, is a
non-volatile technology, where the spin of electrons is used
to describe logical bit values. Fig. 1 illustrates the structure
of a DWM nanotape and its access ports. Different spins are
contained within domains, separated by notches in the tape. A
number of tapes with their access ports are typically clustered
Fig. 1: Horizontal and vertical configurations of DWM.
together and organized as DWM block clusters (DBCs) [6].
The whole DBC is activated simultaneously, so that all tapes
are read, written or shifted simultaneously.
Introducing a current from one end of a nanotape to the
other shifts the domains, with the electron flow determining
the shift direction. By shifting the domains, each access
port consisting of CMOS transistors can be used to access
multiple domains, which explains the extreme density of the
DWM technology. The area of a DWM consists mostly of the access transistors [6]; the trade-off is the shifting delay and the additional energy required to access the domains.
As shifting a domain over the tape end is destructive,
overhead domains are used in one or both ends of tapes to
avoid data loss when shifting bits. In this paper, we refer to the number of accessible domains as the effective number of domains.
If density is the most important requirement, one access
port attached to a long tape can be used. The maximum
practical length, however, is determined by the delays and
resulting execution latencies incurred from shifting. Previous
work proposes multiple access ports per tape, so that the
number of domains accessed through each access port is
relatively low [14]. This keeps the average number of shifts
low, while still sharing shifting circuitry for the entire tape.
Read-only ports are smaller than write or read-write ports, as
more current is required to write a value to a domain, requiring
a larger transistor.
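The shift cost described above can be made concrete with a small sketch. This is our illustration, not code from the paper: it counts shifts under a lazy policy (the tape moves only when the requested domain is not under its port), assuming each port statically serves a contiguous segment of the tape; the function name and segment model are our own.

```python
def count_shifts(accesses, domains_per_port):
    """Count lazy-policy shifts on one tape for a sequence of domain indices.

    Each access port serves `domains_per_port` consecutive domains. All
    ports move together with the tape, so a single displacement value
    models the alignment of every port with its segment.
    """
    offset = 0   # current tape displacement from the rest position
    shifts = 0
    for d in accesses:
        target = d % domains_per_port   # displacement aligning d with its port
        shifts += abs(target - offset)  # lazy: shift only as far as needed
        offset = target
    return shifts

# A sequential walk over 8 domains: one port vs. two ports.
print(count_shifts(range(8), 8))   # one port: 7 shifts, one per consecutive access
print(count_shifts(range(8), 4))   # two ports: 9, rewinding to the 2nd segment costs 3
# Reading the second segment in reverse order instead returns the tape
# toward its rest position and avoids the rewind:
print(count_shifts([0, 1, 2, 3, 7, 6, 5, 4], 4))   # 6 shifts
```

The last line hints at why a back-and-forth access order, as exploited later by SHRIMP, can beat a strictly ascending one even with the same number of ports.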
III. THE SHRIMP APPROACH
The proposed SHRIMP approach utilizes an instruction
placement based on static control flow graph (CFG) analysis
together with supporting hardware circuitry. The compilation
flow and overall structure of the target DWM-based archi-
tecture are shown in Fig. 2. As shown in the figure, the DWM consists of DBCs mapped consecutively in memory, each DBC consisting of m tapes, with a single read-write port and a read port each, and n effective domains.
The instruction placement is performed before assembling
and linking a program, using a compiler framework such as
GCC. A CFG is first generated from intermediate represen-
tation of the code for each function. Then, function basic
blocks (BBs) are split into two halves and remapped to start
from addresses aligned with the DWM access ports with
instructions of the latter BB half reversed. Unconditional
branches are inserted in the gaps left between the BB halves
and to replace fallthroughs between the remapped BBs. Fi-
nally, the modified code is assembled into an executable.
Fig. 2: Programming flow with SHRIMP and hardware support. Contributions of this work highlighted.
A comparison between linear and SHRIMP placements of a simple if-then-else structure is presented in Fig. 3.
During execution, the first half of a BB is read normally
from the first access port of a DBC by incrementing the
program counter (PC) and, thus, shifting the DBC tapes in one
direction. A jump inserted at the end of the first half switches execution to the latter half. As the target address resides in
the latter half of the DBC, the fetch unit starts decrementing
the PC, shifting the DBC tapes towards their starting position.
This reduces the total number of shifts and branching delay in
repeatedly executed program BBs, as useful instructions are
fetched while shifting in both directions and executing the BB
again requires little or no shifting to start.
An example of execution with SHRIMP is presented in
Fig. 4. Here, a BB with four instructions is split into a
single DBC. For clarity, individual tapes are not pictured. Each
column represents the same DBC, with clock cycles advancing
from left to right. Light colour represents the accessible
domains and the instruction read at each cycle is highlighted.
First, instructions a0 and a1 are read sequentially from the top access port. As a0 is initially located at the access port, no shifts are required. Next, the DBC tapes are shifted once to reach a1 and, after that, once again to reach the jump J, targeting a2. The execution continues from the bottom access port. No shifting is required for a2, as it is aligned with the access port. For a3 and the jump out of the DBC, two shifts in total are required.
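The shift counts of this example can be checked with a toy model. This is our sketch, not the authors' simulator: it assumes two ports per tape, the bottom port physically n/2 domains below the top one, lazy shifting, and the Fig. 4 address layout for n = 6 (a0, a1, J in the upper half; J, a3, a2 in the lower half).

```python
def dbc_shift_trace(fetch_addresses, n):
    """Per-fetch shift counts for addresses within one DBC of effective
    length n. A single displacement s aligns both ports: the top port
    sits over domain s, the bottom port over domain n/2 + s.
    """
    s = 0
    trace = []
    for a in fetch_addresses:
        target = a if a < n // 2 else a - n // 2  # displacement aligning a
        trace.append(abs(target - s))             # lazy shifting cost
        s = target
    return trace, s

# Fetch order of the example: a0(0), a1(1), J(2), a2(5), a3(4), J(3).
trace, final = dbc_shift_trace([0, 1, 2, 5, 4, 3], 6)
print(trace, sum(trace), final)
```

The model reproduces the behaviour described above: a0 and a2 need no shifts, four shifts are spent in total, and the final displacement is zero, so re-executing the BB starts without any shifting.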
The instruction placement pass and the associated hardware are described in more detail in the following subsections.
Fig. 3: If-then-else structure using linear placement and SHRIMP placement. (a) CFG and corresponding linear placement. (b) SHRIMP placement. Inserted branches highlighted.
A. Instruction Placement
The proposed placement algorithm used in SHRIMP is pre-
sented in Algorithm 1. On Lines 2–4, the algorithm identifies
program BBs and constructs CFGs for each function. During
CFG construction, an implicit optimization is done: function
calls are treated as instructions not affecting the control flow.
By later placing the function caller and callee in separate
DBCs, the caller’s DBC is left in ideal position after the
function returns. Resuming execution at the return address
only requires a single shift.
On Lines 7–13, BBs that do not fit in a single DBC are split every instructionsPerDBC/2 instructions. The order of instructions placed to the latter halves of DBCs is reversed, as the underlying hardware assumes the opposite shift direction for them. If a non-branching instruction at the end of either DBC half is reached, the hardware performs an implicit jump of instructionsPerDBC/2 to reach the next instruction.
Next, the remainders of each BB are categorized as executed
once, or able to be executed multiple times. Loops, functions
Fig. 4: Execution example with SHRIMP.
called from inside loops, and functions called from multiple locations in the code are placed into the latter category. To handle BBs with either an even or odd number of instructions, for a BB of k instructions the first ⌈k/2⌉ instructions are assigned to the first half of a DBC, and the remaining ⌊k/2⌋ instructions to the second half. On Lines 14–19, BBs that are able to execute multiple times are split, and each half is placed to align with an individual port in a DBC. The next free address is set to the start of the next available DBC. On Lines 20–23, linear placement is used for the remainders of BBs that can only be executed once, as it is not beneficial to split them. Again, the order of instructions placed to lower halves of DBCs is reversed. As splitting a BB requires inserting two jumps, it also incurs two additional shifts to access them. These are only partly avoided in BBs spanning multiple DBCs, as the fallthrough implementation in SHRIMP is based on the next address to be read. To avoid a negative impact on the number of shifts and the execution time, a threshold for the minimum length of a BB to be split is introduced. Without this threshold, shifting to the jump instructions in short BBs would increase the number of shifts.
Algorithm 1 Instruction Placement Algorithm
 1: nextFreeAddress = programStartAddress
 2: for all function in functions do
 3:     build CFGs
 4: end for
 5: for all CFG in CFGs do
 6:     for all BB in CFG do
 7:         for i in ⌊numBBInstructions / instructionsPerDBC⌋ do
 8:             split BB at index instructionsPerDBC · (i + 1/2)
 9:             place first half to nextFreeAddress
10:             nextFreeAddress += instructionsPerDBC/2
11:             place second half to nextFreeAddress in reverse order
12:             nextFreeAddress = nextFreeDBCAddress
13:         end for
14:         if BB can be executed multiple times and
            numBBInstructions − (numBBInstructions % instructionsPerDBC)
            > splittingThreshold then
15:             split BB at index numBBInstructions −
                (numBBInstructions % instructionsPerDBC)
16:             place first half to nextFreeAddress
17:             nextFreeAddress += instructionsPerDBC/2
18:             place second half to nextFreeAddress in reverse order
19:             nextFreeAddress = nextFreeDBCAddress
20:         else
21:             place numBBInstructions −
                (numBBInstructions % instructionsPerDBC) instructions
                with linear placement
22:             reverse order of instructions placed in lower half of DBC
23:             nextFreeAddress += numBBInstructions
24:         end if
25:     end for
26: end for
27: insert unconditional jumps between split BB halves
28: insert unconditional jumps to and from split BBs
29: fix branch targets
On Line 27, jumps are inserted between the BB halves. Handling fallthroughs (CFG edges without branches) to other BBs presents another problem. One solution would be to insert no-operation instructions (NOPs) between the relocated BBs and let the processor execute NOPs until the successor BB is reached. However, this clearly increases execution time and adds costly shifts. It can be more efficient to insert a jump after the last instruction of the fallthrough BB. As branching delay is architecture and microarchitecture dependent, we chose to always replace fallthroughs with jumps in the evaluated proof-of-concept implementation, on Line 28. Aligning BBs with DBC access ports breaks their sequentiality, leaving gaps which did not exist in the original program. Thus, jump addresses are updated on Line 29.
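The split-and-reverse step for a single repeatable BB can be sketched as follows. The helper, the jump labels and the DBC-image representation are our illustrative assumptions, not the authors' implementation; it mirrors the ⌈k/2⌉/⌊k/2⌉ split and the two inserted jumps described above.

```python
import math

def place_split_bb(instructions, dbc_size, successor="out"):
    """Lay out one repeatable basic block in a single DBC image.

    The first ceil(k/2) instructions are placed ascending from address 0
    (upper access port); the remaining floor(k/2) are placed in reverse
    order from the last address (lower port), so that decrementing the PC
    fetches them in program order. Two jumps bridge the halves.
    Assumes k >= 2 and k + 2 <= dbc_size.
    """
    k = len(instructions)
    first = instructions[: math.ceil(k / 2)]
    second = instructions[math.ceil(k / 2):]

    dbc = ["nop"] * dbc_size
    dbc[: len(first)] = first
    dbc[len(first)] = f"jmp {second[0]}"       # switch to the lower half
    for i, ins in enumerate(second):
        dbc[dbc_size - 1 - i] = ins            # reversed second half
    dbc[dbc_size - 1 - len(second)] = f"jmp {successor}"  # leave the DBC
    return dbc

# Reproduces the Fig. 4 layout: a0, a1, J | J, a3, a2
print(place_split_bb(["a0", "a1", "a2", "a3"], 6))
```

For the four-instruction example of Fig. 4 this yields the upper half in program order, the bridging jump, the exit jump, and the second half stored reversed at the bottom of the DBC.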
B. Hardware Support
The hardware designs for the DWM and the instruction fetch
unit are shown in Fig. 2. For the DWM, we use a scheme
where the memory peripheral circuits decode an address into
the corresponding DBC and domain, and calculate the required
shifting amount based on a head status array [6] holding the
current shifting position for each DBC.
DWM design decisions and modifications to the instruction
fetch logic for SHRIMP are described in the next subsections.
C. DWM design
As access ports dominate the area of a DBC over the
nanotapes, we chose one read-write and one read port to
maximize the amount of bits stored per area unit illustrated
in Fig. 2. As domains are always shifted in a back-and-forth
manner, only domains mapped to the first access ports require
an overhead area. Assuming nas the effective tape length,
additional n/21overhead domains are required per tape.
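The overhead sizing follows from the maximum tape displacement. A short derivation, under the two-port layout described above (ports at domain offsets 0 and n/2, back-and-forth lazy shifting):

```latex
% Two ports serve domains 0..n/2-1 and n/2..n-1 of an n-domain tape.
% Aligning domain a with its port requires displacement
%   s(a) = a \bmod (n/2), \quad 0 \le s(a) \le n/2 - 1.
% The maximum displacement bounds how many domains can be pushed past
% the tape end, so the overhead region needs n/2 - 1 domains:
s_{\max} = \max_{a}\left(a \bmod \tfrac{n}{2}\right) = \tfrac{n}{2} - 1
\quad\Rightarrow\quad
N_{\text{overhead}} = \tfrac{n}{2} - 1 .
```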
Shifting the DWM tapes requires ensuring correct position-
ing of tapes in relation to access ports. Previous work [6]
considers static and dynamic policies for head selection. The
static policy assigns a fixed access port for every domain. The
dynamic policy uses the head closest to the domain to be
read at run time. As the program CFGs provide predictability
for the instruction memory accesses, SHRIMP utilizes a static
head selection policy. Due to the sequential access patterns of
BBs, dynamically computing the access port to use on each
read operation seems excessive.
In addition to which access port to use, a policy for when to
shift the tape is required. Previous work [6] considers eager
and lazy policies. We adopt the lazy policy for SHRIMP, as
for a sequential access pattern, an eager shifting policy would
dramatically increase the number of shifts.
Regarding the head status array required by the lazy policy in conjunction with the static policy, for d accessible domains per access port, the maximum number of shifts is d − 1. Each entry of the array requires log2(d − 1) bits to store the offset amount. Whether a split BB has an even or odd number of instructions determines whether executing it leaves zero or one in the head status array. Thus, it would be tempting to use a single bit per DBC. However, the linear placement of single-execution BBs and those below the splitting threshold requires log2(d − 1) bits.
D. Instruction Fetch Logic
Switching the shifting direction between BB halves is achieved by incrementing or decrementing the PC. For a DBC with an effective tape length of n, address bit log2(n) can be directly used to control the direction. If the bit is zero, the memory location is in the range of the upper access port, and vice versa for the lower port. The proposed hardware uses this bit to control a mux, which chooses either −1 or +1 to be added to the PC.
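A minimal sketch of the direction selection. We assume word-indexed addresses within a DBC, where the deciding bit is the most significant bit of the within-DBC index (bit log2(n) − 1 for a power-of-two n); the paper's bit position log2(n) may correspond to a different addressing granularity, so the bit index here is our assumption.

```python
import math

def pc_step(index_in_dbc, n):
    """Return the PC increment for a fetch at the given within-DBC index.

    For an effective tape length n (power of two), the MSB of the index
    selects the DBC half: 0 -> upper half, fetch ascending (+1);
    1 -> lower half, fetch descending (-1). This models the mux choosing
    the value added to the PC.
    """
    half_bit = (index_in_dbc >> (int(math.log2(n)) - 1)) & 1
    return -1 if half_bit else 1

print(pc_step(0, 8), pc_step(5, 8))   # upper half: +1, lower half: -1
```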
IV. EVALUATION
For evaluation, we considered 12 CHStone benchmarks, with their characteristics listed in Table I. The exact RISC-V instruction set flavour was RV32I, with no variable-length instructions included. To produce instruction access traces and verify correct execution of the modified programs, we executed the binaries using the RISC-V instruction set simulator Spike.
As a baseline, we compiled and simulated all benchmarks without SHRIMP, assuming the memory layout from Section III-C. We used a linear DWM placement without reversing instructions in the latter halves of DBCs, assuming only one shifting direction as opposed to two in SHRIMP.
To measure the impact of SHRIMP on the number of shifts and execution cycles, we used the RTSim [15] simulator. This cycle-accurate simulation framework models the DWM shifting operations and simulates the access port positions. It takes instruction access traces generated by the Spike simulator and configuration parameters for the DWM device and architecture. The simulator produces the total number of shifting operations and the execution cycles for a given trace. For the evaluation, we assumed that reading an instruction and performing a shift each require one clock cycle.
We used the RISC-V GCC 7.2.0 compiler to produce the assembly input to the SHRIMP instruction placement pass. As the RISC-V compiler produces a rather large amount of identical initialization code for each application, we only took into account the actual application code, to better highlight differences between the benchmarks.
To prevent RISC-V GCC linker optimizations, where some load operations are converted from one to two instructions, we passed the --no-relax switch to the linker to maintain the alignment of BBs with DBC limits. Similarly, we inserted placeholder NOPs before call operations in the intermediate assembly, as these were converted into an auipc + jalr pair by the compiler. We removed the placeholders before compiling. To keep the remapped BB addresses during assembly, we inserted NOPs into the unused addresses left by SHRIMP.
A. Effect on Shifting Amount and Execution Cycles
The total shifts per benchmark are presented in Figs. 5
and 6. With effective tape length 8, shift reductions in all
benchmarks were similar, on average 40%. The effect of split
TABLE I: Benchmark characteristics.

           instructions able to            avg. instructions
           execute repeatedly (%)   loops  per BB
adpcm              88                 16     17
aes                91                 18     22
blowfish           94                 10     29
dfadd              45                  1     10
dfdiv              68                  3     11
dfmul              55                  1     11
dfsin              64                  4     11
gsm                41                 19     12
jpeg               84                 47     11
mips               91                  4      7
motion             76                 11      9
sha                79                 12     14
Fig. 5: Number of shifts across split thresholds from 4 to 64 compared to linear placement, tape length 8.
Fig. 6: Number of shifts across split thresholds from 4 to 64 compared to linear placement, tape length 64.
threshold was non-negligible only in motion, which contains many BBs with fewer than 4 instructions. Preventing their splitting with the threshold increased the total shifts. At tape length 64, increasing the splitting threshold increased shifts in all benchmarks except dfadd, dfdiv, dfmul and dfsin. These share a similar structure of a single loop with relatively many instructions. This led to a large BB filling multiple DBCs at all splitting thresholds, resulting in homogeneous shift counts.
Total cycle counts are presented in Figs. 7 and 8. As tape size increased, the reduction compared to linear placement decreased in most of the benchmarks. In jpeg, motion and sha, with small splitting thresholds, the reduction improved from tape size 8. Total cycle counts increased in mips, which had a combination of a relatively high share of instructions likely to execute multiple times, small BB sizes and only a few loops, as seen in Table I. This led to the inserted jumps between the BB halves negating the benefits of SHRIMP.
B. Instruction Overhead and Memory Utilization
We illustrate the increase in instructions fetched due to inserted jumps in Table II. As differences between tape lengths were small, we averaged the results over lengths 8 to 64. As the splitting threshold was increased, the overhead of fetched instructions decreased, since short BBs were not split and, therefore, no jumps were inserted for them. Figs. 5 and 6 show that there is a trade-off between a decreased fetch count and an increased shift count. At splitting threshold 64, the number of instructions fetched did not significantly differ from the baseline, as the placement resembled the linear placement with latter DBC halves reversed. mips and motion fetched significantly more instructions during their execution compared to the other benchmarks. This is related to the execution cycles in Fig. 8 and stems from the same reasons as discussed in Section IV-A.
As SHRIMP placement leaves some memory addresses un-
used, we evaluated the effective memory utilization, presented
in Figs. 9 and 10, with tape lengths 8 and 64. Tape sizes 8, 16,
32 and 64 were evaluated, with the utilization per benchmark
degrading quite linearly between sizes 8 and 64. With short
Fig. 7: Execution cycles across split thresholds from 4 to 64 compared to linear placement, tape length 8.
Fig. 8: Execution cycles across split thresholds from 4 to 64 compared to linear placement, tape length 64.
tape lengths, split BBs ended up filling the majority of DBCs, with only the last instructions of a BB requiring the insertion of jumps and NOPs. Increasing the tape length worsened the utilization, as short BBs still occupied a full DBC. As the split threshold increased, utilization improved due to fewer BBs being split and more ending up consecutively in memory. Compared to the total cycle counts in Figs. 7 and 8, there was still improvement over the linear placement in most benchmarks due to the reversed placement of SHRIMP.
C. Discussion
As instruction and data access patterns are inherently different, a different DWM structure for each seems optimal. This is natural for Harvard architecture devices with separate instruction and data buses and typically a cache or a scratchpad for each. However, in Von Neumann architectures, where instructions and data share the same bus, one memory is typically used for both. This raises a question: what is the optimal DWM structure for storing both instructions and data? Depending on whether the optimization target is area, energy or performance, the memory can be designed to favour either one. Another option is to implement separate instruction-optimized and data-optimized physical address ranges.
TABLE II: Increase in instructions fetched averaged over tape lengths from 8 to 64.

           Basic block splitting threshold
             4       8      16      32      64
adpcm      5.0%    5.0%    3.5%    0.9%    0.0%
aes        6.5%    5.1%    3.7%    2.4%    0.3%
blowfish   4.7%    3.5%    2.3%    0.2%    0.2%
dfadd      0.4%    0.4%    0.4%    0.4%    0.4%
dfdiv      0.5%    0.3%    0.3%    0.3%    0.3%
dfmul      0.4%    0.4%    0.4%    0.4%    0.4%
dfsin      0.1%    0.1%    0.1%    0.1%    0.0%
gsm        4.0%    3.7%    1.8%    0.6%    0.5%
jpeg       7.5%    4.0%    2.1%    0.7%    0.1%
mips      15.2%   12.6%    8.9%    3.6%    3.6%
motion    20.0%   13.4%    0.2%    0.0%    0.0%
sha        5.3%    5.1%    4.6%    4.6%    0.0%
Fig. 9: Increase in memory usage with basic block splitting thresholds from 4 to 64, tape effective length 8 domains.
Fig. 10: Increase in memory usage with basic block splitting thresholds from 4 to 64, tape effective length 64 domains.
Moreover, contemporary processor systems typically implement a memory hierarchy with multiple levels of caches, whose operation is based on the linear placement of instructions and data. Further research is required on efficient methods of integrating SHRIMP with mainstream memory hierarchies.
Multiple ports per tape could be used to share the shifting logic of a tape. As the maximum length of a tape is limited, and the access port transistors dominate the physical area in a DBC, we use two ports per tape, the minimum viable number for SHRIMP. This maximizes the number of tapes per area unit and, therefore, the effective bit density of the memory. Typically, only one instruction is fetched and decoded per clock cycle in a software-programmable processor. In this context, multiple ports per tape would also increase the leakage power consumption.
V. RELATED WORK
Previous work proposes DWM-based caches [6], scratchpad memories [7], [10], [11] and GPGPU register files [8]. However, these are primarily targeted at data. Instruction scheduling to reduce data memory shifts was considered by Gu et al. [9]. They ordered instructions based on the data access patterns in programs to minimize the shift counts of the data memory, but did not consider reading the instructions from a DWM.
Previous work [14], where DWM is used as a data memory,
utilizes multiple access ports per tape. This is done in order to
minimize the shifting delay when accessing different memory
locations, but simultaneously using only one shifting circuitry
for the entire tape as opposed to using multiple shorter tapes
with fewer access ports.
VI. CONCLUSIONS
In this paper we proposed SHRIMP, the first instruction placement strategy specifically designed for DWM technology. Based on static control flow graph analysis, frequently executed program BBs are split into halves, with the latter half placed in reverse order to reduce the energy- and time-consuming shifts specific to DWM technology. According to our measurements, the proposed method reduced the total number of shifts in 12 CHStone benchmarks by 40% on average when compared to a linear instruction placement. The total clock cycle count was reduced by 23% on average.
The results indicate that further research on strategies for
placing multiple BBs in the split or back-and-forth fashion
could provide additional improvements in memory usage
overhead, shifting reduction and total clock cycle counts.
ACKNOWLEDGMENTS
The authors thank the following sources of financial sup-
port: Tampere University Graduate School, Business Finland
(FiDiPro Program funding decision 40142/14), HSA Foun-
dation, the Academy of Finland (funding decision 297548),
ECSEL JU project FitOptiVis (project number 783162), the
German Research Council (DFG) through the TraceSymm
project CA 1602/4-1 and the Cluster of Excellence ‘Center
for Advancing Electronics Dresden’ (cfaed).
REFERENCES
[1] Huawei Technologies, S. Anders, and G. Andrae, “Total consumer power consumption forecast,” 2017, presentation available: https://www.researchgate.net/publication/320225452_Total_Consumer_Power_Consumption_Forecast.
[2] W. A. Wulf and S. A. McKee, “Hitting the memory wall: implications
of the obvious,” Computer Architecture News, vol. 23, no. 1, Mar 1995.
[3] S. Ghose et al., “What your dram power models are not telling
you: lessons from a detailed experimental study,” in abstracts of the
international conference on measurement and modeling of computer
systems, June 2018.
[4] S. Parkin, M. Hayashi, and L. Thomas, “Magnetic domain-wall racetrack
memory,” Science, vol. 320, May 2008.
[5] S. Parkin and S.-H. Yang, “Memory on the racetrack,” Nature nanotech-
nology, vol. 10, Mar 2015.
[6] R. Venkatesan et al., “Tapecache: a high density, energy efficient cache based on domain wall memory,” in proceedings of the international symposium on low power electronics and design, July 2012.
[7] H. Mao, C. Zhang, G. Sun, and J. Shu, “Exploring data placement in
racetrack memory based scratchpad memory,” in proceedings of the non-
volatile memory system and applications symposium, Aug 2015.
[8] M. Mao, W. Wen, Y. Zhang, Y. Chen, and H. Li, “Exploration of gpgpu
register file architecture using domain-wall-shift-write based racetrack
memory,” in proceedings of the design automation conference, June
2014.
[9] Shouzhen Gu et al., “Area and performance co-optimization for domain
wall memory in application-specific embedded systems,” in proceedings
of the design automation conference, June 2015.
[10] A. A. Khan, N. A. Rink, F. Hameed, and J. Castrillon, “Optimizing ten-
sor contractions for embedded devices with racetrack memory scratch-
pads,” in Proceedings of the International Conference on Languages,
Compilers, Tools and Theory for Embedded Systems, June 2019.
[11] A. A. Khan, F. Hameed, R. Blaesing, S. Parkin, and J. Castrillon,
“Shiftsreduce: Minimizing shifts in racetrack memory 4.0,” arXiv e-
prints, Mar 2019.
[12] Y. Hara, H. Tomiyama, S. Honda, and H. Takada, “Proposal and quantitative analysis of the chstone benchmark program suite for practical c-based high-level synthesis,” journal of information processing, vol. 17, Oct 2009.
[13] A. Waterman, Y. Lee, D. A. Patterson, and K. Asanovic, “The RISC-V instruction set manual, volume i: base user-level ISA,” EECS Department, UC Berkeley, Tech. Rep. UCB/EECS-2011-62, 2011.
[14] C. Zhang, G. Sun, W. Zhang, F. Mi, H. Li, and W. Zhao, “Quantitative
modeling of racetrack memory, a tradeoff among area, performance, and
power,” in proceedings of the asia and south pacific design automation
conference, Jan 2015.
[15] A. A. Khan, F. Hameed, R. Bläsing, S. Parkin, and J. Castrillon, “RTSim: A cycle-accurate simulator for racetrack memories,” IEEE Computer Architecture Letters, vol. 18, no. 1, pp. 43–46, Jan 2019.
... The variable K (line 13) computes the number of DBCs required for storing disjoint variables (V dj ). The variables in V dj are assigned to DBCs 1 → K and V ndj to the remaining (q − K) DBCs (lines [14][15][16][17][18][19][20][21] where q represents the total number of DBCs. Finally, lines 22-23 apply the single DBC heuristics from [2], [7] to optimize within DBC placement of program variables. ...
... DMA-Chen and DMA-SR further improve the latency by (68.1%, 60.1%, 36.5%, 13.4%) and (70.1%, 62%, 37.7%, 14.6%) for (2, 4, 8, 16) DBCs, respectively. The latency gain primarily stems from the reduced number of RTM shifts, which lowers the RTM access latency and ultimately the overall runtime. ...
... Fig. 5 highlights the significant reduction in the total energy consumed by DMA-OFU (61%, 62%, 44%, 13%) and DMA-SR (77%, 70%, 50%, 21%) relative to AFD-OFU for (2, 4, 8, 16) DBCs, respectively. Breaking the energy consumption down into leakage, read/write and shift energy, we observe that the gain in shift energy is proportional to the reduction in the number of shifts. Fig. 6 shows the trade-off among various parameters for the best-performing DMA-SR configuration as we increase the number of DBCs from 2 to 16. ...
Preprint
Ultra-dense non-volatile racetrack memories (RTMs) have been investigated at various levels in the memory hierarchy for improved performance and reduced energy consumption. However, the innate shift operations in RTMs hinder their applicability to replace low-latency on-chip memories. Recent research has demonstrated that intelligent placement of memory objects in RTMs can significantly reduce the amount of shifts with no hardware overhead, albeit for specific system setups. However, existing placement strategies may lead to sub-optimal performance when applied to different architectures. In this paper we look at generalized data placement mechanisms that improve upon existing ones by taking into account the underlying memory architecture and the timing and liveliness information of memory objects. We propose a novel heuristic and a formulation using genetic algorithms that optimize key performance parameters. We show that, on average, our generalized approach improves the number of shifts, performance and energy consumption by 4.3x, 46% and 55% respectively compared to the state-of-the-art.
... The most prominent software solution for RTM shift reduction is compiler-guided intelligent data and instruction placement [13]–[15], [99]. Through static code analysis and profiling, the compiler constructs an internal model of the application's memory access patterns. ...
... This improves the performance and energy consumption of the RTM-based SPM by 24% and 74%, respectively, compared to an iso-capacity SRAM. The work in [99] explores RTM as an instruction memory and proposes layouts that best suit the sequential reads/writes of RTM and the sequential nature of the instruction stream. ...
Article
Racetrack memory (RTM) is a novel spintronic memory-storage technology that has the potential to overcome fundamental constraints of existing memory and storage devices. It is unique in that its core differentiating feature is the movement of data, which is composed of magnetic domain walls (DWs), by short current pulses. This enables more data to be stored per unit area compared to any other current technology. On the one hand, RTM has the potential for mass data storage with unlimited endurance using considerably less energy than today's technologies. On the other hand, RTM promises an ultrafast nonvolatile memory competitive with static random access memory (SRAM) but with a much smaller footprint. During the last decade, the discovery of novel physical mechanisms to operate RTM has led to a major enhancement in the efficiency with which nanoscopic, chiral DWs can be manipulated. New materials and artificially atomically engineered thin-film structures have been found to increase the speed and lower the threshold current with which the data bits can be manipulated. With these recent developments, RTM has attracted the attention of the computer architecture community, which has evaluated the use of RTM at various levels in the memory stack. Recent studies advocate RTM as a promising compromise between, on the one hand, power-hungry, volatile memories and, on the other hand, slow, nonvolatile storage. By optimizing the memory subsystem, significant performance improvements can be achieved, enabling a new era of cache, graphics processing units, and high-capacity memory devices. In this article, we provide an overview of the major developments of RTM technology from both the physics and computer architecture perspectives over the past decade. We identify the remaining challenges and give an outlook on its future.
... In our previous work [26], we proposed a memory structure and a compiler instruction placement algorithm to reduce the number of shifts in DWM. This paper extends that work with the following contributions: ...
Article
As performance and energy-efficiency improvements from technology scaling are slowing down, new technologies are being researched in hopes of disruptive results. Domain wall memory (DWM) is an emerging non-volatile technology that promises extreme data density, fast access times and low power consumption. However, DWM access time depends on the distance of the memory location from the access ports, requiring expensive shifting. This causes overheads in performance and energy consumption. In this article, we implement our previously proposed shift-reducing instruction memory placement (SHRIMP) on a RISC-V core in RTL, provide the first thorough evaluation of the control logic required for DWM and SHRIMP, and evaluate the effects on system energy and energy-efficiency. SHRIMP reduces the number of shifts by 36% on average compared to a linear placement in the CHStone and Coremark benchmark suites when evaluated on the RISC-V processor system. The reduced number of shifts leads to an average reduction of 14% in cycle counts compared to the linear placement. When compared to an SRAM-based system, although increasing memory usage by 26%, DWM with SHRIMP allows a 73% reduction in memory energy and a relative energy-delay product of 42%. We estimate overall energy reductions of 14%, 15% and 19% in three example embedded systems.
... However, these solutions are infeasible in the embedded domain as they require additional hardware that costs area, latency, and energy. Similarly, the software techniques presented in [12, 25, 27], and [40] are not ideal fits for optimizing tensor applications. To the best of our knowledge, this is the first work that explores tensor layouts in RTMs for the contraction operation. ...
Article
Tensor contraction is a fundamental operation in many algorithms with a plethora of applications ranging from quantum chemistry over fluid dynamics and image processing to machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip/off-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial in order to meet energy constraints. This work aims at investigating strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM) and DRAM-based off-chip memory. Compiler optimizations such as the loop access order and data layout transformations paired with architectural optimizations such as prefetching and preshifting are employed to reduce the shifting overhead in RTMs. Optimizations for off-chip memory such as memory access order, data mapping and the choice of a suitable memory access granularity are employed to reduce the contention in the off-chip memory. Experimental results demonstrate that the proposed optimizations improve the SPM performance and energy consumption by 32% and 73%, respectively, compared to an iso-capacity SRAM. The overall DRAM dynamic energy consumption improvements due to memory optimizations amount to 80%.
... To abate the total number of shifts, techniques such as data swapping [47, 56], data compression [57], data reorganization for bubble memories [10, 49, 53], and efficient software-supported data and instruction placement [5, 29, 34] have been proposed. In addition, reconfigurable cache organizations have been proposed that mitigate the number of RM shifts by (de-)activating RM cache sets/ways that are far from the access ports at run time [42, 46]. ...
Article
Racetrack memories (RMs) have significantly evolved since their conception in 2008, making them a serious contender in the field of emerging memory technologies. Despite key technological advancements, the access latency and energy consumption of an RM-based system are still highly influenced by the number of shift operations. These operations are required to move bits to the right positions in the racetracks. This article presents data-placement techniques for RMs that maximize the likelihood that consecutive references access nearby memory locations at runtime, thereby minimizing the number of shifts. We present an integer linear programming (ILP) formulation for optimal data placement in RMs, and we revisit existing offset assignment heuristics, originally proposed for random-access memories. We introduce a novel heuristic tailored to a realistic RM and combine it with a genetic search to further improve the solution. We show a reduction in the number of shifts of up to 52.5%, outperforming the state of the art by up to 16.1%.
Article
Traditional memory hierarchy designs, primarily based on SRAM and DRAM, are becoming increasingly unsuitable to meet the performance, energy, bandwidth, and area requirements of modern embedded and high-performance computer systems. Racetrack memory (RTM), an emerging nonvolatile memory technology, promises to meet these conflicting demands by simultaneously offering high speed, higher density, and nonvolatility. RTM provides these efficiency gains by not providing immediate access to all storage locations, but by instead storing data sequentially in the equivalent of nanoscale tapes called tracks . Before any data can be accessed, explicit shift operations must be issued that cost energy and increase access latency. The result is a fundamental change in memory performance behavior: the address distance between subsequent memory accesses now has a linear effect on memory performance. While first techniques exist to optimize programs for linear-latency memories such as RTM, existing automatic solutions treat only scalar memory accesses. This work presents the first automatic compilation framework that optimizes static loop programs over arrays for linear-latency memories. We extend the polyhedral compilation framework Polly to generate code that maximizes accesses to the same or consecutive locations, thereby minimizing the number of shifts. Our experimental results show that the optimized code incurs up to 85% fewer shifts (average 41%), improving both performance and energy consumption by an average of 17.9% and 39.8%, respectively. Our results show that automatic techniques make it possible to effectively program linear-latency memory architectures such as RTM.
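The linear-latency behavior described in this abstract reduces to a simple cost model: the shift cost between two accesses equals their address distance along the track. A minimal sketch of such a model, assuming a single access port (the name `count_shifts` is hypothetical):

```python
def count_shifts(accesses, port_start=0):
    """Total RTM shifts for a sequence of track offsets, where the cost
    between consecutive accesses is their absolute address distance."""
    shifts, port = 0, port_start
    for offset in accesses:
        shifts += abs(offset - port)  # shift the track to align the port
        port = offset
    return shifts
```

Under this model, transformations that make consecutive accesses touch the same or adjacent offsets (as the polyhedral optimizations above aim to do) directly reduce the total shift count: `count_shifts([0, 1, 2, 3])` is cheaper than the reordered `count_shifts([0, 3, 1, 2])`.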
Conference Paper
Tensor contraction is a fundamental operation in many algorithms with a plethora of applications ranging from quantum chemistry over fluid dynamics and image processing to machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial in order to meet energy constraints. This work aims at investigating strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM). Compiler optimizations such as the loop access order and data layout transformations paired with architectural optimizations such as prefetching and preshifting are employed to reduce the shifting overhead in RTMs. Experimental results demonstrate that the proposed optimizations improve the SPM performance and energy consumption by 24% and 74% respectively compared to an iso-capacity SRAM.
Presentation
This presentation outlines an estimation of the global electricity usage that can be ascribed to Communication Technology (CT) in the coming decade. The scope covers two scenarios for the use and production of consumer devices, communication networks and data centers. Two different scenarios, best and expected, are set up, which include annual numbers of sold devices, data traffic and electricity intensities/efficiencies. For the first time, AR and VR devices are included in a CT power trend analysis. I will emphasize the potential development of total global data center IP traffic and data center electric power consumption. The effect of the 5G adoption rate on total CT power consumption will also be addressed. The likely share of CT in total global electric power consumption will be discussed in light of CT's potential to reduce consumption.
Conference Paper
Scratchpad Memory (SPM) has been widely adopted in various computing systems to improve performance of data access. Recently, non-volatile memory technologies (NVMs) have been employed for SPM design to improve its capacity and reduce its energy consumption. In this paper, we explore data allocation in SPM based on racetrack memory (RM), which is an emerging NVM with ultra-high storage density and fast access speed. Since a shift operation is needed to access data in RM, data allocation has an impact on performance of RM based SPM. Several allocation methods have been discussed and compared in this work. Especially, we addressed how to leverage genetic algorithm to achieve near-optimal data allocation.
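The genetic-algorithm approach mentioned in the abstract above can be illustrated with a toy search over variable-to-offset placements, scored by the shift-distance cost. Everything here (the names `ga_allocate` and `placement_shifts`, the swap mutation, the population parameters) is an illustrative assumption, not the cited method:

```python
import random

def placement_shifts(perm, accesses):
    """Shift cost of a placement: perm[i] is the variable stored at offset i."""
    offset = {v: i for i, v in enumerate(perm)}
    cost, port = 0, 0
    for v in accesses:
        cost += abs(offset[v] - port)
        port = offset[v]
    return cost

def ga_allocate(accesses, n_vars, pop_size=20, generations=50, seed=1):
    """Toy genetic search: keep the best half, mutate by swapping two offsets."""
    rng = random.Random(seed)
    pop = [rng.sample(range(n_vars), n_vars) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda p: placement_shifts(p, accesses))
        parents = pop[: pop_size // 2]           # elitist selection
        children = []
        for p in parents:
            child = p[:]
            i, j = rng.randrange(n_vars), rng.randrange(n_vars)
            child[i], child[j] = child[j], child[i]  # swap mutation
            children.append(child)
        pop = parents + children
    return min(pop, key=lambda p: placement_shifts(p, accesses))
```

Because the parents survive each generation, the best placement found never regresses; a practical implementation would add crossover and a convergence criterion.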
Racetrack memories (RTMs) have drawn considerable attention from computer architects of late. Owing to their SRAM-comparable access latency and ultra-high capacity, RTMs are promising candidates to revolutionize the memory subsystem. In order to evaluate their performance and appraise their suitability at various levels in the memory hierarchy, it is crucial to have RTM-specific simulation tools that accurately model their behavior and enable exhaustive design space exploration. To this end, we propose RTSim, an open-source cycle-accurate memory simulator that enables performance evaluation of domain-wall-based racetrack memories. Skyrmion-based RTMs can also be modeled with RTSim because they are architecturally similar to domain-wall-based RTMs. RTSim is developed in collaboration with physicists and computer scientists. It accurately models RTM-specific shift operations, access ports and the sequencing of memory commands, besides handling routine read/write operations. RTSim is built on top of NVMain 2.0, offering a larger design space for exploration.
Article
Main memory (DRAM) consumes as much as half of the total system power in a computer today, due to the increasing demand for memory capacity and bandwidth. There is a growing need to understand and analyze DRAM power consumption, which can be used to research new DRAM architectures and systems that consume less power. A major obstacle against such research is the lack of detailed and accurate information on the power consumption behavior of modern DRAM devices. Researchers have long relied on DRAM power models that are predominantly based off of a set of standardized current measurements provided by DRAM vendors, called IDD values. Unfortunately, we find that state-of-the-art DRAM power models are often highly inaccurate when compared with the real power consumed by DRAM. This is because existing DRAM power models (1) are based off of the worst-case power consumption of devices, as vendor specifications list the current consumed by the most power-hungry device sold; (2) do not capture variations in DRAM power consumption due to different data value patterns; and (3) do not account for any variation across different devices or within a device.
Conference Paper
Domain Wall Memory (DWM), a recently developed spin-based non-volatile memory technology, inherently offers unprecedented density benefits by storing multiple bits in the domains of a ferromagnetic nanowire, which logically resembles a bit-serial tape. However, this structure also leads to a unique challenge: the bits must be accessed sequentially by performing "shift" operations, resulting in variable and potentially higher access latencies. In this paper, we propose a hardware/software co-optimization approach to improve area efficiency and performance for DWM in application-specific embedded systems. For an application-specific embedded system, this technique obtains a DWM that consists of both micro-cell and macro-cell DWM with minimal area. Meanwhile, an instruction schedule and a data allocation with minimal memory access overhead are generated. Experimental results show that the proposed method can minimize the DWM area while satisfying a system performance constraint.
Conference Paper
Recently, an emerging non-volatile memory called Racetrack Memory (RM) has become a promising option to satisfy the requirement of increasing on-chip memory capacity. RM can achieve ultra-high storage density by integrating many bits in a tape-like racetrack, and also provides read/write speeds comparable with SRAM. However, the lack of circuit-level modeling has limited the design exploration of RM, especially at the system level. To overcome this limitation, we develop an RM circuit-level model, with careful study of device configurations and circuit layouts. This model introduces the Macro Unit (MU) as the building block of RM and analyzes the interaction of its attributes. Moreover, we integrate the model into NVSim to enable automatic exploration of its huge design space. Our case study of an RM cache demonstrates significant variance under different optimization targets with respect to area, performance, and energy. In addition, we show that cross-layer optimization is critical for the adoption of RM as on-chip memory.