Kilo-instruction Processors, Runahead and Prefetching
Tanausú Ramírez1, Alex Pajuelo1, Oliverio J. Santana2 and Mateo Valero1,3
1 Departamento de Arquitectura de Computadores UPC – Barcelona
2 Departamento de Informática y Sistemas ULPGC – Las Palmas de GC
3 Barcelona Supercomputing Center (BSC – CNS) – Barcelona
{tramirez, mpajuelo, mateo}@ac.upc.edu , ojsantana@dis.ulpgc.es
Abstract
There is a continuous research effort devoted to overcoming the memory wall
problem. Prefetching is one of the most frequently used techniques. A prefetch
mechanism anticipates the processor requests by moving data into the faster levels of
the memory hierarchy. The Runahead mechanism is another form of prefetching based on
speculative execution. This mechanism executes speculative instructions under an L2
miss, preventing the processor from being stalled when the reorder buffer completely
fills, and thus allowing the generation of useful prefetches. Another technique to
alleviate the memory wall problem provides processors with large instruction windows,
avoiding window stalls due to in-order commit and long latency loads. This approach,
known as “Kilo-instruction processors”, relies on exploiting more instruction level
parallelism by allowing thousands of in-flight instructions while long latency loads are
outstanding in memory.
In this work, we present a comparative study of the three above-mentioned
approaches, showing their key issues and performance tradeoffs. We show that
Runahead execution achieves better performance speedups (30% on average) than
traditional prefetch techniques (21% on average). Nevertheless, the Kilo-instruction
processor performs best (68% on average). Kilo-instruction processors are not only
faster but also generate a lower number of speculative instructions than Runahead.
When the evaluated prefetching mechanism is combined with Runahead and with the
Kilo-instruction processor, performance improves further in each case (49.5% and
88.9% respectively). Kilo-instruction with prefetch still achieves better
performance and executes fewer speculative instructions than Runahead.
1 Introduction
The difference between processor and memory speed grows larger every year. This
gap is well known in the computer architecture area as the memory wall problem [35].
A plethora of
techniques have been proposed to alleviate this problem, such as cache memories
[26][34] and out-of-order execution [2][30]. However, as processor frequency continues
to increase and DRAM latencies do not keep up with this improvement, these traditional
techniques are not enough to hide the main memory latency, severely limiting the
potential performance achievable by the processor. As a consequence, new and different
approaches have appeared to narrow this gap.
The objective of our work is to analyze state-of-the-art mechanisms aiming to
overcome the memory wall problem. Because of the large number of proposals, it is not
possible to analyze each particular technique in a single paper. Therefore, we have
chosen to focus on three well-known techniques: prefetching, Runahead, and Kilo-
instruction processors.
Aggressive hardware prefetchers are commonly implemented in current
processors [14][29]. Prefetching attempts to anticipate the needs of the program
being executed, bringing data near the processor before the program requires them, and
thus reducing the number of memory misses. The efficiency of prefetching depends on data
predictability, that is, on the regularity of program access patterns. If future data
accesses are correctly predicted, data prefetches will improve processor
performance. Conversely, wrong prefetches cause bus contention and
pollution in the cache hierarchy.
Runahead execution [12][20] is an advanced mechanism that aims to
improve prefetch efficiency. Runahead prevents the processor from stalling on a full
reorder buffer during long-latency memory operations by executing speculative
instructions. To do this, when a memory operation that misses in the L2 cache reaches
the ROB head, the processor takes a checkpoint of the architectural state. After taking
the checkpoint, the processor assigns an invalid value to the destination register of
the memory instruction that caused the L2 miss and enters runahead mode. During
runahead mode, the processor keeps executing instructions speculatively, using the
invalid value in place of the missing data. All the instructions that operate on
the invalid value will also produce invalid results. However, the instructions that do not
depend on the invalid value will be pre-executed. When the memory operation that
started runahead mode is resolved, the processor rolls back to the initial checkpoint and
resumes normal execution. As a consequence, all the speculative work done by the
processor is discarded. Nevertheless, this previous execution is not completely useless.
The main advantage of Runahead is that the speculative execution generates
useful data and instruction prefetches, improving the behaviour of the
memory hierarchy during the real execution. The drawback of this technique is that it
executes a large number of speculative instructions, increasing the overall energy
consumption, which has motivated research focused on reducing this
problem [21].
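
The runahead sequence described above (checkpoint, invalid-value propagation, prefetch generation and rollback) can be illustrated with a minimal sketch. The toy instruction format, the `runahead` function, the `INV` marker and the 64-byte line size are assumptions made for this illustration only; they are not taken from the evaluated simulator, and data values are not modelled.

```python
INV = object()  # sentinel for an invalid (unknown) register value


def runahead(regs, program, miss_dest, cached_lines, line_size=64):
    """Pre-execute `program` in runahead mode and return the prefetched lines.

    Toy instruction format (an assumption of this sketch):
      ("add",  dest, src, imm)  ->  regs[dest] = regs[src] + imm
      ("load", dest, src, imm)  ->  regs[dest] = MEM[regs[src] + imm]
    """
    checkpoint = dict(regs)       # checkpoint of the architectural state
    regs[miss_dest] = INV         # destination of the L2-missing load is invalid
    prefetches = []
    for op, dest, src, imm in program:
        if regs[src] is INV:      # invalid sources produce invalid results
            regs[dest] = INV
        elif op == "add":
            regs[dest] = regs[src] + imm
        elif op == "load":
            line = (regs[src] + imm) // line_size * line_size
            if line in cached_lines:
                regs[dest] = 0    # toy model: data values are not simulated
            else:
                prefetches.append(line)   # a runahead miss acts as a prefetch
                cached_lines.add(line)
                regs[dest] = INV          # its data is not available yet
    regs.clear()
    regs.update(checkpoint)       # roll back: speculative results are discarded
    return prefetches


regs = {"r1": 0, "r2": 4096, "r3": 0, "r4": 0}
program = [
    ("add", "r3", "r1", 8),     # depends on the missing load: invalid
    ("load", "r4", "r3", 0),    # invalid address: no prefetch possible
    ("add", "r2", "r2", 64),    # independent: pre-executed
    ("load", "r4", "r2", 0),    # independent miss: becomes a prefetch
    ("add", "r2", "r2", 64),
    ("load", "r4", "r2", 0),
]
cache = set()
lines = runahead(regs, program, miss_dest="r1", cached_lines=cache)
```

In this example the two loads that do not depend on `r1` warm the cache, while the register file returns to its checkpointed contents, so the only lasting effect of the speculative work is the prefetched lines.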
A different approach to overcome the memory wall problem is not relying just
on data prefetching, but also on increasing the instruction level parallelism. Several new
designs have been recently proposed to increase the number of instructions available for
execution by enlarging the instruction window. With a larger instruction
window, it is possible to execute more independent instructions while long latency loads
are outstanding in memory. Thus, while the memory access is being resolved, the
processor is able to overlap it with the execution of useful work. Moreover, this useful
work includes memory accesses that would not be executed using smaller instruction
windows, effectively prefetching data from memory.
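
The effect of the window size on overlapped misses can be approximated with a small idealized model. The function `miss_epochs` and its inputs are hypothetical, and the model assumes that all misses are independent and that a miss at the head of the window blocks commit until it is resolved; it is a sketch of the reasoning, not of the evaluated processor.

```python
def miss_epochs(miss_positions, window_size):
    """Count the groups of independent misses that overlap within one window.

    miss_positions: sorted instruction indices of independent L2-missing loads.
    A miss at index p blocks commit, so only instructions p .. p+window_size-1
    can be in flight; later misses inside that range overlap with it.
    """
    epochs = 0
    window_end = -1
    for pos in miss_positions:
        if pos > window_end:      # outside the current window: a new stall
            epochs += 1
            window_end = pos + window_size - 1
    return epochs


# One independent miss every 100 instructions, 20 misses in total.
positions = list(range(0, 2000, 100))
small = miss_epochs(positions, 128)    # conventional window
large = miss_epochs(positions, 1024)   # kilo-instruction window
```

Under these assumptions the 128-entry window exposes 10 separate memory stalls while the 1024-entry window exposes only 2, because the remaining misses are issued while an older miss is still outstanding.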
Since increasing the size of the instruction window would considerably
increase the processor complexity, the main processor structures must be
designed carefully. This trend has led to the design of Kilo-instruction processors
[7][8][9][10], a complexity-effective architecture that virtually enlarges the instruction
window by using an efficient checkpoint mechanism, leading to an affordable design
that is able to maintain thousands of in-flight instructions.
This paper presents an overall comparison of a stride-based prefetching
mechanism, Runahead execution and a Kilo-instruction processor in a common framework.
We analyze and evaluate important parameters such as performance, number of
executed instructions and the distribution of memory access instructions. This analysis
shows the ability of each technique to reduce the memory wall problem, as well as their
main advantages and disadvantages. We show the limitations that prevent each
technique from achieving better performance. Finally, we combine the prefetch mechanism
with Runahead and Kilo-instruction processors in order to evaluate the benefits of
applying two orthogonal techniques.
The remainder of this paper is organized as follows. We discuss related work and
background in Section 2. In Section 3 we describe our experimental framework.
In Section 4, we present a comprehensive study of the three techniques, identifying key
performance issues and research trends. Finally, we provide the conclusions of our work
in Section 5.
2 Background and Related Work
Prefetching is one of the most widely used techniques to alleviate the memory wall
problem. It is based on predicting future memory accesses to bring data, in advance, to
the faster levels of the memory hierarchy. Unfortunately, prefetching has two major
problems. Firstly, the extra memory accesses increase the pressure on the memory
hierarchy. Secondly, wrong prefetches pollute the caches, causing unnecessary
misses.
Software prefetching techniques [2][19][23] rely on the compiler to reduce
cache misses by inserting prefetch instructions into the code. This is not a trivial task,
since the compiler has limited knowledge of the actual memory behaviour of an
application. The major drawbacks of software prefetching are the increase in the size of the
application code and the need to devote front-end bandwidth to fetching these
instructions.
Hardware prefetching techniques [3][15][16] try to dynamically predict the
effective memory address of future memory instructions in order to anticipate the data
that will be required. These techniques do not enlarge programs by inserting prefetch
instructions, but they increase the processor complexity with the tables needed to store
the memory access patterns and the logic required to use these data and generate a
prediction. There are two important parameters that should be considered when
implementing a hardware prefetch mechanism. One is the degree of prefetching [32],
which indicates the number of prefetches that will be generated for a given instruction.
The second parameter is the distance of prefetching [32], which sets how far ahead of
the current access the first prefetch starts for a given instruction.
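
The roles of degree and distance can be illustrated with a minimal sketch of a PC-indexed stride prefetcher. The class `StridePrefetcher`, its table layout and its two-matching-strides confidence rule are assumptions of this sketch, not the mechanism evaluated in this paper; in particular, the distance is interpreted here as a number of strides ahead of the current access.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StrideEntry:
    last_addr: int           # last address seen for this load PC
    stride: int = 0          # last observed address delta
    confident: bool = False  # True once the same stride is seen twice in a row


class StridePrefetcher:
    """Toy PC-indexed stride prefetcher (hypothetical, for illustration only)."""

    def __init__(self, degree: int = 2, distance: int = 4):
        self.degree = degree      # number of prefetches issued per trigger
        self.distance = distance  # strides ahead of the current access
        self.table: Dict[int, StrideEntry] = {}

    def access(self, pc: int, addr: int) -> List[int]:
        """Record a demand access and return the addresses to prefetch."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = StrideEntry(last_addr=addr)
            return []
        stride = addr - entry.last_addr
        entry.confident = stride != 0 and stride == entry.stride
        entry.stride = stride
        entry.last_addr = addr
        if not entry.confident:
            return []             # irregular pattern: do not pollute the cache
        first = addr + self.distance * stride
        return [first + i * stride for i in range(self.degree)]
```

With degree 2 and distance 4, a load that walks memory in 64-byte steps would trigger, on its third access at address 1128, prefetches for addresses 1384 and 1448. A larger distance starts further ahead of the demand stream, and a larger degree issues more prefetches per trigger at the cost of more memory traffic.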
There also exist hybrid prefetching techniques that combine both software and
hardware schemes [33]. Another prefetch technique is thread-based prefetching [5][6][11][18].
This technique takes advantage of idle thread contexts in a multithreaded
processor to prefetch data for the main thread. Helper threads and assisted threads are
two of the most important techniques in this area.
Runahead execution is another mechanism to perform speculative prefetch. It
was first proposed for in-order processors [12] and later extended to out-of-order
processors as a simple alternative to large instruction windows [20]. Runahead
improves performance, but it considerably increases the
number of executed instructions, and thus the overall energy consumption of the
processor. To mitigate this problem, some proposals [21] aim to make
Runahead a more energy-efficient technique.
A different approach to overcome the memory wall problem is using
complexity-effective strategies to virtually enlarge the instruction window. A simple
proposal is the Waiting Instruction Buffer (WIB) [17]. Those instructions that depend on
an L2 miss are stored in this structure and removed from the instruction window to
allow the commit of those instructions that are independent of the L2 miss. Once the
data is brought from memory, the instructions in the WIB are reinserted into the
instruction window.
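
The selection of the instructions that move to the WIB can be sketched as a transitive dependence walk over the window. The function `split_on_miss` and the `(dest, srcs)` instruction encoding are hypothetical and ignore register renaming; the sketch only illustrates the partitioning idea described above, not the hardware organization proposed in [17].

```python
def split_on_miss(window, miss_reg):
    """Partition the window into (kept, wib); wib holds the miss-dependent slice.

    window: list of (dest, srcs) tuples in program order.
    miss_reg: register still waiting for data from the L2 miss.
    """
    poisoned = {miss_reg}
    kept, wib = [], []
    for dest, srcs in window:
        if poisoned & set(srcs):
            poisoned.add(dest)        # dependence on the miss is transitive
            wib.append((dest, srcs))
        else:
            poisoned.discard(dest)    # an independent write overwrites the register
            kept.append((dest, srcs))
    return kept, wib


window = [
    ("r2", ["r1"]),         # uses the missing value
    ("r5", ["r4"]),         # independent
    ("r3", ["r2", "r5"]),   # depends on r2, hence on the miss
    ("r6", ["r5"]),         # independent
]
kept, wib = split_on_miss(window, miss_reg="r1")
```

Here the instructions writing `r5` and `r6` stay in the window and can proceed, while those writing `r2` and `r3` wait in the WIB until the data arrives and they are reinserted.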
Kilo-instruction processors [7][8][9][10] are an architectural proposal that
prevents the processor from stalling due to the lack of entries in the ROB under L2
misses. A Kilo-instruction processor consists of a set of techniques that allow thousands
of in-flight instructions in the processor, such as a multi-checkpointing mechanism, late
allocation and early release of registers, and complexity-effective designs of the
instruction queues.
With a similar philosophy, Akkary et al. [1] proposed the Checkpoint Processing
and Recovery (CPR) mechanism, in which the ROB is completely removed from the
processor. This approach incorporates a set of microarchitectural schemes to overcome
the ROB limitations, such as selective checkpoint mechanisms, a hierarchical store
queue organization and an algorithm for aggressive physical register de-allocation. The
Continual Flow Pipelines (CFP) architecture [27] is an evolution of CPR, in which an
efficient implementation of a bi-level issue queue is provided. To further improve this
design, it uses a Slice Data Buffer (SDB), a structure that follows the same
philosophy as the above-mentioned WIB.