In Proceedings of the 3rd International Symposium on High Performance Computing (ISHPC2K), October 2000, (c)
Springer-Verlag.
Limits of Task-based Parallelism in Irregular Applications
Barbara Kreaseck Dean Tullsen Brad Calder
Department of Computer Science and Engineering
University of California, San Diego
La Jolla, CA 92093-0114
{kreaseck, tullsen, calder}@cs.ucsd.edu
Abstract
Traditional parallel compilers do not effectively parallelize irregular applications because they con-
tain little loop-level parallelism. We explore Speculative Task Parallelism (STP), where tasks are full
procedures and entire natural loops. Through profiling and compiler analysis, we find tasks that are
speculatively memory- and control-independent of their neighboring code. Via speculative futures, these tasks may be executed in parallel with preceding code when there is a high probability of independence. We estimate the amount of STP in irregular applications by measuring the number of memory-independent instructions these tasks expose. We find that 7 to 22% of dynamic instructions are within memory-independent tasks, depending on assumptions.
1 Introduction
Today’s microprocessors rely heavily on instruction-level parallelism (ILP) to gain higher performance.
Flow control imposes a limit to available ILP in single-threaded applications [8]. One way to overcome this
limit is to find parallel tasks and employ multiple flows of control (threads). Task-level parallelism (TLP)
arises when a task is independent of its neighboring code. We focus on finding these independent tasks and
exploring the resulting performance gains.
Traditional parallel compilers exploit one variety of TLP, loop level parallelism (LLP), where loop iter-
ations are executed in parallel. LLP is overwhelmingly found in numeric, typically FORTRAN, programs
with regular patterns of data accesses. In contrast, general purpose integer applications, which account for
the majority of codes currently run on microprocessors, exhibit little LLP as they tend to access data in
irregular patterns through pointers. Without pointer disambiguation to analyze data access dependences,
traditional parallel compilers cannot parallelize these irregular applications and ensure correct execution.
In this paper we explore task-level parallelism in irregular applications by focusing on Speculative Task
Parallelism (STP), where tasks are speculatively executed in parallel under the following assumptions: 1)
tasks are full procedures or entire natural loops, 2) tasks are speculatively memory-independent and control-
independent, and 3) our architecture allows the parallelization of tasks via speculative futures (discussed
below). Figure 1 illustrates STP, showing a dynamic instruction stream where a task Y has no memory ac-
cess conflicts with a group of instructions, X, that precede Y. The shorter of X and Y determines the overlap
of memory-independent instructions as seen in Figures 1(b) and 1(c). In the absence of any register depen-
dences, X and Y may be executed in parallel, resulting in shorter execution time. It is hard for traditional
parallel compilers of pointer-based languages to expose this parallelism.
Figure 1: STP example: (a) shows a section of code where the task Y is known to be memory-independent of the preceding code X. (b) The shaded region shows memory- and control-independent instructions that are essentially removed from the critical path when Y is executed in parallel with X. (c) When task Y is longer than X.
The goals of this paper are to identify such regions within irregular applications and to find the number of
instructions that may thus be removed from the critical path. This number represents the maximum possible
STP. To facilitate our discussion, we offer the following definitions.
A task, exemplified by Y in Figure 1, is a bounded set of instructions inherent to the application. Two
sections of code are memory-independent when neither contains a store to a memory location that the other
accesses. When all load/store combinations of the type [load,store], [store,load] and [store,store] between
two tasks, X and Y, access different memory locations, X and Y are said to be memory-independent. A
launch point is the point in the code preceding a task where the task may be initiated in parallel with the
preceding code. This point is determined through profiling and compiler analysis. A launched task is one
that begins execution from an associated launch point on a different thread.
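This pairwise test is easy to state in code. The following C sketch checks memory-independence between two sections of code given their profiled accesses; the Access type and the region representation are illustrative assumptions, not the actual tool's data structures:

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct {
        unsigned long region;  /* memory region id (e.g., a page or word address) */
        bool is_store;         /* true for a store, false for a load */
    } Access;

    /* Two code sections are memory-independent when no [load,store],
     * [store,load], or [store,store] pair touches the same region;
     * [load,load] pairs to the same region are harmless. */
    bool memory_independent(const Access *x, size_t nx,
                            const Access *y, size_t ny)
    {
        for (size_t i = 0; i < nx; i++)
            for (size_t j = 0; j < ny; j++)
                if (x[i].region == y[j].region &&
                    (x[i].is_store || y[j].is_store))
                    return false;
        return true;
    }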
Because the biggest barrier to detecting independence in irregular codes is memory disambiguation,
we identify memory-independent tasks using a profile-based approach and measure the amount of STP by
estimating the amount of memory-independent instructions those tasks expose. As successive executions
may differ from the profiled execution, any launched task would be inherently speculative. One way of
launching a task in parallel with its preceding code is through a parallel language construct called a future.
A future conceptually forks a thread to execute the task and identifies an area of memory in which to relay
status and results. When the original thread needs the results of the futured task, it either waits on the futured
task thread or, if the task was never futured because no threads were idle, it executes the task itself.
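As an illustrative sketch of the construct, assuming a pthreads-style environment (the future_t type and the future_launch/future_claim names are hypothetical; the paper assumes the construct rather than prescribing an API):

    #include <pthread.h>

    typedef struct {
        pthread_t tid;
        int launched;        /* was a thread actually forked? */
        void *result;        /* task result, if any */
    } future_t;

    extern void *task_Y(void *arg);   /* the futured task */

    /* Fork the task if a thread is available; otherwise mark it
     * unlaunched so the caller will run it inline later. */
    void future_launch(future_t *f, void *arg) {
        f->launched = (pthread_create(&f->tid, NULL, task_Y, arg) == 0);
    }

    /* At the original call site: wait for the futured task, or run it
     * ourselves if it was never launched. */
    void *future_claim(future_t *f, void *arg) {
        if (f->launched) {
            pthread_join(f->tid, &f->result);
            return f->result;
        }
        return task_Y(arg);  /* task was never futured: execute inline */
    }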
To exploit STP, we assume a speculative machine that supports speculative futures. Such a processor
could speculatively execute code in parallel when there is a high probability of independence, but no guaran-
tee. Our work identifies launch points for this speculative machine, and estimates the parallelism available
to such a machine. With varying levels of control and memory speculation, 7 to 22% of dynamic instruc-
tions are within tasks that are found to be memory-independent, on a set of irregular applications for which
traditional methods of parallelization are ineffective.
In the next section we discuss related work. Section 3 contains a description of how we identify and
quantify STP. Section 4 describes our experimental methodology and Section 5 continues with some results.
Implementation issues are highlighted in Section 6, followed by a summary in Section 7.
2 Related Work
In order to exploit Speculative Task Parallelism, a system would minimally need to include multiple flows of
control and memory disambiguation to aid in mis-speculation detection. Current proposed structures that aid
in dynamic memory disambiguation are implemented in hardware alone [3] or rely upon a compiler [5, 4].
All minimally allow loads to be speculatively executed above stores and detect write-after-read violations
that may result from such speculation.
Some multithreaded machines [21, 19, 2] and single-chip multiprocessors [6, 7] facilitate multiple flows
of control from a single program, where flows are generated by the compiler and/or dynamically. All of these
architectures could exploit non-speculative TLP if the compiler exposed it, but only Hydra [6] could support
STP without alteration.
Our paper examines speculatively parallel tasks in non-traditionally parallel applications. Other pro-
posed systems, displaying a variety of characteristics, also use speculation to increase parallelism. They
include Multiscalar processors [16, 12, 20], Block Structured Architecture [9], Speculative Thread-level
Parallelism [15, 14], Thread-level Data Speculation [18], Dynamic Multithreading Processor [1], and Data
Speculative Multithreaded hardware architecture [11, 10].
In these systems, the types of speculative tasks include fixed-size blocks [9], one or more basic blocks [16],
dynamic instruction sequences [18], loop iterations [15, 11], instructions following a loop [1], or following
a procedure call [1, 14]. These tasks were identified dynamically at run-time [11, 1], statically by compil-
ers [20, 9, 14], or by hand [18]. The underlying architectures include traditional multiprocessors [15, 18],
non-traditional multiprocessors [16, 9, 10], and multithreaded processors [1, 11].
Memory disambiguation and mis-speculation detection were handled by an Address Resolution Buffer [16],
the TimeWarp mechanism of time stamping requests to memory [9], extended cache coherence schemes [14,
18], fully associative queues [1], and iteration tables [11]. Control mis-speculation was always handled by
squashing the mis-speculated task and any of its dependents. While a few systems also handled data mis-speculations by squashing, one rolls execution back to the point of the incorrect data speculation [14], and others allow selective re-execution of only the instructions dependent on the incorrect data speculation [9, 1].
Most systems facilitate data flow by forwarding values produced by one thread to any consuming
threads [16, 9, 18, 1, 11]. A few avoid data mis-speculation through synchronization [12, 14]. Some systems
enable speculation by value prediction using last-value [1, 14, 11] and stride-value predictors [14, 11].
STP identifies a source of parallelism that is complementary to that found by most of the systems above.
Armed with a speculative future mechanism, these systems may benefit from exploiting STP.
3 Finding Task-based Parallelism
We find Speculative Task Parallelism by identifying all tasks that are memory-independent of the code that
precedes the task. This is done through profiling and compiler analysis, collecting data from memory access
conflicts and control flow information. These conflicts determine proposed launch points that mark the
memory dependences of a task. Then for each task, we traverse the control flow graph (CFG) in reverse
control flow order to determine launch points based upon memory and control dependences. Finally, we
estimate the parallelism expected from launching the tasks early. The following subsections explain the details of our approach to finding STP.
Task Selection
The type of task chosen for speculative execution directly affects the amount of speculative parallelism
found in an application. Oplinger et al. [14] found that loop iterations alone were insufficient to make
speculative thread-level parallelism effective for most programs. To find STP, we look at three types of
tasks: leaf procedures (procedures that do not call any other procedure), non-leaf procedures, and entire natural loops. When profiling a combination of task types, we profile them concurrently, exposing memory-independent instructions within an environment of interacting tasks.

Figure 2: PLP Locations: conflict source, conflict destination, PLP candidate, PLP overlap, and task overlap, when the latest conflict source is (a) before the calling routine, (b) within a sibling task, or (c) within the calling routine.
Although all tasks of the chosen type(s) are profiled, only those that expose at least a minimum number of memory-independent instructions are chosen to be launched early. The final task selection is made after evaluating memory and control dependences to determine actual launch points.
Memory Access Conflicts
Memory access conflicts are used to determine the memory dependences of a task. They occur when two load/store instructions access the same memory region. Only a subset of the memory conflicts that occur during execution are useful for calculating launch points. Useful conflicts span task boundaries and are of the form [load, store], [store, load], or [store, store]. We also disregard stores and loads due to register saves and restores across procedure calls. We call the conflicting instruction preceding the task the conflict source, and the conflicting instruction within the task is called the conflict destination. Specifically, when the conflict destination is a load, the conflict source will be the last store to that memory region that occurred outside the task. When the conflict destination is a store, the conflict source will be the last load or store to that memory region that occurred outside the task.
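These rules amount to simple per-region bookkeeping during profiling. A minimal C sketch, assuming a lookup table keyed by region id with both fields initialized to -1 (the RegionInfo structure and region_lookup are assumptions, not the ATOM tool's actual implementation):

    /* For each memory region, remember the instruction ids of the last
     * load and last store seen outside the current task. */
    typedef struct { long last_load; long last_store; } RegionInfo;

    extern RegionInfo *region_lookup(unsigned long region);

    /* Called for each memory access inside a task: returns the instruction
     * id of the conflict source, or -1 if the task access does not conflict
     * with the preceding code. */
    long conflict_source(unsigned long region, int is_store)
    {
        RegionInfo *r = region_lookup(region);
        if (is_store) {
            /* store destination: source is the last outside load OR store */
            return r->last_load > r->last_store ? r->last_load
                                                : r->last_store;
        }
        /* load destination: source is the last outside store only */
        return r->last_store;
    }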
Proposed Launch Points
The memory dependences for a task are marked, via profiling, as proposed launch points (PLPs). A PLP represents the memory access conflict with the latest (closest) conflict source in the dynamic code preceding one execution of that task. Exactly one PLP is found for each dynamic task execution. In our approach, launch points for a task occur only within the task's calling region, limiting the amount and scope of executable changes that would be needed to exploit STP. Thus, PLPs must also lie within a task's calling routine.
Figure 2 contains an example that demonstrates the latest conflict sources and their associated PLPs. Task Z calls tasks X and Y. Y is the currently executing task in the example and Z is its calling routine. When the conflict source occurs before the beginning of the calling routine, as in Figure 2(a), the PLP is directly before the first instruction of the task Z. When the conflict source occurs within a sibling task or its child tasks, as in Figure 2(b), the PLP immediately follows the call to the sibling task. In Figure 2(c), the conflict source is a calling routine instruction and the PLP immediately follows the conflict source.
Figure 3: Task Launch Points: Dotted areas have not been fully visited by the back-trace for task Y. (a) CFG
block contains a PLP. (b) CFG block is calling routine head. (c) Loop contains a PLP. (d) Incompletely
visited CFG block. (e) Incompletely visited loop.
Two measures of memory-independence are associated with each PLP. They are the PLP overlap and the task overlap, as seen in Figure 2. The PLP overlap represents the number of dynamic instructions found between the PLP and the beginning of the task. The task overlap represents the number of dynamic instructions between the beginning of the task and the conflict destination. With PLPs determined by memory dependences that are dynamically closest to the task call site, and by recording the smallest task overlap, we only consider conservative, safe locations with respect to the profiling dataset.
Task Launch Points
Both memory dependences and control dependences influence the placement of task launch points. Our initial approach to exposing STP determines task launch points that provide two guarantees. First, static control dependence is preserved: all paths from a task launch point lead to the original task call site. Second, profiled memory dependence is preserved: should the threaded program be executed on the profiling dataset, all instructions between the task launch point and the originally scheduled task call site will be free of memory conflicts. Variations that relax these guarantees are described in Section 5.2.
For each task, we recursively traverse the CFG in reverse control flow order starting from the original call site, navigating conditionals and loops, to identify task launch points. We use two auxiliary structures: a stalled block list to hold incompletely visited blocks, and a stalled loop list to hold incompletely visited loops. There are five conditions under which we record a task launch point, described below; the first three halt recursive back-tracing along the current path. As illustrated in Figure 3, we record task launch points:
a. when the current CFG block contains a PLP for that task. The task launch point is the last PLP in the
block.
b. when the current CFG block is the head of the task's calling routine and contains no PLPs. The task launch point is the first instruction in the block.
c. when the current loop contains a PLP for that task. Back-tracing only reaches this point once it has visited a loop and all of the loop's exit edges have been visited. As this loop is really a sibling of the current task, task launch points are recorded at the end of all loop exit edges.
d. for blocks that remain on the stalled block list after all recursive back-tracing has exited. A task launch
point is recorded only at the end of each visited successor edge of the stalled block.
e. for loops that remain on the stalled loop list after all recursive back-tracing has exited. A task launch
point is recorded only at the end of each visited loop exit edge.
Each task launch point indicates a position in the executable in which to place a task future. At each task's original call site, a check on the status of the future will indicate whether to execute the task serially or wait on the result of the future.
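A sketch of the traversal in C, covering halting conditions (a) and (b) explicitly; conditions (c) through (e), which involve the loop structure and the stalled lists, are noted only in comments, and all data structures here are illustrative assumptions:

    typedef struct Block Block;
    struct Block {
        int has_plp;          /* block contains a PLP for this task */
        int is_routine_head;  /* head of the task's calling routine */
        int last_plp_instr;   /* position of the last PLP in the block */
        int npred;
        Block **pred;
        int visited;
    };

    extern void record_launch_point(Block *b, int instr);

    void back_trace(Block *b)
    {
        if (b->visited) return;
        b->visited = 1;
        if (b->has_plp) {                    /* condition (a) */
            record_launch_point(b, b->last_plp_instr);
            return;                          /* halt along this path */
        }
        if (b->is_routine_head) {            /* condition (b) */
            record_launch_point(b, 0);       /* first instruction */
            return;
        }
        /* condition (c): a loop containing a PLP is treated as a sibling
         * task; launch points go at the ends of its exit edges. Blocks
         * and loops left incompletely visited go on the stalled lists,
         * handled after the recursion exits (conditions (d) and (e)). */
        for (int i = 0; i < b->npred; i++)
            back_trace(b->pred[i]);
    }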
Figure 4: Task Types: Individual data points identify memory-independent instructions, by task type (Leaf, non-Leaf, Loop), as a percentage of all instructions profiled, and represent our starting configuration.
Parallelization Estimation
We estimate the number of memory-independent instructions that would have been exposed had the tasks been executed at their launch points during the profile run. Our approach ensures that each instruction is counted as memory-independent at most once. When the potential for instruction overlap exceeds the task selection threshold, the task is marked for STP. We use the total number of claimed memory-independent instructions as an estimate of the limit of STP available on our hypothetical speculative machine.
4 Methodology
To investigate STP, we used the ATOM profiling tools [17] and identified natural loops as defined by Muchnick [13]. We profiled the SPECint95 suite of benchmark programs. Each benchmark was profiled for 650 million instructions. We used the reference datasets on all benchmarks except compress. For compress, we used a smaller dataset, in order to profile a more interesting portion of the application.
We measure STP by the number of memory-independent task instructions that would overlap preceding
non-task instructions should a selected task be launched (as a percentage of all dynamic instructions).
The task selection threshold comprises two values, both of which must be exceeded. For all runs, the task
selection threshold was set at 25 memory-independent instructions per task execution and a total of 0.2%
of instructions executed. We impose this threshold to compensate for the expected overhead of managing
speculative threads and to enable allocation of limited resources to tasks exposing more STP.
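The two-part test is straightforward; a minimal sketch, assuming per-task profile totals are available:

    /* Both parts of the task selection threshold must be exceeded
     * (values from the paper: 25 memory-independent instructions per
     * task execution, and 0.2% of all profiled instructions). */
    int select_task(long indep_instrs, long executions, long total_profiled)
    {
        double per_exec = (double)indep_instrs / executions;
        double fraction = (double)indep_instrs / total_profiled;
        return per_exec > 25.0 && fraction > 0.002;
    }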
Our results show a limit to STP exposed by the launched execution of memory-independent tasks. No
changes, such as code motion, were made or assumed to have been made to the original benchmark codes that would increase the number of memory-independent instructions. Overhead due to thread creation, wakeup, and commit will be implementation dependent, and thus is not accounted for. Register
dependences between the preceding code and the launched task were ignored. Therefore, we show an upper
bound to the amount of STP in irregular applications.
5 Results
We investigated the amount of Speculative Task Parallelism under a variety of assumptions about task types, memory conflict granularity, and control and memory dependences. Our starting configuration profiles at the page level (that is, conflicts are memory accesses to the same page) with no explicit speculation, and is thus our most conservative measurement.
5.1 Task Type
We first look to see which task types exhibit the most STP. We then explore various explicit speculation opportunities to find additional sources of STP. Finally, we investigate any additional parallelism that might be exposed by profiling at a finer memory granularity.

Figure 5 (source listing; the profiled conflict source precedes instr-1 on 100% of executions):

    instr-1
    instr-2
    if (cond-1) {            // 90% taken
        instr-4
        instr-5
        if (cond-2) {        // 90% taken
            instr-7
            instr-8
            if (cond-3) {    // 100% taken
                instr-10
                instr-11
                call task A
            } else {
                instr-13
            }
        } else {
            instr-14
        }
    } else {
        instr-15
    }
    instr-16

Figure 5: Control Dependence Speculation: Using the profile information that the conditions are true 90%, 90%, and 100%, respectively, profiled control dependence and speculative control dependence allow task A to be launched outside of the inner if statement. The corresponding CFG displays the launch points as placed by each type of control dependence. The edge frequencies reflect that the code was executed 100 times.
We investigate leaf procedures, non-leaf procedures, and entire natural loops. We profiled these three task types to see if any one task type exhibited more STP than the others. Figure 4 shows that, on average, a total of 7.3% of the profiled instructions were identified as memory-independent, with task type contributions differing by less than 1% of the profiled instructions. This strongly suggests that all task types should be considered for exploiting STP. The succeeding experiments include all three task types in their profiles.
5.2 Explicit Speculation
The starting configuration places launch points conservatively, with no explicit control or memory dependence speculation. Because launched tasks will be implicitly speculative when executed with different datasets, our hypothetical machine must already support speculation and recovery. We explore the level of STP exhibited by explicit speculation: first, by speculating on control dependence, where the launched task may not actually be needed; next, on memory dependence, where the launched task may not always be memory-independent of the preceding code; and finally, on both memory and control dependences, by exploiting the memory-independence of the instructions within the task overlap.
Our starting configuration determines task launch points that preserve static control dependences, such that all paths from the task's launch points lead to the original task call site. Thus, a task that is statically control dependent upon a condition whose outcome is constant, or almost constant, throughout the profile will not be selected, even though launching this task would lead to almost no mis-speculations. We considered two additional control dependence schemes that would be able to exploit the memory-independence of this task.
Profiled control dependences exclude any static control dependences that are based upon branch paths that are never traversed. When task launch points preserve profiled control dependences, all traversed paths from the launch points lead to the original call site.
When task launch points preserve speculative control dependences, all frequently traversed paths from
the launch points lead to the original call site. The amount of speculation is controlled by setting a minimum
frequency percentage, c. For example, when c is set to 90, then at least 90% of the traversed paths from the
launch points must lead to the original call site.
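A small worked check against the numbers in Figure 5 (the function is an illustrative sketch, not the analysis tool's code):

    /* Fraction of traversed paths from a candidate launch point that
     * reach the task call site, checked against the threshold c (percent).
     * Path counts come from the profiled edge frequencies. */
    int allow_launch(long paths_reaching_call, long paths_total, double c)
    {
        return 100.0 * paths_reaching_call / paths_total >= c;
    }

    /* From the block after cond-1 is taken (90 traversals in Figure 5),
     * 81 reach task A: 81/90 = 90% >= c = 90, so the launch point may
     * move above cond-2 and cond-3. From the routine entry (100
     * traversals), only 81 reach task A: 81% < 90, so the launch point
     * may not move above cond-1. */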
In Figure 5, the call statement of task A is statically control dependent on all three if-statements. The
corresponding CFG in Figure 5 highlights the launch points as determined by the three control dependence options. All paths beginning with block H, all traversed paths beginning with block E, and 90% of the traversed paths beginning with block C lead to the call of task A. Therefore, the static control dependence launch point is before block H, the profiled control dependence launch point is before block E, and with c set to 90, the speculative control dependence launch point is before block C.

Figure 6: Memory-independent instructions reported as a percentage of all instructions profiled. The six bars per benchmark are Page/StatControl/ProfMemory, Page/ProfControl/ProfMemory, Page/SpecControl/ProfMemory, Page/ProfControl/SpecMemory, Page/ProfControl/SpecMemory/Synch, and Word/ProfControl/SpecMemory/Synch, where Page = page-level profiling, Word = word-level profiling, ProfControl = profiled control dependence, SpecControl = speculative control dependence, SpecMemory = speculative memory dependence, and Synch = early start with synchronization.
The price of using speculative control dependences will be the waste of resources used to speculatively
initiate a launched task when the executed path does not lead to the task call site. These extra launches can
be squashed at the first mis-speculated conditional.
The first three bars per benchmark in Figure 6 show the effect of control dependence speculation. The bars display static control dependence, profiled control dependence, and speculative control dependence at c = 90, respectively. On average, profiled control dependence exposed an additional 1.3% of dynamic instructions as memory-independent, while speculative control dependence exposed only an additional 0.6% over profiled.
The choice of using profiled or speculative control dependence will be influenced by the underlying architecture, and the degree to which speculative threads compete with non-speculative threads for resources. Further results in this paper use profiled control dependence, due to the low gain from speculative control dependence.
Memory dependence provides another opportunity for explicit speculation. Our starting configuration determines launch points that preserve profiled memory dependences, such that all instructions between each launch point and its original task call site are memory-independent of the task. This approach results in a conservative, but still speculative, placement of launch points.
We also consider the less conservative approach of determining task launch points by speculative memory dependences, which ignores profiled memory conflicts that occur infrequently. The amount of speculation is controlled by setting a minimum frequency percentage, m. For example, when m is set to 90, then at least 90% of the traversed paths from the task launch points to the original call site must be memory-independent of the task. Using speculative memory dependences is especially attractive when PLPs are far apart, and the ones nearest the task call site seldom cause a memory conflict.
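The corresponding per-PLP filter is a one-line frequency test; a sketch, assuming traversal counts are recorded during profiling:

    /* A PLP is ignored under speculative memory dependence when its
     * conflict occurs on too few of the profiled traversals, i.e., when
     * the memory-independent fraction meets the threshold m (percent). */
    int ignore_plp(long conflicting_traversals, long total_traversals,
                   double m)
    {
        double indep = 100.0 * (total_traversals - conflicting_traversals)
                       / total_traversals;
        return indep >= m;
    }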
We examine the effect of task launch points that preserve speculative memory dependence at m = 90 (the fourth bar in Figure 6). Speculative memory dependence provides small increases in parallelism. Despite the small gains, we include task launch points determined by speculative memory dependence for the remaining results.
By placing launch points (futures) at control dependences or memory dependences (PLPs), we have used
the limited synchronization inherent within futures to synchronize these dependences with the beginning
of the speculative task. This limits the amount of STP that we have been able to expose to the number of dynamic instructions between the launch point and the original task call site, which we call the LP overlap. The instructions represented by the task overlap, between the beginning of the speculative task and the earliest profiled conflict destination, are profiled memory-independent of all of the preceding code. By using explicit additional synchronization around the earliest profiled conflict destination, early start with synchronization enables the task overlap to contribute to the number of exposed memory-independent instructions.

Figure 7: Synchronization Points: Gray areas represent memory-independent instructions. (a) A task with a large amount of task overlap on a serial thread. (b) When the dependence is used as a launch point, only the LP overlap contributes to memory-independence. (c) By synchronizing the dependence with the earliest conflict destination, both the LP overlap and the task overlap contribute to memory-independence.
5.2.1 Early Start with Synchronization
Currently, a task with a large task overlap and a small LP overlap would not be selected as memory-independent, even though a large portion of the task is memory-independent of its preceding code. By synchronizing the control or memory dependence with the earliest conflict destination, the task may be launched earlier than the dependence. Where possible, we placed the task launch point above the dependence a distance equal to the task overlap. Any control dependences between the new task launch point and the synchronization point would be handled as speculative control dependences.
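In outline, early start splits the future's implicit synchronization into an explicit release/wait pair; the spec_launch, sync_release, and sync_wait primitives below are hypothetical stand-ins for whatever the speculative machine provides:

    extern void spec_launch(void (*task)(void));
    extern void sync_release(int token);  /* preceding code: dependence resolved */
    extern void sync_wait(int token);     /* task: block until released */

    void task_Y(void) {
        /* ... task overlap: instructions memory-independent of all
           preceding code run without waiting ... */
        sync_wait(0);        /* earliest profiled conflict destination */
        /* ... the conflicting store and the rest of the task follow ... */
    }

    void preceding_code(void) {
        spec_launch(task_Y); /* new, earlier launch point */
        /* ... LP overlap ... */
        sync_release(0);     /* the profiled dependence (ld/st) has executed */
        /* ... remaining preceding code, then wait on Y's future ... */
    }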
Figure 7 illustrates synchronization points. When the dependence determines a launch point, as in Figure 7(b), all memory-independent instructions come from the LP overlap. Figure 7(c) shows that by synchronizing the dependence with the earliest conflict destination, both the LP overlap and the task overlap contribute to the number of memory-independent instructions.
Early start shows the greatest increase in parallelism so far, exposing on average an additional 6.6% of dynamic instructions as memory-independent (the fifth bar per benchmark of Figure 6). The big increase in parallelism came from tasks that had not previously exhibited a significant level of STP, but now are able to exceed our thresholds.
The extra parallelism exposed through early start will come at the cost of additional dynamic instructions
and the cost of explicit synchronization. We did not impose any penalties to simulate those costs as they
will be architecture-dependent.
5.3 Memory Granularity
Figure 8: Conservative vs. Aggressive Speculation: Conservative is page-level profiling on all task types with static control dependence and profiled memory dependence. Aggressive is word-level profiling on all task types with profiled control dependence, speculative memory dependence, and early start.

We define a memory access conflict to occur with two accesses to the same memory region. The memory granularity (the size of these regions) affects the amount of parallelism that is exposed. Reasonable granularities are full bytes, words, cache lines, or pages. A larger memory granularity may result in a more conservative placement of launch points. The actual granularity used will depend on the granularity at which the processor can detect memory ordering violations. Managing profiled-parallel tasks whose launch points were determined with a memory granularity of a page would allow the use of existing page protection mechanisms to detect and recover from dependence violations. Thus, our starting configuration used page-level profiling. We also investigate word-level profiling.

In Figure 6, the last bar shows the results of word-level profiling on top of profiled control dependence, speculative memory dependence, and early start. The average gain in memory-independence across all benchmarks was about 6% of dynamic instructions.
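A conflict test at either granularity reduces to comparing region ids obtained by discarding low-order address bits; a sketch, assuming 8 KB pages (the Alpha page size) and 8-byte words, both sizes being illustrative assumptions:

    /* Two accesses conflict when their region ids match; the id depends
     * on the profiling granularity. */
    unsigned long region_page(unsigned long addr) { return addr >> 13; }
    unsigned long region_word(unsigned long addr) { return addr >> 3;  }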
5.4 Experiment Summary
Figure 8 re-displays the extremes of our STP results, from conservative to aggressive speculation, broken down by task type. The conservative configuration includes page-level profiling on all task types with static control dependence and profiled memory dependence. The aggressive configuration comprises word-level profiling on all task types with profiled control dependence, speculative memory dependence, and early start. M88ksim showed the largest increase in the percentage of memory-independent instructions at over 28%, with vortex very close at over 25%, and the average across benchmarks at about 14%. Each of these increases in parallelism was largely seen in the non-leaf procedures. Ijpeg was the only benchmark to see a sizable increase contributed by leaf procedures. Loops accounted for increases in gcc, go, li, and perl.
Table 1 displays statistics from the conservative and aggressive speculation of those tasks which exceed
our thresholds. The average overlap is that part of the average task length that can be overlapped with other
execution. The number of tasks selected for STP is greatly affected by aggressive speculation.
In this section, early start with synchronization provided the highest single increase among all alternatives. Speculative systems with fast synchronization should be able to exploit STP the most effectively. Our results also indicate that a low-overhead word-level scheme to exploit STP would be profitable.
6 Implementation Issues
For our limit study of Speculative Task Parallelism, we have assumed a hypothetical speculative machine
that supports speculative futures with mechanisms for resolving incorrect speculation. When implementing
this machine, a number of issues need to be addressed.
              Conservative                        Aggressive
           Selected  Avg Dynamic   Average   Selected  Avg Dynamic   Average
           Tasks     Task Length   Overlap   Tasks     Task Length   Overlap
compress       0           0           0        1          282          42
gcc           22         412          93       74          363          89
go            12         546         118       49          334          76
ijpeg          5         803         300       12          996         266
li             3          50          50        9        14198          55
m88ksim        5          34          29       20          123          43
perl           0           0           0       10          515          42
vortex        14         114          45       92          215          47
average        8         245          79       33         2128          83

Table 1: Task Statistics (Conservative vs. Aggressive)
Speculative Thread Management
Any system that exploits STP would need to include instructions for initialization, synchronization, com-
munication and termination of threads. As launched tasks may be speculative, any implementation would
need to handle mis-speculations.
Managing speculative tasks would include detecting load/store conflicts between the preceding code and the launched task, buffering stores in the launched task, and checking for memory-independence before committing the buffered stores to memory. One conflict detection model tracks the load and store addresses in both the preceding code and the launched task. The amount of memory-independence accommodated by this model will be determined by the size and access characteristics of the load/store address storage, and by the conflict granularity.
Another conflict detection model uses a system's page-fault mechanism. When static analysis can determine the page access pattern of the preceding code, the launched task is given restricted access to those pages, while the preceding code is given access to only those pages. Any page access violation would cause the speculative task to fail.
Inter-thread Communication
Any implementation that exploits STP will benefit from a system with fast communication between threads.
At the minimum, inter-thread communication is needed at the end of a launched task and when the task
results are used. Fast communication would be needed to enable early start with synchronization. The
ability to quickly communicate a mis-speculation would reduce the number of instructions that are issued
but never committed. This is especially important for systems where threads compete for the same resources.
Adaptive STP
We select tasks that exhibit STP based upon memory access profiling and compiler analysis. The memory access pattern from one dataset may or may not be a good predictor for another dataset. Two feedback opportunities arise that allow the execution of another dataset to adapt to differences from the profiled dataset. The first would monitor the success rate of each launch point, and suspend further launches from a launch point that fails too frequently.
The second feedback opportunity is found in continuous profiling. Rather than have a single dataset dictate the launched tasks for all subsequent runs, let datasets from all previous runs dictate the launched tasks for the current run. It is possible that the aggregate information from the preceding runs would have a better predictive relationship with future runs. Additionally, the profiled information from the current run
could be used to supersede the profiled information from previous runs, with the idea that the current run may be its own best predictor. Although profiling is expensive and must be optimized, the exact cost is beyond the scope of this paper.
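A sketch of the first feedback mechanism (the LaunchStats structure, the 16-launch warm-up, and the 50% cutoff are all illustrative assumptions; the paper does not fix these parameters):

    /* Monitor each launch point's success rate and suspend launches
     * from points that mis-speculate too often. */
    typedef struct {
        long launches;
        long successes;
        int  suspended;
    } LaunchStats;

    void record_outcome(LaunchStats *lp, int success)
    {
        lp->launches++;
        lp->successes += success;
        if (lp->launches >= 16 &&            /* wait for a minimal sample */
            2 * lp->successes < lp->launches)
            lp->suspended = 1;               /* failing > 50%: stop launching */
    }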
7 Summary
Traditional parallel compilers do not effectively parallelize irregular applications because they contain little loop-level parallelism due to ambiguous memory references. A different source of parallelism, namely Speculative Task Parallelism (STP), arises when a task (either a leaf procedure, a non-leaf procedure, or an entire loop) is control- and memory-independent of its preceding code, and thus could be executed in parallel. To
exploit STP, we assume a speculative machine that supports speculative futures (a parallel programming
construct that executes a task early on a different thread or processor) with mechanisms for resolving incor-
rect speculation when the task is not, after all, independent. This allows us to speculatively parallelize code
when there is a high probability of independence, but no guarantee.
Through profiling and compiler analysis, we find memory-independent tasks that have no memory conflicts with their preceding code, and thus could be speculatively executed in parallel. We estimate the amount of STP in an irregular application by measuring the number of memory-independent instructions these tasks expose. We vary the level of control dependence and memory dependence speculation to investigate their effect on the amount of memory-independence found. We profile at different memory granularities and introduce synchronization to expose higher levels of memory-independence.
We find that no one task type exposes significantly more memory-independent instructions than the others, which strongly suggests that all three task types should be profiled for STP. We also find that starting a task early, with synchronization around dependences, exposes the largest additional amount of memory-independent instructions: an average across the SPECint95 benchmarks of 6.6% of profiled instructions. Profiling memory conflicts at the word level shows a similar gain in comparison to page-level profiling. Speculating beyond profiled memory and static control dependences shows the lowest gain, which is modest at best. Overall, we find that 7 to 22% of instructions are within memory-independent tasks. The lower amount reflects tasks launched in parallel from the least speculative locations.
8 Acknowledgments
This work has been supported in part by DARPA grant DABT63-97-C-0028, NSF grant CCR-980869,
equipment grants from Compaq Computer Corporation, and La Sierra University.
References
[1] H. Akkary and M. Driscoll. A dynamic multithreading processor. In 31st International Symposium on Microar-
chitecture, Dec. 1998.
[2] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith. The Tera computer system. In 1990 International Conference on Supercomputing, June 1990.
[3] M. Franklin and G. S. Sohi. ARB: A hardware mechanism for dynamic reordering of memory references. IEEE
Transactions on Computers, May 1996.
[4] D. M. Gallagher, W. Y. Chen, S. A. Mahlke, J. C. Gyllenhaal, and W. W. Hwu. Dynamic memory disambiguation using the memory conflict buffer. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, Oct. 1994.
[5] L. Gwennap. Intel discloses new IA-64 features. Microprocessor Report, Mar. 8 1999.
[6] L. Hammond, M. Willey, and K. Olukotun. Data speculation support for a chip multiprocessor. ACM SIGPLAN Notices, 33(11):58–69, Nov. 1998.
[7] S. Keckler, W. Dally, D. Maskit, N. Carter, A. Chang, and W. Lee. Exploiting fine-grain thread level parallelism on the MIT Multi-ALU processor. In Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA-98), pages 306–317, June 1998.
[8] M. S. Lam and R. P. Wilson. Limits of control flow on parallelism. In Proceedings of the 19th Annual International Symposium on Computer Architecture (ISCA-92), May 1992.
[9] R. H. Litton, J. A. D. McWha, M. W. Pearson, and J. G. Cleary. Block based execution and task level parallelism. In Australasian Computer Architecture Conference, Feb. 1998.
[10] P. Marcuello and A. Gonzalez. Clustered speculative multithreaded processors. In Proceedings of the [ACM]
International Conference on Supercomputing, June 1999.
[11] P. Marcuello and A. Gonzalez. Exploiting speculative thread-level parallelism on a SMT processor. In Proceed-
ings of the International Conference on High Performance Computing and Networking, April 1999.
[12] A. Moshovos, S. E. Breach, T. N. Vijaykumar, and G. S. Sohi. Dynamic speculation and synchronization of data
dependences. In Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA-97),
June 1997.
[13] S. S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Publ., San Francisco,
1997.
[14] J. T. Oplinger, D. L. Heine, and M. S. Lam. In search of speculative thread-level parallelism. In Proceedings
of the 1999 International Conference on Parallel Architectures and Compilation Techniques (PACT99), October
1999.
[15] J. T. Oplinger, D. L. Heine, S. Liao, B. A. Nayfeh, M. S. Lam, and K. Olukotun. Software and hardware for exploiting speculative parallelism with a multiprocessor. Technical Report CSL-TR-97-715, Stanford University, Computer Systems Laboratory, 1997.
[16] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar. Multiscalar processors. In Proceedings of the 22nd Annual
International Symposium on Computer Architecture (ISCA-95), June 1995.
[17] A. Srivastava and A. Eustace. ATOM: A system for building customized program analysis tools. Research
Report 94.2, COMPAQ Western Research Laboratory, 1994.
[18] J. G. Steffan and T. C. Mowry. The potential of using thread-level data speculation to facilitate automatic parallelization. In Proceedings of the 4th International Symposium on High-Performance Computer Architecture, Feb. 1998.
[19] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In
Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA-95), June 1995.
[20] T. N. Vijaykumar and G. S. Sohi. Task selection for a multiscalar processor. In 31st International Symposium on
Microarchitecture, Dec. 1998.
[21] S. Wallace, B. Calder, and D. Tullsen. Threaded multiple path execution. In Proceedings of the 25th Annual
International Symposium on Computer Architecture (ISCA-98), June 1998.