Performance Evaluation of Thread-Level
Speculation in Off-the-Shelf Hardware
Transactional Memories
Juan Salamanca1, José Nelson Amaral2, and Guido Araujo1
1 Institute of Computing, UNICAMP, Campinas-SP, Brazil
{juan,guido}@ic.unicamp.br
2 Computing Science Department, University of Alberta, Edmonton-AB, Canada
amaral@cs.ualberta.ca
Abstract. Thread-Level Speculation (TLS) is a hardware/software tech-
nique that enables the execution of multiple loop iterations in parallel,
even in the presence of some loop-carried dependences. TLS requires
hardware mechanisms to support conflict detection, speculative storage,
in-order commit of transactions, and transaction roll-back. There is no
off-the-shelf processor that provides direct support for TLS. Speculative
execution is supported, however, in the form of Hardware Transactional
Memory (HTM) — available in recent processors such as the Intel Core
and the IBM POWER8. Earlier work has demonstrated that, in the ab-
sence of specific TLS support in commodity processors, HTM support
can be used to implement TLS. This paper presents a careful evaluation
of the implementation of TLS on the HTM extensions available in such
machines. This evaluation provides evidence to support several impor-
tant claims about the performance of TLS over HTM in the Intel Core
and the IBM POWER8 architectures. Experimental results reveal that
by implementing TLS on top of HTM, speed-ups of up to 3.8× can be
obtained for some loops.
Keywords: Thread-Level Speculation, Transactional Memory
1 Introduction
Loops account for most of the execution time in programs and thus extensive
research has been dedicated to parallelize loop iterations [2]. Unfortunately, in
many cases these efforts are hindered when the compiler cannot prove that a
loop is free of loop-carried dependences. However, sometimes when static analy-
sis concludes that a loop has a may dependence — for example when the analysis
cannot resolve a potential alias relation — the dependence may actually not exist
or it may occur in very few executions of the program [12]. Thread-Level Spec-
ulation (TLS) is a promising technique that can be used to enable the parallel
execution of loop iterations in the presence of may loop-carried dependences.
TLS assumes that the iterations of a loop can be executed in parallel — even
in the presence of potential dependences — and then relies on a mechanism to
detect dependence violations and correct them. The main distinction between
TLS and HTM is that in TLS speculative transactions must commit in order.
Recently hardware support for speculation has been implemented in com-
modity off-the-shelf microprocessors [3, 4]. However, the speculation support in
these architectures was designed with Hardware Transactional Memory (HTM)
in mind and not TLS. The only implementation of hardware support for TLS to
date is in the IBM Blue Gene/Q (BG/Q), a machine that is not readily available
for experimentation or usage. HTM extensions, available in the Intel Core and in
the IBM POWER8 architectures, allow for the speculative execution of atomic
program regions [3–5]. Such HTM extensions enable the implementation of three
key features required by TLS: (a) conflict detection; (b) speculative storage; and
(c) transaction roll-back.
Until now, the majority of the attempts to estimate the performance ben-
efits of TLS were based on simulation studies [10, 11]. Unfortunately, studies
of TLS execution based on simulation have serious limitations. The availability
of speculation support in commodity processors allowed for the first study of
TLS on actual hardware and led to some interesting research questions: (1) can
the existing speculation support in commodity processors, originally designed
for HTM, be used to support TLS? and (2) if it can, what performance effects
would be observed from such implementations? Earlier work has provided a cau-
tiously positive answer to the first question, i.e. supporting TLS on top of HTM
hardware is possible, but it requires several careful software adaptations [9]. To
address the second question, this paper presents a careful evaluation of the im-
plementation of TLS on top of the HTM extensions available in the Intel Core
and in the IBM POWER8. This evaluation uses the same loops from an earlier
study by Murphy et al. [6] and led to some interesting discoveries about the
relevance of loop characterization to predict the potential performance of TLS.
The experimental results indicate that: (1) small loops are not amenable to be
parallelized with TLS on the existing HTM hardware because of the expensive
overhead of: (a) starting and finishing transactions, (b) aborting a transaction,
and (c) setting up the loop for TLS execution; (2) loops with the potential to be success-
fully parallelized in both Intel Core and IBM POWER8 architectures have better
performance on the POWER8 because TLS can take advantage of the ability
of this architecture to suspend and resume transactions to implement ordered
transactions; (3) the larger storage capacity for speculative state in Intel TSX
can be crucial for loops that execute many read and write operations; (4) the
ability to suspend/resume a transaction is important for loops that execute for
a longer time because their transactions may abort due to OS context switching;
and (5) the selected strip size can be critical in determining the number of aborts
due to order inversion.
The remainder of this paper is organized as follows. Section 2 describes
the relevant aspects of the implementation of HTM in both Intel Core and
IBM POWER8 architectures. Section 3 details the related work. Benchmarks,
methodology and settings are described in Section 4. Finally, Section 5 shows
experimental results and a detailed analysis.
Table 1: HTM implementations of Intel Core and IBM POWER [7].
Processor type                               Intel i7-4770   POWER8
Conflict-detection granularity (cache line)  64 B            128 B
Tx Load Capacity                             4 MB            8 KB
Tx Store Capacity                            22 KB           8 KB
L1 Data Cache                                32 KB, 8-way    64 KB
L2 Data Cache                                256 KB          512 KB, 8-way
SMT level                                    2               8
Table 2: HTM Architectural Features.
Features TLS Intel P8
Eager Conflict Detection
Resolution Conflict Policy
Ordered Transactions
Multi-versioned caches
Suspend/Resume
Lazy Conflict Detection
Data Forwarding
Word Conflict Detection
2 How to support TLS over HTM
This section reviews HTM extensions and discusses how they can be effectively
used to support the TLS execution of hard-to-parallelize loops containing (may)
loop-carried dependences.
2.1 Intel Core and IBM POWER8
Transactional memory systems must provide transaction atomicity and isolation,
which require the implementation of the following mechanisms: data versioning
management, conflict detection, and conflict resolution [9].
Both Intel and IBM architectures provide instructions to begin and end a
transaction, and to force a transaction to abort. To perform such operations, the
Intel Core's Transactional Synchronization Extensions (TSX) implement Restricted
Transactional Memory (RTM), an instruction set that includes xbegin,
xend, and xabort. The corresponding instructions in the POWER8 are tbegin,
tend, and tabort.
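For concreteness, the sketch below (illustrative code, not taken from the evaluated implementation) shows how the RTM instructions are typically reached from C through the <immintrin.h> intrinsics _xbegin, _xend, and _xabort; the retry count and the fallback function are assumptions of this example.

#include <immintrin.h>   /* _xbegin, _xend, _xabort, _XBEGIN_STARTED; compile with -mrtm */

#define MAX_RETRIES 3

/* Execute work() inside an RTM transaction; fall back to a
   non-speculative path after MAX_RETRIES aborts. */
static void run_transactional(void (*work)(void), void (*fallback)(void))
{
    for (int retries = 0; retries < MAX_RETRIES; retries++) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            work();      /* speculative region: loads and stores are tracked */
            _xend();     /* commit the transaction */
            return;
        }
        /* status encodes the abort cause: conflict, capacity, explicit _xabort, ... */
    }
    fallback();          /* non-speculative path when the HTM keeps aborting */
}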
All data conflicts are detected at the granularity of the cache line size because
both processors use cache mechanisms — based on physical addresses — and the
cache coherence protocol to track transactional states. Aborts may be caused by:
memory access conflicts, capacity issues due to excessively large transactional
read/write sets or overflow, conflicts due to false sharing, and OS and micro-
architecture events (e.g. system calls, interrupts, or traps) [4,7].
The main differences between POWER8 and the Intel Core HTMs, summa-
rized in Table 1, are: (1) transaction capacity; (2) conflict granularity; and (3)
ability to suspend/resume a transaction. The maximum amount of data that
can be accessed by a transaction in the Intel Core is much larger than in the
POWER8. This speculative storage capacity is limited by the resources needed
both to store read and write sets, and to buffer transactional stores.
In POWER8 the execution of a transaction can be paused through the use
of suspended regions — implemented with two new instructions: tsuspend and
tresume. As described in [9], this mechanism enables the implementation of an
ordered-transaction feature in TLS [5].
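On top of these instructions an ordered commit can be built in software: the transaction is suspended, the thread spins on a shared commit counter outside of transactional tracking, and the transaction is resumed and committed only when all earlier iterations have finished. The sketch below is a minimal illustration of that idea using the GCC PowerPC HTM built-ins (compiled with -mhtm); the variable next_to_commit and the function name are assumptions of this sketch, not the code of [9].

extern volatile unsigned long next_to_commit;  /* iteration allowed to commit next */

/* Called at the end of a transaction started with __builtin_tbegin(0). */
static inline void ordered_commit(unsigned long my_iter)
{
    __builtin_tsuspend();              /* pause the transaction: the spin loop below is
                                          not added to the transactional read set      */
    while (next_to_commit != my_iter)
        ;                              /* wait until all earlier iterations committed  */
    __builtin_tresume();               /* re-enter transactional state                 */
    __builtin_tend(0);                 /* commit the speculative stores                */
    next_to_commit = my_iter + 1;      /* pass the commit token to the next iteration  */
}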
for (count = 0; count < WEIGHT; count++) {
  /* Start sequential segment 1 */
  if (cond) glob++;                     /* Global scalar */
  /* End sequential segment 1 */

  /* Start sequential segment 2 */
  for (i = 0; i < factor; i++) {
    /* Global array, A */
    int tmp = A[factor*(count%4) + i];
    tmp += count*5;
    if (tmp % 2 == 0) {
      A[factor*(count%4) + i] = tmp;
    }
  }
  /* End sequential segment 2 */
}
Fig. 1: A loop with two may loop-carried
dependences. Adapted from [6].
d = STRIP_SIZE;
inc = (NUM_THREADS - 1) * STRIP_SIZE;
count = param->count;

for (; count < WEIGHT; count += inc) {
  prev_count = count;
Retry:
  if (!BEGIN()) {
    for (; count - prev_count < d &&
           count < WEIGHT; count++) {
      if (cond) glob++;
    }
    END();
  }
  else goto Retry;
}
Fig. 2: Code of each thread to paral-
lelize Figure 1's loop with TLS on an
ideal HTM system.
2.2 Thread-Level Speculation
Thread-Level Speculation (TLS) has been widely studied [10,11]. Proposed TLS
hardware systems must support four primary features: (a) data conflict detec-
tion; (b) speculative storage; (c) ordered transactions; and (d) rollback when
a conflict is detected. Some of these features are also supported by the HTM
systems found in the Intel Core and the POWER8, and thus these architectures
have the potential to be used to implement TLS. Table 2 shows the features
required to enable TLS on top of an HTM mechanism, and their availability
in some modern architectures. Neither Intel TSX nor the IBM
POWER8 provide all the features necessary to carry out TLS effectively [9].
Let us examine how TLS can be applied to a simplified version of the loop
example of Figure 1 (the inner loop is omitted) when it runs on top of an
ideal HTM system providing: (a) ordered transactions in hardware; (b) a multi-
versioning cache; (c) eager conflict detection; and (d) a conflict-resolution policy.
Figure 2 shows the loop after it was strip-mined and parallelized for TLS on four
cores. Assume that the END instruction implements: (a) ordered transactions, i.e.,
a transaction executing an iteration of the loop has to wait until all transactions
executing previous iterations have committed, and (b) a conflict-resolution pol-
icy that gives preference to the transaction that is executing the earliest iteration
of the loop while rolling back later iterations. Multi-versioning allows for the re-
moval of Write-After-Write (WAW) and Write-After-Read (WAR) loop-carried
dependences on the glob variable. As shown in Figure 3, in the first four it-
erations cond evaluates false and the iterations finish without aborts. Then, at
iteration 4, the eager-conflict detection mechanism detects the RAW loop-carried
dependence violation on variable glob between iterations 4 and 5, thus rolling
back iteration 5 because it should occur after iteration 4. Subsequent iterations
wait for the previous iterations to commit.
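A minimal sketch of how the BEGIN()/END() pair of Figure 2 can be realized on Intel TSX, following the software-ordering scheme of [8, 9], is shown below; unlike Figure 2, END() here takes the iteration number explicitly, and the names next_iter and ORDER_INVERSION are assumptions of this example rather than the authors' code.

#include <immintrin.h>

#define ORDER_INVERSION 0xA1             /* illustrative explicit-abort code */

extern volatile unsigned long next_iter; /* lowest iteration not yet committed */

static inline int BEGIN(void)
{
    return _xbegin() != _XBEGIN_STARTED; /* 0 when the transaction starts, non-zero on abort */
}

static inline void END(unsigned long my_iter)
{
    if (next_iter != my_iter)      /* an earlier iteration has not committed yet:      */
        _xabort(ORDER_INVERSION);  /* roll back this out-of-order transaction (retry)  */
    _xend();                       /* commit in order                                  */
    next_iter = my_iter + 1;       /* allow the next iteration to commit               */
}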
Fig. 3: Execution flow of Figure 2's code with STRIP_SIZE=1 and NUM_THREADS=4.
3 Previous Research on TLS
Murphy et al. [6] propose a technique to speculatively parallelize loops that ex-
hibit transient loop-carried dependences — a loop where only a small subset of
loop iterations have actual loop-carried dependences. The code produced by their
technique uses a TM hardware (TCC hardware) and software (Tiny STM) model
running on top of the HELIX time emulator. They developed three approaches
to predict the performance of implementing TLS on the HELIX time emulator:
coarse-grained, fine-grained, and judicious. The coarse-grained approach spec-
ulates a whole iteration while the fine-grained approach speculates sequential
segments and executes parallel segments without speculation. The judicious ap-
proach uses profile data at compile time to choose which sequential segment
to speculate or synchronize so as to satisfy (may) loop-carried dependences.
They conclude that TLS is not only advantageous to overcome limitations of the
compiler static data-dependence analysis, but that performance might also be
improved by focusing on the transient nature of dependences.
Murphy et al. evaluated TLS on emulated HTM hardware using cBench pro-
grams [1] and, surprisingly, predicted up to 15 times performance improvements
with 16 cores [6]. They arrived at these predictions even though they did not
use strip mining to decrease the overhead of starting and finishing transactions
as previous work suggested [8,9]. In particular, fine-grained speculation without
strip mining can result in large overheads due to multiple transactions (sequen-
tial segments) per iteration, even larger than coarse-grained speculation. They
parallelized loops in a round-robin fashion which can result in small transactions,
large number of transactions, high abort ratio, bad use of memory locality, and
false sharing. Their over-optimistic predictions are explained by the fact that
their emulation study does not take into account the overhead of setting TLS
up — which is especially high without strip mining. For instance, their emulation
study predicted speed-ups even for small loops. However, when executing such
loops in real hardware, the TLS overhead — setup, begin/end transactions, and
aborts — would nullify any gain from parallel execution.
Odaira and Nakaike, Murphy et al. and Salamanca et al. use coarse-grained
TLS to speculate a (strip-mined) whole iteration and perform conflict detection
and resolution at the end of the iteration to detect RAW dependence viola-
tions [6, 8, 9]. The advantages of coarse-grained TLS are: (a) it is simple to im-
plement because it does not need an accurate data-dependence analyzer; (b) the
number of transactions is smaller than or equal to that of the fine-grained or judicious
approaches; and (c) there is no synchronization in the middle of an iteration.
The downside is that even a single frequent actual loop-carried dependence will
cause transactions to abort and serialize the execution. To illustrate this, assume
an execution of the example of Figure 1 where cond always evaluates to true, and
thus the glob variable is increased at each iteration of the outer loop. With
coarse-grained TLS the execution of this outer loop would be serialized.
Salamanca et al. describe how speculation support designed for HTM can
also be used to implement TLS [9]. They focused their work on the impact
of false sharing and the importance of judicious strip mining and privatization
to achieve performance. They provide a detailed description of the additional
software support that is necessary in both the Intel Core and the IBM POWER8
architectures to support TLS. This paper uses that method to carefully evaluate
the performance of TLS on Intel Core and POWER8 using 22 loops from cBench,
focusing on the characterization of the loops. This loop characterization could
be used in the future to decide if TLS should be used for a given loop.
4 Benchmarks, Methodology and Experimental Setup
The performance assessment reports speed-ups and abort/commit ratios (Trans-
action Outcome) for the coarse-grained TLS parallelization of loops from the
Collective Benchmark (cBench) suite [1] running on Intel Core and
IBM POWER8. For all experiments the default input is used for the cBench
benchmarks. The baseline for speed-up comparisons is the serial execution of
the same benchmark program compiled at the same optimization level. Loop
times are compared to calculate speed-ups. Each software thread is bound to
one hardware thread (core) and executes a determined number of pre-assigned
iterations. Each benchmark was run twenty times and the average time is used.
Runtime variations were negligible and are not presented.
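As an illustration of the thread binding used in this methodology, a Linux software thread can be pinned to a given hardware thread with the sketch below (a standard use of pthread_setaffinity_np, not the authors' exact harness).

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling software thread to hardware thread `cpu` (0-based). */
static int bind_to_hardware_thread(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}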
Loops from cBench were instrumented with the necessary code to implement
TLS, following the techniques described by Salamanca et al. [9]. They were
then executed on an Intel Core i7-4770 and an IBM POWER8 machine,
and their speed-ups measured with respect to sequential execution. Based on
the experimental results, the loops studied are placed in four classes that will be
explained later. Table 3 lists the twenty-two loops from cBench used in this study.
The table shows (1) the loop class (explained later); (2) the ID of the loop in
this study; (3) the ID of the loop in the previous study [6]; (4) the benchmark of
the loop; (5) the file/line of the target loop in the source code; (6) the function
Table 3: Loops extracted from cBench applications.
Class Loop Previous ID Benchmark Location Function %Cov Invocations
I
A 14 automotive bitcount bitcnts.c,65 main1 100% 560
B 18 automotive susan c susan.c,1458 susan corners 83% 344080
C 22 automotive susan e susan.c,1118 susan edges 18% 165308
D 24 automotive susan e susan.c,1057 susan edges 56% 166056
E 28 automotive susan s susan.c,725 susan smoothing 100% 22050
F 15 automotive bitcount bitcnts.c,59 main1 100% 80
II
G 19 automotive susan c susan.c,1457 susan corners 83% 782
H 23 automotive susan e susan.c,1117 susan edges 18% 374
I 25 automotive susan e susan.c,1056 susan edges 56% 374
J 29 automotive susan s susan.c,723 susan smoothing 100% 49
III
K 1 consumer jpeg c jfdcint.c,154 jpeg fdct islow 5% 1758848
L 2 consumer jpeg c jfdcint.c,219 jpeg fdct islow 5% 1758848
M 4 consumer jpeg c jcphuff.c,488 encode mcu AC first 10% 5826184
N 6 consumer jpeg d jidcint.c,171 jpeg idct islow 14% 7280000
O 7 consumer jpeg d jidcint.c,276 jpeg idct islow 15% 7280000
P 13 automotive bitcount bitcnts.c,96 bit shifter 35% 90000000
Q 16 automotive susan c susan.c,1615 susan corners 7% 344080
R 26 automotive susan s susan.c,735 susan smoothing 96% 198450000
S 34 security rijndael d aesxam.c,209 decfile 7% 31864729
T 3 consumer jpeg c jccolor.c,148 rgb ycc convert 10% 439712
U 5 consumer jpeg c jcphuff.c,662 encode mcu AC refine 17% 5826184
Others V 17 automotive susan c susan.c,1614 susan corners 7% 782
Table 4: Characterization and TLS Execution of Classes.
Class Loop Loop Characterization TLS Execution
ID NIntel’s Tbody Intel’s Tloop %lc Read Size Write Size Privatization Intel Core IBM POWER8 Speed-ups in [6]
(ns) (ns) avg max avg max ss Duration (ns) Speed-up ss Speed-up C F J
I
A 1125000 5.0 5680000 0% 12 B 24 B 0 B 20 B Reduction 502 2600.0 2.20 502 3.80 14.0 14.3 14.3
B 590 12.7 7500 0% 48 B 176 B 0 B 36 B No 59 749.0 1.20 59 1.59 10.2 12.0 12.0
C 592 8.1 4810 0% 14 B 192 B 0 B 32 B Array 72 584.0 1.20 68 1.21 7.5 8.0 8.0
D 594 14.1 8420 0% 76 B 176 B 0 B 28 B Array 88 1240.0 1.28 72 2.22 13.0 15.0 15.0
E 600 198.0 118000 0% 14 B 192 B 0 B 32 B Array 15 2970.0 1.60 15 3.18 14.0 15.0 15.0
F 7 5840000.0 40800000 0% 48 B 268 B 155 B 604 B Array 1 5840000.0 0.98 2 2.40 1.0 2.5 2.5
II
G 440 7710.0 3390000 0% 2 KB 3 KB 29 B 328 B No 1 7710.0 1.23 1 1.15 13.0 15.0 15.0
H 442 4790.0 2120000 0% 3 KB 8 KB 37 B 260 B Array 1 4790.0 2.09 2 0.84 12.0 13.8 13.8
I 444 8680.0 3850000 0% 4 KB 4 KB 206 B 1 KB Array 2 17300.0 1.76 1 1.05 13.0 15.0 15.0
J 450 117000.0 52900000 0% 3 KB 8 KB 37 B 260 B Array 1 117000.0 1.89 1 0.73 0.5 1.0 1.0
III
K 8 8.7 69 0% 16 B 32 B 16 B 32 B Array 1 8.7 0.07 1 0.03 5.5 6.0 6.0
L 8 8.5 68 0% 16 B 32 B 16 B 32 B Array 1 8.5 0.06 1 0.03 5.5 6.0 6.0
M 38 5.4 205 100% 12 B 68 B 4 B 36 B Scalar 1 5.4 0.07 1 0.02 0.5 1.0 0.5
N 8 8.1 65 0% 23 B 64 B 16 B 32 B Array 1 8.1 0.05 1 0.05 4.0 4.2 4.2
O 8 9.4 75 0% 24 B 68 B 5 B 16 B Array 1 9.4 0.07 1 0.05 5.8 6.0 6.0
P 23 1.1 26 0% 4 B 12 B 4 B 16 B Reduction 3 3.4 0.02 3 0.02 1.0 2.3 2.3
Q 590 1.0 567 0.14% 4 B 212 B 0 B 36 B Scalar 118 113.0 0.46 95 0.49 9.0 8.5 8.5
R 15 1.8 27 0% 12 B 68 B 4 B 56 B Reduction 10 18.2 0.05 10 0.04 4.0 4.0 4.0
S 16 1.3 21 0% 7 B 8 B 4 B 16 B Array 2 2.6 0.02 2 0.01 1.0 3.0 3.0
T 162 2.5 404 0% 40 B 44 B 12 B 24 B Array & Scalar 8 19.9 0.15 30 0.33 11.0 11.0 2.0
U 63 4.6 289 30% 7 B 8 B 4 B 20 B Scalar 9 41.4 0.20 10 0.16 10.0 11.0 11.0
Others V 440 511.0 225000 34% 1 KB 4 KB 20 B 196 B Scalar 1 511.0 1.25 1 1.34 2.5 2.5 1.0
where the loop is located; (7) %Cov, the fraction of the total execution time
spent in this loop; and (8) the number of invocations of the loop in the whole
program.
This study uses an Intel Core i7-4770 processor with 4 cores with 2-way SMT,
running at 3.4 GHz, with 16 GB of memory on Ubuntu 14.04.3 LTS (GNU/Linux
3.8.0-29-generic x86 64). The cache-line prefetcher is enabled (by default). Each
core has a 32 KB L1 data cache and a 256 KB L2 unified cache. The four
cores share an 8 MB L3 cache. The benchmarks are compiled with GCC 4.9.2
at optimization level -O3 and with the set of flags specified in each benchmark
program.
The IBM processor used is a 4-core POWER8 with 8-way SMT running at 3
GHz, with 16 GB of memory on Ubuntu 14.04.5 (GNU/Linux 3.16.0-77-generic
ppc64le). Each core has a 64 KB L1 data cache, a 32 KB L1 instruction cache,
a 512 KB L2 unified cache, and an 8 MB L3 unified cache. The benchmarks
are compiled with the XL 13.1.1 compiler at optimization level -O2.
5 Classification of Loops Based on TLS Performance
The cBench loops were separated into four classes according to their performance
when executing TLS on top of HTM. The following features, shown in Table 4,
characterize the loops: (1) N, the average number of loop iterations; (2) Tbody,
the average time in nanoseconds of a single iteration of the loop on Intel Core;
(3) Tloop, which is Tbody × N; (4) %lc, the percentage of iterations that have loop-carried
dependences for the default input; and (5) the average (and maximum) size in bytes
read/written by an iteration. The right side of Table 4 describes the TLS execution:
(1) the type of privatization used within the transaction in the TLS implementation;3
(2) ss, the strip size used for the experimental evaluation in the Intel Core; (3) the
Transaction Duration in the Intel Core, which is the product ss × Tbody; (4) the
average speed-ups with four threads for Intel Core after applying TLS; (5) the ss
for POWER8; (6) the speed-ups for POWER8; and (7) the predicted speed-up
from TLS emulation reported in [6] for coarse-grained (C), fine-grained (F), and
judicious (J) speculation using 16 cores.
For all the loops included in this study N > 4, thus they all have enough
iterations to be distributed to the four cores in each architecture. When the
duration of a loop, Tloop, is too short there is not enough work to parallelize
and the performance of TLS is low — in the worst case, loopS, TLS can be
100 times slower than the sequential version. Even a small percentage of loop-
carried dependences, %lc, materializing at runtime may have a significant effect
on performance depending on the distribution of the loop-carried dependences
throughout loop iterations at runtime; thus TLS performance for those loops is
difficult to predict. The size of the read/write set in each transaction can also
lead to performance degradation because of capacity aborts. For the Intel Core
the duration of each transaction is important: rapidly executing many small
transactions leads to an increase of order-inversion aborts4. The number of such
aborts is lowest for medium-sized transactions that have balanced iterations —
when the duration of different iterations of the loop varies, the number of order-
inversion aborts also increases. Finally, long transactions in both architectures
may cause aborts due to traps caused by the end of the OS quantum.
5.1 Class I: Low speculative demand and better performance in
POWER8
The speculative storage requirement of loops in this class is below 2 KB and
thus they are amenable for TLS, and see speed-ups, in both architectures. A
sufficiently small speculative-storage requirement is more relevant for POWER8
which has smaller speculative-storage capacity (see Table 1). These loops also
result in better scaling in POWER8, when compared to Intel Core, because
they can take advantage of the suspend and resume instructions of POWER8
3 A Reduction privatization is a scalar privatization of a reduction operation.
4 An order-inversion abort rolls back, via an explicit abort instruction (xabort), a
transaction that completes its execution out of order.
Fig. 4: Speed-ups and Abort ratios for TLS execution on TSX and POWER8.
to implement ordered transactions in software. They do not scale much beyond
two threads on the Intel Core due to the lack of support for ordered transactions.
Table 4 shows the characterization of Class I. These loops typically provide
a sufficient number of iterations to enable their distribution among the threads.
They also have a relatively moderate duration, as shown by the Tloop values,
and thus they have enough work to be parallelized. TLS makes the most sense when
the compiler cannot prove that iterations are independent but dependences do
not occur at runtime; therefore, most loops that are amenable to TLS (loops in
Classes I and II) have a %lc of zero.
for (i = 0; i < FUNCS; i++) {                  // loopF
  for (j = n = 0, seed = 1; j < iterations;
       j++, seed += 13)                        // loopA
    n += pBitCntFunc[i](seed);
  if (print)
    printf("%-38s> Bits: %ld\n", text[i], n);
}
Fig. 5: loopA and loopF
for (is = 0; is < FUNCS; is += STRIP_SIZE) {   // loopF
  for (i = is; i - is < STRIP_SIZE && i < FUNCS; i++)
    for (j = n_arr[i] = 0, seed = 1; j < iterations;
         j++, seed += 13)                      // loopA
      n_arr[i] += pBitCntFunc[i](seed);
  if (print)
    for (i = is; i - is < STRIP_SIZE && i < FUNCS; i++)
      printf("%-38s> Bits: %ld\n", text[i], n_arr[i]);
}
Fig. 6: loopF after applying strip mining and dividing it into two components.
A typical example of a loop in Class I is loopA, shown in Figure 5. This loop
achieves speed-ups of up to 3.8× with four threads. This loop calls the same bit-
counting function with different inputs for each iteration. Even though this loop
has may loop-carried dependences inside the functions called, none of these de-
pendences materialize at runtime. A successful technique to parallelize this loop
relies on the privatization of variable n and partial accumulation of results to a
global variable after each transaction commits. The successful parallelization of
loopA stems from a moderate duration (Tloop), no actual runtime dependences,
and a read/write set size that is supported by the HTM speculative-storage ca-
pacity. The large number of iterations of this loop allows increasing the strip size
(ss), so that the new transaction duration after strip mining, ss × Tbody, is longer;
as a result, order-inversion aborts decrease (loopB has more order-inversion aborts
than loopA even though its Tbody is longer).
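A minimal sketch of this reduction privatization is shown below, assuming the BEGIN()/END() primitives of Figure 2; the strip bounds j_start/j_end, the shared total n_global, and the atomic merge are our own illustration of the pattern, not the authors' exact code.

long n_local = 0;                      /* private copy of n for this strip            */
seed = 1 + 13 * j_start;               /* seed is a linear induction variable:
                                          recompute it from the strip's first iteration */
Retry:
if (!BEGIN()) {
    for (j = j_start; j < j_end; j++, seed += 13)
        n_local += pBitCntFunc[i](seed);      /* no writes to shared data inside the Tx */
    END();
    __sync_fetch_and_add(&n_global, n_local); /* merge after the commit, non-speculatively */
}
else goto Retry;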
For most of the loops in this class the performance is directly related to the
effective work to be parallelized, represented by Tloop. In the Intel Core the
proportion of order-inversion aborts is inversely related to the transaction du-
ration because very short transactions may reach the commit point even before
previous iterations could commit. Another issue is that very long transactions
may abort due to traps caused by the end of OS quantum. loopF has the longest
ss × Tbody among all loops evaluated and thus many of its transactions abort due to
traps caused by the end of the OS quantum, which explains the high ratio of aborts
from other causes shown for this loop in Figure 4. Coarse-grained TLS paralleliza-
tion of the whole loopF iteration is not possible because each iteration has a printf statement
that is not allowed within a transaction in either architecture. Therefore, each
iteration of loopF must be divided into two components: loopA and the printf
(as shown in Figure 6), before applying TLS only to the first component. The
second component is always executed non-speculatively.
The performance of loopC from one to three threads is higher on Intel Core
than on POWER8 because the larger speculative store capacity in the Intel
Core allows for the use of a larger strip size. With four threads, there is a
small improvement in POWER8 due to the reduction of order-inversion aborts.
The increment in the number of threads intensifies the effect of order inversion
for (j = mask_size; j < x_size - mask_size; j++) {   // loopE
  area = 0;
  total = 0;
  centre = in[i*x_size + j];
  ...                                  // calculating area and total
  tmp = area - 10000;
  if (tmp == 0)
    *out++ = median(in, i, j, x_size);
  else
    *out++ = ((total - (centre*10000)) / tmp);
}
Fig. 7: loopE
n = 0;
for (i = 5; i < y_size - 5; i++)       // loopV
  for (j = 5; j < x_size - 5; j++) {   // loopQ
    x = r[i][j];
    if (x > 0 && (/* compare x */)) {
      corner_list[n].info = 0;
      corner_list[n].x = j;
      corner_list[n].y = i;
      ...
      n++;
    }
  }
Fig. 8: loopQ and loopV
in performance. Therefore, for machines with a higher number of cores, better
speed-ups should be achieved in POWER8 than in Intel Core.
In loopC, loopD, and loopE consecutive iterations write to consecutive mem-
ory positions, leading to false sharing when these iterations are executed in par-
allel in a round-robin fashion. For instance, loopE, shown in Figure 7, writes
to *out++ (consecutive memory positions) in consecutive iterations, generating
false sharing in a round-robin parallelization. The solution is privatization: write
instead into local arrays throughout the transaction and copy the values back to
the original arrays after the commit [9].
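A minimal sketch of this array privatization for loopE follows, again assuming the BEGIN()/END() primitives of Figure 2; the buffer out_local, the strip bounds, the element type, and the copy-back indexing are illustrative assumptions, not the authors' code.

unsigned char out_local[STRIP_SIZE];   /* thread-private output buffer for this strip */
int k = 0;
Retry:
if (!BEGIN()) {
    for (j = j_start; j < j_end; j++) {
        /* ... compute area, total, centre, and tmp exactly as in Figure 7 ... */
        if (tmp == 0)
            out_local[k++] = median(in, i, j, x_size);
        else
            out_local[k++] = ((total - (centre * 10000)) / tmp);
    }
    END();
    memcpy(out_base + (j_start - mask_size), out_local,  /* copy back after the commit:   */
           k * sizeof(out_local[0]));                     /* no false sharing inside the Tx */
}
else goto Retry;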
5.2 Class II: High speculative demand and better performance in
Intel Core
These loops scale better on the Intel Core than on the POWER8 because
of the larger transaction capacity of the Intel Core: the read/write sizes of these
loops overflow the transaction capacity of the POWER8 (see Table 1), leading
to a high number of capacity aborts.
Table 4 shows the characterization of loops in Class II. With more than 400
iterations and a loop execution time Tloop larger than 2 ms, these loops have
enough work to be parallelized. Also, no dependences materialize at runtime for
the default inputs (%lc = 0).
The smaller write size in loopG means that 50% of its transactions do not
overflow the POWER8 speculative-storage capacity, resulting in this loop show-
ing speed-ups of up to 15% with four threads on POWER8. In the Intel Core
this loop has a large number of order-inversion aborts because it has significant
imbalance between its iterations [6]. A contrast is loopH, which has better perfor-
mance in the Intel Core even though its transactions are shorter. loopH results
in far fewer order-inversion aborts because the durations of its transactions
are balanced. loopJ has a long transaction duration and suffers aborts due to OS
traps.
Table 5: Characterization of 6 loops from SPEC CPU 2006.
Loop ID  Benchmark    Location            %Cov  N        Tbody (ns)  Tloop (ns)  %lc  Iteration Size  Class
mcf      429.mcf      pbeampp.c,165       40%   300      20          6000        3%   300 B           Others
milc     433.milc     quark_stuff.c,1523  20%   160000   94          15000000    0%   1 KB            I
h264ref  464.h264ref  mv-search.c,394     36%   1089     156         170000      0%   6 KB            II
sphinx3  482.sphinx3  vector.c,513        37%   2048     29          60000       0%   1 KB            I
astar    473.astar    Way2_.cpp,100       60%   1234     41          50000       20%  1 KB            Others
lbm      470.lbm      lbm.c,186           99%   1300000  55          71000000    0%   500 B           I
5.3 Class III: Not enough work to be parallelized with TLS
These are loops where the TLS implementation does not have enough work to be
distributed among the available threads, resulting in poor performance on either
architecture. The overhead of setting up TLS for these loops is too high in com-
parison to the benefits of parallelization. Murphy et al. [6] reported speed-ups in
these loops because their emulation of TLS hardware did not take into consid-
eration these costs. The experiments in this section reveal that their emulated
numbers overestimate the potential benefit of TLS for these loops. As shown
in Table 4, the available work to be parallelized, Tloop, in all the loops in this
class is below 0.6 µs, which is too small to benefit from parallelization. For in-
stance, loopO (and other loops such as loopP) have no aborts in POWER8, but their
performance is poor because of the overhead of setting up TLS.
Most of the loops in this category have many order-inversion aborts in the Intel
Core because their transaction duration is below 120 ns: the transactions finish
quickly, often before the previous iterations have committed. loopT
presents a high order-inversion abort rate in the Intel Core because its
transactions last less than 20 ns. In POWER8, the strip size needed to increase
the loop body and the privatization of three arrays lead to aborts because the
speculative capacity of the HTM is exceeded.
5.4 Others
These loops are a special case: they have sufficient work to be
parallelized, but their dependences materialize at runtime. For instance, loopV
has a 34% probability of loop-carried dependences, but TLS can still deliver
some performance improvement. As explained in [6], this loop finds local maxima
in a sliding window, with each maximum being added to a list of corners; each
iteration of loopQ processes a single pixel, whereas a complete row is processed
by each iteration of loopV. The input of this loop is a sparse image with most
of the pixels set to zero, and the suspected corners (iterations with loop-carried
dependences) are processed close to each other.
5.5 Predicting the TLS Performance for Other Loops
The characterization of the loops given in Table 4 and the performance evalu-
ation presented could also be used to predict the potential benefit of applying
TLS for new loops that were not included in this study. For loops with short
Fig. 9: Four SPEC2006 Loops. Speed-ups and abort ratios for coarse-grained TLS ex-
ecution on TSX and POWER8.
Table 6: TLS Execution for 6 loops from SPEC CPU 2006.
Loop ID  ss (Intel)  ss (P8)  Intel Tx Duration (ns)  Speed-up (Intel)  Speed-up (P8)  Loop Class
mcf      20          48       400                     1.45              0.60           Others
milc     4           4        375                     1.44              1.50           I
h264ref  16          6        2490                    1.74              1.27           II
sphinx3  8           16       234                     1.16              1.95           I
astar    128         256      5180                    0.74              0.49           Others
lbm      33          17       1800                    0.69              1.30           I
Tloop, such as those in Class III, TLS is very unlikely to result in performance
improvements in either architecture. For loops with small read/write sets and no
dependences materializing at runtime, such as those in class I, TLS is likely to
result in modest improvement for the Intel Core and more significant improve-
ments for the POWER8. Loops that have sufficient work to be parallelized and
no actual dependences but have larger read/write sets, such as those in Class
II, are likely to deliver speed improvements in the Intel Core but will result
in little or no performance gains in the POWER8 because of the more limited
speculative capacity in this architecture. Finally, loops that have sufficient work
to be parallelized but whose dependences materialize at runtime are difficult to
predict — such as loopV. The distribution of loop-carried dependences among
the iterations of such loops must be studied.
Six loops from the SPEC CPU 2006 suite are characterized (Table 5) to
predict to which class they belong according to the classification described in
Section 5. Loops milc, sphinx3, and lbm are classified as Class I; h264ref as
Class II; and mcf and astar as Others. Based on this classification a prediction
can be made about the relative performance of the loops on TLS over HTM for
both architectures. Results of TLS parallelization of these loops are shown in
Table 6 and Figure 9 and confirm the predictions.
6 Conclusions
This paper presents a detailed performance study of an implementation of TLS
on top of existing commodity HTM in two architectures. Based on the perfor-
mance results, it classifies the loops studied and, in doing so, provides guidance to
developers as to which loop characteristics make a loop amenable to the use of
TLS on the Intel Core or on the IBM POWER8 architectures. Future designs of
hardware support for TLS may also benefit from the observations derived from
this performance study.
Acknowledgments. The authors would like to thank FAPESP (grants 15/04285-
5, 15/12077-3, and 13/08293-7) and the NSERC for supporting this work.
References
1. cTuning Foundation: cbench: Collective benchmarks, http://ctuning.org/cbench
(2016)
2. Hurson, A.R., Lim, J.T., Kavi, K.M., Lee, B.: Parallelization of doall and doacross
loops—a survey. Advances in computers 45, 53–103 (1997)
3. IBM: Power ISA Transactional Memory (2012), www.power.org/wp-content/
uploads/2012/07/PowerISA_V2.06B_V2_PUBLIC.pdf
4. Intel Corporation: Intel architecture instruction set extensions programming refer-
ence. Chapter 8: Intel transactional synchronization extensions (2012)
5. Le, H., Guthrie, G., Williams, D., Michael, M., Frey, B., Starke, W., May, C.,
Odaira, R., Nakaike, T.: Transactional memory support in the IBM POWER8
processor. IBM Journal of Research and Development 59(1), 8:1–8:14 (Jan 2015)
6. Murphy, N., Jones, T., Mullins, R., Campanoni, S.: Performance implications of
transient loop-carried data dependences in automatically parallelized loops. In:
Intern. Conf. on Compiler Construction (CC). pp. 23–33. Barcelona, Spain (2016)
7. Nakaike, T., Odaira, R., Gaudet, M., Michael, M.M., Tomari, H.: Quantitative
comparison of hardware transactional memory for Blue Gene/Q, zEnterprise EC12,
Intel Core, and POWER8. In: Intern. Conf. on Computer Architecture (ISCA). pp.
144–157. Portland, OR (2015)
8. Odaira, R., Nakaike, T.: Thread-level speculation on off-the-shelf hardware trans-
actional memory. In: Intern. Symp. on Workload Characterization (IISWC). pp.
212–221. Atlanta, Georgia, USA (Oct 2014)
9. Salamanca, J., Amaral, J.N., Araujo, G.: Evaluating and improving thread-level
speculation in hardware transactional memories. In: IEEE Int. Parallel and Dis-
tributed Processing Symp. (IPDPS). pp. 586–595. Chicago, USA (2016)
10. Steffan, J., Mowry, T.: The potential for using thread-level data speculation to
facilitate automatic parallelization. In: High Performance Computer Architecture
(HPCA). pp. 2–. Washington, DC, USA (1998)
11. Steffan, J.G., Colohan, C.B., Zhai, A., Mowry, T.C.: A scalable approach to thread-
level speculation. In: Intern. Conf. on Computer Architecture (ISCA). pp. 1–12.
Vancouver, BC, Canada (2000)
12. Tournavitis, G., Wang, Z., Franke, B., O’Boyle, M.F.: Towards a holistic ap-
proach to auto-parallelization: Integrating profile-driven parallelism detection and
machine-learning based mapping. In: Programming Language Design and Imple-
mentation (PLDI). pp. 177–187. PLDI ’09, ACM, Dublin, Ireland (2009)