Improving Energy-Efficiency by Bypassing Trivial Computations
ABSTRACT We study the energy-efficiency benefits of bypassing trivial computations in high-performance processors. Trivial computations are those computations whose output can be determined without performing the computation. We show that bypassing trivial instructions reduces energy consumption while improving performance. For the subset of SPEC'2K benchmarks studied here, bypassing trivial instructions improves energy and energy-delay by 4.5% and 11.8% on average, respectively, over a conventional processor.

1. INTRODUCTION
In this work we improve energy-efficiency in high-performance processors by bypassing trivial instructions. A trivial instruction is an instruction whose output can be determined without performing the actual computation. For such instructions, we can determine the result immediately based on the value of one or both of the source operands. Examples are multiply or add instructions where one of the input operands is zero.
Determining the result of a trivial instruction without performing the computation improves energy-efficiency in two ways. First, it results in faster instruction execution, which in turn allows instructions that depend on the trivial instruction's output to execute earlier. This shortens program runtime and thereby reduces energy consumption. Second, by bypassing trivial instructions we no longer spend energy on executing them. As such, we reduce total energy consumption.
We assume a typical load/store ISA where each instruction may have up to two source operands. We refer to the operand which trivializes the operation as the trivializing operand (TO). Examples of TOs are the operand equal to zero in an add operation or the operand equal to one in a multiplication.
A previous study shows that a) an optimizing compiler is often unable to remove trivial operations since trivial values are not known at compile time and b) the amount of trivial computation does not depend heavily on program-specific inputs [2].
Identifying trivial instructions dynamically is possible as soon as the TO and the instruction opcode are known. However, computing the result may not always require knowledge of both source operands. In some cases, e.g., multiplying by zero, we do not need both operands to compute the result. Under such circumstances, the result does not depend on the other operand value. In other cases, e.g., addition to zero, both operands are needed. We refer to those trivial instructions whose output can be calculated knowing only one of the operands as fully-trivial instructions. We refer to those trivial instructions whose result can be computed only after knowing both operands as semi-trivial instructions. Our study shows that semi-trivial instructions account for the majority of trivial instructions. However, bypassing a fully-trivial instruction can impact performance and energy more than bypassing a semi-trivial instruction. This is because fully-trivial instructions can be bypassed earlier and save more energy, as they make reading both operands unnecessary.
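The distinction between fully-trivial and semi-trivial instructions can be illustrated with a minimal Python sketch that follows the conditions listed in Table 1. The function names, the opcode strings, the use of None for an operand that has not been produced yet, and the guard against 0/0 are assumptions made for illustration; they do not describe the detection hardware.

```python
from typing import Optional, Tuple

MASK32 = 0xFFFFFFFF  # all-ones 32-bit value used by the AND/OR conditions in Table 1


def fully_trivial(op: str, a: Optional[int], b: Optional[int]) -> Optional[int]:
    """Return the result if one known operand alone determines it, else None.

    An operand that is not available yet is passed as None.
    The divide-by-zero corner case (0/0) is ignored, as in Table 1.
    """
    if op in ("mul", "and") and (a == 0 or b == 0):
        return 0
    if op == "or" and (a == MASK32 or b == MASK32):
        return MASK32
    if op in ("div", "shl", "shr") and a == 0:
        return 0
    return None


def semi_trivial(op: str, a: int, b: int) -> Optional[int]:
    """Return the result if both operands are known and one trivializes the op."""
    if op in ("add", "xor"):
        if a == 0:
            return b
        if b == 0:
            return a
    elif op == "sub":
        if b == 0:
            return a
        if a == b:
            return 0
    elif op == "mul":
        if a == 1:
            return b
        if b == 1:
            return a
    elif op == "div":
        if b == 1:
            return a
        if a == b and b != 0:  # assumption: guard against 0/0
            return 1
    elif op in ("and", "or"):
        identity = MASK32 if op == "and" else 0
        if a == identity:
            return b
        if b == identity or a == b:
            return a
    elif op in ("shl", "shr") and b == 0:
        return a
    return None


def classify(op: str, a: Optional[int], b: Optional[int]) -> Tuple[str, Optional[int]]:
    """Classify an instruction given the operands known so far."""
    result = fully_trivial(op, a, b)
    if result is not None:
        return "fully-trivial", result
    if a is not None and b is not None:
        result = semi_trivial(op, a, b)
        if result is not None:
            return "semi-trivial", result
    return "non-trivial", None
```

For example, classify("mul", 0, None) identifies a fully-trivial multiplication before the second operand is known, whereas classify("add", 0, None) cannot decide until the other operand arrives.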
Table 1 reports the fully-trivial and semi-trivial computations studied in this work. We report both the operation and the particular source operand value that trivializes the operation. It is possible to extend our study further to include other instruction types (e.g., ABS). However, this will not impact our results as such instructions are very infrequent.
Generally, power-aware techniques save energy at the expense of performance. Bypassing trivial instructions, however, reduces energy consumption while improving performance. Note that computing trivial instruction results, while unnecessary, results in extra latency and additional energy consumption. Therefore, bypassing the computation and obtaining the result without performing it improves both performance and energy simultaneously.
In this work we study the energy benefits achieved by dynamically identifying and bypassing both fully-trivial and semi-trivial computations. In particular, we make the following contributions:
Ehsan Atoofian and Amirali Baniasadi
ECE Department, University of Victoria
{eatoofia, amirali}@ece.uvic.ca
0769523129/05/$20.00 (c) 2005 IEEE
• We show that, by bypassing trivial instructions, it is possible to reduce energy consumption and improve energy-delay, on average, by 4.5% (min.: 1.5%) and 11.8% (min.: 3.5%), respectively.
• We categorize trivial instructions based on the number of source operands needed to detect them and the time at which their source operands become available. We also study how often trivial instructions belong to each category and how this may impact our energy and performance improvements.
The rest of the paper is organized as follows. In Section 2 we explain bypassing trivial instructions in more detail. In Section 3 we explain our implementation. In Section 4 we present our experimental evaluation. In Section 5 we review related work. Finally, in Section 6, we summarize our findings.
2. TRIVIAL INSTRUCTION BYPASSING
The result of a trivial operation can be one of the source operands, zero, or one (e.g., the operations reported in Table 1). Trivial instruction frequency impacts the potential benefits of trivial instruction bypassing. Therefore, in order to decide whether detecting and bypassing trivial operations is worthwhile, we need to know how frequently they appear in the code stream. In Figure 1(a) we report trivial instruction frequency. In addition, to provide better insight, we also report both fully-trivial and semi-trivial instruction frequency. While the entire bar represents total trivial instructions, the lower part of each bar shows the frequency of semi-trivial instructions and the upper part represents fully-trivial instructions.
As represented by the entire bar, on average, trivial instructions account for about 12% of the total instructions. Gcc and vpr have a higher number of trivial instructions than the other benchmarks. Wlf has the lowest number of trivial instructions.
In general, semi-trivial instructions outnumber fully-trivial instructions. Fully-trivial instructions may account for as much as one third of the total number of trivial instructions (e.g., equ). Meanwhile, they may account for as little as 2% of the total trivial instructions (e.g., wlf). On average, about 85% of the trivial instructions are semi-trivial while the remaining 15% are fully-trivial instructions.
As reported in Table 1, different instruction types can be trivial depending on their source operand values. However, the trivial instruction frequency differs from one instruction type to another.
Figure 1(b) reports how often each instruction type is trivial. As reported, at least 10% of each instruction type is trivial. For mult, div and or, trivial instructions account for more than half of the instructions of that type. However, note that a high percentage of trivial instructions for a specific instruction type does not always mean that the particular instruction type will have a considerable impact on energy-efficiency. For example, while 90% of the divisions appear to be trivial, divisions account for less than 1% of the total number of instructions executed.
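The weighting argument above can be made concrete with a short Python sketch. It only multiplies the percentages quoted in this section; the function name is hypothetical, and because the 85%/15% split is an average across benchmarks, the resulting products should be read as rough estimates rather than reported measurements.

```python
def trivial_share_of_total(type_share: float, trivial_fraction: float) -> float:
    """Fraction of all executed instructions that are trivial instructions of one type.

    type_share: fraction of all executed instructions that belong to the type.
    trivial_fraction: fraction of that type's instructions that are trivial.
    """
    return type_share * trivial_fraction


# Division: 90% trivial, but under 1% of all instructions, so trivial
# divisions make up less than about 0.9% of the instruction stream.
div_upper_bound = trivial_share_of_total(0.01, 0.90)

# Average split quoted above: about 12% of instructions are trivial,
# of which about 85% are semi-trivial and 15% are fully-trivial.
semi_of_total = trivial_share_of_total(0.12, 0.85)   # roughly 10.2% of all instructions
fully_of_total = trivial_share_of_total(0.12, 0.15)  # roughly 1.8% of all instructions
```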
Table 1: Fully-trivial and semi-trivial instructions studied in this work

Operation                       Full Triviality Condition
Multiplication: A*B             A=0 or B=0
Division: A/B                   A=0
AND: A & B                      A=0x00000000 or B=0x00000000
OR: A | B                       A=0xffffffff or B=0xffffffff
Logical Shift: A<<B, A>>B       A=0
Arithmetic Shift: A<<B, A>>B    A=0

Operation                       Semi Triviality Condition
Addition: A+B                   A=0 or B=0
Subtraction: A-B                B=0 or A=B
Multiplication: A*B             A=1 or B=1
Division: A/B                   B=1 or A=B
AND: A & B                      A=0xffffffff or B=0xffffffff or A=B
OR: A | B                       A=0x00000000 or B=0x00000000 or A=B
XOR: A XOR B                    A=0x00000000 or B=0x00000000
Logical Shift: A<<B, A>>B       B=0
Arithmetic Shift: A<<B, A>>B    B=0

Trivial instructions can only be bypassed when either both operands (for semi-trivial) or their TO (for fully-trivial) are known. Based on the availability time(s) of the source operand(s), we categorize trivial instructions into two groups:
The first group consists of those instructions whose source operand(s) (both operands for semi-trivial, the TO for fully-trivial) are known while they are at the decode stage. For this group, the required source operands have been produced early enough that the trivial instruction can be bypassed at the decode stage.
The second group consists of those instructions whose necessary operands are not available at the instruction decode stage. Therefore, these trivial instructions cannot be bypassed at the decode stage and are sent to the issue queue, where they wait for their operands and the required resources to become available. This group of trivial instructions is identified at the issue stage, when the required source operands (again, both operands for semi-trivial, the TO for fully-trivial) are known.
We refer to the trivial instructions identified at decode as decode-trivial and to those identified at issue as issue-trivial. In Figure 2 we report the percentage of decode-trivial and issue-trivial instructions. While the entire bar represents total trivial instructions (similar to Figure 1(a)), the lower part of each bar shows the frequency of decode-trivial instructions and the upper part represents issue-trivial instructions.
On average, issue-trivial instructions account for half of the trivial instructions. However, for some benchmarks (e.g., bzp and equ) decode-trivial instructions outnumber issue-trivial instructions. For other benchmarks (e.g., gcc and vpr) issue-trivial instructions outnumber decode-trivial instructions.
Note that the earlier a trivial instruction is identified, the earlier it can be bypassed. As such, we expect decode-trivial instructions to yield higher energy savings and performance improvements than issue-trivial instructions.
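The grouping described in this section can be summarized in a small Python sketch. The function and argument names are hypothetical; the rule itself restates the text: a fully-trivial instruction only needs its TO at decode, while a semi-trivial instruction needs both operands.

```python
def detection_stage(kind: str, to_ready_at_decode: bool, other_ready_at_decode: bool) -> str:
    """Return where a trivial instruction can be identified.

    kind: "fully-trivial" or "semi-trivial".
    to_ready_at_decode: the trivializing operand (TO) is available at decode.
    other_ready_at_decode: the remaining source operand is available at decode.
    """
    if kind == "fully-trivial":
        ready = to_ready_at_decode
    elif kind == "semi-trivial":
        ready = to_ready_at_decode and other_ready_at_decode
    else:
        raise ValueError(f"not a trivial instruction kind: {kind}")
    return "decode-trivial" if ready else "issue-trivial"
```

Under this rule, a multiplication by zero whose zero operand is already in the register file is decode-trivial even if the other operand is still in flight, whereas an addition to zero in the same situation is issue-trivial.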
Figure 1: (a) Trivial instruction frequency and distribution: The entire bar represents trivial instruction frequency. The lower part shows semi-trivial instruction frequency while the upper part shows fully-trivial instruction frequency. (b) How often each instruction type is trivial. (See Table 2 for benchmark abbreviations)
[Panel (a): x-axis benchmarks gzp, vpr, equ, bzp, gcc, wlf, swm, mes, AVG; y-axis 0% to 18%; legend Half-Trivial, Full-Trivial. Panel (b): x-axis instruction types add, sub, mult, div, and, or, xor, shift, fadd, fsub, fmult, fdiv; y-axis 0% to 100%.]
Figure 2: Trivial instruction frequency and distribution: The entire bar represents trivial instruction frequency. The lower part shows decode-trivial instruction frequency while the upper part shows issue-trivial instruction frequency. (See Table 2 for benchmark abbreviations)
[x-axis benchmarks gzp, vpr, equ, bzp, gcc, wlf, swm, mes, AVG; y-axis 0% to 18%; legend Decode-Trivial Instruction, Issue-Trivial Instruction.]
3. IMPLEMENTATION
In this work we assume that all reservation stations monitor their source operands for data availability simultaneously. We also assume that at dispatch, already-available operand values are read from the register file and stored in the reservation station. The reservation station logic compares the operand tags of unavailable data with the result tags of completing instructions. Once a match is detected, the operand is read from the bypass logic. As soon as all operands become available in the reservation station, the instruction may issue (subject to resource availability) [9]. An alternative implementation is storing pointers to where the operand can be found (e.g., in the register file) rather than storing the data in the reservation station [10]. While trivial instruction bypassing could be used on top of both implementations, here we assume the former.
Figure 3 shows the schematic of a processor that bypasses trivial instructions and the procedures followed. We first discuss decode-trivial instructions. At decode, the Trivial Instruction Detection unit examines source operands. If the instruction is trivial, the rename table is modified so that it maps the destination register to the physical register assigned to the input source operand or to the zero register, as presented in Figure 3(b). Once the rename table is modified, we no longer execute the trivial instruction. As such, instructions depending on the trivial instruction result can start execution immediately (subject to resource availability). Note that decode-trivial instructions, once detected, do not consume execution unit resources.
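The rename-table remapping for decode-trivial instructions can be sketched in Python as follows. The dictionary representation, the register names and the ZERO_REG identifier are assumptions made for illustration. The sketch covers only the two cases named in the text (result equals a source operand, or result is zero) and ignores how a real design would track physical registers shared by several architectural registers before freeing them.

```python
from typing import Dict, Optional

ZERO_REG = "p0"  # hypothetical physical register that always reads as zero


def bypass_at_decode(rename_table: Dict[str, str], dest: str,
                     result_source: Optional[str]) -> None:
    """Remap dest instead of executing a decode-trivial instruction.

    rename_table: architectural register name -> physical register name.
    result_source: architectural source register whose value equals the
        result, or None when the result is zero.
    """
    if result_source is None:
        rename_table[dest] = ZERO_REG
    else:
        # Consumers of dest now read the physical register that already
        # holds the non-trivializing source operand.
        rename_table[dest] = rename_table[result_source]


# add r3 = r1 + r2 with r2 == 0: r3 is mapped to the register holding r1.
table = {"r1": "p7", "r2": "p9", "r3": "p4", "r4": "p5"}
bypass_at_decode(table, "r3", "r1")

# mult r4 = r1 * r2 with r2 == 0: r4 is mapped to the zero register.
bypass_at_decode(table, "r4", None)
```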
To identify trivial instructions while they are in the issue queue, the trivial instruction detection unit examines the produced data as soon as the associated tag is received by the reservation station. Once we detect an issue-trivial instruction, we bypass executing the instruction and send the result to the writeback unit, as presented in Figure 3(c). However, the destination register of an issue-trivial instruction should not be released, since there may still be instructions depending on the trivial instruction outcome which have not read their source operands yet.
Note that, in order to improve performance, modern processors wake up consumer instructions in advance, before the data is actually available. This makes executing producer-consumer pairs in consecutive cycles possible. As a result, issue-trivial instructions would have to be issued first and then read their operands to test triviality. Consequently, in this study we assume that issue-trivial instructions take issue slots but are not executed in the ALU and write their results as soon as possible. Therefore, issue-trivial instructions benefit less from trivial instruction bypassing than decode-trivial instructions.
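The issue-stage behaviour described in this section can be sketched as a toy Python model of one reservation-station entry. All names (Entry, on_broadcast, the returned action strings) are hypothetical, the triviality check covers only two rows of Table 1, and timing details such as early wakeup and issue-slot accounting are not modelled.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Entry:
    op: str
    tags: Dict[str, Optional[str]]    # operand name -> awaited producer tag, None once captured
    values: Dict[str, Optional[int]]  # operand name -> captured value, None if not yet available


def trivial_result(op: str, a: Optional[int], b: Optional[int]) -> Optional[int]:
    """Small subset of Table 1: multiply by zero and add with zero."""
    if op == "mul" and (a == 0 or b == 0):
        return 0  # fully-trivial: one operand suffices
    if op == "add" and a is not None and b is not None:
        if a == 0:
            return b
        if b == 0:
            return a
    return None


def on_broadcast(entry: Entry, tag: str, value: int) -> Tuple[str, Optional[int]]:
    """Capture a completing result whose tag matches, then test triviality."""
    for name, awaited in entry.tags.items():
        if awaited == tag:
            entry.values[name] = value
            entry.tags[name] = None
    a, b = entry.values["a"], entry.values["b"]
    result = trivial_result(entry.op, a, b)
    if result is not None:
        return "bypass", result  # skip the ALU and send the result to writeback
    if a is not None and b is not None:
        return "execute", None   # all operands ready, instruction is not trivial
    return "wait", None          # still waiting for an operand
```

In this model a multiplication bypasses as soon as a zero operand is captured, while an addition has to wait for both operands before it can be recognized as trivial.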
Figure 3: a) Schematic for a pipelined processor bypassing trivial instructions b) Decode-Trivial instruction detection procedure c) Issue-Trivial instruction detection procedure.
[(a) Pipeline stages FETCH, DECODE & RENAME, ISSUE, COMPLETE, COMMIT, with a TRIVIAL INST. DETECTION unit. (b) Read operand from register file; Trivial? No: do nothing. Yes: bypass instruction by remapping the renaming table. (c) Read operand from bypass logic; Trivial? No: do nothing. Yes: bypass instruction: do not execute, send zero or the non-trivializing source operand to the bypass logic and register file.]
4. METHODOLOGY AND RESULTS
In this section, we describe our analysis framework. To evaluate how bypassing trivial instructions impacts performance and energy, we compare our processor with a conventional processor that does not bypass trivial instructions. We report performance, energy and energy-delay.
We used both floating point (equ, mes and swm) and integer (gzp, vpr, gcc, bzp and wlf) programs from the SPEC CPU2000 suite compiled for the MIPS-like PISA architecture used by the SimpleScalar v3.0 simulation tool set [1]. We used WATTCH [4] for energy estimation. The benchmark set studied here includes a variety of programs, including high- and low-IPC programs and programs limited by memory, branch misprediction, etc.
Note that detecting and bypassing trivial instructions requires additional hardware. Consequently, and depending on how the technique is implemented, this will result in power overhead. Throughout this study we assume that this power overhead is negligible compared to our savings.
We used GNU's gcc compiler. In the interest of space, we use the abbreviations shown under the "Ab." column in Table 2. We simulated 500M instructions after skipping 500M instructions. We detail the base processor model in Table 3.
4.1. Performance
Bypassing trivial instructions will improve performance only if the bypassed instructions are on the critical path. To investigate how bypassing trivial instructions impacts performance, in Figure 4 we report performance improvements compared to a conventional processor. Vpr and mes show higher performance improvements than the other benchmarks. Wlf has the lowest performance improvement among all benchmarks.
4.2. Energy and Energy-Delay
In Figure 5 we report energy and energy-delay measurements. In Figure 5(a) we report energy savings achieved by bypassing trivial instructions. Wlf has the lowest energy savings compared to other benchmarks. Vpr and equ have higher energy reduction compared to the rest of the benchmarks.
In Figure 5(b) we report energy-delay improvements achieved by bypassing trivial instructions. Again, wlf has the lowest energy-delay improvement compared to other benchmarks. Vpr has the highest energy-delay improvement among all benchmarks.
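As a rough illustration of how the energy and energy-delay figures relate, the Python sketch below treats energy-delay as the product of energy and execution time and derives the delay reduction implied by the average improvements stated in the abstract (4.5% and 11.8%). This is an assumption-laden back-of-the-envelope calculation, not a reported result: averages taken across benchmarks do not compose exactly, and the function name is hypothetical.

```python
def implied_delay_ratio(energy_saving: float, energy_delay_saving: float) -> float:
    """Delay relative to the baseline, assuming energy-delay = energy * delay.

    (1 - energy_delay_saving) = (1 - energy_saving) * delay_ratio
    """
    return (1.0 - energy_delay_saving) / (1.0 - energy_saving)


# Average improvements quoted in the abstract.
ratio = implied_delay_ratio(0.045, 0.118)
delay_reduction = 1.0 - ratio  # roughly 0.076, i.e. about 7.6% shorter runtime
```

The point of the calculation is that the energy-delay improvement exceeds the energy saving because bypassing also shortens execution time.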
Table 2: Benchmark abbreviations used here

Program       Ab.     Program       Ab.
164.gzip      gzp     177.mesa      mes
171.swim      swm     183.equake    equ
175.vpr       vpr     256.bzip2     bzp
176.gcc       gcc     300.twolf     wlf

Table 3: Base processor configuration.

Instruction Fetch Queue #    32
Reorder Buffer Size          64
Load/Store Queue Size        32
Branch Predictor             8K GShare + 8K bimodal w/ 8K selector
Scheduler                    64 entries, RUU-like
Fetch Unit                   Up to 4 instr./cycle. 64-Entry Fetch Buffer
OOO Core                     any 4 instructions / cycle
L1 - Instruction Caches      64K, 4-way SA, 32-byte blocks, 3 cycle hit latency
L1 - Data Caches             32K, 2-way SA, 32-byte blocks, 3 cycle hit latency
Unified L2                   256K, 4-way SA, 64-byte blocks, 16-cycle hit latency
Main Memory                  Infinite, 80 cycles
Memory Port #                2

Figure 4: Performance improvement achieved by bypassing trivial instructions over a conventional processor.
[x-axis benchmarks gzp, vpr, equ, bzp, gcc, wlf, swm, mes, AVG; y-axis 0% to 14%.]
4.3. Discussion
In this section we review the results. A detailed analysis of the results would require studying many issues, including the instruction type distribution of the bypassed instructions and how often critical path instructions are bypassed for each benchmark. Our discussion here, however, focuses only on the data presented earlier. We discuss integer and floating point benchmarks separately.
1) The integer benchmarks studied here include gzp, vpr, bzp, gcc and wlf. Among integer benchmarks, vpr and gcc have a higher number of trivial instructions. The high number of trivial instructions for vpr explains why this benchmark benefits more than other benchmarks from bypassing trivial instructions. As for gcc, how