ABSTRACT

We study the energy-efficiency benefits of bypassing trivial computations in high-performance processors. Trivial computations are those whose output can be determined without performing the computation. We show that bypassing trivial instructions reduces energy consumption while improving performance. For the subset of SPEC’2K benchmarks studied here, bypassing trivial instructions improves energy and energy-delay, on average, by 4.5% and 11.8%, respectively, over a conventional processor.

1. INTRODUCTION

In this work we improve energy-efficiency in high-performance processors by bypassing trivial instructions. A trivial instruction is an instruction whose output can be determined without performing the actual computation. For such instructions, we can determine the results immediately based on the value of one or both of the source operands. Examples are multiply or add instructions where one of the input operands is zero.

Determining the trivial instruction result without performing the computation improves energy-efficiency in two ways. First, it results in faster instruction execution, which in turn can allow instructions that depend on the trivial instruction's output to execute earlier. This shortens program runtime, which reduces energy consumption. Second, by bypassing trivial instructions we no longer spend energy on executing them, which reduces total energy consumption.

We assume a typical load/store ISA where each instruction may have up to two source operands. We refer to the operand which trivializes the operation as the trivializing operand (TO). Examples of TOs are the operand equal to zero in an add operation or the operand equal to one in a multiplication.

A previous study shows that a) an optimizing compiler is often unable to remove trivial operations since trivial values are not known at compile time and b) the amount of trivial computation does not heavily depend on program-specific inputs [2].

Identifying trivial instructions dynamically is possible as soon as the TO and the instruction opcode are known. However, computing the result may not always require knowledge of both source operands. In some cases, e.g., multiplying by zero, we do not need both operands to compute the result, since the result does not depend on the value of the other operand. In other cases, e.g., adding zero, both operands are needed. We refer to those trivial instructions whose output can be calculated knowing only one of the operands as fully-trivial instructions. We refer to those trivial instructions whose result can be computed only after knowing both operands as semi-trivial instructions. Our study shows that semi-trivial instructions account for the majority of trivial instructions. However, bypassing a fully-trivial instruction can impact performance and energy more than bypassing a semi-trivial instruction. This is because fully-trivial instructions can be bypassed earlier and save more energy, as they make reading both operands unnecessary.

Table 1 reports the fully-trivial and semi-trivial computations studied in this work. We report both the operation and the particular source operand value that trivializes the operation. It is possible to extend our study further to include other instruction types (e.g., ABS). However, this will not impact our results as such instructions are very infrequent.
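As an illustration only, the conditions listed in Table 1 can be written as a small software check. The sketch below is not the hardware detection logic studied here. The operation names (mul, sll, and so on), the classify function, the 32-bit operand width and the use of None for an operand that is not yet known are all assumptions made for the example.

```python
MASK = 0xFFFFFFFF  # all-ones pattern; assumes 32-bit operands as in Table 1

SHIFTS = ("sll", "srl", "sra")  # hypothetical names for the shift operations


def full_trivial(op, a, b):
    """Full-triviality conditions of Table 1; None marks an operand not yet known."""
    if op == "mul":
        return a == 0 or b == 0
    if op == "div" or op in SHIFTS:
        return a == 0
    if op == "and":
        return a == 0 or b == 0
    if op == "or":
        return a == MASK or b == MASK
    return False


def semi_trivial(op, a, b):
    """Semi-triviality conditions of Table 1; both operands must be known."""
    if op == "add":
        return a == 0 or b == 0
    if op == "sub":
        return b == 0 or a == b
    if op == "mul":
        return a == 1 or b == 1
    if op == "div":
        return b == 1 or a == b
    if op == "and":
        return a == MASK or b == MASK or a == b
    if op == "or":
        return a == 0 or b == 0 or a == b
    if op == "xor":
        return a == 0 or b == 0
    if op in SHIFTS:
        return b == 0
    return False


def classify(op, a=None, b=None):
    """Classify one operation given the operands known so far."""
    if full_trivial(op, a, b):
        return "fully-trivial"
    if a is None or b is None:
        return "undetermined"  # a semi-trivial check needs both operands
    if semi_trivial(op, a, b):
        return "semi-trivial"
    return "non-trivial"


print(classify("mul", a=0))       # fully-trivial
print(classify("add", a=0, b=7))  # semi-trivial
print(classify("add", a=0))       # undetermined
```

Checking the full-triviality conditions first mirrors the distinction drawn above: a fully-trivial instruction can be recognized from the TO alone, whereas a semi-trivial one cannot be recognized until both operands are available.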

Generally, power-aware techniques save energy at the expense of performance. Bypassing trivial instructions, however, reduces energy consumption while improving performance. Note that computing trivial instruction results, while unnecessary, results in extra latency and additional energy consumption. Therefore, bypassing the computation and obtaining the result without performing the computation will improve both performance and energy simultaneously.

In this work we study the energy benefits achieved by dynamically identifying and bypassing both fully-trivial and semi-trivial computations. In particular, we make the following contributions:

• We show that, by bypassing trivial instructions, it is possible to reduce energy consumption and improve energy-delay, on average, by 4.5% (min.: 1.5%) and 11.8% (min.: 3.5%), respectively.

• We categorize trivial instructions based on the number of source operands needed to detect them and the availability time of their source operands. We also study how often trivial instructions belong to each category and how this may impact our energy and performance improvements.

Improving Energy-Efficiency by Bypassing Trivial Computations

Ehsan Atoofian and Amirali Baniasadi

ECE Department, University of Victoria

{eatoofia, amirali}@ece.uvic.ca

0-7695-2312-9/05/$20.00 (c) 2005 IEEE

The rest of the paper is organized as follows. In Section 2 we explain bypassing trivial instructions in more detail. In Section 3 we explain our implementation. In Section 4 we present our experimental evaluation. In Section 5 we review related work. Finally, in Section 6, we summarize our findings.

2. TRIVIAL INSTRUCTION BYPASSING

The result of a trivial operation is one of the source operands, zero, or one (e.g., the operations reported in Table 1). The frequency of trivial instructions determines the potential benefit of bypassing them. Therefore, to decide whether detecting and bypassing trivial operations is worthwhile, we need to know how frequently they appear in the code stream. Figure 1(a) reports trivial instruction frequency. To provide better insight, we also report fully-trivial and semi-trivial instruction frequency separately: while the entire bar represents total trivial instructions, the lower part of each bar shows the frequency of semi-trivial instructions and the upper part represents fully-trivial instructions.

As represented by the entire bar, on average, trivial instructions account for about 12% of the total instructions. Gcc and vpr have a higher number of trivial instructions than the other benchmarks. Wlf has the lowest number of trivial instructions.

In general, semi-trivial instructions outnumber fully-trivial instructions. Fully-trivial instructions may account for as much as one third of the total number of trivial instructions (e.g., equ) or for as little as 2% of them (e.g., wlf). On average, about 85% of the trivial instructions are semi-trivial while the remaining 15% are fully-trivial.

As reported in Table 1, different instruction types can be trivial depending on their source operand values. However, the trivial instruction frequency differs from one instruction type to another.

Figure 1(b) reports how often each instruction type is trivial. As reported, at least 10% of each instruction type is trivial. In cases such as mult, div, and or, trivial instructions account for more than half of the instructions. However, note that a high percentage of trivial instructions for a specific instruction type does not always mean that the particular instruction type will have a considerable impact on energy-efficiency. For example, while 90% of the divisions appear to be trivial, they only account for less than 1% of the total number of instructions executed.
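The point about divisions is simple arithmetic: the share of all executed instructions that are trivial instructions of a given type is the product of that type's share of the instruction stream and the fraction of that type that is trivial. The sketch below is a minimal illustration. It reads the example above as saying that divisions make up at most 1% of executed instructions, and the second instruction type and its numbers are hypothetical, chosen only for contrast.

```python
def overall_trivial_share(type_share, trivial_fraction):
    """Fraction of all executed instructions that are trivial instructions of one type.

    type_share: fraction of executed instructions belonging to the type.
    trivial_fraction: fraction of that type's instructions that are trivial.
    """
    return type_share * trivial_fraction


# Upper bound for divisions under the reading stated in the lead-in:
# at most 1% of instructions, 90% of which are trivial.
div_bound = overall_trivial_share(0.01, 0.90)

# Hypothetical frequent type: 30% of instructions, only 10% of them trivial.
frequent = overall_trivial_share(0.30, 0.10)

print(f"divisions: at most {div_bound:.1%} of all instructions")
print(f"hypothetical frequent type: {frequent:.1%} of all instructions")
```

Under these assumptions the frequent type contributes more bypass opportunities overall, even though a much smaller fraction of its instructions is trivial.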

Table 1: Fully- and semi-trivial instructions studied in this work

Operation                        Full Triviality Condition
Multiplication: A*B              A=0 or B=0
Division: A/B                    A=0
AND: A & B                       A=0x00000000 or B=0x00000000
OR: A | B                        A=0xffffffff or B=0xffffffff
Logical Shift: A<<B, A>>B        A=0
Arithmetic Shift: A<<B, A>>B     A=0

Operation                        Semi Triviality Condition
Addition: A+B                    A=0 or B=0
Subtraction: A-B                 B=0 or A=B
Multiplication: A*B              A=1 or B=1
Division: A/B                    B=1 or A=B
AND: A & B                       A=0xffffffff or B=0xffffffff or A=B
OR: A | B                        A=0x00000000 or B=0x00000000 or A=B
XOR: A XOR B                     A or B =0x00000000
Logical Shift: A<<B, A>>B        B=0
Arithmetic Shift: A<<B, A>>B     B=0

Trivial instructions can only be bypassed when either both operands (for semi-trivial) or their TO (for fully-trivial) are known. Based on the source operand(s) availability time(s), we categorize trivial instructions into two groups:

The first group consists of those instructions whose required source operand(s) (both operands for semi-trivial, the TO for fully-trivial) are known while they are at the decode stage. For this group, the required source operands have been produced early enough that the trivial instruction can be bypassed at the decode stage.

The second group consists of those trivial instructions whose necessary operands are not available at the decode stage. These trivial instructions cannot be bypassed at decode and are sent to the issue queue, where they wait for their operands and the required resources to become available. This group of trivial instructions is identified at the issue stage, once the required source operands (again, both operands for semi-trivial, the TO for fully-trivial) are known.

We refer to the trivial instructions identified at decode as decode-trivial and to those identified at issue as issue-trivial. In Figure 2 we report the percentage of decode-trivial and issue-trivial instructions. While the entire bar represents total trivial instructions (similar to Figure 1(a)), the lower part of each bar shows the frequency of decode-trivial instructions and the upper part represents issue-trivial instructions.
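The two groups can be summarized as a small decision rule. The following is a minimal sketch of the categorization as described in the text, not of any hardware. The function name and its boolean arguments are assumptions introduced for the example.

```python
def detection_stage(kind, to_ready_at_decode, other_ready_at_decode):
    """Return where a trivial instruction can be identified.

    kind: "fully-trivial" (only the TO is needed) or
          "semi-trivial" (both source operands are needed).
    The two flags say whether each operand value is already known at decode.
    """
    if kind == "fully-trivial":
        ready = to_ready_at_decode
    elif kind == "semi-trivial":
        ready = to_ready_at_decode and other_ready_at_decode
    else:
        raise ValueError(f"unknown kind: {kind}")
    return "decode-trivial" if ready else "issue-trivial"


# A multiply by zero whose zero operand is known at decode can be caught there,
# even though the other operand is still being produced.
print(detection_stage("fully-trivial", True, False))  # decode-trivial

# An add of zero must wait in the issue queue until the other operand arrives.
print(detection_stage("semi-trivial", True, False))   # issue-trivial
```

The rule also makes visible why fully-trivial instructions have a better chance of being caught at decode: they need one operand rather than two to be available early.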

On average, issue-trivial instructions account for half of the trivial instructions. However, for some benchmarks (e.g., bzp and equ) the number of decode-trivial instructions exceeds the number of issue-trivial instructions. For other benchmarks (e.g., gcc and vpr), issue-trivial instructions outnumber decode-trivial instructions.

Note that the earlier a trivial instruction is identified, the earlier it can be bypassed. As such, we expect decode-trivial instructions to yield higher energy savings and performance improvements than issue-trivial instructions.

Figure 1: (a) Trivial instruction frequency and distribution: The entire bar represents trivial instruction frequency. The lower part shows semi-trivial instruction frequency while the upper part shows fully-trivial instruction frequency. (b) How often each instruction type is trivial. (See Table 2 for benchmark abbreviations)

[Plot not reproduced. (a) x-axis: gzp, vpr, equ, bzp, gcc, wlf, swm, mes, AVG; y-axis: 0% to 18%; legend: Half-Trivial, Full-Trivial. (b) x-axis: add, sub, mult, div, and, or, xor, shift, fadd, fsub, fmult, fdiv; y-axis: 0% to 100%.]

Figure 2: Trivial instruction frequency and distribution: The entire bar represents trivial instruction frequency. The lower part shows decode-trivial instruction frequency while the upper part shows issue-trivial instruction frequency. (See Table 2 for benchmark abbreviations)

[Plot not reproduced. x-axis: gzp, vpr, equ, bzp, gcc, wlf, swm, mes, AVG; y-axis: 0% to 18%; legend: Decode-Trivial Instruction, Issue-Trivial Instruction.]

3. IMPLEMENTATION

In this work we assume that all reservation stations monitor their source operands for data availability simultaneously. We also assume that at dispatch, already-available operand values are read from the register file and stored in the reservation station. The reservation station logic compares the operand tags of unavailable data with the result tags of completing instructions. Once a match is detected, the operand is read from the bypass logic. As soon as all operands become available in the reservation station, the instruction may issue (subject to resource availability) [9]. An alternative implementation is storing pointers to where the operand can be found (e.g., in the register file) rather than storing the data in the reservation station [10]. While trivial instruction bypassing could be used on top of both implementations, here we assume the former.

Figure 3 shows the schematic of a processor that bypasses trivial instructions and the procedures followed. We first discuss decode-trivial instructions. At decode, the Trivial Instruction Detection unit examines source operands. If the instruction is trivial, the rename table is modified so it maps the destination register to the physical register assigned to the input source operand or to the zero register, as presented in Figure 3(b). Once the rename table is modified, we no longer execute the trivial instruction. As such, instructions depending on the trivial instruction result can start execution immediately (subject to resource availability). Note that decode-trivial instructions, once detected, do not consume execution unit resources.
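The rename-table remapping can be pictured with a toy model. The sketch below is an assumption-laden illustration rather than the mechanism evaluated here: the rename table is a plain dictionary from architectural to physical register names, ZERO_REG and bypass_at_decode are hypothetical names, and physical-register allocation and freeing are ignored. It covers only the two cases named in the text, a result equal to a source operand and a result equal to zero.

```python
ZERO_REG = "p0"  # hypothetical physical register that always holds zero


def bypass_at_decode(rename_table, dest, result_source=None):
    """Remap dest instead of executing a decode-trivial instruction.

    rename_table: dict mapping architectural register -> physical register.
    dest: architectural destination register of the trivial instruction.
    result_source: architectural source register whose value equals the
        result, or None when the result is zero.
    Returns the physical register that dest now maps to.
    """
    if result_source is None:
        rename_table[dest] = ZERO_REG
    else:
        rename_table[dest] = rename_table[result_source]
    return rename_table[dest]


table = {"r1": "p7", "r2": "p4", "r3": "p9", "r5": "p11"}

# add r3 = r1 + r2 with r2 known to be zero: the result is the value of r1.
bypass_at_decode(table, "r3", result_source="r1")

# mul r5 = r1 * r2 with r2 known to be zero: the result is zero.
bypass_at_decode(table, "r5")

print(table["r3"], table["r5"])  # p7 p0
```

A later consumer of r3 is renamed to p7 and can therefore issue as soon as p7 is available, without waiting for an add to execute.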

To identify trivial instructions while they are in the issue queue, the trivial instruction detection unit examines the produced data as soon as the associated tag is received by the reservation station. Once we detect an issue-trivial instruction, we bypass executing the instruction and send the result to the write-back unit, as presented in Figure 3(c). However, the destination register of an issue-trivial instruction should not be released, since there may still be instructions depending on the trivial instruction outcome which have not read their source operands yet.
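The issue-stage check can be sketched in the same spirit. The class below is a simplified software model under several assumptions: RSEntry, capture and try_bypass are hypothetical names, only a multiply by zero and an add of zero are handled, and wakeup timing and issue-slot accounting are not modelled.

```python
class RSEntry:
    """One reservation-station entry holding up to two source operands."""

    def __init__(self, op, dest_tag, src_tags, src_vals):
        self.op = op
        self.dest_tag = dest_tag
        self.src_tags = list(src_tags)
        self.src_vals = list(src_vals)  # None marks an operand not yet available

    def capture(self, tag, value):
        """Handle a result broadcast; return a write-back action if now trivial."""
        for i, t in enumerate(self.src_tags):
            if t == tag and self.src_vals[i] is None:
                self.src_vals[i] = value
        return self.try_bypass()

    def try_bypass(self):
        a, b = self.src_vals
        # Fully-trivial: one zero operand is enough for a multiply.
        if self.op == "mul" and (a == 0 or b == 0):
            return ("writeback", self.dest_tag, 0)
        # Semi-trivial: an add needs both operands; the result is the other one.
        if self.op == "add" and a is not None and b is not None and (a == 0 or b == 0):
            return ("writeback", self.dest_tag, b if a == 0 else a)
        return None  # not (yet) trivial: keep waiting or execute normally


mul = RSEntry("mul", "t9", ["t1", "t2"], [None, None])
print(mul.capture("t2", 0))  # ('writeback', 't9', 0)

add = RSEntry("add", "t5", ["t1", "t2"], [0, None])
print(add.capture("t2", 4))  # ('writeback', 't5', 4)
```

In this model the multiply is bypassed while its other operand is still outstanding, whereas the add is bypassed only once its second operand has been captured.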

Note that, in order to improve performance, modern processors wake up consumer instructions in advance, before the data is actually available. This makes executing producer-consumer pairs in consecutive cycles possible. As a result, issue-trivial instructions would have to be issued first and then read operands to test triviality. Consequently, in this study we assume that issue-trivial instructions take issue slots but will not be executed in the ALU and will write their results as soon as possible. Therefore, issue-trivial instructions benefit less from trivial instruction bypassing compared to decode-trivial instructions.

Figure 3: a) Schematic for a pipelined processor bypassing trivial instructions b) Decode-Trivial instruction detection procedure c) Issue-Trivial instruction detection procedure.

[Diagram not reproduced. (a) Pipeline stages FETCH, DECODE & RENAME, ISSUE, COMPLETE, COMMIT, with a TRIVIAL INST. DETECTION unit. (b) Read Operand From Register File; Trivial? No: Do Nothing. Yes: Bypass Instruction: Remap Renaming Table. (c) Read Operand From Bypass Logic; Trivial? No: Do Nothing. Yes: Bypass Instruction: Do not Execute, Send Zero or the Non-Trivializing Source Operand to Bypass Logic & Register File.]

4. METHODOLOGY AND RESULTS

In this Section, we report our analysis framework. To evaluate how bypassing trivial instructions impacts performance and energy, we compare our processor with a conventional processor that does not bypass trivial instructions. We report performance, energy and energy-delay.

We used both floating point (equ, mes and swm) and integer (gzp, vpr, gcc, bzp and wlf) programs from the SPEC CPU2000 suite compiled for the MIPS-like PISA architecture used by the Simplescalar v3.0 simulation tool set [1]. We used WATTCH [4] for energy estimation. The benchmark set studied here covers a range of programs, including high- and low-IPC programs and those limited by memory, branch misprediction, etc.

Note that detecting and bypassing trivial instructions requires additional hardware. Consequently, and depending on how the technique is implemented, this will result in power overhead. Throughout this study we assume that this power overhead is negligible compared to our savings.

We used GNU’s gcc compiler. In the interest of space, we use the abbreviations shown under the “Ab.” column in Table 2. We simulated 500M instructions after skipping 500M instructions. We detail the base processor model in Table 3.

4.1. Performance

Bypassing trivial instructions will improve performance only if the bypassed instructions are on the critical path. To investigate how bypassing trivial instructions impacts performance, in Figure 4 we report performance improvements compared to a conventional processor. Vpr and mes show higher performance improvements than the other benchmarks. Wlf has the lowest performance improvement among all benchmarks.

4.2. Energy and Energy-Delay

In Figure 5 we report energy and energy-delay measurements. In Figure 5(a) we report energy savings achieved by bypassing trivial instructions. Wlf has the lowest energy savings among the benchmarks. Vpr and equ have higher energy reductions than the rest of the benchmarks.

In Figure 5(b) we report energy-delay improvements achieved by bypassing trivial instructions. Again, wlf has the lowest energy-delay improvement compared to other benchmarks. Vpr has the highest energy-delay improvement among all benchmarks.
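Because bypassing reduces both energy and runtime, the two savings compound in the energy-delay metric. The sketch below assumes the conventional definition of energy-delay as the product of energy and execution time, which this section does not spell out. The function name and the input values are hypothetical and are not results from this study.

```python
def energy_delay_improvement(energy_saving, delay_saving):
    """Relative reduction of the energy-delay product E * D.

    energy_saving: fractional reduction in energy (0.05 means 5% less energy).
    delay_saving: fractional reduction in execution time.
    """
    return 1.0 - (1.0 - energy_saving) * (1.0 - delay_saving)


# Hypothetical single-benchmark example: 5% less energy, 10% shorter runtime.
improvement = energy_delay_improvement(0.05, 0.10)
print(f"{improvement:.1%}")  # 14.5%
```

Note that the relation holds per benchmark. Averages over several benchmarks need not combine in exactly this way.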

4.3. Discussion

In this Section we review the results. A detailed analysis of the results would require studying many issues including instruction type distribution for the bypassed instructions and how often critical path instructions are bypassed for each benchmark. Our discussion here, however, only focuses on the data presented earlier. We discuss integer and floating point benchmarks separately.

Table 2: Benchmark abbreviations used here

Program        Ab.     Program        Ab.
164.gzip       gzp     177.mesa       mes
171.swim       swm     183.equake     equ
175.vpr        vpr     256.bzip2      bzp
176.gcc        gcc     300.twolf      wlf

Table 3: Base processor configuration.

Instruction Fetch Queue #    32
Reorder Buffer Size          64
Load/Store Queue Size        32
Branch Predictor             8K GShare+8K bi-modal w/ 8K selector
Scheduler                    64 entries, RUU-like
Fetch Unit                   Up to 4 instr./cycle. 64-Entry Fetch Buffer
OOO Core                     any 4 instructions / cycle
L1 - Instruction Caches      64K, 4-way SA, 32-byte blocks, 3 cycle hit latency
L1 - Data Caches             32K, 2-way SA, 32-byte blocks, 3 cycle hit latency
Unified L2                   256K, 4-way SA, 64-byte blocks, 16-cycle hit latency
Main Memory                  Infinite, 80 cycles
Memory Port #                2

Figure 4: Performance improvement achieved by bypassing trivial instructions over a conventional processor.

[Plot not reproduced. x-axis: gzp, vpr, equ, bzp, gcc, wlf, swm, mes, AVG; y-axis: 0% to 14%.]

1) The integer benchmarks studied here include gzp, vpr, bzp, gcc and wlf. Among integer benchmarks, vpr and gcc have a higher number of trivial instructions. The high number of trivial instructions for vpr explains why this benchmark benefits more than other benchmarks from bypassing trivial instructions. As for gcc, how-