Simple and Efficient Construction of Static Single
Assignment Form
Matthias Braun¹, Sebastian Buchwald¹, Sebastian Hack², Roland Leißa²,
Christoph Mallon², and Andreas Zwinkau¹
¹ Karlsruhe Institute of Technology
{matthias.braun,buchwald,zwinkau}@kit.edu
² Saarland University
{hack,leissa,mallon}@cs.uni-saarland.de
Abstract. We present a simple SSA construction algorithm, which al-
lows direct translation from an abstract syntax tree or bytecode into an
SSA-based intermediate representation. The algorithm requires no prior
analysis and ensures that even during construction the intermediate rep-
resentation is in SSA form. This allows the application of SSA-based op-
timizations during construction. After completion, the intermediate rep-
resentation is in minimal and pruned SSA form. In spite of its simplicity,
the runtime of our algorithm is on par with Cytron et al.’s algorithm.
1 Introduction
Many modern compilers feature intermediate representations (IR) based on the
static single assignment form (SSA form). SSA was conceived to make program
analyses more efficient by compactly representing use-def chains. Over the last
years, it turned out that the SSA form not only helps to make analyses more
efficient but also easier to implement, test, and debug. Thus, modern compilers
such as the Java HotSpot VM [14], LLVM [2], and libFirm [1] exclusively base
their intermediate representation on the SSA form.
The first algorithm to efficiently construct the SSA form was introduced by
Cytron et al. [10]. One reason why this algorithm is still very popular is that it
guarantees a form of minimality on the number of placed φ functions. However,
for compilers that are entirely SSA-based, this algorithm has a significant draw-
back: Its input program has to be represented as a control flow graph (CFG) in
non-SSA form. Hence, if the compiler wants to construct SSA from the input
language (be it given by an abstract syntax tree or some bytecode format), it has
to take a detour through a non-SSA CFG in order to apply Cytron et al.’s algo-
rithm. Furthermore, to guarantee the minimality of the φ function placement,
Cytron et al.'s algorithm relies on several other analyses and transformations:
To calculate the locations where φ functions have to be placed, it computes
a dominance tree and the iterated dominance frontiers. To avoid placing dead
φ functions, a liveness analysis or dead code elimination has to be performed [7].
Both requiring a CFG and relying on other analyses make it inconvenient to
use this algorithm in an SSA-centric compiler.
Modern SSA-based compilers take different approaches to construct SSA:
For example, LLVM uses Cytron et al.’s algorithm and mimics the non-SSA
CFG by putting all local variables into memory (which is usually not in SSA-
form). This comes at the cost of expressing simple definitions and uses of those
variables using memory operations. Our measurements show that 25% of all
instructions generated by the LLVM front end are of this kind: immediately
after the construction of the IR they are removed by SSA construction.
Other compilers, such as the Java HotSpot VM, do not use Cytron et al.’s
algorithm at all because of the inconveniences described above. However, they
also have the problem that they do not compute minimal and/or pruned SSA
form, that is, they insert superfluous and/or dead φ functions.
In this paper, we
– present a simple, novel SSA construction algorithm, which requires neither dominance nor iterated dominance frontiers, and thus is suited to construct an SSA-based intermediate representation directly from an AST (Section 2),
– show how to combine this algorithm with on-the-fly optimizations to reduce the footprint during IR construction (Section 3.1),
– describe a post pass that establishes minimal SSA form for arbitrary programs (Section 3.2),
– prove that the SSA construction algorithm constructs pruned SSA form for all programs and minimal SSA form for programs with reducible control flow (Section 4),
– show that the algorithm can also be applied in related domains, like translating an imperative program to a functional continuation-passing style (CPS) program or reconstructing SSA form after transformations, such as live range splitting or rematerialization, have added further definitions to an SSA value (Section 5),
– demonstrate the efficiency and simplicity of the algorithm by implementing it in Clang and comparing it with Clang/LLVM's implementation of Cytron et al.'s algorithm (Section 6).
To the best of our knowledge, the algorithm presented in this paper is the first
to construct minimal and pruned SSA on reducible CFGs without depending on
other analyses.
2 Simple SSA Construction
In the following, we describe our algorithm to construct SSA form. It significantly
differs from Cytron et al.’s algorithm in its basic idea. Cytron et al.’s algorithm is
an eager approach operating in forwards direction: First, the algorithm collects
all definitions of a variable. Then, it calculates the placement of corresponding
φ functions and, finally, pushes these definitions down to the uses of the variable.
In contrast, our algorithm works backwards in a lazy fashion: Only when a
variable is used, we query its reaching definition. If it is unknown at the current
location, we will search backwards through the program. We insert φfunctions
at join points in the CFG along the way, until we find the desired definition. We
employ memoization to avoid repeated look-ups.
This process consists of several steps, which we explain in detail in the rest
of this section. First, we consider a single basic block. Then, we extend the
algorithm to whole CFGs. Finally, we show how to handle incomplete CFGs,
which usually emerge when translating an AST to IR.
2.1 Local Value Numbering
a ← 42
b ← a
c ← a + b
a ← c + 23
c ← a + d
(a) Source program

v1: 42
v2: v1 + v1
v3: 23
v4: v2 + v3
v5: v4 + v?
(b) SSA form
Fig. 1. Example for local value numbering
When translating a source program, the IR for a sequence of statements usually
ends up in a single basic block. We process these statements in program execution
order and for each basic block we keep a mapping from each source variable to
its current defining expression. When encountering an assignment to a variable,
we record the IR of the right-hand side of the assignment as current definition
of the variable. Accordingly, when a variable is read, we look up its current
definition (see Algorithm 1). This process is well known in the literature as local
value numbering [9]. When local value numbering for one block is finished, we
call this block filled. Particularly, successors may only be added to a filled block.
This property will later be used when handling incomplete CFGs.
writeVariable(variable, block, value):
    currentDef[variable][block] ← value

readVariable(variable, block):
    if currentDef[variable] contains block:
        # local value numbering
        return currentDef[variable][block]
    # global value numbering
    return readVariableRecursive(variable, block)

Algorithm 1: Implementation of local value numbering
A sample program and the result of this process are illustrated in Figure 1. For
the sake of presentation, we denote each SSA value by a name vi.¹ In a concrete
implementation, we would just refer to the representation of the expression. The
names have no meaning otherwise; in particular, they are not local variables in
the sense of an imperative language.
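For illustration only (this is not the paper's implementation), the following Python sketch replays the local value numbering of Figure 1 within a single basic block; expression nodes are plain tuples and all names are made up for this example.

    # Local value numbering for one basic block (cf. Figure 1); illustrative only.
    current_def = {}  # maps a source variable to its current defining expression

    def write_variable(var, value):
        current_def[var] = value

    def read_variable(var):
        # Within one block every read sees a prior write; reads of variables with
        # no local definition (like d below) are handled by the global value
        # numbering of Section 2.2.
        return current_def[var]

    write_variable('a', 42)                                 # v1: 42
    write_variable('b', read_variable('a'))                 # b maps to v1 as well
    write_variable('c', ('+', read_variable('a'),
                              read_variable('b')))          # v2: v1 + v1
    write_variable('a', ('+', read_variable('c'), 23))      # v4: v2 + v3
    # c <- a + d: read_variable('d') has no local definition here (value v?)
    print(current_def)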
Now, a problem occurs if a variable is read before it is assigned in a basic
block. This can be seen in the example for the variable d and its corresponding
value v?. In this case, d's definition is found on a path from the CFG's root to the
current block. Moreover, multiple definitions in the source program may reach
the same use. The next section shows how to extend local value numbering to
handle these situations.

¹ This acts like a let binding in a functional language. In fact, SSA form is a kind of functional representation [3].
2.2 Global Value Numbering
x ← ...
while (...)
{
    if (...) {
        x ← ...
    }
}
use(x)
(a) Source program

v0: ...
while (...)
    v2: φ(v0, v3)
{
    if (...) {
        v1: ...
    }
    v3: φ(v1, v2)
}
use(v2)
(b) SSA form

(c) Control flow graph with blocks for v0, the loop header (v2: φ(v0, v3)), the if branch (v1), the body exit (v3: φ(v1, v2)), and use(v2)

Fig. 2. Example for global value numbering
If a block currently contains no definition for a variable, we recursively look
for a definition in its predecessors. If the block has a single predecessor, just
query it recursively for a definition. Otherwise, we collect the definitions from
all predecessors and construct a φ function, which joins them into a single new
value. This φ function is recorded as current definition in this basic block.
Looking for a value in a predecessor might in turn lead to further recursive
look-ups. Due to loops in the program, those might lead to endless recursion.
Therefore, before recursing, we first create the φ function without operands and
record it as the current definition for the variable in the block. Then, we determine
the φ function's operands. If a recursive look-up arrives back at the block,
this φ function will provide a definition and the recursion will end. Algorithm 2
shows pseudocode to perform global value numbering. Its first condition will be
used to handle incomplete CFGs, so for now assume it is always false.
Figure 2 shows this process. For presentation, the indices of the values vi are
assigned in the order in which the algorithm inserts them. We assume that the
loop is constructed before x is read, i.e., v0 and v1 are recorded as definitions for x
by local value numbering and only then the statement after the loop looks up x.
As there is no definition for x recorded in the block after the loop, we perform
a recursive look-up. The block has only a single predecessor, so no φ function is
needed here. This predecessor is the loop header, which also has no definition
for x and has two predecessors. Thus, we place an operandless φ function v2. Its
first operand is the value v0 flowing into the loop. The second operand requires
further recursion. The φ function v3 is created and gets its operands from its
direct predecessors. In particular, v2 placed earlier breaks the recursion.
Recursive look-up might leave redundant φ functions. We call a φ function vφ
trivial iff it just references itself and one other value v any number of times:
∃v ∈ V: vφ: φ(x1, . . . , xn), xi ∈ {vφ, v}.
readVariableRecursive(variable, block):
    if block not in sealedBlocks:
        # Incomplete CFG
        val ← new Phi(block)
        incompletePhis[block][variable] ← val
    else if |block.preds| = 1:
        # Optimize the common case of one predecessor: No phi needed
        val ← readVariable(variable, block.preds[0])
    else:
        # Break potential cycles with operandless phi
        val ← new Phi(block)
        writeVariable(variable, block, val)
        val ← addPhiOperands(variable, val)
    writeVariable(variable, block, val)
    return val

addPhiOperands(variable, phi):
    # Determine operands from predecessors
    for pred in phi.block.preds:
        phi.appendOperand(readVariable(variable, pred))
    return tryRemoveTrivialPhi(phi)

Algorithm 2: Implementation of global value numbering
tryRemoveTrivialPhi(phi):
    same ← None
    for op in phi.operands:
        if op = same || op = phi:
            continue  # Unique value or self-reference
        if same ≠ None:
            return phi  # The phi merges at least two values: not trivial
        same ← op
    if same = None:
        same ← new Undef()  # The phi is unreachable or in the start block
    users ← phi.users.remove(phi)  # Remember all users except the phi itself
    phi.replaceBy(same)  # Reroute all uses of phi to same and remove phi
    # Try to recursively remove all phi users, which might have become trivial
    for use in users:
        if use is a Phi:
            tryRemoveTrivialPhi(use)
    return same

Algorithm 3: Detect and recursively remove a trivial φ function
Such a φ function can be removed and the value v is used instead (see Algorithm 3).
As a special case, the φ function might use no other value besides itself. This means
that it is either unreachable or in the start block. We replace it by an undefined value.
Moreover, if a φ function could be successfully replaced, other φ functions
using this replaced value might become trivial as well. For this reason, we apply
this simplification recursively on all of these users.
This approach works for all acyclic language constructs. In this case, we can
fill all predecessors of a block before processing it. The recursion will only search
in already filled blocks. This ensures that we retrieve the latest definition for
each variable from the predecessors. For example, in an if-then-else statement the
block containing the condition can be filled before the then and else branches are
processed. Accordingly, after the two branches are completed, the block joining
the branches is filled. This approach also works when reading a variable after a
loop has been constructed. But when reading a variable within a loop, which is
under construction, some predecessors—at least the jump back to the head of
the loop—are missing.
2.3 Handling Incomplete CFGs
We call a basic block sealed if no further predecessors will be added to the block.
As only filled blocks may have successors, predecessors are always filled. Note
that a sealed block is not necessarily filled. Intuitively, a filled block contains
all its instructions and can provide variable definitions for its successors. Con-
versely, a sealed block may look up variable definitions in its predecessors as all
predecessors are known.
sealBlock(block):
    for variable in incompletePhis[block]:
        addPhiOperands(variable, incompletePhis[block][variable])
    sealedBlocks.add(block)

Algorithm 4: Handling incomplete CFGs
But how do we handle a look-up of a variable in an unsealed block that has
no current definition for this variable? In this case, we place an operandless
φ function into the block and record it as proxy definition (see the first case in
Algorithm 2). Further, we maintain a set incompletePhis of these proxies per
block. When a block is sealed later on, we add operands to these φ functions
(see Algorithm 4). Again, when the φ function is complete, we check whether it
is trivial.
Sealing a block is an explicit action during IR construction. We illustrate how
to incorporate this step by the example of constructing the while loop seen in
Figure 3a. First, we construct the while header block and add a control flow edge
from the while entry block to it. Since the jump from the body exit needs to be
added later, we cannot seal the while header yet. Next, we create the body entry
and while exit blocks and add the conditional control flow from the while header
to these two blocks. No further predecessors will be added to the body entry
block, so we seal it now. The while exit block might get further predecessors
due to break instructions in the loop body. Now we fill the loop body. This
might include further inner control structures, like an if shown in Figure 3b.
Finally, they converge at the body exit block. All the blocks forming the body
are sealed at this point. Now we add the edge back to the while header and seal
the while header. The loop is completed. In the last step, we seal the while exit
block and then continue IR construction with the source statement after the
while loop.
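The following small Python sketch only models the bookkeeping of this construction order for the while loop of Figure 3a; the Block class, the block names, and the log are illustrative, and the actual filling of instructions as well as the incomplete-φ handling of Algorithm 4 are left out.

    # Bookkeeping sketch of the fill/seal order for the while loop of Figure 3a.
    class Block:
        def __init__(self, name):
            self.name, self.preds = name, []
            self.filled = self.sealed = False
        def add_pred(self, pred, log):
            assert pred.filled, "only filled blocks may get successors"
            self.preds.append(pred); log.append(f"edge {pred.name} -> {self.name}")
        def fill(self, log):  self.filled = True; log.append(f"fill {self.name}")
        def seal(self, log):  self.sealed = True; log.append(f"seal {self.name}")

    log = []
    entry, header = Block("while entry"), Block("while header")
    body, exit_ = Block("loop body"), Block("while exit")

    entry.fill(log)
    header.add_pred(entry, log)     # back edge from the body is still missing
    header.fill(log)                # fill the loop condition
    body.add_pred(header, log); exit_.add_pred(header, log)
    body.seal(log)                  # the body entry gets no further predecessors
    body.fill(log)                  # fill the loop body (may contain inner ifs)
    header.add_pred(body, log)      # add the back edge, then ...
    header.seal(log)                # ... the header's predecessors are complete
    exit_.seal(log)                 # breaks, if any, have been added by now
    exit_.fill(log)                 # continue with the statement after the loop
    print("\n".join(log))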
(a) While statement: blocks while entry, while header, body entry, body exit, and while exit
(b) If statement: blocks if entry, then entry, then exit, else entry, else exit, and if exit

Fig. 3. CFG illustration of construction procedures. Dotted lines represent possible
further code, while straight lines are normal control flow edges. Numbers next to a
basic block denote the order of sealing (top) and filling (bottom).
3 Optimizations
3.1 On-the-fly Optimizations
In the previous section, we showed that we optimize trivial φ functions as soon
as they are created. Since φ functions belong to the IR, this means we employ an
IR optimization during SSA construction. Obviously, this is not possible with all
optimizations. In this section, we elaborate what kind of IR optimizations can
be performed during SSA construction and investigate their effectiveness.
We start with the question whether the optimization of φ functions removes
all trivial φ functions. As already mentioned in Section 2.2, we recursively optimize
all φ functions that have used a removed trivial φ function. Since a successful
optimization of a φ function can only render those φ functions trivial that use the
former one, this mechanism enables us to optimize all trivial φ functions. In Section 4,
we show that, for reducible CFGs, this is equivalent to the construction
of minimal SSA form.
Since our approach may lead to a significant number of triviality checks, we
use the following caching technique to speed such checks up: While constructing
a φ function, we record the first two distinct operands that are also distinct from
the φ function itself. These operands act as witnesses for the non-triviality of the
φ function. When a new triviality check occurs, we compare the witnesses again.
If they remain distinct from each other and the φ function, the φ function is still
non-trivial. Otherwise, we need to find a new witness. Since all operands until
the second witness are equal to the first one or to the φ function itself, we only
need to consider operands that are constructed after both old witnesses. Thus,
the technique speeds up multiple triviality checks for the same φ function.
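A minimal Python sketch of this caching follows; the field names and the rescanning strategy are illustrative, and a real implementation would resume the operand scan after the old witnesses instead of restarting it.

    # Sketch of caching two "witness" operands to speed up triviality checks.
    class Phi:
        def __init__(self, operands=None):
            self.operands = operands if operands is not None else []
            self._witnesses = []        # up to two operands proving non-triviality

        def is_trivial(self):
            # Fast path: two cached witnesses that are still distinct from each
            # other and from the phi itself prove that the phi is non-trivial.
            w = [x for x in self._witnesses if x is not self]
            if len(w) == 2 and w[0] is not w[1]:
                return False
            # Slow path: look for (up to) two fresh witnesses among the operands.
            self._witnesses = []
            for op in self.operands:
                if op is self or any(op is x for x in self._witnesses):
                    continue
                self._witnesses.append(op)
                if len(self._witnesses) == 2:
                    return False        # merges at least two distinct values
            return True                 # references at most one other value

    a, b = object(), object()
    p = Phi([a, b]); print(p.is_trivial())                        # False
    q = Phi([a, a]); q.operands.append(q); print(q.is_trivial())  # True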
There is a simpler variant of the mechanism that results in minimal SSA
form for most but not all cases: Instead of optimizing the users of a replaced
φ function, we optimize the unique operand. This variant is especially interesting
for IRs that do not inherently provide a list of users for each value.
The optimization of φ functions is only one out of many IR optimizations
that can be performed during IR construction. In general, our SSA construction
algorithm allows the use of conservative IR optimizations, i.e., optimizations that
require only local analysis. These optimizations include:

Arithmetic Simplification All IR node constructors perform peephole optimizations and return simplified nodes when possible. For instance, the construction of a subtraction x − x always yields the constant 0.
Common Subexpression Elimination This optimization reuses existing values that are identified by local value numbering.
Constant Folding This optimization evaluates constant expressions at compile time, e.g., 2*3 is optimized to 6.
Copy Propagation This optimization removes unnecessary assignments to local variables, e.g., x = y. In SSA form, there is no need for such assignments; we can directly use the value of the right-hand side.

A small sketch of such an optimizing node constructor is given below.
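The sketch below is purely illustrative (tuples stand in for IR nodes, all names are hypothetical): it combines constant folding, a few arithmetic simplifications, and value-numbering-based common subexpression elimination in one node constructor, in the spirit of the optimizations listed above.

    # Peephole-optimizing construction of binary IR nodes (illustrative names).
    known_nodes = {}   # value numbering table: (op, left, right) -> node

    def make_binary(op, left, right):
        # Constant folding: both operands are constants, evaluate right away.
        if isinstance(left, int) and isinstance(right, int):
            return {'+': left + right, '-': left - right,
                    '*': left * right, '&': left & right}[op]
        # Arithmetic simplification: x - x = 0, x & 0 = 0, x + 0 = x, x * 1 = x.
        if op == '-' and left == right:
            return 0
        if op == '&' and (left == 0 or right == 0):
            return 0
        if op == '+' and right == 0 or op == '*' and right == 1:
            return left
        # Common subexpression elimination via the value numbering table.
        key = (op, left, right)
        if key not in known_nodes:
            known_nodes[key] = ('node', op, left, right)
        return known_nodes[key]

    x = ('param', 'x')
    print(make_binary('&', x, 0))                               # simplified to 0
    print(make_binary('+', x, 0))                               # simplified to x
    print(make_binary('+', x, 1) is make_binary('+', x, 1))     # True: CSE reuse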
int foo(int x) {
    int mask ← 0;
    int res;
    if (x & mask) {
        res ← 0;
    } else {
        res ← x;
    }
    return res;
}
(a) Program code

v0: x
v1: 0
v2: v0 & v1
v3: v2 ≠ 0
condjump v3
v4: 0        v5: v0
v6: φ(v4, v5)
return v6
(b) Unoptimized

v0: x
v1: 0
v3: false
return v0
(c) Optimized

Fig. 4. The construction algorithm allows performing conservative optimizations during
SSA construction. This may also affect control flow, which in turn could lead to a
reduced number of φ functions.
Figure 4 shows the effectiveness of these optimizations. We want to con-
struct SSA form for the code fragment shown in Figure 4a. Without on-the-fly
optimizations, this results in the SSA form program shown in Figure 4b. The
first difference with enabled optimizations occurs during the construction of the
value v2. Since a bitwise conjunction with zero always yields zero, the arithmetic
simplification triggers and simplifies this value. Moreover, the constant value
zero is already available. Thus, the common subexpression elimination reuses
the value v1 for the value v2. In the next step, constant propagation folds the
comparison with zero to false. Since the condition of the conditional jump is
false, we can omit the then part.² Within the else part, we perform copy
propagation by registering v0 as the value for res. Likewise, v6 vanishes and in the
end the function returns v0. Figure 4c shows the optimized SSA form program.
The example demonstrates that on-the-fly optimizations can further reduce
the number of φ functions. This can even lead to fewer φ functions than required
for minimal SSA form according to Cytron et al.'s definition.
² This is only possible because the then part contains no labels.
3.2 Minimal SSA Form for Arbitrary Control Flow
So far, our SSA construction algorithm does not construct minimal SSA form in
the case of irreducible control flow. Figure 5b shows the constructed SSA form
for the program shown in Figure 5a. Figure 5c shows the corresponding minimal
SSA form—as constructed by Cytron et al.’s algorithm. Since there is only one
definition for the variable x, the φ functions constructed by our algorithm are
superfluous.
x ← ...
if (...)
    goto second_loop_entry
while (...) {
    ...
second_loop_entry:
    ...
}
use(x)
(a) Program with irreducible control flow

v0: ...
v1: φ(v0, v2)
v2: φ(v0, v1)
use(v2)
(b) Intermediate representation with our SSA construction algorithm

v0: ...
use(v0)
(c) Intermediate representation in minimal SSA form

Fig. 5. Our SSA construction algorithm can produce extraneous φ functions in presence
of irreducible control flow. We remove these φ functions afterwards.
In general, a non-empty set P of φ functions is redundant iff the φ functions
just reference each other or one other value v: ∃v ∈ V ∀vi ∈ P: vi: φ(x1, . . . , xn),
xi ∈ P ∪ {v}. In particular, when P contains only a single φ function, this
definition degenerates to the definition of a trivial φ function given in Section 2.2.
We show that each set of redundant φ functions P contains a strongly connected
component (SCC) that is also redundant. This implies a definition of minimality
that is independent of the source program and stricter than the definition
by Cytron et al. [10].
Lemma 1. Let P be a redundant set of φ functions with respect to v. Then there
is a strongly connected component S ⊆ P that is also redundant.

Proof. Let P′ be the condensation of P, i.e., each SCC in P is contracted into a
single node. The resulting P′ is acyclic [11]. Since P′ is non-empty, it has a leaf
s0. Let S be the SCC which corresponds to s0. Since s0 is a leaf, the φ functions
in S only refer to v or to other φ functions in S. Hence, S is a redundant SCC. ⊓⊔
Algorithm 5 exploits Lemma 1 to remove superfluous φ functions. The function
removeRedundantPhis takes a set of φ functions and computes the SCCs of
their induced subgraph. Figure 6b shows the resulting SCCs for the data flow
graph in Figure 6a. Each dashed edge targets a value that does not belong to the
SCC of its source. We process the SCCs in topological order to ensure that used
values outside of the current SCC are already contracted. In our example, this
proc removeRedundantPhis(phiFunctions):
    sccs ← computePhiSCCs(inducedSubgraph(phiFunctions))
    for scc in topologicalSort(sccs):
        processSCC(scc)

proc processSCC(scc):
    if len(scc) = 1: return  # we already handled trivial φ functions
    inner ← set()
    outerOps ← set()
    for phi in scc:
        isInner ← True
        for operand in phi.getOperands():
            if operand not in scc:
                outerOps.add(operand)
                isInner ← False
        if isInner:
            inner.add(phi)
    if len(outerOps) = 1:
        replaceSCCByValue(scc, outerOps.pop())
    else if len(outerOps) > 1:
        removeRedundantPhis(inner)

Algorithm 5: Remove superfluous φ functions in case of irreducible data flow.
means we process the SCC containing only φ0 first. Since φ0 is the only φ function
within its SCC, we already handled it during the removal of trivial φ functions.
Thus, we skip this SCC.
For the next SCC, containing φ1–φ4, we construct two sets: The set inner
contains all φ functions having operands solely within the SCC. The set outerOps
contains all φ function operands that do not belong to the SCC. For our example,
inner = {φ3, φ4} and outerOps = {φ0, +}.
If the outerOps set is empty, the corresponding basic blocks must be unreachable
and we skip the SCC. In case that the outerOps set contains exactly one
value, all φ functions within the SCC can only get this value. Thus, we replace
the SCC by the value. If the outerOps set contains multiple values, all φ functions
that have an operand outside the SCC are necessary. We collected the remaining
φ functions in the set inner. Since our SCC can contain multiple inner SCCs, we
recursively perform the procedure with the inner φ functions. Figure 6c shows
the inner SCC for our example. In the recursive step, we replace this SCC by
φ2. Figure 6d shows the resulting data flow graph.
Performing on-the-fly optimizations (Section 3.1) can also lead to irreducible
data flow. For example, let us reconsider Figure 2 described in Section 2.2.
Assume that both assignments to x are copies from some other variable y. If
we now perform copy propagation, we obtain two φ functions v2: φ(v0, v3) and
v3: φ(v0, v2) that form a superfluous SCC. Note that this SCC will also appear
after performing copy propagation on a program constructed with Cytron's algorithm.
Thus, Algorithm 5 is also applicable to other SSA construction algorithms.
Finally, Algorithm 5 can be seen as a generalization of the local simplifications
by Aycock and Horspool (see Section 7).
(a) Original data flow graph: nodes x, y, 1, +, and φ0–φ4
(b) SCCs and their operands
(c) Inner SCC: φ3 and φ4
(d) Optimized data flow graph: nodes x, y, 1, +, φ0, φ1, and φ2

Fig. 6. Algorithm 5 detects the inner SCC spanned by φ3 and φ4. This SCC represents
the same value. Thus, it gets replaced by φ2.
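Algorithm 5 relies on a helper computePhiSCCs. A minimal Python sketch of how it could be realized with Tarjan's algorithm [17] is shown below; the duck-typed phi objects with an operands list are an assumption about the IR interface, and the small demo at the end is purely illustrative.

    # Sketch of computePhiSCCs: Tarjan's SCC algorithm restricted to the subgraph
    # induced by the given phi functions (edges go from a phi to its phi operands).
    def compute_phi_sccs(phi_functions):
        phis = set(phi_functions)
        index, lowlink, on_stack = {}, {}, set()
        stack, sccs, next_index = [], [], [0]

        def strongconnect(v):
            index[v] = lowlink[v] = next_index[0]; next_index[0] += 1
            stack.append(v); on_stack.add(v)
            for w in v.operands:
                if w not in phis:
                    continue                    # only phi-to-phi edges matter here
                if w not in index:
                    strongconnect(w)
                    lowlink[v] = min(lowlink[v], lowlink[w])
                elif w in on_stack:
                    lowlink[v] = min(lowlink[v], index[w])
            if lowlink[v] == index[v]:          # v is the root of an SCC
                scc = set()
                while True:
                    w = stack.pop(); on_stack.discard(w); scc.add(w)
                    if w is v:
                        break
                sccs.append(scc)

        for v in phis:
            if v not in index:
                strongconnect(v)
        # Tarjan emits an SCC only after every SCC it points to, i.e. operand SCCs
        # come first, which matches the processing order used by Algorithm 5.
        return sccs

    class FakePhi:                               # demo objects, illustrative only
        def __init__(self): self.operands = []
    p, q, r = FakePhi(), FakePhi(), FakePhi()
    p.operands, q.operands, r.operands = [q], [p], [p]    # {p, q} form a cycle
    print([len(s) for s in compute_phi_sccs([p, q, r])])  # prints [2, 1]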
3.3 Reducing the Number of Temporary φ Functions
The presented algorithm is structurally simple and easy to implement. However,
it might produce many temporary φ functions, which get removed right away
during IR construction. In the following, we describe two extensions to the algorithm,
which aim at reducing or even eliminating these temporary φ functions.

Marker Algorithm. Many control flow joins do not need a φ function for a variable.
So instead of placing a φ function before recursing, we just mark the block
as visited. If we reach this block again during recursion, we will place a φ function
there to break the cycle. After collecting the definitions from all predecessors, we
remove the marker and place a φ function (or reuse the one placed by recursion)
if we found different definitions. Using this technique, no temporary φ functions
are placed in acyclic data-flow regions. Temporary φ functions are only generated
in data-flow loops and might be determined to be unnecessary later on. A sketch
of this variant is given below.
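The following self-contained Python sketch is one possible reading of the marker variant; the Block and Phi classes and all names are illustrative, sealing is omitted, and the full user rerouting of Algorithm 3 is reduced to the single case needed for the demo.

    class Block:
        def __init__(self, name, preds=()):
            self.name, self.preds = name, list(preds)

    class Phi:
        def __init__(self, block):
            self.block, self.operands = block, []

    current_def = {}   # (variable, block) -> value
    marked = set()

    def write_variable(var, block, value):
        current_def[var, block] = value

    def read_variable(var, block):
        if (var, block) in current_def:
            return current_def[var, block]
        return read_variable_recursive(var, block)

    def read_variable_recursive(var, block):
        if block in marked:                    # reached again: break the cycle
            phi = Phi(block)
            write_variable(var, block, phi)
            return phi
        marked.add(block)
        ops = [read_variable(var, p) for p in block.preds]
        marked.discard(block)
        phi = current_def.get((var, block))    # phi placed by a recursive call?
        others = [op for op in ops if op is not phi]
        if len({id(o) for o in others}) == 1:  # only one distinct definition found
            val = others[0]                    # no (or only a trivial) phi needed;
            # a full implementation would also reroute users of phi (Algorithm 3)
        else:
            if phi is None:
                phi = Phi(block)               # different definitions: place a phi
            phi.operands = ops
            val = phi
        write_variable(var, block, val)
        return val

    # Demo: a loop with a single definition of x before it needs no phi at all.
    entry  = Block("entry")
    header = Block("header", [entry])
    body   = Block("body", [header])
    header.preds.append(body)                  # back edge
    after  = Block("after", [header])
    write_variable("x", entry, "v0")
    print(read_variable("x", after))           # prints v0: no phi is necessary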
SCC Algorithm. While recursing, we use Tarjan's algorithm to detect data-flow
cycles, i.e., SCCs [17]. If only a unique value enters the cycle, no φ functions
will be necessary. Otherwise, we place a φ function into every basic block that
has a predecessor from outside the cycle. In order to add operands to these
φ functions, we apply the recursive look-up again, as this may require the placement
of further φ functions. This mirrors the algorithm for removing redundant cycles
of φ functions described in Section 3.2. In case of recursing over sealed blocks,
the algorithm only places necessary φ functions. The next section gives a formal
definition of a necessary φ function and shows that an algorithm that only places
necessary φ functions produces minimal SSA form.
4 Properties of our Algorithm
Because most optimizations treat φ functions as uninterpreted functions, it is
beneficial to place as few φ functions as possible. In the rest of this section, we
show that our algorithm does not place dead φ functions and constructs minimal
(according to Cytron et al.'s definition) SSA form for programs with reducible
control flow.

Pruned SSA form. A program is said to be in pruned SSA form [7] if each φ function
(transitively) has at least one non-φ user. We only create φ functions on
demand when a user asks for it: either a variable being read or another φ function
needing an argument. So our construction naturally produces a program in
pruned SSA form.
Minimal SSA form. Minimal SSA form requires that φ functions for a variable v
only occur in basic blocks where different definitions of v meet for the first time.
Cytron et al.'s formal definition is based on the following two terms:

Definition 1 (path convergence). Two non-null paths X0 →+ XJ and
Y0 →+ YK are said to converge at a block Z iff the following conditions hold:

X0 ≠ Y0;   (1)
XJ = Z = YK;   (2)
(Xj = Yk) ⇒ (j = J ∨ k = K).   (3)

Definition 2 (necessary φ function). A φ function for variable v is necessary
in block Z iff two non-null paths X →+ Z and Y →+ Z converge at a block Z,
such that the blocks X and Y contain assignments to v.
A program with only necessary φ functions is in minimal SSA form. The following
is a proof that our algorithm presented in Section 2 with the simplification
rule for φ functions produces minimal SSA form for reducible programs.
We say a block A dominates a block B if every path from the entry block
to B passes through A. We say A strictly dominates B if A dominates B and
A ≠ B. Each block C except the entry block has a unique immediate dominator
idom(C), i.e., a strict dominator of C which does not dominate any other strict
dominator of C. The dominance relation can be represented as a tree whose
nodes are the basic blocks with a connection between immediately dominating
blocks.
Definition 3 (reducible flow graph, Hecht and Ullman [12]). A (control)
flow graph G is reducible iff for each cycle C of G there is a node of C
which dominates all other nodes in C.
We now assume that our construction algorithm is finished and has produced
a program with a reducible CFG. We observe that the simplification rule
tryRemoveTrivialPhi of Algorithm 3 was applied at least once to each φ function
with its current arguments. This is because we apply the rule each time a φ function's
parameters are set for the first time. In the case that a simplification of
another operation leads to a change of parameters, the rule is applied again.
Furthermore, our construction algorithm fulfills the following property:

Definition 4 (SSA property). In an SSA-form program, a path from a definition
of an SSA value for variable v to its use cannot contain another definition or
φ function for v. The uses of the operands of a φ function happen in the respective
predecessor blocks, not in the φ's block itself.

The SSA property ensures that only the "most recent" SSA value of a variable
v is used. Furthermore, it forbids multiple φ functions for one variable in
the same basic block.
Lemma 2. Let p be a φ function in a block P. Furthermore, let q in a block Q
and r in a block R be two operands of p, such that p, q and r are pairwise distinct.
Then at least one of Q and R does not dominate P.

Proof. Assume that Q and R dominate P, i.e., every path from the start block
to P contains Q and R. Since immediate dominance forms a tree, Q dominates
R or R dominates Q. Without loss of generality, let Q dominate R. Furthermore,
let S be the corresponding predecessor block of P where p is using q. Then there
is a path from the start block crossing Q, then R and S. This violates the SSA
property. ⊓⊔
Lemma 3. If a φ function p in a block P for a variable v is unnecessary, but
non-trivial, then it has an operand q in a block Q, such that q is an unnecessary
φ function and Q does not dominate P.

Proof. The node p must have at least two different operands r and s, which are
not p itself. Otherwise, p is trivial. They can either be:

– The result of a direct assignment to v.
– The result of a necessary φ function r′. This however means that r′ was
  reachable by at least two different direct assignments to v. So there is a path
  from a direct assignment of v to p.
– Another unnecessary φ function.

Assume neither r in a block R nor s in a block S is an unnecessary φ function.
Then a path from an assignment to v in a block Vr crosses R and a path
from an assignment to v in a block Vs crosses S. They converge at P or earlier.
Convergence at P is not possible because p is unnecessary. An earlier convergence
would imply a necessary φ function at this point, which violates the SSA
property.

So r or s must be an unnecessary φ function. Without loss of generality, let
this be r.

If R does not dominate P, then r is the sought-after q. So let R dominate P.
Due to Lemma 2, S does not dominate P. Employing the SSA property, r ≠ p
yields R ≠ P. Thus, R strictly dominates P. This implies that R dominates all
predecessors of P, which contain the uses of p, especially the predecessor S′ that
contains the use of s. Due to the SSA property, there is a path from S to S′ that
does not contain R. Employing that R dominates S′, this yields that R dominates S.

Now assume that s is necessary. Let X contain the most recent definition of
v on a path from the start block to R. By Definition 2 there are two definitions
of v that render s necessary. Since R dominates S, the SSA property yields that
one of these definitions is contained in a block Y on a path R →+ S. Thus,
there are paths X →+ P and Y →+ P rendering p necessary. Since this is a
contradiction, s is unnecessary and the sought-after q. ⊓⊔
Theorem 1. A program in SSA form with a reducible CFG G without any trivial
φ functions is in minimal SSA form.

Proof. Assume G is not in minimal SSA form and contains no trivial φ functions.
We choose an unnecessary φ function p. Due to Lemma 3, p has an operand q,
which is unnecessary and does not dominate p. By induction, q has an unnecessary
φ function as operand as well, and so on. Since the program only has a finite
number of operations, there must be a cycle when following the q chain. A cycle
in the φ functions implies a cycle in G. As G is reducible, the control flow
cycle contains one entry block, which dominates all other blocks in the cycle.
Without loss of generality, let q be in the entry block, which means it dominates
p. Therefore, our assumption is wrong and G is either in minimal SSA form or
there exist trivial φ functions. ⊓⊔

Because our construction algorithm will remove all trivial φ functions, the
resulting IR must be in minimal SSA form for reducible CFGs.
4.1 Time Complexity
We use the following parameters to provide a precise worst-case complexity for
our construction algorithm:

– B denotes the number of basic blocks.
– E denotes the number of CFG edges.
– P denotes the program size.
– V denotes the number of variables in the program.

We start our analysis with the simple SSA construction algorithm presented in
Section 2.3. In the worst case, SSA construction needs to insert Θ(B) φ functions
with Θ(E) operands for each variable. In combination with the fact that the
construction of SSA form within all basic blocks is in Θ(P), this leads to a lower
bound of Ω(P + (B + E) · V).
We show that our algorithm matches this lower bound, leading to a worst-case
complexity of Θ(P + (B + E) · V). Our algorithm requires Θ(P) to fill all basic
blocks. Due to our variable mapping, we place at most O(B · V) φ functions. Furthermore,
we perform at most O(E · V) recursive requests at block predecessors.
Altogether, this leads to a worst-case complexity of Θ(P + (B + E) · V).
Next, we consider the on-the-fly optimization of φ functions. Once we have
optimized a φ function, we check whether we can optimize the φ functions that
use the former one. Since our algorithm constructs at most B · V φ functions,
this leads to O(B² · V²) checks. One check needs to compare at most O(B)
operands of the φ function. However, using the caching technique described in
Section 3.1, the number of checks performed for each φ function amortizes the
time for checking the corresponding φ function. Thus, the on-the-fly optimization
of φ functions can be performed in O(B² · V²).
To obtain minimal SSA form, we need to contract SCCs that pass the same
value. Since we consider only φ functions and their operands, the size of the
SCCs is in O((B + E) · V). Hence, computing the SCCs for the data flow graph
is in O(P + (B + E) · V). Computing the sets inner and outer considers each
φ function and its operands exactly once. Thus, it is also in O((B + E) · V). The
same argument applies for the contraction of an SCC in case there is only one
outer operand. In the other case, we iterate the process with a subgraph that is
induced by a proper subset of the nodes in our SCC. Thus, we need at most B · V
iterations. In total, this leads to a time complexity in O(P + B · (B + E) · V²)
for the contraction of SCCs.
5 Other Applications of the Algorithm
5.1 SSA Reconstruction
(a) Original program: blocks A → B → C with v0: x defined in B and use(v0) in C
(b) Optimized control flow: A jumps directly to C
(c) Fixed SSA form: v1: x′ in A, v0: x in B, and v2: φ(v1, v0) with use(v2) in C

Fig. 7. We assume that A always jumps via B to C. Adjusting A's jump requires SSA
reconstruction.
Some transformations like live range splitting, rematerialization or jump
threading introduce additional definitions for an SSA value. Because this violates
the SSA property, SSA has to be reconstructed. For the latter transformation,
we run through an example in order to demonstrate how our algorithm can be
leveraged for SSA reconstruction.
Suppose an analysis determined that the basic block B in Figure 7a always
branches to C when entered from A. Thus, we let A directly jump to
C (Figure 7b). However, definition v0 does not dominate its use anymore. We
can fix this issue by first inserting a copy v1 of v0 into A. Then, we invoke
writeVariable(V, A, x') and writeVariable(V, B, x), where V is just some handle to
refer to the set of definitions that represent the "same variable". Next, a call
to readVariableRecursive(V, C) adds a necessary φ function and yields v2 as new
definition, which we can use to update v0's original use (Figure 7c).
In particular, for jump threading, it is desirable to not depend on dominance
calculations—as opposed to Cytron et al.’s algorithm: Usually, several iterations
of jump threading are performed until no further improvements are possible.
Since jump threading alters control flow in a non-trivial way, each iteration
would require a re-computation of the dominance tree.
Note that SSA reconstruction always runs on complete CFGs. Hence, sealing
and issues with non-sealed basic blocks do not arise in this setting.
5.2 CPS Construction
CPS is a functional programming technique that captures control flow in continuations.
Continuations are functions that never return. Instead, each continuation
invokes another continuation in tail position.
int f(int x) {
    int a;
    if (x = 0) {
        a ← 23;
    } else {
        a ← 42;
    }
    return a;
}
(a) Source program

int f(int x) {
    if (x = 0) {
    } else {
    }
    v0: φ(23, 42)
    return v0
}
(b) SSA form

f(x: int, ret: int → ⊥) → ⊥ {
    let then := () → ⊥
            next(23)
        else := () → ⊥
            next(42)
        next := (a: int) → ⊥
            ret(a)
    in
        branch(x = 0, then, else)
}
(c) CPS version

Fig. 8. An imperative program in SSA form and converted to a functional CPS program
SSA form can be considered as a restricted form of CPS [13,3]. Our algorithm
is also suitable to directly convert an imperative program into a functional
CPS program without the need for a third program representation. Instead of
φ functions, we have to place parameters in local functions. Instead of adding
operands to φ functions, we add arguments to the predecessors' calls. Like when
constructing SSA form, on-the-fly optimizations, as described in Section 3.1, can
be exploited to shrink the program. Figure 8 demonstrates CPS construction.
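Purely as an illustration (the CPS language of Figure 8c is not Python), the same structure can be mimicked with Python closures; here the continuation next_ takes the parameter a that plays the role of the φ function v0.

    # Figure 8c mimicked with Python closures; illustrative only.
    def f(x, ret):
        def next_(a):            # the parameter a plays the role of phi v0
            return ret(a)
        def then_():
            return next_(23)
        def else_():
            return next_(42)
        return then_() if x == 0 else else_()

    print(f(0, lambda a: a))     # 23
    print(f(1, lambda a: a))     # 42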
In a CPS program, we cannot simply remove a φ function. Rather, we would
have to eliminate a parameter, fix its function's type and adjust all users of
this function. As this set of transformations is expensive, it is worthwhile not to
introduce unnecessary parameters in the first place and therefore to use the
extensions described in Section 3.3.
6 Evaluation
6.1 Comparison to Cytron et al.’s Algorithm
We implemented the algorithm presented in this paper in LLVM 3.1 [2] to com-
pare it against an existing, highly-tuned implementation of Cytron et al.’s algo-
rithm. Table 1 shows the number of constructed instructions for both algorithms.
Since LLVM first models accesses to local variables with load and store instructions,
we also report the instruction counts immediately before SSA construction.
In total, SSA construction reduces the number of instructions by 25%, which
demonstrates the significant overhead of the temporary non-SSA IR.
Comparing the number of constructed instructions, we see small differences
between the results of LLVM’s and our SSA-construction algorithm. One reason
for the different number of φ functions is the removal of redundant SCCs: 3 (out
of 11) non-trivial SCCs do not originate from irreducible control flow and are not
removed by Cytron et al.'s algorithm. The remaining difference in the number of
φ functions and memory instructions stems from minor differences in handling
unreachable code. In most benchmarks, our algorithm triggers LLVM’s constant
folding more often, and thus further reduces the overall instruction count. Ex-
ploiting more on-the-fly optimizations like common subexpression elimination as
described in Section 3.1 would shrink the overall instruction count even further.
For the runtime comparison of both algorithms, we count the number of
executed x86 instructions. Table 2 shows the counts collected by the valgrind
instrumentation tool. While the results vary for each benchmark, the marker
Benchmark | Before SSA Constr.: #insn #mem #phi | After SSA Constr.: #insn #mem #phi | Marker: #insn #mem #phi | Insn ratio
gzip 12,038 5,480 82 9,187 2,117 594 9,179 2,117 594 76%
vpr 40,701 21,226 129 27,155 6,608 1,201 27,092 6,608 1,201 67%
gcc 516,537 206,295 2,230 395,652 74,736 12,904 393,554 74,683 12,910 76%
mcf 3,988 2,173 14 2,613 658 154 2,613 658 154 66%
crafty 44,891 18,804 116 36,050 8,613 1,466 36,007 8,613 1,466 80%
parser 30,237 14,542 100 20,485 3,647 1,243 20,467 3,647 1,243 68%
perlbmk 185,576 86,762 1,764 140,489 37,599 5,840 140,331 37,517 5,857 76%
gap 201,185 86,157 4,074 149,755 29,476 9,325 149,676 29,475 9,326 74%
vortex 126,097 65,245 990 88,257 25,656 2,739 88,220 25,661 2,737 70%
bzip2 8,605 4,170 9 6,012 1,227 359 5,993 1,227 359 70%
twolf 76,078 38,320 246 58,737 18,376 2,849 58,733 18,377 2,849 77%
Sum 1,245,933 549,174 9,754 934,392 208,713 38,674 931,865 208,583 38,696 75%
Table 1. Comparison of instruction counts of LLVM's normal implementation and our
algorithm. #mem are alloca, load and store instructions. Insn ratio is the quotient of
#insn with the marker algorithm and #insn before SSA construction.
Benchmark | Cytron et al. | Marker | Marker / Cytron et al.
164.gzip 969,233,677 967,798,047 99.85%
175.vpr 3,039,801,575 3,025,286,080 99.52%
176.gcc 25,935,984,569 26,009,545,723 100.28%
181.mcf 722,918,540 722,507,455 99.94%
186.crafty 3,653,881,430 3,632,605,590 99.42%
197.parser 2,084,205,254 2,068,075,482 99.23%
253.perlbmk 12,246,953,644 12,062,833,383 98.50%
254.gap 8,358,757,289 8,339,871,545 99.77%
255.vortex 7,841,416,740 7,845,699,772 100.05%
256.bzip2 569,176,687 564,577,209 99.19%
300.twolf 6,424,027,368 6,408,289,297 99.76%
Sum 71,846,356,773 71,647,089,583 99.72%
Table 2. Executed instructions for Cytron et al.’s algorithm and the Marker algorithm
algorithm needs slightly (0.28%) fewer instructions in total. All measurements
were performed on a Core i7-2600 CPU with 3.4 GHz, by compiling the C-
programs of the SPEC CINT2000 benchmark suite.
6.2 Effect of On-The-Fly Optimization
We also evaluated the effects of performing on-the-fly optimizations (as described
in Section 3.1) on the speed and quality of SSA construction. Our libFirm [1]
compiler library has always featured a variant of the construction algorithms de-
scribed in this paper. There are many optimizations interwoven with the SSA
construction. The results are shown in Table 3. Enabling on-the-fly optimiza-
tions during construction results in an increased construction time of 0.84 s, but
Benchmark | No On-the-fly Optimizations: Time, IR Time, Instructions | On-the-fly Optimizations: Time, IR Time, Instructions
164.gzip 1.38 s 0.03 s 10,520 1.34 s 0.05 s 9,255
175.vpr 3.80 s 0.08 s 28,506 3.81 s 0.12 s 26,227
176.gcc 59.80 s 0.61 s 408,798 59.16 s 0.91 s 349,964
181.mcf 0.57 s 0.02 s 2,631 0.60 s 0.03 s 2,418
186.crafty 7.50 s 0.13 s 42,604 7.32 s 0.18 s 37,384
197.parser 5.54 s 0.06 s 19,900 5.55 s 0.09 s 18,344
253.perlbmk 25.10 s 0.29 s 143,039 24.79 s 0.41 s 129,337
254.gap 18.06 s 0.25 s 152,983 17.87 s 0.34 s 132,955
255.vortex 17.66 s 0.35 s 98,694 17.54 s 0.45 s 92,416
256.bzip2 1.03 s 0.01 s 6,623 1.02 s 0.02 s 5,665
300.twolf 7.24 s 0.18 s 60,445 7.20 s 0.27 s 55,346
Sum 147.67 s 2.01 s 974,743 146.18 s 2.86 s 859,311
Table 3. Effect of on-the-fly optimizations on construction time and IR size
the resulting graph has only 88.2% of the number of nodes. This speeds up later
optimizations, resulting in a 1.49 s faster overall compilation.
6.3 Conclusion
The algorithm has been shown to be as fast as Cytron et al.'s algorithm in
practice. However, if the algorithm is combined with on-the-fly optimizations,
the overall compilation time is reduced. This makes the algorithm an interesting
candidate for just-in-time compilers.
7 Related Work
SSA form was invented by Rosen, Wegman, and Zadeck [15] and became popular
after Cytron et al. [10] presented an efficient algorithm for constructing it. This
algorithm can be found in all textbooks presenting SSA form and is used by the
majority of compilers. For each variable, the iterated dominance frontiers of all
blocks containing a definition are computed. Then, a rewrite phase creates new
variable numbers, inserts φ functions and patches users. The details necessary
for this paper were already discussed in Section 1.
Choi et al. [7] present an extension to the previous algorithm that constructs
minimal and pruned SSA form. It computes liveness information for each variable
v and inserts a φ function for v only if v is live at the corresponding basic block.
This technique can also be applied to other SSA construction algorithms to
ensure pruned SSA form, but comes at the cost of computing liveness
information.
Briggs et al. [6] present semi-pruned SSA form, which omits the costly liveness
analysis. However, they can only prevent the creation of dead φ functions for
variables that are local to a basic block.
Sreedhar and Gao [16] present a data structure, called DJ graph, that en-
hances the dominance tree with edges from the CFG. Compared to computing
iterated dominance frontiers for each basic block, this data structure is only
linear in the program size and allows computing the blocks where φ functions
need to be placed in linear time per variable. This gives an SSA construction
algorithm with cubic worst-case complexity in the size of the source program.
There is also a range of construction algorithms that aim for simplicity instead.
Brandis and Mössenböck [5] present a simple SSA construction algorithm
that directly works on the AST like our algorithm. However, their algorithm is
restricted to structured control flow (no gotos) and does not construct pruned
SSA form. Click and Paleczny [8] describe a graph-based SSA intermediate rep-
resentation used in the Java HotSpot server compiler [14] and an algorithm to
construct this IR from the AST. Their algorithm is in the same spirit as the one of
Brandis and Mössenböck and thus constructs neither pruned nor minimal
SSA form. Aycock and Horspool present an SSA construction algorithm that
is designed for simplicity [4]. They place a φ function for each variable at each
basic block. Afterwards they employ the following rules to remove φ functions:

1. Remove φ functions of the form vi = φ(vi, . . . , vi).
2. Substitute φ functions of the form vi = φ(vi1, . . . , vin) with i1, . . . , in ∈ {i, j} by vj.

This results in minimal SSA form for reducible programs. The obvious drawback
of this approach is the overhead of inserting φ functions at each basic block. This
also includes basic blocks that are prior to every basic block that contains a real
definition of the corresponding variable.
8 Conclusions
In this paper, we presented a novel, simple, and efficient algorithm for SSA con-
struction. In comparison to existing algorithms it has several advantages: It does
not require other analyses and transformations to produce minimal (on reducible
CFGs) and pruned SSA form. It can be directly constructed from the source
language without passing through a non-SSA CFG. It is well suited to perform
several standard optimizations (constant folding, value numbering, etc.) already
during SSA construction. This reduces the footprint of the constructed program,
which is important in scenarios where compilation time is of the essence. After
IR construction, a post pass ensures minimal SSA form for arbitrary control flow.
Our algorithm is also useful for SSA reconstruction where, up to now, standard
SSA construction algorithms were not directly applicable. Finally, we proved
that our algorithm always constructs pruned and minimal SSA form.
In terms of performance, a non-optimized implementation of our algorithm
is slightly faster than the highly-optimized implementation of Cytron et al.’s
algorithm in the LLVM compiler, measured on the SPEC CINT2000 benchmark
suite. We expect that after fine-tuning our implementation, we can improve the
performance even more.
Acknowledgments We thank the anonymous reviewers for their helpful com-
ments. This work was partly supported by the German Research Foundation
(DFG) as part of the Transregional Collaborative Research Centre “Invasive
Computing” (SFB/TR 89).
References
1. libFirm – The FIRM intermediate representation library, http://libfirm.org
2. The LLVM compiler infrastructure project, http://llvm.org
3. Appel, A.W.: SSA is functional programming. SIGPLAN Notices 33(4), 17–20 (Apr
1998)
4. Aycock, J., Horspool, N.: Simple generation of static single-assignment form. In:
Compiler Construction, Lecture Notes in Computer Science, vol. 1781, pp. 110–125.
Springer (2000)
5. Brandis, M.M., Mössenböck, H.: Single-pass generation of static single-assignment
form for structured languages. ACM Trans. Program. Lang. Syst. 16(6), 1684–1698
(Nov 1994)
6. Briggs, P., Cooper, K.D., Harvey, T.J., Simpson, L.T.: Practical improvements to
the construction and destruction of static single assignment form. Softw. Pract.
Exper. 28(8), 859–881 (Jul 1998)
7. Choi, J.D., Cytron, R., Ferrante, J.: Automatic construction of sparse data flow
evaluation graphs. In: Proceedings of the 18th ACM SIGPLAN-SIGACT Sympo-
sium on Principles of Programming Languages. pp. 55–66. POPL ’91, ACM, New
York, NY, USA (1991)
8. Click, C., Paleczny, M.: A simple graph-based intermediate representation. In:
Papers from the 1995 ACM SIGPLAN workshop on Intermediate representations.
pp. 35–49. IR ’95, ACM, New York, NY, USA (1995)
9. Cocke, J.: Programming languages and their compilers: Preliminary notes. Courant
Institute of Mathematical Sciences, New York University (1969)
10. Cytron, R., Ferrante, J., Rosen, B.K., Wegman, M.N., Zadeck, F.K.: Efficiently
computing static single assignment form and the control dependence graph.
TOPLAS 13(4), 451–490 (Oct 1991)
11. Eswaran, K.P., Tarjan, R.E.: Augmentation problems. SIAM J. Comput. 5(4),
653–665 (Dec 1976)
12. Hecht, M.S., Ullman, J.D.: Characterizations of reducible flow graphs. J. ACM
21(3), 367–375 (Jul 1974)
13. Kelsey, R.A.: A correspondence between continuation passing style and static single
assignment form. SIGPLAN Not. 30, 13–22 (Mar 1995)
14. Paleczny, M., Vick, C., Click, C.: The Java HotSpot™ server compiler. In: Symposium
on Java™ Virtual Machine Research and Technology Symposium. pp. 1–12.
JVM’01, USENIX Association, Berkeley, CA, USA (2001)
15. Rosen, B.K., Wegman, M.N., Zadeck, F.K.: Global value numbers and redundant
computations. In: Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium
on Principles of Programming Languages. pp. 12–27. POPL ’88, ACM Press, New
York, NY, USA (1988)
16. Sreedhar, V.C., Gao, G.R.: A linear time algorithm for placing φ-nodes. In: Pro-
ceedings of the 22nd ACM SIGPLAN-SIGACT Symposium on Principles of Pro-
gramming Languages. pp. 62–73. POPL ’95, ACM, New York, NY, USA (1995)
17. Tarjan, R.E.: Depth-first search and linear graph algorithms. SIAM Journal on
Computing 1(2), 146–160 (1972)