Comparability graph coloring for optimizing utilization of stream register files in stream processors.
Xuejun Yang∗, Li Wang∗†¶, Jingling Xue†, Yu Deng∗, Ying Zhang∗
∗National Laboratory for Parallel and Distributed Processing, School of Computer, NUDT, China
†Programming Languages and Compilers Group, School of Computer Science and Engineering, UNSW, Australia
{xjyang, yudeng, zhangying}@nudt.edu.cn, {lwang, jingling}@cse.unsw.edu.au
Abstract
A stream processor executes an application that has been decom-
posed into a sequence of kernels that operate on streams of data
elements. During the execution of a kernel, all streams accessed
must be communicated through the SRF (Stream Register File), a
non-bypassing software-managed on-chip memory. Therefore,
optimizing utilization of the SRF is crucial for good performance.
The key insight is that the interference graphs (IGs) formed by the
streams in stream applications tend to be comparability graphs or
decomposable into a set of multiple comparability graphs. We present
a compiler algorithm that can find optimal or near-optimal colorings
in stream IGs, thereby achieving better SRF utilization than the
First-Fit bin-packing algorithm, the best in the literature.
Categories and Subject Descriptors
D.3.4 [Programming Languages]: Processors - compilers and
optimization; B.3.2 [Memory Structures]: Design Styles - Primary memory
General Terms
Algorithms, Languages, Performance
Keywords
Stream processor, stream programming, comparability graph coloring,
software-managed cache
1. Introduction
Media applications, such as image processing, signal processing,
video and graphics, are becoming an increasingly dominant portion
of computing workloads today. In contrast with other applications,
media applications exhibit producer-consumer locality with
little global data reuse, have abundant parallelism and require high
computation rates (with 10-100 billion operations per second and a
few to thousands of operations per input data). These characteristics
are poorly matched to conventional general-purpose programmable
architectures that depend on data reuse (captured by
hardware-managed caches), cannot exploit the available parallelism
and cannot support high computation rates. On the other hand,
special-purpose media-processing processors tailored to one specific
application require significant design effort and are thus difficult
to change as applications and/or algorithms evolve.
¶The work was carried out during Li Wang’s visit to Jingling Xue’s Re-
search Group at UNSW during February 2008 – February 2009.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. To copy otherwise, to republish, to post on servers or to redistribute
to lists, requires prior specific permission and/or a fee.
PPoPP’09, February 14–18, 2009, Raleigh, North Carolina, USA.
Copyright © 2009 ACM 978-1-60558-397-6/09/02 ... $5.00
Programmable stream processors, such as Imagine [18],
Raw [20], Cell [23], AMD FireStream and Merrimac [4], represent
a promising alternative for achieving high performance in media
applications [17, 18, 21]. In addition, stream processing is also well
suited for some scientific applications [4, 23, 25].
[Figure 1 diagram: stream register file, network interface, stream
controller, host interface, stream memory controller, micro
controller, DDR memory controller and four ALU clusters (Cluster0 to
Cluster3), linked by control and data paths to the host processor and
to off-chip memory.]
Figure 1: Block diagram of the FT64 stream processor.
We have recently designed and fabricated a 64-bit stream processor,
FT64 [25], for media applications as well as certain scientific
applications that are also amenable to stream processing. As
shown in Figure 1, like Imagine [18], Cell [23] and Merrimac [4],
FT64 can be easily mapped to the stream virtual machine architecture
described in [12]. Such a stream processor executes applications
that have been mapped to the stream programming model: a program is
decomposed into a sequence of computation-intensive kernels that
operate on streams of data elements. Kernels are compiled to VLIW
microprograms to be executed on clusters of ALUs, one at a time.
Streams are stored in the SRF (Stream Register File), a
software-managed on-chip memory. Expressing an application as
streams exposes its inherent locality and parallelism. Kernels expose
kernel locality by keeping temporary values local (in the local
register files near the ALUs, which are not shown) and
instruction-level parallelism (exploited by the multiple ALUs in each
cluster). Streams expose producer-consumer locality between kernels,
enabling some output streams produced by a kernel to be consumed by
the next kernel in sequence, and data-level parallelism, enabling
different elements of an input stream to be operated on
simultaneously, one on each ALU cluster, in a SIMD fashion.
Research into advanced compiler technology for stream languages and
architectures is still in its infancy. Among several challenges posed
by stream processing for compilation [5], an efficient allocation of
the scarce on-chip SRF is critical to performance. The SRF, the nexus
of a stream processor, is introduced to capture the widespread
producer-consumer locality in media applications and so reduce
expensive off-chip memory traffic. Unlike conventional register
files, however, the SRF is non-bypassing: the input and output streams
of a kernel must all be stored in the SRF while the kernel is being
executed. If the working set of a kernel is too large to fit into the
SRF, strip mining can be applied to segment some large streams into
smaller strips so that the kernel can then be called to operate on one
strip at a time. Alternatively, some streams can be double-buffered
[5] or spilled [22] until the data set of every kernel does not exceed
the SRF capacity. Therefore, optimizing utilization of the SRF is
crucial for good performance.
We are aware of two existing SRF management techniques for stream
processors [5, 22]. In [5], SRF utilization is optimized by applying
First-Fit bin-packing heuristics. In our recent work [22], we have
experimented with adapting to SRF allocation a graph coloring approach
that we introduced in [14] for scratchpad allocation. This graph
coloring approach requires the SRF to be partitioned into pseudo
registers before graph coloring can be applied. Artificial aliases
among pseudo registers may cause SRF fragmentation, reducing SRF
utilization unnecessarily. On the other hand, First-Fit heuristics
can be sub-optimal for many applications. For small applications,
either technique suffices. For large applications with tens to
hundreds of kernels, both need to be further improved.
In this paper, we present a new compiler algorithm that aims
to optimize utilization of SRF for stream applications. The central
machinery is the traditional interference graph (IG) representation,
except that an IG here is a weighted (undirected) graph formed by the
streams operated on by a sequence of kernels. The key discovery is
that the IGs in many media applications are comparability graphs,
enabling the compiler to obtain optimal colorings in polynomial
time. This has motivated us to develop a new algorithm for opti-
mizing utilization of SRF when allocating the streams in stream
IGs to the SRF by comparability graph coloring. If the data set of
a kernel still exceeds the SRF capacity after SRF allocation, live-
range splitting (or spilling) and strip mining can be applied, as will
be discussed in the concluding section of this paper.
In summary, this paper makes the following contributions:
• We propose, for the first time, to optimize utilization of SRF by
comparability graph coloring and present an efficient algorithm
designed for well-structured media and scientific applications
amenable to stream processing.
• We show that our algorithm can find optimal and near-optimal
colorings for stream IGs, thereby outperforming First-Fit
heuristics.
The rest of this paper is organized as follows. For background
information, Section 2 introduces the stream programming model
by an example. In Section 3, we make precise the SRF management
problem we solve. Section 4 casts it as a comparability graph
coloring problem and presents our algorithm for solving the new
formulation. Section 5 evaluates our approach. Section 6 discusses
related work. Section 7 concludes by discussing future work.
2. Stream Programming Model
The central idea behind stream processing is to divide an application
into kernels and streams to expose its inherent locality and
parallelism. As a result, an application is divided into two programs,
a stream program running on the host processor and a kernel program
running on the stream processor. The stream program specifies the
flow of streams between kernels and initiates the execution of
kernels. The kernel program executes these kernels, one at a time.
1 complex xmat[2*N], ymat[2*N];
2 complex twiddlemat[log2(2*N)*N];
3 stream<complex> a(N), b(N);
4 stream<complex> twiddle(N);
5 stream<complex> c(N), d(N);
6 dataInit('xmatix.dat', xmat);
7 dataInit('twiddlematrix.dat', twiddlemat);
8 Load(xmat[0, N-1], a);
9 Load(xmat[N, 2*N-1], b);
10 for (int i = 0; i < log2(2*N); i+=2) {
11   Load(twiddlemat[i*N, (i+1)*N-1], twiddle);
12   Kernel('fft', a, b, twiddle, c, d);
13   Load(twiddlemat[(i+1)*N, (i+2)*N-1], twiddle);
14   Kernel('fft', c, d, twiddle, a, b);
15 }
16 Store(a, ymat[0, N-1]);
17 Store(b, ymat[N, 2*N-1]);
18 bitReverse(ymat);
19 dataSave('ymatrix.dat', ymat);
1 fft(stream<complex> a, stream<complex> b,
2 stream<complex> twiddle,
3 stream<complex> c, stream<complex> d)
4 {
5 complex a_tmp, b_tmp, c_tmp, d_tmp;
6 complex twiddle_tmp;
7 for (int i = 0; i < N/2; i++) {
8 a>>a_tmp;
9 b>>b_tmp;
10 twiddle>>twiddle_tmp;
11 c<<a_tmp+b_tmp;
12 c<<twiddle_tmp*(a_tmp-b_tmp);
13 }
14 for (i = N/2; i < N; i++) {
15 a>>a_tmp;
16 b>>b_tmp;
17 twiddle>>twiddle_tmp;
18 d<<a_tmp+b_tmp;
19 d<<twiddle_tmp*(a_tmp-b_tmp);
20 }
21 }
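(a) Stream program (b) Kernel program
[Panel (c) diagram: successive invocations of kernel fft (Kfft); one
invocation consumes a, b and twiddle and produces c and d, and the
next consumes c, d and twiddle and produces a and b.]
(c) Data flow of streams through kernels
Figure 2: Stream and kernel programs for a radix-2 FFT.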
Figure 2 depicts the mapping of a 2N-point radix-2 FFT
to the stream programming model. The kernel fft is executed
log2(2N) times with explicit producer-consumer locality: every
output stream from a kernel execution is used as an input for the
next kernel execution in sequence.
Let us examine the stream program first. In lines 1 and 2, three
arrays of sizes 2N, 2N and log2(2N) ∗ N are declared, respectively.
In lines 3 – 5, five streams of size N are declared. In lines 6 and 7,
the function dataInit is called twice to initialize arrays xmat and
twiddlemat residing in the off-chip memory with the two data files
stored at the host processor. In line 8, the data in the first half of
xmat are gathered into stream a. This results in the loading of the
data from xmat in off-chip memory into the space allocated to stream a
in the SRF. In line 9, stream b is initialized from the second half of
xmat. In line 10, the loop in a sequential FFT program is unrolled
once to expose the producer-consumer locality between the kernel
calls to fft. In line 11, the “twiddle factors” needed by FFT are
gathered into stream twiddle. In line 12, the kernel fft is called to
perform the core computation of FFT on the stream processor. As
shown, a, b and twiddle are input streams and c and d are output
streams. In line 13, stream twiddle is updated with new twiddle
factors. In line 14, fft is called again with c, d and twiddle as
input and a and b as output. After the loop has run to completion, the
final output streams are stored from the SRF into array ymat in
off-chip memory (lines 16 and 17). Since the output is in bit-reversed
order, the function bitReverse reorders the data in line 18. In line
19, the result is saved into a data file.
In the kernel program, a loop at line 7 first goes over the first
half of each input stream. In line 8, the elements of stream a are
read sequentially, one at a time, into a temporary variable a_tmp. In
lines 9 and 10, the elements of streams b and twiddle are read off
similarly. In lines 11 and 12, the computations on these elements
are performed with the results being appended to output stream c.
In lines 14 – 20, these steps are repeated on the second half of the
input streams, with the results being appended to output stream d.
3. Problem Statement
The focus of this work is on optimizing utilization of the SRF, so
only stream programs are relevant here. Given a stream program,
this paper presents an algorithm that assigns the streams in the
program to the SRF so as to minimize the total amount of space
taken by the streams. Such an algorithm can then be used by a
stream compiler to produce a final SRF allocation by combining it
with live range splitting and strip mining, if necessary.
A stream program consists of a sequence of loops where each
loop includes a sequence of kernels operating on streams. In a
stream compiler, all loops are considered separately in SRF allo-
cation. As shown in Figure 1, the DRAM controller supports two
stream-level instructions, Load and Store, that transfer an entire
stream between off-chip memory and the SRF. In stream programs
as demonstrated in Figure 2, loads and stores are used to initialize
some streams from the global input data residing in off-chip mem-
ory and write certain streams to off-chip memory, respectively.
The central machinery in our approach to allocating the streams
in a loop to the SRF is the traditional interference graph (IG), except
that it is a weighted (undirected) graph formed by the streams
operated on by the kernels in the loop. All streams accessed in
the loop are identified as live ranges to be placed in the SRF. If
two live ranges interfere (i.e., overlap), they must be placed in
non-overlapping SRF spaces. The live ranges of streams are computed
by extending the def/use definitions for scalars to streams: Load
defines a stream, Store uses a stream, and a kernel call (re)defines
its output streams and uses its input streams. The live range of a
stream starts from its definition and ends at its last use. Streams
are first renamed using the SSA (static single assignment) form.
After the live ranges have been computed for a loop, its
weighted (undirected) IG, denoted G, is built in the normal manner,
where a weighted node denotes a stream live range whose weight is
the size of the stream, and an edge connects two nodes if their
live ranges interfere with each other.
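The construction just described can be sketched in Python. This is a minimal illustration rather than the compiler's actual implementation: the `(op, defs, uses)` statement encoding, the function names `live_ranges` and `build_ig`, and the example program with its stream sizes are all assumptions made here, and streams are assumed to be already renamed into SSA form.

```python
from itertools import combinations

def live_ranges(program):
    """Return {stream: (def_pos, last_use_pos)} for an SSA-renamed program.

    Each statement is a hypothetical (op, defs, uses) tuple: Load defines a
    stream, Store uses one, and a Kernel call defines its outputs and uses
    its inputs.
    """
    ranges = {}
    for pos, (_op, defs, uses) in enumerate(program):
        for s in defs:
            ranges[s] = [pos, pos]   # the live range starts at the definition
        for s in uses:
            ranges[s][1] = pos       # and is extended to the latest use
    return {s: (lo, hi) for s, (lo, hi) in ranges.items()}

def build_ig(ranges, sizes):
    """Weighted IG: node weight = stream size, edge = overlapping live ranges."""
    edges = {
        frozenset((a, b))
        for a, b in combinations(ranges, 2)
        if ranges[a][0] <= ranges[b][1] and ranges[b][0] <= ranges[a][1]
    }
    return dict(sizes), edges

# Illustrative two-kernel program (not taken from the paper).
program = [
    ("Load",   ["p"], []),
    ("Kernel", ["q"], ["p"]),
    ("Load",   ["r"], []),
    ("Kernel", ["s"], ["q", "r"]),
    ("Store",  [],    ["s"]),
]
sizes = {"p": 4, "q": 4, "r": 2, "s": 4}   # illustrative stream sizes
ranges = live_ranges(program)
weights, edges = build_ig(ranges, sizes)
```

Because live ranges are treated as closed intervals over statement positions, the inputs and outputs of a kernel call overlap at that call, which reflects the non-bypassing nature of the SRF.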
The SRF allocation problem can be naturally solved as an
interval-coloring problem as formalized below. Allocating SRF
spaces to stream live ranges in an IG is represented by an assign-
ment of intervals to the nodes in the IG. Minimizing the span of
intervals amounts to minimizing the required SRF size.
DEFINITION 1. Given a stream IG G = (V,E) with positive
integral node weights w : V → ℕ (representing stream sizes), an
interval coloring α of G maps each node x onto an interval αx of
the real line of width w(x) such that adjacent nodes are mapped to
disjoint intervals, i.e., (x,y) ∈ E implies αx ∩ αy = ∅.
It is well-known that interval coloring is NP-complete.
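As a concrete reading of Definition 1, the following Python sketch checks whether a proposed assignment of SRF offsets is a valid interval coloring and reports the space it needs. The function names, the use of half-open intervals starting at non-negative offsets, and the three-node example are assumptions made for illustration.

```python
def is_interval_coloring(weights, edges, start):
    """Check Definition 1 for half-open intervals [start[x], start[x] + w(x))."""
    for x, y in edges:
        x_lo, x_hi = start[x], start[x] + weights[x]
        y_lo, y_hi = start[y], start[y] + weights[y]
        if x_lo < y_hi and y_lo < x_hi:   # adjacent nodes must not intersect
            return False
    return True

def span(weights, start):
    """SRF space needed when offsets begin at 0: the largest interval end."""
    return max(start[x] + weights[x] for x in weights)

# Illustrative path graph a - b - c with weights 3, 4 and 2.
weights = {"a": 3, "b": 4, "c": 2}
edges = [("a", "b"), ("b", "c")]
good = {"b": 0, "a": 4, "c": 4}   # a and c may overlap: they do not interfere
bad = {"a": 0, "b": 2, "c": 6}    # a = [0,3) collides with b = [2,6)
```

Here `good` is valid with a span of 7, while `bad` is rejected because two interfering streams would share SRF space.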
Our IG-based approach is flexible enough to accommodate pre-
pass optimizations that are applied earlier to a program either by
the programmer or the compiler. One example is to reorder some
loads and stores to overlap memory transfers and kernel execution.
Another is to split some long live ranges (in scientific applications)
accomplished by inserting a pair of store and load instructions. We
plan to automate their integration with this work in the future.
4. Comparability Graph Coloring for Stream IGs
Section 4.1 recalls the basic results about interval coloring and
comparability coloring [9], which provide a basis for understanding
our approach and proving its optimality and near-optimality.
Section 4.2 describes our key insight drawn from a careful analysis
of the structure of stream IGs: a large number of stream IGs are
comparability graphs, enabling their optimal colorings to be found
in polynomial time. In Section 4.3, we turn this insight into an
algorithm that can find optimal or near-optimal colorings for
well-structured media and scientific applications when their stream IGs
are expected to be decomposable into a set of comparability graphs.
4.1 Interval and Comparability Graph Coloring: Basics
Given a (directed or undirected) graph G = (V,E) and a subset
A ⊆ V, the subgraph induced by A is G(A) = (A,E(A)), where
E(A) = {(x,y) ∈ E | x,y ∈ A}. A subset A ⊆ V of r nodes is
an r-clique if it induces a complete subgraph. A clique is a maximal
clique if it is not contained in any other clique.
Given an undirected graph G = (V,E) with the function w
mapping nodes to positive integral weights, the total width
(i.e., the number of hues) of an interval coloring α, χα(G;w), is
|∪_{x∈V} αx|. The chromatic number χ(G;w) is the smallest width
needed to color the nodes in G. The clique number is defined as
ω(G;w) = max{w(K) | K is a clique of G}, where w(K) is the total
weight of the nodes in K. As a fact, we have:
\[ \chi(G;w) \ge \omega(G;w) \tag{1} \]
4.1.1 Interval Coloring vs. Acyclic Orientation
Figure 3 illustrates the equivalence between finding an interval
coloring and finding an acyclic orientation for a weighted graph.
[Figure 3 diagram: a weighted graph with nodes a:3, b:4, c:2, x:2,
y:1 and z:4, shown with two acyclic orientations and the interval
colorings they induce.]
(a) (G;w); (b) χα(G;w) = 12; (c) χβ(G;w) = 10
Figure 3: Two interval colorings α and β of a weighted undirected
graph together with their equivalent acyclic orientations.
Let G = (V,E) be an undirected graph. An orientation of G is a
function α that assigns every edge a direction such that α(x,y) ∈
{(x,y),(y,x)} for all (x,y) ∈ E. Let Gα be the digraph obtained
by replacing each edge (x,y) ∈ E with the arc α(x,y). An
orientation α is acyclic if Gα contains no directed cycles.
Every interval coloring α of G induces an acyclic orientation α′
such that (x,y) ∈ α′ if and only if αx > αy, i.e., an arc is directed
from x to y if and only if αx is to the right of αy, for all (x,y) ∈ E.
Conversely, an acyclic orientation α of G induces an interval
coloring α′. For a sink node x, let α′x = [0, w(x)). Proceeding
inductively, for a node y with all its successor nodes already
colored (i.e., α′ defined at the successors), let t be the largest
endpoint of their intervals and define α′y = [t, t + w(y)).
From an acyclic orientation, we can obtain an interval coloring
in linear time by a depth-first search.
The problem of finding optimal colorings is NP-complete. In an
optimal coloring, the chromatic number χ(G;w) is related to the
notion of heaviest path in an acyclic orientation of G:
\[ \chi(G;w) = \min_{\alpha \in A(G)} \Bigl( \max_{\mu \in P(\alpha)} w(\mu) \Bigr) \tag{2} \]
where A(G) is the set of acyclic orientations of G and P(α)
the set of directed paths in an orientation α ∈ A(G). In other
words, the orientation whose heaviest path is the smallest induces
an optimal coloring. The heaviest-path-based formulation stated in
(2) is exploited in the development of our coloring algorithm for
stream IGs (Section 4.3).
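The inductive construction of an interval coloring from an acyclic orientation can be sketched as follows. This is an illustrative Python version under stated assumptions: the function names and the three-node path graph are made up for the example, arcs are given as pairs (x, y) meaning x → y, and the input is assumed to be acyclic (a cyclic input would recurse without end).

```python
def coloring_from_orientation(weights, arcs):
    """Interval coloring induced by an acyclic orientation.

    A sink x gets [0, w(x)); any other node starts at the largest endpoint
    among its successors.  Returns {node: (lo, hi)} for half-open [lo, hi).
    """
    succ = {v: [] for v in weights}
    for x, y in arcs:
        succ[x].append(y)

    interval = {}

    def place(v):                      # memoized depth-first search
        if v not in interval:
            t = max((place(s)[1] for s in succ[v]), default=0)
            interval[v] = (t, t + weights[v])
        return interval[v]

    for v in weights:
        place(v)
    return interval

def width(interval):
    """Total width, which equals the weight of the heaviest directed path."""
    return max(hi for _lo, hi in interval.values())

# Illustrative path graph a - b - c with weights 3, 4 and 2.
weights = {"a": 3, "b": 4, "c": 2}
chain = [("a", "b"), ("b", "c")]    # a -> b -> c: heaviest path weighs 9
better = [("a", "b"), ("c", "b")]   # b is a sink: heaviest path weighs 7
```

With `chain` the width is 9, while with `better` it drops to 7, matching the min-max relation in (2): the orientation with the lighter heaviest path yields the narrower coloring.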
In Figure 3(b), the heaviest path is x → c → b → z with a
total weight of χα(G;w) = 12. In Figure 3(c), the heaviest path
is c → z → b with a total weight of χβ(G;w) = 10. The gap
between the two colorings is 2 but can be larger (Figure 22). So
there is a need to look for an optimal solution efficiently in practice.
[Figure 4 diagram: a graph G0, three disjoint graphs G1, G2 and G3,
and the composition graph G0[G1,G2,G3].]
Figure 4: An illustration of Definition 4 (n = 3).
Due to the equivalence between acyclic orientations and interval
colorings, we also write χα(G;w) to mean the width of the interval
coloring associated with an acyclic orientation α of G.
4.1.2 Comparability Graph Coloring
In the context of this work, we examine below a class of graphs that
allows interval colorings to be found optimally in polynomial time.
DEFINITION 2. An orientation α of an undirected graph G is
transitive if (x,z) ∈ Gα whenever (x,y),(y,z) ∈ Gα.
DEFINITION 3. An undirected graph G is a comparability graph if
there exists a transitive orientation of G.
A transitive orientation is acyclic, but an acyclic orientation is
not necessarily transitive. In Figure 3, α is not transitive since
(x,b),(b,a) ∈ Gα but (x,a) ∉ Gα. However, β is transitive. Therefore,
the graph shown in Figure 3(a) is a comparability graph.
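Definition 2 translates directly into a small check. The sketch below is illustrative only: `is_transitive` is a name chosen here, arcs are pairs (x, y) meaning x → y, and the example orientations are of a three-node path and a triangle rather than of the graph in Figure 3.

```python
def is_transitive(arcs):
    """Definition 2: (x, z) must be an arc whenever (x, y) and (y, z) are."""
    arcs = set(arcs)
    return all(
        (x, z) in arcs
        for x, y in arcs
        for y2, z in arcs
        if y == y2
    )

# Orientations of the path a - b - c.
chain = [("a", "b"), ("b", "c")]   # acyclic, but (a, c) is not an edge
sink_b = [("a", "b"), ("c", "b")]  # no two-arc path exists, so it is transitive

# Orientations of the triangle on a, b, c.
ordered = [("a", "b"), ("b", "c"), ("a", "c")]  # follows the order a, b, c
cyclic = [("a", "b"), ("b", "c"), ("c", "a")]   # a directed cycle
```

The `chain` orientation shows why acyclicity alone is not enough: it contains no cycle, yet it fails the transitivity test.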
Let α be a transitive orientation of a comparability graph G.
Restating (1), we have χ(G;w) ≥ ω(G;w). Due to transitivity, every
path in Gα is contained in a clique of G. In particular, the
heaviest path in Gα equals the heaviest clique in G, i.e., χ(G;w) ≤
χα(G;w) = ω(G;w). Hence, χα(G;w) = χ(G;w) = ω(G;w).
This result is summarized below.
THEOREM 1. For any transitive orientation α of G, the interval
coloring induced is optimal (and can be found in linear time).
DEFINITION 4. Let G0 be a graph with n nodes v1,v2,...,vn and
G1,G2,...,Gn be n disjoint graphs. These graphs may be directed
or undirected. The composition graph G = G0[G1,G2,...,Gn],
which is illustrated in Figure 4, is formed as follows. First,
replace vi in G0 with Gi. Second, for all 1 ≤ i,j ≤ n, make each
node of Gi adjacent to each node of Gj whenever vi is adjacent to
vj in G0. Formally, for Gi = (Vi,Ei), we define G = (V,E) as:
\[ V = \bigcup_{1 \le i \le n} V_i \]
\[ E = \bigcup_{1 \le i \le n} E_i \,\cup\, \{(x,y) \mid x \in V_i,\ y \in V_j \text{ and } (v_i,v_j) \in E_0\} \]
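For undirected graphs, the composition in Definition 4 can be sketched in a few lines of Python. The function name `compose`, the dictionary layout of `parts` and the tiny example are assumptions made for illustration.

```python
def compose(g0_edges, parts):
    """Build G = G0[G1, ..., Gn] for undirected graphs.

    g0_edges: edges (vi, vj) of G0, where each vi is a key of parts.
    parts:    {vi: (Vi, Ei)}, the disjoint graph that replaces node vi.
    """
    nodes, edges = set(), set()
    for v_set, e_set in parts.values():
        nodes |= set(v_set)                       # V is the union of the Vi
        edges |= {frozenset(e) for e in e_set}    # keep every Ei
    for vi, vj in g0_edges:
        # Join every node of Gi to every node of Gj when vi, vj are adjacent.
        edges |= {frozenset((x, y)) for x in parts[vi][0] for y in parts[vj][0]}
    return nodes, edges

# G0 is a single edge v1 - v2; G1 is the node p and G2 is the edge q - r.
parts = {"v1": ({"p"}, set()), "v2": ({"q", "r"}, {("q", "r")})}
nodes, edges = compose([("v1", "v2")], parts)
```

In this example the composition is a triangle on p, q and r, since p is joined to both nodes of G2.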
THEOREM 2. Let G = G0[G1,G2,...,Gn], where all Gi’s are disjoint
undirected graphs. Then G is a comparability graph if and
only if each Gi (0 ≤ i ≤ n) is a comparability graph.
Furthermore, the problems of recognizing a comparability
graph G = (V,E) and finding a transitive orientation of G can both
be solved in O(δ·|E|) time and O(|V| + |E|) space, where δ
is the maximum degree of a node in G. Given a transitive orientation
α, an optimal coloring of G can be obtained in linear time (Theorem 1).
4.2 Optimal Colorings of Comparability Stream IGs
In stream programs with producer-consumer locality but little
global data reuse, the live ranges of streams are also local. A
typical stream program (or a loop in such a program) consists of a
series of kernels, each producing intermediate streams to be consumed
by the next kernel in sequence. We show below that if no stream
live range in a stream IG spans more than two kernel calls, then
the IG is a comparability graph and its optimal coloring
can thus be found in polynomial time. This result is proved easily
by a straightforward application of Theorem 2.
Figure 5 shows the IG for a series of three kernels, where all live
ranges are no longer than two kernel calls. In particular, stream q
is live from kernel ‘1’ to kernel ‘2’, streams u,v and w are live in
kernels ‘2’ and ‘3’, and the remaining streams are only live at the
kernels where they are operated on. In this example and the proofs
of our results, whether a stream is an input or output is irrelevant.
Load(..., p);
Kernel('1', p, q);
Load(..., r);
Load(..., s);
Load(..., t);
Kernel('2', q, r, s, t, u, v, w);
Load(..., x);
Kernel('3', u, v, w, x, y);
Store(y, ...);
[Figure 5 diagram: the IG of the program above, with nodes p, q, r,
s, t, u, v, w, x and y.]
Figure 5: A stream program and its IG.
Let Gcg be the IG built from a loop containing Ncg kernels
(numbered from 1) such that each live range in Gcg is not longer
than two kernels. We partition all live ranges in Gcg into 2Ncg sets:
\[ K_1, K_{12}, K_2, K_{23}, K_3, \ldots, K_{(N_{cg}-1)N_{cg}}, K_{N_{cg}}, K_{N_{cg}1} \tag{3} \]
where Ki consists of all streams accessed, i.e., live only in kernel i
and Ki(i⊕1) of all streams live only in kernels i and i ⊕ 1. We define
i⊕c to be (i+c−1)%Ncg+1 and i⊖c to be (i−c−1)%Ncg+1.
As illustrated in Figure 6, all streams accessed in a kernel in a
loop form a maximal clique in the stream IG of the loop.
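The wrap-around index arithmetic and the partition in (3) can be sketched as follows. The helper names, the tuple keys used for the sets, and the `spans` table (read off the description of Figure 5, giving the first and last kernel in which each stream is live) are assumptions made for illustration.

```python
def oplus(i, c, n):
    """i (+) c over kernels numbered 1..n: (i + c - 1) % n + 1."""
    return (i + c - 1) % n + 1

def ominus(i, c, n):
    """i (-) c over kernels numbered 1..n: (i - c - 1) % n + 1.

    Python's % never returns a negative result for a positive modulus;
    in C, C++ or Java the operand would need adjusting first.
    """
    return (i - c - 1) % n + 1

def partition(spans, n):
    """Group streams into the sets of (3): key (i,) for Ki, (i, j) for Kij."""
    sets = {}
    for stream, (first, last) in spans.items():
        assert first == last or last == oplus(first, 1, n), "longer than two kernels"
        key = (first,) if first == last else (first, last)
        sets.setdefault(key, set()).add(stream)
    return sets

# Live ranges of the three-kernel program of Figure 5 as (first, last) kernel.
spans = {"p": (1, 1), "q": (1, 2), "r": (2, 2), "s": (2, 2), "t": (2, 2),
         "u": (2, 3), "v": (2, 3), "w": (2, 3), "x": (3, 3), "y": (3, 3)}
sets = partition(spans, 3)
```

No stream falls into the wrap-around set (3, 1) in this example, so that set is empty.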
LEMMA 1. The streams in K(i⊖1)i ∪ Ki ∪ Ki(i⊕1) form a maximal
clique for every kernel i.
[Figure 6 diagram: Kernel 1 induces the clique {p, q}, Kernel 2 the
clique {q, r, s, t, u, v, w} and Kernel 3 the clique {u, v, w, x, y}.]
Figure 6: Kernel-induced cliques for the program in Figure 5.
Our main results are stated in two theorems: Theorem 3 is
applicable when Ncg is even, and Theorem 4 is applicable when
KNcg1 = ∅, i.e., when cross-iteration reuse is absent. When neither
condition holds, we can apply loop unrolling once to produce a
loop with an even number of kernels so that Theorem 3 can be
applied. For stream processors, unrolling a stream program that is
executed on the host processor does not negatively affect program
performance. (Code size expansion for the host is not a concern.)
THEOREM 3. If Ncg is even, then Gcg is a comparability graph.
PROOF. Let us assume first that all sets listed in (3) are not empty.
By construction, the live ranges in every such set are equal. Thus,
the subgraph of Gcg induced by Ki (Ki(i⊕1)) is a clique, denoted
Gi (Gi(i⊕1)). So we have the following 2Ncg induced cliques:
\[ G_1, G_{12}, G_2, G_{23}, G_3, \ldots, G_{(N_{cg}-1)N_{cg}}, G_{N_{cg}}, G_{N_{cg}1} \tag{4} \]
In addition, for any two sets K and K′ listed in (3), either every
live range x ∈ K interferes with every live range x′ ∈ K′ or there
is no interference between the live ranges in K and those in K′.
By Definition 4, in Gcg, if we let Gi (Gi(i⊕1)) “collapse” into
one node, identified by Ki (Ki(i⊕1)), and denote the resulting
“decomposed graph” by G0, we have:
Gcg = G0[G1,G12,G2,...,GNcg,GNcg1]
A clique is a comparability graph. Thus, Gi, i = 1,12,2,...,Ncg1,
given in (4) are all comparability graphs. Then, by Theorem 2, Gcg
is a comparability graph if we show that G0 is. To achieve this,
by Definition 3, it suffices to find a transitive orientation of
G0. As shown in Figure 7, there are exactly two different transitive
orientations since K12,K23,...,KNcg1 must alternate to be a
source or a sink (Lemma 2). This is possible since Ncg is even.
Finally, if any set listed in (3) is empty, the decomposed graph
G0 is still a comparability graph since every induced subgraph of a
comparability graph is a comparability graph. □
[Figure 7 diagram: (a) G0 with nodes K1, K12, K2, K23, K3, K34, K4
and K41; (b) its two orientations.]
Figure 7: Two transitive orientations of G0 (Ncg = 4).
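The alternating orientation used in the proof of Theorem 3 can be checked mechanically on the decomposed graph. The Python sketch below is an illustration under stated assumptions: node ("K", i) stands for Ki and node ("P", i) for Ki(i⊕1), all sets in (3) are taken to be nonempty, at least three kernels are assumed, and the function names are chosen here.

```python
def g0_ring(n):
    """Edges of the decomposed graph G0 for n >= 3 kernels."""
    edges = []
    for i in range(1, n + 1):
        prev = n if i == 1 else i - 1
        k, p_prev, p_next = ("K", i), ("P", prev), ("P", i)
        # The three sets that are live in kernel i are pairwise adjacent.
        edges += [(p_prev, k), (k, p_next), (p_prev, p_next)]
    return edges

def orient_alternating(edges, odd_is_source=True):
    """Make the two-kernel sets alternate between source and sink."""
    def role(node):
        kind, i = node
        if kind != "P":
            return None
        return "source" if (i % 2 == 1) == odd_is_source else "sink"

    arcs = set()
    for x, y in edges:
        if role(x) == "source" or role(y) == "sink":
            arcs.add((x, y))
        else:
            arcs.add((y, x))
    return arcs

def is_transitive(arcs):
    """(x, z) must be an arc whenever (x, y) and (y, z) are."""
    return all((x, z) in arcs for x, y in arcs for y2, z in arcs if y == y2)

one = orient_alternating(g0_ring(4), odd_is_source=True)
other = orient_alternating(g0_ring(4), odd_is_source=False)
odd = orient_alternating(g0_ring(3))
```

For four kernels both alternations are transitive and are reverses of each other. For three kernels this particular alternation is not transitive, because the two-kernel sets cannot alternate around an odd ring.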
THEOREM 4. If KNcg1 = ∅, then Gcg is a comparability graph.
PROOF. A transitive orientation of Gcg as shown in Figure 7 always
exists even if Ncg is odd since the “ring” is broken at KNcg1. □
In fact, Theorem 4 holds as long as Ki(i⊕1) = ∅ for some i.
Let us illustrate Theorem 4 in Figure 8 for the IG in Figure 5.
Since this IG is a comparability graph, an optimal coloring is
guaranteed to be found. The optimality is independent of the node
weights in the graph.
[Figure 8 diagram: the decomposed graph G0 over K1, K12, K2, K23 and
K3; the cliques G1 = {p}, G12 = {q}, G2 = {r, s, t}, G23 = {u, v, w}
and G3 = {x, y}; the oriented composition
G = G0[G1,G12,G2,G23,G3]; and the resulting intervals Ip, Iq, Ir, Is,
It, Iu, Iv, Iw, Ix and Iy on an axis from 0 to 13.]
Figure 8: Optimal interval coloring of the stream IG given in
Figure 5 (with the weights of p, q and r being 1, the weights of s, t, u
and v being 2 and the weights of w, x and y being 3). To avoid
cluttering, in the graph labelled by G0[G1,G12,G2,G23,G3], a thicker
arrow directed from a clique K to a clique K′ symbolizes all
directed edges (x,x′), for all x ∈ K and all x′ ∈ K′.
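The example of Figures 5 and 8 can be replayed end to end in Python. This is a sketch under stated assumptions: the stream weights come from the caption of Figure 8 and the kernel accesses from Figure 5, but the node order used to orient the edges is one choice made here (with K12 as a source and K23 as a sink), so the individual offsets need not match those drawn in Figure 8.

```python
from itertools import combinations

weights = {"p": 1, "q": 1, "r": 1, "s": 2, "t": 2,
           "u": 2, "v": 2, "w": 3, "x": 3, "y": 3}
kernels = [["p", "q"],
           ["q", "r", "s", "t", "u", "v", "w"],
           ["u", "v", "w", "x", "y"]]

# Streams accessed by the same kernel interfere pairwise (Lemma 1).
edges = {frozenset(e) for k in kernels for e in combinations(k, 2)}

# Assumed orientation: every edge points from the earlier to the later
# node in this order, so q is a source and u, v, w are reached last.
order = ["q", "p", "r", "s", "t", "x", "y", "u", "v", "w"]
rank = {v: i for i, v in enumerate(order)}
arcs = {tuple(sorted(e, key=rank.get)) for e in edges}

def is_transitive(arcs):
    return all((x, z) in arcs for x, y in arcs for y2, z in arcs if y == y2)

def coloring_from_orientation(weights, arcs):
    """Sinks start at 0; other nodes start above all their successors."""
    succ = {v: [] for v in weights}
    for x, y in arcs:
        succ[x].append(y)
    interval = {}
    def place(v):
        if v not in interval:
            t = max((place(s)[1] for s in succ[v]), default=0)
            interval[v] = (t, t + weights[v])
        return interval[v]
    for v in weights:
        place(v)
    return interval

intervals = coloring_from_orientation(weights, arcs)
width = max(hi for _lo, hi in intervals.values())
heaviest_clique = max(sum(weights[s] for s in k) for k in kernels)
```

The orientation passes the transitivity check and the resulting width is 13, equal to the weight of the heaviest kernel clique, so by (1) no narrower coloring exists for these weights.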
The facts stated in Lemmas 2 and 3 are exploited in the devel-
opment of our algorithm for coloring stream IGs in Section 4.3.
LEMMA 2. Suppose Gcg is a comparability graph. Let G′cg be an
induced subgraph of Gcg. If G′cg is connected, then it has at most
eight different transitive orientations.
PROOF. If G′cg is connected, there must exist two kernels i and
j such that Ki(i⊕1),K(i⊕1)(i⊕2),...,K(j⊖2)(j⊖1),K(j⊖1)j are in
G′cg and that these are the only sets containing two-kernel long
live ranges listed in (3). Let us consider only the worst case when
Ki = Kj = ∅. In a transitive orientation of G′cg, the middle
j − i − 2 sets in the above list must alternate to serve as a source
or a sink. So there are only two possibilities. In either case, edge
(Ki(i⊕1),Ki⊕1) may have at most two orientations, and similarly,
edge (Kj⊖1,K(j⊖1)j) may have at most two orientations. So there
are at most 2 × 2 × 2 = 8 different transitive orientations. □
LEMMA 3. Suppose Gcg is a comparability graph. If all sets in (3)
are nonempty, then Gcg has exactly two transitive orientations.
PROOF. Gcg is connected, so the result follows by applying
Lemma 2 (Figure 7). □
4.3 A General Algorithm for Coloring Stream IGs
In some scientific applications (amenable to stream processing), the
presence of temporal reuse in a few streams could make their live
ranges longer than two kernels. In some media applications, there
are also occasionally a few long producer-consumer live ranges.
Furthermore, some live ranges may be extended by the programmer
or a pre-pass compiler optimization in order to overlap memory
transfers and kernel execution. Such stream IGs may or may not
be comparability graphs. In this section, we generalize our work
described in the preceding section to deal with these stream IGs.
The basic idea is to partition the node set V in G = (V,E) into:
\[ V_s = \{ v \in V \mid v\text{'s live range spans at most two kernels} \} \]
\[ V_l = \{ v \in V \mid v\text{'s live range spans more than two kernels} \} \]
As a result, E is partitioned into the following three subsets:
\[ E_s = \{ (x,y) \in E \mid x \in V_s,\ y \in V_s \} \]
\[ E_l = \{ (x,y) \in E \mid x \in V_l,\ y \in V_l \} \]
\[ E_{sl} = \{ (x,y) \in E \mid x \in V_s,\ y \in V_l \} \]
By Theorems 3 and 4, the subgraph G(Vs) induced by Vs is
a comparability graph. Our key observation is that the long live
ranges in stream IGs are sparse and tend not to be live simultane-
ously. So we assume that the subgraph G(Vl) is a forest of disjoint
trees (which are trivially comparability graphs). In the rare cases
when G(Vl) is not a forest, we may, as part of future work, apply
live range splitting to shorten some live ranges to make it so.
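The partition of nodes and edges, together with the forest assumption on G(Vl), can be sketched in Python as follows. The function names, the `spans` table (the number of kernels each live range spans) and the choice of streams are assumptions made for illustration; the example uses a hand-picked subset of the streams of the program in Figure 9.

```python
def partition_ig(spans, edges):
    """Split nodes by live-range length and edges by endpoint class."""
    v_s = {v for v, k in spans.items() if k <= 2}
    v_l = set(spans) - v_s
    e_s, e_l, e_sl = set(), set(), set()
    for x, y in edges:
        e = frozenset((x, y))
        if x in v_s and y in v_s:
            e_s.add(e)
        elif x in v_l and y in v_l:
            e_l.add(e)
        else:
            e_sl.add(e)
    return v_s, v_l, e_s, e_l, e_sl

def is_forest(nodes, edges):
    """True if the undirected graph has no cycle (union-find check)."""
    parent = {v: v for v in nodes}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v
    for e in edges:
        x, y = tuple(e)
        rx, ry = find(x), find(y)
        if rx == ry:          # x and y are already connected: a cycle
            return False
        parent[rx] = ry
    return True

# m is live over kernels 2..5 and p over kernels 3..6; o, r and t are short.
spans = {"m": 4, "p": 4, "o": 2, "r": 2, "t": 2}
edges = [("m", "p"), ("m", "o"), ("m", "r"), ("m", "t"),
         ("p", "o"), ("p", "r"), ("p", "t"), ("o", "r"), ("r", "t")]
v_s, v_l, e_s, e_l, e_sl = partition_ig(spans, edges)
```

Here `is_forest(v_l, e_l)` holds because the two long live ranges are joined by a single edge.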
Load(..., k);
Kernel('1', k, l);
Load(..., m);
Load(..., n);
Kernel('2', l, m, n, o);
Load(..., p);
Load(..., q);
Kernel('3', o, p, q, r);
Load(..., s);
Kernel('4', r, s, t);
Load(..., u);
Kernel('5', m, t, u, v);
Load(..., w);
Kernel('6', p, v, w, x);
Kernel('7', x, y);
[Figure 9 diagram: the IG G of the program above, with nodes k, l, m,
n, o, p, q, r, s, t, u, v, w, x and y, shown together with its
subgraphs G(Vs) and G(Vl).]
Figure 9: A program with two long live ranges m and p and its IG.
As illustrated in Figure 9, Vl consists of two long live ranges m
and p: m is live from kernel 2 to kernel 5 and p is live from kernel 3
to kernel 6. Since both streams interfere with each other, the forest
G(Vl) has only one tree, which is a line connecting m and p.