From Algorithms to Parallel Architectures: A Formal Approach.

Abstract

The authors introduce a formal approach for the synthesis of parallel architectures. Four different forms are used to express the given algorithms: simultaneous recursion, recursion with respect to different variables, fixed nesting and variable nesting. Four different architectures for the same algorithm are obtained. As an example, a matrix-matrix multiplication algorithm is used to obtain four different optimal architectures. The different architectures of this example are compared in terms of area, time, broadcasting and required hardware. The approach provides two main features: completeness and correctness.
Khaled M. Elleithy
Computer Engineering Department
King Fahd University
Dhahran 31261
Saudi Arabia

Magdy A. Bayoumi
The Center for Advanced Computer Studies
University of Southwestern Louisiana
Lafayette, LA 70504
U.S.A.

TH0363-2/91/0000/0358$01.00 © 1991 IEEE

1. Introduction
Formal high level synthesis of general architectures is an important design phase to ensure functional, correct and cost-effective architectures. Recently, there have been several efforts in this direction [1-5], but many of these efforts have not included parallel architectures. There have been several synthesis approaches for synthesizing special classes of arrays [6,7]. In this paper we present a formal system for synthesizing parallel architectures. The architectures produced by this system can be classified as uniprocessor architectures. To exploit the parallelism in a given algorithm, the methodology has been generalized so that it can be applied to simultaneous recursion forms [10]. In this paper we shall extend the methodology by applying it to the following forms: recursion with respect to several variables, fixed nested recursion and variable nested recursion. Recursion with respect to several variables will be discussed in detail. For a complete discussion of all the forms please refer to [10].
The methodology provides two main features: completeness and correctness. Completeness means the ability to use the approach for any general algorithm. Correctness is achieved by using a set of transformations that are proved to be correct. A design example of matrix-matrix multiplication is used with each one of the forms to obtain a parallel architecture. The different architectures for this example are compared in terms of area and speed.
1.1. System Overview
Figure 1 shows the different components of our formal system. The system is composed of two subsystems: a synthesis subsystem and a user-interface subsystem. A new language, Algorithm Specification Language (ASL), based on μ-recursive functions, is used to specify the given algorithm. Transformation techniques are used to transform an algorithm specified in ASL to a realization language called RSL. Every construct in ASL has an isomorphic representation in RSL, which is the basis of the automated transformation.
A logic programming environment based on Prolog is employed as a user interface to the synthesis process. The logic programming environment supports specifying, simulating, and testing the target systems. Prolog provides homogeneity to the developed system as it supports hierarchical development and mixing of descriptions at various hierarchical levels. For more details on the synthesis subsystem and the user-interface subsystem please refer to [8-10].
In the next section, due to space limitations, we shall present only the synthesis approach for recursion with respect to several variables in detail.
Authorized licensed use limited to: University of Bridgeport. Downloaded on February 24,2010 at 13:29:18 EST from IEEE Xplore. Restrictions apply.
2. Recursion with respect to several variables
If $x_i$ ($1 \le i \le n$) are $(n-1)$-place functions, $z_i$ are $n$-place functions, $y$ is a $2n$-place function and $w$ are $k$-place functions, then $z$ is defined by ASL code ASL1.
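As an illustration of this kind of definition, the following Python sketch builds a function z for the two-variable case (n = 2). It assumes the classical shape of recursion with respect to several variables: one base case per variable that reaches zero, and a step case that combines two earlier values of z. The helper name make_two_variable_recursion, the argument order of y and w, and the lattice-path instance are all hypothetical choices made for the example; they are not taken from ASL1.

```python
from functools import lru_cache
from typing import Callable


def make_two_variable_recursion(
    x1: Callable[[int], int],
    x2: Callable[[int], int],
    y: Callable[[int, int, int, int], int],
    w: Callable[[int, int], int],
) -> Callable[[int, int], int]:
    """Build z from an assumed two-variable instance of the schema.

    z(0, b)         = x1(b)
    z(a + 1, 0)     = x2(a)
    z(a + 1, b + 1) = y(a, b, z1, z2)
        with z1 = z(a, w(a, b)) and z2 = z(a + 1, b)
    """

    @lru_cache(maxsize=None)
    def z(a: int, b: int) -> int:
        if a == 0:
            return x1(b)                    # first variable is zero
        if b == 0:
            return x2(a - 1)                # second variable is zero
        z1 = z(a - 1, w(a - 1, b - 1))      # first variable decreased
        z2 = z(a, b - 1)                    # second variable decreased
        return y(a - 1, b - 1, z1, z2)

    return z


# Hypothetical instance: number of monotone lattice paths to the point (a, b).
paths = make_two_variable_recursion(
    x1=lambda b: 1,
    x2=lambda a: 1,
    y=lambda a, b, z1, z2: z1 + z2,
    w=lambda a, b: b + 1,
)
```

With these choices paths(a, b) satisfies paths(a, b) = paths(a - 1, b) + paths(a, b - 1), so each recursive call reduces exactly one of the two variables.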
Transformation Algorithm to RSL
To transform the system of recursion with respect to several variables to RSL, we implement each equation using the same method described in [9]. RSL1 is the RSL representation of the system.
Equation 1 shows that $n$ registers are initialized with the arguments ($arg_1, arg_2, \ldots, arg_n$). Equation 2 means that the unit Suc, which is a basic function, has its inputs $control_i$ ($1 \le i \le n$) connected to the $ready_i$ output of the unit computing $x_i$, to be sure that $I$ is not incremented until $x_i$ is computed. Equation 3 represents the fact that $I$ is incremented every clock cycle using the Suc unit, and that $I$ is initialized to the value 1 using register number $n+1$. Equation 4 determines the end of operation, when $I$ reaches the value $m$. Equations 5, 6 and 7 represent the composition operation in equations 1, 2 and 3 of the ASL representation, respectively. The architecture for the recursion with respect to several variables is shown in Figure 2. Similar analysis has been done for the other two approaches: fixed nested recursion and variable nested recursion [10].
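The control behaviour described by Equations 2 to 4 can be mimicked in software. The sketch below is only one possible reading of that description: it assumes that I is held whenever any ready signal is low, and that the run stops as soon as I equals m. The function name run_index_counter and the ready_trace input format are hypothetical.

```python
from typing import List, Sequence


def run_index_counter(m: int, ready_trace: Sequence[Sequence[bool]]) -> List[int]:
    """Simulate the index register I, one entry of ready_trace per clock cycle.

    ready_trace[t] holds the ready signals of the x_i units in cycle t.
    I starts at 1, is incremented only when every ready signal is high,
    and the run stops once I reaches m.
    """
    index = 1                       # I is initialised to 1
    history = [index]
    for ready_signals in ready_trace:
        if index == m:              # end of operation: I has reached m
            break
        if all(ready_signals):      # Suc is enabled only when all x_i are ready
            index += 1              # I <- Suc(I)
        history.append(index)
    return history
```

For example, run_index_counter(3, [[True, True], [True, False], [True, True], [True, True]]) holds I at 2 during the cycle in which one unit is not ready, and stops once I is 3.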
Table 1 shows a comparison among these different forms of recursion in terms of architecture, broadcasting and complexity of the controller. The simultaneous recursion is the only form that gives a two dimensional array. All forms have broadcasting except the variable nesting. The controller of the variable nesting is complex compared with the other three forms.
3. Matrix-Matrix Multiplication Example
An example of matrix multiplication is introduced as an application of the different forms of recursion. The architecture has two matrices A and B as inputs, and matrix C as an output. The multiplication is done in a recursive way and can be described by the following high level subroutine:
matrix-multiplication (A,B)
begin
  for i ← 1 to n
    for j ← 1 to n
    begin
      c_{ij,0} ← 0
      for k ← 1 to n
        c_{ij,k} ← c_{ij,k-1} + A_{i,k} * B_{k,j}
      next k
    end
    next j
  next i
end
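The subroutine above can be transcribed almost line for line into Python. This is only an illustrative sketch: the function name matrix_multiply and the list-of-lists matrix representation are assumptions made here, and the variable partial plays the role of c_{ij,k}.

```python
from typing import List

Matrix = List[List[float]]


def matrix_multiply(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two n x n matrices using c_{ij,k} = c_{ij,k-1} + A_{i,k} * B_{k,j}."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            partial = 0.0                               # c_{ij,0} = 0
            for k in range(n):
                partial = partial + a[i][k] * b[k][j]   # c_{ij,k}
            c[i][j] = partial                           # c_{ij,n} is the final element
    return c
```

For instance, matrix_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]) yields [[19, 22], [43, 50]].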
For the ASL and RSL representation using the four recursive forms please refer to [10].
Figure 3 shows the architecture obtained for matrix multiplication for the case of recursive equations with several variables. The details of implementing the inner-product cell are shown in [9]. The architecture consists of N inner-product cells. The number of cycles required to perform the multiplication is N.
Figure 4 shows the architecture using recursion with respect to several variables. The architecture consists of N multiplication cells and one adder. The number of cycles required to perform the multiplication is N.
Figure 5 shows the architecture using fixed nesting recursion. The architecture consists of N inner-product cells. The number of cycles required to perform the multiplication is N.
Table 2 shows a comparison between the different architectures of the matrix-matrix multiplication.
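One way to picture the organisation with N multiplication cells and one adder, described above for Figure 4, is sketched below for a single output element. The sketch assumes that all multipliers fire in parallel and that the single adder absorbs exactly one product per clock cycle; the paper does not spell out this schedule, and the function name multiplier_adder_element is hypothetical.

```python
from typing import List, Tuple


def multiplier_adder_element(row: List[float], col: List[float]) -> Tuple[float, int]:
    """Compute one output element with len(row) multipliers and a single adder.

    Returns the element value and the number of adder cycles used, assuming
    the adder absorbs exactly one product per clock cycle.
    """
    products = [a * b for a, b in zip(row, col)]   # all multipliers work in parallel
    total = 0.0
    cycles = 0
    for product in products:                       # the single adder accumulates serially
        total += product
        cycles += 1
    return total, cycles
```

Under these assumptions one element built from N products takes N adder cycles; for example, multiplier_adder_element([1, 2, 3], [4, 5, 6]) returns (32.0, 3).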
4. Conclusions
In this paper a formal approach for transforming different forms of recursion to parallel architectures has been introduced. Four different forms are used to express a given algorithm. Four optimal architectures for a matrix-matrix multiplication are compared. The developed approach represents the first step towards developing a high level synthesis system for general parallel architectures. It ensures correctness, but it does not address optimality, which is considered an important issue too. The developed approach has the following advantages:
[1] It is suitable for large problems since the transformation algorithm is linear.
[2] It does not require the target architecture to be known in advance.
[3] The technique is fully automated.
[4] The designer is not responsible for specifying the sequencing of operations and the communications among different units.
[5] The approach is applicable to any general algorithm.
Acknowledgements: The first author wishes to thank King Fahd University of Petroleum and Minerals for support.
5. References
[1] C. J. Tseng, "Automated Synthesis of Data Paths in Digital Systems," Ph.D. Dissertation, CMU, Apr. 1984.
[2] T. J. Kowalski, "The VLSI Design Automation Assistant: A Knowledge-based Expert System," Ph.D. thesis, CMU, 1984.
[3] H. Trickey, "Flamel: A High Level Hardware Compiler," IEEE Trans. on Computer-Aided Design, vol. CAD-6, no. 2, pp. 259-269, March 1987.
[4] T. Tanaka, T. Kobayashi and O. Karatsu, "HARP: Fortran to Silicon," IEEE Trans. on Computer-Aided Design, vol. 8, no. 6, pp. 649-660, June 1989.
[5] C. Niessen, C. H. van Berkel, M. Rem, and R. W. Saeijs, "VLSI Programming and Silicon Compilation; A Novel Approach from Philips Research," Intl. Conf. on Computer Design: VLSI in Computers and Processors, pp. 150-151, Oct. 1988.
[6] S. K. Rao, "Regular Iterative Algorithms and Their Implementations on Processor Arrays," Ph.D. Dissertation, Dept. of Electrical Eng., Stanford Univ., 1985.
[7] P. Quinton, "Automatic Synthesis of Systolic Arrays From Uniform Recurrent Equations," Proc. of the 11th Annual Intl. Symp. on Computer Architecture, pp. 208-214, 1984.
[8] K. M. Elleithy and M. A. Bayoumi, "Formal Synthesis of Parallel Architectures from Recursive Equations," Proceedings of the 1990 International Conference on Parallel Processing, Aug. 1990.
[9] K. M. Elleithy and M. A. Bayoumi, "A Formal Framework for Synthesis of Parallel Architectures," Proceedings of the Fourth Annual Symposium on Parallel Processing, April 1990.
[10] K. M. Elleithy, "A Formal Framework for High Level Synthesis of Digital Designs," Ph.D. Dissertation, Center for Advanced Computer Studies, University of SW Louisiana, 1990.
......................................
z_{n-1} = z(arg_1+1, ..., arg_{n-2}+1, arg_{n-1}, w_{n-1})
z_n = z(arg_1+1, ..., arg_{n-1}+1, arg_n)

ASL1: ASL code for recursion with respect to several variables

                           Simultaneous      Several variables   Fixed nesting     Variable nesting
Architecture               two dimensional   one dimensional     one dimensional   one dimensional
Broadcasting               yes               yes                 yes               no
Complexity of controller   simple            simple              simple            complex

Table 1. Comparison Between Different Forms of Recursion.

               Simultaneous        Several variables           Nesting
Area           (illegible)         (illegible)                 (illegible)
Time           (illegible)         (illegible)                 (illegible)
A*T            (illegible)         (illegible)                 (illegible)
Broadcasting   yes                 no                          no
Hardware       inner-product (n)   multiplier (n), adder (1)   inner-product (n)

Table 2. Comparison Between Different Matrix-Matrix Multiplication Architectures.
Initp(1, arg_1; 2, arg_2; ...; n, arg_n)
......................................
Suc_control_i ← x_i^Ready
I ← p_{n+1} Suc(I)
Ready ← Eq(I, m)
z(0, arg_2, ..., arg_n) ← Comp(arg_2, ..., arg_n # x_1)
z^Ready(0, arg_2, ..., arg_n) ← And(arg_2^Ready, arg_3^Ready, ..., arg_n^Ready)
z(arg_1+1, 0, arg_3, ..., arg_n) ← Comp(arg_1+1, 0, arg_3, ..., arg_n # x_2)
z^Ready(arg_1+1, 0, arg_3, ..., arg_n) ← And((arg_1+1)^Ready, ..., arg_n^Ready)
......................................
z(arg_1+1, arg_2+1, ..., arg_{n-1}+1, 0) ← Comp(arg_1+1, arg_2+1, ..., arg_{n-1}+1 # x_n)
z^Ready(arg_1+1, arg_2+1, ..., arg_{n-1}+1, 0) ← And((arg_1+1)^Ready, (arg_2+1)^Ready, ..., (arg_{n-1}+1)^Ready)
......................................
z_{n-1} ← Comp(arg_1+1, ..., arg_{n-2}+1, arg_{n-1}, w_{n-1} # z)
z_{n-1}^Ready ← And((arg_1+1)^Ready, ..., (arg_{n-2}+1)^Ready, arg_{n-1}^Ready, w_{n-1}^Ready)
z_n ← Comp(arg_1+1, ..., arg_{n-1}+1, arg_n # z)
z_n^Ready ← And((arg_1+1)^Ready, ..., (arg_{n-1}+1)^Ready, arg_n^Ready)

RSL1: RSL code for recursion with respect to several variables