A Quaternary Decision Diagram Machine and the Optimization of its Code.
IEICE TRANS. INF. & SYST., VOL.E93–D, NO.8 AUGUST 2010
INVITED PAPER
A Quaternary Decision Diagram Machine: Optimization of Its Code∗
Special Section on Multiple-Valued Logic and VLSI Computing
Tsutomu SASAO†a), Hiroki NAKAHARA†, Munehiro MATSUURA†, Yoshifumi KAWAMURA††,
and Jon T. BUTLER†††, Members
SUMMARY  This paper first reviews the trends of VLSI design, focusing
on the power dissipation and programmability. Then, we show the
advantage of Quaternary Decision Diagrams (QDDs) in representing and
evaluating logic functions. That is, we show how QDDs are used to
implement QDD machines, which yield high-speed implementations. We
compare QDD machines with binary decision diagram (BDD) machines, and
show a speed improvement of 1.28-2.02 times when QDDs are chosen. We
consider 1- and 2-address BDD machines, and 3- and 4-address QDD
machines, and we show a method to minimize the number of instructions.
key words: quaternary decision diagram, branching program machine
1. Trends of VLSI Design
1.1 Explosion of Complexity
With the growth of multimedia and other applications, the
demand for high-performance processors has increased. In
the past, Moore’s Law solved this problem. Moore’s Law
states that the number of transistors on a chip doubles every
18 months.
In the process of miniaturization, the scaling down of
transistor size and chip area has reduced power dissipation.
That is, by scaling down the transistor size in LSIs, chip
area, delay, and power dissipation can be reduced at the
same time. However, in the future, the number of transis-
tors on a chip is expected to fall short of that predicted by
Moore’s Law.
1.2 Power Dissipation
As transistor size decreases, supply voltage must also scale
down to keep the electric field in the integrated circuit
constant[32]. However, as the supply voltage decreases,
subthreshold leakage current increases. Nowadays, power
dissipation due to leakage current accounts for about 40% of
the total power dissipation in a microprocessor[5]. Therefore,
as supply voltage is reduced, power density is a limiting
factor. With an increase of the power density, the temperature
of the chip may become too high. To make matters
worse, leakage current increases exponentially with
temperature[3]. When a transistor produces more heat than the
heatsink can dissipate, thermal runaway occurs. Therefore,
cooling is very important. In the past, reduction of chip area
was the main design issue. However, nowadays, the reduction
of power dissipation is the primary design issue. In mobile
applications, battery size is limited, so the use of low
power devices is crucial.

Manuscript received November 9, 2009.
†The authors are with Kyushu Institute of Technology, Iizuka-shi, 820–8502 Japan.
††The author is with Renesas Electronics Corp., Tokyo, 100–0004 Japan.
†††The author is with the Naval Postgraduate School, Monterey, CA 93943–5121, USA.
∗A preliminary version of this paper was presented at ISMVL-2010[30].
a) E-mail: sasao@cse.kyutech.ac.jp
DOI: 10.1587/transinf.E93.D.2026
1.3 Multi-Core and Parallel Processing
Power dissipation of a CMOS gate is approximately
$P = \alpha \times V_{dd}^{2} \times f$,
where $\alpha$ is a constant, $V_{dd}$ is the supply voltage, and $f$ is the
clock frequency.
Reduction of the supply voltage without changing transistor
dimensions requires a reduction in clock frequency
$f$ [4]. Assume that the power supply voltage is reduced by
30%, and that the clock frequency is reduced by 50%. In
this case, we have
$\alpha \times (0.7 V_{dd})^{2} \times 0.5 f \approx 0.25\,\alpha V_{dd}^{2} f$.
Consider a dual core version of this, as shown in Fig. 1.
In this case, a reduction by half of the frequency is compensated
by an increase by two times of the number of processors,
yielding nearly equal throughput. That is, this change
has resulted in a reduction by half of the power with no
change in the system throughput.
In personal computers, many threads are running at the
same time. Thus, many computers can benefit from multi-cores.
In this sense, chip area is increased to reduce power
dissipation. Increasing the number of cores increases the
chip cost, but the reduction of power dissipation is more
important.

Fig. 1  Using a dual core processor to reduce the power by half.

Copyright © 2010 The Institute of Electronics, Information and Communication Engineers
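The dual-core power estimate of this section can be checked numerically. The following is a minimal sketch, assuming normalized values (alpha = Vdd = f = 1) that are illustrative and not taken from any measurement; the function name `cmos_power` is hypothetical.

```python
def cmos_power(alpha, vdd, f):
    """Approximate power of a CMOS gate: P = alpha * Vdd^2 * f."""
    return alpha * vdd ** 2 * f

# Normalized baseline: one core at full voltage and full frequency.
baseline = cmos_power(1.0, 1.0, 1.0)

# One core with the supply voltage reduced by 30% and the clock by 50%.
one_slow_core = cmos_power(1.0, 0.7, 0.5)

# Two such cores; ideal parallel throughput is 2 * 0.5 f = f.
dual_core = 2 * one_slow_core

print(round(one_slow_core / baseline, 3))  # 0.245, i.e. about a quarter
print(round(dual_core / baseline, 3))      # 0.49, i.e. about a half
```

The throughput claim relies on the workload being perfectly parallel across the two cores, which is an idealization.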
By reducing power, cooling fans can often be eliminated[2].
Also, reliability will be enhanced because of
lower temperatures. Excessively high temperature can burn
out the chip. Even if the temperature is low enough so that
this does not occur, high temperature can cause cumulative
damage.
In multi-core systems, unused cores can be turned off to
further reduce power dissipation. Unfortunately, developing
efficient software for multi-core systems is not so easy. Most
existing software is single-threaded. In a single core processor,
various methods are used to increase the performance without
increasing the clock frequency, including pipelining,
superscalar and superpipelined architectures, and very long
instruction word processors (VLIWs). Unfortunately, even if the chip
area of a single-core processor is doubled to increase the
performance, the resulting performance is increased only by
1.4 times, as predicted by Pollack’s rule[4].
1.4 Programmable Device
With the miniaturization of chips, the cost of masks for
VLSI has increased drastically. Since the number of tran-
sistors has increased, VLSI design is now very complicated.
As transistors become smaller, variability of the threshold
voltage of transistors increases. Therefore, achieving con-
sistent switching becomes difficult. As a result, design and
test cost has also increased[9]. Due to this, custom chips are
feasible only for mass-production products, such as games
and cellular phones. In addition, the life of today’s prod-
ucts is short: every few months, new products are devel-
oped. Thus, the number of newly developed VLSIs has
been reduced. Instead, microprocessors, application specific
standard products (ASSPs), and field programmable gate ar-
rays (FPGAs) are used to implement electronic appliances.
These can be customized by writing programs.
2. Introduction of Branching Program Machines
In the rest of this paper, we focus on branching program
machines, which are suitable for control applications. They
are programmable, since major parts consist of memories.
Because memory is involved, reliability can be improved by
using traditional techniques, such as error correcting codes
(ECC).
Branching program machines for BDDs have been
used in control applications[6],[10]–[12]. Fast response is
especially important in control applications in which there
are usually hundreds of inputs. For such applications, a gen-
eral purpose microprocessor (MPU) cannot meet the speed
requirements. A branching program machine can be several
times faster than an MPU: An ordinary MPU requires two
or three machine instructions to read and test one input vari-
able, while the branching program machine requires just one
instruction[7].
Fig. 2  An example of BDD.
Fig. 3  MUX circuit.
Parallelization can be implemented by multi-way
branching programs. Thus, performance can be improved
without increasing the clock frequency.
2.1 Conversion from a Circuit to a Branching Program Machine
Consider the implementation of a given logic function.
This can be represented by a binary decision diagram
(BDD). Figure 2 shows the BDD of an example function,
$f(x_1, x_2, x_3, x_4) = x_1 x_2 \vee (x_3 \oplus x_4)$. In this diagram, dotted
lines (left lines) correspond to $x_i = 0$ and solid lines (right
lines) correspond to $x_i = 1$. By replacing each non-terminal
node of a BDD with a multiplexer (MUX), we have a cir-
cuit, at the top of Fig.3, that realizes the given logic function
whose BDD is shown in Fig.2.
However, such implementation requires dedicated in-
terconnections and expensive masks. A branching program
machine is a sequential circuit that emulates the MUX cir-
cuit. In this case, the interconnections are programmed in
a memory. Thus, by using a branching program machine, a
logic function is implemented by logic and memory. Since it
has no instruction fetch, it is faster and dissipates less power
than a general purpose microprocessor.
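The sequential emulation of the MUX circuit can be sketched in software as follows. This is only an illustration: the node table is a hypothetical BDD for the example function with variable order x1, x2, x3, x4, and its node names and shape are assumptions that need not match the drawing in Fig. 2.

```python
from itertools import product

# Hypothetical node table for f = x1*x2 OR (x3 XOR x4).
# node -> (variable index, successor when the variable is 0, successor when 1).
# Integer successors 0 and 1 are the terminal nodes.
BDD = {
    "a": (1, "c", "b"),
    "b": (2, "c", 1),
    "c": (3, "d", "e"),
    "d": (4, 0, 1),
    "e": (4, 1, 0),
}

def evaluate(bdd, root, x):
    """Follow one path from the root to a terminal, one node per step."""
    node, steps = root, 0
    while node in bdd:
        var, lo, hi = bdd[node]
        node = hi if x[var - 1] else lo
        steps += 1
    return node, steps

def reference(x1, x2, x3, x4):
    """The example function written directly as a logic expression."""
    return (x1 & x2) | (x3 ^ x4)

# Exhaustive check: the table realizes f, using at most n = 4 steps.
for x in product((0, 1), repeat=4):
    value, steps = evaluate(BDD, "a", x)
    assert value == reference(*x) and steps <= 4
```

Here the dictionary plays the role of the memory that stores the interconnections, and each loop iteration corresponds to one MUX of the circuit being emulated.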
Unfortunately, a branching program machine is slower
than the original logic circuit, since it emulates the cir-
cuit sequentially. A straightforward method to increase the
speed is to increase the clock frequency. However, this is
difficult in most cases. To increase processing speed without
increasing the clock frequency, we use a Multi-valued Deci-
sion Diagram (MDD). For example, when two variables are
evaluated at the same time, the decision diagram has four-way
branches; this is called a Quaternary Decision Diagram
(QDD). In this way, performance is increased without
increasing the clock frequency. Such an idea is used in VLIW
processors[21], where branch instructions are multiway.
2.2 Optimization of Branching Program Machine
A Quaternary Decision Diagram (QDD) machine is up to
two times faster than a BDD machine. However, instruction
words for the QDD machine require four address fields, i.e.,
instructions with many bits are necessary. This increases
the power dissipation, which is proportional to the number
of bits in the instruction words.
Optimization of code for a QDD machine can be
treated as an optimization of a 4-valued logic circuit. A
multi-core system of 128 QDD machines was implemented
on an FPGA[24]. This is up to 96 times faster than a
microprocessor (Core2Duo U7600), even though the
QDD machines run at 100MHz, while the microprocessor
runs at 1.2GHz. Further, the power dissipation of the 128 QDD
machines is only a quarter of that of the microprocessor.
The rest of this paper is organized as follows: Sec-
tion 3 introduces a method to represent multi-output logic
functions by multi-valued decision diagrams. Section 4 in-
troduces branching program machines: It introduces both
a 4-address QDD machine and a 3-address QDD machine.
The 3-address QDD machine requires less memory than the
4-address QDD machine. Section 5 shows an optimization
problem of codes for 3-address QDD machines. Section 6
shows the experimental results. And finally, Sect.7 con-
cludes the paper.
3. Representation of Multiple-Output Functions
3.1 Multi-Valued Decision Diagrams
An arbitrary n variable logic function can be represented
by a binary decision diagram (BDD). Evaluation of a BDD
requires n table look-ups. Figure 4 shows an example of
an MTBDD (multi-terminal binary decision diagram). In
this case, many outputs can be evaluated at the same time.
To further speed up the evaluation, a multiple-valued decision
diagram (MDD) is used. In the MDD(k), k variables
are grouped to form a $2^k$-valued super variable. To evaluate
the MDD(k), we need at most $\lceil n/k \rceil$ table look-ups[20],
[25]. When the function is represented by an MDD(k), the
evaluation of a logic function can be k times faster than the
corresponding BDD†. Thus, a larger k yields a faster evaluation
of the MDD(k). Unfortunately, the size of memory
to represent a node for an MDD(k) is proportional to $2^k$, as
shown in Fig. 5. For many benchmark functions, the total
size of the memory for an MDD(k) achieves its minimum
when k = 2[25]. Therefore, in logic evaluation, MDD(2)s
are more suitable than BDDs. Since nodes in an MDD(2)
have 4 branches, it is termed a Quaternary Decision Diagram
(QDD).

Fig. 4  Example of an MTBDD.
Fig. 5  Nodes for MDD(k).
Fig. 6  Conversion of BDD to MDD(2).
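The trade-off between evaluation steps and node size can be tabulated with a few lines of code. This is a minimal sketch: the value n = 16 is an arbitrary illustration, and the function names are hypothetical.

```python
def max_lookups(n, k):
    """Upper bound on table look-ups for an MDD(k): ceil(n / k)."""
    return -(-n // k)

def branches_per_node(k):
    """Outgoing edges per MDD(k) node; node memory grows in proportion."""
    return 2 ** k

n = 16  # illustrative number of binary input variables
for k in (1, 2, 3, 4):
    print(k, max_lookups(n, k), branches_per_node(k))
```

Going from k = 1 to k = 2 halves the number of look-ups while doubling the branches per node; larger k keeps reducing look-ups more slowly while the node size keeps doubling.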
3.2 Optimization of MDDs
In an MDD(k), the evaluation of an n-variable logic function
can be done by at most $\lceil n/k \rceil$ table look-ups. So, the major
problem is the minimization of the number of nodes. In
general, it is not so easy to obtain an MDD(k) with the minimum
number of nodes. The following heuristic method is
used to obtain near minimal MDDs:
1. Minimize nodes of the BDD by a heuristic method[27].
2. Partition the input variables to generate an MDD(k)[28].
Figure 6 shows an example of a conversion from a BDD into
an MDD(2). In the above MDDs, we assume each group of
variables has the same size. Such MDDs are homogeneous
MDDs. When the groups have different sizes, the MDD
is a heterogeneous MDD. For simplicity, in this paper, we
consider only homogeneous MDDs.

†This is true only when the MDD(k) and the BDD are quasi reduced.
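To make the grouping of variables concrete, the sketch below builds a homogeneous MDD(2) directly from a small function by pairing the inputs in natural order and sharing identical sub-diagrams. It is not the heuristic of [27], [28]: it enumerates all input combinations, so it is usable only for small n, and the names `build_qdd` and `evaluate` are hypothetical. The result is a reduced (not quasi reduced) diagram.

```python
from itertools import product

def build_qdd(f, n):
    """Build a reduced MDD(2) for f over n binary inputs (n even)."""
    nodes = {}  # unique table: (level, children) -> node id

    def build(level, prefix):
        if level == n // 2:
            return ("T", f(*prefix))          # terminal node
        children = tuple(build(level + 1, prefix + pair)
                         for pair in product((0, 1), repeat=2))
        if len(set(children)) == 1:           # all four edges agree: skip node
            return children[0]
        key = (level, children)
        if key not in nodes:                  # share identical sub-diagrams
            nodes[key] = ("N", len(nodes))
        return nodes[key]

    return build(0, ()), nodes

def evaluate(root, nodes, x):
    """Evaluate the diagram with one four-way branch per visited node."""
    by_id = {node: key for key, node in nodes.items()}
    node = root
    while node[0] == "N":
        level, children = by_id[node]
        node = children[2 * x[2 * level] + x[2 * level + 1]]
    return node[1]

def f(x1, x2, x3, x4):
    return (x1 & x2) | (x3 ^ x4)

root, nodes = build_qdd(f, 4)
print(len(nodes))  # 2 non-terminal nodes for this function
```

For this example function, the QDD needs two non-terminal nodes and at most two branches per evaluation.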
4. Branching Program Machine
Special machines to evaluate MDDs have been devel-
oped[13]–[15]. Unfortunately, they are unsuitable for prac-
tical applications. Here, we consider a machine whose ar-
chitecture is well-suited for evaluating MDDs, but is easily
programmed.
4.1 2-Address BDD Machine
A branching program for BDDs uses only two kinds of in-
structions:
B_Branch (ADDR0, ADDR1), INDEX
Output DATA, and GOTO ADDR.
The first one is the binary branch instruction, which is
similar to the computed GOTO statement of the FORTRAN
language: If the value of the input variable specified by INDEX
is equal to 0, then go to ADDR0; otherwise, go to ADDR1. The
second one performs the output operation followed by an
unconditional GOTO operation.
Example 4.1: Consider the MTBDD shown in Fig.4. The
following code evaluates the MTBDD:
N0:B_Branch(N2,N1), X1
N1:B_Branch(N2,T4), X2
N2:B_Branch(N3,N4), X3
N3:B_Branch(T0,T1), X4
N4:B_Branch(T2,T3), X4
T0:Output 0, and GOTO N0
T1:Output 9, and GOTO N0
T2:Output 10, and GOTO N0
T3:Output 11, and GOTO N0
T4:Output 15, and GOTO N0
In this example, DATA in Output DATA is the decimal
equivalent of the function output values expressed in binary
as (f3, f2, f1, f0). (End of Example)
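The behavior of the code in Example 4.1 can be reproduced with a small software interpreter. The sketch below is an illustration only: the tuple encoding of instructions and the function name `run_once` are assumptions, and the interpreter stops after one output instead of continuing at N0.

```python
# Program of Example 4.1: label -> instruction.
# ("branch", (ADDR0, ADDR1), INDEX) or ("output", DATA, next address).
PROGRAM = {
    "N0": ("branch", ("N2", "N1"), 1),
    "N1": ("branch", ("N2", "T4"), 2),
    "N2": ("branch", ("N3", "N4"), 3),
    "N3": ("branch", ("T0", "T1"), 4),
    "N4": ("branch", ("T2", "T3"), 4),
    "T0": ("output", 0, "N0"),
    "T1": ("output", 9, "N0"),
    "T2": ("output", 10, "N0"),
    "T3": ("output", 11, "N0"),
    "T4": ("output", 15, "N0"),
}

def run_once(program, x, start="N0"):
    """Evaluate one input vector x = (x1, x2, x3, x4).

    Returns the output DATA and the number of executed instructions."""
    pc, executed = start, 0
    while True:
        instr = program[pc]
        executed += 1
        if instr[0] == "branch":
            _, addrs, index = instr
            pc = addrs[x[index - 1]]   # x_index selects ADDR0 or ADDR1
        else:
            return instr[1], executed

data, executed = run_once(PROGRAM, (1, 1, 0, 0))
print(data, format(data, "04b"), executed)  # 15 1111 3
```

The binary string printed for DATA corresponds to the outputs (f3, f2, f1, f0).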
Figure 7 shows the architecture of the 2-address BDD machine,
where only the circuit for the branching operation is
shown. The first field, COM, of the branching instruction
specifies the branch command. The second field, INDEX,
specifies the index i of the input variable $x_i$. It determines
which variable to select. The input selector in Fig. 7 produces
the value of the variable $x_i$, which selects the next branch
address. When $x_i = 0$, ADDR0 is selected. Otherwise,
ADDR1 is selected. The selected address is then loaded
into the program counter (PC). In this way, the next address
is specified. To reduce the width of the instruction words,
1-address BDD machines shown in Fig. 8 have been developed
[6],[11],[18],[33]. In this case, when the value of the input
variable specified by INDEX is 1, the machine works similarly
to the 2-address BDD machine. Otherwise, the content of the
program counter (PC) is incremented by one, to access the next
address. In this case, the size of the instruction word is
reduced, but unconditional GOTO instructions are necessary,
as shown later.

Fig. 7  2-address BDD machine.
Fig. 8  1-address BDD machine.
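The effect of the single address field can be seen by re-coding the MTBDD of Example 4.1 for a 1-address machine. The layout below is a hand-made sketch, not code from the paper; the instruction encoding and the name `run` are assumptions, and the "GOTO N0" part of the output instructions is omitted because the interpreter stops after one output.

```python
# Hypothetical 1-address code for the MTBDD of Example 4.1.
# ("branch", ADDR, INDEX): jump to ADDR if x_INDEX is 1, otherwise PC + 1.
CODE = [
    ("N0", ("branch", "N1", 1)),
    ("N2", ("branch", "N4", 3)),
    ("N3", ("branch", "T1", 4)),
    ("T0", ("output", 0)),
    ("N1", ("branch", "T4", 2)),
    (None, ("goto", "N2")),      # N2 cannot follow N1 directly: extra GOTO
    ("N4", ("branch", "T3", 4)),
    ("T2", ("output", 10)),
    ("T1", ("output", 9)),
    ("T3", ("output", 11)),
    ("T4", ("output", 15)),
]
LABELS = {label: addr for addr, (label, _) in enumerate(CODE) if label}

def run(x):
    """Evaluate one input vector x = (x1, x2, x3, x4)."""
    pc = LABELS["N0"]
    while True:
        op = CODE[pc][1]
        if op[0] == "branch":
            pc = LABELS[op[1]] if x[op[2] - 1] else pc + 1
        elif op[0] == "goto":
            pc = LABELS[op[1]]
        else:
            return op[1]

print(len(CODE))  # 11 instructions: 5 branches, 1 GOTO, 5 outputs
```

In this layout, node N2 is reached by fall-through from N0, so the 0-edge of N1 to N2 has to be realized by an unconditional GOTO.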
4.2 4-Address QDD Machine
By simultaneously evaluating two binary variables and by
increasing the number of branch addresses to four, we have
a branch instruction for a 4-address QDD machine. Since
it evaluates two binary variables at a time, it can reduce the
evaluation time to half that of the 2-address BDD machine.

Fig. 9  4-address QDD machine.
Fig. 10  Branch instruction for 4-address QDD machine.
Fig. 11  Output instruction for a QDD machine.
A branching program for 4-address QDD machines
consists of two kinds of instructions:
Q_Branch(ADDR0,ADDR1,ADDR2,ADDR3),INDEX
Output DATA, and GOTO ADDR
Figure 10 shows the format for the branch instruction. Fig-
ure 9 shows the architecture of the 4-address QDD ma-
chine, where only the circuit for the branching operation is
shown. The first field of the branching instruction specifies
the branch command. The second field, INDEX, specifies
the index i of the input variable Xi. It determines which
variables to select. In the case of a QDD, two consecutive
binary variables are selected at a time. The input selector
shown in Fig. 9 produces $X_i$. The upper multiplexer selects
the variable. When $X_i = (0,0)$, ADDR0 is selected; when
$X_i = (0,1)$, ADDR1 is selected; when $X_i = (1,0)$, ADDR2 is
selected; and when $X_i = (1,1)$, ADDR3 is selected. The selected
address is then loaded into the program counter (PC).
In this way, the next address is specified as a function of INDEX
i and the input variable $X_i$. Note that this instruction
requires a rather long word, which would be expensive for
embedded applications.
Figure 11 shows the format for the output instruction.
The left field specifies the instruction type: Output. The
middle field contains the address to which this program
should jump. The right field is the output value, as shown at
the bottom of the QDD.
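A software model of the 4-address branch may help to see how two binary variables select one of four addresses. The program below is a hypothetical example for f = x1x2 ∨ (x3 ⊕ x4) with X1 = (x1, x2) and X2 = (x3, x4); the encoding and the name `run_once` are assumptions made for this sketch.

```python
from itertools import product

# ("Q_Branch", (ADDR0, ADDR1, ADDR2, ADDR3), INDEX) or ("Output", DATA, next).
PROGRAM = {
    "N0": ("Q_Branch", ("N1", "N1", "N1", "T1"), 1),
    "N1": ("Q_Branch", ("T0", "T1", "T1", "T0"), 2),
    "T0": ("Output", 0, "N0"),
    "T1": ("Output", 1, "N0"),
}

def run_once(program, bits, start="N0"):
    """Evaluate one vector of binary inputs (x1, x2, x3, x4).

    Returns the output DATA and the number of executed instructions."""
    pc, executed = start, 0
    while True:
        instr = program[pc]
        executed += 1
        if instr[0] == "Q_Branch":
            _, addrs, index = instr
            hi, lo = bits[2 * index - 2], bits[2 * index - 1]
            pc = addrs[2 * hi + lo]    # (0,0)->ADDR0, ..., (1,1)->ADDR3
        else:
            return instr[1], executed

# Exhaustive check against the logic expression.
for x1, x2, x3, x4 in product((0, 1), repeat=4):
    value, executed = run_once(PROGRAM, (x1, x2, x3, x4))
    assert value == (x1 & x2) | (x3 ^ x4) and executed <= 3
```

Each evaluation executes at most two branch instructions and one output instruction, compared with up to four branches for a BDD of the same function.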
Fig. 12  Branch instruction for a 3-address QDD machine.
Fig. 13  3-address QDD machine.
4.3 3-Address QDD Machine
Since the 4-address QDD instruction requires a long word,
we developed a 3-address QDD machine. The branch in-
struction for the 3-address QDD machine contains only
three address fields. For example, consider the instruction
shown in Fig.12. This instruction is symbolically denoted
by
Q_Branch(+1,ADDR1,ADDR2,ADDR3),INDEX.
In this instruction, ADDR1, ADDR2, and ADDR3 are spec-
ified, but ADDR0 is missing. ADDR0 is replaced by “+1”,
which corresponds to the next address of the current instruc-
tion. This instruction performs the following operations:
• Let $X_i$ be the input variable selected by INDEX, and let
v ∈ {0, 1, 2, 3} be its value. If v = 0, then go to
the next address of the current instruction; otherwise, go to
ADDRv.
Lemma 4.1: An arbitrary QDD can be evaluated by a pro-
gram consisting of the following instructions:
Q_Branch(+1,ADDR1,ADDR2,ADDR3),INDEX
GOTO ADDR
Output DATA, and GOTO ADDR
For example, the instruction for the 4-address QDD machine
Q_Branch(ADDR0,ADDR1,ADDR2,ADDR3),INDEX
can be simulated by the pair of instructions:
Q_Branch(+1,ADDR1,ADDR2,ADDR3),INDEX
GOTO ADDR0
Note that the last instruction is an unconditional GOTO
statement. As shown in the next section, the number of
unconditional GOTO statements can be minimized by an
optimization algorithm. Figure 13 shows the architecture of the
3-address QDD machine, where only the circuit for branching
operations is shown. Consider the instruction in Fig. 12.
When the value of the input variable specified by INDEX is
nonzero, the machine works like the 4-address QDD machine. When
the value of the input variable specified by INDEX is equal to 0,
the program counter (PC) is incremented by one, to access
the next address.

Fig. 14  Four types of branch instructions for 3-address QDD machine.
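The simulation used in Lemma 4.1 can be sketched as a naive translation from 4-address code to 3-address code, followed by an interpreter for the result. The input program is a hypothetical example (f = x1x2 ∨ (x3 ⊕ x4) with X1 = (x1, x2), X2 = (x3, x4)), and the names `lower` and `run3` are assumptions made for this sketch.

```python
# Labeled 4-address code: (label, instruction).
PROGRAM4 = [
    ("N0", ("branch", ("N1", "N1", "N1", "T1"), 1)),
    ("N1", ("branch", ("T0", "T1", "T1", "T0"), 2)),
    ("T0", ("output", 0, "N0")),
    ("T1", ("output", 1, "N0")),
]

def lower(program4):
    """Replace every Q_Branch(a0,a1,a2,a3) by Q_Branch(+1,a1,a2,a3); GOTO a0."""
    code, labels = [], {}
    for label, instr in program4:
        labels[label] = len(code)
        if instr[0] == "branch":
            _, (a0, a1, a2, a3), index = instr
            code.append(("branch3", (a1, a2, a3), index))
            code.append(("goto", a0))
        else:
            code.append(instr)
    return code, labels

def run3(code, labels, X, start="N0"):
    """Evaluate one vector X of 4-valued inputs; X[i-1] is the value of X_i."""
    pc, executed = labels[start], 0
    while True:
        instr = code[pc]
        executed += 1
        if instr[0] == "branch3":
            v = X[instr[2] - 1]
            pc = pc + 1 if v == 0 else labels[instr[1][v - 1]]
        elif instr[0] == "goto":
            pc = labels[instr[1]]
        else:
            return instr[1], executed

code, labels = lower(PROGRAM4)
print(len(code))  # 6 instructions: 2 branches, 2 GOTOs, 2 outputs
```

This translation inserts one GOTO per non-terminal node, which is the worst case; a better instruction order lets the 0-edge target sit at the next address so that the GOTO can be dropped.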
In our hardware implementation, we use the four types
of branch instructions shown in Fig. 14. To distinguish four
branch instructions, we use two additional bits in the instruction
field. However, as shown in the experimental results, by
using four branch instructions, we can reduce the number of
instructions and the total bit size. So, the cost of these extra
bits is fully compensated.
5. Optimization of Codes for QDD Machines
In this section, we consider a method to reduce the num-
ber of instructions for QDD machines. Interestingly, this is
solved by minimizing the number of unconditional GOTO
statements.
Definition 5.1: Given the QDD and an order of the input
variables (e.g., $x_1, x_2, \ldots, x_n$), the code size CSIZE is the
number of instructions needed to compute the decision diagram
on a given machine. Let 4aQDDM denote a 4-address
QDD machine, and let 3aQDDM denote a 3-address QDD
machine.
Lemma 5.2: Let $N_N$ be the number of non-terminal nodes,
and let $N_T$ be the number of terminal nodes in a QDD. We
have the following relation:
$\mathrm{CSIZE}(\mathrm{4aQDDM}) = N_N + N_T$.  (1)
(Proof) In a 4-address QDD machine, a non-terminal node
is represented by a branch instruction, and a terminal node
is represented by an output instruction. (Q.E.D.)
Lemma 5.3: Let $N_N$ be the number of non-terminal nodes
and let $N_T$ be the number of terminal nodes in a QDD. Let
$N_U$ be the number of unconditional GOTO statements that
are not part of output statements. Then, we have the following
relations:
$\mathrm{CSIZE}(\mathrm{3aQDDM}) = N_U + N_N + N_T$  (2)
$0 \le N_U \le N_N$  (3)
(Proof) In a 3-address QDD machine, a non-terminal node
is represented by either a branch instruction or a pair consisting
of a branch instruction and an unconditional GOTO
statement. Also, a terminal node is represented by an output
instruction. Thus, the number of unconditional GOTO
statements is at most the number of non-terminal nodes.
(Q.E.D.)

Fig. 15  QDD for example function.
In the case of a 4-address QDD machine, there is no
code optimization problem, i.e., the instructions can be gen-
erated in any order. However, in the case of a 3-address
QDD machine, the length of the program depends on the
order of instructions.
Example 5.2: Consider the QDD shown in Fig.15. It has
five non-terminal nodes, and four terminal nodes. When the
code is generated in breadth-first order, i.e., in the order of
X1, X2, and X3, we have the following:
/** Code with Unconditional GOTO **/
N0:Q_Branch(+1,N1,N1,N1),X1
Q_Branch(+1,N3,N3,N3),X2
GOTO N2
N1:Q_Branch(+1,T3,T3,T3),X2
GOTO N3
N2:Q_Branch(+1,T1,T1,T1),X3
GOTO T0
N3:Q_Branch(+1,T2,T2,T2),X3
GOTO T1
T0:Output 0, and GOTO N0
T1:Output 1, and GOTO N0
T2:Output 2, and GOTO N0
T3:Output 3, and GOTO N0
Note that the above program has four unconditional GOTO
statements that are not part of output statements. However,
when the code is generated in depth-first order, it has no
unconditional GOTO statements that are not part of output
statements:
/** Code without Unconditional GOTO **/
N0:Q_Branch(+1,N1,N1,N1),X1
Q_Branch(+1,N3,N3,N3),X2
Q_Branch(+1,T1,T1,T1),X3
T0:Output 0, and GOTO N0
N1:Q_Branch(+1,T3,T3,T3),X2
N3:Q_Branch(+1,T2,T2,T2),X3
T1:Output 1, and GOTO N0
T2:Output 2, and GOTO N0
T3:Output 3, and GOTO N0
Note that the first four instructions correspond to the leftmost
path from the root node to the terminal node T0. The
next three instructions correspond to the path through node N1,
node N3, and terminal node T1. (End of Example)
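The two listings of Example 5.2 can be compared by running them on a small software model of the 3-address machine. The tuple encoding and the function `run` below are assumptions made for this sketch; the instruction sequences themselves are transcribed from the example, with the "GOTO N0" part of the output instructions omitted because the interpreter stops after one output.

```python
from itertools import product

# (label, "branch", (ADDR1, ADDR2, ADDR3), INDEX), (label, "goto", ADDR),
# or (label, "output", DATA). ADDR0 is always "+1" (the next address).
BREADTH_FIRST = [
    ("N0", "branch", ("N1", "N1", "N1"), 1),
    (None, "branch", ("N3", "N3", "N3"), 2),
    (None, "goto", "N2"),
    ("N1", "branch", ("T3", "T3", "T3"), 2),
    (None, "goto", "N3"),
    ("N2", "branch", ("T1", "T1", "T1"), 3),
    (None, "goto", "T0"),
    ("N3", "branch", ("T2", "T2", "T2"), 3),
    (None, "goto", "T1"),
    ("T0", "output", 0),
    ("T1", "output", 1),
    ("T2", "output", 2),
    ("T3", "output", 3),
]
DEPTH_FIRST = [
    ("N0", "branch", ("N1", "N1", "N1"), 1),
    (None, "branch", ("N3", "N3", "N3"), 2),
    (None, "branch", ("T1", "T1", "T1"), 3),
    ("T0", "output", 0),
    ("N1", "branch", ("T3", "T3", "T3"), 2),
    ("N3", "branch", ("T2", "T2", "T2"), 3),
    ("T1", "output", 1),
    ("T2", "output", 2),
    ("T3", "output", 3),
]

def run(code, X):
    """Evaluate X = (X1, X2, X3), each in 0..3; return (DATA, executed)."""
    labels = {ins[0]: addr for addr, ins in enumerate(code) if ins[0]}
    pc, executed = labels["N0"], 0
    while True:
        ins = code[pc]
        executed += 1
        if ins[1] == "branch":
            v = X[ins[3] - 1]
            pc = pc + 1 if v == 0 else labels[ins[2][v - 1]]
        elif ins[1] == "goto":
            pc = labels[ins[2]]
        else:
            return ins[2], executed

# Both orders compute the same function; depth-first is never slower.
for X in product(range(4), repeat=3):
    a, steps_a = run(BREADTH_FIRST, X)
    b, steps_b = run(DEPTH_FIRST, X)
    assert a == b and steps_b <= steps_a

print(len(BREADTH_FIRST), len(DEPTH_FIRST))  # 13 9
```

The code sizes agree with relation (2): with five non-terminal and four terminal nodes, the breadth-first code has 4 + 5 + 4 = 13 instructions and the depth-first code has 0 + 5 + 4 = 9.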
The code optimization problem for a 3-address QDD ma-
chine can be reduced to a graph covering problem as fol-
lows:
Definition 5.2: A path cover of a QDD is a set of paths
such that every node in the QDD belongs to exactly one
path. A minimal path cover is a path cover with the fewest
paths. A path in a QDD can consist of just one node.
Theorem 5.1: An optimal code for a 3-address QDD ma-
chine corresponds to a minimal disjoint path cover of the
QDD.
(Proof) A path in a QDD corresponds to a sequence of
Q_Branch instructions followed by an output instruction. A
sequence of Q_Branch instructions without an output
instruction requires an unconditional GOTO statement. By
Lemma 5.3, minimization of the number of unconditional
GOTO statements minimizes the code size. (Q.E.D.)
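One standard way to compute a minimum disjoint path cover of a directed acyclic graph is maximum bipartite matching: each matched edge places a node directly after its predecessor, and every non-terminal node left without a matched outgoing edge needs a GOTO. The sketch below uses this textbook technique; it is not the algorithm developed by the authors, and it assumes that any outgoing edge of a node may serve as the "+1" edge. The names `plan_layout`, `succ`, and the node name "A" (the unlabeled node of Example 5.2) are hypothetical.

```python
def plan_layout(succ, terminals):
    """succ: non-terminal node -> list of distinct successor nodes.

    Returns (number of extra GOTOs, instruction order)."""
    match_of = {}  # node -> the predecessor that falls through into it

    def try_assign(u, seen):
        # Augmenting-path step of a simple bipartite matching algorithm.
        for v in succ[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in match_of or try_assign(match_of[v], seen):
                match_of[v] = u
                return True
        return False

    matched = sum(try_assign(u, set()) for u in succ)

    # Chain the matched edges into paths, starting from unmatched heads.
    next_of = {u: v for v, u in match_of.items()}
    order = []
    for head in list(succ) + list(terminals):
        if head in match_of:
            continue
        node = head
        while node is not None:
            order.append(node)
            node = next_of.get(node)
    return len(succ) - matched, order

# QDD of Example 5.2, read off from the listed code.
succ = {
    "N0": ["A", "N1"],
    "A": ["N2", "N3"],
    "N1": ["N3", "T3"],
    "N2": ["T0", "T1"],
    "N3": ["T1", "T2"],
}
gotos, order = plan_layout(succ, ["T0", "T1", "T2", "T3"])
print(gotos, order)
```

For this QDD the sketch finds a cover with no extra GOTO, and the resulting order coincides with the depth-first listing of Example 5.2.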
6. Experiment and Observation
6.1 Benchmark Results
To see the effectiveness of QDDs over BDDs, and the effectiveness
of the code optimization, we realized certain benchmark
functions by BDDs and QDDs. First, we compare
QDDs and BDDs with respect to the numbers of nodes.
Then, we convert these into code for BDD and QDD machines,
and compare the numbers of instructions.

Table 1  Number of nodes and code sizes for BDD machine and QDD machine
(excerpt: Aver. Inst. and Ratio columns).

Func. Name   Aver. Inst. (BDD)   Aver. Inst. (QDD)   Ratio
C432         19.10               12.73               1.50
amd           5.63                3.47               1.62
apex2         6.66                4.99               1.33
apex4         8.24                4.61               1.79
chkn          7.01                5.16               1.36
duke2         6.36                4.09               1.55
gary          5.51                3.42               1.61
in0           5.02                2.92               1.72
in1           6.85                4.70               1.46
in2           3.98                2.60               1.53
in3           6.63                4.77               1.39
in4           4.69                3.44               1.36
in5           8.54                6.57               1.30
in6           7.51                5.88               1.28
in7           7.58                5.84               1.30
m181          6.80                4.71               1.44
misex2        4.97                3.60               1.38
misex3        7.55                4.05               1.86
misj         14.12                9.57               1.47
mlp6         12.10                5.98               2.02
risc          4.42                2.55               1.74
signet       18.23               13.31               1.37
tial         12.05                6.37               1.89
vg2           7.65                5.62               1.36
x1dn          9.55                5.74               1.66
x6dn          4.14                2.74               1.52
x9dn          9.30                5.80               1.60
Table 1 shows the experimental results. Func. name denotes
the name of the benchmark functions; # Inp. denotes
the number of input variables; # Out. denotes the number
of outputs; BDD Nodes denotes the number of nodes of the
MTBDD including both terminal and non-terminal nodes;
Opt. Codes under BDD denotes the number of instructions
of the optimized code for the 1-address BDD machine (near
optimal solution); Term. Nodes denotes the number of terminal
nodes; Aver. Inst. under BDD denotes the average number
of instructions to evaluate an input vector by a 1-address
BDD machine; QDD Nodes denotes the number of nodes
of the MTQDD including both terminal and non-terminal
nodes, that is the same as the number of instructions for a
4-address QDD machine; X = 00 Codes under QDD denotes
the number of instructions in the code for the 3-address QDD
machine, when only the first type of instruction in Fig. 14 is
used; Opt. Codes under QDD denotes the number of instructions
of the optimized code for the 3-address QDD machine,
when all four types of instructions in Fig. 14 are used to
minimize the number of GOTO statements; X = 00 GOTO denotes
the number of GOTO statements, when only one type
of branching instruction is used; Opt. GOTO (= Opt. Codes −
QDD Nodes) under QDD denotes the number of GOTO
statements, when four types of branching instructions are used;
Aver. Inst. under QDD denotes the average number of instructions
to evaluate an input vector by a 3-address QDD machine;
and Ratio denotes the value: (Aver. Inst. in 1-address
BDD machine)/(Aver. Inst. in 3-address QDD machine).
6.2 Detail of the Experiment
Optimization of Decision Diagrams: First, the ordering
that minimizes the size of the MTBDD is obtained. Then,
the input variables are partitioned into groups of two vari-
ables in the natural order to obtain the MTQDDs.
Optimization of Code: Theorem 5.1 shows how to mini-
mize the number of instructions by minimizing the number
of GOTO statements. The algorithm given by [16] is only
applicable to the program with nodes whose in-degrees and
out-degrees are both two. So, we developed our own algo-
rithm to obtain near optimal solutions for our more general
case.
6.3 Observations
From the table, we can observe the following:
• The number of nodes in QDDs is smaller than that of
BDDs.
• The number of instructions for the 3-address QDD ma-
chine can be considerably reduced by an optimization
algorithm.
• For C432, in3, misex2, misj, and risc, the number of
GOTO statements in the optimized QDD codes is zero.
This means that optimal code is generated for these
functions. Also, for these functions, optimal code for
BDD machines is generated.
• signet requires many GOTO statements in both BDD
and QDD machines. The number of GOTO statements
for a BDD machine is given by (Opt. Codes) −
(BDD Nodes) = 8671 − 7347 = 1324.
• Opt. Codes, the number of instructions for a 3-address
QDD machine, is often larger than QDD Nodes, the
number of instructions for a 4-address QDD machine.
The column headed by Opt. GOTO (= Opt. Codes −
QDD Nodes) shows the extra GOTOs. Except for a
few functions, the number of extra GOTOs is rather small.
• Consider the value: (Sum of X = 00 Codes) − (Sum
of Opt. Codes) = 28535 − 24528 = 4007. This is
the total number of instructions saved by using
four types of branch instructions instead of only
one type. However, to specify four types of
instructions, we need two additional bits in the
instruction field. Let w be the number of bits in a
word in the 3-address QDD machine, where only one
type of branching instruction is used. Then, the merit
of using four types of instructions is more accurately
expressed as: (Sum of X = 00 Codes) × w −
(Sum of Opt. Codes) × (w + 2) =
28535w − 24528(w + 2) = 4007w − 49056, which is
positive for w ≥ 13. Note that, in most cases, w > 20,
so we can conclude that the use of four types of
Q Branch instructions reduces the total number of bits.
• The last column of the table shows that the 3-address
QDD machine is 1.28 to 2.02 times faster than the
1-address BDD machine. Note that, for MLP6, the ratio
is greater than 2. This is due to GOTO statements. If
we compare the average numbers of instructions in
a 2-address BDD machine and a 4-address QDD machine,
the ratio is at most 2.
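The break-even word width implied by the bit-count expression above can be checked with a few lines of Python. The two constants are the sums quoted from the table; the function name `bits_saved` and the scanned range of w are illustrative assumptions, not part of the original evaluation.

```python
X00_CODES = 28535   # sum of "X = 00 Codes" (one type of branch instruction)
OPT_CODES = 24528   # sum of "Opt. Codes" (four types of branch instructions)
EXTRA_BITS = 2      # additional bits needed to select one of four branch types


def bits_saved(w):
    """Total bits saved by four branch types when the base word has w bits."""
    return X00_CODES * w - OPT_CODES * (w + EXTRA_BITS)


# Smallest word width for which four branch types reduce the total bit count.
break_even = next(w for w in range(1, 65) if bits_saved(w) > 0)

print(break_even)       # 13
print(bits_saved(20))   # 31084
```

Under these figures the saving is positive for any word of 13 bits or more, so the condition w > 20 stated above leaves a comfortable margin.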
6.4 Hardware Implementation
To show the usefulness of multi-core QDD machines,
we have developed a parallel branching program machine
(PBM128) consisting of 128 QDD machines and a pro-
grammable interconnection on Altera’s Stratix II FPGA.
We realized many benchmark functions on the PBM128,
and compared its memory size and computation time with
Intel’s Core2Duo microprocessor. PBM128 requires
approximately one quarter of the memory required by the
Core2Duo, and is 21.4 to 96.1 times faster than the Core2Duo.
Details are shown in [24].
7. Conclusions
In this paper, we first reviewed the trends of VLSI design,
focusing on power dissipation and programmability. Then,
we considered a branching program machine to evaluate
multiple-output logic functions. To increase the speed of
evaluation, we used QDDs instead of BDDs. To reduce
the memory size, we used 3-address QDD machines
instead of 4-address QDD machines. We proposed the use
of four types of branch instructions. Also, we considered
a method to optimize codes for 3-address QDDs. This is
different from existing methods to optimize decision
diagrams. We showed that the minimization of the number of
instructions corresponds to minimizing the number of
unconditional GOTO statements. For various benchmark
functions, we optimized the codes, and showed the effectiveness
of the approach.
Acknowledgements
This research is partly supported by The Japan Society for
the Promotion of Science (JSPS) Grant in Aid for Scientific
Research, and the Knowledge Cluster Initiative (the second
stage) of MEXT (Ministry of Education, Culture, Sports,
Science and Technology), and by the U.S. Air Force/CVAQ
(D. Nussbaum). Discussions with Prof. Shigeki Iwata and
Mr. Hisashi Kajiwara were quite helpful.
References
[1] P. Ashar and S. Malik, “Fast functional simulation using branching
programs,” Proc. International Conference on Computer Aided De-
sign, pp.408–412, Nov. 1995.
[2] B. Beavers, “The story behind the Intel Atom processor success,”
IEEE Des. Test Comput., vol.26, no.2, pp.8–13, March-April 2009.
[3] S. Borkar, “Design challenges of technology scaling,” IEEE Micro,
vol.19, no.4, pp.23–29, July-Aug. 1999.
[4] S. Borkar, “Thousand core chips: A technology perspective,” Proc.
44th annual Design Automation Conference, DAC-2007, pp.746–
749, June 2007.
IEICE TRANS. INF. & SYST., VOL.E93–D, NO.8 AUGUST 2010
[5] S. Borkar, N.P. Jouppi, and P. Stenstrom, “Microprocessors in the
era of terascale integration,” Proc. Conference on Design, Automation
and Test in Europe, (DATE-2007), pp.237–242, April 2007.
[6] R.T. Boute, “The binary-decision machine as programmable con-
troller,” Euromicro Newsletter, vol.1, no.2, pp.16–22, 1976.
[7] P.C. Baracos, R.D. Hudson, L.J. Vroomen, and P.J.A. Zsombor-
Murray, “Advances in binary decision based programmable con-
trollers,” IEEE Trans. Ind. Electron., vol.35, no.3, pp.417–425, Aug.
1988.
[8] J.T. Butler, T. Sasao, and M. Matsuura, “Average path length
of binary decision diagrams,” IEEE Trans. Comput., vol.54, no.9,
pp.1041–1053, Sept. 2005.
[9] R.K. Brayton, “The future of logic synthesis and verification,” in
S. Hassoun and T. Sasao eds., Logic Synthesis and Verification, Kluwer
Academic Publishers, 2002.
[10] C.H. Clare, Designing Logic Systems Using State Machines,
McGraw-Hill, New York, 1973.
[11] M. Davio, J.-P. Deschamps, and A. Thayse, Digital Systems with
Algorithm Implementation, p.368, John Wiley & Sons, New York,
1983.
[12] D. Green, Modern Logic Design, Addison-Wesley Publishing Com-
pany, 1986.
[13] Y. Iguchi, T. Sasao, and M. Matsuura, “Implementation of multiple-
output functions using PROMDDs,” 30th International Symposium
on Multiple-Valued Logic, pp.199–205, Portland, Oregon, U.S.A.,
May 2000.
[14] Y. Iguchi, T. Sasao, M. Matsuura, and A. Iseno, “A hardware simula-
tion engine based on decision diagrams,” ASP-DAC 2000, (Asia and
South Pacific Design Automation Conference 2000), Yokohama,
Japan, Jan. 2000.
[15] Y. Iguchi, T. Sasao, and M. Matsuura, “Evaluation of multiple-output
logic functions using decision diagrams,” ASP-DAC 2003,
(Asia and South Pacific Design Automation Conference 2003),
pp.312–315, Kitakyushu, Jan. 2003.
[16] S. Iwata, “Programs with minimal goto statements,” Information and
Control, vol.37, no.1, pp.105–114, 1978.
[17] T. Kam, T. Villa, R. Brayton, and A. Sangiovanni, “Multi-valued
decision diagrams: Theory and applications,” J. Multiple-Valued
Logic, vol.4, no.1-2, pp.9–62, 1998.
[18] D. Mange, “A high-level-language programmable controller,” IEEE
Micro, vol.6, no.1, pp.25–41 (Part I), Feb./March, 1986, vol.6, no.2,
pp.47–63 (Part II), March/April, 1986.
[19] S. Minato, N. Ishiura, and S. Yajima, “Shared binary decision dia-
gram with attributed edges for efficient Boolean function manipula-
tion,” Proc. 27th ACM/IEEE Design Automation Conf., pp.52–57,
June 1990.
[20] P.C. McGeer, K.L. McMillan, A. Saldanha, A.L. Sangiovanni-
Vincentelli, and P. Scaglia, “Fast discrete function evaluations using
decision diagrams,” International Conf. on Computer Aided Design,
pp.402–407, Nov. 1995.
[21] S.M. Moon and S.D. Carson, “Generalized multiway branch unit for
VLIW microprocessors,” IEEE Trans. Parallel Distrib. Syst., vol.6,
no.8, pp.850–862, Aug. 1995.
[22] R. Murgai, F. Hirose, and M. Fujita, “Logic synthesis for a single
large look-up table,” Proc. International Conference on Computer
Design, pp.415–424, Oct. 1995.
[23] H. Nakahara and T. Sasao, “A PC-based logic simulator using
a look-up table cascade emulator,” IEICE Trans. Fundamentals,
vol.E89-A, no.12, pp.3471–3481, Dec. 2006.
[24] H. Nakahara, T. Sasao, K. Matsuura, and Y. Kawamura, “Emula-
tion of sequential circuits by a parallel branching program machine,”
5th International Workshop on Applied Reconfigurable Computing
(ARC2009), Karlsruhe, Germany, March 2009, Lect. Notes Com-
put. Sci., LNCS5443, pp.261–267, March 2009.
[25] S. Nagayama, T. Sasao, Y. Iguchi, and M. Matsuura, “Area-time
complexities of multi-valued decision diagrams,” IEICE Trans. Fun-
damentals, vol.E87-A, no.5, pp.1020–1028, May 2004.
[26] S. Nagayama and T. Sasao, “On the optimization of heterogeneous
MDDs,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst.,
vol.24, no.11, pp.1645–1659, Nov. 2005.
[27] R. Rudell, “Dynamic variable ordering for ordered binary decision
diagrams,” ICCAD-93, pp.42–47, 1993.
[28] T. Sasao and J.T. Butler, “A method to represent multiple-output
switching functions by using multi-valued decision diagrams,” IEEE
International Symposium on Multiple-Valued Logic, Santiago de
Compostela, Spain, pp.248–254, May 1996.
[29] T. Sasao, Switching Theory for Logic Synthesis, Kluwer Academic
Publishers, 1999.
[30] T. Sasao, H. Nakahara, M. Matsuura, Y. Kawamura, and J.T. Butler,
“A quaternary decision diagram machine and the optimization of
its code,” 39th International Symposium on Multiple-Valued Logic
(ISMVL 2009), pp.362–369, May 2009.
[31] C. Scholl, R. Drechsler, and B. Becker, “Functional simulation using
binary decision diagrams,” ICCAD’97, pp.8–12, Nov. 1997.
[32] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A
Systems Perspective, Addison-Wesley, 1980.
[33] P.J.A. Zsombor-Murray, L.J. Vroomen, R.D. Hudson, Le-Ngoc Tho,
and P.H. Holck, “Binary-decision-based programmable controllers,
Part I-III,” IEEE Micro, vol.3, no.4, pp.67–83 (Part I), July-Aug.
1983, vol.3, no.5, pp.16–26 (Part II), Oct. 1983, vol.3, no.6,
pp.24–39 (Part III), Nov.-Dec. 1983.
Tsutomu Sasao received the B.E., M.E.,
and Ph.D. degrees in Electronics Engineering
from Osaka University, Osaka, Japan, in 1972,
1974, and 1977, respectively. He has held
faculty/research positions at Osaka University,
Japan, IBM T. J. Watson Research Center,
Yorktown Heights, NY, and the Naval Postgraduate
School, Monterey, CA. He has served as the
Director of the Center for Microelectronic Systems
at the Kyushu Institute of Technology, Iizuka,
Japan. Now, he is a Professor in the Department
of Computer Science and Electronics. His research areas include
logic design and switching theory, representations of logic functions, and
multiple-valued logic. He has published more than 8 books on logic design,
including Logic Synthesis and Optimization, Representation of Discrete
Functions, Switching Theory for Logic Synthesis, and Logic Synthesis
and Verification, Kluwer Academic Publishers, 1993, 1996, 1999, and
2001, respectively. He has served as Program Chairman for the IEEE
International Symposium on Multiple-Valued Logic (ISMVL) many times.
Also, he was the Symposium Chairman of the 28th ISMVL held in Fukuoka,
Japan, in 1998. He received the NIWA Memorial Award in 1979, the
Takeda Techno-Entrepreneurship Award in 2001, and Distinctive Contribution
Awards from the IEEE Computer Society MVL-TC for papers presented at
ISMVLs in 1986, 1996, 2003 and 2004. He has served as an associate editor
of the IEEE Transactions on Computers. He is a Fellow of the IEEE.
Hiroki Nakahara received the B.E.,
M.E., and Ph.D. degrees in computer science
from Kyushu Institute of Technology, Fukuoka,
Japan, in 2003, 2005, and 2007, respectively.
He has a research position at Kyushu Institute
of Technology, Iizuka, Japan. His research interests
include logic synthesis, reconfigurable
architecture, embedded systems, and high-level
synthesis. He is a member of the IEEE.
Munehiro Matsuura was born in 1965
in Kitakyushu City, Japan. He studied at the
Kyushu Institute of Technology from 1983 to
1989. He received the B.E. degree in Natural
Sciences from the University of the Air, in
Japan, 2003. He has been working as a Technical
Assistant at the Kyushu Institute of Technology
since 1991. He has implemented several
logic design algorithms under the direction of
Professor Tsutomu Sasao. His interests include
decision diagrams and exclusive-OR based circuit
design.
Yoshifumi Kawamura graduated from
Electrical Engineering of Miyagi Technical College.
In 1981, he entered the Semiconductor Division
of Hitachi, Ltd., and engaged in the development
of telecommunication devices. In
1993, he transferred to Graphics Communication
Laboratories, and engaged in the research and
development of MPEG2 System for four years.
In 1997, he transferred to the Semiconductor
Div., Hitachi, Ltd., and engaged in the development
of LSI for GSM cell phones. In 2003, he transferred
to the Corporate Strategy Planning Division, Renesas Technology
Corporation. Now, he is a senior engineer of the System Core Technology
Division, where he is involved in the development of programmable and
reconfigurable technology.
Jon T. Butler received the BEE and
MEngr degrees from Rensselaer Polytechnic Institute,
Troy, New York, in 1966 and 1967, respectively.
He received the PhD degree from
The Ohio State University, Columbus, Ohio, in
1973. Since 1987, he has been a professor at
the Naval Postgraduate School, Monterey, California.
From 1974 to 1987, he was at Northwestern
University, Evanston, Illinois. During
that time he served two periods of leave at the
Naval Postgraduate School, first as a National
Research Council Senior Postdoctoral Associate (1980-1981) and second
as the NAVALEX Chair Professor (1986-1987). He served one period of
leave as a foreign visiting professor at the Kyushu Institute of Technology,
Iizuka, Japan. His research interests include logic optimization,
multiple-valued logic, and reconfigurable computing. He has served on the editorial
boards of the IEEE Transactions on Computers, Computer, and the IEEE
Computer Society Press. He has served as the editor-in-chief of Computer
and the IEEE Computer Society Press. He received the Award of Excellence,
the Outstanding Contributed Paper Award, and a Distinctive Contributed
Paper Award for papers presented at the International Symposium
on Multiple-Valued Logic. He received the Distinguished Service Award,
two Meritorious Awards, and nine Certificates of Appreciation for service
to the IEEE Computer Society. He is a fellow of the IEEE.