A Parallel Branching Program Machine for Emulation of Sequential Circuits.
ABSTRACT The parallel branching program machine (PBM128) consists of 128 branching program machines (BMs) and a programmable interconnection.
To represent logic functions on BMs, we use quaternary decision diagrams. To evaluate functions, we use 3-address quaternary
branch instructions. We emulated many benchmark circuits on the PBM128, and compared its memory size and computation time with
those of Intel's Core2Duo microprocessor. The PBM128 requires approximately a quarter of the memory of the Core2Duo, and is 21.4-96.1
times faster than the Core2Duo.
Hiroki Nakahara1, Tsutomu Sasao1, Munehiro Matsuura1, and Yoshifumi Kawamura2
1Kyushu Institute of Technology, Japan
2Renesas Technology Corp., Japan
1 Introduction
A Branching Program Machine (BM) is a special-purpose processor that evaluates
binary decision diagrams (BDDs) [3,2,14]. The BM uses only two kinds of instructions:
branch instructions and output instructions. Thus, the architecture of the BM is much simpler than
that of a general-purpose microprocessor (MPU). Since the BM uses dedicated
instructions to evaluate BDDs, it is faster than the MPU. In fact, for control applications,
the BM is much faster than the MPU. The applications of BMs include sequencers
[3,14], logic simulators [11,1], and networks (e.g., packet classification).
In this paper, we show the parallel branching program machine (PBM128) that consists of 128
BMs and a programmable interconnection. To reduce computation time and memory
size, we use special instructions that evaluate two consecutive nodes at a time.
2 Branching Program Machine to Emulate Sequential Circuits
We show the branching program machine (BM) that emulates the sequential circuit
shown in Fig. 1. First, the combinational circuit is represented by a decision diagram.
Next, it is translated into the codes of the BM. Finally, the BM executes those codes. To
emulate the sequential circuit, the BM uses registers that store state variables. We assume
that the BM uses 32-bit instructions, which match the data structure of embedded
systems and the embedded memory of FPGAs.
In this paper, we use standard terminology for reduced ordered binary decision diagrams
(BDDs) and reduced ordered multi-valued decision diagrams (MDDs).
J. Becker et al. (Eds.): ARC 2009, LNCS 5453, pp. 261–267, 2009.
© Springer-Verlag Berlin Heidelberg 2009
Fig. 1. Model for a Sequential Circuit
Fig. 2. Mnemonics and Internal Representations
Fig. 3. Example of MTBDD
Fig. 4. MTQDD derived from MTBDD in Fig. 3
An MTBDD (Multi-Terminal Binary Decision Diagram) can evaluate many
outputs at a time. Evaluation of an MTBDD requires at most n table look-ups, where n
is the number of input variables. The APL (average path length) of a BDD denotes the
average number of nodes traversed in the BDD. Evaluation time for a BDD is proportional
to the APL. To further speed up the evaluation, an MTMDD(k) (Multi-Terminal
Multi-valued Decision Diagram) is used. In the MTMDD(k), k variables are grouped to
form a 2^k-valued super variable. Note that a BDD is equivalent to an MDD(1). For many
benchmark functions, in logic evaluation, with regard to the area-time complexity,
MDD(2)s are more suitable than BDDs. Since each node of an MDD(2) has 4 branches, it is
called a QDD (Quaternary Decision Diagram). In this paper, we use an MTQDD
(Multi-Terminal QDD).
Example 2.1 Fig. 3 shows an example of an MTBDD. Fig. 4 shows the MTQDD that is
derived from the MTBDD in Fig. 3. (End of Example)
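The grouping of k binary variables into one 2^k-valued super variable can be illustrated with a short sketch. The function name `to_super_variables` and the convention that the first variable of each group is the most significant bit are assumptions made for illustration; they are not taken from the paper.

```python
def to_super_variables(bits, k=2):
    """Group consecutive binary inputs into 2**k-valued super variables.

    Assumption: the first variable of each group is the most significant bit.
    """
    if len(bits) % k != 0:
        raise ValueError("number of inputs must be a multiple of k")
    values = []
    for start in range(0, len(bits), k):
        value = 0
        for bit in bits[start:start + k]:
            value = (value << 1) | bit  # shift in the next variable of the group
        values.append(value)
    return values


# Four binary inputs x0..x3 become two quaternary inputs X0 = (x0, x1), X1 = (x2, x3).
x = [1, 0, 1, 1]
X = to_super_variables(x, k=2)
print(X)
```

With k = 2, a path through the diagram tests at most n/2 super variables instead of n binary variables, which is the source of the speed-up discussed above.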
2.2 Instructions to Evaluate MTQDDs
Three types of instructions are used to evaluate an MTQDD. A 2-address binary
branch instruction (B_BRANCH) and a 3-address quaternary branch instruction
(Q_BRANCH) evaluate a non-terminal node, while a dataset instruction (DATASET)
evaluates a terminal node. Mnemonics and their internal representations for
B_BRANCH, Q_BRANCH, and DATASET are shown in Fig. 2.
B_BRANCH performs a binary branch: if the value of the variable specified by INDEX
is equal to 0, then GOTO ADDR0, else GOTO ADDR1. DATASET performs
an output operation and a jump operation: first, DATASET writes DATA (16 bits) to
a register specified by REG; then, GOTO ADDR. Q_BRANCH jumps to one of four
addresses: three jump addresses are specified by ADDR0, ADDR1, and ADDR2, while
the remaining address is the one that follows the present instruction (PC+1). Since
Q_BRANCH evaluates two variables at a time, the total evaluation time can be reduced
to as little as half of that for B_BRANCH instructions. It can also reduce the total
number of instructions. We use the four different Q_BRANCH instructions shown in
Fig. 7. SEL in the Q_BRANCH specifies one of the four combinations. Let i be the value
of the variable specified by INDEX. If SEL = i, then jump to PC+1; otherwise, jump to
the address among ADDR0, ADDR1, and ADDR2 that Fig. 7 assigns to i. In addition,
unconditional jump instructions are necessary to evaluate some QDDs. Example 2.2
illustrates this.
Example 2.2 The program in Fig. 5 evaluates the MTBDD in Fig. 3. Consider the
MTQDD shown in Fig. 4. Fig. 8 shows the MTQDD with address assignment for
Q_BRANCH instructions, where SEL has the same meaning as in Fig. 7. For A6, a
B_BRANCH instruction is used to perform an unconditional jump. The program in
Fig. 6 evaluates the MTQDD. (End of Example)
A0: B_BRANCH (A1,A7),x0
A1: B_BRANCH (A2,A3),x1
A2: DATASET 01,0,A0
A3: B_BRANCH (A4,A5),x2
A4: DATASET 10,0,A0
A5: B_BRANCH (A4,A6),x3
A6: DATASET 00,0,A0
A7: B_BRANCH (A3,A8),x1
A8: B_BRANCH (A6,A5),x2
Fig. 5. Program Code for the MTBDD in Fig. 3
A0: Q_BRANCH (A2,A2,A5),X0,00
A1: DATASET 01,0,A0
A2: Q_BRANCH (A3,A3,A4),X1,00
A3: DATASET 10,0,A0
A4: DATASET 00,0,A0
A5: Q_BRANCH (A4,A4,A4),X1,10
A6: B_BRANCH (A3,A3),--
Fig. 6. Program Code for the MTQDD in Fig. 8
SEL   00      01      10      11
00    PC+1    ADDR0   ADDR1   ADDR2
01    ADDR0   PC+1    ADDR1   ADDR2
10    ADDR0   ADDR1   PC+1    ADDR2
11    ADDR0   ADDR1   ADDR2   PC+1
Fig. 7. Four Different Q_BRANCH Instructions
Fig. 8. MTQDD with 3-address Quaternary Branch Instructions
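The instruction semantics of this section can be checked with a small software interpreter. The following is a minimal sketch, not the hardware implementation: the tuple encoding of instructions, the function name `run`, and the assumption that the super variable Xj packs (x2j, x2j+1) with x2j as the most significant bit are all introduced here for illustration. The two programs are transcribed from Fig. 5 and Fig. 6, with addresses A0, A1, ... written as list indices. For a single input vector the interpreter stops at DATASET, whereas the BM would continue at ADDR (A0 in both listings).

```python
from itertools import product

# Fig. 5 (MTBDD): ("B", addr0, addr1, index) is B_BRANCH, ("D", data) is DATASET.
BDD_PROGRAM = [
    ("B", 1, 7, 0),    # A0
    ("B", 2, 3, 1),    # A1
    ("D", "01"),       # A2
    ("B", 4, 5, 2),    # A3
    ("D", "10"),       # A4
    ("B", 4, 6, 3),    # A5
    ("D", "00"),       # A6
    ("B", 3, 8, 1),    # A7
    ("B", 6, 5, 2),    # A8
]

# Fig. 6 (MTQDD): ("Q", (addr0, addr1, addr2), index, sel) is Q_BRANCH.
QDD_PROGRAM = [
    ("Q", (2, 2, 5), 0, 0b00),  # A0
    ("D", "01"),                # A1
    ("Q", (3, 3, 4), 1, 0b00),  # A2
    ("D", "10"),                # A3
    ("D", "00"),                # A4
    ("Q", (4, 4, 4), 1, 0b10),  # A5
    ("B", 3, 3, None),          # A6: B_BRANCH used as an unconditional jump
]


def run(program, x):
    """Evaluate one input vector x; return (DATA, number of instructions executed)."""
    pc, executed = 0, 0
    while True:
        op = program[pc]
        executed += 1
        if op[0] == "D":
            return op[1], executed
        if op[0] == "B":
            _, addr0, addr1, index = op
            # With no variable specified, both targets coincide (unconditional jump).
            pc = addr0 if index is None or x[index] == 0 else addr1
        else:
            _, addrs, index, sel = op
            value = (x[2 * index] << 1) | x[2 * index + 1]
            if value == sel:
                pc += 1  # the value selected by SEL falls through to PC+1
            else:
                # The other three values take ADDR0..ADDR2 in increasing order (Fig. 7).
                others = [v for v in range(4) if v != sel]
                pc = addrs[others.index(value)]


for x in product((0, 1), repeat=4):
    assert run(BDD_PROGRAM, x)[0] == run(QDD_PROGRAM, x)[0]
print("Fig. 5 and Fig. 6 programs agree on all 16 input vectors")
```

Under these assumptions both programs produce the same DATA for every input vector, and the MTQDD program never executes more instructions than the MTBDD program.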
2.3 Branching Program Machine for a Sequential Circuit
Fig. 9 shows a branching program machine (BM) for a sequential circuit. It consists
of the instruction memory that stores up to 256 words of 32 bits; the instruction
decoder; the program counter (PC); and the register file. In our implementation, two
clocks are used to execute each instruction of the BM: a Double-Rank Flip-Flop is
used to implement the state register and the output register. Fig. 10 shows the
Double-Rank Flip-Flop, where L1 and L2 are D-latches.
Fig. 9. BM for a Sequential Circuit
Fig. 10. Double-Rank Flip-Flop
In the BM, the values of the state register are fed back into its inputs. Thus, the BM can
emulate a sequential circuit. A BM can load the external inputs, the state variables, and
the outputs from other BMs by specifying the value of the input select register.
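The feedback of the state register can be pictured with a small software model. This is only a sketch under stated assumptions: the class `DoubleRankRegister`, the rule that writes go to the first rank while reads see the second rank until `commit` is called, and the 2-bit counter used as the emulated circuit are all hypothetical and chosen for illustration.

```python
class DoubleRankRegister:
    """Two-rank register: writes land in l1, readers see l2 until commit()."""

    def __init__(self, value=0):
        self.l1 = value  # first rank: receives the newly computed value
        self.l2 = value  # second rank: stable value fed back to the inputs

    def write(self, value):
        self.l1 = value

    def commit(self):
        self.l2 = self.l1


def next_state(enable, state):
    """Hypothetical combinational part: a 2-bit counter with an enable input."""
    return (state + 1) % 4 if enable else state


def emulate(inputs):
    """Emulate the sequential circuit for a sequence of enable inputs."""
    reg = DoubleRankRegister(0)
    trace = []
    for enable in inputs:
        reg.write(next_state(enable, reg.l2))  # evaluate using the stable state
        reg.commit()                           # state becomes visible for the next vector
        trace.append(reg.l2)
    return trace


print(emulate([1, 1, 0, 1, 1]))
```

In the BM, the role of `next_state` would be played by the branch program for the combinational part, with DATASET instructions writing the next-state values.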
3 Parallel Branching Program Machine
Fig. 11 shows the architecture of the 8_BM, which consists of eight BMs. The output registers
and the flag registers of the BMs are connected in cascade through programmable routing
boxes. Then, these values are stored into the common registers of the 8_BM. Also, the
values of the registers are fed back to the input of BM0. Each BM can operate independently.
A programmable routing box implements either the bitwise AND or the bitwise
OR operation. Constant values can also be generated. In the programmable routing
boxes (highlighted in gray in Fig. 11), constant 1s are generated to perform the bitwise
AND operation, while constant 0s are generated to perform the bitwise OR operation.
Since BMs are connected to each other by sharing a register, each BM can send a
signal to another BM in one clock. Since a BM uses two clocks to perform an instruction,
the communication delay within an 8_BM can be neglected.
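One way to read the role of the constants is that they serve as the identity element that starts the cascade: all 1s for a chain of bitwise ANDs, and all 0s for a chain of bitwise ORs. The sketch below follows that reading. The function name `cascade`, the 16-bit register width (taken from the DATA field of DATASET), and the modelling of the routing boxes as a simple fold are assumptions, not details confirmed by the paper.

```python
from functools import reduce

WIDTH = 16                  # assumed register width (DATASET writes 16-bit DATA)
MASK = (1 << WIDTH) - 1     # constant with all bits set to 1


def cascade(outputs, mode):
    """Combine the BM output registers through a chain of routing boxes."""
    if mode == "and":
        # Constant 1s start the chain so that they do not affect the AND result.
        return reduce(lambda acc, out: acc & out, outputs, MASK)
    if mode == "or":
        # Constant 0s start the chain so that they do not affect the OR result.
        return reduce(lambda acc, out: acc | out, outputs, 0)
    raise ValueError("mode must be 'and' or 'or'")


bm_outputs = [0b1100, 0b1010, 0b1110]
print(bin(cascade(bm_outputs, "or")), bin(cascade(bm_outputs, "and")))
```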
Fig. 11. Architecture of 8_BM
Fig. 12. Parallel Branching Program Machine
3.2 Parallel Branching Program Machine
Fig. 12 shows the Parallel Branching Program Machine (PBM128) consisting of 128
BMs described in Section 2. Eight BMs constitute an 8_BM, and sixteen 8_BMs and a
programmable interconnection constitute the PBM128. Primary inputs and configuration
signals are sent to the 8_BMs. Each 8_BM has external outputs and state variables.
The external outputs are connected to the system bus, while the state variables are sent
to the 8_BMs through the programmable interconnection. When all the 8_BMs finish
their operations, the values of the state variables of an 8_BM are sent to the other 8_BMs
through the programmable interconnection. These operations can be specified by the
values of the flag register. In addition, an MPU is used to control the whole system.
3.3 Programmable Interconnection
A multi-level circuit of multiplexers is used in the programmable interconnection. To
increase the throughput, pipeline registers are inserted into the programmable
interconnection. The insertion of pipeline registers increases the latency: four clocks are used
to connect the outputs of an 8_BM to another 8_BM. Since two clocks are used for an
instruction of the BM, the PBM128 requires two instruction times to finish the connection
between BMs in different 8_BMs. In the code generation, this wait time is inserted.
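The wait time follows from simple arithmetic on the figures given above. The sketch below restates it; the function names are hypothetical, and the conversion to nanoseconds uses the 100 MHz operating frequency reported in the experimental section.

```python
import math

CLOCKS_PER_INSTRUCTION = 2        # each BM instruction takes two clocks
INTERCONNECT_LATENCY_CLOCKS = 4   # pipelined interconnection between 8_BMs


def wait_instructions(latency_clocks, clocks_per_instruction):
    """Number of instruction slots needed to cover the interconnection latency."""
    return math.ceil(latency_clocks / clocks_per_instruction)


def clocks_to_ns(clocks, frequency_mhz):
    """Convert a clock count to nanoseconds at the given frequency."""
    return clocks * 1000.0 / frequency_mhz


slots = wait_instructions(INTERCONNECT_LATENCY_CLOCKS, CLOCKS_PER_INSTRUCTION)
print(slots, clocks_to_ns(INTERCONNECT_LATENCY_CLOCKS, 100))
```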
4 Implementation and Experimental Results
4.1 Implementation of Parallel Branching Program Machine
We implemented the PBM128 on an Altera FPGA (Stratix II: EP2S130F1508C4).
In our implementation, the maximum frequency is 132.73 [MHz]. The PBM128 consumes
67817 of the 106032 available ALUTs. Each BM consumes 455
ALUTs (0.6% of used ALUTs), each 8_BM consumes 3778 ALUTs (5.6% of used
ALUTs), sixteen 8_BMs consume 60764 ALUTs (89.6% of used ALUTs), and the
programmable interconnection consumes 6307 ALUTs (9.3% of used ALUTs). As for the
MPU, the embedded processor Nios II/f is used.
Table 1. Comparison of the Execution Code Size and the Execution Time
Name In Out FF Core2Duo
Code Time Code Time Code
35 49 164 74.6 12030 17.8
36 39 211 148.6 13450 33.4
229 197 224 112.1 17500 24.8
bigkey 263 197 224 149.5 19170 33.9
apex6 135 9923.0 3700
cps 24 10233.9 3468
des256 245 123.1 16560 30.7
frg2143 13940.0 6390
4.8 163 4.79
8.3 162 4.08
9.2 215 4.34
We selected benchmark functions, and compared the execution time and code size
for the PBM128 with those for Intel's general-purpose processor Core2Duo U7600 (1.2 GHz,
L1 data cache 32 KB, L1 instruction cache 32 KB, and L2 cache 2 MB). The execution code was
generated by the gcc compiler with the optimization option -O3. We partition the outputs into
groups, then represent them by multiple MTQDDs, and finally convert them into the
code for the PBM128. We used a grouping method that partitions outputs with
similar inputs. As for the data structure, the MTQDD is used for the PBM128, while
the MTBDD is used for the Core2Duo, since on the Core2Duo the MTBDD is faster than
the MTQDD. We used the same partitions of the outputs for the Core2Duo and for the
PBM128. To obtain the execution time per vector, we generated random test vectors and
obtained the average time. The frequency for the PBM128 is 100 [MHz], while that for the
Core2Duo is 1.2 [GHz]. Table 1 compares the code size and the execution time for the
Core2Duo and the PBM128. In Table 1, Name denotes the name of the benchmark function;
In denotes the number of inputs; Out denotes the number of outputs; FF denotes the number
of state variables; Code denotes the size of the execution code [KBytes]; Time denotes the
execution time [nsec]; and Ratios denote the ratios of the code size and of the execution
time (Core2Duo/PBM128). Table 1 shows that the PBM128 requires approximately
a quarter of the memory of the Core2Duo, and is 21.4-96.1 times faster than the
Core2Duo.
5 Conclusion
In this paper, we presented the PBM128 that consists of 128 BMs and a programmable
interconnection. To represent logic functions on BMs, we used quaternary decision
diagrams. To evaluate functions, we used 3-address quaternary branch instructions. We
emulated many benchmark functions on the PBM128 and Intel's Core2Duo microprocessor.
The PBM128 requires approximately a quarter of the memory of the Core2Duo,
and is 21.4-96.1 times faster than the Core2Duo.
Acknowledgments
This research is supported in part by the Grants in Aid for Scientific Research of JSPS,
and the grant of Innovative Cluster Project of MEXT (the second stage). Discussion
with Mr. Hisashi Kajiwara was quite useful.
References
1. Ashar, P., Malik, S.: Fast functional simulation using branching programs. In: Proc. International Conference on Computer Aided Design, pp. 408–412 (November 1995)
2. Baracos, P.C., Hudson, R.D., Vroomen, L.J., Zsombor-Murray, P.J.A.: Advances in binary decision based programmable controllers. IEEE Transactions on Industrial Electronics 35(3),
3. Boute, R.T.: The binary-decision machine as programmable controller. Euromicro Newsletter 1(2), 16–22 (1976)
4. Bryant, R.E.: Graph-based algorithms for boolean function manipulation. IEEE Trans. Compt. C-35(8), 677–691 (1986)
5. Butler, J.T., Sasao, T., Matsuura, M.: Average path length of binary decision diagrams. IEEE Trans. Compt. 54(9), 1041–1053 (2005)
6. Clare, C.H.: Designing Logic Systems Using State Machines. McGraw-Hill, New York
7. Davio, M., Deschamps, J.-P., Thayse, A.: Digital Systems with Algorithm Implementation, p. 368. John Wiley & Sons, New York (1983)
8. Iguchi, Y., Sasao, T., Matsuura, M.: Evaluation of multiple-output logic functions. In: Asia and South Pacific Design Automation Conference 2003, Kitakyushu, Japan, January 21-24, pp. 312–315 (2003)
9. Kam, T., Villa, T., Brayton, R.K., Sangiovanni-Vincentelli, A.L.: Multi-valued decision diagrams: Theory and Applications. Multiple-Valued Logic 4(1-2), 9–62 (1998)
10. Nakahara, H., Sasao, T., Matsuura, M.: A Design algorithm for sequential circuits using LUT rings. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E88-A(12), 3342–3350 (2005)
11. McGeer, P.C., McMillan, K.L., Saldanha, A., Sangiovanni-Vincentelli, A.L., Scaglia, P.: Fast discrete function evaluation using decision diagrams. In: Proc. International Conference on Computer Aided Design, pp. 402–407 (November 1995)
12. Sasao, T., Nakahara, H., Matsuura, M., Iguchi, Y.: Realization of sequential circuits by look-up table ring. In: The 2004 IEEE International Midwest Symposium on Circuits and Systems, Hiroshima, July 25-28, pp. I:517–I:520 (2004)
13. Yang, S.: Logic synthesis and optimization benchmark user guide version 3.0. MCNC (Jan-
14. Zsombor-Murray, P.J.A., Vroomen, L.J., Hudson, R.D., Tho, L.-N., Holck, P.H.: Binary-decision-based programmable controllers, Part I-III. IEEE Micro 3(4), 67–83 (Part I), (5), 16–26 (Part II), (6), 24–39 (Part III) (1983)