A Parallel Branching Program Machine for Emulation of Sequential Circuits.
ABSTRACT The parallel branching program machine (PBM128) consists of 128 branching program machines (BMs) and a programmable interconnection.
To represent logic functions on BMs, we use quaternary decision diagrams. To evaluate functions, we use 3-address quaternary
branch instructions. We emulated many benchmark circuits on the PBM128, and compared its memory size and computation time with
those of Intel's Core2Duo microprocessor. The PBM128 requires approximately a quarter of the memory of the Core2Duo, and is
21.4-96.1 times faster than the Core2Duo.
Hiroki Nakahara1, Tsutomu Sasao1, Munehiro Matsuura1, and Yoshifumi Kawamura2
1Kyushu Institute of Technology, Japan
2Renesas Technology Corp., Japan
1 Introduction
A Branching Program Machine (BM) is a special-purpose processor that evaluates
binary decision diagrams (BDDs) [3,2,14]. The BM uses only two kinds of instructions:
branch and output instructions. Thus, the architecture of the BM is much simpler than
that of a general-purpose microprocessor (MPU). Since the BM uses dedicated
instructions to evaluate BDDs, it is faster than the MPU. In fact, for control applications,
the BM is much faster than the MPU. The applications of BMs include sequencers
[3,14], logic simulators [11,1], and networks (e.g., packet classification).
In this paper, we show the parallel branching program machine (PBM128) that consists of 128
BMs and a programmable interconnection. To reduce computation time and memory
size, we use special instructions that evaluate two consecutive nodes at a time.
2 Branching Program Machine to Emulate Sequential Circuits
We show the branching program machine (BM) that emulates the sequential circuit
shown in Fig. 1. First, the combinational circuit is represented by a decision diagram.
Next, it is translated into the codes of the BM. Finally, the BM executes those codes. To
emulate the sequential circuit, the BM uses registers that store state variables. We
assume that the BM uses 32-bit instructions, which match the data structure of embedded
systems and the embedded memory of FPGAs.
In this paper, we use standard terminology for reduced ordered binary decision
diagrams (BDDs) and reduced ordered multi-valued decision diagrams (MDDs).
J. Becker et al. (Eds.): ARC 2009, LNCS 5453, pp. 261–267, 2009.
© Springer-Verlag Berlin Heidelberg 2009
Fig. 1. Model for a Sequential Circuit
Fig. 2. Mnemonics and Internal Representations
Fig. 3. Example of MTBDD
Fig. 4. MTQDD derived from MTBDD in Fig. 3
An MTBDD (Multi-Terminal Binary Decision Diagram) can evaluate many
outputs at a time. Evaluation of an MTBDD requires at most n table look-ups for
n input variables. The APL (average path length) of a BDD denotes the average
number of nodes traversed when the BDD is evaluated. Evaluation time for a BDD
is proportional to the APL. To further speed up the evaluation, an MTMDD(k)
(Multi-Terminal Multi-valued Decision Diagram) is used. In the MTMDD(k), k variables
are grouped to form a 2^k-valued super variable.
Note that a BDD is equivalent to an MDD(1). For many benchmark functions, with
regard to the area-time complexity of logic evaluation, MDD(2)s are more suitable than
BDDs. Since an MDD(2) has 4 branches, it is called a QDD (Quaternary Decision
Diagram). In this paper, we use an MTQDD (Multi-Terminal QDD).
Example 2.1 Fig. 3 shows an example of an MTBDD. Fig. 4 shows the MTQDD that is
derived from the MTBDD in Fig. 3. (End of Example)
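As a small illustration of the grouping of k variables into a super variable, the following Python sketch packs binary inputs, k at a time, into 2^k-valued values. The function name to_super_variables and the choice of treating the first bit of each group as the most significant one are assumptions made for this sketch, not details given in the paper.

```python
def to_super_variables(bits, k=2):
    """Group binary inputs into 2**k-valued super variables.

    Assumptions of this sketch: the first bit of each group is the most
    significant one, and len(bits) is a multiple of k.
    """
    if len(bits) % k != 0:
        raise ValueError("number of inputs must be a multiple of k")
    values = []
    for start in range(0, len(bits), k):
        value = 0
        for bit in bits[start:start + k]:
            value = (value << 1) | bit  # append the next bit on the right
        values.append(value)
    return values


# k = 2 yields the quaternary variables of a QDD:
# (x0, x1) -> X0 and (x2, x3) -> X1.
print(to_super_variables([0, 1, 1, 1]))  # [1, 3]
```

With k = 1 the input is returned unchanged, which mirrors the remark that a BDD is equivalent to an MDD(1).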
2.2 Instructions to Evaluate MTQDDs
Three types of instructions are used to evaluate an MTQDD. A 2-address binary
branch instruction (B_BRANCH) and a 3-address quaternary branch instruction
(Q_BRANCH) evaluate a non-terminal node, while a dataset instruction (DATASET)
evaluates a terminal node. Mnemonics and their internal representations for
B_BRANCH, Q_BRANCH and DATASET are shown in Fig. 2.
B_BRANCH performs a binary branch: If the value of the variable specified by INDEX
is equal to 0, then GOTO ADDR0, else GOTO ADDR1. DATASET performs
an output operation and a jump operation. First, DATASET writes DATA (16 bits) to
a register specified by REG. Then, GOTO ADDR. Q_BRANCH jumps to one of four
addresses: Three jump addresses are specified by ADDR0, ADDR1, and ADDR2, while
the remaining address is the next address (PC+1) to the present one. Since it evaluates
two variables at a time, the total evaluation time can be reduced to as little as half of
that with B_BRANCH instructions. Also, it can reduce the total number of instructions.
We use the four different Q_BRANCH instructions shown in Fig. 7. SEL in the Q_BRANCH
specifies one of the four combinations. Let i be the value of the variable specified by
INDEX. If SEL = i, then jump to PC+1; otherwise, jump to the address among ADDR0,
ADDR1, and ADDR2 that Fig. 7 assigns to i for that SEL. In addition, unconditional jump
instructions are necessary to evaluate some QDDs. Example 2.2 illustrates this.
Example 2.2 The program in Fig. 5 evaluates the MTBDD in Fig. 3. Consider the
MTQDD shown in Fig. 4. Fig. 8 shows the MTQDD with address assignment for
Q_BRANCH instructions, where SEL has the same meaning as in Fig. 7. At A6, a
B_BRANCH instruction is used to perform an unconditional jump. The program in
Fig. 6 evaluates the MTQDD. (End of Example)
A0: B_BRANCH (A1,A7),x0
A1: B_BRANCH (A2,A3),x1
A2: DATASET 01,0,A0
A3: B_BRANCH (A4,A5),x2
A4: DATASET 10,0,A0
A5: B_BRANCH (A4,A6),x3
A6: DATASET 00,0,A0
A7: B_BRANCH (A3,A8),x1
A8: B_BRANCH (A6,A5),x2
Fig. 5. Program Code for the MTBDD in Fig. 3
A0: Q_BRANCH (A2,A2,A5),X0,00
A1: DATASET 01,0,A0
A2: Q_BRANCH (A3,A3,A4),X1,00
A3: DATASET 10,0,A0
A4: DATASET 00,0,A0
A5: Q_BRANCH (A4,A4,A4),X1,10
A6: B_BRANCH (A3,A3),--
Fig. 6. Program Code for the MTQDD in Fig. 8
       Value of the variable specified by INDEX
SEL    00      01      10      11
00     PC+1    ADDR0   ADDR1   ADDR2
01     ADDR0   PC+1    ADDR1   ADDR2
10     ADDR0   ADDR1   PC+1    ADDR2
11     ADDR0   ADDR1   ADDR2   PC+1
Fig. 7. Four Different Q_BRANCH Instructions
Fig. 8. MTQDD with 3-address Quaternary Branch Instructions
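The instruction semantics of this section can be checked with a small software model. The following Python sketch interprets the programs of Fig. 5 and Fig. 6 and confirms that both produce the same output for all 16 input vectors. It is a minimal sketch under stated assumptions: the tuple encoding of instructions, the function names, and the convention that super variable Xj stands for the pair (x2j, x2j+1) with the first bit most significant are choices made here, not the 32-bit internal representation of Fig. 2.

```python
from itertools import product

# Programs transcribed from Fig. 5 and Fig. 6; address Ak becomes index k.
# B_BRANCH: (op, (ADDR0, ADDR1), INDEX); INDEX None marks the unconditional jump.
# Q_BRANCH: (op, (ADDR0, ADDR1, ADDR2), INDEX, SEL).
# DATASET:  (op, DATA, REG, ADDR).
MTBDD_CODE = [
    ("B_BRANCH", (1, 7), 0),
    ("B_BRANCH", (2, 3), 1),
    ("DATASET", "01", 0, 0),
    ("B_BRANCH", (4, 5), 2),
    ("DATASET", "10", 0, 0),
    ("B_BRANCH", (4, 6), 3),
    ("DATASET", "00", 0, 0),
    ("B_BRANCH", (3, 8), 1),
    ("B_BRANCH", (6, 5), 2),
]
MTQDD_CODE = [
    ("Q_BRANCH", (2, 2, 5), 0, 0b00),
    ("DATASET", "01", 0, 0),
    ("Q_BRANCH", (3, 3, 4), 1, 0b00),
    ("DATASET", "10", 0, 0),
    ("DATASET", "00", 0, 0),
    ("Q_BRANCH", (4, 4, 4), 1, 0b10),
    ("B_BRANCH", (3, 3), None),
]


def q_branch_targets(pc, addrs, sel):
    """Jump targets for variable values 00, 01, 10, 11 (one row of Fig. 7)."""
    targets = list(addrs)
    targets.insert(sel, pc + 1)  # the value equal to SEL falls through to PC+1
    return targets


def evaluate(code, x):
    """Run one evaluation; return (DATA, number of executed instructions).

    The run stops at the first DATASET, so REG and the jump back to A0
    are not modeled in this sketch.
    """
    pc, steps = 0, 0
    while True:
        instr = code[pc]
        steps += 1
        if instr[0] == "DATASET":
            return instr[1], steps
        if instr[0] == "B_BRANCH":
            (addr0, addr1), index = instr[1], instr[2]
            pc = addr0 if index is None or x[index] == 0 else addr1
        else:  # Q_BRANCH
            addrs, index, sel = instr[1], instr[2], instr[3]
            value = 2 * x[2 * index] + x[2 * index + 1]
            pc = q_branch_targets(pc, addrs, sel)[value]


bdd_steps = qdd_steps = 0
for x in product((0, 1), repeat=4):
    bdd_out, b = evaluate(MTBDD_CODE, x)
    qdd_out, q = evaluate(MTQDD_CODE, x)
    assert bdd_out == qdd_out, x
    bdd_steps += b
    qdd_steps += q

print(bdd_steps, qdd_steps)  # 66 45
```

Over the 16 input vectors, this model executes 66 instructions for the B_BRANCH program and 45 for the Q_BRANCH program, which illustrates the reduction in evaluation time for this example only.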
2.3 Branching Program Machine for a Sequential Circuit
Fig. 9 shows a branching program machine (BM) for a sequential circuit. It consists
of the instruction memory that stores up to 256 words of 32 bits; the instruction
decoder; the program counter (PC); and the register file. In our implementation, two
clocks are used to execute each instruction of the BM: A Double-Rank Flip-Flop is
used to implement the state register and the output register. Fig. 10 shows the
Double-Rank Flip-Flop, where L1 and L2 are D-latches.
Fig. 9. BM for a Sequential Circuit
Fig. 10. Double-Rank Flip-Flop
In the BM, the values of the state register are fed back into its inputs. Thus, the BM can
emulate a sequential circuit. A BM can load the external inputs, the state variables, and
the outputs from other BMs by specifying the value of the input select register.
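The feedback of the state register can be sketched in software as a loop that evaluates the combinational part once per input vector and writes the next state back. In the following Python sketch, the function emulate and the 2-bit counter counter_step are hypothetical; counter_step merely stands in for an MTQDD program that computes the outputs and the next state.

```python
def emulate(step, inputs, state):
    """Evaluate the combinational part once per input vector.

    `step` maps (input, state) to (output, next_state); the returned
    next_state plays the role of the state register that is fed back.
    """
    outputs = []
    for x in inputs:
        out, state = step(x, state)
        outputs.append(out)
    return outputs, state


def counter_step(enable, state):
    """Hypothetical circuit: a 2-bit counter that outputs its current state."""
    next_state = (state + enable) % 4
    return state, next_state


outputs, final_state = emulate(counter_step, [1, 1, 0, 1, 1], 0)
print(outputs, final_state)  # [0, 1, 2, 2, 3] 0
```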
3 Parallel Branching Program Machine
Fig. 11 shows the architecture of the 8_BM, which consists of 8 BMs. The output registers
and the flag registers of the BMs are connected in cascade through programmable routing
boxes. Then, these values are stored into the common registers of the 8_BM. Also, the
values of the registers are fed back to the input of BM0. Each BM can operate independently.
A programmable routing box implements either the bitwise AND or the bitwise
OR operation. Constant values can also be generated. In the programmable routing
boxes (highlighted in gray in Fig. 11), constant 1s are generated to perform the bitwise
AND operation, while constant 0s are generated to perform the bitwise OR operation.
Since BMs are connected to each other by sharing a register, each BM can send a
signal to another BM in one clock. Since a BM uses two clocks to perform an instruction,
the communication delay within an 8_BM can be neglected.
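One way to read the role of the constants is that each cascade starts from the identity element of its operation: all 1s for AND and all 0s for OR. The following Python sketch models the cascade under that reading. The function name cascade, the mode strings, and the 16-bit width (chosen to match the 16-bit DATA field) are assumptions of this sketch rather than details of the routing boxes.

```python
from functools import reduce

WIDTH = 16  # assumed register width, matching the 16-bit DATA field


def cascade(bm_outputs, mode):
    """Combine BM output registers in cascade with bitwise AND or OR.

    The chain starts from a constant: all 1s for AND, all 0s for OR,
    so the constant itself never changes the result.
    """
    mask = (1 << WIDTH) - 1
    if mode == "and":
        return reduce(lambda acc, value: acc & value, bm_outputs, mask)
    if mode == "or":
        return reduce(lambda acc, value: acc | value, bm_outputs, 0)
    raise ValueError("mode must be 'and' or 'or'")


print(bin(cascade([0b1100, 0b1010], "and")))  # 0b1000
print(bin(cascade([0b1100, 0b1010], "or")))   # 0b1110
```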
Fig. 11. Architecture of 8_BM
Fig. 12. Parallel Branching Program Machine
3.2 Parallel Branching Program Machine
Fig. 12 shows the Parallel Branching Program Machine (PBM128) consisting of 128
BMs described in Section 2. Eight BMs constitute an 8_BM, and sixteen 8_BMs and a
programmable interconnection constitute the PBM128. Primary inputs and configuration
signals are sent to the 8_BMs. Each 8_BM has external outputs and state variables.
The external outputs are connected to the system bus, while the state variables are sent
to 8_BMs through the programmable interconnection. When all the 8_BMs finish their
operations, the values of the state variables of an 8_BM are sent to other 8_BMs through
the programmable interconnection. These operations can be specified by the values of
the flag register. In addition, an MPU is used to control the whole system.
3.3 Programmable Interconnection
A multi-level circuit of multiplexers is used in the programmable interconnection. To
increase the throughput, pipeline registers are inserted into the programmable
interconnection. The insertion of pipeline registers increases the latency: Four clocks are used
to connect the outputs of an 8_BM to another 8_BM. Since two clocks are used for an
instruction of the BM, the PBM128 requires two instruction times to finish the connection
between BMs in different 8_BMs. In the code generation, this wait time is inserted.
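The wait time follows from simple arithmetic on the figures above. The Python sketch below restates it; the function names are hypothetical, and the 100 MHz value used in the example is the operating frequency reported in the experiments.

```python
import math

INTERCONNECT_LATENCY_CLOCKS = 4  # clocks to connect one 8_BM to another
CLOCKS_PER_INSTRUCTION = 2       # clocks per BM instruction


def wait_instructions(latency_clocks=INTERCONNECT_LATENCY_CLOCKS,
                      clocks_per_instruction=CLOCKS_PER_INSTRUCTION):
    """Instruction slots needed to cover the interconnection latency."""
    return math.ceil(latency_clocks / clocks_per_instruction)


def instruction_time_ns(clock_mhz, clocks_per_instruction=CLOCKS_PER_INSTRUCTION):
    """Duration of one BM instruction in nanoseconds."""
    return clocks_per_instruction * 1000 / clock_mhz


print(wait_instructions())                              # 2
print(wait_instructions() * instruction_time_ns(100))   # 40.0 ns at 100 MHz
```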
4 Implementation and Experimental Results
4.1 Implementation of Parallel Branching Program Machine
We implemented the PBM128 on an Altera FPGA (Stratix II: EP2S130F1508C4).
In our implementation, the maximum frequency is 132.73 MHz. The PBM128 consumes
67817 ALUTs out of 106032 available ALUTs. Each BM consumes 455
ALUTs (0.6% of used ALUTs), each 8_BM consumes 3778 ALUTs (5.6% of used
ALUTs), sixteen 8_BMs consume 60764 ALUTs (89.6% of used ALUTs), and the
programmable interconnection consumes 6307 ALUTs (9.3% of used ALUTs). As for the
MPU, the embedded processor Nios II/f is used.
Table 1. Comparison of the Execution Code Size and the Execution Time
                          Core2Duo         PBM128        Ratios
Name     In    Out   FF   Code    Time     Code   Time   Code
         35    49    164  74.6    12030    17.8
         36    39    211  148.6   13450    33.4
         229   197   224  112.1   17500    24.8
bigkey   263   197   224  149.5   19170    33.9
apex6    135   99         23.0    3700     4.8    163    4.79
cps      24    102        33.9    3468     8.3    162    4.08
des      256   245        123.1   16560    30.7
frg2     143   139        40.0    6390     9.2    215    4.34
We selected benchmark functions, and compared the execution time and code size
for the PBM128 with those for Intel's general-purpose processor Core2Duo U7600 (1.2 GHz,
L1 data cache 32 KB, L1 instruction cache 32 KB, and L2 cache 2 MB). The execution code was
generated by the gcc compiler with the optimization option -O3. We partition the outputs into
groups, then represent them by multiple MTQDDs, and finally convert them into the
codes for the PBM128. We used a grouping method that partitions outputs with
similar inputs. As for the data structure, the MTQDD is used for the PBM128, while
the MTBDD is used for the Core2Duo, since on the Core2Duo the MTBDD is faster than the MTQDD.
We used the same partitions of the outputs for the Core2Duo and for the PBM128. To
obtain the execution time per vector, we generated random test vectors and obtained the
average time. The frequency for the PBM128 is 100 MHz, while that for the Core2Duo
is 1.2 GHz. Table 1 compares the code size and the execution time for the Core2Duo
and the PBM128. In Table 1, Name denotes the name of the benchmark function; In
denotes the number of inputs; Out denotes the number of outputs; FF denotes the number
of state variables; Code denotes the size of the execution code [KBytes]; Time denotes the
execution time [nsec]; and Ratios denote the ratios of the code size and of the execution
time (Core2Duo/PBM128). Table 1 shows that the PBM128 requires approximately
a quarter of the memory of the Core2Duo, and is 21.4-96.1 times faster than the
Core2Duo.
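The Ratios columns can be recomputed from the Code and Time columns. The Python sketch below does so for apex6, cps, and frg2, the three rows for which both the Core2Duo and the PBM128 values are legible in this copy of Table 1. The dictionary layout and the function name ratios are assumptions of this sketch, and the numbers are as read from the table.

```python
# Code sizes in KBytes and times in nsec, as read from Table 1.
ROWS = {
    "apex6": {"c2d_code": 23.0, "c2d_time": 3700, "pbm_code": 4.8, "pbm_time": 163},
    "cps":   {"c2d_code": 33.9, "c2d_time": 3468, "pbm_code": 8.3, "pbm_time": 162},
    "frg2":  {"c2d_code": 40.0, "c2d_time": 6390, "pbm_code": 9.2, "pbm_time": 215},
}


def ratios(row):
    """Return (code ratio, time ratio), both as Core2Duo / PBM128."""
    return (row["c2d_code"] / row["pbm_code"],
            row["c2d_time"] / row["pbm_time"])


for name, row in ROWS.items():
    code_ratio, time_ratio = ratios(row)
    print(f"{name}: code {code_ratio:.2f}, time {time_ratio:.1f}")
```

For cps, the time ratio computed this way is about 21.4, which agrees with the lower end of the reported 21.4-96.1 range.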
5 Conclusion
In this paper, we presented the PBM128 that consists of 128 BMs and a programmable
interconnection. To represent logic functions on BMs, we used quaternary decision
diagrams. To evaluate functions, we used 3-address quaternary branch instructions. We
emulated many benchmark functions on the PBM128 and Intel's Core2Duo microprocessor.
The PBM128 requires approximately a quarter of the memory of the Core2Duo,
and is 21.4-96.1 times faster than the Core2Duo.
Acknowledgments
This research is supported in part by the Grants in Aid for Scientific Research of JSPS,
and the grant of Innovative Cluster Project of MEXT (the second stage). Discussions
with Mr. Hisashi Kajiwara were quite useful.
References
1. Ashar, P., Malik, S.: Fast functional simulation using branching programs. In: Proc. International Conference on Computer Aided Design, pp. 408–412 (November 1995)
2. Baracos, P.C., Hudson, R.D., Vroomen, L.J., Zsombor-Murray, P.J.A.: Advances in binary decision based programmable controllers. IEEE Transactions on Industrial Electronics 35(3),
3. Boute, R.T.: The binary-decision machine as programmable controller. Euromicro Newsletter 1(2), 16–22 (1976)
4. Bryant, R.E.: Graph-based algorithms for boolean function manipulation. IEEE Trans. Compt. C-35(8), 677–691 (1986)
5. Butler, J.T., Sasao, T., Matsuura, M.: Average path length of binary decision diagrams. IEEE Trans. Compt. 54(9), 1041–1053 (2005)
6. Clare, C.H.: Designing Logic Systems Using State Machines. McGraw-Hill, New York
7. Davio, M., Deschamps, J.-P., Thayse, A.: Digital Systems with Algorithm Implementation, p. 368. John Wiley & Sons, New York (1983)
8. Iguchi, Y., Sasao, T., Matsuura, M.: Evaluation of multiple-output logic functions. In: Asia and South Pacific Design Automation Conference 2003, Kitakyushu, Japan, January 21-24, pp. 312–315 (2003)
9. Kam, T., Villa, T., Brayton, R.K., Sangiovanni-Vincentelli, A.L.: Multi-valued decision diagrams: Theory and Applications. Multiple-Valued Logic 4(1-2), 9–62 (1998)
10. Nakahara, H., Sasao, T., Matsuura, M.: A Design algorithm for sequential circuits using LUT rings. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E88-A(12), 3342–3350 (2005)
11. McGeer, P.C., McMillan, K.L., Saldanha, A., Sangiovanni-Vincentelli, A.L., Scaglia, P.: Fast discrete function evaluation using decision diagrams. In: Proc. International Conference on Computer Aided Design, pp. 402–407 (November 1995)
12. Sasao, T., Nakahara, H., Matsuura, M., Iguchi, Y.: Realization of sequential circuits by look-up table ring. In: The 2004 IEEE International Midwest Symposium on Circuits and Systems, Hiroshima, July 25-28, pp. I:517–I:520 (2004)
13. Yang, S.: Logic synthesis and optimization benchmark user guide version 3.0. MCNC (Jan-
14. Zsombor-Murray, P.J.A., Vroomen, L.J., Hudson, R.D., Tho, L.-N., Holck, P.H.: Binary-decision-based programmable controllers, Part I-III. IEEE Micro 3(4), 67–83 (Part I), (5), 16–26 (Part II), (6), 24–39 (Part III) (1983)