Architectural Implications of Brick and Mortar
Silicon Manufacturing
Martha Mercaldi Kim∗
Mojtaba Mehrara†
Mark Oskin∗
Todd Austin†
∗Computer Science & Engineering
University of Washington
Seattle, WA 98195
{mercaldi,oskin}@cs.washington.edu
†Electrical Engineering & Computer Science
University of Michigan
Ann Arbor, MI 48109
{mehrara,austin}@umich.edu
ABSTRACT
We introduce a novel chip fabrication technique called “brick and
mortar”, in which chips are made from small, pre-fabricated ASIC
bricks and bonded in a designer-specified arrangement to an inter-
brick communication backbone chip. The goal of brick and mortar
assembly is to provide a low-overhead method to produce custom
chips, yet with performance that tracks an ASIC more closely than
an FPGA. This paper examines the architectural design choices in
this chip-design system. These choices include the definition of
reasonable bricks, both in functionality and size, as well as the communication
interconnect that the I/O cap provides. To do this we
synthesize candidate bricks, analyze their area and bandwidth de-
mands, and present an architectural design for the inter-brick com-
munication network. We discuss a sample chip design, a 16-way
CMP, and analyze the costs and benefits of designing chips with
brick and mortar. We find that this method of producing chips in-
curs only a small performance loss (8%) compared to a fully cus-
tom ASIC, which is significantly less than the degradation seen
from other low-overhead chip options, such as FPGAs. Finally, we
measure the effect that architectural design decisions have on the
behavior of the proposed physical brick assembly technique, flu-
idic self-assembly.
Categories and Subject Descriptors: B.7 Integrated Circuits: Advanced
technologies; B.4.3 Input/Output and Data Communications:
Interconnections (Subsystems) [Interfaces, Topology]
General Terms: Design, Performance
Keywords: Chip assembly, Design re-use, Interconnect design.
1. INTRODUCTION
Technology scaling has produced a wealth of transistor resources
and, largely, commensurate improvements in chip performance.
These benefits, however, have come with an ever increasing price
tag, due to rising design, engineering, validation, and ASIC ini-
tiation costs [8]. The result has been a steady decline in ASIC
“starts” [9]. The cycle feeds on itself: fewer starts means fewer
customers to amortize the high cost of fabrication facilities, leading
to even higher start costs and further declining starts.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ISCA’07, June 9–13, 2007, San Diego, California, USA.
Copyright 2007 ACM 978-1-59593-706-3/07/0006 ...$5.00.

To implement a design, engineers typically choose between two
options. Either they must face the high fixed costs of ASIC pro-
duction, and hope to amortize it over a large volume of parts, or
they must use an FPGA with low fixed costs but high unit part
cost. The trade-offs are not just financial. ASICs provide signif-
icant speed (3-4x) and power (up to 12x) savings [27], compared
to FPGAs, and the technical demands of certain applications, for
instance, cell phones, will demand an ASIC. However, FPGAs of-
fer in-field reprogrammability, which is useful for accommodat-
ing changing standards. This drives the need for a manufacturing
technology that provides the key advantages of FPGAs – low non-
recurring costs, and quick turn-around on designs – coupled with
the key advantages of ASICs – low unit cost, high performance and
low power.
This paper introduces such a technology, which we call brick
and mortar silicon. At the heart of this manufacturing technique
are two architectural components: bricks, which are mass-produced
pieces of silicon containing processor cores, memory arrays, small
gate arrays, DSPs, FFT engines, and other IP (intellectual property)
blocks; and mortar, an I/O cap, that is a mass-produced silicon sub-
strate. Engineers design products with the brick and mortar process
by putting pre-produced bricks of IP into an application-specific
layout. This arrangement of bricks is then bonded to the I/O cap
that interconnects them.
What differentiates brick and mortar from existing approaches,
such as system on chip (SoC), is that bricks and I/O caps are manu-
factured separately and bonded together using flip-chip techniques.
Existing approaches provide IP blocks to engineers as “gateware”
netlists. Engineers integrate them into a chip design that is then
manufactured. Brick and mortar provides IP to designers as real
physical entities – small chips to be assembled into the final prod-
uct.
Our vision is that bricks are the modern-day analogue of the 7400
series of logic, and the I/O cap is the modern wire-wrap board.
Rather than spin custom ASICs for products, engineers could pur-
chase these prefabricated components and bond them together as
needed.
The key advantages of brick and mortar chip production stem
from mass-production of its constituent parts. Bricks are produced
in conventional ASIC processes, and hence brick and mortar chips
gain the advantages of an ASIC: low power and high performance.
Although they are ASICs, bricks are small, resulting in lower in-
dividual design and verification costs. Once designed and verified,
they can be produced in bulk and used in a variety of end-user prod-
ucts. All of this reduces the cost of a brick and mortar chip. Brick
and mortar chips are also designed to be mass-produced, using
fluidic self-assembly or another low-cost physical assembly tech-
nique.
To make brick and mortar chip production successful, one must
carefully design the architecture of the bricks and the I/O cap.
The bricks must have appropriate sizes and useful function. Large
bricks provide more area for physical connection to the I/O cap
and, consequently, more inter-brick bandwidth. Large bricks can
integrate more logic and/or memory on a single brick, thereby in-
creasing circuit performance via the decreased intra-brick commu-
nication latency. In contrast, small bricks offer more design flexi-
bility and, because they are less specialized, more potential re-use
across designs. It is important to find a suitable balance between
integration and generality in brick function for this technology.
The I/O cap implements inter-brick communication. It is an ac-
tive silicon die containing wiring, routing, and logic resources. I/O
caps that provide more sophisticated routing capabilities (such as
packet networks) free logic space on the bricks. On the other hand,
if the I/O cap is too specialized, it cannot be re-used across a vari-
ety of brick and mortar chips. Striking the correct balance between
logic and wiring efficiency is the question driving the architecture
of the I/O cap.
This paper is the first to describe and evaluate the brick and mor-
tar assembly process. One must carefully engineer the architecture
of both the bricks and the I/O cap to make this chip production
method viable. We present a design study of these components and
find that three physical sizes of bricks (0.25mm², 1mm², 4mm²)
are sufficient to contain the IP blocks we study. Using these bricks
and an I/O cap designed for both packet-switched communication
and FPGA island-style, configured communication, we show how
to build a variety of CMP products. The best of these CMPs performs
within 8% of an equivalent design built with a traditional ASIC
design process. Finally, we describe how to build brick and mor-
tar chips from a low-cost fluidic self-assembly process. We study
how this manufacturing process interacts with the architectural
decisions both brick and application chip designers will make.
Specifically, we find that designing chip architectures that permit a small
amount of slack in brick placement on the I/O cap can lead to a
factor of 10 improvement in the rate of brick and mortar chip pro-
duction.
The next section describes brick and mortar chip production in
more detail. We also foreshadow the quantitative results presented
later in this paper with a qualitative discussion summarizing the
key advantages of brick and mortar chips. Section 3 presents the
architecture of bricks and the I/O cap and motivates those architec-
tural choices through design synthesis results. Section 4 examines
how architectural choices affect a sample chip design, a 16-way
chip-multiprocessor, quantifying the benefits and costs of brick and
mortar with respect to performance. In Section 5 we discuss how
to assemble brick and mortar chips, which can be either via robots
or via self-assembly. For cost and convenience reasons, we expect
mass-produced brick and mortar chips to utilize fluidic self-assembly,
and in Section 6 we examine in more detail the behavior of
the assembly process and how it interacts with the architectural de-
cisions presented in Section 3. Section 7 summarizes related tech-
nologies before we conclude.
2. THE POTENTIAL OF BRICK AND MORTAR CHIPS
At the heart of the brick and mortar chip manufacturing process
are two architectural components: a brick and an I/O cap. Bricks
are physical pieces of silicon that contain an IP-block-sized
component such as a processor, network interface, or small gate array. An
I/O cap is another silicon die containing an inter-brick communica-
tion infrastructure. A brick and mortar manufactured chip consists
of several bricks, arranged into an application specific layout, that
are bonded to an I/O cap. Once bricks are bonded to the I/O cap,
the cap provides power and clock to the bricks and I/O capabilities
that enable bricks to communicate with each other and the outside
pins of the chip package. Figure 1 depicts a brick layout and the
I/O cap to which the bricks are bonded.
Before delving into the architectural components of brick and
mortar chip design, we outline qualitatively the key reasons we are
pursuing this line of research. Later sections of this paper will re-
visit most of these issues with quantitative analysis.
Reduced cost: As already discussed in the introduction, the chief
motivation for our research is to produce a low-cost alternative to
ASIC chip production. Section 7 describes other technologies with
related goals. With brick and mortar, cost reductions come from
utilizing mass produced bricks and I/O caps in multiple different
chip designs.
Compatible design flow: Today ASIC designers employ signifi-
cant amounts of existing IP to produce chips. This improves design
reliability and saves design time and cost. Brick and mortar is com-
patible with this design flow, merely moving the IP blocks from de-
sign modules, which fit into synthesis tool flows, to physical bricks,
which fit into a manufacturing flow.
ASIC-like speed and power: Because most of the logic of a brick
and mortar chip exists within a single ASIC component, its perfor-
mance, in speed and power, will tend closer to an ASIC than an
FPGA. Small gate array bricks can implement any small custom
logic.
Mixed process integration: As we will show, bricks must
comply with a standard physical and logical interface. They do
not, however, have to be built from the same underlying technol-
ogy. This offers an easy way to mix and match bulk CMOS, SOI,
DRAM and other process technologies into the same chip.
Improved yield: Large brick and mortar chips can have a higher
yield than large ASICs. The advantage comes from assembling a
large chip out of many smaller components. The smaller the component,
the higher the yield. One can test component bricks before
assembling them, ensuring only functional bricks are included in
any assembly, and resulting in an extremely high overall yield.
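As a rough illustration of the yield argument, the sketch below uses the classic Poisson defect model, Y = exp(-A * D). The paper does not specify a yield model, and the defect density used here is a made-up value chosen only to show the trend across brick sizes and a large monolithic die.

```python
import math

# Hypothetical defect density in defects per mm^2 (not a figure from the paper).
DEFECT_DENSITY = 0.005

def poisson_yield(area_mm2, defect_density=DEFECT_DENSITY):
    """Fraction of defect-free dies under the Poisson yield model."""
    return math.exp(-area_mm2 * defect_density)

# Brick areas come from the paper; the 200 mm^2 monolithic die is illustrative.
for name, area in [("small brick", 0.25), ("medium brick", 1.0),
                   ("large brick", 4.0), ("monolithic die", 200.0)]:
    print(f"{name:>14}: {poisson_yield(area):.3f}")
```

Under any such model the per-component yield rises as the component shrinks, and pre-testing bricks keeps defective ones out of the assembled chip.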
These advantages do not come for free, however. Brick and
mortar assembly will be viable only if its components are well-
architected. This paper presents the results of our architectural
analysis. We begin in the next section by designing the brick and
I/O components.
3. ARCHITECTURE
We now turn to the task of understanding two central architec-
tural questions the brick and mortar approach poses, namely, “what
is a brick?”, and “what is an I/O cap?”
3.1 Bricks
There are three important architectural questions to answer about
a brick. How do bricks communicate? How large is a brick? What
is the appropriate functionality for bricks to provide? To answer
these questions we begin by investigating how the physical con-
straints placed on bricks influence the architectural decisions.
What are the goals and constraints of inter-brick communica-
tion? The primary architectural constraint on inter-brick commu-
nication is that bricks must communicate with other bricks through
the I/O cap. Flip-chip bonding connects the I/O pads of each brick
to the I/O pads in the cap. Other studies [19] indicate that each
bonding bump requires only 25µm x 25µm in area and can provide
at least 2.5Gbps bandwidth.

[Figure 1 image omitted. Labels: Functional Bricks, I/O Cap, I/O pad; steps: assemble bricks, bond to I/O cap.]
Figure 1: Brick and Mortar Process: With brick and mortar chip design, mass produced ASIC functional bricks are assembled
in a custom, per-design fashion, and bonded to an ASIC I/O cap providing flexible, high-performance interconnect for bricks to
communicate. I/O pads cover the surface of both the bricks and the I/O cap, so that the bricks can communicate when bonded
together.

Function                        Cite       Circuit     Max. Circuit  Min. Perf.  Valid Freq. Range (MHz)
                                           Area (µm²)  Freq. (MHz)   (Mbps)      0.25 mm² brick  1.0 mm² brick  4.0 mm² brick
Small Bricks
USB 1.1 PHYSICAL LAYER          [34]       2,201       2941          12          2 - 2941        No benefit     No benefit
VITERBI                         [46]       2,614       1961          -           N/A - 1961      No benefit     No benefit
VGA/LCD CONTROLLER              [34]       4,301       1219          -           N/A - 1046      N/A - 1219     No benefit
WB DMA                          [34]       13,684      1163          -           N/A - 521       N/A - 1163     No benefit
MEMORY CONTROLLER               [34]       29,338      952           -           N/A - 843       N/A - 952      No benefit
TRI MODE ETHERNET               [34]       32,009      893           1000        125 - 893       No benefit     No benefit
PCI BRIDGE                      [34]       76,905      1042          -           N/A - 610       N/A - 1042     No benefit
WB Switch (8 master, 16 slave)  [34]       81,073      1087          -           N/A - 88        N/A - 353      N/A - 1087
FPU                             [34]       85,250      1515          -           N/A - 505       N/A - 1515     No benefit
DES                             [34]       85,758      1370          1000        16 - 1203       16 - 1370      No benefit
16K SRAM (Singleport)           [7]        195,360     2481          -           N/A - 2481      No benefit     No benefit
AHO-CORASICK STR. MATCH         [51]       201,553     2481          -           N/A - 1331      N/A - 2481     No benefit
RISC CORE (NO FPU) / 8K CACHE   [34], [7]  219,971     1087          -           N/A - 1087      No benefit     No benefit
8K SRAM (Dualport)              [7]        230,580     1988          -           N/A - 1988      No benefit     No benefit
Medium Bricks
TRIPLE DES                      [34]       294,075     1282          1000        No space        16 - 1282      No benefit
FFT                             [45]       390,145     1220          -           No space        N/A - 1220     No benefit
JPEG DECODER                    [34]       625,457     629           -           No space        N/A - 629      No benefit
64K SRAM (Singleport)           [7]        682,336     2315          -           No space        N/A - 2315     No benefit
32K SRAM (Dualport)             [7]        733,954     1842          -           No space        N/A - 1842     No benefit
RISC CORE + 64K CACHE           [34], [7]  864,017     1087          -           No space        N/A - 1087     No benefit
Large Bricks
256K SRAM (Singleport)          [7]        2,729,344   2315          -           No space        No space       N/A - 2315
128K SRAM (Dualport)            [7]        2,935,817   2882          -           No space        No space       N/A - 2882
RISC CORE + 256K CACHE          [34], [7]  3,111,025   1087          -           No space        No space       N/A - 1087

Table 1: IP Block Synthesis and Brick Assignment: This table shows the synthesis-produced area and timing characteristics of each
brick-candidate IP block. Each block has been assigned to the smallest brick which met its area and bandwidth constraints. Note
how some of the blocks that we have assigned to small bricks could take advantage of the increased I/O bandwidth afforded by larger
bricks (indicated by the increased frequency range).

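The bump figures quoted above (25µm x 25µm per bump, at least 2.5Gbps each) imply a ceiling on how much I/O a brick face can carry. The following sketch computes that ceiling for the three brick sizes used in this paper. It assumes, purely for illustration, that every bump site carries a signal; in practice some sites would go to power, ground and clock, so these are upper bounds rather than figures reported by the paper.

```python
BUMP_SIDE_UM = 25      # bump footprint side in microns, from the text
GBPS_PER_BUMP = 2.5    # minimum per-bump bandwidth, from the text

def max_bumps(brick_side_um):
    """Upper bound on bump sites for a square brick, assuming the whole
    face is tiled with 25um x 25um sites (an illustrative assumption)."""
    per_side = brick_side_um // BUMP_SIDE_UM
    return per_side * per_side

def max_bandwidth_gbps(brick_side_um):
    """Aggregate bandwidth if every bump site carried a signal."""
    return max_bumps(brick_side_um) * GBPS_PER_BUMP

for label, side_um in [("0.25 mm^2", 500), ("1.0 mm^2", 1000), ("4.0 mm^2", 2000)]:
    print(label, max_bumps(side_um), "bumps,", max_bandwidth_gbps(side_um), "Gbps")
```

The bump count, and with it the available bandwidth, scales linearly with brick area.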
What are the goals and constraints on brick size? The con-
strained I/O sets a lower-bound on feasible brick size. Early VLSI
engineers observed a phenomenon dubbed “Rent’s rule”. Rent’s
rule states that a circuit’s required I/O is proportional to a power of its area
(IO ∝ Area^β). While the precise constants used in the rule
change depending upon the type of circuit, the structure of the rule
does not [10]. It is important for our purposes, however, that the
I/O required by a block of circuitry grows at roughly the square
root of the area. Prior work [10] suggests that β = 0.45 for processors
and memory, and β = 0.6 for less structured logic. Because
the I/O available to a brick grows linearly with its area, there must
be some minimum brick size, below which the brick area will not
be sufficient to support the I/O demands of the circuitry the brick
contains.
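The minimum-size argument can be made concrete with a small sketch. Available I/O is taken as one bump per 25µm x 25µm of brick face, and required I/O follows Rent's rule with the exponents quoted above. The Rent constant k is hypothetical (the paper gives no value), so the resulting areas are illustrative only.

```python
BUMP_AREA_UM2 = 25 * 25   # one bump per 25um x 25um of brick face

def available_io(area_um2):
    """I/O supplied by the brick face; grows linearly with area."""
    return area_um2 / BUMP_AREA_UM2

def required_io(area_um2, k, beta):
    """Rent's rule: IO = k * Area**beta (k is a hypothetical constant)."""
    return k * area_um2 ** beta

def min_brick_area(k, beta):
    """Area at which supply meets demand: A / 625 = k * A**beta."""
    return (k * BUMP_AREA_UM2) ** (1.0 / (1.0 - beta))

K = 0.5  # hypothetical Rent constant, chosen only for illustration
for beta in (0.45, 0.6):
    print(f"beta={beta}: minimum area ~ {min_brick_area(K, beta):,.0f} um^2")
```

Below the crossover area the circuit needs more pins than the face supplies; above it the face offers more pins than the circuit can use, which is the maximum-useful-size effect described next.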
Bricks will also have a maximum useful size. Rent’s rule also
means that beyond some larger size, bricks will not be able to uti-
lize all of the I/O available to them. Brick designers should design
bricks that use the available I/O, because it is this I/O that connects
the fixed, inflexible brick designs in unique ways to produce unique
chip designs.
Finally, there can be multiple brick sizes. The more brick sizes
offered, the better the area and I/O offering of the bricks can match
the true area and I/O requirements of the circuit. We require that
the bricks conform to “standard” sizes because it is very difficult to
design an I/O cap to interconnect arbitrarily-sized bricks.
What are the goals and constraints on brick functionality? The
applications for which we envision using brick and mortar manu-
facturing are those which currently employ traditional ASICs. For
example, wireless transceivers, media encoding/decoding, system-
on-chip (SoC) integrations, etc. In these realms, the functional
blocks forming a design are fairly large: FFT engines, JPEG com-
pressors, embedded microprocessors.
Below, we address each of these three questions quantitatively,
based on synthesis data from candidate brick functions.
Brick size determination: To begin assembling a brick family, we
used freely available IP cores to produce a “benchmark suite”.
Starting with Verilog source code from OPENCORES.ORG [34] and
other sources of publicly available IP [51, 45, 46], we compiled the
designs with the Synopsys DC Ultra design flow [49], targeting the
90nm TSMC [50] ASIC process. We used a commercial memory
compiler [7] to generate optimized memory IP blocks.
Based on this data and the constraints outlined above, we conclude
that three brick sizes are reasonable: small (0.25mm²),
medium (1.0mm²) and large (4.0mm²). Table 1 shows the specifications
of the resulting brick assignments. Each brick size offers
a fixed I/O bandwidth based on the brick area. In Table 1 we have
converted these bandwidth limitations into upper bounds on the
brick clock speed. Based on prior work [19] we assume 2.5Gbps
per pin. This upper bound is also subject to the speed at which the
IP block can operate in a 90nm TSMC standard cell process. When
present, the lower frequency bound indicates the minimum speed
required to meet application requirements (e.g. an Ethernet device
must process data at the line rate).
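A minimal sketch of how such frequency bounds might be derived follows. The lower bound divides the required throughput by the number of bits the block handles per cycle; the datapath widths below are our assumption, chosen because they reproduce the lower bounds listed in Table 1 (2, 125 and 16 MHz). The upper bound takes the slower of the circuit's own maximum frequency and the limit imposed by the brick's I/O, which is passed in as a parameter here because the per-block pin counts are not given.

```python
import math

def min_freq_mhz(min_perf_mbps, bits_per_cycle):
    """Lowest clock (MHz) that sustains the required throughput."""
    return math.ceil(min_perf_mbps / bits_per_cycle)

def max_freq_mhz(circuit_max_mhz, io_limit_mhz):
    """Upper bound: the slower of the circuit itself and the brick's I/O."""
    return min(circuit_max_mhz, io_limit_mhz)

# (min. performance in Mbps from Table 1, ASSUMED datapath width in bits)
examples = {
    "USB 1.1 physical layer": (12, 8),
    "Tri-mode Ethernet": (1000, 8),
    "DES": (1000, 64),
}
for name, (mbps, width) in examples.items():
    print(name, min_freq_mhz(mbps, width), "MHz")

# FPU from Table 1: the circuit reaches 1515 MHz, but the small brick
# limits it to 505 MHz.
print("FPU on a small brick:", max_freq_mhz(1515, 505), "MHz")
```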
We have organized bricks according to their sizes. We assign IP
blocks to the smallest brick size which could meet their area and
application bandwidth needs. Note that none of the medium bricks
benefits from increasing the brick size, indicating that none of them
is I/O constrained. This is a direct effect of Rent’s rule. The higher
maximum clock frequency at a larger brick size indicates, however,
that five of the thirteen small bricks could take significant advantage
of the increased I/O bandwidth that a larger brick affords. In these
cases, we envision brick builders will do one of two things: (1)
provide two different brick sizes, with the smaller brick supporting
only lower frequency designs, or (2) more likely, they will redesign
the bricks to take advantage of the added area of a larger brick. We
did not investigate this aspect of brick design in this study, but one
option would be to group blocks of similar functionality (e.g., an
Ethernet and USB controller on the same “general purpose computing
I/O” brick). Another option is to tune buffer sizes on the design.
For example, the Aho-Corasick [51] block can use additional buffer
space to support more complex matching patterns.
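The assignment rule described above (smallest brick that meets the area and bandwidth needs) can be sketched as follows. The sketch assumes a circuit may occupy the full brick area, and it leaves the bandwidth test as a caller-supplied hook named `fits_bandwidth`, which is hypothetical; with the hook left at its default, the area check alone reproduces the small/medium/large grouping of Table 1.

```python
# Brick areas in square microns: 0.25, 1.0 and 4.0 mm^2.
BRICK_AREAS_UM2 = [("small", 250_000), ("medium", 1_000_000), ("large", 4_000_000)]

def assign_brick(circuit_area_um2, fits_bandwidth=lambda size: True):
    """Return the smallest brick that holds the circuit and passes the
    (hypothetical) bandwidth check, or None if no brick is big enough."""
    for size, area in BRICK_AREAS_UM2:
        if circuit_area_um2 <= area and fits_bandwidth(size):
            return size
    return None

# Circuit areas taken from Table 1.
for name, area in [("FPU", 85_250), ("FFT", 390_145), ("256K SRAM", 2_729_344)]:
    print(name, "->", assign_brick(area))
```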
3.2 I/O cap
The I/O cap is a silicon die that has four primary functions: (1)
power for the bricks; (2) clocks for the bricks; (3) I/O pads for con-
nectivity to external package pins; and (4) connectivity between
bricks. The first three offer little in the way of brick and mortar-
specific architectural questions, so we focus on the fourth to drive
the I/O cap design. Within this, two key questions are: Given an
application space and brick family, what is the best use of the lim-
ited number of communication pins into and out of a brick? How
do we design a single I/O cap that functions with a variety of brick
sizes? To answer these questions we return to our synthesis data.
Because the bricks come in three sizes, and because the partic-
ular arrangement of bricks will vary on a per-chip basis, the in-
terconnect in the I/O cap must be both multigranular and flexible.
Figure 2 illustrates the two network designs we propose.
Packet-switched interconnect: The first network is a dynamically
switched packet network. Panel (a) shows an example brick lay-
out, with an overlay of the logical packet-switched network. Each
circle represents a network node. The black nodes are leaf nodes
which are valid packet destinations. The nodes represent routers
in the interconnect. Within each 4mm² of silicon, the interconnect
is a fat-tree, and at the topmost level it is a grid. We coded and
synthesized a 64-bit packet implementation of this network using
Synopsys DC Ultra. The synthesis results indicated that this network
could operate at 800MHz, with one cycle per hop, and would
consume 43% of the I/O cap area. For a 64mm² I/O cap, the bisection
bandwidth of this network is 3.3Tbps.
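The quoted bandwidths can be approximately reproduced with a short calculation. This is our reading rather than a derivation given in the paper: it assumes that link capacity in the 4-ary fat tree quadruples at each level (consistent with the 51, 205 and 819 Gbps labels in Figure 2), that a 64mm² cap forms a 4 x 4 grid of 4mm² regions, and that a cut through the middle of the grid severs four top-level links.

```python
LINK_BITS = 64       # packet width, from the text
CLOCK_HZ = 800e6     # network clock, from the text
FANOUT = 4           # 4-ary fat tree, from the Figure 2 caption

def link_gbps(level):
    """Bandwidth of a link `level` steps above a leaf, ASSUMING capacity
    is multiplied by the fan-out at each level of the fat tree."""
    return LINK_BITS * CLOCK_HZ * FANOUT ** level / 1e9

def grid_bisection_tbps(grid_side, top_level=2):
    """Bisection bandwidth of a grid_side x grid_side top-level grid,
    ASSUMING a middle cut severs grid_side top-level links."""
    return grid_side * link_gbps(top_level) / 1e3

for level in range(3):
    print(f"level {level}: {link_gbps(level):.1f} Gbps")
print(f"4 x 4 grid bisection: {grid_bisection_tbps(4):.2f} Tbps")
```

Under these assumptions the per-level link rates come to 51.2, 204.8 and 819.2 Gbps, and the bisection comes to about 3.28 Tbps, in line with the 3.3Tbps figure above.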
FPGA-style interconnect fabric: The second interconnect option
is an island-style reconfigurable interconnect, shown in Figure 2,
panel (b), over which pins in the I/O cap are programmatically
connected. Just as with an FPGA, the connecting wires are routed
through this mesh. Since wires are constrained at the brick-to-I/O
cap interface, we utilize the same physical wires as the packet net-
work and mux them between the two networks dynamically.
As with the packet-switched network, we synthesized a config-
urable single-bit wiring node. Area results from DC Ultra indicate
that such a node requires 155 square microns. Leaving area for
remaining logic on the I/O cap (power distribution, clock, pad drivers,
etc.), we estimate room for approximately 500 switches per small
brick. We devote 400 of these switches to a 20 by 20 fully con-
figurable mesh, and another 64 switches to a partially configurable
one. Figure 2 illustrates this design. The benefit of this approach
is that by enforcing a small amount of standardization on the pin
interface, bricks can utilize the mesh to route large 64-bit items to
their neighbors. We also retain some flexibility with the 20 fully
configurable routes. The bisection bandwidth of this network is
0.26Tbps (fully switchable) and 0.8Tbps (partially configurable).
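A back-of-the-envelope check of these figures is sketched below. The switch area and switch counts come from the text. The bisection estimate rests on our assumptions that the cap is the 64mm² one discussed earlier (8mm on a side, so 16 small bricks across) and that each mesh wire carries 800Mbps, matching the packet network clock.

```python
SWITCH_AREA_UM2 = 155            # synthesized single-bit node, from the text
SMALL_BRICK_AREA_UM2 = 250_000   # 0.25 mm^2 of I/O cap under a small brick
FULLY_CONFIGURABLE = 20 * 20     # 400 switches in the 20 by 20 mesh
PARTIALLY_CONFIGURABLE = 64      # switches in the partially configurable mesh

def switch_area_fraction(n_switches):
    """Share of the cap area under one small brick taken by switches."""
    return n_switches * SWITCH_AREA_UM2 / SMALL_BRICK_AREA_UM2

def mesh_bisection_tbps(wires_per_brick_edge, bricks_across=16, wire_mbps=800):
    """Bisection bandwidth, ASSUMING a 16-brick-wide cap and 800 Mbps
    per wire (both are our assumptions, not stated in the text)."""
    return wires_per_brick_edge * bricks_across * wire_mbps / 1e6

used = FULLY_CONFIGURABLE + PARTIALLY_CONFIGURABLE
print(f"{used} switches use {switch_area_fraction(used):.1%} of the area")
print(f"fully configurable: {mesh_bisection_tbps(20):.3f} Tbps")
print(f"partially configurable: {mesh_bisection_tbps(64):.3f} Tbps")
```

With these assumptions the two meshes come to roughly 0.26 and 0.82 Tbps, consistent with the bisection figures quoted above.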
While the packet-switched network is most useful for routing
data between dynamically changing sources and destinations, this
mesh is better suited to tightly and statically coupling two bricks,
particularly two that are physically near one another. We utilize
this, for example, to bind the FPU to the CPU for one CMP
configuration in the next section.
[Figure 2 image omitted. Panels (a) and (b); labels: 64 bits, 20 bits, 0.5 mm, 51 Gbps, 205 Gbps, 819 Gbps.]
Figure 2: I/O Cap Interconnects: The I/O cap offers two inter-brick interconnects. The first, in panel (a), is a dynamically-routed,
packet-switched network. The routers, represented by circles in the figure, are organized into a 4-ary fat tree. The network runs
at 800MHz and requires a single cycle hop between routers. The black routers represent valid routing destinations for the example
brick layout. The second interconnect in panel (b) is an island-style, statically programmable, mesh interconnect that can connect
pins directly to one another. These two interconnects support two different styles of inter-brick communication. The first supports
dynamic and variable communication while the second can tightly couple bricks in a fixed pattern.
Chip Multiprocessor Designs         CMP-L                           CMP-M                           CMP-S
Total Area (mm²)                    193.5                           177.5                           200.5

Chip Composition                    Count           % Area          Count           % Area          Count           % Area
Small Bricks (.5x.5 mm)
  RISC CORE (NO FPU) + 8K CACHE     -               N/A             -               N/A             16              1.99%
  FPU                               -               N/A             -               N/A             16              1.99%
  ETHERNET NIC                      1               0.13%           1               0.14%           1               0.12%
  MEM CNTL                          1               0.13%           1               0.14%           1               0.12%
  USB PHYS LAYER                    1               0.13%           1               0.14%           1               0.12%
  DMA                               1               0.13%           1               0.14%           1               0.12%
  PCI BRIDGE                        1               0.13%           1               0.14%           1               0.12%
  VGA/LCD CNTL                      1               0.13%           1               0.14%           1               0.12%
Medium Bricks (1x1 mm)
  RISC CORE + 64K CACHE             -               N/A             16              9.01%           -               N/A
Large Bricks (2x2 mm)
  RISC CORE + 256K CACHE            16              33.07%          -               N/A             -               N/A
  256K SRAM                         32              66.15%          40              90.14%          48              95.29%

Simics/GEMS Performance Simulation  Brick & Mortar  ASIC            Brick & Mortar  ASIC            Brick & Mortar  ASIC
  Number of Cores                   16              16              16              16              16              16
  L1 Cache / Core (KB)              256             256             128             128             8               8
  L2 Cache Size (MB)                8               8               10              10              12              12
  L2 Associativity                  4               4               5               5               6               6
  L2 Block Size (B)                 64              64              64              64              64              64
  L2 Set Size (KB)                  32              32              32              32              32              32
  Processor Cycles to L2            31              22              41              22              50              22
  Exe. Time (Avg.)                  108%            100%            120%            100%            136%            100%
Table 2: CMP configurations: The following table describes the three CMP configurations used in our study. We have focused on
building CMPs from three different size RISC core bricks. CMP-L features high integration with a large brick combining processor
and L1 cache. CMP-M integrates a much smaller L1 cache onto a medium brick with the processor, while CMP-S offers only 8K of
L1 cache with the processor on a small brick.
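The composition figures in Table 2 can be cross-checked from the brick counts alone. The sketch below sums brick areas per configuration and derives the L2 capacity from the number of 256K SRAM bricks; it is an arithmetic check based on our reading of the table, not part of the original evaluation.

```python
BRICK_AREA_MM2 = {"small": 0.25, "medium": 1.0, "large": 4.0}

# Brick counts per configuration, as read from Table 2.
CONFIGS = {
    "CMP-L": {"small": 6, "medium": 0, "large": 16 + 32},
    "CMP-M": {"small": 6, "medium": 16, "large": 40},
    "CMP-S": {"small": 16 + 16 + 6, "medium": 0, "large": 48},
}
SRAM_BRICKS = {"CMP-L": 32, "CMP-M": 40, "CMP-S": 48}

def total_area_mm2(counts):
    """Sum of brick areas for one chip configuration."""
    return sum(BRICK_AREA_MM2[size] * n for size, n in counts.items())

def l2_size_mb(n_sram_bricks, kb_per_brick=256):
    """L2 capacity if every 256K SRAM brick contributes to the L2."""
    return n_sram_bricks * kb_per_brick / 1024

for name, counts in CONFIGS.items():
    print(name, total_area_mm2(counts), "mm^2,", l2_size_mb(SRAM_BRICKS[name]), "MB L2")
```

Under this reading CMP-L and CMP-M come to 193.5 and 177.5 mm², matching the table, and the L2 sizes come to 8, 10 and 12 MB. CMP-S sums to 201.5 mm² against the 200.5 mm² listed, so the table may count that configuration slightly differently.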