Tiled Multi-Core Stream Architecture
Nan Wu, Qianming Yang, Mei Wen, Yi He, Ju Ren, Maolin Guan, Chunyuan Zhang
Computer School, National University of Defense Technology
Changsha, Hunan, P. R. China, 410073
nanwu@nudt.edu.cn
Abstract. Conventional stream architectures focus on exploiting ILP and DLP
in applications, although the stream model also exposes abundant TLP at kernel
granularity. On the other hand, with the development of modern VLSI
technology, increasing application demands and scalability requirements
challenge conventional stream architectures. In this paper we present a novel
Tiled Multi-Core Stream Architecture called TiSA. TiSA introduces the tile,
which consists of multiple stream cores, as a new category of architectural
resource, and uses an on-chip network to support stream transfer among tiles.
In TiSA, multiple levels of parallelism are exploited at different granularities
of processing elements. Besides hardware modules, this paper also discusses
other key issues of the TiSA architecture, including the programming model,
various execution patterns and resource allocation. We then evaluate the hardware
scalability of TiSA by scaling it from tens to thousands of ALUs and estimating
its area and delay cost. We also evaluate the software scalability of TiSA by
simulating 6 stream applications and comparing sustained performance with other
stream processors, general purpose processors, and different configurations of
TiSA. A 256-ALU TiSA with 4 tiles and 4 stream cores per tile is shown to be
feasible in 45 nanometer technology, sustaining 100~350 GFLOP/s on most stream
benchmarks and providing ~10x speedup over a 16-ALU TiSA with a 5%
degradation in area per ALU. The results show that TiSA is a VLSI- and
performance-efficient architecture for the billion-transistor era.
1 Introduction
With VLSI technology scaling, increasing numbers of transistors can fit onto one
chip. In 2010, more than 1 billion transistors may be integrated on a chip. In a
45-nanometer technology, a 32-bit ALU with multiplier requires only 0.044 mm² of
chip area. In this technology, over two thousand of these multipliers could fit on a
single 1 cm² chip and could easily be pipelined to operate at over 1 GHz. At this
speed, these multipliers could provide 2 Teraops per second of arithmetic bandwidth
[1]. The challenge is translating this physical potential into efficient computational
resources. In general purpose processors, most of the chip area is used to exploit
Instruction Level Parallelism (ILP) and to implement a cache memory hierarchy that
reduces average memory access latency.
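The figures quoted above can be checked with a few lines of arithmetic. The sketch below is only a back-of-envelope estimate: it assumes the whole 1 cm² die is filled with ALUs (no wiring, registers or memory) and that each ALU completes one operation per cycle. The function names are illustrative and do not come from the paper.

```python
# Back-of-envelope check of the ALU density and throughput figures.
# Assumptions (not from the paper): the entire die is ALUs, and each ALU
# completes one operation per cycle.

ALU_AREA_MM2 = 0.044     # 32-bit ALU with multiplier at 45 nm
CHIP_AREA_MM2 = 100.0    # 1 cm^2 expressed in mm^2
CLOCK_HZ = 1e9           # 1 GHz


def alus_per_chip(chip_area_mm2=CHIP_AREA_MM2, alu_area_mm2=ALU_AREA_MM2):
    """Number of whole ALUs that fit in the given die area."""
    return int(chip_area_mm2 // alu_area_mm2)


def peak_ops_per_second(num_alus, clock_hz=CLOCK_HZ, ops_per_alu_per_cycle=1):
    """Aggregate arithmetic bandwidth in operations per second."""
    return num_alus * clock_hz * ops_per_alu_per_cycle


if __name__ == "__main__":
    n = alus_per_chip()
    print(n, "ALUs,", peak_ops_per_second(n) / 1e12, "Teraops/s")
```

Under these assumptions the count comes out a little above two thousand ALUs and the bandwidth a little above 2 Teraops per second, which is consistent with the rounded numbers in the text.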
For some important application domains, such as multimedia, graphics,
cryptography and scientific modeling, special architectures based on the stream model
may meet this challenge. The stream model is a programming model that
originated from the vector parallel model. In the stream programming model, an
application is usually composed of a collection of data streams passing through a
series of computation kernels. Each stream is a sequence of homogeneous data
records. Each kernel is a loop body that is applied to each record of the input stream
[2]. At the stream program level, the producer-consumer relation between kernels and
streams explicitly exposes TLP between kernels, and between loading streams and
running kernels, while at the kernel program level, data streams and the intensive
computation inside a kernel expose abundant DLP and ILP. The high predictability and
abundant parallelism of the stream model provide significant advantages. In a stream
architecture, more software and compiler management techniques can be used to
avoid dynamic extraction of instruction parallelism, such as superscalar execution
and instruction speculation. Moreover, because a stream architecture processes a
batch of data at a time, the long latency of memory references and communication
can be tolerated, and the emphasis is placed on achieved throughput. Thus,
application-specific architectures that are based on or compatible with the stream
model have achieved cost-efficient high performance, such as Imagine [6],
Merrimac [12], FT64 [13] and CELL [9].
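To make the stream model concrete, here is a minimal sketch in Python. It is not code for any of the processors cited above. A stream is modeled as an iterable of homogeneous records, and a kernel as a loop body applied to every record. The names `kernel`, `scale`, `magnitude_sq` and `pipeline` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, Iterator, Tuple

Record = Tuple[float, float]  # one homogeneous record: a 2-D point


def kernel(body: Callable) -> Callable[[Iterable], Iterator]:
    """Wrap a per-record loop body so it can be applied to a whole stream."""
    def run(stream: Iterable) -> Iterator:
        for record in stream:
            yield body(record)
    return run


# Two hypothetical kernels; each one only sees one record at a time.
scale = kernel(lambda r: (2.0 * r[0], 2.0 * r[1]))
magnitude_sq = kernel(lambda r: r[0] * r[0] + r[1] * r[1])


def pipeline(stream: Iterable[Record]) -> Iterator[float]:
    """Producer-consumer chain: the output stream of scale feeds magnitude_sq."""
    return magnitude_sq(scale(stream))


if __name__ == "__main__":
    input_stream = [(1.0, 2.0), (3.0, 4.0)]
    print(list(pipeline(input_stream)))
```

The generators here run lazily on a single thread, so the sketch shows only the dataflow structure. On a stream processor the two kernels could run concurrently (TLP at kernel granularity), and each loop body could be applied to many records at once (DLP).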
Along with the increasing demand for processor performance and the broadening
application domain of the stream model, stream applications and algorithms are
becoming more complex. For SBR and UAV, the demand for performance had already
reached 1 TFLOPS in 2004 [3]. Achieving this performance requires at least 1000
ALUs running at 1 GHz on a chip. This requires a stream architecture with good
scalability of performance and resources. Therefore, in the era of more than one
billion transistors on a chip, architectural innovation is necessary to maintain the
performance and cost efficiency of stream architectures.
This paper presents the Tiled Multi-core Stream Architecture (TiSA). We describe
TiSA’s hardware architecture, programming model and resource allocation. Then we
analyze the hardware overhead based on a scaling cost model, and evaluate TiSA’s
performance for several applications. The results show that TiSA is a VLSI- and
performance-efficient architecture for the billion-transistor era.
2 Motivation
The scaling of conventional stream processors has reached its limit. A typical stream
architecture, Imagine, supports ILP and DLP and consists of 8 clusters (4
ALUs/cluster); both the number of clusters and the number of ALUs within a cluster
can be increased (two scaling dimensions: inter-cluster and intra-cluster). However,
prior research has shown that a stream architecture like Imagine achieves its best
performance efficiency with 16 clusters (4 ALUs/cluster). Scaling beyond
64 ALUs decreases performance efficiency, and the
drawback becomes more pronounced as the number of ALUs increases [1].
On the other hand, the application domain of conventional stream
processors tends to narrow, for the following reasons: 1. The original two-level
scaling methods increase the demand for DLP in the application, but a stream
application may not provide more DLP or longer streams, which results in
inefficient computation or the short stream effect. 2. Intensive computation is
necessary to make full use of a stream architecture. For a stream processor with
48 ALUs, the threshold is about 33 Ops/w (33 arithmetic operations per memory
reference), and it grows annually for future stream processors because off-chip
bandwidth grows more slowly than arithmetic bandwidth. That means the requirement
on an application’s computational intensity is increasing. 3. As stream application
domains broaden, supporting only ILP and DLP is not enough for a stream
architecture. 4. The single SIMD execution mode imposes restrictions, because
stream applications contain irregular streams that may not be suitable for SIMD
mode [4].
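The computational-intensity threshold in point 2 can be illustrated with a simple machine-balance model. This is an assumption made for the sketch, not the cost model used in the paper. Each ALU is taken to issue one operation per cycle, and sustained throughput is taken to be the minimum of the arithmetic peak and the memory bandwidth multiplied by the operations performed per memory word. The paper does not state the memory bandwidth, so the value below is back-derived from the quoted data point (48 ALUs, about 33 Ops/w).

```python
def intensity_threshold(num_alus, words_per_cycle):
    """Ops per memory word needed to keep every ALU busy (balance model)."""
    return num_alus / words_per_cycle


def attainable_ops_per_cycle(num_alus, words_per_cycle, ops_per_word):
    """Sustained ops per cycle: limited by the ALUs or by memory bandwidth."""
    return min(num_alus, words_per_cycle * ops_per_word)


# Memory bandwidth implied by the quoted data point under this model
# (an inference for illustration, not a number given in the paper).
IMPLIED_WORDS_PER_CYCLE = 48 / 33

if __name__ == "__main__":
    bw = IMPLIED_WORDS_PER_CYCLE
    # A kernel with half the threshold intensity keeps only half the ALUs busy.
    print(attainable_ops_per_cycle(48, bw, 16.5))
    # Doubling the ALU count at fixed bandwidth doubles the threshold.
    print(intensity_threshold(96, bw))
```

In this model, adding ALUs without adding off-chip bandwidth raises the threshold proportionally, which is the trend described in the text.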
[Figure 1 panels: (a) VIRAM, vector, 4 vector units; (b) IMAGINE, stream, 8 clusters x 8 ALUs; (c) RAW, fine-grain CMP, 64 in-order cores; (d) TRIPS, 4 ultra-large cores x 16 ALUs; (e) MASA, 4 macro-tiles x 4 stream cores x 4 ALUs. Arrow labels: "Exploits fine-grain parallelism TLP", "Exploits more type of parallelism ILP&DLP", "Runs more applications effectively", "Lower complexity".]
Fig. 1. Granularity of parallel processing elements on a chip
In fact, stream applications usually exhibit a mix of ILP, DLP and TLP, with
several types of control and memory reference behavior. Supporting
multiple parallel execution modes simultaneously is more suitable for them. There are
several approaches. Figure 1 shows several typical stream architectures. The
important difference among them is the granularity of parallel processing. The left two
processors in figure 1, VIRAM [5] and Imagine [6], work in SIMD mode, with
DLP granularities of word and record respectively. Both of them provide high
performance for stream applications but are not suitable for irregular streams or
for exploiting TLP. The right two processors in figure 1, RAW [7] and TRIPS [8],
support multiple parallel execution modes simultaneously and are general enough for
several types of workloads, including stream applications. RAW uses a uniform set of
hardware to support all parallel execution modes, while TRIPS uses different sets of
hardware to support different parallel execution modes. Both of them are
tiled architectures and also CMPs, which suit TLP rather than kernel level
parallelism, or the DLP and TLP inside a kernel. From figure 1 we can see that, from
left to right, the parallel processing elements are gradually decomposed into tiles to
exploit fine-grain TLP, while from right to left the processors support fewer
parallel execution modes and their complexity becomes lower.
It can also be seen that, from top to bottom, the granularity of the processing unit
becomes larger, which enables more applications to run efficiently.
Thus, a novel stream architecture should aim to run more stream applications by
increasing the granularity of the processing unit and supporting several parallel
execution modes. The Tiled Multi-Core Stream Architecture (TiSA) is shown in the
middle of figure 1. It supports several types of parallel execution modes, including
TLP exploited by tiles, kernel level parallelism exploited by stream cores, and ILP
and DLP exploited by the ALUs inside each core. Compared to conventional stream
architectures, TiSA provides the following advantages: 1. The number of tiles can be
scaled without demanding more DLP from the stream application. This improves the
stream processor’s scalability, as described in section 6. 2. Supporting several
parallel execution patterns enables more applications with different parallelism
requirements to run on TiSA. 3. It can reduce the threshold of intensive computation.
Since each core may execute a kernel, several cores share the computation that was
previously executed on a single core. 4. Fault tolerance can be
implemented by executing the same task on different tiles. 5. It inherits the
advantages of multi-core processors. In a multi-core processor, the on-die
communication latency is about 20 cycles and the on-die interconnect bandwidth
between cores will be >100 terabytes/second [10].
3 The TiSA Architecture
The TiSA architecture uses large, coarse-grained tiles to achieve high performance
for stream applications with high computing intensity, and augments them with
multiple execution pattern features that enable a tile to be subdivided for explicitly
concurrent applications at different granularities. In contrast to conventional
large-core designs with centralized components that are difficult to scale, the TiSA
architecture is heavily partitioned to avoid large centralized structures and long
wire runs. These partitioned computation and memory elements are connected by 2D
on-chip networks with multiple virtual point-to-point communication channels that
are exposed to the software schedulers described in section 5 for optimization.
Figure 2 shows a diagram of the TiSA architecture, which consists of the following
major components.
A tile is a partition of computation elements. It consists of multiple stream cores
and a network bridge. As shown in Figure 2, four stream cores compose a tile. The
organization is tightly coupled between stream cores within a tile and loosely coupled
between tiles.
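The resource hierarchy (tiles, stream cores, clusters, ALUs) can be written down as a small configuration sketch. It uses the 256-ALU configuration from the abstract (4 tiles, 4 stream cores per tile) and the 4 clusters x 4 ALUs stream core quoted in this section. The class name `TiSAConfig` and its field names are hypothetical. Counting one multiply-add as two floating-point operations is an assumption, chosen because it reproduces the 32 operations per cycle the paper quotes for such a core.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TiSAConfig:
    """Illustrative description of a TiSA configuration (hypothetical names)."""
    tiles: int = 4
    cores_per_tile: int = 4
    clusters_per_core: int = 4
    alus_per_cluster: int = 4
    flops_per_alu_per_cycle: int = 2  # assumption: multiply-add counted as 2 flops

    @property
    def stream_cores(self) -> int:
        return self.tiles * self.cores_per_tile

    @property
    def alus_per_core(self) -> int:
        return self.clusters_per_core * self.alus_per_cluster

    @property
    def total_alus(self) -> int:
        return self.stream_cores * self.alus_per_core

    @property
    def peak_flops_per_cycle_per_core(self) -> int:
        return self.alus_per_core * self.flops_per_alu_per_cycle


if __name__ == "__main__":
    cfg = TiSAConfig()
    print(cfg.stream_cores, cfg.total_alus, cfg.peak_flops_per_cycle_per_core)
    # One hypothetical way to obtain a 16-ALU machine: a single stream core.
    print(TiSAConfig(tiles=1, cores_per_tile=1).total_alus)
```

Keeping the levels as separate fields mirrors the scaling dimensions discussed in the paper: the tile and core counts can change independently of the cluster and ALU counts inside a core.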
[Figure 2 panels: TiSA Top (Tile and Interconnect Network, with stream cores, host cores, memory cores, IO cores and network bridges, NB); Stream Core; Memory/IO Core; Host Scalar Core; Network Interface; Network Bridge.]
Fig. 2. TiSA architecture overview
Stream Core, which adopts a simplified classical stream processor architecture [2],
is the basic computation module of TiSA. It is optimized for executing kernels of
stream applications. Each stream core has its own instruction controllers (a Stream
Controller and a Micro-controller), its own data storage (a Stream Register File,
SRF), and multiple arithmetic clusters. The clusters are controlled by the
micro-controller in a SIMD+VLIW pattern. A cluster is composed of a set of fully
pipelined ALUs, each performing one multiply-add operation per cycle, and some
non-ALU function units, including 1 iterative unit to support operations like divide
and square-root, 1 juke-box unit to support conditional streams [1], 1 COMM unit
connected to an inter-cluster switch for data communication between clusters, and a
group of local register files. Taking a stream core configured with 4 clusters x 4
ALUs as an example, a peak arithmetic performance of 32 floating-point operations
per cycle can be achieved in a single core. In order to exploit DLP and ILP
efficiently, the components of the stream core mentioned above are scalable, leading
to varied configurations and tradeoffs at design time. Furthermore, in our future
plans the stream core could be heterogeneous, and special function units or
reconfigurable circuits could even be used. Besides this,
there are some standard components in all types of cores, including the Network Interface