A Low-Power Multithreaded Processor for Software Defined Radio
Michael Schulte2, John Glossner1,3, Sanjay Jinturkar1, Mayan Moudgill1,
Suman Mamidi2, and Stamatis Vassiliadis3
1 Sandbridge Technologies, 1 North Lexington Ave., White Plains, NY, 10512, USA
2 University of Wisconsin, Dept. of ECE, 1415 Engineering Drive, Madison, WI, 53706, USA
3 Delft University of Technology, Electrical Engineering, Mathematics and Computer Science Department, Delft, The Netherlands
Abstract. Embedded digital signal processors for software defined radio have stringent design
constraints including high computational bandwidth, low power consumption, and low inter-
rupt latency. Furthermore, due to rapidly evolving communication standards with increasing
code complexity, these processors must be compiler-friendly, so that code for them can
quickly be developed in a high-level language. In this paper, we present the design of the
Sandblaster Processor, a low-power multithreaded digital signal processor for software de-
fined radio. The processor uses a unique combination of token triggered threading, powerful
compound instructions, and SIMD vector operations to provide real-time baseband processing
capabilities with very low power consumption. We describe the processor’s architecture and
microarchitecture, along with various techniques for achieving high performance and low
power dissipation. We also describe the processor’s programming environment and the
SB3010 platform, a complete system-on-chip solution for software defined radio. Using a su-
per-computer class vectorizing compiler, the SB3010 achieves real-time performance in soft-
ware on a variety of communication protocols including 802.11b, GPS, AM/FM radio, Blue-
tooth, GPRS, and WCDMA. In addition to providing a programmable platform for SDR, the
processor also provides efficient support for a wide variety of digital signal processing and multimedia applications.
1 Introduction
General purpose processors have utilized various microarchitectural techniques, such as deep pipelines,
multiple instruction issue, out-of-order instruction issue, and speculative execution to achieve very high
performance. Recently, simultaneous multithreading (SMT) processors, in which multiple hardware
threads simultaneously issue multiple instructions per cycle, have been deployed. These techniques have
produced performance increases at high complexity and power dissipation costs.
In the embedded digital signal processing (DSP) community, power dissipation and real-time processing
constraints have typically precluded general purpose microarchitectural techniques. Rather than minimizing
average execution time, embedded DSP processors often require the worst-case execution time to be
minimized in order to satisfy real-time constraints. Consequently, very long instruction word (VLIW) or
statically scheduled microarchitectures with architecturally visible pipelines are typically employed [4-8].
Unfortunately, exposing pipelines may pose interrupt latency restrictions, particularly if all memory loads
must complete prior to servicing an interrupt. Furthermore, on-chip memory access in DSP systems has
traditionally operated at the processor clock frequency. Although this eases the programming burden
and allows single-cycle on-chip memory accesses, it often restricts the maximum processor clock frequency.
Traditional wireless communication systems have typically been implemented using custom hardware so-
lutions. Chip rate, symbol rate, and bit rate coprocessors are often coordinated by a programmable DSP,
but the DSP does not typically participate in physical layer processing [9,10]. Even when supporting a
single communication system, the hardware development cycle for these systems is onerous and often re-
quires multiple chip redesigns late in the certification process. When multiple communication systems must
simultaneously be supported, silicon area and design validation are major inhibitors to commercial success.
A software-based platform that is capable of being dynamically reconfigured for different communication
systems enables elegant reuse of silicon area and reduced time-to-market through software modifications,
instead of time-consuming hardware redesigns. Software-based platforms also allow wireless devices to be
reconfigured to implement emerging wireless communication standards, thereby decreasing product development time.
Software Defined Radios (SDRs), which provide a programmable and dynamically reconfigurable
method for implementing the physical layer processing of multiple communication systems, have been
widely recognized as one of the most important new technologies for wireless communication systems.
SDRs have a significant advantage over traditional communication devices, because they can support sev-
eral communication systems in software. For example, a single SDR implementation might provide support
for WCDMA, GPRS, WLAN, and GPS.
In this paper, we present the Sandblaster Processor, a low-power multithreaded digital signal processor
for SDR. In Section 2, we give an overview of the processor and its compound instruction set architecture.
In Section 3, we present a low-power multithreaded microarchitecture, in which multithreading is utilized to
reduce power consumption and simplify programming. We also describe a non-blocking, fully inter-
locked pipeline implementation with reduced hardware complexity that allows on-chip memory to operate
significantly slower than the processor cycle time without inducing pipeline stalls. In Section 4, we present
the design of the single-instruction-multiple-data (SIMD) vector unit and discuss a novel approach for per-
forming saturating dot products. In Section 5, we discuss the processor’s programming environment. In
Section 6, we present the SB3010, a complete system-on-chip (SoC) platform for SDR and demonstrate the
ability of the SB3010 to perform real-time physical layer processing of various communication standards in
software. In Section 7, we give our conclusions. This paper is an extension of our previously presented research.
2 Processor Design
Sandbridge Technologies has designed a multithreaded processor capable of efficiently executing DSP,
embedded control, and Java code in a single compound instruction set optimized for SDR applications [13-
16]. The Sandblaster Processor overcomes the deficiencies of previous approaches by providing substantial
parallelism and throughput for high-performance DSP applications, while maintaining fast interrupt re-
sponse, high-level language programmability, and very low power dissipation. The design utilizes a unique
combination of modern techniques including hardware support for multiple threads, SIMD vector process-
ing, and instruction set support for Java code. Program memory is conserved through the use of powerful
compound instructions that may issue multiple operations per cycle. Architecturally, it is possible to turn
off the entire processor. All clocks may be disabled or the processor may idle with clocks running. Each
hardware thread unit may also be disabled to reduce toggling.
Figure 1 shows a block diagram of the processor, which is partitioned into three data processing
units: a program flow control unit, an integer/load-store unit, and a SIMD vector unit. The program flow
control unit is the brain of the processor. It performs instruction fetch and decode, instruction address cal-
culations, and interrupt processing. The integer/load-store unit performs scalar arithmetic and logic opera-
tions, data address calculations, memory access operations, and special-purpose register manipulations.
The SIMD vector unit, described in Section 4, simultaneously performs the same operation on four sets of
vector elements and facilitates high-speed execution of SDR applications.
The processor core also includes an instruction cache, data memory, and bus/memory interface unit. The
64KB, 4-way set associative instruction cache stores instructions to be fetched for each thread. An associa-
tive cache is used to reduce the likelihood of one thread evicting another thread’s active program. In our
implementation, a thread identifier register is used to select whether the line from the left or right bank is
evicted, which reduces the complexity of the line selection logic. The 64KB, 8-bank data memory
stores data for each thread. Using a pre-loaded data memory, instead of a data cache, facilitates the stream-
ing nature and real-time requirements of SDR applications. The bus/memory interface unit provides access
to level-2 (L2) memory, other processor cores, and the rest of the system.
Fig. 1. Sandblaster processor microarchitecture.
2.1 Processor Pipelines
Pipelines for one particular implementation of the Sandblaster Processor are shown in Figure 2. The exe-
cution pipelines are different for various operations. The Load/Store (Ld/St) pipeline is shown to have nine
stages, and it is assumed that the instruction has already been fetched. The first stage decodes the instruc-
tion. This is followed by a read from the general-purpose register file. The next stage generates the address
to perform the Load or Store. Five cycles are used to access data memory. Finally, the result is written
back to the register file. Once an instruction from a particular thread enters the pipeline, it runs to comple-
tion. It is also guaranteed to write back its result before the next instruction from the same thread tries to
read the result. The number of pipeline stages for each instruction and the maximum number of hardware
threads are selected to provide a short cycle time and sufficient thread-level parallelism for a variety of
SDR applications. The number of cycles to access memory is selected to allow both the processor and
memory to operate near the peak linear power-performance range, as explained in Section 3.1.
Fig. 2. Processor pipelines.
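The run-to-completion guarantee above can be checked with a little arithmetic. The sketch below is our own model, not the actual pipeline control logic; it assumes a thread issues one instruction every T cycles in strict round-robin order, decodes in the first stage, reads its registers in the second, and writes back in the last stage:

```c
#include <assert.h>

/* Sketch: with T hardware threads issuing in strict round-robin order, an
 * instruction occupies pipeline stages 0..D-1 in consecutive cycles and
 * writes back in stage D-1.  The next instruction from the same thread
 * issues T cycles later and reads its registers one cycle after decode.
 * No bypass or interlock hardware is needed as long as write-back happens
 * strictly before that read. */
int writeback_precedes_next_read(int depth, int threads)
{
    int issue_cycle     = 0;                     /* decode of instruction i   */
    int writeback_cycle = issue_cycle + depth - 1;
    int next_issue      = issue_cycle + threads; /* decode of instruction i+1 */
    int next_read       = next_issue + 1;        /* register-read stage       */
    return writeback_cycle < next_read;
}
```

With the nine-stage Ld/St pipeline and eight threads described above, write-back lands on cycle 8 while the next register read from the same thread occurs on cycle 9, so the guarantee holds with one cycle to spare.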
The number of execute stages varies across the other execution pipelines. The integer/load-store unit has two
execute stages for arithmetic and logic (ALU) operations and three execute stages for integer multiplication
(I_MUL) operations. The Wait stage for the ALU and I_MUL operations causes the operands to be read from
the general-purpose register file one cycle later than the operands for load and store operations. This helps
reduce the number of register file read ports, as is explained in Section 4.1. The vector multiplication
(V_MUL) instruction has four execute stages: two for multiplication and two for addition. An additional
transfer (Xfer) stage is allocated between the computation of a result and writing the result back to the
register file to account for delays due to long wires in deep submicron design.
An important point is that the write back stages of the operations are staggered and different instructions
may read from or write to different register files. This allows a single write port to be implemented per
register file, but provides the same functionality as multiple write ports. The processor does not stall pro-
vided threads issue as even/odd pairs. This allows the register files to be banked, giving the power dissipa-
tion of a single write port. With the complexity of a single write port per register file, the processor can
sustain more than 3.9 taps per cycle on typical DSP filters, including the overhead of entering and exiting
the loop. This is in contrast with VLIW designs that may require many write ports to achieve the same functionality.
2.2 Compound Instructions
Historically, DSPs have used compound instruction set architectures to conserve instruction space encod-
ing bits. In contrast, VLIW architectures are often completely orthogonal, but only encode a single opera-
tion per instruction field, such that a single VLIW is composed of multiple instruction fields. This has the
disadvantage of requiring many instruction bits to be fetched per cycle, as well as several register file write
ports. Both of these features contribute heavily to power dissipation. For example, Texas Instruments'
TMS320C62x VelociTi processor fetches up to eight 32-bit instructions each cycle. It has two general-
purpose register files, and each register file has 13 read ports and 9 write ports.
The Sandblaster Processor has a compound instruction set architecture, in which specific fields within a
64-bit compound instruction may issue powerful compound operations, including SIMD vector operations.
Certain combinations of compound operations are not allowed within the same instruction to reduce area
and power dissipation. Figure 3 illustrates the compound nature of our architecture. It shows a single com-
pound instruction with three compound operations. This instruction implements the inner loop of a vector
sum-of-squares computation. The first compound operation, lvu, loads the vector register vr0 with four
16-bit elements and updates the address pointer r3 to the next element. The vector multiply-reduce-
saturate operation, vmulreds, reads four fixed-point fractional 16-bit elements from vr0, multiplies
each element by itself, saturates each product, adds all four saturated products plus an accumulator regis-
ter, ac0, with saturation after each addition, and stores the result back in ac0. The loop operation
decrements the loop count register lc0, compares it to zero, and branches to address L0 if the result is not zero.
L0: lvu %vr0, %r3, 8
|| vmulreds %ac0,%vr0,%vr0,%ac0
|| loop %lc0,L0
Fig. 3. A single compound instruction for a vector sum-of-squares loop.
All the code shown in Figure 3 is encoded in a single 64-bit compound instruction. Each compound op-
eration, including each vector operation, is specified with at most 21 bits. As in most DSP architectures,
certain combinations of operations may not be specified within the same instruction. The 64-bit instruction
shown in Figure 3 may require 256 bits or more to encode on a VLIW machine. Furthermore, since the
pipeline in a VLIW machine typically produces architecturally visible side effects (i.e. it is not transparent),
it may take a deeply software pipelined loop to obtain single-cycle throughput, thereby exploding instruc-
tion storage requirements. To further distinguish our approach from VLIW and exposed pipeline architec-
tures, each instruction is completely interlocked and architecturally defined to complete with no visible
pipeline effects, which is critical for fast interrupt processing.
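As a concrete reference for what the compound instruction of Figure 3 computes, here is a plain C model of the saturated vector sum of squares. The Q15 fractional multiply and the 32-bit accumulator are simplifications of our choosing (the hardware accumulators are 40 bits wide); the saturate-after-every-operation ordering follows the vmulreds description above:

```c
#include <assert.h>
#include <stdint.h>

/* Saturate a 64-bit intermediate to 32 bits. */
int32_t sat32(int64_t x)
{
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t)x;
}

/* Fractional Q15 multiply: (a * b) << 1, saturated to 32 bits. */
int32_t qmul(int16_t a, int16_t b)
{
    return sat32(2 * (int64_t)a * (int64_t)b);
}

/* Saturating sum of squares over k elements (k assumed to be a multiple
 * of four, as the vectorized loop requires), four elements at a time:
 * one lvu/vmulreds/loop compound instruction per outer iteration.  Each
 * product is saturated, then accumulated with saturation after each add. */
int32_t sum_of_squares(const int16_t *v, int k)
{
    int32_t ac0 = 0;                  /* models accumulator register ac0 */
    for (int i = 0; i < k; i += 4)    /* one compound instruction        */
        for (int j = 0; j < 4; j++)   /* reduction with saturating adds  */
            ac0 = sat32((int64_t)ac0 + qmul(v[i + j], v[i + j]));
    return ac0;
}
```

For example, squaring four 0.5 (Q15 value 16384) elements drives the 32-bit accumulator exactly to its saturation point.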
3 Low Power Multithreading
Wireless communication and multimedia applications typically exhibit high thread-level parallelism [18-
21]. For example, our implementation of WCDMA has 32 concurrent threads. Explicit multithreaded
processors are able to exploit this parallelism by concurrently executing instructions from different control
threads within a single processor pipeline. Although threads share the same underlying processor
hardware, each thread typically has its own processor state to facilitate concurrent execution. Multithread-
ing offers several benefits including (1) reducing the performance penalty of context switches and long
latency operations, such as cache misses, (2) facilitating concurrent execution of multiple tasks, and (3)
increasing processor throughput and utilization.
There are several techniques for explicit hardware multithreading, including interleaved multithreading
(IMT), blocked multithreading (BMT), and simultaneous multithreading (SMT). With IMT [23, 24],
also known as fine grain multithreading or horizontal multithreading, only one thread can issue an instruc-
tion each cycle, and threads issue instructions in a predetermined, successive order (e.g., round robin). By
designing the processor pipeline so that an instruction from a given thread retires before the next instruction
from the same thread issues, IMT eliminates control and data dependencies between instructions in the
processor pipeline. With BMT [25,26], also known as coarse-grain multithreading or vertical multithread-
ing, instructions are executed sequentially until a long-latency event (e.g., a cache miss) occurs. The long-
latency event triggers a fast context switch to another thread. With SMT [2,27], multiple instructions may
be issued each cycle from multiple threads. When combined with out-of-order superscalar processing, the
additional hardware required for SMT is not significant. Although SMT may reduce power dissipation in
superscalar processors, both out-of-order superscalar and SMT processing consume significant power.
They also make it difficult to determine worst case execution times, since instructions are scheduled dy-
namically, rather than at compile time.
Recent research has shown the potential benefits of using multithreaded processors in high-performance,
low-power embedded systems. One study demonstrates that multithreaded processors can efficiently implement
video decompression algorithms by having independent threads operate in parallel on different video
frames. Another shows that multithreaded processors can provide significant power savings over single-threaded
processors due to their ability to utilize execution units more efficiently and to meet real-time deadlines at
lower clock frequencies and lower voltages by executing multiple threads in parallel. Related work demonstrates that
multithreaded processors can dissipate less power than single-threaded architectures, and presents various
techniques for further reducing power in multithreaded systems. Other research presents new techniques for
multithreading in embedded processors that have low area and power requirements, and discusses how
multithreaded processing is well suited for the design of future multimedia chips due to its abilities to hide
latencies, facilitate dynamic task scheduling, alleviate operating system overhead, and reduce hardware
complexity. Further work presents a resource management scheme that reduces performance unpredictability
in embedded SMT processors for real-time systems.
3.1 Decoupled Logic and Memory
As technology improves, processors are capable of executing at very fast cycle times. Current state-of-the-art 0.13um technologies can produce processors faster than 3 GHz. Unfortunately,
current high-performance processors consume significant power. If power-performance curves are consid-
ered for both memory and logic within a technology, there is a region that provides approximately linear
increase in power for linear increase in performance. Above a specific threshold, there is an exponential
increase in power for a linear increase in performance. Even more significant, memory and logic do not
have the same threshold.
Based on our experience with several designs, logic power-performance curves may be in the linear range
until approximately 600MHz in 0.13um CMOS technology, as illustrated in Figure 4. Unfortunately,
memory power-performance curves are linear to about 300MHz in 0.13um CMOS technology. This pre-
sents a dilemma as to whether to optimize for performance or power. The Sandblaster implementation of
multithreading allows the processor cycle time to be decoupled from the on-chip memory access time. This
allows both logic and memory to operate in the linear region, thereby significantly reducing power dissipation. The decoupled execution does not induce pipeline stalls due to the unique multithreaded pipeline design.
Fig. 4. Power consumption curves vs. CPU frequency and memory access time.
3.2 Token Triggered Threading
Figure 1 shows the microarchitecture of the multithreaded Sandblaster Processor. In a multithreaded
processor, multiple threads of execution operate simultaneously. The processor supports concurrent program execution through hardware threads, where each hardware thread has its own processor state.
The Sandblaster microarchitecture supports up to eight concurrent hardware threads using a form of in-
terleaved multithreading called token triggered threading (T3). As shown in Figure 5, with T3 only one
thread may issue an instruction on a cycle boundary; round-robin threading imposes the same constraint.
What distinguishes T3 is that on each clock cycle a token indicates the subsequent thread that is to
issue an instruction. Tokens may be sequential (e.g., round-robin), even/odd, or based on other communica-
tion patterns. Compared to SMT, T3 has much less hardware complexity and power dissipation, since the
method for selecting threads is simplified, only a single compound instruction issues each clock cycle, and
dependency checking and bypass hardware are not needed, as explained in the next section. Compared to
traditional IMT, T3 provides higher performance through compound instructions, SIMD vector operations,
and greater flexibility in scheduling threads.
Fig. 5. One possible order of threads issuing instructions in token triggered threading.
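A minimal model of the token sequence may make the distinction from round-robin concrete. The even/odd pattern below (thread order 0, 2, 4, 6, 1, 3, 5, 7) is one hypothetical token policy of our choosing; the eight-thread count follows the text, but the code is an illustration, not the actual issue logic:

```c
#include <assert.h>

/* Token-triggered threading sketch: instead of a hardwired round-robin
 * counter, a token names the next thread to issue.  Round-robin is the
 * special case next = (t + 1) % NTHREADS; the policy below interleaves
 * all even threads, then all odd threads. */
enum { NTHREADS = 8 };

int next_thread_even_odd(int t)
{
    int next = t + 2;
    if (next == NTHREADS)     return 1;  /* evens exhausted: start odds  */
    if (next == NTHREADS + 1) return 0;  /* odds exhausted: wrap to even */
    return next;
}

/* Check the fairness property shared with round-robin: over NTHREADS
 * cycles, every thread issues exactly once. */
int issues_once_per_round(void)
{
    int count[NTHREADS] = {0};
    int t = 0;
    for (int cycle = 0; cycle < NTHREADS; cycle++) {
        count[t]++;                      /* thread t issues this cycle */
        t = next_thread_even_odd(t);
    }
    for (int i = 0; i < NTHREADS; i++)
        if (count[i] != 1) return 0;
    return 1;
}
```

Any token policy with this once-per-round property preserves the interleaving guarantee that eliminates dependency checking, which is why T3 can vary the pattern without adding bypass hardware.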
4 SIMD Vector Unit
Figure 6 shows a high-level block diagram of the SIMD vector processing unit (VPU), which consists of
four vector processing elements (VPEs), a shuffle unit, a reduction unit, and a multithreaded 2-bank accu-
mulator register file. The four VPEs perform arithmetic and logic operations in SIMD fashion on 16-bit,
32-bit, and 40-bit fixed-point data types. High-speed 64-bit data busses allow each VPE to load or store 16
bits of data each cycle in SIMD fashion. Support for SIMD execution significantly reduces code size, as
well as power consumption from fetching and decoding instructions, since multiple sets of data elements
are processed with a single instruction.
The shuffle unit transfers data between the VPEs, and is useful when implementing various DSP algo-
rithms, such as FFTs, DCTs, and Viterbi coding, which require data to be processed and then rearranged.
The shuffle unit reduces the number of data memory accesses needed to perform these algorithms by
allowing data to be rearranged within the VPU, instead of having to store data out to memory and then
retrieve it in a different order. The reduction unit takes results from the VPEs, adds them to or subtracts
them from an accumulator register file operand, and then stores the result of the reduction back in the ac-
cumulator register file. The reduction unit and accumulator register file accelerate the computation of dot
products and similar vector operations.
Fig. 6. SIMD vector processing unit.
4.1 Vector Processing Element
A block diagram of a single VPE is shown in Figure 7. Each VPE contains a 2-bank multithreaded vec-
tor register file (VRF), a multiply-accumulate (MAC) unit, a barrel shifter, a compare-select unit, a logic
unit, and an adder. To reduce dynamic power dissipation, clock gating is employed such that when per-
forming an operation in a particular unit, the inputs to other units do not change. Pipeline stages are indi-
cated by dashed horizontal lines in Figure 7. The MAC unit uses two pipeline stages to perform partial
product generation and reduction. Outputs from the MAC unit are stored in carry-save format to reduce
delay. The shifter, compare-select unit, and adder also use two pipeline stages, while the logic unit uses a
single pipeline stage. The adder can perform stand-alone addition, subtraction, or rounding operations, as
well as sum the carry-save outputs from the MAC unit to produce a two’s complement result. Besides per-
forming comparisons, the compare-select unit performs minimum, maximum, and select operations. This
capability results in a negligible increase in area and no increase in cycle time, yet greatly reduces the number of conditional branches needed in many DSP algorithms, which improves performance and power consumption.
All of the VPE functional units support operations on 16-bit, 32-bit, and 40-bit operands, except for the
MAC unit, which only multiplies 16-bit operands and can then add a 32-bit or 40-bit accumulator. Multi-
plication of numbers larger than 16 bits is not necessary for many DSP algorithms in our application do-
main, and support for it would lead to an unacceptable increase in area, cycle time, and power consump-
tion. When required, 32-bit multiplications are implemented using multiple 16-bit multiplications and 32-
bit additions. The MAC unit, shifter, and adder all support both saturating and wrap-around arithmetic
operations. When overflow occurs with saturating arithmetic, the result is saturated to the most positive or
most negative number in the specified format. When overflow occurs with wrap-around arithmetic, any bits
that cannot fit in the specified number format are simply discarded. DSP algorithms in our application domain typically use saturating arithmetic with 32-bit accumulators and wrap-around arithmetic with 40-bit accumulators.
Fig. 7. Vector processing element.
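The two overflow behaviors can be sketched in C. These helpers are illustrative models of the semantics described above, not the hardware datapath:

```c
#include <assert.h>
#include <stdint.h>

/* Saturating add: on overflow, clamp to the most positive or most
 * negative value of the 32-bit format. */
int32_t add32_sat(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* Wrap-around add: bits that cannot fit in the format are discarded.
 * Unsigned arithmetic gives the well-defined modulo-2^32 result. */
int32_t add32_wrap(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}
```

The same inputs diverge only on overflow: adding 1 to the most positive value sticks at the maximum under saturation but wraps to the most negative value under wrap-around arithmetic.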
Most SIMD vector instructions go through eight pipeline stages. For example, a vector MAC instruction
goes through the following stages: Instruction Decode, VRF Read, Mpy1, Mpy2, Add1, Add2, Transfer
(Xfer), and Write Back (WB). The Xfer stage is needed due to the long wiring delay between the bottom of
the VPU and the VRF. Since there are eight cycles between when consecutive instructions issue from the
same thread, results from one instruction are guaranteed to have been written back to the
VRF by the time the next instruction in the same thread is ready to read them. Thus, no dependency check-
ing or bypass hardware is needed. This is illustrated in Figure 8, where two consecutive vector multiply
(vmul) operations issue from the same thread. Even if there is a data dependency between the two opera-
tions, there is no need to stall the second operation, since the first operation has completed the WB stage
before the second operation enters the VRF Read stage.
Fig. 8. Two consecutive vector multiply operations that issue from the same thread.
The VRF in each VPE has eight 40-bit register file entries per thread. Since both a vector MAC opera-
tion (with three vector source operands and one vector destination operand) and a vector load or store op-
eration (with one vector destination or source operand) can appear in the same compound instruction, a
VRF designed using a standard implementation requires four read ports and two write ports. To reduce the
number of ports, the VRF uses a novel technique that divides it into two register banks: one for even
threads and one for odd threads. Register accesses by certain source and destination operands are delayed,
such that in a given cycle each register file bank has at most two operands being read and one operand be-
ing written. For example, when a MAC operation and a store operation appear in the same compound in-
struction, the two multiplier operands are read from the VRF immediately following the instruction decode
stage, but the accumulator and store operands are read one cycle later (i.e., during the Mpy1 stage). Thus,
the accumulator and store operands are read from one bank of the register file, while the next instruction,
which issues from a different thread, reads at most two operands from the other bank. To further reduce
power dissipation, each 40-bit entry in the VRF is divided into three parts: eight guard bits, 16 upper bits,
and 16 lower bits. When reading or writing 16-bit data types, only the 16 upper bits are accessed.
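The delayed-operand schedule can be sanity-checked with a small model. The code below is our own sketch (the `peak_bank_reads` helper is hypothetical): it issues one MAC-plus-store compound instruction per cycle, alternating between the even-thread and odd-thread banks, and counts source reads per bank per cycle:

```c
#include <assert.h>
#include <string.h>

/* A MAC + store pair needs four VRF source reads.  With the delay, the
 * two multiplier operands are read in the issue cycle and the accumulator
 * and store operands one cycle later, so each bank sees at most two reads
 * per cycle; without the delay, all four land in the same cycle. */
enum { CYCLES = 16 };

int peak_bank_reads(int delay_acc_and_store)
{
    int reads[2][CYCLES + 1];
    memset(reads, 0, sizeof reads);
    for (int c = 0; c < CYCLES; c++) {
        int bank = c % 2;            /* even/odd thread banks alternate  */
        reads[bank][c] += 2;         /* multiplier operands at issue     */
        reads[bank][c + delay_acc_and_store] += 2; /* acc + store reads  */
    }
    int peak = 0;
    for (int b = 0; b < 2; b++)
        for (int c = 0; c <= CYCLES; c++)
            if (reads[b][c] > peak) peak = reads[b][c];
    return peak;
}
```

Running the model with the delay enabled gives a peak of two reads per bank per cycle, matching the two-read-port claim; disabling the delay doubles the required ports.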
4.2 Saturating Dot Products
The reduction unit and accumulator register file are used with the VPEs to perform dot products and
similar vector operations, which are required in many DSP applications. In particular, Global System for
Mobile communication (GSM) speech coders, which are fundamental components of second and third gen-
eration cell phone technology, frequently perform dot products with saturation after each multiplication and
each addition. To be compliant with GSM standards, the results produced by GSM algorithms must be
identical (bit-exact) to the results obtained when the algorithms are executed serially with saturation after
each operation. Since saturating arithmetic operations are not associative, most DSP processors execute
saturating dot products in GSM algorithms sequentially, which degrades performance.
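The bit-exact requirement can be made concrete with a serial C reference model. Because saturating addition is not associative, this serial order defines the answer; the helper names below are ours, not taken from any GSM codebase:

```c
#include <assert.h>
#include <stdint.h>

/* Saturate a 64-bit intermediate to the 32-bit format. */
int32_t l_sat(int64_t x)
{
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t)x;
}

/* Bit-exact reference: dot product with saturation after each Q15
 * fractional multiplication and after each addition, executed serially. */
int32_t dot_sat(const int16_t *a, const int16_t *b, int k)
{
    int32_t acc = 0;
    for (int i = 0; i < k; i++) {
        int32_t p = l_sat(2 * (int64_t)a[i] * b[i]); /* saturated product */
        acc = l_sat((int64_t)acc + p);               /* saturated add     */
    }
    return acc;
}
```

A vmulreds-style implementation can match this result exactly because the four multiplications are mutually independent and the reduction unit applies its saturating additions in the same serial order.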
Previous hardware solutions to the problem of providing fast bit-exact saturating dot products compute
multiple potential results in parallel with specialized circuitry that determines which additions overflow
[34,35]. Once the overflow information is computed, it is used to select the correct result. Although this
approach improves the execution time of saturating dot products, it requires significant area and power to
calculate all potential results and select the correct one.
The Sandblaster Processor executes roughly (k/4) vmulreds operations to perform a saturating dot
product of two k-element vectors. This computation is similar to the vector sum-of-squares computation
shown in Figure 3, except the vmulreds operation changes to vmulreds %ac0,%vr0,%vr1,%ac0,
and data is also loaded into vr1. For each vmulreds operation, four pairs of vector elements are multi-
plied in parallel by the MAC units in the four VPEs. The reduction unit then adds the results from the
VPEs, along with an operand from the accumulator register file, with saturation after each addition. The
result from the reduction unit is then written back to the accumulator register file to be used in the next vmulreds operation.
To achieve a low cycle time with a low supply voltage, the reduction unit uses a four-stage pipeline, in
which each pipeline stage consists of a 40-bit addition, overflow detection, and conditional saturation.
Since the number of cycles between subsequent instructions in a given thread is greater than the number of
reduction unit pipeline stages plus the two cycles needed to read and write the accumulator register file, the
increased latency due to pipelining the reduction unit is hidden by multithreading and the processor does not
need to detect data dependencies between subsequent vmulreds operations. Compared to the techniques
presented in [34,35], our approach uses much less area and can take advantage of deeper reduction unit
pipelines. Compared to serial computation of saturating dot products, our approach improves performance
and reduces the number of accumulator register file accesses by nearly a factor of four.
5 Programming Environment
In classical DSP architectures, the execution pipelines were visible to the programmer (i.e. not transpar-
ent) and necessarily shallow, to allow assembly language optimizations. This programming restriction en-
cumbered implementations with tight timing constraints for both arithmetic execution and memory access.
The key characteristic that separates modern DSP architectures from more classical DSP architectures is
the focus on compilability. As a result, significantly longer pipelines with multiple cycles to access memory
and multiple cycles to compute arithmetic operations could be utilized. This trend has yielded higher clock
frequencies and higher performance DSPs. With long pipelines and multiple instruction issue, the difficul-
ties of attempting assembly language programming become apparent. Controlling instruction dependencies
between upwards of 100 in-flight instructions is a non-trivial task for a programmer. This is exactly the
area where a compiler excels.
5.1 Integrated Development Environment
The Sandbridge Technologies Integrated Development Environment (IDE) provides an easy-to-use
graphical user interface to all of the software tools and is based on the open-source NetBeans IDE. The
Sandbridge IDE is a graphical front end to the C compiler, assembler, simulator, and debugger. The IDE
provides the ability to create, edit, build, execute, and debug an application. It also provides the ability to
access CVS and the web, mount a file system, and communicate with Sandbridge hardware boards.
5.2 Optimizing ANSI C Compiler
There are a number of issues that must be addressed in designing a DSP compiler. First, there is a fun-
damental mismatch between DSP data types and C language constructs. A basic data type in DSPs is a
saturating fractional fixed-point number. C language constructs, however, define integer modulo arithmetic,
which forces the programmer to explicitly program saturating operations. As DSP C compilers have difficulty generating efficient code, language extensions to high-level languages have been introduced [37].
Typical extensions include support for special 16-bit data types (e.g., Q15 formats), saturating types, mul-
tiple memory spaces, and parallel SIMD execution. These extensions often require a special compiler and
the code may not be emulated easily on multiple platforms.
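To make the data-type mismatch concrete, the following is a minimal sketch of Q15 saturating arithmetic emulated in portable ANSI C; the function names are illustrative and not taken from any particular toolchain:

```c
#include <stdint.h>

/* Q15: a 16-bit signed fraction in [-1, 1); value = raw / 32768.0. */

/* Saturating Q15 addition, emulated with wider modulo arithmetic. */
int16_t q15_add(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* cannot overflow in 32 bits */
    if (sum >  32767) return  32767;         /* clamp to Q15 maximum */
    if (sum < -32768) return -32768;         /* clamp to Q15 minimum */
    return (int16_t)sum;
}

/* Saturating Q15 multiplication: 16x16 -> 32 bits, then renormalize. */
int16_t q15_mul(int16_t a, int16_t b)
{
    if (a == -32768 && b == -32768)          /* the only overflowing case:   */
        return 32767;                        /* (-1)*(-1) saturates to ~1    */
    int32_t prod = ((int32_t)a * b) << 1;    /* Q30 product -> Q31           */
    return (int16_t)(prod >> 16);            /* keep the high 16 bits: Q15   */
}
```

On a DSP, each of these is a single saturating instruction; in plain C it expands to the comparisons and branches shown above, which is why compilers that cannot recognize the pattern generate inefficient code.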
Sandbridge Technologies has built an optimizing ANSI C compiler that does not rely on any extensions.
This compiler applies several high-performance compiler optimizations, which enable the generation of
very efficient assembly code and obviate the need to write assembly code for this processor. In addition to
applying a number of well-known scalar and loop optimizations, the compiler applies DSP optimizations,
vector optimizations, and automatic parallel multithreaded optimizations.
ANSI C does not provide language features to program saturating DSP computations. Therefore, a pro-
grammer has to write emulation C code to perform these operations. The assembly code generated for this
emulation C code is very inefficient. Therefore, DSP compilers typically use mechanisms called intrinsics
to replace the emulation C code with equivalent assembly code [38]. However, this approach requires the
programmer to specify a predefined mapping between the snippets of assembly code and the emulation C
code. Unfortunately, this forces the programmer to understand the details of the underlying processor’s
assembly language and the details of the compiler’s operation. It also makes the code non-portable and
difficult to maintain. This approach is used by compilers on several well-known DSPs [5-8].
The Sandbridge compiler, however, does not use this approach. We have developed proprietary semantic
analysis techniques, which eliminate the need for intrinsics. A programmer writes C code in a processor
independent manner, focusing primarily on the function to be implemented. If saturating DSP operations
are required, the programmer writes the saturation emulation code in standard modulo C arithmetic. The
compiler converts the C code into a dependence flow graph, analyzes the range of the arithmetic operations
in the emulation code, propagates it across code segments, determines if it is a saturating or non-saturating
operation based on the dependency graph, and emits the correct assembly code. The semantic analysis does
not rely on a particular coding style or patterns in the C source code. This makes the approach very general
and applicable to any piece of C code. This technique has significant software productivity gains over in-
trinsic functions and does not force the software writers to become DSP assembly language programmers.
Further details on this approach are provided in [39,40].
Another important technique used by the compiler is the exploitation of SIMD instructions, which are
used to implement vector operations. The compiler performs high performance inner and outer loop vector
optimizations that use SIMD instructions to exploit the data level parallelism inherent in signal processing
applications. These optimizations include support for vector loads, stores, and arithmetic operations, such
as vmulreds. In conjunction with loop optimizations, these operations provide very efficient loops that
can perform as many as 16 RISC operations per single cycle. It is important to note that although saturating
operations are non-associative (i.e., the order of computation is important), they can still take advantage of
the SIMD operations. This is because the compiler was designed in conjunction with the processor, and special
hardware support allows the compiler to safely vectorize such non-associative operations.
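The non-associativity of saturating addition is easy to demonstrate. With a 16-bit saturating add (emulated here in portable C; the helper name is illustrative), the same three-element reduction gives different results in different evaluation orders:

```c
#include <stdint.h>

/* 16-bit saturating addition, emulated in wider modulo arithmetic. */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum >  32767) return  32767;
    if (sum < -32768) return -32768;
    return (int16_t)sum;
}

/* Reducing {30000, 10000, -10000} in two different orders:
 *   left-to-right:  (30000 + 10000) saturates to 32767; then -10000 -> 22767
 *   right-to-left:  10000 + (-10000) = 0; then 30000 + 0 -> 30000
 * A vectorizing compiler may therefore only reorder such a reduction when
 * the hardware guarantees the result for the order it actually uses. */
int16_t reduce_ltr(void) { return sat_add16(sat_add16(30000, 10000), -10000); }
int16_t reduce_rtl(void) { return sat_add16(30000, sat_add16(10000, -10000)); }
```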
Figure 9 shows the results of compilers for state-of-the-art DSPs [5,7,8] on out-of-the-box AMR ETSI C
code. The x-axis shows the DSP vendor and the y-axis shows the number of MHz required to compute
frames of speech in real-time. In all cases, the highest optimization level that produced the logically correct
code was used. The AMR code is completely unmodified and no special include files are used. Without
using any compiler techniques, such as intrinsics or special type definitions, our compiler is able to achieve
real-time operation on the Sandblaster core at hand-coded assembly language performance levels.
Fig. 9. Number of MHz needed to achieve real-time performance on out-of-the-box AMR ETSI encoder C code without intrinsics or special type definitions. Processors compared: Sandblaster (SB), TI C64x, TI C62x, StarCore SC140, and ADI Blackfin.
Since other solutions are not able to automatically generate DSP operations, they must use proprietary
intrinsics to improve their performance. With intrinsic libraries the results for most DSPs are near the
Sandblaster results. However, as mentioned earlier, these intrinsics make the code non-portable, dependent
on the names of the emulation C routines, and harder to maintain. The Sandbridge solution does not suffer
from these disadvantages.
5.3 Simulation Environment
Efficient compilation is just one aspect of software productivity. Prior to having hardware, algorithm de-
signers should have access to fast simulation technology. Sandbridge recognizes this fact and has provided
a fast, cycle-count-accurate simulator, which improves programmer productivity. The simulator uses a high-level description of the underlying architecture and microarchitecture to provide accurate cycle counts for
the processor core.
The simulator is based on Just-in-Time code generation technology, which has been developed in-house [41]. This technique differs from the interpretive techniques used in other DSP simulators. In the interpretive approach, the simulator models the target architecture, may mimic the implementation pipeline,
and has data structures to reflect the machine resources such as registers. The simulator contains a main
driver loop, which performs the fetch, decode, data read, execute and write back operations for each in-
struction in the target executable code. Note that these actions are performed every time the instruction is
executed. In addition, numerous conditional statements have to be executed within the main driver loop as
all combinations of opcodes and operands have to be accounted for.
Our simulator uses a Just-in-Time dynamic translation technique. With this approach, the simulator takes advantage of any a priori knowledge of the target executable and converts the target assembly code to
host assembly code before executing any piece of code. Using this approach, the simulator generates host
machine code for instruction fetch, decode, and operand reads at the beginning of program execution
(called the translation phase). The host instructions are then executed at the end of the translation phase.
This approach eliminates the overhead of repetitive target instruction fetch, decode, and operand read in the
interpretive simulation model.
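The interpretive model can be sketched as follows; the toy instruction set and data structures are invented purely for illustration. The fetch, decode, and dispatch work inside the loop is repeated for every dynamic instruction, which is precisely the overhead a translate-once JIT approach eliminates:

```c
#include <stdint.h>

/* Toy target ISA: three opcodes and four registers, for illustration only. */
enum { OP_ADD, OP_MULS, OP_HALT };

typedef struct { uint8_t op, rd, ra, rb; } Insn;   /* decoded instruction  */
typedef struct { int32_t r[4]; } Cpu;              /* simulated registers  */

/* Interpretive driver loop: fetch, decode, read operands, execute, and
 * write back -- performed again on EVERY execution of every instruction. */
void interpret(Cpu *cpu, const Insn *prog)
{
    for (int pc = 0; ; pc++) {
        Insn i = prog[pc];                         /* fetch                */
        switch (i.op) {                            /* decode + dispatch    */
        case OP_ADD:
            cpu->r[i.rd] = cpu->r[i.ra] + cpu->r[i.rb];
            break;
        case OP_MULS: {                            /* saturating multiply  */
            int64_t p = (int64_t)cpu->r[i.ra] * cpu->r[i.rb];
            if (p > INT32_MAX) p = INT32_MAX;
            if (p < INT32_MIN) p = INT32_MIN;
            cpu->r[i.rd] = (int32_t)p;
            break;
        }
        case OP_HALT:
            return;
        }
    }
}
```

A JIT-based simulator instead performs the fetch/decode work once per static instruction, emitting host code that executes the `switch` arms directly on subsequent runs.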
This technique provides very fast simulation times. Figure 10 shows the post-compilation performance of
the same AMR encoder for several DSP simulators. All programs were executed on the same 1GHz Pen-
tium laptop computer. Our simulator is capable of simulating 25 million instructions per second. This is
more than two orders of magnitude faster than the nearest competitor and allows real-time execution of
GSM speech coding on our simulator running on a 1GHz Pentium. To further elaborate, while some DSPs
cannot even execute the out-of-box C code in real-time on their native processor, Sandbridge achieves mul-
tiple real-time channels on a simulation model of the processor.
Fig. 10. Speed of various simulators on the ETSI AMR encoder, in millions of instructions per second: Sandblaster, TI C64x (Code Composer), TI C62x (Code Composer), and ADI Blackfin (Visual DSP).
6 Software Defined Radio Platform
Sandbridge Technologies has developed the SB3010, a complete system-on-chip (SOC) platform for
software defined radio. As shown in Figure 11, the SB3010 integrates four Sandblaster cores, an ARM
microcontroller that functions as an applications processor, on-chip instruction caches and data memories,
and on-chip L2 memories. The chip also contains several internal digital peripheral interfaces for moving
data in and out of the chip, Time Division Multiplexing (TDM) ports, and an Advanced Microprocessor
Bus Architecture (AMBA) bus. A high-speed Universal Serial Bus (USB) interface provides easy connectivity to
external systems. Control and test busses, such as JTAG, SPI, and I2C, allow the chip to control RF and
front-end chips. The SB3010 includes support for multiple communication protocols, plus all of the peripheral device features of an advanced, multi-function handset.
Fig. 11. SB3010 SDR baseband processor. Each of the four Sandblaster cores has its own instruction and data memories (64KB / 64KB), together with DSP local peripherals and a 10–50MHz reference clock input.
Silicon for the SB3010 is available and fully functional. Each of the four cores runs at up to 800MHz,
providing a total peak performance of more than 12 billion multiply-accumulate (MAC) operations per
second. Measured results for a synthesized version of the SB3010 have achieved 600MHz operation at
0.9V. The typical power dissipation is 150mW per core at 600MHz and 0.9V, providing the most power-
efficient processor design in its class, where power-efficiency is defined as the processor’s performance
divided by its power consumption.
Figure 12 shows the performance requirements for baseband processing in 802.11b, GPS, AM/FM ra-
dio, Bluetooth, GPRS, and WCDMA as a function of SB3010 utilization for different transmission rates.
The numbers given are obtained by dividing the performance of the SB3010 at 600MHz per core by the
peak real-time processing requirements of the particular communication system. A notable point is that all
these communications systems are written in generic C code with no hardware acceleration required. It is
also notable that performance, accuracy, and concurrency can be dynamically adjusted based on the mix of
tasks desired. As illustrated in Figure 12, the SB3010 provides processing capacity for full 2 Mbits/s
WCDMA FDD-mode including chip, bit, and symbol rate processing in real-time. The remaining cycles
can be used for non-real time tasks.
Fig. 12. SB3010 utilization for baseband processing in various communication systems at different transmission rates (e.g., GPRS Class 10/12; WCDMA at 64, 384, and 2k Kbps).
7 Conclusions
This paper has presented the design of a high-performance, low-power processor for software defined ra-
dio. The design uses a unique combination of token triggered threading, SIMD vector operations, and pow-
erful compound instructions to provide very low power consumption and real-time baseband processing
capabilities. Having validated our low power design approach with working silicon and having imple-
mented complete baseband processing in software, we provide a SDR baseband processor with power dis-
sipation appropriate for commercial terminals.
References
1. J. P. Shen and M. Lipasti, Modern Processor Design: Fundamentals of Superscalar Processors, New
York, McGraw-Hill Book Company, 2005.
2. D. M. Tullsen, S. J. Eggers, H. M. Levy, “Simultaneous Multithreading: Maximizing on-chip Parallel-
ism,” Proceedings of the International Symposium on Computer Architecture, June 1995, pp. 392-
3. P. Lapsley, J.Bier, A. Shoham, and E. A. Lee, DSP Processor Fundamentals: Architectures and Fea-
tures, New York, IEEE Press, 1997.
4. J. T. J.van Eijndhoven, F. W. Sijstermans, K. A. Vissers, E. J. D. Pol, M. I. A. Tromp, P. Struik, R.
H. J. Bloks, P. van der Wolf, A. D. Pimentel, and H. P. E. Vranken, “TriMedia CPU64 Architecture,”
Proceedings of the International Conference on Computer Design, October 1999, pp. 586-592.
5. O. Wolf and J. Bier, “StarCore Launches First Architecture,” Microprocessor Report, vol. 12, Octo-
ber, 1998, pp. 1-4.
6. J. Fridman and Z. Greenfield, “The TigerSHARC DSP Architecture,” IEEE Micro, vol. 20, January,
2000, pp. 66-76.
7. N. Seshan, “High VelociTI Processing: Texas Instruments VLIW DSP Architecture,” IEEE Signal
Processing Magazine, vol. 15, March 1998, pp. 86 - 101, 117.
8. R. K. Kolagotla, J. Fridman, B. C. Aldrich, M. M. Hoffman, W. C. Anderson, M. S. Allen, D. B.
Witt, R. R. Dunton, and L. A. Booth, Jr., “High Performance Dual-MAC DSP Architecture,” IEEE
Signal Processing Magazine, vol. 19, July 2002, pp. 42-53.
9. J. Glossner, D. Iancu, J. Lu, E. Hokenek, and M. Moudgill, “A Software Defined Communications
Baseband Design,” IEEE Communications Magazine, vol. 41, January 2003, pp. 120-128.
10. A. M. Eltawil and B. Daneshrad, “A Low-power DS-CDMA RAKE Receiver Utilizing Resource
Allocation Techniques,” IEEE Journal of Solid-State Circuits, vol. 39, August 2004, pp. 1321-1330.
11. M. Mehta, N. Drew, G. Vardoulias, N. Greco, and C. Niedermeier, “Reconfigurable Terminals: An
Overview of Architectural Solutions,” IEEE Communications Magazine, vol. 39, August 2001, pp.
12. M. J. Schulte, J. Glossner, S. Mamidi, M. Moudgill, and S. Vassiliadis, “A Low-Power Multithreaded
Processor for Baseband Communication Systems,” in Embedded Processor Design Challenges: Sys-
tems, Architectures, Modelling, and Simulation, Lecture Notes in Computer Science, Norwell, MA,
Springer, vol. 3133, July 2004, pp. 393-402.
13. J. Glossner, K. Chirca, M. J. Schulte, H. Wang, N. Nasimzada, D. Har, S. Wang, A. J. Hoane, Jr., G.
Nacer, M. Moudgill, and S. Vassiliadis, “Sandbridge Sandblaster Low Power DSP,” in Proceedings of
the IEEE Custom Integrated Circuits Conference, October 2004, pp. 575-581.
14. J. Glossner, E. Hokenek, and M. Moudgill, “The Sandbridge Sandblaster Communications Processor,”
Software Defined Radio: Baseband Technology for 3G Handsets and Basestations, West Sussex,
England, John Wiley & Sons, SDR Series, vol. 5, 2004, pp. 129-157.
15. J. Glossner, T. Raja, E. Hokenek, and M. Moudgill, “A Multithreaded Processor Architecture for
SDR,” Proceedings of the Korean Institute of Communication Sciences, vol. 19, November 2002, pp.
16. J. Glossner, M. Schulte, and S. Vassiliadis, “A Java-Enabled DSP,” Embedded Processor Design
Challenges, Systems, Architectures, Modeling, and Simulation (SAMOS), Lecture Notes in Computer
Science, vol. 2268, Berlin, Springer-Verlag, 2002, pp. 307-325.
17. J. Glossner, M. Moudgill, D. Iancu, G. Nacer, S. Jinturkar, S. Stanley, M. Samori, T. Raja, and M. J.
Schulte, “The Sandbridge Sandblaster Convergence Platform,” pp. 1-21, 2005. Available from:
18. J. P. Wittenburg, P Pirsch, and G. Meyer, “A Multithreaded Architecture Approach to Parallel DSPs
for High Performance Image Processing Applications,” Proceedings of the IEEE Workshop on Signal
Processing Systems, October 1999, pp. 241-250.
19. H. Oehring, U. Sigmund, and T. Ungerer, “MPEG-2 Video Decompression on Simultaneous Multi-
threaded Multimedia Processors,” Proceedings of the International Conference on Parallel Architec-
tures and Compilation Techniques, October 1999, pp. 11-16.
20. Y.-K. Chen, E. Debes, R. Lienhart, M. Holliman, and M. Yeung, “Evaluating and Improving Perform-
ance of Multimedia Applications on Simultaneous Multithreading,” Proceedings of the Ninth Interna-
tional Conference on Parallel and Distributed Systems, December 2002, pp. 529-534.
21. S. Kaxiras, G. Narlikar, A. D. Berenbaum, and Z. Hu, “Comparing Power Consumption of an SMT
and a CMP DSP for Mobile Phone Workloads,” Proceedings of the International Conference on
Compilers, Architecture, and Synthesis for Embedded Systems, 2001, pp. 211-220.
22. T. Ungerer, B. Robič, and J. Šilc, “A Survey of Processors with Explicit Multithreading,” ACM Computing Surveys, vol. 35, March 2003, pp. 29-63.
23. B. J. Smith, “The Architecture of HEP,” Parallel MIMD Computation: HEP Supercomputer and Its
Applications, Cambridge, MA, MIT Press, 1985, pp. 41–55.
24. R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith, “The Tera Com-
puter System,” Proceedings of the 4th International Conference on Supercomputing, 1990, pp. 1–6.
25. A. Agarwal, J. Kubiatowicz, D. Kranz, B.-H. Lim, D. Yeung, G. D’Souza, and M. Parkin, “Sparcle:
An Evolutionary Processor Design for Large-Scale multiprocessors,” IEEE Micro, vol. 13, 1993, pp.
26. U. Brinkschulte, C. Krakowski, J. Kreuzinger, and T. Ungerer, “A Multithreaded Java Microcontroller
for Thread-oriented Realtime Event-handling,” Proceedings of the International Conference on Paral-
lel Architectures and Compilation, 1999, pp. 34-39.
27. D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm, “Exploiting Choice:
Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,” Proceed-
ings of the 23rd Annual International Symposium on Computer Architecture, 1996, pp. 191-202.
28. J. S. Seng, D. M. Tullsen, and G. Z. N. Cai, “Power-sensitive Multithreaded Architecture,” Proceed-
ings of the International Conference on Computer Design, September 2000, pp. 199-206.
29. J. W. Haskins, Jr., K. R. Hirst, and K. Skadron, “Inexpensive Throughput Enhancement in Small-scale
Embedded Microprocessors with Block Multithreading: Extensions, Characterization, and Tradeoffs,”
Proceedings of the IEEE International Conference on Performance, Computing, and Communica-
tions, April 2001, pp. 319-328.
30. W. El-Kharashi, F. ElGuibaly, and K. F. Li, “Multithreaded Processors: The Upcoming Generation for
Multimedia Chips,” Proceedings of the IEEE Symposium on Advances in Digital Filtering and Signal
Processing, June 1998, pp. 111-115.
31. F. J. Cazorla, A. Ramirez, M. Valero, P. M. W. Knijnenburg, R. Sakellariou, and E. Fernandez, “QoS
for High-performance SMT Processors in Embedded Systems,” IEEE Micro, vol. 24, July-Aug. 2004,
32. J. Sebot and N. Drach, “SIMD Extensions: Reducing Power Consumption on a Superscalar Processor
for Multimedia Applications,” Cool Chips, vol. IV, April 2001.
33. R. B. Lee, “Subword Permutation Instructions for Two-Dimensional Multimedia Processing in Mi-
croSIMD Architectures,” Proceedings of the IEEE 11th International Conference on Application-
Specific Systems, Architectures and Processor, July 2000, pp. 3-14.
34. M. J. Schulte, P. I. Balzola, J. Ruan, and J. Glossner, “Parallel Saturating Multioperand Adders,” Pro-
ceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded
Systems, November 2000, pp. 172-179.
35. P. Balzola, M. Schulte, J. Ruan, J. Glossner, and E. Hokenek, “Design Alternatives for Parallel Satu-
rating Multioperand Adders,” Proceedings of the International Conference on Computer Design:
VLSI in Computers & Processors, September 2001, pp. 172-177.
36. T. Boudreau, J. Glick, S. Greene, J. Woehr, and V. Spurlin, “NetBeans: The Definitive Guide,” Sebas-
topol, CA, O'Reilly & Associates, October 2002.
37. K.W. Leary and W. Waddington, “DSP/C: A Standard High Level Language for DSP and Numeric
Processing,” Proceedings of the International Conference on Acoustics, Speech and Signal Process-
ing, IEEE, 1990, pp. 1065-1068.
38. D. Batten, S. Jinturkar, J. Glossner, M. Schulte, and P. D’Arcy, “A New Approach to DSP Intrinsic
Functions,” Proceedings of the Hawaii International Conference on System Sciences, January 2000,
39. S. Jinturkar, J. Glossner, V. Kotlyar, and M. Moudgill, “The Sandblaster Automatic Multithreaded
Vectorizing Compiler,” Proceedings of the 2004 Global Signal Processing Expo (GSPx) and Interna-
tional Signal Processing Conference, September 2004.
40. V. Kotlyar and M. Moudgill, “Detecting Overflow Detection,” Proceedings of the 2004 CODES+ISSS
International Conference on Hardware/Software Codesign and System Synthesis, September 2004,
41. J. Glossner, S. Dorward, S. Jinturkar, M. Moudgill, E. Hokenek, M. Schulte, and S. Vassiliadis,
“Sandbridge Software Tools,” Proceedings of the 3rd Annual Systems, Architectures, Modelling, and
Simulation (SAMOS) Conference, Samos, Greece, July 2003, pp. 142-148.
Michael Schulte received a B.S. degree in Electrical Engineering from the Uni-
versity of Wisconsin-Madison in 1991, and M.S. and Ph.D. degrees in Electri-
cal Engineering from the University of Texas at Austin in 1992 and 1996, re-
spectively. From 1996 to 2002, he was an assistant and associate professor at
Lehigh University, where he directed the Computer Architecture and Arithmetic Research Laboratory. He
is currently an assistant professor at the University of Wisconsin-Madison, where he leads the Madison
Embedded Systems and Architectures Group. His research interests include high-performance embedded
processors, computer architecture, domain-specific systems, computer arithmetic, and wireless systems. He
is a senior member of the IEEE and the IEEE Computer Society, and an associate editor for the IEEE
Transactions on Computers and the Journal of VLSI Signal Processing.
John Glossner is CTO & Executive Vice President at Sandbridge Technologies.
Prior to co-founding Sandbridge, John managed the Advanced DSP Technology
group, Broadband Transmission Systems group, and was Access Aggregation
Business Development manager at IBM’s T.J. Watson Research Center. Prior to
IBM, John managed the software effort in Lucent/Motorola’s Starcore DSP
design center. John received a Ph.D. in Computer Architecture from TU Delft in the Netherlands for his
work on a Multithreaded Java processor with DSP capability. He also received an M.S. degree in Engineer-
ing Management and an M.S.E.E. from NTU. John also holds a B.S.E.E. degree from Penn State. John has
more than 60 publications and 12 issued patents.
Dr. Sanjay Jinturkar is the Director of Software at Sandbridge and manages the
systems software and communications software groups. Previously, he managed
the software tools group at StarCore. He has a Ph.D. in Computer Science from the University of Virginia and holds 20 publications and 4 patents.
Mayan Moudgill obtained a Ph.D. in Computer Science from Cornell University
in 1994, after which he joined IBM at the Thomas J. Watson Research Center.
He worked on a variety of computer architecture and compiler related projects,
including the VLIW research compiler, Linux ports for the 40x series embedded
processors and simulators for the Power 4. In 2001, he co-founded Sandbridge Technologies, a start-up
that is developing digital signal processors targeted at 3G wireless phones.
Suman Mamidi is a graduate student in the Department of Electrical and Com-
puter Engineering at the University of Wisconsin-Madison. He received his M.S.
degree from the University of Wisconsin-Madison in December, 2003 and is
currently working towards his PhD. His research interests include low-power
processors, hardware accelerators, multithreaded processors, reconfigurable hardware, and embedded systems.
Stamatis Vassiliadis was born in Manolates, Samos, Greece, in 1951. He is
currently a Chair Professor in the Electrical Engineering, Mathematics, and
Computer Science (EEMCS) department of Delft University of Technology (TU
Delft), The Netherlands. He previously served in the Electrical and Computer
Engineering faculties of Cornell University, Ithaca, NY and the State University
of New York (S.U.N.Y.), Binghamton, NY. For a decade, he worked with IBM,
where he was involved in a number of advanced research and development pro-
jects. He received numerous awards for his work, including 24 publication awards, 15 invention awards,
and an outstanding innovation award for engineering/scientific hardware design. His 73 USA patents rank
him as the top all time IBM inventor. Dr. Vassiliadis is an ACM fellow, an IEEE fellow and a member of
the Royal Netherlands Academy of Arts and Sciences (KNAW).