A Low-Power Multithreaded Processor for Software Defined Radio.
Michael Schulte2, John Glossner1,3, Sanjay Jinturkar1, Mayan Moudgill1,
Suman Mamidi2, and Stamatis Vassiliadis3
1 Sandbridge Technologies
1 North Lexington Ave.
White Plains, NY, 10512, USA
2 University of Wisconsin
Dept. of ECE
1415 Engineering Drive
Madison, WI, 53706, USA
3 Delft University of Technology
Electrical Engineering, Mathematics and
Computer Science Department
Delft, The Netherlands
Abstract. Embedded digital signal processors for software defined radio have stringent design
constraints including high computational bandwidth, low power consumption, and low interrupt
latency. Furthermore, due to rapidly evolving communication standards with increasing code
complexity, these processors must be compiler-friendly, so that code for them can quickly be
developed in a high-level language. In this paper, we present the design of the Sandblaster
Processor, a low-power multithreaded digital signal processor for software defined radio. The
processor uses a unique combination of token triggered threading, powerful compound instructions,
and SIMD vector operations to provide real-time baseband processing capabilities with very low
power consumption. We describe the processor’s architecture and microarchitecture, along with
various techniques for achieving high performance and low power dissipation. We also describe the
processor’s programming environment and the SB3010 platform, a complete system-on-chip solution
for software defined radio. Using a supercomputer-class vectorizing compiler, the SB3010 achieves
real-time performance in software on a variety of communication protocols including 802.11b, GPS,
AM/FM radio, Bluetooth, GPRS, and WCDMA. In addition to providing a programmable platform for SDR,
the processor also provides efficient support for a wide variety of digital signal processing and
1 Introduction
General purpose processors have utilized various microarchitectural techniques such as deep pipelines,
multiple instruction issue, out-of-order instruction issue, and speculative execution to achieve very high
performance. Recently, simultaneous multithreading (SMT) processors, in which multiple hardware
threads simultaneously issue multiple instructions per cycle, have been deployed. These techniques have
produced performance increases at high complexity and power dissipation costs.
In the embedded digital signal processing (DSP) community, power dissipation and real-time processing
constraints have typically precluded general purpose microarchitectural techniques. Rather than minimize
average execution time, embedded DSP processors often require the worst case execution time to be
minimized in order to satisfy real-time constraints. Consequently, very long instruction word (VLIW) or
statically scheduled microarchitectures with architecturally visible pipelines are typically employed [4-8].
Unfortunately, exposing pipelines may pose interrupt latency restrictions, particularly if all memory loads
must complete prior to servicing an interrupt. Furthermore, on-chip memory access in DSP systems has
traditionally operated at the processor clock frequency. Although this eases the programming burden
and allows single cycle on-chip memory accesses, it often restricts the maximum processor clock
frequency.
Traditional wireless communication systems have typically been implemented using custom hardware
solutions. Chip rate, symbol rate, and bit rate coprocessors are often coordinated by a programmable DSP,
but the DSP does not typically participate in physical layer processing [9,10]. Even when supporting a
single communication system, the hardware development cycle for these systems is onerous and often
requires multiple chip redesigns late in the certification process. When multiple communication systems must
simultaneously be supported, silicon area and design validation are major inhibitors to commercial success.
A software-based platform that is capable of being dynamically reconfigured for different communication
systems enables elegant reuse of silicon area and reduced time-to-market through software modifications,
instead of time-consuming hardware redesigns. Software-based platforms also allow wireless devices to be
reconfigured to implement emerging wireless communication standards, thereby decreasing product devel-
Software Defined Radios (SDRs), which provide a programmable and dynamically reconfigurable
method for implementing the physical layer processing of multiple communication systems, have been
widely recognized as one of the most important new technologies for wireless communication systems.
SDRs have a significant advantage over traditional communication devices, because they can support
several communication systems in software. For example, a single SDR implementation might provide support
for WCDMA, GPRS, WLAN, and GPS.
In this paper, we present the Sandblaster Processor, a low-power multithreaded digital signal processor
for SDR. In Section 2, we give an overview of the processor and its compound instruction set architecture.
In Section 3, we present a low power multithreaded microarchitecture, in which multithreading is utilized to
reduce power consumption and simplify programming. We also describe a non-blocking fully interlocked
pipeline implementation with reduced hardware complexity that allows on-chip memory to operate
significantly slower than the processor cycle time without inducing pipeline stalls. In Section 4, we present
the design of the single-instruction-multiple-data (SIMD) vector unit and discuss a novel approach for
performing saturating dot products. In Section 5, we discuss the processor’s programming environment. In
Section 6, we present the SB3010, a complete system-on-chip (SoC) platform for SDR, and demonstrate the
ability of the SB3010 to perform real-time physical layer processing of various communication standards in
software. In Section 7, we give our conclusions. This paper is an extension of the research presented in
2 Processor Design
Sandbridge Technologies has designed a multithreaded processor capable of efficiently executing DSP,
embedded control, and Java code in a single compound instruction set optimized for SDR applications
[13-16]. The Sandblaster Processor overcomes the deficiencies of previous approaches by providing substantial
parallelism and throughput for high-performance DSP applications, while maintaining fast interrupt
response, high-level language programmability, and very low power dissipation. The design utilizes a unique
combination of modern techniques including hardware support for multiple threads, SIMD vector
processing, and instruction set support for Java code. Program memory is conserved through the use of powerful
compounded instructions that may issue multiple operations per cycle. Architecturally, it is possible to turn
off the entire processor. All clocks may be disabled or the processor may idle with clocks running. Each
hardware thread unit may also be disabled to reduce toggling.
Figure 1 shows a block diagram of the processor, which is partitioned into three data processing
units: a program flow control unit, an integer/load-store unit, and a SIMD vector unit. The program flow
control unit is the brain of the processor. It performs instruction fetch and decode, instruction address
calculations, and interrupt processing. The integer/load-store unit performs scalar arithmetic and logic
operations, data address calculations, memory access operations, and special-purpose register manipulations.
The SIMD vector unit, described in Section 4, simultaneously performs the same operation on four sets of
vector elements and facilitates high-speed execution of SDR applications.
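To make the four-way SIMD behavior concrete, the following Python sketch models a vector operation applied to four element pairs per step. It is only an illustration: the names `LANES`, `simd_op`, and `vector_apply` are hypothetical, and the sketch ignores element widths, saturation, and all other details of the actual vector unit, which is described in Section 4.

```python
import operator

# Number of element pairs processed per step; the text states four.
LANES = 4

def simd_op(op, a, b):
    """Apply one operation to LANES element pairs, as a single SIMD step would."""
    if len(a) != LANES or len(b) != LANES:
        raise ValueError("each operand must hold exactly LANES elements")
    return [op(x, y) for x, y in zip(a, b)]

def vector_apply(op, a, b):
    """Process two equal-length vectors LANES elements at a time.

    Returns the result vector and the number of SIMD steps used.
    """
    if len(a) != len(b) or len(a) % LANES != 0:
        raise ValueError("vectors must have equal length, a multiple of LANES")
    result = []
    steps = 0
    for i in range(0, len(a), LANES):
        result.extend(simd_op(op, a[i:i + LANES], b[i:i + LANES]))
        steps += 1
    return result, steps

# Example: an 8-element multiply takes 2 steps instead of 8 scalar operations.
products, steps = vector_apply(operator.mul, [1, 2, 3, 4, 5, 6, 7, 8], [2] * 8)
```

In this model an N-element vector operation needs N/4 steps, which is the source of the throughput gain the text refers to.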
The processor core also includes an instruction cache, data memory, and bus/memory interface unit. The
64KB, 4-way set associative instruction cache stores instructions to be fetched for each thread. An
associative cache is used to reduce the likelihood of one thread evicting another thread’s active program. In our
implementation, a thread identifier register is used to select whether the line from the left or right bank is
evicted, which reduces the complexity of the line selection logic. The 64KB, 8-bank data memory
stores data for each thread. Using a pre-loaded data memory, instead of a data cache, facilitates the
streaming nature and real-time requirements of SDR applications. The bus/memory interface unit provides access
to level-2 (L2) memory, other processor cores, and the rest of the system.
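One way to picture the thread-based eviction choice is sketched below in Python. The text states only that a thread identifier register selects the left or right bank. Everything else here is an assumption made for illustration: that each bank holds two of the four ways, that the low bit of the thread identifier picks the bank, and that a hypothetical one-bit indicator (`way_bit`) picks the way inside the bank.

```python
WAYS = 4            # 4-way set associative, as stated in the text
WAYS_PER_BANK = 2   # assumption: left bank = ways 0-1, right bank = ways 2-3

def victim_way(thread_id, way_bit):
    """Return the way to evict for a miss by the given thread.

    Assumptions (not from the paper): the low bit of thread_id selects the
    bank, and way_bit (0 or 1) selects the way inside that bank.
    """
    bank = thread_id & 1                          # 0 = left, 1 = right
    return bank * WAYS_PER_BANK + (way_bit & 1)

# Under these assumptions, even-numbered threads only ever evict ways 0-1
# and odd-numbered threads only ever evict ways 2-3.
even_victims = {victim_way(t, b) for t in (0, 2, 4, 6) for b in (0, 1)}
odd_victims = {victim_way(t, b) for t in (1, 3, 5, 7) for b in (0, 1)}
```

The point of the sketch is that the selection logic then needs only one bit of per-set state rather than a full four-way replacement policy, and a thread cannot evict lines held in the other bank.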
Fig 1. Sandblaster processor microarchitecture.
2.1 Processor Pipelines
Pipelines for one particular implementation of the Sandblaster Processor are shown in Figure 2. The
execution pipelines are different for various operations. The Load/Store (Ld/St) pipeline is shown to have nine
stages, and it is assumed that the instruction has already been fetched. The first stage decodes the
instruction. This is followed by a read from the general-purpose register file. The next stage generates the address
to perform the Load or Store. Five cycles are used to access data memory. Finally, the result is written
back to the register file. Once an instruction from a particular thread enters the pipeline, it runs to
completion. It is also guaranteed to write back its result before the next instruction from the same thread tries to
read the result. The number of pipeline stages for each instruction and the maximum number of hardware
threads are selected to provide a short cycle time and sufficient thread-level parallelism for a variety of
SDR applications. The number of cycles to access memory is selected to allow both the processor and
memory to operate near the peak linear power-performance range, as explained in Section 3.1.
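The relationship between pipeline depth and thread count can be checked with a short calculation. The Python sketch below is illustrative only. The stage labels are hypothetical names for the nine Ld/St stages listed above, and two further assumptions are not taken from this section: each stage takes one cycle, and threads issue in strict round-robin order, so a given thread issues one instruction every `num_threads` cycles.

```python
# Hypothetical labels for the nine Ld/St stages described in the text:
# decode, register file read, address generation, five memory cycles, writeback.
LD_ST_STAGES = [
    "decode", "rf_read", "agen",
    "mem0", "mem1", "mem2", "mem3", "mem4",
    "writeback",
]

def stage_cycle(issue_cycle, stage):
    """Cycle in which an instruction issued at issue_cycle occupies a stage
    (assumes one stage per cycle and no stalls)."""
    return issue_cycle + LD_ST_STAGES.index(stage)

def hazard_free(num_threads):
    """True if a load's writeback happens strictly before the register read
    of the next instruction from the same thread, assuming round-robin issue."""
    writeback = stage_cycle(0, "writeback")
    next_read = stage_cycle(num_threads, "rf_read")
    return writeback < next_read

def min_threads_for_no_hazard():
    """Smallest thread count for which the guarantee holds without interlocks."""
    n = 1
    while not hazard_free(n):
        n += 1
    return n
```

Under these assumptions `min_threads_for_no_hazard()` evaluates to 8. A different issue policy, or a register file that allows a write and a read of the same register in one cycle, would change that number.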