A Low-Power Multithreaded Processor for Software Defined Radio.
ABSTRACT Embedded digital signal processors for software defined radio have stringent design constraints including high computational
bandwidth, low power consumption, and low interrupt latency. Furthermore, due to rapidly evolving communication standards
with increasing code complexity, these processors must be compiler-friendly, so that code for them can quickly be developed
in a high-level language. In this paper, we present the design of the Sandblaster Processor, a low-power multithreaded digital
signal processor for software defined radio. The processor uses a unique combination of token triggered threading, powerful
compound instructions, and SIMD vector operations to provide real-time baseband processing capabilities with very low power
consumption. We describe the processor’s architecture and microarchitecture, along with various techniques for achieving high
performance and low power dissipation. We also describe the processor’s programming environment and the SB3010 platform, a
complete system-on-chip solution for software defined radio. Using a super-computer class vectorizing compiler, the SB3010
achieves real-time performance in software on a variety of communication protocols including 802.11b, GPS, AM/FM radio, Bluetooth,
GPRS, and WCDMA. In addition to providing a programmable platform for SDR, the processor also provides efficient support for
a wide variety of digital signal processing and multimedia applications.
Michael Schulte2, John Glossner1,3, Sanjay Jinturkar1, Mayan Moudgill1,
Suman Mamidi2, and Stamatis Vassiliadis3
1 Sandbridge Technologies
1 North Lexington Ave.
White Plains, NY, 10512, USA
2 University of Wisconsin
Dept. of ECE
1415 Engineering Drive
Madison, WI, 53706, USA
3 Delft University of Technology
Electrical Engineering, Mathematics and
Computer Science Department
Delft, The Netherlands
1 Introduction
General purpose processors have utilized various microarchitectural techniques such as deep pipelines,
multiple instruction issue, out-of-order instruction issue, and speculative execution to achieve very high
performance. Recently, simultaneous multithreading (SMT) processors, in which multiple hardware
threads simultaneously issue multiple instructions per cycle, have been deployed. These techniques have
increased performance at the cost of high complexity and power dissipation.
In the embedded digital signal processing (DSP) community, power dissipation and real-time processing
constraints have typically precluded general purpose microarchitectural techniques. Rather than minimize
average execution time, embedded DSP processors often require the worst case execution time to be
minimized in order to satisfy real-time constraints. Consequently, very long instruction word (VLIW) or
statically scheduled microarchitectures with architecturally visible pipelines are typically employed [4-8].
Unfortunately, exposing pipelines may pose interrupt latency restrictions, particularly if all memory loads
must complete prior to servicing an interrupt. Furthermore, on-chip memory access in DSP systems has
traditionally operated at the processor clock frequency. Although this eases the programming burden
and allows single cycle on-chip memory accesses, it often restricts the maximum processor clock frequency.
Traditional wireless communication systems have typically been implemented using custom hardware
solutions. Chip rate, symbol rate, and bit rate coprocessors are often coordinated by a programmable DSP,
but the DSP does not typically participate in physical layer processing [9,10]. Even when supporting a
single communication system, the hardware development cycle for these systems is onerous and often
requires multiple chip redesigns late in the certification process. When multiple communication systems must
simultaneously be supported, silicon area and design validation are major inhibitors to commercial success.
A software-based platform that is capable of being dynamically reconfigured for different communication
systems enables elegant reuse of silicon area and reduced time-to-market through software modifications,
instead of time-consuming hardware redesigns. Software-based platforms also allow wireless devices to be
reconfigured to implement emerging wireless communication standards, thereby decreasing product devel-
Software Defined Radios (SDRs), which provide a programmable and dynamically reconfigurable
method for implementing the physical layer processing of multiple communication systems, have been
widely recognized as one of the most important new technologies for wireless communication systems.
SDRs have a significant advantage over traditional communication devices, because they can support
several communication systems in software. For example, a single SDR implementation might provide support
for WCDMA, GPRS, WLAN, and GPS.
In this paper, we present the Sandblaster Processor, a low-power multithreaded digital signal processor
for SDR. In Section 2, we give an overview of the processor and its compound instruction set architecture.
In Section 3, we present a low power multithreaded microarchitecture, in which multithreading is utilized to
reduce power consumption and simplify programming. We also describe a non-blocking fully
interlocked pipeline implementation with reduced hardware complexity that allows on-chip memory to operate
significantly slower than the processor cycle time without inducing pipeline stalls. In Section 4, we present
the design of the single-instruction-multiple-data (SIMD) vector unit and discuss a novel approach for
performing saturating dot products. In Section 5, we discuss the processor’s programming environment. In
Section 6, we present the SB3010, a complete system-on-chip (SoC) platform for SDR, and demonstrate the
ability of the SB3010 to perform real-time physical layer processing of various communication standards in
software. In Section 7, we give our conclusions. This paper is an extension of the research presented in
2 Processor Design
Sandbridge Technologies has designed a multithreaded processor capable of efficiently executing DSP,
embedded control, and Java code in a single compound instruction set optimized for SDR applications [13-16].
The Sandblaster Processor overcomes the deficiencies of previous approaches by providing substantial
parallelism and throughput for high-performance DSP applications, while maintaining fast interrupt
response, high-level language programmability, and very low power dissipation. The design utilizes a unique
combination of modern techniques including hardware support for multiple threads, SIMD vector
processing, and instruction set support for Java code. Program memory is conserved through the use of powerful
compounded instructions that may issue multiple operations per cycle. Architecturally, it is possible to turn
off the entire processor. All clocks may be disabled or the processor may idle with clocks running. Each
hardware thread unit may also be disabled to reduce toggling.
Figure 1 shows a block diagram of the processor, which is partitioned into three data processing
units: a program flow control unit, an integer/load-store unit, and a SIMD vector unit. The program flow
control unit is the brain of the processor. It performs instruction fetch and decode, instruction address
calculations, and interrupt processing. The integer/load-store unit performs scalar arithmetic and logic
operations, data address calculations, memory access operations, and special-purpose register manipulations.
The SIMD vector unit, described in Section 4, simultaneously performs the same operation on four sets of
vector elements and facilitates high-speed execution of SDR applications.
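As a rough illustration of what "the same operation on four sets of vector elements" means, the following Python sketch applies one binary operation across four lanes. The lane count comes from the text; the function names, the signed 16-bit element width, and the choice of a saturating add as the example operation are assumptions made only for this sketch, not details of the Sandblaster vector unit.

```python
LANES = 4  # from the text: four sets of vector elements per operation


def simd_apply(op, a, b):
    """Apply the same binary operation to each of the four lane pairs."""
    if len(a) != LANES or len(b) != LANES:
        raise ValueError(f"expected {LANES} elements per operand")
    return [op(x, y) for x, y in zip(a, b)]


def sat_add16(x, y):
    """Saturating add on signed 16-bit values (element width is an assumption)."""
    return max(-32768, min(32767, x + y))


# Lanes 0 and 1 overflow the 16-bit range and are clamped; lanes 2 and 3 do not.
result = simd_apply(sat_add16, [30000, -30000, 100, 0], [5000, -5000, 23, 0])
```

In hardware the four lanes operate in the same cycle; the list comprehension here only models the per-lane result, not the timing.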
The processor core also includes an instruction cache, data memory, and bus/memory interface unit. The
64KB, 4-way set associative instruction cache stores instructions to be fetched for each thread. An
associative cache is used to reduce the likelihood of one thread evicting another thread’s active program. In our
implementation, a thread identifier register is used to select whether the line from the left or right bank is
evicted, which reduces the complexity of the line selection logic. The 64KB, 8-bank data memory
stores data for each thread. Using a pre-loaded data memory, instead of a data cache, facilitates the
streaming nature and real-time requirements of SDR applications. The bus/memory interface unit provides access
to level-2 (L2) memory, other processor cores, and the rest of the system.
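One possible reading of the thread-driven eviction described above is sketched below in Python. Only the 4-way associativity and the use of a thread identifier to choose between a left and a right bank come from the text. The mapping of ways to banks, the use of the low bit of the thread identifier, and the one-bit per-set toggle that picks a way inside the bank are all hypothetical, chosen to show how little selection state such a scheme would need compared with full LRU tracking.

```python
WAYS = 4  # from the text: 4-way set associative instruction cache


def choose_victim(thread_id: int, way_toggle: int) -> int:
    """Return the way (0..3) to evict from a set.

    Assumptions (not from the source): ways 0-1 form the "left" bank and
    ways 2-3 the "right" bank; the low bit of the thread identifier picks
    the bank; a one-bit per-set toggle picks the way inside that bank.
    """
    bank = thread_id & 1  # 0 = left bank, 1 = right bank
    return 2 * bank + (way_toggle & 1)
```

Under these assumptions, threads with even identifiers only ever evict lines from the left bank and threads with odd identifiers from the right bank, so two such threads cannot evict each other's lines within a set.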
Fig 1. Sandblaster processor microarchitecture.
2.1 Processor Pipelines
Pipelines for one particular implementation of the Sandblaster Processor are shown in Figure 2. The
execution pipelines are different for various operations. The Load/Store (Ld/St) pipeline is shown to have nine
stages, and it is assumed that the instruction has already been fetched. The first stage decodes the
instruction. This is followed by a read from the general-purpose register file. The next stage generates the address
to perform the Load or Store. Five cycles are used to access data memory. Finally, the result is written
back to the register file. Once an instruction from a particular thread enters the pipeline, it runs to
completion. It is also guaranteed to write back its result before the next instruction from the same thread tries to
read the result. The number of pipeline stages for each instruction and the maximum number of hardware
threads are selected to provide a short cycle time and sufficient thread-level parallelism for a variety of
SDR applications. The number of cycles to access memory is selected to allow both the processor and
memory to operate near the peak linear power-performance range, as explained in Section 3.1.
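The write-back guarantee can be checked with a small timing model. The Python sketch below lists the nine Ld/St stages named above and asks whether a thread's result is written back before that thread's next instruction reads the register file. It assumes, as a simplification not stated in this section, that threads issue in strict round-robin order with one instruction entering decode per cycle, so consecutive instructions of one thread are spaced by the number of threads. The function and stage names are illustrative only.

```python
# Stage order follows the Ld/St pipeline described in the text: decode,
# register read, address generation, five memory cycles, write back.
LD_ST_STAGES = (
    ["decode", "reg_read", "addr_gen"]
    + [f"mem{i}" for i in range(1, 6)]
    + ["write_back"]
)


def stage_cycle(issue_cycle, stage):
    """Cycle in which an instruction entering decode at issue_cycle is in `stage`."""
    return issue_cycle + LD_ST_STAGES.index(stage)


def same_thread_hazard_free(num_threads):
    """True if write back strictly precedes the next same-thread register read.

    Assumption: strict round-robin issue, so a thread's next instruction
    enters decode num_threads cycles after its previous one.
    """
    write_back = stage_cycle(0, "write_back")
    next_read = stage_cycle(num_threads, "reg_read")
    return write_back < next_read


# Smallest thread count for which the model needs no interlock or bypass.
min_threads = next(n for n in range(1, 33) if same_thread_hazard_free(n))
```

In this model, hiding the full load latency without stalls or forwarding takes at least as many interleaved threads as there are cycles between register read and write back, which illustrates why the pipeline depth and the maximum number of hardware threads are chosen together.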