S. Parameswaran

University of South Wales, Pontypridd, Wales, United Kingdom

Publications (126) · 15.01 Total Impact Points

  • H. Javaid, A. Ignjatovic, S. Parameswaran
    ABSTRACT: The paradigm of pipelined MPSoC (processors connected in a pipeline) is well suited to the data flow nature of multimedia applications. Design space exploration is often performed to optimize the execution time, latency or throughput of a pipelined MPSoC, where the variants in the system are processor configurations arising from the customizable options in each of the processors. Since there can be billions of combinations of processor configurations (design points), the challenge is to quickly provide estimates of the performance metrics of those design points. Hence, in this article, we propose analytical models to estimate the execution time, latency and throughput of a pipelined MPSoC's design points, avoiding slow full-system cycle-accurate simulations of all the design points. For effective use of these analytical models, latencies of individual processor configurations must be available. We propose two estimation methods (PS and PSP) to quickly gather latencies of processor configurations with a reduced number of simulations. The PS method simulates all the processor configurations once, while the PSP method simulates only a subset of processor configurations and then uses a processor analytical model to estimate the latencies of the remaining configurations. We experimented with several pipelined MPSoCs executing typical multimedia applications (JPEG encoder/decoder, MP3 encoder and H.264 encoder). Our results show that the analytical models with the PS and PSP methods had maximum absolute errors of 12.95 percent and 18.67 percent respectively, and minimum fidelities of 0.93 and 0.88 respectively. The design spaces of the pipelined MPSoCs ranged from 10^12 to 10^18 design points, so simulating all design points would take years and is infeasible. Compared to the PS method, the PSP method reduced simulation time from days to several hours.
    IEEE Transactions on Parallel and Distributed Systems 01/2014; 25(8):2159-2168. · 1.80 Impact Factor
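    The models in the entry above build on standard pipeline performance relations. As a rough, generic illustration only (not the paper's exact equations), the Python sketch below estimates latency, throughput and execution time from per-stage latencies; the stage latencies and iteration count are made-up values.

```python
# A minimal sketch of generic pipeline performance relations of the kind such
# analytical models build on. Stage latencies (cycles per iteration) and the
# iteration count are illustrative placeholders, not values from the paper.

def pipeline_metrics(stage_latencies, iterations):
    """Estimate latency, throughput and execution time of an ideal pipeline."""
    critical = max(stage_latencies)            # slowest stage bounds throughput
    latency = sum(stage_latencies)             # time for one item to traverse all stages
    throughput = 1.0 / critical                # items completed per cycle (steady state)
    exec_time = latency + (iterations - 1) * critical  # fill time + steady-state drain
    return latency, throughput, exec_time

if __name__ == "__main__":
    stages = [120, 300, 180, 250]              # hypothetical per-stage latencies (cycles)
    lat, thr, t = pipeline_metrics(stages, iterations=1000)
    print(f"latency={lat} cycles, throughput={thr:.4f} items/cycle, exec_time={t} cycles")
```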
  • H. Javaid, M. Shafique, J. Henkel, S. Parameswaran
    ABSTRACT: Pipelined MPSoCs provide a high throughput implementation platform for multimedia applications. They are typically balanced at design-time considering worst-case scenarios so that a given throughput can be fulfilled at all times. Such worst-case pipelined MPSoCs lack runtime adaptability, resulting in inefficient resource utilization and high power/energy consumption under a dynamic workload. In this paper, we propose a novel adaptive architecture and a distributed runtime processor manager to enable runtime adaptation in pipelined MPSoCs. The proposed architecture consists of main processors and auxiliary processors, where a main processor uses a differing number of auxiliary processors depending on runtime workload variations. The runtime processor manager combines the application's execution knowledge with offline profiling and statistical information to proactively predict the auxiliary processors that should be used by a main processor. The idle auxiliary processors are then deactivated using clock- or power-gating. Each main processor with its pool of auxiliary processors has its own runtime manager, independent of the other main processors, enabling a distributed runtime manager. Our experiments with an H.264 video encoder for HD720p resolution at 30 frames/s show that the adaptive pipelined MPSoC consumed up to 29% less energy (computed using processors and caches) than a worst-case pipelined MPSoC, while delivering a minimum of 28.75 frames/s. Our results show that adaptive pipelined MPSoCs can emerge as an energy-efficient implementation platform for advanced multimedia applications.
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 01/2014; 33(5):663-676. · 1.09 Impact Factor
  • Liang Tang, J.A. Ambrose, S. Parameswaran
    ABSTRACT: The need to integrate multiple wireless communication protocols into a single low-cost, low-power hardware platform is prompted by the increasing number of emerging communication protocols and applications. This paper presents a novel application specific platform that integrates multiple wireless transmission baseband protocols in a pipelined coprocessor, which can be programmed to support various baseband protocols. The coprocessor can dynamically select the suitable pipeline stages for each baseband protocol. Moreover, each carefully designed stage is able to perform a certain signal processing function in a reconfigurable fashion. The proposed platform is flexible (compared to ASICs) and is suitable for mobile applications (compared to FPGAs and processors). The area footprint of the coprocessor is smaller than an ASIC or FPGA implementation of multiple individual protocols, while its throughput is 34% worse than ASICs and 32% better than FPGAs. The power consumption is 2.7X worse than ASICs but 40X better than FPGAs on average. The proposed platform outperforms a processor implementation in area, throughput and power consumption. Moreover, fast protocol switching is supported. Wireless LAN (WLAN) 802.11a, WLAN 802.11b and Ultra Wide Band (UWB) transmission circuits are developed and mapped to the pipelined coprocessor to prove the efficacy of our proposal.
    Design Automation Conference (DAC), 2013 50th ACM / EDAC / IEEE; 01/2013
  • H. Javaid, D. Witono, S. Parameswaran
    ABSTRACT: In this paper, we propose a design flow for the pipelined paradigm of Multi-Processor System on Chips (MPSoCs) targeting multiple streaming applications. A multi-mode pipelined MPSoC, used as a streaming accelerator, executes multiple, mutually exclusive applications through modes, where each mode refers to the execution of one application. We model each application as a directed graph. The challenge is to merge the application graphs into a single graph so that the multi-mode pipelined MPSoC derived from the merged graph contains minimal resources. We solve this problem by finding the maximal overlap between application graphs. Three heuristics are proposed: two of them greedily merge application graphs, while the third finds an optimal merging at the cost of higher running time. The results indicate significant area savings (up to 62% in processor area, 57% in FIFO area and 44% in processor/FIFO ports) with minuscule degradation of system throughput (up to 2%) and latency (up to 2%) and a small increase in energy (up to 3%) when compared to the widely used approach of designing distinct pipelined MPSoCs for individual applications. Our work is a first step in the direction of multi-mode pipelined MPSoCs, and the results demonstrate the usefulness of resource sharing among pipelined MPSoC based streaming accelerators in a multimedia platform.
    Design Automation Conference (ASP-DAC), 2013 18th Asia and South Pacific; 01/2013
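    To illustrate the idea of merging application graphs for resource sharing, the toy Python sketch below greedily maps the nodes of one graph onto another to maximize shared edges. The graphs, node names and greedy criterion are invented for the example; the paper's heuristics are considerably more sophisticated.

```python
# A toy greedy merge of two application graphs (adjacency sets), in the spirit of
# resource sharing between pipelined MPSoCs. Node names, the greedy criterion and
# the example graphs are all illustrative; the paper's heuristics are more involved.

def greedy_merge(g1, g2):
    """Map nodes of g2 onto nodes of g1 to maximize shared edges, greedily."""
    merged = {n: set(s) for n, s in g1.items()}
    mapping, used = {}, set()
    for n2, succs2 in g2.items():
        # prefer an unused g1 node whose successor set overlaps most with n2's
        best, best_score = None, -1
        for n1, succs1 in g1.items():
            if n1 in used:
                continue
            score = len(succs1 & {mapping.get(s, s) for s in succs2})
            if score > best_score:
                best, best_score = n1, score
        target = best if best is not None else n2
        mapping[n2] = target
        used.add(target)
        merged.setdefault(target, set())
    # add g2's edges (translated through the mapping) into the merged graph
    for n2, succs2 in g2.items():
        merged[mapping[n2]] |= {mapping.get(s, s) for s in succs2}
    return merged, mapping

if __name__ == "__main__":
    app_a = {"in": {"dct"}, "dct": {"quant"}, "quant": {"out"}, "out": set()}
    app_b = {"in": {"filt"}, "filt": {"quant"}, "quant": {"out"}, "out": set()}
    merged, mapping = greedy_merge(app_a, app_b)
    print("processors needed:", len(merged), "mapping:", mapping)
```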
  • ABSTRACT: Applying error recovery uniformly can either compromise the real-time constraint or worsen the power/energy envelope. Neither of these violations can realistically be accepted in embedded system design, which expects an ultra-efficient realization of a given application. In this paper, we propose a HW/SW methodology that exploits both application specific characteristics and spatial/temporal redundancy. Our methodology combines design-time and runtime optimizations to enable the resultant embedded processor to perform runtime adaptive error recovery operations, precisely targeting the reliability-wise critical instruction executions. The proposed error recovery functionality can dynamically 1) evaluate the reliability cost economy (in terms of execution time and dynamic power), 2) determine the most profitable scheme, and 3) adapt to the corresponding error recovery scheme, which is composed of spatial and temporal redundancy based error recovery operations. The experimental results show that our methodology can achieve up to fifty times greater reliability while maintaining the execution time and power deadlines, when compared to the state of the art.
    Design Automation Conference (DAC), 2013 50th ACM / EDAC / IEEE; 01/2013
  • Liang Tang, J.A. Ambrose, S. Parameswaran
    ABSTRACT: The need to integrate multiple wireless communication protocols into a single low-cost flexible hardware platform is prompted by the increasing number of emerging communication protocols and applications in modern embedded systems. The modulation mapping scheme, one of the key components in the communication baseband, varies across differing communication protocols. This paper presents an efficient tiny processor, named MAPro, which is programmable at runtime for differing modulation schemes. MAPro costs little in area, consumes less power compared to other programmable solutions, and is sufficiently fast to satisfy even the most demanding of Software Defined Radio (SDR) applications. The proposed method is flexible (when compared to ASICs) and suitable for mobile applications (when compared to FPGAs and ASIP processors). The area of MAPro is only 25% of the combined ASIC implementation of multiple individual modulation mapping circuits, while its throughput meets specification. Power consumption is 110% more than the ASIC implementation on average. MAPro significantly outperforms both the FPGA and ASIP processor implementations in area and power consumption. In terms of throughput, MAPro is similar to the FPGA, and outperforms the ASIP processor.
    26th International Conference on VLSI Design and 12th International Conference on Embedded Systems (VLSID), 2013; 01/2013
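    The core operation MAPro makes programmable, modulation mapping, amounts to grouping bits and indexing a constellation table. The sketch below shows that operation in software with simplified, unnormalized BPSK/QPSK tables; the tables and scheme names are illustrative, not taken from any particular standard or from MAPro itself.

```python
# A minimal sketch of runtime-selectable modulation mapping, the function MAPro
# implements in hardware. Constellation tables here are simplified/unnormalized
# examples (BPSK/QPSK only), not the exact tables of any standard.

CONSTELLATIONS = {
    "bpsk": {0: complex(-1, 0), 1: complex(1, 0)},
    "qpsk": {0: complex(-1, -1), 1: complex(-1, 1),
             2: complex(1, -1),  3: complex(1, 1)},
}
BITS_PER_SYMBOL = {"bpsk": 1, "qpsk": 2}

def modulate(bits, scheme):
    """Group 'bits' (list of 0/1) and map each group to a constellation point."""
    k = BITS_PER_SYMBOL[scheme]
    table = CONSTELLATIONS[scheme]
    symbols = []
    for i in range(0, len(bits) - k + 1, k):
        index = 0
        for b in bits[i:i + k]:            # pack k bits into a table index
            index = (index << 1) | b
        symbols.append(table[index])
    return symbols

if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 0]
    print("BPSK:", modulate(data, "bpsk"))
    print("QPSK:", modulate(data, "qpsk"))
```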
  • J. Henkel, V. Narayanan, S. Parameswaran, J. Teich
    ABSTRACT: As embedded on-chip systems grow more and more complex and are about to be deployed in automotive and other demanding application areas (beyond the mainstream of consumer electronics), run-time adaptation is a prime design consideration for many reasons: i) reliability is a major concern when migrating to technology nodes of 32nm and beyond, ii) efficiency, i.e. computational power per Watt, is a challenge as computing models do not keep up with hardware-provided computing capabilities, iii) power densities increase rapidly as Dennard Scaling fails, resulting in what is dubbed “Dark Silicon”, and iv) highly complex embedded applications are hard to predict. All these scenarios (and further ones not listed here) make proactive and sophisticated run-time adaptation techniques a prime design consideration for generations of multi-core architectures to come. The intent of this paper is to present problems and solutions of top research initiatives from diverse angles with the common denominator of the dire need for run-time adaptation. The first part tackles the thermal problem, i.e. high power densities and the related short- and long-term effects on reliability, and presents scalable techniques to cope with the related problems. The second section demonstrates the potential of steep slope devices for thread scheduling on multi-cores. The third approach presents embedded pipelined architectures running complex multimedia applications, whereas the fourth section introduces the paradigm of invasive computing, i.e. a novel computing approach promising high efficiency through a highly-adaptive hardware/software architecture. In summary, the paper presents snapshots of four highly-adaptive solutions and platforms, from different angles, for the challenges of complex future multi-core systems.
    Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2013 International Conference on; 01/2013
  • ABSTRACT: Efficiency in embedded systems is paramount to achieving high performance while consuming less area and power. Processors in embedded systems have to be designed carefully to achieve such design constraints. Application Specific Instruction set Processors (ASIPs) exploit the nature of applications to design an optimal instruction set. Despite not being general enough to execute any application, ASIPs are highly preferred in the embedded systems industry, where devices are produced to satisfy a certain type of application domain(s) (either intra-domain or inter-domain). Typically, ASIPs are designed from a base processor and functionalities are added for applications. This paper studies multi-application ASIPs and their instruction sets, extensively analyzing the instructions for inter-domain and intra-domain designs. The metrics analyzed are the reusable instructions and the extra cost to add a certain application, together with hardware synthesis numbers such as area, timing and delay. A wide range of applications from various benchmark suites (BioPerf, CommBench, MediaBench, MiBench and SPEC2006) and domains are analyzed for three different architectures (LEON2, PISA and ARM-Thumb). Processors are generated for these architectures in different configurations for analysis and synthesis. Our study shows that intra-domain applications share a larger number of common instructions, whereas inter-domain applications have very few common instructions, regardless of the kind of architecture (and therefore the ISA).
    26th International Conference on VLSI Design and 12th International Conference on Embedded Systems (VLSID), 2013; 01/2013
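    The overlap analysis described above can be pictured as simple set operations over per-application instruction sets, as in the toy sketch below. The application names and instruction mnemonics are invented; the paper works on full benchmark suites and real ISAs.

```python
# A toy illustration of the kind of instruction-set overlap analysis described
# above: instructions per application are modeled as plain sets. Application
# names and instruction mnemonics are invented for the example.

def reusable_instructions(apps):
    """Instructions common to every application in the group."""
    sets = list(apps.values())
    common = set(sets[0])
    for s in sets[1:]:
        common &= s
    return common

def extra_cost(base_isa, new_app_isa):
    """Instructions that must be added to support one more application."""
    return new_app_isa - base_isa

if __name__ == "__main__":
    intra_domain = {                       # two hypothetical media applications
        "jpeg": {"add", "mul", "ld", "st", "mac", "sat"},
        "h264": {"add", "mul", "ld", "st", "mac", "clz"},
    }
    security_app = {"add", "xor", "rotl", "ld", "st"}   # a hypothetical crypto app
    print("reusable within domain:", reusable_instructions(intra_domain))
    base = set().union(*intra_domain.values())
    print("extra cost of adding crypto app:", extra_cost(base, security_app))
```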
  • Liang Tang, J.A. Ambrose, S. Parameswaran
    ABSTRACT: The need to integrate multiple wireless communication protocols into a single low-cost flexible hardware platform is prompted by the increasing number of emerging communication protocols and applications in modern embedded systems. Interleaving, one of the key components of the communication baseband, varies across differing communication protocols. A novel reconfigurable variable increment step (VIS) based interleaver is proposed for efficient multi-mode communication applications. The proposed reconfigurable interleaver supports both block and convolutional interleaving, costs little in area, consumes less power compared to other programmable solutions, and is sufficiently fast to satisfy even the most demanding of Software Defined Radio (SDR) applications. The proposed method is flexible (when compared to ASICs and some reconfigurable solutions) and suitable for mobile applications in terms of area, power consumption and throughput (when compared to FPGAs, processors and other ASIC based reconfigurable proposals).
    Circuits and Systems (ISCAS), 2013 IEEE International Symposium on; 01/2013
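    A block interleaver with a variable increment step can be sketched as reading a block at addresses that advance by a programmable step modulo the block length, as below. The step and block length are illustrative, and the convolutional interleaving mode supported by the proposed hardware is not shown.

```python
# A rough sketch of a variable-increment-step (VIS) style block interleaver:
# read addresses advance by a programmable step modulo the block length. The
# step/length values are illustrative placeholders.

from math import gcd

def vis_block_interleave(symbols, step):
    """Permute 'symbols' by reading at addresses (i*step) mod N."""
    n = len(symbols)
    assert gcd(step, n) == 1, "step must be coprime with block length for a full permutation"
    return [symbols[(i * step) % n] for i in range(n)]

def vis_block_deinterleave(symbols, step):
    """Invert the permutation by writing back to (i*step) mod N."""
    n = len(symbols)
    out = [None] * n
    for i, s in enumerate(symbols):
        out[(i * step) % n] = s
    return out

if __name__ == "__main__":
    block = list(range(12))
    tx = vis_block_interleave(block, step=5)
    rx = vis_block_deinterleave(tx, step=5)
    print(tx)
    assert rx == block
```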
  • ABSTRACT: Soft errors have become a major adverse effect in CMOS based electronic systems. Mitigating soft errors requires enhancing the underlying system with error recovery functionality, which typically leads to considerable design cost overhead in terms of performance, power and area. For embedded systems, where stringent design constraints apply, such cost must be properly bounded. In this paper, we propose a HW/SW methodology, DHASER, which enables efficient error recovery functionality for embedded ASIP-based multi-core systems. DHASER consists of three main parts: task level correctness (TLC) analysis, TLC-based processor/core customization, and a runtime reliability-aware task management mechanism. It enables each individual ASIP-based processing core to dynamically adapt its specific error recovery functionality according to the corresponding task's characteristics (i.e., soft error vulnerability and execution time deadline). The goal is to optimize the overall system reliability while considering performance/throughput. The experimental results show that DHASER can significantly improve the reliability of the system, with little cost overhead, in comparison to state-of-the-art counterparts.
    Computer-Aided Design (ICCAD), 2013 IEEE/ACM International Conference on; 01/2013
  • J.A. Ambrose, I. Nawinne, S. Parameswaran
    ABSTRACT: Mapping tasks to cores in a Multiprocessor System-on-Chip (MPSoC) to meet constraints is widely investigated. Thus far the data flow graphs (DFGs) used for binding have been limited to acyclic or single-rate graphs. In this paper we generalize the approach by allowing DFGs to be cyclic and multi-rate. We further improve energy consumption by setting the frequency per core in a Globally Asynchronous, Locally Synchronous (GALS) architecture (through the distribution of slack). A design flow combining these two approaches is proposed to form a latency-constrained and energy-efficient binding. A generalized solution is proposed, compared to the state-of-the-art, using improvements in formulation, data structures and heuristics. Eight benchmarks are experimented upon for mesh and pipeline architectures. Our heuristics achieve significant simulation speedup compared to the state-of-the-art and provide a solution within 2.5% of the optimal on average (26% in the worst case), obtained 40x quicker in the average case. Such a speedup allows us to rapidly explore a large design space.
    Circuits and Systems (ISCAS), 2013 IEEE International Symposium on; 01/2013
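    The slack-distribution idea, slowing each core of a GALS design to the lowest frequency that still meets the latency constraint, can be illustrated with the simplified sketch below. The cycle counts, period and frequency menu are invented, and the paper's formulation additionally handles cyclic, multi-rate graphs.

```python
# A simplified sketch of slack-driven per-core frequency selection in a GALS
# pipeline: each core is slowed to the minimum frequency that still meets the
# required iteration period. Cycle counts, the period and the frequency menu
# are invented for the example.

def pick_frequencies(cycles_per_iter, period_s, available_freqs_hz):
    """For each core, choose the lowest available frequency meeting the period."""
    chosen = []
    for cycles in cycles_per_iter:
        f_min = cycles / period_s                      # minimum frequency to fit the period
        feasible = [f for f in available_freqs_hz if f >= f_min]
        if not feasible:
            raise ValueError("no available frequency meets the latency constraint")
        chosen.append(min(feasible))                   # slack absorbed by running slower
    return chosen

if __name__ == "__main__":
    cores = [40_000, 90_000, 60_000]                   # cycles per iteration, per core
    freqs = [50e6, 100e6, 200e6, 400e6]                # discrete frequency levels (Hz)
    print(pick_frequencies(cores, period_s=1e-3, available_freqs_hz=freqs))
```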
  • Su Myat Min, H. Javaid, A. Ignjatovic, S. Parameswaran
    ABSTRACT: This paper studies the effects of the last-level cache on DRAM energy consumption. In particular, we explore how different last-level cache configurations affect the idle periods of DRAM, and whether those idle periods can be exploited through the self refresh power down mode to maximize energy reduction across both the last-level cache and the DRAM. A suitable last-level cache configuration reduces the active power consumption of DRAM by reducing read/write accesses to it, and use of the self refresh power down mode reduces the background power of DRAM, creating the possibility of significant energy reduction. We propose a power mode controller that adaptively transitions DRAM to self refresh power down mode when a memory request hits the last-level cache, and activates the DRAM when a memory request misses the last-level cache. We experimented with eight applications from mediabench, and found that an optimal last-level cache configuration with self refresh power down mode can save up to 89% energy compared to a standard memory controller. Additionally, the use of self refresh power down mode degraded the performance by a maximum of only 2%. Thus, we conclude that exploration and optimization of the last-level cache can result in significant energy savings for the memory subsystem with little performance degradation.
    Embedded Computing (MECO), 2013 2nd Mediterranean Conference on; 01/2013
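    The proposed power mode controller's policy can be summarized as: self-refresh on a last-level cache hit, wake the DRAM on a miss. The sketch below captures that policy only; the wake-up penalty is a placeholder and the paper's energy accounting is not reproduced.

```python
# A minimal sketch of the power-mode policy described above: keep DRAM in
# self-refresh while requests hit the last-level cache, and wake it only on a
# miss. The wake-up penalty below is a placeholder, not a datasheet value.

class DramPowerModeController:
    SELF_REFRESH, ACTIVE = "self_refresh", "active"

    def __init__(self, wakeup_cycles=10):
        self.mode = self.SELF_REFRESH
        self.wakeup_cycles = wakeup_cycles
        self.stall_cycles = 0                      # performance cost of waking DRAM

    def on_llc_access(self, hit):
        if hit:
            # request served by the LLC: DRAM can stay in (or return to) self-refresh
            self.mode = self.SELF_REFRESH
        else:
            # LLC miss: DRAM must be activated before servicing the request
            if self.mode == self.SELF_REFRESH:
                self.stall_cycles += self.wakeup_cycles
            self.mode = self.ACTIVE

if __name__ == "__main__":
    ctrl = DramPowerModeController()
    for hit in [True, True, False, True, False, False]:   # a toy access pattern
        ctrl.on_llc_access(hit)
    print("final mode:", ctrl.mode, "| extra stall cycles:", ctrl.stall_cycles)
```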
  • Haseeb Bokhari, Haris Javaid, Sri Parameswaran
    ABSTRACT: Application specific MPSoCs are often used for high end streaming applications, which impose stringent throughput constraints and raise the demand for application specific communication architectures. In this paper, we introduce a framework that selectively adds high bandwidth express links between communicating processors so that some traffic is directed via the express links rather than the baseline on-chip interconnect, improving the MPSoC's throughput. We present a novel heuristic, xLink, which exploits both processor latencies and on-chip traffic volume to efficiently prune the exponential design space and quickly reach a solution with the minimal number of express links for a given throughput constraint. Our framework is oblivious to the baseline interconnect and can therefore be applied to different interconnects. We applied our framework to two different MPSoC interconnects, a crossbar NoC and a mesh NoC, using 9 benchmark applications. For the crossbar NoC based MPSoC, xLink found the optimal solution in 24 out of 26 cases considered (with a maximum error of 20%), while a traditional heuristic found the optimal solution in only 17 cases (with a maximum error of 44%). For the mesh NoC based MPSoC, xLink is better than the traditional heuristic in 3 out of 9 cases considered, with up to 11% saving in communication architecture area footprint. The xLink heuristic always took less than one hour, compared to several hours for the traditional heuristic and several days for an exhaustive algorithm. On average, xLink resulted in a runtime speedup of 7.5× for the crossbar NoC topology and 16.5× for the mesh NoC topology, with respect to the traditional heuristic.
    Embedded Systems for Real-time Multimedia (ESTIMedia), 2013 IEEE 11th Symposium on; 01/2013
  • Josef Schneider, Sri Parameswaran
    ABSTRACT: JPEG encoding is a commonly performed application that is both compute and memory intensive, and thus not well suited to low-power embedded systems with narrow data buses and small amounts of memory. An embedded system may also need to adapt its application in order to meet varying system constraints such as power, energy, time or bandwidth. We present here an extremely compact JPEG encoder that uses very few system resources and is capable of dynamically changing its Quality of Service (QoS) on the fly. The application was tested on a NIOS II core and on AVR and PIC24 microcontrollers, with excellent results.
    Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013; 01/2013
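    The QoS knob of a JPEG encoder is typically the quality factor used to rescale the quantization tables. The sketch below shows the widely used IJG-style rescaling of the baseline luminance table as an illustration of that knob; it is not code from the presented encoder.

```python
# A small sketch of how a JPEG encoder can trade quality for size on the fly:
# the baseline luminance quantization table is rescaled by a quality factor
# (IJG-style scaling). This illustrates the QoS knob only.

STD_LUMA_QTABLE = [                      # baseline JPEG luminance table (8x8, row major)
    16, 11, 10, 16, 24, 40, 51, 61,   12, 12, 14, 19, 26, 58, 60, 55,
    14, 13, 16, 24, 40, 57, 69, 56,   14, 17, 22, 29, 51, 87, 80, 62,
    18, 22, 37, 56, 68, 109, 103, 77, 24, 35, 55, 64, 81, 104, 113, 92,
    49, 64, 78, 87, 103, 121, 120, 101, 72, 92, 95, 98, 112, 100, 103, 99,
]

def scaled_qtable(quality):
    """Rescale the base table for a quality factor in [1, 100]."""
    quality = max(1, min(100, quality))
    scale = 5000 // quality if quality < 50 else 200 - 2 * quality
    return [min(255, max(1, (q * scale + 50) // 100)) for q in STD_LUMA_QTABLE]

if __name__ == "__main__":
    for q in (25, 50, 90):               # lower quality -> coarser quantization -> smaller file
        print(f"quality {q}: first row {scaled_qtable(q)[:8]}")
```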
  • Proceedings of the 23rd ACM International Conference on Great Lakes Symposium on VLSI, Paris, France; 01/2013
  • ABSTRACT: Intermittent faults (IF) in chips are becoming commonplace with current technology trends and process scaling. In this paper, we first modify the well known birth-death Markov model so that availability can be calculated. We then show that the standard birth-death Markov model does not capture IF correctly, and create a novel Markov model for intermittent faults that is derived from the specific nature of such faults. The proposed model, for the first time, differentiates risky and normal components and therefore does not waste processing time on unnecessary testing procedures. Consequently, the availability of processors under the proposed model increases significantly compared to the traditional model (from 0.90 to 0.99 with a typical parameter set). In addition, the proposed model facilitates parameter space exploration. Positive effects were observed when varying parameters such as error rate, recovery time and test program length. We conclude that the choice of the right testing parameters is vital for achieving optimal system availability, and the new model supports achieving this.
    Proceedings of the 23rd ACM International Conference on Great Lakes Symposium on VLSI, Paris, France; 01/2013
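    For context, the availability that such Markov models compute reduces, in the simplest two-state (up/down) case, to the repair rate over the sum of the failure and repair rates. The sketch below works that textbook case with arbitrary rates; the paper's intermittent-fault model is a refinement not reproduced here.

```python
# A worked example of the textbook two-state (up/down) availability model that
# birth-death-style analyses build on. Rates below are arbitrary illustrative
# values, not parameters from the paper.

def steady_state_availability(failure_rate, repair_rate):
    """Fraction of time in the 'up' state for an up/down Markov chain."""
    # Balance equation: pi_up * failure_rate = pi_down * repair_rate,
    # with pi_up + pi_down = 1  =>  pi_up = repair_rate / (failure_rate + repair_rate).
    return repair_rate / (failure_rate + repair_rate)

if __name__ == "__main__":
    lam = 1e-3          # failures per hour (hypothetical)
    mu = 1e-1           # recoveries per hour (hypothetical)
    print(f"availability = {steady_state_availability(lam, mu):.4f}")  # ~0.9901
```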
  • Su Myat Min Shwe, H. Javaid, S. Parameswaran
    ABSTRACT: In this paper, we explore the design space of a unified last-level cache to improve system performance and energy efficiency. The challenge is to quickly estimate the execution time and energy consumption of the system for distinct cache configurations using a minimal number of slow full-system cycle-accurate simulations. To this end, we propose a novel, simple yet highly accurate execution time estimator and a simple, reasonably accurate energy estimator. Our framework, RExCache, combines a cycle-accurate simulator and a trace-driven cache simulator with these estimators to avoid cycle-accurate simulation of all the last-level cache configurations. Once execution time and energy estimates are available from the estimators, RExCache chooses the cache configuration with minimum execution time or minimum energy consumption. Our experiments with nine different applications from mediabench and 330 last-level cache configurations show that the execution time and energy estimators had average absolute accuracies of at least 99.74% and 80.31% respectively. RExCache took only a few hours (21 hours for H.264enc) to explore the last-level cache configurations, compared to several days for the traditional method (36 days for H.264enc) and cycle-accurate simulations (257 days for H.264enc), enabling quick exploration of the last-level cache. When 100 different real-time constraints on execution time and energy were used, all the cache configurations found by RExCache were similar to those from cycle-accurate simulations. On the other hand, the traditional method found the correct cache configurations for only 69 out of 100 constraints. Thus, RExCache has better absolute accuracy than the traditional method, while reducing the simulation time by at least 97%.
    Design Automation Conference (ASP-DAC), 2013 18th Asia and South Pacific; 01/2013
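    Estimators of this kind typically combine statistics from a trace-driven cache simulator with simple linear cost models. The sketch below shows such a generic estimator (base cycles plus miss penalties for time; per-access, per-miss and leakage terms for energy); the coefficients are placeholders and the formulas are not RExCache's exact models.

```python
# A rough sketch of the kind of estimator such a framework combines with a
# trace-driven cache simulator. The coefficients are placeholders and the
# formulas are generic, not the paper's exact models.

def estimate_exec_cycles(base_cycles, llc_misses, miss_penalty_cycles):
    """Base (cache-perfect) cycles plus the stall cycles caused by LLC misses."""
    return base_cycles + llc_misses * miss_penalty_cycles

def estimate_energy_nj(exec_cycles, llc_accesses, llc_misses,
                       e_access_nj, e_miss_nj, leakage_nj_per_cycle):
    """Dynamic energy of LLC accesses/misses plus leakage over the run."""
    dynamic = llc_accesses * e_access_nj + llc_misses * e_miss_nj
    static = exec_cycles * leakage_nj_per_cycle
    return dynamic + static

if __name__ == "__main__":
    # hypothetical statistics for one last-level cache configuration
    cycles = estimate_exec_cycles(base_cycles=5_000_000, llc_misses=40_000,
                                  miss_penalty_cycles=120)
    energy = estimate_energy_nj(cycles, llc_accesses=900_000, llc_misses=40_000,
                                e_access_nj=0.5, e_miss_nj=12.0, leakage_nj_per_cycle=0.02)
    print(f"estimated cycles: {cycles:,}, estimated energy: {energy:,.0f} nJ")
```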
  • ABSTRACT: Soft errors have been identified as one of the major challenges to CMOS technology based computing systems. To mitigate this problem, error recovery is a key component, which usually accounts for a substantial cost, since it must introduce redundancy in either time or space. Consequently, using state-of-the-art recovery techniques can heavily strain the design constraints, which are fairly stringent in embedded system design. In this paper, we propose a HW/SW methodology that generates a processor which performs finely configured error recovery functionality targeting the given design constraints (e.g., performance, area and power). Our methodology employs three application-specific optimization heuristics, which generate the optimized composition and configuration based on two primitive error recovery techniques. The resultant processor is composed of the selected primitive techniques at the corresponding instruction executions, and is configured to perform error recovery at run-time according to the scheme determined at design time. The experimental results show that our methodology can achieve up to nine times greater reliability while maintaining the given constraints, in comparison to the state of the art.
    Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013; 01/2013
  • ABSTRACT: It is of critical importance to satisfy the deadline requirements of an embedded application to avoid undesired outcomes. Multiprocessor System-on-Chips (MPSoCs) play a vital role in contemporary embedded devices in satisfying timing deadlines. Such MPSoCs include two-level cache hierarchies which have to be dimensioned carefully to support the timing deadlines of the application(s) while consuming minimum area and therefore minimum power. Given the deadline of an application, it is possible to systematically derive the maximum time that can be spent on memory accesses, which can then be used to dimension suitable cache sizes. As the dimensioning has to be done rapidly to satisfy time-to-market requirements, we choose a well acclaimed rapid cache simulation strategy, single-pass trace driven simulation, for estimating the cache dimensions. For the first time, we address the two main challenges, coherency and scalability, in adapting a single-pass simulator to an MPSoC with a two-level cache hierarchy. The challenges are addressed through a modular bottom-up simulation technique where L1 and L2 simulations are handled in independent communicating modules. In this paper, we present how the dimensioning is performed for a two-level inclusive data cache hierarchy in an MPSoC. With the proposed rapid simulation, the estimates are produced within an hour (worst case on the considered application benchmarks). We evaluated our approach with task based MPSoC implementations of JPEG and H.264 benchmarks and achieved timing deviations of 16.1% and 7.2% respectively on average against the requested data access times. The deviations are always positive, meaning our simulator is guaranteed to satisfy the requested data access time. In addition, we generated a set of synthetic memory traces and used them to extensively analyse our simulator. For the synthetic traces, our simulator provides cache sizes that always guarantee the requested data access time, deviating by less than 14.5% on average.
    Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis; 10/2012
  • Liang Tang, Jorgen Peddersen, Sri Parameswaran
    ABSTRACT: The need to integrate multiple wireless communication protocols into a single low-cost, low-power hardware platform is prompted by the increasing number of emerging communication protocols and applications. This paper presents an efficient methodology for integrating multiple wireless protocols in an ASIC while minimizing resource usage. A hierarchical data path merging algorithm is developed to find common shareable components in two different communication circuits. The data path merging approach builds a combined generic circuit with inserted multiplexers (MUXes) which provides the same functionality as each individual circuit. The proposed method is orders of magnitude faster (well over 1000 times faster for realistic circuits) than the existing data path merging algorithm (with an overhead of 3% additional area) and can switch communication protocols on the fly (i.e. it can switch between protocols in a single clock cycle), which is a desirable feature for seemingly simultaneous multi-mode wireless communication. Wireless LAN (WLAN) 802.11a, WLAN 802.11b and Ultra Wide Band (UWB) transmission circuits are merged to prove the efficacy of our proposal.
    Proceedings of the IEEE International Conference on VLSI Design 01/2012;
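    Data path merging can be pictured, in a very reduced form, as matching components of the two circuits by type and inserting MUXes where a shared component is fed differently by each protocol. The toy sketch below counts merged components and MUXes for two invented component inventories; the actual algorithm merges hierarchical netlists with real interconnect.

```python
# A toy version of data path merging: components are matched by type so that the
# merged circuit needs only max(count) instances of each type, and a MUX is
# assumed in front of every shared instance. Component types and the example
# inventories are illustrative, not real baseband netlists.

from collections import Counter

def merge_datapaths(circuit_a, circuit_b):
    """circuit_x maps component type -> number of instances used by that circuit."""
    a, b = Counter(circuit_a), Counter(circuit_b)
    merged = {t: max(a[t], b[t]) for t in set(a) | set(b)}
    shared = {t: min(a[t], b[t]) for t in set(a) & set(b)}
    muxes = sum(shared.values())        # pessimistic: one input MUX per shared instance
    return merged, muxes

if __name__ == "__main__":
    wlan_a = {"fft": 1, "mult": 4, "add": 6, "scrambler": 1}
    uwb    = {"fft": 1, "mult": 6, "add": 4, "spreader": 1}
    merged, muxes = merge_datapaths(wlan_a, uwb)
    total_separate = sum(wlan_a.values()) + sum(uwb.values())
    print("merged components:", sum(merged.values()), "of", total_separate,
          "| MUXes added:", muxes)
```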

Publication Stats

663 Citations
15.01 Total Impact Points

Institutions

  • 2003–2013
    • University of South Wales
      Pontypridd, Wales, United Kingdom
  • 2001–2013
    • University of New South Wales
      • School of Computer Science and Engineering
      Kensington, New South Wales, Australia
  • 2010
    • Stanford University
      Palo Alto, California, United States
  • 2009
    • Tampere University of Technology
      Tampere, Province of Western Finland, Finland
  • 1998–2001
    • University of Queensland 
      • School of Information Technology and Electrical Engineering
      Brisbane, Queensland, Australia