Tjerk Bijlsma

Universiteit Twente, Enschede, Overijssel, Netherlands

Are you Tjerk Bijlsma?

Claim your profile

Publications (11)0 Total impact

  • [Show abstract] [Hide abstract]
    ABSTRACT: Many applications contain loops with an undetermined number of iterations. These loops have to be parallelized in order to increase the throughput when executed on an embedded multiprocessor platform. This paper presents a method to automatically extract a parallel task graph based on function level parallelism from a sequential nested loop program with while loops. In the parallelized task graph loop iterations can overlap during execution. We introduce the notion of a single assignment section such that we can exploit single assignment to overlap iterations of the while loop during the execution of the parallel task graph. Synchronization is inserted in the parallelized task graph to ensure the same functional behavior as the sequential nested loop program. It is shown that the generated parallel task graph does not introduce deadlock. A DVB-T radio receiver where the user can switch channels after an undetermined amount of time illustrates the approach.
    Design, Automation & Test in Europe Conference & Exhibition (DATE), 2011; 04/2011
  • Source
    Tjerk Bijlsma
    [Show abstract] [Hide abstract]
    ABSTRACT: This thesis is concerned with the automatic parallelization of real-time stream processing applications, such that they can be executed on embedded multiprocessor systems. Stream processing applications can be found in the video and channel decoding domain. These applications often have temporal requirements and can contain non-manifest conditions and expressions. For non-manifest conditions and expressions the results cannot be evaluated at compile time. Current parallelization approaches have difficulties with the extraction of function parallelism from stream processing applications. Some of these approaches require applications with manifest behavior and affine index-expressions. For these applications, they can derive data dependencies and insert inter-task communication via FIFO buffers. But, these approaches cannot support stream processing applications with non-manifest loops. Furthermore, to the best of our knowledge current approaches can only extract a temporal analysis model from applications with manifest behavior and without cyclic data dependencies. To address the issues mentioned above, we present an automatic parallelization approach to extract function parallelism from sequential descriptions of real-time stream processing applications. We introduce a language to describe stream processing applications. The key property of this language is that all dependencies can be derived at compile time. In our language we support non-manifest loops, if-statements, and index-expressions. We introduce a new buffer type that can always be used to replace the array communication. This buffer supports multiple reading and writing tasks. Because we can always derive the data dependencies and always replace the array communication by communication via a buffer, we can always extract the available function parallelism. Furthermore, our parallelization approach uses an underlying temporal analysis model, in which we capture the inter-task synchronization. With this analysis model, we can compute system settings and perform optimizations. Our parallelization approach is implemented in a multiprocessor compiler. We evaluated our approach, by extracting parallelism from a WLAN channel decoder application and a JPEG decoder application with our multiprocessor compiler.
    Theoretical Computer Science - TCS. 01/2011;
  • [Show abstract] [Hide abstract]
    ABSTRACT: Multimedia applications process streams of values and can often be represented as task graphs. For performance reasons, these task graphs are executed on multiprocessor systems. Inter-task communication is performed via buffers, where the order in which values are written into a buffer can differ from the order in which they are read. Some existing approaches perform inter-task communication via first-in-first-out buffers in combination with reordering tasks and require applications with affne index-expressions. In our previous work, we used circular buffers with a non-overlapping read and write window, such that a reordering task is not required. However, these windows can cause deadlock for cyclic task graphs. In this paper, we introduce circular buffers with multiple overlapping windows that do not delay the release of locations and therefore they do not introduce deadlock for cyclic task graphs. We show that buffers with multiple overlapping read and write windows are attractive, because they avoid that a buffer has to be selected from which a value has to be read or into which a value has to be written. This significantly simplifies the extraction of a task graph from a sequential application. These buffers are also attractive, because a buffer capacity equal to the array size is sufficient for deadlock-free execution, instead of performing global analysis to compute sufficient buffer capacities. Our case-study presents two applications that require these buffers.
    01/2011;
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Multimedia applications process streams of values and can often be represented as task graphs. For performance reasons, these task graphs are executed on multiprocessor systems. Inter-task communication is performed via buffers, where the order in which values are written into a buffer can differ from the order in which they are read. Some existing approaches perform inter-task communication with first-in-first-out buffers and reordering tasks and require applications with affine index expressions. Other approaches communicate containers, in which values can be accessed in any order, such that a reordering task is not required. However, these containers delay the release of locations, which can cause deadlock in cyclic task graphs. In this paper, we introduce circular buffers with overlapping windows for deadlock-free execution of cyclic task graphs that may contain non-affine index expressions. Inside the windows, values can be written or read in an arbitrary order, such that a reordering task is not required. Deadlock is avoided by releasing a written location directly from the write window. The approach is demonstrated for the cyclic task graph of an orthogonal frequency-division multiplexing (OFDM) receiver application, containing non-affine index expressions.
    Systems, Architectures, Modeling, and Simulation, 2009. SAMOS '09. International Symposium on; 08/2009
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Multimedia applications, executed by embedded multiprocessor systems, can in some cases be represented as task graphs, with the tasks containing nested loop programs. The nested loop programs communicate via arrays and can be executed on different processors. Typically an array can be communicated via a circular buffer with a capacity smaller than the array. For such buffers, the communicating nested loop programs have to synchronize and a sufficient buffer capacity needs to be computed. In a circular buffer we use a write and a read window to support rereading, out-of-order reading or writing, and skipping of locations. A cyclo static dataflow model is derived from the application and used to compute buffer capacities that guarantee deadlock free execution. Our case-study applies circular buffers in a Digital Audio Broadcasting channel decoder application, where the frequency deinterleaver reads according to a non-affine pseudo-random function. For this application, buffer capacities are calculated that guarantee deadlock free execution.
    Proceedings of the 11th International Workshop on Software and Compilers for Embedded Systems, Munich, Germany, March 13-14, 2008; 01/2008
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: In modern multiprocessor systems, proces-sors can be stalled by inter-task communication when read-ing from a remote buffer. This paper presents a solution for the inter-task communication, that has a minimal im-pact on the performance of the system, hides the inter-task communication latency without requiring additional hard-ware. The solution applies to jobs, represented as task graphs, where the tasks are nested loop programs. Buffers are allocated in scratch-pad memories of the consuming tasks to provide low latency read access. For the nested loop programs, minimal buffer sizes can be determined to cover all possible communication patterns. The added computational complexity is low, as the solution adds only a few operations to the nested loop programs.
    Operating Systems Review - SIGOPS. 01/2007;
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Digital Down Conversion (DDC) is an algorithm, used to lower the amount of samples per second by selecting a limited frequency band out of a stream of samples. A pos- sible DDC algorithm consists of two simple Cascading In- tegrating Comb (CIC) filters and a Finite Input Response (FIR) filter preceded by a modulator that is controlled with a Numeric Controlled Oscillator (NCO). Implementations of the algorithm have been made for five architectures, two Application Specific Integrated Circuits (ASIC), a General Purpose Processor (GPP), a Field Programmable Gate Ar- ray (FPGA), and the Montium Tile Processor (TP). All ar- chitectures are functionally capable of performing the algo- rithm. The differences between the architectures are their performance, flexibility and energy consumption. In this paper we compared the energy consumption of the archi- tectures when performing the DDC algorithm. The ASIC is the best solution if digital down conversion is constantly re- quired. When digital down conversion is needed only parts of the time, the Altera Cyclone II is the best solution due to its smaller technology size. In the spare time the reconfig- urable architectures can be reconfigured for other tasks of today's multimedia devices.
    20th International Parallel and Distributed Processing Symposium (IPDPS 2006), Proceedings, 25-29 April 2006, Rhodes Island, Greece; 01/2006
  • Tjerk Bijlsma, Pierre G. Jansen
    [Show abstract] [Hide abstract]
    ABSTRACT: With the growing popularity of Wireless Sensor Networks (WSNs), the demand increases to perform time critical operations within such networks. A WSN is composed of sensor nodes, which contain a radio, ports for multiple sensors and a microcontroller. These sensor nodes have to provide a wide range of functionality as long as possible, while they use their scarce energy from a battery.
    Smart Sensing and Context, First European Conference, EuroSSC 2006, Enschede, Netherlands, October 25-27, 2006, Proceedings; 01/2006
  • T. Bijlsma, P.T. Wolkotte, G.J.M. Smit
    [Show abstract] [Hide abstract]
    ABSTRACT: Digital down conversion (DDC) is an algorithm, used to lower the amount of samples per second by selecting a limited frequency band out of a stream of samples. A possible DDC algorithm consists of two simple cascading integrating comb (CIC) filters and a finite input response (FIR) filter preceded by a modulator that is controlled with a numeric controlled oscillator (NCO). Implementations of the algorithm have been made for five architectures, two application specific integrated circuits (ASIC), a general purpose processor (GPP), a field programmable gate array (FPGA), and the Montium tile processor (TP). All architectures are functionally capable of performing the algorithm. The differences between the architectures are their performance, flexibility and energy consumption. In this paper, we compared the energy consumption of the architectures when performing the DDC algorithm. The ASIC is the best solution if digital down conversion is constantly required. When digital down conversion is needed only parts of the time, the Altera Cyclone II is the best solution due to its smaller technology size. In the spare time, the reconfigurable architectures can be reconfigured for other tasks of today's multimedia devices
    Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International; 01/2006
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Our Multi Processor System on Chip (MPSoC) template provides processing tiles that are connected via a network on chip. A processing tile contains a processing unit and a Scratch Pad Memory (SPM). This paper presents the Omphale tool that performs the first step in mapping a job, represented by a task graph, to such an MPSoC, given the SPM sizes as constraints. Furthermore a memory tile is introduced. The result of Omphale is a Cyclo Static DataFlow (CSDF) model and a task graph where tasks communicate via sliding windows that are located in circular buffers. The CSDF model is used to determine the size of the buffers and the communication pattern of the data. A buffer must fit in the SPM of the processing unit that is reading from it, such that low latency access is realized with a minimized number of stall cycles. If a task and its buffer exceed the size of the SPM, the task is examined for additional parallelism or the circular buffer is partly located in a memory tile. This results in an extended task graph that satisfies the SPM size constraints.