Article

Abstract

The computational demands of encoding and decoding motion-compensated transform representations of digital video are well-known, and even hard-wired solutions to single algorithms remain a design challenge. When interactivity or personalization is added, and when algorithms increase in complexity to include structured or object-based representations, not only do the requirements increase but so too does the need for computational flexibility. It is often proposed to solve the computational problem in a flexible manner by using multiple identical general-purpose processors in parallel (a multiple-instruction, multiple-data, or MIMD approach). Such methods, though, may not achieve the needed number of operations per second without large numbers of processors; in that case communications bottlenecks can arise and programmers can find it difficult to parallelize software efficiently. A less-well-known form of parallel computation, based on streams, is conceptually closer to the ways in...

... One characteristic property of applications such as image recognition, digital signal processing and video stream operations is that they are typically very demanding [1], and the same operations are repeated. Many of the above applications have structured data which lie in blocks or in a regular manner in memory [2]. ...
... Such properties can be exploited by letting the applications execute in one or more computation modules that receive the data in streams. This type of data processing is also known as stream-based computation [1]. Examples of stream-based algorithms are various transforms and filtering operations on video and multimedia data (e.g. ...
... A lot of work has been done on developing new machines which are controlled by a data stream rather than by an instruction stream. Chidi [1] and PipeRench [4] are used to accelerate multimedia applications. Likewise, the SCORE [5] system and RaPiD [6] are reconfigurable systems for stream-based computations. ...
Conference Paper
This paper describes the design and implementation of an address generator for stream-based computation. The unit can generate addresses by a 1-, 2- or 3-dimensional mapping from a linear data string in memory. A processing unit will get the required data in a continuous stream without empty time slots, even when switching between addressing algorithms. Each algorithm is specified by a set of parameters loaded into FIFOs in the background. The unit is specified in VHDL, simulated, synthesized and implemented on an FPGA of type Xilinx Virtex-II Pro. A speed of 144 MHz is obtained for generating 36-bit addresses. Ideas for expanding the flexibility of the unit are discussed.
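
As a rough software model of what such a unit does (a sketch only; the real design is VHDL on a Virtex-II Pro, and the parameter names base, extents and strides are invented here), the following Python generator produces the linear addresses for a 1-, 2- or 3-dimensional scan over a flat data string in memory:

def addresses(base, extents, strides):
    """Yield linear memory addresses for a 1-, 2- or 3-D scan.

    extents[i] is the number of steps in dimension i (innermost first);
    strides[i] is the address increment for dimension i.
    """
    assert len(extents) == len(strides) <= 3
    def scan(dim, offset):
        if dim < 0:
            yield offset
            return
        for i in range(extents[dim]):
            yield from scan(dim - 1, offset + i * strides[dim])
    yield from scan(len(extents) - 1, base)

# Example: column-major walk over an 8x4 block stored row-major
# with a row pitch of 640 words.
addrs = list(addresses(base=0, extents=[8, 4], strides=[640, 1]))

Switching "algorithms" then amounts to loading a fresh (base, extents, strides) parameter set, which the hardware unit keeps queued in FIFOs so the output stream never stalls.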
... In [36], Watlington and Bove discuss a stream-based computing paradigm for programming video processing applications. Rather than dealing with multidimensional dataspaces directly, as is done in this paper, the authors sketch some ideas of how multidimensional arrays can be collapsed into 1-D streams using simple horizontal/vertical scanning techniques. ...
... Programs are specified as signal flow graphs. Streams are 1-D, as in [36]. Multirate operations are supported by associating a clock period with every operation. ...
Article
Full-text available
Signal flow graphs with dataflow semantics have been used in signal processing system simulation, algorithm development, and real-time system design. Dataflow semantics implicitly expose function parallelism by imposing only a partial ordering constraint on the execution of functions. One particular form of dataflow called synchronous dataflow (SDF) has been quite popular in programming environments for digital signal processing (DSP) since it has strong formal properties and is ideally suited for expressing multirate DSP algorithms. However, SDF and other dataflow models use first-in first-out (FIFO) queues on the communication channels and are thus ideally suited only for one-dimensional (1-D) signal processing algorithms. While multidimensional systems can also be expressed by collapsing arrays into 1-D streams, such modeling is often awkward and can obscure potential data parallelism that might be present. SDF can be generalized to multiple dimensions; this model is called multidimensional synchronous dataflow (MDSDF). This paper presents MDSDF and shows how MDSDF can be efficiently used to model a variety of multidimensional DSP systems, as well as other types of systems that are not modeled elegantly in SDF. However, MDSDF generalizes the FIFO queues used in SDF to arrays and, thus, is capable only of expressing systems sampled on rectangular lattices. This paper also presents a generalization of MDSDF that is capable of handling arbitrary sampling lattices and lattice-changing operations such as nonrectangular decimation and interpolation. An example of a practical system is given to show the usefulness of this model. The key challenge in generalizing the MDSDF model is preserving static schedulability, which eliminates the overhead associated with dynamic scheduling, and preserving a model where data parallelism, as well as functional parallelism, is fully explicit
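
The static schedulability that SDF (and MDSDF) preserves rests on the balance equations: for every arc from u to v, the repetition counts must satisfy r[u] * produced = r[v] * consumed. The sketch below (Python; the three-actor graph and its rates are invented for illustration, and a connected graph is assumed) solves these equations for the smallest positive integer repetitions vector:

from fractions import Fraction
from math import lcm

def repetitions(actors, edges):
    """edges: list of (src, dst, produced, consumed)."""
    r = {actors[0]: Fraction(1)}
    changed = True
    while changed:                      # propagate rates to a fixed point
        changed = False
        for u, v, p, c in edges:
            if u in r and v not in r:
                r[v] = r[u] * p / c
                changed = True
            elif v in r and u not in r:
                r[u] = r[v] * c / p
                changed = True
            else:
                assert r[u] * p == r[v] * c, "inconsistent (unschedulable) graph"
    scale = lcm(*(f.denominator for f in r.values()))
    return {a: int(f * scale) for a, f in r.items()}

# A produces 2 tokens/firing; B consumes 1 and produces 3; C consumes 2.
print(repetitions(["A", "B", "C"],
                  [("A", "B", 2, 1), ("B", "C", 3, 2)]))
# {'A': 1, 'B': 2, 'C': 3}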
... The high-performance signal processing part of an application can be represented by task graphs, in which the nodes represent large autonomous tasks and the edges represent stream communication between these tasks [5][6][10]. To obtain high performance, concurrent execution of tasks is exploited by using multiple autonomous ADS processors that can perform tasks independently, and thus in parallel with other processors (seeFigure 1). ...
Conference Paper
Full-text available
The demands in terms of processing performance, communication bandwidth and real-time throughput of many multimedia applications are much higher than today's processing architectures can deliver. The PROPHID heterogeneous multiprocessor architecture template aims to bridge this gap. The template contains a general purpose processor connected to a central bus, as well as several high-performance application domain specific processors. A high-throughput communication network is used to meet the high bandwidth requirements between these processors. In this network multiple time-division-multiplexed data streams are transferred over several parallel physical channels. This paper presents a method for guaranteeing the throughput for hard-real-time streams in such a network. At compile time sufficient bandwidth is assigned to these streams. The assignment can be determined in polynomial time. Remaining bandwidth is assigned to soft-real-time streams at run time. We thus achieve efficient stream communication with guaranteed performance
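
The compile-time part of such a scheme can be illustrated with a simple greedy assignment over a TDM frame (a sketch under assumed numbers; the frame length, slot bandwidth and stream demands below are invented, and the paper's polynomial-time algorithm handles multiple physical channels rather than the single wheel shown here):

from math import ceil

def assign_slots(frame_slots, slot_bw, hard_streams):
    """hard_streams: {name: required bandwidth}. Returns (table, free slots)."""
    table, nxt = {}, 0
    for name, bw in hard_streams.items():
        need = ceil(bw / slot_bw)          # slots per frame for this stream
        if nxt + need > frame_slots:
            raise ValueError("frame over-committed: " + name)
        table[name] = list(range(nxt, nxt + need))
        nxt += need
    return table, frame_slots - nxt        # leftover slots for soft streams

table, free = assign_slots(frame_slots=32, slot_bw=25e6,
                           hard_streams={"video_in": 270e6, "video_out": 270e6})
# each 270 Mbit/s stream gets ceil(270/25) = 11 slots; the 10 remaining
# slots are granted to soft-real-time streams at run time.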
... In the application domain of real-time video we focus on dedicated architectures that support the concept of streams [17] and achieve the required performance by exploiting the inherent parallelism of the applications on domain-specific, coarse-grain processors, with limited internal flexibility (i.e. weakly programmable). ...
Conference Paper
Full-text available
In this paper we present an approach for quantitative analysis of application-specific dataflow architectures. The approach allows the designer to rate design alternatives in a quantitative way and therefore supports him in the design process to find better performing architectures. The context of our work is video signal processing algorithms which are mapped onto weakly-programmable, coarse-grain dataflow architectures. The algorithms are represented as Kahn graphs with the functionality of the nodes being coarse-grain functions. We have implemented an architecture simulation environment that permits the definition of dataflow architectures as a composition of architecture elements, such as functional units, buffer elements and communication structures. The abstract, clock-cycle accurate simulator has been built using a multi-threading package and employs object oriented principles. This results in a configurable and efficient simulator. Algorithms can subsequently be executed on the architecture model producing quantitative information for selected performance metrics. Results are presented for the simulation of a realistic application on several dataflow architecture alternatives, showing that many different architectures can be simulated in modest time on a modern workstation
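
At a very high level of abstraction, the kind of quantitative output such a simulator produces can be imitated with simple timing recurrences (a sketch, not the paper's multithreaded simulator; the latencies and buffer depth are invented): two functional units connected by a bounded buffer, with the completion time of every token computed explicitly.

def pipeline_cycles(n, lat_a=2, lat_b=3, depth=4):
    """Cycle when unit B emits its n-th token in A -> FIFO(depth) -> B."""
    a_end, b_start, b_end = [0] * n, [0] * n, [0] * n
    for i in range(n):
        a = (a_end[i - 1] if i else 0) + lat_a
        if i >= depth:                      # FIFO full: wait for a free slot
            a = max(a, b_start[i - depth])
        a_end[i] = a
        b_start[i] = max(a_end[i], b_end[i - 1] if i else 0)
        b_end[i] = b_start[i] + lat_b
    return b_end[-1]

print(pipeline_cycles(100))   # about lat_a + 100*lat_b: unit B is the bottleneck

Sweeping parameters such as depth or the unit latencies gives exactly the sort of quantitative comparison between architecture alternatives the paper describes, albeit for a trivial two-node graph.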
... The DS (Data Shuffler) controls the flow of stream data between the PowerPC bus and the RP. It basically contains buffers, FIFOs, comparators, dataflow state machines, and registers needed to move stream data using the Media Lab's Stream Processing mechanism [13]. Sixty-four bits of data can flow through the DS in either direction simultaneously. ...
Article
Holo-Chidi is a holographic video processing system designed at the MIT Media Laboratory for real-time computation of Computer Generated Holograms and the subsequent display of the holograms at video frame rates. Its processing engine is adapted from Chidi, a reconfigurable multimedia processing system used for real-time synthesis and analysis of digital video frames. Holo-Chidi is made of two main components: the sets of Chidi processor cards and the display video concentrator card. The processor cards are used for hologram computation while the display video concentrator card acts as frame buffer for the system. The display video concentrator also formats the computed holographic data and converts them to analog form for feeding the acousto-optic modulators of the Media Lab's Mark-II holographic display system. The display video concentrator card can display the computed holograms from the Chidi cards loaded from its high-speed I/O (HSIO) interface port or precomputed hologr...
... The Object-Based Media Group at the Media Laboratory has for several years proposed stream-based software techniques and hardware architectures for managing interconnected and possibly heterogeneous parallel processing resources, particularly in the domain of sophisticated video and audio processing [1]. Even in systems with standard hardware architecture the application of stream-based software techniques is advantageous. ...
Article
The Chidi system is a PCI-bus media processor card which performs its processing tasks on a large field-programmable gate array (Altera 10K100) in conjunction with a general purpose CPU (PowerPC 604e). Special address-generation and buffering logic (also implemented on FPGAs) allows the reconfigurable processor to share a local bus with the CPU, turning burst accesses to memory into continuous streams and converting between the memory's 64-bit words and the media data types. In this paper we present the design requirements for the Chidi system, describe the hardware architecture, and discuss the software model for its use in media processing. Keywords: video compression, field-programmable gate array, data-flow computing, digital signal processing. 1. INTRODUCTION: Now that field-programmable gate arrays (FPGAs) have reached gate densities at which they can perform useful computational tasks, particularly in certain mathematical and signal-processing domains, they are increasingly bei...
Article
Full-text available
Abstract This thesis proposes a mechanism, streams, for overcoming many of the problems associated with the parallel processing of media. A programming model and runtime system using the stream mechanism, MagicEight, is also proposed. MagicEight supports medium to coarse grain parallelism, using a hybrid dataflow model of execution. Multidimensional streams of data elements provide a scalable means of both obtaining parallelism and ameliorating an indeterminate memory access latency. The machine architectures which MagicEight is intended to support range from a single general purpose processor to heterogeneous multiprocessor systems of two to two hundred processors interconnected by communications channels of varying capabilities. In the multiprocessor case, some of the processors may be specialized -- capable of executing a restricted set of algorithms much more efficiently than a general purpose processor. 1. Introduction: While the ability of personal computers to acquire, process, and present video and sound has now been established, the computational requirements of many media applications exceed that provided by a single general purpose processor. My thesis is that streams are a mechanism for enabling efficient dynamic parallelization of the computational tasks typically found in media processing. I am also proposing a programming model for media processing using this mechanism. The model is a variant of hybrid dataflow, utilizing multidimensional streams as both a basic data type and a mechanism for synchronization and obtaining parallelism. It supports machine architectures containing a heterogeneous mix of processors. In order to provide higher compression, greater flexibility, and more semantic description of scene content, video is increasingly moving toward representations in which the data are segmented not into arbitrary fixed and regular patterns, but rather into objects or regions determined by scene-understanding algorithms [18][23][30][7][6]. These structured (or object-based) representations are effectively sets of objects and "scripts" describing how to render output images from the objects. The media being presented is generated at the receiver, not merely decoded, allowing the presentation to adapt to receiver capabilities, viewing situation, and user preferences.
Article
We describe a parallel computer system for processing media: audio, video, and graphics, among others. The system supports medium to coarse grain parallelism, using a dataflow model of execution, on a range of machine architectures scaling from a single von Neumann or general purpose processor (GPP) up to networks of several hundred heterogeneous processors. A distributed resource manager, extending or subsuming the functionality of a traditional operating system, is an integral and necessary part of the system. While we are building a system for processing a variety of media, in this paper we concentrate on video because it provides an extreme case in terms of both data rates and available parallelism.
Article
Object-based media refers to the representation of audiovisual information as a collection of objects - the result of scene-analysis algorithms - and a script describing how they are to be rendered for display. Such multimedia presentations can adapt to viewing circumstances as well as to viewer preferences and behavior, and can provide a richer link between content creator and consumer. With faster networks and processors, such ideas become applicable to live interpersonal communications as well, creating a more natural and productive alternative to traditional videoconferencing. This paper outlines examples of object-based media algorithms and applications developed by my group, and presents new hardware architectures and software methods that we have developed to meet the computational requirements of object-based and other advanced media representations. In particular we describe stream-based processing, which enables automatic run-time parallelization of multidimensional signal processing tasks even given heterogeneous computational resources.
Conference Paper
We present an approach to model dataflow architectures at a high level of abstraction using timed coloured Petri nets. We specifically examine the value of Petri nets for evaluating the performance of such architectures. For this purpose we assess the value of Petri nets both as a modelling technique for dataflow architectures and as an analysis tool that yields valuable performance data for such architectures through the execution of Petri net models. Because our aim is to use the models for performance analysis, we focus on representing the timing and communication behaviour of the architecture rather than the functionality. A modular approach is used to model architectures. We identify five basic hardware building blocks from which Petri net models of dataflow architectures can be constructed. In defining the building blocks we will identify strengths and weaknesses of Petri nets for modelling dataflow architectures. A technique called folding is applied to build generic models of dataflow architectures. A timed coloured Petri net model of the Prophid dataflow architecture, which is being developed at Philips Research Laboratories, is presented. This model has been designed in the tool ExSpect. The performance of the Prophid architecture has been analysed by simulation with this model.
Article
In this paper I describe the application of machine-vision techniques to video coding in order to create what my research group calls object-oriented television, where moving scenes are represented in terms of objects (as recovered by analysis methods). Beyond data compactness, such a representation offers the ability to add new degrees of freedom to content creation and display. I discuss some of the scene analysis problems (particularly 2-D and 3-D model-fitting and object segmentation) and the algorithmic approaches my group has taken to solve them; suggest computational strategies for compact, powerful, programmable decoding hardware (particularly stream-based computing combined with automatic resource management); and demonstrate some of the applications we have developed.
Article
This paper reports the development of a holographic video (holovideo) rendering system that uses standard 2-D medical imaging inputs and generates medical images of human body parts as holographic video with three-dimensional (3-D) realism. The system generates 3-D medical images by transforming a numerical description of a scene (such as in data from MRI, CAT, PET, and X-ray databases) into a holographic fringe pattern and then displays the images on the image volume of a holovideo display system. The system uses specialized digital signal processors to scale up the computation and rendering speed of the holovideo computing system beyond what exists today. Holograms developed under this research have horizontal (holographic) resolution high enough for smooth binocular parallax and a (video) resolution in the vertical direction comparable to NTSC television. Thus, the holovideo rendering and display system provides medical personnel with the information essential for viewing internal organs of humans with accurate 3-D realism. It is envisioned that the commercializable system that will ultimately be developed in the course of this research program will be compact enough to be used as desktop equipment in a medical imaging platform and economical enough to be made available in adequate numbers to hospitals.
Article
In this paper I describe some of the design issues and research questions associated with object-based video coding algorithms, as well as the new applications made possible. I propose a hardware and software strategy to cope with the computational demands (stream-based computing combined with automatic resource management) and also briefly introduce object-based audio representations that are linked to the video representations.
Conference Paper
PROPHID is a design method aiming at high-performance systems with a focus on high-throughput signal processing for multimedia applications. The processing and communication bandwidth requirements of such systems are very high. To obtain a good balance between performance, programmability and efficiency in terms of speed, area and power, PROPHID uses a novel heterogeneous multi-processor architecture template which exploits task-level concurrency. A general purpose processor aimed at control-oriented tasks and low to medium-performance signal processing tasks, as well as application domain specific processors aimed at high-performance signal processing tasks, are available in this template. Next to a central control-oriented bus, a special high-throughput communication network is used to meet the high bandwidth requirements of the application domain specific processors. This paper discusses the characteristics and advantages of the PROPHID architecture, showing that high performance is obtained by embedding multiple autonomous data-driven processors in a stream-based communication environment.
Conference Paper
Full-text available
The original vision of ubiquitous computing [14] is about enabling people to more easily accomplish tasks through the seamless interworking of the physical environment and a computing infrastructure. A major challenge to the practical realization of this vision involves the integration of commercial-off-the-shelf (COTS) hardware and software components: consider the awkwardness of such a mundane task as exporting a textual memo written on a Palm Pilot to a Microsoft Word document. It is not enough to overcome the protocol and data format mismatches that currently impede the interoperation of these entities: for the user experience to be truly seamless, we must provide a framework for the dynamic connection of such endpoints on demand, to support the ad-hoc interactions that are an integral part of ubiquitous computing. To this end, we offer a dynamic mediation framework called Paths. A Path consists of dynamically instantiated, automatically composable operators that bridge datatype and protocol mismatches between components wishing to communicate. Because operator composability is inferred from the type system, adding support for a new type of endpoint requires only incremental work; because the control and data flow for Paths are largely decoupled from the communicating endpoints, it is easy to connect COTS or legacy components. We describe the Paths architecture, our prototype implementation, and our experience and lessons based on several production applications built with the framework, and outline some continuing work on Paths in the context of the Stanford Interactive Workspaces project.
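
The type-driven composition at the heart of Paths can be sketched as a shortest-path search over operators typed as input-to-output conversions (Python; the operator names and type labels below are invented examples, not the framework's actual API):

from collections import deque

OPERATORS = {                      # name: (input type, output type)
    "memo_to_text": ("palm_memo", "plain_text"),
    "text_to_rtf":  ("plain_text", "rtf"),
    "rtf_to_word":  ("rtf", "word_doc"),
    "text_to_html": ("plain_text", "html"),
}

def find_path(src_type, dst_type):
    """Breadth-first search for the shortest operator chain."""
    queue, seen = deque([(src_type, [])]), {src_type}
    while queue:
        t, chain = queue.popleft()
        if t == dst_type:
            return chain
        for name, (i, o) in OPERATORS.items():
            if i == t and o not in seen:
                seen.add(o)
                queue.append((o, chain + [name]))
    return None

print(find_path("palm_memo", "word_doc"))
# ['memo_to_text', 'text_to_rtf', 'rtf_to_word']

Because the search is driven only by the type table, adding one new operator immediately makes every chain through it available, which is the incremental-work property the abstract describes.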
Article
The generation of computer-generated holographic fringes for real-time holographic video (holovideo) display is very computation-intensive, requiring the development of such special systems as the Massachusetts Institute of Technology (MIT) Media Lab's "Holo-Chidi" system, which can generate and display holovideo at video rates. The Holo-Chidi system is made of two sets of cards - the set of processor cards and the set of video concentrator cards (VCCs). The processor cards are used for hologram computation, data archival/retrieval from a host system, and higher level control of the VCCs. The VCC formats computed holographic data from multiple hologram-computing processor cards, converting the digital data to analog form to feed the acousto-optic modulators of the Media Lab's "Mark-II" holographic display system. The generation of the holographic fringes from the 3-D numerical description of a scene takes place inside field-programmable gate arrays (FPGAs) resident in the processor card. These large FPGAs employ several superposition processing pipelines, all working in parallel to generate the fringes of the hologram frame. With nine processor boards, there are the equivalent of about 288 superposition "processors" generating the fringes simultaneously. A Holo-Chidi system with three VCCs has enough frame buffering capacity to hold up to 32 36-Mbyte hologram frames at a time. Precomputed holograms can also be loaded into the VCC from a host computer through the low-speed universal serial bus (USB) port.
Article
Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998. Includes bibliographical references (p. 78-79). by Mark Lee. M.Eng.
Article
Thesis (M.S.)--Massachusetts Institute of Technology, Program in Media Arts & Sciences, 1996. Includes bibliographical references (p. 81-84). by Kathleen Lee Evanco. M.S.
Conference Paper
Shortened military avionics life cycles, growing real-time processing performance requirements, and commercial product technology adaptation to military operational environments make the move to COTS challenging. Avionics systems integrators must focus on procuring avionics box enclosures that support military operational environments. Aircraft wiring modification processes have to be developed to provide upgrades that support high speed computing systems using greater data bandwidths. Software engineering processes have to be tailored to incorporate new computing system standards and methodologies. Avionics Systems Engineering processes must evolve and adapt to dynamically changing COTS NDI product lines that incorporate emerging standards. COTS-OSEP embraces the combination of art and science that includes insight, intuition, and technology projection necessary to upgrade avionics suites with the latest capabilities. With the convergence of telecommunication, high definition television, multimedia, and super computing desktop technologies, traditional military ECBS architecting paradigms have to significantly evolve in order to reap `value engineering' benefits from the commercial marketplace.
Article
Full-text available
This paper discusses the Chidi holographic video processing system (called Holo-Chidi) used for real-time computation of Computer Generated Holograms and the subsequent display of the holograms at video frame rates. Chidi is a reconfigurable multimedia processing system designed at the MIT Media Laboratory for real-time synthesis and analysis of multimedia data in general and digital video frames in particular. Holo-Chidi, which is an adaptation of Chidi, comprises two main components: the sets of processor cards and the display interface cards. Each processor card consists of a General Purpose Processor (GPP), a processor to PCI bridge, up to 128MByte DRAM, three SRAM-based Field Programmable Gate Arrays (FPGAs), and high bandwidth data transceivers, all resident on a standard PCI form-factor card. One of the FPGAs, called the RP (Reprogrammable Processor), is dynamically reconfigurable and enables Chidi to be used as a flexible specialized hardware for use in performing computatio...
Article
Full-text available
In this paper we present an approach for quantitative evaluation of design alternatives. We particularly focus on signal processing algorithms that are stream-oriented (possibly with dynamic dataflow) and on the mapping of the algorithms to configurable coarse-grain dataflow architectures. Such algorithms can be represented as Kahn graphs with the functionality of the nodes being described as functional C-code. We have implemented a simulation environment that permits the definition of a dataflow architecture as a composition of architecture elements such as functional units, buffer elements and communication structures. The simulator is built around a multithreading package using object oriented principles resulting in a configurable and efficient simulator. Algorithms can subsequently be executed on the architecture producing quantitative information for selected performance metrics. These performance numbers quantify design alternatives and therefore help steer design decisions. Res...
Article
The generation of video and audio coding methods to follow the present, pioneering generation has not yet been standardized, but it is possible to predict many of their characteristics. I discuss these, with particular reference to their impact on the design of software and hardware systems for multimedia. Keywords: video compression, model-based coding, data-flow computing, digital signal processing. 1. BACKGROUND: Compressed digital video and audio have become common. The price of dedicated processing circuits has dropped to the point that they are cost-effective for consumer applications, while general-purpose processor performance is increasing to the point that standard personal computers can support digital multimedia applications with little or no additional hardware support. But digital media representations are a moving target, and now that we know appropriate hardware/software design methodologies and optimizations to achieve real-time performance for the current generation ...
Article
Full-text available
A hierarchical model of computer organizations is developed, based on a tree model using request/service type resources as nodes. Two aspects of the model are distinguished: logical and physical. General parallel- or multiple-stream organizations are examined as to type and effectiveness, especially regarding intrinsic logical difficulties. The overlapped simplex processor (SISD) is limited by data dependencies. Branching has a particularly degenerative effect. The parallel processors [single-instruction stream-multiple-data stream (SIMD)] are analyzed. In particular, a nesting type explanation is offered for Minsky's conjecture that the performance of a parallel processor increases as log M instead of M (the number of data stream processors). Multiprocessors (MIMD) are subjected to a saturation syndrome based on general communications lockout. Simplified queuing models indicate that saturation develops when the fraction of task time spent locked out (L/E) approaches 1/n, where n is the number of processors. Resource sharing in multiprocessors can be used to avoid several other classic organizational problems.
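
The saturation condition can be made concrete with a back-of-envelope model (a simplification assumed here, not the paper's exact queuing analysis): if each of n processors spends a fraction L/E of its task time contending for a shared resource, useful throughput stops scaling once n * (L/E) reaches 1.

def effective_processors(n, L_over_E):
    """Processors' worth of useful work with lockout fraction L/E each."""
    demand = n * L_over_E          # offered load on the shared resource
    if demand < 1:
        return n                   # below saturation: near-linear scaling
    return n / demand              # saturated: throughput pinned at 1/(L/E)

for n in (2, 4, 8, 16, 32):
    print(n, round(effective_processors(n, L_over_E=0.05), 1))
# scaling is linear up to n = 1/(L/E) = 20 processors, flat afterwards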
Article
Full-text available
The Cheops Imaging System is a compact, modular platform for acquisition, processing, and display of digital video sequences and model-based representations of moving scenes, and is intended as both a laboratory tool and a prototype architecture for future programmable video decoders. Rather than using a large number of general-purpose processors and dividing up image processing tasks spatially, Cheops abstracts out a set of basic, computationally intensive stream operations that may be performed in parallel and embodies them in specialized hardware. We review the Cheops architecture, describe the software system that has been developed to perform resource management, and present the results of some performance tests
Article
Full-text available
A novel design methodology for rapid implementation of cheap high-performance ASICs (application-specific integrated circuits) is introduced. The method derives from high-level algorithm specifications or from high-level source programs not only the target hardware, but (in contrast to silicon compilers) also the machine code to run it. The method is based on a novel sequential machine paradigm where execution is used (being orders of magnitude more efficient) instead of simulation and where programmers may do the design job, rather than real hardware designers. It is shown that, for a very large class of commercially important algorithms (DSP, graphics, image processing and many others), this paradigm is orders of magnitude more efficient than the von Neumann paradigm. Compared to von-Neumann-based implementations, acceleration factors of up to more than 2000 have been obtained experimentally. The performance of ASICs obtained by this methodology is mostly competitive with ASIC designs obtained in the much slower and much more expensive traditional way. As a by-product the new methodology also supports the automatic generation of universal accelerators for coprocessor use in workstations.
Article
Full-text available
This report describes the hardware architecture and software implementation of a hologram computing system developed at the MIT Media Laboratory. The hologram computing employs specialized stream-processing hardware embedded in the Cheops Image Processing system -- a compact, block data-flow parallel processor. A superposition stream processor performs weighted summations of arbitrary one-dimensional basis functions. A two-step holographic computation method -- called Hogel-Vector encoding -- utilizes the stream processor's computational power. An array of encoded hogel vectors, generated from a three-dimensional scene description, is rapidly decoded using the processor. The resulting 36-megabyte holographic pattern is transferred to framebuffers and then fed to a real-time electro-holographic display, producing three-dimensional holographic images. System performance is sufficient to generate an image volume approximately 100 mm per side in 3 seconds. The architecture is scalable over...
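
The core superposition operation is an accumulation of weighted 1-D basis functions per hogel, which in software is a small matrix product (a sketch with invented sizes and random placeholder data, not the Cheops implementation):

import numpy as np

def superpose(weights, basis):
    """weights: (n_hogels, n_basis); basis: (n_basis, hogel_width).
    Returns one hologram line of n_hogels * hogel_width samples."""
    # each hogel is an independent weighted sum of the same basis table
    return (weights @ basis).reshape(-1)

rng = np.random.default_rng(0)
basis = rng.standard_normal((32, 256))     # 32 basis functions, 256 samples
weights = rng.standard_normal((1024, 32))  # one weight vector per hogel
line = superpose(weights, basis)           # 1024 * 256 = 262144 samples

Hogel-vector encoding splits the work exactly along this line: the weight vectors are computed from the scene description, and the stream processor performs the regular, data-independent decode step above at high rate.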
Book
This book describes a powerful language for multidimensional declarative programming called Lucid. Lucid has evolved considerably in the past ten years. The main catalyst for this metamorphosis was the discovery that Lucid is based on intensional logic, one commonly used in studying natural languages. Intensionality, and more specifically indexicality, has enabled Lucid to implicitly express multidimensional objects that change, a fundamental capability with several consequences which are explored in this book. The author covers a broad range of topics, from foundations to applications, and from implementations to implications. The role of intensional logic in Lucid as well as its consequences for programming in general is discussed. The syntax and mathematical semantics of the language are given and its ability to be used as a formal system for transformation and verification is presented. The use of Lucid in both multidimensional applications programming and software systems construction (such as a parallel programming system and a visual programming system) is described. A novel model of multidimensional computation--eduction--is described along with its serendipitous practical benefits for harnessing parallelism and tolerating faults. As the only volume that reflects the advances over the past decade, this work will be of great interest to researchers and advanced students involved with declarative language systems and programming.
Article
The Astronautics ZS-1 is a high speed, 64-bit computer system designed for scientific and engineering applications. The ZS-1 central processor uses a decoupled architecture, which splits instructions into two streams---one for fixed point/memory address computation and the other for floating point operations. The two instruction streams are then processed in parallel. Pipelining is also used extensively throughout the ZS-1. This paper describes the architecture and implementation of the ZS-1 central processor, beginning with some of the basic design objectives. Descriptions of the instruction set, pipeline structure, and virtual memory implementation demonstrate the methods used to satisfy the objectives. High performance is achieved through a combination of static (compile-time) instruction scheduling and dynamic (run-time) scheduling. Both types of scheduling are illustrated with examples.
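
The decoupling idea can be sketched with two cooperating streams, one issuing loads and one consuming operands (Python generators as an illustration only; real decoupled hardware lets the access stream slip ahead of the execute stream through architectural queues, whereas a generator is demand-driven):

def access(a, b, n):
    """Address/memory stream: issues loads in program order."""
    for i in range(n):
        yield (a[i], b[i])                # entries for the 'load queue'

def execute(loads):
    """Floating-point stream: consumes operands as they arrive."""
    acc = 0.0
    for x, y in loads:
        acc += x * y                      # dot-product kernel, for example
    return acc

a, b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
print(execute(access(a, b, 3)))           # 32.0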
Article
Access/execute architectures have several advantages over more traditional architectures. Because address generation and memory access are decoupled from operand use, memory latencies are tolerated better, there is more potential for concurrent operation, and it permits the use of specialized hardware to facilitate fast address generation. This paper describes the code generation and optimization algorithms that are used in an optimizing compiler for an architecture that contains explicit hardware support for the access/execute model of computation. Of particular interest is the novel approach that the compiler uses to detect recurrence relations in programs and to generate code for them. Because these relations are often used in problem domains that require significant computational resources, detecting and handling them can result in significant reductions in execution time. While the techniques discussed were originally targeted for one specific architecture, many of the techniques are applicable to commonly available microprocessors. The paper describes the algorithms as well as our experience with using them on a number of machines.
Article
In search of more compression, researchers have recently sought to describe digital video of real scenes not as sequences of frames but rather as collections of objects that are rendered and combined according to scripting information. Depending upon the application and the scene analysis tools available, representations may range from two-dimensional layers to full three-dimensional computer-graphics-style data bases. The significance of these more meaningful representations goes beyond compression, however, enabling new forms of interactivity and personalization, as well as new degrees of freedom in post-production. This paper proposes a computational framework for a television receiver that can handle digital video in forms from 'traditional' motion-compensated transform coders to sets of three-dimensional objects and discusses the requirements for a scripting language to control such a receiver. It is also noted that the concept of scalability can be expanded to include 'intelligently resizable video,' where the originator of a video sequence can specify how the scene is to be composed and cut for displays of differing sizes and aspect ratios.
Article
The combination of processing power, memory capacity, and input/output (I/O) bandwidth found in the IBM POWER Visualization System™ (PVS) makes it an ideal tool for high-end, digital post-production applications. This general-purpose computer can adapt to the specific tasks required during almost any phase of the post-production process, including computer graphics rendering, video editing, rotoscoping, and special effects, auto-assembly, conforming, and compression. This paper examines the post-production process itself: the steps that are taken, how data flows through it, and generally how the PVS can fit in to provide an integrated environment.
Article
A hardware structure capable of high concurrency is outlined and an abstract model of data flow program execution is presented which could be implemented within the proposed hardware structure. The abstract model supports a user programming language that includes recursive function modules and provides streams of values for inter-module communication. The aim of this work is to develop practical general-purpose computer systems embodying data flow principles.
Article
An abstract is not available.
Article
The author shows how a graphical programming environment like those commonly used for signal processing can expose data parallelism. In particular, he sets objectives for the syntax and semantics of graphical programs. It is shown that the synchronous dataflow model can be extended to multidimensional streams to represent and exploit data parallelism in signal processing applications. The resulting semantics are related to reduced dependence graphs used in systolic array design and to the stream-oriented functional languages Lucid, Sisal, and Silage. Formal properties are developed.
Article
One principle of structured programming is that a program should be separated into meaningful independent subprograms, which are then combined so that the relation of the parts to the whole can be clearly established. This paper describes several alternative ways to compose programs. The main method used is to permit the programmer to denote by an expression the sequence of values taken on by a variable. The sequence is represented by a function called a stream, which is a functional analog of a coroutine. The conventional while and for loops of structured programming may be composed by a technique of stream processing (analogous to list processing), which results in more structured programs than the originals. This technique makes it possible to structure a program in a natural way into its logically separate parts, which can then be considered independently.
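
In modern notation the paper's technique maps directly onto lazy generators: the sequence of values a loop variable takes on becomes a first-class stream, and the loop is recomposed from independently meaningful parts (a Python rendering of the idea, not the paper's own notation):

from itertools import count, islice

naturals = count(1)                        # the stream 1, 2, 3, ...
odds     = (n for n in naturals if n % 2)  # filter: 1, 3, 5, ...
squares  = (n * n for n in odds)           # map:    1, 9, 25, ...

# the conventional loop "sum the squares of the first 5 odd numbers"
# becomes a composition of separately stated, logically independent parts:
print(sum(islice(squares, 5)))             # 1 + 9 + 25 + 49 + 81 = 165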
Conference Paper
In this paper, we describe a simple language for parallel programming. Its semantics is studied thoroughly. The desirable properties of this language and its deficiencies are exhibited by this theoretical study. Basic results on parallel program schemata are given. We hope in this way to make a case for a more formal (i.e. mathematical) approach to the design of languages for systems programming and the design of operating systems. There is wide disagreement among systems designers as to what are the best primitives for writing systems programs. In this paper, we describe a simple language for parallel programming and study its mathematical properties.
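
The language described here is the origin of Kahn process networks, whose key property is that blocking reads on FIFO channels make the whole network's behavior deterministic. A minimal executable rendering (threads and queues in Python, with an end-of-stream marker added as an implementation convenience):

import threading, queue

def producer(out, n):
    for i in range(n):
        out.put(i)
    out.put(None)                          # end-of-stream marker

def doubler(inp, out):
    while (v := inp.get()) is not None:    # blocking read: Kahn semantics
        out.put(2 * v)
    out.put(None)

def consumer(inp, result):
    while (v := inp.get()) is not None:
        result.append(v)

c1, c2, result = queue.Queue(), queue.Queue(), []
threads = [threading.Thread(target=producer, args=(c1, 5)),
           threading.Thread(target=doubler, args=(c1, c2)),
           threading.Thread(target=consumer, args=(c2, result))]
for t in threads: t.start()
for t in threads: t.join()
print(result)                              # [0, 2, 4, 6, 8], deterministic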
Conference Paper
Streams are data structures proposed for inclusion in several research programming languages, including VAL, to promote parallel execution and to implement input-output in applicative systems. To avoid paying a large overhead cost in near-term multiprocessor systems, the authors propose a special version of streams whose implementation efficiency potential does not impair their usefulness in typical applications. Special streams require no dynamic storage management during element production and consumption. They are part of a VAL implementation effort for the Denelcor HEP multiprocessor system.
Conference Paper
In several data flow architectures, "streams" are proposed as special data structures able to improve parallel execution in functional programs by providing a pipelining effect between different program parts. This paper describes how streams are implemented on a data flow computer system based on a paged memory. This memory holds both the data flow programs and data structures such as streams. Streams are stored in the memory as a linked list of pages while pointers to the streams are flowing as data tokens. A reference count is used to prevent excessive copying of data and to control the allocation and recovery of pages. Input/output is treated as a special application of streams.
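
The storage scheme is straightforward to model: a stream is a linked list of fixed-size pages, tokens carry pointers to the stream rather than its data, and a reference count recovers the pages when the last pointer is dropped (a sketch; the page size and the omitted free-list management are placeholders):

PAGE_SIZE = 4

class Page:
    def __init__(self):
        self.items, self.next = [], None

class Stream:
    def __init__(self):
        self.head = self.tail = Page()
        self.refcount = 1                  # one token points at the stream

    def append(self, v):
        if len(self.tail.items) == PAGE_SIZE:
            self.tail.next = Page()        # link a fresh page
            self.tail = self.tail.next
        self.tail.items.append(v)

    def add_ref(self):                     # a new token points at the stream
        self.refcount += 1

    def release(self):                     # a token has been consumed
        self.refcount -= 1
        if self.refcount == 0:
            self.head = self.tail = None   # pages recovered (GC'd here)

s = Stream()
for v in range(10):
    s.append(v)                            # 10 values span 3 linked pages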
Article
This document presents a language and graph representation designed to aid in the partitioning of large signal processing applications into tasks to run on a multiprocessor. This language compiles directly into a program graph upon which optimizations are performed to find the optimal task configuration. The optimal task configuration is one in which the communication-to-computation ratio is minimized while throughput is maximized. The language and graph are designed to express coarse-grain parallelism so that large communication delays and task overhead do not outweigh the speedup achieved by exploiting the parallelism in the application. The data flow model underlying the graph representation is based on a distributed, loosely-coupled multiprocessor concept which supports tagged data flow at the operating system level. The approach utilized in the graph and language is to insert specific data routing operators into the data flow graph defined by the application. These operators shape and route data through the application and express all the coarse-grain parallelism which is exploitable in the model. Data routing operators are parameterized to reflect the degree of parallelism, i.e., the number of parallel tasks in a partition of the application. The graph constructs also contain sufficient information to describe fully the task structure of the application. Given the characteristics of the partition, the programmer can define a standard by which the 'goodness' of the partition is evaluated. Given this standard, algorithms to perform the evaluation can be developed. The ultimate goal is the development of algorithms to perform the entire partitioning process automatically.
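
The routing operators can be pictured as a parameterized scatter/gather pair around k independent tasks (a sketch with an assumed round-robin policy; the thesis's operators are richer and carry partition parameters):

def scatter(stream, k):
    """Split one stream into k substreams, round-robin."""
    subs = [[] for _ in range(k)]
    for i, v in enumerate(stream):
        subs[i % k].append(v)
    return subs

def gather(subs):
    """Re-interleave k substreams into one stream, round-robin."""
    out, i = [], 0
    while any(subs):
        if subs[i % len(subs)]:
            out.append(subs[i % len(subs)].pop(0))
        i += 1
    return out

parts = scatter(range(8), 3)                  # [[0,3,6], [1,4,7], [2,5]]
work  = [[x * x for x in p] for p in parts]   # k independent parallel tasks
print(gather(work))                           # [0, 1, 4, 9, 16, 25, 36, 49]

The degree of parallelism k is exactly the parameter the thesis attaches to its routing operators, so exploring partitions amounts to re-running the graph with different k.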
Conference Paper
Not Available
Conference Paper
A general purpose, programmable, digital video signal processor has been designed for efficient processing of real-time video signals. The chip is fabricated in a 0.8 μm CMOS technology on a die of 156 mm². The parallel architecture of 28 processing elements realizes a throughput of 15 GIPS at a clock frequency of 54 MHz. To achieve such a throughput, many of the blocks are custom designed. Special attention has been given to clock routing. Dedicated measures have been taken to reduce power and ground bounce. One of them is the inclusion of a low-voltage I/O mode, which also limits the power consumption. A set of programming tools supports the development of applications.
Article
An approach to transformations for general loops in which dependence vectors represent precedence constraints on the iterations of a loop is presented. Therefore, dependences extracted from a loop nest must be lexicographically positive. This leads to a simple test for legality of compound transformations: any code transformation that leaves the dependences lexicographically positive is legal. The loop transformation theory is applied to the problem of maximizing the degree of coarse- or fine-grain parallelism in a loop nest. It is shown that the maximum degree of parallelism can be achieved by transforming the loops into a nest of coarsest fully permutable loop nests and wavefronting the fully permutable nests. The canonical form of coarsest fully permutable nests can be transformed mechanically to yield maximum degrees of coarse- and/or fine-grain parallelism. The efficient heuristics can find the maximum degrees of parallelism for loops whose nesting level is less than five
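
The legality test stated above reduces to a few lines: apply the compound transformation, written as an integer matrix T, to every dependence vector and check that each image remains lexicographically positive (a sketch; the example checks loop interchange on a doubly nested loop):

def lex_positive(v):
    for x in v:
        if x > 0:
            return True
        if x < 0:
            return False
    return False                        # the zero vector is not positive

def legal(T, deps):
    apply = lambda d: [sum(t * x for t, x in zip(row, d)) for row in T]
    return all(lex_positive(apply(d)) for d in deps)

interchange = [[0, 1],
               [1, 0]]
print(legal(interchange, [(1, 0), (0, 1)]))  # True: interchange is legal
print(legal(interchange, [(1, -1)]))         # False: (1,-1) maps to (-1,1)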