Article

SRDAG compaction: a generalization of trace scheduling to increase the use of global context information


Abstract

Microcode compaction is the process of converting essentially vertical microcode into horizontal microcode for a given architecture. The conventional plan calls for a microcode compiler to generate vertical code for a given architecture and then use a compaction system to produce horizontal code, thereby greatly reducing the complexity of horizontal code generation. This paper attempts to extend the existing techniques used to perform the compaction process. Specifically, the procedure presented generalizes the "trace scheduling" method of [Fisher81] by using more global context information in compaction decisions. A number of definitions from classical compaction are generalized to encompass this expanded scope. Further, the paper presents two example classes of problems for which the new method outperforms the trace scheduling technique in terms of the execution time efficiency of the generated code. A number of unresolved questions are noted involving the class of global compaction procedures.
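Trace scheduling, which this paper generalizes, first selects a likely execution path (a trace) through the flow graph and then compacts it as if it were one basic block. As a rough illustration of the trace-selection phase only, here is a minimal sketch; the greedy rule, the block names, and the `succ_weight` profile structure are illustrative assumptions, not the paper's algorithm.

```python
# Hypothetical sketch of trace selection, the first phase of trace
# scheduling: seed with the hottest unscheduled block, then greedily
# extend the trace along the most frequently taken successor edges.

def pick_trace(blocks, succ_weight, scheduled):
    """Grow a trace from the hottest block not yet scheduled.

    blocks      -- iterable of block ids, in priority order
    succ_weight -- dict mapping block -> {successor: edge frequency}
    scheduled   -- set of blocks already placed in an earlier trace
    """
    candidates = [b for b in blocks if b not in scheduled]
    if not candidates:
        return []
    # Seed with the most frequently executed remaining block.
    seed = max(candidates,
               key=lambda b: sum(succ_weight.get(b, {}).values()))
    trace, current = [seed], seed
    while True:
        succs = {s: w for s, w in succ_weight.get(current, {}).items()
                 if s not in scheduled and s not in trace}
        if not succs:
            break
        current = max(succs, key=succs.get)  # follow the likeliest edge
        trace.append(current)
    return trace
```

Once a trace is picked, it is compacted as a unit and removed from consideration; the loop then repeats on the remaining (off-trace) blocks.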


... Since the introduction of trace scheduling, several attempts have been made to address excessive compensation code and unnecessary degradation of side paths [Freudenberger et al. 1994; Lah and Atkin 1983; Linn 1983; Smith et al. 1992; Su et al. 1984]. Most previous work on solving the compensation-code and side-path problems focuses on disabling certain global code motions, thus limiting the benefit of trace scheduling. ...
... The original trace scheduling paper [Fisher 1981] used CP list scheduling. Most subsequent work on trace scheduling focused on improving the original algorithm by limiting the amount of compensation code [Freudenberger et al. 1994; Lah and Atkin 1983; Linn 1983; Smith et al. 1992; Su et al. 1984], a problem that was recognized in the original paper. The paper by Freudenberger et al. [1994] provides an extensive study of compensation code and identifies two different approaches to limiting compensation code: avoidance (avoiding code motions that lead to compensation code) and suppression (using global data and control flow information to detect cases where compensation code is redundant). ...
Article
Full-text available
This article presents the first optimal algorithm for trace scheduling. The trace is a global scheduling region used by compilers to exploit instruction-level parallelism across basic block boundaries. Several heuristic techniques have been proposed for trace scheduling, but the precision of these techniques has not been studied relative to optimality. This article describes a technique for finding provably optimal trace schedules, where optimality is defined in terms of a weighted sum of schedule lengths across all code paths in a trace. The optimal algorithm uses branch-and-bound enumeration to efficiently explore the entire solution space. Experimental evaluation of the algorithm shows that, with a time limit of 1 s per problem, 91% of the hard trace scheduling problems in the SPEC CPU 2006 Integer Benchmarks are solved optimally. For 58% of these hard problems, the optimal schedule is improved compared to that produced by a heuristic scheduler with a geometric mean improvement of 3.2% in weighted schedule length and 18% in compensation code size.
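The objective used to define optimality above can be stated compactly: each code path through the trace contributes its schedule length weighted by its execution frequency. A minimal sketch of this scoring function (the path data in the test is illustrative):

```python
# Sketch of the weighted-schedule-length objective: a candidate trace
# schedule is scored by summing, over all code paths exiting the trace,
# (path schedule length) * (path execution frequency). Lower is better.

def weighted_length(path_lengths, path_freqs):
    """path_lengths and path_freqs are parallel sequences over the
    code paths through the trace."""
    return sum(length * freq
               for length, freq in zip(path_lengths, path_freqs))
```

A branch-and-bound scheduler can then prune any partial schedule whose lower bound on this score already exceeds the best complete schedule found so far.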
... Numerous improvements to trace scheduling have been suggested since Fisher first published his algorithm [17,18,16,23,15,13]. Recently, three approaches have been suggested that have different strategies for creating compensation code: superblock scheduling [14,4], Bernstein and Rodell's global instruction scheduling [3,2], and Smith's global scheduling [26]. Our work was done before these three, but not published until now. ...
Article
Trace scheduling is an optimization technique that selects a sequence of basic blocks as a trace and schedules the operations from the trace together. If an operation is moved across basic block boundaries, one or more compensation copies may be required in the off-trace code. This article discusses the generation of compensation code in a trace scheduling compiler and presents techniques for limiting the amount of compensation code: avoidance (restricting code motion so that no compensation code is required) and suppression (analyzing the global flow of the program to detect when a copy is redundant). We evaluate the effectiveness of these techniques based on measurements for the SPEC89 suite and the Livermore Fortran Kernels, using our implementation of trace scheduling for a Multiflow Trace 7/300. The article compares different compiler models contrasting the performance of trace scheduling with the performance obtained from typical RISC compilation techniques. There are two key results of this study. First, the amount of compensation code generated is not large. For the SPEC89 suite, the average code size increase due to trace scheduling is 6%. Avoidance is more important than suppression, although there are some kernels that benefit significantly from compensation code suppression. Since compensation code is not a major issue, a compiler can be more aggressive in code motion and loop unrolling. Second, compensation code is not critical to obtain the benefits of trace scheduling. Our implementation of trace scheduling improves the SPEC mark rating by 30% over basic block scheduling, but restricting trace scheduling so that no compensation code is required improves the rating by 25%. This indicates that most basic block scheduling techniques can be extended to trace scheduling without requiring any complicated compensation code bookkeeping.
... More sophisticated global compaction algorithms are mostly based on Fisher's Trace Scheduling [4]. The improvements concentrate on the reduction of computation time and better trace selection (e.g. [9], [17], [18], [6] and [13]). A good overview of global microcode compaction can be found in [16]. ...
Conference Paper
Full-text available
Modern CAD systems allow designers to come up with powerful programmable datapaths in a very short time. The time to develop compilers for these datapaths is much longer. This paper presents a new approach to compiler generation. We show how a VHDL description of a programmable datapath can be analyzed to extract several pieces of information for compiler generation. The analysis finds computing and storage resources, classifies signals as control or data, and extracts all the possible micro operations for this datapath.
... We have not measured its effectiveness without speculative execution. An extension to consider multiple control flow paths, as suggested in [46,47], is a good idea, but the trace algorithm as described is an improvement over basic block schedulers. ...
Article
The Multiflow compiler uses the trace scheduling algorithm to find and exploit instruction-level parallelism beyond basic blocks. The compiler generates code for VLIW computers that issue up to 28 operations each cycle and maintain more than 50 operations in flight. At Multiflow the compiler generated code for eight different target machine architectures and compiled over 50 million lines of Fortran and C applications and systems code. The requirement of finding large amounts of parallelism in ordinary programs, the trace scheduling algorithm, and the many unique features of the Multiflow hardware placed novel demands on the compiler. New techniques in instruction scheduling, register allocation, memory-bank management, and intermediate-code optimizations were developed, as were refinements to reduce the overhead of trace scheduling. This article describes the Multiflow compiler and reports on the Multiflow practice and experience with compiling for instruction-level parallelism beyond basic blocks.
... This was first proposed by Fisher in [52] as a way to increase the available parallelism at the microcode level. The technique has also been applied to horizontally microcoded architectures [96] [104]. ...
Article
Table of contents (front matter): Acknowledgments; List of Tables; List of Figures; Chapter I: Introduction (1. Scheduling, 2. Methodology, 3. Research Contributions, 4. Thesis Organization); Chapter II: Instruction ...
Chapter
Since its introduction by Joseph A. Fisher in 1979, trace scheduling has influenced much of the work on compile-time ILP. Initially developed for use in microcode compaction, trace scheduling quickly became the main technique for machine-level compile-time parallelism exploitation. Trace scheduling has been used since the 1980s in many state-of-the-art compilers (e.g., Intel, Fujitsu, HP).
Chapter
Some time ago, we proposed a new computing model, called VLIW-in-the-large, allowing both coarse grain and fine grain parallelism to be exploited in the execution of programs on the coMP architecture. coMP is an MIMD machine explicitly designed to allow fast inter-processor communications to be performed. These kinds of fast, possibly synchronous, inter-processor communications are essential to the realization of the VLIW-in-the-large computing model. In this paper, we present some experimental results that validate both the computing model and the design choices relative to the coMP massively parallel computing architecture.
Article
Pipelining and parallel functional units are common optimization techniques used in high-performance processors. Traditionally, this parallelism internal to the data path of a processor is only available to the microcode programmer, and the problems of minimizing the execution time of the microcode within and across basic blocks are known as local and global compaction, respectively. The development of the global compaction technique, trace scheduling, has led to the introduction of VLIW (very long instruction word) architectures [9,19,20,21]. A VLIW machine is like a horizontally microcoded machine: it consists of parallel functional units, each of which can be independently controlled through dedicated fields in a “very long” instruction. A characteristic distinctive of VLIW architectures is that these long instructions are the machine instructions. There is no additional layer of interpretation where machine instructions are expanded into micro-instructions. A compiler directly generates these long machine instructions from programs written in a high-level language. A VLIW machine generally has an orthogonal instruction set; whereas in a typical horizontally microcoded engine, complex resource or field conflicts exist between functionally independent operations.
Article
The compiler described here is a production-grade compiler which supports all the features normally associated with a production compiler, such as global optimization, an interface to an interactive symbolic debugger, reference map output, etc. This report shows that commonly known algorithms can be used in the construction of a compiler which gives acceptable performance in both compilation rate and object code efficiency for a production environment. The compiler has now been in general use as a principal part of FPS's program development software (PDS) system for two years and has manifested its strengths and weaknesses. Because the architecture of the FPS-164 is unconventional, the next section of this paper briefly describes the principal hardware features. Next, the compiler is described to show which algorithms are used and how they are integrated to produce the complete system.
Article
This paper describes an efficient global microprogram optimization technique called MODYWT. It uses a dynamic microoperation priority weight function in the process of combining microoperations into microinstructions. MODYWT is tested with random microprograms generated for a simulated machine having IBM 360/40 system architecture. Performances of different global compaction algorithms are also compared with MODYWT using some metrics proposed in the software engineering literature for measuring software complexities.
Article
Compacting the microoperations of a microprogram into horizontal microinstructions requires an efficient global compaction algorithm. This paper describes a global compaction algorithm which is more practical than some of the existing techniques based on Tree compaction, Trace scheduling, and the generalized data dependency graph. The algorithm uses a local compaction technique in which the microoperation priority function is dynamically modified.
Article
Global microcode compaction is an open problem in firmware engineering. Although Fisher's trace scheduling method may produce significant reductions in the execution time of compacted microcode, it has some drawbacks. Four methods, Tree, SRDAG, ITSC, and GDDG, have recently been presented to mitigate those drawbacks in different ways. The purpose of the research reported in this paper is to evaluate these new methods. In order to do this, we have tested the published algorithms on several unified microcode sequences of two real machines and compared them on the basis of the results of experiments using three criteria: time efficiency, space efficiency, and complexity.
Article
We describe a system which allows high-level microprogramming without requiring programmer knowledge of the target architecture, depending instead on retargetable microcode generation and optimization. In the ideal system the code generation, microcode compaction, encoding and simulation are driven by a single description of the target microarchitecture. An initial implementation, which is now working for a real microprogrammable processor, demonstrates the feasibility of the key technologies.
Article
This paper shows that software pipelining is an effective and viable scheduling technique for VLIW processors. In software pipelining, iterations of a loop in the source program are continuously initiated at constant intervals, before the preceding iterations complete. The advantage of software pipelining is that optimal performance can be achieved with compact object code. This paper extends previous results of software pipelining in two ways: First, this paper shows that by using an improved algorithm, near-optimal performance can be obtained without specialized hardware. Second, we propose a hierarchical reduction scheme whereby entire control constructs are reduced to an object similar to an operation in a basic block. With this scheme, all innermost loops, including those containing conditional statements, can be software pipelined. It also diminishes the start-up cost of loops with a small number of iterations. Hierarchical reduction complements the software pipelining technique, permitting a consistent performance improvement to be obtained. The techniques proposed have been validated by an implementation of a compiler for Warp, a systolic array consisting of 10 VLIW processors. This compiler has been used for developing a large number of applications in the areas of image, signal and scientific processing.
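The constant initiation interval (II) that software pipelining relies on is bounded below by resource usage: if a new iteration starts every II cycles, each resource class must fit one iteration's demand into II cycles. A minimal sketch of this standard resource bound follows; the function name and the resource table in the test are illustrative assumptions, not Warp's actual configuration.

```python
# Sketch of the resource lower bound on the initiation interval:
# II >= ceil(demand[r] / units[r]) for every resource class r,
# where demand counts one loop iteration's uses of that resource.
from math import ceil

def res_min_ii(op_resources, units):
    """op_resources: one resource-class name per operation in the
    loop body; units: dict resource -> number of functional units."""
    demand = {}
    for r in op_resources:
        demand[r] = demand.get(r, 0) + 1
    return max(ceil(demand[r] / units[r]) for r in demand)
```

For example, a loop body with three multiplies on two multipliers cannot start iterations faster than every two cycles, regardless of how the schedule is arranged; dependence cycles can only raise this bound further.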
Conference Paper
A global compacter is presented for digital signal processors. The global compaction algorithm outlined demonstrates that optimal or near-optimal code can be produced for digital signal processing (DSP) chips by employing conventional compiler optimization techniques in conjunction with a global compacter and a loop pipeliner. Code can be efficiently compacted across basic block boundaries as well as within basic blocks. The loop pipeliner produces optimal or near-optimal pipelined loops for any looping structure. The global compacter can be used by both high-level-language compilers and hand assemblers
Conference Paper
Fine-grained parallelism is offered by an increasing number of processors. This kind of parallelism increases performance for all kinds of applications, including general-purpose code; most promising is the combination with coarse-grained parallelism. Unlike coarse-grained parallelism it can be exploited by automatic parallelization. This paper presents program analysis and transformation methods for exploitation of fine-grained parallelism, based on global instruction scheduling.
Conference Paper
Microcode compaction is an essential component of any high-level language compiler that generates microcode for a horizontal architecture machine. Recent research into both local and global compaction has assumed the use of a simple abstract machine. Although this assumption simplifies the effort considerably, it neglects addressing and timing problems brought on by the uncommon operation of some machines.This paper discusses both local and global compaction in terms of the Burroughs D-machine. The D-machine has peculiar timing and an uncommon jump instruction that do not readily fit into proposed compaction algorithms. Methods for handling these problems are presented. In addition, two popular algorithms for performing compaction, list scheduling and trace scheduling, are explained entirely in terms of the D-machine. This should aid the reader in understanding the problem and evaluating any alternatives.
Conference Paper
This paper shows that software pipelining is an effective and viable scheduling technique for VLIW processors. In software pipelining, iterations of a loop in the source program are continuously initiated at constant intervals, before the preceding iterations complete. The advantage of software pipelining is that optimal performance can be achieved with compact object code. This paper extends previous results of software pipelining in two ways: First, this paper shows that by using an improved algorithm, near-optimal performance can be obtained without specialized hardware. Second, we propose a hierarchical reduction scheme whereby entire control constructs are reduced to an object similar to an operation in a basic block. With this scheme, all innermost loops, including those containing conditional statements, can be software pipelined. It also diminishes the start-up cost of loops with a small number of iterations. Hierarchical reduction complements the software pipelining technique, permitting a consistent performance improvement to be obtained. The techniques proposed have been validated by an implementation of a compiler for Warp, a systolic array consisting of 10 VLIW processors. This compiler has been used for developing a large number of applications in the areas of image, signal and scientific processing.
Article
This paper describes a new scheduling algorithm for automatic synthesis of the control blocks of control-dominated circuits. The proposed scheduling algorithm is distinctive in its approach to partition a control/data flow graph (CDFG) into an equivalent state transition graph. It works on the CDFG to exploit operation relocation, chaining, duplication, and unification. The optimization goal is to schedule each execution path as fast as possible. Benchmark data shows that this approach achieved better results over the previous ones in terms of the speedup of the circuit and the number of states and transitions.
Article
The performance of microprocessors has increased steadily over the past 20 years at a rate of about 50% per year. This is the cumulative result of architectural improvements as well as increases in circuit speed. Moreover, this improvement has been obtained in a transparent fashion, that is, without requiring programmers to rethink their algorithms and programs, thereby enabling the tremendous proliferation of computers that we see today. To continue this performance growth, microprocessor designers have incorporated instruction-level parallelism (ILP) into new designs. ILP utilizes the parallel execution of the lowest level computer operations (adds, multiplies, loads, and so on) to increase performance transparently. The use of ILP promises to make possible, within the next few years, microprocessors whose performance is many times that of a CRAY-1S. This article provides an overview of ILP, with an emphasis on ILP architectures (superscalar, VLIW, and dataflow processors) and the compiler techniques necessary to make ILP work well.
Conference Paper
VLIW architectures have been shown to be able to exploit large amounts of fine grain parallelism in the execution of sequential imperative programs. In this paper, a new computing model is presented, which allows the VLIW techniques to be adopted to operate a distributed memory, multiprocessor machine. The model, called VLIW-in-the-large, can be adopted in conjunction with a suitable hardware framework to obtain consistent speedups in the execution of both sequential and parallel-natured software. The authors show that the advantages of the VLIW-in-the-large computing model with respect to the classical VLIW approach are: (i) better utilization of hardware resources; (ii) extension of the applicability of the VLIW techniques to multiprocessor architectures, in such a way that they can be used for multi-style, multi-grain parallelism exploitation; (iii) compact realization of processing elements, suitable for VLSI massively parallel architectures
Article
An overview of automatic program parallelization techniques is presented. It covers dependence analysis techniques, followed by a discussion of program transformations, including straight-line code parallelization, do-loop transformations, and parallelization of recursive routines. Several experimental studies on the effectiveness of parallelizing compilers are surveyed
Article
Full-text available
Microcode compaction is an essential tool for the compilation of high-level language microprograms into microinstructions with parallel microoperations. Although guaranteeing minimum execution time is an exponentially complex problem, recent research indicates that it is not difficult to obtain practical results. This paper, which assumes no prior knowledge of microprogramming on the part of the reader, surveys the approaches that have been developed for compacting microcode. A comprehensive terminology for the area is presented, as well as a general model of processor behavior suitable for comparing the algorithms. Execution examples and a discussion of strengths and weaknesses are given for each of the four classes of local compaction algorithms: linear, critical path, branch and bound, and list scheduling. Local compaction, which applies to jump-free code, is fundamental to any compaction technique. The presentation emphasizes the conceptual distinction between data dependency and conflict analysis.
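Of the four classes surveyed, list scheduling is the most widely used for local compaction. As a rough sketch of the idea for jump-free code: microoperations become ready once their data-dependence predecessors are scheduled, and each cycle the ready ops are packed into one microinstruction subject to conflict analysis. The priority rule (list order) and the resource-set representation of conflicts below are illustrative assumptions.

```python
# Minimal list-scheduling sketch for local compaction of jump-free
# microcode. Each cycle: collect ready ops (all predecessors done),
# then greedily pack them into one microinstruction, skipping any op
# whose resources conflict with ops already packed this cycle.

def list_schedule(ops, deps, uses):
    """ops: op ids in priority order; deps: dict op -> set of
    predecessor ops; uses: dict op -> set of resources occupied.
    Returns a list of microinstructions (sets of op ids)."""
    done, schedule = set(), []
    while len(done) < len(ops):
        ready = [o for o in ops
                 if o not in done and deps.get(o, set()) <= done]
        instr, busy = set(), set()
        for op in ready:
            if uses[op] & busy:
                continue            # resource conflict: defer a cycle
            instr.add(op)
            busy |= uses[op]
        schedule.append(instr)
        done |= instr
    return schedule
```

The data-dependency test (`deps`) and the conflict test (`uses`) are deliberately separate here, mirroring the survey's distinction between data dependency and conflict analysis.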
Article
In previous papers [1,2,3] a high level microprogramming language schema called S* was described. S* is a partially specified language such that for a given host machine M1, a particular language S*(M1) results when M1's properties are used to complete the specifications of S*. We say that S* is instantiated into S*(M1) with respect to M1. This paper describes the instantiation of S* with respect to the Nanodata QM-1. The resulting language S*(QM-1) allows high level “nanoprograms” to be written for the QM-1. The major objective of this research was to examine the language schema S*, from which S*(QM-1) was instantiated, in light of its overall philosophy and usefulness as a tool in the development of a specific microprogramming language for a highly complex microprogrammable machine.
Article
In this study "trace scheduling" is developed as a solution to the global compaction problem. Trace scheduling works on traces (or paths) through microprograms. Compacting is thus done with a broad overview of the program. Important operations are given priority, no matter what their source block was. This is in sharp contrast with earlier methods, which compact one block at a time and then attempt iterative improvement. It is argued that those methods suffer from the lack of an overview and make many undesirable compactions, often preventing desirable ones. Loops are handled using the reducible property of most flow graphs. The loop handling technique permits the operations to move around loops, as well as into loops, where appropriate.
Article
Microcode compaction is an essential tool for the compilation of high-level language microprograms into microinstructions with parallel microoperations. The purpose of the research reported in this paper is to compare four microcode compaction methods reported in the literature: first-come first-served, critical path, branch and bound, and list scheduling. In order to do this a complete, machine independent method of representing the microoperations of real machines had to be developed; and the compaction algorithms had to be recast to use this representation. The compaction algorithms were then implemented and tested on microcode produced by a compiler for a high-level microprogramming language. The results of these experiments were that for all cases examined the first-come first-served and list scheduling algorithms produced microcode compacted into a minimal number of microinstructions in time that was a polynomial function of order two of the number of input microoperations.