Pohua P. Chang's research while affiliated with University of Illinois, Urbana-Champaign and other places

What is this page?


This page lists the scientific contributions of an author, who either does not have a ResearchGate profile, or has not yet added these contributions to their profile.

It was automatically created by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.


Publications (21)


Comparing Static And Dynamic Code Scheduling for Multiple-Instruction-Issue Processors
  • Article
  • Full-text available

April 1999 · 197 Reads · 22 Citations

Pohua P. Chang · William Y. Chen · Scott A. Mahlke
This paper examines two alternative approaches to supporting code scheduling for multiple-instruction-issue processors. One is to provide a set of non-trapping instructions so that the compiler can perform aggressive static code scheduling. The application of this approach to existing commercial architectures typically requires extending the instruction set. The other approach is to support out-of-order execution in the microarchitecture so that the hardware can perform aggressive dynamic code scheduling. This approach usually does not require modifying the instruction set but requires complex hardware support. In this paper, we analyze the performance of the two alternative approaches using a set of important nonnumerical C benchmark programs. A distinguishing feature of the experiment is that the code for the dynamic approach has been optimized and scheduled as much as allowed by the architecture. The hardware is only responsible for the additional reordering that cannot be performed...
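The static alternative described above can be sketched as a greedy list scheduler. This is an illustrative toy only, not the paper's experimental setup: a hypothetical multiple-issue machine with unit-latency instructions, where an instruction may issue once every value it reads (produced inside the stream) was computed in an earlier cycle.

```python
# Sketch of static list scheduling for a hypothetical multiple-issue
# machine. Instructions are (dest, sources) pairs; registers never written
# by any listed instruction are treated as live-in and always ready.

def static_schedule(instrs, issue_width):
    """Pack instructions into issue cycles; returns one list per cycle."""
    produced = {dst for dst, _ in instrs}
    ready_at = {}                        # dest -> cycle its value is ready
    schedule = []
    remaining = list(instrs)
    while remaining:
        cycle = len(schedule)
        # Issue if every internally produced source is done (unit latency).
        issued = [(d, srcs) for d, srcs in remaining
                  if all(s not in produced or ready_at.get(s, float("inf")) <= cycle
                         for s in srcs)][:issue_width]
        for d, _ in issued:
            ready_at[d] = cycle + 1
        remaining = [i for i in remaining if i not in issued]
        schedule.append([d for d, _ in issued])
    return schedule

# Toy dependence chain: c needs a and b; d needs c; e is independent.
code = [("a", ["x"]), ("b", ["y"]), ("c", ["a", "b"]),
        ("d", ["c"]), ("e", ["z"])]
wide = static_schedule(code, issue_width=2)    # 3 cycles
narrow = static_schedule(code, issue_width=1)  # 5 cycles, one per instruction
```

The same reordering freedom is what an out-of-order microarchitecture buys at run time with hardware instead of at compile time.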


Scalar Program Performance on Multiple-Instruction-Issue Processors with a Limited Number of Registers

April 1999 · 6 Reads · 11 Citations

In this paper the performance of multiple-instruction-issue processors with variable register file sizes is examined for a set of scalar programs. We make several important observations. First, multiple-instruction-issue processors can perform effectively without a large number of registers. In fact, the register files of many existing architectures (16-32 registers) are capable of sustaining a high instruction execution rate. Second, even for small register files (8-12 registers), substantial performance gains can be obtained by increasing the issue rate of a processor. In general, the percentage increase in performance achieved by increasing the issue rate is relatively constant for all register file sizes. Finally, code transformations designed for multiple-instruction-issue processors are found to be effective for all register file sizes; however, for small register files, the performance improvement is limited due to the excessive spill code introduced by the transformations.
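The spill-code effect the abstract mentions can be illustrated with a deliberately simplified model (not the paper's methodology): values are kept in k physical registers with LRU replacement, and every re-load of a previously defined value counts as spill traffic.

```python
# Toy register-pressure model: k physical registers, LRU replacement;
# each reuse of a value that was evicted earlier costs one spill reload.

def spill_reloads(trace, k):
    regs = []          # physical registers, LRU order (front = oldest)
    defined = set()    # virtual registers that have held a value before
    reloads = 0
    for v in trace:
        if v in regs:                  # value still register-resident
            regs.remove(v)
            regs.append(v)
            continue
        if v in defined:               # was evicted earlier: spill reload
            reloads += 1
        defined.add(v)
        if len(regs) == k:             # evict the least recently used
            regs.pop(0)
        regs.append(v)
    return reloads

trace = list("abcabcabc")        # three values reused round-robin
few = spill_reloads(trace, 2)    # every reuse forces a reload
enough = spill_reloads(trace, 3) # all three values stay resident
```

With one register too few, every reuse misses; with just enough, spill traffic disappears entirely, which is the kind of threshold behavior the paper measures across register file sizes.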


Proceedings of the 18th International Symposium on Computer Architecture (ISCA)

April 1999 · 19 Reads · 3 Citations

Pohua P. Chang · Scott A. Mahlke · William Y. Chen · [...]
The performance of multiple-instruction-issue processors can be severely limited by the compiler's ability to generate efficient code for concurrent hardware. In the IMPACT project, we have developed IMPACT-I, a highly optimizing C compiler to exploit instruction level concurrency. The optimization capabilities of the IMPACT-I C compiler are summarized in this paper. Using the IMPACT-I C compiler, we ran experiments to analyze the performance of multiple-instruction-issue processors executing some important non-numerical programs. The multiple-instruction-issue processors achieve solid speedup over high-performance single-instruction-issue processors. We ran experiments to characterize the following architectural design issues: code scheduling model, instruction issue rate, memory load latency, and function unit resource limitations. Based on the experimental results, we propose the IMPACT Architectural Framework, a set of architectural features that best support the IMPACT-I C compile...


Figure 6: The interaction of classical, superscalar and multiprocessor optimizations for an infinite window.
Figure 2: Effect of data migration on parallelism and efficiency.
The Effect Of Compiler Optimizations On Available Parallelism In Scalar Programs

April 1999 · 52 Reads · 6 Citations

In this paper we analyze the effect of compiler optimizations on fine grain parallelism in scalar programs. We characterize three levels of optimization: classical, superscalar, and multiprocessor. We show that classical optimizations not only improve a program's efficiency but also its parallelism. Superscalar optimizations further improve the parallelism for moderately parallel machines. For highly parallel machines, however, they actually constrain available parallelism. The multiprocessor optimizations we consider are memory renaming and data migration.

Introduction. Compiler optimizations are designed to reduce a program's execution time. Traditionally, these optimizations are customized for a given machine model. Classical optimizations are designed to improve the program's efficiency for a machine model which has one thread of execution and can issue one instruction per cycle. Superscalar optimizations are designed for a machine model with a single thread of execution and a lim...
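The tension between efficiency and parallelism can be seen with the standard "available parallelism" metric (instruction count divided by dependence-chain length, assuming unit latency and an unbounded window). In this sketch, a classical optimization such as common-subexpression elimination removes work but leaves the critical path unchanged, so measured parallelism drops even though the program got faster:

```python
# Available parallelism = instruction count / critical path length,
# for (dest, sources) instructions with unit latency.

def available_parallelism(instrs):
    depth = {}
    critical = 0
    for dst, srcs in instrs:
        d = 1 + max((depth.get(s, 0) for s in srcs), default=0)
        depth[dst] = d
        critical = max(critical, d)
    return len(instrs) / critical

# Before CSE: x+y is computed twice, in parallel.
before = [("t1", ["x", "y"]), ("t2", ["x", "y"]), ("t3", ["t1", "t2"])]
# After CSE: fewer instructions, same critical path of length 2.
after = [("t1", ["x", "y"]), ("t3", ["t1", "t1"])]

available_parallelism(before)   # 3 / 2 = 1.5
available_parallelism(after)    # 2 / 2 = 1.0
```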


Data Access Microarchitectures for Superscalar Processors with Compiler-Assisted Data Prefetching

April 1999 · 7 Reads · 60 Citations

The performance of superscalar processors is more sensitive to the memory system delay than their single-issue predecessors. This paper examines alternative data access microarchitectures that effectively support compiler-assisted data prefetching in superscalar processors. In particular, a prefetch buffer is shown to be more effective than increasing the cache dimension in solving the cache pollution problem. All in all, we show that a small data cache with compiler-assisted data prefetching can achieve a performance level close to that of an ideal cache.

1 Introduction. Superscalar processors can potentially deliver more than five times speedup over conventional single-issue processors [1]. With the total execution cycle count dramatically reduced, each cycle becomes more significant to the overall performance. Because each data cache miss can introduce many extra execution cycles, a superscalar processor can easily lose the majority of its performance to the memory hierarchy. Out-of-...
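A toy model (not the paper's microarchitecture) shows why a small prefetch buffer helps on streaming accesses: a direct-mapped cache is paired with a single-entry buffer that always holds the next sequential line, as if the compiler had inserted a prefetch for it.

```python
# Direct-mapped cache plus optional one-entry sequential prefetch buffer.

def miss_count(addresses, n_lines=4, line_size=4, prefetch=False):
    cache = {}       # set index -> tag
    pbuf = None      # line number held by the prefetch buffer
    misses = 0
    for a in addresses:
        line = a // line_size
        idx, tag = line % n_lines, line // n_lines
        if cache.get(idx) != tag and not (prefetch and pbuf == line):
            misses += 1
        cache[idx] = tag             # fill from memory or prefetch buffer
        if prefetch:
            pbuf = line + 1          # prefetch the next sequential line
    return misses

stream = range(32)                   # sequential sweep through 8 lines
miss_count(stream)                   # 8 misses: one per line
miss_count(stream, prefetch=True)    # 1 miss: only the very first line
```

Because the prefetched line sits in the buffer rather than in the cache proper, a real design of this shape also avoids displacing useful cache lines, which is the pollution argument the abstract makes.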


Using Profile Information to Assist Classic Code Optimizations

Software: Practice and Experience, Vol. 21(12), 1301-1321 (December 1991)

April 1997 · 333 Reads · 107 Citations

This paper describes the design and implementation of an optimizing compiler that automatically generates profile information to assist classic code optimizations. This compiler contains two new components, an execution profiler and a profile-based code optimizer, which are not commonly found in traditional optimizing compilers. The execution profiler inserts probes into the input program, executes the input program for several inputs, accumulates profile information and supplies this information to the optimizer. The profile-based code optimizer uses the profile information to expose new optimization opportunities that are not visible to traditional global optimization methods. Experimental results show that the profile-based code optimizer significantly improves the performance of production C programs that have already been optimized by a high-quality global code optimizer
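The profiling half of such a compiler can be sketched in a few lines: probes inserted at basic-block boundaries accumulate execution counts over several inputs, and the counts then tell the optimizer which path deserves attention. The program and block names below are made up for illustration.

```python
# Minimal execution-profiler sketch: probes at basic-block boundaries
# accumulate counts across runs on several inputs.

counters = {}

def probe(block):
    counters[block] = counters.get(block, 0) + 1

def instrumented(x):        # the "input program" with probes inserted
    probe("entry")
    if x > 0:
        probe("then")
        return x * 2
    probe("else")
    return -x

for v in [5, 3, -1, 7]:     # run the program for several inputs
    instrumented(v)

# counters now reads entry: 4, then: 3, else: 1, so the "then" path is
# hot and a profile-based optimizer would focus its effort there.
```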


Three Architectural Models for Compiler-Controlled Speculative Execution

May 1995 · 6 Reads · 36 Citations

IEEE Transactions on Computers

To effectively exploit instruction-level parallelism, the compiler must move instructions across branches. When an instruction is moved above a branch on which it is control dependent, it is speculatively executed: it executes before it is known whether or not its result is needed. Speculative execution carries potential hazards; if these hazards can be eliminated, the compiler can schedule the code more aggressively. This paper outlines the hazards of speculative execution and discusses three architectural models, restricted, general, and boosting, which provide increasing amounts of support for removing these hazards. The performance gained by each level of additional hardware support is analyzed using the IMPACT C compiler, which performs superblock scheduling for superscalar and superpipelined processors.
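One of the hazards, and one way architectural support removes it, can be shown in miniature. The compiler wants to hoist a divide above the branch that guards it; a trapping divide would fault when the guard is false, so the hoisted copy uses a "silent" non-trapping form whose dummy result is simply never used. The function names here are illustrative, not from the paper.

```python
# Sketch of compiler-controlled speculation with a non-trapping operation.

def silent_div(a, b):
    return a // b if b != 0 else 0   # non-trapping: dummy value on fault

def scheduled(x):
    t = silent_div(100, x)   # speculative: executed before the branch
    if x != 0:
        return t             # branch confirms speculation: result needed
    return -1                # otherwise the speculative result is discarded

scheduled(4)    # 25, same answer as the unspeculated code
scheduled(0)    # -1: the would-be divide-by-zero never traps
```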


Table 2: The benchmarks.
Figure 3: The performance of prescheduling for the superpipelined versions of existing architectures. The base architecture is the single-issue processor with no prescheduling. 2 and 3 are single-issue, 2X- and 3X-superpipelined processors respectively; 4 and 6 are dual-issue, 2X- and 3X-superpipelined processors respectively. All the processors have 32 registers.
The Importance of Prepass Code Scheduling for Superscalar and Superpipelined Processors

April 1995 · 79 Reads · 61 Citations

IEEE Transactions on Computers

Superscalar and superpipelined processors utilize parallelism to achieve peak performance that can be several times higher than that of conventional scalar processors. For this potential to be translated into the speedup of real programs, the compiler must be able to schedule instructions so that the parallel hardware is effectively utilized. Previous work has shown that prepass code scheduling helps to produce a better schedule for scientific programs, but the importance of prescheduling has never been demonstrated for control-intensive non-numeric programs. These programs differ significantly from scientific programs because they contain frequent branches; the compiler must perform global scheduling in order to find enough independent instructions. In this paper, the code optimizer and scheduler of the IMPACT-I C compiler are described. Within this framework, we study the importance of prepass code scheduling for a set of production C programs. It is shown that, in contrast to the results previously obtained for scientific programs, prescheduling is not important for compiling control-intensive programs to the current generation of superscalar and superpipelined processors. However, if some of the current restrictions on upward code motion can be removed in future architectures, prescheduling would substantially improve the execution time of this class of programs on both superscalar and superpipelined processors.


The Superblock: An Effective Technique for VLIW and Superscalar Compilation

May 1993 · 663 Reads · 626 Citations

The Journal of Supercomputing

A compiler for VLIW and superscalar processors must expose sufficient instruction-level parallelism (ILP) to effectively utilize the parallel hardware. However, ILP within basic blocks is extremely limited for control-intensive programs. We have developed a set of techniques for exploiting ILP across basic block boundaries. These techniques are based on a novel structure called the superblock. The superblock enables the optimizer and scheduler to extract more ILP along the important execution paths by systematically removing constraints due to the unimportant paths. Superblock optimization and scheduling have been implemented in the IMPACT-I compiler. This implementation gives us a unique opportunity to fully understand the issues involved in incorporating these techniques into a real compiler. Superblock optimizations and scheduling are shown to be useful while taking into account a variety of architectural features.
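The first step of superblock formation, profile-driven trace selection, can be sketched as follows: starting from a seed block, repeatedly follow the most frequently executed successor edge. (Tail duplication, which then removes side entrances so the trace becomes a single-entry superblock, is omitted here.) The CFG and edge weights are invented for illustration.

```python
# Profile-driven trace selection over a CFG given as successor lists
# plus per-edge execution frequencies.

def select_trace(succs, edge_freq, seed):
    trace, block = [], seed
    while block is not None and block not in trace:   # stop at a cycle
        trace.append(block)
        out = succs.get(block, [])
        block = max(out, key=lambda s: edge_freq[(trace[-1], s)]) if out else None
    return trace

succs = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
edge_freq = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 90, ("C", "D"): 10}
select_trace(succs, edge_freq, "A")   # ["A", "B", "D"]
# D still has a side entrance from C; tail-duplicating D into D' would
# finish the single-entry superblock A -> B -> D'.
```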


Efficient Instruction Sequencing with Inline Target Insertion

January 1993 · 37 Reads · 14 Citations

IEEE Transactions on Computers

Inline target insertion, a specific compiler and pipeline implementation method for delayed branches with squashing, is defined. The method is shown to offer two important features not discovered in previous studies. First, branches inserted into branch slots are correctly executed. Second, the execution returns correctly from interrupts or exceptions with only one program counter. These two features result in better performance and less software/hardware complexity than conventional delayed branching mechanisms.


Citations (20)


... The first step towards the design of video processors and video systems is to achieve an accurate understanding of the major video applications and the workload characteristics of those applications. Knowledge of video applications is a necessary component in video system design for deciding: (1) whether the system will support one or more video standards, and consequently whether it requires hardware and/or software for separate standards, (2) whether the design should employ either more application-specific hardware, providing more computational capability but less flexibility, or more software, providing greater flexibility but necessitating more powerful and costly processors, and (3) what video processing and system control software requirements must be served by the system processor(s). Depending upon how much of the video processing is performed in dedicated hardware versus software, it is necessary to understand the workload characteristics of the video applications in selecting or designing a video processor. ...

Reference:

MediaBench II video: Expediting the next generation of video systems research
IMPACT
  • Citing Article
  • May 1991

ACM SIGARCH Computer Architecture News

... Code duplication has frequently been used in compilers to enable subsequent optimizations [17,53,61,62]. As it implies a trade-off between the achievable performance gains via those follow-up optimizations and the code size increase, fine-tuned heuristics are often used to influence decisions on whether to duplicate code or not [58]. ...

Using Profile Information to Assist Classic Compiler Code Optimizations
  • Citing Article
  • December 1991

Software Practice and Experience

... Most existing approaches consider information about individual functions in isolation, and the called function is not considered part of the semantics of the calling function, which can result in some semantic information being lost when executing binary code similarity analysis. Existing solutions inline all user-defined functions, which can lead to an explosion in function code size (Chang et al. 1992;Wang et al. 2015). However, not all functions are closely related to the function calling them. ...

Profile‐guided automatic inline expansion for C programs
  • Citing Article
  • May 1992

Software Practice and Experience

... As shown in Table 2.1, the dynamic target set is considerably smaller than the static set. The data in Figure 2.8 strongly supports our MBR optimization, previously identified by Chang and Hwu [33] for multi-way branches implemented for switch statements. In their work, any switch statement with less than ten branches would be implemented as sleds of conditional branches and direct jumps, ordered by profiled execution frequency. ...

Control flow optimization for supercomputer scalar processing
  • Citing Conference Paper
  • January 1989
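The multi-way branch optimization described in the snippet above, lowering a switch as a chain of conditional tests ordered by profiled execution frequency, can be illustrated with a hypothetical cost model (case names and counts invented):

```python
# Cost of a conditional-branch chain: comparisons executed until a case
# matches, summed over all profiled input values.

def chain_cost(values, case_order):
    total = 0
    for v in values:
        for case in case_order:
            total += 1
            if v == case:
                break
    return total

# Profiled input: 'z' dominates, so testing it first pays off.
inputs = ["z"] * 90 + ["a"] * 9 + ["b"]
chain_cost(inputs, ["a", "b", "z"])   # 281 comparisons in source order
chain_cost(inputs, ["z", "a", "b"])   # 111 comparisons in profile order
```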

... Compiler researchers proposed loop transformation techniques such as loop tiling and fusion to improve data locality [3, 4, 1]. To enhance instruction locality, researchers proposed procedure orderings and code layout optimisations [17, 10, 5] to improve spatial locality of instructions. Existing literature has only considered improving data/instruction locality over single program runs. ...

Achieving High Instruction Cache Performance with an Optimizing Compiler.
  • Citing Conference Paper
  • May 1989

... The superblock is a global scheduling region comprising a single-entry multiple-exit sequence of basic blocks [2]. Scheduling decisions may decrease the execution time for one control path while increasing it for another, and by making these decisions in favor of a more frequently executed path, overall performance can be improved [19]. When testing and applying mathematical functions in SML, we discover that for a particular input set, there is one path that demonstrates obviously higher execution frequency than others. ...

The Superblock: An Effective Technique for VLIW and Superscalar Compilation

The Journal of Supercomputing

... The full source code of this work is available on https://gricad-gitlab.univ-grenoble-alpes. Scheduling methods (for the compaction problem: how to parallelize a sequential micro-code) have been thoroughly studied by Fisher [30], Rau et al. [78], Chang and Hwu [20]. Some of these methods were implemented in the Multiflow compiler [55] for a VLIW architecture. ...

Trace Selection For Compiling Large C Application Programs To Microcode

... Extracting instruction-level parallelism can be done either statically [10] (at compile time) or dynamically [16] (at execution time). Dynamic extraction of parallelism is based on complex hardware, while static techniques [2] [6] shift the burden of identifying parallelism onto the compiler [1] [5]. ...

Three Architectural Models for Compiler-Controlled Speculative Execution
  • Citing Article
  • May 1995

IEEE Transactions on Computers