Pohua P. Chang's research while affiliated with University of Illinois, Urbana-Champaign and other places

What is this page?


This page lists the scientific contributions of an author, who either does not have a ResearchGate profile, or has not yet added these contributions to their profile.

It was automatically created by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.


Publications (21)


Comparing Static And Dynamic Code Scheduling for Multiple-Instruction-Issue Processors
  • Article
  • Full-text available

April 1999 · 197 Reads · 22 Citations

Pohua P. Chang · William Y. Chen · Scott A. Mahlke
This paper examines two alternative approaches to supporting code scheduling for multiple-instruction-issue processors. One is to provide a set of non-trapping instructions so that the compiler can perform aggressive static code scheduling. The application of this approach to existing commercial architectures typically requires extending the instruction set. The other approach is to support out-of-order execution in the microarchitecture so that the hardware can perform aggressive dynamic code scheduling. This approach usually does not require modifying the instruction set but requires complex hardware support. In this paper, we analyze the performance of the two alternative approaches using a set of important nonnumerical C benchmark programs. A distinguishing feature of the experiment is that the code for the dynamic approach has been optimized and scheduled as much as allowed by the architecture. The hardware is only responsible for the additional reordering that cannot be performed...
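The static alternative described above can be sketched as a greedy list scheduler. This is an illustrative toy only, not the paper's experimental setup: a hypothetical multiple-issue machine with unit-latency instructions, where an instruction may issue once every value it reads (produced inside the stream) was computed in an earlier cycle.

```python
# Sketch of static list scheduling for a hypothetical multiple-issue
# machine. Instructions are (dest, sources) pairs; registers never written
# by any listed instruction are treated as live-in and always ready.

def static_schedule(instrs, issue_width):
    """Pack instructions into issue cycles; returns one list per cycle."""
    produced = {dst for dst, _ in instrs}
    ready_at = {}                        # dest -> cycle its value is ready
    schedule = []
    remaining = list(instrs)
    while remaining:
        cycle = len(schedule)
        # Issue if every internally produced source is done (unit latency).
        issued = [(d, srcs) for d, srcs in remaining
                  if all(s not in produced or ready_at.get(s, float("inf")) <= cycle
                         for s in srcs)][:issue_width]
        for d, _ in issued:
            ready_at[d] = cycle + 1
        remaining = [i for i in remaining if i not in issued]
        schedule.append([d for d, _ in issued])
    return schedule

# Toy dependence chain: c needs a and b; d needs c; e is independent.
code = [("a", ["x"]), ("b", ["y"]), ("c", ["a", "b"]),
        ("d", ["c"]), ("e", ["z"])]
wide = static_schedule(code, issue_width=2)    # 3 cycles
narrow = static_schedule(code, issue_width=1)  # 5 cycles, one per instruction
```

The same reordering freedom is what an out-of-order microarchitecture buys at run time with hardware instead of at compile time.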


Scalar Program Performance on Multiple-Instruction-Issue Processors with a Limited Number of Registers

April 1999 · 6 Reads · 11 Citations

In this paper the performance of multiple-instruction-issue processors with variable register file sizes is examined for a set of scalar programs. We make several important observations. First, multiple-instruction-issue processors can perform effectively without a large number of registers. In fact, the register files of many existing architectures (16-32 registers) are capable of sustaining a high instruction execution rate. Second, even for small register files (8-12 registers), substantial performance gains can be obtained by increasing the issue rate of a processor. In general, the percentage increase in performance achieved by increasing the issue rate is relatively constant for all register file sizes. Finally, code transformations designed for multiple-instruction-issue processors are found to be effective for all register file sizes; however, for small register files, the performance improvement is limited due to the excessive spill code introduced by the transformations.
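The spill-code effect the abstract mentions can be illustrated with a deliberately simplified model (not the paper's methodology): values are kept in k physical registers with LRU replacement, and every re-load of a previously defined value counts as spill traffic.

```python
# Toy register-pressure model: k physical registers, LRU replacement;
# each reuse of a value that was evicted earlier costs one spill reload.

def spill_reloads(trace, k):
    regs = []          # physical registers, LRU order (front = oldest)
    defined = set()    # virtual registers that have held a value before
    reloads = 0
    for v in trace:
        if v in regs:                  # value still register-resident
            regs.remove(v)
            regs.append(v)
            continue
        if v in defined:               # was evicted earlier: spill reload
            reloads += 1
        defined.add(v)
        if len(regs) == k:             # evict the least recently used
            regs.pop(0)
        regs.append(v)
    return reloads

trace = list("abcabcabc")        # three values reused round-robin
few = spill_reloads(trace, 2)    # every reuse forces a reload
enough = spill_reloads(trace, 3) # all three values stay resident
```

With one register too few, every reuse misses; with just enough, spill traffic disappears entirely, which is the kind of threshold behavior the paper measures across register file sizes.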


Proceedings of the 18th International Symposium on Computer Architecture (ISCA)

April 1999 · 19 Reads · 3 Citations

Pohua P. Chang · Scott A. Mahlke · William Y. Chen · [...]
The performance of multiple-instruction-issue processors can be severely limited by the compiler's ability to generate efficient code for concurrent hardware. In the IMPACT project, we have developed IMPACT-I, a highly optimizing C compiler to exploit instruction level concurrency. The optimization capabilities of the IMPACT-I C compiler are summarized in this paper. Using the IMPACT-I C compiler, we ran experiments to analyze the performance of multiple-instruction-issue processors executing some important non-numerical programs. The multiple-instruction-issue processors achieve solid speedup over high-performance single-instruction-issue processors. We ran experiments to characterize the following architectural design issues: code scheduling model, instruction issue rate, memory load latency, and function unit resource limitations. Based on the experimental results, we propose the IMPACT Architectural Framework, a set of architectural features that best support the IMPACT-I C compile...


Figure 6: The interaction of classical, superscalar and multiprocessor optimizations for an infinite window.
Figure 2: Effect of data migration on parallelism and efficiency.
The Effect Of Compiler Optimizations On Available Parallelism In Scalar Programs

April 1999 · 52 Reads · 6 Citations

In this paper we analyze the effect of compiler optimizations on fine grain parallelism in scalar programs. We characterize three levels of optimization: classical, superscalar, and multiprocessor. We show that classical optimizations not only improve a program's efficiency but also its parallelism. Superscalar optimizations further improve the parallelism for moderately parallel machines. For highly parallel machines, however, they actually constrain available parallelism. The multiprocessor optimizations we consider are memory renaming and data migration.

Introduction. Compiler optimizations are designed to reduce a program's execution time. Traditionally, these optimizations are customized for a given machine model. Classical optimizations are designed to improve the program's efficiency for a machine model which has one thread of execution and can issue one instruction per cycle. Superscalar optimizations are designed for a machine model with a single thread of execution and a lim...
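The tension between efficiency and parallelism can be seen with the standard "available parallelism" metric (instruction count divided by dependence-chain length, assuming unit latency and an unbounded window). In this sketch, a classical optimization such as common-subexpression elimination removes work but leaves the critical path unchanged, so measured parallelism drops even though the program got faster:

```python
# Available parallelism = instruction count / critical path length,
# for (dest, sources) instructions with unit latency.

def available_parallelism(instrs):
    depth = {}
    critical = 0
    for dst, srcs in instrs:
        d = 1 + max((depth.get(s, 0) for s in srcs), default=0)
        depth[dst] = d
        critical = max(critical, d)
    return len(instrs) / critical

# Before CSE: x+y is computed twice, in parallel.
before = [("t1", ["x", "y"]), ("t2", ["x", "y"]), ("t3", ["t1", "t2"])]
# After CSE: fewer instructions, same critical path of length 2.
after = [("t1", ["x", "y"]), ("t3", ["t1", "t1"])]

available_parallelism(before)   # 3 / 2 = 1.5
available_parallelism(after)    # 2 / 2 = 1.0
```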


Data Access Microarchitectures for Superscalar Processors with Compiler-Assisted Data Prefetching

April 1999 · 7 Reads · 60 Citations

The performance of superscalar processors is more sensitive to the memory system delay than their single-issue predecessors. This paper examines alternative data access microarchitectures that effectively support compiler-assisted data prefetching in superscalar processors. In particular, a prefetch buffer is shown to be more effective than increasing the cache dimension in solving the cache pollution problem. All in all, we show that a small data cache with compiler-assisted data prefetching can achieve a performance level close to that of an ideal cache.

1 Introduction. Superscalar processors can potentially deliver more than five times speedup over conventional single-issue processors [1]. With the total execution cycle count dramatically reduced, each cycle becomes more significant to the overall performance. Because each data cache miss can introduce many extra execution cycles, a superscalar processor can easily lose the majority of its performance to the memory hierarchy. Out-of-...
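A toy model (not the paper's microarchitecture) shows why a small prefetch buffer helps on streaming accesses: a direct-mapped cache is paired with a single-entry buffer that always holds the next sequential line, as if the compiler had inserted a prefetch for it.

```python
# Direct-mapped cache plus optional one-entry sequential prefetch buffer.

def miss_count(addresses, n_lines=4, line_size=4, prefetch=False):
    cache = {}       # set index -> tag
    pbuf = None      # line number held by the prefetch buffer
    misses = 0
    for a in addresses:
        line = a // line_size
        idx, tag = line % n_lines, line // n_lines
        if cache.get(idx) != tag and not (prefetch and pbuf == line):
            misses += 1
        cache[idx] = tag             # fill from memory or prefetch buffer
        if prefetch:
            pbuf = line + 1          # prefetch the next sequential line
    return misses

stream = range(32)                   # sequential sweep through 8 lines
miss_count(stream)                   # 8 misses: one per line
miss_count(stream, prefetch=True)    # 1 miss: only the very first line
```

Because the prefetched line sits in the buffer rather than in the cache proper, a real design of this shape also avoids displacing useful cache lines, which is the pollution argument the abstract makes.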


Using Profile Information to Assist Classic Code Optimizations

Software: Practice and Experience, Vol. 21(12), 1301-1321 (December 1991)

April 1997 · 333 Reads · 107 Citations

This paper describes the design and implementation of an optimizing compiler that automatically generates profile information to assist classic code optimizations. This compiler contains two new components, an execution profiler and a profile-based code optimizer, which are not commonly found in traditional optimizing compilers. The execution profiler inserts probes into the input program, executes the input program for several inputs, accumulates profile information and supplies this information to the optimizer. The profile-based code optimizer uses the profile information to expose new optimization opportunities that are not visible to traditional global optimization methods. Experimental results show that the profile-based code optimizer significantly improves the performance of production C programs that have already been optimized by a high-quality global code optimizer
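The profiling half of such a compiler can be sketched in a few lines: probes inserted at basic-block boundaries accumulate execution counts over several inputs, and the counts then tell the optimizer which path deserves attention. The program and block names below are made up for illustration.

```python
# Minimal execution-profiler sketch: probes at basic-block boundaries
# accumulate counts across runs on several inputs.

counters = {}

def probe(block):
    counters[block] = counters.get(block, 0) + 1

def instrumented(x):        # the "input program" with probes inserted
    probe("entry")
    if x > 0:
        probe("then")
        return x * 2
    probe("else")
    return -x

for v in [5, 3, -1, 7]:     # run the program for several inputs
    instrumented(v)

# counters now reads entry: 4, then: 3, else: 1, so the "then" path is
# hot and a profile-based optimizer would focus its effort there.
```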


Three Architectural Models for Compiler-Controlled Speculative Execution

May 1995 · 6 Reads · 36 Citations

IEEE Transactions on Computers

To effectively exploit instruction-level parallelism, the compiler must move instructions across branches. When an instruction is moved above a branch on which it is control dependent, it is speculatively executed: it executes before it is known whether or not its result is needed. Speculative execution carries potential hazards; if these hazards can be eliminated, the compiler can schedule the code more aggressively. This paper outlines the hazards of speculative execution and discusses three architectural models, restricted, general, and boosting, which provide increasing amounts of support for removing these hazards. The performance gained by each level of additional hardware support is analyzed using the IMPACT C compiler, which performs superblock scheduling for superscalar and superpipelined processors.
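One of the hazards, and one way architectural support removes it, can be shown in miniature. The compiler wants to hoist a divide above the branch that guards it; a trapping divide would fault when the guard is false, so the hoisted copy uses a "silent" non-trapping form whose dummy result is simply never used. The function names here are illustrative, not from the paper.

```python
# Sketch of compiler-controlled speculation with a non-trapping operation.

def silent_div(a, b):
    return a // b if b != 0 else 0   # non-trapping: dummy value on fault

def scheduled(x):
    t = silent_div(100, x)   # speculative: executed before the branch
    if x != 0:
        return t             # branch confirms speculation: result needed
    return -1                # otherwise the speculative result is discarded

scheduled(4)    # 25, same answer as the unspeculated code
scheduled(0)    # -1: the would-be divide-by-zero never traps
```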


Table 2: The benchmarks.
Figure 3: The performance of prescheduling for the superpipelined versions of existing architectures. The base architecture is the single-issue processor with no prescheduling. 2 and 3 are single-issue, 2X- and 3X-superpipelined processors respectively; 4 and 6 are dual-issue, 2X- and 3X-superpipelined processors respectively. All the processors have 32 registers.
The Importance of Prepass Code Scheduling for Superscalar and Superpipelined Processors

April 1995 · 79 Reads · 61 Citations

IEEE Transactions on Computers

Superscalar and superpipelined processors utilize parallelism to achieve peak performance that can be several times higher than that of conventional scalar processors. For this potential to be translated into the speedup of real programs, the compiler must be able to schedule instructions so that the parallel hardware is effectively utilized. Previous work has shown that prepass code scheduling helps to produce a better schedule for scientific programs, but the importance of prescheduling has never been demonstrated for control-intensive non-numeric programs. These programs differ significantly from scientific programs because they contain frequent branches; the compiler must perform global scheduling in order to find enough independent instructions. In this paper, the code optimizer and scheduler of the IMPACT-I C compiler are described. Within this framework, we study the importance of prepass code scheduling for a set of production C programs. It is shown that, in contrast to the results previously obtained for scientific programs, prescheduling is not important for compiling control-intensive programs to the current generation of superscalar and superpipelined processors. However, if some of the current restrictions on upward code motion can be removed in future architectures, prescheduling would substantially improve the execution time of this class of programs on both superscalar and superpipelined processors.


The Superblock: An Effective Technique for VLIW and Superscalar Compilation

May 1993 · 663 Reads · 626 Citations

The Journal of Supercomputing

A compiler for VLIW and superscalar processors must expose sufficient instruction-level parallelism (ILP) to effectively utilize the parallel hardware. However, ILP within basic blocks is extremely limited for control-intensive programs. We have developed a set of techniques for exploiting ILP across basic block boundaries. These techniques are based on a novel structure called the superblock. The superblock enables the optimizer and scheduler to extract more ILP along the important execution paths by systematically removing constraints due to the unimportant paths. Superblock optimization and scheduling have been implemented in the IMPACT-I compiler. This implementation gives us a unique opportunity to fully understand the issues involved in incorporating these techniques into a real compiler. Superblock optimizations and scheduling are shown to be useful while taking into account a variety of architectural features.
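The first step of superblock formation, profile-driven trace selection, can be sketched as follows: starting from a seed block, repeatedly follow the most frequently executed successor edge. (Tail duplication, which then removes side entrances so the trace becomes a single-entry superblock, is omitted here.) The CFG and edge weights are invented for illustration.

```python
# Profile-driven trace selection over a CFG given as successor lists
# plus per-edge execution frequencies.

def select_trace(succs, edge_freq, seed):
    trace, block = [], seed
    while block is not None and block not in trace:   # stop at a cycle
        trace.append(block)
        out = succs.get(block, [])
        block = max(out, key=lambda s: edge_freq[(trace[-1], s)]) if out else None
    return trace

succs = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
edge_freq = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 90, ("C", "D"): 10}
select_trace(succs, edge_freq, "A")   # ["A", "B", "D"]
# D still has a side entrance from C; tail-duplicating D into D' would
# finish the single-entry superblock A -> B -> D'.
```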


Efficient Instruction Sequencing with Inline Target Insertion

January 1993 · 37 Reads · 14 Citations

IEEE Transactions on Computers

Inline target insertion, a specific compiler and pipeline implementation method for delayed branches with squashing, is defined. The method is shown to offer two important features not discovered in previous studies. First, branches inserted into branch slots are correctly executed. Second, the execution returns correctly from interrupts or exceptions with only one program counter. These two features result in better performance and less software/hardware complexity than conventional delayed branching mechanisms.


Citations (20)


... The first step towards the design of video processors and video systems is to achieve an accurate understanding of the major video applications and the workload characteristics of those applications. Knowledge of video applications is a necessary component in video system design for deciding: (1) whether the system will support one or more video standards, and consequently whether it requires hardware and/or software for separate standards, (2) whether the design should employ either more application-specific hardware, providing more computational capability but less flexibility, or more software, providing greater flexibility but necessitating more powerful and costly processors, and (3) what video processing and system control software requirements must be served by the system processor(s). Depending upon how much of the video processing is performed in dedicated hardware versus software, it is necessary to understand the workload characteristics of the video applications in selecting or designing a video processor. ...

Reference:

MediaBench II video: Expediting the next generation of video systems research
IMPACT
  • Citing Article
  • May 1991

ACM SIGARCH Computer Architecture News

... Code duplication has frequently been used in compilers to enable subsequent optimizations [17,53,61,62]. As it implies a trade-off between the achievable performance gains via those follow-up optimizations and the code size increase, fine-tuned heuristics are often used to influence decisions on whether to duplicate code or not [58]. ...

Using Profile Information to Assist Classic Compiler Code Optimizations
  • Citing Article
  • December 1991

Software Practice and Experience

... Most existing approaches consider information about individual functions in isolation, and the called function is not considered part of the semantics of the calling function, which can result in some semantic information being lost when executing binary code similarity analysis. Existing solutions inline all user-defined functions, which can lead to an explosion in function code size (Chang et al. 1992;Wang et al. 2015). However, not all functions are closely related to the function calling them. ...

Profile‐guided automatic inline expansion for C programs
  • Citing Article
  • May 1992

Software Practice and Experience

... As shown in Table 2.1, the dynamic target set is considerably smaller than the static set. The data in Figure 2.8 strongly supports our MBR optimization, previously identified by Chang and Hwu [33] for multi-way branches implemented for switch statements. In their work, any switch statement with less than ten branches would be implemented as sleds of conditional branches and direct jumps, ordered by profiled execution frequency. ...

Control flow optimization for supercomputer scalar processing
  • Citing Conference Paper
  • January 1989
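The multi-way branch optimization described in the snippet above, lowering a switch as a chain of conditional tests ordered by profiled execution frequency, can be illustrated with a hypothetical cost model (case names and counts invented):

```python
# Cost of a conditional-branch chain: comparisons executed until a case
# matches, summed over all profiled input values.

def chain_cost(values, case_order):
    total = 0
    for v in values:
        for case in case_order:
            total += 1
            if v == case:
                break
    return total

# Profiled input: 'z' dominates, so testing it first pays off.
inputs = ["z"] * 90 + ["a"] * 9 + ["b"]
chain_cost(inputs, ["a", "b", "z"])   # 281 comparisons in source order
chain_cost(inputs, ["z", "a", "b"])   # 111 comparisons in profile order
```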

... Compiler researchers proposed loop transformation techniques such as loop tiling and fusion to improve data locality [3, 4, 1]. To enhance instruction locality, researchers proposed procedure orderings and code layout optimisations [17, 10, 5] to improve spatial locality of instructions. Existing literature has only considered improving data/instruction locality over single program runs. ...

Achieving High Instruction Cache Performance with an Optimizing Compiler.
  • Citing Conference Paper
  • May 1989

... The superblock is a global scheduling region comprising a single-entry multiple-exit sequence of basic blocks [2]. Scheduling decisions may decrease the execution time for one control path while increasing it for another, and by making these decisions in favor of a more frequently executed path, overall performance can be improved [19]. When testing and applying mathematical functions in SML, we discover that for a particular input set, there is one path that demonstrates obviously higher execution frequency than others. ...

The Superblock: An Effective Technique for VLIW and Superscalar Compilation

The Journal of Supercomputing

... The full source code of this work is available on https://gricad-gitlab.univ-grenoble-alpes. Scheduling methods (for the compaction problem: how to parallelize a sequential micro-code) have been thoroughly studied by Fisher [30], Rau et al. [78], Chang and Hwu [20]. Some of these methods were implemented in the Multiflow compiler [55] for a VLIW architecture. ...

Trace Selection For Compiling Large C Application Programs To Microcode

... Extracting instruction-level parallelism can be done either statically [10] (at compile time) or dynamically [16] (at execution time). Dynamic extraction of parallelism is based on complex hardware, while static techniques [2] [6] shift the burden of identifying parallelism onto the compiler [1] [5]. ...

Three Architectural Models for Compiler-Controlled Speculative Execution
  • Citing Article
  • May 1995

IEEE Transactions on Computers