International Journal of Parallel Programming

Published by Springer Nature

Online ISSN: 1573-7640 · Print ISSN: 0885-7458

Articles


ADL-Based Specification of Implementation Styles for Functional Simulators
  • Conference Paper

August 2011 · 48 Reads

David A. Penry

Functional simulators find widespread use as sub-systems within microarchitectural simulators. The speed of functional simulators is strongly influenced by the implementation style of the functional simulator, e.g. interpreted vs. binary-translated simulation. Speed is also strongly influenced by the level of detail of the interface the functional simulator presents to the rest of the timing simulator. This level of detail may change during design space exploration, requiring corresponding changes to the interface and the simulator. However, for many implementation styles, changing the interface is difficult. As a result, architects may choose either implementation styles which are more malleable or interfaces with more detail than is necessary. In either case, simulation speed is traded for simulator design time. We show that this tradeoff is unnecessary if an orthogonal-specification design principle is practiced: specify how a simulator is to be implemented separately from what it is implementing and then synthesize a simulator from the combined specifications. We show that the use of an Architectural Description Language (ADL) with constructs for implementation style specification makes it possible to synthesize interfaces with different implementation styles with reasonable effort.

[Figure: Combining atomic blocks into an enlarged atomic block.]
Increasing the Instruction Fetch Rate via Block-Structured Instruction Set Architectures
  • Conference Paper
  • Full-text available

January 1997 · 132 Reads

To exploit larger amounts of instruction level parallelism, processors are being built with wider issue widths and larger numbers of functional units. Instruction fetch rate must also be increased in order to effectively exploit the performance potential of such processors. Block-structured ISAs provide an effective means of increasing the instruction fetch rate. We define an optimization, called block enlargement, that can be applied to a block-structured ISA to increase the instruction fetch rate of a processor that implements that ISA. We have constructed a compiler that generates block-structured ISA code, and a simulator that models the execution of that code on a block-structured ISA processor. We show that for the SPECint95 benchmarks, the block-structured ISA processor executing enlarged atomic blocks outperforms a conventional ISA processor by 12% while using simpler microarchitectural mechanisms to support wide-issue and dynamic scheduling.

Branch Classification: A New Mechanism for Improving Branch Predictor Performance

April 1996 · 79 Reads

There is wide agreement that one of the most important impediments to the performance of current and future pipelined superscalar processors is the presence of conditional branches in the instruction stream. Speculative execution seems to be one solution of choice to the branch problem, but speculative work is discarded if a branch is mispredicted. Therefore, we need a very accurate branch predictor; 95% accuracy is not good enough. This paper proposes branch classification to help improve the accuracy of branch predictors. Branch classification allows an individual branch instruction to be associated with the branch predictor best suited to predict its direction. Using this approach, a hybrid branch predictor can be constructed such that each component branch predictor predicts those branches for which it is best suited. This paper suggests one classification scheme, analyzes several branch predictors, and proposes a hybrid branch predictor that achieves higher prediction accuracy than any branch predictor previously reported in the literature.
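The abstract leaves the classification scheme itself unspecified; the hybrid idea can nevertheless be sketched as a tournament-style predictor in which a per-branch chooser table plays the role of the classifier, steering each branch to the component predictor that has served it best. A minimal C sketch, where the table sizes, indexing, and update rules are illustrative assumptions rather than the paper's scheme:

    /* Toy hybrid (tournament) predictor: a chooser table picks between a
       bimodal and a gshare component per branch. Illustrative only. */
    #include <stdbool.h>
    #include <stdint.h>

    #define TBL 4096                 /* entries per table (assumed size) */
    static uint8_t bimodal[TBL];     /* 2-bit saturating counters */
    static uint8_t gshare[TBL];
    static uint8_t meta[TBL];        /* 2-bit chooser: >= 2 favors gshare */
    static uint16_t ghist;           /* global branch history register */

    static bool taken(uint8_t c) { return c >= 2; }
    static void train(uint8_t *c, bool t) {
        if (t) { if (*c < 3) (*c)++; } else { if (*c > 0) (*c)--; }
    }

    bool predict(uint32_t pc) {
        bool pb = taken(bimodal[pc % TBL]);
        bool pg = taken(gshare[(pc ^ ghist) % TBL]);
        return taken(meta[pc % TBL]) ? pg : pb;
    }

    void update(uint32_t pc, bool t) {
        uint32_t bi = pc % TBL, gi = (pc ^ ghist) % TBL;
        bool pb = taken(bimodal[bi]), pg = taken(gshare[gi]);
        if (pb != pg)                        /* train the chooser only on */
            train(&meta[pc % TBL], pg == t); /* component disagreement    */
        train(&bimodal[bi], t);
        train(&gshare[gi], t);
        ghist = (uint16_t)((ghist << 1) | t);
    }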

Combining loop transformations considering caches and scheduling

January 1997 · 15 Reads

The performance of modern microprocessors is greatly affected by cache behavior, instruction scheduling, register allocation and loop overhead. High level loop transformations such as fission, fusion, tiling, interchanging and outer loop unrolling (e.g., unroll and jam) are well known to be capable of improving all these aspects of performance. Difficulties arise because these machine characteristics and these optimizations are highly interdependent. Interchanging two loops might, for example, improve cache behavior but make it impossible to allocate registers in the inner loop. Similarly, unrolling or interchanging a loop might individually hurt performance but doing both simultaneously might help performance. Little work has been published on how to combine these transformations into an efficient and effective compiler algorithm. In this paper we present a model that estimates total machine cycle time, taking into account cache misses, software pipelining, register pressure and loop overhead. We then develop an algorithm to intelligently search through the various possible transformations, using our machine model to select the set of transformations leading to the best overall performance. We have implemented this algorithm as part of the MIPSPro commercial compiler system. We give experimental results showing that our approach is both effective and efficient in optimizing numerical programs.
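One transformation such a search must weigh is outer loop unrolling (unroll-and-jam), whose interplay of register pressure and memory reuse the abstract alludes to. A hedged before/after sketch in C on a matrix-vector product (the kernel and unroll factor are illustrative, not from the paper): jamming two unrolled outer iterations lets each loaded x[j] be reused twice, at the cost of extra registers.

    #define N 1024   /* assumed even so the unroll factor of 2 divides it */

    void mv_before(double y[N], const double A[N][N], const double x[N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                y[i] += A[i][j] * x[j];
    }

    void mv_after(double y[N], const double A[N][N], const double x[N]) {
        for (int i = 0; i < N; i += 2) {     /* outer loop unrolled by 2 */
            double s0 = y[i], s1 = y[i + 1];
            for (int j = 0; j < N; j++) {    /* bodies jammed together */
                double xj = x[j];            /* x[j] now loaded once */
                s0 += A[i][j] * xj;
                s1 += A[i + 1][j] * xj;
            }
            y[i] = s0; y[i + 1] = s1;
        }
    }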

Higher order data types

February 1980 · 31 Reads

We consider a generalization of the concept of abstract data type suitable for modeling situations in which there is more than one level of functionality. An instance of such a situation is the difference in level of functionality between the query and update functions in a data base. We introduce the concept of a higher order data type to model this idea. The underlying algebraic ideas are outlined, and sample applications of the concept are presented.

Extension of tabled 0L-systems and languages

January 1973 · 13 Reads

This paper introduces a new family of languages which originated from a study of some mathematical models for the development of biological organisms. Various properties of this family are established and in particular it is proved that it forms a full abstract family of languages. It is compared with some other families of languages which have already been studied and which either originated from the study of models for biological development or belong to the now standard Chomsky hierarchy. A characterization theorem for context-free languages is also established.

Minimal data dependence abstractions for loop transformations: Extended version

August 1995 · 52 Reads

Many abstractions of program dependences have already been proposed, such as the Dependence Distance, the Dependence Direction Vector, the Dependence Level, or the Dependence Cone. These different abstractions have different precisions. The minimal abstraction associated with a transformation is the abstraction that contains the minimal amount of information necessary to decide when such a transformation is legal. Minimal abstractions for loop reordering and unimodular transformations are presented. As an example, the dependence cone, which approximates dependences by a convex cone of the dependence distance vectors, is the minimal abstraction for unimodular transformations. It also contains enough information for legally applying all loop reordering transformations and finding the same set of valid mono- and multi-dimensional linear schedules as the dependence distance set.
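As a concrete instance of the information these abstractions summarize, consider the illustrative loop nest below (not taken from the paper): each iteration (i, j) writes a[i][j] and reads the value written at iteration (i-1, j+1), so the only dependence distance vector is (1, -1). Any abstraction at least as precise as the distance set shows that interchanging the loops is illegal, since interchange maps (1, -1) to the lexicographically negative (-1, 1).

    #define N 100

    void kernel(double a[N][N]) {
        for (int i = 1; i < N; i++)
            for (int j = 0; j < N - 1; j++)
                /* reads what iteration (i-1, j+1) wrote: distance (1, -1) */
                a[i][j] = a[i - 1][j + 1] + 1.0;
    }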

Semantic-Aware Automatic Parallelization of Modern Applications Using High-Level Abstractions

October 2010 · 193 Reads

Automatic introduction of OpenMP for sequential applications has attracted significant attention recently because of the proliferation of multicore processors and the simplicity of using OpenMP to express parallelism for shared-memory systems. However, most previous research has only focused on C and Fortran applications operating on primitive data types. Modern applications using high-level abstractions, such as C++ STL containers and complex user-defined class types, are largely ignored due to the lack of research compilers that are readily able to recognize high-level object-oriented abstractions and leverage their associated semantics. In this paper, we use a source-to-source compiler infrastructure, ROSE, to explore compiler techniques to recognize high-level abstractions and to exploit their semantics for automatic parallelization. Several representative parallelization candidate kernels are used to study semantic-aware parallelization strategies for high-level abstractions, combined with extended compiler analyses. Preliminary results have shown that semantics of abstractions can help extend the applicability of automatic parallelization to modern applications and expose more opportunities to take advantage of multicore processors.

Keywords: Automatic parallelization · High-level abstractions · Semantics · ROSE · OpenMP
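The rewrite the parallelizer ultimately performs can be pictured with a deliberately simple C kernel (the paper's actual targets are C++ STL-based codes handled through ROSE): once analysis shows the iterations are independent, an OpenMP work-sharing directive is inserted.

    #include <omp.h>

    void saxpy(int n, float a, const float *x, float *y) {
        /* each iteration touches only x[i] and y[i], so there are no
           loop-carried dependences and the loop may be parallelized */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }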

LALP: A language to program custom FPGA-based acceleration engines

January 2010 · 145 Reads

Field-Programmable Gate Arrays (FPGAs) are becoming increasingly important in embedded and high-performance computing systems. They allow performance levels close to the ones obtained with Application-Specific Integrated Circuits, while still keeping design and implementation flexibility. However, to efficiently program FPGAs, one needs the expertise of hardware developers in order to master hardware description languages (HDLs) such as VHDL or Verilog. Attempts to furnish a high-level compilation flow (e.g., from C programs) still have to address open issues before broader efficient results can be obtained. Bearing in mind an FPGA's available resources, LALP (Language for Aggressive Loop Pipelining), a novel language to program FPGA-based accelerators, has been developed, together with its compilation framework, including mapping capabilities. The main ideas behind LALP are to provide a higher abstraction level than HDLs, to exploit the intrinsic parallelism of hardware resources, and to allow the programmer to control execution stages whenever the compiler techniques are unable to generate efficient implementations. Those features are particularly useful to implement loop pipelining, a well-regarded technique used to accelerate computations in several application domains. This paper describes LALP, and shows how it can be used to achieve high-performance computing solutions.

Keywords: Loop pipelining · Compilers · Reconfigurable computing · FPGA

Data structures in context-addressed cellular memories

December 1972 · 7 Reads

This report discusses the capability of an associative memory to search some useful data bases. The report utilizes a simplified cell and a collection of assembler language instructions to show how sets and trees can be searched in the memory. An OR rail and an EXCLUSIVE-OR rail are discussed in relation to their use in searching ordered and unordered sets, strings, and tree data structures. Linked data structures are also discussed. This report is oriented toward the software aspects of the associative memory, to lead to further research in the design of high-level languages that utilize the capability of the rails.

An NC algorithm for recognizing tree adjoining languages

April 1992 · 6 Reads

A parallel algorithm is presented for recognizing the class of languages generated by tree adjoining grammars, a tree rewriting system which has applications in natural language processing. This class of languages is known to properly include all context-free languages; for example, the non-context-free sets {a^n b^n c^n} and {ww} are in this class. It is shown that the recognition problem for tree adjoining languages can be solved by a concurrent-read, concurrent-write parallel random-access machine (CRCW PRAM) in O(log n) time using polynomially many processors. Thus, the class of tree adjoining languages is in AC¹ and hence in NC. This extends a previous result for context-free languages.

Communication Optimization for Affine Recurrence Equations Using Broadcast and Locality

February 2000 · 22 Reads

This paper deals with communication optimization, which is a crucial issue in automatic parallelization. Starting from a system of parameterized affine recurrence equations, we propose a heuristic that determines a set of efficient space-time transformations. It focuses on removing distant communications using broadcast (including anticipated broadcast) and on enforcing locality.

On the Effectiveness of Flow Aggregation in Improving Instruction Reuse in Network Processing Applications

December 2003 · 19 Reads

Instruction Reuse is a microarchitectural technique that exploits dynamic instruction repetition to remove redundant computations at run-time. In this paper we examine instruction reuse of integer ALU and load instructions in network processing applications and attempt to answer the following questions: (1) How much instruction repetition can be reused in packet processing applications? (2) Can the temporal locality of network traffic be exploited to reduce interference in the Reuse Buffer and improve reuse? (3) What is the effect of reuse on microarchitectural features such as resource contention and memory accesses? We use an execution-driven simulation methodology to evaluate instruction reuse and find that for the benchmarks considered, 1 to 50% of the dynamic instructions are reused, yielding performance improvements between 1 and 20%. To further improve reuse, a flow aggregation scheme, together with an architecture for exploiting it, is proposed. This scheme is mostly applicable to header processing applications and exploits temporal locality in packet data to uncover higher reuse. As a side effect, instruction reuse reduces memory traffic and improves performance.
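A reuse buffer can be pictured as a small memoization table consulted before execution. The toy C sketch below keys entries by opcode and operand values for brevity; hardware schemes more commonly index by PC, so this is an illustrative assumption rather than the paper's design.

    #include <stdbool.h>
    #include <stdint.h>

    #define RB 1024   /* direct-mapped table size (assumed) */
    struct rb_entry { bool valid; uint8_t op; int32_t a, b, result; };
    static struct rb_entry rb[RB];

    static uint32_t rb_index(uint8_t op, int32_t a, int32_t b) {
        return (op ^ (uint32_t)a * 2654435761u ^ (uint32_t)b) % RB;
    }

    /* on a hit the stored result is reused and the ALU is skipped */
    bool rb_lookup(uint8_t op, int32_t a, int32_t b, int32_t *out) {
        struct rb_entry *e = &rb[rb_index(op, a, b)];
        if (e->valid && e->op == op && e->a == a && e->b == b) {
            *out = e->result;
            return true;
        }
        return false;
    }

    void rb_insert(uint8_t op, int32_t a, int32_t b, int32_t result) {
        rb[rb_index(op, a, b)] =
            (struct rb_entry){ true, op, a, b, result };
    }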

Dynamic Binary Instrumentation and Data Aggregation on Large Scale Systems

June 2007 · 237 Reads

Dynamic binary instrumentation for performance analysis on large scale architectures such as the IBM Blue Gene/L system (BG/L) poses unique challenges. Their unprecedented scale and often limited OS support require new mechanisms to organize binary instrumentation, to interact with the target application, and to collect the resulting data. We describe the design and current status of a new implementation of the Dynamic Probe Class Library (DPCL) API for large scale systems. DPCL provides an easy to use layer for dynamic instrumentation on parallel MPI applications based on the DynInst dynamic instrumentation library for sequential platforms. Our work includes modifying DynInst to control instrumentation from remote I/O nodes and porting DPCL's communication for performance data collection to use MRNet, a tree-based overlay network (TBON) that supports scalable multicast and data reduction. We describe extensions to the DPCL API that support instrumentation of task subsets and aggregation of collected performance data.

Computer aided design of database internal schema

January 1981 · 5 Reads

A large number of complexly interrelated parameters are involved in the internal-schema-level design of database systems. Consequently, a single design model is seen to be infeasible. A package of three aids is proposed to assist a designer in the step-by-step design of an internal schema. The three aids pertain to the splitting of a relation, the merging of relations, and access strategy selection for a relation.

Parallel general prefix computations with geometric, algebraic, and other applications

December 1989 · 7 Reads

We introduce a generic problem component that captures the most common, difficult kernel of many problems. This kernel involves general prefix computations (GPC). GPC's lower-bound complexity of Ω(n log n) time is established, and we give optimal solutions on the sequential model in O(n log n) time, on the CREW PRAM model in O(log n) time, on the BSR (broadcasting with selective reduction) model in constant time, and on mesh-connected computers in O(√n) time, all with n processors, plus an O(log² n) time solution on the hypercube model. We show that GPC techniques can be applied to a wide variety of geometric (point set and tree) problems, including triangulation of point sets, two-set dominance counting, ECDF searching, finding two- and three-dimensional maximal points, the reconstruction of trees from their traversals, counting inversions in a permutation, and matching parentheses.
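The abstract does not restate the formal GPC definition, but the simpler kernel it generalizes, the ordinary prefix computation over an arbitrary associative operator, is easy to show; this sequential scan runs in O(n), and it is GPC's extra generality that forces the Ω(n log n) bound. Names and types below are illustrative.

    /* out[i] = in[0] op in[1] op ... op in[i], for an associative op */
    typedef long (*assoc_op)(long, long);

    void prefix(const long *in, long *out, int n, assoc_op op) {
        if (n == 0) return;
        out[0] = in[0];
        for (int i = 1; i < n; i++)
            out[i] = op(out[i - 1], in[i]);
    }

    static long add(long a, long b) { return a + b; }
    /* usage: prefix(a, s, n, add) leaves the running sums of a[] in s[] */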

Algorithmic approach to the consecutive retrieval property

August 1979 · 4 Reads

Ghosh's consecutive retrieval property (CR property) not only represents an elegant file organization, but also raises the problem of how to construct a file with this property. Ghosh used an n × m 0–1 incidence matrix, where n is the number of records and m is the number of queries, as a tool for constructing a file with the CR property. In this paper the rows and columns of the incidence matrix are restricted to unique 0–1 patterns. It is found that such a unique incidence matrix cannot have the CR property if the number of rows is greater than 2m−1. This upper bound can be used to generate m(2m−1) columns, which form all the matrices with the CR property that may correspond to the given matrix. Two algorithms are presented which map the columns of the given incidence matrix to these columns with consecutive 1's. A proposed implementation in terms of database design is also presented.
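For a fixed record order, testing whether an incidence matrix has the CR property is straightforward: the records relevant to each query, i.e., the 1's in each column, must form one contiguous run. A small C sketch with assumed dimensions and orientation (rows = records, columns = queries):

    #define NROWS 8   /* records (assumed) */
    #define NCOLS 8   /* queries (assumed) */

    int has_cr_property(const int m[NROWS][NCOLS]) {
        for (int c = 0; c < NCOLS; c++) {
            int runs = 0, prev = 0;
            for (int r = 0; r < NROWS; r++) {
                if (m[r][c] && !prev) runs++;   /* a new run of 1's starts */
                prev = m[r][c];
            }
            if (runs > 1) return 0;   /* 1's not consecutive for query c */
        }
        return 1;
    }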

Parallel algorithms for line generation

October 1990 · 25 Reads

A new, parallel approach for generating Bresenham-type lines is developed. Coordinate pairs which approximate straight lines on a square grid are derived from line equations. These pairs serve as a basis for the development of four new parallel algorithms. One of the algorithms uses the fact that straight line generation is equivalent to a vector prefix sums calculation. The algorithms execute on a binary tree of processors. Each node in the tree performs a simple calculation that involves only additions and shifts. All four algorithms have time complexity O(log₂ n), where n, of the form 2^m, denotes the number of points generated and n−1 is the number of processors in the tree. This compares to O(n) for Bresenham's algorithm executed on a sequential processor. Pipelining can be used to achieve a constant time per line generation as long as line length is less than n.
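For reference, the sequential O(n) baseline being parallelized is Bresenham's integer line algorithm; a standard first-octant version (0 <= dy <= dx) is sketched below in C, with plot() standing in for a write to the grid. Each point costs only additions, doublings, and a comparison.

    void plot(int x, int y);   /* placeholder for the target grid */

    void bresenham(int x0, int y0, int x1, int y1) {
        int dx = x1 - x0, dy = y1 - y0;   /* assumes first octant */
        int d = 2 * dy - dx;              /* decision variable */
        for (int x = x0, y = y0; x <= x1; x++) {
            plot(x, y);
            if (d > 0) { y++; d -= 2 * dx; }
            d += 2 * dy;
        }
    }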

Two Algorithms for Constructing a Delaunay Triangulation

June 1980 · 4,230 Reads

This paper provides a unified discussion of the Delaunay triangulation. Its geometric properties are reviewed and several applications are discussed. Two algorithms are presented for constructing the triangulation over a planar set of N points. The first algorithm uses a divide-and-conquer approach. It runs in O(N log N) time, which is asymptotically optimal. The second algorithm is iterative and requires O(N²) time in the worst case. However, its average-case performance is comparable to that of the first algorithm.
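A building block shared by both construction strategies is the in-circle predicate: with (a, b, c) in counterclockwise order, point d lies inside their circumcircle exactly when a 3×3 determinant is positive. A naive floating-point C version is sketched below; robust implementations use exact or adaptive arithmetic.

    typedef struct { double x, y; } pt;

    /* returns nonzero iff d is strictly inside the circumcircle of (a,b,c),
       assuming (a,b,c) are given in counterclockwise order */
    int in_circle(pt a, pt b, pt c, pt d) {
        double ax = a.x - d.x, ay = a.y - d.y;
        double bx = b.x - d.x, by = b.y - d.y;
        double cx = c.x - d.x, cy = c.y - d.y;
        double det =
            (ax * ax + ay * ay) * (bx * cy - cx * by) -
            (bx * bx + by * by) * (ax * cy - cx * ay) +
            (cx * cx + cy * cy) * (ax * by - bx * ay);
        return det > 0;
    }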

Memetic Algorithms for Parallel Code Optimization

April 2007 · 92 Reads

Discovering the optimum number of processors and the distribution of data on distributed memory parallel computers for a given algorithm is a demanding task. A memetic algorithm (MA) is proposed here to find the best number of processors and the best data distribution method to be used for each stage of a parallel program. A steady-state memetic algorithm is compared with a transgenerational memetic algorithm using different crossover operators and hill-climbing methods. A self-adaptive MA is also implemented, based on a multimeme strategy. All the experiments are carried out on computationally intensive, communication intensive, and mixed problem instances. The MA performs successfully for the illustrative problem instances.

Depth-m search in branch-and-bound algorithms

January 1978 · 28 Reads

A new search strategy, called depth-m search, is proposed for branch-and-bound algorithms, where m is a parameter to be set by the user. In particular, depth-1 search is equivalent to the conventional depth-first search, and depth-∞ search is equivalent to the general heuristic search (including best-bound search as a special case). It is confirmed by computational experiment that the performance of depth-m search continuously changes from that of depth-first search to that of heuristic search when m is changed from 1 to ∞. The exact upper bound on the size of the required memory space is derived and shown to be bounded by O(nm), where n is the problem size. Some methods for controlling m during computation are also proposed and compared, to carry out the entire computation within a given memory space bound.

Optimal parallel algorithms for constructing and maintaining a balanced m-way search tree

December 1986 · 8 Reads

We present parallel algorithms for constructing and maintaining balanced m-way search trees. These parallel algorithms have time complexity O(1) for an n-processor configuration. The formal correctness of the algorithms is given in detail.

Greedy partitioned algorithms for the shortest-path problem

August 1991 · 20 Reads

A partitioned, priority-queue algorithm for solving the single-source best-path problem is defined and evaluated. Finding single-source paths for sparse graphs is notable for its definite lack of parallelism: no known algorithms are scalable. Qualitatively, we discuss the close relationships between our algorithm and previous work by Quinn, Chikayama, and others. Performance measurements of variations of the algorithm, implemented both in concurrent and imperative programming languages on a shared-memory multiprocessor, are presented. This quantitative analysis of the algorithms provides insights into the tradeoffs between complexity and overhead in graph searching executed in high-level parallel languages with automatic task scheduling.
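For orientation, the sequential algorithm whose priority queue the parallel variants partition is Dijkstra-style single-source search. A compact C baseline with a linear-scan extract-min (O(V²)) is sketched below; the adjacency matrix and fixed graph size are assumptions of the sketch.

    #include <limits.h>

    #define V 64   /* number of vertices (assumed) */

    /* adj[u][v] is the edge weight, or -1 if there is no edge */
    void dijkstra(const int adj[V][V], int src, long dist[V]) {
        int done[V] = {0};
        for (int i = 0; i < V; i++) dist[i] = LONG_MAX;
        dist[src] = 0;
        for (int iter = 0; iter < V; iter++) {
            int u = -1;   /* extract-min over unfinished reachable nodes */
            for (int i = 0; i < V; i++)
                if (!done[i] && dist[i] != LONG_MAX &&
                    (u < 0 || dist[i] < dist[u]))
                    u = i;
            if (u < 0) break;
            done[u] = 1;
            for (int v = 0; v < V; v++)   /* relax edges out of u */
                if (adj[u][v] >= 0 && dist[u] + adj[u][v] < dist[v])
                    dist[v] = dist[u] + adj[u][v];
        }
    }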

Modeling of multiple copy update costs for file allocation in distributed databases

February 1985 · 11 Reads

The file allocation problem for distributed databases has been extensively studied in the literature and the objective is to minimize total costs consisting of storage, query and update communication costs. Current modeling of update communication costs is simplistic and does not capture the working of most of the protocols that have been proposed. This paper shows that more accurate modeling of update costs can be achieved fairly easily without an undue increase in the complexity of the formulation. In particular, formulations for two classes of update protocols are shown. Existing heuristics can be used on these formulations to obtain good solutions.

Analytical and Simulation-based Design Space Exploration of Software Defined Radios

June 2010 · 30 Reads

The proposed idea of software defined radios (SDRs) offers the potential to cope with the complexity and flexibility demands of future wireless communication devices. Unfortunately, the tight interaction of software and hardware, as well as the high computational requirements, make the development of SDRs one of the most challenging tasks system architects are facing today. The main challenge is to select the optimal or a sub-optimal solution within the large design space spread by the many design options. This paper introduces a novel design space exploration framework, particularly for early development stages. The key contribution is a pre-simulation mathematical analysis based on synchronous data flow (SDF) graphs, used to make the right software and hardware design decisions. This analysis can be utilized right from the start of the design cycle with only limited knowledge of the final implementation. In addition, the proposed technique seamlessly integrates into an electronic system level (ESL) based simulation framework. This allows for a smooth transition from pure mathematical analysis to the simulation of the final implementation. The practical usage of the framework and its capabilities are highlighted by a case study from a typical SDR design.

Keywords: Design space exploration · Software defined radio · ESL design

[Figures: Moving the lower part of A[i] into the upper part of B[i]; speedup on an Intel Pentium III using MMC.]
An Extended ANSI C for Processors with a Multimedia Extension

January 2003 · 208 Reads

This paper presents the Multimedia C language, which is designed for the multimedia extensions included in all modern microprocessors. The paper discusses the language syntax, the implementation of its compiler and its use in developing multimedia applications. The goal was to provide programmers with the most natural way of using multimedia processing facilities in the C language. The MMC language has been used to develop some of the most frequently used multimedia kernels. The presented experiments on these scientific and multimedia applications have yielded good performance improvements.

MRPPS—An interactive refutation proof procedure system for question-answering

June 1974 · 26 Reads

In this paper a system termed the Maryland refutation proof procedure system (MRPPS) is described. MRPPS is an interactive system which gives the user the ability to create and maintain a core-bound data base and to input queries either as well-formed formulas in the first-order predicate calculus or as clauses. MRPPS consists of three basic components: (1) a set of inference rules for making logical deductions, many of which are based on the resolution principle; (2) a search strategy for heuristically determining the sequence in which to select base clauses and to perform deductions on clauses already generated; and (3) a base clause selection strategy that uses heuristic and semantic information for determining which data axioms and general axioms are to be brought to bear on the problem. These three components are described and MRPPS is compared to related work.

APCFS: Autonomous and Parallel Compressed File System

August 2011 · 133 Reads

APCFS (Autonomous and Parallel Compressed File System) is a file system that supports fast autonomous compression at high compression rates. It is designed as a virtual layer inserted over an existing file system, compressing and decompressing data by intercepting kernel calls. The system achieves an enhanced compression ratio by combining two compression techniques. Speed is attained by performing the two compression techniques in parallel. Experimental results indicate good performance.

Keywords: File compression · Parallel compression · High-speed compression · Compressible format

Concepts and realization of a high-performance data type architecture

February 1982 · 13 Reads

Fifth generation computers are expected to capitalize on the dramatic progress of VLSI technology in order to offer an improved performance/cost figure. An even more important requirement, however, is that they will support by architectural means the generation, execution, and maintenance of quality software, as a way out of the software crisis. One approach towards the design and implementation of quality software is programming with abstract data types, in connection with elaborate type consistency checking. The objection raised against the abstract data type based programming style is poor run-time efficiency when such programs are executed on a conventional machine. In this paper a data type architecture is described that offers efficient and convenient mechanisms for constructing arbitrary data structures and encapsulating them into abstract data types, thus avoiding the inefficiency penalty mentioned above. Through a process of hierarchical decomposition, user-defined abstract data types are mapped on representations given in terms of a basic structured machine data type. This approach combines high performance with generality and completeness. The hardware structure of the data type architecture can be classified as a strongly coupled, asymmetric multicomputer system with hierarchical function distribution among the computers. The system includes a pipeline for numerical and nonnumerical operations, performed on the vector-structured basic machine data type in the SIMD mode of operation. Software reliability and data security are enhanced through elaborate run-time consistency checking. The computer, which was designed and built at the Technical University of Berlin, has recently become operational. This paper outlines the operational principle, the mechanisms, and the hardware and software structure of this innovative, fifth generation computer architecture.

[Figures: Processor pipeline; floorplans for RF-to-RF architectures; DSE framework; average and effective variation in performance for multi-cycle configurations (UMC 0.13µ ASIC technology).]
Evaluation of Bus Based Interconnect Mechanisms in Clustered VLIW Architectures

December 2007 · 90 Reads

With new sophisticated compiler technology, it is possible to schedule distant instructions efficiently. As a consequence, the amount of exploitable instruction level parallelism (ILP) in applications has gone up considerably. However, monolithic register file VLIW architectures present scalability problems due to a centralized register file which is far slower than the functional units (FUs). Clustered VLIW architectures, with a subset of FUs connected to any RF, provide an attractive solution to address this issue. Recent studies with a wide variety of inter-cluster interconnection mechanisms have reported substantial gains in performance (number of cycles) over the most studied RF-to-RF type interconnections. However, these studies have compared only one or two design points in the RF-to-RF interconnect design space. In this paper, we extend the previously reported work. We consider both multi-cycle and pipelined buses. To obtain realistic bus latencies, we synthesized the various architectures and calculated post-layout clock periods. The results demonstrate that while there is less than 10% variation in interconnect area, the bus based architectures are slower by as much as 400%. Also, neither multi-cycle nor pipelined buses, nor an increase in the number of buses, is able to achieve performance comparable to point-to-point type interconnects.

An optimal scheduling procedure for matrix inversion on linear array at a processor level

August 1994 · 3 Reads

This paper presents a parallel algorithm for computing the inverse of a dense matrix based on a modified Jordan elimination which requires fewer calculation steps than the standard one. The algorithm is proposed for implementation on a linear array with a small to moderate number of processors which operate in a parallel-pipeline fashion. Communication between neighboring processors is achieved through a common memory module implemented as a FIFO memory module. For the proposed algorithm we define a task scheduling procedure and prove that it is time optimal. In order to compute the speedup and efficiency of the system, two definitions (Amdahl's and Gustafson's) were used. For the proposed architecture, involving two to 16 processors, estimated Gustafson's (Amdahl's) speedups are in the range 1.99 to 13.76 (1.99 to 9.69).
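As a point of reference, the standard (unmodified) Jordan elimination that the paper's algorithm refines inverts A by reducing it to the identity while applying the same row operations to an identity matrix. A minimal C sketch without pivoting, so nonzero pivots are assumed throughout:

    #define N 4   /* matrix order (assumed) */

    /* overwrites a and leaves a's inverse in inv; returns -1 if a zero
       pivot is hit (a full implementation would pivot instead) */
    int jordan_invert(double a[N][N], double inv[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                inv[i][j] = (i == j) ? 1.0 : 0.0;
        for (int p = 0; p < N; p++) {
            double piv = a[p][p];
            if (piv == 0.0) return -1;
            for (int j = 0; j < N; j++) { a[p][j] /= piv; inv[p][j] /= piv; }
            for (int i = 0; i < N; i++) {
                if (i == p) continue;
                double f = a[i][p];
                for (int j = 0; j < N; j++) {
                    a[i][j]   -= f * a[p][j];
                    inv[i][j] -= f * inv[p][j];
                }
            }
        }
        return 0;
    }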

[Figures and table: Process arrival pattern; process arrival patterns in the microbenchmark (64 KB message size, 200 ms computation time) on the two platforms; the imbalance factor for major routines.]
A Study of Process Arrival Patterns for MPI Collective Operations

November 2006 · 133 Reads

Process arrival pattern, which denotes the timing when different processes arrive at an MPI collective operation, can have a significant impact on the performance of the operation. In this work, we characterize the process arrival patterns in a set of MPI programs on two common cluster platforms, use a micro-benchmark to study the process arrival patterns in MPI programs with balanced loads, and investigate the impacts of different process arrival patterns on collective algorithms. Our results show that (1) the differences between the times when different processes arrive at a collective operation are usually sufficiently large to affect the performance; (2) application developers in general cannot effectively control the process arrival patterns in their MPI programs in the cluster environment: balancing loads at the application level does not balance the process arrival patterns; and (3) the performance of collective communication algorithms is sensitive to process arrival patterns. These results indicate that process arrival pattern is an important factor that must be taken into consideration in developing and optimizing MPI collective routines. We propose a scheme that achieves high performance with different process arrival patterns, and demonstrate that by explicitly considering process arrival pattern, more efficient MPI collective routines than the current ones can be obtained.
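The measurement behind such a characterization can be approximated very simply: each rank records a timestamp immediately before entering the collective, and the timestamps are gathered afterwards for comparison. A hedged MPI/C sketch, assuming clocks synchronized well enough for relative comparison (the paper's methodology is more careful):

    #include <mpi.h>
    #include <stdio.h>

    void timed_allreduce(long *val, long *sum, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        double arrive = MPI_Wtime();   /* arrival timestamp */
        MPI_Allreduce(val, sum, 1, MPI_LONG, MPI_SUM, comm);

        double times[512];             /* assumes size <= 512 */
        MPI_Gather(&arrive, 1, MPI_DOUBLE, times, 1, MPI_DOUBLE, 0, comm);
        if (rank == 0)
            for (int i = 0; i < size; i++)
                printf("rank %d arrived %+.6f s relative to rank 0\n",
                       i, times[i] - times[0]);
    }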

[Figure: Overview of the determination of data dependences.]
Data Dependence Analysis of Assembly Code

January 2000 · 414 Reads

Determination of data dependences is a task typically performed with high-level language source code in today's optimizing and parallelizing compilers. Very little work has been done in the field of data dependence analysis on assembly language code, but this area will be of growing importance, e.g., for increasing instruction-level parallelism. A central element of a data dependence analysis in this case is a method for memory reference disambiguation which decides whether two memory operations may access (or definitely access) the same memory location. In this paper we describe a new approach for the determination of data dependences in assembly code. Our method is based on a sophisticated algorithm for symbolic value propagation, and it can derive value-based dependences between memory operations instead of just address-based dependences. We have integrated our method into the Salto system for assembly language optimization. Experimental results show that our approach greatly improves the precision of the dependence analysis in many cases.

Attempting guards in parallel: A data flow approach to execute generalized guarded commands

January 1992 · 6 Reads

Earlier approaches to execute generalized alternative/repetitive commands of Communicating Sequential Processes (CSP) attempt the selection of guards in a sequential order. Also, these implementations are based on either shared memory or message passing multiprocessor systems. In contrast, we propose an implementation of generalized guarded commands using the data-driven model of computation. A significant feature of our implementation is that it attempts the selection of the guards of a process in parallel. We prove that our implementation is faithful to the semantics of the generalized guarded commands. Further, we have simulated the implementation using discrete-event simulation and measured various performance parameters. The measured parameters are helpful in establishing the fairness of our implementation and its superiority, in terms of efficiency and the parallelism exploited, over other implementations. The simulation study is also helpful in identifying various issues that affect the performance of our implementation. Based on this study, we have proposed an adaptive algorithm which dynamically tunes the extent of parallelism in the implementation to achieve an optimum level of performance.

Specification of data restructuring software based on the attribute method

January 1984 · 12 Reads

The idea that abstract software specification is an essential phase in developing large and complex software has been widely accepted. In this paper, we specify in an abstract, but precise, way software for restructuring data structures based on the flat-file and hierarchical data models. Our specification also covers the case in which a target data structure is constructed from many source data structures. In data restructuring, data structures are transformed. We propose the use of the attribute method for these kinds of translation-oriented specification situations in the database area. We apply the attribute method in the context of abstract syntax instead of a concrete one.

Semantic equivalence of covering attribute grammars

December 1979 · 6 Reads

This paper investigates some methods for proving the equivalence of different language specifications that are given in terms of attribute grammars. Different specifications of the same language may be used for different purposes, such as language definition, program verification, or language implementation. The concept of syntactic coverings is extended to the semantic part of attribute grammars. Given two attribute grammars, the paper discusses several propositions that give sufficient conditions for one attribute grammar to be semantically covered by the other one. These tools are used for a comparison of two attribute grammars that specify syntax and semantics of mixed-type expressions. This example shows a trade-off between the complexity of syntactic and semantic specifications. Another example discussed is the equivalence of different attribute grammars for the translation of the while-statement, as used in compilers for top-down and bottom-up syntax analysis.

Syntactic monoids in the construction of systolic tree automata

February 1985 · 12 Reads

The acceptance of regular languages by systolic tree automata is analyzed in more detail by investigating the structure of the individual processors needed. Since the processors essentially compute the monoid product, our investigation leads to questions concerning syntactic monoids. Certain properties of monoids turn out to be important here. These properties, as well as the language families induced by them, are also studied in the paper.

A Compile/Run-time Environment for the Automatic Transformation of Linked List Data Structures

December 2008 · 23 Reads

Irregular access patterns are a major problem for today’s optimizing compilers. In this paper, a novel approach will be presented that enables transformations that were designed for regular loop structures to be applied to linked list data structures. This is achieved by linearizing access to a linked list, after which further data restructuring can be performed. Two subsequent optimization paths will be considered: annihilation and sublimation, which are driven by the occurring regular and irregular access patterns in the applications. These intermediate codes are amenable to traditional compiler optimizations targeting regular loops. In the case of sublimation, a run-time step is involved which takes the access pattern into account and thus generates a data instance specific optimized code. Both approaches are applied to a sparse matrix multiplication algorithm and an iterative solver: preconditioned conjugate gradient. The resulting transformed code is evaluated using the major compilers for the x86 platform, GCC and the Intel C compiler.
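The linearization step that enables the rest of the pipeline can be shown in miniature: copy a linked list's payloads into a contiguous array, so that later traversals become regular loops amenable to classical transformations and hardware prefetching. A C sketch with an assumed integer payload:

    #include <stdlib.h>

    struct node { int val; struct node *next; };

    /* copies the list's payloads into a malloc'd array; length via *n */
    int *linearize(const struct node *head, size_t *n) {
        size_t len = 0;
        for (const struct node *p = head; p; p = p->next) len++;
        int *arr = malloc(len * sizeof *arr);
        if (!arr) { *n = 0; return NULL; }
        size_t i = 0;
        for (const struct node *p = head; p; p = p->next) arr[i++] = p->val;
        *n = len;
        return arr;   /* traversals are now regular loops over arr[0..len) */
    }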

Automatic Application-Specific Instruction-Set Extensions Under Microarchitectural Constraints

December 2003 · 193 Reads

This paper presents a methodology for automatically designing Instruction-Set Extensions in embedded processors. Many commercially available CPUs now offer the possibility of extending their instruction set for a specific application. Their tool chains typically support manual experimentation, but algorithms that can define the set of customised functional units most beneficial for a given application are missing. Only a few algorithms exist, but they are severely limited in the type and size of the operation clusters they can choose, and hence significantly reduce the effectiveness of specialisation. A more general algorithm is presented here which selects maximal-speedup convex subgraphs of the application dataflow graph under fundamental microarchitectural constraints, and which improves significantly on the state of the art.

Bandwidth Efficient All-to-All Broadcast on Switched Clusters

August 2008 · 122 Reads

Clusters of workstations employ flexible topologies: regular, irregular, and hierarchical topologies have been used in such systems. The flexibility poses challenges for developing efficient collective communication algorithms since the network topology can potentially have a strong impact on the communication performance. In this paper, we consider the all-to-all broadcast operation on clusters with cut-through and store-and-forward switches. We show that near-optimal all-to-all broadcast on a cluster with any topology can be achieved by only using the links in a spanning tree of the topology when the message size is sufficiently large. The result implies that increasing network connectivity beyond the minimum tree connectivity does not improve the performance of the all-to-all broadcast operation when the most efficient topology specific algorithm is used. All-to-all broadcast algorithms that achieve near-optimal performance are developed for clusters with cut-through and clusters with store-and-forward switches. We evaluate the algorithms through experiments and simulations. The empirical results confirm our theoretical finding.
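A topology-oblivious baseline helps frame the result: in the classic logical-ring all-to-all broadcast (allgather), each rank forwards one block per step for p−1 steps. A hedged MPI/C sketch of this generic algorithm, not the paper's topology-specific one:

    #include <mpi.h>

    /* buf holds 'size' blocks of blklen bytes each; the caller has already
       filled in this rank's own block at offset rank * blklen */
    void ring_allgather(char *buf, int blklen, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        int right = (rank + 1) % size, left = (rank + size - 1) % size;
        for (int s = 0; s < size - 1; s++) {
            int sendblk = (rank - s + size) % size;      /* forwarded block */
            int recvblk = (rank - s - 1 + size) % size;  /* arriving block  */
            MPI_Sendrecv(buf + sendblk * blklen, blklen, MPI_BYTE, right, 0,
                         buf + recvblk * blklen, blklen, MPI_BYTE, left, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }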

The butterfly barrier

August 1986 · 202 Reads

We describe an algorithm for barrier synchronization that requires only reads and writes to shared store. The algorithm is faster than the traditional locked-counter approach for two processors and has an attractive log₂ N time scaling for larger N. The algorithm is free of hot spots and critical regions and requires a shared memory bandwidth which grows linearly with N, the number of participating processors. We verify the technique using both a real shared-memory multiprocessor, for up to 30 processors, and a shared-memory multiprocessor simulator, for up to 256 processors.
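The structure is a butterfly: in round r, thread i pairwise-synchronizes with thread i XOR 2^r, so log₂ N rounds suffice. The C11 sketch below is an illustrative reconstruction, not the paper's code; it adds the usual parity/sense-reversal device so flags can be reused across barrier episodes, and the shared flags are touched only by plain loads and stores (no read-modify-write).

    #include <stdatomic.h>
    #include <stdbool.h>

    #define NT 8        /* participating threads, a power of two (assumed) */
    #define ROUNDS 3    /* log2(NT) */

    static atomic_bool flag[2][ROUNDS][NT];    /* two banks, alternated */
    static _Thread_local bool sense = true;    /* per-thread phase */
    static _Thread_local int parity = 0;

    void butterfly_barrier(int me) {           /* me in [0, NT) */
        for (int r = 0; r < ROUNDS; r++) {
            int partner = me ^ (1 << r);       /* butterfly pairing */
            atomic_store(&flag[parity][r][me], sense);   /* announce */
            while (atomic_load(&flag[parity][r][partner]) != sense)
                ;                              /* spin on partner's flag */
        }
        if (parity == 1) sense = !sense;       /* flip every two episodes */
        parity ^= 1;
    }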

Simple-English for data base communication

December 1977 · 8 Reads

Three classes of the so-called natural languages for communication with data bases are defined: English-like, pseudo-English, and simple-English. It is argued that English-like and pseudo-English languages are normally more difficult to learn and use than artificial programming languages with no overt claim to English likeness. Simple-English is presented as a family of languages in which many restrictions (which hamper learning) are removed through interaction with, and drawing inferences from, the data base and the underlying system. It is concluded, however, that English likeness and ease of learning may be contradictory notions.

Data base organization and retrieval techniques for steam turbine engineering

December 1973 · 3 Reads

The data base organization and information retrieval techniques described in this paper are used by the Westinghouse Electric Corporation's Large Turbine Division at Lester, Pennsylvania to store and retrieve engineering and descriptive information pertaining to the entire life of a steam turbine, from the original design of the unit through the latest overhaul. The large volume of data generated during the design, manufacture and subsequent operation of a steam turbine requires the use of a computer for effective data management. Manual record keeping has become increasingly inadequate to cope with the task of relating this ever-growing data to sound management and engineering decisions. An operational computer-oriented data bank system has been installed to fulfill the following data storage and retrieval needs: (1) quicker and more comprehensive retrieval of failure data to assist in service work; (2) organized and efficient feedback of failure information to the Engineering Department; (3) an integrated data reference source for turbine history; (4) greater insight into reliability problems to provide statistics for design guidance and manpower allocation; (5) greater insight into proper turbine operating techniques and maintenance schedules. The system is built upon a master Generic Profile data base; all other files and reports are derived from it. Indexed sequential data sets are used throughout, providing complete random access to any information in the data base. A substantial number of types of retrieval are available, and within each type the user is able to request any combination of data he may require. In some instances retrieval requests of the same type are processed simultaneously, thereby reducing file access and overall execution time. Topics covered in the paper include data base content, data base organization and access methods, types of retrieval available, and retrieval techniques. The system employs an IBM 360 Model 50 computer operating under MVT and OS. The largest program requires approximately 150 K of core storage for execution. All programs are written in ANS COBOL.

Entity-relationship diagrams which are in BCNF

January 1983 · 181 Reads

In Ref. 8, we introduced a simplifying assumption about entity-relationship diagrams (ERDs), called regularity, and showed that regular ERDs have several desirable properties. One such property is that every relation schema in the ERD's canonical relational scheme can be put into Third Normal Form. We left open there the more basic question: under what conditions would the original relation schemas actually be in Boyce-Codd Normal Form (BCNF)? Since the visible semantics of ERDs determine naturally their associated functional dependencies (fd's), it is important to know when an ERD, as designed, already has this strongest normal form given purely in terms of fd's. We show here a sufficient diagrammatic condition (loop-free) under which a regular ERD will have databases enjoying the benefits of BCNF.

[Figures: Conformance-checking the refinement of an even-parity checker; refinement of locks with a double handshake protocol.]
A Compositional Behavioral Modeling Framework for Embedded System Design and Conformance Checking

December 2005 · 90 Reads

We propose a framework based on a synchronous multi-clocked model of computation to support the inductive and compositional construction of scalable behavioral models of embedded systems engineered with de facto standard design and programming languages. Behavioral modeling is seen under the paradigm of type inference. The aim of the proposed type system is to capture the behavior of a system under design and to re-factor it by performing global optimizing and architecture-sensitive transformations on it. It makes it possible to modularly express a wide spectrum of static and dynamic behavioral properties and to automatically or manually scale the desired degree of abstraction of these properties for efficient verification. The type system is presented using a generic and language-independent static single assignment intermediate representation.

[Figures: NBTI microarchitectural assessment framework; NBTI-induced path delay across ALU components, and across components and technology with RAS.]
New-Age: A Negative Bias Temperature Instability-Estimation Framework for Microarchitectural Components

August 2009 · 61 Reads

Degradation of device parameters over the lifetime of a system is emerging as a significant threat to system reliability. Among the aging mechanisms, wearout resulting from Negative Bias Temperature Instability (NBTI) is of particular concern in deep submicron technology generations. While there has been significant effort at the device and circuit level to model and characterize the impact of NBTI, the analysis of NBTI's impact at the architectural level is still in its infancy. To facilitate architectural-level aging analysis, a tool has been developed that evaluates timing degradation due to NBTI early in the design cycle. The tool includes workload-based temperature and performance degradation analysis across a variety of technologies and operating conditions, revealing a complex interplay between the factors influencing NBTI timing degradation.

Binary input, output, and manipulation extensions of conversational programming with some biological applications

January 1976 · 2 Reads

This paper presents details of how the conversational programming system may be extended to allow input, output, and manipulation in binary and how these extensions may be used in various applications.

Binding environments for parallel logic programs in non-shared memory multiprocessors

April 1988 · 6 Reads

A method known as closed environments can be used to represent variable bindings for OR-parallel logic programs without relying on a shared memory or common address space. The representation is based on a procedure that transforms stack frames after unification, taking into account problems with common unbound ancestors and shared instances of complex terms. Closed environments were developed for the AND/OR Process Model, but may be applicable to other OR-parallel models.

On data retrieval from unambiguous bit matrices

January 1975 · 13 Reads

Algorithms to check whether a bit matrix is unambiguous or a sum set is unique are given. Let an unambiguous bit matrix Z be represented by its row sums and column sums. An efficient algorithm is developed to reconstruct only those rows of Z satisfying the conditions specified by a given data retrieval descriptor. This algorithm illustrates that using unambiguous bit matrices as data files is desirable not only for the purpose of data compression but also for the purpose of fast data retrieval.

On the efficient implementation of retention block-structured languages

February 1981 · 13 Reads

This paper describes the deletion-retention contour machine (DRCM), an efficient implementation of a retention block-structured language. It allows programs to be handled by the deletion strategy until some forward reference is generated during execution. The retention strategy is then adopted, at which point a time- and space-efficient garbage compaction algorithm recovers the inaccessible cells. Moreover, the garbage collector can, on discovering the absence of accessible forward references, restore the deletion strategy. An estimate of the computation time of a lifetime well-stacking (LWS) program on the DRCM is obtained, which shows that an LWS program runs on the DRCM in almost the same time as on a stack machine with lifetime checks to prevent dangling references. Such a property also holds for LWS programs with full-label and nonlocal gotos.
