About
85 Publications
16,736 Reads
1,863 Citations
Additional affiliations
January 1995 - December 1999
Publications (85)
A method for rolling back speculative threads in symmetric-multiprocessing (SMP) environments is disclosed. In one embodiment, such a method includes detecting an aborted thread at runtime and determining whether the aborted thread is an oldest aborted thread. In the event the aborted thread is the oldest aborted thread, the method sets a high-prio...
Many studies point to the difficulty of scaling existing computer architectures to meet the needs of an exascale system (i.e., capable of executing $10^{18}$ floating-point operations per second), consuming no more than 20 MW in power, by around the year 2020. This paper outlines a new architecture, the Active Memory Cube, which reduces the energy...
A method, system and computer program product for implementing load-reserve and store-conditional instructions in a multi-processor computing system. The computing system includes a multitude of processor units and a shared memory cache, and each of the processor units has access to the memory cache. In one embodiment, the method comprises providin...
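For context, a minimal sketch of the retry pattern that load-reserve and store-conditional instructions make possible, written here with C++11 atomics rather than the instructions themselves; on PowerPC such a loop typically compiles down to a lwarx/stwcx. pair. The function name and memory orderings are illustrative, not taken from the patent.

    #include <atomic>

    // Sketch only: compare_exchange_weak may fail spuriously, much like a
    // store-conditional whose reservation has been lost, so the update is retried.
    int fetch_and_add(std::atomic<int>& counter, int delta) {
        int observed = counter.load(std::memory_order_relaxed);
        while (!counter.compare_exchange_weak(observed, observed + delta,
                                              std::memory_order_acq_rel,
                                              std::memory_order_relaxed)) {
            // 'observed' now holds the freshly reloaded value; loop and retry.
        }
        return observed;  // value seen immediately before the successful update
    }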
This paper describes an end-to-end system implementation of a transactional memory (TM) programming model on top of the hardware transactional memory (HTM) of the Blue Gene/Q machine. The TM programming model supports most C/C++ programming constructs using a best-effort HTM and the help of a complete software stack including the compiler, the kern...
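For context, a hedged sketch of how a conflicting update might be written under such a TM programming model. The pragma name tm_atomic follows my recollection of the IBM XL syntax for Blue Gene/Q and should be treated as an assumption; the function and data layout are invented for illustration.

    // Sketch only: the runtime is assumed to retry a hardware transaction that
    // aborts on a conflict and to fall back to a software path if retries keep
    // failing; the pragma name is an assumption, not verified syntax.
    void accumulate(double* histogram, int bin, double weight) {
        #pragma tm_atomic
        {
            histogram[bin] += weight;  // conflicting updates to the same bin are detected by the HTM
        }
    }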
A system, method and computer program product for performing various store-operate instructions in a parallel computing environment that includes a plurality of processors and at least one cache memory device. A queue in the system receives, from a processor, a store-operate instruction that specifies under which condition a cache coherence operati...
In a multiprocessor system, with conflict checking implemented in a directory lookup of a shared cache memory, a reader set encoding permits dynamic recordation of read accesses. The reader set encoding includes an indication of a portion of a line read, for instance by indicating boundaries of read accesses. Different encodings may apply to differ...
A system and method for improving the yield rate of a multiprocessor semiconductor chip that includes primary processor cores and one or more redundant processor cores. A first tester conducts a first test on one or more processor cores, and encodes results of the first test in an on-chip non-volatile memory. A second tester conducts a second test...
A system and method for accessing memory are provided. The system comprises a lookup buffer for storing one or more page table entries, wherein each of the one or more page table entries comprises at least a virtual page number and a physical page number; a logic circuit for receiving a virtual address from said processor, said logic circuit for ma...
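For context, a minimal software model of the lookup this abstract describes, assuming 4 KiB pages and a small direct-mapped buffer; the sizes, field names, and miss behavior are illustrative assumptions rather than details from the patent.

    #include <cstdint>
    #include <optional>

    // Sketch only: each entry pairs a virtual page number with a physical page
    // number; a hit concatenates the physical page number with the page offset.
    struct PageTableEntry { uint64_t virtual_page = 0; uint64_t physical_page = 0; bool valid = false; };

    struct LookupBuffer {
        static constexpr int kEntries = 64;       // assumed buffer size
        static constexpr int kPageShift = 12;     // assumed 4 KiB pages
        PageTableEntry entries[kEntries];

        std::optional<uint64_t> translate(uint64_t virtual_address) const {
            uint64_t vpn = virtual_address >> kPageShift;
            const PageTableEntry& e = entries[vpn % kEntries];
            if (e.valid && e.virtual_page == vpn) {
                uint64_t offset = virtual_address & ((uint64_t{1} << kPageShift) - 1);
                return (e.physical_page << kPageShift) | offset;
            }
            return std::nullopt;  // miss: the full page table would be walked instead
        }
    };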
In a multiprocessor system with at least two levels of cache, a speculative thread may run on a core processor in parallel with other threads. When the thread seeks to do a write to main memory, this access is to be written through the first level cache to the second level cache. After the write-through, the corresponding line is deleted from the fi...
Implementation primitives for concurrent array-based stacks, queues, double-ended queues (deques) and wrapped deques are provided. In one aspect, each element of the stack, queue, deque or wrapped deque data structure has its own ticket lock, allowing multiple threads to concurrently use multiple elements of the data structure and thus achieving hi...
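For context, a minimal ticket-lock sketch of the kind that could guard one element of such a structure, so threads working on different elements never contend with each other; member names and memory orderings are illustrative.

    #include <atomic>

    // Sketch only: tickets are handed out in order and holders are served in
    // the same order, giving a fair per-element lock.
    struct TicketLock {
        std::atomic<unsigned> next_ticket{0};
        std::atomic<unsigned> now_serving{0};

        void lock() {
            unsigned my_ticket = next_ticket.fetch_add(1, std::memory_order_relaxed);
            while (now_serving.load(std::memory_order_acquire) != my_ticket) {
                // spin until this ticket is served
            }
        }
        void unlock() {
            now_serving.fetch_add(1, std::memory_order_release);
        }
    };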
A multiprocessor system includes nodes. Each node includes a data path that includes a core, a TLB, and a first level cache implementing disambiguation. The system also includes at least one second level cache and a main memory. For thread memory access requests, the core uses an address associated with an instruction format of the core. The first...
In a multiprocessor system, a central memory synchronization module coordinates memory synchronization requests responsive to memory access requests in flight, a generation counter, and a reclaim pointer. The central module communicates via point-to-point communication. The module includes a global OR reduce tree for each memory access requesting d...
A list prefetch engine improves a performance of a parallel computing system. The list prefetch engine receives a current cache miss address. The list prefetch engine evaluates whether the current cache miss address is valid. If the current cache miss address is valid, the list prefetch engine compares the current cache miss address and a list addr...
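For context, a hedged software model of the matching step such a list prefetch engine performs: the current miss address is compared against a window of a previously recorded miss-address list, and on a match the next few list entries are prefetched. The window size, prefetch depth, and names are illustrative assumptions.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Sketch only: 'recorded_misses' stands in for the list captured on an
    // earlier pass; 'issue_prefetch' stands in for the hardware prefetch request.
    struct ListPrefetcher {
        std::vector<uint64_t> recorded_misses;
        size_t cursor = 0;

        void on_cache_miss(uint64_t miss_address) {
            const size_t kWindow = 8, kDepth = 4;   // assumed tuning parameters
            size_t end = std::min(cursor + kWindow, recorded_misses.size());
            for (size_t i = cursor; i < end; ++i) {
                if (recorded_misses[i] == miss_address) {
                    cursor = i + 1;
                    for (size_t d = 0; d < kDepth && cursor + d < recorded_misses.size(); ++d)
                        issue_prefetch(recorded_misses[cursor + d]);
                    return;
                }
            }
            // No match: the access pattern has diverged from the recorded list.
        }

        void issue_prefetch(uint64_t /*address*/) { /* placeholder for the real request */ }
    };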
A system, method and computer program product for supporting system initiated checkpoints in high performance parallel computing systems and storing of checkpoint data to a non-volatile memory storage device. The system and method generates selective control signals to perform checkpointing of system related data in presence of messaging activity a...
An apparatus and method for tracking coherence event signals transmitted in a multiprocessor system. The apparatus comprises a coherence logic unit, each unit having a plurality of queue structures with each queue structure associated with a respective sender of event signals transmitted in the system. A timing circuit associated with a queue struc...
In a parallel processing system with speculative execution, conflict checking occurs in a directory lookup of a cache memory that is shared by all processors. In each case, the same physical memory address will map to the same set of that cache, no matter which processor originated that access. The directory includes a dynamic reader set encoding,...
A method and apparatus for supporting cache coherency in a multiprocessor computing environment having multiple processing units, each processing unit having one or more local cache memories associated and operatively connected therewith. The method comprises providing a snoop filter device associated with each processing unit, each snoop filter de...
An apparatus and computer program product for improving performance of a parallel computing system. A first hardware local cache controller associated with a first local cache memory device of a first processor detects an occurrence of a false sharing of a first cache line by a second processor running the program code and allows the false sharing...
A multiprocessor system supports multiple concurrent modes of speculative execution. Speculation identification numbers (IDs) are allocated to speculative threads from a pool of available numbers. The pool is divided into domains, with each domain being assigned to a mode of speculation. Modes of speculation include TM, TLS, and rollback. Allocatio...
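For context, a hedged sketch of allocating speculation IDs from a pool partitioned into per-mode domains; the pool size, the equal split, and the interface are assumptions, not details from the patent.

    #include <bitset>
    #include <optional>

    // Sketch only: each mode (TM, TLS, rollback) draws IDs from its own domain,
    // so IDs belonging to different modes never collide.
    enum class Mode { TM = 0, TLS = 1, Rollback = 2 };

    class SpeculationIdPool {
        static constexpr int kIds = 128;                   // assumed pool size
        static constexpr int kIdsPerDomain = kIds / 3;     // assumed equal split
        std::bitset<kIds> in_use;

    public:
        std::optional<int> allocate(Mode mode) {
            int base = static_cast<int>(mode) * kIdsPerDomain;
            for (int id = base; id < base + kIdsPerDomain; ++id)
                if (!in_use[id]) { in_use[id] = true; return id; }
            return std::nullopt;  // this domain is exhausted
        }
        void release(int id) { in_use[id] = false; }
    };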
Fixing a problem is usually greatly aided if the problem is reproducible. To ensure reproducibility of a multiprocessor system, the following aspects are proposed: a deterministic system start state, a single system clock, phase alignment of clocks in the system, system-wide synchronization events, reproducible execution of system components, deter...
In a cache memory, energy and other efficiencies can be realized by saving a result of a cache directory lookup for sequential accesses to a same memory address. Where the cache is a point of coherence for speculative execution in a multiprocessor system, with directory lookups serving as the point of conflict detection, such saving becomes particu...
Mechanisms for generating checkpoints in a speculative versioning cache of a data processing system are provided. The mechanisms execute code within the data processing system, wherein the code accesses cache lines in the speculative versioning cache. The mechanisms further determine whether a first condition occurs indicating a need to generate a...
A parallel computing system processes at least one store instruction. A first processor core issues a store instruction. A first queue, associated with the first processor core, stores the store instruction. A second queue, associated with a first local cache memory device of the first processor core, stores the store instruction. The first process...
A method and system are disclosed for providing combined error code protection and subgroup parity protection for a given group of n bits. The method comprises the steps of identifying a number, m, of redundant bits for said error protection; and constructing a matrix P, wherein multiplying said given group of n bits with P produces m redundant err...
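For context, a small worked sketch of the encoding direction: over GF(2), multiplying the n data bits by a matrix P amounts to computing, for each of the m redundant bits, the parity of the data bits selected by one row of P. The 8-bit width and the particular row masks below are illustrative, not the matrix constructed in the publication.

    #include <array>
    #include <cstdint>

    // Sketch only: each row mask selects the data bits whose XOR (parity) forms
    // one redundant bit; __builtin_parity is the GCC/Clang parity builtin.
    constexpr int kRedundantBits = 4;                            // assumed m
    constexpr std::array<uint8_t, kRedundantBits> kRowMask = {   // assumed rows of P
        0b11010101, 0b01101101, 0b00011110, 0b11110000};

    uint8_t encode_redundant_bits(uint8_t data) {                // assumed n = 8
        uint8_t redundant = 0;
        for (int r = 0; r < kRedundantBits; ++r)
            redundant |= static_cast<uint8_t>(__builtin_parity(data & kRowMask[r]) << r);
        return redundant;
    }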
A computing apparatus for reducing the amount of processing in a network computing system which includes a network system device of a receiving node for receiving electronic messages comprising data. The electronic messages are transmitted from a sending node. The network system device determines when more data of a specific electronic message is b...
A system and method for enhancing performance of a computer which includes a computer system including a data storage device. The computer system includes a program stored in the data storage device and steps of the program are executed by a processor. The processor processes instructions from the program. A wait state in the processor waits for re...
A circuit generates a global clock signal with a pulse width modification to synchronize processors in a parallel computing system. The circuit may include a hardware module and a clock splitter. The hardware module may generate a clock signal and performs a pulse width modification on the clock signal. The pulse width modification changes a pulse...
Mechanisms are provided for controlling version pressure on a speculative versioning cache. Raw version pressure data is collected based on one or more threads accessing cache lines of the speculative versioning cache. One or more statistical measures of version pressure are generated based on the collected raw version pressure data. A determinatio...
A system, method, and computer program product for reducing the latency of signals communicated through a crossbar switch, the method including using, at slave arbitration logic devices associated with Slave devices for which access is requested from one or more Master devices, two or more priority vector signals cycled among their use every clock...
The memory subsystem of the IBM Blue Gene®/Q Compute chip features multi-versioning and access conflict detection. Its ordered and unordered transaction modes implement both speculative execution (SE) and transactional memory (TM). Blue Gene/Q's large shared second-level cache serves as storage for speculative versions, allowing up to 30 MB of spec...
The heart of a Blue Gene®/Q system is the Blue Gene/Q Compute (BQC) chip, which combines processors, memory, and communication functions on a single chip. The Blue Gene/Q Compute chip has 16 + 1 + 1 processor cores, each with a quad single-instruction, multiple-data (SIMD) floating-point unit, and a multi-versioned Level 2 cache that provides hardw...
A stream prefetch engine performs data retrieval in a parallel computing system. The engine receives a load request from at least one processor. The engine evaluates whether a first memory address requested in the load request is present and valid in a table. The engine checks whether there exists valid data corresponding to the first memory addres...
This paper describes an end-to-end system implementation of the transactional memory (TM) programming model on top of the hardware transactional memory (HTM) of the Blue Gene/Q (BG/Q) machine. The TM programming model supports most C/C++ programming constructs on top of a best-effort HTM with the help of a complete software stack including the comp...
A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists t...
Blue Gene/Q aims to build a massively parallel high-performance computing system out of power-efficient processor chips, resulting in power-efficient, cost-efficient, and floor-space-efficient systems. Focusing on reliability during design helps with scaling to large systems and lowers the total cost of ownership. This article examines the architec...
A control logic device performs a local rollback in a parallel supercomputing system. The supercomputing system includes at least one cache memory device. The control logic device determines a local rollback interval. The control logic device runs at least one instruction in the local rollback interval. The control logic device evaluates whether...
A method for passing remote messages in a parallel computer system formed as a network of interconnected compute nodes includes that a first compute node (A) sends a single remote message to a remote second compute node (B) in order to control the remote second compute node (B) to send at least one remote message. The method includes various steps...
A memory system and method for providing atomic memory-based counter operations to operating systems and applications that make most efficient use of counter-backing memory and virtual and physical address space, while simplifying operating system memory management, and enabling the counter-backing memory to be used for purposes other than counter-...
A programmable memory system and method for enabling one or more processor devices access to shared memory in a computing environment, the shared memory including one or more memory storage structures having addressable locations for storing data. The system comprises: one or more first logic devices associated with a respective one or more process...
A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly shar...
A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists t...
On June 26, 2007, IBM announced the Blue Gene/P™ system as the leading offering in its massively parallel Blue Gene® supercomputer line, succeeding the Blue Gene/L™ system. The Blue Gene/P system is designed to scale to at least 262,144 quad-processor nodes, with a peak performance of 3.56 petaflops. More significantly, the Blue Gene/P syst...
The Blue Gene/L system at the Department of Energy Lawrence Livermore National Laboratory in Livermore, California is the world’s most powerful supercomputer. It has achieved groundbreaking performance in both standard benchmarks and real scientific applications. In that process, it has enabled new science that simply could not be done befor...
A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly shar...
Optimizing supercomputer performance requires a balance between objectives for processor performance, network performance, power delivery and cooling, cost and reliability. In particular, scaling a system to a large number of processors poses challenges for reliability, availability and serviceability. Given the power and thermal constraints of dat...
The BlueGene/L supercomputer has been designed with a focus on power/performance efficiency to achieve high application performance under the thermal constraints of common data centers. To achieve this goal, emphasis was put on system solutions to engineer a power-efficient system. To exploit thread level parallelism, the BlueGene/L system can scal...
As one of the most highly integrated system-on-a-chip application-specific integrated circuits (ASICs) to date, the Blue Gene®/L compute chip presented unique challenges that required extensions of the standard ASIC synthesis, timing, and physical design methodologies. We describe the design flow from floorplanning through synthesis and timing clos...
The Blue Gene®/L computer is a massively parallel supercomputer based on IBM system-on-a-chip technology. It is designed to scale to 65,536 dual-processor nodes, with a peak performance of 360 teraflops. This paper describes the project objectives and provides an overview of the system architecture that resulted. We discuss our application-based ap...
The Blue Gene®/L compute chip is a dual-processor system-on-a-chip capable of delivering an arithmetic peak performance of 5.6 gigaflops. To match the memory speed to the high compute performance, the system implements an aggressive three-level on-chip cache hierarchy. The implemented hierarchy offers high bandwidth and integrated prefetching on ca...
An overview of the design aspects of the BlueGene/L chip, the heart of the BlueGene/L supercomputer, is presented. Following an SoC approach, processors, memory and communication subsystems are integrated into one low-power chip. The high-density system packaging of the BlueGene/L system provides better power and cost performance.
The Blue Gene®/L compute chip contains two PowerPC® 440 processor cores, private L2 prefetch caches, a shared L3 cache and double-data-rate synchronous dynamic random access memory (DDR SDRAM) memory controller, a collective network interface, a torus network interface, a physical network interface, an interrupt controller, and a bridge interface t...
The Blue Gene®/L compute (BLC) and Blue Gene/L link (BLL) chips have extensive facilities for control, bring-up, self-test, debug, and nonintrusive performance monitoring built on a serial interface compliant with IEEE Standard 1149.1. Both the BLL and the BLC chips contain a standard eServer™ chip JTAG controller called the access macro. For BLC,...
This paper describes the Blue Gene®/L advanced diagnostics environment (ADE) used throughout all aspects of the Blue Gene/L project, including design, logic verification, bring-up, diagnostics, and manufacturing test. The Blue Gene/L ADE consists of a lightweight multithreaded coherence-managed kernel, runtime libraries, device drivers, system prog...
BlueGene/L is a supercomputer consisting of 64K dual-processor system-on-a-chip compute nodes, capable of delivering an arithmetic peak performance of 5.6 Gflops per node. To match the memory speed to the high compute performance, the system implements an aggressive three-level on-chip cache hierarchy for each node. The implemented hierarchy offers...
This paper gives an overview of the BlueGene/L Supercomputer. This is a jointly funded research partnership between IBM and the Lawrence Livermore National Laboratory as part of the United States Department of Energy ASCI Advanced Architecture Research Program. Application performance and scaling studies have recently been initiated with partners a...
System-on-a-chip technology allows a level of integration that can be leveraged to develop inexpensive high-performance, low-power computing nodes. When used in aggregate, this approach promises to challenge conventional supercomputer architectures in the high-performance computing arena. Systems under consideration reach into the hundreds of thous...
Summary form only given. Large powerful networks coupled to state-of-the-art processors have traditionally dominated supercomputing. As technology advances, this approach is likely to be challenged by a more cost-effective System-On-A-Chip approach, with higher levels of system integration. The scalability of applications to architectures with tens...
This paper gives an overview of the BlueGene/L Supercomputer. This is a jointly funded research partnership between IBM and the Lawrence Livermore National Laboratory as part of the United States Department of Energy ASCI Advanced Architecture Research Program. Application performance and scaling studies have recently been initiated with partners a...
Large powerful networks coupled to state-of-the-art processors have traditionally dominated supercomputing. As technology advances, this approach is likely to be challenged by a more cost-effective System-On-A-Chip approach, with higher levels of system integration. The scalability of applications to architectures with tens to hundreds of thousands...
System-on-a-chip technology allows a level of integration that can be leveraged to develop inexpensive high-performance, low-power computing nodes. When used in aggregate, this approach promises to challenge conventional supercomputer architectures in the high-performance computing arena. Systems under consideration reach into the hundreds of thous...
A programmable digital signal processor (DSP) for real-time image processing is presented that combines the concepts of single-instruction multiple-data (SIMD) and very long instruction word with a high utilization of parallel resources on instruction and data level. The SIMD approach has been extended with autonomous instruction selection capabili...
In this paper, a programmable DSP for real time image processing is presented that combines the concepts of VLIW and SIMD with a high utilization of parallel resources on instruction level and data level. The SIMD approach has been extended with autonomous instruction selection capabilities (ASIMD), which offers to control four parallel datapaths w...
A fully scalable, cellular multiprocessor architecture is proposed that is able to dynamically adapt its processing resources to varying demands of signal processing applications. This ability is achieved by migration of tasks between processor cells at run time so as to avoid cell overload. Several dynamic migration strategies are investigated,...
Architecture and design of the HiPAR-DSP, a SIMD controlled signal processor with parallel data paths, VLIW, and a novel memory design, are presented. The processor architecture is derived from an analysis of the target algorithms and specified in VHDL on register transfer level. A team of more than 20 graduate students covered the whole design process, including th...
Architecture and design of the HiPAR-DSP, a SIMD controlled signal processor with parallel data paths, VLIW and novel memory design is presented. The processor architecture is derived from an analysis of the target algorithms and specified in VHDL on register transfer level. A team of more than 20 graduate students covered the whole design process,...
Derived from a thorough analysis of a wide class of image processing algorithms' properties, a parallel RISC architecture has been developed. The architecture gains performance from data level parallelism as well as from instruction level parallelism. From the beginning of the concept phase, high-level programming capabilities have been one of the...
This paper describes essential differences between the video coding part of recent multimedia communication schemes (such as the MPEG-4 standard, currently under development) and conventional schemes (such as MPEG-1, MPEG-2, or ITU-T H.26x), caused by the ability to work on arbitrarily shaped objects rather than only on rigid square blocks of pels. Methods em...
The efficient implementation of algorithms with irregular data access or control-flow on a parallel SIMD processor requires specific architectural measures. This paper demonstrates the parallelization of medium-level algorithms on the HiPAR-DSP, a programmable RISC processor for real-time image processing with 4 or 16 parallel data paths. We show t...
A highly parallel single-chip image signal processor architecture has been derived by analysis of image processing algorithms. Available levels of parallelism and their associated demands on data access, control and complexity of operations were taken into account. The RISC-architecture, called “HiPAR-DSP”, consists of a control unit, 16 parallel A...
An apparatus and method for providing a data eye monitor. The data eye monitor apparatus utilizes an inverter/latch string circuit and a set of latches to save the data eye for providing an infinite persistent data eye. In operation, incoming read data signals are adjusted in the first stage individually and latched to provide the read data to the...
An apparatus and method for evaluating a state of an electronic or integrated circuit (IC), each IC including one or more processor elements for controlling operations of IC sub-units, and each the IC supporting multiple frequency clock domains. The method comprises: generating a synchronized set of enable signals in correspondence with one or more...
Method and apparatus of prefetching streams of varying prefetch depth dynamically changes the depth of prefetching so that the number of multiple streams as well as the hit rate of a single stream are optimized. The method and apparatus in one aspect monitor a plurality of load requests from a processing unit for data in a prefetch buffer, determin...
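For context, a hedged sketch of one way prefetch depth could be adapted per stream from observed prefetch-buffer hits and misses; the thresholds, bounds, and re-evaluation interval are illustrative assumptions rather than the method claimed here.

    // Sketch only: frequent misses suggest the stream is outrunning the
    // prefetcher (deepen it); a very high hit rate suggests depth can be
    // reduced, freeing buffer space for additional streams.
    struct StreamState {
        int depth = 2;            // lines prefetched ahead of the demand stream
        int hits = 0, misses = 0;
    };

    void on_stream_access(StreamState& s, bool hit_in_prefetch_buffer) {
        (hit_in_prefetch_buffer ? s.hits : s.misses)++;
        if (s.hits + s.misses >= 16) {                                  // assumed interval
            if (s.misses > s.hits && s.depth < 8) ++s.depth;
            else if (s.hits > 3 * s.misses && s.depth > 1) --s.depth;
            s.hits = s.misses = 0;
        }
    }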
A massively parallel supercomputer of petaOPS-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC) having up to four processing elements. The ASIC nodes are interconnected by multiple independent networks that optimally maximize the thr...
An apparatus and method for controlling power usage in a computer includes a plurality of computers communicating with a local control device, and a power source supplying power to the local control device and the computer. A plurality of sensors communicate with the computer for ascertaining power usage of the computer, and a system control device...
Also published as a doctoral dissertation, Universität Hannover, 2001.