
Thomas Sterling - Indiana University Bloomington
About
154 Publications
15,649 Reads
3,532 Citations
Publications (154)
Control parallelism and data parallelism are mostly reasoned about and optimized as separate functions. Because of this, workloads that are irregular, fine-grained, and dynamic, such as dynamic graph processing, become very hard to scale. An experimental research approach to computer architecture that synthesizes prior techniques of parallel computing along wi...
HPC is entering a point of singularity where previous technology trends (Moore’s Law etc.) are terminating and dramatic performance progress may depend on advances in computer architecture outside of the scope of conventional practices. This may extend to the opportunities potentially available through the context of non-von Neumann architectures....
The rise in computing hardware choices is driving a reevaluation of operating systems. The traditional role of an operating system controlling the execution of its own hardware is evolving toward a model whereby the controlling processor is distinct from the compute engines that are performing most of the computations. In this context, an operating...
Future Exascale architectures will likely make extensive use of computing accelerators such as Field Programmable Gate Arrays (FPGAs) given that these accelerators are very power efficient. Oftentimes, these FPGAs are located at the network interface card (NIC) and switch level in order to accelerate network operations, incorporate contention avoid...
We present DASHMM, a general library implementing multipole methods (including both Barnes-Hut and the Fast Multipole Method). DASHMM relies on dynamic adaptive runtime techniques provided by the HPX-5 system to parallelize the resulting multipole moment computation. The result is a library that is easy-to-use, extensible, scalable, efficient, and...
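To make the flavor of this computation concrete, the sketch below is a minimal serial Barnes-Hut-style force evaluation in Python. It illustrates the algorithmic idea only, under assumed names and parameters (Body, Node, theta, and so on); it does not use or reproduce the DASHMM or HPX-5 APIs.

    # Minimal serial Barnes-Hut-style sketch in 1D (illustration only; not the DASHMM API).
    # Distant groups of bodies are replaced by a single point mass at their center of mass
    # whenever the usual opening criterion size/distance < theta holds.
    from dataclasses import dataclass
    from typing import List, Optional

    G = 1.0  # gravitational constant in code units (assumed)

    @dataclass
    class Body:
        x: float
        mass: float

    @dataclass
    class Node:
        bodies: List[Body]
        xmin: float
        xmax: float
        mass: float
        com: float                      # center of mass
        left: Optional["Node"] = None
        right: Optional["Node"] = None

    def build_tree(bodies, xmin, xmax, leaf_size=4):
        """Recursively bisect the domain until each node holds at most leaf_size bodies."""
        mass = sum(b.mass for b in bodies)
        com = sum(b.mass * b.x for b in bodies) / mass if mass > 0 else 0.5 * (xmin + xmax)
        node = Node(bodies, xmin, xmax, mass, com)
        if len(bodies) > leaf_size:
            mid = 0.5 * (xmin + xmax)
            node.left = build_tree([b for b in bodies if b.x < mid], xmin, mid, leaf_size)
            node.right = build_tree([b for b in bodies if b.x >= mid], mid, xmax, leaf_size)
        return node

    def accel(node, b, theta=0.5, eps=1e-3):
        """Acceleration on body b from the bodies in node, approximating distant nodes."""
        if node is None or node.mass == 0.0:
            return 0.0
        if node.left is None:
            # Leaf: direct sum over individual bodies, skipping b itself.
            a = 0.0
            for other in node.bodies:
                if other is not b:
                    dx = other.x - b.x
                    a += G * other.mass * dx / (abs(dx) ** 3 + eps)
            return a
        size = node.xmax - node.xmin
        dist = abs(node.com - b.x)
        if dist > 0.0 and size / dist < theta:
            # Well separated: treat the whole node as one point mass at its center of mass.
            dx = node.com - b.x
            return G * node.mass * dx / (abs(dx) ** 3 + eps)
        return accel(node.left, b, theta, eps) + accel(node.right, b, theta, eps)

    bodies = [Body(x=0.1 * i, mass=1.0) for i in range(100)]
    root = build_tree(bodies, 0.0, 10.0)
    print(accel(root, bodies[0]))

The opening criterion is what allows distant groups to be summarized by a single (here, monopole) approximation; multipole libraries such as DASHMM generalize this pattern to full expansions and, as the abstract notes, parallelize the resulting computation with dynamic runtime techniques.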
This report summarizes runtime system challenges for exascale computing that follow from the fundamental challenges for exascale systems that have been well studied in past reports, e.g., [6, 33, 34, 32, 24]. Some of the key exascale challenges that pertain to runtime systems include parallelism, energy efficiency, memory hierarchies, data movemen...
Maintaining a scalable high-performance virtual global address space using distributed memory hardware has proven to be challenging. In this paper we evaluate a new approach for such an active global address space that leverages the capabilities of the network fabric to manage addressing, rather than software at the endpoint hosts. We describe our...
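For readers unfamiliar with global address spaces, the toy Python sketch below shows the endpoint-software style of translation that the paper's network-managed approach is intended to replace; the class name, block size, and round-robin placement policy are assumptions made purely for illustration.

    # Toy software-managed global address space: a global virtual address is split into
    # (block id, offset) and a directory maps each block to a home node. The paper's
    # approach moves this resolution into the network fabric; here it is ordinary
    # endpoint software, shown only for contrast.
    BLOCK_BITS = 20                      # 1 MiB blocks (assumed)
    BLOCK_MASK = (1 << BLOCK_BITS) - 1

    class ToyAddressSpace:
        def __init__(self, num_nodes=4):
            self.num_nodes = num_nodes
            self.directory = {}          # block id -> home node
            self.next_home = 0

        def register_block(self, block_id):
            """Place a block on a home node (round-robin placement, assumed)."""
            home = self.next_home
            self.directory[block_id] = home
            self.next_home = (self.next_home + 1) % self.num_nodes
            return home

        def resolve(self, gva):
            """Translate a global virtual address to (home node, local offset)."""
            block_id = gva >> BLOCK_BITS
            offset = gva & BLOCK_MASK
            return self.directory[block_id], offset

    gas = ToyAddressSpace()
    gas.register_block(0x3)
    print(gas.resolve((0x3 << BLOCK_BITS) | 0x1234))   # -> (0, 0x1234)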
A strategic challenge confronting the continued advance of high performance computing (HPC) to extreme scale is the approaching near-nanoscale semiconductor technology and the end of Moore's Law. This paper introduces the foundations of an innovative class of parallel architecture reversing many of the conventional architecture directions, but bene...
This poster focuses on application performance under HPX. Developed world-wide, HPX is emerging as a critical new programming model combined with a runtime system that uses an asynchronous style to escape the traditional static communicating sequential processes execution model, namely MPI, with a fully dynamic and adaptive model exploiting the cap...
The HPX runtime system is a critical component of the DOE XPRESS (eXascale PRogramming Environment and System Software) project and other projects world-wide. We are exploring a set of innovations in execution models, programming models and methods, runtime and operating system software, adaptive scheduling and resource management algorithms, and i...
Conventional programming practices on multicore processors in high performance computing architectures are not universally effective in terms of efficiency and scalability for many algorithms in scientific computing. One possible solution for improving efficiency and scalability in applications on this class of machines is the use of a many-tasking...
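As a generic illustration of the many-tasking idea (plain Python, not the runtime studied in this work; chunk sizes and worker counts are arbitrary), over-decomposing the work into many more tasks than cores lets a dynamic scheduler absorb irregular task costs:

    # Many-tasking in miniature: far more tasks than workers, scheduled dynamically
    # rather than statically partitioned one chunk per core.
    from concurrent.futures import ThreadPoolExecutor

    def task(chunk):
        # Hypothetical fine-grained unit of work over one chunk of data.
        return sum(x * x for x in chunk)

    data = list(range(100_000))
    chunks = [data[i:i + 1000] for i in range(0, len(data), 1000)]

    with ThreadPoolExecutor(max_workers=8) as pool:
        # 100 tasks across 8 workers: the pool schedules them as workers free up,
        # which tolerates irregular task durations better than a fixed schedule.
        total = sum(pool.map(task, chunks))

    print(total)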
Brain-inspired computing structures, technologies, and methods offer innovative approaches to the future of computing. From the lowest level of neuron devices to the highest abstraction of consciousness, the brain drives new ideas (literally and conceptually) in computer design and operation. This paper interrelates three levels of brain inspired a...
The end of Dennard scaling and the looming Exascale challenges of efficiency, reliability, and scalability are driving a shift in programming methodologies away from conventional practices towards dynamic runtimes and asynchronous, data driven execution. Since Exascale machines are not yet available, however, experimental runtime systems and applic...
Achieving the performance potential of an Exascale machine depends on realizing both operational efficiency and scalability in high performance computing applications. This requirement has motivated the emergence of several new programming models which emphasize fine and medium grain task parallelism in order to address the aggravating effects of a...
The guest editors discuss some recent advances in exascale computing, as well as remaining issues.
The addition of nuclear and neutrino physics to general relativistic fluid codes allows for a more realistic description of hot nuclear matter in neutron star and black hole systems. This additional microphysics requires that each processor have access to large tables of data, such as equations of state, and in large simulations, the memory require...
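The sketch below illustrates the kind of tabulated lookup the abstract refers to: pressure bilinearly interpolated from an equation-of-state table indexed by density and temperature. The table contents and dimensions are invented for illustration; production tables are far larger, which is the source of the per-process memory pressure the paper discusses.

    # Bilinear interpolation into a (fake) tabulated equation of state.
    import numpy as np

    rho_axis = np.logspace(-2, 2, 64)      # density grid (arbitrary units, assumed)
    T_axis = np.logspace(-1, 1, 32)        # temperature grid (assumed)
    # Stand-in "EOS": ideal-gas-like pressure in place of real tabulated data.
    pressure_table = rho_axis[:, None] * T_axis[None, :]

    def eos_pressure(rho, T):
        """Bilinearly interpolate pressure at (rho, T) from the table."""
        i = np.clip(np.searchsorted(rho_axis, rho) - 1, 0, len(rho_axis) - 2)
        j = np.clip(np.searchsorted(T_axis, T) - 1, 0, len(T_axis) - 2)
        fr = (rho - rho_axis[i]) / (rho_axis[i + 1] - rho_axis[i])
        ft = (T - T_axis[j]) / (T_axis[j + 1] - T_axis[j])
        p00 = pressure_table[i, j]
        p10 = pressure_table[i + 1, j]
        p01 = pressure_table[i, j + 1]
        p11 = pressure_table[i + 1, j + 1]
        return ((1 - fr) * (1 - ft) * p00 + fr * (1 - ft) * p10
                + (1 - fr) * ft * p01 + fr * ft * p11)

    print(eos_pressure(1.0, 2.0))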
• This conference focuses strongly on computational accelerator technologies, a specific new technology proving very useful in support of computationally intensive research.
• I will, as promised, summarize the state of use of accelerators ...
In past centuries, education has been one of the most powerful tools to help propel economic development and improve social well-being. Modern educational systems have benefited from technological advancement, especially in information and networking technologies. Although distance education has existed for more than 100 years, it still continues t...
Several applications in astrophysics require adequately resolving many physical and temporal scales which vary over several orders of magnitude. Adaptive mesh refinement techniques address this problem effectively but often result in constrained strong scaling performance. The ParalleX execution model is an experimental execution model that aims to...
The scalability and efficiency of graph applications are significantly constrained by conventional systems and their supporting programming models. Technology trends like multicore, manycore, and heterogeneous system architectures are introducing further challenges and possibilities for emerging application domains such as graph applications. This...
Exascale systems, expected to emerge by the end of the next decade, will require the exploitation of billion-way parallelism at multiple hierarchical levels in order to achieve the desired sustained performance. The task of assessing future machine performance is approached by identifying the factors which currently challenge the scalability of par...
Exascale systems, expected to emerge by the end of the next decade, will require the exploitation of billion-way parallelism at multiple hierarchical levels in order to achieve the desired sustained performance. While traditional approaches to performance evaluation involve measurements of existing applications on the available platforms, such a me...
Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have...
Creating the next generation of power-efficient parallel computers requires a rethink of the mechanisms and methodology for building parallel applications. Energy constraints have pushed us into a regime where parallelism will be ubiquitous rather than limited to highly specialized high-end supercomputers. New execution models are required to span...
The Beowulf Bootcamp is an initiative designed to raise the awareness of and interest in high performance computing in high schools in the area of Baton Rouge, Louisiana. The goal is to familiarize students with all aspects of a supercomputer, giving them hands-on experience with touching or assembling hardware components. No less significant to t...
HPC is entering a new phase in system structure and operation driven by a combination of technology and architecture trends. Perhaps foremost are the constraints of power and complexity which, as a result of the flat-lining of clock rates, have made multicore the primary means by which performance gain is being achieved with Moore's Law. Indeed, for...
High performance computing (HPC) is experiencing a phase change with the challenges of programming and management of heterogeneous multicore systems architectures and large scale system configurations. It is estimated that by the end of the next decade exaflops computing systems requiring hundreds of millions of cores demanding multi-billion-way pa...
The development of HPC systems capable of exascale performance will demand innovations in hardware architecture and system software as well as programming models and methods. The combination of a vast increase in scale of such systems combined with the emergence of heterogeneous multicore structures is forcing future systems to be organized and ope...
Cloud computing is gaining importance as a computational resource allocation trend in commercial, academic, and industrial sectors. Cloud computing offers an amorphous distributed environment of computing resources and services to a dynamic distributed user base. High performance computing (HPC) involves many processing requirements that demand enh...
Although distance learning has a history that spans many decades, the full opportunity that is implicit in its exploitation has not been fully realized due to a combination of factors, including the disparity of experience between it and its classroom counterpart. However, current and emerging technologies are helping overcome this barrier by providing signif...
Productivity is an emerging measure of merit for high-performance computing. While pervasive in application, conventional metrics such as flops fail to reflect the complex interrelationships of diverse factors that determine the overall effectiveness of the use of a computing system. As a consequence, comparative analysis of design and proc...
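As a hedged illustration only (not the formulation defined in the paper), a composite productivity metric of the kind argued for here can be framed as delivered utility per unit of total cost:

    % Illustrative composite metric; \Psi, U, S, E, A and the cost terms are
    % assumed symbols, not the paper's own formulation.
    \Psi = \frac{U}{C_{\text{total}}}
         = \frac{S \cdot E \cdot A}{C_{\text{machine}} + C_{\text{software}} + C_{\text{operations}}}

Here S stands for delivered capability (e.g., sustained performance at scale), E for the efficiency with which it is achieved, A for availability, and the denominator aggregates lifetime costs; a flops figure alone reflects only part of the numerator.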
Instruction pressure is the level of time, space, and power required to manage the instruction stream in support of high-speed execution on modern multicore general-purpose processor and embedded controller based computing. L1 instruction cache and processor pin bandwidth are examples of direct resource costs imposed by the instruction access demand of a proc...
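A back-of-the-envelope example (all figures assumed, not taken from the paper) shows how quickly instruction demand translates into fetch bandwidth:

    % Hypothetical core for illustration only: 4-wide issue, 2 GHz clock, 4-byte instructions.
    \mathrm{BW}_{\text{fetch}} = \text{issue width} \times f_{\text{clock}} \times \text{instruction size}
                               = 4 \times 2\,\mathrm{GHz} \times 4\,\mathrm{B}
                               = 32\,\mathrm{GB/s}\ \text{per core}

Aggregated over many cores, demand of this magnitude is the kind of direct cost in L1 capacity and pin bandwidth that the abstract calls instruction pressure.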
The performance opportunities enabled through multi-core chips and the efficiency potential of heterogeneous ISAs and structures are creating a climate for computer architecture, highly parallel processing chips, and HPC systems unprecedented for more than a decade. But with change comes the uncertainty from competition of alternatives. One thing is...
This paper proposes the study of a new computation model that attempts to address the underlying sources of performance degradation (e.g. latency, overhead, and starvation) and the difficulties of programmer productivity (e.g. explicit locality management and scheduling, performance tuning, fragmented memory, and synchronous global barri...
This paper addresses the underlying sources of performance degradation (e.g. latency, overhead, and starvation) and the difficulties of programmer productivity (e.g. explicit locality management and scheduling, performance tuning, fragmented memory, and synchronous global barriers) to dramatically enhance the broad effectiveness of parallel process...
Experts in advanced system technologies will predict the design of the best HPC architectures in 2020. They will defend why they think the technology they select will be the winning technology 15 years from now. The panelists will pick one set of technology - not a list of possibilities - to define the system. They will define the performance and a...
A dramatic trend in computing is the adoption of multi-core technology by the vendors from which our current and future HPC systems are being derived. Multi-core is offered as a path to continued reliance on and benefits of Moore's Law while reining in the previously unfettered growth of power consumption and design complexity. Are we saved? Or is it...
Twenty five years ago supercomputing was dominated by vector processors and emergent SIMD array processors clocked at tens of Megahertz. Today responding to dramatic advances in semiconductor device fabrication technologies, the world of supercomputing is dominated by multi-core based MPP and commodity cluster systems clocked at Gigahertz. Twenty f...
This chapter centers mainly on successful programming models that map algorithms and simulations to computational resources used in high-performance computing. These resources range from group-based or departmental clusters to high-end resources available at the handful of supercomputer centers around the world. Also covered are newer programming m...
MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high performance computing and scalable embedded processing. It is a Processor-in-Memory (PIM) architecture integrating both DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore with multiple memory/processor nodes on each c...
In a recent paper, Gordon Bell and Jim Gray (2002) put forth a view of the past, present, and future of high-performance computing (HPC) that is both insightful and thought provoking. Identifying key trends with a grace and candor rarely encountered in a single work, the authors describe an evolutionary past drawn from their vast experience and pro...
Continuum computer architecture (CCA) is a non-von Neumann architecture that offers an alternative to conventional structures as digital technology evolves towards nano-scale and the ultimate flat-lining of Moore's law. Coincidentally, it also defines a model of architecture particularly well suited to logic classes that exhibit ultra-high clock ra...
The anticipated advent of practical nanoscale technology sometime in the next decade with likely experimental technologies nearer term presents enormous opportunities for the realization of future high performance computing potentially in the pan-Exaflops performance domain (10^18 to 10^21 flops), but imposes substantial, albeit exciting, technical...
Historically, high performance computing has been measured in terms of peak or delivered performance and, to a lesser extent, in terms of performance to cost. Such metrics fail to capture the impact on the usefulness and ease of use of such systems. Productivity has been identified as a new parameter for high end computing systems that include both delivered...
Last year's paper by Bell and Gray [1] examined past trends in high performance computing and asserted likely future directions based on market forces. While many of the insights drawn from this perspective have merit and suggest elements governing likely future directions for HPC, there are a number of points put forth that we feel require further...
InfiniBand is a new industry-wide general-purpose interconnect standard designed to provide significantly higher levels of reliability, availability, performance, and scalability than alternative server I/O technologies. After more than two years since its official release, many are still trying to understand what the profitable uses are for this n...
Percolation has recently been proposed as a key component of an advanced program execution model for future generation high-end machines featuring adaptive data/code transformation and movement for effective latency tolerance. An early evaluation of the performance effect of percolation is very important in the design space exploration of future ge...
Future high-end computers which promise very high performance require sophisticated program execution models and languages in order to deal with very high latencies across the memory hierarchy and to exploit massive parallelism. This paper presents our progress in an ongoing research toward this goal. Specifically, we will develop a suitable progra...
Processor-in-Memory (PIM) architectures avoid the von Neumann bottleneck in conventional machines by integrating high-density DRAM and CMOS logic on the same chip. Parallel systems based on this new technology are expected to provide higher scalability, adaptability, robustness, fault tolerance and lower power consumption than current MPPs or commo...
Implicit in the evolution of current technology and high-end systems is the anticipated implementation of computers capable of a peak performance of 1 Petaflops by the year 2010. This is consistent with both the semiconductor industry's roadmap of basic device technology development and an extrapolation of the TOP-500 li...
In this document, the rationale for design choices made in the interface specification is set off in this format. Some readers may wish to skip these sections, while readers interested in interface design may want to read them carefully. (End of rationale.) Advice to users. Throughout this document, material that speaks to users and illustrates usage...
Future high-end computers will offer great performance improvements over today's machines, enabling applications of far greater complexity. However, designers must solve the challenge of exploiting massive parallelism efficiently in the face of very high latencies across the memory hierarchy. We believe the key to meeting this challenge is the desig...
The application of cluster computer systems has escalated dramatically over the last several years. Driven by a range of applications that need relatively low-cost access to high performance computing systems, cluster computers have reached worldwide use. In this paper we outline the results of using three generations of cluster machines at JPL fo...
Scientists have found a cheaper way to solve tremendously difficult computational problems: connect ordinary PCs so that they can work together.
The emergence of semiconductor fabrication technology allowing a tight coupling between high-density DRAM and CMOS logic on the same chip has led to the important new class of Processor-In-Memory (PIM) architectures. Newer developments provide powerful parallel processing capabilities on the chip, exploiting the facility to load wide words in singl...
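The NumPy sketch below is a software caricature (not hardware, and not the architecture described in the paper) of the wide-word idea: a single row access delivers many fields at once, and the same operation is applied to all of them in one step; the row width and field size are assumed.

    # Simulating a row-wide ("wide word") PIM operation in software, for illustration only.
    import numpy as np

    ROW_BITS = 2048                          # width of one DRAM row in this toy model (assumed)
    FIELD_BITS = 32
    FIELDS_PER_ROW = ROW_BITS // FIELD_BITS  # 64 operands delivered per row access

    memory = np.arange(16 * FIELDS_PER_ROW, dtype=np.uint32).reshape(16, FIELDS_PER_ROW)

    def row_wide_add(row_index, scalar):
        """Model of a PIM operation: one row access, then the same add applied
        to all 64 fields of the wide word at once."""
        wide_word = memory[row_index]                        # single "row activation"
        memory[row_index] = wide_word + np.uint32(scalar)    # row-wide ALU operation

    row_wide_add(3, 7)
    print(memory[3][:5])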
High performance computing with respect to personal computer (PC) clusters is presented. The requirement of performance and the availability of resources are addressed through an innovative synergy of new and old ideas from the parallel computing community. The low cost technologies from the consumer digital electronics industry are also addressed. The le...
Future-generation space missions across the solar system to the planets, moons, asteroids, and comets may someday incorporate supercomputers both to expand the range of missions being conducted and to significantly reduce their cost. By performing science computation directly on the spacecraft itself, the amount of data required to be downlinked ma...
Throughout the history of computer implementation, the technologies employed for logic to build ALUs and the technologies employed to realize high speed and high-density storage for main memory have been disparate, requiring different fabrication techniques. This was certainly true at the beginning of the era of electronic digital computers where l...
Teraflops-scale computing systems are becoming available to an increasingly broad range of users as the performance of the constituent processing elements increases and their relative cost (e.g. per Mflops) decreases. To the original DOE ASCI Red machine have been added the ASCI Blue systems and additional 1 Teraflops commercial systems at key natio...
The emergence of semiconductor fabrication technology allowing a tight coupling between high-density DRAM and CMOS logic on the same chip has led to the important new class of Processor-in-Memory (PIM) architectures. Recent developments provide powerful parallel processing capabilities on the chip, exploiting the facility to load wide words in sing...
Beowulf-class systems are an extremely inexpensive way of aggregating substantial quantities of a given resource to facilitate the execution of different kinds of potentially large workloads. Beowulf-class systems are clusters of mass-market COTS PC computers (e.g. Intel Pentium III) and network hardware (e.g. Fast Ethernet, Myrinet) employing avai...
The emergence of semiconductor fabrication technology allowing a tight coupling between high-density DRAM and CMOS logic on the same chip has led to the important new class of Processor-In-Memory (PIM) architectures. Furthermore, large arrays of PIMs can be arranged into massively parallel architectures. In this paper, we outline the salient featur...
In cooperation with the European Southern Observatory (ESO), Caltech has investigated the application of Beowulf clusters to the management and analysis of data generated by large astronomical instruments, exemplified by the Very Large Telescope (VLT) in Cerro Paranal, Chile. The VLT consists of four 8-meter telescopes that can operate independently...
Machine Interface: B.1 Global System Issues; B.1.1 Global Name Management; B.1.2 Global Memory Management; B.2 Macroservers ...
The semantics of memory (a large state which can only be read or changed a small piece at a time) has remained virtually untouched since von Neumann, and its effects (latency and bandwidth) have proved to be the major stumbling block for high performance computing. This paper suggests a new model termed "microservers" that exploits "Processin...
This paper presents an analytical performance prediction for the implementation of Cannon's matrix multiply algorithm in the Hybrid Technology Multi-Threading (HTMT) architecture [8]. The HTMT subsystems are built from new technologies: super-conducting processor elements (called SPELLs [5]), a network based on RSFQ (Rapid Single Flux Quantum) logi...
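For reference, the sketch below is a serial NumPy simulation of Cannon's algorithm on a q x q logical process grid, the algorithm whose HTMT implementation the paper models analytically; the matrix sizes and block layout are illustrative, and nothing HTMT-specific is represented.

    # Serial simulation of Cannon's block-shifting matrix multiply (illustration only).
    import numpy as np

    def cannon_matmul(A, B, q):
        """Multiply A @ B by skewing and circularly shifting blocks, as in Cannon's algorithm."""
        n = A.shape[0]
        assert n % q == 0
        bs = n // q  # block size

        # Partition A, B, C into q x q grids of bs x bs blocks.
        Ab = [[A[i*bs:(i+1)*bs, j*bs:(j+1)*bs].copy() for j in range(q)] for i in range(q)]
        Bb = [[B[i*bs:(i+1)*bs, j*bs:(j+1)*bs].copy() for j in range(q)] for i in range(q)]
        Cb = [[np.zeros((bs, bs)) for _ in range(q)] for _ in range(q)]

        # Initial skew: shift row i of A left by i, column j of B up by j.
        Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]

        for _ in range(q):
            # Local multiply on every "process", then circular shifts of A (left) and B (up).
            for i in range(q):
                for j in range(q):
                    Cb[i][j] += Ab[i][j] @ Bb[i][j]
            Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
            Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]

        return np.block(Cb)

    A = np.random.rand(8, 8)
    B = np.random.rand(8, 8)
    assert np.allclose(cannon_matmul(A, B, q=4), A @ B)

Each of the q steps performs one local block multiply and one nearest-neighbor shift per block, which is why the algorithm lends itself to the kind of analytical communication/computation modeling the abstract describes.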
Do-it-yourself supercomputing has emerged as a solution to cost-effectively sustain the computational demands of the scientific research community. Despite some of the successes of this approach, represented by Beowulf-class computing, it has limitations that need to be recognized as well as problems that need to be resolved in order to extend its...