Thomas Sterling
  • Indiana University Bloomington

About

154 Publications
15,649 Reads
3,532 Citations
Current institution: Indiana University Bloomington

Publications (154)
Preprint
Control parallelism and data parallelism are mostly reasoned about and optimized as separate concerns. Because of this, workloads that are irregular, fine-grained, and dynamic, such as dynamic graph processing, become very hard to scale. An experimental research approach to computer architecture that synthesizes prior techniques of parallel computing along wi...
Chapter
HPC is entering a point of singularity where previous technology trends (Moore’s Law etc.) are terminating and dramatic performance progress may depend on advances in computer architecture outside of the scope of conventional practices. This may extend to the opportunities potentially available through the context of non-von Neumann architectures....
Article
The rise in computing hardware choices is driving a reevaluation of operating systems. The traditional role of an operating system controlling the execution of its own hardware is evolving toward a model whereby the controlling processor is distinct from the compute engines that are performing most of the computations. In this context, an operating...
Preprint
The rise in computing hardware choices is driving a reevaluation of operating systems. The traditional role of an operating system controlling the execution of its own hardware is evolving toward a model whereby the controlling processor is distinct from the compute engines that are performing most of the computations. In this context, an operating...
Chapter
Future Exascale architectures will likely make extensive use of computing accelerators such as Field Programmable Gate Arrays (FPGAs) given that these accelerators are very power efficient. Oftentimes, these FPGAs are located at the network interface card (NIC) and switch level in order to accelerate network operations, incorporate contention avoid...
Article
We present DASHMM, a general library implementing multipole methods (including both Barnes-Hut and the Fast Multipole Method). DASHMM relies on dynamic adaptive runtime techniques provided by the HPX-5 system to parallelize the resulting multipole moment computation. The result is a library that is easy-to-use, extensible, scalable, efficient, and...
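As background on the methods such a library implements, the Barnes-Hut half of the multipole family reduces to a tree of monopole summaries plus an opening-angle test that decides when a whole subtree may be approximated by its center of mass. The sketch below is a minimal serial Python illustration of that idea only; the class and function names are assumptions for illustration and do not reflect the DASHMM or HPX-5 APIs.

```python
# Minimal Barnes-Hut sketch (illustrative assumption, not the DASHMM API).
import numpy as np

class Node:
    """Octree node summarizing its points by a monopole (total mass, center of mass)."""
    def __init__(self, points, masses, center, size):
        self.size = size
        self.mass = masses.sum()
        self.com = (points * masses[:, None]).sum(axis=0) / self.mass
        self.children = []
        if len(points) > 1 and size > 1e-6:
            bits = (points > center).astype(int)                 # octant of each point
            idx = bits[:, 0] + 2 * bits[:, 1] + 4 * bits[:, 2]
            for o in range(8):
                sel = idx == o
                if sel.any():
                    offs = np.array([o & 1, (o >> 1) & 1, (o >> 2) & 1]) - 0.5
                    self.children.append(Node(points[sel], masses[sel],
                                              center + offs * size / 2, size / 2))

def accel(node, x, theta=0.5, eps=1e-3):
    """Barnes-Hut acceptance test: treat a sufficiently distant cell as one monopole."""
    d = node.com - x
    r = np.linalg.norm(d) + eps
    if not node.children or node.size / r < theta:
        return node.mass * d / r**3                              # far field: one term
    return sum(accel(c, x, theta, eps) for c in node.children)   # near field: recurse

rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, (256, 3))
ms = rng.uniform(0.5, 1.0, 256)
root = Node(pts, ms, np.zeros(3), 2.0)
print(accel(root, pts[0]))                                       # acceleration on particle 0
```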
Technical Report
Full-text available
This report summarizes runtime system challenges for exascale computing that follow from the fundamental challenges for exascale systems that have been well studied in past reports, e.g., [6, 33, 34, 32, 24]. Some of the key exascale challenges that pertain to runtime systems include parallelism, energy efficiency, memory hierarchies, data movemen...
Conference Paper
Maintaining a scalable high-performance virtual global address space using distributed memory hardware has proven to be challenging. In this paper we evaluate a new approach for such an active global address space that leverages the capabilities of the network fabric to manage addressing, rather than software at the endpoint hosts. We describe our...
Conference Paper
A strategic challenge confronting the continued advance of high performance computing (HPC) to extreme scale is the approaching near-nanoscale semiconductor technology and the end of Moore's Law. This paper introduces the foundations of an innovative class of parallel architecture reversing many of the conventional architecture directions, but bene...
Conference Paper
Full-text available
This poster focuses on application performance under HPX. Developed worldwide, HPX is emerging as a critical new programming model combined with a runtime system that uses an asynchronous style to escape the traditional static communicating sequential processes execution model (namely MPI) in favor of a fully dynamic and adaptive model exploiting the cap...
Poster
Full-text available
The HPX runtime system is a critical component of the DOE XPRESS (eXascale PRogramming Environment and System Software) project and other projects world-wide. We are exploring a set of innovations in execution models, programming models and methods, runtime and operating system software, adaptive scheduling and resource management algorithms, and i...
Conference Paper
Conventional programming practices on multicore processors in high performance computing architectures are not universally effective in terms of efficiency and scalability for many algorithms in scientific computing. One possible solution for improving efficiency and scalability in applications on this class of machines is the use of a many-tasking...
Conference Paper
Brain-inspired computing structures, technologies, and methods offer innovative approaches to the future of computing. From the lowest level of neuron devices to the highest abstraction of consciousness, the brain drives new ideas (literally and conceptually) in computer design and operation. This paper interrelates three levels of brain inspired a...
Conference Paper
The end of Dennard scaling and the looming Exascale challenges of efficiency, reliability, and scalability are driving a shift in programming methodologies away from conventional practices towards dynamic runtimes and asynchronous, data driven execution. Since Exascale machines are not yet available, however, experimental runtime systems and applic...
Conference Paper
Achieving the performance potential of an Exascale machine depends on realizing both operational efficiency and scalability in high performance computing applications. This requirement has motivated the emergence of several new programming models which emphasize fine and medium grain task parallelism in order to address the aggravating effects of a...
Article
The guest editors discuss some recent advances in exascale computing, as well as remaining issues.
Conference Paper
The addition of nuclear and neutrino physics to general relativistic fluid codes allows for a more realistic description of hot nuclear matter in neutron star and black hole systems. This additional microphysics requires that each processor have access to large tables of data, such as equations of state, and in large simulations, the memory require...
Article
The addition of nuclear and neutrino physics to general relativistic fluid codes allows for a more realistic description of hot nuclear matter in neutron star and black hole systems. This additional microphysics requires that each processor have access to large tables of data, such as equations of state, and in large simulations the memory required...
Conference Paper
• This conference focuses strongly on computational accelerator technologies, a specific new technology proving very useful in support of computationally intensive research.
• I will, as promised, summarize the state of use of accelerators ...
Article
Full-text available
In past centuries, education has been one of the most powerful tools to help propel economic development and improve social well-being. Modern educational systems have benefited from technological advancement, especially in information and networking technologies. Although distance education has existed for more than 100 years, it still continues t...
Article
Several applications in astrophysics require adequately resolving many physical and temporal scales which vary over several orders of magnitude. Adaptive mesh refinement techniques address this problem effectively but often result in constrained strong scaling performance. The ParalleX execution model is an experimental execution model that aims to...
Article
Full-text available
The scalability and efficiency of graph applications are significantly constrained by conventional systems and their supporting programming models. Technology trends like multicore, manycore, and heterogeneous system architectures are introducing further challenges and possibilities for emerging application domains such as graph applications. This...
Article
Exascale systems, expected to emerge by the end of the next decade, will require the exploitation of billion-way parallelism at multiple hierarchical levels in order to achieve the desired sustained performance. The task of assessing future machine performance is approached by identifying the factors which currently challenge the scalability of par...
Article
Full-text available
Exascale systems, expected to emerge by the end of the next decade, will require the exploitation of billion-way parallelism at multiple hierarchical levels in order to achieve the desired sustained performance. While traditional approaches to performance evaluation involve measurements of existing applications on the available platforms, such a me...
Article
Full-text available
Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have...
Article
Creating the next generation of power-efficient parallel computers requires a rethink of the mechanisms and methodology for building parallel applications. Energy constraints have pushed us into a regime where parallelism will be ubiquitous rather than limited to highly specialized high-end supercomputers. New execution models are required to span...
Article
Full-text available
The Beowulf Bootcamp is an initiative designed to raise awareness of and interest in high performance computing in high schools in the area of Baton Rouge, Louisiana. The goal is to familiarize students with all aspects of a supercomputer, giving them hands-on experience touching and assembling hardware components. No less significant to t...
Conference Paper
HPC is entering a new phase in system structure and operation driven by a combination of technology and architecture trends. Perhaps foremost are the constraints of power and complexity which, as a result of the flat-lining of clock rates, have made multicore the primary means by which performance gain is being achieved under Moore’s Law. Indeed, for...
Conference Paper
High performance computing (HPC) is experiencing a phase change with the challenges of programming and management of heterogeneous multicore systems architectures and large scale system configurations. It is estimated that by the end of the next decade exaflops computing systems requiring hundreds of millions of cores demanding multi-billion-way pa...
Article
The development of HPC systems capable of exascale performance will demand innovations in hardware architecture and system software as well as programming models and methods. The combination of a vast increase in scale of such systems combined with the emergence of heterogeneous multicore structures is forcing future systems to be organized and ope...
Article
Cloud computing is gaining importance as a computational resource allocation trend in commercial, academic, and industrial sectors. Cloud computing offers an amorphous distributed environment of computing resources and services to a dynamic distributed user base. High performance computing (HPC) involves many processing requirements that demand enh...
Conference Paper
High performance computing (HPC) is experiencing a phase change with the challenges of programming and management of heterogeneous multicore systems architectures and large scale system configurations. It is estimated that by the end of the next decade exaflops computing systems requiring hundreds of millions of cores demanding multi-billion-way pa...
Conference Paper
Although distance learning has a history that spans many decades, the full opportunity that is implicit in its exploitation has not been fully realized due to a combination of factors, including the disparate experience between it and its classroom counterpart. However, current and emerging technologies are helping overcome this barrier by providing signif...
Article
Productivity is an emerging measure of merit for high-performance computing. While pervasive in application, conventional metrics such as flops fail to reflect the complex interrelationships of diverse factors that determine the overall effectiveness of the use of a computing system. As a consequence, comparative analysis of design and proc...
Conference Paper
Instruction pressure is the level of time, space, and power required to manage the instruction stream to support high-speed execution of modern multicore general processor and embedded controller based computing. L1 instruction cache and processor pin bandwidth are examples of direct resource costs imposed by the instruction access demand of a proc...
Conference Paper
The performance opportunities enabled through multi-core chips and the efficiency potential of heterogeneous ISAs and structures are creating a climate for computer architecture, highly parallel processing chips, and HPC systems unprecedented for more than a decade. But with change comes the uncertainty from competition of alternatives. One thing is...
Conference Paper
Full-text available
This paper proposes the study of a new computation model that attempts to address the underlying sources of performance degradation (e.g. latency, overhead, and starvation) and the difficulties of programmer productivity (e.g. explicit locality management and scheduling, performance tuning, fragmented memory, and synchronous global barri...
Conference Paper
Full-text available
This paper addresses the underlying sources of performance degradation (e.g. latency, overhead, and starvation) and the difficulties of programmer productivity (e.g. explicit locality management and scheduling, performance tuning, fragmented memory, and synchronous global barriers) to dramatically enhance the broad effectiveness of parallel process...
Conference Paper
Experts in advanced system technologies will predict the design of the best HPC architectures in 2020. They will defend why they think the technology they select will be the winning technology 15 years from now. The panelists will pick one set of technology - not a list of possibilities - to define the system. They will define the performance and a...
Conference Paper
A dramatic trend in computing is the adoption of multi-core technology by the vendors from which our current and future HPC systems are being derived. Multi-core is offered as a path to continued reliance and benefits of Moore's Law while reining in the previously unfettered growth of power consumption and design complexity. Are we saved? or is it...
Conference Paper
Twenty five years ago supercomputing was dominated by vector processors and emergent SIMD array processors clocked at tens of Megahertz. Today responding to dramatic advances in semiconductor device fabrication technologies, the world of supercomputing is dominated by multi-core based MPP and commodity cluster systems clocked at Gigahertz. Twenty f...
Chapter
Full-text available
This chapter centers mainly on successful programming models that map algorithms and simulations to computational resources used in high-performance computing. These resources range from group-based or departmental clusters to high-end resources available at the handful of supercomputer centers around the world. Also covered are newer programming m...
Article
MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high performance computing and scalable embedded processing. It is a Processor-in-Memory (PIM) architecture integrating both DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore with multiple memory/processor nodes on each c...
Article
Full-text available
In a recent paper, Gordon Bell and Jim Gray (2002) put forth a view of the past, present, and future of high-performance computing (HPC) that is both insightful and thought provoking. Identifying key trends with a grace and candor rarely encountered in a single work, the authors describe an evolutionary past drawn from their vast experience and pro...
Conference Paper
Continuum computer architecture (CCA) is a non-von Neumann architecture that offers an alternative to conventional structures as digital technology evolves towards nano-scale and the ultimate flat-lining of Moore's law. Coincidentally, it also defines a model of architecture particularly well suited to logic classes that exhibit ultra-high clock ra...
Article
The anticipated advent of practical nanoscale technology sometime in the next decade with likely experimental technologies nearer term presents enormous opportunities for the realization of future high performance computing potentially in the pan-Exaflops performance domain (10^18 to 10^21 flops), but imposes substantial, albeit exciting, technical...
Article
Historically, high performance computing has been measured in terms of peak or delivered performance and, to a lesser extent, performance relative to cost. Such metrics fail to capture the impact on the usefulness and ease of use of such systems. Productivity has been identified as a new parameter for high end computing systems that include both delivered...
Article
Full-text available
Last year's paper by Bell and Gray [1] examined past trends in high performance computing and asserted likely future directions based on market forces. While many of the insights drawn from this perspective have merit and suggest elements governing likely future directions for HPC, there are a number of points put forth that we feel require further...
Article
Full-text available
InfiniBand is a new industry-wide general-purpose interconnect standard designed to provide significantly higher levels of reliability, availability, performance, and scalability than alternative server I/O technologies. After more than two years since its official release, many are still trying to understand what the profitable uses are for this n...
Conference Paper
Percolation has recently been proposed as a key component of an advanced program execution model for future generation high-end machines featuring adaptive data/code transformation and movement for effective latency tolerance. An early evaluation of the performance effect of percolation is very important in the design space exploration of future ge...
Article
Future high-end computers which promise very high performance require sophisticated program execution models and languages in order to deal with very high latencies across the memory hierarchy and to exploit massive parallelism. This paper presents our progress in an ongoing research toward this goal. Specifically, we will develop a suitable progra...
Conference Paper
Processor-in-Memory (PIM) architectures avoid the von Neumann bottleneck in conventional machines by integrating high-density DRAM and CMOS logic on the same chip. Parallel systems based on this new technology are expected to provide higher scalability, adaptability, robustness, fault tolerance and lower power consumption than current MPPs or commo...
Article
Processor-in-Memory (PIM) architectures avoid the von Neumann bottleneck in conventional machines by integrating high-density DRAM and CMOS logic on the same chip. Parallel systems based on this new technology are expected to provide higher scalability, adaptability, robustness, fault tolerance and lower power consumption than current MPPs or commo...
Conference Paper
Implicit in the evolution of current technology and high-end system evolution is the anticipated achievement of the implementation of computers capable of a peak performance of 1 Petaflops by the year 2010. This is consistent with both the semiconductor industry’s roadmap of basic device technology development and an extrapolation of the TOP-500 li...
Article
this document, the rationale for design choices made in the interface specification is set off in this format. Some readers may wish to skip these sections, while readers interested in interface design may want to read them carefully. (End of rationale.) Advice to users. Throughout this document, material that speaks to users and illustrates usage...
Conference Paper
Full-text available
Future high-end computers will offer great performance improvements over today's machines, enabling applications of far greater complexity. However, designers must solve the challenge of exploiting massive parallelism efficiently in the face of very high latencies across the memory hierarchy. We believe the key to meeting this challenge is the desig...
Conference Paper
Full-text available
The application of cluster computer systems has escalated dramatically over the last several years. Driven by a range of applications that need relatively low-cost access to high performance computing systems, cluster computers have reached worldwide use. In this paper we outline the results of using three generations of cluster machines at JPL fo...
Article
Scientists have found a cheaper way to solve tremendously difficult computational problems: connect ordinary PCs so that they can work together.
Article
The emergence of semiconductor fabrication technology allowing a tight coupling between high-density DRAM and CMOS logic on the same chip has led to the important new class of Processor-In-Memory (PIM) architectures. Newer developments provide powerful parallel processing capabilities on the chip, exploiting the facility to load wide words in singl...
Article
High performance computing with respect to personal computer (PC) clusters is presented. The requirements of performance and the availability of resources are addressed through an innovative synergy of new and old ideas from the parallel computing community. The low-cost technologies from the consumer digital electronics industry are also addressed. The le...
Article
Future-generation space missions across the solar system to the planets, moons, asteroids, and comets may someday incorporate supercomputers both to expand the range of missions being conducted and to significantly reduce their cost. By performing science computation directly on the spacecraft itself, the amount of data required to be downlinked ma...
Conference Paper
Throughout the history of computer implementation, the technologies employed for logic to build ALUs and the technologies employed to realize high speed and high-density storage for main memory have been disparate, requiring different fabrication techniques. This was certainly true at the beginning of the era of electronic digital computers where l...
Article
Teraflops-scale computing systems are becoming available to an increasingly broad range of users as the performance of the constituent processing elements increases and their relative cost (e.g. per Mflops) decreases. To the original DOE ASCI Red machine have been added the ASCI Blue systems and additional 1 Teraflops commercial systems at key natio...
Conference Paper
The emergence of semiconductor fabrication technology allowing a tight coupling between high-density DRAM and CMOS logic on the same chip has led to the important new class of Processor-in-Memory (PIM) architectures. Recent developments provide powerful parallel processing capabilities on the chip, exploiting the facility to load wide words in sing...
Conference Paper
Beowulf-class systems are an extremely inexpensive way of aggregating substantial quantities of a given resource to facilitate the execution of different kinds of potentially large workloads. Beowulf-class systems are clusters of mass-market COTS PC computers (e.g. Intel Pentium III) and network hardware (e.g. Fast Ethernet, Myrinet) employing avai...
Conference Paper
The emergence of semiconductor fabrication technology allowing a tight coupling between high-density DRAM and CMOS logic on the same chip has led to the important new class of Processor-In-Memory (PIM) architectures. Furthermore, large arrays of PIMs can be arranged into massively parallel architectures. In this paper, we outline the salient featur...
Article
In cooperation with the European Southern Observatory (ESO), Caltech has investigated the application of Beowulf clusters to the management and analysis of data generated by large astronomical instruments, exemplified by the Very Large Telescope (VLT) in Cerro Paranal, Chile. The VLT consists of four 8-meter telescopes that can operate independently...
Article
Table-of-contents excerpt: Machine Interface; B.1 Global System Issues; B.1.1 Global Name Management; B.1.2 Global Memory Management; B.2 Macroservers ...
Article
The emergence of semiconductor fabrication technology allowing a tight coupling between high-density DRAM and CMOS logic on the same chip has led to the important new class of Processor-In-Memory (PIM) architectures. Newer developments provide powerful parallel processing capabilities on the chip, exploiting the facility to load wide words in singl...
Article
Full-text available
The semantics of memory (a large state which can only be read or changed a small piece at a time) has remained virtually untouched since von Neumann, and its effects (latency and bandwidth) have proved to be the major stumbling block for high performance computing. This paper suggests a new model termed "microservers" that exploits "Processin...
Article
Full-text available
This paper presents an analytical performance prediction for the implementation of Cannon's matrix multiply algorithm in the Hybrid Technology Multi-Threading (HTMT) architecture [8]. The HTMT subsystems are built from new technologies: super-conducting processor elements (called SPELLs [5]), a network based on RSFQ (Rapid Single Flux Quantum) logi...
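For context on what such a prediction models, Cannon's algorithm pre-skews the operand blocks and then performs q shift-and-multiply rounds on a q x q grid, so every processor always holds exactly the block pair it needs. The serial numpy sketch below illustrates only that standard formulation (an assumption for illustration; it does not reflect the HTMT, SPELL, or RSFQ implementation analyzed in the paper).

```python
# Serial simulation of Cannon's matrix multiply on a q x q block grid (illustrative only).
import numpy as np

def cannon_multiply(A, B, q):
    """Multiply A @ B by skewing block rows/columns, then q shift-and-multiply rounds."""
    n = A.shape[0]
    assert n % q == 0
    b = n // q
    # Partition the operands into q x q grids of b x b blocks.
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(q)] for i in range(q)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(q)] for i in range(q)]
    Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]
    # Initial skew: shift block row i of A left by i, block column j of B up by j.
    Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
    Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]        # local block multiply-accumulate
        # Roll step: shift A blocks left by one, B blocks up by one.
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return np.block(Cb)

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
print(np.allclose(cannon_multiply(A, B, 4), A @ B))    # True
```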
Article
Do-it-yourself supercomputing has emerged as a solution to cost-effectively sustain the computational demands of the scientific research community. Despite some of the successes of this approach, represented by Beowulf-class computing, it has limitations that need to be recognized as well as problems that need to be resolved in order to extend its...
