
Man vs. Machine: The Challenge of Engineering Programs for HPC (Slides & YouTube)

Author: Gordon Bell

Abstract

In retrospect, the first era of scientific computing, 1960-1995, was defined by Seymour Cray-designed computers hosting single-memory FORTRAN programs that could be executed by a dozen or so processors operating in parallel. In 1993, the first multicomputer with a thousand independent, interconnected computers outperformed mono-memory supercomputers and operated at 60 GFlops. In 2018 the fastest computer, at Oak Ridge, had 4,608 computers in a collection of processors, cores, and GPUs, or 1.5 million processing elements, and operated at 140 petaflops. Thus, in a little over a quarter of a century, the problem of utilizing this performance gain of over 2 million, which comes from managing millions of processing elements in the highest-performing computers, includes not only changes in the nature and scale of the science being simulated or analyzed, but also algorithms that scale, and the engineering investment in the design, construction, and maintenance of the programs. Use paradigms are evolving from analysis and simulation to data science analytics, machine learning, and Artificial Intelligence.
The Challenge of Designing and Using HPC Programs
Man vs Machine
Petascale Computing Institute
Gordon Bell
YouTube Talk is at
https://www.youtube.com/playlist?list=PLO8UWE9gZTlBwO8Xzq79J45clbnfaYDca
PCI Sponsors
The Argonne National Laboratory, the Blue Waters project at NCSA, the National Energy
Research Scientific Computing Center,
Oak Ridge Leadership Computing Facility, Pittsburgh Supercomputing Center,
SciNet at the University of Toronto, and the Texas Advanced Computing Center.
Abstract
Man vs. Machine: The Challenge of Engineering Programs for HPC
In retrospect, the first era of scientific computing, 1960-1995, was defined by Seymour Cray-designed computers hosting single-memory FORTRAN programs that could be executed by a dozen or so processors operating in parallel.
In 1993, the first multicomputer with a thousand independent, interconnected computers outperformed mono-memory supercomputers and operated at 60 GFlops. In 2018 the fastest computer, at Oak Ridge, had 4,608 computers in a collection of processors, cores, and GPUs, or 1.5 million processing elements, and operated at 140 petaflops.
Thus, in a little over a quarter of a century, the problem of utilizing this performance gain of over 2 million, which comes from managing millions of processing elements in the highest-performing computers, includes not only changes in the nature and scale of the science being simulated or analyzed, but also algorithms that scale, and the engineering investment in the design, construction, and maintenance of the programs.
Use paradigms are evolving from analysis and simulation to data science analytics, machine learning, and Artificial Intelligence.
The Challenge of Engineering Programs for HPC
1947-1993: In retrospect, the first era of scientific computing was defined by Seymour Cray-designed computers hosting single-memory FORTRAN programs that could be easily executed by a dozen or so mono-memory, vector processors.
1993: The first multicomputer operating as a supercomputer had a thousand independent, interconnected computers that outperformed the Cray mono-memory supercomputers, and operated at 60 GFlops.
2019: The fastest computer at ORNL has 4,608 computers in a collection of processing cores, processors and GPUs providing 1.5 million processing elements, operating at 140 petaflops.
Thus, in a little over a quarter of a century, the problem became:
Realizing a performance gain of more than 2 million by managing millions of processing elements, given
changes in the nature and scale of the science being simulated or analyzed,
discovering algorithms that scale, and
engineering the design, construction, and maintenance of the programs.
DAY:   MONDAY, AUGUST 19   | TUESDAY, AUGUST 20    | WEDNESDAY, AUGUST 21                                | THURSDAY, AUGUST 22                | FRIDAY, AUGUST 23
 8:00  Keynote – G. Bell   | MPI                   | OpenACC                                             | Python (in HPC)                    | Software Engineering Best Practices
 9:00  Computing Paradigms | MPI                   | CUDA                                                | Python (in HPC)                    | Containers
10:00  Working Lunch       | Lunch                 | Lunch                                               | Lunch                              | Lunch
11:00  ANL, NCSA, NERSC, ORNL, PSC, SciNet, TACC
11:10  OpenMP              | Hybrid (MPI + OpenMP) | CUDA                                                | Debugging, Profiling, Optimization |
12:00  OpenMP              | Survey of Libraries   | CUDA                                                | Debugging, Profiling, Optimization | Survey of Visualization Resources
13:00  Break               | Break                 | Break                                               | Break                              | Wrap-up and Adjourn
13:30  MPI                 | OpenACC               | CUDA, OpenACC; linear algebra optimization exercise | Parallel I/O Best Practices        |
CPUs, GPUs, and the programming models of MPI, OpenMP, Hybrid, OpenACC, CUDA
We will be surprised…
In hindsight, FORTRAN c1957 had hidden parallelism -- until the 90s
Last 25 years: high-placed fruit (ladders, balloons, aircraft)
Performance gain of 1000 per decade … a 10K-hour or 5-year mastery of each change in speed, form of parallelism, algorithms
New computing paradigms after experimentation and theory
3rd Paradigm: Ken Wilson, Simulation … mid 80's
Visualization rediscovered every decade to deal with data
4th Paradigm: Data discovery … Gray 2008
AI, machine learning, neural nets, to uncover or expose the phenomena … 2018
Enables new kinds of science and engineering
Topics
Moore’s Law vs Algorithms … What can go wrong?
The Ideal supercomputer. Quick visit to the machines, parallelism, and
performance….
Capability Computing: Sunway and Global Simulation (Bell Prize, 2017)
Juggling: Mastering the options for parallelism? Capability? or Capacity?
Job Stream and Ensemble (embarrassing parallelism), lots of jobs
Shared Memory, the first 34 years of my career until I gave up
MPI (Communicating Sequential Processes),
SIMD
Paradigms …new systems and careers
The Petascale program
Architecture will continue: Specialized chips versus programmable chips
Climate Change Doesn’t Just Happen
Blue Waters hurricane simulation & visualization
Storms are worse than we had thought
vis a vis new effects of CO2 mixing
Modeling the climate…
European Centre for Medium-Range Weather
Forecasts (ECMWF) ISC19
No magic bullets for exa-ops
Op rate is constrained by a 2-3 GHz clock
Exa ops: 10^18 / (2 x 10^9), or half-a-billion-fold parallelism
1000s-10,000s of independent, interconnected computers
Intra-computer parallelism
– SIMD, e.g. GPU
– Specialized chip, e.g. Google TPU
– FPGA, e.g. unrolling the loops
Simpler ops versus double-precision floating point help
– 8- or 16-bit ops -> 16x
Courtesy Jack Dongarra
Algorithms and Moore's Law
Advances over 36 years, or 24 doubling times for Moore's Law: 2^24 ≈ 16 million, the same as the factor from algorithms alone!
[Chart: relative speedup vs. year]
[Chart: “Moore’s Law” for combustion simulations: Log Effective GigaFLOPS vs. Calendar Year, 1980-2010, showing algorithmic advances (AMR, Low Mach, higher-order AMR, ARK integrator with complex chemistry, high-order autocode) alongside machines (Cray 2, NERSC RS/6000, NERSC SP3).]
“Moore’s Law” for MHD simulations
“Semi-implicit”: all waves treated implicitly, but still stability-limited by transport
“Partially implicit”: fastest waves filtered, but still stability-limited by slower waves
The power of optimal algorithms
Advances in algorithmic efficiency rival advances in hardware architecture.
Consider Poisson's equation, ∇²u = f, on a cube of size N = n^3 (a 64 x 64 x 64 grid):

Year  Method        Reference                Storage  Flops
1947  GE (banded)   Von Neumann & Goldstine  n^5      n^7
1950  Optimal SOR   Young                    n^3      n^4 log n
1971  CG            Reid                     n^3      n^3.5 log n
1984  Full MG       Brandt                   n^3      n^3

If n = 64, this implies an overall reduction in flops of ~16 million.
*Six months is reduced to 1 s
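Worked out from the table (a sketch of the arithmetic only, using the slide's own exponents):

% Flop reduction from banded Gaussian elimination (1947) to Full Multigrid (1984), n = 64
\frac{\mathcal{O}(n^{7})}{\mathcal{O}(n^{3})} = n^{4} = 64^{4} \approx 1.7\times 10^{7}\quad (\text{the}\ \sim 16\ \text{million factor above})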
… and we started calling them supercomputers
The ideal Supercomputer:
Speed, memory, and parallelism (scaling)
Clock speed that increases with time
One, very large and scalable memory for any number of processors
Overlap of memory access and instruction execution
Parallelism of a single instruction stream including look-ahead and execution
Pipelining
Vector processing
Multiprocessors—(Scale up)
Multiple streams & multi-threading scalability
Multiple independent interconnected computers (Scale out)
aka clusters aka multicomputers
Multiprocessor nodes aka constellations
Multi-threading and vector processing
Stream processing using GPUs
Direct compilation of algorithms into FPGAs and application specific chips
HPC aka Supercomputing ideas & machines:
It’s about speed, parallelism, standards, & cost (COTS)
1. The Cray Era (1964-1993): Mono-memory computer
Increased clock: high kHz => MHz => low GHz (10,000x)
Intra-processor parallelism techniques (10x > 100x vector)
Shared memory, multi-processor parallelism (1 > 32)
2. “Killer Micros” transition (1984-1993)
Searching for the way… scalability, CMOS micros, standards, failures
Similar to search for the first supers, and for xxMD way.
1987… Prize to acknowledge parallelism
3. The Multicomputer aka Clusters era (1984-present)
Parallelism: <10 => 1000x => 10,000 => 100,000 => million… billion
Now it is up to programmers to exploit parallelism
Copyright Gordon Bell
[Chart: Linpack (GFLOPS) vs. year of introduction, 1960-2020, log scale from 0.0001 to 100,000,000, for the CDC 6600, CDC 7600, Cray 1, Cray XMP, Cray YMP 16, Intel Delta, TM CM5, Fujitsu NWT, IBM Power3, NEC ES, BlueGene, Tianhe Intel, Fujitsu SPARC, BG/Q Cray, and the exascale target.]
Seymour Cray mono-memory computers: 40% per year (100x per decade)
Multicomputers aka clusters: 2x per year (1000x per decade)
Copyright Gordon Bell
1. 1945,6 EDVAC Recipe and IAS Architecture. A dozen IACS were built e.g. Illiac, Johniac, Maniac,
2. 1957 Fortran first delivery establishes standard for scientific computing… FORTRAN 2008
3. 1960 LARC and 1961 Stretch—response to customer demand; 1962 Atlas commissioned
4. 1964 CDC 6600 (.48 MF) introduces parallel function units. Seymour establishes 30 year reign
5. 1965 Amdahl’s Law: single processor vs multi-P’s or vectors
6. 1976 Cray 1 (26 MF) introduces first, practical vector processor architecture.
7. 1982 Cray XMP… (PK: 1 GF) Intro of mP. The end of the beginning of mono-memory
8. 1982, 83 Caltech Cosmic Cube demo of multicomputer with 8, 64 computers. New beginning.
9. 1984 NSF Establishes Supercomputing centers at NCSA, SDSC, and Cornell, +Pittsburgh
10.1987 nCUBE (1K computers) 400-600 speedup, Sandia wins first Bell Prize.
1988 Gustafson’s Law as Amdahl’s Law Corollary (Simon #1)
11.1992 Intel Touchstone Delta at Sandia Reaches 100 GF
12.1993 CM5 (60 GF Bell Prize) 1024 Sparc computers. Cray C90 was 16! No way or plan to compete
13.1993 Top500 established using LINPACK Benchmark. (Simon #10) Begin multicomputer era
14.1994 Beowulf kit and recipe for multicomputers and MPI-1 Standard established
15.1995 ASCI > Advanced Simulation and Computing (ASC) Program
16.1996 Seymour R. Cray dies in a car accident. Building a shared memory computer using itanium
17.1997 ASCI Red (1 TF) at Sandia
18.1999 The Grid: Blueprint for a New Computing Infrastructure (Simon #8)
19.2008 IBM BlueGene (1.5 PF)
20.2012 Cray Titan (17.6) GPU and CUDA
21. Tianhe-2 at NUDT; 2016 Sunway TaihuLight achieves 93 PF with >10M cores
Top 20+ HPC seminal events: recipes, standards, plans, and prizes
Copyright Gordon Bell
1945,6 EDVAC Recipe & IAS Architecture.
Over a dozen IACS were built e.g. Illiac, Johniac, Maniac,
1957 FORTRAN establishes standard for scientific computing
FORTRAN 2008
1960 LARC and 1961 Stretch—response to customer demand;
1962 Atlas commissioned
1964 CDC 6600 (.48 MF) “first super” parallel function units.
S. R. Cray begins 30-year reign as “the supercomputer designer”
1965 Amdahl’s Law: Defines difficulty to speed up computation
with various forms of parallelism hardware
1976 Cray 1 (26 MF) first, practical vector processor architecture.
1982 Cray XMP…C90 (.5-16 GF) Intro of shared memory mPv.
The beginning of the end of mono-memory computing
1993 CM5, 1024 multicomputer beats mono memory Cray
Top 20 HPC seminal events: Machines, recipes, standards, plans, and prizes
Copyright Gordon Bell
SDSC flash Gordon
2 PFlops, 47K cores, 247 TBytes, 64 TBytes of flash
Bridget Gordon Bell, C. Gordon Bell, Robert Sinkovitz (SDSC)
I never pass up an opportunity to visit a computing center
1960: “We need the largest computer you can build”
UNIVAC, IBM, and Manchester U.
Three efforts to build the world’s largest computer
establishes a unique computer class for scientific computing
UNIVAC LARC … Livermore /Univac spec’d, decimal (Univac/Eckert)
IBM Stretch … lookahead, pipelining for LANL
Manchester/Ferranti Atlas … paging and one-level store
Mainframe? Supercomputer?: LARC
Begun in 1955 for Livermore and delivered in 1960
Had dual processors and decimal arithmetic
New surface-barrier transistors and core memory
Decimal Arithmetic
Courtesy of Burton Smith, Microsoft
LARC at LLNL c1960
Sid Fernbach, Harold Brown, Edward Teller
Mainframe? Supercomputer?: Stretch, Harvest
IBM 7030 (STRETCH)
Delivered to Los Alamos 4/61
Pioneered in both architecture
and implementation at IBM
IBM 7950 (HARVEST)
Delivered to NSA 2/62
Was STRETCH + 4 boxes
IBM 7951 Stream unit
IBM 7952 Core storage
IBM 7955 Tape unit
IBM 7959 I/O Exchange
Courtesy of Burton Smith, Microsoft
IBM SMS Modules for Stretch
IBM Stretch c1961 & 360/91 c1965 consoles!
Ferranti/Manchester Atlas c1961
(One million instructions per second)
Mainframe? Supercomputer?
Mastering the skills of parallelism
Pipelining Look-ahead, …Multi-threading
Fortran 1957, ’60, … ’08 Spec’ing a Supercomputer
CDC 6600 #1 Console & frame c1964 LLNL
First Supercomputer?
Neil Lincoln's Reminiscences of computer architecture and design at Control Data Corporation
NEIL: Nevertheless, Seymour did have a conception of FORTRAN when he was building the 6600. He didn't build a machine and then adapt FORTRAN to it.
MARIS (re 7600): The really crucial thing was that
Seymour was the key designer as well as the
architect. He understood the whole thing and kept it
under control.
CDC 6600 block diagram
2^17 60-bit words of memory
400,000 transistors, MTBF 2000 hrs.
MTBF 2000 hrs.
CDC 6600 Processor
…first RISC?
LLNL Octopus hub PDP-6 c1965
256K words, 36 bits/word
2 x 10 x 25 = 500 modules; 5,000 transistors? 1,900 watts
Thought in 1964 when I heard about the 6600: “Holy s**t!” How did he do that?
Digital's PDP-6 was being built and introduced
10x less expensive ($300 K vs. $3 M)
6600 had 600K transistors; 4-phase, 10 MHz clock
“6” had 5,000 transistors: 2 bays x 10 5” crates x 25 = 500 modules; clock ran asynchronously at 5 MHz.
Successor PDP-10 ran at 10 MHz
PDP-6 vs. CDC 6600
Attribute                  PDP-6                          6600              Factor
Transistor type, intro     Ge, c10/64                     Si, c9/64
No. transistors            500 boards x 10; mem 50 x 10   400,000           400?
  (not counting diodes)
Clock (ns)                 200                            4-phase, 100      2-4
Power (watts)              1,900                          150,000           79x
Cost                       $250-500K                      $7-10 million     14-20
Adams cost per month       6.2-30                         62-91             10
Adams add time             4.4                            .3                14.7
Roberts cost ($M)          1                              5.5               5.5
Roberts perf. (bits/us)    9                              145               16
Roberts bits/sec/$         27                             9                 3
Degree of parallelism      1, mem. overlap                Lots happening!
Successors                 DecSystem 10, 20               6400, … 7600
Team size                  17 including SW                35
Two CDC 7600s at LLNL c1969
Courtesy of Burton Smith, Microsoft
Search for a parallel architecture constrained by Amdahl's Law
Three approaches… that didn’t work
Illiac IV (SIMD) m=64
CDC STAR and ETA10 vectors in memory
TI ASC (Vector arch; Cray 1’s 5x clock)
Then success
Cray 1 vector architecture
Amdahl's law c1967 … the limit of parallelism
If w1 work is done at speed s1 and w2 at speed s2, the average speed s is (w1+w2)/(w1/s1 + w2/s2).
This is just the total work divided by the total time.
For example, if w1 = 9, w2 = 1, s1 = 100, and s2 = 1, then s = 10/1.09 ≈ 9.
Amdahl, Gene M, “Validity of the single
processor approach to achieving large
scale computing capabilities”,
Proc. SJCC, AFIPS Press, 1967
Courtesy of Burton Smith, Microsoft
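Restating the slide's example as a worked formula (nothing beyond the numbers given above):

% Amdahl's law: work w_1 at speed s_1, w_2 at speed s_2
s = \frac{w_1 + w_2}{w_1/s_1 + w_2/s_2}
  = \frac{9 + 1}{9/100 + 1/1}
  = \frac{10}{1.09} \approx 9.2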
ILLIAC IV: U of IL, at NASA in 1971
Courtesy of Burton Smith, Microsoft
1964 project
(U. of IL)
Burroughs
contract to build
SIMD 64 PEs
10 MB disk/PE
Moved to NASA
1975 on ARPAnet
Resides at the
Computer History
Museum
Cray-1 c1976: Supercomputer vector era
Courtesy of Burton Smith, Microsoft
Steve Squired, DARPA SCI 1985
Cray 1 sans covers
The Vector ISA
Unlike the CDC Star-100, no
development contract
Los Alamos got a one-year
free trial. Los Alamos
leased the system.
Los Alamos developed or
adapted existing software
Cray-1 and Amdahl’s law
Scalar performance
2X the 7600
Vector 160 Mflops
80 MHz clock
Peak floating point ops vs.
Instructions per second
“Supercomputer” connotes
a Cray-1
Courtesy of Burton Smith, Microsoft
Cray 1 processor block diagram… see 6600
Shared Memory: Cray Vector Systems
Cray Research, by Seymour Cray
Cray-1 (1976): 1 processor
Cray-2 (1985): up to 4 processors*
Cray Research, not by Seymour Cray
Cray X-MP (1982): up to 4 procs
Cray Y-MP (1988): up to 8 procs
Cray C90: (1991?): up to 16 procs
Cray T90: (1994): up to 32 procs
Cray X1: (2003): up to 8192 procs
Cray Computer, by Seymour Cray
Cray-3 (1993): up to 16 procs
Cray-4 (unfinished): up to 64 procs
All are UMA systems except the X1,
which is NUMA
*One 8-processor Cray-2 was built
Cray-2
Courtesy of Burton Smith, Microsoft
Four Decades 1976…2016: a million, million
Fenton-Tantos desktop Cray 1 or XMP at 0.1 size
(Spartan-3E-1600 development board)
A new beginning: “the killer micros”
A transition decade
1982, 83 Caltech Cosmic Cube multicomputer with 8, 64 computers.
DARPA Strategic Computing Initiative and Japanese Fifth Gen.
1984 NSF Establishes Supercomputer Centers at Illinois & San Diego
1987 First Bell Prize for parallel programming awarded to Sandia nCUBE (1K computers), 400-600 speedup
1988 Gustafson’s Law as Amdahl’s Law Corollary (Simon #1)
1989 Parallel Virtual Machine distributed memory programming
1992 Intel Touchstone Delta at Sandia Reaches 100 GF
1993 CM5 (60 GF Bell Prize) 1024 Sparc computers.
Cray C90 was 16! No way/plan to compete with ECL, shared memory
1993 Top500 established, using LINPACK Benchmark. (Simon #10) Begin multicomputer era
1994 Beowulf kit and recipe for multicomputers, and MPI-1 Standard released.
Copyright Gordon Bell
Capacity Computing … independent parallelism
Capability … parallelism for a single job
1983 Caltech Cosmic Cube: 8-node prototype (’82) & 64-node ’83
Intel iPSC 64 Personal Supercomputer ’85
Copyright Gordon Bell
1982-1984: The Lax Report to NSF's NSB for NSF Advanced Scientific Computing
Gresham's Law: VAXen are KILLING supercomputers
NSF needs to fund supercomputer access and centers
1984: NSF Establishes Office of Scientific Computing
NCSA at U IL (1985)
SDSC at UCSD
Cornell National Supercomputer Facility
Pittsburgh Supercomputer Center
John Von Neumann Center at Princeton etc.
1986 CISE (Computer and Information
Science and Engineering) Directorate
Research focus on parallelism!
Knuth, Thompson, Karp disagree
Gordon Bell with VAX minicomputer c1978
VAX c1978: 5 MHz
Cray 1 ’76, XMP 4 ’82, YMP 16 ’88: 120 MHz
1986 Lax Report …
VAX is killing supercomputing
1986 NSF Centers: NCSA, SDSC,
Pittsburgh, Cornell, JVNC.
1987 Bell starts up NSF CISE
Bell Prize for Parallelism, July 1987
Alan Karp:
Offers $100 for a
program with 200 X
parallelism by 1995.
Bell, 1987 goals:
10 X by 1992
100 X by 1997
Researcher claims:
1 million X by 2002
Gustafson's Law
Benner, Gustafson, Montry: winners of the first Gordon Bell Prize
S(P) = P – α x (P – 1)
where P is the number of processors, S is the speedup, and α is the non-parallelizable fraction of any parallel process
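For contrast with Amdahl's fixed-size limit, a small illustrative comparison at the same serial fraction (the numbers here are mine, not from the slide):

% Scaled (Gustafson) vs. fixed-size (Amdahl) speedup for alpha = 0.01, P = 1024
S_{\text{Gustafson}}(P) = P - \alpha(P-1) = 1024 - 0.01 \times 1023 \approx 1014
\qquad
S_{\text{Amdahl}}(P) = \frac{1}{\alpha + (1-\alpha)/P} = \frac{1}{0.01 + 0.99/1024} \approx 91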
1989: The “killer micros” –Eugene Brooks, LLNL
Challenge: how do you utilize (program)
a large number of interconnected,
independent computers?
Top500 #1 @ 60 Gflops: June 1993
1994: MPI 1.0 Message Passing Interface
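To make the programming model concrete, a minimal MPI sketch in C, in the communicating-sequential-processes style MPI-1 standardized (illustrative only; any MPI implementation's mpicc and mpirun should work):

/* ring.c: each rank forwards a token to the next rank around a ring. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size > 1) {
        int next = (rank + 1) % size;
        int prev = (rank + size - 1) % size;
        if (rank == 0) {
            /* Rank 0 starts the token, then waits for it to come back. */
            MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("token returned to rank 0 after visiting %d ranks\n", size);
        } else {
            /* Everyone else receives from the left and passes to the right. */
            MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}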
Beowulf: computer cluster by Don Becker & Tom Sterling, NASA 1994
BSD, LINUX, Solaris,
and Windows Support
for MPI and PVM
Goodyear Aerospace MPP SIMD
Gould NPL
Guiltech
Intel Scientific Computers
International Parallel Machines
Kendall Square Research
Key Computer Laboratories searching again
MasPar
Meiko
Multiflow
Myrias
Numerix
Pixar
Parsytec
nCUBE
Prisma
Pyramid Early RISC
Ridge
Saxpy
Scientific Computer Systems (SCS)
Soviet Supercomputers
Supertek
Supercomputer Systems
Suprenum
Tera > Cray Company
Thinking Machines
Vitesse Electronics
Wavetracer SIMD
Lives Lost: The search for parallelism c1983-1997
DOE and DARPA Strategic Computing Initiative
ACRI French-Italian program
Alliant Proprietary Crayette
American Supercomputer
Ametek
Applied Dynamics
Astronautics
BBN
CDC >ETA ECL transition
Cogent
Convex > HP
Cray Computer > SRC GaAs flaw
Cray Research > SGI > Cray Manage
Culler-Harris
Culler Scientific Vapor…
Cydrome VLIW
Dana/Ardent/Stellar/Stardent
Denelcor
Encore
Elexsi
ETA Systems aka CDC;Amdahl flaw
Evans and Sutherland Computer
Exa
Flexible
Floating Point Systems SUN savior
Galaxy YH-1
1994 Meeting with Jim Gray
“the day I gave up on shared memory
computers”
Copyright G. Bell and J. Gray 1996
1994: Computers will all be scalable (for the web, vs. smP)
Thesis: SNAP: Scalable Networks as Platforms
upsize from desktop to world-scale computer
based on a few standard components
Because:
Moore’s law: exponential progress
standards & commodities
stratification and competition
When: Sooner than you think!
massive standardization gives massive use
economic forces are enormous
Network
Platform
1993 CM5, 1024-computer cluster. Top500
1995 ASCI > Advanced Simulation and Computing (ASC) Program
1996 Seymour R. Cray is killed in a car accident… was building a shared-memory computer using Itanium
1997 ASCI Red (1 TF) at Sandia, 9K computers
2008 IBM Blue Gene (1.5 PF)
2012 Cray Titan (17.6 PF) GPU and CUDA
Tianhe-2 at NUDT; 2016 Sunway TaihuLight achieves 93 PF with >10M cores
2018 ORNL Summit Top500 #1: 148.5 PF, 2.4 Mcores, 10 MWatts, 3 GHz
2018 $500 M commitment to deliver one+ exaflops to ANL
The Multicomputer aka Clusters Era
Copyright Gordon Bell
ASCI Red 1997-2005 at Sandia National Lab
June 1997-2000
1.3-2.5 Tflops
9,216-9632 proc
640 disks,
1,540 PS,
616 interconnect
Japanese Earth Simulator (NEC)
2002-2004: 35 Teraflops, 5,000 vector processor computers
…Stimulant for ASCI
LLNL Sequoia 17 Petaflops… Top500 #1 June 2012
4 threads/core
16 cores/chip
1024 chips/rack
96 racks per system
1.57 million processors
1 Gigabyte/processor
1.6 PetaBytes primary memory
80 KWatt/rack
7.7 MWatts
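A quick check that the per-rack figures above add up (arithmetic only, not from the slide):

% Sequoia core and thread counts: 16 cores/chip x 1024 chips/rack x 96 racks
16 \times 1024 \times 96 = 1{,}572{,}864 \approx 1.57\ \text{million cores};
\quad \times\, 4\ \text{threads/core} \approx 6.3\ \text{million hardware threads}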
ORNL Summit Top500 2018 #1
148.5 PF (201 PF peak), 2.4 Mcores, 10 MWatts, 3 GHz
Copyright Gordon Bell
IBM Power System AC922 node
4,600 compute nodes
22 GB/s non-blocking links
two IBM POWER9 processors and four threads
2 SIMD Multi-Core (SMC)
512 GB of DDR4 memory
six NVIDIA Volta V100 accelerators
80 streaming multiprocessors (SMs)
32 FP64 (double-precision) cores, 64 FP32 (single-precision) cores, 64 INT32 cores, and 8 tensor cores per SM
96 GB for accelerators
1.6 TB of non-volatile memory
GBell Prize Platforms 1987-2016 (3 Decades)
Year       Parallelism  Machine 1           Machine 2
1987 (Mf)  600          nCUBE 1K            Cray XMP
1988 (Gf)  800          Cray YMP            nCUBE, iPSC
1989       1100         TM CM2, 1K PE       PLA
1990       16,000       TM CM2, 16K         iPSC 860
1991       -            -
1992       500          Intel Delta         Dist. WS
1993       1000         TM CM5              SNAP
1994       1904         Intel Paragon       Cluster WS
1995       128          Fujitsu NWT         Grape P:288
1996       196          Fujitsu NWT         Grape 1296
1997       4096         ASCI Red            Cluster Alpha
1998       9200         Cray T3E/1024PE     ASCI Red
1999 (Tf)  5832         Blue Pacific        Grape 5
2000                    Grape 6             Cluster WS
2001       1024         IBM 16 mP cluster   Distributed Cs
2002                    NEC ES              ASCI White
2003       1944         NEC ES
2004       4096         NEC ES              1.34TF Grape
2005       131,072      BlueGene/L LLNL
2006       65,536       BlueGene            MD Grape
2007       200,000      BlueGene
2008       196,000      Jaguar, Oak Ridge
2009 (Pf)  147,464      Jaguar, Oak Ridge   Anton 1; GPUs
2010       200,000      Jaguar, Oak Ridge
2011       442,368      Fujitsu K           CPU:16K/GPU:4K
2012       82,944       Fujitsu K
2013       1,600,000    Sequoia
2014                    Anton 2
2015       1,500,000    Sequoia
2016       10,600,000   Sunway TaihuLight
Copyright Gordon Bell
2 CMU
2 Cornell
2 D E Shaw
2 FSU
2 Fujitsu
2 HIT
2 JAMSTEC
2 Max Planck Inst
2 Mobil
2 Nagoya U
2 Pittsburg SC
2 THC
2 TMC
2 Tokyo Inst of Tech
2 Tokyo U
2 Tskuba U
2 U Colorado
2 U MN
2 U TX
2 UC/Berkeley
118
3 Japan AEC
3 NEC
3 Cray
3 ETH
3 Japan
Marine
3 NAL
3 NYU
3 U Tokyo
3 Yale
4 Argonne
4 Intel
4 Riken
Bell Prize
winner orgs.
7 Caltech
7 IBM
6 LANL
6 LLNL
6 ORNL
6 Sandia
5 Earth Sim.
Ctr.
Abuques
Ansoft
Bejing Normal U
Brown
BTL
Center of Earth System
Columbia
Emory
Fermilab
Found MST
GA Tech
Hiroshima U
HNC
IDA
Inst fir Frontier Res
Japan NAL
Keio U
LBNL
MIT
Munich Tech U
Nagoya U
NAO Japan
NASA Ames
NASA Goddard
NASA Langley
Nat Space Dev
NCAR
Next Gen SC
NRL
NSC
Ohio State
Old Dominion U
Penn State
Purdue
Rutgers
Sandia
Tel Aviv U
THC
Traco
Tsingua U
U Chicago
U de Louviain
U IL
U Messina
U MI
U Milano
U NM
U of Bristol
U of Chinese Acad
U of Electro-communication
U of IL
U of TN
U Penn
U Sydney
UC/Davis
United Tech
Vienna U of Tech
Wills Phy Lab
Yamagata U
68 Copyright Gordon Bell
[Chart: number of processors and Linpack Gflops vs. year, log scale from 1 to 1,000,000,000, with Bell Prize (Gflops) points and an exponential fit to the Linpack Gflops trend.]
Copyright Gordon Bell
Sunway 100+ Petaflops computer, 10.6 million cores
40 cabinets x 4 supernodes/cab
x 256 nodes/supernode (= 32 boards x 4 cards x 2 nodes/card)
x 4 x (1 + 8 x 8) processing elements/node
Node peak performance of (256 cores * 8 flops/cycle * 1.45 GHz) + (4 cores * 16 flops/cycle * 1.45 GHz) = 3.0624 Tflop/s per node
Node memory 4 x 8 GBytes: 32 x 10^9 Bytes / 3 x 10^12 flops/sec = 0.01 Bytes/Operation
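Multiplying out the hierarchy above (a consistency check using only the slide's own numbers):

% Sunway TaihuLight node and core counts
40 \times 4 \times 256 = 40{,}960\ \text{nodes}, \qquad
40{,}960 \times 4 \times (1 + 64) = 10{,}649{,}600\ \text{cores}
% node peak: 256 x 8 x 1.45 + 4 x 16 x 1.45 = 2969.6 + 92.8 = 3062.4 Gflop/s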
CPUs, GPUs, … all those registers
Sunway Top500 2016-17
• 74.15% efficient (peak at 125 Pflop/s)
• The number of cores is 163,840 x (1 + 64) = 10,649,600 cores for the HPL run.
Size of the matrix, n = 12,288,000 (1.2 PB) … 100 MBytes/node
• 163,840 MPI processes, which corresponds to 4 x 40,960 CGs in the system.
• Logical process grid of p x q = 256 x 640 (2^14 x 10 = 16K x 10)
• Each CG (Core Group) has one MPE and 64 CPEs,
so within each MPI process, 64 threads use the 64 CPEs.
• Time to complete benchmark run: 13,298 seconds (3.7 hours)
• Average 15.371 MW
• 6 Gflops/W
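These numbers are mutually consistent (a back-of-envelope check using the standard HPL flop count, 2n^3/3):

% HPL flops and sustained rate for n = 12,288,000 and t = 13,298 s
\tfrac{2}{3} n^{3} \approx \tfrac{2}{3}(1.2288\times 10^{7})^{3} \approx 1.24\times 10^{21}\ \text{flops},\qquad
\frac{1.24\times 10^{21}}{1.3298\times 10^{4}\ \text{s}} \approx 9.3\times 10^{16}\ \text{flop/s} \approx 93\ \text{Pflop/s}
% which is ~74% of the 125 Pflop/s peak, matching the quoted efficiency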
2016 Bell Prize: Climate Modeling
Yifeng Cui, SDSC
team member
LLNL Sierra (Top500 2018) #2
Vendor: IBM
CPU clock speed (GHz): 3.4
Nodes: 5 login nodes, 4,320 GPU nodes, 4,474 total nodes
CPUs: IBM Power9 architecture, 44 cores/node, 190,080 total cores
GPUs: NVIDIA V100 (Volta), 17,280 total GPUs, 4 GPUs per compute node, 7.00 TFLOP/s peak per GPU (double precision)
Memory: 1,382,400 GB total, 256 GB CPU memory/node, 64 GB GPU memory/node, 170 GB/s peak single-CPU memory bandwidth
Peak performance: 4,666 TFLOPS (CPUs), 120,960 TFLOPS (GPUs), 125,626 TFLOPS (CPUs+GPUs)
OS: RHEL
Interconnect: IB EDR
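The peak figures follow directly from the GPU count (a simple cross-check of the table):

% Sierra GPU peak: 17,280 GPUs x 7.00 TFLOP/s each
17{,}280 \times 7.00 = 120{,}960\ \text{TFLOPS (GPUs)};\qquad
120{,}960 + 4{,}666 = 125{,}626\ \text{TFLOPS total}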
ORNL Summit Top500 2018 #1
148.5 PF (201 PF peak), 2.4 Mcores, 10 MWatts, 3 GHz
Copyright Gordon Bell
IBM Power System AC922 node
4,600 compute nodes
22 GB/s non-blocking links
two IBM POWER9 processors and four threads
2 SIMD Multi-Core (SMC)
512 GB of DDR4 memory
six NVIDIA Volta V100 accelerators
80 streaming multiprocessors (SMs)
32 FP64 (double-precision) cores, 64 FP32 (single-precision) cores, 64 INT32 cores, and 8 tensor cores per SM
96 GB for accelerators
1.6 TB of non-volatile memory
Copyright Gordon Bell
Parallelism (a code sketch of the within-node levels follows this list)
Compute nodes, o(1000s): many jobs, many instances of a job (ensemble computing)
Internode communication, or communicating sequential processes
Multiple sockets (chips) per node, 1…8, may share memory
Multiple cores, 2-44
Multiple threads, 1-2
» Instruction-level parallelism, pipelining into GPU ALUs, o(50)
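As a concrete illustration of the within-node levels (threads across cores plus SIMD lanes within a core), a minimal OpenMP sketch in C; a compiler flag such as -fopenmp is assumed:

/* daxpy.c: thread-level plus SIMD parallelism inside one node. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N], y[N];
    double a = 2.0, sum = 0.0;

    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* 'parallel for' splits iterations across threads (cores);
       'simd' asks the compiler to vectorize each thread's chunk. */
    #pragma omp parallel for simd reduction(+:sum)
    for (int i = 0; i < N; i++) {
        y[i] = a * x[i] + y[i];
        sum += y[i];
    }

    printf("max threads: %d, checksum: %.1f\n", omp_get_max_threads(), sum);
    return 0;
}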
Copyright Gordon Bell
Parallel levels by grain size (a minimal hybrid sketch follows this list)
Job stream parallelism aka ensembles, capacity computing
Communicating, sequential processes with multiple threads
Shared memory
GPU, CUDA, accelerated computing
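A hedged sketch of the second level, combining MPI ranks between nodes with OpenMP threads within a rank (illustrative C only; compile with something like mpicc -fopenmp):

/* hybrid.c: coarse-grain MPI between ranks, fine-grain OpenMP within a rank. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local = 0.0, global = 0.0;

    /* Fine grain: shared-memory threads inside this rank. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++)
        local += 1.0;

    /* Coarse grain: message passing across ranks (distributed memory). */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d ranks x %d threads, total = %.0f\n",
               nranks, omp_get_max_threads(), global);

    MPI_Finalize();
    return 0;
}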
Four Science Paradigms, Jim Gray, Jan. 2007
1. Thousand years ago:
science was empirical
describing natural phenomena
2. Last few hundred years:
theoretical branch
using models, generalizations
3. 1987, since ENIAC & FORTRAN:
a computational branch
simulating complex phenomena
4. 2010 Data-intensive science :
data exploration (eScience)
unify theory, experiment, and simulation
Data captured by instruments
Or generated by simulation
Processed by software
Information/Knowledge stored in computer
Scientist analyzes database / files
using data management and statistics
Jim Gray NRC-CSTB 2007-01
Third and Fourth Paradigms of science
1987 Ken Wilson, Nobel Prize winner, declares: “Computation is the 3rd Paradigm”
2008 DOE Dir. Of Science
discovers 3rd Paradigm
Nov 2010 “The Big Idea: The
Next Scientific Revolution -
Harvard Business Review”
2007 Jim Gray @NRC CSTB:
Data Science is 4th Paradigm
More paradigms? What's Next?
Visualization: described in 1987, re-discovered in 2000, and re-re-discovered (every 20 years something is rediscovered)
Very large, coupled, complete, & complex models e.g. climate
simulation
Data Science recognized in 2010 to manage data
Data Scientist has become a profession
AI for Machine Learning is next big thing for HPC
Copyright Gordon Bell
Exascale Apps
Copyright Gordon Bell
Exascale program goals
The software must be:
interoperable
sustainable
maintainable
adaptable
portable
scalable
deployed at DOE computing
facilities
Software must Work:
Easy to use
Understandable
Perform well
Outperform anything out there
Competitive
Validated and Verified
Copyright Gordon Bell
Architectures for Apps, “MSFT Consulting Eng.”
Ending of a 50 year exponential obviously has huge ramifications
We will see a Cambrian explosion of new hardware in the cloud
This heterogeneity will be disruptive in many respects
We can back into it with massive amounts of programmable hardware
We need ways of applying “architecture” that are not manual or ad hoc
Innovation will continue but will be more surprising / less predictable
We should celebrate this world-changing era but keep going!
TPU: High-level Chip
Architecture
4 MiB of on-chip Accumulator
memory
24 MiB of on-chip Unified
Buffer (activation memory)
3.5X as much on-chip memory
vs GPU
The Matrix Unit: 65,536
(256x256) 8-bit multiply-
accumulate units
700 MHz clock rate
Peak: 92T operations/second (= 65,536 * 2 * 700M)
>25X as many MACs vs GPU
>100X as many MACs vs CPU
Two 2133MHz DDR3
DRAM channels
8 GiB of off-chip weight DRAM memory
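Multiplying out the quoted figures (just the arithmetic behind the 92T number above):

% TPU peak: MAC count x 2 ops per multiply-accumulate x clock rate
65{,}536 \times 2 \times 7\times 10^{8}\ \text{Hz} \approx 9.2\times 10^{13}\ \text{ops/s} \approx 92\ \text{Tops/s}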
Temporal versus Spatial Computing
[Diagram: a CPU computes temporally, streaming a sequence of instructions past the data; an FPGA computes spatially, streaming data past instructions laid out across the chip.]
The end
Copyright Gordon Bell