
Cray 1 Supercomputer Performance Comparisons With Home Computers Phones and Tablets


Abstract

The main comparisons are based on benchmark results used to verify performance of the first Cray 1, with variations from two similar vintage benchmarks. Maximum MFLOPS performance, 100% vectorisation and multiprocessor effects are also considered. Samples of historic benchmark results are included along with others for the latest compilations. The bottom line is that a mid range 2021 laptop is indicated as being between 226 and 2671 times faster than the Cray 1, over 12 comparisons, depending on the particular application. For lower range ARM processors, the 2020 Raspberry Pi produced gains of between 25 and 400 times. My 2020 mid cost Android based phone achieved 74 to 757 times. There is no one answer.
Roy Longbottom
Contents

Summary and Background Activities: Summary, Reliability Studies, Acceptance Trials, System Evaluation and Trials,
Stress Testing Programs, Met Seymour Cray, External Consultancy, Hands on Cray 1, Collecting Performance Data,
Met Key Benchmark Authors, Influencing Supercomputer Choice, Benchmarking In Japan

My PC Benchmarks: Classic Benchmarks, Netlib Involvement, Livermore Loops, Linpack 100, Whetstone,
Vector Whetstone, MP Whetstone, MP MFLOPS

Detailed Results and Comparisons: Variations, Cray 1, Raspberry Pi, Android, Windows PCs, SIMD Windows and Linux PCs,
Vector Whetstone, MP Whetstone, MP MFLOPS, MP MFLOPS Part 2, MP Livermore Loops, MP MFLOPS 4 to 64 Threads,
Performance Summary, More Advanced Hardware, Disassembled Code, Benchmark Error Reports, Run Time Displays,
Faster Than Expected
Also celebrating the 50th anniversary of the Whetstone Benchmark - 1972 to 2022.
Summary
This report is mainly based on the comprehensive benchmark used to verify performance of the first Cray 1. This
comprises the Lawrence Livermore Laboratory program kernels (aka Livermore Loops), which provide a range of Millions of
Floating Point Operations Per Second (MFLOPS) measurements. In this case, results from my 1990s conversion to all C
code are used.
To support these performance ratings, results are also considered from two similar vintage benchmarks. These are the
Linpack and Whetstone benchmarks. The first is linpack-pc.c, my accepted conversion for PCs, available at Netlib. For the
second, I took over design responsibility from Harold Curnow, the original author, and developed enhanced variations,
including one with 100% vectorisation, the initial target being the first Cray 1 system delivered to the UK.
A selection of available results is provided to demonstrate performance variations and comparisons over the years. Other
important issues can be considered, based on the information provided in my first Raspberry Pi report.
"In 1978, the Cray 1 supercomputer cost $7 Million, weighed 10,500 pounds and had a 115 kilowatt power supply. It was,
by far, the fastest computer in the world. The Raspberry Pi costs around $70 (CPU board, case, power supply, SD card),
weighs a few ounces, uses a 5 watt power supply and is more than 4.5 times faster than the Cray 1"
Background Activities - This provides details of my involvement in evaluating, acceptance testing and benchmarking
mainframe and supercomputer systems for UK Government and University projects, including hands-on Cray 1
development of benchmark and stress test programs.
Results Provided - Livermore Loops MFLOPS minimum, geometric mean (official average) and maximum, Linpack MFLOPS,
Whetstone overall MWIPS and average MFLOPS of appropriate tests. These are all single core benchmarks.
Raspberry Pi ARM CPUs - The comment above was for the 2012 Pi 1. In 2020, the Pi 400 average Livermore Loops,
Linpack and Whetstone MFLOPS reached 78.8, 49.5 and 95.5 times faster than the Cray 1.
Android ARM CPUs - 2012 Android tablet results identified Cray 1 gains with a range from barely there up to 10 times.
My 2021 mid priced phone produced MFLOPS gains of 123, 74 and 151 times.
Windows and Linux PCs Intel CPUs - The first PC to reach the average Cray 1 Livermore Loops score is indicated as a
1994 100 MHz Pentium. Best results for the original benchmarks are for a medium range laptop with a 2021 11th
generation 4150 MHz Core i5 CPU. The three MFLOPS gains were 117, 131 and 134 times.
Advanced SIMD compilations lead to i5 gains of 359, 337 and 226 times.
Multiprogramming Livermore Loops - Four copies of the Advanced SIMD Livermore Loops Benchmark were run at the
same time. This resulted in a MFLOPS throughput gain of 1134 times.
Vector Whetstones - This single core benchmark uses large data arrays that produce 100% vectorisation for all test
functions and was produced to benchmark the first UK Cray 1. Results are included for thirteen 1978 to 1991
supercomputers. For this benchmark, Single and Double Precision (SP and DP) versions are available, the latter could be
appropriate for comparison with supercomputer longer words. Top SP and DP MFLOPS measurements for the Core i5 were
602 and 433 times faster than Cray 1.
Multithreading MP Whetstones - Results are provided essentially from running multiple copies of the mainly scalar
version of the Whetstone benchmark, using 1, 2, 4, and 8 threads, via a single program. It highlights complications due to
varying CPU MHz, according to the number of threads, and benefits of PC Hyperthreading. Single and double precision
versions were run, in this case obtaining similar performance. Eight thread throughput gains over the Cray 1 were
Raspberry Pi 400 times, Android phone 757 times and Core i5 laptop 1521 times.
MP MFLOPS - This executes combinations of floating point multiplications and additions handling SP or DP variables,
intended to demonstrate near maximum performance, again from a single program. For Intel, assembly code listings are
provided for the normally fastest test. Based on the mix of floating point operations, an estimate of Cray 1 maximum
speed, running these, is reduced from 160 MFLOPS to 122. The Core i5 laptop gains, over the revised Cray 1 maximum
rating, were SP 2671 and DP 1317 times via 326 and 161 GFLOPS. Gains on the other devices were Android phone
SP 293 times and Raspberry Pi SP 247 times, both at greater than 30 GFLOPS. This benchmark has a run time
parameter to use up to 64 threads that should demonstrate far superior performance of more advanced CPUs.
Background Activities
Reliability Studies - I worked for the UK Government Central Computer Agency from 1960, initially analysing fault returns
that were contractually required for all new systems. These provided the first detailed statistics included in my book
“Computer System Reliability” published in 1980. I also provided assistance in running acceptance tests, gathering similar
information, over the years, for inclusion in my book.
Acceptance Trials and First Supercomputer Involvement - During the late 1960s, with 20 staff, I was put in charge
of all acceptance trials, taking personal responsibility for top of the range computers. This included organising and
supervising trials of the UK Atlas 2 for Cambridge University Mathematical Laboratory, the earlier 1962 version being said
to be the most powerful supercomputer in the world.
Scientific Systems Evaluation and Acceptance Trials - In the 1970s and early 1980s, with up to 15 staff, I covered
evaluating and acceptance testing of scientific systems, with continuing responsibility for design and acceptance trial
supervision of the larger systems. Between 1972 and 1973 these included an IBM 360/195 for UK Met Office and a CDC
7600 for ULCC (University of London), again said to be the current fastest supercomputers.
Stress Testing Programs - In order to stress test all computers, during acceptance tests and under Operating Systems,
I produced a range of Fortran programs, a few for testing CPUs, with others covering everything from paper tape punches
to disk drives. These had parameters to run for extended periods and were used during hundreds of acceptance tests
from 1972 up until the 1990s. The tests included the Whetstone benchmark, the first accepted general purpose computer
benchmark, produced by my CCTA colleague Harold Curnow. I collected running times of most programs for use
in performance evaluation.
Met Seymour Cray - It must have been 1969, when I was visiting Control Data manufacturing facility in Minnesota, that
I was asked to visit Chippewa Falls in Wisconsin to witness a UK Met Office benchmark run on the CDC 7600. Then, I had
a brief encounter with Seymour Cray, who appeared to run the benchmark. After setting it up, it was all over in a flash,
with Seymour reporting that it took not a lot of milliseconds. This influenced my later development of general purpose
benchmarks to have noticeable running times with ongoing displays of progress.
External Consultancy - CCTA had contractual responsibility for handling procurement of centrally funded university
computers, leading to me becoming an advisor to the Computer Board for Universities and Research Councils, and later a
member of the Technical Sub-Group for Advanced Research Computers. In 1976, I was appointed, as an expert from a
Member State, to join a European Centre for Medium-Range Weather Forecasts Committee, involving procurement of a
new supercomputer, where a Cray 1 became the obvious choice.
Hands-on Cray 1 - My detailed involvement in real supercomputers started in 1978, including a second visit to
Chippewa Falls to evaluate the Cray 1. This was followed by a pre-delivery factory trial, in 1979, for the new AWRE
Aldermaston system. Meanwhile, Cray 1 serial 1 was at the UK Rutherford Laboratory, where I converted all my
appropriate test programs, and Whetstone benchmark, to use the new vector instructions. The on-site acceptance trials
were carried out later in 1979, where the Cray 1 passed with flying colours. This was followed by the same factory and
on-site testing procedures on serial 1 CDC Cyber 205 for UK Met Office, in 1981. That one failed its first factory trial due
to my I/O stress testing program identifying a design fault.
Collecting Performance Data - Next, until my early retirement in 1993, I was mainly involved in performance
consultancy of data processing systems, personally covering more than 60 projects. I also took over design responsibility
for the Whetstone benchmark and continued consultancy on university procurements. Starting during this period, I
collected published details of computers, amounting to more than 2000 mainframes, minicomputers, supercomputers and
workstations, from around 120 suppliers. I also continued collecting Whetstone results, ending with more than 700 from
53 computer manufacturers, covering minicomputers, through mainframes, to supercomputers. These provided the
beginning for my ResearchGate reports, starting with Whetstone Benchmark History and Results.pdf and
Computer_Speed_Claims_1980_to_1996.pdf, also Computer Speeds From Instruction Mixes pre-1960 to 1971.pdf.
Met Key Supercomputer Benchmark Authors - As part of a university benchmark investigation team, I visited the USA
in 1987, including visits to the creator of the Linpack range of benchmarks, Jack Dongarra, in Tennessee, and to the Lawrence
Livermore Research Laboratory, which produced the Livermore Loops (Livermore Fortran Kernels) benchmark. This became
the key supercomputer benchmark for a number of years.
Influencing Supercomputer Choice - In 1988, the Director of University of Manchester Regional Computer Centre
requested my performance analysis of the two competing supercomputers, after part of the evaluation team had claimed
that I would support one of the proposals. I demonstrated the opposite choice, which was accepted, using my scalar and
vector Whetstone benchmark results. This is based on a large number of users, where 90% of programs
can be vectorised. Then, the one with the fastest maximum vector speed, but the lowest for scalar code, lost the battle.
Benchmarking Supercomputers in Japan - My last involvement in supercomputers was for a new one for University of
London Computer Centre, over 1991 and 1992, when I became the independent observer of a benchmark, based on
numerous real applications, at Fujitsu and NEC in Japan. My colleague dealt with Cray, in the USA, which won the contract
with a Y-MP configuration. As confirmed with my simple scalar and vector Whetstone, which I ran then, it was really
comparing multiple pipelines against multiple CPUs, each of the latter with scalar and vector processing capabilities.
My PC Benchmarks
Classic Benchmarks - Following retirement came part time consultancy and eventually creating my website
(roylongbottom.org.uk) to house copies of performance data, collected during my CCTA days (with approval) and a range
of benchmarks, initially concentrating on those for PCs, all for free with no adverts. The first are detailed in my Classic
Benchmarks report, covering Whetstone, Dhrystone, Linpack 100 and Livermore Loops, using C/C++ compiled programs.
Early PCs had poor timer resolution, with benchmarks or functions requiring running times of 5 seconds for consistent
performance. Other requirements included logging results in text files, checking output for consistent numeric results and,
where possible, performance and results displayed as the tests progressed. Besides benchmark reports, identified below, I
developed programs in assembly code, with reports now at ResearchGate in PC CPUID 1994 to 2013, plus Measured
Maximum Speeds via Assembler Code.pdf and PC CPU Specifications 1994 to 2014 plus Measured MIPS and MFLOPS per
MHz.pdf.
Netlib Involvement - Other than for my Whetstone programs, the initial source code was obtained from Netlib, where
my linpack-pc.c code was later accepted and included, for use on PCs. Livermore Loops conversion was time consuming,
where C code was available for the calculations, but data generation, checking and other activities were in Fortran, which
I converted to C. All these followed conversion routes from running under DOS, through variations of Windows, Linux,
including Raspberry Pi, and Android, most at 32 bits and 64 bits. The majority of early results were collected through
involvement in Compuserve Benchmark and Standards Forum.
Livermore Loops - Results shown here are for the second version of this benchmark, comprising 24 loops. Besides
MFLOPS measurements for each of these, summary minimum, maximum and various averages are produced, with
geometric mean being the official average. During my visit to LLL, I was given a copy of the 1986 report “The Livermore
Fortran Kernels: A Computer Test of the Numerical Performance Range”, with over 200 pages of results for minicomputers
to supercomputers. The document appears to be available via the Internet, but I never managed to obtain a free download.
My ResearchGate report Livermore Loops Benchmark Results On PCs and Later Devices.pdf, contains results up to 2017.
As for the other Classic Benchmarks, later details are included in reports covering the different platforms.
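As a reminder of how the official average is formed, the following is a minimal sketch of a geometric mean calculation over a set of loop MFLOPS values. It is illustrative only, not the benchmark's own code, and the sample numbers are arbitrary.

/* Minimal sketch of a geometric mean (the official Livermore Loops average)
   over per-loop MFLOPS values - illustrative, not the benchmark's own code. */
#include <math.h>
#include <stdio.h>

static double geometric_mean(const double *mflops, int n)
{
    double logsum = 0.0;
    for (int i = 0; i < n; i++)
        logsum += log(mflops[i]);      /* sum of logs */
    return exp(logsum / n);            /* nth root of the product */
}

int main(void)
{
    double sample[3] = {10.0, 20.0, 40.0};         /* arbitrary example values */
    printf("%.2f\n", geometric_mean(sample, 3));   /* prints 20.00 */
    return 0;
}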
Linpack 100 - Performance of this original version of the benchmark is dependent on a function that simply calculates
dy[i] = dy[i] + da*dx[i]; but with too many overheads to allow performance to approach the maximum possible.
However, this can be compiled to use linked (fused) multiply and add instructions, such as available on the Cray 1 and
later computers, particularly with SIMD vector capabilities. As for Livermore Loops, run on systems with smaller word sizes
than Cray 1, it is compiled to use double precision arithmetic. Detailed results of this original version, and that for the
later HPL variety for modern supercomputers, are available from Netlib in document Performance of Various Computers
Using Standard Linear Equations Software. My Linpack Benchmark results On PCs and Later Devices is available from
ResearchGate.
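The calculation quoted above is the well known daxpy operation. For reference, a minimal sketch of that inner loop (not the complete linpack-pc.c source from Netlib) is shown below; on hardware with linked or fused multiply and add, the one multiply and one add per element can issue as a single instruction, which is why Linpack benefits from those operations.

/* Minimal sketch of the daxpy style inner loop that dominates Linpack 100 -
   illustrative only, not the complete linpack-pc.c source from Netlib. */
static void daxpy(int n, double da, const double *dx, double *dy)
{
    for (int i = 0; i < n; i++)
        dy[i] = dy[i] + da * dx[i];    /* one multiply and one add per element */
}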
Whetstone - The current versions include the changes I made to produce performance ratings of each of the eight test
functions, particularly to identify cases where some of the code was not being executed (deliberately by some, in the
days of minicomputers). Here, the original overall MWIPS ratings are quoted, along with average MFLOPS for the three
tests carrying out straightforward calculations. Two other tests include functions such as SIN and LOG, where changes in
maths libraries can significantly affect overall MWIPS. This benchmark is based on simple code sequences used in the
1960s, where performance is more inclined to be proportional to CPU MHz until dramatic changes are made in hardware,
like the introduction of additional instructions. My ResearchGate report for this benchmark is Whetstone Benchmark
Detailed Later Results.
Vector Whetstone - The vector version, converted for the Cray 1, executes the same functions as the scalar version,
but covering a number of sequential memory locations, defined by a vector length parameter. The Cray 1
achieved maximum performance at vector lengths of 64 words and above, with a sawtooth pattern. I have a previous
unpublished C/C++ version that I ran on Raspberry Pi 400 and Windows and Linux based PCs, with results, at vector
lengths of 256, being provided below.
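As a simplified illustration of the difference (not the benchmark's actual source), each scalar statement becomes a loop over the chosen vector length, which a vector machine or SIMD compiler can process as whole vectors:

/* Simplified illustration of scalar versus vector Whetstone style statements -
   not the actual benchmark source. VLEN stands for the vector length parameter. */
#define VLEN 256

static void scalar_style(double *x1, const double *x2, double t)
{
    *x1 = (*x1 + *x2) * t;             /* one result per statement */
}

static void vector_style(double x1[VLEN], const double x2[VLEN], double t)
{
    for (int i = 0; i < VLEN; i++)     /* VLEN results per statement, 100% vectorisable */
        x1[i] = (x1[i] + x2[i]) * t;
}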
MP Whetstone - None of the other benchmarks, covered here so far, exercise more than one CPU core. This MP
benchmark currently executes 1, 2, 4 and 8 copies of the standard code via multithreading in a single program. Some
results are included to highlight performance gains over the single CPU in the Cray 1.
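A minimal sketch of the multithreading arrangement is shown below, using POSIX threads; the work() function is only a placeholder standing in for the real Whetstone test code, and the thread counts follow the 1, 2, 4 and 8 pattern described above.

/* Minimal sketch of running 1, 2, 4 and 8 copies of the same code via threads
   in a single program - work() is a placeholder, not the Whetstone test code. */
#include <pthread.h>
#include <stdio.h>

#define MAX_THREADS 8

static double results[MAX_THREADS];

static void *work(void *arg)
{
    int id = *(int *)arg;
    double sum = 0.0;
    for (long i = 1; i <= 10000000L; i++)   /* placeholder calculation */
        sum += 1.0 / (double)i;
    results[id] = sum;                      /* each thread uses its own data */
    return NULL;
}

int main(void)
{
    pthread_t threads[MAX_THREADS];
    int ids[MAX_THREADS];

    for (int nthreads = 1; nthreads <= MAX_THREADS; nthreads *= 2) {
        for (int i = 0; i < nthreads; i++) {
            ids[i] = i;
            pthread_create(&threads[i], NULL, work, &ids[i]);
        }
        for (int i = 0; i < nthreads; i++)
            pthread_join(threads[i], NULL);
        printf("%d thread(s) completed\n", nthreads);
    }
    return 0;
}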
MP MFLOPS - Results for this benchmark have been included for comparison with the maximum MFLOPS possible on the
Cray 1. The benchmark executes tests with 2, 8 and 32 floating point operations per data word, covering larger caches
and RAM. Default operation uses 8 threads but can be changed, up to 64 threads, with a run time parameter. Description
and earlier results are available in MultiThreading Benchmarks.
Detailed Results and Comparisons
Variations - Results provided below cover the Livermore Loops, Linpack and Whetstone benchmarks. Note that the main
Livermore report contains numerous results, covering different compilers and possibly hardware changes. As for the
Whetstone benchmark, slow performance of single test functions can severely impact overall ratings. The chosen results
represent my choice of typical performance.
Cray 1 - The maximum possible hardware performance was said to be 160 MFLOPS for the 80 MHz Cray 1, comprising
linked multiply and add, for two results per clock cycle. The LLL benchmark maximum shown is 82.1 MFLOPS with average
11.9 and minimum 1.2. Then Linpack and Whetstone are 27 and 6 MFLOPS, Linpack benefiting from linked multiply and
add.
Raspberry Pi - In 2013 I ran my benchmarks on the first version of Raspberry Pi. These were essentially the same as
those used on PCs, under Linux. The programs included the Livermore Loops benchmark and that led to me including the
following in a report.
“In 1978, the Cray 1 supercomputer cost $7 Million, weighed 10,500 pounds and had a 115 kilowatt power supply. It was,
by far, the fastest computer in the world. The Raspberry Pi costs around $70 (CPU board, case, power supply, SD card),
weighs a few ounces, uses a 5 watt power supply and is more than 4.5 times faster than the Cray 1.” This refers to
official average geometric mean results.
In 2019 (aged 84), I was invited to become a voluntary member of Raspberry Pi Foundation’s Alpha Testing Team, which I
accepted. This led to me running my benchmarks and stress tests on new top of the range systems before
announcement. The supply of new hardware and software, so far, has led to me producing eight additional PDF reports that
are available from ResearchGate. See Project Log for Performance of Raspberry Pi and Android Devices. Early results are
included in the benchmark specific reports, identified above. Later reports that include links to download the benchmarks
are Raspberry Pi 400 PC 32 Bit and 64 Bit Benchmarks and Stress Tests.pdf and Raspberry Pi 32 Bit and 64 Bit Benchmarks
and Stress Tests.pdf.
The following MFLOPS comparisons are in the order of Livermore Loops average.
Comparison - The three 700 MHz Pi 1 main measurements (Loops, Linpack and Whetstone) were 55, 42 and 94 MFLOPS,
with the four gains over Cray 1 being 8.8 times for MHz and 4.6, 1.6, 15.7 times for MFLOPS.
The 2020 1800 MHz Pi 400 provided 819, 1147 and 498 MFLOPS, with MHz speed gains of 23 times and 69, 42 and 83
times for MFLOPS. With more advanced SIMD options, the 64 bit compilation produced Cray 1 MFLOPS gains of 78.8,
49.5 and 95.5 times.
                     LLLoops MFLOPS       Linpack  Whetstone       CPU   Device
CPU           MHz    Max   Gmean   Min    MFLOPS   MWIPS  MFLOPS    Year   Year
Main Columns V V V V
Cray
Cray 1 80 82.1 11.9 1.2 27 16.2 est 6 1978
XMP1 118 162.2 17.3 2.1 121 30.3 11 1985
Cray 1 Whets MFLOPS estimated based on XMP results
Raspberry Pi 32 bit
Pi CPU
1 1176JZF 700 148 55 17 42 271 94 2001 2012
2 A7 900 248 115 42 120 525 244 2011 2014
3 A53 1200 436 184 56 176 725 324 2012 2016
4 A72 1500 1861 679 180 764 1883 415 2015 2019
400 A72 1800 2262 819 217 1147 2258 498 2015 2020
Raspberry Pi 64 bit
400 A72 1800 3353 938 242 1337 2505 573 2015 2020
Rpi 1/Cray 1 8.8 1.8 4.6 13.8 1.6 16.7 15.7
64 bit/32 bit 1.0 1.5 1.1 1.1 1.2 1.1 1.1
64 bit/Cray 1 22.5 40.8 78.8 201.7 49.5 154.6 95.5
Main Columns # # # #
Android
In 2012, I converted my benchmarks to run via Android, in native ARM code, requiring Java front end programs. The latest
versions identify the hardware, automatically running code for ARM or Intel CPUs, under 32 bit or 64 bit Operating
Systems. In the early days I obtained lots of results with similar performance, a sample of these being provided below.
Comparison - The first results were for tablets that did not have hardware or software to support fast floating point
calculations. The earliest with appropriate facilities, from 2012, used the ARM Cortex-A9 processors, starting with 800
MHz versions. This is indicated as having the three MFLOPS speeds of 20, 11 and 22, or at 10 times Cray 1 CPU MHz, with
gains of 1.7, 0.4 and 3.7 in MFLOPS.
A later 800 MHz V7-A9 obtained 115, 101 and 155 MFLOPS, or Cray 1 gains of 9.7, 3.7 and 25.8 times.
Fastest results provided are for a 2021 mid priced phone with a Kryo 570 CPU, said to be based on ARM Cortex-A77. At
2000 MHz, this obtained an average LLL speed of 1468 MFLOPS, with Linpack at 1986 and 905 for Whetstone and Cray 1
performance gains of 123, 74 and 151 times, at 25 times CPU MHz.
The latest versions of the benchmarks can be downloaded and installed from the following (see security warning). Android
9 Benchmarks and Stress Tests On 32 Bit and 64 Bit CPUs. Then Android 10 and 11 Benchmarks and ARM big.LITTLE
Architecture Issues might be of interest, with Android Benchmarks For 32 Bit and 64 Bit CPUs from ARM and Intel providing
more information, results and access to older (out of date) apps.
                     LLLoops MFLOPS       Linpack  Whetstone       CPU   Device
CPU           MHz    Max   Gmean   Min    MFLOPS   MWIPS  MFLOPS    Year   Year
Main Columns V V V V
Cray 1 80 82.1 11.9 1.2 27 16.2 6.0 1978
Android 32 bits
V7-A9 a 800 36 20 11 11 171 22 2012 2012
V7-A9 a later 800 253 115 47 101 687 155 2012 2012
v7-A9 1200 208 176 27 159 731 259 2012 2012
v8-A53 1300 397 164 28 348 868 332 2012 2015
v7-A15 1700 471 342 34 826 907 329 2012 2013
QU-800 2150 447 356 112 630 1974 610 2013 2013
V8-A72 1800 674 584 136 1023 2053 465 2015 2015
Android 64 bits
v8-A53 1300 805 238 101 338 1494 319 2012 2015
Exynos 8890 2300 188 158 27 999 3342 760 2016 2017
v8-A57 2000 724 641 245 1163 1988 390 2013 2015
v8-A73 2000 877 786 269 1122 2927 497 2016 2019
Kryo 570 2000 1620 1468 514 1986 4650 905 2020 2021
A53 64/32bit 1.0 2.0 1.5 3.6 1.0 1.7 1.0
V7-A9 a/Cray 1 10.0 0.4 1.7 9.2 0.4 10.6 3.7
v7-A9 later 10.0 3.1 9.7 39.2 3.7 42.4 25.8
32b A72/Cray 1 22.5 8.2 49.0 113.5 37.9 126.7 77.5
64b 570/Cray 1 25.0 19.7 123.3 428.0 73.6 287.1 150.8
Main Columns # # # #
Windows Intel and AMD CPUs
I developed my benchmarks for Intel CPUs in the 1990s, starting with DOS and OS/2, through varieties of Windows and
Linux. The compiled benchmarks and source codes are all available for download via my PDF reports at ResearchGate. I
received numerous results up to 2005. By 2013, my interests became ARM CPUs, with MHz of those from Intel not
increasing sufficiently to show real improvements in performance of my single core benchmarks. Then, my 3.9 GHz CPU
was close to maximum speed and, in 2021, this appears to have only reached 5.5 GHz, but now with 16 CPU cores. In
order to obtain some up to date performance data, I bought a new laptop with an 11th generation Core i5 CPU that runs
at a maximum speed of 4.15 GHz.
Comparison - Below are samples of results where details for the three benchmarks were available. The first PC to reach
the average Cray 1 Livermore Loops score is indicated as a 1994 100 MHz Pentium, shown as 12 MFLOPS, with Linpack
and Whetstone at 12 and 16. This gives approximate Cray 1 comparisons of MHz and the three MFLOPS measurements of
1.3, 1.0, 0.44 and 2.6 times.
PCs with faster Pentium processors continued to produce performance proportional to CPU MHz, with improvements
appearing with the 1995 Pentium Pro. At 200 MHz the three MFLOPS measurements were 34, 49 and 41 and four
comparisons 2.5, 2.9, 1.8 and 6.8 times.
Next came various Pentium II and III models with improvements to these benchmarks mainly proportional to CPU MHz.
Then the 2002 Pentium 4 is shown to achieve 187, 382 and 146 MFLOPS, but at 1700 MHz, producing the four Cray 1
comparisons of 21, 16, 14 and 24 times, with decreases in MFLOPS per MHz, compared with earlier Pentiums.
With alternative CPU technology, the per MHz ratio improved with a single core of a 1830 MHz 2007 Core 2 processor
obtaining 413, 998 and 374 MFLOPS or Cray 1 improvements of 23, 35, 37 and 62 times.
The 2010 Core i7 range produced an improvement in MFLOPS per MHz, with the 3900 MHz 2013 model obtaining 1108,
2684 and 716 MFLOPS and comparisons 49, 93, 99 and 119 times.
The 2021 laptop with a Core i5 1135G7 CPU provided further gains with a higher MFLOPS per MHz rating for Livermore
Loops and Linpack but not much with Whetstone. MFLOPS identified were 1387, 3541 and 802, and Cray 1 comparisons
of 117, 131 and 134 times.
These results are from running optimised versions of the original Windows Classic Benchmarks livecont.exe, linpcont.exe
and whetcont.exe, available in downloadable benchnt.zip.
                     LLLoops MFLOPS       Linpack  Whetstone      Device  LLLoops Gmean
CPU           MHz    Max   Gmean   Min    MFLOPS   MWIPS  MFLOPS    Year   MFLOPS per MHz
Main Columns V V V V
Cray 1 80 82.1 11.9 1.2 27 16.2 6.0 1978 0.15
Windows PCs
AMD 80386 40 1.2 0.6 0.2 0.5 5.7 0.8 1991 0.02
80486 DX2 66 4.9 2.7 0.7 2.6 15 3.3 1992 0.04
Pentium 75 24 7.7 1.3 7.6 48 11 1994 0.10
Pentium 100 34 12 2.1 12 66 16 1994 0.12
Pentium 200 66 22 3.8 132 31 1996 0.11
AMD K6 200 68 22 2.7 23 124 26 1997 0.11
Pentium Pro 200 121 34 3.6 49 161 41 1995 0.17
Pentium II 300 177 51 5.5 48 245 61 1997 0.17
AMD K62 500 172 55 6.0 46 309 67 1999 0.11
Pentium III 450 267 77 8.3 62 368 92 1999 0.17
Pentium 4 1700 1043 187 19 382 603 146 2002 0.11
Athlon Tbird 1000 1124 201 23 373 769 161 2000 0.20
Core 2 1830 1650 413 40 998 1557 374 2007 0.23
Core i5 2300 2326 438 35 1065 1813 428 2009 0.19
Athlon 64 2150 2484 447 48 812 1720 355 2005 0.21
Phenom II 3000 3894 644 64 1413 2145 424 2009 0.21
Core i7 930 3066 2751 732 68 1765 2496 576 2010 0.24
Core i7 4820K 3900 5508 1108 88 2680 3114 716 2013 0.28
Core i5 1135G7 4150 7505 1387 92 3541 3293 802 2021 0.33
Pentium/Cray 1 1.3 0.4 1.0 1.8 0.4 4.1 2.6
i5/Cray 1 52 91 117 77 131 203 134
i5/i7 1.1 1.4 1.3 1.1 1.3 1.1 1.1
Main Columns # # # #
SIMD Windows and Linux PCs
Following are results from running the benchmarks compiled with SSE, AVX and AVX-512 SIMD options. These employ 128,
256 or 512 bit vector registers simultaneously operating on 4, 8 or 16 single precision (SP) and 2, 4 or 8 double precision
(DP) numbers, historically stated as producing maximum SP performance of 4, 8 or 16 MFLOPS per MHz and half those for
DP. With 100% fused multiply and add (FMA) type operations these maximum expectations would be doubled. FMA was
only available on the Core i5 laptop tested here. It should be noted that using fused operation can result in slightly
different accuracy in computed results. The benchmarks report these as errors. See Error Reports. Similar variations were
encountered, in the past, using different versions of the compilers.
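To illustrate the widths involved, the short example below uses Intel AVX intrinsics, where one fused multiply add instruction operates on 8 single precision lanes, giving 16 floating point operations per instruction. This is only an illustration and is not code from the benchmarks; compile flags such as gcc -mavx -mfma are needed for it to build.

/* Illustration of AVX width and FMA - not code from the benchmarks.
   Build with something like: gcc -mavx -mfma fma_demo.c */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256 x = _mm256_set1_ps(2.0f);
    __m256 a = _mm256_set1_ps(3.0f);
    __m256 b = _mm256_set1_ps(1.0f);
    __m256 r = _mm256_fmadd_ps(x, a, b);   /* (x * a) + b on all 8 SP lanes */

    float out[8];
    _mm256_storeu_ps(out, r);
    printf("%.1f\n", out[0]);              /* prints 7.0 */
    return 0;
}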
Windows benchmarks, used in this area, were lloops64.exe, linpack64.exe and whetsSSE.exe. These and source code
files are included in Windows-Benchmarks.zip . Compared with the earlier results, performance increased to achieve Cray
1 MFLOPS gains of 238, 190 and 182 times. For this area, double precision Whetstone results are also shown to run at
the same speed as the single precision version.
I have had difficulties in using the latest C compilers for Windows, but a new bootable flash drive for Ubuntu 20.04
provided the compiler, enabling more advanced options to be used under Linux. The new benchmarks were initially
compiled on an older PC as it did not seem possible to boot the latest flash drive on my new Core i5 based laptop. For the
latter, I installed WSL (Windows Subsystem for Linux) in order to compile and run the programs.
Linux - The first compilations under Linux were slightly faster than those from Windows. Those used here were compiled
on the i5 laptop using the latest gcc 9.3.0 compiler, under Ubuntu. Disassembled code was examined to show that SSE,
AVX and AVX-512 instructions were being used, as appropriate. This cannot be guaranteed by relying on compile options.
These benchmarks can be downloaded in Linux-Benchmarks.tar.xz. The first Linux results, using the AVX SIMD
instructions, increased the three i5 Cray 1 gains to 300, 259 and 179 times. AVX-512 hardware was only available on the
Core i5 CPU, providing the three MFLOPS gains of 359, 337 and 226 times.
The table provides MFLOPS per MHz calculations for Livermore Loops average and maximum results. A major surprise is
that the latter for SSE and AVX, of 3.56 and 4.77 were higher than recognised maximum double precision ratios, without
FMA, of 2.0 and 4.0. This also applied for SSE for the Core i7 at 3.05. The AVX-512 FMA 47692 MFLOPS ratio of 11.49
suggests significant FMA was being used. See also Faster Than Expected below.
                     LLLoops MFLOPS       Linpack  Whetstone      Device  LLLoops Gmean   LLLoops Max
CPU           MHz    Max   Gmean   Min    MFLOPS   MWIPS  MFLOPS    Year   MFLOPS per MHz  MFLOPS per MHz
(the Core i5 DP rows give double precision Whetstone MWIPS and MFLOPS)
Main Columns V V V V
Cray 1 80 82.1 11.9 1.2 27 16.2 6.0 1978 0.15 1.03
Windows PCs Earlier SSE Compiler
Core i7 4820K 3900 6145 2037 327 3601 6385 1081 2013 0.52 1.58
Core i5 1135G7 4150 8313 2828 386 5132 7466 1094 2021 0.68 2.00
Core i5 DP 4150 7256 1098
i5/Cray 1 52 101 238 321 190 461 182
i5/i7 1.1 1.4 1.4 1.2 1.4 1.2 1.0
Linux PCs SSE New Compiler
Core i7 4820K 3900 11881 2578 569 5306 6007 1182 2013 0.66 3.05
Core i5 1135G7 4150 14786 3364 575 7322 6586 1052 2021 0.81 3.56
i5/Cray 1 52 180 283 479 271 407 175
i5/i7 1.1 1.2 1.3 1.0 1.4 1.1 0.9
Linux PCs AVX New Compiler
Core i7 4820K 3900 12878 2615 597 5098 5887 1174 2013 0.67 3.30
Core i5 1135G7 4150 19794 3568 943 6998 6477 1077 2021 0.86 4.77
Core i5 DP 6861 1076
i5/Cray 1 52 241 300 786 259 400 179
i5/i7 1.1 1.5 1.4 1.6 1.4 1.1 0.9
SP/DP 0.9 1.0
i7 AVX/SSE 1.1 1.0 1.0 1.0 1.0 1.0
i5 AVX/SSE 1.3 1.1 1.6 1.0 1.0 1.0
Linux AVX-512 FMA New Compiler
Core i5 1135G7 4150 47692 4273 965 9088 8193 1353 2021 1.03 11.49
i5/Cray 1 52 581 359 805 337 506 226
Main Columns # # # #
The following is a summary of the range of results on the 2021 Core i5, to show the impact of compilers that support
newer technology. Note that trying to run the AVX-512 variety on earlier CPUs, without this option, results in a program
failure report. The No SSE results are from the earlier table.
The i5 CPU MHz is 52 times that of the Cray 1, compared with gains of over 300 times for the Livermore Loops and Linpack
benchmarks using AVX-512 functions and more than 200 times for Whetstone. Later are multithreading results for the
latter, and for a vector version, to highlight the benefits of using more advanced facilities.
            ------- MFLOPS -------     ----- i5/Cray 1 -----
            LLoops  Linpack   Whets    LLoops  Linpack  Whets
No SSE 1387 3541 802 117 131 134
SSE 3364 7322 1052 283 271 175
AVX 3568 6998 1077 300 259 179
AVX512 4273 9088 1353 359 337 226
Vector Whetstone Benchmark
Below are details of supercomputer Whetstone Scalar and Vector benchmark results included in my ResearchGate
Whetstone Benchmark History and Results report. Details of the vector program version are included above. As far as I
remember, all these results are from systems using a single scalar CPU, possibly with more than one vector pipeline. Cray
was the first manufacturer to produce systems with multiple scalar CPUs but it is not clear if any of the others followed
this line in the timescale considered. From the details here, both Cray Y-MP MHz clock speed and scalar MFLOPS are
indicated as around twice as fast as Cray 1, with vector MFLOPS four times faster, the system having two vector units.
This benchmark is included in Linux-Benchmarks.tar.xz and Windows-Benchmarks.zip .
Best results, from the next table, for Core i5 and Raspberry Pi 400 are provided, to demonstrate their superiority over
1991 supercomputers. On top of this, the former have multiple cores, with the potential of four times higher throughput or
raw performance. See MP Whetstone results and those for MP MFLOPS.
                          ----- Scalar -----    ----- Vector -----    Vector/Scalar
                    MHz    MWIPS    MFLOPS       MWIPS    MFLOPS          MFLOPS       DATE
Cray 1 80 16.2 5.9 98 47 8.0 1978
CDC Cyber 205 50 11.9 4.9 161 57 11.7 1981
Cray XMP1 118 30.3 11.0 313 151 13.7 1982
Cray 2/1 244 25.8 N/A 425 N/A 1984
Amdahl VP 500 # 143 21.7 7.5 250 103 13.8 1984
Amdahl VP 1100 # 143 21.7 7.5 374 146 19.5 1984
Amdahl VP 1200 # 143 21.7 7.5 581 264 35.3 1984
IBM 3090-150 VP 54 12.1 4.9 60 17 3.6 1986
(CDC) ETA 10E 95 15.7 6.5 335 124 19.2 1987
Cray YMP1 154 31.0 12.0 449 195 16.3 1987
Fujitsu VP-2400/4 312 71.7 25.4 1828 794 31.3 1991
NEC SX-3/11 345 42.9 17.0 1106 441 25.9 1991
NEC SX-3/12 345 42.9 17.0 1667 753 44.3 1991
# Fujitsu Systems
Core i5 AVX512 SP 4150 7780 1353 21039 28303 20.9 2021
Core i5 AVX512 DP 4150 8193 1353 21464 20346 15.0 2021
Pi 400 SP 1800 2505 573 3755 2131 3.7 2020
Pi 400 DP 1800 2684 575 3407 1184 2.1 2020
The following include all three MFLOPS measurements to identify maximum, as the second test sometimes falls behind.
Single and double precision results are provided, where either could be valid, depending on numeric precision
requirements.
The fastest Whetstone floating point code is not well suited to benefit from fused multiply and add operation, with one
multiply associated with four additions or subtractions. The maximum Core i5 speed of 75.1 GFLOPS is quite impressive.
Average i5 Cray 1 MFLOPS gains were 602 and 433 times, for single then double precision calculations. Note that
some SP SSE MFLOPS per MHz were again greater than 4.0 and AVX above 8.0 and half these with DP. The Raspberry Pi
400 vector performance was not that good but, as shown above, somewhat faster than the scalar speed.
                                                                Average   Avg MFLOPS   Max MFLOPS   Average MFLOPS
Mode                    MHz   MWIPS   MFLOPS  MFLOPS  MFLOPS     MFLOPS     per MHz      per MHz        xCray 1
Windows SSE
Phenom II 64b SP 3000 4869 4429 3067 751 1593 0.5 1.5 34
Phenom II 64b DP 3000 4897 2418 1722 751 1290 0.4 0.8 27
Phenom II 32b SP 3000 4624 1798 1584 701 1148 0.4 0.6 24
Core i7 4820K 64b SP 3900 7256 14233 12655 958 2513 0.6 3.6 53
Core i7 4820K 64b DP 3900 7299 7416 7019 953 2261 0.6 1.9 48
Core i7 4820K 32b SP 3900 10494 10362 9748 9468 9846 2.5 2.7 209
Core i5 1135G7 64b SP 4150 8435 23709 21246 1043 2862 0.7 5.7 61
Core i5 1135G7 64b DP 4150 8621 12375 11475 1041 2659 0.6 3.0 57
Core i5 1135G7 32b SP 4150 13387 18221 17254 13739 16162 3.9 4.4 344
Linux
Core i7 4820K Op3 SP 3900 12012 12896 6248 17131 10136 2.6 4.4 216
Core i7 4820K AVX SP 3900 11924 20394 7124 23551 12938 3.3 6.0 275
Core i7 4820K Op3 DP 3900 11383 6259 4601 8711 6099 1.6 2.2 130
Core i7 4820K AVX DP 3900 11526 10509 5789 11950 8533 2.2 3.1 182
Core i5 1135G7 Op3 SP 4150 20870 21024 10721 28800 17088 4.1 6.9 364
Core i5 1135G7 AVX SP 4150 20294 37170 12353 39126 22487 5.4 9.4 478
Core i5 1135G7 A512 SP 4150 21039 62592 13037 75094 28303 6.8 18.1 602
Core i5 1135G7 Op3 DP 4150 20045 10884 8035 14575 10528 2.5 3.5 224
Core i5 1135G7 AVX DP 4150 20526 19270 10311 20360 15152 3.7 4.9 322
Core i5 1135G7 A512 DP 4150 21464 33188 11504 32907 20346 4.9 8.0 433
Raspberry Pi 400 SP 1800 3755 2413 1683 2506 2131 1.2 1.4 45
Raspberry Pi 400 DP 1800 3407 1216 1151 1186 1184 0.7 0.7 25
MP Whetstone Benchmark
Previous results compared Cray 1 performance with single CPU cores on the later systems. Here we consider possible
implications of using multiple cores, using this benchmark that effectively represents 1, 2, 4 and 8 users concurrently
executing the same application, but using different data. Details shown are overall MWIPS ratings, the three MFLOPS
measurements, overall harmonic mean MFLOPS, recorded running times, MFLOPS performance gains over the Cray 1 and
MFLOPS per MHz ratios for single core activity. Note nominal running time varies due to rough calibration of the number of
passes to use. The benchmark is also included in Linux-Benchmarks.tar.xz and Windows-Benchmarks.zip .
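As a reminder, the overall MFLOPS figures quoted are harmonic means of the three individual test speeds. A minimal sketch of that calculation (not the benchmark's own code) follows; the sample values are the three single thread Phenom II results from the table below, giving the 794 shown there.

/* Minimal sketch of the harmonic mean used for the overall MFLOPS rating -
   illustrative only, not the benchmark's own code. */
#include <stdio.h>

static double harmonic_mean(const double *mflops, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += 1.0 / mflops[i];            /* sum of reciprocal speeds */
    return (double)n / sum;
}

int main(void)
{
    double sample[3] = {817.0, 817.0, 752.0};      /* single thread Phenom II values */
    printf("%.0f\n", harmonic_mean(sample, 3));    /* prints 794 */
    return 0;
}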
Phenom, Windows 7 - This demonstrates almost perfect speed gains using 1 to 2 and 2 to 4 cores, with no further
increase using 8 threads.
Core i7 Desktop - This can use 4 cores or 8 independent threads at the same time. This application appeared to
demonstrate near best case performance gains using 8 threads.
Core i5 Laptop - Performance Monitor indicated that this ran at around 4150 MHz using 1 and 2 threads, but reduced to
about 3800 MHz for 4 and 8 threads.
Windows vs Linux - Average MFLOPS performance was quite similar, on both the i7 and i5 PCs, at the lower level of
optimisation shown here.
Single vs Double Precision - Results indicated similar performance, as expected from scalar operation.
PC Performance Gains - Some of the Core i7 speeds were faster than on the i5. For the latter, eight thread Cray 1
MFLOPS gains were 1521 times.
Android Phone - The Kryo 570 CPU has out-of-order execution, maybe responsible for the highest MFLOPS per MHz ratio
of 0.42. But maximum performance of the big/LITTLE CPU arrangement, of 2 fast and 6 slow cores, led to 8 core
performance being only 5 times faster than for 1 core. Still, the Cray 1 gain was 757 times.
Raspberry Pi 400 - As might be expected, performance of this quad core system produced the same elapsed time using
1, 2 and 4 threads, and a little bit extra with 8 threads. Maximum Cray 1 gain was 400 times.
                                      --- MFLOPS ---     Average
System         Threads  MWIPS    MFLOPS  MFLOPS  MFLOPS   MFLOPS   Secs  xCray 1  Gain  Per MHz
Desktop Win 7 1 4086 817 817 752 794 5.0 132 1.0 0.26
Phenom II 2 8149 1635 1616 1501 1582 5.0 264 2.0
4 core 4 16199 3261 3234 2968 3149 5.1 525 4.0
3000 MHz 8 16602 3428 3461 3056 3304 10.1 551 4.2
Desktop Win 10 1 6169 1236 1236 856 1077 4.5 179 1.0 0.28
Core i7 4820K 2 13106 2601 2604 1910 2322 4.2 387 2.2
4 Core 8 Thread 4 25343 5181 5197 3723 4587 4.5 764 4.3
3900 MHz 8 46579 10310 10263 7403 9104 5.0 1517 8.5
Laptop Win 10 1 7555 1195 1216 1046 1147 4.9 191 1.0 0.28
Core i5 1135G7 2 15048 2385 2424 2083 2287 5.0 381 2.0
4 Core 8 Thread 4 27290 4339 4407 3787 4158 5.6 693 3.6
4150 MHz or less 8 53037 8619 8773 7538 8272 5.9 1379 7.2
Linux
Desktop SP 1 6157 1189 1146 931 1076 4.7 179 1.0 0.28
Core i7 4820K 2 12641 2529 2608 1931 2314 4.6 386 2.1
4 Core 8 Thread 4 25490 5204 5213 3900 4685 4.6 781 4.4
3900 MHz 8 43907 10217 10440 7714 9279 5.7 1547 8.6
Desktop DP 1 6500 1235 1252 972 1138 3.9 190 1.0 0.29
Core i7 4820K 2 13098 2542 2636 1938 2328 3.9 388 2.0
4 Core 8 Thread 4 26298 5105 5273 3906 4676 3.9 779 4.1
3900 MHz 8 44758 10268 10435 7755 9312 5.2 1552 8.2
Laptop SP 1 7640 1140 1199 1015 1113 5.0 185 1.0 0.27
Core i5 1135G7 2 14662 2347 2262 1997 2192 5.4 365 2.0
4 Core 8 Thread 4 26754 4320 4387 3752 4133 6.1 689 3.7
4150 MHz or less 8 46016 7885 8264 6701 7556 7.5 1259 6.8
Laptop SP AVX512 1 8432 1281 1280 1248 1269 5.0 212 1.0 0.31
Core i5 1135G7 2 16728 2542 2548 2471 2520 5.0 420 2.0
4 Core 8 Thread 4 29816 4625 4617 4523 4588 6.0 765 3.6
4150 MHz or less 8 54985 9203 9188 8994 9127 6.6 1521 7.2
Laptop DP AVX512 1 8748 1278 1278 1248 1268 4.9 211 1.0 0.31
Core i5 1135G7 2 17372 2542 2542 2481 2521 5.0 420 2.0
4 Core 8 Thread 4 31459 4622 4622 4514 4585 5.5 764 3.6
4150 MHz or less 8 57024 9187 9210 8985 9126 6.0 1521 7.2
Android Phone 1 4327 1010 984 782 913 4.6 152 1.0 0.42
Kryo 570 2 8782 1850 2126 1604 1836 4.5 306 2.0
2 x 2200 MHz + 4 13969 3189 3373 2641 3034 6.9 506 3.3
6 x 1800 MHz 8 21039 4535 4985 4171 4540 7.9 757 5.0
Raspberry Pi 400 1 2266 644 645 376 520 5.0 87 1.0 0.29
4 x Cortex A72 2 4533 1285 1284 751 1038 5.0 173 2.0
1800 MHz 4 9065 2562 2498 1505 2062 5.0 344 4.0
8 9611 3284 3375 1543 2402 10.1 400 4.6
MP MFLOPS Linux - Intel Single Precision Results
The benchmark aims at producing maximum measured floating point performance for comparison with the
theoretically possible 160 MFLOPS on Cray 1. Here, a Linux benchmark is used, running SSE and AVX Intel SIMD
instructions (in Linux-Benchmarks.tar.xz).
Calculations are carried out of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 operations per
input data word. In each case, accessing 102400, 1024000 and 10240000 data words, covering caches and RAM. Up to
64 threads can be used, each using a dedicated segment of the data, default being 8 threads. Data is checked for
consistent values at the end.
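A minimal sketch of the style of kernel, for the 8 operations per word case quoted above, is shown below. It is illustrative only: the constants are arbitrary, and the real benchmark adds the threading, timing and checking described here.

/* Minimal sketch of an MP MFLOPS style kernel with 8 floating point operations
   per data word - illustrative only; constants are arbitrary and each thread
   would run this over its own segment of the data array. */
static void kernel8(int n, float x[])
{
    const float a = 0.02f, b = 0.98f, c = 0.04f, d = 0.96f, e = 0.06f, f = 0.94f;
    for (int i = 0; i < n; i++)
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;   /* 8 FLOPs */
}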
Below are measured MFLOPS using 1, 2, 4 and 8 threads for the Core i7 and i5 computers, executing SSE and AVX
instructions, plus AVX-512 on the i5. As for MP Whetstones, performance improvements, from doubling the number of
threads (MP Gains), are shown to be non-linear for the Core i5 laptop.
Single core MFLOPS per MHz ratios are also shown. Maximum single precision expectations, without FMA instructions, are
4 for SSE, 8 for AVX and 16 for AVX-512, or 32 for the latter where FMA is used. Then double precision operation
expectations are half these values.
It can be seen that, for both i7 and i5, SSE and AVX MFLOPS/MHz ratios were higher than these. I have been unable to
identify the reason for these levels of performance, without FMA type instructions being used. For further information see
Faster Than Expected below.
AVX-512 MFLOPS per MHz was less than 32, one reason being that not all instructions were of the FMA variety, as shown
in the code disassemblies below. These indicate that the highest expected speed achievable by the FMA code is just
over 76% of maximum with complete FMA instructions, or 24.4 MFLOPS/MHz, close to that obtained.
The performance adjustment is also shown to produce a likely reduction in Cray 1 maximum speed to 122 MFLOPS,
executing these functions. The maximum Core i5 single precision speed of 325915 MFLOPS indicates a Cray 1 gain of
2671 times. Maximum double precision result, from the next page, was 160641 MFLOPS with a gain of 1317 times.
Also single precision results on the next page indicate maximum 8 thread speed on Raspberry Pi of 30150 MFLOPS or
Cray 1 gain of 247 times and Android phone at 35686 MFLOPS or gain of 293 times, both at a SIMD level equivalent to Intel SSE.
Threads 1 2 4 8 1 2 4 8 1 2 4 8
KWDs Ops SSE SSE SSE SSE AVX AVX AVX AVX AVX512 AVX512 AVX512 AVX512
Core i7 3900 MHz MFLOPS
102 2 10106 22704 47224 54668 11379 27114 56982 63095 N/A N/A N/A N/A
1024 2 9801 19227 36849 42389 10542 20127 39567 45256
10240 2 5856 9342 10120 9951 6004 9400 10165 9936
102 8 24258 48818 91871 97077 36354 82307 169881 184765
1024 8 24356 49258 91911 96902 34820 67057 130960 161412
10240 8 19421 34454 39855 39777 22340 36088 40372 39578
102 32 23355 46711 88383 93448 45374 88045 171961 177649
1024 32 23284 46883 88776 93381 45459 91277 172443 178895
10240 32 23107 46102 85346 92767 43834 86697 152019 157381
Maximum 24356 49258 91911 97077 45459 91277 172443 184765
MP Gains 1.0 2.0 3.8 4.0 1.0 2.0 3.8 4.1
AVX/SSE 1.9 1.9 1.9 1.9
Max/MHz 6.2 11.7
Core i5 MFLOPS
102 2 24612 48845 46738 80544 29021 30791 86020 93812 37656 74288 72164 121973
1024 2 21362 42345 43579 79180 21656 44753 44415 93920 23333 46844 58968 122122
10240 2 7495 12295 13298 14067 7620 11160 13454 14020 9274 13455 13337 13995
102 8 33271 65364 71105 119460 64946 128515 153955 210177 71895 142743 142554 241880
1024 8 32614 65504 63763 118933 62120 127095 121959 210157 66304 134081 144756 239841
10240 8 22467 38871 50079 56166 24963 42384 53438 56122 30345 49693 54170 56226
102 32 33273 58673 69365 119426 64941 124972 133637 225265 94417 170909 324843 325915
1024 32 32997 39974 86194 119313 64304 125772 125365 224014 91558 185785 324870 324936
10240 32 32777 64727 82112 116115 61061 114491 127026 200120 77458 140903 182219 222231
Maximum 33273 65504 86194 119460 64946 128515 153955 225265 94417 185785 324870 325915
MP Gains 1.0 2.0 2.6 3.6 1.0 2.0 2.4 3.5 1.0 2.0 3.4 3.5
AVX/SSE 2.0 2.0 1.8 1.9
a512/AVX 1.5 1.4 2.1 1.4
i5/i7 1.4 1.3 0.9 1.2 1.4 1.4 0.9 1.2
MHz 4150 4150 3600 3600 4150 4150 3600 3600 4150 4150 3600 3600
Max/MHz 8.0 15.6 22.8
MP MFLOPS 2 - Intel DP, Android and Raspberry Pi SP
The first columns in this table provide Core i7 and Core i5 double precision MP MFLOPS results, using 1 and 8 threads.
Calculations below show that performance at 32 operations per word, and other high performing areas, was effectively at
half single precision speed, as expected with SIMD. The lower ratios probably reflect half speed double precision
calculations and overheads dealing with 64 bit numbers.
Single precision results are also included for the 2000 MHz Kryo 570 Android phone and 1800 MHz Raspberry Pi 400. For
these, SIMD level used is equivalent to Intel SSE. Worse than MP Whetstone, this time the Kryo 570, using 8 threads,
was only three times faster than during the single thread test. Then, a CPU monitoring app indicated that six cores were
running at 1804 MHz, with two at 768 MHz.
                  --- Core i7 DP ---      --------- Core i5 DP ---------      Phone SP      RPi SP
Threads            1     8     1     8      1     8     1     8     1      8     1     8      1     8
KWDs Ops/word     SSE   SSE   AVX   AVX    SSE   SSE   AVX   AVX  AVX512 AVX512  SP    SP     SP    SP
102 2 4921 28537 5290 32337 12437 38391 14606 43320 18872 60955 6977 15998 4015 10169
1024 2 4820 21214 4772 19551 4978 29821 6351 32157 8120 35674 8034 14536 3865 9622
10240 2 2949 4923 2946 4950 3604 6562 3683 6728 4442 6514 2984 2442 447 585
102 8 12233 48924 17683 95178 16500 59285 32504 104046 35958 120212
1024 8 12074 48679 16145 78149 12762 54904 19300 92706 22226 105465
10240 8 9929 19774 10969 19845 10941 26897 12157 27045 14806 26544
102 32 11742 46894 22880 89459 16602 58258 32420 111461 47200 160641 12178 34803 7902 28978
1024 32 11697 46848 22667 88958 16314 59325 31215 107323 42515 151251 12139 35686 7860 30150
10240 32 11615 46395 21983 78687 16315 57399 30488 99303 38532 105812 12137 34050 7326 8537
Maximum 12233 48924 22880 95178 16602 59325 32504 111461 47200 160641 12178 35686 7902 30150
MP Gain 1.0 4.0 1.0 4.2 1.0 3.6 1.0 3.4 1.0 3.4 1.0 2.9 1.0 3.8
MF/MHZ 3.14 5.87 4.0 7.8 11.4 6.1 4.4
Double/Single Precision
102 2 0.49 0.52 0.46 0.51 0.51 0.48 0.50 0.46 0.50 0.50
1024 2 0.49 0.50 0.45 0.43 0.23 0.38 0.29 0.34 0.35 0.29
10240 2 0.50 0.49 0.49 0.50 0.48 0.47 0.48 0.48 0.48 0.47
102 8 0.50 0.50 0.49 0.52 0.50 0.50 0.50 0.50 0.50 0.50
1024 8 0.50 0.50 0.46 0.48 0.39 0.46 0.31 0.44 0.34 0.44
10240 8 0.51 0.50 0.49 0.50 0.49 0.48 0.49 0.48 0.49 0.47
102 32 0.50 0.50 0.50 0.50 0.50 0.49 0.50 0.49 0.50 0.49
1024 32 0.50 0.50 0.50 0.50 0.49 0.50 0.49 0.48 0.46 0.47
10240 32 0.50 0.50 0.50 0.50 0.50 0.49 0.50 0.50 0.50 0.48
MP Livermore Loops
Four copies of the Livermore Loops Benchmark were run at the same time on the i5 laptop, with a longer parameter for
seconds per loop, each program running for around 15 minutes. Using all cores led to the usual reduction in CPU MHz,
but there may have been more throttling to counteract heating effects. Single thread (Geomean) official average speed
was 4273 MFLOPS, compared with a per thread average of 3375 here. However, the total throughput of 13500 MFLOPS
indicates an increase over Cray 1 of 1134 times.
---------- AVX-512 DP MFLOPS ----------
Thread Maximum Average Geomean Harmean Minimum
1 33413.3 5809.5 3430.8 2293.0 493.7
2 35648.5 5576.5 3275.7 2223.1 552.1
3 35422.7 5953.9 3449.2 2300.6 505.1
4 36895.5 5746.0 3344.4 2190.7 459.4
MP MFLOPS 4 to 64 Threads
As indicated earlier, the MP MFLOPS benchmark can handle up to 64 threads, with an execution command under Linux (such
as ./MPmflops64AVX512 Threads 64). For correct operation, the specified number must be 1, 2, 4, 8, 16, 32 or 64.
Following are results on the i5 between 4 and 64 threads. These show that performance can be significantly improved
using additional threads, the reason being that each thread's segment of the data becomes small enough to be held in a
faster lower level cache. At 64 threads, for example, each segment of the largest 10,240,000 word array is only 160,000
single precision words, around 640 KB.
4 Byte     Ops/   Repeat         MFLOPS Using Number Of Threads
Words      Word   Passes         4        8       16       32       64
102400 2 75000 72164 112210 155132 158133 153968
1024000 2 7500 58968 108429 119118 117709 122011
10240000 2 750 13337 13824 17251 60342 116964
102400 8 75000 142554 210116 253359 270576 275220
1024000 8 7500 144756 212406 233939 236110 242271
10240000 8 750 54170 54988 64520 174245 235583
102400 32 75000 324843 312508 316881 318233 327762
1024000 32 7500 324870 308995 310405 325996 327897
10240000 32 750 182219 204563 243408 301543 322605
Performance Summary
Following is a summary of most results, intended to show best case performance gains, over the Cray 1, for different
classes of work. Considering the Core i5 details, the first four are for programs that only use (or are intended to use) a
single CPU core. The main one for Cray 1 comparison being Livermore Loops average. Linpack is the only one that
provides a single measurement. Whetstones identify relative performance of scalar and vector processing. As for MP
benchmarks, vector single and double precision results are provided. The former can be used for comparison with those
produced via the long word used by Cray, if the numeric accuracy is acceptable.
The MP benchmark results can be used to represent multiple users running the same program or a single program
executing multiple threads, each handling a dedicated segment of shared data. Again for the Core i5, MP Whetstone
MFLOPS were similar for double and single precision versions, with little opportunity for vectorisation. The simpler
Whetstone calculations demonstrate the benefit of hyperthreading with the 4 core, 8 thread throughput being nearly
seven times faster than the standalone run. On the other hand, MP MFLOPS suffered from the i5 running at a lower MHz
when four cores were being used, leading to 8 thread performance being less than four times faster than via 1 thread.
This benchmark identified the highest Cray 1 performance gains of over 2600 times for single precision calculations, but
half of this at double precision.
On cost/performance grounds, the Raspberry Pi 400 was better than the Core i5 laptop, in some of the early cases, but
worse on others, then fell far behind on benchmarks that could benefit from compilation using Intel Advanced Vector
instructions. Compared with the Cray 1, MP performance gains of up to 400 times were recorded.
Just considering performance of the Android phone, the more advanced ARM CPU used provided some significant gains
over the Raspberry Pi, but lost the advantage, due to the big/LITTLE architecture, on running the MP MFLOPS 8 thread
test. Still, best Cray 1 performance gain was 757 times through using multiple cores.
                            Cray 1   -- Core i5 AVX-512 --   -- Android Phone --   - Raspberry Pi 400 -
                                              x Cray                 x Cray                 x Cray
CPU MHz 1 Thread 80 4150 52 2000 25 1800 23
CPU MHz 8 Thread N/A 3600 <1800 1800
1. Livermore Loops
MFLOPS Max 82.1 47692 581 1620 20 3353 41
MFLOPS Average 11.9 4273 359 1468 123 938 79
2. Linpack
MFLOPS 27 9088 337 1986 74 1337 50
3. Whetstone
MFLOPS 6 1353 226 905 151 573 96
4. Vector Whetstone
MFLOPS DP Average 47 20346 433 1184 25
MFLOPS DP Maximum 32907 1216
MFLOPS SP Average 28303 602 2131 45
MFLOPS SP Maximum 75094 2506
5. MP Whetstone
MFLOPS DP Average 6 9126 1521
MFLOPS DP Maximum 9210
MFLOPS SP Average 9127 1521 4540 757 2402 400
MFLOPS SP Maximum 9203 4985 3284
6. MP MFLOPS
MFLOPS DP 1 Thread 122 47200 387
MFLOPS DP 8 Thread 160641 1317
MFLOPS SP 1 Thread 94417 774 12178 100 7902 65
MFLOPS SP 8 Thread 325915 2671 35686 293 30150 247
More Advanced Hardware
Here, relative Cray 1 performance calculations, for Android devices and PCs, have been for mid range hardware. It is
useful to consider apparently more powerful processors.
CPU MHz - In a given processing architecture, performance is usually proportional to CPU MHz. This was clear in earlier
times, when Pentium, Celeron and Xeon processors had the same core processor. The above benchmarks were run on a
Core i5 with maximum turbo speed of 4150 MHz and an ARM CPU at 2000 MHz. The latest 2022 processors appear to be
rated at up to 5500 MHz for PCs and 3000 MHz for ARM based phones. These would affect the single core benchmarks
but not excessively.
Multiple Cores - At least for the laptop and phone used here, full benefits of multiple cores were not apparent. The
laptop switched to a lower MHz and the phone’s 8 core big/LITTLE processor maximum performance became not much
better than the 4 core Raspberry Pi. Performance appears to be becoming even more unpredictable. The latest examples
(that I have seen) are an Intel CPU with 24 threads over 16 cores, 8 at up to 5.1 GHz and 8 at up to 3.8 GHz, and an ARM
design with 1 core at 3200 MHz, 3 at 2420 MHz and 4 at 1800 MHz.
More Advanced CPU Options - Some CPUs in the Core range have two 512-bit fused-multiply add (FMA) units that can,
potentially, double SIMD performance of the right sort of application. Judging by the improvement in adopting a higher
level of SIMD here and consideration of heating effects, I would not bet on it.
Disassembled Code
Disassembled code compiled to use AVX-512 and AVX instructions is listed below. The former includes vector fused multiply and
add or subtract instructions. With AVX-512 there are 21 arithmetic vector instructions and the expected 32 with AVX, the latter also
applying for SSE code. The minimum instruction count for a fully fused multiply and add version is 16, limiting achievable speed to
16/21, or 76.19%, of the full FMA maximum. This would lead to the maximum Cray 1 MFLOPS for this code becoming 122, instead of 160.
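As a worked check of these figures, using the peak rates quoted earlier (32 MFLOPS per MHz for AVX-512 with full FMA, and 160 MFLOPS for the Cray 1):

\[
\frac{16}{21} \approx 0.762, \qquad
32 \times \frac{16}{21} \approx 24.4 \text{ MFLOPS/MHz}, \qquad
160 \times \frac{16}{21} \approx 122 \text{ MFLOPS}
\]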
AVX-512 AVX
L22: L60:
vmovupd (%rax), %zmm0 vmovups (%rax), %xmm1
addq $64, %rax vinsertf128 $0x1, 16(%rax), %ymm1, %ymm1
vaddpd %zmm0, %zmm28, %zmm31 addq $32, %rax
vaddpd %zmm0, %zmm30, %zmm1 vaddps -24(%rsp), %ymm1, %ymm0
vmulpd %zmm27, %zmm31, %zmm31 vmulps 8(%rsp), %ymm0, %ymm15
vfmsub132pd %zmm29, %zmm31, %zmm vaddps 40(%rsp), %ymm1, %ymm0
vaddpd %zmm0, %zmm26, %zmm31 vmulps 72(%rsp), %ymm0, %ymm0
vfmadd231pd %zmm31, %zmm25, %zmm vsubps %ymm0, %ymm15, %ymm0
vaddpd %zmm24, %zmm0, %zmm31 vaddps 104(%rsp), %ymm1, %ymm15
vfnmadd132pd %zmm23, %zmm1, %zmm3 vmulps 136(%rsp), %ymm15, %ymm15
vaddpd %zmm22, %zmm0, %zmm1 vaddps %ymm15, %ymm0, %ymm0
vfmadd231pd %zmm21, %zmm1, %zmm3 vaddps 168(%rsp), %ymm1, %ymm15
vaddpd %zmm20, %zmm0, %zmm1 vmulps -56(%rsp), %ymm15, %ymm15
vfnmadd132pd %zmm19, %zmm31, %zmm vsubps %ymm15, %ymm0, %ymm0
vaddpd %zmm18, %zmm0, %zmm31 vaddps %ymm14, %ymm1, %ymm15
vfmadd231pd %zmm17, %zmm31, %zmm vmulps -88(%rsp), %ymm15, %ymm15
vaddpd %zmm16, %zmm0, %zmm31 vaddps %ymm15, %ymm0, %ymm0
vfnmadd132pd %zmm15, %zmm1, %zmm3 vaddps %ymm13, %ymm1, %ymm15
vaddpd %zmm14, %zmm0, %zmm1 vmulps %ymm12, %ymm15, %ymm15
vfmadd231pd %zmm13, %zmm1, %zmm3 vsubps %ymm15, %ymm0, %ymm0
vaddpd %zmm12, %zmm0, %zmm1 vaddps %ymm11, %ymm1, %ymm15
vaddpd %zmm10, %zmm0, %zmm0 vmulps %ymm10, %ymm15, %ymm15
vfnmadd132pd %zmm11, %zmm31, %zmm vaddps %ymm15, %ymm0, %ymm0
vfmadd132pd %zmm9, %zmm1, %zmm0 vaddps %ymm9, %ymm1, %ymm15
vmovupd %zmm0, -64(%rax) vmulps %ymm8, %ymm15, %ymm15
cmpq %rax, %rcx vsubps %ymm15, %ymm0, %ymm0
jne .L22 vaddps %ymm7, %ymm1, %ymm15
vmulps %ymm6, %ymm15, %ymm15
vaddps %ymm15, %ymm0, %ymm0
vaddps %ymm5, %ymm1, %ymm15
vaddps %ymm3, %ymm1, %ymm1
vmulps %ymm4, %ymm15, %ymm15
vsubps %ymm15, %ymm0, %ymm0
vmulps %ymm2, %ymm1, %ymm15
vaddps %ymm15, %ymm0, %ymm0
vmovups %xmm0, -32(%rax)
vextractf128 $0x1, %ymm0, -16(%rax)
cmpq %rdx, %rax
jne .L60
Benchmark Error Reports Next or Go To Start
Benchmark Error Reports
The Livermore Loops benchmark displays details of the results for each of the three sets of 24 kernel calculations. These
include the final numeric results, whose expected values are included in the program and can vary slightly, depending on
the hardware and compiler options. The values under OK indicate accuracy in terms of the number of agreeing decimal digits,
double precision numbers nominally being accurate to up to 16 decimal digits, but possibly subject to rounding errors.
As shown below, there were differences in the numeric results from the Core i5 laptop, with agreement reducing from 15 or
16 digits to 12 or 13 when using the AVX512 compile option. This is consistent with fused operations performing only one
rounding, as opposed to one for each separate multiply and add instruction.
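As an illustration of what the OK column represents, the following is a minimal sketch (not the benchmark's own code) that
estimates how many significant decimal digits of a calculated checksum agree with the expected value, here using the Test 6
figures from the AVX512 report below. The estimate is approximate and may differ by a digit from the benchmark's reported
count.

/* Hypothetical sketch of the accuracy check behind the OK column:
   estimate how many significant decimal digits of a calculated
   checksum agree with the expected value. Not the benchmark's own
   code, and the count may differ slightly from its reported figure. */
#include <stdio.h>
#include <math.h>

static int matching_digits(double result, double expected)
{
    if (result == expected)
        return 16;                               /* full double precision */
    double relerr = fabs(result - expected) / fabs(expected);
    int digits = (int)floor(-log10(relerr));     /* digits before values differ */
    if (digits < 0)  digits = 0;
    if (digits > 16) digits = 16;
    return digits;
}

int main(void)
{
    /* Test 6 values from the AVX512 compilation report below */
    double result   = 4.375116344743195e+03;
    double expected = 4.375116344729986e+03;
    printf("Matching digits approximately %d\n", matching_digits(result, expected));
    return 0;
}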
Kernel Floating Pt ops
No Passes E No Total Secs. MFLOPS Span Checksums OK
Earlier Compilation
6 3 x 658 2 1.566566e+09 0.89 1751.62 64 4.375116344729986e+03 16
7 4 x 529 16 6.737344e+09 0.89 7529.02 995 6.104251075174761e+04 16
18 2 x 703 44 6.124536e+09 0.89 6867.09 100 1.015727037502299e+05 15
Log Program report - Numeric results were as expected
AVX Compilation
6 3 x 814 2 1.937971e+09 1.00 1929.85 64 4.375116344729986e+03 16
7 4 x 616 16 7.845376e+09 1.00 7835.67 995 6.104251075174761e+04 16
18 2 x1711 44 1.490623e+10 1.00 14869.06 100 1.015727037502299e+05 15
Log Program report - Numeric results were as expected
AVX512 Compilation
6 3 x 757 2 1.802266e+09 1.00 1802.82 64 4.375116344743195e+03 12
7 4 x3738 16 4.760717e+10 1.00 47691.47 995 6.104251075174966e+04 13
18 2 x2393 44 2.084782e+10 1.00 20781.51 100 1.015727037502806e+05 12
Log Program report - Examples of different numeric results
Test 6 result was 4.375116344743195e+03 expected 4.375116344729986e+03
Test 7 result was 6.104251075174966e+04 expected 6.104251075174761e+04
Test 18 result was 1.015727037502806e+05 expected 1.015727037502299e+05
MP Linpack and Whetstone Benchmarks Error Reports
Similar sumcheck variations were recorded when running the Linpack and Whetstone benchmarks on the Core i5 based
laptop. In both cases, as for the Livermore Loops example, the errors were not reported when running on older hardware or
with alternative compilations.
Linpack AVX-512
Linpack Double Precision Unrolled Benchmark n @ 100
Optimisation AVX512 64 Bit, Tue Dec 7 11:38:24 2021
Speed 5151.83 MFLOPS
Variable norm. resid Non-standard result was 1.9 instead of 1.7
Variable resid Non-standard result was 8.46778499e-14 instead of 7.41628980e-14
Variable x[0]-1 Non-standard result was -1.11799459e-13 instead of -1.49880108e-14
Variable x[n-1]-1 Non-standard result was -9.60342916e-14 instead of -1.89848137e-14
Whetstone SSE
Whetstone Double Precision SSE2 Benchmark Tue Jan 11 19:34:50 2022
Test 5 Non-standard result was 0.49902937281518372 instead of 0.49902937281518167
Log file result
Loop content Result MFLOPS MOPS Seconds
N5 sin,cos etc. 0.49902937281518372 281.276 2.089
Run Time Displays Next or Go To Start
Run Time Displays
The benchmarks used here calibrate their running times to achieve a given noticeable duration, with the Windows and Linux
versions displaying details on completion of individual test functions. This automatic adjustment has so far survived a
more than 100 fold increase in CPU MHz. The current aim for the Livermore Loops is 1 second for each of the 72 tests,
with Linpack 1 second for each of the 10 calculations and Whetstone 10 seconds overall.
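The calibration output below shows the approach: run the work with an increasing pass count until the measured time becomes
meaningful, then scale the count to the target time. A minimal sketch of that idea follows (illustrative only, not the
benchmarks' own source code; run_kernel is a hypothetical stand-in for the timed calculations).

/* Minimal sketch of the automatic run time calibration described above.
   Illustrative only, not the benchmarks' actual source code. */
#include <stdio.h>
#include <time.h>

static volatile double sink;

static void run_kernel(long passes)       /* hypothetical dummy FP workload */
{
    double x = 0.999999;
    for (long i = 0; i < passes * 1000; i++)
        x = x * 1.000001 + 0.000001;
    sink = x;
}

static double secs_between(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

long calibrate(double target_seconds)
{
    long passes = 1;
    double secs = 0.0;
    while (secs < 0.1)                    /* grow until the time is measurable */
    {
        passes *= 2;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        run_kernel(passes);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        secs = secs_between(t0, t1);
        printf("Loop count %8ld %6.2f seconds\n", passes, secs);
    }
    return (long)((double)passes * target_seconds / secs);  /* scale to target */
}

int main(void)
{
    printf("Passes used %ld\n", calibrate(1.0));   /* aim for about 1 second */
    return 0;
}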
L.L.N.L. 'C' KERNELS: MFLOPS P.C. VERSION 4.0
Calculating outer loop overhead
1000 times 0.00 seconds
10000 times 0.00 seconds
100000 times 0.00 seconds
1000000 times 0.01 seconds
10000000 times 0.04 seconds
20000000 times 0.08 seconds
40000000 times 0.16 seconds
80000000 times 0.31 seconds
Overhead for each loop 3.9288e-09 seconds
Calibrating part 1 of 3
Loop count 4 0.00 seconds
Loop count 16 0.00 seconds
Loop count 64 0.00 seconds
Loop count 256 0.01 seconds
Loops 200 x 1 x Passes
Kernel Floating Pt ops
No Passes E No Total Secs. MFLOPS Span Checksums OK
------------ -- ------------- ----- ------- ---- ---------------------- --
1 7 x1566 5 1.097296e+10 0.98 11171.25 1001 5.114652693224671e+04 16
2 67 x 595 4 3.093524e+09 0.98 3164.40 101 1.539721811668385e+03 15
3 9 x 657 2 2.367565e+09 0.95 2494.82 1001 1.000742883066363e+01 15
4 14 x 728 2 2.446080e+09 0.96 2555.68 1001 5.999250595473891e-01 16
5 10 x 234 2 9.360000e+08 0.95 980.20 1001 4.548871642387267e+03 16
6 3 x 904 2 2.152243e+09 0.95 2276.20 64 4.375116344729986e+03 16
7 4 x 975 16 1.241760e+10 1.03 12101.10 995 6.104251075174761e+04 16
8 10 x 385 36 5.488560e+09 0.95 5788.45 100 1.501268005625795e+05 15
9 36 x 536 17 6.626246e+09 0.96 6926.99 101 1.189443609974981e+05 16
10 34 x 456 9 2.818627e+09 0.95 2973.11 101 7.310369784325296e+04 16
11 11 x 565 1 1.243000e+09 0.95 1309.65 1001 3.342910972650109e+07 16
12 12 x1201 1 2.882400e+09 0.95 3030.87 1000 2.907141294167248e-05 16
13 36 x 177 7 5.709312e+08 0.95 600.71 64 1.202533961842805e+11 15
14 2 x 290 11 1.277276e+09 0.95 1347.14 1001 3.165553044000335e+09 15
15 1 x 660 33 2.178000e+09 0.96 2268.96 101 3.943816690352044e+04 15
16 25 x 768 10 2.035200e+09 0.94 2153.77 75 5.650760000000000e+05 16
17 35 x 368 9 2.341584e+09 0.96 2447.92 101 1.114641772902486e+03 16
18 2 x 733 44 6.385896e+09 0.97 6567.18 100 1.015727037502299e+05 15
19 39 x 215 6 1.016262e+09 0.95 1070.62 101 5.421816960147207e+02 16
20 1 x 187 26 9.724000e+08 0.95 1021.36 1000 3.040644339351239e+07 16
21 1 x 302 2 7.625500e+09 0.95 8021.31 101 1.597308280710199e+08 15
22 11 x 356 17 1.344754e+09 0.95 1416.60 101 2.938604376566697e+02 16
23 8 x 223 11 1.942776e+09 0.95 2045.20 100 3.549900501563623e+04 16
24 5 x1553 1 1.553000e+09 0.95 1637.44 1001 5.000000000000000e+02 16
Maximum Rate 12101.10
Average Rate 3557.12
Geometric Mean 2580.73
Harmonic Mean 1966.74
Minimum Rate 600.71
Do Span 471
Calibrating part 2 of 3
Loop count 8 0.00 seconds
Loop count 32 0.00 seconds
Loop count 128 0.00 seconds
Loops 200 x 2 x Passes
Kernel Floating Pt ops
No Passes E No Total Secs. MFLOPS Span Checksums OK
------------ -- ------------- ----- ------- ---- ---------------------- --
1 40 x1061 5 8.572880e+09 0.98 8769.29 101 5.253344778937972e+02 16
2 40 x 495 4 3.072960e+09 1.01 3046.39 101 1.539721811668385e+03 15
3 53 x 595 2 2.548028e+09 1.00 2536.39 101 1.009741436578952e+00 16
4 70 x 949 2 3.188640e+09 1.00 3194.69 101 5.999250595473891e-01 16
5 55 x 247 2 1.086800e+09 1.00 1082.99 101 4.589031939600982e+01 16
6 7 x 760 2 2.042880e+09 0.98 2081.44 32 8.631675645333210e+01 16
7 22 x 858 16 1.220145e+10 0.99 12378.97 101 6.345586315784055e+02 16
More Below or Go To Start
8 6 x 338 36 5.782234e+09 1.00 5784.83 100 1.501268005625795e+05 15
9 21 x 483 17 6.966212e+09 1.00 6934.93 101 1.189443609974981e+05 16
10 19 x 431 9 2.977520e+09 1.01 2952.22 101 7.310369784325296e+04 16
11 64 x 536 1 1.372160e+09 1.00 1366.60 101 3.433560407475758e+04 16
12 68 x 931 1 2.532320e+09 1.04 2443.43 100 7.127569130821465e-06 16
13 41 x 165 7 6.061440e+08 1.00 603.71 32 9.816387810944356e+10 15
14 10 x 373 11 1.657612e+09 1.01 1640.50 101 3.039983465145392e+07 15
15 1 x 348 33 2.296800e+09 1.00 2295.02 101 3.943816690352044e+04 15
16 27 x 748 10 2.261952e+09 1.01 2241.47 40 6.480410000000000e+05 16
17 20 x 340 9 2.472480e+09 1.01 2441.10 101 1.114641772902486e+03 16
18 1 x 753 44 6.560136e+09 0.99 6608.43 100 1.015727037502299e+05 15
19 23 x 192 6 1.070438e+09 1.02 1053.29 101 5.421816960147207e+02 16
20 8 x 125 26 1.040000e+09 1.01 1031.93 100 3.126205178815431e+04 16
21 1 x 324 2 8.100000e+09 1.00 8099.88 50 7.824524877232093e+07 16
22 7 x 295 17 1.418242e+09 1.00 1415.93 101 2.938604376566697e+02 16
23 5 x 188 11 2.047320e+09 1.00 2044.93 100 3.549900501563623e+04 16
24 31 x 881 1 1.092440e+09 1.00 1087.94 101 5.000000000000000e+01 16
Maximum Rate 12378.97
Average Rate 3464.01
Geometric Mean 2544.88
Harmonic Mean 1951.83
Minimum Rate 603.71
Do Span 90
Calibrating part 3 of 3
Loop count 32 0.00 seconds
Loop count 128 0.00 seconds
Loop count 512 0.00 seconds
Loop count 2048 0.01 seconds
Loops 200 x 8 x Passes
Kernel Floating Pt ops
No Passes E No Total Secs. MFLOPS Span Checksums OK
------------ -- ------------- ----- ------- ---- ---------------------- --
1 28 x1795 5 1.085616e+10 1.00 10866.23 27 3.855104502494961e+01 16
2 46 x 748 4 2.422323e+09 1.00 2415.89 15 3.953296986903059e+01 16
3 37 x1126 2 3.599597e+09 1.01 3575.44 27 2.699309089320672e-01 16
4 38 x1471 2 2.683104e+09 1.00 2685.18 27 5.999250595473891e-01 16
5 40 x 473 2 1.574144e+09 1.02 1546.96 27 3.182615248447483e+00 16
6 21 x1047 2 1.688602e+09 1.01 1665.66 8 1.120309393467088e+00 15
7 20 x1082 16 1.163366e+10 0.94 12311.54 21 2.845720217644024e+01 16
8 9 x 427 36 5.755277e+09 1.00 5741.74 14 2.960543667875005e+03 15
9 26 x 664 17 7.043712e+09 1.00 7015.77 15 2.623968460874250e+03 16
10 25 x 557 9 3.007800e+09 1.02 2959.56 15 1.651291227698265e+03 16
11 46 x1015 1 1.942304e+09 1.02 1901.09 27 6.551161335845770e+02 16
12 48 x1714 1 3.422515e+09 1.02 3359.57 26 1.943435981130448e-06 16
13 31 x 226 7 6.277376e+08 1.01 621.29 8 3.847124199949431e+10 15
14 8 x 490 11 1.862784e+09 1.01 1853.19 27 2.923540598672009e+06 15
15 1 x 639 33 2.361744e+09 1.00 2361.30 15 1.108997288134785e+03 16
16 14 x 974 10 2.399936e+09 1.00 2394.06 15 5.152160000000000e+05 16
17 26 x 513 9 2.881008e+09 1.00 2875.92 15 2.947368618589361e+01 16
18 2 x 647 44 5.921344e+09 1.00 5907.72 14 9.700646212337041e+02 16
19 28 x 273 6 1.100736e+09 1.02 1074.17 15 1.268230698051003e+01 15
20 7 x 145 26 1.097824e+09 1.02 1077.78 26 5.987713249475302e+02 16
21 1 x 212 2 8.480000e+09 1.01 8435.26 20 5.009945671204667e+07 16
22 8 x 421 17 1.374144e+09 1.00 1373.68 15 6.109968728263972e+00 16
23 7 x 356 11 2.850848e+09 1.00 2862.02 14 4.850340602749970e+02 16
24 23 x 952 1 9.108736e+08 1.00 910.11 27 1.300000000000000e+01 16
Maximum Rate 12311.54
Average Rate 3657.96
Geometric Mean 2704.87
Harmonic Mean 2064.49
Minimum Rate 621.29
Do Span 19
Overall
Part 1 weight 1
Part 2 weight 2
Part 3 weight 1
Maximum Rate 12378.97
Average Rate 3535.78
Geometric Mean 2593.02
Harmonic Mean 1982.64
Minimum Rate 600.71
Do Span 167
More Below or Go To Start
Unrolled Double Precision Linpack Benchmark - PC Version in 'C/C++'
Optimisation AVX 64 Bit
norm resid resid machep x[0]-1 x[n-1]-1
1.7 7.41628980e-14 2.22044605e-16 -1.49880108e-14 -1.89848137e-14
Times are reported for matrices of order 100
1 pass times for array with leading dimension of 201
dgefa dgesl total Mflops unit ratio
0.00016 0.00001 0.00017 4091.04 0.0005 0.0030
Calculating matgen overhead
10 times 0.00 seconds
100 times 0.00 seconds
1000 times 0.03 seconds
10000 times 0.28 seconds
20000 times 0.54 seconds
40000 times 1.02 seconds
Overhead for 1 matgen 0.00003 seconds
Calculating matgen/dgefa passes for 1 seconds
10 times 0.00 seconds
100 times 0.02 seconds
1000 times 0.17 seconds
2000 times 0.32 seconds
4000 times 0.64 seconds
8000 times 1.27 seconds
Passes used 6311
Times for array with leading dimension of 201
dgefa dgesl total Mflops unit ratio
0.00013 0.00000 0.00014 5049.22 0.0004 0.0024
0.00013 0.00000 0.00014 4949.89 0.0004 0.0025
0.00013 0.00000 0.00014 4956.75 0.0004 0.0025
0.00013 0.00000 0.00014 5048.17 0.0004 0.0024
0.00013 0.00000 0.00014 5049.18 0.0004 0.0024
Average 5010.64
Calculating matgen2 overhead
Overhead for 1 matgen 0.00003 seconds
Times for array with leading dimension of 200
dgefa dgesl total Mflops unit ratio
0.00012 0.00000 0.00013 5372.95 0.0004 0.0023
0.00012 0.00000 0.00013 5374.23 0.0004 0.0023
0.00012 0.00000 0.00013 5370.76 0.0004 0.0023
0.00012 0.00000 0.00013 5462.18 0.0004 0.0022
0.00012 0.00000 0.00013 5463.74 0.0004 0.0022
Average 5408.77
Unrolled Double Precision 5010.64 Mflops
Single Precision C Whetstone Benchmark AVX 64 Bit, Tue Jan 18 23:29:03 2022
Calibrate
0.01 Seconds 1 Passes (x 100)
0.01 Seconds 5 Passes (x 100)
0.05 Seconds 25 Passes (x 100)
0.23 Seconds 125 Passes (x 100)
1.12 Seconds 625 Passes (x 100)
5.49 Seconds 3125 Passes (x 100)
Use 5695 passes (x 100)
Single Precision C/C++ Whetstone Benchmark
Loop content Result MFLOPS MOPS Seconds
N1 floating point -1.12475013732910156 1324.013 0.083
N2 floating point -1.12274742126464844 1319.724 0.580
N3 if then else 1.00000000000000000 0.000 0.000
N4 fixed point 12.00000000000000000 6512.599 0.275
N5 sin,cos etc. 0.49911010265350342 134.391 3.526
N6 floating point 0.99999982118606567 977.216 3.144
N7 assignments 3.00000000000000000 3908.815 0.269
N8 exp,sqrt etc. 0.75110864639282227 99.207 2.135
MWIPS 5688.198 10.012
Faster Than Expected Next or Go To Start
Faster Than Expected
The following is intended to show that it is not just my MP MFLOPS benchmark that produces performance levels higher
than expected from compilations using SSE and AVX instructions, in this case dealing with double precision variables. The
other benchmark considered is the naturally double precision Livermore Loops. In both cases, the disassembled code was
checked to ensure that there were no Fused Multiply and Add (FMA) type instructions. Then, according to the documentation
viewed, maximum performance using SSE instructions is 2.0 MFLOPS per MHz, with 128-bit xmm registers, and 4.0 MFLOPS per
MHz using 256-bit AVX with ymm registers.
Below, the Livermore Loops example shows the full displayed output for the kernel producing the maximum MFLOPS, the
source code with its 16 floating point operations and the compile commands used. The SSE example indicates 3.56 MFLOPS per
MHz, thought to be impossible without FMA. The AVX results provide 4.86 MFLOPS per MHz, 21.5% higher than the expected
maximum.
The same range of results, source code and compile options is provided for the MP MFLOPS benchmark with these combinations
of instructions, running via a single thread. Looking at the first word size details, least likely to involve RAM data
transfers, SSE at 12437 to 16602 MFLOPS equates to 3.00 to 4.00 MFLOPS per MHz, and AVX at 14606 to 32420 MFLOPS to 3.53 to
7.81 per MHz. These ranges include the Livermore Loops ratios, but the larger ones are higher than might be expected using
FMA, with the particular combination of instructions shown.
Unexpectedly high levels of performance were also produced on running the benchmarks on the much older Core i7 PC. There,
the Livermore Loops SSE maximum was 3.05 MFLOPS per MHz, with single thread DP MP-MFLOPS at 3.14 MFLOPS per MHz using SSE
and 5.87 using AVX.
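The MFLOPS per MHz figures above are simply the measured MFLOPS divided by the 4150 MHz maximum turbo speed. A short check
of the quoted ratios, using values taken from the results listed below:

/* Verify the MFLOPS per MHz ratios quoted above, using the measured
   results listed below (Core i5 at 4150 MHz maximum turbo). */
#include <stdio.h>

int main(void)
{
    const double mhz = 4150.0;
    printf("Livermore kernel 7 SSE2 %5.2f per MHz\n", 14782.74 / mhz);
    printf("Livermore kernel 7 AVX  %5.2f per MHz\n", 20184.30 / mhz);
    printf("MP MFLOPS SSE2 range    %5.2f to %5.2f per MHz\n",
           12437.0 / mhz, 16602.0 / mhz);
    printf("MP MFLOPS AVX range     %5.2f to %5.2f per MHz\n",
           14606.0 / mhz, 32420.0 / mhz);
    return 0;
}
/* Approximately 3.56, 4.86, 3.00 to 4.00 and 3.52 to 7.81 respectively */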
4150 MHz Core i5 Livermore Loops Benchmark
Kernel Floating Pt ops
No Passes E No Total Secs. MFLOPS Span Checksums OK
------------ -- ------------- ----- ------- ---- ---------------------- --
SSE2
7 4 x1037 16 1.320723e+10 0.89 14782.74 995 6.104251075174761e+04 16
AVX
7 4 x1423 16 1.812333e+10 0.90 20184.30 995 6.104251075174761e+04 16
Kernel 7 C Code
for ( k=0 ; k < n ; k++ )
{
x[k] = u[k] + r*( z[k] + r*y[k] ) +
t*( u[k+3] + r*( u[k+2] + r*u[k+1] ) +
t*( u[k+6] + q*( u[k+5] + q*u[k+4] ) ) );
}
Compiled With gcc lloops.c -O3 -msse2 -m64 -lrt -lc -lm -o lloopssse2
and gcc lloops.c -O3 -mavx -m64 -lrt -lc -lm -o lloopsavx
#####################################################
4150 MHz Core i5 MP DP MFLOPS Benchmark 1 Thread
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
SSE2
Data in & out 102400 2 75000 1.234995 12437 0.414016 Yes
Data in & out 1024000 2 7500 3.085865 4978 0.812316 Yes
Data in & out 10240000 2 750 4.262126 3604 0.977908 Yes
Data in & out 102400 8 75000 3.723678 16500 0.563491 Yes
Data in & out 1024000 8 7500 4.814260 12762 0.883058 Yes
Data in & out 10240000 8 750 5.615416 10941 0.986707 Yes
Data in & out 102400 32 75000 14.803324 16602 0.353716 Yes
Data in & out 1024000 32 7500 15.063927 16314 0.723569 Yes
Data in & out 10240000 32 750 15.063069 16315 0.964957 Yes
AVX
Data in & out 102400 2 75000 1.051636 14606 0.414016 Yes
Data in & out 1024000 2 7500 2.418388 6351 0.812316 Yes
Data in & out 10240000 2 750 4.170949 3683 0.977908 Yes
Data in & out 102400 8 75000 1.890234 32504 0.563491 Yes
Data in & out 1024000 8 7500 3.183412 19300 0.883058 Yes
Data in & out 10240000 8 750 5.054079 12157 0.986707 Yes
Data in & out 102400 32 75000 7.580423 32420 0.353716 Yes
Data in & out 1024000 32 7500 7.873082 31215 0.723569 Yes
Data in & out 10240000 32 750 8.061002 30488 0.964957 Yes
C Function Code 8 Operations per Word
for(i=0; i < n; i++)
x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f;
Compiled With gcc mpmflops2dp.c -lpthread -msse2 -lrt -lc -lm -O3 -o MPmflops64SSE2DP
and gcc mpmflops2dp.c -lpthread -mavx -lrt -lc -lm -O3 -o MPmflops64AVXDP
Go To Start
