
Raspberry Pi 4B 64 Bit Benchmarks and Stress Tests

Authors:
  • UK Government

Abstract

Previously, I have run my 32 bit and 64 bit benchmarks and stress tests on the appropriate range of Raspberry Pi computers, up to model 3B+. Details of the benchmarks, results and download links are available from ResearchGate in a PDF file. I have also run the 32 bit versions on the Raspberry Pi 4, with results in Raspberry Pi 4 Benchmarks PDF file and Raspberry Pi 4 Stress Tests PDF file. This new report contains brief reminders of the benchmarks, with 64 bit results on the Raspberry Pi 4 and Pi 3B+ using Gentoo Operating System. Pi 4/Pi 3B+ comparisons are included, then others with 32 bit systems and later gcc 9 compilations.
Raspberry Pi 4B 64 Bit Benchmarks and Stress Tests
Roy Longbottom
Contents
Summary
Introduction
Whetstone Benchmark
Dhrystone Benchmark
Linpack 100 Benchmark
Livermore Loops Benchmark
FFT Benchmarks
BusSpeed Benchmark
MemSpeed Benchmark
NeonSpeed Benchmark
MultiThreading Benchmarks
MP-Whetstone Benchmark
MP-Dhrystone Benchmark
MP NEON Linpack Benchmark
MP-BusSpeed Benchmark
MP-BusSpeed Disassembly
MP-RandMem Benchmark
MP-MFLOPS Benchmarks
MP-MFLOPS Disassembly
MP-MFLOPS Sumchecks
OpenMP-MFLOPS Benchmarks
OpenMP-MemSpeed Benchmarks
Stress Testing Benchmarks
Integer Stressing Benchmark
Single Precision Stress Benchmark
Double Precision Stress Benchmark
High Performance Linpack
DriveSpeed Benchmark
USB 3 and 2 Benchmarks
Drive Write/Reboot/Read Tests
LAN and WiFi Benchmarks
Java Whetstone Benchmark
JavaDraw Benchmark
OpenGL Benchmark
Stress Tests
HP Linpack Stress Test
Integer Stress Test
Single Precision FPU Stress Test
Double Precision FPU Stress Test
OpenGL + 3 x Livermore Loops
Input/Output Stress Test
CPU + Main SD + USB + LAN Test
Summary
Previously, I have run my 32 bit and 64 bit benchmarks and stress tests on the appropriate range of Raspberry Pi
computers, up to model 3B+. Details of the benchmarks, results and download links are available from ResearchGate in a
PDF file. I have also run the 32 bit versions on the Raspberry Pi 4, with results in Raspberry Pi 4 Benchmarks PDF file and
Raspberry Pi 4 Stress Tests PDF file. This new report contains brief reminders of the benchmarks, with 64 bit results on
the Raspberry Pi 4 and Pi 3B+ using Gentoo Operating System. Pi 4/Pi 3B+ comparisons are included, then others with 32
bit systems and later gcc 9 compilations. The range of benchmarking targets was as follows.
Single Core CPU Tests - comprising Whetstone, Dhrystone, Linpack and Livermore Loops Classic Benchmarks.
Single Core Memory Benchmarks - measuring performance using data from caches and RAM. These comprise FFTs
with floating point, BusSpeed, with integer arithmetic, then MemSpeed and NeonSpeed with both.
Multithreading Benchmarks - Most of the multithreading benchmarks execute the same calculations using 1, 2, 4 and
8 threads. Some are multithreaded versions of the previous programs, comprising Whetstone, Dhrystone, Linpack and
BusSpeed benchmarks. Then there is MP-RandMem for random access and MP-MFLOPS for high speed floating point.
Finally, there are OpenMP versions of the latter and MemSpeed.
Stress Testing Benchmarks - The Raspberry Pi 4 can become excessively hot and might need a cooling fan
attachment for efficient operation of certain applications. Stress tests are detailed later; this area covers benchmarks
intended to identify which area to test. Three programs cover floating point and integer arithmetic, with different
processing profiles, accessing caches and RAM. Then there is High Performance Linpack, which can be a killer.
Input/Output Benchmarks - DriveSpeed and LanSpeed are used to measure performance of the main SD card, USB
connected storage and networks via WiFi or Ethernet.
Java and OpenGL Benchmarks - A Java Whetstone benchmark is provided and one using JavaDraw procedures. The
OpenGL benchmark has six test functions of increasing complexity and run using a range of different window sizes.
Stress Tests - Stress tests mainly have run time options to specify running time and parameters such as memory used and
alternative test functions, then run with continuous displays showing any changes in performance. An extra program
measured CPU MHz, temperature and voltage. The main CPU stress tests are mentioned above; the Livermore Loops and
OpenGL benchmark programs can also be used, along with one geared up to exercise input/output. Stress test results
identify cases of temperature related CPU speed throttling down to 600 MHz, with temperatures up to 85°C, when a
cooling fan is not fitted.
Performance Comparisons - More than 1400 comparisons are provided. For the main 1000 plus applicable to
CPU speed, the Pi 4 was generally faster than the Pi 3B+, with average, minimum and maximum ratios of 2.62, 0.70 and
16.8 times, the latter arising from use of the larger L2 cache. There were also average performance gains for 64 bit
compilations, compared with those at 32 bits, and some losses, the three ratios being 1.28, 0.31 and 4.90. The same
story applied to gcc 9 versus gcc 6 compilations at 1.16, 0.37 and 2.93. A key area is maximum floating point speed
running the High Performance Linpack Benchmark, with the four GB Pi 4 achieving more than 10 presumably double
precision GFLOPS, close to my benchmark’s score at 13, with single precision at 26.
Other Issues
Dual Monitors - handled in different ways. Gentoo provided mirroring or a wide image squashed on one monitor.
Raspbian spread wide images across both displays, but had no mirroring option.
C Direct I/O - This worked as expected at 32 bits but in the 64 bit Gentoo version could lead to failure to write or read.
Separate write and read programs were produced to enable performance to be measured.
5 GHz WiFi - There were difficulties in connecting at 5 GHz using Raspbian, and it seemed to be impossible using Gentoo.
Introduction
The Raspberry Pi 4B uses a quad core ARM A72 CPU, with 32 KB L1 cache and shared 1 MB L2 cache. RAM is
LPDDR4-3200 with 1, 2 or 4 GB options. Other enhancements are USB 3 connections and gigabit Ethernet. The benchmarks
and stress tests covered here were run on 4 GB models.
Previously, I have run my 32 bit and 64 bit benchmarks and stress tests on the appropriate range of Raspberry Pi
computers, up to model 3B+. Details of the benchmarks, results and download links are available from ResearchGate in a
Raspberry Pi 3B 32 bit and 64 bit Benchmarks and Stress Tests PDF file. I have also run the 32 bit versions on the
Raspberry Pi 4, with results in Raspberry Pi 4 Benchmarks PDF file and Raspberry Pi 4 Stress Tests PDF file. This new
report contains brief reminders of the benchmarks, with 64 bit results on the Raspberry Pi 4 and Pi 3B+ using Gentoo
Operating System. Pi 4/Pi 3B+ comparisons are included, then others with 32 bit systems and later gcc 9 compilations.
The programs and source codes for the original 64 bit versions are available for downloading in
Raspberry-Pi-4-Benchmarks.tar.gz, and the new gcc 9 compilations in Raspberry-Pi-4-64-Bit-Benchmarks.tar.gz.
New gcc 9 program versions - On producing these, the first step was to change the functions used to identify the
hardware, where the existing procedures replicate information for each core (even four lots were too much). I noted
that the lscpu command now provides adequate detail, so I now use this instead. RPi 3B+ and RPi 4B CPUID results are
now as follows:
Pi 3B+
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: ARM
Model: 4
Model name: Cortex-A53
Stepping: r0p4
CPU max MHz: 1400.0000
CPU min MHz: 600.0000
BogoMIPS: 38.40
Flags: fp asimd evtstrm crc32 cpuid
Linux pi64 4.19.67-v8-174fcab91765-bis+ #2 SMP PREEMPT
Tue Aug 27 13:29:20 GMT 2019 aarch64 GNU/Linux
Pi 4B
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: ARM
Model: 3
Model name: Cortex-A72
Stepping: r0p3
CPU max MHz: 1500.0000
CPU min MHz: 600.0000
BogoMIPS: 108.00
Flags: fp asimd evtstrm crc32 cpuid
Linux pi64 4.19.67-v8-174fcab91765-p4-bis+ #2 SMP PREEMPT
Tue Aug 27 13:58:09 GMT 2019 aarch64 GNU/Linux
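The lscpu listings above are plain "Key: value" lines, so they are easy to pick apart in a script. The following is a minimal Python sketch of that idea, not the code used in the benchmarks (which are C programs); the function names parse_lscpu and read_lscpu and the SAMPLE text are illustrative assumptions, with the sample values copied from the Pi 4B listing.

```python
import subprocess

def parse_lscpu(text):
    """Turn 'Key: value' lines of lscpu output into a dictionary."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            info[key.strip()] = value.strip()
    return info

def read_lscpu():
    """Run lscpu on the local machine and parse its output (Linux only)."""
    result = subprocess.run(["lscpu"], capture_output=True, text=True, check=True)
    return parse_lscpu(result.stdout)

# Sample lines copied from the Pi 4B listing in this report.
SAMPLE = """Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
CPU(s): 4
Model name: Cortex-A72
CPU max MHz: 1500.0000
CPU min MHz: 600.0000
"""

cpu = parse_lscpu(SAMPLE)
print(cpu["Model name"], float(cpu["CPU max MHz"]))
```

Calling read_lscpu() on a Raspberry Pi should return the same kind of dictionary for the machine the script is running on.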
Whetstone Benchmark - whetstonePi64, whetstonePi64g9, whetstonePiA7
This has a number of simple programming loops, with the overall MWIPS rating dependent on floating point calculations,
lately those identified as COS and EXP. The last three can be over optimised (N/A), but the time does not affect the
overall rating much.
For this simple code, at 64 bits, average Pi 4 performance gain, over the Pi 3B+, was 2.12 times, but only around 1.3
times for straightforward floating point calculations. Then, as should be expected, the Pi 4B 32 bit speed was not much
slower.
Performance of the gcc 9 compilations for the Pi 4B was effectively the same as the earlier versions. The Pi 3B+ results
indicated improvements, but this was due to the EXP type function calculations. The new compilation included a minor
tweak for the IF tests, to avoid over optimisation.
System      MHz  MWIPS -----MFLOPS----- -------------MOPS--------------
                           1     2     3   COS   EXP  FIXPT    IF  EQUAL
Pi 3B+     1400   1071   383   403   328  20.9  12.4   1704   N/A   1357
Pi 4B      1500   2269   522   534   398  54.8  39.8   2487   N/A    997
Pi4/3B+    1.07   2.12  1.36  1.32  1.21  2.63  3.21   1.46   N/A   0.73
Pi 4B 32b  1500   1884   516   478   310  54.7  27.1   2498  2247    999
64b/32b    1.00   1.20  1.01  1.12  1.28  1.00  1.47   1.00   N/A   1.00
========================================================================
gcc 9
Pi 3B+     1400   1482   384   404   329  27.4  28.2   1712  2042   1362
Pi 4B      1500   2330   522   533   398  60.4  40.3   2493  2984    997
Pi4/3B+    1.07   1.57  1.36  1.32  1.21  2.21  1.43   1.46  1.46   0.73
gcc 9/6
Pi 4B      1.00   1.03  1.00  1.00  1.00  1.10  1.01   1.00   N/A   1.00
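The comparison rows in these tables are simple quotients of the measured speeds, and the Summary quotes average, minimum and maximum values of such ratios. As a rough Python illustration (the helper names ratio and summarise are assumptions, and the figures are the MWIPS and Pi4/3B+ values from the first table above):

```python
# MWIPS ratings copied from the table above.
MWIPS = {"Pi 3B+ 64b": 1071, "Pi 4B 64b": 2269, "Pi 4B 32b": 1884}

def ratio(new, old):
    """Relative performance, rounded to two decimals as in the tables."""
    return round(new / old, 2)

def summarise(ratios):
    """Average, minimum and maximum of a list of performance ratios."""
    return sum(ratios) / len(ratios), min(ratios), max(ratios)

pi4_vs_pi3 = ratio(MWIPS["Pi 4B 64b"], MWIPS["Pi 3B+ 64b"])
b64_vs_b32 = ratio(MWIPS["Pi 4B 64b"], MWIPS["Pi 4B 32b"])

# Pi4/3B+ row of the first table, omitting the MHz column and the N/A entry.
pi4_pi3_row = [2.12, 1.36, 1.32, 1.21, 2.63, 3.21, 1.46, 0.73]
average, lowest, highest = summarise(pi4_pi3_row)

print(pi4_vs_pi3, b64_vs_b32, average, lowest, highest)
```

Ratios recomputed from the rounded table entries can differ in the last digit from those printed, which were presumably derived from unrounded measurements.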
Dhrystone Benchmark - dhrystonePi64, dhrystonePi64g9, dhrystonePiA7
This appears to be the most popular ARM benchmark and is often subject to over optimisation, so results from different
compilers cannot be compared. Ignoring this, results in VAX MIPS, aka DMIPS, and comparisons follow. This benchmark
has no significant data arrays suitable for vectorisation.
Using the same 64 bit program, the Pi 4 was more than twice as fast as the Pi 3B+, and 52% faster than the 32 bit
compilation. The gcc 9 compilations led to no real difference in performance.
                 Compiled  DMIPS
System      MHz     DMIPS   /MHz
Pi 3B+     1400      4028   2.88
Pi 4B      1500      8176   5.45
Pi4/3B+    1.07      2.03
Pi 4B 32b  1500      5366   3.58
64b/32b    1.00      1.52
================================
gcc 9
Pi 3B+     1400      3896   2.78
Pi 4B      1500      8190   5.46
Pi4/3B+    1.07      2.10
gcc 9/6
Pi 4B      1.00      1.00
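For reference, VAX MIPS (DMIPS) is conventionally the Dhrystones per second score divided by 1757, the score of the VAX 11/780, and the DMIPS/MHz column is that rating divided by the clock speed. A small Python sketch of the arithmetic follows; the function names are illustrative and not taken from the benchmark source.

```python
VAX_11_780_DHRYSTONES_PER_SECOND = 1757  # conventional 1 MIPS reference score

def vax_mips(dhrystones_per_second):
    """Convert a Dhrystones per second score to VAX MIPS (DMIPS)."""
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SECOND

def dmips_per_mhz(dmips, mhz):
    """DMIPS/MHz, rounded to two decimals as in the table."""
    return round(dmips / mhz, 2)

# DMIPS and MHz values copied from the table above.
pi3_64b = dmips_per_mhz(4028, 1400)
pi4_64b = dmips_per_mhz(8176, 1500)
pi4_32b = dmips_per_mhz(5366, 1500)
pi4_over_pi3 = round(8176 / 4028, 2)
b64_over_b32 = round(8176 / 5366, 2)

print(pi3_64b, pi4_64b, pi4_32b, pi4_over_pi3, b64_over_b32)
```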
Linpack 100 Benchmark MFLOPS - linpackPi64, linpackPiSP64, linpackPiNEONi64, linpackPi64g9,
linpackPi64g9SP, linpackPi64NEONig9, linpackPiA7, linpackPiA7SP
The original Linpack benchmark specified the use of double precision (DP) floating point arithmetic, and the code used
here is identical to that initially approved for use on old PCs. For the benefit of early ARM computers, the code is also
run using single precision (SP) numbers. A version was also produced, replacing the key Daxpy code with NEON Intrinsic
Functions, using vector operations, also with single precision calculations.
The Pi 3B+ 32 bit results are also provided for clarification. My results were highlighted in the MagPi magazine, on
announcement of the Pi 4, particularly the 2 GFLOPS 32 bit NEON speed. See raspberry-pi-4-specs-benchmarks.
At 64 bits, Pi 4/3B+ performance ratios were generally higher than those from the earlier benchmarks. Then, as could be
expected, the virtually compiler independent performance, using NEON Intrinsic Functions, was similar at 32 bits and 64
bits. The main 64 bit gain was with the compiled single precision version, obtaining the same performance as that via
NEON Intrinsics.
The new gcc 9 compilations produced the same performance as the older versions, within the variations normally seen
on this benchmark.
                 -------- MFLOPS ---------
System      MHz      DP      SP  SP NEON
Pi 3B+     1400   396.6   562.1    604.2
Pi 4B      1500  1059.9  1977.8   1968.6
Pi4/3B+    1.07    2.67    3.52     3.26
Pi 4B 32b  1500   760.2   921.6   2010.5
64b/32b    1.00    1.39    2.15     0.98
Pi 3B+ 32  1400   210.5   225.2    562.5
Pi4/3B+    1.07    3.61    4.09     3.57
========================================
gcc 9
Pi 3B+     1400   396.2   571.3    566.7
Pi 4B      1500  1110.6  2052.4   1887.5
Pi4/3B+    1.07    2.80    3.59     3.33
gcc 9/6
Pi 4B      1.00    1.05    1.04     0.96
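The Daxpy operation mentioned above computes y = y + a * x over two vectors, costing one multiply and one add per element. The sketch below shows the operation and how an MFLOPS figure follows from an operation count and a running time. It is a plain Python illustration with assumed function names, not the benchmark's C or NEON code.

```python
def daxpy(a, x, y):
    """In-place y[i] = y[i] + a * x[i]; returns the number of floating point operations."""
    for i in range(len(x)):
        y[i] += a * x[i]      # one multiply and one add per element
    return 2 * len(x)

def mflops(operations, seconds):
    """Millions of floating point operations per second."""
    return operations / seconds / 1.0e6

y = [1.0, 2.0, 3.0]
flops = daxpy(2.0, [1.0, 1.0, 1.0], y)
print(y, flops, mflops(2_000_000, 0.001))
```

A vectorised version, whether produced by the compiler or written with NEON intrinsics, performs the same arithmetic on several elements per instruction.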
Livermore Loops Benchmark MFLOPS - liverloopsPi64, liverloopsPi64g9, liverloopsPiA7
This original main benchmark for supercomputers was first introduced in 1970, initially comprising 14 kernels of numerical
application, written in Fortran. This was increased to 24 kernels in the 1980s. Following are overall MFLOPS ratings,
geometric mean being the official average performance, followed by details from the 24 kernels. Note that these are for
double precision calculations.
All the ratings indicate reasonably significant performance gains of Pi 4 over Pi 3B+ and 64 bits over 32 bits. Results from
the 24 kernels indicate some higher gains. Also note the maximum speed of 2.49 GFLOPS (Double Precision).
The speed of the original Raspberry Pi could be rated as 4.5 times faster than the Cray 1 supercomputer (Geomean 11.9)
- see my quote in this report. Now, one core of the Raspberry Pi 4B, at 64 bits, produces performance equivalent to 61
Cray 1 supercomputers.
There were some performance differences in gcc 9 results but average speeds were quite similar.
Overall Ratings - MFLOPS
System       MHz  Maximum  Average  Geomean  Harmean  Minimum
Pi 3B+ 64b  1400    737.7    319.4    284.7    250.6     91.6
Pi 4B 64b   1500   2490.5    892      730.3    603.3    212.4
Pi4/3B+     1.07     3.38     2.79     2.57     2.41     2.32
Pi 4B 32b   1500   1800.2    635.1    519.0    416.1    155.3
64b/32b     1.00     1.38     1.40     1.41     1.45     1.37
=============================================================
gcc 9
Pi 3B+      1400   1000.7    347.8    308.0    275.2    117.3
Pi 4B       1500   2744.5    962.5    768.2    596.2    132.1
Pi4/3B+     1.07     2.74     2.77     2.49     2.17     1.13
gcc 9/6
Pi 4B       1.00     1.10     1.08     1.05     0.99     0.62
MFLOPS for 24 loops
MFLOPS Of 24 Kernels
Pi 3B+ 540 296 539 527 226 175 738 428 484 251 169 245
127 161 291 258 440 520 333 280 310 93 362 209
Pi 4B 2026 997 987 948 372 739 2033 2491 1980 758 495 875
220 404 811 710 753 1124 444 397 1061 414 822 283
Pi 4B/ 3.75 3.37 1.83 1.80 1.65 4.23 2.76 5.83 4.09 3.02 2.92 3.57
Pi 3B+ 1.73 2.51 2.79 2.75 1.71 2.16 1.33 1.42 3.43 4.48 2.27 1.36
Min 1.33 Max 5.83
Pi 4B 32 746 964 988 943 212 538 1169 1800 1032 469 214 186
159 335 778 623 732 1034 320 350 489 360 749 187
64b/32b 2.72 1.03 1.00 1.00 1.76 1.37 1.74 1.38 1.92 1.62 2.31 4.70
1.38 1.20 1.04 1.14 1.03 1.09 1.39 1.13 2.17 1.15 1.10 1.51
Min 1.00 Max 4.70
===========================================================================
gcc9
Pi 3B+ 565 320 319 535 227 207 1001 581 541 234 171 248
121 160 293 280 456 547 337 287 367 190 386 209
Pi 4B 2146 989 970 965 390 785 2386 2479 1879 632 500 973
134 423 814 670 726 1177 450 397 1675 561 818 283
Pi 4B/ 3.80 3.09 3.04 1.80 1.72 3.80 2.38 4.27 3.48 2.70 2.93 3.93
Pi 3B+ 1.10 2.65 2.78 2.39 1.59 2.15 1.33 1.39 4.56 2.95 2.12 1.35
Min 1.10 Max 4.56
gcc 9/6
Pi 4B 1.06 0.99 0.98 1.02 1.05 1.06 1.17 1.00 0.95 0.83 1.01 1.11
0.61 1.05 1.00 0.94 0.96 1.05 1.01 1.00 1.58 1.35 1.00 1.00
Min 0.61 Max 1.58
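The three averages in the overall ratings are the arithmetic, geometric and harmonic means. The Python sketch below applies them to the 24 Pi 4B 64 bit kernel speeds listed above and repeats the Cray 1 comparison. The function names are illustrative, and the means of these 24 rounded values need not match the published overall ratings, which are presumably derived from more measurements than the 24 shown.

```python
import math

# Pi 4B 64 bit MFLOPS for the 24 kernels, copied from the table above.
PI4_64B_KERNELS = [2026, 997, 987, 948, 372, 739, 2033, 2491, 1980, 758, 495, 875,
                   220, 404, 811, 710, 753, 1124, 444, 397, 1061, 414, 822, 283]

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    # exp of the mean of logs avoids overflow from multiplying 24 large numbers
    return math.exp(sum(math.log(v) for v in values) / len(values))

def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)

CRAY1_GEOMEAN = 11.9      # Cray 1 geometric mean MFLOPS quoted in the text
PI4_64B_GEOMEAN = 730.3   # Pi 4B 64 bit geometric mean from the overall ratings
cray_equivalents = PI4_64B_GEOMEAN / CRAY1_GEOMEAN

print(arithmetic_mean(PI4_64B_KERNELS),
      geometric_mean(PI4_64B_KERNELS),
      harmonic_mean(PI4_64B_KERNELS),
      round(cray_equivalents))
```

The harmonic mean is never larger than the geometric mean, which in turn is never larger than the arithmetic mean, matching the ordering of the Average, Geomean and Harmean columns.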
Fast Fourier Transforms Benchmarks - fft1-RPi64, fft3c-RPi64, fft1Pi64g9,
fft3cPi64g9, fft1-RPi2, fft3c-Rpi2
This is a real application provided by my collaborator at Compuserve Forum. There are two versions. The first one is the
original C program. The second is an optimised version, originally using my x86 assembly code, but translated back into C
code, making use of the partitioning and (my) arrangement to optimise for burst reading from RAM. Three measurements
are made, at each size, using both single and double precision data, calculating FFT sizes between 1K and 1024K.
Results are in milliseconds, with those here, the average of three measurements.
There were gains all round on the Pi 4, compared with the 3B+, mainly between 3 and 4 times on the optimised version,
less so using FFT1, with more data transfer speed dependency.
On the Pi 4, performance from the 32 bit compilation was often similar to that at 64 bits. This is probably due to much of
the data being read on a skipped sequential basis, not good for vectorisation.
The Pi 4B/3B+ performance gains were similar using both gcc 9 and gcc 6 compiled programs, but the gcc 9 compilation
produced some faster FFT1 speeds, as shown in the Pi 4B gcc 9/6 comparisons.
Gentoo 64b Pi 3B+
Size FFT1 FFT3
K SP DP SP DP
1 0.13 0.15 0.15 0.17
2 0.29 0.39 0.32 0.38
4 0.76 1.13 0.79 0.85
8 1.93 2.66 1.77 1.94
16 4.02 5.51 4.69 5.14
32 9.50 25.11 9.51 13.67
64 42.53 110.21 25.30 32.25
128 151.08 257.41 57.68 76.71
256 355.88 589.07 129.47 174.85
512 819.91 1324.89 297.80 390.74
1024 1746.23 2943.08 641.50 863.82
Gentoo 64b Pi 4B Pi4/3B+
Size FFT1 FFT3 FFT1 FFT3
K SP DP SP DP SP DP SP DP
1 0.04 0.04 0.04 0.04 3.30 3.62 3.60 4.13
2 0.08 0.14 0.11 0.09 3.81 2.88 2.82 4.03
4 0.25 0.38 0.19 0.22 3.05 2.93 4.13 3.86
8 0.79 1.31 0.46 0.50 2.45 2.04 3.87 3.87
16 2.15 2.91 1.15 1.09 1.87 1.89 4.07 4.71
32 5.71 6.76 2.48 3.18 1.66 3.71 3.83 4.30
64 15.22 51.00 5.43 9.29 2.79 2.16 4.66 3.47
128 83.47 151.95 16.28 24.75 1.81 1.69 3.54 3.10
256 231.24 362.64 39.13 57.28 1.54 1.62 3.31 3.05
512 561.16 765.18 90.20 133.21 1.46 1.73 3.30 2.93
1024 1250.51 1878.44 213.35 303.39 1.40 1.57 3.01 2.85
Raspbian 32b Pi 4B 64B/32b
Size FFT1 FFT3 FFT1 FFT3
K SP DP SP DP SP DP SP DP
1 0.04 0.04 0.06 0.05 0.99 0.96 1.44 1.18
2 0.08 0.12 0.13 0.11 1.04 0.89 1.14 1.18
4 0.32 0.37 0.27 0.24 1.28 0.96 1.42 1.09
8 0.77 0.97 0.58 0.55 0.98 0.74 1.26 1.09
16 1.69 2.01 1.49 1.35 0.78 0.69 1.29 1.24
32 4.37 4.89 2.96 3.63 0.77 0.72 1.19 1.14
64 9.12 26.55 7.46 10.75 0.60 0.52 1.37 1.16
128 55.52 160.11 17.93 26.03 0.67 1.05 1.10 1.05
256 305.92 423.06 41.16 55.06 1.32 1.17 1.05 0.96
512 833.10 854.88 86.93 120.53 1.48 1.12 0.96 0.90
1024 1617.49 1875.52 190.28 266.60 1.29 1.00 0.89 0.88
===========================================================================
Gentoo Pi 3B+ gcc 9 Gentoo Pi 4B gcc 9
Size FFT1 FFT3 FFT1 FFT3
K SP DP SP DP SP DP SP DP
1 0.15 0.16 0.15 0.14 0.04 0.04 0.04 0.04
2 0.34 0.39 0.31 0.31 0.08 0.13 0.08 0.09
4 0.89 1.00 0.82 0.79 0.19 0.33 0.19 0.21
8 2.19 2.70 1.66 1.89 0.71 0.74 0.46 0.46
16 4.32 5.94 4.88 5.32 1.63 2.06 1.17 1.09
32 12.47 24.05 9.59 14.82 3.73 4.03 2.44 3.09
64 66.46 116.11 26.53 36.64 7.92 27.12 5.46 9.06
128 169.06 268.02 63.65 84.00 43.28 100.75 16.09 22.00
256 401.86 600.72 141.83 195.69 192.57 254.20 37.08 49.76
512 853.48 1266.96 329.26 435.23 590.20 651.24 82.54 110.23
1024 1966.69 2808.07 721.36 981.82 1463.15 1749.37 202.20 251.71
Pi 4B/3B+ Pi 4B gcc 9/6
1 3.53 3.77 3.63 3.78 0.97 0.98 1.02 1.18
2 4.39 3.05 3.97 3.64 1.00 1.06 1.46 1.08
4 4.75 3.03 4.23 3.81 1.34 1.16 0.98 1.06
8 3.06 3.62 3.62 4.10 1.10 1.76 1.00 1.09
16 2.65 2.89 4.16 4.89 1.32 1.41 0.98 1.00
32 3.34 5.97 3.93 4.79 1.53 1.68 1.02 1.03
64 8.39 4.28 4.85 4.04 1.92 1.88 0.99 1.03
128 3.91 2.66 3.96 3.82 1.93 1.51 1.01 1.12
256 2.09 2.36 3.82 3.93 1.20 1.43 1.06 1.15
512 1.45 1.95 3.99 3.95 0.95 1.17 1.09 1.21
1024 1.34 1.61 3.57 3.90 0.85 1.07 1.06 1.21
BusSpeed Benchmark - busSpdPi64, busspeedPi64g9, busspeedPiA7
This is a read only benchmark with data from caches and RAM. The program reads one word with 32 word address
increments before the next, followed by reading after decreasing increments, finally reading all data. This shows where
data is read in bursts, enabling estimates to be made of bus speeds. The two comparison columns are for two word and
one word increments.
Most data transfers were 2.0 to 2.5 times faster on the Pi 4, including from RAM, and somewhat higher with L2 cache
based data.
The 64 bit version still deals with 32 bit words but transferred data somewhat quicker than the 32 bit program, as shown
by the Pi 4 results.
Results from the gcc 9 compilations were virtually the same as those from gcc 6.
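The access pattern described above can be sketched as follows. This is a hypothetical Python illustration of strided reading only; the names strided_read and run_passes, and the exclusive-or used as the integer work, are assumptions rather than the benchmark's actual C code.

```python
INCREMENTS = (32, 16, 8, 4, 2, 1)   # word address increments; 1 means read all

def strided_read(words, increment):
    """Read every increment-th word; return (checksum, number of words read)."""
    checksum = 0
    count = 0
    for i in range(0, len(words), increment):
        checksum ^= words[i]   # stand-in for the benchmark's integer operation
        count += 1
    return checksum, count

def run_passes(kbytes):
    """Words read per pass for a buffer of the given size, using 4 byte words."""
    words = list(range(kbytes * 1024 // 4))
    return {inc: strided_read(words, inc)[1] for inc in INCREMENTS}

words_read = run_passes(16)
print(words_read)
```

With 32 word increments only one word in 32 is used, so if the hardware fetches data in bursts, most of each burst is wasted, which is why the Inc32 columns show much lower speeds than Read All once data no longer comes from L1 cache.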
Gentoo 64b Pi 3B+
BusSpeed armv8 64 Bit Fri Aug 16 12:53:43 2019
Reading Speed 4 Byte Words in MBytes/Second
Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read
KBytes Words Words Words Words Words All
16 3819 4253 4622 5041 5089 3870
32 1234 1328 2067 3158 4082 3674
64 681 704 1325 2208 3350 3602
128 638 646 1214 2070 3238 3625
256 592 617 1165 1991 3164 3622
512 295 309 640 985 2085 2790
1024 108 120 271 525 1070 1636
4096 98 123 249 486 881 1840
16384 121 114 246 480 977 1642
65536 121 124 248 409 989 1864
Gentoo 64b Pi 4B Inc2 Rd All
4B/3B+ 4B/3B+
16 4999 5042 5665 5885 5891 8217 1.16 2.12
32 1578 2105 3283 4339 5154 7507 1.26 2.04
64 585 911 1855 3085 5163 7918 1.54 2.20
128 590 932 1888 3110 5161 7874 1.59 2.17
256 598 934 1908 3056 5265 7883 1.66 2.18
512 603 939 1822 3019 5124 7716 2.46 2.77
1024 319 482 1060 1885 3283 5721 3.07 3.50
4096 209 253 503 1006 2009 4111 2.28 2.23
16384 209 261 520 1041 2071 4115 2.12 2.51
65536 203 263 489 1011 2023 4036 2.05 2.17
Raspbian 32b Pi 4B Rd All
64b/32b
16 3836 4049 4467 5885 4641 5858 1.14
32 761 1473 2594 3216 3960 4780 1.01
64 409 801 1684 2422 3745 3940 0.95
128 406 803 1202 1914 3037 5377 1.32
256 415 700 1165 2481 4789 5137 1.27
512 392 760 1243 2455 3764 4264 1.38
1024 230 256 623 1061 2455 3501 1.59
4096 197 214 454 938 1852 3195 1.80
16384 138 215 445 897 1724 3210 1.91
65536 174 215 398 744 1655 3130 1.61
=====================================================================
Gentoo 64b Pi 3B+ gcc 9
BusSpeed 64 Bit gcc 9 Thu Sep 26 12:51:15 2019
BusSpeed armv8 64 Bit Fri Aug 16 12:53:43 2019
Reading Speed 4 Byte Words in MBytes/Second
Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read
KBytes Words Words Words Words Words All
16 3860 4283 4677 4901 5022 3591
32 2228 2433 2989 4740 4912 3629
64 700 697 1299 2200 3310 3348
128 637 636 1208 2064 3151 3396
256 597 600 1161 1945 3105 3377
512 232 194 500 884 1629 2350
1024 118 131 159 440 692 1682
4096 91 99 197 463 923 1878
16384 119 117 200 392 775 1606
65536 101 105 238 464 873 1876
Gentoo 64b Pi 4B Rd All Rd All
4B/3B+ gcc 9/6
16 4815 5060 5573 5808 5741 8935 2.49 1.09
32 1534 1828 2967 4254 4930 7825 2.16 1.04
64 792 1007 1988 3269 4844 8062 2.41 1.02
128 730 950 1881 3133 5007 8162 2.40 1.04
256 733 955 1901 3128 5071 8236 2.44 1.04
512 737 952 1885 3139 5058 8237 3.51 1.07
1024 374 539 1047 1884 3177 5537 3.29 0.97
4096 235 255 497 990 1975 3386 1.80 0.82
16384 239 263 501 913 1984 3973 2.47 0.97
65536 239 237 502 995 1984 3971 2.12 0.98
MemSpeed Benchmark MB/Second - memSpdPi64, memSpdPi64g9, memspeedPiA7
MemSpeed benchmark measures data reading speeds in MegaBytes per second, carrying out calculations on arrays of
cache and RAM data, normally sized 2 x 4 KB to 2 x 4 MB. Calculations are as shown in the result headings. For the first
two double precision tests, speed MFLOPS can be calculated by dividing MB/second by 8 and 16. For single precision
divide by 4 and 8.
Results are provided below for the Gentoo 64 bit version on the Pi 3B+ and Pi 4B, and the Raspbian 32 bit variety on the
Pi 4B, then a sample of relative performance, covering data from L1 cache, L2 cache and RAM.
Gains, greater than the 7% CPU MHz difference, were recorded all round by the Pi 4B over the Pi 3B+. The most
impressive were on using L2 cache based data and the more intensive floating point calculations. On the Pi 4B, speeds
of 64 bit and 32 bit compilations were similar using RAM based data and executing some integer tests, but significantly
faster from cache based floating point calculations.
Many Pi 4B/3B+ comparisons were similar, but the gcc 9 compilation gave rise to a number of changes, compared with
the older version. The latter was slightly faster using some double precision calculations, but gcc 9 produced speed
increases between 1.3 and 2.6 times with integers and single precision, the latter providing a maximum of 5.5 GFLOPS
compared with 3.5.
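The MB/second to MFLOPS conversion described above can be checked against the Max MFLOPS lines in the results. A minimal Python sketch, with an assumed helper name and lookup table, using the divisors stated in the text:

```python
# Divisors from the text: MB/second divided by these gives MFLOPS.
DIVISOR = {
    ("x[m]=x[m]+s*y[m]", "double"): 8,
    ("x[m]=x[m]+s*y[m]", "single"): 4,
    ("x[m]=x[m]+y[m]", "double"): 16,
    ("x[m]=x[m]+y[m]", "single"): 8,
}

def mflops(mb_per_second, test, precision):
    """Convert a MemSpeed MB/second result to MFLOPS."""
    return mb_per_second / DIVISOR[(test, precision)]

# Fastest x[m]=x[m]+s*y[m] results for the Pi 4B at 64 bits (16 KB rows).
gcc6_double = mflops(15719, "x[m]=x[m]+s*y[m]", "double")
gcc6_single = mflops(14042, "x[m]=x[m]+s*y[m]", "single")
gcc9_single = mflops(22116, "x[m]=x[m]+s*y[m]", "single")

print(gcc6_double, gcc6_single, gcc9_single)
```

These come out within one MFLOPS of the 1965, 3511 and 5529 shown in the Max MFLOPS lines for the Pi 4B, the last two being the 3.5 and 5.5 GFLOPS quoted above.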
Memory Reading Speed Test armv8 64 Bit by Roy Longbottom
Start of test Fri Aug 16 12:48:51 2019
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
Gentoo 64b Pi 3B+
8 4813 2897 4350 6180 3954 4831 5378 4324 4324
16 4540 2900 4356 6213 3961 4838 5401 4344 4333
32 4184 2780 4047 5540 3721 4483 5421 4285 4316
64 3784 2678 3803 4776 3547 4171 4925 4087 4051
128 3613 2694 3842 4731 3562 4188 4967 4087 4103
256 3133 2652 3800 4626 3493 4027 4967 4093 4096
512 670 882 1630 2913 2422 2718 3101 3141 2780
1024 587 774 1017 1310 1287 1184 1105 1526 1543
2048 555 746 917 1143 1131 1043 1071 1007 1128
4096 545 691 1130 1039 1015 1140 1045 1087 892
8192 537 795 1139 980 1133 1148 887 854 922
Max MFLOPS 602 725
Gentoo 64b Pi 4B
8 15530 13973 12509 15570 14025 15534 11417 9308 7798
16 15719 14042 12750 15745 14200 15660 11753 9447 7890
32 14062 12228 11435 14052 12699 12855 11864 9459 7937
64 12195 11344 10698 12211 11705 12025 8872 8752 7904
128 12172 11360 10755 12166 11862 11975 8569 8460 7913
256 12228 11369 10697 12123 11790 12082 8073 8222 7896
512 11269 10738 10206 10985 11164 11590 8017 6280 6557
1024 3407 2635 3281 3396 3242 2979 3765 3947 4029
2048 1525 1832 1838 1851 1607 1838 2819 2790 2770
4096 1407 1851 1859 1861 1666 1840 2485 2487 2410
8192 1913 1914 1922 1528 1895 1891 2496 2234 2489
Max MFLOPS 1965 3511
Comparison 64b Pi4/3B+
8 3.23 4.82 2.88 2.52 3.55 3.22 2.12 2.15 1.80
16 3.46 4.84 2.93 2.53 3.58 3.24 2.18 2.17 1.82
256 3.90 4.29 2.82 2.62 3.38 3.00 1.63 2.01 1.93
512 16.82 12.17 6.26 3.77 4.61 4.26 2.59 2.00 2.36
1024 5.80 3.40 3.23 2.59 2.52 2.52 3.41 2.59 2.61
4096 2.58 2.68 1.65 1.79 1.64 1.61 2.38 2.29 2.70
8192 3.56 2.41 1.69 1.56 1.67 1.65 2.81 2.62 2.70
Raspbian 32b Pi 4B
8 8459 4766 13344 8303 4768 15553 7806 9926 9927
16 7142 3918 8649 7103 4094 9309 7899 10086 10056
32 7969 4490 10339 7941 4532 11627 7758 10070 10048
64 8126 4602 9909 8114 4617 11069 7425 8021 8070
128 8302 4651 9623 8311 4657 10836 7374 8049 7934
256 8319 4663 9627 8360 4666 10768 7530 7922 7925
512 8088 4629 9453 8239 4650 10696 5023 7904 7949
1024 3581 3113 3618 3577 3150 3675 5358 2431 1560
2048 1338 1808 1780 1811 1832 1773 2131 950 956
4096 1881 1880 1852 1879 1664 1336 1988 984 1054
8192 1890 1901 1884 1729 1319 1367 2252 1018 1021
Max MFLOPS 1057 1192
Comparison Pi 4B 64b/32b
8 1.84 2.93 0.94 1.88 2.94 1.00 1.46 0.94 0.79
16 2.20 3.58 1.47 2.22 3.47 1.68 1.49 0.94 0.78
256 1.47 2.44 1.11 1.45 2.53 1.12 1.07 1.04 1.00
512 1.39 2.32 1.08 1.33 2.40 1.08 1.60 0.79 0.82
1024 0.95 0.85 0.91 0.95 1.03 0.81 0.70 1.62 2.58
4096 0.75 0.98 1.00 0.99 1.00 1.38 1.25 2.53 2.29
8192 1.01 1.01 1.02 0.88 1.44 1.38 1.11 2.19 2.44
=====================================================================
Gentoo 64b Pi 3B+ gcc 9
Memory Reading Speed Test 64 Bit gcc 9 by Roy Longbottom
Start of test Thu Sep 26 12:43:02 2019
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
8 4565 5140 7847 5439 5827 7928 6161 4288 4334
16 4445 5145 7942 5362 5829 7941 6207 4358 4310
32 4094 4853 7251 4750 5396 7250 6139 4312 4303
64 3767 4748 7008 4320 5309 6954 5461 4097 4100
128 3912 4799 7319 4442 5486 7325 5328 4133 4134
256 3838 4824 6934 4400 5426 7247 5354 3844 4010
512 2570 3661 3826 2773 3975 4912 3302 2532 3017
1024 878 2120 2228 938 2182 2239 1098 1215 1361
2048 848 1961 2046 1016 2008 2033 758 805 814
4096 856 1961 2040 1007 1984 2036 839 863 856
8192 885 1940 1956 1013 1921 1957 844 865 868
Max MFLOPS 571 1286
Gentoo 64b Pi 4B
8 13385 21854 24413 13416 23402 24404 11630 9316 9315
16 13527 22116 24712 13551 23675 24722 11800 9447 9446
32 12170 19681 21716 12164 21047 21740 11403 9511 9514
64 11402 19074 20086 11613 20057 20101 9317 8651 8663
128 11770 20334 21119 12124 21389 21087 8003 8136 8136
256 11740 20281 21115 12029 21384 21111 8098 8184 8015
512 11671 20255 20873 12058 21561 21072 7721 6684 6929
1024 2818 7728 5968 3957 7839 7831 4691 3610 3832
2048 1884 3436 3743 1880 3578 3281 2597 2717 2696
4096 1284 2399 2555 1446 3802 3625 2420 2630 2632
8192 1913 3759 3459 1937 3798 3772 2468 2482 2482
Max MFLOPS 1691 5529
Comparison 64b Pi4/3B+
8 2.93 4.25 3.11 2.47 4.02 3.08 1.89 2.17 2.15
16 3.04 4.30 3.11 2.53 4.06 3.11 1.90 2.17 2.19
256 3.06 4.20 3.05 2.73 3.94 2.91 1.51 2.13 2.00
512 4.54 5.53 5.46 4.35 5.42 4.29 2.34 2.64 2.30
1024 3.21 3.65 2.68 4.22 3.59 3.50 4.27 2.97 2.82
4096 1.50 1.22 1.25 1.44 1.92 1.78 2.88 3.05 3.07
8192 2.16 1.94 1.77 1.91 1.98 1.93 2.92 2.87 2.86
Comparison Pi4B gcc 9/6
8 0.86 1.56 1.95 0.86 1.67 1.57 1.02 1.00 1.19
16 0.86 1.57 1.94 0.86 1.67 1.58 1.00 1.00 1.20
256 0.96 1.78 1.97 0.99 1.81 1.75 1.00 1.00 1.02
512 1.04 1.89 2.05 1.10 1.93 1.82 0.96 1.06 1.06
1024 0.83 2.93 1.82 1.17 2.42 2.63 1.25 0.91 0.95
4096 0.91 1.30 1.37 0.78 2.28 1.97 0.97 1.06 1.09
8192 1.00 1.96 1.80 1.27 2.00 1.99 0.99 1.11 1.00
NeonSpeed Benchmark MB/Second - NeonSpeedPi64, NeonSpeedPi64g9, NeonSpeed
This carries out some of the same calculations as MemSpeed. All results are for 32 bit floating point and integer
calculations. The Norm functions were as generated by the compiler, using NEON directives, and the Neon measurements
were obtained using Intrinsic Functions.
Unlike running the same programs on the Pi 3B+, on the Pi 4, compiled codes were no longer slower than those
produced via Intrinsic Functions. This led to performance gains of up to over five times.
Except when using L1 cache based data, performance was essentially the same using 32 bit and 64 bit benchmarks.
With the gcc 9 compilation, the Pi 4B continued to be significantly faster than the 3B+. Comparing Pi 4B gcc 9 and 6
results, performance was essentially the same when NEON Intrinsic Functions were used, but, as with MemSpeed,
normal compilations were faster, averaging around 80% faster, in this case.
NEON Speed Test armv8 64 Bit V 1.0 Fri Aug 16 2019
Vector Reading Speed in MBytes/Second
Memory Float v=v+s*v Int v=v+v+s Neon v=v+v
KBytes Norm Neon Norm Neon Float Int
Gentoo 64b Pi 3B+
16 2715 5110 3945 4826 5426 5598
32 2528 4326 3569 4191 4596 4661
64 2491 4153 3494 4068 4407 4429
128 2537 4228 3583 4120 4461 4473
256 2526 4265 3614 4140 4480 4514
512 1917 2830 2545 2579 2896 2964
1024 1166 1299 1152 1257 1205 1229
4096 1022 1135 1132 1122 1130 1100
16384 1080 1026 1131 1016 1064 1094
65536 996 1120 1061 831 1110 1069
Gentoo 64b Pi 4B
16 13982 16424 12505 15239 16065 17193
32 9554 10753 8981 9657 10970 11025
64 10658 11833 10274 10722 12110 12134
128 10657 11887 10337 10680 11994 11973
256 10709 11970 10360 10774 12003 12083
512 10147 11441 9733 10209 11264 11532
1024 2964 3222 2876 3216 3270 2942
4096 1734 1712 1729 1772 1586 1728
16384 1592 1922 1818 1923 1926 1667
65536 1970 1736 1997 1747 1884 2021
Comparison 64b Pi4/3B+
16 5.15 3.21 3.17 3.16 2.96 3.07
256 4.24 2.81 2.87 2.60 2.68 2.68
512 5.29 4.04 3.82 3.96 3.89 3.89
65536 1.98 1.55 1.88 2.10 1.70 1.89
Raspbian 32b Pi 4B
16 9677 10072 8905 9358 9776 10473
32 10149 10330 9364 9539 9988 10543
64 10948 11708 10466 10568 11318 11994
128 10484 11232 10410 10104 11200 11792
256 10509 11369 10428 10264 11273 11842
512 10406 11066 10134 10054 11075 11467
1024 3069 3202 3159 3166 3204 3203
4096 1721 1910 1908 1882 1903 1900
16384 2023 2009 2008 1965 2032 2013
65536 2073 2074 2074 2073 2068 2064
Comparison Pi 4B 64b/32b
16 1.44 1.63 1.40 1.63 1.64 1.64
256 1.02 1.05 0.99 1.05 1.06 1.02
512 0.98 1.03 0.96 1.02 1.02 1.01
65536 0.95 0.84 0.96 0.84 0.91 0.98
=====================================================================
Gentoo 64b Pi 3B+ gcc 9
NEON Speed Test 64 Bit gcc 9 Thu Sep 26 12:45:07 2019
Vector Reading Speed in MBytes/Second
Memory Float v=v+s*v Int v=v+v+s Neon v=v+v
KBytes Norm Neon Norm Neon Float Int
16 5118 5461 6218 5298 6024 6011
32 4894 4980 5886 4855 5431 5445
64 4713 4557 5669 4452 4868 4867
128 4824 4703 5814 4598 4995 4946
256 4857 4750 5815 4643 5028 4964
512 3694 2652 4265 2675 3003 3007
1024 2085 1135 2204 1132 1128 1077
4096 2008 1021 2070 1033 1056 1036
16384 1912 1061 2042 958 1065 1047
65536 1783 1062 1873 769 1080 1081
Gentoo 64b Pi 4B
16 21046 14555 16698 13502 14565 16970
32 17797 12061 14509 10785 12282 13112
64 19517 10860 15252 9981 10793 11419
128 19839 10936 15468 10120 11001 11579
256 20094 10838 15603 10229 10885 11566
512 20076 10846 15469 10185 10943 11667
1024 7016 3040 6826 3211 3417 3548
4096 3945 1940 3599 1950 1768 1937
16384 3394 2017 3386 1963 1848 2014
65536 3484 2043 3839 1765 2060 2049
Comparison 64b Pi4/3B+
16 4.11 2.67 2.69 2.55 2.42 2.82
32 3.64 2.42 2.47 2.22 2.26 2.41
64 4.14 2.38 2.69 2.24 2.22 2.35
128 4.11 2.33 2.66 2.20 2.20 2.34
256 4.14 2.28 2.68 2.20 2.16 2.33
512 5.43 4.09 3.63 3.81 3.64 3.88
1024 3.36 2.68 3.10 2.84 3.03 3.29
4096 1.96 1.90 1.74 1.89 1.67 1.87
16384 1.78 1.90 1.66 2.05 1.74 1.92
65536 1.95 1.92 2.05 2.30 1.91 1.90
Comparison Pi4B gcc 9/6
16 1.51 0.89 1.34 0.89 0.91 0.99
32 1.86 1.12 1.62 1.12 1.12 1.19
64 1.83 0.92 1.48 0.93 0.89 0.94
128 1.86 0.92 1.50 0.95 0.92 0.97
256 1.88 0.91 1.51 0.95 0.91 0.96
512 1.98 0.95 1.59 1.00 0.97 1.01
1024 2.37 0.94 2.37 1.00 1.04 1.21
4096 2.28 1.13 2.08 1.10 1.11 1.12
16384 2.13 1.05 1.86 1.02 0.96 1.21
65536 1.77 1.18 1.92 1.01 1.09 1.01
Average 1.95 1.00 1.73 1.00 0.99 1.06
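The Average row above appears to be the arithmetic mean of the ten per-size ratios in each column. The following is a minimal sketch of that arithmetic, assuming each ratio is simply the gcc 9 speed divided by the gcc 6 speed for the same Pi 4B test. The function names are illustrative and are not taken from the benchmark source.

```c
#include <assert.h>
#include <stddef.h>

/* Ratio of two speeds in MB/second, e.g. gcc 9 result over gcc 6 result. */
double speed_ratio(double newer_mbps, double older_mbps)
{
    assert(older_mbps > 0.0);
    return newer_mbps / older_mbps;
}

/* Arithmetic mean of n ratios, as assumed for the "Average" row. */
double mean_ratio(const double *ratios, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += ratios[i];
    return n > 0 ? sum / (double)n : 0.0;
}
```

For the Float Norm column, speed_ratio(21046, 13982) gives about 1.51 for the 16 KB row, and the mean of the ten listed ratios is about 1.947, in line with the 1.95 shown.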
MultiThreading Benchmarks
Most of the multithreading benchmarks execute the same calculations using 1, 2, 4 and 8 threads. One of them,
MP-MFLOPS, is available in two versions using standard compiled C code, for single and double precision
arithmetic. A further version uses NEON intrinsic functions. Another variety uses OpenMP procedures for automatic
parallelism.
MP-Whetstone Benchmark - MP-WhetsPi64, MP-WhetsPi64g9, MP-WHETSPiA7
Multiple threads each run the eight test functions at the same time, but with some dedicated variables. Measured speed
is based on the last thread to finish, with mutex functions used to avoid update conflicts by allowing only one
thread at a time to access common data. Performance was generally proportional to the number of cores used. There
can be some significant differences from the single CPU Whetstone benchmark results on particular tests, due to a
different compiler being used. None of the test functions are suitable for SIMD operation, so the simpler instructions
are used. The Overall Seconds lines indicate MP efficiency.
As with the single core version, the average Pi 4 MWIPS performance gain over the Pi 3B+ was just over 2 times. The
Pi 4B 64 bit and 32 bit speeds were closer, with the 32 bit version somewhat faster on some floating point calculations
this time. Most of the important Pi 4B gcc 9 results were virtually the same as those from the earlier gcc 6 compilations,
but the 3B+ COS and EXP speeds differed: the tables below show 27.2 and 28.1 MOPS with gcc 9 against 23.2 and 13.0
MOPS with gcc 6, using one thread.
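The Overall Seconds lines in the tables below can be turned into a rough MP efficiency figure. This is a minimal sketch, assuming that every thread runs the same full workload (so n threads complete n times the work) and that four cores are available. The function names are illustrative and are not part of the benchmark.

```c
#include <assert.h>

/* Throughput gain over one thread when every thread runs the same
   workload: n threads finish n times the work in seconds_n. */
double mp_throughput_gain(int threads, double seconds_1, double seconds_n)
{
    assert(threads > 0 && seconds_1 > 0.0 && seconds_n > 0.0);
    return (double)threads * seconds_1 / seconds_n;
}

/* Efficiency relative to the cores that can actually be used
   (4 on the Pi 3B+ and Pi 4B). */
double mp_efficiency(int threads, int cores, double seconds_1, double seconds_n)
{
    int usable = threads < cores ? threads : cores;
    assert(cores > 0);
    return mp_throughput_gain(threads, seconds_1, seconds_n) / (double)usable;
}
```

With the Pi 3B+ gcc 6 timings of 4.96 seconds at one thread and 5.05 seconds at four, the gain is about 3.93. At eight threads (10.07 seconds) the efficiency against four cores is about 0.985, which is consistent with the near proportional MWIPS scaling.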
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
Threads 1 2 3 MOPS MOPS MOPS MOPS MOPS
Gentoo RPi 3B+ 64 Bit
1 1152 383 383 328 23.2 13.0 N/A 2721 1365
2 2312 767 767 657 46.5 26.0 N/A 5461 2738
4 4580 1506 1526 1304 92.0 51.6 N/A 10777 5449
8 4788 1815 1961 1382 95.0 53.3 N/A 13827 5811
Overall Seconds 4.96 1T, 4.95 2T, 5.05 4T, 10.07 8T
Gentoo RPi 4B 64 Bit
1 2395 536 538 397 60.8 39.0 N/A 4483 997
2 4784 1062 1079 794 121.2 77.9 N/A 8932 1990
4 9476 2125 2080 1568 240.8 155.3 N/A 17718 3962
8 9834 2631 2744 1630 243.6 160.1 N/A 22265 4053
Overall Seconds 4.99 1T, 5.01 2T, 5.12 4T, 10.17 8T
Comparison 64b Pi4/3B+
1 2.08 1.40 1.41 1.21 2.62 3.00 N/A 1.65 0.73
2 2.07 1.39 1.41 1.21 2.61 3.00 N/A 1.64 0.73
4 2.07 1.41 1.36 1.20 2.62 3.01 N/A 1.64 0.73
8 2.05 1.45 1.40 1.18 2.56 3.00 N/A 1.61 0.70
Raspbian RPi 4B 32 Bit
1 2059 673 680 311 55.6 33.1 7462 2245 995
2 4117 1342 1391 624 110.7 65.9 14887 4467 1986
4 7910 2652 2722 1180 208.5 132.6 29291 8952 3832
8 8652 3057 2971 1268 233.2 149.6 38368 11923 3942
Overall Seconds 4.99 1T, 5.01 2T, 5.29 4T, 10.71 8T
Comparison Pi 4B 64b/32b
1 1.16 0.80 0.79 1.28 1.09 1.18 N/A 2.00 1.00
2 1.16 0.79 0.78 1.27 1.09 1.18 N/A 2.00 1.00
4 1.20 0.80 0.76 1.33 1.15 1.17 N/A 1.98 1.03
8 1.14 0.86 0.92 1.28 1.04 1.07 N/A 1.87 1.03
===========================================================================
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
Threads 1 2 3 MOPS MOPS MOPS MOPS MOPS
Gentoo 64b Pi 3B+ gcc 9
1 1500 381 384 328 27.2 28.1 5098 2049 1368
2 3001 766 762 656 54.5 56.5 10130 4102 2737
4 5940 1488 1528 1304 107.8 111.5 19741 7665 5423
8 5987 1528 1666 1267 107.4 117.9 25862 9518 5666
Overall Seconds 4.98 1T, 4.98 2T, 5.16 4T, 10.30 8T
Gentoo 64b Pi 4B gcc 9
1 2364 530 532 395 60.6 40.0 7426 2242 996
2 4724 1060 1052 789 121.0 80.4 14853 4476 1994
4 9413 2103 2112 1579 241.0 159.5 29161 8638 3968
8 9848 2671 2453 1644 247.0 168.1 37385 11636 4108
Overall Seconds 5.00 1T, 5.01 2T, 5.07 4T, 10.20 8T
Comparison 64b Pi4/3B+
1 1.58 1.39 1.38 1.20 2.23 1.42 1.46 1.09 0.73
2 1.57 1.38 1.38 1.20 2.22 1.42 1.47 1.09 0.73
4 1.58 1.41 1.38 1.21 2.24 1.43 1.48 1.13 0.73
8 1.64 1.75 1.47 1.30 2.30 1.43 1.45 1.22 0.72
Comparison Pi4B gcc 9/6
1 0.99 0.99 0.99 1.00 1.00 1.03 N/A 0.50 1.00
2 0.99 1.00 0.97 0.99 1.00 1.03 N/A 0.50 1.00
4 0.99 0.99 1.02 1.01 1.00 1.03 N/A 0.49 1.00
8 1.00 1.02 0.89 1.01 1.01 1.05 N/A 0.52 1.01
MP-Dhrystone Benchmark - MP-DHRYPi64, MP-DHRYPi64g9, MP-DHRYPiA7
This executes multiple copies of the same program, but with some shared data, leading to inconsistent multithreading
performance with not much gain using multiple cores.
The single thread speeds were similar to the earlier Dhrystone results, with RPi 4B ratings around twice as fast as those
for the Pi 3B+. The single thread Pi 4B 64 bit/32 bit speed ratio was also similar to that from the single core tests.
As indicated for the earlier gcc 6 results, this benchmark produces inconsistent performance and does not provide a
good example of multithreading. In this case, gcc 6 and gcc 9 results were similar, with a Pi 4B/3B+ performance gain
of about two times using one thread.
Example Results Log File
MP-Dhrystone Benchmark 64 Bit gcc 9 Thu Sep 26 11:46:22 2019
Using 1, 2, 4 and 8 Threads
Threads 1 2 4 8
Seconds 0.55 1.19 2.31 4.57
Dhrystones per Second 14579147 13499628 13827400 14017880
VAX MIPS rating 8298 7683 7870 7978
Internal pass count correct all threads
End of test Thu Sep 26 11:46:31 2019
#############################################################
Comparisons
Threads                           1      2      4      8
VAX MIPS rating Pi 3B+ 64      4207   6804   7401   7415
VAX MIPS rating Pi 4B 64       8880   7828   8303   8314
VAX MIPS rating Pi 4B 32       5539   5739   6735   7232
Pi 4B/3B+ 64 bits              2.11   1.15   1.12   1.12
Pi 4B 64 bits/32 bits          1.60   1.36   1.23   1.15
=======================================================
Gentoo gcc 9
VAX MIPS rating Pi 3B+ 64      4062   6504   8242   8343
VAX MIPS rating Pi 4B 64       8298   7683   7870   7978
Pi 4B/3B+ 64 bits              2.04   1.18   0.95   0.96
Pi 4B gcc 9/6                  0.93   0.98   0.95   0.96
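The VAX MIPS ratings in the example log follow from the Dhrystones per Second figures. The conventional reference is the VAX 11/780, rated at 1757 Dhrystones per second for 1 MIPS. A minimal sketch of the conversion follows; the function names are illustrative, and rounding to the nearest whole number is an assumption that matches the four logged values.

```c
#include <assert.h>

/* Dhrystones per second achieved by the VAX 11/780 reference machine,
   which defines 1 VAX MIPS. */
#define VAX_DHRYSTONES_PER_SECOND 1757.0

double vax_mips(double dhrystones_per_second)
{
    assert(dhrystones_per_second >= 0.0);
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND;
}

/* Rounded to the nearest whole number, as the log appears to print it. */
long vax_mips_rounded(double dhrystones_per_second)
{
    return (long)(vax_mips(dhrystones_per_second) + 0.5);
}
```

For example, vax_mips_rounded(14579147.0) returns 8298, the single thread rating shown in the gcc 9 log.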
MP SP NEON Linpack Benchmark - linpackMPNeonPi64, linpackMPNeonPi64g9, linpackNeonMP
This executes a single copy of the benchmark, at three data sizes, with the critical daxpy code multithreaded. This code
was also modified to allow a higher level of parallelism, without changing any calculations. MP performance was still
much slower than running unthreaded (the None column below). The main reasons appear to be updating data in RAM, to
maintain integrity, with performance reflecting memory speeds, and exceptionally high thread start/stop overheads.
This benchmark uses the same NEON Intrinsic Functions as the single core program, with similar speeds at N = 100,
without the threading overheads, but decreasing with larger data sizes, involving RAM accesses.
The full logged output is shown for the first entry, to demonstrate error checking facilities. The sumchecks were
identical from the Pi 3B+ and Pi 4B at Gentoo 64 bits, but those from the Raspbian 32 bit test were different, as shown
below. Ignoring the slow threaded results, performance ratios of CPU speed limited tests were similar to the single core
version.
At least for the unthreaded tests, the gcc 9 results for the Pi 4B were mainly within 10% of those from gcc 6.
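The idea of multithreading the daxpy step without changing any calculations can be sketched as follows. This is an illustration only, not the benchmark's code: it uses the single precision form of the update in plain C (the benchmark uses NEON intrinsics), the function names are hypothetical, and the ranges are processed in turn rather than by real threads, simply to show that splitting the loop leaves every result unchanged.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = y[i] + a * x[i] for start <= i < end: the work one thread would do. */
void saxpy_range(float a, const float *x, float *y, size_t start, size_t end)
{
    for (size_t i = start; i < end; i++)
        y[i] = y[i] + a * x[i];
}

/* Split n elements into `parts` contiguous ranges and process each one.
   In a threaded build each range would go to its own thread; here they are
   run in turn to show that the arithmetic is unchanged by the split. */
void saxpy_partitioned(float a, const float *x, float *y, size_t n, size_t parts)
{
    assert(parts > 0);
    for (size_t p = 0; p < parts; p++) {
        size_t start = n * p / parts;
        size_t end = n * (p + 1) / parts;
        saxpy_range(a, x, y, start, end);
    }
}
```

Because each element is touched exactly once whichever range it falls in, a run with one part and a run with several parts give the same values. The sketch does not model the thread start/stop cost that dominates the measured MP results at N = 100.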
Example Results Log File
Linpack Single Precision MultiThreaded Benchmark
64 Bit NEON Intrinsics, Fri Aug 23 00:45:54 2019
MFLOPS 0 to 4 Threads, N 100, 500, 1000
Threads None 1 2 4
N 100 642.56 66.69 66.05 65.54
N 500 479.48 274.36 274.85 269.07
N 1000 363.77 316.17 310.37 316.71
NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1
N 100 500 1000
NR 1.97 5.40 13.51
RE 4.69621336e-05 6.44138840e-04 3.22485110e-03
MA 1.19209290e-07 1.19209290e-07 1.19209290e-07
X0 -1.31130219e-05 5.79357147e-05 -3.08930874e-04
XN -1.30534172e-05 3.51667404e-05 1.90019608e-04
Thread
0 - 4 Same Results Same Results Same Results
####################################################
Comparisons
Threads None 1 2 4
Gentoo Pi 3B+ 64 Bits
N 100 642.56 66.69 66.05 65.54
N 500 479.48 274.36 274.85 269.07
N 1000 363.77 316.17 310.37 316.71
Gentoo 64b Pi 4B
N 100 2252.7 97.3 97.4 97.4
N 500 1628.2 665.2 646.6 674.4
N 1000 399.9 406.8 405.8 399.5
Comparison 64b Pi4/3B+
N 100 3.51 1.46 1.48 1.49
N 500 3.40 2.42 2.35 2.51
N 1000 1.10 1.29 1.31 1.26
Raspbian 32b Pi 4B
N 100 1921.5 108.7 101.9 102.5
N 500 1548.8 530.2 714.4 733.1
N 1000 399.9 378.1 364.8 398.2
Comparison Pi 4B 64b/32b
N 100 1.17 0.89 0.96 0.95
N 500 1.05 1.25 0.91 0.92
N 1000 1.00 1.08 1.11 1.00
========================================
gcc 9
MFLOPS 0 to 4 Threads, N 100, 500, 1000
Threads None 1 2 4
Gentoo 64b Pi 3B+ gcc 9
N 100 641.6 63.0 62.3 61.9
N 500 326.6 229.3 222.6 227.0
N 1000 320.1 275.0 274.3 275.2
Gentoo 64b Pi 4B gcc 9
N 100 2076.2 98.6 96.6 96.2
N 500 1327.1 631.9 632.5 639.2
N 1000 394.6 375.3 382.3 375.7
Comparison 64b Pi4/3B+
N 100 3.24 1.57 1.55 1.55
N 500 4.06 2.76 2.84 2.82
N 1000 1.23 1.36 1.39 1.37
Comparison Pi4B gcc 9/6
N 100 0.92 1.01 0.99 0.99
N 500 0.82 0.95 0.98 0.95
N 1000 0.99 0.92 0.94 0.94
####################################################
32 bit numeric results
N 100 500 1000
NR 2.17 5.42 9.50
RE 5.16722466e-05 6.46698638e-04 2.26586126e-03
MA 1.19209290e-07 1.19209290e-07 1.19209290e-07
X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
XN -5.06639481e-06 -4.70876694e-06 1.41978264e-04
MP BusSpeed Benchmark - MP-BusSpd2Pi64, MP-BusSpd2Pi64g9, MP-BusSpeedPiA7
(read only)
Each thread accesses all of the data in separate sections, covering caches and RAM, starting at different points, with
this version. See single processor BusSpeed details regarding burst reading that can indicate significant differences.
Comparisons are provided for RdAll, at 1, 2, 4 and 8 threads. Pi 4B/3B+ performance ratios were similar to those for the
single core tests. There was an exception with two threads, on the Pi 4, using RAM at 64 bits, probably due to caching
effects and not seen on subsequent repeated tests.
Particularly note that RdAll performance was significantly better using the 32 bit Raspbian compiler. Below are examples
of disassembly, showing that the 64 bit Pi 4 code employed scalar operation, using 32 bit w registers, with the 32 bit
version benefiting from using 128 bit q registers, for Single Instruction Multiple Data (SIMD) operation. Compile options
are included below, where alternatives were also tried on the Pi 4B, but failed to implement SIMD operation.
At least most of the gcc 9 RdAll tests were significantly faster than those produced by gcc 6.
MP-BusSpd armv8 64 Bit Fri Aug 23 00:47:43 2019
MB/Second Reading Data, 1, 2, 4 and 8 Threads
Gentoo 64b Pi 3B+
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
12.3 1T 3138 2822 3044 2383 1708 1737
2T 5354 4865 5647 4519 3303 3361
4T 7922 7504 9717 6794 6216 6597
8T 5125 4159 6987 6696 5350 5195
122.9 1T 640 666 1191 1864 1627 1712
2T 1008 1018 1926 3496 3268 3387
4T 962 1042 2157 4259 6427 4372
8T 1031 1047 2147 3952 6317 6514
12288 1T 124 114 260 527 1016 1363
2T 137 138 275 487 946 2182
4T 105 118 240 409 975 2158
8T 108 117 236 504 1077 2051
Gentoo 64b Pi 4B (final column: RdAll ratio, Pi 4B/3B+)
12.3 1T 4864 4879 5378 4379 4115 4221 2.43
2T 8159 6924 9179 8006 7689 7837 2.33
4T 12677 11531 14850 12554 13807 14794 2.24
8T 7398 6927 10881 11675 11497 13075 2.52
122.9 1T 665 926 1869 2714 3557 4152 2.43
2T 610 696 1549 4898 7188 8184 2.42
4T 476 865 1885 4107 8058 14617 3.34
8T 474 883 1848 3919 7939 13633 2.09
12288 1T 202 210 514 1044 2033 3616 2.65
2T 258 425 853 1551 3693 6228 2.85
4T 217 346 497 1024 2181 3789 1.76
8T 220 275 540 1030 1937 3577 1.74
Raspbian 32b Pi 4B (final column: RdAll ratio, Pi 4B 64b/32b)
12.3 1T 5263 5637 5809 5894 5936 13445 0.31
2T 9412 10020 10567 11454 11604 24980 0.31
4T 16282 15577 16418 21222 20000 45530 0.32
8T 11600 13285 16070 18579 20593 36837 0.35
122.9 1T 739 956 1888 3153 5008 9527 0.44
2T 629 1158 1568 5058 9509 16489 0.50
4T 600 1093 2134 4527 8732 16816 0.87
8T 593 1104 2121 4382 8629 17158 0.79
12288 1T 238 258 518 1005 2001 4029 0.90
2T 278 228 453 1690 1826 3628 1.72
4T 269 257 740 1019 1790 4145 0.91
8T 233 292 532 926 2186 3581 1.00
===================================================================
MB/Second Reading Data, 1, 2, 4 and 8 Threads
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
Gentoo 64b Pi 3B+ gcc 9
12.3 1T 3453 4178 4428 3543 3584 2335
2T 5594 7732 8086 6856 6924 4654
4T 9065 12522 13157 12942 13415 9209
8T 6661 10770 13266 11955 12573 8478
122.9 1T 640 646 1197 1970 2909 2272
2T 1030 1012 2006 3671 5784 4528
4T 1001 1041 2145 4266 8337 6729
8T 1043 1061 2123 4005 8133 8572
12288 1T 114 104 241 444 932 1352
2T 126 122 253 370 1005 1997
4T 104 138 197 471 1133 1745
8T 102 96 231 466 796 1893
Gentoo 64b Pi 4B gcc 9 (final columns: RdAll ratios, Pi 4B/3B+ and Pi 4B gcc 9/6)
12.3 1T 5573 5750 5057 5646 5800 9129 3.91 2.16
2T 7191 9038 10035 11020 11125 17757 3.82 2.27
4T 7023 12144 14591 17681 20490 29184 3.17 1.97
8T 7553 11837 12565 15640 18546 30517 3.60 2.33
122.9 1T 672 922 1864 3092 4744 7741 3.41 1.86
2T 577 947 2100 3051 8780 14975 3.31 1.83
4T 519 983 1884 3980 8701 18139 2.70 1.24
8T 515 951 1913 4181 8797 16899 1.97 1.24
12288 1T 230 261 499 1016 1678 3873 2.86 1.07
2T 276 225 418 925 1929 5629 2.82 0.90
4T 258 267 579 802 1749 5758 3.30 1.52
8T 214 213 538 1069 2145 4680 2.47 1.31
MP BusSpeed Disassembly
The following shows part of the source code used to read all data, the compile commands used and a disassembly of
part of the long (100+) sequences of instructions used for the 32 bit and 64 bit gcc 9 benchmarks. A disassembly of the
64 bit gcc 6 version was not available.
Source Code 64 AND instructions in main loop
for (i=start; i<end; i=i+64)
{
andsum1[t] = andsum1[t]
& array[i ] & array[i+1 ] & array[i+2 ] & array[i+3 ]
& array[i+4 ] & array[i+5 ] & array[i+6 ] & array[i+7 ]
/* ... and so on, to ... */
& array[i+56] & array[i+57] & array[i+58] & array[i+59]
& array[i+60] & array[i+61] & array[i+62] & array[i+63];
}
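The loop above can be reduced to a compact sketch that compiles as it stands. This is an illustration under stated assumptions, not the benchmark source: it ANDs 8 words per pass instead of 64 to stay short, the function names are hypothetical, and the strided reader assumes that the Inc32 to Inc2 columns mean one word read out of every N words.

```c
#include <assert.h>
#include <stddef.h>

/* AND together every word in array[start..end), 8 words per pass here
   rather than the benchmark's 64, to keep the sketch short. */
unsigned int read_all_and(const unsigned int *array, size_t start, size_t end)
{
    unsigned int andsum = 0xFFFFFFFFu;
    assert(end >= start && (end - start) % 8 == 0);
    for (size_t i = start; i < end; i += 8) {
        andsum = andsum
            & array[i    ] & array[i + 1] & array[i + 2] & array[i + 3]
            & array[i + 4] & array[i + 5] & array[i + 6] & array[i + 7];
    }
    return andsum;
}

/* Read one word in every `inc` words (assumed meaning of Inc32 ... Inc2). */
unsigned int read_stride_and(const unsigned int *array, size_t n, size_t inc)
{
    unsigned int andsum = 0xFFFFFFFFu;
    assert(inc > 0);
    for (size_t i = 0; i < n; i += inc)
        andsum &= array[i];
    return andsum;
}
```

A long chain of independent AND operations like this is what a vectorising compiler can map onto 128 bit registers, which is the difference visible in the 32 bit and 64 bit disassemblies.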
Pi 32 Bit Raspbian Compile
gcc mpbusspd2.c cpuidc.c -lpthread -lm -lrt -O3 -mcpu=cortex-a7
-mfloat-abi=hard -mfpu=neon-vfpv4 -o MP-BusSpd2PiA7
Pi 64 Bit Gentoo Compile
gcc mpbusspd2.c -lpthread -lm -lrt -O3 -march=armv8-a -no-pie -o MP-BusSpd2Pi64g9
Parameters also tried
-march=armv8-a+crc -mtune=cortex-a72 -ftree-vectorize -O2 -pipe
-fomit-frame-pointer
Pi 32 Bit Disassembly          Pi 64 Bit Disassembly
vld1.32 {q6}, [lr]             ldp w30, w17, [x0, 52]
vld1.32 {q7}, [r6]             and w18, w18, w30
vand q10, q10, q6              and w1, w1, w18
vld1.32 {q6}, [r0]             ldp w18, w30, [x0, 60]
vand q9, q9, q7                and w17, w17, w18
vand q12, q12, q6              and w1, w1, w17
vld1.32 {q7}, [ip]             ldp w17, w18, [x0, 68]
vld1.32 {q6}, [r7]             and w30, w30, w17
add r1, r3, #96                and w1, w1, w30
add r6, r3, #144               ldp w30, w17, [x0, 76]
vand q11, q11, q7              and w18, w18, w30
vand q14, q14, q6              and w1, w1, w18
vld1.32 {q7}, [r1]             ldp w18, w30, [x0, 84]
vld1.32 {q6}, [r6]             and w17, w17, w18
MP RandMem Benchmark - MP-RandMemPi64, MP-RandMemPi64g9, MP-RandMemPiA7
This benchmark potentially reads and writes all data, in sections covering caches and RAM, each thread starting at
different addresses. Random access can select any address after that. Writing tends to involve updating the
appropriate memory area, providing constant speeds. Random access is significantly affected by burst reading and
writing.
The Pi 4B provided variable gains over the Pi 3B+ at 64 bits, but smaller gains on the Pi 4B from 64 bits over 32 bits.
Some moderate Pi4/3B+ performance gains were produced using gcc 9, but on the Pi 4B the older version was, possibly,
a little faster.
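The difference between the serial and random tests can be illustrated with an index array that selects which word is accessed next. This is a sketch under assumptions: the benchmark's own access code and random number source are not shown in this report, so the generator, the index array and all function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simple linear congruential generator, an assumption for this sketch;
   the benchmark's own random number source is not shown in this report. */
static unsigned int lcg_next(unsigned int *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Fill index[] with serial (0, 1, 2, ...) or random positions in [0, n). */
void fill_index(unsigned int *index, size_t n, int random_order, unsigned int seed)
{
    unsigned int state = seed;
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        index[i] = random_order ? (unsigned int)(lcg_next(&state) % n)
                                : (unsigned int)i;
}

/* Read test: AND together the words selected through index[]. */
unsigned int read_indexed(const unsigned int *data, const unsigned int *index, size_t n)
{
    unsigned int andsum = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++)
        andsum &= data[index[i]];
    return andsum;
}

/* Read/write test: update each selected word in place. */
void read_write_indexed(unsigned int *data, const unsigned int *index, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[index[i]] ^= 1u;
}
```

With serial indexes, consecutive accesses fall in the same cache line and benefit from burst transfers. With random indexes over a RAM sized array, most accesses fetch a whole line to use one word, which fits the much lower random MB/second figures at 12288 KB.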
MB/Second Using 1, 2, 4 and 8 Threads
Serial Serial Random Random Serial Serial Random Random
KB+Thread Read RdWr Read RdWr Read RdWr Read RdWr
Gentoo Pi 4B 64 Bits
12.3 1T 5922 7871 5892 7857
2T 11856 7882 11902 7923
4T 22964 7821 22276 7832
8T 23225 7751 22082 7717
122.9 1T 5827 7276 2052 1921
2T 10965 7258 1754 1924
4T 10969 7232 1848 1929
8T 10896 7158 1834 1909
12288 1T 3879 1052 188 170
2T 4848 935 218 168
4T 4684 943 332 170
8T 3982 1049 340 171
Gentoo Pi 3B+ 64 Bits Raspbian Pi 4B 32 Bits
12.3 1T 4901 3587 4912 3585 5860 7905 5927 7657
2T 8749 3564 8719 3556 11747 7908 11182 7746
4T 17108 3504 17160 3505 21416 7626 17382 7731
8T 16885 3475 16650 3485 20649 7528 20431 7378
122.9 1T 3921 3339 1010 974 5479 7269 1826 1923
2T 7360 3350 1814 972 10355 6964 1667 1920
4T 12199 3313 2281 969 9808 7177 1715 1908
8T 12089 3313 2279 968 11677 7058 1697 1919
12288 1T 2024 828 83 67 3438 1271 179 152
2T 2169 820 142 67 4176 1204 213 167
4T 2178 818 154 67 4227 1117 337 161
8T 2219 821 161 67 3479 1093 287 168
4 Thread Pi 4B/3B+ 64 Bits 4 Thread Pi 4B 64 bits/32 bits
12.3 4T 1.34 2.23 1.30 2.23 1.07 1.03 1.28 1.01
122.9 4T 0.90 2.18 0.81 1.99 1.12 1.01 1.08 1.01
12288 4T 2.15 1.15 2.16 2.54 1.11 0.84 0.99 1.06
===================================================================
MB/Second Using 1, 2, 4 and 8 Threads
Serial Serial Random Random Serial Serial Random Random
KB+Thread Read RdWr Read RdWr Read RdWr Read RdWr
Gentoo 64b Pi 3B+ gcc 9 Gentoo 64b Pi 4B gcc 9
12.3 1T 4886 3581 4878 3590 5737 6884 5763 7537
2T 8723 3550 8724 3550 11536 7592 10238 6898
4T 16836 3498 17531 3509 21084 7575 15160 7390
8T 15777 3459 16783 3466 20089 7339 15311 7200
122.9 1T 3913 3346 987 972 5739 7231 2006 1906
2T 7285 3339 1753 964 10662 7217 1742 1896
4T 12354 3344 2350 972 10376 6741 1815 1812
8T 11841 3333 2300 962 10298 6937 1823 1848
12288 1T 1795 761 69 60 3477 905 181 162
2T 1915 735 118 60 3750 794 215 164
4T 2452 730 128 59 4669 968 259 162
8T 1805 755 137 60 3419 981 301 157
4 Thread 4 Thread
Comparison 64b Pi4/3B+ Comparison Pi4B gcc 9/6
12.3 4T 1.25 2.17 0.86 2.11 0.92 0.97 0.68 0.94
122.9 4T 0.84 2.02 0.77 1.86 0.95 0.93 0.98 0.94
12288 4T 1.90 1.33 2.02 2.75 1.00 1.03 0.78 0.95
MP-MFLOPS Benchmarks - MP-MFLOPSPi64, MP-MFLOPSPi64g9, MP-MFLOPSPi64DP,
MP-MFLOPSPi64DPg9, MP-NeonMFLOPS64, MP-NeonMFLOPS64g9, MP-MFLOPSPiA7
MP-MFLOPS measures floating point speed on data from caches and RAM. The first calculations are as used in Memory
Speed Benchmark, with a multiply and an add per data word read. The second uses 32 operations per input data word of
the form x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f -- more. Tests cover 1, 2, 4 and 8 threads, each carrying out the same
calculations but accessing different segments of the data. Versions are available using single precision and double
precision data, plus one with NEON intrinsic functions. The numeric results are converted into a simple sumcheck, that
should be constant, irrespective of the number of threads used. Correct values are included at the end of the results
below. Note the differences using NEON functions and double or single precision floating point instructions.
There can be wide variations in speeds, affected by the short running times and by factors such as cached data
variations. To help in interpreting results, comparisons are provided of results using one, two and four threads. These
indicate that, with cache based data, the Pi 4B was more than 3.5 times faster than the Pi 3B+ at two operations per
word, but less so at 32 operations.
The 64 bit and 32 bit comparisons were, no doubt, influenced by the particular compiler version used, and this is
reflected in the main disassembled code shown below, for 32 operations per word. The 32 bit version compile included
-mfpu=neon-vfpv4, but NEON was not implemented, resulting in scalar operation, using single word s registers. I have
another version, compiled with -funsafe-math-optimizations, that produces NEON instructions, with similar
performance to the 64 bit version, but more sumcheck differences.
The benchmark compiled to use NEON Intrinsic Functions does not include any that specify fused multiply and add
operations, reducing maximum possible speed. The 64 bit compiler converts the functions to include fused instructions,
providing the fastest speeds.
The main compiler independent feature that provides a clear advantage to 64 bit operation is that the CPU, at 32 bits,
does not support double precision SIMD (NEON) operation, with single word d registers being compiled. On the other
hand, the performance gain does not appear to meet the potential. This suggests that there are other limiting factors -
see disassembly below.
It is difficult to judge relative gcc 9 and 6 performance, probably due to the short running times. The former appears to
be more than 10% faster, running the single precision tests. For these, the disassembled instructions look the same as
those shown below, but in a different sequence.
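The first calculation described above (a multiply and an add per data word), its division into per-thread segments and the sumcheck idea can be sketched as follows. This is not the benchmark source: the function names are hypothetical, the segments are run in turn rather than on real threads, and a plain sum stands in for the benchmark's own sumcheck conversion, which is not shown in this report.

```c
#include <assert.h>
#include <stddef.h>

/* First calculation: one add and one multiply per data word. */
void triad_2ops(float *x, size_t start, size_t end, float a, float b)
{
    for (size_t i = start; i < end; i++)
        x[i] = (x[i] + a) * b;
}

/* Apply the kernel to n words in `segments` separate parts. The benchmark
   gives each part to its own thread; this sketch runs them in turn. */
void run_in_segments(float *x, size_t n, size_t segments, float a, float b)
{
    assert(segments > 0);
    for (size_t s = 0; s < segments; s++)
        triad_2ops(x, n * s / segments, n * (s + 1) / segments, a, b);
}

/* Stand-in sumcheck (an assumption): the plain sum of all results. */
double sumcheck(const float *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}
```

Since every word receives the same operations whichever segment it falls in, the sum is the same for 1, 2, 4 or 8 segments, which is why a constant sumcheck is a useful check on the threaded runs.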
Single Precision
MP-MFLOPS armv8 64Bit Thu Aug 22 19:50:10 2019
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word 2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800 12.8 128 12800 12.8 128 12800
---- Gentoo Pi 4B 64 Bits MFLOPS ---
1T 2908 2854 459 5778 5734 5405
2T 5700 5311 457 10935 11212 7968
4T 10375 5588 490 18181 21842 7637
8T 9675 8460 511 20128 20567 8568
--- Gentoo Pi 3B+ 64 Bits MFLOPS --- -- Raspbian Pi 4B 32 Bits MFLOPS -
1T 792 806 373 1780 1783 1724 987 993 606 2816 2794 2804
2T 1482 1596 382 3542 3509 3380 1823 1837 567 5610 5541 5497
4T 2861 2742 429 5849 7013 5465 2119 3349 647 9884 10702 9081
8T 2770 2877 429 6434 6700 6101 3136 3783 609 10230 10504 9240
Comparisons
--------- Pi 4B/3B+ 64 Bits -------- ------ Pi 4B 64 bits/32 bits -----
1T 3.67 3.54 1.23 3.25 3.22 3.14 2.95 2.87 0.76 2.05 2.05 1.93
2T 3.85 3.33 1.20 3.09 3.20 2.36 3.13 2.89 0.81 1.95 2.02 1.45
4T 3.63 2.04 1.14 3.11 3.11 1.40 4.90 1.67 0.76 1.84 2.04 0.84
===========================================================================
MP-MFLOPS 64 Bit gcc 9 Thu Sep 26 12:36:54 2019
FPU Add & Multiply using 1, 2, 4 and 8 Threads
Gentoo 64b Pi 3B+ gcc 9 Gentoo 64b Pi 4B gcc 9
2 Ops/Word 32 Ops/Word 2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800 12.8 128 12800 12.8 128 12800
1T 827 805 371 3232 3157 2802 3162 3072 468 6754 6714 6340
2T 1608 1567 360 6420 6423 5286 6498 6029 496 13329 12397 7623
4T 1764 3142 400 11240 12355 6029 11709 6141 529 24825 25055 8723
8T 2548 2575 381 10813 11755 5827 10828 8158 493 19452 22190 8426
Comparisons
........... 64b Pi4/3B+ .......... .......... Pi4B gcc 9/6 ..........
1T 3.82 3.82 1.26 2.09 2.13 2.26 1.09 1.08 1.02 1.17 1.17 1.17
2T 4.04 3.85 1.38 2.08 1.93 1.44 1.14 1.14 1.09 1.22 1.11 0.96
4T 6.64 1.95 1.32 2.21 2.03 1.45 1.13 1.10 1.08 1.37 1.15 1.14
###########################################################################
Double Precision
MP-MFLOPS armv8 64Bit Double Precision Thu Aug 22 19:51:42 2019
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word 2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800 12.8 128 12800 12.8 128 12800
---- Gentoo Pi 4B 64 Bits MFLOPS ---
1T 1464 1386 225 3398 3386 3182
2T 2837 2792 228 6720 6741 4547
4T 5172 3414 251 10405 12762 4763
8T 4774 4353 275 11506 12118 4865
--- Gentoo Pi 3B+ 64 Bits MFLOPS --- -- Raspbian Pi 4B 32 Bits MFLOPS -
1T 415 386 206 1400 1403 1333 1187 1220 309 2682 2714 2701
2T 820 813 209 2804 2767 2597 2420 2416 282 5379 5415 4780
4T 1328 1323 212 5433 5340 2465 4665 2381 317 10256 10336 5242
8T 1343 1308 214 5090 5006 3280 4385 3114 310 9721 10340 5131
Comparisons
--------- Pi 4B/3B+ 64 Bits -------- ------ Pi 4B 64 bits/32 bits -----
1T 3.53 3.59 1.09 2.43 2.41 2.39 1.23 1.14 0.73 1.27 1.25 1.18
2T 3.46 3.43 1.09 2.40 2.44 1.75 1.17 1.16 0.81 1.25 1.24 0.95
4T 3.89 2.58 1.18 1.92 2.39 1.93 1.11 1.43 0.79 1.01 1.23 0.91
===========================================================================
MP-MFLOPS 64 Bit gcc 9 Double Precision Thu Sep 26 22:05:10 2019
FPU Add & Multiply using 1, 2, 4 and 8 Threads
---- Gentoo 64b Pi 3B+ gcc 9 ---- ----- Gentoo 64b Pi 4B gcc 9 ----
2 Ops/Word 32 Ops/Word 2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800 12.8 128 12800 12.8 128 12800
1T 384 350 127 1582 1546 1372 657 663 183 3283 3358 3169
2T 753 753 184 3109 3157 2645 3203 2690 223 6573 6353 4535
4T 1346 1330 194 4228 6099 3067 5799 3866 292 12432 12665 4906
8T 1234 1340 201 4888 5748 3190 5322 4583 269 10738 8891 4521
Comparisons
........... 64b Pi4/3B+ .......... .......... Pi4B gcc 9/6 ..........
1T 1.71 1.89 1.44 2.08 2.17 2.31 0.45 0.48 0.81 0.97 0.99 1.00
2T 4.25 3.57 1.21 2.11 2.01 1.71 1.13 0.96 0.98 0.98 0.94 1.00
4T 4.31 2.91 1.51 2.94 2.08 1.60 1.12 1.13 1.16 1.19 0.99 1.03
NEON Single Precision
MP-MFLOPS NEON Intrinsics 64 Bit Thu Aug 22 19:52:48 2019
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word 2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800 12.8 128 12800 12.8 128 12800
---- Gentoo Pi 4B 64 Bits MFLOPS ---
1T 3311 3192 535 6442 6548 6198
2T 4607 6186 552 13030 13012 8468
4T 6279 5725 562 23798 24128 9374
8T 7815 12044 486 22725 21712 9395
--- Gentoo Pi 3B+ 64 Bits MFLOPS -- -- Raspbian Pi 4B 32 Bits MFLOPS -
1T 830 823 406 2989 2986 2792 2491 2399 615 4325 4285 4261
2T 1575 1498 414 5981 5872 5445 5629 5520 591 8602 8463 8308
4T 2217 2650 431 11661 11644 6061 10580 5594 553 16991 16493 9124
8T 2733 3197 437 10505 10637 6708 7047 10785 513 14325 16219 8867
Comparisons
--------- Pi 4B/3B+ 64 Bits -------- ------ Pi 4B 64 bits/32 bits -----
1T 3.99 3.88 1.32 2.16 2.19 2.22 1.33 1.33 0.87 1.49 1.53 1.45
2T 2.93 4.13 1.33 2.18 2.22 1.56 0.82 1.12 0.93 1.51 1.54 1.02
4T 2.83 2.16 1.30 2.04 2.07 1.55 0.59 1.02 1.02 1.40 1.46 1.03
===========================================================================
MP-MFLOPS NEON Intrinsics 64 Bit gcc 9 Thu Sep 26 22:02:00 2019
FPU Add & Multiply using 1, 2, 4 and 8 Threads
---- Gentoo 64b Pi 3B+ gcc 9 ---- ----- Gentoo 64b Pi 4B gcc 9 ----
2 Ops/Word 32 Ops/Word 2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800 12.8 128 12800 12.8 128 12800
1T 769 765 354 3009 2967 2638 1233 1313 507 6451 6428 6224
2T 1315 1324 293 5863 5990 5097 6307 4824 389 12559 12784 7612
4T 1750 2647 380 10081 11250 5748 8101 5186 531 24762 24708 7902
8T 2180 2664 392 9719 11010 6368 6782 8444 504 22598 24113 7979
........... 64b Pi4/3B+ .......... .......... Pi4B gcc 9/6 ..........
1T 1.60 1.72 1.43 2.14 2.17 2.36 0.37 0.41 0.95 1.00 0.98 1.00
2T 4.80 3.64 1.33 2.14 2.13 1.49 1.37 0.78 0.70 0.96 0.98 0.90
4T 4.63 1.96 1.40 2.46 2.20 1.37 1.29 0.91 0.94 1.04 1.02 0.84
MP-MFLOPS Disassembly
On the Pi 4B, with single precision floating point and SIMD, four word registers were used (see 4s below). With this, four
results of calculations might be expected per clock cycle, or 6 GFLOPS per core at 1.5 GHz and up to 24 GFLOPS using
all four cores. Instructions such as fused multiply and add could then double the speed, to 12 GFLOPS per core. For the
mix of instructions below, expectations might be 70% of this, or 8.4 GFLOPS. Using double precision, with two words in
the 128 bit registers, expectations might be half that, at 4.2 GFLOPS per core, with this code.
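The expectations above are simple arithmetic, sketched here for checking. The sketch assumes the Pi 4B clock of 1.5 GHz and that one SIMD instruction completes per clock cycle. The function names are illustrative, and the instruction counts passed to mix_fraction are those of the SP NEON loop listed in this section (11 add, 1 multiply, 10 fused).

```c
#include <assert.h>

/* Peak GFLOPS per core, assuming one SIMD instruction completes per clock
   cycle: clock (GHz) x lanes per register x operations per instruction
   (1 for add or multiply, 2 for fused multiply and add). */
double peak_gflops_per_core(double clock_ghz, int lanes, int ops_per_instruction)
{
    assert(clock_ghz > 0.0 && lanes > 0 && ops_per_instruction > 0);
    return clock_ghz * lanes * ops_per_instruction;
}

/* Fraction of the all-fused peak reached by a mix of FP instructions. */
double mix_fraction(int add_or_sub, int multiply, int fused)
{
    int instructions = add_or_sub + multiply + fused;
    int operations = add_or_sub + multiply + 2 * fused;
    assert(instructions > 0);
    return (double)operations / (2.0 * instructions);
}
```

peak_gflops_per_core(1.5, 4, 1) gives 6 and peak_gflops_per_core(1.5, 4, 2) gives 12. mix_fraction(11, 1, 10) is about 0.73, close to the 70% used in the text, and counts only floating point instructions, not the loads, store, compare and branch that share the loop.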
SP NEON 24.1 GFLOPS - 6.55 1 core    DP 12.7 GFLOPS - 3.39 1 core
.L41:                                .L84:
 ldr q1, [x1]                         ldr q16, [x2, x0]
 ldr q0, [sp, 64]                     add w3, w3, 1
 fadd v18.4s, v20.4s, v1.4s           cmp w3, w6
 fadd v17.4s, v22.4s, v1.4s           fadd v15.2d, v16.2d, v14.2d
 fadd v0.4s, v0.4s, v1.4s             fadd v17.2d, v16.2d, v12.2d
 fadd v16.4s, v24.4s, v1.4s           fmul v15.2d, v15.2d, v13.2d
 fadd v7.4s, v26.4s, v1.4s            fmls v15.2d, v17.2d, v11.2d
 fadd v6.4s, v28.4s, v1.4s            fadd v17.2d, v16.2d, v10.2d
 fadd v5.4s, v30.4s, v1.4s            fmla v15.2d, v17.2d, v9.2d
 fmul v0.4s, v0.4s, v19.4s            fadd v17.2d, v16.2d, v8.2d
 fadd v4.4s, v10.4s, v1.4s            fmls v15.2d, v17.2d, v31.2d
 fadd v3.4s, v12.4s, v1.4s            fadd v17.2d, v16.2d, v30.2d
 fadd v2.4s, v14.4s, v1.4s            fmla v15.2d, v17.2d, v29.2d
 fadd v1.4s, v8.4s, v1.4s             fadd v17.2d, v16.2d, v28.2d
 fmls v0.4s, v21.4s, v18.4s           fmls v15.2d, v17.2d, v0.2d
 fmla v0.4s, v23.4s, v17.4s           fadd v17.2d, v16.2d, v27.2d
 fmls v0.4s, v25.4s, v16.4s           fmla v15.2d, v17.2d, v26.2d
 fmla v0.4s, v27.4s, v7.4s            fadd v17.2d, v16.2d, v25.2d
 fmls v0.4s, v29.4s, v6.4s            fmls v15.2d, v17.2d, v24.2d
 fmla v0.4s, v31.4s, v5.4s            fadd v17.2d, v16.2d, v23.2d
 fmls v0.4s, v9.4s, v1.4s             fmla v15.2d, v17.2d, v22.2d
 fmla v0.4s, v4.4s, v11.4s            fadd v17.2d, v16.2d, v21.2d
 fmls v0.4s, v3.4s, v13.4s            fadd v16.2d, v16.2d, v19.2d
 fmla v0.4s, v2.4s, v15.4s            fmls v15.2d, v17.2d, v20.2d
 str q0, [x1], 16                     fmla v15.2d, v16.2d, v18.2d
 cmp x1, x0                           str q15, [x2, x0]
 bne .L41                             add x0, x0, 16
                                      bcc .L84
                     32 bit     64 bit     32 bit     64 bit     32 bit     64 bit
                     SP         SP         DP         DP         NEON SP    NEON SP
Maximum GFLOPS       10.7       21.8       10.3       12.7       17.0       24.1
Instructions
  Total              27         39         26         27         67         27
  Floating point     22         32         22         32         32         22
FP operations
  Total              32         128        32         64         128        128
  Add or subtract    11         44         11         22         21         44
  Multiply           1          4          1          2          11         4
  Fused              20         80         20         40         0          80
Add example          fadds      fadd       faddd      fadd       vadd.f32   fadd
                     s16,       v15.4s,    d25,       v15.2d,    q9,        v1.4s,
                     s23,       v16.4s,    d17,       v16.2d,    q8,        v8.4s,
                     s2         v15.4s     d15        v14.2d     q14        v1.4s
Multiply example     fnmuls     fmul       fmuld      fmul       vmul.f32   fmul
                     s16,       v15.4s,    d16,       v15.2d,    q9,        v0.4s,
                     s3,        v15.4s,    d16,       v15.2d,    q9,        v0.4s,
                     s16        v17.4s     d5         v13.2d     q12        v19.4s
Fused example        vfma.f32   fmla       vfma.f64   fmla       N/A        fmla
                     s16,       v15.4s,    d16,       v15.2d,               v0.4s,
                     s29,       v17.4s,    d22,       v17.2d,               v4.4s,
                     s9         v0.4s      d28        v22.2d                v11.4s
FP registers used    32         4          32         25         16         32
MP-MFLOPS Sumchecks
Different instructions, such as those for SP and DP, may not produce identical numeric results. Variations also depend on
the number of passes; here the results moved closer to 1.0 as data size increased. The only anomaly is marked -X below.
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
SP
4B/64 1T 76406 97075 99969 66015 95363 99951
3B/64 1T 76406 97075 99969 66015 95363 99951
4B/32 1T 76406 97075 99969 66015 95363 99951
DP
4B/64 1T 76384 97072 99969 66065 95370 99951
3B/64 1T 76384 97072 99969 66065 95370 99951
4B/32 1T 76384 97072 99969 66065 95370 99951
NEON Bit SP
4B/64 1T 76406 97075 99969 66015 95363 99951
3B/64 1T 76406 97075 99969 66015 95363 99951
4B/32 1T 76406 97075 99969 66014-X 95363 99951
OpenMP MFLOPS - OpenMP-MFLOPS64, OpenMP-MFLOPS64g9, notOpenMP-MFLOPS64,
notOpenMP-MFLOPS64g9, OpenMP-MFLOPS, notOpenMP-MFLOPS
This benchmark carries out the same calculations as the MP-MFLOPS Benchmarks but, in addition, includes calculations
with eight operations per data word. There is also a notOpenMP-MFLOPS single core version, compiled from the same code
and carrying out identical numbers of floating point calculations, but without an OpenMP compile directive.
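The relationship between the OpenMP and notOpenMP versions can be sketched with a single source loop. This is an illustration, not the benchmark code: the function name is hypothetical, and it assumes that the compile directive mentioned above corresponds to gcc's -fopenmp option. Built with that option, the pragma shares the loop across cores; built without it, the pragma is ignored and the same calculations run on one core.

```c
#include <assert.h>
#include <stddef.h>

/* Eight operations per data word. With -fopenmp the pragma shares the loop
   across cores; without it the pragma is ignored and the same source runs
   on a single core. */
void triad_8ops_omp(float *x, long n,
                    float a, float b, float c, float d, float e, float f)
{
    long i;
    assert(x != NULL && n >= 0);
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
}
```

Each iteration reads and writes only its own x[i], so the loop has no dependencies between iterations and the results are the same in both builds.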
Following is an example of full output. The strange test names were carried forward from a 2014 CUDA benchmark, via
Windows and Linux Intel CPU versions. Details are in the following GigaFLOPS Benchmarks report, covering MP-MFLOPS,
QPAR and OpenMP. This showed nearly 100 GFLOPS from a Core i7 CPU and 400 GFLOPS from a GeForce GTX 650
graphics card, via CUDA. See GigaFLOPS Benchmarks.htm.
The detail is followed by MFLOPS results on Pi 3B+ and Pi 4B. The direct conversions of the code from large systems lead
to excessive memory demands for Raspberry Pi systems, with too many tests dependent on RAM speed, and low MP
performance gains. There were glimpses of the usual performance gains and a maximum of over 20 SP GFLOPS on a 64 bit
Pi 4B.
The Pi 4B gcc 9/6 performance ratios indicate no consistent advantage for either compilation, although the gcc 9 version
produced the highest result of 24.7 SP GFLOPS.
Gentoo 64b Pi 4B gcc 9
OpenMP MFLOPS64g9 Thu Sep 26 16:51:07 2019
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 100000 2 2500 0.124228 4025 0.929538 Yes
Data in & out 1000000 2 250 0.842066 594 0.992550 Yes
Data in & out 10000000 2 25 0.873622 572 0.999250 Yes
Data in & out 100000 8 2500 0.147889 13524 0.957117 Yes
Data in & out 1000000 8 250 0.904478 2211 0.995518 Yes
Data in & out 10000000 8 25 0.951405 2102 0.999549 Yes
Data in & out 100000 32 2500 0.324246 24673 0.890215 Yes
Data in & out 1000000 32 250 1.097993 7286 0.988088 Yes
Data in & out 10000000 32 25 1.045087 7655 0.998796 Yes
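The MFLOPS column in the log above appears to follow from the other columns: words, times operations per word, times repeat passes, divided by running time. The Python sketch below checks that reading against three logged rows. The function name is illustrative, and the formula is inferred from the log, not taken from the benchmark source code.

```python
def mflops(words, ops_per_word, passes, seconds):
    """Millions of floating point operations per second for one test row."""
    total_ops = words * ops_per_word * passes
    return total_ops / seconds / 1e6

# Rows copied from the Pi 4B gcc 9 log above:
# words, ops/word, repeat passes, seconds, logged MFLOPS
rows = [
    (100000, 2, 2500, 0.124228, 4025),
    (100000, 32, 2500, 0.324246, 24673),
    (10000000, 32, 25, 1.045087, 7655),
]

for words, ops, passes, secs, logged in rows:
    calculated = mflops(words, ops, passes, secs)
    print(f"{words:>9} words, {ops:>2} ops/word: "
          f"{calculated:8.0f} MFLOPS calculated, {logged} logged")
```

With 4 byte words, the three data sizes of 100000, 1000000 and 10000000 words correspond to the 0.4, 4 and 40 Mbytes entries used in the comparison tables.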
--------- gcc 9 ---------
Mbytes/ Pi 3B+ Pi 4B Pi 4B Pi 3B+ Pi 4B
Ops/Word 64b 64b 32b 64b 64b
All 1T All 1T All 1T All 1T All 1T
0.4/2 2674 755 5386 2780 4716 2850 2341 795 4025 2236
4/2 411 404 563 557 556 429 381 362 594 403
40/2 419 408 545 588 544 632 401 387 572 493
0.4/8 7029 1886 15401 5555 7981 5191 6051 1906 13524 5373
4/8 1656 1495 2223 2116 2389 2082 1491 1352 2211 1948
40/8 1725 1507 2361 2310 2199 2003 1598 1418 2102 2308
0.4/32 6648 1699 20429 5647 8147 5449 12002 3185 24673 6786
4/32 5977 1616 8082 5445 7951 5385 5641 2809 7286 6385
40/32 6027 1616 8470 5479 8030 5379 6142 2809 7655 6415
Pi 4B gcc 9 Pi 4B
4b/3b 64/32b 4b/3b gcc 9/6
All 1T All 1T All 1T All 1T
0.4/2 2.01 3.68 1.14 0.98 1.72 2.81 0.75 0.80
4/2 1.37 1.38 1.01 1.30 1.56 1.11 1.06 0.72
40/2 1.30 1.44 1.00 0.93 1.43 1.27 1.05 0.84
0.4/8 2.19 2.95 1.93 1.07 2.24 2.82 0.88 0.97
4/8 1.34 1.42 0.93 1.02 1.48 1.44 0.99 0.92
40/8 1.37 1.53 1.07 1.15 1.32 1.63 0.89 1.00
0.4/32 3.07 3.32 2.51 1.04 2.06 2.13 1.21 1.20
4/32 1.35 3.37 1.02 1.01 1.29 2.27 0.90 1.17
40/32 1.41 3.39 1.05 1.02 1.25 2.28 0.90 1.17
OpenMP-MemSpeed - OpenMP-MemSpeed264, OpenMP-MemSpeed264g9,
NotOpenMP-MemSpeed264, NotOpenMP-MemSpeed264g9, OpenMP-MemSpeed2,
NotOpenMP-MemSpeed2
This is the same program as the single core MemSpeed benchmark, but with increased memory sizes and compiled using
OpenMP directives. The same program was also compiled without these directives (NotOpenMP-MemSpeed2). Although
the source code appears to be suitable for speed up by parallelisation, many of the test functions are slower using
OpenMP. Detailed comparisons of these results are rather meaningless. Below are Pi 4B results from a gcc 9 compilation.
See MemSpeed results for other comparisons.
Memory Reading Speed Test OpenMP 64 Bit gcc 9 by Roy Longbottom
Start of test Thu Sep 26 22:08:22 2019
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 7616 8480 8749 7548 8520 8530 35856 18594 18601
8 8195 8660 8876 8147 5740 8365 37153 18878 18864
16 7992 7684 8189 8064 8139 8023 35774 18896 18898
32 8975 8535 8024 9048 8536 8512 37465 18392 19024
64 8622 7997 8057 8511 7953 7994 19618 16857 16701
128 11940 11637 11554 12101 11659 11498 13815 13417 13964
256 17008 17339 16359 17104 17396 17038 11877 12344 12376
512 17740 15986 18607 17522 18547 15612 12575 13616 13495
1024 7011 10208 10016 11310 5287 11413 7060 6279 10045
2048 7024 4201 7006 7017 6943 3225 2822 3386 3391
4096 3854 7002 7126 6912 7074 3985 2199 3127 3132
8192 2632 6950 7151 5291 2796 6813 2546 3091 2403
16384 7350 7073 3537 7583 5327 3200 2609 3053 1907
32768 7514 7616 7725 7807 2344 2936 2702 2559 3042
65536 7065 2937 7571 4306 7086 2975 2127 3017 2677
131072 1772 1779 2562 8092 2583 2800 2035 1866 2869
Memory Reading Speed Test notOpenMP 64 Bit gcc 9 by Roy Longbottom
4 12991 21391 23815 13044 22904 23856 11216 9060 9062
8 13380 21857 24416 13414 23420 24400 11630 9313 9312
16 13534 22119 24711 13550 23683 24718 11797 9447 9447
32 11981 19879 21566 12100 21243 21572 9552 8928 8924
64 11695 19992 20989 12044 21020 20966 9356 8613 8602
128 11824 20347 21045 12116 21217 21067 8132 8149 8178
256 11705 20247 21090 12041 21382 21013 8081 8182 5919
512 11515 20242 21155 12059 21089 20938 8093 8127 7376
1024 4504 8674 8151 4658 8682 8680 3894 3739 3887
2048 1868 3231 3636 1868 3581 3491 2639 2871 2896
4096 1921 2994 3748 1925 3781 3443 2589 2634 2636
8192 1836 3719 3695 1921 3624 3791 2603 2596 2595
16384 1951 3724 3002 1977 3838 3249 2584 2572 2384
32768 1710 3431 3427 2008 3186 3449 2545 2531 2529
65536 2030 3034 2135 2047 3035 2394 2550 2535 2546
131072 2029 2023 2024 1873 2059 1652 2378 2466 2392
Stress Testing Programs Benchmarking Mode
My latest stress testing programs have parameters that specify running time, data size, number of threads, log file
number and, in two cases, processing density. When run without parameters, the full range of options is used,
providing a useful benchmark. Log file results from Pi 4B tests, and comparisons, are provided below.
Integer Stress Test Benchmark - MP-IntStress64, MP-IntStress
The integer program test loop comprises 32 add or subtract instructions, operating on hexadecimal data patterns, with
sequences of 8 subtracts then 8 adds to restore the original pattern. Disassembly shows that the test loop, in fact,
used 68 instructions, most additional ones being load register type. The result of these is 68/32 instructions per 4 byte
word. At the maximum of 1943M words per second, using a single core, resultant execution speed was 4129 MIPS with
nearly four times more using all cores.
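As a check on the arithmetic above, the following Python sketch converts a measured MB/second figure into MIPS, using the 68 instructions per 32 words quoted from the disassembly. It assumes that the 1943M words per second corresponds to the 7771 MB/second single thread result in the log below, divided by 4 bytes per word. The names are illustrative only.

```python
BYTES_PER_WORD = 4        # 32 bit data words
LOOP_INSTRUCTIONS = 68    # instructions in the disassembled test loop
LOOP_WORDS = 32           # words processed by one pass of the loop

def mips_from_mb_per_sec(mb_per_sec):
    """Convert measured MB/second into millions of instructions per second."""
    mwords_per_sec = mb_per_sec / BYTES_PER_WORD
    return mwords_per_sec * LOOP_INSTRUCTIONS / LOOP_WORDS

single_core = mips_from_mb_per_sec(7771)   # 1 thread, 16 KB, Pi 4B 64 bit
all_cores = mips_from_mb_per_sec(30292)    # 8 threads, 16 KB, Pi 4B 64 bit

print(f"1 thread : {single_core:.0f} MIPS")
print(f"8 threads: {all_cores:.0f} MIPS, "
      f"{all_cores / single_core:.2f} times the single core speed")
```

The small difference from the quoted 4129 MIPS comes from rounding 1942.75M words per second up to 1943M before multiplying.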
The tables below, with speeds on the considered systems, provide average performance gains of the Pi 4B at 64 bits,
somewhat limited in this case.
Gentoo Pi 4B 64 Bits
MP-Integer-Test 64 Bit v1.0 Fri Sep 6 16:33:36 2019
Benchmark 1, 2, 4, 8, 16 and 32 Threads
MB/second
KB KB MB Same All
Secs Thrds 16 160 16 Sumcheck Tests
4.3 1 7771 7352 3895 00000000 Yes
3.3 2 15467 14218 3714 FFFFFFFF Yes
3.0 4 28715 26652 3345 5A5A5A5A Yes
3.0 8 30292 26310 3334 AAAAAAAA Yes
3.0 16 29466 28503 3337 CCCCCCCC Yes
3.0 32 29351 30358 3390 0F0F0F0F Yes
Pi 4B 32 bit MB/sec Pi 3B+ 64 bit MB/sec
KB KB MB KB KB MB
16 160 16 16 160 16
Threads
1 5964 5756 3931 4823 3884 1209
2 11787 11430 3748 9613 7709 1908
4 23214 22060 3456 17737 15137 1779
8 22197 22171 3472 17651 18692 1767
16 22671 23299 3256 18255 18793 1757
32 21379 21881 3346 18246 18674 1748
Pi 4B 64b/32b 64b Pi 4B/3B+
Average
Gain 1.31 1.25 0.99 1.63 1.67 2.13
Single Precision Floating Point Stress Test Benchmark - MP-FPUStress64, MP-FPUStress
This and the double precision program carry out the same calculations as MP-MFLOPS, but are slightly faster by
including a loop that repeats the tests within the calculate functions. Maximum speeds were 6.75 GFLOPS, using one
core, and 26.7 GFLOPS with four cores.
These programs were written using a later compiler than those used for MP-MFLOPS, at least resulting in similar speeds
between 32 bit and 64 bit versions. Typical Pi 4B/3B+ performance improvements were indicated.
Gentoo Pi 4B 64 Bits
MP-Threaded-MFLOPS 64 Bit v1.0 Fri Sep 6 16:30:12 2019
Benchmark 1, 2, 4 and 8 Threads
MFLOPS Numeric Results
Ops/ KB KB MB KB KB MB
Secs Thrd Word 12.8 128 12.8 12.8 128 12.8
1.7 T1 2 2819 2874 504 40392 76406 99700
3.2 T2 2 5592 5702 511 40392 76406 99700
4.6 T4 2 9223 7520 519 40392 76406 99700
6.0 T8 2 9520 10471 545 40392 76406 99700
8.2 T1 8 5381 5595 2050 54764 85092 99820
9.8 T2 8 11039 10883 2173 54764 85092 99820
11.3 T4 8 19087 21040 2044 54764 85092 99820
12.9 T8 8 19747 21107 2016 54764 85092 99820
17.5 T1 32 6693 6753 6377 35206 66015 99520
20.2 T2 32 13491 13464 8710 35206 66015 99520
22.2 T4 32 25732 26704 9160 35206 66015 99520
24.1 T8 32 25708 25770 8927 35206 66015 99520
End of test Fri Sep 6 16:30:37 2019
Pi 4B 32 bit Pi 3B+ 64 bit
Threads KB KB MB KB KB MB
Ops/wd 12.8 128 12.8 12.8 128 12.8
T1 2 2641 2607 646 838 826 373
T2 2 5089 5116 618 1659 1650 380
T4 2 8282 8522 683 2584 3296 384
T8 2 8756 9847 686 3013 3056 391
T1 8 5543 5428 2597 1981 1972 1354
T2 8 10754 10603 2711 3936 3923 1518
T4 8 18716 20823 2844 7482 7396 1531
T8 8 19859 21684 2555 7399 7705 1534
T1 32 5309 5274 5265 2820 2809 2462
T2 32 10557 10509 9991 5636 5583 4754
T4 32 20416 20919 11340 10640 10882 6020
T8 32 20072 19787 9330 10641 10926 6159
Average Pi 4B Performance Gains
Ops/Word Pi 4B 64b/32b 64b Pi 4B/3B+
2 1.09 1.04 0.79 3.37 3.16 1.36
8 1.00 1.01 0.77 2.69 2.80 1.40
32 1.27 1.29 0.96 2.40 2.41 1.85
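The average gains in the table above are consistent with taking the mean of the speed ratios at each thread count. The Python sketch below reproduces the first row (2 operations per word, 12.8 KB) from the MFLOPS values in the preceding tables. The averaging method is inferred from the numbers, and the function name is illustrative.

```python
from statistics import mean

# MFLOPS at 2 operations per word, 12.8 KB data, for 1, 2, 4 and 8 threads
pi4_64bit = [2819, 5592, 9223, 9520]
pi4_32bit = [2641, 5089, 8282, 8756]
pi3_64bit = [838, 1659, 2584, 3013]

def average_gain(new, old):
    """Mean of the per-thread-count speed ratios new/old."""
    return mean(n / o for n, o in zip(new, old))

print(f"Pi 4B 64b/32b : {average_gain(pi4_64bit, pi4_32bit):.2f}")
print(f"64b Pi 4B/3B+ : {average_gain(pi4_64bit, pi3_64bit):.2f}")
```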
Double Precision Floating Point Stress Test Benchmark - MP-FPUStress64DP,
MP-FPUStressDP
Maximum measured DP speeds were 3.39 GFLOPS, using one core, and 13.2 GFLOPS with four cores. Some of the 64/32
bit and 4B/3B+ performance ratios were similar to those from MP-MFLOPS.
Gentoo Pi 4B 64 Bits
MP-Threaded-MFLOPS 64 Bit v1.0 Fri Sep 6 16:31:24 2019
Double Precision Benchmark 1, 2, 4 and 8 Threads
MFLOPS Numeric Results
Ops/ KB KB MB KB KB MB
Secs Thrd Word 12.8 128 12.8 12.8 128 12.8
3.2 T1 2 1398 1462 285 40395 76384 99700
6.2 T2 2 2799 2807 256 40395 76384 99700
8.9 T4 2 5024 4589 257 40395 76384 99700
11.5 T8 2 5089 5545 280 40395 76384 99700
15.7 T1 8 2668 2790 1103 54805 85108 99820
18.8 T2 8 5670 5545 1158 54805 85108 99820
21.7 T4 8 10259 10011 1068 54805 85108 99820
24.7 T8 8 10239 10824 1036 54805 85108 99820
34.1 T1 32 3317 3390 3195 35159 66065 99521
39.2 T2 32 6791 6754 4753 35159 66065 99521
43.1 T4 32 12940 13200 4497 35159 66065 99521
46.9 T8 32 13200 13049 4557 35159 66065 99521
End of test Fri Sep 6 16:32:11 2019
Pi 4B 32 bit Pi 3B+ 64 bit
Threads KB KB MB KB KB MB
Ops/wd 12.8 128 12.8 12.8 128 12.8
T1 2 993 998 329 412 411 193
T2 2 1971 1995 309 828 824 194
T4 2 3633 3937 340 1543 1514 197
T8 2 3635 3796 339 1525 1551 196
T1 8 2378 2445 1288 980 978 696
T2 8 4770 4860 1282 1975 1964 782
T4 8 9281 9556 1210 3688 3688 781
T8 8 9119 9448 1245 3726 3689 787
T1 32 2697 2726 2708 1402 1403 1231
T2 32 5397 5446 5163 2808 2808 2399
T4 32 10689 10806 5146 5379 5413 3195
T8 32 10716 10494 4497 5450 5485 3150
Average Pi 4B Performance Gains
Ops/Word Pi 4B 64b/32b 64b Pi 4B/3B+
2 1.40 1.37 0.82 3.34 3.39 1.38
8 1.13 1.12 0.87 2.78 2.83 1.44
32 1.23 1.24 1.00 2.40 2.41 1.86
High Performance Linpack Benchmark
Earlier, the High Performance Linpack Benchmark was run on Raspberry Pi 3 models and, later, on the Raspberry Pi 4
system, both via the 32 bit Raspbian Operating System. Details and results can be found in the following reports: Pi 3B and
3B+ results and Pi 4B 32 bit results.
Initially, two versions of HPL tests were run, one accessing precompiled Basic Linear Algebra Subprograms and the other
with ATLAS alternatives, that had to be built. The whole benchmark suite was produced according to these instructions.
The ATLAS version was installed, as the older benchmark would not run on the Pi 4. One issue is the time required for
the build, apparently due to the numerous tuning tests. Time taken was 14 hours using a Pi 3B+, then 8 hours on a Pi 4.
Later, 64 bit ATLAS was built on the Pi 3B+, via Gentoo, taking 26 hours, which included extended periods swapping data
with the rather slow main drive. The procedure specified in the above instructions was used, successfully leading to a working
package. Only one change was required, to Make.rpi line 95, changing
LAdir = /home/pi/atlas-build to LAdir = /home/demouser/atlas-build.
Following the introduction of 64 bit Gentoo for the Pi 4B, ATLAS was again created, taking more than 10 hours. As
indicated in the above links, the HPL benchmark can be a useful stress test, due to the long running time with heavy
processing. It can lead to CPU MHz being throttled on the Pi 4B, producing slow GFLOPS speeds. The tests reported here
were run using a Pi 4B with a cooling fan, with CPU MHz monitored to help to indicate that the processor was running at
full speed.
The benchmark was run on various Raspberry Pi models, using the same parameters. An example of the main output
produced is shown below. Key areas are array size parameter N, running time, GFLOPS speed rating and sumcheck
(0.0010188 in this case), including whether acceptable (PASSED).
pi@raspberrypi:~/hpl-2.2/bin/rpi $ mpiexec -f nodes-1pi ./xhpl
================================================================================
HPLinpack 2.2 -- High-Performance Linpack benchmark -- February 24, 2016
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 20000
NB : 128
PMAP : Row-major process mapping
P : 2
Q : 2
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 20000 128 2 2 494.46 1.079e+01
HPL_pdgesv() start time Fri Oct 11 22:34:37 2019
HPL_pdgesv() end time Fri Oct 11 22:42:52 2019
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0010188 ...... PASSED
================================================================================
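The Gflops figure in the output above can be related to N and the running time. The Python sketch below assumes the usual HPL operation count of 2/3 N^3 + 3/2 N^2; the N^2 term makes no visible difference at these sizes, so the check mainly confirms the N^3 term. The function name is illustrative.

```python
def hpl_gflops(n, seconds):
    """GFLOPS for an N x N solve, assuming 2/3 N^3 + 3/2 N^2 operations."""
    operations = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return operations / seconds / 1e9

# N and Time from the example output above
print(f"N = 20000, 494.46 seconds: {hpl_gflops(20000, 494.46):.2f} GFLOPS")
```

The same calculation can be applied to any other N and Time pair reported by the benchmark.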
High Performance Linpack Benchmark Results
Maximum performance depends heavily on the amount of RAM available. As with the original single
CPU Linpack benchmark, where N is the matrix problem size, minimum memory used is N x N x 8 Bytes (double precision),
that is 512 MB for N = 8000 and 3.2 GB for N = 20000. The end of the detailed output indicates a further problem, where the
first run at maximum size might be slow, with extra time swapping data out of RAM, to create space for the HPL data.
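The memory requirement can be worked through for each problem size used here. The following Python sketch applies the N x N x 8 Bytes rule and also inverts it, to estimate the largest N that fits in a given amount of memory. The helper names are illustrative, and the estimate ignores memory needed by the operating system and the HPL workspace.

```python
from math import isqrt

def hpl_matrix_bytes(n, bytes_per_element=8):
    """Minimum memory for an N x N double precision matrix."""
    return n * n * bytes_per_element

def largest_n(available_bytes, bytes_per_element=8):
    """Largest N whose matrix fits in the given number of bytes."""
    return isqrt(available_bytes // bytes_per_element)

for n in (4000, 8000, 16000, 20000):
    print(f"N = {n:>5}: {hpl_matrix_bytes(n) / 1e6:6.0f} MB")

# Example only: the N that exactly fills 3.2 GB
print(largest_n(3_200_000_000))
```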
Next, the benchmark produces a sumcheck but, in the case of the ATLAS implementation, the values are not consistent for
the same problem size, although all those shown here were indicated as PASSED (within specified tolerances). The differences could
be produced by different CPU models or alternative compilations, but the least understandable case is identified at the end
of the detailed output, where the sumcheck is shown to vary on repeating the program on the same system.
Comparing Pi 4B 32 bit and 64 bit GFLOPS maximum speeds, the 32 bit version appears to be slightly faster (or the same
within reasonable tolerances). Then it is not clear (to me), whether the compiled code completely embraces the
difference in technology or whether external compile options should be included for the different packages involved.
Anyway, around 10 double precision GFLOPS was the maximum produced by other benchmarks, reported above.
------ Time ------ ----- GFLOPS ----- ----------- Sumcheck ----------
4B 4B 3B+ 4B 4B 3B+ 4B 4B 3B+
N 64b 32b 64b 64b 32b 64b 64b 32b 64b
4000 5.51 5.20 14.53 7.75 8.20 2.94 0.0022808 0.0023975 0.0025857
8000 38.22 36.70 101.59 8.93 9.30 3.36 0.0017216 0.0016746 0.0017518
16000 269.26 263.00 10.14 10.40 0.0012577 0.0011258
20000 513.67 494.30 10.38 10.80 0.0009637 0.0010188
GFLOPS Comparisons
4B 64b
N 64b/32b 4B/3B+
4000 0.95 2.64
8000 0.96 2.66
16000 0.98
20000 0.96
Example Logged Results
Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 20000 128 2 2 516.71 1.032e+01
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0008697 ...... PASSED
================================================================================
First Run
WR11C2R4 20000 128 2 2 656.89 8.120e+00
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0009470 ...... PASSED
================================================================================
DriveSpeed Benchmark - DriveSpeed64v2, DriveSpeed64v2g9, DriveSpeed
This benchmark has the format shown below, measuring writing and reading speeds of large files, cached files, random
access and numerous small files. Run time parameters are available to specify large file size and the file path.
In order to test a USB drive, it must be mounted - plug in, right click Mount Volume or double click to open. Run df
command to find the path, needed for use as a run time parameter. Following is an example log file and the command
used to run the program to test a USB 3 stick. With no MB parameter, default large file sizes are 8 and 16 MB.
############################## Pi 4B USB 3 ###############################
Run command ./DriveSpeed64v2g9 MB 512 FilePath /run/media/demouser/PATRIOT
##########################################################################
DriveSpeed RasPi 64 Bit 2.0 Fri Sep 13 22:25:40 2019
Selected File Path:
/run/media/demouser/PATRIOT/
Total MB 120832, Free MB 119778, Used MB 1054
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
512 30.72 31.11 34.01 287.24 295.04 311.90
1024 34.66 36.11 35.45 298.87 302.38 300.26
Cached
8 42.03 39.58 38.85 1167.71 1029.35 1061.56
Random Read Write
From MB 4 8 16 4 8 16
msecs 0.004 0.007 0.310 9.65 10.42 9.71
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 0.03 0.07 0.13 268.10 427.95 657.48
ms/file 122.73 122.28 122.22 0.02 0.02 0.02 2.557
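In the small files section of the log above, the MB/sec and ms/file rows appear to be two views of the same measurement. The Python sketch below reproduces the write figures, assuming 1 KB = 1024 bytes and 1 MB = 1,000,000 bytes. Those unit conventions are inferred from the numbers, not confirmed from the program source.

```python
def small_file_mb_per_sec(file_kb, ms_per_file):
    """MB/second implied by the time taken to handle one small file."""
    bytes_per_file = file_kb * 1024
    seconds_per_file = ms_per_file / 1000.0
    return bytes_per_file / 1e6 / seconds_per_file

# File sizes and write times (ms/file) from the USB 3 log above
for file_kb, ms in ((4, 122.73), (8, 122.28), (16, 122.22)):
    print(f"{file_kb:>2} KB files: {small_file_mb_per_sec(file_kb, ms):.2f} MB/sec")
```

The read times are logged as 0.02 ms, which is too coarse to reproduce the read MB/sec values in the same way.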
For non-cached tests, in the standard version of this benchmark, the file opening handle includes the O_DIRECT option,
specifying Direct I/O (no caching). The latest minor variety of this appears to work, as expected, on the 32 bit Raspbian
version, on both main and USB drives. The 64 bit compilation of this indicated a failure to write to the main SD drive and
a failure to read from USB flash drives. Omitting O_DIRECT, for reading, appeared to correct the latter (see above). To
check this and enable main drive measurements, separate direct I/O free large file write and read only programs were
produced, to follow write/reboot/read procedures. These were also necessary to indicate throughput simultaneously
writing or reading two USB 3 drives.
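A minimal Python sketch of a large file write and read measurement without direct I/O is shown below. It is not the DriveSpeed code. The function name, block size and file name are hypothetical, and a read immediately after a write is likely to be served from the RAM cache, which is why a reboot between writing and reading is needed for true drive reading speeds.

```python
import os
import tempfile
import time

def write_then_read(path, size_mb=8, block_kb=64):
    """Time a buffered write (flushed with fsync) and a read of one file.

    Returns (write MB/second, read MB/second). O_DIRECT is not used, so
    the read probably comes from the page cache, not from the drive.
    """
    block = bytes(block_kb * 1024)
    blocks = size_mb * 1024 // block_kb
    total_mb = blocks * len(block) / 1e6

    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())   # push data to the drive before stopping the clock
    write_secs = max(time.perf_counter() - start, 1e-9)

    start = time.perf_counter()
    bytes_read = 0
    with open(path, "rb") as f:
        while chunk := f.read(len(block)):
            bytes_read += len(chunk)
    read_secs = max(time.perf_counter() - start, 1e-9)

    if bytes_read != blocks * len(block):
        raise IOError("file size read back does not match size written")
    return total_mb / write_secs, total_mb / read_secs

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        w, r = write_then_read(os.path.join(folder, "test.bin"))
        print(f"write {w:.1f} MB/s, read {r:.1f} MB/s (read probably cached)")
```

To test a particular drive, the path would point at a file on that drive instead of a temporary directory.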
Following are 64 bit Pi 4B SD main drive results from the separate write and read tests, followed by full results from Pi 4B
with 32 bit Raspbian, using a same brand SD card. Note the similarity in writing and reading speeds of large files.
################# Main SD Drive From Write/Read Tests Below ################
Write1 Write2 Write3 Read1 Read2 Read3
Write 18.99 19.34 19.47 1337.09 1164.91 1325.96 - cached
Read N/A N/A N/A 45.80 45.88 45.89 - not cached
============================== 32 Bit Results ==============================
DriveSpeed RasPi 1.1 Mon Apr 29 10:20:57 2019
Current Directory Path: /home/pi/Raspberry_Pi_Benchmarks/DriveSpeed/drive1
Total MB 14845, Free MB 8198, Used MB 6646
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
8 16.41 11.21 12.27 39.81 40.10 40.39
16 11.79 21.10 34.05 40.18 40.19 40.33
Cached
8 137.47 156.43 285.59 580.73 598.66 587.97
Random Read Write
From MB 4 8 16 4 8 16
msecs 0.371 0.371 0.363 1.28 1.53 1.30
200 File Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 3.49 6.41 8.26 7.67 11.68 17.51
ms/file 1.17 1.28 1.98 0.53 0.70 0.94 0.014
USB 3 and 2 Flash Drive Benchmarks
Two FAT 32 formatted USB 3 sticks were used, P at 128 GB, with 32 KB sectors, reading speed rated as up to 400
MB/second, and R 8.8 GB partition, with 8 KB sectors, reading speed rated as up to 190 MB/second (but appears to do
better sometimes). The benchmark was run using USB 2 connections, on a Pi 3B+ and a Pi 4B, and via USB 3 slots on
the Pi 4B.
Following is a summary of results, indicating USB 3 large file reading speed improvements of between 6.7 and 8.1 times, but
disappointing writing performance. The slower P speeds might be affected by the mysteries of updating file
allocation tables, which also influence random access and dealing with lots of small files, including file delete times. USB 3 use
provided little or no performance gain for the latter. Cached reading reflects RAM speed, the only area showing a clear
difference in performance between the Pi 3B+ and Pi 4B.
MB/second 16 MB USB 2, 1024 MB USB 3
System Drive Write1 Write2 Write3 Read1 Read2 Read3
Pi 3B+ USB 2 P 11.5 11.4 11.5 36.6 37.7 37.3
Pi 3B+ USB 2 R 15.9 16.4 13.9 37.1 40.1 39.8
Pi 4B USB 2 P 12.6 12.6 12.6 37.0 37.3 37.2
Pi 4B USB 2 R 22.6 22.9 22.9 36.5 36.3 36.5
Pi 4B USB 3 P 34.7 36.1 35.5 298.9 302.4 300.3
Pi 4B USB 3 R 48.9 44.6 53.4 249.4 248.8 246.2
Compare MB/second
Pi 4B P USB 3/2 2.75 2.88 2.81 8.07 8.11 8.07
Pi 4B R USB 3/2 2.17 1.94 2.33 6.83 6.85 6.74
Cached MB/second Write1 Write2 Write3 Read1 Read2 Read3
Pi 3B+ USB 2 P 13.6 14.2 14.4 633.4 544.0 464.3
Pi 3B+ USB 2 R 13.7 14.4 19.4 623.5 661.4 557.6
Pi 4B USB 2 P 15.0 14.7 14.8 1204.0 1047.3 1066.3
Pi 4B USB 2 R 20.8 21.2 13.9 930.2 933.6 1230.3
Pi 4B USB 3 P 42.0 39.6 38.9 1167.7 1029.4 1061.6
Pi 4B USB 3 R 21.1 15.9 36.2 1103.6 944.9 981.0
Compare
Pi 4B P USB 3/2 2.80 2.70 2.63 0.97 0.98 1.00
Pi 4B R USB 3/2 1.01 0.75 2.60 1.19 1.01 0.80
Random milliseconds
Read Write
Pi 3B+ USB 2 P 0.013 0.013 0.254 11.76 10.18 9.80
Pi 3B+ USB 2 R 0.017 0.008 0.032 1.09 1.39 11.72
Pi 4B USB 2 P 0.006 0.007 0.215 9.56 8.54 8.75
Pi 4B USB 2 R 0.009 0.005 0.016 1.35 2.12 1.34
Pi 4B USB 3 P 0.004 0.007 0.310 9.65 10.42 9.71
Pi 4B USB 3 R 0.004 0.004 0.008 1.75 0.85 0.92
Compare
Pi 4B P USB 3/2 1.50 1.00 0.69 0.99 0.82 0.90
Pi 4B R USB 3/2 2.25 1.25 2.00 0.77 2.49 1.46
200 Small Files milliseconds
Write Read Delete
Pi 3B+ USB 2 P 134.2 128.6 129.6 0.08 0.12 0.07 3.36
Pi 3B+ USB 2 R 105.5 104.7 107.6 0.05 0.05 0.07 0.26
Pi 4B USB 2 P 125.8 125.5 125.8 0.02 0.02 0.02 3.12
Pi 4B USB 2 R 104.1 104.0 104.0 0.02 0.02 0.03 0.14
Pi 4B USB 3 P 122.7 122.3 122.2 0.02 0.02 0.02 2.56
Pi 4B USB 3 R 105.4 104.0 104.3 0.02 0.02 0.03 0.15
Compare
Pi 4B P USB 3/2 1.03 1.03 1.03 1.00 1.00 1.00 1.22
Pi 4B R USB 3/2 0.99 1.00 1.00 1.00 1.00 1.00 0.95
Drive Write/Reboot/Read Tests - DriveSpeed264WR, DriveSpeed264Rd
As a reminder, different programs were produced to enable separate measurements of writing and reading, because of
the inability to avoid written data being cached on a main drive, invalidating drive reading speed measurements. These
were also required to measure overall throughput, when using two USB drives. The write test also reads the data for
verification, but this will normally be cached in RAM, with high data transfer speeds. VMSTAT results are provided,
covering reading speeds.
Main SD Drive - This is rated at up to 98 MB/second reading speed but only achieves near 46 MB/second. VMSTAT
results confirm data transfer speed and three files eventually occupying around 3 GB of the cache, with the low 2% (x4)
CPU utilisation and 23% (x4) waiting for I/O.
Run Commands ./DriveSpeed264WR MB 1024 and ./DriveSpeed264Rd MB 1024
Current Directory Path: /home/demouser/RPi3-64-Bit-Benchmarks/IOtests/writeread
Total MB 28225, Free MB 18761, Used MB 9464
1024 MB MBytes/Second
Write1 Write2 Write3 Read1 Read2 Read3
Write 18.99 19.34 19.47 1337.09 1164.91 1325.96
Read N/A N/A N/A 45.80 45.88 45.89
vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 1 0 673848 60668 2792716 0 0 45056 0 767 1181 0 2 75 23 0
0 1 0 630228 60668 2835544 0 0 44544 0 789 1199 0 2 74 23 0
0 1 0 585204 60668 2880268 0 0 45056 0 691 1041 0 3 75 23 0
USB 3 Drive P - Read only speed was similar to that from the earlier detailed test. Note high CPU utilisation average of
17%, equivalent to 68% of one core.
Run Commands ./DriveSpeed264WR MB 1024 FilePath /run/media/demouser/PATRIOT
and ./DriveSpeed264Rd MB 1024 FilePath /run/media/demouser/PATRIOT
Selected File Path:
/run/media/demouser/PATRIOT/
Total MB 120832, Free MB 119752, Used MB 1080
1024 MB MBytes/Second
Write1 Write2 Write3 Read1 Read2 Read3
Write 58.45 23.10 22.91 1368.04 1190.71 1354.84
Read N/A N/A N/A 306.18 294.93 302.91
vmstat
procs -----------memory--------- ---swap-- -----io---- -system-- ------cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 256 811672 20920 2696504 0 0 305664 0 3898 6182 1 15 73 11 0
0 1 256 510852 20920 2996188 0 0 303616 0 4304 5936 1 16 72 12 0
1 0 256 239400 20920 3267636 0 0 307184 0 4512 6177 1 17 71 11 0
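The CPU utilisation and data transfer figures quoted for Drive P can be extracted from the vmstat lines above. The Python sketch below averages user plus system time and scales it by the four cores. It also converts the bi column to MB/second, assuming vmstat's default 1024 byte blocks; dividing bi by 1000 instead gives a similar rough figure. The parsing helper is illustrative and only handles the 17 column layout shown.

```python
VMSTAT_COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa st".split()

def parse_vmstat(line):
    """Map one vmstat data line to a dictionary keyed by column name."""
    return dict(zip(VMSTAT_COLUMNS, map(int, line.split())))

samples = [parse_vmstat(line) for line in (
    "1 0 256 811672 20920 2696504 0 0 305664 0 3898 6182 1 15 73 11 0",
    "0 1 256 510852 20920 2996188 0 0 303616 0 4304 5936 1 16 72 12 0",
    "1 0 256 239400 20920 3267636 0 0 307184 0 4512 6177 1 17 71 11 0",
)]

CORES = 4
busy = sum(s["us"] + s["sy"] for s in samples) / len(samples)
blocks_in = sum(s["bi"] for s in samples) / len(samples)

print(f"CPU busy : {busy:.0f}% of four cores, {busy * CORES:.0f}% of one core")
print(f"Read rate: {blocks_in * 1024 / 1e6:.0f} MB/second (1024 byte blocks assumed)")
```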
USB 3 Drive R - This time data transfer speed was slower than the earlier example.
Selected File Path:
/run/media/demouser/REMIX_OS/
Total MB 9017, Free MB 7485, Used MB 1532
1024 MB MBytes/Second
Write1 Write2 Write3 Read1 Read2 Read3
Write 46.43 28.81 36.57 1265.07 1103.23 1236.02
Read N/A N/A N/A 172.71 172.14 176.49
vmstat
procs -----------memory--------- ---swap-- -----io---- -system-- ------cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 1 256 111512 912 3417624 0 0 175189 0 4315 5929 1 12 71 17 0
0 1 256 169756 992 3358840 0 0 169043 0 4064 5515 1 11 71 17 0
0 1 256 177444 1068 3351176 0 0 155724 0 4088 6023 1 12 70 16 0
USB 3 Drives R and P Together
File sizes were reduced to 512 MB for these tests, in order to ensure that there would be sufficient RAM to contain six
copies, as indicated in VMSTAT cache occupancy. This makes it more tricky to measure total throughput, but the
following appears to provide a best case example, with a maximum of up to 386 MB/second, with CPU utilisation near
100% of one core. Different log files are needed for reading, to avoid confusion.
Later is a bad example, where one drive appears to be running at USB 2 speed.
Run Commands ./DriveSpeed264WR MB 512 FilePath /run/media/demouser/PATRIOT
and ./DriveSpeed264WR MB 512 FilePath /run/media/demouser/REMIX_OS
and ./DriveSpeed264Rd MB 512 FilePath /run/media/demouser/PATRIOT Log 1
and ./DriveSpeed264Rd MB 512 FilePath /run/media/demouser/REMIX_OS Log 2
Write/Read Thu Sep 19 16:07:48 2019 /run/media/demouser/REMIX_OS/
Write/Read Thu Sep 19 16:07:46 2019 /run/media/demouser/PATRIOT/
512 MB MBytes/Second
Write1 Write2 Write3 Read1 Read2 Read3
R 28.72 33.89 44.69 1302.19 1131.65 1374.24
P 11.93 8.86 6.21 1232.47 1072.38 1213.36
Sep 23 17:11:21 2019 /run/media/demouser/PATRIOT/
Sep 23 17:11:20 2019 /run/media/demouser/REMIX_OS/
512 MB MBytes/Second
Write1 Write2 Write3 Read1 Read2 Read3 Seconds
P N/A N/A N/A 159.78 187.44 294.23 7.7
R N/A N/A N/A 221.83 232.10 230.94 6.7+2 delayed start
vmstat
procs -----------memory--------- ---swap-- -----io---- -system-- ------cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 3160720 74616 296092 0 0 0 0 2031 3601 4 2 94 0 0
0 1 0 3112052 74616 342188 0 0 45552 0 1512 2257 1 3 93 4 0
0 1 0 2908004 74616 547600 0 0 206336 0 4684 7169 4 14 67 15 0
2 0 0 2531960 74616 919400 0 0 369136 0 5495 8033 4 24 47 25 0
2 0 0 2149064 74616 1303288 0 0 382960 0 5168 7007 1 21 52 26 0
1 1 0 1771492 74616 1681348 0 0 385024 0 5969 8255 1 23 49 26 0
1 1 0 1383524 74616 2068788 0 0 386016 0 5621 7926 1 21 49 29 0
0 2 0 999100 74616 2453280 0 0 383488 0 4602 6895 1 19 54 26 0
0 1 0 628988 74616 2824188 0 0 368640 0 5405 8153 2 20 56 22 0
1 0 0 310748 74624 3142732 0 0 317424 20 4622 6551 1 17 72 10 0
1 0 0 223052 73680 3231812 0 0 268288 0 2815 5012 1 18 72 10 0
0 0 0 223824 73680 3231280 0 0 32768 0 1044 2009 1 3 95 1 0
0 0 0 223824 73680 3231280 0 0 0 0 393 619 0 0 99 0 0
===============================================================================
Bad Example
Write1 Write2 Write3 Read1 Read2 Read3
P N/A N/A N/A 36.37 37.72 37.48
R N/A N/A N/A 248.18 248.22 223.53
LAN and WiFi Benchmarks - LanSpeed64, LanSpeed64g9, LanSpdx86Win.exe, LanSpeed
The Raspberry Pi LanSpeed64 version uses the same programming code as for the DriveSpeed benchmark, except
O_DIRECT is not used on creating files. The measurements were made between the Pi 4B and a Windows 7 based PC,
where the data transfer speed was confirmed via Task Manager Network information and sysstat sar -n DEV on the
Raspberry Pi 4. SAMBA was also installed to connect a remote PC and enable an Intel Windows version,
LanSpdx86Win.exe, to be run.
An example of a LanSpeed64 log file is provided below, preceded by examples of the required mount and run commands.
For further details of required procedures see This PDF file, LAN/WiFi section. The 64 bit results are followed by details
from running the benchmark on a 32 bit system, and showing the same levels of performance, within the usual variability.
Commands
sudo mount -t cifs -o dir_mode=0777,file_mode=0777 //192.168.1.68/d /media/public
./LanSpeed64 FilePath /media/public/test
Log File
LanSpeed RasPi 64 Bit 1.0 Thu Sep 12 22:06:06 2019
Selected File Path:
/media/public/test/
Total MB 266240, Free MB 70991, Used MB 195249
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
8 66.13 92.09 92.76 96.36 96.85 97.30
16 80.79 93.59 94.61 103.99 104.34 104.57
Random Read Write
From MB 4 8 16 4 8 16
msecs 0.004 0.009 0.435 0.95 0.92 0.93
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 1.37 2.45 4.77 1.37 2.49 4.92
ms/file 2.99 3.35 3.43 2.98 3.29 3.33 0.467
************************ 32 Bit Pi 4B ************************
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
8 67.82 12.97 90.19 99.84 93.49 96.83
16 92.25 92.66 92.96 103.9 105.28 91.17
Random Read Write
From MB 4 8 16 4 8 16
msecs 0.007 0.01 0.04 1.01 0.85 0.91
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 1.47 2.8 5.14 2.47 4.71 8.61
ms/file 2.78 2.92 3.19 1.66 1.74 1.90 0.256
LAN and WiFi Benchmark Results
Below are results from programs run on the Pi 3B+ and 4B, plus others from running on a PC. Dealing with large files, PC
to Pi 4B and Pi 4B to PC LAN speeds demonstrated some gigabit performance examples (over 100 MB/second), around
three times faster than on the Pi 3B+. My BT Hub has dual 2.4 GHz (WiFi1) and 5 GHz (WiFi2) capabilities, leading to the
following erratic performance, where (I think) greater than 10 MB/second is indicative of 5 GHz and around 4 MB/second
for 2.4 GHz, the former usually only on writing. In this case, the hub was inches away from the Pi.
I changed the hub settings to provide separate 2.4 and 5 GHz hub address selections, with 72 and 180 Mbits/second
being indicated, respectively. These sorts of numbers were confirmed on my Smartphone, but were variable. The 64 bit version
would not connect to the network at 5 GHz, unlike the 32 bit program, which obtained, for example, 15 MB/second writing and
8 MB/second reading. These differences could be, I suppose, due to program, software and/or hub incompatibility.
Random access times appeared to be quite similar on all WiFi tests, with faster but variable comparative times via LAN.
There were similar relationships on dealing with numerous small files.
Some results from running the 32 bit benchmark on a Pi 4B are provided. Performance there was also erratic, so these
speeds represent best case measurements, with large files read somewhat faster than at 64 bits.
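As a rough cross-check of the WiFi figures, the link rates indicated by the hub can be converted from Mbits/second to MB/second. The following is only an illustrative sketch: it assumes 8 bits per byte, ignores all protocol overheads, and the function and variable names are mine, not part of the benchmark.

```python
def mbits_to_mbytes(mbits_per_second):
    """Convert a nominal link rate in Mbits/second to MB/second (8 bits per byte)."""
    return mbits_per_second / 8.0


# Link rates indicated by the hub for the separate 2.4 GHz and 5 GHz selections
link_24ghz = mbits_to_mbytes(72)
link_5ghz = mbits_to_mbytes(180)

# 32 bit program speeds quoted in the text for the 5 GHz connection (MB/second)
write_share = 100.0 * 15 / link_5ghz
read_share = 100.0 * 8 / link_5ghz

print(f"2.4 GHz nominal {link_24ghz:.1f} MB/s, 5 GHz nominal {link_5ghz:.1f} MB/s")
print(f"5 GHz writing at {write_share:.0f}% and reading at {read_share:.0f}% of nominal")
```

On this basis, speeds above 10 MB/second could only be obtained via the 5 GHz connection, which fits the interpretation given above.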
Large Files MB/second
System MB Write1 Write2 Write3 Read1 Read2 Read3
PC WiFi 16 4.08 4.16 4.11 2.34 1.68 1.30
PC LAN 16 106.11 106.11 105.89 50.67 33.86 25.47
LAN 3B+ 16 28.63 29.03 28.96 22.18 32.28 32.61
3B+ WiFi 16 11.15 11.00 10.76 4.01 3.89 3.09
4B WiFi1 16 6.43 6.39 6.47 4.33 4.13 4.86
4B WiFi2 16 13.26 13.34 13.25 3.69 4.22 4.00
4B LAN 16 80.79 93.59 94.61 103.99 104.34 104.57
4B LAN 128 96.58 96.67 95.74 106.41 107.24 107.82
32 Bit
4B WiFi1 16 6.70 6.82 6.76 7.19 6.53 7.22
4B WiFi2 16 11.50 13.93 14.13 9.91 8.88 9.92
Random milliseconds
System Read Write
PC WiFi 1.711 1.972 2.015 2.26 2.28 2.25
PC LAN 0.606 0.590 0.532 0.47 0.48 0.47
LAN 3B+ 0.030 0.816 0.484 1.19 1.16 1.16
3B+ WiFi 3.052 3.167 3.475 3.60 3.39 3.45
4B WiFi1 3.286 3.549 3.627 4.02 3.45 3.72
4B WiFi2 2.786 2.822 2.944 3.20 2.94 2.92
4B LAN 0.004 0.009 0.435 0.95 0.92 0.93
32 Bit
4B WiFi1 2.691 2.875 3.048 3.13 2.93 2.84
4B WiFi2 Similar
200 Small Files milliseconds per file
System Write Read Delete
PC WiFi 10.09 12.42 13.81 5.50 6.11 8.06 1.507
PC LAN 4.05 4.59 4.53 2.38 2.23 2.64 0.661
LAN 3B+ 3.72 4.36 4.45 3.33 3.40 3.60 0.378
3B+ WiFi 12.61 13.53 14.97 13.17 14.06 15.88 2.534
4B WiFi1 15.08 16.53 22.83 12.96 14.23 17.29 2.509
4B WiFi2 11.38 12.85 12.82 10.64 11.83 14.15 2.083
4B LAN 2.99 3.35 3.43 2.98 3.29 3.33 0.467
32 Bit
4B WiFi1 12.14 18.59 15.70 11.10 22.20 12.99 2.153
4B WiFi2 30.85 17.83 18.10 16.62 14.93 16.01 3.361
Java Whetstone Benchmark - whetstc.class
The benchmark measures performance of various floating point and integer calculations, with an overall rating in Million
Whetstone Instructions Per Second (MWIPS). Results are also provided for a 32 bit version run on a Pi 4B, showing
variations in performance, using a different version of Java.
############################# Pi 3B+ #############################
Whetstone Benchmark Java Version, Sep 20 2019, 11:06:12
1 Pass
Test Result MFLOPS MOPS millisecs
N1 floating point -1.124750137 310.88 0.0618
N2 floating point -1.131330490 289.41 0.4644
N3 if then else 1.000000000 241.15 0.4292
N4 fixed point 12.000000000 706.28 0.4460
N5 sin,cos etc. 0.499110132 23.31 3.5700
N6 floating point 0.999999821 130.04 4.1480
N7 assignments 3.000000000 89.19 2.0720
N8 exp,sqrt etc. 0.825148463 21.92 1.6970
MWIPS 775.89 12.8884
Operating System Linux, Arch. aarch64, Version 4.19.67
Java Vendor IcedTea, Version 1.8.0_222
############################# Pi 4B ##############################
Whetstone Benchmark Java Version, Sep 12 2019, 20:15:35
1 Pass
Test Result MFLOPS MOPS millisecs Gains
N1 floating point -1.124750137 488.80 0.0393 1.57
N2 floating point -1.131330490 475.92 0.2824 1.64
N3 if then else 1.000000000 344.31 0.3006 1.43
N4 fixed point 12.000000000 1571.86 0.2004 2.23
N5 sin,cos etc. 0.499110132 43.55 1.9104 1.87
N6 floating point 0.999999821 264.15 2.0420 2.03
N7 assignments 3.000000000 264.00 0.7000 2.96
N8 exp,sqrt etc. 0.825148463 25.80 1.4420 1.18
MWIPS 1445.70 6.9171 1.86
Operating System Linux, Arch. aarch64, Version 4.19.67
Java Vendor IcedTea, Version 1.8.0_222
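The Gains column above appears to be the Pi 4B rate divided by the Pi 3B+ rate for each test. The short sketch below reproduces it from the MFLOPS, MOPS and MWIPS values in the two 64 bit logs. The dictionary layout and function name are my own, for illustration only.

```python
# Rates (MFLOPS, MOPS or MWIPS) copied from the Pi 3B+ and Pi 4B 64 bit logs
pi3b_plus = {"N1": 310.88, "N2": 289.41, "N3": 241.15, "N4": 706.28,
             "N5": 23.31, "N6": 130.04, "N7": 89.19, "N8": 21.92,
             "MWIPS": 775.89}
pi4b = {"N1": 488.80, "N2": 475.92, "N3": 344.31, "N4": 1571.86,
        "N5": 43.55, "N6": 264.15, "N7": 264.00, "N8": 25.80,
        "MWIPS": 1445.70}


def gains(new, old):
    """Return the new/old speed ratio for every test, rounded to 2 decimals."""
    return {test: round(new[test] / old[test], 2) for test in old}


whetstone_gains = gains(pi4b, pi3b_plus)
for test, gain in whetstone_gains.items():
    print(f"{test:6s} {gain:.2f}")
```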
######################### Pi 4B 32 Bit ###########################
Whetstone Benchmark OpenJDK11 Java Version, May 15 2019, 18:48:20
1 Pass
Test Result MFLOPS MOPS millisecs
N1 floating point -1.124750137 524.02 0.0366
N2 floating point -1.131330490 494.12 0.2720
N3 if then else 1.000000000 289.92 0.3570
N4 fixed point 12.000000000 1092.99 0.2882
N5 sin,cos etc. 0.499110132 59.86 1.3900
N6 floating point 0.999999821 345.95 1.5592
N7 assignments 3.000000000 331.54 0.5574
N8 exp,sqrt etc. 0.825148463 25.41 1.4640
MWIPS 1687.92 5.9244
Operating System Linux, Arch. arm, Version 4.19.37-v7l+
Java Vendor BellSoft, Version 11.0.2-BellSoft
JavaDraw Benchmark - JavaDrawPi.class
The benchmark draws simple objects, in numbers ranging from small to rather excessive, to measure drawing performance
in Frames Per Second (FPS). Five tests draw on a background of continuously changing colour shades, each test adding to the load.
Pi 4B performance gains, shown below, ranged between 2.10 and 3.42 times.
At the end are 32 bit results from a Pi 4B test, using alternative Java software, giving FPS around 15 to 23% lower.
############################# Pi 3B+ #############################
Java Drawing Benchmark, Sep 20 2019, 11:08:33
Produced by javac 1.7.0_02
Test Frames FPS
Display PNG Bitmap Twice Pass 1 335 33.46
Display PNG Bitmap Twice Pass 2 546 54.53
Plus 2 SweepGradient Circles 502 50.08
Plus 200 Random Small Circles 366 36.59
Plus 320 Long Lines 134 13.30
Plus 4000 Random Small Circles 46 4.59
Total Elapsed Time 60.2 seconds
Operating System Linux, Arch. aarch64, Version 4.19.67
Java Vendor IcedTea, Version 1.8.0_222
############################# Pi 4B ##############################
Java Drawing Benchmark, Sep 12 2019, 20:18:28
Produced by javac 1.7.0_02
Test Frames FPS Gains
Display PNG Bitmap Twice Pass 1 1146 114.52 3.42
Display PNG Bitmap Twice Pass 2 1318 131.79 2.42
Plus 2 SweepGradient Circles 1237 123.66 2.47
Plus 200 Random Small Circles 972 97.13 2.65
Plus 320 Long Lines 415 41.48 3.12
Plus 4000 Random Small Circles 97 9.65 2.10
Total Elapsed Time 60.1 seconds
Operating System Linux, Arch. aarch64, Version 4.19.67
Java Vendor IcedTea, Version 1.8.0_222
######################### Pi 4B 32 Bit ###########################
Java Drawing Benchmark, May 15 2019, 18:55:41
Produced by OpenJDK 11 javac
Test Frames FPS
Display PNG Bitmap Twice Pass 1 877 87.65
Display PNG Bitmap Twice Pass 2 1042 104.18
Plus 2 SweepGradient Circles 1015 101.47
Plus 200 Random Small Circles 779 77.85
Plus 320 Long Lines 336 33.52
Plus 4000 Random Small Circles 83 8.25
Total Elapsed Time 60.1 seconds
Operating System Linux, Arch. arm, Version 4.19.37-v7l+
Java Vendor BellSoft, Version 11.0.2-BellSoft
OpenGL GLUT Benchmark - videogl64, videogl64g9, videogl32
In 2012, I approved a request from a Quality Engineer at Canonical, to use this OpenGL benchmark in the testing
framework of the Unity desktop software. The program can be run as a benchmark, or selected functions, as a stress
test of any duration.
The benchmark measures graphics speed in terms of Frames Per Second (FPS) via six simple and more complex tests.
The first four tests portray moving up and down a tunnel including various independently moving objects, with and
without texturing. The last two tests represent a real application for designing kitchens. The first is in wireframe format,
drawn with 23,000 straight lines. The second has colours and textures applied to the surfaces.
Pi 4B average performance gains are included below, the best being around 2.1 times, with textured objects, and the worst
around 1.5 times, with the slow kitchen displays.
Dual Monitors - The benchmark was also run with two 1920x1080 monitors connected. It displayed two identical
displays when the mirror option was selected. Without this, the normal display, from where the program is executed,
appeared on one display, and the OpenGL images on the other. This was fine when the usual display dimensions, as
shown below, were specified. With no parameters, the full screen image was assumed to be 3840x1080 and this was
displayed horizontally squashed into 1920 pixels. FPS measurements for the latter are shown below. On running the 32
bit version via Raspbian, the default display was 3840x1080, across both monitors, but on only one monitor when
1920x1080 parameters or less were specified. There was no mirror option. See performance below.
In order to demonstrate maximum speeds, VSYNC (vblank) has to be switched off. The command is shown in the
following script that is used to run a series of tests.
export vblank_mode=0
./videogl64g9 Width 160, Height 120, NoEnd
./videogl64g9 Width 320, Height 240, NoHeading, NoEnd
./videogl64g9 Width 640, Height 480, NoHeading, NoEnd
./videogl64g9 Width 1024, Height 768, NoHeading, NoEnd
./videogl64g9 NoHeading
The benchmark can also be run as a stress test, using run time parameters for running time and test to run, besides
window size, as shown above.
32 bit Pi 4B results are also provided, in this case, a bit slower than the 64 bit speeds.
############################# Pi 3B+ #############################
GLUT OpenGL Benchmark 64 Bit Version 1, Fri Sep 20 11:15:47 2019
Running Time Approximately 5 Seconds Each Test
Window Size Coloured Objects Textured Objects WireFrm Texture
Pixels Few All Few All Kitchen Kitchen
Wide High FPS FPS FPS FPS FPS FPS
160 120 389.6 227.2 122.6 75.3 30.0 21.5
320 240 328.1 201.7 113.8 73.3 30.2 21.3
640 480 203.3 144.7 87.8 62.0 30.2 21.0
1024 768 107.1 94.5 60.3 51.1 28.9 20.0
1920 1080 45.3 47.5 36.9 33.1 28.7 20.0
############################## Pi 4B #############################
160 120 767.4 420.3 258.3 154.3 45.7 31.7
320 240 682.9 388.8 245.0 148.3 45.1 30.8
640 480 367.1 262.6 217.9 140.1 46.2 30.9
1024 768 150.8 148.8 128.6 117.3 45.3 30.4
1920 1080 71.9 73.9 64.0 61.6 43.3 27.9
Pi 4B Gains 1.77 1.74 2.12 2.10 1.52 1.46
Dual Monitor- mirrored displays
1920 1080 65.0 66.3 61.6 58.2 42.7 27.5
Dual Monitor - not mirrored squashed image on one monitor
3840 1080 60.9 59.6 57.2 54.8 40.8 26.8
Dual Monitor 32 bit two monitors
3840 1080 26.9 26.6 26.1 25.1 25.5 15.9
************************ Pi 4B 32 Bit ************************
GLUT OpenGL Benchmark 32 Bit Version 1, Fri Oct 11 19:12:24 2019
Window Size Coloured Objects Textured Objects WireFrm Texture
Pixels Few All Few All Kitchen Kitchen
320 240 663.3 365.9 218.6 126.3 33.1 23.5
640 480 318.7 259.7 192.4 116.8 32.2 22.1
1024 768 138.9 134.1 112.7 102.7 31.9 21.4
1920 1080 57.5 56.1 53.3 50.0 29.3 19.5
Avg 64b/32b 1.13 1.13 1.15 1.19 1.42 1.39
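The Pi 4B Gains line in the 64 bit results is consistent with taking the Pi 4B/Pi 3B+ FPS ratio at each of the five window sizes and averaging the five ratios for each test. The sketch below, with names of my own choosing, shows that calculation using the FPS values from the two tables. It is a check on the arithmetic, not the program used to produce the report.

```python
# FPS per window size (160x120 up to 1920x1080), six tests per row
pi3b_plus_fps = [
    [389.6, 227.2, 122.6, 75.3, 30.0, 21.5],
    [328.1, 201.7, 113.8, 73.3, 30.2, 21.3],
    [203.3, 144.7, 87.8, 62.0, 30.2, 21.0],
    [107.1, 94.5, 60.3, 51.1, 28.9, 20.0],
    [45.3, 47.5, 36.9, 33.1, 28.7, 20.0],
]
pi4b_fps = [
    [767.4, 420.3, 258.3, 154.3, 45.7, 31.7],
    [682.9, 388.8, 245.0, 148.3, 45.1, 30.8],
    [367.1, 262.6, 217.9, 140.1, 46.2, 30.9],
    [150.8, 148.8, 128.6, 117.3, 45.3, 30.4],
    [71.9, 73.9, 64.0, 61.6, 43.3, 27.9],
]
published_gains = [1.77, 1.74, 2.12, 2.10, 1.52, 1.46]


def average_gains(new_rows, old_rows):
    """Average, over all window sizes, the new/old FPS ratio of each test."""
    tests = len(new_rows[0])
    sizes = len(new_rows)
    return [
        sum(new_rows[r][t] / old_rows[r][t] for r in range(sizes)) / sizes
        for t in range(tests)
    ]


opengl_gains = average_gains(pi4b_fps, pi3b_plus_fps)
for calculated, published in zip(opengl_gains, published_gains):
    print(f"calculated {calculated:.2f}  published {published:.2f}")
```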
Stress Tests
The first stress tests used cover the central processor, for which an extra program was produced to measure the
environment whilst running. Variable parameters are:
Passes and sampling seconds to determine running time. If the stress test also has sampling periods, it is normally
not possible to synchronise them but approximate periods can be matched.
CPU MHz - This can vary faster than any sampling time based on seconds, but the general trend can be useful. Tests
that measure speed over sampling periods provide a better indication.
Core Voltage - This appears to vary a little, reason unknown.
CPU Temperature - Assuming that it is correct, as it changes slowly, this is the most useful measurement.
PMIC temperature - No issue so far with Power Management Integrated Circuit temperatures.
###################################################
Parameters - upper or lower case
./RPiHeatMHzVolts2 passes 33 secs 20 log 12
or
./RPiHeatMHzVolts2 P 33 S 20 L 12
For 33 samples at 20 second intervals, log file RPiHeatMHz12.txt
To cover 10 minute test
###################################################
Temperature and CPU MHz Measurement
Start at Mon Oct 28 20:49:52 2019
Using 33 samples at 20 second intervals
Seconds
0.0 ARM MHz=1500, core volt=0.8490V, CPU temp=61.0'C, pmic temp=55.2'C
20.0 ARM MHz=1500, core volt=0.8437V, CPU temp=73.0'C, pmic temp=62.8'C
40.3 ARM MHz=1500, core volt=0.8437V, CPU temp=77.0'C, pmic temp=66.5'C
60.5 ARM MHz=1500, core volt=0.8437V, CPU temp=79.0'C, pmic temp=69.4'C
80.7 ARM MHz=1500, core volt=0.8437V, CPU temp=80.0'C, pmic temp=70.3'C
101.0 ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=70.3'C
121.2 ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
141.4 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
161.7 ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
181.9 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C
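For later analysis, log lines in the above format can be parsed and averaged with a few lines of Python. This is a sketch only, assuming the line layout shown in the log extract. The regular expression and function names are mine and are not part of RPiHeatMHzVolts2.

```python
import re

# Matches lines such as:
# 20.0 ARM MHz=1500, core volt=0.8437V, CPU temp=73.0'C, pmic temp=62.8'C
LINE = re.compile(
    r"^\s*(?P<secs>[\d.]+)\s+ARM MHz=\s*(?P<mhz>\d+), core volt=(?P<volts>[\d.]+)V, "
    r"CPU temp=(?P<cpu>[\d.]+)'C, pmic temp=(?P<pmic>[\d.]+)'C"
)


def parse_log(text):
    """Return one dict of float values per measurement line, skipping other lines."""
    samples = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            samples.append({key: float(value) for key, value in match.groupdict().items()})
    return samples


def average(samples, key):
    """Arithmetic mean of one measurement over all samples."""
    return sum(sample[key] for sample in samples) / len(samples)


# Four lines taken from the log extract above
log_text = """Seconds
0.0 ARM MHz=1500, core volt=0.8490V, CPU temp=61.0'C, pmic temp=55.2'C
121.2 ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
141.4 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
181.9 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C
"""

samples = parse_log(log_text)
print(f"{len(samples)} samples, average {average(samples, 'mhz'):.0f} MHz, "
      f"{average(samples, 'cpu'):.2f} C")
```

Any extra field appended by a stress test, such as MB/sec or MFLOPS, is ignored by this pattern, as only the start of each line is matched.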
Next are results for the High Performance Linpack that runs for a long time, significantly increasing CPU temperatures
and slowing down, without a cooling fan being in place. These results can be compared with those for the 32 bit version,
available in the report Raspberry Pi 4B Stress Tests Including High Performance Linpack.pdf. This shows that the same
sort of performance levels as the 64 bit version are obtained, with and without a cooling fan.
Following HPL results here, are some for my integer and floating point stress tests. Although further comparative tests
are needed to be conclusive, it does seem that the 64 bit floating point versions are faster than the 32 bit varieties and
subject to lower temperature increases.
High Performance Linpack Stress Test
The HPL benchmark results quoted earlier showed speeds of 8.1 GFLOPS on a cold start and 10.8 GFLOPS later, with a
cooling fan in operation for both. The first results below were run without a fan, with a room temperature around 21°C,
producing 7.6 GFLOPS on a cold start. The average CPU frequency then came out at 1056 MHz, with an average
temperature of 80.3°C.
The second results followed a warm reboot to use a different version of Gentoo with HPL installed, obtaining 5.54
GFLOPS, with severe CPU frequency throttling, down to 600 MHz, and temperatures up to 86°C. Averages were 790
MHz and 84.1°C, as shown in the log below.
Shortly afterwards, with the fan in place, the Pi ran at 1500 MHz continuously, achieving 10.4 GFLOPS, with a maximum
temperature of 64°C.
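The GFLOPS figures can be related to the problem size N and the running time reported in the HPL output. The sketch below assumes the usual HPL operation count of 2/3 N^3 + 3/2 N^2 floating point operations. The function name is mine.

```python
def hpl_gflops(n, seconds):
    """GFLOPS for an N x N solve, assuming 2/3 N^3 + 3/2 N^2 operations."""
    operations = (2.0 / 3.0) * n ** 3 + 1.5 * n ** 2
    return operations / seconds / 1e9


# N and Time values from the two no fan HPL runs reported in this section
cold_start = hpl_gflops(20000, 702.81)
warm_reboot = hpl_gflops(20000, 963.16)

print(f"cold start  {cold_start:.3f} GFLOPS")
print(f"warm reboot {warm_reboot:.3f} GFLOPS")
```

With N = 20000, the two times of 702.81 and 963.16 seconds give values that agree with the 7.589 and 5.538 GFLOPS reported by HPL.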
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 20000 128 2 2 702.81 7.589e+00
HPL_pdgesv() start time Sat Aug 24 10:42:58 2019
HPL_pdgesv() end time Sat Aug 24 10:54:41 2019
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0008453 ...... PASSED
================================================================================
Example 2 - Note different sumchecks again
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 20000 128 2 2 963.16 5.538e+00
HPL_pdgesv() start time Tue Oct 29 11:51:10 2019
HPL_pdgesv() end time Tue Oct 29 12:07:13 2019
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0009005 ...... PASSED
================================================================================
Temperature and CPU MHz Measurement
Start at Tue Oct 29 11:50:27 2019
Using 40 samples at 30 second intervals
Seconds
0.0 ARM MHz=1500, core volt=0.8542V, CPU temp=63.0'C, pmic temp=58.0'C
30.0 ARM MHz=1500, core volt=0.8542V, CPU temp=79.0'C, pmic temp=69.4'C
60.3 ARM MHz=1000, core volt=0.8542V, CPU temp=83.0'C, pmic temp=72.2'C
91.6 ARM MHz=1000, core volt=0.8490V, CPU temp=85.0'C, pmic temp=74.1'C
122.2 ARM MHz=1000, core volt=0.8490V, CPU temp=84.0'C, pmic temp=74.1'C
152.7 ARM MHz= 750, core volt=0.8490V, CPU temp=83.0'C, pmic temp=74.1'C
183.2 ARM MHz=1000, core volt=0.8490V, CPU temp=84.0'C, pmic temp=76.0'C
213.8 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
244.3 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
274.7 ARM MHz= 600, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
305.2 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
335.6 ARM MHz=1000, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
366.1 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
396.6 ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
427.2 ARM MHz= 750, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
457.5 ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
488.0 ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
518.6 ARM MHz= 750, core volt=0.8490V, CPU temp=84.0'C, pmic temp=76.9'C
549.0 ARM MHz= 600, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
579.6 ARM MHz= 750, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.0'C
610.1 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
640.6 ARM MHz= 750, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
671.1 ARM MHz= 750, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
701.6 ARM MHz= 600, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.0'C
732.0 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
762.4 ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
792.9 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
823.4 ARM MHz= 750, core volt=0.8490V, CPU temp=84.0'C, pmic temp=76.9'C
853.9 ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
884.4 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
914.9 ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
945.3 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
975.8 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
1006.3 ARM MHz= 750, core volt=0.8490V, CPU temp=84.0'C, pmic temp=76.0'C
1036.7 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
1067.0 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=74.1'C
Averages 790 84.1 75.5
Integer Stress Test - MP-IntStress64, MP-IntStress
As for my other CPU stress tests, the 4 and 8 thread results from running in benchmarking mode are shown. Run time
parameters are also provided, along with the commands used for the particular tests.
In this case, a summary of separate tests for L1 cache, L2 cache and RAM are given. During the 10 minute sessions, the
cache tests were mainly running at 1000 MHz, with those using RAM at the full speed 1500 MHz. No temperatures above
84°C were recorded.
Examining the full detail of the first test indicated that average CPU MHz and measured MB/second were around 75% of
the maximum.
KB KB MB Same All
Secs Thrds 16 160 16 Sumcheck Tests
3.0 4 28715 26652 3345 5A5A5A5A Yes
3.0 8 30292 26310 3334 AAAAAAAA Yes
./RPiHeatMHzVolts2 passes 66 secs 10 log 34 - used for all 10 minute stress tests
==== Stress Test Parameters - upper or lower case, only first letter counts ====
Threads 1, 2, 4, 8, 16, 32 KB between 12 and 15624 Log < 100 Minutes any > 0
./MP-IntStress64 KB 16 Threads 8 Mins 10 Log 34
Seconds MB/sec
0.0 ARM MHz=1500, core volt=0.8455V, CPU temp=62.0'C, pmic temp=57.1'C
10.0 ARM MHz=1500, core volt=0.8455V, CPU temp=69.0'C, pmic temp=62.8'C 28695
20.2 ARM MHz=1500, core volt=0.8402V, CPU temp=73.0'C, pmic temp=64.6'C 28729
152.5 ARM MHz=1000, core volt=0.8402V, CPU temp=82.0'C, pmic temp=72.2'C 21523
305.5 ARM MHz=1000, core volt=0.8402V, CPU temp=83.0'C, pmic temp=74.1'C 20026
448.2 ARM MHz=1000, core volt=0.8402V, CPU temp=83.0'C, pmic temp=74.1'C 19611
601.1 ARM MHz=1000, core volt=0.8402V, CPU temp=83.0'C, pmic temp=74.1'C 19199
%Min/Max 66.9
./MP-IntStress64 KB 160 Threads 8 Mins 10 Log 34
Seconds MB/sec
0.0 ARM MHz=1500, core volt=0.8402V, CPU temp=64.0'C, pmic temp=57.1'C
10.0 ARM MHz=1500, core volt=0.8402V, CPU temp=71.0'C, pmic temp=62.8'C 26323
20.2 ARM MHz=1500, core volt=0.8402V, CPU temp=75.0'C, pmic temp=66.5'C 26140
152.9 ARM MHz=1000, core volt=0.8402V, CPU temp=82.0'C, pmic temp=74.1'C 18016
306.5 ARM MHz=1000, core volt=0.8402V, CPU temp=83.0'C, pmic temp=74.1'C 17306
449.8 ARM MHz=1000, core volt=0.8402V, CPU temp=84.0'C, pmic temp=74.1'C 17248
603.3 ARM MHz= 750, core volt=0.8402V, CPU temp=84.0'C, pmic temp=74.1'C 16832
%Min/Max 63.9
./MP-IntStress64 KB 16000 Threads 8 Mins 10 Log 34
Seconds MB/sec
0.0 ARM MHz=1500, core volt=0.8402V, CPU temp=66.0'C, pmic temp=60.9'C
10.0 ARM MHz=1500, core volt=0.8402V, CPU temp=71.0'C, pmic temp=62.8'C 3372
20.3 ARM MHz=1500, core volt=0.8402V, CPU temp=72.0'C, pmic temp=62.8'C 3369
155.2 ARM MHz=1500, core volt=0.8402V, CPU temp=76.0'C, pmic temp=68.4'C 3365
309.8 ARM MHz=1500, core volt=0.8402V, CPU temp=79.0'C, pmic temp=69.4'C 3367
454.4 ARM MHz=1500, core volt=0.8402V, CPU temp=78.0'C, pmic temp=70.3'C 3367
599.7 ARM MHz=1500, core volt=0.8402V, CPU temp=78.0'C, pmic temp=70.3'C 3368
%Min/Max 99.8
Single Precision Floating Point Stress Test - MP-FPUStress64, MP-FPUStress
Two sets of result summaries are provided below, both using 1280 KB memory space and 8 threads. With four cores, this
results in data being in L2 cache (4 x 160 KB) to run at full speed, with additional overhead of moving data to/from RAM.
One test uses 8 operations per word, with 32 in the other. With hot starts, neither reached a CPU temperature of 84°C
and had similar performance degradation at the highest temperatures.
After writing the above, the 32 bit stress test was repeated, with results shown below. Although not conclusive
from a single run, they indicate that the impact was more severe than in the 64 bit run, with CPU speed samples reducing
to 600 MHz, higher temperatures and a larger performance degradation.
Ops/ KB KB MB KB KB MB
Secs Thrd Word 12.8 128 12.8 12.8 128 12.8
4.6 T4 2 9223 7520 519 40392 76406 99700
6.0 T8 2 9520 10471 545 40392 76406 99700
11.3 T4 8 19087 21040 2044 54764 85092 99820
12.9 T8 8 19747 21107 2016 54764 85092 99820
22.2 T4 32 25732 26704 9160 35206 66015 99520
24.1 T8 32 25708 25770 8927 35206 66015 99520
==== Stress Test Parameters - upper or lower case, only first letter counts ====
Threads 1,2,4,8,16,32,64 KB 12 to 15624 Ops/Wordd 2,8,32 Log<100 Minutes any>0
./MP-FPUStress64 KB 1280 T 8 Ops 8 Mins 10 Log 33
Seconds MFLOPS
0.0 ARM MHz=1500, core volt=0.8437V, CPU temp=64.0'C, pmic temp=59.0'C
10.0 ARM MHz=1500, core volt=0.8437V, CPU temp=71.0'C, pmic temp=62.8'C 17309
20.2 ARM MHz=1500, core volt=0.8437V, CPU temp=75.0'C, pmic temp=66.5'C 18018
101.9 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C 14224
204.2 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C 12806
306.8 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=73.1'C 12447
409.4 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=73.1'C 11870
501.6 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 12191
604.1 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C 12169
%Min/Max 65.9
./MP-FPUStress64 KB 1280 T 8 Ops 32 Mins 10 Log 33
Seconds MFLOPS
0.0 ARM MHz=1500, core volt=0.8437V, CPU temp=65.0'C, pmic temp=59.0'C
10.0 ARM MHz=1500, core volt=0.8437V, CPU temp=72.0'C, pmic temp=65.6'C 22634
20.2 ARM MHz=1500, core volt=0.8437V, CPU temp=76.0'C, pmic temp=67.5'C 22992
101.9 ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C 18629
204.0 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=74.1'C 16674
306.3 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C 16448
408.6 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C 16158
500.7 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 16081
603.0 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C 15553
%Min/Max 67.6
======================================================================================
32 Bit Version ./MP-FPUStress KB 1280 T 8 Ops 32 Mins 10 Log 73
Seconds MFLOPS
0.0 ARM MHz=1500, core volt=0.8560V, CPU temp=56.0'C, pmic temp=50.5'C
10.0 ARM MHz=1500, core volt=0.8560V, CPU temp=70.0'C, pmic temp=60.9'C 20233
20.7 ARM MHz=1500, core volt=0.8560V, CPU temp=74.0'C, pmic temp=64.6'C 20221
106.4 ARM MHz=1000, core volt=0.8560V, CPU temp=83.0'C, pmic temp=70.3'C 14173
204.3 ARM MHz=1000, core volt=0.8455V, CPU temp=84.0'C, pmic temp=73.1'C 13115
302.2 ARM MHz=1000, core volt=0.8455V, CPU temp=85.0'C, pmic temp=74.1'C 12650
400.2 ARM MHz= 750, core volt=0.8455V, CPU temp=85.0'C, pmic temp=74.1'C 11957
508.8 ARM MHz=1000, core volt=0.8455V, CPU temp=85.0'C, pmic temp=74.1'C 11485
585.1 ARM MHz= 600, core volt=0.8455V, CPU temp=84.0'C, pmic temp=74.1'C 11454
606.9 ARM MHz=1000, core volt=0.8455V, CPU temp=84.0'C, pmic temp=74.1'C 11242
%Min/Max 55.6
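The %Min/Max figure is consistent with the slowest measured speed expressed as a percentage of the fastest. The sketch below applies that to the MFLOPS samples listed above for the two 32 operations per word runs. Only the summarised samples are used here, so the full logs could contain other extremes, and the function name is mine.

```python
def percent_min_max(rates):
    """Slowest rate as a percentage of the fastest."""
    return 100.0 * min(rates) / max(rates)


# MFLOPS samples from the summaries, Ops 32, 8 threads, 1280 KB
mflops_64bit = [22634, 22992, 18629, 16674, 16448, 16158, 16081, 15553]
mflops_32bit = [20233, 20221, 14173, 13115, 12650, 11957, 11485, 11454, 11242]

ratio_64bit = percent_min_max(mflops_64bit)
ratio_32bit = percent_min_max(mflops_32bit)

print(f"64 bit %Min/Max {ratio_64bit:.1f}")
print(f"32 bit %Min/Max {ratio_32bit:.1f}")
```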
Double Precision Floating Point Stress Test - MP-FPUStress64DP, MP-FPUStressDP
Below are full results for a 10 minute test using the double precision floating point stress test, with data in L2 cache
with four cores in use. Although the measured MFLOPS was greater than that obtained by HPL, the same range
of high temperatures and performance degradation were not generated.
The 32 bit version was also rerun, producing similar results to those at 64 bits.
Ops/ KB KB MB KB KB MB
Secs Thrd Word 12.8 128 12.8 12.8 128 12.8
8.9 T4 2 5024 4589 257 40395 76384 99700
11.5 T8 2 5089 5545 280 40395 76384 99700
21.7 T4 8 10259 10011 1068 54805 85108 99820
24.7 T8 8 10239 10824 1036 54805 85108 99820
43.1 T4 32 12940 13200 4497 35159 66065 99521
46.9 T8 32 13200 13049 4557 35159 66065 99521
==== Stress Test Parameters - upper or lower case, only first letter counts ====
Threads 1,2,4,8,16,32,64 KB 12 to 15624 Ops/Wordd 2,8,32 Log<100 Minutes any>0
./MP-FPUStress64DP KB 1280 T 8 Ops 32 Mins 10 Log 31
Seconds MFLOPS
0.0 ARM MHz=1500, core volt=0.8437V, CPU temp=63.0'C, pmic temp=57.1'C
10.0 ARM MHz=1500, core volt=0.8437V, CPU temp=71.0'C, pmic temp=62.8'C 12718
20.2 ARM MHz=1500, core volt=0.8437V, CPU temp=74.0'C, pmic temp=66.5'C 12755
30.5 ARM MHz=1500, core volt=0.8437V, CPU temp=77.0'C, pmic temp=68.4'C 12750
40.7 ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=70.3'C 12755
50.9 ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=70.3'C 12183
61.2 ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C 11358
71.4 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C 10922
81.6 ARM MHz=1000, core volt=0.8437V, CPU temp=80.0'C, pmic temp=72.2'C 10333
91.8 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C 9948
102.0 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C 9692
112.3 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C 9466
122.6 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C 9217
132.8 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=74.1'C 9181
143.0 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C 9145
153.2 ARM MHz=1000, core volt=0.8437V, CPU temp=80.0'C, pmic temp=72.2'C 9043
163.4 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C 8921
173.6 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C 9838
183.9 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C 8755
194.1 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C 8737
204.4 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C 8721
214.7 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C 8721
224.9 ARM MHz=1500, core volt=0.8437V, CPU temp=83.0'C, pmic temp=73.1'C 8670
235.1 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=73.1'C 8619
245.4 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C 8592
255.6 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C 8592
265.9 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8540
276.2 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=73.1'C 8488
286.4 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C 8547
296.7 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8510
307.0 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8473
317.2 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8507
327.5 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8541
337.7 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8544
347.9 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C 8464
358.2 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8531
368.4 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8495
378.7 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8460
388.9 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8514
399.2 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8484
409.4 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C 8454
419.6 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8459
429.8 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8489
440.1 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8472
450.3 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8428
460.6 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8384
470.9 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8384
481.2 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8387
491.4 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8391
501.7 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8244
511.9 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8346
522.1 ARM MHz= 750, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8272
532.4 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8272
542.6 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8329
552.8 ARM MHz= 750, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8239
563.1 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8183
573.3 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8129
583.6 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8343
593.9 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8266
604.1 ARM MHz=1000, core volt=0.8437V, CPU temp=85.0'C, pmic temp=74.1'C 8190
%Min/Max 63.7
OpenGL + 3 x Livermore Loops - liverloopsPi64Rg9, liverloopsPi64, liverloopsPiA7R
In order to make it easier to run these stress tests, lxterminal was installed and the script shown below used to open four
terminal windows and run the environmental monitor program plus three copies of a modified Loops benchmark that
allows different log files to be specified. This executes 72 loops for a minimum time of 12 seconds each. The second
script file is provided to run the kitchen display tests for 16 minutes in full screen mode. A further terminal was opened to
run the vmstat resource monitor.
The tests were run twice, without and with a cooling fan in place. Results are shown below. In this case, the no fan
tests were not that much slower, obtaining averages of 77 to 80% of the fan cooled speeds on OpenGL FPS, CPU MHz
and total Loops MFLOPS.
These results were produced with all programs compiled by gcc 9 and not run on a hot day. Compared with performance
using 32 bit versions, detailed in this 32 Bit Report, the 64 bit results were far better, but the former were produced by
an older compiler and run on a hot day. The tests were repeated, using 32 bit programs produced by the later gcc 8
compiler.
As before, the 64 bit gcc 9 Livermore Loops and OpenGL single core benchmarks were faster than the new 32 bit
versions, in this case by 14% for the former and 40% for the latter. On running the stress test, both had similar average
CPU MHz, CPU temperature and PMIC temperature, with 64 bit FPS and MFLOPS maintaining performance advantage,
with similar ratios as obtained from single core tests.
run.sh
lxterminal -e ./RPiHeatMHzVolts2 Passes 35 Seconds 30 Log 20 &
lxterminal -e ./liverloopsPi64Rg9 Seconds 12 Log 21 &
lxterminal -e ./liverloopsPi64Rg9 Seconds 12 Log 22 &
lxterminal -e ./liverloopsPi64Rg9 Seconds 12 Log 23
runogl.sh
export vblank_mode=0 &
./videogl64g9 Test 6 Mins 16 Log 20
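As a rough planning check, the nominal run times implied by the parameters in the two scripts above can be worked out as follows. This is only a sketch: the function name and dictionary keys are illustrative, and it assumes that each of the 72 loops runs for exactly its 12 second minimum and that the monitor takes exactly 30 seconds per pass.

```python
# Nominal durations derived from the run.sh and runogl.sh parameters.
# Assumption: loops run for exactly their minimum time, with no overheads.
LOOPS = 72               # loops executed by each liverloops copy
SECONDS_PER_LOOP = 12    # "Seconds 12" in run.sh
MONITOR_PASSES = 35      # "Passes 35" in run.sh
MONITOR_INTERVAL = 30    # "Seconds 30" in run.sh
OPENGL_MINUTES = 16      # "Mins 16" in runogl.sh


def planned_durations():
    """Return the nominal duration of each program in seconds."""
    return {
        "liverloops (minimum)": LOOPS * SECONDS_PER_LOOP,
        "RPiHeatMHzVolts2": MONITOR_PASSES * MONITOR_INTERVAL,
        "videogl64g9": OPENGL_MINUTES * 60,
    }


if __name__ == "__main__":
    for name, secs in planned_durations().items():
        print(f"{name:22s} {secs:5d} s = {secs / 60:4.1f} min")
```

On these assumptions the Loops copies need at least 864 seconds, which fits inside the 16 minute OpenGL run and the monitoring period.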
No Fan With Fan
Seconds MHz CPU C PMIC C FPS MHz CPU C PMIC C FPS
0 1500 57 51 1500 37 32
30 1500 75 63 27 1500 53 44 27
60 1500 76 68 29 1500 53 44 28
90 1500 81 72 25 1500 58 50 27
120 1500 81 70 23 1500 55 48 26
150 1000 82 74 23 1500 57 49 29
180 1000 80 72 22 1500 54 47 27
210 1000 81 72 24 1500 55 46 29
240 1500 80 72 26 1500 54 44 28
270 1500 81 72 27 1500 55 47 28
300 1000 82 72 22 1500 56 48 29
330 1500 82 72 24 1500 56 50 29
360 1000 82 72 24 1500 56 49 28
390 1000 82 72 22 1500 58 50 26
420 1000 83 72 22 1500 57 50 26
450 1000 82 74 19 1500 56 50 30
480 1000 82 74 21 1500 56 48 28
510 1000 82 72 22 1500 54 46 29
540 1000 81 72 22 1500 55 47 30
570 1500 81 72 24 1500 55 47 30
600 1000 82 74 24 1500 57 49 30
630 1500 81 72 23 1500 58 51 29
660 1000 82 72 23 1500 57 50 29
690 1000 83 73 22 1500 59 51 28
720 1000 83 72 21 1500 57 51 28
750 1000 82 74 21 1500 57 50 29
780 1000 84 74 19 1500 54 47 29
810 1000 82 72 19 1500 56 48 29
840 1000 82 72 20 1500 54 46 29
870 1000 82 72 20 1500 53 46 30
900 1000 82 72 23 1500 49 42 31
Average 1161 81 71 23 1500 55 47 29
Minimum 1000 57 51 19 1500 37 32 26
Maximum 1500 84 74 29 1500 59 51 31
% Hot/Cold
Average 77 68 66 80
Minimum 67 65 61 73
Maximum 100 70 69 94
MFLOPS Average Geomean Harmean Average Geomean Harmean
1 684 562 453 898 732 590
2 716 574 451 887 712 571
3 716 566 438 895 724 582
Total %Hot/Cold
MFLOPS 79 78 77
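The no fan to fan ratios for CPU MHz, FPS and total MFLOPS can be reproduced approximately from the rounded averages in the tables above. The sketch below is illustrative only (the names are not from the benchmark programs). Because it starts from rounded values, the FPS ratio comes out slightly under the 80% shown in the table, which was presumably calculated from unrounded averages.

```python
# Rounded averages copied from the tables above.
NO_FAN = {"MHz": 1161, "FPS": 23}
WITH_FAN = {"MHz": 1500, "FPS": 29}
MFLOPS_NO_FAN = [684, 716, 716]    # "Average" column, copies 1 to 3
MFLOPS_WITH_FAN = [898, 887, 895]


def percent(hot, cold):
    """Hot (no fan) value as a percentage of the cold (fan) value."""
    return 100.0 * hot / cold


def hot_cold_summary():
    """Return the no fan results as percentages of the fan cooled results."""
    summary = {key: percent(NO_FAN[key], WITH_FAN[key]) for key in NO_FAN}
    summary["MFLOPS"] = percent(sum(MFLOPS_NO_FAN), sum(MFLOPS_WITH_FAN))
    return summary


if __name__ == "__main__":
    for key, value in hot_cold_summary().items():
        print(f"{key:7s} {value:5.1f}% of fan cooled speed")
```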
Input/Output Stress Test - burnindrive264g9, burnindrive2
This is essentially the same as the program I used during hundreds of UK Government and University computer acceptance
trials during the 1970s and 1980s, with some significant achievements. Burnindrive writes four files, using 164 blocks of
64 KB, repeated 16 times (164.0 MB), with each block containing a unique data pattern. The files are then read for two
minutes, in a quasi-random sequence, with data and file ID checked for correct values. Then each block (unique
pattern) is read numerous times, over one second, again with checking for correct values. Total time is normally about 5
minutes for all tests, with default parameters. The data patterns are shown below, followed by run time parameters,
then examples of results provided, including added calculations of speed.
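The write then verify idea can be illustrated with a minimal sketch. This is not the burnindrive source: the function names are hypothetical, each block is assumed to be a 32 bit pattern repeated to fill 64 KB, and the file ID checking, O_DIRECT access and timed read phases of the real program are omitted.

```python
import os
import struct
import tempfile

BLOCK_BYTES = 64 * 1024  # 64 KB blocks, as described above


def make_block(pattern):
    """Fill one block with a repeated 32 bit pattern (assumed layout)."""
    return struct.pack("<I", pattern) * (BLOCK_BYTES // 4)


def file_size_mb(n_patterns, repeats):
    """File size in MB for n_patterns blocks written repeats times."""
    return n_patterns * BLOCK_BYTES * repeats / (1024 * 1024)


def write_file(path, patterns, repeats):
    """Write every pattern block in order, repeated the given number of times."""
    with open(path, "wb") as f:
        for _ in range(repeats):
            for pattern in patterns:
                f.write(make_block(pattern))


def verify_file(path, patterns, repeats):
    """Read the file back and count blocks that differ from the expected data."""
    errors = 0
    with open(path, "rb") as f:
        for _ in range(repeats):
            for pattern in patterns:
                if f.read(BLOCK_BYTES) != make_block(pattern):
                    errors += 1
    return errors


if __name__ == "__main__":
    print("Default file size:", file_size_mb(164, 16), "MB")
    demo_patterns = [0x0, 0x1, 0x55555555, 0xFFFF0000]  # four of the 164 listed
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "demo.bin")
        write_file(path, demo_patterns, repeats=2)
        print("Blocks in error:", verify_file(path, demo_patterns, repeats=2))
```

With the default 164 patterns and 16 repeats the size works out at 164 x 64 KB x 16 = 164.0 MB, and one repeat gives the 10.25 MB multiplier quoted in the run time parameters.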
Patterns
No. Hex No. Hex No. Hex No. Hex No. Hex No. Hex No. Hex
1 0 25 800000 49 3 73 FF 97 FFFFDFFF 121 FFFFEAAA 145 FFFFF0F0
2 1 26 1000000 50 33 74 FF00FF 98 FFFFBFFF 122 FFFFAAAA 146 FFF0F0F0
3 2 27 2000000 51 333 75 1FF 99 FFFF7FFF 123 FFFEAAAA 147 F0F0F0F0
4 4 28 4000000 52 3333 76 3FF 100 FFFEFFFF 124 FFFAAAAA 148 FFFFFFE0
5 8 29 8000000 53 33333 77 7FF 101 FFFDFFFF 125 FFEAAAAA 149 FFFF83E0
6 10 30 10000000 54 333333 78 FFF 102 FFFBFFFF 126 FFAAAAAA 150 FE0F83E0
7 20 31 20000000 55 3333333 79 1FFF 103 FFF7FFFF 127 FEAAAAAA 151 FFFFFFC0
8 40 32 40000000 56 33333333 80 3FFF 104 FFEFFFFF 128 FAAAAAAA 152 FFFC0FC0
9 80 33 1 57 7 81 7FFF 105 FFDFFFFF 129 EAAAAAAA 153 FFFFFF80
10 100 34 5 58 1C7 82 FFFF 106 FFBFFFFF 130 AAAAAAAA 154 FFE03F80
11 200 35 15 59 71C7 83 FFFFFFFF 107 FF7FFFFF 131 FFFFFFFC 155 FFFFFF00
12 400 36 55 60 1C71C7 84 FFFFFFFE 108 FEFFFFFF 132 FFFFFFCC 156 FF00FF00
13 800 37 155 61 71C71C7 85 FFFFFFFD 109 FDFFFFFF 133 FFFFFCCC 157 FFFFFE00
14 1000 38 555 62 F 86 FFFFFFFB 110 FBFFFFFF 134 FFFFCCCC 158 FFFFFC00
15 2000 39 1555 63 F0F 87 FFFFFFF7 111 F7FFFFFF 135 FFFCCCCC 159 FFFFF800
16 4000 40 5555 64 F0F0F 88 FFFFFFEF 112 EFFFFFFF 136 FFCCCCCC 160 FFFFF000
17 8000 41 15555 65 F0F0F0F 89 FFFFFFDF 113 DFFFFFFF 137 FCCCCCCC 161 FFFFE000
18 10000 42 55555 66 1F 90 FFFFFFBF 114 BFFFFFFF 138 CCCCCCCC 162 FFFFC000
19 20000 43 155555 67 7C1F 91 FFFFFF7F 115 FFFFFFFE 139 FFFFFFF8 163 FFFF8000
20 40000 44 555555 68 1F07C1F 92 FFFFFEFF 116 FFFFFFFA 140 FFFFFE38 164 FFFF0000
21 80000 45 1555555 69 3F 93 FFFFFDFF 117 FFFFFFEA 141 FFFF8E38
22 100000 46 5555555 70 3F03F 94 FFFFFBFF 118 FFFFFFAA 142 FFE38E38
23 200000 47 15555555 71 7F 95 FFFFF7FF 119 FFFFFEAA 143 F8E38E38
24 400000 48 55555555 72 1FC07F 96 FFFFEFFF 120 FFFFFAAA 144 FFFFFFF0
Sequences - First 16
No. File No. File No. File No. File
1 0 1 2 3 5 0 2 1 3 9 0 3 1 2 13 0 1 2 3
2 1 2 3 0 6 1 3 2 0 10 1 0 3 2 14 1 2 3 0
3 2 3 0 1 7 2 0 1 3 11 2 1 0 3 15 2 3 0 1
4 3 0 2 1 8 3 1 2 0 12 3 2 1 0 16 3 0 2 1
###########################################################################
Run Time Parameters - Upper or Lower Case
Option                  Description                                         Default
R or Repeats            Data size, multiplier of 10.25 MB, more or less     16
P or Patterns           Number of patterns for smaller files < 164          164
M or Minutes            Large file reading time                             2
L or Log                Log file name extension 0 to 99                     0
S or Seconds            Time to read each block, last section               1
F or FilePath           For other than SD card or SD card directory
C or CacheData          Omit O_DIRECT on opening files to allow caching     No
O or OutputPatterns     Log patterns and file sequences used as above       No
D or DontRunReadTests   Or only run write tests                             No
Format ./burnindrive2 Repeats 16, Minutes 2, Log 0, Seconds 1
or ./burnindrive2 R 16, M 2, L 0, S 1
###########################################################################
Examples of Results Main SD Card Default Parameters
File 1 164.00 MB written in 14.66 seconds - 11.2 MB/second
To File 4 164.00 MB written in 12.15 seconds - 13.5 MB/second
Read passes 1 x 4 Files x 164.00 MB in 0.33 minutes - 33.1 MB/second
To Read passes 7 x 4 Files x 164.00 MB in 2.28 minutes - 33.6 MB/second
Passes in 1 second(s) for each of 164 blocks of 64KB: - 164 measurements
580 580 580 580 580 580 580 580 580 580 580
580 580 580 580 580 580 580 580 580 580 580
95120 read passes of 64KB blocks in 2.76 minutes - 36.8 MB/second
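The added speed calculations for the file write and read phases follow directly from the sizes and times logged. A small sketch, with illustrative function names, reproduces the figures shown above, taking each file as 164 MB.

```python
FILE_MB = 164.0  # size of each of the four files with default parameters


def write_speed(seconds, file_mb=FILE_MB):
    """MB/second for writing one file in the given number of seconds."""
    return file_mb / seconds


def read_speed(passes, minutes, files=4, file_mb=FILE_MB):
    """MB/second for reading all files the given number of passes."""
    return passes * files * file_mb / (minutes * 60.0)


if __name__ == "__main__":
    print(f"File 1 write   {write_speed(14.66):5.1f} MB/second")
    print(f"File 4 write   {write_speed(12.15):5.1f} MB/second")
    print(f"1 read pass    {read_speed(1, 0.33):5.1f} MB/second")
    print(f"7 read passes  {read_speed(7, 2.28):5.1f} MB/second")
```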
CPU + Main SD + USB + LAN Test
A system test was run using the following script file, comprising commands to run programs to monitor the environment,
and others to exercise the main SD card, two USB 3 drives, 1 Gbps Ethernet and CPU floating point with two threads.
The programs were run via the script file so that they all started at the same time, as indicated in the summaries below.
They also all ran for between 12 and 13 minutes. The performance levels obtained when each program was run by itself (BI)
are also shown, and are often not much higher. Performance is not as high as shown by other benchmarks, probably because
data transfers are based on 64 KB block sizes and all data in each block is checked for correctness.
A snapshot of vmstat system performance is also provided. The bo and bi KB/second writing and reading speeds are
essentially the same as the sum of those reported by the programs handling the main and USB drives. LAN speeds are not
included in vmstat. Total CPU utilisation (us + sy) is shown to be nearly 90% at the start of writing and closer to 75% on
reading. This is the average utilisation across all four cores, equivalent to at least three cores at 100%. The following
section shows variations in performance with time.
############################### Script File ###############################
lxterminal -e ./RPiHeatMHzVolts2 Passes 35 Seconds 30 Log 20 &
lxterminal -e ./burnindrive264g9 Seconds 4 Minutes 1 Log 21 &
lxterminal -e ./burnindrive264g9 Seconds 4 Minutes 1 FilePath /run/media/demouser/PATRIOT Log 22 &
lxterminal -e ./burnindrive264g9 Seconds 4 Minutes 1 FilePath /run/media/demouser/REMIXOSSYS Log 23 &
lxterminal -e ./burnindrive264g9 Seconds 4 Minutes 1 FilePath /media/public/test Log 24 &
lxterminal -e ./MP-FPUStress64 KB 256 T 2 Ops 32 Mins 12 Log 33
vmstat 10 96 > vmstat.txt
############################################################################
Main SD Drive Tue Nov 5 15:47:03 2019
End of test Tue Nov 5 16:00:06 2019
Write 164 MB x files 4 53.6 seconds = 12.2 MB/second (BI 12.7)
Read 164 MB x files 3 x 4 67.2 seconds = 29.3 MB/second (BI 33.6)
Read 329480 x 64 KB 659.4 seconds = 32.0 MB/second (BI 36.8)
============================================================
USB 3 Drive 1 Tue Nov 5 15:47:03 2019
End of test Tue Nov 5 15:59:31 2019
Write 164 MB x files 4 17.5 seconds = 37.5 MB/second (BI 68.3)
Read 164 MB x files 6 x 4 72.0 seconds = 54.7 MB/second (BI 75.0)
Read 735800 x 64 KB 657.6 seconds = 71.6 MB/second (BI 66.5)
============================================================
USB 3 Drive 2 Tue Nov 5 15:47:03 2019
End of test Tue Nov 5 15:59:57 2019
Write 164 MB x files 4 37.4 seconds = 17.5 MB/second (BI 23.8)
Read 164 MB x files 3 x 4 75.6 seconds = 26.0 MB/second (BI 28.5)
Read 282740 x 64 KB 660.0 seconds = 27.4 MB/second (BI 29.8)
============================================================
1 Gbps LAN Tue Nov 5 15:47:03 2019
End of test Tue Nov 5 15:59:35 2019
Write 164 MB x files 4 18.1 seconds = 36.2 MB/second (BI 55.7)
Read 164 MB x files 3 x 4 74.4 seconds = 26.4 MB/second (BI 34.0)
Read 303920 x 64 KB 659.4 seconds = 29.5 MB/second (BI 45.3)
============================================================
MP-Threaded-MFLOPS 64 Bit v1.1 Tue Nov 5 15:47:03 2019
End of test Tue Nov 5 15:59:13 2019
2 core GFLOPS 10.9 to 7.4 with CPU throttling.
See RPiHeatMHzVolts2 results where detail is included
============================================================
From vmstat 10 second sampling
Secs procs ---------memory---------- ---swap-- -----io---- --system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
10 5 3 0 3059800 94956 346060 0 0 14 63204 17819 19051 51 38 2 9 0
20 3 2 0 3058696 95248 346704 0 0 14265 60713 17613 18789 51 33 4 12 0
60 4 2 0 3061196 95668 343572 0 0 93479 7577 24239 24987 54 19 4 23 0
70 4 3 0 3050632 95684 353600 0 0 112115 24 24496 25316 54 20 12 14 0
710 3 3 0 3058696 96532 349460 0 0 132992 16 18936 22387 53 22 3 22 0
720 5 1 0 3058728 96548 349452 0 0 134400 13 20635 23842 54 23 1 23 0
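The comparison between vmstat and the program results can be checked with a short sketch. It parses two of the sample lines above (with the added Secs column), treats bi and bo as KB/second as stated in the text, and assumes 1000 KB per MB for the conversion. The function names are illustrative, and the program speeds are copied from the Main SD and USB summaries.

```python
FIELDS = "secs r b swpd free buff cache si so bi bo in cs us sy id wa st".split()

SAMPLES = """\
10 5 3 0 3059800 94956 346060 0 0 14 63204 17819 19051 51 38 2 9 0
710 3 3 0 3058696 96532 349460 0 0 132992 16 18936 22387 53 22 3 22 0
"""

# MB/second reported by the programs: main SD, USB 3 drive 1, USB 3 drive 2.
PROGRAM_WRITE = [12.2, 37.5, 17.5]
PROGRAM_BLOCK_READ = [32.0, 71.6, 27.4]
CORES = 4  # the Raspberry Pi 4B has four CPU cores


def parse(line):
    """Turn one vmstat sample line into a dictionary keyed by column name."""
    return dict(zip(FIELDS, (int(value) for value in line.split())))


def summarise(row):
    """Convert bi/bo to MB/second and us + sy to equivalent fully busy cores."""
    busy = row["us"] + row["sy"]
    return {
        "read_mb_s": row["bi"] / 1000.0,
        "write_mb_s": row["bo"] / 1000.0,
        "busy_pct": busy,
        "busy_cores": CORES * busy / 100.0,
    }


if __name__ == "__main__":
    for line in SAMPLES.splitlines():
        row = parse(line)
        s = summarise(row)
        print(f"{row['secs']:4d} s  read {s['read_mb_s']:6.1f}  write {s['write_mb_s']:5.1f} MB/s"
              f"  CPU {s['busy_pct']}% = {s['busy_cores']:.1f} cores")
    print(f"Program totals: write {sum(PROGRAM_WRITE):.1f}, "
          f"block read {sum(PROGRAM_BLOCK_READ):.1f} MB/s")
```

At 710 seconds the 75% total utilisation corresponds to three of the four cores being fully busy, and the vmstat reading speed is within about 2 MB/second of the sum of the three drive results.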
Speeds and Temperature - These tests were run without an active cooling fan, resulting in some CPU throttling, with
clock speed down to 1000 MHz some of the time, when the temperature reached 80°C. The MP-Threaded-MFLOPS dual
core performance measurements have been added to the environmental details, mainly indicating the effects of
throttling.
The final section of the burnindrive results records the number of read passes in 4 seconds, in a table comprising 14
lines of 11 recordings and one with 10, over approximately 11 minutes. The average burnindrive results for each line are
provided below, not exactly synchronised with the environmental samples, but giving an indication of changes in
throughput with time. Total passes and percentages of the first total are also shown, the degradation not being as severe
as the CPU speed reductions.
Temperature and CPU MHz Measurement + MP-FPUStress64 2 Core MFLOPS
Start at Tue Nov 5 15:47:03 2019
Using 25 samples at 30 second intervals
Seconds MFLOPS
0.0 ARM MHz=1500, core volt=0.8560V, CPU temp=66.0'C, pmic temp=59.0'C
30.0 ARM MHz=1500, core volt=0.8560V, CPU temp=75.0'C, pmic temp=65.6'C 10890
60.2 ARM MHz=1500, core volt=0.8560V, CPU temp=78.0'C, pmic temp=68.4'C 10551
90.4 ARM MHz=1500, core volt=0.8560V, CPU temp=80.0'C, pmic temp=70.3'C 10549
120.6 ARM MHz=1500, core volt=0.8560V, CPU temp=81.0'C, pmic temp=70.3'C 10452
150.8 ARM MHz=1500, core volt=0.8560V, CPU temp=81.0'C, pmic temp=70.3'C 9862
181.1 ARM MHz=1000, core volt=0.8560V, CPU temp=81.0'C, pmic temp=70.3'C 9482
211.4 ARM MHz=1500, core volt=0.8560V, CPU temp=82.0'C, pmic temp=72.2'C 9137
241.6 ARM MHz=1500, core volt=0.8507V, CPU temp=81.0'C, pmic temp=72.2'C 9132
271.9 ARM MHz=1000, core volt=0.8507V, CPU temp=82.0'C, pmic temp=70.3'C 9122
302.2 ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C 9389
332.4 ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C 8550
362.7 ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C 9043
392.9 ARM MHz=1500, core volt=0.8455V, CPU temp=81.0'C, pmic temp=72.2'C 8045
423.3 ARM MHz=1000, core volt=0.8455V, CPU temp=81.0'C, pmic temp=72.2'C 8174
453.6 ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C 8444
483.9 ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C 8335
514.3 ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C 7951
544.6 ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C 8125
574.8 ARM MHz=1500, core volt=0.8455V, CPU temp=83.0'C, pmic temp=72.2'C 8078
605.1 ARM MHz=1000, core volt=0.8455V, CPU temp=81.0'C, pmic temp=72.2'C 8280
635.4 ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C 7845
665.7 ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C 7761
696.0 ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=73.1'C 8341
726.2 ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C 7407
Passes in 4 seconds for each of 164 blocks of 64KB
Seconds Main SD USB 1 USB 2 LAN Total %First
44 2013 4522 1884 1915 10333 100
88 2007 4533 1838 1911 10289 100
132 2016 4496 1760 1809 10082 98
176 2011 4536 1785 1845 10178 99
220 2002 4493 1729 1913 10136 98
264 1971 4262 1751 1904 9887 96
308 1980 4540 1747 1911 10178 99
352 2002 4464 1660 1845 9971 96
396 1987 4442 1629 1844 9902 96
440 1964 4453 1585 1771 9773 95
484 1995 4504 1635 1731 9864 95
528 1989 4229 1696 1762 9676 94
572 1947 4616 1684 1833 10080 98
616 2013 4476 1660 1798 9947 96
660 2262 4758 1826 2022 10868 105
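To put the relative degradation into numbers, the sketch below compares the lowest total of read passes with the first total, alongside the fall in dual core MFLOPS and in CPU clock speed. Values are copied from the two tables above and the names are illustrative.

```python
# Total column from the burnindrive passes table above.
TOTAL_PASSES = [10333, 10289, 10082, 10178, 10136, 9887, 10178, 9971,
                9902, 9773, 9864, 9676, 10080, 9947, 10868]
MFLOPS_FIRST, MFLOPS_LAST = 10890, 7407   # first and last MFLOPS samples
MHZ_FULL, MHZ_THROTTLED = 1500, 1000      # normal and throttled clock speeds


def percent_of(value, reference):
    """Value expressed as a percentage of the reference."""
    return 100.0 * value / reference


def degradation_summary():
    """Lowest I/O throughput, final MFLOPS and throttled MHz as percentages."""
    return {
        "I/O passes (worst line)": percent_of(min(TOTAL_PASSES), TOTAL_PASSES[0]),
        "2 core MFLOPS (last)": percent_of(MFLOPS_LAST, MFLOPS_FIRST),
        "CPU MHz (throttled)": percent_of(MHZ_THROTTLED, MHZ_FULL),
    }


if __name__ == "__main__":
    for name, value in degradation_summary().items():
        print(f"{name:24s} {value:5.1f}% of initial")
```

The worst I/O line is still about 94% of the first, whereas MFLOPS and clock speed fall to roughly two thirds of their starting values.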