Content uploaded by Roy Longbottom
Author content
All content in this area was uploaded by Roy Longbottom on Nov 02, 2020
Content may be subject to copyright.
Raspberry Pi 400 PC 32 Bit and 64 Bit Benchmarks and Stress Tests
Roy Longbottom
Contents
Summary Introduction Benchmark Results
Whetstone Benchmark Dhrystone Benchmark Linpack 100 Benchmark
Livermore Loops Benchmark FFT Benchmarks BusSpeed Benchmark
MemSpeed Benchmark NeonSpeed Benchmark MultiThreading Benchmarks
MP-Whetstone Benchmark MP-Dhrystone Benchmark MP NEON Linpack Benchmark
MP-BusSpeed Benchmark MP-RandMem Benchmark MP-MFLOPS Benchmarks
OpenMP-MFLOPS Benchmarks OpenMP-MemSpeed Benchmarks Java Whetstone Benchmark
JavaDraw Benchmark OpenGL GLUT Benchmark I/O Benchmarks
LAN Benchmark WiFi Benchmark USB Booting
USB 3 and Main Drive Benchmarks High Performance Linpack 32 Bit Stress Test Benchmarks
64 Bit Stress Test Benchmarks Stress Test Parameters 32 Bit Floating Point Stress Tests
64 Bit Floating Point Stress Tests 32 Bit Integer Stress Tests 64 Bit Integer Stress Tests
32 Bit System Stress Tests 32 Bit System Stress Tests Part 2 64 Bit System Stress Tests
32 Bit TV Test Plus Remote Access 64 Bit TV Test Using Bluetooth 64 Bit Danger
Summary
This report provides results of benchmarks and stress tests run on a Raspberry Pi 400 PC, using the 32 bit and 64 bit Operating
Systems. The PC comprises an upgraded version of the Raspberry Pi 4B CPU, fitted without a fan inside a Raspberry Pi keyboard and
running at 1800 MHz instead of 1500 MHz. Benchmark results were compared with those from the original Pi 4B at 32 bits, and
Pi 400 32 bit results were compared with those from 64 bit operation.
CPU and RAM Benchmarks The first group of 18 benchmarks measures various aspects of CPU performance, including accessing
multiple CPU cores. At 32 bits, the Pi 400 generally provides the expected 20% improvement in performance where CPU time
dominates, but little difference where RAM speed is the limit. Average performance was superior using 64 bit operation, but too
variable to be conclusive. The compiler version used was identified as a potentially significant issue.
Graphics Benchmarks These comprise Java and OpenGL programs that execute a range of test functions. All ran successfully
on each configuration, including dual monitor operation. Some Pi 400 and 64 bit gains were apparent, but these depended on the
version of system software used.
LAN Benchmarks These demonstrated that the Pi 400 PC obtained the same Gigabit performance as the original Pi 4B, writing
and reading large files at around 112 MB/second. The benchmark also demonstrated that 64 bit operation could handle much
larger file sizes.
WiFi Benchmarks All systems were run using 2.4 and 5 GHz operation. In my environment, consistent 5 GHz operation
was extremely difficult to achieve and the Pi 400 appeared to be slower on data reception speeds.
USB and Main Drive Benchmarks Local and USB 3 performance was measured using a range of low and high speed drives, along with
the surprising availability of USB booting. File size limitations were also exposed with 32 bit working.
High Performance Linpack Benchmark This was ported to work at 32 and 64 bits on the Pi 400, both achieving the highest Pi 4 GB
RAM rating of 11.7 GFLOPS. Run as stress tests, over 30 minutes, CPU speed was constant, unlike a fanless Pi 4B with a best
case performance of 8.8 GFLOPS.
CPU Stress Tests Fifteen half hour stress tests were run, covering 32 bit and 64 bit operation, single and double precision (SP
and DP) floating point and integer calculations, using four threads. Best Pi 400 32 bit floating point average performance was 25
GFLOPS SP and 13 GFLOPS DP, with 64 bits somewhat faster. Most were at room temperatures of 30°C, some using side by side Pi
400 and 4B systems. With one exception, Pi 400 and 4B with fan tests ran at constant CPU speeds at temperatures less
than 70°C, where Pi 400 performance was around 20% faster. The exception was a Pi 400 session where outside temperatures were
greater than 40°C. This time the performance was also constant, with CPU temperature up to 71°C. Tests with the fanless Pi 4
saw temperatures of 86°C and CPU MHz sometimes throttled down to 600 MHz.
System Stress Tests Six programs were run at the same time for 15 minutes, exercising integer and floating point hardware, all
RAM space, OpenGL and drive data transfers, whilst monitoring environment and system utilisation. These were run at 32 and 64
bits on the Pi 400 and fan controlled Pi 4B. There were no excessive CPU temperatures and no data comparison errors.
TV Tests BBC iPlayer programmes were viewed on the Pi 400 for at least seven hours each, via a TV at 32 bits and a PC monitor at
64 bits, with sound through an external Bluetooth speaker for the latter. There were a few peculiarities for consideration, but no
interruptions to service.
Introduction
This is a continuation of earlier activity with details at ResearchGate in Raspberry Pi 4B 32 Bit Benchmarks.pdf, Raspberry Pi 4B
Stress Tests Including High Performance Linpack.pdf and Raspberry Pi 64 Bit OS and 8 GB Pi 4B Benchmarks.pdf. The 32 bit
benchmarks are available in Raspberry-Pi-4-Benchmarks.tar.gz, with the 64 bit versions in
Raspberry-Pi-OS-64-Bit-Benchmarks.tar.xz.
This report covers the Pi 400 PC, using July/August 2020 32 bit and 64 bit Operating Systems, with 64 bit/32 bit comparisons and
others with the original 32 bit Pi 4B. Brief descriptions are generally provided. For more comprehensive information, see the above
PDF files.
The Pi 400 PC is essentially a Raspberry Pi keyboard containing an upgraded and fanless Pi 4B with enhanced facilities and
options. The latest Pi 4 CPU default clock speed is 1800 MHz, compared with 1500 MHz for the original model.
Traditionally, the benchmarks provided details of the system being tested, by accessing built-in CPUID details. Following are the
latest details, which identify the differences between the two Pi 4 models and the 32 bit/64 bit variations.
Pi 4B 32 Bit 2019 OS
Architecture: armv7l
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: ARM
Model: 3
Model name: Cortex-A72
Stepping: r0p3
CPU max MHz: 1500.0000
CPU min MHz: 600.0000
BogoMIPS: 270.00
Flags: half thumb fastmult vfp edsp neon vfpv3 tls vfpv4
idiva idivt vfpd32 lpae evtstrm crc32
Raspberry Pi reference 2019-09-26
Pi 400 PC 32 Bit 2020 OS
Architecture: armv7l
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: ARM
Model: 3
Model name: Cortex-A72
Stepping: r0p3
CPU max MHz: 1800.0000
CPU min MHz: 600.0000
BogoMIPS: 324.00
Flags: half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
Raspberry Pi reference 2020-07-06
Pi 400 64 Bit 2020 OS - as 32 bit except
BogoMIPS: 108.00
Flags: fp asimd evtstrm crc32 cpuid
Linux raspberrypi 5.4.51-v8+ #1333 SMP PREEMPT Mon Aug 10 16:58:35 BST 2020 aarch64 GNU/Linux
Benchmark Results
The following provides benchmark results from the original 32 bit Raspberry Pi 4B and later ones from the Pi 400 PC, working at 32
bits and 64 bits. Comparisons and limited comments are provided.
Whetstone Benchmark - whetstonePiC8, whetstonePi64g8
This has a number of simple programming loops, with the overall MWIPS rating dependent on floating point calculations. With no
accessing of data in L2 cache or RAM, across the board 400/4B performance ratios were the same as the ratio of MHz
speeds.
Pi 400 overall rating was 11% faster at 64 bits, with variations of -6% to +27%.
System        MHz  MWIPS  ------MFLOPS------  -------------MOPS---------------
                             1     2     3     COS    EXP  FIXPT     IF  EQUAL
Pi 4B 32b    1500   1883   522   471   313    54.9   26.4   2496   3178    998
Pi 400 32b   1800   2258   628   565   376    65.7   31.7   2998   3826   1198
Pi 400 64b   1800   2505   628   643   478    69.2   32.8   2996   3592   1198
400/4B       1.20   1.20  1.20  1.20  1.20    1.20   1.20   1.20   1.20   1.20
400 64/32b   1.00   1.11  1.00  1.14  1.27    1.05   1.04   1.00   0.94   1.00
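The comparison rows in these tables are simple ratios of the measured values. As a minimal sketch, assuming the MHz and MWIPS figures are copied from the Whetstone table and using hypothetical names (`results`, `ratio`), the first two columns of the 400/4B and 64/32b rows can be reproduced as follows:

```python
# MHz and MWIPS values copied from the Whetstone table.
# The dictionary layout and function name are illustrative assumptions.
results = {
    "Pi 4B 32b": {"MHz": 1500, "MWIPS": 1883},
    "Pi 400 32b": {"MHz": 1800, "MWIPS": 2258},
    "Pi 400 64b": {"MHz": 1800, "MWIPS": 2505},
}


def ratio(new, old, key):
    """Return results[new][key] / results[old][key], rounded to two decimals."""
    return round(results[new][key] / results[old][key], 2)


clock_ratio = ratio("Pi 400 32b", "Pi 4B 32b", "MHz")    # 400/4B, MHz column
mwips_ratio = ratio("Pi 400 32b", "Pi 4B 32b", "MWIPS")  # 400/4B, MWIPS column
bits_ratio = ratio("Pi 400 64b", "Pi 400 32b", "MWIPS")  # 64/32b, MWIPS column

print(clock_ratio, mwips_ratio, bits_ratio)
```

A MWIPS ratio equal to the clock ratio is what would be expected when a test is limited by CPU speed alone.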
Dhrystone Benchmark - dhrystonePiC8, dhrystonePi64g8
This is the most popular ARM integer benchmark, often subject to over optimisation, rated in VAX MIPS aka DMIPS. Again, Pi 400
32 bit performance improvement was 20%. The latter also indicated a 38% gain at 64 bits.
                              DMIPS
System          MHz   DMIPS   /MHz   Compare     MHz   DMIPS
Pi 4B 32 bit   1500    5648   3.77
Pi 400 32 bit  1800    6779   3.77   400/4B     1.20    1.20
Pi 400 64 bit  1800    9337   5.19   64/32 bit  1.00    1.38
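The DMIPS/MHz column normalises the rating by clock speed, so that the two 32 bit systems can be seen to do the same work per clock cycle. A minimal sketch of that calculation, with the figures copied from the table and hypothetical names (`rows`, `dmips_per_mhz`):

```python
# (system, MHz, DMIPS) copied from the Dhrystone table.
rows = [
    ("Pi 4B 32 bit", 1500, 5648),
    ("Pi 400 32 bit", 1800, 6779),
    ("Pi 400 64 bit", 1800, 9337),
]


def dmips_per_mhz(dmips, mhz):
    """Normalise a DMIPS rating by clock speed, rounded to two decimals."""
    return round(dmips / mhz, 2)


per_mhz = {name: dmips_per_mhz(dmips, mhz) for name, mhz, dmips in rows}

for name, value in per_mhz.items():
    print(f"{name:14s} {value:.2f} DMIPS/MHz")
```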
Linpack 100 Benchmark MFLOPS - linpackPiC8, linpackPiC8SP, linpackPiNEONiC8,
linpackPi64g8, linpackPi64gSP, linpackPi64NEONig8
This original Linpack benchmark executes double precision arithmetic. I introduced two single precision versions, one using NEON
functions to include vector processing. Performance of this benchmark can vary, with its dependence on data placement in L2
cache. In this case, best results from more than one run were used, reflecting around 20% gain by the 32 bit Pi 400.
64 bit results indicated 10% to 79% speed gain, best being SP where the compiler probably generated vector instructions.
                                        NEON                               NEON
System          MHz      DP      SP      SP   Compare     MHz    DP    SP    SP
Pi 4B 32 bit   1500   957.1  1068.8  1819.9
Pi 400 32 bit  1800  1146.9  1306.2  2174.6   400/4B     1.20  1.20  1.22  1.19
Pi 400 64 bit  1800  1337.2  2343.9  2400.4   64/32 bit  1.00  1.17  1.79  1.10
Livermore Loops Benchmark MFLOPS - liverloopsPiC8, liverloopsPi64g8
This benchmark measures performance of 24 double precision kernels, initially used to select the latest supercomputer. The
official average is the geometric mean, where the Cray 1 supercomputer was rated as 11.9 MFLOPS. Following are MFLOPS for the
individual kernels, followed by overall scores. Although each kernel is executed for a relatively long time, performance of some
can be inconsistent, as reflected in the performance ratios. Fortunately, in this case, overall 32 bit Pi 400 performance ratings
indicated a 20% improvement.
There were wide 64 bit/32 bit comparison variations, but the geometric mean indicated a 15% higher rating.
MFLOPS for 24 loops Pi 4B 1500 MHz 32 bit
1480 1017 974 930 383 657 1624 1861 1664 617 498 741
221 320 803 640 737 1003 451 378 1047 411 763 187
MFLOPS for 24 loops Pi 400 1800 MHz 32 bit
1751 1225 1187 1120 469 608 1944 2262 2004 888 591 878
267 374 965 767 897 1218 542 454 1261 473 915 225
MFLOPS for 24 loops Pi 400 1800 MHz 64 bit
2553 1194 1163 1139 452 887 2762 3352 2451 999 601 1169
255 480 979 771 872 1377 540 477 1978 481 982 375
400/4B 1.18 1.20 1.22 1.20 1.22 0.93 1.20 1.22 1.20 1.44 1.19 1.18
1.21 1.17 1.20 1.20 1.22 1.21 1.20 1.20 1.20 1.15 1.20 1.20
64/32b 1.46 0.97 0.98 1.02 0.96 1.46 1.42 1.48 1.22 1.13 1.02 1.33
0.96 1.28 1.01 1.01 0.97 1.13 1.00 1.05 1.57 1.02 1.07 1.67
System MHz Maximum Average Geomean Harmean Minimum
Pi 4B 32 bit 1500 1860.8 800.4 679.0 564.1 179.5
Pi 400 32 bit 1800 2262.0 965.1 818.6 679.6 217.4
Pi 400 64 bit 1800 3353.1 1170.9 938.2 761.6 242.0
400/4B 32 bit 1.20 1.22 1.21 1.21 1.20 1.21
64 bit/32 bit 1.00 1.48 1.21 1.15 1.12 1.11
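The Average, Geomean and Harmean columns are three different ways of summarising the 24 kernel speeds. The following is a minimal sketch of the three means, applied to the rounded Pi 4B 32 bit kernel figures listed above. The function names are illustrative assumptions, and the benchmark derives its summary from its own unrounded measurements, so the output of this sketch need not match the reported summary line exactly.

```python
import math

# Rounded MFLOPS for the 24 kernels, Pi 4B 1500 MHz 32 bit, copied from above.
kernels_pi4b_32bit = [
    1480, 1017, 974, 930, 383, 657, 1624, 1861, 1664, 617, 498, 741,
    221, 320, 803, 640, 737, 1003, 451, 378, 1047, 411, 763, 187,
]


def arithmetic_mean(values):
    """Sum of the values divided by their count."""
    return sum(values) / len(values)


def geometric_mean(values):
    """n-th root of the product, computed via logarithms to avoid overflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def harmonic_mean(values):
    """Count divided by the sum of reciprocals; dominated by the slowest kernels."""
    return len(values) / sum(1.0 / v for v in values)


print(f"Average {arithmetic_mean(kernels_pi4b_32bit):.1f}")
print(f"Geomean {geometric_mean(kernels_pi4b_32bit):.1f}")
print(f"Harmean {harmonic_mean(kernels_pi4b_32bit):.1f}")
```

For any set of positive values the harmonic mean is no greater than the geometric mean, which is no greater than the arithmetic mean, the same ordering seen in the summary table.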
Fast Fourier Transforms Benchmarks - fft1PiC8, fft3cPiC8, fft1Pi64g, fft3cPi64g8
This is a real application provided by my collaborator at Compuserve Forum. There are two benchmarks. The first one is the
original C program. The second is an optimised version, originally using my x86 assembly code, but translated back into C code,
making use of the partitioning and (my) arrangement to optimise for burst reading from RAM. Three measurements use both single
and double precision data, calculating FFT sizes between 1K and 1024K, with data from caches and RAM. Note that steps in
performance levels occur at data size changes from L1 to L2 caches, then to RAM.
Following are average running times from the three passes of each FFT calculation. Performance can vary, particularly for the
calculations with the shortest running times. But they indicate that those that are CPU speed dependent were around 20% faster
using the 1800 MHz processor at 32 bit working. Then the remainder, affected by increasing dependency on RAM speed, showed
no gain.
The Pi 400 64 bit and 32 bit comparisons indicate that the FFT calculations involving RAM data transfers are of similar speeds.
Then, as with the Linpack benchmarks, single precision performance gains are more apparent than those with double precision.
Time in milliseconds Comparison
Pi 4B FFT1 32b Pi 4B FFT3 32b
SP DP SP DP
Size K
1 0.04 0.04 0.05 0.04
2 0.08 0.13 0.10 0.10
4 0.29 0.34 0.24 0.23
8 0.79 0.82 0.57 0.51
16 1.65 1.85 1.32 1.19
32 3.76 4.71 2.69 3.30
64 8.82 30.64 6.60 9.47
128 58.54 132.41 16.92 23.85
256 275.44 373.12 37.61 55.97
512 780.89 751.27 81.54 128.13
1024 1578.70 1812.20 186.45 288.27
Pi 400 FFT1 32b Pi 400 FFT3 32b 400/4B1 FFT1 400/4B1 FFT3
SP DP SP DP SP DP SP DP Average
Size K
1 0.03 0.03 0.05 0.04 1.17 1.26 1.09 1.11 1.16
2 0.07 0.10 0.09 0.08 1.15 1.34 1.13 1.24 1.21
4 0.21 0.29 0.21 0.19 1.35 1.19 1.17 1.20 1.23
8 0.63 0.82 0.47 0.42 1.26 1.00 1.21 1.20 1.17
16 1.45 1.60 1.27 1.02 1.13 1.16 1.04 1.16 1.12
32 3.54 4.22 2.80 3.08 1.06 1.12 0.96 1.07 1.05
64 7.72 37.12 6.94 8.83 1.14 0.83 0.95 1.07 1.00
128 55.94 111.66 15.70 22.27 1.05 1.19 1.08 1.07 1.10
256 230.26 326.14 34.93 53.51 1.20 1.14 1.08 1.05 1.12
512 667.66 901.75 76.35 121.06 1.17 0.83 1.07 1.06 1.03
1024 1503.53 1948.66 167.64 279.32 1.05 0.93 1.11 1.03 1.03
Pi 400 FFT1 64b Pi 400 FFT3 64b 64b/32b FFT1 64b/32b FFT3
SP DP SP DP SP DP SP DP Average
Size K
1 0.03 0.03 0.03 0.03 1.00 0.94 1.36 1.06 1.09
2 0.07 0.11 0.07 0.08 1.05 0.85 1.26 1.01 1.04
4 0.18 0.29 0.17 0.19 1.16 0.99 1.24 0.99 1.10
8 0.52 0.84 0.38 0.44 1.20 0.98 1.23 0.96 1.09
16 1.27 1.57 0.96 1.03 1.14 1.02 1.32 0.99 1.12
32 3.25 4.00 1.96 3.00 1.09 1.06 1.43 1.02 1.15
64 7.24 28.72 5.35 9.97 1.07 1.29 1.30 0.89 1.14
128 45.45 187.78 14.83 23.77 1.23 0.59 1.06 0.94 0.96
256 321.30 465.17 36.13 52.24 0.72 0.70 0.97 1.02 0.85
512 825.72 1073.88 77.95 113.83 0.81 0.84 0.98 1.06 0.92
1024 1622.96 2014.74 166.64 250.60 0.93 0.97 1.01 1.11 1.00
BusSpeed Benchmark - busspeedPiC8, busspeedPi64g8
This is a read only benchmark with data from caches and RAM. The program reads one word with 32 word increments for the
next one, then skips following data words by decreasing increments, finally reading all data. This shows where data is read in bursts,
enabling estimates to be made of bus speeds, as 16 times the speed of appropriate measurements at Inc16.
Performance gains of the 32 bit Pi 400 PC CPU increased in line with MHz difference, using data from L1 and L2 caches, with no
gain using RAM based data.
With the Pi 400 at 64 bits, RAM speeds were, again, virtually the same as at 32 bits. Based on reading all data, average 64 bit
cache based performance gains were 55%. With L1 cache based data, the 64 bit compilation appears to generate less efficient
code at the larger increments, with results resembling burst reading effects.
Reading Speed 4 Byte Words in MBytes/Second
Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read
KBytes Words Words Words Words Words All
Pi 4B 1500 MHz 32 bits
16 4880 5075 5612 5852 5877 5864
32 846 1138 2153 3229 4908 5300
64 746 1019 2035 3027 4910 5360
128 728 983 1952 2908 4888 5389
256 683 934 1901 2794 4874 5431
512 656 900 1760 2625 4585 5259
1024 301 410 870 1356 2846 4238
4096 233 248 531 996 2151 4045
16384 236 258 511 891 2143 4011
65536 237 257 508 881 2172 4015
Pi 400 1800 MHz 32 bits
16 5859 6098 6734 7023 7053 7036
32 1726 2247 3593 5034 6093 6412
64 800 1098 2259 3425 5886 6506
128 825 1125 2258 3353 5842 6513
256 815 1125 2279 3351 5837 6534
512 822 1103 2198 3308 5849 6499
1024 315 533 1035 1172 3134 4961
4096 232 266 557 1062 2148 4256
16384 239 256 487 940 1987 3787
65536 227 256 481 935 1945 3766
Pi 400/4B 32 bits
16 1.20 1.20 1.20 1.20 1.20 1.20
32 2.04 1.97 1.67 1.56 1.24 1.21
64 1.07 1.08 1.11 1.13 1.20 1.21
128 1.13 1.14 1.16 1.15 1.20 1.21
256 1.19 1.20 1.20 1.20 1.20 1.20
512 1.25 1.23 1.25 1.26 1.28 1.24
1024 1.05 1.30 1.19 0.86 1.10 1.17
4096 1.00 1.07 1.05 1.07 1.00 1.05
16384 1.01 0.99 0.95 1.05 0.93 0.94
65536 0.96 1.00 0.95 1.06 0.90 0.94
Pi 400 1800 MHz 64 bits
16 1576 2079 4920 6419 6612 10274
32 1506 1857 3213 4859 6087 10126
64 965 1239 2442 3969 5844 10015
128 885 1142 2246 3773 5889 10266
256 880 1129 2271 3782 5909 10346
512 875 1135 2203 3682 5818 10175
1024 425 570 1105 1973 3312 6064
4096 246 259 560 1122 2182 4276
16384 236 256 493 987 1968 3921
65536 243 258 477 944 1887 3780
Pi 400 64 bits/32 bits
16 0.27 0.34 0.73 0.91 0.94 1.46
32 0.87 0.83 0.89 0.97 1.00 1.58
64 1.21 1.13 1.08 1.16 0.99 1.54
128 1.07 1.02 0.99 1.13 1.01 1.58
256 1.08 1.00 1.00 1.13 1.01 1.58
512 1.06 1.03 1.00 1.11 0.99 1.57
1024 1.35 1.07 1.07 1.68 1.06 1.22
4096 1.06 0.97 1.01 1.06 1.02 1.00
16384 0.99 1.00 1.01 1.05 0.99 1.04
65536 1.07 1.01 0.99 1.01 0.97 1.00
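The bus speed estimate described for this benchmark is 16 times the Inc16 reading speed. A minimal sketch of that arithmetic, using the 65536 KB (RAM) rows of the three result tables, is shown below. The function and dictionary names are illustrative assumptions, and the Read All figures are included only for comparison.

```python
def estimated_bus_speed(inc16_mb_per_s, words_per_burst=16):
    """Estimate bus speed in MB/second as 16 times the Inc16 reading speed."""
    return inc16_mb_per_s * words_per_burst


# (Inc16, Read All) in MB/second at 65536 KB, copied from the tables.
ram_rows = {
    "Pi 4B 1500 MHz 32 bits": (257, 4015),
    "Pi 400 1800 MHz 32 bits": (256, 3766),
    "Pi 400 1800 MHz 64 bits": (258, 3780),
}

for system, (inc16, read_all) in ram_rows.items():
    estimate = estimated_bus_speed(inc16)
    print(f"{system}: estimated bus speed {estimate} MB/s, Read All {read_all} MB/s")
```

With these inputs the estimates are 4112, 4096 and 4128 MB/second, with no sign of a gain from the faster CPU clock.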
MemSpeed Benchmark MB/Second - memspeedPiC8, memspeedPi64g8
The benchmark includes CPU speed dependent calculations using data from caches and RAM, via single and double precision
floating point and integer functions. The instruction sequences used are shown in the results column titles.
Subject to normal variations, 32 bit comparisons again indicate the expected 20% improved performance of the later processor,
at the lower data sizes, and no difference accessing RAM.
Under 64 bit working, it seems that performance from RAM can still have CPU speed influences, where the 64/32 bit performance
ratio can vary from 1.0. With data from caches, 64 bit floating point functions were mainly faster than at 32 bits with double
precision operation, but similar on integer calculations. Completely unexpectedly, the 64 bit single precision floating point tests
were slower than those at double precision, making the 32 bit benchmark appear to be more than twice as fast. See 64 Bit Danger.
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
Pi 4B 1500 MHz 32 bits
8 11768 9844 3841 11787 9934 4351 10309 7816 7804
16 11880 9880 3822 11886 10043 4363 10484 7902 7892
32 9539 8528 3678 9517 8661 4098 10564 7948 7945
64 9952 9310 3733 9997 9470 4160 8452 7717 7732
128 9947 9591 3757 9990 9757 4178 8205 7680 7753
256 10015 9604 3758 10030 9781 4186 8120 7734 7707
512 9073 9300 3751 9472 9526 4175 7995 7709 7602
1024 2681 5303 3594 2664 4965 3760 4828 3592 3569
2048 1671 3488 3242 1757 3635 3540 2882 1036 1023
4096 1777 3700 3283 1827 3627 3555 2433 1052 1054
8192 1931 3805 3420 1933 3815 3629 2465 980 971
Pi 400 1800 MHz 32 bits
8 14084 11813 4591 14142 12013 5226 12383 9380 9364
16 14259 11857 4586 14263 12061 5243 12589 9483 9476
32 14323 11877 4563 14321 12078 5114 12688 9539 9478
64 12010 11155 4479 11965 11345 4980 10951 9127 9121
128 12147 11512 4515 11998 11714 5030 9677 9176 9191
256 12149 11527 4522 12000 11735 5026 9683 9145 9249
512 11383 11071 4508 10765 11467 5007 9675 9187 9139
1024 3531 6947 4300 4229 7006 4662 5530 5380 5325
2048 1730 3427 3507 1979 3938 3947 2836 1021 1022
4096 1772 4032 3891 2027 3484 4044 2511 1038 1038
8192 2016 4005 3896 2021 3908 3956 2544 1000 1003
Pi 400/4B 32 bits
8 1.20 1.20 1.20 1.20 1.21 1.20 1.20 1.20 1.20
16 1.20 1.20 1.20 1.20 1.20 1.20 1.20 1.20 1.20
32 1.50 1.39 1.24 1.50 1.39 1.25 1.20 1.20 1.19
64 1.21 1.20 1.20 1.20 1.20 1.20 1.30 1.18 1.18
128 1.22 1.20 1.20 1.20 1.20 1.20 1.18 1.19 1.19
256 1.21 1.20 1.20 1.20 1.20 1.20 1.19 1.18 1.20
512 1.25 1.19 1.20 1.14 1.20 1.20 1.21 1.19 1.20
1024 1.32 1.31 1.20 1.59 1.41 1.24 1.15 1.50 1.49
2048 1.04 0.98 1.08 1.13 1.08 1.11 0.98 0.99 1.00
4096 1.00 1.09 1.19 1.11 0.96 1.14 1.03 0.99 0.98
8192 1.04 1.05 1.14 1.05 1.02 1.09 1.03 1.02 1.03
Pi 400 1800 MHz 64 bits
8 18133 4792 4749 18693 5259 5275 13962 11182 11182
16 13147 4574 4532 13052 5015 5043 14049 11327 11340
32 16248 4614 4702 16355 5030 5090 13598 11393 11391
64 15292 4617 4710 15106 5020 5056 11114 10488 10527
128 14771 4641 4734 14603 5007 5058 9832 9836 9837
256 14783 4646 4716 14698 5053 5063 9666 9768 9809
512 14842 4648 4717 14705 5057 5066 9768 9925 9877
1024 5441 4436 4494 5484 4486 4646 3852 4179 4389
2048 1703 3940 3918 2034 4037 4036 2913 2918 2874
4096 2053 3968 4025 2060 4091 4070 2735 2714 2685
8192 2036 3940 3882 2034 3935 3995 2642 2643 2638
Pi 400 64 bits/32 bits
8 1.29 0.41 1.03 1.32 0.44 1.01 1.13 1.19 1.19
16 0.92 0.39 0.99 0.92 0.42 0.96 1.12 1.19 1.20
32 1.13 0.39 1.03 1.14 0.42 1.00 1.07 1.19 1.20
64 1.27 0.41 1.05 1.26 0.44 1.02 1.01 1.15 1.15
128 1.22 0.40 1.05 1.22 0.43 1.01 1.02 1.07 1.07
256 1.22 0.40 1.04 1.22 0.43 1.01 1.00 1.07 1.06
512 1.30 0.42 1.05 1.37 0.44 1.01 1.01 1.08 1.08
1024 1.54 0.64 1.05 1.30 0.64 1.00 0.70 0.78 0.82
2048 0.98 1.15 1.12 1.03 1.03 1.02 1.03 2.86 2.81
4096 1.16 0.98 1.03 1.02 1.17 1.01 1.09 2.61 2.59
8192 1.01 0.98 1.00 1.01 1.01 1.01 1.04 2.64 2.63
NeonSpeed Benchmark MB/Second - NeonSpeedC8, NeonSpeedPi64g8
This carries out some of the same calculations as MemSpeed. All results are for 32 bit floating point and integer calculations.
Norm functions were as generated by the compiler, with the NEON versions produced through using intrinsic functions.
32 bit performance ratios, at different data sizes, were essentially the same as those for the MemSpeed benchmark, around 1.2
from caches and 1.0 from RAM.
At 64 bits, the first column calculations are the same as in MemSpeed, where the 64 bit compiler produces ridiculous results. See
64 Bit Danger. Some gains are indicated with normal integer calculations. The others are from using NEON intrinsic functions,
where 64 bit vector instructions can be similar to those from 32 bit NEON code.
Vector Reading Speed in MBytes/Second
Memory Float v=v+s*v Int v=v+v+s Neon v=v+v
KBytes Norm Neon Norm Neon Float Int
Pi 4B 1500 MHz 32 bits
16 9884 12882 3910 12773 13090 15133
32 9904 13061 3916 13002 13162 15239
64 9029 11526 3450 10704 11708 12084
128 9242 11784 3391 11016 11816 12179
256 9283 11890 3396 11215 11929 12284
512 9043 10680 3413 10211 10925 11241
1024 5818 3310 3507 3288 3239 2902
4096 4060 1994 3497 1991 2009 2011
16384 4030 2063 3445 2068 2072 2067
65536 3936 2109 3391 1858 2122 2121
Pi 400 1800 MHz 32 bits
16 11860 15690 4736 15664 15690 18155
32 11884 15563 4702 15462 15770 17912
64 10911 14009 4085 13499 14226 14435
128 11003 14111 4065 13282 14149 14400
256 11098 14167 4079 13366 14164 14490
512 10756 12779 4087 12544 13100 13275
1024 5812 3561 4161 3314 3268 3352
4096 3798 1813 3666 1870 1870 1868
16384 3892 1988 3850 1928 1991 1990
65536 3815 2031 3719 2044 2022 2018
Pi 400/4B 32 bits
16 1.20 1.22 1.21 1.23 1.20 1.20
32 1.20 1.19 1.20 1.19 1.20 1.18
64 1.21 1.22 1.18 1.26 1.22 1.19
128 1.19 1.20 1.20 1.21 1.20 1.18
256 1.20 1.19 1.20 1.19 1.19 1.18
512 1.19 1.20 1.20 1.23 1.20 1.18
1024 1.00 1.08 1.19 1.01 1.01 1.16
4096 0.94 0.91 1.05 0.94 0.93 0.93
16384 0.97 0.96 1.12 0.93 0.96 0.96
65536 0.97 0.96 1.10 1.10 0.95 0.95
Pi 400 1800 MHz 64 bits
16 4496 19696 4790 17870 18908 21817
32 4302 16223 4658 14602 16138 16890
64 4043 13754 4620 13009 13995 14035
128 4002 14077 4700 13371 14157 14231
256 3992 14148 4716 13508 14311 14312
512 4007 14178 4716 13649 14524 14515
1024 3867 5478 4490 5301 5458 5531
4096 3706 2088 4070 2092 2101 2098
16384 3636 2063 3985 2062 2058 2057
65536 3319 2057 3803 2011 2059 2063
Pi 400 64 bits/32 bits
16 0.38 1.26 1.01 1.14 1.21 1.20
32 0.36 1.04 0.99 0.94 1.02 0.94
64 0.37 0.98 1.13 0.96 0.98 0.97
128 0.36 1.00 1.16 1.01 1.00 0.99
256 0.36 1.00 1.16 1.01 1.01 0.99
512 0.37 1.11 1.15 1.09 1.11 1.09
1024 0.67 1.54 1.08 1.60 1.67 1.65
4096 0.98 1.15 1.11 1.12 1.12 1.12
16384 0.93 1.04 1.04 1.07 1.03 1.03
65536 0.87 1.01 1.02 0.98 1.02 1.02
MultiThreading Benchmarks
Most of the multithreading benchmarks execute the same calculations using 1, 2, 4 and 8 threads. One of them, MP-MFLOPS, is
available in two different versions, using standard compiled “C” code for single and double precision arithmetic. A further version
uses NEON intrinsic functions. Another variety uses OpenMP procedures for automatic parallelism.
MP-Whetstone Benchmark - MP-WHETSPC8, MP-WHETSPi64g8
Multiple threads each run the eight test functions at the same time, but with some dedicated variables. Measured speed is based
on the last thread to finish. Performance was generally proportional to the number of cores used. Overall seconds indicates MP
efficiency.
During the 32 bit tests, except for the last memory copy (Equal) test and the 8 thread section, which can have variable performance,
the 1800 MHz gain ratios were effectively at the expected 1.20 level.
Performance of the Pi 400 64 bit version was similar to that at 32 bits on a number of test functions, but the overall score
indicated a 14% improvement at all significant thread counts.
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
1 2 3 MOPS MOPS MOPS MOPS MOPS
Pi 4B 1500 MHz 32 bits
1T 1889.5 538.7 537.6 311.4 56.3 26.1 7450.5 2243.2 659.9
2T 3782.7 1065.5 1071.2 627.1 112.3 52.0 14525.7 4460.9 1327.3
4T 7564.1 2101.0 2145.9 1250.4 225.0 104.1 29430.5 8944.2 2660.8
8T 8003.6 2598.8 2797.0 1313.0 233.2 110.4 37906.3 10786.7 2799.4
Overall Seconds 4.99 1T, 5.00 2T, 5.03 4T, 10.06 8T
Pi 400 1800 MHz 32 bits
1T 2269.3 639.9 646.3 376.3 67.6 31.3 8975.2 2690.7 762.9
2T 4533.6 1275.0 1292.1 752.8 134.5 62.6 17936.5 5348.1 1539.8
4T 9057.1 2527.4 2574.9 1502.3 269.8 124.6 35455.7 10736.3 3086.1
8T 9658.2 3009.7 3068.6 1577.6 287.3 133.8 46331.2 13940.2 3208.7
Overall Seconds 5.08 1T, 5.08 2T, 5.12 4T, 10.22 8T
Pi 400/4B 32 bits
1T 1.20 1.19 1.20 1.21 1.20 1.20 1.20 1.20 1.16
2T 1.20 1.20 1.21 1.20 1.20 1.20 1.23 1.20 1.16
4T 1.20 1.20 1.20 1.20 1.20 1.20 1.20 1.20 1.16
8T 1.21 1.16 1.10 1.20 1.23 1.21 1.22 1.29 1.15
Pi 400 1800 MHz 64 bits
1T 2577.2 636.9 635.2 477.4 72.6 32.8 8878.4 2693.8 1198.3
2T 5153.3 1266.0 1264.9 954.4 145.2 65.6 17863.2 5388.7 2394.6
4T 10289.2 2520.1 2537.3 1906.5 290.3 131.1 33002.1 10732.0 4781.4
8T 10768.5 3027.7 3361.7 1960.4 298.5 137.4 45139.4 12877.9 4887.0
Overall Seconds 4.99 1T, 5.00 2T, 5.04 4T, 10.10 8T
Pi 400 64 bits/32 bits
1T 1.14 1.00 0.98 1.27 1.07 1.05 0.99 1.00 1.57
2T 1.14 0.99 0.98 1.27 1.08 1.05 1.00 1.01 1.56
4T 1.14 1.00 0.99 1.27 1.08 1.05 0.93 1.00 1.55
8T 1.11 1.01 1.10 1.24 1.04 1.03 0.97 0.92 1.52
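One way to check the statement that performance was generally proportional to the number of cores used is to divide each multithreaded MWIPS rating by the single thread rating. The following is a minimal sketch using the Pi 4B 32 bit MWIPS column. The names `speedup` and `efficiency` are illustrative assumptions, with the core count of four taken from the system details earlier in the report.

```python
CORES = 4  # the Pi 4B and Pi 400 both report four CPU cores

# MWIPS by thread count, Pi 4B 1500 MHz 32 bits, copied from the table.
mwips_pi4b_32bit = {1: 1889.5, 2: 3782.7, 4: 7564.1, 8: 8003.6}


def speedup(rates, threads):
    """Rating with the given thread count relative to the single thread rating."""
    return rates[threads] / rates[1]


def efficiency(rates, threads, cores=CORES):
    """Speedup per core actually usable; close to 1.0 means proportional scaling."""
    return speedup(rates, threads) / min(threads, cores)


for threads in sorted(mwips_pi4b_32bit):
    print(f"{threads}T speedup {speedup(mwips_pi4b_32bit, threads):.2f} "
          f"efficiency {efficiency(mwips_pi4b_32bit, threads):.2f}")
```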
MP-Dhrystone Benchmark - MP-DHRYPiC8, MP-DHRYPi64g8
This executes multiple copies of the same program, but with some shared data, leading to unacceptable multithreaded
performance. The 32 bit single thread speeds were similar to the earlier Dhrystone result, with the usual improvement of
around 20% at 1800 MHz. The other results don’t mean much but were not too far from 20%, in this case.
The single thread test at 64 bits was 42% faster than at 32 bits, but the gain was much less using more than one thread.
MP-Dhrystone Benchmark
Using 1, 2, 4 and 8 Threads
Pi 4B 1500 MHz 32 bits
Threads 1 2 4 8
Seconds 0.79 1.21 2.62 4.88
Dhrystones per Second 10126308 13262168 12230188 13106002
VAX MIPS rating 5763 7548 6961 7459
Pi 400 1800 MHz 32 bits
Seconds 0.65 1.00 2.09 3.87
Dhrystones per Second 12259203 15949971 15292691 16517837
VAX MIPS rating 6977 9078 8704 9401
Pi 400/4B 32 bits
Comparison 1.21 1.20 1.25 1.26
Pi 400 1800 MHz 64 bits
Seconds 0.92 1.78 3.58 7.17
Dhrystones per Second 17447778 17937022 17879626 17858080
VAX MIPS rating 9930 10209 10176 10164
Pi 400 64 bits/32 bits
Comparison 1.42 1.12 1.17 1.08
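The VAX MIPS rating is conventionally obtained by dividing Dhrystones per second by 1757, the score attributed to the VAX 11/780 reference machine. A minimal sketch, assuming that divisor and using a hypothetical function name (`vax_mips`), reproduces the ratings in the table from the Dhrystones per Second lines:

```python
VAX_11_780_DHRYSTONES_PER_SECOND = 1757  # conventional reference score


def vax_mips(dhrystones_per_second):
    """Convert Dhrystones per second to a VAX MIPS (DMIPS) rating."""
    return round(dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SECOND)


# Dhrystones per second copied from the table: 1, 2, 4 and 8 threads.
pi4b_32bit = [10126308, 13262168, 12230188, 13106002]
pi400_32bit = [12259203, 15949971, 15292691, 16517837]
pi400_64bit = [17447778, 17937022, 17879626, 17858080]

print([vax_mips(d) for d in pi4b_32bit])
print([vax_mips(d) for d in pi400_32bit])
print([vax_mips(d) for d in pi400_64bit])
```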
MP SP NEON Linpack Benchmark - linpackNeonMPC8, linpackMPNeonPi64g8
This was produced to show that the original Linpack benchmark was completely unsuitable for benchmarking multiple CPUs or
cores, and this is reflected in the results. The program uses NEON intrinsic functions, with increasing data sizes.
At 32 bits, the unthreaded N=100 1800 MHz performance gains were close to expectations. At N=500, memory demands border
on the L2 cache/RAM area that can lead to inconsistent results, with N=1000 limited by RAM speed.
The Pi 400 64 bit version had similarly slow multithreaded performance to the other examples and faster N=100 single thread
performance, compared with the Pi 4B.
Linpack Single Precision MultiThreaded Benchmark
Using NEON Intrinsics
MFLOPS 0 to 4 Threads, N 100, 500, 1000
Threads None 1 2 4
Pi 4B 1500 MHz 32 bits
N 100 2007.38 112.55 107.85 106.98
N 500 1332.24 686.10 686.11 689.02
N 1000 402.61 435.26 432.21 432.01
Pi 400 1800 MHz 32 bits
N 100 2345.82 109.74 104.55 104.73
N 500 2022.64 812.55 827.02 819.69
N 1000 423.56 438.79 440.90 443.00
Pi 400/4B 32 bits
N 100 1.17 0.98 0.97 0.98
N 500 1.52 1.18 1.21 1.19
N 1000 1.05 1.01 1.02 1.03
Pi 400 1800 MHz 64 bits
N 100 2611.69 95.68 97.32 97.17
N 500 1611.41 660.61 658.51 654.51
N 1000 409.08 436.27 435.62 416.79
Pi 400 64 bits/32 bits
N 100 1.11 0.87 0.93 0.93
N 500 0.80 0.81 0.80 0.80
N 1000 0.97 0.99 0.99 0.94
MP BusSpeed (read only) Benchmark - MP-BusSpd2PiC8, MP-BusSpd2Pi64g8
Each thread accesses all of the data in separate sections, covering caches and RAM, starting at different points, the latter to
avoid misrepresentation of performance using shared L2 cache.
Performance variations can be expected from this benchmark, but 32 bit 1800/1500 MHz ratios can be interpreted
as around 1.2 using cache based data and 1.0 from RAM.
The 64 bit compiler somehow manages to lose its way on decreasing addressing increments after Inc8, leading to the 32 bit
version appearing to be up to three times faster.
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
Pi 4B 1500 MHz 32 bits
12.3 1T 5310 5616 5801 5898 5940 13425
2T 9393 10008 11293 11293 11368 24932
4T 15781 15015 17606 19034 22279 40736
8T 8465 9599 14580 18465 20034 36831
122.9 1T 664 930 1861 3191 5017 10281
2T 564 726 1523 5376 9387 18985
4T 486 919 1886 4289 8337 16979
8T 487 912 1854 4275 8271 16826
12288 1T 225 258 514 1010 1992 3975
2T 202 421 450 1765 3307 7396
4T 261 288 825 1332 1772 5014
8T 218 273 496 1041 2571 4021
Pi 400 1800 MHz 32 bits
12.3 1T 6387 6760 6851 6749 7138 16074
2T 11269 9282 11428 12968 13224 29163
4T 16205 12617 16819 21992 26820 52253
8T 9723 13073 17001 21763 23693 43707
122.9 1T 797 1117 2070 3794 6006 12286
2T 690 834 1786 5527 11282 22883
4T 578 1102 2259 5090 9868 19940
8T 576 1101 2242 5041 9870 19975
12288 1T 230 255 495 1003 1981 3993
2T 210 223 906 1587 1918 3794
4T 303 283 476 904 2082 3549
8T 250 238 524 1046 2284 3701
Pi 400/4B 32 bits
12.3 1T 1.20 1.20 1.18 1.14 1.20 1.20
2T 1.20 0.93 1.01 1.15 1.16 1.17
4T 1.03 0.84 0.96 1.16 1.20 1.28
8T 1.15 1.36 1.17 1.18 1.18 1.19
122.9 1T 1.20 1.20 1.11 1.19 1.20 1.20
2T 1.22 1.15 1.17 1.03 1.20 1.21
4T 1.19 1.20 1.20 1.19 1.18 1.17
8T 1.18 1.21 1.21 1.18 1.19 1.19
12288 1T 1.02 0.99 0.96 0.99 0.99 1.00
2T 1.04 0.53 2.01 0.90 0.58 0.51
4T 1.16 0.98 0.58 0.68 1.17 0.71
8T 1.15 0.87 1.06 1.00 0.89 0.92
Pi 400 1800 MHz 64 bits
12.3 1T 6198 6657 6770 5048 4907 5080
2T 8825 12952 11558 9422 9531 9937
4T 10051 11518 17686 16592 18680 19757
8T 8994 10828 16669 16241 16797 19140
122.9 1T 718 1114 2257 3357 4610 4877
2T 676 939 2587 5818 9114 9703
4T 579 1126 2447 4861 9587 17663
8T 572 1119 2427 4911 9556 16538
12288 1T 226 255 473 940 1882 3738
2T 311 297 446 944 1880 3786
4T 236 352 568 1352 2773 3205
8T 246 263 563 931 1904 4182
Pi 400 64 bits/32 bits
12.3 1T 0.97 0.98 0.99 0.75 0.69 0.32
2T 0.78 1.40 1.01 0.73 0.72 0.34
4T 0.62 0.91 1.05 0.75 0.70 0.38
8T 0.93 0.83 0.98 0.75 0.71 0.44
122.9 1T 0.90 1.00 1.09 0.88 0.77 0.40
2T 0.98 1.13 1.45 1.05 0.81 0.42
4T 1.00 1.02 1.08 0.96 0.97 0.89
8T 0.99 1.02 1.08 0.97 0.97 0.83
12288 1T 0.98 1.00 0.96 0.94 0.95 0.94
2T 1.48 1.33 0.49 0.59 0.98 1.00
4T 0.78 1.24 1.19 1.50 1.33 0.90
8T 0.98 1.11 1.07 0.89 0.83 1.13
MP RandMem Benchmark - MP-RandMemPiC8, MP-RandMemPi64g8
The benchmark uses the same complex indexing for serial and random access, with separate read only and read/write tests. The
performance patterns were as expected. Random access speed depends on the impact of burst reading and writing, producing
the slow speeds shown. Read only performance increased, as expected, in line with the thread count, whereas read/write
performance remained constant at a particular data size, probably due to write back to shared data space.
Performance comparisons at 32 bits again indicated 20% gains at 1800 MHz using cached data and no gain using RAM.
On the Pi 400, 64 bit performance was generally the same as at 32 bits, with no scope for vectorisation.
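As an illustration only, the idea of using one indexed loop for both serial and random access can be sketched as below. This is not the benchmark source: the function names (fill_serial, shuffle_index, sum_indexed) and the small random number generator are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill idx[] so that reads through it walk the data in order. */
void fill_serial(uint32_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        idx[i] = (uint32_t)i;
}

/* Shuffle idx[] in place (Fisher-Yates) using a small LCG, so the same
   indexed loop then touches the data in random order. */
void shuffle_index(uint32_t *idx, size_t n, uint32_t seed)
{
    assert(n > 0);
    for (size_t i = n - 1; i > 0; i--) {
        seed = seed * 1664525u + 1013904223u;
        size_t j = (size_t)(seed >> 8) % (i + 1);
        uint32_t t = idx[i];
        idx[i] = idx[j];
        idx[j] = t;
    }
}

/* Read only test: the loop is identical for serial and random access,
   only the contents of idx[] differ. */
uint32_t sum_indexed(const uint32_t *data, const uint32_t *idx, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[idx[i]];
    return sum;
}
```

Because the shuffled index is a permutation of the serial one, both orders read every word exactly once, so any speed difference comes from the access pattern rather than the amount of work.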
KB SerRD SerRDWR RndRD RndRDWR
Pi 4B 1500 MHz 32 bits
12.3 1T 5950 7903 5945 7896
2T 11849 7923 11887 7917
4T 23404 7785 23395 7761
8T 21903 7669 23104 7655
122.9 1T 5670 7309 2002 1924
2T 10682 7285 1648 1923
4T 9944 7266 1813 1927
8T 9896 7216 1812 1919
12288 1T 3904 1075 179 164
2T 7317 1055 215 164
4T 3398 1063 343 165
8T 4156 1062 350 165
Pi 400 1800 MHz 32 bits
12.3 1T 7135 9348 7136 9370
2T 14256 9359 14273 9352
4T 28119 9240 28114 9258
8T 26441 9130 26328 9144
122.9 1T 6790 8281 2381 2246
2T 12555 8297 2098 2313
4T 11951 8481 2177 2317
8T 12044 8485 2155 2305
12288 1T 3777 946 178 162
2T 7474 1066 211 165
4T 4319 1184 343 164
8T 4407 1227 340 165
Pi 400/4B 32 bits
12.3 1T 1.20 1.18 1.20 1.19
2T 1.20 1.18 1.20 1.18
4T 1.20 1.19 1.20 1.19
8T 1.21 1.19 1.14 1.19
122.9 1T 1.20 1.13 1.19 1.17
2T 1.18 1.14 1.27 1.20
4T 1.20 1.17 1.20 1.20
8T 1.22 1.18 1.19 1.20
12288 1T 0.97 0.88 0.99 0.99
2T 1.02 1.01 0.98 1.01
4T 1.27 1.11 1.00 0.99
8T 1.06 1.16 0.97 1.00
Pi 400 1800 MHz 64 bits
12.3 1T 7138 9489 7129 9478
2T 14187 9516 13922 9506
4T 22329 9352 23537 9153
8T 20274 9216 24488 9252
122.9 1T 6921 8444 2397 2242
2T 13046 8419 1983 2339
4T 12397 8443 2127 2347
8T 12567 8295 2127 2371
12288 1T 2761 1264 183 167
2T 7408 1278 200 162
4T 3354 772 254 167
8T 3993 1251 253 141
Pi 400 64 bits/32 bits
12.3 1T 1.00 1.02 1.00 1.01
2T 1.00 1.02 0.98 1.02
4T 0.79 1.01 0.84 0.99
8T 0.77 1.01 0.93 1.01
122.9 1T 1.02 1.02 1.01 1.00
2T 1.04 1.01 0.95 1.01
4T 1.04 1.00 0.98 1.01
8T 1.04 0.98 0.99 1.03
12288 1T 0.73 1.34 1.03 1.03
2T 0.99 1.20 0.95 0.98
4T 0.78 0.65 0.74 1.02
8T 0.91 1.02 0.74 0.85
MP-MFLOPS Benchmarks - MP-MFLOPSPiC8, MP-MFLOPSDPC8, MP-NeonMFLOPSC8,
MP-MFLOPSPi64g8, MP-MFLOPSDPPi64g8, MP-NeonMFLOPSPi64g8
MP-MFLOPS measures floating point speed on data from caches and RAM. The first calculations are as used in Memory Speed
Benchmark, with a multiply and an add per data word read. The second uses 32 operations per input data word of the form x[i] =
(x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f -- more. Tests cover 1, 2, 4 and 8 threads, each carrying out the same calculations but
accessing different segments of the data. There are three varieties, single precision, double precision and single precision
through NEON intrinsic functions, all attempting to show near maximum MP floating point processing speeds.
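A minimal sketch of the arrangement described above, assuming a simple even split of the data between threads. Only the part of the expression quoted in the text is coded (three adds inside the brackets, three multiplies, one subtract and one add, eight operations in all); the further terms of the 32 operation version are not shown in this report and are not guessed here. The names kernel_shown_terms and segment_bounds are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Only the terms quoted in the text; the benchmark's 32 operation
   version continues with further terms that are not reproduced here. */
void kernel_shown_terms(float *x, size_t begin, size_t end,
                        float a, float b, float c, float d, float e, float f)
{
    for (size_t i = begin; i < end; i++)
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
}

/* Hypothetical split: thread t of nthreads works on words [*begin, *end). */
void segment_bounds(size_t nwords, unsigned nthreads, unsigned t,
                    size_t *begin, size_t *end)
{
    assert(nthreads > 0 && t < nthreads);
    *begin = nwords * t / nthreads;
    *end   = nwords * (t + 1) / nthreads;
}
```

In a threaded program each thread would call segment_bounds with its own number and then run the kernel on that segment, so every thread carries out the same calculations on a different part of the array.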
At 32 bits, subject to normal intermittent variations, the Pi 400 exhibited the normal 20% performance increase over the Pi
4B using the cache based data sizes of 12.8 and 128 KB, but similar performance at 12.8 MB from RAM, where there was little
dependence on floating point calculating times. Note the NEON 32 bit performance gains.
Pi 400 performance was similar at 64 bits and 32 bits using RAM, again where there was little dependence on arithmetic
calculations. With cached data, Pi 400 single precision speed improved considerably, typically 2.5 times faster, but half that at
double precision. The NEON cached calculations were generally somewhat faster, with the two processors executing different
varieties of vector instructions.
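The comparison columns in the tables that follow are simple speed ratios, and the "normal 20%" corresponds to the 1800/1500 MHz clock ratio of 1.20. A small sketch of that arithmetic is given below; the function names and the tolerance are assumptions for illustration, not part of the benchmark.

```c
#include <assert.h>

/* Speed ratio as shown in the comparison columns (new / old). */
double speed_ratio(double new_mflops, double old_mflops)
{
    assert(old_mflops > 0.0);
    return new_mflops / old_mflops;
}

/* Non-zero when a ratio is within tol of the 1800/1500 clock ratio. */
int tracks_clock(double ratio, double tol)
{
    const double clock_ratio = 1800.0 / 1500.0;
    double diff = ratio - clock_ratio;
    if (diff < 0.0)
        diff = -diff;
    return diff <= tol;
}
```

For example, the single thread 32 Ops/Word results at 12.8 KB (3380 against 2814 MFLOPS) give a ratio of about 1.20, following the clock, while the 2 Ops/Word results at 12800 KB (593 against 520 MFLOPS) give about 1.14.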
Single Precision MFLOPS Comparisons
2 Ops/Word 32 Ops/Word 2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800 12.8 128 12800 12.8 128 12800
Pi 4B 1500 MHz 32 bits Max 11.0 GFLOPS
1T 1224 1257 520 2814 2800 2803
2T 2485 2257 525 5608 5575 5576
4T 4119 3243 534 11018 10645 8358
8T 4131 4618 541 9941 10339 8165
Pi 400 1800 MHz 32 bits Max 13.2 GFLOPS Pi 400/4B
1T 1526 1504 593 3380 3366 3356 1.25 1.20 1.14 1.20 1.20 1.20
2T 2972 2593 598 6725 6696 6675 1.20 1.15 1.14 1.20 1.20 1.20
4T 5414 5304 621 13179 13144 9339 1.31 1.64 1.16 1.20 1.23 1.12
8T 4862 5790 602 12055 13010 9343 1.18 1.25 1.11 1.21 1.26 1.14
Pi 400 1800 MHz 64 bits Max 30.9 GFLOPS Pi 400 64 bits/32 bits
1T 3995 3752 476 8108 8066 7436 2.62 2.49 0.80 2.40 2.40 2.22
2T 7743 5887 594 15761 15762 10185 2.61 2.27 0.99 2.34 2.35 1.53
4T 13859 13899 600 30674 30891 10476 2.56 2.62 0.97 2.33 2.35 1.12
8T 12741 13809 583 27381 30559 8492 2.62 2.38 0.97 2.27 2.35 0.91
NEON Intrinsic Functions MFLOPS
Pi 4B 1500 MHz 32 bits Max 17.2 GFLOPS NEON SP/Normal SP
1T 2797 2870 641 4422 4454 4405 2.29 2.28 1.23 1.57 1.59 1.57
2T 3217 5601 569 8587 8800 8377 1.29 2.48 1.08 1.53 1.58 1.50
4T 7902 9864 611 17061 17215 9704 1.92 3.04 1.14 1.55 1.62 1.16
8T 7070 10562 603 15531 16203 9516 1.71 2.29 1.11 1.56 1.57 1.17
Pi 400 1800 MHz 32 bits Max 20.1 GFLOPS Pi 400/4B
1T 3471 3459 597 5318 5345 5244 1.24 1.21 0.93 1.20 1.20 1.19
2T 6842 4295 575 10587 10460 9499 2.13 0.77 1.01 1.23 1.19 1.13
4T 9441 6507 608 20053 20147 9377 1.19 0.66 1.00 1.18 1.17 0.97
8T 7133 8382 500 18080 20187 8589 1.01 0.79 0.83 1.16 1.25 0.90
Pi 400 1800 MHz 64 bits Max 30.2 GFLOPS Pi 400 64 bits/32 bits
1T 4015 3865 447 7902 7860 7326 1.16 1.12 0.75 1.49 1.47 1.40
2T 7412 7347 573 15625 15543 9123 1.08 1.71 1.00 1.48 1.49 0.96
4T 9292 13936 605 29605 30067 10412 0.98 2.14 1.00 1.48 1.49 1.11
8T 10169 9622 585 28978 30150 8537 1.43 1.15 1.17 1.60 1.49 0.99
Double Precision MFLOPS
Pi 4B 1500 MHz 32 bits Max 10.4 GFLOPS
1T 1203 1211 315 2675 2719 2674
2T 2291 2441 293 5406 5421 4907
4T 4673 2501 309 10313 10393 5256
8T 4394 3550 265 8782 10110 5197
Pi 400 1800 MHz 32 bits Max 12.6 GFLOPS Pi 400/4B
1T 1441 1470 259 3274 3262 3116 1.20 1.21 0.82 1.22 1.20 1.17
2T 2944 2640 258 6491 6368 4420 1.29 1.08 0.88 1.20 1.17 0.90
4T 5555 2860 270 12560 12604 4344 1.19 1.14 0.87 1.22 1.21 0.83
8T 3730 5499 267 12154 11558 4398 0.85 1.55 1.01 1.38 1.14 0.85
Pi 400 1800 MHz 64 bits Max 15.1 GFLOPS Pi 400 64 bits/32 bits
1T 2003 1955 250 4085 4071 3744 1.39 1.33 0.97 1.25 1.25 1.20
2T 3789 3780 296 8110 8098 5141 1.29 1.43 1.15 1.25 1.27 1.16
4T 6974 7093 300 14998 15093 5085 1.26 2.48 1.11 1.19 1.20 1.17
8T 4784 3983 281 14296 14433 4238 1.28 0.72 1.05 1.18 1.25 0.96
OpenMP-MFLOPS - OpenMP-MFLOPSC8, notOpenMP-MFLOPSC8, OpenMP-MFLOPS64g8, notOpenMP-MFLOPS64g8
This benchmark carries out the same single precision calculations as the MP-MFLOPS Benchmarks but, in addition, calculations
with eight operations per data word. There is also a notOpenMP-MFLOPS single core version, compiled from the same code and
carrying out identical numbers of floating point calculations, but without an OpenMP compile directive.
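A minimal sketch of how one source file can serve both builds, assuming a two operation kernel of the form (x[i]+a)*b, which is an illustrative choice rather than the confirmed benchmark code. When compiled with OpenMP enabled (for example gcc -fopenmp) the loop is shared between cores; when compiled without it, the pragma is ignored and the same loop runs on one core.

```c
#include <assert.h>
#include <stddef.h>

/* Two operations per word (one add, one multiply). The arithmetic is
   identical whether or not the OpenMP directive is honoured. */
void two_ops_per_word(float *x, size_t n, float a, float b)
{
    assert(x != NULL);
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        x[i] = (x[i] + a) * b;
}
```

Each element is updated independently, so both builds should leave the same values in the array for a given compiler and instruction set.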
The final data values are checked for consistency. Different compilers or different CPUs could involve using alternative
instructions or rounding effects, with variable accuracy. For a given compiler and CPU, OpenMP sumchecks could be expected to be
the same as those from the notOpenMP single core version. However, this is not always the case. This benchmark was compiled
from code used for desktop PCs, with data sizes starting at 400 KB (100K words), then 4 MB and 40 MB.
The main purposes of this benchmark are to see if OpenMP can produce similar maximum performance to MP-MFLOPS and whether
this can increase in line with the number of cores used. In fact, faster OpenMP 32 bit performance was apparent, with 24.2 SP
GFLOPS at 1800 MHz, 21% faster than at 1500 MHz, indicating more efficient operation than via my hand coded NEON functions.
At 64 bits, a maximum speed of 30.2 single precision GFLOPS was demonstrated, effectively the same as MP-MFLOPS.
With the minimum data size of 400 KB, probably served mainly from L2 cache, performance can be quite variable, as it can with
the other sizes, which represent speed from RAM. Appropriate four core performance gains were demonstrated in some cases.
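The MFLOPS column in the tables is consistent with words x operations per word x repeat passes, divided by the elapsed seconds. The sketch below simply restates that arithmetic; the function name is an assumption and it is not taken from the benchmark source.

```c
#include <assert.h>

/* MFLOPS from the table columns:
   4 byte words x ops per word x repeat passes / seconds / 10^6. */
double mflops(double words, double ops_per_word, double passes, double seconds)
{
    assert(seconds > 0.0);
    return words * ops_per_word * passes / seconds / 1.0e6;
}
```

For the first Pi 4B OpenMP row, 100000 words x 2 operations x 2500 passes in 0.098043 seconds gives about 5100 MFLOPS, matching the table.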
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All 400
Words Word Passes Results Same /4B
OpenMP MFLOPS Pi 4B 1500 MHz 32 bits Max 20.0 GFLOPS
Data in & out 100000 2 2500 0.098043 5100 0.929538 Yes
Data in & out 1000000 2 250 0.810084 617 0.992550 Yes
Data in & out 10000000 2 25 0.922891 542 0.999250 Yes
Data in & out 100000 8 2500 0.144870 13805 0.957126 Yes
Data in & out 1000000 8 250 0.922568 2168 0.995524 Yes
Data in & out 10000000 8 25 0.918226 2178 0.999550 Yes
Data in & out 100000 32 2500 0.401577 19921 0.890282 Yes
Data in & out 1000000 32 250 0.935064 8556 0.988096 Yes
Data in & out 10000000 32 25 0.916277 8731 0.998806 Yes
OpenMP MFLOPS Pi 400 1800 MHz 32 bits Max 24.2 GFLOPS
Data in & out 100000 2 2500 0.157307 3178 0.929538 Yes 0.62
Data in & out 1000000 2 250 1.049819 476 0.992550 Yes 0.77
Data in & out 10000000 2 25 0.908061 551 0.999250 Yes 1.02
Data in & out 100000 8 2500 0.132957 15042 0.957126 Yes 1.09
Data in & out 1000000 8 250 0.797976 2506 0.995524 Yes 1.16
Data in & out 10000000 8 25 0.917282 2180 0.999550 Yes 1.00
Data in & out 100000 32 2500 0.330647 24195 0.890282 Yes 1.21
Data in & out 1000000 32 250 0.872310 9171 0.988096 Yes 1.07
Data in & out 10000000 32 25 0.948771 8432 0.998806 Yes 0.97
Next Run
Data in & out 100000 2 2500 0.087220 5733 0.929538 Yes 1.12
Data in & out 100000 8 2500 0.108323 18463 0.957126 Yes 1.34
Data in & out 100000 32 2500 0.987574 8101 0.890282 Yes 0.41
notOpenMP MFLOPS Pi 4B 1500 MHz 32 bits Max 5.5 GFLOPS
Data in & out 100000 2 2500 0.220277 2270 0.929538 Yes
Data in & out 1000000 2 250 0.791373 632 0.992550 Yes
Data in & out 10000000 2 25 0.792594 631 0.999250 Yes
Data in & out 100000 8 2500 0.362916 5511 0.957126 Yes
Data in & out 1000000 8 250 0.902125 2217 0.995524 Yes
Data in & out 10000000 8 25 0.786859 2542 0.999550 Yes
Data in & out 100000 32 2500 1.497859 5341 0.890282 Yes
Data in & out 1000000 32 250 1.518747 5267 0.988096 Yes
Data in & out 10000000 32 25 1.516393 5276 0.998806 Yes
notOpenMP MFLOPS Pi 400 1800 MHz 32 bits Max 6.6 GFLOPS
Data in & out 100000 2 2500 0.127996 3906 0.929538 Yes 1.72
Data in & out 1000000 2 250 0.802889 623 0.992550 Yes 0.99
Data in & out 10000000 2 25 0.774740 645 0.999250 Yes 1.02
Data in & out 100000 8 2500 0.302848 6604 0.957126 Yes 1.20
Data in & out 1000000 8 250 0.897527 2228 0.995524 Yes 1.00
Data in & out 10000000 8 25 0.858763 2329 0.999550 Yes 0.82
Data in & out 100000 32 2500 1.247949 6411 0.890282 Yes 1.20
Data in & out 1000000 32 250 1.303086 6139 0.988096 Yes 1.17
Data in & out 10000000 32 25 1.293210 6186 0.998806 Yes 1.17
64 Bit OpenMP-MFLOPS Results below
Results OpenMP-MFLOPS64g8, notOpenMP-MFLOPS64g8
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All 64 bits/
Words Word Passes Results Same 32 bits
OpenMP MFLOPS Pi 400 1800 MHz 64 bits Max 30.2 GFLOPS
Data in & out 100000 2 2500 0.085683 5835 0.929538 Yes 1.84
Data in & out 1000000 2 250 0.781184 640 0.992550 Yes 1.34
Data in & out 10000000 2 25 0.781273 640 0.999250 Yes 1.16
Data in & out 100000 8 2500 0.097116 20594 0.957117 Yes 1.37
Data in & out 1000000 8 250 0.802633 2492 0.995518 Yes 0.99
Data in & out 10000000 8 25 0.817137 2448 0.999549 Yes 1.16
Data in & out 100000 32 2500 0.288180 27760 0.890215 Yes 1.15
Data in & out 1000000 32 250 0.832001 9615 0.988088 Yes 1.05
Data in & out 10000000 32 25 0.850003 9412 0.998796 Yes 1.12
Other Run
Data in & out 100000 32 2500 0.265007 30188 0.890215 Yes
Data in & out 1000000 32 250 0.836860 9560 0.988088 Yes
Data in & out 10000000 32 25 0.850294 9409 0.998796 Yes
notOpenMP MFLOPS Pi 400 1800 MHz 64 bits Max 8.2 GFLOPS
Data in & out 100000 2 2500 0.128715 3885 0.929538 Yes 0.99
Data in & out 1000000 2 250 1.012816 494 0.992550 Yes 0.79
Data in & out 10000000 2 25 1.232502 406 0.999250 Yes 0.63
Data in & out 100000 8 2500 0.301728 6628 0.957117 Yes 1.00
Data in & out 1000000 8 250 1.097623 1822 0.995518 Yes 0.82
Data in & out 10000000 8 25 1.021233 1958 0.999549 Yes 0.84
Data in & out 100000 32 2500 0.981493 8151 0.890215 Yes 1.27
Data in & out 1000000 32 250 1.243212 6435 0.988088 Yes 1.05
Data in & out 10000000 32 25 1.131976 7067 0.998796 Yes 1.14
OpenMP-MemSpeed - OpenMP-MemSpeed2C8, NotOpenMP-MemSpeed2C8
OpenMP-MemSpeed264g8, NotOpenMP-MemSpeed64g8
This is the same program as the single core MemSpeed benchmark, but with increased memory sizes and compiled using OpenMP
directives. The same program was also compiled without these directives (NotOpenMP-MemSpeed2), with example single core
results shown after the detailed measurements. Although the source code appears to be suitable for speed up by
parallelisation, many of the test functions are slower using OpenMP. The benchmark demonstrates that OpenMP might fail to
produce performance gains on what appears to be suitable code. There might also be compile options that
overcome this problem.
Performance comparisons are provided for samples of the OpenMP results and some for single core operation. These can be
interpreted as demonstrating the usual 1800 MHz gains of 20% for CPU speed limited tests, at 32 bits, and no gain using RAM
based data, but subject to wide variations. MP speed at 64 bits was little different to that at 32 bits, but the single core version
appeared to be somewhat faster executing double precision calculations.
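For reference, the three loop forms named in the table headings can be written as below, here in double precision. This is an illustrative sketch only: the function names, and the placement of the OpenMP directive on each loop, are assumptions rather than the benchmark source.

```c
#include <assert.h>

/* x[m] = x[m] + s * y[m] */
void scaled_add(double *x, const double *y, double s, long n)
{
    assert(n >= 0);
    #pragma omp parallel for
    for (long m = 0; m < n; m++)
        x[m] = x[m] + s * y[m];
}

/* x[m] = x[m] + y[m] */
void plain_add(double *x, const double *y, long n)
{
    assert(n >= 0);
    #pragma omp parallel for
    for (long m = 0; m < n; m++)
        x[m] = x[m] + y[m];
}

/* x[m] = y[m] */
void copy_words(double *x, const double *y, long n)
{
    assert(n >= 0);
    #pragma omp parallel for
    for (long m = 0; m < n; m++)
        x[m] = y[m];
}
```

Each loop does very little arithmetic per word moved, which is one reason why thread start up and memory bandwidth, rather than the number of cores, can dominate the measured speed.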
Memory Reading Speed Test OpenMP
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
Pi 4B 1500 MHz 32 bits
4 8097 8322 8641 8020 8436 8384 39701 19701 19712
8 7814 8555 8756 8321 8548 8526 39042 19984 19996
16 8149 7738 7742 8303 7779 8192 37995 19883 19984
32 8969 8769 8799 9040 8759 8743 37737 20133 20130
64 7617 7457 7437 7575 7380 7422 17770 15332 14248
128 11221 10936 11003 11105 11011 10986 13650 13910 13881
256 17883 18144 18036 17691 18094 17844 13073 12465 12535
512 18001 18468 19675 17075 18221 19264 13511 13895 12008
1024 9532 10590 9772 11842 11282 11277 7173 9473 9496
2048 7095 7025 6866 7117 7043 6946 2914 3475 3468
4096 7244 6927 7036 5951 7054 6531 2582 3130 3122
8192 4578 7173 7025 6322 7078 7182 2504 3127 3115
16384 5470 7043 7067 7103 7052 7020 2557 3093 3088
32768 7359 7817 7766 7158 7078 7757 2618 3066 3094
65536 7810 7268 7266 3824 7478 5164 2486 3016 2931
131072 2460 2655 7224 7513 7308 7339 2540 2944 2940
Not OMP
8 11775 3895 4342 11787 4325 4354 10334 7806 7816
256 10032 3699 4223 9978 4289 4185 7105 7612 7621
65536 2099 2587 3033 2103 3021 3001 2585 1105 1101
Pi 400 1800 MHz 32 bits
4 9870 10099 10417 9594 10121 10071 47620 23674 23662
8 9413 10284 10511 9978 10239 10233 46992 24003 24005
16 9462 10322 10557 9446 10264 10234 45814 24091 23992
32 10985 10180 10356 10898 10204 10258 45479 24148 24160
64 11212 10952 10978 11184 10992 10938 30749 22302 21768
128 14481 14069 14237 14437 14228 14265 16353 17408 17293
256 20737 21740 21905 20853 21742 21892 14898 15113 15109
512 20922 22457 23702 20469 21381 22893 14975 16248 16272
1024 14626 12757 12104 13595 13508 12422 10711 12240 10897
2048 5193 7184 7224 7309 7238 7227 2990 3347 3355
4096 7839 6201 7620 7822 7646 7561 2650 2997 3016
8192 7867 7820 7844 7778 7736 7736 2494 2961 2877
16384 8089 7768 7800 5995 7829 7996 2508 2858 2840
32768 1921 7278 7313 7756 7308 7552 2659 2895 2869
65536 2302 7267 2708 5814 6992 7310 2597 2769 2801
131072 3730 2546 2841 7442 2617 5254 2611 2804 2764
Not OMP
4 13781 4653 5235 13804 5216 5215 11951 9142 9130
256 12185 4441 5064 11889 5009 5001 9702 9260 9256
65536 1108 1418 3026 2016 2264 3038 2551 945 901
400/4B 32 bits Samples
4 1.22 1.21 1.21 1.20 1.20 1.20 1.20 1.20 1.20
8 1.20 1.20 1.20 1.20 1.20 1.20 1.20 1.20 1.20
16 1.16 1.33 1.36 1.14 1.32 1.25 1.21 1.21 1.20
128 1.29 1.29 1.29 1.30 1.29 1.30 1.20 1.25 1.25
256 1.16 1.20 1.21 1.18 1.20 1.23 1.14 1.21 1.21
512 1.16 1.22 1.20 1.20 1.17 1.19 1.11 1.17 1.36
32768 0.26 0.93 0.94 1.08 1.03 0.97 1.02 0.94 0.93
65536 0.29 1.00 0.37 1.52 0.94 1.42 1.04 0.92 0.96
131072 1.52 0.96 0.39 0.99 0.36 0.72 1.03 0.95 0.94
Not OMP
8 1.17 1.19 1.21 1.17 1.21 1.20 1.16 1.17 1.17
256 1.21 1.20 1.20 1.19 1.17 1.19 1.37 1.22 1.21
65536 0.53 0.55 1.00 0.96 0.75 1.01 0.99 0.86 0.82
64 Bit OpenMP-MemSpeed Results below
Results OpenMP-MemSpeed264g8, NotOpenMP-MemSpeed64g8
Pi 400 1800 MHz 64 bits
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 9304 10175 10500 8950 10229 10240 47444 22323 22326
8 9867 10400 10669 9767 10411 10432 46796 22663 22659
16 9624 10210 10012 9400 10097 10223 45939 22831 22830
32 11543 10488 10590 11533 10484 10479 45358 22849 22850
64 9976 9351 9360 9833 9300 9287 23856 19396 19353
128 12504 12426 12326 12438 12399 12252 16646 16057 16029
256 20937 22629 22372 21707 22453 22434 14964 15050 15079
512 21415 22091 21606 20618 22166 21289 15254 16012 15761
1024 9413 11347 9278 12991 12294 12168 6405 9009 9213
2048 7451 6795 7128 7292 7099 7047 2945 3091 3179
4096 6679 7016 7220 7210 7028 7107 2784 2897 2919
8192 7741 6345 7626 7625 7532 7374 2461 2837 2842
16384 7557 7331 7650 7556 7345 7729 2559 2924 2927
32768 2214 7172 7066 7283 7125 7105 2489 2836 2829
65536 2309 6584 7193 7265 6608 6570 2662 2748 2755
131072 5200 6211 7275 7400 7287 7248 2603 2805 2764
Not OMP
8 18146 4545 5046 17725 5043 5051 13599 10933 10999
256 14712 4648 5064 14594 5040 5064 9724 9855 9853
65536 2046 3002 3077 2029 3067 3070 2550 2569 2568
Pi 400 64 bits/32 bits
4 0.94 1.01 1.01 0.93 1.01 1.02 1.00 0.94 0.94
8 1.05 1.01 1.02 0.98 1.02 1.02 1.00 0.94 0.94
16 1.02 0.99 0.95 1.00 0.98 1.00 1.00 0.95 0.95
128 0.86 0.88 0.87 0.86 0.87 0.86 1.02 0.92 0.93
256 1.01 1.04 1.02 1.04 1.03 1.02 1.00 1.00 1.00
512 1.02 0.98 0.91 1.01 1.04 0.93 1.02 0.99 0.97
32768 1.15 0.99 0.97 0.94 0.97 0.94 0.94 0.98 0.99
65536 1.00 0.91 2.66 1.25 0.95 0.90 1.03 0.99 0.98
131072 1.39 2.44 2.56 0.99 2.78 1.38 1.00 1.00 1.00
Not OMP
8 1.32 0.98 0.96 1.28 0.97 0.97 1.14 1.20 1.20
256 1.21 1.05 1.00 1.23 1.01 1.01 1.00 1.06 1.06
65536 1.85 2.12 1.02 1.01 1.35 1.01 1.00 2.72 2.85
Java Whetstone Benchmark - whetstc.class
The Java benchmarks comprise class files that were produced some time ago, but source codes are available to renew the files.
Performance can vary significantly using different Java Virtual Machines, so comparisons might not be appropriate. Note that
some speeds here are effectively the same as those found running the C compiled version above, with Pi 400 speed gains, at 32
bits, of around 20%. Using these particular versions of Java, some floating point functions were slower at 64 bits.
********************* Pi 4B 1500 MHz 32 bits ********************
Whetstone Benchmark OpenJDK11 Java Version, May 15 2019, 18:48:20
1 Pass
Test Result MFLOPS MOPS millisecs
N1 floating point -1.124750137 524.02 0.0366
N2 floating point -1.131330490 494.12 0.2720
N3 if then else 1.000000000 289.92 0.3570
N4 fixed point 12.000000000 1092.99 0.2882
N5 sin,cos etc. 0.499110132 59.86 1.3900
N6 floating point 0.999999821 345.95 1.5592
N7 assignments 3.000000000 331.54 0.5574
N8 exp,sqrt etc. 0.825148463 25.41 1.4640
MWIPS 1687.92 5.9244
Operating System Linux, Arch. arm, Version 4.19.37-v7l+
Java Vendor BellSoft, Version 11.0.2-BellSoft
******************** Pi 400 1800 MHz 32 bits ********************
Whetstone Benchmark Java Version, Jul 30 2020, 11:49:33
1 Pass
Test Result MFLOPS MOPS millisecs
N1 floating point -1.124750137 629.92 0.0305
N2 floating point -1.131330490 584.35 0.2300
N3 if then else 1.000000000 415.00 0.2494
N4 fixed point 12.000000000 1315.79 0.2394
N5 sin,cos etc. 0.499110132 71.72 1.1600
N6 floating point 0.999999821 415.05 1.2996
N7 assignments 3.000000000 399.48 0.4626
N8 exp,sqrt etc. 0.825148463 32.95 1.1290
MWIPS 2083.13 4.8005
Operating System Linux, Arch. arm, Version 5.4.51-v7l+
Java Vendor Raspbian, Version 11.0.8
******************** Pi 400 1800 MHz 64 bits ********************
Whetstone Benchmark Java Version, Aug 26 2020, 19:37:28
1 Pass
Test Result MFLOPS MOPS millisecs
N1 floating point -1.124750137 624.19 0.0308
N2 floating point -1.131330490 577.32 0.2328
N3 if then else 1.000000000 405.88 0.2550
N4 fixed point 12.000000000 1582.12 0.1991
N5 sin,cos etc. 0.499110132 57.54 1.4460
N6 floating point 0.999999821 331.53 1.6270
N7 assignments 3.000000000 359.25 0.5144
N8 exp,sqrt etc. 0.825148463 30.47 1.2210
MWIPS 1809.61 5.5261
Operating System Linux, Arch. aarch64, Version 5.4.51-v8+
Java Vendor Debian, Version 11.0.8
JavaDraw Benchmark - JavaDrawPi.class
The benchmark uses small to rather excessive numbers of simple objects to measure drawing performance in Frames Per Second
(FPS). Five tests draw on a background of continuously changing colour shades, each test adding to the load.
At 32 bits, in order for this to run at maximum speed on the original Pi 4B, it was necessary to disable the experimental GL driver.
The initial Pi 400 run was also without that driver enabled. In that run, the initial graphics speed dependent tests were the same
as those for the Pi 4B, with the later CPU limited ones some 20% faster. This Pi 400 test was rerun with part of the display window
on the monitor and part on a TV, via dual monitor operation, then completely on the TV. The affected, somewhat slower, FPS
speeds shown below were seen in both cases.
Using the 64 bit Pi 400 configuration, performance was the same using the original and experimental GL driver. It was also similar
to the first Pi 400 results at 32 bits.
********************* Pi 4B 1500 MHz 32 bits ********************
Java Drawing Benchmark, May 15 2019, 18:55:41
Produced by OpenJDK 11 javac
Test Frames FPS
Display PNG Bitmap Twice Pass 1 877 87.65
Display PNG Bitmap Twice Pass 2 1042 104.18
Plus 2 SweepGradient Circles 1015 101.47
Plus 200 Random Small Circles 779 77.85
Plus 320 Long Lines 336 33.52
Plus 4000 Random Small Circles 83 8.25
Total Elapsed Time 60.1 seconds
Operating System Linux, Arch. arm, Version 4.19.37-v7l+
Java Vendor BellSoft, Version 11.0.2-BellSoft
******************** Pi 400 1800 MHz 32 bits ********************
Java Drawing Benchmark, Jul 30 2020, 12:01:08
Produced by javac 1.7.0_02
Test Frames FPS
Display PNG Bitmap Twice Pass 1 904 90.36
Display PNG Bitmap Twice Pass 2 1038 103.79
Plus 2 SweepGradient Circles 1019 101.84
Plus 200 Random Small Circles 855 85.41
Plus 320 Long Lines 391 39.08
Plus 4000 Random Small Circles 102 10.11
Total Elapsed Time 60.1 seconds
Operating System Linux, Arch. arm, Version 5.4.51-v7l+
Java Vendor Raspbian, Version 11.0.8
******************* Dual Monitor + TV Involved *****************
Display PNG Bitmap Twice Pass 1 698 69.75
Display PNG Bitmap Twice Pass 2 909 90.84
Plus 2 SweepGradient Circles 918 91.78
** 32 bit Pi 400 after enabling experimental desktop GL driver **
Java Drawing Benchmark, Jul 31 2020, 10:08:07
Produced by javac 1.6.0_27
Test Frames FPS
Display PNG Bitmap Twice Pass 1 1164 116.33
Display PNG Bitmap Twice Pass 2 1346 134.49
Plus 2 SweepGradient Circles 1317 131.62
Plus 200 Random Small Circles 976 97.53
Plus 320 Long Lines 402 40.12
Plus 4000 Random Small Circles 103 10.27
Total Elapsed Time 60.1 seconds
Operating System Linux, Arch. arm, Version 5.4.51-v7l+
Java Vendor Raspbian, Version 11.0.8
64 Bit Java Draw Benchmark below
64 Bit JavaDraw Benchmark - JavaDrawPi.class
******************** Pi 400 1800 MHz 64 bits ********************
Java Drawing Benchmark, Aug 26 2020, 19:38:46
Produced by javac 1.8.0_222
Test Frames FPS
Display PNG Bitmap Twice Pass 1 860 85.92
Display PNG Bitmap Twice Pass 2 957 95.68
Plus 2 SweepGradient Circles 1002 100.18
Plus 200 Random Small Circles 843 84.24
Plus 320 Long Lines 402 40.12
Plus 4000 Random Small Circles 99 9.86
Total Elapsed Time 60.1 seconds
Operating System Linux, Arch. aarch64, Version 5.4.51-v8+
Java Vendor Debian, Version 11.0.8
** 64 bit Pi 400 after enabling experimental desktop GL driver **
Java Drawing Benchmark, Aug 26 2020, 20:09:05
Produced by javac 1.8.0_222
Test Frames FPS
Display PNG Bitmap Twice Pass 1 800 79.94
Display PNG Bitmap Twice Pass 2 966 96.51
Plus 2 SweepGradient Circles 999 99.81
Plus 200 Random Small Circles 864 86.30
Plus 320 Long Lines 409 40.83
Plus 4000 Random Small Circles 109 10.86
Total Elapsed Time 60.1 seconds
Operating System Linux, Arch. aarch64, Version 5.4.51-v8+
Java Vendor Debian, Version 11.0.8
******************* Dual Monitor + TV Involved *****************
Dual Monitor Part on monitor and part on TV
Display PNG Bitmap Twice Pass 1 748 74.72
Display PNG Bitmap Twice Pass 2 872 87.15
Plus 2 SweepGradient Circles 914 91.37
32 Bit OpenGL GLUT Benchmark - videogl32
In 2012, I approved a request from a Quality Engineer at Canonical, to use this OpenGL benchmark in the testing framework of
the Unity desktop software. The program can be run as a benchmark, or selected functions, as a stress test of any duration.
The benchmark measures graphics speed in terms of Frames Per Second (FPS) via six simple and more complex tests. The first
four tests portray moving up and down a tunnel including various independently moving objects, with and without texturing. The
last two tests represent a real application for designing kitchens. The first is in wireframe format, drawn with 23,000 straight
lines. The second has colours and textures applied to the surfaces.
As a benchmark, it was run using the following script file, the first command being needed to avoid VSYNC, allowing FPS to be
greater than 60.
export vblank_mode=0
./videogl32 Width 320, Height 240, NoEnd
./videogl32 Width 640, Height 480, NoHeading, NoEnd
./videogl32 Width 1024, Height 768, NoHeading, NoEnd
./videogl32 Width 1920, Height 1080, NoHeading
The first Pi 400 results indicated that performance was slower on most of the tests, excluding those for the kitchen displays, the
latter being more CPU speed limited, providing the hoped for 20% performance improvement. Then, I remembered using an
experimental desktop GL driver, enabled via sudo raspi-config. This was used on the Pi 400, where G3 GL OpenGL desktop driver
with full KMS was selected. This produced the same or better Pi 400 performance than the Pi 4B.
As indicated below, the dual monitor connections allowed full screen operation across both monitors to be tested, with the
default full screen pixel settings applied, 2 x 1920 wide in this case.
********************* Pi 4B 1500 MHz 32 bits ********************
GLUT OpenGL Benchmark 32 Bit Version 1, Thu May 2 19:01:05 2019
Running Time Approximately 5 Seconds Each Test
Window Size Coloured Objects Textured Objects WireFrm Texture
Pixels Few All Few All Kitchen Kitchen
Wide High FPS FPS FPS FPS FPS FPS
320 240 766.7 371.4 230.6 130.2 32.5 22.7
640 480 427.3 276.5 206.0 121.8 31.7 22.2
1024 768 193.1 178.8 150.5 110.4 31.9 21.5
1920 1080 81.4 79.4 74.6 68.3 30.8 20.0
******************** Pi 400 1800 MHz 32 bits ********************
GLUT OpenGL Benchmark 32 Bit Version 1, Thu Jul 30 12:31:31 2020
Running Time Approximately 5 Seconds Each Test
Window Size Coloured Objects Textured Objects WireFrm Texture
Pixels Few All Few All Kitchen Kitchen
Wide High FPS FPS FPS FPS FPS FPS
320 240 688.1 405.2 223.1 138.2 42.8 29.0
640 480 319.4 281.4 200.1 126.8 41.4 27.8
1024 768 140.3 134.5 113.9 103.0 40.2 27.1
1920 1080 57.7 56.3 53.5 49.6 37.4 24.0
******************* Pi 400 New Driver 32 bits ******************
GLUT OpenGL Benchmark 32 Bit Version 1, Thu Jul 30 13:59:55 2020
Running Time Approximately 5 Seconds Each Test
Window Size Coloured Objects Textured Objects WireFrm Texture
Pixels Few All Few All Kitchen Kitchen
Wide High FPS FPS FPS FPS FPS FPS
320 240 823.6 435.1 244.5 140.7 42.5 28.7
640 480 427.8 310.0 219.6 134.3 42.1 28.3
1024 768 192.3 181.9 149.9 116.3 40.9 27.0
1920 1080 81.7 79.0 73.7 67.4 38.1 24.5
****************** Pi 400 Dual Monitor 32 bits ******************
3840 1080 27.0 26.6 26.3 25.1 27.3 19.3
64 Bit OpenGL Benchmark below
64 Bit OpenGL GLUT Benchmark - videogl64
In this case (at this early stage?), the 64 bit default driver appeared to produce much slower performance than the 32 bit Pi 400
system. The later driver also produced slower performance on the early, graphics speed dependent, tests, but 20% faster on the
last tests that depend on processor power.
The dual monitor test results were similar to those at 32 bits.
******************** Pi 400 1800 MHz 64 bits ********************
GLUT OpenGL Benchmark 64 Bit gcc 9, Wed Aug 26 19:53:43 2020
Running Time Approximately 5 Seconds Each Test
Window Size Coloured Objects Textured Objects WireFrm Texture
Pixels Few All Few All Kitchen Kitchen
Wide High FPS FPS FPS FPS FPS FPS
160 120 334.3 162.1 173.9 90.0 27.1 23.7
320 240 220.5 131.9 128.7 74.1 25.0 21.5
640 480 109.4 81.0 80.6 55.7 22.2 17.9
1024 768 57.5 47.5 45.4 34.2 18.2 13.4
1920 1080 27.0 24.3 22.0 18.9 14.3 8.4
******************* Pi 400 New Driver 64 bits ******************
GLUT OpenGL Benchmark 64 Bit gcc 9, Wed Aug 26 20:03:54 2020
Running Time Approximately 5 Seconds Each Test
Window Size Coloured Objects Textured Objects WireFrm Texture
Pixels Few All Few All Kitchen Kitchen
Wide High FPS FPS FPS FPS FPS FPS
160 120 783.4 446.7 286.4 170.7 50.9 35.5
320 240 659.3 406.0 265.5 160.8 51.9 35.1
640 480 319.2 276.9 229.0 144.2 47.5 32.7
1024 768 140.2 134.4 122.4 113.2 48.1 32.5
1920 1080 57.8 56.5 55.6 52.4 46.7 29.8
****************** Pi 400 Dual Monitor 64 bits ******************
3840 1080 27.2 26.6 27.0 26.0 27.5 21.4
I/O Benchmarks
Two varieties of I/O benchmarks are provided, one to measure performance of main and USB drives, and the other for LAN and
WiFi network connections. The programs write and read three files at two sizes (defaults 8 and 16 MB), followed by random
reading and writing of 1 KB blocks out of 4, 8 and 16 MB and, finally, writing and reading 200 small files, sized 4, 8 and 16 KB. Run
time parameters are provided for the size of large files and file path. The same program code is used for both varieties, the only
difference being file opening properties. The drive benchmark includes extra options to use direct I/O, avoiding data caching in
main memory, but includes an extra test with caching allowed. For further details and downloads see the PDF file.
LanSpeed Benchmark - (1G bits per second Ethernet) - LanSpeed, LanSpeed64g8
Measured performance can vary significantly, but both Pi 4B and Pi 400 tests demonstrated Gigabit performance on the large
files. Of particular note (with my program), these 32 bit systems indicated that the 2 GB file could not be written, with the actual
file size ending at 2,147,483,647 Bytes (or 2^31 - 1). Also note the more consistent speeds handling 1 GB files.
The default 64 bit benchmark produced similar performance to the 32 bit version. However, a major advantage of the former is
its ability to handle much larger files, as illustrated below at 3 and 6 GB.
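The size at which the 32 bit write stopped is exactly the largest value a signed 32 bit integer can hold. The sketch below only demonstrates that arithmetic; whether a 32 bit file offset or byte counter is the actual cause in the benchmark program is an assumption, not something confirmed here, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Largest byte count a signed 32 bit offset can represent. */
int64_t max_signed32_offset(void)
{
    return (int64_t)INT32_MAX;   /* 2^31 - 1 = 2,147,483,647 */
}

/* Non-zero when a file of this size needs more than a signed 32 bit offset. */
int exceeds_signed32(int64_t file_bytes)
{
    assert(file_bytes >= 0);
    return file_bytes > max_signed32_offset();
}
```

A 1 GB file fits within this limit, whereas a full 2 GB file (2,147,483,648 bytes) exceeds it by one byte. On 32 bit Linux, C programs built with -D_FILE_OFFSET_BITS=64 use 64 bit file offsets, which might be one way to lift such a limit if file offsets are indeed the cause.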
******************** Pi 4B 1500 MHz 32 bits ******************
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
8 67.82 12.97 90.19 99.84 93.49 96.83
16 92.25 92.66 92.96 103.9 105.28 91.17
Random Read Write
From MB 4 8 16 4 8 16
msecs 0.007 0.01 0.04 1.01 0.85 0.91
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 1.47 2.80 5.14 2.47 4.71 8.61
ms/file 2.78 2.92 3.19 1.66 1.74 1.90 0.256
Large File Write MBytes/Second Read MBytes/Second
1 GB 96.13 93.34 94.98 114.51 112.16 114.91
2 GB Error writing file Segmentation fault
******************* Pi 400 1800 MHz 32 bits ******************
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
8 47.07 87.12 90.94 102.11 100.03 100.24
16 82.75 90.84 91.03 106.19 106.39 105.10
Random Read Write
From MB 4 8 16 4 8 16
msecs 0.007 0.02 0.43 0.98 0.90 0.89
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 1.35 2.62 5.04 2.21 4.10 6.88
ms/file 3.03 3.12 3.25 1.85 2.00 2.38 0.184
Large File Write MBytes/Second Read MBytes/Second
1 GB 109.69 111.03 107.39 112.28 112.72 112.02
2 GB Error writing file Segmentation fault
******************* Pi 400 1800 MHz 64 bits ******************
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
8 46.59 89.13 93.19 103.35 73.78 65.73
16 65.89 96.57 67.83 90.43 105.20 105.43
Random Read Write
From MB 4 8 16 4 8 16
msecs 0.004 0.017 0.397 1.09 1.02 1.05
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 1.36 2.64 5.11 1.95 3.33 8.55
ms/file 3.01 3.11 3.21 2.10 2.46 1.92 0.194
Large File Write MBytes/Second Read MBytes/Second
3 GB 114.00 114.11 114.93 112.31 114.79 116.96
6 GB 92.46 92.06 114.06 115.22 115.57 113.66
LanSpeed Benchmarks - WiFi - LanSpeed, LanSpeed64g8
Following are old Pi 4B results and those for the Pi 400, using 2.4 GHz and 5 GHz WiFi frequencies, communicating with a
Windows 7 based PC. Details on setting up the links can be found in This PDF file, LAN/WiFi section. Performance of the two
systems was reasonably similar at 2.4 GHz, exhibiting normal variations at these file sizes. With my setup, consistent 5 GHz
operation was extremely difficult to achieve in both cases, but the results shown indicate the most frequent performance
patterns. The main difference was the particularly slow Pi 400 reading speeds, which were apparently lower at 5 GHz than at
2.4 GHz.
Results at 64 bits were similar to those at 32 bits, but it took many more attempts to run at 5 GHz.
**************** Pi 4B 1500 MHz 2.4 GHz 32 bit **************
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
8 6.35 6.33 6.38 7.05 6.98 7.10
16 6.70 6.82 6.76 7.19 6.53 7.22
Random Read Write
From MB 4 8 16 4 8 16
msecs 2.691 2.875 3.048 3.13 2.93 2.84
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 0.34 0.44 1.04 0.37 0.37 1.26
ms/file 12.14 18.59 15.7 11.1 22.2 12.99 2.153
***************** Pi 4B 1500 MHz 5 GHz 32 bit ***************
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
8 11.90 12.96 13.16 10.11 9.55 9.66
16 11.50 13.93 14.13 9.91 8.88 9.92
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 0.13 0.46 0.91 0.25 0.55 1.02
ms/file 30.85 17.83 18.10 16.62 14.93 16.01 3.361
Random similar to 2.4 GHz
*************** Pi 400 1800 MHz 2.4 GHz 32 bit **************
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
8 2.02 6.08 6.59 6.91 5.82 7.01
16 6.78 6.64 6.70 7.04 6.05 6.36
Random Read Write
From MB 4 8 16 4 8 16
msecs 3.234 3.354 3.637 4.12 3.72 3.72
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 0.36 0.61 1.07 0.46 0.85 1.55
ms/file 11.50 13.37 15.34 8.88 9.59 10.55 2.924
**************** Pi 400 1800 MHz 5 GHz 32 bit ***************
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
8 2.85 9.75 9.82 4.03 4.20 4.14
16 11.42 10.20 10.14 4.18 4.17 4.16
Random Read Write
From MB 4 8 16 4 8 16
msecs 3.006 3.206 3.276 3.55 3.29 3.28
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 0.42 0.50 0.34 0.48 0.88 1.44
ms/file 9.72 16.44 48.26 8.61 9.30 11.39 2.812
LanSpeed Benchmarks - WiFi - LanSpeed64g8
Windows Perfmon was executed, to indicate volume and reliability of network traffic, at the same time as a run of the 5 GHz
benchmark. This confirmed the measured data transfer speeds of large files and indicated no errors or discards.
*************** Pi 400 1800 MHz 2.4 GHz 64 bit **************
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
8 5.93 5.91 5.98 6.79 5.75 6.62
16 6.51 3.23 6.61 6.08 5.72 6.19
Random Read Write
From MB 4 8 16 4 8 16
msecs 3.240 3.720 3.651 4.14 3.92 4.16
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 0.32 0.58 1.00 0.42 0.79 1.44
ms/file 12.92 14.14 16.39 9.80 10.42 11.36 1.335
*************** Pi 400 1800 MHz 5 GHz 64 bit ****************
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
8 11.55 12.00 12.24 4.16 4.32 4.28
16 12.21 12.41 12.34 4.13 4.28 4.24
Random Read Write
From MB 4 8 16 4 8 16
msecs 2.738 2.882 2.967 3.10 2.87 2.89
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 0.49 0.43 0.64 0.54 0.92 1.54
ms/file 8.42 19.06 25.54 7.65 8.87 10.66 1.009
USB Booting
The initial 8 GB SD card was particularly slow on booting and on running initial benchmarks. I cloned it to a 32 GB SanDisk Ultra
card and that was fine. In turn, to test USB booting, I then copied that to a faster 64 GB USB stick and a 500 GB partition of a 1
TB hard drive for the highest data transfer rate. Without investigating, I expanded the Filesystem of the HD and that made it
unusable (Windows indicated total size of 256 GB, unreadable). That did not matter as there was no useful information on the
drive. I repartitioned the drive, via Ubuntu, with >64 GB for Raspbian, where cloning to that from the 64 GB image was
satisfactory, without expansion. Two further partitions were created, formatted as FAT32 and EXT4, each occupying half of the
remaining space.
As indicated later, the USB 3 drives produced higher data transfer speeds than the 32 GB SD card, but were slower on booting,
as shown in the following early-life measurements, which could change. Part of the reason for slow booting is explained in an
initial display, indicating that an SD card cannot be found and later, apparently searching for a bootable device.
Seconds
Initial Total To Reboot To
Drive Display Desktop Desktop
32 bit 8 GB SD card N/A 46 68
32 bit 32 GB SD card N/A 22 26
32 bit 32 GB SD USB Reader 7 31 30
32 bit 64 GB USB Stick 25 46 29
32 bit 64 GB HD Partition 29 63 64
64 bit 32 GB SD card N/A 22 26
64 bit 32 GB SD USB Reader 7 25 28
32 Bit USB 3 and Main Drive Benchmarks - DriveSpeed
Large Files - Below are DriveSpeed benchmark results, initially from the four main drives. The SD cards obtained the same level
of performance when booted using a USB 3 card reader. The main drive data area is formatted as Ext4. Performance of large files
identifies possible benefits of using alternative drives to SD cards. The 64 GB USB stick appeared to write at much slower
rates, starting at somewhat less than 500 MB. Using the secondary disk partitions, writing to EXT4 format is shown to be
faster than to FAT32. It seems that, compared with the Pi 4B, the Pi 400 can provide superior writing performance with FAT32
but not so at EXT4.
Random Access - The measured access times can vary widely, with the reasons for differences difficult to identify. Traditionally,
hard disk drive times would normally be greater than half the revolution time, 5.5 ms in this case. However, this Toshiba Canvio
is said to have an 8 MB buffer, indicating that most accesses could be to the buffer, at bus speeds.
200 Small Files - Under EXT4 format, hard drive performance was indicated as being far superior, with the 8 GB SD card worst.
Hard drive performance at FAT32 was exceptionally bad, with a sector size of 32 KB, when each of the 200 files was that size. Pi
400 and Pi 4B performance was essentially the same running all these tests (those identical results were double checked).
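The 5.5 ms figure quoted under Random Access is the time for half a revolution, which corresponds to a spindle speed of about 5400 RPM. That speed is inferred from the figure, not stated in the report. A short Python sketch of the conversion, with hypothetical function names:

```python
# Average rotational latency is the time for half a revolution.
def avg_rotational_latency_ms(rpm):
    ms_per_revolution = 60_000.0 / rpm
    return ms_per_revolution / 2.0

# Inverse: spindle speed implied by a quoted half-revolution time.
def rpm_from_half_revolution(half_rev_ms):
    return 60_000.0 / (2.0 * half_rev_ms)

print(round(avg_rotational_latency_ms(5400), 2))  # close to the 5.5 ms quoted
print(round(rpm_from_half_revolution(5.5)))       # close to 5400 RPM
```

Measured random read times well under this value, such as those in the tables that follow, are therefore more likely to reflect buffer or cache hits than mechanical access.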
Large Files MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
8 GB SD Card 16 7.44 5.23 6.13 22.88 22.71 22.12
32 GB SD Card 16 19.02 17.56 17.39 44.71 43.50 44.84
64 GB USB Drive 16 74.42 77.55 76.80 129.86 130.65 129.75
64 GB USB Drive 500 30.92 23.74 29.67 132.11 131.10 132.16
64 GB USB Drive 2000 28.78 28.77 29.45 131.87 132.27 132.33
64 GB HD Partition 16 55.80 81.05 52.98 134.06 142.09 143.91
64 GB HD Partition 2000 149.83 148.52 146.76 151.64 151.99 150.15
64 GB HD Pi 4B USB 2000 147.17 146.79 146.45 148.38 151.14 97.80
Same HD Pi400 FAT 2000 83.27 82.66 83.22 143.79 144.02 144.06
Same HD Pi400 EXT4 2000 125.74 123.60 120.72 130.20 128.47 124.88
Same HD Pi4B FAT 2000 68.10 66.83 67.67 148.63 148.69 149.25
Same HD Pi4B EXT4 2000 125.36 118.11 122.68 130.29 127.10 128.45
Random Read Write
From MB 4 8 16 4 8 16
8 GB SD Card msecs 0.436 0.417 0.406 2.86 2.87 79.52
32 GB SD Card msecs 0.250 0.249 0.279 1.61 1.50 1.55
64 GB USB Drive msecs 0.671 0.675 0.671 2.14 2.20 2.18
64 GB HD Partition msecs 0.170 0.647 0.426 5.18 11.79 11.13
64 GB HD Pi 4B USB msecs 0.976 0.356 0.367 0.68 0.64 0.68
Same HD Pi400 FAT msecs 0.169 0.170 0.170 0.66 0.63 0.70
Same HD Pi400 EXT4 msecs 0.436 0.486 0.314 0.71 0.64 0.70
Same HD Pi4B FAT msecs 0.573 0.515 0.368 0.63 0.58 0.65
Same HD Pi4B EXT4 msecs 1.087 0.391 0.286 0.68 0.63 0.68
200 Small Files Write Read
File KB 4 8 16 4 8 16
8 GB SD Card MB/sec 0.42 2.59 2.61 5.63 8.95 12.15
32 GB SD Card MB/sec 2.57 5.10 5.59 9.08 12.42 20.69
64 GB USB Drive MB/sec 1.95 2.55 4.58 7.33 11.85 21.22
64 GB HD Partition MB/sec 4.20 16.53 13.64 13.32 20.21 50.28
64 GB HD Pi 4B USB MB/sec 8.58 20.83 35.28 20.83 36.94 61.32
Same HD Pi400 FAT MB/sec 0.04 0.07 0.15 0.37 0.73 1.46
Same HD Pi400 EXT4 MB/sec 8.15 15.02 20.04 8.86 12.86 34.40
Same HD Pi4B FAT MB/sec 0.04 0.07 0.15 0.37 0.73 1.46
Same HD Pi4B EXT4 MB/sec 9.90 15.22 14.05 13.42 7.95 19.51
64 Bit USB 3 and Main Drive Benchmarks Pi 400 - DriveSpeed64v2g8, LanSpeed64g8,
DriveSpeed264WRg8, DriveSpeed264Rd2g8
A major advantage of 64 bit working is that much larger files can be handled, but there is a disadvantage, in running my
benchmarks, where Direct I/O does not appear to be available. Attempting to run DriveSpeed leads to an error report when
accessing an Ext4 formatted partition (see below). Alternatives are LanSpeed, using data larger than RAM size, to minimise
caching, or separate programs to write and read, requiring a reboot before the latter (DriveSpeed264WRg8 and
DriveSpeed264Rd2g8). The latter are variations of LanSpeed just dealing with large file tests, writing 1 MB at a time, with read
only declaring an array to contain all data being read. The example below indicated the limitation with 4 GB RAM, where only
reading of the first 1024 MB of each file was successful.
Large Files - Compared with 32 bit operation and using the appropriate formatting, performance was similar using Ext4
partitions, but much larger files could be handled at 64 bits. At FAT32, files of twice the size could be dealt with, but
performance on writing was much worse.
Random Access - None of the Pi 400 reading times represent drive hardware performance, being accelerated by caching or HD
buffering, but 32 bit reading was also faster than expectations. Writing times produced inexplicable variations.
Small Files - FAT32 performance was again particularly slow. Then, Ext4 reported speed via LanSpeed was accelerated by
buffering.
Large Files MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
HD Ext4 LanSpeed 4096 130.60 112.66 110.96 85.23 118.60 119.20
HD Ext4 LanSpeed 8192* 122.62 111.55 103.17 101.44 124.52 119.92
HD FAT32 LanSpeed 4096 Error writing file Segmentation fault
HD FAT32 LanSpeed 4000= 125.45 137.00 137.94 147.63 146.50 146.16
HD Ext4 DriveSpeed Error writing file Segmentation fault
HD FAT32 DriveSpeed 4096 Error writing file Segmentation fault
HD FAT32 DriveSpeed 4000= 20.50 20.56 12.53 143.32 146.59 146.32
SD Main LanSpeed 4096 21.34 18.22 17.40 34.78 45.86 45.33
SD Main Write/Read 4096# 18.73 18.87 18.83 45.96 46.01 46.04
SD Main Read only 4096 Memory allocation failed asked for 3 x 4096 MB
SD Main Read only 1333 Memory allocation failed asked for 3 x 1333 MB
SD Main Read only 1024 N/A N/A N/A 46.26 46.23 45.87
SD FAT32 LanSpeed 4096 Error writing file Segmentation fault
SD FAT32 LanSpeed 4000 20.14 20.12 20.09 95.33 95.21 95.32
SD FAT32 DriveSpeed 4096 Error writing file Segmentation fault
SD FAT32 DriveSpeed 4000 17.13 17.20 17.25 95.79 95.71 95.54
32 Bit From Above For Comparison
HD EXT4 DriveSpeed 2000* 125.74 123.60 120.72 130.20 128.47 124.88
HD FAT32 DriveSpeed 2000= 83.27 82.66 83.22 143.79 144.02 144.06
SD Main DriveSpeed 16# 19.02 17.56 17.39 44.71 43.50 44.84
Random Read Write
From MB 4 8 16 4 8 16
HD Ext4 LanSpeed msecs* 0.002 0.002 0.002 43.48 45.76 41.66
HD FAT32 LanSpeed msecs= 0.003 0.003 0.003 12.22 12.24 16.22
HD FAT32 DriveSpeed msecs= 0.003 0.003 0.004 12.68 12.37 12.26
SD Main LanSpeed msecs# 0.002 0.002 0.002 4.46 4.17 4.63
SD FAT32 LanSpeed msecs 0.003 0.003 0.003 6.05 5.87 6.05
SD FAT32 DriveSpeed msecs 0.004 0.004 0.010 2.97 2.55 2.42
32 Bit From Above For Comparison
HD EXT4 DriveSpeed msecs* 0.436 0.486 0.314 0.71 0.64 0.70
HD FAT32 DriveSpeed msecs= 0.169 0.170 0.170 0.66 0.63 0.70
SD Main DriveSpeed msecs# 0.250 0.249 0.279 1.61 1.50 1.55
200 Small Files Write Read
File KB 4 8 16 4 8 16
HD Ext4 LanSpeed MB/sec* 69.10 115.19 175.42 232.95 395.30 624.64
HD FAT32 LanSpeed MB/sec= 0.04 0.08 0.16 296.33 485.65 736.98
HD FAT32 DriveSpeed MB/sec= 0.04 0.07 0.15 292.47 40.44 391.07
SD Main LanSpeed MB/sec# 83.78 36.88 148.72 335.38 216.77 786.13
SD FAT32 LanSpeed MB/sec 0.04 0.08 0.15 306.56 493.11 730.44
SD FAT32 DriveSpeed MB/sec 0.04 0.08 0.15 299.67 130.72 34.53
32 Bit From Above For Comparison
HD EXT4 DriveSpeed MB/sec* 8.15 15.02 20.04 8.86 12.86 34.40
HD FAT32 DriveSpeed MB/sec= 0.04 0.07 0.15 0.37 0.73 1.46
SD Main DriveSpeed MB/sec# 2.57 5.10 5.59 9.08 12.42 20.69
High Performance Linpack Benchmark - xhpl
My ATLAS version of HPL has been ported onto the Pi 400. For more detail and results see my ResearchGate report Raspberry Pi
4B Stress Tests Including High Performance Linpack.pdf. Besides being the gold standard benchmark for massively parallel
supercomputers, it makes an excellent stress test, a disadvantage being that there is no ongoing progress report to indicate
deteriorating performance. It has an N input parameter that determines how much memory is required (N x N x 8 bytes). At
N = 20000 this is 3.2 GB (base 10), and the maximum N for 4 GB RAM is not much higher.
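The memory requirement can be checked with a few lines of Python. This is a sketch with hypothetical function names. It counts only the N x N matrix of 8 byte words and takes 4 GB as 4 x 10^9 bytes, ignoring the operating system and HPL workspace, so the practical limit is lower than the upper bound it prints.

```python
import math

# Bytes needed for the N x N matrix of double precision (8 byte) values.
def hpl_matrix_bytes(n, word_bytes=8):
    return n * n * word_bytes

# Largest N whose matrix alone would fit in ram_bytes (an upper bound only).
def largest_n(ram_bytes, word_bytes=8):
    return math.isqrt(ram_bytes // word_bytes)

print(hpl_matrix_bytes(20000) / 1e9)   # GB (base 10) at N = 20000
print(largest_n(4 * 10 ** 9))          # upper bound on N for 4 GB
```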
Below is an example of the main output from the first set of tests on the Pi 400, followed by a summary of later results,
comprising five runs over around 50 minutes at N = 20000. There are start and end overheads not reported in benchmark
execution time. Performance is shown to be constant over this period. Next are details of VMSTAT system monitor results,
showing use of 3.2 GB RAM and 100% CPU utilisation of four cores. This is followed by CPU and Power Management IC
temperatures, during the five runs, nowhere near where CPU MHz throttling might be expected. Room temperature was 27°C and
hot spot readings on the keyboard up to 36°C.
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 20000 128 2 2 451.75 1.181e+01
HPL_pdgesv() start time Thu Jul 23 21:28:59 2020
HPL_pdgesv() end time Thu Jul 23 21:36:31 2020
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0010188 ...... PASSED
================================================================================
Start Time Run Time GFLOPS SumCheck
12:04:00 455.77 11.70 0.0010188
12:14:13 453.90 11.75 0.0010188
12:25:13 458.17 11.64 0.0010188
12:36:14 453.06 11.77 0.0010188
12:46:58 457.73 11.65 0.0010188
VMSTAT -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
SUMMARY swpd free buff cache si so bi bo in cs us sy id wa st
Pre Start 0 3545296 24620 225724 0 0 102 4 176 263 1 1 97 0 0
1 Started 512 304936 24640 199252 0 17 44 18 491 88 98 2 0 0 0
1 Finished 512 421176 24916 199220 0 0 4 4 490 97 98 2 0 0 0
2 Started 7680 314692 18484 194812 1 241 23 243 550 229 97 3 0 0 0
2 Finished 7680 433492 18836 190508 0 0 0 4 535 202 98 2 0 0 0
Later near 7424 427324 20520 193888 0 0 0 3 513 159 97 3 0 0 0
Run 1 2 3 4 5 1 2 3 4 5
CPU °C PMIC °C
Seconds
0 33 40 43 44 45 36 43 46 48 49
30 41 48 52 53 53 38 45 48 50 51
60 47 53 57 57 58 41 48 51 51 52
90 50 54 57 58 58 44 50 51 53 54
To
390 55 58 61 60 61 51 53 55 55 57
420 55 59 59 62 62 51 53 55 55 57
450 55 58 60 62 62 51 53 55 55 57
480 56 55 59 60 62 51 53 55 55 57
Max 56 59 61 62 62 51 53 55 55 57
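The GFLOPS figures above can be cross-checked from N and the run time. The sketch below assumes the operation count usually quoted for HPL, 2/3 N^3 + 3/2 N^2; the report does not state the formula. The function name is hypothetical and the run times are those listed above.

```python
# Assumed HPL operation count: 2/3 N^3 + 3/2 N^2 floating point operations.
def hpl_gflops(n, seconds):
    flops = (2.0 / 3.0) * n ** 3 + 1.5 * n ** 2
    return flops / seconds / 1e9

# First run from the main output, then the five later runs.
RUN_TIMES = (451.75, 455.77, 453.90, 458.17, 453.06, 457.73)

for seconds in RUN_TIMES:
    print(seconds, round(hpl_gflops(20000, seconds), 2))
```

With this assumption the calculated values agree with the reported GFLOPS to two decimal places.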
Below are Pi 4B and Pi 400 results at larger N values. Later 4B results are included, where fanless performance improved. With
performance being partly affected by RAM speed, the fanless Pi 400 gain was around 10%, compared to the Pi 4B with a fan.
System Fan N GFLOPS Seconds Max °C Min MHz
Pi 4B No 16000 6.8 404 86 750/600
Yes 16000 10.4 263 70 1500
Later 4B No 16000 8.6 319 83 1000
Yes 16000 10.4 263 63 1500
Pi 400 16000 11.4 239 57 1800
Pi 4B No 20000 6.2 856 87 750/600
Yes 20000 10.8 494 71 1500
Later 4B No 20000 8.8 604 85 1000
Yes 20000 10.7 497 63 1500
Pi 400 20000 11.8 452 62 1800
64 Bit High Performance Linpack Benchmark - xhpl
Following are results from four consecutive runs of HPL using the 64 bit configuration, with environmental and system activity
monitoring. Moderate temperature increases ensured constant CPU MHz and measured GFLOPS, effectively at the same speed as
the 32 bit version. As noted before, and inexplicably, the calculated and accepted sumchecks were different.
64 Bit HPL Benchmark Results
Start Time N NB P Q Time Gflops SumCheck
Sep 8 12:07:45 20000 128 2 2 456.61 1.168e+01 0.0009306 .. PASSED
Sep 8 12:16:31 20000 128 2 2 459.68 1.160e+01 0.0009602 .. PASSED
Sep 8 12:25:20 20000 128 2 2 460.25 1.159e+01 0.0011412 .. PASSED
Sep 8 12:34:10 20000 128 2 2 454.22 1.174e+01 0.0009636 .. PASSED
Temperature and CPU MHz Measurement Start at Tue Sep 8 12:07:12 2020
Seconds
0.0 ARM MHz=1800, core volt=0.9500V, CPU temp=39.0'C, pmic temp=41.1'C
60.0 ARM MHz=1800, core volt=0.9500V, CPU temp=52.0'C, pmic temp=45.8'C
121.4 ARM MHz=1800, core volt=0.9500V, CPU temp=53.0'C, pmic temp=48.6'C
182.6 ARM MHz=1800, core volt=0.9500V, CPU temp=54.0'C, pmic temp=49.6'C
243.8 ARM MHz=1800, core volt=0.9500V, CPU temp=55.0'C, pmic temp=50.5'C
304.9 ARM MHz=1800, core volt=0.9500V, CPU temp=55.0'C, pmic temp=51.4'C
366.1 ARM MHz=1800, core volt=0.9500V, CPU temp=56.0'C, pmic temp=52.4'C
427.2 ARM MHz=1800, core volt=0.9500V, CPU temp=57.0'C, pmic temp=52.4'C
488.4 ARM MHz=1800, core volt=0.9500V, CPU temp=54.0'C, pmic temp=52.4'C
549.3 ARM MHz=1800, core volt=0.9500V, CPU temp=54.0'C, pmic temp=51.4'C
610.2 ARM MHz=1800, core volt=0.9500V, CPU temp=57.0'C, pmic temp=53.3'C
671.5 ARM MHz=1800, core volt=0.9500V, CPU temp=59.0'C, pmic temp=54.3'C
732.7 ARM MHz=1800, core volt=0.9500V, CPU temp=57.0'C, pmic temp=54.3'C
794.0 ARM MHz=1800, core volt=0.9500V, CPU temp=59.0'C, pmic temp=54.3'C
855.2 ARM MHz=1800, core volt=0.9500V, CPU temp=58.0'C, pmic temp=55.2'C
916.3 ARM MHz=1800, core volt=0.9500V, CPU temp=58.0'C, pmic temp=55.2'C
977.5 ARM MHz=1800, core volt=0.9500V, CPU temp=59.0'C, pmic temp=55.2'C
1038.7 ARM MHz=1800, core volt=0.9500V, CPU temp=56.0'C, pmic temp=54.3'C
1099.7 ARM MHz=1800, core volt=0.9500V, CPU temp=59.0'C, pmic temp=54.3'C
1160.8 ARM MHz=1800, core volt=0.9500V, CPU temp=61.0'C, pmic temp=55.2'C
1222.0 ARM MHz=1800, core volt=0.9500V, CPU temp=60.0'C, pmic temp=55.2'C
1283.1 ARM MHz=1800, core volt=0.9500V, CPU temp=60.0'C, pmic temp=56.2'C
1344.2 ARM MHz=1800, core volt=0.9500V, CPU temp=60.0'C, pmic temp=56.2'C
1405.4 ARM MHz=1800, core volt=0.9500V, CPU temp=60.0'C, pmic temp=57.1'C
1466.5 ARM MHz=1800, core volt=0.9500V, CPU temp=61.0'C, pmic temp=57.1'C
1527.8 ARM MHz=1800, core volt=0.9500V, CPU temp=59.0'C, pmic temp=56.2'C
1589.0 ARM MHz=1800, core volt=0.9500V, CPU temp=57.0'C, pmic temp=55.2'C
1649.9 ARM MHz=1800, core volt=0.9500V, CPU temp=60.0'C, pmic temp=55.2'C
1710.9 ARM MHz=1800, core volt=0.9500V, CPU temp=60.0'C, pmic temp=57.1'C
1772.1 ARM MHz=1800, core volt=0.9500V, CPU temp=61.0'C, pmic temp=57.1'C
1833.4 ARM MHz=1800, core volt=0.9500V, CPU temp=61.0'C, pmic temp=57.1'C
1894.5 ARM MHz=1800, core volt=0.9500V, CPU temp=62.0'C, pmic temp=57.1'C
1955.6 ARM MHz=1800, core volt=0.9500V, CPU temp=62.0'C, pmic temp=58.0'C
2016.7 ARM MHz=1800, core volt=0.9500V, CPU temp=62.0'C, pmic temp=58.0'C
2077.9 ARM MHz=1800, core volt=0.9500V, CPU temp=59.0'C, pmic temp=57.1'C
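A log in the format shown above can be summarised automatically, for example to confirm that MHz never dropped. The following Python sketch assumes only the line layout visible in the listing. The function names are hypothetical and are not part of the monitoring program.

```python
import re

# Pattern for one monitor line, based on the layout shown in the listing.
LINE = re.compile(
    r"^\s*(?P<secs>[\d.]+)\s+ARM MHz=(?P<mhz>\d+),\s*core volt=(?P<volts>[\d.]+)V,"
    r"\s*CPU temp=(?P<cpu>[\d.]+)'C,\s*pmic temp=(?P<pmic>[\d.]+)'C"
)

def parse_line(line):
    """Return the numeric fields of one log line, or None if it does not match."""
    match = LINE.match(line)
    if match is None:
        return None
    return {name: float(value) for name, value in match.groupdict().items()}

def summarise(lines):
    """Lowest MHz and highest temperatures over all matching lines."""
    rows = [row for row in map(parse_line, lines) if row is not None]
    return {
        "min_mhz": min(row["mhz"] for row in rows),
        "max_cpu": max(row["cpu"] for row in rows),
        "max_pmic": max(row["pmic"] for row in rows),
    }

# Three lines copied from the listing.
SAMPLE = [
    "0.0 ARM MHz=1800, core volt=0.9500V, CPU temp=39.0'C, pmic temp=41.1'C",
    "1955.6 ARM MHz=1800, core volt=0.9500V, CPU temp=62.0'C, pmic temp=58.0'C",
    "2077.9 ARM MHz=1800, core volt=0.9500V, CPU temp=59.0'C, pmic temp=57.1'C",
]
print(summarise(SAMPLE))
```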
vmstat 60 seconds sampling
procs -----------memory---------- ---swap-- ----io---- --system- ------cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 3436000 23368 250372 0 0 286 6 204 293 2 2 95 1 0
4 0 15616 266040 880 131528 0 259 109 263 1250 462 94 2 4 0 0
4 0 18688 258616 3928 136892 4 49 251 51 1114 119 100 0 0 0 0
4 0 21504 262664 4240 132936 0 46 13 49 1102 92 100 0 0 0 0
4 0 21504 262160 4264 132940 1 0 1 2 1097 84 100 0 0 0 0
4 0 21504 262932 4288 132932 0 0 0 2 1095 79 100 0 0 0 0
4 0 21504 262428 4316 132944 0 0 0 2 1102 85 100 0 0 0 0
4 0 21504 265200 4340 130812 1 0 1 2 1092 74 100 0 0 0 0
4 0 21504 264948 4708 132172 3 0 31 2 1099 93 100 0 0 0 0
4 0 21504 2423512 4852 135656 0 0 20 3 1105 100 99 1 0 0 0
4 0 21504 280360 4880 117156 1 0 3 3 1100 89 99 1 0 0 0
4 0 27904 294848 4908 105784 0 105 79 107 1115 110 100 0 0 0 0
4 0 27904 293336 4928 105984 0 0 3 2 1097 79 100 0 0 0 0
4 0 57600 301452 9764 120232 13 495 758 517 1458 809 99 1 0 0 0
4 0 73728 312128 9576 123948 25 283 257 305 1336 548 99 1 0 0 0
4 0 73728 311372 9740 124008 0 0 3 2 1099 86 100 0 0 0 0
4 0 73728 310868 9752 124016 1 0 1 2 1096 80 100 0 0 0 0
4 0 73728 309828 9764 124624 0 0 10 2 1098 85 100 0 0 0 0
4 0 73472 1445224 10136 127356 1 0 73 3 1118 128 98 2 0 0 0
4 0 73472 306776 10172 128404 1 0 1 4 1098 87 99 1 0 0 0
4 0 73472 306280 10196 128488 1 0 3 2 1166 219 100 0 0 0 0
5 0 73472 305920 10216 128516 0 0 0 2 1095 78 100 0 0 0 0
4 0 73472 305044 10244 128524 1 0 1 2 1100 90 100 0 0 0 0
4 0 73216 305412 10268 128620 0 0 2 2 1094 80 100 0 0 0 0
4 0 73216 305040 10292 128632 0 0 0 2 1091 75 100 0 0 0 0
4 0 72960 304916 10320 128640 0 0 0 2 1100 81 100 0 0 0 0
4 0 72960 302436 10348 131852 1 0 1 3 1096 80 100 0 0 0 0
4 0 72704 470192 10380 127388 0 0 0 3 1111 110 98 2 0 0 0
4 0 72704 306264 10412 128684 0 0 0 2 1126 146 100 0 0 0 0
4 0 72704 305768 10440 128696 1 0 1 3 1095 82 100 0 0 0 0
4 0 72704 305768 10464 128708 0 0 0 2 1095 79 100 0 0 0 0
4 0 72704 305876 10488 128752 0 0 1 2 1096 77 100 0 0 0 0
4 0 72704 305752 10516 128760 0 0 0 3 1094 77 100 0 0 0 0
4 0 72704 305504 10540 128768 1 0 1 2 1099 87 100 0 0 0 0
4 0 72704 305380 10568 128784 0 0 0 3 1090 74 100 0 0 0 0
4 0 72704 299800 10596 134108 5 0 19 2 1227 353 100 0 0 0 0
32 Bit Stress Test Benchmarks - MP-FPUStress, MP-FPUStressDP, MP-IntStress
These stress tests have a benchmarking mode that provides choices for a long running test. The choices cover the number of
threads, the memory size (to exercise caches or RAM) and, for the floating point programs, the number of operations carried out
on each data word. Numeric sumchecks are carried out to verify all calculations.
Floating Point - Below are Pi 4B single precision speeds in MFLOPS then sumchecks, followed by those for double precision
working, then the same for Pi 400 and comparisons. The latter indicate a 20% Pi 400 performance gain for cache based tests and
no difference on those that were RAM speed dependent. The sumcheck comparisons show that the two systems produced the
same numeric results carrying out millions of calculations.
Integer - The test loop comprises 32 add or subtract instructions, operating on hexadecimal data patterns, with sequences of 8
subtracts then 8 adds to restore the original pattern. Performance is measured in MBytes per second. Results show the varying
hexadecimal data patterns used and compared verification. Comparative performance again shows that the Pi 400 was 20% faster
on CPU speed dependent tasks and no different when reliant on RAM speed.
32 Bit Single Precision Floating Point 32 Bit Double Precision Floating Point
------ MFLOPS ----- ----- Sumchecks --- ------ MFLOPS ----- ----- Sumchecks ---
* * *
-------------------------------------- Pi 4B -------------------------------------
Ops KB KB MB * KB KB MB * KB KB MB * KB KB MB
Thrds /Wd 12.8 128 12.8 * 12.8 128 12.8 * 12.8 128 12.8 * 12.8 128 12.8
T1 2 2603 2607 651 * 40392 76406 99700 * 992 990 317 * 40395 76384 99700
T2 2 5017 5138 645 * 40392 76406 99700 * 1940 1993 319 * 40395 76384 99700
T4 2 7045 9724 656 * 40392 76406 99700 * 3639 3925 329 * 40395 76384 99700
T8 2 8747 9690 633 * 40392 76406 99700 * 3690 3913 331 * 40395 76384 99700
T1 8 5542 5427 2479 * 54756 85091 99820 * 2390 2435 1266 * 54805 85108 99820
T2 8 10774 10716 2579 * 54756 85091 99820 * 4608 4853 1170 * 54805 85108 99820
T4 8 19196 20561 2595 * 54756 85091 99820 * 8902 9081 1165 * 54805 85108 99820
T8 8 18718 20629 2512 * 54756 85091 99820 * 8852 8971 1098 * 54805 85108 99820
T1 32 5307 5244 5217 * 35296 66020 99519 * 2703 2724 2672 * 35159 66065 99521
T2 32 10559 10521 9764 * 35296 66020 99519 * 5385 5442 5009 * 35159 66065 99521
T4 32 20070 20557 9864 * 35296 66020 99519 * 10582 10836 4824 * 35159 66065 99521
T8 32 19793 20919 9460 * 35296 66020 99519 * 10484 10749 4765 * 35159 66065 99521
------------------------------------- Pi 400 -------------------------------------
T1 2 3163 3129 646 * 40392 76406 99700 * 1192 1187 321 * 40395 76384 99700
T2 2 6145 6144 646 * 40392 76406 99700 * 2362 2392 324 * 40395 76384 99700
T4 2 8974 10119 655 * 40392 76406 99700 * 4155 4692 278 * 40395 76384 99700
T8 2 9584 11780 645 * 40392 76406 99700 * 4232 4730 272 * 40395 76384 99700
T1 8 6606 6514 2515 * 54756 85091 99820 * 2899 2931 1250 * 54805 85108 99820
T2 8 13028 12755 2831 * 54756 85091 99820 * 5643 5829 1128 * 54805 85108 99820
T4 8 22820 25005 2778 * 54756 85091 99820 * 10637 11351 1208 * 54805 85108 99820
T8 8 23260 24714 2345 * 54756 85091 99820 * 10850 10938 1217 * 54805 85108 99820
T1 32 6368 6327 6115 * 35296 66020 99519 * 3252 3257 3156 * 35159 66065 99521
T2 32 12643 12602 10838 * 35296 66020 99519 * 6484 6538 5455 * 35159 66065 99521
T4 32 24016 25146 10124 * 35296 66020 99519 * 12833 12791 4790 * 35159 66065 99521
T8 32 23811 24068 8760 * 35296 66020 99519 * 12093 12226 4463 * 35159 66065 99521
--------------------------------- Pi 400 / Pi 4B --------------------------------
L1 L2 RAM * ---- Sumchecks --- * L1 L2 RAM * ---- Sumchecks ---
T1 2 1.22 1.20 0.99 * 1.00 1.00 1.00 * 1.20 1.20 1.01 * 1.00 1.00 1.00
T2 2 1.22 1.20 1.00 * 1.00 1.00 1.00 * 1.22 1.20 1.02 * 1.00 1.00 1.00
T4 2 1.27 1.04 1.00 * 1.00 1.00 1.00 * 1.14 1.20 0.84 * 1.00 1.00 1.00
T8 2 1.10 1.22 1.02 * 1.00 1.00 1.00 * 1.15 1.21 0.82 * 1.00 1.00 1.00
T1 8 1.19 1.20 1.01 * 1.00 1.00 1.00 * 1.21 1.20 0.99 * 1.00 1.00 1.00
T2 8 1.21 1.19 1.10 * 1.00 1.00 1.00 * 1.22 1.20 0.96 * 1.00 1.00 1.00
T4 8 1.19 1.22 1.07 * 1.00 1.00 1.00 * 1.19 1.25 1.04 * 1.00 1.00 1.00
T8 8 1.24 1.20 0.93 * 1.00 1.00 1.00 * 1.23 1.22 1.11 * 1.00 1.00 1.00
T1 32 1.20 1.21 1.17 * 1.00 1.00 1.00 * 1.20 1.20 1.18 * 1.00 1.00 1.00
T2 32 1.20 1.20 1.11 * 1.00 1.00 1.00 * 1.20 1.20 1.09 * 1.00 1.00 1.00
T4 32 1.20 1.22 1.03 * 1.00 1.00 1.00 * 1.21 1.18 0.99 * 1.00 1.00 1.00
T8 32 1.20 1.15 0.93 * 1.00 1.00 1.00 * 1.15 1.14 0.94 * 1.00 1.00 1.00
---------------------------- 32 Bit Integers ---------------------------
Pi 4B MB/second Same Pi 400 MB/second Pi 400/Pi4B
KB KB MB All KB KB MB KB KB MB
Threads 16 160 16 Sumcheck Tests 16 160 16 16 160 16
1 5751 5755 3882 00000000 Yes 7062 6907 3825 1.23 1.20 0.99
2 11820 11302 3772 FFFFFFFF Yes 14215 13724 3736 1.20 1.21 0.99
4 22467 21906 3375 5A5A5A5A Yes 27026 26533 3397 1.20 1.21 1.01
8 22019 22094 3415 AAAAAAAA Yes 26959 25993 3419 1.22 1.18 1.00
16 22891 22448 3395 CCCCCCCC Yes 27424 27479 3413 1.20 1.22 1.01
32 22574 23412 3436 0F0F0F0F Yes 27143 27869 3458 1.20 1.19 1.01
64 Bit Stress Test Benchmarks - MP-FPUStress64g8, MP-FPUStress64DPg8, MP-IntStress64g8
Unlike the earlier CPU benchmarks reported here, the 32 bit stress tests were produced by an earlier compiler, so comparisons
may not be valid. In this case, the 64 bit floating point performance is generally shown to be faster, with sumchecks that are
identical at double precision and differ only slightly at single precision, but that with integer arithmetic is indicated as often
running at half speed. Faster results from an earlier 64 bit version are provided to identify the later compiler deficiency. See
64 Bit Danger.
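The half speed observation can be reproduced from the integer MB/second figures in this report. The Python sketch below copies the 1, 2 and 4 thread rows of the 32 bit Pi 400 results and of the gcc 8 64 bit results. The dictionary and function names are hypothetical.

```python
# MB/second at 16 KB, 160 KB and 16 MB, keyed by thread count (from the tables).
PI400_32BIT = {1: (7062, 6907, 3825), 2: (14215, 13724, 3736), 4: (27026, 26533, 3397)}
GCC8_64BIT = {1: (3455, 3481, 3074), 2: (7047, 6975, 3507), 4: (13712, 13977, 3357)}

def ratios(new, old):
    """Speed of new relative to old, rounded to two decimal places."""
    return {
        threads: tuple(round(n / o, 2) for n, o in zip(new[threads], old[threads]))
        for threads in new
    }

for threads, row in ratios(GCC8_64BIT, PI400_32BIT).items():
    print(threads, row)
```

The cache based columns come out at about 0.5, while the RAM based column is closer to 1.0, matching the 64 Bit/32 Bit ratios tabulated in this section.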
64 Bit Single Precision Floating Point 64 Bit Double Precision Floating Point
------ MFLOPS ----- ----- Sumchecks --- ------ MFLOPS ----- ----- Sumchecks ---
* * *
Ops KB KB MB * KB KB MB * KB KB MB * KB KB MB
Thrds /Wd 12.8 128 12.8 * 12.8 128 12.8 * 12.8 128 12.8 * 12.8 128 12.8
T1 2 3114 4852 1191 * 40394 76395 99700 * 1822 2252 613 * 40395 76384 99700
T2 2 9362 9555 1236 * 40394 76395 99700 * 4190 4493 604 * 40395 76384 99700
T4 2 16966 15205 1119 * 40394 76395 99700 * 8082 8708 603 * 40395 76384 99700
T8 2 16096 17963 1027 * 40394 76395 99700 * 8275 7905 603 * 40395 76384 99700
T1 8 5645 5697 3695 * 54764 85092 99820 * 3342 3354 2190 * 54805 85108 99820
T2 8 11333 11335 4125 * 54764 85092 99820 * 6643 6718 2142 * 54805 85108 99820
T4 8 21208 22499 4151 * 54764 85092 99820 * 12734 13322 2058 * 54805 85108 99820
T8 8 21585 21456 4115 * 54764 85092 99820 * 12919 12523 2101 * 54805 85108 99820
T1 32 7025 7049 7006 * 35206 66015 99520 * 4002 4009 3961 * 35159 66065 99521
T2 32 14081 14047 13565 * 35206 66015 99520 * 7993 8016 7511 * 35159 66065 99521
T4 32 27027 28036 16116 * 35206 66015 99520 * 15462 15988 8132 * 35159 66065 99521
T8 32 26548 27040 16049 * 35206 66015 99520 * 15722 15825 8038 * 35159 66065 99521
------------------------------- Pi 400 64 Bit/32 Bit ------------------------------
L1 L2 RAM * ---- Sumchecks --- * L1 L2 RAM * ---- Sumchecks ---
T1 2 0.98 1.55 1.84 * 1.00 1.00 1.00 * 1.53 1.90 1.91 * 1.00 1.00 1.00
T2 2 1.52 1.56 1.91 * 1.00 1.00 1.00 * 1.77 1.88 1.86 * 1.00 1.00 1.00
T4 2 1.89 1.50 1.71 * 1.00 1.00 1.00 * 1.95 1.86 2.17 * 1.00 1.00 1.00
T8 2 1.68 1.52 1.59 * 1.00 1.00 1.00 * 1.96 1.67 2.22 * 1.00 1.00 1.00
T1 8 0.85 0.87 1.47 * 1.00 1.00 1.00 * 1.15 1.14 1.75 * 1.00 1.00 1.00
T2 8 0.87 0.89 1.46 * 1.00 1.00 1.00 * 1.18 1.15 1.90 * 1.00 1.00 1.00
T4 8 0.93 0.90 1.49 * 1.00 1.00 1.00 * 1.20 1.17 1.70 * 1.00 1.00 1.00
T8 8 0.93 0.87 1.75 * 1.00 1.00 1.00 * 1.19 1.14 1.73 * 1.00 1.00 1.00
T1 32 1.10 1.11 1.15 * 1.00 1.00 1.00 * 1.23 1.23 1.26 * 1.00 1.00 1.00
T2 32 1.11 1.11 1.25 * 1.00 1.00 1.00 * 1.23 1.23 1.38 * 1.00 1.00 1.00
T4 32 1.13 1.11 1.59 * 1.00 1.00 1.00 * 1.20 1.25 1.70 * 1.00 1.00 1.00
T8 32 1.11 1.12 1.83 * 1.00 1.00 1.00 * 1.30 1.29 1.80 * 1.00 1.00 1.00
------------------------------ Pi 400 Integers ----------------------------
gcc 8 MB/second Same Pi 400 64 Bit/32 Bit Version 1 MB/sec
KB KB MB All KB KB MB KB KB MB
Threads 16 160 16 Sumcheck Tests 16 160 16 16 160 16
1 3455 3481 3074 00000000 Yes 0.49 0.50 0.80 8774 8150 3772
2 7047 6975 3507 FFFFFFFF Yes 0.50 0.51 0.94 17241 15941 3687
4 13712 13977 3357 5A5A5A5A Yes 0.51 0.53 0.99 32768 29966 3339
8 13631 13696 3353 AAAAAAAA Yes 0.51 0.53 0.98 32845 33055 3366
16 13184 13906 3351 CCCCCCCC Yes 0.48 0.51 0.98 32959 34188 3364
32 12617 13960 3414 0F0F0F0F Yes 0.46 0.50 0.99 31531 33694 3388
Stress Test Parameters
The following show stress test run time parameters. The classifications can be upper or lower case and only the first character is
interpreted.
./MP-FPUStress Threads tt, Minutes mm, KB kk, Ops oo, Log ll
./MP-FPUStressDP Threads tt, Minutes mm, KB kk, Ops oo, Log ll
./MP-IntStress Threads tt, Minutes mm, KB kk, Log ll
./RPiHeatMHzVolts2 Passes pp, Seconds ss, Log ll
vmstat ss pp
tt = Threads 1, 2, 4, 8, 16, 32, (64 FPU) mm = Minutes greater than 0
kk = KBytes 12 to 15624 oo = Operations Per Word 2, 8 or 32
ll = number added to log file name, 0 to 99 pp = Passes (at ss second intervals)
ss = Second intervals
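The rule that only the first character of each classification is interpreted, in either case, can be sketched as below. This is an illustrative Python parser, not the stress tests' own code. It assumes name and value pairs separated by spaces with any commas ignored, and it takes the value ranges from the list above. The names parse_params and VALID are hypothetical.

```python
# Range checks taken from the parameter list above (64 threads is FPU only).
VALID = {
    "T": lambda v: v in (1, 2, 4, 8, 16, 32, 64),  # Threads
    "M": lambda v: v > 0,                           # Minutes
    "K": lambda v: 12 <= v <= 15624,                # KBytes
    "O": lambda v: v in (2, 8, 32),                 # Operations per word
    "L": lambda v: 0 <= v <= 99,                    # Log file number
}

def parse_params(args):
    """Map name/value pairs to a dict keyed by the upper-cased first letter."""
    tokens = [arg.strip(",") for arg in args if arg.strip(",")]
    params = {}
    for name, value in zip(tokens[0::2], tokens[1::2]):
        key = name[0].upper()
        number = int(value)
        if key not in VALID or not VALID[key](number):
            raise ValueError(f"bad parameter: {name} {value}")
        params[key] = number
    return params

print(parse_params("threads 4, m 30, KB 256, Ops 32, log 1".split()))
```

In this sketch "threads", "T" and "t" are all accepted as the same classification.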
32 Bit Floating Point Stress Tests - MP-FPUStress, MP-FPUStressDP
Half hour single and double precision cache based stress tests (256 KB, 4 Threads, 32 Ops/Word) were run on the latest fan
cooled 8 GB Pi 4, with appropriate firmware and Operating System, side by side with the same tests on the fanless Pi 400. This
was on a hot August day, where the room temperature was 30°C. Following is a summary of results. Both systems ran at constant
CPU MHz and voltage, with effectively constant measured MFLOPS, the latter reflecting the expected Pi 400 gain. CPU
temperatures did not approach the anticipated level where MHz throttling would occur, and maximum measurements on the two
systems were similar. This was unlike those for the PMIC, where the Pi 400 recordings were hotter. The Pi 400 has a full width
metallic heat spreader between the keyboard and the circuit board, a quick look suggesting thermal contact with the CPU. This
appears to be an excellent cooling arrangement.
An extra test was carried out on the Pi 4B, with the fan disabled, demonstrating severe CPU MHz throttling and much worse
performance, underlining the Pi 400 advantage. Average temperature, over half an hour, was 84°C accompanied by a 42%
reduction in performance.
The Pi 400 keyboard temperature was measured during the stress tests, reaching a warm to touch 40°C, according to my infrared
thermometer.
Pi 4B Pi 4B No Fan Pi 400 Fanless
Single Double Single Single Double
MFLOPS Avg 20896 10797 12151 25056 12953
MFLOPS Min 20541 10587 10946 24587 12754
MHz Avg 1500 1500 870 1800 1800
MHz Min 1500 1500 600 1800 1800
Volts 0.8600 0.8600 0.8600 0.9500 0.9500
Temperatures CPU PMIC CPU PMIC CPU PMIC CPU PMIC CPU PMIC
°C °C °C °C °C °C °C °C °C °C
Avg 66.6 49.5 60.3 47.1 84.0 71.2 61.8 57.7 61.1 59.5
Max 69.0 50.5 62.0 48.6 86.0 72.2 68.0 61.8 64.0 60.9
Minutes
0 43 38.2 44 41.1 61 57.1 38 43.9 48 53.3
1 63 45.8 59 43.9 82 65.6 56 49.6 60 57.1
2 66 48.6 61 46.7 83 68.4 57 51.4 59 58.0
3 67 49.6 61 47.7 84 70.3 58 53.3 61 59.0
4 67 49.6 62 46.7 85 70.3 59 54.3 62 59.0
5 67 49.6 61 47.7 84 70.3 61 55.2 61 59.0
6 68 49.6 61 47.7 85 71.2 61 55.2 62 59.0
7 67 49.6 62 47.7 85 72.2 62 57.1 63 59.0
8 69 49.6 61 47.7 85 72.2 62 57.1 54 58.0
9 69 49.6 62 47.7 86 72.2 63 58.0 61 59.0
10 68 49.6 62 47.7 85 72.2 63 58.0 62 59.0
11 67 49.6 61 46.7 86 72.2 64 59.0 62 59.0
12 67 49.6 61 46.7 86 72.2 65 59.0 62 59.0
13 68 50.5 59 46.7 86 72.2 53 55.2 62 59.0
14 68 50.5 60 46.7 86 72.2 64 59.0 63 59.0
15 68 50.5 60 46.7 85 72.2 61 57.1 64 60.9
16 69 50.5 59 46.7 85 72.2 64 59.0 64 60.9
17 69 50.5 60 46.7 85 72.2 66 59.0 63 60.9
18 68 49.6 60 46.7 86 72.2 66 59.9 61 60.9
19 68 49.6 61 47.7 85 72.2 66 60.9 64 60.9
20 69 50.5 61 47.7 85 72.2 66 60.9 56 59.0
21 68 50.5 62 47.7 86 72.2 66 60.9 63 59.0
22 68 50.5 61 47.7 85 72.2 66 60.9 62 60.9
23 68 50.5 62 47.7 85 72.2 67 60.9 64 60.9
24 67 50.5 62 48.6 85 72.2 67 61.8 63 60.9
25 68 50.5 62 48.6 84 72.2 54 58.0 63 60.9
26 68 50.5 61 47.7 86 72.2 67 60.9 63 60.9
27 68 50.5 62 47.7 85 72.2 66 61.8 64 60.9
28 69 50.5 61 47.7 86 72.2 68 61.8 63 60.9
29 68 50.5 60 47.7 85 72.2 68 61.8 63 60.9
30 57 50.5 59 47.7 78 72.2 51 57.1 53 59.0
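The percentages quoted in the text can be reproduced from the MFLOPS Avg row of the table above. This is a minimal Python sketch of that arithmetic; the variable and function names are illustrative only.

```python
def percent_change(new, old):
    """Percentage change of new relative to old (positive means faster)."""
    return (new / old - 1.0) * 100.0


# MFLOPS Avg values copied from the table above.
pi4b_fan_sp = 20896      # Pi 4B with fan, single precision
pi4b_nofan_sp = 12151    # Pi 4B fan disabled, single precision
pi400_sp = 25056         # Pi 400, single precision
pi4b_fan_dp = 10797      # Pi 4B with fan, double precision
pi400_dp = 12953         # Pi 400, double precision

throttle_loss = -percent_change(pi4b_nofan_sp, pi4b_fan_sp)
gain_sp = percent_change(pi400_sp, pi4b_fan_sp)
gain_dp = percent_change(pi400_dp, pi4b_fan_dp)

print(f"Pi 4B loss without fan: {throttle_loss:.0f}%")
print(f"Pi 400 gain, single precision: {gain_sp:.0f}%")
print(f"Pi 400 gain, double precision: {gain_dp:.0f}%")
```

The first figure corresponds to the 42% reduction mentioned for the fanless Pi 4B, and the other two to the expected 20% gain from 1800 MHz versus 1500 MHz.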
64 Bit Floating Point Stress Tests - MP-FPUStress64g8, MP-FPUStress64DPg8
Three half hour floating point stress tests were run consecutively on the Pi 400, two at single precision and one at double
precision. All ran using four threads, sharing 256 KB, with 32 floating point operations per data word read/written. The usual
environment and system monitors were run at the same time. Room temperature was 27°C.
With temperatures remaining relatively low, CPU MHz and measured performance were constant and a little faster than at 32 bits.
Average Pi 400 64 bit double precision performance of 15.9 GFLOPS can be judged against 11.6 GFLOPS from High Performance
Linpack.
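As a rough check on "a little faster than at 32 bits", the average MFLOPS of the 64 bit runs (first single precision run and the double precision run, from the table that follows) can be divided by the Pi 400 averages from the 32 bit section. The sketch below is illustrative Python, with the figures copied from the two tables.

```python
# Pi 400 MFLOPS averages, copied from the 32 bit and 64 bit stress test tables.
mflops_32 = {"single": 25056, "double": 12953}
mflops_64 = {"single": 28011, "double": 15948}
hpl_gflops = 11.6   # High Performance Linpack result quoted in the text

# Speed ratio of 64 bit over 32 bit for each precision.
ratios = {p: mflops_64[p] / mflops_32[p] for p in mflops_32}

# Fraction of the stress test double precision rate achieved by HPL.
hpl_fraction = hpl_gflops / (mflops_64["double"] / 1000.0)

for precision, ratio in ratios.items():
    print(f"{precision}: 64 bit / 32 bit = {ratio:.2f}")
print(f"HPL / stress test double precision = {hpl_fraction:.2f}")
```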
10-Sep-20 09:45 10:18 10:49
Precision Single Single Double
MFLOPS Avg 28011 27995 15948
MFLOPS Min 26336 27019 15467
MHz Avg 1800 1800 1800
MHz Min 1800 1800 1800
Volts 0.9500 0.9500 0.9500
Temperatures CPU PMIC CPU PMIC CPU PMIC
°C °C °C °C °C °C
Avg 52.3 51.7 56.6 56.2 60.5 58.5
Max 57.0 55.2 59.0 58.0 63.0 60.9
Minutes
0 35 39.2 47 50.5 47 50.5
1 45 43.9 55 53.3 58 55.2
2 48 45.8 55 54.3 58 55.2
3 49 46.7 54 55.2 59 57.1
4 50 47.7 56 55.2 59 57.1
5 49 48.6 56 55.2 59 57.1
6 51 49.6 57 55.2 61 58.0
7 52 49.6 55 55.2 60 58.0
8 51 50.5 57 55.2 61 59.0
9 52 50.5 56 55.2 60 59.0
10 52 51.4 57 56.2 60 59.0
11 53 51.4 58 56.2 61 59.0
12 53 51.4 57 57.1 62 59.0
13 54 52.4 57 57.1 61 59.0
14 54 52.4 57 57.1 61 59.0
15 54 53.3 58 57.1 61 59.0
16 54 53.3 57 57.1 61 59.0
17 54 53.3 58 57.1 62 59.0
18 55 53.3 58 57.1 62 59.0
19 55 54.3 58 57.1 62 59.0
20 56 54.3 57 57.1 62 59.0
21 55 54.3 59 57.1 63 59.0
22 55 54.3 58 57.1 63 59.0
23 55 54.3 58 57.1 61 59.0
24 54 55.2 58 57.1 62 59.0
25 56 55.2 58 57.1 63 59.0
26 55 55.2 59 57.1 63 60.9
27 55 55.2 59 57.1 63 60.9
28 56 55.2 58 58.0 63 60.9
29 57 55.2 59 58.0 63 60.9
30 48 54.3 50 56.2 53 59.0
32 Bit Integer Stress Tests - MP-IntStress
The Pi 400 integer stress tests were run shortly after those on floating point, leading to higher temperatures on starting, but
not much difference in maximum recordings. They were also run using a data size of 256 KB and four threads. Both the fan cooled
Pi 4 and Pi 400 again ran continuously at maximum MHz and constant voltage. Measured MB/second of each was also effectively
constant, with the Pi 400 being 20% faster. The Pi 4B was also run with the fan inoperable, where CPU MHz throttling was less
severe than in the floating point tests, with a lower performance degradation of 29%.
An additional test was carried out on the Pi 400 outside on a sheltered table, where the local temperature was initially 40°C,
increasing to 44°C with the sun shining on part of the keyboard. The keyboard temperature increased to 51°C for the last minute
of the test. Over the testing time, maximum temperatures increased by around 7°C, not sufficient to invoke throttling and
providing virtually the same performance as in the earlier test.
Pi 4B Fan Pi 4B No Fan Pi 400 Fanless Pi 400 Outside
MB/S Avg 22164 15736 26395 26215
MB/S Min 21472 13756 25541 25779
MHz Avg 1500 1053 1800 1800
MHz Min 1500 600 1800 1800
Volts 0.8600 0.8600 0.9500 0.9500
Temperatures CPU PMIC CPU PMIC CPU PMIC CPU PMIC
°C °C °C °C °C °C °C °C
Avg 61.5 47.6 82.7 69.9 62.1 59.8 65.1 63.5
Max 64.0 48.6 86.0 72.2 64.0 61.8 71.0 69.4
Minutes
0 45 41.1 60 55.2 48 53.3 45 49.6
1 59 43.9 78 62.8 58 55.2 56 54.3
2 63 46.7 82 66.5 62 57.1 58 56.2
3 63 48.6 83 69.4 60 59.0 60 58.0
4 63 48.6 83 70.3 62 59.0 62 59.0
5 63 48.6 83 70.3 62 59.0 63 60.9
6 63 47.7 84 70.3 62 59.0 63 61.8
7 64 47.7 84 70.3 63 59.0 63 61.8
8 62 47.7 83 70.3 61 59.0 64 62.8
9 60 46.7 84 70.3 64 59.0 65 62.8
10 62 47.7 83 70.3 63 59.0 65 62.8
11 62 47.7 83 70.3 62 59.0 64 62.8
12 63 48.6 83 70.3 63 59.0 66 62.8
13 62 48.6 83 70.3 63 60.9 67 62.8
14 63 48.6 83 70.3 63 59.0 68 64.6
15 64 48.6 83 70.3 62 59.0 68 64.6
16 63 48.6 83 70.3 63 60.9 67 65.6
17 61 47.7 83 70.3 63 60.9 68 65.6
18 63 47.7 84 71.2 62 60.9 67 65.6
19 62 47.7 83 70.3 64 60.9 68 65.6
20 63 47.7 83 70.3 64 60.9 67 65.6
21 63 48.6 84 70.3 61 60.9 69 66.5
22 63 48.6 84 70.3 63 60.9 67 65.6
23 64 47.7 85 72.2 63 60.9 67 65.6
24 62 48.6 84 72.2 63 61.8 68 66.5
25 61 47.7 86 72.2 63 60.9 69 66.5
26 62 47.7 86 72.2 64 61.8 70 67.5
27 61 47.7 84 72.2 63 61.8 70 67.5
28 62 47.7 86 72.2 62 60.9 71 68.4
29 62 48.6 84 72.2 64 61.8 71 69.4
30 54 47.7 84 72.2 64 61.8 62 68.4
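For the fanless Pi 4B runs, the loss in measured speed can be compared with the loss in average CPU MHz. The following Python sketch uses the averages from the floating point and integer tables. It suggests that throughput fell roughly in proportion to the average clock, which is my reading of the figures rather than a claim made by the benchmark programs.

```python
FULL_MHZ = 1500   # Pi 4B clock when not throttled

# Average speed with fan, without fan, and average MHz without fan.
runs = {
    "floating point, single precision (MFLOPS)":
        {"perf_fan": 20896, "perf_nofan": 12151, "mhz_nofan": 870},
    "integer (MB/second)":
        {"perf_fan": 22164, "perf_nofan": 15736, "mhz_nofan": 1053},
}


def throttle_ratios(run):
    """Return (performance ratio, clock ratio) for a fanless run."""
    perf_ratio = run["perf_nofan"] / run["perf_fan"]
    mhz_ratio = run["mhz_nofan"] / FULL_MHZ
    return perf_ratio, mhz_ratio


for name, run in runs.items():
    perf_ratio, mhz_ratio = throttle_ratios(run)
    print(f"{name}: speed x{perf_ratio:.2f}, clock x{mhz_ratio:.2f}")
```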
64 Bit Integer Stress Tests - MP-IntStress64
As indicated in 64 Bit Stress Test Benchmarks, the gcc 8 compiled integer calculations often ran at half speed (see 64 Bit
Danger). So, the earlier MP-IntStress64 program was used for these stress tests. Three half hour runs of these were
carried out, using four threads, covering data from L1 caches, shared L2 cache and RAM. For the latter, more than 3 GB was
used, as reflected in the vmstat details shown below.
Again, temperatures were low and performance constant, within normal variations.
Memory KB 16 256 3500000 vmstat Memory
3500000 KB
MB/sec Avg 34803 28756 3571
MB/sec Min 32815 26940 2804
MHz Avg 1800 1800 1800
MHz Min 1800 1800 1800
Volts 0.9500 0.9500 0.9500
Temperatures CPU PMIC CPU PMIC CPU PMIC
°C °C °C °C °C °C
Avg 55.8 55.3 59.9 58.6 48.2 51.6
Max 59.0 57.1 63.0 61.8 50.0 52.4
swpd free
Minutes
0 42 46.7 44 49.6 42 46.7 0 3417124
1 51 50.5 56 53.3 46 49.6 77312 115776
2 53 52.4 57 55.2 48 50.5 76800 107948
3 55 52.4 57 55.2 47 50.5 76800 107648
4 54 53.3 58 57.1 48 50.5 76800 107884
5 54 54.3 59 57.1 48 51.4 76800 107884
6 55 54.3 58 57.1 48 51.4 76800 107128
7 55 54.3 61 57.1 48 51.4 76800 108388
8 55 55.2 60 58.0 48 51.4 76544 106120
9 55 55.2 61 59.0 48 51.4 76544 105868
10 56 55.2 60 59.0 49 51.4 76544 104356
11 55 55.2 59 59.0 49 51.4 76544 105364
12 58 55.2 60 59.0 49 51.4 76544 104356
13 57 55.2 61 59.0 49 51.4 74240 116292
14 58 55.2 62 59.0 49 52.4 74240 129052
15 57 57.1 63 59.0 48 52.4 74240 129052
16 56 55.2 61 59.0 49 52.4 74240 128548
17 57 57.1 62 59.0 49 52.4 74240 128664
18 57 57.1 63 59.0 49 52.4 74240 128012
19 58 57.1 62 59.0 49 52.4 74240 129272
20 58 57.1 62 60.9 50 52.4 74240 128784
21 58 57.1 62 60.9 49 52.4 74240 128028
22 59 57.1 61 59.0 49 52.4 74240 128532
23 58 57.1 63 60.9 48 52.4 74240 128280
24 58 57.1 63 60.9 49 52.4 74240 127020
25 58 57.1 63 60.9 49 52.4 74240 128532
26 58 57.1 62 60.9 48 52.4 74240 128532
27 58 57.1 60 60.9 50 52.4 74240 128280
28 59 57.1 62 60.9 48 52.4 73728 127776
29 58 57.1 63 61.8 49 52.4 73728 126516
30 50 55.2 52 59.0 44 51.4 73472 127524
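The "more than 3 GB" of RAM used by the third run can be estimated from the vmstat free column above, as the drop between the sample before the test started and the first sample during it. This Python sketch assumes vmstat reports KB as units of 1024 bytes, and it ignores changes in buffers and cache, so it is an approximation only.

```python
# vmstat samples copied from the table above (KB).
free_before_kb = 3417124   # minute 0, before the 3500000 KB test started
free_during_kb = 115776    # minute 1, test running
swpd_during_kb = 77312     # swap in use at minute 1

used_kb = free_before_kb - free_during_kb
used_gib = used_kb / 1024 ** 2          # binary gigabytes
used_gb = used_kb * 1024 / 1e9          # decimal gigabytes

print(f"RAM taken from the free pool: {used_kb} KB "
      f"({used_gib:.2f} GiB, {used_gb:.2f} GB)")
print(f"Swap space in use: {swpd_during_kb / 1024:.1f} MB")
```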
32 Bit System Stress Tests
These stress tests comprised running programs, each for 15 minutes at the same time, exercising floating point calculations and
OpenGL graphics activity, whilst others were validating data transfers from RAM and the main drive. The run time environment
was also monitored. The variations of programs used can be obtained from Raspberry-Pi-4-Benchmarks.tar.gz.
The script file, shown below, was used to kick off the programs at the same time (within 10 seconds, validated by provided
results logs). The tests were run on the latest 8 GB Pi 4B, with cooling fan, and the fanless Pi 400 PC. The 4B drive was a 32 GB
SD card with the Pi 400 using a higher speed USB 3 booted disk drive.
######################## Script File ########################
lxterminal -e ./RPiHeatMHzVolts2 Passes 16 Seconds 60 Log 31 &
lxterminal -e ./liverloopsPiA7R Seconds 12 Log 31 &
lxterminal -e ./MP-IntStress Threads 1 KB 15000 Mins 15 Log 31 &
lxterminal -e ./burnindrive2 Repeats 16, Minutes 12, Log 31, Seconds 1 &
export vblank_mode=0 &
lxterminal -e ./videogl32 Test 6 Mins 15 Log 31 &
vmstat 60 16 > vmstat31.txt
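The last line of the script saves the vmstat samples in a text file. A log of that kind could be summarised afterwards with a short parser such as the Python sketch below. The function name parse_vmstat is hypothetical, the sample rows are in the standard vmstat layout reproduced later in this report (not the contents of vmstat31.txt), and the first row is skipped when averaging because vmstat reports averages since boot there.

```python
def parse_vmstat(text):
    """Parse standard vmstat output into a list of dicts keyed by column name."""
    rows, columns = [], None
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "procs":   # blank line or group header
            continue
        if fields[0] == "r":                     # column name header
            columns = fields
            continue
        if columns and len(fields) == len(columns):
            rows.append(dict(zip(columns, map(int, fields))))
    return rows


sample = """
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  1      0 2772060  39876 714888    0    0    14    58  680 1022  6  3 91  0  0
 0  0      0 2659720  46688 797124    0    0    31   259 2771 4188  6  3 91  0  0
 0  0      0 2670300  49392 790776    0    0     0   233 2697 4046  5  3 92  0  0
"""

rows = parse_vmstat(sample)
interval_rows = rows[1:]   # skip the since-boot averages in the first row
busy = [r["us"] + r["sy"] for r in interval_rows]
mean_busy = sum(busy) / len(busy)
print(f"{len(interval_rows)} samples, mean CPU busy {mean_busy:.1f}%")
```

With a real log, the text would be read with open("vmstat31.txt").read() instead of the embedded sample.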
The following results cover CPU MHz, voltage, temperatures and utilisation of memory, drives and CPU, with details for other
programs on the next page. Both systems appeared to run continuously at maximum CPU MHz, without temperatures increasing
anywhere near the point where throttling would occur. The Pi 4B CPU started 5°C higher, continuing with the same difference
until the end. The Pi 400 PMIC started 4°C higher and that increased to 6°C.
VMSTAT shows that not much RAM was needed for these tests, both systems having similar CPU + Wait For I/O utilisations, with
around 1% idle time. The main difference was main drive MB/second, with the Pi 400 disk drive some 75% faster than the Pi 4B
SD card. Not too much can be read into that. It might have been the opposite effect, with the Pi 4B using the hard drive.
Results on the next page indicate that the Pi 400 obtained an official Livermore Loops average of 592.1 MFLOPS, compared with
494.4 on the Pi 4B, a difference of 20%. The two systems obtained similar speeds during the integer RAM tests, of over 2.3
GB/second, with the Pi 400 producing an 11% performance advantage running the OpenGL Textured Kitchen routine.
################# Pi 4B ################# ################# Pi 400 ################
================== CPU MHz CPU Voltage and Temperature Measurement =================
Secs Start at Wed Aug 12 14:03:08 2020 Secs Start at Wed Aug 12 14:02:58 2020
0 ARM MHz=1500 0.86V CPU=46°C pmic=42°C 0 ARM MHz=1800 0.95V CPU=41°C pmic=46°C
60 ARM MHz=1500 0.86V CPU=56°C pmic=47°C 60 ARM MHz=1800 0.95V CPU=49°C pmic=50°C
121 ARM MHz=1500 0.86V CPU=59°C pmic=49°C 121 ARM MHz=1800 0.95V CPU=51°C pmic=51°C
182 ARM MHz=1500 0.86V CPU=59°C pmic=49°C 182 ARM MHz=1800 0.95V CPU=50°C pmic=52°C
243 ARM MHz=1500 0.86V CPU=57°C pmic=49°C 243 ARM MHz=1800 0.95V CPU=51°C pmic=52°C
304 ARM MHz=1500 0.86V CPU=59°C pmic=49°C 303 ARM MHz=1800 0.95V CPU=53°C pmic=53°C
365 ARM MHz=1500 0.86V CPU=59°C pmic=50°C 364 ARM MHz=1800 0.95V CPU=54°C pmic=54°C
426 ARM MHz=1500 0.86V CPU=59°C pmic=50°C 425 ARM MHz=1800 0.95V CPU=53°C pmic=54°C
486 ARM MHz=1500 0.86V CPU=59°C pmic=50°C 486 ARM MHz=1800 0.95V CPU=52°C pmic=55°C
547 ARM MHz=1500 0.86V CPU=59°C pmic=49°C 548 ARM MHz=1800 0.95V CPU=53°C pmic=55°C
608 ARM MHz=1500 0.86V CPU=59°C pmic=49°C 609 ARM MHz=1800 0.95V CPU=55°C pmic=55°C
669 ARM MHz=1500 0.86V CPU=58°C pmic=50°C 670 ARM MHz=1800 0.95V CPU=55°C pmic=55°C
730 ARM MHz=1500 0.86V CPU=59°C pmic=49°C 731 ARM MHz=1800 0.95V CPU=54°C pmic=55°C
790 ARM MHz=1500 0.86V CPU=58°C pmic=49°C 792 ARM MHz=1800 0.95V CPU=54°C pmic=55°C
851 ARM MHz=1500 0.86V CPU=59°C pmic=49°C 854 ARM MHz=1800 0.95V CPU=54°C pmic=56°C
912 ARM MHz=1500 0.86V CPU=51°C pmic=48°C 915 ARM MHz=1800 0.95V CPU=48°C pmic=55°C
End at Wed Aug 12 14:19:21 2020 End at Wed Aug 12 14:19:14 2020
============================== vmstat 60 second samples =============================
Memory MB MB/sec CPU %utilise Wait Memory MB MB/sec CPU %utilise Wait
free buff cache in out user sys idle I/O free buff cache in out user sys idle I/O
7505 23 224 0 0 3 1 95 1 3377 89 229 0 0 3 1 94 2
7428 24 263 0 7 75 7 3 15 3313 89 256 33 11 76 12 2 10
7424 24 266 15 4 76 9 0 15 3315 89 252 42 0 76 11 0 12
7424 24 265 24 0 76 10 1 13 3316 89 252 42 0 76 11 1 12
7423 24 265 25 0 75 10 1 15 3315 89 252 43 0 76 10 2 12
7423 24 265 24 0 75 10 1 15 3312 89 256 42 0 76 11 1 13
7422 24 266 24 0 75 9 1 16 3313 89 254 41 0 76 10 1 13
7422 24 268 24 0 75 10 1 14 3311 89 256 41 0 77 11 0 12
7422 24 268 24 0 76 10 1 13 3310 89 257 42 0 76 11 1 12
7420 24 269 24 0 75 9 1 14 3311 89 255 41 0 77 10 1 12
7422 24 267 24 0 76 10 1 13 3310 89 257 41 0 77 11 0 12
7423 24 267 24 0 74 10 1 15 3308 89 258 41 0 77 11 1 11
7420 24 269 24 0 75 9 0 15 3308 89 258 43 0 76 11 2 12
7419 25 270 24 0 74 9 0 16 3309 89 256 64 0 77 1 5 1
7420 25 268 24 0 75 10 1 14 3309 90 256 75 0 77 1 6 1
7423 25 266 25 0 70 10 4 16 3309 90 258 78 0 63 1 6 1
Other Stress Testing Programs - run with the above
Livermore Loops - There are 24 of these kernels, each with an individual MFLOPS measurement, and a number of summaries are
also produced, the official average being the geometric mean. Note that there are three passes of this benchmark with differing
memory demands. The detailed figures are from one of these runs but the summaries are for all results.
MP Integer RAM Exerciser and OpenGL Benchmark - These report results as the tests progress, and performance for both is
provided together below. There can be performance variations over the testing time, depending on activities in other programs or
manual interventions.
BurnInDrive - This uses 64 KB block sizes, with 164 variations of data patterns, where a parameter controls file size, in this
case 16 blocks per pattern for 164 MB files. Four of these are written, then read by random selection for a specified time.
Finally, blocks are read continuously for a specified number of seconds (See more information here). Performance from the Pi
400 hard drive was clearly superior to that from the Pi 4B SD card. Calculated reading speeds were effectively the same as
indicated by VMSTAT.
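The 164 MB file size follows from the figures quoted for BurnInDrive, on the assumption (mine, not stated explicitly) that each of the 164 data patterns is written as the requested number of 64 KB blocks. A minimal Python sketch of the arithmetic:

```python
BLOCK_KB = 64     # block size used by the drive test
PATTERNS = 164    # number of data pattern variations
FILES = 4         # number of files written


def file_size_mb(blocks_per_pattern):
    """File size in MB, assuming every pattern is written blocks_per_pattern times."""
    return PATTERNS * blocks_per_pattern * BLOCK_KB / 1024


size = file_size_mb(16)   # "Repeats 16" in the script file
print(f"File size {size:.2f} MB, total written {FILES * size:.2f} MB")
```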
======= Livermore Loops 64 Bit Reliability test 12 seconds each loop x 24 x 3 =======
Pi 4B Pi 400
Wed Aug 12 14:03:08 2020 Wed Aug 12 14:02:58 2020
Numeric results were as expected Numeric results were as expected
MFLOPS for 24 loops MFLOPS for 24 loops
734.0 933.4 982.3 939.1 204.1 717.0 820.5 1060.4 1063.9 1066.8 233.7 517.4
1128.8 1600.7 1225.6 383.5 211.8 184.8 1358.0 1911.3 1521.3 487.6 251.1 220.1
135.9 267.9 710.0 619.7 731.0 1012.8 188.6 363.1 842.4 734.0 868.1 1177.8
315.4 330.3 305.8 352.7 681.0 186.4 379.2 393.9 303.3 416.7 835.5 207.1
Maximum Average Geomean Harmean Minimum Maximum Average Geomean Harmean Minimum
1600.7 610.8 494.4 390.8 117.9 1911.3 728.0 592.1 472.4 164.5
End of test Wed Aug 12 14:17:56 2020 End of test Wed Aug 12 14:17:24 2020
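The summary line above contains arithmetic, geometric and harmonic means. The Python sketch below shows how these are derived from the 24 loop results listed for each system. Because the listed values come from one pass, whereas the summaries cover all three passes, the calculated means are not expected to match the reported ones exactly.

```python
from statistics import mean, geometric_mean, harmonic_mean

# MFLOPS for the 24 loops, copied from the detailed results above.
pi4b = [734.0, 933.4, 982.3, 939.1, 204.1, 717.0,
        1128.8, 1600.7, 1225.6, 383.5, 211.8, 184.8,
        135.9, 267.9, 710.0, 619.7, 731.0, 1012.8,
        315.4, 330.3, 305.8, 352.7, 681.0, 186.4]
pi400 = [820.5, 1060.4, 1063.9, 1066.8, 233.7, 517.4,
         1358.0, 1911.3, 1521.3, 487.6, 251.1, 220.1,
         188.6, 363.1, 842.4, 734.0, 868.1, 1177.8,
         379.2, 393.9, 303.3, 416.7, 835.5, 207.1]


def summarise(values):
    """Return the five summary figures used in the Livermore Loops report."""
    return {"max": max(values), "mean": mean(values),
            "geomean": geometric_mean(values),
            "harmean": harmonic_mean(values), "min": min(values)}


for name, values in (("Pi 4B", pi4b), ("Pi 400", pi400)):
    s = summarise(values)
    print(name, " ".join(f"{k}={v:.1f}" for k, v in s.items()))
```

The geometric mean is less influenced by a few very fast loops than the arithmetic mean, which is why it is used as the official average.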
===================== MP Integer RAM and OpenGL Tests ======================
Pi 4B Pi 400
Start Aug 12 2020 14:03:08 14:03:08 14:02:58 14:02:58
Secs Kbytes Thrds Pattern All Same MB/sec FPS MB/sec FPS
30 15000 1 00000000 Yes 2528 13 2428 14
60 15000 1 FFFFFFFF Yes 2501 13 2379 15
90 15000 1 FFFFFFFF Yes 2217 13 2539 13
To
840 15000 1 AAAAAAAA Yes 2217 14 2175 17
870 15000 1 CCCCCCCC Yes 2569 12 2343 15
900 15000 1 CCCCCCCC Yes 2455 13 2348 16
Average 2351 13.2 2394 14.7
End Aug 12 2020 14:18:12 14:18:11 14:18:03 14:18:01
======================== burnindrive2 Main Drive =========================
Pi 400 Pi 4B
Start Wed Aug 12 14:02:58 2020 14:03:08 2020
Write seconds 164.00 MB x 4 files 11.93 82.49
Read files for 12+ minutes Files Minutes
x 4
Read passes 1 x 4 Files x 164.00 MB in 0.26 minutes 1 0.44
Read passes 2 x 4 Files x 164.00 MB in 0.52 minutes 2 0.92
TO
Read passes 25 x 4 Files x 164.00 MB in 6.52 minutes 13 5.88
Read passes 26 x 4 Files x 164.00 MB in 6.78 minutes 14 6.34
To
Read passes 45 x 4 Files x 164.00 MB in 11.79 minutes 26 11.79
Read passes 46 x 4 Files x 164.00 MB in 12.08 minutes 27 12.25
Calculated MB/second over 12+ minutes 41.6 24.1
Passes in 1 second(s) for each of 164 blocks of 64KB:
Examples
1140 1180 1160 1220 1280 1360 1520 1520 1460 1240 420 420
1260 1200 1160 1140 1140 1140 1160 1140 1160 1120 380 400
To
1320 1400 1360 1300 1160 1240 1360 1380 1400 1140 540 560
1240 1240 1240 1220 1220 1240 1180 1180 1160 1180 560 560
Passes Minutes
200220 read passes of 64KB blocks in 2.76 minutes 79580 2.80
No errors found during reading tests
End Wed Aug 12 14:18:00 2020 14:19:34 2020
64 Bit System Stress Tests
These 15 minute tests were run using the same script file sequence as the above 32 bit session, using 64 bit program
compilations, except that most of the RAM was used in the single core integer test.
The vmstat report shows that all these programs ran without memory swapping, with all four cores being almost fully used
throughout. Recorded data transfer speeds confirmed those measured by the drive program. Processor speed and measured
OpenGL frames per second were constant, with low temperatures being maintained.
vmstat RPiHeatMHzVolts MP-Int OpenGL
Memory MB------- MB/sec CPU %util-- %wait ARM Volts CPU PMIC Stress Test 6
Minutes swpd free cache in out usr sys idl I/O MHz °C °C MB/sec FPS
0 0 3285 298 0 0 8 0 91 0 1800 0.95 38 41
1 0 291 326 0 11 74 8 2 17 1800 0.95 46 45 2198 22
2 0 277 329 28 0 77 8 1 14 1800 0.95 47 46 2202 21
3 0 273 332 28 0 76 7 1 16 1800 0.95 46 47 2203 21
4 0 273 333 28 0 76 7 1 16 1800 0.95 48 48 2211 21
5 0 272 334 28 0 76 7 1 15 1800 0.95 49 48 2201 22
6 0 275 331 28 0 76 7 1 16 1800 0.95 48 48 2196 22
7 0 275 330 28 0 76 7 1 16 1800 0.95 50 49 2193 21
8 0 270 334 28 0 76 7 1 16 1800 0.95 48 49 2189 21
9 0 275 330 28 0 76 7 1 16 1800 0.95 49 49 2175 22
10 0 274 331 28 0 76 7 1 15 1800 0.95 51 50 2169 21
11 0 273 331 28 0 76 7 1 15 1800 0.95 51 50 2166 20
12 0 271 333 28 0 76 7 1 16 1800 0.95 51 50 2162 21
13 0 271 334 28 0 76 7 2 15 1800 0.95 51 50 2156 21
14 0 270 335 30 0 76 7 1 16 1800 0.95 51 50 2148 21
15 0 271 335 30 0 70 7 6 16 1800 0.95 46 49 2129 20
Avg 1800 0.95 48 48 2180 21
Min 1800 0.95 38 41 2129 20
Max 1800 0.95 51 50 2211 22
Livermore Loops Benchmark 64 Bit
Reliability test 12 seconds each loop x 24 x 3
MFLOPS for 24 loops
2447.2 1053.5 1083.2 1140.1 452.8 921.3 2738.0 3288.1 2377.3 694.8 566.2 1129.5
203.9 430.8 925.5 718.4 835.0 1291.8 516.4 440.1 1899.1 435.7 910.7 363.6
Overall Ratings
Maximum Average Geomean Harmean Minimum
3288.1 1110.8 881.5 703.7 185.7
Main SD Card Storage Stress Test ARM 64 Bit
4 x 164.00 MB written in 58.38 seconds = 11.2 MB/second
Read passes 31 x 4 Files x 164.00 MB in 12.07 minutes = 28.1 MB/second
85660 read passes of 64KB blocks in 2.79 minutes = 32.0 MB/second
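The MB/second figures reported by the drive test can be reproduced from the pass counts and times. The following Python sketch assumes 4 files of 164 MB and 64 KB blocks, as described earlier, and the function names are illustrative. It is checked here against the three 64 bit results above.

```python
FILE_MB = 164.0   # size of each file
FILES = 4         # files written and read
BLOCK_KB = 64     # block size for the final repeated reads


def write_mb_per_sec(seconds):
    """Speed of writing all files once."""
    return FILES * FILE_MB / seconds


def read_mb_per_sec(passes, minutes):
    """Speed of reading all files 'passes' times in 'minutes'."""
    return passes * FILES * FILE_MB / (minutes * 60)


def block_read_mb_per_sec(block_passes, minutes):
    """Speed of the final repeated 64 KB block reads."""
    return block_passes * BLOCK_KB / 1024 / (minutes * 60)


print(f"Write      {write_mb_per_sec(58.38):.1f} MB/second")
print(f"Read files {read_mb_per_sec(31, 12.07):.1f} MB/second")
print(f"Read block {block_read_mb_per_sec(85660, 2.79):.1f} MB/second")
```

The same read_mb_per_sec function also gives the 41.6 and 24.1 MB/second shown for the 32 bit session, from 46 passes in 12.08 minutes and 27 passes in 12.25 minutes.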
32 Bit TV Test Plus Remote Access
Using a TV connection, the Pi 400 was run for eight hours displaying live TV via BBC iPlayer. Before increasing the image to full
screen, my environment monitor program was started. The room temperature was 24°C. As can be seen below, CPU and PMIC
temperatures did not increase significantly. Full screen resolution was reported as 960 x 540 pixels, with Ethernet traffic at
1700 kbps.
Later, two terminals were connected from PuTTY, on a PC. Sysstat software was installed from there, to enable monitoring of
network data transfer speeds. The VMSTAT system utilisation monitor was started from the second terminal, both saving results
on the Pi 400 SD card.
Received network data mainly arrived continuously at around 214 kB per second. Taking into account extra overhead bits, that is
similar to 1700 kbits per second. The increases to more than 250 kB/s, and the associated transmitted bytes, occurred after I
opened VNC Viewer, on my smartphone, to have a look at the TV picture there. It was really bad, with jumpy rather than smooth
flow. Assuming that full screen data is transferred, rather than in compressed input format, 960 x 540 pixels at 4 bytes per
pixel indicates over 2000 kB per frame, implying that data supplied to the phone would result in an extremely low displayed
frames per second.
VMSTAT indicates low Pi 400 CPU utilisation. The only noticeable activity is data output to the main drive, being about the same
as kB/s received over the network. The burst of reading from the drive, near the end, occurred following pausing the iPlayer for
a short time, followed by continuing playing the recording.
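The estimates in the paragraphs above can be written out as a short calculation. This Python sketch uses the quoted 214 kB/s received rate, the 960 x 540 resolution at 4 bytes per pixel, and the peak transmitted rate of 2104.84 kB/s from the sar report below. Treating 1 kB as 1000 bytes and assuming uncompressed frames are both simplifications, so the frames per second figure is indicative only.

```python
WIDTH, HEIGHT, BYTES_PER_PIXEL = 960, 540, 4

# Received streaming rate converted from kBytes to kbits per second.
rx_kbytes_per_sec = 214
rx_kbits_per_sec = rx_kbytes_per_sec * 8

# Size of one uncompressed full screen frame.
frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
frame_kb = frame_bytes / 1000

# Peak transmitted rate while VNC Viewer was open (from the sar report).
tx_kbytes_per_sec = 2104.84
fps = tx_kbytes_per_sec * 1000 / frame_bytes

print(f"Received: {rx_kbits_per_sec} kbits/second")
print(f"Frame size: {frame_kb:.1f} kB")
print(f"Uncompressed frames per second at peak: {fps:.2f}")
```

About one frame per second is consistent with the jumpy picture seen on the phone.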
sar -n DEV 1800 20 Communications Traffic
10:38:54 rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
11:08:54 147.41 61.02 214.87 4.27 0.00 0.00 0.04 0.18
11:38:54 146.16 62.63 211.44 4.38 0.00 0.00 0.04 0.17
12:08:54 898.76 1548.94 261.53 2104.84 0.00 0.00 0.54 1.72
12:38:54 1028.00 1794.83 273.61 155.17 0.00 0.00 1.40 0.22
13:08:54 148.80 62.42 216.92 4.36 0.00 0.00 0.04 0.18
13:38:54 148.77 62.69 216.94 4.38 0.00 0.00 0.03 0.18
14:08:54 3266.34 3909.49 155.45 2299.59 0.00 0.00 2.33 1.88
14:38:54 147.26 62.17 214.69 4.34 0.00 0.00 0.05 0.18
15:08:54 149.15 62.01 216.70 4.33 0.00 0.00 0.66 0.18
15:38:54 146.12 62.80 211.66 4.39 0.00 0.00 1.17 0.17
16:08:54 148.26 61.73 216.21 4.31 0.00 0.00 0.03 0.18
16:38:54 148.85 62.48 217.04 4.37 0.00 0.00 0.03 0.18
vmstat 1800 20 System Utilisation
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 1 0 2772060 39876 714888 0 0 14 58 680 1022 6 3 91 0 0
0 0 0 2659720 46688 797124 0 0 31 259 2771 4188 6 3 91 0 0
0 0 0 2670300 49392 790776 0 0 0 233 2697 4046 5 3 92 0 0
1 0 0 2636548 52248 812228 0 0 3 232 3014 4191 14 4 81 0 0
1 0 0 2606896 55732 819748 0 0 4 239 3271 4198 24 6 70 0 0
0 1 0 2651508 58032 812232 0 0 0 237 2704 4037 6 3 91 0 0
0 0 0 2626876 60160 822588 0 0 0 235 2687 4038 5 3 92 0 0
0 0 0 2631420 62128 821656 0 0 0 238 2703 4034 5 3 92 0 0
0 0 0 2634884 64036 817980 0 0 0 235 2688 4033 5 3 92 0 0
0 0 0 2643896 65856 813900 0 0 0 237 2684 4032 5 3 92 0 0
0 0 0 2629000 67704 816040 0 0 0 233 2682 4036 5 3 91 0 0
4 0 0 2529104 68992 899540 0 0 40 238 2818 4258 6 3 90 0 0
0 0 0 2529352 70632 896856 0 0 0 237 2693 4034 5 2 92 0 0
Temperature and CPU MHz Measurement
Start at Wed Aug 19 08:42:13 2020
Using samples at 1800 second intervals
Seconds
0.0 ARM MHz=1800, core volt=0.9500V, CPU temp=32.0'C, pmic temp=32.6'C
1800.0 ARM MHz=1800, core volt=0.9500V, CPU temp=35.0'C, pmic temp=38.2'C
3600.3 ARM MHz=1800, core volt=0.9500V, CPU temp=36.0'C, pmic temp=39.2'C
5400.5 ARM MHz=1800, core volt=0.9500V, CPU temp=36.0'C, pmic temp=40.1'C
7200.8 ARM MHz=1800, core volt=0.9500V, CPU temp=37.0'C, pmic temp=40.1'C
9001.1 ARM MHz=1800, core volt=0.9500V, CPU temp=36.0'C, pmic temp=40.1'C
10801.3 ARM MHz=1800, core volt=0.9500V, CPU temp=38.0'C, pmic temp=40.1'C
12601.7 ARM MHz=1800, core volt=0.9500V, CPU temp=41.0'C, pmic temp=42.9'C
14401.9 ARM MHz=1800, core volt=0.9500V, CPU temp=38.0'C, pmic temp=42.0'C
16202.2 ARM MHz=1800, core volt=0.9500V, CPU temp=38.0'C, pmic temp=41.1'C
18002.4 ARM MHz=1800, core volt=0.9500V, CPU temp=37.0'C, pmic temp=41.1'C
19802.7 ARM MHz=1800, core volt=0.9500V, CPU temp=37.0'C, pmic temp=41.1'C
21603.0 ARM MHz=1800, core volt=0.9500V, CPU temp=37.0'C, pmic temp=41.1'C
23403.3 ARM MHz=1800, core volt=0.9500V, CPU temp=37.0'C, pmic temp=41.1'C
25203.5 ARM MHz=1800, core volt=0.9500V, CPU temp=37.0'C, pmic temp=41.1'C
27003.8 ARM MHz=1800, core volt=0.9500V, CPU temp=37.0'C, pmic temp=41.1'C
28804.1 ARM MHz=1800, core volt=0.9500V, CPU temp=37.0'C, pmic temp=41.1'C
Terminated Wed Aug 19 16:42
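Log lines in the format shown above can be parsed to extract maximum temperatures or to confirm that the clock never dropped. This is a minimal Python sketch, assuming the exact line layout of this log (the summary tables in the stress test sections use a shortened layout, which this pattern does not match). The three sample lines are copied from the log above.

```python
import re

# Pattern for lines such as:
# 1800.0 ARM MHz=1800, core volt=0.9500V, CPU temp=35.0'C, pmic temp=38.2'C
LINE_RE = re.compile(
    r"^\s*(?P<secs>[\d.]+)\s+ARM MHz=(?P<mhz>\d+),\s*"
    r"core volt=(?P<volts>[\d.]+)V,\s*"
    r"CPU temp=(?P<cpu>[\d.]+)'C,\s*pmic temp=(?P<pmic>[\d.]+)'C"
)


def parse_log(text):
    """Return one dict of float values per matching log line."""
    samples = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            samples.append({k: float(v) for k, v in match.groupdict().items()})
    return samples


sample = """
0.0 ARM MHz=1800, core volt=0.9500V, CPU temp=32.0'C, pmic temp=32.6'C
12601.7 ARM MHz=1800, core volt=0.9500V, CPU temp=41.0'C, pmic temp=42.9'C
28804.1 ARM MHz=1800, core volt=0.9500V, CPU temp=37.0'C, pmic temp=41.1'C
"""

samples = parse_log(sample)
max_cpu = max(s["cpu"] for s in samples)
max_pmic = max(s["pmic"] for s in samples)
min_mhz = min(s["mhz"] for s in samples)
print(f"Max CPU {max_cpu}°C, max PMIC {max_pmic}°C, min MHz {min_mhz:.0f}")
```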
64 Bit TV Test Using Bluetooth
The session via the 64 bit Operating System displayed BBC iPlayer programmes on a PC monitor for more than 7 hours. The Pi 400
PC (I used) has no AV jack, so sound was played by pairing a Bluetooth speaker. As with the 32 bit tests, the Ethernet network
connection was used.
Without paying serious attention, full screen close up picture quality was acceptable. Right clicking on the screen, from time
to time, indicated the following properties, showing large differences in the amount of data handled and displayed. There
were corresponding variations in monitored statistics, with network received traffic (rxkB/s), vmstat drive kB data out (bo), and
CPU utilisation (us + sy). CPU and PMIC temperatures reduced later on, but that might have been due to the room becoming
cooler approaching midnight. I suppose that the varying traffic levels were caused by network congestion (but was it?).
Bluetooth - I found it difficult to connect Bluetooth devices (in my environment?). After failing to pair, I could find no menu
based operation to prevent further error indications. Executing the commands, shown below, allowed more attempts and
sometimes a successful connection.
Periodic Properties Displayed
kbps pixels
1700 960 x 540
5166 1280 x 720 at 18:30
932 704 x 396
533 512 x 288
5166 1280 x 720
sar -n DEV 1800 20 Communications Traffic
16:25:17 rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
16:55:17 142.08 73.26 205.16 5.13 0.00 0.00 1.93 16.81
17:25:17 137.60 70.84 198.17 4.97 0.00 0.00 2.10 16.23
17:55:17 130.85 67.14 188.22 4.71 0.00 0.00 2.18 15.42
18:25:17 397.85 193.62 576.05 13.28 0.00 0.00 0.49 47.19
18:55:17 455.64 221.75 661.41 15.08 0.00 0.00 0.03 54.18
19:25:17 450.00 217.99 654.41 14.73 0.00 0.00 0.04 53.61
19:55:17 446.87 216.61 649.94 14.68 0.00 0.00 0.03 53.24
20:25:17 176.93 88.05 256.34 6.11 0.00 0.00 1.43 21.00
20:55:17 134.94 68.95 193.33 4.90 0.00 0.00 2.08 15.84
21:25:17 84.61 44.44 117.08 3.32 0.00 0.00 2.04 9.59
21:55:17 79.75 42.79 110.32 3.24 0.00 0.00 2.07 9.04
22:25:17 51.18 28.03 69.52 2.26 0.00 0.00 1.99 5.70
22:55:17 51.49 28.67 69.98 2.32 0.00 0.00 2.07 5.73
23:25:17 37.40 21.18 49.87 1.82 0.00 0.00 2.04 4.09
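The bit rates shown in the displayed properties can be compared with the received traffic in the sar report above, by converting kbits to kBytes per second. In the Python sketch below, the pairing of each reported bit rate with a particular sar sample is my assumption (only the 18:30 reading has a time attached), so the comparison is indicative only.

```python
def kbps_to_kbytes_per_sec(kbps):
    """Convert kbits per second to kBytes per second (8 bits per byte)."""
    return kbps / 8


# (reported kbps, assumed matching sar rxkB/s sample)
pairs = [
    (1700, 205.16),   # 960 x 540, early in the session
    (5166, 654.41),   # 1280 x 720, evening peak
    (932, 117.08),    # 704 x 396
    (533, 69.52),     # 512 x 288
]

for kbps, rx_kb in pairs:
    expected = kbps_to_kbytes_per_sec(kbps)
    difference = (rx_kb / expected - 1) * 100
    print(f"{kbps:5d} kbps = {expected:6.1f} kB/s, "
          f"sar {rx_kb:6.2f} kB/s ({difference:+.1f}%)")
```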
vmstat 1800 20 System Utilisation
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 2586672 50628 714756 0 0 39 4 148 161 2 1 97 0 0
0 0 0 2314608 54248 897128 0 0 0 235 6690 5244 15 4 81 0 0
4 0 0 2289668 57772 911392 0 0 0 219 6638 5156 14 3 82 0 0
1 0 0 2287436 60836 913288 0 0 0 226 6688 5248 14 4 82 0 0
4 0 0 2123628 64124 970940 0 0 1 356 5991 4691 27 5 68 0 0
3 0 0 2102296 67052 973788 0 0 0 650 7919 5869 50 9 41 0 0
10 0 0 2019576 69760 1020768 0 0 0 665 7963 5894 51 9 40 0 0
7 0 0 2011040 72504 1028204 0 0 0 636 7882 5837 50 9 41 0 0
1 0 0 2017004 74848 1021604 0 0 0 473 7445 5680 37 7 56 0 0
2 0 0 2009756 77008 1013644 0 0 0 230 6684 5016 15 3 82 0 0
0 0 0 2019832 79008 1002208 0 0 0 164 6491 4966 12 3 85 0 0
0 0 0 2006692 80804 1013468 0 0 0 140 6471 4992 11 3 86 0 0
0 0 0 2005796 82740 1007964 0 0 0 107 6381 4839 9 3 88 0 0
1 0 0 1986264 84220 1029168 0 0 0 92 6344 4781 8 3 89 0 0
0 0 0 1995508 85804 1010960 0 0 0 85 6351 4789 8 3 90 0 0
1 0 0 1995000 87296 1011092 0 0 0 75 6295 4743 7 2 90 0 0
Temperature and CPU MHz Measurement Start at Sat Sep 12 16:11:35 2020
Seconds
0.0 ARM MHz=1800, core volt=0.9500V, CPU temp=42.0'C, pmic temp=43.9'C
1800.0 ARM MHz=1800, core volt=0.9500V, CPU temp=42.0'C, pmic temp=46.7'C
3600.3 ARM MHz=1800, core volt=0.9500V, CPU temp=42.0'C, pmic temp=46.7'C
5400.5 ARM MHz=1800, core volt=0.9500V, CPU temp=43.0'C, pmic temp=47.7'C
7200.8 ARM MHz=1800, core volt=0.9500V, CPU temp=47.0'C, pmic temp=50.5'C
9001.3 ARM MHz=1800, core volt=0.9500V, CPU temp=48.0'C, pmic temp=51.4'C
10801.7 ARM MHz=1800, core volt=0.9500V, CPU temp=48.0'C, pmic temp=51.4'C
12602.2 ARM MHz=1800, core volt=0.9500V, CPU temp=48.0'C, pmic temp=51.4'C
14402.6 ARM MHz=1800, core volt=0.9500V, CPU temp=44.0'C, pmic temp=48.6'C
16202.9 ARM MHz=1800, core volt=0.9500V, CPU temp=44.0'C, pmic temp=47.7'C
18003.2 ARM MHz=1800, core volt=0.9500V, CPU temp=42.0'C, pmic temp=46.7'C
19803.4 ARM MHz=1800, core volt=0.9500V, CPU temp=41.0'C, pmic temp=46.7'C
21603.7 ARM MHz=1800, core volt=0.9500V, CPU temp=41.0'C, pmic temp=45.8'C
23404.0 ARM MHz=1800, core volt=0.9500V, CPU temp=40.0'C, pmic temp=45.8'C
25204.2 ARM MHz=1800, core volt=0.9500V, CPU temp=41.0'C, pmic temp=45.8'C
27004.4 ARM MHz=1800, core volt=0.9500V, CPU temp=40.0'C, pmic temp=45.8'C
Bluetooth Commands
sudo hciconfig hci0 reset
sudo invoke-rc.d bluetooth restart
64 Bit Danger
For all my Raspberry Pi, and other Linux benchmarks, I have relied on compiling with -O3 optimisation, possibly with additional
parameters for SIMD operation, like -mavx with Linux/Intel and -funsafe-math-optimizations with ARM Cortex-A7, to produce
NEON instructions. With 64 bit operation, using -O3, the gcc compiler produced unsuitably slow code (with my programming
procedures?). I tried adding other compile directives, but these produced the same slow performance with -O3 present. Then I
tried -O2, which seems to avoid vectorisation, the results being shown below, along with samples from the -O3 runs.
For the first two examples, although using -O2 produced faster single precision and integer calculations from cached data,
performance using RAM was reduced to half speed. With the integer stress test, -O2 also restored appropriate cache based
speeds, but with no loss of RAM performance.
It seems that anyone hoping for faster SIMD operation, with these types of program, should also try compiling without
vectorisation, to verify performance gains.
########### Memory Reading Speed Test 64 Bit gcc 8 Opt -O2 ###########
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
8 16065 11315 8296 16099 9473 9459 12353 8037 9349
16 16245 11407 8309 16259 9522 9513 12569 7993 9466
32 14290 10468 7747 14377 8451 8248 12673 8039 9525
64 12853 10212 7867 13049 7747 7975 10854 7452 9026
128 12970 10307 7958 13149 7852 8070 10159 7610 9094
256 13021 10286 7986 13157 7958 8078 9714 7706 8986
512 12781 10259 7958 13009 7951 8079 9631 7665 9033
1024 3689 4372 3978 4432 3886 3902 5865 5469 5928
2048 1800 1792 1722 1805 1769 1750 3023 2984 2949
4096 1921 1933 1905 1918 1910 1894 2658 2678 2686
8192 1962 1961 1809 1952 1955 1926 2596 2601 2613
########### Memory Reading Speed Test 64 Bit gcc 8 Opt -O3 ###########
8 18133 4792 4749 18693 5259 5275 13962 11182 11182
256 14783 4646 4716 14698 5053 5063 9666 9768 9809
8192 2036 3940 3882 2034 3935 3995 2642 2643 2638
##### NEON Speed Test 64 Bit gcc 8 Opt -O2 #####
Vector Reading Speed in MBytes/Second
Memory Float v=v+s*v Int v=v+v+s Neon v=v+v
KBytes Norm Neon Norm Neon Float Int
16 11286 19667 8090 18132 19678 22533
32 10394 14494 7193 13225 14233 14562
64 10765 13825 7457 12642 13846 14040
128 11057 14324 7769 13237 14394 14612
256 11113 14477 7844 13318 14530 14674
512 11149 14560 7893 13392 14627 14637
1024 4513 4758 3637 3808 4211 4770
4096 2063 2053 2086 2042 2060 2062
16384 2058 2051 2054 2056 2054 2043
65536 2059 2045 2049 2064 2049 2050
##### NEON Speed Test 64 Bit gcc 8 Opt -O3 #####
16 4496 19696 4790 17870 18908 21817
256 3992 14148 4716 13508 14311 14312
65536 3319 2057 3803 2011 2059 2063
#### MP-Integer-Test 64 Bit v2-gcc8 opt -O2 ####
MB/second
KB KB MB Same All
Secs Thrds 16 160 16 Sumcheck Tests
4.2 1 8040 7892 3783 00000000 Yes
3.2 2 17193 15430 3685 FFFFFFFF Yes
3.0 4 29261 29819 3329 5A5A5A5A Yes
3.0 8 29886 31708 3383 AAAAAAAA Yes
3.0 16 30410 33010 3365 CCCCCCCC Yes
2.9 32 30375 33435 3392 0F0F0F0F Yes
#### MP-Integer-Test 64 Bit v2-gcc8 opt -O3 ####
7.4 1 3455 3481 3074 00000000 Yes
4.7 2 7047 6975 3507 FFFFFFFF Yes
3.6 4 13712 13977 3357 5A5A5A5A Yes
3.6 8 13631 13696 3353 AAAAAAAA Yes
3.7 16 13184 13906 3351 CCCCCCCC Yes
3.6 32 12617 13960 3414 0F0F0F0F Yes
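The effect described in this section can be seen by dividing selected -O2 results by the corresponding -O3 results from the tables above. The Python sketch below uses a small hand picked sample of entries (single precision and integer, cache and RAM sized data). A ratio above 1.0 means -O2 was faster.

```python
# (test, data size): (-O2 MB/second, -O3 MB/second), copied from the tables above.
results = {
    "MemSpeed Sngl x[m]=x[m]+s*y[m], 8 KB": (11315, 4792),
    "MemSpeed Sngl x[m]=x[m]+s*y[m], 8192 KB": (1961, 3940),
    "NeonSpeed Float Norm, 16 KB": (11286, 4496),
    "NeonSpeed Float Norm, 65536 KB": (2059, 3319),
    "MP-Integer 1 thread, 16 KB": (8040, 3455),
    "MP-Integer 1 thread, 16 MB": (3783, 3074),
}

ratios = {name: o2 / o3 for name, (o2, o3) in results.items()}

for name, ratio in ratios.items():
    verdict = "-O2 faster" if ratio > 1.0 else "-O3 faster"
    print(f"{name:42s} O2/O3 = {ratio:.2f}  {verdict}")
```

In this sample, the cache based tests run more than twice as fast with -O2, the first two RAM based tests are slower with -O2, and the integer stress test shows no RAM penalty.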