
Raspberry Pi 64 Bit OS and 8 GB Pi 4B Benchmarks

Author:
  • Roy Longbottom

Abstract

I have compiled my benchmarks to run under the 64 bit Beta Raspberry Pi OS, with tests including variations to exercise the Pi 4B with 8 GB RAM. Full results are included, with comparisons to 32 bit working, along with a link to download all the benchmarks and source codes. Some results show significant 64 bit performance gains and others none. The main observation was the ability of my programs to use much more memory space and exercise huge files.
Roy Longbottom
Contents
Summary
Introduction
Benchmark Results
Whetstone Benchmark
Dhrystone Benchmark
Linpack 100 Benchmark
Livermore Loops Benchmark
FFT Benchmarks
BusSpeed Benchmark
MemSpeed Benchmark
NeonSpeed Benchmark
MultiThreading Benchmarks
MP-Whetstone Benchmark
MP-Dhrystone Benchmark
MP NEON Linpack Benchmark
MP-BusSpeed Benchmark
MP-RandMem Benchmark
MP-MFLOPS Benchmarks
OpenMP-MFLOPS Benchmarks
OpenMP-MemSpeed Benchmarks
I/O Benchmarks
WiFi Benchmark
LAN Benchmark
USB 3 Benchmarks
Pi 4 Main Drive Benchmark
Java Whetstone Benchmark
JavaDraw Benchmark
OpenGL Benchmark
Usable RAM
High Performance Linpack
Floating Point Stress Tests
Integer Stress Tests
64 GB SD Card
System Stress Tests
Power Over Ethernet
CPU Performance Throttling Effects
Summary
This report covers the May 2020 Raspberry Pi 4B upgrades, comprising 8GB RAM and the Beta pre-release 64 bit
Raspberry Pi OS. Note that observations and performance measured might not apply to an officially released Operating
System. Objectives of this exercise were to show that my programs could be compiled and run on the 64 bit system
and to compare performance with that of the original 32 bit Pi 4B.
Single Core CPU Tests - These “Classic Benchmarks” set the original performance standards of computers. There are
four, one with three varieties, and some with multiple test functions. All showed 64 bit average performance gains
in the range of 11% to 81%, the highest where the new vector instructions were compiled.
Single Core Memory Benchmarks - These measure performance using data from caches and RAM. There are four
benchmarks, each with between 60 and 100 measurements. A bottom line assessment is that 64 bit and 32 bit
speeds from RAM were the same, as were around half of CPU dependent routines, with the other half an
average near 30% faster at 64 bits.
Multithreading Benchmarks - There were twelve, including some intended to show where programs are unsuitable for
multithreaded operation. Five measured floating point performance, where the average 64 bit gain was 39%,
demonstrating a maximum of 25.9 single precision GFLOPS and 12.7 at double precision. Of the other two
applicable benchmarks, one was rated as 13% faster at 64 bits, with the other indicating the same performance.
Drive and Network Benchmarks - These mainly ran successfully at 64 bits, providing similar performance to 32 bit
runs. A major difference is that file sizes appeared to be limited to 2 GB minus 1 bytes (2^31-1) at 32 bits. At this
stage there were free space limitations but, at 64 bits, files of up to 3 x 12 GB could be exercised.
Java and OpenGL Benchmarks - 64 bit Java CPU speed, Java drawing and OpenGL benchmarks were run, with
different window settings, including using dual monitors.
Usable RAM - Two simple repetitive exercises were carried out to see how much RAM space could be used, via
allocation and via dimensioned arrays. With one program, memory was allocated in 1 billion byte steps. Maximums were 3
billion bytes at 32 bits then, at 64 bits, 3 billion with 4 GB RAM and 7 billion at 8 GB. With dimensioning, more precise
values were obtained, indicating 3.43 GB and 7.9 GB at 64 bits but 2 GB minus 1 at 32 bits.
High Performance Linpack Benchmark - Performance depends on the memory size parameter N squared. With a fan
in use, maximum 32 bit and 64 bit speeds were similar at around 11.25 double precision GFLOPS, at N=30000
with 8 GB RAM, best performance with 4 GB, was 10.8 GFLOPS at N=20000. As a stress test, with no fan, the
original Pi 4 board obtained 6.2 GFLOPS at N=20000, with the new one reaching at least 8.5 GFLOPS, demonstrating
a significant improvement in thermal management.
CPU Stress Tests - Floating point tests demonstrated the same best case 64 bit performance gains as earlier
benchmarks and details of 10 minute stress tests confirmed better thermal management, in a more linear way.
A single thread 10 minute stress test was run with integer calculations using more than 7.2 GB of RAM, with some
swapping, but no severe performance degradation. The stress tests were run without an operational fan.
64 GB Main Drive SD Card - This was obtained to show that extra large files could be used. A single near 40 GB file
was written and read with a new benchmark variation, taking 33 minutes.
System Stress Tests - Fifteen minute tests were run, with and without cooling, using four benchmarks covering CPU,
near 6 GB of RAM, main drive and graphics. There were temperature rises with no cooling, but with little
performance degradation, both continuously providing around 0.6 GFLOPS, 1140 MB/second from RAM, 30
MB/second from the drive and 21 FPS graphics speed.
Power over Ethernet - Following more comprehensive earlier activity, some long cable PoE tests were repeated to
confirm that it was still applicable for this 64 bit configuration.
CPU Performance Throttling Effects - Again, after an earlier exercise, frequency scaling settings forced the CPU to
run at 600 MHz, normally the lowest throttling frequency, whilst playing programmes via BBC iPlayer for more than two
hours to an HD TV, over WiFi. This ran with acceptable picture and sound quality.
Introduction
This report covers the May 2020 Raspberry Pi 4B upgrades, comprising 8 GB RAM and the Beta pre-release 64 bit
Raspberry Pi OS (Operating System). This is a continuation of earlier activity with details at ResearchGate in Raspberry
Pi 4B 32 Bit Benchmarks.pdf and Raspberry Pi 4B Stress Tests Including High Performance Linpack.pdf. These provide
more detailed information of the programs used and comparisons with older systems.
Most of the benchmarks and stress testing programs were recompiled for use here, using the supplied gcc 8 compiler;
two did not produce acceptable code and were substituted by earlier 64 bit versions. All the programs are available for
downloading from ResearchGate in Raspberry-Pi-OS-64-Bit-Benchmarks.tar.xz.
Traditionally, the benchmarks provide details of the system being tested, obtained from built-in CPUID information.
Following are the latest details, identifying the differences between 32 bit and 64 bit operation.
32 bit
Architecture: armv7l
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: ARM
Model: 3
Model name: Cortex-A72
Stepping: r0p3
CPU max MHz: 1500.0000
CPU min MHz: 600.0000
BogoMIPS: 270.00
Flags: half thumb fastmult vfp edsp neon vfpv3 tls vfpv4
idiva idivt vfpd32 lpae evtstrm crc32
Raspberry Pi reference 2019-05-13
64 bit
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: ARM
Model: 3
Model name: Cortex-A72
Stepping: r0p3
CPU max MHz: 1500.0000
CPU min MHz: 600.0000
BogoMIPS: 108.00
Flags: fp asimd evtstrm crc32 cpuid
Linux raspberrypi 4.19.118-v8+ #1311 SMP PREEMPT Mon Apr 27 14:32:38 BST 2020 aarch64 GNU/Linux
Benchmark Results
The following provide benchmark results with limited comments on Raspberry Pi 4B performance, compiled as 32 bit and
64 bit programs. There are also considerations of the impact of the larger 8 GB RAM and the possibility of larger file
sizes.
Whetstone Benchmark - whetstonePiC8, whetstonePi64g8
This has a number of simple programming loops, with the overall MWIPS rating dependent on floating point calculations,
lately those identified as COS and EXP. The last three can be over optimised, but the time does not affect the overall
rating much.
Performance is normally more dependent on CPU MHz than advanced instructions, but an overall improvement of 11%
was indicated and 11% on straightforward floating point calculations.
System      MHz  MWIPS  ----MFLOPS----   -----------MOPS------------
                          1    2    3    COS   EXP  FIXPT    IF  EQUAL
32 bit     1500   1883   522  471  313   54.9  26.4   2496  3178    998
64 bit     1500   2085   524  535  398   57.6  27.3   2493  2979    997
64/32 bit         1.11  1.00 1.14 1.27   1.05  1.03   1.00  0.94   1.00
Dhrystone Benchmark - dhrystonePiC8, dhrystonePi64g8
This appears to be the most popular ARM benchmark and is often subject to over optimisation, so results from
different compilers cannot safely be compared. Ignoring this, results in VAX MIPS, aka DMIPS, and comparisons follow.
The 64 bit compilation provided an apparent 54% improvement in performance, but was possibly over optimised.
System      MHz  DMIPS  DMIPS/MHz
32 bit     1500   5077       3.76
64 bit     1500   7814       5.21
64/32 bit         1.54
Linpack 100 Benchmark MFLOPS - linpackPiC8, linpackPiC8SP, linpackPiNEONiC8, linpackPi64g8,
linpackPi64gSP, linpackPi64NEONig8
This original Linpack benchmark uses a small data array, unsuitable for higher speed multiprocessing. It executes double
precision arithmetic. I introduced a single precision version with a NEON variety, to indicate vector processing speed.
The NEON version, which uses intrinsic functions, was the star of the show when the Pi 4B was introduced, with the
most significant performance improvements, compared to the Pi 3B, and the benefit reflected in the 32 bit NEON/SP
results below. The 64 bit SP result now shows that 64 bit vector instructions can achieve the same sort of
performance gains, this time 81% faster than at 32 bits.
System      MHz       DP       SP  NEON SP
32 bit     1500    957.1   1068.8   1819.9
64 bit     1500   1111.5   1938.2   2030.9
64/32 bit            1.16     1.81     1.12
Livermore Loops Benchmark MFLOPS - liverloopsPiC8, liverloopsPi64g8
This benchmark measures performance of 24 double precision kernels, initially used in selecting supercomputers.
The official average is the geometric mean, on which the Cray 1 supercomputer was rated at 11.9 MFLOPS. Following
are MFLOPS for the individual kernels, followed by overall scores.
Based on Geomean results, the overall 64 bit speed rating was 13% faster than at 32 bits, but vector instructions
pushed this up to a maximum 67%.
MFLOPS for 24 loops
32 bit
1480 1017 974 930 383 657 1624 1861 1664 617 498 741
221 320 803 640 737 1003 451 378 1047 411 763 187
64 bit
2108 936 960 965 383 809 2313 2488 2066 669 500 981
181 405 815 644 727 1190 450 397 1716 367 818 313
64 bit / 32 bit gain range - 0.82 to 1.67
Comparisons
System MHz Maximum Average Geomean Harmean Minimum
32 bit 1500 1860.8 800.4 679.0 564.1 179.5
64 bit 1500 2616.7 959.8 766.7 613.0 169.7
64/32 bit 1.41 1.20 1.13 1.09 0.95
Fast Fourier Transforms Benchmarks - fft1PiC8, fft3cPiC8, fft1Pi64g, fft3cPi64g8
This is a real application provided by my collaborator at Compuserve Forum. There are two benchmarks. The first one is
the original C program. The second is an optimised version, originally using my x86 assembly code, but translated back
into C code, making use of the partitioning and (my) arrangement to optimise for burst reading from RAM. Three
measurements use both single and double precision data, calculating FFT sizes between 1K and 1024K, with data from
caches and RAM. Note that steps in performance levels occur at data size changes from L1 to L2 caches, then to
RAM.
Following are average running times from the three passes of each FFT calculation. There were no significant
variations in overall performance between 32 bit and 64 bit compilations. This could be expected using RAM, but there
is probably too much diversity in data flow from caches to benefit from advanced vector operation.
Time in milliseconds
         32 bit FFT 1     32 bit FFT 3     64 bit FFT 1     64 bit FFT 3
Size K     SP      DP       SP      DP       SP      DP       SP      DP
1 0.04 0.04 0.05 0.04 0.04 0.04 0.04 0.04
2 0.08 0.13 0.10 0.10 0.08 0.14 0.08 0.10
4 0.29 0.34 0.24 0.23 0.23 0.40 0.21 0.24
8 0.79 0.82 0.57 0.51 0.74 0.99 0.47 0.51
16 1.65 1.85 1.32 1.19 1.88 2.67 1.15 1.20
32 3.76 4.71 2.69 3.30 5.04 5.16 2.26 3.31
64 8.82 30.64 6.60 9.47 8.72 32.58 5.72 10.19
128 58.54 132.41 16.92 23.85 49.92 160.12 15.92 24.43
256 275.44 373.12 37.61 55.97 293.06 389.40 37.85 54.60
512 780.89 751.27 81.54 128.13 559.88 780.79 82.06 119.23
1024 1578.70 1812.20 186.45 288.27 1376.28 1890.46 178.37 262.30
Ratios > 1.0 = 64 bit faster
                      FFT 1 SP  FFT 1 DP  FFT 3 SP  FFT 3 DP
Average                   1.05      0.89      1.13      1.02
Minimum                   0.75      0.69      0.99      0.93
Maximum                   1.39      0.96      1.26      1.12
BusSpeed Benchmark - busspeedPiC8, busspeedPi64g8
This is a read only benchmark, with data from caches and RAM. The program reads one word, with a 32 word increment
to the next, then repeats over the skipped data with decreasing increments, finally reading all data. This shows where
data is read in bursts, enabling estimates of bus speeds to be made, as 16 times the speed of the appropriate
measurements at Inc16.
The speed via these increments can vary considerably, so comparisons are provided for the Read All column. The
32 bit RAM speeds are indicated as being slightly faster but, with data from caches, average 64 bit gains were around
55%.
Reading Speed 4 Byte Words in MBytes/Second
Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read
KBytes Words Words Words Words Words All
32 bit
16 4880 5075 5612 5852 5877 5864
32 846 1138 2153 3229 4908 5300
64 746 1019 2035 3027 4910 5360
128 728 983 1952 2908 4888 5389
256 683 934 1901 2794 4874 5431
512 656 900 1760 2625 4585 5259
1024 301 410 870 1356 2846 4238
4096 233 248 531 996 2151 4045
16384 236 258 511 891 2143 4011
65536 237 257 508 881 2172 4015
64 bit                                              64b/32b
16 4898 5109 5626 5860 5879 9238 1.58
32 1109 1389 2485 3804 5026 8435 1.59
64 804 1030 2025 3285 4871 8312 1.55
128 737 951 1877 3130 4908 8556 1.59
256 732 953 1897 3147 4941 8617 1.55
512 701 939 1766 2902 4601 8150 1.31
1024 323 494 986 1807 3060 5553 1.31
4096 242 259 486 964 1932 3856 0.95
16384 236 268 493 971 1939 3878 0.97
65536 242 271 494 973 1942 3884 0.97
MemSpeed Benchmark MB/Second - memspeedPiC8, memspeedPi64g8
The benchmark includes CPU speed dependent calculations using data from caches and RAM. The calculations are
shown in the results column titles. Following are full Pi 32 bit and 64 bit results, plus some calculations of maximum
MFLOPS.
Ignoring the last three columns, with no calculations, that are subject to over optimisation, the arithmetic overhead
led to similar RAM performance in the two environments. Integer speeds appeared to be the same, but double
precision tests indicated a 64 bit advantage of over 20% and 30%, depending on which cache was involved. This time
(as seen before) the 64 bit compiler generated implausible code for single precision calculations, producing much
slower speeds than at double precision.
Below are results from running the original 64 bit version, compiled by gcc 7 (I think). This confirmed that the strange
results are unlikely to be caused by the 64 bit hardware or Operating System.
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
32 bit
8 11768 9844 3841 11787 9934 4351 10309 7816 7804
16 11880 9880 3822 11886 10043 4363 10484 7902 7892
32 9539 8528 3678 9517 8661 4098 10564 7948 7945
64 9952 9310 3733 9997 9470 4160 8452 7717 7732
128 9947 9591 3757 9990 9757 4178 8205 7680 7753
256 10015 9604 3758 10030 9781 4186 8120 7734 7707
512 9073 9300 3751 9472 9526 4175 7995 7709 7602
1024 2681 5303 3594 2664 4965 3760 4828 3592 3569
2048 1671 3488 3242 1757 3635 3540 2882 1036 1023
4096 1777 3700 3283 1827 3627 3555 2433 1052 1054
8192 1931 3805 3420 1933 3815 3629 2465 980 971
MFLOPS 1471 2470
64 bit
8 15531 3999 3957 15576 4387 4358 11629 9313 9314
16 15717 3992 3922 15770 4355 4377 11799 9444 9446
32 12020 3818 3814 12043 4179 4198 11549 9496 9497
64 12228 3816 3887 12220 4166 4195 8935 8506 8506
128 12265 3869 3941 12157 4182 4206 8080 8193 8196
256 12230 3873 3932 12073 4199 4216 8129 8224 8223
512 9731 3832 3902 9709 4150 4171 8029 7845 7865
1024 3772 3682 3769 3467 3887 3920 5478 5543 5378
2048 1896 3463 3496 1886 3616 3612 2937 2945 2923
4096 1924 3520 3528 1933 3651 3394 2752 2796 2785
8192 1996 3523 3555 1988 3643 3630 2668 2661 2663
MFLOPS 1964 1000
64 bit / 32 bit
16 1.32 0.40 1.03 1.33 0.43 1.00 1.13 1.20 1.20
256 1.22 0.40 1.05 1.20 0.43 1.01 1.00 1.06 1.07
8192 1.03 0.93 1.04 1.03 0.95 1.00 1.08 2.72 2.74
########################### Earlier Version ###########################
Memory Reading Speed Test armv8 64 Bit by Roy Longbottom
Start of test Wed Jun 10 10:04:22 2020
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
8 15504 13974 12580 15552 14024 15534 11521 9313 7791
16 15707 14173 12747 15758 14183 15746 11751 9445 7890
32 13356 11998 11123 13372 12300 12836 11450 9500 7937
64 12340 11302 10651 12156 11698 12044 9415 8937 7910
128 12253 11384 10707 12207 11861 12083 8260 8299 7821
256 12259 11408 10694 12089 11896 12091 8101 8220 7894
512 9855 9593 9246 10264 9482 9801 7917 8057 7754
1024 3317 3613 3571 3640 3602 3600 5885 5833 5616
2048 1881 1885 1881 1890 1879 1879 2911 2999 3015
4096 1950 1946 1949 1952 1941 1925 2672 2666 2661
8192 1952 1964 1964 1968 1962 1961 2546 2536 2537
NeonSpeed Benchmark MB/Second - NeonSpeedC8, NeonSpeedPi64g8
This carries out some of the same calculations as MemSpeed. All results are for 32 bit floating point and integer
calculations. Norm functions were as generated by the compiler, and NEON results were obtained using intrinsic functions.
The same slow single precision calculation speeds as above were produced again at 64 bits, as indicated by the earlier
version results included below. As could be expected, 32 bit and 64 bit calculations, obtained via NEON intrinsic
functions, were effectively the same.
Vector Reading Speed in MBytes/Second
Memory Float v=v+s*v Int v=v+v+s Neon v=v+v
KBytes Norm Neon Norm Neon Float Int
32 bit
16 9884 12882 3910 12773 13090 15133
32 9904 13061 3916 13002 13162 15239
64 9029 11526 3450 10704 11708 12084
128 9242 11784 3391 11016 11816 12179
256 9283 11890 3396 11215 11929 12284
512 9043 10680 3413 10211 10925 11241
1024 5818 3310 3507 3288 3239 2902
4096 4060 1994 3497 1991 2009 2011
16384 4030 2063 3445 2068 2072 2067
65536 3936 2109 3391 1858 2122 2121
64 bit
16 3629 14987 3925 13643 14457 16642
32 3475 10933 3821 9970 11029 11055
64 3447 11749 3845 11098 11802 12079
128 3332 11392 3912 10813 11430 11513
256 3325 11565 3926 10981 11598 11699
512 3313 10553 3917 10269 10755 10740
1024 3239 3331 3737 3291 3302 3321
4096 2987 1888 3331 1777 1881 1878
16384 3150 1821 3347 1814 1812 1834
65536 2747 1954 3132 2017 1904 2021
64 bit / 32 bit
16 0.37 1.16 1.00 1.07 1.10 1.10
256 0.36 0.97 1.16 0.98 0.97 0.95
8192 0.70 0.93 0.92 1.09 0.90 0.95
########################### Earlier Version ###########################
NEON Speed Test armv8 64 Bit V 1.0 Wed Jun 10 10:06:03 2020
Vector Reading Speed in MBytes/Second
Memory Float v=v+s*v Int v=v+v+s Neon v=v+v
KBytes Norm Neon Norm Neon Float Int
16 13999 16429 12687 15238 16213 17194
32 12384 13367 11232 12767 14406 14493
64 10736 11870 10305 10790 11940 11976
128 10728 11826 10393 10739 11951 11956
256 10760 11908 10386 10816 12026 12064
512 10697 11911 10404 10781 12070 12006
1024 3854 3941 3810 4015 4315 4402
4096 2007 2000 2018 1985 1995 1999
16384 2002 2008 1997 1927 1997 1997
65536 2030 2027 2022 2020 2012 2023
MultiThreading Benchmarks
Most of the multithreading benchmarks execute the same calculations using 1, 2, 4 and 8 threads. One of them, MP-
MFLOPS, is available in two different versions, using standard compiled “C” code for single and double precision
arithmetic. A further version uses NEON intrinsic functions. Another variety uses OpenMP procedures for automatic
parallelism.
MP-Whetstone Benchmark - MP-WHETSPC8, MP-WHETSPi64g8
Multiple threads each run the eight test functions at the same time, but with some dedicated variables. Measured
speed is based on the last thread to finish. Performance was generally proportional to the number of cores used.
Overall seconds indicates MP efficiency.
The MWIPS performance rating indicated that 64 bit code was 13% faster than that at 32 bits.
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
1 2 3 MOPS MOPS MOPS MOPS MOPS
32 bit
1T 1889.5 538.7 537.6 311.4 56.3 26.1 7450.5 2243.2 659.9
2T 3782.7 1065.5 1071.2 627.1 112.3 52.0 14525.7 4460.9 1327.3
4T 7564.1 2101.0 2145.9 1250.4 225.0 104.1 29430.5 8944.2 2660.8
8T 8003.6 2598.8 2797.0 1313.0 233.2 110.4 37906.3 10786.7 2799.4
Overall Seconds 4.99 1T, 5.00 2T, 5.03 4T, 10.06 8T
64 bit
1T 2147.8 530.7 530.0 397.8 60.5 27.3 7462.8 2237.7 998.2
2T 4294.1 1058.4 1059.5 795.8 120.9 54.6 14877.9 4457.8 1994.8
4T 8558.2 2093.8 2112.2 1590.3 241.8 108.3 29221.8 8909.9 3982.1
8T 8987.0 2689.8 2721.9 1641.0 254.1 112.0 37422.9 10873.9 4122.3
Overall Seconds 5.00 1T, 5.00 2T, 5.05 4T, 10.13 8T
4 Thread 64 bit/32 bit Performance ratios
1.13 1.00 0.98 1.27 1.07 1.04 0.99 1.00 1.50
MP-Dhrystone Benchmark - MP-DHRYPiC8, MP-DHRYPi64g8
This executes multiple copies of the same program, but with some shared data, leading to unacceptable multithreaded
performance. The single thread speeds were similar to the earlier Dhrystone results, with 44% 64 bit performance
gains. The other results don’t mean much.
Pi 3B+ ARM V7
MP-Dhrystone Benchmark Linux/ARM V7A v1.0 Wed Apr 24 22:57:46 2019
Using 1, 2, 4 and 8 Threads
32 bit
Threads 1 2 4 8
Seconds 0.79 1.21 2.62 4.88
Dhrystones per Second 10126308 13262168 12230188 13106002
VAX MIPS rating 5763 7548 6961 7459
64 bit
Seconds 0.55 1.08 2.15 4.30
Dhrystones per Second 14531390 14791730 14896723 14872767
VAX MIPS rating 8271 8419 8478 8465
64 bit / 32 bit 1.44 1.12 1.22 1.13
MP SP NEON Linpack Benchmark - linpackNeonMPC8, linpackMPNeonPi64g8
This was produced to show that the original Linpack benchmark was completely unsuitable for benchmarking multiple
CPUs or cores, and this is reflected in the results. The program uses NEON intrinsic functions, with increasing data
sizes. The unthreaded results are of interest but, using NEON functions, the 64 bit program cannot improve
performance much.
Pi 3B+ ARM V7
Linpack Single Precision MultiThreaded Benchmark
Using NEON Intrinsics, Wed Apr 24 23:03:08 2019
MFLOPS 0 to 4 Threads, N 100, 500, 1000
Threads None 1 2 4
32 bit
N 100 2007.38 112.55 107.85 106.98
N 500 1332.24 686.10 686.11 689.02
N 1000 402.61 435.26 432.21 432.01
64 bit
N 100 2167.70 91.82 89.65 89.96
N 500 1438.27 644.85 635.89 635.33
N 1000 394.99 376.97 383.92 384.19
64 bit / 32 bit
N 100 1.08 0.82 0.83 0.84
N 500 1.08 0.94 0.93 0.92
N 1000 0.98 0.87 0.89 0.89
MP BusSpeed (read only) Benchmark - MP-BusSpd2PiC8, MP-BusSpd2Pi64g8
Each thread accesses all of the data in separate sections, covering caches and RAM, starting at different points, the
latter to avoid misrepresentation of performance using the shared L2 cache. Each set of results shows appropriate
performance gains on increasing the number of threads used, but the 64 bit compiler somehow manages to lose its way
on decreasing addressing increments after Inc8, leading to the 32 bit version appearing to be up to three times faster.
Below are example results from a version compiled by gcc 9 for 64 bit Gentoo, showing that the performance issue was
probably not caused by the 64 bit hardware or Operating System.
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
32 bit
12.3 1T 5310 5616 5801 5898 5940 13425
2T 9393 10008 11293 11293 11368 24932
4T 15781 15015 17606 19034 22279 40736
8T 8465 9599 14580 18465 20034 36831
122.9 1T 664 930 1861 3191 5017 10281
2T 564 726 1523 5376 9387 18985
4T 486 919 1886 4289 8337 16979
8T 487 912 1854 4275 8271 16826
12288 1T 225 258 514 1010 1992 3975
2T 202 421 450 1765 3307 7396
4T 261 288 825 1332 1772 5014
8T 218 273 496 1041 2571 4021
64 bit                                        RdAll  64b/32b
12.3 1T 5168 5542 5641 4205 4095 4230 0.32
2T 8968 10728 10161 8110 8058 8368 0.34
4T 7874 13255 15586 13641 15485 16533 0.41
8T 8186 13386 15239 13469 14431 16372 0.44
122.9 1T 598 927 1876 2792 3746 4059 0.39
2T 514 719 1538 4846 7596 8083 0.43
4T 486 933 2060 4126 8175 13690 0.81
8T 483 937 2059 4160 8166 13817 0.82
12288 1T 224 257 488 964 1933 3579 0.90
2T 219 427 889 1832 3493 5371 0.73
4T 280 353 562 859 2168 3286 0.66
8T 229 230 527 1075 1880 4480 1.11
###################### gcc 9 Version ###################
MP-BusSpd 64 Bit gcc 9 Fri May 29 09:56:08 2020
12.3 4T 7317 13937 15720 18355 20549 33244
122.9 4T 492 937 1883 4009 7820 16423
MP RandMem Benchmark - MP-RandMemPiC8, MP-RandMemPi64g8
The benchmark uses the same complex indexing for serial and random access, with separate read only and read/write
tests. The performance patterns were as expected, and essentially the same at 32 bits and 64 bits, with no scope for
vectorisation. Random access is dependent on the impact of burst reading and writing, producing those slow speeds.
Read only performance increased, as expected, relative to the thread count, with that for read/write remaining
constant at a particular data size, probably due to write back to shared data space.
KB SerRD SerRDWR RndRD RndRDWR
32 bit
12.3 1T 5950 7903 5945 7896
2T 11849 7923 11887 7917
4T 23404 7785 23395 7761
8T 21903 7669 23104 7655
122.9 1T 5670 7309 2002 1924
2T 10682 7285 1648 1923
4T 9944 7266 1813 1927
8T 9896 7216 1812 1919
12288 1T 3904 1075 179 164
2T 7317 1055 215 164
4T 3398 1063 343 165
8T 4156 1062 350 165
64 bit
12.3 1T 5945 7898 5948 7895
2T 11913 7937 11905 7929
4T 23601 7875 23385 7867
8T 23139 7777 23016 7770
122.9 1T 5785 7090 2026 1977
2T 10941 7074 1654 1968
4T 10364 7052 1854 1970
8T 10256 7031 1844 1973
12288 1T 3861 1244 180 169
2T 3793 1242 220 171
4T 3941 1100 343 170
8T 4065 1247 351 171
64 bit / 32 bit
12.3 4T 1.01 1.01 1.00 1.01
122.9 4T 1.04 0.97 1.02 1.02
12288 4T 1.16 1.03 1.00 1.03
MP-MFLOPS Benchmarks - MP-MFLOPSPiC8, MP-MFLOPSDPC8, MP-NeonMFLOPSC8,
MP-MFLOPSPi64g8, MP-MFLOPSDPPi64g8, MP-NeonMFLOPSPi64g8
MP-MFLOPS measures floating point speed on data from caches and RAM. The first calculations are as used in Memory
Speed Benchmark, with a multiply and an add per data word read. The second uses 32 operations per input data word
of the form x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f -- more. Tests cover 1, 2, 4 and 8 threads, each carrying out the
same calculations but accessing different segments of the data.
There are three varieties: single precision, double precision and single precision through NEON intrinsic functions, all
attempting to show near maximum MP floating point processing speeds. 64 bit operation implemented vector
processing, with the expected single precision maximum performance twice as fast as double precision. Best performance
gains over 32 bit working were more than 2.5 times, with four thread performance near 26 GFLOPS single precision and
near 13 GFLOPS double precision. This time, the 32 bit NEON version provided performance improvements over the
plain single precision version but, at 64 bits, more efficient vector instructions were implemented, operating at up to
nearly 25 GFLOPS.
The numeric results are converted into a simple sumcheck that should be constant, irrespective of the number of
threads used. Correct values are included at the end of the results below. Note the differences between using NEON
functions and double or single precision floating point instructions.
Single Precision Version
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
32 bit
1T 1224 1257 520 2814 2800 2803
2T 2485 2257 525 5608 5575 5576
4T 4119 3243 534 11018 10645 8358
8T 4131 4618 541 9941 10339 8165
64 bit
1T 3303 3113 526 6750 6713 6429
2T 6410 4860 540 13378 13373 9005
4T 11696 6413 571 25479 25917 10126
8T 10262 10054 571 23140 23427 8726
Max
64b/32b 2.83 2.18 1.06 2.31 2.43 1.21
NEON Intrinsic Functions Version
32 bit
1T 2797 2870 641 4422 4454 4405
2T 3217 5601 569 8587 8800 8377
4T 7902 9864 611 17061 17215 9704
8T 7070 10562 603 15531 16203 9516
64 bit
1T 3319 3245 527 6569 6538 6294
2T 5737 5333 556 12810 12784 9565
4T 8497 11088 572 24775 24885 9570
8T 8037 11330 573 22658 21773 9443
Max
64b/32b 1.08 1.07 0.89 1.45 1.45 0.99
Double Precision Version
32 bit
1T 1203 1211 315 2675 2719 2674
2T 2291 2441 293 5406 5421 4907
4T 4673 2501 309 10313 10393 5256
8T 4394 3550 265 8782 10110 5197
64 bit
1T 1637 1553 273 3356 3351 3220
2T 3180 3031 278 6664 6676 4531
4T 5778 3102 283 12522 12675 4791
8T 3927 4272 286 12304 11351 4875
Max
64b/32b 1.24 1.20 0.91 1.21 1.22 0.93
Sumchecks
SP 76406 97075 99969 66015 95363 99951
NEON 76406 97075 99969 66014 95363 99951
DP 76384 97072 99969 66065 95370 99951
OpenMP-MFLOPS - OpenMP-MFLOPSC8, notOpenMP-MFLOPSC8, OpenMP-MFLOPS64g8,
notOpenMP-MFLOPS64g8
This benchmark carries out the same single precision calculations as the MP-MFLOPS Benchmarks but, in addition,
calculations with eight operations per data word. There is also a notOpenMP-MFLOPS single core version, compiled from
the same code and carrying out identical numbers of floating point calculations, but without an OpenMP compile
directive.
The final data values are checked for consistency. Different compilers or different CPUs could involve alternative
instructions or rounding effects, with variable accuracy. OpenMP sumchecks could then be expected to be the same as
those from the notOpenMP single core runs; however, this is not always the case. This benchmark was a compilation of
code used for desktop PCs, with data sizes starting at 100 KB, then 1 MB and 10 MB.
The main purposes of this benchmark are to see if OpenMP can produce similar maximum performance as MP-MFLOPS
and that this can increase in line with the number of cores used. These objectives were met using 32 floating point
operations per data word. Then, the 64 bit tests achieved up to 24 GFLOPS, 21% faster than at 32 bits.
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
OpenMP MFLOPS 32 bit
Data in & out 100000 2 2500 0.098043 5100 0.929538 Yes
Data in & out 1000000 2 250 0.810084 617 0.992550 Yes
Data in & out 10000000 2 25 0.922891 542 0.999250 Yes
Data in & out 100000 8 2500 0.144870 13805 0.957126 Yes
Data in & out 1000000 8 250 0.922568 2168 0.995524 Yes
Data in & out 10000000 8 25 0.918226 2178 0.999550 Yes
Data in & out 100000 32 2500 0.401577 19921 0.890282 Yes
Data in & out 1000000 32 250 0.935064 8556 0.988096 Yes
Data in & out 10000000 32 25 0.916277 8731 0.998806 Yes
OpenMP MFLOPS 64 bit                                                        64b/32b
Data in & out 100000 2 2500 0.092784 5389 0.929538 Yes 1.06
Data in & out 1000000 2 250 0.794744 629 0.992550 Yes 1.02
Data in & out 10000000 2 25 0.784255 638 0.999250 Yes 1.18
Data in & out 100000 8 2500 0.114583 17455 0.957117 Yes 1.26
Data in & out 1000000 8 250 0.797846 2507 0.995518 Yes 1.16
Data in & out 10000000 8 25 0.879850 2273 0.999549 Yes 1.04
Data in & out 100000 32 2500 0.332392 24068 0.890215 Yes 1.21
Data in & out 1000000 32 250 0.849420 9418 0.988088 Yes 1.10
Data in & out 10000000 32 25 0.933336 8571 0.998796 Yes 0.98
notOpenMP MFLOPS 32 bit
Data in & out 100000 2 2500 0.220277 2270 0.929538 Yes
Data in & out 1000000 2 250 0.791373 632 0.992550 Yes
Data in & out 10000000 2 25 0.792594 631 0.999250 Yes
Data in & out 100000 8 2500 0.362916 5511 0.957126 Yes
Data in & out 1000000 8 250 0.902125 2217 0.995524 Yes
Data in & out 10000000 8 25 0.786859 2542 0.999550 Yes
Data in & out 100000 32 2500 1.497859 5341 0.890282 Yes
Data in & out 1000000 32 250 1.518747 5267 0.988096 Yes
Data in & out 10000000 32 25 1.516393 5276 0.998806 Yes
notOpenMP MFLOPS 64 bit 64b/
32b
Data in & out 100000 2 2500 0.152535 3278 0.929538 Yes 1.44
Data in & out 1000000 2 250 0.965797 518 0.992550 Yes 0.82
Data in & out 10000000 2 25 0.781680 640 0.999250 Yes 1.01
Data in & out 100000 8 2500 0.356388 5612 0.957117 Yes 1.02
Data in & out 1000000 8 250 0.925742 2160 0.995518 Yes 0.97
Data in & out 10000000 8 25 0.840113 2381 0.999549 Yes 0.94
Data in & out 100000 32 2500 1.176455 6800 0.890215 Yes 1.27
Data in & out 1000000 32 250 1.227945 6515 0.988088 Yes 1.24
Data in & out 10000000 32 25 1.225311 6529 0.998796 Yes 1.24
OpenMP-MemSpeed Benchmarks below or Go To Start
OpenMP-MemSpeed - OpenMP-MemSpeed2C8, NotOpenMP-MemSpeed2C8
OpenMP-MemSpeed264g8, NotOpenMP-MemSpeed64g8
This is the same program as the single core MemSpeed benchmark, but with increased memory sizes and compiled
using OpenMP directives. The same program was also compiled without these directives (NotOpenMP-MemSpeed2),
with example single core results shown after the detailed measurements. Although the source code appears
suitable for speeding up by parallelisation, many of the test functions are slower using OpenMP. Detailed comparisons
of these results are rather meaningless, but they demonstrate that OpenMP might fail to produce performance
gains on apparently suitable code. There might also be compile options that overcome this problem.
Memory Reading Speed Test OpenMP
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
32 bit
4 8097 8322 8641 8020 8436 8384 39701 19701 19712
8 7814 8555 8756 8321 8548 8526 39042 19984 19996
16 8149 7738 7742 8303 7779 8192 37995 19883 19984
32 8969 8769 8799 9040 8759 8743 37737 20133 20130
64 7617 7457 7437 7575 7380 7422 17770 15332 14248
128 11221 10936 11003 11105 11011 10986 13650 13910 13881
256 17883 18144 18036 17691 18094 17844 13073 12465 12535
512 18001 18468 19675 17075 18221 19264 13511 13895 12008
1024 9532 10590 9772 11842 11282 11277 7173 9473 9496
2048 7095 7025 6866 7117 7043 6946 2914 3475 3468
4096 7244 6927 7036 5951 7054 6531 2582 3130 3122
8192 4578 7173 7025 6322 7078 7182 2504 3127 3115
16384 5470 7043 7067 7103 7052 7020 2557 3093 3088
32768 7359 7817 7766 7158 7078 7757 2618 3066 3094
65536 7810 7268 7266 3824 7478 5164 2486 3016 2931
131072 2460 2655 7224 7513 7308 7339 2540 2944 2940
Not OMP
8 11775 3895 4342 11787 4325 4354 10334 7806 7816
256 10032 3699 4223 9978 4289 4185 7105 7612 7621
65536 2099 2587 3033 2103 3021 3001 2585 1105 1101
64 bit
4 7749 8500 8716 7451 8520 8533 39508 18586 18589
8 8198 8669 8874 8148 8678 8691 38972 18863 18861
16 8023 8499 8335 7895 8355 8507 38305 19003 19004
32 9034 8517 8619 9127 8550 8522 37928 19071 18409
64 8652 8201 8178 8565 8223 8093 25191 17494 17508
128 11397 11616 11715 11345 11649 11029 13861 14097 14170
256 18242 18745 18195 17417 18605 18019 12535 12637 12623
512 17580 18467 18787 18010 18414 18321 12900 13180 13121
1024 8043 10172 11540 12510 10220 12082 9800 9586 9857
2048 4816 6807 6850 6922 6805 6666 3137 3372 3369
4096 7029 6846 6881 7017 5145 6801 2776 3124 3112
8192 2428 7085 7124 7068 7134 6904 2571 3092 3112
16384 7133 7152 7328 7008 3445 7178 2473 3099 3104
32768 2656 7643 7669 7802 7616 7559 2043 3112 3104
65536 7995 6523 2572 7059 6514 6485 2431 2955 3036
131072 1981 7273 7327 1878 3615 7267 2538 2968 2976
Not OMP
8 15532 3990 4394 15567 4386 4394 11629 9315 9314
256 12318 3871 4219 12134 4206 4219 8092 8231 8229
65536 2005 2588 2937 2011 2930 2621 2577 2565 2566
I/O Benchmarks below or Go To Start
I/O Benchmarks
Two varieties of I/O benchmarks are provided, one to measure performance of main and USB drives, and the other for
LAN and WiFi network connections. The Raspberry Pi programs write and read three files at two sizes (defaults 8 and
16 MB), followed by random reading and writing of 1 KB blocks out of 4, 8 and 16 MB files and, finally, writing and
reading 200 small files, sized 4, 8 and 16 KB. Run time parameters are provided for the size of the large files and the
file path. The same program code is used for both varieties, the only difference being the file opening properties. The
drive benchmark includes extra options to use direct I/O, avoiding data caching in main memory, but includes an extra
test with caching allowed. For further details and downloads see the usual PDF file.
LanSpeed Benchmarks - WiFi - LanSpeed, LanSpeed64g8
Following are Raspberry Pi 32 bit and 64 bit results using, I believe, both 2.4 GHz and 5 GHz WiFi frequencies.
Details on setting up the links can be found in This PDF file, LAN/WiFi section. Performance of the two systems was
reasonably similar at both frequencies, but speeds can vary widely. Also (with my setup?), obtaining consistent 5 GHz
operation was extremely difficult to achieve in both cases.
********************* 32 bit 2.4 GHz ********************
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
8 6.35 6.33 6.38 7.05 6.98 7.10
16 6.70 6.82 6.76 7.19 6.53 7.22
Random Read Write
From MB 4 8 16 4 8 16
msecs 2.691 2.875 3.048 3.13 2.93 2.84
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 0.34 0.44 1.04 0.37 0.37 1.26
ms/file 12.14 18.59 15.7 11.1 22.2 12.99 2.153
********************** 32 bit 5 GHz *********************
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
8 11.90 12.96 13.16 10.11 9.55 9.66
16 11.50 13.93 14.13 9.91 8.88 9.92
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 0.13 0.46 0.91 0.25 0.55 1.02
ms/file 30.85 17.83 18.10 16.62 14.93 16.01 3.361
Random similar to 2.4 GHz
********************** 64 bit 2.4 GHz *******************
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
8 5.48 5.14 5.39 6.86 6.61 5.30
16 5.62 5.64 5.69 5.17 5.02 5.18
Random Read Write
From MB 4 8 16 4 8 16
msecs 3.666 4.035 5.131 4.82 4.67 3.90
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 0.24 0.52 0.95 0.34 0.60 1.14
ms/file 17.10 15.73 17.20 12.00 13.68 14.35 2.437
********************** 64 bit 5 GHz *********************
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
8 11.43 11.70 11.57 8.21 3.64 7.05
16 10.96 7.30 11.84 8.40 6.24 7.94
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 0.38 0.73 1.12 0.39 0.73 0.98
ms/file 10.82 11.15 14.62 10.55 11.23 16.73 2.618
Random similar to 2.4 GHz
LAN Benchmark below or Go To Start
LanSpeed Benchmark - (1G bits per second Ethernet) - LanSpeed, LanSpeed64g8
Measured performance can vary significantly, but both 32 bit and 64 bit tests demonstrated Gigabit performance on
the large files. Of particular note (with my program), the 32 bit system indicated that a 2 GB file could not be
written, the actual file size ending at 2,147,483,647 bytes (2^31 - 1). On the other hand, at 64 bits, three files
each of 8 GB and of 16 GB were successfully written and read (in around 25 minutes).
************************ 32 bit ************************
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
8 67.82 12.97 90.19 99.84 93.49 96.83
16 92.25 92.66 92.96 103.9 105.28 91.17
Random Read Write
From MB 4 8 16 4 8 16
msecs 0.007 0.01 0.04 1.01 0.85 0.91
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 1.47 2.8 5.14 2.47 4.71 8.61
ms/file 2.78 2.92 3.19 1.66 1.74 1.9 0.256
Larger Files
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
32 78.2 34.46 80.71 84.94 87.11 84.97
64 88.18 87.52 87.03 111.34 109.58 107.28
128 98.84 99.24 96.58 110.99 110.57 87.43
256 106.75 105.43 106.4 85.78 108.99 106.29
1024 96.13 93.34 94.98 114.51 112.16 114.91
2048 Error writing file Segmentation fault
Wrote 2,147,483,647 bytes
************************ 64 bit ************************
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
1024 93.63 93.17 96.38 108.02 109.36 109.30
2048 98.41 96.54 99.18 111.26 111.89 111.83
Random Read Write
From MB 4 8 16 4 8 16
msecs 0.003 0.005 0.014 0.81 0.75 1.23
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 1.42 2.82 5.24 2.30 4.56 8.09
ms/file 2.89 2.90 3.13 1.78 1.80 2.02 0.288
Much Larger Files
8192 89.77 89.98 91.86 117.29 117.21 117.17
16384 90.64 89.47 91.10 116.58 117.24 117.13
USB Benchmarks below or Go To Start
USB Benchmarks - DriveSpeed, DriveSpeed64v2g8
Following are DriveSpeed results at 32 bits and 64 bits, accessing the same USB 3 drive. Note the differences in
performance during the various test procedures (they might not be the same next time). The 32 bit system again
failed on attempting to write a 2 GB file (the 2^31 - 1 limit).
At 64 bits, 4 GB could not be written, the size limit being disappointing. This benchmark uses direct I/O. As I
later discovered, running with caching enabled (using the LanSpeed benchmark) can write and read much larger files,
including those too large to cache. The example below is for writing and reading three files, each near 6 GB, then 12 GB.
The vmstat recordings show that there was no serious memory swapping, with around 7.5 GB of RAM used for caching.
********************* 32 bit USB 3 *********************
DriveSpeed RasPi 1.1 Sat May 30 15:31:20 2020
Selected File Path: /media/pi/PATRIOT1/
Total MB 120832, Free MB 112565, Used MB 8267
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
512 73.43 74.88 74.88 217.60 219.98 218.02
1024 63.03 76.64 74.46 220.72 220.60 219.97
Cached
8 38.07 41.95 39.95 700.06 693.26 677.20
Random Read Write
From MB 4 8 16 4 8 16
msecs 0.982 0.981 1.001 6.81 6.31 6.31
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 0.03 0.07 0.14 2.58 5.23 10.32
ms/file 120.08 120.06 120.00 1.59 1.57 1.59 2.491
Larger Files MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
2000 75.14 74.93 74.93 216.19 217.22 216.53
2048 Error writing file Segmentation fault
********************* 64 bit USB 3 *********************
DriveSpeed RasPi 64 Bit gcc 8 Wed May 27 11:43:43 2020
Selected File Path: /media/pi/PATRIOT1/
Total MB 120832, Free MB 114614, Used MB 6218
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
1024 27.78 21.39 21.43 270.32 278.81 274.98
2048 21.40 21.14 21.44 275.79 273.14 319.95
Cached
8 40.27 42.81 42.81 1206.64 1068.72 1031.56
Random Read Write
From MB 4 8 16 4 8 16
msecs 0.004 0.004 0.184 4.33 4.00 4.04
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 0.03 0.07 0.14 261.45 11.19 84.39
ms/file 119.60 119.05 119.64 0.02 0.73 0.19 2.477
Larger Files MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
2048 23.77 19.89 20.64 320.34 272.90 271.96
4096 Write failure
2000 21.72 22.38 26.57 275.40 273.85 309.57
4000 37.38 36.30 37.67 297.09 299.91 286.94
Caching Benchmark - USB 3 Hard Drive - 3 files up to near 36 GB capacity used
6000 169.80 136.20 126.26 90.43 146.13 144.05
12000 146.65 108.83 67.14 108.13 146.84 143.76
swpd free buff cache si so bi bo vmstat memory and I/O activity
768 7417668 102040 250844 0 0 1299 1329 Start
768 1970544 94436 5704132 0 0 0 132723 Writing 12000 MB
768 107908 92712 7568500 0 0 140339 0 Reading 12000 MB
Main Drive Benchmark below or Go To Start
Pi 4B Main Drive Benchmark - DriveSpeed, DriveSpeed64v2g8
The DriveSpeed benchmark failed to execute on the 64 bit system, producing the message "Error writing file
Segmentation fault". It had run previously on the Pi 4B but, again, would only write files of less than 2 GB, as shown
below. This also applied when running LanSpeed on the main drive. From below, note the faster reading speeds at 1024
MB; this was because the file is small enough to be cached.
Below are default results from running LanSpeed on the Pi 4 at 64 bits, initially intended to verify that the main drive
could be accessed by one of my programs. At first, I could not run specifying large files, as there was limited free
space on the OS drive. After cloning the card to a 32 GB version, 19 GB free space was indicated. I then ran the
program to write three 6000 MB files. This was followed by specifying 16000 MB, where one file was written and
the second generated an error after writing around 2500 MB. The good news was that the test did not crash
the system.
************************ 32 bit ************************
Current Directory Path: /home/pi/Raspberry_Pi_Benchmarks
Total MB 14845, Free MB 8198, Used MB 6646
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
8 16.41 11.21 12.27 39.81 40.10 40.39
16 11.79 21.10 34.05 40.18 40.19 40.33
Cached
8 137.47 156.43 285.59 580.73 598.66 587.97
Random Read Write
From MB 4 8 16 4 8 16
msecs 0.371 0.371 0.363 1.28 1.53 1.30
200 File Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 3.49 6.41 8.26 7.67 11.68 17.51
ms/file 1.17 1.28 1.98 0.53 0.70 0.94 0.014
Larger Files
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
1024 13.38 13.35 13.39 42.68 42.59 42.36
2048 Error writing file Segmentation fault
LanSpeed
1024 11.65 13.46 13.48 560.78 574.76 617.67
2048 Error writing file Segmentation fault
************************ 64 bit ************************
LanSpeed RasPi 64 Bit gcc 8 Wed May 27 10:36:54 2020
Current Directory Path: /home/pi/Raspberry-Pi-4-64-Bit-Benchmarks
Total MB 14637, Free MB 8724, Used MB 5913
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
8 265.13 281.30 292.28 1270.88 1286.35 1329.42
16 246.59 277.53 299.05 1201.20 1327.24 1095.78
Random Read Write
From MB 4 8 16 4 8 16
msecs 0.002 0.002 0.002 7.68 9.01 7.14
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 56.52 64.92 94.20 303.96 549.54 538.32
ms/file 0.07 0.13 0.17 0.01 0.01 0.03 0.014
Larger File - 32 GB SD card
Total MB 29643, Free MB 19776, Used MB 9868
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
6000 24.14 18.80 19.39 31.07 45.60 45.76
16000 21.12 Error writing file Segmentation fault
File 1 15.6 GiB (16,777,216,000 bytes)
File 2 2.5 GiB ( 2,645,176,320 bytes) - Not enough free space
Java Whetstone Benchmark below or Go To Start
Java Whetstone Benchmark - whetstc.class
The Java benchmarks comprise class files that were produced some time ago, but source codes are available to renew
them. Performance can vary significantly using different Java Virtual Machines, so comparisons might not be
appropriate.
The results below suggest that 32 bit overall performance, in MWIPS, was faster than at 64 bits. This was due to the
most time consuming functions (N5 and N6) taking less time. Note that some speeds are effectively the same as those
found running the C compiled version above.
************************* 32 bit *************************
Whetstone Benchmark OpenJDK11 Java Version, May 15 2019, 18:48:20
1 Pass
Test Result MFLOPS MOPS millisecs
N1 floating point -1.124750137 524.02 0.0366
N2 floating point -1.131330490 494.12 0.2720
N3 if then else 1.000000000 289.92 0.3570
N4 fixed point 12.000000000 1092.99 0.2882
N5 sin,cos etc. 0.499110132 59.86 1.3900 x
N6 floating point 0.999999821 345.95 1.5592 x
N7 assignments 3.000000000 331.54 0.5574
N8 exp,sqrt etc. 0.825148463 25.41 1.4640
MWIPS 1687.92 5.9244
Operating System Linux, Arch. arm, Version 4.19.37-v7l+
Java Vendor BellSoft, Version 11.0.2-BellSoft
************************* 64 bit *************************
Whetstone Benchmark Java Version, May 22 2020, 14:24:09
1 Pass
Test Result MFLOPS MOPS millisecs
N1 floating point -1.124750137 520.61 0.0369
N2 floating point -1.131330490 481.38 0.2792
N3 if then else 1.000000000 236.41 0.4378
N4 fixed point 12.000000000 1320.20 0.2386
N5 sin,cos etc. 0.499110132 47.96 1.7348 x
N6 floating point 0.999999821 276.33 1.9520 x
N7 assignments 3.000000000 320.17 0.5772
N8 exp,sqrt etc. 0.825148463 25.41 1.4640
MWIPS 1487.99 6.7205
Operating System Linux, Arch. aarch64, Version 4.19.118-v8+
Java Vendor Debian, Version 11.0.7
JavaDraw Benchmark below or Go To Start
JavaDraw Benchmark - JavaDrawPi.class
The benchmark draws small to rather excessive numbers of simple objects to measure drawing performance in Frames
Per Second (FPS). Five tests draw on a background of continuously changing colour shades, each test adding to the
load. In order for this to run at maximum speed, it was necessary to disable the experimental GL driver.
In this case, performance at 32 bits and 64 bits was quite similar.
************************* 32 bit *************************
Java Drawing Benchmark, May 15 2019, 18:55:41
Produced by OpenJDK 11 javac
Test Frames FPS
Display PNG Bitmap Twice Pass 1 877 87.65
Display PNG Bitmap Twice Pass 2 1042 104.18
Plus 2 SweepGradient Circles 1015 101.47
Plus 200 Random Small Circles 779 77.85
Plus 320 Long Lines 336 33.52
Plus 4000 Random Small Circles 83 8.25
Total Elapsed Time 60.1 seconds
Operating System Linux, Arch. arm, Version 4.19.37-v7l+
Java Vendor BellSoft, Version 11.0.2-BellSoft
************************* 64 bit *************************
Java Drawing Benchmark, May 22 2020, 14:25:15
Produced by javac 1.8.0_222
Test Frames FPS
Display PNG Bitmap Twice Pass 1 833 83.26
Display PNG Bitmap Twice Pass 2 1001 100.05
Plus 2 SweepGradient Circles 994 99.39
Plus 200 Random Small Circles 836 83.54
Plus 320 Long Lines 380 37.98
Plus 4000 Random Small Circles 95 9.44
Total Elapsed Time 60.1 seconds
Operating System Linux, Arch. aarch64, Version 4.19.118-v8+
Java Vendor Debian, Version 11.0.7
OpenGL GLUT Benchmark below or Go To Start
OpenGL GLUT Benchmark - videogl32, videogl64
In 2012, I approved a request from a Quality Engineer at Canonical to use this OpenGL benchmark in the testing
framework of the Unity desktop software. The program can be run as a benchmark or, using selected functions, as a
stress test of any duration.
The benchmark measures graphics speed in terms of Frames Per Second (FPS) via six simple and more complex tests.
The first four tests portray moving up and down a tunnel containing various independently moving objects, with and
without texturing. The last two tests represent a real application for designing kitchens. The first is in wireframe
format, drawn with 23,000 straight lines; the second has colours and textures applied to the surfaces.
As a benchmark, it was run using the following script file, the first command being needed to avoid VSYNC, allowing
FPS to be greater than 60.
export vblank_mode=0
./videogl32 Width 320, Height 240, NoEnd
./videogl32 Width 640, Height 480, NoHeading, NoEnd
./videogl32 Width 1024, Height 768, NoHeading, NoEnd
./videogl32 Width 1920, Height 1080, NoHeading
The benchmark could not be recompiled at 64 bits, as certain freeglut functions were not readily available, so an
earlier version was used. In this case, the 64 bit version, at the higher pixel settings, appeared to be slower on the
graphics speed dependent tests, but faster elsewhere.
As indicated below, the dual monitor connections enabled this option to be tested at 64 bits.
************************ 32 bit ************************
GLUT OpenGL Benchmark 32 Bit Version 1, Thu May 2 19:01:05 2019
Running Time Approximately 5 Seconds Each Test
Window Size Coloured Objects Textured Objects WireFrm Texture
Pixels Few All Few All Kitchen Kitchen
Wide High FPS FPS FPS FPS FPS FPS
320 240 766.7 371.4 230.6 130.2 32.5 22.7
640 480 427.3 276.5 206.0 121.8 31.7 22.2
1024 768 193.1 178.8 150.5 110.4 31.9 21.5
1920 1080 81.4 79.4 74.6 68.3 30.8 20.0
************************ 64 bit ************************
GLUT OpenGL Benchmark 64 Bit gcc 9, Fri May 22 13:50:00 2020
Running Time Approximately 5 Seconds Each Test
Window Size Coloured Objects Textured Objects WireFrm Texture
Pixels Few All Few All Kitchen Kitchen
Wide High FPS FPS FPS FPS FPS FPS
160 120 753.4 414.5 258.3 152.0 42.7 30.0
320 240 644.5 385.9 243.9 145.6 41.5 29.1
640 480 320.6 270.6 217.9 136.8 43.0 29.4
1024 768 140.6 135.1 122.6 114.1 41.8 28.5
1920 1080 57.7 56.4 55.7 52.4 40.5 26.7
****************** 64 bit Dual Monitor*******************
3840 1080 26.9 26.7 27.0 26.0 27.5 21.6
Usable RAM below or Go To Start
Usable RAM - MALLOC
On running various benchmarks, it became clear that there were restrictions on how much RAM could be used by my C
based benchmarks. A simple program was written that allocated a specified amount of memory using malloc, filled it
with data, freed the space, then repeated the sequence incrementally until an allocation failure was indicated. Both
32 bit and 64 bit versions were produced and each run on 4 GB and 8 GB systems. Except at 64 bits with 8 GB RAM, all
were restricted to less than 4,000,000,000 bytes. For that configuration, vmstat memory utilisation details are
provided, showing the low points and samples between, identifying that memory space had been freed.
############################### 32 Bit OS ###############################
4 GB RAM
Bytes 1000000000 250000000 words allocated 250000000 written finished
Bytes 2000000000 500000000 words allocated 500000000 written finished
Bytes 3000000000 750000000 words allocated 750000000 written finished
Bytes 4000000000 Memory allocation failed - Exit Later OK to 3050000000 (2.84 GB)
8 GB RAM
Bytes 1000000000 250000000 words allocated 250000000 written finished
Bytes 2000000000 500000000 words allocated 500000000 written finished
Bytes 3000000000 750000000 words allocated 750000000 written finished
Bytes 4000000000 Memory allocation failed - Exit Later OK to 3060000000 (2.85 GB)
############################### 64 Bit OS ###############################
4 GB RAM
Bytes 1000000000 250000000 words allocated 250000000 written finished
Bytes 2000000000 500000000 words allocated 500000000 written finished
Bytes 3000000000 750000000 words allocated 750000000 written finished
Bytes 4000000000 Memory allocation failed - Exit Later OK to 3700000000 (3.45 GB)
8 GB RAM
Bytes 1000000000 250000000 words allocated 250000000 written finished
Bytes 2000000000 500000000 words allocated 500000000 written finished
Bytes 3000000000 750000000 words allocated 750000000 written finished
Bytes 4000000000 1000000000 words allocated 1000000000 written finished
Bytes 5000000000 1250000000 words allocated 1250000000 written finished
Bytes 6000000000 1500000000 words allocated 1500000000 written finished
Bytes 7000000000 1750000000 words allocated 1750000000 written finished
Bytes 8000000000 Memory allocation failed - Exit Later OK to 7900000000 (7.36 GB)
pass swpd free buff cache pass swpd free buff cache
0 7412260 85908 274472 0 7234852 85908 278140
1 0 6615688 85908 277608 5 0 2600856 85908 277096
0 7385388 85908 277264 0 7184736 85908 277612
2 0 5671192 85908 277612 6 0 1571436 85908 277096
0 7210328 85908 277264 0 7257464 85908 277096
3 0 4526104 85908 277096 7 0 624436 86228 281456
0 7324312 85908 277096 0 7402400 86228 283200
4 0 3665272 85908 277264
Usable RAM - Specified Dimensions
Where dimensions were specified in the programs, rather than using malloc, some differences were apparent. Using the
32 bit system, a compile error was indicated when the dimensions required 2 GB (2^31) bytes, with 1 byte less being
accepted. As shown below, at 64 bits, more than 2 GB was allowed on the 4 GB system. Then, at both 4 GB and 8 GB,
arrays close to these total RAM sizes could be used.
######################## 32 Bit OS 4 GB and 8 GB ########################
int array[536870912]; size of array 'array' is too large 2 GB
int array[536870911]; compiles
float array[536870912]; size of array 'array' is too large 2 GB
float array[536870911]; compiles
double array[268435456]; size of array 'array' is too large 2 GB
double array[268435455]; compiles
############################# 64 Bit OS 4 GB ############################
int array[920000000]; OK 3.43 GB
int array[1073741824]; Segmentation fault 4 GB
float array[920000000]; OK 3.43 GB
float array[1073741824]; Segmentation fault 4 GB
double array[460000000]; OK 3.43 GB
double array[536870912]; Segmentation fault 4 GB
############################# 64 Bit OS 8 GB ############################
int array[1950000000]; OK 7.9 GB paging
int array[2147483648]; Segmentation fault 8 GB
float array[1950000000]; OK 7.9 GB
float array[2147483648]; Segmentation fault 8 GB
double array[975000000]; OK 7.9 GB
double array[1073741824]; Segmentation fault 8 GB
High Performance Linpack Benchmark below or Go To Start
High Performance Linpack Benchmark - xhpl
I ported my ATLAS version of HPL, which I have run on earlier Raspberry Pi systems, to both the 64 bit and 32 bit SD
cards. See my report at ResearchGate, Raspberry Pi 4B Stress Tests Including High Performance Linpack.pdf. The report
showed that the amount of memory used followed the same proportions as the original Linpack benchmark, somewhat
greater than N x N x 8 bytes for double precision operation on an N x N dimensioned array. For RAM residence,
N=20000 would require 4 GB and N=30000 would need 8 GB.
Following are results from tests run without and with a cooling fan in place. The first were for the original Pi 4 with
4 GB RAM, carried out in June 2019. The others, with 8 GB, were run using recent 32 bit and 64 bit Operating System
versions in 2020. With the fan in place, clock speeds were effectively constant at 1500 MHz on all three test rigs,
with the same MFLOPS performance at each problem size. The 4 GB system appeared to be running at a higher
temperature, but not high enough to introduce CPU MHz throttling.
With no fan in use, throttling occurred on all systems from N=16000 onwards. The 4 GB system suffered more of this
than the 8 GB models, reflected in higher temperatures and slower performance. The difference is thought to be due
to improvements made in thermal management.
These tests show that the HPL benchmark is an excellent stress testing application that demonstrates using most of
the available RAM while running at high performance levels. The double precision speed approached the 12.6 GFLOPS
achieved by one of my benchmarks. The 64 bit compilation does not appear to benefit from using advanced vector
operations, but I could not identify whether other compiling parameters could be included.
No Fan Fan
RAM at bits N GFLOPS Seconds Max °C Min MHz GFLOPS Seconds Max °C Min MHz
4 GB 32b 8000 8.6 40 81 1500 9.3 37 61 1500
8 GB 32b 8000 9.7 35 58 1500 9.6 35 57 1500
8 GB 64b 8000 8.8 39 76 1500 8.7 39 55 1500
4 GB 32b 16000 6.8 404 86 750/600 10.4 263 70 1500
8 GB 32b 16000 8.6 319 83 1000 10.4 263 63 1500
8 GB 64b 16000 8.1 338 84 1000 10.0 273 61 1500
4 GB 32b 20000 6.2 856 87 750/600 10.8 494 71 1500
8 GB 32b 20000 8.8 604 85 1000 10.7 497 63 1500
8 GB 64b 20000 8.5 625 85 1000/600 10.3 519 63 1500
4 GB 32b 30000 N/A N/A
8 GB 32b 30000 8.2 2195 85 1000/600 11.3 1590 64 1500
8 GB 64b 30000 7.6 2370 86 1000/600 11.4 1584 63 1500
Below are vmstat details, showing that most of the RAM was in use and four cores were running at 100% utilisation.
Then there are examples of environmental differences between older 32 bit and later 64 bit operation, particularly
MHz throttling variations, core voltage and pmic temperature differences.
8 GB 64b 30000
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 7422216 83712 264952 0 0 213 4 211 345 2 2 96 1 0
4 0 0 5366940 83720 269572 0 0 144 2 1130 483 82 3 15 0 0
4 0 0 2974924 83728 271960 0 0 0 3 1287 585 97 3 0 0 0
4 0 0 637296 83960 275704 0 0 0 48 1859 2130 96 4 0 0 0
4 0 3072 246724 43176 207604 1 83 141 95 1663 1402 97 3 0 0 0
4 0 3584 243388 32412 191932 3 17 11 23 1110 131 100 0 0 0 0
6 0 3584 247168 32420 187520 0 0 0 2 1085 59 100 0 0 0 0
Later
5 0 3584 238580 34324 193432 0 0 4 2 1196 361 99 1 0 0 0
5 0 7936 238124 26356 193392 0 140 386 193 1993 2064 97 3 0 0 0
4 0 7936 247408 27264 194160 1 0 70 11 1889 1888 98 2 0 0 0
4 GB 32b 20000 No Fan
485.3 ARM MHz=1000, core volt=0.8771V, CPU temp=84.0'C, pmic temp=74.1'C
506.6 ARM MHz= 750, core volt=0.8771V, CPU temp=85.0'C, pmic temp=74.1'C
528.0 ARM MHz= 750, core volt=0.8771V, CPU temp=86.0'C, pmic temp=74.1'C
549.2 ARM MHz= 600, core volt=0.8771V, CPU temp=85.0'C, pmic temp=74.1'C
570.6 ARM MHz=1000, core volt=0.8771V, CPU temp=85.0'C, pmic temp=74.1'C
591.9 ARM MHz= 750, core volt=0.8771V, CPU temp=84.0'C, pmic temp=74.1'C
8 GB 64b 30000 No Fan
1546.8 ARM MHz=1000, core volt=0.8600V, CPU temp=86.0'C, pmic temp=70.3'C
1577.8 ARM MHz= 600, core volt=0.8600V, CPU temp=85.0'C, pmic temp=70.3'C
1608.8 ARM MHz=1000, core volt=0.8600V, CPU temp=86.0'C, pmic temp=70.3'C
1639.9 ARM MHz=1000, core volt=0.8350V, CPU temp=85.0'C, pmic temp=70.3'C
1670.8 ARM MHz=1000, core volt=0.8600V, CPU temp=85.0'C, pmic temp=70.3'C
1701.8 ARM MHz= 600, core volt=0.8600V, CPU temp=85.0'C, pmic temp=70.3'C
1732.8 ARM MHz=1000, core volt=0.8600V, CPU temp=85.0'C, pmic temp=70.3'C
Floating Point Stress Tests below or Go To Start
Floating Point Stress Tests - MP-FPUStress, MP-FPUStressDP, MP-FPUStress64g8, MP-
FPUStress64DPg8
These stress tests have a benchmarking mode that provides choices for a long running test. They cover the number of
threads, floating point operations carried out on each data word, and memory size to cover caches and RAM. Numeric
sumchecks are carried out, where the same number of calculations apply at different thread counts in each section.
Below are results for both 64 bit and 32 bit compilations, where sumchecks were identical. Performance at 64 bits can
be seen to be faster than at 32 bits, with the best case nearly twice as fast.
Next, below, are results from 10 minute stress tests, showing measured GFLOPS and CPU temperatures for fanless
operation. CPU MHz variations were between 1500/1000/750 at 32 bits and 1500/1000 for all 64 bit tests, again
indicating improved thermal management.
64 Bits MFLOPS Numeric Results 32 Bits MFLOPS
Ops/ KB KB MB KB KB MB KB KB MB
Secs Thrd Word 12.8 128 12.8 12.8 128 12.8 12.8 128 12.8
Single Precision
0.9 T1 2 3845 4032 1232 40394 76395 99700 2134 2607 656
1.6 T2 2 7947 7992 1083 40394 76395 99700 5048 5156 621
2.3 T4 2 14295 14760 1145 40394 76395 99700 7536 9939 681
3.0 T8 2 13427 14985 1166 40394 76395 99700 7934 9839 639
4.9 T1 8 4665 4740 3200 54764 85092 99820 5535 5420 2569
6.0 T2 8 9334 9453 4143 54764 85092 99820 10757 10732 2454
6.9 T4 8 17902 18462 4693 54764 85092 99820 18108 20703 2444
7.7 T8 8 17473 18460 4570 54764 85092 99820 19236 20286 2245
13.0 T1 32 5827 5869 5861 35206 66015 99520 5309 5270 5262
15.6 T2 32 11712 11729 11524 35206 66015 99520 10551 10528 9753
17.2 T4 32 23149 22887 16343 35206 66015 99520 20120 20886 11064
18.7 T8 32 22202 23048 16411 35206 66015 99520 19415 20464 9929
Double Precision
1.8 T1 2 1802 1878 587 40395 76384 99700 921 998 326
3.4 T2 2 3716 3741 527 40395 76384 99700 1968 1995 308
4.8 T4 2 6814 7335 547 40395 76384 99700 3465 3925 342
6.1 T8 2 6633 7011 588 40395 76384 99700 3646 3702 301
9.2 T1 8 2738 2796 2014 54805 85108 99820 2377 2446 1283
11.4 T2 8 5598 5582 2114 54805 85108 99820 4916 4860 1326
13.0 T4 8 10545 11132 2196 54805 85108 99820 9202 9510 1391
14.7 T8 8 10693 10849 2149 54805 85108 99820 9090 9006 1298
24.1 T1 32 3280 3296 3279 35159 66065 99521 2695 2725 2707
28.8 T2 32 6583 6588 6430 35159 66065 99521 5416 5441 5121
31.6 T4 32 12785 13162 8477 35159 66065 99521 10666 10831 5275
34.4 T8 32 12718 12781 8816 35159 66065 99521 10427 10602 4832
Stress Tests Original 32 Bits ------------------ 64 Bits ------------------
8 Ops/word 8 Ops/word 32 Ops/Word 32 Ops/Word DP
Seconds °C GFLOPS °C GFLOPS °C GFLOPS °C GFLOPS
0 61 59 58 58
20 76 19.2 65 18.4 71 22.9 73 12.9
40 81 19.0 74 18.2 74 23.1 77 12.9
60 82 17.8 76 18.4 76 22.9 78 12.9
80 83 15.5 78 18.1 78 23.0 80 13.0
100 84 15.0 78 18.1 79 23.0 83 12.4
120 83 14.0 82 18.2 81 23.0 82 11.7
140 84 13.3 82 17.6 82 22.5 82 11.2
160 84 13.3 81 16.8 82 21.6 82 10.9
180 86 12.9 82 16.3 82 21.0 83 10.9
200 85 13.0 82 16.2 82 20.7 83 10.5
220 84 12.8 82 15.8 82 20.4 82 10.2
240 84 12.6 83 15.6 82 20.1 83 10.2
260 83 12.6 83 15.9 83 19.9 82 10.2
280 85 12.2 83 15.3 82 19.9 83 10.0
300 84 12.1 83 15.4 81 19.6 83 9.9
320 85 12.0 83 15.5 82 19.5 82 9.7
340 84 11.6 82 15.2 82 19.5 82 9.9
360 85 11.6 83 14.7 83 19.3 83 9.8
380 85 11.3 82 14.7 82 19.2 83 9.6
400 85 11.6 83 14.8 82 19.0 83 9.6
420 84 11.6 83 14.9 82 18.9 82 9.5
440 85 11.5 82 14.6 83 18.8 82 9.6
460 84 11.5 83 14.9 83 18.7 82 9.5
480 85 11.5 83 14.6 82 18.8 83 9.5
500 84 11.1 83 14.7 83 18.8 83 9.5
520 85 11.3 82 14.6 82 18.6 83 9.4
540 84 11.4 83 14.7 82 18.7 83 9.4
560 84 11.3 83 14.6 82 18.7 83 9.6
580 85 11.3 83 14.6 83 18.4 83 9.6
600 85 11.3 83 14.5 83 18.5 83 9.7
Average 83.9 12.9 81.2 15.9 81.1 20.2 81.9 10.5
Min/max 0.58 0.78 0.80 0.72
Integer Stress Tests below or Go To Start
Integer Stress Tests - MP-IntStress, MP-IntStress64g8, MP-IntStress64
This program has parameters for the number of threads, memory required and running time. The test loop comprises 32 add
or subtract instructions, operating on hexadecimal data patterns, with sequences of 8 subtracts then 8 adds to
restore the original pattern. Performance is measured in MBytes per second. Results show the varying hexadecimal
data patterns used and their comparison for verification, details not shown in the summary benchmarking mode logged below.
Here, it can be seen that 64 bit performance was much slower using the latest gcc 8 64 bit compilations. The earlier
2019 64 bit results confirm that the poor performance was due to a compiling issue.
Following the benchmark results are details from two stress tests, run without an operational fan. The first
represents one user demanding 7600000 KB (7.25 GB) of memory space. Performance throughout was effectively the
same as the memory speed indicated by the benchmark (1 thread, 16 MB), CPU MHz being constant, with little change in
temperatures. As shown by the vmstat details, some data was swapped out to make room for that of the application.
The second stress test involved 8 threads and cache based data, initially running at maximum CPU speed (for this
code). This time, there was CPU clock throttling, down to 1000 MHz, a CPU temperature rise up to 84°C and a 31%
decrease in measured MBytes per second.
Benchmark MBytes/second
------ 32 Bits ------ ------ 64 Bits ------ --- 2019 64 Bits ---
KB KB MB KB KB MB KB KB MB
Threads 16 160 16 16 160 16 16 160 16
1 5956 5754 3977 2878 2936 2602 5928 6786 3903
2 11861 11429 3763 5855 5817 3641 14468 13292 3772
4 22998 21799 3464 11403 11416 3564 27146 25103 3425
8 22695 21128 3490 10853 11297 3557 27576 24844 3432
16 22835 23491 3485 11069 11612 3548 27365 28511 3434
32 22593 23485 3591 10790 11646 3758 26377 28527 3455
Stress Test Start
Data Same All
Seconds Size Threads MB/sec Sumcheck Threads
20.0 7600000 KB 1 2606 00000000 Yes
57.8 7600000 KB 1 2604 FFFFFFFF Yes
91.0 7600000 KB 1 2575 5A5A5A5A Yes
129.5 7600000 KB 1 2608 AAAAAAAA Yes
vmstat 10 second samples
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 7433336 83140 266140 0 0 222 2 177 273 1 1 97 1 0
1 0 0 5535964 83152 268248 0 0 2 7 501 707 2 7 92 0 0
1 8 69888 63404 1048 106744 16 6943 515 6951 664 506 3 18 54 25 0
1 0 67072 63916 4548 123920 3 0 3 8 468 260 26 1 74 0 0
Later to end
1 0 95336 62748 4868 135672 4 0 4 6 475 274 26 1 73 0 0
--------- 7600000 KB 1 Thread --------- --------- 1280 KB 8 Threads ----------
Secs MB/sec MHz Volts °C CPU °C PMIC MB/sec MHz Volts °C CPU °C PMIC
0 1500 0.8500 59 55.2 1500 0.8600 57 54.3
20 2606 1500 0.8500 61 55.2 10902 1500 0.8600 70 55.2
40 2599 1500 0.8500 63 55.2 10267 1500 0.8600 73 58.0
60 2604 1500 0.8500 63 56.2 10150 1500 0.8600 75 59.0
80 2575 1500 0.8500 65 57.1 11046 1500 0.8600 79 61.8
100 2566 1500 0.8500 65 57.1 11039 1500 0.8600 80 62.8
120 2605 1500 0.8500 66 58.0 10503 1000 0.8600 81 64.6
140 2608 1500 0.8500 66 58.0 8780 1500 0.8600 82 65.6
160 2583 1500 0.8500 67 59.0 8501 1500 0.8600 82 66.5
180 2605 1500 0.8500 66 59.0 8704 1500 0.8600 83 66.5
200 2604 1500 0.8500 66 59.0 8507 1500 0.8600 83 66.5
220 2608 1500 0.8500 67 59.0 8829 1000 0.8600 83 67.5
240 2608 1500 0.8500 68 59.0 8749 1000 0.8600 82 67.5
260 2605 1500 0.8500 68 59.0 8542 1500 0.8600 83 68.4
280 2573 1500 0.8500 67 59.0 8500 1000 0.8600 82 67.5
300 2601 1500 0.8500 68 59.0 8434 1000 0.8600 83 68.4
320 2607 1500 0.8500 68 59.0 8360 1500 0.8600 83 68.4
340 2605 1500 0.8500 68 59.0 8302 1000 0.8600 83 68.4
360 2575 1500 0.8500 67 59.0 8179 1000 0.8600 82 68.4
380 2608 1500 0.8500 68 59.0 8102 1000 0.8600 84 68.4
400 2584 1500 0.8500 68 59.0 8215 1500 0.8600 84 68.4
420 2575 1500 0.8500 68 59.0 8070 1000 0.8600 82 69.4
440 2574 1500 0.8500 66 59.0 8042 1500 0.8600 82 69.4
460 2608 1500 0.8500 67 59.0 7945 1500 0.8600 82 69.4
480 2581 1500 0.8500 68 59.0 8100 1000 0.8600 84 69.4
500 2583 1500 0.8500 67 59.0 8024 1000 0.8600 84 69.4
520 2609 1500 0.8500 69 59.0 7933 1000 0.8600 82 69.4
540 2602 1500 0.8500 67 60.9 7813 1000 0.8600 84 69.4
560 2606 1500 0.8500 68 59.0 7988 1500 0.8600 83 69.4
580 2606 1500 0.8500 69 60.9 7882 1000 0.8600 83 69.4
600 2704 1500 0.8500 69 60.9 7597 1500 0.8600 83 69.4
64 GB SD Card below or Go To Start
64 GB SD Card - DriveSpeed64v2g8, LanSpeed64g8, DriveSpeed264WRg8,
DriveSpeed264Rd2g8
My initial 64 bit Raspberry Pi OS was installed on a 16 GB SD card, later cloned (by Windows Win32DiskImager) to one
with 32 GB capacity. It soon became apparent that this was too small to handle extra large files on the main drive, so
I bought a 64 GB higher speed version, which, surprisingly, resized its free space after booting. I then ran some tests
to see how much of this could be used.
The first exercise was to compare performance of the 64 GB and 32 GB SanDisk cards, using a USB 3 card reader, via
DriveSpeed Direct I/O. The former has maximum ratings of 160 MB/second reading and 60 MB/second writing, the latter
a read rating of only 98. For the large file tests, handling near 6 GB (3 x 2000 MB), reading speeds were similar,
with the 64 GB card being much faster on writing. Random access and small file performance were also similar.
Next, up to nearly 6 GB of file space was used running LanSpeed, the same program as DriveSpeed, writing and reading
using a 1 MB data array in RAM, with caching allowed but effectively negated on handling such large files. Data from
the random and small size file tests was cached and can be ignored. Output from vmstat, with 10 second sampling,
indicates that most of the memory was used then released, repeating the activity for the second three files. As
observed in other tests, it seems that writing of cached data is deferred, overlapped with reading.
Finally, an example of results from separate write/read and read only benchmarks, with caching enabled, is provided
below. This just deals with large files, where up to three can be selected. In this case, one file of near 40 GB was
written. The read only test loads the data into an array in RAM, where the maximum size appears to be around 6 GB.
When dealing with smaller files, the system should be rebooted before reading, so that the data is no longer cached.
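Where a reboot is inconvenient, a similar effect can usually be obtained by dropping the kernel page cache before the read only test. A sketch using the standard Linux drop_caches interface (requires root; these commands are an assumed alternative, not part of the benchmark):

```shell
# Flush dirty pages to the drive, then drop the clean page cache,
# dentries and inodes, so the next read comes from the drive, not RAM.
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
```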
############################ USB 3 ############################
64 GB Total MB 59639, Free MB 48318, Used MB 11321
32 GB Total MB 29643, Free MB 19707, Used MB 9936
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
64 GB 2000 58.77 59.24 59.10 68.68 69.18 68.84
32 GB 2000 21.23 21.14 21.16 70.22 70.27 70.33
########################## Main Drive #########################
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
64 GB 4000 54.53 38.01 38.91 32.91 45.90 45.88
64 GB 8000 43.16 36.73 36.63 38.34 45.90 45.91
-----------memory---------- ---swap-- -----io----
swpd free buff cache si so bi bo
Start/Write
0 6430000 1024660 317064 0 0 270 3
0 4232720 1024696 2511212 0 0 0 27790
0 3138388 1024744 3605256 0 0 0 37690
Write/Read
512 258740 427420 7089616 0 0 8336 30214
512 67632 400000 7309488 0 0 24475 14101
512 61368 340176 7376464 0 0 44800 0
Delete/Read/Write
512 56868 121324 7600856 0 0 44817 0
512 5605880 115092 2057148 0 0 18298 17233
512 4472096 115140 3191272 0 0 0 36872
Write/Read
512 267968 17524 7492716 0 0 3 33253
512 75996 17596 7684276 0 0 8107 31443
512 63056 17652 7698440 0 0 44817 0
End 512 7521128 18700 238356 0 0 37260 0
#################### Main Drive Near 40 GB ####################
Before Total MB 59639, Free MB 48324, Used MB 11315
After Total MB 59639, Free MB 8325, Used MB 51314
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
40000 36.65 45.89
Read only
6000 N/A N/A N/A 45.74
-----------memory---------- ---swap-- -----io----
swpd free buff cache si so bi bo
Example write
256 270432 33192 7473084 0 0 1 36069
Example read
256 62384 31332 7681720 0 0 44809 0
Example read only after reboot
256 272032 25052 3041320 0 0 44812 0
System Stress Tests below or Go To Start
System Stress Tests
These stress tests were run twice, once with a cooling fan in use and then with the fan disabled. The following script
file was run to open six terminals to execute my CPU MHz, Voltage and Temperature Measurement program and vmstat
system monitor, whilst running my Livermore Loops, MP Integer RAM Exerciser, BurninDrive and OpenGL benchmarks, in
stress testing mode, with nominal running time of 15 minutes.
On running these, as indicated by the environmental monitor, the system ran at much higher temperatures with no fan
in use, but with no indication of CPU MHz throttling in the periodic instantaneous measurement samples. Vmstat
recordings were virtually the same with and without cooling, starting with MP-IntStress64g8 grabbing nearly 6 GB of
RAM, with continuing CPU utilisation of around 82% (three cores at 100%, one at 28%) and, after a short write phase,
the main drive being read at 30 MB/second.
A variation of the Livermore Loops Benchmark has options to change the running time of each of the 72 floating
point kernel executions (24 loops x 3 passes), to control overall running time for stress testing purposes. Results
are also checked for correctness, and log numbers can be assigned to enable multiple copies to be run.
######################## Script File ########################
lxterminal -e ./RPiHeatMHzVolts2 Passes 16 Seconds 60 Log 31 &
lxterminal -e ./liverloopsPi64Rg8 Seconds 12 Log 31 &
lxterminal -e ./MP-IntStress64g8 Threads 1 KB 6000000 Mins 15 Log 31 &
lxterminal -e ./burnindrive264g8 Repeats 16, Minutes 12, Log 31, Seconds 1 &
export vblank_mode=0 &
lxterminal -e ./videogl64g9 Test 6 Mins 15 Log 31 &
vmstat 60 16 > vmstat31.txt
############## With Cooling ############# ############### No Cooling ##############
================== CPU MHz CPU Voltage and Temperature Measurement =================
Secs Start at Wed Jun 10 12:56:49 2020 Secs Start at Wed Jun 10 13:19:58 2020
0 ARM MHz=1500 0.85V CPU=39'C pmic=34'C 0 ARM MHz=1500 0.85V CPU=40'C pmic=35'C
60 ARM MHz=1500 0.85V CPU=47'C pmic=39'C 60 ARM MHz=1500 0.85V CPU=58'C pmic=46'C
120 ARM MHz=1500 0.85V CPU=50'C pmic=41'C 120 ARM MHz=1500 0.85V CPU=65'C pmic=53'C
180 ARM MHz=1500 0.85V CPU=50'C pmic=42'C 180 ARM MHz=1500 0.85V CPU=68'C pmic=55'C
241 ARM MHz=1500 0.85V CPU=49'C pmic=41'C 241 ARM MHz=1500 0.85V CPU=71'C pmic=59'C
301 ARM MHz=1500 0.85V CPU=51'C pmic=42'C 301 ARM MHz=1500 0.85V CPU=74'C pmic=60'C
362 ARM MHz=1500 0.85V CPU=52'C pmic=42'C 362 ARM MHz=1500 0.85V CPU=76'C pmic=62'C
422 ARM MHz=1500 0.85V CPU=52'C pmic=42'C 422 ARM MHz=1500 0.85V CPU=76'C pmic=62'C
483 ARM MHz=1500 0.85V CPU=51'C pmic=42'C 482 ARM MHz=1500 0.85V CPU=76'C pmic=62'C
543 ARM MHz=1500 0.85V CPU=51'C pmic=41'C 543 ARM MHz=1500 0.85V CPU=77'C pmic=64'C
604 ARM MHz=1500 0.85V CPU=52'C pmic=42'C 603 ARM MHz=1500 0.85V CPU=78'C pmic=65'C
664 ARM MHz=1500 0.85V CPU=51'C pmic=42'C 664 ARM MHz=1500 0.85V CPU=81'C pmic=66'C
725 ARM MHz=1500 0.85V CPU=51'C pmic=42'C 724 ARM MHz=1500 0.85V CPU=80'C pmic=67'C
785 ARM MHz=1500 0.85V CPU=52'C pmic=42'C 785 ARM MHz=1500 0.85V CPU=81'C pmic=67'C
846 ARM MHz=1500 0.85V CPU=51'C pmic=42'C 845 ARM MHz=1500 0.85V CPU=76'C pmic=66'C
906 ARM MHz=1500 0.85V CPU=46'C pmic=42'C 905 ARM MHz=1500 0.85V CPU=73'C pmic=65'C
966 ARM MHz=1500 0.85V CPU=40'C pmic=37'C 966 ARM MHz=1500 0.85V CPU=65'C pmic=60'C
End at Wed Jun 10 13:12:56 2020 End at Wed Jun 10 13:36:04 2020
============================== vmstat 60 second samples =============================
Memory MB/sec Swap MB/sec %utilise Memory MB/sec Swap MB/sec %utilise
swpd free buf cach si so bi bo us sy id wa swpd free buf cach si so bi bo us sy id wa
0 7231 45 486 0 0 1 0 14 2 81 3 0 7231 45 486 0 0 1 0 14 2 81 3
0 1147 45 533 0 0 11 11 71 11 1 17 0 1147 45 533 0 0 11 11 71 11 1 17
0 1145 45 535 0 0 29 0 76 8 1 16 0 1145 45 535 0 0 29 0 76 8 1 16
0 1142 45 538 0 0 30 0 75 8 1 17 0 1142 45 538 0 0 30 0 75 8 1 17
0 1142 45 536 0 0 30 0 75 7 1 17 0 1142 45 536 0 0 30 0 75 7 1 17
0 1143 45 536 0 0 30 0 75 7 1 17 0 1143 45 536 0 0 30 0 75 7 1 17
0 1141 45 539 0 0 30 0 75 7 1 17 0 1141 45 539 0 0 30 0 75 7 1 17
0 1141 45 538 0 0 30 0 75 8 1 16 0 1141 45 538 0 0 30 0 75 8 1 16
0 1138 45 541 0 0 30 0 75 7 1 17 0 1138 45 541 0 0 30 0 75 7 1 17
0 1141 45 536 0 0 30 0 76 7 0 17 0 1141 45 536 0 0 30 0 76 7 0 17
0 1139 45 540 0 0 30 0 75 7 1 16 0 1139 45 540 0 0 30 0 75 7 1 16
0 1140 46 539 0 0 30 0 74 7 2 17 0 1140 46 539 0 0 30 0 74 7 2 17
0 1143 46 536 0 0 30 0 75 7 2 17 0 1143 46 536 0 0 30 0 75 7 2 17
0 1139 46 537 0 0 30 0 75 7 1 16 0 1139 46 537 0 0 30 0 75 7 1 16
0 1143 46 537 0 0 31 0 61 7 13 18 0 1143 46 537 0 0 31 0 61 7 13 18
0 1142 46 537 0 0 31 0 52 7 21 20 0 1142 46 537 0 0 31 0 52 7 21 20
======= Livermore Loops 64 Bit Reliability test 12 seconds each loop x 24 x 3 =======
Wed Jun 10 12:56:49 2020 Wed Jun 10 13:19:58 2020
Numeric results were as expected Numeric results were as expected
MFLOPS for 24 loops MFLOPS for 24 loops
2061.5 944.0 950.8 946.9 362.4 646.6 1498.8 991.4 920.0 733.5 370.3 561.1
2073.5 2695.3 1403.8 547.2 493.9 959.9 2202.2 2453.3 1991.9 711.4 473.4 676.4
206.5 362.3 794.9 634.4 721.9 1143.2 178.3 349.0 766.6 601.3 641.1 1007.9
411.8 367.7 1469.5 389.4 739.6 306.1 435.3 376.9 1530.5 365.2 801.5 309.5
Maximum Average Geomean Harmean Minimum Maximum Average Geomean Harmean Minimum
2698.1 912.3 737.2 602.3 187.7 2654.4 924.2 742.1 597.9 158.9
End of test Wed Jun 10 13:11:53 2020 End of test Wed Jun 10 13:33:21 2020
Other Stress Testing Programs used are below or Go To Start
Other Stress Testing Programs - included in above
The OpenGL Benchmark has options to select the window size (default full screen), one of the six test procedures,
running time and log number. As shown below, FPS performance was virtually the same with and without a fan being in
use.
BurnInDrive uses 64 KB block sizes, with 164 variations of data patterns, where a parameter controls file size, in this
case 16 blocks for 164 MB files. Four of these files are written, then read by random selection for a specified time.
Finally, blocks are read continuously for a specified number of seconds (see more information here). Again, there was
no real difference with and without cooling. Measured performance, at 33 x 4 x 164 MB in 12.32 minutes, was 29.3
MB/second, of the same order as that measured by vmstat.
############## With Cooling ############# ############### No Cooling ##############
============ OpenGL Reliability Test - Display 1920x1080 for 15 minutes ===========
Wed Jun 10 12:56:49 2020 Wed Jun 10 13:19:58 2020
Test 6 Tiled Kitchen 30 seconds 21 FPS Test 6 Tiled Kitchen 30 seconds 21 FPS
Test 6 Tiled Kitchen 30 seconds 21 FPS Test 6 Tiled Kitchen 30 seconds 21 FPS
Test 6 Tiled Kitchen 30 seconds 20 FPS Test 6 Tiled Kitchen 30 seconds 21 FPS
Test 6 Tiled Kitchen 30 seconds 18 FPS Test 6 Tiled Kitchen 30 seconds 20 FPS
Test 6 Tiled Kitchen 30 seconds 19 FPS Test 6 Tiled Kitchen 30 seconds 20 FPS
Test 6 Tiled Kitchen 30 seconds 19 FPS Test 6 Tiled Kitchen 30 seconds 21 FPS
Test 6 Tiled Kitchen 30 seconds 20 FPS Test 6 Tiled Kitchen 30 seconds 20 FPS
Test 6 Tiled Kitchen 30 seconds 20 FPS Test 6 Tiled Kitchen 30 seconds 21 FPS
Test 6 Tiled Kitchen 30 seconds 20 FPS Test 6 Tiled Kitchen 30 seconds 20 FPS
Test 6 Tiled Kitchen 30 seconds 20 FPS Test 6 Tiled Kitchen 30 seconds 19 FPS
Test 6 Tiled Kitchen 30 seconds 21 FPS Test 6 Tiled Kitchen 30 seconds 21 FPS
Test 6 Tiled Kitchen 30 seconds 21 FPS Test 6 Tiled Kitchen 30 seconds 22 FPS
Test 6 Tiled Kitchen 30 seconds 21 FPS Test 6 Tiled Kitchen 30 seconds 19 FPS
Test 6 Tiled Kitchen 30 seconds 20 FPS Test 6 Tiled Kitchen 30 seconds 20 FPS
Test 6 Tiled Kitchen 30 seconds 20 FPS Test 6 Tiled Kitchen 30 seconds 22 FPS
Test 6 Tiled Kitchen 30 seconds 20 FPS Test 6 Tiled Kitchen 30 seconds 21 FPS
Test 6 Tiled Kitchen 30 seconds 21 FPS Test 6 Tiled Kitchen 30 seconds 20 FPS
Test 6 Tiled Kitchen 30 seconds 21 FPS Test 6 Tiled Kitchen 30 seconds 21 FPS
Test 6 Tiled Kitchen 30 seconds 21 FPS Test 6 Tiled Kitchen 30 seconds 21 FPS
Test 6 Tiled Kitchen 30 seconds 21 FPS Test 6 Tiled Kitchen 30 seconds 18 FPS
Test 6 Tiled Kitchen 30 seconds 21 FPS Test 6 Tiled Kitchen 30 seconds 20 FPS
Test 6 Tiled Kitchen 30 seconds 12 FPS Test 6 Tiled Kitchen 30 seconds 21 FPS
Test 6 Tiled Kitchen 30 seconds 20 FPS Test 6 Tiled Kitchen 30 seconds 21 FPS
Test 6 Tiled Kitchen 30 seconds 20 FPS Test 6 Tiled Kitchen 30 seconds 21 FPS
Test 6 Tiled Kitchen 30 seconds 21 FPS Test 6 Tiled Kitchen 30 seconds 21 FPS
Test 6 Tiled Kitchen 30 seconds 21 FPS Test 6 Tiled Kitchen 30 seconds 21 FPS
Test 6 Tiled Kitchen 30 seconds 20 FPS Test 6 Tiled Kitchen 30 seconds 21 FPS
Test 6 Tiled Kitchen 30 seconds 21 FPS Test 6 Tiled Kitchen 30 seconds 21 FPS
Test 6 Tiled Kitchen 30 seconds 21 FPS Test 6 Tiled Kitchen 30 seconds 21 FPS
Test 6 Tiled Kitchen 30 seconds 20 FPS Test 6 Tiled Kitchen 30 seconds 21 FPS
End at Wed Jun 10 13:11:51 2020 End at Wed Jun 10 13:35:01 2020
======================== burnindrive264g8 Pi 4 Main Drive =========================
Current Path: /home/pi/0test/morestress Total MB 59639 Free MB 20353, Used MB 39286
Wed Jun 10 12:56:49 2020 Wed Jun 10 13:19:58 2020
File 1 164 MB written 9.19 seconds File 1 164 MB written in 9.15 seconds
File 2 164 MB written 9.05 seconds File 2 164 MB written in 8.94 seconds
File 3 164 MB written 9.63 seconds File 3 164 MB written in 9.67 seconds
File 4 164 MB written 8.91 seconds File 4 164 MB written in 8.97 seconds
Total 36.78 seconds Total 36.74 seconds
Start Reading Wed Jun 10 12:57:26 2020 Start Reading Wed Jun 10 13:20:35 2020
Passes 1 x 4 Files x 164 MB 0.38 minutes Passes 1 x 4 Files x 164 MB 0.38 minutes
Passes 2 x 4 Files x 164 MB 0.76 minutes Passes 2 x 4 Files x 164 MB 0.75 minutes
Passes 3 x 4 Files x 164 MB 1.13 minutes Passes 3 x 4 Files x 164 MB 1.14 minutes
To
Passes 31 x 4 Files x 164 MB 11.58 minutes Passes 31 x 4 Files x 164 MB 11.56 minutes
Passes 32 x 4 Files x 164 MB 11.95 minutes Passes 32 x 4 Files x 164 MB 11.93 minutes
Passes 33 x 4 Files x 164 MB 12.32 minutes Passes 33 x 4 Files x 164 MB 12.31 minutes
Start Repeat Read Wed Jun 10 13:09:45 2020 Start Repeat Read Wed Jun 10 13:32:53 2020
Passes in 1 second for 164 blocks of 64KB Passes in 1 second for 164 blocks of 64KB
460 500 540 540 520 440 420 480 540 520 560 560 480 440 440 500 540 540 520 440
520 440 440 460 540 540 520 460 440 440 440 440 540 540 540 420 420 420 520 540
540 540 540 440 440 420 540 540 540 440 540 480 440 460 500 540 540 480 440 460
To To
580 580 580 580 580 580 580 580 580 580 580 580 580 580 580 580 580 580 580 580
580 580 580 580 580 580 580 580 580 580 580 580 580 580 580 580 580 580 580 580
580 580 580 580 580 580 580 580 580 580 580 580 580 580 580 580 580 580 580 580
83300 Passes of 64KB blocks 2.78 minutes 83900 Passes of 64KB blocks 2.78 minutes
No errors found during reading tests No errors found during reading tests
End of test Wed Jun 10 13:12:32 2020 End of test Wed Jun 10 13:35:40 2020
Power Over Ethernet below or Go To Start
Power Over Ethernet (PoE)
I recently carried out tests of Raspberry Pi 4 systems using power supplied over LAN cables. My report is at
ResearchGate in Benchmarking Raspberry Pi 4 Running From Power Over Ethernet.pdf. This covers using long, short,
thick and thin cables, measuring data transmission speeds and the ability to run using my most power consuming
benchmarks, particularly with the only wire connected to the Pi being the Ethernet cable. Screenshots of remote
control via Windows, Linux and Android are provided. PoE requires additional hardware that injects high voltage power
on to the cable and, at the other end, converts it to that normally used by the destination device. For Raspberry Pi,
there is a PoE HAT, with a fan, for this purpose, or separate fanless connectors can be obtained.
A few simple tests were run on the configuration being considered here, simply to verify that the facility was
operational. In this case, 48 metres of CAT 6 cable was used with a fanless connector (the 8 GB Pi was fitted with an
inexpensive fan). A hard disk and a USB flash drive were plugged into USB 3 sockets, but not in use. The tests were
executed via remote control terminals, using PuTTY on a Windows 7 based PC. After the first one, the only wire
plugged in to the Pi was the power connector from the PoE converter, with communication via WiFi. Results below
were all copied from the Windows PuTTY displays.
The first tests were run using the LAN Benchmark, with only large file results shown. The Ethernet performance was at
the same 1 Gbps speeds identified earlier. WiFi was from a greater distance, apparently mainly at 2.4 GHz speeds.
The other example is from running a Floating Point Stress Test, for 10 minutes, with 8 threads running at the same
near 24 GFLOPS continuously. The vmstat report indicates 8 processes in use and 100% CPU utilisation (of 4 cores)
over the whole period. With the fan in use, temperature increases were insignificant. Core voltage did not change
between idle and full speed operation.
################ Data Transmission Speeds ################
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
Ethernet
512 80.81 81.27 83.18 112.53 111.69 112.38
1024 93.91 91.64 88.02 112.68 112.64 112.68
WiFi
50 7.28 8.55 8.15 5.51 6.10 6.37
100 5.95 7.97 7.14 6.58 5.26 6.75
############ High Power Demanding CPU Stress Test ###########
Data Ops/ Numeric
Seconds Size Threads Word MFLOPS Results Passes
9.3 1280 KB 8 32 23435 50160 19677
18.2 1280 KB 8 32 23274 50160 19677
27.0 1280 KB 8 32 23375 50160 19677
35.8 1280 KB 8 32 23374 50160 19677
44.7 1280 KB 8 32 23357 50160 19677
To
566.3 1280 KB 8 32 23396 50160 19677
575.1 1280 KB 8 32 23406 50160 19677
583.9 1280 KB 8 32 23424 50160 19677
592.7 1280 KB 8 32 23359 50160 19677
601.7 1280 KB 8 32 23145 50160 19677
############################# vmstat Activity Monitor #############################
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
8 0 0 7723004 15720 148020 0 0 11 5 975 421 91 0 9 0 0
8 0 0 7722396 15780 148140 0 0 0 4 1048 428 100 0 0 0 0
8 0 0 7725468 15844 148044 0 0 0 5 1059 447 100 0 0 0 0
8 0 0 7725720 15892 148052 0 0 0 3 1052 431 100 0 0 0 0
8 0 0 7725404 15948 148072 0 0 0 3 1051 432 100 0 0 0 0
8 0 0 7725368 16004 148072 0 0 0 3 1040 413 100 0 0 0 0
8 0 0 7725984 16060 148076 0 0 0 4 1050 431 100 0 0 0 0
9 0 0 7725908 16116 148076 0 0 0 3 1040 409 100 0 0 0 0
8 0 0 7725656 16164 148084 0 0 0 3 1044 415 100 0 0 0 0
8 0 0 7725372 16220 148092 0 0 0 4 1067 437 100 0 0 0 0
##################### CPU MHz, Voltage and Temperatures ####################
Seconds
0.0 ARM MHz= 600, core volt=0.8500V, CPU temp=34.0'C, pmic temp=33.5'C
60.0 ARM MHz=1500, core volt=0.8500V, CPU temp=51.0'C, pmic temp=38.2'C
120.4 ARM MHz=1500, core volt=0.8500V, CPU temp=52.0'C, pmic temp=40.1'C
180.8 ARM MHz=1500, core volt=0.8500V, CPU temp=52.0'C, pmic temp=40.1'C
241.3 ARM MHz=1500, core volt=0.8500V, CPU temp=53.0'C, pmic temp=41.1'C
301.7 ARM MHz=1500, core volt=0.8500V, CPU temp=53.0'C, pmic temp=41.1'C
362.1 ARM MHz=1500, core volt=0.8500V, CPU temp=53.0'C, pmic temp=41.1'C
422.4 ARM MHz=1500, core volt=0.8500V, CPU temp=54.0'C, pmic temp=41.1'C
482.8 ARM MHz=1500, core volt=0.8500V, CPU temp=53.0'C, pmic temp=41.1'C
543.2 ARM MHz=1500, core volt=0.8500V, CPU temp=54.0'C, pmic temp=41.1'C
603.6 ARM MHz=1500, core volt=0.8500V, CPU temp=53.0'C, pmic temp=41.1'C
CPU Performance Throttling Effects below or Go To Start
CPU Performance Throttling Effects
Another of my reports covered Raspberry Pi 4 CPU MHz Throttling Performance Effects. This was demonstrated by
forcing the CPU clock speed to run continuously at 600 MHz, by setting the frequency scaling governor to powersave
mode.
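Setting the governor can be done through the standard Linux cpufreq sysfs interface. A sketch of the commands (paths as on Raspberry Pi OS; requires root; the vcgencmd check is an assumed way to confirm the effect):

```shell
# Select the powersave governor on every core, pinning the ARM clock
# at its minimum (600 MHz on a Pi 4); select ondemand to restore scaling.
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo powersave | sudo tee "$g" > /dev/null
done
vcgencmd measure_clock arm   # confirm the clock is held near 600 MHz
```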
This exercise involved using BBC iPlayer for two and a half hours, connected to a TV with a 1920 x 1080 display, using
WiFi communication and the CPU at 600 MHz. A drama programme was watched for two hours, with no apparent
buffering and, in my opinion, a perfectly good display, the activity report indicating 960 x 540 size at 1700 kbps. A
second programme, a wildlife documentary, did produce the occasional short delay, with buffering, reporting the same
size but down to 923 kbps. The tests were run without an active cooling fan.
Following are vmstat details, showing CPU utilisation of around 47%, indicating two CPU cores in use at 100% for
most of the time, then the environment monitor readings, showing constant MHz and voltage without significant rises
in temperatures.
vmstat
-----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
swpd free buff cache si so bi bo in cs us sy id wa st
Early 0 6475260 109296 736232 0 0 0 242 2795 3640 40 7 52 0 0
End 0 6467036 111324 740656 0 0 0 248 2867 3752 39 7 54 0 0
RPiHeatMHzVolts2 Program - Room at 27°C
Hot start ARM MHz= 600, core volt=0.8500V, CPU temp=69.0'C, pmic temp=62.8'C
Later ARM MHz= 600, core volt=0.8500V, CPU temp=70.0'C, pmic temp=64.6'C
Near End ARM MHz= 600, core volt=0.8500V, CPU temp=72.0'C, pmic temp=66.5'C
Go To Start