All content in this area was uploaded by Roy Longbottom on Sep 01, 2017
Roy Longbottom's Raspberry Pi, Pi 2 and Pi 3 Benchmarks
Contents
General Raspberry Pi Systems 64 Bit SUSE & Gentoo
Standards/Configuration Details Whetstone Benchmark Dhrystone 2 Benchmark
Linpack Benchmark Livermore Loops Benchmark Memory Speed Benchmark
Bus Speed Benchmark FFT Benchmarks NEON Benchmarks
Linpack NEON Benchmarks NEON Float & Integer Benchmark NEON MemSpeed Benchmark
Maximum 1 Core MFLOPS MultiThreading Benchmarks MP-MFLOPS
MP-Whetstone MP-Dhrystone MP-BusSpeed
MP-RandMem OpenMP-MFLOPS OpenMP-MemSpeed
NEON MP Benchmarks MP-NeonMFLOPS linpackNeonMP
Java Benchmarks Java Whetstone Benchmarks JavaDraw Benchmark
OpenGL ES Benchmark OpenGL GLUT Benchmark DriveSpeed Benchmark
LAN/WiFi Benchmark 64 Bit Drive & LAN Benchmarks Temperature & MHz Recorder
Reliability Tests 64 Bit Reliability Tests Performance Monitor
Assembly Code SUSE RPi3 64 Bit Stress Tests
Note
Considering the historical significance of some of the benchmarks and performance data, my web site was accepted for
archiving by the British Library. A number of instances were archived between 2011 and 2013 (and later, after selecting
one of these instances) - see Roy Longbottom's PC Benchmark Collection Archive. This document was converted by
Winnovative Free HTML to PDF Converter for inclusion in my ResearchGate material. A number of links are included to
various html documents that are now in the archive. Some of these will be converted into further PDF files for
ResearchGate, possibly with more recent information. Also note that internal links such as “To Start” might not work.
General
Roy Longbottom’s PC Benchmark Collection comprises numerous FREE benchmarks and reliability testing programs, for
processors, caches, memory, buses, disks, flash drives, graphics, local area networks and Internet. The original ones ran
under DOS and later versions under all varieties of Windows. Most have also been converted to run under Linux on PCs
and many via Android on tablets and phones. Some of the Linux variety C/C++ source code was changed slightly to
compile for execution on the Raspberry Pi. All will be available in ResearchGate project files.
After reading that compilation time on the Raspberry Pi was painfully slow, the programs were compiled on a Linux
Ubuntu 12.04 based PC via the Raspbian Toolchain, using instructions downloaded from www.xappsoftware.com. This allows
programs to be compiled from a Terminal window. Using this, the C/C++ code can first be compiled to run on the Linux
driven PC, then transferred to the Raspberry Pi via LAN or a USB flash drive. To execute a program after transferring,
a change under Properties, Permissions is needed to make it executable. One complication is that setting the path to the
cross compiler did not work as suggested by xappsoftware. Below are examples of commands used for the two executable
files - note the path for gcc:
cc whets.c cpuidc.c -lm -O3 -o whetstoneIL
~/toolchain/raspbian-toolchain-gcc-4.7.2-linux32/bin/arm-linux-gnueabihf-gcc whets.c cpuidc.c -lm -O3 -march=armv6 -mfloat-abi=hard -mfpu=vfp -o whetstonePiA6
Command to execute - ./whetstonePiA6
The last three parameters (-march to -mfpu) made no difference to performance, but others are likely to be needed to
take advantage of later ARM floating point functions. Note, the first four benchmark programs were compiled later on
the Raspberry Pi itself. Both the above cc and gcc (with no Toolchain path) commands were used for compilation.
These and the PC based files all produced the same numeric results and mainly the same performance. Compilation time
was acceptable at between 8 and 36 seconds.
The benchmarks and source codes can be downloaded in Raspberry_Pi_Benchmarks.zip. This includes the executables
compiled, as above, to run on Intel CPUs via Linux and the versions compiled on the Raspberry Pi. To download the
benchmarks, click on the Raspberry_Pi_Benchmarks.zip link, select Save to download to Home (assume /home/pi). Open
File Manager and right click on zip file and select Extract here.
To enable execution of the programs, a security setting is required. Double click on Raspberry_Pi_Benchmarks folder to
open, right click on each executable (dhrystonePiA6, linpackPiA6, linpackPiSP, liverloopsPiA6, memspeedPiA6,
whetstonePiA6), select Properties, Permissions, tick Make the file executable. The new program titles mainly end in
PiA7.
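The same permission change can be made without the File Manager dialog. The following is a minimal sketch in Python, assuming a Linux file system; the helper name make_executable is hypothetical and the demonstration uses an empty stand-in file, not a real benchmark binary. It is roughly what chmod a+x does from the Terminal.

```python
import os
import stat
import tempfile

def make_executable(path):
    """Add execute permission for user, group and others; return the new mode."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return os.stat(path).st_mode

# Demonstration on a throwaway file rather than a real benchmark executable.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "whetstonePiA6")   # empty stand-in file
    open(demo, "w").close()
    os.chmod(demo, 0o644)                       # start as read/write only
    new_mode = make_executable(demo)
    print(oct(stat.S_IMODE(new_mode)))          # 0o755 on Linux
```

For the real files, the function would be called once per executable name listed above, from inside the Raspberry_Pi_Benchmarks directory.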
To run, open LX Terminal, type cd Raspberry_Pi_Benchmarks to enter the directory, type ls to ensure the path is
correct and to list files, then execute for example using ./dhrystonePiA6. Information will be displayed as the
benchmarks are running and results will be saved in log files, example Dhry.txt.
To Start
Raspberry Pi System
For those who do not know, the Raspberry Pi has a 3.5 x 2.5 inch motherboard, in this case, containing a 700 MHz ARM
1176JZF v6 single core CPU and 512 MB RAM. External connectors include two full size USB sockets with others for a
full size HDMI plug, a micro USB socket for power, an RJ45 Ethernet port and a slot for an SD card, used as the main
drive.
The operating system is Raspbian, based on Linux Debian, in this case Wheezy-Raspbian. This can be obtained
pre-loaded on an SD card or downloaded from raspberrypi.org and copied to an SD card to produce a bootable drive. I used
Image Writer for Microsoft Windows for this purpose.
In my case, booting time, from connecting power to desktop display, is 30 seconds. Using a simple command (see
below) produces a menu where CPU speed can be selected up to 1 GHz, also increasing memory bus speed.
Raspberry Pi 2 Model B has a 900 MHz quad core Broadcom BCM2836 ARM V7 CPU with 1 GB RAM and can be
overclocked to 1 GHz, using the configuration menu. L1 data cache size is 32 KB and L2 cache 512 KB, shared by all
cores. Existing benchmarks were run on the new computer along with additional programs, produced by a newer
compiler, to see if additional hardware features were used. The additional benchmarks were produced using gcc 4.8,
where a typical compile command is:
gcc whets.c cpuidc.c -lm -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard -o newA7
Raspberry Pi 3 Model B includes a quad core Broadcom BCM2837 system-on-chip running at 1200 MHz, each core
having a 32 KB L1 cache. There is a shared 512 KB L2 cache and 1 GB RAM. The CPU is an ARM Cortex-A53, capable of
64 bit working, but presently only supports 32 bit operation. Benchmark results are now included.
Performance of a Cortex-A53 based Android tablet is available, for the same benchmarks, at both 32 bit and 64 bit
working. These results are included below, to identify potential differences at 64 bits.
To Start
64 Bit OpenSUSE and Gentoo For Raspberry Pi 3
Up until late 2016, readily available operating systems were 32 bit versions. The first reference I have seen, for a
64 bit variety, was for OpenSUSE for Raspberry Pi 3. There are different distros available, one being for SUSE Linux
Enterprise Server (SLES). A number of free OpenSUSE downloads are for both Leap 42.2 and Tumbleweed versions.
Registration is required for SLES, with free use for at least a year. All downloads are raw.xz compressed files.
Converting the xz files to successfully bootable SD cards can be difficult. I had to extract the raw files on a PC using
Linux Ubuntu and copy them to the card via Windows, using Win32 Disk Imager. I managed to produce working systems
for Leap 42.2 but not Tumbleweed.
I installed GCC-6. That produced what appeared to be good 64 bit code (from disassembly), but performance was
variable. This was due to a default “on demand” boot setting that produced variable CPU MHz. In order to understand
the implications of this, I compiled and ran some MP tests, details of which are in Appendix 1 in lieu of SUSE Rpi3 Stress
Tests.htm. Some benchmarks were also compiled by gcc 4.8, using 64 bit SLES, to explore performance differences
between 32 bit and 64 bit working.
The benchmarks and source codes are being included in Rpi3-64-Bit-Benchmarks.tar.gz. The source codes include the
compile and link commands used, an example being below.
Example Compile Command
gcc-6 whets.c cpuidc.c -lm -lrt -O3 -march=armv8-a -o whetstonePi64
ARM options, such as -mcpu and other CPUs for -march, were not available
Linux Gentoo - Details of a bootable 64-bit Gentoo image for the Raspberry Pi 3 became available in February 2017.
Details and downloads are available from that Rpi3-64-Bit-Benchmarks.tar.gz file.
The bootable SD card was created as for OpenSUSE above. The OpenSUSE produced benchmarks are being run via
Gentoo and, where appropriate, results included below. This time, although “on demand” CPU MHz was used,
benchmarks consistently ran at full speed, with lower MHz only being shown when the CPU was idle.
To Start
Standards/Configuration Details
All the benchmarks are run from Terminal commands and provide continuous displays of current activity. This was
included in original versions of the benchmarks when CPUs were really slow. They all produce a summary of results in a
.txt based log file and this includes system information, where the following example is for my particular system. Note
that this includes the meaningless BogoMIPS measurement that does not change when the processor is overclocked.
Raspberry Pi 2 has additional features such as neon, vfpv3 and vfpv4.
SUSE and Gentoo for Raspberry Pi 3 - CPU architecture: 8 identifies 64 bit working.
The programs accept keyboard input at the end, to include comments in the log, such as "overclocked at 1000 MHz".
The source code has expected numeric answers, selected for particular hardware. These are checked for correctness
and errors reported in the log. Running on a variation of the hardware could produce false error reports for floating
point calculations.
Also shown below is the command to select the menu with the overclocking option and commands to obtain CPU MHz
and these do not change when the CPU is overclocked.
SYSTEM INFORMATION
From File /proc/cpuinfo
Processor : ARMv6-compatible processor rev 7 (v6l)
BogoMIPS : 464.48 was #371 PREEMPT
BogoMIPS : 697.95 later #557 PREEMPT
Features : swp half thumb fastmult vfp edsp java tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xb76
CPU revision : 7
Hardware : BCM2708
Revision : 000d
Serial : 00000000db690cb4
From File /proc/version
Linux version 3.6.11+ (dc4@dc4-arm-01) (gcc version 4.7.2 20120731 (prerelease)
(crosstool-NG linaro-1.13.1+bzr2458 - Linaro GCC 2012.08) ) #371 PREEMPT
Thu Feb 7 16:31:35 GMT 2013
####################################################
Raspberry Pi 2
processor : 0, 1, 2 and 3
model name : ARMv7 Processor rev 5 (v7l)
BogoMIPS : 38.40
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva
idivt vfpd32 lpae evtstrm
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xc07
CPU revision : 5
Linux version 3.18.5-v7+ (dc4@dc4-XPS13-9333) (gcc version 4.8.3 20140303 (prerelease)
(crosstool-NG linaro-1.13.1+bzr2650 - Linaro GCC 2014.03) ) #225 SMP PREEMPT
Fri Jan 30 18:53:55 GMT 2015
####################################################
Raspberry Pi 3 - 32 Bit Mode
processor : 0, 1, 2 and 3
model name : ARMv7 Processor rev 4 (v7l)
BogoMIPS : 38.40
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva
idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4
Linux version 4.1.19-v7+ (dc4@dc4-XPS13-9333) (gcc version 4.9.3 (crosstool-NG
crosstool-ng-1.22.0-88-g8460611) ) #858 SMP Tue Mar 15 15:56:00 GMT 2016
####################################################
Raspberry Pi 3 - 64 Bit OpenSUSE and Gentoo
processor : 0, 1, 2 and 3
BogoMIPS : 38.40
Features : fp asimd evtstrm crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4
OpenSUSE
Linux version 4.4.36-8-default (geeko@buildhost) (gcc version 4.8.5 (SUSE Linux))
#1 SMP Fri Dec 9 16:18:38 UTC 2016 (3ec5648)
Gentoo
Linux version 4.10.0-rc5-v8 (sakaki@chiyo) (gcc version 5.4.0 (Gentoo 5.4.0-r2
p1.2, pie-0.6.5) ) #1 SMP PREEMPT Wed Jan 25 20:13:50 GMT 2017
####################################################
Commands to obtain CPU MHz See later for more details
vcgencmd measure_clock arm
frequency(45)=700074000
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
700000
With Raspbian and Gentoo, both identify full and standby clock frequencies
(RPi 3 1200 and 600 MHz), but the ARM function also provides measurements when the
clock speed is reduced due to high temperatures.
SUSE - does not support vcgencmd but following appears to identify MHz
cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq
Command for overclocking selection - not RPi 3
sudo raspi-config
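The two clock readings shown above use different units: vcgencmd reports Hz and the cpufreq files report kHz. A minimal sketch of converting both to MHz follows, using the sample outputs from this section as input strings. The function names are hypothetical and the parsing assumes the output formats are exactly as shown.

```python
def vcgencmd_to_mhz(line):
    """Convert a 'vcgencmd measure_clock arm' line such as
    'frequency(45)=700074000' (value in Hz) to MHz."""
    hz = int(line.strip().split("=")[1])
    return hz / 1e6

def sysfs_to_mhz(text):
    """Convert a cpufreq value such as '700000' (in kHz) to MHz."""
    return int(text.strip()) / 1000.0

print(vcgencmd_to_mhz("frequency(45)=700074000"))  # 700.074
print(sysfs_to_mhz("700000"))                      # 700.0
```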
To Start
Whetstone Benchmark - whetstonePiA6, whetstonePiA7, whetstonePi64
The Whetstone Benchmark, introduced in 1972, was the first general purpose benchmark that set industry standards of
performance, particularly for minicomputers. The benchmark produced speed ratings in terms of Thousands of
Whetstone Instructions Per Second (KWIPS). In 1978, self timing versions (by yours truly) produced speed ratings, for
each of the eight test procedures, in MOPS (Millions of Operations Per Second) or MFLOPS (Millions of Floating Point
Operations Per Second), with an overall rating in MWIPS, mainly dependent on floating point speed.
Unlike some other floating point benchmarks, the new PiA7 compilation produces identical numeric results to those
below.
Besides the logged results, other information, shown below, is displayed on the Terminal, particularly for calibrating to
run for a total of about 10 seconds. The time for each test identifies what determines the overall MWIPS rating. This now
depends mainly on the tests with mathematical functions, whereas originally it was the N6 floating point test.
pi@raspberrypi ~/benchmarks $ ./whetstonePiA6
##########################################
Single Precision C Whetstone Benchmark Opt 3 32 Bit, Sun May 12 11:05:53 2013
Calibrate
0.04 Seconds 1 Passes (x 100)
0.19 Seconds 5 Passes (x 100)
0.93 Seconds 25 Passes (x 100)
4.68 Seconds 125 Passes (x 100)
Use 267 passes (x 100)
Single Precision C/C++ Whetstone Benchmark
Loop content Result MFLOPS MOPS Seconds
N1 floating point -1.12475013732910156 97.811 0.053
N2 floating point -1.12274742126464844 100.800 0.360
N3 if then else 1.00000000000000000 698.625 0.040
N4 fixed point 12.00000000000000000 425.250 0.200
N5 sin,cos etc. 0.49911010265350342 5.850 3.840
N6 floating point 0.99999982118606567 85.669 1.700
N7 assignments 3.00000000000000000 498.960 0.100
N8 exp,sqrt etc. 0.75110864639282227 2.722 3.690
MWIPS 270.460 9.983
A new results file, whets.txt, will have been created in the same
directory as the .EXE files, if one did not already exist.
Type additional information to include in whets.txt - Press Enter
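To see which tests dominate the overall rating, the Seconds column of the log above can be turned into shares of total running time. This is only an illustrative sketch using the figures displayed above; it is not part of the benchmark code.

```python
# Seconds per test, copied from the whetstonePiA6 log above.
seconds = {
    "N1 floating point": 0.053,
    "N2 floating point": 0.360,
    "N3 if then else": 0.040,
    "N4 fixed point": 0.200,
    "N5 sin,cos etc.": 3.840,
    "N6 floating point": 1.700,
    "N7 assignments": 0.100,
    "N8 exp,sqrt etc.": 3.690,
}

total = sum(seconds.values())   # about 9.983, as in the MWIPS line
shares = {name: t / total for name, t in seconds.items()}

# List tests from the largest share of run time to the smallest.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {100 * share:5.1f}%")

function_share = shares["N5 sin,cos etc."] + shares["N8 exp,sqrt etc."]
print(f"Function tests: {100 * function_share:.1f}% of run time")
```

On these figures the two function tests account for about three quarters of the running time, with N6 the next largest.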
To Start
See Comparisons Below
Whetstone Benchmark Comparisons
Results below are for the Raspberry Pi running at 700 MHz and overclocked at 1000 MHz. For comparison purposes, also
shown are speeds obtained on various Android based ARM CPUs and Intel processors running under Linux, compiled as
above. The latter are similar to those from my earlier Linux benchmarks. Results on many more systems are in
Whetstone Benchmark Results, with speeds of ancient computers in Whetstone Benchmark History and Results.
Raspberry Pi 2, with default settings, is just over twice as fast as the original, on average, or 57% faster at 1000 MHz.
Performance via gcc 4.8 can be slightly slower than the earlier benchmarks. The programming code used is not really
suitable to produce performance gains through advanced instructions.
This benchmark is particularly sensitive to optimisation in compiling the COS and EXP function tests that can determine
the overall MWIPS rating. The other main influence is the third MFLOPS measurement. On all fronts, the Raspberry Pi 3
performance is around 1.33 times that of a non-overclocked Raspberry Pi 2, similar to the CPU MHz ratio.
Except for the function tests, other results from the Cortex-A53 based tablet are similar to the Raspberry Pi 3,
adjusted for CPU MHz, and that also applies to 32 bit versus 64 bit operation. Much of the similarity is due to execution
loops containing few simple instructions.
Raspberry Pi 3 SUSE and Gentoo 64 Bits - other than the COS and EXP type function tests, speeds were similar to the
32 bit version and the Android 64 bit app. With SUSE on-demand CPU frequency, overall MWIPS ratings were 20% to 40%
slower. SUSE and Gentoo MWIPS ratings were slightly different, again due to those volatile function test results, with
others essentially the same, as would be expected with the simple processing arrangements. As indicated for the IF test,
the compiler detected that it was not necessary to repeat the calculations, but this would make no real difference to
MWIPS.
                            --------MFLOPS--------  ---------------MOPS---------------
System         MHz   MWIPS       1       2       3   COS   EXP   FIXPT      IF   EQUAL
Raspberry Pi   700   270.5    97.8   100.8    85.7   5.9   2.7   425.3   698.6   499.0
Raspberry Pi  1000   390.6   136.8   146.3   122.9   8.5   3.9   617.4  1014.3   804.9
RPi 2 v7-A7    900   525.0   252.0   261.3   223.0  10.2   5.1  1102.5  1358.4   882.0
RPi 2 v7-A7   1000   584.6   280.3   290.7   248.0  11.3   5.7  1314.0  1208.9   981.1
RPi 3 v8-A53  1200   724.5   331.0   347.5   298.1  12.1   8.7  1520.4  1873.4  1216.3
gcc 4.8
RPi 2 v7-A7    900   507.0   250.4   227.1   184.6  10.1   5.1  1113.7  1334.9   668.4
RPi 2 v7-A7   1000   568.4   280.4   254.4   206.7  11.3   5.7  1248.8  1497.9   749.2
RPi 3 v8-A53  1200   711.6   336.5   329.7   256.9  12.2   8.8  1498.5  1796.7  1198.7
gcc-6
64 Bit Working
OpenSuse
RPi 3 v8-A53  1200   997.2   336.6   354.1   287.8  18.4  12.3  1498.7  ######  1197.3
Gentoo
RPi 3 v8-A53  1200  1022.9   327.6   346.3   282.1  20.3  12.6  1467.3  ######  1166.4
Android
ARM 926EJ      800    31.2    10.2    10.2    11.4   0.6   0.3    38.8   278.4   219.4
ARM v7-A9      800   687.4   165.4   149.9   153.4  15.9   9.3   723.1  1082.1   725.3
ARM v7-A9     1300  1115.0   271.3   250.7   256.4  25.8  14.6  1190.0  1797.0  1198.7
ARM v7-A15    1700  1333.6   315.5   291.2   298.6  39.8  18.1  1394.7  2089.9  1395.5
ARM v8-A53    1300   834.7   348.9   312.7   310.9  36.7   5.4  1556.7  1867.2   570.5
64 Bit Version
ARM v8-A53    1300  1494.2   347.1   307.0   305.9  37.5  20.6  1552.2  1863.7  1239.1
Intel Atom    1666   822.3   332.4   325.7   308.6  17.2   8.1  1013.8  2368.9  1228.0
Core 2        2400  2316.1   810.0   790.4   576.2  56.8  23.8  3986.9  7532.4  2831.4
Core i7       3900  3959.0  1331.0  1330.9   938.4  96.5  42.1  6515.7 10966.7  5850.8
###### compiler optimiser produces 1 pass, this test does not affect MWIPS much
To Start
Dhrystone 2 Benchmark - dhrystonePiA6, dhrystonePiA7, dhrystonePi64
The Dhrystone "C" benchmark provides a measure of integer performance (no floating point instructions). It became the
key standard benchmark from 1984, with the growth of Unix systems. The first version was produced by Reinhold P.
Weicker in Ada and translated to "C" by Rick Richardson. Two versions are available - Dhrystone versions 1.1 and 2.1.
The second version, used here, was produced to avoid over-optimisation problems encountered with version 1, but
some is still possible. Speed was originally measured in Dhrystones per second. This was later changed to VAX MIPS by
dividing Dhrystones per second by 1757, the DEC VAX 11/780 result, the latter being regarded as the first 1 MIPS
minicomputer.
This again runs for 10 seconds after calibration. In this case, logged results are nanoseconds for one Dhrystone run,
Dhrystones per Second and VAX MIPS rating, plus details of detected errors or “Numeric results were correct”. Below is
the execution command and details of displayed information, excluding standard system information.
pi@raspberrypi ~/benchmarks $ ./dhrystonePiA6
##########################################
Dhrystone Benchmark, Version 2.1 (Language: C or C++)
Optimisation Opt 3 32 Bit
Register option not selected
10000 runs 0.00 seconds
100000 runs 0.07 seconds
200000 runs 0.15 seconds
400000 runs 0.28 seconds
800000 runs 0.56 seconds
1600000 runs 1.13 seconds
3200000 runs 2.26 seconds
Final values (* implementation-dependent):
Int_Glob: O.K. 5 Bool_Glob: O.K. 1
Ch_1_Glob: O.K. A Ch_2_Glob: O.K. B
Arr_1_Glob[8]: O.K. 7 Arr_2_Glob8/7: O.K. 3200010
Ptr_Glob-> Ptr_Comp: * 5722488
Discr: O.K. 0 Enum_Comp: O.K. 2
Int_Comp: O.K. 17 Str_Comp: O.K. DHRYSTONE PROGRAM, SOME STRING
Next_Ptr_Glob-> Ptr_Comp: * 5722488 same as above
Discr: O.K. 0 Enum_Comp: O.K. 1
Int_Comp: O.K. 18 Str_Comp: O.K. DHRYSTONE PROGRAM, SOME STRING
Int_1_Loc: O.K. 5 Int_2_Loc: O.K. 13
Int_3_Loc: O.K. 7 Enum_Loc: O.K. 1
Str_1_Loc: O.K. DHRYSTONE PROGRAM, 1'ST STRING
Str_2_Loc: O.K. DHRYSTONE PROGRAM, 2'ND STRING
Nanoseconds one Dhrystone run: 671.88
Dhrystones per Second: 1488372
VAX MIPS rating = 847.11
Type additional information to include in Dhry.txt - Press Enter
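The conversion described earlier, dividing Dhrystones per second by 1757, can be checked against the three result lines in the log above. The following is a small sketch with hypothetical function names, not the benchmark's own code.

```python
VAX_11_780_DHRYSTONES_PER_SECOND = 1757  # DEC VAX 11/780 result, the 1 MIPS reference

def vax_mips(dhrystones_per_second):
    """VAX MIPS rating from Dhrystones per second."""
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SECOND

def nanoseconds_per_run(dhrystones_per_second):
    """Time for one Dhrystone run in nanoseconds."""
    return 1e9 / dhrystones_per_second

dps = 1488372  # Dhrystones per Second from the log above
print(round(vax_mips(dps), 2))             # 847.11
print(round(nanoseconds_per_run(dps), 2))  # 671.88
```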
To Start
See Comparisons Below
Description above
Dhrystone 2 Benchmark Comparisons
Below is a similar combination of results as for the Whetstone Benchmark. For results on other systems see Dhrystone
Results.htm. Unlike with the Whetstone benchmark, which uses floating point calculations, the Raspberry Pi CPU speed
executing integer functions is close to that of ARM Cortex-A9 processors, on a per MHz basis. The Raspberry Pi 2 is
faster than the first version, performance ratios being shown below. The new gcc 4.8 compilation provides slightly
higher speed ratings.
The Raspberry Pi 3 averages 45% faster than the Pi 2 on these compilations, compared with a 33% faster CPU MHz.
These results are similar to those from the Cortex-A53 based tablet at 64 bits, where optimisation may not have been
as good as possible at 32 bits.
Raspberry Pi 3 SUSE and Gentoo 64 Bits - Speeds were more than 40% faster than 32 bit system results, and up to
2.95 VAX MIPS (DMIPS) per MHz. Variations between the two 64 bit tests are quite normal. The lower Android 64 bit
performance suggests that the later compiler might be responsible.
System MHz VAX MIPS
Raspberry Pi 700 847
Raspberry Pi 1000 1226
RPi 2 v7-A7 900 1538 1.82 x Rpi 700
RPi 2 v7-A7 1000 1694 1.38 x RPi 1000
RPi 3 v8-A53 1200 2201 1.43 x RPi 2 900
gcc 4.8
RPi 2 v7-A7 900 1667 1.08 x RPi 2
RPi 2 v7-A7 1000 1852 1.09 x RPi 2
RPi 3 v8-A53 1200 2469 1.48 x RPi 2 900
gcc-6
64 Bit Working
OpenSuse
RPi 3 v8-A53 1200 3536 1.43 x RPi 3 32 bits
Gentoo
RPi 3 v8-A53 1200 3475 0.98 x Suse 64 bits
Android
ARM 926EJ 800 356
ARM v7-A9 800 962
ARM v7-A9 1300 1610
ARM v7-A15 1700 3189
ARM v8-A53 1300 1423
64 Bit Version
ARM v8-A53 1300 2569
Linux using CC
Intel Atom 1666 2629
Core 2 2400 6857
Linux using older GCC
Intel Atom 1666 2055
Core 2 2400 5582
Core i7 3900 16356
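The ratios quoted in the right hand column of the table, and the VAX MIPS per MHz figure in the text, are simple divisions of the tabulated ratings. A short sketch that reproduces several of them follows. The dictionary keys are labels invented here for readability.

```python
# VAX MIPS ratings copied from the table above (keys are illustrative labels).
vax_mips = {
    "RPi 700": 847,
    "RPi 1000": 1226,
    "RPi 2 900": 1538,
    "RPi 2 1000": 1694,
    "RPi 3 1200": 2201,
    "RPi 2 900 gcc 4.8": 1667,
    "RPi 3 1200 gcc 4.8": 2469,
    "RPi 3 64 bit OpenSUSE": 3536,
    "RPi 3 64 bit Gentoo": 3475,
}

def ratio(a, b):
    """Speed of system a relative to system b, rounded as in the table."""
    return round(vax_mips[a] / vax_mips[b], 2)

print(ratio("RPi 2 900", "RPi 700"))                         # 1.82
print(ratio("RPi 3 1200", "RPi 2 900"))                      # 1.43
print(ratio("RPi 3 1200 gcc 4.8", "RPi 2 900 gcc 4.8"))      # 1.48
print(ratio("RPi 3 64 bit OpenSUSE", "RPi 3 1200 gcc 4.8"))  # 1.43
print(ratio("RPi 3 64 bit Gentoo", "RPi 3 64 bit OpenSUSE")) # 0.98

# DMIPS per MHz for the fastest 64 bit result at 1200 MHz.
print(round(vax_mips["RPi 3 64 bit OpenSUSE"] / 1200, 2))    # 2.95
```

On these numbers, the 1.43 gain shown for OpenSUSE matches a comparison against the gcc 4.8 32 bit result (2469) rather than the earlier compilation (2201).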
To Start
Linpack Benchmark 32b - linpackPiA6, linpackPiSP, linpackPiA7, linpackPiA7SP
Linpack Benchmark 64b - linpackPi64, linpackPiSP64
See Comparisons Below
The Linpack Benchmark was produced from the "LINPACK" package of linear algebra routines. It became the primary
benchmark for scientific applications, particularly under Unix, from the mid 1980's, with a slant towards supercomputer
performance. The original double precision C version, used here, operates on 100x100 matrices. Performance is
governed by an inner loop in function daxpy() with a linked triad dy[i] = dy[i] + da * dx[i], and is measured in Millions of
Floating Point Operations Per Second (MFLOPS).
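As an illustration of the linked triad, here is a simplified sketch of the daxpy inner loop. It is written in Python for brevity, whereas the benchmark itself is in C, and it omits everything except the loop quoted above. The returned operation count assumes one multiply and one add per element.

```python
def daxpy(da, dx, dy):
    """Linked triad dy[i] = dy[i] + da * dx[i], updating dy in place.

    Returns the number of floating point operations performed,
    counting one multiply and one add per element.
    """
    for i in range(len(dy)):
        dy[i] = dy[i] + da * dx[i]
    return 2 * len(dy)

dx = [1.0, 2.0, 3.0, 4.0]
dy = [10.0, 20.0, 30.0, 40.0]
flops = daxpy(0.5, dx, dy)
print(dy, flops)  # [10.5, 21.0, 31.5, 42.0] 8
```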
Displayed output is the same as the original version for PCs where the bloated detail was needed due to using a low
resolution timer. My variation, linpack-pc.c, for PCs was accepted by Netlib and can be downloaded from their
library. Note that clicking on the link might produce details without appropriate line feeds. Using Windows, you might
have to download and open with WordPad or suitable language editor.
The line starting with norm resid 1.7 shows the numeric results of calculations. These can vary using different hardware
and compilers - see examples in Android and Raspberry Pi MultiThreading Versions section of My Linpack Benchmark
Results. For comparison purposes, these are set in the C source code and checked at run time, a "Numeric results were
as expected" message being logged if correct, or details provided if incorrect. Note that the compiled code could give
consistently different results on other Linux based ARM processors. The log file shows only one MFLOPS speed.
Unlike normal Intel floating point, double precision calculations are often slower than those using single precision on
ARM processors. So, besides linpackPiA6, a single precision compilation, linpackPiSP, is also provided. As for the double
precision results, these are identical to those on Android based ARM systems.
The gcc 4.8 equivalents are linpackPiA7 and linpackPiA7SP where, as shown below, these produce different numeric
answers. These are probably acceptable and due to different rounding with the assembly code used. Below is that used
for the performance dependent code.
pi@raspberrypi ~/benchmarks $ ./linpackPiA6
##########################################
Unrolled Double Precision Linpack Benchmark - Linux Version in 'C/C++'
Optimisation Opt 3 32 Bit
norm resid resid machep x[0]-1 x[n-1]-1
1.7 7.41628980e-14 2.22044605e-16 -1.49880108e-14 -1.89848137e-14
Times are reported for matrices of order 100
1 pass times for array with leading dimension of 201
dgefa dgesl total Mflops unit ratio
0.00000 0.00000 0.00000 0.00 0.0000 0.0000
Calculating matgen overhead
10 times 0.01 seconds
100 times 0.15 seconds
200 times 0.28 seconds
400 times 0.58 seconds
800 times 1.13 seconds
Overhead for 1 matgen 0.00141 seconds
Calculating matgen/dgefa passes for 1 seconds
10 times 0.17 seconds
20 times 0.35 seconds
40 times 0.69 seconds
80 times 1.38 seconds
Passes used 57
Times for array with leading dimension of 201
dgefa dgesl total Mflops unit ratio
0.01578 0.00053 0.01631 42.11 0.0475 0.2912
0.01596 0.00053 0.01648 41.66 0.0480 0.2943
0.01578 0.00053 0.01631 42.11 0.0475 0.2912
0.01596 0.00053 0.01648 41.66 0.0480 0.2943
0.01578 0.00070 0.01648 41.66 0.0480 0.2943
Average 41.84
Calculating matgen2 overhead
Overhead for 1 matgen 0.00144 seconds
Times for array with leading dimension of 200
dgefa dgesl total Mflops unit ratio
0.01523 0.00053 0.01576 43.58 0.0459 0.2813
0.01540 0.00053 0.01593 43.10 0.0464 0.2845
0.01540 0.00053 0.01593 43.10 0.0464 0.2845
0.01523 0.00070 0.01593 43.10 0.0464 0.2845
0.01523 0.00070 0.01593 43.10 0.0464 0.2845
Average 43.20
Unrolled Double Precision 41.84 Mflops
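The Average lines in the log above are arithmetic means of the five Mflops values. The sketch below recomputes them and also estimates MFLOPS from the first total time, assuming the usual Linpack operation count of 2n^3/3 + 2n^2 for a matrix of order n. That formula is an assumption here, as the document does not state it, and the estimate differs slightly from the logged value because the displayed times are rounded.

```python
def linpack_ops(n):
    """Assumed operation count for factoring and solving an n x n system."""
    return 2.0 * n ** 3 / 3.0 + 2.0 * n ** 2

# Mflops columns copied from the log above.
mflops_ld201 = [42.11, 41.66, 42.11, 41.66, 41.66]
mflops_ld200 = [43.58, 43.10, 43.10, 43.10, 43.10]

avg201 = sum(mflops_ld201) / len(mflops_ld201)
avg200 = sum(mflops_ld200) / len(mflops_ld200)
print(f"{avg201:.2f} {avg200:.2f}")  # 41.84 43.20

# Estimate from the first 'total' time (0.01631 s) for order 100.
estimate = linpack_ops(100) / 0.01631 / 1e6
print(f"{estimate:.2f}")  # close to the logged 42.11
```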
See Numeric Results of Calculations Below
Raspberry Pi Results of Calculations
norm resid resid x[0]-1 x[n-1]-1
DP Pi 1.7 7.41628980e-14 -1.49880108e-14 -1.89848137e-14
DP Pi 2-3 1.9 8.46778499E-14 -1.11799459E-13 -9.60342916E-14
DP Pi 64 1.9 8.46778499e-14 -1.11799459e-13 -9.60342916e-14
SP Pi 1.6 3.80277634e-05 -1.38282776e-05 -7.51018524e-06
SP Pi NEON 2.2 5.16722466e-05 -2.38418579e-07 -5.06639481e-06
SP Pi 2-3 2.0 4.69621336E-05 -1.31130219E-05 -1.30534172E-05
SP Pi 64 2.0 4.69621336e-05 -1.31130219e-05 -1.30534172e-05
To Start
See Comparisons on next page
Linpack Benchmark comparisons
The first Raspberry Pi results do not look too good but they would on a cost/performance basis. Also, the MFLOPS
ratings should be compared with Linpack Results on PCs and older mainframes, supercomputers, Unix boxes and
minicomputers with Netlib Linpack Results. The Linpack benchmark depends on data in L2 cache and this might lead to
variations in running time. Other versions might specify larger array sizes (like 1000 x 1000) that can depend on slower
memory.
The Raspberry Pi 2 is faster than the first version, performance ratios being shown below. In this case, the new code
from gcc 4.8 is faster than the original, but only for the double precision benchmark, due to the more efficient
instructions shown below. The benchmark has also been compiled to use ARM NEON Single Instruction Multiple Data
(SIMD) functions (linpackPiNEONi, linpackPiNEON64), speed being included in the results table. Further details are in
a later section.
Based on MFLOPS/MHz, the Raspberry Pi 3 can be slower than the RPi 2, but is quite a bit faster on the NEON version.
The Cortex-A53 based tablet 32 bit performance is similar to the RPi 3, but 64 bit working is much faster.
Raspberry Pi 3 SUSE and Gentoo 64 Bits - The measurements for these and the Android 64 bit version were
essentially the same. Speed improvements, over the 32 bit version, were around 1.9 times DP and 2.5 times SP. NEON
speeds were not much different, where the intrinsic functions are translated into different variations of vector
instructions.
MFLOPS GAIN
System MHz DP SP NEON SP DP SP Against
Raspberry Pi 700 42 58 N/A
Raspberry Pi 1000 68 88 N/A
RPi 2 v7-A7 900 120 156 N/A 2.86 2.69 RPi 700
RPi 2 v7-A7 1000 134 175 N/A 1.97 1.99 RPi 1000
RPi 3 v8-A53 1200 176 190 N/A 1.47 1.22 RPi 2 900
gcc 4.8
RPi 2 v7-A7 900 154 156 300 1.28 1.00 RPi 2 900
RPi 2 v7-A7 1000 169 176 334 1.26 1.01 RPi 2 1000
RPi 3 v8-A53 1200 180 194 486 1.17 1.24 RPi 2 900
gcc-6
64 Bit Working
OpenSuse
RPi 3 v8-A53 1200 348 494 530 1.93 2.55 RPI 3 32b
Gentoo
RPi 3 v8-A53 1200 343 482 521 1.90 2.48 RPI 3 32b
Android
ARM 926EJ 800 6 10 N/A
ARM v7-A9 800 101 129 256
ARM v7-A9 1300 151 201 377
ARM v7-A15 1700 459 803 1335
gcc 4.8
ARM v7-A9 1300 159 200
ARM v7-A15 1700 795 977
ARM v8-A53 1300 178 187 423
64 Bit Version
ARM v8-A53 1300 348 493 521
Linux using CC
Intel Atom 1666 211
Core 2 2400 1631
Linux using older GCC
Intel Atom 1666 196
Core 2 2400 1288
Core i7 3900 2534
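The MFLOPS/MHz comparison mentioned in the text can be worked out directly from the gcc 4.8 rows of the table. A minimal sketch, with the row labels invented here:

```python
# (label, MHz, DP, SP, NEON SP) from the gcc 4.8 rows of the table above.
rows = [
    ("RPi 2 v7-A7 gcc 4.8", 900, 154, 156, 300),
    ("RPi 3 v8-A53 gcc 4.8", 1200, 180, 194, 486),
]

per_mhz = {}
for label, mhz, dp, sp, neon in rows:
    per_mhz[label] = (dp / mhz, sp / mhz, neon / mhz)
    print(f"{label:22s} DP {dp / mhz:.3f}  SP {sp / mhz:.3f}  NEON {neon / mhz:.3f}")

rpi2 = per_mhz["RPi 2 v7-A7 gcc 4.8"]
rpi3 = per_mhz["RPi 3 v8-A53 gcc 4.8"]
# RPi 3 is lower per MHz for DP and SP, higher for NEON.
print(rpi3[0] < rpi2[0], rpi3[1] < rpi2[1], rpi3[2] > rpi2[2])  # True True True
```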
To Start
Livermore Loops Benchmark - liverloopsPiA6, liverloopsPiA7, liverloopsPi64
See Comparisons Below
This original main benchmark for supercomputers was first introduced in 1970, initially comprising 14 kernels of numerical
application, written in Fortran. This was increased to 24 kernels in the 1980s. Performance measurements are in terms
of Millions of Floating Point Operations Per Second or MFLOPS. The kernels are executed three times with different
double precision data array sizes. Following are overall MFLOPS results for various systems, geometric mean being
the official average performance. [Reference - F.H. McMahon, The Livermore Fortran Kernels: A Computer Test Of
The Numerical Performance Range, Lawrence Livermore National Laboratory, Livermore, California, UCRL-53745,
December 1986]
---------------- MFLOPS ---------------
CPU MHz Maximum Average Geomean Harmean Minimum Measured in
CDC 6600 10 1.1 0.5 0.5 0.4 0.2 1970 *
CDC 7600 36.4 7.3 4.2 3.9 2.5 1.4 1974 *
Cray 1A 80 83.5 25.8 14.4 7.9 2.7 1980 *
Cray 1S 80 82.1 22.2 11.9 6.5 1.0 1985
CDC Cyber 205 50 146.9 36.4 14.6 5.0 0.6 1982 *
Cray 2 244 146.4 36.7 14.2 5.8 1.7 1985
Cray XMP1 105 187.8 61.3 31.5 15.6 3.6 1986
* Fewer than 24 Kernels
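The Average, Geomean and Harmean columns are arithmetic, geometric and harmonic means of the kernel speeds. The sketch below applies the three definitions to the 24 MFLOPS values from the liverloopsPiA6 part 3 log reproduced in this document. It is unweighted and covers only one of the three data sizes, so the results should not be expected to match the benchmark's own logged summary.

```python
import math

# MFLOPS for kernels 1 to 24, from the liverloopsPiA6 part 3 log in this document.
mflops = [
    68.29, 62.65, 135.70, 75.04, 36.99, 43.52, 146.64, 124.52,
    92.17, 46.59, 31.30, 24.66, 21.07, 38.63, 64.21, 59.41,
    84.27, 86.92, 71.56, 57.04, 52.99, 22.42, 89.56, 28.56,
]

n = len(mflops)
average = sum(mflops) / n                              # arithmetic mean
geomean = math.exp(sum(math.log(x) for x in mflops) / n)  # geometric mean
harmean = n / sum(1.0 / x for x in mflops)             # harmonic mean

print(f"Maximum {max(mflops):.2f}  Average {average:.2f}  Geomean {geomean:.2f}  "
      f"Harmean {harmean:.2f}  Minimum {min(mflops):.2f}")
```

For any set of positive speeds that are not all equal, the harmonic mean is below the geometric mean, which is below the arithmetic mean. This is the ordering seen in every row of the table above.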
Below is the run command, then displayed calibration phase, final results and details for the 24 loops using the largest
data sizes. Calibration arranges for each loop to run for around one second. The Checksums OK column is an indication
of accuracy, compared with a specification and probably based on results from CDC 6600 and 7600. These
hardware/compiler dependent numeric answers are checked as in the Linpack benchmark. Results included in the log file
are Minimum, Maximum, Averages and 24 weighted average MFLOPS speeds.
As with the Linpack benchmark, liverloopsPiA7, the gcc 4.8 compilation, produced different numeric answers to the
earlier version, this time for 22 out of the 24 kernels. All were only slightly different and are shown below, for part 3 of
3. The benchmark produced a run time error from the initial gcc 4.8 compilation. This was due to the way in which
shared array space is allocated and was also apparent with earlier Android compilations. So, the same code changes
were made and the revised source code is included in Raspberry_Pi_Benchmarks.zip.
pi@raspberrypi ~/benchmarks $ ./liverloopsPiA6
##########################################
L.L.N.L. 'C' KERNELS: MFLOPS P.C. VERSION 4.0 Optimisation Opt 3 32 Bit
Calculating outer loop overhead
1000 times 0.00 seconds
10000 times 0.00 seconds
100000 times 0.00 seconds
1000000 times 0.06 seconds
2000000 times 0.11 seconds
4000000 times 0.23 seconds
Overhead for each loop 5.7500e-08 seconds
Calibrating part 3 of 3
Loop count 32 0.00 seconds
Loop count 128 0.01 seconds
Loop count 512 0.04 seconds
Loops 200 x 8 x Passes
Kernel Floating Pt ops
No Passes E No Total Secs. MFLOPS Span Checksums OK
------------ -- ------------- ----- ------- ---- ---------------------- --
1 28 x 11 5 6.652800e+07 0.97 68.29 27 3.855104502494961e+01 16
2 46 x 18 4 5.829120e+07 0.93 62.65 15 3.953296986903059e+01 16
3 37 x 36 2 1.150848e+08 0.85 135.70 27 2.699309089320672e-01 16
4 38 x 36 2 6.566400e+07 0.88 75.04 27 5.999250595473891e-01 16
5 40 x 12 2 3.993600e+07 1.08 36.99 27 3.182615248447483e+00 16
6 21 x 34 2 5.483520e+07 1.26 43.52 8 1.120309393467088e+00 15
7 20 x 14 16 1.505280e+08 1.03 146.64 21 2.845720217644024e+01 16
8 9 x 10 36 1.347840e+08 1.08 124.52 14 2.960543667875005e+03 15
9 26 x 11 17 1.166880e+08 1.27 92.17 15 2.623968460874250e+03 16
10 25 x 10 9 5.400000e+07 1.16 46.59 15 1.651291227698265e+03 16
11 46 x 18 1 3.444480e+07 1.10 31.30 27 6.551161335845770e+02 16
12 48 x 14 1 2.795520e+07 1.13 24.66 26 1.943435981130448e-06 16
13 31 x 9 7 2.499840e+07 1.19 21.07 8 3.847124199949431e+10 15
14 8 x 11 11 4.181760e+07 1.08 38.63 27 2.923540598672009e+06 15
15 1 x 17 33 6.283200e+07 0.98 64.21 15 1.108997288134785e+03 16
16 14 x 34 10 8.377600e+07 1.41 59.41 15 5.152160000000000e+05 16
17 26 x 17 9 9.547200e+07 1.13 84.27 15 2.947368618589361e+01 16
18 2 x 11 44 1.006720e+08 1.16 86.92 14 9.700646212337041e+02 16
19 28 x 23 6 9.273600e+07 1.30 71.56 15 1.268230698051003e+01 15
20 7 x 9 26 6.814080e+07 1.19 57.04 26 5.987713249475302e+02 16
21 1 x 2 2 8.000000e+07 1.51 52.99 20 5.009945671204667e+07 16
22 8 x 8 17 2.611200e+07 1.16 22.42 15 6.109968728263972e+00 16
23 7 x 11 11 8.808800e+07 0.98 89.56 14 4.850340602749970e+02 16
24 23 x 35 1 3.348800e+07 1.17 28.56 27 1.300000000000000e+01 16
End of results, plus different numeric results next
Livermore Loops results continued
Maximum Rate 146.64
Average Rate 65.20
Geometric Mean 56.66
Harmonic Mean 48.85
Minimum Rate 21.07
Do Span 19
Overall
Part 1 weight 1
Part 2 weight 2
Part 3 weight 1
Maximum Rate 148.29
Average Rate 64.41
Geometric Mean 54.74
Harmonic Mean 46.40
Minimum Rate 16.62
Do Span 167
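The part 3 summary lines (Maximum, Average, Geometric Mean, Harmonic Mean and Minimum Rate) can be derived from the 24 per-kernel MFLOPS figures listed above. The following is a minimal Python sketch, assuming unweighted means over the MFLOPS column as printed (rounded to two decimal places). The function and variable names are illustrative and are not taken from the benchmark source.

```python
import math

# MFLOPS column for kernels 1 to 24, copied from the part 3 results above.
mflops = [68.29, 62.65, 135.70, 75.04, 36.99, 43.52, 146.64, 124.52,
          92.17, 46.59, 31.30, 24.66, 21.07, 38.63, 64.21, 59.41,
          84.27, 86.92, 71.56, 57.04, 52.99, 22.42, 89.56, 28.56]


def summarise(rates):
    """Return the five summary statistics for a list of MFLOPS rates."""
    n = len(rates)
    return {
        "maximum": max(rates),
        "average": sum(rates) / n,                               # arithmetic mean
        "geomean": math.exp(sum(math.log(r) for r in rates) / n),
        "harmean": n / sum(1.0 / r for r in rates),
        "minimum": min(rates),
    }


stats = summarise(mflops)
for name, value in stats.items():
    print(f"{name:8s} {value:8.2f}")
```

The printed values can be compared with the Maximum Rate to Minimum Rate lines for part 3. The harmonic mean is never larger than the geometric mean, which is never larger than the arithmetic mean, so slow kernels pull the official geometric mean rating below the simple average.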
gcc 4.8 and gcc-6 Different Numeric Results
Later at 64 bits - Checks for was results
1 was 3.855104502494985e+01 expected 3.855104502494961e+01
2 was 3.953296986903406e+01 expected 3.953296986903059e+01
3 was 2.699309089321338e-01 expected 2.699309089320672e-01
4 was 5.999250595474085e-01 expected 5.999250595473891e-01
5 was 3.182615248448323e+00 expected 3.182615248447483e+00
6 was 1.120309393467610e+00 expected 1.120309393467088e+00
7 was 2.845720217644064e+01 expected 2.845720217644024e+01
8 was 2.960543667877653e+03 expected 2.960543667875005e+03
9 was 2.623968460874436e+03 expected 2.623968460874250e+03
10 was 1.651291227698388e+03 expected 1.651291227698265e+03
11 was 6.551161335846584e+02 expected 6.551161335845770e+02
12 was 1.943435982643127e-06 expected 1.943435981130448e-06
13 was 3.847124173932926e+10 expected 3.847124199949431e+10
14 was 2.923540598700724e+06 expected 2.923540598672009e+06
15 was 1.108997288135077e+03 expected 1.108997288134785e+03
17 was 2.947368618590736e+01 expected 2.947368618589361e+01
18 was 9.700646212341634e+02 expected 9.700646212337041e+02
19 was 1.268230698051755e+01 expected 1.268230698051003e+01
20 was 5.987713249471707e+02 expected 5.987713249475302e+02
21 was 5.009945671206671e+07 expected 5.009945671204667e+07
22 was 6.109968728264851e+00 expected 6.109968728263972e+00
23 was 4.850340602751729e+02 expected 4.850340602749970e+02
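The size of these differences can be gauged by counting how many leading digits of the "was" and "expected" values agree, or by calculating the relative difference. The sketch below is an illustration only: it is an assumption that this resembles the check behind the Checksums OK column, and the helper names are hypothetical.

```python
def matching_digits(was: str, expected: str) -> int:
    """Count the leading significant digits shared by two values printed
    in the same d.ddddde+xx format. Returns 0 if the exponents differ.
    A plain string comparison can understate agreement in cases such as
    1.9999 versus 2.0000."""
    mant_a, exp_a = was.lower().split("e")
    mant_b, exp_b = expected.lower().split("e")
    if exp_a != exp_b:
        return 0
    digits_a = mant_a.replace(".", "").lstrip("-")
    digits_b = mant_b.replace(".", "").lstrip("-")
    count = 0
    for a, b in zip(digits_a, digits_b):
        if a != b:
            break
        count += 1
    return count


def relative_difference(was: str, expected: str) -> float:
    """Absolute difference divided by the magnitude of the expected value."""
    w, e = float(was), float(expected)
    return abs(w - e) / abs(e)


# Kernels 1 and 13 from the list above.
k1 = ("3.855104502494985e+01", "3.855104502494961e+01")
k13 = ("3.847124173932926e+10", "3.847124199949431e+10")
print(matching_digits(*k1), relative_difference(*k1))
print(matching_digits(*k13), relative_difference(*k13))
```

On this basis kernel 1 agrees to 14 digits and kernel 13, the least accurate of those listed, to 8 digits, a relative difference of less than 1 part in 100 million.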
Livermore Loops Benchmark Comparisons
For Cray 1 comparison purposes, it is more appropriate to use Cray 1S results, as these are from running all 24 kernels.
Geometric mean for this system is 11.9 MFLOPS. In 1978, the Cray 1 supercomputer cost $7 Million, weighed 10,500
pounds and had a 115 kilowatt power supply. It was, by far, the fastest computer in the world. The Raspberry Pi costs
around $70 (CPU board, case, power supply, SD card), weighs a few ounces, uses a 5 watt power supply and is more
than 4.5 times faster than the Cray 1.
Average performance gains of the Raspberry Pi 2 are not as high as those for the Linpack benchmark, but the best test
loop, at 900 MHz, is 4.25 times faster than the original Pi at 700 MHz. Highest average of 138 MFLOPS is 11.6 times
faster than a Cray 1.
The Raspberry Pi 3 average speed shown is 46% faster than RPi 2, compared with 33% higher CPU MHz, and is also a little
faster than the Android tablet with the Cortex-A53. The latter's 64 bit compilation indicates an average improvement of
46%, with a wide variation in MFLOPS from individual tests.
Raspberry Pi 3 SUSE and Gentoo 64 Bits - In this case, the SUSE based test results all appeared to be slightly faster
than those for Gentoo, with Gentoo speeds between 92% and 99% of those from SUSE. Average Android 64 bit speed was
10% slower but some results were faster, probably due to a different compiler handling the relatively large code.
Compared to 32 bit speeds, the 64 bit scores were between 1.02 and 2.88 times faster. The official geometric mean
rating was 1.34 times faster. On the same basis, the RPi 3 can be rated as the equivalent of 21 times the Cray 1
supercomputer.
See also Livermore Loops Results on PCs.
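The "times faster" figures quoted above, and those in the Geomean Against column of the comparison table, are simple ratios of geometric mean MFLOPS. A short Python sketch of the arithmetic follows, using geometric means quoted in this section. The dictionary keys and function name are illustrative only.

```python
CRAY_1S_GEOMEAN = 11.9  # MFLOPS, Cray 1S running all 24 kernels

# Geometric mean MFLOPS taken from the results in this section.
geomeans = {
    "Raspberry Pi 700 MHz": 54.7,
    "RPi 2 1000 MHz gcc 4.8": 138.2,
    "RPi 3 1200 MHz gcc 4.8 32 bit": 185.9,
    "RPi 3 1200 MHz OpenSuse 64 bit": 249.4,
}


def times_faster(system_mflops: float, baseline_mflops: float) -> float:
    """Ratio of two geometric mean MFLOPS ratings."""
    return system_mflops / baseline_mflops


for name, value in geomeans.items():
    ratio = times_faster(value, CRAY_1S_GEOMEAN)
    print(f"{name:32s} {ratio:5.1f} x Cray 1S")

# 64 bit versus 32 bit on the RPi 3, as in the Geomean Against column.
gain_64_bit = times_faster(geomeans["RPi 3 1200 MHz OpenSuse 64 bit"],
                           geomeans["RPi 3 1200 MHz gcc 4.8 32 bit"])
print(f"RPi 3 64 bit / 32 bit {gain_64_bit:.2f}")
```

This gives 4.6, 11.6 and 21.0 times the Cray 1S for the first, second and last entries, and 1.34 for the 64 bit gain.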
Compare
System MHz Maximum Average Geomean Harmean Minimum Geomean Against
Raspberry Pi # 700 148.3 64.4 54.7 46.4 16.6
Raspberry Pi # 1000 216.8 94.8 80.8 68.7 29.3
RPi 2 v7-A7 900 248.0 126.1 114.9 103.9 41.5 2.10 RPi 700
RPi 2 v7-A7 1000 273.5 139.7 127.3 115.2 46.5 1.58 RPi 1000
gcc 4.8
RPi 2 v7-A7 * 900 223.8 136.9 125.6 113.0 42.3 1.09 RPi 2# 900
RPi 2 v7-A7 1000 244.9 150.7 138.2 124.4 46.7 1.09 RPi 2# 1000
RPi 3 v8-A53 ^ 1200 435.5 206.9 183.6 159.8 55.6 1.46 RPi 2* 900
gcc 4.8
RPi 3 v8-A53 > 1200 398.4 210.6 185.9 160.2 56.5 1.01 RPi 3^ 1200
gcc-6 64 Bit Working
OpenSuse gcc-6
RPi 3 v8-A53 > 1200 649.0 278.8 249.4 221.6 95.0 1.34 RPi 3 32 bit
Gentoo
RPi 3 v8-A53 > 1200 627.3 275.7 246.8 219.2 90.6 1.34 RPi 3 32 bit
Android
ARM 926EJ 800 9.9 5.6 5.4 5.2 2.4
ARM v7-A9 800 253.2 129.3 115.3 101.6 46.7
ARM v7-A9 1200 391.9 202.1 181.3 160.9 68.1
ARM v7-A15 1700 1252.8 476.0 375.8 288.8 90.8
ARM v8-A53 $ 1300 393.4 188.3 158.3 124.6 27.1 0.85 RPi 3> 1200
64 Bit Version
ARM v8-A53 1300 772.2 265.9 232.5 206.3 97.8 1.47 RPi 3$ 1300
Atom Z3745 1866 1031.2 480.0 429.8 378.6 154.7
Linux using CC
Intel Atom 1666 480.3 217.6 189.9 162.2 59.7
Core 2 2400 2264.7 1039.3 822.9 606.4 161.6
Linux using older GCC
Intel Atom 1666 465.2 212.2 185.1 157.4 49.7
Core 2 2400 2384.9 1038.1 805.8 582.1 161.0
Core i7 4820K 3900 5551.3 2196.8 1712.4 1286.6 415.3
MFLOPS for 24 loops
Raspberry Pi 700 MHz
66.1 79.8 132.8 141.1 23.8 29.3 110.8 129.7 90.2 38.7 32.0 25.2
22.1 16.6 61.0 58.6 81.5 59.8 73.5 42.2 29.9 22.5 66.4 29.5
Raspberry Pi 1000 MHz
97.0 116.2 197.2 206.0 37.4 47.2 169.0 185.6 132.6 57.4 46.2 35.9
32.7 32.0 89.7 85.6 118.4 88.8 107.1 75.6 47.6 32.4 106.0 42.6
Raspberry Pi 2 900 MHz
114.1 129.1 221.7 218.0 84.7 96.8 196.3 248.0 155.2 137.4 74.2 63.6
62.4 70.6 125.6 125.1 196.3 153.3 132.6 115.2 78.4 41.6 166.5 89.0
Raspberry Pi 2 1000 MHz
126.7 143.7 246.8 242.7 94.0 108.2 218.5 273.4 172.7 135.8 82.6 70.8
69.0 78.3 140.1 139.3 218.5 170.7 147.7 128.6 80.0 46.7 184.5 99.1
Raspberry Pi 2 900 MHz gcc 4.8
132.0 163.4 223.8 220.6 85.4 126.3 217.5 212.5 189.9 123.4 99.3 56.0
67.9 83.9 125.0 133.2 202.0 180.8 160.3 125.1 86.3 42.5 185.5 127.5
Raspberry Pi 2 1000 MHz gcc 4.8
139.0 166.2 244.9 243.7 88.1 140.1 232.0 234.5 210.7 136.1 109.1 61.6
74.8 92.8 137.9 147.0 223.1 199.2 177.0 133.8 95.2 47.0 204.6 140.9
Raspberry Pi 3 1200 MHz
191.8 242.9 295.6 292.1 139.6 165.7 362.0 435.4 282.7 162.4 108.1 85.0
82.1 107.0 223.8 208.4 358.6 277.4 208.8 201.9 113.9 55.6 305.2 148.6
Raspberry Pi 3 1200 MHz gcc 4.8
192.9 228.0 398.4 337.4 124.6 167.5 359.7 384.3 347.7 171.6 132.5 74.7
83.9 109.1 225.4 221.2 307.9 288.6 202.2 211.9 114.7 56.9 300.2 170.1
Raspberry Pi 3 1200 MHz gcc-6
OpenSuse 64 Bit Working
468.5 260.9 474.4 463.7 196.7 179.6 649.0 399.9 426.1 223.5 148.9 215.3
109.0 140.8 256.3 226.2 386.4 454.5 291.7 246.1 273.6 99.5 316.7 183.3
Gentoo 64 Bit Working
462.9 256.1 465.6 454.5 193.1 178.4 627.3 366.3 417.9 215.1 146.2 211.3
107.1 136.6 251.0 222.3 379.7 446.9 286.5 240.9 253.1 91.6 314.5 180.1
Android
ARM 926EJ 800 MHz
5.6 6.4 6.2 6.1 4.6 4.9 5.9 6.1 6.0 9.0 5.8 3.9
4.0 3.6 3.8 5.6 7.6 4.5 5.7 4.3 5.2 2.5 5.7 7.4
ARM v7-A9 800 MHz
172.6 127.5 253.2 248.6 71.6 141.2 197.6 190.4 202.3 109.2 55.2 51.2
54.1 51.5 100.0 144.1 192.1 139.4 130.1 105.4 111.2 63.1 136.3 56.8
ARM v7-A9 1200 MHz
241.7 233.4 383.5 388.7 98.4 147.1 293.1 258.5 314.6 181.1 99.1 95.3
80.6 68.1 171.6 226.9 346.2 176.9 202.6 184.9 119.5 102.1 200.9 88.5
ARM v8-A53 1300 MHz
163.4 243.4 272.1 270.3 109.5 111.2 282.2 389.0 360.6 219.6 124.0 61.8
67.6 87.4 27.3 224.2 340.1 241.9 168.5 198.8 120.2 120.6 277.7 79.1
ARM v8-A53 1300 MHz 64 Bit
451.4 191.4 243.2 272.4 144.9 144.5 749.4 411.1 453.6 261.1 138.0 206.1
122.5 130.1 215.0 249.8 411.6 395.4 241.7 248.1 152.8 118.7 317.2 103.7
Linux using CC
Intel Atom 1666 MHz
308 297 480 468 206 175 312 308 406 125 169 140
64 101 122 216 236 195 220 134 188 61 304 94
Core 2 2400 MHz
1952 1302 1583 1527 341 1186 2184 2263 2155 1184 800 795
162 396 371 874 1341 1029 509 384 1597 174 1190 558
Linux using older GCC
Intel Atom 1666 MHz
260 250 336 374 167 178 312 306 406 128 168 105
64 99 121 212 228 194 224 134 197 56 304 99
Core 2 2400 MHz
1953 1223 1584 1534 343 1238 2192 2385 2147 1187 795 479
161 396 276 956 1368 959 509 385 1385 165 1182 560
Memory Speed Benchmark - memspeedPiA6, memspeedPiA7, memSpdPi64
The MemSpeed benchmark measures data reading speeds in MegaBytes per second, carrying out calculations on arrays of
cache and RAM data, normally sized 2 x 4 KB to 2 x 4 MB. Calculations are as shown in the results headings. For the
first two double precision tests, speed in Million Floating Point Operations Per Second (MFLOPS) can be calculated by
dividing MB/second by 8 and 16 respectively. For single precision, divide by 4 and 8. A disassembly showed that Millions
of [Assembler] Instructions Per Second (MIPS), for the first two integer tests, can be calculated by multiplying MB/second
by 0.78 and 0.59. For the three copy tests, MIPS are MB/second times 0.344 for double precision and 0.688 for the
other two. These calculations are shown below. Note that the changes in speeds, as data size increases, indicate the
sizes of caches. As different instruction counts are produced with later NEON compilations, MOPS are shown for the
first integer test.
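The conversions described above can be sketched in Python as follows. The divisors and multipliers are those given in the text. The labels "first", "second" and "copy" for the three groups of tests, and the function names, are assumptions made for this illustration.

```python
# MB/second divided by these values gives MFLOPS.
MFLOPS_DIVISOR = {
    ("first", "Dble"): 8,    # x[m]=x[m]+s*y[m], double precision
    ("first", "Sngl"): 4,    # x[m]=x[m]+s*y[m], single precision
    ("second", "Dble"): 16,  # x[m]=x[m]+y[m], double precision
    ("second", "Sngl"): 8,   # x[m]=x[m]+y[m], single precision
}

# MB/second multiplied by these values gives MIPS (from the disassembly).
MIPS_FACTOR = {
    ("first", "Int32"): 0.78,
    ("second", "Int32"): 0.59,
    ("copy", "Dble"): 0.344,
    ("copy", "Sngl"): 0.688,
    ("copy", "Int32"): 0.688,
}


def mflops(mb_per_sec: float, test: str, precision: str) -> float:
    """Convert a floating point test speed from MB/second to MFLOPS."""
    return mb_per_sec / MFLOPS_DIVISOR[(test, precision)]


def mips(mb_per_sec: float, test: str, precision: str) -> float:
    """Convert an integer or copy test speed from MB/second to MIPS."""
    return mb_per_sec * MIPS_FACTOR[(test, precision)]


# Maximum speeds from the 700 MHz Raspberry Pi results.
print(mflops(568, "first", "Dble"))         # 71.0
print(mflops(640, "first", "Sngl"))         # 160.0
print(round(mips(930, "first", "Int32")))   # 725
print(round(mips(1094, "second", "Int32"))) # 645
```

These values correspond to the Max MFLOPS and Max MIPS lines at the end of the 700 MHz results.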
The two executables are memspeedPiA6 for the Raspberry Pi and memspeedIL for Intel/Linux. Particularly for the latter,
the default maximum of 8 MB might be too small to demonstrate RAM speed. For either, a run time parameter is provided
to use more memory, up to 128, 256, 512 or 1024 MB - examples memspeedPiA6 MB 256 and memspeedIL MB 1024.
Raspberry Pi CPU 700 MHz, Core 400 MHz, SDRAM 400 MHz
Memory Reading Speed Test 32 Bit Version 4 by Roy Longbottom
Start of test Mon May 20 10:25:17 2013
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
8 538 640 930 602 731 1094 1230 465 465 L1
16 568 602 787 602 731 1023 1000 426 507
32 292 256 310 276 262 330 1066 426 547 L2
64 276 238 276 262 238 292 341 269 284
128 189 170 193 182 170 200 222 196 204
256 140 129 142 136 129 144 138 119 124 RAM
512 138 127 138 134 127 144 131 111 119
1024 136 127 138 134 127 144 124 111 119
2048 136 127 138 132 128 144 128 111 121
4096 136 128 138 134 126 144 128 111 119
8192 138 127 138 136 127 144 126 111 119
End of test Mon May 20 10:26:06 2013
Max MFLOPS 71 160 38 91
Max MIPS 725 645 423 320 320
Max MOPS 233
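As noted earlier, cache sizes show up as steps in the speed figures. A minimal Python sketch that flags these steps is given below, using the first double precision column from the results above. The 80% threshold and the function name are arbitrary choices for this illustration, not part of the benchmark.

```python
# KBytes used and MB/second for the first Dble column (700 MHz Raspberry Pi).
sizes_kb = [8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
dble_mb_per_sec = [538, 568, 292, 276, 189, 140, 138, 136, 136, 136, 138]


def speed_drops(sizes, speeds, threshold=0.8):
    """Return the data sizes where speed falls below threshold times the
    speed measured at the previous (smaller) data size."""
    drops = []
    for i in range(1, len(sizes)):
        if speeds[i] < threshold * speeds[i - 1]:
            drops.append(sizes[i])
    return drops


print(speed_drops(sizes_kb, dble_mb_per_sec))  # [32, 128, 256]
```

The first step, at 32 KB, is where the data no longer fits in L1 cache. Speeds then fall away in stages as data spills from L2 cache to RAM, and are flat from 256 KB upwards.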
Memory Speed Comparison
The first results below are for the Raspberry Pi at the maximum overclocked settings. The overheads on repetitively
running the tests cause variations in speeds of the lower data sizes but average overclocked speed gain, using L1
cache, is 1.41 times, compared with 1.43 times CPU MHz. Average RAM speed gains are 1.53 times, similar to
expectations. A surprise is for L2 cache based data, where the average gain is 1.72 times and some speeds appear to
be faster than using L1 cache.
Comparing 900 MHz Raspberry Pi 2 results, from gcc 4.8 (PiA7), with the original system, at 700 MHz, indicates average
performance gains of 3.3, 5.3 and 3.8 times for L1 cache, L2 cache and RAM based data, increased from the old PiA6
version at 2.4, 4.5 and 3.5 times. The first calculations are the same as those that determine Linpack benchmark
speeds. In this case, gcc 4.8 single precision speeds are again slower than using the original benchmarks (PiA7 vfma.f32
instructions and PiA6 fmacs). The PiA7 integer calculations provide the highest performance gains from cached data,
the test loop containing 2 vector loads to quad word registers (vld1.32), 2 vector adds (vadd.i32) and one vector
store (vst1.32), compared with 8 loads, 8 adds and 4 stores in PiA6.
Results for a version compiled to use NEON instructions, providing some of the fastest speeds, are included below. For
more details see MemSpeed NEON. Later results are for the same code compiled for Android devices, less the copy
tests, where the later ARM systems are considerably faster. In this case, the Pi performs relatively well on single
precision floating point. For other results see Android Benchmarks.htm.
The other results are using the Intel/Linux version, where speeds are generally much faster. An exception is L1 cache
speed using single precision floating point, where the Pi is faster than the Atom on a MFLOPS/MHz basis. For older PC
speeds that are slower than the Raspberry Pi see MemSpd2k results.htm.
Compared to default speed Raspberry Pi 2 results, RPi 3 L1 cache performance is not much faster than the 1.33 times
clock MHz ratio, but L2 results are more than twice as fast, where RPi 3 L2 and L1 cache speeds are similar. Average
RPi 3 RAM MB/second measurements indicate an improvement of 2.5 times, where memory clock speed is
double. The Cortex-A53 based Android tablet 64 bit performance is generally much faster than from the 32 bit
compilation, but the 32 bit compiler is not as effective as that used for the Raspberry Pi. Best 64 bit gains are when
using 64 bit double precision numbers, where cache based speed can be twice that from the RPi 3.
Raspberry Pi 3 SUSE and Gentoo 64 Bits - Results below include benchmarks compiled with gcc 4.8 and gcc 6, run
via SUSE, with the latter also run using Gentoo. These are followed by comparisons of L1 cache, L2 cache and RAM
speeds. The first is for SUSE/Gentoo, where speeds are essentially the same. Next comparisons are for gcc 6/gcc 4.8
then gcc 6/32 bit A7. The former indicates gains on DP calculations using caches, and the latter on all L1 cache speeds
and L2 cache speeds, other than for some integer tests. Both are more efficient running the last data copying
procedures. For the first DP tests, gcc 4.8 and gcc 6 both use 64 bit fused multiply and add vector instructions, gcc 4.8
being slower due to using additional and different load instructions - see Assembly Code. Compile options to use NEON
instructions for MemSpeed NEON are not available at 64 bit working.
Raspberry Pi CPU 1000 MHz, Core 500 MHz, SDRAM 600 MHz, 6 volts
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
8 602 640 1185 930 1163 1662 1422 511 761 L1
16 787 930 1292 853 1023 1523 1777 537 761
32 487 426 487 465 426 568 1939 820 1142 L2
64 465 393 465 426 393 511 592 457 508
128 330 310 341 320 301 365 341 301 341
256 208 200 213 204 200 217 196 170 189 RAM
512 204 200 213 200 200 213 196 176 182
1024 213 200 208 200 200 217 196 170 182
2048 204 196 213 204 200 217 196 170 182
4096 204 200 213 200 200 217 196 170 182
8192 204 200 213 200 200 218 204 169 182
Max MFLOPS 98 232 58 145
Max MIPS 1007 980 667 563 785
Max MOPS 323
############################## RPi 2 ##################################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
PiA6
8 731 1280 1142 2133 1454 1422 2666 1523 1641 L1
16 1066 1333 1292 1969 1406 1523 2666 1523 1641
32 1023 1293 1094 1828 1333 1406 2051 1428 1428
64 930 1016 1067 1662 1185 1230 1230 1333 1333 L2
128 853 1016 1023 1524 1186 1186 1163 1454 1333
256 853 1068 930 1423 1186 1186 1143 1455 1455
512 602 853 787 1168 853 930 1144 1027 1066
1024 365 512 393 465 538 426 984 511 465 RAM
2048 310 445 310 353 465 330 853 496 496
4096 301 445 301 341 445 330 834 546 511
8192 307 446 317 351 446 338 945 580 580
Max MFLOPS 133 333
Max MOPS 323
############################## RPi 2 ##################################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
PiA7
8 929 832 2047 2044 1366 2862 2035 2690 2845
16 1398 1197 2050 2049 1368 2868 2044 2861 2861
32 1264 1094 1768 1773 1227 2272 1700 2159 2160
64 1195 1042 1634 1635 1161 1997 1450 1479 1488
128 1133 991 1512 1526 1095 1792 1154 1121 1124
256 961 981 1500 1506 1089 1787 1132 1078 1064
512 629 669 895 878 717 979 1146 786 788
1024 400 396 470 458 413 496 943 642 644
2048 326 313 357 354 328 374 958 678 678
4096 322 311 354 351 326 372 954 721 718
8192 325 311 355 353 327 372 952 732 733
Max MFLOPS 175 299
Max MOPS 512
########################### RPi 2 OC ##################################
Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
PiA6
8 682 853 1306 2327 1523 1777 2909 1777 1777
16 1185 1406 1306 2327 1523 1777 2666 1777 1777
32 1185 1333 1333 1969 1523 1523 2279 1641 1641
64 1023 1293 1094 1778 1333 1306 1882 1599 1428
128 1023 1186 1094 1641 1230 1333 1641 1524 1539
256 930 1142 1016 1642 1333 1333 1778 1429 1524
512 682 930 787 1094 930 930 1642 1068 984
1024 465 602 487 568 639 538 1168 618 618
2048 379 538 409 465 538 409 914 597 597
4096 379 538 379 445 538 409 904 658 682
8192 378 546 393 446 546 427 819 750 760
Max MFLOPS 148 351
Max MOPS 333
PiA7
8 918 928 2261 2258 1509 3162 2248 3142 3143
16 1547 1322 2265 2264 1511 3168 2258 3160 3160
32 1536 1314 2251 2245 1501 3146 2247 3141 3130
64 1296 1135 1773 1776 1263 2134 1795 1789 1797
128 1226 1098 1679 1676 1213 1996 1822 1483 1486
256 1013 985 1442 1446 1083 1672 1549 1311 1304
512 568 553 694 682 579 742 1371 989 993
1024 473 465 550 548 485 591 1279 913 916
2048 413 400 459 456 415 484 943 688 688
4096 410 398 455 446 411 480 871 620 620
8192 411 399 457 454 412 482 847 601 600
Max MFLOPS 193 330 142 189
Max MOPS 566
########################### RPi 2 NEON #################################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
8 918 1778 2031 2029 2369 2838 2020 2825 2823
16 1388 1781 2034 2034 2374 2847 2029 2840 2828
32 1380 1768 2021 2020 2357 2811 2024 2832 2831
64 1169 1435 1595 1597 1785 1924 1573 1392 1391
128 1124 1366 1509 1513 1688 1794 1608 990 986
256 875 1163 1270 1269 1391 1460 1163 892 900
512 675 886 953 941 1022 1074 1081 776 785
1024 363 401 409 399 419 428 904 596 596
2048 318 338 341 343 355 362 751 539 541
4096 316 333 339 339 351 359 720 501 503
8192 317 334 340 340 352 361 709 483 484
Max MFLOPS 174 445 127 297
Max MOPS 509
######################## RPi 2 NEON OC ################################
Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2
Memory Reading Speed Test NEON 32 Bit Version 1 by Roy Longbottom
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
NEON
8 1542 1963 2257 2253 2633 3143 1672 2143 3078
16 1542 1978 2248 2258 2638 3163 2247 3111 3116
32 1402 1744 1961 1965 2221 2481 1958 2532 2534
64 1303 1596 1770 1778 1988 2146 1700 1756 1756
128 1242 1508 1665 1667 1862 1977 1599 1458 1467
256 976 1276 1376 1395 1532 1483 1610 1313 1315
512 756 966 1031 1020 1111 1156 1643 1099 1107
1024 476 544 569 554 584 606 1376 953 956
2048 401 432 447 444 458 471 1268 968 967
4096 401 429 443 436 455 466 1239 1043 1039
8192 404 434 448 446 460 472 1001 777 779
Max MFLOPS 193 493 141 330
Max MOPS 562
############################## RPi 3 ##################################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
PiA6 8 1523 1777 1828 2461 1969 2327 3657 2285 2461 L1
16 1662 1777 1828 2285 2133 2327 3846 2285 2461
32 1662 1777 1939 2461 1969 2327 3657 2461 2381
64 1524 1641 1778 2133 1969 1969 3657 2279 2285 L2
128 1524 1778 1828 2328 1829 2133 3657 2279 2279
256 1525 1779 1828 2327 1828 2001 3657 2280 2286
512 1456 1642 1779 2133 1832 1969 3413 2287 2135
1024 930 1094 1094 1236 1186 1186 1232 1144 1070 RAM
2048 930 992 1023 1102 1102 853 1066 914 921
4096 930 1023 1092 1102 1102 1102 834 837 834
8192 893 983 1071 1111 1160 1071 976 945 877
Max MFLOPS 208 444
Max MOPS 485
PiA7 8 1619 1812 3448 2375 2237 3793 2698 3121 3147
16 1621 1814 3459 2379 2240 3793 2710 3136 3162
32 1577 1743 3243 2277 2132 3138 2702 3123 3131
64 1537 1690 3126 2196 2047 3362 2566 2890 2917
128 1570 1714 3257 2243 2076 3502 2624 2993 3027
256 1573 1720 3285 2261 2084 3522 2652 3071 2930
512 1453 1598 2785 2055 1906 2081 2430 2783 2815
1024 918 1097 1327 1204 1185 1355 1606 1261 1263
2048 891 1032 1224 1133 1113 1191 882 811 817
4096 885 1023 1223 1127 1104 1201 787 756 755
8192 876 1019 1225 1118 954 1203 876 871 873
Max MFLOPS 203 454 149 280
Max MOPS 865
########################### RPi 3 NEON #################################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Memory Reading Speed Test NEON 32 Bit Version 1 by Roy Longbottom
8 1627 2387 3467 2387 3181 3812 2713 3164 3149
16 1621 2377 3457 2377 3169 3805 2713 3164 3165
32 1577 2273 3238 2280 2985 3535 2647 3103 3105
64 1526 2150 3018 2157 2793 3256 2568 2921 2921
128 1554 2217 3190 2216 2925 3436 2631 3028 3029
256 1561 2228 3225 2221 2948 3471 2654 3077 3077
512 1434 2010 2742 1978 2534 2313 2468 2840 2840
1024 950 1227 1324 1182 1306 1339 1581 1298 1298
2048 935 1136 1215 1128 1212 1214 915 880 885
4096 913 1121 1180 1131 1213 1212 825 844 842
8192 926 1134 1212 1126 936 1199 792 774 790
Max MFLOPS 203 594 149 396
Max MOPS 864
############################# RPi 3 SUSE ##############################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Compiled for 64 bit ARM v8a
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
8 4224 2547 3813 5433 3469 4237 4717 3793 3794 L1
16 4211 2546 3820 5423 3469 4236 4759 3815 3815
32 3380 2287 3225 4132 3003 3526 4603 3752 3752
64 3290 2266 3179 3994 2966 3451 4539 3724 3723 L2
128 3386 2321 3301 4039 3076 3567 4359 3589 3590
256 3342 2346 3359 4096 3132 3643 4355 3593 3593
512 2961 2070 2824 3371 2640 3025 3599 3087 3082
1024 757 1268 1344 1331 1341 1369 1487 1479 1419 RAM
2048 756 959 1227 1193 1226 1254 1134 1237 1212
4096 699 952 1230 1226 1226 1248 1063 1173 1165
8192 754 1169 1203 1206 1207 1210 1036 1045 1033
Max MFLOPS 528 637 340 433
Max MOPS 955
############################## RPi 3 Gentoo #############################
8 4158 2503 3749 5341 3411 4164 4639 3729 3730 L1
16 4014 2506 3758 5359 3416 4174 4675 3751 3750
32 3925 2483 3722 5307 3384 4125 4665 3712 3712
64 3253 2301 3271 4121 3043 3581 4342 3544 3535 L2
128 3196 2360 3394 4190 3165 3719 4221 3487 3484
256 3125 2385 3437 4225 3201 3767 4215 3501 3504
512 672 2079 2937 3551 2725 3223 3858 3249 3255
1024 618 1189 1266 1265 1255 1274 1156 1433 1355 RAM
2048 607 1133 1183 1162 1178 1194 978 1027 1026
4096 619 1135 1185 1170 1175 1200 995 1060 1048
8192 554 1140 1189 1171 1178 1206 1009 1081 1081
Max MFLOPS 520 627 335 427
Max MOPS 940
############################# RPi 3 SUSE ##############################
Compiled for 64 bit ARM v8a+fp+sim
8 2726 2544 3468 4013 3468 4233 4206 3791 3788 L1
16 2728 2552 3477 4026 3478 4247 4232 3814 3814
32 2557 2392 3190 3611 3191 3812 4248 3819 3822
64 2416 2248 2961 3246 2961 3478 4037 3725 3728 L2
128 2452 2276 3038 3283 3025 3530 3908 3567 3566
256 2414 2313 3093 3350 3088 3600 3940 3594 3594
512 2156 2027 2603 2779 2583 2989 3473 3255 3075
1024 707 954 1315 1330 1314 1330 1597 1591 1538 RAM
2048 704 955 1146 1148 1134 1156 1038 1039 1037
4096 697 983 1136 1135 1109 1142 843 907 898
8192 694 1106 1140 1135 1141 1136 877 957 940
Max MFLOPS 341 636 251 435
Max MOPS 869
######################## 64 Bit Comparison #############################
x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
L1 16KB
SUSE/Gentoo 1.05 1.02 1.02 1.01 1.02 1.01 1.02 1.02 1.02
SUSE/gcc 4.8 1.54 1.00 1.10 1.35 1.00 1.00 1.12 1.00 1.00
SUSE/32 bit 2.60 1.40 1.10 2.28 1.55 1.12 1.76 1.22 1.21
L2 256 KB
SUSE/Gentoo 1.07 0.98 0.98 0.97 0.98 0.97 1.03 1.03 1.03
SUSE/gcc 4.8 1.38 1.01 1.09 1.22 1.01 1.01 1.11 1.00 1.00
SUSE/32 bit 2.12 1.36 1.02 1.81 1.50 1.03 1.64 1.17 1.23
RAM 4 MB
SUSE/Gentoo 1.13 0.84 1.04 1.05 1.04 1.04 1.07 1.11 1.11
SUSE/gcc 4.8 1.00 0.97 1.08 1.08 1.11 1.09 1.26 1.29 1.30
SUSE/32 bit 0.79 0.93 1.01 1.09 1.11 1.04 1.35 1.55 1.54
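The ratios in the comparison above appear to be calculated column by column from single rows of the result tables. The following Python sketch reproduces the three "L1 16KB" lines, on the assumption that the 16 KB rows are used and that "32 bit" refers to the PiA7 (non-NEON) results. The variable names are illustrative.

```python
# 16 KB rows, nine columns each, copied from the result tables above.
suse_gcc6 = [4211, 2546, 3820, 5423, 3469, 4236, 4759, 3815, 3815]
gentoo_gcc6 = [4014, 2506, 3758, 5359, 3416, 4174, 4675, 3751, 3750]
suse_gcc48 = [2728, 2552, 3477, 4026, 3478, 4247, 4232, 3814, 3814]
pia7_32bit = [1621, 1814, 3459, 2379, 2240, 3793, 2710, 3136, 3162]


def speed_ratios(numerator, denominator):
    """Column by column speed ratios, rounded to two decimal places."""
    return [round(n / d, 2) for n, d in zip(numerator, denominator)]


print("SUSE/Gentoo ", speed_ratios(suse_gcc6, gentoo_gcc6))
print("SUSE/gcc 4.8", speed_ratios(suse_gcc6, suse_gcc48))
print("SUSE/32 bit ", speed_ratios(suse_gcc6, pia7_32bit))
```

The same function can be applied to the 256 KB and 4096 KB rows for the L2 cache and RAM comparisons.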
######################## Other Cortex A53 ##############################
Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53
ARM/Intel MemSpeed Benchmark 1.2 05-Aug-2015 17.16
Compiled for 32 bit ARM v7a
Reading Speed in MBytes/Second
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m]
KBytes Dble Sngl Int Dble Sngl Int
16 1940 971 1693 2470 1278 2084 L1
32 1879 955 1676 2378 1255 1967
64 1801 938 1615 2254 1218 1912 L2
128 1706 941 1620 2279 1224 1872
256 1818 935 1570 2291 1155 1875
512 1633 884 1451 2008 1132 1704
1024 1276 781 1181 1454 938 1324 RAM
4096 1335 808 1260 1533 1010 1386
16384 1342 813 1270 1487 1013 1419
65536 1346 809 1274 1546 1031 1252
Max MFLOPS 242 243 154 160
Max MOPS 419
ARM/Intel MemSpeed Benchmark 1.2 05-Aug-2015 17.29
Compiled for 64 bit ARM v8a
16 4092 2198 3951 5293 3611 4408
32 3753 2496 3630 4651 3300 3992
64 3407 2388 3368 3715 3023 3677
128 3496 2462 3521 4137 3139 3844
256 3535 2481 3573 4199 3322 3911
512 3054 2248 3126 3556 2548 3372
1024 1714 1704 2029 2069 1854 2099
4096 1832 1595 1841 1914 1780 1897
16384 1844 1601 1850 1925 1798 1891
65536 1859 1608 1837 1921 1795 1812
Max MFLOPS 512 624 331 451
Max MOPS 988
############################# Other ####################################
Android MemSpeed Benchmark 17-Oct-2012 20.19
ARM Cortex-A9 1300 MHz, 1 GB DDR3 RAM
Reading Speed in MBytes/Second
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m]
KBytes Dble Sngl Int Dble Sngl Int
16 1735 888 2456 2726 1364 2818 L1
32 1448 760 1474 1700 1039 1648
64 1318 719 1290 1468 952 1385 L2
128 1279 715 1289 1443 944 1336
256 1268 714 1279 1435 943 1313
512 1158 691 1204 1321 892 1228
1024 729 553 735 772 632 742
4096 445 392 425 442 421 439 RAM
16384 435 390 428 435 412 431
65536 445 404 393 450 432 449
Intel Atom 1666 MHz memspeedIL
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
8 1720 853 2150 2203 1086 3686 1379 1851 1785 L1
16 1612 825 2051 2150 1075 2962 1599 1777 1612
32 1517 825 1785 2019 1041 2666 1290 1388 1379 L2
64 1470 825 1785 2051 1041 2580 1379 1333 1646
128 1724 948 2272 2580 1358 3463 1612 1785 1785
256 1725 948 2299 2499 1403 3572 1613 1731 1785
512 1624 914 2151 2349 1315 3228 1533 1670 1668
1024 1590 882 1990 2155 1296 2515 1251 1292 1292 RAM
2048 1590 882 1998 2095 1263 2235 1081 1117 1076
4096 1553 914 1951 2111 1279 2180 1076 1084 1055
8192 1592 910 1985 2113 1279 2171 1092 1085 1119
Core 2 2400 MHz, Dual channel DDR2 RAM, memspeedIL
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
8 17427 6736 6249 12498 6450 6399 12498 6348 6348 L1
16 13839 6450 6249 12498 6450 6450 12985 6348 6249
32 16664 6249 6450 13134 6399 6450 12498 6348 6143
64 10751 4999 5262 7528 4999 5332 5119 3555 3555 L2
128 7831 4999 5332 7313 4999 5333 5119 3703 3703
256 11494 4999 5332 7691 4999 5333 5208 3555 3656
512 11347 5160 5333 7313 4999 5264 5209 3555 3656
1024 9142 5160 5333 7699 5160 5332 5211 3707 3656
2048 10239 5007 5341 7528 4949 5341 5119 3555 3451
4096 7110 4790 5023 6920 4790 5023 4013 3135 3236
8192 3949 3686 3813 4031 3794 3794 2047 2015 1974 RAM
Bus Speed Benchmark - busspeedPiA6, busspeedPiA7, busSpdPi64
This benchmark is designed to identify data being read in bursts over buses. The program starts by reading a word (4
bytes) with an address increment of 32 words (128 bytes) before reading another word. The increment is reduced by
half on successive tests, until all data is read.
Maximum MB/second data transfer speed is calculated as bus clock MHz x 2 for Double Data Rate (DDR) x bus width (at
this time 4 bytes ARM, 8 bytes Intel) x number of memory channels. However, some of these specifications can be
misleading and maximum speed options might not be provided on a particular platform. Where the maximum is not
provided, there can be confusion as to whether specified MHz is raw bus clock speed or included DDR consideration.
One thing is quite clear, and that is multiple threads or programs are required to demonstrate highest obtainable
throughput and this should be less than maximum specified speed due to start up (CAS latency) and other overheads.
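The maximum speed formula above can be written as a short Python function. This is a sketch only: the function and parameter names are illustrative, and it is an assumption that the "x16#" and "x8#" width entries in the table below represent 8 byte and 4 byte buses with two memory channels.

```python
def max_bus_mb_per_sec(bus_clock_mhz: float, width_bytes: int,
                       channels: int = 1, ddr: bool = True) -> float:
    """Theoretical maximum MB/second = bus clock MHz x 2 (for DDR)
    x bus width in bytes x number of memory channels."""
    transfers_per_clock = 2 if ddr else 1
    return bus_clock_mhz * transfers_per_clock * width_bytes * channels


print(max_bus_mb_per_sec(400, 8))              # Old Atom, 6400
print(max_bus_mb_per_sec(533, 8, channels=2))  # Atom Z3745, 17056
print(max_bus_mb_per_sec(933, 4, channels=2))  # Snapdragon 800, 14928
print(max_bus_mb_per_sec(450, 4))              # Raspberry Pi 2 and 3, 3600
print(max_bus_mb_per_sec(400, 4))              # original Raspberry Pi, 3200
```

These are specification limits only. As explained above, measured throughput should be lower due to CAS latency and other overheads.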
In order to minimise CPU time influence, estimates of maximum MB/second can be calculated from burst speeds (as
shown below for 16 word address increments), and these should normally be greater than the Read All results. In the
original benchmark, all threads started reading from word one, but this could lead to unreasonably fast speeds when
shared L2 or L3 caches were provided. Results below are for a revised benchmark with staggered starting addresses,
for example 4 threads at 3 MB intervals using 12 MB RAM.
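The "Max est" figures in the table below are 16 times the 4 thread speed at 16 word address increments. The Python sketch below repeats that arithmetic and expresses each estimate as a percentage of the specified maximum. Figures are copied from the table and the names used are illustrative.

```python
BURST_WORDS = 16  # address increment, in 4 byte words, of the Inc16 test

# system: (4 thread Inc16 MB/second, specified maximum MB/second)
systems = {
    "Atom Z3745": (455, 17056),
    "Nexus 7 Cortex-A9": (68, 5333),
    "Kindle HDX 7 Snapdragon 800": (605, 14928),
    "Lenovo Tab 2 Cortex-A53": (277, 5333),
    "Moto G4 Cortex-A53": (353, 7466),
    "Raspberry Pi 2": (98, 3600),
    "Raspberry Pi 3": (137, 3600),
}


def burst_estimate(inc16_mb_per_sec: float) -> float:
    """Estimated maximum MB/second, assuming each word read at a 16 word
    increment causes a full burst of 16 words to be transferred."""
    return BURST_WORDS * inc16_mb_per_sec


for name, (inc16, specified) in systems.items():
    estimate = burst_estimate(inc16)
    percent = 100.0 * estimate / specified
    print(f"{name:28s} {estimate:6.0f} MB/sec {percent:5.1f}% of specified")
```

For example, the Raspberry Pi 3 estimate is 16 x 137 = 2192 MB/second, which is below the 3600 MB/second specification, as expected.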
Multithreaded benchmark results are provided below to help to identify why the single core BusSpeed benchmark
speeds might be different from expectation. For comparison purposes, results are included from the Android MP
Benchmarks, besides those from the BusSpeed section of the Raspberry Pi Multithreading Benchmarks.
                                                             Bus
                        Inc16   Inc8   Inc4   Inc2   Read  Clock    DDR  Width     Max
                        Words  Words  Words  Words    All    MHz     x2  Bytes  MB/sec
Old Atom                  262    541   1048   1973   3262    400    800     x8    6400
Atom 1.86 GHz
Z3745                     275    611   1183   2328   3922    533   1066   x16#   17056
2 Threads                 435    787   1671   3323   6507
4 Threads                 455    884   1754   3490   6971     Max est 16 x 455    7280
Nexus 7 1.2 GHz
Cortex-A9                  51     81    126    172    330    666   1333     x4    5333
2 Threads                  67    107    196    335    620
4 Threads                  68    108    215    426    835      Max est 16 x 68    1088
Kindle HDX 7 2.15 GHz
Snapdragon 800            406    516    899   1663   2929    933   1866    x8#   14928
2 Threads                 541    962   1569   2851   4776
4 Threads                 605   1109   2439   4161   8243     Max est 16 x 605    9680
Lenovo Tab 2 1.3 GHz
Cortex-A53                175    344    677   1285   1979    666   1333     x4    5333
2 Threads                 241    479    968   1883   3724
4 Threads                 277    556   1130   2126   4328     Max est 16 x 277    4432
Moto G4 1.5 GHz
Cortex-A53                172    339    658   1247   2014    933   1866     x4    7466
2 Threads                 307    591   1124   2192   3839
4 Threads                 353    813   1692   3015   6058     Max est 16 x 353    5648
Raspberry Pi 2 0.9 GHz
ARM-V7                     71    159    281    628   1147    450    900     x4    3600
2 Threads                  87    177    311    697   1256
4 Threads                  98    191    297    700   1186      Max est 16 x 98    1568
Raspberry Pi 3 1.2 GHz
Cortex-A53                136    263    513   1047   2080    450    900     x4    3600
2 Threads                 138    276    554   1108   2149
4 Threads                 137    269    536   1169   2383     Max est 16 x 137    2192
# dual channel
Below are the Raspberry Pi results from the busSpeed.txt log file, running at the default speed settings. The program's main
test has 64 C statements that translate into 64 load and 64 AND instructions. With loop overheads, that translates to
132 instructions on 256 bytes, so MIPS will be MB/second x 0.516. The results suggest that data transfer bursts
are 32 bytes (8 transfers of 4 bytes), with a possible maximum speed of 8 x 34 = 272 MB/second, at this single core
level. They imply that there is also burst reading from caches besides using RAM, and performance of the latter is not
very good, with this single core CPU.
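The two conversions in this paragraph can be checked with a few lines. This is only a sketch of the arithmetic; the constant and function names are invented for illustration, and the 32 byte burst size is the inference made in the text, not a measured hardware value.

```python
INSTRUCTIONS_PER_PASS = 132  # 64 loads + 64 ANDs + loop overheads, as stated above
BYTES_PER_PASS = 256         # 64 words of 4 bytes


def mips_from_mb_s(mb_s):
    """Instructions per microsecond (MIPS) implied by a Read All speed in MB/second."""
    return mb_s * INSTRUCTIONS_PER_PASS / BYTES_PER_PASS


def burst_limited_max(inc_speed_mb_s, words_per_burst=8):
    """Possible maximum MB/second if each word read brings in a whole burst."""
    return inc_speed_mb_s * words_per_burst


if __name__ == "__main__":
    print(round(INSTRUCTIONS_PER_PASS / BYTES_PER_PASS, 3))  # conversion factor
    print(round(mips_from_mb_s(1142)))  # L1 cache Read All speed from the log below
    print(burst_limited_max(34))        # RAM speed at Inc8 to Inc32
```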
Raspberry Pi CPU 700 MHz, Core 400 MHz, SDRAM 400 MHz
Maximum speed 400 x 2 (DDR) x 4 Width = 3.2 GB/sec
BusSpeed 32 Bit V1.1 Wed May 22 15:28:01 2013
Reading Speed 4 Byte Words in MBytes/Second
Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read
KBytes Words Words Words Words Words All
16 290 304 568 984 1125 1142 L1
32 133 116 131 133 225 465 L2
64 116 98 116 109 192 409
128 60 54 62 68 126 273
256 34 34 34 43 88 192 RAM
512 34 34 34 45 91 200
1024 34 31 34 45 91 181
4096 32 33 33 45 87 183
16384 32 32 34 44 83 186
65536 34 32 34 44 88 186
Bus Speed Results and Comparison
The first one for comparison is the overclocked Pi, where most results are as might be expected at the higher clock
frequencies but, again, with some L2 cache speeds quite a bit faster.
Raspberry Pi 2 results are shown with the CPU at 900 MHz and overclocked to 1 GHz, where the corresponding SDRAM
frequencies are 450 and 500 MHz. The busspeedPiA6 speeds are most unusual on reading all data, where speed is
slower than reading every other word. Assembly Code appears to show that there is little difference in
generated instructions between the two versions, except that PiA7 uses negative indexing. Comparisons, shown with PiA7 1
GHz details, suggest that speed from RAM is at least 2.5 times faster from gcc 4.8. The other comparisons are for
busspeedPiA6, where the highest performance gains, of RPi 2, are via data in L2 cache.
Next results are for one CPU core on a Nexus 7, with a 1300 MHz ARM Cortex-A9 processor. The overclocked Pi is not
too far away on RAM performance but falls behind on L1 and L2 cache based data.
The two Intel examples are clearly much faster but BusSpd2k Results on PCs provides results on older systems where
the Raspberry Pi is the winner (ignore the last two columns for MMX instructions). There are also results of some slower
systems in Android Benchmarks.htm.
On the Raspberry Pi 3, busspeedPiA6, and the newer busspeedPiA7 benchmark, demonstrate almost identical
performance. With the former, considering just the Read All results, the RPi 3 is shown to average 2.85 times faster
than the RPi 2 using cache based data and 5.26 times from RAM. Corresponding ratios using PiA7 are 2.31 and 1.42
times.
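The averages quoted above can be reproduced from the Read All columns of the RPi 2 (900 MHz) and RPi 3 logs in this section. The sketch below assumes, following the L1, L2 and RAM markers in those logs, that the 16 KB to 512 KB rows count as cache based and the 1024 KB to 65536 KB rows as RAM. Variable and function names are illustrative.

```python
from statistics import mean

# Read All MB/second, rows 16, 32, 64, 128, 256, 512, 1024, 4096, 16384, 65536 KB
RPI2_A6 = [1489, 1641, 1365, 1191, 1169, 782, 412, 322, 335, 335]
RPI3_A6 = [4266, 3657, 3276, 3413, 3414, 2983, 1879, 1852, 1806, 1789]
RPI2_A7 = [1738, 1681, 1587, 1625, 1634, 1363, 1264, 1266, 1264, 1261]
RPI3_A7 = [4413, 4311, 3546, 3467, 3457, 3105, 1945, 1585, 1744, 1881]

CACHE = slice(0, 6)   # 16 KB to 512 KB (assumed cache based)
RAM = slice(6, 10)    # 1024 KB to 65536 KB


def average_gain(new, old, rows):
    """Mean of the per-row speed ratios new/old over the selected rows."""
    return mean(n / o for n, o in zip(new[rows], old[rows]))


if __name__ == "__main__":
    for name, new, old in (("PiA6", RPI3_A6, RPI2_A6), ("PiA7", RPI3_A7, RPI2_A7)):
        print(name,
              "cache %.2f" % average_gain(new, old, CACHE),
              "RAM %.2f" % average_gain(new, old, RAM))
```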
The 32 bit compiler, used for the Cortex-A53 based tablet tests, produced different performance characteristics to
those used on the Raspberry Pi, some better scores and some worse. The same might apply to the 64 bit version, but
results from RAM were faster than other A53 tablet and RPi 3 tests, the latter by an average of 50%.
Raspberry Pi 3 SUSE and Gentoo 64 Bits - With the Read All tests being the most representative of data transfers
normally used, comparisons are provided for these and there was not much difference in performance between 32 bit
PiA7, 64 bit gcc 6 and 64 bit gcc 4.8 speeds. The one exception is at 16 KB data size, where gcc 6 tests were slow.
The C code loop has 64 AND statements. Disassembly shows that the gcc 4.8 version has 64 AND and 65 load (ldr)
instructions, using 8 w registers. The gcc 6 program has 64 AND, 19 load (ldr) and 23 load pair (ldp) instructions, using
up to 16 w registers - more registers, fewer instructions but slower?
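A quick tally of the instruction counts quoted above shows that both compilers load the same number of words, since each ldp fills two registers. The helper name is made up for this sketch, and only the instructions listed in the text are counted, not any other loop overhead.

```python
def loop_summary(and_count, ldr_count, ldp_count=0):
    """Instruction and loaded-word totals for the inner loop."""
    return {
        "instructions": and_count + ldr_count + ldp_count,
        # ldr loads one w register, ldp loads a pair
        "words_loaded": ldr_count + 2 * ldp_count,
    }


if __name__ == "__main__":
    print("gcc 4.8", loop_summary(64, 65))
    print("gcc 6  ", loop_summary(64, 19, 23))
```

So the gcc 6 loop issues fewer instructions for the same data, which makes its slower 16 KB result more likely to come from scheduling or load pairing behaviour than from extra work, although the figures here cannot confirm that.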
Raspberry Pi CPU 1000 MHz, Core 500 MHz, SDRAM 600 MHz, 6 volts
Reading Speed 4 Byte Words in MBytes/Second
Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read
KBytes Words Words Words Words Words All
16 290 387 984 1505 1575 1750 L1
32 246 186 232 232 393 731 L2
64 146 113 131 148 273 546
128 102 87 93 113 210 420
256 53 48 53 75 131 303 RAM
512 48 48 50 75 137 300
1024 48 50 49 69 139 305
4096 50 52 52 72 134 299
16384 48 52 52 69 139 296
65536 49 52 49 72 139 291
############################## RPi 2 ##################################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
Reading Speed 4 Byte Words in MBytes/Second
Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read
KBytes Words Words Words Words Words All
PiA6
16 1346 1428 1575 1641 1706 1489 L1
32 930 984 1163 1422 1489 1641
64 426 372 630 1024 1365 1365 L2
128 341 380 682 1137 1462 1191
256 213 232 512 813 1191 1169
512 129 136 273 570 840 782
1024 73 83 167 360 685 412 RAM
4096 63 76 152 293 629 322
16384 69 74 149 314 599 335
65536 69 78 148 279 629 335
PiA7
16 950 1509 1632 1726 1734 1738
32 1240 1318 1437 1716 1633 1681
64 419 429 747 1214 1479 1587
128 386 411 702 1211 1572 1625
256 367 399 691 1194 1573 1634
512 138 164 313 598 990 1363
1024 79 88 175 372 673 1264
4096 66 76 154 300 632 1266
16384 71 77 154 299 633 1264
65536 71 76 154 297 633 1261
########################### RPi 2 OC ##################################
Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2
Reading Speed 4 Byte Words in MBytes/Second
Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read Pi2/Pi
KBytes Words Words Words Words Words All 1 GHz
PiA6
16 1066 1662 1706 1975 1861 1896 1.08
32 930 1163 1367 1706 1706 1861 2.55
64 465 474 820 1219 1575 1462 2.68
128 372 426 787 1241 1706 1490 3.55
256 393 426 745 1260 1626 1491 4.92
512 266 281 522 916 1367 1196 3.99
1024 105 114 249 456 913 508 1.67
4096 93 115 220 396 880 419 1.40
16384 100 113 227 419 838 441 1.49
65536 97 111 209 419 883 447 1.54
A7/A6
PiA7 1 GHz
16 1554 1662 1813 1894 1892 1894 1.00
32 629 648 911 1328 1604 1756 0.94
64 453 461 803 1245 1572 1752 1.20
128 394 430 773 1284 1705 1783 1.20
256 280 410 747 1306 1733 1798 1.21
512 242 253 472 891 1335 1607 1.34
1024 107 122 243 481 919 1287 2.53
4096 95 108 216 420 886 1204 2.87
16384 98 108 216 419 885 1205 2.73
65536 99 109 216 419 888 1204 2.69
############################## RPi 3 ##################################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Reading Speed 4 Byte Words in MBytes/Second
Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read
KBytes Words Words Words Words Words All
PiA6
16 3429 3555 3938 4266 4266 4266 L1
32 1066 1066 1693 2625 3413 3657
64 639 609 1125 1896 2978 3276 L2
128 533 546 1023 1862 2844 3413
256 533 525 1023 1706 2730 3414
512 351 393 758 1310 2184 2983
1024 123 136 274 548 1012 1879 RAM
4096 100 117 254 471 943 1852
16384 119 129 244 489 978 1806
65536 122 123 258 479 1032 1789
PiA7 /PiA6
16 3335 3741 4075 4371 4388 4413 1.03
32 1964 2229 2787 4271 4308 4311 1.18
64 612 615 1121 1932 2880 3546 1.08
128 570 573 1034 1803 2756 3467 1.02
256 541 544 995 1758 2737 3457 1.01
512 382 408 794 1360 2269 3105 1.04
1024 128 136 256 533 1025 1945 1.04
4096 109 125 245 482 961 1585 0.86
16384 120 125 241 477 964 1744 0.97
65536 120 125 243 477 947 1881 1.05
############################# RPi 3 SUSE ##############################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Compiled for 64 bit ARM v8a
Reading Speed 4 Byte Words in MBytes/Second
Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read
KBytes Words Words Words Words Words All
/PiA7
16 3370 3765 4085 4440 4477 3399 0.80
32 2070 2222 2768 4314 4386 3389 0.93
64 590 604 1138 1875 2866 3100 0.95
128 559 568 1061 1784 2781 3135 0.92
256 534 542 1023 1741 2759 3161 0.93
512 477 485 948 1628 2648 3107 1.04
1024 100 142 273 519 1082 2038 1.08
4096 90 128 254 493 988 1935 1.04
16384 123 128 253 495 999 1963 1.09
65536 123 128 254 497 994 1980 1.11
############################ RPi 3 Gentoo #############################
/SUSE
16 1927 3680 4011 4336 4394 3335 0.98
32 2022 2159 2688 4171 4257 3299 0.97
64 579 595 1121 1859 2835 3065 0.99
128 549 557 1041 1750 2735 3082 0.98
256 518 528 1001 1700 2701 3095 0.98
512 384 397 788 1397 2284 2744 0.88
1024 128 131 253 505 1010 1923 0.94
4096 88 119 238 461 938 1737 0.90
16384 115 116 238 455 929 1657 0.84
65536 115 119 238 459 934 1764 0.89
############################# RPi 3 SUSE ##############################
Compiled for 64 bit ARM v8a+fp+simd
/SUSE64
16 3275 3775 4021 4277 4330 4399 1.29
32 914 966 1582 2441 3246 3771 1.11
64 601 611 1144 1958 2899 3548 1.14
128 559 567 1054 1824 2796 3471 1.01
256 534 543 1019 1758 2744 3405 1.08
512 319 348 682 1280 2164 3021 0.97
1024 114 138 274 539 1064 2045 1.00
4096 86 124 247 489 966 1788 0.92
16384 121 123 247 488 971 1858 0.95
65536 121 125 247 490 963 1736 0.88
######################## Other Cortex A53 ##############################
Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53
ARM/Intel BusSpeed Benchmark 1.2 06-Aug-2015 10.57
Compiled for 32 bit ARM v7a
Reading Speed 4 Byte Words in MBytes/Second
Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read
KBytes Words Words Words Words Words All
16 874 932 1814 2302 2355 2263 L1
32 758 803 1309 1820 2323 2386
64 653 671 1203 1741 2206 2332 L2
128 603 620 1107 1693 2222 2351
256 574 589 1075 1711 2211 2327
512 332 372 681 1075 1863 2120
1024 137 193 371 578 1322 2129 RAM
4096 172 179 351 567 1151 2126
16384 172 178 351 504 1117 2136
65536 172 177 349 478 882 2129
ARM/Intel BusSpeed Benchmark 1.2 06-Aug-2015 11.02
Compiled for 64 bit ARM v8a
16 3188 3635 3937 4327 4372 4462
32 1478 1607 2246 3382 3853 4144
64 600 622 1163 2011 2972 3585
128 558 575 1056 1889 2892 3525
256 538 550 1028 1826 2837 3260
512 371 425 813 1490 2403 3202
1024 136 196 382 728 1423 2750
4096 170 177 346 669 1340 2652
16384 169 174 341 678 1352 2663
65536 168 174 341 676 1347 2611
############################# Other ####################################
Android BusSpeed Benchmark 19-Oct-2012 17.29
ARM Cortex-A9 1300 MHz, 1 GB DDR3 RAM
RAM 1 GB DDR3L-1333 Bandwidth 5.3 GB/sec
Reading Speed 4 Byte Words in MBytes/Second
Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read
KBytes Words Words Words Words Words All
16 2723 2420 3044 3364 3499 3500 L1
32 1054 1087 1061 1382 1565 2145
64 436 433 419 652 751 1160 L2
128 345 337 337 542 633 943
256 329 309 322 522 614 961
512 339 299 311 506 574 937
1024 170 168 180 269 349 629
4096 59 55 84 127 176 338 RAM
16384 56 56 83 125 173 335
65536 56 56 82 125 174 334
Intel Atom 1666 MHz busspeedIL
Reading Speed 4 Byte Words in MBytes/Second
Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read
KBytes Words Words Words Words Words All
16 3703 5160 5881 6249 6399 6529 L1
32 484 396 745 1474 2499 3931 L2
64 484 393 787 1516 2482 3878
128 491 410 775 1462 2509 3923
256 492 415 775 1454 2540 3887
512 225 327 606 1213 2184 3534
1024 130 266 533 1034 1952 3306 RAM
4096 126 262 524 1048 1941 3313
16384 135 270 508 1048 1917 3276
65536 135 262 541 1048 1973 3262
Core 2 2400 MHz, Dual channel DDR2 RAM, busspeedIL
Reading Speed 4 Byte Words in MBytes/Second
Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read
KBytes Words Words Words Words Words All
16 6535 5516 6059 6490 6205 6304 L1
32 5925 3225 3938 6023 6094 5966
64 1721 1305 2154 3047 4444 5269 L2
128 1407 1333 2172 3033 4571 5333
256 1538 1365 2206 3047 4432 5334
512 1391 1376 2150 3102 4552 5336
1024 1377 1376 2202 3104 4519 5460
4096 731 814 1425 2206 3669 4882
16384 345 380 761 1310 2530 4343 RAM
65536 321 374 748 1310 2485 4066
FFT Benchmarks - fft1-RPi2, fft3c-Rpi2, fft1-RPi64, FFT3c-RPi64
In 2000, I provided optimised code for a Fast Fourier Transform program, resulting in a series of Windows benchmarks
that provided graphical output - see fftgraf results.htm. The fastest one used SSE type assembly code that modern
compilers can also produce. The new versions use all C code, with identical calculations compiled to run via Linux,
Windows and Android. The benchmarks and source codes are in FFT Benchmarks.zip with further details and results
from PCs, Android devices and RPi 2 in FFTBenchmarks.htm.
There are two benchmarks, FFT1, the original, and FFT3c, optimised, with 32 bit and 64 bit versions, when appropriate.
Performance is measured in milliseconds, for FFTs sized 1K to 1024K, with three measurements using both single and
double precision floating point data, plus some sumchecks for the largest ones. Results from a Raspberry Pi 2, at 900
MHz, are below. These are similar to a year 2000 Pentium III PC.
Raspberry Pi 3 average performance gains were similar to the clock speed ratio of 1.33.
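The measurement scheme (each FFT size timed three times, reported in milliseconds) can be sketched as follows. This is not the FFT1 or FFT3c code: it is a plain recursive radix-2 transform written for illustration, the test data pattern is arbitrary, and pure Python is far too slow for the 1024K size, so only small sizes are shown.

```python
import cmath
import time


def fft(values):
    """Recursive radix-2 decimation-in-time FFT; length must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2])
    odd = fft(values[1::2])
    half = n // 2
    out = [0j] * n
    for k in range(half):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + half] = even[k] - twiddle
    return out


def time_fft_ms(size, repeats=3):
    """Run the FFT 'repeats' times on the same data, returning each time in ms."""
    data = [complex((i * 7) % 13, 0.0) for i in range(size)]
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fft(data)
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


if __name__ == "__main__":
    print("Size K   milliseconds")
    for k in (1, 2, 4):
        print("%6d   %s" % (k, "  ".join("%9.3f" % ms for ms in time_fft_ms(k * 1024))))
```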
Raspberry Pi 3 SUSE and Gentoo 64 Bits - On running the newly compiled 64 bit versions on both systems, wide
variations in performance were observed, with the smaller FFTs, where measured time is less than a millisecond. Full
speed could be achieved by using “performance” CPU MHz setting (where available - see On Demand below) or running
another CPU bound program at the same time. These slower speeds also became apparent on 32 bit results, including
via Raspberry Pi 2. All tests were repeated to run at maximum speeds, producing the results shown below. In some
cases, the earlier slow measurements are also included (see Initial Slow Speed).
Gentoo and SUSE produced virtually the same performance, with variations probably caused by different L2 cache
presence. The 64 bit version averaged 24% faster on the single precision FFTs but with no real difference using double
precision calculations.
The 64 bit benchmarks and source codes are included in Rpi3-64-Bit-Benchmarks.tar.gz.
Small FFT Tests - shortfft64 and shortfft32 - New programs were produced to identify differences in MHz settings.
These execute 30 of the smallest 1K single precision FFTs 500 times. A summary of results is below. Besides the 500
measurements, total time is provided that includes data generation and checking overheads, these being included in a
final summary. With on demand CPU MHz setting, 32 bit Raspbian and 64 bit Gentoo generally produce much slower
execution times over the first few measurements, with the remainder at similar faster speeds. 64 bit OpenSUSE tends
to produce the same slow speeds at the start but also has random longer periods of slow performance. Results from all
three systems indicate constant running time with performance MHz setting or running another CPU benchmark at the
same time. These are also included in Rpi3-64-Bit-Benchmarks.tar.gz.
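One way to pick out the slow start from a series of such measurements is sketched below. It is not part of shortfft64 or shortfft32: the function name is hypothetical, and the rule of flagging any measurement more than 1.5 times the median is an arbitrary threshold chosen for this example. The sample values are the first ten Raspbian On Demand measurements from the listing below.

```python
from statistics import median


def summarise(batch_ms, ffts_per_batch=30, slow_factor=1.5):
    """Summarise per-batch times (ms), each batch being 30 small FFTs."""
    typical = median(batch_ms)
    total = sum(batch_ms)
    average = total / len(batch_ms)
    return {
        "each_ms": average / ffts_per_batch,   # time per single FFT
        "average_batch_ms": average,           # average for 30 FFTs
        "total_ms": total,                     # all batches, excluding overheads
        # positions of batches well above the typical time (assumed threshold)
        "slow_batches": [i for i, ms in enumerate(batch_ms)
                         if ms > slow_factor * typical],
    }


if __name__ == "__main__":
    first_ten = [12.9, 12.2, 7.4, 6.0, 6.0, 6.4, 6.0, 6.0, 6.0, 6.0]
    print(summarise(first_ten))
```

With the full 500 measurements, the same function gives figures comparable to the Each, Av 30 and 500x30 columns of the summary, while the slow_batches list shows whether slow periods are confined to the start or recur later, as seen with OpenSUSE.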
######################### Raspberry Pi 2 #########################
RPi2 FFT 32 Bit Benchmark Version 1.0 Thu Feb 16 12:23:55 2017
Size milliseconds
K Single Precision Double Precision
1 0.212 0.206 0.206 0.246 0.245 0.252
2 0.462 0.447 0.447 0.689 0.678 0.723
4 1.244 1.206 1.192 1.704 1.634 1.616
8 2.995 3.133 2.989 4.397 3.963 3.899
16 6.983 6.785 6.767 13.282 10.515 9.748
32 17.142 17.182 16.855 31.020 30.025 31.891
64 52.794 52.885 52.727 152.318 146.516 145.472
128 278.668 280.006 285.012 358.963 362.587 360.340
256 624.823 636.579 632.442 779.830 790.282 815.686
512 1506.681 1512.883 1514.028 1678.495 1681.863 1668.933
1024 3288.894 3293.423 3312.335 3792.264 3808.471 3789.059
1024 Square Check Maximum Noise Average Noise
SP 9.999520e-01 3.346482e-06 4.565234e-11
DP 1.000000e+00 1.133294e-23 1.428110e-28
===================== Initial Slow Speed =====================
1 0.309 0.305 0.307 0.364 0.356 0.355
2 0.666 0.673 0.680 0.928 0.912 0.900
RPi2 FFT 32 Bit Benchmark Version 3c.0 Thu Feb 16 12:21:57 2017
Size milliseconds
K Single Precision Double Precision
1 0.282 0.237 0.232 0.255 0.246 0.247
2 0.612 0.529 0.582 0.574 0.627 0.635
4 1.523 1.249 1.203 1.498 1.668 1.543
8 2.925 2.781 2.727 3.226 3.141 3.063
16 7.220 6.679 6.672 8.954 8.808 8.737
32 16.862 17.276 15.712 23.606 23.662 23.527
64 41.294 41.568 40.916 57.516 56.900 56.923
128 98.052 97.028 96.708 128.591 127.978 127.868
256 217.731 214.874 214.927 277.817 276.615 280.291
512 466.673 461.412 462.023 596.874 598.976 595.552
1024 1009.119 998.319 999.178 1325.278 1310.229 1304.572
1024 Square Check Maximum Noise Average Noise
SP 9.999520e-01 3.346482e-06 4.565233e-11
DP 1.000000e+00 1.133294e-23 1.428110e-28
===================== Initial Slow Speed =====================
1 0.393 0.349 0.348 0.253 0.237 0.283
2 0.820 0.781 0.802 0.562 0.551 0.552
######################### Raspberry Pi 3 #########################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
RPi2 FFT 32 Bit Benchmark Version 1.0 Wed Feb 15 11:05:59 2017
Size milliseconds
K Single Precision Double Precision
1 0.167 0.164 0.163 0.166 0.164 0.165
2 0.393 0.366 0.366 0.419 0.417 0.418
4 1.036 1.007 0.934 1.117 1.091 1.088
8 2.269 2.247 2.236 2.550 2.506 2.501
16 5.624 5.290 5.231 6.086 5.852 5.842
32 12.714 12.569 12.844 22.068 22.479 21.907
64 43.349 44.585 43.293 110.424 110.410 110.581
128 214.541 217.334 216.575 269.974 269.617 269.755
256 526.296 525.924 525.682 615.746 615.259 615.811
512 1199.912 1199.233 1199.311 1364.511 1364.153 1367.418
1024 2509.227 2538.168 2523.659 2831.903 2831.330 2826.171
1024 Square Check Maximum Noise Average Noise
SP 9.999520e-01 3.346482e-06 4.565233e-11
DP 1.000000e+00 1.133294e-23 1.428110e-28
===================== Initial Slow Speed =====================
1 0.329 0.335 0.326 0.446 0.340 0.340
2 0.729 0.733 0.765 0.913 0.840 0.824
###################################################
RPi2 FFT 32 Bit Benchmark Version 3c.0 Wed Feb 15 11:03:37 2017
Size milliseconds
K Single Precision Double Precision
1 0.215 0.199 0.199 0.170 0.164 0.164
2 0.453 0.462 0.455 0.376 0.373 0.373
4 1.027 1.279 1.023 0.888 0.889 0.883
8 2.333 2.320 2.282 2.052 2.047 2.043
16 5.465 5.362 5.613 5.987 5.977 6.043
32 12.309 12.468 12.216 15.382 15.479 15.396
64 30.695 31.084 30.685 37.030 36.987 37.003
128 72.510 72.023 72.091 84.237 84.239 84.367
256 161.194 160.483 160.714 193.733 193.813 193.760
512 369.130 367.713 367.509 426.499 426.238 425.983
1024 802.163 799.225 798.768 957.992 948.540 948.625
1024 Square Check Maximum Noise Average Noise
SP 9.999520e-01 3.346482e-06 4.565233e-11
DP 1.000000e+00 1.133294e-23 1.428110e-28
===================== Initial Slow Speed =====================
1 0.427 0.397 0.398 0.175 0.165 0.166
2 0.996 0.952 0.924 0.396 0.395 0.393
############################# RPi 3 SUSE ##############################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Compiled for 64 bit ARM v8a
armv8 64 Bit FFT Benchmark Version 1.0 Wed Feb 8 20:01:52 2017
Size milliseconds
K Single Precision Double Precision
1 0.153 0.152 0.152 0.175 0.170 0.168
2 0.347 0.339 0.334 0.402 0.387 0.387
4 0.817 0.763 0.766 1.946 1.433 1.242
8 3.296 2.018 1.963 2.966 2.716 2.698
16 4.623 4.456 4.392 6.719 6.229 6.759
32 10.551 10.417 10.301 18.407 18.816 18.941
64 28.290 28.555 28.032 126.881 127.317 127.272
128 173.229 173.332 172.477 299.374 298.644 298.596
256 405.373 405.188 407.602 657.365 657.037 657.864
512 905.640 921.727 921.347 1461.983 1463.511 1462.099
1024 2018.414 2018.043 2018.976 3163.591 3163.848 3164.858
1024 Square Check Maximum Noise Average Noise
SP 9.999520e-01 3.346482e-06 4.565234e-11
DP 1.000000e+00 1.133294e-23 1.428110e-28
armv8 64 Bit FFT Benchmark Version 3c.0 Wed Feb 8 20:11:05 2017
Size milliseconds
K Single Precision Double Precision
1 0.195 0.161 0.159 0.190 0.184 0.185
2 0.380 0.355 0.360 0.421 0.419 0.420
4 0.988 0.796 0.778 0.959 0.956 0.957
8 2.282 2.183 1.802 2.131 2.101 2.100
16 4.371 4.191 4.091 5.203 5.160 5.176
32 9.477 9.550 9.520 14.318 14.219 14.188
64 26.061 25.553 25.462 33.704 33.668 33.720
128 61.337 60.707 60.460 77.791 77.816 77.922
256 137.002 134.328 134.307 179.822 179.707 181.027
512 315.380 313.872 313.642 392.380 394.200 392.586
1024 692.640 689.569 689.751 859.132 854.983 852.890
1024 Square Check Maximum Noise Average Noise
SP 9.999520e-01 3.346482e-06 4.565234e-11
DP 1.000000e+00 1.133294e-23 1.428110e-28
############################ RPi 3 Gentoo #############################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Compiled for 64 bit ARM v8a
armv8 64 Bit FFT Benchmark Version 1.0 Wed Feb 8 19:46:59 2017
Size milliseconds
K Single Precision Double Precision
1 0.177 0.155 0.166 0.190 0.168 0.168
2 0.346 0.370 0.348 0.640 0.496 0.471
4 0.806 0.776 0.773 1.792 1.811 2.455
8 2.879 2.026 2.313 3.143 2.673 2.614
16 4.694 4.487 4.446 6.501 6.077 6.090
32 10.824 11.067 10.520 27.899 27.393 32.721
64 49.580 37.161 37.028 119.094 118.648 118.820
128 172.333 186.946 172.173 294.386 294.253 294.366
256 406.581 407.594 406.053 670.012 670.096 670.169
512 938.983 938.567 939.929 1486.050 1485.846 1486.961
1024 1987.861 1989.141 1997.740 3143.410 3143.533 3143.669
1024 Square Check Maximum Noise Average Noise
SP 9.999520e-01 3.346482e-06 4.565234e-11
DP 1.000000e+00 1.133294e-23 1.428110e-28
armv8 64 Bit FFT Benchmark Version 3c.0 Wed Feb 8 19:55:51 2017
Size milliseconds
K Single Precision Double Precision
1 0.181 0.172 0.160 0.190 0.185 0.185
2 0.400 0.366 0.362 0.458 0.420 0.423
4 0.892 0.937 0.932 0.989 0.976 0.994
8 1.986 1.967 2.604 2.269 2.270 2.334
16 5.590 4.686 4.433 5.615 5.533 5.621
32 10.438 10.081 10.263 14.656 14.616 14.669
64 27.759 27.381 27.154 34.832 34.816 34.853
128 63.303 62.331 62.107 79.898 79.849 79.896
256 138.935 170.902 137.272 186.385 186.580 186.381
512 318.062 315.184 315.421 409.840 410.370 410.283
1024 691.349 683.468 685.295 919.255 904.665 904.236
1024 Square Check Maximum Noise Average Noise
SP 9.999520e-01 3.346482e-06 4.565234e-11
DP 1.000000e+00 1.133294e-23 1.428110e-28
###### Example 64 Bit Results With On Demand CPU MHz ######
===================== On Demand Slow Speed ====================
armv8 64 Bit FFT Benchmark Version 3c.0 Wed Feb 8 19:58:15 2017
Size milliseconds
K Single Precision Double Precision
1 0.367 0.321 0.320 0.200 0.213 0.188
2 0.875 0.835 0.805 0.443 0.444 0.425
4 1.974 2.038 1.862 0.996 0.978 0.993
8 6.018 5.208 3.971 2.294 2.285 2.278
16 9.424 4.566 4.586 5.574 5.585 5.561
32 10.608 10.236 10.202 14.902 14.826 14.728
64 28.013 27.164 27.240 34.889 34.867 34.939
128 63.213 62.583 62.562 80.222 80.257 80.036
256 139.365 137.684 137.460 186.954 187.057 187.003
512 318.927 316.056 315.992 412.486 412.306 412.417
1024 693.102 684.980 686.608 918.059 902.694 902.613
1024 Square Check Maximum Noise Average Noise
SP 9.999520e-01 3.346482e-06 4.565234e-11
DP 1.000000e+00 1.133294e-23 1.428110e-28
###################################################
RPi 3 500 x 30 1K Single Precision FFT milliseconds
Raspbian On Demand
12.9 12.2 7.4 6.0 6.0 6.4 6.0 6.0 6.0 6.0
6.1 6.0 6.0 6.0 6.0 6.0 6.1 6.1 6.0 6.2
6.2 6.0 6.0 6.1 6.0 6.0 6.0 6.0 6.1 6.0
6.2 6.0 6.0 7.0 6.1 6.0 6.0 6.0 6.1 6.0
6.2 6.1 6.0 6.0 6.2 6.0 6.0 6.0 6.0 7.2
To
6.5 6.3 6.1 6.2 6.1 6.1 6.1 6.1 6.1 6.1
6.5 6.3 6.1 6.1 6.1 6.1 6.1 6.1 6.1 6.1
6.4 6.2 6.1 6.1 6.2 6.1 6.1 6.1 6.1 6.1
Raspbian With Stress Test
6.7 6.2 6.0 6.0 6.0 6.0 6.1 6.0 6.1 6.0
6.5 6.2 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
6.4 6.2 6.0 6.0 6.0 6.0 6.0 6.1 6.0 6.0
To
6.3 6.2 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
6.3 6.2 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
6.3 6.2 6.0 6.0 6.1 6.0 6.0 6.0 6.0 6.0
OpenSUSE On Demand
12.1 12.5 8.9 5.3 5.3 5.3 5.3 5.3 5.3 5.3
5.3 5.7 5.3 5.3 5.3 5.3 5.3 5.3 5.3 5.3
5.3 5.6 5.3 5.3 5.3 5.3 5.3 5.3 5.3 5.3
To
7.9 11.7 10.7 10.6 10.6 10.6 10.6 10.6 10.6 10.7
11.6 11.2 10.6 10.7 10.6 10.6 10.6 10.6 10.6 10.6
11.7 11.5 10.6 10.6 10.6 10.6 10.6 10.6 10.6 10.6
11.8 11.1 10.6 10.6 10.7 10.6 10.6 10.7 10.6 10.6
To
5.5 6.0 5.8 5.3 5.3 5.3 5.3 5.3 5.3 5.3
5.5 5.9 5.7 5.3 5.3 5.3 5.3 5.3 5.3 5.3
5.5 6.0 5.8 5.3 5.3 5.3 5.3 5.3 5.3 5.4
OpenSUSE Performance
6.1 6.0 5.4 5.5 5.4 5.3 5.3 5.3 5.3 5.3
5.5 6.0 5.8 5.3 5.3 5.3 5.3 5.3 5.3 5.3
5.5 6.1 5.8 5.3 5.3 5.3 5.3 5.3 5.3 5.3
To
5.5 6.2 5.7 5.3 5.3 5.3 5.3 5.3 5.3 5.3
5.5 6.1 5.8 5.3 5.3 5.3 5.3 5.3 5.3 5.3
5.5 6.0 5.7 5.3 5.3 5.3 5.3 5.3 5.3 5.3
Gentoo On Demand
17.5 15.4 11.8 8.6 5.4 5.4 5.4 5.4 5.4 5.4
5.5 5.8 6.0 5.4 5.5 5.4 5.5 5.4 5.4 5.4
5.5 5.6 6.1 5.4 5.5 5.4 5.5 5.5 5.4 5.4
To
5.7 6.9 5.7 5.4 5.4 5.4 5.5 5.4 5.4 5.4
5.8 6.8 5.8 5.6 5.4 5.4 5.4 5.5 5.4 5.4
5.7 6.4 5.7 5.5 5.4 5.4 5.5 5.4 5.4 5.4
Gentoo With Stress Test
5.9 7.2 5.9 5.5 5.4 5.4 5.4 5.4 5.4 5.5
5.6 6.9 5.7 5.4 5.4 5.4 5.4 5.4 5.4 5.4
5.6 6.5 5.7 5.4 5.4 5.4 5.4 5.4 5.4 5.4
5.8 7.1 5.9 5.4 5.4 5.4 5.4 5.4 5.4 5.4
To
5.7 6.8 5.7 5.4 5.4 5.4 5.4 5.4 5.4 5.4
5.7 6.7 6.1 5.4 5.4 5.4 5.4 5.4 5.4 5.4
5.8 6.6 5.6 5.4 5.4 5.4 5.4 5.4 5.4 5.4
################### Summary milliseconds ###################
                              Each   Av 30  500x30  +Overheads
Raspbian On Demand           0.206    6.17    3086       14402
Raspbian Plus Stress Test    0.202    6.07    3036       14222
OpenSUSE On Demand           0.221    6.23    3314       13035
OpenSUSE Performance         0.182    5.45    2725       10663
Gentoo On Demand             0.190    5.70    2852        8994
Gentoo Plus Stress Test      0.187    5.61    2802        8872
Single Core NEON Benchmarks
Some of these are essentially the same as my Android NEON Benchmarks.htm, using NEON Intrinsic Functions. Others
are produced by including the compile option -funsafe-math-optimizations, alongside -mfpu=neon-vfpv4. Results for
single core NEON benchmarks are included in this document, with the programs and source codes in
Raspberry_Pi_Benchmarks.zip. For MultiThreading versions, see Raspberry Pi Multithreading Benchmarks and
Raspberry_Pi_MP_Benchmarks.zip.
64 Bit Versions - The compiler does not have the NEON directive, but translates NEON intrinsic functions into 64 bit
vector instructions. The 64 bit benchmarks and source codes are in Rpi3-64-Bit-Benchmarks.tar.gz.
Linpack NEON Benchmarks - linpackPiNEONi, linpackPiFSSP, linpackPiNEONi64
The Android version was written using NEON Intrinsic Functions and was converted to Linux format in linpackneon.c,
compiled as linpackPiNEONi. The standard Linux single precision version was recompiled with the additional
-funsafe-math-optimizations parameter as linpackPiFSSP. Comparative performance of the intrinsic program is shown in
Linpack Benchmark Comparisons above.
Linpack benchmark performance is mainly determined by the daxpy function, specifically an unrolled loop with four dy[i]
= dy[i] + da * dx[i] statements, accessing sequential data. NEON q registers are 128 bits or four words and there are
multiply and add instructions, using three registers. The assembly code loop has two loads and one store, with
linpackPiNEONi using vmla Vector Multiply Accumulate instruction and linpackPiFSSP using the faster vfma Fused
Multiply Accumulate - one instruction for 4 multiplies and 4 adds.
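The shape of that loop can be shown with a simplified sketch. It is not the Linpack source: the remainder handling is an assumption made to keep the example general, and Python lists stand in for the C arrays. Each group of four statements corresponds to what a single q register multiply accumulate would do, that is 4 multiplies and 4 adds, or 8 floating point operations per pass.

```python
def daxpy_unrolled(da, dx, dy):
    """dy[i] = dy[i] + da * dx[i], four sequential elements per loop pass."""
    n = len(dx)
    main = n - n % 4
    for i in range(0, main, 4):
        # These four statements map onto one 4-wide vector multiply accumulate
        dy[i] = dy[i] + da * dx[i]
        dy[i + 1] = dy[i + 1] + da * dx[i + 1]
        dy[i + 2] = dy[i + 2] + da * dx[i + 2]
        dy[i + 3] = dy[i + 3] + da * dx[i + 3]
    for i in range(main, n):
        # Leftover elements when n is not a multiple of four
        dy[i] = dy[i] + da * dx[i]
    return dy


def flops_per_cycle(mflops, cpu_mhz):
    """Floating point operations completed per CPU clock cycle."""
    return mflops / cpu_mhz


if __name__ == "__main__":
    print(daxpy_unrolled(2.0, [1.0, 2.0, 3.0, 4.0, 5.0], [10.0] * 5))
    # RPi 3 linpackPiFSSP result from the table below: 488 MFLOPS at 1200 MHz
    print("%.2f" % flops_per_cycle(488, 1200))
```

At well under one operation per cycle, against a nominal 8 per vector instruction, the measured MFLOPS are clearly governed by the loads, store and other parts of the benchmark as well as the multiply accumulate itself.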
Raspberry Pi 3 speeds are shown to be 54% to 57% faster than the non-overclocked Raspberry Pi 2, compared with a
33% faster CPU MHz.
These instructions are known to produce rounding complications, with differences in results shown below. I could not
say whether they are acceptable.
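One source of such differences is that a fused multiply-add rounds once, after the add, whereas a separate multiply and add rounds the product first. The sketch below emulates single precision in Python by rounding through a 4 byte float. The sample values are chosen so that the double precision intermediate is exact, which is the only reason the simple fused emulation is valid here; it is not a general purpose FMA.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def multiply_then_add(a, b, c):
    """Two roundings: the product is rounded to single precision before the add."""
    return f32(f32(a * b) + c)


def fused_multiply_add(a, b, c):
    """One rounding. Only valid when a * b + c is exact in double, as it is below."""
    return f32(a * b + c)


# Sample single precision values, picked so the product needs 25 significant bits
A = 1.0 + 2.0 ** -12
B = 1.0 + 2.0 ** -13
C = -(1.0 + 2.0 ** -12 + 2.0 ** -13)

if __name__ == "__main__":
    print("multiply then add:", multiply_then_add(A, B, C))
    print("fused            :", fused_multiply_add(A, B, C))
```

The separate version loses the low order bit of the product and returns zero, while the fused version keeps it. Neither result is wrong in itself, but sums built from many such steps, as in Linpack, end up with slightly different residuals.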
Raspberry Pi 3 SUSE and Gentoo 64 Bits - As both use different varieties of SIMD instructions, performance is not
that much better than the 32 bit version.
                          linpackPiNEONi        linpackPiFSSP         linpackPiNEONi64
RPi 2 MFLOPS at 900 MHz   300                   311
RPi 2 MFLOPS at 1000 MHz  334                   348
RPi 3 MFLOPS at 1200 MHz  486                   488                   530
NEON Function             vmla.f32 q8, q9, q10  vfma.f32 q8, q9, q10  fmla v0.4s, v1.4s, v2.4s

                   norm resid  resid           x[0]-1           x[n-1]-1
Pi, Android+NEON   1.6         3.80277634e-05  -1.38282776e-05  -7.51018524e-06
Pi 2/3 Not NEON    2.0         4.69621336E-05  -1.31130219E-05  -1.30534172E-05
Pi 3 64 NEON In    2.0         4.69621336E-05  -1.31130219E-05  -1.30534172E-05
Pi 3 64 Not NEON   2.0         4.69621336E-05  -1.31130219E-05  -1.30534172E-05
Pi 2/3 Intrinsic   2.2         5.16722466e-05  -2.38418579e-07  -5.06639481e-06
Pi 2/3 Compiled    1.9         4.62468779e-05  -1.31130219e-05  -1.30534172e-05
NEON Float & Integer Benchmark - NeonSpeed, NeonSpeedPi64
This was the first benchmark produced to measure speed using NEON instructions on ARM v7 CPUs using Android. It
executes some of the code used in Memory Speed Benchmark, with additional tests recoded using NEON intrinsic
functions. The benchmark and source code are included in Raspberry_Pi_Benchmarks.zip.
The compile command (for gcc 4.8) is shown below, where the -funsafe-math-optimizations option leads to the
compiler generating NEON code for normal floating point statements. In this case, vfma Fused Multiply Accumulate
instructions were generated, as opposed to vmla Vector Multiply Accumulate from the intrinsic functions. Then,
vadd.i32 was produced for all integer tests. In this case, performance from both methods was quite similar.
Raspberry Pi 3 speeds were quite a bit faster than the Raspberry Pi 2 at 900 MHz. Average, minimum and maximum
improvements, using data in L1 cache, were 1.71, 1.37 and 2.02 times. L2 cache ratios were 3.13, 1.90 and 2.45, with
RAM, best, at 3.57, 2.92 and 5.33. The RPi 3 was also more efficient in running the NEON instructions using caches.
Example Android results logs are also provided, to show the difference where compiled NEON instructions are not
provided at 32 bits. Performance at 64 bits is also provided, for the tablet with the ARM-A53 CPU, where NEON
instructions are compiled and cache based speeds similar to the RPi 3.
Raspberry Pi 3 SUSE and Gentoo 64 Bits - 64 bit and 32 bit speeds are, again, nearly the same, using different
variations of vector instructions. An exception is the slower performance from gcc 4.8 in translating NEON intrinsic
functions for the v=v+s*v test.
gcc neonspeed.c cpuidc.c -lm -lrt -O3 -mcpu=cortex-a7 -mfloat-abi=hard
-mfpu=neon-vfpv4 -funsafe-math-optimizations -o NeonSpeed
##############################################
Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz
NEON Speed Test V 1.0 Tue Mar 17 12:06:58 2015
Vector Reading Speed in MBytes/Second
Memory Float v=v+s*v Int v=v+v+s Neon v=v+v
KBytes Norm Neon Norm Neon Float Int
16 1914 1978 2049 2293 2341 2797 L1
32 1897 1951 2032 2253 2310 2745
64 1517 1543 1619 1694 1718 1915 L2
128 1417 1435 1510 1569 1594 1791
256 1414 1433 1499 1571 1593 1771
512 680 578 654 600 577 604
1024 434 403 451 414 396 409 RAM
4096 327 328 332 324 324 330
16384 333 334 338 345 330 337
65536 339 336 340 172 331 338
Max MFLOPS 479 495
Max MOPS 512 573
##################### OC ######################
Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz,
over_voltage=2
NEON Speed Test V 1.0 Tue Mar 17 12:12:37 2015
Vector Reading Speed in MBytes/Second
Memory Float v=v+s*v Int v=v+v+s Neon v=v+v
KBytes Norm Neon Norm Neon Float Int
16 2114 2183 2265 2531 2587 3090 L1
32 2078 2134 2228 2461 2532 3003
64 1673 1703 1785 1870 1900 2118 L2
128 1565 1581 1668 1736 1761 1974
256 1545 1577 1660 1726 1752 1951
512 1055 1042 1100 1121 1101 1178
1024 499 506 523 525 512 530 RAM
4096 429 431 440 428 433 445
16384 436 438 448 453 440 454
65536 446 443 452 229 444 458
Max MFLOPS 529 546
Max MOPS 566 633
End of test Tue Mar 17 12:12:57 2015
################### RPi 3 ###################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
NEON Speed Test V 1.0 Fri Jul 29 12:03:47 2016
Vector Reading Speed in MBytes/Second
Memory Float v=v+s*v Int v=v+v+s Neon v=v+v
KBytes Norm Neon Norm Neon Float Int
16 2720 4001 3459 4225 4474 4750
32 2598 3706 3268 3879 4091 4320
64 2453 3389 3069 3526 3675 3859
128 2503 3466 3178 3598 3718 3918
256 2530 3516 3230 3649 3779 3950
512 2221 2923 2718 2964 3104 3217
1024 1262 1326 1317 1316 1324 1316
4096 1170 1213 1204 1213 1210 1195
16384 1177 1229 1218 1147 1222 1215
65536 1181 1226 1221 916 1208 1218
Max MFLOPS 680 1000
Max MOPS 865 1056
End of test Fri Jul 29 12:04:07 2016
################ RPi 3 SUSE ################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Compiled for 64 bit ARM v8a
Vector Reading Speed in MBytes/Second
Memory Float v=v+s*v Int v=v+v+s Neon v=v+v
KBytes Norm Neon Norm Neon Float Int
16 2393 4497 3479 4252 4783 4932
32 2299 4081 3284 3910 4362 4441
64 2193 3663 3067 3593 3896 3904
128 2227 3701 3144 3603 3909 3926
256 2226 3693 3153 3586 3896 3923
512 1913 3461 2958 3358 3609 3577
1024 1271 1408 1406 1364 1363 1422
4096 1130 1207 1219 1158 1186 1208
16384 1102 1116 1132 1037 1111 1116
65536 1089 1095 1107 810 1091 1095
Max MFLOPS 598 1124
Max MOPS 870 1063
############### RPi 3 Gentoo #################
16 2352 4419 3418 4178 4700 4850
32 2330 4355 3388 4122 4664 4806
64 2177 3678 3066 3607 3932 3923
128 2230 3772 3174 3683 4012 4007
256 2240 3785 3199 3694 4024 4024
512 1936 3095 2690 2996 3241 3279
1024 1143 1203 1253 1162 1178 1229
4096 1097 1182 1182 1115 1138 1192
16384 1103 1193 1188 1138 1143 1201
65536 1109 1199 1200 866 1165 1214
Max MFLOPS 588 1104
Max MOPS 855 1045
################ RPi 3 SUSE ##################
Compiled for 64 bit ARM v8a+fp+sim
16 2390 3001 3187 3925 4135 4372
32 2381 2985 3187 3894 4135 4371
64 2174 2674 2817 3300 3468 3608
128 2177 2704 2859 3341 3512 3654
256 2200 2712 2848 3315 3520 3637
512 2010 2400 2539 2894 3018 3094
1024 1238 1314 1338 1356 1382 1385
4096 1098 1148 1159 1158 1170 1188
16384 1063 1082 1120 1041 1109 1114
65536 1063 1067 1108 815 1092 1097
Max MFLOPS 598 750
Max MOPS 797 981
More Results Below or Go To Start
#################### Android #####################
Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53
ARM/Intel NeonSpeed Benchmark V1.2 13-Aug-2015 16.32
Compiled for 32 bit ARM v7a
Vector Reading Speed in MBytes/Second
Memory Float v=v+s*v Int v=v+v+s Neon v=v+v
KBytes Norm Neon Norm Neon Float Int
16 971 3853 1807 4059 3957 4397
32 970 3812 1800 3983 3891 4323
64 927 3228 1605 3038 3269 3521
128 926 3321 1681 3343 3354 3596
256 936 3386 1693 3449 3413 3667
512 898 2889 1578 2996 2927 3118
1024 794 1859 1345 2057 1996 1924
4096 794 1796 1250 1788 1813 1835
16384 792 1773 1270 1820 1829 1864
65536 796 1811 1289 1852 1832 1880
Total Elapsed Time 11.3 seconds
ARM/Intel NeonSpeed Benchmark V1.2 13-Aug-2015 16.37
Compiled for 64 bit ARM v8a
Vector Reading Speed in MBytes/Second
Memory Float v=v+s*v Int v=v+v+s Neon v=v+v
KBytes Norm Neon Norm Neon Float Int
16 3054 4055 3605 4376 4911 5094
32 2922 3787 3435 4198 4546 4682
64 2795 3514 3259 3658 4050 4116
128 2886 3529 3373 3924 4148 3963
256 2883 3641 3264 3942 4193 4276
512 2454 3165 2985 3385 3586 3542
1024 1633 2000 1835 2043 2114 2105
4096 1738 1893 1899 1900 1956 1955
16384 1757 1870 1886 1802 1921 1846
65536 1755 1875 1870 1903 1936 1937
Max MFLOPS 764 1014
Max MOPS 901 1094
Total Elapsed Time 10.2 seconds
#################### Android #####################
Nexus 7 Quad 1200 MHz Cortex-A9, Android 4.1.2
Android NeonSpeed Benchmark 15-Dec-2012 14.38
Vector Reading Speed in MBytes/Second
Memory Float v=v+s*v Int v=v+v+s Neon v=v+v
KBytes Norm Neon Norm Neon Float Int
16 860 2575 2325 2918 3053 3245 L1
32 950 2551 2400 2823 2944 3131
64 744 1396 1329 1434 1465 1496 L2
128 713 1342 1319 1365 1392 1417
256 714 1339 1311 1357 1377 1400
512 708 1323 1299 1348 1358 1383
1024 608 875 869 917 930 952
4096 460 493 492 481 488 504 RAM
16384 460 498 487 507 506 504
65536 459 495 469 251 503 505
Max MFLOPS 238 644
Max MOPS 600 730
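The Max MFLOPS and Max MOPS lines in the listings above appear consistent with dividing the best MBytes/Second
figure in a column by four, that is, one arithmetic operation per 4 byte word read. This is inferred from the printed
numbers, not taken from the benchmark source, and the helper name below is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: convert a vector reading speed in MBytes/second to
   millions of operations per second, assuming (as the tables suggest) one
   operation per 4 byte word read. Rounds to the nearest whole number. */
long mbps_to_mops(long mbytes_per_second)
{
    assert(mbytes_per_second >= 0);
    return (mbytes_per_second + 2) / 4;   /* 4 bytes per word, rounded */
}
```

For example, the Raspberry Pi 2 float speeds of 1914 and 1978 MB/second at 16 KB give 479 and 495, matching the
Max MFLOPS line of that listing.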
To Start
MemSpeed NEON - memSpdPiNEON
This is compiled from the Memory Speed Benchmark source code, using the additional -funsafe-math-optimizations
compile parameter. An example of results is included above. The memspeedPiA7 benchmarks, compiled with the
-mfpu=neon-vfpv4 option, generated NEON instructions for integer arithmetic (vadd.i32 q8, q8, q10), as with
memSpdPiNEON, leading to the same performance. Then four scalar fused multiply and add instructions (fadds s12, s8,
s12) were generated for the single precision (SP) floating point test, as opposed to NEON (vfma.f32 q8, q9, q6) with
the new benchmark, with similar differences for the second set of calculations. Details are above, and maximum
MFLOPS below, showing a gain approaching 50% through using NEON instructions. Note: currently NEON floating
point functions are only available at single precision. For reference, double precision (DP) results are also shown.
Both compilations for memspeedPiA7 and memSpdPiNEON have NEON integer instructions of the form vadd.i32 q8, q8,
q9, providing significant performance gains, as shown by integer MOPS below.
Raspberry Pi 3 - The best gains were on integer MOPS, at 1.5 to 1.7 times the 900 MHz RPi 2. Some double precision
gains were lower than the clock MHz ratio of 1.33.
Raspberry Pi 3 SUSE and Gentoo 64 Bits - Compile options not available, but see Memory Speed Benchmark above.
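As a rough illustration of the scalar versus NEON difference described above, the following sketch carries out the
x[m] = x[m] + s * y[m] calculation first as a plain C loop and then four words at a time with NEON intrinsics. The
function names are hypothetical and this is not the benchmark source. The intrinsic path is only compiled where the
compiler reports NEON support, and which instructions are finally generated depends on the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Plain C version: one multiply and one add per word. The compiler may
   produce scalar or vector instructions, depending on its options. */
void mul_add_scalar(float *x, const float *y, float s, size_t n)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s * y[i];
}

/* Four words at a time using NEON intrinsics where available,
   otherwise falling back to the plain C loop. */
void mul_add_neon(float *x, const float *y, float s, size_t n)
{
    assert(x != NULL && y != NULL);
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t vs = vdupq_n_f32(s);            /* s in all four lanes */
    for (; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);
        float32x4_t vy = vld1q_f32(y + i);
        vx = vmlaq_f32(vx, vs, vy);             /* vx + vs * vy */
        vst1q_f32(x + i, vx);
    }
#endif
    for (; i < n; i++)                          /* remaining (or all) words */
        x[i] = x[i] + s * y[i];
}
```

Both functions should give the same values for data where the products and sums are exactly representable, as in
small whole numbers.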
MFLOPS
memspeedPiA6 memspeedPiA7 memSpdPiNEON
Raspberry Pi 2
SP MFLOPS at 900 MHz 333 299 445
SP MFLOPS at 1000 MHz 351 330 493
DP MFLOPS at 900 MHz 133 175 174
DP MFLOPS at 1000 MHz 148 193 193
Raspberry Pi 3
SP MFLOPS at 1200 MHz 444 454 594
DP MFLOPS at 1200 MHz 208 203 203
INT MOPS
memspeedPiA6 memspeedPiA7 memSpdPiNEON
Raspberry Pi 2
Int MOPS at 900 MHz 323 512 509
Int MOPS at 1000 MHz 333 566 562
Raspberry Pi 3
Int MOPS at 1200 MHz 485 865 864
To Start
Maximum One Core Single Precision MFLOPS
notOpenMP-MFLOPS, notOpenMP-MFLOPS64, MP-MFLOPSPiA7, MP-MFLOPS64
MP-NeonMFLOPS, MP-NeonMLOPS64, MP-MFLOPSPiNeon
All of these carry out the same calculations, executed in the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f
with 2, 8 or 32 operations per input data word. Full results are provided below. OpenMP-MFLOPS automatically uses all
available cores and notOpenMP-MFLOPS uses one core with no MP overheads. All others use 1, 2, 4 and 8 threads,
with the best MFLOPS from 1 thread shown here.
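A minimal sketch of the 8 operations per word form follows, along with the usual MFLOPS arithmetic. The function
names and the way the constants are passed are assumptions for illustration, not the benchmark source. The MFLOPS
formula is consistent with the notOpenMP-MFLOPS listings later in this document, where 100000 words, 2 operations
per word and 2500 passes in 0.697952 seconds are reported as 716 MFLOPS.

```c
#include <assert.h>
#include <stddef.h>

/* 8 operations per word: three adds, three multiplies, one subtract and
   one final add. The constants a to f are illustrative parameters. */
void calc8(float *x, size_t n, float a, float b, float c,
           float d, float e, float f)
{
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
}

/* MFLOPS = words * operations per word * passes / seconds / 1,000,000 */
double mflops(double words, double ops_per_word, double passes,
              double seconds)
{
    assert(seconds > 0.0);
    return words * ops_per_word * passes / seconds / 1.0e6;
}
```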
The compilation for PiA7 MP-MFLOPS includes an option to use NEON instructions, but does not do so in the 32 bit
version. MP-MFLOPS64 and OpenMP-MFLOPS64 varieties use the simple “-march=armv8-a” directive.
The compiled MP-MFLOPSPiNeon and OpenMP benchmarks include “-funsafe-math-optimizations” parameter that
produces SIMD instructions. This option is not available at 64 bits. MP-NeonMFLOPS and MP-NeonMFLOPS64 use a well
ordered structure of NEON intrinsic functions, clearly suitable for SIMD operation.
gcc neonmflops.c cpuidc.c -lm -lrt -O3 -mcpu=cortex-a7 -mfloat-abi=hard
-mfpu=neon-vfpv4 -funsafe-math-optimizations -lpthread -o MP-NeonMFLOPS
Raspberry Pi 3 speeds were 1.75 times faster than model 2, at two operations per word, increasing to 2.28 times at
32 operations per word.
64 Bit Versions - MP-MFLOPS results were between 2.3 and 4 times faster than the 32 bit version, due to using SIMD
instructions. The notOpenMP-MFLOPS performance was similar, with both versions using SIMD. MP-NeonMLOPS64
intrinsics were compiled as more effective vector instructions, to produce gains between 1.25 and 1.54 times.
Cortex-A53 based Android tablet results are also shown, with similar performance. Details are in android
benchmarks.htm#anchor18, available within Android Benchmark and Stress Testing Apps.zip.
Reliability Tests - The MP-MFLOPS functions were used in stress testing programs that have command line options to
define which function to use and the running time. See Reliability Tests, 64 Bit Reliability Tests and Raspberry Pi 2 and 3
Stress Tests. The original versions, such as burninfpuPiA7 and MFLOPS benchmarks, produced less than 1.5 MFLOPS
per MHz, where the test functions were driven by repetitive external calls. A later one, burninfpuPi2, in
Raspberry_Pi_2_Stress_Tests.zip, included the repeat calls within the functions, and unrolled some of the calculations,
producing some much faster speeds. The 64 bit version, burninfpuPi64, in Rpi3-64-Bit-Benchmarks.tar.gz, produced
similar superior performance, as reflected in the results below.
Single Precision MFLOPS
MHz 2 Ops/word 8 Ops/word 32 Ops/word
Raspberry Pi 2
notOpenMP-MFLOPS 900 398 777 692
notOpenMP-MFLOPS 1000 461 861 765
burninfpuPiA7 L2 900 450 777 685
Raspberry Pi 3 1200
notOpenMP-MFLOPS 716 1697 1581
MP-MFLOPSPiA7 182 693
MP-MFLOPSPiNeon Compiled 782 1672
MP-NeonMFLOPS Intrinsics 583 1706
burninfpuPiA7 L2 cache data 721 1644 1703
notOpenMP-MFLOPS64 718 1720 1496
MP-MFLOPS64 730 1579
MP-MFLOPSNeon Compiled N/A
MP-NeonMLOPS64 Intrinsics 729 2640
burninfpuPi64 L2 cache data 1721 3796 1562
Cortex A53 Android Tablet 1300 MHz 1 Core Threaded
SP MFLOPS 32 bit Intrinsics 619 1426
SP MFLOPS 64 bit Intrinsics 726 2639
To Start
MultiThreading Benchmarks and MP-MFLOPS
These are essentially the same as my Android Multithreading Benchmarks, available within Android Benchmark and
Stress Testing Apps.zip. Except for OpenMP tests, all run the benchmarks using 1, 2, 4 and 8 threads. Those that use
caches and RAM have data sizes around 12.8 KB, 128 KB and 12.8 MB. The test runs considered below are to provide
Raspberry Pi 3 comparisons of 64 bit versus 32 bit operation. The new benchmarks and source codes are included in
Rpi3-64-Bit-Benchmarks.tar.gz. Details and results of earlier measurements can be found in Raspberry Pi Multithreading
Benchmarks, with benchmarks and source codes in Raspberry_Pi_MP_Benchmarks.zip.
Where appropriate, the benchmarks show that the same numerical results are produced using a varying number of
threads. Example results for different compilations of MP-MFLOPS are shown below. At 32 bits, the benchmark was
compiled firstly with normal floating point parameters, secondly with additional NEON directives and thirdly with NEON
intrinsic functions, replacing normal C code. At 64 bits, only the first and last of these were appropriate. The intrinsic
functions were translated into different forms of vector instructions. The different compilations produced variations in
numerical results, as shown in the following.
################ MP-MFLOPS FORMAT #################
MP-MFLOPS armv8 64Bit Fri Feb 24 13:30:16 2017
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 730 717 413 1579 1575 1541
2T 1361 1351 389 3075 3145 2849
4T 2259 2417 370 5399 6114 4944
8T 2226 1919 352 5346 5948 4986
Results x 100000, 0 indicates ERRORS
1T 76406 97075 99969 66015 95363 99951
2T 76406 97075 99969 66015 95363 99951
4T 76406 97075 99969 66015 95363 99951
8T 76406 97075 99969 66015 95363 99951
End of test Fri Feb 24 13:30:21 2017
MP-MFLOPS Linux/ARM V7A
1T 76406 97075 99969 66015 95363 99951
MP-MFLOPS Compiled NEON
1T 76406 97075 99969 66008 95367 99951
MP-MFLOPS NEON Intrinsics
1T 76406 97075 99969 66014 95363 99951
MP-MFLOPS 64 Bit
1T 76406 97075 99969 66015 95363 99951
MP-MFLOPS NEON Intrinsics 64 Bit
1T 76406 97075 99969 66015 95363 99951
MP-MFLOPS Double Precision
1T 76384 97072 99969 66065 95370 99951
MP-MFLOPS 64 Bit DP
1T 76384 97072 99969 66065 95370 99951
See Results Below or Go To Start
MP-MFLOPS - MP-MFLOPSPiA7, MP-MFLOPSDP, MP-MFLOPSPi64, MP-MFLOPSPi64DP
MP-MFLOPS measures floating point speed on data from caches and RAM. The first calculations are as used in Memory
Speed Benchmark, with a multiply and an add per data word read. Others use more calculations in the form of x[i] =
(x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 8 or 32 operations per input data word. Each thread carries out the
same calculations but accesses different segments of the data. The result, on cache based calculations, is often
performance proportional to the number of cores used.
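The idea of each thread working on its own segment of the data can be sketched as below. The partitioning rule and
function names are assumptions for illustration; the benchmark's own division of the data is not shown in this
document. Thread creation is left out so that the sketch stays self-contained; each call to mul_add_segment could be
run by a separate thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical partitioning: thread t of nthreads works on words
   [*start, *end). Any remainder is spread over the first threads. */
void segment_bounds(size_t n, size_t nthreads, size_t t,
                    size_t *start, size_t *end)
{
    assert(nthreads > 0 && t < nthreads);
    size_t base  = n / nthreads;
    size_t extra = n % nthreads;
    *start = t * base + (t < extra ? t : extra);
    *end   = *start + base + (t < extra ? 1 : 0);
}

/* An add and a multiply on one segment of the data. */
void mul_add_segment(float *x, size_t start, size_t end, float a, float b)
{
    for (size_t i = start; i < end; i++)
        x[i] = (x[i] + a) * b;
}
```

As the segments do not overlap and together cover all the words, the numeric results are the same whatever number
of segments is used.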
64 Bit vs 32 Bit - Bearing in mind that results represented by the third column are likely to be dependent on memory
speed, the first cache based tests averaged four times faster, with a 25% improvement from RAM.
Then, with 32 operations per word, a 2.19 times speed gain applied. Double precision improvements were much less.
Single/Double Precision - Results were quite similar using the 32 bit benchmark. At 64 bits, SP speed averaged
2.1 times faster at 2 operations per word, with a 37% improvement at the higher number of
calculations.
SUSE vs Gentoo - Except for the isolated blip, which can be expected on this type of test, performance was
essentially the same.
MP gains - Ignoring the 12800 KB memory based tests, where gains can be lower, four thread versus one thread gains
averaged 3.38 times, with a maximum of 3.88 times.
Comparison with other MP-MFLOPS benchmarks - see Maximum 1 Core MFLOPS above.
###################### RPi 3 #######################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
MP-MFLOPS Linux/ARM V7A v1.0 Tue Aug 30 14:16:59 2016
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 159 181 178 690 692 685
2T 342 364 353 1384 1386 1368
4T 466 501 456 2451 2473 2633
8T 581 643 479 2618 2502 2550
Results x 100000
1T 76406 97075 99969 66015 95363 99951
########### RPi 3 V7A2 Double Precision ############
MP-MFLOPS Double Precision v1.0 Wed Sep 7 17:07:12 2016
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 143 182 171 678 680 674
2T 343 361 240 1360 1360 1335
4T 441 712 240 2232 2208 2185
8T 406 593 241 2345 2315 2272
Results x 100000
1T 76384 97072 99969 66065 95370 99951
################## RPi 3 SUSE #####################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Compiled for 64 bit ARM v8a
MP-MFLOPS armv8 64Bit Fri Feb 24 13:30:16 2017
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 730 717 413 1579 1575 1541
2T 1361 1351 389 3075 3145 2849
4T 2259 2417 370 5399 6114 4944
8T 2226 1919 352 5346 5948 4986
Results x 100000
1T 76406 97075 99969 66015 95363 99951
Continued Below or Go To Start
########### RPi 3 SUSE Double Precision ############
MP-MFLOPS armv8 64Bit Double Precision Fri Feb 24 13:53:27 2017
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 365 356 198 1233 1230 1127
2T 659 657 166 2401 2397 1923
4T 1200 927 176 4678 4640 2776
8T 1051 1039 174 4678 4682 2909
Results x 100000
1T 76384 97072 99969 66065 95370 99951
################ RPi 3 Gentoo ######################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Compiled for 64 bit ARM v8a
MP-MFLOPS armv8 64Bit Thu Mar 2 16:48:04 2017
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 709 634 391 1541 1535 1497
2T 1095 1072 355 3095 3023 2925
4T 1503 2249 350 5419 6070 5230
8T 2475 1985 381 5440 5975 5030
Results x 100000
1T 76406 97075 99969 66015 95363 99951
########## RPi 3 Gentoo Double Precision ###########
MP-MFLOPS armv8 64Bit Double Precision Thu Mar 2 16:52:33 2017
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 354 327 197 1205 1187 1081
2T 685 691 201 2411 2369 1763
4T 1202 1063 202 4681 4595 2064
8T 1145 1077 201 4520 4581 2663
Results x 100000
1T 76384 97072 99969 66065 95370 99951
To Start
MP-Whetstone - MP-WHETSPiA7, MP-WHETSPi64
Multiple threads each run the eight test functions at the same time, but with some dedicated variables. Measured
speed is based on the last thread to finish. Mutex functions are used to avoid updating conflicts, by only allowing
one thread at a time to access common data. Again, performance is generally proportional to the number of cores used.
There can be some significant differences from the single CPU Whetstone benchmark results on particular tests due to
a different compiler being used.
None of the test functions are suitable for SIMD operation, and the simpler instructions being used can lead to some
32 bit tests being faster than those compiled for 64 bits. The Fixed Point MIPS loops are clearly over optimised but, in
any case, the time taken has little influence on the overall MWIPS rating.
############################## RPi 3 ##################################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
MP-Whetstone Benchmark Linux/ARM V7A v1.0 Mon Aug 15 19:34:21 2016
Using 1, 2, 4 and 8 Threads
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
1 2 3 MOPS MOPS MOPS MOPS MOPS
1T 723.1 517.2 517.0 254.9 12.1 8.8 5853.9 1181.8 1189.8
2T 1464.7 960.5 1025.1 511.3 24.1 18.5 11899.0 2381.2 2385.7
4T 2902.3 1696.4 1867.3 1013.4 47.8 36.8 19754.6 4541.3 4687.1
8T 3004.0 2747.8 2569.0 1066.4 48.6 38.0 25502.9 6075.2 5610.8
Overall Seconds 4.77 1T, 4.74 2T, 4.88 4T, 9.76 8T
############################# RPi 3 SUSE ##############################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Compiled for 64 bit ARM v8a
MP-Whetstone Benchmark armv8 64 Bit Tue Mar 7 23:27:25 2017
Using 1, 2, 4 and 8 Threads
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
1 2 3 MOPS MOPS MOPS MOPS MOPS
1T 985.3 336.2 336.3 287.7 18.1 12.3 1478579.3 2331.7 1198.9
2T 1964.8 670.7 672.6 566.7 36.2 24.6 2794892.5 4724.7 2372.4
4T 3900.7 1248.1 1330.8 1139.9 71.6 48.9 3931546.6 9424.8 4747.9
8T 3925.4 1314.4 1349.8 1146.9 72.0 49.1 6508657.2 9578.2 4779.7
Overall Seconds 4.94 1T, 4.98 2T, 5.14 4T, 10.11 8T
############################ RPi 3 Gentoo #############################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Compiled for 64 bit ARM v8a
MP-Whetstone Benchmark armv8 64 Bit Wed Mar 8 11:48:21 2017
Using 1, 2, 4 and 8 Threads
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
1 2 3 MOPS MOPS MOPS MOPS MOPS
1T 1045.1 322.6 330.4 282.5 20.4 12.8 1527755.4 2316.1 1178.5
2T 2091.3 653.1 661.0 563.9 40.9 25.5 2764929.6 4599.8 2356.7
4T 2460.5 1199.4 1314.7 1124.8 41.2 27.3 5201735.3 9305.0 2480.2
8T 3394.6 1422.0 1697.0 1192.2 56.4 44.8 4006323.7 10229.6 2480.3
Overall Seconds 5.02 1T, 5.02 2T, 8.57 4T, 13.51 8T
To Start
MP-Dhrystone - MP-DHRYPiA7, MP-DHRYPi64
This runs multiple copies of the whole program at the same time. Dedicated data arrays are used for each thread but
there are numerous other variables that are shared. The latter reduces performance gains via multiple threads and, in
some cases, these can be slower than using a single thread.
The only reliable measurement, for comparison purposes, is the single thread speed. Here, the 64 bit version indicates a
speed improvement of 50%, over the 32 bit program.
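The VAX MIPS rating in the listings below follows the usual Dhrystone convention of dividing Dhrystones per second
by 1757, the score of the VAX 11/780 reference machine. A small sketch, with a hypothetical function name:

```c
#include <assert.h>

/* VAX MIPS rating: Dhrystones per second divided by 1757, the score of
   the VAX 11/780 reference machine. */
double vax_mips(double dhrystones_per_second)
{
    assert(dhrystones_per_second >= 0.0);
    return dhrystones_per_second / 1757.0;
}
```

The single thread results of 4229473 and 6343818 Dhrystones per second round to 2407 and 3611 VAX MIPS, a ratio
of close to 1.5.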
############################## RPi 3 ##################################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
MP-Dhrystone Benchmark Linux/ARM V7A v1.0 Mon Aug 15 19:47:57 2016
Using 1, 2, 4 and 8 Threads
Threads 1 2 4 8
Seconds 0.95 1.12 1.59 3.04
Dhrystones per Second 4229473 7124952 10091677 10523432
VAX MIPS rating 2407 4055 5744 5989
Internal pass count correct all threads
End of test Mon Aug 15 19:48:04 2016
############################# RPi 3 SUSE ##############################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Compiled for 64 bit ARM v8a
MP-Dhrystone Benchmark armv8 64 Bit Tue Mar 7 22:20:45 2017
Using 1, 2, 4 and 8 Threads
Threads 1 2 4 8
Seconds 0.63 0.77 1.40 2.77
Dhrystones per Second 6343818 10382333 11459690 11533058
VAX MIPS rating 3611 5909 6522 6564
Internal pass count correct all threads
End of test Tue Mar 7 22:20:51 2017
############################ RPi 3 Gentoo #############################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Compiled for 64 bit ARM v8a
MP-Dhrystone Benchmark armv8 64 Bit Wed Mar 8 11:34:32 2017
Using 1, 2, 4 and 8 Threads
Threads 1 2 4 8
Seconds 0.63 0.78 2.75 3.11
Dhrystones per Second 6367171 10213192 5810865 10285768
VAX MIPS rating 3624 5813 3307 5854
Internal pass count correct all threads
End of test Wed Mar 8 11:34:40 2017
To Start
MP-BusSpeedPiA7, MP-BusSpeedPi64
This runs integer read only tests using caches and RAM, each thread accessing the same data sequentially. To start
with, data is read with large address increments to demonstrate burst data transfers. Performance gains, using L1
cache, can be proportional to the number of cores, but not quite so using L2. The program is designed to produce
maximum throughput over buses and demonstrates the fastest RAM speeds using multiple cores.
In the original version, each thread started reading data from the same starting point. This produced acceptable results
until shared L2 caches appeared. Then it produced excessive RAM speeds, using more than one thread. With version 2,
as used for the following, each thread starts reading from different addresses, providing more realistic results.
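One possible way of staggering the starting addresses is sketched below, where each thread begins at a different
offset and wraps round at the end of the data. The offset rule and function name are assumptions; the method used
in version 2 of the benchmark is not detailed here. The modulo in the inner loop keeps the sketch short and would
be too slow for a real speed test.

```c
#include <assert.h>
#include <stddef.h>

/* Read all n words once, starting at a per-thread offset and wrapping at
   the end, combining the words with AND. The offset rule is an assumption. */
unsigned int read_staggered(const unsigned int *data, size_t n,
                            size_t t, size_t nthreads)
{
    assert(n > 0 && nthreads > 0 && t < nthreads);
    size_t start = (n / nthreads) * t;      /* different start per thread */
    unsigned int result = 0xffffffffu;
    for (size_t k = 0; k < n; k++)
        result &= data[(start + k) % n];
    return result;
}
```

Every thread still reads all of the data, so each should return the same AND result.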
The 32 bit ARM V7A compilation produced the expected pattern of speeds, doubling up with decreasing address
increments, where burst reading is used, and improving L1 cache data transfer rate, also providing reasonable MP
performance gains. The 64 bit results were much slower and, in particular, demonstrated slower L1 cache speeds at
reducing address increments. The reason can be identified from a disassembly of the code used for the important “Read
All” tests. Here, the C code has a loop with 64 AND operations. The 32 bit version translated these arithmetic
operations into 16 NEON four way vector instructions. The 64 bit version had 64 scalar AND and 64 data load
instructions, overall executing 2.5 times as many instructions as the 32 bit version to deal with the same
amount of data.
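Reading with an address increment can be sketched as follows, with an increment of 1 corresponding to the read all
case. The function names are hypothetical, and the benchmark's loop of 64 AND operations per pass is not
reproduced. Whether the compiler turns such a loop into vector instructions depends on the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>

/* AND together every inc-th word of data[]; inc = 1 reads all words. */
unsigned int read_with_increment(const unsigned int *data, size_t n,
                                 size_t inc)
{
    assert(inc > 0);
    unsigned int result = 0xffffffffu;
    for (size_t i = 0; i < n; i += inc)
        result &= data[i];
    return result;
}

/* Number of words actually read for a given increment. */
size_t words_read(size_t n, size_t inc)
{
    assert(inc > 0);
    return (n + inc - 1) / inc;
}
```

Doubling the increment halves the number of words read, which is why speeds are expected to double with each
smaller increment where burst reading applies.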
MB/Second Reading Data, 1, 2, 4 and 8 Threads
Staggered starting addresses to avoid caching
############# Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz #############
MP-BusSpd ARM V7A v2 Sun Jul 24 09:26:21 2016
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
12.3 1T 3011 3715 3792 4080 4400 4149
2T 5391 6873 7125 7827 8466 8124
4T 8622 11926 13488 15276 16419 13422
8T 4922 7930 9659 11732 13307 11995
122.9 1T 565 563 1070 1792 2830 3865
2T 886 901 1762 3225 5402 7584
4T 901 921 1863 3727 7185 13816
8T 874 919 1762 3712 6269 9242
12288 1T 120 125 244 420 968 1926
2T 126 128 246 537 1000 2184
4T 110 118 231 443 990 1824
8T 120 137 262 517 1043 2124
########### Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz SUSE ###########
Compiled for 64 bit ARM v8a
MP-BusSpd armv8 64 Bit Tue Mar 7 22:44:44 2017
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
12.3 1T 2885 2407 2576 2093 1460 1521
2T 4764 4197 4944 3960 2890 2929
4T 6842 6443 8343 6997 5360 5667
8T 4563 4352 6368 6106 4600 5184
122.9 1T 545 584 1043 1596 1456 1462
2T 872 890 1718 3001 2807 2861
4T 828 900 1859 3687 5523 5789
8T 866 913 1875 3691 5477 5704
12288 1T 113 123 244 486 915 1145
2T 69 125 226 435 1149 1964
4T 86 91 268 490 998 2092
8T 89 104 219 480 976 1798
######## Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz Gentoo ########
Compiled for 64 bit ARM v8a
MP-BusSpd armv8 64 Bit Wed Mar 15 11:23:10 2017
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
12.3 1T 2699 2430 2645 2052 1467 1493
2T 4687 4153 4854 3797 2827 2933
4T 6825 6472 8358 7148 4789 5680
8T 4272 4146 5928 5705 4588 4977
122.9 1T 550 568 1022 1615 1427 1472
2T 872 852 1691 3027 2821 2932
4T 821 894 1845 3654 5570 5822
8T 896 892 1850 3602 5136 5439
12288 1T 108 115 224 455 852 1085
2T 51 120 216 432 856 1722
4T 68 109 229 402 887 1604
8T 67 109 240 583 975 1834
To Start
MP-RandMemPiA7, MP-RandMemPi64
The benchmark has cache and RAM read only and read/write tests using sequential and random access, each thread
accessing the same data but starting at different points. It uses the Mutex functions as in Whetstone above,
sometimes leading to no performance gains using multiple threads. Although performance via the L1 cache, L2 cache
and RAM can be different, it is normally consistent, in each of these areas, during read/write tests. With the read only
tests, performance via L1 cache typically produced a throughput gain of 3.6 to 3.8 times using four cores, but
somewhat less so, using shared data in L2 cache. Random access is also demonstrated as being relatively slow where
burst data transfers are involved. Note that performance can vary somewhat, and a few runs might be needed to
demonstrate best case results.
L1 cache 64 bit speeds are shown to be 43% faster than those at 32 bits for read only tests, and 20% faster via L2
cache but, in the same areas, up to 20% slower when writing is involved.
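The difference between serial and random access can be illustrated by reading the same data through an index
array that is either in order or shuffled. This is a sketch under assumptions: the function names are hypothetical
and the small random number generator is only there to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Small linear congruential generator, used only to keep the sketch
   self-contained; the benchmark's own random numbers are not shown here. */
static unsigned int next_random(unsigned int *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Fill idx[] with 0..n-1 (serial order) or with pseudo-random positions. */
void fill_indices(size_t *idx, size_t n, int random_order, unsigned int seed)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        idx[i] = random_order ? next_random(&seed) % n : i;
}

/* Read data[] in the order given by idx[]; the same loop serves both tests. */
unsigned int read_indexed(const unsigned int *data, const size_t *idx,
                          size_t n)
{
    unsigned int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[idx[i]];
    return sum;
}
```

With serial indices, successive reads fall in the same burst of data. With random indices, most reads from L2 cache
or RAM need a new burst, in line with the much lower RndRD speeds in the listings.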
MB/Second Using 1, 2, 4 and 8 Threads
############################## RPi 3 ##################################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
MP-RandMem Linux/ARM V7A v1.0 Mon Aug 15 19:37:27 2016
KB SerRD SerRDWR RndRD RndRDWR
12.3 1T 2907 3773 2917 3790
2T 5480 3768 5187 3775
4T 11198 3679 10960 3712
8T 10094 3697 10038 3685
122.9 1T 2673 3340 686 892
2T 5031 3386 1251 888
4T 9398 3378 2002 890
8T 9291 3370 1916 886
12288 1T 1896 899 50 64
2T 2535 900 98 65
4T 2878 896 137 64
8T 2631 897 130 65
############################# RPi 3 SUSE ##############################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Compiled for 64 bit ARM v8a
MP-RandMem armv8 64 Bit Tue Mar 7 23:20:26 2017
KB SerRD SerRDWR RndRD RndRDWR
12.3 1T 4251 3142 4180 3074
2T 7641 3118 7586 3120
4T 15308 3077 15309 3060
8T 14920 3041 14761 3043
122.9 1T 3462 2848 889 858
2T 6356 2899 1590 846
4T 11078 2910 2013 857
8T 11069 2917 2018 843
12288 1T 1858 873 83 67
2T 2331 864 148 66
4T 2359 878 160 66
8T 2108 890 163 66
############################ RPi 3 Gentoo #############################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Compiled for 64 bit ARM v8a
MP-RandMem armv8 64 Bit Sun Mar 12 11:18:10 2017
KB SerRD SerRDWR RndRD RndRDWR
12.3 1T 4268 3087 4267 3087
2T 7520 3062 7525 3055
4T 15295 3021 14322 3021
8T 15200 2973 14897 2999
122.9 1T 3384 2851 872 839
2T 6314 2877 1523 838
4T 11027 2871 2012 836
8T 10344 2864 1937 835
12288 1T 1795 846 78 63
2T 1933 771 136 63
4T 1760 845 152 63
8T 1972 843 138 63
To Start
OpenMP-MFLOPS, notOpenMP-MFLOPS, OpenMP-MFLOPS64, notOpenMP-MFLOPS64
The benchmark uses the same program calculations as the original MP_MFLOPS benchmark for Linux, with
MP-MFLOPS above using a cut down version, implemented for use on Android devices. The OpenMP-MFLOPS benchmark uses
the simplest OpenMP directive, #pragma omp parallel for, before the for loops where parallelisation might be expected,
and a -fopenmp compile parameter. Also, notOpenMP-MFLOPS is the same, without the compile parameter.
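A minimal sketch of the directive in use is shown below, with a hypothetical function name and a two operations
per word calculation chosen for illustration. Compiled with gcc and -fopenmp, the loop iterations are shared
between the available cores. Compiled without it, the pragma is ignored and the same source runs on one core.

```c
#include <assert.h>
#include <stddef.h>

/* With -fopenmp the loop iterations are shared between cores; without it
   the pragma is ignored and the same source runs on one core. */
void add_multiply(float *x, long n, float a, float b)
{
    assert(x != NULL || n == 0);
#pragma omp parallel for
    for (long i = 0; i < n; i++)
        x[i] = (x[i] + a) * b;
}
```

Each iteration only touches its own x[i], so the results do not depend on how the iterations are divided.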
Samples of full results are below for 32 bit and 64 bit benchmarks. At this time OpenMP libraries are not included in gcc
for 64 bit Gentoo but, of course, the notOpenMP-MFLOPS64 program could be run.
Below the detailed results are performance comparisons and a table of numeric results. Although the latter
were constant during a test run, variations occur in values from different compilations. It should be noted that the
minimum data size is 400 KB, which is in L2 cache using one core or four cores.
64 Bit vs 32 Bit - Main gains were at 32 operations per word read, little different with the single core test, maybe a
little slower, but up to 2.54 times faster using all cores.
MP gains - The main gains were on tests using L2 cache and 8 calculations per word, with maximum of 2.72 times at
32 bits and 3.37 times at 64 bits.
Different Numeric Results - 32 bit and 64 bit results can be different. The values with and without OpenMP are the
same, except for 32 operations per word at 32 bits. Here, the same type of instructions are used, but in a different
order.
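The effect of instruction order on floating point results can be shown with a deliberately extreme example, summing
the same three single precision values in two different orders. This is only an illustration of the general effect,
with hypothetical function names; it is not the instruction sequence from the benchmark.

```c
#include <assert.h>

/* (a + b) + c, with volatile used to force each step to single precision */
float sum_left(float a, float b, float c)
{
    volatile float t = a + b;
    volatile float r = t + c;
    return r;
}

/* a + (b + c), the same additions in a different order */
float sum_right(float a, float b, float c)
{
    volatile float t = b + c;
    volatile float r = a + t;
    return r;
}

/* 1.0e8 - 1.0e8 + 1.0 gives 1.0 one way, but the 1.0 is lost the other
   way, as single precision values near 1.0e8 are spaced 8 apart. */
void check_order_example(void)
{
    assert(sum_left(1.0e8f, -1.0e8f, 1.0f) == 1.0f);
    assert(sum_right(1.0e8f, -1.0e8f, 1.0f) == 0.0f);
}
```

The benchmark differences are far smaller, appearing in the fifth decimal place, but arise in the same way.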
Comparison with other MP-MFLOPS benchmarks - see Maximum 1 Core MFLOPS above.
############################## RPi 3 ##################################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Not OpenMP MFLOPS Benchmark 1 Mon Aug 15 19:23:03 2016
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 100000 2 2500 0.697952 716 0.929538 Yes
Data in & out 1000000 2 250 1.160158 431 0.992550 Yes
Data in & out 10000000 2 25 1.140070 439 0.999250 Yes
Data in & out 100000 8 2500 1.178477 1697 0.957126 Yes
Data in & out 1000000 8 250 1.442497 1386 0.995524 Yes
Data in & out 10000000 8 25 1.428921 1400 0.999550 Yes
Data in & out 100000 32 2500 5.060230 1581 0.890268 Yes
Data in & out 1000000 32 250 5.203246 1538 0.988078 Yes
Data in & out 10000000 32 25 5.203889 1537 0.998806 Yes
OpenMP MFLOPS Benchmark 1 Sat Jul 30 13:01:12 2016
Data in & out 100000 2 2500 0.363631 1375 0.929538 Yes
Data in & out 1000000 2 250 1.133716 441 0.992550 Yes
Data in & out 10000000 2 25 1.150107 435 0.999250 Yes
Data in & out 100000 8 2500 0.432833 4621 0.957126 Yes
Data in & out 1000000 8 250 1.177219 1699 0.995524 Yes
Data in & out 10000000 8 25 1.151536 1737 0.999550 Yes
Data in & out 100000 32 2500 3.845114 2081 0.890232 Yes
Data in & out 1000000 32 250 3.754590 2131 0.988068 Yes
Data in & out 10000000 32 25 3.737356 2141 0.998785 Yes
Continued Below or Go To Start
############################# RPi 3 SUSE ##############################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Compiled for 64 bit ARM v8a
notOpenMP MFLOPS64 Fri Feb 24 15:48:41 2017
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 100000 2 2500 0.696362 718 0.929538 Yes
Data in & out 1000000 2 250 1.202102 416 0.992550 Yes
Data in & out 10000000 2 25 1.140033 439 0.999250 Yes
Data in & out 100000 8 2500 1.162491 1720 0.957117 Yes
Data in & out 1000000 8 250 1.504922 1329 0.995518 Yes
Data in & out 10000000 8 25 1.478444 1353 0.999549 Yes
Data in & out 100000 32 2500 5.346043 1496 0.890215 Yes
Data in & out 1000000 32 250 5.482719 1459 0.988088 Yes
Data in & out 10000000 32 25 5.477190 1461 0.998796 Yes
OpenMP MFLOPS64 Fri Feb 24 16:49:35 2017
Data in & out 100000 2 2500 0.229756 2176 0.929538 Yes
Data in & out 1000000 2 250 1.230560 406 0.992550 Yes
Data in & out 10000000 2 25 1.159971 431 0.999250 Yes
Data in & out 100000 8 2500 0.344756 5801 0.957117 Yes
Data in & out 1000000 8 250 1.245537 1606 0.995518 Yes
Data in & out 10000000 8 25 1.187876 1684 0.999549 Yes
Data in & out 100000 32 2500 1.373730 5824 0.890215 Yes
Data in & out 1000000 32 250 1.519274 5266 0.988088 Yes
Data in & out 10000000 32 25 1.469316 5445 0.998796 Yes
############################ RPi 3 Gentoo #############################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Compiled for 64 bit ARM v8a
notOpenMP MFLOPS64 Thu Mar 2 17:05:47 2017
Data in & out 100000 2 2500 0.739649 676 0.929538 Yes
Data in & out 1000000 2 250 1.230036 406 0.992550 Yes
Data in & out 10000000 2 25 1.179612 424 0.999250 Yes
Data in & out 100000 8 2500 1.196997 1671 0.957117 Yes
Data in & out 1000000 8 250 1.560925 1281 0.995518 Yes
Data in & out 10000000 8 25 1.483354 1348 0.999549 Yes
Data in & out 100000 32 2500 5.437056 1471 0.890215 Yes
Data in & out 1000000 32 250 5.585995 1432 0.988088 Yes
Data in & out 10000000 32 25 5.576582 1435 0.998796 Yes
OpenMP MFLOPS64 - OpenMP library file not available
################### Comparison ###################
Words Ops/ MP Gains MP Gains 64 Bit Gains 64 Bit Gains
Word 32 Bit 64 Bit Not OMP OMP
100000 2 1.92 3.03 1.00 1.58
1000000 2 1.02 0.98 0.97 0.92
10000000 2 0.99 0.98 1.00 0.99
100000 8 2.72 3.37 1.01 1.26
1000000 8 1.23 1.21 0.96 0.95
10000000 8 1.24 1.24 0.97 0.97
100000 32 1.32 3.89 0.95 2.80
1000000 32 1.39 3.61 0.95 2.47
10000000 32 1.39 3.73 0.95 2.54
#################### Numeric Results #####################
Words Ops/ Not OMP OMP Not OMP OMP
Word 32 Bit 32 Bit 64 Bit 64 Bit
100000 2 0.929538 0.929538 0.929538 0.929538
1000000 2 0.992550 0.992550 0.992550 0.992550
10000000 2 0.999250 0.999250 0.999250 0.999250
100000 8 0.957126 0.957126 0.957117 0.957117
1000000 8 0.995524 0.995524 0.995518 0.995518
10000000 8 0.999550 0.999550 0.999549 0.999549
100000 32 0.890268 0.890232 0.890215 0.890215
1000000 32 0.988078 0.988068 0.988088 0.988088
10000000 32 0.998806 0.998785 0.998796 0.998796
To Start
OpenMP-MemSpeed2, NotOpenMP-MemSpeed2, OpenMP-MemSpeed264, NotOpenMP-MemSpeed264
This is the same as the Memory Speed Benchmark, but with measurements extended to cover more memory, and with the
code compiled using the OpenMP directive and compile parameter. The NotOpenMP tests use the same code compiled
without OpenMP. Together, these allow comparisons of MP performance gains over the full range of memory use. At the
time of testing, OpenMP was not available in Gentoo, but the NotOpenMP benchmark was run.
MP Gains and Losses - As all the test functions involve writing back results, with few instructions in between, MP
benefits are often limited. With the OpenMP 64 bit version, integer tests averaged 12% to 30% slower, but floating
point calculations were faster, by 1.62 to 2.45 times for DP and 1.25 to 1.88 times for SP. The corresponding 32 bit
figures were 33% to 61%, 2.85 to 3.75 and 1.44 to 1.88 respectively.
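As an illustration of how such gains and losses are expressed, the following Python sketch divides each OpenMP MB/S value by the matching Not OMP value for the 8 KB rows of the 32 bit RPi 3 listings in this section. It assumes the Not OMP listing uses the same column order as the OpenMP one (Dble, Sngl, Int32 for each of the three test functions), and the helper names are made up for this example. Single cells can fall outside the averaged ranges quoted above, and the sketch does not try to reproduce how those averages were formed.

```python
# MB/S values copied from the 8 KB rows of the 32 bit RPi 3 listings.
# Assumed column order: Dble, Sngl, Int32 for each of the three test functions.
COLUMNS = ["Dble", "Sngl", "Int32"] * 3
OPENMP_8K = [5414, 3115, 1322, 10150, 5068, 1470, 14323, 8301, 1254]
NOT_OMP_8K = [1594, 2547, 3812, 2389, 3465, 3812, 2715, 2716, 2716]

def mp_gain(openmp_mbps, not_omp_mbps):
    """Ratio above 1.0 means the OpenMP version was faster."""
    return openmp_mbps / not_omp_mbps

def describe(gain):
    """Express a ratio as 'times faster' or as a percentage slower."""
    if gain >= 1.0:
        return f"{gain:.2f} times faster"
    return f"{(1.0 - gain) * 100:.0f}% slower"

gains = [mp_gain(o, n) for o, n in zip(OPENMP_8K, NOT_OMP_8K)]

for name, gain in zip(COLUMNS, gains):
    print(f"{name:5s} {describe(gain)}")
```

The same ratio therefore reads as "times faster" for the floating point columns and as "% slower" for the integer columns, which is the convention used in the paragraph above.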
64/32 Bit Ratios - Comparisons of 64 bit versus 32 bit were also diverse, starting with 64 bit RAM speeds being
somewhat slower. For cache based data, average integer, DP and SP performance ratios were 1.23 to 1.45, 0.82 to 1.05
and 0.71 to 1.04 with OpenMP, then 1.05 to 1.35, 1.63 to 2.60 and 0.96 to 1.35 with notOpenMP.
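One possible way to form such averages, shown here only as a Python sketch, is to take the 64 bit to 32 bit ratio for each of the three test functions and average them per data type. The values are the 8 KB rows of the 32 bit and SUSE 64 bit OpenMP listings in this section. Whether the quoted ranges were produced exactly this way is an assumption, and `ratios_by_type` is a name made up for this example.

```python
from statistics import mean

# 8 KB rows of the OpenMP listings, MB/S.
# Column order: Dble, Sngl, Int32 for each of the three test functions.
OMP_64BIT_8K = [6182, 3187, 1711, 9272, 4842, 1848, 11315, 5645, 2054]
OMP_32BIT_8K = [5414, 3115, 1322, 10150, 5068, 1470, 14323, 8301, 1254]
TYPES = ["Dble", "Sngl", "Int32"]

def ratios_by_type(new, old):
    """Average new/old ratio over the three test functions, per data type."""
    averages = {}
    for index, type_name in enumerate(TYPES):
        # Every third column, starting at index, belongs to one data type.
        pairs = zip(new[index::3], old[index::3])
        averages[type_name] = mean(n / o for n, o in pairs)
    return averages

average_ratio = ratios_by_type(OMP_64BIT_8K, OMP_32BIT_8K)

for type_name in TYPES:
    print(f"{type_name:5s} {average_ratio[type_name]:.2f}")
```

For this single row the averages come out at about 0.95 for DP, 0.89 for SP and 1.40 for integer, each inside the OpenMP cache based ranges given above.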
############################## RPi 3 ##################################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom
Start of test Mon Sep 5 14:27:38 2016
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 5518 2990 1309 8808 4732 1455 15426 7656 1244
8 5414 3115 1322 10150 5068 1470 14323 8301 1254
16 5503 3143 1270 10255 5154 1378 16743 8043 1221
32 5507 3145 1344 10142 5089 1458 16572 7732 1206
64 5033 2999 1257 9230 4867 1419 16012 7869 1228
128 5255 3041 1258 9372 5014 1365 9452 8192 1252
256 5266 3093 1282 9401 5006 1372 8418 7864 1313
512 4494 2765 1358 7248 4482 1332 5748 5460 1410
1024 3810 2683 1078 4425 3668 1155 1753 1732 1265
2048 2008 1425 1098 2274 2214 980 1086 1094 1333
4096 3972 2413 1075 4628 3672 945 1058 1057 839
8192 1597 2435 920 3671 3649 1199 1059 1067 1043
16384 3838 1624 1867 4440 1550 1108 1065 1076 1166
32768 1658 2273 1695 4227 1876 1054 1066 1039 921
65536 3657 1247 1286 4839 3801 1308 1053 1046 1133
131072 990 655 810 1260 932 826 1129 1083 619
####################### RPi 3 Not OMP ###########################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Memory Reading Speed Test Not OpenMP Version 2 by Roy Longbottom
Start of test Mon Sep 5 14:28:22 2016
4 785 2536 3789 2360 3448 3787 2670 2693 2692
8 1594 2547 3812 2389 3465 3812 2715 2716 2716
16 1595 2551 3824 2392 3477 3823 2727 2728 2728
32 1556 2435 3564 2300 3272 3565 2730 2722 2723
64 1513 2314 3330 2189 3091 3327 2599 2435 2435
128 1516 2312 3357 2188 3118 3353 2635 2569 2569
256 1521 2316 3381 2187 3130 3384 2676 2618 2617
512 1419 2034 2765 1977 2674 2835 2593 2481 2524
1024 1113 1379 1544 1348 1521 1543 1691 1583 1586
2048 995 1203 1282 1193 1277 1257 1263 1231 1232
4096 992 1196 1248 1178 1252 1259 1203 1176 1166
8192 1041 1237 1290 1213 1298 1291 927 943 954
16384 1052 1262 1311 1229 1252 1303 874 866 867
32768 1053 1271 1317 1239 1325 1303 995 987 991
65536 1057 1281 1323 1245 1343 1316 920 920 918
131072 1057 1283 1323 1184 1350 1327 856 849 840
########################## RPi 3 SUSE 64 Bit ###########################
Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
Memory Reading Speed Test OpenMP 64 Bit by Roy Longbottom
Start of test Tue Mar 7 23:41:04 2017
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 5788 3106 1698 8203 4576 1827 11038 5622 2042
8 6182 3187 1711 9272 4842 1848 11315 5645 2054
16 5631 3197 1639 9320 4850 1753 11223 5520 1813
32 6132 3174 1604 9124 4833 1640 11040 5408 1731
64 5967 3168 1602 8641 4764 1688 9768 5338 1763
128 5469 3173 1572 8682 4408 1727 9054 5358 1811
256 5242 3177 1625 8630 4668 1678 8276 4972 182