MLPerf Mobile Inference Benchmark
Why Mobile AI Benchmarking Is Hard and What to Do About It
Vijay Janapa Reddi*, David Kanter†, Peter Mattson‡, Jared Duke‡, Thai Nguyen‡, Ramesh Chukka§,
Kenneth Shiring¶, Koan-Sin Tan¶, Mark Charlebois||, William Chou||, Mostafa El-Khamy**,
Jungwook Hong**, Michael Buch*, Cindy Trinh††, Thomas Atta-fosu§, Fatih Cakir**,
Masoud Charkhabi‡, Xiaodong Chen**, Jimmy Chiang¶, Dave Dexter‡‡,
Woncheol Heo‡, Guenther Schmuelling§§, Maryam Shabani§, Dylan Zika††
Abstract

MLPerf Mobile is the first industry-standard open-source mobile benchmark developed by industry members and academic researchers to allow performance/accuracy
evaluation of mobile devices with different AI chips and
software stacks. The benchmark draws from the expertise
of leading mobile-SoC vendors, ML-framework providers,
and model producers. In this paper, we motivate the drive to
demystify mobile-AI performance and present MLPerf Mo-
bile’s design considerations, architecture, and implemen-
tation. The benchmark comprises a suite of models that
operate under standard data sets, quality metrics, and run
rules. For the ﬁrst iteration, we developed an Android
app to provide an “out-of-the-box” inference-performance
benchmark for computer vision and natural-language pro-
cessing on mobile devices. The benchmark also supports
non-smartphone devices such as laptops and mobile PCs.
As a whole, the MLPerf Mobile inference benchmark can
serve as a framework for integrating future models, for cus-
tomizing quality-target thresholds to evaluate system per-
formance, for comparing software frameworks, and for as-
sessing heterogeneous-hardware capabilities for machine
learning, all fairly and faithfully with reproducible results.
1 Introduction

Mobile artificial-intelligence (AI) applications are increasingly important as AI technology becomes a critical differentiator among smartphones, laptops, and other mobile devices. Many consumer applications benefit from AI: image processing, voice processing, and text interpretation.
*Harvard University, †MLCommons, ‡Google, §Intel, ¶MediaTek, ||Qualcomm, **Samsung, ††ENS Paris-Saclay, ‡‡Arm, §§Microsoft

AI provides state-of-the-art solutions to these tasks with a quality that users will notice on their devices. More and
more consumers are using such applications, and they ex-
pect a high-quality experience—especially for applications
with video or audio interactivity.
Consequently, mobile-device and chipset manufacturers
are motivated to improve AI implementations. Support for
the technology is becoming common in nearly all mobile
segments, from cost-optimized devices to premium phones.
The many AI approaches range from purely software tech-
niques to hardware-supported machine learning that relies
on tightly coupled libraries. Seeing through the mist of
competing solutions is difﬁcult for mobile consumers.
On the hardware front, laptops and smartphones have in-
corporated application-speciﬁc integrated circuits (ASICs)
to support AI in an energy-efﬁcient manner. For machine
learning, this situation leads to custom hardware that ranges
from specialized instruction-set-architecture (ISA) exten-
sions on general-purpose CPUs to ﬁxed-function acceler-
ators dedicated to efﬁcient machine learning. Also, because
mobile devices are complex, they incorporate a variety of
features to remain competitive, especially those that help
conserve battery life.
The software front includes many code paths and AI
infrastructures owing to the desire to efﬁciently support
machine-learning hardware. Most SoC vendors lean toward
custom pathways for model compilation and deployment
that are tightly integrated with the hardware. Examples
include Google's Android Neural Network API (NNAPI), Intel's OpenVINO, MediaTek's NeuroPilot, Qualcomm's SNPE, and Samsung's Exynos Neural Network SDK. These frameworks handle different numerical formats (e.g., FP32, FP16, and INT8) for execution,
and they provide run-time support for various machine-
learning networks that best ﬁt the application and platform.
arXiv:2012.02328v1 [cs.LG] 3 Dec 2020

Hardware and software support for mobile AI applications is becoming a differentiating capability, resulting in a growing need to make AI-performance evaluation transparent. OEMs, SoC vendors, and consumers benefit when
mobile devices employ AI in ways they can see and com-
pare. A typical comparison point for smartphone makers
and the technical press, for example, is CPUs and GPUs,
both of which have associated benchmarks. Similarly, mobile-AI performance evaluation can benefit from standard benchmarks.
Benchmarking AI performance is nontrivial, however. It
is especially challenging because AI implementations come
in a wide variety with differing capabilities. This variety,
combined with a lack of software-interface standards, com-
plicates the design of standard benchmarks. In edge de-
vices, the quality of the results is often highly speciﬁc to
each problem. In other words, the deﬁnition of high perfor-
mance is often task speciﬁc. For interactive user devices,
latency is normally the preferred performance metric. For
noninteractive ones, throughput is usually preferred. The
implementation for each task can generally trade off neural-
network accuracy for lower latency. This tradeoff makes
choosing a benchmark suite’s accuracy threshold critical.
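The accuracy-threshold rule can be made concrete with a short check. A minimal sketch in Python (the function and variable names are ours, not MLPerf's code):

```python
def passes_quality_target(measured_acc: float, fp32_acc: float,
                          rel_threshold: float) -> bool:
    """True if a candidate (e.g., quantized) model retains enough of the
    FP32 reference accuracy to count as a valid benchmark run."""
    return measured_acc >= rel_threshold * fp32_acc

# Example: a 98%-of-FP32 target against a 76.19% Top-1 reference
# sets an absolute floor of about 74.67% Top-1.
print(passes_quality_target(74.80, 76.19, 0.98))  # True
print(passes_quality_target(74.00, 76.19, 0.98))  # False
```

Fixing the threshold this way lets implementations trade accuracy for latency only down to a floor that all submitters share.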
To address these challenges, MLPerf (mlperf.org) takes
an open-source approach. It is a consortium of industry and
academic organizations with shared interests, yielding collective expertise on neural-network models, data sets, and submission rules to ensure the results are relevant to the industry and beneficial to consumers while remaining transparent.
The following are important principles that inform the
MLPerf Mobile benchmark:
• Measured performance should match the performance that end users perceive in commercial devices. We want to prevent the benchmark from implementing special code beyond what these users generally employ.
• The benchmark's neural-network models should closely match typical mobile-device workloads, reflecting real benefits to mobile-device users.
• Neural-network benchmark models should represent diverse tasks. This approach yields a challenging test that resists extensive domain-specific optimizations.
• Testing conditions should closely match the environments in which mobile devices typically serve, including ambient temperature, battery power, and any special performance modes.
• All benchmark submissions should undergo third-party validation. Since mobile devices are ubiquitous, results should be reproducible outside the submitting organization.
MLPerf’s approach to addressing the mobile-AI bench-
mark needs of smartphones is to build an Android app that
all benchmarking must use. As of the initial v0.7 release
of the MLPerf Mobile benchmark, the app employs a stan-
dard set of four neural-network models for three vision tasks
and one NLP task and passes these models to the back-end
layer. This layer is an abstraction that allows hardware ven-
dors to optimize their implementations for neural networks.
The app also has a presentation layer that wraps the more technical benchmark layers and the Load Generator ("LoadGen"). MLPerf created the LoadGen to allow representative testing of different inference platforms and use cases; it generates inference requests in a pattern and measures parameters such as latency, throughput, or latency-bounded throughput. MLPerf additionally offers a
headless version of the mobile application that enables lap-
tops running non-mobile OSs to use the same benchmarks.
The first round of MLPerf Mobile submissions is complete. Intel, MediaTek, Qualcomm, and Samsung participated in this round, and all passed the third-party-validation requirement (i.e., reproducibility) for their results. These results show performance variations and illustrate the wide range of hardware and software approaches that vendors take to implement neural-network models
on mobile devices. The results also highlight a crucial take-
away: measuring mobile-AI performance is challenging but
possible. It requires a deep understanding of the fragmented
and heterogeneous mobile ecosystem as well as a strong
commitment to fairness and reproducibility. MLPerf Mo-
bile is a step toward better benchmark transparency.
2 Benchmarking Challenges
The mobile ecosystem is rife with hardware hetero-
geneity, software fragmentation, developer options, deploy-
ment scenarios, and OEM life cycles. Each by itself leads
to hardware-performance variability, but the combination
makes AI benchmarking on mobile systems extremely dif-
ﬁcult. Figure 1 shows the various stakeholders and explains
the implementation options and challenges facing each one.
2.1 Hardware Heterogeneity
Smartphones contain complex heterogeneous chipsets
that provide many different compute units and accelerators.
Any or all of these components can aid in machine-learning
(ML) inference. As such, recognizing the variability of
SoCs is crucial.
A typical mobile system-on-a-chip (SoC) complex in-
cludes a CPU cluster, GPU, DSP, Neural Processing Unit
(NPU), Hexagon Tensor Accelerator (HTA), Hexagon Vec-
tor Extensions (HVX), and so on. Many smartphones to-
day are Arm based, but the CPU cores generally implement
a heterogeneous "big.LITTLE" architecture. Some SoCs even have big-CPU clusters in which some big cores clock faster than others. Also, devices fall into different tiers, with different hardware capabilities at different prices, varying in their memory capacity and storage features.

Figure 1: Mobile AI performance stakeholders.
Any processing engine can run ML workloads, but
this ﬂexibility also makes benchmarking AI performance
difﬁcult. A given device may have a spectrum of AI-
performance capabilities depending on which processing
engines it uses. Hence the need for a systematic way to
benchmark a smartphone’s AI-hardware performance.
2.2 Software Fragmentation
The mobile-software ecosystem is heavily differentiated,
from the OS to the machine-learning run time. The result
can be drastic hardware-performance changes or variability.
Mobile devices employ various OSs: Android, iOS, Windows, Ubuntu, Yocto, and so on. Each one has its own ecosystem of ML application programming interfaces (APIs) and application-deployment options that necessitate particular support.
Smartphone OSs have undergone substantial consolida-
tion. Numerous APIs have served in the development of
ML applications, and often, a single SoC or OEM device
will support a vendor SDK and a plurality of frameworks.
SoC vendors will by default offer a proprietary SDK that
generates optimized binaries so ML models can run on
SoC-speciﬁc hardware. These vendors also make engineer-
ing investments to support more-generic frameworks, such
as TensorFlow Lite (TFLite) and NNAPI, which provide a compatibility layer for various accelerators and device types. Because engineering resources are
limited, however, SoC vendors must prioritize their own
SDKs, often resulting in partial or less optimized generic-
framework support. This diversity of vendor SDKs and levels of framework support is a chief reason why the mobile-ML software ecosystem is fragmented.
This situation complicates hardware-performance as-
sessment because the choice of software framework has
a substantial effect. A high-performance SoC, for instance, may deliver low performance owing to an ill-matched framework. Even an SoC that integrates a high-performance ML accelerator will handle a network poorly if a generic Android framework such as NNAPI lacks well-optimized driver backends for it.
Because software code paths can drastically affect hard-
ware performance, a transparent mechanism for operating
and evaluating a mobile device is essential.
2.3 Developer Options
Developers can choose among several approaches to en-
able machine learning on mobile devices. Each one has im-
plications for achievable hardware performance on a given
application. Recognizing these behind-the-scenes factors is
therefore critical to maximizing performance.
Application developers can work through a marketplace
such as Google Play to create mobile-app variants for
every SoC vendor if they follow a vendor-SDK approach
(Figure 2a). Doing so presents a scalability challenge, however, because of the increased time to market and the additional engineering effort.
An alternative is to create an application using a native
OS/framework API such as NNAPI, which provides a more
scalable approach (Figure 2b). Nevertheless, this alternative
has a crucial shortcoming: it is only viable if SoC vendors
provide good backend drivers to the framework, necessitating cooperation between these vendors and the framework developers.
A ﬁnal alternative is to bind the neural-network model to
the underlying hardware. Doing so allows compilation of
the model to a particular device, avoiding reliance on any
particular run time (Figure 2c).
2.4 Deployment Scenarios
Machine-learning applications have many potential uses
on mobile devices. Details of the usage scenario determine
the extent to which a neural-network model is optimized for
the hardware and how it runs, because of strong or weak ties
to the device.
Developers primarily build applications without speciﬁc
ties to vendor implementations. They may design custom
neural-network models that can run on any device. Thus,
mobile devices often run apps that employ unknown mod-
els for a variety of hardware. Figure 3(a) illustrates this
case. OEMs, on the other hand, build their ML applications
for their own devices. Therefore, both the models and the
device targets are known at deployment time (Figure 3(b)).
A service provider (e.g., Verizon or AT&T) that uses a vari-
ety of hardware solutions may, however, support its service with known models, in which case both the models and the hardware are known (Figure 3(c)).

Figure 2: Application-development options.
Development of the applications deployed in these sce-
narios may also take place in various ways. OEMs that
manufacture devices can use vendor SDKs to support their
applications with minimal extra effort.
2.5 OEM Life Cycle
Mobile-SoC testing often occurs on development plat-
forms. Gaining access to them, however, is difﬁcult. There-
fore, the results of benchmark testing that employs a devel-
opment platform may not be independently veriﬁable. For
this reason, benchmarking generally takes place on com-
mercial devices. But because of the way commercial mo-
bile devices (particularly smartphones) operate, getting re-
producible numbers can be difﬁcult.
A variety of factors, ranging from how OEMs pack-
age software for delivery to how software updates are is-
sued, affect hardware-performance measurements. OEMs
employ vendor SoCs and associated software releases to
produce commercial mobile devices. In the case of smart-
phones, those devices may sell unlocked or locked to a wire-
less carrier, in which case the carrier ultimately controls
the software. OEMs pick up the software updates from
the SoC vendors and usually bundle them with other up-
dates for periodic release. If the carrier sells the device,
it will likely require testing and validation before allow-
ing any updates. This restriction can add further delays
to the software-update channel. NNAPI updates, for in-
stance, would require a new software update for the device.
For a benchmark, no recompilation is necessary when using
NNAPI; updates to a vendor SDK, however, may necessi-
tate recompilation (Figure 2a).
When benchmarking a device, a newly installed software update may affect the results, and installing the same version of the software used to generate a particular result may be impossible. After a device applies a system-software update, the only way to revert to the previous configuration is to factory reset the device. But doing so also undoes any associated security fixes.

Figure 3: ML-application scenarios.
More often than not, a substantial delay occurs between
the time when an SoC vendor releases new software and
when that software sees deployment on user devices. The
delay is usually measured in months, and it especially af-
fects the system-API approach (e.g., NNAPI). Extensive
planning is therefore necessary for a commercial phone to
have all the required features for an upcoming benchmark.
Finally, commercial devices receive OEM updates only
for a ﬁxed period, so they will not beneﬁt from additional
software-performance enhancements after that time.
2.6 Legal and IP
An important yet easily overlooked aspect of ML bench-
marking is the law. A chief challenge to constructing a
widely used mobile benchmark is the legal and intellectual-
property (IP) regime for both data sets and tool chains.
Since ML tends to be open source, the rigidity and restric-
tions on data sets and SDKs can be surprising.
Standard ML data sets are distributed under licenses with limited or unclear redistribution rights (e.g., ImageNet
and COCO). Not all organizations have licensed these data
sets for commercial use, and redistribution through an app
is legally complicated. In addition, submitters to an ML
benchmark may apply different legal-safety standards when
participating in a public-facing software release.
Additionally, many SoC vendors rely on proprietary
SDKs to quantize and optimize neural networks for their
products. Although some SDKs are publicly available un-
der off-the-shelf licensing terms, others require direct ap-
proval or negotiation with the vendor. Moreover, most forbid redistribution and sharing, potentially hindering reproduction of the overall flow and verification of a result.
Area      Task                   Reference Model                 Data Set                   Quality Target
Vision    Image classification   MobileNetEdgeTPU (4M params)    ImageNet 2012 (224x224)    98% of FP32 (76.19% Top-1)
Vision    Object detection       SSD-MobileNet v2 (17M params)   COCO 2017 (300x300)        93% of FP32 (0.244 mAP)
Vision    Semantic segmentation  DeepLab v3+ (2M params)         ADE20K (512x512)           97% of FP32 (54.8% mIoU)
Language  Question answering     MobileBERT (25M params)         mini SQuAD v1.1 dev        93% of FP32 (93.98% F1)

Table 1: MLPerf Mobile v0.7 benchmark suite.
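Because every quality target in Table 1 is stated relative to the FP32 reference, the absolute accuracy floors follow by simple multiplication. A sketch (the dictionary structure is ours; the numbers are from Table 1):

```python
# (FP32 reference accuracy, relative target) per task, from Table 1.
SUITE = {
    "image_classification":  (76.19, 0.98),   # Top-1, percent
    "object_detection":      (0.244, 0.93),   # mAP
    "semantic_segmentation": (54.8,  0.97),   # mIoU, percent
    "question_answering":    (93.98, 0.93),   # F1, percent
}

def absolute_target(task: str) -> float:
    """Minimum accuracy a submission must reach for the given task."""
    fp32, rel = SUITE[task]
    return rel * fp32

for task in SUITE:
    print(task, round(absolute_target(task), 3))
```

For example, the image-classification floor comes out to roughly 74.67% Top-1, and the object-detection floor to an mAP of about 0.227 (i.e., 22.7).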
3 MLPerf Mobile Benchmarks
The MLPerf Mobile Inference benchmark is community
driven. As such, all involved parties aided in developing
the benchmark models and submission rules; the group in-
cludes both submitting organizations and organizations that
care about mobile AI. Participants reached a consensus on
what constitutes a fair and useful benchmark that accurately
reﬂects mobile-device performance in realistic scenarios.
Table 1 briefly summarizes the tasks, models, data sets, and metrics. This section describes the models in MLPerf Mobile v0.7. Beyond the models themselves, a crucial aspect of our work is the method we prescribe for mobile-AI performance testing. We also describe the quality requirements that apply during benchmark testing.
3.1 Tasks and Models
Machine-learning tasks and associated neural-network
models come in a wide variety. Our benchmark’s ﬁrst it-
eration focused on establishing a high-quality method of
benchmarking, rather than focusing on model quantity. To
this end, we intentionally chose a few machine-learning
tasks representing real-world uses. Benchmarking them
yields helpful insights about hardware performance across
a wide range of deployment scenarios (smartphones, note-
books, etc.). We chose networks for these tasks on the ba-
sis of their maturity and applicability to various hardware
(CPUs, GPUs, DSPs, NPUs, etc.).
Image classification picks the best label to describe an input image. Many commercial applications employ image classification, which is a de facto standard for evaluating ML-system performance. Moreover, classifier-network evaluation provides a good performance indicator for a model that serves as a feature-extractor backbone for other tasks. Applications include photo search, text extraction, and industrial automation (object sorting and defect detection).
On the basis of community feedback, we selected MobileNetEdgeTPU, which is well optimized for mobile
applications and provides good performance on different
SoCs. The MobileNetEdgeTPU network is a descendant of the MobileNet v2 family, optimized for low latency and for mobile accelerators. The MobileNetEdgeTPU model architecture is based on convolutional layers with inverted residuals and linear bottlenecks, similar to MobileNet v2, but it is optimized by introducing fused inverted-bottleneck convolutions to improve hardware utilization and by removing hard-swish and squeeze-and-excite blocks.
The MobileNetEdgeTPU reference model is evaluated on the ImageNet 2012 validation data set and requires 74.66% Top-1 accuracy (98% of FP32 accuracy); the app uses a different data set. Before inference, images are resized, cropped to 224x224, and normalized.
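The resize-and-crop geometry can be sketched in a few lines of pure Python. The crop size matches the 224x224 input above; the normalization constants are typical MobileNet values that we assume here rather than take from the paper:

```python
def center_crop_box(width: int, height: int, crop: int = 224):
    """Top-left and bottom-right corners of a centered crop window."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop

def normalize(pixel: float, mean: float = 127.5, std: float = 127.5) -> float:
    """Map [0, 255] pixel values to roughly [-1, 1]."""
    return (pixel - mean) / std

print(center_crop_box(256, 256))  # (16, 16, 240, 240)
```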
Object detection draws bounding boxes around objects in an input image and labels each one; it is commonly applied to camera input. Implementations typically
use a pretrained image-classiﬁer network as a backbone or
feature extractor, then perform bounding-box selection and
regression for precise localization [49, 43]. Object detec-
tion is crucial for automotive tasks, such as detecting haz-
ards and analyzing trafﬁc, and for mobile-retail tasks, such
as identifying items in a picture.
Our reference model is the Single Shot Detector (SSD) with a MobileNet v2 backbone—a choice that is well adapted to constrained computing environments. The
SSD-MobileNet v2 uses MobileNet v2 for feature extraction and a mobile-friendly variant of regular SSD, called SSDlite, for detection. In the SSD prediction layers, all the regular convolutions are replaced with separable convolutions (depthwise followed by 1x1 projection). SSD-MobileNet v2 improves latency by significantly decreasing the number of operations; it also reduces the memory footprint needed during inference by never fully materializing the large intermediate tensors. Two SSD-MobileNet v2 versions acted as the reference models for the object-detection benchmark; one replaces more of the regular SSD-layer convolutions with depth-separable convolutions than the other does.
We used the COCO 2017 validation data set and, for the quality metric, the mean average precision (mAP). The target is an mAP of 22.7 (93% of FP32 accuracy). Preprocessing consists of first resizing to 300x300—typical of resolutions in smartphones and other compact devices—and then normalizing.
Semantic image segmentation partitions an input image into labeled objects at pixel granularity. It applies to autonomous driving and robotics [38, 54, 45, 53], remote sensing, medical imaging, and complex image manipulation such as red-eye reduction.
Our reference model for this task is DeepLab v3+ with a MobileNet v2 backbone. DeepLab v3+ belongs to a family of semantic image-segmentation models that use fully convolutional neural networks to directly predict pixel classification [44, 33]; it achieves state-of-the-art performance by overcoming reduced-feature-resolution problems and incorporating multiscale context. DeepLab v3+ uses an encoder-decoder architecture with atrous spatial pyramid pooling and a modular feature extractor. We selected MobileNet v2 as the feature extractor because it enables state-of-the-art model accuracy within a constrained computational budget.
We chose the ADE20K validation data set for its realistic scenarios, cropped and scaled images to 512x512, and (naturally) settled on the mean intersection over union (mIoU) as our metric. Additionally, we trained the model to predict just 32 classes (compared with 150 in the original ADE20K data set): classes 1 through 31 are the most frequent (pixel-wise) classes in ADE20K, and the 32nd represents all the others. The mIoU counts only pixels whose ground-truth label belongs to one of the 31 most frequent classes, which improves the score by discarding the network's poor performance on low-frequency classes.
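The class-filtered mIoU described above can be sketched as follows (pure Python; flat lists of per-pixel labels stand in for label maps, and the function is ours, not the benchmark's code):

```python
def mean_iou(pred, truth, valid_classes):
    """Mean intersection-over-union, counting only pixels whose
    ground-truth label is in valid_classes (here, the 31 frequent ones)."""
    ious = []
    for c in valid_classes:
        pixels = [(p, t) for p, t in zip(pred, truth) if t in valid_classes]
        inter = sum(1 for p, t in pixels if p == c and t == c)
        union = sum(1 for p, t in pixels if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Tiny example: five pixels, three valid classes.
print(mean_iou([1, 1, 2, 2, 3], [1, 2, 2, 2, 3], {1, 2, 3}))
```

Pixels whose ground truth falls outside the valid classes are ignored entirely, which is what discards the network's performance on low-frequency classes.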
Question answering is an NLP task: responding to human-posed questions in colloquial language. Example applications include search engines, chatbots, and other information-retrieval tools. For this task, we use the Stanford Question Answering Dataset (SQuAD) v1.1 Dev.
Given a question and a passage from a Wikipedia article,
the model must extract a text segment from the passage to
answer the question.
Recent NLP models that rely on pretrained contextual
representations have proven useful in diverse situations
[31, 46, 47]. BERT (Bidirectional Encoder Representations from Transformers) improves on those models by pretraining the contextual representations to be bidirectional and by learning relationships between sentences from unlabeled text. We selected MobileBERT, a lightweight
BERT model that is well suited to resource-limited mo-
bile devices. Further motivating this choice is the model’s
state-of-the-art performance and task-agnostic nature: even
though we consider question answering, MobileBERT is
adaptable to other NLP tasks with only minimal ﬁne-tuning.
We trained the model with a maximum sequence length of 384 and used the F1 score as our metric.
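The token-level F1 used for SQuAD-style evaluation can be sketched as follows (simplified: real SQuAD scoring also lowercases and strips punctuation and articles before comparing):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-overlap precision and recall."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("in the park", "the park"))
```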
3.2 Reference Code
MLPerf provides reference-code implementations for
the TensorFlow and TensorFlow Lite benchmarks. All reference models have 32-bit floating-point weights, and the benchmark additionally provides an 8-bit quantized version (with either post-training quantization or quantization-aware training, depending on the task). The code for all reference implementations is open source and free to download from GitHub.

Figure 4: Load Generator ("LoadGen") testing the SUT.
The reference code’s goal is to explicitly identify the crit-
ical model-invocation stages. For instance, the reference
benchmarks implement the preprocessing stages and the
model’s input-generation procedure. Submitters may adopt
the code for their submission. They may also optimize
these stages (e.g., rewrite them in C instead of Python) for
performance—as long as they employ all the same stages
and take the same steps to maintain equivalence.
By default, the reference code is not well-optimized.
Vendors that submit results to MLPerf must inherit the ref-
erence code, adapt it, and produce optimized glue code that
performs well on their hardware. For example, to perform
(quantized) inference, they may need to invoke the correct
software backend (e.g., SNPE and ENN) or NNAPI driver
to schedule code to their SoC’s custom accelerators.
3.3 System Under Test
A typical system under test (SUT) interfaces with several
components. Orchestrating the complete SUT execution in-
volves multiple stages. The main ones are model selection,
data-set input, preprocessing, back-end execution, and post-
processing. Figure 4 shows how these stages work together.
Model selection. The ﬁrst step is selection of the refer-
ence models, either TensorFlow or TFLite.
Load generator. To enable representative testing of var-
ious inference platforms and use cases, we created the Load
Generator ("LoadGen"). The LoadGen creates inference
requests in a pattern and measures some parameter (e.g., la-
tency, throughput, or latency-bounded throughput). In addi-
tion, it logs information about the system during execution
to enable post-submission result validation. Submitter mod-
iﬁcation of the LoadGen software is forbidden.
Data-set input. The LoadGen uses the data sets as in-
puts to the SUT. In accuracy mode, it feeds the entire data
set to the SUT to verify that the model delivers the required
accuracy. In performance mode, it feeds a subset of the im-
ages to the SUT to measure steady-state performance. A
seeded random-number generator selects samples from the data set for inference, which precludes any unrealistic data-set-specific optimizations.
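The seeded selection can be sketched as follows (pure Python; the real LoadGen is a C++ library, and the exact seeding scheme here is illustrative only):

```python
import random

def select_samples(dataset_size: int, count: int, seed: int):
    """Pick sample indices reproducibly: the fixed seed makes a run
    verifiable after the fact, while random selection precludes
    optimizations tuned to a fixed data-set order."""
    rng = random.Random(seed)
    return [rng.randrange(dataset_size) for _ in range(count)]

picks = select_samples(dataset_size=1024, count=5, seed=42)
assert picks == select_samples(1024, 5, 42)  # same seed, same selection
```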
Preprocessing. The typical image-preprocessing
tasks—such as resizing, cropping, and normalization—
depend on the neural-network model. This stage implements data-set-specific preprocessing; it varies with the task, and all submitters must follow the same steps.
Back-end execution. The reference benchmark imple-
mentation is a TFLite smartphone back end that optionally
includes NNAPI and GPU delegates. A “dummy” back end
is also available as a reference for proprietary back ends;
submitters replace it with whatever corresponds to their sys-
tem. For instance, Qualcomm would replace the dummy
with SNPE, and Samsung would replace it with ENN. The back end can likewise be another framework, such as OpenVINO, for notebooks and other large mobile devices.
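The back-end abstraction can be pictured as a small interface that each vendor fills in; the class names below are ours and merely sketch the "dummy back end" idea:

```python
class Backend:
    """Interface a vendor back end (SNPE, ENN, OpenVINO glue, ...) implements."""
    def load(self, model_path: str) -> None:
        raise NotImplementedError
    def run(self, inputs):
        raise NotImplementedError

class DummyBackend(Backend):
    """Placeholder that echoes inputs, letting the harness run end to end
    before a real vendor back end is plugged in."""
    def load(self, model_path: str) -> None:
        self.model_path = model_path
    def run(self, inputs):
        return inputs

backend = DummyBackend()
backend.load("mobilenet_edgetpu.tflite")  # hypothetical file name
print(backend.run([1, 2, 3]))  # [1, 2, 3]
```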
Postprocessing. This data-set-speciﬁc task covers all the
operations necessary for accuracy calculations. For exam-
ple, computing the Top-1 or Top-5 results for an image clas-
sifier requires a Top-K op/layer after the softmax layer.
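The Top-K step itself is small; a pure-Python sketch (real implementations add it as an op or layer in the graph):

```python
def top_k(scores, k=5):
    """Indices of the k highest-scoring classes, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

softmax_out = [0.05, 0.70, 0.10, 0.15]
print(top_k(softmax_out, k=1))  # [1] -> the Top-1 prediction is class 1
```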
A typical SUT can be either a smartphone or a laptop.
We therefore designed all the mobile-benchmark compo-
nents to take advantage of either one. Figure 5 shows how
MLPerf Mobile supports this ﬂexibility. The reference Ten-
sorFlow models are at the root of the entire process. The
MLPerf Mobile process follows one of three paths.
Code path 1 allows submitters to optimize the reference
TensorFlow models for implementation via a proprietary
backend (e.g., SNPE for Qualcomm or ENN for Samsung),
then schedule and deploy the networks on the hardware.
Code path 2 allows submitters to convert the reference
TensorFlow models to a mobile-friendly format using an ex-
porter. These models are then easy to deploy on the device,
along with quantization optimizations, using the TFLite del-
egates to access the AI-processing hardware.
Code path 3 allows non-smartphone submitters to run
the reference TensorFlow models through nonmobile back-
ends (e.g., OpenVINO) on laptops and tablets that run op-
erating systems such as Windows and Linux.
3.4 Execution Scenarios
MLPerf Mobile Inference supports two modes for run-
ning ML models: single stream and ofﬂine. They reﬂect the
typical operating behavior of many mobile applications.
Single stream. In the single-stream scenario, the ap-
plication sends a single inference query to the SUT with
a sample size of one. That size is typical of smartphones
and other interactive devices where the user takes a picture
and expects a timely response, as well as AR/VR headsets
where real-time operation is crucial. The LoadGen injects a
query into the SUT and waits for query completion. When the query is complete, the LoadGen records the inference run length and sends the next query. This process repeats until the LoadGen has issued all the samples (1,024) in the task's corresponding data set or a minimum runtime of 60 seconds has been met.

Figure 5: MLPerf Mobile benchmark code paths. The benchmarks run on smartphones and on mobile PCs, such as laptops. On smartphones, there are multiple framework options and backend code paths that vendors can select.
Ofﬂine. In the ofﬂine scenario, the LoadGen sends all
the samples to the SUT in one burst. Although the query
sample size remains one, as in the single-stream scenario,
the number of samples in the query is much larger. Of-
ﬂine mode in MLPerf Mobile v0.7 issues 24,576 samples—
enough to provide sufﬁcient run time. This choice typically
reﬂects applications that require multi-image processing, si-
multaneous processing of batched input, or concurrent use
of models such as image classiﬁcation and person detec-
tion for photos in an album. Its implementation is usually a
batched query with a batch size larger than one.
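The two scenarios amount to two driving loops around the same inference call. A sketch (pure Python; `infer` is a stand-in for the SUT, and the sample counts mirror the text above):

```python
import time

def single_stream(infer, samples, min_queries=1024, min_seconds=60):
    """Issue one query at a time, recording per-query latency, until both
    the sample-count and minimum-runtime conditions are met."""
    latencies, start, i = [], time.perf_counter(), 0
    while i < min_queries or time.perf_counter() - start < min_seconds:
        t0 = time.perf_counter()
        infer([samples[i % len(samples)]])
        latencies.append(time.perf_counter() - t0)
        i += 1
    return latencies

def offline(infer, samples, total=24576, batch_size=32):
    """Send all samples in one burst, processed in batches; report throughput."""
    start = time.perf_counter()
    for b in range(0, total, batch_size):
        infer([samples[i % len(samples)]
               for i in range(b, min(b + batch_size, total))])
    elapsed = time.perf_counter() - start
    return total / max(elapsed, 1e-9)  # samples per second
```

Single stream is scored on per-query latency, offline on throughput.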
4 Result Submission
This section outlines how submitters produce high-
quality benchmark results for submission. We outline the
submission process, the run rules, and the procedure for ver-
ifying the accuracy and validity of the results.
4.1 Submission Process
The reference models for MLPerf Mobile are provided as
frozen TensorFlow FP32 checkpoints, and valid submissions
must start from these frozen graphs. From the frozen graph,
submitters can export a reference FP32 TFLite model. They
can generate ﬁxed-point models with INT8 precision from
the reference FP32 models using post-training quantization
(PTQ), but they cannot perform quantization-aware train-
ing (QAT). Network retraining typically alters the neural-
network architecture; as a result, model equivalence is difficult to verify. Additionally, retraining allows a submit-
ter to use their training capabilities (e.g., neural architec-
ture search) to enhance inference performance, changing
the very nature of the benchmark. Depending on submit-
ter needs, however, MLPerf provides QAT versions of the
model. All organizations mutually agree on these QAT
models as being comparable to the PTQ models.
In general, QAT reduces accuracy loss relative to PTQ.
Therefore, we chose the minimum-accuracy thresholds on
the basis of what is achievable through post-training quanti-
zation without any training data. For some benchmarks, we
generated a reference INT8 QAT model using the Tensor-
Flow quantization tools; submitters can employ it directly
in the benchmark.
Some hardware is unable to directly deploy TensorFlow-
quantized models, however, and submission organizations
may need different ﬁxed-point formats to match their hard-
ware. In such cases, we only allow post-training quantiza-
tion without training data from a reference model.
For each model, the Mobile Working Group speciﬁed a
calibration data set (typically 500 samples or images from
the training or validation data set) to use for calibration in
the PTQ process. Submitters can only use the approved cali-
bration data set, but they may select a subset of the samples.
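As a concrete illustration of what PTQ calibration computes, the sketch below derives an affine INT8 scale and zero point from the minimum and maximum values observed over a calibration set. It is a simplified stand-in for what quantization tools do internally, not the MLPerf reference tooling:

```python
# Simplified post-training affine quantization (illustrative helper, not
# the MLPerf tooling): derive an INT8 (scale, zero_point) pair from the
# value range observed on a calibration subset, then quantize values.

def quant_params(cal_min: float, cal_max: float, bits: int = 8):
    """Compute (scale, zero_point) for asymmetric INT8 quantization."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1   # -128, 127
    # The represented range must include zero so it maps exactly.
    cal_min, cal_max = min(cal_min, 0.0), max(cal_max, 0.0)
    scale = (cal_max - cal_min) / (qmax - qmin)
    zero_point = round(qmin - cal_min / scale)
    return scale, int(zero_point)

def quantize(x: float, scale: float, zero_point: int) -> int:
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))                          # clamp to INT8

def dequantize(q: int, scale: float, zero_point: int) -> float:
    return (q - zero_point) * scale
```

QAT improves on this by adjusting the weights themselves during training; PTQ, as sketched here, only picks the quantization parameters after the fact.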
A submitter may make minimal changes to the model, provided those changes are mathematically equivalent, or apply approved approximations to make the model compatible with their hardware. However, MLPerf rules strictly prohibit altering
the AI models to reduce their computational complexity;
banned techniques include channel pruning, ﬁlter pruning,
and weight skipping.
4.2 Submission System
Smartphones and notebooks can use the mobile-
benchmark suite. For smartphones, we developed a refer-
ence MLPerf Android app that supports TFLite delegates
and NNAPI delegates. We benchmark the inference-task
performance at the application layer to reﬂect latencies that
mobile-device users observe and to give developers a refer-
ence for expected user-app latencies.
The MLPerf Mobile app queries the LoadGen, which in
turn queries input samples for the task, loads them to mem-
ory, and tracks the time required to execute the task. Com-
panies that used proprietary delegates implemented their
backend interface to the reference MLPerf app. Such back-
ends query the correct library (TensorFlow, TFLite, Exynos
Neural Network (ENN) SDK, or SNPE SDK) to run the
models on the SUT in accordance with the run rules.
For laptops, submitters can build a native command-
line application by following the instructions in the MLCommons GitHub repo. The application integrates the MLPerf LoadGen and supports backends such as the OpenVINO run time. It generates logs
consistent with MLPerf rules, validated by the submission
checker. The number of samples necessary for performance
mode and for accuracy mode remains identical to the num-
ber in the smartphone scenario. The only difference is the
absence of a user interface for these devices.
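The vendor-backend contract described above can be pictured as a small interface. The class and method names here are hypothetical, chosen only to show how a backend plugs its runtime (TFLite, ENN SDK, SNPE SDK, or OpenVINO) behind a common benchmark-facing API:

```python
# Hypothetical sketch of a vendor-backend interface (names are our own;
# the real MLPerf app defines its own contract). Each backend maps these
# calls onto its runtime.
from abc import ABC, abstractmethod

class Backend(ABC):
    @abstractmethod
    def load_model(self, model_path: str) -> None:
        """Prepare the model on the target hardware."""

    @abstractmethod
    def issue_query(self, samples: list) -> list:
        """Run inference on a batch of samples and return the outputs."""

class EchoBackend(Backend):
    """Stand-in backend used here only to show the call sequence."""
    def load_model(self, model_path: str) -> None:
        self.model_path = model_path

    def issue_query(self, samples: list) -> list:
        return list(samples)  # a real backend would return model outputs
```

The benchmark harness only sees this interface; everything behind it, including processor selection and optimizations, is the vendor's choice.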
4.3 Run Rules
In any benchmark, measurement consistency is crucial
for reproducibility. MLPerf Mobile is no different. We de-
veloped a strict set of run rules that allow us to reproduce
submitted results through an independent third party.
• Test control. The MLPerf app runs the five bench-
marks in a speciﬁc order. For each one, the model
ﬁrst runs on the whole validation set to calculate the
accuracy, which the app then reports. Performance
mode then follows. Single-stream mode measures the
90th-percentile latency over at least 1,024 samples for
a minimum run time of 60 seconds to achieve a stable
performance result. Ofﬂine mode reports the average
throughput necessary to process 24,576 samples and in
current systems will exceed 60 seconds of run time.
• Thermal throttling. Machine-learning models are
computationally heavy and can trigger run-time ther-
mal throttling to cool the SoC. We recommend that
smartphones maintain an air gap with proper ven-
tilation and avoid ﬂush contact with any surfaces.
Additionally, we require normal room temperature
operation—between 20 and 25 degrees Celsius.
• Cooldown interval. The benchmark does not test the
performance under thermal throttling, so the app al-
lows a break setting of 0–5 minutes between the indi-
vidual tests to allow the phone to reach its cooldown
state before starting each one. If the benchmark suite
is to run multiple times, we recommend a minimum
10-minute break between them.
• Battery power. The benchmark runs while the phone
is battery powered, but we recommend a full charge
beforehand to avoid entering power-saving mode.
The above rules are generally inapplicable to laptops be-
cause these devices have sufﬁcient power and cooling.
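The pacing these rules imply can be sketched as a simple loop; this is our own illustration, not the app's implementation, and the benchmark names are placeholders for the five tests the app runs in order:

```python
# Illustrative pacing loop for the run rules above (not the MLPerf app's
# actual code): run the five benchmarks in a fixed order, pausing between
# tests so each one starts from a thermally stable state.
import time

# Placeholder names for the five benchmark runs.
BENCHMARKS = ["bench1", "bench2", "bench3", "bench4", "bench5"]

def run_suite(run_one, cooldown_s=120):
    results = {}
    for i, name in enumerate(BENCHMARKS):
        if i > 0:
            time.sleep(cooldown_s)   # 0-5 minute cooldown between tests
        results[name] = run_one(name)
    return results
```

Repeating the whole suite would, per the rules, insert a minimum 10-minute break between runs.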
4.4 Result Validation
MLPerf Mobile submission rules require that the SUT
(smartphone or laptop) be commercially available before
publication, which enables a more tightly controlled and ro-
bust validation, review, and audit process. In contrast, the
other MLPerf benchmark suites allow submission of pre-
view and research systems that are not commercially avail-
able. Smartphones should be for sale either through a car-
rier or as an unlocked phone. The SUT includes both the
hardware and the software components, so these rules pro-
hibit device rooting.
At submission time, each organization has no knowledge
of other results or submissions. All must submit their results
at the same time. Afterward, the submitters collectively re-
view all the results in a closed setting, inspired by the peer-
review process for academic publications.
Submissions include all of the benchmark app’s log ﬁles,
unedited. After the submission deadline, results for each
participating organization are available for examination by
the MLPerf working group and the other submitters, along
with any modiﬁed models and code used in the respective
submissions. The vendor backend (but not the tool chain)
is included. MLPerf also receives private vendor SDKs to
allow auditing of the model-conversion process.
The audit process comprises examination of log ﬁles,
models, and code for compliance with the submission rules
as well as veriﬁcation of their validity. It also includes veri-
ﬁcation of the system’s reported accuracy and latencies. To
verify results, we build the vendor-speciﬁc MLPerf app, in-
stall it on the phone (in the factory-reset state), and attempt
to reproduce latency or throughput numbers, along with ac-
curacy. We consider the results veriﬁed if our numbers are
within 5% of the reported values.
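The acceptance criterion can be stated precisely; this helper is our own paraphrase of the rule, not code from the audit tooling:

```python
# The 5% verification rule, restated as a check: a reproduced measurement
# is accepted if it falls within 5% of the reported value.
def verified(reported: float, reproduced: float, tol: float = 0.05) -> bool:
    return abs(reproduced - reported) <= tol * reported
```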
5 Performance Evaluation
The MLPerf Mobile inference suite ﬁrst saw action in
October 2020. Mobile submissions fall into one of two cat-
egories: smartphones and laptops. The results reveal a de-
vice’s system on chip (SoC) performance for each of the
machine learning tasks in version 0.7. This section assesses
how the benchmark performed—speciﬁcally, whether it met
expectations in being transparent and faithful, reﬂecting the
vast diversity of AI hardware and software.
5.1 Premium ML Systems
The submitted systems include premier 5G smartphones
and high-end mobile SoCs from MediaTek, Qualcomm, and
Samsung. The MediaTek chipset is a Dimensity 820 in
the Xiaomi Redmi 10X smartphone; it contains MediaTek’s
AI processing unit (APU) 3.0. The APU uniquely supports
FP16 and INT16. The Qualcomm chipset is a Snapdragon 865+ in the Asus ROG Phone 3. It integrates
Qualcomm’s Hexagon 698 DSP, which consists of two en-
gines that can handle AI processing exclusively. The ﬁrst
engine is the Hexagon Vector Extension (HVX), which is
designed for advanced imaging and computer-vision tasks
intended to run on the DSP instead of the CPU. The sec-
ond, the company’s AI-processor (AIP) cluster, supports
the Hexagon Tensor Accelerator (HTA), which can also per-
form AI tasks. These engines can serve together for maxi-
mum performance, or they can operate in isolation (depend-
ing on the compiler optimizations). The Samsung chipset is
an Exynos 990 in the company's Galaxy Note 20 Ul-
tra, which has a dual-core custom neural processing unit
(NPU) specialized to handle AI workloads. In the laptop
category, Intel submitted results for its new Willow Cove
CPU and first-generation integrated Xe-LP GPU. That GPU served as the AI accelerator. These systems col-
lectively reﬂect the state of the art in AI processors.
In the smartphone category, three organizations submit-
ted a total of 14 individual results. No one solution domi-
nates all benchmarks. Figure 6 plots the single-stream re-
sults for the three smartphone chipsets on each benchmark
task. It includes both throughput and latency results. Each
chipset offers a unique differentiable value. MediaTek’s Di-
mensity scored the highest in object-detection and image-
segmentation throughput. Samsung’s Exynos performed
well on image classiﬁcation and NLP, where it achieved
the highest scores. Qualcomm’s Snapdragon is competitive
for image segmentation and NLP. The image-classiﬁcation
task employs ofﬂine mode, which allows batch process-
ing; here, Exynos delivered 674.4 frames per second (FPS),
and Snapdragon delivered 605.37 FPS (not shown in Fig-
ure 6). In most cases, the throughput differences are
marginal. An essential point to keep in mind, however, is
that other metrics—beyond performance benchmarks—go
into assessing a chipset’s viability for a given task.
5.2 Result Transparency
The submission results highlight an important point:
they reﬂect the variety of hardware and software combina-
tions we discussed earlier (Section 2). All mobile SoCs rely
on a generic processor, but the AI-performance results were
from AI accelerators using different software frameworks.
Transparency into how the results were generated is crucial.
Figure 7 shows the potential code paths for producing the
submission results. The dashed lines represent mere possi-
bilities, whereas the solid lines indicate actual submissions.
Looking only at Figure 7 is insufﬁcient to determine which
paths produce high-quality results. Any other code paths
would have yielded a different performance result. There-
fore, transparency on benchmark performance is essential.
It reveals which code paths were taken, making the perfor-
mance results reproducible and informative for consumers.
Table 2 presents additional details, including speciﬁcs
for each benchmark result in both single-stream and ofﬂine
modes. MLPerf Mobile exposes this information to make
the results reproducible. For each benchmark and each sub-
mitting organization, the table shows the numerical preci-
sion, the run time, and the hardware unit that produced the
results. Exposing each of these details is important because
the many execution paths in Figure 7 can drastically affect
a device’s performance.
Figure 6: Results from the first MLPerf Mobile round show that no one solution fits all tasks. The bars correspond to throughput (left y-axis); the line graph corresponds to latency (right y-axis). Each panel compares MediaTek, Samsung, and Qualcomm: (a) image classification, (b) object detection (SSD-MobileNet v2), (c) semantic segmentation (DeepLab v3 + MobileNet v2), and (d) natural-language processing (MobileBERT).
5.3 Execution Diversity
Mobile-device designers prefer INT8 or FP16 format be-
cause quantized inference runs faster and provides better
performance and memory bandwidth than FP32. The
accuracy tradeoff for quantized models (especially since no
retraining is allowed) is tolerable in smartphones, which
seldom perform safety-critical tasks, such as those in au-
tonomous vehicles (e.g., detecting pedestrians).
All the mobile-vision tasks employ INT8 heavily. Most
vendors rely on INT8 because it enables greater perfor-
mance and consumes less power, preserving the device’s
battery life. NLP favors FP16. Although this format re-
quires more power than INT8, it offers better accuracy. Per-
haps more importantly, submitters use FP16 because most
AI engines today lack efﬁcient support for non-vision tasks.
The GPU is a good balance between ﬂexibility and efﬁ-
ciency. Unsurprisingly, therefore, all vendors submitted re-
sults that employed GPUs with FP16 precision for NLP.
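The bandwidth and memory argument is easy to quantify: weight storage scales linearly with bit width, so INT8 weights take a quarter of the space of FP32 weights and half that of FP16. A back-of-envelope sketch (the parameter count is a made-up example, not a benchmark model's actual size):

```python
# Back-of-envelope storage cost of model weights at different precisions.
def weight_bytes(num_params: int, bits: int) -> int:
    """Bytes needed to store the weights at the given precision."""
    return num_params * bits // 8

params = 4_000_000  # hypothetical 4M-parameter mobile model
fp32, fp16, int8 = (weight_bytes(params, b) for b in (32, 16, 8))
# INT8 halves FP16 storage and quarters FP32 storage; the same ratios
# apply to the memory bandwidth consumed when streaming weights.
```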
NNAPI is designed to be a common baseline for machine
learning on Android devices and to distribute that workload
across ML-processor units, such as CPUs, GPUs, DSPs,
and NPUs. But nearly all submissions in Table 2 use pro-
prietary frameworks. These frameworks, such as ENN and
SNPE, give SoC vendors more control over their product’s
performance. For instance, they can control which proces-
sor (CPU, GPU, DSP, NPU, for example) to use and what
optimizations to apply.
All laptop submissions employ INT8 and achieve the de-
sired accuracy on vision and language models. For single-
stream mode, because only a single sample is available per query,
some models cannot fully utilize the computational re-
sources of the GPU. Therefore, the backend must select be-
tween the CPU and GPU to deliver the best overall perfor-
mance. For example, smaller models such as MobileNetEd-
geTPU use the CPU. For the ofﬂine mode, multiple samples
are available as a single query, so inference employs both
the CPU and GPU.
Finally, there is hardware diversity. Table 2 shows a variety of
hardware combinations that achieve good performance on
all MLPerf Mobile AI tasks. In one case, the CPU is the
backbone, orchestrating overall execution, including preprocessing and other tasks the benchmark does not measure. In contrast, the GPU, DSPs, NPUs, and AIPs deliver high-performance AI execution.

Figure 7: Potential code paths (dashed lines) and actual submitted code paths (solid lines) for producing MLPerf Mobile AI-performance results. NPU is the Neural Processing Unit from Samsung. The Hexagon Tensor Accelerator (HTA) and Hexagon Vector eXtensions (HVX) are part of the Qualcomm DSP and can be used either individually or together.
The MLPerf results provide transparency into the perfor-
mance results, which show how SoC vendors achieve their
best performance on a range of tasks. Figure 7 and Table 2
reveal substantial differences in how AI systems perform on
the different devices. Awareness of such underlying varia-
tions is crucial because the measured performance should
match what end users experience, particularly on commer-
cially available devices.
Finally, since the benchmark models represent diverse
tasks, and since MLPerf Mobile collects results over a sin-
gle long run that covers all of these models, it strongly curbs
domain-speciﬁc framework optimizations. Furthermore,
the benchmarked mobile devices are commonly available
and the testing conditions ensure a realistic experimental
setup, so the results are attainable in practice and repro-
ducible by others.
6 Consumer, Industry, Research Value
Measuring mobile AI performance in a fair, repro-
ducible, and useful manner is challenging but not in-
tractable. The need for transparency stems from the massive
hardware and software diversity, which is often tightly cou-
pled with the intricacies of deployment scenarios, developer
options, OEM life cycles, and so on.
MLPerf Mobile focuses on transparency for consumers
by packaging the submitted code into an app. Figure 8a
shows the MLPerf Mobile startup screen. With a simple tap
on the “Go” button, the app runs all benchmarks by default,
following the prescribed run rules (Figure 8b), and clearly
displays the results. It reports both performance and accu-
racy for all benchmark tasks (Figure 8c) and permits the
user to view results for each one (Figure 8d). Furthermore,
the conﬁguration that generates the results is also transpar-
ent (Figure 8e). The application currently runs on Android,
though future versions will likely support iOS as well.
We believe that analysts, OEMs, academic researchers,
neural-network-model designers, application developers,
and smartphone users can all gain from result transparency.
We brieﬂy summarize how the app beneﬁts each one.
Application developers. MLPerf Mobile shows appli-
cation developers what real-world performance may look
like on the device. We expect the benchmark to provide insight into the software frameworks on the various "phones" (i.e., SoCs). More specifically, it can help them quickly identify the optimal solution for a given platform. For application developers
who deploy their products “into the wild,” the benchmark
and the various machine-learning tasks offer perspective on
the end-user experience for a real application.

Table 2: Implementation details for the results presented in Figure 7. The columns correspond to the benchmark data sets and models (ImageNet with MobileNetEdgeTPU, COCO with SSD-MobileNet v2, ADE20K with DeepLab v3+ and MobileNet v2, and SQuAD with MobileBERT). The table shows the myriad combinations of numerical formats, software run times, and hardware backend targets that are possible, which reinforces the need for result transparency.
OEMs. MLPerf Mobile standardizes the benchmark-
ing method across different mobile SoCs. All SoC ven-
dors employ the same tasks, models, data sets, metrics,
and run rules, making the results comparable and repro-
ducible. Given the hardware ecosystem’s vast heterogene-
ity, the standardization that our benchmark provides is vital.
Model designers. MLPerf Mobile makes it easy to
package new models into the mobile app, which organi-
zations can then easily share and reproduce. The app
framework, coupled with the underlying LoadGen, allows
model designers to test and evaluate the model’s perfor-
mance on a real device rather than using operation counts
and model size as heuristics to estimate performance. This
feature closes the gap between model designers and hard-
ware vendors—groups that have thus far failed to share in-
formation in an efﬁcient and effective manner.
Mobile users. The average end user wants to make
informed purchases. For instance, many want to know
whether upgrading their phone to the latest chipset will
meaningfully improve their experience. To this end,
they want public, accessible information about various
devices—something MLPerf Mobile provides. In addi-
tion, some power users want to measure their device’s per-
formance and share that information with performance-
crowdsourcing platforms. Both are important reasons for
having an easily reproducible mechanism for measuring
mobile AI performance.
Academic researchers. Reproducibility is a challenge
for state-of-the-art technologies. We hope researchers em-
ploy our mobile-app framework to test their methods and
techniques for improving model performance, quality, or
both. The framework is open source and freely accessi-
ble. As such, it can enable academic researchers to integrate
their optimizations and reproduce recent results.
Technical analysts. MLPerf Mobile provides repro-
ducibility and transparency for technical analysts, who of-
ten strive to make “apples-to-apples” comparisons. The ap-
plication makes it easy to reproduce vendor-claimed results
as well as to interpret the results, because it shows how the
device achieves a particular performance number and how
it is using the hardware accelerator.
7 Related Work
There are many ongoing efforts in mobile AI perfor-
mance benchmarking. We describe the prior art in mobile
and ML benchmarking and emphasize how MLPerf Mobile
differs from these related works.
Android Machine Learning Test Suite (MLTS).
MLTS, part of the Android Open Source Project (AOSP)
source tree, provides benchmarks for NNAPI drivers. It
is mainly for testing the accuracy of vendor NNAPI drivers.
MLTS includes an app that allows a user to test the latency
and accuracy of quantized and ﬂoating-point TFLite mod-
els (e.g., MobileNet and SSD-MobileNet) against a 1,500-
(a) Startup screen. (b) Running the benchmarks. (c) Reporting results. (d) Run details. (e) Conﬁguration settings.
Figure 8: MLPerf Mobile app on Android.
image subset of the Open Images Dataset v4. Further
statistics, including latency distributions, are also available.
Xiaomi’s Mobile AI Benchmark. Xiaomi provides
an open-source end-to-end benchmark tool for evaluating
model accuracy and latency. In addition to a command-
line utility to run the benchmarks on a user device, the
tool includes a daily performance-benchmark run for var-
ious neural-network models (mostly on the Xiaomi Redmi
K30 Pro). The tool has a conﬁgurable backend that allows
users to employ multiple ML-hardware-delegation frame-
works (including MACE, SNPE, and TFLite).
TensorFlow Lite. TFLite provides a command-line
benchmark utility to measure the latency of any TFLite
model. A wrapper APK is also available to show how these models perform when embedded in an Android application. Users can select the NNAPI delegate, or they can disable NNAPI in favor of a hardware-offload backend.
For in-depth performance analysis, the benchmark supports
timing of individual TFLite operators.
AI-Benchmark. Ignatov et al.  performed an exten-
sive evaluation of machine-learning performance on mobile
systems with AI acceleration, using HiSilicon, MediaTek,
Qualcomm, Samsung, and UniSoc chipsets. They evaluated
21 deep-learning tasks using 50 metrics, including inference
speed, accuracy, and stability. The authors reported the re-
sults of their AI-Benchmark app for 100 mobile SoCs. The
benchmark runs preselected models of various bit widths
(INT8, FP16, and FP32) on the CPU and on open-source or
vendor-proprietary TFLite delegates. Performance-report
updates appear on the AI-Benchmark website after each major release of TFLite/NNAPI and of new SoCs with AI acceleration.
AImark. Master Lu (Ludashi), a closed-source Android and iOS application, uses vendor SDKs to im-
plement its benchmarks. It comprises image-classiﬁcation,
image-recognition, and image-segmentation tasks, includ-
ing models such as ResNet-34, Inception V3, SSD-MobileNet [36, 43], and DeepLab v3+. The
benchmark judges mobile-phone AI performance by eval-
uating recognition efﬁciency and provides a line-test score.
Aitutu. A closed-source application [3, 8], Aitutu em-
ploys Qualcomm’s SNPE, MediaTek’s NeuroPilot, HiSil-
icon’s Kirin HiAI, Nvidia’s TensorRT, and other vendor
SDKs. It implements image classiﬁcation based on the
Inception V3 neural network, using 200 images as
test data. The object-detection model is based on SSD-
MobileNet [36, 43], using a 600-frame video as test data.
The score is a measure of speed and accuracy—faster re-
sults with higher accuracy yield a greater ﬁnal score.
Geekbench. Primate Labs created Geekbench [20, 6],
a cross-platform CPU-compute benchmark that supports
Android, iOS, Linux, macOS, and Windows. The Geek-
bench 5 CPU benchmark features new applications, includ-
ing augmented reality and machine learning, but it lacks
heterogeneous-IP support. Users can share their results by
uploading them to the Geekbench Browser.
UL Procyon AI Inference Benchmark. From UL
Benchmarks, which produced PCMark and 3DMark, came
VRMark [25, 26], an Android NNAPI CPU- and GPU-
focused AI benchmark. The professional benchmark suite
UL Procyon only compares NNAPI implementations and
compatibility on ﬂoating-point- and integer-optimized mod-
els. It contains MobileNet v3, Inception v4, SSDLite MobileNet v3 [28, 43], DeepLab v3, and other models. It also attempts to test custom CNN models but uses an AlexNet architecture to test basic operations. The application provides benchmark scores, performance charts, hardware monitoring, and model output.
Neural Scope. National Chiao Tung University [17, 18]
developed an Android NNAPI application supporting FP32
and INT8 precisions. The benchmarks comprise object
classiﬁcation, object detection, and object segmentation,
including MobileNet v2, ResNet-50, Inception v3, SSD-MobileNet [36, 43], and ResNet-50 with atrous-convolution layers. Users can run the app on their mobile devices and immediately receive a cost-performance assessment.
8 Future Work
The ﬁrst iteration of the MLPerf Mobile benchmark fo-
cused on the foundations. On the basis of these fundamen-
tals, its scope can easily expand. The following are areas of future work.
iOS support. A major area of interest for MLPerf Mo-
bile is to develop an iOS counterpart for the ﬁrst-generation
Android app. Apple’s iOS is a major AI-performance player
that brings both hardware and software diversity compared with Android.
Measuring software frameworks. Most AI bench-
marks focus on AI-hardware performance. But as we de-
scribed in Section 2, software performance—and, more im-
portantly, its capabilities—is crucial to unlocking a device’s
full potential. To this end, enabling apples-to-apples com-
parison of software frameworks on a ﬁxed hardware plat-
form has merit. The backend code path in Figure 5 (code
path 1) is a way to integrate different machine-learning
frameworks in order to determine which one achieves the
best performance on a target device.
Expanding the benchmarks. An obvious area of im-
provement is expanding the scope of the benchmarks to in-
clude more tasks and models, along with different quality
targets. Examples include additional vision tasks, such as
super resolution, and speech models, such as RNN-T.
Rolling submissions. The mobile industry is growing
and evolving rapidly. New devices arrive frequently, of-
ten in between MLPerf calls for submissions. MLPerf Mo-
bile therefore plans to add “rolling submissions” in order to
encourage vendors to submit their MLPerf Mobile scores
continuously. Doing so would allow smartphone makers to
more consistently use the benchmark to report the AI per-
formance of their latest devices.
Power measurement. A major area of potential im-
provement for MLPerf Mobile is power measurement.
Since mobile devices are battery constrained, evaluating
AI’s power draw is important.
To make additional progress, we need community in-
volvement. We therefore encourage the broader mobile
community to join the MLPerf effort and maintain the mo-
mentum behind an industry-standard open-source mobile AI benchmark.
9 Conclusion
Machine-learning inference has many potential applica-
tions. Building a benchmark that encapsulates this broad
spectrum is challenging. In this paper, we focused on smart-
phones and the mobile-PC ecosystem, which is rife with
hardware and software heterogeneity. Coupled with the
life-cycle complexities of mobile deployments, this hetero-
geneity makes benchmarking mobile AI performance over-
whelmingly difﬁcult. To bring consensus, we developed the
MLPerf Mobile AI inference benchmark. Many leading or-
ganizations have joined us in building a uniﬁed benchmark
that meets competing organizations’ disparate needs. The
unique value of MLPerf Mobile is not so much in the bench-
marks, rules, and metrics. Instead, it is in the value that the
industry creates for itself, beneﬁting everyone.
MLPerf Mobile provides an open source, out-of-the-
box inference-throughput benchmark for popular computer-
vision and natural-language-processing applications on mo-
bile devices, including smartphones and laptops. It can
serve as a framework to integrate future models, as the un-
derlying framework is independent of the top-level model
and of data-set changes. The app and the integrated Load
Generator allow us to evaluate a variety of situations, such
as changing the quality thresholds for overall system per-
formance. The app can also serve as a common platform
for comparing different machine-learning frameworks on
the same hardware. Finally, the suite allows for fair and
faithful evaluation of heterogeneous hardware, with full reproducibility.
Acknowledgments
The MLPerf Mobile team would like to acknowledge sev-
eral people for their effort. In addition to the team that archi-
tected the benchmark, MLPerf Mobile is the work of many
individuals who also helped produce the first set of results.
Arm: Ian Forsyth, James Hartley, Simon Holland, Ray
Hwang, Ajay Joshi, Dennis Laudick, Colin Osborne, and
dividiti: Anton Lokhmotov.
Google: Bo Chen, Suyog Gupta, Andrew Howard, and
Harvard University: Yu-Shun Hsiao.
Intel: Thomas Baker, Srujana Gattupalli, and Maxim
MediaTek: Kyle Guan-Yu Chen, Allen Lu, Ulia Tseng,
and Perry Wang.
Qualcomm: Mohit Mundhra.
Samsung: Dongwoon Bai, Stefan Bahrenburg, Jihoon
Bang, Long Bao, Yoni Ben-Harush, Yoojin Choi, Fang-
ming He, Amit Knoll, Jaegon Kim, Jungwon Lee, Sukhwan
Lim, Yoav Noor, Muez Reda, Hai Su, Zengzeng Sun,
Shuangquan Wang, Maiyuran Wijay, Meng Yu, and George
Xored: Ivan Osipov, and Daniil Efremo.
 AI-Benchmark. http://ai-benchmark.com/.
 AImark. https://play.google.com/store/
 Antutu Benchmark. https://www.antutu.com/en/
 Big.LITTLE. https://www.arm.com/why-arm/
 Deploy High-Performance Deep Learning Inference.
 Geekbench. https://www.geekbench.com/.
 Google Play. https://play.google.com/store.
 Is Your Mobile Phone Smart? Antutu AI Benchmark
Public Beta Is Released. https://www.antutu.com/
 LoadGen. https://github.com/mlperf/
 MediaTek Dimensity 820. https://www.mediatek.
 MLPerf. https://github.com/mlperf.
 MLPerf Mobile v0.7 Results. https://mlperf.org/
 Mobile AI Bench. https://github.com/XiaoMi/
 Mobile Processor Exynos 990. https://www.
 Neural Networks API. https://developer.
 Neural Networks API Drivers. https://source.
 NeuralScope Mobile AI Benchmark Suite. https:
 Neuralscope offers you benchmarking your AI solutions.
 NeuroPilot. https://neuropilot.mediatek.
 Primate Labs. https://www.primatelabs.com/.
 Samsung Neural SDK. https://developer.
 Snapdragon 865+ 5G Mobile Platform.
snapdragon-865- plus-5g- mobile-platform.
 Snapdragon Neural Processing Engine SDK. https:
 TensorFlow Lite. https://www.tensorflow.org/
 UL Benchmarks. https://benchmarks.ul.com/.
 UL Procyon AI Inference Benchmark.
 Willow cove - microarchitectures - intel.
 Andrew Howard, Suyog Gupta. Introducing
the Next Generation of On-Device Vision Mod-
els: MobileNetV3 and MobileNetEdgeTPU.
introducing-next- generation-on- device.
 Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos,
Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic im-
age segmentation with deep convolutional nets, atrous con-
volution, and fully connected crfs, 2017.
 Liang-Chieh Chen, Yukun Zhu, George Papandreou, Flo-
rian Schroff, and Hartwig Adam. Encoder-decoder with
atrous separable convolution for semantic image segmentation, 2018.
 Andrew M. Dai and Quoc V. Le. Semi-supervised sequence learning, 2015.
 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina
Toutanova. Bert: Pre-training of deep bidirectional trans-
formers for language understanding, 2019.
 David Eigen and Rob Fergus. Predicting depth, surface nor-
mals and semantic labels with a common multi-scale convo-
lutional architecture, 2015.
 Song Han, Huizi Mao, and William J Dally. Deep com-
pression: Compressing deep neural networks with pruning,
trained quantization and huffman coding. arXiv preprint
 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition, 2015.
 Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry
Kalenichenko, Weijun Wang, Tobias Weyand, Marco An-
dreetto, and Hartwig Adam. Mobilenets: Efﬁcient convolu-
tional neural networks for mobile vision applications, 2017.
 Andrey Ignatov, Radu Timofte, Andrei Kulik, Seungsoo
Yang, Ke Wang, Felix Baum, Max Wu, Lirong Xu, and Luc
Van Gool. Ai benchmark: All about deep learning on smart-
phones in 2019. In 2019 IEEE/CVF International Confer-
ence on Computer Vision Workshop (ICCVW), pages 3617–
3635. IEEE, 2019.
 W. Kim and J. Seok. Indoor semantic segmentation for robot
navigating on mobile. In 2018 Tenth International Confer-
ence on Ubiquitous and Future Networks (ICUFN), pages
 Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton.
Imagenet classiﬁcation with deep convolutional neural net-
works. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q.
Weinberger, editors, Advances in Neural Information Pro-
cessing Systems 25, pages 1097–1105. Curran Associates,
 Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Ui-
jlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan
Popov, Matteo Malloci, Alexander Kolesnikov, and et al. The
open images dataset v4. International Journal of Computer
Vision, 128(7):1956–1981, Mar 2020.
 Chien-Hung Lin, Chih-Chung Cheng, Yi-Min Tsai, Sheng-
Je Hung, Yu-Ting Kuo, Perry H Wang, Pei-Kuei Tsung,
Jeng-Yun Hsu, Wei-Chih Lai, Chia-Hung Liu, et al. 7.1 a
3.4-to-13.3 tops/w 3.6 tops dual-core deep-learning acceler-
ator for versatile ai applications in 7nm 5g smartphone soc.
In 2020 IEEE International Solid-State Circuits Conference-
(ISSCC), pages 134–136. IEEE, 2020.
 Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir
Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva
Ramanan, C. Lawrence Zitnick, and Piotr Doll´
coco: Common objects in context, 2015.
 Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian
Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C.
Berg. Ssd: Single shot multibox detector. Lecture Notes
in Computer Science, page 21–37, 2016.
 Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully
convolutional networks for semantic segmentation, 2015.
 Natalia Neverova, Pauline Luc, Camille Couprie, Jakob J.
Verbeek, and Yann LeCun. Predicting deeper into the future
of semantic segmentation. CoRR, abs/1703.07684, 2017.
 Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gard-
ner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer.
Deep contextualized word representations, 2018.
 Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya
Sutskever. Improving language understanding by generative
 Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and
Percy Liang. Squad: 100,000+ questions for machine com-
prehension of text, 2016.
 Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun.
Faster r-cnn: Towards real-time object detection with region
proposal networks, 2016.
 Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, San-
jeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy,
Aditya Khosla, Michael Bernstein, Alexander C. Berg, and
Li Fei-Fei. ImageNet Large Scale Visual Recognition Chal-
lenge. International Journal of Computer Vision (IJCV),
 Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zh-
moginov, and Liang-Chieh Chen. Mobilenetv2: Inverted
residuals and linear bottlenecks, 2019.
 Jamie Sherrah. Fully convolutional networks for dense se-
mantic labelling of high-resolution aerial imagery, 2016.
 Mennatullah Siam, Sara Elkerdawy, Martin Jagersand, and
Senthil Yogamani. Deep semantic segmentation for auto-
mated driving: Taxonomy, roadmap and challenges. In 2017
IEEE 20th international conference on intelligent trans-
portation systems (ITSC), pages 1–8. IEEE, 2017.
 G. Sun and H. Lin. Robotic grasping using semantic seg-
mentation and primitive geometric model based 3d pose es-
timation. In 2020 IEEE/SICE International Symposium on
System Integration (SII), pages 337–342, 2020.
 Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yim-
ing Yang, and Denny Zhou. Mobilebert: a compact task-
agnostic bert for resource-limited devices, 2020.
 Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe,
Jonathon Shlens, and Zbigniew Wojna. Rethinking the in-
ception architecture for computer vision, 2015.
 Saeid Asgari Taghanaki, Kumar Abhishek, Joseph Paul Co-
hen, Julien Cohen-Adad, and Ghassan Hamarneh. Deep se-
mantic segmentation of natural and medical images: A re-
 Xavier Vera. Inside tiger lake: Intel’s next generation mobile
client cpu. In 2020 IEEE Hot Chips 32 Symposium (HCS),
pages 1–26. IEEE Computer Society, 2020.
 Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela
Barriuso, and Antonio Torralba. Scene parsing through
ade20k dataset. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 633–641,