MLPerf Mobile Inference Benchmark
Why Mobile AI Benchmarking Is Hard and What to Do About It
Vijay Janapa Reddi*, David Kanter, Peter Mattson, Jared Duke, Thai Nguyen, Ramesh Chukka§,
Kenneth Shiring, Koan-Sin Tan, Mark Charlebois||, William Chou||, Mostafa El-Khamy**,
Jungwook Hong**, Michael Buch*, Cindy Trinh††, Thomas Atta-fosu§, Fatih Cakir**,
Masoud Charkhabi, Xiaodong Chen**, Jimmy Chiang, Dave Dexter‡‡,
Woncheol Heo, Guenther Schmuelling§§, Maryam Shabani§, Dylan Zika††
MLPerf Mobile is the first industry-standard open-
source mobile benchmark developed by industry members
and academic researchers to allow performance/accuracy
evaluation of mobile devices with different AI chips and
software stacks. The benchmark draws from the expertise
of leading mobile-SoC vendors, ML-framework providers,
and model producers. In this paper, we motivate the drive to
demystify mobile-AI performance and present MLPerf Mo-
bile’s design considerations, architecture, and implemen-
tation. The benchmark comprises a suite of models that
operate under standard data sets, quality metrics, and run
rules. For the first iteration, we developed an Android
app to provide an “out-of-the-box” inference-performance
benchmark for computer vision and natural-language pro-
cessing on mobile devices. The benchmark also supports
non-smartphone devices such as laptops and mobile PCs.
As a whole, the MLPerf Mobile inference benchmark can
serve as a framework for integrating future models, for cus-
tomizing quality-target thresholds to evaluate system per-
formance, for comparing software frameworks, and for as-
sessing heterogeneous-hardware capabilities for machine
learning, all fairly and faithfully with reproducible results.
1 Introduction
Mobile artificial-intelligence (AI) applications are in-
creasingly important as AI technology becomes a critical
differentiator among smartphones, laptops, and other mo-
bile devices. Many consumer applications benefit from AI:
image processing, voice processing, and text interpretation.
AI provides state-of-the-art solutions to these tasks with a quality that users will notice on their devices. More and more consumers are using such applications, and they expect a high-quality experience, especially for applications with video or audio interactivity.

*Harvard University, MLCommons, Google, §Intel, MediaTek, ||Qualcomm, **Samsung, ††ENS Paris-Saclay, ‡‡Arm, §§Microsoft
Consequently, mobile-device and chipset manufacturers
are motivated to improve AI implementations. Support for
the technology is becoming common in nearly all mobile
segments, from cost-optimized devices to premium phones.
The many AI approaches range from purely software tech-
niques to hardware-supported machine learning that relies
on tightly coupled libraries. Seeing through the mist of
competing solutions is difficult for mobile consumers.
On the hardware front, laptops and smartphones have in-
corporated application-specific integrated circuits (ASICs)
to support AI in an energy-efficient manner. For machine
learning, this situation leads to custom hardware that ranges
from specialized instruction-set-architecture (ISA) exten-
sions on general-purpose CPUs to fixed-function acceler-
ators dedicated to efficient machine learning. Also, because
mobile devices are complex, they incorporate a variety of
features to remain competitive, especially those that help
conserve battery life.
The software front includes many code paths and AI
infrastructures owing to the desire to efficiently support
machine-learning hardware. Most SoC vendors lean toward
custom pathways for model compilation and deployment
that are tightly integrated with the hardware. Examples
include Google’s Android Neural Network API (NNAPI)
[15], Intel’s OpenVINO [5], MediaTek’s NeuroPilot [19],
Qualcomm’s SNPE [23] and Samsung’s Exynos Neural
Network SDK [21]. These frameworks handle different nu-
merical formats (e.g., FP32, FP16, and INT8) for execution,
and they provide run-time support for various machine-
learning networks that best fit the application and platform.
arXiv:2012.02328v1 [cs.LG] 3 Dec 2020

Hardware and software support for mobile AI applications is becoming a differentiating capability, resulting in a growing need to make AI-performance evaluation transparent. OEMs, SoC vendors, and consumers benefit when
mobile devices employ AI in ways they can see and com-
pare. A typical comparison point for smartphone makers
and the technical press, for example, is CPUs and GPUs,
both of which have associated benchmarks [6]. Similarly,
mobile AI performance can also benefit from benchmarks.
Benchmarking AI performance is nontrivial, however. It
is especially challenging because AI implementations come
in a wide variety with differing capabilities. This variety,
combined with a lack of software-interface standards, com-
plicates the design of standard benchmarks. In edge de-
vices, the quality of the results is often highly specific to
each problem. In other words, the definition of high perfor-
mance is often task specific. For interactive user devices,
latency is normally the preferred performance metric. For
noninteractive ones, throughput is usually preferred. The
implementation for each task can generally trade off neural-
network accuracy for lower latency. This tradeoff makes
choosing a benchmark suite’s accuracy threshold critical.
To address these challenges, MLPerf takes an open-source approach. It is a consortium of industry and academic organizations with shared interests, yielding collective expertise on neural-network models, data sets, and submission rules to ensure the results are relevant to the industry and beneficial to consumers while being transparent and reproducible.
The following are important principles that inform the
MLPerf Mobile benchmark:
• Measured performance should match the performance that end users perceive in commercial devices. We want to prevent the benchmark from implementing special code beyond what these users generally employ.

• The benchmark's neural-network models should closely match typical mobile-device workloads. They should reflect real benefits to mobile-device users in daily situations.

• Neural-network benchmark models should represent diverse tasks. This approach yields a challenging test that resists extensive domain-specific optimizations.

• Testing conditions should closely match the environments in which mobile devices typically serve. Affected characteristics include ambient temperature, battery power, and special performance modes that are software adjustable.

• All benchmark submissions should undergo third-party validation. Since mobile devices are ubiquitous, results should be reproducible outside the submitting organization.
MLPerf’s approach to addressing the mobile-AI bench-
mark needs of smartphones is to build an Android app that
all benchmarking must use. As of the initial v0.7 release
of the MLPerf Mobile benchmark, the app employs a stan-
dard set of four neural-network models for three vision tasks
and one NLP task and passes these models to the back-end
layer. This layer is an abstraction that allows hardware ven-
dors to optimize their implementations for neural networks.
The app also has a presentation layer that wraps the more technical benchmark layers and the Load Generator ("LoadGen") [9]. MLPerf created the LoadGen [9] to allow representative testing of different inference platforms and use cases; it generates inference requests in a pattern and measures parameters such as latency, throughput, or latency-bounded throughput. MLPerf additionally offers a
headless version of the mobile application that enables lap-
tops running non-mobile OSs to use the same benchmarks.
The first round of MLPerf Mobile submissions is com-
plete [12]. Intel, MediaTek, Qualcomm, and Samsung
participated in this round, and all passed the third-party-
validation requirement (i.e., reproducibility) for their re-
sults. These results show performance variations and illus-
trate the wide range of hardware and software approaches
that vendors take to implementing neural-network models
on mobile devices. The results also highlight a crucial take-
away: measuring mobile-AI performance is challenging but
possible. It requires a deep understanding of the fragmented
and heterogeneous mobile ecosystem as well as a strong
commitment to fairness and reproducibility. MLPerf Mo-
bile is a step toward better benchmark transparency.
2 Benchmarking Challenges
The mobile ecosystem is rife with hardware hetero-
geneity, software fragmentation, developer options, deploy-
ment scenarios, and OEM life cycles. Each by itself leads
to hardware-performance variability, but the combination
makes AI benchmarking on mobile systems extremely dif-
ficult. Figure 1 shows the various stakeholders and explains
the implementation options and challenges facing each one.
2.1 Hardware Heterogeneity
Smartphones contain complex heterogeneous chipsets
that provide many different compute units and accelerators.
Any or all of these components can aid in machine-learning
(ML) inference. As such, recognizing the variability of
SoCs is crucial.
A typical mobile system-on-a-chip (SoC) complex in-
cludes a CPU cluster, GPU, DSP, Neural Processing Unit
(NPU), Hexagon Tensor Accelerator (HTA), Hexagon Vec-
tor Extensions (HVX), and so on. Many smartphones to-
day are Arm based, but the CPU cores generally implement
a heterogeneous "big.LITTLE" architecture [4]. Some SoCs
even have big-CPU clusters where some big CPUs clock
faster than others. Also, devices fall into different tiers with
Figure 1: Mobile AI performance stakeholders.
different hardware capabilities at different prices, varying in
their memory capacity and storage features.
Any processing engine can run ML workloads, but
this flexibility also makes benchmarking AI performance
difficult. A given device may have a spectrum of AI-
performance capabilities depending on which processing
engines it uses. Hence the need for a systematic way to
benchmark a smartphone’s AI-hardware performance.
2.2 Software Fragmentation
The mobile-software ecosystem is heavily differentiated,
from the OS to the machine-learning run time. The result
can be drastic hardware-performance changes or variability.
Mobile devices employ various OSs: Android, iOS, Win-
dows, Ubuntu, Yocto, and so on. Each one has an ecosys-
tem of ML application programming interfaces (APIs) and
application-deployment options that necessitate particular
software solutions.
Smartphone OSs have undergone substantial consolida-
tion. Numerous APIs have served in the development of
ML applications, and often, a single SoC or OEM device
will support a vendor SDK and a plurality of frameworks.
SoC vendors will by default offer a proprietary SDK that
generates optimized binaries so ML models can run on
SoC-specific hardware. These vendors also make engineer-
ing investments to support more-generic frameworks, such
as TensorFlow Lite (TFLite) [24] and NNAPI [15], that
provide a compatibility layer to support various accelera-
tors and device types. Because engineering resources are
limited, however, SoC vendors must prioritize their own
SDKs, often resulting in partial or less optimized generic-
framework support. The diversity of vendor SDKs and
framework-support levels are all reasons why the mobile-
ML software ecosystem is fragmented.
This situation complicates hardware-performance as-
sessment because the choice of software framework has
a substantial effect. A high-performance SoC, for in-
stance, may deliver low performance owing to an ill-matched framework. Even for SoCs that integrate a high-performance ML accelerator, if a generic Android framework such as NNAPI lacks well-optimized driver back ends for it, the accelerator will perform poorly when handling a network.
Because software code paths can drastically affect hard-
ware performance, a transparent mechanism for operating
and evaluating a mobile device is essential.
2.3 Developer Options
Developers can choose among several approaches to en-
able machine learning on mobile devices. Each one has im-
plications for achievable hardware performance on a given
application. Recognizing these behind-the-scenes factors is
therefore critical to maximizing performance.
Application developers can work through a marketplace
such as Google Play [7] to create mobile-app variants for
every SoC vendor if they follow a vendor-SDK approach
(Figure 2a). Doing so presents a scalability challenge, how-
ever, because of the increased time to market and additional
development costs.
An alternative is to create an application using a native
OS/framework API such as NNAPI, which provides a more
scalable approach (Figure 2b). Nevertheless, this alternative has a crucial shortcoming: it is only viable if SoC vendors provide good back-end drivers for the framework, necessitating cooperation between these vendors and the framework developers.
A final alternative is to bind the neural-network model to
the underlying hardware. Doing so allows compilation of
the model to a particular device, avoiding reliance on any
particular run time (Figure 2c).
2.4 Deployment Scenarios
Machine-learning applications have many potential uses
on mobile devices. Details of the usage scenario determine
the extent to which a neural-network model is optimized for
the hardware and how it runs, because of strong or weak ties
to the device.
Developers primarily build applications without specific
ties to vendor implementations. They may design custom
neural-network models that can run on any device. Thus,
mobile devices often run apps that employ unknown mod-
els for a variety of hardware. Figure 3(a) illustrates this
case. OEMs, on the other hand, build their ML applications
for their own devices. Therefore, both the models and the
device targets are known at deployment time (Figure 3(b)).
A service provider (e.g., Verizon or AT&T) that uses a variety of hardware solutions may, however, support its service with known models, in which case both the models and the hardware are known (Figure 3(c)).

Figure 2: Application-development options.
Development of the applications deployed in these sce-
narios may also take place in various ways. OEMs that
manufacture devices can use vendor SDKs to support their
applications with minimal extra effort.
2.5 OEM Life Cycle
Mobile-SoC testing often occurs on development plat-
forms. Gaining access to them, however, is difficult. There-
fore, the results of benchmark testing that employs a devel-
opment platform may not be independently verifiable. For
this reason, benchmarking generally takes place on com-
mercial devices. But because of the way commercial mo-
bile devices (particularly smartphones) operate, getting re-
producible numbers can be difficult.
A variety of factors, ranging from how OEMs pack-
age software for delivery to how software updates are is-
sued, affect hardware-performance measurements. OEMs
employ vendor SoCs and associated software releases to
produce commercial mobile devices. In the case of smart-
phones, those devices may sell unlocked or locked to a wire-
less carrier, in which case the carrier ultimately controls
the software. OEMs pick up the software updates from
the SoC vendors and usually bundle them with other up-
dates for periodic release. If the carrier sells the device,
it will likely require testing and validation before allow-
ing any updates. This restriction can add further delays
to the software-update channel. NNAPI updates, for in-
stance, would require a new software update for the device.
For a benchmark, no recompilation is necessary when using
NNAPI; updates to a vendor SDK, however, may necessi-
tate recompilation (Figure 2a).
Figure 3: ML-application scenarios.

When benchmarking a device, a newly installed software update may affect the results, and installing the same version of the software used to generate a particular result may be impossible. After a device applies a system-software update, the only way to revert to the previous configuration is
to factory reset the device. But doing so also undoes any
associated security fixes.
More often than not, a substantial delay occurs between
the time when an SoC vendor releases new software and
when that software sees deployment on user devices. The
delay is usually measured in months, and it especially af-
fects the system-API approach (e.g., NNAPI). Extensive
planning is therefore necessary for a commercial phone to
have all the required features for an upcoming benchmark.
Finally, commercial devices receive OEM updates only
for a fixed period, so they will not benefit from additional
software-performance enhancements after that time.
2.6 Legal and IP
An important yet easily overlooked aspect of ML bench-
marking is the law. A chief challenge to constructing a
widely used mobile benchmark is the legal and intellectual-
property (IP) regime for both data sets and tool chains.
Since ML tends to be open source, the rigidity and restric-
tions on data sets and SDKs can be surprising.
Standard ML data sets are distributed under licenses with limited or unclear redistribution rights (e.g., ImageNet and COCO). Not all organizations have licensed these data
sets for commercial use, and redistribution through an app
is legally complicated. In addition, submitters to an ML
benchmark may apply different legal-safety standards when
participating in a public-facing software release.
Additionally, many SoC vendors rely on proprietary
SDKs to quantize and optimize neural networks for their
products. Although some SDKs are publicly available under off-the-shelf licensing terms, others require direct approval or negotiation with the vendor. Moreover, most forbid redistribution and sharing, potentially hindering reproduction of the overall flow and verification of a result.
Area | Task | Reference Model | Data Set | Quality Target
Vision | Image classification | MobileNetEdgeTPU (4M params) | ImageNet 2012 (224x224) | 98% of FP32 (76.19% Top-1)
Vision | Object detection | SSD-MobileNet v2 (17M params) | COCO 2017 (300x300) | 93% of FP32 (0.244 mAP)
Vision | Semantic segmentation | DeepLab v3+ (2M params) | ADE20K (512x512) | 97% of FP32 (54.8% mIoU)
Language | Question answering | MobileBERT (25M params) | mini Squad v1.1 dev | 93% of FP32 (93.98% F1)

Table 1: MLPerf Mobile v0.7 benchmark suite.
3 MLPerf Mobile Benchmarks
The MLPerf Mobile Inference benchmark is community
driven. As such, all involved parties aided in developing
the benchmark models and submission rules; the group in-
cludes both submitting organizations and organizations that
care about mobile AI. Participants reached a consensus on
what constitutes a fair and useful benchmark that accurately
reflects mobile-device performance in realistic scenarios.
Table 1 briefly summarizes the tasks, models, data sets, and metrics. This section describes the models in MLPerf Mobile v0.7. Beyond the models themselves, a crucial aspect of our work is the method we prescribe for mobile-AI performance testing. We also describe the quality requirements during benchmark testing.
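Each task's accuracy floor in Table 1 is a fixed fraction of the FP32 reference model's accuracy. A minimal sketch of that computation (the function name is ours, not from the benchmark code):

```python
def quality_target(fp32_accuracy, fraction):
    # Accuracy floor a submission must reach, expressed as a fixed
    # fraction of the FP32 reference model's accuracy, e.g. 98% of
    # 76.19% Top-1 for image classification.
    return fp32_accuracy * fraction
```

For classification, quality_target(76.19, 0.98) gives roughly 74.67% Top-1; the paper quotes 74.66%.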
3.1 Tasks and Models
Machine-learning tasks and associated neural-network
models come in a wide variety. Our benchmark’s first it-
eration focused on establishing a high-quality method of
benchmarking, rather than focusing on model quantity. To
this end, we intentionally chose a few machine-learning
tasks representing real-world uses. Benchmarking them
yields helpful insights about hardware performance across
a wide range of deployment scenarios (smartphones, note-
books, etc.). We chose networks for these tasks on the ba-
sis of their maturity and applicability to various hardware
(CPUs, GPUs, DSPs, NPUs, etc.).
Image classification picks the best label to describe an input image. Many commercial applications employ image classification, which is a de facto standard for evaluating ML-system performance. Moreover, classifier-network evaluation is a good performance indicator for the model when it serves as a feature-extractor backbone for other tasks. Image classification has a wide range of applications, such as photo search, text extraction, and industrial automation (object sorting and defect detection).
On the basis of community feedback, we selected Mo-
bileNetEdgeTPU [28], which is well-optimized for mobile
applications and provides good performance on different
SoCs. The MobileNetEdgeTPU network is a descendant of the MobileNet v2 family, optimized for low latency and mobile accelerators. The MobileNetEdgeTPU model
architecture is based on convolutional layers with inverted
residuals and linear bottlenecks, similar to MobileNet v2,
but is optimized by introducing fused inverted bottleneck
convolutions to improve hardware utilization, and remov-
ing hard-swish and squeeze-and-excite blocks.
The MobileNetEdgeTPU reference model is evaluated on the ImageNet 2012 validation data set [50] and requires 74.66% Top-1 accuracy (98% of FP32 accuracy); note that the app uses a different data set. Before inference, images are resized, cropped to 224x224, and normalized.
Object detection draws bounding boxes around objects
in an input image and then labels the object and is com-
monly applied to camera input. Implementations typically
use a pretrained image-classifier network as a backbone or
feature extractor, then perform bounding-box selection and
regression for precise localization [49, 43]. Object detec-
tion is crucial for automotive tasks, such as detecting haz-
ards and analyzing traffic, and for mobile-retail tasks, such
as identifying items in a picture.
Our reference model is the Single Shot Detector (SSD)
[43] with a MobileNet v2 backbone [51]—a choice that is
well adapted to constrained computing environments. The
SSD-MobileNet v2 uses MobileNet v2 for feature extraction and a mobile-friendly variant of regular SSD, called SSDLite, for detection. In the SSD prediction layers, all the regular convolutions are replaced with separable convolutions (depthwise followed by 1x1 projection). SSD-MobileNet v2 improves latency by significantly decreasing the number of operations; it also reduces the memory footprint needed during inference by never fully materializing the large intermediate
tensors. Two SSD-MobileNet v2 versions acted as the refer-
ence models for the object-detection benchmark, where one
model replaces more of the regular SSD-layer convolutions
with depth-separable convolutions than the other does.
We used the COCO 2017 validation data set [42] and, for the quality metric, the mean average precision (mAP). The target accuracy is an mAP of 22.7 (93% of FP32 accuracy). Preprocessing consists of first resizing to 300x300, typical of resolutions in smartphones and other compact devices, and then normalizing.
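The mAP metric builds on per-box intersection over union (IoU), which decides whether a predicted box matches a ground-truth box. A minimal sketch of that building block (corner-format boxes assumed; the function is ours, not from the evaluation code):

```python
def box_iou(a, b):
    # Intersection over union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```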
Semantic image segmentation partitions an input image into labeled objects at pixel granularity. It applies to autonomous driving and robotics [38, 54, 45, 53], remote sensing [52], medical imaging [57], and complex image manipulation such as red-eye reduction.
Our reference model for this task is DeepLab v3+ [30]
with a MobileNet v2 backbone. DeepLab v3+ originates
from the family of semantic image-segmentation models
that use fully convolutional neural networks to directly pre-
dict pixel classification [44, 33] as well as to achieve state-
of-the-art performance by overcoming reduced-feature-
resolution problems and incorporating multiscale context.
DeepLabV3+ uses an encoder-decoder architecture with
atrous spatial pyramid pooling and a modular feature ex-
tractor. We selected MobileNet-V2 as the feature extractor
because it enables state-of-the-art model accuracy within a
constrained computational budget.
We chose the ADE20K validation data set [59] for its
realistic scenarios, cropped and scaled images to 512x512,
and (naturally) settled on the mean intersection over union
(mIoU) for our metric. Additionally, we trained the model to predict just 32 classes (compared with 150 in the original ADE20K data set): classes 1 through 31 are the most frequent (pixel-wise) classes in ADE20K, and class 32 represents all the others. The mIoU is computed only over pixels whose ground-truth label belongs to one of the 31 most frequent classes, which avoids penalizing the network for its poor performance on low-frequency classes.
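One plausible reading of this restricted mIoU, as a sketch (the function and the class encoding are ours; classes 1-31 are the frequent classes and 32 is the catch-all):

```python
def restricted_miou(gt, pred, eval_classes=range(1, 32)):
    # Mean IoU over the 31 frequent classes. Pixels whose ground
    # truth is the catch-all class 32 are ignored (our reading of
    # the rule; the reference implementation may differ in detail).
    # gt and pred are flat lists of per-pixel class IDs.
    ious = []
    for c in eval_classes:
        inter = sum(1 for g, p in zip(gt, pred) if g == c and p == c)
        union = sum(1 for g, p in zip(gt, pred)
                    if g in eval_classes and (g == c or p == c))
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```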
Question answering is an NLP task: responding to human-posed questions in colloquial language. Example applications include search engines, chatbots, and other information-retrieval tools. For this task, we use the Stanford Question Answering Dataset (SQuAD) v1.1 Dev [48].
Given a question and a passage from a Wikipedia article,
the model must extract a text segment from the passage to
answer the question.
Recent NLP models that rely on pretrained contextual
representations have proven useful in diverse situations
[31, 46, 47]. BERT (Bidirectional Encoder Representations
from Transformers) [32] improves on those models by pre-
training the contextual representations to be bidirectional
and to learn relationships between sentences using unla-
beled text. We selected MobileBERT [55], a lightweight
BERT model that is well suited to resource-limited mo-
bile devices. Further motivating this choice is the model’s
state-of-the-art performance and task-agnostic nature: even
though we consider question answering, MobileBERT is
adaptable to other NLP tasks with only minimal fine-tuning.
We trained the model with a maximum sequence length of
384 and use the F1 score for our metric.
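The F1 metric compares the predicted and reference answer spans at the token level. A simplified sketch (the official SQuAD evaluation script also normalizes case, articles, and punctuation, which we omit here):

```python
def token_f1(prediction, ground_truth):
    # Token-overlap F1 between a predicted answer span and the
    # reference answer (simplified; no text normalization).
    pred_tokens = prediction.split()
    gt_tokens = ground_truth.split()
    gt_counts = {}
    for t in gt_tokens:
        gt_counts[t] = gt_counts.get(t, 0) + 1
    common = 0
    for t in pred_tokens:
        if gt_counts.get(t, 0) > 0:
            gt_counts[t] -= 1
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```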
3.2 Reference Code
MLPerf provides reference-code implementations for
the TensorFlow and TensorFlow Lite benchmarks.

Figure 4: Load Generator ("LoadGen") testing the SUT.

All reference models have 32-bit floating-point weights, and the
benchmark additionally provides an 8-bit quantized ver-
sion (with either post-training quantization or quantization-
aware training, depending on the tasks). The code for all
reference implementations is open source and free to down-
load from GitHub [11].
The reference code’s goal is to explicitly identify the crit-
ical model-invocation stages. For instance, the reference
benchmarks implement the preprocessing stages and the
model’s input-generation procedure. Submitters may adopt
the code for their submission. They may also optimize
these stages (e.g., rewrite them in C instead of Python) for
performance—as long as they employ all the same stages
and take the same steps to maintain equivalence.
By default, the reference code is not well-optimized.
Vendors that submit results to MLPerf must inherit the ref-
erence code, adapt it, and produce optimized glue code that
performs well on their hardware. For example, to perform
(quantized) inference, they may need to invoke the correct
software backend (e.g., SNPE and ENN) or NNAPI driver
to schedule code to their SoC’s custom accelerators.
3.3 System Under Test
A typical system under test (SUT) interfaces with several
components. Orchestrating the complete SUT execution in-
volves multiple stages. The main ones are model selection,
data-set input, preprocessing, back-end execution, and post-
processing. Figure 4 shows how these stages work together.
Model selection. The first step is selection of the refer-
ence models, either TensorFlow or TFLite.
Load generator. To enable representative testing of var-
ious inference platforms and use cases, we created the Load
Generator (“LoadGen”) [9]. The LoadGen creates inference
requests in a pattern and measures some parameter (e.g., la-
tency, throughput, or latency-bounded throughput). In addi-
tion, it logs information about the system during execution
to enable post-submission result validation. Submitter mod-
ification of the LoadGen software is forbidden.
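In spirit, the LoadGen's single-stream issuing loop works as sketched below. This is a simplified Python sketch under our assumptions; the actual LoadGen is a C++ library with trace logging and more elaborate timing rules. The defaults reflect the v0.7 single-stream parameters (1,024 queries or 60 seconds, whichever takes longer).

```python
import time

def run_single_stream(issue_query, samples, min_queries=1024,
                      min_duration_s=60.0):
    # Issue one query at a time, wait for it to complete, and record
    # each inference latency (single-stream pattern).
    latencies = []
    start = time.monotonic()
    i = 0
    while i < min_queries or time.monotonic() - start < min_duration_s:
        t0 = time.monotonic()
        issue_query(samples[i % len(samples)])
        latencies.append(time.monotonic() - t0)
        i += 1
    return latencies
```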
Data-set input. The LoadGen uses the data sets as in-
puts to the SUT. In accuracy mode, it feeds the entire data
set to the SUT to verify that the model delivers the required
accuracy. In performance mode, it feeds a subset of the im-
ages to the SUT to measure steady-state performance. A
seeded random-number generator selects samples from the data set for inference, precluding any unrealistic data-set-specific optimizations.
Preprocessing. The typical image-preprocessing
tasks—such as resizing, cropping, and normalization—
depend on the neural-network model. This stage implements data-set-specific preprocessing; it varies with the task, and all submitters must follow the same steps.
Back-end execution. The reference benchmark imple-
mentation is a TFLite smartphone back end that optionally
includes NNAPI and GPU delegates. A “dummy” back end
is also available as a reference for proprietary back ends;
submitters replace it with whatever corresponds to their system. For instance, Qualcomm would replace the dummy with SNPE, and Samsung would replace it with ENN. On notebooks and other large mobile devices, the back end corresponds to other frameworks, such as OpenVINO.
Postprocessing. This data-set-specific task covers all the
operations necessary for accuracy calculations. For example, computing the Top-1 or Top-5 results for an image classifier requires a Top-K op/layer after the softmax layer.
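For instance, the Top-K step can be sketched as below (a pure-Python illustration of the postprocessing idea, not the benchmark's actual op):

```python
def top_k_accuracy(logits_batch, labels, k=5):
    # For each sample, take the k highest-scoring classes and check
    # whether the ground-truth label is among them.
    correct = 0
    for logits, label in zip(logits_batch, labels):
        topk = sorted(range(len(logits)),
                      key=lambda i: logits[i], reverse=True)[:k]
        correct += label in topk
    return correct / len(labels)
```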
A typical SUT can be either a smartphone or a laptop.
We therefore designed all the mobile-benchmark compo-
nents to take advantage of either one. Figure 5 shows how
MLPerf Mobile supports this flexibility. The reference Ten-
sorFlow models are at the root of the entire process. The
MLPerf Mobile process follows one of three paths.
Code path 1 allows submitters to optimize the reference
TensorFlow models for implementation via a proprietary
backend (e.g., SNPE for Qualcomm or ENN for Samsung),
then schedule and deploy the networks on the hardware.
Code path 2 allows submitters to convert the reference
TensorFlow models to a mobile-friendly format using an ex-
porter. These models are then easy to deploy on the device,
along with quantization optimizations, using the TFLite del-
egates to access the AI-processing hardware.
Code path 3 allows non-smartphone submitters to run
the reference TensorFlow models through nonmobile back-
ends (e.g., OpenVINO) on laptops and tablets that run op-
erating systems such as Windows and Linux.
3.4 Execution Scenarios
MLPerf Mobile Inference supports two modes for run-
ning ML models: single stream and offline. They reflect the
typical operating behavior of many mobile applications.
Single stream. In the single-stream scenario, the ap-
plication sends a single inference query to the SUT with
a sample size of one. That size is typical of smartphones
and other interactive devices where the user takes a picture
and expects a timely response, as well as AR/VR headsets
where real-time operation is crucial. The LoadGen injects a
query into the SUT and waits for query completion. When the query is complete, the LoadGen records the inference run length and sends the next query. This process repeats until the LoadGen has issued all 1,024 samples in the task's corresponding data set and a minimum runtime of 60 seconds has been met.

Figure 5: MLPerf Mobile benchmark code paths. The benchmarks run on smartphones and on mobile PCs, such as laptops. On smartphones, there are multiple framework options and back-end code paths that vendors can select.
Offline. In the offline scenario, the LoadGen sends all
the samples to the SUT in one burst. Although the query
sample size remains one, as in the single-stream scenario,
the number of samples in the query is much larger. Of-
fline mode in MLPerf Mobile v0.7 issues 24,576 samples—
enough to provide sufficient run time. This choice typically
reflects applications that require multi-image processing, si-
multaneous processing of batched input, or concurrent use
of models such as image classification and person detec-
tion for photos in an album. Its implementation is usually a
batched query with a batch size larger than one.
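The two scenarios can be illustrated with a minimal timing-harness sketch in plain Python. This is not the actual LoadGen code, and `run_inference` is a hypothetical stand-in for a real back end; the stopping condition follows the run rules (both a minimum query count and a minimum duration):

```python
import time

def run_inference(sample):
    """Hypothetical SUT call; a real back end would run the model here."""
    pass

def single_stream(samples, min_queries=1024, min_duration_s=60.0):
    """Issue one query at a time until both minimums are met;
    return the 90th-percentile latency in seconds."""
    latencies, start, i = [], time.perf_counter(), 0
    while len(latencies) < min_queries or time.perf_counter() - start < min_duration_s:
        t0 = time.perf_counter()
        run_inference(samples[i % len(samples)])
        latencies.append(time.perf_counter() - t0)
        i += 1
    latencies.sort()
    return latencies[int(0.9 * len(latencies))]

def offline(samples):
    """Issue every sample as one burst; return average throughput (samples/s)."""
    start = time.perf_counter()
    for sample in samples:
        run_inference(sample)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
    return len(samples) / elapsed
```

Single stream reports a tail latency because each query waits for the previous one; offline reports throughput because the burst permits batching.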
4 Result Submission
This section outlines how submitters produce high-
quality benchmark results for submission. We outline the
submission process, the run rules, and the procedure for ver-
ifying the accuracy and validity of the results.
4.1 Submission Process
The reference models for MLPerf Mobile are provided as
frozen TensorFlow FP32 checkpoints and valid submissions
must start from these frozen graphs. From the frozen graph,
submitters can export a reference FP32 TFLite model. They
can generate fixed-point models with INT8 precision from
the reference FP32 models using post-training quantization
(PTQ), but they cannot perform quantization-aware train-
ing (QAT). Network retraining typically alters the neural-network architecture, so model equivalence is difficult to verify. Additionally, retraining allows a submit-
ter to use their training capabilities (e.g., neural architec-
ture search) to enhance inference performance, changing
the very nature of the benchmark. Depending on submit-
ter needs, however, MLPerf provides QAT versions of the
model. All organizations mutually agree on these QAT
models as being comparable to the PTQ models.
In general, QAT reduces accuracy loss relative to PTQ.
Therefore, we chose the minimum-accuracy thresholds on
the basis of what is achievable through post-training quanti-
zation without any training data. For some benchmarks, we
generated a reference INT8 QAT model using the Tensor-
Flow quantization tools; submitters can employ it directly
in the benchmark.
Some hardware is unable to directly deploy TensorFlow-
quantized models, however, and submission organizations
may need different fixed-point formats to match their hard-
ware. In such cases, we only allow post-training quantiza-
tion without training data from a reference model.
For each model, the Mobile Working Group specified a
calibration data set (typically 500 samples or images from
the training or validation data set) to use for calibration in
the PTQ process. Submitters may use only the approved calibration data set, though they may select a subset of the samples.
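The role of the calibration set can be illustrated with a minimal affine-quantization sketch in plain Python (illustrative only; real PTQ toolchains, such as the TensorFlow quantization tools, use more sophisticated range estimation):

```python
def calibrate_range(calibration_values):
    """Derive the quantization range from calibration samples alone
    (no training data, mirroring the PTQ restriction)."""
    return min(calibration_values), max(calibration_values)

def quantize_int8(x, lo, hi):
    """Affine-quantize a float to INT8 using the calibrated range."""
    scale = (hi - lo) / 255.0
    zero_point = -128 - round(lo / scale)
    q = round(x / scale) + zero_point
    return max(-128, min(127, q)), scale, zero_point

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

# Stand-in calibration set (the approved sets hold roughly 500 samples).
lo, hi = calibrate_range([-1.0, 0.2, 0.9, 1.0])
q, scale, zp = quantize_int8(0.5, lo, hi)
error = abs(dequantize(q, scale, zp) - 0.5)  # bounded by one quantization step
```

A poorly chosen calibration subset widens `scale` and inflates this rounding error, which is why the working group fixes the approved calibration data.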
A submitter may make minimal, mathematically equivalent changes to the model, or apply approved approximations, to make it compatible with their hardware. However, MLPerf rules strictly prohibit altering
the AI models to reduce their computational complexity;
banned techniques include channel pruning, filter pruning,
and weight skipping.
4.2 Submission System
Smartphones and notebooks can use the mobile-
benchmark suite. For smartphones, we developed a refer-
ence MLPerf Android app that supports TFLite delegates
and NNAPI delegates. We benchmark the inference-task
performance at the application layer to reflect latencies that
mobile-device users observe and to give developers a refer-
ence for expected user-app latencies.
The MLPerf Mobile app queries the LoadGen, which in
turn queries input samples for the task, loads them to mem-
ory, and tracks the time required to execute the task. Com-
panies that used proprietary delegates implemented their
backend interface to the reference MLPerf app. Such back-
ends query the correct library (TensorFlow, TFLite, Exynos
Neural Network (ENN) SDK, or SNPE SDK) to run the
models on the SUT in accordance with the run rules.
For laptops, submitters can build a native command-line application by following the instructions in the MLCommons GitHub repo. This application integrates the MLPerf LoadGen and supports back ends such as the OpenVINO run time. It generates logs consistent with MLPerf rules, which the submission checker validates. The number of samples necessary for performance
mode and for accuracy mode remains identical to the num-
ber in the smartphone scenario. The only difference is the
absence of a user interface for these devices.
4.3 Run Rules
In any benchmark, measurement consistency is crucial
for reproducibility. MLPerf Mobile is no different. We de-
veloped a strict set of run rules that allow us to reproduce
submitted results through an independent third party.
Test control. The MLPerf app runs the five bench-
marks in a specific order. For each one, the model
first runs on the whole validation set to calculate the
accuracy, which the app then reports. Performance
mode then follows. Single-stream mode measures the
90th-percentile latency over at least 1,024 samples for
a minimum run time of 60 seconds to achieve a stable
performance result. Offline mode reports the average
throughput when processing 24,576 samples, which on current systems takes more than 60 seconds of run time.
Thermal throttling. Machine-learning models are
computationally heavy and can trigger run-time ther-
mal throttling to cool the SoC. We recommend that
smartphones maintain an air gap with proper ven-
tilation and avoid flush contact with any surfaces.
Additionally, we require normal room temperature
operation—between 20 and 25 degrees Celsius.
Cooldown interval. The benchmark does not test the
performance under thermal throttling, so the app al-
lows a break setting of 0–5 minutes between the indi-
vidual tests to allow the phone to reach its cooldown
state before starting each one. If the benchmark suite
is to run multiple times, we recommend a minimum
10-minute break between them.
Battery power. The benchmark runs while the phone
is battery powered, but we recommend a full charge
beforehand to avoid entering power-saving mode.
The above rules are generally inapplicable to laptops be-
cause these devices have sufficient power and cooling.
4.4 Result Validation
MLPerf Mobile submission rules require that the SUT
(smartphone or laptop) be commercially available before
publication, which enables a more tightly controlled and ro-
bust validation, review, and audit process. In contrast, the
other MLPerf benchmark suites allow submission of pre-
view and research systems that are not commercially avail-
able. Smartphones should be for sale either through a car-
rier or as an unlocked phone. The SUT includes both the
hardware and the software components, so these rules pro-
hibit device rooting.
At submission time, each organization has no knowledge
of other results or submissions. All must submit their results
at the same time. Afterward, the submitters collectively re-
view all the results in a closed setting, inspired by the peer-
review process for academic publications.
Submissions include all of the benchmark app’s log files,
unedited. After the submission deadline, results for each
participating organization are available for examination by
the MLPerf working group and the other submitters, along
with any modified models and code used in the respective
submissions. The vendor backend (but not the tool chain)
is included. MLPerf also receives private vendor SDKs to
allow auditing of the model-conversion process.
The audit process comprises examination of log files,
models, and code for compliance with the submission rules
as well as verification of their validity. It also includes veri-
fication of the system’s reported accuracy and latencies. To
verify results, we build the vendor-specific MLPerf app, in-
stall it on the phone (in the factory-reset state), and attempt
to reproduce latency or throughput numbers, along with ac-
curacy. We consider the results verified if our numbers are
within 5% of the reported values.
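The 5% verification criterion amounts to a simple relative-error check, sketched below with hypothetical numbers:

```python
def verified(reported, measured, tolerance=0.05):
    """Audit check: the reproduced number must fall within
    `tolerance` (5%) of the submitter-reported value."""
    return abs(measured - reported) <= tolerance * abs(reported)

ok = verified(reported=674.4, measured=660.0)           # ~2% deviation: passes
flagged = not verified(reported=674.4, measured=600.0)  # ~11% deviation: fails
```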
5 Performance Evaluation
The MLPerf Mobile inference suite first saw action in
October 2020. Mobile submissions fall into one of two cat-
egories: smartphones and laptops. The results reveal a de-
vice’s system on chip (SoC) performance for each of the
machine learning tasks in version 0.7. This section assesses
how the benchmark performed—specifically, whether it met
expectations in being transparent and faithful, reflecting the
vast diversity of AI hardware and software.
5.1 Premium ML Systems
The submitted systems include premier 5G smartphones
and high-end mobile SoCs from MediaTek, Qualcomm, and
Samsung. The MediaTek chipset is a Dimensity 820 [10] in
the Xiaomi Redmi 10X smartphone; it contains MediaTek’s
AI processing unit (APU) 3.0. The APU uniquely supports
FP16 and INT16 [41]. The Qualcomm chipset is a Snap-
dragon 865+ [22] in the Asus ROG Phone 3. It integrates
Qualcomm’s Hexagon 698 DSP, which consists of two en-
gines that can handle AI processing exclusively. The first
engine is the Hexagon Vector Extension (HVX), which is
designed for advanced imaging and computer-vision tasks
intended to run on the DSP instead of the CPU. The sec-
ond, the company’s AI-processor (AIP) cluster, supports
the Hexagon Tensor Accelerator (HTA), which can also per-
form AI tasks. These engines can serve together for maxi-
mum performance, or they can operate in isolation (depend-
ing on the compiler optimizations). The Samsung chipset is
an Exynos 990 [14] in the company’s Galaxy Note 20 Ul-
tra, which has a dual-core custom neural processing unit
(NPU) specialized to handle AI workloads. In the laptop
category, Intel submitted results for its new Willow Cove
CPU [27] and first-generation integrated Xe-LP GPU. That
GPU served as the AI accelerator [58]. These systems col-
lectively reflect the state of the art in AI processors.
In the smartphone category, three organizations submit-
ted a total of 14 individual results. No one solution domi-
nates all benchmarks. Figure 6 plots the single-stream re-
sults for the three smartphone chipsets on each benchmark
task. It includes both throughput and latency results. Each
chipset offers a unique differentiable value. MediaTek’s Di-
mensity scored the highest in object-detection and image-
segmentation throughput. Samsung’s Exynos performed
well on image classification and NLP, where it achieved
the highest scores. Qualcomm’s Snapdragon is competitive
for image segmentation and NLP. The image-classification
task employs offline mode, which allows batch process-
ing; here, Exynos delivered 674.4 frames per second (FPS),
and Snapdragon delivered 605.37 FPS (not shown in Fig-
ure 6). In most cases, the throughput differences are
marginal. An essential point to keep in mind, however, is
that other metrics—beyond performance benchmarks—go
into assessing a chipset’s viability for a given task.
5.2 Result Transparency
The submission results highlight an important point:
they reflect the variety of hardware and software combina-
tions we discussed earlier (Section 2). All mobile SoCs rely
on a generic processor, but the AI-performance results were
from AI accelerators using different software frameworks.
Transparency into how the results were generated is crucial.
Figure 7 shows the potential code paths for producing the
submission results. The dashed lines represent mere possi-
bilities, whereas the solid lines indicate actual submissions.
Looking only at Figure 7 is insufficient to determine which
paths produce high-quality results. Any other code paths
would have yielded a different performance result. There-
fore, transparency on benchmark performance is essential.
It reveals which code paths were taken, making the perfor-
mance results reproducible and informative for consumers.
Table 2 presents additional details, including specifics
for each benchmark result in both single-stream and offline
modes. MLPerf Mobile exposes this information to make
the results reproducible. For each benchmark and each sub-
mitting organization, the table shows the numerical preci-
sion, the run time, and the hardware unit that produced the
results. Exposing each of these details is important because
the many execution paths in Figure 7 can drastically affect
a device’s performance.
Figure 6: Results from the first MLPerf Mobile round show that no one solution fits all tasks: (a) image classification, (b) object detection (SSD-MobileNet v2), (c) semantic segmentation (DeepLab v3 + MobileNet v2), and (d) natural-language processing (MobileBERT). For each of MediaTek, Samsung, and Qualcomm, the bars correspond to throughput (left y-axis, frames or samples per second) and the line graph corresponds to latency (right y-axis, ms).
5.3 Execution Diversity
Mobile-device designers prefer INT8 or FP16 format be-
cause quantized inference runs faster and provides better
performance and memory bandwidth than FP32 [34]. The
accuracy tradeoff for quantized models (especially since no
retraining is allowed) is tolerable in smartphones, which
seldom perform safety-critical tasks, such as those in au-
tonomous vehicles (e.g., detecting pedestrians).
All the mobile-vision tasks employ INT8 heavily. Most
vendors rely on INT8 because it enables greater perfor-
mance and consumes less power, preserving the device’s
battery life. NLP favors FP16. Although this format re-
quires more power than INT8, it offers better accuracy. Per-
haps more importantly, submitters use FP16 because most
AI engines today lack efficient support for non-vision tasks.
The GPU is a good balance between flexibility and effi-
ciency. Unsurprisingly, therefore, all vendors submitted re-
sults that employed GPUs with FP16 precision for NLP.
NNAPI is designed to be a common baseline for machine
learning on Android devices and to distribute that workload
across ML-processor units, such as CPUs, GPUs, DSPs,
and NPUs. But nearly all submissions in Table 2 use pro-
prietary frameworks. These frameworks, such as ENN and
SNPE, give SoC vendors more control over their product’s
performance. For instance, they can control which proces-
sor (CPU, GPU, DSP, NPU, for example) to use and what
optimizations to apply.
All laptop submissions employ INT8 and achieve the de-
sired accuracy on vision and language models. In single-stream mode, because only a single sample is available per query, some models cannot fully utilize the computational resources of the GPU. Therefore, the backend must select be-
tween the CPU and GPU to deliver the best overall perfor-
mance. For example, smaller models such as MobileNetEd-
geTPU use the CPU. For the offline mode, multiple samples
are available as a single query, so inference employs both
the CPU and GPU.
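The device-selection heuristic described above can be sketched as follows (the policy and threshold are entirely hypothetical; actual laptop back ends such as OpenVINO apply their own scheduling logic):

```python
def pick_devices(batch_size, model_params_millions, small_model_cutoff=10.0):
    """Hypothetical dispatch: a small model fed single-sample queries
    cannot saturate the GPU, so it runs on the CPU; large batched
    (offline) queries can occupy both the CPU and the GPU."""
    if batch_size == 1 and model_params_millions < small_model_cutoff:
        return ["cpu"]
    return ["cpu", "gpu"]

single_stream_target = pick_devices(batch_size=1, model_params_millions=4.0)
offline_target = pick_devices(batch_size=24576, model_params_millions=4.0)
```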
Finally, there is hardware diversity. Table 2 shows a variety of
hardware combinations that achieve good performance on
all MLPerf Mobile AI tasks. In one case, the CPU is the
backbone, orchestrating overall execution, including preprocessing and other tasks the benchmark does not measure. In contrast, the GPU, DSPs, NPUs, and AIPs deliver high-performance AI execution.

Figure 7: Potential code paths (dashed lines) and actual submitted code paths (solid lines) for producing MLPerf Mobile AI-performance results. NPU is the neural processing unit from Samsung. The Hexagon Tensor Accelerator (HTA) and Hexagon Vector Extensions (HVX) are part of the Qualcomm DSP and can be used either individually or together.
5.4 Summary
The MLPerf results provide transparency into the perfor-
mance results, which show how SoC vendors achieve their
best performance on a range of tasks. Figure 7 and Table 2
reveal substantial differences in how AI systems perform on
the different devices. Awareness of such underlying varia-
tions is crucial because the measured performance should
match what end users experience, particularly on commer-
cially available devices.
Finally, since the benchmark models represent diverse
tasks, and since MLPerf Mobile collects results over a sin-
gle long run that covers all of these models, it strongly curbs
domain-specific framework optimizations. Furthermore,
the benchmarked mobile devices are commonly available
and the testing conditions ensure a realistic experimental
setup, so the results are attainable in practice and repro-
ducible by others.
6 Consumer, Industry, Research Value
Measuring mobile AI performance in a fair, repro-
ducible, and useful manner is challenging but not in-
tractable. The need for transparency stems from the massive hardware and software diversity, which is often tightly coupled with the intricacies of deployment scenarios, developer options, OEM life cycles, and so on.
MLPerf Mobile focuses on transparency for consumers
by packaging the submitted code into an app. Figure 8a
shows the MLPerf Mobile startup screen. With a simple tap
on the “Go” button, the app runs all benchmarks by default,
following the prescribed run rules (Figure 8b), and clearly
displays the results. It reports both performance and accu-
racy for all benchmark tasks (Figure 8c) and permits the
user to view results for each one (Figure 8d). Furthermore,
the configuration that generates the results is also transpar-
ent (Figure 8e). The application currently runs on Android,
though future versions will likely support iOS as well.
We believe that analysts, OEMs, academic researchers,
neural-network-model designers, application developers,
and smartphone users can all gain from result transparency.
We briefly summarize how the app benefits each one.
Application developers. MLPerf Mobile shows appli-
cation developers what real-world performance may look
like on the device. For application developers, we expect
the benchmark provides insight into the software frame-
works on the various “phones” (i.e., SoCs). More specifically, it can help them quickly identify the optimal solution for a given platform. For application developers who deploy their products “into the wild,” the benchmark and the various machine-learning tasks offer perspective on
Task:     Image Classification | Image Classification (offline) | Object Detection | Image Segmentation | Natural-Language Processing
Data set: ImageNet | ImageNet | COCO | ADE20K | SQuAD
Model:    MobileNetEdge | MobileNetEdge | SSD-MobileNet v2 | DeepLab v3+ with MobileNet v2 | MobileBERT
Back end: NNAPI (neuron-ann) | NNAPI (neuron-ann) | NNAPI (neuron-ann) | TFLite delegate | TFLite delegate

Table 2: Implementation details for the results presented in Figure 7. The table shows the myriad combinations of numerical formats, software run times, and hardware backend targets that are possible, which reinforces the need for result transparency.
the end-user experience for a real application.
OEMs. MLPerf Mobile standardizes the benchmark-
ing method across different mobile SoCs. All SoC ven-
dors employ the same tasks, models, data sets, metrics,
and run rules, making the results comparable and repro-
ducible. Given the hardware ecosystem’s vast heterogene-
ity, the standardization that our benchmark provides is vital.
Model designers. MLPerf Mobile makes it easy to
package new models into the mobile app, which organi-
zations can then easily share and reproduce. The app
framework, coupled with the underlying LoadGen, allows
model designers to test and evaluate the model’s perfor-
mance on a real device rather than using operation counts
and model size as heuristics to estimate performance. This
feature closes the gap between model designers and hard-
ware vendors—groups that have thus far failed to share in-
formation in an efficient and effective manner.
Mobile users. The average end user wants to make
informed purchases. For instance, many want to know
whether upgrading their phone to the latest chipset will
meaningfully improve their experience. To this end,
they want public, accessible information about various
devices—something MLPerf Mobile provides. In addi-
tion, some power users want to measure their device’s per-
formance and share that information with performance-
crowdsourcing platforms. Both are important reasons for
having an easily reproducible mechanism for measuring
mobile AI performance.
Academic researchers. Reproducibility is a challenge
for state-of-the-art technologies. We hope researchers em-
ploy our mobile-app framework to test their methods and
techniques for improving model performance, quality, or
both. The framework is open source and freely accessi-
ble. As such, it can enable academic researchers to integrate
their optimizations and reproduce more recent results from
the literature.
Technical analysts. MLPerf Mobile provides repro-
ducibility and transparency for technical analysts, who of-
ten strive to make “apples-to-apples” comparisons. The ap-
plication makes it easy to reproduce vendor-claimed results
as well as to interpret the results, because it shows how the
device achieves a particular performance number and how
it is using the hardware accelerator.
7 Related Work
There are many ongoing efforts in mobile AI perfor-
mance benchmarking. We describe the prior art in mobile
and ML benchmarking and emphasize how MLPerf Mobile
differs from these related works.
Android Machine Learning Test Suite (MLTS).
MLTS, part of the Android Open Source Project (AOSP)
source tree, provides benchmarks for NNAPI drivers [16]. It
is mainly for testing the accuracy of vendor NNAPI drivers.
MLTS includes an app that allows a user to test the latency
and accuracy of quantized and floating-point TFLite models (e.g., MobileNet and SSD-MobileNet) against a 1,500-image subset of the Open Images Dataset v4 [40]. Further statistics, including latency distributions, are also available.

Figure 8: MLPerf Mobile app on Android: (a) startup screen; (b) running the benchmarks; (c) reporting results; (d) run details; (e) configuration settings.
Xiaomi’s Mobile AI Benchmark. Xiaomi provides
an open-source end-to-end benchmark tool for evaluating
model accuracy and latency [13]. In addition to a command-
line utility to run the benchmarks on a user device, the
tool includes a daily performance-benchmark run for var-
ious neural-network models (mostly on the Xiaomi Redmi
K30 Pro). The tool has a configurable backend that allows
users to employ multiple ML-hardware-delegation frame-
works (including MACE, SNPE, and TFLite).
TensorFlow Lite. TFLite provides a command-line
benchmark utility to measure the latency of any TFLite
model [24]. A wrapper APK is also available to reference
how these models perform when embedded in an Android
application. Users can select the NNAPI delegate and they
can disable NNAPI in favor of a hardware-offload backend.
For in-depth performance analysis, the benchmark supports
timing of individual TFLite operators.
AI-Benchmark. Ignatov et al. [37] performed an exten-
sive evaluation of machine-learning performance on mobile
systems with AI acceleration, using HiSilicon, MediaTek,
Qualcomm, Samsung, and UniSoc chipsets. They evaluated
21 deep-learning tasks using 50 metrics, including inference
speed, accuracy, and stability. The authors reported the re-
sults of their AI-Benchmark app for 100 mobile SoCs. The
benchmark runs preselected models of various bit widths
(INT8, FP16, and FP32) on the CPU and on open-source or
vendor-proprietary TFLite delegates. Performance-report updates appear on the AI-Benchmark website [1] after each major release of TFLite/NNAPI and after the launch of new SoCs with AI acceleration.
AImark. Master Lu (Ludashi) [2], a closed-sourced
Android and iOS application, uses vendor SDKs to im-
plement its benchmarks. It comprises image-classification,
image-recognition, and image-segmentation tasks, includ-
ing models such as ResNet-34 [35], Inception V3 [56],
SSD-MobileNet [36, 43], and DeepLab v3+ [30]. The
benchmark judges mobile-phone AI performance by eval-
uating recognition efficiency and provides a line-test score.
Aitutu. A closed-source application [3, 8], Aitutu em-
ploys Qualcomm’s SNPE, MediaTek’s NeuroPilot, HiSil-
icon’s Kirin HiAI, Nvidia’s TensorRT, and other vendor
SDKs. It implements image classification based on the
Inception V3 neural network [56], using 200 images as
test data. The object-detection model is based on SSD-
MobileNet [36, 43], using a 600-frame video as test data.
The score is a measure of speed and accuracy—faster re-
sults with higher accuracy yield a greater final score.
Geekbench. Primate Labs created Geekbench [20, 6],
a cross-platform CPU-compute benchmark that supports
Android, iOS, Linux, macOS, and Windows. The Geek-
bench 5 CPU benchmark features new applications, includ-
ing augmented reality and machine learning, but it lacks
heterogeneous-IP support. Users can share their results by
uploading them to the Geekbench Browser.
UL Procyon AI Inference Benchmark. UL Benchmarks, which produced PCMark, 3DMark, and VRMark, offers an Android NNAPI CPU- and GPU-focused AI benchmark [25, 26]. The professional benchmark suite UL Procyon compares only NNAPI implementations and their compatibility on floating-point- and integer-optimized models. It contains MobileNet v3 [28], Inception v4 [56],
SSDLite MobileNet v3 [28, 43], DeepLab v3 [30], and
other models. It also attempts to test custom CNN mod-
els but uses an AlexNet [39] architecture to test basic op-
erations. The application provides benchmark scores, per-
formance charts, hardware monitoring, model output, and
device rankings.
Neural Scope. National Chiao Tung University [17, 18]
developed an Android NNAPI application supporting FP32
and INT8 precisions. The benchmarks comprise object
classification, object detection, and object segmentation,
including MobileNet v2 [51], ResNet-50 [35], Inception
v3, SSD-MobileNet [36, 43], and ResNet-50 with atrous-
convolution layers [29]. Users can run the app on their
mobile devices and immediately receive a cost-performance report.
8 Future Work
The first iteration of the MLPerf Mobile benchmark fo-
cused on the foundations. On the basis of these fundamen-
tals, its scope can easily expand. The following are areas of
future work:
iOS support. A major area of interest for MLPerf Mo-
bile is to develop an iOS counterpart for the first-generation
Android app. Apple’s iOS is a major AI-performance player that adds further hardware and software diversity beyond Android.
Measuring software frameworks. Most AI bench-
marks focus on AI-hardware performance. But as we de-
scribed in Section 2, software performance—and, more im-
portantly, its capabilities—is crucial to unlocking a device’s
full potential. To this end, enabling apples-to-apples com-
parison of software frameworks on a fixed hardware plat-
form has merit. The backend code path in Figure 5 (code
path 1) is a way to integrate different machine-learning
frameworks in order to determine which one achieves the
best performance on a target device.
Expanding the benchmarks. An obvious area of im-
provement is expanding the scope of the benchmarks to in-
clude more tasks and models, along with different quality
targets. Examples include additional vision tasks, such as
super resolution, and speech models, such as RNN-T.
Rolling submissions. The mobile industry is growing
and evolving rapidly. New devices arrive frequently, of-
ten in between MLPerf calls for submissions. MLPerf Mo-
bile therefore plans to add “rolling submissions” in order to
encourage vendors to submit their MLPerf Mobile scores
continuously. Doing so would allow smartphone makers to
more consistently use the benchmark to report the AI per-
formance of their latest devices.
Power measurement. A major area of potential im-
provement for MLPerf Mobile is power measurement.
Since mobile devices are battery constrained, evaluating
AI’s power draw is important.
To make additional progress, we need community in-
volvement. We therefore encourage the broader mobile
community to join the MLPerf effort and maintain the momentum behind an industry-standard open-source mobile benchmark.
9 Conclusion
Machine-learning inference has many potential applica-
tions. Building a benchmark that encapsulates this broad
spectrum is challenging. In this paper, we focused on smart-
phones and the mobile-PC ecosystem, which is rife with
hardware and software heterogeneity. Coupled with the
life-cycle complexities of mobile deployments, this hetero-
geneity makes benchmarking mobile AI performance over-
whelmingly difficult. To bring consensus, we developed the
MLPerf Mobile AI inference benchmark. Many leading or-
ganizations have joined us in building a unified benchmark
that meets competing organizations’ disparate needs. The
unique value of MLPerf Mobile is not so much in the bench-
marks, rules, and metrics. Instead, it is in the value that the
industry creates for itself, benefiting everyone.
MLPerf Mobile provides an open source, out-of-the-
box inference-throughput benchmark for popular computer-
vision and natural-language-processing applications on mo-
bile devices, including smartphones and laptops. It can
serve as a framework to integrate future models, as the un-
derlying framework is independent of the top-level model
and of data-set changes. The app and the integrated Load
Generator allow us to evaluate a variety of situations, such
as changing the quality thresholds for overall system per-
formance. The app can also serve as a common platform
for comparing different machine-learning frameworks on
the same hardware. Finally, the suite allows for fair and
faithful evaluation of heterogeneous hardware, with fully reproducible results.

Acknowledgments

The MLPerf Mobile team would like to acknowledge sev-
eral people for their effort. In addition to the team that archi-
tected the benchmark, MLPerf Mobile is the work of many
individuals that also helped produce the first set of results.
Arm: Ian Forsyth, James Hartley, Simon Holland, Ray
Hwang, Ajay Joshi, Dennis Laudick, Colin Osborne, and
Shultz Wang.
dviditi: Anton Lokhmotov.
Google: Bo Chen, Suyog Gupta, Andrew Howard, and
Jaeyoun Kim.
Harvard University: Yu-Shun Hsiao.
Intel: Thomas Baker, Srujana Gattupalli, and Maxim
MediaTek: Kyle Guan-Yu Chen, Allen Lu, Ulia Tseng,
and Perry Wang.
Qualcomm: Mohit Mundhra.
Samsung: Dongwoon Bai, Stefan Bahrenburg, Jihoon
Bang, Long Bao, Yoni Ben-Harush, Yoojin Choi, Fang-
ming He, Amit Knoll, Jaegon Kim, Jungwon Lee, Sukhwan
Lim, Yoav Noor, Muez Reda, Hai Su, Zengzeng Sun,
Shuangquan Wang, Maiyuran Wijay, Meng Yu, and George
Xored: Ivan Osipov and Daniil Efremo.

References

[1] AI-Benchmark.
[2] AImark.
[3] Antutu Benchmark.
[4] Big.LITTLE.
[5] Deploy High-Performance Deep Learning Inference.
[6] Geekbench.
[7] Google Play.
[8] Is Your Mobile Phone Smart? Antutu AI Benchmark
Public Beta Is Released.
[9] LoadGen.
[10] MediaTek Dimensity 820. https://www.mediatek.
[11] MLPerf.
[12] MLPerf Mobile v0.7 Results.
[13] Mobile AI Bench.
mobile-ai- bench.
[14] Mobile Processor Exynos 990. https://www.
[15] Neural Networks API. https://developer.
[16] Neural Networks API Drivers. https://source.
[17] NeuralScope Mobile AI Benchmark Suite. https:
[18] Neuralscope offers you benchmarking your AI solutions.
[19] NeuroPilot. https://neuropilot.mediatek.
[20] Primate Labs.
[21] Samsung Neural SDK. https://developer.
[22] Snapdragon 865+ 5G Mobile Platform.
snapdragon-865-plus-5g-mobile-platform.
[23] Snapdragon Neural Processing Engine SDK. https:
[24] TensorFlow Lite.
[25] UL Benchmarks.
[26] UL Procyon AI Inference Benchmark.
ai-inference-benchmark.
[27] Willow Cove - Microarchitectures - Intel.
[28] Andrew Howard, Suyog Gupta. Introducing
the Next Generation of On-Device Vision Models:
MobileNetV3 and MobileNetEdgeTPU.
introducing-next-generation-on-device.
[29] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos,
Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic im-
age segmentation with deep convolutional nets, atrous con-
volution, and fully connected crfs, 2017.
[30] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Flo-
rian Schroff, and Hartwig Adam. Encoder-decoder with
atrous separable convolution for semantic image segmenta-
tion, 2018.
[31] Andrew M. Dai and Quoc V. Le. Semi-supervised sequence
learning, 2015.
[32] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina
Toutanova. Bert: Pre-training of deep bidirectional trans-
formers for language understanding, 2019.
[33] David Eigen and Rob Fergus. Predicting depth, surface nor-
mals and semantic labels with a common multi-scale convo-
lutional architecture, 2015.
[34] Song Han, Huizi Mao, and William J Dally. Deep com-
pression: Compressing deep neural networks with pruning,
trained quantization and huffman coding. arXiv preprint
arXiv:1510.00149, 2015.
[35] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition, 2015.
[36] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry
Kalenichenko, Weijun Wang, Tobias Weyand, Marco An-
dreetto, and Hartwig Adam. Mobilenets: Efficient convolu-
tional neural networks for mobile vision applications, 2017.
[37] Andrey Ignatov, Radu Timofte, Andrei Kulik, Seungsoo
Yang, Ke Wang, Felix Baum, Max Wu, Lirong Xu, and Luc
Van Gool. Ai benchmark: All about deep learning on smart-
phones in 2019. In 2019 IEEE/CVF International Confer-
ence on Computer Vision Workshop (ICCVW), pages 3617–
3635. IEEE, 2019.
[38] W. Kim and J. Seok. Indoor semantic segmentation for robot
navigating on mobile. In 2018 Tenth International Confer-
ence on Ubiquitous and Future Networks (ICUFN), pages
22–25, 2018.
[39] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton.
Imagenet classification with deep convolutional neural net-
works. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q.
Weinberger, editors, Advances in Neural Information Pro-
cessing Systems 25, pages 1097–1105. Curran Associates,
Inc., 2012.
[40] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Ui-
jlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan
Popov, Matteo Malloci, Alexander Kolesnikov, and et al. The
open images dataset v4. International Journal of Computer
Vision, 128(7):1956–1981, Mar 2020.
[41] Chien-Hung Lin, Chih-Chung Cheng, Yi-Min Tsai, Sheng-
Je Hung, Yu-Ting Kuo, Perry H Wang, Pei-Kuei Tsung,
Jeng-Yun Hsu, Wei-Chih Lai, Chia-Hung Liu, et al. 7.1 a
3.4-to-13.3 tops/w 3.6 tops dual-core deep-learning acceler-
ator for versatile ai applications in 7nm 5g smartphone soc.
In 2020 IEEE International Solid-State Circuits Conference
(ISSCC), pages 134–136. IEEE, 2020.
[42] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir
Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva
Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft
COCO: Common objects in context, 2015.
[43] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian
Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C.
Berg. Ssd: Single shot multibox detector. Lecture Notes
in Computer Science, page 21–37, 2016.
[44] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully
convolutional networks for semantic segmentation, 2015.
[45] Natalia Neverova, Pauline Luc, Camille Couprie, Jakob J.
Verbeek, and Yann LeCun. Predicting deeper into the future
of semantic segmentation. CoRR, abs/1703.07684, 2017.
[46] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gard-
ner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer.
Deep contextualized word representations, 2018.
[47] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya
Sutskever. Improving language understanding by generative
pre-training, 2018.
[48] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and
Percy Liang. Squad: 100,000+ questions for machine com-
prehension of text, 2016.
[49] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun.
Faster r-cnn: Towards real-time object detection with region
proposal networks, 2016.
[50] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, San-
jeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy,
Aditya Khosla, Michael Bernstein, Alexander C. Berg, and
Li Fei-Fei. ImageNet Large Scale Visual Recognition Chal-
lenge. International Journal of Computer Vision (IJCV),
115(3):211–252, 2015.
[51] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zh-
moginov, and Liang-Chieh Chen. Mobilenetv2: Inverted
residuals and linear bottlenecks, 2019.
[52] Jamie Sherrah. Fully convolutional networks for dense se-
mantic labelling of high-resolution aerial imagery, 2016.
[53] Mennatullah Siam, Sara Elkerdawy, Martin Jagersand, and
Senthil Yogamani. Deep semantic segmentation for auto-
mated driving: Taxonomy, roadmap and challenges. In 2017
IEEE 20th international conference on intelligent trans-
portation systems (ITSC), pages 1–8. IEEE, 2017.
[54] G. Sun and H. Lin. Robotic grasping using semantic seg-
mentation and primitive geometric model based 3d pose es-
timation. In 2020 IEEE/SICE International Symposium on
System Integration (SII), pages 337–342, 2020.
[55] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yim-
ing Yang, and Denny Zhou. Mobilebert: a compact task-
agnostic bert for resource-limited devices, 2020.
[56] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe,
Jonathon Shlens, and Zbigniew Wojna. Rethinking the in-
ception architecture for computer vision, 2015.
[57] Saeid Asgari Taghanaki, Kumar Abhishek, Joseph Paul Co-
hen, Julien Cohen-Adad, and Ghassan Hamarneh. Deep se-
mantic segmentation of natural and medical images: A re-
view, 2020.
[58] Xavier Vera. Inside tiger lake: Intel’s next generation mobile
client cpu. In 2020 IEEE Hot Chips 32 Symposium (HCS),
pages 1–26. IEEE Computer Society, 2020.
[59] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela
Barriuso, and Antonio Torralba. Scene parsing through
ade20k dataset. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 633–641,
2017.