©2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including
reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any
copyrighted component of this work in other works.
Estimation of Non-Functional Properties for
Embedded Hardware with Application to Image
Processing
Christian Herglotz, Jürgen Seiler, André Kaup
Multimedia Communications and Signal Processing
Friedrich-Alexander University Erlangen-Nürnberg (FAU),
Cauerstr. 7, 91058 Erlangen, Germany
{christian.herglotz, juergen.seiler, andre.kaup}@fau.de
Arne Hendricks, Marc Reichenbach, Dietmar Fey
Chair of Computer Architecture
Friedrich-Alexander University Erlangen-Nürnberg (FAU),
Martensstr. 3, 91058 Erlangen, Germany
{arne.hendricks, marc.reichenbach, dietmar.fey}@cs.fau.de
Abstract—In recent years, the growing demand for portable devices, which offer only limited processing capacity and battery power, has increased the need for energy- and time-efficient hard- and software solutions. Preliminary estimations of
time and energy consumption can thus be valuable to improve
implementations and design decisions. To this end, this paper
presents a method to estimate the time and energy consumption
of a given software solution, without having to rely on the use of a
traditional Cycle Accurate Simulator (CAS). Instead, we propose
to utilize a combination of high-level functional simulation with
a mechanistic extension to include non-functional properties:
Instruction counts from virtual execution are multiplied with
corresponding specific energies and times. By evaluating two com-
mon image processing algorithms on an FPGA-based CPU, where
a mean relative estimation error of 3% is achieved for cacheless
systems, we show that this estimation tool can be a valuable aid
in the development of embedded processor architectures. The
tool allows the developer to reach well-suited design decisions
regarding the optimal processor hardware configuration for a
given algorithm at an early stage in the design process.
I. INTRODUCTION
Recently, the demand for portable devices capable of performing highly complex image processing tasks has increased rapidly. Examples are cameras, smartphones, and tablet PCs that can capture images and videos in real time. Furthermore, consumers like to process their data directly to enhance image quality, compress videos and pictures, or perform other picture-manipulation tasks. Due to their highly complex nature, these tasks are time and energy consuming, which reduces the operating time of the battery significantly. Hence, it is desirable to develop energy efficient and fast
running image processing software. For the developer, a major problem in software design is that obtaining these non-functional properties of an application requires complex test setups. For example, to obtain the energy consumption of a solution, the code has to be compiled, executed on the target platform, and measured using a power meter. Only then is it possible to decide whether a solution is energy efficient.
To overcome the complex task of measurement, we propose to perform a simulation on a virtual platform that allows estimating the required energy and time for the target processing platform. This facilitates predictions on virtual platforms and allows a very precise estimation. As we do not model the cache system in this paper, we chose two image processing algorithms as applications. These algorithms (video decoding and signal extrapolation) show a highly linear processing order with very high locality, such that cache misses play a minor role during execution. The latter algorithm shows a highly homogeneous processing flow that repeatedly uses the same methods, whereas the former incorporates highly heterogeneous functions that are called in an unpredictable manner. Hence, we believe that they represent typical algorithms used on portable, battery-constrained devices.
By using the open-access platform OVP (Open Virtual Platform) by Imperas, it is possible to simulate the execution of such a process on any CPU of interest. During the simulation, we count how often each instruction is executed. Multiplying these instruction counts with instruction-specific energies and times, we can estimate the complete processing time and energy of the written code on the desired target platform. In our approach, these specific energies and times are measured beforehand using a predefined set of specialized executable kernels. By evaluating the influence of an FPU on the chip area, processing time, and energy of the two image processing algorithms, we show that our model can help the developer choose a suitable processing platform for his application. In this contribution, we show that this approach is valid for a cacheless, re-configurable, soft intellectual property (IP) CPU on an FPGA; further work aims at generalizing this concept to allow the estimation of any CPU.
In this paper, we build upon the work presented in [4]. We
augment this concept by introducing the FPU into the model,
showing the viability of this approach for an extended set of
test cases, and giving an example for a concrete application.
The paper is organized as follows: Section II presents an
overview of existing approaches as well as a classification.
Section III introduces the virtual platform we used to simulate
the behavior of the processor. Subsequently, based on the
simulation, the general model is presented for estimating
processing time and energy in Section IV. Then, Section V
explains the measurement setup and how the energies and times for a single instruction as well as for the complete image processing algorithms are determined. Finally, Section
VI introduces two showcase image processing algorithms,
evaluates the estimation accuracy for both these cases and
shows how design decisions can be made.
II. RELATED WORK
Simulations are crucial tasks when developing hard- and
software systems. Depending on the abstraction level, different
simulation methods exist. A general goal is to simulate as abstractly as possible (to enable fast simulation) but as accurately as needed (to yield the desired properties). Therefore, we discuss several simulation approaches for microarchitectures and how to combine them to obtain very accurate results for non-functional properties like energy and time within a short simulation time.
The most exact results can be achieved by cycle-accurate simulators (CAS), such as simulations on the hardware description level. These could be RTL (Register-Transfer Level) simulations or gate-level simulations, where a CAS simulates each clock cycle within a microarchitecture. By counting glitches and transitions, very accurate power estimations can be achieved. Moreover, by multiplying the simulated clock cycles with the clock frequency, the exact execution time can be determined. Unfortunately, CAS leads to very slow simulation speeds because the whole architecture has to be simulated. Typical examples include Mentor ModelSim, Synopsys
VCS and Cadence NCSim.
One possibility to speed up the process is to use SystemC and TLM (Transaction-Level Modeling). By applying the method of approximately timed models, as described in the TLM 2.0 standard, simulations run faster because clock cycles are combined into so-called phases. However, in contrast to the CAS simulations, the counted times and transitions are less exact. A simulator which follows this paradigm is the SoCLib library [8]. Moreover, frameworks such as Power-Sim have been developed to speed up the simulation time by providing quasi cycle-accurate simulation.
On a higher abstraction level, simulation tools like Gem5
combined with external models like Orion or McPAT [5] can
measure both time and energy to a certain extent, while at
the same time sacrificing simulation performance: Complex
applications like image processing can take up to several days
until the simulation is finished. Thus, if only functional properties such as the correctness of the algorithm, its results, or its completion are of interest, an instruction-accurate simulator is needed in order to speed up the simulation process.
Instruction set simulators (ISS) do not simulate the exact
architecture, but rather “interpret” the binary executable and
manage internal registers of the architecture. Typically, they
require the least simulation time, but on the other hand do
not include non-functional results such as processing time and
energy consumption [14]. They are usually used to emulate systems that are not physically present, as a debugger, or as a development platform for embedded software to obtain functional properties. OVPsim can be mentioned as an example of this
simulator class.
The discussed approaches show that a very abstract description level usually goes hand-in-hand with less information being retrievable from simulation. One way to overcome this problem is to execute on a very abstract level and to compensate for inaccurate results by applying a mathematical or statistical model in order to obtain relatively accurate estimations.
A typical example for this approach is presented by Carlson
et al.: The Sniper simulator. According to [6], it is a simulator
for x86 architectures such as Intel Nehalem and Intel Core2.
For an interval model [7], where an interval is a series of instructions, possible cache and branch misses (due to interaction) are estimated.

Fig. 1. Different simulation tools (algorithm, ISS (OVP), Gem5, SystemC, cycle-accurate simulators, real hardware + application; our work extends the ISS with an energy/time model, a processor model, and single measurements) regarding simulation speed (left) and estimation accuracy (right) with respect to non-functional properties. All the tools provide functional parameters; non-functional properties cannot be obtained by the ISS or the algorithm alone.

With the help of this information,
homogeneous and heterogeneous desktop multi-core architec-
tures are simulated. This is helpful when modeling rather complicated desktop architectures that include out-of-order execution, latency hiding, and varying levels of parallelism [9]. While they are able to simulate multi-program workloads and multi-threaded applications, one drawback is the relative error, which rises up to 25%; this is acceptable for energy-uncritical desktop applications and multi-program simulation, but unacceptable when trying to find optimal design choices for energy-aware embedded hardware. Furthermore, Sniper is solely focused on Intel desktop architectures, which makes its use for general embedded hardware simulation infeasible.
An illustration of the presented simulation layers (including our extension of an ISS), ranked by their respective simulation speed as well as the accuracy of the resulting non-functional parameter estimations, can be seen in Figure 1. Summarizing, we propose to estimate the processing
time and energy using an extended ISS with only slightly
increased simulation times. Therefore, we chose an existing
tool that features a wide variety of embedded architectures,
the OVP framework. Efforts have been undertaken to find
timing models for CPUs modeled in OVP in the past [13], where the authors extended parts of a framework with pseudo-cycle-accurate timing models. Using a watchdog component, they integrated an assembly parser and a hash table with pre-characterized groups of instructions and timing information. The component then analyzes every instruction based upon a disassembly by OVP and assigns suitable timing information. Unfortunately, simulation run-time is poor due to the external analysis of the instructions via disassembly. With regard to power modeling, Shafique et al. [17] proposed an
adaptive management system for dynamically re-configurable
processors that considers leakage and dynamic energy. Our
approach is different as our emphasis is not on a low-level
estimation including register-transfer level (RTL) techniques
such as power-gating or instruction set building (often leading
to a long simulation run-time), but rather on a fast high-level
mechanistic simulation. Further work regarding power and time consumption has also been done by Grüttner et al. in [11].
Their focus, however, lies on rapid virtual system prototyping
of SoCs using C/C++ generated virtual executable prototypes
utilizing code annotation.
In order to motivate our approach we first explain the
fundamentals of a mechanistic simulation [7], which can be
employed when using an ISS. Mechanistic simulation means
simulating bottom up, starting from a basic understanding of
an architecture. In this context, the source of this understanding
is unimportant; thus, it can also be achieved by an empirical training phase, which makes the approach suitable for proprietary IPs whose underlying details are only partially known. This can be done by running measurements or experiments on actual hardware, in our case measuring the processing energy and time of instructions. The resulting data is then prepared to be used in the simulation model; preparation can include regression and fitting of parameters. A mechanistic simulation is then run on a typical set of instructions using constant costs per instruction. In a very early work, this concept was analyzed by Tiwari et al. [18], who measured the current a processor consumes when a certain instruction is executed. Our approach is presented in the next section.
III. INSTRUCTION SET SIMULATOR: VIRTUAL PLATFORMS
As described in the section before, in this paper we want to
combine an ISS with an additional model to get non-functional
properties such as energy consumption and processing time. To
achieve this, we discuss the ISS in detail and explain how it
is modified for our purposes.
Open Virtual Platform (OVP) is a functional simulation envi-
ronment which provides very fast simulations even for complex
applications. Moreover, this flexible simulation environment is easy to use because, once compiled, applications can be run on both the real hardware and the OVP simulator without additional annotations to the program. The fast simulation run-times even for complex applications are possible because OVP simulates instruction-accurate, not cycle-accurate, processing. The user can debug the simulation and will know at any point, e.g., the contents of the registers and the program counter, but not the current state of the processor pipeline. Analyzing non-functional properties like energy consumption or execution run-time is therefore not natively possible [2].
To run a simulation using OVP, two things are necessary: first, a so-called platform model, which denotes the current hardware platform, e.g., a CPU and a memory; second, the application has to be provided as a binary executable file (the kernel). For
a fast start in OVP, several processor and peripheral models
are included, where some of them are open source (e.g.,
the OR1K processor model). These models are dynamically
linked at run-time to the simulator, called OVPsim. Today, there are a number of different manufacturers of embedded processors on the market, such as ARM or Intel. Because
they are Hard-IP (intellectual property), these processors have
the disadvantage that they cannot be individualized to the needs
of the hardware designer, e.g., degree of parallelism, cache size
or cache replacement strategy. Fortunately, there are also some
open source Soft-IP processors available, like the LEON3
processor [1]. This processor can be edited individually to the
needs of the hardware designer. The LEON3 implements the
SPARC V8 processor architecture, which is available as VHDL
source code. The different configurations of the processor can
be synthesized and tested on an FPGA.
Fig. 2. Simulation workflow of an instruction in OVP: the 32-bit machine code is analyzed by the decoder, which produces a decode entry (e.g., SPARC_ADD_REGISTER); the disassembler provides the textual form ("add %g2, %g4, %g1") for debugging output, and the morpher generates the native code that the simulator executes.
For our work, we have chosen the LEON3 processor because
of its configurability. It is easier to analyze energy and time
if unnecessary components can be disabled to allow a well
defined measurement environment. Moreover, the availability
as an open source soft IP core allows easy debugging. As
previously no SPARC V8 processor model was available, we developed a new, complete processor model [3]. In this way, we reach the same level of flexibility and configuration possibilities in simulation as with the processor on real hardware. For this, the C API provided by OVP for implementing and extending custom processor models was used.
The general simulation flow of a single instruction in our
SPARC V8 processor model is visualized in Fig. 2. First,
the decoder analyzes the 32-bit instruction for patterns and
decides what kind of instruction it is. Then, the instruction
receives an internal tag which is used for representation
in the disassembler and morpher. The disassembler includes
functions for simulation output if the user wishes to debug
the simulated instructions. The morpher part of the processor
model generates native code for the simulator to execute. These
functions represent what the simulator should do. E.g., an
arithmetic operation extracts the source registers, reads the
values of these registers, executes the operation, and saves the
result to the target register. Moreover, depending on the kind
of arithmetic operation, the internal ALU state can be changed
to implement further instructions like branches.
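To make this concrete, the following sketch illustrates in plain C++ what the generated code for a register-register add conceptually does. The register-file structure and function names are our own illustration; the actual morpher emits native simulator code rather than calling C++ functions, and register windowing is omitted.

// Illustrative semantics of a simulated "add rs1, rs2, rd" instruction (sketch).
#include <cstdio>
#include <cstdint>

struct RegisterFile {
    uint32_t r[32];      // simplified SPARC V8 integer registers (no windowing)
    bool icc_z, icc_n;   // subset of the ALU condition codes (zero, negative)
};

// Extract the source registers, read their values, execute the operation,
// and save the result to the target register; flag-setting variants (addcc)
// additionally update the internal ALU state used by branches.
void executeAddRegister(RegisterFile& rf, int rs1, int rs2, int rd, bool setFlags)
{
    uint32_t result = rf.r[rs1] + rf.r[rs2];
    rf.r[rd] = result;
    if (setFlags) {
        rf.icc_z = (result == 0);
        rf.icc_n = (result >> 31) != 0;
    }
}

int main()
{
    RegisterFile rf = {};
    rf.r[2] = 40;                           // %g2
    rf.r[4] = 2;                            // %g4
    executeAddRegister(rf, 2, 4, 1, false); // mimics "add %g2, %g4, %g1" from Fig. 2
    std::printf("%%g1 = %u\n", rf.r[1]);
    return 0;
}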
As writing a morpher function for every possible instruction is highly complex, instructions were grouped and morphing functions were combined. E.g., arithmetic instructions like add or sub and their variants (analyzing flags, setting flags) form one
group. Because of different data manipulation, register-register
and register-immediate instructions had their own group, e.g.,
arithmetic-register-register instructions and arithmetic-register-
immediate instructions. Figure 3 shows a visualization of this
grouping.
The methods described above were used to get a fully
functional simulation environment. To enable the estimation
of non-functional properties, the functional simulation has to
be extended. On real hardware, not all instructions have the
same data or control path in the processor, e.g., a floating
point operation is much more complex, needs more cycles
and therefore more run-time than a simple integer instruction.
Thus, all instruction groups are further divided into categories
like integer, floating point, jumps, etc.

Fig. 3. Allocation from decode entries to morph functions, which create native code for the simulator: the register-type entries SPARC_ADD_REGISTER, SPARC_SUB_REGISTER, SPARC_AND_REGISTER, and SPARC_OR_REGISTER map to doArithmeticRegister() with the operations ADD, SUB, AND, and OR; their _CONST counterparts map to doArithmeticConstant(); and SPARC_BRANCHALWAYS and SPARC_BRANCHONEQUAL map to doBranch() with the operations BA and BE.

The internal counters for these instruction categories are realized without using callback
functions to ensure a high simulation speed. Instead, in every morpher function a counter is implemented that increments an internal temporary register after the corresponding instruction has been executed by the simulator. For every category, one internal register exists. After the full execution of the application, the simulator reads out these registers and presents the results.
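A minimal sketch of this counting scheme is given below; the enum, the counter array, and the function names are illustrative assumptions and do not correspond to the actual OVP API, which holds the counters in internal simulator registers.

// Sketch: per-category instruction counting as done inside the morph functions.
#include <cstdint>
#include <cstdio>

enum Category { INT_ARITH, JUMP, MEM_LOAD, MEM_STORE, NOP_CAT, OTHER,
                FPU_ARITH, FPU_DIV, FPU_SQRT, NUM_CATEGORIES };

static uint64_t counters[NUM_CATEGORIES] = {0};

// Incremented directly from the morph functions, so no callbacks are needed.
inline void countInstruction(Category c) { ++counters[c]; }

// Example: one shared morph function for all register-register arithmetic
// instructions (add, sub, and, or), distinguished by an operation parameter.
enum ArithOp { OP_ADD, OP_SUB, OP_AND, OP_OR };
void doArithmeticRegister(ArithOp op, int rs1, int rs2, int rd)
{
    (void)op; (void)rs1; (void)rs2; (void)rd;
    // ... generate/execute the native code for the operation ...
    countInstruction(INT_ARITH);
}

int main()
{
    // Pretend the simulator executed a few instructions of this group.
    doArithmeticRegister(OP_ADD, 2, 4, 1);
    doArithmeticRegister(OP_SUB, 3, 5, 2);

    // After the full execution, the counters are read out and handed to the
    // energy/time model of Section IV.
    std::printf("integer arithmetic instructions: %llu\n",
                (unsigned long long)counters[INT_ARITH]);
    return 0;
}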
In this context we want to mention that the non-functional
behavior of an instruction is not necessarily constant. Espe-
cially for complex instruction set architectures (CISC) the
context has a major impact on energy and time (instructions
can take longer due to mispredicted branches, depending on
where they are found in the context of the program). On the
other hand, for reduced instruction set architectures (RISC),
most instructions can be executed using fewer cycles compared
to a CISC-based system. The time and energy wasted by, e.g., a flushed pipeline due to a mispredicted branch are consequently not as severe as on a CISC-based processor. As our work is mainly focused on embedded hardware, which often implements RISC architectures such as the SPARC V8-based LEON3 (which does not even feature a pipeline), we argue that due to architectural properties it is valid to assume that an instruction shows a roughly constant non-functional behavior regardless of the context it is found in.
IV. ENERGY AND TIME MODELING
The general equations used to estimate the processing energy \hat{E} and time \hat{T} are given as

\hat{E} = \sum_{c} e_c \cdot n_c \quad \text{and} \quad \hat{T} = \sum_{c} t_c \cdot n_c,    (1)

where the index c represents the instruction category as introduced before, e_c and t_c the instruction-specific energy and time, and n_c the instruction count.
For our model, nine instruction categories have been iden-
tified as summarized in Table I. The first six categories
describe the energy consumption of the basic integer unit. The
remaining three categories correspond to the FPU operations.
The category “FPU Arithmetic” comprises the floating point add, subtract, and multiply operations.
The middle and the right column of Table I include the instruction-specific energies and times that are assigned to the respective categories. The specific time t_c can be interpreted as the mean time required to execute one instruction of this category in our test setup. Likewise, the specific energy e_c describes the mean energy needed during the execution of one instruction. The values shown in the table have been derived by the measurement method explained in Section V.

TABLE I. INSTRUCTION CATEGORIES AND THEIR RESPECTIVE SPECIFIC ENERGIES AND TIMES AS DERIVED BY DEDICATED MEASUREMENTS.

Instruction category c    Spec. Time t_c    Spec. Energy e_c
Integer Arithmetic        45 ns             15 nJ
Jump                      238 ns            76 nJ
Memory Load               700 ns            229 nJ
Memory Store              376 ns            166 nJ
NOP                       46 ns             13 nJ
Other                     41 ns             13 nJ
FPU Arithmetic            46 ns             14 nJ
FPU Divide                431 ns            431 nJ
FPU Square root           612 ns            88 nJ
Now if we know how often an instruction of a given category
is executed during a process, we can multiply this number
with the specific energy and time, add up the accumulated
values for each category and obtain an estimation for the
complete execution time and energy. These numbers n_c are called instruction counts, and they are derived by the simulation in the ISS as presented above.
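As an illustration of how (1) is evaluated, the following sketch combines hypothetical instruction counts, as they could be read out from the simulator after a run, with the specific values from Table I; the counts and the data structures are assumptions for illustration only.

// Sketch: evaluating Eq. (1) with the specific values from Table I.
#include <cstdio>
#include <cstdint>

struct Category {
    const char* name;
    double t_c;   // specific time per instruction [s]
    double e_c;   // specific energy per instruction [J]
};

int main()
{
    // Values from Table I (ns and nJ converted to seconds and joules).
    const Category categories[] = {
        {"Integer Arithmetic", 45e-9, 15e-9}, {"Jump",         238e-9,  76e-9},
        {"Memory Load",       700e-9, 229e-9}, {"Memory Store", 376e-9, 166e-9},
        {"NOP",                46e-9,  13e-9}, {"Other",         41e-9,  13e-9},
        {"FPU Arithmetic",     46e-9,  14e-9}, {"FPU Divide",   431e-9, 431e-9},
        {"FPU Square root",   612e-9,  88e-9},
    };
    const int numCategories = sizeof(categories) / sizeof(categories[0]);

    // Hypothetical instruction counts n_c as reported by the extended ISS.
    const uint64_t n_c[] = {850000000ULL, 120000000ULL, 310000000ULL,
                            95000000ULL, 20000000ULL, 60000000ULL, 0ULL, 0ULL, 0ULL};

    double E_hat = 0.0, T_hat = 0.0;
    for (int c = 0; c < numCategories; ++c) {
        E_hat += categories[c].e_c * double(n_c[c]);   // E^ = sum_c e_c * n_c
        T_hat += categories[c].t_c * double(n_c[c]);   // T^ = sum_c t_c * n_c
    }
    std::printf("estimated energy E^: %.2f J\n", E_hat);
    std::printf("estimated time   T^: %.2f s\n", T_hat);
    return 0;
}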
V. MEASUREMENT SETUP
In order to prove the viability of the presented model,
we built a dedicated test setup for measuring the execution
time and energy consumption of a SPARC LEON3 softcore
processor [1] on an FPGA board. The FPGA board was a
Terasic DE2-115 featuring an Altera Cyclone IV FPGA. The
board was controlled using GRMON debugging tools [10]
and the LEON3 was synthesized using Quartus, where the
cache system and the MMU were disabled. Hence, in this
publication, we consider a baseline CPU including an FPU.
We utilized an FPGA because it offers great flexibility: The
CPU can be customized according to our needs for highly
versatile testing. We exploited this property to generate a useful
platform for the step-by-step construction of an accurate and
general RISC model.
For the measurement of the execution time of a process we
used the clock()-function from the C++ standard library
time.h. The measurement method for the energy consump-
tion of the process is the same as already presented in [4].
To obtain the energy required to execute a single instruction, we measured two kernels: a reference and a test kernel, as indicated in Table II. The processing in both kernels features the same number of baseline instructions, e.g., jumps for a loop. In contrast, the test kernel additionally contains a large number of specific instructions that are not included in the reference kernel. Subtracting the processing energy and time
of the reference kernel (E_ref, T_ref) from that of the test kernel (E_test, T_test), we obtain the time and energy required by the additional instructions. This value is then divided by the number of instruction executions n_test, which is the product of the number of loop iterations and the number of instructions inside the loop. Thus, we obtain an instruction-specific time
and energy as
e_c = \frac{E_{test} - E_{ref}}{n_{test}} \quad \text{and} \quad t_c = \frac{T_{test} - T_{ref}}{n_{test}}.    (2)
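As a brief numerical illustration with hypothetical measurement values: if the test kernel yields E_test = 5.2 J and T_test = 9.3 s, the reference kernel E_ref = 3.7 J and T_ref = 4.8 s, and the loop executes n_test = 10^8 test instructions, then (2) gives e_c = 1.5 J / 10^8 = 15 nJ and t_c = 4.5 s / 10^8 = 45 ns, which correspond to the "Integer Arithmetic" entries of Table I.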
TABLE II. PSEUDO-CODE OF REFERENCE AND TEST FILE TO OBTAIN TIME- AND ENERGY-SPECIFIC VALUES. THE REFERENCE FILE CONTAINS A FOR LOOP WITHOUT ANY CONTENT. IN THE TEST FILE, THE FOR LOOP CONTAINS A LARGE AMOUNT OF THE INSTRUCTIONS TO BE TESTED, IN THIS EXAMPLE INTEGER ADD OPERATIONS.

// Reference kernel
int main(.)
{
  for (i=0; i<1000000; i++)
  {
    // empty
  }
}

// Test kernel
int main(.)
{
  for (i=0; i<1000000; i++)
  {
    ADD # #
    ...
    ADD # #
  }
}
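A compilable variant of such a kernel pair might look as follows; the inline assembly, the amount of unrolling, and the loop count are illustrative assumptions, while clock() provides T_test (or T_ref), and E_test and E_ref are obtained with the external power meter rather than in software.

// Sketch of a test kernel for the "Integer Arithmetic" category; the reference
// kernel is identical except that the loop body is left empty.
#include <cstdio>
#include <ctime>

int main()
{
    long a = 0, b = 1;

    std::clock_t start = std::clock();
    for (long i = 0; i < 1000000; ++i) {
        // A large, unrolled block of the instruction under test; the inline
        // assembly (SPARC V8 "add") keeps the compiler from removing it.
        asm volatile("add %0, %1, %0" : "+r"(a) : "r"(b));
        asm volatile("add %0, %1, %0" : "+r"(a) : "r"(b));
        asm volatile("add %0, %1, %0" : "+r"(a) : "r"(b));
        asm volatile("add %0, %1, %0" : "+r"(a) : "r"(b));
        // ... repeated until the adds clearly dominate the loop overhead
    }
    std::clock_t stop = std::clock();

    // T_test for the test kernel (T_ref for the empty reference kernel);
    // dividing the differences by n_test yields e_c and t_c as in (2).
    double seconds = double(stop - start) / CLOCKS_PER_SEC;
    std::printf("elapsed time: %f s (result %ld)\n", seconds, a);
    return 0;
}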
Due to the unrealistic programming flow of the reference
and the test kernel, these values may differ from the values
observed in a real application. Hence, the values are checked
for consistency and manually adapted, if necessary.
It should be noted that the energy for the execution of a
certain instruction can be variable depending on the preceding
and succeeding instructions, or the input data. To overcome this problem, we assume that in a real application this variation averages out to an approximately constant value when the corresponding instruction is executed multiple times in different contexts, which is supported by the results of our evaluation.
VI. EVALUATION
In this section, we show that our model returns valid energy
and time estimations for the given CPU by testing two con-
ventional image processing algorithms: High-Efficiency Video Coding (HEVC) decoding, which performs mainly integer arithmetic, and Frequency Selective Extrapolation (FSE), which makes extensive use of floating point operations.
A. HEVC Decoding
To test the energy consumption of the HEVC decoder, we used the HM reference software [12], slightly modified to run bare-metal. To this end, we included the in- and output streams directly in the kernel and cross-compiled it for the LEON3. While HEVC can be fully implemented using pure integer arithmetic, the software used performs a few floating point operations, e.g., for timing purposes. Furthermore, it
uses a high variety of different algorithmic tools and methods
like filtering operations and transformations. Due to predictive
tools, a high amount of memory space is required.
To have a representative test set, we measured the decoding
process of 36 different video bit streams. These bit streams
were encoded with four different encoding configurations (in-
tra, lowdelay, lowdelay P, and randomaccess), three different
visual qualities (quantization parameters 10, 32, and 45), and
three different input raw sequences.
B. Frequency Selective Extrapolation
The second test of the model was carried out by computing
the Frequency Selective Extrapolation (FSE) [15] algorithm
on the device. FSE is an algorithm for reconstructing image
signals which are not completely available, but rather contain
regions where the original content is unknown. This may,
e.g., happen in the case of transmission errors that have to
be concealed or if an image contains distortions or undesired objects.

Fig. 4. Comparison between measurement and estimation for four different showcase processes (energy in J, left axis; time in s, right axis). The two FSE kernels as well as the two HEVC kernels process the same input data. In contrast to the float kernels, the fixed kernels are compiled with the -msoft-float compiler flag.

For the extrapolation purpose, FSE iteratively gener-
ates a parametric model of the desired signal as a weighted
superposition of complex-valued Fourier basis functions. As
the model is defined for the available as well as for the
unknown samples, one directly obtains an extension of the
signal into these unknown regions.
For generating the model, the available samples are iteratively approximated, where in every iteration one basis function is selected by performing a weighted projection of the approximation residual onto all basis functions. This process can also be carried out in the frequency domain; in doing so, a Fast Fourier Transform is necessary. Due to the high amplitude range and the required accuracy, all operations need to be carried out in double precision. For a
detailed discussion of FSE, especially the relationship between
the spatial-domain implementation and the frequency-domain
implementation and its influence on the required operations
and run-time, please refer to [15], [16].
As a test set for the input data we chose 24 different pictures
from the Kodak test image database where for each picture a
different mask was defined. Hence, we obtained 24 kernels
differing by the input image and mask.
C. Experimental Results
To see the benefit and the influence of an additional FPU in
a CPU, for both algorithms we tested two cases: Processing
with and without floating point operations (float and fixed,
respectively). The latter case is achieved by compiling the kernel with the compiler flag -msoft-float, which emulates floating point operations using integer arithmetic. The use of this compiler flag does not influence the precision of the process, such that the output exactly matches the output of the kernel compiled without this flag.
We show the validity of the model by measuring the
execution time and the energy consumption of the processes
presented above and comparing them to the estimations re-
turned by the proposed model. The bar diagram in Figure 4
shows the results for four different representative cases.
The dark blue bars depict the measured energies, the light
blue bars the estimated energies (left axis). The yellow bars
represent the measured times and the red bars the estimated
times (right axis). We can see that all estimations are located
close to their corresponding measured values.
TABLE III. MEAN ABSOLUTE ESTIMATION ERROR AND MAXIMUM ABSOLUTE ERROR OF OUR MODEL.

                               Energy    Time
Mean absolute error ε̄          2.68%     2.72%
Maximum absolute error ε_max   6.32%     6.95%

TABLE IV. THE CHANGE OF NON-FUNCTIONAL PROPERTIES OF AN ALGORITHM WHEN INTRODUCING AN FPU INTO THE HARDWARE.

                       FSE        HEVC Decoding
Energy consumption     -92.6%     -42.88%
Processing Time        -92.8%     -43.49%
# logical elements     +109%      +109%

To evaluate the performance of our algorithm, we calculated the estimation errors for all M tested kernels as

\varepsilon_m = \frac{\hat{E}_m - E_{meas,m}}{E_{meas,m}},    (3)

where \hat{E}_m is the estimated energy from (1), E_{meas,m} is the measured energy, and m is the kernel index. In the same way, the estimation error for time was derived. Table III shows two summarizing indicators for the evaluated kernels: first, the mean absolute estimation error \bar{\varepsilon} = \frac{1}{M}\sum_{m=1}^{M} |\varepsilon_m| and, second, the maximum absolute error \varepsilon_{max} = \max_m |\varepsilon_m|, m = 1, ..., M, for both energy and time. The maximum error is the highest error we observed for our evaluated kernel set. The small mean errors, which are lower than 3%, show that the estimations of our model can be used to approximate the real energy consumption and processing time.
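For completeness, the error statistics in Table III can be computed directly from the per-kernel estimates and measurements, as sketched below; the numerical values in the two arrays are placeholders, not measured data.

// Sketch: mean and maximum absolute relative estimation error (Eq. (3)).
#include <cstdio>
#include <cmath>
#include <algorithm>
#include <vector>

int main()
{
    // Placeholder per-kernel energies [J]: estimates E^_m and measurements E_meas,m.
    std::vector<double> estimated = {101.0, 243.5, 55.2, 310.0};
    std::vector<double> measured  = { 98.6, 250.1, 57.4, 305.2};

    double sumAbs = 0.0, maxAbs = 0.0;
    for (std::size_t m = 0; m < measured.size(); ++m) {
        double eps = (estimated[m] - measured[m]) / measured[m];   // Eq. (3)
        sumAbs += std::fabs(eps);
        maxAbs = std::max(maxAbs, std::fabs(eps));
    }
    double meanAbs = sumAbs / double(measured.size());

    std::printf("mean absolute error: %.2f %%\n", 100.0 * meanAbs);
    std::printf("maximum absolute error: %.2f %%\n", 100.0 * maxAbs);
    return 0;
}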
D. Application of the Model
As a first basic application of this model, the information presented above can be used to help the developer decide on a suitable architecture. If he, e.g., would like to know whether it makes
sense to include an FPU on his hardware, he can simulate the
execution of his code with and without an FPU and obtain
the information about processing time and consumed energy.
The result of such a benchmark for our framework is shown
in Table IV.
The values in the table are mean values over all tested
kernels. The third row shows the increase in chip area needed for an FPU, which can be obtained by synthesizing the
processor. We can see that if we spend more chip area (about
twice the size as indicated by the number of logical elements),
we save more than 90% of the processing time and energy for
FSE processing, which may be highly beneficial for energy- and time-constrained devices. In contrast, for HEVC decoding, an FPU reduces time and energy by less than half, such that the expensive chip area might lead the developer to choose a processor without an FPU.
VII. CONCLUSIONS
In this paper, we presented a method to accurately estimate
non-functional properties of an algorithm using simulations
on a virtual platform. The model is based on a mechanistic
approach and reaches an average error of 2.68% for energy
and 2.72% for time estimation. Furthermore, we have shown
that the information can be used by developers for time and
energy estimates. Further work aims at incorporating a model
for the cache and multi-core processors and generalizing this
concept to any CPU of interest. Additionally, we will evaluate
the estimation accuracy of this model for further algorithms to
show the general viability.
ACKNOWLEDGMENT
This work was financed by the Research Training Group
1773 “Heterogeneous Image Systems”, funded by the German
Research Foundation (DFG).
REFERENCES
[1] Leon3 processor. [Online.] Available: http://gaisler.com/index.php/
products/processors/leon3?task=view&id=13.
[2] B. Bailey. System level virtual prototyping becomes a reality with OVP donation from Imperas. White Paper, June 1, 2008.
[3] S. Berschneider. Modellbasierte Hardwareentwicklung am Beispiel eingebetteter Prozessoren für die optische Messtechnik. Master's thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany, Dec. 2012.
[4] S. Berschneider, C. Herglotz, M. Reichenbach, D. Fey, and A. Kaup.
Estimating video decoding energies and processing times utilizing
virtual hardware. In Proc. 3PMCES Workshop. Design, Automation
& Test in Europe (DATE), 2014.
[5] N. Binkert, B. Beckmann, et al. The gem5 simulator. ACM SIGARCH Computer
Architecture News, 39(2):1–7, May 2011.
[6] T. E. Carlson, W. Heirman, and L. Eeckhout. Sniper: Exploring the level
of abstraction for scalable and accurate parallel multi-core simulations.
In Proc. International Conference for High Performance Computing,
Networking, Storage and Analysis. ACM, Nov 2011.
[7] L. Eeckhout. Computer architecture performance evaluation methods.
Synthesis Lectures on Computer Architecture, 5:1–145, 2010.
[8] K. Z. Elabidine and A. Greiner. An accurate power estimation method
for MPSoC based on SystemC virtual prototyping. In Proc. 3PMCES
Workshop. Design, Automation & Test in Europe (DATE), 2014.
[9] S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith. A mechanistic
performance model for superscalar out-of-order processors. ACM
Transactions on Computer Systems (TOCS), 27(2), May 2009.
[10] A. Gaisler. Grmon2 debug monitor. [Online.] Available: http://www.
gaisler.com/index.php/products/debug-tools/grmon2.
[11] K. Grüttner, P. Hartmann, T. Fandrey, K. Hylla, D. Lorenz, S. Stat-
telmann, B. Sander, O. Bringmann, W. Nebel, and W. Rosenstiel.
An ESL timing and power estimation and simulation framework for
heterogeneous SoCs. Proc. International Conference on Embedded
Computer Systems: Architectures, Modeling, and Simulation (SAMOS
XIV), pages 181–190, 2014.
[12] ITU/ISO/IEC. HEVC Test Model HM-11.0. [Online.] Available: https:
//hevc.hhi.fraunhofer.de/.
[13] F. Rosa, L. Ost, R. Reis, and G. Sassatelli. Instruction-driven timing
CPU model for efficient embedded software development using OVP.
In Proc. IEEE International Conference on Electronics, Circuits, and
Systems (ICECS), 2011.
[14] D. Sanchez and C. Kozyrakis. Zsim: Fast and accurate microar-
chitectural simulation of thousand-core systems. Proc. 40th Annual
International Symposium on Computer Architecture (ISCA), 2013.
[15] J. Seiler and A. Kaup. Complex-valued frequency selective extrap-
olation for fast image and video signal extrapolation. IEEE Signal
Processing Letters, 17(11):949–952, November 2010.
[16] J. Seiler and A. Kaup. A fast algorithm for selective signal extrapolation
with arbitrary basis functions. EURASIP Journal on Advances in Signal
Processing, 2011:1–10, 2011.
[17] M. Shafique, L. Bauer, and J. Henkel. Adaptive energy management for dynamically
reconfigurable processors. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 33(1):50–63, January 2014.
[18] V. Tiwari, S. Malik, and A. Wolfe. Power analysis of embedded
software: A first step towards software power minimization. IEEE
Transactions on VLSI Systems, 2(4):437–445, Dec 1994.
Article
Signal extrapolation is an important task in digital signal processing for extending known signals into unknown areas. The Selective Extrapolation is a very effective algorithm to achieve this. Thereby, the extrapolation is obtained by generating a model of the signal to be extrapolated as weighted superposition of basis functions. Unfortunately, this algorithm is computationally very expensive and, up to now, efficient implementations exist only for basis function sets that emanate from discrete transforms. Within the scope of this contribution, a novel efficient solution for Selective Extrapolation is presented for utilization with arbitrary basis functions. The proposed algorithm mathematically behaves identically to the original Selective Extrapolation but is several decades faster. Furthermore, it is able to outperform existent fast transform domain algorithms which are limited to basis function sets that belong to the corresponding transform. With that, the novel algorithm allows for an efficient use of arbitrary basis functions, even if they are only numerically defined.