Cold Start Influencing Factors
in Function as a Service
Johannes Manner, Martin Endreß, Tobias Heckel and Guido Wirtz
Distributed Systems Group
University of Bamberg
Bamberg, Germany
{johannes.manner, guido.wirtz}@uni-bamberg.de
{martin.endress, tobias-christian-juergen-lukas.heckel}@stud.uni-bamberg.de
Abstract—Function as a Service (FaaS) is a young and rapidly
evolving cloud paradigm. Due to its hardware abstraction,
inherent virtualization problems come into play and need an
assessment from the FaaS point of view. Especially avoidance of
idling and scaling on demand cause a lot of container starts and
as a consequence a lot of cold starts for FaaS users. The aim of
this paper is to address the cold start problem in a benchmark
and investigate influential factors on the duration of the perceived
cold start.
We conducted a benchmark on AWS Lambda and Microsoft
Azure Functions with 49,500 cloud function executions. Formulated as hypotheses, the influence of the chosen programming language, platform, memory size of the cloud function, and size of the deployed artifact are the dimensions of our benchmark. Cold starts on the platform as well as cold starts perceived by users were measured and compared to each other. Our results show that there is an enormous difference between the overhead the user perceives and the billed duration. In our benchmark,
the average cold start overheads on the user’s side ranged from
300ms to 24s for the chosen configurations.
Index Terms—Serverless Computing, Function as a Service,
FaaS, Cloud Functions, Cold Start, Benchmarking
I. INTRODUCTION
Function as a Service (FaaS) and Serverless Computing are often used interchangeably, but there is a difference [1].
Serverless in general is a more abstract phrase, whereas FaaS
focuses on event driven cloud functions. Therefore, we use the
term FaaS to stick to a precise phrasing of the investigated
cloud service model.
Cold starts are an inherent problem of virtualization tech-
niques. Cloud functions are executed in containers. The first
execution of a cloud function experiences a cold start since
the container must be started prior to the execution. For performance reasons, FaaS providers do not shut down the containers immediately. Subsequent executions reuse these already spawned containers to profit from the warm execution environments. Avoidance of idling and scaling on demand are
game changers compared to other cloud service models, but
entail more cold starts.
So far, cold starts have been perceived as a system-level challenge [2], [3] and, to our knowledge, have not been investigated directly. Efforts are made to circumvent the cold start problem by pinging the cloud function on a regular basis [4], [5]. This ping hack1 sins against the scale-to-zero principle of FaaS.
Therefore, we are motivated to research factors that influ-
ence the cold start and pose the following research questions:
• RQ1: Which factors, formulated as hypotheses, influence the cold start of cloud functions?
• RQ2: How can cold starts of cloud functions be benchmarked consistently to obtain repeatable experiments and results?
Based on these questions, we formulated influencing factors
as hypotheses and provided a pipeline to execute repeatable
benchmarks. Our benchmark is designed to be relevant, re-
peatable, fair, verifiable and economical [6].
The agenda is as follows: In Section II, we present other
benchmarks on FaaS with varying focuses. Section III an-
swers the first research question by forming hypotheses about
influential factors on the cold start, which are partly tested
in the experiment in Section IV. Our results are presented
in Section V and discussed in Section VI. Future Work in
Section VII concludes this paper.
II. RELATED WORK
We present economically oriented benchmarks in the first paragraph and performance-oriented ones in the second.
Based on the use cases and requirements, avoidance of
idle instances is a game changer. ADZIC and CHATLEY [7]
compared the cost of hosting one hour of operation on PaaS
offerings like AWS EC2 and Google AppEngine with the same
functionality on FaaS. They calculated a cost reduction of up
to 99.95% assuming that the functionality is invoked every five
minutes for 200ms. Another benchmark with similar research
questions was performed by VILLAMIZAR ET AL. [8]. They
compared three different system architectures: A monolithic
approach, a microservice-based system, and a cloud function
architecture. Three scenarios were transformed from monoliths
to loosely coupled cloud functions for a more robust data basis.
The transition of the monolithic application into microservices
caused a first cost reduction. This reduction progressed by running the microservices as cloud functions on a FaaS platform
and resulted in an overall cost reduction of 70% and more.
1J. Daly. 15 Key Takeaways from the Serverless Talk at AWS Startup Day. https://www.jeremydaly.com/15-key-takeaways-from-the-serverless-talk-at-aws-startup-day/. 2018. Last accessed 9/13/2018.
LEE ET AL. [4] executed a performance-oriented benchmark
for distributed data processing emphasizing the scaling prop-
erty and resulting throughput of FaaS platforms. To utilize
the cloud functions, they used a compute intensive workload.
They monitored the number of running instances when varying
the workload to assess the time overhead. However, they did not express the overhead due to cold starts explicitly. All
major, commercial FaaS platforms were part of the study,
namely AWS Lambda2, Microsoft Azure Functions3, Google
Cloud Functions4, and IBM OpenWhisk5. MALAWSKI ET AL. [9] also conducted a cross-platform study with the same providers. They used recursive fibonacci as a compute-bound benchmark algorithm. Interaction via API gateways was their chosen scenario, one of the main use cases for cloud functions. This allowed a time measurement on client and provider side. Following their interpretation of the results, they were able to compute the overhead perceived by the user, which includes network latency, platform routing, and scheduling overhead. The cold start was missing from their enumeration of influencing factors and is therefore investigated in this work.
III. HYPOTHESES
To facilitate an unbiased evaluation, hypotheses about the
implications of different parameters and decisions were for-
mulated prior to the experiment. FaaS users and especially
providers make decisions, which may influence the cold start
behavior of the executed functions.
H1: Programming Language - FaaS platforms offer a large
variety of programming languages [3]. JavaScript (JS)
for example is supported by all major platforms since
it is a perfect fit for small, stateless cloud functions.
Compiled languages like Java and C# also come into focus due to their engineering benefits as functions get more complex. Because of the environment overhead,
our hypothesis is that compiled programming languages
impose a significantly higher cold start overhead than
interpreted languages like JS. For instance, the execution
of a cloud function written in Java needs a running Java
Virtual Machine (JVM) which must be set up prior to
function execution.
H2: Deployment Package Size - Our hypothesis is that the
cold start overhead increases with the deployment pack-
age size. We want to measure the time needed to copy the function image to the container, load the image into memory, unpack it, and execute it.
H3: Memory/CPU Setting - Our hypothesis is that the cold
start overhead decreases with increasing resources, be-
cause the container can be loaded and set up faster. We
assume that this behavior is observable for low memory
settings where the CPU is busy, but is negligible for
high settings since the CPU is underutilized in these
2https://aws.amazon.com/lambda
3https://azure.microsoft.com/en-us/services/functions/
4https://cloud.google.com/functions/
5https://www.ibm.com/cloud/functions
cases. This limitation does not weaken the hypothesis,
because the low memory settings starting at 128MB are
of particular interest. Memory and CPU are used in
combination since most of the mature platforms offer a
linear scaling of CPU power based on the memory setting.
H4: Number of Dependencies - Loading dependencies
takes time when spinning up a cloud function. Our
hypothesis is that the number and size of dependencies increase the cold start overhead since they must be loaded prior to the first execution and can be reused in subsequent ones. If we can confirm this hypothesis, a best practice would be to split required libraries into sublibraries, so that only the needed subset of functionality is extracted into a new artifact.
H5: Concurrency Level - FaaS gets attention especially due
to the scaling property of cloud functions. We hypothesize
that the concurrency level, i. e., the number of concurrent
requests and therefore started containers, neither influ-
ences the cold start overhead nor the execution time.
Functions are started independently of each other in
a separate container for every concurrent execution. If
1000 requests arrive simultaneously, we expect that 1000
containers are started by the middleware of the FaaS
platform.
H6: Prior Executions - Avoidance of idling is a tremendous
improvement of FaaS compared to PaaS. Achieving this
goal comes with the drawback that unused containers are
removed from running machines. Subsequent calls to the
cloud function require a new container. We assume that
the cold start overhead is independent of prior executions.
This hypothesis is of particular interest for the first
execution of the cloud function after deployment.
H7: Container Shutdown - Providers may optimize their
infrastructure by using learning algorithms for identifying
cloud functions, which are used frequently. Due to cost
effects and user satisfaction, we hypothesize that the
duration after which a container shuts down is dependent
on the number of previous executions. According to the
FaaS paradigm, executions are independent of each other
and should not influence the lifespan of a container.
IV. EXPERIMENT
A. Hypotheses Selection
The experiment in this paper evaluates three out of the seven
hypotheses of Section III. Programming Language, Deployment Package Size, and Memory/CPU Setting are investigated.
Reasons for choosing these hypotheses are the ease of testing and of obtaining stable and reproducible results. Since this is the first benchmark focusing only on cold starts, our aim is a clear experimental setup that reduces other parameters and external influences to a minimum. We omitted the hypotheses with a concurrent notion due to the side effects introduced by concurrency in general. The data basis produced by our sequential benchmark has a minimum of external influences. Without this data basis, an evaluation of the hypotheses with a concurrent notion in our future work would not be possible due to missing reference data. Therefore, our benchmark is of special interest for real-world applications which are only requested once or twice an hour and thus benefit from the scale-to-zero property.
B. Provider and Language Selection
We selected Java and JS as programming languages. An
important reason for this decision is that Java is a compiled
and JS an interpreted language. This selection emphasizes
the differences in programming languages for the evaluation
of the programming language hypothesis. Furthermore, Java
and JS are widely used in enterprises and the open source
community6. Based on this language selection, we selected
AWS Lambda and Microsoft Azure Functions as platforms
since they were the only platforms supporting Java. This
selection prevents us from comparing the memory setting hypothesis between providers, because Azure Functions scales memory and CPU automatically. However, we can still provide first insights into this hypothesis.
C. Algorithm Selection
To test our hypotheses, we chose the recursive version of fibonacci, as did SPILLNER ET AL. [10], which calculates the nth value of the fibonacci sequence. Recursive fibonacci
is compute bound. As a consequence, the compute time is
determined by the processing power of the machine. Since it
is a recursive algorithm with two recursive calls, the tree grows
exponentially. The calculation is not memory bound, since one recursive call is evaluated completely before the next evaluation starts, e.g., fibonacci(n−1) is evaluated before fibonacci(n−2) is called. This results in a time complexity of O(2^n) and a memory complexity of O(n).
With these characteristics, the algorithm is well suited for
our benchmark. Assuming that identical hardware is used
within a data center [4], predictability and low variance in
function execution time guarantee stable results. The low
memory usage ensures that we can benchmark the function
with any of the memory settings provided by FaaS platforms.
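For illustration, a minimal sketch of this compute-bound workload in Java is given below; it is a plain standalone version, not the exact handler code deployed in the experiment (the reference implementations are linked in Section IV-D).

    // Minimal sketch of the compute-bound workload (illustrative, not the deployed SeMoDe code).
    // Recursive fibonacci: O(2^n) time, O(n) stack depth.
    public class Fibonacci {

        static long fibonacci(long n) {
            if (n <= 1) {
                return n;
            }
            // fibonacci(n-1) is evaluated completely before fibonacci(n-2) is called,
            // so the call stack never grows beyond depth n.
            return fibonacci(n - 1) + fibonacci(n - 2);
        }

        public static void main(String[] args) {
            long start = System.currentTimeMillis();
            long result = fibonacci(40); // the input value used in the experiment
            long duration = System.currentTimeMillis() - start;
            System.out.println("fibonacci(40) = " + result + " in " + duration + " ms");
        }
    }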
D. Generic Experiment Pipeline
Using a deployment pipeline, which automates all necessary
steps, makes it possible to reproduce results more easily. The
pipeline is included in our prototype SeMoDe7 as a utility mechanism for the mentioned fibonacci functions.
The first step is to implement a function that should be assessed. Reference implementations8 for some provider-
language combinations of the recursive fibonacci algorithm are
available in a separate folder of SeMoDe. To test some of the
hypotheses explained in Section III, we need an interceptor
step, where the source code is altered. In the case of the
deployment package size hypothesis, this step can be per-
formed automatically. In other cases such as the dependency
hypothesis, a manual interception must be performed.
6Based on the opened pull requests on GitHub, https://octoverse.github.com/, last accessed on 10/10/2018.
7https://github.com/johannes-manner/SeMoDe
8https://github.com/johannes-manner/SeMoDe/tree/master/fibonacci
We use the Command Line Interface (CLI) of the Serverless
Framework9to specify configuration settings and for the
deployment of our cloud functions.
SeMoDe offers different benchmarking options, where spe-
cial emphasis is put on the isolation of cold starts when
executing cloud functions. We set up the API gateways of
the FaaS platforms to enable local REST calls on our host
machine. This procedure gives us the opportunity to control
the execution by specifying the input of the respective cloud function. The specific input prevents the platforms from caching results. Finally, SeMoDe provides fetchers per provider, which
retrieve the data from the logging services in a controlled
manner.
E. Experimental Setup
While creating the experimental setup, we considered sev-
eral aspects that were independent of the hypotheses. The first
one is the way of invoking functions on the platform. Also, logging information for later analysis is a big issue for data consistency. Finally, the steps during a function invocation and execution that influence the cold start were of concern.
Due to the specific focus on cold starts, the aim of the
experimental setting is to force a cold start closely followed
by a warm start on the same container instance. A warm start
is defined as the reuse of a container in our setup. Given that
there is only a single cold start per container, having a pair
with a single cold and a single warm execution guarantees
a sound comparison, because mean calculation of several
warm executions is avoided. Such mean calculations could
have distorted our results, because one platform optimizes
the performance of cloud functions after a certain amount of
invocations, as we observed during our initial experiments.
Tests have shown that containers on most platforms were shut
down after at most 20 minutes of idling.
A FaaS platform is a black box. The precise execution
duration, which is used for billing on the platform, includes
the function execution and parts of the start up process. Other
parts of the initialization and start up of the container plus
other needed infrastructural components are not included. To
measure these aspects, we performed a REST-based interaction
with the FaaS platform, where the start and end time is also
logged on the client side.
Logging the time stamps locally enables us to compare the
local execution with the platform duration. After storing the
starting time stamp, a REST call is executed, which sends
the request over the network to the API gateway endpoint.
This endpoint creates a new event which triggers a container
creation or reuse. Finally, the cloud function is executed and
the middleware on the platform logs the start and end time
of the function execution as well as the precise duration,
which is the difference of both time stamps. The result of
the computation is transferred to the API gateway endpoint,
wrapped in a response, and sent to the caller via the network.
The client REST call exits and the local end time stamp is logged on the host machine of the FaaS user. The two local time stamps enable an assessment of the perceived execution duration for the user and as a consequence the difference between cold and warm starts.

9https://serverless.com/framework

Fig. 1: Sequential Execution of our Experiment (alternating delays of 1 min and 29 min between the sequential local REST invocations to the FaaS platform)
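The client-side part of this measurement can be sketched as follows; the API gateway URL and the JSON payload are placeholders, and the actual implementation is part of SeMoDe.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Sketch of one local REST invocation with client-side time stamps
    // (endpoint and payload are placeholders, not the real configuration).
    public class LocalInvocation {

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.execute-api.eu-west-1.amazonaws.com/prod/fibonacci"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString("{\"n\": 40}"))
                    .build();

            long localStart = System.currentTimeMillis();   // local start time stamp
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            long localEnd = System.currentTimeMillis();     // local end time stamp

            // The response body is expected to contain the container and platform
            // identifiers, which are later matched against the platform logs.
            System.out.println("status=" + response.statusCode()
                    + " localDuration=" + (localEnd - localStart) + " ms"
                    + " body=" + response.body());
        }
    }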
To force pairs of cold and warm executions, we used the
SeMoDe benchmarking mode sequential with changing inter-
vals. This mode triggers the provided function with a delay
between execution start times. Delays vary and are defined
in a provided array of delays d in a round robin fashion.
The start time of each execution is generalized in Fig. 2. The
platform response includes a container and platform identifier.
These identifiers enable an unambiguous matching between
the local REST data and the platform log data.
\[
start(i, d) =
  \begin{cases}
    0 & \text{if } i = 0 \\
    start(i-1, d) + d[\,i \bmod \operatorname{len}(d)\,] & \text{if } i \geq 1
  \end{cases}
\]

Fig. 2: Start Time of the ith Execution of the Local Benchmark Invocation
We set our array d to {1 minute, 29 minutes}. The start time is the time of the local utility, which calls the API gateway. A representation of the resulting execution sequence can be seen in Fig. 1. Once again, it should be noted that the invocation of the cloud functions is sequential.
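A direct transcription of the start time definition from Fig. 2 might look as follows (illustrative; SeMoDe implements the scheduling internally).

    // Sketch: start times for sequential invocations with delays applied round robin (cf. Fig. 2).
    public class StartTimes {

        // Returns the relative start time of the i-th execution for the delay array d.
        static long start(int i, long[] d) {
            if (i == 0) {
                return 0;
            }
            return start(i - 1, d) + d[i % d.length];
        }

        public static void main(String[] args) {
            long[] d = {1, 29}; // delays in minutes, as used in the experiment
            for (int i = 0; i < 6; i++) {
                System.out.println("execution " + i + " starts at minute " + start(i, d));
            }
        }
    }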
F. Experiment’s Dimension Selection
Finally, we select the memory and package sizes to test our
hypotheses. The initial package size is the size of a package after the build phase. Initial Java packages are approximately 1.5 MB in size, JS ones are smaller than 1 KB. All larger package sizes were created artificially by adding a file to the zip or by enlarging the JS file with a comment. Deployment packages were sized initial,
3.125 MB, 6.25 MB, 12.5 MB, 25 MB and 50 MB. On Azure,
three additional packages were created for 100 MB, 200 MB
and 400 MB. For AWS, 50 MB is the upper package size limit
for functions at the time of the experiment (June 2018).10
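One possible way to produce such artificially increased packages is to generate an incompressible filler file that is packaged together with the function; the following sketch is illustrative and not the exact procedure used in the experiment.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Random;

    // Sketch: create a filler file of random (hence barely compressible) bytes that is
    // added to the deployment artifact to inflate it to the desired size.
    public class PackageInflator {

        static void createFiller(Path target, long sizeInBytes) throws IOException {
            byte[] chunk = new byte[1024 * 1024]; // 1 MB chunks
            Random random = new Random(42);       // fixed seed for reproducible artifacts
            try (OutputStream out = Files.newOutputStream(target)) {
                long written = 0;
                while (written < sizeInBytes) {
                    random.nextBytes(chunk);
                    int len = (int) Math.min(chunk.length, sizeInBytes - written);
                    out.write(chunk, 0, len);
                    written += len;
                }
            }
        }

        public static void main(String[] args) throws IOException {
            // e.g., a filler that pads the artifact to roughly 25 MB before it is zipped and deployed
            createFiller(Paths.get("filler.bin"), 25L * 1024 * 1024);
        }
    }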
The memory setting was only configured on AWS since
Azure does not support this feature. Memory settings on
AWS were 128 MB, 256 MB, 512 MB, 1024 MB, 2048 MB
and the maximum setting of 3008 MB. The memory setting
linearly determines the compute power of the container. Every
combination of deployment package size, memory, language,
and provider resulted in a cloud function. Therefore, we deployed 72 cloud functions on AWS and 18 on Azure. The
10https://docs.aws.amazon.com/lambda/latest/dg/limits.html
lower number of Azure functions is explicable by the dynamic
allocation of memory and CPU to the functions.11
Our experimental setup is designed to exclude side effects.
Calculating the execution overhead (cold – warm) as logged
by the client isolates the perceived cold start overhead. The
average execution time of the function (recursive fibonacci calculation), network latency, and routing within the FaaS platform are assumed to be equal for cold and warm executions and therefore irrelevant for the cold start overhead value. This subtraction isolates the additional time-consuming steps that occur during a cold start and answers the research question of how to benchmark cold starts on FaaS platforms consistently.
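A minimal sketch of this isolation step, assuming pairs of cold and warm executions with client-side durations (names and example values are illustrative):

    import java.util.List;

    // Sketch: perceived cold start overhead per cold/warm pair, based on the client-side durations.
    public class ColdStartOverhead {

        record Execution(long localDurationMs) { }
        record Pair(Execution cold, Execution warm) { }

        // overhead = cold duration - warm duration, both logged locally on the client
        static long overhead(Pair pair) {
            return pair.cold().localDurationMs() - pair.warm().localDurationMs();
        }

        static double meanOverhead(List<Pair> pairs) {
            return pairs.stream().mapToLong(ColdStartOverhead::overhead).average().orElse(Double.NaN);
        }

        public static void main(String[] args) {
            // illustrative pairs, in the order of magnitude of Table I (AWS Java, client side)
            List<Pair> pairs = List.of(
                    new Pair(new Execution(5961), new Execution(4211)),
                    new Pair(new Execution(6100), new Execution(4300)));
            System.out.println("mean cold start overhead: " + meanOverhead(pairs) + " ms");
        }
    }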
G. Provider Limitations
The function execution time is limited to five minutes for
AWS. However, the AWS API Gateway closes the connection
after 29 seconds.12 In this event, the execution may still
succeed but a timeout occurs locally and matching of local
data and platform executions is no longer possible. Therefore,
we had to make sure that the cloud function executions are
never longer than 29 seconds for any of the configurations.
We tested several values of n as input for the recursive fibonacci
function and found that calculating fibonacci(40) results in a
duration below 29 seconds even for the slowest configuration
(128 MB).
H. Experiment Execution and Data Dimensions
The experiment was executed between 6/25/2018 and
7/1/2018. Each cloud function was invoked 550 times to get
275 pairs of cold and warm executions. If a cloud function
returned 500 as HTTP status code, which indicates a server
error, or another error occurred like an API gateway timeout,
we excluded the cold as well as the warm execution. Only
pairwise valid data is processed and included in the results.
To summarize our setting, the result data matrix consists
of seven dimensions: Provider, Programming Language, De-
ployment Package Size, Memory Setting, Specific Invocation
Time, Local Duration, and Platform Duration.
11https://docs.microsoft.com/en-us/azure/azure-functions/functions-scale#how-the-consumption-plan-works
12https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html
V. RESULTS
A. Hypothesis-Independent Results
Before we confirm or reject the selected hypotheses, we
gather general insights from the data in this part. Figures 3
and 4 and Table I are based on the same dimension selection.
The deployment package size is the initial one; all valid pairs of each cloud function are used to compute the figures and mean values. For AWS, if not noted otherwise, the cloud function with 256 MB memory is selected. Due to the different scaling of memory, the absolute values of cold and warm executions on AWS and Azure are not comparable. Nevertheless, the data provides insights into how Azure deals with cold starts in general and how AWS does with a relatively low memory setting.
The following sections will discuss these aspects in more detail
when reflecting our selected hypotheses.
Figure 3 shows boxplots of the duration in milliseconds,
measured by the start and end time of REST calls on the client.
The bottom of the box is the 25th percentile and the top is the
75th percentile. The center line of the box is the median, which equals the 50th percentile. The upper whisker is the 75th percentile plus the box length multiplied by 1.5. Correspondingly, the lower whisker is the 25th percentile minus the box length multiplied by 1.5. Values that are not between the two whiskers are outliers and depicted as dots. This procedure
was also chosen for the generation of Fig. 4. The values of
these boxplots were fetched from the logging services of the
respective platforms.
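For reference, the box and whisker bounds described above can be computed as in the following sketch (the percentile interpolation used by the plotting tool may differ slightly).

    import java.util.Arrays;

    // Sketch of the box plot bounds described above: quartiles, box length (IQR),
    // and whiskers at 1.5 times the box length.
    public class BoxPlotBounds {

        // simple nearest-rank style percentile on a sorted copy of the data
        static double percentile(double[] values, double p) {
            double[] sorted = values.clone();
            Arrays.sort(sorted);
            int index = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
            return sorted[Math.max(0, Math.min(index, sorted.length - 1))];
        }

        public static void main(String[] args) {
            double[] durationsMs = {4200, 4220, 4250, 4280, 4300, 4310, 5900, 6100}; // illustrative

            double q1 = percentile(durationsMs, 25);
            double median = percentile(durationsMs, 50);
            double q3 = percentile(durationsMs, 75);
            double boxLength = q3 - q1;

            double upperWhisker = q3 + 1.5 * boxLength;
            double lowerWhisker = q1 - 1.5 * boxLength;

            System.out.printf("Q1=%.0f median=%.0f Q3=%.0f upper=%.0f lower=%.0f%n",
                    q1, median, q3, upperWhisker, lowerWhisker);
            // values outside [lowerWhisker, upperWhisker] are plotted as outliers
        }
    }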
When assessing the box and the whiskers, the values for cold executions on the client are higher than those of the corresponding warm ones. These client values can be related visually to the platform values, because the box plots in the two figures are based on exactly the same raw data and the Y axis dimension is identical for both figures. Due to this visual coherence, some outliers are no longer included in the figures. The raw data, the box plots as printed here, and the box plots including the outliers are accessible online13. Sometimes cold executions are faster than warm ones. Especially the AWS values in Fig. 4, where a large overlap between cold and warm execution durations is present, strengthen this statement. The warm and cold values for AWS JS even seem to be equal.
To get more insights about absolute values, Table I presents
mean values in milliseconds measured on client and platform
side. The dimensions of the presented data are provider,
language and initial deployment package size. AWS cloud
functions were executed with 256MB memory. The mean
values of these cloud functions can be compared to the
medians in Figs. 3 and 4.
For AWS, we measured an execution overhead for the cold
start of 1,750 ms (cold – warm) on the client instead of 247 ms
on the platform for Java and 644 ms instead of -43 ms for
JS. We observed that some cold executions on the platform
are faster than the warm executions on the same container.
13https://github.com/johannes-manner/SeMoDe/releases/tag/wosc4
                         Cold     Warm   Cold–Warm
AWS    Java  Client      5961     4211        1750
             Platform    4329     4082         247
       JS    Client     14320    13676         644
             Platform   13496    13539         -43
Azure  Java  Client     26681     1809       24872
             Platform   15261     1545       13716
       JS    Client     14369     4547        9822
             Platform    5492     4270        1222

TABLE I: Mean Values in Milliseconds for Cold and Warm Executions on Client and Platform Side
This is the reason why the value for JS is negative in this
case. For JS, 63 % of the cold executions were faster than
the corresponding warm ones. 16% of AWS cloud functions
written in Java executed faster on the cold start compared
to the warm execution on the same container. These start
and end times were logged on the platform. On the client
side, a cold execution is never faster than its corresponding
warm execution, neither for JS nor for Java. Our conclusion for AWS is that typical tasks during container start-up, etc., are not included in the value logged on the platform. Our results strengthen this assumption. Java needs a more resource-intensive environment with the initialization of a JVM during the cold start, whereas JS only uses an interpreter to execute the code. We assume that underutilization when executing the cold request, co-location of various cloud functions on the same host, and other factors influence the performance as well.
For Azure, a larger execution overhead can also be identified
using the logged execution times on the client. This indicates
that the platforms do not log the complete execution overhead
of the cold start in the execution time of the function.
In order to evaluate our hypotheses, we need to know the
total execution overhead of cold starts. This is the reason why
for all further analyses we only consider the execution times
logged on the client.
B. Hypothesis-Dependent Results
1) Assessment Methodology: To assess the hypothesis-dependent results, we use mean values as in the previous section, but more often a correlation metric to make a clear statement about the degree to which a hypothesis is supported by our data.
We used Spearman’s correlation coefficient since the nor-
mal distribution test showed that the data is not distributed
normally, which renders Pearson's correlation coefficient not applicable. The correlation coefficient ρ ranges from negatively correlated (-1) to positively correlated (+1). There exist different interpretations of the strength of a correlation. We stick to a widely used interpretation [11], where 0 indicates no correlation, an absolute value of 0.2 a weak, 0.5 a moderate, 0.8 a strong, and 1.0 a perfect correlation.
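A minimal sketch of Spearman's ρ, computed as Pearson's correlation on the (tie-averaged) ranks; statistical packages provide this out of the box.

    import java.util.Arrays;

    // Sketch: Spearman's rank correlation as Pearson's correlation on the ranks
    // (average ranks for ties).
    public class Spearman {

        static double[] ranks(double[] values) {
            Integer[] order = new Integer[values.length];
            for (int i = 0; i < values.length; i++) order[i] = i;
            Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));

            double[] ranks = new double[values.length];
            int i = 0;
            while (i < values.length) {
                int j = i;
                while (j + 1 < values.length && values[order[j + 1]] == values[order[i]]) j++;
                double avgRank = (i + j) / 2.0 + 1; // average rank for tied values
                for (int k = i; k <= j; k++) ranks[order[k]] = avgRank;
                i = j + 1;
            }
            return ranks;
        }

        static double pearson(double[] x, double[] y) {
            double meanX = Arrays.stream(x).average().orElse(0);
            double meanY = Arrays.stream(y).average().orElse(0);
            double cov = 0, varX = 0, varY = 0;
            for (int i = 0; i < x.length; i++) {
                cov += (x[i] - meanX) * (y[i] - meanY);
                varX += (x[i] - meanX) * (x[i] - meanX);
                varY += (y[i] - meanY) * (y[i] - meanY);
            }
            return cov / Math.sqrt(varX * varY);
        }

        static double spearman(double[] x, double[] y) {
            return pearson(ranks(x), ranks(y));
        }

        public static void main(String[] args) {
            // illustration on the per-setting means of Table II (Java row);
            // the ρ reported in Table V is computed on all individual measurements.
            double[] memoryMb = {128, 256, 512, 1024, 2048, 3008};
            double[] overheadMs = {1980, 1750, 1292, 1113, 1257, 861};
            System.out.println("Spearman's rho = " + spearman(memoryMb, overheadMs));
        }
    }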
Additionally, we constructed a linear regression model to calculate the slope of the line and its intercept with the Y axis. This enables us to state an equation to compute other configurations than the investigated ones. Especially for the deployment hypothesis, this approach can forecast arbitrary package sizes. The resulting linear models and the correlation coefficient ρ are presented in Tables III and V.

Fig. 3: Execution Times of Cold and Warm Invocations on Client Side (box plots of durations in milliseconds for aws-java 256 MB, aws-js 256 MB, azure-java, and azure-js)

Fig. 4: Execution Times of Cold and Warm Invocations on Provider Side (box plots of durations in milliseconds for aws-java 256 MB, aws-js 256 MB, azure-java, and azure-js)
The slope of the line is no indicator of correlation, but states how strongly the values of Y are influenced by an increasing or decreasing X value.
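A minimal ordinary least squares sketch for such a linear model y = intercept + slope · x follows; the data in the example is illustrative, not taken from the experiment.

    // Sketch: ordinary least squares fit y = intercept + slope * x, as used for the
    // linear models in Tables III and V (library implementations exist, of course).
    public class LinearRegression {

        static double[] fit(double[] x, double[] y) {
            int n = x.length;
            double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
            for (int i = 0; i < n; i++) {
                sumX += x[i];
                sumY += y[i];
                sumXY += x[i] * y[i];
                sumXX += x[i] * x[i];
            }
            double slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
            double intercept = (sumY - slope * sumX) / n;
            return new double[] {intercept, slope};
        }

        public static void main(String[] args) {
            // illustrative data: deployment package size in MB vs. cold start overhead in ms
            double[] sizeMb = {1.5, 3.125, 6.25, 12.5, 25, 50};
            double[] overheadMs = {1520, 1545, 1560, 1620, 1740, 1960};
            double[] model = fit(sizeMb, overheadMs);
            System.out.printf("overhead ≈ %.0f ms + %.1f ms/MB * size%n", model[0], model[1]);
        }
    }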
2) H1: Hypothesis Programming Language: Cold start overheads for AWS Java and JS functions with different memory settings are shown in Table II. Cold start overheads for Java cloud functions are between two and three times higher than those of the respective JS functions. Azure's data from the previous section
supports our hypothesis. Average cold execution overhead
was 24,872 ms for Java, while the JS function only caused
9,822 ms, which results in a ratio of 2.53. Based on these
ratios, we confirm the hypothesis, as we noticed that the cold
start time was significantly larger for each of the Java functions
compared to JS.
Memory in MB    128    256    512   1024   2048   3008
Java           1980   1750   1292   1113   1257    861
JS              587    644    614    368    589    371
Ratio          3.38   2.72   2.10   3.03   2.14   2.32

TABLE II: Differences of Cold and Warm Executions in Milliseconds on Client Side (AWS Lambda)
3) H2: Hypothesis Deployment Package Size: For AWS and Azure JS, we can confirm our hypothesis since ρ is positive. For AWS, the correlation is weak, but present. Azure JS has the highest value for the slope (43 ms/MB) and also the most significant correlation.
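As an illustration of the forecasting use of these linear models, the AWS Java model in Table III predicts a cold start overhead of approximately 1510 ms + 9 ms/MB × 50 MB ≈ 1960 ms for a 50 MB deployment package.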
For Azure Java the correlation coefficient is negative and
therefore we consider this combination in more detail. Ta-
ble IV states the cold start overhead for different deployment
package sizes on the client side.

                  ρ    Linear Model
AWS    Java    0.29    1510 ms + 9 ms/MB
       JS      0.37    613 ms + 12 ms/MB
Azure  Java   -0.15    25580 ms − 7 ms/MB
       JS      0.46    8571 ms + 43 ms/MB

TABLE III: Spearman's Correlation Coefficient ρ and Linear Regression Model for Hypothesis Deployment Package Size

DPS (MB)   initial   3.125   6.25   12.5   25   50   100   200   400
CSO (s)         25      27     22     27   24   36    26    26    23

TABLE IV: Deployment Package Size (DPS) in MB and Cold Start Overhead (CSO) in Seconds for the Azure Java Cloud Function

The mean values in this table
are mainly between 22 and 27 seconds with an outlier at the 50 MB deployment package size. Comparing the initial size and the highest configuration with 400 MB, there is only a difference of 2 seconds, which is less than 10% of the absolute value. This and the fact that there is no clear tendency indicate that the hypothesis for this combination is not significant. This assessment is also based on the low absolute value of ρ and the small value for the slope. Therefore, we reject the hypothesis
for Azure cloud functions written in Java.
4) H3: Hypothesis Memory Setting: The hypothesis Mem-
ory Setting states that the cold start overhead decreases with
the size of memory. Only AWS cloud functions are tested
since Azure has no memory setting. Exactly as for the prior hypothesis, we calculated Spearman's correlation coefficient ρ as well as the linear regression model.
            ρ    Linear Model
Java    -0.59    1634 ms − 0.25 ms/MB
JS      -0.20    606 ms − 0.07 ms/MB

TABLE V: Spearman's Correlation Coefficient ρ and Linear Regression Model for Hypothesis Memory Setting
Our hypothesis holds true since the values for the correlation coefficient ρ in Table V are both negative. For Java, we observe a higher correlation and slope. We assume that this is caused by a costlier middleware layer. As Java is a compiled language, the JVM needs to be set up to execute the code. The available CPU and memory might have a positive influence on how fast this setup is done. JS is an interpreted language and therefore its execution environment is not as complex as the one for Java, but additional resources also have a positive effect on the cold start time.
VI. DISCUSSION
A. Discussion of Results
Our methodology of assessing the cold start from the user's point of view is indispensable, because platforms report only a fraction of the cold start overhead in their function duration. Additionally, they seem to report different fractions of the provisioning and initialization. Especially for functions written in JS on AWS, our results were surprising. We measured that cold starts on the platform were faster than the consecutive warm ones in some cases. This leads to the conclusion that AWS only bills users for their function executions without the time to set up servers, virtual machines, and containers.
The gap between compiled and interpreted languages with
a ratio between 2 and 3 was higher than expected. Our expla-
nation is that complex execution environments, like the JVM
in case of the compiled language Java, overcharge the already
busy CPU. This effect is smaller for higher memory settings,
but present. Especially the performance gain for compiled
languages is worth mentioning. Cold start overhead of Java
functions correlates with ρ=−0.59. Only the deployment
package size hypothesis shows a mixed picture, where the
correlation is lower and varies between positive and negative
values within the same platform.
Our motivation to take the cold starts of cloud functions
into consideration is the currently prevailing strategy of using
pings to pre-warm cloud function instances. The experimental
setup of our benchmark is a REST based interaction via an
API gateway. As noted in the introduction, this ping hack is
opposed to the FaaS principle of scaling to zero. The mean
cold start overhead we measured for different platforms, lan-
guages and without artificially increased deployment package
sizes ranges from 370 ms for JS on AWS with a memory
setting of 3008 MB to 24 s for Java on Azure. The 50 MB configuration for Azure Functions written in Java had even more overhead, as already shown in Section V-B3, due to the experimental status of some languages. Therefore, the ping hack may not always be necessary. Additionally, scaling also leads to cold starts, so the ping hack does not solve the problem at all; it only ensures that a fixed number of containers is available. Our results, especially the comparison of cold and warm executions on the client side, demonstrate that in some use cases there is no need for this kind of hack, especially in situations where response times of a few hundred milliseconds (AWS-JS-3008 MB: 370 ms overhead) are acceptable. Because of this wide range
of cold start overheads, it is important to assess the impact on
specific applications. For applications requiring a fast response
or involving user interactions, even small cold start overheads
might impose a problem.
Further investigations are needed in this area, because the cold start is one of the main issues FaaS has to address and solve.
B. Threats to Validity
Based on the characteristics HUPPLER [6] mentioned in his benchmarking publication, we tried to make our benchmark as robust, self-explanatory, and repeatable as possible. However, there are some factors that could threaten the validity of our data:
Platform Limitations - There is only limited information
available on how containers are initialized and cloud
functions are executed. Based on the documentation alone, the high variance of the execution times of a cloud function is not fully explicable. Also, additional
services like the API gateway on AWS can influence the
results by returning errors, which are not cloud function
related.
Available Metrics - The function execution time that is
logged and used for billing on the platforms provides
only limited information. On AWS, the start-up duration of a container is not included in the logged execution
time. This initialization of a container is crucial for the
perceived cold starts.
Sample Size - We only tested our hypotheses with 275 cold-
warm pairs per function.
Temporal Relevance - Due to the very young and evolving
FaaS paradigm, the updates and changes in the platforms
limit our results to a certain time frame.
VII. FUTURE WORK
We plan to do the same benchmark setting again for
the tested hypotheses and want to integrate additional FaaS
platforms, namely IBM OpenWhisk, Google Cloud Functions,
OpenFaaS14 and FnProject15. The next benchmark will be
executed for a longer time period to assess daily differences
in the execution time and cold start behavior. Testing further
hypotheses, especially the number of dependencies, which is
important during the implementation of cloud functions, is
scheduled for the next experiment.
Also, the experimental design of some hypotheses, e.g. the deployment package size hypothesis, needs a redesign, especially from the programming language point of view, to verify whether this hypothesis is not significant for some combinations.
This conceptual redesign should avoid ambiguous results and
be part of the next experiment.
This follow up benchmark serves as a data basis for a
concurrency benchmark, which will be executed at the same
time to get comparable executions. The concurrency tests are
quite important since one of the main use cases is the usage
of cloud functions as a reactive component to decouple peak
loads in a web application scenario. Especially peak loads
trigger a huge amount of cold starts on the platform.
To get more insights about some hypotheses, e.g. the
programming language hypothesis, we want to conduct a
study, where our functions are executed locally. Development-
production parity is a key issue when comparing the local
values with the client perceived REST duration and also
with the platforms' start and end times. A comparison with benchmarks executed locally will help us to substantiate our hypotheses in future work.
To understand the different FaaS use cases, further cloud function triggers need to be investigated with respect to their cold start impact. Database integration triggers in particular are widely used, where a cloud function is triggered for every entry inserted into a database.
14https://www.openfaas.com/
15https://fnproject.io/
REFERENCES
[1] E. van Eyk et al., “The SPEC Cloud Group’s Research Vision on FaaS and Serverless Architectures,” in Proc. WoSC, 2017.
[2] I. Baldini et al., “Serverless Computing: Current Trends and Open Problems,” 2017.
[3] T. Lynn et al., “A Preliminary Review of Enterprise Serverless Cloud Computing (Function-as-a-Service) Platforms,” in Proc. CloudCom, 2017.
[4] H. Lee et al., “Evaluation of Production Serverless Computing Environments,” in Proc. WoSC, 2018.
[5] E. van Eyk and A. Iosup, “Addressing Performance Challenges in Serverless Computing,” in Proc. ICT.OPEN, 2018.
[6] K. Huppler, “The Art of Building a Good Benchmark,” in Performance Evaluation and Benchmarking, 2009.
[7] G. Adzic and R. Chatley, “Serverless Computing: Economic and Architectural Impact,” in Proc. ESEC/FSE, 2017.
[8] M. Villamizar et al., “Infrastructure Cost Comparison of Running Web Applications in the Cloud Using AWS Lambda and Monolithic and Microservice Architectures,” in Proc. CCGrid, 2016.
[9] M. Malawski et al., “Benchmarking Heterogeneous Cloud Functions,” in Proc. Euro-Par, 2018.
[10] J. Spillner et al., “FaaSter, Better, Cheaper: The Prospect of Serverless Scientific Computing and HPC,” in Communications in Computer and Information Science, 2017.
[11] K. H. Zou et al., “Correlation and Simple Linear Regression,” Radiology, vol. 227, no. 3, pp. 617–628, 2003.