No Free Lunch: Microservice Practices
Reconsidered in Industry
Qilin Xiang, Xin Peng, Chuan He, Hanzhang Wang, Tao Xie, Dewei Liu, Gang Zhang, and Yuanfang Cai
Abstract—Microservice architecture advocates a number of technologies and practices such as lightweight containers, container orchestration, and DevOps, with the promised benefits of faster delivery, improved scalability, and greater autonomy. However, microservice systems implemented in industry vary widely in terms of adopted practices and achieved benefits, and are drastically different from what is advocated in the literature. In this article, we conduct an empirical study, including an online survey with 51 responses and 14 interviews with experienced microservice experts, to advance our understanding of microservice practices in industry. As a part of our findings, the empirical study clearly revealed three levels of maturity of microservice systems (from basic to advanced): independent development and deployment, high scalability and availability, and service ecosystem, categorized by the fulfilled benefits of microservices. We also identify 11 practical issues that constrain the microservice capabilities of organizations. For each issue, we summarize the practices that have been explored and adopted in industry, along with the remaining challenges. Our study can help practitioners better position their microservice systems and determine what infrastructures and capabilities are worth investing in. Our study can also help researchers better understand industrial microservice practices and identify useful research problems.
Index Terms—microservice, survey, empirical study, industrial practice.
1 INTRODUCTION
MICROSERVICE architecture is an architectural style that
structures an application as a suite of loosely coupled
services, each of which has a single responsibility and can be
deployed independently, scaled independently, and tested
independently [1], [2]. Different from traditional service-
oriented architecture (SOA), which is viewed mostly as an
integration solution, microservices are individual software
applications that communicate with each other through
well-defined network interfaces [3]. Microservice architec-
ture is supposed to deliver the benefits of faster delivery,
improved scalability, and greater autonomy [1], and has
been the latest trend in building cloud native applications.
Many Internet applications (e.g., Amazon [4], Netflix [5],
Tencent’s WeChat [6], eBay’s developers program [7]) and
enterprise applications (e.g., online meeting applications
and BPM applications [8]) have been built based on mi-
croservice architecture.
Microservice architecture advocates a series of technolo-
gies and practices. An individual microservice is usually
packaged and deployed in the cloud using a lightweight
container (e.g., Docker [9]). A collection of microservices
are typically managed using container orchestration tech-
nologies (e.g., Kubernetes [10], Docker Swarm [11], and
Mesos [12]), following industry-proven DevOps practices
X. Peng is the corresponding author.
Q. Xiang, X. Peng, C. He, and D. Liu are with the School of Computer Science and the Shanghai Key Laboratory of Data Science, Fudan University, Shanghai, China, and the Shanghai Institute of Intelligent Electronics & Systems, China.
H. Wang is with eBay Inc., USA.
T. Xie is with Peking University, China.
G. Zhang is with Emergent Design Inc., China.
Y. Cai is with Drexel University, USA.
and supported by fully automated software integration and
delivery machinery [3], [13], [14], [15]. Recently, there has been a
trend of applying more advanced microservice technologies
and practices such as chaos engineering [16], serverless [17],
and service mesh [18].
More and more organizations, of various sizes and domains, are adopting microservice architecture. These or-
domains, are adopting microservice architecture. These or-
ganizations have exhibited drastic differences in terms of
their adopted technologies and practices, as well as achieved
benefits. The adoption of all recommended technologies and
practices would require great investments in process and
infrastructure, and the adoption of new technologies and
practices involves uncertainties and risks. Therefore, organizations usually adopt only part of the recommended microservice technologies and practices.
On the other hand, the organizations’ motivation to
pursue more sophisticated infrastructure is usually driven
by business needs from the market and internal forces. For
example, an organization running a system with stable and
predictable access load may not be interested in flexible
scalability, but could be interested in fast delivery of new
features. Therefore, an organization will only make more
investment on advanced microservice infrastructure when
the corresponding driving forces are taking effects.
Considering the sharp contrast between the advocated
benefits and practices in literature and the fulfilled benefits
and adopted practices in industry, we are curious to explore
microservice practices in industry to find out how and why
they differ. Understanding the differences among the matu-
rity levels of microservice systems in practice is important
for practitioners to position their systems and to create a
roadmap of continuous improvement, and for researchers to
understand the difficulties and challenges at different levels.
Some previous research efforts [19], [20], [21] conduct systematic mapping studies to learn the characteristics, benefits, and research trends of microservices from the literature, but they do not directly reflect the microservice practices
in industry. Other research efforts [22], [23], [24] report in-
dustrial surveys that reveal common practices, benefits, and
challenges. Some empirical studies [25], [26], [27] further
identify microservice-specific bad smells and anti-patterns
from both organizational and technical perspectives. These
previous studies do not reveal the existence of maturity levels and their associated practices and benefits. As a result, practitioners cannot position their current practice, or make informed decisions on whether and what improvement is needed to achieve a higher level of maturity.
In this article, we report our empirical study to inves-
tigate industrial microservice systems with a special focus
on the differences in terms of adopted practices and ful-
filled benefits. Based on a conceptual model to characterize
microservice maturity levels, we investigate the following
three research questions.
RQ1 (Maturity Level): What maturity levels can
industrial microservice systems be classified into?
What are the characteristics and driving forces of
each level? What benefits and promises can be
achieved at each level?
RQ2 (Issue): What are the common issues that constrain microservice development capabilities?
How do these common issues influence the microser-
vice systems at different maturity levels?
RQ3 (Practice and Challenge): What practices have
been explored and adopted for the identified issues
at each maturity level? How well are the issues
addressed by these practices and what challenges
still remain there?
To answer these questions, we first conduct an online survey to gain an overview of industrial microservice systems, serving as the basis for the next-phase interviews. The data collected from the survey can be found in our replication package1. We then conduct a series of interviews
with the architects and technical leaders for some of these
systems to collect high-fidelity and in-depth data around
microservice practices and benefits.
Based on the survey and interviews, we observed that
the practices and benefits that have been adopted and ful-
filled vary greatly in different systems. Some systems have
just fulfilled the basic benefits of independent development
and deployment, but may still well support the business
needs of the organizations. In contrast, some other systems
have evolved into service ecosystems and achieved
high scalability, availability, and expandability of business
domains, driven by the forces of business expansion and
merger. Most interestingly, these differences can be clearly
characterized by three levels of maturity, from basic to ad-
vanced: (1) independent development and deployment, (2)
high scalability and availability, and (3) service ecosystem.
The success of microservice architecture relies on a col-
lection of capabilities such as service decomposition, log-
ging and monitoring, fault localization, and service evolu-
tion. These capabilities are often restricted by various issues
raised in microservice development, and thus influence the
1. https://replication-package-tse.github.io/TSE2020/index.html
Fig. 1. Conceptual Model
fulfillment of microservice benefits. Based on the survey and
interviews, we identified 11 common issues and analyzed
their influences on microservice systems at different levels.
For each issue, we summarize the practices that have been
explored and adopted, and the remaining challenges.
Our findings in this study are valuable for both practi-
tioners and researchers. For practitioners, our findings can
help them position their microservice systems at a proper
level according to the business needs and determine what
infrastructures and capabilities are worth investing in. More
concretely, they can learn the issues that restrict their capa-
bilities and the practices that they can follow to address the
issues. For researchers, our findings can help them under-
stand the situation of industrial microservice practices and
the needs of microservice systems of different levels. They
may identify valuable research to address the challenges
in the current microservice practices and have a better
understanding of problems in the context of microservices.
The rest of the article is organized as follows. Section 2
introduces the conceptual model and the study process. Sec-
tion 3 reports the data statistics of survey results. Section 4
summarizes our findings and answers the three research
questions based on the survey and interviews. Section 5
discusses the research opportunities at different maturity
levels. Sections 6 and 7 describe threats to validity and related work, respectively. Section 8 concludes this article with future work.
2 STUDY DESIGN
We design the research questions and study process based
on a collection of concepts related to the maturity levels.
In this section, we first introduce the conceptual model and
then the study process.
2.1 Conceptual Model
Our conceptual model is shown in Figure 1. Our study starts
with the following three important promises that are often
expected from a microservice architecture [3].
Faster Delivery. Ideas can be turned into features
running in production in a shorter time.
Improved Scalability and Availability. The system
can better scale with environmental changes (e.g.,
system load) and at the same time ensure the avail-
ability.
Greater Autonomy. Development teams can make
technical decisions in a more autonomous way.
TABLE 1
Benefits of Microservices
Promise | Benefit | Description
Faster Delivery | Parallel Development | Different services can be developed and deployed independently and in parallel.
Faster Delivery | Extendibility and Expandability | The system can well support the extension of new requirements and the expansion of business areas.
Improved Scalability and Availability | Flexible and Automatic Scalability | The system can automatically and flexibly scale services according to the changes of load and environment.
Improved Scalability and Availability | Fault Tolerance and Fault Isolation | The system can ensure high availability by fault tolerance and fault isolation.
Greater Autonomy | Reduced Communication Cost | Different teams can make technical decisions independently, and thus require less communication between teams.
Greater Autonomy | Flexible Choice of Technology Stack | Different teams can flexibly choose the most appropriate technology stack (e.g., programming languages, frameworks) for their services.
These promises can be concretized into a set of bene-
fits [1], [3], [19], [21], [28] as shown in Table 1. The fulfillment
of these benefits varies greatly in industrial microservice
systems. These systems can be classified into different ma-
turity levels based on their achieved benefits. Note that the fulfillment of a benefit implies the fulfillment of the business value behind it, which is different from satisfaction with the corresponding aspect. For example, an organization may not fulfill the benefit of flexible and automatic scalability in a microservice system, but may still be satisfied with its scalability if the number of users accessing the system remains stable. Investment in a higher maturity level of microservices is often driven by external forces, which usually originate from the demands of business development, e.g., the increase of service requests and the expansion of business domains. An organization may choose to keep a microservice system at a lower level if there are no business demands driving it to change.
The fulfillment of the benefits highly relies on a collec-
tion of capabilities of the organizations [1], [22], [23], [24],
[29], including service decomposition, database decomposi-
tion, deployment, service communication design, API gate-
way design, service registration and discovery, logging and
monitoring, performance and availability assurance, testing,
fault localization, and service evolution. These capabilities
are restricted by various issues that are specific to microser-
vice systems. An issue has different impact on systems
with different maturity levels. Accordingly, for microservice
systems at different maturity levels, the developers have
explored and adopted some practices to address the issues.
Some practices are associated with specific advances in the
microservice infrastructures. These practices alleviate the re-
lated issues and can be followed by other microservice sys-
tems, but may still have challenges to be further addressed.
For example, after a strategy of database decomposition is
used to cope with database performance issues, there may
still exist data consistency challenges.
2.2 Study Process
Figure 2 illustrates the three main stages of this study:
survey, interview, and analysis. Our team is composed of six
researchers (including three professors) and two industry
experts. Each of the three main stages of this study has
several key objectives:
Survey: Collect objective and quantifiable data to
gain an overview of multiple microservice projects
and help identify the interviewee candidates. In ad-
dition, the survey provides a basis for the interview
template shown in Figure 3.
Interview: Collect high-fidelity and in-depth data
around microservice practices of the IT industries.
In this stage, we focus on the fulfilled benefits,
capabilities and related issues, infrastructures and
other practices, as well as remaining challenges of
the subject systems.
Analysis: Through descriptive statistics and quanti-
tative analysis, identify practice patterns and issue
trends from data collected from both the survey and
interviews. In this stage, we finalize the identified
maturity levels (RQ1) and learn how issues, prac-
tices, and challenges vary at different levels (RQ2,
RQ3).
2.2.1 Survey
The survey stage collects technical details to create an over-
all picture of microservice capabilities.
As shown in Figure 2, at the survey stage, four researchers and one industry expert conducted brainstorming to initially design the survey questionnaire based on the required capabilities of microservices; then two researchers and one industry expert reviewed and improved the questionnaire. To evaluate and improve the survey questionnaire,
we rigorously conducted five pilot surveys with experi-
enced industry experts, resulting in five revisions based on
the feedback.
We then published the questionnaire2 on social media
(e.g., Twitter and WeChat) and microservice-related technol-
ogy communities (e.g., the DevOps community and Kuber-
netes community). We also sent invitations to a manually prepared mailing list derived from microservice-related articles published in technical forums (e.g., InfoQ) and microservice-related open-source projects in open-source communities (e.g.,
GitHub). The survey is targeted at industry technical experts
who have (1) overall architecture design knowledge of the
microservice architecture and migration, and (2) experience
and familiarity with the main processes of the microservice
project. For every survey and interview, a responding practitioner is asked to discuss one microservice system that he/she is most familiar with. Because a practitioner may be involved in multiple systems that adopt different practices, his/her responses could be vague or conflicting if he/she mixed experience from multiple systems.
The survey questionnaire consists of 26 multiple-choice
questions (13 questions with multiple-select checkbox and 2
optional to answer), and 21 Q&A questions (13 optional to
answer). The survey includes four main parts:
2. https://forms.gle/bHjnGmQnB2ddcG8E9
Fig. 2. Study Process
Fig. 3. Interview Template
1) Practitioner’s Basic Information: We collect the basic
information of the practitioners, including the name, email
address, company, job title, and number of years working in
industry.
2) Project Overview: We collect the project’s basic infor-
mation, including the project area, the number of microser-
vices, LOC, programming languages in use, service origin
(legacy vs. new service), and the satisfaction of six benefits
listed in Table 1.
3) Technical Detail Questions: We design 34 technical
detail questions based on the capabilities described in Sec-
tion 2.1. For each capability, we focus on how it has been
achieved and the existing shortcomings, or why it is not
available.
4) Feedback: At the end of the questionnaire, we ask
for suggestions and whether they are willing to conduct an
interview.
2.2.2 Interview
We received 51 responses to the survey and conducted
basic statistical analysis to obtain an overview. The data
reveals some interesting facts that are at odds with common
expectations of microservice systems, e.g., the fact that auto
scalability is not achieved in most microservice systems, and
that the participants are mostly satisfied with their fault
tolerance and fault isolation mechanisms while debugging
is recognized as one of the most difficult aspects of microser-
vice systems. These responses inspire our curiosity for fur-
ther inquiry. Understanding that the survey respondents may not have been able to spend enough time considering each question or providing precise answers, we further invited these participants to face-to-face, in-depth interviews. 14 of them accepted our
invitation. Through in-depth discussions with these experts
from various industry sectors across the globe, we derive
our key findings and conclusions.
This stage starts with a brainstorming session based
on the survey result, and we decide to conduct a semi-
structured interview. As recommended by Waring and
Wainwright [30], we prepare a mixture of close-ended and
open-ended interview template shown in Figure 3. The
interview template consists of (1) information confirmation
to warm up and confirm with practitioners the basic in-
formation and project overview according to the answers
recorded in the questionnaire; (2) question setting, i.e., Q1
(in Figure 3) aims to confirm the capability within each
area; Q2 aims to learn the relevant practices to achieve the
capability; Q3 aims to figure out new or unsolved issues
and the remaining challenges; and Q4 aims to discover the
difficulties (e.g., lack of driving forces or issues) behind
capability insufficiency or the reasons why a capability is
not required.
To evaluate and improve the interview, we rigorously
conducted 2 pilot interviews, resulting in 2 revisions based
on the feedback. We then conduct the formal interviews mostly (12/14) face-to-face, with the other two through video con-
ferencing. Each interview involves one or two interviewees
and three of the authors as the interviewers. We record
the interview with the consent of the interviewees. During
the interview, one interviewer is responsible for asking
questions guided by the interview template, while the other
interviewers take notes and make sure that all the concerns
in this study are covered. The interviews are not limited
to the template: while trying to learn the adopted practices
and issues, we also ask improvisational questions based
on the answers of the interviewee(s) to dig deeper [31].
We conduct 33 follow-up conversations (by phone, email,
or in-person) to complement the face-to-face interviews
when new issues/practices/challenges are identified later.
For example, when we learn that serverless is adopted for
rapid response of changes in a later interview, we need to
confirm whether the practice is adopted in other systems
interviewed before. The formal interviews (face to face or
online conferencing) and follow-up conversations are all
conducted with the same interviewee. A formal interview
may involve several follow-up conversations.
2.2.3 Analysis
With all the data collected and analyzed, we intend to
summarize the findings to answer the research questions.
From the survey stage, we use the visual analysis tools
provided by the questionnaire platform (e.g., Google Forms)
to analyze the answers to each question. For closed-ended
questions, we directly obtain the required results using the
analysis tools. For open-ended questions, we use the word
frequency and perspective analysis tools provided by the
platform to assist us in extracting effective information from
a large number of texts. This information helps us identify
required categories by open coding and find useful feedback
from the participants. For each open-ended question, three
of the researchers code the answers into different categories
(e.g., service decomposition methods) through discussion
and consensus.
We use a voice-to-text tool to convert the interview
recordings into a text document, and manually proofread
the converted document to ensure the accuracy of the data.
In addition, we summarize the key points of the interview
based on the notes and text documents converted from
recordings. If a recording is not allowed, we still summarize
the key points based on the notes. Finally, we manually an-
alyze the interview data to answer the three research ques-
tions. For RQ1 (Maturity Level), we extract the promises
and benefits (of microservices) that have been fulfilled in the
systems and then cluster the systems into different groups
to identify the maturity levels. For RQ2 (Issue), three of
the researchers code the mentioned technical topics about
various microservice development capabilities into different
issues. For RQ3 (Practice and Challenge), we summarize the
practices that have been adopted for each identified issue
and estimate the satisfaction levels of the interviewees based
on their feedback. For an issue that is currently not well
addressed, we further consider the challenges behind.
3 SURVEY RESULTS
3.1 Demographics and System Overview
Based on the 51 responses to the survey, we conducted basic
statistical analysis to obtain an overview. In addition to re-
vealing how technologies are used in different microservice
systems, these responses also serve as the basis for us to
conduct in-depth conversations with the practitioners.
The participants of this survey all play significant roles in developing microservice systems, as shown in Figure 4, including CTOs, technical directors, architects, scrum masters, development engineers, DevOps engineers, and product managers, among others. They have 3 to 35 years of professional experience in industry (10.2 years on average). Each participant fills out a questionnaire based on one microservice system.
Fig. 4. Job Titles of Survey Participants (CTO, architect, development engineer, DevOps engineer, scrum master, technical director, site reliability engineer, project manager, agile coach)
Fig. 5. System Scale: (a) the number of services per system (<20: 20, 20-100: 26, 101-500: 3, >500: 2); (b) average LOC of a service (<1000: 7, 1001-5000: 20, 5001-20000: 11, >20000: 13)
The survey results show that the number of services in these microservice systems ranges from dozens to hundreds. The average number of lines of code (LOC)
for each service ranges from hundreds to tens of thousands.
We observe that the average number of a service’s LOC in
a system evolving from a legacy one is larger than that of a
newly developed microservice system.
Table 2 summarizes how satisfied the participants are with each of the six types of microservice benefits listed in
Table 1. The participants are most (88.24%) satisfied with
Parallel Development and least (48.92%) satisfied with Flexible
and Automatic Scalability. The data indicate that, in existing
microservice systems, parallel development can be achieved
very well, but scalability still needs to be improved. It is
surprising to observe that 13.72% of the participants do not care
about the benefit of Flexible Choice of Technology Stack.
3.2 Basic Results
In this section, we summarize the responses on each techni-
cal aspect of microservice systems.
3.2.1 Service Decomposition
Service decomposition is the most important part of microservice development, and different systems employ different decomposition strategies, as shown in Table 3. Most (46/51) of the systems decompose services based on expert experience, i.e., based on the architect's or domain expert's understanding of the business capabilities and domain; this is an open-ended approach with no uniform standard or implementation method. Only a few (4/51) systems use a Domain-Driven Design approach to decompose services, using techniques such as Event Storming and User Journey mapping. A few participants argue that service decomposition is not designed once at the beginning, but needs to evolve according to business changes.
TABLE 2
The Satisfaction of Microservice Benefits
Benefits Very Dissatisfied Dissatisfied Satisfied Very Satisfied Do Not Care Do Not Know
Parallel Development 0.00% 11.76% 45.10% 43.14% 0.00% 0.00%
Extendibility and Expandability 1.96% 11.76% 56.86% 25.49% 0.00% 3.92%
Flexible and Automatic Scalability 7.84% 35.29% 37.25% 11.76% 5.88% 1.96%
Fault Tolerance and Fault Isolation 3.92% 13.72% 58.82% 23.53% 0.00% 0.00%
Reduced Communication Cost 1.96% 13.72% 54.90% 23.53% 3.92% 1.96%
Flexible Choice of Technology Stack 3.92% 15.68% 25.49% 35.29% 13.72% 5.88%
TABLE 3
Service Decomposition Strategies
Strategies #
By expert experience 46
By Domain-Driven Design 4
By data flow 1
Others 5
Several shortcomings of service decomposition are also mentioned
by the participants:
Coupling. ”Services are poorly decomposed; there is still
some coupling, and some modules interact too frequently
and have dependencies on each other...”
Consistency. ”It is difficult to ensure that the service
decomposition design is consistent with the code imple-
mentation...”
Granularity. ”Service granularity is large; business isola-
tion needs to be improved... Service granularity is too fine;
governance costs increase...”
TABLE 4
Database Decomposition Strategies
Strategies #
By business capability 27
By domain 13
By horizontal decomposition 2
By vertical decomposition 1
By data flow 1
3.2.2 Database Decomposition
It is recommended that in a microservice system each service should manage its own database [1], but we observe that several (7/51) systems use centralized databases, over half (30/51) of the systems share databases among some services, and only about a quarter (14/51) of the systems have no shared database. Table 4 summarizes database decomposition strategies: over half (27/44) of the systems decompose databases by business capability and about a third (13/44) of them decompose databases by domain.
TABLE 5
The Reasons of Shared Database Between Services
Reasons #
Business logic is severely coupled 13
Facilitating data synchronization and cascading queries 8
In the process of database decomposing 3
Small business scale and low operation and maintenance cost 1
Lack of technical support 1
Some data source formatting is difficult to solve 1
The reasons why services need to share databases are
summarized in Table 5. The main reason is the severe cou-
pling of business logic (13/37), which makes it difficult to
decompose databases. Several (8/37) participants reported
that a shared database would reduce the time for data
synchronization and facilitate cascading queries. In some
cases, the databases are decomposed, but not completely,
so there are still services sharing one database. Several par-
ticipants reported shortcomings of database decomposition,
with several examples listed below:
Redundancy. ”There is a lot of data redundancy in the
data tables...”
Granularity. ”Database granularity is too fine; batch
operation performance is low... Database granularity is
coarse, not conducive to data isolation...”
TABLE 6
Service Deployment Solutions
Solutions #
Virtual machine 14
Virtual machine, Container 16
Container 12
Physical machine, Virtual machine 3
Physical machine, Virtual machine, Container 3
Physical machine 2
Physical machine, Container 1
3.2.3 Deployment
A microservice system may contain tens or even hundreds
of services. An instance of service can be deployed in a
physical machine, a virtual machine, or a container [32]. As
reported in Table 6, less than half (22/51) of the systems
use a hybrid deployment approach: the combination of
virtual machines and containers is the most (16/51) pop-
ular approach. Some of these systems use only containers
(12/51), virtual machines (14/51), or physical machines
(2/51). Most systems use Kubernetes (20/51), Mesos (1/51),
and Docker Swarm (11/51) to manage containers and use
VMware vSphere (15/51) to manage virtual machines. A
small number of systems are hosted by cloud platforms
(7/51).
TABLE 7
Communication Styles
Styles #
HTTP/REST, Messaging 18
HTTP/REST 15
HTTP/REST, RPC, Messaging 12
RPC 4
HTTP/REST, RPC 2
3.2.4 Service Communication Design
In a microservice architecture, services must interact using
an inter-process communication protocol such as HTTP,
AMQP, and RPC, depending on the nature of each ser-
vice [33]. There are three main communication styles:
HTTP/REST, RPC, and Messaging [34]. As shown in Ta-
ble 7, more than half (32/51) of the systems use a hybrid communication style to meet the requirements of particular scenarios: about a third (18/51) use both HTTP/REST and Messaging, and about a quarter (12/51) use all of
the three communication styles. One-third of the systems
employ only one communication style, HTTP/REST (15/51)
or RPC (4/51).
Fig. 6. API Gateway: (a) Single API Gateway; (b) Multiple API Gateway
3.2.5 API Gateway Design
An API Gateway is a server that is the single entry point
into the system. It might have other responsibilities such as
authentication, monitoring, load balancing, caching, request
shaping and management, and static response handling.
There are two kinds of API gateways: single API gateway
(Figure 6(a)) and multiple API gateway (Figure 6(b), a varia-
tion of Backend for Frontend [35]) [36]. The main difference
between a single API gateway and a multiple API gateway
is that the latter defines different gateways for different
clients. In Figure 6(b), there are three kinds of clients:
web application, mobile application, and external 3rd party
application. There are three different API gateways, each providing a set of APIs for its clients. The survey results show that over half (28/51) of the systems use a single API gateway, about a quarter (14/51) of them use multiple API gateways, and only a few (9/51) do not use any API gateway.
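As a concrete illustration of the multiple-API-gateway variant, the following minimal Java sketch shows two Backends for Frontends shaping the same downstream product data differently for mobile and web clients; the client types, routes, and payload fields are hypothetical, not taken from any surveyed system.

import java.util.Map;

// Sketch of the "multiple API gateway" (Backend for Frontend) variant:
// each client type gets its own gateway exposing a tailored API over
// the same downstream services. Names and payloads are hypothetical.
public class BackendForFrontend {
    static String handle(String clientType, String productId) {
        // Both gateways would call the same downstream product service...
        Map<String, String> product = Map.of(
                "id", productId, "name", "Laptop",
                "description", "A long marketing description...");
        // ...but each shapes the response for its own client.
        return switch (clientType) {
            case "mobile" -> product.get("name");             // lean payload
            case "web" -> product.get("name") + " - "
                    + product.get("description");             // rich payload
            default -> throw new IllegalArgumentException(clientType);
        };
    }

    public static void main(String[] args) {
        System.out.println(handle("mobile", "p-1"));
        System.out.println(handle("web", "p-1"));
    }
}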
Fig. 7. Service Registration and Discovery: (a) service registration (self, 3rd-party, do not use); (b) service discovery (client-side, server-side, do not use)
3.2.6 Service Registration and Discovery
Service registration and discovery are the key components
for a microservice system. Based on them, client services
could then dynamically discover and invoke the required
functionalities without any explicit reference to the invoked
services’ location [3]. In a microservice system, service
instances have dynamically assigned network locations.
Moreover, the set of service instances changes dynamically
because of autoscaling, failures, and upgrades. The service
registration and discovery mechanism helps locate a ser-
vice instance in a runtime environment. Locating a service
instance is divided into two phases: registration and dis-
covery. There are two kinds of registration patterns: self-
registration and third-party registration, and also two kinds
of discovery patterns: client-side discovery and server-side
discovery [37]. According to Figure 7, most (40/51) of the
systems have a service registration mechanism, including
self-registration (29/51) and third-party registration (11/51).
A majority (40/51) of the systems use a service discov-
ery mechanism, including client-side discovery (27/51) and
server-side discovery (13/51).
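To make the registration and discovery patterns concrete, here is a minimal Java sketch assuming an in-memory registry; real systems delegate this to infrastructure such as Eureka or Zookeeper, and the service names and addresses below are illustrative only. In server-side discovery, the same lookup would happen inside a router or load balancer rather than in the client.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal in-memory registry illustrating self-registration combined
// with client-side discovery (the caller load-balances itself).
public class ServiceRegistry {
    private final Map<String, List<String>> instances = new ConcurrentHashMap<>();
    private final AtomicInteger counter = new AtomicInteger();

    // Self-registration: a starting service instance reports its own address.
    public void register(String serviceName, String address) {
        instances.computeIfAbsent(serviceName, k -> new CopyOnWriteArrayList<>())
                 .add(address);
    }

    // Client-side discovery: the caller queries the registry and picks
    // an instance itself, here via simple round-robin.
    public String discover(String serviceName) {
        List<String> addrs = instances.get(serviceName);
        if (addrs == null || addrs.isEmpty()) {
            throw new IllegalStateException("no instance of " + serviceName);
        }
        return addrs.get(Math.floorMod(counter.getAndIncrement(), addrs.size()));
    }

    public static void main(String[] args) {
        ServiceRegistry registry = new ServiceRegistry();
        registry.register("order-service", "10.0.0.1:8080");
        registry.register("order-service", "10.0.0.2:8080");
        System.out.println(registry.discover("order-service")); // 10.0.0.1:8080
        System.out.println(registry.discover("order-service")); // 10.0.0.2:8080
    }
}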
3.2.7 Logging and Monitoring
A microservice system often employs sophisticated moni-
toring and logging mechanisms to manage individual ser-
vices using dashboards, so that the up/down status and
a variety of operational and business relevant metrics can
be displayed [1]. As shown in Figure 8, it is observed
that most of the logging and monitoring platforms are
built with open-source systems, such as the ELK stack (i.e.,
Logstash [38] for log collection, ElasticSearch [39] for log
indexing and retrieval, Kibana [40] for visualization), and
Zipkin [41] for tracing and visualization. Some systems
use a self-developed Application Performance Management
(APM) system or commercial software. Some participants
complained about “insufficient granularity of monitoring met-
rics and poor support for specific middleware”.
3.2.8 Performance and Availability Assurance
There are a large number of remote calls between services,
and any request error could cause cascading failures. Re-
mote calls could also lead to additional network latency,
affecting system performance. As shown in Figure 9, we
find that almost all (48/51) of the systems use various
strategies to handle the performance and availability issues.
Timeout, Rate Limiters, Retry, and Circuit Breaker are the most common strategies used in real-world microservice systems, suggesting that more effort is spent on ensuring the availability of network resources than on hardware resources. As reported in Figure 10, horizontal scaling
(41/51) is much preferred to vertical scaling (31/51). In
addition, the proportion of automatic scaling (4/51 Hori-
zontal, 2/51 Vertical) is very low, and nearly half (25/51
Horizontal, 24/51 Vertical) are done manually (more details
being discussed in Section 4.2.4).
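To sketch how one of these mechanisms works, the following minimal Java circuit breaker fails fast once a callee keeps failing, returning a fallback instead of letting the failure cascade. The threshold and cool-down values are illustrative; production systems typically rely on libraries such as Resilience4j or Hystrix rather than hand-rolled code.

import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Minimal circuit breaker: after too many consecutive failures the
// breaker "opens" and fails fast, then allows a trial call once the
// cool-down period has elapsed.
public class CircuitBreaker {
    private final int failureThreshold;
    private final Duration coolDown;
    private int consecutiveFailures = 0;
    private Instant openedAt = null;

    public CircuitBreaker(int failureThreshold, Duration coolDown) {
        this.failureThreshold = failureThreshold;
        this.coolDown = coolDown;
    }

    public <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        boolean open = openedAt != null
                && Instant.now().isBefore(openedAt.plus(coolDown));
        if (open) {
            return fallback.get(); // fail fast: protect the struggling callee
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures = 0; // success closes the breaker
            openedAt = null;
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = Instant.now(); // open the breaker
            }
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        CircuitBreaker breaker = new CircuitBreaker(3, Duration.ofSeconds(30));
        for (int i = 0; i < 5; i++) {
            String answer = breaker.call(
                    () -> { throw new RuntimeException("simulated timeout"); },
                    () -> "cached-default");
            System.out.println(answer); // always the fallback; breaker opens after call 3
        }
    }
}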
TABLE 8
Testing Approaches
Approaches #
Unit testing 47
Integration testing 41
Stress testing 37
End-to-end testing 35
Component testing 24
Consumer driven contract testing 13
Chaos engineering 6
Fig. 8. Logging and Monitoring Tools
Fig. 9. Mechanisms for Performance and Availability Assurance (Circuit Breaker: 28, Rate Limiters: 30, Bulkhead: 12, Timeout: 33, Retry: 29, CQRS: 14, Event Sourcing: 7, Do not use: 3)
Fig. 10. Scaling Strategies: (a) horizontal scaling (manual: 25, semi-automated: 12, automated: 4, do not use: 10); (b) vertical scaling (manual: 24, semi-automated: 5, automated: 2, do not use: 20)
3.2.9 Testing
Due to the dynamics and complexity of microservice sys-
tems, testing strategies suitable for monolithic applications
need to be reconsidered [42]. Table 8 shows that unit testing
(47/51) is used in almost all systems, followed by inte-
gration testing (41/51), stress testing (37/51), end-to-end
testing (35/51), component testing (24/51) and consumer-
driven contract testing (13/51). We observe that testing
whether the business logic meets the requirements is a pri-
ority, followed by performance. A few (6/51) of the systems
use chaos engineering to ensure the quality of services.
TABLE 9
The Approaches of Fault Localization
Approaches #
By logging 28
By monitoring tools (metrics, distributed tracing) 20
By testing 5
By remote debugging 5
By local debugging 4
Others 3
3.2.10 Fault Localization
In the production environment, a large portion of microser-
vice failures are related to the complex and dynamic in-
teractions and dynamic runtime environments [43]. Table 9
shows that over half (28/51) of the practitioners use logs for
troubleshooting, two-fifths (20/51) of them use monitoring
tools to help locate faults, and the rest (17/51) of the practi-
tioners locate faults by testing, remote/local debugging, etc.
Most developers may use a combination of these approaches
to locate faults, such as monitoring and distributed tracing.
The practitioners also reported some shortcomings in trou-
bleshooting:
Disappointing logging and monitoring. “distributed
tracing should be combined with log, process state, and
distributed service state to provide richer problem diagno-
sis information.”
Insufficient automation. “locating a fault usually re-
quires human effort to look at logs and monitor informa-
tion on multiple platforms, lacking effective automation
mechanisms to assist in locating faults.”
Fault reproduction. “production failures are difficult to
reproduce in the development environment.”
TABLE 10
System Assessment and Measurement
Types Metrics #
System metrics
CPU 37
Memory 33
Network latency 28
I/O 15
Thread 9
Service metrics
Query Per Second (QPS) 26
Transactions Per Second (TPS) 15
Error and exception 8
Success rate 7
Response time 2
Call chain statistics 4
Mean Time To Repair (MTTR) 3
3.2.11 Service Evolution
Quality assessment plays an important role in the evolution
of microservice systems. As shown in Table 10, the systems
often use two kinds of metrics to assess the quality of the
system. Some practitioners mentioned metrics related to the operating system and hardware, including CPU, memory, and network latency. A few participants reported that they pay more attention to service-relevant metrics, such as QPS, TPS, errors, and exceptions.
TABLE 11
Subject Systems in Interviews
Com. Area Sys. Domain #Serv KLOC Origin
C1 Finance S1 Financial 20+ 5+ New
C2 Internet S2 E-Commerce 10+ 0.3+ Legacy
C3 IT S3 E-Commerce 12 5+ New
C4 IT S4 E-Commerce 40+ 0.3+ New
C5 IT S5 Development 10+ 20+ New
C6 IT S6 E-Commerce 20+ 100+ Legacy
C7 Internet S7 Entertainment 800+ 50+ New
C8 Internet
S8 E-Commerce 20+ 100+ Legacy
S9 Development 10+ 0.5+ New
S10 Development 20+ 2+ New
C9 Internet
S11 E-commerce 100+ 0.3+ New
S12 Development 20+ 10+ Legacy
S13 E-Commerce 100,000+ 50+ Legacy
C10 Manufacture S14 Manufacture 300+ 50+ New
4 FINDINGS
We interviewed the designers of 14 microservice systems
from 10 companies in different domains (including finance, Internet, IT, and manufacturing), as listed in Table 11. These companies vary in size, and four of them are Fortune 500 companies. The systems cover different domains such as finance, e-commerce, entertainment, manufacturing, and software development. Their sizes vary greatly in terms of both service
number (10+ to 1000+, S13 is a huge ecosystem that spans
multiple business domains, and has 100,000+ services.) and
service size (0.3K+ to 100K+ lines of code per service).
Among the 14 systems, 9 are newly developed microservice
systems, while the other 5 are evolved from legacy systems.
Based on the survey and interviews, we summarize our
findings and answer the three research questions.
4.1 Maturity Levels (RQ1)
Our survey and interviews reveal that the promises and
benefits of microservices have been more or less fulfilled in
these systems. On the other hand, the degree of fulfillment
varies greatly in different systems. Some basic benefits are
well fulfilled in most subject systems, while more advanced
benefits are only well fulfilled in a few systems. Based on
the fulfillment degree of the benefits defined in Table 1, we
group the subject systems into the following three levels.
Note that a company may have multiple microservice sys-
tems (e.g., S11-S13) of different levels.
Level 1 - Independent Development and Deployment. The development and deployment of different services are isolated, and each service can be developed and deployed independently in an autonomous way.
Level 2 - High Scalability and Availability. The
system can be flexibly scaled with system load and
other environmental changes, and at the same time
ensure high availability through fault tolerance and
fault isolation.
Level 3 - Service Ecosystem. The system has evolved into an ecosystem that can support not only
the extension of new requirements but also the ex-
pansion of business domains.
S1-S4 belong to level 1. These systems usually have
dozens of services. They implement some basic charac-
teristics of microservices: each service is physically iso-
lated, running in its own process and communicating with
lightweight mechanisms [1], [3]. All the systems are de-
ployed on physical or virtual machines. S2 also uses contain-
ers for some services but without container orchestration.
To deploy or manage the systems, the operators usually
need to allocate resources manually (e.g., virtual machines),
configure runtime environments, and upgrade or down-
grade packages. These practices are called mutable infras-
tructure [44], which means that the infrastructure will be
continually updated, tweaked, and tuned to meet the ongo-
ing needs of the purpose it serves. Mutable infrastructure is
known to suffer from a number of problems. For example, it
is hard to scale, as each service instance’s creation involves a
lot of manual configurations [45]; it is hard to recover or roll-
back from failures, as the configuration of a service instance
is unknown after changes [45]. All these systems implement
continuous integration/deployment (CI/CD) pipelines ex-
cept S4.
To evolve a monolithic system to a level 1 microservice
system, the organization needs to refactor the system into a
set of services, form a cross-functional team for each service,
and establish CI/CD pipelines. It is usually challenging to
conduct this kind of large-scale architectural refactoring [46]
and at the same time ensure that the system behaviors are
not changed.
S5-S12 belong to level 2. These systems usually have
dozens to hundreds of services. All of these systems are
deployed with lightweight containers (e.g., Docker) and
some of them use a hybrid deployment of virtual machines
and containers. A common characteristic is that they have
established the practices of immutable infrastructure [44],
where service instances are replaced rather than changed.
The practices are based on the construction and deployment
of images (e.g., Docker images). Moreover, all these systems
implement DevOps practices except S5, which just imple-
ments CI/CD pipelines.
To evolve a level 1 system to a level 2 system, the
organization needs to establish a series of infrastructures,
e.g., container and image management and even container
orchestrators (e.g., Kubernetes), and runtime monitoring
systems.
S13 and S14 belong to level 3. These systems have evolved into service ecosystems consisting of hundreds to
thousands of services. In these systems, services from differ-
ent business domains are interconnected and supported by
a set of common infrastructures. The service infrastructures offer not only technical support (e.g., resource allocation and scaling, database management, and message queues) but also business support such as access to business services (e.g., user authorization and authentication). These
systems adopt more advanced cloud computing technolo-
gies. For example, service mesh [18] provides the capa-
bility of traffic management and decouples the business
logic and infrastructure, allowing developers to focus on
their business logic without being distracted by business-
neutral issues, such as Service Discovery, Circuit Breaker,
and Rate Limiters. Serverless [17] includes BaaS (Backend
as a Service) [47] and FaaS (Function as a Service) [48]. BaaS
provides a set of common services including technical ser-
vices (such as user authentication and authorization service)
and business services (such as payment service), and we
can quickly build new services and even new applications
TABLE 12
Benefits Fulfilled by Different Maturity Levels (star ratings, per level, for Parallel Development, Extendibility and Expandability, Flexible and Automatic Scalability, Fault Tolerance and Fault Isolation, and Reduced Communication Cost; Flexible Choice of Technology Stack is marked as independent of the maturity levels)
based on BaaS. FaaS is about realizing customized business
logic by running backend code without managing your own
server systems or your own long-lived server applications,
speeding up the new-idea-to-initial-deployment story.
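As a hedged illustration of the FaaS model described above, the following minimal Java sketch shows business logic written as a stateless function with no server lifecycle code anywhere; the event shape and the handler signature are hypothetical, since real platforms such as AWS Lambda define their own handler interfaces.

import java.util.Map;
import java.util.function.Function;

// Sketch of the FaaS programming model: the developer writes a
// stateless function and hands it to the platform, which provisions
// and scales execution on demand.
public class ThumbnailFunction {
    // The customized business logic: one event in, one response out.
    static Function<Map<String, String>, String> handler = event ->
            "thumbnail generated for " + event.get("imageKey");

    public static void main(String[] args) {
        // Locally simulate the platform invoking the function on an event.
        System.out.println(handler.apply(Map.of("imageKey", "photos/cat.png")));
    }
}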
To evolve a level 2 system to a level 3 system, the organi-
zation needs to improve the technical infrastructures that are
expected to provide common business services, introduce
new technologies to the infrastructures (e.g., service mesh
and Serverless), and establish governance mechanisms for
the service ecosystem.
The three maturity levels fulfill different benefits of microservices, as shown in Table 12. We score each benefit at each maturity level. The scores are subjective assignments based on the following criterion: deltas of a half star and one star denote minor and fundamental improvements over a lower level, respectively.
Parallel Development. The benefit is primarily related
to service granularity and largely fulfilled at level 1 with the
decomposition and isolation of services. It can be seen from
Table 2 that the practitioners are most (88.24%) satisfied with
Parallel Development. With the advances in infrastructures,
level 2 and level 3 systems can better support the devel-
opment and frequent updating of smaller services (e.g.,
functions in FaaS), and thus can better fulfill the benefit.
Extendibility and Expandability. At level 1 and level 2,
the benefit is fulfilled based on the proper decomposition
of services, easing the extension of new services or re-
quirements. For example, the application of DDD (Domain-Driven Design) makes services better aligned with domain concepts and thus facilitates requirements extension [49]. Level 3 systems further facilitate the emergence of new applications based on the business support of service infrastructures, thus supporting the expansion of business domains.
Flexible and Automatic Scalability. At level 1, the
operators manually scale the services based on their expe-
riences. Level 2 systems implement semi-automatic scaling
of services: monitoring mechanisms alarm the operators on
possible performance degradation and the operators make
service scaling decisions based on the alarms. The auto-
scaling mechanisms provided by the container orchestra-
tors or cloud platforms automatically execute the scaling
instructions. Level 3 systems fulfill better scalability based
on Serverless, which enables the deployment and delivery
of fine-grained service functionalities without creating and
managing the required infrastructure resources [3]. It is
worth noting that no systems in our study implement fully
automatic scaling of services due to the concern about the
controllability and risks [50].
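For intuition, the following minimal Java sketch shows the kind of replica-count calculation that an auto-scaling mechanism executes once the operator approves an alarm. The formula mirrors the one documented for the Kubernetes Horizontal Pod Autoscaler; the metric values in the example are illustrative.

// Semi-automatic horizontal scaling: monitoring raises an alarm, an
// operator approves, and the orchestrator computes the new replica
// count as desired = ceil(current * currentMetric / targetMetric).
public class ScalingAdvisor {
    static int desiredReplicas(int currentReplicas,
                               double currentMetric, double targetMetric) {
        return (int) Math.ceil(currentReplicas * (currentMetric / targetMetric));
    }

    public static void main(String[] args) {
        // 4 replicas at 85% average CPU against a 50% target -> scale to 7.
        System.out.println(desiredReplicas(4, 85.0, 50.0));
    }
}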
Fault Tolerance and Fault Isolation. Level 1 systems
largely fulfill fault tolerance and fault isolation based on the
physical isolation of services (ensuring that the failures of
any service do not affect other services) and fault tolerance
mechanisms such as timeout, retry and circuit breaker [51].
Level 2 and level 3 systems better fulfill fault tolerance
and fault isolation based on failure recovery and rollback
mechanisms supported by immutable infrastructures. In
addition, the systems can more easily realize blue-green
deployment [52], canary release (including A/B testing) [53]
and rolling update [54], so that the systems can be migrated
to the new version fast and smoothly.
Reduced Communication Cost. The benefit is primarily
related to physical isolation of services, explicit definition
of service contracts, and cross-functional teams. It is largely
fulfilled at level 1 and has no significant changes at level 2
and 3.
Flexible Choice of Technology Stack. The aspect mainly
relies on the scale of the organization and is irrelevant to
the maturity levels. Hybrid technology stacks, e.g., using
multiple languages in service development, are expensive to maintain for small organizations. Such stacks require the organization to have developers familiar with different technologies and to implement common dependencies in different languages.
explains why 13.72% participants in the survey do not care
about the benefit of Flexible Choice of Technology Stack (see
Table 2). Our interviews show that large organizations tend
to choose hybrid technology stacks.
The fulfillment of the benefits is driven by specific busi-
ness forces, and the driving force at a higher level includes
the driving forces of lower levels. The driving force of
level 1 is the fast response to market changes and business
innovations, necessitating independent development and
deployment of services. The driving force of level 2 is highly
available services for a large number of user accesses, neces-
sitating flexible and automatic scalability. The driving force
of level 3 is business expansion and merger, necessitating
service ecosystems and high expandability.
4.2 Issues, Practices, and Challenges (RQ2 & RQ3)
For each of the 11 capabilities, we investigated possible re-
strictive issues, practices that have been adopted to address
the issues, and the remaining challenges.
The investigation revealed 11 issues, 10 practices, and 10
challenges, which emerge from the survey and interviews
(Table 13). 5 issues (i.e., Issues 1, 3, 4, 5, 10) are mentioned
both in the survey and in the interviewed systems; the other
6 only showed up in the interviewed systems. The higher
the maturity level, the more issues a system has to face, and
the more practices need to be adopted.
There are 4 capabilities for which no issues were reported: communication design, API gateway design, service registration and discovery, and testing. This result indicates that the
organizations generally have no difficulties in these aspects
with the support of mature tools and infrastructures. For ex-
ample, service registration and discovery are well supported
by Eureka [55] and Zookeeper [56]. We next report the issues
identified for each capability together with the practices
addressing each issue and the remaining challenges.
TABLE 13
Sources of Issues, Practices, and Challenges (I, P, C mean the issue, practice, challenge, respectively; for the survey and each interviewed system S1-S14, grouped by maturity level, checkmarks record which of Issues 1-11 and their associated practices and challenges were observed)
4.2.1 Service Decomposition
The capability of service decomposition is restricted by the
issues on decomposition decision, design evaluation, and
refactoring of legacy systems.
1) Issue 1: Decomposition Decision Influences a Lot.
Services in a microservice system are physically isolated
as basic units of development, deployment, and scaling.
Improper service decomposition may cause serious quality
problems (e.g., performance and scalability). On the other
hand, physical isolation makes it impossible to access the
internal logics of a service from outside. Therefore, improper
service decomposition may hinder the implementation of
new requirements. Some interviewees reported that they
had to merge multiple services together due to these prob-
lems. Moreover, they reported that the refactoring of mi-
croservice systems is much harder than that of monolithic
systems, as the refactoring crosses the boundaries of mul-
tiple services with independent development teams. This
issue influences all the three maturity levels in a similar way.
Practice: Domain Driven Design (DDD). DDD [57]
is a software development methodology that focuses on
mapping concepts in the problem domain into artifacts
in the solution domain. Although DDD is not specific for
microservices, it is widely used in microservices systems to
achieve more effective service decomposition. Actually all
the interviewees mentioned that they are using or want to
use DDD, but only a few of the systems successfully applied
DDD.
Challenge: Domain Model and Artifact Mapping. The
challenges with DDD lie in the derivation of the domain
model and its mapping with artifacts, including (1) how to
conduct domain analysis to derive a proper domain model;
(2) how to extract domain concepts from the artifacts (e.g.,
code, test cases) of legacy systems and link them to corre-
sponding new artifacts; (3) how to monitor and maintain the
consistency between domain model and artifacts.
2) Issue 2: Service Dependencies are Hard to Cap-
ture. Compared with the modularity of monolithic systems,
service decomposition quality is more difficult to evaluate.
The modularity evaluation of monolithic systems usually
relies on static dependencies and evolutionary coupling of
files [58], [59]. To use static analysis to capture service de-
pendencies for microservice systems, one needs to map the
service IP addresses or service names in the code to service
repositories. This mapping is usually complex and error-
prone. On the other hand, as services are independently
developed in separate repositories, it is infeasible to capture
evolutionary coupling by analyzing revision histories. The
lack of service dependency information makes it hard to measure
service coupling and further evaluate the quality of service
decomposition. This issue influences all the three maturity
levels similarly.
Practice: Capturing Service Dependency using Run-
time Tracing. Runtime tracing captures service invocations
and thus can be used to capture service dependencies by
analyzing runtime invocations.
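A minimal sketch of this practice, assuming trace spans have already been collected by a tracer such as Zipkin (the Span shape below is a hypothetical simplification): aggregating caller-callee pairs from spans yields a service dependency graph, whose completeness is bounded by the invocations actually observed.

import java.util.*;

public class DependencyGraph {
    // A span records one observed service invocation; the caller/callee
    // fields simplify what real tracers record (trace ids, timestamps,
    // endpoints, and so on).
    record Span(String caller, String callee) {}

    // Aggregate observed invocations into a caller -> callees graph.
    static Map<String, Set<String>> build(List<Span> spans) {
        Map<String, Set<String>> graph = new TreeMap<>();
        for (Span s : spans) {
            graph.computeIfAbsent(s.caller(), k -> new TreeSet<>()).add(s.callee());
        }
        return graph;
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(
                new Span("gateway", "order-service"),
                new Span("order-service", "inventory-service"),
                new Span("order-service", "payment-service"));
        // Prints: {gateway=[order-service], order-service=[inventory-service, payment-service]}
        System.out.println(build(spans));
    }
}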
Challenge: High Cost and Low Coverage of Runtime
Tracing. Runtime tracing heavily relies on monitoring in-
frastructure (see Section 4.2.5). If an organization has not
established required infrastructure, it is expensive to im-
plement runtime tracing in an ad hoc way (e.g., by instru-
menting monitoring code in an intrusive way). On the other
hand, service dependencies captured by runtime tracing
are usually incomplete due to low coverage of potential
dependencies.
3) Issue 3: Legacy Systems are Hard to Migrate. 5 out of
the 14 interviewed systems are evolved from legacy systems
(see Table 11). Due to the complex dependencies between
files, it is often hard to migrate a monolithic legacy system
to microservice architecture and at the same time ensure the
continuous provision of business services. This issue mainly
influences level 1, as incremental migration to microservice
systems usually occurs at this level.
Practice: Strangler Pattern and Anti-Corruption Layer
Pattern. These two patterns are often used together to
migrate a legacy system to microservice architecture in-
crementally. The Strangler pattern [60] suggests gradual
replacement of specific pieces of functionalities with newly
developed services. The migration process can take a long
time, during which both the legacy system and the new
services are used to support the business together. Once all of the legacy system's functionality has been replaced by new services, the legacy system is fully strangled and can be decommissioned. The Anti-Corruption Layer [61]
pattern suggests isolating the new services from the legacy
system by placing an anti-corruption layer between them.
This layer translates communications between the two parts,
allowing the legacy system to remain unchanged while the new services avoid compromising their own design decisions.
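As a simplified illustration, the following Python sketch shows the role of an anti-corruption layer: a single adapter that knows both vocabularies and translates the legacy data model into the new service's model. All field names are hypothetical.

    # A minimal sketch of an anti-corruption layer (ACL).
    def legacy_fetch_customer(raw_id):
        """Stand-in for a call into the legacy system over its old API."""
        return {"CUST_NO": raw_id, "CUST_NM": "Alice", "CR_LMT": "1000.00"}

    class CustomerAdapter:
        """The ACL: the only component that knows both data models."""
        def get_customer(self, customer_id: str) -> dict:
            raw = legacy_fetch_customer(customer_id)
            # Translate legacy field names and types into the new model,
            # so the legacy schema never leaks into the new services.
            return {
                "id": raw["CUST_NO"],
                "name": raw["CUST_NM"],
                "credit_limit": float(raw["CR_LMT"]),
            }

    print(CustomerAdapter().get_customer("C42"))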
Challenge: Legacy System Boundary and Halfway Mi-
gration. During incremental migration, it is often hard to determine a proper boundary between the legacy system and the new services. An
improper boundary may make the new services hard to
develop (e.g., due to complex dependencies with the legacy
system) or the anti-corruption layer hard to implement.
Another challenge lies in halfway migration: the migration
process often ends with a big monolithic legacy subsystem
after all the easy parts have been replaced with services.
The remaining legacy subsystem is hard to migrate and, at the same time, introduces dead-code problems, as it may include a large amount of code that has already been reimplemented in new services.
4.2.2 Database Decomposition
The capability of database decomposition is restricted by the
issue of data coupling among services.
Issue 4: Data Coupling among Services. Multiple ser-
vices may have data coupling when the data elements (e.g.,
fields or tables) in their databases are involved in the same
data query or transaction. Due to the physical isolation,
cascading queries and traditional transaction management
cannot be used for microservice systems. This limitation
explains why 37 out of the 51 systems in the survey share
databases among services, including 7 using centralized
databases and 30 sharing databases among some services.
Note that by sharing databases we mean sharing database
servers among different services. This issue influences all
the three maturity levels.
Practice: Service Invocation Composition and Dis-
tributed Transaction. A cascading query can be imple-
mented by the composition of multiple service invocations.
Distributed transactions and data consistency can be en-
sured by using distributed transaction frameworks (e.g.,
Seata [62]).
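As a simplified illustration, the following Python sketch replaces a cross-database join with the composition of two service invocations; the endpoints, payload formats, and use of the requests library are assumptions for illustration.

    # A minimal sketch: composing two service invocations to replace what
    # used to be a single SQL join across one shared database.
    import requests

    ORDER_SVC = "http://order-service/api/orders"   # hypothetical endpoints
    USER_SVC = "http://user-service/api/users"

    def orders_with_user_names(user_ids):
        """Formerly: SELECT ... FROM orders JOIN users ... in one database.
        Now: one call per owning service, joined in application code."""
        users = {u["id"]: u["name"]
                 for u in requests.get(USER_SVC, params={"ids": user_ids}).json()}
        orders = requests.get(ORDER_SVC, params={"user_ids": user_ids}).json()
        # Each extra hop adds network latency -- the challenge noted below.
        return [{**o, "user_name": users.get(o["user_id"])} for o in orders]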
Challenge: Subsequent Refactoring and Network La-
tency. When a database is decomposed into multiple parts,
the source code that relies on multiple parts of the database
(e.g., due to cascading queries) needs to be refactored accord-
ingly. On the other hand, the composition of multiple service
invocations may cause serious network latency.
4.2.3 Deployment
The capability of deployment is restricted by the issue of
complex service configurations.
Issue 5: Complex Service Configurations. Microservice
systems usually involve complex service configurations. For
example, improper or inconsistent service configurations
(e.g., inconsistent memory limitations of JVM and Docker)
often cause runtime failures [8]. This issue mainly influences
level 2 and 3 systems. Their runtime environments are
highly dynamic and it is hard to form a common recommen-
dation for service configurations. Moreover, it is harder to
locate and fix the failures caused by configuration problems.
Practice and Challenge. No effective practices were identified in this study. It remains a great challenge to
determine and maintain proper configurations for services.
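Although no established practice emerged from the study, the JVM/Docker example above suggests the kind of consistency check that could be automated. The following Python sketch is purely illustrative; the headroom ratio and input formats are assumptions.

    # A minimal sketch: check that the JVM heap limit (-Xmx) leaves headroom
    # below the container memory limit for non-heap memory (metaspace,
    # threads, native buffers). Thresholds are illustrative assumptions.
    import re

    def check_heap_vs_container(jvm_opts: str, container_limit_mb: int,
                                headroom: float = 0.25) -> bool:
        match = re.search(r"-Xmx(\d+)([mMgG])", jvm_opts)
        if not match:
            return False  # no explicit heap limit is itself a risk
        heap_mb = int(match.group(1)) * (1024 if match.group(2) in "gG" else 1)
        return heap_mb <= container_limit_mb * (1 - headroom)

    # A 4 GB heap in a 4096 MB container leaves no headroom -> flagged.
    print(check_heap_vs_container("-Xmx4g", 4096))  # False
    print(check_heap_vs_container("-Xmx3g", 4096))  # True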
4.2.4 Performance and Availability Assurance
The capability of performance and availability assurance is
restricted by the issues of stateful service and autoscaling
strategy.
1) Issue 6: Inconsistency across Stateful Services. The
scaling of a stateful service may leave its multiple instances
in inconsistent states, which may in turn cause failures.
Stateful services are not recommended, but still exist in
microservice systems due to the incomplete migration from
monolithic to microservice architecture or the choice of
inexperienced developers. This issue mainly influences level
2 and 3 systems.
Practice: Migrating States to External Storage. By mi-
grating service states to external storage such as in-memory
cache (e.g., Redis [63]), we can change a stateful service into
a stateless one.
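As a simplified illustration, the following Python sketch moves per-instance session state into Redis so that any instance can serve any request. The key naming and TTL are illustrative assumptions.

    # A minimal sketch: externalizing session state to Redis.
    import json
    import redis  # third-party client: pip install redis

    r = redis.Redis(host="redis", port=6379)

    # Before: state lived in a per-instance dict, so scaling broke sessions.
    # sessions = {}

    def save_session(session_id: str, data: dict) -> None:
        # A TTL keeps abandoned sessions from accumulating in the store.
        r.setex(f"session:{session_id}", 3600, json.dumps(data))

    def load_session(session_id: str) -> dict:
        raw = r.get(f"session:{session_id}")
        return json.loads(raw) if raw else {}

The shared store is then exactly the latency source and potential bottleneck described in the challenge below.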
Challenge: High Refactoring Cost, Network Latency,
and System Bottleneck. Eliminating service states requires
additional refactoring efforts. On the other hand, external
storage may introduce network latency and become a system
bottleneck due to shared access.
2) Issue 7: Unpredictable and Uncontrollable Autoscal-
ing Strategy. Autoscaling strategies are hard to test and the
effects of the strategies are highly unpredictable. Moreover,
the effects may be uncontrollable. For example, when a mi-
croservice system encounters DoS attacks, the rapid growth
of service instances may occupy a lot of resources and thus
make services unavailable. This limitation explains why the
participants in the survey are the least satisfied (7.84% very
dissatisfied, 35.29% dissatisfied) with Flexible and Automatic
Scalability (see Table 2). This issue mainly influences level 2
and 3 systems.
Practice: Semi-automated Scaling. Semi-automated
scaling alerts the operators to possible problems, and
the operators make scaling decisions. The decisions are
then automatically executed, e.g., by creating more service
instances.
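As a simplified illustration of this human-in-the-loop pattern, the following Python sketch alerts the operator when a metric crosses a threshold and automatically executes only the decision the operator approves. The metric source, threshold, and deployment name are hypothetical; the scaling step shells out to the standard kubectl scale command.

    # A minimal sketch of semi-automated scaling.
    import subprocess

    def current_cpu_utilization(service: str) -> float:
        """Stand-in for a query to the monitoring system (e.g., Prometheus)."""
        return 0.92  # pretend the service is under heavy load

    def alert_and_scale(service: str, threshold: float = 0.8) -> None:
        if current_cpu_utilization(service) <= threshold:
            return
        answer = input(f"{service} exceeds {threshold:.0%} CPU. "
                       f"Scale to how many replicas (blank = do nothing)? ")
        if answer.strip().isdigit():
            # The approved decision is executed automatically.
            subprocess.run(["kubectl", "scale", f"deployment/{service}",
                            f"--replicas={answer.strip()}"], check=True)

    alert_and_scale("order-service")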
Challenge: Predictable and Reliable Autoscaling. Al-
though none of the interviewed systems implements fully automatic scaling, it remains desirable. Predictable and reliable
autoscaling requires intelligent scaling decision making and
sound quality assurance at runtime.
4.2.5 Logging and Monitoring
The capability of logging and monitoring is restricted by the
issues on distributed tracing and service anomaly detection.
1) Issue 8: Complex and Asynchronous Service Invo-
cation Chain. Microservice systems often involve complex
and asynchronous service invocation chains. Distributed
tracing is usually required to pinpoint where failures occur
and what causes poor performance [64]. Its impact increases
from level 1 to level 3. Level 2 systems usually have more
complex and dynamic service invocation chains. Level 3
systems crossing multiple business domains have more
complex service interactions.
Practice: Invasive and Non-invasive Tracing. There are
two types of approaches for distributed tracing: invasive
approaches implement tracing by instrumenting probes into
services; non-invasive approaches implement tracing by
using sidecar to proxy network requests in a service mesh.
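As a simplified illustration of the invasive style, the following Python sketch propagates a trace ID through HTTP headers. The header name and downstream endpoint are illustrative assumptions; real probes use richer contexts (e.g., B3 or W3C traceparent headers).

    # A minimal sketch of invasive trace-context propagation.
    import uuid
    import requests

    TRACE_HEADER = "X-Request-Id"  # assumed convention for this sketch

    def handle_request(incoming_headers: dict) -> None:
        # Reuse the caller's trace ID, or start a new trace at the edge.
        trace_id = incoming_headers.get(TRACE_HEADER, str(uuid.uuid4()))
        print(f"[trace {trace_id}] handling request")  # stands in for a probe
        # The fragility noted below: forgetting to forward this header
        # anywhere in the chain breaks the whole trace.
        requests.get("http://inventory-service/api/stock",
                     headers={TRACE_HEADER: trace_id})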
Challenge: High Cost, Fragility, and Latency. The chal-
lenges with invasive approaches include (1) the cost of code
instrumentation is high, especially when there are a lot of
services written in a mix of languages; (2) the tracing chains
are fragile, as any problems (e.g., missing or wrongly passed
trace ID) with the probe of a service may cause the whole
tracing chain to be interrupted or corrupted. The challenges
with non-invasive approaches include (1) the cost for the
required infrastructures such as service mesh; (2) network
latency caused by sidecar proxies.
2) Issue 9: Service Incidents are Hard to Detect. Due to
the dynamic behaviors (e.g., scaling) and fine granularity of microservices, detecting service incidents is more challenging than in monolithic systems. This issue exists across all maturity levels.
Level 2 and level 3 systems are more dynamic and flexible,
so they are impacted more than level 1 systems.
Practice: Dashboard and Threshold. Many systems con-
duct anomaly detection by manually analyzing the metrics
provided by the dashboard of monitoring systems (e.g.,
Grafana and Prometheus), or setting thresholds to trigger
predefined actions.
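As a simplified illustration, the following Python sketch pairs a threshold rule with a rolling baseline, a small step toward automating the manual threshold tuning discussed below. The window size and sensitivity are illustrative assumptions.

    # A minimal sketch: flag metric values more than k standard deviations
    # above a rolling mean, so the threshold adapts as the system changes.
    from collections import deque
    from statistics import mean, stdev

    def detect(metric_stream, window: int = 60, k: float = 3.0):
        recent = deque(maxlen=window)
        for value in metric_stream:
            if len(recent) >= 10 and value > mean(recent) + k * stdev(recent):
                yield value  # anomaly: trigger an action or page an operator
            recent.append(value)

    latencies = [100, 102, 98, 101, 99, 103, 97, 100, 102, 99, 101, 450]
    print(list(detect(latencies, window=10)))  # [450]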
Challenge: Insufficient Automation. Operators cannot watch the dashboard 24 hours a day; therefore, anomaly detection tools with full or high automation are needed. Moreover, a dashboard cannot display all the metrics of interest, and operators often need to work across multiple monitoring platforms to obtain the required information. In addition, it is difficult
to ensure the accuracy and timeliness of the thresholds,
because threshold setting depends on experience and it
needs to be constantly adjusted for system changes.
4.2.6 Fault Localization
The capability of fault localization is restricted by the issue
of complex and dynamic service interactions.
Issue 10: Complex and Dynamic Service Interaction.
Faults and failures of a microservice system often involve
complex and dynamic distributed environments and ser-
vice invocation chains. For example, service instances are
dynamically created and destroyed, and the analysis of a
request execution process often requires logs distributed in
many service instances. This issue often makes it hard to
identify a failing circumstance and reproduce it in the test
environment. The issue exists across all maturity levels but has a greater influence on level 2 and 3 systems, which are more dynamic and involve more complex runtime environments and service invocation chains.
Practice: Local Debugging, Mock, Remote Debugging,
Traffic Routing. There are different practices applicable to
different levels. Level 1 systems can use local debugging to
locate faults, as the invocation chains usually involve only
a small number of services. For level 2 and level 3 systems,
local debugging is usually infeasible, as the debugging may
involve many services and the local environment cannot
host all of them. These systems can use remote debugging
by connecting to a remote server and using online debug-
ging tools (e.g., Visual Studio Remote Debugger [65]). An al-
ternative practice is traffic routing [66], which routes service
traffic from a remote environment (e.g., a test environment)
to the local environment.
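As a simplified illustration of the traffic routing idea, the following Python sketch routes requests that carry a debug header to a developer's local environment while leaving all other traffic on the normal service. The hosts and header name are hypothetical; production systems implement this in the mesh layer (e.g., Istio).

    # A minimal sketch of header-based debug traffic routing.
    def route(request_headers: dict) -> str:
        if request_headers.get("X-Debug-User") == "alice":
            return "http://alice-laptop:8080"   # debug traffic -> local instance
        return "http://inventory-service"       # everyone else -> normal service

    print(route({"X-Debug-User": "alice"}))  # http://alice-laptop:8080
    print(route({}))                         # http://inventory-service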
Challenge: Debugging Performance, Infrastructure Re-
quirements, and Lack of Intelligence. Remote debugging
may not be smooth due to the communication with a
remote server, which may become a bottleneck. Traffic rout-
ing requires the supporting network infrastructures (e.g.,
Sidecar [67]), which are not available for many systems.
Overall, current microservice fault localization practices lack intelligence, and automated approaches (e.g., machine-learning-based fault localization [43]) are highly desired.
4.2.7 Service Evolution
The capability of service evolution is restricted by the issue
of evolution compatibility.
Issue 11: Evolution Compatibility. A microservice sys-
tem may have a lot of services and the upgrade of a service
may cause compatibility problems with upstream services.
Its impact increases from level 1 to level 3. Level 2 systems
are more dynamic and more frequently upgraded based on
the support of immutable infrastructures. Service evolution in level 3 systems has a much broader range of change impact across the ecosystem.
Practice: Backward Compatibility and Upgrade Deadline. A common practice for this issue is to ensure backward compatibility by providing multiple versions of a service API. This multi-version API provision usually can
be done by adding a version prefix to the URI (Uniform
Resource Identifier). At the same time, the usages of dif-
ferent service API versions are continuously monitored and
those that are no longer used can be decommissioned. The
organizations may also set an upgrade deadline; after the deadline passes, old service API versions are no longer available.
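As a simplified illustration, the following Python sketch (assuming the Flask web framework) exposes two API versions under URI prefixes and counts the usage of each version so that versions no longer used can be decommissioned. The routes and response fields are hypothetical.

    # A minimal sketch: backward compatibility via URI version prefixes.
    from collections import Counter
    from flask import Flask, jsonify

    app = Flask(__name__)
    version_usage = Counter()  # monitored to decide when to decommission v1

    @app.route("/v1/users/<user_id>")
    def get_user_v1(user_id):
        version_usage["v1"] += 1
        return jsonify({"id": user_id, "name": "Alice Ng"})

    @app.route("/v2/users/<user_id>")
    def get_user_v2(user_id):
        version_usage["v2"] += 1
        # v2 splits the name field; v1 stays available until the deadline.
        return jsonify({"id": user_id, "first_name": "Alice",
                        "last_name": "Ng"})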
Challenge: High Maintenance Cost. The impact of a
service upgrade is often hard to predict, so the developers
are not sure whether other services will be broken by the
upgrade. Therefore, the developers may choose to keep
many different versions of the same service API in the
production environment, thus causing high maintenance
costs.
5 DISCUSSION
The complexity and dynamism of microservice systems
pose unique challenges to a variety of software engineering
tasks [8]. Learning the roadmap of industrial microservice
systems and grounding the practices and challenges on
the maturity levels can help researchers better understand
potential research opportunities in the context of different
maturity levels.
The focus of level-1 microservice systems is independent
development and deployment, and the aim of this stage
is to establish a design structure that conforms to the mi-
croservice architecture. For newly developed systems, the
challenges are mainly related to service decomposition and
service coupling.
Software design methodologies such as domain-driven design (DDD) and techniques such as distributed transactions are widely practiced. However, improper service decomposition and data
coupling still often prevent the system from fulfilling the
desired independence, extensibility, and performance. DDD
is a successful methodology, but lacks effective techniques
and tools. Knowledge-based techniques and tools are re-
quired to map between the domain model and artifacts,
and maintain their consistency, e.g., by extracting domain
concepts from the artifacts [68]. On the other hand, mi-
croservice architecture analysis techniques need to consider
the independence and extensibility, e.g., by assessing the
alignment of the domain model and architecture model, and
also the performance, e.g., by estimating the frequency and
latency of distributed service invocations.
For migrated microservice systems, the challenges are
mainly related to the incremental migration process. Al-
though the strangler pattern and anti-corruption layer pat-
tern are used to support incremental migration, it is often
the case that a big monolithic legacy subsystem remains in
the system and interacts with migrated services. The reason
why the developers choose to keep the monolithic subsys-
tem is usually that it is not cost-effective to refactor the
subsystem. Microservice refactoring techniques are required
to not only provide migration suggestions but also safely
implement the refactoring.
Moreover, this hybrid architecture brings additional dif-
ficulties to runtime monitoring and tracing. The monolithic
subsystem influences the observability of the whole system,
as the subsystem encapsulates a large unobservable part of
the system and its old technology stack makes it hard to
apply the latest tracing frameworks.
The focus of level-2 microservice systems is high scala-
bility and availability, and the aim of this stage is to establish
the operation infrastructure and practices required by high-
scalability and high-availability microservice systems.
The challenges are mainly related to the detection and
localization of faults. Some organizations have advocated
AIOps (artificial intelligence for IT operations), and intelli-
gent fault detection and localization are important parts of
its practices. Microservice systems at this level widely adopt
distributed tracing, and thus fault detection and localization
can utilize not only logs and metrics but also traces. The
analysis involves a huge amount of data. For example, a
large Internet system may produce billions of traces per day.
Highly efficient data analysis techniques are thus required
to detect and locate potential faults among a large number
of services, using techniques such as data mining, machine
learning, and interactive visualization. Moreover, analyzing
the combination of logs, metrics, and traces is challenging,
as they are produced at different levels: logs reflect the local
behaviors of individual service instances; metrics measure
the availability of infrastructure resources and quality of
services; traces record the invocation chains of services
for requests. An integrated representation of the data is
required to support effective and efficient fault detection
and localization.
The focus of level-3 microservice systems is service
ecosystem, and the aim of this stage is to establish the
technical infrastructures that support the continuous ex-
pansion of business domains. The challenges are mainly
related to the required domain abstraction and service gov-
ernance mechanisms. Like software product lines, service
ecosystems are constructed based on a set of core assets
that embody the commonality of the domains. Different
from traditional software product lines, service ecosystems
rely on microservice infrastructure and common services to
implement the core assets. A prominent challenge is how to
elicit and construct a stable abstraction for an open and un-
certain future. The challenge with service governance orig-
inates from the large number of services and the complex
service interactions across multiple relevant domains. The
applications in a service ecosystem may emerge continu-
ously based on existing infrastructure and common services.
To ensure the sustainable evolution of the ecosystem, it
is thus crucial to establish a comprehensive set of mecha-
nisms to support service governance requirements such as
configuration management, authorization, versioning, and
evolution management.
6 THREATS TO VALIDITY
In our study, there are three main threats to internal validity. The first one is the soundness of the question design: some questions are optional or designed to solicit in-depth answers, so our key insights are discovered and validated during the interviews. The second one lies in the qualification of the interviewees: we listed preferred qualifications in the survey and selected interviewees based on their answers. One interview may involve more than one interviewee, and follow-up conversations were conducted to ensure the quality of the interviews. The third one lies in the validity of the interviewees’ answers: to protect the reputation of their companies, some interviewees may not have answered fully faithfully based on their real experience.
There are two main threats to the external validity of
our study. The first one is the quantitative limitation of the
surveys and interviews. Our study targets experts who must have sufficient knowledge and experience in microservice practices, and microservice experts with full-stack knowledge and experience are rare. The 51 survey participants do not constitute an ideal sample size, but they still form a meaningful sample. In addition, we carefully selected the interviewees to avoid noisy data and to maintain a diversified group for generalizability. The second one lies in the limited variety of interviewees’ roles. We targeted architects and experienced developers mainly because they have the most comprehensive background; as a result, there is a lack of data from operation and management personnel.
7 RE LATED WORK
Recently, there have been various investigations on the
practice of microservice architecture. Hassan et al. [69]
formulated the problem of addressing the microservice
design trade-offs and introduced their solution proposal.
Phipathananunth et al. [70] described Pink, a framework
for synthetic runtime monitoring of microservices software
systems. Pina et al. [71] proposed a much simpler and non-
invasive monitoring approach that includes topology and
performance metrics. Zhou et al. [72] presented a debug-
ging approach for microservice systems based on the delta
debugging algorithm, which minimizes failure-inducing
deltas of circumstances for effective debugging. Du et al. [73]
designed an anomaly detection system (ADS) to detect
and diagnose anomalies in microservices by monitoring
and analyzing real-time performance data. Zhou et al. [43]
proposed MEPFL, an approach of latent error prediction
and fault localization for microservice systems by learning
from system trace logs. Each of the preceding efforts mainly
focused on a specific proposed practice of microservices.
In this work, we focus on learning industrial challenges,
practices, and capabilities from real-world examples of mi-
croservice systems.
Dragoni et al. [74] reviewed the development history from objects and services to microservices, presented the current state of the art, and raised some open problems and future challenges. Di Francesco et al. [20] performed a system-
atic mapping study to identify, classify, and evaluate the
current state-of-the-art on architecting microservices from
the following three perspectives: publication trends, focus
of research, and potential for industrial adoption. Pahl et
al. [21] conducted a systematic mapping study on the mo-
tivation, architecture, methods, techniques, and challenges
of microservices. Alshuqayran et al. [19] conducted a sys-
tematic study on identifying architectural challenges, the
architectural diagrams/views, and quality attributes related
to microservice systems. Jamshidi et al. [3] reported the
current situation, benefits, evolution, and future challenges
of microservices. Aderaldo et al. [75] presented an initial set
of requirements for a candidate microservice benchmark
system to be used in research on software architecture.
These previous research efforts conducted valuable systematic studies of the literature, but they do not represent
the practices in industry. Our work conducts an extensive
questionnaire survey along with interviews with industrial
microservice experts, and can help better understand the
state-of-the-practice and the real challenges remaining to be
addressed for future research.
Taibi et al. [23] conducted a questionnaire survey completed
by 21 practitioners and focused on processes, motivation,
and issues for migration toward a microservice architecture.
Di Francesco et al. [22] performed an empirical study on migra-
tion practices toward the adoption of microservices in in-
dustry. They collected information utilizing interviews and
questionnaires on the activities and the challenges during
the migration. Haselböck et al. [76] investigated the importance
of different areas of microservice design. Ten microservice
experts were interviewed to understand the importance
and relevance of the microservice design areas. Bogner et al. [77] contributed a qualitative study with insights into
industry adoption and implementation of microservices by
analyzing 14 service-based systems during 17 interviews.
They focused on applied technologies, microservices char-
acteristics, and the perceived influence on software quality.
Zhang et al. [24] carried out a series of industrial interviews
with 13 companies of different types to investigate the gap
between the ideal visions and real industrial practices and
the benefits of microservices from the industrial experiences.
Ford [78] conducted a survey on the state of microservices
practices in industry, including continuous deployment and
automated testing, containers and Kubernetes, and integration
with legacy applications. Taibi et al. [25] identified a taxon-
omy of 20 anti-patterns, including both organizational ones
and technical ones. Richards [27] reported 10 common microservice anti-patterns and pitfalls. Taibi and Lenarduzzi [26] col-
lected evidence of microservice-specific bad practices and
classified them into a catalog of 11 microservice-specific bad
smells.
These preceding research efforts reported insightful find-
ings on microservice systems in practice but do not classify systems or investigate the underlying patterns across multiple dimensions of diversity. Our work not only derives the ma-
turity levels of microservice systems but also draws a com-
prehensive roadmap associated with inherent challenges.
Our work also explores potential research opportunities by
observing industry experiences and obtains many in-depth
findings on practice patterns and ecosystem characteristics.
8 CONCLUSION
In this article, we have reported an empirical study, in-
cluding an online survey and a series of interviews, that
is designed to advance our understanding of microservice
practices in industry and remaining challenges that may
lead to valuable research. As a part of our findings, we have
identified three maturity levels of microservice systems:
independent development and deployment, high scalabil-
ity and availability, and service ecosystem, based on the
fulfilled benefits of microservices. We have also identified
11 practical issues that restrict the microservice capabilities
of organizations and the corresponding practices and chal-
lenges.
For practitioners, our findings can help them position
their microservice systems at a proper level according to
the business needs and determine what infrastructures and
capabilities are worth investing in. More concretely, they can
learn the issues that restrict their capabilities and the prac-
tices that they can follow to address the issues. In this way,
they can set a roadmap of continuous improvement. For
researchers, our findings can help them understand the sit-
uation of industrial microservice practices and the needs of
microservice systems of different levels. They may identify
potential research opportunities to address the challenges in
the current microservice practices.
In future work, we plan to investigate the research
problems identified in this work and at the same time
work together with our industrial collaborators to establish
a shared benchmark for microservice research. In addition,
we plan to further explore the emerging technical problems
in service ecosystems.
REFERENCES
[1] J. Lewis and M. Fowler, “Microservices,” 2014. [Online]. Available:
https://martinfowler.com/articles/microservices.html
[2] J. Thönes, “Microservices,” IEEE Software, vol. 32, no. 1, pp. 116–116, 2015.
[3] P. Jamshidi, C. Pahl, N. C. Mendonça, J. Lewis, and S. Tilkov,
“Microservices: The journey so far and challenges ahead,” IEEE
Software, vol. 35, no. 3, pp. 24–35, 2018.
[4] Amazon.Com, “Amazon,” 2017. [Online]. Available: https:
//d0.awsstatic.com/whitepapers/microservices-onaws.pdf
[5] Netflix.Com, “Netflix,” 2019. [Online]. Available: https://www.
netflix.com/
[6] Wechat.Com, “Wechat,” 2019. [Online]. Available: https://www.
wechat.com/
[7] Ebay.Com, “Ebay developers program,” 2019. [Online]. Available:
https://developer.ebay.com/docs
[8] X. Zhou, X. Peng, T. Xie, J. Sun, C. Ji, W. Li, and D. Ding, “Fault
analysis and debugging of microservice systems: Industrial sur-
vey, benchmark system, and empirical study,” IEEE Transactions
on Software Engineering, pp. 1–1, 2018.
[9] Docker.Com, “Docker,” 2019. [Online]. Available: https://docker.
com/
[10] Kubernetes.Io, “Kubernetes,” 2019. [Online]. Available: https:
//kubernetes.io/
[11] Docker.Com, “Docker swarm,” 2019. [Online]. Available: https:
//docs.docker.com/swarm/
[12] Mesos.Com, “Mesos,” 2019. [Online]. Available: http://mesos.
apache.org/
[13] C. Pahl, “Containerization and the paas cloud,” IEEE Cloud Com-
puting, vol. 2, no. 3, pp. 24–31, 2015.
[14] L. Bass, I. Weber, and L. Zhu, DevOps: A Software Architect’s
Perspective. Addison-Wesley Professional, 2015.
[15] J. Humble and D. Farley, Continuous Delivery: Reliable Software
Releases through Build, Test, and Deployment Automation. Pearson
Education, 2010.
[16] Wikipedia.Org, “Chaos engineering,” 2019. [Online]. Available:
https://en.wikipedia.org/wiki/Chaos_engineering
[17] M. Roberts, “Serverless architectures,” 2018. [Online]. Available:
https://martinfowler.com/articles/serverless.html
[18] W. Morgan, “What’s a service mesh? and why do i need
one?” 2017. [Online]. Available: https://buoyant.io/2017/04/25/
whats-a-service-mesh-and-why-do-i-need-one/
[19] N. Alshuqayran, N. Ali, and R. Evans, “A systematic mapping
study in microservice architecture,” in Proceedings of the 9th Inter-
national Conference on Service-Oriented Computing and Applications,
2016, pp. 44–51.
[20] P. D. Francesco, I. Malavolta, and P. Lago, “Research on archi-
tecting microservices: Trends, focus, and potential for industrial
adoption,” in Proceedings of the 2017 IEEE International Conference
on Software Architecture, 2017, pp. 21–30.
[21] C. Pahl and P. Jamshidi, “Microservices: A systematic mapping
study,” in Proceedings of the 6th International Conference on Cloud
Computing and Services Science, 2016, pp. 137–146.
[22] P. D. Francesco, P. Lago, and I. Malavolta, “Migrating towards
microservice architectures: An industrial survey,” in Proceedings of
the 2018 IEEE International Conference on Software Architecture, 2018,
pp. 29–39.
[23] D. Taibi, V. Lenarduzzi, and C. Pahl, “Processes, motivations, and
issues for migrating to microservices architectures: An empirical
investigation,” IEEE Cloud Computing, vol. 4, no. 5, pp. 22–32, 2017.
[24] H. Zhang, S. Li, Z. Jia, C. Zhong, and C. Zhang, “Microservice
architecture in reality: An industrial inquiry,” in Proceedings of the
2019 IEEE International Conference on Software Architecture, 2019, pp.
51–60.
[25] D. Taibi, V. Lenarduzzi, and C. Pahl, Microservices Anti-patterns:
A Taxonomy. Cham: Springer International Publishing, 2020, pp.
111–128.
[26] D. Taibi and V. Lenarduzzi, “On the definition of microservice bad
smells,” IEEE Software, 2018.
[27] M. Richards, Microservices AntiPatterns and Pitfalls. O’Reilly
Media, Inc., 2016.
[28] J. Soldani, D. A. Tamburri, and W.-J. V. D. Heuvel, “The pains
and gains of microservices: A systematic grey literature review,”
Journal of Systems and Software, vol. 146, pp. 215–232, 2018.
[29] C. Richardson, “A pattern language for microservices,” 2014. [On-
line]. Available: https://microservices.io/patterns/index.html
[30] T. Waring and D. Wainwright, “Issues and challenges in the use of
template analysis: Two comparative case studies from the field,”
Electronic Journal of Business Research Methods, vol. 6, no. 1, pp. 85–
94, 2008.
[31] H. Kallio, A.-M. Pietilä, M. Johnson, and M. Kangasniemi, “Sys-
tematic methodological review: developing a framework for a
qualitative semi-structured interview guide,” Journal of Advanced
Nursing, vol. 72, no. 12, pp. 2954–2965, 2016.
[32] C. Richardson, “Choosing a microservices deployment
strategy,” 2015. [Online]. Available: https://www.nginx.com/
blog/deploying-microservices/
[33] Microsoft.Com, “Communication in a microser-
vice architecture,” 2018. [Online]. Available:
https://docs.microsoft.com/en-us/dotnet/architecture/
microservices/architect-microservice-container-applications/
communication-in-microservice-architecture
[34] C. Richardson, “Building microservices: Inter-process
communication in a microservices architecture,”
2015. [Online]. Available: https://www.nginx.com/blog/
building-microservices-inter-process-communication/
[35] S. Newman, “Pattern: Backends for frontends,” 2015. [Online].
Available: https://samnewman.io/patterns/architectural/bff/
[36] C. Richardson, “Building microservices: Using an api
gateway,” 2015. [Online]. Available: https://www.nginx.com/
blog/building-microservices-using-an-api-gateway/
[37] ——, “Service discovery in a microservices architecture,”
2015. [Online]. Available: https://www.nginx.com/blog/
service-discovery-in-a-microservices-architecture/
[38] Logstash.Com, “Logstash,” 2018. [Online]. Available: https:
//www.elastic.co/products/logstash
[39] Elasticsearch.Com, “Elasticsearch,” 2018. [Online]. Available:
https://www.elastic.co/products/elasticsearch
[40] Kibana.Com, “Kibana,” 2018. [Online]. Available: https://www.
elastic.co/products/kibana
[41] Zipkin.Com, “Zipkin,” 2016. [Online]. Available: https://zipkin.
io/
[42] T. Clemson, “Testing strategies in a microservice architecture,”
2014. [Online]. Available: https://martinfowler.com/articles/
microservice-testing/
[43] X. Zhou, X. Peng, T. Xie, J. Sun, C. Ji, D. Liu, Q. Xiang, and C. He,
“Latent error prediction and fault localization for microservice
applications by learning from system trace logs,” in Proceedings
of the 27th ACM Joint Meeting on European Software Engineering
Conference and Symposium on the Foundations of Software Engineering,
2019, pp. 683–694.
[44] S. Logic, “Mutable and immutable infrastructure,”
2019. [Online]. Available: https://www.sumologic.com/insight/
mutable-immutable-infrastructure/
[45] H. Virdó, “What is immutable infrastructure?” 2019. [Online].
Available: https://www.digitalocean.com/community/tutorials/
what-is-immutable-infrastructure
[46] Y. Lin, X. Peng, Y. Cai, D. Dig, D. Zheng, and W. Zhao, “Interactive
and guided architectural refactoring with search-based recom-
mendation,” in Proceedings of the 24th ACM SIGSOFT International
Symposium on Foundations of Software Engineering, 2016, pp. 535–
546.
[47] Cloudflare.Com, “What is baas? backend-as-a-service vs.
serverless,” 2019. [Online]. Available: https://www.cloudflare.
com/learning/serverless/glossary/backend-as-a-service-baas/
[48] Wikipedia.Org, “Function as a service,” 2019. [Online]. Available:
https://en.wikipedia.org/wiki/Function_as_a_service
[49] F. Rademacher, J. Sorgalla, and S. Sachweh, “Challenges of
domain-driven microservice design: A model-driven perspective,”
IEEE Software, vol. 35, no. 3, pp. 36–43, 2018.
[50] A. Bremler-Barr, E. Brosh, and M. Sides, “Ddos attack on cloud
auto-scaling mechanisms,” in Proceedings of the IEEE INFOCOM
2017 - IEEE Conference on Computer Communications, 2017, pp. 1–9.
[51] Microsoft.Com, “Resiliency patterns,” 2019. [On-
line]. Available: https://docs.microsoft.com/en-us/azure/
architecture/patterns/category/resiliency
[52] M. Fowler, “Bluegreendeployment,” 2010. [Online]. Available:
https://martinfowler.com/bliki/BlueGreenDeployment.html
[53] D. Sato, “Canaryrelease,” 2014. [Online]. Available: https:
//martinfowler.com/bliki/CanaryRelease.html
[54] Kubernetes.Io, “Performing a rolling update,” 2019.
[Online]. Available: https://kubernetes.io/docs/tutorials/
kubernetes-basics/update/update-intro/
[55] Netflix, “Netflix eureka,” 2019. [Online]. Available: https:
//github.com/Netflix/eureka
[56] Apache.Org, “Apache zookeeper,” 2019. [Online]. Available:
http://zookeeper.apache.org/
[57] Wikipedia.Org, “Domain-driven design,” 2019. [Online]. Avail-
able: https://en.wikipedia.org/wiki/Domain-driven_design
[58] L. Xiao, Y. Cai, R. Kazman, R. Mo, and Q. Feng, “Identifying
and quantifying architectural debt,” in Proceedings of the 38th
International Conference on Software Engineering, 2016, pp. 488–498.
[59] R. Mo, Y. Cai, R. Kazman, L. Xiao, and Q. Feng, “Decoupling level:
A new metric for architectural maintenance complexity,” in Pro-
ceedings of the 38th International Conference on Software Engineering,
2016, pp. 499–510.
[60] Microsoft.Com, “Strangler pattern,” 2017. [Online]. Avail-
able: https://docs.microsoft.com/en-us/azure/architecture/
patterns/strangler
[61] ——, “Anti-corruption layer pattern,” 2017. [On-
line]. Available: https://docs.microsoft.com/en-us/azure/
architecture/patterns/anti-corruption-layer
[62] Seata.Io, “Seata,” 2019. [Online]. Available: https://seata.io/
en-us/
[63] Redis.Io, “Redis,” 2016. [Online]. Available: https://redis.io/
[64] Opentracing.Io, “What is distributed tracing?” 2019.
[Online]. Available: https://opentracing.io/docs/overview/
what-is-tracing/
[65] Microsoft.Com, “Visual studio remote debugger,” 2019. [On-
line]. Available: https://docs.microsoft.com/en-us/visualstudio/
debugger/remote-debugging?view=vs-2019
[66] Istio.Io, “Traffic management,” 2019. [Online]. Available: https:
//istio.io/docs/concepts/traffic-management/
[67] ——, “Sidecar,” 2019. [Online]. Available: https://istio.io/docs/
reference/config/networking/v1alpha3/sidecar/
[68] C. Wang, X. Peng, M. Liu, Z. Xing, X. Bai, B. Xie, and T. Wang,
“A learning-based approach for automatic construction of domain
glossary from source code and documentation,” in Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 97–108.
[69] S. Hassan and R. Bahsoon, “Microservices and their design trade-
offs: A self-adaptive roadmap,” in Proceedings of the 2016 IEEE
International Conference on Services Computing, 2016, pp. 813–818.
[70] C. Phipathananunth and P. Bunyakiati, “Synthetic runtime moni-
toring of microservices software architecture,” in Proceedings of the
42nd IEEE Annual Computer Software and Applications Conference,
2018, pp. 448–453.
[71] F. Pina, J. Correia, R. Filipe, F. Araújo, and J. Cardoso, “Nonin-
trusive monitoring of microservice-based systems,” in Proceedings
of the 17th IEEE International Symposium on Network Computing and
Applications, 2018, pp. 1–8.
[72] X. Zhou, X. Peng, T. Xie, J. Sun, W. Li, C. Ji, and D. Ding,
“Delta debugging microservice systems,” in Proceedings of the 33rd
ACM/IEEE International Conference on Automated Software Engineer-
ing, 2018, pp. 802–807.
[73] Q. Du, T. Xie, and Y. He, “Anomaly detection and diagnosis for
container-based microservices with performance monitoring,” in
Proceedings of the 18th International Conference on Algorithms and
Architectures for Parallel Processing, 2018, pp. 560–572.
[74] N. Dragoni, S. Giallorenzo, A. Lluch-Lafuente, M. Mazzara,
F. Montesi, R. Mustafin, and L. Safina, “Microservices: Yesterday,
today, and tomorrow,” CoRR, vol. abs/1606.04036, 2016.
[75] C. M. Aderaldo, N. C. Mendonça, C. Pahl, and P. Jamshidi, “Bench-
mark requirements for microservices architecture research,” in
Proceedings of the 1st International Workshop on Establishing the
Community-Wide Infrastructure for Architecture-Based Software Engi-
neering, 2017, pp. 8–13.
[76] S. Haselböck, R. Weinreich, and G. Buchgeher, “An expert in-
terview study on areas of microservice design,” in Proceedings of
the 11th Conference on Service-Oriented Computing and Applications,
2018, pp. 137–144.
[77] J. Bogner, J. Fritzsch, S. Wagner, and A. Zimmermann, “Microser-
vices in industry: Insights into technologies, characteristics, and
software quality,” in Proceedings of the 2019 IEEE International
Conference on Software Architecture Companion, 2019, pp. 187–195.
[78] N. Ford, The State of Microservices Maturity. O’Reilly Media, Inc., 2018.