
Johan Tordsson, PhD
Associate Professor at Umeå University
About
99 Publications · 24,358 Reads
3,842 Citations
Introduction
Autonomic systems + Cloud native computing + Microservices
Current institution: Umeå University
Additional affiliations: January 2005 – present
Publications (99)
Disaggregated systems have a novel architecture motivated by the requirements of resource-intensive applications such as social networking, search, and in-memory databases. The total amount of resources such as memory and CPU cores is very large in such systems. However, the distributed topology of disaggregated server systems results in non-uniform...
Although repeatability and reproducibility are essential in science, failed attempts to replicate results across diverse fields made some scientists argue for a reproducibility crisis. In response, several high-profile venues within computing established artifact evaluation tracks, a systematic procedure for evaluating and badging research artifact...
Microservice-based architectures consist of numerous, loosely coupled services with multiple instances. Service meshes aim to simplify traffic management and prevent microservice overload through circuit breaking and request retry mechanisms. Previous studies have demonstrated that the static configuration of these mechanisms is unfit for the dynam...
Microservice-based architectures have become ubiquitous in large-scale software systems. Experimental distributed systems researchers constantly propose enhanced resource management mechanisms. These mechanisms need to be evaluated using both realistic and flexible microservice benchmarks that enable studying how diverse application characteristics...
A microservice architecture features hundreds or even thousands of small loosely coupled services with multiple instances. Because microservice performance depends on many factors including the workload, inter-service traffic management is complex in such dynamic environments. Service meshes aim to handle this complexity and to facilitate managemen...
Although High Performance Computing (HPC) users understand basic resource requirements such as the number of CPUs and memory limits, internal infrastructural utilization data is exclusively leveraged by cluster operators, who use it to configure batch schedulers. This task is challenging and increasingly complex due to ever larger cluster scales an...
Causal inference (CI) is a popular performance diagnosis method that infers anomaly propagation from observed data to locate root causes. Although some specific CI methods have been employed in the literature, the overall performance of this class of methods on microservice performance diagnosis is not well understood. To...
Following the adoption of cloud computing, the proliferation of cloud data centers in multiple regions, and the emergence of computing paradigms such as fog computing, there is a need for integrated and efficient management of geodistributed clusters. Geo-distributed deployments suffer from resource fragmentation, as the resources in certain locati...
Microservice architectures are increasingly adopted to design large-scale applications. However, the highly distributed nature and complex dependencies of microservices complicate automatic performance diagnosis and make it challenging to guarantee service level agreements (SLAs). In particular, identifying the culprits of a microservice performanc...
Site Reliability Engineers are at the center of two tensions: On one hand, they need to respond to alerts within a short time, to restore a non-functional system. On the other hand, short response times are disruptive to everyday life and lead to alert fatigue. To alleviate this tension, many resource management mechanisms have been proposed to handle overlo...
Data stream processing is an attractive paradigm for analyzing IoT data at the edge of the Internet before transmitting processed results to a cloud. However, the relative scarcity of fog computing resources combined with the workloads' nonstationary properties make it impossible to allocate a static set of resources for each application. We propos...
Microservices represent a popular paradigm to construct large-scale applications in many domains thanks to benefits such as scalability, flexibility, and agility. However, it is difficult to manage and operate a microservice system due to its high dynamics and complexity. In particular, the frequent updates of microservices lead to the absence of h...
Despite the abundant research in cloud autoscaling, autoscaling in Kubernetes, arguably the most popular cloud platform today, is largely unexplored. Kubernetes' Cluster Autoscaler can be configured to select nodes either from a single node pool (CA) or from multiple node pools (CA-NAP). We evaluate and compare these configurations using two repres...
The extreme adoption rate of container technologies along with raised security concerns have resulted in the development of multiple alternative container runtimes targeting security through additional layers of indirection. In an apples-to-apples comparison, we deploy three runtimes in the same Kubernetes cluster, the security focused Kata and gVi...
As resources in geo-distributed environments are typically located in remote sites characterized by high latency and intermittent network connectivity, delays and transient network failures are common between the management layer and the remote resources. In this paper, we show that delays and transient network failures coupled with static configur...
Software architecture is undergoing a transition from monolithic architectures to microservices to achieve resilience, agility, and scalability in software development. However, with microservices it is difficult to diagnose performance issues due to technology heterogeneity, the large number of microservices, and frequent updates to both software feat...
Despite the ubiquitous adoption of cloud computing and a very rich set of services offered by cloud providers, current systems lack efficient and flexible mechanisms to collaborate among multiple cloud sites. In order to guarantee resource availability during peaks in demand and to fulfill service level objectives, cloud service providers cap resou...
Data stream processing (DSP) is an interesting computation paradigm in geo-distributed infrastructures such as Fog computing because it allows one to decentralize the processing operations and move them close to the sources of data. However, any decomposition of DSP operators onto a geo-distributed environment with large and heterogeneous network l...
High Performance Computing (HPC) and Data Intensive (DI) workloads have been executed on separate clusters using different tools for resource and application management. With increasing convergence, where modern applications are composed of both types of jobs in complex workflows, this separation becomes a growing overhead and the need for a common...
Energy management has become increasingly necessary in data centers to address all energy‐related costs, including capital costs, operating expenses, and environmental impacts. Heterogeneous systems with mixed hardware architectures provide both throughput and processing efficiency for different specialized application types and thus have a potenti...
Virtualization solutions based on hypervisors or containers are enabling technologies for scalable, flexible, and cost-effective resource sharing. As the fundamental limitations of each technology are yet to be understood, they need to be regularly reevaluated to better understand the trade-off provided by latest technological advances. This paper...
Resource sharing is an inherent characteristic of cloud data centers. Virtual Machines (VMs) and/or Containers that are co-located in the same physical server often compete for resources leading to interference. The noisy neighbor’s effect refers to an anomaly caused by a VM/container limiting resources accessed by another one. Our main contributio...
The Mobile Cloud Network is an emerging cost and capacity heterogeneous distributed cloud topological paradigm that aims to remedy the application performance constraints imposed by centralised cloud infrastructures. A centralised cloud infrastructure and the adjoining Telecom network will struggle to accommodate the exploding amount of traffic gen...
Complete data center failures may occur due to disastrous events such as earthquakes or fires. To attain robustness against such failures and reduce the probability of data loss, data must be replicated in another data center sufficiently geographically separated from the original data center. Implementing geo-replication is expensive as every data...
To reduce the congestion due to the future bandwidth-hungry applications in domains such as Health care, Internet of Things (IoT), etc., we study the benefit of introducing additional Data Centers (DCs) closer to the network edge for the optimal application placement. Our study shows that the edge layer DCs in a Mobile Edge Network (MEN) infrastruc...
Numerous auto-scaling strategies have been proposed in the past few years for improving various Quality of Service (QoS) indicators of cloud applications, for example, response time and throughput, by adapting the amount of resources assigned to the application to meet the workload demand. However, the evaluation of a proposed auto-scaler is usuall...
Current open issues regarding cloud computing include the support for nontrivial Quality of Service-related Service Level Objectives (SLOs) and reducing the energy footprint of data centers. One strategy that can contribute to both is the integration of accelerators as specialized resources within the cloud system. In particular, Field Programmable...
To meet the challenges of consistent performance, low communication latency, and a high degree of user mobility, cloud and Telecom infrastructure vendors and operators foresee a Mobile Cloud Network that incorporates public cloud infrastructures with cloud augmented Telecom nodes in forthcoming mobile access networks. A Mobile Cloud Network is comp...
Cloud applications are growing more and more complex because of user mobility, hardware heterogeneity, and their multi-component nature. Today's cloud infrastructure paradigm, based on distant data centers, is not able to provide consistent performance and low enough communication latency for future applications. These discrepancies can be accommodated u...
We investigate methods for detection of rapid workload increases (load spikes) for cloud workloads. Such rapid and unexpected workload spikes are a main cause of poor performance or even crashing applications, as the allocated cloud resources become insufficient. Detecting spikes early is fundamental to performing corrective management actions, li...
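The spike-detection idea above can be illustrated with a minimal sketch: compare a short-term moving average of the load against a longer-term baseline and flag a spike when their ratio exceeds a threshold. The window sizes and threshold here are illustrative assumptions, not the method from the publication.

```python
from collections import deque

class SpikeDetector:
    """Flag rapid workload increases by comparing a short-term load
    average against a longer-term baseline (illustrative sketch)."""

    def __init__(self, short_window=3, long_window=12, threshold=2.0):
        self.short = deque(maxlen=short_window)
        self.long = deque(maxlen=long_window)
        self.threshold = threshold  # short/long ratio that counts as a spike

    def observe(self, load):
        self.long.append(load)
        self.short.append(load)
        if len(self.long) < self.long.maxlen:
            return False  # not enough history for a baseline yet
        baseline = sum(self.long) / len(self.long)
        recent = sum(self.short) / len(self.short)
        return baseline > 0 and recent / baseline >= self.threshold

detector = SpikeDetector()
for v in [100, 102, 98, 101, 99, 100, 103, 97, 100, 102, 99, 101]:
    detector.observe(v)
print(detector.observe(100))  # steady load: False
detector.observe(400)
print(detector.observe(450))  # sudden ~4x jump: True
```

A sliding-window ratio like this reacts within a few samples, at the cost of false positives on bursty but harmless traffic, which is exactly the trade-off early spike detection has to navigate.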
Video-on-Demand (VoD) and video sharing services account for a large percentage of the total downstream Internet traffic. In order to provide a better understanding of the load on these services, we analyze and model a workload trace from a VoD service provided by a major Swedish TV broadcaster. The trace contains over half a million requests gener...
Workload burstiness and spikes are among the main reasons for service disruptions and decreases in the Quality-of-Service (QoS) of online services. They are hurdles that complicate autonomic resource management of data centers. In this paper, we review the state-of-the-art in online identification of workload spikes and quantifying burstiness. The ap...
Low resource utilization in cloud data centers can be mitigated by overbooking but this increases the risk of performance degradation. We propose a three level Quality of Service (QoS) scheme for overbooked cloud data centers to assure high performance QoS for applications that need it. We design a controller that dynamically maps virtual cores to...
Since first demonstrated by Clark et al. in 2005, live migration of virtual machines has both become a standard feature of hypervisors and created an active field of research. However, the rich ongoing research in live migration focuses mainly on performance improvements to well-known techniques, most of them being variations of the Clark approach. In...
Virtual machine placement is the process of mapping virtual machines to available physical hosts within a datacenter or on a remote datacenter in a cloud federation. Normally, service owners cannot influence the placement of service components beyond choosing datacenter provider and deployment zone at that provider. For some services, however, this...
Resource overbooking is an admission control technique to increase utilization in cloud environments. However, due to uncertainty about future application workloads, overbooking may result in overload situations and deteriorated performance. We mitigate this using brownout, a feedback approach to application performance steering, that ensures grace...
Horizontal elasticity through scale-out is the current dogma for scaling cloud applications but requires a particular application architecture. Vertical elasticity is transparent to applications but less used as scale-up is limited by the size of a single physical server. In this paper, we propose a novel approach, server disaggregation, that aggre...
Energy management has become increasingly necessary in large-scale cloud data centers to address high operational costs and carbon footprints to the environment. In this work, we combine three management techniques that can be used to control cloud data centers in an energy-efficient manner: changing the number of virtual machines, the number of co...
The vision of the CACTOS project is focused on cloud topology optimization. This is realised by providing new types of data centre optimization and simulation mechanisms. CACTOS is holistic and aims to support both design-time and run-time optimization, covers data acquisition and application profiling as well as infrastructure management. One o...
Accurate understanding of workloads is key to efficient cloud resource management as well as to the design of large-scale applications. We analyze and model the workload of Wikipedia, one of the world's largest web sites. With descriptive statistics, time-series analysis, and polynomial splines, we study the trend and seasonality of the workload, i...
Elasticity is key for the cloud paradigm, where the pay-per-use nature provides great flexibility for end-users. However, elasticity complicates long-term capacity planning for cloud providers as the exact amount of resources required over time becomes uncertain. Admission control techniques are thus needed to handle the trade-off between resource...
Until now, most research on cloud service placement has focused on static pricing scenarios, where cloud providers offer fixed prices for their resources. However, with the recent trend of dynamic pricing of cloud resources, where the price of a compute resource can vary depending on the free capacity and load of the provider, new placement algorit...
Cloud computing enables elasticity – rapid provisioning and deprovisioning of computational resources. Elasticity allows cloud users to quickly adapt resource allocation to meet changes in their workloads. For cloud providers, elasticity complicates capacity management as the amount of resources that can be requested by users is unknown and can var...
Despite the potential given by the combination of multi-tenancy and virtualization, resource utilization in today's data centers is still low. We identify three key characteristics of cloud services and infrastructure-as-a-service management practices: burstiness in service workloads, fluctuations in virtual machine resource usage over time, and vi...
We introduce and define the concept of recontextualization for cloud applications by extending contextualization, i.e. the dynamic configuration of virtual machines (VM) upon initialization, with autonomous updates during runtime. Recontextualization allows VM images and instances to be dynamically re-configured without restarts or downtime, and th...
New VM instances are created from static templates that contain the basic configuration of the VM to achieve elasticity with regards to capacity. Instance specific settings can be injected into the VM during the deployment phase through means of contextualization. So far this is limited to a single data source and data remains static throughout the...
The cloud computing landscape has recently developed into a spectrum of cloud architectures, leading to a broad range of management tools for similar operations but specialized for certain deployment scenarios. This both hinders the efficient reuse of algorithmic innovations within cloud management operations and increases the heterogeneity between...
Elasticity is the ability of a cloud infrastructure to dynamically change the amount of resources allocated to a running service as load changes. We build an autonomous elasticity controller that changes the number of virtual machines allocated to a service based on both monitored load changes and predictions of future load. The cloud infrastructur...
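The controller idea above, sizing a service from both monitored load and a prediction of future load, can be sketched minimally as follows. The per-VM capacity, headroom factor, and naive linear-trend predictor are illustrative assumptions, not the controller from the publication.

```python
import math

def plan_capacity(history, per_vm_capacity=100.0, headroom=0.2, min_vms=1):
    """Choose a VM count from the latest monitored load plus a
    one-step-ahead linear-trend prediction (illustrative sketch)."""
    current = history[-1]
    # Naive trend prediction: extrapolate the last observed change.
    predicted = current + (current - history[-2]) if len(history) >= 2 else current
    # Provision for the worse of now vs. predicted, plus safety headroom.
    demand = max(current, predicted) * (1 + headroom)
    return max(min_vms, math.ceil(demand / per_vm_capacity))

print(plan_capacity([300, 380]))  # rising load: predict 460, provision 6 VMs
print(plan_capacity([400, 400]))  # steady load: provision 5 VMs
```

Reacting to the maximum of the current and predicted demand makes scale-up proactive while never provisioning below what is already measured, which is the usual motivation for combining monitoring with prediction.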
Cloud elasticity is the ability of the cloud infrastructure to rapidly change the amount of resources allocated to a service in order to meet the actual varying demands on the service while enforcing SLAs. In this paper, we focus on horizontal elasticity, the ability of the infrastructure to add or remove virtual machines allocated to a service dep...
In the past few years, we have witnessed the proliferation of a heterogeneous ecosystem of cloud providers, each one with a different infrastructure offer and pricing policy. We explore this heterogeneity in a novel cloud brokering approach that optimizes placement of virtual infrastructures across multiple clouds and also abstracts the deployment...
We present an approach to optimal virtual machine placement within datacenters for predictable and time-constrained load peaks. A method for optimal load balancing is developed, based on binary integer programming. For tradeoffs between quality of solution and computation time, we also introduce methods to pre-process the optimization problem before...
Although supported by many contemporary Virtual Machine (VM) hypervisors, live migration is impossible for certain applications. When migrating CPU- and/or memory-intensive VMs, two problems occur: extended migration downtime that may cause service interruption or even failure, and prolonged total migration time that is harmful for the overall syste...
We propose a cloud contextualization mechanism which operates in two stages, contextualization of VM images prior to service deployment (PaaS level) and self-contextualization of VM instances created from the image (IaaS level). The contextualization tools are implemented as part of the OPTIMIS Toolkit, a set of software components for simplified m...
Cloud brokerage mechanisms are fundamental to reduce the complexity of using multiple cloud infrastructures to achieve optimal placement of virtual machines and avoid the potential vendor lock-in problems. However, current approaches are restricted to static scenarios, where changes in characteristics such as pricing schemes, virtual machine types,...
We demonstrate the OPTIMIS toolkit for scalable and dependable service platforms and architectures that enable flexible and dynamic provisioning of Cloud services. The innovations demonstrated are aimed at optimizing Cloud services and infrastructures based on aspects such as trust, risk, eco-efficiency, cost, performance and legal constraints. Ada...
Addressing the management challenges for a multitude of distributed cloud architectures, we focus on the three complementary cloud management problems of predictive elasticity, admission control, and placement (or scheduling) of virtual machines. As these problems are intrinsically intertwined we also propose an approach to optimize the overall sys...
Despite the widespread support for live migration of Virtual Machines (VMs) in current hypervisors, these have significant shortcomings when it comes to migration of certain types of VMs. More specifically, with existing algorithms, there is a high risk of service interruption when migrating VMs with high workloads and/or over low-bandwidth network...
As cloud computing becomes more predominant, the problem of scalability has become critical for cloud computing providers. The cloud paradigm is attractive because it offers a dramatic reduction in capital and operation expenses for consumers.
The cloud based delivery model for IT resources is revolutionizing the IT industry. Despite the marketing hype around “the cloud”, the paradigm itself is in a critical transition state from the laboratories to mass market. Many technical and business aspects of cloud computing need to mature before it is widely adopted for corporate use. For exampl...
Web Services communicate through XML-encoded messages and suffer from substantial overhead due to verbose encoding of transferred messages and extensive (de)serialization at the end-points. We demonstrate that response caching is an effective approach to reduce Internet latency and server load. Our Tantivy middleware layer reduces the volume of dat...
We investigate interoperability aspects of scientific workflow systems and argue that the workflow execution environment, the model of computation (MoC), and the workflow language form three dimensions that must be considered depending on the type of interoperability sought: at the activity, sub-workflow, or workflow levels. With a focus on the pro...
The problem of Grid-middleware interoperability is addressed by the design and analysis of a feature-rich, standards-based framework for all-to-all cross-middleware job submission. The service implements a decentralized brokering policy, striving to optimize the performance for an individual user by minimizing the response time for each jo...
We present algorithms, methods, and software for a Grid resource manager, that performs resource brokering and job scheduling in production Grids. This decentralized broker selects computational resources based on actual job requirements, job characteristics, and information provided by the resources, with the aim to minimize the total time to deli...
Accurate predictions of application execution time have many uses in Grid computing. The predictions can include both estimations of the execution time of an application for a range of problem sizes, and predictions of how an application performs on different Grid resources. The actual problem formulation depends on the intended use of the...
We present an approach for development of Grid resource management tools, where we put into practice internationally established high-level views of future Grid architectures. The approach addresses fundamental Grid challenges and strives towards a future vision of the Grid where capabilities are made available as independent and dynamically as...
We present a generic and light-weight Grid workflow execution engine made available as a Grid service. A long-term goal is to facilitate the rapid development of application-oriented end-user workflow tools, while providing a high degree of Grid middleware-independence. The workflow engine is designed for workflow execution, independent of clie...
We propose a multi-tiered architecture for middleware-independent Grid job management. The architecture consists of a number of services for well-defined tasks in the job management process, offering complete user-level isolation of service capabilities, multiple layers of abstraction, control, and fault tolerance. The middleware abstraction layer...
We present the architecture and implementation of a grid resource broker and job submission service, designed to be as independent as possible of the grid middleware used on the resources. The overall architecture comprises seven general components and a few conversion and integration points where all middleware-specific issues are handled. The imp...
This contribution presents our experiences from developing an advanced course in grid computing, aimed at application and infrastructure developers. The course was intended for computer science students with extensive programming experience and previous knowledge of distributed systems, parallel computing, computer networking, and security. The pre...
A resource broker is a central component in a grid environment. The purpose of the broker is to dynamically find, characterize, and allocate the resources most suitable to the user's applications.
This contribution presents algorithms, methods, and software for a Grid resource manager, responsible for resource brokering and scheduling in early production Grids. The broker selects computing resources based on actual job requirements and a number of criteria identifying the available resources, with the aim to minimize the total time to de...
We examine some issues that arise when using both local and Grid resources in scientific workflows. Our previous work addresses and illustrates the benefits of a light-weight and generic workflow engine that manages and optimizes Grid resource usage. Extending on this effort, we here illustrate how a client tool for bioinformatics applications empl...
The emergence of Grid computing infrastructures enables researchers to share resources and collaborate in more efficient ways than before, despite belonging to different organizations and being distanced geographically. While the Grid computing paradigm offers new opportunities, it also gives rise to new difficulties. One such problem is the selec...
We present the architecture and implementation of a Grid resource broker and job submission service, designed to be as independent as possible of the Grid middleware used on the resources. The implementation is based on state-of-the-art Grid and Web services technology as well as existing and emerging standards (WSRF, JSDL, GLUE, WS-Agreement)...
This contribution presents the ongoing development of a resource manager for use in early production grids. Even though our main focus is to develop a stable brokering facility for current production grids, we also address features needed in further improved resource managers for future enhanced grid infrastructures. The primary target environmen...