
Peter Garraghan - Lecturer at Lancaster University
About
53 Publications · 20,259 Reads · 1,916 Citations
Publications (53)
Understanding the characteristics and patterns of workloads within a Cloud computing environment is critical in order to improve resource management and operational conditions while Quality of Service (QoS) guarantees are maintained. Simulation models based on realistic parameters are also urgently needed for investigating the impact of these workl...
Cloud computing providers are under great pressure to reduce operational costs through improved energy utilization while provisioning dependable service to customers; it is therefore extremely important to understand and quantify the explicit impact of failures within a system in terms of energy costs. This paper presents the first comprehensive an...
Simulation is critical when studying real operational behavior of increasingly complex Cyber-Physical Systems, forecasting future behavior, and experimenting with hypothetical scenarios. A critical aspect of simulation is the ability to evaluate large-scale systems within a reasonable time frame while modeling complex interactions between millions...
Large-scale distributed systems in Cloud datacenters are capable of provisioning service to consumers with diverse business requirements. Providers face pressure to provision uninterrupted, reliable services while reducing operational costs in the face of significant software and hardware failures. A widely used means to achieve such a goal is using redundan...
Large Language Models (LLMs) guardrail systems are designed to protect against prompt injection and jailbreak attacks. However, they remain vulnerable to evasion techniques. We demonstrate two approaches for bypassing LLM prompt injection and jailbreak detection systems via traditional character injection methods and algorithmic Adversarial Machine...
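To make the character-injection idea concrete, here is a minimal, self-contained sketch: a toy keyword matcher stands in for a guardrail, and interleaving zero-width Unicode characters hides a trigger phrase from it. The `naive_guardrail` function and its blocklist are illustrative placeholders, not any production detection system.

```python
# Minimal sketch of character injection: hiding a payload from a naive
# keyword-based guardrail by interleaving zero-width characters.
# `naive_guardrail` is a toy stand-in, not a real detection system.

ZWSP = "\u200b"  # zero-width space, invisible when rendered

def inject_zero_width(text: str) -> str:
    """Interleave zero-width spaces between the characters of `text`."""
    return ZWSP.join(text)

def naive_guardrail(prompt: str, blocklist=("ignore previous instructions",)) -> bool:
    """Return True if the prompt is flagged (toy substring matcher)."""
    return any(term in prompt.lower() for term in blocklist)

payload = "Ignore previous instructions"
print(naive_guardrail(payload))                     # True  -> blocked
print(naive_guardrail(inject_zero_width(payload)))  # False -> evades the matcher
```

Real guardrails are far more robust than a substring match; the sketch only shows why character-level perturbations are a natural evasion vector.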
Large Language Models (LLMs) have become widely used across various domains spanning search engines, code generation, and text creation. However, a major concern associated with their adoption is the high cost of inference, impacting both their sustainability and financial feasibility. In this study, we empirically study how different prompt and re...
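As a rough illustration of why prompt and response length matter, the sketch below estimates inference energy under the simplifying assumption that cost scales linearly with token counts, with decode tokens costing more than prefill tokens. The per-token figures are invented placeholders, not measurements from the study.

```python
# Back-of-envelope estimator for LLM inference cost as a function of
# prompt and response length. Per-token energy figures are hypothetical.

JOULES_PER_PROMPT_TOKEN = 0.5    # assumed prefill cost per token
JOULES_PER_OUTPUT_TOKEN = 2.0    # assumed decode cost per token

def inference_energy_joules(prompt_tokens: int, output_tokens: int) -> float:
    return (prompt_tokens * JOULES_PER_PROMPT_TOKEN
            + output_tokens * JOULES_PER_OUTPUT_TOKEN)

# Under these assumptions, trimming the response saves far more than
# trimming the prompt by the same number of tokens.
print(inference_energy_joules(1000, 500))  # 1500.0 J
print(inference_energy_joules(500, 500))   # 1250.0 J
print(inference_energy_joules(1000, 250))  # 1000.0 J
```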
As Information and Communication Technology (ICT) use has become more prevalent, there has been a growing concern in how its associated greenhouse gas emissions will impact the climate. Estimating such ICT emissions is a difficult undertaking due to its complexity, its rapidly changing nature, and the lack of accurate and up-to-date data on individ...
The heterogeneity of Deep Learning models, libraries, and hardware poses an important challenge for improving model inference performance. Auto-tuners address this challenge via automatic tensor program optimization towards a target-device. However, auto-tuners incur a substantial time cost to complete given their design necessitates performing ten...
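The measure-and-select loop at the heart of an auto-tuner can be sketched in a few lines: time several candidate tile sizes for a blocked matrix multiply and keep the fastest. Real auto-tuners search vastly larger schedule spaces and use learned cost models; the kernel and candidate set here are purely illustrative.

```python
# Toy auto-tuner: random search over tile sizes for a blocked matmul,
# timing each candidate on the host and keeping the fastest.
import random
import time

import numpy as np

def blocked_matmul(a, b, tile):
    n = a.shape[0]
    c = np.zeros((n, n))
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                c[i:i+tile, j:j+tile] += a[i:i+tile, k:k+tile] @ b[k:k+tile, j:j+tile]
    return c

def tune(n=256, candidates=(16, 32, 64, 128), trials=4):
    a, b = np.random.rand(n, n), np.random.rand(n, n)
    best_tile, best_time = None, float("inf")
    for tile in random.sample(candidates, min(trials, len(candidates))):
        start = time.perf_counter()
        blocked_matmul(a, b, tile)            # measure this candidate
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_tile, best_time = tile, elapsed
    return best_tile, best_time

print(tune())
```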
Today's data centers use substantial amounts of the world's electrical supply. However, in line with circular economy concepts, much of this energy can be reused. Such reuse includes the heating of buildings, but also commodity dehydration, electricity production and energy storage. This multi-disciplinary paper presents several novel applications...
Deep Learning (DL) models increasingly power a diversity of applications. Unfortunately, this pervasiveness also makes them attractive targets for extraction attacks which can steal the architecture, parameters, and hyper-parameters of a targeted DL model. Existing extraction attack studies have observed varying levels of attack success for differe...
Deep learning and artificial intelligence are often viewed as panacea technologies — ones which can decarbonise many industries. But what is the carbon cost of these systems? Damian Borowiec, Richard R. Harper and Peter Garraghan discuss.
Modern large-scale computing systems distribute jobs into multiple smaller tasks which execute in parallel to accelerate job completion rates and reduce energy consumption. However, a common performance problem in such systems is dealing with straggler tasks: slow-running instances that increase the overall response time. Such tasks can sig...
The worldwide adoption of cloud data centers (CDCs) has given rise to the ubiquitous demand for hosting application services on the cloud. Further, contemporary data-intensive industries have seen a sharp upsurge in the resource requirements of modern applications. This has led to the provisioning of an increased number of cloud servers, giving ris...
Distributed systems have been an active field of research for over 60 years and have played a crucial role in computer science, enabling the invention of the Internet that underpins all facets of modern life. Through technological advancements and their changing role in society, distributed systems have undergone a perpetual evolution, with each ch...
In light of the current climate crisis, a holistic approach to infrastructural matters regarding energy, communication, data and sustainable communities, as well as the water-food-energy nexus in general, is critical. One enabler for building sustainable communities around the Globe is ICT (information and communications technology). In the near fu...
To accelerate the training of Deep Learning (DL) models, clusters of machines equipped with hardware accelerators such as GPUs are leveraged to reduce execution time. State-of-the-art resource managers are needed to increase GPU utilization and maximize throughput. While co-locating DL jobs on the same GPU has been shown to be effective, this can i...
Technology providers make heavy use of edge‐cloud data centers (ECDCs) to meet user demand, yet ECDCs are large energy consumers. To reduce the energy expenditure of ECDCs, task placement is one of the most prominent solutions for the effective allocation and consolidation of tasks onto physical machines (PMs). Such...
Cloud computing systems split compute- and data-intensive jobs into smaller tasks and execute them in parallel using clusters to improve execution time. However, such systems at increasing scale are exposed to stragglers, whereby abnormally slow-running tasks executing within a job substantially affect job completion performance. Su...
Deep Learning (DL) models are deployed as jobs within machines containing GPUs. These DL systems - ranging from a singular GPU device to machine clusters - require state-of-the-art resource management to increase resource utilization and job throughput. While it has been identified that co-location - multiple jobs co-located within the same GPU - i...
Workload prediction has been widely researched in the literature. However, existing techniques are per‐job based and useful for service‐like tasks whose workloads exhibit seasonality and trend. But cloud jobs have many different workload patterns and some do not exhibit recurring workload patterns. We consider job‐pool‐based workload estimation, wh...
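A minimal sketch of the job-pool idea: when a task lacks usable history of its own, fall back to the average of a pool of similar jobs. The grouping key (job name) and the sample data below are placeholders for illustration, not the paper's estimator.

```python
# Job-pool-based workload estimation sketch: group past observations by
# job name and estimate new work from the pool average.
from collections import defaultdict
from statistics import mean

history = [
    # (job_name, observed CPU usage)
    ("batch-etl", 0.62), ("batch-etl", 0.58), ("batch-etl", 0.65),
    ("web-frontend", 0.21), ("web-frontend", 0.25),
]

pools = defaultdict(list)
for name, cpu in history:
    pools[name].append(cpu)

def estimate(job_name: str, default: float = 0.5) -> float:
    """Pool average if the pool exists, else a conservative default."""
    samples = pools.get(job_name)
    return mean(samples) if samples else default

print(estimate("batch-etl"))      # ~0.617, from the pool
print(estimate("brand-new-job"))  # 0.5 fallback, no history yet
```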
Current cloud computing frameworks host millions of physical servers that provision cloud computing resources in the form of virtual machines (VMs). Cloud Data Center (CDC) infrastructures require significant amounts of energy to deliver large-scale computational services. Computing nodes generate large volumes of heat, requiring cooling uni...
Current cloud computing frameworks host millions of physical servers that provision cloud computing resources in the form of virtual machines. Cloud Data Center (CDC) infrastructures require significant amounts of energy to deliver large-scale computational services. Moreover, computing nodes generate large volumes of heat, requiring coolin...
This chapter presents a scalable software‐defined orchestration architecture to intelligently compose and orchestrate thousands of heterogeneous Fog appliances (devices, servers). Specifically, it provides a resource filtering‐based resource assignment mechanism to optimize the resource utilization and fair resource sharing among multitenant Intern...
It is a long-standing challenge to achieve a high degree of resource utilization in cluster scheduling. Resource oversubscription has become a common practice for improving resource utilization and reducing cost. However, current centralized approaches to oversubscription suffer from resource mismatch and fail to take into account ot...
Cloud computing plays a critical role in modern society and enables a range of applications from infrastructure to social media. Such systems must cope with varying load and evolving usage patterns, reflecting society's interaction with and dependency on automated computing systems, whilst satisfying Quality of Service (QoS) guarantees. Enabling these systems are a...
Containerized clusters of machines at scale that provision Cloud services are encountering substantive difficulties with stragglers -- whereby a small subset of executing tasks degrades overall system performance. Stragglers are an unsolved challenge due to a wide variety of root-causes and stochastic behavior. While there have been efforts to mi...
Since the conception of cloud computing, ensuring its ability to provide highly reliable service has been of the utmost importance and criticality to the business objectives of providers and their customers. This has held true for every facet of the system, encompassing applications, resource management, the underlying computing infrastructure, and...
A long-standing challenge in cluster scheduling is to achieve a high degree of utilization of heterogeneous resources in a cluster. In practice there exists a substantial disparity between perceived and actual resource utilization. A scheduler might regard a cluster as fully utilized if a large resource request queue is present, but the actual reso...
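The perceived/actual disparity is easy to illustrate: a scheduler that counts reserved capacity as "used" sees near-full machines, while monitoring reports much lower real usage, leaving headroom that oversubscription can reclaim. The numbers below are invented for illustration.

```python
# Perceived vs. actual utilization, and the headroom between them.
machines = [
    # (reserved_cores, actually_used_cores, capacity_cores)
    (30, 12.0, 32),
    (28, 20.5, 32),
    (31, 9.0, 32),
]

for reserved, used, cap in machines:
    perceived = reserved / cap    # what the scheduler believes is used
    actual = used / cap           # what monitoring reports
    headroom = reserved - used    # reclaimable via oversubscription
    print(f"perceived {perceived:.0%}, actual {actual:.0%}, "
          f"headroom {headroom:.1f} cores")
```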
Cloud computing represents a paradigm shift in provisioning on-demand computational resources underpinned by data center infrastructure, which now constitutes 1.5% of worldwide energy consumption. Such consumption is not merely limited to operating IT devices, but encompasses cooling systems representing 40% of total data center energy usage. Given th...
The global uptake of Cloud computing has attracted increased interest within both academia and industry, resulting in the formation of large-scale and complex distributed systems. This has led to increased failure occurrences within computing systems, which induce substantial negative impact upon system performance and task reliability perceived by use...
Energy consumed by Cloud datacenters has dramatically increased, driven by rapid uptake of applications and services globally provisioned through virtualization. By applying energy-aware virtual machine scheduling, Cloud providers are able to achieve enhanced energy efficiency and reduced operation cost. Energy consumption of datacenters consists o...
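One common way to cast energy-aware scheduling, sketched below with invented numbers, is a greedy placement that assigns each VM to the host with the smallest estimated power increase. The power model (zero draw when off, linear when active) is a textbook simplification rather than the paper's model; it naturally rewards consolidation, since waking an idle host pays the idle power cost.

```python
# Greedy energy-aware VM placement under an assumed host power model.
IDLE_W, PEAK_W = 100.0, 250.0  # hypothetical host power envelope

def power(util: float) -> float:
    # zero draw when off; idle cost plus a linear term when active
    return 0.0 if util == 0 else IDLE_W + (PEAK_W - IDLE_W) * util

def place(vm_load: float, hosts: list[float]) -> int:
    """Return the index of the feasible host with the smallest power delta."""
    best_i, best_delta = -1, float("inf")
    for i, util in enumerate(hosts):
        if util + vm_load > 1.0:
            continue  # would overload this host
        delta = power(util + vm_load) - power(util)
        if delta < best_delta:
            best_i, best_delta = i, delta
    return best_i

hosts = [0.0, 0.7, 0.5]   # current CPU utilization per host
print(place(0.25, hosts)) # 1: consolidates onto an already-active host
```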
Large-scale Internet of Things (IoT) services such as healthcare, smart cities, and marine monitoring are pervasive in cyber-physical environments strongly supported by Internet technologies and fog computing. Complex IoT services are increasingly composed of sensors, devices, and compute resources within fog computing infrastructures. The orchestr...
Cloud datacenters are compute facilities formed of hundreds to thousands of heterogeneous servers that require significant power to operate effectively. Servers are composed of multiple interacting sub-systems, including applications, microelectronic processors, and cooling, which reflect their respective power profiles via different param...
Scheduling is a core component within distributed systems to determine optimal allocation of tasks within servers. This is challenging within modern Cloud computing systems - comprising millions of tasks executing in thousands of heterogeneous servers. Theoretical scheduling is capable of providing complete and sophisticated algorithms towards a si...
Large-scale IoT services such as health-care, smart cities and marine monitoring are pervasive in Cyber-physical environments strongly supported by Internet technologies and Fog computing. Complex IoT services are increasingly composed of sensors, devices, and compute resources within Fog computing infrastructures. The orchestration of such applica...
Increased complexity and scale of virtualized distributed systems have resulted in the manifestation of emergent phenomena substantially affecting overall system performance. This phenomenon is known as the “Long Tail”, whereby a small proportion of task stragglers significantly impede job completion time. While work focuses on straggler detection and mi...
Task stragglers hinder effective parallel job execution in Cloud datacenters, resulting in late-timing failures due to the violation of specified timing constraints. Straggler-tolerant methods such as speculative execution provide limited effectiveness due to (i) lack of precise straggler root-cause knowledge and (ii) straggler identification occurr...
The ability of servers to effectively execute tasks within Cloud datacenters varies due to heterogeneous CPU and memory capacities, resource contention situations, network configurations and operational age. Unexpectedly slow server nodes (node-level stragglers) result in assigned tasks becoming task-level stragglers, which dramatically impede para...
Cloud computing systems face the substantial challenge of the Long Tail problem: a small subset of straggling tasks significantly impede parallel job completion. This behavior results in longer service response times and degraded system utilization. Speculative execution, which creates task replicas at runtime, is a typical method deployed in large...
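The core of speculative execution can be sketched simply: flag tasks whose elapsed time exceeds some multiple of the median progress, launch replicas for them, and accept whichever copy finishes first. The 1.5x threshold and task data below are illustrative.

```python
# Straggler flagging for speculative execution: tasks running longer
# than `factor` x the median elapsed time get a replica.
from statistics import median

def find_stragglers(elapsed: dict[str, float], factor: float = 1.5) -> list[str]:
    med = median(elapsed.values())
    return [t for t, e in elapsed.items() if e > factor * med]

running = {"task-1": 42.0, "task-2": 40.5, "task-3": 118.0, "task-4": 39.8}
for task in find_stragglers(running):
    print(f"{task}: launching speculative replica")  # first finisher wins
```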
In cloud computing, good resource management can benefit both cloud users as well as cloud providers. Workload prediction is a crucial step towards achieving good resource management. While it is possible to estimate the workloads of long-running tasks based on the periodicity in their historical workloads, it is difficult to do so for tasks which...
The increasing complexity and scale of distributed systems have resulted in the manifestation of emergent behavior which substantially affects overall system performance. A significant emergent property is that of the "Long Tail", whereby a small proportion of task stragglers significantly impact job execution completion times. To mitigate such beha...
Utility computing is an increasingly important paradigm whereby computing resources are provided on-demand as utilities. An important component of utility computing is storage: data volumes are growing rapidly, and mechanisms to mitigate this growth need to be developed. Data deduplication is a promising technique for drastically reducing the amou...
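A minimal sketch of deduplication: split data into fixed-size chunks, index each chunk by its SHA-256 digest, and store a chunk only the first time it is seen; a file is then just a "recipe" of digests. Production systems typically use content-defined (variable-size) chunking instead of the fixed chunking shown here.

```python
# Content-based deduplication with fixed-size chunks and a digest index.
import hashlib

CHUNK = 4096  # bytes

def dedup(data: bytes, store: dict[str, bytes]) -> list[str]:
    """Store unique chunks; return the recipe (list of digests)."""
    recipe = []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)   # write only if unseen
        recipe.append(digest)
    return recipe

store: dict[str, bytes] = {}
blob = b"A" * 8192 + b"B" * 4096 + b"A" * 4096   # repeated content
recipe = dedup(blob, store)
print(len(recipe), "chunks referenced,", len(store), "stored")  # 4 vs 2
```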
As Cloud computing becomes the dominant computational model in information technology, Cloud security is becoming a major issue in Cloud adoption, and is considered one of the most critical concerns for large Cloud customers (i.e., governments and enterprises). This valid concern is mainly driven by the Multi-Tenancy situation...
Cloud computing research is in great need of statistical parameters derived from the analysis of real-world systems. One aspect of this is the failure characteristics of Cloud environments composed of workloads and servers; currently, few metrics are available that quantify failure and repair times of workloads and servers at large scale. Workloa...
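Two of the metrics the abstract refers to, Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR), can be computed directly from a failure/repair event log; the sketch below uses invented timestamps.

```python
# MTBF / MTTR from a toy event log of (failed_at, repaired_at) hours.
events = [(10.0, 12.5), (40.0, 41.0), (95.0, 99.5)]  # illustrative data

repair_times = [r - f for f, r in events]
mttr = sum(repair_times) / len(repair_times)

# time between the starts of consecutive failures
gaps = [events[i + 1][0] - events[i][0] for i in range(len(events) - 1)]
mtbf = sum(gaps) / len(gaps)

print(f"MTTR = {mttr:.2f} h, MTBF = {mtbf:.2f} h")
# steady-state availability under these definitions:
print(f"availability ~ {mtbf / (mtbf + mttr):.3f}")
```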
Computing Clouds are typically characterized as large-scale systems that exhibit dynamic behavior due to variance in workload. However, how exactly these characteristics affect the dependability of Cloud systems remains unclear. Furthermore, provisioning reliable service within a Cloud federation, which involves the orchestration of multiple Clouds...
Understanding the resource utilization and server characteristics of large-scale systems is crucial if service providers are to optimize their operations whilst maintaining Quality of Service. For large-scale data centers, identifying the characteristics of resource demand and the current availability of such resources allows system managers to des...
Analyzing behavioral patterns of workloads is critical to understanding Cloud computing environments. However, until now only a limited number of real-world Cloud data center trace logs have been available for analysis. This has led to a lack of methodologies to capture the diversity of patterns that exist in such datasets. This paper presents the...
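One simple way to group workload behavior patterns, shown below on synthetic data, is to cluster tasks by resource-usage features with k-means. The features and the two synthetic "patterns" are illustrative, not the paper's methodology.

```python
# Clustering tasks by (cpu, mem) usage with a tiny k-means.
import numpy as np

rng = np.random.default_rng(0)
# synthetic "trace": two behavioral patterns in usage space
tasks = np.vstack([
    rng.normal([0.2, 0.3], 0.05, (50, 2)),   # light tasks
    rng.normal([0.7, 0.6], 0.05, (50, 2)),   # heavy tasks
])

def kmeans(x, k=2, iters=20):
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        # assign each task to its nearest centroid
        labels = np.argmin(((x[:, None] - centers) ** 2).sum(-1), axis=1)
        # recompute centroids, keeping old ones for empty clusters
        centers = np.array([x[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers, labels

centers, labels = kmeans(tasks)
print("pattern centroids (cpu, mem):")
print(centers.round(2))
```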
Dependability is a critical concern in provisioning services in Cloud Computing environments. This is true when considering reliability, an attribute of dependability that is a critical and challenging problem in a Cloud context [2]. Fault-tolerance is one means to attain reliability, and is typically implemented by using some form of diversity. Fe...
Cloud computing has emerged as a popular paradigm that enables the establishment of large-scale, flexible computing infrastructures that can offer significant cost savings for both businesses and consumers by allowing compute resources to be scaled dynamically to deal with current or anticipated usage [1]. This concept has been further strengthened w...