Article

Abstract

Large data centers are complex systems that depend on several generations of hardware and software components, ranging from legacy mainframes and rack-based appliances to modular blade servers and modern rack scale design solutions. To cope with this heterogeneity, the data center manager must coordinate a multitude of tools, protocols, and standards. Currently, data center managers, standardization bodies, and hardware/software manufacturers are joining efforts to develop and promote Redfish as the main hardware management standard for data centers, and even beyond the data center. The authors hope that this article can be used as a starting point to understand how Redfish and its extensions are being targeted as the main management standard for next-generation data centers. This article describes Redfish and the recent collaborations to leverage this standard.
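As a concrete illustration of the kind of interface the abstract refers to, the sketch below queries a Redfish service root over HTTPS in Python. It is a minimal example, not taken from the article; the BMC address and credentials are placeholders.

    # Minimal sketch: query the Redfish service root of a BMC.
    # The address, credentials, and TLS handling are placeholder assumptions.
    import requests

    BMC = "https://192.0.2.10"        # placeholder BMC address
    AUTH = ("admin", "password")      # placeholder credentials

    # Every Redfish service exposes its service root at /redfish/v1/.
    resp = requests.get(f"{BMC}/redfish/v1/", auth=AUTH, verify=False, timeout=10)
    resp.raise_for_status()
    root = resp.json()

    # The service root advertises the top-level resource collections.
    print(root.get("RedfishVersion"))
    print(root.get("Systems", {}).get("@odata.id"))   # e.g. /redfish/v1/Systems
    print(root.get("Chassis", {}).get("@odata.id"))   # e.g. /redfish/v1/Chassis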


... Third, it needs monitoring-specific agents and plugins on each monitored remote node. To overcome these in-band limitations of Nagios, we propose and implement the integration of the Nagios Core with the state-of-the-art Out-Of-Band (OOB) Redfish telemetry model and interface [6] [7] [8] [9]. Redfish is an open and scalable industry standard designed to enable data center operators to manage, monitor, and control data center resources. ...
... It also provides other system management functions, including remote power control and interaction with the basic input/output system (BIOS). Significant progress has been made in the arena of BMC hardware and software, including Redfish and the Redfish telemetry model [6][7][8][9]. The following subsections describe OOB protocols relevant to this study. ...
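The snippet above mentions remote power control through the BMC. A hedged sketch of how such an out-of-band action can be issued over Redfish follows; the system URI, address, and credentials are placeholders, and the reset types a given BMC accepts are advertised in the action's @Redfish.AllowableValues.

    # Sketch of an out-of-band power action through a BMC's Redfish service.
    import requests

    BMC = "https://192.0.2.10"               # placeholder BMC address
    AUTH = ("admin", "password")             # placeholder credentials
    SYSTEM = f"{BMC}/redfish/v1/Systems/1"   # placeholder system resource

    # Standard Redfish reset action on a ComputerSystem resource.
    action = f"{SYSTEM}/Actions/ComputerSystem.Reset"
    resp = requests.post(action, json={"ResetType": "GracefulRestart"},
                         auth=AUTH, verify=False, timeout=10)
    print(resp.status_code)  # 200/202/204 indicate the BMC accepted the request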
Preprint
Full-text available
Current monitoring tools for high-performance computing (HPC) systems are often inefficient in terms of scalability and interfacing with modern data center management APIs. This inefficiency leads to a lack of effective management of the infrastructure of modern data centers. Nagios is one of the widely used industry-standard tools for data center infrastructure monitoring, which mainly includes monitoring of nodes and associated hardware and software components. However, current Nagios monitoring has special requirements that introduce several limitations. First, significant human effort is needed for the configuration of monitored nodes in the Nagios server. Second, the Nagios Remote Plugin Executor and the Nagios Service Check Acceptor are required on the Nagios server and each monitored node for active and passive monitoring, respectively. Third, Nagios monitoring also requires monitoring-specific agents on each monitored node. These shortcomings are inherently due to Nagios’ in-band implementation nature. To overcome these limitations, we introduced Redfish-Nagios, a scalable out-of-band monitoring tool for modern HPC systems. It integrates the Nagios server with the out-of-band Distributed Management Task Force’s Redfish telemetry model, which is implemented in the baseboard management controller of the nodes. This integration eliminates the requirements of any agent, plugin, hardware component, or configuration on the monitored nodes. It is potentially a paradigm shift in Nagios-based monitoring for two reasons. First, it simplifies communication between the Nagios server and monitored nodes. Second, it saves computational costs by removing the requirements of running complex Nagios-native protocols and agents on the monitored nodes. The Redfish-Nagios integration methodology enables the monitoring of next-generation HPC systems using the scalable and modern Redfish telemetry model and interface.
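To make the agentless idea more tangible, the following sketch shows what a Nagios-style check driven by out-of-band Redfish data could look like. It is not the Redfish-Nagios tool described above, only an illustration; the BMC address, credentials, and system ID are assumptions.

    #!/usr/bin/env python3
    # Hypothetical Nagios-style check that reads a node's health out-of-band
    # from its BMC via Redfish instead of an in-band agent.
    import sys
    import requests

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

    def main():
        bmc, auth = "https://192.0.2.10", ("admin", "password")  # placeholders
        try:
            r = requests.get(f"{bmc}/redfish/v1/Systems/1",
                             auth=auth, verify=False, timeout=10)
            r.raise_for_status()
            health = r.json().get("Status", {}).get("Health", "Unknown")
        except requests.RequestException as exc:
            print(f"UNKNOWN - Redfish query failed: {exc}")
            return UNKNOWN
        if health == "OK":
            print("OK - node health reported as OK by the BMC")
            return OK
        if health == "Warning":
            print("WARNING - node health degraded")
            return WARNING
        print(f"CRITICAL - node health is {health}")
        return CRITICAL

    if __name__ == "__main__":
        sys.exit(main())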
... It is intended to cover additional data center subsystems, namely power and cooling [10]. However, until such standards are commercially supported, data center managers must rely on existing commercial and open-source islands of solutions. ...
Article
A data center infrastructure is composed of heterogeneous resources divided into three main subsystems: IT (processors, memory, disks, network, etc.), power (generators, power transformers, uninterruptible power supplies, distribution units, among others), and cooling (water chillers, pipes, and cooling towers). This heterogeneity makes it challenging to collect and gather data from the many devices in the infrastructure, and extracting relevant information from that data is a further challenge for data center managers. To improve cloud availability, the entire infrastructure must be monitored with a variety of (open source and/or commercial) monitoring tools, such as Zabbix, Nagios, Prometheus, CloudWatch, and AzureWatch. It is common to use several monitoring systems to collect real-time data from components across the different subsystems, which creates the inherent challenge of aggregating and organizing all of the collected infrastructure data and measurements. This first step is necessary before any valuable insight for decision-making can be obtained. In this paper, we present the Data Center Availability (DCA) System, a software system that aggregates and analyzes data center measurements for the study of DCA. We also discuss the DCA implementation and illustrate its operation by monitoring a small university research laboratory data center. The DCA System monitors different types of devices, such as servers, switches, and power devices, through the Zabbix tool, and automatically identifies the failure-time seasonality and trend present in the data collected from the data center's devices.
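The abstract mentions automatically identifying failure-time seasonality and trend. A minimal sketch of that kind of analysis is given below, assuming a daily failure series and a weekly seasonal period; it is an illustration, not the DCA System's actual implementation.

    # Minimal sketch: trend/seasonality identification on failure data.
    # The synthetic series and the 7-day period are assumptions.
    import pandas as pd
    from statsmodels.tsa.seasonal import seasonal_decompose

    # Daily failure counts collected from the monitoring tool (synthetic here).
    idx = pd.date_range("2023-01-01", periods=120, freq="D")
    failures = pd.Series([(i % 7) + i // 30 for i in range(120)], index=idx)

    # Additive decomposition with an assumed weekly seasonality.
    result = seasonal_decompose(failures, model="additive", period=7)
    print(result.trend.dropna().tail())   # long-term trend component
    print(result.seasonal.head(7))        # repeating weekly pattern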
Article
Next-generation cloud data centers are based on software-defined data center infrastructures that promote flexibility, automation, optimization, and scalability. The Redfish standard and the Intel Rack Scale Design technology enable software-defined infrastructure and disaggregate bare-metal compute, storage, and networking resources into virtual pools to dynamically compose resources and create virtual performance-optimized data centers (vPODs) tailored to workload-specific demands. This article proposes four chassis design configurations, based on the Distributed Management Task Force's Redfish industry standard, for composing vPOD systems: a fully shared design, a partially shared homogeneous design, a partially shared heterogeneous design, and a not-shared design; they differ mainly in the level of hardware disaggregation employed. Furthermore, we propose models that combine reliability block diagram and stochastic Petri net modeling approaches to represent the complex relationship between the pool of disaggregated hardware resources and their power and cooling sources in a vPOD. The four proposed design configurations were analyzed and compared in terms of availability and component sensitivity indexes by scaling their configurations across different data center infrastructures. The results show that, in general, increasing hardware disaggregation improves availability. However, beyond a given point, the availability of the fully shared, partially shared homogeneous, and partially shared heterogeneous configurations remains almost equal, while the not-shared configuration is still able to improve its availability.
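For readers unfamiliar with reliability block diagrams, the sketch below shows the basic series/parallel availability arithmetic that underlies such comparisons. It is not the authors' combined RBD/SPN model, and the MTTF/MTTR values are illustrative assumptions.

    # Back-of-the-envelope RBD arithmetic (illustrative values only).
    def availability(mttf_h, mttr_h):
        """Steady-state availability of a single block."""
        return mttf_h / (mttf_h + mttr_h)

    def series(*blocks):
        """All blocks required: availabilities multiply."""
        a = 1.0
        for b in blocks:
            a *= b
        return a

    def parallel(*blocks):
        """Redundant blocks: the set fails only if all blocks fail."""
        u = 1.0
        for b in blocks:
            u *= (1.0 - b)
        return 1.0 - u

    power = availability(50_000, 8)     # e.g. a power distribution unit
    cooling = availability(40_000, 12)  # e.g. a chiller
    node = availability(30_000, 4)      # a compute sled
    # Two redundant nodes sharing one power source and one cooling source.
    print(f"{series(power, cooling, parallel(node, node)):.6f}")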
Article
Traditional data center infrastructure suffers from a lack of standard and ubiquitous management solutions. Despite the contributions achieved so far, existing tools lack interoperability and are hardware-dependent. Vendors are already actively participating in the specification and design of new standard software and hardware interfaces within different forums. Nevertheless, the complexity and variety of data center infrastructure components, which include servers, cooling, networking, and power hardware, coupled with the introduction of the software-defined data center paradigm, has led to the parallel development of a myriad of standardization efforts. In an attempt to shed light on recent works, we survey and discuss the main standardization efforts for traditional data center infrastructure management.
Article
Full-text available
Cloud data center providers benefit from software-defined infrastructure, since it promotes flexibility, automation, and scalability. The new paradigm of software-defined infrastructure helps face the management challenges of a large-scale infrastructure and guarantee service level agreements with established availability levels. Assessing the availability of a data center remains a complex task, as it requires gathering information about a complex infrastructure and generating accurate models to estimate its availability. This paper addresses this gap by proposing a methodology to automatically acquire a data center's hardware configuration and assess, through models, its availability. The proposed methodology leverages the emerging standardized Redfish API and relevant modeling frameworks. Through this approach, we analyzed the availability benefits of migrating from a conventional data center infrastructure (a Performance Optimized Data center, POD, with redundant servers) to a next-generation virtual Performance Optimized Data center (a virtual POD, vPOD, composed of a pool of disaggregated hardware resources). Results show that vPOD improves availability compared to conventional data center configurations.
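A hedged sketch of the inventory-acquisition step that such a methodology relies on is shown below: it walks the Redfish Systems collection and records a few standard ComputerSystem properties. The endpoint and credentials are placeholders, and this is not the authors' tooling.

    # Sketch: acquire a hardware inventory out-of-band via Redfish.
    import requests

    BMC = "https://192.0.2.10"      # placeholder endpoint
    AUTH = ("admin", "password")    # placeholder credentials

    def get(path):
        r = requests.get(f"{BMC}{path}", auth=AUTH, verify=False, timeout=10)
        r.raise_for_status()
        return r.json()

    systems = get("/redfish/v1/Systems")
    for member in systems.get("Members", []):
        sysinfo = get(member["@odata.id"])
        print(sysinfo.get("Model"),
              sysinfo.get("ProcessorSummary", {}).get("Count"),
              sysinfo.get("MemorySummary", {}).get("TotalSystemMemoryGiB"),
              sysinfo.get("Status", {}).get("Health"))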
Article
This paper provides an overview of Software-Defined “Hardware” Infrastructures (SDHI). SDHI builds upon the concept of hardware (HW) resource disaggregation, which breaks today's physical server-oriented model in which the use of a physical resource (e.g., processor or memory) is constrained to a physical server's chassis. SDHI extends the definition of Software-Defined Infrastructures (SDI) and brings greater modularity, flexibility, and extensibility to cloud infrastructures, thus allowing cloud operators to employ resources more efficiently and allowing applications not to be bound by the physical infrastructure's layout. This paper aims to be an initial introduction to SDHI and its associated technological advancements. It starts with an overview of the cloud domain and puts into perspective some of the most prominent efforts in the area. It then presents a set of differentiating use cases that SDHI enables, states the fundamentals behind SDI and SDHI, and elaborates on why SDHI is of great interest today. Moreover, it provides an overview of the functional architecture of a cloud built on SDHI, exploring how the impact of this transformation reaches far beyond the cloud infrastructure level to platforms, execution environments, and applications. Finally, an in-depth assessment is made of the technologies behind SDHI, their impact, and the associated challenges and potential future directions of SDHI.
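To connect the disaggregation idea to the Redfish documents listed in the references below, here is a hedged sketch of how a client might compose a logical system from free resource blocks through the Redfish Composition Service. The pod-manager address, credentials, and exact payload are assumptions and vary by implementation.

    # Sketch: compose a system from disaggregated resource blocks
    # via the Redfish Composition Service (illustrative URIs and payload).
    import requests

    POD_MANAGER = "https://192.0.2.20"   # placeholder pod/rack manager
    AUTH = ("admin", "password")         # placeholder credentials

    # Pick two free resource blocks (e.g., a compute block and a storage block).
    blocks = requests.get(
        f"{POD_MANAGER}/redfish/v1/CompositionService/ResourceBlocks",
        auth=AUTH, verify=False, timeout=10).json()
    chosen = [m["@odata.id"] for m in blocks.get("Members", [])[:2]]

    # Request a new composed ComputerSystem bound to those blocks.
    payload = {
        "Name": "composed-node-1",
        "Links": {"ResourceBlocks": [{"@odata.id": b} for b in chosen]},
    }
    resp = requests.post(f"{POD_MANAGER}/redfish/v1/Systems",
                         json=payload, auth=AUTH, verify=False, timeout=10)
    print(resp.status_code, resp.headers.get("Location"))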
Article
The rapid growth of cloud computing, both in the spectrum and in the volume of cloud workloads, necessitates revisiting the traditional datacenter design based on rack-mountable servers. Next-generation datacenters need to offer enhanced support for: (i) fast-changing system configuration requirements due to workload constraints, (ii) timely adoption of emerging hardware technologies, and (iii) maximal sharing of systems and subsystems in order to lower costs. Disaggregated datacenters, constructed as a collection of individual resources such as CPU, memory, and disks, and composed into workload execution units on demand, are an interesting new trend that can address the above challenges. In this paper, we demonstrate the feasibility of composable systems by building a rack-scale composable system prototype using a PCIe switch. Through empirical approaches, we develop an assessment of the opportunities and challenges of leveraging the composable architecture for rack-scale cloud datacenters, with a focus on big data and NoSQL workloads. In particular, we compare and contrast the programming models that can be used to access the composable resources, and develop the implications for network and resource provisioning and management in rack-scale architectures.
Cloud Central Office Reference Architectural Framework
G. Karagiannis, "Cloud Central Office Reference Architectural Framework," Broadband Forum TR-384, 2018.
Redfish Composability White Paper
DMTF, "Redfish Composability White Paper," DSP2050, 2018.
Redfish Interoperability Profiles
DMTF, "Redfish Interoperability Profiles," DSP0272 v1.1.0; https://www.dmtf.org/sites/default/files/standards/documents/DSP0272_1.1.0.pdf, 2019, accessed May 2019.
Redfish for Networking White Paper
DMTF, "Redfish for Networking White Paper," DSP2047 v0.1.0; https://www.dmtf.org/sites/default/files/standards/documents/DSP2047_0.1.0.pdf, 2017, accessed June 2019.
OCP's Hardware Management/SpecsAndDesigns
OCP, "Hardware Management/SpecsAndDesigns"; https://www.opencompute.org/wiki/Hardware_Management/SpecsAndDesigns, 2018, accessed May 2019.