Book

The Grid: Blueprint for a New Computing Infrastructure

Authors: Ian Foster, Carl Kesselman
... In a World of Warcraft-like MMO, GuildWars, the player pays for game packages, but not for access to the servers (though that access is still needed to play). Still, it is debatable whether games themselves have changed, or whether it is simply the marketing of games that has undergone a shift. ...
... Servers, web services, and other lingo relating to computation architecture are not entirely compatible with the way services are conceived of in this paper. For an example, see [8]. ...
... This paper concentrates on player services. For a compatible model of the expanded game experience, see [12]. Note that MMOs are not the only type of game that ties a product into a service: for example, alternate reality games such as Majestic also require an active service element [23]. ...
... By and large, data mining systems that have been developed to date for clusters, distributed clusters and grids have assumed that the processors are the scarce resource, and hence shared. When processors become available, the data is moved to the processors, the computation is started, and results are computed and returned [7]. To simplify, this is the supercomputing model, and, in the ... [Figure 2: A Sector server provides file locating and file access services (together with distributed storage and routing services) to any Sector client.] ...
... In general, anyone in the public can read data from Sector. In contrast, systems such as GFS [8] and Hadoop [4] are targeted towards organizations (only users with accounts can read and write data), while systems such as Globus [7] are targeted towards virtual organizations (anyone with access to a node running GSI [7] and having an account can read and write data). Also, unlike some peer-to-peer systems, while reading data is open, writing data in Sector is controlled through access control lists. ...
Preprint
We describe a cloud-based infrastructure that we have developed, optimized for wide-area, high-performance networks and designed to support data mining applications. The infrastructure consists of a storage cloud called Sector and a compute cloud called Sphere. We describe two applications that we have built using the cloud and some experimental studies.
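The supercomputing-versus-Sector contrast in the excerpt above is essentially "move data to compute" versus "move compute to data". A minimal sketch of the latter idea, with hypothetical node and segment names (this is an illustration of the concept, not Sector/Sphere's actual API):

```python
# Minimal sketch of the "move compute to data" idea: a user-defined
# function is shipped to each node and applied to the data segments
# already stored there, instead of moving the data to a central pool
# of processors. Node and segment names are illustrative only.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for data segments resident on storage nodes.
NODE_SEGMENTS = {
    "node-a": [[1, 2, 3], [4, 5]],
    "node-b": [[6, 7], [8, 9, 10]],
}

def apply_udf_locally(node, udf):
    """Apply a user-defined function to every segment held by `node`."""
    return [udf(segment) for segment in NODE_SEGMENTS[node]]

def sphere_style_run(udf):
    # Each node processes its own segments in parallel; only the small
    # per-segment results travel over the network.
    with ThreadPoolExecutor() as pool:
        futures = {n: pool.submit(apply_udf_locally, n, udf) for n in NODE_SEGMENTS}
        return {n: f.result() for n, f in futures.items()}

if __name__ == "__main__":
    print(sphere_style_run(sum))  # {'node-a': [6, 9], 'node-b': [13, 27]}
```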
... The single-sweep method is easy to use and implement, does not require a priori knowledge of the free energy landscape, and can be applied to map free energies in several variables (up to four, as demonstrated here, and probably more). The single-sweep method is also very efficient, especially since the mean force calculations can be performed using independent calculations on distributed processors (i.e. using grid computing facilities [14,15]). ...
... The calculations of these time averages are independent of each other, and hence they can be distributed, using (ideally) at least one processor per center z_k, an approach that optimally fits the purposes of grid computing [14,15]. The estimator in (11) has the advantage of being simple, but it introduces an error due to the finiteness of κ. ...
... N_max in (15) must be much larger than K in (2) to achieve the same accuracy). This is because the leveling out achieved by the integral term in (14), and hence the convergence of the representation (15), only occurs statistically [24,25] (in contrast, the mean force data used at each center in the single-sweep method already contains all the statistical information needed at that center). This is consistent with metadynamics being in essence a histogram method, albeit one where the histogram windows are adjusted on the fly. ...
Preprint
A simple, efficient, and accurate method is proposed to map multi-dimensional free energy landscapes. The method combines the temperature-accelerated molecular dynamics (TAMD) proposed in [Maragliano & Vanden-Eijnden, Chem. Phys. Lett. 426, 168 (2006)] with a variational reconstruction method using radial-basis functions for the representation of the free energy. TAMD is used to rapidly sweep through the important regions of the free energy landscape and compute the gradient of the free energy locally at points in these regions. The variational method is then used to reconstruct the free energy globally from the mean force at these points. The algorithmic aspects of the single-sweep method are explained in detail, and the method is tested on simple examples, compared to metadynamics, and finally used to compute the free energy of the solvated alanine dipeptide in two and four dihedral angles.
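In outline, the reconstruction step described above is a radial-basis-function fit to the computed mean forces. A sketch using Gaussian basis functions (the paper's exact normalization and minimization details may differ):

```latex
% Free energy represented with radial basis functions centered at the
% TAMD-swept points z_k, where the mean forces f_k (estimates of the
% free-energy gradient) were computed locally:
A(z) \approx \sum_{k=1}^{K} a_k \,\varphi_\sigma\!\left(|z - z_k|\right),
\qquad \varphi_\sigma(r) = e^{-r^2/2\sigma^2}.
% The coefficients a_k and the width \sigma are chosen variationally,
% by minimizing the mismatch between the gradient of the representation
% and the computed mean forces:
E(a,\sigma) = \sum_{k=1}^{K} \bigl|\nabla_z A(z_k) - f_k\bigr|^2 .
```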
... In multi-agent systems (MAS), efficient context management is crucial for maintaining response consistency, especially in dynamic, real-time environments with interdependent topics [Ferber, 1999, Shehory and Kraus, 1998a]. Traditionally, MAS configurations have leveraged centralized databases [Bernstein and Goodman, 1986] or distributed grids [Foster and Kesselman, 2001] to optimize data management. However, these approaches may become less effective when applied to LLM-based systems where memory limitations and context overflow are significant concerns. ...
... Huang and Chen [2023] introduced techniques for noise filtering in collaborative LLM environments, showing that reducing noise in inter-agent communication improves response consistency. However, as highlighted by Durfee [1991] and Foster and Kesselman [2001], the overhead of noise reduction can still impact system scalability, particularly in high-load environments. ...
Preprint
Large Language Models (LLMs) are increasingly utilized in multi-agent systems (MAS) to enhance collaborative problem-solving and interactive reasoning. Recent advancements have enabled LLMs to function as autonomous agents capable of understanding complex interactions across multiple topics. However, deploying LLMs in MAS introduces challenges related to context management, response consistency, and scalability, especially when agents must operate under memory limitations and handle noisy inputs. While prior research has explored optimizing context sharing and response latency in LLM-driven MAS, these efforts often focus on either fully centralized or decentralized configurations, each with distinct trade-offs. In this paper, we develop a probabilistic framework to analyze the impact of shared versus separate context configurations on response consistency and response times in LLM-based MAS. We introduce the Response Consistency Index (RCI) as a metric to evaluate the effects of context limitations, noise, and inter-agent dependencies on system performance. Our approach differs from existing research by focusing on the interplay between memory constraints and noise management, providing insights into optimizing scalability and response times in environments with interdependent topics. Through this analysis, we offer a comprehensive understanding of how different configurations impact the efficiency of LLM-driven multi-agent systems, thereby guiding the design of more robust architectures.
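Purely as a toy illustration of the kind of quantity such a metric captures (this is not the paper's RCI definition; the scoring rule below is invented for the example), one can score the fraction of responses that do not contradict facts still held in a bounded shared context:

```python
# Hypothetical toy illustration of a response-consistency score for a
# multi-agent setup with a bounded shared context. NOT the paper's RCI
# definition, only a sketch of the kind of quantity involved.
from collections import deque

def toy_consistency_index(responses, context_window=4):
    """Fraction of responses consistent with the facts still in context."""
    context = deque(maxlen=context_window)  # memory cap: old facts fall out
    consistent = 0
    for topic, claim in responses:
        prior = next((c for t, c in context if t == topic), None)
        if prior is None or prior == claim:
            consistent += 1
        context.append((topic, claim))
    return consistent / len(responses)

if __name__ == "__main__":
    log = [("a", 1), ("b", 2), ("a", 1), ("c", 3), ("a", 9)]  # last contradicts
    print(f"toy consistency = {toy_consistency_index(log):.2f}")  # 0.80
```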
... The proliferation of distributed computing systems has revolutionized computational task processing, necessitating the development of efficient job scheduling techniques. Distributed computing environments (DCEs) are characterized by complex architectures involving multiple independent computing nodes that collaborate to execute tasks (Foster & Kesselman, 2003). In such environments, efficient job scheduling is critical to optimizing resource utilization, minimizing processing times, and enhancing overall system performance (Buyya et al., 2009). ...
... Grid computing involves the integration of heterogeneous, geographically distributed resources across multiple administrative domains, presenting unique challenges for resource allocation, fault tolerance, and load balancing (Foster & Kesselman, 2003). Traditional centralized schedulers struggle in such environments due to scalability limitations, single points of failure, and administrative autonomy conflicts (Buyya & Venugopal, 2005). ...
Article
The rapid growth of distributed computing environments has necessitated the development of efficient job scheduling mechanisms to optimize resource utilization and minimize latency. MultiAgent Systems (MAS) have emerged as a promising approach to address the complexities of job scheduling in such environments. This paper explores the integration of MAS into distributed computing systems to enhance job scheduling efficiency. We propose a novel framework that leverages the autonomous, collaborative, and adaptive capabilities of agents to improve scheduling decisions. Through extensive simulations and comparative analysis, we demonstrate that our approach significantly reduces job completion times and enhances resource allocation. The findings of this study contribute to the growing body of knowledge on intelligent scheduling systems and provide practical insights for implementing MAS in real-world distributed computing environments.
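As a hedged illustration of the agent-based idea (not the paper's framework; the names and the bidding rule are invented for the example), a contract-net-style broker can award each job to the node agent quoting the earliest completion time:

```python
# Illustrative sketch of agent-based scheduling: each node agent "bids"
# its estimated completion time for a job, and a broker agent awards
# the job to the lowest bidder. Invented names and rules, for exposition.
class NodeAgent:
    def __init__(self, name, speed):
        self.name, self.speed, self.busy_until = name, speed, 0.0

    def bid(self, job_size):
        # Estimated completion time if this node took the job now.
        return self.busy_until + job_size / self.speed

    def accept(self, job_size):
        self.busy_until = self.bid(job_size)

def broker_schedule(jobs, agents):
    """Award each job to the agent with the best (lowest) bid."""
    schedule = []
    for size in jobs:
        best = min(agents, key=lambda a: a.bid(size))
        best.accept(size)
        schedule.append((size, best.name))
    return schedule, max(a.busy_until for a in agents)  # plan and makespan

if __name__ == "__main__":
    agents = [NodeAgent("fast", 2.0), NodeAgent("slow", 1.0)]
    plan, makespan = broker_schedule([8, 4, 6, 2], agents)
    print(plan, f"makespan={makespan:.1f}")
```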
... Feitelson and Rudolph's 1995 paper [Feitelson and Rudolph (1995)] established a basis for making comparisons between job schedulers, and as other schedulers were developed by other organizations, more such comparison papers were written, including Baker et al. (1996), Byun et al. (2000), El-Ghazawi et al. (2004), and Yan and Chapman (2008). In the early 2000s, a number of universities and research organizations were developing technology to share supercomputing resources across their organizations as computing grids [Foster and Kesselman (2003)]. Among the significant papers and books from the research on scheduling for grid computing were Czajkowski et al. (1998), Krauter et al. (2002), and Nabrzyski et al. (2004). ...
... This solution is not that different from the concept of queues among the HPC schedulers. We should note that a large push for metascheduling in the supercomputing research community in the late 1990s and early 2000s enabled supercomputing jobs to be submitted and executed across multiple organizations and supercomputing centers including Globus [Foster and Kesselman (2003), Nabrzyski et al. (2004)], Legion [Grimshaw and Wulf (1997)], and Condor Flocking [Epema et al. (1996)]. That research has reached maturity and is being used by many supercomputing consortiums. ...
Preprint
In the rapidly expanding field of parallel processing, job schedulers are the "operating systems" of modern big data architectures and supercomputing systems. Job schedulers allocate computing resources and control the execution of processes on those resources. Historically, job schedulers were the domain of supercomputers, and job schedulers were designed to run massive, long-running computations over days and weeks. More recently, big data workloads have created a need for a new class of computations consisting of many short computations taking seconds or minutes that process enormous quantities of data. For both supercomputers and big data systems, the efficiency of the job scheduler represents a fundamental limit on the efficiency of the system. Detailed measurement and modeling of the performance of schedulers are critical for maximizing the performance of a large-scale computing system. This paper presents a detailed feature analysis of 15 supercomputing and big data schedulers. For big data workloads, the scheduler latency is the most important performance characteristic of the scheduler. A theoretical model of the latency of these schedulers is developed and used to design experiments targeted at measuring scheduler latency. Detailed benchmarking of four of the most popular schedulers (Slurm, Son of Grid Engine, Mesos, and Hadoop YARN) is conducted. The theoretical model is compared with data and demonstrates that scheduler performance can be characterized by two key parameters: the marginal latency of the scheduler t_s and a nonlinear exponent α_s. For all four schedulers, the utilization of the computing system decreases to < 10% for computations lasting only a few seconds. Multilevel schedulers that transparently aggregate short computations can improve utilization for these short computations to > 90% for all four of the schedulers that were tested.
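A back-of-the-envelope version of such a latency model, under our own simplifying assumption that launching n tasks costs roughly t_s · n^α_s seconds of scheduler time (this is our reading of the abstract, not the paper's exact equations), shows why utilization collapses for short tasks:

```python
# Rough sketch of a scheduler-latency utilization model, assuming
# (our assumption) that launching n tasks costs about t_s * n**alpha_s
# seconds of scheduler time.
def utilization(task_seconds, n_tasks, t_s, alpha_s):
    work = task_seconds * n_tasks
    overhead = t_s * n_tasks**alpha_s
    return work / (work + overhead)

if __name__ == "__main__":
    # Short tasks are dominated by scheduler overhead; long tasks are not.
    for dur in (1, 10, 100, 1000):
        u = utilization(dur, n_tasks=1000, t_s=0.5, alpha_s=1.2)
        print(f"{dur:>5}s tasks -> utilization {u:.1%}")
```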
... Future e-science applications will require efficient processing of data [1], where storage and processors may be distributed among the collaborating researchers. A computational Grid consists of heterogeneous computational resources, possibly with different users, and provides remote access to these resources [2,3,4]; it is an ideal computing environment for e-science applications [5,6,7]. The Grid has attracted researchers as an alternative to supercomputers for high-performance computing. ...
... Coordinator 0 and coordinator 12, which have excessive loads, calculate the local transfers by executing pseudocode similar to the local load balancing code in Alg. 2. All of these message transmissions are shown in Fig. 4(c). 4. Coordinator 0 sends a Coord_Xfer message to node 1 to start a load transfer. ...
Preprint
E-science applications may require huge amounts of data and high processing power, and grid infrastructures are very suitable for meeting these requirements. The load distribution in a grid may vary, leading to bottlenecks and overloaded sites. We describe a hierarchical dynamic load balancing protocol for Grids. The Grid consists of clusters, and each cluster is represented by a coordinator. Each coordinator first attempts to balance the load in its cluster and, if this fails, communicates with the other coordinators to perform transfer or reception of load. This process is repeated periodically. We analyze the correctness, performance and scalability of the proposed protocol and show from the simulation results that our algorithm balances the load by decreasing the number of highly loaded nodes in a grid environment.
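The two-level scheme can be sketched as follows; the code is an illustration of the idea (the threshold and transfer rule are invented here), not the protocol's actual messages:

```python
# Sketch of hierarchical load balancing: a coordinator first tries to
# balance load inside its own cluster, and only if that fails does it
# export the surplus to a peer cluster. Invented rules, for exposition.
def balance_within_cluster(loads, high):
    """Shift load from overloaded nodes to the least-loaded ones."""
    loads = loads[:]
    for i in range(len(loads)):
        while loads[i] > high:
            j = min(range(len(loads)), key=loads.__getitem__)
            if loads[j] >= high:          # nobody local can absorb more
                return loads, False
            move = min(loads[i] - high, high - loads[j])
            loads[i] -= move
            loads[j] += move
    return loads, True

def balance_grid(clusters, high):
    for name in list(clusters):
        loads, ok = balance_within_cluster(clusters[name], high)
        clusters[name] = loads
        if not ok:
            # Local balancing failed: export the surplus to the currently
            # lightest cluster; its coordinator re-balances next period.
            surplus = sum(max(0.0, l - high) for l in loads)
            clusters[name] = [min(l, high) for l in loads]
            target = min((c for c in clusters if c != name),
                         key=lambda c: sum(clusters[c]))
            clusters[target][0] += surplus
    return clusters

if __name__ == "__main__":
    grid = {"c0": [14.0, 4.0, 4.0], "c1": [1.0, 1.0, 1.0]}
    print(balance_grid(grid, high=5.0))
    # -> {'c0': [5.0, 5.0, 5.0], 'c1': [5.0, 4.0, 1.0]}
```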
... The middleware framework for sharing of information technology resources that was given the communal name "The Grid" had strong authentication built in at the spanning layer of its protocol stack [4]. The Grid service stack was advertised as having a "thin waist" in analogy to the spanning layer of the Internet, and as an attempt to lay claim to the implication of scalability. ...
... 1.1.4 We define an implements relation ⪯ between two service specifications S and T and a program P as follows: ...
Preprint
The hourglass model is a widely used as a means of describing the design of the Internet, and can be found in the introduction of many modern textbooks. It arguably also applies to the design of other successful spanning layers, notably the Unix operating system kernel interface, meaning the primitive system calls and the interactions between user processes and the kernel. The impressive success of the Internet has led to a wider interest in using the hourglass model in other layered systems, with the goal of achieving similar results. However, application of the hourglass model has often led to controversy, perhaps in part because the language in which it has been expressed has been informal, and arguments for its validity have not been precise. Making a start on formalizing such an argument is the goal of this paper.
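To give the flavor of such a formalization, here is one conventional reading, stated explicitly as our assumption rather than the paper's definition: treat a specification as the set of behaviors it permits, and implementation as behavior containment.

```latex
% Illustrative formalization (an assumption for exposition, not the
% paper's exact definitions): let a specification S denote the set of
% behaviors it permits, and beh(P) the behaviors of a program P. Then
P \models S \iff \mathrm{beh}(P) \subseteq S,
% and one specification is below another when every program
% implementing the first also implements the second:
S \preceq T \iff \forall P \,\bigl( P \models S \implies P \models T \bigr).
```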
... Reviewing the retrospective of computing infrastructure development, one must acknowledge that the concept of a GRID computing infrastructure, proposed by Ian Foster and Carl Kesselman in 1999 [6], best matches the requirements formulated above. Here is how the authors themselves define the main properties of a GRID infrastructure (quoted from [7]): ...
... software-defined networking (SDN) and network function virtualization (NFV) [18]. Let us briefly describe its organization, which is based on the federative principle [6]. Each federate is in charge of a certain amount of computing resources, telecommunication resources, and data storage resources. ...
Article
Full-text available
Received 14.11.2023; revised 20.03.2024; accepted for publication 26.03.2024. This article considers the application of machine learning methods to the optimal management of the resources of a networked computing infrastructure, a new generation of computing infrastructure. The relationship between the proposed computing infrastructure and the GRID concept is examined. It is shown how machine learning methods in the management of a networked computing infrastructure make it possible to solve the infrastructure management problems that prevented the GRID concept from being realized in full. As an example, the application of a multi-agent optimization method in combination with reinforcement learning for managing network resources is considered. It is shown that the use of multi-agent machine learning methods increases the speed of distributing traffic flows and ensures optimal loading of the network channels of the computing infrastructure by the criterion of load distribution uniformity, and that such network resource management is more efficient than a centralized approach. Keywords: reinforcement learning methods, multi-agent methods, networked computing infrastructure
... GRID computing has become a promising architecture for solving large-scale computational problems using distributed resources available across geographical and administrative boundaries (Foster & Kesselman, 2004). It allows computing, storage, and service resources to work collectively on challenges that require large-scale computing, especially in scientific and technical analysis, and to handle the scale, variability, and distribution of resources in a networked environment. ...
Article
Grid computing has actively changed the sphere of large-scale distributed processing, allowing the employment of geographically distributed and heterogeneous resources. However, efficient scheduling of tasks in such environments is not a simple feat, as it has to address issues such as dynamic workloads, heterogeneity of resources, and scalability. This research work adopts a heuristic-based task scheduling framework which combines GA, ACO, and PSO to achieve maximum performance in terms of reduced makespan, improved resource utilization, load balance, and energy consumption. The framework was implemented and evaluated using the GridSim simulation tool under task loads varying from 100 up to 500. A comparative analysis showed that, in general, heuristic schedulers, especially PSO, were more effective than FCFS in all the tested criteria. While other methods showed a higher makespan (though still producing an evenly distributed load), PSO provided the shortest makespan, the best resource utilization, and very low average wait and response times. The research therefore adds to the existing literature on intelligent adaptive scheduling technologies in grids by proposing a scalable solution fitting both demands for execution efficiency and economy. These results provide insights into the impact of heuristic optimization on making grid computing more responsive, energy efficient, and high throughput.
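A minimal PSO sketch for the task-to-node mapping problem named above (an illustration of the technique, not the paper's GridSim experiments; the parameters are typical textbook values):

```python
# Minimal PSO sketch for mapping tasks to grid nodes to reduce makespan.
# Continuous particle positions are decoded into a discrete task->node
# assignment by truncating modulo the node count.
import random

TASKS = [random.uniform(10, 100) for _ in range(50)]   # task lengths (MI)
NODES = [random.uniform(1, 4) for _ in range(8)]       # node speeds (MIPS)

def makespan(position):
    finish = [0.0] * len(NODES)
    for task, x in zip(TASKS, position):
        n = int(x) % len(NODES)
        finish[n] += task / NODES[n]
    return max(finish)

def pso(n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5):
    dim = len(TASKS)
    pos = [[random.uniform(0, len(NODES)) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    gbest = min(pbest, key=makespan)
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if makespan(pos[i]) < makespan(pbest[i]):
                pbest[i] = pos[i][:]
        gbest = min(pbest, key=makespan)
    return gbest, makespan(gbest)

if __name__ == "__main__":
    _, best = pso()
    print(f"best makespan found: {best:.1f}s")
```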
... Another similar approach that modified the OpenSSH source code was the GSI-SSH project [22], which relied on a GSS-API mechanism that used X.509 certificates for authentication, implemented by the Globus Toolkit [23]. The patch was never merged by the OpenSSH maintainers, due to prioritising security over new features. ...
Article
Full-text available
Secure Shell (SSH) is the de facto standard protocol for accessing remote servers on the command line across a number of use cases, including remote system administration, high-performance computing access, git operations, or system backups via rsync. However, it only supports a limited number of authentication mechanisms, with SSH keys being the most widely used. As federated infrastructures become more prevalent, there is a growing demand for SSH to operate seamlessly and securely in such environments. The use of SSH keys in federated setups poses a number of challenges, since the keys are trusted permanently and can be shared across devices and teams. Mitigations, such as key approval and distribution, make operation at scale complex and error prone. This motivated us to develop a set of tools, collectively referred to as ssh-oidc, for facilitating federated identities with SSH by making use of OpenID Connect (OIDC), one of the established protocols used in federated identity management. We support two different approaches: one based on PAM authentication, which works by passing an OIDC access token to the SSH server for authentication, and the other one utilising SSH certificates, which are issued by our online certificate authority in exchange for an access token. Both approaches rely on a central component, motley_cue, to handle the mapping of federated identities to Unix accounts on the ssh-server, authorisation, and just-in-time account provisioning. This tool integrates well with user management systems and policies. We also provide client-side tools that automate the process of obtaining and storing the necessary credentials, and ensure a single sign-on experience for the user.
... Applications would no longer be built from scratch, but as compositions of the available services. In this way, dynamic, distributed, multi-domain resources within a virtual organization can be integrated to solve problems collaboratively [1,2]. ...
Article
Full-text available
Coping with dynamic changes in the network environment and in user demands is a major challenge. A dynamically customizable, extensible service composition model is proposed in this paper. To guarantee the rationality of the service composition, an on-demand service composition model based on Petri nets is proposed. The model is then analyzed to validate whether the service composition can achieve the desired target. Finally, the results show that the model can guarantee the correctness of the extensible service and validate the rationality and effectiveness of the method.
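The Petri-net check underlying such a model can be illustrated in a few lines. This sketch (with invented place and transition names) shows the enabled/fire rule and a sequential composition of two services reaching its goal marking:

```python
# Tiny Petri-net sketch: a transition is enabled when every input place
# holds enough tokens; firing consumes input tokens and produces output
# tokens. Place/transition names are illustrative only.
def enabled(marking, transition):
    return all(marking.get(p, 0) >= n for p, n in transition["in"].items())

def fire(marking, transition):
    m = dict(marking)
    for p, n in transition["in"].items():
        m[p] -= n
    for p, n in transition["out"].items():
        m[p] = m.get(p, 0) + n
    return m

if __name__ == "__main__":
    # Two services composed in sequence: invoke A, then invoke B.
    t_a = {"in": {"request": 1}, "out": {"a_done": 1}}
    t_b = {"in": {"a_done": 1}, "out": {"response": 1}}
    m = {"request": 1}
    for t in (t_a, t_b):
        assert enabled(m, t), "composition deadlocks here"
        m = fire(m, t)
    print(m)  # {'request': 0, 'a_done': 0, 'response': 1} -- goal reached
```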
... The evolution of parallel processing systems significantly influenced Hadoop's development. The studies of parallel database systems [21], grid computing architectures [22], and high-performance computing solutions [23] all made important contributions. These precedents provided valuable lessons in job scheduling algorithms, resource allocation strategies, data partitioning techniques, and process coordination mechanisms. ...
Article
Full-text available
This paper systematically analyzes Apache Hadoop's technological evolution, tracing its transformation from a web crawling subsystem to a comprehensive enterprise computing platform. Beginning with its origins in Google's foundational papers on the Google File System (GFS) and MapReduce, we examine the critical architectural decisions and technical innovations that shaped Hadoop's development across its major releases. The study covers key technical milestones: Hadoop's emergence from the Nutch project in 2006, Yahoo!'s production deployment in 2008, the stability-focused 1.0 release in 2011, the groundbreaking YARN architecture in 2013, and the security-enhanced 3.0 version in 2017. Our study shows how each stage of development solved a different distributed computing problem while broadening Hadoop's usefulness beyond the web. We show how architectural changes in resource management, data storage efficiency, and processing flexibility helped Hadoop grow from a specific MapReduce implementation into a flexible distributed computing framework that can handle a wide range of business workloads. The research provides valuable insights into the technical considerations that drive distributed system evolution and offers lessons for future large-scale computing platforms.
... The concept of the grid in computational networks (Foster and Kesselman, 2003b; Foster et al., 2001) was introduced by Foster, Tuecke, and Kesselman. The primary aim is to create the illusion of an extremely large and powerful virtual supercomputer that contains heterogeneous and homogeneous systems with a wide variety of shareable resources. ...
... Modern research in high-energy physics is impossible without significant computing resources. ALICE uses more than 100,000 processors deployed in the GRID [5,6,7,8,9,10,11], spanning more than 80 sites around the world. For the next LHC run, the resources of the GRID infrastructure will suffice for data analysis and processing, but it will lag behind the demands placed on the research. ...
Article
Full-text available
The next LHC run implies the use of far greater resources than the GRID can provide. To address this problem, ALICE is running a project to extend the existing computing model to include additional resources, such as the Titan supercomputer. This article describes the technology for coupling the AliEn computing environment with the Titan supercomputer located at the Oak Ridge Leadership Computing Facility (OLCF). The technology uses the PanDA (Production and Distributed Analysis) workload management system (WMS) to submit jobs to Titan's batch processing queue and to manage data locally. Thanks to PanDA and Titan, the ALICE experiment at the Large Hadron Collider gains new resources for carrying out its tasks. The implementation was tested using ALICE jobs. AliEn (ALIce ENvironment) is a distributed computing environment developed for the ALICE Offline project. It provides access to distributed computing and storage resources for all participants of the ALICE experiment at the Large Hadron Collider (LHC). At present, AliEn processes jobs on approximately 100,000 computing processors in use at more than 80 GRID sites around the world. The architecture of the computing environment consists 99% of imported open-source components, which makes it possible to use their functionality without modification. To link AliEn with the GRID infrastructure, the VOBOX service is used; it allows custom services to be run at computing sites and provides direct interaction with the batch processing queue for launching jobs.
... To manage, store, and process the data obtained from the LHC accelerator and its detectors, the distributed computing network LCG (LHC Computing Grid) is used, based on GRID technology [1][2][3], which makes it possible to unite resources located at different points of the globe into a single computing environment. Another prominent example of the development of elementary particle accelerator complexes is NICA (Nuclotron-based Ion Collider fAcility), ...
Article
Full-text available
At present, scientific research in high-energy physics involves accelerator complexes of varying levels of complexity, experimental facilities, and significant computing resources. To manage these complexes effectively, specialized systems are used: object-oriented distributed systems for controlling hardware equipment. This article presents a software suite developed for the Tango Controls distributed control system that receives and processes data obtained from a Tango Controls device server and displays the resulting data via a web interface as plots and telemetry. Unquestionable advantages of Tango Controls are its cross-platform nature, open source code, and universal tooling, which allow it to be used across a wide range of hardware solutions. The Tango Controls distributed system is used to build control systems for hardware resources. Access to hardware resources is provided through a Tango Controls distributed object. A distributed object in Tango Controls is called a device and is created as an object in a container process called a device server. The device server implements network communication and connects to the configuration database and to clients. At runtime, the device server creates device instances that represent logical entities of hardware components. Clients interact with device servers via the Tango protocol. Tango device servers and clients can be written in Python, C++, or Java. Tango ships with a complete set of tools for development, control, monitoring, and so on. The developed software suite is scalable and has been tested for fault tolerance and security.
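For instance, a minimal telemetry read from a device server in Python typically looks like the following. This is a sketch assuming the PyTango client package and the standard TangoTest device sys/tg_test/1; attribute names vary per device class:

```python
# Sketch of a minimal Tango Controls client in Python. Assumes the
# PyTango package ("tango") is installed, a Tango database is reachable,
# and the standard TangoTest device sys/tg_test/1 is exported.
import tango

proxy = tango.DeviceProxy("sys/tg_test/1")    # connect to the device server
print(proxy.state())                          # device state, e.g. RUNNING
attr = proxy.read_attribute("double_scalar")  # read one telemetry attribute
print(attr.name, attr.value, attr.time)       # value plus its timestamp
```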
... In this context, distributed data mining (DDM) techniques with an efficient aggregation phase have become necessary for analysing these large and multi-dimensional datasets. Moreover, DDM is more appropriate for large-scale distributed platforms, such as clusters and Grids [3], where datasets are often geographically distributed and owned by different organisations. Many DDM methods, such as distributed association rules and distributed classification [4], [5], [6], [7], [8], [9], have been proposed and developed in the last few years. ...
Preprint
Distributed data mining techniques, and mainly distributed clustering, have been widely used in the last decade because they deal with very large and heterogeneous datasets which cannot be gathered centrally. Current distributed clustering approaches normally generate global models by aggregating local results that are obtained on each site. While this approach mines the datasets at their locations, the aggregation phase is complex, which may produce incorrect and ambiguous global clusters and therefore incorrect knowledge. In this paper we propose a new clustering approach for very large spatial datasets that are heterogeneous and distributed. The approach is based on the K-means algorithm but generates the number of global clusters dynamically. Moreover, this approach uses an elaborate aggregation phase. The aggregation phase is designed in such a way that the overall process is efficient in time and memory allocation. Preliminary results show that the proposed approach produces high quality results and scales up well. We also compared it to two popular clustering algorithms and showed that this approach is much more efficient.
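A sketch of the two-phase idea, though not the paper's exact algorithm: each site clusters locally, only the centroids travel, and the aggregation merges nearby centroids so the global cluster count emerges dynamically (merge_dist is an invented parameter):

```python
# Illustrative two-phase distributed clustering: local k-means per site,
# then centroid-level aggregation. Centroids closer than merge_dist are
# merged, so the number of global clusters is not fixed in advance.
import math
import random

def local_kmeans(points, k, iters=20):
    centroids = random.sample(points, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            buckets[i].append(p)
        for i, b in enumerate(buckets):
            if b:
                centroids[i] = tuple(sum(x) / len(b) for x in zip(*b))
    return centroids

def aggregate(all_centroids, merge_dist):
    merged = []
    for c in all_centroids:
        for i, m in enumerate(merged):
            if math.dist(c, m) < merge_dist:
                merged[i] = tuple((a + b) / 2 for a, b in zip(m, c))
                break
        else:
            merged.append(c)
    return merged

if __name__ == "__main__":
    sites = [[(random.gauss(cx, 0.3), random.gauss(cy, 0.3))
              for _ in range(200)]
             for cx, cy in ((0, 0), (5, 5), (0, 5))]  # 3 distributed sites
    locals_ = [c for site in sites for c in local_kmeans(site, k=3)]
    print(f"global clusters: {len(aggregate(locals_, merge_dist=1.5))}")
```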
... The EU-funded MammoGrid project [1] is currently collecting a European distributed database of mammograms with the aim of applying GRID technologies to support the early detection of breast cancer. GRID is an emerging resource-sharing model that provides a distributed infrastructure of interconnected computing and storage elements [2]. A GRID-based architecture would allow resource sharing and co-working between radiologists throughout the European Union. ...
Preprint
Full-text available
A computer-aided detection (CADe) system for microcalcification cluster identification in mammograms has been developed in the framework of the EU-funded MammoGrid project. The CADe software is mainly based on wavelet transforms and artificial neural networks. It is able to identify microcalcifications in different kinds of mammograms (i.e. acquired with different machines and settings, digitized with different pitch and bit depth, or direct digital ones). The CADe can be remotely run from GRID-connected acquisition and annotation stations, supporting clinicians from geographically distant locations in the interpretation of mammographic data. We report the FROC analyses of the CADe system performance on three different datasets of mammograms, i.e. images of the INFN-funded CALMA database collected in the Italian national screening program, the MIAS database, and the so-far collected MammoGrid images. Sensitivity values of 88% at a rate of 2.15 false positive findings per image (FP/im), 88% with 2.18 FP/im, and 87% with 5.7 FP/im have been obtained on the CALMA, MIAS and MammoGrid databases, respectively.
... This communication takes multiple roles, including the initiation of communication, discovering the holdings and capabilities of each archive, the actual process of querying, the streaming of data, and an overall control structure. None of these ideas is entirely new; the general information technology field has been confronting similar issues, and solutions such as Grid frameworks (Foster and Kesselman, 2001), JINI, and the Web services model (see, e.g., the IBM Web Services web site for more information) are equally applicable. ...
Preprint
Astronomy has a long history of acquiring, systematizing, and interpreting large quantities of data. Starting from the earliest sky atlases through the first major photographic sky surveys of the 20th century, this tradition is continuing today, and at an ever increasing rate. Like many other fields, astronomy has become a very data-rich science, driven by the advances in telescope, detector, and computer technology. Numerous large digital sky surveys and archives already exist, with information content measured in multiple Terabytes, and even larger, multi-Petabyte data sets are on the horizon. Systematic observations of the sky, over a range of wavelengths, are becoming the primary source of astronomical data. Numerical simulations are also producing comparable volumes of information. Data mining promises to both make the scientific utilization of these data sets more effective and more complete, and to open completely new avenues of astronomical research. Technological problems range from the issues of database design and federation, to data mining and advanced visualization, leading to a new toolkit for astronomical research. This is similar to challenges encountered in other data-intensive fields today. These advances are now being organized through a concept of the Virtual Observatories, federations of data archives and services representing a new information infrastructure for astronomy of the 21st century. In this article, we provide an overview of some of the major datasets in astronomy, discuss different techniques used for archiving data, and conclude with a discussion of the future of massive datasets in astronomy.
... The problem of creating the technology of remote job execution emerged within the conception of distributed computing [21] in the 1970s. In the 1990s, as a result of the development of the Internet and the widespread adoption of computer technology, distributed computing gained a boost in development, particularly within the grid paradigm [4], [5]. As a whole, the conception and the grid infrastructures created in the past years can be viewed as a comprehensive attempt to build distributed systems that automate the interaction between providers of data-processing services and their consumers. ...
Preprint
The paper examines current trends in the design of systems for convenient and secure remote job submission to various computing resources, including supercomputers, computer clusters, cloud resources, data storages and databases, and grid infrastructures, by authorized users, as well as remote job monitoring and retrieval of the results. Currently, high-performance computing and storage resources are capable of independently solving the majority of practical problems in the field of science and technology. Therefore, the focus in the development of a new generation of middleware shifts from global grid systems to building convenient and efficient web platforms for remote access to individual computing resources. The paper examines the general principles of construction and briefly describes some specific implementations of such web platforms.
... The most visible sign of such changes is the wide spread of clusters, which consist of collections of tens or hundreds of standard, almost identical processors connected by a high-speed interconnection network [6]. The next natural step is the extension to local sets of clusters or to geographically distant grids [10]. ...
Preprint
We describe in this paper a new method for building an efficient algorithm for scheduling jobs in a cluster. Jobs are considered as parallel tasks (PT) which can be scheduled on any number of processors. The main feature is to consider two criteria that are optimized together. These criteria are the makespan and the weighted minimal average completion time (minsum). They are chosen for their complementarity, to be able to represent both user-oriented and system-administrator objectives. We propose an algorithm based on a batch policy with increasing batch sizes, with a smart selection of jobs in each batch. This algorithm is assessed by intensive simulation, compared to a new lower bound (obtained by a relaxation of an ILP) on the optimal schedules for each criterion separately. It is currently implemented on an actual real-size cluster platform.
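The batch policy can be illustrated on a single-processor simplification (the paper schedules moldable parallel tasks; batch doubling and in-batch job selection are the point of this sketch, not the full algorithm):

```python
# Sketch of the batch idea: jobs are released in batches of doubling
# size, and inside each batch the shortest jobs go first, trading off
# makespan against average completion time. Illustrative only.
def batch_schedule(job_lengths):
    order, batch, i = [], 1, 0
    while i < len(job_lengths):
        chunk = sorted(job_lengths[i:i + batch])  # smart in-batch selection
        order.extend(chunk)
        i += batch
        batch *= 2                                # increasing batch sizes
    return order

def avg_completion(order):
    t, total = 0.0, 0.0
    for job in order:
        t += job
        total += t
    return total / len(order)

if __name__ == "__main__":
    jobs = [7, 1, 9, 3, 5, 2, 8]
    sched = batch_schedule(jobs)
    print(sched, f"mean completion {avg_completion(sched):.1f}")
```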
... The disparate processing units share no system resources; each has its own operating system, and they communicate through a high-speed network. The main computing models within distributed parallel computing systems include cluster [89,26], grid [86,13,32,82], and cloud computing [82,33,59]. ...
Preprint
While modern parallel computing systems offer high performance, utilizing these powerful computing resources to the highest possible extent demands advanced knowledge of various hardware architectures and parallel programming models. Furthermore, optimized software execution on parallel computing systems demands consideration of many parameters at compile-time and run-time. Determining the optimal set of parameters in a given execution context is a complex task, and therefore to address this issue researchers have proposed different approaches that use heuristic search or machine learning. In this paper, we undertake a systematic literature review to aggregate, analyze and classify the existing software optimization methods for parallel computing systems. We review approaches that use machine learning or meta-heuristics for software optimization at compile-time and run-time. Additionally, we discuss challenges and future research directions. The results of this study may help to better understand the state-of-the-art techniques that use machine learning and meta-heuristics to deal with the complexity of software optimization for parallel computing systems. Furthermore, it may aid in understanding the limitations of existing approaches and identification of areas for improvement.
... Motivated by the different utilization levels of clusters around the globe and by the need to run even larger parallel programs, in the early 2000s, Grid Computing became relevant for the HPC community. Grids offer users access to powerful resources managed by autonomous administrative domains [50,51]. The notion of monetary costs for running applications was soft, favoring a more collaborative model of resource sharing. ...
Preprint
High Performance Computing (HPC) clouds are becoming an alternative to on-premise clusters for executing scientific applications and business analytics services. Most research efforts in HPC cloud aim to understand the cost-benefit of moving resource-intensive applications from on-premise environments to public cloud platforms. Industry trends show hybrid environments are the natural path to get the best of the on-premise and cloud resources---steady (and sensitive) workloads can run on on-premise resources and peak demand can leverage remote resources in a pay-as-you-go manner. Nevertheless, there are plenty of questions to be answered in HPC cloud, which range from how to extract the best performance of an unknown underlying platform to what services are essential to make its usage easier. Moreover, the discussion on the right pricing and contractual models to fit small and large users is relevant for the sustainability of HPC clouds. This paper brings a survey and taxonomy of efforts in HPC cloud and a vision on what we believe is ahead of us, including a set of research challenges that, once tackled, can help advance businesses and scientific discoveries. This becomes particularly relevant due to the fast increasing wave of new HPC applications coming from big data and artificial intelligence.
... In order to cope with gigabytes or even terabytes of data, a natural step is to use the power of parallel and distributed machines, and there have been parallel versions of centre-based DM algorithms [2]. These parallel and distributed machines were mostly clusters of computers or Grids [3]. In this case, large datasets are divided (either horizontally or vertically) into disjoint partitions and then scattered on computing nodes. ...
Preprint
Efficient extraction of useful knowledge from data is still a challenge, mainly when the data is distributed, heterogeneous, and of varying quality depending on its corresponding local infrastructure. To reduce the overhead cost, most of the existing distributed clustering approaches generate global models by aggregating local results obtained on each individual node. The complexity and quality of solutions depend highly on the quality of the aggregation. In this respect, we propose an approach for distributed density-based clustering that both reduces the communication overhead due to the data exchange and improves the quality of the global models by considering the shapes of local clusters. Preliminary results show that this algorithm is very promising.
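A sketch of the shape-aware aggregation idea (illustrative only; the paper's cluster contours are richer than this): each site ships a few spread-out boundary points per local cluster, and two clusters merge globally when their boundaries come within eps of each other:

```python
# Illustrative shape-aware aggregation: sites send a handful of
# spread-out sample points per local cluster instead of the raw data;
# two local clusters merge when any of their points are close enough.
import math

def boundary(points, n=8):
    """Cheap stand-in for a cluster contour: n spread-out sample points."""
    pts = sorted(points)
    step = max(1, len(pts) // n)
    return pts[::step][:n]

def merge_clusters(local_clusters, eps):
    groups = [[c] for c in local_clusters]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if any(math.dist(p, q) < eps
                       for a in groups[i] for b in groups[j]
                       for p in a for q in b):
                    groups[i] += groups.pop(j)
                    merged = True
                    break
            if merged:
                break
    return groups

if __name__ == "__main__":
    site1 = [boundary([(x / 10, 0.0) for x in range(30)])]        # a stripe
    site2 = [boundary([(x / 10, 0.1) for x in range(25, 60)])]    # its extension
    site3 = [boundary([(9.0 + x / 10, 5.0) for x in range(20)])]  # far away
    print(len(merge_clusters(site1 + site2 + site3, eps=0.5)))    # -> 2
```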
... The main challenge in grid computing is efficient resource utilization and minimization of turnaround time. The existing system model consists of a web-based grid network platform with different management policies, forming a heterogeneous system where the computing cost and computing performance become significant at each node [3], [4]. In a grid computing environment, applications are submitted for use of grid resources by users from their terminals. ...
Conference Paper
Grid scheduling is a technique by which user demands are met and resources are efficiently utilized. Scheduling algorithms are used to minimize job waiting time and completion time. Most of the minimization algorithms are implemented in homogeneous resource environments. In this paper, the presented algorithm minimizes the average turnaround time in a heterogeneous resource environment. The algorithm is based on a greedy approach and is used in a static job submission environment where all the jobs are submitted at the same time. Taking all jobs as independent, the turnaround time of each job is minimized in order to minimize the average turnaround time of all submitted jobs.
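A greedy scheme of this kind can be sketched as follows; this is our illustration of the approach (shortest jobs first, each placed on the heterogeneous resource that finishes it earliest), not necessarily the paper's exact rule:

```python
# Greedy sketch: with all jobs submitted at time 0, process shorter
# jobs first and place each on the resource that finishes it earliest.
def greedy_turnaround(job_sizes, speeds):
    ready = [0.0] * len(speeds)            # next free time per resource
    turnaround = []
    for size in sorted(job_sizes):         # shortest jobs first
        best = min(range(len(speeds)),
                   key=lambda r: ready[r] + size / speeds[r])
        ready[best] += size / speeds[best]
        turnaround.append(ready[best])     # submission time is 0 for all
    return sum(turnaround) / len(turnaround)

if __name__ == "__main__":
    print(f"avg turnaround: {greedy_turnaround([12, 3, 7, 20, 5], [1.0, 2.5]):.2f}")
```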
... The Grid is a computing platform that delivers computing services from various independent computing sites to users in disparate locations; the paradigm requires large-scale sharing and proper service delivery to meet users' needs. To attain these requirements, the Grid encourages the integration and aggregation of different federating computing units to create a virtual organization of grid networks, so that it can deliver the Quality of Service required by customers (Foster, 2000; Foster & Kesselman, 1999; Wieczorek, 2009). ...
Article
Full-text available
Improvements in computer technologies continue to shape the present and the future of modernization. Driven by the need for faster and more efficient processing, most chip manufacturers have abandoned the single-processor system and turned their attention to other hardware technologies like the multicore system. However, should the baby (the single-processor system) be thrown out with the bathwater? Parallelization, which defines the era of the multicore, can improve performance on single-processor systems if properly exploited. This work uses thread-level parallelism on a single-processor system to sort randomly generated grid jobs. The method randomly generates grid jobs, which are then sorted into groups based on the computing requirements of each job. Using fuzzy rules, the sorting is done with a range of threads from one to eight in steps of two. For each sorting run, the completion time is recorded. The analysis shows that increasing the number of threads improves performance on the single-processor system. However, as the number of jobs increases, the execution time also increases for all thread counts, indicating a general performance decline. The analysis also showed a steady improvement in performance as the number of threads increased from one to two and from two to four. However, the improvement leveled off between four and six threads and degraded between six and eight threads. This indicates that as the number of threads increases, the single-processor system poses a bottleneck to performance due to context switches and other overheads. We therefore recommend that for thread-level parallelization on single-processor systems, the number of threads should not be more than four.
... Ian Foster and Carl Kesselman were the first to use the word "grid" in their book "The Grid: Blueprint for a New Computing Infrastructure" published in 2003 [21]. They compare a computing grid to an electrical grid in terms of resource availability. In an electrical grid, you simply plug into an outlet and electricity flows. ...
Article
Computing grids are infrastructures that provide almost infinite computing capacity; they are now used in all fields, from the study of pandemics to the monitoring of rocket trajectories and the study of meteorological and climatic phenomena. They have a distributed and heterogeneous architecture that gives them very high computing performance. They are made up of several computing nodes that are subject to failures, such as frank (crash) failures. A frank failure in a computing grid is an abnormal and unexpected interruption of a node. Many frank failure tolerance protocols have been proposed in the literature, but none of these protocols integrates the anticipation of frank failures. The objective of this article is to propose a model, based on the PDEVS formalism, of frank failure tolerance in a computing grid that allows the anticipation of frank failures. The proposed model relies on the temperature variation of electronic components and on the state of the hard disk, through the values provided by SMART data, to predict a probable frank failure of a node. The results of the simulations on the different scenarios we carried out show that our model provides better performance than those proposed in the literature when the number of nodes to tolerate is greater than 200.
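The anticipation step can be illustrated with a simple threshold score over temperature drift and SMART counters; the thresholds and weights below are invented for the example and are not the paper's PDEVS model:

```python
# Illustrative sketch of anticipating a node's frank (crash) failure
# from temperature drift and SMART counters. All thresholds and weights
# are invented for this example.
def failure_risk(temps, smart):
    """Score in [0,1]: higher means a frank failure looks more likely."""
    drift = temps[-1] - sum(temps) / len(temps)   # recent heating trend
    risk = 0.0
    if temps[-1] > 85:
        risk += 0.4                               # absolute overheating
    if drift > 10:
        risk += 0.2                               # fast temperature rise
    if smart.get("reallocated_sectors", 0) > 50:
        risk += 0.3                               # disk already degrading
    if smart.get("pending_sectors", 0) > 0:
        risk += 0.1                               # unresolved bad sectors
    return min(risk, 1.0)

if __name__ == "__main__":
    risk = failure_risk([62, 66, 71, 88], {"reallocated_sectors": 120,
                                           "pending_sectors": 3})
    action = "migrate jobs off node" if risk >= 0.5 else "keep monitoring"
    print(f"risk={risk:.1f} -> {action}")
```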
... Scenarios in which the 1 ms latency criterion becomes apparent include manoeuvring a 3D object using a joystick or within a VR setting. Should the delay between the virtual image and the movement surpass 1 ms, it can lead to cybersickness [250]. Thus, ensuring low latency in haptic systems is essential. ...
Article
Full-text available
Haptic communications represent a vast area of research focused on integrating the sense of touch into digital sensory experiences. Achieving effective haptic communication requires meticulous design and implementation of all subsystems. As implied by the term, the two primary subsystems are haptic and communication. Haptic refers to replicating the touch sensation in various applications such as augmented reality, virtual reality and teleoperation, and communication involves optimising network structures to transmit and receive haptic information alongside other sensory data. In this survey paper, we discuss both haptic interfaces and network requirements simultaneously. For haptic interfaces, we comprehensively explore the mechanisms of touch perception, haptic sensing, and haptic feedback. We delve into haptic sensing by examining state-of-the-art sensors and approaches to capture data related to touch, such as pressure, force, and motion, and translate these physical interactions into digital data that a haptic system can interpret and respond to. Subsequently, we discuss various methods of achieving haptic feedback, including different mechanical actuators and electrical stimulation. We also investigate the incorporation of artificial intelligence in this field, proposing new areas where it could enhance system performance. Additionally, we address open challenges and future research directions, covering critical issues related to privacy, data transmission, cybersickness, performance and wearability of haptic interface, integrated systems, power supply and evaluation of these devices. Through this interdisciplinary approach, which merges haptic feedback, haptic sensing, and communication, our paper aims to inspire further research and development, ultimately advancing technology and enhancing haptic experiences.
... According to [2], there are generally two types of grid infrastructure, namely computational and data grids. A computational grid is a hardware and software framework to deliver dependable, consistent, pervasive, and affordable access to high-end computational capability [3], while a data grid is "an infrastructure that manages huge amount of data files and provides resources across geographically distributed collaboration" [4]. ...
Article
Full-text available
A campus grid is a feasible deployment of grid computing, since a campus environment is centrally controlled and obtaining managerial permission is simpler than in other industries. The usability of grid computing is also potentially high because of the numerous demands from students and researchers in need of high-end computational power and data storage. However, an automated, efficient way of deploying a campus grid platform has never been disclosed; we therefore propose a methodology to deploy a campus grid with automation based on the desktop-grid architecture. Some related issues and challenges that are currently being addressed, with improvements still to be explored, are presented in this paper. Large-scale campus grid deployment involving multiple computer labs and hundreds of computers in total was accomplished by combining automation scripts and manual intervention. The chosen campus grid software system is the Berkeley Open Infrastructure for Network Computing (BOINC). This practice is expected to guide future BOINC campus grid administrators in establishing a working grid computing system, in order to provide grid-based computing and storage resources, especially for running heavy simulation programs.
... In contrast to earlier computing methods, cloud computing "became mainstream" in October 2007 (Foster and Kesselman, 1998; Raleigh and Armonk, 2007; Naone, 2007; Reimer, 2007). The cooperation between IBM and Google to operate under a common domain (Lohr, 2007; View et al., 2007) and the subsequent entry of ... (Wikipedia, 2012b; Vouk, 2008). ...
Article
Full-text available
Small and medium-sized enterprises, in addition to major corporations, are now aiming to adopt a cost-effective computing resource for their business applications by using the novel idea of cloud computing in their environment. By utilizing fewer resources and less management support, and by sharing networks, valuable resources, bandwidth, software, and hardware in a cost-effective manner with fewer service provider interactions, cloud computing enhances the performance of companies. In essence, it is a novel idea for giving consumers access to virtualized resources. Customers can use the cloud to store large amounts of data from several locations and to request services, applications, and solutions. But as cloud computing's popularity grows, there is a growing danger that security will overtake other concerns as the primary one. The current article suggests a backup strategy needed to address cloud computing security concerns.
Chapter
The rapid advancement of cloud computing has positioned Amazon Web Services (AWS) as a transformative platform for fostering sustainable innovation. This chapter explores how AWS cloud-based ecosystems enable organizations to achieve environmental, economic, and operational sustainability. By leveraging AWS services such as Elastic Compute Cloud (EC2), Lambda, and S3, businesses can optimize resource utilization, reduce energy consumption, and minimize carbon footprints. The chapter delves into key AWS solutions, including AI/ML, IoT, and serverless computing, to highlight their role in driving innovation across industries such as healthcare, manufacturing, and finance. Additionally, it examines strategies for integrating AWS Well-Architected Framework principles to enhance efficiency, scalability, and resilience while adhering to sustainability goals.
Article
John McCarthy proposed the vision of utility computing in 1961. Barbara Liskov proposed a related vision of abstraction-powered Internet Computer in 2009. This position paper outlines a distributed computing model towards realizing the McCarthy-Liskov vision. This “hypertasking” model aims at extending the “hypermedia” model of the World Wide Web into a model of World Wide Computing Utility, turning an information web into a computing web. The hypertasking model contains three abstractions, including global resource space, stored-computer architecture, and monadic hypermedia. A prototype architecture and experimental evidence are presented to support this perspective.
Article
Full-text available
This paper proposes the definition of Smart Sport Psychology as the integration of advanced technologies, such as artificial intelligence (AI), machine learning and the Internet of Things (IoT), along with other emerging technologies, into the principles and practices of sport psychology. This approach uses smart devices, data analytics and virtual reality technologies to optimise athletes' mental and physical performance, personalise psychological interventions and improve athletes' overall well-being. Through a comprehensive literature review, we explore how these technologies can address common psychological problems among athletes, including stress, anxiety, motivation, mental recovery, concentration, self-confidence and psychological injury prevention. The results highlight the effectiveness of continuous monitoring, personalised interventions and real-time feedback, providing innovative, data-driven strategies to improve athletes' performance and well-being. The paper concludes that Smart Sport Psychology has the potential to transform the sport industry, offering effective and sustainable solutions for the holistic development of athletes. This review provides a conceptual and practical framework for the implementation of smart technologies in sport psychology, highlighting their importance and long-term benefits in the sport domain.
Preprint
Full-text available
AI-based object detection, and efforts to explain and investigate its characteristics, are topics of high interest. The impact of, e.g., complex background structures with an appearance similar to the objects of interest on the detection accuracy, and, beforehand, the necessary dataset composition, are subjects of ongoing research. In this paper, we present a systematic investigation of background influences and of different features of the object to be detected. The latter include various materials and surfaces, partially transparent and with shiny reflections, in the context of an Industry 4.0 learning factory. Different YOLOv8 models have been trained for each of the materials on datasets of different sizes, where the appearance was the only changing parameter. In the end, similar characteristics tend to show different behaviours and sometimes unexpected results. While some background components tend to be detected, others with the same features are not part of the detection. Additionally, some more precise conclusions can be drawn from the results. We therefore contribute a challenging dataset with detailed investigations on 92 trained YOLO models, addressing issues of detection accuracy and possible overfitting.
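Training one model per material or appearance variant, as described, can be reproduced roughly as follows with the ultralytics package; the dataset YAML paths and variant names are placeholders, not the paper's artifacts:

```python
# Sketch of training one YOLOv8 model per material/appearance variant,
# assuming the `ultralytics` package; dataset YAMLs are placeholders.
from ultralytics import YOLO

for variant in ("matte", "transparent", "shiny"):   # appearance variants
    model = YOLO("yolov8n.pt")                      # pretrained nano model
    model.train(data=f"datasets/{variant}.yaml",    # same labels, new looks
                epochs=50, imgsz=640, seed=0)
    metrics = model.val()                           # per-variant accuracy
    print(variant, metrics.box.map50)               # mAP@0.5 comparison
```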
Chapter
Grid systems provide mechanisms for single sign-on and uniform APIs for job submission and data transfer, in order to allow the coupling of distributed resources in a seamless manner. However, new users face a daunting barrier to entry due to the high cost of deployment and maintenance. They are often required to learn complex concepts related to Grid infrastructures (credential management, scheduling systems, data staging, etc.). To most scientific users, running their applications with minimal changes and yet getting results faster is highly desirable, without having to know much about how the resources are used. Hence, a higher level of abstraction must be provided for the underlying infrastructure to be used effectively. For this purpose, we have developed the Opal toolkit for exposing applications on Grid resources as simple Web services. Opal provides a basic set of Application Programming Interfaces (APIs) that allows users to execute their deployed applications, query job status, and retrieve results. Opal also provides a mechanism to define command-line arguments and automatically generates user interfaces for the Web services dynamically. In addition, Opal services can be hooked up to a metascheduler such as CSF4 to leverage a distributed set of resources, and accessed via a multitude of interfaces such as Web browsers, rich desktop environments, workflow tools, and command-line clients.
Article
High-performance computing (HPC) clusters are essential for scientific simulations, big data processing, and artificial intelligence (AI) applications. Ensuring reliability and minimizing downtime in these environments requires robust fault tolerance strategies. Traditional fault tolerance methods rely on checkpoint/restart mechanisms, hardware redundancy, and error-correcting codes. However, AI-optimized strategies leverage machine learning and predictive analytics to detect, prevent, and recover from failures more efficiently. This paper compares traditional and AI-driven fault tolerance strategies in HPC clusters, analyzing their effectiveness in handling hardware failures, software crashes, and network disruptions. The study evaluates performance trade-offs, scalability, and real-world implementations to determine the most effective approaches for modern HPC infrastructures.
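A minimal sketch of the traditional checkpoint/restart baseline that the paper compares against: application state is periodically persisted so a failed run resumes from the last checkpoint rather than from scratch. The file name, state layout and interval are illustrative assumptions.

```python
# Periodic checkpointing with atomic replacement, plus restart-from-checkpoint.
import os
import pickle

CKPT = "state.ckpt"

def load_state():
    if os.path.exists(CKPT):                # resume after a failure
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "result": 0.0}       # fresh start

state = load_state()
for step in range(state["step"], 1_000_000):
    state["result"] += step * 1e-6          # stand-in for the real computation
    state["step"] = step + 1
    if step % 10_000 == 0:                  # checkpoint interval
        with open(CKPT + ".tmp", "wb") as f:
            pickle.dump(state, f)
        os.replace(CKPT + ".tmp", CKPT)     # atomic swap avoids torn checkpoints
```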
Article
Parallel computing systems are designed to handle complex computations efficiently by distributing workloads across multiple processing units. However, as these systems scale, they become increasingly susceptible to faults that can compromise performance and reliability. Achieving an optimal balance between scalability and fault tolerance is critical for ensuring high-performance computing while minimizing failures. This paper explores the trade-offs between scalability and fault tolerance, examining various fault-handling mechanisms, redundancy strategies, and load-balancing techniques that enhance reliability without compromising efficiency. We present a comparative analysis of fault-tolerance mechanisms and scalability techniques, offering insights into the best practices for designing robust parallel computing architectures.
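One way to make the trade-off concrete is a toy replication model: running each task on r replicas raises the probability that it completes despite failures, but divides usable throughput by r. The per-replica failure probability below is an assumed illustrative value.

```python
# Replication trade-off: reliability gained versus throughput spent.
def completion_prob(p: float, r: int) -> float:
    """A task completes if at least one of its r replicas survives."""
    return 1 - p ** r

p = 0.05                                  # assumed per-replica failure probability
for r in (1, 2, 3):
    print(f"replicas={r}: success={completion_prob(p, r):.4f}, "
          f"relative throughput={1 / r:.2f}")
```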
Article
High-performance computing (HPC) clusters play a vital role in scientific research, financial modeling, artificial intelligence, and other computationally intensive fields. However, these clusters are susceptible to faults that can lead to downtime, reduced efficiency, and increased operational costs. This paper explores strategies for fault prediction and prevention in HPC clusters, focusing on proactive maintenance, machine learning-based anomaly detection, redundancy planning, and self-healing systems. By analyzing current methodologies and real-world case studies, we aim to provide a comprehensive framework for improving system reliability and minimizing downtime in HPC environments.
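The machine-learning-based anomaly detection idea can be sketched with scikit-learn's IsolationForest trained on node telemetry; the features (temperature, fan speed, ECC error counts) and their distributions below are illustrative assumptions, not data from the paper.

```python
# Fit an anomaly detector on healthy telemetry, then flag outlying nodes.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Columns: temperature (C), fan speed (RPM), corrected ECC errors per hour.
healthy = rng.normal([65, 3000, 1], [5, 200, 1], size=(1000, 3))
model = IsolationForest(contamination=0.01, random_state=0).fit(healthy)

sample = np.array([[88, 1500, 40]])     # hot node, slow fan, ECC error burst
if model.predict(sample)[0] == -1:      # -1 flags an anomaly
    print("node flagged: drain jobs and schedule maintenance")
```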
Article
High-Performance Computing (HPC) environments are essential for complex computations in fields such as scientific research, engineering, and artificial intelligence. However, these systems are prone to faults that can significantly affect performance and reliability. Self-healing algorithms offer a proactive approach to fault mitigation by detecting, diagnosing, and recovering from failures autonomously. This paper explores various self-healing techniques, including machine learning-driven fault prediction, automated recovery mechanisms, and redundancy-based resilience strategies. By analyzing existing solutions and their impact on HPC efficiency, we provide insights into optimizing fault tolerance and ensuring system reliability.
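In its simplest form, a self-healing loop is a supervisor that detects dead worker processes and relaunches them. The sketch below assumes a hypothetical worker.py entry point; real systems would add diagnosis and backoff before recovery.

```python
# Detect / recover supervisor: restart any worker process that exits.
import subprocess
import time

CMD = ["python", "worker.py"]            # hypothetical worker entry point
workers = [subprocess.Popen(CMD) for _ in range(4)]

while True:                              # supervisor runs for the job's lifetime
    for i, proc in enumerate(workers):
        if proc.poll() is not None:      # detect: process has exited
            print(f"worker {i} died with code {proc.returncode}; restarting")
            workers[i] = subprocess.Popen(CMD)   # recover: relaunch
    time.sleep(5)
```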
Article
High-Performance Computing (HPC) clusters play a crucial role in executing large-scale computations across scientific research, engineering, and artificial intelligence applications. However, these clusters are prone to hardware and software failures that can lead to significant performance degradation. Optimizing task scheduling algorithms is essential to minimize faults, enhance resource utilization, and ensure reliability. This paper explores various scheduling strategies, including static and dynamic scheduling, fault-tolerant scheduling mechanisms, and the integration of artificial intelligence for predictive fault management. By analyzing real-world case studies and performance benchmarks, this study presents an optimized approach to task scheduling that improves fault resilience in HPC environments.
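The core of a fault-tolerant dynamic scheduler can be sketched as a pull-based queue in which a task lost to a node failure is re-queued rather than dropped. Failure detection is simulated here; a real cluster would rely on heartbeats or the resource manager.

```python
# Dynamic scheduling with re-queue-on-failure so no work is permanently lost.
import random
from collections import deque

tasks = deque(range(10))        # pending task IDs
done = []

while tasks:
    task = tasks.popleft()      # dynamic assignment to the next free node
    if random.random() < 0.2:   # simulated node failure mid-task
        print(f"task {task} lost to node failure; re-queueing")
        tasks.append(task)      # fault tolerance: resubmit instead of dropping
    else:
        done.append(task)

print("completed:", done)
```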
Article
Parallel computing architectures are increasingly used for high-performance computing applications, but their complexity makes them vulnerable to hardware and software failures. Checkpointing is a critical fault recovery mechanism that periodically saves system states, allowing computations to resume from the last checkpoint rather than restarting from scratch. This paper explores various checkpointing techniques, including system-wide, incremental, and application-level checkpointing, as well as emerging AI-driven approaches for failure prediction and adaptive checkpointing. By analyzing these methods in the context of parallel computing, we aim to provide insights into optimizing reliability and performance in fault-prone computing environments.
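As one concrete instance of adaptive checkpointing, the checkpoint interval can be derived from the checkpoint cost and the (possibly predicted) mean time between failures via Young's classic approximation, T_opt ≈ sqrt(2 · C · MTBF). The numbers below are illustrative.

```python
# Young's approximation: denser checkpoints when failures are expected sooner.
import math

def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximately optimal seconds of work between checkpoints."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# If a failure predictor lowers the expected MTBF, the interval shrinks.
for mtbf_h in (24, 6, 1):
    t = young_interval(checkpoint_cost_s=60, mtbf_s=mtbf_h * 3600)
    print(f"MTBF={mtbf_h:>2} h -> checkpoint every {t / 60:.1f} min")
```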
Chapter
Cloud computing is an evolution of information technology and a dominant business model for delivering IT resources. With cloud computing, individuals and organizations can gain on-demand network access to a shared pool of managed and scalable IT resources, such as servers, storage, and applications. Recently, academics as well as practitioners have paid a great deal of attention to cloud computing. Individuals rely heavily on cloud services in their daily lives, e.g., for storing data, writing documents, managing businesses, and playing games online. Cloud computing also provides the infrastructure that has powered key digital trends such as mobile computing, the Internet of Things, big data, and artificial intelligence, thereby accelerating industry dynamics, disrupting existing business models, and fueling digital transformation. Still, cloud computing not only provides a vast number of benefits and opportunities; it also comes with several challenges and concerns, e.g., regarding protecting customers’ data.