Brian Bockelman's research while affiliated with University of Wisconsin–Madison and other places

Publications (170)

Preprint
Full-text available
Prior to the public release of Kubernetes, it was difficult to conduct joint development of elaborate analysis facilities due to the highly non-homogeneous nature of hardware and network topology across compute facilities. However, since the advent of systems like Kubernetes and OpenShift, which provide declarative interfaces for building fault-tole...
Preprint
The HL-LHC presents significant challenges for the HEP analysis community. The number of events in each analysis is expected to increase by an order of magnitude and new techniques are expected to be required; both challenges necessitate new services and approaches for analysis facilities. These services are expected to provide new capabilities, a...
Preprint
Modern network performance monitoring toolkits, such as perfSONAR, take a remarkable number of measurements about the local network environment. To gain a complete picture of network performance, however, one needs to aggregate data across a large number of endpoints. The Service Analysis and Network Diagnosis (SAND) data pipeline collects data fro...
Article
Named Data Networking (NDN) is a promising approach to provide fast in-network access to Compact Muon Solenoid (CMS) datasets. It proposes a content-centric rather than a host-centric approach to data retrieval. Data packets with unique and immutable names are retrieved from a content store (CS) using Interest packets. The current NDN architecture...
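The retrieval model described in this abstract (Interest packets matched by name against immutable Data packets held in a content store) can be illustrated in a few lines of Python; the classes and content names below are illustrative, not from the paper.

    # Illustrative sketch of NDN-style retrieval: an Interest carries only a
    # content name; the content store (CS) answers with the immutable Data
    # packet if it holds one. All names here are made up for the example.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Data:
        name: str        # unique, immutable content name
        payload: bytes

    class ContentStore:
        def __init__(self):
            self._cache = {}                  # name -> Data

        def insert(self, data: Data):
            self._cache[data.name] = data

        def on_interest(self, name: str):
            # Cache hit: serve from the CS; miss: forward upstream (omitted).
            return self._cache.get(name)

    cs = ContentStore()
    cs.insert(Data("/cms/dataset/block0/seg0", b"payload bytes"))
    print(cs.on_interest("/cms/dataset/block0/seg0"))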
Article
Full-text available
During the first observation run, the LIGO collaboration needed to offload some of its most intense CPU workflows from its dedicated computing sites to opportunistic resources. Open Science Grid enabled LIGO's PyCBC, RIFT and Bayeswave workflows to run seamlessly on a combination of owned and opportunistic resources. One of the challenges is e...
Preprint
Full-text available
The High Luminosity Large Hadron Collider presents a data challenge. The amount of data recorded from the experiments and transported to hundreds of sites will see a thirtyfold increase in annual data volume. A systematic approach to comparing the performance of different Third Party Copy (TPC) transfer protocols is needed. Two contenders, XRootD-HTTPS...
Preprint
Data analysis in HEP has often relied on batch systems and event loops; users are given a non-interactive interface to computing resources and consider data event-by-event. The "Coffea-casa" prototype analysis facility is an effort to provide users with alternate mechanisms to access computing resources and enable new programming paradigms. Instead...
Preprint
Full-text available
The intelligent Data Delivery Service (iDDS) has been developed to cope with the huge increase of computing and storage resource usage in the coming LHC data taking. iDDS has been designed to intelligently orchestrate workflow and data management systems, decoupling data pre-processing, delivery, and main processing in various workflows. It is an e...
Article
Full-text available
Since 2017, the Worldwide LHC Computing Grid (WLCG) has been working towards enabling token-based authentication and authorisation throughout its entire middleware stack. Following the publication of the WLCG Common JSON Web Token (JWT) Schema v1.0 [1] in 2019, middleware developers have been able to enhance their services to consume and validate t...
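As a rough illustration of the token model (not code from the paper), the sketch below validates a bearer token with the PyJWT library and inspects WLCG-profile claims; the issuer, key, group, and scope values are placeholders.

    # Hedged sketch: validating a WLCG-profile bearer token with PyJWT.
    # The issuer, audience, group, and capability values are placeholders.
    import jwt  # PyJWT

    def authorize(token: str, issuer_public_key: str) -> bool:
        claims = jwt.decode(
            token, issuer_public_key,
            algorithms=["RS256"],
            audience="https://wlcg.cern.ch/jwt/v1/any",
            issuer="https://cms-auth.example.org",   # placeholder issuer
        )
        # The WLCG Common JWT Schema allows group- and/or capability-based
        # authorization; check either style here.
        groups = claims.get("wlcg.groups", [])
        scopes = claims.get("scope", "").split()
        return "/cms/admins" in groups or "storage.read:/" in scopes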
Article
Full-text available
The High Luminosity Large Hadron Collider presents a data challenge. The amount of data recorded from the experiments and transported to hundreds of sites will see a thirtyfold increase in annual data volume. A systematic approach to comparing the performance of different Third Party Copy (TPC) transfer protocols is needed. Two contenders, XRootD-HTTP...
Article
Full-text available
Data analysis in HEP has often relied on batch systems and event loops; users are given a non-interactive interface to computing resources and consider data event-by-event. The “Coffea-casa” prototype analysis facility is an effort to provide users with alternate mechanisms to access computing resources and enable new programming paradigms. Instead...
Article
Full-text available
The processing needs for the High Luminosity (HL) upgrade for the LHC require the CMS collaboration to harness the computational power available on non-CMS resources, such as High-Performance Computing centers (HPCs). These sites often limit the external network connectivity of their computational nodes. In this paper we describe a strategy in whic...
Article
Full-text available
The intelligent Data Delivery Service (iDDS) has been developed to cope with the huge increase of computing and storage resource usage in the coming LHC data taking. iDDS has been designed to intelligently orchestrate workflow and data management systems, decoupling data pre-processing, delivery, and main processing in various workflows. It is an e...
Preprint
Full-text available
During the first observation run, the LIGO collaboration needed to offload some of its most intense CPU workflows from its dedicated computing sites to opportunistic resources. Open Science Grid enabled LIGO's PyCBC, RIFT and Bayeswave workflows to run seamlessly on a combination of owned and opportunistic resources. One of the challenges is e...
Article
Mechanisms for remote execution of computational tasks enable a distributed system to effectively utilize all available resources. This ability is essential to attaining the objectives of high availability, system reliability, and graceful degradation, and directly contributes to flexibility, adaptability, and incremental growth. As part of a nationa...
Preprint
Full-text available
The WLCG Authorisation Working Group was formed in July 2017 with the objective to understand and meet the needs of a future-looking Authentication and Authorisation Infrastructure (AAI) for WLCG experiments. Much has changed since the early 2000s when X.509 certificates presented the most suitable choice for authorisation within the grid; progress...
Preprint
Full-text available
Since its earliest days, the Worldwide LHC Computing Grid (WLCG) has relied on GridFTP to transfer data between sites. The announcement that Globus is dropping support of its open source Globus Toolkit (GT), which forms the basis for several FTP clients and servers, has created an opportunity to reevaluate the use of FTP. HTTP-TPC, an extension...
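HTTP-TPC drives a transfer by sending a WebDAV COPY request to one endpoint that names the other; a minimal pull-mode sketch (with placeholder URLs and token) looks like this:

    # Hedged sketch of an HTTP-TPC pull-mode transfer: the destination
    # storage endpoint is asked to COPY from the source URL.
    import requests

    resp = requests.request(
        "COPY",
        "https://dest.example.org/store/file.root",          # destination
        headers={
            "Source": "https://source.example.org/store/file.root",
            "Authorization": "Bearer <token>",               # placeholder
        },
        stream=True,
    )
    # Endpoints report progress as performance markers in the response body.
    print(resp.status_code)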
Preprint
The ATLAS Event Streaming Service (ESS) at the LHC is an approach to preprocessing and delivering data for the Event Service (ES), which implements fine-grained ATLAS event processing. The ESS allows one to asynchronously deliver only the input events required by ES processing, with the aim to decrease data traffic over WAN and improve over...
Preprint
A general problem faced by computing on the grid for opportunistic users is that delivering cycles is simpler than delivering data to those cycles. In this project we show how we integrated XRootD caches placed on the internet backbone to implement a content delivery network for general science workflows. We will show that for some workflows on dif...
Preprint
WLCG relies on the network as a critical part of its infrastructure and therefore needs to guarantee effective network usage and prompt detection and resolution of any network issues including connection failures, congestion and traffic routing. The OSG Networking Area, in partnership with WLCG, is focused on being the primary source of networking...
Preprint
Failure is inevitable in scientific computing. As scientific applications and facilities have grown in scale over recent decades, finding the root cause of a failure can be very complex or at times nearly impossible. Different scientific computing customers have varying availability demands as well as a diverse willingness to pay for availabili...
Preprint
Scientific computing workflows generate enormous amounts of distributed data that are short-lived, yet critical for job completion time. This class of data is called intermediate data. A common way to achieve high data availability is to replicate data. However, an increasing scale of intermediate data generated in modern scientific applications demands new st...
Preprint
We overview recent changes in the ROOT I/O system, enhancing it by improving its performance and interaction with other data analysis ecosystems. The newly introduced compression algorithms, the much faster bulk I/O data path, and a few additional techniques all have the potential to significantly improve experiments' software perfo...
Article
Full-text available
ROOT is a large code base with a complex set of build-time dependencies; there is a significant difference in compilation time between the “core” of ROOT and the full-fledged deployment. We present results on a “delayed build” for internal ROOT packages and external packages. This gives the ability to offer a “lightweight” core of ROOT, later exten...
Article
Full-text available
The LHC’s Run 3 will push the envelope on data-intensive workflows and, since at the lowest level this data is managed using the ROOT software framework, preparations for managing this data are starting already. At the beginning of LHC Run 1, all ROOT data was compressed with the ZLIB algorithm; since then, ROOT has added support for additional algo...
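ROOT encodes the file-level compression choice as algorithm*100 + level; as a sketch (assuming a recent ROOT build with ZSTD support, algorithm code 5, and an arbitrary file name), this can be set from PyROOT:

    # Sketch: choosing a compression algorithm and level when writing a
    # ROOT file. ROOT encodes the setting as algorithm*100 + level;
    # 505 below requests ZSTD (code 5) at level 5.
    import ROOT

    f = ROOT.TFile("demo.root", "RECREATE", "", 505)
    t = ROOT.TTree("events", "demo tree")
    # ... define and fill branches as usual; baskets are compressed
    # with the chosen setting when written out.
    t.Write()
    f.Close()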
Article
Full-text available
Distinct HEP workflows have distinct I/O needs; while ROOT I/O excels at serializing complex C++ objects common to reconstruction, analysis workflows typically have simpler objects and can sustain higher event rates. To meet these workflows, we have developed a “bulk I/O” interface, allowing multiple events’ data to be returned per library call. Th...
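The same many-events-per-call access pattern can be seen from Python with uproot (shown only as an analogy to the bulk interface; file, tree, and branch names are placeholders):

    # Sketch of array-at-a-time reading, the access pattern bulk I/O targets.
    # File, tree, and branch names are placeholders.
    import uproot

    with uproot.open("events.root") as f:
        pt = f["Events"]["Muon_pt"].array()   # one call returns many events
        print(len(pt), pt[:5])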
Article
Full-text available
We overview recent changes in the ROOT I/O system, enhancing it by improving its performance and interaction with other data analysis ecosystems. The newly introduced compression algorithms, the much faster bulk I/O data path, and a few additional techniques all have the potential to significantly improve experiments’ software performance. The nee...
Article
Full-text available
The WLCG Authorisation Working Group was formed in July 2017 with the objective to understand and meet the needs of a future-looking Authentication and Authorisation Infrastructure (AAI) for WLCG experiments. Much has changed since the early 2000s when X.509 certificates presented the most suitable choice for authorisation within the grid; progress...
Article
Full-text available
A Third Party Copy (TPC) mechanism has existed in the pure XRootD storage environment for many years. However, using the XRootD TPC in the WLCG environment presents additional challenges due to the diversity of the storage systems involved such as EOS, dCache, DPM and ECHO, requiring that we carefully navigate the unique constraints imposed by thes...
Article
Full-text available
The ATLAS Event Streaming Service (ESS) at the LHC is an approach to preprocessing and delivering data for the Event Service (ES), which implements fine-grained ATLAS event processing. The ESS allows one to asynchronously deliver only the input events required by ES processing, with the aim to decrease data traffic over WAN and improve over...
Article
Full-text available
Since its earliest days, the Worldwide LHC Computing Grid (WLCG) has relied on GridFTP to transfer data between sites. The announcement that Globus is dropping support of its open source Globus Toolkit (GT), which forms the basis for several FTP clients and servers, has created an opportunity to reevaluate the use of FTP. HTTP-TPC, an extension...
Article
Full-text available
WLCG relies on the network as a critical part of its infrastructure and therefore needs to guarantee effective network usage and prompt detection and resolution of any network issues including connection failures, congestion and traffic routing. The OSG Networking Area, in partnership with WLCG, is focused on being the primary source of networking...
Article
Full-text available
LHC data is constantly being moved between computing and storage sites to support analysis, processing, and simulation; this is done at a scale that is currently unique within the science community. For example, the CMS experiment on the LHC manages approximately 200PB of storage across 100 sites and, on a daily basis, moves 1PB between sites via G...
Article
Full-text available
A general problem faced by opportunistic users computing on the grid is that delivering cycles is simpler than delivering data to those cycles. In this project XRootD caches are placed on the internet backbone to create a content delivery network. Scientific workflows in the domains of high energy physics, gravitational waves, and others profit fro...
Article
Full-text available
The Open Science Grid (OSG) provides a common service for resource providers and scientific institutions, and supports sciences such as High Energy Physics, Structural Biology, and other community sciences. As scientific frontiers expand, so does the need for resources to analyze new data. For example, High Energy Physics experiments such as the LH...
Article
Named Data Networking (NDN) is one of the promising future internet architectures, which focuses on the data rather than its location (IP/host-based system). NDN has several characteristics which facilitate addressing and routing the data: fail-over, in-network caching and load balancing. This makes it useful in areas such as managing scientific da...
Technical Report
Full-text available
This document describes how WLCG users may use the available geographically distributed resources without X.509 credentials. In this model, clients are issued with bearer tokens; these tokens are subsequently used to interact with resources. The tokens may contain authorization groups and/or capabilities, according to the preference of the Virtual...
Conference Paper
Hundreds of physicists analyze data collected by the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider using the CMS Remote Analysis Builder and the CMS global pool to exploit the resources of the Worldwide LHC Computing Grid. Efficient use of such an extensive and expensive resource is crucial. At the same time, the CMS collabora...
Article
Full-text available
Rucio is an open-source software framework that provides scientific collaborations with the functionality to organize, manage, and access their data at scale. The data can be distributed across heterogeneous data centers at widely distributed locations. Rucio was originally developed to meet the requirements of the high-energy physics experiment AT...
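A minimal sketch of the client-side view (assuming a configured Rucio client with valid credentials; the scope and name below are made-up placeholders):

    # Hedged sketch: listing the replicas of a file with the Rucio client.
    # Assumes a working rucio.cfg and valid credentials; the DID is made up.
    from rucio.client import Client

    client = Client()
    dids = [{"scope": "user.jdoe", "name": "example.file.root"}]
    for replica in client.list_replicas(dids):
        print(replica["name"], list(replica["rses"]))  # sites holding a copy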
Conference Paper
Data distribution for opportunistic users is challenging as they own neither the computing resources they are using nor any nearby storage. Users are motivated to use opportunistic computing to expand their data processing capacity, but they require storage and fast networking to distribute data to that processing. Since it requires significant mana...
Conference Paper
The management of security credentials (e.g., passwords, secret keys) for computational science workflows is a burden for scientists and information security officers. Problems with credentials (e.g., expiration, privilege mismatch) cause workflows to fail to fetch needed input data or store valuable scientific results, distracting scientists from...
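One failure mode named here, credential expiration mid-workflow, can be guarded against with a simple pre-flight check; the sketch below shells out to the standard voms-proxy-info tool, with an arbitrary lifetime threshold.

    # Sketch: refuse to launch a workflow stage if the X.509 proxy is close
    # to expiry. Uses `voms-proxy-info --timeleft`; the threshold is arbitrary.
    import subprocess

    def proxy_seconds_left() -> int:
        out = subprocess.run(
            ["voms-proxy-info", "--timeleft"],
            capture_output=True, text=True, check=True,
        )
        return int(out.stdout.strip())

    MIN_LIFETIME = 6 * 3600  # six hours, an arbitrary safety margin
    if proxy_seconds_left() < MIN_LIFETIME:
        raise RuntimeError("proxy too close to expiry; renew before submitting")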
Preprint
The LHC's Run 3 will push the envelope on data-intensive workflows and, since at the lowest level this data is managed using the ROOT software framework, preparations for managing this data are starting already. At the beginning of LHC Run 1, all ROOT data was compressed with the ZLIB algorithm; since then, ROOT has added support for additional algor...
Preprint
Full-text available
ROOT is a large code base with a complex set of build-time dependencies; there is a significant difference in compilation time between the "core" of ROOT and the full-fledged deployment. We present results on a "delayed build" for internal ROOT packages and external packages. This gives the ability to offer a "lightweight" core of ROOT, later exten...
Preprint
Distinct HEP workflows have distinct I/O needs; while ROOT I/O excels at serializing complex C++ objects common to reconstruction, analysis workflows typically have simpler objects and can sustain higher event rates. To meet these workflows, we have developed a "bulk I/O" interface, allowing multiple events' data to be returned per library call. Thi...
Preprint
The management of security credentials (e.g., passwords, secret keys) for computational science workflows is a burden for scientists and information security officers. Problems with credentials (e.g., expiration, privilege mismatch) cause workflows to fail to fetch needed input data or store valuable scientific results, distracting scientists from...
Preprint
Data distribution for opportunistic users is challenging as they own neither the computing resources they are using nor any nearby storage. Users are motivated to use opportunistic computing to expand their data processing capacity, but they require storage and fast networking to distribute data to that processing. Since it requires significant mana...
Article
Full-text available
Particle physics has an ambitious and broad experimental programme for the coming decades. This programme requires large investments in detector hardware, either to build new facilities and experiments, or to upgrade existing ones. Similarly, it requires commensurate investment in the R&D of software to acquire, manage, process, and analyse the she...
Preprint
Full-text available
Recent gravitational-wave observations from the LIGO and Virgo observatories have brought a sense of great excitement to scientists and citizens the world over. Since September 2015, 10 binary black hole coalescences and one binary neutron star coalescence have been observed. They have provided remarkable, revolutionary insight into the "gravitation...
Article
Recent gravitational-wave observations from the LIGO and Virgo observatories have brought a sense of great excitement to scientists and citizens the world over. Since September 2015, 10 binary black hole coalescences and one binary neutron star coalescence have been observed. They have provided remarkable, revolutionary insight into the "gravitation...
Preprint
Rucio is an open source software framework that provides scientific collaborations with the functionality to organize, manage, and access their volumes of data. The data can be distributed across heterogeneous data centers at widely distributed locations. Rucio was originally developed to meet the requirements of the high-energy physics experi...
Preprint
In this paper, we propose an application-aware intelligent load balancing system for high-throughput, distributed computing, and data-intensive science workflows. We leverage emerging deep learning techniques for time-series modeling to develop an application-aware predictive analytics system for accurately forecasting GridFTP connection loads. Our...
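As a toy illustration of the forecasting setup only (not the paper's architecture), the sketch below windows a univariate load series and fits a small LSTM with Keras; all shapes and hyperparameters are arbitrary.

    # Toy sketch of time-series load forecasting: sliding windows over a
    # univariate series feed a small LSTM. Not the paper's model.
    import numpy as np
    import tensorflow as tf

    loads = np.random.rand(1000).astype("float32")  # stand-in for connection loads
    W = 24                                          # look-back window (arbitrary)
    X = np.stack([loads[i:i + W] for i in range(len(loads) - W)])[..., None]
    y = loads[W:]

    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(32, input_shape=(W, 1)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=2, batch_size=32, verbose=0)
    print(model.predict(X[-1:]))                    # one-step-ahead forecast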
Article
Full-text available
The ROOT software framework is foundational for the HEP ecosystem, providing multiple capabilities such as I/O, a C++ interpreter, GUI, and math libraries. It uses object-oriented concepts and build-time components to layer between them. We believe that a new layering formalism will benefit the ROOT user community. We present the modularization str...
Article
Full-text available
WLCG relies on the network as a critical part of its infrastructure and therefore needs to guarantee effective network usage and prompt detection and resolution of any network issues, including connection failures, congestion and traffic routing. The OSG Networking Area, in partnership with WLCG, has focused on collecting, storing and making available al...
Article
Full-text available
Foundational software libraries such as ROOT are under intense pressure to avoid software regression, including performance regressions. Continuous performance benchmarking, as a part of continuous integration and other code quality testing, is an industry best-practice to understand how the performance of a software product evolves. We present a f...
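The core idea here, time a kernel and fail when it regresses past a stored baseline, fits in a short harness; the kernel, tolerance, and baseline path below are all illustrative.

    # Toy continuous-benchmarking check: time a kernel, compare against a
    # stored baseline, flag regressions. Everything here is illustrative.
    import json, pathlib, timeit

    def kernel():
        sum(i * i for i in range(100_000))   # stand-in for a real workload

    elapsed = min(timeit.repeat(kernel, number=10, repeat=5))
    baseline_file = pathlib.Path("baseline.json")
    if baseline_file.exists():
        baseline = json.loads(baseline_file.read_text())["elapsed"]
        if elapsed > 1.10 * baseline:        # >10% slower counts as a regression
            raise SystemExit(f"regression: {elapsed:.3f}s vs {baseline:.3f}s")
    baseline_file.write_text(json.dumps({"elapsed": elapsed}))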
Article
Full-text available
Scheduling multi-core workflows in a global HTCondor pool is a multi-dimensional problem whose solution depends on the requirements of the job payloads, the characteristics of available resources, and the boundary conditions such as fair share and prioritization imposed on the job matching to resources. Within the context of a dedicated task force,...
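A multi-core request is ultimately expressed in the job description; a minimal sketch with the HTCondor Python bindings (the executable and resource numbers are placeholders):

    # Sketch: describing and submitting a multi-core job with the HTCondor
    # Python bindings (9.x-style API). Payload and numbers are placeholders.
    import htcondor

    sub = htcondor.Submit({
        "executable": "/usr/bin/my_payload",  # placeholder payload
        "request_cpus": "8",                  # multi-core slot request
        "request_memory": "16GB",
    })
    schedd = htcondor.Schedd()
    result = schedd.submit(sub)
    print(result.cluster())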
Article
Full-text available
GlideinWMS pilot factories are a key aspect of pilot-based grid operations. Proper and efficient use of any central building block of the grid infrastructure is essential, and GlideinWMS factories are no exception. The monitoring package for the GlideinWMS factory was originally developed when the factories were serving a couple of VOs...
Article
Full-text available
The CMS experiment has an HTCondor Global Pool, composed of more than 200K CPU cores available for Monte Carlo production and the analysis of data. The submission of user jobs to this pool is handled by either CRAB, the standard workflow management tool used by CMS users to submit analysis jobs requiring event processing of large amounts of data, or b...
Article
Full-text available
LHC experiments make extensive use of web proxy caches, especially for software distribution via the CernVM File System and for conditions data via the Frontier Distributed Database Caching system. Since many jobs read the same data, cache hit rates are high and hence most of the traffic flows efficiently over Local Area Networks. However, it is no...
Article
Full-text available
The CMS Submission Infrastructure Global Pool, built on GlideinWMS and HTCondor, is a worldwide distributed dynamic pool responsible for the allocation of resources for all CMS computing workloads. Matching the continuously increasing demand for computing resources by CMS requires the anticipated assessment of its scalability limitations. In additi...
Article
Full-text available
The OSG has long maintained a central accounting system called Gratia. It uses small probes on each computing and storage resource in order to collect resource usage. The probes report to a central collector which stores the usage in a database. The database is then queried to generate reports. As the OSG aged, the size of the database grew very la...
Article
Full-text available
Outside the HEP computing ecosystem, it is vanishingly rare to encounter user X.509 certificate authentication (and proxy certificates are even more rare). The web never widely adopted the user certificate model, but increasingly sees the need for federated identity services and distributed authorization. For example, Dropbox, Google and Box instead...