Jakob Blomer
CERN · Physics Department (PH)

About

82 Publications · 15,525 Reads · 992 Citations

Publications (82)
Article
Full-text available
The CernVM File System (CVMFS) provides the software distribution backbone for High Energy and Nuclear Physics experiments and many other scientific communities in the form of a globally available shared software area. It has been designed for the software distribution problem of experiment software for LHC Runs 1 and 2. For LHC Run 3 and even more...
Article
Full-text available
After using ROOT’s TTree I/O subsystem for over two decades and storing more than an exabyte of compressed High Energy Physics (HEP) data, advances in technology have motivated a complete redesign, RNTuple, which breaks backward compatibility to take better advantage of modern storage options. The RNTuple I/O subsystem has been designed to address p...
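
To make the model-based interface concrete, here is a minimal, hypothetical sketch of writing event data with RNTuple. The namespace (ROOT::Experimental), the header names, and the field, tuple, and file names ("pt", "Events", "data.root") are assumptions that have varied across ROOT releases; treat this as an illustration rather than definitive usage.

    #include <ROOT/RNTupleModel.hxx>
    #include <ROOT/RNTupleWriter.hxx>

    void write_events() {
       // Declare the data model: each field becomes an on-disk column.
       auto model = ROOT::Experimental::RNTupleModel::Create();
       auto pt = model->MakeField<float>("pt");

       // Attach the model to a writer that creates (or replaces) the file.
       auto writer = ROOT::Experimental::RNTupleWriter::Recreate(
          std::move(model), "Events", "data.root");

       for (int i = 0; i < 100; ++i) {
          *pt = 0.5f * i;  // set the bound field for this event
          writer->Fill();  // commit one event (row)
       }
    }  // the writer flushes and closes the file on destruction
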
Article
Full-text available
RDataFrame is ROOT’s high-level interface for Python and C++ data analysis. Since it first became available, RDataFrame adoption has grown steadily and it is now poised to be a major component of analysis software pipelines for LHC Run 3 and beyond. Thanks to its design inspired by declarative programming principles, RDataFrame enables the developm...
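
As an illustration of the declarative style, a minimal RDataFrame analysis might look as follows; the tree name ("Events"), file name, and branch names ("pt", "mass") are placeholders, not taken from the publication.

    #include <ROOT/RDataFrame.hxx>
    #include <TROOT.h>

    void analyse() {
       ROOT::EnableImplicitMT();              // use all available cores
       ROOT::RDataFrame df("Events", "data.root");
       auto h = df.Filter("pt > 10")          // declarative selection
                  .Histo1D("mass");           // lazily booked result
       h->Draw();                             // first access runs the event loop
    }

The computation graph is executed only when a result is accessed, which lets RDataFrame run all booked actions in a single, parallelized pass over the data.
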
Article
Full-text available
The RNTuple I/O subsystem is ROOT’s future event data file format and access API. It is driven by the expected data volume increase at upcoming HEP experiments, e.g. at the HL-LHC, and recent opportunities in the storage hardware and software landscape such as NVMe drives and distributed object stores. RNTuple is a redesign of the TTree binary form...
Article
Full-text available
The recent evolutions of the analysis frameworks and physics data formats of the LHC experiments provide the opportunity of using central analysis facilities with a strong focus on interactivity and short turnaround times, to complement the more common distributed analysis on the Grid. In order to plan for such facilities, it is essential to know i...
Article
Full-text available
Data preservation is a mandatory specification for any present and future experimental facility and it is a cost-effective way of doing fundamental research by exploiting unique data sets in the light of the continuously increasing theoretical understanding. This document summarizes the status of data preservation in high energy physics. The paradi...
Preprint
Full-text available
This document summarizes the status of data preservation in high energy physics. The paradigms and the methodological advances are discussed from a perspective of more than ten years of experience with a structured effort at international level. The status and the scientific return related to the preservation of data accumulated at large collider e...
Article
Full-text available
Upcoming HEP experiments, e.g. at the HL-LHC, are expected to increase the volume of generated data by at least one order of magnitude. In order to retain the ability to analyze the influx of data, full exploitation of modern storage hardware and systems, such as low-latency high-bandwidth NVMe devices and distributed object stores, becomes critica...
Article
Full-text available
In recent years, RDataFrame, ROOT’s high-level interface for data analysis and processing, has seen widespread adoption on the part of HEP physicists. Much of this success is due to RDataFrame’s ergonomic programming model that enables the implementation of common analysis tasks more easily than previous APIs, without compromising on application pe...
Article
Full-text available
The CernVM File System (CernVM-FS) is a global read-only POSIX file system that provides scalable and reliable software distribution to numerous scientific collaborations. It gives access to more than a billion binary files of experiment application software stacks and operating system containers to end user devices, grids, clouds, and supercompute...
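
Because CernVM-FS presents itself as an ordinary read-only POSIX file system mounted under /cvmfs, applications need no dedicated client API; plain file I/O suffices. A sketch with a hypothetical repository path:

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
       // Hypothetical path inside a CernVM-FS repository; file contents
       // are downloaded over HTTP and cached locally on first access.
       std::ifstream setup("/cvmfs/experiment.example.org/setup.sh");
       std::string line;
       while (std::getline(setup, line))
          std::cout << line << '\n';
       return 0;
    }
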
Article
Full-text available
Data analysis workflows in High Energy Physics (HEP) read data written in the ROOT columnar format. Such data has traditionally been stored in files that are often read via the network from remote storage facilities, which represents a performance penalty especially for data processing workflows that are I/O bound. To address that issue, this paper...
Preprint
Full-text available
ROOT is high energy physics' software for storing and mining data in a statistically sound way, to publish results with scientific graphics. It has been evolving for 25 years, now providing the storage format for more than one exabyte of data; virtually all high energy physics experiments use ROOT. With another significant increase in the amount of dat...
Preprint
Full-text available
This document discusses the state, roadmap, and risks of the foundational components of ROOT with respect to the experiments at the HL-LHC (Run 4 and beyond). As foundational components, the document considers in particular the ROOT input/output (I/O) subsystem. The current HEP I/O is based on the TFile container file format and the TTree binary ev...
Preprint
Upcoming HEP experiments, e.g. at the HL-LHC, are expected to increase the volume of generated data by at least one order of magnitude. In order to retain the ability to analyze the influx of data, full exploitation of modern storage hardware and systems, such as low-latency high-bandwidth NVMe devices and distributed object stores, becomes critica...
Article
The performance gap between CPU and memory widens continuously. Choosing the best memory layout for each hardware architecture is increasingly important as more and more programs become memory bound. For portable codes that run across heterogeneous hardware architectures, the choice of the memory layout for data structures is ideally decoupled from...
Poster
Full-text available
Upcoming HEP experiments, e.g. at the HL-LHC, are expected to increase the volume of generated data by at least one order of magnitude. In order to retain the ability to analyze the influx of data, full exploitation of modern storage hardware and systems, such as low-latency high-bandwidth NVMe devices and distributed object stores, becomes critica...
Preprint
Over the last two decades, ROOT TTree has been used for storing over one exabyte of High-Energy Physics (HEP) events. The TTree columnar on-disk layout has proven to be ideal for analyses of HEP data that typically require access to many events, but only a subset of the information stored for each of them. Future colliders, and particularly HL...
Preprint
Full-text available
The performance gap between CPU and memory widens continuously. Choosing the best memory layout for each hardware architecture is increasingly important as more and more programs become memory bound. For portable codes that run across heterogeneous hardware architectures, the choice of the memory layout for data structures is therefore ideally deco...
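
To show what decoupling the layout from the access pattern means, here is a generic sketch (deliberately not the paper's actual library interface): the kernel is written once against an index-based accessor and compiles unchanged against an array-of-structs or a struct-of-arrays layout.

    #include <array>
    #include <cstddef>

    struct AoS {                                // array of structs
       struct Particle { float x, y; };
       std::array<Particle, 1024> data;
       float &x(std::size_t i) { return data[i].x; }
    };

    struct SoA {                                // struct of arrays
       std::array<float, 1024> xs, ys;
       float &x(std::size_t i) { return xs[i]; }
    };

    template <typename Layout>
    void shift_x(Layout &l, float dx) {         // layout-agnostic kernel
       for (std::size_t i = 0; i < 1024; ++i)
          l.x(i) += dx;
    }

    int main() {
       SoA soa{};
       shift_x(soa, 1.0f);  // the same call works for an AoS instance
    }

Which layout wins depends on the access pattern and the hardware; keeping the kernel layout-agnostic lets that choice be made per target architecture.
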
Article
Full-text available
The past years have shown a revolution in the way scientific workloads are being executed thanks to the wide adoption of software containers. These containers run largely isolated from the host system, ensuring that the development and execution environments are the same everywhere. This enables full reproducibility of the workloads and therefore a...
Article
Full-text available
In the HEP community, software plays a central role in the operation of experiments’ facilities and for reconstruction jobs, with CVMFS being the service enabling the distribution of software at scale. In view of High Luminosity LHC, CVMFS developers investigated how to improve the publication workflow to support the most demanding use cases. This...
Article
Full-text available
Containers became the de-facto standard to package and distribute modern applications and their dependencies. The HEP community demonstrates an increasing interest in such technology, with scientists encapsulating their analysis workflow and code inside a container image. The analysis is first validated on a small dataset and minimal hardware resou...
Article
Full-text available
Over the last two decades, ROOT TTree has been used for storing over one exabyte of High-Energy Physics (HEP) events. The TTree columnar on-disk layout has proven to be ideal for analyses of HEP data that typically require access to many events, but only a subset of the information stored for each of them. Future colliders, and particularly HL...
Preprint
Full-text available
The high energy physics community is discussing where investment is needed to prepare software for the HL-LHC and its unprecedented challenges. The ROOT project has been one of the central software players in high energy physics for decades. From its experience and expectations, the ROOT team has distilled a comprehensive set of areas that should see r...
Article
Full-text available
Linux containers have gained widespread use in high energy physics, be it for services using container engines such as containerd/kubernetes, for production jobs using container engines such as Singularity or Shifter, or for development workflows using Docker as a local container engine. Thus the efficient distribution of the container images, whos...
Preprint
The ROOT TTree data format encodes hundreds of petabytes of High Energy and Nuclear Physics events. Its columnar layout drives rapid analyses, as only those parts ("branches") that are really used in a given analysis need to be read from storage. Its unique feature is the seamless C++ integration, which allows users to directly store their event cl...
Article
Full-text available
The CernVM File System provides the software and container distribution backbone for most High Energy and Nuclear Physics experiments. It is implemented as a file system in user-space (Fuse) module, which permits its execution without any elevated privileges. Yet, mounting the file system in the first place is handled by a privileged suid helper pr...
Article
Full-text available
The ROOT TTree data format encodes hundreds of petabytes of High Energy and Nuclear Physics events. Its columnar layout drives rapid analyses, as only those parts (“branches”) that are really used in a given analysis need to be read from storage. Its unique feature is the seamless C++ integration, which allows users to directly store their event cl...
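
The selective branch access described above can be illustrated with ROOT's TTreeReader; the tree and branch names below ("Events", "pt") and the file name are placeholders.

    #include <TFile.h>
    #include <TTreeReader.h>
    #include <TTreeReaderValue.h>
    #include <iostream>

    void read_pt() {
       TFile file("data.root");
       TTreeReader reader("Events", &file);
       TTreeReaderValue<float> pt(reader, "pt");  // only this branch is read
       while (reader.Next())                      // iterate over events
          std::cout << *pt << '\n';
    }
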
Article
Full-text available
Particle physics has an ambitious and broad experimental programme for the coming decades. This programme requires large investments in detector hardware, either to build new facilities and experiments, or to upgrade existing ones. Similarly, it requires commensurate investment in the R&D of software to acquire, manage, process, and analyse the she...
Article
Full-text available
LHC experiments make extensive use of web proxy caches, especially for software distribution via the CernVM File System and for conditions data via the Frontier Distributed Database Caching system. Since many jobs read the same data, cache hit rates are high and hence most of the traffic flows efficiently over Local Area Networks. However, it is no...
Article
Full-text available
The CernVM File System (CernVM-FS) provides a scalable and reliable software distribution and—to some extent—a data distribution service. It gives POSIX access to more than a billion binary files of experiment application software stacks and operating system containers to end user devices, grids, clouds, and supercomputers. Increasingly, CernVM-FSa...
Article
Full-text available
The CernVM File System (CernVM-FS) provides a scalable and reliable software distribution service implemented as a POSIX read-only filesystem in user space (FUSE). It was originally developed at CERN to assist High Energy Physics (HEP) collaborations in deploying software on the worldwide distributed computing infrastructure for data processing app...
Article
Full-text available
The bright future of particle physics at the Energy and Intensity frontiers poses exciting challenges to the scientific software community. The traditional strategies for processing and analysing data are evolving in order to (i) offer higher-level programming model approaches and (ii) exploit parallelism to cope with the ever increasing complexity...
Article
Full-text available
Containerization technology is becoming more and more popular because it provides an efficient way to improve deployment flexibility by packaging up code into software micro-environments. Yet, containerization has limitations and one of the main ones is the fact that entire container images need to be transferred before they can be used. Container...
Article
Full-text available
The analysis of High Energy Physics (HEP) data sets often takes place outside the realm of experiment frameworks and central computing workflows, using carefully selected "n-tuples" or Analysis Object Data (AOD) as a data source. Such n-tuples or AODs may comprise data from tens of millions of events and grow to hundred gigabytes or a few terabytes...
Conference Paper
In recent years, there was a growing interest in improving the utilization of supercomputers by running applications of experiments at the Large Hadron Collider (LHC) at CERN when idle cores cannot be assigned to traditional HPC jobs. At the same time, the upcoming LHC machine and detector upgrades will produce some 60 times higher data rates and c...
Article
Full-text available
Data federations have become an increasingly common tool for large collaborations such as CMS and ATLAS to efficiently distribute large data files. Unfortunately, these are typically implemented with weak namespace semantics and a non-POSIX API. On the other hand, CVMFS has provided a POSIX-compliant read-only interface for use cases with a small w...
Article
Full-text available
The CernVM File System today is commonly used to host and distribute application software stacks. In addition to this core task, recent developments expand the scope of the file system into two new areas. Firstly, CernVM-FS emerges as a good match for container engines to distribute the container image contents. Compared to native container image d...
Article
Full-text available
All four of the LHC experiments depend on web proxies (that is, squids) at each grid site to support software distribution by the CernVM File System (CVMFS). CMS and ATLAS also use web proxies for conditions data distributed through the Frontier Distributed Database caching system. ATLAS & CMS each have their own methods for their grid jobs to find...
Article
Full-text available
Cloud resources nowadays contribute an essential share of resources for computing in high-energy physics. Such resources can be provided either by private or public IaaS clouds (e.g. OpenStack, Amazon EC2, Google Compute Engine) or by volunteer computers (e.g. LHC@Home 2.0). In any case, experiments need to prepare a virtual machine image that pro...
Article
Full-text available
During the last years, several Grid computing centres chose virtualization as a better way to manage diverse use cases with self-consistent environments on the same bare infrastructure. The maturity of control interfaces (such as OpenNebula and OpenStack) opened the possibility to easily change the amount of resources assigned to each use case by s...
Article
Full-text available
Amazon S3 is a widely adopted web API for scalable cloud storage that could also fulfill storage requirements of the high-energy physics community. CERN has been evaluating this option using some key HEP applications such as ROOT and the CernVM filesystem (CvmFS) with S3 back-ends. In this contribution we present an evaluation of two versions of th...
Article
Full-text available
A common use pattern in the computing models of particle physics experiments is running many distributed applications that read from a shared set of data files. We refer to this data as auxiliary data, to distinguish it from (a) event data from the detector (which tends to be different for every job), and (b) conditions data about the detector (whi...
Article
Full-text available
The distributed file system landscape is scattered. Besides a plethora of research file systems, there is also a large number of production grade file systems with various strengths and weaknesses. The file system, as an abstraction of permanent storage, is appealing because it provides application portability and integration with legacy and third-...
Article
Full-text available
Most high-energy physics analysis jobs are embarrassingly parallel except for the final merging of the output objects, which are typically histograms. Currently, the merging of output histograms scales badly. The running time for distributed merging depends not only on the overall number of bins but also on the number of partial histogram output files...
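
The merge step in question can be sketched with ROOT's TH1::Merge, which combines partial histograms bin by bin; the histograms here stand in for per-worker outputs and their binning is made up.

    #include <TH1F.h>
    #include <TList.h>

    void merge_partials() {
       // Two partial histograms, as produced by independent workers.
       TH1F part1("h1", "partial 1", 100, 0., 1.);
       TH1F part2("h2", "partial 2", 100, 0., 1.);
       part1.Fill(0.3);
       part2.Fill(0.7);

       // Merge into a total; cost grows with bins times number of files.
       TH1F total("h", "merged;x;entries", 100, 0., 1.);
       TList parts;
       parts.Add(&part1);
       parts.Add(&part2);
       total.Merge(&parts);
    }
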
Article
Full-text available
Lately, there is a trend in scientific projects to look for computing resources in the volunteering community. In addition, to reduce the development effort required to port the scientific software stack to all the known platforms, the use of Virtual Machines (VMs) is becoming increasingly popular. Unfortunately their use further complicates the s...
Article
Data from High Energy Physics (HEP) experiments are collected with significant financial and human effort and are mostly unique. An inter-experimental study group on HEP data preservation and long-term analysis was convened as a panel of the International Committee for Future Accelerators (ICFA). The group was formed by large collider-based experim...
Article
Delivering complex software across a worldwide distributed system is a major challenge in high-throughput scientific computing. The problem arises at different scales for many scientific communities that use grids, clouds, and distributed clusters to satisfy their computing needs. For high-energy physics (HEP) collaborations dealing with large amou...
Article
Full-text available
Distributed file systems provide a fundamental abstraction to location-transparent, permanent storage. They allow distributed processes to co-operate on hierarchically organized data beyond the life-time of each individual process. The great power of the file system interface lies in the fact that applications do not need to be modified in order to...
Article
Full-text available
The CernVM File System (CernVM-FS) is a snapshotting read-only file system designed to deliver software to grid worker nodes over HTTP in a fast, scalable and reliable way. In recent years it became the de-facto standard method to distribute HEP experiment software in the WLCG and is starting to be adopted by other grid computing communities outside HEP...
Article
Scientific results in high-energy physics and in many other fields often rely on complex software stacks. In order to support reproducibility and scrutiny of the results, it is good practice to use open source software and to cite software packages and versions. With ever-growing complexity of scientific software on one side and with IT life-cycles...
Article
Full-text available
Both the CernVM File System (CVMFS) and the Frontier Distributed Database Caching System (Frontier) distribute centrally updated data worldwide for LHC experiments using http proxy caches. Neither system provides privacy or access control on reading the data, but both control access to updates of the data and can guarantee the authenticity and inte...
Article
Full-text available
In a virtualized environment, contextualization is the process of configuring a VM instance for the needs of various deployment use cases. Contextualization in CernVM can be done by passing a handwritten context to the user data field of cloud APIs, when running CernVM on the cloud, or by using CernVM web interface when running the VM locally. Cern...
Article
Full-text available
PROOF, the Parallel ROOT Facility, is a ROOT-based framework which enables interactive parallelism for event-based tasks on a cluster of computing nodes. Although PROOF can be used simply from within a ROOT session with no additional requirements, deploying and configuring a PROOF cluster used to be far less straightforward. Recently great efforts ha...
Article
Full-text available
The traditional virtual machine building and deployment process is centered around the virtual machine hard disk image. The packages comprising the VM operating system are carefully selected, hard disk images are built for a variety of different hypervisors, and images have to be distributed and decompressed in order to instantiate a virtual ma...
Conference Paper
Computing at LHC is based on the Grid model, where geographically distributed local batch farms are federated using proper middleware. Several computing centers are considering a conversion to private clouds, capable of supporting alternative computing models along with the Grid one: CERN is pioneering such conversion. Computing tasks at LHC are mo...
Article
Full-text available
CernVM Co-Pilot is a framework for instantiating an ad-hoc computing infrastructure on top of managed or unmanaged computing resources. Co-Pilot can either be used to create a stand-alone computing infrastructure, or to integrate new computing resources into existing infrastructures (such as Grid or batch). Unlike traditional middleware systems, Co...
Article
CernVM is a virtual software appliance designed to support the development cycle and provide a runtime environment for LHC applications. It consists of a minimal Linux distribution, a specially tuned file system designed to deliver application software on demand, and contextualization tools. The maintenance of these components involves a variety of...
Article
The ATLAS collaboration is managing one of the largest collections of software among the High Energy Physics experiments. Traditionally, this software has been distributed via rpm or pacman packages, and has been installed in every site and user's machine, using more space than needed since the releases share common files but are installed in their...
Article
Full-text available
For the past couple of years, a team at CERN and partners from the Citizen Cyberscience Centre (CCC) have been working on a project that enables general physics simulation programs to run in a virtual machine on volunteer PCs around the world. The project uses the Berkeley Open Infrastructure for Network Computing (BOINC) framework. Based on CERNVM and...
Article
Long-term preservation of scientific data represents a challenge to experiments, especially regarding the analysis software. Preserving data is not enough; the full software and hardware environment is needed. Virtual machines (VMs) make it possible to preserve hardware “in software”. A complete infrastructure package has been developed for easy de...
Article
The CernVM File System (CernVM-FS) provides a scalable, reliable and close to zero-maintenance software distribution service. It was developed to assist HEP collaborations to deploy their software on the worldwide-distributed computing infrastructure used to run their data processing applications. CernVM-FS is deployed on a wide range of computing...
Article
Full-text available
Volunteer desktop grids are nowadays becoming more and more powerful thanks to improved high-end components: multi-core CPUs, more RAM, larger hard disks, better network connectivity and bandwidth, etc. As a result, desktop grid systems can run more complex experiments or simulations, but some problems remain: the heterogeneity of hardware a...
Article
CernVM is a virtual software appliance designed to support the development cycle and provide a runtime environment for the LHC experiments. It consists of three key components that differentiate it from more traditional virtual machines: a minimal Linux Operating System (OS), a specially tuned file system designed to deliver application software on...
Article
CernVM Co-Pilot is a framework for the delivery and execution of the workload on remote computing resources. It consists of components which are developed to ease the integration of geographically distributed resources (such as commercial or academic computing clouds, or the machines of users participating in volunteer computing projects) into exis...
Article
Full-text available
Parallelism aims to improve computing performance by executing a set of computations concurrently. Since the advent of today's many-core machines the full exploitation of the available CPU power has been one of the main challenges. In High Energy Physics (HEP) final data analysis the bottleneck is not only the available CPU but also the available I...
Article
Full-text available
Computing for the LHC, and for HEP more generally, is traditionally viewed as requiring specialized infrastructure and software environments, and therefore not compatible with the recent trend in "volunteer computing", where volunteers supply free processing time on ordinary PCs and laptops via standard Internet connections. In this paper, we demon...
Article
In a distributed computing model such as WLCG, experiment-specific application software has to be efficiently distributed to every site of the Grid. Application software is currently installed in a shared area of the site, visible to all Worker Nodes (WNs) through some protocol (NFS, AFS or other). The software is installed at t...
Article
The CernVM File System (CernVM-FS) is a read-only file system designed to deliver high energy physics (HEP) experiment analysis software onto virtual machines and Grid worker nodes in a fast, scalable, and reliable way. CernVM-FS decouples the underlying operating system from the experiment-defined software stack. Files and file metadata are aggress...
Article
The computing facilities used to process data for the experiments at the Large Hadron Collider at CERN are scattered around the world. The embarrassingly parallel workload allows for use of various computing resources, such as Grid sites of the Worldwide LHC Computing Grid, commercial and institutional cloud resources, as well as individual home PC...
Article
In the attempt to solve the problem of processing data coming from LHC experiments at CERN at a rate of 15 PB per year, for almost a decade the High Energy Physics (HEP) community has focused its efforts on the development of the Worldwide LHC Computing Grid. This generated large interest and expectations promising to revolutionize computing. Meanwhi...
Conference Paper
Scientific computing often takes place on globally scattered resources. The CernVM web file system supports such scenarios. It was designed to easily retrieve files from the web server that provides the CERN software distributions. In this paper, we extend the CernVM file system. We propose a distributed algorithm to retrieve cached data from the m...
Article
CernVM is a Virtual Software Appliance capable of running physics applications from the LHC experiments at CERN. It aims to provide a complete and portable environment for developing and running LHC data analysis on any end-user computer (laptop, desktop) as well as on the Grid, independently of Operating System platforms (Linux, Windows, MacOS). T...
Conference Paper
Full-text available
Using virtualization technology, the entire application environment of an LHC experiment, including its Linux operating system and the experiment's code, libraries and support utilities, can be incorporated into a virtual image and executed under suitable hypervisors installed on a choice of target host platforms. The Virtualization R&D project at...
Article
Full-text available
CernVM[1] is a Virtual Software Appliance capable of running physics applications from the LHC experiments at CERN. Such a virtual appliance aims to provide a complete and portable environment for developing and running LHC data analysis on any end-user computer (laptop, desktop) and on the Grid, independently of Operating System software and hardw...
Article
GrGen.NET is a graph rewrite tool enabling elegant and convenient development of graph rewriting applications with comparable performance to conventionally developed ones. GrGen.NET uses attributed, typed, and directed multigraphs with multiple inheritance on node and edge types. Extensive graphical debugging integrated into an interactive shell co...
