Enric Tejedor’s research while affiliated with CERN and other places


Publications (15)


Figure 2. Runtimes for RDataFrame event loops for three different scenarios: A. 100M events, 1 histogram produced, 2 variations (nominal, up, down histograms filled). B. 10M events, 1 nominal histogram, 100 variations (101 histograms filled). C. 100k events, 20 nominal histograms, 100 variations (2k histograms filled).
RDataFrame enhancements for HEP analyses
  • Article
  • Full-text available

February 2023 · 52 Reads · 1 Citation · Journal of Physics Conference Series

S Hageboeck · [...] · S Wunsch

In recent years, RDataFrame, ROOT's high-level interface for data analysis and processing, has seen widespread adoption among HEP physicists. Much of this success is due to RDataFrame's ergonomic programming model, which enables common analysis tasks to be implemented more easily than previous APIs did, without compromising application performance. Nonetheless, RDataFrame's interfaces have been further improved by the recent addition of several major HEP-oriented features. In this contribution we introduce, among others, a dedicated syntax to define systematic variations, per-sample callbacks useful for defining quantities that vary on a per-sample basis, simplified collection operations, and the injection of just-in-time-compiled Python functions into the optimized C++ event loop.
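To make the new variation syntax concrete, here is a minimal PyROOT sketch, assuming ROOT 6.26 or later, where Vary and VariationsFor live under the experimental RDataFrame interface; the column name pt and the ±10% scale are illustrative choices, not from the paper:

```python
import ROOT

# Toy input: 1000 events with a single transverse-momentum column.
df = ROOT.RDataFrame(1000).Define("pt", "gRandom->Exp(30.)")

# Register "down"/"up" systematic variations of pt: per event, the
# expression yields one value per variation; evaluation is lazy.
df = df.Vary("pt", "ROOT::RVecD{pt*0.9, pt*1.1}", ["down", "up"], "ptScale")

h_nominal = df.Histo1D(("h_pt", "p_{T}", 50, 0., 150.), "pt")

# A single event loop produces the nominal histogram plus all varied
# ones, keyed like "nominal", "ptScale:down", "ptScale:up".
histos = ROOT.RDF.Experimental.VariationsFor(h_nominal)
for key in histos.GetKeys():
    print(key, histos[key].GetMean())
```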


HL-LHC Analysis With ROOT

May 2022 · 196 Reads

ROOT is high energy physics' software for storing and mining data in a statistically sound way and for publishing results with scientific graphics. It has been evolving for 25 years and now provides the storage format for more than one exabyte of data; virtually all high energy physics experiments use ROOT. With another significant increase in the amount of data to be handled scheduled to arrive in 2027, ROOT is preparing for a massive upgrade of its core ingredients. As part of a review of crucial software for high energy physics, the ROOT team has documented its R&D plans for the coming years.


ROOT for the HL-LHC: data format

April 2022 · 36 Reads

This document discusses the state, roadmap, and risks of the foundational components of ROOT with respect to the experiments at the HL-LHC (Run 4 and beyond). As foundational components, the document considers in particular the ROOT input/output (I/O) subsystem. The current HEP I/O is based on the TFile container file format and the TTree binary event data format. The work going into the new RNTuple event data format aims at superseding TTree, to make RNTuple the production ROOT event data I/O that meets the requirements of Run 4 and beyond.
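As a concrete baseline, the current I/O stack can be exercised from PyROOT in a few lines; this is a minimal sketch (file, tree, and branch names are illustrative), and the stated goal is that the same high-level code keeps working once the underlying store is an RNTuple instead of a TTree:

```python
import ROOT
from array import array

# Write a TTree, today's production event data format, into a TFile,
# the container format.
f = ROOT.TFile("events.root", "RECREATE")
tree = ROOT.TTree("Events", "toy events")
pt = array("f", [0.0])
tree.Branch("pt", pt, "pt/F")  # one float branch per event
for _ in range(1000):
    pt[0] = ROOT.gRandom.Exp(30.0)
    tree.Fill()
f.Write()
f.Close()

# Read it back through the high-level interface.
df = ROOT.RDataFrame("Events", "events.root")
print("mean pt:", df.Mean("pt").GetValue())
```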


Figure 1: Design of the new PyROOT on top of cppyy.
Figure 2: This example dynamically defines a C++ function template with variadic template syntax via PyROOT; after that it instantiates the template with three types and calls the resulting function.
A New PyROOT: Modern, Interoperable and More Pythonic

January 2020 · 388 Reads · 12 Citations · The European Physical Journal Conferences

Python is nowadays one of the most widely used languages for data science. Its rich ecosystem of libraries, together with its simplicity and readability, is behind its popularity. HEP is also embracing that trend, often using Python as an interface language to access C++ libraries for the sake of performance. PyROOT, the Python bindings of the ROOT software toolkit, plays a key role here, since it allows C++ code to be invoked automatically and dynamically from Python without generating any static wrappers beforehand. In that sense, this paper presents the efforts to create a new PyROOT with three main qualities: modern, able to exploit the latest C++ features from Python; pythonic, providing Python syntax to use C++ classes; and interoperable, able to interact with the most important libraries of the Python data science toolset.
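A minimal sketch of this dynamic mode of operation; the C++ function below is our own illustration, not part of ROOT:

```python
import ROOT

# Declare C++ on the fly: cling JIT-compiles it, so no static
# wrappers are generated beforehand.
ROOT.gInterpreter.Declare("""
template <typename T>
T scaled_sum(const std::vector<T>& v, T scale) {
    T s{};
    for (auto x : v) s += x;
    return s * scale;
}
""")

# The template is instantiated for the argument types at call time.
v = ROOT.std.vector["double"]([1.0, 2.0, 3.0])
print(ROOT.scaled_sum(v, 2.0))  # -> 12.0

# "Pythonizations" make C++ classes feel native: std::vector iterates.
print([x for x in v])
```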


Declarative Big Data Analysis for High-Energy Physics: TOTEM Use Case

August 2019 · 78 Reads · 7 Citations

The High-Energy Physics community faces new data processing challenges caused by the expected growth of data resulting from the upgrade of the LHC accelerator. These challenges drive the demand for exploring new approaches to data analysis. In this paper, we present a new declarative programming model extending the popular ROOT data analysis framework, and its distributed processing capability based on Apache Spark. The developed framework enables high-level operations on the data, known from other big data toolkits, while preserving compatibility with existing HEP data files and software. In our experiments with a real analysis of TOTEM experiment data, we evaluate the scalability of this approach and its prospects for interactive processing of such large data sets. Moreover, we show that the analysis code developed with the new model is portable between a production cluster at CERN and an external cluster hosted in the Helix Nebula Science Cloud, thanks to the Science Box bundle of services.
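This programming model lives on in today's ROOT as distributed RDataFrame. The following is a hedged sketch, assuming ROOT 6.24 or later with the Spark backend available and an illustrative local input file; in production the SparkContext would point at a cluster such as CERN's or one hosted in a science cloud, and the exact keyword arguments may differ between ROOT versions:

```python
import ROOT
from pyspark import SparkConf, SparkContext

# A local Spark context stands in for a production cluster; only the
# master URL and resource settings would change.
sc = SparkContext(conf=SparkConf().setMaster("local[4]"))

# Same declarative API as plain RDataFrame, but the event loop is
# split into ranges that run as Spark tasks.
SparkRDF = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame
df = SparkRDF("Events", "events.root", sparkcontext=sc, npartitions=8)

h = df.Filter("pt > 20").Histo1D(("h", "p_{T}", 50, 0., 150.), "pt")
print(h.GetEntries())  # triggers the distributed event loop
```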


Figure 1. Synchronization and sharing features of CERNBox.
Figure 4. Before cloning a shared project, a SWAN user can inspect its contents and open its notebooks statically.
Facilitating Collaborative Analysis in SWAN

January 2019 · 91 Reads · 3 Citations · The European Physical Journal Conferences

SWAN (Service for Web-based ANalysis) is a CERN service that allows users to perform interactive data analysis in the cloud, following a "software as a service" model. It is built upon the widely-used Jupyter notebooks, allowing users to write and run their data analysis using only a web browser. By connecting to SWAN, users have immediate access to the storage, software and computing resources that CERN provides and that they need to do their analyses. Besides providing an easier way of producing scientific code and results, SWAN is also a great tool for creating shareable content. From results that need to be reproducible to tutorials and demonstrations for outreach and teaching, Jupyter notebooks are the ideal way of distributing this content: in one single file, users can include their code, the results of the calculations and all the relevant textual information. Sharing notebooks allows others to visualize, modify, personalize or even re-run all the code. In that sense, this paper describes the efforts made to facilitate sharing in SWAN. Given the importance of collaboration in our scientific community, we have brought the sharing functionality of CERNBox, CERN's cloud storage service, directly inside SWAN. SWAN users now have a redesigned interface where they can share "Projects": a special kind of folder containing notebooks and other files, e.g., input datasets and images. When a user shares a Project with other users, the latter can immediately see and work with its contents from SWAN.


Figure 4: Visualization of status and resource utilization
Figure 6: Browsing the HDFS filesystem from a Jupyter notebook
Figure 7: Top level comparison of Spark deployments
Figure 8: Client tools to manage Spark Kubernetes cluster
Figure 9: Elapsed time of selected queries executed against YARN & K8s
Apache Spark usage and deployment models for scientific computing

January 2019 · 1,916 Reads · 8 Citations · The European Physical Journal Conferences

This talk shares our recent experiences in providing a data analytics platform based on Apache Spark for High Energy Physics, the CERN accelerator logging system and infrastructure monitoring. The Hadoop Service has started to expand its user base to researchers who want to perform analysis with big data technologies. Among many frameworks, Apache Spark is currently getting the most traction from various user communities, and new ways to deploy Spark, such as Apache Mesos or Spark on Kubernetes, have started to evolve rapidly. Meanwhile, notebook web applications such as Jupyter offer the ability to perform interactive data analytics and visualizations without the need to install additional software. CERN already provides a web platform, called SWAN (Service for Web-based ANalysis), where users can write and run their analyses in the form of notebooks, seamlessly accessing the data and software they need. The first part of the presentation covers several recent integrations and optimizations of the Apache Spark computing platform to enable HEP data processing and CERN accelerator logging system analytics. The optimizations and integrations include, but are not limited to, access to kerberized resources, an XRootD connector enabling remote access to EOS storage, and integration with SWAN for interactive data analysis, thus forming a truly unified analytics platform. The second part of the talk touches upon the evolution of the Apache Spark data analytics platform, particularly the recent work done to run Spark on Kubernetes on the virtualized and container-based infrastructure in OpenStack. This deployment model allows for elastic scaling of data analytics workloads, enabling efficient, on-demand utilization of resources in private or public clouds.
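As an illustration of what the XRootD integration enables, here is a hedged PySpark sketch; it assumes the Hadoop-XRootD connector jar is on the Spark classpath (as in CERN's setup), and the EOS host, path, and column name are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("eos-example").getOrCreate()

# With the connector on the classpath, root:// URLs behave like any
# other Hadoop-compatible filesystem.
df = spark.read.parquet(
    "root://eosuser.cern.ch//eos/user/e/example/data.parquet")
df.groupBy("run").count().show()
```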


Figure 2: A Monte Carlo generation of QCD events featuring low transverse momentum is considered. No I/O from disk is performed, but a large set of histograms is filled. The plot shows the event rate scaling with respect to the number of cores used on a KNL device. The blue line shows the scaling of the original version of the application, based on a customised version of ROOT 5.34. The red line corresponds to a port of the application to ROOT 6.14. The scaling continues up to 200 cores, taking advantage of the accelerator, while the application based on the old ROOT version shows degraded performance.
Figure 3: The PyRDF architecture.
Figure 4. Diagram showing the Cling-CUDA compiler pipeline (courtesy of S. Ehring) [15]
Operations parallelised in ROOT as of release 6.14.
A Parallelised ROOT for Future HEP Data Processing

January 2019 · 124 Reads · 1 Citation · The European Physical Journal Conferences

In the coming years, HEP data processing will need to exploit parallelism on present and future hardware resources to sustain its bandwidth requirements. As one of the cornerstones of the HEP software ecosystem, ROOT embraced an ambitious parallelisation plan which delivered compelling results. In this contribution the strategy is characterised, as well as its evolution in the medium term. The units of the ROOT framework where task and data parallelism have been introduced are discussed, with runtime and scaling measurements. We give an overview of concurrent operations in ROOT, for instance in the areas of I/O (reading and writing of data), fitting/minimization, and data analysis. This paper introduces the programming model and use cases for explicit and implicit parallelism, where the former is explicit in user code and the latter is managed internally by ROOT.
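The implicit flavour, as exposed in PyROOT, can be sketched as follows (input file and column name are illustrative); the explicit counterpart is represented in C++ by executors such as ROOT::TThreadExecutor and ROOT::TProcessExecutor, where user code owns the task decomposition:

```python
import ROOT

# Implicit parallelism: ROOT manages the task pool internally while
# user code keeps its sequential appearance.
ROOT.EnableImplicitMT()  # optionally pass a thread count

df = ROOT.RDataFrame("Events", "events.root")
h_pt = df.Histo1D("pt")
mean = df.Mean("pt")

# A single multithreaded event loop fills both results on first access.
print(h_pt.GetEntries(), mean.GetValue())
```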


Figure 6. Scaling of a Monte Carlo QCD low-pT event generation and on-the-fly analysis, for an ad-hoc implementation using a patched ROOT 5 and POSIX threads (labeled "original" in the plot) and an RDataFrame rewrite of that same application (yielding identical results). No disk reads or writes are performed by either application. KNL architecture, 64 physical cores.
RDataFrame: Easy Parallel ROOT Analysis at 100 Threads

January 2019 · 903 Reads · 34 Citations · The European Physical Journal Conferences

The physics programmes of LHC Run III and the HL-LHC challenge the HEP community. The volume of data to be handled is unprecedented at every step of the data processing chain, and analysis is no exception. Physicists must be provided with first-class analysis tools which are easy to use, exploit bleeding-edge hardware technologies and allow parallelism to be expressed seamlessly. This document discusses the declarative analysis engine of ROOT, RDataFrame, and details how it profitably exploits commodity hardware as well as high-end servers and manycore accelerators, thanks to the synergy with the existing parallelised ROOT components. Real-life analyses of LHC experiments' data expressed in terms of RDataFrame are presented, highlighting the programming model provided to express them in a concise and powerful way. The recent developments which make RDataFrame a lightweight data processing framework, such as callbacks and I/O capabilities, are described. Finally, the flexibility of RDataFrame and its ability to read data formats other than ROOT's are characterised; as an example, it is discussed how RDataFrame can directly read and analyse LHCb's raw data format, MDF.
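A minimal sketch of the declarative programming model (file and column names are illustrative):

```python
import ROOT

ROOT.EnableImplicitMT()  # leverage the parallelised ROOT components
df = ROOT.RDataFrame("Events", "events.root")

# Declarative graph: a named cut, a derived column, a booked histogram.
sel = df.Filter("pt > 20", "pT cut").Define("pt2", "pt*pt")
h = sel.Histo1D(("h", "p_{T}^{2}", 50, 0., 1.e4), "pt2")
report = sel.Report()

h.GetValue()    # one (multithreaded) event loop fills all booked results
report.Print()  # cut-flow statistics gathered during that same loop
```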



Citations (10)


... It was shown to provide a sweet spot in performance and ergonomics for HEP use cases when compared to more general-purpose query languages [8]. Nevertheless, RDataFrame keeps evolving [9], introducing features such as the aforementioned native support for expressing systematic variations or a tight integration with RNTuple [10], ROOT's new and experimental data format. ...

Reference:

Boosting RDataFrame performance with transparent bulk event processing
RDataFrame enhancements for HEP analyses

Journal of Physics Conference Series

... The benchmark chosen to demonstrate where the new framework adds or replaces functionality is targeted at a phase space enriching light-flavoured jets, with particular focus on Drell-Yan events with additional jets [18]. Adding to the challenge, the existing framework to be enhanced by the new one depends on PyROOT [19] imported as part of experiment-specific software, and its building-blocks are event loops with physics object loops nested therein. Parts of the existing implementation are written in Python 2. ...

A New PyROOT: Modern, Interoperable and More Pythonic

The European Physical Journal Conferences

... Thirty Python package developers, maintainers, and power users, all of whom are authors of this report, gathered together in an informal setting to discuss relevant and timely trends in Python, largely targeting end-user analysis. The following topics were central points of discussion: how PyHEP does (or does not) relate to physicists' needs; the Analysis Grand Challenges that run analyses at scale at analysis facilities; leveraging key packages from the ecosystem; the development of statistical packages, models, interfaces, and serialization; workflow management systems; histogramming; and key distributed processing tools like RDataFrame [3], Coffea [4], and Dask [5]. Finally, we brainstormed the organization of future PyHEP.dev ...

RDataFrame: Easy Parallel ROOT Analysis at 100 Threads

The European Physical Journal Conferences

... Distributed computing has always played a crucial role in HEP, from batch systems with manual job submission to automated WLCG gridware and orchestration systems like Dask [86,87], Apache Spark [88], and Stellar HPX [89,90], as well as WMSs like Luigi and Snakemake. Recently, orchestration systems have changed how we think about distributed computing by removing much of the boilerplate and setup required to parallelize applications. ...

Apache Spark usage and deployment models for scientific computing

The European Physical Journal Conferences

... A later study overcame this issue, but did not achieve higher scaling when using available CERN storage facilities [29]. An example of good scalability was provided by researchers of the TOTEM experiment at CERN, with a first approach at distributing a ROOT application over Spark resources in a cloud [30]. The presence of Spark in the HEP community has become relevant enough that CERN has invested in specific infrastructure to support Spark analysis workflows [31]. ...

Declarative Big Data Analysis for High-Energy Physics: TOTEM Use Case
  • Citing Chapter
  • August 2019

... With this setup, TOTEM scientists validated the obtained physics results and demonstrated that it is possible to reduce the time required for the final steps of their analysis by a factor of 280x compared to single-core execution (i.e., from 8 hours to less than 2 minutes). We invite the interested reader to refer to [14] for further details. ...

Big Data Tools and Cloud Services for High Energy Physics Analysis in TOTEM Experiment
  • Citing Conference Paper
  • December 2018

... Implementations of data frames exist in numerous programming languages, including R [10] and Python [11,12]. CERN's ROOT, a common choice for analysis in low energy nuclear physics, also implements data frames as RDataFrame [13,14]. The version of sauce described in this paper has been implemented using the Python bindings of the polars library. ...

Novel functional and distributed approaches to data analysis available in ROOT

Journal of Physics Conference Series

... It is used in production by HEP experiments, for example at the LHC at CERN, and stores more than 2 EB of data. TTree data can be written in parallel with a system called TBufferMerger that merges in memory the files from concurrent data producer threads [3,4]. Originally TBufferMerger used a dedicated output thread that acted as synchronization point for all data buffers [3]. ...

Increasing Parallelism in the ROOT I/O Subsystem

Journal of Physics Conference Series

... Such classes are currently available for TMVA [11] (through the TMVA::Experimental::RReader class included in ROOT), lwtnn [12], Tensorflow [9] (through the C API), PyTorch [13] (using TorchScript), and ONNX-Runtime [14]. Most of these support multiple input nodes, with potentially different types, and are thread-safe, such that implicit multithreading [15] can be used. The performance may in the future be improved by evaluating on a batch instead of a single example at a time. ...

Expressing Parallelism with ROOT

Journal of Physics Conference Series

... An installation on the HNSciCloud [5] comprising 2,400 CPUs, 10 TB of memory, and 22 TB of storage featured ScienceBox in Kubernetes and an Apache Spark cluster to massively scale out computations. Access to the platform was granted to a group of TOTEM scientists who actively used SWAN [6] in place of the production service at CERN. With this setup, TOTEM scientists demonstrated that it is possible to reduce the time required for the final analysis steps by a factor of 200x, and they successfully validated the obtained physics results [7]. ...

SWAN: A service for interactive analysis in the cloud
  • Citing Article
  • December 2016

Future Generation Computer Systems