Peter Zaspel’s research while affiliated with University of Wuppertal and other places


Publications (34)


Investigating Data Hierarchies in Multifidelity Machine Learning for Excitation Energies
  • Article

March 2025 · 6 Reads · 1 Citation

Journal of Chemical Theory and Computation

Vivin Vinod · Peter Zaspel

Predicting Molecular Energies of Small Organic Molecules With Multi-Fidelity Methods

March 2025 · 29 Reads

Journal of Computational Chemistry

Vivin Vinod · Dongyu Lyu · [...] · Peter Zaspel

Multi-fidelity methods in machine learning (ML) have seen increasing usage for the prediction of quantum chemical properties. These methods, such as Δ-ML and Multifidelity Machine Learning (MFML), have been shown to significantly reduce the computational cost of generating training data. This work implements and analyzes several multi-fidelity methods, including Δ-ML and MFML, for the prediction of electronic molecular energies at the DLPNO-CCSD(T) level, that is, at the level of coupled cluster theory including single and double excitations and perturbative triples corrections. The models for small organic molecules are evaluated not only on the basis of accuracy of prediction, but also on efficiency in terms of the time-cost of generating training data. In addition, the models are evaluated for the prediction of energies for molecules sampled from a public dataset, in particular for atmospherically relevant molecules, isomeric compounds, and highly conjugated complex molecules.
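The Δ-ML idea summarized in the abstract can be sketched with a toy surrogate. Here, two synthetic functions stand in for a cheap and an expensive quantum chemistry method (the function names and data are illustrative, not the authors' implementation); kernel ridge regression with a Laplacian kernel, as used in the related papers, learns only the correction between the two fidelities:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)

# Hypothetical stand-ins for two fidelities, for illustration only:
# "expensive" plays the target level, "cheap" a systematically biased approximation.
def expensive(x):  # e.g. a CCSD(T)-quality energy
    return np.sin(3 * x) + 0.5 * x

def cheap(x):      # e.g. a DFT-quality energy with a systematic error
    return np.sin(3 * x) + 0.5 * x + 0.3 * np.cos(5 * x)

# Only a small set of geometries carries labels at BOTH fidelities.
X_delta = rng.uniform(-1, 1, 16).reshape(-1, 1)
y_delta = expensive(X_delta.ravel()) - cheap(X_delta.ravel())

# Learn only the correction (the "delta") with Laplacian-kernel KRR.
model = KernelRidge(kernel="laplacian", gamma=2.0, alpha=1e-9)
model.fit(X_delta, y_delta)

# Prediction at the target level = cheap evaluation + learned correction.
X_test = np.linspace(-1, 1, 200).reshape(-1, 1)
y_pred = cheap(X_test.ravel()) + model.predict(X_test)
mae = np.mean(np.abs(y_pred - expensive(X_test.ravel())))
print(f"Delta-ML MAE: {mae:.4f}")
```

Because the correction is smoother than the target itself, a handful of expensive labels suffices; this is the cost saving the abstracts refer to.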


The workflow of generating the QeMFi dataset by sampling from the WS22 database: 15,000 geometries are used for each molecule, resulting in a total of 135,000 single-point geometries. For each of these, multiple QC properties are calculated at the DFT level of theory with varying basis set sizes to create the diverse multifidelity dataset.
Scatter plots of UMAPs for the various molecules which compose the WS22 database. The UMAPs were generated from the unsorted Coulomb Matrix (CM) molecular descriptor for each molecule. The legend indicates the geometries which are part of the WS22 and the QeMFi dataset, respectively. For all molecules, it can be observed that the QeMFi dataset traverses the entirety of the configuration space that WS22 covers.
Preliminary analysis of the multifidelity structure of SCF ground state energies for the SMA molecule. The three preliminary tests for the hierarchy are performed as prescribed in ref. [15]. The ground state energies show a normal distribution centered around 0 E_h. The difference in the fidelity energies is monotonically decreasing for increasing fidelities, indicating that the assumed hierarchy holds. The scatter plot of the energies of different fidelities with respect to the TZVP fidelity shows a mostly compact distribution; with STO-3G there is a wider deviation from the identity map (dashed black line).
Learning curves for MFML and o-MFML for the SCF ground state energies of SMA as recorded in the QeMFi database. The reference single-fidelity KRR, trained on TZVP only, is also shown. The Laplacian kernel was used with a kernel width of 200.0 and regularization of 10⁻¹⁰. The global SLATM [35] molecular descriptors were used.
Time versus MAE plots for MFML and o-MFML models predicting the SCF ground state energies of the SMA molecule. The time to generate the training set for MFML is a comprehensive measure of the cost of a multifidelity model, as prescribed in ref. [15].


QeMFi: A Multifidelity Dataset of Quantum Chemical Properties of Diverse Molecules
  • Article
  • Full-text available

February 2025 · 4 Reads · 2 Citations

Scientific Data

Progress in both Machine Learning (ML) and Quantum Chemistry (QC) methods has resulted in high-accuracy ML models for QC properties. Datasets such as MD17 and WS22 have been used to benchmark these models at a given level of QC method, or fidelity, which refers to the accuracy of the chosen QC method. Multifidelity ML (MFML) methods, where models are trained on data from more than one fidelity, have been shown to be effective over single-fidelity methods. Much research is progressing in this direction for diverse applications ranging from energy band gaps to excitation energies. One hurdle for effective research here is the lack of a diverse multifidelity dataset for benchmarking. We provide the Quantum chemistry MultiFidelity (QeMFi) dataset consisting of five fidelities calculated with the TD-DFT formalism. The fidelities differ in their basis set choice: STO-3G, 3-21G, 6-31G, def2-SVP, and def2-TZVP. QeMFi offers to the community a variety of QC properties such as vertical excitation properties and molecular dipole moments. Further, QeMFi offers QC computation times, allowing for a time-benefit benchmark of multifidelity models for ML-QC.


Predicting Molecular Energies of Small Organic Molecules with Multifidelity Methods

January 2025 · 15 Reads

Multifidelity methods in machine learning (ML) have seen increasing use for the prediction of quantum chemical properties. These methods, such as ∆-ML and multifidelity ML, have been shown to significantly reduce the computational cost of generating training data. This work implements and analyzes several multifidelity methods, including ∆-ML and multifidelity ML, for the prediction of electronic molecular energies at the DLPNO-CCSD(T) level, i.e., at the level of coupled cluster theory including single and double excitations and perturbative triples corrections. The models for small organic molecules are evaluated not only on the basis of accuracy of prediction, but also on efficiency in terms of the time-cost of generating training data. In addition, the models are evaluated for the prediction of energies for molecules sampled from a public dataset, in particular for atmospherically relevant molecules, isomeric compounds, and highly conjugated complex molecules.


Excitation Energy Transfer between Porphyrin Dyes on a Clay Surface: A study employing Multifidelity Machine Learning

October 2024 · 50 Reads

Natural light-harvesting antenna complexes efficiently capture solar energy using chlorophyll, i.e., magnesium porphyrin pigments, embedded in a protein matrix. Inspired by this natural configuration, artificial clay-porphyrin antenna structures have been experimentally synthesized and have demonstrated remarkable excitation energy transfer properties. The study presents the computational design and simulation of a synthetic light-harvesting system that emulates natural mechanisms by arranging cationic free-base porphyrin molecules on an anionic clay surface. We investigated the transfer of excitation energy among the porphyrin dyes using a multiscale quantum mechanics/molecular mechanics (QM/MM) approach based on the semi-empirical density functional-based tight-binding (DFTB) theory for the ground state dynamics. To improve the accuracy of our results, we incorporated an innovative multifidelity machine learning (MFML) approach, which allows the prediction of excitation energies at the numerically demanding time-dependent density functional theory level with the Def2-SVP basis set. This approach was applied to an extensive dataset of 640K geometries for the 90-atom porphyrin structures, facilitating a thorough analysis of the excitation energy diffusion among the porphyrin molecules adsorbed to the clay surface. The insights gained from this study, inspired by natural light-harvesting complexes, demonstrate the potential of porphyrin-clay systems as effective energy transfer systems.


Evaluation of uncertainty estimations for Gaussian process regression based machine learning interatomic potentials

October 2024 · 8 Reads

Machine learning interatomic potentials (MLIPs) have seen significant advances as efficient replacements of expensive quantum chemical calculations. Uncertainty estimations for MLIPs are crucial to quantify the additional model error they introduce and to leverage this information in active learning strategies. MLIPs that are based on Gaussian process regression provide a standard deviation as a possible uncertainty measure. An alternative approach is ensemble-based uncertainties. Although these uncertainty measures have been applied to active learning, it has rarely been studied how they correlate with the error, and it is not always clear whether active learning actually outperforms random sampling strategies. We consider GPR models with Coulomb and SOAP representations as inputs to predict potential energy surfaces and excitation energies of molecules. We evaluate how the GPR variance and ensemble-based uncertainties relate to the error and whether model performance improves by selecting the most uncertain samples from a fixed configuration space. For the ensemble-based uncertainty estimations, we find that they often do not provide any information about the error. For the GPR standard deviation, we find that predictions with an increasing standard deviation often also have an increasing systematic bias, which is not captured by the uncertainty. In these cases, selecting training samples with the highest uncertainty leads to a model with a worse test error compared to random sampling. We conclude that confidence intervals, which are derived from the predictive standard deviation, can be highly overconfident. Selecting samples with high GPR standard deviation leads to a model that overemphasizes the borders of the configuration space represented in the fixed dataset. This may result in worse performance in more densely sampled areas but better generalization for extrapolation tasks.
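The GPR standard deviation and the uncertainty-driven sample selection discussed in the abstract can be sketched on a toy potential-energy-like curve. This is a minimal illustration with scikit-learn, not the authors' setup; the target function, kernel choice, and pool sizes are all assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)

# Toy 1D curve standing in for a QC property over a configuration space.
def target(x):
    return np.sin(4 * x) + 0.2 * x**2

X_pool = np.linspace(-2, 2, 400).reshape(-1, 1)       # fixed configuration space
train_idx = rng.choice(len(X_pool), 10, replace=False)
X_train = X_pool[train_idx]
y_train = target(X_train.ravel())

gpr = GaussianProcessRegressor(
    kernel=RBF(length_scale=0.5) + WhiteKernel(noise_level=1e-6),
    normalize_y=True,
)
gpr.fit(X_train, y_train)

# Predictive mean and standard deviation over the whole pool.
mean, std = gpr.predict(X_pool, return_std=True)

# Uncertainty-driven ("active learning") step: pick the most uncertain samples.
most_uncertain = np.argsort(std)[-5:]
print("indices of most uncertain samples:", most_uncertain)
```

As the abstract notes, the selected points tend to sit at the borders and in sparsely sampled regions of the pool; whether labeling them beats random sampling depends on how well the standard deviation tracks the actual error.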


Predicting Molecular Energies of Small Organic Molecules with Multifidelity Methods

October 2024 · 16 Reads

Multifidelity methods in machine learning (ML) have seen increasing use for the prediction of quantum chemical properties. These methods, such as ∆-ML and multifidelity ML, have been shown to significantly reduce the computational cost of generating training data. This work implements and analyzes several multifidelity methods, including ∆-ML and multifidelity ML, for the prediction of electronic molecular energies at the DLPNO-CCSD(T) level, i.e., at the level of coupled cluster theory including single and double excitations and perturbative triples corrections. The models for small organic molecules are evaluated not only on the basis of accuracy of prediction, but also on efficiency in terms of the time-cost of generating training data. In addition, the models are evaluated for the prediction of energies for molecules sampled from a public dataset, in particular for atmospherically relevant molecules, isomeric compounds, and highly conjugated complex molecules.


Fig. 1: A hypothetical comparison of training data used across fidelities for the different kinds of scaling factors used in this work. a) The multifidelity training data structure used in MFML with a small fixed scaling factor (γ). b) The multifidelity training data structure for a large fixed scaling factor (γ) results in a larger number of training samples being used at the cheaper fidelities. c) The structure of multifidelity training data used for scaling factors that are decided based on the QC-time cost, explained in section 2.4 as θ_f^F and θ_f^{f−1}. d) Comparison of training data structure evolution for conventional MFML and the Γ-curve introduced in section 2.6. Notice how the number of training samples used at the target (the costliest) fidelity remains the same across the data structure for the Γ-curve, while it increases for the conventional MFML method.
Fig. 2: Multifidelity learning curves for the prediction of excitation energies taken from the QeMFi dataset. The top row corresponds to the MFML models, while the bottom row is for the o-MFML models. Different fixed scaling factors are used to scale the data across each fidelity in the multifidelity models, as explained in section 2.4. The scaling factors are reported at the top of each column.
Fig. 4: Comparison of learning curves for fixed scaling factors γ, θ_f^{f−1}, and θ_f^F with f_b: STO-3G. The x-axis reports the number of training samples used at the highest fidelity, that is, TZVP. Both MFML and o-MFML models are compared. Increasing values of γ result in a constant lowered offset of the learning curves. The cost-informed scaling factors show a higher value of MAE.
Fig. 5: Time to generate training data versus MAE of the corresponding o-MFML model for the diverse scaling factors studied. The different scaling factors used are denoted as sub-titles. The MAE is reported in cm⁻¹ and the time in minutes. The single-fidelity KRR case is also depicted for reference. As one increases the scaling factors across the fidelities, the learning curves of the MFML models shift further due to the larger number of training samples used. The two cases of θ_f^{f−1} and θ_f^F are explained in section 2.4. The bottom-right corner plot compares the o-MFML curves for the 3-21G baseline for the two time-informed scaling factors and the case of γ = 2.
Fig. 7: (a) Time to generate training data and corresponding o-MFML model error as MAE in cm⁻¹ for the constant scaling factors γ used in this study. An inset between 1,500-3,000 minutes is provided so the curves for all γ studied in this work can be readily compared in regions that are too crowded in the main plot. (b) MAE versus time-cost for different Γ(N_train^TZVP)-curves. Increasing the number of training samples at TZVP improves the model accuracies along the Γ(·)-curves, with a saturation observed towards the end of each curve.
Investigating Data Hierarchies in Multifidelity Machine Learning for Excitation Energies

October 2024 · 3 Reads

Recent progress in machine learning (ML) has made high-accuracy quantum chemistry (QC) calculations more accessible. Of particular interest are multifidelity machine learning (MFML) methods, where training data from differing accuracies or fidelities are used. These methods usually employ a fixed scaling factor, γ, to relate the number of training samples across different fidelities, which reflects the cost and assumed sparsity of the data. This study investigates the impact of modifying γ on model efficiency and accuracy for the prediction of vertical excitation energies using the QeMFi benchmark dataset. Further, this work introduces QC compute time informed scaling factors, denoted as θ, that vary based on QC compute times at different fidelities. A novel error metric, error contours of MFML, is proposed to provide a comprehensive view of model error contributions from each fidelity. The results indicate that high model accuracy can be achieved with just 2 training samples at the target fidelity when a larger number of samples from lower fidelities are used. This is further illustrated through a novel concept, the Γ-curve, which compares model error against the time-cost of generating training samples, demonstrating that multifidelity models can achieve high accuracy while minimizing training data costs.
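The fixed scaling factor γ described in the abstract relates training-set sizes across fidelities: each step down from the target fidelity multiplies the sample count by γ. A minimal sketch of that bookkeeping (the function name is illustrative; the sizes follow the convention stated in the abstract):

```python
def mfml_training_sizes(n_target: int, gamma: int, n_fidelities: int) -> list[int]:
    """Number of training samples per fidelity, cheapest first.

    n_target:     samples at the costliest (target) fidelity
    gamma:        fixed scaling factor between adjacent fidelities
    n_fidelities: total number of fidelity levels F
    """
    # Fidelity f (0 = cheapest, F-1 = target) gets n_target * gamma**(F-1-f) samples.
    return [n_target * gamma ** (n_fidelities - 1 - f) for f in range(n_fidelities)]

# Five fidelities (e.g. STO-3G up to def2-TZVP), gamma = 2,
# and just 2 samples at the target fidelity:
print(mfml_training_sizes(2, 2, 5))  # → [32, 16, 8, 4, 2]
```

This makes the abstract's headline result concrete: with γ = 2 and only 2 target-fidelity samples, the model still sees 32 samples at the cheapest level, which is where most of the learning signal comes from.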


Fig. 2: Distribution of training, validation, and test set used in this work. All the nine molecules of the QeMFi dataset are evenly present in each of the sets.
Fig. 5: Learning curves and time-cost assessment for a predicted QC baseline for ∆-ML. This is in contrast to the usual ∆-ML method, wherein one would perform QC calculations for the QC baseline. This model is further explained in the main text.
Fig. S1: Learning curves for ∆-ML with varying baseline fidelities for the atomization energies of the QM7b dataset. The different basis sets are denoted as subplot titles.
Benchmarking Data Efficiency in Δ-ML and Multifidelity Models for Quantum Chemistry

October 2024 · 7 Reads

The development of machine learning (ML) methods has made quantum chemistry (QC) calculations more accessible by reducing the compute cost incurred in conventional QC methods. This cost has since been translated into the overhead cost of generating training data. Increased work in reducing the cost of generating training data resulted in the development of Δ-ML and multifidelity machine learning methods, which use data at more than one QC level of accuracy, or fidelity. This work compares the data costs associated with Δ-ML, multifidelity machine learning (MFML), and optimized MFML (o-MFML) in contrast with a newly introduced Multifidelity Δ-Machine Learning (MFΔML) method for the prediction of ground state energies over the multifidelity benchmark dataset QeMFi. This assessment is made on the basis of the training data generation cost associated with each model and is compared with the single-fidelity kernel ridge regression (KRR) case. The results indicate that the use of multifidelity methods surpasses the standard Δ-ML approaches in cases of a large number of predictions. For cases where the Δ-ML method might be favored, such as small test set regimes, the MFΔML method is shown to be more efficient than conventional Δ-ML.


Assessing non-nested configurations of multifidelity machine learning for quantum-chemical properties

October 2024 · 21 Reads · 5 Citations

Multifidelity machine learning (MFML) for quantum chemical properties has seen strong development in recent years. The method has been shown to reduce the cost of generating training data for high-accuracy, low-cost ML models. In such a set-up, the ML models are trained on molecular geometries and some property of interest computed at various computational chemistry accuracies, or fidelities. These are then combined in training the MFML models. In some multifidelity models, the training data is required to be nested, that is, the same molecular geometries are included to calculate the property across all the fidelities. In these multifidelity models, the requirement of a nested configuration restricts the kind of sampling that can be performed while selecting training samples at different fidelities. This work assesses the use of non-nested training data for two of these multifidelity methods, namely MFML and optimized MFML (o-MFML). The assessment is carried out for the prediction of ground state energies and first vertical excitation energies of a diverse collection of molecules of the CheMFi dataset. Results indicate that the MFML method still requires a nested structure of training data across the fidelities. However, the o-MFML method shows promising results for non-nested multifidelity training data, with model errors comparable to those of the nested configurations.
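The nested-versus-non-nested distinction in the abstract can be made concrete with index sets: in a nested configuration, the geometries used at a costlier fidelity are a subset of those used at every cheaper one, while a non-nested configuration samples each fidelity independently. A small sketch (the pool size and per-fidelity counts are illustrative; the counts follow the γ = 2 convention used in the related papers):

```python
import numpy as np

rng = np.random.default_rng(2)
pool = np.arange(1000)        # indices of available molecular geometries
sizes = [32, 16, 8, 4, 2]     # samples per fidelity, cheapest first (gamma = 2)

# Nested sampling: draw the cheapest-fidelity set once, then subsample it,
# so each costlier fidelity's indices are contained in all cheaper ones.
nested = [rng.choice(pool, sizes[0], replace=False)]
for n in sizes[1:]:
    nested.append(rng.choice(nested[-1], n, replace=False))

# Non-nested sampling: every fidelity draws independently from the full pool.
non_nested = [rng.choice(pool, n, replace=False) for n in sizes]

# The nesting property holds level by level in the first construction:
for coarse, fine in zip(nested, nested[1:]):
    assert set(fine) <= set(coarse)
print("nested sizes:", [len(s) for s in nested])
```

The non-nested draw is the more flexible sampling the abstract refers to: nothing forces the same geometry to carry labels at multiple fidelities, which is exactly the configuration o-MFML is reported to tolerate.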


Citations (15)


... In addition to the single fidelity GPR and MFML models, a recently introduced MFML approach, referred to as the Γ-curve [46], is analyzed as well. In conventional MFML theory, the training samples at the various fidelities are decided by a scaling factor, γ, that is, ...

Reference:

Excitation Energy Transfer between Porphyrin Dyes on a Clay Surface: A study employing Multifidelity Machine Learning
Investigating Data Hierarchies in Multifidelity Machine Learning for Excitation Energies
  • Citing Article
  • March 2025

Journal of Chemical Theory and Computation

... Here, each fidelity is treated as inter-related to the others and a surrogate MTGPR model is created. Further, a diverse multifidelity dataset consisting of 135,000 point geometries has recently been made available [16,17] with various QC properties, such as vertical excitation energies, calculated with DFT formalism. The fidelities are differentiated by the choice of basis set used in the calculation. ...

QeMFi: A Multifidelity Dataset of Quantum Chemical Properties of Diverse Molecules

Scientific Data

... The value of γ = 2 is conventionally used in MFML based on previous work [32, 49-51]. In a recent work, the effect of different values of γ on the model error of MFML has been studied [33]. Ref. [33] reports that the use of very little training data at the target fidelity, combined with increasing values of γ, results in a more data-efficient model. ...

Assessing non-nested configurations of multifidelity machine learning for quantum-chemical properties

... Multifidelity methods harnessing inherent QC hierarchies to cancel out errors across different numerical QC methods have since superseded the single fidelity ML methods. These methods include Δ-ML [12] based models such as hierarchical machine learning [13], multifidelity machine learning (MFML) [14,15], and optimized MFML (o-MFML) [16]. Certain other flavors of ML using multifidelity data have been proposed and tested, including multi-task Gaussian processes treating the different fidelities as interdependent tasks [17,18]. ...

Optimized Multifidelity Machine Learning for Quantum Chemistry

... Ramakrishnan et al. (2015) popularized the ∆-learning approach (Bogojeski et al., 2020), where a model learns to predict the difference between some prior and the reference quantum mechanical targets. Multi-fidelity learning generalizes ∆-learning by building a hierarchy of models that predict increasingly accurate levels of theory (Giselle Fernández-Godino, 2023; Vinod et al., 2023; Forrester et al., 2007; Heinen et al., 2024). Making predictions in the hierarchical multi-fidelity setting corresponds to evaluating a baseline fidelity level and then refining this prediction with models that provide corrections to more accurate levels of theory in the hierarchy. ...

Multifidelity Machine Learning for Molecular Excitation Energies
  • Citing Article
  • October 2023

Journal of Chemical Theory and Computation

... Efforts are being focused on integrating more complex fluid solvers into these creation suites. For instance, the 3D solver for the two-phase incompressible Navier-Stokes equations, NaSt3DGPF, was successfully coupled with Maya in a toolkit that enables the user to control the full fluid simulation within Maya's interface [42,43]. The solver uses high-order finite difference discretization methods, and the rendering techniques result in realistic CFD visualizations. ...

Kernel-based stochastic collocation for the random two-phase Navier-Stokes equations
  • Citing Article
  • January 2019

International Journal for Uncertainty Quantification

... Δ-Machine Learning (ML) aims to efficiently elevate a DFT-MLP to close to the CCSD(T) level [41,64,101,106-109]. The Δ-ML approach we use [101] for this purpose is given by the following equation: ...

Boosting Quantum Machine Learning Models with a Multilevel Combination Technique: Pople Diagrams Revisited
  • Citing Article
  • December 2018

Journal of Chemical Theory and Computation

... Here (a) is commonly encountered for compressing forward operators in integral equations and kernel matrices. Existing codes include HLIBpro [3], [18], H2Pack [16], ASKIT [19], GOFMM [20], and GPU implementations like H2Opus [17] and hmglib [21]. They typically leverage adaptive cross approximation, proxy surfaces, or preselected skeletons to construct the H² matrix. ...

Algorithmic patterns for H-matrices on many-core processors
  • Citing Article
  • August 2017

Journal of Scientific Computing

... This work introduces algebraically constructed multilevel hierarchies [8,10,24] for the solution of elliptic problems on tensor product domains. While previous works [14,15] first constructed the multilevel hierarchy of meshes or triangulations and then discretized the problem by finite elements, the new approach first discretizes the problem on Ω on the finest (potentially unstructured) mesh T J and then constructs coarser versions of the linear system resulting from the fine discretization. ...

Subspace correction methods in algebraic multi-level frames
  • Citing Article
  • January 2016

Linear Algebra and its Applications