About
Publications: 1,243
Reads: 229,771
Citations: 107,603
Introduction
Additional affiliations
- Research Associate, January 1989 - present
- January 2000 - present
Publications (1,243)
Contents
I. Predictive Modeling
A. Leveraging LLMs for Accurate Molecular Energy Predictions
B. From Text to Cement: Developing Sustainable Concretes Using In-Context Learning
C. Molecule Discovery by Context
D. Text template paraphrasing with LLMs
1. Problem
2. Solution
3. Impact
4. Lessons learned
E. GA without genes
II....
We illustrate how to construct high-performance workflows across multiple computing resources with minimal networking configuration.
Recording: https://youtu.be/KO7anZs4G48
Chemistry and materials science are complex. Recently, there have been great successes in addressing this complexity using data-driven or computational techniques. Yet, the necessity of input structured in very specific forms and the fact that there is an ever-growing number of tools creates usability and accessibility challenges. Coupled with the...
Pooling and sharing data increases and distributes its value. But since data cannot be revoked once shared, scenarios that require controlled release of data for regulatory, privacy, and legal reasons default to not sharing. Because selectively controlling what data to release is difficult, the few data-sharing consortia that exist are often built...
Federated learning has shown enormous promise as a way of training ML models in distributed environments while reducing communication costs and protecting data privacy. However, the rise of complex cyber-physical systems, such as the Internet-of-Things, presents new challenges that are not met with traditional FL methods. Hierarchical Federated Lea...
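As context for the federated setting this abstract describes, the sketch below shows plain federated averaging (FedAvg), the baseline that hierarchical schemes build on. It is a NumPy-only illustration with made-up client data and a linear least-squares model, not code from the paper.

```python
# Minimal FedAvg sketch: clients train locally, the server averages updates.
import numpy as np

def local_update(w, X, y, lr=0.1, epochs=5):
    # One client's local training: a few epochs of gradient descent.
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def fedavg_round(w_global, clients):
    # One communication round: average local updates, weighted by data size.
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    updates = [local_update(w_global.copy(), X, y) for X, y in clients]
    return sum(s * u for s, u in zip(sizes, updates)) / sizes.sum()

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(4):  # four simulated edge devices, data never pooled
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + 0.1 * rng.normal(size=50)))

w = np.zeros(2)
for _ in range(20):
    w = fedavg_round(w, clients)
print(w)  # approaches [2, -1] without sharing raw client data
```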
Machine learning interatomic potentials have emerged as a powerful tool for bypassing the spatio-temporal limitations of ab initio simulations, but major challenges remain in their efficient parameterization. We present AL4GAP, an active learning software workflow for generating multi-composition Gaussian approximation potentials (GAP) for arbitrar...
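To make the active-learning idea concrete, here is a schematic loop of the kind such workflows automate: train a surrogate, query the candidate the model is least certain about, and label it with a reference calculation. This is a generic scikit-learn sketch with a toy oracle standing in for a DFT call; it is not the AL4GAP implementation.

```python
# Generic uncertainty-driven active learning with a Gaussian process surrogate.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def oracle(x):
    # Placeholder for an expensive reference calculation (e.g., a DFT energy).
    return np.sin(3 * x).ravel()

rng = np.random.default_rng(0)
pool = rng.uniform(0, 3, size=(200, 1))            # candidate configurations
labeled_X = pool[:5]
labeled_y = oracle(labeled_X)

gp = GaussianProcessRegressor()
for _ in range(10):                                # active-learning iterations
    gp.fit(labeled_X, labeled_y)
    _, std = gp.predict(pool, return_std=True)
    pick = np.argmax(std)                          # most uncertain candidate
    labeled_X = np.vstack([labeled_X, pool[[pick]]])
    labeled_y = np.append(labeled_y, oracle(pool[[pick]]))

print(f"surrogate trained on {len(labeled_X)} reference calculations")
```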
Applications that fuse machine learning and simulation can benefit from the use of multiple computing resources, with, for example, simulation codes running on highly parallel supercomputers and AI training and inference tasks on specialized accelerators. Here, we present our experiences deploying two AI-guided simulation workflows across such hete...
In many experiment-driven scientific domains, such as high-energy physics, material science, and cosmology, high data rate experiments impose hard constraints on data acquisition systems: collected data must either be indiscriminately stored for post-processing and analysis, thereby necessitating large storage capacity, or accurately filtered in re...
Dendritic microstructures are ubiquitous in nature and are the primary solidification morphologies in metallic materials. Techniques such as X-ray computed tomography (XCT) have provided new insights into dendritic phase transformation phenomena. However, manual identification of dendritic morphologies in microscopy data can be both labor intensive...
Scaling deep neural network training to more processors and larger batch sizes is key to reducing end-to-end training time; yet, maintaining comparable convergence and hardware utilization at larger scales is challenging. Increases in training scales have enabled natural gradient optimization methods as a reasonable alternative to stochastic gradie...
Vast volumes of data are produced by today’s scientific simulations and advanced instruments. These data cannot be stored and transferred efficiently because of limited I/O bandwidth, network speed, and storage capacity. Error-bounded lossy compression can be an effective method for addressing these issues: not only can it significantly reduce data...
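For readers unfamiliar with error-bounded lossy compression, the sketch below illustrates only its defining guarantee: every reconstructed value stays within a user-chosen absolute error bound. It uses simple uniform quantization; production compressors such as SZ or ZFP add prediction and entropy coding that this toy omits.

```python
# Error-bounded quantization: reconstruction error never exceeds abs_err.
import numpy as np

def compress(data, abs_err):
    # Map each value to an integer bin of width 2 * abs_err.
    return np.round(data / (2 * abs_err)).astype(np.int64)

def decompress(codes, abs_err):
    # Reconstruct bin centers; each value is within abs_err of the original.
    return codes * (2 * abs_err)

data = np.random.default_rng(1).normal(size=1_000_000)
bound = 1e-3
recon = decompress(compress(data, bound), bound)
print(np.max(np.abs(recon - data)))  # at most `bound`, up to float rounding
```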
funcX is a distributed function as a service (FaaS) platform that enables flexible, scalable, and high performance remote function execution. Unlike centralized FaaS systems, funcX decouples the cloud-hosted management functionality from the edge-hosted execution functionality. funcX's endpoint software can be deployed, by users or adminis...
The Common Fund Data Ecosystem (CFDE) has created a flexible system of data federation that enables researchers to discover datasets from across the US National Institutes of Health Common Fund without requiring that data owners move, reformat, or rehost those data. This system is centered on a catalog that integrates detailed descriptions of biome...
A concise and measurable set of FAIR (Findable, Accessible, Interoperable and Reusable) principles for scientific data is transforming the state-of-practice for data management and stewardship, supporting and enabling discovery and innovation. Learning from this initiative, and acknowledging the impact of artificial intelligence (AI) in the practic...
Clouds play a critical role in the Earth's energy budget and their potential changes are one of the largest uncertainties in future climate projections. However, the use of satellite observations to understand cloud feedbacks in a warming climate has been hampered by the simplicity of existing cloud classification schemes, which are based on single...
Our work seeks to transform how new and emergent variants of pandemic-causing viruses, especially SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) that can learn the evolutionary landscape of SARS-CoV-2 genomes. By pretraining on over 110 million pr...
Batteries are central to modern society. They are no longer just a convenience but a critical enabler of the transition to a resilient, low-carbon economy. Battery development capabilities are provided by communities spanning materials discovery, battery chemistry and electrochemistry, cell and pack design, scale-up, manufacturing, and deployments....
Powerful detectors at modern experimental facilities routinely collect data at multiple GB/s. Online analysis methods are needed to enable the collection of only interesting subsets of such massive data streams, such as by explicitly discarding some data elements or by directing instruments to relevant areas of experimental space. Thus, methods are...
A foundational set of findable, accessible, interoperable, and reusable (FAIR) principles were proposed in 2016 as prerequisites for proper data management and stewardship, with the goal of enabling the reusability of scholarly data. The principles were also meant to apply to other digital assets, at a high level, and over time, the FAIR guiding pr...
Clouds play an important role in the Earth's energy budget and their behavior is one of the largest uncertainties in future climate projections. Satellite observations should help in understanding cloud responses, but decades and petabytes of multispectral cloud imagery have to date received only limited use. This study reduces the dimensionality o...
funcX is a distributed function as a service (FaaS) platform that enables flexible, scalable, and high performance remote function execution. Unlike centralized FaaS systems, funcX decouples the cloud-hosted management functionality from the edge-hosted execution functionality. funcX's endpoint software can be deployed, by users or administrators,...
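A minimal usage sketch, assuming the funcX Python SDK (since renamed Globus Compute) and an already-deployed endpoint whose UUID appears only as a placeholder; exact client calls may differ across SDK versions, so treat this as an outline rather than the project's canonical example.

```python
# Sketch of remote function execution through a funcX endpoint.
from funcx.sdk.client import FuncXClient

def platform_info():
    # Runs remotely on the endpoint, not on the submitting machine.
    import platform
    return platform.uname()

fxc = FuncXClient()
func_id = fxc.register_function(platform_info)        # register with the cloud service
endpoint_id = "00000000-0000-0000-0000-000000000000"  # placeholder endpoint UUID
task_id = fxc.run(endpoint_id=endpoint_id, function_id=func_id)
# get_result may raise while the task is pending; real code polls until done.
print(fxc.get_result(task_id))
```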
Coherent microscopy techniques provide an unparalleled multi-scale view of materials across scientific and technological fields, from structural materials to quantum devices, from integrated circuits to biological cells. Driven by the construction of brighter sources and high-rate detectors, coherent X-ray microscopy methods like ptychography are p...
Computed Tomography (CT) is an imaging technique in which information about an object is collected at different angles (called projections or scans). The cross-sectional image showing the internal structure of the slice is then produced by solving an inverse problem. Limited by factors such as radiation dosage and projection angles, the produced...
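The forward and inverse problems described here can be illustrated with scikit-image, where radon simulates projections over a set of angles and iradon performs filtered back-projection; the phantom, image size, and angle counts below are arbitrary choices for illustration, not the setup used in the paper.

```python
# Simulate CT projections and reconstruct with filtered back-projection.
import numpy as np
from skimage.data import shepp_logan_phantom
from skimage.transform import radon, iradon, resize

image = resize(shepp_logan_phantom(), (128, 128))         # test slice
theta_full = np.linspace(0.0, 180.0, 180, endpoint=False)
theta_sparse = np.linspace(0.0, 180.0, 30, endpoint=False)  # limited angles

recon_full = iradon(radon(image, theta=theta_full), theta=theta_full)
recon_sparse = iradon(radon(image, theta=theta_sparse), theta=theta_sparse)

# Fewer projection angles yield a visibly degraded reconstruction.
print(np.abs(recon_full - image).mean(), np.abs(recon_sparse - image).mean())
```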
Research process automation--the reliable, efficient, and reproducible execution of linked sets of actions on scientific instruments, computers, data stores, and other resources--has emerged as an essential element of modern science. We report here on new services within the Globus research data management platform that enable the specification of...
Serial synchrotron crystallography enables the study of protein structures under physiological temperature and reduced radiation damage by collection of data from thousands of crystals. The Structural Biology Center at Sector 19 of the Advanced Photon Source has implemented a fixed-target approach with a new 3D-printed mesh-holder optimized for sam...
The severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) replication transcription complex (RTC) is a multi-domain protein responsible for replicating and transcribing the viral mRNA inside a human cell. Attacking RTC function with pharmaceutical compounds is a pathway to treating COVID-19. Conventional tools, e.g. cryo-electron microscopy...
Despite much creative work on methods and tools, reproducibility-the ability to repeat the computational steps used to obtain a research result-remains elusive. One reason for these difficulties is that extant tools for capturing research processes, while powerful, often fail to capture vital connections as research projects grow in extent and comp...
Applications of X-ray computed tomography (CT) for porosity characterization of engineering materials often involve an extended data analysis workflow that includes CT reconstruction of raw projection data, binarization, labeling and mesh extraction. It is often desirable to map the porosity in larger samples but the computational challenge of re...
The broad sharing of research data is widely viewed as critical for the speed, quality, accessibility, and integrity of science. Despite increasing efforts to encourage data sharing, both the quality of shared data and the frequency of data reuse remain stubbornly low. We argue here that a significant reason for this unfortunate state of affairs is...
Unraveling the liquid structure of multicomponent molten salts is challenging due to the difficulty in conducting and interpreting high-temperature diffraction experiments. Motivated by this challenge, we developed composition-transferable Gaussian approximation potential (GAP) for molten LiCl-KCl. A DFT-SCAN accurate GAP is active-learned from onl...
As efforts advance around the globe, the US falls behind
On August 2, 2021 a group of concerned scientists and US funding agency and federal government officials met for an informal discussion to explore the value and need for a well-coordinated US Open Research Commons (ORC); an interoperable collection of data and compute resources within both the public and private sectors which are easy to use and ac...
G4MP2 theory has proven to be a reliable and accurate quantum chemical composite method for the calculation of molecular energies using an approximation based on second-order perturbation theory to lower computational costs compared to G4 theory. However, it has been found to have significantly increased errors when applied to larger organic molecu...
Extreme times require extreme measures. In this column, we discuss how high-performance computing embraces artificial intelligence and data analytics to address global challenges.
A concise and measurable set of FAIR (Findable, Accessible, Interoperable and Reusable) principles for scientific data is transforming the state-of-practice for data management and stewardship, supporting and enabling discovery and innovation. Learning from this initiative, and acknowledging the impact of artificial intelligence (AI) in the practi...
Pooling and sharing data increases and distributes its value. But since data cannot be revoked once shared, scenarios that require controlled release of data for regulatory, privacy, and legal reasons default to not sharing. Because selectively controlling what data to release is difficult, the few data-sharing consortia that exist are often built...
Background: Personalized breast cancer (BC) screening adjusts the imaging modality and frequency of exams according to a woman's risk of developing BC. This can lower cost and false positives by reducing unnecessary exams and has the potential to find more cancers at a curable stage. Deep learning (DL) is a class of artificial intelligence algorith...
Transformer-based masked language models trained on general corpora, such as BERT and RoBERTa, have shown impressive performance on various downstream tasks. Increasingly, researchers are "finetuning" these models to improve performance on domain-specific tasks. Here, we report a broad study in which we applied 14 transformer-based models to 11 sci...
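A minimal finetuning sketch in the spirit of that study, assuming the Hugging Face transformers and datasets packages; the model name, the two-sentence toy corpus, and the hyperparameters are placeholders rather than the models and tasks evaluated in the paper.

```python
# Finetune a pretrained masked-language model for sentence classification.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import Dataset

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Tiny illustrative corpus; a real scientific task would use domain text.
data = Dataset.from_dict({
    "text": ["the catalyst improved yield", "the instrument failed to start"],
    "label": [1, 0],
})
data = data.map(lambda ex: tok(ex["text"], truncation=True,
                               padding="max_length", max_length=32))

args = TrainingArguments(output_dir="out", num_train_epochs=1,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args, train_dataset=data).train()
```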
In order to take full advantage of the U.S. Department of Energy's billion-dollar investments into the next-generation research infrastructure (e.g., exascale, light sources, colliders), advances are required not only in detector technology but also in computing and specifically AI. Let us consider an example from X-ray science. Nanoscale X-ray ima...
Parsl is a parallel programming library for Python that aims to make it easy to specify parallelism in programs and to realize that parallelism on arbitrary parallel and distributed computing systems. Parsl relies on developers annotating Python functions-wrapping either Python or external applications-to indicate that these functions may be execut...
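The annotation model Parsl uses can be shown in a few lines: decorating a function as an app makes each call return a future that executes in parallel. The sketch below uses Parsl's bundled local-threads configuration; on a cluster one would load a configuration targeting the local scheduler instead.

```python
# Parallel execution of an annotated Python function with Parsl.
import parsl
from parsl import python_app
from parsl.configs.local_threads import config

parsl.load(config)

@python_app
def square(x):
    return x * x

futures = [square(i) for i in range(8)]   # submitted asynchronously
print([f.result() for f in futures])      # gather results: [0, 1, 4, ...]
```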
Advancements in scientific instrument sensors and connected devices provide unprecedented insight into ongoing experiments and present new opportunities for control, optimization, and steering. However, the diversity of sensors and the heterogeneity of their data make it challenging to fully realize these new opportunities. Organizing and syn...
Data assimilation (DA) in geophysical sciences remains the cornerstone of robust forecasts from numerical models. Indeed, DA plays a crucial role in the quality of numerical weather prediction and is a crucial building block that has allowed dramatic improvements in weather forecasting over the past few decades. DA is commonly framed in a variation...
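For reference, the variational framing mentioned here is conventionally written as the 3D-Var cost function (a textbook form, not a formula taken from this paper):

$$ J(\mathbf{x}) = \tfrac{1}{2}(\mathbf{x}-\mathbf{x}_b)^{\mathsf T}\mathbf{B}^{-1}(\mathbf{x}-\mathbf{x}_b) + \tfrac{1}{2}\big(\mathbf{y}-\mathcal{H}(\mathbf{x})\big)^{\mathsf T}\mathbf{R}^{-1}\big(\mathbf{y}-\mathcal{H}(\mathbf{x})\big), $$

where $\mathbf{x}_b$ is the background state, $\mathbf{y}$ the observations, $\mathcal{H}$ the observation operator, and $\mathbf{B}$ and $\mathbf{R}$ the background- and observation-error covariances.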
Dendritic microstructures are ubiquitous in nature and are the primary solidification morphologies in metallic materials. Techniques such as x-ray computed tomography (XCT) have provided new insights into dendritic phase transformation phenomena. However, manual identification of dendritic morphologies in microscopy data can be both labor intensive...
Extracting actionable information from data sources such as the Linac Coherent Light Source (LCLS-II) and Advanced Photon Source Upgrade (APS-U) is becoming more challenging due to the fast-growing data generation rate. The rapid analysis possible with ML methods can enable fast feedback loops that can be used to adjust experimental setups in real-...
The increasing volume and variety of science data have led to the creation of metadata extraction systems that automatically derive and synthesize relevant information from files. A critical component of metadata extraction systems is a mechanism for mapping extractors (lightweight tools that mine information from particular file types) to each file i...
Powerful detectors at modern experimental facilities routinely collect data at multiple GB/s. Online analysis methods are needed to enable the collection of only interesting subsets of such massive data streams, such as by explicitly discarding some data elements or by directing instruments to relevant areas of experimental space. Such online analy...
With the widespread availability of high-speed networks, it becomes feasible to outsource computing to remote providers and to federate resources from many locations. Such observations motivated the development, from the mid-1990s onwards, of a range of innovative Grid technologies, applications, and infrastructures. We review the history, current...
Data - arguably the most important product of worldwide materials research investment - are rarely shared. The small and biased proportion of results published are buried in plots and text licensed by journals. This situation wastes resources, hinders innovation, and, in the current era of data-driven discovery, is no longer tenable. In this commen...
Serial synchrotron crystallography enables studies of protein structures under physiological temperature and reduced radiation damage by collection of data from thousands of crystals. The Structural Biology Center at Sector 19 of the Advanced Photon Source has implemented a fixed-target approach with a new 3D printed mesh-holder optimized for sampl...
The applications being developed within the U.S. Exascale Computing Project (ECP) to run on imminent Exascale computers will generate scientific results with unprecedented fidelity and record turn-around time. Many of these codes are based on particle-mesh methods and use advanced algorithms, especially dynamic load-balancing and mesh-refinement, t...
Beamlines at synchrotron light source facilities are powerful scientific instruments used to image samples and observe phenomena at high spatial and temporal resolutions. Typically, these facilities are equipped only with modest compute resources for the analysis of generated experimental datasets. However, high data rate experiments can easily gen...
Next-generation scientific instruments will collect data at unprecedented rates: multiple GB/s and exceeding TB/day. Such runs will benefit from automation and steering via machine learning methods, but these methods require new data management and policy techniques. We present here the Braid Provenance Engine (Braid-DB), a system that embraces AI-...
Unraveling the liquid structure of multi-component molten salts is challenging due to the difficulty in conducting and interpreting high temperature diffraction experiments. Motivated by this challenge, we developed composition-transferable Gaussian Approximation Potentials (GAP) for molten LiCl-KCl. A DFT-SCAN accurate GAP is active learned from o...
Despite much creative work on methods and tools, reproducibility -- the ability to repeat the computational steps used to obtain a research result -- remains elusive. One reason for these difficulties is that extant tools for capturing research processes do not align well with the rich working practices of scientists. We advocate here for simple me...
The broad sharing of research data is widely viewed as of critical importance for the speed, quality, accessibility, and integrity of science. Despite increasing efforts to encourage data sharing, both the quality of shared data, and the frequency of data reuse, remain stubbornly low. We argue here that a major reason for this unfortunate state of...
X-ray diffraction-based microscopy techniques such as high-energy diffraction microscopy (HEDM) rely on precise knowledge of the positions of diffraction peaks. These positions are typically computed by fitting the observed intensities in detector data to a theoretical peak shape such as pseudo-Voigt. As experiments become more complex a...
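As a concrete illustration of the fitting step, the sketch below fits a pseudo-Voigt profile (a weighted sum of Gaussian and Lorentzian terms) to a synthetic peak with SciPy's curve_fit; the peak parameters and noise are invented, not HEDM detector data.

```python
# Fit a pseudo-Voigt peak shape to noisy synthetic intensities.
import numpy as np
from scipy.optimize import curve_fit

def pseudo_voigt(x, amp, x0, w, eta):
    gauss = np.exp(-4 * np.log(2) * (x - x0) ** 2 / w ** 2)
    lorentz = 1.0 / (1 + 4 * (x - x0) ** 2 / w ** 2)
    return amp * (eta * lorentz + (1 - eta) * gauss)

x = np.linspace(-5, 5, 200)
rng = np.random.default_rng(0)
y = pseudo_voigt(x, 10.0, 0.3, 1.5, 0.4) + 0.1 * rng.normal(size=x.size)

popt, _ = curve_fit(pseudo_voigt, x, y, p0=[8.0, 0.0, 1.0, 0.5])
print(popt)  # recovered amplitude, center, width, and mixing parameter
```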
Dedicated network connections are being increasingly deployed in cloud, centralized and edge computing and data infrastructures, whose throughput profiles are critical indicators of the underlying data transfer performance. Due to the cost and disruptions to physical infrastructures, network emulators, such as Mininet, are often used to generate me...
The increasing volume and variety of science data have led to the creation of metadata extraction systems that automatically derive and synthesize relevant information from files. A critical component of metadata extraction systems is a mechanism for mapping extractors (lightweight tools that mine information from particular file types) to each file i...
Technological advancements in modern scientific instruments, such as scanning electron microscopes (SEMs), have significantly increased data acquisition rates and image resolutions enabling new questions to be explored; however, the resulting data volumes and velocities, combined with automated experiments, are quickly overwhelming scientists as th...
Machine learning (ML) has emerged as a promising technology to accelerate materials discovery. While systematic screening of vast chemical spaces is computationally expensive, ML algorithms offer a directed approach to identifying and testing promising molecular candidates for specific applications. Two significant hurdles towards development of ro...