Ian Foster

University of Chicago · Department of Computer Science

PhD

About

1,243 Publications · 229,996 Reads
107,765 Citations
Additional affiliations
  • Research Associate, Argonne National Laboratory (January 1989 - present)
  • Distinguished Fellow, University of Chicago (January 2000 - present)
  • Professor

Publications (1,243)
Data
Contents: I. Predictive Modeling: A. Leveraging LLMs for Accurate Molecular Energy Predictions; B. From Text to Cement: Developing Sustainable Concretes Using In-Context Learning; C. Molecule Discovery by Context; D. Text template paraphrasing with LLMs (1. Problem; 2. Solution; 3. Impact; 4. Lessons learned); E. GA without genes; II. ...
Presentation
Full-text available
We illustrate how to construct high-performance workflows across multiple computing resources with minimal networking configuration. Recording: https://youtu.be/KO7anZs4G48
Preprint
Full-text available
Chemistry and materials science are complex. Recently, there have been great successes in addressing this complexity using data-driven or computational techniques. Yet, the necessity of input structured in very specific forms and the fact that there is an ever-growing number of tools creates usability and accessibility challenges. Coupled with the...
Preprint
Full-text available
Pooling and sharing data increases and distributes its value. But since data cannot be revoked once shared, scenarios that require controlled release of data for regulatory, privacy, and legal reasons default to not sharing. Because selectively controlling what data to release is difficult, the few data-sharing consortia that exist are often built...
Preprint
Full-text available
Federated learning has shown enormous promise as a way of training ML models in distributed environments while reducing communication costs and protecting data privacy. However, the rise of complex cyber-physical systems, such as the Internet-of-Things, presents new challenges that are not met with traditional FL methods. Hierarchical Federated Lea...
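
The core idea of hierarchical FL is two-tier aggregation: devices report to an edge aggregator, and edge aggregators report to a global server, cutting wide-area communication. A minimal sketch of that pattern, assuming plain federated averaging and synthetic parameter vectors (all names and sizes here are illustrative, not the paper's method):

    import numpy as np

    def fedavg(weights):
        """Average a list of parameter vectors."""
        return np.mean(weights, axis=0)

    rng = np.random.default_rng(0)
    n_edges, devices_per_edge, dim = 3, 4, 8

    # One local-training round, faked here as random parameter vectors.
    device_models = [
        [rng.normal(size=dim) for _ in range(devices_per_edge)]
        for _ in range(n_edges)
    ]

    edge_models = [fedavg(group) for group in device_models]  # edge-level aggregation
    global_model = fedavg(edge_models)                        # cloud-level aggregation
    print(global_model.shape)  # (8,)
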
Preprint
Full-text available
Machine learning interatomic potentials have emerged as a powerful tool for bypassing the spatio-temporal limitations of ab initio simulations, but major challenges remain in their efficient parameterization. We present AL4GAP, an active learning software workflow for generating multi-composition Gaussian approximation potentials (GAP) for arbitrar...
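
A schematic active-learning loop in the spirit of such workflows, assuming an ensemble-disagreement criterion: label the configurations where surrogate models disagree most, then retrain. Function names are placeholders, not the AL4GAP API:

    import numpy as np

    rng = np.random.default_rng(0)

    def ensemble_predict(models, x):
        preds = np.array([m(x) for m in models])
        return preds.mean(axis=0), preds.std(axis=0)  # mean energy, disagreement

    # Toy "models" and candidate configurations (1-D features for brevity).
    models = [lambda x, a=a: a * x for a in (0.9, 1.0, 1.1)]
    pool = rng.uniform(-1, 1, size=100)

    for _ in range(5):
        _, std = ensemble_predict(models, pool)
        pick = np.argmax(std)          # most uncertain candidate
        x_new = pool[pick]
        # y_new = run_dft(x_new)       # expensive reference calculation (stub)
        pool = np.delete(pool, pick)   # retrain models with (x_new, y_new) here
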
Preprint
Applications that fuse machine learning and simulation can benefit from the use of multiple computing resources, with, for example, simulation codes running on highly parallel supercomputers and AI training and inference tasks on specialized accelerators. Here, we present our experiences deploying two AI-guided simulation workflows across such hete...
Preprint
Full-text available
In many experiment-driven scientific domains, such as high-energy physics, material science, and cosmology, high data rate experiments impose hard constraints on data acquisition systems: collected data must either be indiscriminately stored for post-processing and analysis, thereby necessitating large storage capacity, or accurately filtered in re...
Article
Dendritic microstructures are ubiquitous in nature and are the primary solidification morphologies in metallic materials. Techniques such as X-ray computed tomography (XCT) have provided new insights into dendritic phase transformation phenomena. However, manual identification of dendritic morphologies in microscopy data can be both labor intensive...
Article
Scaling deep neural network training to more processors and larger batch sizes is key to reducing end-to-end training time; yet, maintaining comparable convergence and hardware utilization at larger scales is challenging. Increases in training scales have enabled natural gradient optimization methods as a reasonable alternative to stochastic gradie...
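
For intuition only, a sketch of a natural-gradient-style update with a diagonal Fisher approximation; the paper's approach (K-FAC-style methods) maintains Kronecker-factored blocks instead, and every name and size below is hypothetical:

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=10)          # model parameters
    g = rng.normal(size=10)          # minibatch gradient
    fisher_diag = 0.9 * np.ones(10) + 0.1 * g**2   # running estimate of E[g^2]

    lr, damping = 0.1, 1e-3
    w -= lr * g / (fisher_diag + damping)  # precondition gradient by inverse Fisher
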
Article
Full-text available
Vast volumes of data are produced by today’s scientific simulations and advanced instruments. These data cannot be stored and transferred efficiently because of limited I/O bandwidth, network speed, and storage capacity. Error-bounded lossy compression can be an effective method for addressing these issues: not only can it significantly reduce data...
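
A toy illustration of the error-bounded property using uniform quantization; real compressors such as SZ and ZFP are far more sophisticated, but the guarantee being checked is the same:

    import numpy as np

    def compress(data, abs_err):
        # Quantize to integer multiples of 2*abs_err so that reconstruction
        # error never exceeds abs_err.
        return np.round(data / (2 * abs_err)).astype(np.int64)

    def decompress(codes, abs_err):
        return codes * (2 * abs_err)

    field = np.random.default_rng(0).normal(size=1_000_000)
    bound = 1e-3
    recon = decompress(compress(field, bound), bound)
    assert np.max(np.abs(field - recon)) <= bound  # the error bound holds
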
Article
funcX is a distributed function as a service (FaaS) platform that enables flexible, scalable, and high performance remote function execution. Unlike centralized FaaS systems, funcX decouples the cloud-hosted management functionality from the edge-hosted execution functionality. funcX's endpoint software can be deployed, by users or adminis...
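
The decoupled pattern looks roughly like the following when used from the funcX SDK; this is a sketch of the client calls as documented for the SDK, and the endpoint UUID is a placeholder for an endpoint you have deployed:

    from funcx.sdk.client import FuncXClient

    def double(x):
        return 2 * x

    fxc = FuncXClient()
    func_id = fxc.register_function(double)
    endpoint_id = "00000000-0000-0000-0000-000000000000"  # placeholder

    task_id = fxc.run(21, endpoint_id=endpoint_id, function_id=func_id)
    print(fxc.get_result(task_id))  # -> 42 once the task completes
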
Article
Full-text available
The Common Fund Data Ecosystem (CFDE) has created a flexible system of data federation that enables researchers to discover datasets from across the US National Institutes of Health Common Fund without requiring that data owners move, reformat, or rehost those data. This system is centered on a catalog that integrates detailed descriptions of biome...
Article
Full-text available
A concise and measurable set of FAIR (Findable, Accessible, Interoperable and Reusable) principles for scientific data is transforming the state-of-practice for data management and stewardship, supporting and enabling discovery and innovation. Learning from this initiative, and acknowledging the impact of artificial intelligence (AI) in the practic...
Preprint
Full-text available
Clouds play a critical role in the Earth's energy budget and their potential changes are one of the largest uncertainties in future climate projections. However, the use of satellite observations to understand cloud feedbacks in a warming climate has been hampered by the simplicity of existing cloud classification schemes, which are based on single...
Preprint
Full-text available
Our work seeks to transform how new and emergent variants of pandemic-causing viruses, especially SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pretraining on over 110 million pr...
Article
Batteries are central to modern society. They are no longer just a convenience but a critical enabler of the transition to a resilient, low-carbon economy. Battery development capabilities are provided by communities spanning materials discovery, battery chemistry and electrochemistry, cell and pack design, scale-up, manufacturing, and deployments....
Article
Full-text available
Powerful detectors at modern experimental facilities routinely collect data at multiple GB/s. Online analysis methods are needed to enable the collection of only interesting subsets of such massive data streams, such as by explicitly discarding some data elements or by directing instruments to relevant areas of experimental space. Thus, methods are...
Preprint
Full-text available
A foundational set of findable, accessible, interoperable, and reusable (FAIR) principles were proposed in 2016 as prerequisites for proper data management and stewardship, with the goal of enabling the reusability of scholarly data. The principles were also meant to apply to other digital assets, at a high level, and over time, the FAIR guiding pr...
Preprint
Full-text available
Clouds play an important role in the Earth's energy budget and their behavior is one of the largest uncertainties in future climate projections. Satellite observations should help in understanding cloud responses, but decades and petabytes of multispectral cloud imagery have to date received only limited use. This study reduces the dimensionality o...
Preprint
funcX is a distributed function as a service (FaaS) platform that enables flexible, scalable, and high performance remote function execution. Unlike centralized FaaS systems, funcX decouples the cloud-hosted management functionality from the edge-hosted execution functionality. funcX's endpoint software can be deployed, by users or administrators,...
Preprint
Full-text available
Coherent microscopy techniques provide an unparalleled multi-scale view of materials across scientific and technological fields, from structural materials to quantum devices, from integrated circuits to biological cells. Driven by the construction of brighter sources and high-rate detectors, coherent X-ray microscopy methods like ptychography are p...
Preprint
Computed Tomography (CT) is an imaging technique in which information about an object is collected at different angles (called projections or scans). The cross-sectional image showing the internal structure of the slice is then produced by solving an inverse problem. Constrained by factors such as radiation dosage and projection angles, the produced...
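
A toy version of that inverse problem, assuming a random system matrix in place of a physical projection operator: recover x from y = A @ x with fewer measurements than unknowns via Landweber iteration (gradient descent on the data-misfit term):

    import numpy as np

    rng = np.random.default_rng(0)
    n_pixels, n_meas = 64, 40
    A = rng.normal(size=(n_meas, n_pixels))
    x_true = rng.random(n_pixels)
    y = A @ x_true                       # simulated projection data

    x = np.zeros(n_pixels)
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    for _ in range(2000):
        x += step * A.T @ (y - A @ x)    # gradient step on ||y - A x||^2
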
Preprint
Full-text available
Research process automation--the reliable, efficient, and reproducible execution of linked sets of actions on scientific instruments, computers, data stores, and other resources--has emerged as an essential element of modern science. We report here on new services within the Globus research data management platform that enable the specification of...
Article
Full-text available
Serial synchrotron crystallography enables the study of protein structures under physiological temperature and reduced radiation damage by collection of data from thousands of crystals. The Structural Biology Center at Sector 19 of the Advanced Photon Source has implemented a fixed-target approach with a new 3D-printed mesh-holder optimized for sam...
Article
The severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) replication transcription complex (RTC) is a multi-domain protein responsible for replicating and transcribing the viral mRNA inside a human cell. Attacking RTC function with pharmaceutical compounds is a pathway to treating COVID-19. Conventional tools, e.g. cryo-electron microscopy...
Article
Despite much creative work on methods and tools, reproducibility (the ability to repeat the computational steps used to obtain a research result) remains elusive. One reason for these difficulties is that extant tools for capturing research processes, while powerful, often fail to capture vital connections as research projects grow in extent and comp...
Preprint
Full-text available
Applications of X-ray computed tomography (CT) for porosity characterization of engineering materials often involve an extended data analysis workflow that includes CT reconstruction of raw projection data, binarization, labeling and mesh extraction. It is often desirable to map the porosity in larger samples but the computational challenge of re...
Article
Full-text available
The broad sharing of research data is widely viewed as critical for the speed, quality, accessibility, and integrity of science. Despite increasing efforts to encourage data sharing, both the quality of shared data and the frequency of data reuse remain stubbornly low. We argue here that a significant reason for this unfortunate state of affairs is...
Article
Unraveling the liquid structure of multicomponent molten salts is challenging due to the difficulty in conducting and interpreting high-temperature diffraction experiments. Motivated by this challenge, we developed composition-transferable Gaussian approximation potential (GAP) for molten LiCl-KCl. A DFT-SCAN accurate GAP is active-learned from onl...
Article
As efforts advance around the globe, the US falls behind
Preprint
Full-text available
On August 2, 2021 a group of concerned scientists and US funding agency and federal government officials met for an informal discussion to explore the value and need for a well-coordinated US Open Research Commons (ORC); an interoperable collection of data and compute resources within both the public and private sectors which are easy to use and ac...
Article
G4MP2 theory has proven to be a reliable and accurate quantum chemical composite method for the calculation of molecular energies using an approximation based on second-order perturbation theory to lower computational costs compared to G4 theory. However, it has been found to have significantly increased errors when applied to larger organic molecu...
Article
Extreme times require extreme measures. In this column, we discuss how high-performance computing embraces artificial intelligence and data analytics to address global challenges.
Preprint
A concise and measurable set of FAIR (Findable, Accessible, Interoperable and Reusable) principles for scientific data are transforming the state-of-practice for data management and stewardship, supporting and enabling discovery and innovation. Learning from this initiative, and acknowledging the impact of artificial intelligence (AI) in the practi...
Article
Pooling and sharing data increases and distributes its value. But since data cannot be revoked once shared, scenarios that require controlled release of data for regulatory, privacy, and legal reasons default to not sharing. Because selectively controlling what data to release is difficult, the few data-sharing consortia that exist are often built...
Article
Background: Personalized breast cancer (BC) screening adjusts the imaging modality and frequency of exams according to a woman's risk of developing BC. This can lower cost and false positives by reducing unnecessary exams and has the potential to find more cancers at a curable stage. Deep learning (DL) is a class of artificial intelligence algorith...
Preprint
Full-text available
Transformer-based masked language models trained on general corpora, such as BERT and RoBERTa, have shown impressive performance on various downstream tasks. Increasingly, researchers are "finetuning" these models to improve performance on domain-specific tasks. Here, we report a broad study in which we applied 14 transformer-based models to 11 sci...
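
The finetuning recipe the abstract refers to is the standard Hugging Face transformers pattern; a minimal sketch, with the model name and labels as placeholders rather than the study's actual configuration:

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = "bert-base-uncased"  # placeholder; the study compares 14 such models
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

    batch = tok(["The GAP potential reproduces DFT energies."], return_tensors="pt")
    labels = torch.tensor([1])
    out = model(**batch, labels=labels)   # loss computed internally
    out.loss.backward()                    # one finetuning step (optimizer omitted)
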
Preprint
In order to take full advantage of the U.S. Department of Energy's billion-dollar investments into the next-generation research infrastructure (e.g., exascale, light sources, colliders), advances are required not only in detector technology but also in computing and specifically AI. Let us consider an example from X-ray science. Nanoscale X-ray ima...
Preprint
Full-text available
Parsl is a parallel programming library for Python that aims to make it easy to specify parallelism in programs and to realize that parallelism on arbitrary parallel and distributed computing systems. Parsl relies on developers annotating Python functions-wrapping either Python or external applications-to indicate that these functions may be execut...
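
The annotation pattern described here is Parsl's documented decorator API: a function marked as an app returns a future when called, and Parsl executes the calls in parallel on the configured resources (a local thread pool in this sketch):

    import parsl
    from parsl import python_app
    from parsl.configs.local_threads import config

    parsl.load(config)

    @python_app
    def square(x):
        return x * x

    futures = [square(i) for i in range(4)]      # submitted concurrently
    print([f.result() for f in futures])         # -> [0, 1, 4, 9]
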
Preprint
Advancements in scientific instrument sensors and connected devices provide unprecedented insight into ongoing experiments and present new opportunities for control, optimization, and steering. However, the diversity of sensors and the heterogeneity of their data make it challenging to fully realize these new opportunities. Organizing and syn...
Article
Full-text available
Data assimilation (DA) in geophysical sciences remains the cornerstone of robust forecasts from numerical models. Indeed, DA plays a crucial role in the quality of numerical weather prediction and is a crucial building block that has allowed dramatic improvements in weather forecasting over the past few decades. DA is commonly framed in a variation...
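
In the variational framing mentioned here, 3D-Var minimizes J(x) = (x-xb)^T B^-1 (x-xb) + (y-Hx)^T R^-1 (y-Hx); for a linear observation operator the minimizer solves a linear system. A self-contained sketch with synthetic matrices (all sizes and covariances illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 5, 3
    xb = rng.normal(size=n)                 # background (prior) state
    H = rng.normal(size=(m, n))             # observation operator
    y = H @ xb + 0.1 * rng.normal(size=m)   # observations
    B = np.eye(n)                           # background-error covariance
    R = 0.1 * np.eye(m)                     # observation-error covariance

    Binv, Rinv = np.linalg.inv(B), np.linalg.inv(R)
    lhs = Binv + H.T @ Rinv @ H             # from setting grad J(x) = 0
    rhs = Binv @ xb + H.T @ Rinv @ y
    x_analysis = np.linalg.solve(lhs, rhs)  # analysis state minimizing J
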
Preprint
Full-text available
Dendritic microstructures are ubiquitous in nature and are the primary solidification morphologies in metallic materials. Techniques such as x-ray computed tomography (XCT) have provided new insights into dendritic phase transformation phenomena. However, manual identification of dendritic morphologies in microscopy data can be both labor intensive...
Preprint
Extracting actionable information from data sources such as the Linac Coherent Light Source (LCLS-II) and Advanced Photon Source Upgrade (APS-U) is becoming more challenging due to the fast-growing data generation rate. The rapid analysis possible with ML methods can enable fast feedback loops that can be used to adjust experimental setups in real-...
Conference Paper
Full-text available
The increasing volume and variety of science data has led to the creation of metadata extraction systems that automatically derive and synthesize relevant information from files. A critical component of metadata extraction systems is a mechanism for mapping extractors (lightweight tools to mine information from particular file types) to each file i...
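
The mapping mechanism might look like the following in its simplest form; extractor names and the type-based dispatch are illustrative, not the system's actual API:

    import pathlib

    def tabular_extractor(path):   # stub: would parse headers, column stats, ...
        return {"type": "tabular", "file": str(path)}

    def image_extractor(path):     # stub: would pull dimensions, EXIF, ...
        return {"type": "image", "file": str(path)}

    EXTRACTORS = {".csv": tabular_extractor, ".tsv": tabular_extractor,
                  ".png": image_extractor, ".jpg": image_extractor}

    def extract_metadata(path):
        extractor = EXTRACTORS.get(pathlib.Path(path).suffix.lower())
        return extractor(path) if extractor else {"type": "unknown", "file": path}

    print(extract_metadata("results/run1.csv"))
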
Preprint
Full-text available
Powerful detectors at modern experimental facilities routinely collect data at multiple GB/s. Online analysis methods are needed to enable the collection of only interesting subsets of such massive data streams, such as by explicitly discarding some data elements or by directing instruments to relevant areas of experimental space. Such online analy...
Preprint
With the widespread availability of high-speed networks, it becomes feasible to outsource computing to remote providers and to federate resources from many locations. Such observations motivated the development, from the mid-1990s onwards, of a range of innovative Grid technologies, applications, and infrastructures. We review the history, current...
Preprint
Full-text available
Data - arguably the most important product of worldwide materials research investment - are rarely shared. The small and biased proportion of results published are buried in plots and text licensed by journals. This situation wastes resources, hinders innovation, and, in the current era of data-driven discovery, is no longer tenable. In this commen...
Preprint
Full-text available
Serial synchrotron crystallography enables studies of protein structures under physiological temperature and reduced radiation damage by collection of data from thousands of crystals. The Structural Biology Center at Sector 19 of the Advanced Photon Source has implemented a fixed-target approach with a new 3D printed mesh-holder optimized for sampl...
Article
The applications being developed within the U.S. Exascale Computing Project (ECP) to run on imminent Exascale computers will generate scientific results with unprecedented fidelity and record turn-around time. Many of these codes are based on particle-mesh methods and use advanced algorithms, especially dynamic load-balancing and mesh-refinement, t...
Chapter
Full-text available
Beamlines at synchrotron light source facilities are powerful scientific instruments used to image samples and observe phenomena at high spatial and temporal resolutions. Typically, these facilities are equipped only with modest compute resources for the analysis of generated experimental datasets. However, high data rate experiments can easily gen...
Chapter
Next-generation scientific instruments will collect data at unprecedented rates: multiple GB/s and exceeding TB/day. Such runs will benefit from automation and steering via machine learning methods, but these methods require new data management and policy techniques. We present here the Braid Provenance Engine (Braid-DB), a system that embraces AI-...
Preprint
Full-text available
Unraveling the liquid structure of multi-component molten salts is challenging due to the difficulty in conducting and interpreting high temperature diffraction experiments. Motivated by this challenge, we developed composition-transferable Gaussian Approximation Potentials (GAP) for molten LiCl-KCl. A DFT-SCAN accurate GAP is active learned from o...
Preprint
Full-text available
Despite much creative work on methods and tools, reproducibility -- the ability to repeat the computational steps used to obtain a research result -- remains elusive. One reason for these difficulties is that extant tools for capturing research processes do not align well with the rich working practices of scientists. We advocate here for simple me...
Preprint
Full-text available
The broad sharing of research data is widely viewed as of critical importance for the speed, quality, accessibility, and integrity of science. Despite increasing efforts to encourage data sharing, both the quality of shared data, and the frequency of data reuse, remain stubbornly low. We argue here that a major reason for this unfortunate state of...
Article
Full-text available
X-ray diffraction based microscopy techniques such as high-energy diffraction microscopy (HEDM) rely on knowledge of the position of diffraction peaks with high precision. These positions are typically computed by fitting the observed intensities in detector data to a theoretical peak shape such as pseudo-Voigt. As experiments become more complex a...
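
The pseudo-Voigt fit referred to here is a weighted sum of Gaussian and Lorentzian profiles; a minimal sketch with synthetic detector data and a standard least-squares fit (parameterization chosen for brevity, not necessarily the paper's):

    import numpy as np
    from scipy.optimize import curve_fit

    def pseudo_voigt(x, amp, mu, w, eta):
        gauss = np.exp(-((x - mu) ** 2) / (2 * w ** 2))
        lorentz = w ** 2 / ((x - mu) ** 2 + w ** 2)
        return amp * (eta * lorentz + (1 - eta) * gauss)

    x = np.linspace(-5, 5, 200)
    rng = np.random.default_rng(0)
    y = pseudo_voigt(x, 10.0, 0.3, 0.8, 0.4) + 0.1 * rng.normal(size=x.size)

    popt, _ = curve_fit(pseudo_voigt, x, y, p0=[8.0, 0.0, 1.0, 0.5])
    print(popt)  # recovered amplitude, center, width, mixing parameter
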
Chapter
Dedicated network connections are being increasingly deployed in cloud, centralized and edge computing and data infrastructures, whose throughput profiles are critical indicators of the underlying data transfer performance. Due to the cost and disruptions to physical infrastructures, network emulators, such as Mininet, are often used to generate me...
Chapter
The increasing volume and variety of science data has led to the creation of metadata extraction systems that automatically derive and synthesize relevant information from files. A critical component of metadata extraction systems is a mechanism for mapping extractors (lightweight tools to mine information from particular file types) to each file i...
Chapter
Technological advancements in modern scientific instruments, such as scanning electron microscopes (SEMs), have significantly increased data acquisition rates and image resolutions enabling new questions to be explored; however, the resulting data volumes and velocities, combined with automated experiments, are quickly overwhelming scientists as th...
Article
Machine learning (ML) has emerged as a promising technology to accelerate materials discovery. While systematic screening of vast chemical spaces is computationally expensive, ML algorithms offer a directed approach to identifying and testing promising molecular candidates for specific applications. Two significant hurdles towards development of ro...