Ian Foster

University of Chicago | UC · Department of Computer Science

PhD

About

1,002
Publications
157,908
Reads
97,008
Citations
Additional affiliations
January 2000 - present
University of Chicago
Position
  • Professor
January 1989 - present
Argonne National Laboratory
Position
  • Distinguished Fellow
January 1985 - December 1988
Imperial College London
Position
  • Research Associate

Publications

Publications (1,002)
Preprint
Beamlines at synchrotron light source facilities are powerful scientific instruments used to image samples and observe phenomena at high spatial and temporal resolutions. Typically, these facilities are equipped only with modest compute resources for the analysis of generated experimental datasets. However, high data rate experiments can easily gen...
Article
Full-text available
Potential climate-related impacts on future crop yield are a major societal concern. Previous projections of the Agricultural Model Intercomparison and Improvement Project’s Global Gridded Crop Model Intercomparison based on the Coupled Model Intercomparison Project Phase 5 identified substantial climate impacts on all major crops, but associated u...
Article
Full-text available
Advanced satellite-borne remote sensing instruments produce high-resolution multispectral data for much of the globe at a daily cadence. These datasets open up the possibility of improved understanding of cloud dynamics and feedback, which remain the biggest source of uncertainty in global climate model projections. As a step toward answering these...
Preprint
Scripting languages such as Python and R have been widely adopted as tools for the productive development of scientific software because of the power and expressiveness of the languages and available libraries. However, deploying scripted applications on large-scale parallel computer systems such as the IBM Blue Gene/Q or Cray XE6 is a challenge be...
Preprint
Full-text available
Advanced satellite-borne remote sensing instruments produce high-resolution multi-spectral data for much of the globe at a daily cadence. These datasets open up the possibility of improved understanding of cloud dynamics and feedback, which remain the biggest source of uncertainty in global climate model projections. As a step towards answering thes...
Preprint
Full-text available
Potential climate-related impacts on future crop yield are a major societal concern first surveyed in a harmonized multi-model effort in 2014. We report here on new 21st-century projections using ensembles of latest-generation crop and climate models. Results suggest markedly more pessimistic yield responses for maize, soybean, and rice compared to...
Preprint
Full-text available
Synchrotron-based X-ray computed tomography is widely used for investigating inner structures of specimens at high spatial resolutions. However, potential beam damage to samples often limits the X-ray exposure during tomography experiments. Proposed strategies for eliminating beam damage also decrease reconstruction quality. Here we present a deep...
Preprint
X-ray computed tomography is a commonly used technique for noninvasive imaging at synchrotron facilities. Iterative tomographic reconstruction algorithms are often preferred for recovering high quality 3D volumetric images from 2D X-ray images, however, their use has been limited to small/medium datasets due to their computational requirements. In...
Article
Full-text available
Background The National Cancer Institute Informatics Technology for Cancer Research (ITCR) program provides a series of funding mechanisms to create an ecosystem of open-source software (OSS) that serves the needs of cancer research. As the ITCR ecosystem substantially grows, it faces the challenge of the long-term sustainability of the software be...
Article
Full-text available
A limited nuclear war between India and Pakistan could ignite fires large enough to emit more than 5 Tg of soot into the stratosphere. Climate model simulations have shown severe resulting climate perturbations with declines in global mean temperature by 1.8 °C and precipitation by 8%, for at least 5 y. Here we evaluate impacts for the global food...
Preprint
Full-text available
The Sustainability and Industry Partnership Work Group (SIP-WG) is a part of the National Cancer Institute (NCI) Informatics Technology for Cancer Research (ITCR) program. The charter of the SIP-WG is to investigate options of long-term sustainability of open source software (OSS) developed by the ITCR, in part by developing a collecti...
Article
Background: Sub-Saharan Africa (SSA) has a high proportion of premenopausal hormone receptor negative breast cancer. Previous studies reported a strikingly high prevalence of germline mutations in BRCA1 and BRCA2 among Nigerian breast cancer patients. It is unknown if this exists in other SSA countries. Methods: Breast cancer cases, unselected f...
Chapter
The continuous increase in the data produced by simulations, experiments and edge components in the last few years has forced a shift in the scientific research process, leading to the definition of a fourth paradigm in Science, concerning data-intensive computing. This data deluge, in fact, introduces various challenges related to big data volumes...
Poster
Full-text available
Data plane programmability has emerged as a response to the lack of flexibility in networking ASICs and the long product cycles that vendors take to introduce new protocols on their networking gear. Furthermore, it tries to bridge the gap between the SDN promise and OpenFlow implementations. Following the ASIC tradition, OpenFlow implementations ha...
Conference Paper
Full-text available
X-ray computed tomography (XCT) is used regularly at synchrotron light sources to study the internal morphology of materials at high resolution. However, experimental constraints, such as radiation sensitivity, can result in noisy or undersampled measurements. Further, depending on the resolution, sample size and data acquisition rates, the resultin...
Chapter
The Mediterranean area is subject to a range of destructive weather events, including middle-latitudes storms, Mediterranean sub-tropical hurricane-like storms (“medicanes”), and small-scale but violent local storms. Although predicting large-scale atmosphere disturbances is a common activity in numerical weather prediction, the tasks of recognizin...
Conference Paper
Full-text available
We present a framework for cloud characterization that leverages modern unsupervised deep learning technologies. While previous neural network-based cloud classification models have used supervised learning methods, unsupervised learning allows us to avoid restricting the model to artificial categories based on historical cloud classification schem...
Chapter
Scientific Named Entity Referent Extraction is often more complicated than traditional Named Entity Recognition (NER). For example, in polymer science, chemical structure may be encoded in a variety of nonstandard naming conventions, and authors may refer to polymers with conventional names, commonly used names, labels (in lieu of longer names), sy...
Preprint
Full-text available
Synchrotron-based x-ray tomography is a noninvasive imaging technique that allows for reconstructing the internal structure of materials at high spatial resolutions. Here we present TomoGAN, a novel denoising technique based on generative adversarial networks, for improving the quality of reconstructed images for low-dose imaging conditions, as at...
Conference Paper
Rapidly growing data volumes at light sources demand increasingly automated data collection, distribution, and analysis processes, in order to enable new scientific discoveries while not overwhelming finite human capabilities. We present here the case for automating and outsourcing light source science using cloud-hosted data automation and enrichm...
Conference Paper
Distributed scientific and big-data computations are becoming increasingly dependent on access to remote files. Wide-area file transfers are supported by two basic schemes: (i) application-level tools, such as GridFTP, that provide transport services between file systems housed at geographically separated sites, and (ii) file systems mounted over w...
Chapter
To support increasingly distributed scientific and big-data applications, powerful data transfer infrastructures are being built with dedicated networks and software frameworks customized to distributed file systems and data transfer nodes. The data transfer performance of such infrastructures critically depends on the combined choices of file, dis...
Conference Paper
Full-text available
Cloud providers continue to expand and diversify their collection of leasable resources to meet the needs of an increasingly wide range of applications. While this flexibility is a key benefit of the cloud, it also creates a complex landscape in which users are faced with many resource choices for a given application. Suboptimal selections can both...
Preprint
Data transfer in wide-area networks has been long studied in different contexts, from data sharing among data centers to online access to scientific data. Many software tools and platforms have been developed to facilitate easy, reliable, fast, and secure data transfer over wide area networks, such as GridFTP, FDT, bbcp, mdtmFTP, and XDD. However,...
Poster
Full-text available
Poster presented at the 2018 AGU Fall Meeting (https://agu.confex.com/agu/fm18/meetingapp.cgi/Paper/439373), where it received an award (https://membership.agu.org/ospawinner/ricardo-barros-lourenco/).
Conference Paper
To support increasingly distributed scientific and big-data applications, powerful data transfer infrastructures are being built with dedicated networks and software frameworks customized to distributed file systems and data transfer nodes. The data transfer performance of such infrastructures critically depends on the combined choices of file, dis...
Conference Paper
Dedicated data transport infrastructures are increasingly being deployed to support distributed big-data and high-performance computing scenarios. These infrastructures employ data transfer nodes that use sophisticated software stacks to support network transport among sites, which often house distributed file and storage systems. Throughput measur...
Preprint
Full-text available
While the Machine Learning (ML) landscape is evolving rapidly, there has been a relative lag in the development of the "learning systems" needed to enable broad adoption. Furthermore, few such systems are designed to support the specialized requirements of scientific ML. Here we present the Data and Learning Hub for science (DLHub), a multi-tenant...
Article
Data from sensors incorporated into mobile devices, such as networked navigational sensors, can be used to capture detailed environmental information. We describe here a workflow and framework for using sensors on boats to construct unique new datasets of underwater topography (bathymetry). Starting with a large number of measurements of position,...
Conference Paper
The data crowdsourcing paradigm applied in coastal and marine monitoring and management has been developed only recently due to the challenges of the marine environment. The pervasive internet of things technology is contributing to increase the number of connected instrumented devices available for data crowd-sourcing. A main issue in the fog/edge...
Article
As materials data sets grow in size and scope, the role of data mining and statistical learning methods to analyze these materials data sets and build predictive models is becoming more important. This manuscript introduces matminer, an open-source, Python-based software platform to facilitate data-driven methods of analyzing and predicting materia...
Article
Ongoing, rapid innovations in fields ranging from microelectronics, aerospace, and automotive to defense, energy, and health demand new advanced materials at even greater rates and lower costs. Traditional materials R&D methods offer few paths to achieve both outcomes simultaneously. Materials informatics, while a nascent field, offers such a pro...
Article
Full-text available
Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of large-scale bioinformatics experiments. Because these experiments are computation- and data-intensive, they require high-performance computing techniques and can benefit from specialized technologies such as Scientific Workflow Managemen...
Article
Full-text available
Traditional machine learning (ML) metrics overestimate model performance for materials discovery. We introduce (1) leave-one-cluster-out cross-validation (LOCO CV) and (2) a simple nearest-neighbor benchmark to show that model performance in discovery applications strongly depends on the problem, data sampling, and extrapolation. Our results sugges...
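The idea behind leave-one-cluster-out cross-validation can be sketched in a few lines: cluster the feature space, then hold out one whole cluster per fold so the model must extrapolate to unseen regions rather than interpolate between near neighbors. This is a minimal illustration on synthetic data, not the paper's implementation; all data, model choices, and cluster counts here are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

# Synthetic regression data standing in for a materials dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)

# Cluster the feature space; each fold holds out one entire cluster,
# so the model is evaluated on a region it has never seen.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

scores = []
for c in np.unique(labels):
    train, test = labels != c, labels == c
    model = Ridge().fit(X[train], y[train])
    scores.append(mean_absolute_error(y[test], model.predict(X[test])))

print(f"LOCO-CV mean absolute error: {np.mean(scores):.3f}")
```

LOCO-CV errors are typically worse than random-split errors on the same data, which is the point: they estimate performance in a discovery setting, where candidates lie outside the training distribution.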
Conference Paper
Full-text available
Data publication systems are typically tailored to the requirements and processes of a specific domain, collaboration, and/or use case. We propose here an alternative approach to engineering such systems, based on customizable compositions of simple, independent platform services, each of which provides a distinct function such as identification, m...
Conference Paper
Full-text available
Wide-area data transfer is central to distributed science. Network capacity, data movement infrastructure, and tools in science environments continuously evolve to meet the requirements of distributed-science applications. Research and education (R&E) networks such as the U.S. Department of Energy's Energy Sciences network and Internet2 provide mul...
Article
Full-text available
Scientific computing systems are becoming significantly more complex, with distributed teams and complex workflows spanning resources from telescopes and light sources to fast networks and Internet of Things sensor systems. In such settings, no single, centralized administrative team and software stack can coordinate and manage all resources used b...
Conference Paper
Full-text available
Scientific computing systems are becoming increasingly complex and indeed are close to reaching a critical limit in manageability when using current human-in-the-loop techniques. In order to address this problem, autonomic, goal-driven management actions based on machine learning must be applied end to end across the scientific computing landsca...
Conference Paper
Full-text available
Wide area data transfers play an important role in many science applications but rely on expensive infrastructure that often delivers disappointing performance in practice. In response, we present a systematic examination of a large set of data transfer log data to characterize transfer characteristics, including the nature of the datasets transfe...
Conference Paper
Full-text available
Exponential increases in data volumes and velocities are overwhelming finite human capabilities. Continued progress in science and engineering demands that we automate a broad spectrum of currently manual research data manipulation tasks, from data transfer and sharing to acquisition, publication, and analysis. These needs are particularly evident...
Conference Paper
Full-text available
Amazon spot instances provide preemptable computing capacity at a cost that is often significantly lower than comparable on-demand or reserved instances. Spot instances are charged at the current spot price: a fluctuating market price based on supply and demand for spot instance capacity. However, spot instances are inherently volatile, the spot pr...
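The pricing dynamic this abstract describes can be illustrated with a toy simulation: a job runs on a spot instance only while the fluctuating market price stays at or below a bid threshold, and interrupted hours must be re-run at the on-demand rate. All prices, the bid, and the re-run rule here are hypothetical simplifications, not AWS's actual spot mechanics.

```python
import random

random.seed(1)
ON_DEMAND = 0.40  # hypothetical on-demand price, $/hour

# Hypothetical fluctuating spot market price over a 24-hour job.
spot_prices = [round(random.uniform(0.08, 0.20), 3) for _ in range(24)]

# Simplification: hours where the market price exceeds our bid are
# "interrupted" and assumed to be re-run at the on-demand rate.
BID = 0.15
spot_hours = [p for p in spot_prices if p <= BID]
interrupted = len(spot_prices) - len(spot_hours)

cost = sum(spot_hours) + interrupted * ON_DEMAND
savings = 1 - cost / (len(spot_prices) * ON_DEMAND)
print(f"{interrupted} interrupted hours; cost ${cost:.2f} "
      f"({savings:.0%} below pure on-demand)")
```

Even this crude model shows the trade-off the abstract raises: a low bid maximizes the per-hour discount but increases interruptions, so the cheapest strategy depends on how expensive an interruption is for the workload.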
Article
Full-text available
Extreme-scale simulations and experiments can generate large amounts of data, whose volume can exceed the compute and/or storage capacity at the simulation or experimental facility. With the emergence of ultra-high-speed networks, researchers are considering pipelined approaches in which data are passed to a remote facility for analysis. Here we...
Article
The Framework to Advance Climate, Economic, and Impact Investigations with Information Technology (FACE-IT) is a workflow engine and data science portal based on Galaxy and Globus technologies that enables computational scientists to integrate data, pre/post processing and simulation into a framework that supports offline environmental model coupli...
Conference Paper
As the volume and velocity of data generated by scientific experiments increase, the analysis of those data inevitably requires HPC resources. Successful research in a growing number of scientific fields depends on the ability to analyze data rapidly. In many situations, scientists and engineers want quasi-instant feedback, so that results from one...
Article
Full-text available
We describe best practices for providing convenient, high-speed, secure access to large data via research data portals. We capture these best practices in a new design pattern, the Modern Research Data Portal, that disaggregates the traditional monolithic web-based data portal to achieve orders-of-magnitude increases in data transfer performance, s...