All content in this area was uploaded by Logan Ward on Nov 25, 2019
MATERIALS DATA AND
AI EFFORTS IN THE US
LOGAN WARD
Assistant Computational Scientist
Data Science and Learning Division
lward@anl.gov
20 November 2019
RMIT, Melbourne, Australia
A LITTLE BIT ABOUT ME
Education:
–BS/MS in Materials Science from The Ohio State University
–PhD in Materials Science from Northwestern
–Post-doc in Computer Science at UChicago
Research Background:
To let you know where my perspective comes from
ML for Materials, Materials Modeling, Crystal Structure Solution, Software for Science
GOALS FOR TODAY
Two main questions:
1. What should I know about materials databases?
–What are the challenges?
–How are people trying to solve them?
–What is still broken?
2. What are the major “AI plus Materials” activities in the US?
–What are the main themes?
–What is new and exciting?
–What has staying power?
Also the outline for my talk
WHAT SHOULD I KNOW ABOUT
MATERIALS DATABASES?
MATERIALS DATABASES ARE NOT NEW
CRC Handbook: 1913
ASM Handbooks: ~1920s
JANAF Tables: 1964
(Hultgren) Select Values of the Thermo. Properties of…: 1973
Pauling File: 1995
OQMD: 2013
… and many more
Why Important Now? Why not “solved”?
In fact, they are older than “materials science”
WHY THE SUDDEN PREVALENCE?
Computing (especially AI) makes data much more useful
Seko et al. PRB (2014), 054303
Chen et al. Chem. Mater. (2012) Ward et al. Acta Mat. (2018)
Zhang et al. Acta Mat. (2018)
Kusne et al. Sci Rep. (2014)
MATERIALS DATA PYRAMID
Borrowed with appreciation from NIST
Curation Effort Interaction Frequency
Different Types of Data == Different Database Requirements!
WORKING DATA: CLOSE TO THE SCIENTIST
[Data] scientists need…
1. Unrestricted access to data
2. Portability
3. Easy use from other tools
4. Ability to share with collaborators
Solving everyday data problems
Everyone has their own workflow
Creating usable data management systems is a huge and well-studied problem
LABORATORY INFORMATION MANAGEMENT (US)
NIST’s CDCS, AFRL HyperThought, Materials Commons, 4CeeD
“SHARABLE” DATA AND PUBLICATION
Need: “Publish and Forget”
Requirements:
1. Provenance Information
2. Archival Storage
3. Detailed Descriptions
4. Rewards for Data Publication
Data for when a project is done
Common Features of All Services
WHAT DOES PUBLISHED DATA LOOK LIKE?
Basic Provenance Information
Links to Files
Data is Available (!), Only Usable by Humans
THERE ARE PLENTY OF PUBLICATION SERVICES
Each best for different types of data, different journals, etc.
Will these tools work for next generation beamlines?
REFERENCE DATA: WHAT PEOPLE WANT
At least people who do “data science”
Smallest fraction of data. Typically…
•Extensively curated
•Composed of many experiments
•Specific goal of collection
•Consistent format (schema)
Accordingly, reference data are most widely used and usable
Handbooks Web Databases
a.get_in_chemsys(['Ca', 'O'])
Web APIs
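The call above looks like a chemical-system query of the kind that web APIs for reference databases tend to offer. As a rough sketch of what such a query does, the example below filters a small in-memory table. Everything in it is an assumption for illustration: the `ENTRIES` records, their `id` values, and this `get_in_chemsys` function are hypothetical and do not belong to any real database client.

```python
import re

# Hypothetical records standing in for a reference database (not real data).
ENTRIES = [
    {"id": "demo-1", "formula": "CaO"},
    {"id": "demo-2", "formula": "CaO2"},
    {"id": "demo-3", "formula": "Ca"},
    {"id": "demo-4", "formula": "O2"},
    {"id": "demo-5", "formula": "CaTiO3"},
    {"id": "demo-6", "formula": "Co"},
]


def elements_in(formula):
    """Return the set of element symbols in a simple formula such as 'CaTiO3'."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def get_in_chemsys(elements, entries=ENTRIES):
    """Return entries whose elements all fall inside the requested chemical system."""
    allowed = set(elements)
    return [e for e in entries if elements_in(e["formula"]) <= allowed]


results = get_in_chemsys(["Ca", "O"])
formulas = [e["formula"] for e in results]
print(formulas)
```

The point of the sketch is the shape of the interaction: one function call with a list of elements returns structured records that other tools can consume directly, which is what handbooks and browse-only web databases cannot offer.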
WHAT IS SPECIAL ABOUT “REFERENCE DATA”?
Large amount of data
Curated metadata
Link to original source
Data in Tabular Form
Such data is rare, yet a requirement for ML
REFERENCE DATA: A BRIGHT FUTURE
Commercial/Industrial Academic/National Laboratory
Reference Databases are Proliferating Rapidly!
WHAT ARE THE TRENDS?
1. Data is Getting Published!
2. Repositories are Digital
3. Efforts are Community Driven
WHAT ARE THE MAJOR TRENDS?
1. Data is Getting Published → Deluge of Data
•Data Management Systems seldom used
•Publication repositories lack metadata
2. Repositories are Digital → APIs are Uncommon
•Tools Do Not Work with Databases
3. Efforts are Community Driven → Many Silos
•Finding Best Dataset Difficult
PSA: I work with the Materials Data Facility
Current State: Data and Tools are Available
What would make things better?
A SEAMLESS DATA INFRASTRUCTURE
Data
Resources
Software Computing
Easily Access Data/Software/Compute from Anywhere
You
Republish New Data/Software Just As Easily
THE MATERIALS DATA FACILITY (MDF)
•Connect: Extract domain-relevant metadata / transform the data
•Publish: Built to handle big data (many TB, millions of files), provides persistent identifier for data, distributed storage enabled
•Discover: Programmatic search index to aggregate and retrieve data across hundreds of indexed data sources
https://www.materialsdatafacility.org
> 35 TB of data
> 320 published authors
DLHub – A Data and Learning Hub for Science
Energy Storage
•Predict molecular energies with G4MP2 accuracy at B3LYP cost
•Data available in MDF
Tomography
•Enhance tomographic scans and remove noise using generative adversarial model
•Example data available on Petrel
X-Ray Science (Cherukara et al., 2018)
•Predict structure and phase of a material given coherent diffraction intensity
•Data available from Github
Exascale Cancer Research
(Figure labels: Input, Output)
A DOSE OF PRACTICAL ADVICE
WHAT CAN DATA MANAGEMENT LOOK LIKE?
I’ve done a variety of “AI+Materials” work
•HiTp Experimentation
•Atomistic Simulation
•Electrolyte Design
•Deep Learning and Formation Enthalpy
•Multi-Modal Imaging
•Advanced State-of-Health for Li-ion Batteries
DATA MANAGEMENT NEED NOT BE FANCY
Your research processes and tools might not change
Knowing “what you have” and “where you got it”
should not rely on memory
Write code to format data
(Don’t edit by hand!)
Organize your hard drive
Hold on to metadata
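The advice above can be sketched in a few lines of Python: a script, not a hand edit, turns a raw export into a consistent table and records where the data came from. The raw text, column names, and metadata fields below are all hypothetical, and the example keeps everything in memory rather than writing files.

```python
import csv
import hashlib
import io
import json

# Hypothetical raw instrument export with messy spacing (illustration only).
RAW = """sample , temp (C), hardness (HV)
A-1, 25 , 210
A-2, 300, 185
"""


def format_records(raw_text):
    """Parse the raw text into rows with consistent column names and types."""
    reader = csv.reader(io.StringIO(raw_text))
    next(reader)  # skip the raw header; clean column names are assigned below
    rows = []
    for line in reader:
        if not line:
            continue  # ignore blank lines
        sample, temp, hardness = (field.strip() for field in line)
        rows.append({
            "sample": sample,
            "temperature_C": float(temp),
            "hardness_HV": float(hardness),
        })
    return rows


def to_csv(rows):
    """Serialize the cleaned rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = format_records(RAW)
clean_csv = to_csv(rows)

# Sidecar metadata: what the data is and where it came from.
metadata = {
    "source": "hypothetical hardness tester export",
    "raw_sha256": hashlib.sha256(RAW.encode("utf-8")).hexdigest(),
    "columns": {"temperature_C": "degrees Celsius", "hardness_HV": "Vickers hardness"},
    "n_rows": len(rows),
}
metadata_json = json.dumps(metadata, indent=2)
print(clean_csv)
print(metadata_json)
```

Because the cleaning step is code, it can be rerun when new raw files arrive, and the checksum ties each cleaned table back to the exact raw file it was made from.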
BUT THERE ARE TOOLS AVAILABLE
I use a few to track and share my own data
COMMUNICATION: GITHUB/GITLAB
AUTOMATION: QCARCHIVE
SHARING: MDF/CITRINATION
Put simply:
1. Use and share(!) automation code with others
2. Write code just like manuscripts
3. Consider publication at the outset
CONCLUSION / FAQ PAGE
Overall: There is not (and will never be) a single data solution
Sharing your data
▪My data isn’t really usable… consider a LIMS system
▪I have data, but not sure if it will be useful…publish it anyway (it might be!)
▪I have data, and I really want people to use it… Find a good “reference database” or make your own
Finding new data
▪There are many, probably old, reference books. Dig through them!
▪There are newer, digital resources. For guides, check out:
–Materials Research Bulletin, Hill MRS Bulletin (2019), Technical Societies (go ECS!)
A short guide to common data problems
WHAT IS GOING ON IN MATERIALS AND ML?
[LOOKING MOSTLY AT US-RELATED EFFORTS]
“INFORMATICS” IS ON THE UPTICK
Goal: High-level overview of what
I think is happening next
What science is already good at:
- Building ML models for materials
- Surrogates for expensive calculations
Where we are going:
Pervasive AI in science?
Or at least papers that use that term are prevalent
ML enters the zeitgeist?
AREA 1: PROLIFIC AND ACCESSIBLE DATA
Solid foundation for data-driven science
Ward et al. MRS Bulletin (2018)
MACHINE-ASSISTED CURATION
Several recent success stories in NLP for materials
Efficient dye-sensitized solar cells (Argonne/Cambridge)
Combined data mining with high-throughput computation
Cole Computing in Sci and Eng (2018). doi: 10.1109/MCSE.2018.011111129
Guidance for materials synthesis (MIT/LBNL)
Parsed 640k materials articles, resulting data is open!
Kim et al. Chem Mater (2017). doi: 10.1021/acs.chemmater.7b03500
Identifying functional materials from text mining (LBNL)
Learned associations between materials from abstracts
Tshitoyan et al. Nature (2019). doi: 10.1038/s41586-019-1335-8
AUTOMATED LABORATORIES
Taking humans “out of the loop”
Humans do not report data well…
… and their biases are problematic
(This team is doing stellar work)
A potential solution: Try dumber things faster with computers!
AUTOMATED LABORATORIES
Report data and tirelessly perform research
Automation expands a scientist’s ability to be creative
“ESCALATE” (Fordham University/Haverford College), “ADA” (UBC), “ARES” (AFRL)
AREA 2: AI AS AN EVERYDAY RESEARCH TOOL
What is not likely: Replacing scientists with machines
More feasible and useful: Prioritizing scientist effort
“AI Assistants” for scientists
Images from Wikimedia
AI-ENHANCED CHARACTERIZATION
Computer vision algorithms are quite mature
Less Tedium in Microscopy
Li, Fields, Morgan. npj Comp Mat. (2018)
Faster Acquisition for CT
Liu et al. ArXiv: 1902.07582
Automated Library Analysis
Suram et al. ACS Combi. (2017)
Flagging Images with Errors
Wang et al. IEEE WACV. (2017)
AI-ASSISTED MATERIALS DESIGN
Assisting scientific creativity
Models can [often] be improved by just adding new data
Best case: No human effort
Implication: Updating model predictions faster than making new measurements
Figure: Balachandran et al. Sci. Rep. (2016), 19660. doi: 10.1038/srep19660
USING AI TO STEER DESIGN EXPERIMENTS
Leads to better materials with less experimentation
Ref: Xue et al. Nat Comm. (2016), 11241. doi: 10.1038/ncomms11241
Of [the 36 compositions we tested], 14 had smaller Δ𝑇 than any of the 22 in the original data set.
Best alloy was 42% lower!
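The general idea of letting a model choose the next experiment can be sketched as a small loop. This is a toy illustration under stated assumptions, not the method used by Xue et al.: the `run_experiment` function is a made-up stand-in for a real measurement, the surrogate is a simple inverse-distance-weighted average, and distance to the nearest measured point serves as a crude proxy for uncertainty. All names and the `kappa` parameter are hypothetical.

```python
def run_experiment(x):
    """Stand-in for a real measurement; this toy function is an assumption."""
    return (x - 0.7) ** 2 + 0.1 * x


def predict(x, measured):
    """Toy surrogate: inverse-distance-weighted average of measured values."""
    weighted = []
    for x_m, y_m in measured:
        distance = abs(x - x_m)
        if distance == 0:
            return y_m  # already measured exactly here
        weighted.append((1.0 / distance ** 2, y_m))
    total = sum(w for w, _ in weighted)
    return sum(w * y for w, y in weighted) / total


def novelty(x, measured):
    """Distance to the nearest measured point, used as a crude uncertainty proxy."""
    return min(abs(x - x_m) for x_m, _ in measured)


def choose_next(candidates, measured, kappa=1.0):
    """Pick the untested candidate with the lowest optimistic estimate (minimizing)."""
    tested = {x_m for x_m, _ in measured}
    untested = [x for x in candidates if x not in tested]
    return min(untested, key=lambda x: predict(x, measured) - kappa * novelty(x, measured))


# Candidate "compositions" on a 1D grid, with two initial measurements.
candidates = [i / 20 for i in range(21)]
measured = [(x, run_experiment(x)) for x in (0.0, 1.0)]
initial_best = min(y for _, y in measured)

# Each round: update the model implicitly (it reads `measured`), pick, measure.
for _ in range(6):
    x_next = choose_next(candidates, measured)
    measured.append((x_next, run_experiment(x_next)))

best_x, best_y = min(measured, key=lambda pair: pair[1])
print(f"best x = {best_x:.2f}, value = {best_y:.4f} after {len(measured)} measurements")
```

The `kappa` term trades off exploiting what the model already predicts to be good against exploring regions with no data; in practice a model with real uncertainty estimates would replace both toy functions.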
BIG GOAL: “CLOSING THE LOOP”
AI Steering Automated Experimentation
[Many North American efforts under way]
“Scientific Question”/“Hypothesis”/“Design Goals”
Useful Data/Anomalies
Scientific Knowledge
Source: Curtis Berlinguette (UBC)
Great example: Ada!
CONCLUSIONS
Two main questions:
1. What should I know about materials databases?
–What are the challenges? Depends on the type of data
–How are people trying to solve them? Community of database software
–What is still broken? Linking databases together and to compute
2. What are the major “AI plus Materials” activities in the US?
–What are the main themes? AI becoming pervasive
–What is new and exciting? NLP and laboratory automation
–What has staying power? Tools to aid (not replace) humans
Thanks to our sponsors!
U.S. DEPARTMENT OF ENERGY
Globus, IMaD, DLHub, Argonne LDRD