eScience at the Royal Society of Chemistry: Current Initiatives

Abstract

Access to scientific information has changed in a manner that was likely never even imagined by the early pioneers of the internet. The quantities of data, the array of tools available to search and analyze, the devices and the shift in community participation continue to expand, while the pace of change does not appear to be slowing. ChemSpider is one of the chemistry community’s primary online public compound databases. Containing tens of millions of chemical compounds and their associated data, ChemSpider serves data to tens of thousands of chemists every day, and it serves as the foundation for many important international projects to integrate chemistry and biology data, facilitate drug discovery efforts and help to identify new chemicals from under the ocean. This presentation will provide an overview of the expanding reach of this eScience cheminformatics platform and the nature of the solutions that it helps to enable, including structure validation, text mining and semantic markup, the National Chemical Database Service for the United Kingdom and the development of a chemistry data repository. We will also discuss the possibilities it offers in the domain of crowdsourcing and open data sharing. The future of scientific information and communication will be underpinned by these efforts, influenced by increasing participation from the scientific community and facilitated collaboration, and will ultimately accelerate scientific progress.
eScience at the Royal Society of Chemistry: Current Initiatives
Antony Williams
Cornell University, May 14th 2013
We Have… Too Much Data!!!
The World of Online Chemistry
Property databases
Compound aggregators
Screening assay results
Scientific publications
Encyclopedic articles (Wikipedia)
Metabolic pathway databases
ADME/Tox data (eTOX, for example)
Blogs/Wikis and Open Notebook Science
e-Science and Primary Data
How much data generated in a lab, that COULD go public, is lost forever?
Public Domain reference databases of value?
Syntheses
Properties
Spectra
CIFs
Images
Much of chemistry is chemical structure-based: where and how could we host these data?
RSC’s ChemSpider
ChemSpider
>28.5 million unique chemicals from >400 data sources
Focus on improving data quality, enhancing functionality, integrating and enabling
Crowdsourced “Annotations”
Users can add
Descriptions/Syntheses/Commentaries
Links to PubMed articles
Links to articles via DOIs
Add spectral data
Add Crystallographic Information Files
Add photos
Add MP3 files
Add Videos
Spectra
Chemistry Data online are messy
We have inherited errors
All public compound databases have errors
“Incorrect” structures – assertions, timelines etc
“Incorrect” names associated with structures
Properties
Links
Publications
ENORMOUS CHALLENGE
Crowdsourced Curation
Crowdsourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
Search “Vitamin H”
“Curate” Identifiers
Validated Name-Structure Dictionaries
Chemical name dictionaries are used for:
Text-mining (publications, patents)
Used to index PubMed and link to Google Patents
Linking to other databases (think Biology!)
When structures are not available, drug names provide the link
Searching the web
Names link to structures, which link to InChIs
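As a minimal sketch of how a name dictionary can drive text-mining, the Python below tags known chemical names in a sentence and maps each one to a structure record. It is an illustration only, not the ChemSpider implementation: the dictionary contents, the record IDs and the function names are all hypothetical.

```python
import re

# Hypothetical mini-dictionary: every synonym maps to one structure record.
# The record IDs are placeholders, not real ChemSpider or InChI identifiers.
NAME_TO_RECORD = {
    "biotin": "record-biotin",
    "vitamin h": "record-biotin",
    "vincristine": "record-vincristine",
}


def build_pattern(names):
    """Compile one case-insensitive regex, longest names first so that
    multi-word synonyms win over shorter overlapping ones."""
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def tag_names(text, dictionary=NAME_TO_RECORD):
    """Return (matched text, start offset, record ID) for each dictionary hit."""
    pattern = build_pattern(dictionary)
    return [
        (m.group(0), m.start(), dictionary[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


hits = tag_names("Vitamin H (biotin) was compared with vincristine.")
for name, offset, record in hits:
    print(f"{offset:3d}  {name!r} -> {record}")
```

Because both "Vitamin H" and "biotin" resolve to the same record, a wrong synonym in the dictionary would silently link text to the wrong structure, which is why the dictionaries need to be validated.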
I want to know about “Vincristine”
Vincristine: Identifiers and Properties
Vincristine: Vendors and Sources
Linked by Structure
Vincristine: Patents
Linked by Name
Vincristine: Articles
Linked by Name
Semantic Mark-up of Articles
Linking Names to Structures
The InChI Identifier
InChI Strings Hash to InChIKeys
Vancomycin: Search the Internet
Vancomycin
Search Molecular Skeleton
Search Full Molecule
Full Skeleton Search: 104 Hits
Full Molecule Search: 4 Hits
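The difference between the skeleton and full-molecule searches can be sketched with the layout of a standard InChIKey: a 14-character block hashed from the connectivity layers, a 10-character block covering the remaining layers (such as stereochemistry) plus flag characters, and a final protonation character. The helper names below are hypothetical, and the two alanine keys are quoted from memory, so treat them as illustrative and verify them before reuse.

```python
def split_inchikey(key):
    """Split a standard InChIKey into its three hyphen-separated blocks."""
    parts = key.split("-")
    if len(parts) != 3 or [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a 27-character InChIKey: {key!r}")
    return parts


def same_skeleton(key_a, key_b):
    """Skeleton match: only the first (connectivity) block has to agree."""
    return split_inchikey(key_a)[0] == split_inchikey(key_b)[0]


def same_molecule(key_a, key_b):
    """Full-molecule match: the complete key has to agree."""
    split_inchikey(key_a)  # validate both keys before comparing
    split_inchikey(key_b)
    return key_a == key_b


# Believed to be the keys for L- and D-alanine (assumption, quoted from memory).
L_ALANINE = "QNAYBMKLOCPYGJ-REOHZSHNSA-N"
D_ALANINE = "QNAYBMKLOCPYGJ-UWTATZPHSA-N"

print(same_skeleton(L_ALANINE, D_ALANINE))  # stereoisomers share a skeleton
print(same_molecule(L_ALANINE, D_ALANINE))  # but are not the same full key
```

A skeleton search is therefore the looser query, which is consistent with it returning more hits (104) than the full-molecule search (4).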
ChemSpider Resources for Chemistry
Some usage statistics
ca. 200 visitors at any one time, ~30,000 visits per day
Mar 4-Apr 3, 2013
Visits = 731,656
Unique Visitors = 527,008
Independent servers to support other projects
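As a quick check on the figures above, the reporting window can be turned into averages. This is plain arithmetic on the numbers quoted on the slide. The inclusive day count is an assumption, and the slide does not say whether "~30,000 visits per day" refers to an average or to busier days.

```python
from datetime import date

visits = 731_656
unique_visitors = 527_008

# Assumption: the window Mar 4-Apr 3, 2013 includes both end dates.
days = (date(2013, 4, 3) - date(2013, 3, 4)).days + 1

visits_per_day = visits / days
visits_per_visitor = visits / unique_visitors

print(f"{days} days, {visits_per_day:,.0f} visits/day on average")
print(f"{visits_per_visitor:.2f} visits per unique visitor")
```

Over this window the average works out to roughly 23,600 visits per day and about 1.4 visits per unique visitor.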
Access ChemSpider
APIs
Programmatic access used by Mobile Apps, Funded Consortia projects, many Academic groups
Widgets
UI components for embedding in other websites
Data
Data access, downloads, reuse, licensing
Flexible ChemSpider API
http://www.chemspider.com/google/
Publications - a summary of work
Scientific publications are a summary of work
Is all work reported?
How much science is lost to pruning?
What of value sits in notebooks and is lost?
How much data is lost?
How many compounds never reported?
How many syntheses fail or succeed?
How many characterization measurements?
Micropublishing Syntheses
ChemSpider SyntheticPages
Olympicene
So you Want a Profile???
Interactive Data
Integrate with instruments and software
Integration with analytical instrumentation vendors already in place
Agilent, Bruker, Thermo, Waters
Also, cheminformatics vendors link to ChemSpider
Accelrys, ACD/Labs, ChemAxon, iChemLabs, and…
PharmaSea
Dereplication via ChemSpider
Segregation of natural products datasets
Analytical data algorithms & integration
Mass spec searching: predicted fragmentation
NMR feature searching: NMR prediction
Computer-assisted structure elucidation
It is so difficult to navigate…
What’s the structure?
Are they in our file?
What’s similar?
What’s the target?
Pharmacology data?
Known Pathways?
Working On Now?
Connections to disease?
Expressed in right cell type?
Competitors?
IP?
3-year Innovative Medicines Initiative project
Integrating chemistry and biology data using semantic web technologies
Open source code, open data and open standards
Academics, Pharma companies, Publishers…
ChemSpider Contributions
The host of the chemistry services
Supplier of “standardized” chemical data files
Chemistry searching (structure, substructure etc)
Curator and data quality checking
Now building the Open PHACTS chemical registration system
Natural Products Updates
Names hard, Structures “Obvious”
New content based on monthly updates of the database
Click through to the Natural Products Updates entry
National Chemical Database Service
National Chemical Database Service for UK Academics
Integrating Commercial Databases and Services
Chemicals, analytical data, prediction algorithms
Development of data repository
Community Repository for Data
Funding agencies encourage sharing of data
Increasing availability of “Open Data”
Institutional repositories: no specific domain support
Develop a community repository for chemistry data: private, public, embargoed
Provides data to develop models/algorithms
Community Repository for Data
Automated depositions of data
DOI’ed data objects for citation purposes
A database of reference data, but validated by the community
National services feeding the repository: crystallography, mass spectrometry
Integrate with blogging tools for chemistry
Integrate with Electronic Lab Notebooks as feeds
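One way to picture a deposited data object with private, public and embargoed states is the sketch below. It is a hypothetical data model, not the repository's actual design: the class, field and method names are assumptions, and the DOI is a placeholder.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PRIVATE = "private"
    PUBLIC = "public"
    EMBARGOED = "embargoed"


@dataclass
class Deposition:
    """Hypothetical record for one deposited, citable data object."""
    doi: str                      # placeholder DOI, for citation purposes
    depositor: str
    data_type: str                # e.g. "spectrum", "CIF", "synthesis"
    visibility: Visibility
    embargo_until: Optional[date] = None

    def readable_by(self, user, today):
        """Depositors always see their own data; others depend on visibility."""
        if user == self.depositor:
            return True
        if self.visibility is Visibility.PUBLIC:
            return True
        if self.visibility is Visibility.EMBARGOED:
            return self.embargo_until is not None and today >= self.embargo_until
        return False


record = Deposition(
    doi="10.0000/example.1",      # not a real DOI
    depositor="alice",
    data_type="CIF",
    visibility=Visibility.EMBARGOED,
    embargo_until=date(2014, 1, 1),
)
print(record.readable_by("bob", date(2013, 5, 14)))   # during the embargo
print(record.readable_by("bob", date(2014, 1, 1)))    # once the embargo ends
```

Keeping the embargo date on the record itself means an automated deposition, for example from an Electronic Lab Notebook feed, can become public without a second manual step.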
Model Building with Community Data
Community data as a basis of model building
Consume data from available databases, community data, new publications and build predictive algorithms for the community
How many algorithms are reported and lost? How much repeat work is done in the domain of algorithmic development?
Support for Chemical Reactions
Integrating mined reaction data from patents
Will also incorporate and integrate RSC Databases: Methods of Organic Synthesis, Catalysts and Catalyzed Reactions and…
Inside our Publication Archive
How much data is in the archive, in the publications and in the supplementary info?
How many compounds for ChemSpider?
How many syntheses for ChemSpider reactions?
How many characterization measurements?
Property Data
Spectral Data
Graphs and charts to be used for modeling?
What if we could capture it all?
Digitally Enhancing the RSC Archive
Start with data in publications
Data Validation and Curation Required
Encouraging Participation with Rewards and RECOGNITION
Manual Curation
Integrated commenting, curating and validation platform across ALL eScience and publishing platforms
All integrated to a central RSC profile and feeding the AltMetrics tools
Structure Review
Future Recognition in AltMetrics?
ChemSpider
Internet Data
The Future
Commercial Software
Pre-competitive Data
Open Science
Open Data
Publishers
Educators
Open Databases
Chemical Vendors
Small organic molecules
Undefined materials
Organometallics
Nanomaterials
Polymers
Minerals
Particle bound
Links to Biologicals
The Future of Chemistry on the Web?
Public compound databases federate & build a linked environment of validated data!
Data validation needs are not ignored
Publishers layer on information to make publications discoverable
Open Data proliferate
The “Semantic Web” will continue to develop…
Thank you
Email: williamsa@rsc.org
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams