eScience at the Royal Society of Chemistry: Current Initiatives

May 2013

DOI:10.6084/m9.figshare.703652

Authors:

Antony John Williams

United States Environmental Protection Agency

Access to scientific information has changed in ways that the early pioneers of the internet likely never imagined. The quantities of data, the array of tools available for searching and analysis, the range of devices and the shift in community participation continue to expand, and the pace of change does not appear to be slowing. ChemSpider is one of the chemistry community’s primary online public compound databases. Containing tens of millions of chemical compounds and their associated data, ChemSpider serves data to tens of thousands of chemists every day and is the foundation for many important international projects to integrate chemistry and biology data, facilitate drug discovery efforts and help identify new chemicals from under the ocean. This presentation will provide an overview of the expanding reach of this eScience cheminformatics platform and the nature of the solutions that it helps to enable, including structure validation, text mining and semantic markup, the National Chemical Database Service for the United Kingdom and the development of a chemistry data repository. We will also discuss the possibilities it offers in the domain of crowdsourcing and open data sharing. The future of scientific information and communication will be underpinned by these efforts, influenced by increasing participation from the scientific community, facilitated by collaboration, and will ultimately accelerate scientific progress.

eScience at the Royal Society of Chemistry: Current Initiatives

Antony Williams

Cornell University, May 14th 2013

We Have …Too Much Data!!!

The World of Online Chemistry

•Property databases

•Compound aggregators

•Screening assay results

•Scientific publications

•Encyclopedic articles (Wikipedia)

•Metabolic pathway databases

•ADME/Tox data – eTOX for example

•Blogs/Wikis and Open Notebook Science

e-Science and Primary Data

•How much data generated in a lab, that COULD go public, is lost forever?

•Public Domain reference databases of value?

–Syntheses

–Properties

–Spectra

–CIFs

–Images

•Much of chemistry is chemical structure-based – where and how could we host these data?

RSC’s ChemSpider

ChemSpider

•>28.5 million unique chemicals from >400 data sources

•Focus on improving data quality, enhancing functionality, integrating and enabling

Crowdsourced “Annotations”

•Users can add

–Descriptions/Syntheses/Commentaries

–Links to PubMed articles

–Links to articles via DOIs

–Spectral data

–Crystallographic Information Files

–Photos

–MP3 files

–Videos

Spectra

Chemistry Data online are messy

•We have inherited errors

•All public compound databases have errors

•“Incorrect” structures – assertions, timelines etc

•“Incorrect” names associated with structures

•Properties

•Links

•Publications

•ENORMOUS CHALLENGE

Crowdsourced Curation

•Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate

Search “Vitamin H”

“Curate” Identifiers

Validated Name-Structure Dictionaries

•Chemical name dictionaries are used for:

•Text-mining (publications, patents)

–Used to index PubMed and link to Google Patents

•Linking to other databases – think Biology!

–When structures are not available drug names link

•Searching the web

–Names link to structures link to InChIs
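The text-mining use of a name dictionary can be sketched in a few lines. This is only an illustration under stated assumptions: the dictionary is hand-made, the record keys (`record-A`, `record-B`) are placeholders rather than real ChemSpider IDs or InChIKeys, and `tag_names` is a hypothetical helper, not part of any ChemSpider API.

```python
import re

# Hypothetical, hand-made dictionary: each validated name maps to a record key.
# The keys are placeholders, not real ChemSpider IDs or InChIKeys.
NAME_TO_RECORD = {
    "biotin": "record-A",
    "vitamin h": "record-A",    # synonym of biotin, so it shares the record
    "vincristine": "record-B",
}


def build_pattern(names):
    """Compile one case-insensitive pattern that matches any dictionary name."""
    # Longest names first, so multi-word names win over shorter overlaps.
    ordered = sorted(names, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)


def tag_names(text, dictionary=NAME_TO_RECORD):
    """Return (matched text, character offset, record key) for each name found."""
    pattern = build_pattern(dictionary)
    return [
        (m.group(0), m.start(), dictionary[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


sentence = "Vitamin H (biotin) was assayed alongside vincristine."
hits = tag_names(sentence)
for matched, offset, record in hits:
    print(matched, offset, record)
```

Because both "Vitamin H" and "biotin" resolve to the same record key, a wrong or unvalidated synonym in the dictionary would mislink every document that uses it, which is why the dictionary has to be validated first.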

I want to know about “Vincristine”

Vincristine: Identifiers and Properties

Vincristine: Vendors and Sources

Linked by Structure

Vincristine: Patents

Linked by Name

Vincristine: Articles

Linked by Name

Semantic Mark-up of Articles

Linking Names to Structures

The InChI Identifier

InChI Strings Hash to InChIKeys
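The fixed layout of an InChIKey can be illustrated with a small parser. This is a sketch, not the hashing algorithm itself (the hash is computed by the InChI software and is not reproduced here). It assumes the 27-character key layout: a 14-letter block hashed from the connectivity, an 8-letter block hashed from the remaining layers, a standard/non-standard flag, a version letter and a protonation letter. `parse_inchikey` and `InChIKeyParts` are hypothetical names; the example key is the one for ethanol.

```python
import re
from typing import NamedTuple

# 14 letters, hyphen, 8 letters + S/N flag + version letter, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


class InChIKeyParts(NamedTuple):
    skeleton: str       # first block: hash of the connectivity ("skeleton")
    detail: str         # second block: hash of the remaining layers
    is_standard: bool   # True when the flag character is "S"
    version: str        # InChI version letter
    protonation: str    # "N" means no added or removed protons


def parse_inchikey(key: str) -> InChIKeyParts:
    """Split an InChIKey into its blocks, or raise ValueError if malformed."""
    match = INCHIKEY_RE.match(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, detail, flag, version, protonation = match.groups()
    return InChIKeyParts(skeleton, detail, flag == "S", version, protonation)


parts = parse_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA-N")  # ethanol
print(parts)
```

A fixed-length key made only of capital letters and hyphens is what makes the identifier practical as a search term for general web search engines.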

Vancomycin –Search the Internet

Vancomycin

Search Molecular Skeleton

Search Full Molecule

Full Skeleton Search: 104 Hits

Full Molecule Search: 4 Hits
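The difference between the two hit counts can be sketched as follows, assuming that the skeleton search matches only the first block of the InChIKey while the full-molecule search matches the whole key. The page names and keys below are made-up placeholders, not real search results or real InChIKeys.

```python
# Hypothetical index of web pages and the InChIKey each one mentions.
# All keys are placeholders chosen to be easy to read.
RECORDS = [
    ("page-1", "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),
    ("page-2", "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),
    ("page-3", "AAAAAAAAAAAAAA-CCCCCCCCSA-N"),  # same skeleton, other stereo form
    ("page-4", "DDDDDDDDDDDDDD-UHFFFAOYSA-N"),  # unrelated compound
]


def full_molecule_search(query, records=RECORDS):
    """Pages whose key is identical to the query key."""
    return [source for source, key in records if key == query]


def skeleton_search(query, records=RECORDS):
    """Pages whose key shares the first (connectivity) block with the query."""
    skeleton = query.split("-")[0]
    return [source for source, key in records if key.split("-")[0] == skeleton]


query = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
full_hits = full_molecule_search(query)
skeleton_hits = skeleton_search(query)
print(len(skeleton_hits), "skeleton hits;", len(full_hits), "full-molecule hits")
```

The skeleton search is the looser of the two, so it always returns at least as many hits as the full-molecule search.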

ChemSpider Resources for Chemistry

Some usage statistics

•ca. 200 visitors at any one time, ~30,000 visits per day

•Mar 4-Apr 3, 2013

–Visits = 731,656

–Unique Visitors = 527,008

•Independent servers to support other projects

Access ChemSpider

•APIs

–Programmatic access used by Mobile Apps, Funded Consortia projects, many Academic groups

•Widgets

–UI components for embedding in other websites

•Data

–Data access, downloads, reuse, licensing

Flexible ChemSpider API

http://www.chemspider.com/google/

Publications - a summary of work

•Scientific publications are a summary of work

–Is all work reported?

–How much science is lost to pruning?

–What of value sits in notebooks and is lost?

•How much data is lost?

–How many compounds never reported?

–How many syntheses fail or succeed?

–How many characterization measurements?

Micropublishing Syntheses

ChemSpider SyntheticPages

Olympicene

So you Want a Profile???

Interactive Data

Integrate to instruments and software

•Integration to analytical instrumentation vendors already in place

–Agilent, Bruker, Thermo, Waters

•Also, Cheminformatics vendors link to ChemSpider

–Accelrys, ACD/Labs, ChemAxon, iChemLabs, and…

PharmaSea

•Dereplication via ChemSpider

•Segregation of natural products datasets

•Analytical data algorithms & integration

–Mass spec searching – predicted fragmentation

–NMR feature searching – NMR prediction

–Computer-assisted structure elucidation
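One simple form of mass-based dereplication is to compare an observed mass against the monoisotopic masses of known compounds within a ppm tolerance. The sketch below is an assumption about how such a lookup could work, not the PharmaSea or ChemSpider implementation. The candidate list is a tiny hand-picked set, the observed value is treated as a neutral monoisotopic mass (adduct correction is omitted), and all function names are hypothetical.

```python
import re

# Monoisotopic masses (in u) of the most abundant isotopes.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}

# Small illustrative candidate set: name -> molecular formula.
CANDIDATES = {
    "caffeine": "C8H10N4O2",
    "theobromine": "C7H8N4O2",
    "theophylline": "C7H8N4O2",
}


def monoisotopic_mass(formula):
    """Sum isotope masses for a simple formula such as 'C8H10N4O2'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def dereplicate(observed_mass, candidates=CANDIDATES, tolerance_ppm=5.0):
    """Names of known compounds whose mass lies within the tolerance."""
    return sorted(
        name
        for name, formula in candidates.items()
        if abs(ppm_error(observed_mass, monoisotopic_mass(formula))) <= tolerance_ppm
    )


matches = dereplicate(180.0650)
print(matches)
```

Theobromine and theophylline are isomers, so a mass match alone cannot tell them apart. That ambiguity is the reason the list above pairs mass searching with predicted fragmentation, NMR prediction and computer-assisted structure elucidation.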

It is so difficult to navigate…

•What’s the structure?

•Are they in our file?

•What’s similar?

•What’s the target?

•Pharmacology data?

•Known Pathways?

•Working On Now?

•Connections to disease?

•Expressed in right cell type?

•Competitors?

•IP?

•3-year Innovative Medicines Initiative project

•Integrating chemistry and biology data using semantic web technologies

•Open source code, open data and open standards

•Academics, Pharma companies, Publishers….

ChemSpider Contributions

•The host of the chemistry services

–Supplier of “standardized” chemical data files

–Chemistry searching (structure, substructure etc)

–Curator and data quality checking

•Now building the Open PHACTS chemical registration system

Natural Products Updates

•Names hard, Structures “Obvious”

•New content based on monthly updates of the database

•Click through to the Natural Products Updates entry

National Chemical Database Service

•National Chemical Database Service for UK Academics

•Integrating Commercial Databases and Services

•Chemicals, analytical data, prediction algorithms

•Development of data repository

Community Repository for Data

•Funding agencies encourage sharing of data

•Increasing availability of “Open Data”

•Institutional repositories offer no domain-specific support

•Develop a community repository for chemistry data – private, public, embargoed

•Provides data to develop models/algorithms

Community Repository for Data

•Automated depositions of data

•DOI’ed data objects for citation purposes

•A database of reference data, but validated by the community

•National services feeding the repository – crystallography, mass spectrometry

•Integrate to blogging tools for chemistry

•Integrate to Electronic Lab Notebooks as feeds
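The repository ideas on these two slides (private, public or embargoed data, and DOI’ed data objects) suggest a deposition record along the following lines. This is a hypothetical data model for illustration only: the class names, fields and the example DOI are assumptions, not the schema of any RSC system.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PRIVATE = "private"
    PUBLIC = "public"
    EMBARGOED = "embargoed"


@dataclass
class Deposition:
    title: str
    data_type: str                        # e.g. "spectrum", "CIF", "synthesis"
    visibility: Visibility
    embargo_until: Optional[date] = None  # only meaningful when embargoed
    doi: Optional[str] = None             # assigned so the data can be cited

    def is_visible(self, today: date) -> bool:
        """Public data is always visible; embargoed data once the date passes."""
        if self.visibility is Visibility.PUBLIC:
            return True
        if self.visibility is Visibility.EMBARGOED and self.embargo_until is not None:
            return today >= self.embargo_until
        return False


deposit = Deposition(
    title="1H NMR spectrum of compound 7",
    data_type="spectrum",
    visibility=Visibility.EMBARGOED,
    embargo_until=date(2014, 1, 1),
    doi="10.9999/example.1",              # placeholder DOI, not a real identifier
)
print(deposit.is_visible(date(2013, 5, 14)), deposit.is_visible(date(2014, 1, 1)))
```

Keeping the embargo date on the record lets a deposit made at the time of measurement become public automatically, for example once the associated article is published.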

Model Building with Community Data

•Community data as a basis of model building

–Consume data from available databases, community data, new publications and build predictive algorithms for the community

–How many algorithms are reported and lost? How much repeat work is done in the domain of algorithmic development?

Support for Chemical Reactions

•Integrating mined reaction data from patents

•Will also incorporate and integrate RSC Databases: Methods of Organic Synthesis, Catalysts and Catalyzed Reactions and…

Inside our Publication Archive

•How much data is in the archive, in the publications and in the supplementary info?

–How many compounds for ChemSpider?

–How many syntheses for ChemSpider reactions?

–How many characterization measurements?

•Property Data

•Spectral Data

•Graphs and charts to be used for modeling?

What if we could capture it all?

Digitally Enhancing the RSC Archive

Start with data in publications

Data Validation and Curation Required

Encouraging Participation with Rewards and RECOGNITION

Manual Curation

•Integrated commenting, curating and validation platform across ALL eScience and publishing platforms

•All integrated to a central RSC profile and feeding the AltMetrics tools

Structure Review

Future Recognition in AltMetrics?

The Future

[Diagram labels: ChemSpider; Internet Data; Commercial Software; Pre-competitive Data; Open Science; Open Data; Publishers; Educators; Open Databases; Chemical Vendors; Small organic molecules; Undefined materials; Organometallics; Nanomaterials; Polymers; Minerals; Particle bound; Links to Biologicals]

The Future of Chemistry on the Web?

•Public compound databases federate & build a linked environment of validated data!

•Data validation needs are not ignored

•Publishers layer on information to make publications discoverable

•Open Data proliferate

•The “Semantic Web” will continue to develop…

Thank you

Email: williamsa@rsc.org

Twitter: @ChemConnector

Personal Blog: www.chemconnector.com

SLIDES: www.slideshare.net/AntonyWilliams
