Multimodal AI Engine for Clinical Trials Outcome Prediction: Prospective Case Study Summer 2020

Alex Zhavoronkov¹*, Roman Kudrin¹, Elena Tutubalina¹, Anna Kuzmina¹,
Daniil Korbut¹, Artur Kadurin¹, Daniil Polykovskiy¹, and Alexander Aliper¹
¹ Insilico Medicine Hong Kong Ltd, Pak Shek Kok, New Territories, Hong Kong.
*Correspondence: alex@insilico.com
Abstract
The high failure rate of drugs in clinical trials is the main reason for the rapidly
increasing costs of drug development. Accurate prediction of clinical trial outcomes for
programs in early discovery stages may help save billions of dollars and prioritize the
programs that are more likely to benefit patients. Pharmaceutical companies usually have
substantially more information about the drugs in their clinical pipelines than is available
to external parties. However, pharmaceutical R&D is a lengthy and fragmented process
with asymmetry of information and reliance on clinical trial design, human
experience, and the competitive landscape. Since 2014, our team has been developing
clinical trial outcome prediction engines that use artificial intelligence and focus primarily
on early-stage preclinical data, performing retrospective and prospective validation
internally and with pharmaceutical companies and financial institutions. While no validation
methodology can guarantee consistent performance in the future, systems validated
using retrospective data must also be tested using prospective validation, where predictions
are made before the readouts are known.
We have built and tested a novel multimodal AI engine for the prediction of clinical trial
success on multiple types of features, including small molecule descriptors,
transcriptomic data, text-mined target and indication representations, and clinical
trial protocols. The predictor achieved 0.88 ROC AUC on predicting phase II to phase III
transition in a quasi-prospective validation setting. In this paper, we used our model to
predict transitions for all Roche phase II trials with expected readouts in summer 2020.
We hope that such validation will demonstrate the model's accuracy in a truly prospective
setting. This paper is not peer-reviewed and is not intended for making clinical trial
adjustments. It will be deposited on a preprint server so that the predictions can be
compared with the actual clinical trial readouts in the future.
Introduction
It is widely accepted that the clinical development of a drug is a costly and
time-consuming process. As early as 2012, the declining trend in drug development
productivity was recognized in a seminal paper by Scannell et al. [1]. The phenomenon
was termed “Eroom’s Law”, a reference to Moore’s Law, which describes the
exponential growth of computational power; in the case of clinical development,
unfortunately, it is the expense per drug approval that grows exponentially.
Because of this exponential cost growth, a satisfactory return on investment in
the pharmaceutical industry is difficult to sustain. Accurate prediction of clinical trial
outcomes may help optimize the pipelines of pharmaceutical companies and
guide the decisions of hedge funds and investment banks managing
investment portfolios. Since deep learning systems started
outperforming humans in multiple tasks, including image recognition, in 2014, deep
learning techniques have been rapidly propagating into biomedicine [2]. One of the common
applications of deep learning is drug purposing and repurposing [3, 4].
There are several existing tools and techniques for clinical trial scoring; we briefly
describe the most notable ones. PrOCTOR [5] considered a small dataset consisting of
successfully launched drugs and drugs that failed due to toxic side effects. PrOCTOR used
a machine learning scoring ensemble based on several simple drug descriptors and
drug target features. Lo et al. [6] analyzed a large dataset of clinical trials and built
a machine learning model predicting drug development program phase transition on the
basis of features mostly reflecting clinical trial design such as the number of endpoints,
masking (open-label, double-blind, etc.), and usage of biomarker analysis. Recently,
Feijoo et al. [7] extracted structured data from eligibility criteria using text mining
techniques to predict drug development programs’ phase transitions; this model was
also based on clinical trial design characteristics. The study by Qi et al. [8] focused on
the translation of phase II trial results to phase III trial results using a recurrent neural
network model that took phase II trial results as an input. Artemov et al. [9] developed a
pipeline with deep learning techniques and biology analytical tools to predict the
outcomes of phase I/II clinical trials. This pipeline predicts the side effects of a drug and
estimates drug-induced pathway activation. It then uses the predicted side effect
probabilities and pathway activation scores as an input to train a classifier that predicts
clinical trial outcomes.
That said, [5] and [9] used structural and target-based information about drugs, while
[6–8] relied mostly on trial design protocols or readouts from the previous phases of
clinical trials. We sought to combine the strengths of these approaches: an extensive
dataset, multimodal data sources, and biological background.
Besides building a predictive model, we analyzed its predictions using Shapley Additive
Explanations (SHAP) [10, 11] to discover clinical trials’ weakest points. Such
explanations are most important when the model predicts that a clinical trial will fail.
In our experience, experts in clinical trials give little credence to performance
measured retrospectively. Our previous attempts at prospective validation using very
early legacy predictors, which demonstrated unprecedented performance in
cross-validation [9], did not meet internal expectations and resulted in a complete
redesign of the pipeline, multiple subsequent experiments, and a new clinical trial
prediction philosophy.
We designed our engine based on feedback received from collaborations with several
large banks, hedge funds, and big pharma representatives on clinical trial scoring. Our
clinical trial outcome prediction pipeline, shown in Figure 1, is part of Insilico
Medicine’s comprehensive drug discovery engine. In this work, we describe our
approach and make prospective predictions for Roche-sponsored clinical trials with
expected readouts in summer 2020.
Figure 1. Insilico Medicine drug discovery pipeline. The clinical trials engine was used
for this study. The modules from the target discovery platform were used
for target scoring; the virtual screening modules and medicinal chemistry filters from the
chemistry platform were used for predicting the properties of small molecules.
Clinical Trials Scoring Approach
Dataset and Model
Our work builds upon a machine learning model with multiple groups of features:
molecular descriptors; ADMETox features; drug, target and indication representations
derived from biomedical documents; statistical features based on large textual datasets
(biomedical literature, patents, and government grants); various protocol features
extracted from clinical trials; drug and target omics features (Figure 2). The dataset for
training a machine learning model consists of 3,802 unique small molecules, 1,350
unique indications, and 10,922 unique drug-indication pairs in total. Using features
specified below, we built a model for predicting the drug-indication pair phase transition.
Each molecule is associated with a SMILES string and a textual representation containing
the compound name. In this work, we focused on predicting whether a given drug
development program will advance from phase I to phase II and from phase II to phase
III.
Figure 2. Data sources upon which the predictor of clinical trials’ outcome was built.
We summed up time-dependent features over the seven years preceding the year when
the previous clinical trial phase started (e.g., phase II start year for phase II to phase III
transition). Then, we concatenated these features with time-independent features and
the final feature matrix was fed to the LightGBM [12, 13] model.
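The aggregation step above can be illustrated with a minimal sketch. It assumes that "preceding" excludes the phase start year itself, and the names `sum_time_dependent`, `build_feature_row`, and the toy values are hypothetical. Training a LightGBM model on the resulting rows is not shown.

```python
from typing import Dict, List, Sequence

WINDOW_YEARS = 7  # the text sums over the seven years preceding the phase start year


def sum_time_dependent(yearly: Dict[int, List[float]], phase_start_year: int,
                       window: int = WINDOW_YEARS) -> List[float]:
    """Sum yearly feature vectors over the `window` years before `phase_start_year`.

    Assumption: the start year itself is excluded, so only earlier years contribute.
    Years missing from `yearly` are skipped.
    """
    years = range(phase_start_year - window, phase_start_year)
    present = [yearly[y] for y in years if y in yearly]
    if not present:
        return []
    # zip(*present) groups the values of each feature across years.
    return [sum(column) for column in zip(*present)]


def build_feature_row(yearly: Dict[int, List[float]], static: Sequence[float],
                      phase_start_year: int) -> List[float]:
    """Concatenate summed time-dependent features with time-independent ones."""
    return sum_time_dependent(yearly, phase_start_year) + list(static)


# Hypothetical toy record: two time-dependent features per year
# (e.g. publication and patent counts) and two time-independent descriptors.
yearly_counts = {
    2009: [100.0, 100.0],  # outside the 2010-2016 window, ignored
    2010: [1.0, 0.0],
    2013: [2.0, 1.0],
    2016: [4.0, 2.0],
    2017: [50.0, 50.0],    # the phase start year itself, ignored
}
static_descriptors = [0.5, 3.0]
row = build_feature_row(yearly_counts, static_descriptors, phase_start_year=2017)
print(row)  # [7.0, 3.0, 0.5, 3.0]
```

Stacking one such row per drug-indication pair would give the feature matrix passed to the gradient boosting model.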
Chemoinformatics and ADMETox features
To extract features from molecules, we preprocessed molecules by neutralizing them
according to internal ad-hoc rules. Then we calculated molecular descriptors using
publicly available tools (Mordred [14], RDKit [15], MCE-18 [16]) and added ADMETox
properties predicted by proprietary tools.
Temporal drug, target and indication representations
We trained a word2vec model [17] using the Gensim library [18] on PubMed abstracts
(4.5B words). Unsupervised word embeddings capture latent chemical and
pharmaceutical knowledge from large corpora of texts [19, 20]. We set the embedding
size to 250 and left the other parameters at their defaults. First, we divided the dataset
into yearly time slices from 2012 to 2018 and trained a word2vec model on each time slice.
We then applied a linear transformation to map the word2vec models onto a common space
[21].
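One common way to obtain such a linear map is orthogonal Procrustes alignment, sketched below with NumPy. This is an illustration under stated assumptions, not the procedure of [21], which is a self-learning method and considerably more involved: here the map is assumed to be orthogonal and fitted on anchor words shared by two time slices, and the embeddings are random stand-ins rather than word2vec vectors.

```python
import numpy as np


def fit_orthogonal_map(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Return the orthogonal matrix W minimising ||source @ W - target||_F.

    Rows of `source` and `target` are embeddings of the same anchor words
    taken from two different yearly models.
    """
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


rng = np.random.default_rng(0)
dim = 5  # toy size; the text uses 250-dimensional embeddings

# Stand-in for the reference slice (e.g. 2018): 20 anchor words.
reference = rng.normal(size=(20, dim))

# Stand-in for an earlier slice: the same words in a rotated coordinate system.
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
earlier = reference @ rotation.T

w = fit_orthogonal_map(earlier, reference)
aligned = earlier @ w  # earlier-slice vectors expressed in the reference space

print(np.allclose(aligned, reference))  # True
```

An orthogonal map preserves distances and angles within each slice, so nearest-neighbour relations learned in a given year are not distorted by the alignment.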
Biomedical document statistics
We utilized statistics from various biomedical textual databases using Pharmacognitive¹.
This system provides access to databases of grants, publications, patents, and clinical
trials. We focused on features from three large domains:
(i) scientific literature from PubMed,
(ii) USPTO patents,
(iii) projects from the grant-funding agencies of the USA, Canada, the EU, and
Australia.
The Pharmacognitive system allows for the retrieval of statistics such as the number of
documents or overall funding per year matching a query. As queries, we used drug names
and indications. We extended all queries with synonyms provided by Pharmacognitive
and computed the following values for each query:
the number of publications/patents/projects published in a particular year;
the number of publications/patents/projects published before that year;
the average and sum of funding for grants published in that year;
the average and sum of funding for grants published before that year.
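The four per-query values listed above can be sketched as follows. The `Document` record and the `yearly_statistics` function are hypothetical and do not reflect the Pharmacognitive API; the sketch assumes each retrieved document carries a publication year and, for grants only, a funding amount.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Document:
    """Hypothetical record for one publication, patent, or grant project."""
    year: int
    funding: Optional[float] = None  # set only for grant projects


def funding_stats(group: List[Document]) -> Tuple[float, float]:
    """Return (average, sum) of funding over documents that report an amount."""
    amounts = [d.funding for d in group if d.funding is not None]
    total = float(sum(amounts))
    mean = total / len(amounts) if amounts else 0.0
    return mean, total


def yearly_statistics(docs: List[Document], year: int) -> Dict[str, float]:
    """Compute counts and funding statistics in `year` and strictly before it."""
    in_year = [d for d in docs if d.year == year]
    before = [d for d in docs if d.year < year]
    mean_in, sum_in = funding_stats(in_year)
    mean_before, sum_before = funding_stats(before)
    return {
        "count_in_year": len(in_year),
        "count_before": len(before),
        "funding_mean_in_year": mean_in,
        "funding_sum_in_year": sum_in,
        "funding_mean_before": mean_before,
        "funding_sum_before": sum_before,
    }


# Toy grant records matching one query (values are made up).
grants = [
    Document(2014, 100.0),
    Document(2015, 300.0),
    Document(2016, 200.0),
    Document(2016, 400.0),
]
stats = yearly_statistics(grants, year=2016)
print(stats["count_in_year"], stats["funding_sum_in_year"])   # 2 600.0
print(stats["count_before"], stats["funding_mean_before"])    # 2 200.0
```

For publications and patents, which carry no funding amount in this sketch, only the two count values are meaningful.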
¹ https://pharmacognitive.com/
Clinical trial protocols
The set of protocol features included enrollment, patient age, ethnicity, number of
endpoints, masking (open-label, double-blind, etc.), and contract research organization
locations.
Target omics
We derived the target network scores based on interactomic and transcriptomic data
and pathway activation/inhibition scores from Insilico’s Pandomics platform². For
dimensionality reduction and for pathway perturbation analysis, we used the In silico
Pathway Activation Network Decomposition Analysis (iPANDA) algorithm [22], which is
commonly used for target identification and for evaluating the synergistic effect of
dual-target inhibition [23] and combinations.
Quasi-prospective validation
We utilized time-stamped data to ensure that each trial representation included only the
information available before the clinical trial started. This prevents information leakage,
which could lead to overestimation of the model performance.
Using our pipeline, we performed quasi-prospective validation by splitting all drug
development programs into a training set consisting of projects that ended by the specified
year and a validation set consisting of projects that ended after the specified year.
We report the performance for experiments with 2015 as the splitting year.
Each record in the training and test sets is associated with the earliest year for
each phase. For the predictor of phase II to phase III transition, we achieved a ROC AUC
of 0.88; at a prediction threshold of 0.5, the accuracy is 0.81, and the
F1-scores are 0.85 and 0.75 for the negative and positive classes, respectively, on the
test set described above. We also provide ROC AUC for our predictor on specific
therapeutic areas (Figure 3).
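The split and the reported metrics can be illustrated with a small, self-contained sketch. The program records, field names, and scores are hypothetical toy values. In practice the scores would come from a model trained on the training part only, and the numbers printed here are unrelated to the results reported above.

```python
from typing import Dict, List, Sequence, Tuple

Program = Dict[str, float]


def split_by_year(programs: List[Program], split_year: int) -> Tuple[List[Program], List[Program]]:
    """Programs that ended by `split_year` go to training, later ones to validation."""
    train = [p for p in programs if p["end_year"] <= split_year]
    valid = [p for p in programs if p["end_year"] > split_year]
    return train, valid


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_for_class(labels: Sequence[int], preds: Sequence[int], cls: int) -> float:
    """F1-score treating `cls` as the class of interest."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Toy programs: label 1 means the program advanced to the next phase.
programs = [
    {"end_year": 2013, "label": 1, "score": 0.9},
    {"end_year": 2015, "label": 0, "score": 0.2},
    {"end_year": 2016, "label": 1, "score": 0.8},
    {"end_year": 2017, "label": 0, "score": 0.3},
    {"end_year": 2018, "label": 1, "score": 0.4},
    {"end_year": 2019, "label": 0, "score": 0.6},
]
train, valid = split_by_year(programs, split_year=2015)

labels = [int(p["label"]) for p in valid]
scores = [p["score"] for p in valid]
preds = [1 if s >= 0.5 else 0 for s in scores]  # prediction threshold of 0.5

auc = roc_auc(labels, scores)
accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
f1_negative = f1_for_class(labels, preds, cls=0)
f1_positive = f1_for_class(labels, preds, cls=1)
print(len(train), len(valid))                  # 2 4
print(auc, accuracy, f1_negative, f1_positive)  # 0.75 0.5 0.5 0.5
```

Splitting on the end year rather than at random keeps every validation outcome later than anything seen in training, which is what makes the setting quasi-prospective.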
² https://pandomics.com/
Figure 3. Performance of the predictor in terms of ROC AUC for phase II to phase III
transition by therapeutic category. The vertical axis depicts the number of projects for
each category.
Prospective Validation
In this paper, we make prospective predictions for the upcoming clinical trials for one
pharmaceutical company using the described multimodal clinical trials prediction model.
The pharmaceutical company was selected using the following principles:
1. The company must have many phase I and phase II clinical trials reading out in
summer 2020 for targeted small-molecule drugs.
2. Only single-drug trials were counted; combination therapies were excluded.
3. Only small molecules with a clear target were included.
4. There should be no current engagement between the company and Insilico
Medicine.
Using these criteria, we selected Roche, which has 2 clinical trials meeting the specified
criteria. We show our model’s predictions in Table 1. We predict success for both trials.
Table 1. Predictions of phase transitions for the analyzed Roche clinical trials (asterisk
indicates the probability of success for the current trial).
| NCT ID | Conditions | Intervention | Phases | Primary Completion Date | Completion Date | Probability of reaching Phase II | Probability of reaching Phase III |
|---|---|---|---|---|---|---|---|
| NCT03669640 | Schizophrenia, Schizoaffective Disorder | RO6889450 | Phase II | July 2, 2020 | July 2, 2020 | 0.99 | 0.97* |
| NCT03385668 | Pulmonary Fibrosis | Pirfenidone | Phase II | September 2020 | February 2021 | 0.98 | 0.96* |
Conclusion
Using our proprietary engine, we computed the probabilities of success for 2 current
clinical trials by Roche expected to read out in summer 2020. With a threshold of 0.5,
we expect both of them to meet their primary endpoints. We recognize that the
COVID-19 pandemic may delay the readouts but still hope that the predictions will
provide insight into our engine’s real-world performance and demonstrate its applicability.
Conflict of Interest
All of the authors of this study are affiliated with Insilico Medicine (www.insilico.com), a for-profit
company developing critical intellectual property in the field of artificial intelligence for generative
biology, chemistry, and prediction of clinical trials outcomes with the aim of complete integration
of multiple previously disconnected areas of pharmaceutical drug discovery and development.
The company is engaged in aging research to acquire a deep understanding of the longitudinal
changes in human biology and the relationship between multiple data types.
Disclaimer
This paper is not peer-reviewed. This document is to be posted on a preprint server for the
purposes of date stamping and prospective validation of the clinical trial prediction engine. The
choice of a pharmaceutical company for the case study was based on the number of clinical
studies expected to read out in summer 2020 that the AI engine may be able to predict in a
fully-automated manner. These predictions are highly speculative and should not be used to
derive conclusions about the clinical trials. No insider information from the pharmaceutical
company was used in this study. While the system was used to predict clinical trials for several
large pharmaceutical companies, Roche was selected as there is no active collaboration
between the companies and there is absolutely no chance that any internal data was used for
making these predictions. These predictions were not provided to the hedge funds and
investment banks piloting the applications of Insilico’s AI engine in advance of the publication. In
the future we are planning to date stamp several other predictions of clinical trials outcomes.
References
[1] Scannell JW, Blanckley A, Boldon H, et al. Diagnosing the decline in pharmaceutical R&D
efficiency. Nat Rev Drug Discov 2012; 11: 191–200.
[2] Mamoshina P, Vieira A, Putin E, et al. Applications of Deep Learning in Biomedicine. Mol
Pharm 2016; 13: 1445–1454.
[3] Vanhaelen Q, Mamoshina P, Aliper AM, et al. Design of efficient computational workflows
for in silico drug repurposing. Drug Discov Today 2017; 22: 210–222.
[4] Aliper A, Plis S, Artemov A, et al. Deep Learning Applications for Predicting
Pharmacological Properties of Drugs and Drug Repurposing Using Transcriptomic Data.
Mol Pharm 2016; 13: 2524–2530.
[5] Gayvert KM, Madhukar NS, Elemento O. A Data-Driven Approach to Predicting Successes
and Failures of Clinical Trials. Cell Chem Biol 2016; 23: 1294–1301.
[6] Lo AW, Siah KW, Wong CH. Machine learning with statistical imputation for predicting drug
approvals. Harvard Data Science Review.
[7] Feijoo F, Palopoli M, Bernstein J, et al. Key indicators of phase transition for clinical trials
through machine learning. Drug Discov Today 2020; 25: 414–421.
[8] Qi Y, Tang Q. Predicting Phase 3 Clinical Trial Results by Modeling Phase 2 Clinical Trial
Subject Level Data Using Deep Learning. In: Doshi-Velez F, Fackler J, Jung K, et al. (eds)
Proceedings of the 4th Machine Learning for Healthcare Conference. Ann Arbor, Michigan:
PMLR, 2019, pp. 288–303.
[9] Artemov AV, Putin E, Vanhaelen Q, et al. Integrated deep learned transcriptomic and
structure-based predictor of clinical trials outcomes. BioRxiv,
https://www.biorxiv.org/content/10.1101/095653v2.abstract (2016).
[10] Lundberg SM, Erion G, Chen H, et al. From local explanations to global understanding with
explainable AI for trees. Nature Machine Intelligence 2020; 2: 56–67.
[11] Lundberg SM, Lee S-I. A Unified Approach to Interpreting Model Predictions. In: Guyon I,
Luxburg UV, Bengio S, et al. (eds) Advances in Neural Information Processing Systems 30.
Curran Associates, Inc., 2017, pp. 4765–4774.
[12] Ke G, Meng Q, Finley T, et al. LightGBM: A Highly Efficient Gradient Boosting Decision
Tree. In: Guyon I, Luxburg UV, Bengio S, et al. (eds) Advances in Neural Information
Processing Systems 30. Curran Associates, Inc., 2017, pp. 3146–3154.
[13] Ke G, Meng Q, Finley T, et al. Welcome to LightGBM’s documentation! - LightGBM 2.3.2
documentation. LightGBM.
[14] Moriwaki H, Tian Y-S, Kawashita N, et al. Mordred: a molecular descriptor calculator. J
Cheminform 2018; 10: 4.
[15] Landrum G, Others. RDKit: Open-source cheminformatics, http://www.rdkit.org (2006).
[16] Ivanenkov YA, Zagribelnyy BA, Aladinskiy VA. Are We Opening the Door to a New Era of
Medicinal Chemistry or Being Collapsed to a Chemical Singularity? Perspective. J Med
Chem 2019; 62: 10026–10043.
[17] Mikolov T, Sutskever I, Chen K, et al. Distributed Representations of Words and Phrases
and their Compositionality. In: Burges CJC, Bottou L, Welling M, et al. (eds) Advances in
Neural Information Processing Systems 26. Curran Associates, Inc., 2013, pp. 3111–3119.
[18] Rehurek R, Sojka P. Software Framework for Topic Modelling with Large Corpora. In:
Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks,
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.695.4595 (2010,
accessed 31 March 2020).
[19] Tshitoyan V, Dagdelen J, Weston L, et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 2019; 571: 95–98.
[20] Madzhidov TI, Tutubalina EV, Miftahutdinov ZS, et al. Using semantic analysis of texts for
the identification of drugs with similar therapeutic effects. Russ Chem Bull 2017; 66:
2180–2189.
[21] Artetxe M, Labaka G, Agirre E. A robust self-learning method for fully unsupervised
cross-lingual mappings of word embeddings. Proceedings of the 56th Annual Meeting of
the Association for Computational Linguistics (Volume 1: Long Papers). Epub ahead of print
2018. DOI: 10.18653/v1/p18-1073.
[22] Ozerov IV, Lezhnina KV, Izumchenko E, et al. In silico Pathway Activation Network
Decomposition Analysis (iPANDA) as a method for biomarker development. Nat Commun
2016; 7: 13427.
[23] Ravi R, Noonan KA, Pham V, et al. Bifunctional immune checkpoint-targeted
antibody-ligand traps that simultaneously disable TGFβ enhance the efficacy of cancer
immunotherapy. Nat Commun 2018; 9: 741.