
AiZynthFinder: a fast, robust and flexible open-source software for retrosynthetic planning

Abstract

We present the open-source AiZynthFinder software that can be readily used in retrosynthetic planning. The algorithm is based on a Monte Carlo tree search that recursively breaks down a molecule to purchasable precursors. The tree search is guided by an artificial neural network policy that suggests possible precursors by utilizing a library of known reaction templates. The software is fast and can typically find a solution in less than 10 s and perform a complete search in less than 1 min. Moreover, the development of the code was guided by a range of software engineering principles such as automatic testing, system design and continuous integration leading to robust software with high maintainability. Finally, the software is well documented to make it suitable for beginners. The software is available at http://www.github.com/MolecularAI/aizynthfinder.
Genhedenetal. J Cheminform (2020) 12:70
https://doi.org/10.1186/s13321-020-00472-1
SOFTWARE
AiZynthFinder: afast, robust andexible
open-source software forretrosynthetic
planning
Samuel Genheden1*, Amol Thakkar1,2, Veronika Chadimová1, Jean‑Louis Reymond2, Ola Engkvist1
and Esben Bjerrum1*
Keywords: Neural network, CASP, Retrosynthesis planning software, Monte Carlo tree‑search, Retrosynthesis
© The Author(s) 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Introduction
Synthesis planning is the process by which a chemist or a computer determines how to synthesize a specific compound. This is typically carried out by retrosynthetic analysis, where the desired compound is iteratively broken down into intermediates or smaller precursors until known or purchasable building blocks have been found. Such analysis was pioneered by Corey et al. and was traditionally carried out by hand or by using expert systems utilizing hand-encoded rules [1–3]. With the rise of deep learning in the last decade, the field of retrosynthetic software tools has undergone a swift change. Now, sophisticated and automatic algorithms have the potential to provide retrosynthetic analysis with a broader application domain and with better accuracy [4–6].
Retrosynthesis planning algorithms can be divided into template-based and template-free approaches. In template-based approaches, reaction templates or rules that describe chemical transformations are manually encoded or derived from a database of known reactions, and subsequently applied to other compounds to create plausible reaction outcomes. Segler et al. showed that it was possible to train a neural network to prioritize templates, and subsequently use this as a policy to guide a Monte Carlo tree search algorithm that suggests synthetic pathways for a given compound [7, 8]. Template-free approaches, on the other hand, do not rely on such templates but typically treat the chemical reaction as a natural language problem, where one set of words (reactants) is transformed into another set of words (products) [9–11]. Other template-free methods are based on graph approaches [12, 13].
There are several tools available for retrosynthesis planning but to our knowledge only two are fully open source, i.e. the ASKCOS suite of programs from MIT [14] and LillyMol from Eli Lilly and Company [15]. The tools Chemical AI [16] and IBM RXN [17] are free for registered users, but only the algorithm of the latter has been reported in the literature. Other tools [18–23] are closed commercial applications where the algorithm is partly undisclosed. This is partly a problem of data availability: most of the reaction databases or manually encoded rules are commercial, which limits the way free and open-source software can use them. The same applies to the database of purchasable precursors that is used as a stop criterion in several programs. However, we believe that the scientific community would benefit from an open-source implementation that provides algorithmic transparency and promotes reproducible research with sustainable software. Therefore, we present the AiZynthFinder tool that can be used for retrosynthesis planning. An early version of this tool has been used previously to determine the influence of the reaction database on retrosynthetic predictions [24], but the code base has been re-engineered to make it more flexible, robust and maintainable. We provide a trained neural network policy as well as tools to make a database of purchasable precursors so that the tool can be used directly. In addition, we provide extensive documentation to lower the learning curve for new users. We envisage that by providing this tool free and open-source, other researchers can use it for benchmarking, contribute to its continuous development and use the tool for suggesting synthetic routes for novel compounds.
Open Access
Journal of Cheminformatics
*Correspondence: samuel.genheden@astrazeneca.com; esben.bjerrum@astrazeneca.com
1 Hit Discovery, Discovery Sciences, R&D, AstraZeneca Gothenburg, Mölndal, Sweden
Full list of author information is available at the end of the article
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Implementation
The AiZynthFinder software is written in Python 3 and is distributed on GitHub under the MIT license [25]. It depends on several freely available Python packages such as TensorFlow [26], RDKit [27] and NetworkX [28].
The central algorithm of the AiZynthFinder software has been described elsewhere [8, 24] and therefore we only provide a brief outline here. The input is a molecule that will be broken down to purchasable precursors. The outcome will be a list of precursors that can be purchased or molecules that cannot be broken down by the algorithm. The software is based on a Monte Carlo tree search [29], where each node in the tree corresponds to a set of molecules that can or cannot be broken down further. At each iteration, the leaf node deemed most promising according to upper confidence bound statistics [29] is selected for further exploration. A neural network policy is then used to shortlist reaction templates and prioritize which child to create by applying a reaction template to create the new precursors. This procedure is repeated until a terminal state has been reached, i.e., a precursor that is purchasable has been found, or the tree has reached a maximum depth. At this point the score of the leaf node is backpropagated up to the root of the tree (the input molecule), and the next iteration commences. The tree search is terminated either after a fixed number of iterations or after a time limit has passed. In comparison to the algorithm proposed by Segler et al. [8], the algorithm in AiZynthFinder does not include a filter to quickly remove unfeasible reactions, nor does it utilize different policies for the expansion and rollout phases.
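The loop outlined above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the AiZynthFinder implementation: molecules are plain strings, `expand` is a hypothetical stand-in for applying reaction templates (it simply splits a string in two), `STOCK` and `MAX_DEPTH` are made-up values, the rollout is collapsed to visiting a single new child, and the exploration constant `c` is an assumed value.

```python
import math

STOCK = {"A", "B", "AB"}  # toy "purchasable" molecules (assumption)
MAX_DEPTH = 4             # maximum tree depth (assumption)


def expand(molecule):
    """Toy stand-in for template application: all ways to split a string in two."""
    return [(molecule[:i], molecule[i:]) for i in range(1, len(molecule))]


class Node:
    """A tree node holding a set of molecules."""

    def __init__(self, molecules, parent=None):
        self.molecules = tuple(molecules)
        self.parent = parent
        self.depth = 0 if parent is None else parent.depth + 1
        self.visits = 0
        self.total_score = 0.0
        self.children = []

    def is_solved(self):
        return all(mol in STOCK for mol in self.molecules)

    def is_terminal(self):
        return self.is_solved() or self.depth >= MAX_DEPTH

    def score(self):
        # Fraction of the node's molecules that are purchasable.
        return sum(mol in STOCK for mol in self.molecules) / len(self.molecules)

    def create_children(self):
        # Break down every molecule that is not yet purchasable.
        for index, mol in enumerate(self.molecules):
            if mol in STOCK:
                continue
            for precursors in expand(mol):
                new = self.molecules[:index] + precursors + self.molecules[index + 1:]
                self.children.append(Node(new, parent=self))


def ucb(child, parent_visits, c=1.4):
    """Upper confidence bound: average score plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # unvisited children are always tried first
    exploration = c * math.sqrt(2 * math.log(parent_visits) / child.visits)
    return child.total_score / child.visits + exploration


def select_leaf(root):
    """Descend from the root, following the child with the highest UCB."""
    node = root
    while node.children:
        parent_visits = node.visits
        node = max(node.children, key=lambda ch: ucb(ch, parent_visits))
    return node


def backpropagate(node, score):
    """Add the leaf score to every node on the path back to the root."""
    while node is not None:
        node.visits += 1
        node.total_score += score
        node = node.parent


def tree_search(target, iterations=50):
    root = Node([target])
    for _ in range(iterations):
        leaf = select_leaf(root)
        if not leaf.is_terminal():
            leaf.create_children()
            if leaf.children:
                leaf = leaf.children[0]
        backpropagate(leaf, leaf.score())
    return root


def solved_states(node):
    """Collect the molecule sets of all solved nodes in the tree."""
    found = [node.molecules] if node.is_solved() else []
    for child in node.children:
        found.extend(solved_states(child))
    return found
```

Calling `solved_states(tree_search("ABAB"))` returns the sets of toy precursors that are all in stock. In the real software the children are proposed by the neural network policy rather than enumerated exhaustively.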
The structure of the AiZynthFinder package is shown in Fig. 1a. The main interface to the algorithm is in the aizynthfinder.py module, which brings classes from the mcts sub-package together to perform the tree search. However, for the end-user we provide two interfaces: one command-line interface (CLI) and one graphical user interface (GUI) that is intended to be used in a Jupyter notebook. These two interfaces, which reside in the interface sub-package, are installed together with the package. The CLI comes with some additional features that are lacking from the GUI. Foremost, it allows compounds to be processed in batch, i.e. the user can submit hundreds or thousands of compounds with one command. Secondly, detailed results are stored to disk and can later be processed or viewed. For instance, one can calculate statistics on the search trees, or one can produce images of the top-ranked routes. Lastly, the CLI provides finer-grained debugging information, which could be invaluable to software developers. The sub-package training contains tools to train the policy neural network, and the sub-package tools contains other useful CLIs.
The overall design follows principles from object-oriented programming such that each component is implemented as a class. The main classes for the tree search and their relationships are shown in Fig. 1b. The AiZynthFinder class loads a user configuration from file as a Configuration object, which includes the creation of a Policy and a Stock object. This configuration is used to control the tree search. The actual tree search is then carried out by the TreeSearch class that creates a Node object representing a node in the tree search that can be expanded to create new Nodes. The molecules on each Node are represented by a State object that holds a list of TreeMolecule objects. A Reaction class encapsulates a chemical reaction on TreeMolecule objects and is used to apply the reaction templates to create new precursors.
The Policy class encapsulates a recommendation engine based on a trained neural network. Given a molecule object, it will return a sorted list of reaction templates and the probability of each template. The templates are sorted on the probability as given by the neural network. We have trained neural networks on several template libraries (see ref. [24] for a comparison) and provide one based on the publicly available US patent office (USPTO) data set [30] for anyone to use. We also provide tools to train the neural network, in case someone has their own or in-licensed reaction database. These tools can for instance be used with RDChiral [31] and our previously described procedure [24] for extracting templates.
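The ranking step performed by such a policy can be illustrated as follows. This is a sketch under assumptions: the function name `rank_templates`, the template labels and the raw network outputs are hypothetical, and a softmax is assumed as the way the outputs are turned into probabilities.

```python
import math


def rank_templates(logits, templates, top_n=3):
    """Return the top_n (template, probability) pairs, most probable first."""
    # Softmax: convert raw outputs to probabilities that sum to one.
    shift = max(logits)  # subtracting the maximum avoids overflow
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    probabilities = [value / total for value in exps]
    ranked = sorted(
        zip(templates, probabilities), key=lambda pair: pair[1], reverse=True
    )
    return ranked[:top_n]


# Hypothetical template labels and network outputs, for illustration only.
templates = ["amide coupling", "Suzuki coupling", "ester hydrolysis", "Boc deprotection"]
logits = [2.0, 0.5, 1.0, -1.0]
for name, probability in rank_templates(logits, templates):
    print(f"{name}: {probability:.2f}")
```

Only the shortlisted templates would then be applied to the molecule, which keeps the number of expensive template applications per expansion small.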
Fig. 1 The AiZynthFinder package. a The Python package structure, outlining top-level modules and sub-packages. b The main classes involved in the tree search and their relationships. A line ending with a solid diamond indicates an “owns” relationship, and a line ending with an arrow indicates a “uses” relationship, according to UML notation
The Stock class is an abstraction around a collection of compounds that serves as the stop condition for the tree search. This is typically a list of purchasable compounds, but could also be an abstract collection based on some rules, e.g. compounds with fewer than seven carbon atoms are considered purchasable. To support different kinds of collections, the Stock class uses one or more instances of query classes that, given a molecule object, return whether that compound is “in stock”. The package comes with two query classes, one that holds a set of InChI keys [32] in the computer memory and one that holds a connection to a Mongo database with InChI keys. We also provide examples to show how one can create a rule-based query class. For our internal usage we refer to lists of purchasable compounds from several commercial vendors; however, it is just as straightforward to create a list from open-source databases such as ZINC [33]. To simplify this process, we provide a tool to make a stock in a suitable format for the tree search from files containing SMILES strings [34].
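The idea of a stock backed by several interchangeable query classes can be sketched as below. All class names here (`Molecule`, `InMemoryQuery`, `CarbonCountQuery`, `Stock`) are hypothetical and do not reproduce the package's own classes; the carbon count is stored on the toy molecule object instead of being computed with a cheminformatics toolkit, and the InChI keys are included only as example strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Molecule:
    """Toy molecule: an InChI key plus a precomputed carbon count."""
    name: str
    inchi_key: str
    n_carbons: int


class InMemoryQuery:
    """Holds a set of InChI keys in memory."""

    def __init__(self, inchi_keys):
        self._keys = set(inchi_keys)

    def __contains__(self, mol):
        return mol.inchi_key in self._keys


class CarbonCountQuery:
    """Rule-based query: fewer than max_carbons carbon atoms counts as in stock."""

    def __init__(self, max_carbons=7):
        self.max_carbons = max_carbons

    def __contains__(self, mol):
        return mol.n_carbons < self.max_carbons


class Stock:
    """A molecule is in stock if any of the underlying queries contains it."""

    def __init__(self, *queries):
        self._queries = list(queries)

    def __contains__(self, mol):
        return any(mol in query for query in self._queries)


aspirin = Molecule("aspirin", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", 9)
ethanol = Molecule("ethanol", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N", 2)
caffeine = Molecule("caffeine", "RYYVLZVUVIJVGH-UHFFFAOYSA-N", 8)

stock = Stock(InMemoryQuery([aspirin.inchi_key]), CarbonCountQuery(max_carbons=7))
for mol in (aspirin, ethanol, caffeine):
    print(mol.name, mol in stock)
```

Because every query only needs to answer a membership test, a database-backed query could be added without changing the `Stock` class or the tree search.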
The main MCTS implementation has been extensively profiled and optimized; the bottlenecks are calls to the neural network and to RDChiral [31] for resolving reaction templates, routines that rely on optimized C or C++ code. We have not attempted to parallelize the code, as the serial execution time is sufficient for our purposes (see below). For the prediction of multiple compounds at the same time, the problem is of course embarrassingly parallel. The benchmarking numbers below were obtained using a single CPU (Intel Xeon 4.00 GHz) and a single GPU (Nvidia GeForce RTX 2080 Ti) on a Linux machine with 64 GB memory.
More than 85% of the code is covered by automatic unit
and integration tests, which we execute on each commit.
Furthermore, the code is PEP 8 compliant and autoformatted, and
code complexity is monitored automatically on each
commit. All of this contributes to the robustness and
maintainability of the code base and provides the basis
for continuous integration and deployment. Extensive
API documentation is autogenerated from docstrings
and is complemented by hand-written tutorials.
Results anddiscussion
As described in the Implementation section, there are two main interfaces to the tool. Here, we exemplify the usage of the tool with the GUI and then proceed with a comparison using the CLI. In the example below we have used the policy trained on USPTO data [24]. Furthermore, we created a stock from compounds available in the ZINC database [33]; we only downloaded tranches including fragment compounds (molecular weight up to 250 Da and log P up to 3.5) that had reactivity labeled as “standard” or “reactive”, resulting in 17,422,831 compounds.
Graphical user interface
To use the GUI (and the CLI), a configuration file needs to be created in YAML format. This configuration file must contain the path to files for the policy and instructions on how to set up the stock. The policy files are (1) the saved neural network model and (2) a list of reaction templates. Multiple stocks and policy networks can be specified in the configuration and selected in the GUI before running the algorithm. The user is also free to fine-tune the search algorithm using a set of properties. For the GUI, they serve as default values, whereas for the CLI they are used directly in the search algorithm. If not provided in the configuration file, default recommended settings are automatically applied.
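The fallback to default settings can be pictured as a simple merge of user-supplied properties over a table of defaults. The sketch below assumes the configuration has already been parsed into a Python dictionary; the key names, default values and file paths are illustrative assumptions and are not taken from the package.

```python
# Hypothetical defaults; the names and values are illustrative only.
DEFAULT_PROPERTIES = {
    "iteration_limit": 100,  # maximum number of tree-search iterations
    "time_limit": 120,       # maximum search time in seconds
    "max_depth": 6,          # maximum depth of the search tree
}


def resolve_properties(config):
    """Merge user-supplied properties over the defaults."""
    user_properties = config.get("properties", {})
    unknown = sorted(set(user_properties) - set(DEFAULT_PROPERTIES))
    if unknown:
        raise ValueError(f"Unknown properties: {', '.join(unknown)}")
    return {**DEFAULT_PROPERTIES, **user_properties}


# A configuration as it might look after parsing; paths are placeholders.
config = {
    "policy": {"uspto": ["path/to/model", "path/to/templates"]},
    "stock": {"zinc": "path/to/stock"},
    "properties": {"time_limit": 60},
}
properties = resolve_properties(config)
print(properties)
```

Rejecting unknown keys, as done here, is one possible design choice that surfaces typos in a configuration file early instead of silently ignoring them.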
The GUI is based on the Jupyter notebook infrastructure, which builds and displays the GUI with as few as two lines of Python code. Although a Jupyter notebook requires the user to enter Python code, the number of commands one must enter is minimal, so it is suitable even for non-technical researchers. A Jupyter notebook is also ideal as a working environment for researchers that want to experiment with the algorithm and the result of the tree search. Because a Jupyter notebook provides the full Python environment, one can easily customize the setup of the algorithm and fully inspect the predicted routes. Furthermore, there are projects such as Voilà [35] built around Jupyter notebooks that make it easy to create interactive webpages directly from the notebooks. This could be set up for users that primarily want to use AiZynthFinder to find suggestions for synthesis plans.
In Fig.2, we have input the SMILES string for the anti-
viral drug Amenamevir. Furthermore, the user can then
select the stock and neural policy they want to use, as
well as some options for the tree search.
When the tree search is completed, the user can view the predicted reaction routes. The GUI allows browsing through the top-ranked routes, but using Python scripting, all routes can be extracted and displayed. Figure 3 shows an example for the Amenamevir drug. First, the results show whether the route is solved or not, i.e. if all precursors are in stock, and the score of the route. The score reflects the fraction of solved precursors and the number of reactions required to synthesize the target compound. The score for a solved compound is close to 1.0, whereas the score for an unsolved compound is typically less than 0.8. However, it should be noted that the score was designed to support the tree search and is rather indiscriminate with regard to the quality of the route (i.e. whether it is a good route or not) and should be interpreted with care. Second, the results clearly display which precursors to procure in order to synthesize the target compound. Lastly, the GUI shows the predicted route with precursors in stock highlighted with a green rectangle, and the precursors that are not in stock highlighted in orange. In the example shown in Fig. 3, we see that the suggested route is very similar to the reported synthetic route for Amenamevir [36], with the difference that the anilinoacetate is available to purchase and does not need to be synthesized.
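To make the behaviour of such a score concrete, the sketch below combines the two ingredients named above: the fraction of precursors in stock and the number of reactions. The functional form, the weight of 0.95 and the midpoint of four reactions are assumptions chosen for illustration; they are not taken from the text and should not be read as the software's actual scoring function.

```python
import math


def route_score(in_stock_flags, n_reactions, weight=0.95, midpoint=4):
    """Illustrative score: mostly the solved fraction, with a small bonus for short routes."""
    fraction_solved = sum(in_stock_flags) / len(in_stock_flags)
    # Logistic term that decreases smoothly as the route gets longer.
    step_term = 1.0 / (1.0 + math.exp(n_reactions - midpoint))
    return weight * fraction_solved + (1.0 - weight) * step_term


solved = route_score([True, True, True], n_reactions=2)
unsolved = route_score([True, False], n_reactions=2)
print(f"solved route: {solved:.3f}, unsolved route: {unsolved:.3f}")
```

With these assumed parameters a fully solved route scores close to 1.0 and a route with half of its precursors missing scores well below 0.8, which mirrors the qualitative behaviour described above. Such a score says nothing about chemical feasibility, which is why it is indiscriminate with regard to route quality.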
Comparison withtheASKCOS tool
As mentioned above, several other retrosynthesis tools exist, but unfortunately very few of them are open source or well described in the literature. The software that is closest for a comparison is the Tree builder module in the ASKCOS suite of programs [14, 37]. The algorithm underlying the Tree builder module is similar to the algorithm of AiZynthFinder, although different expansion policies are used and the search tree is constructed differently. The software is written in Python and the code is available on GitHub. However, it is foremost intended for end-users and the interface is web-based. LillyMol [15], which is another open-source code, uses an exhaustive search of template space to produce one-step suggestions, i.e. not complete routes, and is thus less relevant to compare with. To make a rough baseline comparison between ASKCOS and AiZynthFinder we selected 100 random compounds from the ChEMBL database and submitted them to the Tree builder module of the public ASKCOS web server [38]. Even though this might not represent the latest version of the codebase, it is arguably the interface that most people would use. We set a max depth of 6, an expansion time of 120 s and used a fast filter; otherwise default values were applied. We used the AiZynthFinder CLI together with the ZINC stock and the USPTO policy to predict routes for the same 100 compounds. Some statistics on the source code and the route finding are collected in Table 1 and the full data is available as Additional file 1. It is important to note that these 100 compounds are not necessarily a representative part of the chemical space that might be relevant in a drug design project. Thus, the test set should be viewed as an illustration of the capacity of the software rather than a go-to benchmarking set.
AiZynthFinder and ASKCOS find routes for 55 and 62 compounds, respectively. There were 47 compounds for which both tools found a route, 15 compounds where ASKCOS found a solution and AiZynthFinder did not, and 8 compounds where AiZynthFinder found a solution and ASKCOS did not. There were 30 compounds that neither tool found a solution for. We have found that route finding capability depends on the stock that is used as the stop criterion in both tools [24]. The example stock created from a subset of the ZINC database is for instance much less extensive than some of the commercial stocks we typically use. If we include the readily available Enamine building blocks in the stock, we could find routes for an additional 10 compounds. The ASKCOS tool from the public webserver employs a commercial database consisting of 107,000 compounds with list prices less than $100/g from Sigma Aldrich and eMolecules [6]. The other factor that determines if a solution is found is the template library; here we used the USPTO policy for AiZynthFinder, whereas ASKCOS is based on the more extensive Reaxys database [39]. Using a policy based on Reaxys data we find routes for 56 compounds, although there is not a complete overlap with the USPTO results. We have previously investigated the effect of policies trained on a variety of datasets on the route finding capability of AiZynthFinder [24]; however, we cannot release these to the public due to licensing agreements.
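The overlap figures quoted above are mutually consistent, which can be verified with a few lines of arithmetic. The variable names are ours; the numbers are those reported in the text.

```python
# Counts reported for the 100 ChEMBL compounds.
both = 47          # solved by both tools
askcos_only = 15   # solved by ASKCOS only
aizynth_only = 8   # solved by AiZynthFinder only
neither = 30       # solved by neither tool

aizynth_total = both + aizynth_only
askcos_total = both + askcos_only
all_compounds = both + askcos_only + aizynth_only + neither

# Routes found by AiZynthFinder when the Enamine building blocks are added to the stock.
with_enamine = aizynth_total + 10

print(aizynth_total, askcos_total, all_compounds, with_enamine)
```

The two totals reproduce the 55 and 62 solved compounds, and the four disjoint groups add up to the full set of 100 compounds.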
Fig. 2 The input section of the AiZynthFinder GUI. A user has entered the SMILES string for the drug Amenamevir and selected the ZINC stock
Furthermore, the capability to find a route for both tools is closely related to the complexity of the synthesis. This can be seen in Fig. 4, showing the distribution of the synthetic accessibility (SA) score [40] for four sets of data. We see that for both AiZynthFinder and ASKCOS, the SA score is generally lower for compounds that the tools were able to find a solution for. Similar observations have been discussed previously in the literature [41]. It seems that ASKCOS is somewhat better at finding solutions with a mid-range SA score, but this might be due to the lack of some scaffolds in the ZINC stock. Moreover, it seems that AiZynthFinder predicts slightly shorter reaction routes, with fewer purchasable precursors, although it is unclear if the difference is significant given the rather small test set.
Looking at the timings of the software, we see that AiZynthFinder is faster than ASKCOS, both in terms of total search time and the time it takes to find the first solution. However, this difference could be partially attributed to the environment in which the test was executed, a local Linux computer in the case of AiZynthFinder and a webserver in the case of ASKCOS. Lastly, we want to point out that AiZynthFinder has a much smaller code base than ASKCOS, with less than half the number of Python statements in the core modules (the part of the code necessary to execute the tree search). The large difference in total statements of the package can be attributed to the fact that ASKCOS has a lot more features than AiZynthFinder. However, the difference in the number of core statements could be because we re-engineered the AiZynthFinder package such that it is a better designed package than the previously released code. We quantify this by calculating the average cyclomatic complexity [42], which counts the number of independent branching points, and the Halstead effort [43], which is the product of a volumetric measure and the difficulty of understanding the code. The number of lines, the code complexity and the code effort are among the metrics typically used to determine if a codebase is maintainable [44], and they indicate that the AiZynthFinder code is less complex and requires less effort to extend than ASKCOS.
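As an illustration of what an average complexity figure measures, the sketch below counts branching points per function using Python's standard `ast` module. It is a simplified approximation written for this note: it is not the tool used to produce the numbers in Table 1, it covers only a subset of branching constructs, and nested functions are counted both on their own and as part of the enclosing function.

```python
import ast

# Node types that open an additional independent path through the code.
BRANCH_NODES = (
    ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp, ast.comprehension
)


def cyclomatic_complexity(func_node):
    """One plus the number of branching points found inside a function."""
    complexity = 1
    for node in ast.walk(func_node):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra paths.
            complexity += len(node.values) - 1
    return complexity


def average_complexity(source):
    """Average complexity over all functions defined in the source text."""
    tree = ast.parse(source)
    functions = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    scores = [cyclomatic_complexity(func) for func in functions]
    return sum(scores) / len(scores) if scores else 0.0


EXAMPLE = '''
def simple(x):
    return x + 1

def branching(x):
    if x > 0 and x < 10:
        return "small"
    for i in range(x):
        if i % 2:
            return "odd"
    return "other"
'''

print(average_complexity(EXAMPLE))
```

In this example `simple` has a complexity of 1 and `branching` has a complexity of 5 (two `if` statements, one `for` loop and one `and`), so the average is 3.0.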
Fig. 3 The output section of the AiZynthFinder GUI displaying the first suggested route to synthesize Amenamevir
This is far from a comprehensive comparison and is intended to highlight the similarities and differences between the two tools. As mentioned above, it is difficult to compare the software on equal footing. Different researchers have different priorities when it comes to retrosynthesis, and it is not entirely clear how to make a good comparison. We have not discussed the quality of the predicted routes, which in our opinion is an ill-defined metric. For instance, we submitted Amenamevir to the ASKCOS webserver and did not recover the expected literature route, but that does not mean that the route suggested by ASKCOS is incorrect. The only fair way to find out is to synthesize the compounds according to the proposed route, but even then the successful application of the suggested route is conditioned on finding the optimal conditions for synthesis. As such, a comprehensive comparison of tools is out of scope for this software note.
Table 1 Statistics of AiZynthFinder and ASKCOS predictions on 100 compounds from ChEMBL
Metric                           | AiZynthFinder | ASKCOS
Number of core statements (a)    | 1095          | 2336
Number of total statements (b)   | 1495          | 9987
Average code complexity (c)      | 2.2           | 3.4
Average code effort (d)          | 22.0          | 116.8
Reaction database                | USPTO [30]    | Reaxys [39]
Stock                            | ZINC [33]     | Sigma and eMolecules [6]
Average search time (e) (s)      | 38.7          | 151.0
Average solution time (f) (s)    | 7.1           | 14.3
Number of solved routes          | 55            | 62
Average number of steps          | 2.4           | 3.3
Average number of precursors     | 2.7           | 3.2
(a) The number of Python statements in the modules that are used by the AiZynthFinder CLI and tree builder module, respectively
(b) The total number of Python statements in the aizynthfinder and makeit (ASKCOS) Python packages, respectively
(c) The average cyclomatic complexity over all functions used by the AiZynthFinder CLI or the tree builder module
(d) The average Halstead effort over all functions used by the AiZynthFinder CLI or the tree builder module
(e) The average time to complete the search over all compounds
(f) The average time to find the first solution over all compounds that were solved
Fig. 4 Distribution of the synthetic accessibility score of the 100 ChEMBL compounds, grouped by whether a synthetic route was found with AiZynthFinder or ASKCOS
Future developments
It is our aim that the AiZynthFinder software provides a framework for research and development of novel retrosynthesis algorithms. Therefore, we have designed the software to be easy to maintain and extend with new features. Currently, it contains a solid foundation, i.e., the Monte Carlo tree search algorithm that has shown promising results in finding routes for a range of compounds. We also provide interfaces that suit this core activity. However, it does not yet provide a fully integrated solution. For instance, we are working on improving the accuracy of the predicted routes by implementing a scoring framework. It is also of interest to augment the predictions with an information retrieval system for the used templates, so that chemists can e.g. look up similar reactions. Finally, we are working on improving the recommendation policy, for instance by utilizing the “ring breaker” policy [45]. All such extensions should be possible to implement easily in the current codebase because it has low complexity and Halstead effort. If the features do not depend on internal AstraZeneca infrastructure or data, and are relevant to the larger community, they will be made available when we publish new research findings. We expect minor releases with new features to happen several times a year, whereas patch releases fixing bugs and trivial code updates will be released continuously.
Conclusions
We have presented the AiZynthFinder tool for retrosynthesis planning. In our experience, it can suggest synthetic routes for most compounds in a very short time. We hope that it will be perceived as user-friendly and easy to learn, because we provide extensive documentation. Furthermore, the software is robust and flexible and lends itself to easy extension with novel features. Although it does not provide a complete and integrated solution for synthesis planning, we believe that we have provided a framework and platform where novel algorithms can be tested and integrated in the future. We hope that by releasing the software to the public, researchers interested in retrosynthesis can use it to explore synthetic route prediction and provide suggestions on how it can be improved. By providing open-source code and algorithmic transparency, we aim to promote collaboration around a sustainable reference software. We encourage users to contribute ideas or code so that the tool can be incrementally improved and thereby provide more accurate and useful predictions of reaction routes.
Supplementary information
Supplementary information accompanies this paper at https://doi.org/10.1186/s13321-020-00472-1.
Additional file 1. Complete search results for comparison between AiZynthFinder and ASKCOS.
Acknowledgements
We thank Dr. Michael E. Fortunato, Dr. Connor W. Coley and Prof. Klavs F.
Jensen for helpful comments and clarifications regarding the ASKCOS
software.
Authors’ contributions
SG managed the refactoring project, refactored and made improvements to the code, developed the testing framework, performed the tool comparisons and wrote the initial manuscript. AT worked with the reaction datasets, extracted the templates and trained and developed the policy networks. VC investigated the performance and feasibility of the synthesis predictions. J-LR was AT's academic supervisor and provided helpful feedback and guidance. OE supervised and managed the team. EB designed and coded the first version of the Monte Carlo tree-search software and supervised and managed the project in the early phases. All authors were involved in feedback, planning of the work and editing and improving the manuscript. All authors read and approved the final manuscript.
Funding
Amol Thakkar was supported financially by the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie Grant Agreement No. 676434, “Big Data in Chemistry” (“BIGCHEM,” http://bigchem.eu).
Availability and requirements
Project name: AiZynthFinder
Project home page: http://www.github.com/MolecularAI/aizynthfinder
Operating system(s): Platform independent
Programming language: Python 3
Other requirements: several open source python packages
License: MIT.
Any restrictions to use by non‑academics: none.
Data availability
The ZINC stock as well as the trained USPTO policy is available to download from Figshare: https://doi.org/10.6084/m9.figshare.12334577.v1.
Competing interests
Authors declare no competing interests.
Author details
1 Hit Discovery, Discovery Sciences, R&D, AstraZeneca Gothenburg, Mölndal,
Sweden. 2 Department of Chemistry and Biochemistry, University of Bern,
Freiestrasse 3, 3012 Bern, Switzerland.
Received: 3 July 2020 Accepted: 24 October 2020
References
1. Corey EJ, Todd Wipke W (1969) Computer-assisted design of complex organic syntheses. Science 166:178–192. https://doi.org/10.1126/science.166.3902.178
2. Pensak DA, Corey EJ (1977) LHASA—Logic and Heuristics Applied to Synthetic Analysis. In: Computer-Assisted Organic Synthesis, American Chemical Society. 61:1–32
3. Ihlenfeldt W-D, Gasteiger J (1996) Computer-assisted planning of organic syntheses: the second generation of programs. Angew Chemie Int Ed Engl 34:2613–2633. https://doi.org/10.1002/anie.199526131
4. Engkvist O, Norrby O, Selmi N et al (2018) Computational prediction of chemical reactions: current status and outlook. Drug Discov Today 23:1203–1218. https://doi.org/10.1016/j.drudis.2018.02.014
5. Coley CW, Green WH, Jensen KF (2018) Machine learning in computer-aided synthesis planning. Acc Chem Res 51:1281–1289. https://doi.org/10.1021/acs.accounts.8b00087
6. Coley CW, Thomas DA, Lummiss JAM et al (2019) A robotic platform for flow synthesis of organic compounds informed by AI planning. Science 365:eaax1566. https://doi.org/10.1126/science.aax1566
7. Segler MHS, Waller MP (2017) Neural-symbolic machine learning for retrosynthesis and reaction prediction. Chem A Eur J 23:5966–5971. https://doi.org/10.1002/chem.201605499
8. Segler MHS, Preuss M, Waller MP (2018) Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555:604–610. https://doi.org/10.1038/nature25978
9. Schwaller P, Laino T, Gaudin T et al (2019) Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent Sci 5:1572–1583. https://doi.org/10.1021/acscentsci.9b00576
10. Zheng S, Rao J, Zhang Z et al (2020) Predicting retrosynthetic reactions using self-corrected transformer neural networks. J Chem Inf Model 60:47–55. https://doi.org/10.1021/acs.jcim.9b00949
11. Tetko IV, Karpov P, Van Deursen R, Godin G (2020) Augmented transformer achieves 97% and 85% for top5 prediction of direct and classical retro-synthesis. https://arxiv.org/abs/2003.02804v1
12. Shi C, Xu M, Guo H et al (2020) A graph to graphs framework for retrosynthesis prediction. https://arxiv.org/abs/2003.12725
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 9 of 9
Genhedenetal. J Cheminform (2020) 12:70
fast, convenient online submission
thorough peer review by experienced researchers in your field
rapid publication on acceptance
support for research data, including large and complex data types
gold Open Access which fosters wider collaboration and increased citations
maximum visibility for your research: over 100M website views per year
At BMC, research is always in progress.
Learn more biomedcentral.com/submissions
Ready to submit your research
? Choose BMC and benefit from:
13. Somnath VR, Bunne C, Coley CW, et al (2020) Learning Graph Models for
Template‑Free Retrosynthesis. https ://arxiv .org/abs/2006.07038
14. Coley CW, Barzilay R, Jaakkola TS et al (2017) Prediction of organic reac‑
tion outcomes using machine learning. ACS Cent Sci 3:434–443. https ://
doi.org/10.1021/acsce ntsci .7b000 64
15. Watson IA, Wang J, Nicolaou CA (2019) A retrosynthetic analysis algo‑
rithm implementation. J Cheminform 11:1. https ://doi.org/10.1186/s1332
1‑018‑0323‑6
16. https ://Chemi cal.AI
17. https ://rxn.res.ibm.com/
18. https ://www.cas.org/produ cts/scifi nder/retro synth esis‑plann ing
19. https ://www.infoc hem.de/synth esis/ic‑synth
20. https ://molec ule.one/
21. https ://www.elsev ier.com/solut ions/reaxy s/how‑reaxy s‑works /synth esis‑
plann er
22. https ://www.sigma aldri ch.com/chemi stry/chemi cal‑synth esis/synth esis‑
softw are.html
23. https ://spaya .ai
24. Thakkar A, Kogej T, Reymond J‑L et al (2019) Datasets and their influence
on the development of computer assisted synthesis planning tools in the
pharmaceutical domain. Chem Sci. 11:154–168. https ://doi.org/10.1039/
C9SC0 4944D
25. https ://opens ource .org/licen ses/MIT
26. Abadi M, Agarwal A, Barham P, et al (2015) TensorFlow: large‑scale
machine learning on heterogeneous distributed systems
27. RDKit: Open‑source cheminformatics, http://www.rdkit .org
28. Haberg AA, Schult DA, Swart PJ (2008) Exploring network structure,
dynamics, and function using networkX. In: Proceedings of the 7th
Python in Science Conference (SciPy2008), ed. G. Varoquaux, T. Vaught
and J. Millman, Pasadena, CA USA. pp 11–15
29. Browne CB, Powley E, Whitehouse D et al (2012) A survey of Monte Carlo
tree search methods. IEEE Trans Comput Intell AI Games 4:1–43
30. Lowe D Chemical reactions from US patents, 1976–Sep 2016, https ://
figsh are.com/artic les/Chemi cal_react ions_from_US_paten ts_1976‑
Sep20 16_/51048 73. Accessed 31 Apr 2018
31. Coley CW, Green WH, Jensen KF (2019) RDChiral: an RDKit wrapper for
handling stereochemistry in retrosynthetic template extraction and
application. J Chem Inf Model 59:2529–2537. https ://doi.org/10.1021/acs.
jcim.9b002 86
32. Heller SR, McNaught A, Pletnev I et al (2015) InChI, the IUPAC interna‑
tional chemical identifier. J Cheminform 7:23. https ://doi.org/10.1186/
s1332 1‑015‑0068‑4
33. Sterling T, Irwin JJ (2015) ZINC 15 ‑ Ligand discovery for everyone. J Chem
Inf Model 55:2324–2337. https ://doi.org/10.1021/acs.jcim.5b005 59
34. Weininger D (1988) SMILES, a chemical language and information system:
1: introduction to methodology and encoding rules. J Chem Inf Comput
Sci 28:31–36. https ://doi.org/10.1021/ci000 57a00 5
35. https ://voila .readt hedoc s.io/en/stabl e/index .html
36. Flick AC, Leverett CA, Ding HX et al (2019) Synthetic approaches to the
new drugs approved during 2017. J Med Chem 62:7340–7382
37. https ://githu b.com/conno rcole y/ASKCO S
38. http://askco s.mit.edu/. Accessed 27 Apr 2020 to 29 Apr 2020
39. Reaxys©, Copyright © 2019 Elsevier Limited except certain content
provided by third parties, Reaxys is a trademark of Elsevier
40. Ertl P, Schuffenhauer A (2009) Estimation of synthetic accessibility score
of drug‑like molecules based on molecular complexity and fragment
contributions. J Cheminform 1:8. https ://doi.org/10.1186/1758‑2946‑1‑8
41. Gao W, Coley CW (2020) The synthesizability of molecules proposed
by generative models. J Chem Inf Model. https ://doi.org/10.1021/acs.
jcim.0c001 74
42. Mccabe TJ (1976) A complexity measure. IEEE Trans Softw Eng SE‑2:308–
320. https ://doi.org/10.1109/TSE.1976.23383 7
43. Halstead Maurice H (1977) Elements of Software Science. Elsevier North‑
Holland, Inc., Amsterdam. ISBN 0‑444‑00205‑7
44. Seref B, Tanriover O (2016) Software code maintainability: a literature
review. Int J Softw Eng Appl. https ://doi.org/10.5121/ijsea .2016.7305
45. Thakkar A, Selmi N, Reymond J‑L et al (2020) ‘Ring Breaker’: neural
network driven synthesis prediction of the ring system chemical space. J
Med Chem. https ://doi.org/10.1021/acs.jmedc hem.9b019 19