Saldívar-González et al. J Cheminform (2020) 12:64-020-00466-z
Chemoinformatics-based enumeration of chemical libraries: a tutorial
Fernanda I. Saldívar-González1*, C. Sebastian Huerta-García2 and José L. Medina-Franco1
Virtual compound libraries are increasingly being used in computer‑assisted drug discovery applications and have led
to numerous successful cases. This paper aims to examine the fundamental concepts of library design and describe
how to enumerate virtual libraries using open source tools. To exemplify the enumeration of chemical libraries, we
emphasize the use of pre‑validated or reported reactions and accessible chemical reagents. This tutorial shows a
step-by-step procedure for anyone interested in designing and building chemical libraries with or without
chemoinformatics experience. The aim is to explore various methodologies proposed by synthetic organic chemists and
explore affordable chemical space using open‑access chemoinformatics tools. As part of the tutorial, we discuss three
examples of design: a Diversity‑Oriented‑Synthesis library based on lactams, a bis‑heterocyclic combinatorial library,
and a set of target‑oriented molecules: isoindolinone based compounds as potential acetylcholinesterase inhibitors.
This manuscript also seeks to contribute to the critical task of teaching and learning chemoinformatics.
Keywords: Chemical enumeration, Chemoinformatics, Combinatorial libraries, DOS synthesis, Drug design,
Education, KNIME, Python
© The Author(s) 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and
the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material
in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material
is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Introduction
Hit identification is the starting point and one of the most crucial stages of small-molecule drug discovery [1]. One approach to increase the likelihood of finding new hit compounds is the computational generation of virtual chemical libraries for use in various virtual screening methods. Thus, many researchers are developing new de novo and “make-on-demand” chemical libraries by different in silico approaches [2]. For example, GDB17, generated by Reymond et al., is a chemical library that explores chemical space broadly by enumerating more than 160 billion organic small molecules with up to 17 atoms [3]. Another example is the virtual library CHIPMUNK (CHemically feasible In silico Public Molecular UNiverse Knowledge base), whose 95 million compounds were enumerated by performing a selected set of reactions widely used in traditional combinatorial chemistry [4]. Other examples of virtual libraries based on pre-validated or reported reactions and accessible chemical reagents, developed by pharmaceutical companies, are BI-Claim developed by Boehringer Ingelheim [5], Eli Lilly’s Proximal Collection [6], the Pfizer global virtual library (PGVL) [7], and Merck’s Accessible inventory (MASSIV) [8]. This approach has also been used by chemical vendors to generate “make-on-demand” virtual libraries such as the “Readily Accessible” (REAL) Database and REAL Space, the largest synthetic accessibility-based virtual compound collections to date [9].
In general, virtual libraries address the need to improve compound quality so that lead compounds can be identified efficiently [10]. In this context, the size, structural complexity, and diversity of the virtual libraries play a key role in increasing the chance of a successful drug discovery and development outcome [11]. Another critical aspect of generating virtual libraries is that the compounds obtained must have some novelty and, most importantly, must be synthetically feasible. This strategy is particularly attractive for building libraries for difficult and emerging molecular targets [12].
1 DIFACQUIM Research Group, School of Chemistry, Department of Pharmacy, Universidad Nacional Autónoma de México, Avenida Universidad 3000, 04510 Mexico, Mexico
Full list of author information is available at the end of the article
The construction of a virtual chemical compound can be done in a variety of ways: for example, using a known reaction schema and available reagents, based on functional groups, by de novo design, by morphing/transformation, or by decorating a molecular graph [13].
Different tools have been developed to enumerate virtual libraries; they are summarized in Table 1. Some of these tools, such as Molecular Operating Environment (MOE) [14] and Schrödinger [15], replace a predetermined central unit of a molecule. Other approaches, like Library synthesizer [16] or Nova [17], are based on combinatorial enumeration from specifications of central scaffolds with connection points and lists of R groups provided as SMILES or structure-data files (SDF). Few tools allow the user to enter a list of pre-validated reactions to generate virtual libraries, like Reactor [18], DataWarrior [19], and KNIME [20]. These tools have the advantage of being freely accessible. For Reactor, an academic license can be requested. Our research group recently developed D-Peptide Builder, a free webserver to enumerate combinatorial peptide libraries. The user can build linear or cyclic peptide libraries with N-methylated or non-methylated amino acids [21, 22].
The pre-validated reactions strategy will be useful for synthetic organic chemists who aim to explore all possible compounds obtainable through the reactions or design approaches developed within their research groups or reported in the literature. However, several experimental research groups do not have access to commercial software and/or do not have the informatics background needed to rapidly use open-source tools to enumerate chemical libraries.
This manuscript aims to present and discuss a step-by-step tutorial to enumerate chemical libraries using open-access chemoinformatics tools. As part of the tutorial, three chemical library design approaches were developed. The first uses the DOS Build/Couple/Pair approach, the second exemplifies the design of a bis-heterocyclic combinatorial library, and the third is the design of isoindolinone-based compounds as putative acetylcholinesterase (AChE) inhibitors. The design and construction of these libraries are explained step by step. This manuscript also aims to contribute to the critical task of learning chemoinformatics [27].
Table 1 Examples of chemoinformatic tools available to enumerate virtual chemical libraries
Tool | Main features | References
Free tools
RDKit | Library enumeration is based on generic reactions; for each generic reactant, a list of real reactant structures is provided | [23]
DataWarrior | Enumerated product structures are generated from a given generic reaction; for each generic reactant, a list of real reactant structures is provided | [19]
KNIME | Library enumeration is based on generic reactions, where a list of reagent structures is provided for each generic reagent | [24]
Library synthesizer | Enumerates chemical libraries from specifications of central scaffolds with connection points and lists of R groups | [16]
D-Peptide Builder | A chemoinformatic tool to enumerate combinatorial libraries of up to pentapeptides, linear or cyclic, using the natural pool of 20 amino acids. The user can use non- and/or N-methylated amino acids. The server also enables the rapid visualization of the chemical space of the newly enumerated peptides in comparison with other libraries relevant to drug discovery and preloaded in the server |
SmiLib v2.0 | Tool for rapid combinatorial library enumeration in the flexible and portable SMILES notation. Combinatorial building blocks are attached to scaffolds by means of linkers, which allows the creation of customized libraries using linkers of different sizes and chemical nature |
GLARE (Global Library Assessment of REagents) | Allows the optimization of reagent lists for the design of combinatorial libraries | [26]
Commercial tools
Reactor (ChemAxon) | Library enumeration is based on generic reactions combined with reaction rules; therefore, it is capable of generating chemically feasible products without preselection of reagents |
Molecular Operating Environment (MOE) | Scaffold Replacement: new chemical compounds are generated by replacing a portion of a known compound (the scaffold), while preserving the rest of the compound. QuaSAR_CombiGen: a single combinatorial product is constructed by attaching R-groups to a scaffold at marked attachment points, called ports. The entire combinatorial library is enumerated by exhaustively cycling through all combinations of R-groups at every attachment point on every scaffold |
Schrödinger | Core hopping: creates libraries by substituting one or several attachments on a core structure with fragments from reagent compounds | [15]
Nova (Optibrium) | Enumerates chemical libraries from specifications of central scaffolds with connection points and lists of R groups | [17]
Chemical data formats
Single chemical structures
As in almost every task in chemoinformatics, molecular representation is a key aspect to consider during the enumeration of chemical compounds [28]. Probably the best-known description of compounds is the two-dimensional (2D) graphical representation. There are currently many programs that help draw chemical structures and facilitate storage and interconversion between standard file formats. Some of these programs have free academic versions, such as MarvinSketch [29] and ACD/ChemSketch [30], and others are commercial, such as ChemDraw [31], Schrödinger [15], and MOE [14], to name a few [32]. Three-dimensional (3D) structures are also widely used, especially now that numerous computer programs have been developed to calculate and visualize them. These representations provide a powerful and intuitive tool for understanding many aspects of chemistry. However, they have limitations, especially for everyday chemoinformatics tasks that require storing and handling a vast number of compounds [33]. In these applications, molecular information is typically represented by linear notations [34]. Below, we describe some of the most commonly used linear notations to enumerate chemical structures: SMILES, SMARTS, InChI, and InChIKeys. Intuitive examples illustrating the general concepts of such linear notations are shown in Fig. 1.
SMILES
Linear notations are short, readable descriptions of molecular graphs. A clear example is the broadly used Simplified Molecular Input Line Entry System (SMILES), which captures a molecule’s structure in the form of an unambiguous text string using alphanumeric characters. SMILES strings allow the efficient storage and fast processing of large numbers of molecules. The SMILES notation uses the following basic rules for encoding molecules [36, 37]:
1. Atoms are represented by their atomic symbols. Hydrogen atoms saturating free valences are not represented explicitly.
2. Neighboring atoms stand next to each other, and bonds are characterized as being single (-), double (=), triple (#), or aromatic (:). Single and aromatic bonds are usually omitted.
3. Enclosures in parentheses specify branches in the molecular structure.
4. For the linear representation of cyclic structures, a bond is broken in each ring and the connecting ring atoms are followed by the same digit in the textual representation.
5. Atoms in aromatic rings are indicated by lower case letters. In some cases, there may be problems with aromaticity perception.
Although SMILES strings are unambiguous in describing chemical structures, they are not unique because multiple valid SMILES representations exist for the same molecular graph. Canonical SMILES strings are often used to ensure the uniqueness of molecules in a database. In principle, canonical SMILES strings can be used to identify duplicated compounds, but in practice, canonicalization differs between programs. For more consistent, documented, and standardized duplicate removal, the IUPAC International Chemical Identifier (InChI, InChIKey) [38] is recommended. Another aspect that must be taken into account when using SMILES is the handling of tautomers. Tautomerization can lead to alternative SMILES strings for the same ligand, and inconsistencies in SMILES interpretation can lead to inconsistencies in tautomer representation. Several programs can enumerate canonical tautomers (e.g., Accelrys, OpenEye, and Schrödinger), and this is recommended for the consistent processing of molecules.
SMARTS
SMILES Arbitrary Target Specification (SMARTS) is a language developed to specify substructural patterns used to match molecules and reactions. Substructure specification is achieved using rules that are extensions of SMILES. In particular, the atom and bond labels are extended to include logical operators and other special symbols, which allow SMARTS atoms and bonds to be more inclusive [39]. This notation is especially useful for finding molecules with a particular substructure in a database. SMARTS can also be used to filter out molecules with substructures that are associated with toxicological problems [40] or that appear as frequent hitters (promiscuous compounds) in many biochemical high-throughput screens (Pan Assay Interference Compounds, PAINS) [41]. Other applications are the separation of active from inactive compounds and the evaluation of ligand selectivity. The characterization of chemical reaction centers has been described by Rarey et al. [42] through the development of a new algorithm called SMARTSminer, which allows the automatic derivation of discriminative SMARTS patterns from sets of pre-classified molecules.
The SMARTS language provides several primitive symbols describing atomic and bond properties beyond those used in SMILES (atomic symbol, charge, and isotopic specifications). Table 2 lists the atomic and bond primitives used in SMARTS [39].
Fig. 1 SMILES, SMARTS, InChI and InChIKey concepts. Examples for the illustration of basic SMILES, SMARTS, InChI, and InChIKey syntax rules are provided. SMARTS representations were made in SMARTviewer [35]. InChI and InChIKey identifiers are displayed for caffeine and
Atom and bond primitive specifications may be combined to form expressions by using logical operators. SMARTS examples can be found on Daylight’s web site.
Because chemical pattern representations are relatively new, the number of interfaces where the user can graphically create patterns is limited. Examples of editors that handle SMARTS notation are MarvinSketch [29], JSME [44], SMARTeditor [45], and the PubChem Sketcher web editor [46, 47]. A comparison between these editors was described by Schomburg et al. [45].
InChI andInChI Keys
InChI is the International Chemical Identifier developed
under IUPAC’s auspices, the International Union of Pure
and Applied Chemistry, with principal contributions
from NIST (the U.S. National Institute of Standards and
Technology) and the InChI Trust [38]. e InChI objec-
tive is to establish a unique label for each compound and
allow an easier linking of diverse data compilations. is
notation resolves many of the chemical ambiguities not
addressed by SMILES, particularly concerning stereocent-
ers, tautomers, and other valence model problems. How-
ever, InChIs are difficult to read and interpret by humans
in most cases. InChIs comprise different layers and sub
layers of information separated by slashes (/). Each InChI
string starts with the InChI version number, followed
by the main layer. is main layer contains sublayers
for empirical formula, atom connections, and hydrogen
atoms positions. e identity of each atom and its cova-
lently bonded partners provide all of the information nec-
essary for the main layer. e main layer may be followed
by additional layers, for example, for the charge, isotopic
composition, tautomerism, and stereochemistry [35].
The InChIKey is a fixed-length (27-character) condensed digital representation of an InChI, developed to make it easy to perform web searches for chemical structures. The first block of 14 characters of an InChIKey encodes the core molecular constitution, as described by the formula, connectivity, hydrogen positions, and charge sublayers of the InChI main layer. The other structural features complementing the core data (namely exact positions of mobile hydrogens, stereochemistry, isotopic substitution, and metal ligands, whichever are applicable) are encoded by the second block of the InChIKey. The possible protonation or deprotonation of the core molecular entity (described by the protonation sublayer of the InChI main layer) is encoded in the very last InChIKey flag character. Further details of the InChIKey are available at https://www.inchi-trust.org.
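The block layout described above can be checked mechanically. The following is a minimal sketch, assuming the usual 14-10-1 hyphenated layout; the helper name `split_inchikey` and the further split of the second block into eight hash characters plus two flag characters (standard/non-standard and version) are assumptions of this example rather than statements from the text. The caffeine key is the one commonly listed for that compound.

```python
import re

# 14 characters, hyphen, 8 + 1 + 1 characters, hyphen, 1 character = 27 in total.
INCHIKEY = re.compile(r"^([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])$")


def split_inchikey(key):
    """Return the blocks of an InChIKey, or raise ValueError if the layout is wrong."""
    match = INCHIKEY.match(key)
    if match is None:
        raise ValueError(f"not a 27-character InChIKey: {key!r}")
    first, second, kind, version, protonation = match.groups()
    return {
        "connectivity_block": first,      # core molecular constitution
        "features_block": second,         # stereo, isotopes, mobile H, ...
        "standard_flag": kind,            # assumed meaning: S = standard InChI
        "version_flag": version,          # assumed meaning: A = InChI version 1
        "protonation_flag": protonation,  # last character of the key
    }


caffeine = "RYYVLZVUVIJVGH-UHFFFAOYSA-N"
print(split_inchikey(caffeine))
```

Comparing only the `connectivity_block` values of two keys is one way to group structures that share the same skeleton but differ in stereochemistry or isotopic labeling.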
Chemical reactions
Representing chemical reactions is much more complicated than representing single structures [48]. To represent a chemical reaction, it is particularly important to identify the reactants and products. To represent reactions more generically, it is also necessary to determine the reaction center, that is, the collection of atoms and bonds that are changed during the reaction [49], so that the substructural transformation can be described by specifying the reactive substructures in the reagent and the product. To this end, Daylight [50] has extended SMILES to describe reactions and SMARTS to express reaction queries, and has developed SMIRKS to describe transformations [51]. For its part, IUPAC has also been developing a non-proprietary, international identifier for reactions, "RInChI" [52]. The objective of the RInChI project is to create a unique data string to record and structure detailed information on reaction processes, using InChI software. These approaches are powerful and flexible, allowing for the inclusion of various information, including atom mapping.
Table 2 SMARTS atomic and bond primitives
SMARTS atomic primitives
*: any atom
a: aromatic
A: aliphatic
D<n>: degree, <n> explicit connections
H<n>: total-H-count, <n> attached hydrogens
h<n>: implicit-H-count, <n> implicit hydrogens
R<n>: ring membership, in <n> SSSR rings
r<n>: ring size, in smallest SSSR ring of size <n>
v<n>: valence, total bond order <n>
X<n>: connectivity, <n> total connections
x<n>: ring connectivity, <n> total ring connections
+<n>: positive charge, +<n> formal charge
-<n>: negative charge, -<n> formal charge
#n: atomic number
@: chirality
SMARTS bond primitives
-: single bond (aliphatic)
/: directional bond "up"
\: directional bond "down"
/?: directional bond "up or unspecified"
\?: directional bond "down or unspecified"
=: double bond
#: triple bond
:: aromatic bond
~: any bond (wildcard)
@: any ring bond
To understand the scope of these approaches and the importance of atom mapping, suppose we look for reactions that produce an alcohol from a carbonyl-containing group, such as an ester. If we search for reactions with a carbonyl group in the starting material and an alcohol in the product, the search may return undesired hits in which the starting material contains another carbonyl or alcohol group that does not change during the reaction (see Table 3, Reaction 1). Atom-to-atom mapping ensures that both the carbonyl and alcohol groups are at the reaction site. However, it is essential to note that atom mapping depends on the reaction mechanism, as shown in reactions 2 and 3 of Table 3.
To accurately capture a generic reaction, there are two requirements. The first is the actual set of changes in the molecule that occur during the reaction (captured as changes in atoms and bonds). The second is the indirect effects of activating and deactivating groups near the reaction site [39].
Within the Daylight system, the indirect effects on a generic reaction are most appropriately expressed with the SMARTS query language. However, SMARTS was designed for efficient querying of reaction databases and does not meet the other requirement for accurately capturing a generic reaction. SMIRKS accomplishes this by concisely expressing the list of atom and bond changes of a reaction, as well as the indirect effects of activating and deactivating groups near the reaction site. SMIRKS is a hybrid of SMILES and SMARTS and can be used to represent reaction mechanisms, resonance, and general modifications of molecular graphs [53, 54]. It is a restricted version of reaction SMARTS with a set of rules that act as constraints. A comparison between SMILES, SMARTS, and SMIRKS for representing chemical reactions is given in Table 4.
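To illustrate the Reactant>Agent>Product layout and the role of atom maps, the sketch below splits a reaction string into its three parts and checks that the same map numbers appear on both sides of a transformation. The helper names (`split_reaction`, `atom_maps`, `maps_balanced`) are hypothetical, and the check is purely textual: it does not validate chemistry and would need a toolkit such as RDKit to apply a transformation to real structures. The two example strings are the ones given in Table 4.

```python
import re

ATOM_MAP = re.compile(r":(\d+)\]")  # a map number closes a bracket atom, e.g. [C:1]


def split_reaction(reaction):
    """Split 'reactants>agents>products' into three lists of components."""
    parts = reaction.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")
    return tuple(part.split(".") if part else [] for part in parts)


def atom_maps(components):
    """Collect the sorted atom-map numbers found in a list of components."""
    return sorted(int(n) for c in components for n in ATOM_MAP.findall(c))


def maps_balanced(transformation):
    """True when reactant and product sides carry the same map numbers."""
    reactants, _agents, products = split_reaction(transformation)
    return atom_maps(reactants) == atom_maps(products)


esterification = "CC(=O)O.OCC>[H+].[Cl-].OCC>CC(=O)OCC"
amide_smirks = "[C:1]([O,Cl:5])=[O:2].[N:3][H:4]>>[N:3][C:1]=[O:2].[*:5][H:4]"

print(split_reaction(esterification))
print(maps_balanced(amide_smirks))
```

A transformation whose product side introduces a map number that is absent from the reactant side fails the check, which is a quick way to catch mapping typos before a pattern is used for enumeration.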
Chemical reaction database systems
Reaction databases store information that can help create a data-rich environment in the early stage of pharmaceutical process–product development. With this information, various improvements to the initial selection process can be made, reflected mainly in a decrease in the cost and time required. For example, such a database can be used to compare different reactions that produce the same product, analyze different ways to carry out a specific functional group transformation, and specify reaction conditions. It can also be used to evaluate the reaction path in terms of performance, cost, and sustainability.
Searching for reactions and retrieving relevant information about a chemical reaction is a complex task that involves searching for chemical structures of reagents or products (complete or partial), transformation information (reaction centers), descriptions of reactions (the type of reaction, general comments), and numerical data about the experimental reaction (yield, selectivity, reaction conditions, etc.). For this reason, efforts have been made to classify databases with respect to how reaction information can be searched. The criteria that have been established are the following [56].
Table 3 Examples of reaction queries
Table 4 Comparison between SMILES, SMARTS and SMIRKS to represent chemical reactions
Representation (SMILES)
Reactant>Agent>Product
In some cases the presence of agents can be omitted
Reactant>>Product
CC(=O)O.OCC>[H+].[Cl-].OCC>CC(=O)OCC
Representation (SMARTS)
A reaction query may be composed of optional reactant, agent, and product parts, which are separated by the ">" character
Reactant>Agent>Product
Reactant>>
>Agent>
>>Product
>>[#6][CX3](=O)[#6]
This query returns reactions in which the product contains
Representation (SMIRKS)
Reactant>>Product
[C:1]([O,Cl:5])=[O:2].[N:3][H:4]>>[N:3][C:1]=[O:2].[*:5][H:4]
[C]([O,Cl])=[O].[N][H]>>[N][C]=[O].[*][H]
The use of the SMARTS [O,Cl] allows oxygen or chlorine
Characteristics (SMILES)
The map is always the last part of the atom expression, delimited by a colon, and it is optional
If hydrogen is mapped, it is also "special" and must be shown (hydrogens are normally omitted from SMILES)
Characteristics (SMARTS)
Atom map is optional
Any valid Reaction SMILES is a valid SMARTS query
Any valid Molecule SMARTS can be a component of a
Recursive SMARTS supports only molecule expressions
Characteristics (SMIRKS)
All valid SMIRKS are valid reaction queries
Atoms can be added or deleted during a transformation
Atomic SMARTS expressions can be used for atoms directly involved in the reaction (the reaction center)
Stoichiometry is defined to be 1–1 for all atoms in the reactant and product for a transformation
Explicit hydrogens that are used on one side of a transformation must appear explicitly on the other side of the transformation and must be mapped
Bond expressions must be valid SMILES (no bond queries allowed)
Atomic expressions may be any valid atomic SMARTS expression for nodes where the bonding (connectivity and bond order) does not change
Use (SMILES)
To represent specific reactions between specific reactants yielding specific products
Use (SMARTS)
SMARTS are used for searching reactions
Use (SMIRKS)
SMIRKS are used to represent generic chemical transformations
Applications (SMILES)
Store a library of reactions of interest (these might be a record of reactions that have been carried out at a company, a set of reaction plans in an academic research group, or even a set of hypothetical reactions that might never succeed in the laboratory)
Applications (SMARTS)
Retrieve specific searches
Avoid uninteresting results
Reaction classification and categorization
Applications (SMIRKS)
Using SMIRKS to represent chemical transformations, reaction specifications can be stored in the database
Structures can be transformed and combined (reacted) to produce new structures
i) Each reaction is an individual record in the database (detailed and graphical). The reaction must be retrievable from the database as a detailed record (reagents, products, stoichiometry, etc.). It can also be extracted as a graphical representation where the reaction scheme is shown. In many databases, the reaction is represented in a graphical form.
ii) Structural information for target product as well as
iii) Reaction centers are reliably assigned and searchable. The reaction center of a reaction is the collection of atoms and bonds changed during the reaction [49].
iv) Reaction components must be searchable. Information for the components involved in the reaction, such as reagents, catalysts, solvents, etc.
v) Multistep reactions. In the case of multistep reactions, all reactions (individual and whole pathway) must be searchable.
vi) Reaction conditions. Conditions such as pH, temperature, pressure, etc. should be searchable by exact and suitable values.
vii) Reaction classification. The type of reaction (e.g., esterification) should be searchable.
viii) Post-processing of the database contents. Export of the retrieved reaction data to other tools (i.e., MS
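Several of these criteria amount to requirements on the record structure. As a rough sketch of what a searchable reaction record could look like, the example below stores one reaction per record with its class, conditions, and components, and filters on them. The `ReactionRecord` fields, the `search` helper, and all values in the sample records are invented for illustration and do not describe any of the databases discussed here.

```python
from dataclasses import dataclass, field


@dataclass
class ReactionRecord:
    """One reaction per record (criterion i), with searchable details."""
    reaction_smiles: str                          # reactants>agents>products
    reaction_class: str                           # criterion vii
    yield_percent: float                          # numerical data
    temperature_c: float                          # criterion vi
    solvents: list = field(default_factory=list)  # criterion iv


def search(records, reaction_class=None, min_yield=0.0, solvent=None):
    """Return the records that satisfy every filter that was supplied."""
    hits = []
    for rec in records:
        if reaction_class is not None and rec.reaction_class != reaction_class:
            continue
        if rec.yield_percent < min_yield:
            continue
        if solvent is not None and solvent not in rec.solvents:
            continue
        hits.append(rec)
    return hits


# Hypothetical records; yields, temperatures, and solvents are made up.
records = [
    ReactionRecord("CC(=O)O.OCC>>CC(=O)OCC", "esterification", 65.0, 78.0, ["ethanol"]),
    ReactionRecord("CC(=O)Cl.NC>>CC(=O)NC", "amidation", 90.0, 0.0, ["dichloromethane"]),
    ReactionRecord("OC(=O)c1ccccc1.OC>>COC(=O)c1ccccc1", "esterification", 40.0, 65.0, ["methanol"]),
]

for hit in search(records, reaction_class="esterification", min_yield=50.0):
    print(hit.reaction_smiles, hit.yield_percent)
```

Reaction-center and multistep searches (criteria iii and v) need structure-aware indexing and are outside the scope of this sketch.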
The main reaction databases that help organize, store, and retrieve data have been described by Papadakis et al. [55]. The CASREACT reaction database [57, 58] stands out as containing the largest number of reported reactions, approximately 123 million single-step and multi-step reactions, dating from 1840 to the present. This database can be used to find different ways to produce the same product (single-step or multi-step reactions), applications of a particular catalyst, and various ways to carry out specific functional group transformations [59]. Another reaction database is REAXYS [60], based on Elsevier’s industry-leading chemistry databases, which includes data for more than 49 million reactions dating from 1771 to the present. It covers many compounds (organic, inorganic, and organometallic) and experimental reaction details (yield, solvents, etc.). It is searchable by reactions, substances, formulas, and data such as physicochemical properties and spectra. Additionally, the REAXYS database can be used for synthesis route planning [61].
WebReactions from Open Molecules [62] is a good example of an open-access reaction database. It introduces a new concept for retrieving reactions from a large database, in which reactions are indexed by the bond changes that occur and by the effect of the surrounding groups on such bonds in aspects like rate, hindrance, or resistance to change. Unlike conventional reaction databases, which rely on reaction substructure search, WebReactions performs a customizable reaction similarity search focusing on the reaction center.
The database entries are taxonomically indexed with these successively nested subheadings: a rigorous digital generalization of the reaction class and type, the nature of substitution surrounding the reaction center, the nature of entering and/or leaving groups, and features of the reactant that remain unchanged in the reaction. For example, the synthesis of fentanyl, a potent opioid analgesic [63], and its synthetic derivatives involves a reductive amination that can be searched for in WebReactions [64]. As shown in Fig. 2a, once the reaction of interest is drawn, reaction centers are defined (red), and a minimum yield and characteristics of surrounding atoms can be set. In this case, there are seven matching reactions; three examples are shown in Fig. 2b–d, which illustrate how similar reactions can be carried out with different reducing agents and conditions. Each result provides the reactant, product, catalyst, and the reference to the original paper. A synthetic laboratory may select candidate reactions based on the highest possible yield or on which resources (such as reagents) are readily available.
Freely available and open-source tools for the computational-aided design of chemical libraries
The virtual enumeration of chemical reactions is a powerful tool in systematic compound library design. The exploration of virtual chemistry is bounded only by human imagination and the capabilities of computers. By using reactions deposited in chemical reaction databases, a large number of virtual compounds can be accessed. Therefore, careful planning of these reactions is of utmost importance because it determines the products obtained in these experiments. Until now, computer-based methods have generated compounds to address issues such as the diversity of chemical libraries [8, 65], the design of drug-like or focused libraries [66], and the making and identification of compounds for high-throughput screening strategies [67].
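The combinatorial growth behind virtual enumeration can be sketched in a few lines. The example below is an illustration under stated assumptions: the fragment lists, the amide template, and the function `enumerate_library` are hypothetical, and products are built by plain string substitution rather than by applying a reaction to reagent structures as RDKit, KNIME, or DataWarrior would. The fragments are written so that R1 attaches through its last atom and R2 through its first atom.

```python
from itertools import product

# Hypothetical building blocks given as SMILES fragments.
ACYL_GROUPS = ["C", "CC", "c1ccccc1"]
AMINE_GROUPS = ["C", "CC", "C1CC1"]
AMIDE_TEMPLATE = "{r1}C(=O)N{r2}"


def enumerate_library(template, r1_list, r2_list):
    """Return every R1 x R2 product string once, in enumeration order."""
    seen = {}
    for r1, r2 in product(r1_list, r2_list):
        # String identity only; chemical duplicates would need canonical
        # SMILES or InChIKeys from a chemoinformatics toolkit.
        seen.setdefault(template.format(r1=r1, r2=r2), (r1, r2))
    return list(seen)


library = enumerate_library(AMIDE_TEMPLATE, ACYL_GROUPS, AMINE_GROUPS)
print(len(library))
for smiles in library:
    print(smiles)
```

With three fragments at each position the library holds 3 x 3 products; the size grows multiplicatively with every reagent list, which is why reagent selection matters so much in practice.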
For the efficient design of chemical libraries, it is important
to first define the type of compounds to be obtained, then
evaluate the strategic bonds and select a synthetic strategy.
The choice of strategy will largely depend on
how easily it can be adopted by
medicinal chemists and on the additional requirements to be
covered (structural features, physicochemical properties,
and diversity). The synthesis strategy that has been
most often used to generate virtual libraries is combinatorial
chemistry; however, other approaches such as
diversity-, biology-, lead-, or fragment-oriented synthesis
can be easily implemented [68]. At this stage, it is essential
to focus on well-characterized reactions to avoid the
bottleneck in current computational approaches to drug
design: the assessment of synthetic accessibility [69].
Another pragmatic way to improve compound qual-
ity while enhancing and accelerating drug discovery
projects is to access and propose a high quality, novel,
diverse building block collection [70]. Guidelines have
been developed that provide more specific guidance to
medicinal chemists and help prioritize the synthesis of
compounds. Among these guidelines are the proposed
‘rule of 3’ (MW ≤ 300; logP −3 to 3; HBA ≤ 3; HBD ≤ 3;
tPSA ≤ 60; rotatable bonds ≤ 3) to guide fragment selection
for fragment-based lead generation [71] and the
‘rule of 2’ (MW < 200, clogP < 2, HBD ≤ 2, HBA ≤ 4) to design
novel reagents for drug discovery projects [70]. These
guidelines can help not only prioritize reagents but also
target libraries to compounds with optimal physicochem-
ical properties for drug design. Databases such as ZINC
DB [72], Asinex [73], Life Chemicals [74], and Maybridge
[75] can be used to access and download catalogs of com-
mercially available starting materials.
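As a rough illustration of how such guidelines can be turned into a reagent filter, the sketch below checks precomputed descriptors against the two rules. It assumes that the listed values are upper bounds (≤, or < where stated), that the descriptors have already been calculated with a tool of choice, and it uses hypothetical descriptor values and key names chosen only for this example.

```python
# Assumed upper bounds for the 'rule of 3' (values taken as <= limits).
RULE_OF_3_MAX = {"MW": 300, "HBA": 3, "HBD": 3, "tPSA": 60, "RotB": 3}
RULE_OF_3_LOGP = (-3.0, 3.0)  # allowed logP window


def passes_rule_of_3(desc):
    """Return True if a dict of precomputed descriptors satisfies the 'rule of 3'."""
    within_bounds = all(desc[key] <= limit for key, limit in RULE_OF_3_MAX.items())
    low, high = RULE_OF_3_LOGP
    return within_bounds and low <= desc["logP"] <= high


def passes_rule_of_2(desc):
    """Return True if a dict of precomputed descriptors satisfies the 'rule of 2'."""
    return (
        desc["MW"] < 200
        and desc["clogP"] < 2
        and desc["HBD"] <= 2
        and desc["HBA"] <= 4
    )


# Hypothetical descriptor values, for illustration only.
fragment = {"MW": 151.2, "logP": 0.9, "clogP": 0.9,
            "HBA": 2, "HBD": 2, "tPSA": 49.3, "RotB": 1}
bulky = {"MW": 412.5, "logP": 4.1, "clogP": 4.1,
         "HBA": 6, "HBD": 1, "tPSA": 88.0, "RotB": 7}

for name, desc in (("fragment", fragment), ("bulky", bulky)):
    print(name, passes_rule_of_3(desc), passes_rule_of_2(desc))
```

Keeping the thresholds in a dictionary makes it easy to adjust a single limit when a project calls for a looser or stricter reagent profile.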
In order to exemplify the points above, this section
focuses on creating libraries of chemical compounds
from public data sources, generated using different syn-
thetic strategies and various open-access tools like RDKit,
KNIME, and DataWarrior. The designed libraries are synthetically
accessible because the design approach was based on
feasible reactions and existing reagents. However, this
does not mean that the obtained compounds are easy or
cheap to synthesize. If an approach based on known reaction
schemes were not applied, it would be necessary to
evaluate the synthetic feasibility of the possible synthetic
Fig. 2 Searching the reductive amination involved in the synthesis of fentanyl in WebReactions. a Reaction input and fine-tuning. b–d Example
routes or the products’ accessibility, which we discuss
further in this manuscript.
Design ofalibrary ofbis‑heterocycles obtained
withclick chemistry using Python andtheRDKit
As medicinal chemists try to mimic the core elements of
a wide range of natural products such as nucleic acids,
amino acids, carbohydrates, vitamins, and alkaloids, het-
erocycles have become a standard structural unit in drug
discovery. ese structures allow modulating important
drug properties such as potency and selectivity through
bioisosteric replacements, lipophilicity, polarity, and
aqueous solubility [76].
Click chemistry provides a means for the rapid explo-
ration of the chemical universe enabling rapid struc-
ture–activity relationships (SAR) profiling through the
generation of analog libraries. Click chemistry is wide-
ranging, owing to strongly driven, highly selective reac-
tions of broad scope, allowing a much greater diversity
of block structures to be used [77]. The copper(I)-catalyzed
Huisgen 1,3-dipolar cycloaddition of alkynes and
azides yielding triazoles is the premier example of a
click reaction [78]. Owing to the accessibility of azides and
alkynes, highly diverse, unambiguous libraries become
available quickly.
This example is based on the synthetic approach
reported by Shafi et al. [79] to obtain bis-heterocycles by
linking 5-membered heterocyclic building blocks containing
one heteroatom (N, O, or S) or two heteroatoms (N, O, S;
at least one N) to a set of azide-containing building
blocks through the formation of a 1,4-disubstituted
1,2,3-triazole using click chemistry (Fig. 3). For this purpose,
the heterocycle must contain a nucleophilic moiety
such as a thiol, hydroxyl, or amino group that reacts
with a 3-halopropyne derivative through nucleophilic
aliphatic substitution (SN). Once the alkyne is
attached to the heterocycle, it reacts with the set of
azides to form a 1,2,3-triazole linking both fragments.
Python and the chemoinformatics toolkit RDKit [23]
are used to implement algorithms and functions in this
example. e toolkit RDKit provides the capabilities to
handle and manipulate molecular structures in Python.
A comprehensive introduction and installation instruc-
tions can be found in the online documentation from the
RDKit homepage (https ://rdkit .org/docs/index .html).
Fig. 3 A strategy used to build bis‑heterocycles
>>> import pandas as pd
>>> import rdkit as rk
>>> from rdkit import Chem
>>> from rdkit.Chem import AllChem
>>> from rdkit.Chem.rdMolDescriptors import CalcNumHeteroatoms
# Read building blocks using a Supplier, e.g. the curated "Sigma_bb.sdf" from Additional file 1
>>> supp = Chem.SDMolSupplier(…)
# Keep only the records RDKit could parse (unparsable entries are returned as None)
>>> mols = [x for x in supp if x is not None]
>>> len(mols)  # Number of building blocks
# Match a substructure with a SMARTS query
# SMARTS 5-membered heterocycles
>>> patt1 = Chem.MolFromSmarts(…)
>>> het5 = [x for x in mols if x.HasSubstructMatch(patt1)]
# SMARTS Terminal alkyne 3-bromo or chloro substituted
>>> patt2 = Chem.MolFromSmarts(…)
>>> alkynes = [x for x in mols if x.HasSubstructMatch(patt2)]
# SMARTS azide
>>> patt3 = Chem.MolFromSmarts(…)
>>> azide = [x for x in mols if x.HasSubstructMatch(patt3)]
Procedure in Python:
1. Build or identify a library of commercially available
building blocks. The building blocks used for this
example were taken from the Sigma Aldrich (Building
Blocks) catalog obtained from the ZINC DB [80],
consisting of 124,368 building blocks.
2. Identify the characteristics of building blocks for
the strategy to be followed. Minor components and
duplicate compounds were removed, and building blocks
were selected to comply with Congreve’s ‘rule
of three’ [71]. The curated database can be found in
Additional file 1: "Sigma_bb.sdf." As shown in the code
listing for this example, the
building blocks were read in Python using a supplier.
Then, compounds were filtered for the presence of
appropriate functional groups: a 5-membered heterocyclic
ring with one (N, O or S) or two heteroatoms
(N, O, S; at least one N) and a nucleophilic substituent
(–OH, –SH, –NH2); a terminal alkyne 3-bromo
or chloro substituted; and an azide.
3. Setting up coupling reactions. To generate the library
of bis-heterocycles, the reactions and their corresponding
SMIRKS were defined according to the synthetic
approach reported by Shafi et al. [79] (Table 5). These
reactions were used in the code to enumerate compounds
that were eventually exported in CSV format.
4. Results. In total, 7884 bis-heterocycles were
obtained. Examples of compounds obtained following
this strategy and using the Sigma Aldrich building
block database are shown in Table 9.
Table 5 SMIRKS of the coupling reactions
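# rxn1 and rxn2 are RDKit reaction objects built from the SMIRKS in Table 5,
# e.g. with AllChem.ReactionFromSmarts; the reactant lists must follow the
# reactant order of the corresponding SMIRKS
# Nucleophilic substitution
>>> prods1 = AllChem.EnumerateLibraryFromReaction(rxn1, [het5, alkynes])
# Keep the first product of each reaction as a unique isomeric SMILES
>>> smis = list(set([Chem.MolToSmiles(x[0], isomericSmiles=True) for x in prods1]))
# Click reaction
>>> prods2 = AllChem.EnumerateLibraryFromReaction(rxn2, [[Chem.MolFromSmiles(x) for x in smis], azide])
>>> smis2 = list(set([Chem.MolToSmiles(x[0], isomericSmiles=True) for x in prods2]))
# Export results as a .CSV file
>>> df = pd.DataFrame(smis2, columns=["SMILES"])
>>> df.to_csv('bis_heterocycles.csv', index=False)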
Design of a DOS library using KNIME and RDKit and Marvin nodes
Lactams are a class of compounds important for drug
design due to their great variety of potential therapeutic
applications, spanning cancer [81, 82],
diabetes [83], and infectious diseases [84]. Many lactam-containing
compounds are reported to act as
HIV-1 integrase inhibitors [85], opioid receptor agonists
[86, 87], as well as antitumoral [88, 89], anti-inflammatory
[90, 91], and antidepressant agents [92].
For this example, the generation of a library of lactams was
automated by applying the DOS strategy Build/Couple/
Pair [93] for medicinal chemistry applications [94].
The Build/Couple/Pair approach consists of building
different starting materials with suitable functional
groups that can be joined together through intermolecular
coupling reactions in all possible stereochemical
combinations. In the pairing step, intramolecular
coupling reactions that join the remaining functional
groups are instrumental for developing skeletal diversity
and structurally different molecular scaffolds. The
KNIME (Konstanz Information Miner) workspace [20]
was selected as a platform for generating the workflow,
where each task is represented by a node with
input and output ports. This software can be downloaded
directly from the KNIME homepage (https://www.knime.com/).
For the management and analysis of
databases, the KNIME Example Server provides access
to many explanatory workflows. The example server
is accessible via the KNIME Explorer panel within the
KNIME workbench and represents a great help when
starting a new workflow.
Fig. 4 Workflow for the design of lactams. a Read structures of building blocks; b Building blocks filter: the structures were curated, filtered
according to the ‘rule of three’, and selected for the presence of appropriate functional groups; c Coupling phase: application of the amide bond
formation reaction between carboxylic acids and primary or secondary amines; d Pairing phase: use of the reactions as described in Table 8. Finally,
the compounds were separated into macrocycles and non-macrocycles
Table 6 Functional groups that were quantified to filter building blocks
Functional group      SMARTS
Alkene                [H]\[#6]([H])=[#6]/[#6]
Alkyne                [H]C#C[#6]
Carboxylic acid       C(=O)[O;H,-]
Sulfonyl chloride     [$(S-!@[#6])](=O)(=O)(Cl)
Amine primary         [N;H2;D1;$(N-!@[#6]);!$(N-C=[O,N,S])]
Amine secondary       [N;H1;D2;$(N(-[#6])-[#6]);!$(N-[!#6;!#1]);!$(N-C=[O,N,S])]
Alcohol aromatic      [O;H1;$(O-!@c)]
Alcohol aliphatic     [O;H1;$(O-!@[C;!$(C=!@[O,N,S])])]
Aldehyde              [CH;D2;!$(C-[!#6;!#1])]=O
Halogen               [$([F,Cl,Br,I]-!@[#6]);!$([F,Cl,Br,I]-(=[D1;O,S,N]))]
Azide                 [N;H0;$(N-[#6]);D2]=[N;D2]=[N;D1]
Table 7 SMIRKS of the amide bond formation between carboxylic acids and primary or secondary amines
Table 8 Intramolecular cyclization considered for the pairing phase
Figure 4 shows the workflow designed to generate a
library of lactams following the B/C/P approach. e
development of this workflow is described in detail
1. Build or identify a library of commercially available
building blocks. We selected the commercially available
Enamine building blocks library as a first input for
this tutorial, containing 437,625 unique compounds
(version March 2019) [95]. To allow all datasets to be read,
nodes for retrieving molecules in
different formats were included, such as the SDF
file (structure data file) (A1) or CSV file (comma-separated
values) (A2). The Marvin Sketch node (A3) was
also included to draw other possible building blocks.
2. Identify the characteristics of building blocks for the
strategy to be followed. Compounds were normalized,
minor components and duplicate compounds
were removed (B1), building blocks were selected
to comply with Congreve’s ‘rule of three’ [71]
(B2), and then filtered for the presence of appropriate
functional groups (B3). The strategy used required
building blocks with more than two functional
groups: one for the coupling reaction and another for
the pairing reaction. The functional groups used in
this part and their corresponding SMARTS codes are
listed in Table 6.
3. Setting up coupling reactions. To generate a library
of lactams, only the amide bond formation between
carboxylic acids (C2) and primary (C1) or secondary
amines (C3) was considered as the coupling reaction
(C4 and C5); the SMIRKS of this reaction is shown
in Table 7. The SMILES of both secondary and tertiary
amide-containing coupling products were generated
(C6–C7).
4. Establish pairing reactions. Then, different intramolecular
cyclization reactions were applied for the
pairing phase (D1–D2). Compounds containing the
two functional groups involved in the pairing reaction
within the same building block were removed.
This step was done to ensure that the lactam-containing
ring was closed. Table 8 shows the different
intramolecular cyclizations considered for the pairing
phase and their corresponding SMIRKS.
5. Separate into macrocycles and non-macrocycles.
The lactams obtained from the DOS B/C/P workflow
were divided into macrocycles (more than 7-membered
rings) and non-macrocycles (3- to 7-membered
rings). Examples of non-macrocyclic lactams
that were produced under this approach are shown
in Table 9. Information about the number of compounds
generated and the diversity of the database was
published by Saldívar-González et al. [94].
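The final split in step 5 can be sketched in a few lines of Python. This is only an illustration, not the KNIME nodes used in the workflow: it assumes the ring sizes of each product are already known (in RDKit they could be derived from `mol.GetRingInfo().AtomRings()`), it treats the largest ring as the criterion, and the compound names and ring sizes are hypothetical.

```python
def is_macrocycle(ring_sizes, max_small_ring=7):
    """True if any ring has more members than the non-macrocycle limit (7)."""
    return any(size > max_small_ring for size in ring_sizes)


def split_by_ring_size(library):
    """Split {name: [ring sizes]} into (macrocycles, non_macrocycles) name lists."""
    macrocycles, non_macrocycles = [], []
    for name, ring_sizes in library.items():
        target = macrocycles if is_macrocycle(ring_sizes) else non_macrocycles
        target.append(name)
    return macrocycles, non_macrocycles


# Hypothetical products with the sizes of the rings they contain.
products = {
    "lactam_A": [5],        # gamma-lactam
    "lactam_B": [6, 7],     # bicyclic, largest ring has 7 members
    "lactam_C": [6, 12],    # 12-membered lactam ring
}

macro, non_macro = split_by_ring_size(products)
print("macrocycles:", macro)
print("non-macrocycles:", non_macro)
```

A stricter variant would test only the ring that contains the lactam amide bond rather than the largest ring in the molecule.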
Library ofisoindolinone based compounds
aspotential AChE inhibitors
Alzheimer’s disease (AD) is an incurable, progressive
neurodegenerative disorder with a long presymptomatic
period. It is clinically characterized by cognitive and
behavioral impairment, social and occupational dysfunction
and, ultimately, death [96]. The enhancement of cholinergic
neurotransmission by preserving acetylcholine
(ACh) levels would be an effective way to overcome AD’s
occurrence, symptoms, and progression. Accordingly,
the inhibition of acetylcholinesterase (AChE), which
is responsible for the metabolic breakdown of ACh has
been regarded as one of the most promising approaches
[97]. Although various efficient cholinesterase inhibitor
drugs such as donepezil, rivastigmine, and galanthamine
have been developed, there is still significant demand
for drug discovery leading to efficient anti-Alzheimer’s
agents [98].
Isoindolinones are an important heterocyclic scaffold
ubiquitous in natural products such as aristoyagonine,
nuevamine, lennoxamine, and chilenine [99]. Recently,
Rayatzadeh etal. [98] reported the synthesis and acetyl-
cholinesterase inhibitory activity of novel isoindolinone
derivatives, in which two of the tested compounds
showed an IC50 of 41 and 83μM, respectively. Even more,
the compounds were obtained through a convenient pro-
cedure in the absence of any catalysts or additives in an
Ugi reaction with good tolerance to diverse functional
groups and satisfactory yields between 70 and 90%. is
background information attracted our attention, so we
decided to use the approach reported to be an example of
how a library can be built with an established scaffold and
a targeted biological activity.
Data Warrior was selected as the platform for
this example. This software is a universal data
analysis and visualization program, useful for exploring
large data sets of chemical structures with alphanumerical
properties [19]. Some of its functionalities include
combinatorial library enumeration, the prediction of
molecular properties, and various methods to visualize
chemical space and activity cliffs, with the intent of supporting
chemists in making smarter decisions about structural
changes toward better property profiles.
Procedure in Data Warrior:
1. Build or identify a library of commercially available
building blocks. For this example, the primary source of
building blocks was the Synquest Building Blocks Economical
catalog retrieved from the ZINC DB [100],
consisting of 59,597 building blocks. However, derivatives
of 2-carboxybenzaldehyde were not found in
this database, so a SMARTS containing the moiety
was used to search for building blocks directly in all
ZINC DB catalogs [101]. The screenshots and steps
of how this search was performed can be found in
Additional file 1.
2. Identify the characteristics of building blocks for
the strategy to be followed. Minor components and
duplicate compounds were removed using the Bank-Cleaner
server (https://mobyle.rpbs.univ-paris-dider…#forms::Bank-Cleaner);
then building blocks were selected
to comply with Congreve’s ‘rule of three’ [71] with
the filter parameters created in the FAF-Drugs4 Filter
Editor (https://mobyle.rpbs.univ-paris-dider…#forms::Filter-Editor),
and the filter was run in the FAF-Drugs4 Filtering
Tool (https://mobyle.rpbs.univ-paris-dider…#forms::FAF-Drugs4).
The filter parameters can be found in
Additional file 1. The functional groups needed were
filtered using the Data Warrior substructure search.
The detailed procedure and the substructures defined
for filtering can be found in Additional file 1 (“Substructure
filtering in Data Warrior” section). In this case,
the three-component Ugi reaction required an isocyanide
and a primary amine, which were obtained
from the Synquest Building Blocks, and 2-carboxybenzaldehyde,
obtained from the ZINC catalog.
Additionally, to include only groups that would add
flexibility to the final compound, the isocyanide and
primary amine building blocks containing aromatic
rings were eliminated.
3. Establish the three-component reaction. Using the
Create Combinatorial Library option in the Chemistry
module of Data Warrior, the reaction was built in its
simplest form under “Generic Reaction,” drawing only
the atoms involved in the transformation and adequately
mapping each atom from the reagents to
its position in the product (Fig. 5a). An .RXN file with
the reaction already drawn in another program can
also be imported. The list of building blocks previously
created for each of the reactants in .SDF format
was imported (Fig. 5b), and the library was generated.
4. Results. The SMILES of the isoindolinones were
obtained, generating 738 different compounds.
Examples of isoindolinones that were generated
under this approach are shown in Table 9.
Table 9 Representative examples of compounds from the three libraries designed in this work
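The combinatorial step behind a three-component library can be pictured as taking every possible triple of reagents, one from each reactant list. The sketch below shows only this pairing logic and the resulting library size, not the structure generation performed by Data Warrior; the reagent identifiers and the helper name `enumerate_triples` are hypothetical.

```python
from itertools import product


def enumerate_triples(amines, isocyanides, aldehydes):
    """Return every (amine, isocyanide, aldehyde) combination as a tuple."""
    return list(product(amines, isocyanides, aldehydes))


# Hypothetical reagent identifiers, for illustration only.
amines = ["amine_1", "amine_2"]
isocyanides = ["isocyanide_1", "isocyanide_2", "isocyanide_3"]
aldehydes = ["2-carboxybenzaldehyde"]

triples = enumerate_triples(amines, isocyanides, aldehydes)

# The library size is the product of the sizes of the reagent lists.
print(len(triples))
for triple in triples[:3]:
    print(triple)
```

Because the size grows multiplicatively, filtering each reagent list before enumeration is usually cheaper than filtering the products afterwards.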
Post‑processing virtual libraries
Diversity analysis
Before performing a virtual screening or the synthesis of
a virtual compound, it is convenient to characterize the
compounds generated using different criteria. For exam-
ple, profiling the compound library with whole molecule
descriptors of pharmaceutical relevance can help to vali-
date the strategy used, represent medicinally relevant
chemical spaces [102], and filter compounds with drug-
like properties [103, 104]. Physicochemical properties
frequently used to describe chemical libraries include
molecular weight (MW), number of rotatable bonds
(RBs), hydrogen-bond acceptors (HBAs), hydrogen-bond
donors (HBDs), topological polar surface area (TPSA),
and the octanol/water partition coefficient (SlogP).
A complementary approach to characterize compound
databases is through molecular scaffolds or chemotypes
i.e., a molecule’s core structure [105]. Scaffold analysis is
broadly used to compare compound databases, to iden-
tify novel scaffolds in a compound library, to measure
diversity based on molecular scaffolds [106], to evaluate
the performance of virtual screening approaches [107],
and to analyze the SAR of sets of molecules with measured
activity [108–110]. Like physicochemical properties,
molecular scaffolds are easy to interpret and facilitate
communication with scientists working in different disciplines.
Another approach, perhaps more difficult to
interpret but widely used to characterize databases and
successfully applied to a range of chemoinformatics and
computer-assisted drug design applications,
is molecular fingerprints [111]. Fingerprints are especially
useful for similarity calculations, such as database
searching or clustering, generally measuring similarity as
the Tanimoto coefficient [112].
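For binary fingerprints, the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number of bits set in either. The minimal sketch below represents each fingerprint as the set of its "on" bit positions; the bit positions are hypothetical, and returning 0.0 for two empty fingerprints is a convention chosen here, not one prescribed by the text.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention for two empty fingerprints (an assumption)
    return len(a & b) / union


# Hypothetical on-bit positions of two fingerprints.
fp1 = {1, 2, 3, 4}
fp2 = {3, 4, 5, 6}

print(tanimoto(fp1, fp2))  # 2 shared bits out of 6 distinct bits
print(tanimoto(fp1, fp1))  # identical fingerprints
```

In practice, toolkits such as RDKit provide their own fingerprint and similarity functions, which are preferable for large data sets.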
In addition to helping in the characterization of data-
bases, these chemoinformatic approaches are useful for
determining the chemical and structural diversity of the
compounds generated. e quantitative information gen-
erated helps guide the selection of compound libraries
or individual compounds to identify novel lead candi-
dates for biological targets. In particular, diversity analy-
sis helps compare different databases and evaluate the
structural novelty of a compound collection [113]. Free
tools such as RDKit [23], Platform for Unified Molecu-
lar Analysis (PUMA) [114], or the workflows developed
in KNIME by Naveja et al. [115] can help in the task of
assessing chemical diversity. Interpreting the results of
these analyses individually is, in many cases, complicated
and can lead to biased interpretations since, as previously
mentioned, the perception and evaluation of the diversity
of a collection of compounds is, in general, relative
to the molecular representation. To decrease the dependence
of diversity on molecular representation, it has
been proposed to use a consensus approach through the
assessment of global diversity using Consensus Diversity
Plots (CDPs). A CDP is a 2D graph that represents up to
four measures of diversity in the same plot. The most
common are fingerprint-based diversity, scaffold diversity,
whole-molecule properties associated with drug-like
characteristics, and database size [116].
For the three compound libraries designed in this man-
uscript (lactams, bis-heterocycles, and isoindolinones),
their chemical space based on physicochemical properties
and shapes was analyzed and compared with a reference
library of approved drugs. The global diversity
of each database was also analyzed using a CDP.
Figure 6a illustrates an application of principal component
analysis (PCA) to generate
a visual representation of the property-based chemical
space of 24,698 lactams, 7884 bis-heterocycles, 649
isoindolinones, and a collection of 2125 drugs approved
for clinical use obtained from DrugBank [117]. PCA is a
mathematical method for dimensionality reduction that
allows us to visualize similarities and differences within
collections of compounds based on structural and physicochemical
parameters [118], making it a valuable tool
to guide the design of chemical libraries. The figure
shows that the three libraries designed in this manuscript
occupy the same property space as the main part of the
approved drugs library, indicating that the compounds
Fig. 5 a Reaction input tab in Enumeration of Combinatorial Library; b Reactants input tab in Enumeration of Combinatorial Library; c View of the
library generated
are likely to have favorable drug-like properties. Of
the three designed libraries, the DOS collection is the most
diverse, covering almost the same space as approved
drugs. In contrast, the bis-heterocycles and isoindolinones
are less diverse and focus on a region of the
space. Because of the design strategy, the property space
of the bis-heterocycles library is restricted by the
heterocycles and azides used. Since the isoindolinones library
was designed based on a common scaffold, the variations
of the molecular properties depend on the side-chain
substitutions. Thus, it is not surprising that they are
focused on a more restricted region in chemical space.
Fig. 6 Post‑processing plots. a PCA plot generated using six structural and physicochemical descriptors (MW, HBA, HBD, SlogP, TPSA and RBs). b
PMI plot. Compounds are placed in a triangle where the vertices represent rod, disc, and spherical compounds. c Consensus Diversity Plot (CDP):
(1) Approved drugs, (2) DOS, (3) Bis‑heterocycles, (4) Isoindolinones. Scaffold diversity is measured in the vertical axis using area under the curve
(AUC) and the diversity using molecular fingerprints is measured in the horizontal axis using MACCS/Tanimoto. Diversity based on physicochemical
properties is represented by the Euclidean distance of the six physicochemical properties using a continuous color scale. The relative size of the
data set is represented by the size of the data point. d ADME/Tox profile of the three databases calculated with the free server FAF‑Drugs. *Based on
Lipinski’s Rule of Five
The molecular shape is also a useful property to define
chemical spaces [119]. In the principal moments of inertia (PMI) plot in Fig. 6b, we can
see that the main space occupied by approved drugs
is between rod and disc shapes, and once again, the
three designed libraries share that space.
The bis-heterocycles and isoindolinones libraries each focus
on a specific shape. On the one hand, bis-heterocycles are
predominantly in the rod zone of the PMI plot because the
azide and heterocyclic fragments were linked through a
1,4-disubstituted 1,2,3-triazole in the middle, giving
large molecules. Furthermore, the two aromatic rings highly
restrict the flexibility of the linked fragments, forcing
the molecule into an extended position (Table 9). On
the other hand, isoindolinones are mainly in the disc zone
of the PMI plot because the scaffold ring is planar, so
the main shape variations are caused only by the substituents
in positions 1 and 2 of the ring (Table 9). Some
substituents at position 2 of isoindolinones could cause
the molecules to grow in a rod shape, explaining why a
few molecules of this library tend to expand into the rod
zone. Similarly, the planarity of bis-heterocycles explains
why a smaller number of compounds in this library extend into the disc
zone. The DOS library is centered in the shape space, similar
to approved drugs, because of its larger structural diversity.
In contrast to the other two libraries designed in this
work, compounds in the DOS library explore the sphere zone while
retaining potentially drug-like properties.
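The position of a compound in a PMI plot is commonly given by two normalized ratios of its principal moments of inertia, I1/I3 and I2/I3 with I1 ≤ I2 ≤ I3, so that the triangle vertices lie at (0, 1) for rods, (0.5, 0.5) for discs, and (1, 1) for spheres. The sketch below assumes the three moments have already been computed from a 3D conformer; the moment values are hypothetical, and assigning a compound to its nearest vertex is a simplification introduced here for illustration.

```python
# Triangle vertices of the PMI plot as (I1/I3, I2/I3) coordinates.
VERTICES = {"rod": (0.0, 1.0), "disc": (0.5, 0.5), "sphere": (1.0, 1.0)}


def normalized_pmi_ratios(i1, i2, i3):
    """Return (I1/I3, I2/I3) after sorting the moments in ascending order."""
    smallest, middle, largest = sorted((i1, i2, i3))
    if largest <= 0:
        raise ValueError("principal moments of inertia must be positive")
    return smallest / largest, middle / largest


def nearest_shape(npr1, npr2):
    """Name of the triangle vertex closest to the point (npr1, npr2)."""
    def squared_distance(name):
        x, y = VERTICES[name]
        return (npr1 - x) ** 2 + (npr2 - y) ** 2
    return min(VERTICES, key=squared_distance)


# Hypothetical principal moments (arbitrary units) for three compounds.
examples = {
    "extended": (1.0, 10.0, 10.0),
    "planar": (5.0, 5.0, 10.0),
    "globular": (10.0, 10.0, 10.0),
}

for name, moments in examples.items():
    npr1, npr2 = normalized_pmi_ratios(*moments)
    print(name, round(npr1, 2), round(npr2, 2), nearest_shape(npr1, npr2))
```

Plotting the two ratios for every compound in a library gives the triangular distribution shown in PMI plots, with the nearest-vertex label serving only as a coarse summary.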
Figure6c shows the CDP of the libraries designed in
this work. e size of the data points represents the rela-
tive size of each data set, and the color of each data point
represents the diversity of the physicochemical prop-
erties of the data set as measured by the Euclidean dis-
tance of six properties of pharmaceutical relevance (MW,
HBAs, HBDs, TPSA, SlogP, RBs). To measure the struc-
tural diversity considering the entire structures (includ-
ing not only the central scaffold but also the side chains)
(x-axis), the MACCS fingerprints were used, and then the
Tanimoto coefficient was applied [120]. Values outside
the similarity matrix’s diagonal were used to compute the
median for all the pairwise comparisons. On the other
hand, as a measure of scaffold diversity, the Area Under
the cyclic system recovery Curve (AUC, y-axis) [121] was
used. Scaffolds were generated under the Bemis-Murcko
definition [122]. e AUC value is a useful parameter
to evaluate the diversity of the scaffold’s content in each
database. AUC value ranges from 0.5 (maximum diver-
sity, when each compound in the library has a different
cyclic system) to 1.0 (minimum diversity, when a single
cyclic system encompasses all the compounds). According
to Fig. 6c, the DOS library is the most diverse of all
three designed libraries when considering all three diver-
sity criteria: high scaffold and physicochemical diversity,
and intermediate fingerprint diversity. Approved drugs
are also very diverse when considering scaffolds and fingerprints;
however, the variety in physicochemical properties
is lower. The relatively lower scaffold diversity of
bis-heterocycles and isoindolinones (with an area under
the scaffold recovery curve, AUC, close to one; Fig. 6c)
agrees with the design strategy of both libraries, which
is focused on the scaffolds. In bis-heterocycles, without
considering the heterocycle, the structural variation
associated with the azides is greater, causing
larger fingerprint-based diversity than in isoindolinones. In
isoindolinones, even if the number of different amines
and isocyanides is limited, the three-component reaction
(described in section “Library of isoindolinone based
compounds as potential AChE inhibitors”, vide supra)
offers a larger number of combinations, increasing the
physicochemical diversity.
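The two structural axes of the CDP can be sketched in plain Python. The example below assumes a precomputed, symmetric similarity matrix and a list with one scaffold label per compound; all values are hypothetical, and the recovery curve is integrated with the trapezoidal rule starting from the origin, which is one possible choice rather than the exact procedure of the cited works.

```python
from collections import Counter
from statistics import median


def median_pairwise_similarity(sim):
    """Median of the off-diagonal values of a symmetric similarity matrix."""
    n = len(sim)
    values = [sim[i][j] for i in range(n) for j in range(i + 1, n)]
    return median(values)


def csr_auc(scaffolds):
    """Area under the cyclic system recovery curve.

    x: fraction of scaffolds (most populated first),
    y: fraction of compounds recovered by those scaffolds.
    """
    counts = sorted(Counter(scaffolds).values(), reverse=True)
    n_scaffolds, n_compounds = len(counts), len(scaffolds)
    auc, recovered = 0.0, 0
    for count in counts:
        previous = recovered / n_compounds
        recovered += count
        current = recovered / n_compounds
        auc += (previous + current) / 2 / n_scaffolds  # trapezoid of width 1/n_scaffolds
    return auc


# Hypothetical 3 x 3 Tanimoto similarity matrix.
similarity = [
    [1.0, 0.2, 0.4],
    [0.2, 1.0, 0.6],
    [0.4, 0.6, 1.0],
]
print(median_pairwise_similarity(similarity))

# Hypothetical scaffold labels: a fully diverse set and a heavily skewed one.
diverse = ["s1", "s2", "s3", "s4"]
skewed = ["s1"] * 97 + ["s2", "s3", "s4"]
print(csr_auc(diverse), csr_auc(skewed))
```

With this construction, a set in which every compound has its own scaffold gives 0.5, and the value approaches 1.0 as a single scaffold dominates; a database with exactly one scaffold is a degenerate case that would need separate handling.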
However, it is vital to keep in mind that even in the
design and synthesis of focused libraries, there must be
some degree of diversity, and "redundant" compounds
(molecules that are structurally similar and have the
same activity) should be avoided. A diverse subset
of compounds should be more likely to contain com-
pounds with different activities and should also con-
tain fewer "redundant" compounds. For this reason,
the metrics used above can also be useful for navigat-
ing through the relevant chemical space to identify
subsets of compounds for synthesis, purchase, or test-
ing. Approaches to select subsets efficiently are mainly
cluster analysis, dissimilarity-based methods, cell-based
methods, and optimization techniques [123]. To
repeat this study, readers can use the file titled "Diversity
Analysis.csv" with the PUMA server (https://www.difac…/)
or the workflows reported by
Naveja et al. [115].
ADME/Tox prole
In addition to the diversity analysis described in the previous
section, in order to reduce the number of compounds
to be used in virtual screening, filters for
functional groups, physicochemical properties, PAINS,
and toxicophores can be applied using free servers like
FAF-Drugs (https://mobyle.rpbs.univ-paris-dider…),
Chembioserver 2.0 (https://chembioser…eu/index.php),
and the workflows designed in KNIME.
The compounds of the three libraries obtained in this
work were analyzed in FAF-Drugs to filter undesirable
compounds and assist hit selection before chemical
synthesis. In this server, depending on the filtering
ranges, Accepted (compounds with no structural alerts
and satisfying the physicochemical filter), Intermediate
(compounds that embed low-risk structural
alerts with a number of occurrences below the threshold), or
Rejected (compounds that include a high-risk structural
alert) files are written, together with their associated CSV
results files [127]. According to the FAF-Drugs results,
it can be seen in Fig.6d that the compounds identified
as bis-heterocycles have more drug-like physicochemi-
cal properties; however, it is the isoindolinone database
that contains the fewest structural alerts. In contrast,
the database of lactams obtained by the B/C/P DOS
strategy is the one that contains the largest amount of
PAINS and rejected molecules. The main problematic
moieties in this database are shown in Additional file 1:
Figure S1, where many fluorenylmethyloxycarbonyl
compounds are associated with promiscuity [128], and
compounds with an excess of halogens in their struc-
ture are observed.
Synthetic accessibility
The number of compounds designed in silico may still
be vast, and some of them may not be easy to synthesize
in the laboratory. Therefore, estimating the synthetic
accessibility, or applying filters related to reagent cost, could in
principle help to filter the database further or prioritize
the structures generated.
If an approach based on known reaction schemes were
not applied, it would be necessary to evaluate the synthetic
feasibility of the possible synthetic routes. The
optimal method for evaluating a given compound’s synthetic
feasibility is probably to search the chemical literature
for cases where this or similar molecules/scaffolds
have been synthesized and to check the results with experienced
organic chemists [13]. Some of the tools available
for planning synthetic routes are SciFinder [129], Reaxys
[60], Synthia [130], [131], and IBM RXN [132],
of which the last two are open access. Because this is
an area of research growing in parallel with the available
technologies, it is worth keeping an eye on developing
tools such as AutoSynRoute [133] and new evaluation
methods [134]. Unfortunately, this approach cannot be
automated to filter the input to
a large-scale virtual library, so computer-based methods
to evaluate synthetic accessibility have been developed.
Synthetic accessibility is related to the ease of synthesis
of compounds according to their synthetic complexity,
which combines starting materials information and structural
complexity [135], and is usually measured through
a score (SAscore) on a defined scale. Different tools
are available to measure the synthetic accessibility of molecules.
Some examples are SYLVIA [136], CAESA [137],
WODCA [138], an RDKit Python source [139], a scoring
function in C++ based on the MOSES software library
[140], as well as other reported methods [141].
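As a minimal sketch of how such a score could be used for post-processing, the following Python function filters a list of enumerated structures by a synthetic accessibility cutoff and an optional reagent-cost limit, and then ranks the remaining compounds. The function name `prioritize`, the cutoff values, and the example scores and costs are hypothetical; the sketch assumes a score where lower values mean easier synthesis (as in the 1 to 10 scale of [139]) and takes the scoring and costing functions as arguments rather than computing them.

```python
from typing import Callable, List, Optional, Tuple

Ranked = Tuple[str, float, float]  # (SMILES, SA score, reagent cost)


def prioritize(
    smiles_list: List[str],
    sa_score: Callable[[str], float],
    reagent_cost: Callable[[str], float],
    max_sa: float = 6.0,               # hypothetical cutoff (lower = easier)
    max_cost: Optional[float] = None,  # hypothetical limit; None disables it
) -> List[Ranked]:
    """Drop hard-to-make or expensive compounds, then sort by score and cost."""
    kept: List[Ranked] = []
    for smi in smiles_list:
        score, cost = sa_score(smi), reagent_cost(smi)
        if score > max_sa:
            continue
        if max_cost is not None and cost > max_cost:
            continue
        kept.append((smi, score, cost))
    # Easiest compounds first; ties are broken by the cheaper reagents.
    return sorted(kept, key=lambda row: (row[1], row[2]))


# Made-up scores and costs, for illustration only.
scores = {"CCO": 1.5, "c1ccccc1": 1.0, "CC(C)(C)C1CC1": 7.2}
costs = {"CCO": 10.0, "c1ccccc1": 25.0, "CC(C)(C)C1CC1": 5.0}
ranked = prioritize(list(scores), scores.__getitem__, costs.__getitem__)
```

In practice, `sa_score` could wrap any of the tools listed above, and `reagent_cost` could look up vendor catalog prices for the building blocks of each product.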
Conclusions
In recent years, the generation of virtual libraries has made
unprecedented progress thanks to the development of
different computational methods and synthetic knowledge.
Virtual libraries represent an important source
of novel structures in drug discovery applications. This
work showed how, through different open-access
computational methods, it is possible to automate design
approaches and to enumerate and explore all the compounds
obtained using pre-validated reactions and commercially
or in-house available building blocks. These
methods are becoming increasingly sophisticated and
allow restrictions on compound synthesis and filters to
prevent the creation of unwanted chemical compounds.
The importance of the post-processing step should always
be remembered, bearing in mind that the aim of generating
virtual libraries should be to produce
molecules that are more attractive to medicinal chemists,
both improving the quality of the compounds made
and making sure they are synthetically accessible. We
have shown how different previously reported tools and
available software can be used on the generated libraries
to predict critical pharmacological properties and molecular
shape, or to compare them with existing libraries.
The tutorial examples used in this manuscript show
that it is possible to generate libraries with predicted
drug-like properties using validated reactions and commercially
available building blocks. Some of the generated
compounds explore novel areas of the molecular
shape space compared to approved drugs. We are confident
that the approaches used in this manuscript will
flourish (hopefully, with the aid of this tutorial), as long
as the knowledge derived from organic synthesis continues
to be captured and exploited. We also anticipate that
more academic groups will use these strategies to design
new chemical structures.
Supplementary information
Supplementary information accompanies this paper at
https://doi.org/10.1186/s13321-020-00466-z.
Additional file 1. This document describes the substructure search in
the ZINC database; the filter parameters for Congreve’s Rule of 3 used in
the FAF‑Drugs server; the instructions for filtering substructures in
DataWarrior; and Figure S1.
Abbreviations
AChE: Acetylcholinesterase; AD: Alzheimer disease; CSV: Comma separated
value file; CDP: Consensus Diversity Plot; DOS: Diversity‑Oriented‑Synthesis;
HBAs: Hydrogen‑bond acceptors; HBDs: Hydrogen‑bond donors; InChI: IUPAC
International Chemical Identifier; InChIKey: A fixed‑length (27‑character)
condensed digital representation of an InChI; KNIME: Konstanz Information
Miner; MW: Molecular weight; RBs: Number of rotatable bonds; SA: Synthetic
accessibility; SAR: Structure–activity relationship; SDF: Structure‑data
file; SLogP: Octanol/water partition coefficient; SMARTS: SMILES Arbitrary
Target Specification; SMILES: Simplified Molecular Input Line Entry System; SMIRKS:
Language to define generic reactions, a hybrid of the SMILES and SMARTS
languages; PAINS: Pan Assay Interference Compounds; PMI: Principal Moment
of Inertia; PUMA: Platform for Unified Molecular Analysis; TOS: Target‑Oriented
Synthesis; TPSA: Topological Polar Surface Area.
Acknowledgements
F.I.S.G. thanks Dr. Andrea Trabocchi and Dr. Elena Lenci for their contributions
and comments in the design of the DOS workflow.
Authors’ contributions
FISG developed the DOS workflow, analyzed the data, and contributed to writing
the manuscript. CSHG contributed to the design of the bis‑heterocycle and
isoindolinone libraries and participated in writing the manuscript. JLMF
contributed to the study design and took part in writing the manuscript. All
authors read and approved the final manuscript.
Not applicable.
Availability of data and materials
Data and materials for the examples are available as additional materials.
For Example 1 “Bis‑heterocycles”, the curated database of building blocks can
be found as “Sigma_bb.sdf”, the Python code as “”, and the generated library
can be found as “bis‑heterocycles.csv”.
For Example 2 “DOS”, the building blocks were retrieved from the ZINC DB
catalogs as previously described, the KNIME workflow used is “Workflow_DOS.knwf”,
and the generated library can be found as “LactamsDOS.csv”.
For Example 3 “Isoindolinones”, the building blocks were retrieved from the
ZINC DB catalogs as previously described, and the input files used in DataWarrior,
in SDF format, are included as “synquestecbb.sdf” and “2‑carboxybenzaldehydes.sdf”.
The reaction file is “Ugi‑3comp.rxn”. Finally, the generated library can
be found as “Isoindolinones.sdf”.
The compounds from the three libraries generated in this work and the approved
drugs used for the diversity analysis can be found as "Diversity Analysis.
Competing interests
The authors have declared no competing interest.
Author details
1 DIFACQUIM Research Group, School of Chemistry, Department of Pharmacy,
Universidad Nacional Autónoma de México, Avenida Universidad 3000,
04510 Mexico, Mexico. 2 School of Chemistry, Department of Pharmacy,
Universidad Nacional Autónoma de México, Avenida Universidad 3000,
04510 Mexico, Mexico.
Received: 22 July 2020 Accepted: 5 October 2020
References
1. Yan XC, Sanders JM, Gao Y‑D, Tudor M, Haidle AM, Klein DJ et al (2020)
Augmenting hit identification by virtual screening techniques in small
molecule drug discovery. J Chem Inf Model. https ://
acs.jcim.0c001 13
2. Walters WP, Patrick WW (2019) Virtual chemical libraries. J Med Chem.
https :// hem.8b010 48
3. Ruddigkeit L, van Deursen R, Blum LC, Reymond J‑L (2012) Enumera‑
tion of 166 billion organic small molecules in the chemical universe
database GDB‑17. J Chem Inf Model 52:2864–2875
4. Humbeck L, Weigang S, Schäfer T, Mutzel P, Koch O (2018) CHIPMUNK:
A virtual synthesizable small‑molecule library for medicinal chemistry,
exploitable for protein‑protein interaction modulators. ChemMedChem
5. Lessel U, Wellenzohn B, Lilienthal M, Claussen H (2009) Searching frag‑
ment spaces with feature trees. J Chem Inf Model 49:270–279
6. Nicolaou CA, Watson IA, Hu H, Wang J (2016) The Proximal Lilly Col‑
lection: mapping, exploring and exploiting feasible chemical space. J
Chem Inf Model 56:1253–1266
7. Hu Q, Peng Z, Sutton SC, Na J, Kostrowicki J, Yang B et al (2012) Pfizer
Global Virtual Library (PGVL): a chemistry design tool powered by
experimentally validated parallel synthesis information. ACS Comb Sci
8. Lyu J, Wang S, Balius TE, Singh I, Levit A, Moroz YS et al (2019) Ultra‑large
library docking for discovering new chemotypes. Nature 566:224–229
9. REAL Database ‑ Enamine. https ://enami ry‑synth esis/real‑
compo unds/real‑datab ase. Accessed 4 Sept 2020.
10. Karthikeyan M, Vyas R (2014) Chemoinformatics approach for the
design and screening of focused virtual libraries. In: Karthikeyan M,
Vyas R (eds) Practical Chemoinformatics. Springer India, New Delhi, pp
11. Saldívar‑González FI, Medina‑Franco JL (2020) Chemoinformatics
approaches to assess chemical diversity and complexity of small
molecules. In: Trabocchi A, Lenci E (eds) Small Molecule Drug Discovery.
Elsevier, Florence, pp 83–102
12. Medina‑Franco JL, Martinez‑Mayorga K, Meurice N (2014) Balancing
novelty with confined chemical space in modern drug discovery.
Expert Opin Drug Discov 9:151–165
13. Pitt WR, Kroeplien B (2013) Exploring virtual scaffold spaces. In: Brown N
(ed) Methods and Principles in Medicinal Chemistry. Wiley, London, pp
14. Chemical Computing Group (CCG) | Computer‑Aided Molecular
Design. https ://www.chemc Accessed 4 Sept 2020.
15. Schrödinger. https ://www.schro dinge Accessed 4 Sept 2020.
16. Library synthesizer – Tripod Development. https://tripod.nih.gov/?p=370.
Accessed 4 Sept 2020.
17. Optibrium. https ://www.optib rop/stard rop‑nova.php.
Accessed 4 Sept 2020.
18. Reactor | ChemAxon. https ://chema cts/react or.
Accessed 4 Sept 2020.
19. Sander T, Freyss J, von Korff M, Rufener C (2015) DataWarrior: an open‑
source program for chemistry aware data visualization and analysis. J
Chem Inf Model 55:460–473
20. KNIME. https://www.knime.com/. Accessed 4 Sept 2020.
21. D‑Peptide Builder. https :// Accessed 4 Sept 2020.
22. Díaz‑Eufracio BI, Palomino‑Hernández O, Arredondo‑Sánchez A,
Medina‑Franco JL (2020) D‑Peptide Builder: a web service to enumer‑
ate, analyze, and visualize the chemical space of combinatorial peptide
libraries. Mol Inform. https :// 0035
23. Landrum G. RDKit. 2020. https://www.rdkit.org/. Accessed 4 Sept 2020.
24. Chemical Library Enumeration | KNIME.
https://www.knime.com/knime-applications/chemical-library-enumeration. Accessed 4 Sept 2020.
25. Schüller A, Hähnke V, Schneider G. SmiLib v2.0: A Java‑Based tool
for rapid combinatorial library enumeration. QSAR Comb Sci. 2007;
doi:https :// 0101.
26. GLARE. https ://glare .sourc eforg Accessed 4 Sept 2020.
27. Guha R, Willighagen E (2020) Learning cheminformatics. J Cheminfor‑
matics. https :// 1‑019‑0406‑z
28. Engel T (2003) Representation of chemical compounds. In: Gasteiger J,
Engel T (eds) Chemoinformatics. Wiley‑VCH, Weinheim, pp 15–168
29. Marvin | ChemAxon. https ://chema cts/marvi n.
Accessed 4 Sept 2020.
30. Structure drawing software for academic and personal use. https ://
www.acdla rces/freew are/chems ketch /. Accessed 4 Sept
31. ChemDraw. https ://www.perki nelme ory/chemd raw.
Accessed 4 Sept 2020.
32. Karthikeyan M, Vyas R (2014) Open‑source tools, techniques, and data
in chemoinformatics. In: Karthikeyan M, Vyas R (eds) Practical Chemoin‑
formatics. Springer India, New Delhi, pp 1–92
33. Engel T (2018) Principles of molecular representations. Chemoinformat‑
ics. https :// 27816 880.ch2
34. Misra M, Faulon J‑L (2010) Algorithms to store and retrieve two‑dimen‑
sional (2D) chemical structures. In: Faulon J‑L, Bender A (eds) Handbook
of Chemoinformatics Algorithms. Chapman and Hall/CRC, London, pp
35. Schomburg K, Ehrlich H‑C, Stierand K, Rarey M (2011) Chemical pattern
visualization in 2D – the SMARTSviewer. J Cheminformatics. https ://doi.
36. Weininger D (1988) SMILES, a chemical language and information
system. 1. Introduction to methodology and encoding rules. J Chem Inf
Comput Sci. 28:31–36
37. Weininger D, Weininger A, Weininger JL (1989) SMILES 2 Algorithm
for generation of unique SMILES notation. J Chem Inf Comput Sci.
38. Heller SR, McNaught A, Pletnev I, Stein S, Tchekhovskoi D (2015) InChI,
the IUPAC International Chemical Identifier. J Cheminformatics 30(7):23
39. Inc D. Daylight Theory: SMARTS‑A Language for describing molecular
patterns. 2018. https ://www.dayli ml/doc/theor y/theor s.html. Accessed 4 Sept 2020.
40. Sushko I, Salmina E, Potemkin VA, Poda G, Tetko IV (2012) ToxAlerts: a
Web server of structural alerts for toxic chemicals and compounds with
potential adverse reactions. J Chem Inf Model 52(8):2310–2316
41. Baell JB, Holloway GA (2010) New substructure filters for removal of pan
assay interference compounds (PAINS) from screening libraries and for
their exclusion in bioassays. J Med Chem 53:2719–2740
42. Bietz S, Schomburg KT, Hilbig M, Rarey M (2015) Discriminative chemi‑
cal patterns: automatic and interactive design. J Chem Inf Model
43. Daylight>SMARTS Examples. https ://www.dayli ml_tutor
ials/langu ages/smart s/smart s_examp les.html. Accessed 4 Sept 2020.
44. Bienfait B, Ertl P (2013) JSME: a free molecule editor in JavaScript. J
Cheminformatics 5:24
45. Ihlenfeldt WD, Bolton EE, Bryant SH (2009) The PubChem chemical
structure sketcher. J Cheminformatics 1:20
46. PubChem Sketcher. https ://pubch /index
.html. Accessed 4 Sept 2020.
47. de Sousa JMA (2017) Processing of SMILES, InChI, and Hashed Finger‑
prints. In: Varnek A (ed) Tutorials in chemoinformatics. Wiley, Chichester,
pp 75–81
48. Chen L, Nourse JG, Christie BD, Leland BA, Grier DL (2002) Over 20 years
of reaction access systems from MDL: a novel reaction substructure
search algorithm. J Chem Inf Comp Sci. https ://
49. Warr WA (2014) A short review of chemical reaction database systems,
computer‑aided synthesis design, reaction prediction and synthetic
feasibility. Mol Inform. https :// 0052
50. Daylight. https ://www.dayli Accessed 4 Sept 2020.
51. O’Donnell T. Reactions and transformations. In: Design and use of rela‑
tional databases in chemistry. Boca Raton: CRC Press; 2008. p. 99–107.
52. Grethe G, Blanke G, Kraut H, Goodman JM (2018) International Chemi‑
cal Identifier for Reactions (RInChI). J Cheminformatics 10:22
53. Inc D. Daylight Theory: SMIRKS‑A reaction transform language. 2018.
https :// ls/Dayli ghtTh eoryM anual /theor
y.smirk s.html. Accessed 4 Sept 2020.
54. Daylight>SMIRKS tutorial. https ://www.dayli ml_tutor
ials/langu ages/smirk s/index .html. Accessed 8 May 2020.
55. Papadakis E, Anantpinijwatna A, Woodley J, Gani R (2017) A reaction
database for small molecule pharmaceutical processes integrated with
process information. Processes. https :// 0058
56. Zass E (2008) Databases of chemical reactions. In: Gasteiger J (ed)
Handbook of Chemoinformatics. Wiley‑VCH, Weinheim, pp 667–699
57. Blake JE, Dana RC (1990) CASREACT: more than a million reactions. J
Chem Inf Comp Sci 30:394–399
58. Reactions ‑ CASREACT ‑ Answers to your chemical reaction questions.
https :// nt/react ions. Accessed 4 Sept 2020.
59. Blower PE, Myatt GJ, Petras MW (1997) Exploring functional group
transformations on CASREACT. J Chem Inf Comp Sci 37:54–58
60. Reaxys. https ://www.reaxy Accessed 4 Sept 2020.
61. Goodman J (2009) Computer software review: Reaxys. J Chem Inf Model 49:2897–2898
62. Open Molecules. https ://www.openm olecu actio ns/intro
.html. Accessed 4 Sept 2020.
63. Stanley TH (2005) Fentanyl. J Pain Symptom Manage 29(Suppl):S67–S71
64. Suh YG, Cho KH, Shin DY (1998) Total synthesis of fentanyl. Arch Pharm
Res 21:70–72
65. Huc I, Lehn J‑M (1997) Virtual combinatorial libraries: Dynamic genera‑
tion of molecular and supramolecular diversity by self‑assembly. P Natl
Acad Sci. https ://
66. Schneider G, Fechner U (2005) Computer‑based de novo design of
drug‑like molecules. Nat Rev Drug Discov 4(8):649–663
67. Green DVS. Virtual screening of virtual libraries. In: King FD, Oxford AW,
editors. Progress in Medicinal Chemistry. Elsevier. 2003. p. 61–97.
68. Weber L (2005) Current status of virtual combinatorial library design.
QSAR Comb Sci 24:809–823
69. Aronov AM (2002) Design of virtual combinatorial libraries. In: English
LB (ed) Combinatorial Library. Humana Press, Totowa, pp 267–276
70. Goldberg FW, Kettle JG, Kogej T, Perry MWD, Tomkinson NP (2015)
Designing novel building blocks is an overlooked strategy to improve
compound quality. Drug Discov Today 20:11–17
71. Congreve M, Carr R, Murray C, Jhoti H (2003) A “rule of three” for
fragment‑based lead discovery? Drug Discov Today.
https://doi.org/10.1016/s1359-6446(03)02831-9
72. Sterling T, Irwin JJ (2015) ZINC 15–Ligand Discovery for Everyone. J
Chem Inf Model 55:2324–2337
73. – Asinex Focused Libraries, Screening compounds, Pre‑
plated Sets. https ://www.asine Accessed 4 Sept 2020.
74. Advanced Chemical Building Blocks | Novel scaffolds | Life Chemicals.
https ://lifec hemic ing‑block s. Accessed 4 Sept 2020.
75. Maybridge. https ://www.maybr Accessed 4 Sept 2020.
76. Gomtsyan A (2012) Heterocycles in drugs and drug discovery. Chem
Heterocycl Compd. https :// 3‑012‑0960‑z
77. Kolb HC, Sharpless KB (2003) The growing impact of click chemistry on
drug discovery. Drug Discov Today 8:1128–1137
78. Rostovtsev VV, Green LG, Fokin VV (2002) A stepwise Huisgen cycloaddi‑
tion process: copper(I)‑catalyzed regioselective “ligation” of azides and
terminal alkynes. Angew Chem Int Ed 41:2596–2599
79. Shafi S, Alam MM, Mulakayala N, Mulakayala C, Vanaja G, Kalle AM et al
(2012) Synthesis of novel 2‑mercapto benzothiazole and 1,2,3‑triazole
based bis‑heterocycles: their anti‑inflammatory and anti‑nociceptive
activities. Eur J Med Chem 49:324–333
80. ZINC Sigma Aldrich (Building Blocks). https ://zinc.docki ogs/
sialb b/. Accessed: 9 Jun 2020.
81. Kuhn D, Coates C, Daniel K, Chen D, Bhuiyan M, Kazi A et al (2004) Beta‑
lactams and their potential use as novel anticancer chemotherapeutics
drugs. Front Biosci 9:2605–2617
82. Malebari AM, Fayne D, Nathwani SM, O’Connell F, Noorani S, Twamley
B et al (2020) β‑Lactams with antiproliferative and antiapoptotic activ‑
ity in breast and chemoresistant colon cancer cells. Eur J Med Chem
83. Goel RK, Mahajan MP, Kulkarni SK (2004) Evaluation of anti‑hyperglyce‑
mic activity of some novel monocyclic beta lactams. J Pharm Pharm Sci
84. Shahid M, Sobia F, Singh A, Malik A, Khan HM, Jonas D et al (2009) Beta‑
lactams and beta‑lactamase‑inhibitors in current‑ or potential‑clinical
practice: a comprehensive update. Crit Rev Microbiol 35:81–108
85. Velthuisen EJ, Johns BA, Temelkoff DP, Brown KW, Danehower SC (2016)
The design of 8‑hydroxyquinoline tetracyclic lactams as HIV‑1 integrase
strand transfer inhibitors. Eur J Med Chem 117:99–112
86. De Marco R, Bedini A, Spampinato S, Comellini L, Zhao J, Artali R et al
(2018) Constraining endomorphin‑1 by β, α‑hybrid dipeptide/heterocy‑
cle scaffolds: identification of a novel κ‑opioid receptor selective partial
agonist. J Med Chem 61:5751–5757
87. Rawls SM, Robinson W, Patel S, Baron A (2008) Beta‑lactam antibiotic
prevents tolerance to the hypothermic effect of a kappa opioid recep‑
tor agonist. Neuropharmacology 55:865–870
88. Baiula M, Galletti P, Martelli G, Soldati R, Belvisi L, Civera M et al (2016)
New β‑lactam derivatives modulate cell adhesion and signaling
mediated by RGD‑binding and leukocyte integrins. J Med Chem
89. Xing B, Rao J, Liu R (2008) Novel beta‑lactam antibiotics derivatives:
their new applications as gene reporters, antitumor prodrugs and
enzyme inhibitors. Mini Rev Med Chem 8:455–471
90. Saturnino C, Fusco B, Saturnino P, De Martino G, Rocco F, Lancelot JC
(2000) Evaluation of analgesic and anti‑inflammatory activity of novel
beta‑lactam monocyclic compounds. Biol Pharm Bull 23:654–656
91. Wei J, Pan X, Pei Z, Wang W, Qiu W, Shi Z et al (2012) The beta‑lactam
antibiotic, ceftriaxone, provides neuroprotective potential via anti‑exci‑
totoxicity and anti‑inflammation response in a rat model of traumatic
brain injury. J Trauma Acute Care Surg 73:654–660
92. Volchegorskii IA, Trenina EA (2006) Antidepressant activity of beta‑
lactam antibiotics and their effects on the severity of serotonin edema.
Bull Exp Biol Med 142:73–75
93. Uchida T, Rodriquez M, Schreiber SL (2009) Skeletally Diverse Small
Molecules Using a Build/Couple/Pair Strategy. Org Lett.
https://doi.org/10.1021/ol900173t
94. Saldívar‑González FI, Lenci E, Calugi L, Medina‑Franco JL, Trabocchi A
(2020) Computational‑aided design of a library of lactams through a
Diversity‑Oriented Synthesis strategy. Bioorg Med Chem.
https://doi.org/10.1016/j.bmc.2020.115539
95. Denis. Building Blocks ‑ Enamine n.d. https ://enami ing‑
block s. Accessed 20 April 2019.
96. Panza F, Lozupone M, Logroscino G, Imbimbo BP (2019) A critical
appraisal of amyloid‑β‑targeting therapies for Alzheimer disease. Nat
Rev Neurol 15:73–88
97. Lane RM, Potkin SG, Enz A (2006) Targeting acetylcholinesterase and
butyrylcholinesterase in dementia. Int J Neuropsychopharmacol
98. Rayatzadeh A, Saeedi M, Mahdavi M, Rezaei Z, Sabourian R, Mosslemin
MH et al (2015) Synthesis and evaluation of novel oxoisoindoline
derivatives as acetylcholinesterase inhibitors. Monatshefte für Chemie ‑
Chemical Monthly 146:637–643
99. Bentley KW (2006) beta‑Phenylethylamines and the isoquinoline alka‑
loids. Nat Prod Rep 23(3):444–463
100. ZINC Synquest Building Blocks Economical. https ://zinc.docki
catal ogs/synqu estbb e/. Accessed 4 Sept 2020.
101. ZINC. https ://zinc.docki Accessed 4 Sept 2020.
102. Lipinski C, Hopkins A (2004) Navigating chemical space for biology and
medicine. Nature 432:855–861
103. Lipinski CA (2004) Lead‑ and drug‑like compounds: the rule‑of‑five
revolution. Drug Discov Today Technol 1:337–341
104. Veber DF, Johnson SR, Cheng H‑Y, Smith BR, Ward KW, Kopple KD (2002)
Molecular properties that influence the oral bioavailability of drug
candidates. J Med Chem 45:2615–2623
105. Schuffenhauer A, Varin T (2011) Rule‑based classification of chemical
structures by scaffold. Mol Inform 30:646–664
106. Medina‑Franco J, Martínez‑Mayorga K, Bender A, Scior T (2009) Scaffold
diversity analysis of compound data sets using an entropy‑based meas‑
ure. QSAR Comb Sci. 28:1551–1560
107. Langdon SR, Westwood IM, van Montfort RLM, Brown N, Blagg J (2013)
Scaffold‑focused virtual screening: prospective application to the
discovery of TTK inhibitors. J Chem Inf Model 53:110012
108. Wetzel S, Klein K, Renner S, Rauh D, Oprea TI, Mutzel P et al (2009) Inter‑
active exploration of chemical space with Scaffold Hunter. Nat Chem
Biol 5:581–583
109. Agrafiotis DK, Wiener JJM (2010) Scaffold explorer: an interactive tool
for organizing and mining structure‑activity data spanning multiple
chemotypes. J Med Chem. https :// 4495
110. Mok NY, Brown N (2017) Applications of systematic molecular scaffold
enumeration to enrich structure–activity relationship information. J
Chem Inf Model 57:27–35
111. Medina‑Franco JL, Maggiora GM (2013) Molecular similarity analysis. In:
Bajorath J (ed) Chemoinformatics for drug discovery. Wiley, Hoboken,
pp 343–399
112. Nikolova N, Jaworska J (2003) Approaches to measure chemical similar‑
ity– a Review. QSAR Comb Sci 22:1006–1026
113. Medina‑Franco JL (2013) Chemoinformatic characterization of the
chemical space and molecular diversity of compound libraries. In: Trab‑
occhi A (ed) Diversity‑Oriented Synthesis. Wiley, Hoboken, pp 325–352
114. González‑Medina M, Medina‑Franco JL (2017) Platform for unified
molecular analysis: PUMA. J Chem Inf Model 57:1735–1740
115. Naveja JJ, Saldívar‑González FI, Sánchez‑Cruz N, Medina‑Franco JL
(2019) Cheminformatics approaches to study drug polypharmacol‑
ogy. In: Roy K (ed) Multi‑target drug design using chem‑bioinformatic
approaches. Springer, New York, pp 3–25
116. González‑Medina M, Prieto‑Martínez FD, Owen JR, Medina‑Franco JL
(2016) Consensus diversity plots: a global diversity analysis of chemical
libraries. J Cheminformatics 8:63
117. Wishart DS, Feunang YD, Guo AC, Lo EJ, Marcu A, Grant JR et al (2018)
DrugBank 5.0: a major update to the DrugBank database for 2018.
Nucleic Acids Res. 46:D1074–D1082
118. Akella LB, DeCaprio D (2010) Cheminformatics approaches to analyze
diversity in compound screening libraries. Curr Opin Chem Biol
119. Meyers J, Carter M, Mok NY, Brown N (2016) On the origins of three‑
dimensionality in drug‑like molecules. Future Med Chem 8:1753–1767
120. Willett P, Barnard JM, Downs GM (1998) Chemical similarity searching. J
Chem Inf Comput Sci 38:983–996
121. Lipkus AH, Yuan Q, Lucas KA, Funk SA, Bartelt WF III, Schenck RJ et al
(2008) Structural diversity of organic chemistry. A scaffold analysis of
the CAS Registry. J Org Chem. 73:4443–4451
122. Bemis GW, Murcko MA (1996) The properties of known drugs. 1.
Molecular frameworks. J Med Chem 39:2887–2893
123. Leach AR, Gillet VJ, editors. Selecting diverse sets of compounds. An
introduction to chemoinformatics, Dordrecht: Springer Netherlands;
2007, p. 119–39.
124. Tutorials for Computer Aided Drug Design using KNIME workflows |
KNIME. https://www.knime.com/blog/tutorials-for-computer-aided-drug-design-using-knime-workflows.
Accessed 4 Sept 2020.
125. Gally J‑M, Bourg S, Do Q‑T, Aci‑Sèche S, Bonnet P (2017) VSPrep: a
general KNIME workflow for the preparation of molecules for virtual
screening. Mol Inform 36:1700023
126. Sala Benito JV, Paini A, Richarz A‑N, Meinl T, Berthold MR, Cronin MTD
et al (2017) Automated workflows for modelling chemical fate, kinetics
and toxicity. Toxicol In Vitro 45(Pt 2):249–257
127. Lagorce D, Bouslama L, Becot J, Miteva MA, Villoutreix BO (2017) FAF‑
Drugs4: free ADME‑tox filtering computations for chemical biology and
early stages drug discovery. Bioinformatics 33:3658–3660
128. Bruns RF, Watson IA (2012) Rules for identifying potentially reactive or
promiscuous compounds. J Med Chem 55:9763–9772
129. Retrosynthetic analysis and synthesis planning in SciFinder. https :// cts/scifi nder/retro synth esis‑plann ing. Accessed 4
Sept 2020.
130. Synthia™ organic retrosynthesis software. Sigma‑Aldrich. https ://www.
sigma aldri stry/chemi cal‑synth esis/synth esis‑softw are.
html. Accessed 4 Sept 2020.
131. Spaya. https://beta.spaya.ai/app. Accessed 4 Sept 2020.
132. IBM RXN for Chemistry. https :// Accessed 4 Sept 2020.
133. Lin K, Xu Y, Pei J, Lai L (2020) Automatic retrosynthetic route planning
using template‑free models. Chem Sci 11:3355–3364
134. Schwaller P, Petraglia R, Zullo V, Nair VH, Haeuselmann RA, Pisoni R
et al (2020) Predicting retrosynthetic pathways using transformer‑
based models and a hyper‑graph exploration strategy. Chem Sci
135. Bonnet P (2012) Is chemical synthetic accessibility computationally pre‑
dictable for drug and lead‑like molecules? A comparative assessment
between medicinal and computational chemists. Eur J Med Chem
136. SYLVIA ‑ Estimation of the synthetic accessibility of organic compounds.
https ://‑ cts/sylvi a. Accessed 4 Sept 2020.
137. CAESA | Keymodule. https ://www.keymo cts/caesa /
index .html. Accessed: 13 Jun 2020.
138. Sitzmann M. WODCA synthesis design. https ://www2.chemi e.uni‑erlan are/wodca /index .html. Accessed: 13 Jun 2020.
139. Ertl P, Schuffenhauer A (2009) Estimation of synthetic accessibility score
of drug‑like molecules based on molecular complexity and fragment
contributions. J Cheminformatics 1:8
140. Boda K, Seidel T, Gasteiger J (2007) Structure and reaction based evalu‑
ation of synthetic accessibility. J Comput Aided Mol Des 21:311–325
141. Fukunishi Y, Kurosawa T, Mikami Y, Nakamura H (2014) Prediction of
synthetic accessibility based on commercially available compound
databases. J Chem Inf Model 54:3259–3267
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in pub‑
lished maps and institutional affiliations.
... Different parameters have been used in the development of chemical libraries based on diversity [7][8][9][10]. Some of the concepts and representations used to generate these libraries are based on physicochemical properties (drug-like or leadlike) [11][12][13], 2D descriptors and molecular fingerprints (14), chemical space [15][16][17], molecular shape [18,19], and pharmacophore models [20]. ...
... SMILES is not only used for the representation of molecular models but also for performing similarity searches, in which a comparison of the physicochemical properties of molecules is carried out. SMILES is, therefore, useful in drug design studies based on physicochemical properties [11][12][13], molecular fingerprint [14], chemical space [15][16][17], and molecular scaffold [48]. Weininger, A., and Weininger, D. developed the original SMILES specification in the late 1980s and early 1990s [43,44,51] [46,52]. ...
... Some platforms used are: AlphaScreen technology [139,140], and PrePeP [141]. In addition, in-house methodologies have been developed for certain PAINS with the KNIME [142] software, the OpenEye [143] chemoinformatics tools, and the R and RStudio applications, which use the Java and R programming languages [16,144]. C omputational methods, including AI, are powerful tools for the detection of interfering compounds in vitro biological assays, as well as for virtual molecular modeling assays [145][146][147]. ...
Full-text available
Chemical libraries and compound data sets are among the main inputs to start the drug discovery process at universities, research institutes, and the pharmaceutical industry. The approach used in the design of compound libraries, the chemical information they possess, and the representation of structures, play a fundamental role in the development of studies: chemoinformatics, food informatics, in silico pharmacokinetics, computational toxicology, bioinformatics, and molecular modeling to generate computational hits that will continue the optimization process of drug candidates. The prospects for growth in drug discovery and development processes in chemical, biotechnological, and pharmaceutical companies began a few years ago by integrating computational tools with artificial intelligence methodologies. It is anticipated that it will increase the number of drugs approved by regulatory agencies shortly.
... The descriptor plays a key role for data mining, analysis of chemical distribution, and active site prediction. [9][10][11] Moreover, massive storage and data handling is a drawback of tools and thus counter to this situation, some algorithm-based tools are under pipeline. Some chemical data formats have been described in details in Table 8.2. ...
... 11 Effectiveness assessment of Aldesleukin using in silico design.GRAVY: Grand average of hydropathicity.Calculated using ProtParam web server. b Estimated through ANTIGEN pro online tool.Predicted using AlgPred web server. ...
Modern drug discovery program employs CADD for managing or creating theoretical models, which would be utilized by large databases for discovery and virtual screening of newer therapeutic agents. It is involved in the development and management of several algorithms to explore different prospects in the discovery of novel drug candidates like selection of ligands, prediction of protein structure and function, residues of the active site, and study of protein–ligand interactions. The concept of fragment-based, receptor-based, and nucleic acid-based design of biologics and protein drug design has been explained. Besides, this chapter also presents the role of chemoinformatics and bioinformatics, omics in drug discovery process, various databases used, and advances in drug designing approaches.
... However, these databases have some common set of molecules and hence overlapping chemical space [17]. With the availability of various open-source computational tools, enumerating ultra-large virtual libraries has increasingly become effortless, and so is their application in drug design [7,9,18]. However, one of the biggest challenges after a hit is obtained through VS is the molecules' availability and synthetic tractability for experimental validation. ...
Full-text available
Virtual screening (VS) is an important approach in drug discovery and relies on the availability of a virtual library of synthetically tractable molecules. Ugi reaction (UR) represents an important multi-component reaction (MCR) that reliably produces a peptidomimetic scaffold. Recent literature shows that a tactically assembled Ugi adduct can be subjected to further chemical modifications to yield a variety of rings and scaffolds, thus, renewing the interest in this old reaction. Given the reliability and efficiency of UR, we collated an UR derived library (URDL) of small molecules (total = 5773) for VS. The synthesis of the majority of URDL molecules may be carried out in 1–2 pots in a time and cost-effective manner. The detailed analysis of the average property and chemical space of URDL was also carried out using the open-source Datawarrior program. The comparison with FDA-approved oral drugs and inhibitors of protein–protein interactions (iPPIs) suggests URDL molecules are ‘clean’, drug-like, and conform to a structurally distinct space from the other two categories. The average physicochemical properties of compounds in the URDL library lie closer to iPPI molecules than oral drugs thus suggesting that the URDL resource can be applied to discover novel iPPI molecules. The URDL molecules consist of diverse ring systems, many of which have not been exploited yet for drug design. Thus, URDL represents a small virtual library of drug-like molecules with unexplored chemical space designed for VS. The structures of all molecules of URDL, oral drugs, and iPPI compounds are being made freely accessible as supplementary information for broader application.
... Later, SMILES arbitrary target specification (SMARTS) notation was developed to specify substructural patterns which allow the matching of molecules that contain the specified substructural pattern [77]. For 2D graphical representation, there are programs that allow drawing of the chemical structures and facilitate the storage and interconversion between standard 1D and 3D file formats [78]. 3D databases are very useful for structure-based screening. ...
Natural products (NPs) are a rich source of structurally novel molecules, and the chemical space they encompass is far from being fully explored. Over history, NPs have represented a significant source of bioactive molecules and have served as a source of inspiration for developing many drugs on the market. On the other hand, computer-aided drug design (CADD) has contributed to drug discovery research, mitigating costs and time. In this sense, compound databases represent a fundamental element of CADD. This work reviews the progress toward developing compound databases of natural origin, and it surveys computational methods, emphasizing chemoinformatic approaches to profile natural product databases. Furthermore, it reviews the present state of the art in developing Latin American NP databases and their practical applications to the drug discovery area.
... Later, SMILES Arbitrary Target Specification (SMARTS) notation was developed to specify substructural patterns, which allows matching molecules that contain the specified substructural pattern [76]. For 2D graphical representation, there are programs that allow drawing chemical structures and facilitate storage and interconversion between the standard 1D and 3D file formats [77]. 3D databases are very useful for structure-based screening. ...
Natural products (NPs) are a rich source of structurally novel molecules, and the chemical space they encompass is far from being fully explored. Over history, NPs have represented a significant source of bioactive molecules and have served as a source of inspiration for developing many drugs on the market. On the other hand, computer-aided drug design (CADD) has contributed to drug discovery research, mitigating costs and time. In this sense, compound databases represent a fundamental element of CADD. This work reviews the progress toward developing compound databases of natural origin, particularly databases developed in Latin America, and their practical applications in the drug discovery area. We also survey the computational methods, emphasizing chemoinformatic approaches to profile natural product databases.
... Some metrics have been proposed to quantify the complexity of a molecule structure [2]. Similarly, there are different approaches to measure synthetic accessibility [40]. Saldívar-González et al. recently reviewed the three main methods to evaluate chemical complexity and synthetic accessibility, namely graph-theoretical methods, (sub)structure-based approaches, and physicochemical and topological descriptors [27]. ...
Natural products have a significant role in drug discovery. Their unique chemical structures have led to compounds in clinical use to treat different diseases. Also, natural products are significant sources of inspiration or starting points to develop new therapeutic agents. There are also unique natural products such as peptides and macrocycles that offer sources or starting points to address complex diseases. Computational approaches that use chemoinformatics and molecular modeling methods help to assist and accelerate natural product-based drug discovery. Several research groups have recently used computational methodologies to organize data, interpret results, generate and test hypotheses, filter large chemical databases before experimental screening, and design experiments. Herein, we discuss chemoinformatics and molecular modeling applications to uncover bioactive natural products. We also discuss in silico methods to optimize the biological activity and anticipate potential toxicity issues of natural products. As case studies, we discuss the role of natural products for COVID-19 drug discovery and their impact on the identification of compounds with activity against DNA methyltransferase, an epigenetic target with relevance in cancer and other diseases.
... The chemical space is huge, and novel chemical databases are assembled and under continued development. This has been exemplified by the large number of on-demand and virtual libraries commented on recently (Walters 2019; Saldívar-González et al. 2020). Reviews of recent chemical libraries for drug discovery have been published elsewhere (Gong et al. 2017; Wang et al. 2019). ...
Informatics plays a fundamental role in many chemistry applications, giving rise to the consolidation of well-established disciplines such as bioinformatics and chemoinformatics. It has also led to the maturation of subdisciplines such as food informatics, epi-informatics, and more recently, the so-called natural products informatics. The extensive practice of informatics across different disciplines and subdisciplines has been boosted by the large and increasing availability of open and well-documented resources. A number of them have been implemented as web applications that further encourage their use by the scientific community. In this chapter, we review the recent progress on the development of public chemoinformatic resources for different tasks, with special emphasis on drug discovery applications. Due to the current COVID-19 pandemic, we emphasize resources that have been developed and released over the past few months to support drug discovery efforts worldwide.
Data is a critical element in any discovery process. In the last decades, we observed exponential growth in the volume of available data and the technology to manipulate it. However, data is only practical when one can structure it for a well-defined task. For instance, we need a corpus of text broken into sentences to train a natural language machine-learning model. In this work, we will use the token "dataset" to designate a structured set of data built to perform a well-defined task. Moreover, the dataset will be used in most cases as a blueprint of an entity that at any moment can be stored as a table. Specifically, in science, each area has unique forms to organize, gather and handle its datasets. We believe that datasets must be a first-class entity in any knowledge-intensive process, and all workflows should pay exceptional attention to the datasets' lifecycle, from their gathering to their uses and evolution. We advocate that science and engineering discovery processes are extreme instances of the need for such organization of datasets, calling for new approaches and tooling. Furthermore, these requirements are more evident when the discovery workflow uses artificial intelligence methods to empower the subject-matter expert. In this work, we discuss an approach to bringing datasets as a critical entity in the discovery process in science. We illustrate some concepts using material discovery as a use case. We chose this domain because it leverages many significant problems that can be generalized to other science fields.
The landscape paradigm is revisited in the light of evolution in simple systems. A brief overview of different classes of fitness landscapes is followed by a more detailed discussion of the RNA model, which is currently the only evolutionary model that allows for a comprehensive molecular analysis of a fitness landscape. Neutral networks of genotypes are indispensable for the success of evolution. Important insights into the evolutionary mechanism are gained by considering the topology of sequence and shape spaces. The dynamic concept of molecular quasispecies is viewed in the light of the landscape paradigm. The distribution of fitness values in state space is mirrored by the population structures of mutant distributions. Two classes of thresholds for replication error or mutations are important: (i) the conventional genotypic error threshold, which separates ordered replication from random drift on neutral networks, and (ii) a phenotypic error threshold above which the molecular phenotype is lost. Empirical landscapes are reviewed and finally, the implications of the landscape concept for virus evolution are discussed. Keywords: Accessibility; Error threshold; Fitness landscapes; Genotype-phenotype maps; Molecular evolution; Neutral networks; Phenomenological approach; Quasispecies; RNA model; Selective neutrality; Sequence space; Shape space; Shape space topology; Virus evolution
In the emerging field of drug discovery, rapid virtual screening methods become extremely valuable, especially when dealing with ultra-large databases of small bioactive organic molecules. In this work, we present a fast, computationally resource-efficient, and simple workflow for screening targeted compound libraries generated from ultra-large virtual chemical space. This workflow aims to find compounds whose molecular 3D shapes are similar to those of reference compounds, and at the same time to expand chemical diversity and to identify new and potentially active scaffolds. This pipeline ensures the enrichment of the generated libraries with novel chemotypes. Also, it was shown that delicate tailoring of the physicochemical parameters of the search set ensures that all library compounds will possess desired property distributions. A visual inspection has shown that the structures found bind to the receptor in the same way as the reference ones. Using our screening workflow, we have created a number of conventional protein-targeted libraries: the GPCRs Targeted Library (531 K compounds) and the Protein Kinases Targeted Library (113 K compounds). The described pipeline and scripts are freely accessible at:
We present an extension of our Molecular Transformer model combined with a hyper-graph exploration strategy for automatic retrosynthesis route planning without human intervention. The single-step retrosynthetic model sets a new state of the art for predicting reactants as well as reagents, solvents and catalysts for each retrosynthetic step. We introduce four metrics (coverage, class diversity, round-trip accuracy and Jensen-Shannon divergence) to evaluate the single-step retrosynthetic models, using the forward prediction and a reaction classification model always based on the transformer architecture. The hypergraph is constructed on the fly, and the nodes are filtered and further expanded based on a Bayesian-like probability. We critically assessed the end-to-end framework with several retrosynthesis examples from literature and academic exams. Overall, the frameworks have an excellent performance with few weaknesses related to the training data. The use of the introduced metrics opens up the possibility to optimize entire retrosynthetic frameworks by focusing on the performance of the single-step model only.
Retrosynthetic route planning can be considered as a rule-based reasoning procedure. The possibilities for each transformation are generated based on collected reaction rules, and then potential reaction routes are recommended by various optimization algorithms. Although there has been much progress in computer-assisted retrosynthetic route planning and reaction prediction, fully data-driven automatic retrosynthetic route planning remains challenging. Here we present a template-free approach that is independent of reaction templates, rules, or atom mapping, to implement automatic retrosynthetic route planning. We treated each reaction prediction task as a data-driven sequence-to-sequence problem using the multi-head attention-based Transformer architecture, which has demonstrated power in machine translation tasks. Using reactions from the United States patent literature, our end-to-end models naturally incorporate the global chemical environments of molecules and achieve remarkable performance on top-1 predictive accuracy (63.0%, with reaction class provided) and top-1 molecular validity (99.6%) in one-step retrosynthetic tasks. Inspired by the success rate of the one-step reaction prediction, we further carried out iterative, multi-step retrosynthetic route planning for four case products, which was successful. We then constructed an automatic data-driven end-to-end retrosynthetic route planning system (AutoSynRoute) using Monte Carlo Tree Search with a heuristic scoring function. AutoSynRoute successfully reproduced published synthesis routes for the four case products. The end-to-end model for reaction task prediction can be easily extended to larger or customer-requested reaction databases. Our study presents an important step in realizing automatic retrosynthetic route planning.
A series of novel 1,4-diaryl-2-azetidinone analogues of combretastatin A-4 (CA-4) have been designed, synthesised and evaluated in vitro for antiproliferative activity, antiapoptotic activity and inhibition of tubulin polymerisation. Glucuronidation of CA-4 by uridine 5-diphosphoglucuronosyl transferase enzymes (UGTs) has been identified as a mechanism of resistance in cancer cells. Potential sites of ring B glucuronate conjugation are removed by replacing the B ring meta-hydroxy substituent of selected series of β-lactams with alternative substituents e.g. F, Cl, Br, I, CH3. The 3-phenyl-β-lactam 11 and 3-hydroxy-β-lactam 46 demonstrate improved activity over CA-4 in CA-4 resistant HT-29 colon cancer cells (IC50 = 9 nM and 3 nM respectively compared with IC50 = 4.16 μM for CA-4), while retaining potency in MCF-7 breast cancer cells (IC50 = 17 nM and 22 nM respectively compared with IC50 = 4 nM for CA-4). Compound 46 binds at the colchicine site of tubulin, and strongly inhibits tubulin assembly at micromolar concentrations comparable to CA-4. In addition, compound 46 induced mitotic arrest at low concentration in both cell lines MCF-7 and HT-29 together with downregulation of expression of antiapoptotic proteins Mcl-1, Bcl-2 and survivin in MCF-7 cells. These novel antiproliferative and antiapoptotic β-lactams are potentially useful scaffolds in the development of tubulin-targeting agents for the treatment of breast cancers and chemoresistant colon cancers.
Despite intense interest in expanding chemical space, libraries containing hundreds-of-millions to billions of diverse molecules have remained inaccessible. Here we investigate structure-based docking of 170 million make-on-demand compounds from 130 well-characterized reactions. The resulting library is diverse, representing over 10.7 million scaffolds that are otherwise unavailable. For each compound in the library, docking against AmpC β-lactamase (AmpC) and the D4 dopamine receptor were simulated. From the top-ranking molecules, 44 and 549 compounds were synthesized and tested for interactions with AmpC and the D4 dopamine receptor, respectively. We found a phenolate inhibitor of AmpC, which revealed a group of inhibitors without known precedent. This molecule was optimized to 77 nM, which places it among the most potent non-covalent AmpC inhibitors known. Crystal structures of this and other AmpC inhibitors confirmed the docking predictions. Against the D4 dopamine receptor, hit rates fell almost monotonically with docking score, and a hit-rate versus score curve predicted that the library contained 453,000 ligands for the D4 dopamine receptor. Of 81 new chemotypes discovered, 30 showed submicromolar activity, including a 180-pM subtype-selective agonist of the D4 dopamine receptor.
Peptide‐based drug discovery is re‐gaining attention in drug discovery. Similarly, combinatorial chemistry continues to be a useful technique for the rapid exploration of chemical space. A current challenge, however, is the enumeration of combinatorial peptide libraries using freely accessible tools. To facilitate the swift enumeration of combinatorial peptide libraries, we introduce herein D‐Peptide Builder. In the current version, the user can build up to pentapeptides, linear or cyclic, using the natural pool of 20 amino acids. The user can use non‐ and/or N‐methylated amino acids. The server also enables the rapid visualization of the chemical space of the newly enumerated peptides in comparison with other libraries relevant to drug discovery and preloaded in the server. D‐Peptide Builder is freely accessible at It is also accessible through the open D‐Tools platform (DIFACQUIM Tools for Chemoinformatics‐tools/).
Small molecule libraries for virtual screening are becoming a well-established tool for the identification of new hit compounds. As for experimental assays, the library quality, defined in terms of structural complexity and diversity, is crucial to increase the chance of a successful outcome in the screening campaign. In this context, Diversity-Oriented Synthesis has proven to be very effective, as the compounds generated are structurally complex and differ not only for the appendages, but also for the molecular scaffold. In this work, we automated the design of a library of lactams by applying a Diversity-Oriented Synthesis strategy called Build/Couple/Pair. We evaluated the novelty and diversity of these compounds by comparing them with lactam moieties contained in approved drugs, natural products, and bioactive compounds from ChEMBL. Finally, depending on their scaffold we classified them into β-, γ-, δ-, ε-, and isolated, fused and spiro- lactam groups and we assessed their drug-like and lead-like properties, thus providing the valence of this novel in silico designed library for medicinal chemistry applications.
Large scale in vitro and in silico screening are two orthogonal approaches for hit identification in drug discovery. In recent years, due to the emergence of new targets and a rapid increase in the size of the readily synthesizable chemical space, there is a growing emphasis on the integration of the two techniques to improve the hit finding efficiency. Here, we highlight three examples of drug discovery projects at Merck & Co., Inc., Kenilworth, NJ, USA in which different virtual screening (VS) techniques, each specifically tailored to leverage knowledge available for the target, were utilized to augment the selection of high-quality chemical matter for in vitro assays and to enhance the diversity and tractability of hits. Central to success is a fully integrated workflow combining in silico and experimental expertise at every stage of the hit identification process. We advocate that workflows encompassing VS as part of an integrated hit finding plan should be widely adopted to accelerate hit identification and foster cross-functional collaborations in modern drug discovery.
Small molecules have always been of interest in medicinal chemistry because of their ability to exert significant effects on the functions of macromolecules that comprise living systems. Chemical diversity and complexity are examples of properties to guide the design and selection of new small molecules. These properties may have significant implications in many areas of research including but not limited to drug and probe discovery. Systematic exploration of the chemical diversity and complexity of small molecules can be carried out using different strategies. A cornerstone in the different approaches is structure representation where the most common are molecular scaffolds, fingerprints, and whole molecule properties. This chapter discusses different cheminformatics strategies that have been consistently used in literature to characterize the diversity and complexity of small molecules. Recent exemplary studies of the diversity and complexity of small molecules and their interpretation are also discussed.
Brain accumulation of the amyloid-β (Aβ) peptide is believed to be the initial event in the Alzheimer disease (AD) process. Aβ accumulation begins 15–20 years before clinical symptoms occur, mainly owing to defective brain clearance of the peptide. Over the past 20 years, we have seen intensive efforts to decrease the levels of Aβ monomers, oligomers, aggregates and plaques using compounds that decrease production, antagonize aggregation or increase brain clearance of Aβ. Unfortunately, these approaches have failed to show clinical benefit in large clinical trials involving patients with mild to moderate AD. Clinical trials in patients at earlier stages of the disease are ongoing, but the initial results have not been clinically impressive. Efforts are now being directed against Aβ oligomers, the most neurotoxic molecular species, and monoclonal antibodies directed against these oligomers are producing encouraging results. However, Aβ oligomers are in equilibrium with both monomeric and aggregated species; thus, previous drugs that efficiently removed monomeric Aβ or Aβ plaques should have produced clinical benefits. In patients with sporadic AD, Aβ accumulation could be a reactive compensatory response to neuronal damage of unknown cause, and alternative strategies, including interference with modifiable risk factors, might be needed to defeat this devastating disease.