About
272
Publications
48,618
Reads
7,459
Citations
Introduction
Igor I. Baskin conducts research on the use of machine learning (in particular, neural networks and kernel methods) and computational chemistry and physics in chemoinformatics and materials informatics in the field of electrochemistry.
Current institution
Additional affiliations
January 2021 - present
July 1994 - January 2001
July 1994 - January 2001
Publications (272)
We provide a comprehensive approach to a methodology to evaluate the performance of lithium-ion batteries and related intercalation systems at the single-particle level, by constructing diagrammatic representations. The idea that underlies these methodologies is using two dimensionless/scaling parameters, which allow the evaluation of a series of e...
This paper reviews the application of machine learning to the inhibition of corrosion by organic molecules. The methodologies considered include quantitative structure‐property relationships (QSPR) and related data‐driven approaches. The characteristic features of their key components are considered as applied to corrosion inhibition, including dat...
The increasing demand for energy storage technologies has prompted the exploration of side‐by‐side technologies, that can complement the current Lithium‐ion battery industry with cheaper and more abundant materials that can be incorporated in a myriad of new electrochemical cell designs. To meet these goals, a novel approach for electrolyte design...
Silicon (Si) is the second most abundant material in nature and yet, despite its high abundance and ease of production, the possibility of using Si as an active multivalent rechargeable anode was never explored or reported. As a proof of concept, we will discuss in this talk a new rechargeable Si-ion cell, its design and architecture, enabling Si to be...
Conjugated QSPR models for reactions integrate fundamental chemical laws expressed by mathematical equations with machine learning algorithms. Herein we present a methodology for building conjugated QSPR models integrated with the Arrhenius equation. Conjugated QSPR models were used to predict kinetic characteristics of cycloaddition reactions rela...
Electrochemical processes underlie the functioning of electrochemical devices for energy storage and conversion. In this paper, electrochemoinformatics is defined as a scientific discipline, a part of computational electrochemistry, dealing with the application of information technologies, specifically data science, machine learning (ML), and artif...
In order to better formalize it, the notorious inverse-QSAR problem (finding structures of given QSAR-predicted properties) is considered in this paper as a two-step process including (i) finding "seed" descriptor vectors corresponding to user-constrained QSAR model output values and (ii) identifying the chemical structures best matching the "seed"...
In order to better formalize it, the notorious Inverse-QSAR problem (finding structures of given QSAR-predicted properties) is considered in this paper as a two-step process [1,2,3] including (i) finding “seed” descriptor vectors corresponding to user-constrained QSAR model output values and (ii) identifying the chemical structures best matching the “seed...
Despite its high abundance and ease of production, the possibility of using silicon as an active multivalent rechargeable anode has never been explored, until now. As a proof of concept, a novel rechargeable silicon cell, its design and architecture are reported, enabling Si to be reversibly discharged at 1.1 V and charged at 1.5 V. It is proven th...
The great importance of the ability to quantitatively predict the properties of ionic liquids (ILs) using quantitative structure-property relationships (QSPR) models necessitates the understanding of which modern machine learning (ML) methods in combination with which types of molecular representations are preferable to use for this purpose. To add...
The synthesis of the desired chemical compound is the main task of synthetic organic chemistry. Predictions of reaction conditions and of important quantitative characteristics of chemical reactions, such as yield and reaction rate, can substantially help in the development of optimal synthetic routes and assessment of synthesis cost. Theoretical as...
The most widely used QSAR approaches are mainly based on 2D molecular representation which ignores stereoconfiguration and conformational flexibility of compounds. 3D QSAR uses a single conformer of each compound which is difficult to choose reasonably. 4D QSAR uses multiple conformers to overcome the issues of 2D and 3D methods. However, many of e...
In this article, we consider cross-validation of the quantitative structure-property relationship models for reactions and show that the conventional k-fold cross-validation (CV) procedure gives an ‘optimistically’ biased assessment of prediction performance. To address this issue, we suggest two strategies of model cross-validation, ‘transformatio...
The “creativity” of Artificial Intelligence (AI) in terms of generating de novo molecular structures opened a novel paradigm in compound design, weaknesses (stability & feasibility issues of such structures) notwithstanding. Here we show that “creative” AI may be as successfully taught to enumerate novel chemical reactions that are stoichiometrical...
The rapid development of robotic platforms in drug and materials design stimulates the development of efficient chemoinformatics tools for chemical reaction mining. This article surveys the main aspects and recent advances in this field. The following topics are discussed: reaction data availability, visualization and analysis of chemical reaction space, ret...
Defining a model’s applicability domain (AD) is currently an active research topic in chemoinformatics. Although many AD definitions for models predicting properties of molecules (Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) models) have been described in the literature, none for chemical reactions (...
Here we report a new predictive model for autoignition temperature (AIT), an important physical parameter widely used to assess potential safety hazards of combustible materials. Available structure-AIT data extracted from different sources were critically analysed. Support vector regression (SVR) models on different data subsets were built in orde...
Correction for ‘QSAR without borders’ by Eugene N. Muratov et al., Chem. Soc. Rev., 2020, DOI: 10.1039/d0cs00098a.
Prediction of chemical bioactivity and physical properties has been one of the most important applications of statistical and, more recently, machine learning and artificial intelligence methods in chemical sciences. This field of research, broadly known as quantitative structure–activity relationships (QSAR) modeling, has developed many important a...
Generative Topographic Mapping (GTM) can be efficiently used to visualize, analyze and model large chemical data. The GTM manifold needs to span the chemical space deemed relevant for a given problem. Therefore, the Frame set (FS) of compounds used for the manifold construction must well cover a given chemical space. Intuitively, the FS size must r...
Introduction: Deep discriminative and generative neural-network models are becoming an integral part of the modern approach to ligand-based novel drug discovery. The variety of different architectures of neural networks, the methods of their training, and the procedures of generating new molecules require expert knowledge to choose the most suitabl...
Generative Topographic Mapping (GTM) is a dimensionality reduction method, which is widely used for both data visualization and structure-activity modeling. Large dimensionality of the initial data space may require significant computational resources and slow down the GTM construction. Therefore, it may be meaningful to reduce the number of descri...
Here, we report an application of Artificial Intelligence techniques to generate novel chemical reactions of the given type. A sequence-to-sequence autoencoder was trained on the USPTO reaction database. Each reaction was converted into a single Condensed Graph of Reaction (CGR), followed by their translation into on-purpose developed SMILES/GGR te...
Here, we describe a concept of conjugated models for several properties (activities) linked by a strict mathematical relationship. This relationship can be directly integrated analytically into the Ridge Regression (RR) algorithm or accounted for in a special case of "twin" neural networks (NN). Developed approaches were applied to the modelling of...
The analysis of information on the spatial structure of molecules and the physical fields of their interactions with biological targets is extremely important for solving various problems in drug discovery. This mini-review article surveys the main features of the continuous molecular fields approach and its use for analyzing structure–activity rel...
Here we show that Generative Topographic Mapping (GTM) can be used to explore the latent space of the SMILES-based autoencoders and generate focused molecular libraries of interest. We have built a sequence-to-sequence neural network with Bidirectional Long Short-Term Memory layers and trained it on the SMILES strings from ChEMBL23. Very high recon...
Generative Topographic Mapping (GTM) approach was successfully used to visualize, analyze and model the equilibrium constants (KT) of tautomeric transformations as a function of both structure and experimental conditions. The modeling set contained 695 entries corresponding to 350 unique transformations of 10 tautomeric types, for which KT values w...
Various methods of machine learning, supervised and unsupervised, linear and nonlinear, classification and regression, in combination with various types of molecular descriptors, both “handcrafted” and “data-driven,” are considered in the context of their use in computational toxicology. The use of multiple linear regression, variants of naïve Baye...
Lecture "Application of Deep Learning Neural Networks in Chemoinformatics: Advantages and Prospects"
We report the first direct QSPR modeling of equilibrium constants of tautomeric transformations (logKT) in different solvents and at different temperatures, which does not require intermediate assessment of acidity (basicity) constants for all tautomeric forms. The key step of the modeling consisted in the merging of two tautomers in one sole molec...
The review is devoted to the achievements in analysis of information on chemical reactions using machine learning methods. Four large areas that actively use these methods are outlined: computer-assisted planning of synthesis, analysis and visualization of chemical reaction data, prediction of the quantitative characteristics of reactions and compu...
Generative topographic mapping (GTM) approach is used to visualize the chemical space of organic molecules (L) with respect to binding a wide range of 41 different metal cations (M) and also to build predictive models for stability constants (logK) of 1:1 (M:L) complexes using “density maps,” “activity landscapes,” and “selectivity landscapes” tech...
This lecture is devoted to advances in the analysis of information on chemical reactions using machine learning methods. It outlines four large domains where these methods are actively used: computer-assisted synthetic planning, analysis and visualization of reaction data, prediction of quantitative characteristics of reactions, and, finally, compu...
Using the structural representation of a chemical reaction in the form of a condensed graph, a model allowing the prediction of rate constants (logk) of Diels–Alder reactions performed in different solvents and at different temperatures is constructed for the first time. The model demonstrates good agreement between the predicted and experimental logk...
This chapter illustrates the use of common regression methods and introduces performance measures for regression. The regression problem consists in estimating ligand affinity to adenosine receptor (A2A), as a function of the ligand structure. Ligand structures and their known pKi values were collected from the IupharDB, ChEMBL, and PubChem BioAssay...
This chapter contains a tutorial illustrating bagging and boosting in the context of regression models. The first base regression method used in this tutorial is the classical algorithm of Multiple Linear Regression (MLR) implemented in Weka in the class classifiers/functions with the name LinearRegression. The bagging procedure consists of: genera...
This chapter demonstrates the danger of the variable selection bias and the need for external cross-validation for correct assessment of the prediction performance of QSAR models based on automatically selected descriptors. The n-fold cross-validation technique is widely used to estimate the performance of QSAR models. In this procedure, the entire...
Nowadays, there exist hundreds of different machine learning methods. This chapter includes a tutorial considering the following machine learning methods for performing regression: zero regression (ZeroR), multiple linear regression (MLR), partial least squares (PLS), support vector regression (SVR), k nearest neighbors (kNN), back-propagation neur...
Stacking is historically one of the first ensemble learning methods. It combines several base models (lower-level models) built using absolutely different classes of machine learning methods by means of a “meta-learner” (high-level model) that takes as its inputs the output values of the base models. This chapter demonstrates the ability of stackin...
This chapter demonstrates the interpretable rules method. In this method, the selected rules are sensitive to any modification of the training data, even to the order of the data in the input file. Some machine learning methods allow the user to obtain easily interpretable models involving a relatively small number of attributes. Generally, such models cons...
This chapter illustrates two important approaches of Ensemble Learning (EL): bagging and boosting. The methods are demonstrated on the example of building classification models based on interpretable rules. Some general behaviors of these approaches are highlighted, in particular the situations where one needs to prefer one approach to another. In W...
This chapter illustrates the use of three classification methods and introduces measures of success for classification. The first one, the Naïve Bayes algorithm, focuses on a statistical description of the data. The second one, the Support Vector Machine, provides a geometric view of the classification problem. The third one, the k-Nearest Neighbor...
This chapter introduces the concept of Random Subspace and demonstrates the ability of the Random Forest method to produce strong predictive models. The Random Forest method is based on bagging models built using the Random Tree method, in which classification trees are grown on a random subset of descriptors. The Random Tree method can be viewed a...
In Energy-Based Neural Networks (EBNNs), relationships between variables are captured by means of a scalar function conventionally called "energy". In this article, we introduce a procedure of "harmony search", which looks for compounds providing the lowest energies for the EBNNs trained on active compounds. It can be considered as a special kind o...
Predictive methods for physical–chemical properties are commonly used during the early stage of drug discovery, notably when identifying promising lead structures for development. This article begins with a historical overview of these methods, and background information about the role of physical–chemical properties in medicinal chemistry. Then, a...
Herein, Generative Topographic Mapping (GTM) was challenged to produce planar projections of the high-dimensional conformational space of complex molecules (the 1LE1 peptide). GTM is a probability-based mapping strategy, and its capacity to support property prediction models serves to objectively assess map quality (in terms of regression statistic...
For the first time, energy-based neural networks (EBNNs) were applied to build structure-activity models. The Hopfield Networks (HNs) and the Restricted Boltzmann Machines (RBMs) were used to build one-class classification models for conducting similarity-based virtual screening. The AUC score for ROC curves and 1%-enrichment rates were compared fo...
In this chapter, we review some concepts and techniques used to visualize chemical compounds represented as objects in a multidimensional descriptor space. Several modern dimensionality reduction techniques are compared with respect to their ability to visualize the data in 2D space, using as example a dataset of acetylcholinesterase inhibitors and...
This chapter describes Generative Topographic Mapping (GTM), a dimensionality reduction method that can be used for data visualization, clustering and modeling. GTM is a probabilistic extension of Kohonen maps. Its probabilistic nature can be exploited in order to build regression or classification models, to define their applicability domain...
Introduction:
Neural networks are becoming a very popular method for solving machine learning and artificial intelligence problems. The variety of neural network types and their application to drug discovery requires expert knowledge to choose the most appropriate approach.
Areas covered:
In this review, the authors discuss traditional and newly...
This presentation concerns the use of different dimensionality reduction and data visualization techniques in chemoinformatics. It starts with the justification of the need to visualize data as an important step to transfer the knowledge acquired by computers by analyzing raw data to humans. We live in a three-dimensional world and have to move on a...
This chapter describes Generative Topographic Mapping (GTM), a dimensionality reduction method that can be used for data visualization, clustering and modeling. GTM is a probabilistic extension of Kohonen maps. Its probabilistic nature can be exploited in order to build regression or classification models, to define their applicability domain,...
Prediction of the activity profile of a given molecule or discovering structures possessing a specific activity profile are two important goals in chemoinformatics, which could be achieved by bridging activity and molecular descriptor spaces. In this paper, we introduce the "Stargate" version of the Generative Topographic Mapping approach or S-GTM...
The lecture reviews the state of the art in 3D QSAR methods combining both alignment-based and alignment-free approaches. Three types of molecular fields are considered: molecular interaction fields, atomic property fields, and the fields based on the electron density function. The main problems and challenges for 3D QSAR analysis are outlined. Th...
The lecture gives an overview of a new scientific field, materials informatics. The lecture starts with a definition of a material and a short description of materials science. Several taxonomies of materials are given. Then a short history of this domain is given. The main focus of the lecture is the use of structure-property modeling (including Q...
In this paper we demonstrate that Generative Topographic Mapping (GTM), a machine learning method traditionally used for data visualisation, can be efficiently applied to QSAR modelling using probability distribution functions (PDF) computed in the latent 2-dimensional space. Several different scenarios of the activity assessment were considered: (...
A novel type of molecular fields, Continuous Indicator Fields (CIFs), is suggested to provide 3D structural description of molecules. The values of CIFs are calculated as the degree to which a point with given 3D coordinates belongs to an atom of a certain type. They can be used similarly to standard physicochemical fields for building 3D structure...
The first part of the lecture deals with different approaches to the representation, formal mathematical description and classification of reactions in organic chemistry. Different model-driven (Ugi, Arens, Vladutz, Fujita, Hendrickson, etc.) and data-driven (Gelernter, InfoChem, etc.) methods of reaction classification are considered. Dugundji-Ugi matri...
This paper is devoted to the analysis and visualization in 2-dimensional space of large datasets of millions of compounds using the incremental version of Generative Topographic Mapping (iGTM). The iGTM algorithm implemented in the in-house ISIDA-GTM program has been applied to a database of more than 2 million compounds combining datasets of 36 ch...
The Method of Continuous Molecular Fields is a universal approach to predict various properties of chemical compounds, in which molecules are represented by means of continuous fields (such as electrostatic, steric, electron density functions, etc.). The essence of the proposed approach consists in performing statistical analysis of functional molec...
This paper reports a predictive model for the rate constant of the bimolecular nucleophilic substitution involving the azide moiety. It predicts reaction rate constants in different solvents, including organic mixtures, and with different organic and inorganic azides as reactants. The optimal descriptors describing solvent effects and a cation type...
Machine learning methods play a very important role in chemoinformatics [1], especially for property prediction [2]. In this lecture, they are characterized in terms of the “modes of statistical inference” and “modeling levels” nomenclature and by considering different facets of the modeling with respect to input/output matching, data types, models du...
An approach for the prediction of rate constants of chemical reactions, based on the representation of a chemical reaction as a condensed graph, has been tested on more than 1000 bimolecular nucleophilic substitution reactions with neutral nucleophiles in 38 solvents. Molecular fragment descriptors, temperature, and solvent parameters characterizin...
The evaluation of important pharmacokinetic properties such as hydrophobicity using High Throughput Screening (HTS) methods is a major issue in drug discovery. In this article, we present the measurement of the Chromatographic Hydrophobicity Index (CHI) on a subset of the French chemical library, the "Chimiothèque Nationale" (CN). The data was used...
Quantitative Structure-Activity Relationship modeling is one of the major computational tools employed in medicinal chemistry. However, throughout its entire history it has drawn both praise and criticism concerning its reliability, limitations, successes, and failures. In this paper, we discuss: (i) the development and evolution of QSAR; (ii) the...
This article reviews the application of fragment descriptors at different stages of virtual screening: filtering, similarity search, and direct activity assessment using QSAR/QSPR models. Several case studies are considered. It is demonstrated that the power of fragment descriptors stems from their universality, very high computational efficiency,...
The continuous molecular fields (CMF) approach is based on the application of continuous functions for the description of molecular fields instead of finite sets of molecular descriptors (such as interaction energies computed at grid nodes) commonly used for this purpose. These functions can be encapsulated into kernels and combined with kernel-bas...
We herewith present a novel approach to predict protein-ligand binding modes from the single two-dimensional structure of the ligand. Known protein-ligand X-ray structures were converted into binary bit strings encoding protein-ligand interactions. An artificial neural network was then set-up to first learn and then predict protein-ligand interacti...
Semi-supervised methods dealing with a combination of labeled and unlabeled data are becoming increasingly popular in the machine-learning area, but are still not used in chemoinformatics. Here, we demonstrate that Transductive Support Vector Machines (TSVM) – a semi-supervised large-margin classification method – can be particularly useful to build the models...