Igor Tetko
Research interests
-
Chemoinformatics, Expert Systems, Grid Computing
Publications
-
3.88 Impact points
PLS-Optimal: A Stepwise D-Optimal Design Based on Latent Variables.
Journal of chemical information and modeling. 03/2012;
Several applications, such as risk assessment within REACH or drug discovery, require reliable methods for the design of experiments and efficient testing strategies. Keeping the number of experiments as low as possible is important from both a financial and an ethical point of view, as exhaustive testing of compounds requires significant financial resources and animal lives. With a large initial set of compounds, experimental design techniques can be used to select a representative subset for testing. Once measured, these compounds can be used to develop quantitative structure-activity relationship models to predict properties of the remaining compounds. This reduces the required resources and time. D-Optimal design is frequently used to select an optimal set of compounds by analyzing data variance. We developed a new sequential approach to apply a D-Optimal design to latent variables derived from a partial least squares (PLS) model instead of principal components. The stepwise procedure selects a new set of molecules to be measured after each previous measurement cycle. We show that application of the D-Optimal selection generates models with a significantly improved performance on four different data sets with end points relevant for REACH. Compared to those derived from principal components, PLS models derived from the selection on latent variables had a lower root-mean-square error and a higher Q² and R². This improvement is statistically significant, especially for the small number of compounds selected.
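The selection step described in this abstract can be illustrated with a minimal sketch. It assumes the compounds are already represented by a score matrix (PLS latent variables or principal components, computed elsewhere) and uses a simple greedy search that maximizes the determinant of the information matrix. The function name `greedy_d_optimal`, the greedy strategy and the small ridge term are assumptions made for illustration, not the published procedure.

```python
import numpy as np

def greedy_d_optimal(scores, n_select, ridge=1e-6):
    """Greedily pick rows of `scores` that maximize det(X^T X + ridge * I).

    scores   : (n_compounds, n_latent_variables) array of precomputed scores
    n_select : number of compounds to choose for the next measurement cycle
    ridge    : small regularizer (assumption) so the determinant is never zero
    """
    scores = np.asarray(scores, dtype=float)
    n, p = scores.shape
    selected = []
    info = ridge * np.eye(p)  # regularized information matrix of the chosen set
    for _ in range(n_select):
        best_idx, best_logdet = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            x = scores[i:i + 1]  # candidate row, shape (1, p)
            _, logdet = np.linalg.slogdet(info + x.T @ x)
            if logdet > best_logdet:
                best_idx, best_logdet = i, logdet
        selected.append(best_idx)
        x = scores[best_idx:best_idx + 1]
        info = info + x.T @ x  # add the chosen compound to the design
    return selected

# Toy score matrix: four compounds described by two latent variables.
toy_scores = [[2.0, 0.0], [1.9, 0.1], [0.0, 1.0], [0.1, 0.1]]
chosen = greedy_d_optimal(toy_scores, n_select=2)
```

In a stepwise setting, the scores would be recomputed from a model refitted on all compounds measured so far before each new call.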
-
3.84 Impact points
The perspectives of computational chemistry modeling.
Journal of computer-aided molecular design. 12/2011; 26(1):135-6.
Online tools for computational chemistry modeling will be increasingly used in the future, bringing advantages for both authors and readers.
-
3.88 Impact points
A comparison of different QSAR approaches to modeling CYP450 1A2 inhibition.
Journal of chemical information and modeling. 06/2011; 51(6):1271-80.
Prediction of the CYP450 inhibition activity of small molecules is an important task because of the high risk of drug-drug interactions. CYP1A2 is an important member of the CYP450 superfamily and accounts for 15% of total CYP450 presence in the human liver. This article compares 80 in silico QSAR models that were created by following the same procedure with different combinations of descriptors and machine learning methods. The training and test sets consist of 3745 and 3741 inhibitors and noninhibitors from the PubChem BioAssay database. A heterogeneous external test set of 160 inhibitors was collected from the literature. The studied descriptor sets include E-state, Dragon and ISIDA SMF descriptors. The machine learning methods include Associative Neural Networks (ASNN), K Nearest Neighbors (kNN), Random Tree (RT), C4.5 Tree (J48), and Support Vector Machines (SVM). The influence of descriptor selection on model accuracy was studied, and the benefits of the "bagging" modeling approach were shown. The applicability domain approach was successfully applied, ways of increasing model accuracy through the use of applicability domain measures were demonstrated, and a fragment-based model interpretation was performed. The most accurate models in this study achieved 83% and 68% correctly classified instances on the internal and external test sets, respectively. The applicability domain approach increased the prediction accuracy to 90% for 78% of the internal and 17% of the external test sets, respectively. The most accurate models are available online at http://ochem.eu/models/Q5747.
-
3.84 Impact points
Online chemical modeling environment (OCHEM): web platform for data storage, model development and publishing of chemical information.
Journal of computer-aided molecular design. 06/2011; 25(6):533-54.
The Online Chemical Modeling Environment is a web-based platform that aims to automate and simplify the typical steps required for QSAR modeling. The platform consists of two major subsystems: the database of experimental measurements and the modeling framework. A user-contributed database contains a set of tools for easy input, search and modification of thousands of records. The OCHEM database is based on the wiki principle and focuses primarily on the quality and verifiability of the data. The database is tightly integrated with the modeling framework, which supports all the steps required to create a predictive model: data search, calculation and selection of a vast variety of molecular descriptors, application of machine learning methods, validation, analysis of the model and assessment of the applicability domain. Unlike other similar systems, OCHEM is not intended to re-implement existing tools or models but rather to invite the original authors to contribute their results, make them publicly available, share them with other users and become members of the growing research community. Our intention is to make OCHEM a widely used platform for performing QSPR/QSAR studies online and sharing them with other users on the Web. The ultimate goal of OCHEM is to collect all possible chemoinformatics tools within one simple, reliable and user-friendly resource. OCHEM is free for web users and is available online at http://www.ochem.eu.
-
3.88 Impact points
Applicability domains for classification problems: Benchmarking of distance to models for Ames mutagenicity set.
Journal of chemical information and modeling. 10/2010; 50(12):2094-111.
The estimation of accuracy and applicability of QSAR and QSPR models for biological and physicochemical properties represents a critical problem. The developed parameter of "distance to model" (DM) is defined as a metric of similarity between the training and test set compounds that have been subjected to QSAR/QSPR modeling. In our previous work, we demonstrated the utility and optimal performance of DM metrics that have been based on the standard deviation within an ensemble of QSAR models. The current study applies such analysis to 30 QSAR models for the Ames mutagenicity data set that were previously reported within the 2009 QSAR challenge. We demonstrate that the DMs based on an ensemble (consensus) model provide systematically better performance than other DMs. The presented approach identifies 30-60% of compounds having an accuracy of prediction similar to the interlaboratory accuracy of the Ames test, which is estimated to be 90%. Thus, the in silico predictions can be used to halve the cost of experimental measurements by providing a similar prediction accuracy. The developed model has been made publicly available at http://ochem.eu/models/1.
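As a rough illustration of a distance to model based on the standard deviation within an ensemble, the sketch below computes the per-compound spread of ensemble predictions and keeps the fraction of compounds with the smallest spread. The function names and the toy numbers are assumptions for illustration; the published study benchmarks several DM definitions and is not reproduced here.

```python
import numpy as np

def ensemble_std_dm(ensemble_predictions):
    """Per-compound standard deviation across ensemble members.

    ensemble_predictions : array of shape (n_models, n_compounds)
    A larger value means the models disagree more, i.e. the compound is
    "further" from the model.
    """
    preds = np.asarray(ensemble_predictions, dtype=float)
    return np.std(preds, axis=0, ddof=1)

def most_reliable(ensemble_predictions, fraction):
    """Indices of the `fraction` of compounds with the smallest DM."""
    dm = ensemble_std_dm(ensemble_predictions)
    n_keep = int(round(fraction * dm.size))
    return np.argsort(dm, kind="stable")[:n_keep]

# Three hypothetical models predicting a mutagenicity score for three compounds.
toy_predictions = [
    [0.90, 0.20, 0.50],
    [0.80, 0.90, 0.50],
    [0.85, 0.10, 0.50],
]
reliable = most_reliable(toy_predictions, fraction=0.67)
```

Compounds outside the returned subset would be the natural candidates for experimental measurement.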
-
1.93 Impact points
Large-Scale Evaluation of log P Predictors: Local Corrections May Compensate Insufficient Accuracy and Need of Experimentally Testing Every Other Compound.
Chemistry & biodiversity. 11/2009; 6(11):1837-44.
A large variety of log P calculation methods failed to produce sufficient accuracy in log P prediction for two in-house datasets of more than 96000 compounds, contrary to their significantly better performance on public datasets. Minimum root mean squared errors (RMSE) of 1.02 and 0.65 were calculated for the Pfizer and Nycomed datasets, respectively, in the 'out-of-box' implementation. Importantly, the use of local corrections (LC) implemented in the ALOGPS program, based on experimental in-house log P data, significantly reduced the RMSE to 0.59 and 0.48 for the Pfizer and Nycomed datasets, respectively, instantly and without retraining the model. Moreover, more than 60% of molecules predicted with the highest confidence in each set had a mean absolute error (MAE) of less than 0.33 log units, which is only ca. 10% higher than the estimated variation in experimental log P measurements for the Pfizer dataset. Therefore, following this retrospective analysis, we suggest that the use of log P values predicted with high confidence may eliminate the need to experimentally test every other compound. This strategy could reduce the cost of measurements for pharmaceutical companies by a factor of 2, increase the confidence in prediction at the analog design stage of drug discovery programs, and could be extended to other ADMET properties.
-
1.78 Impact points
Cross-frequency coupling in mesiotemporal EEG recordings of epileptic patients.
Journal of physiology, Paris. 11/2009;
Semi-invasive foramen ovale (Fov) electrodes were used to record electrical activity in the vicinity of the inferior mesial temporal region of epileptic patients, in addition to standard scalp EEG. Third order cumulant analysis was used to measure the phase-coupled frequencies corresponding to non-linear coupling of spectral frequency components, somewhat analogous to frequencies of resonance. On the basis of the distribution of these frequencies, an index of resonance (IR) is defined as the ratio between the number of peaks in the gamma-band (40-55 Hz) and the number of peaks in the beta-band (15-30 Hz). The epileptogenic focus was located in the hemisphere with lower resonant frequencies because these frequencies were characteristic of a spread of the seizure over a broader area. In the case of Fov electrodes, IR could differentiate a group of patients affected by a tumor from patients with mesial temporal sclerosis. The novel index IR appears to be an interesting parameter for evaluating the level of interareal functional connectivity in Fov recordings in epileptic patients, and its usage is likely to be extended in electrophysiological studies.
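The index of resonance defined above is a simple ratio of peak counts, which can be sketched as follows. The sketch assumes the phase-coupled peak frequencies have already been extracted (the cumulant analysis itself is not shown), treats the band limits as inclusive, and returns NaN or infinity when the beta band contains no peaks; all three choices are assumptions.

```python
def index_of_resonance(peak_frequencies_hz,
                       gamma_band=(40.0, 55.0),
                       beta_band=(15.0, 30.0)):
    """Ratio of the number of gamma-band peaks to the number of beta-band peaks."""
    gamma = sum(gamma_band[0] <= f <= gamma_band[1] for f in peak_frequencies_hz)
    beta = sum(beta_band[0] <= f <= beta_band[1] for f in peak_frequencies_hz)
    if beta == 0:
        # Undefined ratio: infinity if only gamma peaks exist, NaN if neither.
        return float("inf") if gamma else float("nan")
    return gamma / beta

# Hypothetical peak list: two gamma peaks, three beta peaks, one outside both bands.
ir = index_of_resonance([42.0, 50.0, 18.0, 22.0, 27.0, 60.0])
```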
-
3.88 Impact points
Inductive Transfer of Knowledge: Application of Multi-Task Learning and Feature Net Approaches to Model Tissue-Air Partition Coefficients.
Journal of chemical information and modeling. 02/2009;
Two inductive knowledge transfer approaches, multitask learning (MTL) and Feature Net (FN), have been used to build predictive neural network (ASNN) and PLS models for 11 types of tissue-air partition coefficients (TAPC). Unlike conventional single-task learning (STL) modeling, which focuses only on a single target property without any relation to other properties, the inductive transfer approach views the individual models as nodes in a network of interrelated models built in parallel (MTL) or sequentially (FN). It has been demonstrated that MTL and FN techniques are extremely useful in structure-property modeling on small and structurally diverse data sets, when conventional STL modeling is unable to produce any predictive model. Predictive individual STL models were obtained for 4 out of 11 TAPC, whereas application of inductive knowledge transfer techniques resulted in models for 9 TAPC. Differences in the prediction performance of the models as a function of the machine-learning method and of the number of properties simultaneously involved in the learning are discussed.
-
Calculation of molecular lipophilicity: state of the art and comparison of methods on more than 96000 compounds
Chemistry Central Journal. 01/2009;
-
Data integration and knowledge transfer: application to the tissue: air partition coefficients
Chemistry Central Journal. 01/2009;
-
Associative neural network.
Methods in molecular biology (Clifton, N.J.). 01/2009; 458:180-97.
An associative neural network (ASNN) is an ensemble-based method inspired by the function and structure of neural network correlations in the brain. The method operates by simulating the short- and long-term memory of neural networks. The long-term memory is represented by an ensemble of neural network weights, while the short-term memory is stored as a pool of internal neural network representations of the input pattern. This organization allows the ASNN to incorporate new data cases in short-term memory and provides high generalization ability without the need to retrain the neural network weights. The method can be used to estimate the bias and the applicability domain of models. Applications of the ASNN in QSAR and drug design are exemplified.
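One way to picture the split between long-term and short-term memory is the toy sketch below. It assumes the trained ensemble already exists and supplies one output per member for each case; the "short-term memory" is then a pool of stored ensemble output vectors with their residuals, and a prediction is the ensemble mean corrected by the mean residual of the nearest stored cases. The class name, the Euclidean distance and the value of `k` are simplifying assumptions and should not be read as the published algorithm.

```python
import numpy as np

class AssociativeMemory:
    """Toy short-term memory placed on top of a fixed, already trained ensemble."""

    def __init__(self, k=3):
        self.k = k
        self.patterns = []   # stored ensemble output vectors
        self.residuals = []  # observed value minus ensemble mean for each case

    def add_case(self, ensemble_outputs, observed):
        """Store a new measured case; the ensemble weights are left untouched."""
        v = np.asarray(ensemble_outputs, dtype=float)
        self.patterns.append(v)
        self.residuals.append(float(observed) - float(v.mean()))

    def predict(self, ensemble_outputs):
        """Ensemble mean plus the mean residual of the k nearest stored cases."""
        v = np.asarray(ensemble_outputs, dtype=float)
        base = float(v.mean())
        if not self.patterns:
            return base
        dists = np.linalg.norm(np.asarray(self.patterns) - v, axis=1)
        nearest = np.argsort(dists, kind="stable")[: self.k]
        return base + float(np.mean(np.asarray(self.residuals)[nearest]))

memory = AssociativeMemory(k=1)
before = memory.predict([1.0, 3.0])      # plain ensemble mean
memory.add_case([1.0, 3.0], observed=2.5)
after = memory.predict([1.0, 3.0])       # corrected by the stored residual
```

The point of the design is that `add_case` only extends the pool, so new data change predictions without any retraining step.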
-
1.37 Impact points
FunCat functional inference with belief propagation and feature integration.
Computational biology and chemistry. 11/2008; 32(5):375-7.
Pairwise comparison of sequence data is intensively used for automated functional protein annotation, while graphical models emerge as promising candidates for an integration of various heterogeneous features. We designed a model, termed hRMN, that integrates different genomic features, and implemented a variant of belief propagation for functional annotation transfer. hRMN allows the assignment of multiple functional categories while avoiding common problems in annotation transfer from heterogeneous datasets, such as an independency of the investigated datasets. We benchmarked this system with large-scale annotation transfer (based on the MIPS FunCat ontology) to proteins of the prokaryotes Bacillus subtilis, Helicobacter pylori, Listeria monocytogenes, and Listeria innocua. hRMN consistently outperformed two competitors in annotation of four bacterial genomes. The developed code is available for download at http://mips.gsf.de/proj/bfab/hRMN.html.
-
2.09 Impact points
MitoP2: An Integrative Tool for the Analysis of the Mitochondrial Proteome.
Molecular biotechnology. 10/2008;
Mitochondria are crucial for normal cell metabolism and maintenance. Mitochondrial dysfunction has been implicated in a spectrum of human diseases, ranging from rare monogenic to common multifactorial disorders. Important for the understanding of organelle function is the assignment of its constituents, and although over 1,500 proteins are predicted to be involved in mammalian mitochondrial function, so far only about 900 are assigned to mitochondria with reasonable certainty. Continuing efforts are being made to obtain a complete inventory of the mitochondrial proteome by single protein studies and high-throughput approaches. To be of best value for the scientific community, this data needs to be structured, explored, and customized. For this purpose, the MitoP2 database (http://www.mitop2.de) was established and is maintained in order to incorporate such data. The central database contains manually evaluated yeast, mouse, and human reference proteins, which show convincing evidence of a mitochondrial location. In addition, entries from genome-wide approaches that suggest protein localization are integrated and serve to compile a combined score for each candidate, which provides a best estimate of mitochondrial localization. Furthermore, it integrates information on the orthology between species, including Saccharomyces cerevisiae, mouse, human, Arabidopsis thaliana, and Neurospora crassa, thus mutually enhancing evidence across species. In contrast to other known databases, MitoP2 takes into account the reliability by which the protein is estimated as being mitochondrially located, as described herein.
Multiple search functions, as well as information on disease-causing genes and available mouse models, make MitoP2 a valuable tool for the genetic investigation of human mitochondrial pathology.
-
3.88 Impact points
Critical Assessment of QSAR Models of Environmental Toxicity against Tetrahymena pyriformis: Focusing on Applicability Domain and Overfitting by Variable Selection.
Journal of chemical information and modeling. 09/2008;
The estimation of the accuracy of predictions is a critical problem in QSAR modeling. The "distance to model" can be defined as a metric that defines the similarity between the training set molecules and the test set compound for the given property in the context of a specific model. It could be expressed in many different ways, e.g., using Tanimoto coefficient, leverage, correlation in space of models, etc. In this paper we have used mixtures of Gaussian distributions as well as statistical tests to evaluate six types of distances to models with respect to their ability to discriminate compounds with small and large prediction errors. The analysis was performed for twelve QSAR models of aqueous toxicity against T. pyriformis obtained with different machine-learning methods and various types of descriptors. The distances to model based on the standard deviation of predicted toxicity calculated from the ensemble of models afforded the best results. This distance also successfully discriminated molecules with small and large prediction errors for a mechanism-based model developed using log P and the Maximum Acceptor Superdelocalizability descriptors. Thus, the distance to model metric could also be used to augment mechanistic QSAR models by estimating their prediction errors. Moreover, the accuracy of prediction is mainly determined by the training set data distribution in the chemistry and activity spaces but not by the QSAR approaches used to develop the models. We have shown that incorrect validation of a model may result in the wrong estimation of its performance and suggested how this problem could be circumvented.
The toxicity of 3182 and 48774 molecules from the EPA High Production Volume (HPV) Challenge Program and EINECS (European chemical Substances Information System), respectively, was predicted, and the accuracy of prediction was estimated. The developed models are available online at http://www.qspr.org.
-
2.91 Impact points
Calculation of molecular lipophilicity: State-of-the-art and comparison of log P methods on more than 96,000 compounds.
Journal of pharmaceutical sciences. 09/2008;
We first review the state of the art in the development of log P prediction approaches, which fall into two major categories: substructure-based and property-based methods. Then, we compare the predictive power of representative methods for one public (N = 266) and two in-house datasets from Nycomed (N = 882) and Pfizer (N = 95809). A total of 30 and 18 methods were tested for public and industrial datasets, respectively. Accuracy of models declined with the number of nonhydrogen atoms. The Arithmetic Average Model (AAM), which predicts the same value (the arithmetic mean) for all compounds, was used as a baseline model for comparison. Methods with a Root Mean Squared Error (RMSE) greater than the RMSE produced by the AAM were considered unacceptable. The majority of analyzed methods produced reasonable results for the public dataset, but only seven methods were successful on both in-house datasets. We proposed a simple equation based on the number of carbon atoms, NC, and the number of hetero atoms, NHET: log P = 1.46 (±0.02) + 0.11 (±0.001) NC − 0.11 (±0.001) NHET. This equation outperformed a large number of programs benchmarked in this study. Factors influencing the accuracy of log P predictions were elucidated and discussed. © 2008 Wiley-Liss, Inc. and the American Pharmacists Association.
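The simple equation and the AAM baseline from this abstract are easy to express in code. The sketch below uses the coefficients quoted above (without their uncertainties); the function names are illustrative, and computing the AAM mean on the same set that is being evaluated is an assumption, since the abstract does not say which set the mean is taken from.

```python
import math

def simple_log_p(n_carbon, n_hetero):
    """log P = 1.46 + 0.11 * NC - 0.11 * NHET (coefficients from the abstract)."""
    return 1.46 + 0.11 * n_carbon - 0.11 * n_hetero

def rmse(predicted, observed):
    """Root mean squared error between two equally long sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(observed))

def aam_rmse(observed):
    """RMSE of the Arithmetic Average Model, which predicts the mean for everyone."""
    mean = sum(observed) / len(observed)
    return rmse([mean] * len(observed), observed)

def is_acceptable(predicted, observed):
    """A method is unacceptable if its RMSE exceeds that of the AAM baseline."""
    return rmse(predicted, observed) <= aam_rmse(observed)

# Benzene has 6 carbon atoms and no hetero atoms.
benzene_estimate = simple_log_p(6, 0)
```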
-
3.25 Impact points
Calculation of lipophilicity for Pt(II) complexes: experimental comparison of several methods.
Journal of inorganic biochemistry. 08/2008; 102(7):1424-37.
Platinum-containing compounds are promising antitumor agents, but must enter cells before reaching their main biological target, namely DNA. Their distribution within the body, and hence their activity, is to a large extent determined by their lipophilicity, so there is strong interest in developing computational methods to predict this important property. This study analyses the accuracy of five methods, namely ALOGPS, KOWWIN, CLOGP and two quantum chemical approaches, in predicting octanol/water partition coefficients (logP) for sets of 43 and 12 Pt(II) complexes, collected from the literature and measured by the authors, respectively. All methods gave generally poor results, with a mean absolute error (MAE) of between 0.8 and 3 log units for prediction of new compounds. Extension of the ALOGPS program with data from the literature set resulted in the best prediction ability, MAE = 0.46, for the measured molecules. The program was also able to correctly predict errors in calculated logP values. It is freely available for interactive use at http://www.vcclab.org.
-
3.88 Impact points
Combinatorial QSAR modeling of chemical toxicants tested against Tetrahymena pyriformis.
Journal of chemical information and modeling. 05/2008; 48(4):766-84.
Selecting the most rigorous quantitative structure-activity relationship (QSAR) approaches is of great importance in the development of robust and predictive models of chemical toxicity. To address this issue in a systematic way, we have formed an international virtual collaboratory consisting of six independent groups with shared interests in computational chemical toxicology. We have compiled an aqueous toxicity data set containing 983 unique compounds tested in the same laboratory over a decade against Tetrahymena pyriformis. A modeling set including 644 compounds was selected randomly from the original set and distributed to all groups that used their own QSAR tools for model development. The remaining 339 compounds in the original set (external set I) as well as 110 additional compounds (external set II) published recently by the same laboratory (after this computational study was already in progress) were used as two independent validation sets to assess the external predictive power of individual models. In total, our virtual collaboratory has developed 15 different types of QSAR models of aquatic toxicity for the training set. The internal prediction accuracy for the modeling set ranged from 0.76 to 0.93 as measured by the leave-one-out cross-validation correlation coefficient (Q²_abs). The prediction accuracy for the external validation sets I and II ranged from 0.71 to 0.85 (linear regression coefficient R²_absI) and from 0.38 to 0.83 (linear regression coefficient R²_absII), respectively.
The use of an applicability domain threshold implemented in most models generally improved the external prediction accuracy but at the same time led to a decrease in chemical space coverage. Finally, several consensus models were developed by averaging the predicted aquatic toxicity for every compound using all 15 models, with or without taking into account their respective applicability domains. We find that consensus models afford higher prediction accuracy for the external validation data sets with the highest space coverage as compared to individual constituent models. Our studies prove the power of a collaborative and consensual approach to QSAR model development. The best validated models of aquatic toxicity developed by our collaboratory (both individual and consensus) can be used as reliable computational predictors of aquatic toxicity and are available from any of the participating laboratories.
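The consensus step, averaging predictions over models with or without regard to each model's applicability domain, might look like the minimal sketch below. The function name, the boolean mask representation of the applicability domain and the use of NaN for compounds covered by no model are assumptions made for illustration.

```python
import numpy as np

def consensus_prediction(predictions, inside_ad=None):
    """Average predictions over models, optionally restricted to each model's AD.

    predictions : (n_models, n_compounds) array of predicted toxicities
    inside_ad   : optional boolean array of the same shape; True where the
                  compound falls inside that model's applicability domain
    Compounds that no model covers get NaN, which lowers the coverage.
    """
    predictions = np.asarray(predictions, dtype=float)
    if inside_ad is None:
        return predictions.mean(axis=0)
    mask = np.asarray(inside_ad, dtype=bool)
    counts = mask.sum(axis=0)                       # covering models per compound
    sums = np.where(mask, predictions, 0.0).sum(axis=0)
    out = np.full(predictions.shape[1], np.nan)
    np.divide(sums, counts, out=out, where=counts > 0)
    return out

toy_preds = [[1.0, 2.0], [3.0, 4.0]]
toy_ad = [[True, False], [True, False]]
plain = consensus_prediction(toy_preds)
restricted = consensus_prediction(toy_preds, toy_ad)
coverage = float(np.mean(~np.isnan(restricted)))    # fraction of compounds predicted
```

The toy case shows the trade-off noted in the abstract: restricting to the applicability domain leaves the second compound unpredicted, so coverage drops to one half.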
-
4.93 Impact points
Beyond the 'best' match: machine learning annotation of protein sequences by integration of different sources of information.
Bioinformatics (Oxford, England). 04/2008; 24(5):621-8.
MOTIVATION: Accurate automatic assignment of protein functions remains a challenge for genome annotation. We have developed and compared the automatic annotation of four bacterial genomes employing a 5-fold cross-validation procedure and several machine learning methods. RESULTS: The analyzed genomes were manually annotated with FunCat categories in MIPS providing a gold standard. Features describing a pair of sequences rather than each sequence alone were used. The descriptors were derived from sequence alignment scores, InterPro domains, synteny information, sequence length and calculated protein properties. Following training we scored all pairs from the validation sets, selected a pair with the highest predicted score and annotated the target protein with functional categories of the prototype protein. The data integration using machine-learning methods provided significantly higher annotation accuracy compared to the use of individual descriptors alone. The neural network approach showed the best performance. The descriptors derived from the InterPro domains and sequence similarity provided the highest contribution to the method performance. The predicted annotation scores allow differentiation of reliable versus non-reliable annotations. The developed approach was applied to annotate the protein sequences from 180 complete bacterial genomes. AVAILABILITY: The FUNcat Annotation Tool (FUNAT) is available on-line as Web Services at http://mips.gsf.de/proj/funat.
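The annotation transfer step (score all pairs, pick the highest-scoring pair, copy the prototype's categories) can be sketched as follows. The sketch assumes the pair scores have already been produced by a trained model; the function name, the dictionary layout and the identifiers and category codes in the example are made up for illustration.

```python
def transfer_annotation(target_id, pair_scores, annotations):
    """Annotate a target with the categories of its best-scoring prototype.

    pair_scores : dict mapping (target_id, prototype_id) -> predicted score
    annotations : dict mapping prototype_id -> set of functional categories
    Returns (prototype_id, score, categories); the score can later be used to
    separate reliable from non-reliable annotations.
    """
    candidates = [
        (score, proto)
        for (tgt, proto), score in pair_scores.items()
        if tgt == target_id
    ]
    if not candidates:
        return None, None, set()
    best_score, best_proto = max(candidates)
    return best_proto, best_score, set(annotations.get(best_proto, ()))

# Hypothetical scores and category codes.
toy_scores = {("t1", "p1"): 0.2, ("t1", "p2"): 0.9, ("t2", "p1"): 0.5}
toy_annotations = {"p1": {"01.01"}, "p2": {"20.01", "32.05"}}
result = transfer_annotation("t1", toy_scores, toy_annotations)
```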
-
Calculation of lipophilicity for Pt(II) complexes: experimental comparison of several methods
Chemistry Central Journal. 01/2008;
Following (6)
-
Humayun Sharif
Max-Planck-Gesellschaft
-
Alberto Manganaro
Università degli Studi di Milano-Bicocca
-
Ahmed Mohamed Abdelaziz
Helmholtz Zentrum München
-
Andrea Mauri
Università degli Studi di Milano-Bicocca
-
Antony Williams
The Royal Society of Chemistry