KDD Cup 2001 Data Analysis: Prediction of Molecular Bioactivity for Drug Design-Binding to Thrombin

... The input variables of such problems may include clinical variables from medical examinations, laboratory test results, or the measurements of high throughput assays like DNA microarrays. Other examples are found in the prediction of biochemical properties such as the binding of a molecule to a drug target ([46,7]; see below). The input variables of such problems may include physico-chemical descriptors of the drug candidate molecule such as the presence or absence of chemical groups and their relative position. ...
... The latter is due to the test molecules being compounds engineered based on previous (training-set) results. [Figure caption:] Bar-plot histogram of all entries in the competition (e.g., the bin labelled '68' gives the number of competition entries with performance in the range from 64 to 68), as well as results from [46] using inductive (dashed line) and transductive (solid line) feature selection systems. ...
... There were more than 100 entries in the KDD Cup for the Thrombin dataset alone, with the winner achieving a performance score of 68%. After the competition took place, a 75% success rate was obtained by performing feature selection with a type of correlation score designed to cope with the small number of positive examples, combined with an SVM [46]. See Figure 6 for an overview of the results of all entries to the competition, as well as the results of [46]. ...
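The two-stage recipe described above (a class-imbalance-aware correlation score for feature selection, followed by an SVM) can be sketched as follows. The Golub-style signal-to-noise score and the synthetic sparse binary data below are illustrative assumptions, not the exact choices of [46]:

```python
# Hedged sketch: rank features by a signal-to-noise correlation score,
# then train a linear SVM on the top-k features. The Golub-style score
# |mu+ - mu-| / (sigma+ + sigma-) and the synthetic sparse binary data
# below are illustrative assumptions, not the exact choices of [46].
import numpy as np
from sklearn.svm import LinearSVC

def correlation_scores(X, y, eps=1e-8):
    """Per-feature |mu+ - mu-| / (sigma+ + sigma- + eps)."""
    pos, neg = X[y == 1], X[y == 0]
    num = np.abs(pos.mean(axis=0) - neg.mean(axis=0))
    den = pos.std(axis=0) + neg.std(axis=0) + eps
    return num / den

rng = np.random.default_rng(0)
n, d, k = 200, 1000, 20                        # few samples, many features
X = (rng.random((n, d)) < 0.05).astype(float)  # sparse binary descriptors
y = (X[:, :5].sum(axis=1) > 0).astype(int)     # only features 0..4 matter

scores = correlation_scores(X, y)
top = np.argsort(scores)[::-1][:k]             # indices of the k best features
clf = LinearSVC(C=1.0).fit(X[:, top], y)
train_acc = clf.score(X[:, top], y)
```

Because the score divides the class-mean gap by within-class spread rather than relying on counts alone, it stays usable when positives are rare, which is the property the excerpt attributes to the score of [46].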
We briefly describe the main ideas of statistical learning theory, support vector machines, and kernel feature spaces. In addition, we present an overview of applications of kernel methods in bioinformatics.
... This classification does not involve first making an inductive generalization from data about previous cases and background assumptions, maybe using inference to the best explanation, and then deducing a conclusion about the new case from that inductive generalization. Vladimir Vapnik and others have developed methods of transduction that in certain cases give better results than inductive methods that infer general principles (Joachims 1999, Vapnik 2000, Weston et al. 2003, Goutte et al. 2004). And, if transduction is sometimes more reasonable in nonmoral reasoning, perhaps it can also be more reasonable in moral reasoning. ...
... Transduction performs considerably better than other methods in certain difficult real-life situations involving high-dimensional feature spaces where there is relatively little data (Joachims 1999, Weston et al. 2003, Goutte et al. 2004). ...
... What is important and not merely terminological is that, under certain conditions, transduction gives considerably better results in practice than those obtained from methods that use labeled data to infer a rule which is then used to classify new cases (Joachims 1999, Vapnik 2000, Weston et al. 2003, Goutte et al. 2004). ...
... Lugosi 1996) is the basic theory behind contemporary machine learning and pattern recognition. We suggest that the theory provides an excellent framework for the philosophy of induction (see also Harman and Kulkarni 2007). Inductive reasons are often compared with deductive reasons. Deductive reasons for a conclusion guarantee the conclusion in the sense that the truth of the reasons guarantees the truth of the conclusion. Not so for inductive reasons, which typically do not provide the same sort of guarantee. One part of the philosophy of induction is concerned with saying what guarantees there are for various inductive methods. There are various paradigmatic approaches to specifying the problem of induction. For example, Reichenbach (1949) argued, roughly, that induction works in the long run if anything works in the long run. His proposal has been followed up in interesting ways in the learning-in-the-limit literature (e.g., Putnam 1963, Osherson et al. 1982, Kelly 1996, Schulte 2002). The paradigm here is to envision a potentially infinite data stream of labeled items, a question Q about that stream, and a method M that proposes an answer to Q given each finite initial sequence of ...
... Vapnik (1979) describes a method of inference, which he has more recently (1998, 2000, p. 293) called "transduction," a method that infers directly from data to the classification of new cases as they come up. Under certain conditions, transduction gives considerably better results than those obtained from methods that use data to infer a rule that is then used to classify new cases (Joachims 1999, Vapnik 2000, Weston et al. 2003, Goutte et al. 2004). More generally, the problem of induction as we have described it, the problem of finding reliable inductive methods, can be fruitfully investigated, and is being fruitfully investigated, in statistical learning theory (Vapnik 1998, Kulkarni et al. 1998, Hastie et al. 2001). ...
Over the last 20 years in biotechnology, the production of recombinant proteins has been a crucial bioprocess in both the biopharmaceutical and research arenas, in terms of human health, scientific impact and economic volume. Although logical genetic-engineering strategies have been established, protein overexpression is still an art. In particular, heterologous expression is often hindered by low production levels and frequent failures for opaque reasons. The problem is accentuated because no generic solution is available to enhance heterologous overexpression. For a given protein, the extent of its solubility can indicate the quality of its function, yet over 30% of synthesized proteins are not soluble. Under given experimental conditions, including temperature, expression host, etc., protein solubility is ultimately determined by the protein's sequence. Numerous machine learning based methods have therefore been proposed to predict protein solubility from the amino acid sequence alone, but despite 20 years of research on the matter, no comprehensive review of the published methods has been available. This paper presents an extensive review of the existing models for predicting protein solubility in the Escherichia coli recombinant protein overexpression system. The models are investigated and compared with regard to the datasets used, features, feature selection methods, machine learning techniques and prediction accuracy, and a discussion of the models is provided at the end. This study aims to examine the machine learning based methods for predicting recombinant protein solubility extensively, so as to offer both a general and a detailed understanding for researchers in the field. Some of the models deliver acceptable prediction performance and convenient user interfaces, and can be considered valuable tools for predicting recombinant protein overexpression results before performing real laboratory experiments, thus saving labour, time and cost.
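The sequence-to-features step shared by the predictors this abstract reviews can be sketched minimally. Amino-acid composition is the simplest featurization used in this literature; the example sequence below is arbitrary, and the resulting vector would feed whatever classifier a given model uses:

```python
# Hedged sketch: the reviewed predictors map an amino-acid sequence to a
# fixed-length feature vector before classification. Amino-acid
# composition is the simplest such featurization; the example sequence
# is arbitrary.
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"      # the 20 standard residues

def composition(seq):
    """Fraction of each standard amino acid in the sequence."""
    counts = Counter(seq)
    n = len(seq)
    return [counts.get(aa, 0) / n for aa in AMINO_ACIDS]

vec = composition("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```

Richer models in the review add dipeptide frequencies, predicted secondary structure, or physico-chemical scales, but all reduce a variable-length sequence to a fixed-length vector in this same way.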
Wet-laboratory mutagenesis to determine enzyme activity changes is expensive and time-consuming. This paper expands on standard one-shot learning by proposing an incremental transductive method (T2bRF) for the prediction of enzyme mutant activity during mutagenesis, using Delaunay tessellation and 4-body statistical potentials for representation. Incremental learning is in tune with both eScience and actual experimentation, as it accounts for the cumulative annotation of enzyme mutant activity over time. The experimental results reported, using cross-validation, show that the proposed incremental transductive method, with random forest as the base classifier, yields better overall results than one-shot learning methods: T2bRF achieves 90% on T4 and LAC (and 86% on HIV-1), significantly better than state-of-the-art competing methods, whose performance is 80% or less on the same datasets.
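The representation step named in this abstract can be sketched with SciPy's Delaunay tessellation, assuming residue positions are available as 3-D coordinates (e.g., C-alpha atoms). Random points stand in for a real structure, and the 4-body statistical potentials themselves, lookup tables over residue quadruplets, are not reproduced here:

```python
# Hedged sketch of the representation step only: a Delaunay tessellation
# over residue coordinates yields residue quadruplets (tetrahedra), the
# inputs to 4-body statistical potentials. Random points stand in for
# C-alpha coordinates; the potential tables are not reproduced here.
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
coords = rng.random((30, 3))          # stand-in for 30 residue positions
tess = Delaunay(coords)
tetrahedra = tess.simplices           # each row: 4 residue indices
```

Each row of `tetrahedra` names four spatially adjacent residues; scoring those quadruplets against a statistical-potential table would produce the fixed-length features fed to the random-forest base classifier.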
The objective of this feasibility study is to introduce machine learning algorithms, in combination with general regression and Cox proportional-hazards regression, to predict the outcome of disease management. Using the delay in the receipt of adjuvant chemotherapy and the SEER-Medicare databases as a proof of principle, we conclude that general regression and Cox proportional-hazards regression, following feature selection, can identify factors that predict the delay and the impact of the delay on survival outcome.
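The Cox model mentioned above is fitted by maximizing its partial likelihood; a minimal sketch of that objective (Breslow convention for ties, toy data, feature selection omitted) is:

```python
# Hedged sketch: Cox proportional-hazards regression maximizes the
# partial likelihood below (Breslow convention). In the two-stage design
# described above, feature selection would run first and only the
# selected covariates would enter X. Toy data; no real SEER-Medicare
# fields are used.
import numpy as np

def cox_partial_loglik(beta, X, time, event):
    """Breslow partial log-likelihood for a Cox PH model."""
    eta = X @ beta
    ll = 0.0
    for i in np.where(event == 1)[0]:
        at_risk = time >= time[i]          # risk set at this event time
        ll += eta[i] - np.log(np.exp(eta[at_risk]).sum())
    return ll

X = np.array([[1.0], [0.0], [2.0], [1.5]])   # one toy covariate
time = np.array([5.0, 3.0, 8.0, 2.0])        # follow-up times
event = np.array([1, 1, 0, 1])               # 1 = event observed
ll0 = cox_partial_loglik(np.zeros(1), X, time, event)
# at beta = 0 each event term is -log(|risk set|): -(log 4 + log 3 + log 2)
```

A fitted model maximizes this quantity over `beta`; the censored subject (event = 0) contributes only through the risk sets, which is how the model uses incomplete follow-up.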