Article

Mining DNA sequences to predict sites which mutations cause genetic diseases.

Knowledge-Based Systems (impact factor: 2.42). 01/2002; 15(4):225-233.
Source: DBLP

ABSTRACT Single nucleotide polymorphism (SNP) analysis is becoming a crossroads of bioinformatics and medicine. We have developed a data mining system called rSNP_Guide (http://wwwmgs.bionet.nsc.ru/mgs/systems/rsnp/) to discover regulatory sites in DNA sequences whose mutations could cause genetic diseases. In the first step, we estimate the ability of each protein under consideration to bind to the genomic DNA whose alteration by mutations is associated with the genetic disease under study. In the second step, we formalize the disease-associated experimental data on the SNP-related alterations in DNA binding to an unknown protein. In the third step, we apply fuzzy clustering to all the known proteins examined in order to identify the one whose specific site is altered by the mutations in a manner consistent with that of the unknown protein experimentally associated with the genetic disease. In the fourth step, we predict the known protein whose binding site is (i) present on the DNA and (ii) altered by the mutations associated with the genetic disease. Finally, in the last step, we estimate the robustness of this prediction. rSNP_Guide has been tested on SNPs with known relationships between regulatory site alterations and genetic disease penetration. In addition, novel SNP-related regulatory sites associated with genetic disease penetration were discovered and then successfully confirmed experimentally.
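
The core matching idea in the abstract (compare how mutations change a known protein's predicted binding with how they change the binding observed experimentally) can be illustrated with a minimal sketch. This is not the rSNP_Guide implementation: the toy position weight matrices, the protein names `protA` and `protB`, the sequences, the uniform background frequency, and the `min_change` threshold are all hypothetical, and the fuzzy clustering and robustness steps are omitted.

```python
from math import log

BASES = "ACGT"
BACKGROUND = 0.25  # assumption: uniform background base frequency


def pwm_from_consensus(consensus, p_match=0.85):
    """Build a toy position weight matrix: the consensus base gets p_match,
    the other three bases share the remaining probability equally."""
    p_other = (1.0 - p_match) / 3.0
    return [{b: (p_match if b == c else p_other) for b in BASES}
            for c in consensus]


def site_score(pwm, site):
    """Log-odds score of one site of the same width as the matrix."""
    return sum(log(col[base] / BACKGROUND) for col, base in zip(pwm, site))


def best_score(pwm, seq):
    """Best log-odds score over all windows of the sequence."""
    width = len(pwm)
    return max(site_score(pwm, seq[i:i + width])
               for i in range(len(seq) - width + 1))


def consistent_proteins(pwms, wild, mutants, min_change=0.5):
    """Return the proteins whose predicted binding change has the same sign
    as the observed change for every mutant allele.

    mutants maps a mutant sequence to the observed direction of change in
    binding: +1 (increased) or -1 (decreased)."""
    hits = []
    for name, pwm in pwms.items():
        wild_score = best_score(pwm, wild)
        if all(observed * (best_score(pwm, seq) - wild_score) > min_change
               for seq, observed in mutants.items()):
            hits.append(name)
    return hits


# Hypothetical data: two made-up proteins, one wild-type allele, and one
# mutant allele for which binding was observed to decrease.
pwms = {
    "protA": pwm_from_consensus("GCCA"),
    "protB": pwm_from_consensus("TTGA"),
}
wild = "AAGCCATT"
mutants = {"AAGCTATT": -1}

candidates = consistent_proteins(pwms, wild, mutants)
print(candidates)
```

The `min_change` threshold keeps negligible score differences (including floating-point noise) from counting as a predicted alteration. A real system would replace the toy matrices with experimentally derived binding models.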

  • Article: Point mutations within 663-666 bp of intron 6 of the human TDO2 gene, associated with a number of psychiatric disorders, damage the YY-1 transcription factor binding site.
    ABSTRACT: Single base mutations G→A at position 663 and G→T at position 666 of intron 6 of the human tryptophan oxygenase gene (TDO2) are associated with a variety of psychiatric disorders [Comings, D.E. et al. (1996) Pharmacogenetics 6, 307-318]. Binding of rat liver nuclear extract proteins to synthetic double-stranded oligonucleotides corresponding to three allelic states of the region between 651 bp and 680 bp of human TDO2 intron 6 has been studied by gel shift assay. It has been demonstrated that each allelic state of the region has a specific set of proteins that interacts with it. With the aid of computer analysis and using specific anti-YY-1 antibodies it has been shown that both mutations damage the YY-1 transcription factor binding site.
    FEBS Letters 11/1999; 462(1-2):85-8. · 3.54 Impact Factor
  • Article: Automated extraction of information in molecular biology.
    ABSTRACT: We review data mining techniques in molecular biology, specifically those that extract information from the scientific literature itself. As more of the biological literature is published electronically, there is an opportunity, and even a need, to automatically summarize the literature in a customized way, for example by associating keywords to a topic. These keywords can be extracted from relevant publications. The process of keyword extraction can be automated and optimized to keep literature pointers automatically up-to-date or to filter relevant information from the literature. To illustrate these points, OMIM (Online Mendelian Inheritance in Man), a database of human inherited diseases, was linked to the literature and keywords were derived that covered distinct aspects such as genetic information on the one hand and disease-specific protein and phenotypic information on the other. They were used to extract information that is helpful for keeping entries about disease up-to-date.
    FEBS Letters 07/2000; 476(1-2):12-7. · 3.54 Impact Factor
  • Article: Mining of biological data II: assessing data structure and class homogeneity by cluster analysis.
    ABSTRACT: An important step in data analysis is class assignment which is usually done on the basis of a macroscopic phenotypic or bioprocess characteristic, such as high vs low growth, healthy vs diseased state, or high vs. low productivity. Unfortunately, such an assignment may lump together samples, which when derived from a more detailed phenotypic or bioprocess description are dissimilar, giving rise to models of lower quality and predictive power. In this paper we present a clustering algorithm for data preprocessing which involves the identification of fundamentally similar lots on the basis of the extent of similarity among the system variables. The algorithm combines aspects of cluster analysis and principal component analysis by applying agglomerative clustering methods to the first principal component of the system data matrix. As part of a rational strategy for developing empirical models, this technique selects lots (samples) which are most appropriate for inclusion in a training set by analyzing multivariate data homogeneity. Samples with similar data structures are identified and grouped together into distinct clusters. This knowledge is used in the formation of potential training sets. Additionally, this technique can identify atypical lots, i.e., samples that are not simply outliers but exhibit the general properties of one class but have been given the assignment of the other. The method is presented along with examples from its application to fermentation data sets.
    Metabolic Engineering 08/2000; 2(3):228-38. · 5.61 Impact Factor


Keywords

data mining system
disease-associated experimental data
fourth step
genetic disease penetration
genetic disease penetrations
genetic diseases
genomic DNA
last step
second step
single nucleotide polymorphism