ABSTRACT: Background
The analysis of complex diseases is an important problem in human genetics. Because multifactoriality is expected to play a pivotal role, many studies are currently focused on collecting information on the genetic and environmental factors that potentially influence these diseases. However, there is still a lack of efficient and thoroughly tested statistical models that can be used to identify implicated features and their interactions. Simulations using large biologically realistic data sets with known gene-gene and gene-environment interactions that influence the risk of a complex disease are a convenient and useful way to assess the performance of statistical methods.
The Gene-Environment iNteraction Simulator 2 (GENS2) simulates interactions among two genetic factors and one environmental factor, and also allows for epistatic interactions. GENS2 is based on data with realistic patterns of linkage disequilibrium, and imposes no limitations either on the number of individuals to be simulated or on the number of non-predisposing genetic/environmental factors to be considered. To make the simulator more intuitive, the input parameters are expressed as standard epidemiological quantities. GENS2 is written in Python and takes advantage of operators and modules provided by the simuPOP simulation environment. It can be used through a graphical or a command-line interface and is freely available from http://sourceforge.net/projects/gensim. The software is released under the GNU General Public License version 3.0.
Data produced by GENS2 can be used as a benchmark for evaluating statistical tools designed for the identification of gene-gene and gene-environment interactions.
Full-text · Article · Jun 2012 · BMC Bioinformatics
ABSTRACT: DNA microarray analysis represents a relevant technology in genetic research to explore and recognize possible genomic features of many diseases. Since it is a high-throughput technology, it requires advanced tools for dimensionality reduction in massive data sets. Clustering is among the most appropriate tools for mining these data, although it suffers from the following problems: instability of the results, large number of genes compared with the number of samples, high noise level, complexity of initialization, and grouping genes and samples simultaneously. Almost all these problems can be positively addressed by using novel techniques, such as biclustering. In this paper, a new biclustering algorithm is proposed, hereafter denoted as combinatorial biclustering algorithm (CBA), that addresses the problems listed above. The algorithm analyzes the data, finding biclusters of the desired size and allowable error. The performance of CBA is compared with that of other biclustering algorithms by discussing the output of the different methods when run on a synthetic data set. CBA seems to perform better, and for this reason, it has been applied to study a real data set as well. In particular, CBA has analyzed the transcriptional profile of 38 gastric cancer tissues with (MSI) and without (MSS) microsatellite instability. The results clearly show a much more coherent behavior in gene expression of normal tissues versus tumoral ones. The high level of gene misregulation in tumoral tissues affects any further bicluster analysis, and it is only partially smoothed in the MSI/MSS study, even when admitting a much higher level of initial admissible error.
No preview · Article · May 2012 · Neural Computing and Applications
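The abstract above does not say how CBA measures the "allowable error" of a bicluster. As a hedged illustration of what such a coherence score can look like, the sketch below computes the mean squared residue of Cheng and Church, a widely used bicluster score that is swapped in here for illustration and is not claimed to be the CBA criterion. The function name and the toy matrices are hypothetical.

```python
def mean_squared_residue(matrix, rows, cols):
    """Cheng-and-Church mean squared residue of the submatrix rows x cols.

    Assumption: this is an illustrative coherence score, not the error
    measure used by CBA. A score of 0 means a perfectly additive bicluster.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_r, n_c = len(rows), len(cols)
    row_means = [sum(r) / n_c for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_r)) / n_r for j in range(n_c)]
    overall = sum(row_means) / n_r
    total = 0.0
    for i in range(n_r):
        for j in range(n_c):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_r * n_c)


# Hypothetical toy data: every row is a shifted copy of the first one.
additive = [[1, 2, 3], [2, 3, 4], [4, 5, 6]]
checkerboard = [[0, 1], [1, 0]]
```

A search procedure could then accept a candidate bicluster only if its score stays below a user-chosen threshold, which is one plausible reading of "desired size and allowable error".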
ABSTRACT: Many decades of scientific investigation have proved the role of selective pressure in Homo sapiens, at least at the level of individual genes or loci. Nevertheless, there are examples of polygenic traits that are bound to be under selection, but studies applying population genetics methods to unveil such occurrences are still lacking. Stature provides a relevant example of a well-studied polygenic trait that is known to be under selection and for which a genome-wide association study identifying the genes involved is now available. We studied the behavior of F(ST) in a simulated toy model to detect population differentiation on a generic polygenic phenotype under selection. The simulations showed that the set of alleles involved in the trait has a higher mean F(ST) value than those undergoing genetic drift only. In view of this, we looked for an increase in the mean F(ST) value of the 180 variants associated with human height. For this set of alleles we found F(ST) to be significantly higher than the genomic background (p = 0.0356). On the basis of a statistical analysis we excluded that the increase was just due to the presence of outliers. We also showed that local adaptation phenomena play only a marginal role, even on different phenotypes in linkage disequilibrium with genetic variants involved in height. The increase of F(ST) for the set of alleles involved in a polygenic trait seems to provide an example of a symmetry-breaking phenomenon concerning population differentiation. The splitting in the allele frequencies would be driven by the initial conditions in the population dynamics, which are stochastically modified by events like drift and bottlenecks, and by other stochastic events such as the appearance of new mutations.
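One simple way to compare the mean F(ST) of a variant set with the genomic background is a resampling test. The sketch below is a minimal illustration under stated assumptions: the abstract does not describe the test that produced p = 0.0356, so the function name, the one-sided design, and the add-one correction are choices made here, not the authors' procedure.

```python
import random
from statistics import mean


def empirical_p_value(set_values, background_values, n_resamples, rng):
    """One-sided resampling p-value for 'the set mean exceeds the background'.

    Assumption: random variant sets of the same size are drawn without
    replacement from the background; matching on allele frequency or
    linkage disequilibrium, which a real analysis may need, is omitted.
    """
    observed = mean(set_values)
    k = len(set_values)
    hits = 0
    for _ in range(n_resamples):
        if mean(rng.sample(background_values, k)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical usage with made-up F(ST) values.
rng = random.Random(42)
background = [rng.uniform(0.0, 0.2) for _ in range(1000)]
trait_set = [0.9, 0.85, 0.95]
p = empirical_p_value(trait_set, background, 999, rng)
```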
ABSTRACT: Many natural phenomena are directly or indirectly related to latitude. Living at different latitudes means being exposed to different climates, diets, light/dark cycles, etc. In humans, one of the best known examples of genetic traits following a latitudinal gradient is skin pigmentation. Several diseases also show latitudinal clines, such as hypertension, cancer, dysmetabolic conditions, schizophrenia, Parkinson's disease and many more.
We investigated, for the first time on a wide genomic scale, the latitude-driven adaptation phenomena. In particular, we selected a set of genes showing signs of latitude-dependent population differentiation. The biological characterization of these genes showed enrichment for neural-related processes. In light of this, we investigated whether genes associated with neuropsychiatric diseases were enriched in Latitude-Related Genes (LRGs). We found a strong enrichment of LRGs in the set of genes associated with schizophrenia. In an attempt to explain this possible link between latitude and schizophrenia, we investigated their associations with vitamin D. We found in a set of vitamin D related genes a significant enrichment of both LRGs and of genes involved in schizophrenia.
Our results suggest a latitude-driven adaptation for both schizophrenia and vitamin D related genes. In addition we confirm, at a molecular level, the link between schizophrenia and vitamin D. Finally, we discuss a model in which schizophrenia is, at least partly, a maladaptive by-product of latitude dependent adaptive changes in vitamin D metabolism.
Full-text · Article · Nov 2010 · BMC Evolutionary Biology
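Enrichment of one gene list within another, as discussed in the abstract above, is often assessed with a one-sided hypergeometric test. The sketch below is an illustration only: the abstract does not state which enrichment test was used, and the function name and the gene counts in the usage example are hypothetical.

```python
from math import comb


def enrichment_p_value(n_genome, n_marked, n_set, n_overlap):
    """P(X >= n_overlap) for X ~ Hypergeometric(n_genome, n_marked, n_set).

    n_genome:  genes in the background
    n_marked:  background genes carrying the label (e.g. latitude-related)
    n_set:     genes in the set being tested (e.g. disease-associated)
    n_overlap: labelled genes observed in the tested set
    """
    upper = min(n_marked, n_set)
    total = comb(n_genome, n_set)
    tail = sum(
        comb(n_marked, k) * comb(n_genome - n_marked, n_set - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts, chosen only to show the call.
p_enrich = enrichment_p_value(n_genome=20000, n_marked=500, n_set=300, n_overlap=20)
```

Exact integer arithmetic through `math.comb` avoids the rounding problems of factorial-based formulas; for very large gene universes a log-space implementation may be preferable.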
ABSTRACT: Complex diseases are multifactorial traits caused by both genetic and environmental factors. They represent the major part of human diseases and include those with the largest prevalence and mortality (cancer, heart disease, obesity, etc.). Despite the large amount of information that has been collected about both genetic and environmental risk factors, there are few examples of studies on their interactions in the epidemiological literature. One reason may be the incomplete knowledge of the power of statistical methods designed to search for risk factors and their interactions in these data sets. An improvement in this direction would lead to a better understanding and description of gene-environment interactions. To this aim, a possible strategy is to challenge the different statistical methods against data sets where the underlying phenomenon is completely known and fully controllable, for example simulated ones.
We present a mathematical approach that models gene-environment interactions. With this method it is possible to generate simulated populations having gene-environment interactions of any form, involving any number of genetic and environmental factors and also allowing non-linear interactions such as epistasis. In particular, we implemented a simple version of this model in a Gene-Environment iNteraction Simulator (GENS), a tool designed to simulate case-control data sets where a one gene-one environment interaction influences the disease risk. The main aim has been to allow the input of population characteristics by using standard epidemiological measures and to implement constraints to make the simulator behaviour biologically meaningful.
With the multi-logistic model implemented in GENS it is possible to simulate case-control samples of complex disease where gene-environment interactions influence the disease risk. The user has full control of the main characteristics of the simulated population, and a Monte Carlo process allows random variability. A knowledge-based approach reduces the complexity of the mathematical model by using reasonable biological constraints and makes the simulation more understandable in biological terms. Simulated data sets can be used for the assessment of novel statistical methods or for the evaluation of the statistical power when designing a study.
Full-text · Article · Jan 2010 · BMC Bioinformatics
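The idea of simulating case-control data with a one gene-one environment interaction can be sketched in a few lines. What follows is not the GENS model: it is a minimal illustration that assumes an ordinary logistic risk function with additive genotype coding, Hardy-Weinberg genotype sampling, and a standard normal exposure. All function names and coefficient values are hypothetical.

```python
import math
import random


def disease_probability(genotype, exposure, beta0, beta_g, beta_e, beta_ge):
    """Logistic disease risk with a gene-environment interaction term.

    Assumption: genotype is the count of risk alleles (0, 1 or 2).
    """
    logit = beta0 + beta_g * genotype + beta_e * exposure + beta_ge * genotype * exposure
    return 1.0 / (1.0 + math.exp(-logit))


def simulate_population(n, risk_allele_freq, rng, **betas):
    """Monte Carlo draw of (genotype, exposure, affected) for n individuals."""
    individuals = []
    for _ in range(n):
        # Two independent allele draws, i.e. Hardy-Weinberg proportions.
        genotype = sum(rng.random() < risk_allele_freq for _ in range(2))
        exposure = rng.gauss(0.0, 1.0)
        affected = rng.random() < disease_probability(genotype, exposure, **betas)
        individuals.append((genotype, exposure, affected))
    return individuals


def sample_case_control(individuals, n_cases, n_controls, rng):
    """Draw a case-control sample from the simulated population."""
    cases = [ind for ind in individuals if ind[2]]
    controls = [ind for ind in individuals if not ind[2]]
    return rng.sample(cases, n_cases), rng.sample(controls, n_controls)


rng = random.Random(0)
population = simulate_population(
    2000, 0.3, rng, beta0=-2.0, beta_g=0.4, beta_e=0.3, beta_ge=0.2
)
cases, controls = sample_case_control(population, 50, 50, rng)
```

Because the generating coefficients are known, a statistical method run on such a sample can be scored on whether it recovers the interaction term, which is the benchmarking use the abstract describes.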
ABSTRACT: Genetic differences both between individuals and populations are studied for their evolutionary relevance and for their potential medical applications. Most of the genetic differentiation among populations is caused by random drift, which should affect all loci across the genome in a similar manner. When a locus shows extraordinarily high or low levels of population differentiation, this may be interpreted as evidence for natural selection. The most used measure of population differentiation was devised by Wright and is known as the fixation index, or F(ST). We performed a genome-wide estimation of F(ST) on about 4 million SNPs from HapMap project data. We demonstrated a heterogeneous distribution of F(ST) values between autosomes and heterochromosomes. When we compared the F(ST) values obtained in this study with another evolutionary measure obtained by a comparative interspecific approach, we found that genes under positive selection appeared to show low levels of population differentiation. We applied a gene set approach, widely used for microarray data analysis, to detect functional pathways under selection. We found that one pathway related to antigen processing and presentation showed low levels of F(ST), while several pathways related to cell signalling, growth and morphogenesis showed high F(ST) values. Finally, we detected a signature of selection within genes associated with human complex diseases. These results can help to identify which processes occurred during human evolution and adaptation to different environments. They also support the hypothesis that common diseases could have a genetic background shaped by human evolution.
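For a single biallelic SNP, Wright's fixation index can be written as F(ST) = (H_T - H_S) / H_T, where H_T is the expected heterozygosity of the pooled population and H_S is the mean expected heterozygosity within subpopulations. The sketch below implements this textbook definition from allele frequencies. It is an assumption that this form is adequate for illustration: the abstract does not say which estimator was applied to the HapMap data, and sample-size-corrected estimators give somewhat different values.

```python
def fst(subpop_freqs, subpop_sizes=None):
    """Wright's F_ST for one biallelic locus from subpopulation allele frequencies.

    Assumption: frequencies are treated as known population values, so no
    sampling correction is applied. Subpopulations are weighted equally
    unless subpop_sizes is given.
    """
    k = len(subpop_freqs)
    if subpop_sizes is None:
        subpop_sizes = [1.0] * k
    total = sum(subpop_sizes)
    weights = [s / total for s in subpop_sizes]
    p_bar = sum(w * p for w, p in zip(weights, subpop_freqs))
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    if h_t == 0.0:
        return 0.0  # locus is monomorphic overall, no differentiation to measure
    h_s = sum(w * 2.0 * p * (1.0 - p) for w, p in zip(weights, subpop_freqs))
    return (h_t - h_s) / h_t
```

With frequencies 0.2 and 0.8 in two equally weighted populations, H_T = 0.5 and H_S = 0.32, so F(ST) = 0.36; identical frequencies give 0 and fixed opposite alleles give 1.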
ABSTRACT: The huge growth in gene expression data calls for the implementation of automatic tools for data processing and interpretation.
We present a new and comprehensive machine learning data mining framework consisting of a non-linear PCA neural network for feature extraction, and probabilistic principal surfaces combined with an agglomerative approach based on Negentropy, aimed at clustering gene microarray data. The method, which provides a user-friendly visualization interface, can work on noisy data with missing points and represents an automatic procedure to obtain, with no a priori assumptions, the number of clusters present in the data. A cell-cycle dataset and a detailed analysis confirm the biological nature of the most significant clusters.
The software described here is a subpackage of the ASTRONEURAL package and is available upon request from the corresponding author.
Supplementary data are available at Bioinformatics online.
ABSTRACT: Microarrays are among the most powerful tools in biological research, but in order to attain their full potential, it is imperative to develop techniques capable of effectively exploiting the huge quantity of data which they produce. In this paper two machine learning methodologies for microarray data analysis are proposed: (1) Probabilistic Principal Surfaces (PPS), a nonlinear latent variable model which offers very appealing visualization and classification abilities and can be effectively employed for clustering purposes. More specifically, the PPS method builds a probability density function of a given data set of patterns, lying in a D-dimensional space (with D ≫ 3), expressed in terms of a fixed number of latent variables, lying in a Q-dimensional space (Q is usually 2 or 3), which can be used (after a proper manipulation) to visualize, classify and cluster the data; (2) Competitive Evolution on Data (CED), an evolutionary system in which the possible solutions (cluster centroids) compete to conquer the largest possible number of resources (data) and thus partition the input data set in clusters. We discuss the application of both methods to the analysis of microarray data obtained for the yeast genome.
No preview · Article · Nov 2005 · Journal of Computational and Theoretical Nanoscience
ABSTRACT: The aim of this work is to apply a novel comprehensive machine learning tool for data mining to the preprocessing and interpretation of gene expression data. Furthermore, some visualization facilities are provided. The data mining framework consists of two main parts: preprocessing and clustering-agglomerating phases. The first phase comprises a noise filtering procedure and a non-linear PCA Neural Network for feature extraction. The second phase accomplishes an unsupervised clustering based on a hierarchy of two approaches: Probabilistic Principal Surfaces to obtain the rough regions of interesting points, and a Fisher-Negentropy information based approach to agglomerate the regions previously found in order to discover substructures present in the data. Experiments on gene microarray data are made. Several experiments are shown varying the threshold, needed by the agglomerative clustering, to understand the structure of the analyzed data set.
ABSTRACT: Due to recent technological advances, data mining in massive data sets has evolved into a crucial research field for many if not all areas of research, from astronomy to high energy physics to genetics. In this paper we discuss an implementation of Probabilistic Principal Surfaces (PPS) which was developed within the framework of the AstroNeural collaboration. PPS are a nonlinear latent variable model which may be regarded as a complete mathematical framework to accomplish some fundamental data mining activities such as visualization, clustering and classification of high dimensional data. The effectiveness of the proposed model is exemplified with reference to a complex astronomical data set.
ABSTRACT: Bioinformatics systems benefit from the use of data mining strategies to locate interesting and pertinent relationships within massive information. For example, data mining methods can ascertain and summarize the set of genes responding to a certain level of stress in an organism. Even a cursory glance through the literature in journals reveals the persistent role of data mining in experimental biology. Integrating data mining within the context of experimental investigations is central to bioinformatics software. In this paper we describe the framework of probabilistic principal surfaces, a latent variable model which offers a large variety of appealing visualization capabilities and which can be successfully integrated in the context of microarray analysis. A preprocessing phase has been added to this framework, consisting of a nonlinear PCA neural network which seems to be very useful for dealing with the noisy and time-dependent nature of microarray data.
ABSTRACT: Probabilistic Principal Surfaces (PPS) offer very powerful visualization and classification capabilities and overcome most of the shortcomings of other neural tools such as SOM, GTM, etc. More specifically, PPS build a probability density function of a given data set of patterns lying in a D-dimensional space (with D >> 3), which can be expressed in terms of a limited number of latent variables lying in a Q-dimensional space (Q is usually 2-3) and which can be used to visualize the data in the latent space. PPS may also be arranged in ensembles to tackle very complex classification tasks. Competitive Evolution on Data (CED) is instead an evolutionary system in which the possible solutions (cluster centroids) compete to conquer the largest possible number of resources (data) and thus partition the input data set in clusters. We discuss the application of Spherical-PPS to two data sets coming, respectively, from astronomy (Great Observatory Origins Deep Survey) and from genetics (microarray data from the yeast genome), and of CED to the genetics data only.
ABSTRACT: Gene-expression microarrays make it possible to simultaneously measure the rate at which a cell or tissue is expressing each of its thousands of genes. One can use these comprehensive snapshots of biological activity to infer regulatory pathways in cells, identify novel targets for drug design, and improve diagnosis, prognosis, and treatment planning for those suffering from disease. However, the amount of data this new technology produces is more than one can manually analyze. Hence, the need for automated analysis of microarray data offers an opportunity for machine learning to have a significant impact on biology and medicine. Probabilistic Principal Surfaces defines a unified theoretical framework for nonlinear latent variable models embracing the Generative Topographic Mapping as a special case. This article describes the use of PPS for the analysis of yeast gene expression levels from microarray chips, showing its effectiveness for high-D data visualization and clustering.
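The statement that PPS embraces the Generative Topographic Mapping as a special case can be made concrete through the noise covariance of the model. The expression below is a sketch of the unified PPS formulation as it is usually presented in the literature, not a formula quoted from this abstract; the symbols (clamping factor α, inverse noise variance β, tangent vectors e_q and normal vectors e_d at the latent point x) are introduced here as assumptions.

```latex
% Oriented covariance of the Gaussian centred on the manifold point y(x; W),
% for data in D dimensions and a Q-dimensional latent space (assumed notation).
\Sigma(x) \;=\; \frac{\alpha}{\beta} \sum_{q=1}^{Q} e_q(x)\, e_q(x)^{\mathsf T}
\;+\; \frac{D - \alpha Q}{\beta\,(D - Q)} \sum_{d=Q+1}^{D} e_d(x)\, e_d(x)^{\mathsf T},
\qquad 0 < \alpha < \frac{D}{Q}.

% Setting \alpha = 1 makes both coefficients equal to 1/\beta, so that
\Sigma(x)\big|_{\alpha = 1} \;=\; \frac{1}{\beta}\, I_D ,
% which is the spherical noise model of GTM.
```

Under this reading, values of α below 1 stretch the noise orthogonally to the manifold, while α = 1 recovers the isotropic case.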
ABSTRACT: Recent technological advances are producing huge data sets in almost all fields of scientific research, from astronomy to genetics. Although each research field often requires ad hoc, fine-tuned procedures to properly exploit all the information inherently present in the data, there is an urgent need for a new generation of general computational theories and tools capable of supporting most human data analysis activities. Here, we propose probabilistic principal surfaces (PPS) as an effective high-D data visualization and clustering tool for data mining applications, emphasizing its flexibility and generality of use in data-rich fields. In order to better illustrate the potential of the method, we also provide a real-world case study by discussing the use of PPS for the analysis of yeast gene expression levels from microarray chips.
ABSTRACT: Microarray technologies represent a powerful tool in biological research, but in order to attain their full potential, it is crucial to develop techniques to effectively exploit the huge quantity of data produced. We propose an innovative tool specifically tailored to perform preprocessing, visualization and clustering on this type of data. The improvements with respect to more traditional techniques are: (a) Preprocessing: a noise estimation method is developed to provide a formal, fast and accurate method to filter out the noisiest genes. The remaining genes are then processed using a nonlinear PCA model which extracts, from each microarray experiment, a smaller number of features; (b) Visualization: the preprocessed genes represent the input of a Probabilistic Principal Surfaces (PPS) model, a latent variable model which has been shown to be very effective for data mining purposes; (c) Clustering: the trained PPS is used as the basis for a clustering algorithm, based on Negentropy information, able to compute, in an automatic way, the number of natural clusters inherently present in the data. All these phases are managed by means of a graphical user interface which provides further tools for data mining activities and gives the possibility to dynamically interact with the data. The tool was tested on yeast gene microarray data in order to find the groupings of co-regulated genes.