Article

Parallel evolutionary computation for multiobjective gene interaction analysis

Authors:
  • IST, Universidade de Lisboa, Portugal

Abstract

Multiple studies provide evidence on the impact of certain gene interactions in the occurrence of diseases. Due to the complexity of genotype–phenotype relationships, highly efficient algorithmic strategies are required that can successfully identify high-order interactions according to different evaluation criteria. This work investigates parallel evolutionary computation approaches for multiobjective gene interaction analysis. A multiobjective genetic algorithm, with novel optimized design features, is developed and parallelized under problem-independent and problem-dependent schemes. Experimental results show the relevant performance of the method for complex interaction orders, significantly accelerating execution time (up to 296×) with regard to other state-of-the-art multiobjective tools.
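As a rough illustration of what "multiobjective" means for a candidate SNP combination, the following minimal sketch filters a set of score vectors down to its non-dominated (Pareto) front. It is not the algorithm from the paper: the function names, the two-objective score tuples, and the convention that lower scores are better are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is no worse than `b` in every objective and strictly
    better in at least one (assumption: all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the points that no other point dominates."""
    return [
        p for p in points
        if not any(dominates(q, p) for q in points if q is not p)
    ]


# Hypothetical scores of four candidate SNP combinations under two objectives.
scores = [(1.0, 5.0), (2.0, 2.0), (3.0, 3.0), (5.0, 1.0)]
front = non_dominated(scores)
```

In this toy input, (3.0, 3.0) is dropped because (2.0, 2.0) is better in both objectives, while the other three candidates each represent a different trade-off and are kept.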


... Since the computational complexity increases exponentially with the interaction order k, due to the higher number of SNP combinations, achieving an efficient execution of this analysis is a challenging task. This is even more demanding when considering multi-objective optimization, which is increasingly being used in state-of-the-art works to improve the accuracy of epistasis detection [4,5,8]. ...
... To tackle this issue, the algorithms for epistasis detection can be deployed and optimized in modern processors, with powerful out-of-order cores [5]. On the other hand, the high computational complexity of high-order epistasis and the characteristics of modern processors may lead to a decreased energy-efficiency. ...
... To perform an experimental evaluation for diverse amounts of SNPs and individual samples, three data sets from [5] are considered in this work, namely: small data set, with 23 SNPs and 10000 samples; medium data set, with 1000 SNPs and 4000 samples; large data set, with 31339 SNPs and 146 samples. Five individual tests were performed for each application implementation, for the epistasis orders k = 2 and k = 3. ...
Chapter
Epistasis detection represents a fundamental problem in bio-medicine to understand the reasons for occurrence of complex phenotypic traits (diseases) across a population of individuals. Exhaustively examining all possible interactions of multiple Single-Nucleotide Polymorphisms provides the most reliable way to identify accurate solutions, but it is a task that is both computationally and memory intensive. To tackle this challenge, this work proposes a modular and self-adaptive framework for high-performance and energy-efficient epistasis analysis on modern tightly-coupled heterogeneous platforms composed of multi-core CPUs and integrated GPUs. To fully exploit the capabilities of these systems, the proposed framework incorporates both task- and data-parallel approaches specifically tailored to enhance single and multi-objective epistasis detection on each device architecture, along with allowing efficient collaborative execution across all devices. The experimental results show the ability of the proposed framework to handle the heterogeneity of an Intel CPU+iGPU system, achieving performance and energy-efficiency gains of up to 5× and 6× in different parallel execution scenarios.
... In addition to speedup gain that can be achieved by running simultaneous model simulations, parallel computing provides the opportunity to improve quality of optimized parameter sets when comparing the solution quality of parallel calibration algorithms with their sequential counterparts. This solution quality improvement can be explained by enhanced exploration/exploitation of the parameter space, which is enabled by running extra model simulations compared to the sequential version of the calibration algorithms (Huo et al., 2018; Harada and Alba, 2020; Gonçalves et al., 2020). Semiromi et al. (2018) showed that parallel automatic calibration of the Hydrologic Engineering Center-Hydrologic Modeling Systems (HEC-HMS) is superior to calibrating metamodels in terms of both processing time and solution quality. ...
Article
In recent decades, parallel computing has been increasingly applied to address the computational challenges of calibrating watershed hydrologic models. The purpose of this paper is to review these parallelization studies to summarize their contributions, identify knowledge gaps, and propose future research directions. These studies parallelized models based on either random-sampling-based algorithms or optimization algorithms and demonstrated considerable parallel speedup gain and parallel efficiency. However, the speedup gain/efficiency decreases as the number of parallel processing units increases, particularly after a threshold. In future, various combinations of hydrologic models, optimization algorithms, parallelization strategies, parallelization architectures, and communication modes need to be implemented to systematically evaluate a suite of parallelization scenarios for improving speedup gain, efficiency, and solution quality. A standardized suite of performance evaluation metrics needs to be developed to evaluate these parallelization approaches. Interactive multi-objective optimization algorithms and/or integrated sensitivity analysis and calibration algorithms are potential future research fields, as well.
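The two metrics named in this abstract follow the usual definitions: speedup is serial time divided by parallel time, and efficiency is speedup divided by the number of processing units. The sketch below applies them to made-up timings; the function names and all timing values are hypothetical and are not taken from any of the reviewed studies.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Parallel speedup: how many times faster the parallel run is."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, n_units: int) -> float:
    """Parallel efficiency: speedup normalized by the number of units."""
    return speedup(t_serial, t_parallel) / n_units


# Hypothetical timings in seconds: one serial run and runs on 4 and 16 units.
t1 = 100.0
timings = {4: 30.0, 16: 12.5}

for units, tp in timings.items():
    print(units, speedup(t1, tp), efficiency(t1, tp, units))
```

With these invented numbers the speedup keeps growing from 4 to 16 units while the efficiency falls, which is the pattern the review describes.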
... As the number of genetic interactions grows exponentially with the number of tested genes, this task becomes highly computationally demanding in modern computing systems, in particular when considering higher interaction orders. Evolutionary algorithms [6] or machine learning methods [7], [8] can be used to speed up epistasis detection by narrowing down the search space. These approaches may lead, however, to reduced accuracy in identifying epistasis. ...
Preprint
Developments in Genome-Wide Association Studies have led to the increasing notion that future healthcare techniques will be personalized to the patient, by relying on genetic tests to determine the risk of developing a disease. To this end, the detection of gene interactions that cause complex diseases constitutes an important application. Similarly to many applications in this field, extensive data sets containing genetic information for a series of patients are used (such as Single-Nucleotide Polymorphisms), leading to high computational complexity and memory utilization, thus constituting a major challenge when targeting high-performance execution in modern computing systems. To close this gap, this work proposes several novel approaches for the detection of three-way gene interactions in modern CPUs and GPUs, making use of different optimizations to fully exploit the target architectures. Crucial insights from the Cache-Aware Roofline Model are used to ensure the suitability of the applications to the computing devices. An extensive study of the architectural features of 13 CPU and GPU devices from all main vendors is also presented, making it possible to understand the features relevant to obtaining high performance in this bioinformatics domain. To the best of our knowledge, this study is the first to perform such evaluation for epistasis detection. The proposed approaches are able to surpass the performance of state-of-the-art works in the tested platforms, achieving an average speedup of 3.9× (7.3× on CPUs and 2.8× on GPUs) and maximum speedup of 10.6× on Intel UHD P630 GPU.
Conference Paper
Single-nucleotide polymorphism (SNP) analysis has become a pivotal strategy for drug discovery within bioinformatics, especially for incurable diseases like cancer. With the increasing number of researchers starting to embrace metaheuristic methods, a review was done by gathering papers from the Google Scholar database from 2018 to 2023, which resulted in 20 papers after title, abstract, and content filtering. The findings show that the Genetic Algorithm and the Harmony Search Algorithm have become popular approaches in SNP analysis, particularly in studies on breast cancer, age-related macular degeneration, and colorectal cancer. However, the review shows that while researchers have proven most methods effective in finding disease-related SNPs, a more measurable study in SNP analysis is needed, due to the lack of elaboration on measurement metrics in the found studies.
Article
Full-text available
Flowering time is a well-known complex trait in crops and is influenced by many interacting genes. In this study, Mathew et al. identify two-way and.... Gene-by-gene interactions, also known as epistasis, regulate many complex traits in different species. With the availability of low-cost genotyping it is now possible to study epistasis on a genome-wide scale. However, identifying genome-wide epistasis is a high-dimensional multiple regression problem and needs the application of dimensionality reduction techniques. Flowering Time (FT) in crops is a complex trait that is known to be influenced by many interacting genes and pathways in various crops. In this study, we successfully apply Sure Independence Screening (SIS) for dimensionality reduction to identify two-way and three-way epistasis for the FT trait in a Multiparent Advanced Generation Inter-Cross (MAGIC) barley population using the Bayesian multilocus model. The MAGIC barley population was generated from intercrossing among eight parental lines and thus, offered greater genetic diversity to detect higher-order epistatic interactions. Our results suggest that SIS is an efficient dimensionality reduction approach to detect high-order interactions in a Bayesian multilocus model. We also observe that many of our findings (genomic regions with main or higher-order epistatic effects) overlap with known candidate genes that have been already reported in barley and closely related species for the FT trait.
Article
Full-text available
Epistasis, commonly defined as the interaction between multiple genes, is an important genetic component underlying phenotypic variation. Many statistical methods have been developed to model and identify epistatic interactions between genetic variants. However, because of the large combinatorial search space of interactions, most epistasis mapping methods face enormous computational challenges and often suffer from low statistical power due to multiple test correction. Here, we present a novel, alternative strategy for mapping epistasis: instead of directly identifying individual pairwise or higher-order interactions, we focus on mapping variants that have non-zero marginal epistatic effects—the combined pairwise interaction effects between a given variant and all other variants. By testing marginal epistatic effects, we can identify candidate variants that are involved in epistasis without the need to identify the exact partners with which the variants interact, thus potentially alleviating much of the statistical and computational burden associated with standard epistatic mapping procedures. Our method is based on a variance component model, and relies on a recently developed variance component estimation method for efficient parameter inference and p-value computation. We refer to our method as the “MArginal ePIstasis Test”, or MAPIT. With simulations, we show how MAPIT can be used to estimate and test marginal epistatic effects, produce calibrated test statistics under the null, and facilitate the detection of pairwise epistatic interactions. We further illustrate the benefits of MAPIT in a QTL mapping study by analyzing the gene expression data of over 400 individuals from the GEUVADIS consortium.
Article
Full-text available
The discovery of higher-order epistatic interactions is an important task in the field of genome-wide association studies which allows for the identification of complex interaction patterns between multiple genetic markers. Some existing brute-force approaches explore the whole space of k-interactions in an exhaustive manner resulting in almost intractable execution times. Computational cost can be reduced drastically by restricting the search space with suitable preprocessing filters which prune unpromising candidates. Other approaches mitigate the execution time by employing massively parallel accelerators in order to benefit from the vast computational resources of these architectures. In this paper, we combine a novel preprocessing filter, namely SingleMI, with massively parallel computation on modern GPUs to further accelerate epistasis discovery. Our implementation improves both the runtime and accuracy when compared to a previous GPU counterpart that employs mutual information clustering for prefiltering. SingleMI is open source software and publicly available at: https://github.com/sleeepyjack/singlemi/.
Article
Full-text available
Complex diseases may result from many genetic variants interacting with each other. For this reason, genome-wide interaction studies (GWIS) are currently performed to detect pairwise SNP interactions. While the computations required here can be completed within reasonable time, detecting third-order SNP interactions for large-scale datasets has not yet been practical due to the cubic complexity of the problem.
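The jump from quadratic to cubic cost can be made concrete by counting the SNP combinations an exhaustive search has to evaluate. The sketch below uses the standard binomial coefficient; the helper name and the SNP counts are illustrative assumptions, not figures from this abstract.

```python
import math


def n_combinations(n_snps: int, order: int) -> int:
    """Number of distinct SNP combinations of a given interaction order,
    i.e. the binomial coefficient C(n_snps, order)."""
    return math.comb(n_snps, order)


# Hypothetical data set sizes, chosen only to show the growth rate.
for n in (100, 1_000, 10_000):
    pairs = n_combinations(n, 2)
    triples = n_combinations(n, 3)
    print(f"{n} SNPs: {pairs} pairs, {triples} triples")
```

Because C(n, 3) grows roughly as n³/6, multiplying the number of SNPs by ten multiplies the number of third-order combinations by about a thousand, whereas the number of pairs grows only about a hundredfold.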
Article
Full-text available
Background Detecting and visualizing nonlinear interaction effects of single nucleotide polymorphisms (SNPs) or epistatic interactions are important topics in bioinformatics since they play an important role in unraveling the mystery of “missing heritability”. However, related studies are almost limited to pairwise epistatic interactions due to their methodological and computational challenges. Results We develop CINOEDV (Co-Information based N-Order Epistasis Detector and Visualizer) for the detection and visualization of epistatic interactions of their orders from 1 to n (n ≥ 2). CINOEDV is composed of two stages, namely, detecting stage and visualizing stage. In detecting stage, co-information based measures are employed to quantify association effects of n-order SNP combinations to the phenotype, and two types of search strategies are introduced to identify n-order epistatic interactions: an exhaustive search and a particle swarm optimization based search. In visualizing stage, all detected n-order epistatic interactions are used to construct a hypergraph, where a real vertex represents the main effect of a SNP and a virtual vertex denotes the interaction effect of an n-order epistatic interaction. By deeply analyzing the constructed hypergraph, some hidden clues for better understanding the underlying genetic architecture of complex diseases could be revealed. Conclusions Experiments of CINOEDV and its comparison with existing state-of-the-art methods are performed on both simulation data sets and a real data set of age-related macular degeneration. Results demonstrate that CINOEDV is promising in detecting and visualizing n-order epistatic interactions. CINOEDV is implemented in R and is freely available from R CRAN: http://cran.r-project.org and https://sourceforge.net/projects/cinoedv/files/.
Article
Full-text available
Motivation: The two-locus model is a typical significant disease model to be identified in genome-wide association studies (GWAS). Due to the intensive computational burden and the diversity of disease models, existing methods suffer from low detection power, high computation cost, and a preference for some types of disease models. Method: In this study, two scoring functions (Bayesian network based K2-score and Gini-score) are used for characterizing two SNP loci as a candidate model; the two criteria are adopted simultaneously to improve identification power and to tackle the preference problem for disease models. The harmony search algorithm (HSA) is improved for quickly finding the most likely candidate models among all two-locus models, in which a local search algorithm with a two-dimensional tabu table is presented to avoid repeatedly evaluating some disease models that have strong marginal effect. Finally, the G-test statistic is used to further test the candidate models. Results: We investigate our method, named FHSA-SED, on 82 simulated datasets and a real AMD dataset, and compare it with two typical methods (MACOED and CSE) which have been developed recently based on swarm intelligent search algorithms. The results of simulation experiments indicate that our method outperforms the two compared algorithms in terms of detection power, computation time, evaluation times, sensitivity (TPR), specificity (SPC), positive predictive value (PPV) and accuracy (ACC). Our method has identified two SNPs (rs3775652 and rs10511467) that may also be associated with disease in the AMD dataset.
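For readers unfamiliar with the G-test mentioned in the method description, the sketch below computes the standard statistic G = 2 Σ O ln(O / E) on a contingency table of genotype combinations (rows) against case/control status (columns). How FHSA-SED builds its table and applies the test is not stated in this abstract, so the table layout, the function name, and the counts are assumptions for illustration only.

```python
import math
from typing import List


def g_statistic(table: List[List[int]]) -> float:
    """G-test statistic for a contingency table of observed counts.

    Expected counts come from the row and column totals under independence.
    Cells with an observed count of zero contribute nothing to the sum.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:
                expected = row_totals[i] * col_totals[j] / total
                g += observed * math.log(observed / expected)
    return 2.0 * g


# Hypothetical counts: two genotype groups (rows) by cases/controls (columns).
independent = [[10, 20], [20, 40]]   # same case/control ratio in both rows
associated = [[30, 10], [10, 30]]    # ratios differ strongly between rows

g_null = g_statistic(independent)
g_alt = g_statistic(associated)
```

Under the null hypothesis of no association, G approximately follows a chi-squared distribution with (rows − 1) × (columns − 1) degrees of freedom, so a large value such as the one for the second table points to an association worth testing further.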
Article
Full-text available
During the past decade, findings of genome-wide association studies (GWAS) improved our knowledge and understanding of disease genetics. To date, thousands of SNPs have been associated with diseases and other complex traits. Statistical analysis typically looks for association between a phenotype and a SNP taken individually via single-locus tests. However, geneticists admit this is an oversimplified approach to tackle the complexity of underlying biological mechanisms. Interaction between SNPs, namely epistasis, must be considered. Unfortunately, epistasis detection gives rise to analytic challenges since analyzing every SNP combination is at present impractical at a genome-wide scale. In this review, we will present the main strategies recently proposed to detect epistatic interactions, along with their operating principle. Some of these methods are exhaustive, such as multifactor dimensionality reduction, likelihood ratio-based tests or receiver operating characteristic curve analysis; some are non-exhaustive, such as machine learning techniques (random forests, Bayesian networks) or combinatorial optimization approaches (ant colony optimization, computational evolution system).
Article
Full-text available
High-throughput genotyping technologies (such as SNP-arrays) allow the rapid collection of up to a few million genetic markers of an individual. Detecting epistasis (based on 2-SNP interactions) in Genome-Wide Association Studies is an important but time-consuming operation since statistical computations have to be performed for each pair of measured markers. Computational methods to detect epistasis therefore suffer from prohibitively long runtimes; e.g. processing a moderately-sized dataset consisting of about 500,000 SNPs and 5,000 samples requires several days using state-of-the-art tools on a standard 3GHz CPU. In this paper we demonstrate how this task can be accelerated using a combination of fine-grained and coarse-grained parallelism on two different computing systems. The first architecture is based on reconfigurable hardware (FPGAs) while the second architecture uses multiple GPUs connected to the same host. We show that both systems can achieve speedups of around four orders-of-magnitude compared to the sequential implementation. This significantly reduces the runtimes for detecting epistasis to only a few minutes for moderately-sized datasets and to a few hours for large-scale datasets.
Article
Full-text available
Background Discovering causal genetic variants from large genetic association studies poses many difficult challenges. Assessing which genetic markers are involved in determining trait status is a computationally demanding task, especially in the presence of gene-gene interactions. Results A non-parametric Bayesian approach in the form of a Bayesian neural network is proposed for use in analyzing genetic association studies. Demonstrations on synthetic and real data reveal they are able to efficiently and accurately determine which variants are involved in determining case-control status. By using graphics processing units (GPUs) the time needed to build these models is decreased by several orders of magnitude. In comparison with commonly used approaches for detecting interactions, Bayesian neural networks perform very well across a broad spectrum of possible genetic relationships. Conclusions The proposed framework is shown to be a powerful method for detecting causal SNPs while being computationally efficient enough to handle large datasets.
Article
Full-text available
Meta-analysis of data from genome-wide association studies (GWAS) of Alzheimer's disease (AD) has confirmed the high risk of APOE and identified twenty other risk genes/loci with moderate effect size. However, many more risk genes/loci remain to be discovered to account for the missing heritability. The contributions from individual single-nucleotide polymorphisms (SNPs) have been thoroughly examined in traditional GWAS data analysis, while SNP-SNP interactions can be explored by a variety of alternative approaches. Here we applied generalized multifactor dimensionality reduction to the re-analysis of four publicly available GWAS datasets for AD. When considering 4-order intragenic SNP interactions, we observed high consistency of discovered potential risk genes among the four independent GWAS datasets. Ten potential risk genes were observed across all four datasets, including PDE1A, RYR3, TEK, SLC25A21, LOC729852, KIRREL3, PTPN5, FSHR, PARK2, and NR3C2. These potential risk genes discovered by generalized multifactor dimensionality reduction are highly relevant to AD pathogenesis based on multiple layers of evidence. The genetic contributions of these genes warrant further confirmation in other independent GWAS datasets for AD.
Article
Full-text available
The concept that a specific alteration in an individual's DNA can result in disease is central to our notion of molecular medicine. The molecular basis of more than 3,500 Mendelian disorders has now been identified. In contrast, the identification of genes for common disease has been much more challenging. We discuss historical and contemporary approaches to disease gene identification, focusing on novel opportunities such as the use of population extremes and the identification of rare variants. While our ability to sequence DNA has advanced dramatically, assigning function to a given sequence change remains a major challenge, highlighting the need for both bioinformatics and functional approaches to appropriately interpret these data. We review progress in mapping and identifying human disease genes and discuss future challenges and opportunities for the field.
Article
Full-text available
Background The detection of epistasis among genetic markers is of great interest in genome-wide association studies (GWAS). In recent years, much research has been devoted to finding disease-associated epistasis in GWAS. However, due to the high computational cost involved, most methods focus on specific epistasis models, which risks a loss of power when the underlying epistasis models are not examined in these analyses. Results In this work, we propose a computationally efficient approach based on complete enumeration of two-locus epistasis models. This approach uses a two-stage (screening and testing) search strategy and guarantees the enumeration of all epistasis patterns. The implementation is done on graphic processing units (GPU), which can finish the analysis of a GWAS data set (with around 5,000 subjects and around 350,000 markers) within two hours. Source code is available at http://bioinformatics.ust.hk/BOOST.html#GBOOST. Conclusions This work demonstrates that the complete compositional epistasis detection is computationally feasible in GWAS.
Article
Full-text available
In genome-wide association studies (GWAS), the association between each single nucleotide polymorphism (SNP) and a phenotype is assessed statistically. To further explore genetic associations in GWAS, we considered two specific forms of biologically plausible SNP-SNP interactions, 'SNP intersection' and 'SNP union,' and analyzed the Crohn's Disease (CD) GWAS data of the Wellcome Trust Case Control Consortium for these interactions using a limited form of logic regression. We found strong evidence of CD-association for 195 genes, identifying novel susceptibility genes (e.g., ISX, SLCO6A1, TMEM183A) as well as confirming many previously identified susceptibility genes in CD GWAS (e.g., IL23R, NOD2, CYLD, NKX2-3, IL12RB2, ATG16L1). Notably, 37 of the 59 chromosomal locations indicated for CD-association by a meta-analysis of CD GWAS, involving over 22,000 cases and 29,000 controls, were represented in the 195 genes, as well as some chromosomal locations previously indicated only in linkage studies, but not in GWAS. We repeated the analysis with two smaller GWASs from the Database of Genotype and Phenotype (dbGaP): in spite of differences of populations and study power across the three datasets, we observed some consistencies across the three datasets. Notable examples included TMEM183A and SLCO6A1 which exhibited strong evidence consistently in our WTCCC and both of the dbGaP SNP-SNP interaction analyses. Examining these specific forms of SNP interactions could identify additional genetic associations from GWAS. R codes, data examples, and a ReadMe file are available for download from our website: http://www.ualberta.ca/~yyasui/homepage.html.
Article
Full-text available
Background Geneticists who look beyond single locus disease associations require additional strategies for the detection of complex multi-locus effects. Epistasis, a multi-locus masking effect, presents a particular challenge, and has been the target of bioinformatic development. Thorough evaluation of new algorithms calls for simulation studies in which known disease models are sought. To date, the best methods for generating simulated multi-locus epistatic models rely on genetic algorithms. However, such methods are computationally expensive, difficult to adapt to multiple objectives, and unlikely to yield models with a precise form of epistasis which we refer to as pure and strict. Purely and strictly epistatic models constitute the worst-case in terms of detecting disease associations, since such associations may only be observed if all n-loci are included in the disease model. This makes them an attractive gold standard for simulation studies considering complex multi-locus effects. Results We introduce GAMETES, a user-friendly software package and algorithm which generates complex biallelic single nucleotide polymorphism (SNP) disease models for simulation studies. GAMETES rapidly and precisely generates random, pure, strict n-locus models with specified genetic constraints. These constraints include heritability, minor allele frequencies of the SNPs, and population prevalence. GAMETES also includes a simple dataset simulation strategy which may be utilized to rapidly generate an archive of simulated datasets for given genetic models. We highlight the utility and limitations of GAMETES with an example simulation study using MDR, an algorithm designed to detect epistasis. Conclusions GAMETES is a fast, flexible, and precise tool for generating complex n-locus models with random architectures. 
While GAMETES has a limited ability to generate models with higher heritabilities, it is proficient at generating the lower heritability models typically used in simulation studies evaluating new algorithms. In addition, the GAMETES modeling strategy may be flexibly combined with any dataset simulation strategy. Beyond dataset simulation, GAMETES could be employed to pursue theoretical characterization of genetic models and epistasis.
Article
Full-text available
Epistasis is recognized as fundamentally important for understanding the mechanism of disease-causing genetic variation. Though many novel methods for detecting epistasis have been proposed, few studies focus on their comparison. Undertaking a comprehensive comparison study is an urgent task and a step toward applying the methods in practice. This paper presents a comparison study of epistasis detection methods, carried out by applying the related software packages to datasets. For this purpose, we categorize methods according to their search strategies, and select five representative methods (TEAM, BOOST, SNPRuler, AntEpiSeeker and epiMODE) originating from different underlying techniques for comparison. The methods are tested on simulated datasets with different sizes, various epistasis models, and with/without noise. The types of noise include missing data, genotyping error and phenocopy. Performance is evaluated by detection power (three forms are introduced), robustness, sensitivity and computational complexity. None of the selected methods is perfect in all scenarios and each has its own merits and limitations. In terms of detection power, AntEpiSeeker performs best on detecting epistasis displaying marginal effects (eME) and BOOST performs best on identifying epistasis displaying no marginal effects (eNME). In terms of robustness, AntEpiSeeker is robust to all types of noise on eME models, BOOST is robust to genotyping error and phenocopy on eNME models, and SNPRuler is robust to phenocopy on eME models and missing data on eNME models. In terms of sensitivity, AntEpiSeeker is the winner on eME models and both SNPRuler and BOOST perform well on eNME models. In terms of computational complexity, BOOST is the fastest among the methods. In terms of overall performance, AntEpiSeeker and BOOST are recommended as the efficient and effective methods. This comparison study may provide guidelines for applying the methods and further clues for epistasis detection.
Article
Full-text available
In genome-wide association studies (GWAS), up to millions of single nucleotide polymorphisms (SNPs) are genotyped for thousands of individuals. However, conventional single locus-based approaches are usually unable to detect gene-gene interactions underlying complex diseases. Due to the huge search space for complicated high order interactions, many existing multi-locus approaches are slow and may suffer from low detection power for GWAS. In this article, we develop a simple, fast and effective algorithm to detect genome-wide multi-locus epistatic interactions based on the clustering of relatively frequent items. Extensive experiments on simulated data show that our algorithm is fast and more powerful in general than some recently proposed methods. On a real genome-wide case-control dataset for age-related macular degeneration (AMD), the algorithm has identified genotype combinations that are significantly enriched in the cases. Availability: http://www.cs.ucr.edu/~minzhux/EDCF.zip Contact: minzhux@cs.ucr.edu; jingli@cwru.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Full-text available
Detection of epistatic interaction between loci has been postulated to provide a more in-depth understanding of the complex biological and biochemical pathways underlying human diseases. Studying the interaction between two loci is the natural progression following traditional and well-established single locus analysis. However, the added costs and time duration required for the computation involved have thus far deterred researchers from pursuing a genome-wide analysis of epistasis. In this paper, we propose a method allowing such analysis to be conducted very rapidly. The method, dubbed EPIBLASTER, is applicable to case-control studies and consists of a two-step process in which the difference in Pearson's correlation coefficients is computed between controls and cases across all possible SNP pairs as an indication of significant interaction warranting further analysis. For the subset of interactions deemed potentially significant, a second-stage analysis is performed using the likelihood ratio test from the logistic regression to obtain the P-value for the estimated coefficients of the individual effects and the interaction term. The algorithm is implemented using the parallel computational capability of commercially available graphical processing units to greatly reduce the computation time involved. In the current setup and example data sets (211 cases, 222 controls, 299468 SNPs; and 601 cases, 825 controls, 291095 SNPs), this coefficient evaluation stage can be completed in roughly 1 day. Our method allows for exhaustive and rapid detection of significant SNP pair interactions without imposing significant marginal effects of the single loci involved in the pair.
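The first screening step described in this abstract, a per-pair difference in Pearson's correlation between cases and controls, can be sketched in a few lines. This is only an illustration and not the EPIBLASTER implementation: the function names, the 0/1/2 genotype coding, the cases-minus-controls sign convention, and the handling of constant SNPs are all assumptions, and the GPU parallelization and the logistic-regression second stage are omitted.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant SNP is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def correlation_differences(
    cases: List[List[float]], controls: List[List[float]]
) -> Dict[Tuple[int, int], float]:
    """For every SNP pair (i, j), correlation in cases minus correlation in
    controls. Each inner list holds one SNP's genotypes across individuals."""
    return {
        (i, j): pearson(cases[i], cases[j]) - pearson(controls[i], controls[j])
        for i, j in combinations(range(len(cases)), 2)
    }


# Hypothetical toy data: 3 SNPs, 4 cases and 4 controls, genotypes coded 0/1/2.
cases = [[0, 1, 2, 1], [0, 1, 2, 1], [2, 1, 0, 1]]
controls = [[0, 1, 2, 1], [2, 1, 0, 1], [2, 1, 0, 1]]
diffs = correlation_differences(cases, controls)
```

Pairs whose difference is large in absolute value, such as (0, 1) in this toy data, would be the ones passed on to a second-stage test. Because every pair is scored independently of the others, the loop is a natural fit for the GPU parallelism the abstract mentions.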
Article
Gene-gene interactions have long been recognized to be fundamentally important for understanding genetic causes of complex disease traits. At present, identifying gene-gene interactions from genome-wide case-control studies is computationally and methodologically challenging. In this paper, we introduce a simple but powerful method, named "BOolean Operation-based Screening and Testing" (BOOST). For the discovery of unknown gene-gene interactions that underlie complex diseases, BOOST allows examination of all pairwise interactions in genome-wide case-control studies in a remarkably fast manner. We have carried out interaction analyses on seven data sets from the Wellcome Trust Case Control Consortium (WTCCC). Each analysis took less than 60 hr to completely evaluate all pairs of roughly 360,000 SNPs on a standard 3.0 GHz desktop with 4G memory running the Windows XP system. The interaction patterns identified from the type 1 diabetes data set display significant difference from those identified from the rheumatoid arthritis data set, although both data sets share a very similar hit region in the WTCCC report. BOOST has also identified some disease-associated interactions between genes in the major histocompatibility complex region in the type 1 diabetes data set. We believe that our method can serve as a computationally and statistically useful tool in the coming era of large-scale interaction mapping in genome-wide case-control studies.
Article
Objective: To explore the association between epistasis among related genes of the renin-angiotensin system (RAS) and type 2 diabetes. Methods: Gene polymorphisms were genotyped in 394 type 2 diabetic patients and 418 healthy control subjects in this case-control study. We used the multifactor dimensionality reduction method to identify gene-gene interactions. Results: No single locus was associated with type 2 diabetes, except for the insert/deletion (I/D) polymorphism of the ACE gene in female subjects. In multi-locus analyses, in male subjects the model of rs2106809 (ACE2), rs220721 (Mas), rs699 (AGT), and I/D (ACE) was significant (P = 0.043). This combination was associated with a 4.00 times (95% CI 2.51-6.38; P < 0.0001) greater prevalence of type 2 diabetes. In female subjects, the model of rs2106809 (ACE2), I/D (ACE), and rs1403543 (AGTR2) was significant (P = 0.012). This three-locus combination was associated with a 2.76 times (1.91-3.97; P < 0.0001) greater prevalence of type 2 diabetes. Conclusions: Interactions among RAS-related genes were associated with type 2 diabetes in a Chinese population.
Article
Epistatic interactions of multiple single nucleotide polymorphisms (SNPs) are now believed to affect individual susceptibility to common diseases. The detection of such interactions, however, is a challenging task in large scale association studies. Ant colony optimization (ACO) algorithms have been shown to be useful in detecting epistatic interactions. AntEpiSeeker, a new two-stage ant colony optimization algorithm, has been developed for detecting epistasis in a case-control design. Based on some practical epistatic models, AntEpiSeeker has performed very well. AntEpiSeeker is a powerful and efficient tool for large-scale association studies and can be downloaded from http://nce.ads.uga.edu/~romdhane/AntEpiSeeker/index.html.
Article
Epistatic interactions among multiple genetic variants in the human genome may be important in determining individual susceptibility to common diseases. Although some existing computational methods for identifying genetic interactions have been effective for small-scale studies, we here propose a method, denoted 'bayesian epistasis association mapping' (BEAM), for genome-wide case-control studies. BEAM treats the disease-associated markers and their interactions via a bayesian partitioning model and computes, via Markov chain Monte Carlo, the posterior probability that each marker set is associated with the disease. Testing this on an age-related macular degeneration genome-wide association data set, we demonstrate that the method is significantly more powerful than existing approaches and that genome-wide case-control epistasis mapping with many thousands of markers is both computationally and statistically feasible.
Conference Paper
Many studies have shown that there is a direct relationship between Single Nucleotide Polymorphisms (SNPs) and the appearance of complex diseases, such as Alzheimer’s or Parkinson’s. However, recent advances in genome-wide association studies indicate that the relationship between SNPs and these diseases goes beyond a simple one-to-one relationship; that is, the interaction of multiple SNPs (epistasis) influences the appearance of these diseases. Accordingly, this work proposes the application of the NSGA-II multi-objective algorithm for the detection of multi-locus epistasis in a database with 31,341 SNPs. Moreover, a parallel study has been performed to reduce the execution time of this problem. Our implementation not only achieves reasonably good parallel performance and scalability, but also yields results whose biological significance surpasses that of other approaches published in the literature.
Article
Application of the experimental design of genome-wide association studies (GWASs) is now 10 years old (young), and here we review the remarkable range of discoveries it has facilitated in population and complex-trait genetics, the biology of diseases, and translation toward new therapeutics. We predict the likely discoveries in the next 10 years, when GWASs will be based on millions of samples with array data imputed to a large fully sequenced reference panel and on hundreds of thousands of samples with whole-genome sequencing data. Copyright © 2017 American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
Article
Motivation: The existing epistasis analysis approaches have been criticized mainly for their: 1) ignoring heterogeneity during epistasis analysis; 2) high computational costs; and 3) volatility of performances and results. Therefore, they will not perform well in general, leading to lack of reproducibility and low power in complex disease association studies. Method: In this work, a fast scheme is proposed to accelerate exhaustive searching based on multi-objective optimization named ESMO for concurrently analyzing heterogeneity and epistasis phenomena. In ESMO, mutual entropy and Bayesian network approaches are combined for evaluating epistatic SNP combinations. In order to be compatible with heterogeneity of complex diseases, we designed an adaptive framework based on non-dominant sort and top k selection algorithm with improved time complexity O(k*M*N). Moreover, ESMO is accelerated by strategies such as trading space for time, calculation sharing and parallel computing. Finally, ESMO is nonparametric and model-free. Results: We compared ESMO with other recent or classic methods using different evaluating measures. The experimental results show that our method not only can quickly handle epistasis, but also can effectively detect heterogeneity of complex population structures. Availability: https://github.com/XiongLi2016/ESMO/tree/master/ESMO-common-master Contact: lx_hncs@163.com
Article
epiSNP is a program for identifying pairwise single nucleotide polymorphism (SNP) interactions (epistasis) in quantitative-trait genome-wide association studies (GWAS). A parallel MPI version (EPISNPmpi) was created in 2008 to address this computationally expensive analysis on large data sets with many quantitative traits and SNP markers. However, the falling cost of genotyping has led to an explosion of large-scale GWAS data sets that challenge EPISNPmpi's ability to compute results in a reasonable amount of time. Therefore, we optimized epiSNP for modern multi-core and highly parallel many-core processors to efficiently handle these large data sets. This paper describes the serial optimizations, dynamic load balancing using MPI-3 RMA operations, and shared-memory parallelization with OpenMP to further enhance load balancing and allow execution on the Intel Xeon Phi coprocessor (MIC). For a large GWAS data set, our optimizations provided a 38.43× speedup over EPISNPmpi on 126 nodes using 2 MICs on TACC's Stampede Supercomputer. We also describe a Coarray Fortran (CAF) version that demonstrates the suitability of PGAS languages for problems with this computational pattern. We show that the Coarray version performs competitively with the MPI version on the NERSC Edison Cray XC30 supercomputer. Finally, the performance benefits of hyper-threading for this application on Edison (average 1.35× speedup) are demonstrated.
Article
In this chapter, a clear difference is made between the parallel design aspect and the parallel implementation aspect of evolutionary algorithms (EAs). From the algorithmic design point of view, the main parallel models for EAs are presented. A unifying view of parallel models for EAs is outlined. This chapter is organized as follows. In Sect. 55.2, the main parallel models for designing EAs are presented. Section 55.3 deals with the implementation issues of parallel EAs. In this section, the main concepts of parallel architectures and parallel programming paradigms, which interfere with the design and implementation of parallel EAs, are outlined. The main performance indicators that can be used to evaluate parallel EAs in terms of efficiency are detailed. Finally, Sect. 55.4 deals with the design and implementation of different parallel models for EAs based on the software framework ParadisEO.
Article
Evolutionary algorithms (EAs) have amply shown their promise in solving various search and optimization problems for the past three decades. One of the hallmarks and niches of EAs is their ability to handle multi-objective optimization problems in their totality, which their classical counterparts lack. Suggested in the beginning of the 1990s, evolutionary multi-objective optimization (EMO) algorithms are now routinely used in solving problems with multiple conflicting objectives in various branches of engineering, science, and commerce. In this chapter, we provide an overview of EMO methodologies by first presenting principles of EMO through an illustration of one specific algorithm and its application to an interesting real-world bi-objective optimization problem. Thereafter, we provide a list of recent research and application developments of EMO to provide a picture of some salient advancements in EMO research. The development and application of EMO to multi-objective optimization problems and their continued extensions to solve other related problems has elevated EMO research to a level which may now undoubtedly be termed as an active field of research with a wide range of theoretical and practical research and application opportunities.
Article
The use of evolutionary algorithms (EAs) for solving multiobjective optimization problems has been very active in the last few years. The main reasons for this popularity are their ease of use with respect to classical mathematical programming techniques, their scalability, and their suitability for finding trade-off solutions in a single run. However, these algorithms may be computationally expensive because (1) many real-world optimization problems typically involve tasks demanding high computational resources and (2) they are aimed at finding a whole front of optimal solutions instead of searching for a single optimum. Parallelizing EAs emerges as a possible way of reducing the CPU time down to affordable values, but it also allows researchers to use an advanced search engine – the parallel model – that provides the algorithms with an improved population diversity and enables them to cooperate with other (eventually nonevolutionary) techniques. The goal of this chapter is to provide the reader with an up-to-date review of the recent literature on parallel EAs for multiobjective optimization.
Article
Epistasis, the interaction between genes, has become a major topic in molecular and quantitative genetics. It is believed that these interactions play a significant role in genetic variations causing complex diseases. Several algorithms have been employed to detect pairwise interactions in genome-wide association studies (GWAS) but revealing higher order interactions remains a computationally challenging task. State of the art tools are not able to perform exhaustive search for all three-locus interactions in reasonable time even for relatively small input datasets. In this paper we present how a hardware-assisted design can solve this problem and provide fast, efficient and exhaustive third-order epistasis analysis with up-to-date FPGA technology.
Article
Interest in discovering combinations of genetic markers from case–control studies, such as Genome Wide Association Studies (GWAS), that are strongly associated to diseases has increased in recent years. Detecting epistasis, i.e. interactions among k markers (k ≥ 2), is an important but time consuming operation since statistical computations have to be performed for each k-tuple of measured markers. Efficient exhaustive methods have been proposed for k = 2, but exhaustive third-order analyses are thought to be impractical due to the cubic number of triples to be computed. Thus, most previous approaches apply heuristics to accelerate the analysis by discarding certain triples in advance. Unfortunately, these tools can fail to detect interesting interactions. We present GPU3SNP, a fast GPU-accelerated tool to exhaustively search for interactions among all marker-triples of a given case–control dataset. Our tool is able to analyze an input dataset with tens of thousands of markers in reasonable time thanks to two efficient CUDA kernels and efficient workload distribution techniques. For instance, a dataset consisting of 50,000 markers measured from 1000 individuals can be analyzed in less than 22 h on a single compute node with 4 NVIDIA GTX Titan boards. Source code is available at: http://sourceforge.net/projects/gpu3snp/.
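To make the "cubic number of triples" concrete, the short sketch below counts the marker combinations an exhaustive search has to visit. It is only arithmetic on the figures quoted in the abstract above; the helper name `num_combinations` is an assumption of this example and is not part of GPU3SNP.

```python
from math import comb


def num_combinations(num_markers, order):
    """Number of unordered marker k-tuples an exhaustive search must score."""
    return comb(num_markers, order)


# Figures from the abstract: 50,000 markers, interaction orders k = 2 and k = 3.
pairs = num_combinations(50_000, 2)
triples = num_combinations(50_000, 3)
print(pairs)    # 1249975000
print(triples)  # 20832083350000
```

With roughly 2.08 × 10^13 triples, finishing in under 22 h implies a sustained rate on the order of 2.6 × 10^8 triples per second across the four GPUs, which is a derived estimate rather than a figure reported by the authors.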
Article
Genome-wide association studies (GWAS) have dominated the field of human genetics for the past 10 years. This study design allows for an unbiased, dense exploration of the genome and provides researchers with a vast array of SNPs to look for association with their trait or disease of interest. GWAS has been referred to as finding needles in a haystack and while many of these "needles," or SNPs associating with disease, have been identified, there is still a great deal of heritability yet to be explained. The missing or phantom heritability is due, at least in part, to epistasis or gene-gene interactions, which have not been extensively explored in GWAS. Part of the challenge for epistasis analysis in GWAS is the sheer magnitude of the search and the computational complexity associated with it. An exhaustive search for epistasis models is not computationally feasible; thus, alternate approaches must be considered. In this chapter, these approaches will be reviewed briefly, and the incorporation of biological knowledge to guide this process will be further expanded upon. Real biological data examples where this approach has yielded successful identification of epistasis will also be provided. Epistasis has been known to be important since the early 1900s; however, its prevalence in mainstream research has been somewhat overshadowed by molecular technology advances. Due to the increasing evidence of epistasis in complex traits, it continues to emerge as a likely explanation for missing heritability.
Article
Motivation: The existing methods for genetic-interaction detection in genome-wide association studies are designed from different paradigms, and their performances vary considerably for different disease models. One important reason for this variability is that their construction is based on a single-correlation model between SNPs and disease. Due to potential model preference and disease complexity, a single-objective method will therefore not work well in general, resulting in low power and a high false-positive rate. Method: In this work, we present a multi-objective heuristic optimization methodology named MACOED for detecting genetic interactions. In MACOED, we combine both logistical regression and Bayesian network methods, which are from opposing schools of statistics. The combination of these two evaluation objectives proved to be complementary, resulting in higher power with a lower false-positive rate than observed for optimizing either objective independently. To solve the space and time complexity for high-dimension problems, a memory-based multi-objective ant colony optimization algorithm is designed in MACOED that is able to retain non-dominated solutions found in past iterations. Results: We compared MACOED with other recent algorithms using both simulated and real datasets. The experimental results demonstrate that our method outperforms others in both detection power and computational feasibility for large datasets. Availability and implementation: Codes and datasets are available at: www.csbio.sjtu.edu.cn/bioinf/MACOED/.
Article
The role of epistasis in the genetic architecture of quantitative traits is controversial, despite the biological plausibility that nonlinear molecular interactions underpin the genotype-phenotype map. This controversy arises because most genetic variation for quantitative traits is additive. However, additive variance is consistent with pervasive epistasis. In this Review, I discuss experimental designs to detect the contribution of epistasis to quantitative trait phenotypes in model organisms. These studies indicate that epistasis is common, and that additivity can be an emergent property of underlying genetic interaction networks. Epistasis causes hidden quantitative genetic variation in natural populations and could be responsible for the small additive effects, missing heritability and the lack of replication that are typically observed for human complex traits.
Article
Genetic association is a challenging task for the identification and characterization of genes that increase the susceptibility to common complex multifactorial diseases. To fully execute genetic studies of complex diseases, modern geneticists face the challenge of detecting interactions between loci. A genetic algorithm (GA) is developed to detect the association of genotype frequencies of cancer cases and noncancer cases based on statistical analysis. An improved genetic algorithm (IGA) is proposed to improve the reliability of the GA method for high-dimensional SNP-SNP interactions. The strategy offers the top five results to the random population process, in which they guide the GA toward a significant search course. The IGA increases the likelihood of quickly detecting the maximum ratio difference between cancer cases and noncancer cases. The joint effect of 23 SNP combinations of six steroid hormone metabolism and signaling-related genes involved in breast carcinogenesis pathways was systematically evaluated, with the IGA successfully detecting significant ratio differences between breast cancer cases and noncancer cases. The possible breast cancer risks were subsequently analyzed by odds-ratio (OR) and risk-ratio analysis. The estimated OR of the best SNP barcode is significantly higher than 1 (between 1.15 and 7.01) for specific combinations of two to 13 SNPs. Analysis results support that the IGA provides higher ratio difference values than the GA between breast cancer cases and noncancer cases over 3-SNP to 13-SNP interactions. A more specific SNP-SNP interaction profile for the risk of breast cancer is also provided.
Article
Models in which two susceptibility loci jointly influence the risk of developing disease can be explored using logistic regression analysis. Comparison of likelihoods of models incorporating different sets of disease model parameters allows inferences to be drawn regarding the nature of the joint effect of the loci. We have simulated case-control samples generated assuming different two-locus models and then analysed them using logistic regression. We show that this method is practicable and that, for the models we have used, it can be expected to allow useful inferences to be drawn from sample sizes consisting of hundreds of subjects. Interactions between loci can be explored, but interactive effects do not exactly correspond with classical definitions of epistasis. We have particularly examined the issue of the extent to which it is helpful to utilise information from a previously identified locus when investigating a second, unknown locus. We show that for some models conditional analysis can have substantially greater power while for others unconditional analysis can be more powerful. Hence we conclude that in general both conditional and unconditional analyses should be performed when searching for additional loci.
Article
Multi-objective evolutionary algorithms (MOEAs) that use non-dominated sorting and sharing have been criticized mainly for: (1) their O(MN^3) computational complexity (where M is the number of objectives and N is the population size); (2) their non-elitism approach; and (3) the need to specify a sharing parameter. In this paper, we suggest a non-dominated sorting-based MOEA, called NSGA-II (Non-dominated Sorting Genetic Algorithm II), which alleviates all of the above three difficulties. Specifically, a fast non-dominated sorting approach with O(MN^2) computational complexity is presented. Also, a selection operator is presented that creates a mating pool by combining the parent and offspring populations and selecting the best N solutions (with respect to fitness and spread). Simulation results on difficult test problems show that NSGA-II is able, for most problems, to find a much better spread of solutions and better convergence near the true Pareto-optimal front compared to the Pareto-archived evolution strategy and the strength-Pareto evolutionary algorithm - two other elitist MOEAs that pay special attention to creating a diverse Pareto-optimal front. Moreover, we modify the definition of dominance in order to solve constrained multi-objective problems efficiently. Simulation results of the constrained NSGA-II on a number of test problems, including a five-objective, seven-constraint nonlinear problem, are compared with another constrained multi-objective optimizer, and the much better performance of NSGA-II is observed.
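The fast non-dominated sorting step mentioned in the abstract above can be sketched as follows. This is a plain illustration of the general technique, assuming every objective is minimized; the function names and the tuple-based representation of objective vectors are choices made for this example, and crowding-distance selection and constraint handling are left out.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one
    (all objectives are assumed to be minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def fast_non_dominated_sort(objectives):
    """Partition solution indices into Pareto fronts (front 0 is non-dominated).

    Each pair of solutions is compared over M objectives, giving the
    O(MN^2) cost quoted in the abstract.
    """
    n = len(objectives)
    dominated_solutions = [[] for _ in range(n)]  # indices that solution p dominates
    domination_count = [0] * n                    # how many solutions dominate p
    fronts = [[]]
    for p in range(n):
        for q in range(n):
            if p == q:
                continue
            if dominates(objectives[p], objectives[q]):
                dominated_solutions[p].append(q)
            elif dominates(objectives[q], objectives[p]):
                domination_count[p] += 1
        if domination_count[p] == 0:
            fronts[0].append(p)
    current = 0
    while fronts[current]:
        next_front = []
        for p in fronts[current]:
            for q in dominated_solutions[p]:
                # Removing p's front leaves q with one fewer dominator.
                domination_count[q] -= 1
                if domination_count[q] == 0:
                    next_front.append(q)
        current += 1
        fronts.append(next_front)
    return fronts[:-1]  # drop the trailing empty front
```

For a two-objective epistasis search, each tuple would hold the two scores of one candidate SNP combination, with any maximized score negated first so that the minimization convention holds.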
Article
Hybrid methods of using evolutionary algorithms with a local search method are often used in the context of single-objective real-world optimization. In this paper, we discuss a couple of hybrid methods for multiobjective real-world optimization. In the posteriori approach, the obtained non-dominated solutions of a multiobjective evolutionary algorithm (MOEA) run are modified using a local search method. In the online approach, a local search method is applied to each solution obtained by genetic operations in a MOEA run. Both these approaches are compared on three engineering shape optimization problems for a fixed number of overall function evaluations. Simulation results suggest important insights about the extent of local search and the extent of an MOEA needed to achieve an overall efficient hybrid approach.