Yi Pan

The Second Xiangya Hospital of Central South University, Changsha, Hunan, China

Publications (361) · 293.48 Total Impact

  • ABSTRACT: Single nucleotide polymorphisms (SNPs), a dominant type of genetic variant, have been used successfully to identify defective genes causing human single-gene diseases. However, most common human diseases are complex diseases caused by gene-gene and gene-environment interactions. Many SNP-SNP interaction analysis methods have been introduced, but they are not powerful enough to discover interactions involving more than three SNPs. This paper proposes a novel method that analyzes all SNPs simultaneously. Unlike existing methods, it regards an individual's genotype data on a list of SNPs as a point with a unit of energy in a multi-dimensional space, and tries to find a new coordinate system in which the difference in energy distribution between cases and controls reaches its maximum. Based on the new coordinate system, the method finds multi-SNP combinatorial patterns that differ between cases and controls. Experiments on simulated data show that the method is efficient, and tests on real data for age-related macular degeneration (AMD) show that it finds more significant multi-SNP combinatorial patterns than existing methods.
    IEEE/ACM Transactions on Computational Biology and Bioinformatics 05/2015; 12(3):695-704. DOI:10.1109/TCBB.2014.2363459 · 1.54 Impact Factor
  • ABSTRACT: Prediction of essential proteins, which are crucial to an organism's survival, is important for disease analysis and drug design, as well as for understanding cellular life. Most prediction methods infer how likely proteins are to be essential from network topology. However, these methods are limited by the completeness of the available protein-protein interaction (PPI) data and depend on the accuracy of the network. To overcome these limitations, several computational methods have been proposed, but few of them take protein domains into consideration. In this work, we first analyze the correlation between the essentiality of proteins and their domain features based on data from 13 species. We find that proteins containing more protein domain types that rarely occur in other proteins tend to be essential. Accordingly, we propose a new prediction method, named UDoNC, that combines the domain features of proteins with their topological properties in the PPI network. In UDoNC, the essentiality of a protein is decided by the number and frequency of its protein domain types, as well as by the essentiality of its adjacent edges as measured by the edge clustering coefficient. Experimental results on S. cerevisiae data show that UDoNC outperforms existing methods in terms of area under the curve (AUC). UDoNC also performs well at predicting essential proteins on E. coli data.
    IEEE/ACM Transactions on Computational Biology and Bioinformatics 04/2015; 12(2):276-288. DOI:10.1109/TCBB.2014.2338317 · 1.54 Impact Factor
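The edge clustering coefficient mentioned above can be sketched in a few lines. This is one common formulation (shared neighbours of the edge's endpoints over the maximum number of triangles the edge could close); UDoNC's exact definition may differ, and the toy network is invented for illustration.

```python
def edge_clustering_coefficient(adj, u, v):
    """One common ECC definition: common neighbours of u and v divided
    by the largest possible number of triangles through edge (u, v)."""
    common = len(adj[u] & adj[v])
    denom = min(len(adj[u]), len(adj[v])) - 1
    return common / denom if denom > 0 else 0.0

# toy PPI network as adjacency sets (hypothetical proteins)
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"A", "C"},
}
```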
  • ABSTRACT: Essential proteins are indispensable for cellular life. Identifying them helps us understand the minimal requirements for cellular life and is also very important for drug design. However, experimental identification of essential proteins is typically time-consuming and expensive. With the development of high-throughput technology in the post-genomic era, more and more protein-protein interaction data have become available, making it possible to study essential proteins at the network level. A series of computational approaches have been proposed for predicting essential proteins based on network topology, most of which use network centralities. In this paper, we investigate the topological characteristics of essential proteins from a completely new perspective. To our knowledge, this is the first time topology potential has been used to identify essential proteins from a protein-protein interaction (PPI) network. The basic idea is that each protein in the network can be viewed as a material particle that creates a potential field around itself, and the interaction of all proteins forms a topological field over the network. By defining and computing each protein's topology potential, we obtain a more precise ranking that reflects the importance of proteins in the PPI network. Experimental results show that the topology potential-based methods TP and TP-NC outperform the traditional topology measures degree centrality (DC), betweenness centrality (BC), closeness centrality (CC), subgraph centrality (SC), eigenvector centrality (EC), information centrality (IC), and network centrality (NC) for predicting essential proteins. In addition, these centrality measures perform better at identifying essential proteins in biological networks when controlled by topology potential.
    IEEE/ACM Transactions on Computational Biology and Bioinformatics 04/2015; 12(2):372-383. DOI:10.1109/TCBB.2014.2361350 · 1.54 Impact Factor
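Topology potential as described above is commonly defined with a Gaussian kernel over graph distances: each node contributes exp(-(d/σ)²) to every other node. The sketch below assumes unit node mass and unweighted shortest-path distance; the paper's parameterization may differ.

```python
import math
from collections import deque

def shortest_path_lengths(adj, src):
    # plain BFS over an unweighted adjacency-set graph
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def topology_potential(adj, sigma=1.0):
    """Gaussian-kernel topology potential: every other reachable node
    contributes exp(-(d/sigma)^2), assuming unit mass per node."""
    pot = {}
    for u in adj:
        d = shortest_path_lengths(adj, u)
        pot[u] = sum(math.exp(-(dist / sigma) ** 2)
                     for v, dist in d.items() if v != u)
    return pot
```

On a small path graph the centre node scores highest, matching the intuition that topology potential rewards well-placed nodes.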
  • ABSTRACT: A variety of biological applications have been built on next-generation genome sequencing technologies, and alignment is the first step once sequencing reads are obtained. In recent years, many software tools have been developed to align short reads to the reference genome efficiently and accurately. However, many reads still cannot be mapped to the reference genome because they exceed the allowable number of mismatches. Moreover, besides the unmapped reads, reads with low mapping quality are also excluded from downstream analyses such as variant calling. If we can take advantage of the confident segments of these reads, not only can alignment rates be improved, but more information will also be provided for downstream analysis. This paper proposes a method, called RAUR (Re-align the Unmapped Reads), to re-align reads that cannot be mapped by alignment tools. First, it uses the base quality scores (reported by the sequencer) to identify the most confident and informative segment of each unmapped read by controlling the number of possible mismatches in the alignment. Then, combined with an alignment tool, RAUR re-aligns these segments. We ran RAUR on both simulated and real data with different read lengths. The results show that many reads that fail to be aligned by the most popular alignment tools (BWA and Bowtie2) can be correctly re-aligned by RAUR with similar precision. Even compared with BWA-MEM and the local mode of Bowtie2, which perform local alignment of long reads to improve the alignment rate, RAUR shows advantages in alignment rate and precision in some cases. Therefore, the trimming strategy used in RAUR is useful for improving the alignment rate of alignment tools for next-generation genome sequencing. All source code is available at http://netlab.csu.edu.cn/bioinformatics/RAUR.html.
    BMC Bioinformatics 03/2015; 16(Suppl 5):S8. DOI:10.1186/1471-2105-16-S5-S8 · 2.67 Impact Factor
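Picking the "most confident segment" of a read from its base qualities can be illustrated with a Kadane-style scan that maximizes the sum of (quality − cutoff) over a contiguous window. This is an illustrative trimming heuristic, not RAUR's exact rule.

```python
def best_segment(qualities, cutoff=20):
    """Return the half-open [start, end) window maximizing
    sum(q - cutoff): the most confident contiguous stretch of a read.
    Illustrative only; RAUR's actual segment selection may differ."""
    best = (0, 0, 0)          # (score, start, end)
    score, start = 0, 0
    for i, q in enumerate(qualities):
        score += q - cutoff
        if score <= 0:
            # segment is a net loss: restart after this base
            score, start = 0, i + 1
        elif score > best[0]:
            best = (score, start, i + 1)
    return best[1], best[2]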
  • ABSTRACT: Recently developed next-generation sequencing platforms not only decrease the cost of metagenomics data analysis but also greatly enlarge the size of metagenomic sequence datasets. A common bottleneck of available assemblers is that the trade-off between the noise of the resulting contigs and the gain in sequence length for better annotation has not received enough attention in large-scale sequencing projects, especially for datasets with low coverage and a large number of non-overlapping contigs. To address this limitation and improve both accuracy and efficiency, we develop a novel metagenomic sequence assembly framework, DIME, that takes a DIvide, conquer, and MErge strategy. In addition, we give two MapReduce implementations of DIME, DIME-cap3 and DIME-genovo, on the Apache Hadoop platform. For a systematic comparison of assembly performance, we tested DIME and five other popular short-read assembly programs, Cap3, Genovo, MetaVelvet, SOAPdenovo, and SPAdes, on four synthetic and three real metagenomic sequence datasets ranging from fifty thousand to a couple of million reads in size. The experimental results demonstrate that our method not only partitions the sequence reads with extremely high accuracy, but also reconstructs more bases, generates higher-quality assembled consensus, and yields higher assembly scores, including corrected N50 and BLAST-score-per-base, than the other tools, with a nearly theoretical speed-up. The results indicate that DIME offers great improvement in assembly across a range of sequence abundances and is thus robust to decreasing coverage.
Journal of Computational Biology 02/2015; 22(2):159-77. DOI:10.1089/cmb.2014.0251 · 1.67 Impact Factor
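The N50 score used to compare the assemblies above has a standard definition: the largest contig length L such that contigs of length ≥ L together cover at least half of the assembled bases. A minimal sketch (plain N50; "corrected" N50 additionally penalizes misassemblies):

```python
def n50(contig_lengths):
    """N50: walk contigs longest-first until half the total assembled
    bases are covered; the length of the contig that crosses the
    halfway point is the N50."""
    total = sum(contig_lengths)
    acc = 0
    for length in sorted(contig_lengths, reverse=True):
        acc += length
        if acc * 2 >= total:
            return length
    return 0
```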
  • ABSTRACT: Essential proteins are vitally important for cellular survival and development, and identifying them is meaningful research work in the post-genome era. The rapid increase in available protein-protein interaction (PPI) data has made it possible to detect protein essentiality at the network level, and a series of centrality measures have been proposed to discover essential proteins based on PPI networks. However, PPI data obtained from large-scale, high-throughput experiments generally contain false positives, so the original PPI data alone are insufficient for identifying essential proteins, and improving accuracy has become the focus of this task. In this paper, we propose a framework for identifying essential proteins from active PPI networks constructed with dynamic gene expression. First, we process the dynamic gene expression profiles using a time-dependent model and a time-independent model. Second, we construct an active PPI network based on co-expressed genes. Last, we apply six classical centrality measures to the active PPI network. For comparison, other prediction methods are also applied to identify essential proteins on the active PPI network. Experimental results on the yeast network show that identifying essential proteins based on the active PPI network considerably improves the performance of centrality measures, in terms of both the number of essential proteins identified and identification accuracy. The results also indicate that most essential proteins are active.
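Building an "active" network from expression data can be sketched with a simple co-expression rule: keep an interaction only if both proteins are expressed above a threshold at some common time point. This is an assumed simplification; the paper's time-dependent/time-independent models are more elaborate.

```python
def active_network(edges, expr, threshold):
    """Filter a static PPI edge list down to 'active' interactions:
    both endpoints must exceed `threshold` at a shared time point.
    Illustrative rule, not the paper's exact construction."""
    active = []
    for u, v in edges:
        times_u = {t for t, x in enumerate(expr[u]) if x >= threshold}
        times_v = {t for t, x in enumerate(expr[v]) if x >= threshold}
        if times_u & times_v:      # co-expressed at some time point
            active.append((u, v))
    return active
```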
  • ABSTRACT: As the volume of data grows at an unprecedented rate, large-scale data mining and knowledge discovery present a tremendous challenge. Rough set theory, which has been used successfully in solving problems in pattern recognition, machine learning, and data mining, centers on the idea that a set of distinct objects may be approximated via a lower and an upper bound. To obtain the benefits that rough sets can provide for data mining and related tasks, efficient computation of these approximations is vital. The recently introduced cloud computing model MapReduce has gained a lot of attention from the scientific community for its applicability to large-scale data analysis. In previous research, we proposed a MapReduce-based method for computing approximations in parallel, which can efficiently process complete data but fails in the case of missing (incomplete) data. To address this shortcoming, three different parallel matrix-based methods are introduced to process large-scale, incomplete data. All of them are built on MapReduce and implemented on Twister, a lightweight MapReduce runtime system. The proposed parallel methods are experimentally shown to be efficient for processing large-scale data.
    IEEE Transactions on Knowledge and Data Engineering 01/2015; 27(2):326-339. DOI:10.1109/TKDE.2014.2330821 · 1.82 Impact Factor
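The lower and upper approximations at the heart of the work above have a compact classical definition: given a partition of the universe into equivalence classes, the lower approximation collects classes fully inside the target set, the upper approximation those that merely intersect it. A serial sketch (the paper's contribution is the parallel, incomplete-data version):

```python
def approximations(partition, target):
    """Classical rough-set approximations over a partition of the
    universe into equivalence classes (each a set of objects)."""
    target = set(target)
    lower, upper = set(), set()
    for block in partition:
        if block <= target:
            lower |= block      # class entirely inside the target
        if block & target:
            upper |= block      # class overlapping the target
    return lower, upper
```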
  • ABSTRACT: Apolipoprotein M (ApoM) is predominantly located in high-density lipoprotein in human plasma. It has been demonstrated that ApoM expression can be regulated by several crucial nuclear receptors involved in bile acid metabolism. In the present study, by combining gene-silencing experiments, overexpression studies, and chromatin immunoprecipitation assays, we show that ApoM positively regulates liver receptor homolog-1 (LRH-1) gene expression via direct binding to an LRH-1 promoter region (nucleotides -406/-197). In addition, we investigated the effects of the farnesoid X receptor agonist GW4064 on hepatic ApoM expression in vitro. In HepG2 cell cultures, both mRNA and protein levels of ApoM and LRH-1 decreased in a time-dependent manner in the presence of 1 μM GW4064, and the inhibitory effect was gradually attenuated after 24 hours. In conclusion, our findings provide supportive evidence that ApoM is a regulator of human LRH-1 transcription, and further reveal the importance of ApoM as a critical regulator of bile acid metabolism.
    Drug Design, Development and Therapy 01/2015; 9:2375-82. DOI:10.2147/DDDT.S78496 · 3.03 Impact Factor
  • ABSTRACT: In genome assembly, the primary issue is how to determine the upstream and downstream sequence regions of sequence seeds for constructing long contigs or scaffolds. When extending a sequence seed, repetitive regions in the genome always produce multiple feasible extension candidates, which increases the difficulty of genome assembly. The universally accepted solution is to choose one candidate based on read overlaps and paired-end (mate-pair) reads. However, this solution faces difficulties with some complex repetitive regions. In addition, sequencing errors may produce false repetitive regions, and uneven sequencing depth leaves some sequence regions with too few or too many reads. All of these problems prevent existing assemblers from producing satisfactory assembly results. In this article, we develop an algorithm, called EPGA, which extracts paths from the De Bruijn graph for genome assembly. EPGA uses a new score function to evaluate extension candidates based on the distributions of reads and insert size. The distribution of reads can solve problems caused by sequencing errors and short repetitive regions, and by assessing the variation of the insert-size distribution, EPGA can solve problems introduced by some complex repetitive regions. To handle uneven sequencing depth, EPGA uses relative mapping to evaluate extension candidates. On real datasets, we compare the performance of EPGA and other popular assemblers. The experimental results demonstrate that EPGA effectively obtains longer and more accurate contigs and scaffolds. EPGA is publicly available for download at https://github.com/bioinfomaticsCSU/EPGA. Contact: jxwang@csu.edu.cn.
    Bioinformatics 11/2014; 31(6). DOI:10.1093/bioinformatics/btu762 · 4.62 Impact Factor
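The De Bruijn graph that EPGA extracts paths from has a standard construction: nodes are (k-1)-mers, and every k-mer observed in the reads adds an edge from its prefix to its suffix. A minimal sketch of that construction (EPGA's scoring of extension candidates is not shown):

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a k-mer De Bruijn graph as an adjacency list:
    each k-mer contributes an edge prefix -> suffix over (k-1)-mers."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph
```

A contig then corresponds to a path through this graph; repeats show up as nodes with multiple outgoing edges, which is exactly where extension candidates multiply.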
  • ABSTRACT: Centrality analysis has become a principal method for identifying essential proteins in biological networks. Here we present CytoNCA, a Cytoscape plugin integrating calculation, evaluation, and visualization analysis for multiple centrality measures. (i) CytoNCA supports eight different centrality measures, each of which can be applied to both weighted and unweighted biological networks. (ii) It allows users to upload biological information about both nodes and edges in the network, integrating biological data with topological data to detect specific nodes. (iii) CytoNCA offers several powerful visualization analysis modules, which generate various forms of output such as graphs, tables, and charts, and analyze associations among all measures. (iv) It can be used to quantitatively assess the calculation results and evaluate their accuracy with statistical measures. (v) Besides the current eight centrality measures, biological characteristics from other sources can also be analyzed and assessed by CytoNCA. This makes CytoNCA an excellent tool for calculating centrality and for evaluating and visualizing biological networks. http://apps.cytoscape.org/apps/cytonca.
BioSystems 11/2014; 127:67-72. DOI:10.1016/j.biosystems.2014.11.005 · 1.47 Impact Factor
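The simplest of the centrality measures a tool like CytoNCA computes is weighted degree centrality: a node's score is the sum of the weights of its incident edges. A minimal sketch (the plugin's eight measures and their exact definitions are not reproduced here):

```python
def degree_centrality(weighted_edges):
    """Weighted degree centrality from a (u, v, weight) edge list;
    with all weights 1.0 this reduces to plain degree."""
    score = {}
    for u, v, w in weighted_edges:
        score[u] = score.get(u, 0.0) + w
        score[v] = score.get(v, 0.0) + w
    return score
```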
  • ABSTRACT: Identification of disease-causing genes among a large number of candidates is a fundamental challenge in human disease studies. However, determining the real disease-causing genes by biological experiments is still time-consuming and laborious. With advances in high-throughput techniques, a large number of protein-protein interactions have been produced, and several methods based on protein interaction networks have been proposed to address this issue. In this paper, we propose a shortest path-based algorithm, named SPranker, to prioritize disease-causing genes in protein interaction networks. Considering the fact that diseases with similar phenotypes are generally caused by functionally related genes, we further propose an improved algorithm, SPGOranker, that integrates the semantic similarity of GO annotations. SPGOranker considers not only the topological similarity between protein pairs in a protein interaction network but also their functional similarity. The proposed algorithms SPranker and SPGOranker were applied to 1598 known orphan disease-causing genes from 172 orphan diseases and compared with three state-of-the-art approaches, ICN, VS, and RWR. The experimental results show that SPranker and SPGOranker outperform ICN, VS, and RWR for the prioritization of orphan disease-causing genes. Importantly, in a case study of severe combined immunodeficiency, SPranker and SPGOranker predicted several novel causal genes.
Science China Life Sciences 10/2014; DOI:10.1007/s11427-014-4747-6 · 1.51 Impact Factor
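A shortest path-based prioritization like the one described above can be illustrated by scoring each candidate gene by its average BFS distance to the known disease genes (smaller is better). This is a simplified stand-in for SPranker's actual scoring; the graph and gene names are invented.

```python
from collections import deque

def rank_candidates(adj, known, candidates):
    """Rank candidates by mean shortest-path distance to known
    disease genes; unreachable candidates score infinity."""
    def bfs(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return dist

    dists = {g: bfs(g) for g in known}
    scores = {c: sum(d.get(c, float("inf")) for d in dists.values()) / len(known)
              for c in candidates}
    return sorted(candidates, key=scores.get)
```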
  • ABSTRACT: Many computational methods have been proposed to identify essential proteins using the topological features of interactome networks. However, the precision of essential protein discovery still needs to be improved. Research shows that the majority of hubs (essential proteins) in the yeast interactome network are essential due to their involvement in essential complex biological modules, and that hubs can be classified into two categories: date hubs and party hubs. In this study, combining gene expression profiles, we propose a new method to predict essential proteins based on overlapping essential modules, named POEM. In POEM, the original protein interactome network is partitioned into many overlapping essential modules, and the frequencies and weighted degrees of proteins in these modules are used to decide which category a protein belongs to. The comparative results show that POEM outperforms the classical centrality measures Degree Centrality (DC), Information Centrality (IC), Eigenvector Centrality (EC), Subgraph Centrality (SC), Betweenness Centrality (BC), Closeness Centrality (CC), and Edge Clustering Coefficient Centrality (NC), as well as two newly proposed essential protein prediction methods, PeC and CoEWC. The experimental results indicate that the precision of essential protein prediction can be improved by considering the modularity of proteins and integrating gene expression profiles with network topological features.
    IEEE Transactions on NanoBioscience 08/2014; 13(4). DOI:10.1109/TNB.2014.2337912 · 1.77 Impact Factor
  • ABSTRACT: Accurate annotation of protein function is still a big challenge for understanding life in the post-genomic era. Recently, some methods have been developed to solve the problem by incorporating the functional similarity of GO terms into the protein-protein interaction (PPI) network. They are based on the observations that a protein tends to share some functions with the proteins it interacts with in the PPI network, and that two similar GO terms in the functional interrelationship network usually co-annotate some common proteins. However, these methods consider the same-level neighbors of proteins and GO terms respectively, and few attempts have been made to investigate their difference. Given the topological and structural differences between the PPI network and the functional interrelationship network, we first investigate at which level the neighbors of proteins tend to have functional associations and at which level the neighbors of GO terms usually co-annotate common proteins. Then an unbalanced bi-random walk (UBiRW) algorithm, which iteratively walks different numbers of steps in the two networks, is adopted to find protein-GO term associations from known associations. Experiments were carried out on S. cerevisiae data. The results show that our method achieves better prediction performance not only than methods that use only PPI network data, but also than methods that consider the same-level neighbors of proteins and GO terms.
    Current Protein and Peptide Science 07/2014; DOI:10.2174/1389203715666140724085224 · 2.33 Impact Factor
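The building block of a bi-random walk is the ordinary random walk with restart on a single network: with probability α the walker jumps back to the seed distribution, otherwise it moves to a uniformly chosen neighbour. UBiRW alternates such walks (with different step counts) over two networks; the sketch below shows only the single-network walk, with an assumed α and iteration count.

```python
def random_walk_restart(adj, seeds, alpha=0.5, iters=50):
    """Random walk with restart on an adjacency-set graph.
    Returns the stationary-ish probability of each node."""
    nodes = list(adj)
    restart = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
    p = dict(restart)
    for _ in range(iters):
        nxt = {u: alpha * restart[u] for u in nodes}   # restart mass
        for u in nodes:
            if not adj[u]:
                continue
            share = (1 - alpha) * p[u] / len(adj[u])   # walk mass
            for v in adj[u]:
                nxt[v] += share
        p = nxt
    return p
```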
  • ABSTRACT: Data gathering is a fundamental task in Wireless Sensor Networks (WSNs). Data gathering trees capable of performing aggregation operations are also referred to as Data Aggregation Trees (DATs). Currently, most existing works focus on constructing DATs according to different user requirements under the Deterministic Network Model (DNM). However, due to the many probabilistic lossy links in WSNs, it is more practical to obtain a DAT under the realistic Probabilistic Network Model (PNM). Moreover, the load-balance factor has been neglected when constructing DATs in the current literature. Therefore, in this paper, we focus on constructing a Load-Balanced Data Aggregation Tree (LBDAT) under the PNM. More specifically, three problems are investigated: the Load-Balanced Maximal Independent Set (LBMIS) problem, the Connected Maximal Independent Set (CMIS) problem, and the LBDAT construction problem. LBMIS and CMIS are well-known NP-hard problems and LBDAT construction is NP-complete. Consequently, approximation algorithms and comprehensive theoretical analysis of the approximation factors are presented. Finally, our simulation results show that the proposed algorithms significantly outperform the existing state-of-the-art approaches.
    IEEE Transactions on Parallel and Distributed Systems 07/2014; 25(7):1681-1690. DOI:10.1109/TPDS.2013.160 · 2.17 Impact Factor
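A maximal independent set, the object underlying the LBMIS and CMIS problems above, can be computed greedily: visit nodes in some order, keep a node if none of its neighbours has been kept. This plain greedy sketch ignores the load-balance and connectivity constraints that make the paper's variants NP-hard.

```python
def greedy_mis(adj):
    """Greedy maximal independent set over an adjacency-set graph;
    deterministic node order, no load-balance objective."""
    mis, blocked = set(), set()
    for u in sorted(adj):
        if u not in blocked:
            mis.add(u)
            blocked |= adj[u] | {u}   # exclude u's neighbours
    return mis
```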
  • ABSTRACT: Identification of essential proteins is very important for understanding the minimal requirements for cellular life and is also necessary for a series of practical applications, such as drug design. With advances in high-throughput technologies, a large number of protein-protein interactions are available, which makes it possible to detect protein essentiality at the network level. Considering that most species already have a number of known essential proteins, we propose a new prior-knowledge-based scheme to discover new essential proteins from protein interaction networks. Based on this scheme, two essential protein discovery algorithms, CPPK and CEPPK, were developed. CPPK predicts new essential proteins based on network topology, and CEPPK detects new essential proteins by integrating network topology and gene expression. The performance of CPPK and CEPPK was validated on the protein interaction network of Saccharomyces cerevisiae. The experimental results show that prior knowledge of known essential proteins is effective for improving prediction precision. The precision of CPPK and CEPPK clearly exceeds that of ten previously proposed essential protein discovery methods: Degree Centrality (DC), Betweenness Centrality (BC), Closeness Centrality (CC), Subgraph Centrality (SC), Eigenvector Centrality (EC), Information Centrality (IC), Bottle Neck (BN), Density of Maximum Neighborhood Component (DMNC), Local Average Connectivity-based method (LAC), and Network Centrality (NC). In particular, CPPK achieves a 40% improvement in precision over BC, CC, SC, EC, and BN, and CEPPK performs even better. CEPPK was also compared to four other methods that are not node centralities (EPC, ORFL, PeC, and CoEWC) and was shown to achieve the best results.
    Methods 06/2014; 67(3). DOI:10.1016/j.ymeth.2014.02.016 · 3.22 Impact Factor
  • ABSTRACT: Identification of protein complexes from protein-protein interaction networks has become a key problem for understanding cellular life in the post-genomic era, and many computational methods have been proposed for it. Up to now, existing computational methods have mostly been applied to static PPI networks. However, proteins and their interactions are dynamic in reality, and identifying dynamic protein complexes is more meaningful and challenging. In this paper, a novel algorithm, named DPC, is proposed to identify dynamic protein complexes by integrating PPI data and gene expression profiles. Following the core-attachment assumption, proteins that are always active over the molecular cycle are regarded as core proteins. Protein-complex cores are identified among these always-active proteins by detecting dense subgraphs, and final protein complexes are extended from the cores by adding attachments based on a topological measure of closeness and dynamic meaning. The protein complexes produced by DPC thus contain two parts: a static core expressed throughout the molecular cycle and short-lived dynamic attachments. DPC was applied to Saccharomyces cerevisiae data, and the experimental results show that it outperforms CMC, MCL, SPICi, HC-PIN, COACH, and Core-Attachment based on matching with known complexes and hF-measures.
    BioMed Research International 05/2014; 2014:375262. DOI:10.1155/2014/375262 · 2.71 Impact Factor
  • ABSTRACT: Advanced biological technologies are producing large-scale protein-protein interaction (PPI) data at an ever-increasing pace, which enables us to identify protein complexes from PPI networks. Pairwise protein interactions can be modeled as a graph in which vertices represent proteins and edges represent PPIs. However, most current algorithms detect protein complexes on deterministic graphs, whose edges are either present or absent, and neighboring information is neglected in these methods. Based on the uncertain graph model, we propose the concept of expected density to assess the density of a subgraph, and the concept of relative degree to describe the relationship between a protein and a subgraph in a PPI network. We develop an algorithm called DCU (detecting complexes based on the uncertain graph model) to detect complexes from PPI networks. In our method, the expected density combined with the relative degree is used to determine whether a subgraph represents a complex with high cohesion and low coupling. We apply our method and the existing competing algorithms to two yeast PPI networks. Experimental results indicate that our method performs significantly better than the state-of-the-art methods and that the proposed model can provide more insights for future study of PPI networks.
    IEEE/ACM Transactions on Computational Biology and Bioinformatics 05/2014; 11(3):486-497. DOI:10.1109/TCBB.2013.2297915 · 1.54 Impact Factor
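Expected density on an uncertain graph has a natural reading: by linearity of expectation, the expected number of edges present in a subgraph is just the sum of the edge probabilities, so the expected density is that sum over the number of possible pairs. A sketch under that assumed definition (the paper's normalization may differ):

```python
from itertools import combinations

def expected_density(prob, nodes):
    """Expected density of a subgraph of an uncertain graph:
    sum of edge-existence probabilities over the number of possible
    node pairs. `prob` maps frozenset({u, v}) -> probability."""
    nodes = list(nodes)
    possible = len(nodes) * (len(nodes) - 1) / 2
    expected = sum(prob.get(frozenset(pair), 0.0)
                   for pair in combinations(nodes, 2))
    return expected / possible if possible else 0.0
```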
  • Wireless Communications and Mobile Computing 05/2014; 14(7):673-688. DOI:10.1002/wcm.2218 · 1.29 Impact Factor
  • ABSTRACT: Most biological processes are carried out by protein complexes, but a substantial number of false positives in protein-protein interaction (PPI) data can compromise the utility of the datasets for complex reconstruction. To reduce the impact of such discrepancies, a number of data integration and affinity scoring schemes have been devised that encode the reliability (confidence) of physical interactions between pairs of proteins. The challenge is then to identify novel and meaningful protein complexes from the weighted PPI network. To address this problem, a novel protein complex mining algorithm, ClusterBFS (Cluster with Breadth-First Search), is proposed. Based on weighted density, ClusterBFS detects protein complexes in the weighted network by breadth-first search, starting from a given seed protein. The experimental results show that ClusterBFS performs significantly better than the other computational approaches in terms of protein complex identification.
    04/2014; 2014:354539. DOI:10.1155/2014/354539
  • ABSTRACT: Background: Taking advantage of high-throughput single nucleotide polymorphism (SNP) genotyping technology, large genome-wide association studies (GWASs) have been considered to hold promise for unravelling complex relationships between genotype and phenotype. At present, traditional single-locus methods are insufficient to detect the multi-locus interactions that broadly exist in complex traits. In addition, statistical tests for high-order epistatic interactions involving more than two SNPs pose computational and analytical challenges, because the computation increases exponentially as the cardinality of the SNP combinations grows. In this paper, we provide a simple, fast, and powerful method that uses dynamic clustering and cloud computing to detect genome-wide multi-locus epistatic interactions. We constructed systematic experiments to compare power against several recently proposed algorithms, including TEAM, SNPRuler, EDCF, and BOOST. Furthermore, we applied our method to two real GWAS datasets, the age-related macular degeneration (AMD) and rheumatoid arthritis (RA) datasets, where we found novel potential disease-related genetic factors that do not show up in detections of two-locus epistatic interactions. Experimental results on simulated data demonstrate that our method is more powerful than several recently proposed methods on both two- and three-locus disease models, and it discovered many novel high-order associations significantly enriched in cases in the two real GWAS datasets. Moreover, the running times of the cloud implementation on the AMD and RA datasets are roughly 2 hours and 50 hours, respectively, on a cluster of forty small virtual machines detecting two-locus interactions. Therefore, we believe our method is suitable and effective for full-scale analysis of multi-locus epistatic interactions in GWAS.
    BMC Bioinformatics 04/2014; 15(1):102. DOI:10.1186/1471-2105-15-102 · 2.67 Impact Factor
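The statistical core of epistasis screens like those compared above is a contingency test over multi-locus genotype combinations: for a chosen SNP subset, count each genotype combination in cases and controls and compute a Pearson chi-square statistic. The sketch below shows that test only; it is illustrative and not the paper's dynamic-clustering method.

```python
from collections import Counter

def chi2_multilocus(case_geno, control_geno):
    """Pearson chi-square over multi-locus genotype combinations.
    Inputs are lists of genotype tuples (one tuple per individual,
    one entry per SNP in the tested subset)."""
    cases = Counter(map(tuple, case_geno))
    controls = Counter(map(tuple, control_geno))
    n_case, n_ctrl = sum(cases.values()), sum(controls.values())
    total = n_case + n_ctrl
    stat = 0.0
    for g in set(cases) | set(controls):
        row = cases[g] + controls[g]
        for obs, n in ((cases[g], n_case), (controls[g], n_ctrl)):
            exp = row * n / total          # expected count under H0
            if exp:
                stat += (obs - exp) ** 2 / exp
    return stat
```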

Publication Stats

3k Citations
293.48 Total Impact Points

Institutions

  • 2015
    • The Second Xiangya Hospital of Central South University
Changsha, Hunan, China
  • 1970–2015
    • Georgia State University
      • Department of Computer Science
      Atlanta, Georgia, United States
  • 2013
    • University of Connecticut
      • Department of Computer Science and Engineering
      Storrs, CT, United States
  • 2010–2013
    • Central South University
      • School of Biological Science and Technology
      • School of Information Science and Engineering
      Changsha, Hunan, China
  • 2009
    • Southeast University (China)
      • School of Computer Science and Engineering
Nanjing, Jiangsu, China
    • University of Central Arkansas
      • Department of Computer Science
      Arkansas, United States
  • 2008
    • National Taiwan University of Science and Technology
      • Department of Computer Science and Information Engineering
Taipei, Taiwan
  • 2003–2008
    • Southwest Jiaotong University
      • School of Information Science and Technology
Chengdu, Sichuan, China
  • 2006–2007
    • Jiangsu University of Science and Technology
Zhenjiang, Jiangsu, China
    • University of Georgia
Athens, Georgia, United States
    • Nanjing University
Nanjing, Jiangsu, China
  • 2005
    • Nanyang Technological University
Singapore
  • 2004–2005
    • The University of Memphis
      • Department of Computer Science
      Memphis, TN, United States
    • Georgia Institute of Technology
      • College of Computing
      Atlanta, Georgia, United States
  • 2002
    • University of Tsukuba
      • Centre for Computational Sciences
      Tsukuba, Ibaraki, Japan
    • The University of Aizu
      • School of Computer Science and Engineering
Aizuwakamatsu, Fukushima, Japan
  • 1970–2001
    • University of Dayton
      • Department of Computer Science
      Dayton, Ohio, United States
  • 1999
    • Griffith University
      Southport, Queensland, Australia
  • 1997–1999
    • Louisiana State University
      • Department of Computer Science
      Baton Rouge, Louisiana, United States