Lin Wang

Peking University, Peping, Beijing, China

Publications (9) · 33.25 Total impact

  • ABSTRACT: Epistatic Miniarray Profiles (EMAP) enable the study of genetic interactions and serve as an important method for constructing large-scale genetic interaction networks. However, a high proportion of missing values frequently poses problems in EMAP data analysis, since such missing values hinder downstream analysis. While some imputation approaches have been available for EMAP data, we adopted an improved SVD modeling procedure to impute the missing values in EMAP data, which results in a higher accuracy rate compared with existing methods. The improved SVD imputation method adopts an effective soft-threshold to the SVD approach, which has been shown to be the best model to impute genetic interaction data when compared with a number of advanced imputation methods. Imputation methods can also improve the clustering results of EMAP datasets. Thus, after applying our imputation method on the EMAP dataset, more meaningful modules, known pathways and protein complexes could be detected. While the phenomenon of missing data unavoidably complicates EMAP data, our results showed that we could complete the original dataset by the Soft-SVD approach to accurately recover genetic interactions.
    Methods 04/2014; · 3.64 Impact Factor
  • ABSTRACT: BACKGROUND: Epistatic Miniarray Profiles (EMAP) enable the study of genetic interactions and serve as an important method for constructing large-scale genetic interaction networks. However, a high proportion of missing values frequently poses problems in EMAP data analysis, since such missing values hinder downstream analysis. While some imputation approaches have been available for EMAP data, we adopted an improved SVD modeling procedure to impute the missing values in EMAP data, which results in a higher accuracy rate compared with existing methods. RESULTS: The improved SVD imputation method adopts an effective soft-threshold to the SVD approach, which has been shown to be the best model to impute genetic interaction data when compared with a number of advanced imputation methods. Imputation methods can also improve the clustering results of EMAP datasets. Thus, after applying our imputation method on the EMAP dataset, more meaningful modules, known pathways and protein complexes could be detected. CONCLUSION: While the phenomenon of missing data unavoidably complicates EMAP data, our results showed that we could complete the original dataset by the Soft-SVD approach to accurately recover genetic interactions.
    Methods 01/2014; · 3.64 Impact Factor
  • ABSTRACT: Expression quantitative trait loci (eQTL) studies have generated large amounts of data in different organisms. The analyses of these data have led to many novel findings and biological insights on expression regulation. However, the role of epistasis in the joint regulation of multiple genes has not been explored. This is largely due to the computational complexity involved when multiple traits are simultaneously considered against multiple markers if an exhaustive search strategy is adopted. In this article, we propose a computationally feasible approach to identify pairs of chromosomal regions that interact to regulate co-expression patterns of pairs of genes. Our approach is built on a bivariate model whose covariance matrix depends on the joint genotypes at the candidate loci. We also propose a filtering process to reduce the computational burden. When we applied our method to a yeast eQTL dataset profiled under both the glucose and ethanol conditions, we identified a total of 225 and 224 modules, respectively, with each module consisting of two genes and two eQTLs where the two eQTLs epistatically regulate the co-expression patterns of the two genes. We found that many of these modules have biological interpretations. Under the glucose condition, ribosome biogenesis was co-regulated with the signaling and carbohydrate catabolic processes, whereas silencing and aging related genes were co-regulated under the ethanol condition with the eQTLs containing genes involved in oxidative stress response process.
    PLoS Genetics 03/2013; 9(3):e1003414. · 8.52 Impact Factor
  • ABSTRACT: MOTIVATION: Expression quantitative trait loci (eQTL) studies investigate how gene expression levels are affected by DNA variants. A major challenge in inferring eQTL is that a number of factors, such as unobserved covariates, experimental artifacts, and unknown environmental perturbations, may confound the observed expression levels. This may both mask real associations and lead to spurious association findings. RESULTS: In this paper, we introduce a LOw-Rank representation to account for confounding factors and make use of Sparse regression for eQTL mapping (LORS). We integrate the low-rank representation and sparse regression into a unified framework, in which SNPs and gene probes can be jointly analyzed. Given the two model parameters, our formulation is a convex optimization problem. We have developed an efficient algorithm to solve this problem, and its convergence is guaranteed. We demonstrate its ability to account for non-genetic effects using simulation, and then apply it to two independent real data sets. Our results indicate that LORS is an effective tool to account for non-genetic effects. First, our detected associations show higher consistency between studies than those of recently proposed methods. Second, we have identified some new hot spots which cannot be identified without accounting for non-genetic effects. AVAILABILITY: The software is available at: http://bioinformatics.med.yale.edu/group CONTACT: Hongyu Zhao (hongyu.zhao@yale.edu).
    Bioinformatics 02/2013; · 5.47 Impact Factor
  • ABSTRACT: Phosphorylation and transcriptional regulation events are critical for cells to transmit and respond to signals. Despite their importance, systems-level strategies that couple these two networks have yet to be presented. Here we introduce a novel approach that integrates the physical and functional aspects of the phosphorylation network together with the transcription network in S. cerevisiae, and demonstrate that different network motifs are involved in these networks, which should be considered in interpreting and integrating large-scale datasets. Based on this understanding, we introduce a HeRS score (hetero-regulatory similarity score) to systematically characterize the functional relevance of kinase/phosphatase involvement with transcription factors, and present an algorithm that predicts hetero-regulatory modules. When extended to the signaling network, this approach confirmed the structure and cross talk of MAPK pathways, inferred a novel functional transcription factor Sok2 in the high osmolarity glycerol pathway, and explained the mechanism of reduced mating efficiency upon Fus3 deletion. This strategy is applicable to other organisms as large-scale datasets become available, providing a means to identify the functional relationships between kinases/phosphatases and transcription factors.
    PLoS ONE 01/2012; 7(3):e33160. · 3.53 Impact Factor
  • ABSTRACT: The goal of network clustering algorithms is to detect dense clusters in a network, providing a first step towards the understanding of large-scale biological networks. With numerous recent advances in biotechnologies, large-scale genetic interactions are widely available, but there is a limited understanding of which clustering algorithms may be most effective. In order to address this problem, we conducted a systematic study to compare and evaluate six clustering algorithms in analyzing genetic interaction networks, and investigated the factors that influence the choice of algorithm. The algorithms considered in this comparison include hierarchical clustering, topological overlap matrix, bi-clustering, Markov clustering, Bayesian discriminant analysis based community detection, and variational Bayes approach to modularity. Both experimentally identified and synthetically constructed networks were used in this comparison. The accuracy of the algorithms is measured by the Jaccard index in comparing predicted gene modules with benchmark gene sets. The results suggest that the best choice differs according to the network topology and evaluation criteria. Hierarchical clustering proved best at predicting protein complexes; Bayesian discriminant analysis based community detection proved best on epistatic miniarray profile (EMAP) datasets; the variational Bayes approach to modularity was noticeably better than the other algorithms on the genome-scale networks.
    Frontiers in bioscience (Elite edition) 01/2012; 4:2150-61.
  • ABSTRACT: Cellular functions depend on genetic, physical and other types of interactions. As such, derived interaction networks can be utilized to discover novel genes involved in specific biological processes. Epistatic Miniarray Profile, or E-MAP, which is an experimental platform that measures genetic interactions on a genome-wide scale, has successfully recovered known pathways and revealed novel protein complexes in Saccharomyces cerevisiae (budding yeast). By combining E-MAP data with co-expression data, we first predicted a potential cell cycle-related gene set. Using Gene Ontology (GO) function annotation as a benchmark, we demonstrated that the prediction by combining microarray and E-MAP data is generally >50% more accurate in identifying co-functional gene pairs than the prediction using either data source alone. We also used transcription factor (TF)-DNA binding data (ChIP-chip) and protein phosphorylation data to construct a local cell cycle regulation network based on the potential cell cycle-related gene set we predicted. Finally, based on the E-MAP screening with 48 cell cycle genes crossing 1536 library strains, we predicted four unknown genes (YPL158C, YPR174C, YJR054W, and YPR045C) as potential cell cycle genes, and analyzed them in detail. By integrating E-MAP and DNA microarray data, potential cell cycle-related genes were detected in budding yeast. This integrative method significantly improves the reliability of identifying co-functional gene pairs. In addition, the reconstructed network sheds light on the function of both known and predicted genes in the cell cycle process. Finally, our strategy can be applied to other biological processes and species, given the availability of relevant data.
    BMC Systems Biology 01/2011; 5 Suppl 1:S9. · 2.98 Impact Factor
  • ABSTRACT: MOTIVATION: Epistatic Miniarray Profiles (EMAP) has enabled the mapping of large-scale genetic interaction networks; however, the quantitative information gained from EMAP cannot be fully exploited since the data are usually interpreted as a discrete network based on an arbitrary hard threshold. To address such limitations, we adopted a mixture modeling procedure to construct a probabilistic genetic interaction network and then implemented a Bayesian approach to identify densely interacting modules in the probabilistic network. RESULTS: Mixture modeling has been demonstrated as an effective soft-threshold technique for EMAP measures. The Bayesian approach was applied to an EMAP dataset studying the early secretory pathway in Saccharomyces cerevisiae. Twenty-seven modules were identified, and 14 of those were enriched by gold standard functional gene sets. We also conducted a detailed comparison with state-of-the-art algorithms, namely hierarchical clustering and Markov clustering. The experimental results show that the Bayesian approach outperforms others in efficiently recovering biologically significant modules.
    Bioinformatics 01/2011; 27(6):853-9. · 5.47 Impact Factor
  • ABSTRACT: Assigning functions to proteins that have not been annotated is an important problem in the post-genomic era. Meanwhile, the availability of data on protein-protein interactions provides a new way to predict protein functions. Previously, several computational methods have been developed to solve this problem. Among them, Deng et al. developed a method based on the Markov random field (MRF). Lee et al. extended it to the kernel logistic regression model (KLR) based on the diffusion kernel. These two methods were tested on yeast benchmark data, and the results demonstrated that both MRF and KLR had high precision in function prediction. On that basis, inspired by the idea of a Markov cluster algorithm, we defined a new measure of network betweenness, and developed a betweenness-based logistic regression model (BLR). Applying it to predict protein functions on the yeast benchmark data, we found that BLR outperformed both the KLR and the MRF models. These results indicate that BLR is a more suitable and efficient approach to function prediction.
    01/2010;
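
The clustering comparison listed above (Frontiers in Bioscience, 2012) measures accuracy with the Jaccard index between predicted gene modules and benchmark gene sets. A minimal sketch of that measure is given here for illustration only: the function name `jaccard_index`, the handling of two empty sets, and the placeholder gene names are assumptions and do not come from the paper.

```python
def jaccard_index(predicted, benchmark):
    """Return |A ∩ B| / |A ∪ B| for two collections of gene names."""
    a, b = set(predicted), set(benchmark)
    union = a | b
    if not union:
        # Assumed convention: two empty sets are scored as no overlap.
        return 0.0
    return len(a & b) / len(union)


# Placeholder gene names, not taken from any dataset in the paper.
predicted_module = {"GENE_A", "GENE_B", "GENE_C"}
benchmark_set = {"GENE_B", "GENE_C", "GENE_D"}

# Two shared genes out of four distinct genes overall.
score = jaccard_index(predicted_module, benchmark_set)
print(score)
```

A score of 1 means the predicted module matches the benchmark set exactly, and a score of 0 means the two share no genes.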