Effects of cDNA microarray time-series data size on gene regulatory network inference accuracy.
ABSTRACT A number of models and algorithms have been proposed for gene regulatory network (GRN) inference; however, none of them addresses how the size of the time-series microarray expression data, measured as the number of time points, affects inference. In this paper, we study this problem by analyzing the behavior of two algorithms based on information-theoretic models. These algorithms were run on data sets of different sizes generated by synthetic network generation tools. Experiments show that the performance of these algorithms reaches a saturation point after a specific data size, giving the biologist an idea of what data size will yield the best inference accuracy. Because the accuracy saturates after a specific number of time points (the saturation point differs between algorithms), generating time-series data for many more time points will not necessarily improve the inference accuracy beyond a certain point. Investigating this saturation, we found that the information-theoretic quantity mutual information tends to zero as the number of time points increases, although the entropy in the network rises to unity. This suggests that mutual information (MI) might not be the best metric for GRN inference algorithms. To modify the MI metric, we introduce a new method of computing time lags between any pair of genes and present the time-lagged mutual information (TLMI) metric for reverse engineering of GRNs.
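To make the quantities in the abstract concrete, the following is a minimal sketch of mutual information and a time-lagged variant for two already-discretized expression series. It is not the paper's method: the abstract does not say how the time lags are computed, so choosing the lag that maximizes MI is an assumption made here purely for illustration, and the function names (`entropy`, `mutual_information`, `lagged_mutual_information`, `best_lag`) are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Shannon entropy in bits of a sequence of discrete symbols."""
    n = len(symbols)
    counts = Counter(symbols)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) = H(X) + H(Y) - H(X,Y) from paired samples."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of time points")
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def lagged_mutual_information(x, y, lag):
    """MI between x[t] and y[t + lag] for a non-negative lag."""
    if lag < 0 or lag >= len(x):
        raise ValueError("lag must satisfy 0 <= lag < len(x)")
    if lag == 0:
        return mutual_information(x, y)
    return mutual_information(x[:len(x) - lag], y[lag:])


def best_lag(x, y, max_lag):
    """Assumption: take the lag in 0..max_lag with the highest lagged MI."""
    scores = {lag: lagged_mutual_information(x, y, lag)
              for lag in range(max_lag + 1)}
    lag = max(scores, key=scores.get)
    return lag, scores[lag]


# Toy binary series: y repeats x two time points later.
x = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
y = [0, 0] + x[:-2]
lag, score = best_lag(x, y, max_lag=3)
```

In this toy series the zero-lag MI understates the dependence between the two genes, while the lagged estimate recovers it. Plug-in estimates on short series are biased, so a real analysis would need a correction or a significance test.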
- Source available from: Kurt Gust
ABSTRACT: Inferring the genetic network architecture in cells is of great importance to biologists, as it can lead to an understanding of the cell signaling and metabolic dynamics underlying cellular processes, the onset of diseases, and potential discoveries in drug development. The focus today has shifted to genome-scale inference approaches that apply information-theoretic metrics such as mutual information to gene expression data. In this paper, we propose two classes of inference algorithms that use scoring schemes on complex interactions and are based primarily on information-theoretic metrics. The central idea is to go beyond pairwise interactions and utilize more complex structures between any node (gene or transcription factor) and its possible multiple regulators (transcription factors only). While this increases the network inference complexity over approaches based on pairwise interactions, it achieves much higher accuracy. We restricted the complex interactions considered in this paper to 3-node structures (any node and its two regulators) to keep our schemes scalable to genome-scale inference while still achieving higher accuracy than other state-of-the-art approaches. Detailed performance analyses based on benchmark precision and recall metrics over the known Escherichia coli transcriptional regulatory network indicated that the accuracy of the proposed algorithms (sCoIn, aCoIn and its variants) is consistently higher than that of popular algorithms such as context likelihood of relatedness (CLR), relevance networks (RN) and GEneNetwork Inference with Ensemble of trees (GENIE3).
Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine; 10/2012
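The precision and recall metrics mentioned in this abstract can be sketched for an inferred network as follows. This is an illustrative sketch only: it assumes edges are represented as directed (regulator, target) pairs, and the gene and transcription factor names are made up rather than taken from the Escherichia coli benchmark.

```python
def precision_recall(predicted_edges, true_edges):
    """Precision and recall of predicted directed edges against a known network."""
    predicted = set(predicted_edges)
    truth = set(true_edges)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical edges, written as (regulator, target).
predicted = [("tfA", "g1"), ("tfA", "g2"), ("tfB", "g3")]
known = [("tfA", "g1"), ("tfB", "g3"), ("tfB", "g4"), ("tfC", "g1")]
precision, recall = precision_recall(predicted, known)
```

Here two of the three predicted edges are in the known network, and two of the four known edges are recovered.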