Effects of cDNA microarray time-series data size on gene regulatory network inference accuracy.
ABSTRACT A number of models and algorithms have been proposed for gene regulatory network (GRN) inference; however, none of them addresses the effect of the size of the time-series microarray expression data, measured in number of time points. In this paper, we study this problem by analyzing the behavior of two algorithms based on information-theoretic models, applied to data sets of different sizes generated by synthetic network generation tools. Experiments show that the performance of these algorithms saturates beyond a specific data size, giving the biologist an indication of how much data yields the best inference accuracy. Moreover, because accuracy saturates after a specific number of time points (the saturation point differing between algorithms), generating time-series data for many additional time points will not necessarily improve inference accuracy beyond a certain point. To understand this saturation, we found that the information-theoretic quantity mutual information tends to zero as the number of time points increases, even though the entropy in the network rises toward unity. This suggests that mutual information (MI) alone may not be the best metric for GRN inference algorithms. To address this, we introduce a new method of computing time lags between any pair of genes and present the time-lagged mutual information (TLMI) metric for reverse engineering of GRNs.
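The abstract does not give the paper's exact TLMI definition, so as an illustration only, here is a minimal sketch of lag-selected mutual information over discretized (e.g. binarized) expression profiles. The function names and the simple lag-search scheme are assumptions for exposition, not the authors' method:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a discrete label sequence."""
    n = len(labels)
    probs = np.array([c / n for c in Counter(labels).values()])
    return float(-np.sum(probs * np.log2(probs)))

def mutual_information(x, y):
    """MI(X;Y) = H(X) + H(Y) - H(X,Y) for two aligned discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def time_lagged_mi(x, y, max_lag=3):
    """Search lags 0..max_lag of y behind x; return (best_lag, best_mi).

    Lag 0 reduces to ordinary MI; positive lags align x[t] with y[t+lag],
    modeling a delayed regulatory influence of gene x on gene y.
    """
    best_lag, best_mi = 0, mutual_information(x, y)
    for lag in range(1, max_lag + 1):
        mi = mutual_information(x[:-lag], y[lag:])
        if mi > best_mi:
            best_lag, best_mi = lag, mi
    return best_lag, best_mi
```

For example, if gene y's binarized profile simply repeats gene x's one time point later, the search recovers lag 1 with near-maximal MI, whereas the lag-0 MI between the two profiles can be close to zero.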
- ABSTRACT (truncated at both ends): … biological data sets. The ability to adequately solve the inverse problem may enable in-depth analysis of complex dynamic systems in biology and other fields. Binary models of genetic networks: virtually all molecular and cellular signaling processes involve several inputs and outputs, forming a complex feedback network. The information for the construction and maintenance of this signaling system is stored in the genome. The DNA sequence codes for the structure and molecular dynamics of RNA and proteins, in turn determining biochemical recognition or signaling processes. The regulatory molecules that control the expression of genes are themselves the products of other genes. Effectively, genes turn each other on and off within a proximal genetic network of transcriptional regulators (Somogyi and Sniegoski, 1996). Furthermore, complex webs involving various intra- and extracellular signaling systems on the one hand depend on the expression of the genes that encode them, and on the … 05/2002. Available from: psu.edu.
- ABSTRACT: A central question in reverse engineering of genetic networks consists in determining the dependencies and regulating relationships among genes. This paper addresses the problem of inferring genetic regulatory networks from time-series gene-expression profiles. By adopting a probabilistic modeling framework compatible with the family of models represented by dynamic Bayesian networks and probabilistic Boolean networks, this paper proposes a network inference algorithm to recover not only the direct gene connectivity but also the regulating orientations. Based on the minimum description length principle, a novel network inference algorithm is proposed that greatly shrinks the search space for graphical solutions and achieves a good trade-off between modeling complexity and data fitting. Simulation results show that the algorithm achieves good performance on synthetic networks. Compared with existing state-of-the-art results in the literature, the proposed algorithm excels in efficiency, accuracy, robustness and scalability. Given a time-series dataset for Drosophila melanogaster, the paper proposes a genetic regulatory network involved in Drosophila's muscle development. Available from the authors upon request. Bioinformatics, 22(17):2129–2135, 10/2006.
- ABSTRACT: This paper reviews the principle of Minimum Description Length (MDL) for problems of model selection. By viewing statistical modeling as a means of generating descriptions of observed data, the MDL framework discriminates between competing models based on the complexity of each description. This approach began with Kolmogorov's theory of algorithmic complexity, matured in the literature on information theory, and has recently received renewed interest within the statistics community. In the pages that follow, we review both the practical and theoretical aspects of MDL as a tool for model selection, emphasizing the rich connections between information theory and statistics. At the boundary between these two disciplines, we find many interesting interpretations of popular frequentist and Bayesian procedures. As we will see, MDL provides an objective umbrella under which rather disparate approaches to statistical modeling can co-exist and be compared. We illustrate th… 01/2001.
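As a concrete illustration of two-part MDL model selection (a generic textbook-style example, not one taken from the review above), candidate models can be scored by data cost plus model cost. The sketch below uses the standard crude two-part code for polynomial regression, where a degree-k fit pays a (k+1)/2 · log n parameter penalty on top of the Gaussian negative log-likelihood; all names are hypothetical:

```python
import numpy as np

def mdl_score(x, y, degree):
    """Two-part MDL score for a least-squares polynomial fit.

    Data cost:  (n/2) * log(RSS/n)   -- Gaussian negative log-likelihood
    Model cost: (k/2) * log(n)       -- k = degree + 1 real parameters
    Lower is better; the penalty discourages needless extra terms.
    """
    n = len(x)
    k = degree + 1
    coeffs = np.polyfit(x, y, degree)
    rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
    return 0.5 * n * np.log(rss / n) + 0.5 * k * np.log(n)
```

Applied to noisy samples of a quadratic, the score typically prefers degree 2 over an underfit line (huge residual cost) and over higher-degree overfits (extra parameter cost with little residual gain), which is exactly the complexity/fit trade-off the MDL framework formalizes.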