ABSTRACT: High-throughput sequencing techniques are increasingly affordable and produce massive amounts of data. Together with other high-throughput technologies, such as microarrays, they have filled databases with an enormous amount of resources. The collection of these valuable data has been routine for more than a decade. Despite different technologies, many experiments share the same goal. For instance, the aims of RNA-seq studies often coincide with those of differential gene expression experiments based on microarrays. As such, it would be logical to utilize all available data. However, there is a lack of biostatistical tools for the integration of results obtained from different technologies. Although diverse technological platforms produce different raw data, one commonality for experiments with the same goal is that all the outcomes can be transformed into a platform-independent data format - rankings - for the same set of items. Here we present the R package TopKLists, which allows for statistical inference on the lengths of informative (top-k) partial lists, for stochastic aggregation of full or partial lists, and for graphical exploration of the input and consolidated output. A graphical user interface has also been implemented to provide access to the underlying algorithms. To illustrate the applicability and usefulness of the package, we integrated microRNA data of non-small cell lung cancer across different measurement techniques and drew conclusions. The package can be obtained from CRAN under an LGPL-3 license.
Statistical Applications in Genetics and Molecular Biology 05/2015; DOI:10.1515/sagmb-2014-0093 · 1.13 Impact Factor
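The idea that outcomes from different platforms can be reduced to rankings of the same items can be illustrated with a minimal Python sketch. It is not the TopKLists algorithm: the aggregation step here is a plain Borda-style mean rank, swapped in for simplicity, and the function names, the microRNA names, and all numbers are hypothetical illustration data.

```python
def to_ranking(scores, higher_is_better=True):
    """Order item names from most to least important according to their scores."""
    return sorted(scores, key=lambda item: scores[item], reverse=higher_is_better)


def borda_aggregate(rankings):
    """Aggregate full or partial rankings by mean rank (a simple Borda-style rule).

    An item missing from a partial (top-k) list is assigned rank k + 1,
    which is an assumption of this sketch, not a rule from the abstract.
    """
    items = set().union(*rankings)

    def mean_rank(item):
        ranks = [r.index(item) + 1 if item in r else len(r) + 1 for r in rankings]
        return sum(ranks) / len(ranks)

    # Ties in mean rank are broken alphabetically so the result is deterministic.
    return sorted(items, key=lambda item: (mean_rank(item), item))


# Hypothetical results from three platforms measuring the same items.
microarray_pvalues = {"miR-21": 0.001, "miR-155": 0.02, "miR-210": 0.04, "let-7a": 0.30}
rnaseq_fold_change = {"miR-21": 4.1, "miR-210": 3.2, "miR-155": 1.9, "let-7a": 1.1}
qpcr_top2 = ["miR-210", "miR-21"]  # a partial (top-k) list with k = 2

rankings = [
    to_ranking(microarray_pvalues, higher_is_better=False),  # small p-value ranks first
    to_ranking(rnaseq_fold_change, higher_is_better=True),   # large fold change ranks first
    qpcr_top2,
]
consensus = borda_aggregate(rankings)
print(consensus)
```

Once every platform's output is a ranking, the raw units (p-values, fold changes, cycle counts) no longer matter, which is what makes the lists comparable across technologies.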
ABSTRACT: Genomic imprinting is an important epigenetic factor in the study of complex traits, which has generally been examined by testing for parent-of-origin effects of alleles. For a diallelic marker locus, the parental-asymmetry test (PAT) based on case-parents trios and its extensions to incomplete nuclear families (1-PAT and C-PAT) are simple and powerful for detecting parent-of-origin effects. However, these methods are suitable only for nuclear families and thus are not amenable to general pedigree data. Use of data from extended pedigrees, if available, may lead to more powerful methods than randomly selecting one two-generation nuclear family from each pedigree. In this study, we extend PAT to accommodate general pedigree data by proposing the pedigree PAT (PPAT) statistic, which uses all informative family trios from pedigrees. To fully utilize pedigrees with some missing genotypes, we further develop the Monte Carlo (MC) PPAT (MCPPAT) statistic based on MC sampling and estimation. Extensive simulations were carried out to evaluate the performance of the proposed methods. Under the assumption that the pedigrees and their associated affection patterns are randomly drawn from a population of pedigrees with at least one affected offspring, we demonstrated that MCPPAT is a valid test for parent-of-origin effects in the presence of association. Further, when data are missing, MCPPAT is much more powerful than PAT applied to trios, or even PPAT applied to all informative family trios from the same pedigrees. Application of the proposed methods to a rheumatoid arthritis dataset further demonstrates the advantage of MCPPAT.
ABSTRACT: One of the major challenges facing researchers studying complex biological systems is integration of data from -omics platforms. Omic-scale data include DNA variations, transcriptome profiles, and RNAomics. Selection of an appropriate approach for a data-integration task is problem dependent, primarily dictated by the information contained in the data. In situations where modeling of multiple raw datasets jointly might be extremely challenging due to their vast differences, rankings from each dataset would provide a commonality based on which results could be integrated. Aggregation of microRNA targets predicted from different computational algorithms is such a problem. Integration of results from multiple mRNA studies based on different platforms is another example that will be discussed. Formulating the problem of integrating ranked lists as minimizing an objective criterion, we explore the use of a cross-entropy Monte Carlo method for solving such a combinatorial problem. Instead of placing a discrete uniform distribution on all the potential solutions, an iterative importance sampling technique is utilized "to slowly tighten the net" to place most distributional mass on the optimal solution and its neighbors. Extensive simulation studies were performed to assess the performance of the method. With satisfactory simulation results, the method was applied to the microRNA and mRNA problems to illustrate its utility.
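A minimal Python sketch of the general cross-entropy Monte Carlo idea for this kind of problem follows. It is an illustration under stated assumptions rather than the authors' implementation: the objective (sum of Spearman footrule distances), the item-by-position probability table, the function names, the tuning parameters, and the example lists are all choices made here for demonstration.

```python
import random


def footrule(candidate, ranking):
    """Spearman footrule distance between two full rankings of the same items."""
    position = {item: i for i, item in enumerate(candidate)}
    return sum(abs(position[item] - i) for i, item in enumerate(ranking))


def objective(candidate, rankings):
    """Criterion to minimize: total footrule distance to all input rankings."""
    return sum(footrule(candidate, r) for r in rankings)


def sample_permutation(items, prob, rng):
    """Fill positions one at a time, drawing among unused items by prob[item][position]."""
    remaining = list(items)
    order = []
    for position in range(len(items)):
        weights = [prob[item][position] + 1e-12 for item in remaining]
        choice = rng.choices(remaining, weights=weights, k=1)[0]
        order.append(choice)
        remaining.remove(choice)
    return order


def cross_entropy_aggregate(rankings, n_samples=200, elite_frac=0.1,
                            smooth=0.7, n_iter=30, seed=0):
    """Search for a consensus ranking by iteratively refitting the sampler to elite samples."""
    rng = random.Random(seed)
    items = sorted(rankings[0])
    n = len(items)
    # Start from a uniform table: every item is equally likely at every position.
    prob = {item: [1.0 / n] * n for item in items}
    n_elite = max(1, int(elite_frac * n_samples))
    best, best_score = None, float("inf")
    for _ in range(n_iter):
        samples = [sample_permutation(items, prob, rng) for _ in range(n_samples)]
        samples.sort(key=lambda s: objective(s, rankings))
        elite = samples[:n_elite]
        score = objective(elite[0], rankings)
        if score < best_score:
            best, best_score = elite[0], score
        # "Tighten the net": shift probability mass toward positions seen in the elite set.
        for item in items:
            for position in range(n):
                freq = sum(s[position] == item for s in elite) / n_elite
                prob[item][position] = (1 - smooth) * prob[item][position] + smooth * freq
    return best, best_score


# Hypothetical input: three full rankings of the same five items.
input_rankings = [
    ["A", "B", "C", "D", "E"],
    ["A", "C", "B", "D", "E"],
    ["B", "A", "C", "D", "E"],
]
consensus, score = cross_entropy_aggregate(input_rankings)
print(consensus, score)
```

The smoothing weight controls how quickly the sampling distribution concentrates: a smaller value tightens the net more slowly and lowers the risk of settling early on a suboptimal ranking, at the cost of more iterations.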
ABSTRACT: Because of the need for fine mapping of disease loci and the availability of dense single-nucleotide-polymorphism markers, many forms of association tests have been developed. Most of them are applicable only to triads, whereas some are amenable to nuclear families (sibships). Although there are a number of methods that can deal with extended families (e.g., the pedigree disequilibrium test [PDT]), most of them cannot accommodate incomplete data. Furthermore, despite a large body of literature on association mapping, only a very limited number of publications are applicable to X-chromosomal markers. In this report, we first extend the PDT to markers on the X chromosome for testing linkage disequilibrium in the presence of linkage. This method is applicable to any pedigree structure and is termed "X-chromosomal pedigree disequilibrium test" (XPDT). We then further extend the XPDT to accommodate pedigrees with missing genotypes in some of the individuals, especially founders. Monte Carlo (MC) samples of the missing genotypes are generated and used to calculate the XMCPDT (X-chromosomal MC PDT) statistic, which is defined as the conditional expectation of the XPDT statistic given the incomplete (observed) data. This MC version of the XPDT remains a valid test for association under linkage with the assumption that the pedigrees and their associated affection patterns are drawn randomly from a population of pedigrees with at least one affected offspring. This set of methods was compared with existing approaches through simulation, and substantial power gains were observed in all settings considered, with type I error rates closely tracking their nominal values.
The American Journal of Human Genetics 10/2006; 79(3):567-73. DOI:10.1086/507609 · 10.93 Impact Factor
ABSTRACT: We compare and contrast the performance of SIMPLE, a Monte Carlo-based software package, with that of several other methods for linkage and haplotype analyses, focusing on the simulated data from the New York City population. First, a whole-genome scan study based on the microsatellite markers was performed using GENEHUNTER. Because GENEHUNTER had to drop individuals for many of the pedigrees, we performed a follow-up study focusing on several regions of interest using SIMPLE, which can handle all pedigrees in their entirety. Second, three haplotyping programs, including the one in SIMPLE, were used to reconstruct haplotypic configurations in pedigrees. SIMPLE emerges clearly as a preferred tool, as it can handle large pedigrees and produces haplotypic configurations without double recombinant haplotypes. For this study, we had knowledge of the simulating models at the time we performed the analysis.