ABSTRACT: Background
High throughput parallel sequencing, RNA-Seq, has recently emerged as an appealing alternative to microarray in identifying differentially expressed genes (DEG) between biological groups. However, there still exists considerable discrepancy on gene expression measurements and DEG results between the two platforms. The objective of this study was to compare parallel paired-end RNA-Seq and microarray data generated on 5-azadeoxy-cytidine (5-Aza) treated HT-29 colon cancer cells with an additional simulation study.
We first performed general correlation analysis comparing gene expression profiles on both platforms. An Errors-In-Variables (EIV) regression model was subsequently applied to assess proportional and fixed biases between the two technologies. Then several existing algorithms, designed for DEG identification in RNA-Seq and microarray data, were applied to compare the cross-platform overlaps with respect to DEG lists, which were further validated using qRT-PCR assays on selected genes. Functional analyses were subsequently conducted using Ingenuity Pathway Analysis (IPA).
Pearson and Spearman correlation coefficients between the RNA-Seq and microarray data each exceeded 0.80, with 66% to 68% overlap of genes on both platforms. The EIV regression model indicated the existence of both fixed and proportional biases between the two platforms. The DESeq and baySeq algorithms (RNA-Seq) and the SAM and eBayes algorithms (microarray) achieved the highest cross-platform overlap rate in DEG results from both experimental and simulated datasets. On the simulated dataset, DESeq controlled the false discovery rate better than baySeq, although it was slightly less sensitive than baySeq. RNA-Seq and qRT-PCR, but not microarray data, confirmed the expected reversal of SPARC gene suppression after treating HT-29 cells with 5-Aza. Thirty-three IPA canonical pathways were identified by both microarray and RNA-Seq data, 152 pathways by RNA-Seq data only, and none by microarray data only.
These results suggest that RNA-Seq has advantages over microarray in identification of DEGs with the most consistent results generated from DESeq and SAM methods. The EIV regression model reveals both fixed and proportional biases between RNA-Seq and microarray. This may explain in part the lower cross-platform overlap in DEG lists compared to those in detectable genes.
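The fixed and proportional biases mentioned above can be illustrated with a small errors-in-variables fit. The sketch below uses Deming regression, one common EIV estimator; the abstract does not state which EIV form the study used, so the function name `deming_fit` and the error-variance ratio `delta` are assumptions made for illustration. A slope different from 1 suggests proportional bias, and an intercept different from 0 suggests fixed bias.

```python
from math import sqrt


def deming_fit(x, y, delta=1.0):
    """Deming (errors-in-variables) regression of y on x.

    delta is the assumed ratio of the y-error variance to the x-error
    variance (delta=1.0 gives orthogonal regression). This is an
    illustrative sketch, not the model fitted in the study.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample variances and covariance of the two platforms' measurements.
    sxx = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    syy = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    if sxy == 0:
        raise ValueError("covariance is zero; slope is undefined")
    diff = syy - delta * sxx
    slope = (diff + sqrt(diff ** 2 + 4 * delta * sxy ** 2)) / (2 * sxy)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical log-expression values for the same genes on two platforms.
platform_a = [0.0, 1.0, 2.0, 3.0]
platform_b = [1.0, 3.0, 5.0, 7.0]
slope, intercept = deming_fit(platform_a, platform_b)
print(f"proportional bias (slope): {slope:.2f}, fixed bias (intercept): {intercept:.2f}")
```

Unlike ordinary least squares, this estimator allows for measurement error in both platforms, which is why an EIV model suits a platform comparison.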
ABSTRACT: Computer-aided detection and diagnosis (CAD) of colonic polyps always faces the challenge of classifying imbalanced data. In this paper, three new operating point selection strategies based on the receiver operating characteristic (ROC) curve are proposed to address the problem.
Classification of imbalanced data performs poorly largely because the best differentiation threshold shifts with the degree of data imbalance. To address this decision-threshold shift, three operating point selection strategies, i.e., shortest distance, harmonic mean and anti-harmonic mean, are proposed and their performance is investigated.
Experiments were conducted on a class-imbalanced database, which contains 64 polyps in 786 polyp candidates. Support vector machine (SVM) and random forests (RFs) were employed as basic classifiers. Two imbalanced data correcting techniques, i.e., cost-sensitive learning and training data down sampling, were applied to SVM and RFs, and their performances were compared with the proposed strategies. Compared with the original thresholding method, which gave 0.488 sensitivity and 0.986 specificity for RFs and 0.526 sensitivity and 0.977 specificity for SVM, our strategies achieved more balanced results, which are around 0.89 sensitivity and 0.92 specificity for RFs and 0.88 sensitivity and 0.90 specificity for SVM. Meanwhile, their performance remained at the same level regardless of whether other correcting methods were used.
Based on the above experiments, the gain of our proposed strategies is noticeable: the sensitivity improved from 0.5 to around 0.88 for RFs and 0.89 for SVM while maintaining a relatively high level of specificity, i.e., 0.92 for RFs and 0.90 for SVM. The performance of our proposed strategies was adaptive and robust across different levels of data imbalance. This indicates a feasible solution to the shifting problem for favorable sensitivity and specificity in CAD of polyps from imbalanced data.
International Journal of Computer Assisted Radiology and Surgery 06/2013; · 1.36 Impact Factor
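The operating point selection idea in the abstract above can be sketched as follows. The abstract names the strategies but does not define them, so the definitions here are assumptions: "shortest distance" is taken as the ROC point closest to the ideal corner (sensitivity 1, specificity 1), and "harmonic mean" as the point maximizing the harmonic mean of sensitivity and specificity. The anti-harmonic mean strategy is omitted because its definition cannot be inferred from the abstract. All function names are hypothetical.

```python
from math import hypot


def roc_points(scores, labels):
    """Return (threshold, sensitivity, specificity) for every distinct score.

    A candidate is called positive when its score is >= the threshold.
    Labels are 1 for true polyps and 0 for false positives.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        points.append((t, tp / n_pos, tn / n_neg))
    return points


def shortest_distance_point(points):
    """Pick the point nearest to the ideal corner (sensitivity=1, specificity=1)."""
    return min(points, key=lambda p: hypot(1.0 - p[1], 1.0 - p[2]))


def harmonic_mean_point(points):
    """Pick the point maximizing the harmonic mean of sensitivity and specificity."""
    def harmonic_mean(p):
        _, sens, spec = p
        return 0.0 if sens + spec == 0 else 2 * sens * spec / (sens + spec)
    return max(points, key=harmonic_mean)


# Hypothetical imbalanced data: 2 true polyps among 8 candidates.
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.45, 0.40, 0.60]
labels = [0, 0, 0, 0, 0, 0, 1, 1]
pts = roc_points(scores, labels)
print("shortest distance:", shortest_distance_point(pts))
print("harmonic mean:", harmonic_mean_point(pts))
```

In this toy data a fixed cut-off of 0.5 would detect only one of the two polyps, whereas both strategies move the threshold down to 0.40 and trade a small loss of specificity for full sensitivity.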
ABSTRACT: BACKGROUND: Culture-independent phylogenetic analysis of 16S ribosomal RNA (rRNA) gene sequences has emerged as an incisive method of profiling bacteria present in a specimen. Currently, multiple techniques are available to enumerate the abundance of bacterial taxa in specimens, including Sanger sequencing, 'next generation' pyrosequencing, microarrays, quantitative PCR, and the rapidly emerging third generation and fourth generation sequencing methods. An efficient statistical tool is urgently needed for the following tasks: (1) to compare the agreement between these measurement platforms, (2) to select the most reliable platform(s), and (3) to combine different platforms of complementary strengths for a unified analysis. RESULTS: We present latent variable structural equation modeling (SEM) as a novel statistical application for the comparative analysis of measurement platforms. The latent variable SEM model treats the true (unknown) relative frequency of a given bacterial taxon in a specimen as the latent (unobserved) variable and estimates the reliabilities of, and similarities between, different measurement platforms, and subsequently weights those measurements optimally for a unified analysis of the microbiome composition. The latent variable SEM contains the repeated measures ANOVA (both the univariate and the multivariate models) as special cases and, as a more general and realistic modeling approach, yields superior goodness-of-fit and more reliable analysis results, as demonstrated by a microbiome study of the human inflammatory bowel diseases. CONCLUSIONS: Given the rapid evolution of modern biotechnologies, the measurement platform comparison, selection and combination tasks are here to stay and to grow -- and the latent variable SEM method is readily applicable to any other biological settings, aside from the microbiome study presented here.
ABSTRACT: Various types of features, e.g., geometric features, texture
features, and projection features, have been introduced for polyp
detection and differentiation tasks via computer aided detection and
diagnosis (CAD) for computed tomography colonography (CTC). Although
these features together cover more information in the data, some of
them are statistically highly related to others, which makes the
feature set redundant and burdens the computation task of CAD. In this
paper, we propose a new dimension reduction method that combines
hierarchical clustering and principal component analysis (PCA) for the
false positive (FP) reduction task. First, we group all the features
based on their similarity using hierarchical clustering, and then PCA
is employed within each group. Different numbers of principal
components are selected from each group to form the final feature set.
A support vector machine is used to perform the classification. The
results show that when three principal components are chosen from each
group, we achieve an area under the receiver operating characteristic
curve of 0.905, which is as high as that of the original dataset.
Meanwhile, the computation time is reduced by 70% and the feature set
size is reduced by 77%. It can be concluded that the proposed method
captures the most important information in the feature set and that
classification accuracy is not affected by the dimension reduction. The
result is promising, and further investigation, such as automatic
threshold setting, is worthwhile and in progress.
ABSTRACT: It is broadly accepted that genetically engineered animal models do not always recapitulate human pathobiology. Therefore, identifying best-fit mouse models of human cancers that truly reflect the corresponding human disease is of vital importance in elucidating molecular mechanisms of tumorigenesis and developing preventive and therapeutic approaches. A new hepatocellular carcinoma (HCC) mouse model lacking a novel putative tumor suppressor IQGAP2 has been generated by our laboratory. The aim of this study was to obtain the molecular signature of Iqgap2(-/-) HCC tumors and establish the relevance of this model to human disease. Here we report a comprehensive transcriptome analysis of Iqgap2(-/-) livers and a cross-species comparison of human and Iqgap2(-/-) HCC tumors using Significance Analysis of Microarray (SAM) and unsupervised hierarchical clustering analysis. We identified the Wnt/β-catenin signaling pathway as the top canonical pathway dysregulated in Iqgap2(-/-) livers. We also demonstrated that Iqgap2(-/-) hepatic tumors shared genetic signatures with HCC tumors from patients with advanced disease, as evidenced by a 78% mouse-to-human microarray data set concordance rate, with 117 out of 151 identified ortholog genes having similar expression profiles across the two species. Collectively, these results indicate that the Iqgap2 knockout mouse model closely recapitulates human HCC at the molecular level and support its further application for the study of this disease.
PLoS ONE 01/2013; 8(8):e71826. · 3.73 Impact Factor
ABSTRACT: The aim of this study was to integrate human clinical, genotype, mRNA microarray and 16S rRNA sequence data collected on 84 subjects with ileal Crohn's disease, ulcerative colitis or control patients without inflammatory bowel diseases in order to interrogate how host-microbial interactions are perturbed in inflammatory bowel diseases (IBD). Ex-vivo ileal mucosal biopsies were collected from the disease-unaffected proximal margin of the ileum resected from patients who were undergoing initial intestinal surgery. Both RNA and DNA were extracted from the mucosal biopsy samples. Patients were genotyped for the three major NOD2 variants (Leu1007fs, R702W, and G908R) and the ATG16L1 T300A variant. Whole human genome mRNA expression profiles were generated using Agilent microarrays. Microbial composition profiles were determined by 454 pyrosequencing of the V3-V5 hypervariable region of the bacterial 16S rRNA gene. The results of permutation-based multivariate analysis of variance and covariance (MANCOVA) support the hypothesis that host mucosal Paneth cell and xenobiotic metabolism genes play an important role in host-microbial interactions.
PLoS ONE 01/2012; 7(6):e30044. · 3.73 Impact Factor
ABSTRACT: Previous genome-wide expression studies have highlighted distinct gene expression patterns in inflammatory bowel disease (IBD) compared to control samples, but the interpretation of these studies has been limited by sample heterogeneity with respect to disease phenotype, disease activity, and anatomic sites. To further improve molecular classification of inflammatory bowel disease phenotypes we focused on a single anatomic site, the disease-unaffected proximal ileal margin of resected ileum, and three phenotypes that were unlikely to overlap: ileal Crohn's disease (ileal CD), ulcerative colitis (UC), and control patients without IBD. Whole human genome (Agilent) expression profiling was conducted on two independent sets of disease-unaffected ileal samples collected from the proximal margin of resected ileum. Set 1 (47 ileal CD, 27 UC, and 25 Control non-IBD patients) was used as the training set and Set 2 was subsequently collected as an independent test set (10 ileal CD, 10 UC, and 10 control non-IBD patients). We compared the 17 gene signatures selected by four different feature-selection methods to distinguish ileal CD phenotype from non-CD phenotype. The four methods yielded different but overlapping solutions that were highly discriminating. All four of these methods selected FOLH1 as a common feature. This gene is an established biomarker for prostate cancer, but has not previously been associated with Crohn's disease. Immunohistochemical staining confirmed increased expression of FOLH1 in the ileal epithelium. These results provide evidence for convergent molecular abnormalities in the macroscopically disease-unaffected proximal margin of resected ileum from ileal CD subjects.
PLoS ONE 01/2012; 7(5):e37139. · 3.73 Impact Factor
ABSTRACT: We tested the hypothesis that Crohn's disease (CD)-related genetic polymorphisms involved in host innate immunity are associated with shifts in human ileum-associated microbial composition in a cross-sectional analysis of human ileal samples. Sanger sequencing of the bacterial 16S ribosomal RNA (rRNA) gene and 454 sequencing of 16S rRNA gene hypervariable regions (V1-V3 and V3-V5) were conducted on macroscopically disease-unaffected ileal biopsies collected from 52 ileal CD, 58 ulcerative colitis and 60 control patients without inflammatory bowel diseases (IBD) undergoing initial surgical resection. These subjects also were genotyped for the three major NOD2 risk alleles (Leu1007fs, R702W, G908R) and the ATG16L1 risk allele (T300A). The samples were linked to clinical metadata, including body mass index, smoking status and Clostridium difficile infection. The sequences were classified into seven phyla/subphyla categories using the Naïve Bayesian Classifier of the Ribosome Database Project. Centered log ratio transformation of six predominant categories was included as the dependent variable in the permutation-based MANCOVA for the overall composition with stepwise variable selection. Polymerase chain reaction (PCR) assays were conducted to measure the relative frequencies of the Clostridium coccoides - Eubacterium rectale group and the Faecalibacterium prausnitzii spp. Empiric logit transformations of the relative frequencies of these two microbial groups were included in permutation-based ANCOVA. Regardless of sequencing method, IBD phenotype, Clostridium difficile and NOD2 genotype were selected as associated (FDR ≤ 0.05) with shifts in overall microbial composition. IBD phenotype and NOD2 genotype were also selected as associated with shifts in the relative frequency of the C. coccoides - E. rectale group. IBD phenotype, smoking and IBD medications were selected as associated with shifts in the relative frequency of F. prausnitzii spp. These results indicate that the effects of genetic and environmental factors on IBD are mediated at least in part by the enteric microbiota.
PLoS ONE 01/2012; 7(6):e26284. · 3.73 Impact Factor
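The two transformations named in the abstract above, the centered log ratio and the empirical logit, can be sketched in a few lines. The pseudocount and offset of 0.5 used to avoid taking the logarithm of zero are assumptions for illustration; the abstract does not say how zeros were handled, and the function names are hypothetical.

```python
from math import log


def centered_log_ratio(counts, pseudocount=0.5):
    """Centered log-ratio (CLR) transform of a vector of taxon counts.

    Each value becomes log(count) minus the mean log count, so the
    transformed values sum to zero. The pseudocount is an assumed way
    of handling zero counts.
    """
    logs = [log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def empirical_logit(successes, total, offset=0.5):
    """Empirical logit of a proportion successes/total.

    The offset keeps the value finite when the proportion is 0 or 1.
    """
    return log((successes + offset) / (total - successes + offset))


# Hypothetical sequence counts for six taxonomic categories in one sample.
sample_counts = [120, 80, 40, 10, 5, 0]
clr_values = centered_log_ratio(sample_counts)
print([round(v, 3) for v in clr_values])

# Hypothetical target-group signal: 30 of 200 units.
print(round(empirical_logit(30, 200), 3))
```

Both transforms map bounded compositional data onto an unbounded scale, which is what makes them usable as dependent variables in linear models such as MANCOVA and ANCOVA.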
ABSTRACT: Our objective is to assess the effect of genetic and environmental factors on Crohn's disease location.
We identified 628 patients with Crohn's disease within the Washington University database (April 2005 through February 2010) who had complete information on 31 Crohn's disease-associated genotypes and clinical information on disease location (L1 to L4), smoking, sex, race, and age at diagnosis. For statistical reasons, the 3 major NOD2 alleles (rs2066844, rs2066845, and rs2066847) were grouped together. Logistic regression incorporating all of the genotypes and clinical covariates, including smoking, was performed with stepwise variable selection and by best subset selection.
Stepwise variable selection selected 3 major covariates, composite NOD2 genotype, smoking, and TNFSF15 genotype, which are also the 3 covariates selected by the best subset method. Whereas the NOD2 genotype and smoking are positively associated with ileal (L1 + L3) disease, the TNFSF15 genotype is positively associated with isolated colonic (L2) disease.
The ability to detect disease site associations in this single-center study may be limited by the population size, low allelic frequency, and/or low odds ratio of certain Crohn's disease risk alleles.
These results indicate that NOD2 genotype, smoking status, and TNFSF15 genotype should be included as covariates in assessing the effect of genetic and environmental factors on Crohn's disease site location.
Diseases of the Colon & Rectum 08/2011; 54(8):1020-5. · 3.34 Impact Factor
ABSTRACT: Culture-independent microbiological technologies that interrogate complex microbial populations without prior axenic culture, coupled with high-throughput DNA sequencing, have revolutionized the scale, speed and economics of microbial ecological studies. Their application to the medical realm has led to a highly productive merger of clinical, experimental and environmental microbiology. The functional roles played by members of the human microbiota are being actively explored through experimental manipulation of animal model systems and studies of human populations. In concert, these studies have appreciably expanded our understanding of the composition and dynamics of human-associated microbial communities (microbiota). Of note, several human diseases have been linked to alterations in the composition of resident microbial communities, so-called dysbiosis. However, how changes in microbial communities contribute to disease etiology remains poorly defined. Correlation of microbial composition represents integration of only two datasets (phenotype and microbial composition). This article explores strategies for merging the human microbiome data with multiple additional datasets (e.g. host single nucleotide polymorphisms and host gene expression) and for integrating patient-based data with results from experimental animal models to gain deeper understanding of how host-microbe interactions impact disease.
Trends in Microbiology 07/2011; 19(9):427-34. · 8.43 Impact Factor
ABSTRACT: Whole human genome (Agilent) expression profiling was conducted on disease-unaffected ileal RNA collected from the proximal margin of resected ileum from 47 ileal Crohn's disease (CD), 27 ulcerative colitis (UC) and 25 control patients without inflammatory bowel diseases (IBD). Cluster analysis combined with significance analysis of microarrays (SAM) and principal component analysis (PCA) was used to reduce the data dimension and identify gene-probe clusters associated with early pathogenic changes in ileal CD and UC. Ingenuity Pathway Analysis (IPA) was used to identify the biological pathways associated with each cluster. We reduced the dimensions of the 26,765 gene probe set to 43 gene-probe clusters. Most of these clusters could be labeled as related to different biological pathways, such as Paneth cell antimicrobial peptides, the formation of organized lymphoid structures, or nuclear receptor signaling and xenobiotic metabolism. Molecular phylogenetic 16S rRNA sequence analysis was completed on 83 DNA samples from the same samples used to generate the gene expression profiles. We conducted an exploratory study to determine if the first principal component (PC1) of these clusters could be linked to specific phyla/subphyla taxa. patients undergoing either right hemicolectomy or total colectomy. Of these 99 subjects, we have completed molecular phylogenetic analysis of the same biopsy samples based on 16S rRNA sequence analysis in 83 subjects. To identify biological pathways associated with early pathogenic changes in the disease-unaffected ileum, we aim to construct a system model including clinical information, genetic data and microbiota composition.
In order to integrate these large data sets, we developed a dimension reduction scheme combining several computational tools, including cluster analysis, significance analysis of microarray (SAM) (4) and principal component analysis (PCA), to summarize information from our whole expression profiling experiments. Cluster analysis of microarray data based on similarity of gene expression values has been used for dimension reduction (5-7), but has been criticized for lacking statistical significance (4). IPA as well as direct inspection of the gene lists within each cluster was used to identify biological pathways. To illustrate the use of this approach towards dimension reduction, we present an exploratory analysis integrating the results of our cluster analysis with genotype, phenotype and human microbiome data.
IEEE 1st International Conference on Computational Advances in Bio and Medical Sciences, ICCABS 2011, Orlando, FL, USA, February 3-5, 2011; 01/2011
ABSTRACT: Abnormal host-microbe interactions are implicated in the pathogenesis of inflammatory bowel diseases. Previous 16S rRNA sequence analysis of intestinal tissues demonstrated that a subset of Crohn's disease (CD) and ulcerative colitis (UC) samples exhibited altered intestinal-associated microbial compositions characterized by depletion of Bacteroidetes and Firmicutes (particularly Clostridium taxa). We hypothesize that NOD2 and ATG16L1 risk alleles may be associated with these alterations.
To test this hypothesis, we took 178 specimens that had undergone previous 16S rRNA sequence analysis, collected from 35 CD, 35 UC, and 54 control patients, and genotyped them for the three major NOD2 risk alleles (Leu1007fs, R702W, and G908R) and the ATG16L1 T300A risk allele. Our statistical models incorporated the following independent variables: 1) disease phenotype (CD, UC, non-IBD control); 2) NOD2 composite genotype (NOD2(R) = at least one risk allele, NOD2(NR) = no risk alleles); 3) ATG16L1 T300A genotype (ATG16L1(R/R), ATG16L1(R/NR), ATG16L1(NR/NR)); 4) patient age at time of surgery, plus all first-order interactions. The dependent variable(s) were the relative frequencies of bacterial taxa classified by applying the RDP 2.1 classifier to previously reported 16S rRNA sequence data.
Disease phenotype, NOD2 composite genotype and ATG16L1 genotype were significantly associated with shifts in microbial compositions by nonparametric multivariate analysis of covariance (MANCOVA). Shifts in the relative frequencies of Faecalibacterium and Escherichia taxa were significantly associated with disease phenotype by nonparametric ANCOVA.
These results support the concept that disease phenotype and genotype are associated with compositional changes in intestinal-associated microbiota.
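The nonparametric (permutation-based) testing idea used in the abstract above can be illustrated with a deliberately simplified case. The sketch below is a two-group permutation test on a difference in means; it is not the covariate-adjusted MANCOVA or ANCOVA of the study, which involves several factors and interactions. The function name, group values, number of permutations, and seed are all assumptions for illustration.

```python
import random


def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Group labels are shuffled n_perm times; the p-value is the share of
    shuffles whose absolute mean difference is at least as large as the
    observed one (with the usual +1 correction).
    """
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        # Small tolerance guards against floating-point round-off.
        if diff >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical transformed relative frequencies of one taxon in two phenotypes.
controls = [10.0, 11.0, 12.0, 13.0, 14.0]
cases = [0.0, 1.0, 2.0, 3.0, 4.0]
print(permutation_p_value(controls, cases))
```

Because the null distribution is built from the data themselves, no normality assumption is needed, which is the appeal of permutation methods for skewed microbial frequency data.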
ABSTRACT: Computer-aided detection (CAD) is an emerging technique which provides an optimal method for automated detection of colonic polyps in computed tomography colonography (CTC). Differentiating true-positives (TPs) from false-positives (FPs) is one of the main tasks of CAD. One major challenge for the differentiation task is how to classify very unbalanced datasets. Many classifiers have been introduced to perform the differentiation task and some have proved useful. However, there has so far been no comparative study to evaluate the effectiveness of these classifiers. In this paper, we present a comparative study, which quantitatively assesses the most commonly used classifiers, e.g., support vector machine (SVM), random forest (RF), linear discriminant analysis (LDA), artificial neural network (ANN), logistic regression (LR), quadratic discriminant analysis (QDA), and classification and regression tree (CART). The performances of these classifiers were evaluated based on 786 initially detected patches, including 64 true polyps. Our results show that SVM, RF and LDA perform the best for the detection task and are also the most robust in dealing with datasets with different levels of imbalance. It can be concluded that these three classifiers are strong, good classifiers. While ANN delivers a less favorable result, it provides good complementary information and can be labeled as a weak, good classifier. From this comparative study, we conjecture that a combination of these classifiers could yield a stronger classifier and is worth further investigation, because they are complementary to each other. We further conclude that integrating a strong classifier for texture analysis would be a logical choice for CAD in CTC.
ABSTRACT: This UH2/UH3 demonstration project entitled “Effects of Crohn’s disease risk alleles on enteric microbiota” is focused on characterizing intestinal associated microbiota in patients with ileal Crohn’s disease (ileal CD), ulcerative colitis (UC) and control patients without inflammatory bowel diseases (non-IBD). We hypothesize that genetic factors that affect Paneth cell function contribute to compositional changes in intestinal microbiota. These changes in microbiota may lead to reduction of protective commensal organisms and increased numbers of aggressive organisms that incite intestinal inflammation. This hypothesis is being tested by high throughput 16S rRNA sequence analysis of de-identified ileal and colonic tissues that have been archived at Washington University St. Louis, University of North Carolina, Mount Sinai Hospital and the Cleveland Clinic. Multivariate analysis of the metagenomic data will be conducted with genotyping metadata (highly reproducible CD risk alleles, including NOD2 and ATG16L1) and phenotyping metadata (e.g. age, gender, race, body mass index, medications and smoking). Shotgun sequencing will be performed on selected fecal specimens linked to ileal tissues to identify additional, auxiliary, or synergistic pathogenic factors or other functional changes in the microbiome. Because members of this research team have observed that a chronic viral infection is required for the Paneth cell defect in Atg16l1 hypomorphic mice, a major focus of these studies will be towards identifying potential viral triggers for the defective Paneth cell phenotype in individuals harboring the ATG16L1 risk allele. Novel genetic probes for protective and aggressive organisms will be developed by mining bacterial genome and shotgun sequencing data. Genomic sequences will be produced for candidate protective and aggressive strains (e.g. adherent-invasive strains of E. coli) isolated from human intestinal tissues where there is limited existing genome information. Quantitative PCR (qPCR) assays using the novel as well as established genetic probes will be conducted to test the hypothesis that an imbalance between protective and aggressive organisms is associated with genetic factors that affect Paneth cell function. Our combined expertise in multiple disciplines across multiple institutions, our demonstrated ability to collect a large number of well-phenotyped samples with longitudinal clinical information that will be linked to host response and morphologic studies, and our consortium’s capacity for high-throughput sequencing will be used to investigate how alterations in the human microbiome relate to CD risk alleles and CD pathogenesis.