ABSTRACT: Background:
Machine learning methods and in particular random forests (RFs) are a promising alternative to standard single SNP analyses in genome-wide association studies (GWAS). RFs provide variable importance measures (VIMs) to rank SNPs according to their predictive power. However, in contrast to the established genome-wide significance threshold, no clear criteria exist to determine how many SNPs should be selected for downstream analyses.
We propose a new variable selection approach, recurrent relative variable importance measure (r2VIM). Importance values are calculated relative to an observed minimal importance score for several runs of RF and only SNPs with large relative VIMs in all of the runs are selected as important. Evaluations on simulated GWAS data show that the new method controls the number of false-positives under the null hypothesis. Under a simple alternative hypothesis with several independent main effects it is only slightly less powerful than logistic regression. In an experimental GWAS data set, the same strong signal is identified while the approach selects none of the SNPs in an underpowered GWAS.
The novel variable selection method r2VIM is a promising extension to standard RF for objectively selecting relevant SNPs in GWAS while controlling the number of false-positive results.
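The selection rule described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function name `r2vim_select`, the `factor` threshold, and the choice to divide each run's importances by the absolute value of that run's minimal importance are assumptions made for the example, and the importance values are made up.

```python
import numpy as np

def r2vim_select(importances, factor=1.0):
    """Select variables whose relative importance is large in every run.

    importances: array of shape (n_runs, n_variables), one row of raw
    variable importance measures (VIMs) per random forest run.
    factor: hypothetical threshold on the relative importance.
    """
    vim = np.asarray(importances, dtype=float)
    # Observed minimal importance per run; its absolute value is used
    # here as the scale of noise (an assumption of this sketch).
    min_abs = np.abs(vim.min(axis=1, keepdims=True))
    if np.any(min_abs == 0):
        raise ValueError("minimal importance is zero in at least one run")
    relative = vim / min_abs
    # Keep a variable only if it passes the threshold in all runs.
    selected = np.all(relative >= factor, axis=0)
    return relative, selected

# Three hypothetical RF runs over four SNPs (illustrative values only).
runs = np.array([
    [0.50,  0.02, -0.05, 0.04],
    [0.45, -0.03, -0.04, 0.06],
    [0.60,  0.01, -0.06, 0.05],
])
relative, selected = r2vim_select(runs, factor=1.0)
print(selected)
```

Requiring the threshold to hold in every run is what makes the rule "recurrent": the fourth SNP exceeds the threshold in one run but not in the others, so it is not selected.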
ABSTRACT: The diagnosis of inflammatory bowel disease (IBD) remains a clinical challenge, and the most accurate diagnostic procedure is a combination of clinical tests including invasive endoscopy. In this study we evaluated whether systematic miRNA expression profiling, in conjunction with machine learning techniques, is suitable as a non-invasive test for the major IBD phenotypes (Crohn's disease (CD) and ulcerative colitis (UC)). Based on microarray technology, expression levels of 863 miRNAs were determined for whole blood samples from 40 CD and 36 UC patients and compared to data from 38 healthy controls (HC). To further discriminate between disease-specific and general inflammation we included miRNA expression data from other inflammatory diseases (inflammation controls (IC): 24 chronic obstructive pulmonary disease (COPD), 23 multiple sclerosis, 38 pancreatitis and 45 sarcoidosis cases) as well as 70 healthy controls from previous studies. Classification problems considering 2, 3 or 4 groups were solved using different types of penalized support vector machines (SVMs). The resulting models were assessed regarding sparsity and performance and a subset was selected for further investigation. Measured by the area under the ROC curve (AUC), the corresponding median holdout-validated accuracy was estimated to range from 0.75 to 1.00 (including IC) and from 0.89 to 0.98 (excluding IC). In combination, the corresponding models provide tools for the distinction of CD and UC as well as CD, UC and HC with expected classification error rates of 3.1 and 3.3%, respectively. These results were obtained by incorporating no more than 16 distinct miRNAs. Validated target genes of these miRNAs have been previously described as being related to IBD. For others we observed significant enrichment for IBD susceptibility loci identified in earlier GWAS.
These results suggest that the proposed miRNA signature is of relevance for the etiology of IBD. Its diagnostic value, however, should be further evaluated in large, independent, clinically well characterized cohorts.
ABSTRACT: We carried out a trans-ancestry genome-wide association and replication study of blood pressure phenotypes among up to 320,251 individuals of East Asian, European and South Asian ancestry. We find genetic variants at 12 new loci to be associated with blood pressure (P = 3.9 × 10^-11 to 5.0 × 10^-21). The sentinel blood pressure SNPs are enriched for association with DNA methylation at multiple nearby CpG sites, suggesting that, at some of the loci identified, DNA methylation may lie on the regulatory pathway linking sequence variation to blood pressure. The sentinel SNPs at the 12 new loci point to genes involved in vascular smooth muscle (IGFBP3, KCNK3, PDE3A and PRDM6) and renal (ARHGAP24, OSR1, SLC22A7 and TBX2) function. The new and known genetic variants predict increased left ventricular mass, circulating levels of NT-proBNP, and cardiovascular and all-cause mortality (P = 0.04 to 8.6 × 10^-6). Our results provide new evidence for the role of DNA methylation in blood pressure regulation.
ABSTRACT: Several pathogenic viruses such as hepatitis B and human immunodeficiency viruses may integrate into the host genome. These virus/host integrations are detectable using paired-end next generation sequencing. However, the low number of expected true virus integrations may be difficult to distinguish from the noise of many false positive candidates. Here, we propose a novel filtering approach that increases specificity without compromising sensitivity for virus/host chimera detection. Our detection pipeline termed Vy-PER (Virus integration detection bY Paired End Reads) outperforms existing similar tools in speed and accuracy. We analysed whole genome data from childhood acute lymphoblastic leukemia (ALL), which is characterised by genomic rearrangements and usually associated with radiation exposure. This analysis was motivated by the recently reported virus integrations at genomic rearrangement sites and association with chromosomal instability in liver cancer. However, as expected, our analysis of 20 tumour and matched germline genomes from ALL patients finds no significant evidence for integrations by known viruses. Nevertheless, our method eliminates 12,800 false positives per genome (80× coverage) and only our method detects singleton human-phiX174-chimeras caused by optical errors of the Illumina HiSeq platform. This high accuracy is useful for detecting low virus integration levels as well as non-integrated viruses.
ABSTRACT: Cancer proteomics provides a powerful approach to identify biomarkers for personalized medicine. In particular, biomarkers for early detection, prognosis and therapeutic intervention of bone cancers, especially osteosarcomas, are missing. Initially, we compared two-dimensional gel electrophoresis (2-DE)-based protein expression patterns between cell lines of fetal osteoblasts, osteosarcoma and pulmonary metastasis derived from osteosarcoma. Two independent statistical analyses by means of PDQuest® and SameSpot® software revealed a common set of 34 differentially expressed protein spots (p < 0.05). Seventeen proteins were identified by mass spectrometry and subjected to Ingenuity Pathway Analysis, resulting in one high-ranked network associated with Gene Expression, Cell Death and Cell-To-Cell Signaling and Interaction. Ran/TC4-binding protein (RANBP1) and Cathepsin D (CTSD) were further validated by Western Blot in cell lines, while the latter also showed higher expression differences in cytospins and in clinical samples using tissue microarrays comprising osteosarcomas, metastases, other bone malignancies, and control tissues. The results show that protein expression patterns distinguish fetal osteoblasts from osteosarcomas, pulmonary metastases, and other bone diseases with relevant sensitivities between 55.56% and 100% at ≥87.50% specificity. In particular, CTSD was validated in clinical material and could thus serve as a new biomarker for bone malignancies and potentially guide individualized treatment regimes.
ABSTRACT: Standard analysis methods for genome wide association studies (GWAS) are not robust to complex disease models, such as interactions between variables with small main effects. These types of effects likely contribute to the heritability of complex human traits. Machine learning methods that are capable of identifying interactions, such as Random Forests (RF), are an alternative analysis approach. One caveat to RF is that there is no standardized method of selecting variables so that false positives are reduced while retaining adequate power. To this end, we have developed a novel variable selection method called relative recurrency variable importance metric (r2VIM). This method incorporates recurrency and variance estimation to assist in optimal threshold selection. For this study, we specifically address how this method performs in data with almost completely epistatic effects (i.e. no marginal effects). Our results show that with appropriate parameter settings, r2VIM can identify interaction effects when the marginal effects are virtually nonexistent. It also outperforms logistic regression, which has essentially no power under this type of model when the number of potential features (genetic variants) is large. (All Supplementary Data can be found here: http://research.nhgri.nih.gov/manuscripts/Bailey-Wilson/r2VIM_epi/).
Article · Jan 2015 · Pacific Symposium on Biocomputing
ABSTRACT: Background and purpose:
The aim of this study was to determine the impact of functional single nucleotide polymorphism (SNP) pathways involved in the ROS pathway, DNA repair, or TGFB1 signaling on acute or late normal toxicity as well as individual radiosensitivity.
Materials and methods:
Patients receiving breast-conserving surgery and radiotherapy were examined either for erythema (n = 83), fibrosis (n = 123), or individual radiosensitivity (n = 123). The 17 SNPs analyzed are involved in the ROS pathway (GSTP1, SOD2, NQO1, NOS3, XDH), DNA repair (XRCC1, XRCC3, XRCC6, ERCC2, LIG4, ATM) or TGFB signaling (SKIL, EP300, APC, AXIN1, TGFB1). Associations with biological and clinical endpoints were studied for single SNPs but especially for combinations of SNPs assuming that a SNP is either beneficial or deleterious and needs to be weighted.
With one exception, no significant association was seen between a single SNP and the three endpoints studied. Nor were significant associations observed when applying a multi-SNP model assuming that each SNP was deleterious. In contrast, significant associations were obtained when SNPs were assumed to be either beneficial or deleterious. These associations became stronger when each SNP was weighted individually. Detailed analysis revealed that both erythema and individual radiosensitivity depend especially on SNPs affecting DNA repair and TGFB1 signaling, while SNPs in the ROS pathway were of minor importance.
Functional pathways of SNPs may be used to form a risk score that allows prediction of acute and late radiation-induced toxicity and also helps to unravel the underlying biological mechanisms.
Article · Aug 2014 · Strahlentherapie und Onkologie
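One way to picture the multi-SNP models contrasted in this abstract is as a signed, weighted sum over genotypes. The abstract does not state how the score was actually formed, so everything in this sketch is an assumption: the 0/1/2 allele-count coding, the +1 (deleterious) / -1 (beneficial) direction vector, the linear sum, the function name `snp_risk_score`, and all numbers.

```python
import numpy as np

def snp_risk_score(genotypes, directions, weights=None):
    """Hypothetical linear risk score over several SNPs.

    genotypes: (n_patients, n_snps) allele counts coded 0, 1 or 2.
    directions: +1 for a SNP treated as deleterious, -1 for beneficial.
    weights: optional per-SNP weights; equal weights if omitted.
    """
    g = np.asarray(genotypes, dtype=float)
    d = np.asarray(directions, dtype=float)
    w = np.ones_like(d) if weights is None else np.asarray(weights, dtype=float)
    # Each SNP contributes its allele count, signed by direction and scaled by weight.
    return g @ (d * w)

# Two hypothetical patients genotyped at three SNPs (illustrative values only).
genotypes = [[0, 1, 2],
             [2, 0, 0]]
directions = [1, -1, 1]

unweighted = snp_risk_score(genotypes, directions)
weighted = snp_risk_score(genotypes, directions, weights=[0.5, 1.0, 2.0])
print(unweighted, weighted)
```

Setting every direction to +1 corresponds to the "each SNP is deleterious" model, while the `weights` argument corresponds to weighting each SNP individually.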
ABSTRACT: Two-point linkage analyses of whole genome sequence data are a promising approach to identify rare variants that segregate with complex diseases in large pedigrees because, in theory, the causal variants have been genotyped. We used whole genome sequence data and simulated traits provided by Genetic Analysis Workshop 18 to evaluate the proportion of false-positive findings in a binary trait using classic two-point parametric linkage analysis. False-positive genome-wide significant log of odds (LOD) scores were identified in more than 80% of 50 replicates for a binary phenotype generated by dichotomizing a quantitative trait that was simulated with a polygenic component (that was not based on any of the provided whole genome sequence genotypes). In contrast, when the trait was truly nongenetic (created by randomly assigning affected-unaffected status), the number of false-positive results was well controlled. These results suggest that when using two-point linkage analyses on whole genome sequence data, one should carefully examine regions yielding significant two-point LOD scores with multipoint analysis and that a more stringent significance threshold may be needed.
ABSTRACT: A dozen genes/regions have been confirmed as genetic risk factors for oral clefts in human association and linkage studies, and animal models suggest that even more genes may be involved. Genomic sequencing studies should identify specific causal variants and may reveal additional genes as influencing the risk of oral clefts, which have a complex and heterogeneous etiology. We conducted a whole exome sequencing (WES) study to search for potentially causal variants using affected relatives drawn from multiplex cleft families. Two or three affected 2°, 3° and higher degree relatives from 55 multiplex families were sequenced. We examined rare single nucleotide variants (SNVs) shared by affected relatives in 348 recognized candidate genes. Exact probabilities that affected relatives would share these rare variants were calculated given pedigree structures and corrected for the number of variants tested. Five novel and potentially damaging SNVs shared by affected distant relatives were found, and confirmed by Sanger sequencing. One damaging SNV in CDH1, shared by three affected second cousins from a single family, attained statistical significance (p=0.02 after correcting for multiple tests). Family-based designs such as the one used in this WES study offer important advantages for identifying genes likely to be causing complex and heterogeneous disorders.
ABSTRACT: Logistic regression has been the de facto, and often the only, model used in the description and analysis of relationships between a binary outcome and observed features. It is widely used to obtain the conditional probabilities of the outcome given predictors, as well as predictor effect size estimates using conditional odds ratios.
We show how statistical learning machines for binary outcomes, provably consistent for the nonparametric regression problem, can be used to provide both consistent conditional probability estimation and conditional effect size estimates. Effect size estimates from learning machines leverage our understanding of counterfactual arguments central to the interpretation of such estimates. We show that, if the data generating model is logistic, we can recover accurate probability predictions and effect size estimates with nearly the same efficiency as a correct logistic model, both for main effects and interactions. We also propose a method using learning machines to scan for possible interaction effects quickly and efficiently. Simulations using random forest probability machines are presented.
The models we propose make no assumptions about the data structure, and capture the patterns in the data by just specifying the predictors involved and not any particular model structure. So they do not run the same risks of model mis-specification and the resultant estimation biases as a logistic model. This methodology, which we call a "risk machine", will share properties from the statistical machine that it is derived from.
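The counterfactual reading of an effect size mentioned in this abstract can be illustrated with any fitted probability estimator. The sketch below is a hedged illustration, not the authors' procedure: `counterfactual_risk_difference` is a hypothetical helper, and `toy_machine` is a made-up stand-in for a trained learning machine (the abstract uses random forest probability machines). The idea shown is to set one predictor to two values for every subject, predict both times, and average the difference.

```python
import numpy as np

def counterfactual_risk_difference(predict_proba, X, column, low=0.0, high=1.0):
    """Average change in predicted probability when one predictor is
    switched from `low` to `high` and all other predictors are held fixed.

    predict_proba: callable mapping an (n, p) array to n probabilities.
    """
    X = np.asarray(X, dtype=float)
    X_low, X_high = X.copy(), X.copy()
    X_low[:, column] = low      # counterfactual data set: predictor absent
    X_high[:, column] = high    # counterfactual data set: predictor present
    return float(np.mean(predict_proba(X_high) - predict_proba(X_low)))

def toy_machine(X):
    # Stand-in for a fitted learner: risk depends on column 0 only.
    return 0.2 + 0.3 * X[:, 0]

# Four hypothetical subjects with two binary predictors.
X = np.array([[0, 1], [1, 0], [1, 1], [0, 0]])
effect_col0 = counterfactual_risk_difference(toy_machine, X, column=0)
effect_col1 = counterfactual_risk_difference(toy_machine, X, column=1)
print(effect_col0, effect_col1)
```

Because the helper only needs a prediction function, the same code would apply to any learner that returns conditional probabilities, which matches the abstract's point that no particular model structure has to be specified.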
ABSTRACT: Chromosomal aneuploidy has been identified as a prognostic factor in the majority of sporadic carcinomas. However, it is not known how chromosomal aneuploidy affects chromosome-specific protein expression in particular, and the cellular proteome equilibrium in general.
The aim was to detect chromosomal aneuploidy-associated expression changes in cell clones carrying trisomies found in colorectal cancer.
We used microcell-mediated chromosomal transfer to generate three artificial trisomic cell clones of the karyotypically stable, diploid, yet mismatch-deficient, colorectal cancer cell line DLD1, each harboring one extra copy of either chromosome 3, 7 or 13. Protein expression differences were assessed by two-dimensional gel electrophoresis and mass spectrometry, compared to whole-genome gene expression data, and evaluated using the PANTHER classification system and Ingenuity Pathway Analysis (IPA).
In total, 79 differentially expressed proteins were identified between the trisomic clones and the parental cell line. Up-regulation of PCNA and HMGB1 as well as down-regulation of IDH3A and PSMB3 were revealed as trisomy-associated alterations involved in regulating genome stability.
These results show that trisomies affect the expression of genes and proteins that are not necessarily located on the trisomic chromosome, but reflect a pathway-related alteration of the cellular equilibrium.
Article · Jan 2014 · Analytical cellular pathology (Amsterdam)
ABSTRACT: A large-scale RNAi screen was performed for 8 different melanoma cell lines using a pooled whole genome lentiviral shRNA library. shRNAs affecting proliferation of transduced melanoma cells were negatively selected during 10 days of culture. Overall, 617 shRNAs were identified by microarray hybridization. Pathway analyses identified mitogen-activated protein kinase (MAPK) pathway members such as ERK1/2, JNK1/2 and MAP3K7 and protein kinase Cβ (PKCβ) as candidate genes. Knockdown of PKCβ most consistently reduced cellular proliferation, colony formation and migratory capacity of melanoma cells and was selected for further validation. PKCβ showed enhanced expression in human primary melanomas and distant metastases as compared with benign melanocytic nevi. Moreover, treatment of melanoma cells with PKCβ-specific inhibitor enzastaurin reduced melanoma cell growth but had only small effects on benign fibroblasts. Finally, PKCβ-shRNA significantly reduced lung colonisation capacity of stably transduced melanoma cells in mice. Taken together, the present study identified new candidate genes for melanoma cell growth and proliferation. PKCβ seems to play an important role in these processes and might serve as a new target for treatment of metastatic melanoma.
Article · Jan 2014 · Pigment Cell & Melanoma Research