Masao Nagasaki

Tohoku University, Miyagi, Japan

Publications (133) · 366.35 Total impact

  •
    ABSTRACT: The Tohoku Medical Megabank Organization constructed the reference panel (referred to as the 1KJPN panel), which contains >20 million single nucleotide polymorphisms (SNPs), from whole-genome sequence data from 1070 Japanese individuals. The 1KJPN panel contains the largest number of haplotypes of Japanese ancestry to date. Here, from the 1KJPN panel, we designed a novel custom-made SNP array, named the Japonica array, which is suitable for whole-genome imputation of Japanese individuals. The array contains 659 253 SNPs, including tag SNPs for imputation, SNPs of the Y chromosome and mitochondria, and SNPs related to previously reported genome-wide association studies and pharmacogenomics. The Japonica array provides better imputation performance for Japanese individuals than the existing commercially available SNP arrays with both the 1KJPN panel and the International 1000 genomes project panel. For common SNPs (minor allele frequency (MAF) > 5%), the genomic coverage of the Japonica array (r² > 0.8) was 96.9%, that is, almost all common SNPs were covered by this array. Nonetheless, the coverage of low-frequency SNPs (0.5% < MAF ⩽ 5%) of the Japonica array reached 67.2%, which is higher than those of the existing arrays. In addition, we confirmed the high-quality genotyping performance of the Japonica array using the 288 samples in 1KJPN; the average call rate was 99.7% and the average concordance rate with the genotypes obtained from a high-throughput sequencer was 99.7%. As demonstrated in this study, the creation of custom-made SNP arrays based on a population-specific reference panel is a practical way to facilitate further association studies through genome-wide genotype imputations. Journal of Human Genetics advance online publication, 25 June 2015; doi:10.1038/jhg.2015.68.
    Journal of Human Genetics 06/2015; DOI:10.1038/jhg.2015.68 · 2.53 Impact Factor
  •
    ABSTRACT: BRCA1-associated protein 1 (BAP1) is a deubiquitinating enzyme that is involved in the regulation of cell growth. Recently, many somatic and germline mutations of BAP1 have been reported in a broad spectrum of tumors. In this study, we identified a novel somatic non-synonymous BAP1 mutation, a phenylalanine-to-isoleucine substitution at codon 170 (F170I), in one of 49 patients with esophageal squamous cell carcinoma (ESC). Multiplex ligation-dependent probe amplification (MLPA) of the BAP1 gene in this ESC tumor disclosed monoallelic deletion (LOH), suggesting BAP1 alterations on both alleles in this tumor. The deubiquitinase activity and the auto-deubiquitinase activity of F170I-mutant BAP1 were markedly suppressed compared with wild-type BAP1. In addition, wild-type BAP1 mostly localizes to the nucleus, whereas the F170I mutant preferentially localized in the cytoplasm. Microarray analysis revealed that expression of the F170I mutant drastically altered gene expression profiles compared with expression of wild-type BAP1. Gene-ontology analyses indicated that the F170I mutation altered the expression of genes involved in oncogenic pathways. We found that one candidate, TCEAL7, previously reported as a putative tumor suppressor gene, was significantly induced by wild-type BAP1 as compared to F170I mutant BAP1. Furthermore, we found that the level of BAP1 expression in the nucleus was reduced in 44% of ESCs examined by immunohistochemistry (IHC). Because the nuclear localization of BAP1 is important for its tumor suppressor function, BAP1 may be functionally inactivated in a substantial portion of ESCs. Taken together, BAP1 is likely to function as a tumor suppressor in at least a part of ESC. This article is protected by copyright. All rights reserved.
    Cancer Science 06/2015; DOI:10.1111/cas.12722 · 3.53 Impact Factor
  •
    ABSTRACT: Human leucocyte antigen (HLA) genes play an important role in determining the outcome of organ transplantation and are linked to many human diseases. Because of the diversity and polymorphisms of HLA loci, HLA typing at high resolution is challenging even with whole-genome sequencing data. We have developed a computational tool, HLA-VBSeq, to estimate the most probable HLA alleles at full (8-digit) resolution from whole-genome sequence data. HLA-VBSeq simultaneously optimizes read alignments to HLA allele sequences and abundance of reads on HLA alleles by variational Bayesian inference. We show the effectiveness of the proposed method over other methods through the analysis of predicting HLA types for HLA class I (HLA-A, -B and -C) and class II (HLA-DQA1,-DQB1 and -DRB1) loci from the simulation data of various depth of coverage, and real sequencing data of human trio samples. HLA-VBSeq is an efficient and accurate HLA typing method using high-throughput sequencing data without the need of primer design for HLA loci. Moreover, it does not assume any prior knowledge about HLA allele frequencies, and hence HLA-VBSeq is broadly applicable to human samples obtained from a genetically diverse population.
  •
    ABSTRACT: With the recent development of microarray and high-throughput sequencing (HTS) technologies, a number of studies have revealed catalogs of copy number variants (CNVs) and their association with phenotypes and complex traits. In parallel, a number of approaches to predict CNV regions and genotypes are proposed for both microarray and HTS data. However, only a few approaches focus on haplotyping of CNV loci. We propose a novel approach to infer copy unit alleles and their numbers in each sample simultaneously from population-scale HTS data by variational Bayesian inference on a generative probabilistic model inspired by latent Dirichlet allocation, which is a well studied model for document classification problems. In simulation studies, we evaluated concordance between inferred and true copy unit alleles for lower-, middle-, and higher-copy number dataset, in which precision and recall were ≥ 0.9 for data with mean coverage ≥ 10× per copy unit. We also applied the approach to HTS data of 1123 samples at highly variable salivary amylase gene locus and a pseudogene locus, and confirmed consistency of the estimated alleles within samples belonging to a trio of CEPH/Utah pedigree 1463 with 11 offspring. Our proposed approach enables detailed analysis of copy number variations, such as association study between copy unit alleles and phenotypes or biological features including human diseases.
    BMC Bioinformatics 01/2015; 16(Suppl 1):S4. DOI:10.1186/1471-2105-16-S1-S4 · 2.67 Impact Factor
  •
    ABSTRACT: High-throughput RNA sequencing (RNA-Seq) enables quantification and identification of transcripts at single-base resolution. Recently, longer sequence reads have become available thanks to the development of new types of sequencing technologies as well as improvements in chemical reagents for the Next Generation Sequencers. Although several computational methods have been proposed for quantifying gene expression levels from RNA-Seq data, they are not sufficiently optimized for longer reads (e.g. > 250 bp). We propose TIGAR2, a statistical method for quantifying transcript isoforms from fixed and variable length RNA-Seq data. Our method models substitution, deletion, and insertion errors of sequencers based on gapped alignments of reads to the reference cDNA sequences so that sensitive read aligners such as Bowtie2 and BWA-MEM are effectively incorporated in our pipeline. Also, a heuristic algorithm is implemented in variational Bayesian inference for faster computation. We apply TIGAR2 to both simulation data and real data of human samples and evaluate the performance of transcript quantification with TIGAR2 in comparison to existing methods. TIGAR2 is a sensitive and accurate tool for quantifying transcript isoform abundances from RNA-Seq data. Our method performs better than existing methods for the fixed-length reads (100 bp, 250 bp, 500 bp, and 1000 bp of both single-end and paired-end) and variable-length reads, especially for reads longer than 250 bp.
  •
    ABSTRACT: Comprehensive understanding of gene regulatory networks (GRNs) is a major challenge in the field of systems biology. Currently, there are two main approaches in GRN analysis using time-course observation data, namely an ordinary differential equation (ODE)-based approach and a statistical model-based approach. The ODE-based approach can generate complex dynamics of GRNs according to biologically validated nonlinear models. However, it cannot be applied to ten or more genes to simultaneously estimate system dynamics and regulatory relationships due to the computational difficulties. The statistical model-based approach uses highly abstract models to simply describe biological systems and to infer relationships among several hundreds of genes from the data. However, the high abstraction generates false regulations that are not permitted biologically. Thus, when dealing with several tens of genes of which the relationships are partially known, a method that can infer regulatory relationships based on a model with low abstraction and that can emulate the dynamics of ODE-based models while incorporating prior knowledge is urgently required. To accomplish this, we propose a method for inference of GRNs using a state space representation of a vector auto-regressive (VAR) model with L1 regularization. This method can estimate the dynamic behavior of genes based on linear time-series modeling constructed from an ODE-based model and can infer the regulatory structure among several tens of genes maximizing prediction ability for the observational data. Furthermore, the method is capable of incorporating various types of existing biological knowledge, e.g., drug kinetics and literature-recorded pathways. The effectiveness of the proposed method is shown through a comparison of simulation studies with several previous methods. 
For an application example, we evaluated mRNA expression profiles over time upon corticosteroid stimulation in rats, thus incorporating corticosteroid kinetics/dynamics, literature-recorded pathways and transcription factor (TF) information.
    PLoS ONE 08/2014; 9(8):e105942. DOI:10.1371/journal.pone.0105942 · 3.53 Impact Factor
  •
    ABSTRACT: Library quantitation is a critical step to obtain high data output in Illumina HiSeq sequencers. Here, we introduce a library quantitation method that utilizes the Illumina MiSeq sequencer, designated quantitative MiSeq (qMiSeq). In this procedure, 96 dual-index libraries including control samples are denatured, pooled in equal volume, and sequenced by MiSeq. We found that the relative concentration of each library can be determined based on the observed index ratio and can be used to determine the HiSeq run condition for each library. Thus, qMiSeq provides an efficient way to quantitate a large number of libraries at a time.
    Analytical Biochemistry 08/2014; 466. DOI:10.1016/j.ab.2014.08.015 · 2.31 Impact Factor
  •
    ABSTRACT: Background: Validation of single nucleotide variations in whole-genome sequencing is critical for studying disease-related variations in large populations. A combination of different types of next-generation sequencers for analyzing individual genomes may be an efficient means of validating multiple single nucleotide variations calls simultaneously. Results: Here, we analyzed 12 independent Japanese genomes using two next-generation sequencing platforms: the Illumina HiSeq 2500 platform for whole-genome sequencing (average depth 32.4×), and the Ion Proton semiconductor sequencer for whole exome sequencing (average depth 109×). Single nucleotide polymorphism (SNP) calls based on the Illumina Human Omni 2.5-8 SNP chip data were used as the reference. We compared the variant calls for the 12 samples, and found that the concordance between the two next-generation sequencing platforms varied between 83% and 97%. Conclusions: Our results show the versatility and usefulness of the combination of exome sequencing with whole-genome sequencing in studies of human population genetics and demonstrate that combining data from multiple sequencing platforms is an efficient approach to validate and supplement SNP calls. Electronic supplementary material: The online version of this article (doi:10.1186/1471-2164-15-673) contains supplementary material, which is available to authorized users.
    BMC Genomics 08/2014; 15(1):673. DOI:10.1186/1471-2164-15-673 · 4.04 Impact Factor
  •
    ABSTRACT: Next-generation sequencers (NGSs) have become one of the main tools for current biology. To obtain useful insights from the NGS data, it is essential to control low-quality portions of the data affected by technical errors such as air bubbles in sequencing fluidics.
    BMC Genomics 08/2014; 15(1):664. DOI:10.1186/1471-2164-15-664 · 4.04 Impact Factor
  •
    ABSTRACT: Recently, several biological simulation models of, e.g., gene regulatory networks and metabolic pathways, have been constructed based on existing knowledge of biomolecular reactions, e.g., DNA-protein and protein-protein interactions. However, since these do not always contain all necessary molecules and reactions, their simulation results can be inconsistent with observational data. Therefore, improvements in such simulation models are urgently required. A previously reported method created multiple candidate simulation models by partially modifying existing models. However, this approach was computationally costly and could not handle a large number of candidates that are required to find models whose simulation results are highly consistent with the data. In order to overcome the problem, we focused on the fact that the qualitative dynamics of simulation models are highly similar if they share a certain amount of regulatory structures. This indicates that better fitting candidates tend to share the basic regulatory structure of the best fitting candidate, which can best predict the data among candidates. Thus, instead of evaluating all candidates, we propose an efficient explorative method that can selectively and sequentially evaluate candidates based on the similarity of their regulatory structures. Furthermore, in estimating the parameter values of a candidate, e.g., synthesis and degradation rates of mRNA, for the data, those of the previously evaluated candidates can be utilized. The method is applied here to the pharmacogenomic pathways for corticosteroids in rats, using time-series microarray expression data. In the performance test, we succeeded in obtaining more than 80% of consistent solutions within 15% of the computational time as compared to the comprehensive evaluation.
Then, we applied this approach to 142 literature-recorded simulation models of corticosteroid-induced genes, and consequently selected 134 newly constructed better models. The method described here was found to be capable of efficiently exploring candidate simulation models and obtaining better models within a short span of time. Furthermore, the results suggest that there may be room for improvement in literature recorded pathways and that they can be systematically updated using biological observational data.
    Bio Systems 06/2014; 121. DOI:10.1016/j.biosystems.2014.06.001 · 1.47 Impact Factor
  •
    ABSTRACT: CSML and SBML are XML-based model definition standards which are developed with the aim of creating exchange formats for modeling, visualizing and simulating biological pathways. In this article we report a release of a format convertor for quantitative pathway models, namely CSML2SBML. It translates models encoded by CSML into SBML without loss of structural and kinetic information. The simulation and parameter estimation of the resulting SBML model can be carried out with the compliant tool CellDesigner for further analysis. The convertor is based on the standards CSML version 3.0 and SBML Level 2 Version 4. In our experiments, 11 out of 15 pathway models in the CSML model repository and 228 models in the Macrophage Pathway Knowledgebase (MACPAK) are successfully converted to SBML models. The consistency of the resulting model is validated by the libSBML Consistency Check of CellDesigner. Furthermore, the converted SBML model, assigned the kinetic parameters translated from the CSML model, can reproduce in CellDesigner the same dynamics as the CSML model running on Cell Illustrator. CSML2SBML, along with its instructions and examples for use, is available at http://csml2sbml.csml.org.
    Bio Systems 05/2014; 121. DOI:10.1016/j.biosystems.2014.05.004 · 1.47 Impact Factor
  •
    ABSTRACT: Heterozygous GATA-2 germline mutations are associated with overlapping clinical manifestations termed GATA-2 deficiency, characterized by immunodeficiency and predisposition to myelodysplastic syndrome (MDS) and acute myeloid leukemia (AML). However, there is considerable clinical heterogeneity among patients, and the molecular basis for the evolution of immunodeficiency into MDS/AML remains unknown. Thus, we conducted whole-genome sequencing on a patient with a germline GATA-2 heterozygous mutation (c. 988 C > T; p. R330X), who had a history suggestive of immunodeficiency and evolved into MDS/AML. Analysis was conducted with DNA samples from leukocytes for immunodeficiency, bone marrow mononuclear cells for MDS and bone marrow-derived mesenchymal stem cells. Whereas we did not identify a candidate genomic deletion that may contribute to the evolution into MDS, a total of 280 MDS-specific nonsynonymous single nucleotide variants were identified. By narrowing down with the single nucleotide polymorphism database, the functional missense database, and NCBI information, we finally identified three candidate mutations for EZH2, HECW2 and GATA-1, which may contribute to the evolution of the disease.
    Annals of Hematology 04/2014; 93(9). DOI:10.1007/s00277-014-2090-4 · 2.40 Impact Factor
  •
    ABSTRACT: Structural variations (SVs), such as insertions, deletions, inversions, and duplications, are a common feature in human genomes, and a number of studies have reported that such SVs are associated with human diseases. Although the progress of next generation sequencing (NGS) technologies has led to the discovery of a large number of SVs, accurate and genome-wide detection of SVs remains challenging. Thus far, various calling algorithms based on NGS data have been proposed. However, their strategies are diverse and there is no tool able to detect a full range of SVs accurately. We focused on evaluating the performance of existing deletion calling algorithms for various spanning ranges from low- to high-coverage simulation data. The simulation data was generated from a whole genome sequence with artificial SVs constructed based on the distribution of variants obtained from the 1000 Genomes Project. From the simulation analysis, deletion calls of various deletion sizes were obtained with each caller, and it was found that the performance was quite different according to the type of algorithms and targeting deletion size. Based on these results, we propose an integrated structural variant calling pipeline (iSVP) that combines existing methods with a newly devised filtering and merging processes. It achieved highly accurate deletion calling with >90% precision and >90% recall on the 30× read data for a broad range of size. We applied iSVP to the whole-genome sequence data of a CEU HapMap sample, and detected a large number of deletions, including notable peaks around 300 bp and 6,000 bp, which corresponded to Alus and long interspersed nuclear elements, respectively. In addition, many of the predicted deletions were highly consistent with experimentally validated ones by other studies. We present iSVP, a new deletion calling pipeline to obtain a genome-wide landscape of deletions in a highly accurate manner. 
From simulation and real data analysis, we show that iSVP is broadly applicable to human whole-genome sequencing data, which will elucidate relationships between SVs across genomes and associated diseases or biological functions.
    BMC Systems Biology 12/2013; 7 Suppl 6(Suppl 6):S8. DOI:10.1186/1752-0509-7-S6-S8 · 2.85 Impact Factor
  •
    ABSTRACT: Using quantitative PCR-based miRNA arrays, we comprehensively analyzed the expression profiles of miRNAs in human and mouse embryonic stem (ES), induced pluripotent stem (iPS), and somatic cells. Immature pluripotent cells were purified using SSEA-1 or SSEA-4 and were used for miRNA profiling. Hierarchical clustering and consensus clustering by nonnegative matrix factorization showed two major clusters, human ES/iPS cells and other cell groups, as previously reported. Principal components analysis (PCA) to identify miRNAs that segregate in these two groups identified miR-187, 299-3p, 499-5p, 628-5p, and 888 as new miRNAs that specifically characterize human ES/iPS cells. Detailed direct comparisons of miRNA expression levels in human ES and iPS cells showed that several miRNAs included in the chromosome 19 miRNA cluster were more strongly expressed in iPS cells than in ES cells. Similar analysis was conducted with mouse ES/iPS cells and somatic cells, and several miRNAs that had not been reported to be expressed in mouse ES/iPS cells were suggested to be ES/iPS cell-specific miRNAs by PCA. Comparison of the average expression levels of miRNAs in ES/iPS cells in humans and mice showed quite similar expression patterns of human/mouse miRNAs. However, several mouse- or human-specific miRNAs are ranked as high expressers. Time course tracing of miRNA levels during embryoid body formation revealed drastic and different patterns of changes in their levels. In summary, our miRNA expression profiling encompassing human and mouse ES and iPS cells gave various perspectives in understanding the miRNA core regulatory networks regulating pluripotent cells characteristics.
    PLoS ONE 09/2013; 8(9):e73532. DOI:10.1371/journal.pone.0073532 · 3.53 Impact Factor
  •
    ABSTRACT: Variant calling from genome-wide sequencing data is essential for the analysis of disease-causing mutations and elucidation of disease mechanisms. However, variant calling in low coverage regions is difficult due to sequence read errors and mapping errors. Hence, variant calling approaches that are robust to low coverage data are needed. We propose a new variant calling approach that considers pedigree information and haplotyping based on sequence reads spanning two or more heterozygous positions, termed phase informative reads. In our approach, genotyping and haplotyping by the assignment of each read to a haplotype based on phase informative reads are simultaneously performed. Therefore, positions with low evidence for heterozygosity are rescued by phase informative reads, and such rescued positions contribute to haplotyping in a synergistic way. In addition, pedigree information supports more accurate haplotyping as well as genotyping, especially in low coverage regions. Although heterozygous positions are useful for haplotyping, homozygous positions are not informative and weaken the information from heterozygous positions, as the majority of positions are homozygous. Thus, we introduce latent variables that determine zygosity at each position in order to filter out homozygous positions for haplotyping. In performance evaluation with parent-offspring trio sequencing data, our approach outperforms existing approaches in accuracy on the agreement with SNP array genotyping results. Also, performance analysis considering distance between variants showed that the use of phase informative reads is effective for accurate variant calling, and further performance improvement is expected with longer sequencing data. Contact: nagasaki@megabank.tohoku.ac.jp.
    Bioinformatics 09/2013; 29(22). DOI:10.1093/bioinformatics/btt503 · 4.62 Impact Factor
  •
    ABSTRACT: Cohesin is a multimeric protein complex that is involved in the cohesion of sister chromatids, post-replicative DNA repair and transcriptional regulation. Here we report recurrent mutations and deletions involving multiple components of the cohesin complex, including STAG2, RAD21, SMC1A and SMC3, in different myeloid neoplasms. These mutations and deletions were mostly mutually exclusive and occurred in 12.1% (19/157) of acute myeloid leukemia, 8.0% (18/224) of myelodysplastic syndromes, 10.2% (9/88) of chronic myelomonocytic leukemia, 6.3% (4/64) of chronic myelogenous leukemia and 1.3% (1/77) of classical myeloproliferative neoplasms. Cohesin-mutated leukemic cells showed reduced amounts of chromatin-bound cohesin components, suggesting a substantial loss of cohesin binding sites on chromatin. The growth of leukemic cell lines harboring a mutation in RAD21 (Kasumi-1 cells) or having severely reduced expression of RAD21 and STAG2 (MOLM-13 cells) was suppressed by forced expression of wild-type RAD21 and wild-type RAD21 and STAG2, respectively. These findings suggest a role for compromised cohesin functions in myeloid leukemogenesis.
    Nature Genetics 08/2013; 45(10). DOI:10.1038/ng.2731 · 29.65 Impact Factor
  •
    ABSTRACT: Many human genes express multiple transcript isoforms through alternative splicing, which greatly increases diversity of protein function. Although RNA sequencing (RNA-Seq) technologies have been widely used in measuring amounts of transcribed mRNA, accurate estimation of transcript isoform abundances from RNA-Seq data is challenging, because reads often map to more than one transcript isoform or to paralogs whose sequences are very similar to each other. We propose a statistical method to estimate transcript isoform abundances from RNA-Seq data. Our method can handle gapped alignments of reads against reference sequences so that it allows insertion or deletion errors within reads. The proposed method optimizes the number of transcript isoforms by variational Bayesian inference through an iterative procedure, and its convergence is guaranteed under a stopping criterion. On simulated data sets, our method outperformed the comparable quantification methods in inferring transcript isoform abundances, and at the same time its rate of convergence was faster than that of the expectation maximization (EM) algorithm. We also applied our method to RNA-Seq data of human cell line samples, and showed that our prediction result was more consistent among technical replicates than those of other methods. An implementation of our method is available at http://github.com/nariai/tigar. CONTACT: nariai@megabank.tohoku.ac.jp. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
    Bioinformatics 07/2013; DOI:10.1093/bioinformatics/btt381 · 4.62 Impact Factor
  •
    ABSTRACT: The derivation of induced pluripotent stem (iPS) cells from individuals with genetic disorders offers new opportunities for basic research into these diseases and the development of therapeutic compounds. Severe congenital neutropenia (SCN) is a serious disorder characterized by severe neutropenia at birth. SCN is associated with heterozygous mutations in the neutrophil elastase [elastase, neutrophil-expressed (ELANE)] gene, but the mechanisms that disrupt neutrophil development have not yet been clarified because of the current lack of an appropriate disease model. Here, we generated iPS cells from an individual with SCN (SCN-iPS cells). Granulopoiesis from SCN-iPS cells revealed neutrophil maturation arrest and little sensitivity to granulocyte-colony stimulating factor, reflecting a disease status of SCN. Molecular analysis of the granulopoiesis from the SCN-iPS cells vs. control iPS cells showed reduced expression of genes related to the wingless-type MMTV integration site family, member 3a (Wnt3a)/β-catenin pathway [e.g., lymphoid enhancer-binding factor 1], whereas Wnt3a administration induced elevation of lymphoid enhancer-binding factor 1 expression and the maturation of SCN-iPS cell-derived neutrophils. These results indicate that SCN-iPS cells provide a useful disease model for SCN, and the activation of the Wnt3a/β-catenin pathway may offer a novel therapy for SCN with ELANE mutation.
    Proceedings of the National Academy of Sciences 02/2013; DOI:10.1073/pnas.1217039110 · 9.81 Impact Factor
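
Note on the genotyping metrics quoted in the Japonica array abstract above (average call rate and average concordance rate): the short Python sketch below shows one common way such figures are computed per sample. It is an illustration only, not the authors' pipeline. The function names, the use of None for a missing call, and the toy genotypes are all assumptions made for this example, and the abstract does not state exactly how its rates were defined.

```python
from typing import Optional, Sequence

Genotype = Optional[str]  # e.g. "AG"; None marks a site with no call (assumed encoding)


def call_rate(genotypes: Sequence[Genotype]) -> float:
    """Fraction of sites that received a genotype call."""
    if not genotypes:
        raise ValueError("no sites given")
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)


def concordance_rate(array_calls: Sequence[Genotype],
                     reference_calls: Sequence[Genotype]) -> float:
    """Fraction of sites called in both sets whose genotypes agree.

    Alleles are compared as unordered pairs, so "AG" matches "GA".
    """
    if len(array_calls) != len(reference_calls):
        raise ValueError("call sets must cover the same sites")
    pairs = [(a, r) for a, r in zip(array_calls, reference_calls)
             if a is not None and r is not None]
    if not pairs:
        raise ValueError("no site is called in both sets")
    matches = sum(sorted(a) == sorted(r) for a, r in pairs)
    return matches / len(pairs)


# Toy data (hypothetical): five sites for one sample.
array_calls = ["AA", "AG", None, "GG", "AG"]
sequencer_calls = ["AA", "GA", "AG", "GG", "GG"]

print(call_rate(array_calls))                          # 4 of 5 sites called
print(concordance_rate(array_calls, sequencer_calls))  # 3 of 4 shared calls agree
```

Averaging these per-sample values over all genotyped samples would give study-level figures of the kind reported in the abstract.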

Publication Stats

2k Citations
366.35 Total Impact Points

Institutions

  • 2012–2014
    • Tohoku University
      • Tohoku Medical Megabank Organization
      Miyagi, Japan
  • 1998–2013
    • The University of Tokyo
      • Center for Human Genome
      • Department of Information Science
      Tōkyō, Japan
  • 2008–2009
    • The Institute of Statistical Mathematics
      Edo, Tōkyō, Japan
  • 2000–2004
    • Yamaguchi University
      • Faculty of Science
      • Graduate School of Science and Engineering
      Yamaguchi-shi, Yamaguchi-ken, Japan