Mark Kunitomi’s research while affiliated with IBM Research - Thomas J. Watson Research Center and other places


Publications (16)


A k-mer database for organism identification. US Patent 11830580
  • Patent

January 2023 · 2 Reads · Kaufman, JH · M Kunitomi · MA Davis

Clustering of Spike glycoprotein T cell Epitopes ORIGINAL+CANDIDATE and their occurrence. The bottom chart is a dendrogram obtained by performing sequence based clustering on T cell S epitopes. The labels along the x-axis in the dendrogram are the epitope sequences and have been assigned colors based on their originating organism: blue if CANDIDATE epitope, red if originally found in SARS, and green if known to be found in SARS-CoV-2. Along the y-axis of the dendrogram is the edit distance score. The edit distance of two sequences lets us know how similar the sequences are to one another. We put a threshold of 1.0 on this edit distance to discover clusters within the epitopes, i.e., epitopes with normalized distance < 1.0 are part of the same cluster. In the top part of each figure, the bars align with the epitope labels from the dendrogram. Each bar represents the number of times all members of the cluster to which the epitope belongs are seen across SARS-CoV-2 genomes in our dataset. It is also important to note that the figure is actually a log–log plot of the counts. Furthermore, each bar is stacked to show genomes sequenced in humans, animals, or environment. We would also like to highlight that low presence in genomes sequenced from environment is not a consequence of epitopes not being found in those genomes, but rather a product of extremely low numbers of high quality genomes from the environment in our dataset. Data used to generate this figure are presented in Supplemental Data SD3.
Immunodominance plot and mutagenesis plots. (a) Stacked area plot depicting normalized T cell epitope presence across the length of the Spike glycoprotein transcript (total length: 1299 amino acids). The graph is colored by epitope origin, with original epitope rates in blue and newly predicted epitope rates in red. (b) Mutation density plot for the Spike glycoprotein; logs normalized mismatch frequency rates across the protein as compared to the consensus sequence.
Clustermap obtained after clustering the T cell epitopes of Spike glycoprotein based on the position at which they occur within the protein. The x-axis is the entire length of the protein, which is 1299 in the case of S. Along the y-axis, every row represents one epitope. The color scheme is defined by using a color map that assigns colors to each row depending on occurrences of the epitope across all genomes. The y-axis labels on the right-hand side are colored cyan to indicate an epitope from the top ten list. Data used to generate this figure are present in Supplemental Data SD4.
Representation of the localization of the B cell and T cell epitopes on the CTD domain of the Nucleoprotein. (A) Scheme of SARS-CoV-2 N domains illustrating the N-term intrinsically disordered region (IDR) followed by the N-terminal domain (NTD), the IDR linker, the C-terminal domain (CTD), and the C-term IDR. (B,C) The N CTD dimer is represented in New Cartoon format (one monomer is gray and the other is transparent), and the sequence of the B cell (B) and T cell (C) epitopes is colored according to the legend represented in the figure. The epitope sequence is represented in the legend. The epitopes located in the linker domain are indicated by (**) and those in the C-term IDR by (*). For greater clarity, we represented the epitopes in only one monomer.
Representation of the localization of the B cell and T cell epitopes on the SARS-CoV-2 Spike glycoprotein in the prefusion and postfusion conformations. (A) Scheme of SARS-CoV-2 S1 and S2 units of the S protein and of their domains. (B,C) The S protein trimer is represented in New Cartoon format (one monomer is gray; the other two are transparent) and is shown in the prefusion conformation on the left side of the panels and in the postfusion conformation on the right side of the panels. The sequence of the B cell (B) and T cell (C) epitopes is shown in the figure legend and is colored accordingly in the S protein structure.


Predicting Epitope Candidates for SARS-CoV-2
  • Article
  • Full-text available

August 2022 · 98 Reads · 6 Citations

Epitopes are short amino acid sequences that define the antigen signature to which an antibody or T cell receptor binds. In light of the current pandemic, epitope analysis and prediction are paramount to improving serological testing and developing vaccines. In this paper, known epitope sequences from SARS-CoV, SARS-CoV-2, and other Coronaviridae were leveraged to identify additional antigen regions in 62K SARS-CoV-2 genomes. Additionally, we present epitope distribution across SARS-CoV-2 genomes, locate the most commonly found epitopes, and discuss where epitopes are located on proteins and how epitopes can be grouped into classes. The mutation density of different protein regions is presented using a big data approach. It was observed that there are 112 B cell and 279 T cell conserved epitopes between SARS-CoV-2 and SARS-CoV, with more diverse sequences found in Nucleoprotein and Spike glycoprotein.
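The figure captions for this article describe sequence-based clustering of epitopes with a threshold on a normalized edit distance. A minimal Python sketch of that idea follows. It is not the authors' implementation: the single-linkage grouping, the normalization by the longer sequence length, the 0.5 cutoff, and the toy peptide strings are all assumptions made for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def cluster_epitopes(epitopes, threshold):
    """Single-linkage grouping: link any pair closer than `threshold`."""
    parent = {e: e for e in epitopes}

    def find(x):
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(epitopes, 2):
        if normalized_distance(a, b) < threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for e in epitopes:
        clusters.setdefault(find(e), []).append(e)
    return list(clusters.values())


if __name__ == "__main__":
    # Toy peptides, not real epitopes.
    toy = ["AAAA", "AAAT", "GGGG"]
    print(cluster_epitopes(toy, threshold=0.5))
```

A dendrogram such as the one in the caption would normally come from a hierarchical clustering routine; the union-find grouping here only reproduces the "cut at a threshold" step for single linkage.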


Figure 3. Immunodominance plot and mutagenesis plots.
Predicting Epitope Candidates for SARS-CoV-2

February 2022 · 29 Reads

Epitopes are short amino acid sequences that define the antigen signature to which an antibody binds. In light of the current pandemic, epitope analysis and prediction are paramount to improving serological testing and developing vaccines. In this paper, we leverage known epitope sequences from SARS-CoV, SARS-CoV-2 and other Coronaviridae and use those known epitopes to identify additional antigen regions in 62k SARS-CoV-2 genomes. Additionally, we present epitope distribution across SARS-CoV-2 genomes, locate the most commonly found epitopes, and discuss where epitopes are located on proteins and how epitopes can be grouped into classes. We also discuss the mutation density of different regions on proteins using a big data approach. We find that there are many conserved epitopes between SARS-CoV-2 and SARS-CoV, with more diverse sequences found in Nucleoprotein and Spike Glycoprotein.


Semi-Supervised Pipeline for Autonomous Annotation of SARS-CoV-2 Genomes

December 2021 · 78 Reads · 6 Citations

SARS-CoV-2 genomic sequencing efforts have scaled dramatically to address the current global pandemic and aid public health. However, autonomous genome annotation of SARS-CoV-2 genes, proteins, and domains is not readily accomplished by existing methods and results in missing or incorrect sequences. To overcome this limitation, we developed a novel semi-supervised pipeline for automated gene, protein, and functional domain annotation of SARS-CoV-2 genomes that differentiates itself by not relying on a single reference genome and by overcoming atypical genomic traits that challenge traditional bioinformatic methods. We analyzed an initial corpus of 66,000 SARS-CoV-2 genome sequences collected from labs across the world using our method and identified the comprehensive set of known proteins with 98.5% set membership accuracy and 99.1% accuracy in length prediction, compared to proteome references, including Replicase polyprotein 1ab (with its transcriptional slippage site). Compared to other published tools, such as Prokka (base) and VAPiD, our method yielded 6.4- and 1.8-fold increases in protein annotations, respectively. Our method generated 13,000,000 gene, protein, and domain sequences, some conserved across time and geography and others representing emerging variants. We observed 3362 non-redundant sequences per protein on average within this corpus and described key D614G and N501Y variants spatiotemporally in the initial genome corpus. For spike glycoprotein domains, we achieved greater than 97.9% sequence identity to references and characterized receptor binding domain variants. We further demonstrated the robustness and extensibility of our method on an additional 4000 variant-diverse genomes containing all named variants of concern and interest as of August 2021. In this cohort, we successfully identified all keystone spike glycoprotein mutations in our predicted protein sequences with greater than 99% accuracy and demonstrated high accuracy of the protein and domain annotations. This work comprehensively presents the molecular targets to refine biomedical interventions for SARS-CoV-2 with a scalable, high-accuracy method to analyze newly sequenced infections as they arise.


Fig. 1 Bioinformatic pipeline schematic for processing microbiome samples in the presence of matrix content. Description of the bioinformatic steps (light gray) applied to high protein powder metatranscriptome samples (dark gray). Black arrows indicate data flow and blue boxes describe outputs from the pipeline.
Fig. 8 Salmonella status correlations with genus relative abundances. Only those genera with the absolute value of the correlation coefficient >0.5 are shown. Positive and negative correlations are indicated in gray and blue, respectively.
Accuracy of microbial identification using two in silico constructed simulated food mixtures.
Monitoring the microbiome for food safety and quality using deep shotgun sequencing

December 2021 · 263 Reads · 33 Citations

npj Science of Food

In this work, we hypothesized that shifts in the food microbiome can be used as an indicator of unexpected contaminants or environmental changes. To test this hypothesis, we sequenced the total RNA of 31 high protein powder (HPP) samples of poultry meal pet food ingredients. We developed a microbiome analysis pipeline employing a key eukaryotic matrix filtering step that improved microbe detection specificity to >99.96% during in silico validation. The pipeline identified 119 microbial genera per HPP sample on average with 65 genera present in all samples. The most abundant of these were Bacteroides, Clostridium, Lactococcus, Aeromonas, and Citrobacter. We also observed shifts in the microbial community corresponding to ingredient composition differences. When comparing culture-based results for Salmonella with total RNA sequencing, we found that Salmonella growth did not correlate with multiple sequence analyses. We conclude that microbiome sequencing is useful to characterize complex food microbial communities, while additional work is required for predicting specific species’ viability from total RNA sequencing.


Analysis and forecasting of global real time RT-PCR primers and probes for SARS-CoV-2

April 2021 · 115 Reads · 22 Citations

Rapid tests for active SARS-CoV-2 infections rely on reverse transcription polymerase chain reaction (RT-PCR). RT-PCR uses reverse transcription of RNA into complementary DNA (cDNA) and amplification of specific DNA (primer and probe) targets using polymerase chain reaction (PCR). The technology makes rapid and specific identification of the virus possible based on sequence homology of nucleic acid sequence and is much faster than tissue culture or animal cell models. However, the technique can lose sensitivity over time as the virus evolves and the target sequences diverge from the selective primer sequences. Different primer sequences have been adopted in different geographic regions. As we rely on these existing RT-PCR primers to track and manage the spread of the Coronavirus, it is imperative to understand how SARS-CoV-2 mutations, over time and geographically, diverge from existing primers used today. In this study, we analyze the performance of the SARS-CoV-2 primers in use today by measuring the number of mismatches between primer sequence and genome targets over time and spatially. We find that there is a growing number of mismatches, an increase by 2% per month, as well as a high specificity of virus based on geographic location.
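To make the mismatch measurement in this abstract concrete, here is a small Python sketch that counts mismatches between a primer and its best-matching window in a genome, then averages them per collection month. It is only an illustration under stated assumptions, not the study's method: it ignores indels, reverse complements, and IUPAC ambiguity codes, and the function names, month labels, and toy sequences are hypothetical.

```python
from collections import defaultdict


def min_mismatches(primer: str, genome: str) -> int:
    """Fewest substitutions between the primer and any same-length genome window."""
    k = len(primer)
    if k == 0 or k > len(genome):
        raise ValueError("primer must be non-empty and no longer than the genome")
    return min(
        sum(p != g for p, g in zip(primer, genome[i:i + k]))
        for i in range(len(genome) - k + 1)
    )


def mean_mismatches_by_month(primer, records):
    """Average best-window mismatch count per month.

    `records` is an iterable of (month_label, genome_sequence) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # month -> [mismatch sum, genome count]
    for month, genome in records:
        totals[month][0] += min_mismatches(primer, genome)
        totals[month][1] += 1
    return {month: s / n for month, (s, n) in sorted(totals.items())}


if __name__ == "__main__":
    # Toy data, not real primers or genomes.
    records = [
        ("2020-03", "TTACGTTT"),
        ("2020-04", "TTACCTTT"),
        ("2020-04", "TTACGTTT"),
    ]
    print(mean_mismatches_by_month("ACGT", records))
```

A rising per-month average in such a table is the kind of trend the abstract summarizes as a growing number of mismatches; grouping by region instead of month would follow the same pattern.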


Figure 1. Total number of mismatches each PCR test creates when tested against the full corpus of SARS-CoV-2 genomes. Each PCR test is identified by the country of use and the targeted gene name.
Figure 3. Distribution of mismatches for each primer. A shows the total number of mismatches aggregated for each day within the time range. B shows the number of mismatches for each day averaged by the number of genomes that occur on a day within the time range.
Analysis and Forecasting of Global RT-PCR Primers for SARS-CoV-2

January 2021 · 50 Reads

Rapid tests for active SARS-CoV-2 infections rely on reverse transcription polymerase chain reaction (RT-PCR). RT-PCR uses reverse transcription of RNA into complementary DNA (cDNA) and amplification of specific DNA (primer and probe) targets using polymerase chain reaction (PCR). The technology makes rapid and specific identification of the virus possible based on sequence homology of nucleic acid sequence and is much faster than tissue culture or animal cell models. However, the technique can lose sensitivity over time as the virus evolves and the target sequences diverge from the selective primer sequences. Different primer sequences have been adopted in different geographic regions. As we rely on these existing RT-PCR primers to track and manage the spread of the Coronavirus, it is imperative to understand how SARS-CoV-2 mutations, over time and geographically, diverge from existing primers used today. In this study, we analyze the performance of the SARS-CoV-2 primers in use today by measuring the number of mismatches between primer sequence and genome targets over time and spatially. We find that there is a growing number of mismatches, an increase by 2% per month, as well as a high specificity of virus based on geographic location.


Analysis and Forecasting of Global of RT-PCR Primers for SARS-CoV-2

December 2020 · 33 Reads · 1 Citation

Rapid tests for active SARS-CoV-2 infections rely on reverse transcription polymerase chain reaction (RT-PCR). RT-PCR uses reverse transcription of RNA into complementary DNA (cDNA) and amplification of specific DNA (primer and probe) targets using polymerase chain reaction (PCR). The technology makes rapid and specific identification of the virus possible based on sequence homology of nucleic acid sequence and is much faster than tissue culture or animal cell models. However, the technique can lose sensitivity over time as the virus evolves and the target sequences diverge from the selective primer sequences. As we rely on existing RT-PCR primers to track and manage the spread of the Coronavirus as public life re-opens, it is imperative to understand how SARS-CoV-2 mutations, over time and geographically, diverge from existing primers used today. In this study, we analyze the performance of the SARS-CoV-2 primers in use today by measuring the number of mismatches between primer sequence and genome targets over time and spatially. We find that there is a growing number of mismatches, an increase by 2% per month, as well as a high specificity of virus based on geographic location.


Monitoring the microbiome for food safety and quality using deep shotgun sequencing

May 2020 · 210 Reads · 3 Citations

In this work, we hypothesized that shifts in the food microbiome can be used as an indicator of unexpected contaminants or environmental changes. To test this hypothesis, we sequenced total RNA of 31 high protein powder (HPP) samples of poultry meal pet food ingredients. We developed a microbiome analysis pipeline employing a key eukaryotic matrix filtering step that improved microbe detection specificity to >99.96% during in silico validation. The pipeline identified 119 microbial genera per HPP sample on average with 65 genera present in all samples. The most abundant of these were Bacteroides, Clostridium, Lactococcus, Aeromonas, and Citrobacter. We also observed shifts in the microbial community corresponding to ingredient composition differences. When comparing culture-based results for Salmonella with total RNA sequencing, we found that Salmonella growth did not correlate with multiple sequence analyses. We conclude that microbiome sequencing is useful to characterize complex food microbial communities, while additional work is required for predicting specific species' viability from total RNA sequencing.


FASER results on experimental samples
FASER results on experimental food mixture
High protein powder sequences mapping to observed source genomes
Food authentication from shotgun sequencing reads with an application on high protein powders

November 2019 · 150 Reads · 40 Citations

npj Science of Food

Here we propose that using shotgun sequencing to examine food leads to accurate authentication of ingredients and detection of contaminants. To demonstrate this, we developed a bioinformatic pipeline, FASER (Food Authentication from SEquencing Reads), designed to resolve the relative composition of mixtures of eukaryotic species using RNA or DNA sequencing. Our comprehensive database includes >6000 plants and animals that may be present in food. FASER accurately identified eukaryotic species with 0.4% median absolute difference between observed and expected proportions on sequence data from various sources including sausage meat, plants, and fish. FASER was applied to 31 high protein powder raw factory ingredient total RNA samples. The samples mostly contained the expected source ingredient, chicken, while three samples unexpectedly contained pork and beef. Our results demonstrate that DNA/RNA sequencing of food ingredients, combined with a robust analysis, can be used to find contaminants and authenticate food ingredients in a single assay.
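The accuracy figure quoted in this abstract, a median absolute difference between observed and expected proportions, can be illustrated with a short Python sketch. The abstract does not say how the median was taken (per species, per sample, or across datasets), so the per-species median below is an assumption, and the function name and the mixture proportions are invented for the example.

```python
from statistics import median


def median_absolute_difference(expected: dict, observed: dict) -> float:
    """Median of |observed - expected| over every species in either profile.

    Species missing from one profile are treated as having proportion 0.
    """
    species = set(expected) | set(observed)
    return median(
        abs(observed.get(s, 0.0) - expected.get(s, 0.0)) for s in species
    )


if __name__ == "__main__":
    # Hypothetical mixture, not data from the paper.
    expected = {"chicken": 0.50, "pork": 0.30, "beef": 0.20}
    observed = {"chicken": 0.52, "pork": 0.29, "beef": 0.19}
    print(median_absolute_difference(expected, observed))
```

Using the median rather than the mean keeps one badly estimated species from dominating the summary.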


Citations (11)


... Pathways related to the activation of B and T cells. Intracellular antigens generate short peptides in the presence of proteases, which are presented to T-cell receptors (TCRs) on T cells by the MHC. Antigenic peptides on MHC-I molecules were recognized by CD8+ T cells, whereas peptides on MHC-II molecules were recognized by CD4+ T cells (Agarwal et al., 2022). In addition, CD28/CTLA4 on T cells binds to the costimulatory molecules CD86 and CD80 to activate T cells. ...

Reference:

Lactiplantibacillus plantarum YY‐112 ameliorates mouse immunosuppression by enhancing B/T‐cell activation and maintaining Th1/Th2 homeostasis
Predicting Epitope Candidates for SARS-CoV-2

... Coronaviridae epitope data were retrieved from the Immune Epitope Database and Analysis Resource (IEDB) on 20 July 2020 and again in April 2021 [44]. The IBM Research Functional Genomics Platform (FGP) [45] with semi-supervised SARS-CoV-2 genome annotation method [46] was used to identify and retrieve the protein sequences, domain sequences, and genome accessions from January 2020 to April 2021. This includes ancestral lineage as well as sampled genomes spanning eight variants of concern and of interest, as described by Beck et al. [46]. ...

Semi-Supervised Pipeline for Autonomous Annotation of SARS-CoV-2 Genomes

... RNA viruses have extremely high mutation rates and new variants emerge, either due to genome recombination/reassortment, selection, or the accumulation of point mutations due to the highly error-prone RNA-dependent RNA polymerase (RdRp) 14 . As the genotypic distribution of the virus shifts as a result of RNA evolution, PCR primers can lose sensitivity, which is reported in human viruses such as SARS-COV-2 15 . A multi-year meta-transcriptomic survey of over 2000 viromes from China during 2016-2019 identified 23 novel viruses from both honey bees and mites 16 , demonstrating one of the many benefits of conducting meta-virome studies. ...

Analysis and forecasting of global real time RT-PCR primers and probes for SARS-CoV-2

... Bacterial cells were enzymatically lysed according to the protocol used by the 100K pathogen project [24], and then RNA was isolated using Trizol LS (Ambion, Austin, TX, USA) according to manufacturer instructions. RNA sequencing libraries were prepared as described previously [25][26][27], with RNA purity and integrity confirmed using TapeStation The same method was applied to all three swabs ( Figure 2). In brief, the oral mucosa lateral to the palatoglossal folds was swabbed using a cytobrush (FLOQSwabs, Coplan, Italy, EU). ...

Monitoring the microbiome for food safety and quality using deep shotgun sequencing

npj Science of Food

... Metagenomics is a powerful tool for characterizing microbial communities, and the translation of "omics" technologies like this to food microbiology will have a significant impact in the food industry and for public health (31,32). The applications of this technology extend far beyond just public health, they can also provide valuable insights about food quality, and there is evidence that the microbiome is likely an important and effective hazard indicator within the food supply chain (33). ...

Monitoring the microbiome for food safety and quality using deep shotgun sequencing

... Beyond this, milk is used as an ingredient to make a variety of products and other foods, with raw milk quality having considerable impacts on finished product quality, safety, and production efficiency. Other studies have aimed to characterize the microbiome of food ingredients in production settings, for example, in high protein powders (5,6), produce (7,8), and fermented foods (9)(10)(11)(12). These studies are useful in demonstrating the potential that metagenomics and metatranscriptomics have in advancing food safety and quality for targeted assessments as well as for improving sensitivity for regular surveillance. ...

Food authentication from shotgun sequencing reads with an application on high protein powders

npj Science of Food

... characteristics between the NAND-flash memory and dynamic random-access memory (DRAM). [1,2] Two types of 3D cross-point memories have been properly documented, including the storage-mapped memory using phase change random-access memory (PCRAM) [3][4][5] or resistive random-access memory (ReRAM) [6][7][8][9][10] and memory-mapped memory using p-spin torque transfer random-access memory (p-STT-MRAM). [11,12] In addition, ReRAM or conductive bridge random-access memory (CBRAM)-based neurons and synapses [13][14][15][16][17][18][19][20][21] have been extensively studied for artificial neural networks in contrast with the complementary metal oxide semiconductor field effect transistor-based neurons and synapses that have a limited ability to achieve a higher neural density. ...

Evaluation of intel 3D-xpoint NVDIMM technology for memory-intensive genomic workloads
  • Citing Conference Paper
  • September 2019

... Currently, food safety regulatory agencies including the Food and Drug Administration (FDA), Centers for Disease Control and Prevention (CDC), United States Department of Agriculture (USDA), and European Food Safety Authority (EFSA) are converging on the use of WGS for pathogen detection and outbreak investigation. Large scale WGS of food-associated bacteria was first initiated via the 100 K Pathogen Genome Project 9 with the goal of expanding the diversity of bacterial reference genomes-a crucial need for foodborne illness outbreak investigation, traceability, and microbiome studies 10,11 . However, since WGS relies on culturing a microbial isolate prior to sequencing, there are inherent biases and limitations in its ability to describe the microorganisms and their interactions in a food sample. ...

Insular Microbiogeography: Three Pathogens as Exemplars

Current Issues in Molecular Biology

... Sequences were assembled using Shovil (v1.0.4) (83), checked for quality, size (4.5-6.5Mbp genome), completeness (>95% estimate), and contamination (<10% estimate) using CheckM (84), and assessed for approximate genera and species and further identity test for possible contamination using Kraken (85)(86)(87)(88)(89)(90). Sixteen sequences that did not meet quality criteria were removed from downstream analysis. ...

Insular Microbiogeography: Three Pathogens as Exemplars

... Another recent example, in a bacterial setting, was the cholerae outbreak in Haiti wherein the phylogenetic analysis resolved the origin of the pathogen 27 . However, for this analysis to succeed, a substantial genome sequence database, of isolates collected across time and geographic location, was needed to enable placement in a phylogenetic context 28,29 . As outbreaks are bound to happen in the future, investment in cataloguing the genomic space of pathogens is even more important than previously appreciated so that populations of appropriate size can be examined as systematically examined in bacteria 30,31 . ...

Exploiting Functional Context in Biology: Reconsidering Classification of Bacterial Life