Ruriko Yoshida

Naval Postgraduate School | NPS · Department of Operations Research

PhD in Mathematics

About

Publications: 137
Reads: 17,930
Citations: 1,642
Citations since 2017
47 Research Items
780 Citations
Introduction
Data Science using tropical geometry and its applications to phylogenetics and phylogenomics.
Additional affiliations
September 2016 - present
Naval Postgraduate School
Position
  • Professor (Associate)
July 2006 - August 2016
University of Kentucky
Position
  • Professor (Associate)
August 2004 - June 2006
Duke University
Position
  • Research Assistant
Education
September 2000 - June 2004
University of California, Davis
Field of study
  • Mathematics
August 1997 - May 2000
University of California, Berkeley
Field of study
  • Mathematics

Publications (137)
Article
Full-text available
Unweighted Pair Group Method with Arithmetic Mean (UPGMA) is one of the most popular distance-based methods to reconstruct an equidistant phylogenetic tree from a distance matrix computed from an alignment of sequences. Since we use equidistant trees as gene trees for phylogenomic analyses under the multi-species coalescent model and since an input...
Article
Background: The detailed complexity of the triceps brachii insertional footprint continues to challenge surgeons, as evidenced by continued reports of triceps-associated complications following elbow procedures. The purpose of this study is to describe the three-dimensional footprint of the triceps brachii at its olecranon insertion at the elbow. Methods...
Article
Support Vector Machines (SVMs) are one of the most popular supervised learning models for classification using a hyperplane in a Euclidean space. Similar to SVMs, tropical SVMs classify data points using a tropical hyperplane under the tropical metric with the max-plus algebra. In this paper, first we show generalization error bounds of tropical SVMs ove...
Preprint
Full-text available
In this paper we propose Hit and Run (HAR) sampling from a tropically convex set. The key ingredient of HAR sampling from a tropically convex set is sampling uniformly from a tropical line segment over the tropical projective torus, which runs linearly in its computational time complexity. We show that this HAR sampling method samples uniformly fro...
Article
Full-text available
We study the behavior of phylogenetic tree shapes in the tropical geometric interpretation of tree space. Tree shapes are formally referred to as tree topologies; a tree topology can also be thought of as a tree combinatorial type, which is given by the tree’s branching configuration and leaf labeling. We use the tropical line segment as a framewor...
Preprint
Full-text available
In 2004, Speyer and Sturmfels showed that a space of phylogenetic trees with $m$ leaves is a tropical Grassmannian, which is a tropicalization of the set of all solutions for a system of certain linear equations under the max-plus arithmetic. In this research we apply the "tropical metric," a well-defined metric over the space of phylogenetic trees...
Preprint
Full-text available
We consider an $I \times J\times K$ table with cell counts $X_{ijk} \geq 0$ for $i = 1, \ldots , I$, $j = 1, \ldots , J$ and $k = 1, \ldots , K$ under the no-three-way interaction model. In this paper, we propose a Markov Chain Monte Carlo (MCMC) scheme connecting the set of all contingency tables by all basic moves of $2 \times 2 \times 2$ minors...
Article
Full-text available
During 2020 and 2021, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) transmission has been increasing among the world’s population at an alarming rate. Reducing the spread of SARS-CoV-2 and other diseases that are spread in similar manners is paramount for public health officials as they seek to effectively manage resources and potent...
Article
Full-text available
Uncrewed autonomous vehicles (UAVs) have made significant contributions to reconnaissance and surveillance missions in past US military campaigns. As the prevalence of UAVs increases, there have also been improvements in counter-UAV technology that make it difficult for them to successfully obtain valuable intelligence within an area of interest. H...
Article
Tropical geometry with the max-plus algebra has been applied to statistical learning models over tree spaces because geometry with the tropical metric over tree spaces has some nice properties such as convexity in terms of the tropical metric. One of the challenges in applications of tropical geometry to tree spaces is the difficulty of interpreting o...
Chapter
Phylogenomics is a new field which applies tools in phylogenetics to genome data. Due to new technology and an increasing amount of data, we face new challenges in analyzing them over a space of phylogenetic trees. Because a space of phylogenetic trees with a fixed set of labels on leaves is not Euclidean, we cannot simply apply tools in data scien...
Preprint
Full-text available
In this research, we investigate a tropical principal component analysis (PCA) as a best-fit Stiefel tropical linear space to a given sample over the tropical projective torus for its dimensionality reduction and visualization. In particular, we characterize the best-fit Stiefel tropical linear space to a sample generated from a mixture of Gaussian di...
Preprint
Full-text available
Uncrewed autonomous vehicles (UAVs) have made significant contributions to reconnaissance and surveillance missions in past US military campaigns. As the prevalence of UAVs increases, there have also been improvements in counter-UAV technology that make it difficult for them to successfully obtain valuable intelligence within an area of interest. H...
Preprint
Full-text available
During 2020 and 2021, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) transmission has been increasing amongst the world's population at an alarming rate. Reducing the spread of SARS-CoV-2 and other diseases that are spread in similar manners is paramount for public health officials as they seek to effectively manage resources and pote...
Chapter
Demand for effective methods of analyzing networks has emerged with the growth of accessible data, particularly for incomplete networks. Even as means for data collection advance, incomplete information remains a reality for numerous reasons. Data can be obscured by excessive noise. Surveys for information typically contain some non-respondents. In...
Book
Full-text available
This book developed from the need to teach a linear algebra course to students focused on data science and bioinformatics programs. These students tend not to realize the importance of linear algebra in applied sciences, since traditional linear algebra courses tend to cover mathematical contexts but not the computational aspect of linear algebra o...
Article
In the fall of 2009 and in the spring of 2012, supported by the National Institute of General Medical Sciences (NIGMS) of the National Institutes of Health (NIH), we designed a course on “Phylogenetic Analysis and Molecular Evolution” (PAME), the first cross-listed course among three different colleges (College of Arts and Sciences, College of Eng...
Preprint
Full-text available
Tropical geometry with the max-plus algebra has been applied to statistical learning models over tree spaces because geometry with the tropical metric over tree spaces has some nice properties. One of the challenges in applications of tropical geometry to tree spaces is the difficulty of interpreting outcomes of statistical models with the tropical m...
Article
Full-text available
A tropical ball is a ball defined by the tropical metric over the tropical projective torus. In this paper we show several properties of tropical balls over the tropical projective torus and also over the space of phylogenetic trees with a given set of leaf labels. Then we discuss their application to the K-nearest neighbors (KNN) algorithm, a superv...
Preprint
Full-text available
Support Vector Machines (SVMs) are one of the most popular supervised learning models for classification using a hyperplane in a Euclidean space. Similar to SVMs, tropical SVMs classify data points using a tropical hyperplane under the tropical metric with the max-plus algebra. In this paper, first we show generalization error bounds of tropical SVMs ove...
Preprint
Full-text available
We study the behavior of phylogenetic tree shapes in the tropical geometric interpretation of tree space. Tree shapes are formally referred to as tree topologies; a tree topology can also be thought of as a tree combinatorial type, which is given by the tree's branching configuration and leaf labeling. We use the tropical line segment as a framewor...
Article
Motivation: Due to new technology for efficiently generating genome data, machine learning methods are urgently needed to analyze large sets of gene trees over the space of phylogenetic trees. However, the space of phylogenetic trees is not Euclidean, so ordinary machine learning methods cannot be directly applied. In 2019, Yoshida et al. introduc...
Preprint
Full-text available
Phylogenomics is a new field which applies tools in phylogenetics to genome data. Due to new technology and an increasing amount of data, we face new challenges in analyzing them over a space of phylogenetic trees. Because a space of phylogenetic trees with a fixed set of labels on leaves is not Euclidean, we cannot simply apply tools in data scien...
Preprint
Full-text available
Most data in genome-wide phylogenetic analysis (phylogenomics) is essentially multidimensional, posing a major challenge to human comprehension and computational analysis. Also, we cannot directly apply statistical learning models in data science to a set of phylogenetic trees since the space of phylogenetic trees is not Euclidean. In fact, the spa...
Preprint
Full-text available
In 2019, Yoshida et al. introduced a notion of tropical principal component analysis (PCA). The output is a tropical polytope with a fixed number of vertices that best fits the data. We here apply tropical PCA to dimension reduction and visualization of data sampled from the space of phylogenetic trees. Our main results are twofold: the existence o...
Article
Full-text available
Theory and empirical evidence clearly indicate that phylogenies (trees) of different genes (loci) should not display precisely matched topologies. The main reason for such phylogenetic incongruence is the reticulated evolutionary history of most species due to meiotic sexual recombination in eukaryotes, or horizontal transfers of genetic materials in pr...
Article
Full-text available
Stephen Fienberg (1942-2016) was a statistician whose career has been an inspiration for the engagement of statistics with social and scientific issues, and it is in this spirit that he helped steer algebraic statistics toward more of a mainstream. Many of his favorite topics in the area are covered in this special issue. We are grateful to all aut...
Chapter
Logistic regression is one of the most popular models to classify in data science, and in general, it is easy to use. However, in order to conduct a goodness-of-fit test, we cannot apply asymptotic methods if we have sparse datasets. In that case, we have to conduct an exact conditional inference via a sampler, such as Markov Chain Monte Carlo (MCMC...
Article
Given a set of organisms, the available corresponding genetic information is often incomplete and most gene trees fail to contain all individuals. This incompleteness causes difficulties in data collection, information extraction, and gene tree inference. Outlying gene trees may represent horizontal gene transfers, gene duplications, hybridizations...
Article
Full-text available
Evolutionary hypotheses provide important underpinnings of biological and medical sciences, and comprehensive, genome-wide understanding of evolutionary relationships among organisms is needed to test and refine such hypotheses. Theory and empirical evidence clearly indicate that phylogenies (trees) of different genes (loci) should not display pre...
Chapter
Principal component analysis is one of the most popular unsupervised learning methods for reducing the dimension of a given data set in a high-dimensional Euclidean space. However, computing principal components on a space of phylogenetic trees with fixed labels of leaves is a challenging task since a space of phylogenetic trees is not Euclidean. In...
Preprint
Full-text available
We introduce a novel framework for the statistical analysis of phylogenetic trees: Palm tree space is constructed on principles of tropical algebraic geometry, and represents phylogenetic trees as points in a space endowed with the tropical metric. We show that palm tree space possesses a variety of properties that allow for the definition of prob...
Article
Full-text available
Principal component analysis is a widely used method for the dimensionality reduction of a given data set in a high-dimensional Euclidean space. Here we define and analyze two analogues of principal component analysis in the setting of tropical geometry. In one approach, we study the Stiefel tropical linear space of fixed dimension closest to the d...
Article
Exact conditional goodness-of-fit tests for discrete exponential family models can be conducted via Monte Carlo estimation of p values by sampling from the conditional distribution of multiway contingency tables. The two most popular methods for such sampling are Markov chain Monte Carlo (MCMC) and sequential importance sampling (SIS). In this work...
Article
Phylogenetic trees are mathematical objects which summarize the most recent common ancestor relationships between a given set of organisms. There is often a need to quantify the degree of similarity or discordance between two proposed trees. For instance, a person may be interested in knowing whether the phylogenetic trees reconstructed from two di...
Article
Full-text available
Most biological data are multidimensional, posing a major challenge to human comprehension and computational analysis. Principal component analysis is the most popular approach to rendering two- or three-dimensional representations of the major trends in such multidimensional data. The problem of multidimensionality is acute in the rapidly growing...
Article
Full-text available
The question of whether there exists an integral solution to the system of linear equations with non-negative constraints, $A\mathbf{x} = \mathbf{b}, \, \mathbf{x} \ge 0$, where $A \in \mathbb{Z}^{m\times n}$ and $\mathbf{b} \in \mathbb{Z}^m$, has applications in many areas, such as operations research, number theory and statistics. In order to solve this problem, we have to understan...
Article
For a graph G with p vertices the closed convex cone consists of all real positive semidefinite matrices whose sparsity pattern is given by G, that is, those matrices with zeros in the off-diagonal entries corresponding to nonedges of G. The extremal rays of this cone and their associated ranks have applications to matrix completion problems, maxim...
Article
Full-text available
We investigate the computation of Fermat-Weber points under the tropical metric, motivated by its application to the space of equidistant phylogenetic trees realized as the tropical linear space of all ultrametrics. While the Fréchet mean with the ${\rm CAT}(0)$-metric of Billera-Holmes-Vogtman has been studied by many authors, the Fermat-Weber p...
Article
At the present time it is often stated that the maximum likelihood (ML) or the Bayesian method of phylogenetic construction is more accurate than the neighbor joining (NJ) method. Our computer simulations, however, have shown that the converse is true if we use p distance in the NJ procedure and the criterion of obtaining the true tree (Pc expresse...
Chapter
In this chapter, we outline the basics of phylogenetic tree reconstruction methods and related computational aspects. We use the software package R to demonstrate computational hands-on examples. One of the great opportunities offered by modern genomics is that phylogenetics applied on a genomic scale (phylogenomics) should be especially powerful f...
Article
Full-text available
We study the geometry of metrics and convexity structures on the space of phylogenetic trees, here realized as the tropical linear space of all ultrametrics. The CAT(0)-metric of Billera-Holmes-Vogtman arises from the theory of orthant spaces. While its geodesics can be computed by the Owen-Provan algorithm, geodesic triangles are complicated and c...
Article
Full-text available
A distance-based method to reconstruct a phylogenetic tree with $n$ leaves takes a distance matrix, an $n \times n$ symmetric matrix with $0$s on the diagonal, as its input and reconstructs a tree with $n$ leaves using tools in combinatorics. A safety radius is a radius from a tree metric (a distance matrix realizing a true tree) within which the inpu...
Article
Full-text available
For a graph $G$ with $p$ vertices the cone of concentration matrices consists of all real positive semidefinite $p\times p$ matrices with zeros in the off-diagonal entries corresponding to nonedges of $G$. The extremal rays of this cone and their associated ranks have applications to matrix completion problems, maximum likelihood estimation in Gaus...
Article
Full-text available
As costs of genome sequencing have dropped precipitously, development of efficient bioinformatic methods to analyze genome structure and evolution has become ever more urgent. For example, most published phylogenomic studies involve either massive concatenation of sequences, or informal comparisons of phylogenies inferred on a small subset of orth...
Article
Full-text available
In order to conduct a statistical analysis on a given set of phylogenetic gene trees, we often use a distance measure between two trees. In a statistical distance-based method to analyze discordance between gene trees, it is key to choose a "biologically meaningful" and "statistically well-distributed" distance between trees. Thus, in this paper, we...
Article
Full-text available
While the majority of gene histories found in a clade of organisms are expected to be generated by a common process (e.g. the coalescent process), it is well-known that numerous other coexisting processes (e.g. horizontal gene transfers, gene duplication and subsequent neofunctionalization) will cause some genes to exhibit a history quite distinct...
Article
Full-text available
We provide an explicit combinatorial formula for the volume of the polytope of n×n doubly-stochastic matrices, also known as the Birkhoff polytope. We do this through the description of a generating function for all the lattice points of the closely related polytope of n×n real non-negative matrices with all row and column sums equal to an integer...
Article
While the theoretical foundation of the optimal camera placement problem has been studied for decades, its practical implementation has recently attracted significant research interest due to the increasing popularity of visual sensor networks. The most flexible formulation of finding the optimal camera placement is based on a binary integer progra...
Data
Physical mapping of EAS genes in Epichloë festucae E2368. Genomic DNA was digested with SfiI or NotI as indicated under each panel, separated by clamped homogeneous electric field (CHEF) electrophoresis, and blotted onto nylon filters. The filters were cut into strips, which were probed with labeled segments of the genes indicated above each lane....
Data
Origins of isolates for which genomes were sequenced or survey-sequenced in this study. (DOCX)
Data
Genome and sequence accession numbers. All data are in GenBank, except the Claviceps purpurea 20.1 assembly (76493), which is in the EMBL database. (DOCX)
Data
Summary of epichloae transposable elements identified within repeat regions. Abbreviations: Eam = Epichloë amarillans, Ebe = E. brachyelytri, Efe = E. festucae, Egl = E. glyceriae, Ety = E. typhina, Nga = Neotyphodium gansuense, Ngi = N. gansuense var. inebrians, retro-Tn = retrotransposon, Tn = transposon. (DOCX)
Data
Full-text available
Patch for OrthoMCL. (PDF)
Data
Synteny relationships between genes flanking non-telomeric alkaloid loci in Clavicipitaceae and orthologs in Fusarium graminearum. (A) Comparison of the regions flanking EAS in C. purpurea 20.1 with orthologous genes in E. festucae Fl1 and F. graminearum PH-1. (B) Comparison of the regions flanking IDT in C. purpurea 20.1 with orthologous genes in...
Data
Shimodaira-Hasegawa test results. Tree 1 is the maximum likelihood estimate (MLE) tree obtained from the data. Δln L represents the difference between the MLE and likelihood value of Tree 2 under the model with the given data. The p-values are for the null hypothesis that Tree 1 and Tree 2 are equally good explanations of the data for Tree 1. (DOCX)
Data
Phylogenies of housekeeping genes from sequenced isolates and other Clavicipitaceae. (A) Phylogenetic tree based on nucleotide alignment for a portion of the RNA polymerase II second-largest subunit gene, rpbB. (B) Phylogenetic tree based on nucleotide alignment for a portion of the translation elongation factor 1-α gene, tefA. Trees are rooted wit...
Data
Comparison of indole-diterpene synthesis (IDT/LTM) gene clusters in genomes of plant-associated Clavicipitaceae. Genes for synthesis of the skeleton compound, paspaline, are shown in blue, and genes for subsequent chemical decorations are shown in red. The function of idtS/ltmS (purple) is unknown. Identifiable genes flanking the clusters are indic...
Data
RIP-indices indicating repeat-induced point mutations in and near alkaloid loci. (A) EAS loci from E. festucae Fl1 and E2368. Gene names are abbreviated A through H for easA through easH, W for dmaW, and clo for cloA. (B) IDT/LTM loci from E. festucae Fl1 and E2368. Gene names are abbreviated B through Q for ltmB through ltmQ. (C) LOL locus and adj...
Data
Secondary metabolism gene clusters in assembled C. purpurea and E. festucae genomes. (DOCX)
Article
Full-text available
The fungal family Clavicipitaceae includes plant symbionts and parasites that produce several psychoactive and bioprotective alkaloids. The family includes grass symbionts in the epichloae clade (Epichloë and Neotyphodium species), which are extraordinarily diverse both in their host interactions and in their alkaloid profiles. Epichloae produce alkaloids of four disti...
Article
Full-text available
On March 11, 2011, Japan was struck by the Great East Japan earthquake followed by a 23-foot tsunami, which crippled the Fukushima Daiichi nuclear plant. Because of a lack of plans for livestock evacuation in the case of a nuclear power plant accident, local farmers in the Fukushima exclusion zone had significant losses. Development of a rigorous a...
Article
Full-text available
Background: The increased use of multi-locus data sets for phylogenetic reconstruction has increased the need to determine whether a set of gene trees significantly deviates from the phylogenetic patterns of other genes. Such unusual gene trees may have been influenced by other evolutionary processes such as selection, gene duplication, or horizontal...
Data
Full-text available
MrBayes parameters. All Bayesian analyses were run using MrBayes. Two independent runs were performed for each data set, each using four Markov chains and the default temperature parameter setting of 0.2. 100,000 generations were run with a sample drawn every 100 generations and 25% of the samples treated as burn-in. The minimum, first quartile, med...