# Ruriko Yoshida · Naval Postgraduate School | NPS · Department of Operations Research


PhD in Mathematics

137 publications · 17,930 reads · 1,642 citations


Data Science using tropical geometry and its applications to phylogenetics and phylogenomics.

Additional affiliations

September 2016 - present

July 2006 - August 2016

August 2004 - June 2006

September 2000 - June 2004

August 1997 - May 2000

Unweighted Pair Group Method with Arithmetic Mean (UPGMA) is one of the most popular distance-based methods to reconstruct an equidistant phylogenetic tree from a distance matrix computed from an alignment of sequences. Since we use equidistant trees as gene trees for phylogenomic analyses under the multi-species coalescent model and since an input...
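
The UPGMA procedure named in this abstract can be illustrated with a minimal sketch. This is the textbook algorithm, not necessarily the variant analyzed in the paper; the function name `upgma`, the `frozenset`-keyed distance dictionary, and the four-leaf example are assumptions made for illustration.

```python
from itertools import combinations


def upgma(dist, labels):
    """Textbook UPGMA on string leaf labels.

    dist maps frozenset({a, b}) -> distance between leaves a and b.
    Returns the merges in order as (cluster_a, cluster_b, node_height).
    """
    clusters = {lab: 1 for lab in labels}  # cluster name -> number of leaves
    d = dict(dist)                         # working copy, extended as clusters merge
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))] / 2.0  # equidistant tree: node sits at half the distance
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        new = "(" + a + "," + b + ")"
        # Distance to every remaining cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((new, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[new] = size_a + size_b
        merges.append((a, b, height))
    return merges


# Hypothetical ultrametric distance matrix on four leaves.
example_dist = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
    frozenset(("A", "D")): 10.0,
    frozenset(("B", "D")): 10.0,
    frozenset(("C", "D")): 10.0,
}
merges = upgma(example_dist, ["A", "B", "C", "D"])
for step in merges:
    print(step)
```

On this input the sketch joins A and B first, then C, then D, with node heights 1.0, 3.0 and 5.0.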

Background
The detailed complexity of triceps brachii insertional footprint continues to challenge surgeons as evidenced by continued reports of triceps-associated complications following elbow procedures. The purpose of this study is to describe the three-dimensional footprint of the triceps brachii at its olecranon insertion at the elbow.
Methods...

Support Vector Machines (SVMs) are one of the most popular supervised learning models for classification using a hyperplane in a Euclidean space. Similar to SVMs, tropical SVMs classify data points using a tropical hyperplane under the tropical metric with the max-plus algebra. In this paper, first we show generalization error bounds of tropical SVMs ove...

In this paper we propose Hit and Run (HAR) sampling from a tropically convex set. The key ingredient of HAR sampling from a tropically convex set is sampling uniformly from a tropical line segment over the tropical projective torus, which runs linearly in its computational time complexity. We show that this HAR sampling method samples uniformly fro...
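
The tropical line segment that this abstract relies on can be sketched as follows. Under the max-plus algebra, the segment between $u$ and $v$ is commonly written as the points $\max(\lambda + u, v)$ taken coordinate-wise, with $\lambda$ ranging from $\min_i (v_i - u_i)$ to $\max_i (v_i - u_i)$. The code only enumerates points on the segment; it is not the paper's HAR sampler and makes no claim about uniformity. All function names are hypothetical.

```python
def tropical_segment_point(u, v, lam):
    """Coordinate-wise max(lam + u_i, v_i): one point on the max-plus segment."""
    return [max(lam + a, b) for a, b in zip(u, v)]


def tropical_segment(u, v, steps=5):
    """Return `steps` points running from v (smallest lam) to u (largest lam)."""
    if len(u) != len(v):
        raise ValueError("points must have the same number of coordinates")
    if steps < 2:
        raise ValueError("steps must be at least 2")
    diffs = [b - a for a, b in zip(u, v)]
    lo, hi = min(diffs), max(diffs)
    lams = [lo + (hi - lo) * t / (steps - 1) for t in range(steps)]
    return [tropical_segment_point(u, v, lam) for lam in lams]


def normalize(x):
    """Representative in the tropical projective torus with first coordinate 0."""
    return [xi - x[0] for xi in x]


u = [0.0, 0.0, 0.0]
v = [0.0, 3.0, 1.0]
points = [normalize(p) for p in tropical_segment(u, v, steps=4)]
for p in points:
    print(p)
```

Normalizing by the first coordinate reflects that points in the tropical projective torus are only defined up to adding a constant to every coordinate.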

We study the behavior of phylogenetic tree shapes in the tropical geometric interpretation of tree space. Tree shapes are formally referred to as tree topologies; a tree topology can also be thought of as a tree combinatorial type, which is given by the tree’s branching configuration and leaf labeling. We use the tropical line segment as a framewor...

In 2004, Speyer and Sturmfels showed that a space of phylogenetic trees with $m$ leaves is a tropical Grassmannian, which is a tropicalization of the set of all solutions for a system of certain linear equations under the max-plus arithmetic. In this research we apply the "tropical metric," a well-defined metric over the space of phylogenetic trees...
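
The tropical metric referred to here is usually defined on the tropical projective torus as $d_{\rm tr}(u, v) = \max_i (u_i - v_i) - \min_i (u_i - v_i)$. A minimal sketch, assuming points are given as equal-length lists of numbers (the function name is hypothetical):

```python
def tropical_distance(u, v):
    """Tropical metric: max_i(u_i - v_i) - min_i(u_i - v_i)."""
    if len(u) != len(v):
        raise ValueError("points must have the same number of coordinates")
    diffs = [a - b for a, b in zip(u, v)]
    return max(diffs) - min(diffs)


u = [0.0, 2.0, 5.0]
v = [0.0, 3.0, 1.0]
print(tropical_distance(u, v))

# Adding a constant to every coordinate of one point leaves the distance unchanged.
shifted = [x + 10.0 for x in u]
print(tropical_distance(shifted, v))
```

The second call illustrates why the metric is well defined on the projective torus: a point and any of its translates by a constant vector are at distance zero from each other.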

We consider an $I \times J\times K$ table with cell counts $X_{ijk} \geq 0$ for $i = 1, \ldots , I$, $j = 1, \ldots , J$ and $k = 1, \ldots , K$ under the no-three-way interaction model. In this paper, we propose a Markov Chain Monte Carlo (MCMC) scheme connecting the set of all contingency tables by all basic moves of $2 \times 2 \times 2$ minors...
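
A basic move of a $2 \times 2 \times 2$ minor adds $+1$ and $-1$ in an alternating pattern to eight cells, which leaves every two-way margin of the table unchanged. The sketch below shows one such move on a table stored as nested lists. Rejecting moves that create negative cells is a simplifying assumption made here and is not necessarily the scheme proposed in the paper; the function names are hypothetical.

```python
import copy


def apply_basic_move(table, i_pair, j_pair, k_pair, sign=1):
    """Add sign * (-1)^(a+b+c) to cell (i_pair[a], j_pair[b], k_pair[c]).

    Returns the new table, or None if any touched cell would become negative.
    """
    new = copy.deepcopy(table)
    for a, i in enumerate(i_pair):
        for b, j in enumerate(j_pair):
            for c, k in enumerate(k_pair):
                new[i][j][k] += sign * (-1) ** (a + b + c)
    if any(new[i][j][k] < 0 for i in i_pair for j in j_pair for k in k_pair):
        return None
    return new


def two_way_margins(table):
    """Return the (i,j), (i,k) and (j,k) margins of a three-way table."""
    I, J, K = len(table), len(table[0]), len(table[0][0])
    ij = [[sum(table[i][j]) for j in range(J)] for i in range(I)]
    ik = [[sum(table[i][j][k] for j in range(J)) for k in range(K)] for i in range(I)]
    jk = [[sum(table[i][j][k] for i in range(I)) for k in range(K)] for j in range(J)]
    return ij, ik, jk


start = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]
moved = apply_basic_move(start, (0, 1), (0, 1), (0, 1))
print(moved)
print(two_way_margins(start) == two_way_margins(moved))
```

Applying the same move a second time to `moved` would push some cells below zero, so the sketch returns `None` in that case.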

During 2020 and 2021, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) transmission has been increasing among the world’s population at an alarming rate. Reducing the spread of SARS-CoV-2 and other diseases that are spread in similar manners is paramount for public health officials as they seek to effectively manage resources and potent...

Tropical geometry with the max-plus algebra has been applied to statistical learning models over tree spaces because geometry with the tropical metric over tree spaces has some nice properties such as convexity in terms of the tropical metric. One of the challenges in applications of tropical geometry to tree spaces is the difficulty interpreting o...

Phylogenomics is a new field which applies tools in phylogenetics to genome data. Due to new technology and an increasing amount of data, we face new challenges in analyzing them over a space of phylogenetic trees. Because a space of phylogenetic trees with a fixed set of labels on leaves is not Euclidean, we cannot simply apply tools in data scien...

In this research, we investigate a tropical principal component analysis (PCA) as a best-fit Stiefel tropical linear space to a given sample over the tropical projective torus for its dimensionality reduction and visualization. Especially, we characterize the best-fit Stiefel tropical linear space to a sample generated from a mixture of Gaussian di...

Uncrewed autonomous vehicles (UAVs) have made significant contributions to reconnaissance and surveillance missions in past US military campaigns. As the prevalence of UAVs increases, there have also been improvements in counter-UAV technology that make it difficult for them to successfully obtain valuable intelligence within an area of interest. H...

Demand for effective methods of analyzing networks has emerged with the growth of accessible data, particularly for incomplete networks. Even as means for data collection advance, incomplete information remains a reality for numerous reasons. Data can be obscured by excessive noise. Surveys for information typically contain some non-respondents. In...

This book developed from the need to teach a linear algebra course to students focused on data science and bioinformatics programs. These students tend not to realize the importance of linear algebra in applied sciences, since traditional linear algebra courses tend to cover mathematical contexts but not the computational aspect of linear algebra o...

In the fall of 2009 and in the spring of 2012, supported by the National Institute of General Medical Sciences (NIGMS) in the National Institutes of Health (NIH), we designed a course on “Phylogenetic Analysis and Molecular Evolution” (PAME), the first cross-listed course between three different colleges (College of Arts and Sciences, College of Eng...

Tropical geometry with the max-plus algebra has been applied to statistical learning models over tree spaces because geometry with the tropical metric over tree spaces has some nice properties. One of the challenges in applications of tropical geometry to tree spaces is the difficulty of interpreting outcomes of statistical models with the tropical m...

A tropical ball is a ball defined by the tropical metric over the tropical projective torus. In this paper we show several properties of tropical balls over the tropical projective torus and also over the space of phylogenetic trees with a given set of leaf labels. Then we discuss its application to the K nearest neighbors (KNN) algorithm, a superv...
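
As a rough illustration of KNN under the tropical metric, the sketch below classifies a query point by majority vote among its $k$ nearest training points, with distance $\max_i (u_i - v_i) - \min_i (u_i - v_i)$. It is a plain KNN with the metric swapped in, not the paper's method; the function names, the toy data, and the tie-breaking (first label encountered among the nearest neighbors) are assumptions.

```python
from collections import Counter


def tropical_distance(u, v):
    """Tropical metric: max_i(u_i - v_i) - min_i(u_i - v_i)."""
    diffs = [a - b for a, b in zip(u, v)]
    return max(diffs) - min(diffs)


def tropical_knn_predict(train_points, train_labels, query, k=3):
    """Majority label among the k training points closest to `query`."""
    order = sorted(
        range(len(train_points)),
        key=lambda idx: tropical_distance(train_points[idx], query),
    )
    votes = Counter(train_labels[idx] for idx in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two loose groups of points.
train_points = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 5.0, 5.0],
    [0.0, 6.0, 5.0],
]
train_labels = ["a", "a", "a", "b", "b"]

print(tropical_knn_predict(train_points, train_labels, [0.0, 5.0, 6.0], k=3))
print(tropical_knn_predict(train_points, train_labels, [0.0, 0.5, 0.0], k=3))
```

The set of training points within a given tropical distance of the query is exactly a tropical ball centered at the query, which is how the two notions in the abstract connect.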

Motivation:
Due to new technology for efficiently generating genome data, machine learning methods are urgently needed to analyze large sets of gene trees over the space of phylogenetic trees. However, the space of phylogenetic trees is not Euclidean, so ordinary machine learning methods cannot be directly applied. In 2019, Yoshida et al. introduc...

Most data in genome-wide phylogenetic analysis (phylogenomics) is essentially multidimensional, posing a major challenge to human comprehension and computational analysis. Also, we cannot directly apply statistical learning models in data science to a set of phylogenetic trees since the space of phylogenetic trees is not Euclidean. In fact, the spa...

In 2019, Yoshida et al. introduced a notion of tropical principal component analysis (PCA). The output is a tropical polytope with a fixed number of vertices that best fits the data. We here apply tropical PCA to dimension reduction and visualization of data sampled from the space of phylogenetic trees. Our main results are twofold: the existence o...

Theory and empirical evidence clearly indicate that phylogenies (trees) of different genes (loci) should not display precisely matched topologies. The main reason for such phylogenetic incongruence is reticulated evolutionary history of most species due to meiotic sexual recombination in eukaryotes, or horizontal transfers of genetic materials in pr...

Stephen Fienberg (1942-2016) was a statistician whose career has been an inspiration for the engagement of statistics with social and scientific issues, and it is in this spirit that he helped steer algebraic statistics toward more of a mainstream. Many of his favorite topics in the area are covered in this special issue. We are grateful to all aut...

Logistic regression is one of the most popular models for classification in data science, and in general, it is easy to use. However, in order to conduct a goodness-of-fit test, we cannot apply asymptotic methods if we have sparse datasets. In that case, we have to conduct an exact conditional inference via a sampler, such as Markov Chain Monte Carlo (MCMC...

Given a set of organisms, the available corresponding genetic information is often incomplete and most gene trees fail to contain all individuals. This incompleteness causes difficulties in data collection, information extraction, and gene tree inference. Outlying gene trees may represent horizontal gene transfers, gene duplications, hybridizations...

Evolutionary hypotheses provide important underpinnings of biological and medical sciences, and comprehensive, genome-wide understanding of evolutionary relationships among organisms is needed to test and refine such hypotheses. Theory and empirical evidence clearly indicate that phylogenies (trees) of different genes (loci) should not display pre...

Principal component analysis is one of the most popular unsupervised learning methods for reducing the dimension of a given data set in a high-dimensional Euclidean space. However, computing principal components on a space of phylogenetic trees with fixed labels of leaves is a challenging task since a space of phylogenetic trees is not Euclidean. In...

We introduce a novel framework for the statistical analysis of phylogenetic trees: Palm tree space is constructed on principles of tropical algebraic geometry, and represents phylogenetic trees as a point in a space endowed with the tropical metric. We show that palm tree space possesses a variety of properties that allow for the definition of prob...

Principal component analysis is a widely-used method for the dimensionality reduction of a given data set in a high-dimensional Euclidean space. Here we define and analyze two analogues of principal component analysis in the setting of tropical geometry. In one approach, we study the Stiefel tropical linear space of fixed dimension closest to the d...

Exact conditional goodness-of-fit tests for discrete exponential family models can be conducted via Monte Carlo estimation of p values by sampling from the conditional distribution of multiway contingency tables. The two most popular methods for such sampling are Markov chain Monte Carlo (MCMC) and sequential importance sampling (SIS). In this work...

Phylogenetic trees are mathematical objects which summarize the most recent common ancestor relationships between a given set of organisms. There is often a need to quantify the degree of similarity or discordance between two proposed trees. For instance, a person may be interested in knowing whether the phylogenetic trees reconstructed from two di...

Most biological data are multidimensional, posing a major challenge to human comprehension and computational analysis. Principal component analysis is the most popular approach to rendering two- or three-dimensional representations of the major trends in such multidimensional data. The problem of multidimensionality is acute in the rapidly growing...

The question of whether there exists an integral solution to the system of linear equations with non-negative constraints, $A\mathbf{x} = \mathbf{b}, \, \mathbf{x} \ge 0$, where $A \in \mathbb{Z}^{m\times n}$ and $\mathbf{b} \in \mathbb{Z}^m$, finds applications in many areas, such as operations research, number theory and statistics. In order to solve this problem, we have to understan...

For a graph G with p vertices the closed convex cone consists of all real positive semidefinite matrices whose sparsity pattern is given by G, that is, those matrices with zeros in the off-diagonal entries corresponding to nonedges of G. The extremal rays of this cone and their associated ranks have applications to matrix completion problems, maxim...

We investigate the computation of Fermat-Weber points under the tropical metric, motivated by its application to the space of equidistant phylogenetic trees realized as the tropical linear space of all ultrametrics. While the Fr\'echet mean with the ${\rm CAT}(0)$-metric of Billera-Holmes-Vogtman has been studied by many authors, the Fermat-Weber p...
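
For reference, a Fermat-Weber point of a sample $v_1, \ldots, v_n$ is usually defined as a minimizer of the sum of distances to the sample. Under the tropical metric this reads as follows, where $x^*$, $v_i$ and $d_{\rm tr}$ are notation assumed here rather than taken from the abstract:

$$
x^* \in \arg\min_{x} \sum_{i=1}^{n} d_{\rm tr}(x, v_i),
\qquad
d_{\rm tr}(x, v) = \max_{j} (x_j - v_j) - \min_{j} (x_j - v_j).
$$

The minimizer need not be unique, which is why $x^*$ is written as an element of the set of minimizers.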

At the present time it is often stated that the maximum likelihood (ML) or the Bayesian method of phylogenetic construction is more accurate than the neighbor joining (NJ) method. Our computer simulations, however, have shown that the converse is true if we use p distance in the NJ procedure and the criterion of obtaining the true tree (Pc expresse...

In this chapter, we outline the basics of phylogenetic tree reconstruction methods and related computational aspects. We use the software package R to demonstrate computational hands-on examples. One of the great opportunities offered by modern genomics is that phylogenetics applied on a genomic scale (phylogenomics) should be especially powerful f...

We study the geometry of metrics and convexity structures on the space of phylogenetic trees, here realized as the tropical linear space of all ultrametrics. The CAT(0)-metric of Billera-Holmes-Vogtman arises from the theory of orthant spaces. While its geodesics can be computed by the Owen-Provan algorithm, geodesic triangles are complicated and c...

A distance-based method to reconstruct a phylogenetic tree with $n$ leaves takes a distance matrix, an $n \times n$ symmetric matrix with $0$s on the diagonal, as its input and reconstructs a tree with $n$ leaves using tools in combinatorics. A safety radius is a radius from a tree metric (a distance matrix realizing a true tree) within which the inpu...

For a graph $G$ with $p$ vertices the cone of concentration matrices consists of all real positive semidefinite $p\times p$ matrices with zeros in the off-diagonal entries corresponding to nonedges of $G$. The extremal rays of this cone and their associated ranks have applications to matrix completion problems, maximum likelihood estimation in Gaus...

As costs of genome sequencing have dropped precipitously, development of efficient bioinformatic methods to analyze genome structure and evolution has become ever more urgent. For example, most published phylogenomic studies involve either massive concatenation of sequences, or informal comparisons of phylogenies inferred on a small subset of orth...

In order to conduct a statistical analysis on a given set of phylogenetic gene trees, we often use a distance measure between two trees. In a statistical distance-based method to analyze discordance between gene trees, it is key to choose a "biologically meaningful" and "statistically well-distributed" distance between trees. Thus, in this paper, we...

While the majority of gene histories found in a clade of organisms are expected to be generated by a common process (e.g. the coalescent process), it is well-known that numerous other coexisting processes (e.g. horizontal gene transfers, gene duplication and subsequent neofunctionalization) will cause some genes to exhibit a history quite distinct...

We provide an explicit combinatorial formula for the volume of the polytope of n×n doubly-stochastic matrices, also known as the Birkhoff polytope. We do this through the description of a generating function for all the lattice points of the closely related polytope of n×n real non-negative matrices with all row and column sums equal to an integer...

While the theoretical foundation of the optimal camera placement problem has been studied for decades, its practical implementation has recently attracted significant research interest due to the increasing popularity of visual sensor networks. The most flexible formulation of finding the optimal camera placement is based on a binary integer progra...

Physical mapping of EAS genes in Epichloë festucae E2368. Genomic DNA was digested with SfiI or NotI as indicated under each panel, separated by clamped homogeneous electric field (CHEF) electrophoresis, and blotted onto nylon filters. The filters were cut into strips, which were probed with labeled segments of the genes indicated above each lane....

Origins of isolates for which genomes were sequenced or survey-sequenced in this study.
Genome and sequence accession numbers. All data are in GenBank, except the Claviceps purpurea 20.1 assembly (76493), which is in the EMBL database.
Summary of epichloae transposable elements identified within repeat regions. Abbreviations: Eam = Epichloë amarillans, Ebe = E. brachyelytri, Efe = E. festucae, Egl = E. glyceriae, Ety = E. typhina, Nga = Neotyphodium gansuense, Ngi = N. gansuense var. inebrians, retro-Tn = retrotransposon, Tn = transposon.
RIP-index Perl script.
Patch for OrthoMCL.
Synteny relationships between genes flanking non-telomeric alkaloid loci in Clavicipitaceae and orthologs in Fusarium graminearum. (A) Comparison of the regions flanking EAS in C. purpurea 20.1 with orthologous genes in E. festucae Fl1 and F. graminearum PH-1. (B) Comparison of the regions flanking IDT in C. purpurea 20.1 with orthologous genes in...

Shimodaira-Hasegawa test results. Tree 1 is the maximum likelihood estimate (MLE) tree obtained from the data. Δln L represents the difference between the MLE and likelihood value of Tree 2 under the model with the given data. The p-values are for the null hypothesis that Tree 1 and Tree 2 are equally good explanations of the data for Tree 1.
(DOCX)

Phylogenies of housekeeping genes from sequenced isolates and other Clavicipitaceae. (A) Phylogenetic tree based on nucleotide alignment for a portion of the RNA polymerase II second-largest subunit gene, rpbB. (B) Phylogenetic tree based on nucleotide alignment for a portion of the translation elongation factor 1-α gene, tefA. Trees are rooted wit...

Comparison of indole-diterpene synthesis (IDT/LTM) gene clusters in genomes of plant-associated Clavicipitaceae. Genes for synthesis of the skeleton compound, paspaline, are shown in blue, and genes for subsequent chemical decorations are shown in red. The function of idtS/ltmS (purple) is unknown. Identifiable genes flanking the clusters are indic...

RIP-indices indicating repeat-induced point mutations in and near alkaloid loci. (A) EAS loci from E. festucae Fl1 and E2368. Gene names are abbreviated A through H for easA through easH, W for dmaW, and clo for cloA. (B) IDT/LTM loci from E. festucae Fl1 and E2368. Gene names are abbreviated B through Q for ltmB through ltmQ. (C) LOL locus and adj...

Secondary metabolism gene clusters in assembled C. purpurea and E. festucae genomes.
(DOCX)

The fungal family Clavicipitaceae includes plant symbionts and parasites that produce several psychoactive and bioprotective alkaloids. The family includes grass symbionts in the epichloae clade (Epichloë and Neotyphodium species), which are extraordinarily diverse both in their host interactions and in their alkaloid profiles. Epichloae produce alkaloids of four disti...

On March 11, 2011, Japan was struck by the Great East Japan earthquake followed by a 23-foot tsunami, which crippled the Fukushima Daiichi nuclear plant. Because of a lack of plans for livestock evacuation in the case of a nuclear power plant accident, local farmers in the Fukushima exclusion zone had significant losses. Development of a rigorous a...

Background
The increased use of multi-locus data sets for phylogenetic reconstruction has increased the need to determine whether a set of gene trees significantly deviate from the phylogenetic patterns of other genes. Such unusual gene trees may have been influenced by other evolutionary processes such as selection, gene duplication, or horizontal...

MrBayes parameters. All Bayesian analyses were run using MrBayes. Two independent runs were performed for each data set, each using four Markov chains and the default temperature parameter setting of 0.2. 100,000 generations were run with a sample drawn every 100 generations and 25% of the samples treated as burn-in. The minimum, first quartile, med...