
Yasuo Tabei
- Ph.D (Computer Science)
- Unit leader at RIKEN
Looking for a postdoc researcher for my team. Please contact me if you are interested.
About
88 Publications
10,097 Reads
1,254 Citations
Introduction
Current institution
Additional affiliations
April 2017 - present
October 2013 - March 2017
Education
October 2006 - October 2009
Publications (88)
Improvements in tracking technology through optical and computer vision systems have enabled a greater understanding of the movement-based behaviour of multiple agents, including in team sports. In this study, a Multi-Agent Statistically Discriminative Sub-Trajectory Mining (MA-Stat-DSM) method is proposed that takes a set of binary-labelled agent...
The compression of highly repetitive strings (i.e., strings with many repetitions) has been a central research topic in string processing, and quite a few compression methods for these strings have been proposed thus far. Among them, an efficient compression format gathering increasing attention is the run-length Burrows-Wheeler transform (RLBWT),...
Lossless data compression has been widely studied in computer science. One of the most widely used lossless data compressions is Lempel-Ziv (LZ) 77 parsing, which achieves a high compression ratio. Bidirectional (a.k.a. macro) parsing is a lossless data compression technique that computes a sequence of phrases copied from another substring (target...
Similarity-preserving hashing is a core technique for fast similarity searches, and it randomly maps data points in a metric space to strings of discrete symbols (i.e., sketches) in the Hamming space. While traditional hashing techniques produce binary sketches, recent ones produce integer sketches for preserving various similarity measures. Howeve...
Non-negative least squares regression (NLS) is a constrained least squares problem where the coefficients are restricted to be non-negative. It is useful for modeling non-negative responses such as time measurements, count data, histograms and so on. Existing NLS solvers are designed for cases where the predictor variables and response variables hav...
Let string S[1..n] be parsed into z phrases by the Lempel-Ziv algorithm. The corresponding compression algorithm encodes S in O(z) space, but it does not support random access to S. We introduce a data structure, the block tree, that represents S in O(z log(n/z)) space and extracts any symbol of S in time O(log(n/z)), among other space-time tradeo...
A keyword dictionary is an associative array whose keys are strings. Recent applications handling massive keyword dictionaries in main memory have a need for a space-efficient implementation. When limited to static applications, there are a number of highly compressed keyword dictionaries based on the advancements of practical succinct data structu...
Here, we present a novel algorithm for frequent itemset mining in streaming data (FIM-SD). For the past decade, various FIM-SD methods in one-pass approximation settings that approximate the support of each itemset have been proposed. They can be categorized into two approximation types: parameter-constrained (PC) mining and resource-const...
Although a significant number of compressed indexes for highly repetitive strings have been proposed thus far, developing compressed indexes that support faster queries remains a challenge. Run-length Burrows-Wheeler transform (RLBWT) is a lossless data compression by a reversible permutation of an input string and run-length encoding, and it has b...
String kernels are attractive data analysis tools for analyzing string data. Among them, alignment kernels are known for their high prediction accuracies in string classifications when tested in combination with SVM in various applications. However, alignment kernels have a crucial drawback in that they scale poorly due to their quadratic computati...
Massive datasets of spatial trajectories representing the mobility of a diversity of moving objects are ubiquitous in research and industry. Similarity search of a large collection of trajectories is indispensable for turning these datasets into knowledge. Current methods for similarity search of trajectories are inefficient in terms of search time...
We propose a novel statistical approach to evaluate the statistical significance (reliability) of the results from discriminative sub-trajectory mining, which we call Statistically Discriminative Sub-trajectory Mining (Stat-DSM). Given two groups of trajectories, the goal of Stat-DSM is to extract moving patterns in the form of sub-trajector...
Lcp-values, lcp-intervals, and maximal repeats are powerful tools in various string processing tasks and have a wide variety of applications. Although many researchers have focused on developing enumeration algorithms for them, those algorithms are inefficient in that the space usage is proportional to the length of the input string. Recently, the...
Prediction of compound-protein interactions with fingerprints has become a challenge in recent pharmaceutical science for efficient drug discovery. We review two scalable methods for predicting drug-protein interactions on fingerprints. In particular, we introduce two techniques of learning statistical models using lossless and lossy data...
We propose a novel statistical approach to evaluate the statistical significance (reliability) of the findings in the discriminative sub-trajectory mining problem, called Statistically Discriminative Sub-trajectory Mining (Stat-DSM). Given two groups of trajectories, the goal is to extract moving patterns in the form of sub-trajectories that occur...
Recently, randomly mapping vectorial data to strings of discrete symbols (i.e., sketches) for fast and space-efficient similarity searches has become popular. Such random mapping is called similarity-preserving hashing and approximates a similarity metric by using the Hamming distance. Although many efficient similarity searches have been proposed,...
Motivation: Genome-wide identification of the transcriptomic responses of human cell lines to drug treatments is a challenging issue in medical and pharmaceutical research. However, drug-induced gene expression profiles are largely unknown and unobserved for all combinations of drugs and human cell lines, which is a serious obstacle in practical a...
A keyword dictionary is an associative array whose keys are strings. Recent applications handling massive keyword dictionaries in main memory have a need for a space-efficient implementation. When limited to static applications, there are a number of highly-compressed keyword dictionaries based on the advancements of practical succinct data structu...
We study the problem of discriminative sub-trajectory mining. Given two groups of trajectories, the goal of this problem is to extract moving patterns in the form of sub-trajectories which are more similar to sub-trajectories of one group and less similar to those of the other. We propose a new method called Statistically Discriminative Sub-traject...
Background: Characterization of drug-protein interaction networks with biological features has become a challenge in recent pharmaceutical science, toward a better understanding of polypharmacology.
Results: We present a novel method for systematic analyses of the underlying features characteristic of drug-protein interaction networks, which...
Converting a compressed format of a string into another compressed format without an explicit decompression is one of the central research topics in string processing. We discuss the problem of converting the run-length Burrows-Wheeler Transform (RLBWT) of a string to Lempel-Ziv 77 (LZ77) phrases of the reversed string. The first results with Polic...
Here, we present a novel algorithm for frequent itemset mining for streaming data (FIM-SD). For the past decade, various FIM-SD methods in one-pass approximation settings have been developed to approximate the frequency of each itemset. These approaches can be categorized into two approximation types: parameter-constrained (PC) mining and resource-...
Lossless data compression has been widely studied in computer science. One of the most widely used lossless data compressions is Lempel-Ziv (LZ) 77 parsing, which achieves a high compression ratio. Bidirectional (a.k.a. macro) parsing is a lossless data compression technique that computes a sequence of phrases copied from another substring (target phrase) on e...
The availability of biomedical big data provides an opportunity to develop data-driven approaches in agriculture and human healthcare research. In this study, we investigate statistical machine learning approaches to metabolic pathway reconstruction and the prediction of drug–target interactions, using heterogeneous biomedical big data. We present...
String kernels are attractive data analysis tools for analyzing string data. Among them, alignment kernels are known for their high prediction accuracies in string classifications when tested in combination with SVMs in various applications. However, alignment kernels have a crucial drawback in that they scale poorly due to their quadratic computat...
Genome-wide identification of all target proteins of drug candidate compounds is a challenging issue in drug discovery. Moreover, emerging phenotypic effects, including therapeutic and adverse effects, are heavily dependent on the inhibition or activation of target proteins. Here we propose a novel computational method for predicting inhibitory and...
We present a novel compressed dynamic self-index for highly repetitive text collections. Signature encoding is a compressed dynamic self-index for highly repetitive texts, but it has the major disadvantage that pattern search for short patterns is slow. We address this disadvantage and achieve faster pattern search by leveraging an idea behind truncated suff...
Similarity search over chemical compound databases is a fundamental task in the discovery and design of novel drug-like molecules. Such databases often encode molecules as non-negative integer vectors, called molecular descriptors, which represent rich information on various molecular properties. While there exist efficient indexing structures for...
Dynamic mode decomposition (DMD) is a data-driven method for calculating a modal representation of a nonlinear dynamical system, and it has been utilized in various fields of science and engineering. In this paper, we propose Bayesian DMD, which provides a principled way to transfer the advantages of the Bayesian formulation into DMD. To this end,...
With massive high-dimensional data now commonplace in research and industry, there is a strong and growing demand for more scalable computational techniques for data analysis and knowledge discovery. Key to turning these data into knowledge is the ability to learn statistical models with high interpretability. Current methods for learning statistic...
Motivation: Metabolic pathways are an important class of molecular networks consisting of compounds, enzymes and their interactions. The understanding of global metabolic pathways is extremely important for various applications in ecology and pharmacology. However, large parts of metabolic pathways remain unknown, and most organism-specific pathway...
Although several self-indexes for highly repetitive text collections exist, developing an index and search algorithm with editing operations remains a challenge. Edit distance with moves (EDM) is a string-to-string distance measure that includes substring moves in addition to ordinal editing operations to turn one string into another. Although the...
Given a string S of length N on a fixed alphabet of σ symbols, a grammar compressor produces a context-free grammar G of size n that generates S and only S. In this paper we describe data structures to support the following operations on a grammar-compressed string: access(S,i,j) (return substring S[i,j]), rank_c(S,i) (return the number of occurre...
Given a grammar compressed string S, a pattern P, and \(d\ge 0\), we consider the problem of finding all occurrences of \(P'\) in S such that \(d(P,P')\le d\) with respect to Hamming distance. We propose an algorithm for this problem in \(O(\lg \lg n \lg ^* N(m+d\ occ_d \lg \frac{m}{d}\lg N))\) time, where \(N=|S|\), \(m=|P|\), n is the number of v...
Although several grammar-based self-indexes have been proposed thus far, their applicability is limited to offline settings where whole input texts are prepared, so index structures must be rebuilt whenever additional inputs arrive, which is often the case in the big data era. In this paper, we present the first online self-indexed grammar compres...
Recent advances in mass spectrometry and related metabolomics technologies have enabled the rapid and comprehensive analysis of numerous metabolites. However, biosynthetic and biodegradation pathways are only known for a small portion of metabolites, with most metabolic pathways remaining uncharacterized. In this study, we developed a novel method...
We describe a data structure that stores a string $S$ in space similar to that of its Lempel-Ziv encoding and efficiently supports access, rank and select queries. These queries are fundamental for implementing succinct and compressed data structures, such as compressed trees and graphs. We show that our data structure can be built in a scalable ma...
Given a string $S$ of length $N$ on a fixed alphabet of $\sigma$ symbols, a grammar compressor produces a context-free grammar $G$ of size $n$ that generates $S$ and only $S$. In this paper we describe data structures to support the following operations on a grammar-compressed string: $\mbox{rank}_c(S,i)$ (return the number of occurrences of symbol...
Edit distance with moves (EDM) is a string-to-string distance measure that includes substring moves in addition to ordinal editing operations to turn one string to the other. Although optimizing EDM is intractable, it has many applications especially in error detections. Edit sensitive parsing (ESP) is an efficient parsing algorithm that guarantees...
Motivation: Metabolic pathway analysis is crucial not only in metabolic engineering but also in rational drug design. However, the biosynthetic/biodegradation pathways are known only for a small portion of metabolites, and a vast amount of pathways remain uncharacterized. Therefore, an important challenge in metabolomics is the de novo reconstructi...
While several self-indexes for highly repetitive texts exist, developing a practical self-index applicable to real world repetitive texts remains a challenge. ESP-index is a grammar-based self-index on the notion of edit-sensitive parsing (ESP), an efficient parsing algorithm that guarantees upper bounds of parsing discrepancies between different a...
We present novel variants of fully online LCA (FOLCA), a fully online grammar compression that builds a straight line program (SLP) and directly encodes it into a succinct representation in an online manner. FOLCA enables a direct encoding of an SLP into a succinct representation that is asymptotically equivalent to an information theoretic lower...
The identification of compound-protein interactions plays key roles in the drug development toward discovery of new drug leads and new therapeutic protein targets. There is therefore a strong incentive to develop new efficient methods for predicting compound-protein interactions on a genome-wide scale. In this paper we develop a novel chemogenomic...
In order to develop hypotheses on unknown metabolic pathways, biochemists frequently rely on literature that uses a free-text format to describe functional groups or substructures. In computational chemistry or cheminformatics, molecules are typically represented by chemical descriptors, i.e., vectors that summarize information on its various prope...
Most phenotypic effects of drugs are involved in the interactions between drugs and their target proteins; however, our knowledge about the molecular mechanism of the drug-target interactions is very limited. One of the challenging issues in recent pharmaceutical science is to identify the underlying molecular features which govern drug-target interact...
We present a fully-online algorithm for constructing straight line programs (SLPs). A naive array representation of an SLP with n variables on an alphabet of size σ requires 2n lg(n + σ) bits. As already shown in [Tabei et al., CPM’13], in offline setting, this size can be reduced to n lg(n + σ) + 2n + o(n), which is asymptotically equal to the i...
Analyzing functional interactions between small compounds and proteins is indispensable in genomic drug discovery. Since rich information on various compound-protein interactions is available in recent molecular databases, there is strong demand for powerful methods that make the best use of such databases and help us find new functional compou...
The metabolic pathway is an important biochemical reaction network involving enzymatic reactions among chemical compounds. However, it is assumed that a large number of metabolic pathways remain unknown, and many reactions are still missing even in known pathways. Therefore, the most important challenge in metabolomics is the automated de novo reco...
We solve an open problem related to an optimal encoding of a straight line program (SLP), a canonical form of grammar compression deriving a single string deterministically. We show that an information-theoretic lower bound for representing an SLP with n symbols requires at least 2n + log n! + o(n) bits. We then present a succinct representation of...
We solve an open problem related to an optimal encoding of a straight line program (SLP), a canonical form of grammar compression deriving a single string deterministically. We show that an information-theoretic lower bound for representing an SLP with n symbols requires at least 2n + log n! + o(n) bits. We then present a succinct representation o...
A dictionary is a crucial data structure for implementing grammar-based compression algorithms. Such a dictionary should access any code in O(1) time for efficient compression. A standard dictionary consisting of fixed-length codes consumes a large amount of memory, 2n log n bits for n variables. We present novel dictionaries consisting of variable-...
Drug effects are mainly caused by the interactions between drug molecules and their target proteins including primary targets and off-targets. Identification of the molecular mechanisms behind overall drug-target interactions is crucial in the drug design process. We develop a classifier-based approach to identify chemogenomic features (the underly...
Similarity searches in the databases of chemical fingerprints are a fundamental task in discovering novel drug-like molecules. Multibit trees have a data structure that enables fast similarity searches of chemical fingerprints (Kristensen et al., WABI'09). A standard pointer-based representation of multibit trees consumes a large amount of memory t...
Computational investigation of protein functions is one of the most urgent and demanding tasks in the field of structural bioinformatics. Exhaustive pairwise comparison of known and putative ligand-binding sites, across protein families and folds, is essential in elucidating the biological functions and evolutionary relationships of proteins. Given...
Numerous potential ligand-binding sites are available today, along with hundreds of thousands of known binding sites observed in the PDB. Exhaustive similarity search for such vastly numerous binding site pairs is useful to predict protein functions and to enable rapid screening of target proteins for drug design. Existing databases of ligand-bindi...
Background / Purpose: Numerous potential ligand-binding sites, in addition to hundreds of thousands of known binding sites in the Protein Data Bank (PDB), are available today. Exhaustive similarity searches for such vastly numerous binding sites are computationally demanding, but they must be useful for prediction of protein functions and rapid s...
Similarity networks of ligands are often reported useful in predicting chemical activities and target proteins. However, the naive method of computing all pairwise similarities of chemical fingerprints takes quadratic time, which is prohibitive for large scale databases with millions of ligands. We propose a fast all pairs similarity search method,...
Similarity search in databases of labeled graphs is a fundamental task in managing graph data such as XML, chemical compounds and social networks. Typically, a graph is decomposed to a set of substructures (e.g., paths, trees and subgraphs) and a similarity measure is defined via the number of common substructures. Using the representation, graphs...
A linear graph is a graph whose vertices are totally ordered. Biological and linguistic sequences with interactions among symbols are naturally represented as linear graphs. Examples include protein contact maps, RNA secondary structures and predicate-argument structures. Our algorithm, linear graph miner (LGM), leverages the vertex order for effic...
A linear graph is a graph whose vertices are totally ordered. Biological and linguistic sequences with interactions among symbols are naturally represented as linear graphs. Examples include protein contact maps, RNA secondary structures and predicate-argument structures. Our algorithm, linear graph miner (LGM), leverages the vertex order for eff...
Non-coding RNAs (ncRNAs) show a unique evolutionary process in which the substitutions of distant bases are correlated in order to conserve the secondary structure of the ncRNA molecule. Therefore, the multiple alignment method for the detection of ncRNAs should take into account both the primary sequence and the secondary structure. Recently, ther...
We present web servers for analysis of non-coding RNA sequences on the basis of their secondary structures. Software tools for structural multiple sequence alignments, structural pairwise sequence alignments and structural motif findings are available from the integrated web server and the individual stand-alone web servers. The servers are located...
Aligning multiple RNA sequences is essential for analyzing non-coding RNAs. Although many alignment methods for non-coding RNAs, including Sankoff's algorithm for strict structural alignments, have been proposed, they are either inaccurate or computationally too expensive. Faster methods with reasonable accuracies are required for genome-scale anal...
Structural RNA genes exhibit unique evolutionary patterns that are designed to conserve their secondary structures; these patterns should be taken into account while constructing accurate multiple alignments of RNA genes. The Sankoff algorithm is a natural alignment algorithm that includes the effect of base-pair covariation in the alignment model....
The functions of non-coding RNAs are strongly related to their secondary structures, but it is known that a secondary structure prediction of a single sequence is not reliable. Therefore, we have to collect similar RNA sequences with a common secondary structure for the analyses of a new non-coding RNA without knowing the exact secondary structure...
Recent studies on comparative genomics revealed that the conserved regions between genomes contain a large amount of non-coding sequences. One of the important challenges is to discover as many candidates of non-coding RNAs as possible with high accuracy by sequence information analyses. Comparison of the two sequences is the most important foundation...