
Kenji Yamanishi
- Professor at The University of Tokyo
About
187 Publications
14,545 Reads
4,567 Citations
Publications (187)
Online structured prediction is a task of sequentially predicting outputs with complex structures based on inputs and past observations, encompassing online classification. Recent studies showed that in the full information setup, we can achieve finite bounds on the surrogate regret, i.e., the extra target loss relative to the best possible surroga...
This study addresses the issue of graph generation with generative models. In particular, we are concerned with the graph community augmentation problem, which refers to the problem of generating unseen or unfamiliar graphs with a new community out of the probability distribution estimated with a given graph dataset. The graph community augmentation me...
The normalized maximum likelihood (NML) code length is widely used as a model selection criterion based on the minimum description length principle, where the model with the shortest NML code length is selected. A common method to calculate the NML code length is to use the sum (for a discrete model) or integral (for a continuous model) of a functi...
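The sum form mentioned above can be made concrete with a small example. The sketch below computes the exact NML code length for a Bernoulli model by evaluating the normalizing sum over all sequences of a given length, grouped by their number of ones; it illustrates the general definition and is not code from the publication.

```python
import math

def bernoulli_nml_code_length(x):
    """NML code length (in nats) of a binary sequence x under the Bernoulli model.

    L_NML(x^n) = -log p(x^n | theta_hat(x^n)) + log C_n, where
    C_n = sum over all sequences y^n of p(y^n | theta_hat(y^n))
    is the discrete normalizing sum (parametric complexity).
    """
    n, k = len(x), sum(x)

    def max_loglik(k, n):
        # log-likelihood at the ML estimate theta_hat = k / n (with 0*log 0 := 0)
        ll = 0.0
        if 0 < k:
            ll += k * math.log(k / n)
        if k < n:
            ll += (n - k) * math.log((n - k) / n)
        return ll

    # parametric complexity: group the 2^n sequences by their number of ones
    log_terms = [math.log(math.comb(n, j)) + max_loglik(j, n) for j in range(n + 1)]
    m = max(log_terms)
    log_C = m + math.log(sum(math.exp(t - m) for t in log_terms))  # log-sum-exp

    return -max_loglik(k, n) + log_C

print(bernoulli_nml_code_length([1, 1, 0, 1, 0, 1, 1, 0]))
```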
Graph data augmentation (GDA), which manipulates graph structure and/or attributes, has been demonstrated as an effective method for improving the generalization of graph neural networks on semi-supervised node classification. As a data augmentation technique, label preservation is critical, that is, node labels should not change after data manipul...
In this chapter, we introduce some methodologies for parameter estimation on the basis of the MDL principle. Parameter estimation is the most fundamental task of statistical inference, which should be addressed preceding model selection, as focused on in the subsequent chapters. Therefore, this chapter can be thought of as a preliminary chapter for...
In this chapter, we address the issue of sequential prediction. The goal is to make the cumulative prediction loss as small as possible. This issue is reduced to the problem of minimizing the cumulative code-length when the code-length is calculated sequentially. We consider three types of prediction algorithms: maximum likelihood prediction algori...
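As a toy illustration of accumulating the code-length sequentially, the sketch below predicts a binary stream with a smoothed plug-in estimator built from past observations only and sums the resulting log losses; the smoothing constant is illustrative, and the chapter's own algorithms (e.g., maximum likelihood prediction) may differ in detail.

```python
import math

def cumulative_code_length(xs, alpha=1.0):
    """Sequentially predict a binary stream and accumulate the log loss.

    The cumulative -log2 probability of the observed symbols equals the
    code length (in bits) of the sequence under the corresponding
    sequential code.
    """
    total, ones = 0.0, 0
    for t, x in enumerate(xs):
        p_one = (ones + alpha) / (t + 2 * alpha)      # predictive probability of 1
        p = p_one if x == 1 else 1.0 - p_one
        total += -math.log2(p)                        # log loss = code length in bits
        ones += x
    return total

print(cumulative_code_length([0, 1, 1, 0, 1, 1, 1, 0, 1]))
```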
This chapter shows an application of the MDL principle to statistical model selection. First, a number of existing model selection criteria such as AIC, BIC, MML, and cross-validation are introduced. The MDL criterion is introduced as an information-theoretic model selection criterion. It is justified in terms of consistency, estimation optimality,...
The MDL principle has been established on the basis of the equivalence relation between a probability distribution and coding through the Kraft inequality. Therefore, the issue of learning with the MDL principle is formulated in a setting where the hypothesis is a class of probability distributions and the loss function is the logarithmic loss. In...
This chapter summarizes mathematical preliminaries necessary for reading this book.
In this chapter, we show an application of the MDL principle to change detection. It is one of the most practically important issues in data science, where the MDL principle plays a key role in designing effective change detection algorithms. We classify the change detection issues into parameter change detection and latent structure change detection....
This chapter introduces the notion of descriptive dimension (Ddim), which is a real-valued representation complexity of a probabilistic model. Ddim is formulated on the basis of the MDL principle and the theory of box counting dimension. We refer to the task of calculating Ddim from data as continuous model selection. When a model changes over time...
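Ddim builds on the theory of box-counting dimension; as background, the sketch below shows the classical box-counting estimate of dimension from a point set (the slope of log N(eps) against log 1/eps). It illustrates the box-counting ingredient only, not the MDL-based computation of Ddim itself.

```python
import numpy as np

def box_counting_dimension(points, epsilons):
    """Estimate the box-counting dimension of a finite point set.

    For each box size eps, count the occupied boxes N(eps) on a grid of
    side eps, then fit the slope of log N(eps) versus log(1/eps).
    """
    points = np.asarray(points)
    counts = []
    for eps in epsilons:
        boxes = {tuple(idx) for idx in np.floor(points / eps).astype(int)}
        counts.append(len(boxes))
    slope, _ = np.polyfit(np.log(1.0 / np.array(epsilons)), np.log(counts), 1)
    return slope

pts = np.random.default_rng(1).random((2000, 2))   # uniform points in the unit square
print(box_counting_dimension(pts, [0.5, 0.25, 0.125, 0.0625]))   # roughly 2
```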
In this chapter, we introduce the notions of information, probability distributions, and coding. We show that (sub-)probability distributions and codes are equivalent through the Kraft inequality. The most primitive quantification of information is Shannon’s information, which is the optimal code-length when a probability distribution is k...
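A minimal numerical illustration of these notions: Shannon's information -log2 p(x) gives code lengths whose Kraft sum does not exceed one, so a prefix code with those lengths exists. The distribution below is a made-up example.

```python
import math

# Shannon's information -log2 p(x) as an (idealized) code length, and a check
# of the Kraft inequality sum(2**-l(x)) <= 1 for those lengths.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

code_lengths = {x: -math.log2(px) for x, px in p.items()}   # optimal code lengths
kraft_sum = sum(2.0 ** (-l) for l in code_lengths.values())

print(code_lengths)   # {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 3.0}
print(kraft_sum)      # 1.0, so a prefix code with these lengths exists
```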
Latent variable models are important knowledge representations for machine learning. This chapter introduces two information-theoretic criteria for model selection in latent variable models: the latent stochastic complexity (LSC) and the decomposed normalized maximum likelihood code-length (DNML). It is shown how DNML can be applied to model selection...
We address the issue of detecting changes of models that lie behind a data stream. The model refers to integer-valued structural information, such as the number of free parameters in a parametric model. Specifically, we are concerned with the problem of how we can detect signs of model changes earlier than they are actualized. To this end, we empl...
Glaucoma, which causes progressive and irreversible damage to eyesight, is the second leading cause of blindness worldwide. The damage is principally estimated by visual field (VF) sensitivity through costly visual field tests. To achieve a less costly estimation, a promising method is to first measure retinal layer thickness (RT) by optica...
Graph embedding methods are effective techniques for representing nodes and their relations in a continuous space. Specifically, the hyperbolic space is more effective than the Euclidean space for embedding graphs with tree-like structures. Thus, it is critical how to select the best dimensionality for the hyperbolic space in which a graph is embed...
Machine learning for point clouds has been attracting much attention, with many applications in various fields, such as shape recognition and material science. To enhance the accuracy of such machine learning methods, it is known to be effective to incorporate global topological features, which are typically extracted by persistent homology. In the...
Recent studies have experimentally shown that we can achieve effective and efficient graph embedding in non-Euclidean metric spaces; graph embedding aims to obtain representations of the vertices that reflect the graph's structure in the metric space. Specifically, graph embedding in hyperbolic space has experimentally succeeded in embedding graphs with hierarchi...
We address the issue of detecting changes of models that lie behind a data stream. The model refers to integer-valued structural information, such as the number of free parameters in a parametric model. Specifically, we are concerned with the problem of how we can detect signs of model changes earlier than they are actualized. To this end, we empl...
Graph embedding methods are effective techniques for representing nodes and their relations in a continuous space. Specifically, the hyperbolic space is more effective than the Euclidean space for embedding graphs with tree-like structures. Thus, it is critical how to select the best dimensionality for the hyperbolic space in which a graph is embedd...
We consider measuring the number of clusters (cluster size) in finite mixture models for interpreting their structures. Many existing information criteria have been applied to this issue by regarding it as identical to the number of mixture components (mixture size); however, this may not be valid in the presence of overlaps or weight biases. I...
This study addresses the issue of summarizing a static graph, known as graph summarization, effectively and efficiently. The resulting compact graph is referred to as a summary graph. Based on the minimum description length principle (MDL), we propose a novel graph summarization algorithm called the graph summarization with latent variable probabil...
The detection of network changes over time is based on identifying deviations of the network structure. The challenge mainly lies in designing a good summary or descriptor of the network structure to facilitate the measurement of deviations. In particular, a network may have a huge number of nodes and edges. Moreover, there can exist complicated dep...
Purpose:
To investigate whether a correction based on a Humphrey field analyzer (HFA) 24-2/30-2 visual field (VF) can improve the prediction performance of a deep learning model to predict the HFA 10-2 VF test from macular optical coherence tomography (OCT) measurements.
Methods:
This is a multicenter, cross-sectional study. The training dataset...
Finite mixture models are widely used for modeling and clustering data. When they are used for clustering, they are often interpreted by regarding each component as one cluster. However, this assumption may be invalid when the components overlap. It leads to the issue of analyzing such overlaps to correctly understand the models. The primary purpos...
Graph embedding, which represents real-world entities in a mathematical space, has enabled numerous applications such as analyzing natural languages, social networks, biochemical networks, and knowledge bases. It has been experimentally shown that graph embedding in hyperbolic space can represent hierarchical tree-like data more effectively than em...
We are concerned with the issue of detecting changes and their signs from a data stream. For example, when given time series of COVID-19 cases in a region, we may raise early warning signals of an epidemic by detecting signs of changes in the data. We propose a novel methodology to address this issue. The key idea is to employ a new information-the...
Purpose
We constructed a multitask learning model (latent space linear regression and deep learning, LSLR-DL) in which the two tasks of cross-sectional predictions (using optical coherence tomography: OCT) of VF (central 10°) and longitudinal progression predictions of VF (30°) were performed jointly via sharing the DL component such that informati...
In this paper, we propose a novel information criteria-based approach to select the dimensionality of the word2vec Skip-gram (SG). From the perspective of probability theory, SG is regarded as an implicit probability distribution estimation under the assumption that there exists a true contextual distribution among words. Therefore, we apply...
The normalized maximum likelihood (NML) distribution of a probabilistic model gives the optimal code length function in the sense of minimax regret. Despite this optimal property, the calculation of the NML distribution is not easy, and existing efficient methods have focused on its asymptotic behavior or on specific models. This paper gives an efficie...
Hyperbolic ordinal embedding (HOE) represents entities as points in hyperbolic space so that they agree as well as possible with given constraints in the form of entity i is more similar to entity j than to entity k. It has been experimentally shown that HOE can obtain representations of hierarchical data such as a knowledge base and a citation net...
Multi-label learning deals with data examples which are associated with multiple class labels simultaneously. Despite the success of existing approaches to multi-label learning, there is still a problem neglected by researchers, i.e., not only are some of the values of observed labels missing, but also some of the labels are completely unobserved f...
This paper addresses the issue of detecting hierarchical changes in latent variable models (HCDL) from data streams. There are three different levels of changes for latent variable models: 1) the first level is the change in data distribution for fixed latent variables, 2) the second one is that in the distribution over latent variables, and 3) the...
In this paper, we propose a novel information criteria-based approach to select the dimensionality of the word2vec Skip-gram (SG). From the perspective of probability theory, SG is regarded as an implicit probability distribution estimation under the assumption that there exists a true contextual distribution among words. Therefore, we apply...
Purpose
To investigate whether optical coherence tomography (OCT) measurements can improve visual field (VF) trend analyses in glaucoma patients, using the ‘deeply-regularized latent-space linear regression’ (DLLR) model.
Design
Retrospective cohort study
Subjects
Training and testing datasets included 7,984 VFs from 998 eyes of 592 patients and...
Spatial attention has been introduced to convolutional neural networks (CNNs) for improving both their performance and interpretability in visual tasks including image classification. The essence of the spatial attention is to learn a weight map which represents the relative importance of activations within the same layer or channel. All existing a...
We are concerned with the issue of detecting changes and their signs from a data stream. For example, when given time series of COVID-19 cases in a region, we may raise early warning signals of an epidemic by detecting signs of changes in the data. We propose a novel methodology to address this issue. The key idea is to employ a new information-the...
In online learning from non-stationary data streams, it is necessary both to learn robustly against outliers and to adapt quickly to changes in the underlying data-generating mechanism. In this paper, we refer to the former nature of online learning algorithms as robustness and the latter as adaptivity. There is an obvious tradeoff between them. It is a fun...
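The robustness/adaptivity tradeoff can be illustrated with a toy online estimator: a larger step size forgets old data faster (adaptivity), while clipping the residual limits the influence of outliers (robustness). This sketch is illustrative only and is not the algorithm proposed in the paper.

```python
def robust_adaptive_mean(stream, step=0.1, clip=3.0):
    """Illustrative online estimator of a stream's mean.

    `step` acts like a forgetting factor (larger = faster adaptation to changes),
    while clipping the residual bounds the influence of any single outlier.
    """
    est = 0.0
    for x in stream:
        residual = x - est
        residual = max(-clip, min(clip, residual))   # robust: cap outlier influence
        est += step * residual                       # adaptive: discount old data
        yield est

data = [0.1, -0.2, 0.0, 50.0, 0.3, 5.0, 5.2, 4.9, 5.1]   # an outlier, then a level shift
print(list(robust_adaptive_mean(data)))
```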
In model-based clustering using finite mixture models, it is a significant challenge to determine the number of clusters (cluster size). It has conventionally been set equal to the number of mixture components (mixture size); however, this may not be valid in the presence of overlaps or weight biases. In this study, we propose to continuously measure the cluster si...
Existing multi-label learning (MLL) approaches mainly assume all the labels are observed and construct classification models with a fixed set of target labels (known labels). However, in some real applications, multiple latent labels may exist outside this set and hide in the data, especially for large-scale data sets. Discovering and exploring the...
Background/Aim
To train and validate the prediction performance of the deep learning (DL) model to predict visual field (VF) in central 10° from spectral domain optical coherence tomography (SD-OCT).
Methods
This multicentre, cross-sectional study included paired Humphrey field analyser (HFA) 10-2 VF and SD-OCT measurements from 591 eyes of 347 pa...
Purpose
To predict the visual field (VF) of glaucoma patients within the central 10 degrees from optical coherence tomography (OCT) measurements using deep learning and tensor regression.
Design
Cross-sectional study
Method
Humphrey 10-2 VFs and OCT measurements were carried out in 505 eyes of 304 glaucoma patients and 86 eyes of 43 normal subjec...
Inter-event times of various human behaviour are apparently non-Poissonian and obey long-tailed distributions as opposed to exponential distributions, which correspond to Poisson processes. It has been suggested that human individuals may switch between different states, in each of which they are regarded as generating events obeying a Poisson proces...
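A toy simulation of the switching picture described above: events are generated by alternating between a fast and a slow Poisson state, which yields inter-event times with a heavier-than-exponential tail. The rates and switching probability are arbitrary illustrative values.

```python
import random

def switching_poisson_intervals(n_events, rates=(5.0, 0.2), stay_prob=0.95, seed=0):
    """Generate inter-event times from a two-state switching Poisson process.

    In each state, inter-event times are exponential with that state's rate;
    after each event, the state switches with probability 1 - stay_prob.
    Mixing a fast and a slow rate in this way produces a heavy-tailed
    inter-event-time distribution.
    """
    rng = random.Random(seed)
    state, intervals = 0, []
    for _ in range(n_events):
        intervals.append(rng.expovariate(rates[state]))
        if rng.random() > stay_prob:
            state = 1 - state
    return intervals

iets = switching_poisson_intervals(10000)
print(max(iets), sum(iets) / len(iets))
```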
This paper introduces the combinatorial Boolean model (CBM), which is defined as the class of linear combinations of conjunctions of Boolean attributes. This paper addresses the issue of learning CBM from labeled data. CBM is of high knowledge interpretability but naïve learning of it requires exponentially large computation time with respect t...
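The model class itself is easy to state in code: a CBM scores an input by a weighted sum over those conjunctions of Boolean attributes that the input satisfies. The sketch below evaluates such a model on a toy example; the paper's learning algorithm is not shown.

```python
def cbm_predict(x, rules):
    """Evaluate a combinatorial Boolean model (CBM) on a Boolean vector x.

    A CBM is a linear combination of conjunctions of Boolean attributes:
    f(x) = sum_j w_j * AND_{i in S_j} x_i.
    `rules` is a list of (attribute-index-set, weight) pairs.
    """
    return sum(w for idxs, w in rules if all(x[i] for i in idxs))

x = [1, 0, 1, 1]
rules = [({0, 2}, 0.7), ({1}, -0.4), ({2, 3}, 1.1)]
print(cbm_predict(x, rules))   # 0.7 + 1.1 = 1.8 (the conjunction over {1} is not satisfied)
```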
This paper addresses the issues of how we can quantify structural information for nonparametric distributions and how we can detect its changes. Structural information refers to an index for a global understanding of a data distribution. When we consider the problem of clustering using a parametric model such as a Gaussian mixture model, the number...
This paper addresses the issue of how we can detect changes of changes, which we call metachanges, in data streams. A metachange refers to a change in patterns of when and how changes occur, referred to as “metachanges along time” and “metachanges along state”, respectively. Metachanges along time mean that the intervals between change points signi...
This paper introduces a new notion of dimensionality of probabilistic models from an information-theoretic viewpoint. We call it the "descriptive dimension" (Ddim). We show that Ddim coincides with the number of independent parameters for the parametric class, and can further be extended to real-valued dimensionality when a number of models are mix...
Existing methods on representation-based subspace clustering mainly treat all features of data as a whole to learn a single self-representation and get one clustering solution. Real data, however, are often complex and consist of multiple attributes or sub-features; for example, a face image has expression or gender attributes. Each attribute is distinct and compl...
Prediction of glaucomatous visual field loss has significant clinical benefits because it can help with early detection of glaucoma as well as decision-making for treatments. Glaucomatous visual loss is conventionally captured through visual field sensitivity (VF) measurement, which is costly and time-consuming. Thus, existing approaches mainly pr...
When considering a data set, it is often unknown how complex it is, and hence it is difficult to assess how rich a model for the data should be. Often these choices are swept under the carpet, ignored, or left to the domain expert, but in practice this is highly unsatisfactory; domain experts do not know how to set k, what prior to choose, or how many...
Semi-supervised representation-based subspace clustering is to partition data into their underlying subspaces by finding effective data representations with partial supervisions. Essentially, an effective and accurate representation should be able to uncover and preserve the true data structure. Meanwhile, a reliable and easy-to-obtain supervision...
We propose a new model selection criterion based on the minimum description length principle, named the decomposed normalized maximum likelihood (DNML) criterion. Our criterion can be applied to a large class of hierarchical latent variable models, such as naïve Bayes models, stochastic block models, latent Dirichlet allocations and Gaussian...
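As the name suggests, the DNML code length decomposes the code length of the complete data into a part for the data given the latent variables and a part for the latent variables themselves; a schematic form of this decomposition (details such as how the latent assignment is chosen are omitted) is:

```latex
% Schematic decomposition suggested by the name DNML: an NML code length for the
% data given the latent variables plus an NML code length for the latent variables.
\[
  L_{\mathrm{DNML}}(x^n, z^n) \;=\; L_{\mathrm{NML}}(x^n \mid z^n) \;+\; L_{\mathrm{NML}}(z^n),
\]
\[
  L_{\mathrm{NML}}(z^n) \;=\; -\log p\bigl(z^n \mid \hat{\theta}(z^n)\bigr)
  \;+\; \log \sum_{\tilde{z}^n} p\bigl(\tilde{z}^n \mid \hat{\theta}(\tilde{z}^n)\bigr).
\]
```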
Non-negative tensor factorization (NTF) is a widely used multi-way analysis approach that factorizes a high-order non-negative data tensor into several non-negative factor matrices. In NTF, the non-negative rank has to be predetermined to specify the model and it greatly influences the factorized matrices. However, its value is conventionally deter...
This paper corrects errors in the calculation of the normalized maximum likelihood (NML) code length for a Gaussian mixture model (GMM). It shows that the NML code length calculated in “Efficient computation of normalized maximum likelihood codes for Gaussian mixture models with its applications to clustering” is an upper bound on the NML code leng...
Inter-event times of various human behavior are apparently non-Poissonian and obey long-tailed distributions as opposed to exponential distributions, which correspond to Poisson processes. It has been suggested that human individuals may switch between different states, in each of which they are regarded as generating events obeying a Poisson process....
Semi-supervised representation-based subspace clustering is to partition data into their underlying subspaces by finding effective data representations with partial supervisions. Essentially, an effective and accurate representation should be able to uncover and preserve the true data structure. Meanwhile, a reliable and easy-to-obtain supervisio...
We develop a new theoretical framework, the "envelope complexity", to analyze the minimax regret with logarithmic loss functions and derive a Bayesian predictor that achieves the adaptive minimax regret over high-dimensional ℓ1-balls up to the major term. The prior is newly derived for achieving the minimax regret and called the sp...
At present, a large amount of traffic-related data is obtained manually and through sensors and social media, e.g., traffic statistics, accident statistics, road information, and users' comments. In this paper, we propose a novel framework for mining traffic risk from such heterogeneous data. Traffic risk refers to the possibility of occurrence of...
We tackle the problem of penalty selection for regularization on the basis of the minimum description length (MDL) principle. In particular, we consider that the design space of the penalty function is high-dimensional. In this situation, the luckiness-normalized-maximum-likelihood (LNML)-minimization approach is favorable, because LNML quantifies...
Conventionally, glaucoma is diagnosed on the basis of visual field sensitivity (VF). However, the VF test is time-consuming, costly, and noisy. Using retinal thickness (RT) for glaucoma diagnosis is currently desirable. Thus, we propose a new methodology for estimating VF from RT in glaucomatous eyes. The key ideas are to use our new methods of pat...
We are concerned with the issue of detecting model changes in probability distributions. We specifically consider the strategies based on the minimum description length (MDL) principle. We theoretically analyze their basic performance from two aspects: data compression and hypothesis testing. From the view of data compression, we derive a new b...
Nonnegative matrix factorization (NMF), a well-known technique to find parts-based representations of nonnegative data, has been widely studied. In reality, ordinal relations often exist among data, such as data i is more related to j than to q. Such relative order is naturally available, and more importantly, it truly reflects the latent data stru...
Purpose: Global indices of standard automated perimetry are insensitive to localized losses, while point-wise indices are sensitive but highly variable. Region-wise indices sit in between. This study introduces a machine-learning-based index for glaucoma progression detection that outperforms global, region-wise, and point-wise indices.
De...
A hyperbolic space has been shown to be more capable of modeling complex networks than a Euclidean space. This paper proposes an explicit update rule along geodesics in a hyperbolic space. The convergence of our algorithm is theoretically guaranteed, and the convergence rate is better than the conventional Euclidean gradient descent algorithm. More...
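For intuition, the sketch below performs one Riemannian gradient-descent step in the Poincaré ball: the Euclidean gradient is rescaled by the inverse metric factor and the iterate is pulled back inside the ball. This retraction-based variant is a common simplification; the paper's update moves exactly along geodesics via the exponential map, which is not reproduced here.

```python
import numpy as np

def poincare_gradient_step(x, euclidean_grad, lr=0.05, eps=1e-5):
    """One retraction-based Riemannian gradient-descent step in the Poincaré ball.

    The Riemannian gradient rescales the Euclidean gradient by the inverse
    metric factor ((1 - ||x||^2)^2 / 4); the iterate is then projected back
    into the open unit ball if necessary.
    """
    scale = (1.0 - np.dot(x, x)) ** 2 / 4.0
    x_new = x - lr * scale * euclidean_grad
    norm = np.linalg.norm(x_new)
    if norm >= 1.0:                       # project back into the open unit ball
        x_new = x_new / norm * (1.0 - eps)
    return x_new

x = np.array([0.3, 0.1])
g = np.array([1.0, -2.0])                 # Euclidean gradient of some loss at x
print(poincare_gradient_step(x, g))
```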
Conventionally, glaucoma is diagnosed on the basis of the visual field sensitivity (VF). However, the VF test is time-consuming, costly, and noisy. Using the retinal thickness (RT) for glaucoma diagnosis is currently desirable. Thus, we propose a new methodology for estimating the VF from the RT in glaucomatous eyes. The key ideas are to use our new...
We tackle the problem of penalty selection of regularization on the basis of the minimum description length (MDL) principle. In particular, we consider that the design space of the penalty function is high-dimensional. In this situation, the luckiness-normalized-maximum-likelihood (LNML)-minimization approach is favorable, because LNML quantifies th...
The normalized maximum likelihood code length has been widely used in model selection, and its favorable properties, such as its consistency and the upper bound of its statistical risk, have been demonstrated. This paper proposes a novel methodology for calculating the normalized maximum likelihood code length on the basis of Fourier analysis. Our...
This paper introduces the combinatorial Boolean model (CBM), which is defined as the class of linear combinations of conjunctions of Boolean attributes. This paper addresses the issue of learning CBM from labeled data. CBM is of high knowledge interpretability but naïve learning of it requires exponentially large computation time with respect t...
This paper shows that the normalized maximum likelihood (NML) code-length calculated in [1] is an upper bound on the NML code-length strictly calculated for the Gaussian mixture model. We call such an upper bound on the NML code-length uNML (upper bound on NML). When we use this uNML code-length, we have to change the scale of the data sequence to satisf...
We propose a new model selection criterion based on the minimum description length principle, named the decomposed normalized maximum likelihood criterion. Our criterion can be applied to a large class of hierarchical latent variable models, such as naïve Bayes models, stochastic block models, and latent Dirichlet allocations, for which ma...
Dense measurement of the visual field, which is necessary to detect glaucoma, is known to be very costly and labor-intensive. Recently, measurement of retinal thickness has become less costly than measurement of the visual field. Thus, it is strongly desired that retinal-thickness measurements be transformed into visual-sensitivity data somehow. In this paper, we...
We are concerned with detecting continuous changes in stochastic processes. In conventional studies on non-stationary stochastic processes, it is often assumed that changes occur abruptly. By contrast, we assume that they take place continuously. The proposed scheme consists of an efficient algorithm and rigorous theoretical analysis under the assu...
The minimum description length principle (MDL principle) is a data-compression-based methodology for optimal estimation and prediction from data. It gives a unifying strategy for designing machine learning algorithms and plays an important role in knowledge discovery from big data. Conventionally, the MDL principle has been extensively studied unde...
This study proposes a novel statistical methodology to analyze expenditure on multiple medical sectors using consumer data. Conventionally, medical expenditure has been analyzed by two-part models, which separately consider purchase decision and amount of expenditure. We extend the traditional two-part models by adding the step of basket analysis f...
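For reference, a conventional two-part model (the baseline that the proposed method extends) can be sketched as follows: one model for the decision to spend at all and another for the log amount among spenders. The data and estimators below are illustrative placeholders, not the study's actual model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

# Minimal sketch of a conventional two-part expenditure model on toy data:
# part 1 models whether any spending occurs, part 2 models the (log) amount
# among spenders.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                                  # consumer covariates
spend = (X[:, 0] + rng.normal(size=500) > 0).astype(int)       # any expenditure?
amount = np.where(spend == 1,
                  np.exp(1.0 + 0.5 * X[:, 1] + 0.1 * rng.normal(size=500)),
                  0.0)

part1 = LogisticRegression().fit(X, spend)                     # purchase decision
spenders = spend == 1
part2 = LinearRegression().fit(X[spenders], np.log(amount[spenders]))  # amount | spend

# expected expenditure for a new consumer: P(spend) * exp(predicted log amount)
x_new = np.zeros((1, 3))
p_spend = part1.predict_proba(x_new)[0, 1]
expected = p_spend * np.exp(part2.predict(x_new)[0])
print(p_spend, expected)
```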