About
164 Publications
14,310 Reads
1,604 Citations
Introduction
Additional affiliations
April 2017 - present
April 1992 - March 1994
March 2017 - present
Publications (164)
In this paper, we propose new methods to efficiently solve convex optimization problems encountered in sparse estimation. These methods include a new quasi-Newton method that avoids computing the Hessian matrix and improves efficiency, and we prove its fast convergence. We also prove the local convergence of the Newton method under the assumption o...
In causal discovery, non-Gaussianity has been used to characterize the complete configuration of a linear non-Gaussian acyclic model (LiNGAM), encompassing both the causal ordering of variables and their respective connection strengths. However, LiNGAM can only deal with the finite-dimensional case. To expand this concept, we extend the notion of v...
This study demonstrates that double descent can be mitigated by adding a dropout layer adjacent to the fully connected linear layer. The unexpected double-descent phenomenon garnered substantial attention in recent years, resulting in fluctuating prediction error rates as either sample size or model size increases. Our paper posits that the optimal...
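A minimal PyTorch sketch (assuming PyTorch is available) of the placement described above, i.e., a dropout layer adjacent to the final fully connected linear layer; the layer sizes and dropout rate are arbitrary illustrations, not the paper's experimental settings.

```python
import torch.nn as nn

# Illustrative classifier: the nn.Dropout sits immediately before the last
# fully connected (linear) layer, which is the placement discussed above.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 2048),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # dropout adjacent to the output linear layer
    nn.Linear(2048, 10),
)
print(model)
```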
We develop a method for estimating a simple matrix for a multidimensional item response theory model. Our proposed method allows each test item to correspond to a single latent trait, making the results easier to interpret. It also enables clustering of test items based on their corresponding latent traits. The basic idea of our approach is to use...
In this chapter, we will first review the basics of Bayesian statistics as a warm-up. In the latter half, assuming only that knowledge, we will describe the full picture of Watanabe’s Bayes theory. Here, we would like to avoid rigorous discussions and talk in an essay-like manner to grasp the overall picture. From now on, we will write...
Based on the introductory content of algebraic geometry learned in the previous chapter, this chapter delves into the core of Watanabe’s Bayesian theory. As learned in Chap. 5, the generalization to non-regular cases considers the range of \(B(\epsilon _n,\theta _*)\) even when there are multiple \(\theta _*\). In Watanabe’s Bayesia...
In this chapter, we examine the value of the real log canonical threshold \(\lambda \). First, assuming a known value of \(\lambda \), we evaluate the WBIC values for each \(\beta > 0\). Then, we determine the value of \(\lambda \) from the WBIC values obtained for different \(\beta > 0\). The value of \(\lambda \) is generally below d/2, and in th...
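For orientation, under Watanabe's asymptotics the tempered posterior mean of the empirical loss behaves as \(E_\beta[nL_n(\theta)] \approx nL_n(\theta_0) + \lambda/\beta\), so one common way to read off \(\lambda\) from WBIC values computed at two inverse temperatures \(\beta_1 \ne \beta_2\) is
\[
\hat{\lambda} \;=\; \frac{E_{\beta_1}[nL_n(\theta)] - E_{\beta_2}[nL_n(\theta)]}{1/\beta_1 - 1/\beta_2},
\]
which is a sketch of the idea rather than the chapter's exact procedure.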
In this chapter, we discuss information criteria such as AIC and BIC. In Watanabe’s Bayesian theory, new information criteria, such as WAIC and WBIC, are proposed. Existing information criteria assume that the true distribution is regular with respect to the statistical model, and they cannot be applied to general situations.
First, we define basic terms in Bayesian statistics, such as prior distribution, posterior distribution, marginal likelihood, and predictive distribution. Next, we define the true distribution q and the statistical model \(\{p(\cdot |\theta )\}_{\theta \in \Theta }\), and find the set of \(\theta \in \Theta \) that minimizes the Kullback–Leibler (K...
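For reference, the quantities referred to above are, in the continuous case (sums replace integrals for discrete variables),
\[
D(q\,\|\,p(\cdot\mid\theta)) = \int q(x)\,\log\frac{q(x)}{p(x\mid\theta)}\,dx,
\qquad
\Theta_* = \mathop{\arg\min}_{\theta\in\Theta} D(q\,\|\,p(\cdot\mid\theta)).
\]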
In Bayesian statistics, it is generally difficult to mathematically derive the posterior distribution, except in special cases. Instead, it is common to generate random numbers following the posterior distribution and perform integration calculations based on their frequency. In this chapter, we will discuss Markov Chain Monte Carlo (MCMC) methods,...
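As a concrete illustration of generating random numbers that follow a target distribution, here is a minimal random-walk Metropolis sketch in Python (NumPy only); the target, step size, and sample count are arbitrary, and this is not the chapter's code.

```python
import numpy as np

def metropolis(log_density, x0=0.0, n_samples=10_000, step=1.0, seed=0):
    """Random-walk Metropolis: draws samples whose empirical distribution
    approximates the (possibly unnormalized) target density."""
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + step * rng.standard_normal()
        # accept with probability min(1, target(proposal) / target(x))
        if np.log(rng.uniform()) < log_density(proposal) - log_density(x):
            x = proposal
        samples.append(x)
    return np.array(samples)

# Example: target proportional to a standard normal density
draws = metropolis(lambda t: -0.5 * t**2)
print(draws.mean(), draws.std())  # roughly 0 and 1
```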
In this paper, we propose new methods to efficiently solve convex optimization problems encountered in sparse estimation, which include a new quasi-Newton method that avoids computing the Hessian matrix and improves efficiency, and we prove its fast convergence. We also prove the local convergence of the Newton method under the assumption of strong...
In Bayesian statistics, it is generally difficult to mathematically derive the posterior distribution, except in special cases. Instead, it is common to generate random numbers following the posterior distribution and perform integration calculations based on their frequency. In this chapter, we will discuss Markov Chain Monte Carlo (MCMC) methods,...
First, we define basic terms in Bayesian statistics, such as prior distribution, posterior distribution, marginal likelihood, and predictive distribution. Next, we define the true distribution q and the statistical model \(\{p(\cdot |\theta )\}_{\theta \in \Theta }\), and find the set of \(\theta \in \Theta \) that minimizes the Kullback-Leibler (K...
In this chapter, we describe the mathematical knowledge necessary for understanding this book. First, we discuss matrices, open sets, closed sets, compact sets, the Mean Value Theorem, and Taylor expansions. All of these are topics covered in the first year of college. Next, we discuss absolute convergence and analytic functions. Then, we discuss t...
In this chapter, we clarify that AIC, TIC, and WAIC show almost the same performance when assuming regularity. In addition, we introduce WBIC, which corresponds to a generalization of BIC, and clarify that it shows similar performance when assuming regularity. Note that there is the following relationship among the propositions presented in this chapter...
In this chapter, we will first review the basics of Bayesian statistics as a warm-up. In the latter half, assuming only that knowledge, we will describe the full picture of Watanabe’s Bayes theory. Here, we would like to avoid rigorous discussions and talk in an essay-like manner to grasp the overall picture. From now on, we will write...
In this chapter, we examine the value of the real log canonical threshold \(\lambda \). First, assuming a known value of \(\lambda \), we evaluate the WBIC values for each \(\beta > 0\). Then, we determine the value of \(\lambda \) from the WBIC values obtained for different \(\beta > 0\). The value of \(\lambda \) is generally below d/2, and in the...
In this paper, we find and analyze that we can easily remove the double descent by adding only one dropout layer before the fully connected linear layer. The surprising double-descent phenomenon has drawn public attention in recent years, making the prediction error rise and drop as we increase either sample or model size. The current paper shows tha...
In sparse estimation, in which the sum of the loss function and the regularization term is minimized, methods such as the proximal gradient method and the proximal Newton method are applied. The former is slow to converge to a solution, while the latter converges quickly but is inefficient for problems such as group lasso. In this paper, w...
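For context, a generic proximal gradient (ISTA) sketch for the lasso objective \(\frac{1}{2}\|y-X\beta\|_2^2+\lambda\|\beta\|_1\); this is the baseline method discussed above, not the quasi-Newton or accelerated methods proposed in the paper.

```python
import numpy as np

def soft_threshold(z, tau):
    # proximal operator of tau * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(X, y, lam, n_iter=500):
    """Proximal gradient method for (1/2)||y - X b||^2 + lam * ||b||_1."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)
        b = soft_threshold(b - grad / L, lam / L)
    return b

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
beta_true = np.zeros(10)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.1 * rng.standard_normal(50)
print(np.round(ista(X, y, lam=1.0), 2))  # nonzeros concentrate on the first three coordinates
```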
In the original version of the book, the following corrections have been incorporated: The phrases “principle component” and “principle component analysis” have been changed to “principal component” and “principal component analysis”, respectively, at all occurrences in Chap. 10. The book and the chapter have been updated with the changes.
In this chapter, we introduce the concept of random variables X:E→R in an RKHS and discuss testing problems in RKHSs. In particular, we define a statistic and its null hypothesis for the two-sample problem and the corresponding independence test.
A stochastic process may be defined either as a sequence of random variables {Xt}t∈T, where T is a set of times, or as a function Xt(ω):T→R of ω∈Ω.
Thus far, we have learned that a feature map Ψ:E∋x↦k(x,·) is obtained by the positive definite kernel k:E×E→R. In this chapter, we generate a linear space H0 based on its image k(x,·)(x∈E) and construct a Hilbert space H by completing this linear space, where H is called the reproducing kernel Hilbert space (RKHS), which satisfies the reproducing p...
In sparse estimation, such as fused lasso and convex clustering, we apply either the proximal gradient method or the alternating direction method of multipliers (ADMM) to solve the problem. It takes time to include matrix division in the former case, while an efficient method such as FISTA (fast iterative shrinkage-thresholding algorithm) has been...
In data analysis and various information processing tasks, we use kernels to evaluate the similarities between pairs of objects. In this book, we deal with mathematically defined kernels called positive definite kernels. Let the elements x, y of a set E correspond to the elements (functions) \(\Psi (x), \Psi (y)\) of a linear space H called the rep...
When considering machine learning and data science issues, in many cases, the calculus and linear algebra courses taken during the first year of university provide sufficient background information. However, we require knowledge of metric spaces and their completeness, as well as linear algebra in non-finite dimensions, for kernels.
In Chapter 1, we learned that the kernel \(k(x,y)\in {\mathbb R}\) represents the similarity between two elements x, y in a set E. Chapter 3 described the relationships between a kernel k, its feature map \(E\ni x\mapsto k(x,\cdot )\in H\), and its reproducing kernel Hilbert space H.
In data analysis and various information processing tasks, we use kernels to evaluate the similarities between pairs of objects. In this book, we deal with mathematically defined kernels called positive definite kernels. Let the elements x, y of a set E correspond to the elements (functions) \(\Psi (x), \Psi (y)\) of a linear space H called the rep...
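A small numerical illustration of what positive definiteness means in practice: the Gram matrix of any finite set of points under a Gaussian kernel has nonnegative eigenvalues. The kernel choice and bandwidth below are arbitrary choices for the demo.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-|x - y|^2 / (2 sigma^2)) is a positive definite kernel on R
    return np.exp(-((x - y) ** 2) / (2 * sigma**2))

x = np.random.default_rng(0).standard_normal(20)
K = gaussian_kernel(x[:, None], x[None, :])   # Gram matrix K_ij = k(x_i, x_j)
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)                # True: K is positive semidefinite
```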
In Chapter 1, we learned that the kernel \(k(x,y)\in {\mathbb R}\) represents the similarity between two elements x, y in a set E.
When considering machine learning and data science issues, in many cases, the calculus and linear algebra courses taken during the first year of university provide sufficient background information. However, we require knowledge of metric spaces and their completeness, as well as linear algebra in non-finite dimensions, for kernels. If your major...
In this chapter, we introduce the concept of random variables \(X: E\rightarrow {\mathbb R}\) in an RKHS and discuss testing problems in RKHSs. In particular, we define a statistic and its null hypothesis for the two-sample problem and the corresponding independence test.
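A common statistic for the two-sample problem in an RKHS is the (squared) maximum mean discrepancy; below is a biased Gaussian-kernel estimate as a sketch. The chapter's exact statistic and the calibration of its null distribution may differ.

```python
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD between 1-D samples x and y."""
    kxx = gaussian_gram(x, x, sigma).mean()
    kyy = gaussian_gram(y, y, sigma).mean()
    kxy = gaussian_gram(x, y, sigma).mean()
    return kxx + kyy - 2 * kxy

rng = np.random.default_rng(0)
print(mmd2(rng.normal(0, 1, 200), rng.normal(0, 1, 200)))  # close to 0 (same distribution)
print(mmd2(rng.normal(0, 1, 200), rng.normal(1, 1, 200)))  # clearly positive
```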
A stochastic process may be defined either as a sequence of random variables \(\{X_t\}_{t\in T}\), where T is a set of times, or as a function \(X_t(\omega ): T\rightarrow {\mathbb R}\) of \(\omega \in \Omega \).
Thus far, we have learned that a feature map \(\Psi : E\ni x\mapsto k(x,\cdot )\) is obtained by the positive definite kernel \(k: E\times E\rightarrow {\mathbb R}\). In this chapter, we generate a linear space \(H_0\) based on its image \(k(x,\cdot )\) \((x\in E)\) and construct a Hilbert space H by completing this linear space, where H is called the...
We consider learning an undirected graphical model from sparse data. While several efficient algorithms have been proposed for graphical lasso (GL), the alternating direction method of multipliers (ADMM) is the main approach taken for joint graphical lasso (JGL). We propose proximal gradient procedures with and without a backtracking opti...
We consider biclustering that clusters both samples and features and propose efficient convex biclustering procedures. The convex biclustering algorithm (COBRA) solves the standard convex clustering problem, which contains a non-differentiable function optimization, twice. We instead convert the original optimization problem to a differentia...
In this chapter, we examine the problem of estimating the structure of the graphical model from observations. In the graphical model, each vertex is regarded as a variable, and edges express the dependency between them (conditional independence). In particular, assume a so-called sparse situation where the number of vertices is larger than the numb...
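As a usage-level illustration of sparse structure estimation for a Gaussian graphical model, a minimal scikit-learn graphical lasso call; the data and regularization level are arbitrary, and this is not the chapter's implementation.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))          # 100 samples, 5 variables
model = GraphicalLasso(alpha=0.2).fit(X)   # alpha controls the sparsity level
# Zeros in the precision matrix correspond to missing edges (conditional independence).
print(np.round(model.precision_, 2))
```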
Thus far, by Lasso, we mean making the estimated coefficients of some variables zero if the absolute values are significantly small for regression, classification, and graphical models. In this chapter, we consider dealing with matrices. Suppose that the given data take the form of a matrix, such as in image processing. We wish to approximate an im...
In this chapter, we consider the so-called generalized linear regression, which includes logistic regression (binary and multiple cases), Poisson regression, and Cox regression.
In general statistics, we often assume that the number of samples N is greater than the number of variables p. If this is not the case, it may not be possible to solve for the best-fitting regression coefficients using the least squares method, or it is too computationally costly to compare a total of \(2^p\) models using some information criterion...
Group Lasso is Lasso such that the variables are categorized into K groups \(k=1,\ldots ,K\). The \(p_k\) variables \(\theta _k=[\theta _{1,k},\ldots ,\theta _{p_k,k}]^T\in {\mathbb R}^{p_k}\) in the same group share the same times at which the nonzero coefficients become zeros when we increase the \(\lambda \) value.
In this chapter, we consider sparse estimation for the problems of multivariate analysis, such as principal component analysis and clustering.
This paper considers an extension of the linear non-Gaussian acyclic model (LiNGAM) that determines the causal order among variables from a dataset when the variables are expressed by a set of linear equations, including noise. In particular, we assume that the variables are binary. The existing LiNGAM assumes that no confounding is present, which...
This paper considers an extension of the linear non-Gaussian acyclic model (LiNGAM) that determines the causal order among variables from a dataset when the variables are expressed by a set of linear equations, including noise. In particular, we assume that the variables are binary. The existing LiNGAM assumes that no confounding is present, which...
Fused Lasso is the problem of finding, given observations \(y_1,\ldots ,y_N\), the \(\theta _1,\ldots ,\theta _N\) that minimize \(\frac{1}{2}\sum _{i=1}^N (y_i-\theta _i)^2 + \lambda \sum _{i=2}^N |\theta _i - \theta _{i-1}|\).
In this chapter, we consider sparse estimation for the problems of multivariate analysis, such as principal component analysis and clustering. There are two equivalent definitions of principal component analysis: finding orthogonal vectors that maximize the variance and finding a vector that minimizes the reconstruction error when the dimension i...
In this chapter, we examine the problem of estimating the structure of the graphical model from observations. In the graphical model, each vertex is regarded as a variable, and edges express the dependency between them (conditional independence). In particular, assume a so-called sparse situation where the number of vertices is larger than the num...
Group Lasso is Lasso such that the variables are categorized into K groups \(k=1,\ldots ,K\). The \(p_k\) variables \(\theta _k=[\theta _{1,k},\ldots ,\theta _{p_k,k}]^T\in {\mathbb R}^{p_k}\) in the same group share the same times at which the nonzero coefficients become zeros when we increase the \(\lambda \) value. This chapter considers groups...
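The mechanism behind all coefficients in a group becoming zero simultaneously is the group-wise soft-thresholding (proximal) operator; a sketch with an arbitrary \(\lambda\):

```python
import numpy as np

def group_soft_threshold(z, lam):
    """Proximal operator of lam * ||.||_2 applied to one group z in R^{p_k}:
    the whole group shrinks toward zero and becomes exactly zero when
    ||z||_2 <= lam, which is why grouped coefficients vanish together."""
    norm = np.linalg.norm(z)
    if norm <= lam:
        return np.zeros_like(z)
    return (1.0 - lam / norm) * z

print(group_soft_threshold(np.array([0.3, -0.4]), lam=1.0))  # [0. 0.]
print(group_soft_threshold(np.array([3.0, -4.0]), lam=1.0))  # shrunk but still nonzero
```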
In statistics, we assume that the number of samples N is larger than the number of variables p. Otherwise, linear regression will not produce any least squares solution, or it will find the optimal variable set by comparing the information criterion values of the 2^p subsets of the cardinality p. Therefore, it is difficult to estimate the parameters...
For regression, until now we have focused on only linear regression, but in this chapter, we will consider the nonlinear case where the relationship between the covariates and response is not linear. In the case of linear regression in Chap. 2, if there are p variables, we calculate p + 1 coefficients of the basis that consists of p + 1 functions 1...
In this chapter, we construct decision trees by estimating the relationship between the covariates and the response from observed data. Starting from the root, each vertex traces to either the left or right at each branch, depending on whether a condition w.r.t. the covariates is met, and finally reaches a terminal node to obtain the response. Comp...
Linear algebra is the basis of logic constructions in any science. In this chapter, we learn about inverse matrices, determinants, linear independence, vector spaces and their dimensions, eigenvalues and eigenvectors, orthonormal bases and orthogonal matrices, and diagonalizing symmetric matrices. In this book, to understand the essence concisely,...
Support vector machine is a method for classification and regression that draws an optimal boundary in the space of covariates (p dimension) when the samples (x1, y1), …, (xN, yN) are given. This is a method to maximize the minimum value over i = 1, …, N of the distance between xi and the boundary. This notion is generalized even if the samples are...
Until now, from the observed data, we have considered the following cases: build a statistical model and estimate the parameters contained in it, or estimate the statistical model itself. In this chapter, we consider the latter for linear regression. The act...
In this chapter, we consider the so-called generalized linear regression, which includes logistic regression (binary and multiple cases), Poisson regression, and Cox regression. We can formulate these problems in terms of maximizing the likelihood and solve them by applying the Newton method: differentiate the log-likelihood by the parameters to be...
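A compact Newton-method sketch for the binary logistic case described above (NumPy only, illustrative; the data-generating parameters below are arbitrary):

```python
import numpy as np

def logistic_newton(X, y, n_iter=20):
    """Newton's method for binary logistic regression: maximize the log-likelihood
    by iterating beta <- beta + (X^T W X)^{-1} X^T (y - p)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # predicted probabilities
        W = p * (1.0 - p)                     # Newton weights (Hessian diagonal)
        grad = X.T @ (y - p)
        hess = X.T @ (X * W[:, None])
        beta += np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.standard_normal(100)])
y = (rng.uniform(size=100) < 1 / (1 + np.exp(-(0.5 + 2 * X[:, 1])))).astype(float)
print(logistic_newton(X, y))  # roughly recovers intercept 0.5 and slope 2
```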
Thus far, by Lasso, we mean making the estimated coefficients of some variables zero if the absolute values are significantly small for regression, classification, and graphical models. In this chapter, we consider dealing with matrices. Suppose that the given data take the form of a matrix, such as in image processing. We wish to approximate an im...
In general statistics, we often assume that the number of samples N is greater than the number of variables p. If this is not the case, it may not be possible to solve for the best-fitting regression coefficients using the least squares method, or it is too computationally costly to compare a total of 2^p models using some information criterion.
Fitting covariate and response data to a line is referred to as linear regression. In this chapter, we introduce the least squares method for a single covariate (single regression) first and extend it to multiple covariates (multiple regression) later. Then, based on the statistical notion of estimating parameters from data, we fi...
Thus far, we have considered supervised learning from N observation data (x1, y1), …, (xN, yN), where y1, …, yN take either real values (regression) or a finite number of values (classification). In this chapter, we consider unsupervised learning, in which such a teacher does not exist, and the relations bet...
In this chapter, we consider constructing a classification rule from covariates to a response that takes values from a finite set such as ± 1 or the digits 0, 1, ⋯ , 9. For example, we wish to classify a postal code from handwritten characters and to make a rule between them. First, we consider logistic regression to minimize the error rate in the test...
Generally, there is not only one statistical model that explains a phenomenon. In that case, the more complicated the model, the easier it is for the statistical model to fit the data. However, we do not know whether the estimation result shows a satisfactory (prediction) performance for new data different from those used for the estimation. For ex...
We consider learning an undirected graphical model from sparse data. While several efficient algorithms have been proposed for graphical lasso (GL), the alternating direction method of multipliers (ADMM) is the main approach taken for joint graphical lasso (JGL). We propose proximal gradient procedures with and without a backtracking opt...
In machine learning and data science, we often consider efficiency for solving problems. In sparse estimation, such as fused lasso and convex clustering, we apply either the proximal gradient method or the alternating direction method of multipliers (ADMM) to solve the problem. It takes time to include matrix division in the former case, while an e...
We consider the problem of Bayesian network structure learning (BNSL) from data. In particular, we focus on the score‐based approach rather than the constraint‐based approach and address what score we should use for the purpose. The Bayesian Dirichlet equivalent uniform (BDeu) has been mainly used within the community of BNs (not outside of it). We...
The most crucial ability for machine learning and data science is not knowledge or experience but the mathematical logic needed to grasp their essence. This textbook approaches the essence of machine learning and data science by considering math problems and building Python programs.
As the preliminary part, Chapter 1 provides a concise introduction...
The most crucial ability for machine learning and data science is not knowledge or experience but the mathematical logic needed to grasp their essence. This textbook approaches the essence of sparse estimation by considering math problems and building Python programs.
Each chapter introduces the notion of sparsity and provides procedures followed by...
For regression, until now we have focused on only linear regression, but in this chapter, we will consider the nonlinear case where the relationship between the covariates and response is not linear. In the case of linear regression in Chap. 2, if there are p variables, we calculate \(p+1\) coefficients of the basis that consists of \(p+1\) functio...
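A sketch of computing the \(p+1\) coefficients for the polynomial basis \(1, x, \ldots, x^p\) by least squares; the basis, degree, and data below are illustrative choices.

```python
import numpy as np

def poly_basis_fit(x, y, p=3):
    """Least-squares fit of the coefficients for the basis 1, x, x^2, ..., x^p."""
    Phi = np.vander(x, N=p + 1, increasing=True)   # design matrix of basis functions
    coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(50)
print(poly_basis_fit(x, y, p=3))   # p + 1 = 4 coefficients
```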
Until now, from the observed data, we have considered the following cases: build a statistical model and estimate the parameters contained in it, or estimate the statistical model itself. In this chapter, we consider the latter for linear regression. The act of f...
Thus far, we have considered supervised learning from N observation data \((x_1, y_1), \ldots , (x_N, y_N)\), where \(y_1, \ldots , y_N\) take either real values (regression) or a finite number of values (classification). In this chapter, we consider unsupervised learning, in which such a teacher does not exist, and the relations between the N samp...
Linear algebra is the basis of logic constructions in any science. In this chapter, we learn about inverse matrices, determinants, linear independence, vector spaces and their dimensions, eigenvalues and eigenvectors, orthonormal bases and orthogonal matrices, and diagonalizing symmetric matrices. In this book, to understand the essence concisely,...
Generally, there is not only one statistical model that explains a phenomenon. In that case, the more complicated the model, the easier it is for the statistical model to fit the data. However, we do not know whether the estimation result shows a satisfactory (prediction) performance for new data different from those used for the estimation. For ex...
In this chapter, we consider constructing a classification rule from covariates to a response that takes values from a finite set such as \(\pm 1\) or the digits \(0,1,\ldots ,9\). For example, we wish to classify a postal code from handwritten characters and to make a rule between them. First, we consider logistic regression to minimize the error rate...
Fitting covariate and response data to a line is referred to as linear regression. In this chapter, we introduce the least squares method for a single covariate (single regression) first and extend it to multiple covariates (multiple regression) later. Then, based on the statistical notion of estimating parameters from data, we find the distributio...
In this chapter, we construct decision trees by estimating the relationship between the covariates and the response from the observed data. Starting from the root, each vertex traces to either the left or right at each branch, depending on whether a condition w.r.t. the covariates is met, and finally reaches a terminal node to obtain the response....
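As a usage sketch of the idea (branching on covariate conditions from the root down to a terminal node), a minimal scikit-learn example, used here only to illustrate the structure rather than the chapter's own construction; the dataset and depth are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# Each internal node tests one covariate; each leaf gives the predicted response.
print(export_text(tree))
print(tree.predict(X[:3]))
```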
In statistics, we assume that the number of samples N is larger than the number of variables p. Otherwise, linear regression will not produce any least squares solution, or it will find the optimal variable set by comparing the information criterion values of the \(2^p\) subsets of the cardinality p. Therefore, it is difficult to estimate the param...
Support vector machine is a method for classification and regression that draws an optimal boundary in the space of covariates (p dimension) when the samples \((x_1, y_1), \ldots , (x_N, y_N)\) are given. This is a method to maximize the minimum value over \(i = 1, \ldots , N\) of the distance between \(x_i\) and the boundary. This notion is genera...
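A short scikit-learn sketch of margin-based classification with a kernel, illustrating the generalization mentioned above; the dataset and parameters are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=0)
# The RBF kernel implicitly maps each x_i into a higher-dimensional space,
# where the margin (distance between the samples and the boundary) is maximized.
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(clf.score(X, y))
```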
The most crucial ability for machine learning and data science is not knowledge or experience but the mathematical logic needed to grasp their essence. This textbook approaches the essence of machine learning and data science by considering math problems and building R programs.
As the preliminary part, Chapter 1 provides a concise introduction to li...
This paper considers structure learning from data with n samples of p variables, assuming that the structure is a forest, using the Chow-Liu algorithm. Specifically, for incomplete data, we construct two model selection algorithms that complete in O(p²...
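For orientation, a generic Chow-Liu sketch over complete discrete data (pairwise empirical mutual information as edge weights, then a maximum-weight spanning tree via networkx); the paper's algorithms additionally handle incomplete data and select a forest rather than a full tree, which this sketch does not.

```python
import numpy as np
import networkx as nx
from sklearn.metrics import mutual_info_score

def chow_liu_tree(X):
    """Maximum-weight spanning tree whose edge weights are the pairwise
    empirical mutual information between the (discrete) columns of X."""
    n, p = X.shape
    G = nx.Graph()
    for i in range(p):
        for j in range(i + 1, p):
            G.add_edge(i, j, weight=mutual_info_score(X[:, i], X[:, j]))
    return nx.maximum_spanning_tree(G)

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 4))   # 500 samples of 4 binary variables
print(sorted(chow_liu_tree(X).edges()))
```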
We consider efficient Bayesian network structure learning (BNSL) based on scores using branch and bound. Thus far, as a BNSL score, the Bayesian Dirichlet equivalent uniform (BDeu) has been used most often, but it has recently been proved that the BDeu does not choose the simplest model even when the likelihood is maximized, whereas Jeffreys' prior and M...
In this paper, we derive the exact formula of Klein's fundamental 2-form of second kind for the so-called $C_{ab}$ curves. The problem was initially solved by Klein in the 19th century for the simple hyper-elliptic curves, but little progress had been seen for its extension for more than 100 years. Recently, it has been addressed by several authors...
In Bayesian network structure learning (BNSL), we need the prior probability over structures and parameters. If the former is the uniform distribution, the latter determines the correctness of BNSL. In this paper, we compare BDeu (Bayesian Dirichlet equivalent uniform) and Jeffreys' prior w.r.t. their consistency. When we seek a parent set $U$ of a...
This paper addresses the problem of efficiently finding an optimal Bayesian network structure for maximizing the posterior probability. In particular, we focus on the B&B strategy to save the computational effort associated with finding the largest score. To make the search more efficient, we need a tighter upper bound so that the current score ca...
This paper proposes an estimator of mutual information for both discrete and continuous variables and applies it to the Chow–Liu algorithm to find a forest that expresses probabilistic relations among them. The state-of-the-art assumes that the continuous variables are Gaussian and that the graphical model under discrete and continuous variables is...
This paper proposes a novel estimator of mutual information for discrete and continuous variables. The main feature of this estimator is that it is zero for a large sample size n if and only if the two variables are independent. The estimator can be used to construct several histograms, compute estimations of mutual information, and choose the maxi...
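For contrast, a naive plug-in estimate of mutual information from a single fixed 2-D histogram; the proposed estimator instead constructs several histograms and chooses among them, so this sketch does not have the stated consistency property.

```python
import numpy as np

def plugin_mi(x, y, bins=10):
    """Plug-in mutual information (in nats) from a fixed 2-D histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])))

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
print(plugin_mi(x, rng.standard_normal(1000)))         # near 0 for independent variables
print(plugin_mi(x, x + 0.1 * rng.standard_normal(1000)))  # clearly positive
```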
The Package Miura contains functions that compute divisor class group arithmetic for nonsingular curves. The package reduces computation in a divisor class group to that in the ideal class group via the isomorphism. The underlying quotient ring should be over the ideal given by a nonsingular curve in the form of Miura. Although computing the multip...
We consider the problem of learning a Bayesian network structure given n examples and the prior probability based on maximizing the posterior probability. We propose an algorithm that runs in O(n log n) time and that addresses continuous variables and discrete variables without assuming any class of distribution. We prove that the decision is stron...
We consider the problem of learning a Bayesian network structure given n examples and the prior probability based on maximizing the posterior probability. We propose an algorithm that runs in O(n log n) time and that addresses continuous variables and discrete variables without assuming any class of distribution. We prove that the decision is stron...
This paper proposes a new mutual information estimator for discrete and continuous variables, and constructs a forest based on the Chow-Liu algorithm. The state-of-the-art method assumes Gaussian and ANOVA for continuous and discrete/continuous cases, respectively. Given data, the proposed algorithm constructs several pairs of quantizers for X and Y su...
This paper addresses the problem of efficiently finding an optimal Bayesian network structure w.r.t. maximizing the posterior probability and minimizing the description length. In particular, we focus on the branch and bound strategy to save computational effort. To obtain an efficient search, a larger lower bound of the score is required (when we...
This volume constitutes the refereed proceedings of the Second International Workshop on Advanced Methodologies for Bayesian Networks, AMBN 2015, held in Yokohama, Japan, in November 2015.
The 18 revised full papers and 6 invited abstracts presented were carefully reviewed and selected from numerous submissions. In the International Workshop on Ad...