Jason M. Klusowski

Princeton University · Department of Operations Research and Financial Engineering

PhD in Statistics and Data Science

About

24 Publications · 3,089 Reads · 152 Citations

Publications (24)
Preprint
Sorted $\ell_1$ regularization has been incorporated into many methods for solving high-dimensional statistical estimation problems, including the SLOPE estimator in linear regression. In this paper, we study how this relatively new regularization technique improves variable selection by characterizing the optimal SLOPE trade-off between the false discovery...
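As a point of reference, here is a minimal sketch of the sorted-$\ell_1$ penalty underlying SLOPE, written directly from the description above; the function name and example values are illustrative, not taken from the paper.

    # Sorted-l1 (SLOPE) penalty: sum_i lam[i] * |beta|_(i), where |beta|_(1)
    # is the largest entry in absolute value and lam is non-increasing.
    import numpy as np

    def slope_penalty(beta, lam):
        abs_sorted = np.sort(np.abs(beta))[::-1]  # |beta| in decreasing order
        return float(np.dot(lam, abs_sorted))

    beta = np.array([0.5, -2.0, 0.0, 1.0])
    lam = np.array([1.0, 0.75, 0.5, 0.25])  # larger coefficients draw larger penalties
    print(slope_penalty(beta, lam))  # 2.0*1.0 + 1.0*0.75 + 0.5*0.5 = 3.0
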
Preprint
Full-text available
This paper shows that decision trees constructed with Classification and Regression Trees (CART) methodology are universally consistent for additive models, even when the dimensionality scales exponentially with the sample size, under certain $\ell_1$ sparsity constraints. The consistency is universal in the sense that there are no a priori assumptions...
Preprint
Full-text available
Decision trees and their ensembles are endowed with a rich set of diagnostic tools for ranking and screening variables in a predictive model. Despite the widespread use of tree-based variable importance measures, pinning down their theoretical properties has been challenging and therefore largely unexplored. To address this gap between theory and practice...
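For concreteness, the impurity-based importance ranking the abstract refers to can be computed with scikit-learn's feature_importances_ attribute; this snippet illustrates the measure itself, not the paper's theoretical analysis.

    # Rank variables by mean decrease in impurity across a random forest.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = 2.0 * X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=500)  # only features 0 and 1 matter

    forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    print(forest.feature_importances_)  # importance mass concentrates on features 0 and 1
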
Article
SLOPE is a relatively new convex optimization procedure for high-dimensional linear regression via the sorted $\ell_1$ penalty: the larger the rank of the fitted coefficient, the larger the penalty. This non-separable penalty renders many existing techniques invalid or inconclusive in analyzing the SLOPE solution. In this paper, we develop an...
Preprint
Full-text available
Within the machine learning community, the widely-used uniform convergence framework seeks to answer the question of how complex models such as modern neural networks can generalize well to new data. This approach bounds the test error of the worst-case model one could have fit to the data, which presents fundamental limitations. In this paper...
Preprint
Full-text available
Decision trees with binary splits are popularly constructed using Classification and Regression Trees (CART) methodology. For regression models, this approach recursively divides the data into two near-homogeneous daughter nodes according to a split point that maximizes the reduction in sum of squares error (the impurity) along a particular variable...
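A minimal sketch of the split search described above, for a single variable; the function name is illustrative, and real CART implementations use an equivalent but faster incremental update.

    # Find the split point maximizing the reduction in sum of squares error.
    import numpy as np

    def best_split(x, y):
        order = np.argsort(x)
        x, y = x[order], y[order]
        total_sse = np.sum((y - y.mean()) ** 2)
        best_point, best_gain = None, -np.inf
        for i in range(1, len(y)):
            left, right = y[:i], y[i:]
            sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if total_sse - sse > best_gain:  # impurity reduction at this split
                best_point, best_gain = (x[i - 1] + x[i]) / 2.0, total_sse - sse
        return best_point, best_gain

    rng = np.random.default_rng(1)
    x = rng.uniform(size=200)
    y = (x > 0.5).astype(float) + 0.1 * rng.normal(size=200)
    print(best_split(x, y))  # split point near 0.5
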
Preprint
Full-text available
Classical results on the statistical complexity of linear models have commonly identified the norm of the weights $\|w\|$ as a fundamental capacity measure. Generalizations of this measure to the setting of deep networks have been varied, though a frequently identified quantity is the product of weight norms of each layer. In this work, we show that...
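To make the capacity measure concrete, the per-layer product can be computed as below; the choice of Frobenius norm is an assumption for illustration, since the abstract does not fix which norm is taken on each layer.

    # Product of layer weight norms for a toy three-layer network.
    import numpy as np

    def layer_norm_product(weights):
        return float(np.prod([np.linalg.norm(W) for W in weights]))  # Frobenius norms

    rng = np.random.default_rng(0)
    weights = [rng.normal(size=(16, 8)), rng.normal(size=(8, 4)), rng.normal(size=(4, 1))]
    print(layer_norm_product(weights))
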
Preprint
Full-text available
SLOPE is a relatively new convex optimization procedure for high-dimensional linear regression via the sorted $\ell_1$ penalty: the larger the rank of the fitted coefficient, the larger the penalty. This non-separable penalty renders many existing techniques invalid or inconclusive in analyzing the SLOPE solution. In this paper, we develop an asymptotically...
Preprint
Full-text available
Decision trees with binary splits are popularly constructed using Classification and Regression Trees (CART) methodology. For binary classification and regression models, this approach recursively divides the data into two near-homogeneous daughter nodes according to a split point that maximizes the reduction in sum of squares error (the impurity) along...
Article
We clarify that Hölder's inequality can be stated more generally than is often realized. This is an immediate consequence of an analogous information-theoretic identity which we call Hölder's identity. We also explain Andrew R. Barron's original use of the identity.
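For reference, the classical statement being generalized (the paper's more general form and Hölder's identity itself are not reproduced here):

    % Hölder's inequality for p, q > 1 with 1/p + 1/q = 1:
    \mathbb{E}\,|XY| \le \left(\mathbb{E}\,|X|^p\right)^{1/p}\left(\mathbb{E}\,|Y|^q\right)^{1/q}
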
Preprint
Full-text available
For any ReLU network there is a representation in which the sum of the absolute values of the weights into each node is exactly $1$, and the input layer variables are multiplied by a value $V$ coinciding with the total variation of the path weights. Implications are given for Gaussian complexity, Rademacher complexity, statistical risk, and metric entropy...
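A numerical check of the rescaling for a one-hidden-layer ReLU network, relying on positive homogeneity, ReLU(c z) = c ReLU(z) for c > 0; the names W1 and w2 are illustrative, and the paper's construction covers general depth.

    import numpy as np

    relu = lambda z: np.maximum(z, 0.0)

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(6, 3))   # weights into 6 hidden nodes, 3 inputs
    w2 = rng.normal(size=6)        # output weights
    x = rng.normal(size=3)

    s = np.abs(W1).sum(axis=1)     # l1 norm of the weights into each hidden node
    W1_hat = W1 / s[:, None]       # each row now sums to 1 in absolute value
    v = w2 * s                     # absorb the scales into the output layer
    V = np.abs(v).sum()            # total variation of the path weights
    w2_hat = v / V                 # output weights now sum to 1 in absolute value

    f_original = w2 @ relu(W1 @ x)
    f_rescaled = V * (w2_hat @ relu(W1_hat @ x))
    print(np.isclose(f_original, f_rescaled))  # True
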
Preprint
Full-text available
It has been experimentally observed in recent years that multi-layer artificial neural networks have a surprising ability to generalize, even when trained with far more parameters than observations. Is there a theoretical basis for this? The best available bounds on their metric entropy and associated complexity measures are essentially linear in t...
Preprint
Full-text available
Random forests have become an important tool for improving accuracy in regression and classification problems since their inception by Leo Breiman in 2001. In this paper, we revisit a historically important random forest model originally proposed by Breiman in 2004 and later studied by Gérard Biau in 2012, where a feature is selected at random and...
Preprint
Full-text available
A popular class of problem in statistics deals with estimating the support of a density from $n$ observations drawn at random from a $d$-dimensional distribution. The one-dimensional case reduces to estimating the end points of a univariate density. In practice, an experimenter may only have access to a noisy version of the original data. Therefore...
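A toy illustration of why noise complicates endpoint estimation: the sample extremes of the noiseless data approach the support boundary from inside, while additive noise pushes them past it (the paper's corrected estimator is not reproduced here).

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, size=10_000)   # true support: [-1, 1]
    y = x + 0.1 * rng.normal(size=x.size)     # noisy observations

    print(x.min(), x.max())  # just inside -1 and 1
    print(y.min(), y.max())  # overshoot the true endpoints
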
Article
Full-text available
Applied researchers often construct a network from a random sample of nodes in order to infer properties of the parent network. Two of the most widely used sampling schemes are subgraph sampling, where we sample each vertex independently with probability $p$ and observe the subgraph induced by the sampled vertices, and neighborhood sampling, where...
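A minimal sketch of the two schemes, written from the definitions above for a graph stored as adjacency sets; because the sentence is truncated, the neighborhood-sampling variant here follows the common convention of observing each sampled vertex's full neighborhood, which is an assumption.

    import random

    def subgraph_sample(adj, p, rng=random):
        # Keep each vertex independently with probability p; observe the induced subgraph.
        kept = {v for v in adj if rng.random() < p}
        return {v: adj[v] & kept for v in kept}

    def neighborhood_sample(adj, p, rng=random):
        # Assumed variant: observe every edge incident to a sampled vertex.
        kept = {v for v in adj if rng.random() < p}
        return {v: set(adj[v]) for v in kept}

    adj = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}
    print(subgraph_sample(adj, 0.5))
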
Article
Full-text available
Learning properties of large graphs from samples has been an important problem in statistical network analysis since the early work of Goodman (1949) and Frank (1978). We revisit a problem formulated by Frank (1978) of estimating the number of connected components in a large graph based on the subgraph sampling model...
Article
Full-text available
The MDL two-part coding $\textit{index of resolvability}$ provides a finite-sample upper bound on the statistical risk of penalized likelihood estimators over countable models. However, the bound does not apply to unpenalized maximum likelihood estimation or procedures with exceedingly small penalties. In this paper, we point out a more general i...
Article
Full-text available
We give convergence guarantees for estimating the coefficients of a symmetric mixture of two linear regressions by expectation maximization (EM). In particular, if the initializer has a large cosine angle with the population coefficient vector and the signal to noise ratio (SNR) is large, a sample-splitting version of the EM algorithm converges to...
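A minimal sketch of plain EM for this symmetric mixture, y = r · x'β* + noise with the hidden sign r uniform on {−1, +1}; the paper's sample-splitting refinement is omitted, the noise level is assumed known, and the initializer is taken near the truth in line with the stated condition.

    import numpy as np

    def em_symmetric_mixture(X, y, beta0, sigma=1.0, iters=50):
        beta = beta0.copy()
        XtX_inv = np.linalg.inv(X.T @ X)
        for _ in range(iters):
            # E-step: E[r_i | data] = 2*P(r_i = +1) - 1 = tanh(y_i * x_i'beta / sigma^2)
            w = np.tanh(y * (X @ beta) / sigma**2)
            # M-step: weighted least squares collapses to OLS on the
            # sign-adjusted responses w * y
            beta = XtX_inv @ X.T @ (w * y)
        return beta

    rng = np.random.default_rng(0)
    n, d = 2000, 5
    X = rng.normal(size=(n, d))
    beta_star = np.array([2.0, -1.0, 0.5, 0.0, 1.0])
    r = rng.choice([-1.0, 1.0], size=n)
    y = r * (X @ beta_star) + 0.5 * rng.normal(size=n)
    beta_hat = em_symmetric_mixture(X, y, beta_star + 0.5 * rng.normal(size=d), sigma=0.5)
    print(np.round(beta_hat, 2))  # close to beta_star, up to a global sign flip
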
Article
Full-text available
Estimation of functions of $d$ variables is considered using ridge combinations of the form $\sum_{k=1}^m c_{1,k}\phi(\sum_{j=1}^d c_{0,j,k}x_j - b_k)$ where the activation function $\phi$ is a function with bounded value and derivative. These include single-hidden layer neural networks, polynomials, and sinusoidal models...
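Evaluating the ridge combination exactly as written above; np.tanh stands in for the bounded-value, bounded-derivative activation $\phi$.

    import numpy as np

    def ridge_combination(x, c1, c0, b, phi=np.tanh):
        # f(x) = sum_k c1[k] * phi(sum_j c0[j, k] * x[j] - b[k])
        return float(c1 @ phi(c0.T @ x - b))

    rng = np.random.default_rng(0)
    d, m = 4, 3
    x = rng.normal(size=d)
    print(ridge_combination(x, rng.normal(size=m), rng.normal(size=(d, m)), rng.normal(size=m)))
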
Article
Full-text available
Recently, a general method for analyzing the statistical accuracy of the EM algorithm has been developed and applied to some simple latent variable models [Balakrishnan et al. 2016]. In that method, the basin of attraction for valid initialization is required to be a ball around the truth. Using Stein's Lemma, we extend these results in the case of...
Article
Full-text available
We establish $L^{\infty}$ and $L^2$ error bounds for functions of many variables that are approximated by linear combinations of ReLU (rectified linear unit) and squared ReLU ridge functions with $\ell^1$ and $\ell^0$ controls on their inner and outer parameters. With the squared ReLU ridge function, we show that the $L^2$ approximation error...
Article
Full-text available
Let $f^{\star}$ be a function on $\mathbb{R}^d$ satisfying a spectral norm condition. For various noise settings, we show that $\mathbb{E}\|\hat{f} - f^{\star}\|^2 \leq v_{f^{\star}}\left(\frac{\log d}{n}\right)^{1/4}$, where $n$ is the sample size and $\hat{f}$ is either a penalized least squares estimator or a greedily obtained version...
