Yingying Fan

Minzu University of China · Department of Information Engineering


Publications: 37
Reads: 4,018
Citations: 897


Publications (37)
Article
The weighted nearest neighbors (WNN) estimator has been popularly used as a flexible and easy-to-implement nonparametric tool for mean regression estimation. The bagging technique is an elegant way to form WNN estimators with weights automatically generated to the nearest neighbors (Steele, 2009; Biau et al., 2010); we name the resulting estimator...
Article
Full-text available
Dysbiosis of human gut microbiota has been reported in association with ulcerative colitis (UC) in both children and adults using either 16S rRNA gene or shotgun sequencing data. However, these studies used either 16S rRNA or metagenomic shotgun sequencing but not both. We sequenced feces samples from 19 pediatric UC and 23 healthy children ages be...
Article
Network data are prevalent in many contemporary big data applications in which a common interest is to unveil important latent links between different pairs of nodes. Yet a simple fundamental question of how to precisely quantify the statistical uncertainty associated with the identification of latent links still remains largely unexplored. In this...
Preprint
Full-text available
The framework of model-X knockoffs provides a flexible tool for exact finite-sample false discovery rate (FDR) control in variable selection. It also completely bypasses the use of conventional p-values, making it especially appealing in high-dimensional nonlinear models. Existing works have focused on the setting of independent and identically dis...
Preprint
This paper investigates the estimation and inference of the average treatment effect (ATE) using deep neural networks (DNNs) in the potential outcomes framework. Under some regularity conditions, the observed response can be formulated as the response of a mean regression problem with both the confounding variables and the treatment indicator as th...
Article
Significance: Although practically attractive with high prediction and classification power, complicated learning methods often lack interpretability and reproducibility, limiting their scientific usage. A useful remedy is to select truly important variables contributing to the response of interest. We develop a method for deep learning inference us...
Article
Full-text available
Identifying interaction effects is fundamentally important in many scientific discoveries and contemporary applications, but it is challenging since the number of pairwise interactions increases quadratically with the number of covariates and that of higher-order interactions grows even faster. Although there is a growing literature on interaction...
Article
Deep learning has benefited almost every aspect of modern big data applications. Yet its statistical properties still remain largely unexplored. It is commonly believed nowadays that deep neural networks (DNNs) benefit from representational learning. To gain some statistical insights into this, we design a simple simulation setting where we generat...
Article
Motivation: The rapid development of sequencing technologies has enabled us to generate a large number of metagenomic reads from genetic materials in microbial communities, making it possible to gain deep insights into understanding the differences between the genetic materials of different groups of microorganisms, such as bacteria, viruses, plasmi...
Preprint
As a flexible nonparametric learning tool, random forest has been widely applied to various real applications with appealing empirical performance, even in the presence of high-dimensional feature space. Unveiling the underlying mechanisms has led to some important recent theoretical results on consistency under the classical setting of fixed dimen...
Preprint
Network data is prevalent in many contemporary big data applications in which a common interest is to unveil important latent links between different pairs of nodes. Yet a simple fundamental question of how to precisely quantify the statistical uncertainty associated with the identification of latent links still remains largely unexplored. In this...
Article
Interpretability and stability are two important features that are desired in many contemporary big data applications arising in statistics, economics, and finance. While the former is enjoyed to some extent by many existing forecasting approaches, the latter in the sense of controlling the fraction of wrongly discovered features which can enhance...
Preprint
Characterizing the exact asymptotic distributions of high-dimensional eigenvectors for large structured random matrices poses important challenges yet can provide useful insights into a range of applications. To this end, in this paper we introduce a general framework of asymptotic theory of eigenvectors (ATE) for large structured symmetric random...
Article
Heterogeneity is often natural in many contemporary applications involving massive data. While posing new challenges to effective learning, it can play a crucial role in powering meaningful scientific discoveries through the integration of information among subpopulations of interest. In this paper, we exploit multiple networks with Gaussian graphs...
Preprint
Interpretability and stability are two important features that are desired in many contemporary big data applications arising in economics and finance. While the former is enjoyed to some extent by many existing forecasting approaches, the latter in the sense of controlling the fraction of wrongly discovered features which can enhance greatly the i...
Preprint
Deep learning has become increasingly popular in both supervised and unsupervised machine learning thanks to its outstanding empirical performance. However, because of their intrinsic complexity, most deep learning methods are largely treated as black box tools with little interpretability. Even though recent attempts have been made to facilitate t...
Preprint
Heterogeneous treatment effects are the center of gravity in many modern causal inference applications. In this paper, we investigate the estimation and inference of heterogeneous treatment effects with precision in a general nonparametric setting. To this end, we enhance the classical k-nearest neighbor method with a simple algorithm, extend it...
Article
Full-text available
Power and reproducibility are key to enabling refined scientific discoveries in contemporary big data applications with general high-dimensional nonlinear models. In this paper, we provide theoretical foundations on the power and robustness for the model-free knockoffs procedure introduced recently in Candès, Fan, Janson and Lv (2016) in high-d...
Article
Full-text available
Evaluating the joint significance of covariates is of fundamental importance in a wide range of applications. To this end, p-values are frequently employed and produced by algorithms that are powered by classical large-sample asymptotic theory. It is well known that the conventional p-values in Gaussian linear model are valid even when the dimensio...
Article
Many modern big data applications feature large scale in both numbers of responses and predictors. Better statistical efficiency and scientific insights can be enabled by understanding the large-scale response-predictor association network structures via layers of sparse latent factors ranked by importance. Yet sparsity and orthogonality have been...
Article
Full-text available
Feature interactions can contribute to a large proportion of variation in many prediction models. In the era of big data, the coexistence of high dimensionality in both responses and covariates poses unprecedented challenges in identifying important interactions. In this paper, we suggest a two-stage interaction identification method, called the in...
Article
A common problem in modern statistical applications is to select, from a large set of candidates, a subset of variables which are important for determining an outcome of interest. For instance, the outcome may be disease status and the variables may be hundreds of thousands of single nucleotide polymorphisms on the genome. For data coming from low-...
Article
Heterogeneity is often natural in many contemporary applications involving massive data. While posing new challenges to effective learning, it can play a crucial role in powering meaningful scientific discoveries through the understanding of important differences among subpopulations of interest. In this paper, we exploit multiple networks with Gau...
Article
Full-text available
Understanding how features interact with each other is of paramount importance in many scientific discoveries and contemporary applications. Yet interaction identification becomes challenging even for a moderate number of covariates. In this paper, we suggest an efficient and flexible procedure, called the interaction pursuit (IP), for interaction...
Article
Large-scale precision matrix estimation is of fundamental importance yet challenging in many contemporary applications for recovering Gaussian graphical models. In this paper, we suggest a new approach of innovated scalable efficient estimation (ISEE) for estimating large precision matrix. Motivated by the innovated transformation, we convert the o...
Article
We suggest a new method, called "Functional Additive Regression", or FAR, for efficiently performing high dimensional functional regression. FAR extends the usual linear regression model involving a functional predictor, X(t), and a scalar response, Y , in two key respects. First, FAR uses a penalized least squares optimization approach to efficien...
Article
Full-text available
This paper is concerned with the problems of interaction screening and nonlinear classification in a high-dimensional setting. We propose a two-step procedure, IIS-SQDA, where in the first step an innovated interaction screening (IIS) approach based on transforming the original p-dimensional feature vector is proposed, and in the second step a spar...
Article
While functional regression models have received increasing attention recently, most existing approaches assume both a linear relationship and a scalar response variable. We suggest a new method, "Functional Response Additive Model Estimation" (FRAME), which extends the usual linear regression model to situations involving both functional predictor...
Article
Two important goals of high-dimensional modelling are prediction and variable selection. In this article, we consider regularization with combined L1 and concave penalties, and study the sampling properties of the global optimum of the suggested method in ultrahigh-dimensional settings. The L1 penalty provides the minimum regularization needed for...
Article
High dimensional sparse modelling via regularization provides a powerful tool for analysing large-scale data sets and obtaining meaningful interpretable models. The use of non-convex penalty functions shows advantage in selecting important features in high dimensions, but the global optimality of such methods still demands more understanding. We co...
Article
High-dimensional data analysis has motivated a spectrum of regularization methods for variable selection and sparse modeling, with two popular methods being convex and concave ones. A long debate has taken place on whether one class dominates the other, an important question both in theory and to practitioners. In this article, we characterize the...
Article
Full-text available
Consider a two-class classification problem when the number of features is much larger than the sample size. The features are masked by Gaussian noise with zero means where the precision matrix (i.e., inverse of the covariance matrix) is unknown but is presumably sparse. The useful features, also unknown, are sparse and each contributes weakly (i.e...
