Yishu Xue

Google Inc. | Google

Doctor of Philosophy


35 Publications
3,618 Reads
273 Citations
Introduction
big data analytics; spatially varying coefficient models; model selection


Publications (35)
Article
The most popular regression model for the analysis of time-to-event data is the Cox proportional hazards model. While the model specifies a parametric relationship between the hazard function and the predictor variables, there is no specification regarding the form of the baseline hazard function. A critical assumption of the Cox model, however, is...
Article
The Cox model, which remains the first choice for analyzing time‐to‐event data even for large datasets, relies on the proportional hazards (PH) assumption. When survival data arrive sequentially in chunks, a fast and minimally storage-intensive approach to testing the PH assumption is desirable. We propose an online updating approach that updates th...
Article
The Cox proportional hazards model is one of the most popular tools for analyzing time‐to‐event data in public health studies. When outcomes observed in clinical data from different regions yield a varying pattern correlated with location, it is often of great interest to investigate spatially varying effects of covariates. In this paper, we propose...
Preprint
Full-text available
Most of the existing spatial clustering literature discusses clustering algorithms for spatial responses. In this paper, we consider a Bayesian clustered regression for spatially dependent data in order to detect clusters in the covariate effects. Our proposed method is based on the Dirichlet process, which provides a probabilistic framework for simultane...
Preprint
Full-text available
An income distribution describes how an entity's total wealth is distributed amongst its population. In economics, the Lorenz curve is a well-known functional representation of income distribution. Clustering of Lorenz curves based on both their similarities and spatial adjacencies is motivated by examining the household incomes in each state from...
Article
Full-text available
Growth curve analysis (GCA) has a wide range of applications in various fields where growth trajectories need to be modeled. Heteroscedasticity is often present in the error term, which cannot be handled with sufficient flexibility by standard linear fixed or mixed-effects models. One situation that has been addressed is where the error variance i...
Preprint
Full-text available
We propose a multidimensional tensor clustering approach for studying how professional basketball players' shooting patterns vary over court locations and game time. Unlike most existing methods that only study continuous-valued tensors or have to assume the same cluster structure along different tensor directions, we propose a Bayesian nonparametr...
Article
Although basketball is a dynamic process sport, played between two sides of five players each, learning some static information is essential for professional players, coaches, and team managers. In order to have a deep understanding of field goal attempts among different players, we propose a zero‐inflated Poisson model with clustered regression co...
Article
In regional economics research, a problem of interest is to detect similarities between regions, and estimate their shared coefficients in economics models. In this article, we propose a mixture of finite mixtures clustered regression model with auxiliary covariates that account for similarities in demographic or economic characteristics over a spa...
Chapter
The cumulative hazard function plays an important role not only in survival analysis in biostatistical applications, but also in many other fields including finance and reliability analysis. When the data size exceeds the computer memory, many traditional nonparametric approaches for estimating and testing the cumulative hazard function are not app...
Article
In this paper, we propose a Susceptible–Infected–Removal (SIR) model with time fused coefficients. In particular, our proposed model discovers the underlying time homogeneity pattern for the SIR model's transmission rate and removal rate via Bayesian shrinkage priors. MCMC sampling for the proposed method is facilitated by the nimble package in R....
Preprint
Full-text available
Although basketball is a dynamic process sport, with 5 plus 5 players competing on both offense and defense simultaneously, learning some static information is predominant for professional players, coaches and team managers. In order to have a deep understanding of field goal attempts among different players, we propose a zero inflated Poisson model...
Article
In this paper, we develop a group learning approach to analyze the underlying heterogeneity structure of shot selection among professional basketball players in the NBA. We propose a mixture of finite mixtures (MFM) model to capture the heterogeneity of shot selection among different players based on the log Gaussian Cox process (LGCP). Our proposed me...
Article
The accelerated failure time (AFT) model is a commonly used tool in analyzing survival data. In public health studies, data is often collected from medical service providers in different locations. Survival rates from different locations often present geographically varying patterns. In this paper, we focus on the accelerated failure time model wit...
Article
The geographically weighted regression (GWR) is a well-known statistical approach to explore spatial non-stationarity of the regression relationship in spatial data analysis. In this paper, we discuss a Bayesian recourse of GWR. Bayesian variable selection based on spike-and-slab prior, bandwidth selection based on range prior, and model assessment...
Preprint
Full-text available
In this paper, we propose a Susceptible-Infected-Removal (SIR) model with time fused coefficients. In particular, our proposed model discovers the underlying time homogeneity pattern for the SIR model's transmission rate and removal rate via Bayesian shrinkage priors. The properties of the proposed models are examined and a Markov chain Monte Carlo...
Article
In economic development there are often regions that share similar socioeconomic characteristics, and econometrics models on such regions tend to produce similar covariate effect estimates. This paper proposes a Bayesian clustered regression for spatially dependent data in order to detect clusters in covariate effects. The proposed method is based...
Preprint
Full-text available
The geographically weighted regression (GWR) is a well-known statistical approach to explore spatial non-stationarity of the regression relationship in spatial data analysis. In this paper, we discuss a Bayesian recourse of GWR. Bayesian variable selection based on spike-and-slab prior, bandwidth selection based on range prior, and model assessment...
Preprint
Full-text available
In this paper, we develop a group learning approach to analyze the underlying heterogeneity structure of shot selection among professional basketball players in the NBA. We propose a mixture of finite mixtures (MFM) model to capture the heterogeneity of shot selection among different players based on the log Gaussian Cox process (LGCP). Our proposed me...
Preprint
Full-text available
In regional economics research, a problem of interest is to detect similarities between regions, and estimate their shared coefficients in economics models. In this article, we propose a mixture of finite mixtures (MFM) clustered regression model with auxiliary covariates that account for similarities in demographic or economic characteristics over...
Preprint
Full-text available
The accelerated failure time (AFT) model is a commonly used tool in analyzing survival data. In public health studies, data is often collected from medical service providers in different locations. Survival rates from different locations often present geographically varying patterns. In this paper, we focus on the accelerated failure time model wit...
Article
The Cox model is the most popular tool for analyzing time-to-event data. The nonparametric baseline hazard function can be as important as the regression coefficients in practice, especially when prediction is needed. In the context of stochastic process control, we propose a simultaneous monitoring method that combines a multivariate control chart...
Preprint
Full-text available
Bayesian spatial modeling of heavy-tailed distributions has become increasingly popular in various areas of science in recent decades. We propose a Weibull regression model with spatial random effects for analyzing extreme economic loss. Model estimation is facilitated by a computationally efficient Bayesian sampling algorithm utilizing the multiva...
Preprint
Full-text available
The Cox proportional hazards model is one of the most popular tools for analyzing time-to-event data in public health studies. When outcomes observed in clinical data from different regions yield a varying pattern correlated with location, it is often of great interest to investigate spatially varying effects of covariates. In this paper, we propose...
Article
The generalized information criterion (GIC) is an important tool for model selection in statistical inference. In the big data setting, traditional GIC cannot be calculated when the data size exceeds the computer memory. We propose an online updating approach to calculate the GIC, and perform model selection for huge datasets. Specifically, we defi...
Article
Objective: To improve on existing methods to infer race/ethnicity in health care data through an analysis of birth records from Connecticut. Data Source: A total of 162,467 Connecticut birth records from 2009 to 2013. Study Design: We developed a logistic model to predict race/ethnicity using data from US Census and patient‐level information. Model...
Article
While studies of the proportional hazards model for big survival data mainly focus on speeding up computation and selecting features from a huge number of covariates, verifying the crucial assumption of proportional hazards (PH) has not been tackled for big data when the data size exceeds a computer’s memory. This dissertation summarizes methodolog...
Article
Income distribution is a major concern in economic theory. In regional economics, it is often of interest to compare income distributions in different regions. Traditional methods often compare the income inequality of different regions by assuming parametric forms of the income distributions, or using summary statistics like the Gini coefficient....
Preprint
Full-text available
The Cox model, which remains the first choice for analyzing time-to-event data even for large datasets, relies on the proportional hazards assumption. When the data size exceeds the computer memory, the standard statistics for testing the proportional hazards assumption can no longer be easily calculated. We propose an online updating approach...
