ABSTRACT: We describe a new approach to analyzing chirp syllables of free-tailed bats from two regions of Texas where they are predominant: Austin and College Station. Our goal is to characterize any systematic regional differences in the mating chirps and to assess whether individual bats have signature chirps. The data are analyzed by modeling spectrograms of the chirps as responses in a Bayesian functional mixed model. Given the variable chirp lengths, we compute the spectrograms on a relative time scale interpretable as the relative chirp position, using a variable window overlap based on chirp length. We use 2D wavelet transforms to capture correlation within the spectrogram in our modeling and to obtain adaptive regularization of the estimates and inference for the region-specific spectrograms. Our model includes random effect spectrograms at the bat level to account for correlation among chirps from the same bat, and to assess relative variability in chirp spectrograms within and between bats. The modeling of spectrograms using functional mixed models is a general approach for the analysis of replicated nonstationary time series, such as our acoustical signals, to relate aspects of the signals to various predictors while accounting for between-signal structure. This can be done on raw spectrograms when all signals are of the same length, and on spectrograms defined on a relative time scale for signals of variable length in settings where defining correspondence across signals based on relative position is sensible.
Journal of the American Statistical Association 06/2013; 108(502):514-526. DOI:10.1080/01621459.2013.793118 · 1.98 Impact Factor
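The variable-length alignment can be sketched in a few lines: compute a spectrogram, then map its time axis onto a common relative-time grid in [0, 1] so chirps of different lengths become comparable. This interpolation-based version is a simplified stand-in for the paper's variable-overlap construction; the function and parameter names are hypothetical.

```python
import numpy as np
from scipy import signal

def relative_time_spectrogram(x, fs, n_rel_bins=50, nperseg=64):
    """Spectrogram with its time axis mapped onto a fixed grid of
    relative chirp positions in [0, 1] (hypothetical helper; the
    paper instead varies the window overlap with chirp length)."""
    f, t, S = signal.spectrogram(x, fs=fs, nperseg=nperseg,
                                 noverlap=nperseg // 2)
    rel_t = t / t[-1]                       # relative chirp position
    grid = np.linspace(0.0, 1.0, n_rel_bins)
    # interpolate each frequency row onto the common relative grid
    S_rel = np.vstack([np.interp(grid, rel_t, row) for row in S])
    return f, grid, S_rel

# two chirps of different lengths yield spectrograms of equal shape
fs = 10_000
short = np.sin(2 * np.pi * 1000 * np.arange(2000) / fs)
long_ = np.sin(2 * np.pi * 1000 * np.arange(5000) / fs)
_, _, A = relative_time_spectrogram(short, fs)
_, _, B = relative_time_spectrogram(long_, fs)
assert A.shape == B.shape
```

Equal-shape spectrograms are what allows the chirps to enter a functional mixed model as responses on a common domain.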
ABSTRACT: When employing model selection methods with oracle properties such as the smoothly clipped absolute deviation (SCAD) and the Adaptive Lasso, it is typical to estimate the smoothing parameter by m-fold cross-validation, for example, m = 10. In problems where the true regression function is sparse and the signals large, such cross-validation typically works well. However, in regression modeling of genomic studies involving Single Nucleotide Polymorphisms (SNP), the true regression functions, while thought to be sparse, do not have large signals. We demonstrate empirically that in such problems, the number of selected variables using SCAD and the Adaptive Lasso, with 10-fold cross-validation, is a random variable that has considerable and surprising variation. Similar remarks apply to non-oracle methods such as the Lasso. Our study strongly questions the suitability of performing only a single run of m-fold cross-validation with any oracle method, and not just the SCAD and Adaptive Lasso.
The American Statistician 11/2011; 65(4):223-228. DOI:10.1198/tas.2011.11052 · 0.92 Impact Factor
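The reported variability is easy to reproduce in miniature with an ordinary Lasso, rerunning 10-fold cross-validation under different random fold splits and counting selected variables each time. Here sklearn's LassoCV stands in for SCAD and the Adaptive Lasso, and the simulation settings are illustrative, not the paper's.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, p = 100, 60
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 0.25                       # sparse truth with weak signals
y = X @ beta + rng.standard_normal(n)

# repeat 10-fold cross-validation with different random fold splits
counts = []
for seed in range(10):
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    fit = LassoCV(cv=cv).fit(X, y)
    counts.append(int(np.sum(fit.coef_ != 0)))
print(counts)   # the selected model size varies from run to run
```

With weak signals, the run-to-run spread of `counts` is exactly the phenomenon the abstract warns about.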
ABSTRACT: The authors consider the analysis of hierarchical longitudinal functional data based upon a functional principal components approach. In contrast to standard frequentist approaches to selecting the number of principal components, the authors do model averaging using a Bayesian formulation. A relatively straightforward reversible jump Markov Chain Monte Carlo formulation has poor mixing properties and in simulated data often becomes trapped at the wrong number of principal components. In order to overcome this, the authors show how to apply Stochastic Approximation Monte Carlo (SAMC) to this problem, a method that has the potential to explore the entire space and does not become trapped in local extrema. The combination of reversible jump methods and SAMC in hierarchical longitudinal functional data is simplified by a polar coordinate representation of the principal components. The approach is easy to implement and does well in simulated data in determining the distribution of the number of principal components, and in terms of its frequentist estimation properties. Empirical applications are also presented.
Canadian Journal of Statistics 06/2010; 38(2):256-270. DOI:10.1002/cjs.10062 · 0.65 Impact Factor
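The polar representation mentioned above maps d-1 angles to a unit vector via hyperspherical coordinates, which keeps principal component loadings on the unit sphere automatically during sampling. A minimal sketch, with a hypothetical helper name:

```python
import numpy as np

def angles_to_unit_vector(theta):
    """Map d-1 angles to a point on the unit sphere S^{d-1} via
    hyperspherical coordinates (hypothetical helper name)."""
    d = len(theta) + 1
    v = np.ones(d)
    for i, th in enumerate(theta):
        v[i] *= np.cos(th)          # cos of this angle stops here
        v[i + 1:] *= np.sin(th)     # sin propagates to later entries
    return v

v = angles_to_unit_vector(np.array([0.3, 1.1, 2.0]))
assert np.isclose(np.linalg.norm(v), 1.0)   # always unit length
```

Because the map is unconstrained in the angles, a sampler can move freely in angle space while the implied loadings remain orthonormalizable, which is what simplifies the reversible jump moves.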
ABSTRACT: Hierarchical functional data are widely seen in complex studies where sub-units are nested within units, which in turn are nested within treatment groups. We propose a general framework of functional mixed effects model for such data: within unit and within sub-unit variations are modeled through two separate sets of principal components; the sub-unit level functions are allowed to be correlated. Penalized splines are used to model both the mean functions and the principal components functions, where roughness penalties are used to regularize the spline fit. An EM algorithm is developed to fit the model, while the specific covariance structure of the model is utilized for computational efficiency to avoid storage and inversion of large matrices. Our dimension reduction with principal components provides an effective solution to the difficult tasks of modeling the covariance kernel of a random function and modeling the correlation between functions. The proposed methodology is illustrated using simulations and an empirical data set from a colon carcinogenesis study. Supplemental materials are available online.
Journal of the American Statistical Association 03/2010; 105(489):390-400. DOI:10.1198/jasa.2010.tm08737 · 1.98 Impact Factor
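The roughness-penalized fitting at the heart of the model reduces to penalized least squares. A minimal sketch with an identity basis and a second-difference penalty (the paper uses B-spline bases inside an EM algorithm; names here are illustrative):

```python
import numpy as np

def penalized_fit(y, lam):
    """Minimize ||y - f||^2 + lam * ||D2 f||^2, where D2 is the
    second-difference operator: the penalized least-squares idea
    behind roughness-penalized spline fits (identity basis for
    brevity; the paper uses B-spline bases)."""
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)   # (n-2) x n second differences
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(100)
f_hat = penalized_fit(y, lam=5.0)
# lam = 0 reproduces the data; lam > 0 trades fidelity for smoothness
assert np.allclose(penalized_fit(y, 0.0), y)
assert np.sum(np.diff(f_hat, 2)**2) < np.sum(np.diff(y, 2)**2)
```

The penalty parameter `lam` plays the role of the roughness penalty that the EM algorithm tunes when fitting the mean and principal component functions.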
ABSTRACT: We consider the problem of score testing for certain low dimensional parameters of interest in a model that could include finite but high dimensional secondary covariates and associated nuisance parameters. We investigate the possibility of the potential gain in power by reducing the dimensionality of the secondary variables via oracle estimators such as the Adaptive Lasso. As an application, we use a recently developed framework for score tests of association of a disease outcome with an exposure of interest in the presence of a possible interaction of the exposure with other co-factors of the model. We derive the local power of such tests and show that if the primary and secondary predictors are independent, then having an oracle estimator does not improve the local power of the score test. Conversely, if they are dependent, there is the potential for power gain. Simulations are used to validate the theoretical results and explore the extent of correlation needed between the primary and secondary covariates to observe an improvement of the power of the test by using the oracle estimator. Our conclusions are likely to hold more generally beyond the model of interactions considered here.
The International Journal of Biostatistics 01/2010; 6(1):Article 12. DOI:10.2202/1557-4679.1231 · 0.74 Impact Factor
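The score-test setup can be illustrated for a Gaussian linear model: estimate the nuisance parameters under the null only, then test the low-dimensional parameter with the efficient score, which projects the primary predictor off the nuisance design. This is a generic sketch, not the paper's interaction framework; names are hypothetical.

```python
import numpy as np
from scipy import stats

def score_test(y, x, Z):
    """Score test of H0: beta = 0 in y = Z @ gamma + x * beta + e,
    with the nuisance gamma estimated only under the null
    (Gaussian working model; an illustrative sketch)."""
    gamma_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ gamma_hat                            # null residuals
    sigma2 = r @ r / (len(y) - Z.shape[1])
    # efficient score: project x off the nuisance design first
    x_perp = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    stat = (x_perp @ r) ** 2 / (sigma2 * (x_perp @ x_perp))
    return stat, stats.chi2.sf(stat, df=1)

rng = np.random.default_rng(0)
n = 200
Z = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
x = rng.standard_normal(n)
y = Z @ np.array([1.0, 0.5, -0.5, 0.2]) + rng.standard_normal(n)
stat, pval = score_test(y, x, Z)
assert stat >= 0 and 0 <= pval <= 1
```

The projection step is where dimension reduction of `Z` matters: if `x` and the retained columns of `Z` are independent, the projection changes little, matching the paper's no-gain result.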
ABSTRACT: Recently (Martinez et al. 2010), we compared calcium ion (Ca2+) signaling between two exposures, where the data present as movies, or, more prosaically, time series of images. That work described novel uses of singular value decompositions (SVD) and weighted versions of them (WSVD) to extract the signals from such movies, in a way that is semi-automatic and tuned closely to the actual data and their many complexities. These complexities include the following. First, the images themselves are of no interest: all interest focuses on the behavior of individual cells across time, and thus the cells need to be segmented in an automated manner. Second, the cells themselves have 100+ pixels, so that they form 100+ curves measured over time, and data compression is required to extract the features of these curves. Third, some of the pixels in some of the cells are subject to image saturation due to bit depth limits, and this saturation needs to be accounted for if one is to normalize the images in a reasonably unbiased manner. Finally, the Ca2+ signals have oscillations or waves that vary with time, and these signals need to be extracted. Thus, that work showed how to use multiple weighted and standard singular value decompositions to detect, extract and clarify the Ca2+ signals. In this paper, we show how this signal extraction lends itself to a cluster analysis of the cell behavior, which reveals distinctly different patterns of behavior.
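The extract-then-cluster pipeline can be sketched on toy data: a rank-1 SVD compresses each simulated cell's pixel curves to one temporal signal, and clustering on frequency-domain features separates two behavioral groups. All data and settings below are hypothetical, not the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
t = np.linspace(0, 10, 200)

def cell_movie(freq):
    """Simulate one cell: 120 pixel time courses sharing one
    oscillatory Ca2+ signal (toy data)."""
    s = np.sin(2 * np.pi * freq * t)
    X = np.outer(rng.uniform(0.5, 1.5, 120), s)
    return X + 0.1 * rng.standard_normal(X.shape)

def extract_signal(X):
    """Rank-1 SVD compresses the pixel curves to one cell signal."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[0]

# two behavioral groups: slow vs fast oscillations
signals = np.array([extract_signal(cell_movie(f))
                    for f in [0.3] * 5 + [1.0] * 5])
# cluster on magnitude spectra, which are invariant to SVD sign flips
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    np.abs(np.fft.rfft(signals)))
assert len(set(labels[:5])) == 1 and len(set(labels[5:])) == 1
assert labels[0] != labels[5]
```

The magnitude spectrum is used as the clustering feature because the sign of an SVD singular vector is arbitrary; any sign-invariant summary of the extracted signal would serve.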
ABSTRACT: Time series associated with single-molecule experiments and/or simulations contain a wealth of multiscale information about complex biomolecular systems. We demonstrate how a collection of Penalized-splines (P-splines) can be useful in quantitatively summarizing such data. In this work, functions estimated using P-splines are associated with stochastic differential equations (SDEs). It is shown how quantities estimated in a single SDE summarize fast-scale phenomena, whereas variation between curves associated with different SDEs partially reflects noise induced by motion evolving on a slower time scale. P-splines assist in "semiparametrically" estimating nonlinear SDEs in situations where a time-dependent external force is applied to a single-molecule system. The P-splines introduced simultaneously use function and derivative scatterplot information to refine curve estimates. We refer to the approach as the PuDI (P-splines using Derivative Information) method. It is shown how generalized least squares ideas fit seamlessly into the PuDI method. Applications demonstrating how utilizing uncertainty information/approximations along with generalized least squares techniques improve PuDI fits are presented. Although the primary application here is in estimating nonlinear SDEs, the PuDI method is applicable to situations where both unbiased function and derivative estimates are available.
SIAM Journal on Multiscale Modeling and Simulation 01/2010; 8(4):1562-1580. DOI:10.1137/090768102 · 1.63 Impact Factor
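Jointly using function and derivative scatterplot information amounts to stacking two design matrices in one least-squares problem. A toy sketch of that idea, with a cubic monomial basis in place of penalized B-splines and simulated data standing in for single-molecule measurements:

```python
import numpy as np

# jointly fit noisy function values and noisy derivative values by
# stacking the basis matrix and its derivative (the PuDI idea in
# miniature; basis and settings here are illustrative)
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 40)
f = x**3 - x                        # true curve
df = 3 * x**2 - 1                   # its derivative
y = f + 0.05 * rng.standard_normal(x.size)
dy = df + 0.05 * rng.standard_normal(x.size)

B = np.vander(x, 4, increasing=True)     # columns [1, x, x^2, x^3]
dB = np.zeros_like(B)                    # derivative of each column
dB[:, 1:] = B[:, :-1] * np.arange(1, 4)
A = np.vstack([B, dB])                   # stack both sources of data
z = np.concatenate([y, dy])
coef = np.linalg.lstsq(A, z, rcond=None)[0]
assert np.max(np.abs(B @ coef - f)) < 0.1   # fit recovers the curve
```

Replacing the plain `lstsq` with a weighted solve (weights from the uncertainty of each observation) gives the generalized least squares refinement the abstract describes.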
ABSTRACT: We compare calcium ion (Ca(2+)) signaling between two exposures; the data present as movies, or, more prosaically, time series of images. This paper describes novel uses of singular value decompositions (SVD) and weighted versions of them (WSVD) to extract the signals from such movies, in a way that is semi-automatic and tuned closely to the actual data and their many complexities. These complexities include the following. First, the images themselves are of no interest: all interest focuses on the behavior of individual cells across time, and thus, the cells need to be segmented in an automated manner. Second, the cells themselves have 100+ pixels, so that they form 100+ curves measured over time, so that data compression is required to extract the features of these curves. Third, some of the pixels in some of the cells are subject to image saturation due to bit depth limits, and this saturation needs to be accounted for if one is to normalize the images in a reasonably unbiased manner. Finally, the Ca(2+) signals have oscillations or waves that vary with time and these signals need to be extracted. Thus, our aim is to show how to use multiple weighted and standard singular value decompositions to detect, extract and clarify the Ca(2+) signals. Our signal extraction methods then lead to simple although finely focused statistical methods to compare Ca(2+) signals across experimental conditions.
The Annals of Applied Statistics 12/2009; 3(4):1467-1492. DOI:10.1214/09-AOAS253 · 1.46 Impact Factor
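A weighted SVD that downweights saturated pixels can be sketched as alternating weighted least squares on a rank-1 model; giving saturated entries zero weight simply drops them from the fit. This is an illustrative sketch of the WSVD idea, not the paper's exact algorithm, and the toy data are hypothetical.

```python
import numpy as np

def rank1_wsvd(X, W, n_iter=500):
    """Rank-1 weighted SVD by alternating weighted least squares:
    minimize sum_ij W_ij (X_ij - u_i v_j)^2. Zero weights let
    saturated pixels be ignored entirely."""
    # initialize v with weighted column means (robust to saturation)
    v = (W * X).sum(axis=0) / np.maximum(W.sum(axis=0), 1e-12)
    for _ in range(n_iter):
        u = (W * X) @ v / ((W * v**2).sum(axis=1) + 1e-12)
        v = (W * X).T @ u / ((W.T * u**2).sum(axis=1) + 1e-12)
    return u, v

rng = np.random.default_rng(0)
u0 = rng.uniform(1, 2, 30)                     # pixel amplitudes
v0 = np.sin(np.linspace(0, 6, 40)) + 2         # shared Ca2+ signal
X = np.outer(u0, v0)
W = np.ones_like(X)
sat = rng.random(X.shape) < 0.05               # 5% "saturated" pixels
X[sat], W[sat] = 99.0, 0.0                     # clipped value, zero weight
u, v = rank1_wsvd(X, W)
# the zero-weighted entries do not contaminate the rank-1 fit
assert np.allclose(np.outer(u, v)[W > 0], X[W > 0], atol=1e-4)
```

A plain SVD of this matrix would be badly distorted by the clipped entries; the weights restrict attention to trustworthy pixels, which is the unbiased-normalization point made in the abstract.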