Article

Highly Irregular Functional Generalized Linear Regression with Electronic Health Records


Abstract

This work presents a new approach, called Multiple Imputation of Sparsely‐sampled Functions at Irregular Times (MISFIT), for fitting generalized functional linear regression models with sparsely and irregularly sampled data. Current methods do not allow for consistent estimation unless one assumes that the number of observed points per curve grows sufficiently quickly with the sample size. In contrast, MISFIT is based on a multiple imputation framework, which, as we demonstrate empirically, has the potential to produce consistent estimates without such an assumption. Just as importantly, it propagates the uncertainty of not having completely observed curves, allowing for a more accurate assessment of the uncertainty of parameter estimates, something that most methods currently cannot accomplish. This work is motivated by a longitudinal study on macrocephaly, or atypically large head size, in which electronic medical records allow for the collection of a great deal of data. However, the sampling is highly variable from child to child. Using MISFIT we are able to clearly demonstrate that the development of pathologic conditions related to macrocephaly is associated with both the overall head circumference of the children as well as the velocity of their head growth.
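As a schematic illustration of the impute-analyse-pool logic sketched in the abstract, the following Python snippet draws several completed sets of FPC scores from assumed subject-level conditional distributions, refits a logistic regression to each, and pools the results with Rubin's rules. It is not the authors' MISFIT implementation; the data, the conditional score distributions, and the model are placeholder assumptions.

```python
# Schematic multiple-imputation workflow for a sparse functional logistic
# regression (illustration only, not the authors' MISFIT implementation).
# Assumption: the conditional means `score_mean[i]` and covariances
# `score_cov[i]` of each subject's FPC scores, given the sparse observations
# (and ideally the response), come from an earlier PACE/MISFIT-type step;
# here they are random placeholders.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, K, M = 200, 3, 20                    # subjects, retained FPC scores, imputations

score_mean = rng.normal(size=(n, K))    # placeholder conditional means
score_cov = np.stack([0.3 * np.eye(K) for _ in range(n)])  # placeholder covariances
beta_true = np.array([1.0, -0.5, 0.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-score_mean @ beta_true)))

params, covs = [], []
for _ in range(M):
    # 1) impute: draw one completed score matrix from the conditional law
    xi = np.array([rng.multivariate_normal(score_mean[i], score_cov[i])
                   for i in range(n)])
    # 2) analyse: fit the generalized (here logistic) model to the completed data
    res = sm.Logit(y, sm.add_constant(xi)).fit(disp=0)
    params.append(res.params)
    covs.append(res.cov_params())

# 3) pool with Rubin's rules
params, covs = np.array(params), np.array(covs)
qbar = params.mean(axis=0)                       # pooled coefficient estimates
wbar = covs.mean(axis=0)                         # within-imputation variance
b = np.cov(params.T, ddof=1)                     # between-imputation variance
total = wbar + (1.0 + 1.0 / M) * b               # total variance of the estimates
print(qbar, np.sqrt(np.diag(total)))
```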


... This often produces a bias when considering subsequent modeling such as nonlinear functional regression. Predicting a scalar response Y from sparsely observed predictor processes X(t) is also a major challenge even for the simplest case of a functional linear model (Petrovich et al. 2018), due to the inconsistency of the predictors that is a consequence of the sparsity of the longitudinal observations. Functional regression models have proved useful to model the relationship between a scalar or functional response and functional predictors X(t), where t ranges over a compact interval T (Ramsay and Silverman 2005; Shi and Choi 2011; Wang et al. 2016). ...
... To obtain a consistent estimate of the slope function β in the FLM with sparse observations, one can use the fact that the linear model structure allows one to express the slope in terms of the cross-covariance and covariance functions of the predictor process X and the response, which are quantities that can be consistently estimated under mild assumptions (Yao et al. 2005b). Alternative multiple imputation methods based on conditioning on both the predictor observations and the response Y have also been explored (Petrovich et al. 2018), and these also rely on cross-covariance estimation. ...
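Written out, the covariance-based expression of the slope used in this line of work (in generic notation, following the Yao et al. 2005b-type construction) is:

```latex
% Functional linear model and the covariance-based expression of its slope,
% with X(t) = mu(t) + sum_k xi_k phi_k(t) the Karhunen-Loeve expansion of the
% predictor and (lambda_k, phi_k) the eigenpairs of the covariance of X:
\[
  E[Y \mid X] = \alpha + \int_{\mathcal{T}} \beta(t)\, X(t)\, dt, \qquad
  \beta(t) = \sum_{k \ge 1} \frac{\operatorname{Cov}(\xi_k, Y)}{\lambda_k}\, \phi_k(t),
\]
\[
  \operatorname{Cov}(\xi_k, Y) = \int_{\mathcal{T}} \operatorname{Cov}\big(X(s), Y\big)\, \phi_k(s)\, ds,
\]
% so consistent estimates of the covariance and cross-covariance functions give
% a consistent estimate of beta, even when each curve is sparsely observed.
```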
Preprint
We develop a representation of Gaussian distributed sparsely sampled longitudinal data whereby the data for each subject are mapped to a multivariate Gaussian distribution; this map is entirely data-driven. The proposed method utilizes functional principal component analysis and is nonparametric, assuming no prior knowledge of the covariance or mean structure of the longitudinal data. This approach naturally connects with a deeper investigation of the behavior of the functional principal component scores obtained for longitudinal data, as the number of observations per subject increases from sparse to dense. We show how this is reflected in the shrinkage of the distribution of the conditional scores given noisy longitudinal observations towards a point mass located at the true but unobservable FPC scores. Mapping each subject's sparse observations to the corresponding conditional score distribution leads to useful visualizations and representations of sparse longitudinal data. Asymptotic rates of convergence as sample size increases are obtained for the 2-Wasserstein metric between the true and estimated conditional score distributions, both for a K-truncated functional principal component representation as well as for the case when K = K(n) diverges with the sample size n → ∞. We apply these ideas to construct predictive distributions aimed at predicting outcomes given sparse longitudinal data.
... Single imputation procedures (like mean imputation or PACE) are useful in general but cannot account for the uncertainty induced by the imputation procedure; once the imputation is done, analyses typically proceed as if the imputed values were the truth. This leads to overly optimistic measures of uncertainty and the potential for substantial bias (Petrovich et al., 2018). To deal with this and other problems associated with single imputation methods, we consider multiple imputation methods. ...
... One issue is that the imputation procedure of PACE does not consider the response Yi, nor does it take into account the subsequent models that will be fit. This results in bias when estimating model parameters (Petrovich et al., 2018). In addition, PACE is a single imputation method, and hence the uncertainty in the imputation is not properly propagated when forming confidence intervals, prediction intervals, or p-values. ...
Article
Full-text available
This work considers the problem of fitting functional models with sparsely and irregularly sampled functional data. It overcomes the limitations of the state‐of‐the‐art methods, which face major challenges in the fitting of more complex non‐linear models. Currently, many of these models cannot be consistently estimated unless the number of observed points per curve grows sufficiently quickly with the sample size, whereas we show numerically that a modified approach with more modern multiple imputation methods can produce better estimates in general. We also propose a new imputation approach that combines the ideas of MissForest with Local Linear Forest and compare their performance with PACE and several other multivariate multiple imputation methods. This work is motivated by a longitudinal study on smoking cessation, in which the Electronic Health Records (EHR) from Penn State PaTH to Health allow for the collection of a great deal of data, with highly variable sampling. To illustrate our approach, we explore the relation between relapse and diastolic blood pressure. We also consider a variety of simulation schemes with varying levels of sparsity to validate our methods.
Preprint
Full-text available
This work considers the problem of fitting functional models with sparsely and irregularly sampled functional data. It overcomes the limitations of the state-of-the-art methods, which face major challenges in the fitting of more complex non-linear models. Currently, many of these models cannot be consistently estimated unless the number of observed points per curve grows sufficiently quickly with the sample size, whereas we show numerically that a modified approach with more modern multiple imputation methods can produce better estimates in general. We also propose a new imputation approach that combines the ideas of MissForest with Local Linear Forest and compare their performance with PACE and several other multivariate multiple imputation methods. This work is motivated by a longitudinal study on smoking cessation, in which the Electronic Health Records (EHR) from Penn State PaTH to Health allow for the collection of a great deal of data, with highly variable sampling. To illustrate our approach, we explore the relation between relapse and diastolic blood pressure. We also consider a variety of simulation schemes with varying levels of sparsity to validate our methods.
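As a rough illustration of the iterative random-forest idea, the sketch below lays sparsely observed curves on a common grid and fills the gaps with scikit-learn's IterativeImputer wrapped around a random forest, re-run under different seeds to obtain several imputed data sets. It is only a stand-in for intuition, not the MissForest + Local Linear Forest procedure proposed in this work; the grid layout and simulated data are assumptions.

```python
# Rough MissForest-style illustration: iterative random-forest imputation of
# sparsely observed curves laid out on a common grid, re-run under several
# seeds to obtain multiple imputed data sets. This is only a stand-in for the
# general idea, not the MissForest + Local Linear Forest procedure of the
# paper; the grid, sparsity pattern, and simulated curves are assumptions.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n, m = 150, 25                                  # curves, grid points
grid = np.linspace(0, 1, m)
curves = np.sin(2 * np.pi * grid) + rng.normal(0, 0.2, size=(n, m))
sparse = np.where(rng.random((n, m)) < 0.7, np.nan, curves)  # ~70% unobserved

imputations = []
for seed in range(5):                           # a handful of completed data sets
    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=30, random_state=seed),
        max_iter=5, random_state=seed)
    imputations.append(imputer.fit_transform(sparse))
# Each element of `imputations` is a completed n x m matrix; a functional model
# can be fitted to each completed set and the results pooled afterwards.
```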
... Recently, the case where the sample paths are non-differentiable has received much attention. There is now extensive evidence that in some applications, such as energy and climate, chemistry and physics, sports science and medical applications, many functional data sets can reasonably be assumed to be generated by continuous but irregular sample paths of X. See, for example, Poß et al. (2020), Petrovich et al. (2022), and Wang et al. (2024). Typically, the paths can reasonably be assumed to be Hölder continuous, but the Hölder exponent is generally unknown. ...
Preprint
Full-text available
The computation of integrals is a fundamental task in the analysis of functional data, which are typically considered as random elements in a space of squared integrable functions. Borrowing ideas from recent advances in the Monte Carlo integration literature, we propose effective unbiased estimation and inference procedures for integrals of uni- and multivariate random functions. Several applications to key problems in functional data analysis (FDA) involving random design points are studied and illustrated. In the absence of noise, the proposed estimates converge faster than the sample mean and the usual algorithms for numerical integration. Moreover, the proposed estimator facilitates effective inference by generally providing better coverage with shorter confidence and prediction intervals, in both noisy and noiseless setups.
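The elementary baseline behind such estimators is that, for independent uniform design points, the plain average of the observed path values is already an unbiased estimate of the path integral; the snippet below only illustrates this starting point, not the refined estimators of the preprint.

```python
# Elementary Monte Carlo baseline: with t_1, ..., t_m i.i.d. Uniform(0, 1), the
# sample mean of X(t_j) is an unbiased estimate of the path integral of X over
# [0, 1]. Illustration of the starting point only; the sample path below is an
# arbitrary assumption.
import numpy as np

rng = np.random.default_rng(2)
x = lambda t: np.sin(2 * np.pi * t) + t ** 2     # one noiseless sample path
t = rng.uniform(0, 1, size=50)                   # random design points
estimate = x(t).mean()                           # unbiased for the integral
truth = 1 / 3                                    # exact value of the integral
print(estimate, truth)
```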
... Then, the eigenfunctions associated to this smooth estimated covariance are used as a basis to represent smoothly the curves (Yao et al., 2005;Xiao et al., 2018); and they could indeed be used as a dictionary in the smoothing problem Problem 3.7 as well. It is also worth noting that this idea has been studied extensively in the context of the functional additive linear model in Petrovich et al. (2018). ...
Thesis
With the increasing ubiquity of data-collecting devices, a great variety of phenomena is monitored with finer and finer accuracy, which constantly expands the scope of Machine Learning applications. Dealing with such a volume of data efficiently is, however, challenging. Fortunately, as measurements get denser, they may become gradually redundant. We can then greatly reduce the burden by finding a representation which exploits properties of the generating process and/or is tailored for the application at hand. This thesis revolves around an aspect of this idea: functional data. Data indeed consist of discrete measurements, but by thinking of them as functional, we can exploit prior knowledge on smoothness to obtain a better yet lower-dimensional representation. The focus is on nonlinear models for functional output regression (FOR), relying on an extension of reproducing kernel Hilbert spaces for vector-valued functions (vv-RKHS), which is the cornerstone of many existing nonlinear FOR methods. We propose to challenge those in two aspects: their computational complexity with respect to the number of measurements per function and their focusing solely on the square loss. To that end, we introduce the new framework of kernel projection learning (KPL), combining vv-RKHSs and representation of signals in dictionaries. The loss remains functional; however, the model predicts only a finite number of representation coefficients. This approach retains the many advantages of vv-RKHSs yet greatly alleviates the computational burden incurred by the functional outputs. We derive two estimators in closed form using the square loss, one for fully observed functions and one for discretized ones. We show that both are consistent in terms of excess risk. We demonstrate as well the possibility to use other differentiable and convex losses, to combine this framework with large-scale kernel methods, and to automatically select the dictionary using a structured penalty. In another contribution, we propose to solve the regression problem in vv-RKHSs of function-valued functions for the family of convoluted losses which we introduce. Those losses can either promote sparsity or robustness, with a parameter controlling the degree of locality of those properties. Thanks to their structure, they are particularly amenable to dual approaches, which we investigate. We then overcome the challenges posed by the functional nature of the dual variables by proposing two possible representations, and we propose corresponding algorithms.
Article
In this work we propose a functional concurrent regression model to estimate labor supply elasticities over the years 1988 through 2014 using Current Population Survey data. Assuming, as is common, that individuals' wages are endogenous, we introduce instrumental variables in a two-stage least squares approach to estimate the desired labor supply elasticities. Furthermore, we tailor our estimation method to sparse functional data. Though recent work has incorporated instrumental variables into other functional regression models, to our knowledge this has not yet been done in the functional concurrent regression model, and most existing literature is not suited for sparse functional data. We show through simulations that this two-stage least squares approach largely eliminates the bias introduced by a naive model (i.e., one that does not acknowledge endogeneity) and produces accurate coefficient estimates for moderate sample sizes.
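A schematic version of the two-stage construction, in generic notation chosen here for illustration rather than taken from the article, is:

```latex
% Stage 1: regress the endogenous functional covariate W_i(t) (e.g., wages) on
% the instruments Z_i(t) at each time point and form fitted values \hat{W}_i(t):
\[
  W_i(t) = Z_i(t)^{\top} \gamma(t) + u_i(t).
\]
% Stage 2: fit the concurrent regression with \hat{W}_i(t) in place of W_i(t):
\[
  Y_i(t) = \beta_0(t) + \beta_1(t)\, \hat{W}_i(t) + \varepsilon_i(t).
\]
```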
Article
Full-text available
The main ideas behind the classic multivariate logistic regression model make sense when translated to the functional setting, where the explanatory variable X is a function and the response Y is binary. However, some important technical issues appear (or are aggravated with respect to those of the multivariate case) due to the functional nature of the explanatory variable. First, the mere definition of the model can be questioned: while most approaches proposed so far rely on the L2-based model, we explore an alternative (in some sense, more general) approach based on the theory of reproducing kernel Hilbert spaces (RKHS). The validity conditions of such an RKHS-based model, and their relation with the L2-based one, are investigated and made explicit in two formal results. Some relevant particular cases are considered as well. Second, we show that, under very general conditions, the maximum likelihood estimator of the logistic model parameters fails to exist in the functional case, although some restricted versions can be considered. Third, we check (in the framework of binary classification) the practical performance of some RKHS-based procedures well suited to our model: they are compared to several competing methods via Monte Carlo experiments and the analysis of real data sets.
Article
Full-text available
In recent years, multiple imputation has emerged as a convenient and flexible paradigm for analysing data with missing values. Essential features of multiple imputation are reviewed, with answers to frequently asked questions about using the method in practice.
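For reference, the standard pooling rules reviewed in treatments of this kind combine M completed-data analyses as follows (a textbook statement, not specific to this article):

```latex
% Rubin's pooling rules for M imputations with completed-data estimates
% \hat{Q}_m and completed-data variances \hat{U}_m:
\[
  \bar{Q} = \frac{1}{M} \sum_{m=1}^{M} \hat{Q}_m, \qquad
  \bar{U} = \frac{1}{M} \sum_{m=1}^{M} \hat{U}_m, \qquad
  B = \frac{1}{M-1} \sum_{m=1}^{M} \big(\hat{Q}_m - \bar{Q}\big)^2,
\]
\[
  T = \bar{U} + \Big(1 + \frac{1}{M}\Big) B,
\]
% with T the total variance attached to the pooled estimate \bar{Q}.
```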
Article
Full-text available
Importance: Incidental findings on screening and diagnostic tests are common and may prompt cascades of testing and treatment that are of uncertain value. No study to date has examined physician perceptions and experiences of these cascades nationally. Objective: To estimate the national frequency and consequences of cascades of care after incidental findings using a national survey of US physicians. Design, Setting, and Participants: Population-based survey study using data from a 44-item cross-sectional, online survey among 991 practicing US internists in a research panel representative of American College of Physicians national membership. The survey was emailed to panel members on January 22, 2019, and analysis was performed from March 11 to May 27, 2019. Main Outcomes and Measures: Physician report of prior experiences with cascades, features of their most recently experienced cascade, and perception of potential interventions to limit the negative consequences of cascades. Results: This study achieved a 44.7% response rate (376 completed surveys), and responses were weighted to be nationally representative. The mean (SE) age of respondents was 43.4 (0.7) years, and 60.4% of respondents were male. Almost all respondents (99.4%; percentages were weighted) reported experiencing cascades, including cascades with clinically important and intervenable outcomes (90.9%) and cascades with no such outcome (94.4%). Physicians reported that cascades caused their patients psychological harm (68.4%), physical harm (15.6%), and financial burden (57.5%) and caused the physicians themselves wasted time and effort (69.1%), frustration (52.5%), and anxiety (45.4%). When asked about their most recent cascade, 33.7% of 371 respondents reported the test revealing the incidental finding may not have been clinically appropriate. During this most recent cascade, physicians reported that guidelines for follow-up testing were not followed (8.1%) or did not exist to their knowledge (53.2%). To lessen the negative consequences of cascades, 62.8% of 376 respondents chose accessible guidelines and 44.6% chose decision aids as potential solutions. Conclusions and Relevance: The survey findings indicate that almost all respondents had experienced cascades after incidental findings that did not lead to clinically meaningful outcomes yet caused harm to patients and themselves. Policy makers and health care leaders should address cascades after incidental findings as part of efforts to improve health care value and reduce physician burnout.
Article
Full-text available
Gut and oral microbiota perturbations have been observed in obese adults and adolescents; less is known about their influence on weight gain in young children. Here we analyzed the gut and oral microbiota of 226 two-year-olds with 16S rRNA gene sequencing. Weight and length were measured at seven time points and used to identify children with rapid infant weight gain (a strong risk factor for childhood obesity), and to derive growth curves with innovative Functional Data Analysis (FDA) techniques. We showed that growth curves were associated negatively with diversity, and positively with the Firmicutes-to-Bacteroidetes ratio, of the oral microbiota. We also demonstrated an association between the gut microbiota and child growth, even after controlling for the effect of diet on the microbiota. Lastly, we identified several bacterial genera that were associated with child growth patterns. These results suggest that by the age of two, the oral microbiota of children with rapid infant weight gain may have already begun to establish patterns often seen in obese adults. They also suggest that the gut microbiota at age two, while strongly influenced by diet, does not harbor obesity signatures many researchers identified in later life stages.
Article
Full-text available
Newly encoded memories are labile and consolidate over time. The importance of sleep in memory consolidation has been well known for almost a decade. However, recent research has shown that awake quiescence, too, can support consolidation: people remember more new memories if they quietly rest after encoding than if they engage in a task. It is not yet known how exactly this rest-related consolidation benefits new memories, and whether it affects the fine detail of new memories. Using a sensitive picture recognition task, we show that awake quiescence aids the fine detail of new memories. Young adults were significantly better at discriminating recently encoded target pictures from similar lure pictures when the initial encoding of target pictures had been followed immediately by 10 minutes of awake quiescence than an unrelated perceptual task. This novel finding indicates that, in addition to influencing how much we remember, our behavioural state during wakeful consolidation determines, at least in part, the level of fine detail of our new memories. Thus, our results suggest that rest-related consolidation protects the fine detail of new memories, allowing us to retain detailed memories.
Article
Full-text available
We consider inference for the mean and covariance functions of covariate adjusted functional data using Local Linear Kernel (LLK) estimators. By means of a double asymptotic, we differentiate between sparse and dense covariate adjusted functional data, depending on the relative order of m (the number of discretization points per function) and n (the number of functions). Our simulation results demonstrate that the existing asymptotic normality results can lead to severely misleading inferences in finite samples. We explain this phenomenon based on our theoretical results and propose finite-sample corrections which provide practically useful approximations for inference in sparse and dense data scenarios. The relevance of our theoretical results is showcased using a real-data application.
Article
Full-text available
In many studies, it is of interest to predict the future trajectory of subjects based on their historical data, referred to as dynamic prediction. Mixed effects models have traditionally been used for dynamic prediction. However, the commonly used random intercept and slope model is often not sufficiently flexible for modeling subject-specific trajectories. In addition, there may be useful exposures/predictors of interest that are measured concurrently with the outcome, complicating dynamic prediction. To address these problems, we propose a dynamic functional concurrent regression model to handle the case where both the functional response and the functional predictors are irregularly measured. Currently, such a model cannot be fit by existing software. We apply the model to dynamically predict children's length conditional on prior length, weight, and baseline covariates. Inference on model parameters and subject-specific trajectories is conducted using the mixed effects representation of the proposed model. An extensive simulation study shows that the dynamic functional regression model provides more accurate estimation and inference than existing methods. Methods are supported by fast, flexible, open source software that uses heavily tested smoothing techniques.
Article
Full-text available
The Hájek-Feldman dichotomy establishes that two Gaussian measures are either mutually absolutely continuous with respect to each other (and hence there is a Radon-Nikodym density for each measure with respect to the other one) or mutually singular. Unlike the case of finite dimensional Gaussian measures, there are non-trivial examples of both situations when dealing with Gaussian stochastic processes. This paper provides: (a) Explicit expressions for the optimal (Bayes) rule and the minimal classification error probability in several relevant problems of supervised binary classification of mutually absolutely continuous Gaussian processes. The approach relies on some classical results in the theory of Reproducing Kernel Hilbert Spaces (RKHS). (b) An interpretation, in terms of mutual singularity, for the “near perfect classification” phenomenon described by Delaigle and Hall (2012a). We show that the asymptotically optimal rule proposed by these authors can be identified with the sequence of optimal rules for an approximating sequence of classification problems in the absolutely continuous case. (c) As an application, we discuss a natural variable selection method, which essentially consists of taking the original functional data X(t), t ∈ [0, 1] to a d-dimensional marginal (X(t1), …, X(td)) which is chosen in order to minimize the classification error of the corresponding Fisher's linear rule. We give precise conditions under which this discrimination method achieves the minimal classification error of the original functional problem.
Article
Full-text available
Objective: Large electronic health record (EHR) datasets are increasingly used to facilitate research on growth, but measurement and recording errors can lead to biased results. We developed and tested an automated method for identifying implausible values in pediatric EHR growth data. Materials and methods: Using deidentified data from 46 primary care sites, we developed an algorithm to identify weight and height values that should be excluded from analysis, including implausible values and values that were recorded repeatedly without remeasurement. The foundation of the algorithm is a comparison of each measurement, expressed as a standard deviation score, with a weighted moving average of a child's other measurements. We evaluated the performance of the algorithm by (1) comparing its results with the judgment of physician reviewers for a stratified random selection of 400 measurements and (2) evaluating its accuracy in a dataset with simulated errors. Results: Of 2 000 595 growth measurements from 280 610 patients 1 to 21 years old, 3.8% of weight and 4.5% of height values were identified as implausible or excluded for other reasons. The proportion excluded varied widely by primary care site. The automated method had a sensitivity of 97% (95% confidence interval [CI], 94-99%) and a specificity of 90% (95% CI, 85-94%) for identifying implausible values compared to physician judgment, and identified 95% (weight) and 98% (height) of simulated errors. Discussion and conclusion: This automated, flexible, and validated method for preparing large datasets will facilitate the use of pediatric EHR growth datasets for research.
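A much simplified sketch of the comparison described above (standard-deviation scores checked against a weighted moving average of the child's other measurements) is given below; the Gaussian age weights, the cutoff, and the helper name flag_implausible are placeholders for illustration, not the validated algorithm of the article.

```python
# Much simplified sketch of the exclusion idea: compare each measurement's
# standard-deviation (z) score with a weighted moving average of the child's
# other z-scores and flag large discrepancies. The Gaussian age weights, the
# cutoff, and the function name are illustrative placeholders, not the
# validated algorithm of the article.
import numpy as np

def flag_implausible(z_scores, ages, cutoff=3.0, bandwidth=2.0):
    """Return a boolean mask of measurements to exclude for one child."""
    z = np.asarray(z_scores, dtype=float)
    a = np.asarray(ages, dtype=float)
    flags = np.zeros(len(z), dtype=bool)
    for i in range(len(z)):
        others = np.delete(np.arange(len(z)), i)
        if others.size == 0:
            continue
        w = np.exp(-((a[others] - a[i]) / bandwidth) ** 2)  # nearer ages count more
        moving_avg = np.average(z[others], weights=w)
        flags[i] = abs(z[i] - moving_avg) > cutoff
    return flags

# Example: a weight-for-age z-score trajectory with one likely recording error.
print(flag_implausible([0.1, 0.2, 0.0, 4.5, 0.3, 0.2, 0.1, 0.0, 0.2],
                       ages=[0.5, 1, 1.5, 2, 3, 4, 5, 6, 7]))
```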
Article
Full-text available
Objective: To identify phenotypes of type 1 diabetes control and associations with maternal/neonatal characteristics based on blood pressure (BP), glucose, and insulin curves during gestation, using a novel functional data analysis approach that accounts for sparse longitudinal patterns of medical monitoring during pregnancy. Methods: We performed a retrospective longitudinal cohort study of women with type 1 diabetes whose BP, glucose, and insulin requirements were monitored throughout gestation as part of a program-project grant. Scores from sparse functional principal component analysis (fPCA) were used to classify gestational profiles according to the degree of control for each monitored measure. Phenotypes created using fPCA were compared with respect to maternal and neonatal characteristics and outcome. Results: Most of the gestational profile variation in the monitored measures was explained by the first principal component (82-94%). Profiles clustered into three subgroups of high, moderate, or low heterogeneity, relative to the overall mean response. Phenotypes were associated with baseline characteristics, longitudinal changes in glycohemoglobin A1 and weight, and with pregnancy-related outcomes. Conclusion: Three distinct longitudinal patterns of glucose, insulin, and BP control were found. By identifying these phenotypes, interventions can be targeted for subgroups at highest risk for compromised outcome, to optimize diabetes management during pregnancy.
Article
Full-text available
The Sleep Heart Health Study (SHHS) is a comprehensive landmark study of sleep and its impacts on health outcomes. A primary metric of the SHHS is the in-home polysomnogram, which includes two electroencephalographic (EEG) channels for each subject, at two visits. The volume and importance of these data present enormous challenges for analysis. To address these challenges, we introduce multilevel functional principal component analysis (MFPCA), a novel statistical methodology designed to extract core intra- and inter-subject geometric components of multilevel functional data. Though motivated by the SHHS, the proposed methodology is generally applicable, with potential relevance to many modern scientific studies of hierarchical or longitudinal functional outcomes. Notably, using MFPCA, we identify and quantify associations between EEG activity during sleep and adverse cardiovascular outcomes.
Article
Full-text available
Smoothing of noisy sample covariances is an important component in functional data analysis. We propose a novel covariance smoothing method based on penalized splines and associated software. The proposed method is a bivariate spline smoother that is designed for covariance smoothing and can be used for sparse functional or longitudinal data. We propose a fast algorithm for covariance smoothing using leave-one-subject-out cross-validation. Our simulations show that the proposed method compares favorably against several commonly used methods. The method is applied to a study of child growth led by one of the coauthors and to a public dataset of longitudinal CD4 counts.
Article
Full-text available
Most complex human diseases are likely the consequence of the joint actions of genetic and environmental factors. Identification of gene-environment (G × E) interactions not only contributes to a better understanding of the disease mechanisms, but also improves disease risk prediction and targeted intervention. In contrast to the large number of genetic susceptibility loci discovered by genome-wide association studies, there have been very few successes in identifying G × E interactions, which may be partly due to limited statistical power and inaccurately measured exposures. Although existing statistical methods only consider interactions between genes and static environmental exposures, many environmental/lifestyle factors, such as air pollution and diet, change over time, and cannot be accurately captured at one measurement time point or by simply categorizing into static exposure categories. There is a dearth of statistical methods for detecting gene by time-varying environmental exposure interactions. Here, we propose a powerful functional logistic regression (FLR) approach to model the time-varying effect of longitudinal environmental exposure and its interaction with genetic factors on disease risk. Capitalizing on the powerful functional data analysis framework, our proposed FLR model is capable of accommodating longitudinal exposures measured at irregular time points and contaminated by measurement errors, commonly encountered in observational studies. We use extensive simulations to show that the proposed method can control the Type I error and is more powerful than alternative ad hoc methods. We demonstrate the utility of this new method using data from a case-control study of pancreatic cancer to identify the windows of vulnerability of lifetime body mass index on the risk of pancreatic cancer as well as genes that may modify this association.
Article
Full-text available
Time-average approximation and principal component analysis of the stochastic process underlying the functional data are the main tools for adapting the NIPALS algorithm to estimate missing data in the functional context. The influence of the amount of missing data on the estimation of linear regression models is studied using the PLS method. A simulation study illustrates our methodology.
Article
Full-text available
We consider nonparametric estimation of the mean and covariance functions for functional/longitudinal data. Strong uniform convergence rates are developed for estimators that are local-linear smoothers. Our results are obtained in a unified framework in which the number of observations within each curve/cluster can be of any rate relative to the sample size. We show that the convergence rates for the procedures depend on both the number of sample curves and the number of observations on each curve. For sparse functional data, these rates are equivalent to the optimal rates in nonparametric regression. For dense functional data, root-n rates of convergence can be achieved with proper choices of bandwidths. We further derive almost sure rates of convergence for principal component analysis using the estimated covariance function. The results are illustrated with simulation studies.
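A bare-bones pooled local-linear estimate of the mean function, using a Gaussian kernel and a fixed, hand-picked bandwidth (bandwidth selection and the choice between per-observation and per-subject weighting are deliberately glossed over here), can be sketched as:

```python
# Bare-bones pooled local-linear estimate of the mean function of sparse
# functional data: all subjects' (t_ij, y_ij) pairs are pooled, and a weighted
# least-squares line is fitted around each grid point with a Gaussian kernel.
# The fixed bandwidth and the equal-weight-per-observation scheme are
# simplifying assumptions made for illustration.
import numpy as np

def local_linear_mean(t_obs, y_obs, t_grid, h=0.1):
    t_obs = np.asarray(t_obs, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    mu = np.empty(len(t_grid))
    for j, t0 in enumerate(t_grid):
        d = t_obs - t0
        w = np.sqrt(np.exp(-0.5 * (d / h) ** 2))      # sqrt of kernel weights
        X = np.column_stack([np.ones_like(d), d])     # local linear design
        beta, *_ = np.linalg.lstsq(X * w[:, None], y_obs * w, rcond=None)
        mu[j] = beta[0]                                # intercept = mean at t0
    return mu

# Example with two sparsely observed subjects pooled together:
t = [0.1, 0.4, 0.9, 0.2, 0.6]
y = [0.0, 0.8, 0.3, 0.2, 1.0]
print(local_linear_mean(t, y, t_grid=np.linspace(0, 1, 5)))
```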
Article
Full-text available
Advances in data collection and storage have tremendously increased the presence of functional data, whose graphical representations are curves, images or shapes. As a new area of Statistics, functional data analysis extends existing methodologies and theories from the fields of functional analysis, generalized linear models, multivariate data analysis, nonparametric statistics and many others. This paper provides a review into functional data analysis with main emphasis on functional principal component analysis, functional principal component regression, and bootstrap in functional principal component regression. Recent trends as well as open problems in the area are discussed.
Article
Full-text available
The test characteristics of head circumference (HC) measurement percentile criteria for the identification of previously undetected pathology associated with head enlargement in primary care are unknown. Electronic patient records were reviewed to identify children aged 3 days to 3 years with new diagnoses of intracranial expansive conditions (IEC) and metabolic and genetic conditions associated with macrocephaly (MGCM). We tested the following HC percentile threshold criteria: ever above the 95th, 97th, or 99.6th percentile and ever crossing 2, 4, or 6 increasing major percentile lines. The Centers for Disease Control and Prevention (CDC) and World Health Organization (WHO) growth curves were used, as well as the primary care network (PCN) curves previously derived from this cohort. Among 74,428 subjects, 85 (0.11%) had a new diagnosis of IEC (n = 56) or MGCM (n = 29), and of these, 24 received intervention. The 99.6th percentile of the PCN curve was the only threshold with a PPV over 1% (PPV 1.8%); the sensitivity of this threshold was only 15%. Test characteristics for the 95th percentiles were: sensitivity (CDC: 46%; WHO: 55%; PCN: 40%), positive predictive value (PPV: CDC: 0.3%; WHO: 0.3%; PCN: 0.4%), and likelihood ratios positive (LR+: CDC: 2.8; WHO: 2.2; PCN: 3.9). Test characteristics for the 97th percentiles were: sensitivity (CDC: 40%; WHO: 48%; PCN: 34%), PPV (CDC: 0.4%; WHO: 0.3%; PCN: 0.6%), and LR+ (CDC: 3.6; WHO: 2.7; PCN: 5.6). Test characteristics for crossing 2 increasing major percentile lines were: sensitivity (CDC: 60%; WHO: 40%; PCN: 31%), PPV (CDC: 0.2%; WHO: 0.1%; PCN: 0.2%), and LR+ (CDC: 1.3; WHO: 1.1; PCN: 1.5). Commonly used HC percentile thresholds had low sensitivity and low positive predictive value for diagnosing new pathology associated with head enlargement in children in a primary care network.
Article
We are interested in functional linear regression when some observations of the real response are missing, while the functional covariate is completely observed. A complete case regression imputation method of missing data is presented, using functional principal component regression to estimate the functional coefficient of the model. We study the asymptotic behavior of the error when the missing data are replaced by the regression imputed value, in a 'missing at random' framework. The completed database can be used to estimate the functional coefficient of the model and to predict new values of the response. The practical behavior of the method is also studied on simulated datasets. A real dataset illustration is performed in the environmental context of air quality.
Article
The 2017 Recommendations for Preventive Pediatric Health Care (Periodicity Schedule) have been approved by the American Academy of Pediatrics (AAP) and represent a consensus of the AAP and the Bright Futures Periodicity Schedule Workgroup. Each child and family is unique; therefore, these recommendations are designed for the care of children who are receiving competent parenting, have no manifestations of any important health problems, and are growing and developing in a satisfactory fashion. Developmental, psychosocial, and chronic disease issues for children and adolescents may require frequent counseling and treatment visits separate from preventive care visits. Additional visits also may become necessary if circumstances suggest variations from normal.
Article
A functional binary choice model is explored in a case-control or choice-based sample design context. That is, a model is considered in which the response is binary, the explanatory variable is functional, and the sample is stratified with respect to the values of the response variable. A dimensional reduction of the space of the explanatory random function based on a Karhunen–Loève expansion is used to define a conditional maximum likelihood estimate of the model. Based on this formulation, several asymptotic properties are given. A simulation study and an application to kneading data are used to compare the proposed method with the ordinary maximum likelihood method, which ignores the nature of the sampling. The proposed model yields encouraging results. The potential of the functional choice-based sampling model for integrating special non-random features of the sample, which would have been difficult to see otherwise, is also outlined.
Article
We propose methods for variable selection in the context of modeling the association between a functional response and concurrently observed functional predictors. This data structure, and the need for such methods, is exemplified by our motivating example: a study in which blood pressure values are observed throughout the day, together with measurements of physical activity, location, posture, affect or mood, and other quantities that may influence blood pressure. We estimate the coefficients of the concurrent functional linear model using variational Bayes and jointly model residual correlation using functional principal components analysis. Latent binary indicators partition coefficient functions into included and excluded sets, incorporating variable selection into the estimation framework. The proposed methods are evaluated in simulations and real-data analyses, and are implemented in a publicly available R package with supporting interactive graphics for visualization.
Conference Paper
Complex diseases such as major depression affect people over time in complicated patterns. Longitudinal data analysis is thus crucial for understanding and prognosis of such diseases and has received considerable attention in the biomedical research community. Traditional classification and regression methods have been commonly applied in a simple (controlled) clinical setting with a small number of time points. However, these methods cannot be easily extended to the more general setting for longitudinal analysis, as they are not inherently built for time-dependent data. Functional regression, in contrast, is capable of identifying the relationship between features and outcomes along with time information by assuming features and/or outcomes as random functions over time rather than independent random variables. In this paper, we propose a novel sparse generalized functional linear model for the prediction of treatment remission status of the depression participants with longitudinal features. Compared to traditional functional regression models, our model enables high-dimensional learning, smoothness of functional coefficients, longitudinal feature selection and interpretable estimation of functional coefficients. Extensive experiments have been conducted on the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) data set and the results show that the proposed sparse functional regression method achieves significantly higher prediction power than existing approaches.
Article
Nonparametric estimation of mean and covariance functions is important in functional data analysis. We investigate the performance of local linear smoothers for both mean and covariance functions with a general weighing scheme, which includes two commonly used schemes, equal weight per observation (OBS), and equal weight per subject (SUBJ), as two special cases. We provide a comprehensive analysis of their asymptotic properties on a unified platform for all types of sampling plan, be it dense, sparse or neither. Three types of asymptotic properties are investigated in this paper: asymptotic normality, L2 convergence and uniform convergence. The asymptotic theories are unified on two aspects: (1) the weighing scheme is very general; (2) the magnitude of the number Ni of measurements for the ith subject relative to the sample size n can vary freely. Based on the relative order of Ni to n, functional data are partitioned into three types: non-dense, dense and ultradense functional data for the OBS and SUBJ schemes. These two weighing schemes are compared both theoretically and numerically. We also propose a new class of weighing schemes in terms of a mixture of the OBS and SUBJ weights, of which theoretical and numerical performances are examined and compared.
Article
With the advance of modern technology, more and more data are being recorded continuously during a time interval or intermittently at several discrete time points. These are both examples of functional data, which has become a commonly encountered type of data. Functional data analysis (FDA) encompasses the statistical methodology for such data. Broadly interpreted, FDA deals with the analysis and theory of data that are in the form of functions. This paper provides an overview of FDA, starting with simple statistical notions such as mean and covariance functions, then covering some core techniques, the most popular of which is functional principal component analysis (FPCA). FPCA is an important dimension reduction tool, and in sparse data situations it can be used to impute functional data that are sparsely observed. Other dimension reduction approaches are also discussed. In addition, we review another core technique, functional linear regression, as well as clustering and classification of functional data. Beyond linear and single- or multiple-index methods, we touch upon a few nonlinear approaches that are promising for certain applications. They include additive and other nonlinear functional regression models and models that feature time warping, manifold learning, and empirical differential equations. The paper concludes with a brief discussion of future directions.
Article
Bayes classifiers for functional data pose a challenge. This is because probability density functions do not exist for functional data. As a consequence, the classical Bayes classifier using density quotients needs to be modified. We propose to use density ratios of projections on a sequence of eigenfunctions that are common to the groups to be classified. The density ratios can then be factored into density ratios of individual functional principal components whence the classification problem is reduced to a sequence of nonparametric one-dimensional density estimates. This is an extension to functional data of some of the very earliest nonparametric Bayes classifiers that were based on simple density ratios in the one-dimensional case. By means of the factorization of the density quotients the curse of dimensionality that would otherwise severely affect Bayes classifiers for functional data can be avoided. We demonstrate that in the case of Gaussian functional data, the proposed functional Bayes classifier reduces to a functional version of the classical quadratic discriminant. A study of the asymptotic behavior of the proposed classifiers in the large sample limit shows that under certain conditions the misclassification rate converges to zero, a phenomenon that has been referred to as "perfect classification". The proposed classifiers also perform favorably in finite sample applications, as we demonstrate in comparisons with other functional classifiers in simulations and various data applications, including wine spectral data, functional magnetic resonance imaging (fMRI) data for attention deficit hyperactivity disorder (ADHD) patients, and yeast gene expression data.
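Schematically, the factorized density-ratio classifier described above can be written as follows; the notation is generic, and the product form reflects the working assumption that the scores are independent within each group (exact in the Gaussian case):

```latex
% Factorized Bayes classifier based on one-dimensional score densities, with
% xi_k(x) the projection of a new curve x onto the k-th common eigenfunction:
\[
  \xi_k(x) = \int_{\mathcal{T}} \big(x(t) - \mu(t)\big)\, \phi_k(t)\, dt, \qquad
  Q_K(x) = \frac{\pi_1}{\pi_0} \prod_{k=1}^{K}
           \frac{f_{1k}\big(\xi_k(x)\big)}{f_{0k}\big(\xi_k(x)\big)},
\]
% assign x to group 1 when Q_K(x) > 1; f_{gk} is the (estimated) density of the
% k-th score in group g and pi_g the prior probability of group g.
```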
Article
We develop a hierarchical Gaussian process model for forecasting and inference of functional time series data. Unlike existing methods, our approach is especially suited for sparsely or irregularly sampled curves and for curves sampled with non-negligible measurement error. The latent process is dynamically modeled as a functional autoregression (FAR) with Gaussian process innovations. We propose a fully nonparametric dynamic functional factor model for the dynamic innovation process, with broader applicability and improved computational efficiency over standard Gaussian process models. We prove finite-sample forecasting and interpolation optimality properties of the proposed model, which remain valid with the Gaussian assumption relaxed. An efficient Gibbs sampling algorithm is developed for estimation, inference, and forecasting, with extensions for FAR(p) models with model averaging over the lag p. Extensive simulations demonstrate substantial improvements in forecasting performance and recovery of the autoregressive surface over competing methods, especially under sparse designs. We apply the proposed methods to forecast nominal and real yield curves using daily U.S. data. Real yields are observed more sparsely than nominal yields, yet the proposed methods are highly competitive in both settings. Supplementary materials, including R code and the yield curve data, are available online.
Article
Hydrocephalus is a common disorder of cerebrospinal fluid (CSF) physiology resulting in abnormal expansion of the cerebral ventricles. Infants commonly present with progressive macrocephaly, whereas children older than 2 years generally present with signs and symptoms of intracranial hypertension. The classic understanding of hydrocephalus as the result of obstruction to bulk flow of CSF is evolving to models that incorporate dysfunctional cerebral pulsations, brain compliance, and newly characterised water-transport mechanisms. Hydrocephalus has many causes. Congenital hydrocephalus, most commonly involving aqueduct stenosis, has been linked to genes that regulate brain growth and development. Hydrocephalus can also be acquired, mostly from pathological processes that affect ventricular outflow, subarachnoid space function, or cerebral venous compliance. Treatment options include shunt and endoscopic approaches, which should be individualised to the child. The long-term outcome for children that have received treatment for hydrocephalus varies. Advances in brain imaging, technology, and understanding of the pathophysiology should ultimately lead to improved treatment of the disorder.
Article
There is a lack of evidence on the value of head circumference (HC) as a screening measure. We aimed to describe the incidence of head centile shifting and the relationship between extremes of head size and later neurodevelopmental problems in the Avon Longitudinal Study of Parents and Children. HC was measured routinely at 2, 9, and 18 or 24 months and by researchers at ages 4, 8, 12, and 18 months. IQ according to the Wechsler Intelligence Scale for Children was measured in research clinics at age 8 for all. Neurocognitive disorders (NCDs) were identified from chart review. There were 10 851 children with ≥2 head measurements. At each age, 2% to 3% of children had SD scores below -2 or above +2, but for most children this was only found at 1 age. More than 15% of children showed centile shifts, but less than one-third of these were sustained at subsequent measurements. Only 0.5% showed a sustained shift beyond the normal range. Children with consistently small heads were up to 7 times more likely to have an NCD, but 85% of children with small heads had no NCDs, and 93% of children with NCDs had head SD scores within the normal range. Centile shifts within the normal range occur commonly and seem mainly to reflect measurement error. This finding makes robust assessment of the head trajectory difficult and may result in many children being investigated unnecessarily. Extreme head size is neither specific nor sensitive for detecting NCDs, suggesting that routine measurement of HC is unhelpful.
Article
Context: Abusive head trauma (AHT) is a dangerous form of child abuse that can be difficult to diagnose in young children. Objectives: To determine how frequently AHT was previously missed by physicians in a group of abused children with head injuries and to determine factors associated with the unrecognized diagnosis. Design: Retrospective chart review of cases of head trauma presenting between January 1, 1990, and December 31, 1995. Setting: Academic children's hospital. Patients: One hundred seventy-three children younger than 3 years with head injuries caused by abuse. Main Outcome Measures: Characteristics of head-injured children in whom diagnosis of AHT was unrecognized and the consequences of the missed diagnoses. Results: Fifty-four (31.2%) of 173 abused children with head injuries had been seen by physicians after AHT and the diagnosis was not recognized. The mean time to correct diagnosis among these children was 7 days (range, 0-189 days). Abusive head trauma was more likely to be unrecognized in very young white children from intact families and in children without respiratory compromise or seizures. In 7 of the children with unrecognized AHT, misinterpretation of radiological studies contributed to the delay in diagnosis. Fifteen children (27.8%) were reinjured after the missed diagnosis. Twenty-two (40.7%) experienced medical complications related to the missed diagnosis. Four of 5 deaths in the group with unrecognized AHT might have been prevented by earlier recognition of abuse. Conclusion: Although diagnosing head trauma can be difficult in the absence of a history, it is important to consider inflicted head trauma in infants and young children presenting with nonspecific clinical signs.
Article
Multiple imputation was designed to handle the problem of missing data in public-use data bases where the data-base constructor and the ultimate user are distinct entities. The objective is valid frequency inference for ultimate users who in general have access only to complete-data software and possess limited knowledge of specific reasons and models for nonresponse. For this situation and objective, I believe that multiple imputation by the data-base constructor is the method of choice. This article first provides a description of the assumed context and objectives, and second, reviews the multiple imputation framework and its standard results. These preliminary discussions are especially important because some recent commentaries on multiple imputation have reflected either misunderstandings of the practical objectives of multiple imputation or misunderstandings of fundamental theoretical results. Then, criticisms of multiple imputation are considered, and, finally, comparisons are made to alternative strategies.
Article
Nonparametric methods are developed for estimating the dose effect when a response consists of correlated observations over time measured in a dose-response experiment. The methods can also be applied to data collected from a completely randomized design experiment. Methods are developed for the detection and description of the effects of dose, time, and their interaction. The methods allow for individual variation in the timing and number of observations. A generalization allowing baseline covariates to be incorporated is addressed. These results may be used in an exploratory fashion in the process of building a random-effects model for longitudinal data.
Article
Missing values and outliers are frequently encountered in traffic monitoring data. We approach these problems by sampling the daily traffic flow rate trajectories from random functions and taking advantage of the data features using functional data analysis. We propose to impute missing values by using the conditional expectation approach to functional principal component analysis (FPCA). Our simulation study shows that the FPCA approach performs better than two commonly discussed methods in the literature, the probabilistic principal component analysis (PCA) and the Bayesian PCA, which have been shown to perform better than many conventional approaches. Based on the FPCA approach, the functional principal component scores can be applied to the functional bagplot and functional highest density region boxplot, which makes outlier detection possible for incomplete functional data. Our numerical results indicate that these two outlier detection approaches coupled with the proposed missing value imputation method can perform reasonably well. Although motivated by traffic flow data application, the proposed functional data methods for missing value imputation and outlier detection can be used in many applications with longitudinally recorded functional data.
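The conditional-expectation (PACE-type) imputation referred to here has, under Gaussian assumptions and in generic notation, the familiar form:

```latex
% PACE-type conditional expectation of the k-th FPC score of subject i given its
% incomplete observation vector Y_i (mean mu_i at the observed times, covariance
% Sigma_{Y_i} including the measurement-error variance):
\[
  \hat{\xi}_{ik} = \hat{\lambda}_k\, \hat{\phi}_{ik}^{\top}\,
                   \hat{\Sigma}_{Y_i}^{-1} \big(Y_i - \hat{\mu}_i\big), \qquad
  \hat{X}_i(t) = \hat{\mu}(t) + \sum_{k=1}^{K} \hat{\xi}_{ik}\, \hat{\phi}_k(t),
\]
% where \hat{\phi}_{ik} stacks the k-th eigenfunction evaluated at subject i's
% observation times; the reconstruction \hat{X}_i(t) fills in the missing values.
```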
Article
Functional principal components (FPC) analysis is widely used to decompose and express functional observations. Curve estimates implicitly condition on basis functions and other quantities derived from FPC decompositions; however, these objects are unknown in practice. In this article, we propose a method for obtaining correct curve estimates by accounting for uncertainty in FPC decompositions. Additionally, pointwise and simultaneous confidence intervals that account for both model- and decomposition-based variability are constructed. Standard mixed model representations of functional expansions are used to construct curve estimates and variances conditional on a specific decomposition. Iterated expectation and variance formulas combine model-based conditional estimates across the distribution of decompositions. A bootstrap procedure is implemented to understand the uncertainty in principal component decomposition quantities. Our method compares favorably to competing approaches in simulation studies that include both densely and sparsely observed functions. We apply our method to sparse observations of CD4 cell counts and to dense white-matter tract profiles. Code for the analyses and simulations is publicly available, and our method is implemented in the R package refund on CRAN.
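The iterated expectation and variance step can be summarized, in generic notation, by the usual conditional decomposition over the distribution of FPC decompositions D:

```latex
% Combining estimates that are conditional on a decomposition D (basis functions,
% scores, truncation) across the distribution of decompositions:
\[
  \hat{f}(t) = E_D\big[ E[\hat{f}(t) \mid D] \big], \qquad
  \operatorname{Var}\big(\hat{f}(t)\big)
    = E_D\big[ \operatorname{Var}(\hat{f}(t) \mid D) \big]
    + \operatorname{Var}_D\big( E[\hat{f}(t) \mid D] \big),
\]
% with the outer expectation and variance over D approximated by bootstrapping
% the FPC decomposition, as described above.
```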
Article
Complex diseases such as major depression affect people over time in complicated patterns. Longitudinal data analysis is thus crucial for understanding and prognosis of such diseases and has received considerable attention in the biomedical research community. Traditional classification and regression methods have been commonly applied in a simple (controlled) clinical setting with a small number of time points. However, these methods cannot be easily extended to the more general setting for longitudinal analysis, as they are not inherently built for time-dependent data. Functional regression, in contrast, is capable of identifying the relationship between features and outcomes along with time information by assuming features and/or outcomes as random functions over time rather than independent random variables. In this paper, we propose a novel sparse generalized functional linear model for the prediction of treatment remission status of the depression participants with longitudinal features. Compared to traditional functional regression models, our model enables high-dimensional learning, smoothness of functional coefficients, longitudinal feature selection and interpretable estimation of functional coefficients. Extensive experiments have been conducted on the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) data set and the results show that the proposed sparse functional regression method achieves significantly higher prediction power than existing approaches.
Article
A random vector x arises from one of two multivariate normal distributions differing in mean but not covariance. A training set x1, x2, ···, xn of previous cases, along with their correct assignments, is known. These can be used to estimate Fisher's discriminant by maximum likelihood and then to assign x on the basis of the estimated discriminant, a method known as the normal discrimination procedure. Logistic regression does the same thing but with the estimation of Fisher's discriminant done conditionally on the observed values of x1, x2, ···, xn. This article computes the asymptotic relative efficiency of the two procedures. Typically, logistic regression is shown to be between one half and two thirds as effective as normal discrimination for statistically interesting values of the parameters.
Article
The probability of disease development in a defined time period is described by a logistic regression model. A model for the regression variable, given disease status, is induced and is applied to case-control data. It is shown that the odds ratio estimators and their asymptotic variance matrices may be obtained by applying the original logistic regression model to the case-control study as if the data had been obtained in a prospective study. This result gives a flexible and convenient method of analysis for a range of case-control studies in which stratum sizes are reasonably large. The work extends Anderson's (1972) results on logistic discrimination and generalizes the findings of Breslow & Powers (1978) on the equivalence of odds ratio estimators when both prospective and retrospective logistic models are applied to case-control data.
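A toy simulation (illustrative only, not the article's development) of the key consequence: fitting the prospective logistic model to a retrospectively sampled case-control data set recovers the odds-ratio (slope) parameter, with only the intercept absorbing the sampling fractions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
N = 200_000
x = rng.normal(size=N)
p = 1 / (1 + np.exp(-(-4.0 + 0.8 * x)))                 # prospective model; true log odds ratio 0.8
d = rng.binomial(1, p)

# Retrospective (case-control) sampling: keep all cases and an equal number of controls
cases = np.flatnonzero(d == 1)
controls = rng.choice(np.flatnonzero(d == 0), size=cases.size, replace=False)
idx = np.concatenate([cases, controls])

# Fit the prospective logistic model to the case-control sample as if it were prospective
fit = LogisticRegression(C=1e6, max_iter=1000).fit(x[idx, None], d[idx])
print(fit.coef_, fit.intercept_)   # slope close to 0.8; only the intercept absorbs the sampling odds
```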
Article
We introduce a class of models for an additive decomposition of groups of curves stratified by crossed and nested factors, generalizing smoothing splines to such samples by associating them with a corresponding mixed-effects model. The models are also useful for imputation of missing data and exploratory analysis of variance. We prove that the best linear unbiased predictors (BLUPs) from the extended mixed-effects model correspond to solutions of a generalized penalized regression where smoothing parameters are directly related to variance components, and we show that these solutions are natural cubic splines. The model parameters are estimated using a highly efficient implementation of the EM algorithm for restricted maximum likelihood (REML) estimation based on a preliminary eigenvector decomposition. Variability of computed estimates can be assessed with asymptotic techniques or with a novel hierarchical bootstrap resampling scheme for nested mixed-effects models. Our methods are applied to menstrual cycle data from studies of reproductive function that measure daily urinary progesterone; the sample of progesterone curves is stratified by cycles nested within subjects nested within conceptive and nonconceptive groups.
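Schematically, and in generic notation rather than the authors', the correspondence between BLUPs and penalized regression referred to above can be written as follows, with the smoothing parameter appearing as a ratio of variance components:

```latex
% Mixed-effects form: y = X\beta + Zb + \varepsilon, \quad
% b \sim N(0, \sigma_b^2 G), \quad \varepsilon \sim N(0, \sigma_\varepsilon^2 I).
% The BLUPs solve a penalized least-squares problem:
(\hat\beta, \hat b)
  = \arg\min_{\beta,\, b}
    \left\{ \lVert y - X\beta - Zb \rVert^2 + \lambda\, b^\top G^{-1} b \right\},
\qquad
\lambda = \frac{\sigma_\varepsilon^2}{\sigma_b^2}.
```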
Article
In this paper we analyse CD4 counts from infants born to mothers who are infected with the human immunodeficiency virus. A random effects model with linear or low-order polynomials in time is unsatisfactory for these longitudinal data. We develop an alternative approach based on a flexible family of models for which both the fixed and the random effects are linear combinations of B-splines. The fixed and random parts are smooth functions of time and the covariance structure is parsimonious. The procedure allows estimates of each individual's smooth trajectory over time to be exhibited. Model selection, estimation and computation are discussed. Centile curves are presented that take into account the longitudinal nature of the data. We emphasize a graphical approach to the presentation of results.
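In generic notation (ours, not the authors'), a model of the kind described has the form:

```latex
% Subject i observed at times t_{i1}, \dots, t_{i n_i}; B-spline basis B_1, \dots, B_K:
y_i(t_{ij}) = \sum_{k=1}^{K} \beta_k B_k(t_{ij})
            + \sum_{k=1}^{K} b_{ik} B_k(t_{ij})
            + \varepsilon_{ij},
\qquad
b_i \sim N(0, D), \quad \varepsilon_{ij} \sim N(0, \sigma^2),
% so the population mean curve and each subject's deviation are smooth linear
% combinations of the same B-spline basis, with a parsimonious covariance matrix D.
```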
Article
In a missing-data setting, we want to estimate the mean of a scalar outcome, based on a sample in which an explanatory variable is observed for every subject while responses are missing by happenstance for some of them. We consider two kinds of estimates of the mean response when the explanatory variable is functional. One is based on the average of the predicted values and the second is a functional adaptation of the Horvitz–Thompson estimator. We show that the infinite dimensionality of the problem does not affect the rates of convergence by establishing that the estimates are root-n consistent under the missing at random (MAR) assumption. These asymptotic results are complemented by simulated experiments illustrating the ease of implementation and the good finite-sample behaviour of the method. This is the first paper emphasizing that the insensitivity of averaged estimates, well known in multivariate non-parametric statistics, carries over to an infinite-dimensional covariate. In this sense, this work opens the way for various other results of this kind in functional data analysis.
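The sketch below illustrates the two kinds of estimates on simulated data, using a crude finite-dimensional summary of the functional covariate in place of a genuinely functional fit; all names, the MAR mechanism, and the data are illustrative rather than taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(3)
n, m = 1000, 40
t = np.linspace(0, 1, m)
phi = np.vstack([np.sin(np.pi * t), np.cos(np.pi * t)])        # two basis functions
xi = rng.normal(size=(n, 2))
X = xi @ phi + rng.normal(scale=0.2, size=(n, m))              # functional covariate X(t)
Y = X.mean(axis=1) + rng.normal(scale=0.5, size=n)             # scalar outcome (true mean 0)

# Missing at random: the response is observed with probability depending on X only
pi_true = 1 / (1 + np.exp(-(0.5 + 1.5 * xi[:, 0])))
R = rng.binomial(1, pi_true).astype(bool)

Z = X @ phi.T / m                                              # crude finite summary of X(t)
# (a) regression estimator: average the predicted responses over the whole sample
mu_reg = LinearRegression().fit(Z[R], Y[R]).predict(Z).mean()
# (b) Horvitz-Thompson-type estimator with estimated observation probabilities
pi_hat = LogisticRegression().fit(Z, R).predict_proba(Z)[:, 1]
mu_ht = np.sum(Y[R] / pi_hat[R]) / n
print(Y[R].mean(), mu_reg, mu_ht)   # complete-case mean is biased upward; both corrections are near 0
```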
Article
We show that, in functional data classification problems, perfect asymptotic classification is often possible, making use of the intrinsic very high dimensional nature of functional data. This performance is often achieved by linear methods, which are optimal in important cases. These results point to a marked contrast between classification for functional data and its counterpart in conventional multivariate analysis, where the dimension is kept fixed as the sample size diverges. In the latter setting, linear methods can sometimes be quite inefficient, and there are no prospects for asymptotically perfect classification, except in pathological cases where, for example, a variance vanishes. By way of contrast, in finite samples of functional data, good performance can be achieved by truncated versions of linear methods. Truncation can be implemented by partial least squares or projection onto a finite number of principal components, using, in both cases, cross-validation to determine the truncation point. We establish consistency of the cross-validation procedure.
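A simplified sketch of the truncation strategy (projection onto leading principal components with a cross-validated truncation point, followed by a linear classifier); the simulated curves and the PCA/LDA pipeline below are illustrative, not the authors' exact procedure.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(4)
n, m = 200, 100
t = np.linspace(0, 1, m)
y = rng.binomial(1, 0.5, n)
modes = np.array([np.sin((k + 1) * np.pi * t) for k in range(4)])     # shared modes of variation
scores = rng.normal(size=(n, 4)) * np.array([2.0, 1.0, 0.5, 0.25])
X = scores @ modes + y[:, None] * 0.6 * np.sin(2 * np.pi * t) \
    + rng.normal(scale=0.2, size=(n, m))                              # class-dependent mean shift

# Truncated linear classifier: project onto leading PCs, then LDA;
# the truncation level is chosen by cross-validation.
clf = Pipeline([("pca", PCA()), ("lda", LinearDiscriminantAnalysis())])
grid = GridSearchCV(clf, {"pca__n_components": list(range(1, 11))}, cv=5).fit(X, y)
print(grid.best_params_, grid.best_score_)
```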
Article
This work proposes an extension of functional principal components analysis (FPCA), or the Karhunen–Loève expansion, which can take into account non-parametrically the effects of an additional covariate. Such models can also be interpreted as non-parametric mixed effect models for functional data. We propose estimators based on kernel smoothers and a data-driven selection procedure for the smoothing parameters based on a two-step cross-validation criterion. The conditional FPCA is illustrated with the analysis of a data set consisting of egg-laying curves for female fruit flies. Convergence rates are given for estimators of the conditional mean function and the conditional covariance operator when the entire curves are collected. Almost sure convergence is also proven when one observes only discretized noisy sample paths. A simulation study allows us to check the good behaviour of the estimators.
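A crude sketch of the conditional estimation step: a kernel-weighted mean and covariance are computed at a given covariate value and then eigendecomposed. The Gaussian kernel, the fixed bandwidth, and the simulated data are illustrative, not the paper's two-step cross-validated procedure.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 400, 40
t = np.linspace(0, 1, m)
z = rng.uniform(0, 1, n)                                      # additional scalar covariate
phi = np.sqrt(2) * np.sin(2 * np.pi * t)                      # single mode of variation
X = ((1 + 2 * z) * rng.normal(size=n))[:, None] * phi \
    + rng.normal(scale=0.2, size=(n, m))                      # variance of the mode grows with z

def conditional_fpca(z0, h=0.1):
    """Kernel-weighted mean/covariance at covariate value z0, then eigendecomposition."""
    w = np.exp(-0.5 * ((z - z0) / h) ** 2)
    w /= w.sum()
    mu = w @ X
    Xc = X - mu
    C = (w[:, None] * Xc).T @ Xc                              # weighted covariance estimate
    vals, vecs = np.linalg.eigh(C)
    return vals[::-1], vecs[:, ::-1]

for z0 in (0.2, 0.8):
    vals, _ = conditional_fpca(z0)
    print(z0, vals[0])                                        # leading eigenvalue increases with z0
```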
Article
This work aims at performing functional principal components analysis (FPCA) with Horvitz–Thompson estimators when the observations are curves collected with survey sampling techniques. One important motivation for this study is that FPCA is a dimension reduction tool which is the first step in developing model-assisted approaches that can take auxiliary information into account. FPCA relies on the estimation of the eigenelements of the covariance operator, which can be seen as nonlinear functionals. Adapting to our functional context the linearization technique based on the influence function developed by Deville [1999. Variance estimation for complex statistics and estimators: linearization and residual techniques. Survey Methodology 25, 193–203], we prove that these estimators are asymptotically design unbiased and consistent. Under mild assumptions, asymptotic variances are derived for the FPCA estimators and consistent estimators of them are proposed. Our approach is illustrated with a simulation study, and we check the good properties of the proposed estimators of the eigenelements as well as of their variance estimators obtained with the linearization approach.
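In generic notation (ours), design-weighted estimators of this kind typically take the form below, with the FPCs obtained as eigenfunctions of the weighted covariance; the exact normalizations may differ from those used in the paper.

```latex
% Sample s drawn with inclusion probabilities \pi_i from a population of size N:
\hat N = \sum_{i \in s} \frac{1}{\pi_i},
\qquad
\hat\mu(t) = \frac{1}{\hat N} \sum_{i \in s} \frac{X_i(t)}{\pi_i},
\qquad
\hat\gamma(u, v) = \frac{1}{\hat N} \sum_{i \in s}
   \frac{\{X_i(u) - \hat\mu(u)\}\,\{X_i(v) - \hat\mu(v)\}}{\pi_i},
% and the estimated FPCs are the eigenfunctions of the operator with kernel \hat\gamma.
```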
Article
Building on results on the convergence of sample-based linear principal component analysis of a random function in a separable Hilbert space, the limiting distributions of the principal values and the principal factors are derived. These results can be written explicitly in the Gaussian case. Some applications to statistical inference are investigated.
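For reference, the standard explicit form of such a limit for the principal values in the Gaussian case, stated in our own notation and assuming distinct eigenvalues, is:

```latex
% \lambda_1 > \lambda_2 > \cdots : distinct eigenvalues of the covariance operator;
% \hat\lambda_j : their empirical counterparts from n independent curves. In the Gaussian case,
\sqrt{n}\,\bigl(\hat\lambda_j - \lambda_j\bigr)
  \;\xrightarrow{d}\; N\bigl(0,\, 2\lambda_j^{2}\bigr),
% with asymptotic independence across components, and a corresponding Gaussian limit
% holds for the principal factors (eigenfunctions).
```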
Article
The elements of a multivariate dataset are often curves rather than single points. Functional principal components can be used to describe the modes of variation of such curves. If one has complete measurements for each individual curve or, as is more common, one has measurements on a fine grid taken at the same time points for all curves, then many standard techniques may be applied. However, curves are often measured at an irregular and sparse set of time points which can differ widely across individuals. We present a technique for handling this more difficult case using a reduced rank mixed effects framework.
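In generic notation (ours), a reduced rank mixed effects representation of this kind models each curve as:

```latex
% Spline basis b(t) \in \mathbb{R}^q evaluated at subject i's own observation times t_{ij};
% reduced rank K < q:
Y_i(t_{ij}) = b(t_{ij})^\top \theta_\mu + b(t_{ij})^\top \Theta\,\alpha_i + \varepsilon_{ij},
\qquad
\alpha_i \sim N(0, D), \quad \varepsilon_{ij} \sim N(0, \sigma^2),
% where \Theta is q \times K, so the columns of \Theta define K principal component
% functions shared across subjects while each curve requires only its own sparse,
% irregular time points t_{ij}.
```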