Article

Statistical Analysis with Missing Data

... In particular, several strategies have been investigated in the literature in order to reduce the sensitivity of the generalization performance of data-driven decision-making systems and high-complexity neural networks to the quality of training data. These strategies aim to optimize the performance of such models through approaches such as data cleaning and augmentation [1][2][3], the use of specific cost functions [4,5] or algorithms [6], regularization, dropout and noise injection [7][8][9], or more sophisticated methods and training procedures [10][11][12]. Nevertheless, while investigations in this direction are undoubtedly crucial, they are often rather empirical in nature, and extensive fine-tuning of parameters is typically required in practical applications. ...
... Indeed, choosing p = 1 − 2R (here this is necessary also in the supervised case because of the product ΓΓₛ resulting from the square (Γₛ − Γ)²). Then, we can take x′ as the initial configuration σ⁽⁰⁾ = x′ and apply (2.1) to get σ⁽¹⁾. Next, multiplying both sides of the evolution relation by xᵢ, we obtain ...
... On the other hand, increasing α, some non-trivial effects may emerge, depending on the interplay between dilution and the noise due to pattern interference, as we show hereafter. In the upper right plot (α = 0.4), ... In the second row, we report the heat maps for the probability that the attractiveness of the ground truth is positive, namely P(Δ ≥ 0) = ½(1 + m⁽¹⁾) (see also App. E), again for α = 0.1 (left) and α = 0.4 (right). ...
Preprint
We consider Hopfield networks, where neurons interact pair-wise by Hebbian couplings built over i. a set of definite patterns (ground truths), ii. a sample of labeled examples (supervised setting), iii. a sample of unlabeled examples (unsupervised setting). We focus on the case where ground truths are Rademacher vectors and examples are noisy versions of these ground truths, possibly displaying some blank entries (e.g., mimicking missing or dropped data), and we determine the spectral distribution of the coupling matrices in the three scenarios by exploiting and extending the Marchenko-Pastur theorem. By leveraging this knowledge, we are able to analytically inspect the stability and attractiveness of the ground truths, as well as the generalization capabilities of the networks. In particular, as corroborated by long-running Monte Carlo simulations, the presence of blank entries can be beneficial in some specific conditions, suggesting strategies based on data sparsification; the robustness of these results in structured datasets is confirmed numerically. Finally, we demonstrate that the Hebbian matrix, built on sparse examples, can be recovered as the fixed point of a gradient descent algorithm with dropout, over a suitable loss function.
... A small portion of claims with missing socioeconomic and remoteness data (< 5%) were imputed using multiple imputation (10 iterations) by the chained equations method [36]. A careful examination of the missing patterns showed that data were missing at random [37]. ...
... As the assumption of proportional hazards was violated in our analysis, an accelerated failure time (AFT) model was employed to identify the factors associated with time to first visit to a healthcare provider [37,38]. AFT models do not require the proportional hazards assumption [39]. ...
... AFT models do not require the proportional hazards assumption [39]. We employed the two most commonly used AFT models, lognormal and Weibull, based on the distribution of the data (lower Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) values indicate a better-fitting model) [37,38]. ...
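For readers who want to reproduce this kind of model comparison, the following is a minimal sketch of fitting Weibull and lognormal AFT models and comparing them by AIC with the lifelines library; the file name, column names, and covariates are hypothetical, and categorical covariates would need to be encoded numerically first.

```python
# Sketch: compare Weibull and lognormal AFT fits by AIC, as described in the
# excerpt above. File and column names are placeholders; covariates numeric.
import pandas as pd
from lifelines import WeibullAFTFitter, LogNormalAFTFitter

df = pd.read_csv("lbp_claims.csv")   # hypothetical: one row per claim
data = df[["days_to_first_visit", "visit_observed", "age", "female", "remote"]].dropna()

models = {"Weibull": WeibullAFTFitter(), "LogNormal": LogNormalAFTFitter()}
for name, m in models.items():
    m.fit(data, duration_col="days_to_first_visit", event_col="visit_observed")
    print(name, "AIC:", round(m.AIC_, 1))

# AFT coefficients exponentiate to time ratios: TR > 1 means a longer time
# to the first healthcare provider visit.
best = min(models.values(), key=lambda m: m.AIC_)
print(best.summary[["coef", "exp(coef)"]])
```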
Article
Full-text available
Purpose Evidence shows that patient outcomes following musculoskeletal injury have been associated with the timing of care. Despite the increasing number of injured workers presenting with low back pain (LBP) in primary care, little is known about the factors that are associated with the timing of initial healthcare provider visits. This study investigated factors that are associated with the timing of initial workers' compensation (WC)-funded care provider visits for LBP claims. Methods We used a retrospective cohort design. A standardised multi-jurisdiction database of LBP claims with injury dates from July 2011 to June 2015 was analysed. Determinants of the time to initial general practitioner (GP) and/or musculoskeletal (MSK) therapist visits were investigated using an accelerated failure time model, with a time ratio (TR) > 1 indicating a longer time to initial healthcare provider visit. Results 9088 LBP claims were included. The median time to first healthcare provider visit was 3 days (interquartile range (IQR) 1–9). Compared to GPs (median 3 days, IQR 1–8), the timing of initial consultation was longer if the first healthcare providers were MSK therapists (median 5 days, IQR 2–14) (p < 0.001). Female workers had a shorter time to first healthcare provider visit [TR = 0.87; 95% CI (0.78, 0.97)] compared to males. For injured workers, it took twice as long to see MSK therapists first as it did to see GPs [TR = 2.12; 95% CI (1.88, 2.40)]. Professional workers and those from remote areas also experienced delayed initial healthcare provider visits. Conclusions The time to initial healthcare provider visit for compensable LBP varied significantly by certain occupational and contextual factors. Further research is needed to investigate the impact of the timing of initial visits to healthcare providers on claim outcomes.
... This creates problems in application domains that rely on access to complete, high-quality data, affecting virtually every academic and professional field and sector. Techniques aimed at rectifying the problem have been an area of research in several disciplines [11]-[13]. The manner in which data points go missing in a dataset determines the approach to be used in estimating these values. ...
... The manner in which data points go missing in a dataset determines the approach to be used in estimating these values. As per [13], there exist three missing data mechanisms. This article focuses on investigating the Missing Completely at Random (MCAR) and Missing at Random (MAR) mechanisms. ...
... MAR, on the other hand, arises when missingness in a specific feature depends on the other features within the dataset, but not on the feature of interest itself [4]. According to [13], there are two main missing data patterns. These are the arbitrary and monotone missing data patterns. ...
Preprint
In this paper, we examine the problem of missing data in high-dimensional datasets by taking into consideration the Missing Completely at Random and Missing at Random mechanisms, as well as the arbitrary missing pattern. Additionally, this paper employs a methodology based on Deep Learning and Swarm Intelligence algorithms in order to provide reliable estimates for missing data. The deep learning technique is used to extract features from the input data via an unsupervised learning approach by modeling the data distribution based on the input. This deep learning technique is then used as part of the objective function for the swarm intelligence technique in order to estimate the missing data after a supervised fine-tuning phase, by minimizing an error function based on the interrelationship and correlation between features in the dataset. The methodology investigated in this paper therefore has longer running times; however, the promising potential outcomes justify the trade-off. Basic knowledge of statistics is presumed.
... Nonparametric estimation techniques [35][36][37][38], which impose fewer structural assumptions and rely more significantly on the data, are particularly well-suited when the system structure is less known. In contrast, parametric approaches are especially useful when a specific well-defined and appropriate structure for the system is given, which is often the case in many real-world scenarios [39][40][41]. ...
... for any θ ∈ Θ. Rather than directly maximizing the likelihood p(y|θ, u) as discussed in Section 4, the EM algorithm employs an iterative scheme that maximizes the auxiliary function Q(θ|θₖ) [41]. More precisely, given θ₀ ∈ Θ, an initial guess for the parameters θ, EM iteratively generates a sequence of parameter estimates given by ...
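For reference, the EM iteration this excerpt alludes to can be stated generically as follows; this is only a sketch, and x here denotes the unobserved (latent) data, an assumption since the excerpt is truncated before that definition.

```latex
% Generic EM iteration: Q is the expected complete-data log-likelihood,
% computed under the current estimate theta_k; x denotes the latent data
% (an assumption, as the excerpt does not show the full definition).
Q(\theta \mid \theta_k) = \mathbb{E}_{x \mid y,\,\theta_k}\!\left[\log p(y, x \mid \theta, u)\right],
\qquad
\theta_{k+1} = \operatorname*{arg\,max}_{\theta \in \Theta} Q(\theta \mid \theta_k),
\qquad k = 0, 1, 2, \ldots
```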
Preprint
Full-text available
In this paper, we address the identification problem for the systems characterized by linear time-invariant dynamics with bilinear observation models. More precisely, we consider a suitable parametric description of the system and formulate the identification problem as the estimation of the parameters defining the mathematical model of the system using the observed input-output data. To this end, we propose two probabilistic frameworks. The first framework employs the Maximum Likelihood (ML) approach, which accurately finds the optimal parameter estimates by maximizing a likelihood function. Subsequently, we develop a tractable first-order method to solve the optimization problem corresponding to the proposed ML approach. Additionally, to further improve tractability and computational efficiency of the estimation of the parameters, we introduce an alternative framework based on the Expectation--Maximization (EM) approach, which estimates the parameters using an appropriately designed cost function. We show that the EM cost function is invex, which ensures the existence and uniqueness of the optimal solution. Furthermore, we derive the closed-form solution for the optimal parameters and also prove the recursive feasibility of the EM procedure. Through extensive numerical experiments, the practical implementation of the proposed approaches is demonstrated, and their estimation efficacy is verified and compared, highlighting the effectiveness of the methods to accurately estimate the system parameters and their potential for real-world applications in scenarios involving bilinear observation structures.
... Analysis of missingness using Little's Missing Completely at Random (MCAR) test (χ² = 35.21, df = 12, p = 0.001) indicated that the missing data did not follow an MCAR pattern [51]. The pre-treatment data (ADNM-8: 1.8%, n = 14) were missing due to branching in the online data collection platform, where, for example, the ADNM-8 questions only appeared if the applicant checked at least one of the stressful event items. ...
... Furthermore, in a binary logistic regression analysis, age and ethnicity significantly predicted missing data, with younger age and non-white ethnic backgrounds associated with higher non-completion (p < .01). Therefore, missing data were assumed to be missing at random [MAR; 51,52]. ...
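The kind of missingness check described in this excerpt can be sketched as a logistic regression of a missingness indicator on covariates; the file and variable names below are hypothetical.

```python
# Sketch: regress an indicator of missing post-treatment scores on covariates.
# Significant predictors argue against MCAR and are consistent with MAR once
# those predictors are included in the analysis model. Names are placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("trial_data.csv")                        # hypothetical file
df["missing_post"] = df["adnm8_post"].isna().astype(int)  # 1 = non-completion

model = smf.logit("missing_post ~ age + C(ethnicity)", data=df).fit()
print(model.summary())
```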
Preprint
Full-text available
Background: Trials of disorder-specific Internet-delivered cognitive-behavioral therapy (ICBT) for Adjustment Disorder (AD) show moderate effect sizes but may have limited scalability in routine care settings, where clients present with a range of concerns. Transdiagnostic ICBT, which addresses common emotional and behavioral concerns irrespective of diagnosis, could address the need for effective and scalable treatments for symptoms of AD. Objective: This study aimed to evaluate the effectiveness of a transdiagnostic ICBT course for patients with high AD symptoms, and to investigate predictors of treatment outcomes and treatment satisfaction. Methods: 793 participants received a therapist-guided, transdiagnostic ICBT course. The study measured changes in AD symptoms from pre-treatment to post-treatment to 3-month follow-up using the Adjustment Disorder – New Module 8 (ADNM-8). Results: The prevalence of high AD symptoms (defined as a score >23 on the ADNM-8) was 54.8% at pre-treatment. The study found large reductions in AD symptoms from pre-treatment to post-treatment (d = 1.29, 95% CI [1.13, 1.45]) and from pre-treatment to the 3-month follow-up (d = 1.67, 95% CI [1.49, 1.85]). These effect sizes were comparable to those found in previous ICBT trials of AD-specific treatments. Approximately 70% of participants scored below the clinical cut-off for high AD symptoms at post-treatment, and 79% met this criterion at follow-up. Engagement in treatment and post-treatment satisfaction were similar between individuals with high AD symptoms and those without, with the majority (76.9%) completing four or more lessons and 81.7% reporting overall satisfaction. Conclusions: The findings suggest that transdiagnostic ICBT is an effective and acceptable treatment for AD symptoms, with results comparable to those of AD-specific interventions. The high prevalence of AD symptoms and stressful life events among participants in a routine care setting underscores the importance of early identification of individuals with high AD symptoms.
... The methods with which the missing data are dealt with depend on the category into which the data fall. Three broad categories for pattern missingness are defined: monotone missingness, file matching, and general missingness [5], [6]. If the set of variables for a given instance is y_1, ..., y_k, monotone missingness occurs if, whenever a value y_j is missing, the variables can be ordered such that y_{j+1}, ..., y_k are also missing. ...
... Missing data are often classified into one of three mechanisms, as defined by Little and Rubin [5]. The mechanisms are listed as follows in order from least to most dependent on other information. ...
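As a toy illustration of the three mechanisms (MCAR, MAR, MNAR) discussed in these excerpts, the following sketch simulates each one and shows how complete-case means behave; the rates and coefficients are arbitrary.

```python
# Toy simulation of the three missing data mechanisms for a variable y with a
# fully observed covariate x. Complete-case means stay unbiased under MCAR
# but drift under MAR and MNAR. All constants are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

m_mcar = rng.random(n) < 0.3                          # independent of everything
m_mar  = rng.random(n) < logistic(-1.0 + 1.5 * x)     # depends only on observed x
m_mnar = rng.random(n) < logistic(-1.0 + 1.5 * y)     # depends on y itself

for label, mask in [("MCAR", m_mcar), ("MAR", m_mar), ("MNAR", m_mnar)]:
    print(f"{label}: complete-case mean of y = {y[~mask].mean():+.3f} "
          f"(full-data mean = {y.mean():+.3f})")
```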
Preprint
This paper presents an impact assessment for the imputation of missing data. The data set used is HIV Seroprevalence data from an antenatal clinic study survey performed in 2001. Data imputation is performed through five methods: Random Forests, Autoassociative Neural Networks with Genetic Algorithms, Autoassociative Neuro-Fuzzy configurations, and two Random Forest and Neural Network based hybrids. Results indicate that Random Forests are superior in imputing missing data in terms both of accuracy and of computation time, with accuracy increases of up to 32% on average for certain variables when compared with autoassociative networks. While the hybrid systems have significant promise, they are hindered by their Neural Network components. The imputed data is used to test for impact in three ways: through statistical analysis, HIV status classification and through probability prediction with Logistic Regression. Results indicate that these methods are fairly immune to imputed data, and that the impact is not highly significant, with linear correlations of 96% between HIV probability prediction and a set of two imputed variables using the logistic regression analysis.
... We employed intent-to-treat analyses that utilized all available data, assuming ignorable missingness (Missing At Random). [32][33][34][35] SBIRT survey scores over time were analyzed using general linear mixed models with a random intercept for practice, adjusting for average number of patient visits per week and practice ownership (private, hospital owned, safety net, rural health center). For count data (# screened, # screen positive, etc.) we used generalized linear mixed models (negative binomial) with random intercepts for practice to analyze the data, again adjusting for average number of patient visits per week and practice ownership. ...
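A minimal sketch of the random-intercept mixed model described in this excerpt, using statsmodels; the data file, outcome, and covariate names are hypothetical stand-ins for the SBIRT variables.

```python
# Sketch: linear mixed model with a random intercept for practice, adjusting
# for visit volume and ownership, as described above. Names are placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("sbirt_long.csv")      # hypothetical long-format data
model = smf.mixedlm(
    "sbirt_score ~ time + visits_per_week + C(ownership)",
    data=df,
    groups=df["practice_id"],           # random intercept per practice
)
result = model.fit()
print(result.summary())
# Mixed models use all available observations per practice, which is what
# makes them compatible with an ignorable (MAR) missingness assumption.
```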
... (1, 466) = 14.12, p = .001). As missing data were related to measured variables and characteristics of participants, we assumed that they were Missing at Random (MAR) and used full information maximum likelihood (Little & Rubin, 2002). ...
Article
Full-text available
This study examined the stability/change trajectories of future expectations during high school, analyzing whether adolescents' sex, attachment to parents and peers, multiple risk, and pandemic-related stress explained these trajectories. The sample included 467 Portuguese adolescents, assessed three times across 18 months. Results revealed that adolescents' future expectations increased significantly over time. We observed significant inter-participant variance at initial levels and growth rates. Emotional bonds with parents were associated with higher initial levels of optimistic future expectations, whereas alienation from peers was associated with lower initial levels of optimism. Adolescents' exposure to multiple psychosocial risks was associated with lower growth of optimism. In turn, alienation from peers was associated with a higher growth rate of optimism. Pandemic-related stress was negatively associated with optimism at T2 and T3, and these associations differed over time. Our findings emphasize the associations between individual, relational, and broader social contexts and the development of adolescents' future expectations.
... The decision on whether to impute or not depends on the rate of missing values and the underlying mechanism that causes the missing entries. Generally, missing data can be a result of one of three processes: missing completely at random (MCAR), missing not at random (MNAR), or missing at random (MAR) [20]. MCAR is a result of random errors, such as equipment failures, whereas MNAR occurs when the mechanism causing the missing entries has no ...
Preprint
Full-text available
Background Missing data in medical datasets poses significant challenges for developing effective AI/ML pipelines. Inaccurate imputation can lead to biased results, reduced model performance, and compromised clinical insights. Understanding how different imputation methods affect AI/ML model performance is crucial for ensuring accurate clinical findings. Objective This study systematically investigates the effects of different imputation methods on AI/ML model performance and the clinical implications of these methods. Methods We investigate the impact of four different missing data strategies on the performance of common classification algorithms for analyzing medical data. The performance was evaluated based on sensitivity and specificity metrics for the tasks of predicting COVID-19 diagnosis and patient deterioration. We also perform feature analysis to understand the clinical implications the choice of imputation method has. Results Our findings show that the choice of imputation method significantly affects the performance of AI/ML techniques and the clinical conclusions drawn from the data. The optimal handling of missing values depends on (i) the composition of the features with missing values, (ii) the rate of missing values, and (iii) the pattern of the missing features. Using COVID-19 diagnosis and patient deterioration as representative examples of clinical tasks, our results indicate that MICE imputation yields the best overall performance, resulting in a 26% improvement in accuracy compared to baseline methods. Specifically, for predicting COVID-19 diagnosis, we achieved a sensitivity of 81% and specificity of 98%, while for patient deterioration, the sensitivity was 65% and specificity was 99%. Conclusion This study demonstrates the critical impact of missing data imputation on AI/ML model performance and the clinical insights derived from these models. Our findings underscore the importance of selecting appropriate imputation techniques tailored to the specific characteristics of medical data to ensure accurate and reliable AI/ML predictions.
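A minimal sketch of a MICE-style workflow of the kind evaluated in this study, using scikit-learn's IterativeImputer as the chained-equations stand-in and reporting sensitivity and specificity; the data files and the choice of classifier are placeholders, not the study's pipeline.

```python
# Sketch: chained-equations-style imputation (IterativeImputer) fitted on the
# training split only, then sensitivity/specificity of a downstream classifier.
# File names and the classifier are placeholders.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

X, y = np.load("features.npy"), np.load("labels.npy")      # placeholders
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

imputer = IterativeImputer(max_iter=10, random_state=0)
X_tr_imp = imputer.fit_transform(X_tr)   # fit on training data only
X_te_imp = imputer.transform(X_te)       # avoid leaking test information

clf = RandomForestClassifier(random_state=0).fit(X_tr_imp, y_tr)
tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te_imp)).ravel()
print("sensitivity:", tp / (tp + fn), "specificity:", tn / (tn + fp))
```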
... Little's missing completely at random test was non-significant, χ²(428) = 435.324, p = .394, supporting the imputation of missing values [85]. Next, we explored whether variables were significantly skewed, which could compromise results. ...
Article
Full-text available
Background Child maltreatment exerts lasting effects on emotion regulation, which in turn accounts for adults' risk for psychopathology such as depression. In this vulnerable population, deficits in emotion regulation of negative affect are well established and include reliance on emotional suppression and rumination strategies. In contrast, alterations in the regulation of positive affect associated with child maltreatment history are less understood. We examined the role of positive rumination and dampening of positive affect, two emotion regulation strategies that may be impaired by the experience of child maltreatment and are associated with depression risk. We hypothesized that alterations in positive rumination and dampening would explain the association between women's child maltreatment history and heightened risk for current depressive symptoms. To determine if positive affect regulation accounts for unique variance between child maltreatment history and depression risk, we controlled for brooding rumination. Methods Undergraduate women (n = 122) completed surveys on child maltreatment, depressive symptoms, and their tendency to dampen or engage in positive rumination in response to positive affect, reflecting cross-sectional data. The PROCESS macro (model 4) was run in SPSS to examine the extent to which emotion regulation strategies accounted for the association between child maltreatment history and current depressive symptoms. Results Child maltreatment history was associated with a higher tendency to dampen positive affect but was not linked with positive rumination. Dampening partially explained the link between child maltreatment and women's current depressive symptoms. Dampening and brooding rumination each accounted for unique variance in the association between child maltreatment and depressive symptoms. Conclusions Results suggest that emotion suppression strategies among child maltreatment survivors may also extend to positive affect, with impairments in specific regulation strategies. Currently dysphoric women with a history of child maltreatment tend to dampen their positive moods and reactions to events as well as ruminate on their dysphoric moods; both tendencies accounted for unique variance in current depression risk. Longitudinal research is warranted to clarify the role of alterations in positive emotion regulation strategies in understanding how child maltreatment fosters risk for psychopathology such as depression.
... Differences in CFI (ΔCFI), RMSEA (ΔRMSEA), and SRMR (ΔSRMR) were also used for model comparisons (ΔCFI ≤ 0.01, ΔRMSEA ≤ 0.015, and ΔSRMR ≤ 0.030 indicate that there are no substantial changes in model fit of the current constrained model versus the previous less constrained model) (Chen, 2007; Cheung & Rensvold, 2002). The missing data were handled using the full information maximum likelihood method (Little & Rubin, 2002) as it yields more accurate results than other approaches by minimizing bias in regression and standard error estimates for all types of missing data (Graham, 2009; Schlomer et al., 2010). ...
Article
Full-text available
Despite evidence indicating that social mindfulness may be a precondition for both prosocial and aggressive behavior, there remains a limited understanding of how the bidirectional dynamics between them unfold over time. Framed in the developmental cascade model, this study examined the longitudinal reciprocal relations between social mindfulness and these two distinct social behaviors among early adolescents by disentangling within-person and between-person effects. A total of 1087 Chinese early adolescents (48.7% girls; Mage = 11.35 ± 0.49 years at Time 1) participated in a three-wave longitudinal study with about four-month assessment intervals. The random-intercept cross-lagged panel model indicated that, at the within-person level, social mindfulness and prosocial behavior positively predicted each other over time. Furthermore, fluctuations in social mindfulness were found to negatively predict changes in aggressive behavior at subsequent time points, but the reverse was not true, suggesting a unidirectional influence. A similar pattern was found between social mindfulness and reactive aggressive behavior, but no significant bidirectional effects emerged between social mindfulness and proactive aggressive behavior. These findings highlight the role of social mindfulness in shaping early adolescents’ social behavior over time, thus providing insights for more targeted and effective interventions to foster prosocial behavior and prevent aggressive behavior.
... Missing data have also been intensively studied in a range of modern fields, for example, in matrix completion (Jin et al., 2022; Mao et al., 2019), sparse principal component analysis (Zhu et al., 2022), and many more. The book by Little and Rubin (2019) provides a comprehensive treatment of the missing data problems. ...
Preprint
Full-text available
We study the problem of missing not at random (MNAR) datasets with binary outcomes. We propose an exponential tilt based approach that bypasses any knowledge on 'nonresponse instruments' or 'shadow variables' that are usually required for statistical estimation. We establish a sufficient condition for identifiability of tilt parameters and propose an algorithm to estimate them. Based on these tilt parameter estimates, we propose importance weighted and doubly robust estimators for any mean functions of interest, and validate their performances in a synthetic dataset. In an experiment with the Waterbirds dataset, we utilize our tilt framework to perform unsupervised transfer learning, when the responses are missing from a target domain of interest, and achieve a prediction performance that is comparable to a gold standard.
... Missing data, accounting for 13.2% of the data because of loss to follow-up at T1, were imputed using the expectation maximization (EM) method (27), which assumed that data were missing completely at random (Little's MCAR test: p = .337) and provided unbiased parameter estimates under this assumption. ...
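For illustration of the EM idea mentioned in this excerpt, the following is a minimal, self-contained EM sketch for a bivariate normal in which one variable is partially missing; it assumes ignorable missingness and is not the cited study's implementation.

```python
# Minimal EM sketch for a bivariate normal: x fully observed, y partly missing
# (np.nan). Iterates E-step (expected sufficient statistics for missing y) and
# M-step (update mean/covariance). Illustrative only; assumes ignorable missingness.
import numpy as np

def em_bivariate(x, y, n_iter=50):
    miss = np.isnan(y)
    mu = np.array([x.mean(), np.nanmean(y)])   # init from observed data
    cov = np.cov(x[~miss], y[~miss])           # init from complete cases
    for _ in range(n_iter):
        # E-step: E[y|x] and E[y^2|x] for the missing entries.
        beta = cov[0, 1] / cov[0, 0]
        resid_var = cov[1, 1] - beta * cov[0, 1]
        y_hat = np.where(miss, mu[1] + beta * (x - mu[0]), y)
        y2_hat = np.where(miss, y_hat**2 + resid_var, y**2)
        # M-step: update parameters from the completed sufficient statistics.
        mu = np.array([x.mean(), y_hat.mean()])
        cxy = np.mean(x * y_hat) - mu[0] * mu[1]
        cov = np.array([[x.var(), cxy],
                        [cxy, np.mean(y2_hat) - mu[1] ** 2]])
    return mu, cov, y_hat   # y_hat holds the EM-imputed values
```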
Preprint
Full-text available
Background: The present study developed an intervention using a personalized Healthy Eating Report Card to provide parents with personalized insights into the extent to which their child adhered to international healthy eating guidelines and engaged in favorable family home food environments. This study aimed to assess the effectiveness of this intervention in improving preschool-aged children's eating practices. Methods: A three-armed, single-blinded randomized controlled trial was conducted with 331 parent-child dyads recruited from eight local kindergartens in Hong Kong. Parents were asked to complete the International Healthy Eating Report Card Scale at baseline and one-month post-intervention. The participants were randomly assigned to one of three groups: (i) the intervention group (who received a personalized Healthy Eating Report Card), (ii) the usual care group (who received a standard government-issued leaflet on healthy eating), or (iii) the mere-measurement control group (who received no healthy eating materials). We examined if the improvement in the overall report card score of the intervention group was statistically higher than that of the other two groups using ANCOVA. Results: The results of ANCOVA demonstrated that the overall report card score was significantly different among the three groups after adjusting for the baseline value [F(2, 327) = 3.98, p = .020, ηp² = .02]. Bonferroni post-hoc tests revealed that children in the intervention group improved significantly more than those in the mere-measurement control group (p < .05) with an improvement of 4.6%. The overall report card score of the usual care group was not significantly different from that of the intervention group or the mere-measurement control group (p > .05). Conclusions: This study provides promising evidence for the effectiveness of the personalized Healthy Eating Report Card in promoting healthy eating practices among preschool-aged children. It also demonstrated its potential as a cost-efficient and scalable tool for health interventions. Trial registration: This trial was registered retrospectively on November 19, 2024, at chictr.org.cn (ChiCTR number: ChiCTR2400092558).
... The data preprocessing phase involves several critical steps to ensure data quality and compatibility. We begin by cleaning the datasets, removing incomplete or inconsistent records, and handling missing values through appropriate imputation techniques [19]. We then standardize variable formats across different years of NSDUH data to ensure comparability. ...
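A minimal pandas sketch of the preprocessing steps described in this excerpt (cleaning, simple imputation, and format standardization across survey years); the file names, year range, and column mappings are hypothetical.

```python
# Sketch of the preprocessing described above; file names, the year range,
# and the rename map are placeholders, not the actual NSDUH processing code.
import pandas as pd

frames = []
for year in range(2015, 2020):                     # hypothetical range
    df = pd.read_csv(f"nsduh_{year}.csv")
    df = df.rename(columns={"IRSEX": "sex"})       # standardize formats (example mapping)
    df = df.drop_duplicates()
    df = df.dropna(thresh=int(0.5 * df.shape[1]))  # drop mostly-empty records
    for col in df.columns:                         # simple imputation
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna("missing")
    df["year"] = year
    frames.append(df)

data = pd.concat(frames, ignore_index=True)
```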
Article
Full-text available
Background: The rapid adoption of telehealth services for youth mental health care necessitates a comprehensive evaluation of its effectiveness. This study aimed to analyze the impact of telehealth on youth mental health outcomes using artificial intelligence techniques applied to large-scale public health data. Methods: We conducted an AI-driven analysis of data from the National Survey on Drug Use and Health (NSDUH) and other SAMHSA datasets. Machine learning techniques, including random forest models, K-means clustering, and time series analysis, were employed to evaluate telehealth adoption patterns, predictors of effectiveness, and comparative outcomes with traditional in-person care. Natural language processing was used to analyze sentiment in user feedback. Results: Telehealth adoption among youth increased significantly, with usage rising from 2.3 sessions per year in 2019 to 8.7 in 2022. Telehealth showed comparable effectiveness to in-person care for depressive disorders and superior effectiveness for anxiety disorders. Session frequency, age, and prior diagnosis were identified as key predictors of telehealth effectiveness. Four distinct user clusters were identified, with socioeconomic status and home environment strongly associated with positive outcomes. States with favorable reimbursement policies saw a 15% greater increase in youth telehealth utilization and 7% greater improvement in mental health outcomes. Conclusions: Telehealth demonstrates significant potential in improving access to and effectiveness of mental health services for youth. However, addressing technological barriers and socioeconomic disparities is crucial to maximize its benefits.
... MICE generates imputations based on a set of imputation models for each variable with missing values and imputes different variable types, such as continuous, binary, and categorical data. The MICE model was specified to complete 5 imputations with 5 iterations [20][21][22]. Effect estimates were computed on each imputed dataset and pooled using Rubin's rules [20,23,24]. ...
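A minimal sketch of the MICE-plus-pooling workflow described in this excerpt, using the statsmodels implementation, which pools estimates across imputations via Rubin's rules; the formula and column names are placeholders, and a plain OLS model stands in for the study's regression.

```python
# Sketch: MICE with 5 burn-in iterations and 5 imputations, pooled estimates.
# Formula and columns are placeholders; sm.OLS stands in for the study's model.
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation import mice

df = pd.read_csv("cohort.csv")[["outcome", "age", "income", "site_volume"]]
imp = mice.MICEData(df)                                  # chained-equations imputer
analysis = mice.MICE("outcome ~ age + income + site_volume", sm.OLS, imp)
results = analysis.fit(n_burnin=5, n_imputations=5)
print(results.summary())   # coefficients pooled across imputations (Rubin's rules)
```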
Article
Full-text available
Purpose Disparities in gynecologic cancer clinical trial enrollment exist between Black and White patients; however, few examine racial differences in clinical trial enrollment predictors. We examined whether first-line clinical trial enrollment determinants differed between Black and White gynecologic cancer patients. Methods We used the National Cancer Database to identify Black and White gynecologic cancer (cervix, ovarian, uterine) patients diagnosed in 2014–2020. Multivariable logistic regression was used to estimate adjusted odds ratios (ORs) and 95% confidence intervals (CIs) for associations between clinical trial enrollment (yes vs no) and sociodemographic, facility, tumor, and treatment characteristics stratified by race. We included a multiplicative interaction term between each assessed predictor and race to test whether associations differed by race. Results We included 703,022 gynecologic cancer patients (mean [SD] age at diagnosis, 60.9 [13.1] years). Clinical trial enrollment was lower among Black (49/86,058, 0.06%) vs. White patients (710/616,964, 0.11%). Only cancer site differed by race: among Black patients, a cervical vs. uterine cancer diagnosis (OR = 4.63, 95% CI = 1.67–12.88) was associated with higher clinical trial enrollment odds, while among White patients, both cervical (OR = 2.21, 95% CI = 1.48–3.29) and ovarian (OR = 3.40, 95% CI = 2.58–4.47) cancer diagnoses (vs. uterine cancer) were associated with higher enrollment odds. Most predictors were associated with clinical trial enrollment odds among White but not Black patients. Conclusion Few differences in first-line clinical trial enrollment predictors exist between Black and White gynecologic cancer patients. Although small numbers of Black patients and low clinical trial prevalence are limitations, this descriptive analysis is important in understanding racially disparate clinical trial enrollment.
... For the 10MWT and TUG, the worst value of all participants was assigned to participants who were unable to perform the gait tests. This type of missing data is called "Missing Not at Random", and using the "worst value" is considered a conservative method of imputing data [43]. The analysis of MEP was not performed on an intention-to-treat basis due to external problems with the TMS device and the elevated amount of missing data for this outcome. ...
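The conservative worst-value assignment described here amounts to a one-line imputation per outcome; the sketch below uses hypothetical column names and assumes higher values mean worse performance.

```python
# Sketch of worst-value imputation for participants unable to perform the gait
# tests (MNAR missingness). Column names are placeholders; higher = worse.
import pandas as pd

df = pd.read_csv("gait_outcomes.csv")        # hypothetical outcome table
for col in ["walk_10m_seconds", "tug_seconds"]:
    worst = df[col].max(skipna=True)         # worst observed value
    df[col] = df[col].fillna(worst)
```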
Article
Full-text available
Background Although transcutaneous spinal cord stimulation (tSCS) has been suggested as a safe and feasible intervention for gait rehabilitation, no studies have determined its effectiveness compared to sham stimulation. Objective To determine the effectiveness of tSCS combined with robotic-assisted gait training (RAGT) on lower limb muscle strength and walking function in incomplete spinal cord injury (iSCI) participants. Methods A randomized, double-blind, sham-controlled clinical trial was conducted. Twenty-seven subacute iSCI participants were randomly allocated to tSCS or sham-tSCS group. All subjects conducted a standard Lokomat walking training program of 40 sessions (5 familiarization sessions, followed by 20 sessions combined with active or sham tSCS, and finally the last 15 sessions with standard Lokomat). Primary outcomes were the lower extremity motor score (LEMS) and dynamometry. Secondary outcomes included the 10-Meter Walk Test (10MWT), the Timed Up and Go test (TUG), the 6-Minute Walk test (6MWT), the Spinal Cord Independence Measure III (SCIM III) and the Walking Index for Spinal Cord Injury II (WISCI-II). Motor evoked potential (MEP) induced by transcranial magnetic stimulation (TMS) were also assessed for lower limb muscles. Assessments were performed before and after tSCS intervention and after 3-weeks follow-up. Results Although no significant differences between groups were detected after the intervention, the tSCS group showed greater effects than the sham-tSCS group for LEMS (3.4 points; p = 0.033), 10MWT (37.5 s; p = 0.030), TUG (47.7 s; p = 0.009), and WISCI-II (3.4 points; p = 0.023) at the 1-month follow-up compared to baseline. Furthermore, the percentage of subjects who were able to walk 10 m at the follow-up was greater in the tSCS group (85.7%) compared to the sham group (43.1%; p = 0.029). Finally, a significant difference (p = 0.049) was observed in the comparison of the effects in the amplitude of the rectus femoris MEPs of tSCS group (− 0.97 mV) and the sham group (− 3.39 mV) at follow-up. Conclusions The outcomes of this study suggest that the combination of standard Lokomat training with tSCS for 20 sessions was effective for LEMS and gait recovery in subacute iSCI participants after 1 month of follow-up. Trial registration ClinicalTrials.gov (NCT05210166).
... Another limitation of mr.mash-rss is that it requires the summary statistics to be computed on the same samples for each phenotype. In other words, there should not be missing data in Y in (1). Dealing with arbitrary patterns of missing data in multivariate models is not a trivial problem [46] and is an area where more research is needed. If individual-level data are available, missing values may be imputed before the prediction analysis. ...
Article
Full-text available
Polygenic prediction of complex trait phenotypes has become important in human genetics, especially in the context of precision medicine. Recently, mr.mash, a flexible and computationally efficient method that models multiple phenotypes jointly and leverages sharing of effects across such phenotypes to improve prediction accuracy, was introduced. However, a drawback of mr.mash is that it requires individual-level data, which are often not publicly available. In this work, we introduce mr.mash-rss, an extension of the mr.mash model that requires only summary statistics from Genome-Wide Association Studies (GWAS) and linkage disequilibrium (LD) estimates from a reference panel. By using summary data, we achieve the twin goal of increasing the applicability of the mr.mash model to data sets that are not publicly available and making it scalable to biobank-size data. Through simulations, we show that mr.mash-rss is competitive with, and often outperforms, current state-of-the-art methods for single- and multi-phenotype polygenic prediction in a variety of scenarios that differ in the pattern of effect sharing across phenotypes, the number of phenotypes, the number of causal variants, and the genomic heritability. We also present a real data analysis of 16 blood cell phenotypes in the UK Biobank, showing that mr.mash-rss achieves higher prediction accuracy than competing methods for the majority of traits, especially when the data set has smaller sample size.
... In this way, valuable information for analysis and for detecting deviations can be lost, especially in datasets where missing values occur relatively frequently and are not entirely random. As Acock [5] and Little [6] note, deleting records with missing (incomplete) data often leads to unsatisfactory results, especially when the missing data are not normally distributed. Deleting a column is considered acceptable when the missing data in it exceed 85%. ...
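The 85% rule mentioned in this excerpt translates into a simple threshold on the per-column missingness rate; the sketch below uses a hypothetical dataset.

```python
# Sketch: drop columns whose share of missing values exceeds 85%; keep the
# rest for imputation. The file name is a placeholder.
import pandas as pd

df = pd.read_csv("wwtp_measurements.csv")     # hypothetical dataset
missing_share = df.isna().mean()              # fraction missing per column
to_drop = missing_share[missing_share > 0.85].index
df = df.drop(columns=to_drop)
print("dropped columns:", list(to_drop))
```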
Conference Paper
The article examines the possibilities for big data analysis, focusing on dealing with missing data. By exploring the techniques and methodologies used for imputation, interpretation, and modeling of missing data in the context of big data analytics, we emphasize the relationship between data completeness and the quality of solutions. The ability to effectively handle missing data not only enhances the reliability and validity of analytical insights but also enables informed strategic decision-making. An experiment was conducted on filling missing data in a dataset based on three different mathematical models. The best result is reported with the autoregressive recursive interpolation model. For this purpose, data from a wastewater treatment plant, which have not been the subject of research so far, were used.
... The technical validation of our dataset primarily involved ensuring data completeness by identifying and addressing missing values, a foundational approach widely adopted in data quality assessments 29,30 . This check, a standard practice in database validation, enables the identification of gaps that could affect data reliability. ...
Article
Full-text available
Governments procure large amounts of goods and services to help them implement policies and deliver public services; in Italy, this is an essential sector, corresponding to about 12% of the gross domestic product. Data are increasingly recorded in public repositories, although they are often divided into multiple sources and not immediately available for consultation. This paper provides a description and analysis of an effort to collect and arrange a legal public administration database. The main source of interest involves the National Anti-Corruption Authority in Italy, which describes more than 3 million tenders. To improve usability, the database is integrated with two other relevant data sources concerning information on public entities and territorial units for statistical purposes. The period identified by domain experts covers 2016–2023. The analysis also identifies key challenges that arise from the current Open Data catalogue, particularly in terms of data completeness. A practical application is described with an example of use. The final dataset, called Italian Tender Hub (ITH), is available in a repository with a description of its use.
... The data matrix included 4.6% missing values (33 out of 708 measured values). The missing completely at random (MCAR) test (χ² = 18.30, df = 10, p = 0.051) indicated no differences between data with or without missing values (Little & Rubin, 2002). Therefore, the scores were expected to be missing at random, requiring no further action. ...
Article
Full-text available
This study explored the profiles of 175 teachers' self-efficacy (TSE) in elementary, vocational, and higher physical education (PE) and examined teachers' perceptions of inter-student bullying as outcomes of these profiles. The links between teachers' perceptions of inter-student bullying and teaching level, teaching experience, tertiary education, gender, and age covariates were also analysed. Latent cluster analysis (LCA), based on cross-sectional data collected via an anonymous online survey, revealed three profiles (low, intermediate, and high). Physical education teachers with low teaching-efficacy profiles reported more frequent inter-student bullying in PE than teachers with intermediate and high profiles. Regarding the low teaching-efficacy profile, tertiary education emerged as a significant covariate for bullying, indicating that PE teachers with a Master's degree in Sport Science reported more frequent inter-student bullying than teachers with other degrees. In the intermediate profile, younger and more experienced PE teachers reported more frequent inter-student bullying than older and less experienced counterparts of the same cluster. Organisations responsible for teachers' education and voluntary professional development must consider the diversity of TSE, regarding teachers' age, teaching experience, and educational level, by tailoring pedagogical practices to promote bully-free PE for students.
... The guardians who participated in this study were their parents. Non-responses to the questionnaire were processed using multiple imputation methods [32]. In the first step, 5 data sets were created by imputing missing values based on fully conditional specification (FCS), a widely used multiple imputation method. ...
Article
Full-text available
Based on emotional security, stress, and spillover and crossover theories, this study aimed to examine the indirect pathways between destructive and constructive interparental conflict, parenting stress, unsupportive parenting, and child insecurity six months later. Using data from two time points beginning when Korean children (N = 159) were approximately 3–5 years old, two dual-mediation models of the relevant variables were constructed. The results indicate that destructive conflict is associated with higher levels of parenting stress, whereas constructive conflict is associated with lower levels of stress. Furthermore, mothers’ and fathers’ parenting stress influenced their own unsupportive parenting behaviors, which, in turn, influenced their children’s insecurity, suggesting a spillover effect. However, the crossover effect and mediation analyses provided partial support for various pathways of the hypotheses. By examining both destructive and constructive conflict, including both maternal and paternal variables, and examining not only spillover but also crossover effects, this study highlights that while constructive conflict may reduce parental stress and unsupportive parenting behaviors, the negative effects of destructive conflict may affect children more strongly. Particularly, by examining the spillover and crossover effects in the unique cultural context of Korean families, this study provides important insights into interparental conflict’s impact on child development.
... Given the 4-point response scale and asymmetric item distributions, they were treated as ordinal categorical variables in the analysis (20). The present sample had minimal missing responses (N = 4) on the CDI-SF, which were handled using full information maximum likelihood under the missing-at-random assumption (21). Psychometric properties of the CDI-SF were examined in three steps. ...
Article
Full-text available
Objectives Cardiac patients experience various somatic and psychosocial symptoms, and stress is an important prognostic factor of cardiac rehabilitation. This study evaluated the psychometric properties of the 12-item Cardiac Distress Inventory – Short Form (CDI-SF) in the Chinese context. Methods A total of 227 patients with cardiac diseases were recruited in a specialist outpatient clinic in Hong Kong between Aug 2022 and July 2023. The participants completed the CDI-SF and validated measures on psychosocial functioning and quality of life. Exploratory factor analysis and partial correlation analysis were conducted to examine the factorial validity, reliability, and convergent validity of the CDI-SF with reference to validating measures. Results The 1-factor model showed adequate model fit with excellent composite reliability (ω = .92) and substantial factor loadings (λ = .64–.94, p < .01). The CDI-SF factor was negatively associated with age (r = −.21, p < .01) and showed positive and strong partial correlations (r = .59–.69, p < .01) with impact of event, depression, and burnout, and negative partial correlations (r = −.43 to −.54, p < .01) with resilience and quality of life. Conclusion Our study provides the first results on the psychometric properties of the CDI-SF among cardiac patients in Hong Kong. The psychometric results support the CDI-SF as a precise, valid, and reliable measure of cardiac distress in the Chinese context.
... For the analyses, we used SAS (Proc Mixed; Littell et al., 2006). Following good practice (e.g., Grimm et al., 2016), we used full information maximum likelihood procedures to accommodate incomplete data under missing-at-random assumptions (Little & Rubin, 2019). Given the large sample size for the analyses and to guard against false positives (for discussion, see Lakens et al., 2018), we used the p < .0001 ...
Article
Full-text available
According to the Flynn effect, performance on cognitive ability tests has improved over the past decades. However, we know very little about whether such historical improvements generalize to middle-aged adults (aged 45–65) and differ across nations. We used harmonized data on episodic memory from nationally representative longitudinal panel surveys across a total of 16 countries (United States, Mexico, China, England, and countries in Continental, Mediterranean, and Nordic Europe). We compared historical change in age-related trajectories of episodic memory among middle-aged adults. Our sample included 117,231 participants who provided 330,390 observations. Longitudinal multilevel regression models revealed that today’s middle-aged adults in the United States perform worse on episodic memory tests than their peers in the past. By contrast, today’s middle-aged adults in most other countries perform better on these tests than their peers in the past. However, later-born cohorts of U.S. and Chinese middle-aged adults experienced less steep within-person decrements—or even increments—in episodic memory than earlier born cohorts. Historical change trends persisted when controlling for sociodemographic factors, as well as for indicators of physical and mental health. Differences in episodic memory by gender and education became smaller over historical time across all nations. Our findings suggest that countries differ considerably in episodic memory performance, by more than half a standard deviation, and in the direction and size of how midlife episodic memory trajectories have changed over historical time. Further factors related to historical changes in midlife episodic memory need to be identified by future research.
... In this study, the attrition rates were 2.0% at T1, 4.4% at T2, 6.6% at T3, 22.8% at T4, and 23.1% at T5. The missing completely at random (MCAR) test was conducted to clarify the missing data mechanism (Little & Rubin, 2019). Results indicated that the data did not meet the criteria for MCAR (χ²(174) = 366.78, ...
Article
Full-text available
Anxiety symptoms are prevalent among college students and are associated with a range of detrimental consequences. Self-compassion and emotion regulation difficulties are important factors affecting anxiety symptoms, but their functional mechanism and longitudinal correlation are still unclear. This three-year longitudinal study (baseline: n = 5785, 48.2% female, Mage = 18.63 years, SD = 0.88; T1 to T5: n = range from 4312 to 5497) aimed to validate the emotion regulation model of self-compassion by examining the associations between self-compassion, emotion regulation difficulties, and anxiety symptoms. Random intercept cross-lagged panel models (RI-CLPMs) were used to distinguish within-person variations over time from stable between-person differences. The results obtained from the RI-CLPMs indicated that there is a bidirectional effect between self-compassion and anxiety symptoms at the within-person level. Emotion regulation difficulties played a longitudinal mediating role in the prediction from self-compassion to anxiety symptoms at the within-person level, validating the emotion regulation model of self-compassion. The current study indicates that cultivating self-compassion in college students is crucial, as it can improve their emotion regulation skills and alleviate anxiety symptoms.
... Robust methods to account for data loss were used in these analyses via SEM, which accounts for this type of drop-off in the data over time by modeling random effects that capture an individual's change over time, and allows for missing data in dependent variables using full information maximum likelihood estimation (FIML), where dependent variables are correlated with measures from earlier data collection time points (Little & Rubin, 2019). Multiple fit indices (Akaike Information Criterion (AIC), Root Mean Square Error of Approximation (RMSEA), Comparative Fit Index (CFI), Tucker-Lewis Index (TLI), and Standardized Root Mean Square Residual (SRMR)) evaluated how well the sample data fit the SEM model. ...
Article
Full-text available
Objective To examine the impact of coping styles in older adults with asthma on the prospective relationship between depressive symptoms and asthma outcomes, and how their perceptions of social support influenced their coping styles. Methods Adults 60 and over with asthma were recruited and interviewed about their experiences of asthma, depression, and other psychosocial factors over three time points (Baseline, 6-month, and 12-month visits). Structural equation models examined the mediating roles of coping styles in the relationship between depressive symptoms (assessed by BDI-II) and asthma outcomes (i.e., asthma control, asthma quality of life, asthma-related distress, asthma-related hospitalizations, and oral corticosteroid use) and the mediating role of perceived social support in the relationship between depressive symptoms and coping style. Results 455 participants were included in this study. Overall, 33.9% of the study population self-identified as Black and 32.8% as Hispanic. Depressive symptoms at baseline predicted less spiritual coping at 6 months (β = −0.15, p = 0.03), more negative coping at 6 months (β = 0.44, p < .0001), and worse asthma outcomes at 12 months (β = 0.31, p < .0001). None of the coping styles significantly mediated the relationship between depressive symptoms and asthma outcomes. Perceived social support mediated the relationship between depressive symptoms and positive coping, such that more depressive symptoms predicted less perceived social support, which in turn resulted in less positive coping engagement (β = −0.06, p = 0.03). Conclusions This study demonstrates that, in older adults with asthma, depressive symptoms impact perceived social support, coping strategy selection (including spiritual coping), and subsequent asthma outcomes.
... Assumption (1) is a missing completely at random (MCAR) assumption (Little and Rubin, 2014) leading to the complete case analysis, where data on participants are used to estimate the health indicators of non-participants. If assumption (1) holds, the non-participation is selective neither with respect to the variables of interest nor with respect to background variables. ...
Preprint
Aims: A common objective of epidemiological surveys is to provide population-level estimates of health indicators. Survey results tend to be biased under selective non-participation. One approach to bias reduction is to collect information about non-participants by contacting them again and asking them to fill in a questionnaire. This information is called re-contact data, and it allows the estimates to be adjusted for non-participation. Methods: We analyse data from the FINRISK 2012 survey, where re-contact data were collected. We assume that the respondents of the re-contact survey are similar to the remaining non-participants with respect to health, given their available background information. Validity of this assumption is evaluated based on hospitalization data obtained through record linkage of survey data to the administrative registers. Using this assumption and multiple imputation, we estimate the prevalences of daily smoking and heavy alcohol consumption and compare them to estimates obtained under the commonly used assumption that the participants represent the entire target group. Results: This approach produces higher prevalence estimates than those estimated from participants only. Among men, the smoking prevalence estimate was 28.5% (23.2% for participants), and the heavy alcohol consumption prevalence was 9.4% (6.8% for participants). Among women, smoking prevalence was 19.0% (16.5% for participants) and heavy alcohol consumption 4.8% (3.0% for participants). Conclusion: Utilization of re-contact data is a useful method to adjust population estimates for non-participation bias in epidemiological surveys.
... It is well known that using the information from complete cases or available cases may lead to invalid statistical inference (Little and Rubin, 2014). A common approach is to use an appropriate imputation model, which accounts for the scale level of the measurements. ...
Preprint
Missing values are a common phenomenon in all areas of applied research. While various imputation methods are available for metrically scaled variables, methods for categorical data are scarce. An imputation method that has been shown to work well for high-dimensional metrically scaled variables is imputation by nearest neighbor methods. In this paper, we extend the weighted nearest neighbors approach to impute missing values in categorical variables. The proposed method, called wNNSel_cat, explicitly uses the information on association among attributes. The performance of different imputation methods is compared in terms of the proportion of falsely imputed values. Simulation results show that the weighting of attributes yields smaller imputation errors than existing approaches. A variety of real data sets is used to support the results obtained by simulations.
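To illustrate the general idea, the sketch below imputes a categorical variable by a majority vote among the k nearest complete cases; it is a simplified stand-in, not the wNNSel_cat method itself, which additionally weights attributes by their association with the target variable.

```python
# Illustrative nearest-neighbor imputation for a categorical variable: an
# unweighted majority vote among the k closest complete cases. This is a
# simplified stand-in for wNNSel_cat, which also weights attributes.
import numpy as np
from collections import Counter

def knn_impute_categorical(X, y, k=5):
    """X: (n, p) numeric feature matrix (complete); y: array of categories
    with None marking missing entries. Returns y with missing values filled."""
    y = np.array(y, dtype=object)
    observed = np.array([v is not None for v in y])
    X_obs, y_obs = X[observed], y[observed]
    for i in np.where(~observed)[0]:
        d = np.linalg.norm(X_obs - X[i], axis=1)        # Euclidean distances
        nearest = np.argsort(d)[:k]
        y[i] = Counter(y_obs[nearest]).most_common(1)[0][0]
    return y
```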
... Consequently, the assessment of a surrogate in the Cox model framework can be cast as a problem of estimation with a missing covariate. Although methods for estimating the Cox model with missing covariates have been extensively studied [e.g., Lin and Ying (1993), Robins, Rotnitzky and Zhao (1994), Zhou and Pepe (1995), Paik and Tsai (1997), Chen and Little (1999), Herring and Ibrahim (2001), Chen (2002) and Little and Rubin (2002)], their application to the proposed surrogate assessment is not direct, as the missing data are entirely in the placebo group. Techniques are called for to predict the "missing" immune responses in the placebo recipients, or a random sample of them. ...
Preprint
Assessing immune responses to study vaccines as surrogates of protection plays a central role in vaccine clinical trials. Motivated by three ongoing or pending HIV vaccine efficacy trials, we consider such surrogate endpoint assessment in a randomized placebo-controlled trial with case-cohort sampling of immune responses and a time to event endpoint. Based on the principal surrogate definition under the principal stratification framework proposed by Frangakis and Rubin [Biometrics 58 (2002) 21--29] and adapted by Gilbert and Hudgens (2006), we introduce estimands that measure the value of an immune response as a surrogate of protection in the context of the Cox proportional hazards model. The estimands are not identified because the immune response to vaccine is not measured in placebo recipients. We formulate the problem as a Cox model with missing covariates, and employ novel trial designs for predicting the missing immune responses and thereby identifying the estimands. The first design utilizes information from baseline predictors of the immune response, and bridges their relationship in the vaccine recipients to the placebo recipients. The second design provides a validation set for the unmeasured immune responses of uninfected placebo recipients by immunizing them with the study vaccine after trial closeout. A maximum estimated likelihood approach is proposed for estimation of the parameters. Simulated data examples are given to evaluate the proposed designs and study their properties.
... The pattern of missing data refers to the distribution of missing values across the whole dataset. Little and Rubin (2014) classified missingness mechanisms into three categories: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR). ...
Preprint
Real-time crash likelihood prediction has been an important research topic. Various classifiers, such as support vector machines (SVM) and tree-based boosting algorithms, have been proposed in traffic safety studies. However, little research has focused on missing data imputation in real-time crash likelihood prediction, although missing values are commonly observed due to sensor breakdowns or external interference. Classifying imbalanced data is also a difficult problem in real-time crash likelihood prediction, since it is hard to distinguish crash-prone cases from the non-crash cases that compose the majority of the observed samples. In this paper, principal component analysis (PCA) based approaches, including LS-PCA, PPCA, and VBPCA, are employed for imputing missing values, and two kinds of solutions are developed for the imbalanced-data problem. The results show that PPCA and VBPCA not only outperform LS-PCA and other imputation methods (including mean imputation and k-means clustering imputation) in terms of root mean square error (RMSE), but also help the classifiers achieve better predictive performance. The two solutions, cost-sensitive learning and the synthetic minority oversampling technique (SMOTE), help improve sensitivity by adjusting the classifiers to pay more attention to the minority class.
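A minimal sketch in the same spirit, assuming a numeric feature matrix with NaN gaps and a rare binary crash label; the iterative SVD step is a crude stand-in for the LS-PCA/PPCA/VBPCA imputers, SMOTE is applied as named in the abstract, and all variable names are hypothetical:

```python
import numpy as np
from imblearn.over_sampling import SMOTE

def svd_impute(X, rank=3, n_iter=50):
    """Iterative low-rank (SVD/PCA-style) imputation: start from column means and
    repeatedly refit a rank-`rank` reconstruction, refreshing only the missing cells."""
    miss = np.isnan(X)
    X_hat = np.where(miss, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X_hat, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        X_hat[miss] = low_rank[miss]
    return X_hat

# Hypothetical traffic data: X has sensor gaps, y is a rare crash indicator.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))
X[rng.random(X.shape) < 0.1] = np.nan
y = (rng.random(500) < 0.05).astype(int)

X_complete = svd_impute(X, rank=4)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_complete, y)  # oversample the minority class
```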
... If data are missing completely at random, standard estimation and inference procedures remain consistent when the incomplete observations are ignored; see Heitjan and Basu (1996) and Little (1988), among others. If data are missing at random (MAR), in the sense that the propensity of missingness depends only on the observed covariates, consistent estimation can still be obtained through covariate balancing; see Rubin (1976a,b), Little and Rubin (1989), Bang and Robins (2005), Qin and Zhang (2007), Chen et al. (2008), Tan (2010), Rotnitzky et al. (2012), and Little and Rubin (2014), among others. In many applications, data are missing not at random (MNAR). ...
Preprint
This paper proposes a simple and efficient estimation procedure for the model with non-ignorable missing data studied by Morikawa and Kim (2016). Their semiparametrically efficient estimator requires explicit nonparametric estimation, and so suffers from the curse of dimensionality and requires bandwidth selection. We propose an estimation method based on the Generalized Method of Moments (hereafter GMM). Our method is consistent and asymptotically normal regardless of the number of moments chosen. Furthermore, if the number of moments increases appropriately, our estimator can achieve the semiparametric efficiency bound derived in Morikawa and Kim (2016), but under weaker regularity conditions. Moreover, our proposed estimator and its consistent covariance matrix are easily computed with the widely available GMM package. We propose two data-based methods for selecting the number of moments. A small-scale simulation study reveals that the proposed estimator indeed outperforms the existing alternatives in finite samples.
... This imputation choice was governed by necessity: standard approaches to missing data, such as listwise or pairwise deletion, would have resulted in a dramatic reduction in the samples on which a predictive model could be built, rendering the modeling process impracticable. Due to the idiosyncrasies of this particular type of dataset, in which there is little overlap in answered values across features, methods that attempt to estimate joint distributions, such as Bayesian multiple imputation [29], were not able to converge (using standard settings for the mi package [30] of 30 iterations and 4 chains). After performing mean imputation, we normalized all values to z-scores. ...
Preprint
Crowdsourcing has been successfully applied in many domains including astronomy, cryptography and biology. In order to test its potential for useful application in a Smart Grid context, this paper investigates the extent to which a crowd can contribute predictive hypotheses to a model of residential electric energy consumption. In this experiment, the crowd generated hypotheses about factors that make one home different from another in terms of monthly energy usage. To implement this concept, we deployed a web-based system within which 627 residential electricity customers posed 632 questions that they thought predictive of energy usage. While this occurred, the same group provided 110,573 answers to these questions as they accumulated. Thus users both suggested the hypotheses that drive a predictive model and provided the data upon which the model is built. We used the resulting question and answer data to build a predictive model of monthly electric energy consumption, using random forest regression. Because of the sparse nature of the answer data, careful statistical work was needed to ensure that these models are valid. The results indicate that the crowd can generate useful hypotheses, despite the sparse nature of the dataset.
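A minimal sketch of the preprocessing described in the citation context above (mean imputation followed by z-scoring) feeding a random forest regression, with a hypothetical sparse matrix standing in for the crowd-sourced answer data:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

# Hypothetical sparse answer matrix (rows: homes, columns: crowd questions) and
# monthly energy usage target; most cells are unanswered (NaN).
rng = np.random.default_rng(1)
answers = rng.normal(size=(200, 40))
answers[rng.random(answers.shape) < 0.7] = np.nan
usage = rng.gamma(shape=2.0, scale=300.0, size=200)

X = SimpleImputer(strategy="mean").fit_transform(answers)   # mean imputation
X = StandardScaler().fit_transform(X)                       # normalize to z-scores
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, usage)
```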
... Although the MCAR assumption is sufficient for valid frequentist inference, certain aspects of likelihood-based frequentist inference (that is, using maximum likelihood estimates and the observed information matrix to measure their precision) are asymptotically valid when the missingness mechanism always produces MAR data sets (Molenberghs and Kenward, 1998;Little and Rubin, 2002). Seaman et al. (2013) referred to this type of missingness mechanism as "everywhere missing at random," whereas Mealli and Rubin (2015) referred to it as "missing always at random"; we choose to follow the latter suggestion as the word "everywhere" in probability and statistics has a different mathematical meaning, which is not reflected in its use in this context. ...
Preprint
Models for analyzing multivariate data sets with missing values require strong, often unassessable, assumptions. The most common of these is that the mechanism that created the missing data is ignorable - a twofold assumption dependent on the mode of inference. The first part, which is the focus here, under the Bayesian and direct-likelihood paradigms, requires that the missing data are missing at random; in contrast, the frequentist-likelihood paradigm demands that the missing data mechanism always produces missing at random data, a condition known as missing always at random. Under certain regularity conditions, assuming missing always at random leads to an assumption that can be tested using the observed data alone, namely that the missing data indicators depend only on fully observed variables. Here, we propose three different diagnostic tests that not only indicate when this assumption is incorrect but also suggest which variables are the most likely culprits. Although missing always at random is not a necessary condition to ensure validity under the Bayesian and direct-likelihood paradigms, it is sufficient, and evidence of its violation should encourage the careful statistician to conduct targeted sensitivity analyses.
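One simple check in this spirit is to regress the missingness indicator of a partially observed variable on the fully observed covariates plus the missingness indicators of the other partially observed variables; this is only an illustrative diagnostic under assumed column groupings, not the three tests proposed in the paper:

```python
import numpy as np
import statsmodels.api as sm

def missingness_dependence_check(r_target, fully_observed, other_indicators):
    """Logistic regression of one missingness indicator on fully observed covariates
    and on the other variables' missingness indicators; small p-values on the latter
    hint that missingness depends on more than the fully observed data."""
    X = sm.add_constant(np.column_stack([fully_observed, other_indicators]))
    fit = sm.Logit(r_target, X).fit(disp=0)
    return fit.pvalues
```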
... The total variability in sensitivity (on the logit scale) from a compound symmetry working variance-covariance structure ranged from 0.24 ... In other words, there was a stronger correlation between any two logit specificities in a given study than between any two logit sensitivities. ...
Preprint
Network meta-analysis (NMA) allows combining efficacy information from multiple comparisons from trials assessing different therapeutic interventions for a given disease, and estimating unobserved comparisons from a network of observed comparisons. Applying NMA to diagnostic accuracy studies is a statistical challenge, given the inherent correlation of sensitivity and specificity. We propose a conceptually simple and novel hierarchical arm-based (AB) model that expresses the logit-transformed sensitivity and specificity as the sum of fixed effects for test, correlated study effects, and a random error associated with the various tests evaluated in a given study. We apply the model to previously published meta-analyses assessing the accuracy of diverse cytological and molecular tests used to triage women with minor cervical lesions to detect cervical precancer, and compare the results with those from the contrast-based (CB) model, which expresses the linear predictor as a contrast to a comparator test. The proposed AB model is more appealing than the CB model in that it yields marginal means, which are easily interpreted, makes use of all available data, and easily accommodates more general variance-covariance matrix structures.
... 1. Due to imprecise, incomplete or unreliable data measurements, such as streams of sensor measurements, RFID measurements, trajectory measurements, or DNA sequencing reads [5,7,63,65,87]. 2. Due to (deliberate) flexible sequence modeling, such as binding profiles of molecular sequences [11,94,109]. 3. When strings contain confidential information (patterns) which has been deleted deliberately for privacy protection [4,18,77]. ...
Article
Full-text available
Missing values arise routinely in real-world sequential (string) datasets due to: (1) imprecise data measurements; (2) flexible sequence modeling, such as binding profiles of molecular sequences; or (3) the existence of confidential information in a dataset which has been deleted deliberately for privacy protection. In order to analyze such datasets, it is often important to replace each missing value, with one or more valid letters, in an efficient and effective way. Here we formalize this task as a combinatorial optimization problem: the set of constraints includes the context of the missing value (i.e., its vicinity) as well as a finite set of user-defined forbidden patterns, modeling, for instance, implausible or confidential patterns; and the objective function seeks to minimize the number of new letters we introduce. Algorithmically, our problem translates to finding shortest paths in special graphs that contain forbidden edges representing the forbidden patterns. Our work makes the following contributions: (1) we design a linear-time algorithm to solve this problem for strings over constant-sized alphabets; (2) we show how our algorithm can be effortlessly applied to fully sanitize a private string in the presence of a set of fixed-length forbidden patterns [Bernardini et al. 2021a]; (3) we propose a methodology for sanitizing and clustering a collection of private strings that utilizes our algorithm and an effective and efficiently computable distance measure; and (4) we present extensive experimental results showing that our methodology can efficiently sanitize a collection of private strings while preserving clustering quality, outperforming the state of the art and baselines. To arrive at our theoretical results, we employ techniques from formal languages and combinatorial pattern matching.
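The problem can be illustrated with a brute-force toy version that restricts each missing position to a single replacement letter and simply checks feasibility against the forbidden patterns; the paper's contribution is a linear-time algorithm over constant-sized alphabets that also optimizes the number of letters introduced, which this sketch does not attempt:

```python
from itertools import product

def fill_missing(s, alphabet, forbidden, missing="#"):
    """Brute-force sketch: replace each missing position with one letter so that
    no forbidden pattern occurs anywhere in the completed string."""
    gaps = [i for i, c in enumerate(s) if c == missing]
    for combo in product(alphabet, repeat=len(gaps)):
        filled = list(s)
        for pos, letter in zip(gaps, combo):
            filled[pos] = letter
        filled = "".join(filled)
        if not any(f in filled for f in forbidden):
            return filled
    return None   # no feasible single-letter completion exists

# e.g. fill_missing("ac#ta#g", "acgt", {"aaa", "cta"})
```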
... The remaining participants' data have a low rate of missing data (i.e., less than 2%). Missing data were omitted, following recommendations by Little and Rubin (2019). Table 1 displays the inter-item correlations for the 16 items. ...
Article
Full-text available
Sensation seeking is the pursuit of varied, novel, and complex experiences and has been the subject of many studies since the 1950s. Research has demonstrated that sensation seeking traits are related to risk-taking, including use and abuse of substances as well as physical risk-taking such as participation in extreme sports. Researchers have developed multiple measures to assess individual differences in sensation seeking. The most used measures for adults have weaknesses, such as conflating traits with behaviors, using out-of-date language, and low internal consistency. In the present research we carried out a series of studies in which we developed and validated a brief measure of sensation seeking. The final version of the measure contained seven adjectives, and participants rated how accurately each adjective described them. The results showed that scores on the new measure were positively related to scores on two popular measures of sensation seeking (i.e., the SSS-V and the AISS) as well as to two measures of risk-taking (i.e., the DOSPERT and YRBS).
... In practice, missing data are mostly either listwise deleted, which can (obviously) lead to a substantial loss of valuable information, or imputed via multiple imputation. The latter method replaces the missing values with draws from probability distributions, commonly using either the joint posterior distribution of all variables with missing observations (Little and Rubin, 2019) or the conditional distribution of each variable conditioned on the other variables in the data (van Buuren, 2007). Many extensions have been proposed, e.g. to include interaction effects (Goldstein, Carpenter and Browne, 2014), general nonlinear effects (Bartlett et al., 2015) or to account for sampling weights (Zhou, Elliott and Raghunathan, 2016). ...
Preprint
Full-text available
We propose an adaptation of the multiple imputation random lasso procedure, tailored to longitudinal data with unobserved fixed effects, which provides robust variable selection in the presence of complex missingness, high dimensionality and multicollinearity. We apply it to identify social and financial success factors of microfinance institutions (MFIs) in a data-driven way from a comprehensive, balanced, and global panel with 136 characteristics for 213 MFIs over a six-year period. We discover the importance of staff structure for MFI success and find that profitability is the most important determinant of financial success. Our results indicate that financial sustainability and breadth of outreach can be increased simultaneously, while the relationship with depth of outreach is more mixed.
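A rough sketch of the multiple-imputation-plus-lasso idea (chained-equations-style imputation as described in the context above, followed by a lasso fit on each completed dataset), assuming a fully observed outcome vector; this is not the authors' random lasso procedure for panel data with fixed effects:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LassoCV

def mi_lasso_selection(X, y, M=5):
    """Impute M times, fit a lasso on each completed dataset, and report how
    often each variable is selected across imputations."""
    selected = np.zeros(X.shape[1])
    for m in range(M):
        imp = IterativeImputer(sample_posterior=True, random_state=m)
        X_m = imp.fit_transform(X)
        lasso = LassoCV(cv=5).fit(X_m, y)
        selected += (lasso.coef_ != 0).astype(float)
    return selected / M   # selection frequency per variable
```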
... Since machine learning algorithms lose efficiency with missing values, and multiple imputation methods are yet to be integrated with them, we analysed only a single dataset imputed using the hot deck method [9,10]. Nevertheless, we recognise the superiority of the multiple imputation approach for dealing effectively with missing values and the limitations of the hot deck method in complying with Rubin's approach [11]. The rationale behind our combined approach (statistical and machine learning methods) was to find out whether using machine learning approaches would lead to additional insights into which prognostic variables affect the employability of programme participants, compared with well-established statistical models. ...
Article
Full-text available
Background Whether machine learning approaches are superior to classical statistical models for survival analyses, especially when proportionality does not hold, is unknown. Objectives To compare model performance and predictive accuracy of classic regressions and machine learning approaches using data from the Inspiring Families programme. Methods The Inspiring Families programme aims to support members of families with complex issues to return to work. We explored predictors of time to return to work with proportional hazards models (semi-parametric Cox and flexible parametric Parmar-Royston, both in Stata) against survival penalised regression with an elastic net penalty (scikit-survival), a (conditional) survival forest algorithm (pySurvival), and a (kernel) survival support vector machine (pySurvival). Results At baseline we obtained data on 61 binary variables from all 3161 participants. No model appeared superior, with low predictive power (concordance index between 0.51 and 0.61). The median time to finding the first job was about 254 days. The top five contributing variables were ‘family issues and additional barriers’, ‘restriction of hours’, ‘available CV’, ‘self-employment considered’ and ‘education’. Harrell’s concordance index ranged from 0.60 (Cox model) to 0.71 (random survival forest), suggesting a better fit for the machine learning approaches. However, a comparison of predicted median times for a selected scenario showed only minor differences. Conclusion Implementing a series of survival models with and without a proportional hazards background provides useful insight as well as better interpretation of the coefficients affected by non-linearities. However, the better fit does not translate into substantially higher predictive power and accuracy from using machine learning approaches. Further tuning of the machine learning algorithms may provide improved results.
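A minimal sketch of random hot-deck imputation within adjustment classes, as mentioned in the citation context above; the data frame, column names, and class variables are hypothetical:

```python
import numpy as np
import pandas as pd

def random_hot_deck(df, col, class_cols, seed=0):
    """Fill missing values of `col` by sampling donors from the same adjustment class."""
    rng = np.random.default_rng(seed)
    out = df[col].copy()
    for _, idx in df.groupby(class_cols).groups.items():
        grp = df.loc[idx, col]
        donors = grp.dropna().to_numpy()
        holes = grp.index[grp.isna()]
        if len(donors) and len(holes):
            out.loc[holes] = rng.choice(donors, size=len(holes), replace=True)
    return out

# e.g. df["education"] = random_hot_deck(df, "education", ["age_band", "region"])
```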
... We also recommend incorporating measurement patterns into statistical and machine learning models to mitigate the impact of varied measurement patterns. This can be achieved by weighting the samples and observations based on their measurement frequencies, modeling the data generation process as a multilevel model where the measurement pattern is treated as a latent variable, or employing a Bayesian model to estimate the impact of different measurement frequencies on model outcomes, thus providing a probabilistic framework to handle uncertainty [19,11,10,22]. Finally, promoting transparency in data collection processes and conducting regular audits of EHR systems to detect bias can help create a more equitable healthcare data landscape. ...
Preprint
Full-text available
Background: Disparities in data collection within electronic health records (EHRs), especially in Intensive Care Units (ICUs), can reveal underlying biases that may affect patient outcomes. Identifying and mitigating these biases is critical for ensuring equitable healthcare. This study aims to develop an analytical framework for measurement patterns, including missingness rates and measurement frequencies, evaluate the association between them and demographic factors, and assess their impact on in-hospital mortality prediction. Methods: We conducted a retrospective cohort study using the Medical Information Mart for Intensive Care III (MIMIC-III) database, which includes data on over 40,000 ICU patients from Beth Israel Deaconess Medical Center (2001–2012). Adult patients with ICU stays longer than 24 hours were included. Measurement patterns, such as missingness rates and measurement frequencies, were derived from EHR data and analyzed. Targeted Machine Learning (TML) methods were used to assess potential biases in measurement patterns across demographic factors (age, gender, race/ethnicity) while controlling for confounders such as other demographics and disease severity. The predictive power of measurement patterns on in-hospital mortality was evaluated. Results: Among 23,426 patients, significant demographic disparities were observed in the first 24 hours of ICU stays. Elderly patients (≥ 65 years) had more frequent temperature measurements compared to younger patients, while males had slightly fewer missing temperature measurements than females. Racial disparities were notable: White patients had more frequent blood pressure and oxygen saturation (SpO2) measurements compared to Black and Hispanic patients. Measurement patterns were associated with ICU mortality, with models based solely on these patterns achieving an area under the receiver operating characteristic curve (AUC) of 0.76 (95% CI: 0.74–0.77). Conclusions: This study underscores the significance of measurement patterns in ICU EHR data, which are associated with patient demographics and ICU mortality. Analyzing patterns of missing data and measurement frequencies provides valuable insights into patient monitoring practices and potential biases in healthcare delivery. Understanding these disparities is critical for improving the fairness of healthcare delivery and developing more accurate predictive models in critical care settings.
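A minimal sketch of the first mitigation listed in the citation context above (weighting observations by measurement frequency) in a mortality model; the column names and the inverse-frequency weighting rule are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_frequency_weighted_model(df, feature_cols, count_col="n_measurements",
                                 outcome_col="in_hospital_death"):
    """Down-weight heavily measured patients so monitoring intensity does not
    dominate the fitted mortality model."""
    w = 1.0 / df[count_col].clip(lower=1)
    w = w * len(w) / w.sum()                   # rescale weights to mean 1
    model = LogisticRegression(max_iter=1000)
    model.fit(df[feature_cols], df[outcome_col], sample_weight=w)
    return model
```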
Article
Full-text available
This study considers the sphericity test, a specific test of variance-covariance matrix under monotone missing data for a one-sample problem. We provide the likelihood ratio (LR) and derive an asymptotic expansion of the likelihood ratio test (LRT) statistic and modified LRT statistic for the null distribution. We also derive the upper percentiles of the LRT statistic and modified LRT statistic when the null hypothesis holds, and provide approximate upper percentiles. Furthermore, we prove that the LR under monotone missing data is affine invariant under the null hypothesis. For complete data, we provide an asymptotic expansion of the LRT statistic and modified LRT statistic for the null distribution. Furthermore, we numerically evaluate the actual type I error rates for the approximate upper percentiles using Monte Carlo simulation and provide examples of the LRT statistic and modified LRT statistic and approximate upper percentiles under monotone missing data.
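For complete data, the classical likelihood-ratio (Mauchly) statistic for sphericity can be computed directly; a small sketch of that complete-data case, not the monotone-missing-data LRT derived in the article:

```python
import numpy as np
from scipy.stats import chi2

def mauchly_sphericity(X):
    """Mauchly's test of sphericity for a complete n x p data matrix X."""
    n, p = X.shape
    nu = n - 1
    S = np.cov(X, rowvar=False)                      # sample covariance (divisor nu)
    W = np.linalg.det(S) / (np.trace(S) / p) ** p
    c = 1.0 - (2 * p**2 + p + 2) / (6.0 * p * nu)    # Bartlett-type correction factor
    stat = -nu * c * np.log(W)
    df = p * (p + 1) // 2 - 1
    return stat, chi2.sf(stat, df)                   # statistic and approximate p-value
```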
Preprint
Full-text available
We study tensor completion (TC) through the lens of low-rank tensor decomposition (TD). Many TD algorithms use fast alternating minimization methods, which solve highly structured linear regression problems at each step (e.g., for CP, Tucker, and tensor-train decompositions). However, such algebraic structure is lost in TC regression problems, making direct extensions unclear. To address this, we propose a lifting approach that approximately solves TC regression problems using structured TD regression algorithms as blackbox subroutines, enabling sublinear-time methods. We theoretically analyze the convergence rate of our approximate Richardson iteration based algorithm, and we demonstrate on real-world tensors that its running time can be 100x faster than direct methods for CP completion.
Article
Full-text available
Given the importance of continuous-time stochastic volatility models for describing the dynamics of interest rates, we propose a goodness-of-fit test for the parametric form of the drift and diffusion functions, based on a marked empirical process of the residuals built with an adapted Kalman filter estimation. The test statistics are constructed by applying a continuous functional (Kolmogorov–Smirnov and Cramér–von Mises) to the empirical processes. Both the different estimation procedures (including alternatives such as methods based on Markov chain Monte Carlo or particle filters) and the newly proposed tests are compared in several simulation studies. The tests are calibrated with a specific bootstrap method using the estimation of a discrete version of the diffusion model with stochastic volatility. Finally, an application of the procedures to real data is provided.
Article
The intricate multi-parameter process control of bioreactor systems poses an urgent challenge for cell culture. Bioreactor modeling makes it feasible to simulate and analyze the implications of each parameter for the culture process, to understand the correlations between the key variables involved in the cell culture process, and to establish a framework for multi-parameter collaborative regulation. This paper reviews approaches for implementing models of multi-parameter process control, along with the analysis and optimization techniques for mathematical models of the physiological processes, culture environments, and bioreactor structures involved in bioengineering, tissue engineering, and hepatocyte culture processes. It then covers the remaining obstacles and the potential of digital twins. In view of this, the review anticipates process optimization and control of the in vitro bioreactor hepatocyte culture system and the bioartificial liver clinical support system, with the intention of enhancing understanding of the hepatocyte culture process and meeting the requirements of bioartificial liver therapy for the quantity and quality of hepatocytes.
Article
Full-text available
Rationale The incidence of renal function alterations among patients with COVID-19 is unknown. Objective To determine the incidence of acute kidney injury (AKI) or augmented renal clearance (ARC) in patients hospitalised with COVID-19 and to identify risk factors for patients who may exhibit each renal alteration. Methods Retrospective, observational cohort analysis of hospitalised adult patients within the National COVID Cohort Collaborative (N3C) database with laboratory-confirmed COVID-19 and available data to calculate creatinine clearance using the Cockcroft–Gault equation from 1 January 2020 through 9 April 2022. Measurements Incidence of AKI or ARC and patient demographics. Main results 15 608 patients were included for renal function characterisation, of whom 20.9% experienced AKI and 34.8% exhibited ARC. ARC lasted longer than AKI; however, AKI was associated with increased hospital length of stay and mortality. 11 274 patients were included in the logistic regression analysis. Height and White race were the only variables associated with decreased risk of AKI, while male sex and diabetes were associated with increased risk. Male sex, Black race and hypertension were associated with decreased risk of ARC. Age was associated with decreased risk of both AKI and ARC, while weight and Hispanic ethnicity were associated with increased risk of both renal alterations. Conclusions A significant proportion of patients exhibit renal alterations during their hospitalisation for COVID-19. These results provide initial evidence for identifying patients at risk of AKI or ARC, but more research is needed, especially with respect to the use of biomarkers for renal alteration risk stratification.
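The Cockcroft–Gault creatinine clearance used to characterise renal function in this cohort is straightforward to compute; a small sketch, with the ARC cut-off of 130 mL/min taken as an assumed convention since the abstract does not state the exact thresholds:

```python
def cockcroft_gault(age_years, weight_kg, serum_creatinine_mg_dl, female):
    """Cockcroft-Gault creatinine clearance in mL/min."""
    crcl = (140 - age_years) * weight_kg / (72.0 * serum_creatinine_mg_dl)
    return crcl * 0.85 if female else crcl

def is_arc(crcl_ml_min, threshold=130.0):
    """Augmented renal clearance flag; the 130 mL/min cut-off is an assumption."""
    return crcl_ml_min >= threshold

# e.g. cockcroft_gault(45, 80, 0.9, female=False) -> about 117 mL/min
```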
Article
Full-text available
Conventional survey tools such as weighting do not address non-ignorable nonresponse that occurs when nonresponse depends on the variable being measured. This paper describes non-ignorable nonresponse weighting and imputation models using randomized response instruments, which are variables that affect response but not the outcome of interest. This paper uses a doubly robust estimator that is valid if one, but not necessarily both, of the weighting and imputation models is correct. When applied to a national 2019 survey, these tools produce estimates that suggest there was nontrivial non-ignorable nonresponse related to turnout, and, for subgroups, Trump approval and policy questions. For example, the conventional MAR-based weighted estimates of Trump support in the Midwest were 10 percentage points lower than the MNAR-based estimates.
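The double robustness relied on here has the familiar augmented inverse-probability-weighting (AIPW) structure; a small sketch of that generic form under an ignorable-response model, purely to illustrate the "one of two models correct" property rather than the paper's instrument-based estimator for non-ignorable nonresponse:

```python
import numpy as np

def aipw_mean(y, responded, pi_hat, m_hat):
    """AIPW estimate of a population mean: consistent if either the response
    propensities pi_hat or the imputation predictions m_hat are correctly specified."""
    y_filled = np.where(responded == 1, y, 0.0)      # y is only used where observed
    return float(np.mean(responded * y_filled / pi_hat
                         - (responded - pi_hat) / pi_hat * m_hat))
```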
Preprint
Full-text available
Introduction COVID-19 was less severe in Sub-Saharan Africa (SSA) compared with Europe and North America. It is unclear whether these differences could be explained immunologically. Here we determined the levels of ex vivo SARS-CoV-2 peptide-specific IFN-γ producing cells, and levels of plasma cytokines and chemokines over the first month of COVID-19 diagnosis among Kenyan COVID-19 patients from urban and rural areas. Methods Between June 2020 and August 2022, we recruited and longitudinally monitored 188 COVID-19 patients from two regions in Kenya, Nairobi (urban, n = 152) and Kilifi (rural, n = 36), with varying levels of disease severity ­– severe, mild/moderate, and asymptomatic. IFN-γ secreting cells were enumerated at 0-, 7-, 14- and 28-days post diagnosis by an ex vivo enzyme-linked immunospot (ELISpot) assay following in vitro stimulation of peripheral blood mononuclear cells (PBMCs) with overlapping peptides from several SARS-CoV-2 proteins. A multiplexed binding assay was used to measure the levels of 22 plasma cytokines and chemokines. Results Higher frequencies of IFN-γ-secreting cells against the SARS-CoV-2 spike peptides were observed on the day of diagnosis among the asymptomatic compared to the patients with severe COVID-19. Higher concentrations of 17 of the 22 cytokines and chemokines measured were positively associated with severe disease, and in particular interleukin (IL)-8, IL-18 and IL-1ra (p<0.0001), while a lower concentration of SDF-1α was associated with severe disease (p<0.0001). Concentrations of 8 and 16 cytokines and chemokines including IL-18 were higher among Nairobi asymptomatic and mild patients compared to their respective Kilifi counterparts. Conversely, the concentrations for SDF-1α were higher in rural Kilifi compared to Nairobi (p=0.012). Conclusion In Kenya, as seen elsewhere, pro-inflammatory cytokines and chemokines were associated with severe COVID-19, while an early IFN-γ cellular response to overlapping SARS-CoV-2 spike peptides was associated with reduced risk of disease. Living in urban Nairobi (compared with rural Kilifi) was associated with increased levels of pro-inflammatory cytokines/chemokines.
Article
This study addresses the issue of random non-response (RNR) in the study data by introducing improved generalized ratio estimators for a heterogeneous population mean. The novelty of the proposed methodology is to optimize the combined ratio estimator using Searls' (1964) idea instead of the conventional mean for a heterogeneous population. To study the characterization properties of the proposed estimators, mathematical expressions for the bias and mean square error (MSqE) are derived under the stratified simple random sampling design. The optimal performance of the estimators is obtained together with the required constraints, and a comprehensive theoretical comparison between the proposed and adopted estimators is made. The precision of the suggested estimators relative to the adopted conventional and well-established estimators is demonstrated both by an empirical analysis of the body fat dataset and by a Monte Carlo simulation. The results of this study highlight that the suggested estimators outperform all existing traditional and well-established estimators adopted under this scenario.
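The Searls (1964) idea referred to here replaces the sample mean by the MSE-minimizing multiple of it when the coefficient of variation is known; a one-function sketch of that underlying idea, not the proposed combined ratio estimator under stratified sampling with non-response:

```python
import numpy as np

def searls_mean(y, cv):
    """Searls (1964) shrinkage of the sample mean: ybar / (1 + cv**2 / n),
    where cv is the (assumed known) coefficient of variation of y."""
    y = np.asarray(y, dtype=float)
    return y.mean() / (1.0 + cv**2 / len(y))
```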
Preprint
We derive the closed-form restricted maximum likelihood (REML) estimator and Kenward-Roger's variance estimator for fixed effects in the mixed effects model for repeated measures (MMRM) when the missing data pattern is monotone. As an important application of the analytic result, we present the formula for calculating the power of treatment comparison using the Wald t test with the Kenward-Roger adjusted variance estimate in MMRM. It allows adjustment for baseline covariates without the need to specify the covariate distribution in randomized trials. A simple two-step procedure is proposed to determine the sample size needed to achieve the targeted power. The proposed method performs well for both normal and moderately nonnormal data even in small samples (n = 20) in simulations. An anti-depressant trial is analyzed for illustrative purposes.
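Given a treatment contrast, its standard error, and Kenward-Roger-adjusted degrees of freedom, the power of the Wald t test follows from the noncentral t distribution; a small sketch with the standard error and degrees of freedom supplied as assumed inputs rather than derived from the closed-form MMRM results:

```python
from scipy.stats import t, nct

def wald_t_power(effect, std_error, df, alpha=0.05):
    """Two-sided power of a Wald t test for a given effect size, standard error
    of the treatment contrast, and (e.g. Kenward-Roger adjusted) degrees of freedom."""
    nc = effect / std_error                       # noncentrality parameter
    t_crit = t.ppf(1.0 - alpha / 2.0, df)
    return 1.0 - nct.cdf(t_crit, df, nc) + nct.cdf(-t_crit, df, nc)

# e.g. wald_t_power(effect=2.0, std_error=0.8, df=38)
```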
Preprint
Full-text available
The study aims to examine the performance of different missing data imputation methods in accurately estimating missing data in high-dimensional datasets, and their impact on classification using extreme learning machines (ELM). Random datasets were generated with n = 150 observations, p = 500 independent variables, and different missing data rates. Various imputation methods were used, including mean, median, random, k-nearest neighbors (KNN), missing value imputation with random forests (I-RF), multivariate imputation by chained equations with classification and regression trees (MICE-CART), as well as direct and indirect use of regularized regression (DURR and IURR), methods specifically developed for high-dimensional data. The performance of the methods was evaluated based on their proximity to the reference classification scores obtained using ELM. I-RF, MICE-CART, DURR, and IURR, followed by KNN, exhibited better performance at low missing rates, while the DURR and IURR methods stood out at high missing rates.
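A compact way to run this kind of comparison for the simpler built-in imputers (mean, k-nearest neighbours, chained-equations-style iterative imputation), assuming MCAR masking of a synthetic matrix; the random-forest and regularized-regression imputers from the study are not included here:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

rng = np.random.default_rng(42)
X = rng.normal(size=(150, 50))              # smaller than p = 500 for a quick run
mask = rng.random(X.shape) < 0.20           # 20% MCAR missingness
X_miss = np.where(mask, np.nan, X)

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(max_iter=10, random_state=0),
}
for name, imp in imputers.items():
    X_hat = imp.fit_transform(X_miss)
    rmse = np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2))
    print(f"{name:10s} RMSE on masked entries: {rmse:.3f}")
```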