BMC Medical Research Methodology

Published by Springer Nature
Online ISSN: 1471-2288
Recent publications
Article
We examine the concept of Bayesian Additional Evidence (BAE) recently proposed by Sondhi et al. We derive simple closed-form expressions for BAE and compare its properties with other methods for assessing findings in the light of new evidence. We find that while BAE is easy to apply, it lacks both a compelling rationale and clarity of use needed for reliable decision-making.
 
Abbreviations RCT: Randomized controlled trial; CRT: Cluster randomized trial; SWCRT: Stepped wedge cluster randomized trial; TTHA: Time-to-hospital admission; TTE: Time-to-event; TTFE: Time-to-first-event; TTRE: Time-to-recurrent-event; TTTE: Time-to-terminal-event; CoxPH: Cox proportional hazards; AG: Andersen-Gill; PWP-TT: Prentice-Williams-Peterson total-time; PWP-GT: Prentice-Williams-Peterson gap-time; HR: Hazard ratio; MSE: Mean square error; CP: Coverage probability; CI: Confidence interval; SE: Standard error.
Article
Background There are currently no methodological studies on the performance of statistical models for estimating intervention effects based on the time-to-recurrent-event (TTRE) in stepped wedge cluster randomised trials (SWCRTs) using an open cohort design. This study aims to address this gap by evaluating the performance of these models in various settings using Monte Carlo simulations with an open cohort design, and by applying them to an actual example. Methods Using Monte Carlo simulations, we evaluated the performance of the existing extended Cox proportional hazards models, i.e., the Andersen-Gill (AG), Prentice-Williams-Peterson Total-Time (PWP-TT), and Prentice-Williams-Peterson Gap-Time (PWP-GT) models, under several event generation models and true intervention effects, with and without stratification by clusters. Unidirectional switching in the SWCRT was represented using time-dependent covariates. Results Across the described settings, in situations where inter-individual variability did not exist, the PWP-GT model with stratification by clusters showed the best performance in most settings and reasonable performance in the others. The only situation in which the PWP-TT model with stratification by clusters was not inferior to the PWP-GT model with stratification by clusters was when there was a sufficient follow-up period and the timing of trial entry was random within the trial period, including the follow-up period. In situations where inter-individual variability existed, the PWP-GT model consistently underperformed the PWP-TT model. The AG model performed well only in a specific setting. In the analysis of an actual example, almost all the statistical models suggested that the risk of events during the intervention condition may be somewhat higher than under the control condition, although the difference was not statistically significant.
Conclusions When estimating the TTRE-based intervention effects of SWCRT in various settings using an open cohort design, the PWP-GT model with stratification by clusters performed most reasonably in situations where inter-individual variability was not present. However, if inter-individual variability was present, the PWP-TT model with stratification by clusters performed best.
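The gap-time restructuring behind the PWP-GT model can be illustrated with a short sketch. The helper below (a hypothetical name, not code from the paper) converts one subject's recurrent event times into counting-process rows in which the clock resets to zero after each event; the event number would then serve as the stratification variable in a stratified Cox fit.

```python
def to_pwp_gt(subject_id, event_times, censor_time):
    """Convert a subject's recurrent event times into PWP gap-time rows.

    Each row is (id, event_number, gap_time, event_indicator). The clock
    resets after every observed event, and a final censored gap is added
    from the last event to the end of follow-up.
    """
    rows, prev = [], 0.0
    for k, t in enumerate(sorted(event_times), start=1):
        rows.append((subject_id, k, t - prev, 1))  # observed event
        prev = t
    if censor_time > prev:
        # censored gap from last event until end of follow-up
        rows.append((subject_id, len(event_times) + 1, censor_time - prev, 0))
    return rows

# subject with events at t = 2.0 and t = 5.5, censored at t = 10.0
rows = to_pwp_gt("A01", [2.0, 5.5], 10.0)
```

The resulting rows would be fed to a Cox model with `event_number` as the stratum, which is what distinguishes PWP-GT from the AG model's common baseline hazard.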
 
Flowchart showing the search results and reasons for exclusion
Article
Introduction In recent years, the number of systematic reviews published has increased steadily, reflecting global interest in this type of evidence synthesis. However, little is known about the characteristics of such research published in Portuguese medical journals. This study aims to evaluate the publication trends and overall quality of these systematic reviews. Material and methods This was a methodological study. We targeted the most visible Portuguese medical journals indexed in MEDLINE. Systematic reviews were identified through an electronic search (via PubMed). We included systematic reviews published up to August 2020. Systematic review selection and data extraction were done independently by three authors. Overall quality was critically appraised independently by three authors using A MeaSurement Tool to Assess systematic Reviews (AMSTAR-2). Disagreements were resolved by consensus. Results Sixty-six systematic reviews published in 5 Portuguese medical journals were included. Most (n = 53; 80.3%) were systematic reviews without meta-analysis. Up to 2010 there was a steady increase in the number of systematic reviews published, followed by a period of great variability, ranging from 1 to 10 publications in a given year. By typology, most had been conducted to assess the effectiveness/efficacy of health interventions (n = 27; 40.9%). General and Internal Medicine (n = 20; 30.3%) was the most addressed field. Most systematic reviews (n = 46; 69.7%) were rated as being of “critically low quality”. Conclusions There were consistent flaws in the reported methodological quality of the included systematic reviews, particularly in establishing a prior protocol and in assessing the potential impact of risk of bias on the results. Over the years, the number of systematic reviews published has increased, yet their quality remains suboptimal.
There is a need to improve the reporting of systematic reviews in Portuguese medical journals, which can be achieved by better adherence to quality checklists/tools.
 
Article
  • Antonio Remiro-Azócar
Background Anchored covariate-adjusted indirect comparisons inform reimbursement decisions where there are no head-to-head trials between the treatments of interest, there is a common comparator arm shared by the studies, and there are patient-level data limitations. Matching-adjusted indirect comparison (MAIC), based on propensity score weighting, is the most widely used covariate-adjusted indirect comparison method in health technology assessment. MAIC has poor precision and is inefficient when the effective sample size after weighting is small. Methods A modular extension to MAIC, termed two-stage matching-adjusted indirect comparison (2SMAIC), is proposed. This uses two parametric models. One estimates the treatment assignment mechanism in the study with individual patient data (IPD), the other estimates the trial assignment mechanism. The first model produces inverse probability weights that are combined with the odds weights produced by the second model. The resulting weights seek to balance covariates between treatment arms and across studies. A simulation study provides proof-of-principle in an indirect comparison performed across two randomized trials. Nevertheless, 2SMAIC can be applied in situations where the IPD trial is observational, by including potential confounders in the treatment assignment model. The simulation study also explores the use of weight truncation in combination with MAIC for the first time. Results Despite enforcing randomization and knowing the true treatment assignment mechanism in the IPD trial, 2SMAIC yields improved precision and efficiency with respect to MAIC in all scenarios, while maintaining similarly low levels of bias. The two-stage approach is effective when sample sizes in the IPD trial are low, as it controls for chance imbalances in prognostic baseline covariates between study arms. It is not as effective when overlap between the trials’ target populations is poor and the extremity of the weights is high. 
In these scenarios, truncation leads to substantial precision and efficiency gains but induces considerable bias. The combination of a two-stage approach with truncation produces the highest precision and efficiency improvements. Conclusions Two-stage approaches to MAIC can increase precision and efficiency with respect to the standard approach by adjusting for empirical imbalances in prognostic covariates in the IPD trial. Further modules could be incorporated for additional variance reduction or to account for missingness and non-compliance in the IPD trial.
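The core of MAIC is a method-of-moments reweighting of the IPD so that covariate means match the comparator study's aggregates. A minimal single-covariate sketch (illustrative, not the authors' 2SMAIC code; the data are made up), together with the effective sample size whose collapse motivates the two-stage extension:

```python
import math

def maic_weights(x, target_mean, tol=1e-10):
    """Illustrative method-of-moments MAIC weights for one covariate.

    Finds alpha such that the weighted mean of x equals the aggregate
    target_mean, with w_i proportional to exp(alpha * (x_i - target_mean)).
    The moment function is strictly increasing in alpha, so bisection works.
    """
    c = [xi - target_mean for xi in x]

    def moment(a):
        return sum(math.exp(a * ci) * ci for ci in c)

    lo, hi = -50.0, 50.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if moment(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    a = (lo + hi) / 2.0
    w = [math.exp(a * ci) for ci in c]
    s = sum(w)
    return [wi * len(w) / s for wi in w]  # rescaled to sum to n

def effective_sample_size(w):
    """ESS after weighting; small values signal the poor precision noted above."""
    return sum(w) ** 2 / sum(wi * wi for wi in w)

x = [1.0, 2.0, 3.0, 4.0]              # IPD covariate values (made up)
w = maic_weights(x, target_mean=2.0)  # aggregate mean from the other study
```

In 2SMAIC these odds weights would be multiplied by inverse-probability-of-treatment weights from a within-trial propensity model; the ESS diagnostic applies to the combined weights in the same way.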
 
Article
  • D’Alessandro Giandomenico
  • Ruffini Nuria
  • Aquino Alessandro
  • [...]
  • Cerritelli Francesco
Background To measure the specific effectiveness of a given treatment in a randomised controlled trial, the intervention and control groups have to be similar in all factors not distinctive to the experimental treatment. The similarity of these non-specific factors can be defined as an equality assumption. The purpose of this review was to evaluate the equality assumptions in manual therapy trials. Methods Relevant studies were identified through the following databases: EMBASE, MEDLINE, SCOPUS, WEB OF SCIENCE, Scholar Google, clinicaltrial.gov, the Cochrane Library, chiloras/MANTIS, PubMed Europe, Allied and Complementary Medicine (AMED), Physiotherapy Evidence Database (PEDro) and Sciencedirect. Studies investigating the effect of any manual intervention compared to at least one type of manual control were included. Data extraction and qualitative assessment were carried out independently by four reviewers, and the summary of results was reported following the PRISMA statement. Result Out of 108,903 retrieved studies, 311, enrolling a total of 17,308 patients, were included and divided into eight manual therapy trials categories. Equality assumption elements were grouped in three macro areas: patient-related, context-related and practitioner-related items. Results showed good quality in the reporting of context-related equality assumption items, potentially because largely included in pre-existent guidelines. There was a general lack of attention to the patient- and practitioner-related equality assumption items. Conclusion Our results showed that the similarity between experimental and sham interventions is limited, affecting, therefore, the strength of the evidence. Based on the results, methodological aspects for planning future trials were discussed and recommendations to control for equality assumption were provided.
 
Article
  • Jessica A. Ramsay
  • Steven Mascaro
  • Anita J. Campbell
  • [...]
  • Yue Wu
Background Diagnosing urinary tract infections (UTIs) in children in the emergency department (ED) is challenging due to variable clinical presentations and difficulties in obtaining a urine sample free from contamination. Clinicians need to weigh a range of observations to make timely diagnostic and management decisions, a difficult task to achieve without support due to the complex interactions among relevant factors. Directed acyclic graphs (DAGs) and causal Bayesian networks (BNs) offer a way to explicitly outline the underlying disease, contamination and diagnostic processes, and to make quantitative inferences about the event of interest, thus serving as a tool for decision support. Methods We prospectively collected data on children presenting to the ED with suspected UTIs. Through knowledge elicitation workshops and one-on-one meetings, a DAG was co-developed with clinical domain experts (the Expert DAG) to describe the causal relationships among variables relevant to paediatric UTIs. The Expert DAG was combined with prospective data and further domain knowledge to inform the development of an application-oriented BN (the Applied BN), designed to support the diagnosis of UTI. We assessed the performance of the Applied BN using quantitative and qualitative methods. Results We summarised patient background, clinical and laboratory characteristics of 431 episodes of suspected UTIs enrolled from May 2019 to November 2020. The Expert DAG was presented with a narrative description, elucidating how infection, specimen contamination and management pathways causally interact to form the complex picture of paediatric UTIs. Parameterised using prospective data and expert-elicited parameters, the Applied BN achieved excellent and stable performance in predicting Escherichia coli culture results, with a mean area under the receiver operating characteristic curve of 0.86 and a mean log loss of 0.48 based on 10-fold cross-validation.
The BN predictions were reviewed via a validation workshop, and we illustrate how they can be presented for decision support using three hypothetical clinical scenarios. Conclusion Causal BNs created from both expert knowledge and data can integrate case-specific information to provide individual decision support during the diagnosis of paediatric UTIs in ED. The model aids the interpretation of culture results and the diagnosis of UTIs, promising the prospect of improved patient care and judicious use of antibiotics.
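The kind of reasoning such a network performs, separating true infection from specimen contamination given a culture result, can be shown with exact inference by enumeration over a latent contamination node. All probabilities below are invented for illustration; they are not the paper's elicited parameters.

```python
def p_positive_culture(uti, contaminated):
    """Hypothetical conditional probability table: cultures turn positive
    through true infection or specimen contamination (numbers invented)."""
    if uti and contaminated:
        return 0.99
    if uti:
        return 0.95
    if contaminated:
        return 0.60
    return 0.02

def posterior_uti(prior_uti=0.15, p_contam=0.20, culture_positive=True):
    """P(UTI | culture result), enumerating the latent contamination node —
    the same exact-inference idea a causal BN applies at larger scale."""
    num = den = 0.0
    for uti in (True, False):
        for contam in (True, False):
            prior = (prior_uti if uti else 1 - prior_uti) * \
                    (p_contam if contam else 1 - p_contam)
            lik = p_positive_culture(uti, contam)
            if not culture_positive:
                lik = 1.0 - lik
            joint = prior * lik
            den += joint
            if uti:
                num += joint
    return num / den

post = posterior_uti()  # a positive culture raises but does not settle the diagnosis
```

With these toy numbers a positive culture lifts the UTI probability from 0.15 to roughly 0.55 rather than near-certainty, because contamination competes as an explanation, which is exactly the interpretive aid described in the conclusion.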
 
Article
Background The Randomised Evaluation of COVID-19 Therapy (RECOVERY) trial is aimed at addressing the urgent need to find effective treatments for patients hospitalised with suspected or confirmed COVID-19. The trial has had many successes, including discovering that dexamethasone is effective at reducing COVID-19 mortality, the first treatment to reach this milestone in a randomised controlled trial. Despite this, it continues to use standard or ‘fixed’ randomisation to allocate patients to treatments. We assessed the impact of implementing response adaptive randomisation within RECOVERY using an array of performance measures, to learn if it could be beneficial going forward. This design feature has recently been implemented within the REMAP-CAP platform trial. Methods Trial data was simulated to closely match the data for patients allocated to standard care, dexamethasone, hydroxychloroquine, or lopinavir-ritonavir in the RECOVERY trial from March-June 2020, representing four out of five arms tested throughout this period. Trials were simulated in both a two-arm trial setting using standard care and dexamethasone, and a four-arm trial setting utilising all above treatments. Two forms of fixed randomisation and two forms of response-adaptive randomisation were tested. In the two-arm setting, response-adaptive randomisation was implemented across both trial arms, whereas in the four-arm setting it was implemented in the three non-standard care arms only. In the two-arm trial, randomisation strategies were performed at the whole trial level as well as within three pre-specified patient subgroups defined by patients’ respiratory support level. Results All response-adaptive randomisation strategies led to more patients being given dexamethasone and a lower mortality rate in the trial. Subgroup specific response-adaptive randomisation reduced mortality rates even further. 
In the two-arm trial, response-adaptive randomisation reduced statistical power compared to fixed randomisation, with subgroup-level adaptive randomisation exhibiting the largest power reduction. In the four-arm trial, response-adaptive randomisation increased statistical power in the dexamethasone arm but reduced statistical power in the lopinavir arm. Response-adaptive randomisation did not induce any meaningful bias in treatment effect estimates, nor did it cause any inflation in the type 1 error rate. Conclusions Using response-adaptive randomisation within RECOVERY could have increased the number of patients receiving the optimal COVID-19 treatment during the trial, while reducing the number of patients needed to attain the same study power as the original study. This would likely have reduced patient deaths during the trial and led to dexamethasone being declared effective sooner. Deciding how to balance the needs of patients within a trial and future patients who have yet to fall ill is an important ethical question for the trials community to address. Response-adaptive randomisation deserves to be considered as a design feature in future trials of COVID-19 and other diseases.
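Response-adaptive randomisation of the kind used in REMAP-CAP is often implemented with Thompson-sampling-style allocation. The generic sketch below (not the RECOVERY simulation code; the counts are invented) estimates, for each arm, the posterior probability that it has the highest success rate, which then drives the allocation ratio.

```python
import random

def allocation_probs(successes, failures, n_draws=10_000, seed=1):
    """Monte Carlo estimate of Thompson-sampling allocation probabilities.

    For each arm, estimates the chance that its Beta(1 + s, 1 + f)
    posterior yields the highest success rate among all arms; these
    probabilities would then set the next block's allocation ratios.
    """
    rng = random.Random(seed)
    k = len(successes)
    wins = [0] * k
    for _ in range(n_draws):
        draws = [rng.betavariate(1 + successes[i], 1 + failures[i])
                 for i in range(k)]
        wins[draws.index(max(draws))] += 1
    return [w / n_draws for w in wins]

# arm 0: standard care (50/100 survived); arm 1: dexamethasone-like arm (80/100)
probs = allocation_probs([50, 80], [50, 20])
```

As the posterior for the better arm separates, its allocation probability approaches one, which is how adaptive designs steer more patients to the effective treatment at the cost of power in the other arms.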
 
Overview of model structure
FVIII concentrations over time
95% CI: 95% confidence interval; rAHF-PFM: antihemophilic factor (recombinant) plasma/albumin-free method; BAY: BAY 81–8973
Bleeding events
rAHF-PFM: antihemophilic factor (recombinant) plasma/albumin-free method; BAY: BAY 81–8973
Tornado diagram indicating the effect of each independent variable on the resultant ICER while keeping all other variables constant
rAHF-PFM: antihemophilic factor (recombinant) plasma/albumin-free method
Article
Background Long-term prophylactic therapy is considered the standard of care for hemophilia A patients. This study models the long-term clinical and cost outcomes of two factor VIII (FVIII) products using a pharmacokinetic (PK) simulation model in a Chinese population. Methods Head-to-head PK profile data of BAY 81–8973 (KOVALTRY®) and antihemophilic factor (recombinant) plasma/albumin-free method (rAHF-PFM, ADVATE®) were applied to a two-state (alive and dead) Markov model to simulate blood FVIII concentrations at a steady state in prophylactically-treated patients with hemophilia A. Worsening of the Pettersson score was simulated and decline was associated with the probability of having orthopaedic surgery. The only difference between the compounds was FVIII concentration at a given time; each subject was treated with 25 IU/kg every 3 days. The model used a lifetime horizon, with cycle lengths of 1 year. Results Cumulative bleeding events, joint bleeding events, and major bleeding events were reduced by 19.3% for BAY 81–8973 compared to rAHF-PFM. Hospitalizations and hospitalization days were also reduced by 19.3% for BAY 81–8973 compared to rAHF-PFM. BAY 81–8973 resulted in both cost savings and a gain in quality adjusted life years (QALYs) compared to rAHF-PFM. Conclusion Based on modeled head-to-head comparisons, differences in PK-properties between BAY 81–8973 and rAHF-PFM result in a reduced number of bleeding events, leading to reduced costs and increased quality of life for BAY 81–8973. These results should be used to inform clinical practice in China when caring for patients with severe hemophilia A.
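The two-state (alive/dead) Markov structure with annual cycles can be sketched generically. The inputs below are placeholders, not the study's Chinese cost inputs or PK-driven bleeding rates, and the published model's link between Pettersson-score decline and surgery probability is omitted.

```python
def markov_two_state(p_death, utility, annual_cost, horizon_years, disc=0.03):
    """Minimal alive/dead Markov cohort model with 1-year cycles.

    Returns (discounted QALYs, discounted costs) for a cohort starting
    fully alive with a constant annual death probability. Placeholder
    inputs only; a real model would vary transitions and costs by cycle.
    """
    alive, qalys, costs = 1.0, 0.0, 0.0
    for year in range(horizon_years):
        df = 1.0 / (1.0 + disc) ** year   # discount factor for this cycle
        qalys += alive * utility * df
        costs += alive * annual_cost * df
        alive *= 1.0 - p_death            # transition: alive -> dead
    return qalys, costs

q_low, c_low = markov_two_state(0.02, 0.85, 120_000.0, 50)
q_high, _ = markov_two_state(0.05, 0.85, 120_000.0, 50)
```

Comparing two treatments then amounts to running the same cohort loop with treatment-specific transition probabilities and costs and differencing the discounted totals to obtain incremental QALYs and an ICER.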
 
The selection frequency of the variables using different P-out-values compared to the complete dataset
Percentage agreement between the P-values of the variables selected by the different pooling methods and the complete dataset.
Model selection frequencies of the first ten unique prognostic models from the four pooling methods, quantifying how likely these models were selected compared to the models from the complete dataset
Article
Background For the development of prognostic models after multiple imputation, variable selection is advised to be applied from the pooled model. The aim of this study is to evaluate, using a simulation study and a practical data example, the performance of four different pooling methods for variable selection in multiply imputed datasets. These methods are the D1, D2, D3 and the recently extended Median-P-Rule (MPR) for categorical, dichotomous, and continuous variables in logistic regression models. Methods Four datasets (n = 200 and n = 500), with 9 variables and correlations of respectively 0.2 and 0.6 between these variables, were simulated. These datasets included 2 categorical and 2 continuous variables with 20% missing-at-random data. Multiple imputation (m = 5) was applied, and the four methods were compared with selection from the full model (without missing data). The same analyses were repeated in five multiply imputed real-world datasets (NHANES) (m = 5, p = 0.05, N = 250/300/400/500/1000). Results In the simulated datasets, the differences between the pooling methods were most evident in the smaller datasets. The MPR performed equally to all other pooling methods for the selection frequency, as well as for the P-values of the continuous and dichotomous variables; however, the MPR performed consistently better for pooling and selecting categorical variables in multiply imputed datasets, and also regarding the stability of the selected prognostic models. Analyses in the NHANES dataset showed that all methods mostly selected the same models. Compared to each other, however, the D2 method seemed to be the least sensitive and the MPR the most sensitive, as well as the simplest method to apply.
Conclusions Considering that the MPR is the simplest and easiest pooling method for epidemiologists and applied researchers to use, we carefully recommend using the MPR method to pool categorical variables with more than two levels after multiple imputation, in combination with backward selection procedures (BWS). Because the MPR never performed worse than the other methods for continuous and dichotomous variables, we also advise using the MPR for these types of variables.
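The Median-P-Rule itself is simple to state: pool each variable's m imputation-specific p-values by taking their median, then apply the selection threshold. A minimal sketch with made-up p-values (one filtering pass only; the D1-D3 rules need the full pooled-covariance machinery and are not shown):

```python
import statistics

def median_p_rule(p_values):
    """Pool one variable's p-values across the m imputed datasets by
    taking their median (the MPR idea)."""
    return statistics.median(p_values)

def select_variables(p_by_variable, p_out=0.157):
    """Single filtering pass: keep variables whose pooled p-value is at or
    below the removal threshold. Real backward selection would drop one
    variable at a time and refit, so this is only the first step."""
    return {v: median_p_rule(ps) for v, ps in p_by_variable.items()
            if median_p_rule(ps) <= p_out}

# made-up p-values for m = 5 imputations; p_out = 0.157 mimics AIC-style selection
selected = select_variables({
    "age":     [0.01, 0.02, 0.03, 0.02, 0.01],
    "smoking": [0.40, 0.25, 0.33, 0.50, 0.29],
})
```

The appeal for applied work is visible here: unlike the D-methods, the MPR needs nothing from the fitted models beyond the per-imputation p-values.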
 
Intersecting individual and contextual factors that can shape individual identity [11, 17]
Knowledge-to-Action Framework [18]
Overview of study
Article
Background Models, theories, and frameworks (MTFs) provide the foundation for a cumulative science of implementation, reflecting a shared, evolving understanding of various facets of implementation. One under-represented aspect in implementation MTFs is how intersecting social factors and systems of power and oppression can shape implementation. There is value in enhancing how MTFs in implementation research and practice account for these intersecting factors. Given the large number of MTFs, we sought to identify exemplar MTFs that represent key implementation phases within which to embed an intersectional perspective. Methods We used a five-step process to prioritize MTFs for enhancement with an intersectional lens. We mapped 160 MTFs to three previously prioritized phases of the Knowledge-to-Action (KTA) framework. Next, 17 implementation researchers/practitioners, MTF experts, and intersectionality experts agreed on criteria for prioritizing MTFs within each KTA phase. The experts used a modified Delphi process to agree on an exemplar MTF for each of the three prioritized KTA framework phases. Finally, we reached consensus on the final MTFs and contacted the original MTF developers to confirm MTF versions and explore additional insights. Results We agreed on three criteria when prioritizing MTFs: acceptability (mean = 3.20, SD = 0.75), applicability (mean = 3.82, SD = 0.72), and usability (median = 4.00, mean = 3.89, SD = 0.31) of the MTF. The top-rated MTFs were the Iowa Model of Evidence-Based Practice to Promote Quality Care for the ‘Identify the problem’ phase (mean = 4.57, SD = 2.31), the Consolidated Framework for Implementation Research for the ‘Assess barriers/facilitators to knowledge use’ phase (mean = 5.79, SD = 1.12), and the Behaviour Change Wheel for the ‘Select, tailor, implement interventions’ phase (mean = 6.36, SD = 1.08). 
Conclusions Our interdisciplinary team engaged in a rigorous process to reach consensus on MTFs reflecting specific phases of the implementation process and prioritized each to serve as an exemplar in which to embed intersectional approaches. The resulting MTFs correspond with specific phases of the KTA framework, which itself may be useful for those seeking particular MTFs for particular KTA phases. This approach also provides a template for how other implementation MTFs could be similarly considered in the future. Trial registration Open Science Framework Registration: osf.io/qgh64.
 
Histogram of hospital length of stay for patients with asthma diagnosis, n = 2,167
Article
Background Hospital length of stay (LOS) is a key indicator of hospital care management efficiency, cost of care, and hospital planning. Hospital LOS is often used as a measure of a post-procedure outcome, as a guide to the benefit of a treatment of interest, or as an important risk factor for adverse events. Therefore, understanding hospital LOS variability is always an important healthcare focus. Hospital LOS data can be treated as count data, with discrete and non-negative values, typically right-skewed, and often exhibiting excessive zeros. In this study, we compared the performance of the Poisson, negative binomial (NB), zero-inflated Poisson (ZIP), and zero-inflated negative binomial (ZINB) regression models using simulated and empirical data. Methods Data were generated under different simulation scenarios with varying sample sizes, proportions of zeros, and levels of overdispersion. Analysis of hospital LOS was conducted using empirical data from the Medical Information Mart for Intensive Care database. Results Results showed that the Poisson and ZIP models performed poorly on overdispersed data. ZIP outperformed the other regression models when the overdispersion was due to zero-inflation only. The NB and ZINB regression models faced substantial convergence issues when incorrectly used to model equidispersed data. The NB model provided the best fit for overdispersed data and outperformed the ZINB model in many simulation scenarios with combinations of zero-inflation and overdispersion, regardless of the sample size. In the empirical data analysis, we demonstrated that fitting incorrect models to overdispersed data led to incorrect regression coefficient estimates and overstated significance of some of the predictors. Conclusions Based on this study, we recommend that researchers consider ZIP models for count data with zero-inflation only, and NB models for overdispersed data or data with combinations of zero-inflation and overdispersion.
If the researcher believes there are two different data generating mechanisms producing zeros, then the ZINB regression model may provide greater flexibility when modeling the zero-inflation and overdispersion.
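The zero-inflation/overdispersion distinction the simulations exploit is easy to reproduce: a zero-inflated Poisson has mean (1 − π)λ but variance (1 − π)λ(1 + πλ), exceeding the mean whenever π > 0. A stdlib-only sketch (simulation only; fitting the four regression models would need a statistics package):

```python
import math
import random

def simulate_zip(n, lam, pi_zero, seed=0):
    """Draw n zero-inflated Poisson counts: a structural zero with
    probability pi_zero, otherwise a Poisson(lam) draw (Knuth's method,
    adequate for small lam)."""
    rng = random.Random(seed)

    def poisson(l):
        limit, k, p = math.exp(-l), 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                return k
            k += 1

    return [0 if rng.random() < pi_zero else poisson(lam) for _ in range(n)]

data = simulate_zip(20_000, lam=3.0, pi_zero=0.3)
mean = sum(data) / len(data)
var = sum((x - mean) ** 2 for x in data) / len(data)
# theory: mean = (1 - 0.3) * 3 = 2.1; variance = 2.1 * (1 + 0.3 * 3) = 3.99,
# so a plain Poisson fit (variance forced equal to the mean) is misspecified here
```

Checking the sample variance against the sample mean in this way is the quick diagnostic that motivates choosing NB or ZIP over plain Poisson in the first place.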
 
Participant flow chart during field testing
Connecting to the USSD app
Accessing specific sexual reproductive health information in the USSD app
Article
Background Adolescent pregnancies and sexually transmitted infections continue to affect 15 – 19-year-olds across the globe. The lack of sexual reproductive health (SRH) information in resource-limited settings, due to cultural and societal attitudes towards adolescent SRH, could be contributing to these negative outcomes. Innovative approaches, including mobile phone technologies, are needed to address the need for reliable adolescent SRH information. Objective The study aimed to co-design an Unstructured Supplementary Service Data (USSD) based mobile app prototype to provide confidential adolescent SRH information on demand, and to evaluate the mobile app’s usability and user experience. Methods A human-centered design methodology was applied. This practice framework allowed the perspectives and feedback of adolescent users to be included in the iterative design process. To participate, an adolescent must have been 15 to 19 years old, resided in Kibra and been able to access a mobile phone. Adolescents were enrolled for the alpha and field testing of the app prototype at different time points. The Mobile Application Rating Scale (MARS), a multidimensional mobile app evaluation tool, was used to assess the functionality, engagement, aesthetics and quality of information in the app. Responses from the MARS were reported as mean scores for each category, with the mean of the aggregate scores constituting the app’s quality score. The MARS data were also evaluated as categorical data; a chi-square test of independence was carried out to assess the significance of any observed differences using cumulative and inverse cumulative distribution functions. Results During the usability test, 62/109 (54.9%) of the adolescents who were followed up had used the app at least once; 30/62 (48.4%) of these were male participants and 32/62 (51.6%) female.
On engagement, the app had a mean score of 4.3/5 (SD 0.44); 4.6/5 (SD 0.38) on functionality; 4.3/5 (SD 0.57) on aesthetics; and 4.4/5 (SD 0.60) on quality of information. The overall app quality mean score was 4.4/5 (SD 0.31). The app was described as ‘very interesting’ to use by 44/62 (70.9%) of the participants, 20/44 males and 24/44 females. The content was deemed to be either ‘perfectly’ or ‘well targeted’ on sexual reproductive health by 60/62 (96.7%) adolescents, and the app was rated ‘best app’ by 45/62 (72.6%) adolescents, 27/45 females and 18/45 males (p = 0.011). Conclusions Adolescents need on-demand, accurate and trusted SRH information. A mobile phone app is a feasible and acceptable way to deliver adolescent SRH information in resource-limited settings. The USSD mobile phone technology shows promise in the delivery of much-needed adolescent SRH information on demand.
 
Article
Background Although books and articles guiding the methods of sample size calculation for prevalence studies are available, we aim to guide, assist and report sample size calculation using purpose-designed calculators. Results We present and discuss the four parameters (namely level of confidence, precision, variability of the data, and anticipated loss) required for sample size calculation in prevalence studies. Choosing correct parameters with proper understanding, and reporting issues, are the main focus. We demonstrate the use of purpose-designed calculators that assist users in making properly informed decisions and preparing appropriate reports. Conclusion The two calculators can be used with free software (a spreadsheet and RStudio), which benefits researchers with limited resources. They will, hopefully, minimize errors in parameter selection, calculation, and reporting. The calculators are available at: ( https://sites.google.com/view/sr-ln/ssc ).
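The textbook formula underlying such calculators for a single proportion is n = z²p(1 − p)/d², inflated for anticipated loss. A minimal sketch of that calculation (an illustration of the standard formula, not the authors' spreadsheet or R code):

```python
import math
from statistics import NormalDist

def prevalence_sample_size(p, precision, confidence=0.95, loss=0.0):
    """Sample size to estimate a prevalence p within +/- precision at the
    given confidence level, covering the four parameters discussed above:
    confidence (z), precision (d), variability (p), and anticipated loss.
    n = z^2 * p * (1 - p) / d^2, then inflated by 1 / (1 - loss)."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    n = z ** 2 * p * (1.0 - p) / precision ** 2
    return math.ceil(n / (1.0 - loss))

n_base = prevalence_sample_size(0.5, 0.05)             # maximum-variability case
n_loss = prevalence_sample_size(0.5, 0.05, loss=0.10)  # allow for 10% dropout
```

Using p = 0.5 gives the conservative maximum of p(1 − p), which is why it is the default when no prior prevalence estimate exists; reporting should state all four inputs, not just the resulting n.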
 
Bias of treatment estimates and the coverage of their standard errors by the 95% confidence interval in the unmeasured confounding scenario. In the “ideal” case the effects of the confounders on the outcomes are the same, and in the “random coefficients” case these effects are randomised. Within each cell, the left funnel plot shows the estimates calibrated with negative controls only and the right funnel plot shows estimates calibrated with negative and positive controls
Bias of treatment estimates and the coverage of their standard errors by the 95% confidence interval in the model misspecification – missing quadratic term scenario. In the “ideal” case the effects of the confounders on the outcomes are the same, and in the “random coefficients” case these effects are randomised. Within each cell, the left funnel plot shows the estimates calibrated with negative controls only and the right funnel plot shows estimates calibrated with negative and positive controls
Bias of treatment estimates and the coverage of their standard errors by the 95% confidence interval in the model misspecification – missing interaction term scenario. In the “ideal” case the effects of the confounders on the outcomes are the same, and in the “random coefficients” case these effects are randomised. Within each cell, the left funnel plot shows the estimates calibrated with negative controls only and the right funnel plot shows estimates calibrated with negative and positive controls
Bias of treatment estimates and the coverage of their standard errors by the 95% confidence interval in the non-positivity scenario. In the “ideal” case the effects of the confounders on the outcomes are the same, and in the “random coefficients” case these effects are randomised. Within each cell, the left funnel plot shows the estimates calibrated with negative controls only and the right funnel plot shows estimates calibrated with negative and positive controls
Bias of treatment estimates and the coverage of their standard errors by the 95% confidence interval in the measurement error scenario. In the “ideal” case the effects of the confounders on the outcomes are the same, and in the “random coefficients” case these effects are randomised. Within each cell, the left funnel plot shows the estimates calibrated with negative controls only and the right funnel plot shows estimates calibrated with negative and positive controls
Article
Background Estimations of causal effects from observational data are subject to various sources of bias. One method for adjusting for the residual biases in the estimation of treatment effects is through the use of negative control outcomes, which are outcomes not believed to be affected by the treatment of interest. The empirical calibration procedure is a technique that uses negative control outcomes to calibrate p-values. An extension of this technique calibrates the coverage of the 95% confidence interval of a treatment effect estimate by using negative control outcomes as well as positive control outcomes, which are outcomes for which the treatment of interest has known effects. Although empirical calibration has been used in several large observational studies, its effect under different bias scenarios has not been systematically examined. Methods The effect of empirical calibration of confidence intervals was analyzed using simulated datasets with known treatment effects. The simulations consisted of a binary treatment and a binary outcome, with biases resulting from an unmeasured confounder, model misspecification, measurement error, and lack of positivity. The performance of the empirical calibration was evaluated by determining the change in the coverage of the confidence interval and the bias in the treatment effect estimate. Results Empirical calibration increased coverage of the 95% confidence interval of the treatment effect estimate under most bias scenarios but was inconsistent in adjusting the bias in the treatment effect estimate. Empirical calibration of confidence intervals was most effective when adjusting for unmeasured confounding bias. Suitable negative controls had a large impact on the adjustment made by empirical calibration, but small improvements in the coverage of the outcome of interest were also observable when using unsuitable negative controls. 
Conclusions This work adds evidence to the efficacy of empirical calibration of the confidence intervals in observational studies. Calibration of confidence intervals is most effective where there are biases due to unmeasured confounding. Further research is needed on the selection of suitable negative controls.
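The core of the confidence-interval calibration can be sketched in a few lines. This is a minimal illustration, assuming the common formulation in which systematic error is modelled as a normal distribution with mean μ and standard deviation τ estimated by maximum likelihood from negative-control estimates (whose true effect is zero); the function names are ours, not from the study.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

def fit_systematic_error(nc_estimates, nc_ses):
    """Fit a normal systematic-error distribution N(mu, tau^2) by maximum
    likelihood from negative-control log-effect estimates (true effect = 0).
    Each estimate i is modelled as N(mu, tau^2 + se_i^2)."""
    nc_estimates = np.asarray(nc_estimates)
    nc_ses = np.asarray(nc_ses)

    def neg_log_lik(params):
        mu, log_tau = params
        var = np.exp(log_tau) ** 2 + nc_ses ** 2
        return -np.sum(stats.norm.logpdf(nc_estimates, loc=mu, scale=np.sqrt(var)))

    res = minimize(neg_log_lik, x0=[0.0, np.log(0.1)], method="Nelder-Mead")
    mu, tau = res.x[0], np.exp(res.x[1])
    return mu, tau

def calibrated_ci(estimate, se, mu, tau, level=0.95):
    """Calibrate a CI: shift by the estimated bias and widen by the
    systematic-error SD."""
    z = stats.norm.ppf(0.5 + level / 2)
    half = z * np.sqrt(se ** 2 + tau ** 2)
    return estimate - mu - half, estimate - mu + half
```

The calibrated interval is never narrower than the nominal one, which is consistent with the coverage gains reported above.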
 
Simulation study: selective coverage from both simulation setups of selective 90% CIs for the submodel inference target. For each scenario, the actual selective coverage rate was estimated by simulation, and over all scenarios, the values were summarised by boxplots. See Supplementary Figure S4 for stratified results. The nominal confidence level of 0.9 used in the construction of the CIs is depicted as a dashed line. Colors indicate the type of variable selection. Monte Carlo error is indicated by grey areas describing binomial 95% CIs expected at the nominal confidence level with 900 iterations
Toy simulation study: selective power and selective type I error of selective 90% confidence intervals. For each scenario, power or type I error was estimated by simulation, and over all scenarios with specified target simulation R², the values were summarised by boxplots. The target values are depicted as dashed lines (1 for power, 0.1 for type I error). Colors indicate the type of variable selection. Results from the realistic setup are comparable to the ones shown here (Supplementary Figure S5)
Toy simulation study: median and interquartile range (IQR) of the widths of selective 90% CIs. CIs were standardized. For each scenario the median and IQR of CI widths were computed, and over all variables and scenarios with specified target simulation R², the values were summarised by boxplots. Dashed lines mark a width of zero. Colors indicate the type of variable selection. Supplementary Figure S7 shows the same results stratified by true predictors and noise variables. Results for the realistic setup are comparable (Supplementary Figure S8)
Toy simulation study: predictive accuracy in terms of difference of validation R² and target simulation R². The target simulation R² was 0.2 in the left panel and 0.8 in the right panel. For each scenario, predictive accuracy was estimated by simulation, and over all scenarios, the values were summarised by boxplots. Dashed lines mark an optimal difference of zero. Colors indicate the type of variable selection. Results for the realistic setup are comparable (Supplementary Figure S12)
Real data example: point estimates and 90% selective CIs for regression coefficients. Results are shown at the original scales of the variables. Each method is depicted in a separate panel. The variables are ordered by increasing standardized coefficients. The individual selection frequencies estimated by 100 subsamples are given as percentages above each panel
Article
Background Variable selection for regression models plays a key role in the analysis of biomedical data. However, inference after selection is not covered by classical statistical frequentist theory, which assumes a fixed set of covariates in the model. This leads to over-optimistic selection and replicability issues. Methods We compared proposals for selective inference targeting the submodel parameters of the Lasso and its extension, the adaptive Lasso: sample splitting, selective inference conditional on the Lasso selection (SI), and universally valid post-selection inference (PoSI). We studied the properties of the proposed selective confidence intervals available via R software packages using a neutral simulation study inspired by real data commonly seen in biomedical studies. Furthermore, we present an exemplary application of these methods to a publicly available dataset to discuss their practical usability. Results Frequentist properties of selective confidence intervals by the SI method were generally acceptable, but the claimed selective coverage levels were not attained in all scenarios, in particular with the adaptive Lasso. The actual coverage of the extremely conservative PoSI method exceeded the nominal levels, and this method also required the greatest computational effort. Sample splitting achieved acceptable actual selective coverage levels, but the method is inefficient and leads to less accurate point estimates. The choice of inference method had a large impact on the resulting interval estimates, thereby necessitating that the user is acutely aware of the goal of inference in order to interpret and communicate the results. Conclusions Despite violating nominal coverage levels in some scenarios, selective inference conditional on the Lasso selection is our recommended approach for most cases. If simplicity is strongly favoured over efficiency, then sample splitting is an alternative. If only few predictors undergo variable selection (i.e. 
up to 5) or the avoidance of false positive claims of significance is a concern, then the conservative approach of PoSI may be useful. For the adaptive Lasso, SI should be avoided and only PoSI and sample splitting are recommended. In summary, we find selective inference useful to assess the uncertainties in the importance of individual selected predictors for future applications.
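Of the compared approaches, sample splitting is the simplest to illustrate. The sketch below is our own minimal implementation (not the R packages evaluated in the study), with an arbitrary Lasso penalty and a 50/50 split: variables are selected on one half of the data, and classical OLS confidence intervals are computed on the other half.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import Lasso

def split_sample_inference(X, y, alpha=0.1, level=0.9, seed=0):
    """Sample splitting: select variables with the Lasso on one half of the
    data, then compute classical OLS confidence intervals on the held-out
    half. The penalty `alpha` and the 50/50 split are illustrative choices."""
    rng = np.random.default_rng(seed)
    n = len(y)
    idx = rng.permutation(n)
    half1, half2 = idx[: n // 2], idx[n // 2:]

    # Stage 1: variable selection on the first half.
    sel = np.flatnonzero(Lasso(alpha=alpha).fit(X[half1], y[half1]).coef_ != 0)
    if sel.size == 0:
        return sel, np.empty((0, 2))

    # Stage 2: classical OLS inference on the held-out half.
    Xs = np.column_stack([np.ones(len(half2)), X[np.ix_(half2, sel)]])
    beta, *_ = np.linalg.lstsq(Xs, y[half2], rcond=None)
    resid = y[half2] - Xs @ beta
    df = len(half2) - Xs.shape[1]
    sigma2 = resid @ resid / df
    cov = sigma2 * np.linalg.inv(Xs.T @ Xs)
    tq = stats.t.ppf(0.5 + level / 2, df)
    se = np.sqrt(np.diag(cov))
    cis = np.column_stack([beta - tq * se, beta + tq * se])
    return sel, cis[1:]  # drop the intercept row
```

Because inference uses data untouched by selection, the intervals are valid without conditioning arguments, at the efficiency cost noted in the abstract.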
 
Example of a person-period data set (on right) created from continuous-time survival data (on left). In the timeline plot, circles indicate censoring and diamonds indicate events. The horizon of interest is w=5 and there are J=5 specified intervals defined as 1: A1=(t0,t1], 2: A2=(t1,t2], 3: A3=(t2,t3], 4: A4=(t3,t4], 5: A5=(t4,t5], whose endpoints are given by t0=0, t1=1, t2=2, t3=3, t4=4, t5=5. ID 1 experiences an event in interval 3 and thus in the person-period data set they have rows corresponding to the first three intervals, and for the 3rd interval their event status is 1. ID 2 is censored in interval 4 and in the person-period data set they have rows corresponding to the first 4 intervals with an event status of 0 for all of them. ID 3 experiences the event at a time beyond the prediction horizon of interest, thus we administratively censor them at the prediction horizon; they have a row in the person-period data set for all intervals and an event status of 0 for all of them
Demonstration of different specifications of censoring to create a person-period data set (on right) from continuous-time survival data of two censored individuals (on left). The horizon of interest is w=5 and there are J=5 intervals defined as Aj=(tj−1,tj] for j∈{1,2,3,4,5}. Three different specifications are given for whether an individual contributes to a particular interval (Event1: intervals during which the individual is observed, Event2: intervals for which they survive at least half of the interval, Event3: intervals for which they survive the entire interval). ID 4 is censored in the first half of interval 3, so in the person-period data set they do not contribute to interval 3 under the Event2 and Event3 specifications. ID 5 is censored in the second half of interval 4, so in the person-period data set for interval 4 they contribute under the Event2 specification but not under the Event3 specification, since they do not survive to the end of the interval
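Under the Event1 rule described above, the expansion into person-period format can be sketched in a few lines. This is a minimal illustration with equally spaced intervals; the event and censoring times below are invented to match the three IDs in the caption.

```python
import numpy as np
import pandas as pd

def to_person_period(times, events, horizon=5, n_intervals=5):
    """Expand continuous-time survival data into person-period format.
    Interval j is A_j = (t_{j-1}, t_j] with equally spaced cut points up to
    `horizon`; subjects event-free beyond the horizon are administratively
    censored there. Uses the 'Event1' rule: a subject contributes a row for
    every interval they enter."""
    cuts = np.linspace(0, horizon, n_intervals + 1)
    rows = []
    for i, (t, e) in enumerate(zip(times, events)):
        t_adm = min(t, horizon)            # administrative censoring at horizon
        e_adm = e if t <= horizon else 0
        for j in range(1, n_intervals + 1):
            if t_adm <= cuts[j - 1]:
                break                      # subject left the risk set earlier
            rows.append({"id": i, "interval": j,
                         "event": int(e_adm and t_adm <= cuts[j])})
    return pd.DataFrame(rows)
```

For example, `to_person_period([2.5, 3.5, 6.0], [1, 0, 1])` reproduces the caption: ID 1 gets three rows with event = 1 in interval 3, ID 2 gets four all-zero rows, and ID 3 gets five all-zero rows.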
Visualization of the model building and testing process
Predicted survival probabilities for a randomly selected individual in the pbc data set with continuous-time models, Cox PH and RSF, and discrete-time models using logistic regression and a neural network. NNet: neural network; PH: proportional hazards; RSF: random survival forest
Test set time-dependent R-squared measure of Brier score (left) and AUC (right) for prediction models applied to the colon data set. Higher values indicate better predictive performance. The R-squared IBS values over the 5 time points for the different models, in decreasing order, are: GBM (0.132), CForest (0.119), RSF (0.105), GBM (0.105), Elastic net (0.101), Cox (0.098), Logistic regression (0.096), Neural network (0.069), SVM (0.010). AUC: area under the ROC curve; CForest: Conditional inference random forest; GBM: gradient boosting machines; NNet: neural network; PH: proportional hazards; RSF: random survival forest
Article
Background Prediction models for time-to-event outcomes are commonly used in biomedical research to obtain subject-specific probabilities that aid in making important clinical care decisions. There are several regression and machine learning methods for building these models that have been designed or modified to account for the censoring that occurs in time-to-event data. Discrete-time survival models, which have often been overlooked in the literature, provide an alternative approach for predictive modeling in the presence of censoring with limited loss in predictive accuracy. These models can take advantage of the range of nonparametric machine learning classification algorithms and their available software to predict survival outcomes. Methods Discrete-time survival models are applied to a person-period data set to predict the hazard of experiencing the failure event in pre-specified time intervals. This framework allows for any binary classification method to be applied to predict these conditional survival probabilities. Using time-dependent performance metrics that account for censoring, we compare the predictions from parametric and machine learning classification approaches applied within the discrete time-to-event framework to those from continuous-time survival prediction models. We outline the process for training and validating discrete-time prediction models, and demonstrate its application using the open-source R statistical programming environment. Results Using publicly available data sets, we show that some discrete-time prediction models achieve better prediction performance than the continuous-time Cox proportional hazards model. Random survival forests, a machine learning algorithm adapted to survival data, also had improved performance compared to the Cox model, but was sometimes outperformed by the discrete-time approaches. 
In comparing the binary classification methods in the discrete time-to-event framework, the relative performance of the different methods varied depending on the data set. Conclusions We present a guide for developing survival prediction models using discrete-time methods and assessing their predictive performance with the aim of encouraging their use in medical research settings. These methods can be applied to data sets that have continuous time-to-event outcomes and multiple clinical predictors. They can also be extended to accommodate new binary classification algorithms as they become available. We provide R code for fitting discrete-time survival prediction models in a GitHub repository.
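The central conversion in this framework, from classifier-predicted interval hazards to a survival curve, can be sketched as follows. Logistic regression stands in here for an arbitrary binary classifier; the implementation is our own minimal version, not the authors' R code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def discrete_time_survival(pp_X, pp_interval, pp_event, new_x, n_intervals):
    """Fit a discrete-time hazard model with a binary classifier on
    person-period rows, then convert predicted interval hazards h_k into a
    survival curve S(j) = prod_{k<=j} (1 - h_k)."""
    # One-hot interval indicators play the role of a baseline hazard.
    dummies = np.eye(n_intervals)[pp_interval - 1]
    Z = np.column_stack([dummies, pp_X])
    clf = LogisticRegression(max_iter=1000).fit(Z, pp_event)

    # Predict the hazard for the new subject in each interval.
    new_rows = np.column_stack([np.eye(n_intervals),
                                np.tile(new_x, (n_intervals, 1))])
    hazards = clf.predict_proba(new_rows)[:, 1]
    return np.cumprod(1 - hazards)
```

Any classifier with a `predict_proba`-style interface (random forest, gradient boosting, neural network) can be substituted without changing the surrounding logic, which is the flexibility the abstract emphasises.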
 
A schematic representation of a classical randomized test-treatment study [4, 5]
Results for the power for the 1620 scenarios stratified by the difference in prevalence, the difference in sensitivity, and the difference in specificity. The power of the fixed design was compared to the adaptive design containing a re-estimation of the prevalence assuming μI+ = 0.05 (a-c) and μI+ = 0.1 (d-f). The black dotted line marks the theoretical power of 80%. The black bold line in the box marks the median value
Article
Background Randomized test-treatment studies aim to evaluate the clinical utility of diagnostic tests by providing evidence on their impact on patient health. However, the sample size calculation is affected by several factors involved in the test-treatment pathway, including the prevalence of the disease. Sample size planning is exposed to strong uncertainties in terms of the necessary assumptions, which have to be compensated for accordingly by adjusting prospectively determined study parameters during the course of the study. Methods An adaptive design with a blinded sample size recalculation in a randomized test-treatment study based on the prevalence is proposed and evaluated by a simulation study. The results of the adaptive design are compared to those of the fixed design. Results The adaptive design achieves the desired theoretical power, under the assumption that all other nuisance parameters have been specified correctly, while wrong assumptions regarding the prevalence may lead to an over- or underpowered study in the fixed design. The empirical type I error rate is sufficiently controlled in the adaptive design as well as in the fixed design. Conclusion The consideration of a blinded recalculation of the sample size already during the planning of the study may be advisable in order to increase the likelihood of study success and to improve the conduct of the study. However, the application of the method is subject to a number of limitations associated with the study design in terms of feasibility, the sample sizes needed to be achieved, and fulfillment of necessary prerequisites.
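The blinded recalculation step can be sketched with a generic normal-approximation sample size formula for comparing two proportions, where the arm-level effect is diluted by the prevalence. This is an illustrative sketch under our own simplifying assumptions, not the paper's exact design; all function and parameter names are ours.

```python
from scipy.stats import norm

def n_per_arm(prev, p0, delta_diseased, alpha=0.05, power=0.8):
    """Per-arm sample size for a two-sample comparison of proportions in a
    test-treatment trial where the intervention only benefits the diseased
    subgroup, so the arm-level effect is diluted by the prevalence `prev`.
    Standard normal-approximation formula for two proportions."""
    p1 = p0
    p2 = p0 + prev * delta_diseased       # diluted arm-level success probability
    pbar = (p1 + p2) / 2
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    num = (z_a * (2 * pbar * (1 - pbar)) ** 0.5
           + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return num / (p2 - p1) ** 2

def blinded_reestimate(pooled_diseased, pooled_total, p0, delta_diseased):
    """Blinded recalculation: estimate the prevalence from pooled interim
    data (blinded to treatment arm, unblinded to disease status) and
    recompute the required sample size."""
    prev_hat = pooled_diseased / pooled_total
    return prev_hat, n_per_arm(prev_hat, p0, delta_diseased)
```

Because a lower prevalence dilutes the arm-level effect, an overestimated planning prevalence leads to an underpowered fixed design, which is exactly the failure mode the adaptive design guards against.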
 
Examples of f(·) corresponding to the proportion of the intervention effect at time T0, for t ∈ [T0−l1, T0+l2]
Application results in Tokyo: A Original time series, B First difference of the data, C Autocorrelation function (ACF) plot for the first difference of the data, and D Partial autocorrelation function (PACF) plot for the first difference of the data
Application results in Osaka and Ehime: A Upper: Original time series, Lower: First difference of the data in Osaka B Upper: Original time series, Lower: First difference of the data in Ehime
Article
Background Interrupted time series (ITS) analysis has become a popular design to evaluate the effects of health interventions. However, the most common formulation for ITS, the linear segmented regression, is not always adequate, especially when the timing of the intervention is unclear. In this study, we propose a new model to overcome this limitation. Methods We propose a new ITS model, ARIMAITS-DL, that combines (1) the Autoregressive Integrated Moving Average (ARIMA) model and (2) distributed lag functional terms. The ARIMA technique allows us to model autocorrelation, which is frequently observed in time series data, and the decaying cumulative effect of the intervention. By contrast, the distributed lag functional terms represent the idea that the intervention effect does not start at a fixed time point but is distributed over a certain interval (thus, the intervention timing seems unclear). We discuss how to select the distribution of the effect, the model construction process, diagnosing the model fitting, and interpreting the results. Further, our model is applied to an example: the state of emergency (SoE) declared during the coronavirus disease 2019 pandemic in Japan. Results We illustrate the ARIMAITS-DL model with some practical distributed lag terms to examine the effect of the SoE on human mobility in Japan. We confirm that the SoE was successful in reducing the movement of people (15.0–16.0% reduction in Tokyo), at least between February 20 and May 19, 2020. We also provide the R code for other researchers to easily replicate our method. Conclusions Our model, ARIMAITS-DL, is a useful tool as it can account for the unclear intervention timing and distributed lag effect with autocorrelation and allows for flexible modeling of different types of impacts such as uniformly or normally distributed impact over time.
 
Brain, Bone, Heart (BBH) Study Timeline. Legend: BBH: Brain, Bone, Heart Study; MWCCS: Multicenter AIDS Cohort Study (MACS)/Women’s Interagency HIV Study (WIHS) Combined Cohort Study (MWCCS); CRF: case report form; sIRB: single Institutional Review Board
Brain, Bone, Heart (BBH) Study Visit Components and Visit Schedule [1]. BBH is a Nested, Multipart sub-study of WIHS, which later transitioned to MWCCS. BBH Study Visits are linked to the nearest parent study visit. Legend: BBH: Brain, Bone, Heart Study; MWCCS: Multicenter AIDS Cohort Study (MACS)/Women’s Interagency HIV Study (WIHS) Combined Cohort Study (MWCCS); DEXA: dual energy X-ray absorptiometry; QCT: quantitative computed tomography; MRI: magnetic resonance imaging; CCTA: coronary computed tomography angiography; CIMT: carotid intima-media thickness
Article
Background Collecting new data from cross-sectional/survey and cohort observational study designs can be expensive and time-consuming. Nested (hierarchically cocooned within an existing parent study) and/or Multipart (≥ 2 integrally interlinked projects) study designs can expand the scope of a prospective observational research program beyond what might otherwise be possible with available funding and personnel. The Brain, Bone, Heart (BBH) study provides an exemplary case to describe the real-world advantages, challenges, considerations, and insights from these complex designs. Main BBH is a Nested, Multipart study conducted by the Specialized Center for Research Excellence (SCORE) on Sex Differences at Emory University. BBH is designed to examine whether estrogen insufficiency-induced inflammation compounds HIV-induced inflammation, leading to end-organ damage and aging-related co-morbidities affecting the neuro-hypothalamic–pituitary–adrenal axis (brain), musculoskeletal (bone), and cardiovascular (heart) organ systems. Using BBH as a real-world case study, we describe the advantages and challenges of Nested and Multipart prospective cohort study design in practice. While excessive dependence on its parent study can pose challenges in a Nested study, there are significant advantages to the study design as well. These include the ability to leverage a parent study’s resources and personnel; more comprehensive data collection and data sharing options; a broadened community of researchers for collaboration; dedicated longitudinal research participants; and access to historical data. Multipart, interlinked studies that share a common cohort of participants and pool of resources have the advantage of dedicated key personnel and the challenge of increased organizational complexity. 
Important considerations for each study design include the stability and administration of the parent study (Nested) and the cohesiveness of linkage elements and staff organizational capacity (Multipart). Conclusion Using the experience of BBH as an example, Nested and/or Multipart study designs have both distinct advantages and potential vulnerabilities that warrant consideration and require strong biostatistics and data management leadership to optimize programmatic success and impact.
 
Flowchart for the process of data extraction
Article
When designing a noninferiority (NI) study, one of the most important steps is to set the NI limit. The NI limit is the acceptable loss of efficacy for a new investigative treatment compared to an active control treatment, often standard care. The limit should be a value so small that the loss of efficacy is clinically negligible. One approach sets the NI limit such that an effect over placebo can be demonstrated through an indirect comparison with historical placebo-controlled trials in which the active control treatment was compared to placebo. In this context, the setting of the NI limit depends on three assumptions: assay sensitivity, bias minimisation, and the constancy assumption. The last of these assumes that the effect of the active control over placebo is constant over time. This paper aims to assess the constancy assumption in placebo-controlled trials. Methods: 236 Cochrane reviews of placebo-controlled trials published in 2015–2016 were collected and used to assess the relation of the placebo response, the active treatment response, and the standardised mean difference (SMD) with time (year of publication). Results: The analysis showed that both the size of the study and the treatment effect were associated with year of publication. The three main variables that affect the estimate of any future trial are the estimate from the meta-analysis of trials conducted up to that point, the year difference in the meta-analysis, and the year in which the trial is conducted. The regression analysis showed that an increase of one unit in the point estimate of the historical meta-analysis would increase the predicted estimate of a future trial on the SMD scale by 0.88. This result suggests that final trial results are 12% smaller than those from the meta-analysis of trials up to that point. Conclusion: The results of this study indicate that the assumption of a constant treatment difference between the active control and placebo can be questioned. 
It is therefore important to consider the effect of time in estimating the treatment response if indirect comparisons are being used as the basis of a NI limit.
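The fixed-margin construction that underlies such indirect comparisons can be sketched as follows. This is our own minimal illustration of the generic "95-95" fixed-margin method, with names of our choosing; a constancy-violation discount (such as the 12% shrinkage reported above) could be applied to the historical estimate before computing the margin.

```python
from scipy.stats import norm

def fixed_margin_ni_limit(effect_hat, se, fraction_preserved=0.5, level=0.95):
    """Fixed-margin method for setting an NI limit. M1 is the conservative
    (lower) end of the historical active-vs-placebo effect from
    meta-analysis (benefit coded as positive); the NI limit M2 preserves a
    chosen fraction of M1 as the allowed loss of efficacy."""
    z = norm.ppf(0.5 + level / 2)
    m1 = effect_hat - z * se              # lower CI bound of historical effect
    m2 = (1 - fraction_preserved) * m1    # allowed loss, e.g. half of M1
    return m1, m2
```

The time trend documented in this study implies that `effect_hat` taken from a historical meta-analysis likely overstates the effect a contemporaneous trial would see, so M1 (and hence M2) would be too generous without such a discount.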
 
An exploratory analysis to assess the association between IOP-derived subject-specific characteristics (i.e., intercept, slope, and variance of residuals from the OLS model of IOPs against measurement times) and the risk of developing POAG using the OHTS data. A raw data of longitudinal IOP over follow-up time; B risk of developing POAG by the quartiles of IOP intercept; C risk of developing POAG by the quartiles of IOP slope; D risk of POAG by the quartiles of within-subject IOP variability
Average estimated effect of biomarker variability on survival outcome (γ̂3) from the joint model and two-stage methods based on the second simulation, where the dotted red line represents the true value (γ3 = 0.5). A γ̂3 as a function of varying SD of the random intercept; B γ̂3 as a function of varying SD of the random slope; C γ̂3 as a function of varying SD of the within-subject variability; D γ̂3 as a function of varying mean of the within-subject variability
Article
Background In recent years there is increasing interest in modeling the effect of early longitudinal biomarker data on future time-to-event or other outcomes. Sometimes investigators are also interested in knowing whether the variability of biomarkers is independently predictive of clinical outcomes. This question in most applications is addressed via a two-stage approach where summary statistics such as variance are calculated in the first stage and then used in models as covariates to predict clinical outcome in the second stage. The objective of this study is to compare the relative performance of various methods in estimating the effect of biomarker variability. Methods A joint model and 4 different two-stage approaches (naïve, landmark analysis, time-dependent Cox model, and regression calibration) were illustrated using data from a large multi-center randomized phase III trial, the Ocular Hypertension Treatment Study (OHTS), regarding the association between the variability of intraocular pressure (IOP) and the development of primary open-angle glaucoma (POAG). The model performance was also evaluated in terms of bias using simulated data from the joint model of longitudinal IOP and time to POAG. The parameters for simulation were chosen based on the OHTS data, and the association between longitudinal and survival data was introduced via underlying, unobserved, and error-free parameters including subject-specific variance. Results In the OHTS data, joint modeling and the two-stage methods reached the consistent conclusion that IOP variability showed no significant association with the risk of POAG. In the simulated data with no association between IOP variability and time-to-POAG, all the two-stage methods (except the naïve approach) provided a reliable estimation. 
When a moderate effect of IOP variability on POAG was imposed, all the two-stage methods underestimated the true association compared with joint modeling, while the model-based two-stage method (regression calibration) resulted in the least bias. Conclusion Regression calibration and joint modeling are the preferred methods for assessing the effect of biomarker variability. Two-stage methods with sample-based measures should be used with caution unless there exists a relatively long series of longitudinal measurements and/or a strong effect size (NCT00000125).
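Stage 1 of the two-stage approach, computing subject-specific summaries from the longitudinal biomarker, can be sketched as follows. This is our own minimal version; in stage 2 the summaries (intercept, slope, and residual SD as the within-subject variability measure) would enter a survival model such as a Cox regression as covariates.

```python
import numpy as np

def subject_summaries(times, values):
    """Stage 1 of the two-stage approach: per-subject OLS regression of the
    biomarker on measurement time, returning (intercept, slope, residual SD)
    for each subject. The residual SD is the sample-based within-subject
    variability measure."""
    out = []
    for t, v in zip(times, values):
        t, v = np.asarray(t, float), np.asarray(v, float)
        slope, intercept = np.polyfit(t, v, 1)
        resid = v - (intercept + slope * t)
        resid_sd = resid.std(ddof=2)    # two fitted coefficients
        out.append((intercept, slope, resid_sd))
    return np.array(out)
```

The abstract's caveat applies directly here: with short measurement series, `resid_sd` is a noisy estimate of the true within-subject variability, which is what attenuates the estimated association relative to joint modeling.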
 
Sample sizes represented by the height of rectangles and prevalence of significant prostate cancer represented by the width of rectangles for the 11 PBCG cohorts used in the study. The cohorts have been numbered according to their rank of clinically significant prostate cancer prevalence. The 3rd cohort in black outline was withheld to serve as an external validation cohort with the remaining 10 cohorts used for training prediction models
Amount of missing risk factor data by cohort on the x-axis; all patients were required to have prostate-specific antigen (PSA) and age, hence 0% missing for these covariates. The 3rd cohort separated by the black vertical line is used as an external validation set, and leave-one-cohort-out cross-validation was applied to the other cohorts. Cohorts were sorted by missing data pattern
CIL and AUC from leave-one-cohort-out cross-validation on the 10 PBCG training cohorts. Median values are indicated with numbers and as vertical lines in the boxes
Calibration plots with shaded pointwise 95% confidence intervals for the 6 modeling methods applied to 10 PBCG training cohorts and validated on the external cohort. The diagonal black line is where predicted risks equal observed risks, lines below the diagonal indicate over-prediction, and lines above under-prediction, on the validation set
Marginal and pairwise comparisons of predictions from the 6 methods for the 5543 biopsies of the external validation set, pooled and stratified by clinically significant prostate cancer status (31.7% with clinically significant prostate cancer). Corr indicates Pearson correlation. Turquoise indicates individuals with clinically significant prostate cancer and purple not
Article
Background We compared six commonly used logistic regression methods for accommodating missing risk factor data from multiple heterogeneous cohorts, in which some cohorts do not collect some risk factors at all, and developed an online risk prediction tool that accommodates missing risk factors from the end-user. Methods Ten North American and European cohorts from the Prostate Biopsy Collaborative Group (PBCG) were used for fitting a risk prediction tool for clinically significant prostate cancer, defined as Gleason grade group ≥ 2 on standard TRUS prostate biopsy. One large European PBCG cohort was withheld for external validation, where calibration-in-the-large (CIL), calibration curves, and area-underneath-the-receiver-operating characteristic curve (AUC) were evaluated. Ten-fold leave-one-cohort-out internal validation further validated the optimal missing data approach. Results Among 12,703 biopsies from 10 training cohorts, 3,597 (28%) had clinically significant prostate cancer, compared to 1,757 of 5,540 (32%) in the external validation cohort. In external validation, the available cases method that pooled individual patient data containing all risk factors input by an end-user had the best CIL, under-predicting risks as percentages by 2.9% on average, and obtained an AUC of 75.7%. Imputation had the worst CIL (-13.3%). The available cases method was further validated as optimal in internal cross-validation and thus used for development of an online risk tool. For end-users of the risk tool, two risk factors were mandatory: serum prostate-specific antigen (PSA) and age, and ten were optional: digital rectal exam, prostate volume, prior negative biopsy, 5-alpha-reductase-inhibitor use, prior PSA screen, African ancestry, Hispanic ethnicity, first-degree prostate-, breast-, and second-degree prostate-cancer family history. 
Conclusion Developers of clinical risk prediction tools should optimize use of available data and sources even in the presence of high amounts of missing data and offer options for users with missing risk factors.
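The available cases idea can be sketched as follows: refit the model on exactly the risk factors the end-user supplies, using only training rows where all of those factors are observed. This is our own minimal illustration, not the published tool; which columns are mandatory is an assumption mirroring the PSA/age requirement above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def available_cases_predict(X, y, query):
    """Available cases prediction for a query with missing risk factors
    (NaN). The model is refit on the supplied factors only, using training
    rows where all of those factors are observed. Columns 0 and 1 (say, PSA
    and age) are assumed always observed and always supplied."""
    supplied = ~np.isnan(query)
    rows = ~np.isnan(X[:, supplied]).any(axis=1)
    model = LogisticRegression(max_iter=1000).fit(
        X[np.ix_(rows, np.flatnonzero(supplied))], y[rows])
    return model.predict_proba(query[supplied].reshape(1, -1))[0, 1]
```

Fitting per query pattern is what lets cohorts that never collected a factor still contribute to predictions that do not use it, in contrast to imputation, which performed worst on CIL above.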
 
Overview of basic components of a factorial survey
Factorial survey development in the TechChild research programme (2021)
Article
Background The decision to initiate invasive long-term ventilation for a child with complex medical needs can be extremely challenging. TechChild is a research programme that aims to explore the liminal space between initial consideration of such technology dependence and the final decision. This paper presents a best practice example of the development of a unique use of the factorial survey method to identify the main influencing factors in this critical juncture in a child’s care. Methods We developed a within-subjects design factorial survey. In phase 1 (design), we defined the survey goal (dependent variable, mode and sample). We defined and constructed the factors and factor levels (independent variables) using previous qualitative research and existing scientific literature. We further refined these factors based on feedback from expert clinicians and a statistician. In phase 2 (pretesting), we subjected the survey tool to several iterations (cognitive interviewing, face validity testing, statistical review, usability testing). In phase 3 (piloting), testing focused on feasibility with members of the target population (n = 18). Ethical approval was obtained from the then host institution’s Health Sciences Ethics Committee. Results Initial refinement of factors was guided by literature and interviews with clinicians and grouped into four broad categories: Clinical, Child and Family, Organisational, and Professional characteristics. Extensive iterative consultations with clinical and statistical experts, including analysis of cognitive interviews, identified best practice in terms of appropriate: inclusion and order of clinical content; cognitive load and number of factors; as well as language used to suit an international audience. The pilot study confirmed feasibility of the survey. 
The final survey comprised a 43-item online tool including two age-based sets of clinical vignettes, eight of which were randomly presented to each participant from a total vignette population of 480. Conclusions This paper clearly explains the processes involved in the development of a factorial survey for the online environment that is internationally appropriate, relevant, and useful to research an increasingly important subject in modern healthcare. This paper provides a framework for researchers to apply a factorial survey approach in wider health research, making this underutilised approach more accessible to a wider audience.
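The vignette universe in a factorial survey is the full crossing of factor levels, from which each participant is shown a random subset. A minimal sketch of that logic, with entirely hypothetical factors and levels (not the TechChild factors):

```python
import itertools
import random

# Hypothetical factors and levels (illustrative only; not the study's factors)
factors = {
    "clinical_stability": ["stable", "deteriorating"],
    "family_preference": ["for", "against", "undecided"],
    "bed_availability": ["available", "limited"],
}

# The vignette population is the full factorial crossing of all factor levels
vignette_population = [
    dict(zip(factors, combo)) for combo in itertools.product(*factors.values())
]

def draw_vignettes(population, n, seed=None):
    """Randomly present n distinct vignettes to a participant."""
    rng = random.Random(seed)
    return rng.sample(population, n)

shown = draw_vignettes(vignette_population, 4, seed=1)
```

In the actual survey the population was constrained (480 vignettes, 8 shown per participant); the sketch keeps only the crossing-then-sampling structure.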
 
Article
Introduction Studying the relationship between unemployment and health raises many methodological challenges. In the current study, the aim was to evaluate the sensitivity of estimates to different ways of measuring unemployment and to the choice of statistical model. Methods The Northern Swedish cohort and two of its follow-up surveys, from 1995 and 2007, were used, together with register data on unemployment. Self-reported current unemployment, self-reported accumulated unemployment and register-based accumulated unemployment were used to measure unemployment, and its effect on self-reported health was evaluated. Analyses were conducted with G-computation, logistic regression and three inverse probability weighting estimators based on propensity scores, with 11 potentially confounding variables included in the analyses. Results were presented as absolute differences in the proportion with poor self-reported health between unemployed and employed individuals, except when logistic regression was used alone. Results Of the initial 1083 pupils in the cohort, our analyses included between 488 and 693 individuals defined as employed and between 61 and 214 individuals defined as unemployed. In the analyses, the deviation between the unemployment measures was large, with a difference of at least 2.5% in effect size when unemployed individuals were compared with employed individuals across the self-reported and register-based unemployment measures. The choice of statistical method had only a small influence on effect estimates, and the deviation was in most cases lower than 1%. When models were compared based on the choice of potential confounders in the analytical model, the deviations were rarely above 0.6% when comparing models with 4 and 11 potential confounders. Our variable for health selection was the only one that strongly affected estimates when it was not part of the statistical model. 
Conclusions How unemployment is measured is highly important when the relationship between unemployment and health is estimated. However, misspecification of the statistical model or the choice of analytical method might not matter much for estimates, except for the inclusion of a variable measuring health status before becoming unemployed. Our results can guide researchers when analysing similar research questions. Model diagnostics are commonly lacking in publications, but they remain very important for validating analyses.
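Inverse probability weighting with propensity scores, one of the estimation approaches compared above, can be sketched on simulated data; the variable names and data-generating model below are illustrative, not the cohort's:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# Synthetic data: one confounder, binary exposure (unemployment),
# binary outcome (poor self-reported health)
confounder = rng.normal(size=n)
exposure = rng.binomial(1, 1 / (1 + np.exp(-(confounder - 1))))
outcome = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * exposure + confounder - 1))))

# Stage 1: propensity score model for exposure given the confounder
ps_model = LogisticRegression().fit(confounder.reshape(-1, 1), exposure)
ps = ps_model.predict_proba(confounder.reshape(-1, 1))[:, 1]

# Stage 2: inverse probability weights and the weighted risk difference,
# the absolute-difference effect scale reported in the study
weights = exposure / ps + (1 - exposure) / (1 - ps)
risk_exposed = np.average(outcome[exposure == 1], weights=weights[exposure == 1])
risk_unexposed = np.average(outcome[exposure == 0], weights=weights[exposure == 0])
risk_difference = risk_exposed - risk_unexposed
```

The real analyses used 11 confounders and several IPW estimators; the sketch keeps only the two-stage weighting logic.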
 
Flow chart of data selection and its linkage application
Article
Background Linkage of public healthcare data provides powerful resources for studying quality of care from a more comprehensive view than a single administrative database can offer. It is believed that positive patient experiences reflect good quality of health care and may reduce patient readmission. This study aimed to determine the relationship between patient experience and hospital readmission at a system level by linking anonymous experience survey data with de-identified longitudinal hospital administrative admissions data. Methods Data were obtained by linking two datasets with anonymised individual-level records from the seven largest-scale acute public hospitals across seven geographical clusters in Hong Kong. Selected records in the two datasets involving the patient experience survey (PES) (2013 survey dataset) and healthcare utilization (admissions dataset) were used. Following data cleaning and standardization, a deterministic data linkage algorithm was used to identify pairs of records uniquely matched on a list of identifiers (10 selected variables) between the two datasets. If a patient’s record from the survey dataset matched hospitalization records in the admissions dataset, the patient was included in the subsequent analyses. Bivariate analyses and multivariable logistic regression models were performed to evaluate the associations between hospital readmission in the next calendar month and patient experience. Results The overall matching rate was 62.1% (1746/2811) for PES participants aged 45 or above from the survey dataset. The average score for overall inpatient experience was 8.10 (SD = 1.53). There was no significant difference between matched and unmatched patients in their score for the perception of overall quality of care received during hospitalization (X² = 6.931, p-value = 0.14) or their score for overall inpatient experience (X² = 7.853, p-value = 0.25). 
In the multivariable model, readmission through the outpatient department (planned admission) in the next calendar month was significantly associated with a higher score given to the overall quality of care received (adjusted OR = 1.54, 95%CI = 1.09–2.17), while such an association was absent for readmission through the Accident and Emergency department (adjusted OR = 0.75, 95%CI = 0.50–1.12). Conclusions This study demonstrated the feasibility of routine record linkage, with limited intrusion on patients’ confidentiality, for evaluating health care quality. It also highlights the significant association between planned readmission and a higher score for overall quality of care received. A possible explanation might be the perceived better co-ordination between outpatient departments and inpatient services and the well-informed discharge plan given to this group of patients.
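The deterministic linkage step, matching records uniquely on a set of identifiers, can be illustrated with pandas; the identifiers and records below are invented for illustration and are not the study's 10 variables:

```python
import pandas as pd

# Illustrative records only; the variable names are hypothetical
survey = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "birth_year": [1950, 1948, 1962],
    "district": ["A", "B", "A"],
    "overall_score": [9, 7, 8],
})
admissions = pd.DataFrame({
    "sex": ["F", "M", "M"],
    "birth_year": [1950, 1948, 1970],
    "district": ["A", "B", "C"],
    "readmitted": [0, 1, 1],
})

keys = ["sex", "birth_year", "district"]

# Keep only identifier combinations that are unique within each dataset,
# then inner-join on them: a deterministic one-to-one linkage
survey_unique = survey.drop_duplicates(subset=keys, keep=False)
admissions_unique = admissions.drop_duplicates(subset=keys, keep=False)
linked = survey_unique.merge(admissions_unique, on=keys, how="inner")

match_rate = len(linked) / len(survey)
```

Only uniquely matched pairs survive, mirroring how the study computed its 62.1% matching rate on a much larger identifier set.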
 
Article
Background Multiple imputation is frequently used to address missing data when conducting statistical analyses. There is a paucity of research into the performance of multiple imputation when the prevalence of missing data is very high. Our objective was to assess the performance of multiple imputation when estimating a logistic regression model when the prevalence of missing data for predictor variables is very high. Methods Monte Carlo simulations were used to examine the performance of multiple imputation when estimating a multivariable logistic regression model. We varied the size of the analysis samples ( N = 500, 1,000, 5,000, 10,000, and 25,000) and the prevalence of missing data (5–95% in increments of 5%). Results In general, multiple imputation performed well across the range of scenarios. The exceptions were scenarios in which the sample size was 500 or 1,000 and the prevalence of missing data was at least 90%. In these scenarios, the estimated standard errors of the log-odds ratios were very large and did not accurately estimate the standard deviation of the sampling distribution of the log-odds ratio. Furthermore, in these settings, estimated confidence intervals tended to be conservative. In all other settings (i.e., sample sizes > 1,000 or a prevalence of missing data below 90%), multiple imputation allowed for accurate estimation of a logistic regression model. Conclusions Multiple imputation can be used in many scenarios with a very high prevalence of missing data.
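The workflow evaluated above, imputing missing predictor values several times and pooling logistic regression estimates, can be sketched as follows; this is a minimal illustration using scikit-learn's IterativeImputer as a stand-in for a full MICE implementation, with Rubin's rules shown only for the pooled point estimate:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000

# Synthetic data: two correlated predictors and a logistic outcome
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.7, size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * x1 - 0.5 * x2))))

# Make x2 missing completely at random at a high prevalence (60%)
X = np.column_stack([x1, x2])
missing = rng.random(n) < 0.6
X[missing, 1] = np.nan

# Multiple imputation: m completed datasets, a model fit on each,
# estimates pooled across imputations (Rubin's rules point estimate)
m = 5
coefs = []
for i in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    X_completed = imputer.fit_transform(X)
    fit = LogisticRegression().fit(X_completed, y)
    coefs.append(fit.coef_[0])

pooled = np.mean(coefs, axis=0)
```

A full Rubin's-rules analysis would also combine within- and between-imputation variances to get standard errors, which is where the paper's small-sample, high-missingness problems appeared.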
 
Calibration plots for prediction of stroke outcome at 90-day on test sets: A the Catboost model, B the XGBoost model
SHapley Additive exPlanations (SHAP) plots: ranking plot of SHAP values on test sets. The blue-to-red colour represents the feature value (red high, blue low). The x-axis measures the impact on the model output (right positive, left negative). (A) the Catboost model, (B) the XGBoost model
Article
Objective We aimed to investigate factors related to 90-day poor prognosis (mRS ≥ 3) in patients with transient ischemic attack (TIA) or minor stroke, construct 90-day poor prognosis prediction models for these patients, and compare the predictive performance of machine learning models and the Logistic model. Method We selected TIA and minor stroke patients from a prospective registry study (CNSR-III). Demographic characteristics, smoking history, drinking history (≥ 20 g/day), physiological data, medical history, secondary prevention treatment, in-hospital evaluation and education, laboratory data, neurological severity, mRS score and TOAST classification of patients were assessed. The data were randomly divided into a training set and a test set in a 70:30 ratio. Univariate and multivariate logistic regression analyses were performed in the training set to identify predictors associated with poor outcome (mRS ≥ 3). The predictors were used to establish the machine learning models and the traditional Logistic model. The training set was used to construct the prediction models, and the test set was used to evaluate their performance. The evaluation indicators included the area under the curve (AUC) for discrimination and the Brier score (or calibration plot) for calibration. Result A total of 10967 patients with TIA and minor stroke were enrolled in this study, with an average age of 61.77 ± 11.18 years; women accounted for 30.68%. Factors associated with poor prognosis in TIA and minor stroke patients included sex, age, stroke history, heart rate, D-dimer, creatinine, TOAST classification, admission mRS, discharge mRS, and discharge NIHSS score. All models, both those constructed by Logistic regression and those by machine learning, performed well in predicting the 90-day poor prognosis (AUC > 0.800). 
The best performing model in the test set was the Catboost model (AUC = 0.839), followed by the XGBoost, GBDT, random forest and Adaboost models (AUCs of 0.838, 0.835, 0.832, and 0.823, respectively). The performance of Catboost and XGBoost in predicting poor prognosis at 90 days was better than that of the Logistic model, and the difference was statistically significant (P < 0.05). All models, both those constructed by Logistic regression and those by machine learning, had good calibration. Conclusion Machine learning algorithms were not inferior to the Logistic regression model in predicting the poor prognosis of patients with TIA and minor stroke at 90 days. Among them, the Catboost model had the best predictive performance. All models provided good discrimination.
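The study's train/test comparison of boosted models against logistic regression can be imitated on synthetic data. The sketch below uses scikit-learn's gradient boosting as a stand-in for CatBoost/XGBoost and made-up data, so only the workflow (70:30 split, AUC for discrimination, Brier score for calibration) mirrors the paper:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the registry data; 70:30 split as in the study design
X, y = make_classification(n_samples=3000, n_features=10, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    p = model.predict_proba(X_test)[:, 1]
    # Discrimination (AUC) and calibration (Brier score), as in the study
    results[name] = {"auc": roc_auc_score(y_test, p),
                     "brier": brier_score_loss(y_test, p)}
```

Formal comparison of two AUCs (the paper's P < 0.05 claim) would additionally require a test such as DeLong's, which is omitted here.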
 
Simulated data (left panel) and identified trajectories (right panel) for each scenario
To make the boxplots for scenarios 1–4 more legible, the boxes representing each subgroup at each time point were shifted slightly so that they would not overlap
Individual trajectories of difficulty initiating and maintaining sleep for identified trajectories (left panel) and a sample of participants (right panel), NDIT Survey cycles 1–20
Article
Background Group-based trajectory modelling (GBTM) is increasingly used to identify subgroups of individuals with similar patterns. In this paper, we use simulated and real-life data to illustrate that GBTM is susceptible to generating spurious findings in some circumstances. Methods Six plausible scenarios, two of which mimicked published analyses, were simulated. Models with 1 to 10 trajectory subgroups were estimated, and the model that minimized the Bayesian information criterion was selected. For each scenario, we assessed whether the method identified the correct number of trajectories, the correct shapes of the trajectories, and the correct mean number of participants in each trajectory subgroup. The performance of the average posterior probability, relative entropy and mismatch criteria in assessing classification adequacy was compared. Results Among the six scenarios, the correct number of trajectories was identified in two, the correct shapes in four and the mean number of participants of each trajectory subgroup in only one. Relative entropy and mismatch outperformed the average posterior probability in detecting spurious trajectories. Conclusion Researchers should be aware that GBTM can generate spurious findings, especially when the average posterior probability is used as the sole criterion to evaluate model fit. Several model adequacy criteria should be used to assess classification adequacy.
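The model-selection loop at the heart of this pitfall, fitting 1 to k subgroups and keeping the criterion-minimizing model, can be mimicked with a Gaussian mixture over trajectory vectors. This is a simplified analogue of GBTM (which is usually fit with specialized software), applied here to data simulated from a single homogeneous group:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# One true population: identical mean trajectory plus random individual variation.
# Mixture-style model selection may still prefer >1 group on such data,
# which is the kind of spurious finding the paper warns about.
n_subjects, n_times = 300, 5
times = np.arange(n_times)
trajectories = (2.0 + 0.5 * times) + rng.normal(scale=1.0,
                                                size=(n_subjects, n_times))

# Fit mixtures with 1..5 groups and select the number minimizing BIC
bics = {}
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, random_state=0).fit(trajectories)
    bics[k] = gm.bic(trajectories)

best_k = min(bics, key=bics.get)
```

In practice the selected `best_k` should be checked with several classification-adequacy criteria (average posterior probability, relative entropy, mismatch), exactly as the paper recommends.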
 
Article
Background Meta-analysis is a central method for generating quality evidence. In particular, meta-analysis is rapidly gaining momentum in the growing world of quantitative information. There are several software applications to process and output the expected results. Open-source software applications generating such results are receiving more attention. This paper uses Python’s capabilities to provide practical instruction for performing a meta-analysis. Methods We used the PythonMeta package with several modifications to perform the meta-analysis on an open-access dataset from Cochrane. The analyses were complemented by employing Python’s zEpid package, which is capable of creating forest plots. Also, we developed Python scripts for contour-enhanced funnel plots to assess funnel plot asymmetry. Finally, we ran the analyses in R and STATA to check the cross-validity of the results. Results A stepwise instruction on installing the software and packages and performing the meta-analysis was provided. We shared the Python code for meta-analysts to follow and generate the standard outputs. Our results were similar to those yielded by R and STATA. Conclusion We successfully produced standard meta-analytic outputs using Python. This programming language has several flexibilities to improve the meta-analysis results even further.
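Independent of any particular package, the core fixed- and random-effects computations behind such a Python meta-analysis fit in a few lines of NumPy; the effect sizes below are invented, not the Cochrane dataset used in the paper:

```python
import numpy as np

# Illustrative effect sizes (log odds ratios) and standard errors (made up)
yi = np.array([-0.37, -0.10, -0.51, -0.28, -0.05])
sei = np.array([0.15, 0.22, 0.18, 0.12, 0.30])

# Fixed-effect model: inverse-variance weighted mean
wi = 1 / sei**2
fixed = np.sum(wi * yi) / np.sum(wi)
se_fixed = np.sqrt(1 / np.sum(wi))

# Random-effects model: DerSimonian-Laird estimate of between-study
# variance tau^2, then re-weighting
q = np.sum(wi * (yi - fixed) ** 2)      # Cochran's Q
df = len(yi) - 1
c = np.sum(wi) - np.sum(wi**2) / np.sum(wi)
tau2 = max(0.0, (q - df) / c)
wi_re = 1 / (sei**2 + tau2)
random_eff = np.sum(wi_re * yi) / np.sum(wi_re)
se_re = np.sqrt(1 / np.sum(wi_re))
ci = (random_eff - 1.96 * se_re, random_eff + 1.96 * se_re)
```

Packages such as PythonMeta wrap these same formulas and add forest and funnel plotting on top.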
 
ROC plot of studies reporting sensitivity and specificity of FC testing for IBD at 50 μg/g (binomial method). The applicable region for primary care is defined by the test positive rate (dashed line) and the test positive rate plus prevalence (trapezium) from THIN data, delineating the area of sensitivity and specificity that is compatible with UK primary care practices. Included studies using the binomial distribution method (Caviglia 2014 [9], Conroy 2018 [10] and DeSloovere 2017 [11]) were compatible with their true parameters lying in the applicable region, unlike the rejected studies (Alrubaiy 2012 [12], Boyd 2016 [13], Carroccio 2003 [14], El Badry 2010 [15], Hogberg 2017 [16], Labaere 2014 [17], Li 2006 [18], Mowat 2016 [19], Oyaert 2017 [20], Oyaert 2014 [21] and Tan 2016 [22])
ROC plot of studies reporting sensitivity and specificity of FC testing for IBD at 50 μg/g (Mahalanobis distance method). The applicable region for primary care is defined by the test positive rate (dashed line) and the test positive rate plus prevalence (trapezium) from THIN data, delineating the area of sensitivity and specificity that is compatible with UK primary care practices. Included studies using the Mahalanobis distance method (Carroccio 2003 [14], Oyaert 2014 [21], Oyaert 2017 [17], Boyd 2016 [13], Alrubaiy 2012 [12], DeSloovere 2017 [11], Conroy 2018 [10], El Badry 2010 [15], and Caviglia 2014 [9]) had a closer ‘statistical distance’ to the applicable region than the rejected studies (Labaere 2014 [17], Li 2006 [18], Hogberg 2017 [16], Mowat 2016 [19] and Tan 2016 [22])
Sensitivity and false positive rate pairs in ROC space from conventional meta-analysis, tailored meta-analysis and THIN data. Tailored meta-analysis was undertaken using the binomial and Mahalanobis distance methods. The applicable region (trapezium) was informed by routine data from primary care. TMA tailored meta-analysis, THIN the health improvement network
Article
Background Meta-analyses of test accuracy studies may provide estimates that are highly improbable in clinical practice. Tailored meta-analysis produces plausible estimates for the accuracy of a test within a specific setting by restricting the selection of included studies to those compatible with that setting, using information from the target setting. The aim of this study was to validate the tailored meta-analysis approach by comparing outcomes from tailored meta-analysis with outcomes from a setting-specific test accuracy study. Methods A retrospective cohort study of primary care electronic health records provided setting-specific data on the test positive rate and disease prevalence. This was used to tailor the study selection from a review of faecal calprotectin testing for inflammatory bowel disease for meta-analysis using the binomial method and the Mahalanobis distance method. Tailored estimates were compared to estimates from a study of test accuracy in primary care using the same routine dataset. Results Tailoring resulted in the inclusion of 3/14 (binomial method) and 9/14 (Mahalanobis distance method) studies in meta-analysis. Sensitivity and specificity from tailored meta-analysis were 0.87 (95% CI 0.77 to 0.94) and 0.65 (95% CI 0.60 to 0.69), respectively, using the binomial method, and 0.98 (95% CI 0.83 to 0.999) and 0.68 (95% CI 0.65 to 0.71), respectively, using the Mahalanobis distance method. The corresponding estimates were 0.94 (95% CI 0.90 to 0.97) and 0.67 (95% CI 0.57 to 0.76) for the conventional meta-analysis, and 0.93 (95%CI 0.89 to 0.96) and 0.61 (95% CI 0.6 to 0.63) for the FC test accuracy study of primary care data, to detect IBD at a threshold of 50 μg/g. Although the binomial method produced a plausible estimate, the tailored estimates of sensitivity and specificity were not closer to the primary study estimates than the estimates from conventional meta-analysis including all 14 studies. 
Conclusions Tailored meta-analysis does not always produce estimates of sensitivity and specificity that lie closer to the estimates derived from a primary study in the setting in question. Potentially, tailored meta-analysis may be improved using a constrained model approach and this requires further investigation.
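The Mahalanobis distance tailoring idea, ranking studies by their statistical distance from the setting-compatible region, can be sketched as follows; the study points, region centre, covariance, and threshold are all illustrative assumptions, not values from the paper:

```python
import numpy as np

# Illustrative (sensitivity, specificity) pairs for candidate studies (made up)
studies = np.array([
    [0.87, 0.65],
    [0.94, 0.40],
    [0.70, 0.90],
    [0.90, 0.62],
])

# Centre and covariance describing the setting-compatible ("applicable") region;
# in practice these are informed by routine data such as the test positive
# rate and disease prevalence
centre = np.array([0.88, 0.64])
cov = np.array([[0.002, 0.0], [0.0, 0.003]])
cov_inv = np.linalg.inv(cov)

def mahalanobis(point, centre, cov_inv):
    d = point - centre
    return float(np.sqrt(d @ cov_inv @ d))

distances = [mahalanobis(s, centre, cov_inv) for s in studies]

# Tailoring: keep studies within a distance threshold of the applicable region
threshold = 2.5
selected = [i for i, d in enumerate(distances) if d <= threshold]
```

The selected subset would then be pooled with a standard meta-analysis, as in the tailored approach.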
 
Visualization of the CLECI process
Examples of linguistic elements for CLECI
Interpretation of reliability measure scores
Basic analytical model of CLECI assessing potential predictors of patterns of language use
Article
Background The quality of communication between healthcare professionals (HCPs) and patients affects health outcomes. Different coding systems have been developed to unravel the interaction. Most schemes consist of predefined categories that quantify the content of communication (the what). Though the form (the how) of the interaction is equally important, protocols that systematically code variations in form are lacking. Patterns of form and how they may differ between groups therefore remain unnoticed. To fill this gap, we present CLECI, Coding Linguistic Elements in Clinical Interactions, a protocol for the development of a quantitative codebook analyzing communication form in medical interactions. Methods Analyzing with a CLECI codebook is a four-step process, i.e. preparation, codebook development, (double-)coding, and analysis and report. Core activities within these phases are research question formulation, data collection, selection of utterances, iterative deductive and inductive category refinement, reliability testing, coding, analysis, and reporting. Results and conclusion We present step-by-step instructions for a CLECI analysis and illustrate this process in a case study. We highlight theoretical and practical issues as well as the iterative codebook development which combines theory-based and data-driven coding. Theory-based codes assess how relevant linguistic elements occur in natural interactions, whereas codes derived from the data accommodate linguistic elements to real-life interactions and contribute to theory-building. This combined approach increases research validity, enhances theory, and adjusts to fit naturally occurring data. CLECI will facilitate the study of communication form in clinical interactions and other institutional settings.
 
Article
Background The individual data collected throughout patient follow-up constitute crucial information for assessing the risk of a clinical event, and eventually for adapting a therapeutic strategy. Joint models and landmark models have been proposed to compute individual dynamic predictions from repeated measures of one or two markers. However, they hardly extend to the case where the patient history includes many more repeated markers. Our objective was thus to propose a solution for the dynamic prediction of a health event that may exploit repeated measures of a possibly large number of markers. Methods We combined a landmark approach extended to endogenous marker history with machine learning methods adapted to survival data. Each marker trajectory is modeled using the information collected up to the landmark time, and summary variables that best capture the individual trajectories are derived. These summaries and additional covariates are then included in different prediction methods adapted to survival data, namely regularized regressions and random survival forests, to predict the event from the landmark time. We also show how predictive tools can be combined into a superlearner. The performances are evaluated by cross-validation using estimators of the Brier score and the area under the Receiver Operating Characteristic curve adapted to censored data. Results We demonstrate in a simulation study the benefits of machine learning survival methods over standard survival models, especially in the case of numerous and/or nonlinear relationships between the predictors and the event. We then applied the methodology in two prediction contexts: a clinical context with the prediction of death in primary biliary cholangitis, and a public health context with age-specific prediction of death in the general elderly population. 
Conclusions Our methodology, implemented in R, enables the prediction of an event using the entire longitudinal patient history, even when the number of repeated markers is large. Although introduced with mixed models for the repeated markers and methods for a single right-censored time-to-event, the technique can be used with any other appropriate modeling technique for the markers and can be easily extended to the competing risks setting.
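A deliberately simplified sketch of the landmark idea follows: individual trajectories are reduced to summary features (here, a last observed value and a slope), which feed a machine learning predictor of the event, scored with the Brier score. Censoring, the survival-adapted estimators, and the superlearner of the actual methodology are omitted, and all names and the data-generating process are invented:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 2000

# Landmark-style summaries of two hypothetical repeated markers, as they
# might be derived from the trajectory observed up to the landmark time
marker1_last = rng.normal(size=n)
marker1_slope = rng.normal(size=n)
marker2_last = rng.normal(size=n)
X = np.column_stack([marker1_last, marker1_slope, marker2_last])

# Event occurring within the prediction horizon (no censoring, for simplicity)
event = rng.binomial(
    1, 1 / (1 + np.exp(-(1.2 * marker1_last + 0.8 * marker1_slope))))

X_train, X_test, y_train, y_test = train_test_split(X, event, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
pred = rf.predict_proba(X_test)[:, 1]
brier = brier_score_loss(y_test, pred)
```

With censored data, the classifier would be replaced by survival-adapted learners (e.g. random survival forests) and the Brier score by its censoring-adjusted estimator.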
 
Written instructions given to participating families
Bland-Altman plot for child school scale measurement compared to the difference between child school scale and home scale weight
Bland-Altman plot for parent school scale measurement compared to the difference between parent school scale and home scale weight
Bland-Altman plot for child school height measurement compared to the difference between child school height and home height measurements
Article
Background The purpose of this study is to describe and assess a remote height and weight protocol that was developed for an ongoing trial conducted during the SARS-CoV-2 pandemic. Methods Thirty-eight rural families (children 8.3 ± 0.7 years; 68% female; and caregivers 38.2 ± 6.1 years) were provided detailed instructions on how to measure height and weight. Families obtained measures via remote data collection (caregiver weight, child height and weight), and the same measures were also taken by trained staff. Differences between data collection methods were examined. Results Per absolute mean difference analyses, slightly larger differences were found for child weight (0.21 ± 0.21 kg), child height (1.53 ± 1.29 cm), and caregiver weight (0.48 ± 0.42 kg) between school and home measurements. Both analyses indicate that the differences had only a minor impact on child BMI percentile (− 0.12, 0.68) and parent BMI (0.05, 0.13). Intraclass coefficients ranged from 0.98 to 1.00, indicating that almost all of the variance was due to between-person differences and not measurement differences within a person. Conclusion Results suggest that remote height and weight collection is feasible for caregivers and children and that there are minimal differences between the various measurement methods studied here when assessing group differences. These differences did not have clinically meaningful impacts on BMI. This is promising for the use of remote height and weight measurement in clinical trials, especially for hard-to-reach populations. Trial registration Registered at clinicaltrials.gov (NCT03304249) on 06/10/2017.
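The Bland-Altman agreement analysis shown in the figures above (bias and 95% limits of agreement between home and staff measurements) reduces to a few lines; the paired weights below are simulated, not the study data:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated paired weights (kg): staff-measured vs home-measured for 38 pairs
staff = rng.normal(30.0, 5.0, size=38)
home = staff + rng.normal(0.1, 0.3, size=38)

# Bland-Altman: mean difference (bias) and 95% limits of agreement
diff = home - staff
bias = diff.mean()
sd = diff.std(ddof=1)
loa_low = bias - 1.96 * sd
loa_high = bias + 1.96 * sd

# Share of pairs inside the limits (roughly 95% expected for normal errors)
inside = np.mean((diff >= loa_low) & (diff <= loa_high))
```

The Bland-Altman plot itself simply scatters `diff` against the pairwise means with horizontal lines at `bias`, `loa_low`, and `loa_high`.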
 
Network diagram representing number of RCTs assessing biologics as treatments for RA, IPD were available for the two trials assessing tocilizumab
Posterior median and 95% credible interval estimates for log odds ratios comparing each biologic versus placebo in terms of the ACR20 outcome, for contrast-based methods without covariates
Posterior median and 95% credible interval estimates for the pooled log odds on each intervention in terms of the ACR20 outcome, for arm-based methods without covariates
Posterior median and 95% credible interval estimates for log odds ratios comparing each biologic versus placebo in terms of the ACR20 outcome, for contrast-based methods applied to the artificial dataset
Posterior median and 95% credible interval estimates for the pooled log odds on each intervention in terms of the ACR20 outcome, for arm-based methods applied to the artificial dataset
Article
Background Increasingly in network meta-analysis (NMA), there is a need to incorporate non-randomised evidence to estimate relative treatment effects, and in particular in cases with limited randomised evidence, sometimes resulting in disconnected networks of treatments. When combining different sources of data, complex NMA methods are required to address issues associated with participant selection bias, incorporating single-arm trials (SATs), and synthesising a mixture of individual participant data (IPD) and aggregate data (AD). We develop NMA methods which synthesise data from SATs and randomised controlled trials (RCTs), using a mixture of IPD and AD, for a dichotomous outcome. Methods We propose methods under both contrast-based (CB) and arm-based (AB) parametrisations, and extend the methods to allow for both within- and across-trial adjustments for covariate effects. To illustrate the methods, we use an applied example investigating the effectiveness of biologic disease-modifying anti-rheumatic drugs for rheumatoid arthritis (RA). We applied the methods to a dataset obtained from a literature review consisting of 14 RCTs and an artificial dataset consisting of IPD from two SATs and AD from 12 RCTs, where the artificial dataset was created by removing the control arms from the only two trials assessing tocilizumab in the original dataset. Results Without adjustment for covariates, the CB method with independent baseline response parameters (CBunadjInd) underestimated the effectiveness of tocilizumab when applied to the artificial dataset compared to the original dataset, albeit with significant overlap in posterior distributions for treatment effect parameters. The CB method with exchangeable baseline response parameters produced effectiveness estimates in agreement with CBunadjInd, when the predicted baseline response estimates were similar to the observed baseline response. 
After adjustment for RA duration, there was a reduction in across-trial heterogeneity in baseline response but little change in treatment effect estimates. Conclusions Our findings suggest incorporating SATs in NMA may be useful in some situations where a treatment is disconnected from a network of comparator treatments, due to a lack of comparative evidence, to estimate relative treatment effects. The reliability of effect estimates based on data from SATs may depend on adjustment for covariate effects, although further research is required to understand this in more detail.
 
Regression Tree for Causal Instrumental Variable-Based Early Surgery Effect on the Probability of Benefit
Regression Tree for Causal Instrumental Variable-Based Early Surgery Effect on the Probability of Detriment
Two-Stage Least Squares (2SLS) Estimates by Benefit and Detriment by Ex-Post Reference Classes for Third-level CART Splits
Article
Background Comparative effectiveness research (CER) using observational databases has been suggested to obtain personalized evidence of treatment effectiveness. Inferential difficulties remain using traditional CER approaches, especially related to designating patients to reference classes a priori. A novel Instrumental Variable Causal Forest Algorithm (IV-CFA) has the potential to provide personalized evidence using observational data without designating reference classes a priori, but the consistency of the evidence when varying key algorithm parameters remains unclear. We investigated the consistency of IV-CFA estimates through application to a database of Medicare beneficiaries with proximal humerus fractures (PHFs) that previously revealed heterogeneity in the effects of early surgery using instrumental variable estimators. Methods IV-CFA was used to estimate patient-specific early surgery effects on both beneficial and detrimental outcomes using different combinations of algorithm parameters, and estimate variation was assessed for a population of 72,751 fee-for-service Medicare beneficiaries with PHFs in 2011. Classification and regression trees (CART) were applied to these estimates to create ex-post reference classes, and the consistency of these classes was assessed. Two-stage least squares (2SLS) estimators were applied to representative ex-post reference classes to scrutinize the estimates relative to known 2SLS properties. Results IV-CFA uncovered substantial early surgery effect heterogeneity across PHF patients, but estimates for individual patients varied with algorithm parameters. CART applied to these estimates revealed ex-post reference classes consistent across algorithm parameters. 2SLS estimates showed that ex-post reference classes containing older, frailer patients with more comorbidities, and lower utilizers of healthcare were less likely to benefit and more likely to have detriments from higher rates of early surgery. 
Conclusions IV-CFA provides an illuminating method to uncover ex-post reference classes of patients based on treatment effects using observational data with a strong instrumental variable. Interpretation of treatment effect estimates within each ex-post reference class using traditional CER methods remains conditional on the extent of measured information in the data.
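The 2SLS estimator applied to the ex-post reference classes can be illustrated on simulated data with an unmeasured confounder and a binary instrument; everything below is a textbook sketch, not the Medicare analysis:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20000

# Simulated setting: an unmeasured confounder drives both treatment and
# outcome, and a binary instrument shifts treatment only
confounder = rng.normal(size=n)
instrument = rng.binomial(1, 0.5, size=n)
treatment = (0.8 * instrument + 0.5 * confounder
             + rng.normal(size=n) > 0.5).astype(float)
outcome = 1.0 * treatment + 1.5 * confounder + rng.normal(size=n)  # true effect 1.0

def add_const(x):
    return np.column_stack([np.ones(len(x)), x])

# Stage 1: predict treatment from the instrument
Z = add_const(instrument)
stage1 = np.linalg.lstsq(Z, treatment, rcond=None)[0]
treatment_hat = Z @ stage1

# Stage 2: regress outcome on the predicted treatment; the slope is 2SLS
stage2 = np.linalg.lstsq(add_const(treatment_hat), outcome, rcond=None)[0]
effect_2sls = stage2[1]

# Naive OLS for comparison, biased upward by the unmeasured confounder
ols = np.linalg.lstsq(add_const(treatment), outcome, rcond=None)[0][1]
```

In the paper, this estimator is run separately within each CART-derived ex-post reference class, with 2SLS standard errors computed by the usual corrected formula rather than the naive stage-2 ones.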
 
Graphical summary of outcomes from the multi-stage meta-consensus exercise
Article
Background Methods for developing national recommendations vary widely. The successful adoption of new guidance into routine practice is dependent on buy-in from the clinicians delivering day-to-day patient care and must be considerate of existing resource constraints, as well as being aspirational in its scope. This initiative aimed to produce guidelines for the management of head and neck squamous cell carcinoma of unknown primary (HNSCCUP) using a novel methodology to maximise the likelihood of national adoption. Methods A voluntary steering committee oversaw 3 phases of development: 1) clarification of topic areas, data collection and assimilation, including systematic reviews and a National Audit of Practice; 2) a National Consensus Day, presenting data from the above to generate candidate consensus statements for indicative voting by attendees; and 3) a National Delphi Exercise seeking agreement on the candidate consensus statements, including representatives from all 58 UK Head and Neck Multidisciplinary Teams (MDT). Methodology was published online in advance of the Consensus Day and Delphi exercise. Results Four topic areas were identified to frame guideline development. The National Consensus Day was attended by 227 participants (54 in-person and 173 virtual). Results from 7 new systematic reviews were presented, alongside 7 expert stakeholder presentations and interim data from the National Audit and from relevant ongoing Clinical Trials. This resulted in the generation of 35 statements for indicative voting by attendees which, following steering committee ratification, led to 30 statements entering the National Delphi exercise. After 3 rounds (with a further statement added after round 1), 27 statements had reached ‘strong agreement’ ( n = 25, 2, 0 for each round, respectively), a single statement achieved ‘agreement’ only (round 3), and ‘no agreement’ could be reached for 3 statements (response rate 98% for each round). 
Subsequently, 28 statements were adopted into the National MDT Guidelines for HNSCCUP. Conclusions The described methodology demonstrated an effective multi-phase strategy for the development of national practice recommendations. It may serve as a cost-effective model for future guideline development for controversial or rare conditions where there is a paucity of available evidence or where there is significant variability in management practices across a healthcare service.
 
Article
Background Due to contradictory results in current research, whether age at menopause is increasing or decreasing in Western countries remains an open question, yet one worth studying, as later ages at menopause are likely to be related to an increased risk of breast cancer. Using data from breast cancer screening programs to study the temporal trend of age at menopause is difficult because especially the younger women in a generational cohort have often not yet reached menopause. Deleting these younger women from breast cancer risk analyses may bias the results. The aim of this study is therefore to recover missing menopause ages as a covariate by comparing methods for handling missing data. Additionally, the study contributes to understanding the evolution of age at menopause for several generations born in Portugal between 1920 and 1970. Methods Data from a breast cancer screening program in Portugal, including 278,282 women aged 45–69 and collected between 1990 and 2010, are used to compare two approaches to imputing age at menopause: (i) a multiple imputation methodology based on a truncated distribution but ignoring the mechanism of missingness; (ii) a copula-based multiple imputation method that simultaneously handles the age at menopause and the missingness mechanism. The linear predictors considered in both cases have a semiparametric additive structure accommodating linear and non-linear effects defined via splines, or Markov random field smoothers in the case of spatial variables. Results Both imputation methods unveiled an increasing trend of age at menopause when viewed as a function of birth year for the youngest generation. This trend is hidden if we model only women with an observed age at menopause. Conclusion When studying age at menopause, missing ages must be recovered with an adequate procedure for incomplete data. 
Imputing these missing ages avoids excluding the younger generational cohort of the screening program from breast cancer risk analyses and hence reduces the bias stemming from this exclusion. In addition, imputing the not-yet-observed menopause ages of the mostly younger women is crucial when studying the time trend of age at menopause; otherwise the analysis will be biased.
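The truncated-distribution idea behind approach (i) can be sketched in a few lines: a woman still premenopausal at her last observed age must have her age at menopause drawn from a distribution truncated below at that age. The Normal(51, 4) location and scale here are illustrative assumptions, not the paper's fitted values.

```python
import random

def impute_menopause_age(current_age, mu=51.0, sigma=4.0, rng=random):
    """Sample an age at menopause from a Normal(mu, sigma) truncated
    below at current_age: the woman is known to still be premenopausal."""
    while True:  # rejection sampling; adequate when truncation is mild
        draw = rng.gauss(mu, sigma)
        if draw > current_age:
            return draw

random.seed(1)
imputed = [impute_menopause_age(48.0) for _ in range(5000)]
print(min(imputed) > 48.0)  # → True: every draw respects the truncation
```

In a full multiple imputation, this draw would be repeated M times per missing value, with the analysis run on each completed dataset and results pooled; the mean of the truncated draws necessarily exceeds the untruncated mean, which is exactly why deleting premenopausal women biases the trend downward.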
 
Study selection flow chart
Variability of six domains of AGREE II applied to the most often overlapping CPGs (assessed by at least five appraisals). The vertical axis represents AGREE II domain scores (0-100), the horizontal axis represents six AGREE II Domains. Legend. Domain 1: Scope and Purpose, Domain 2: Stakeholder involvement, Domain 3: Rigour of Development, Domain 4: Clarity of presentation, Domain 5: Applicability, Domain 6: Editorial Independence. ACP: American College of Physicians; APS: American Pain Society; APTA: American Physical Therapy Association; CCGPP: Council on Chiropractic Guidelines and Practice Parameters; KCE: Belgian Health Care Knowledge Centre; NICE: National Institute for Health and Care Excellence. * NICE 2016 was assessed by nine appraisals but the domain scores were available for eight; ACP 2017 was assessed by eight appraisals but available for seven
Article
Background Systematic reviews can apply the Appraisal of Guidelines for Research & Evaluation (AGREE) II tool to critically appraise clinical practice guidelines (CPGs) for treating low back pain (LBP); however, when appraisals differ in their CPG quality ratings, stakeholders, clinicians, and policy-makers will find it difficult to discern a unique judgement of CPG quality. We wanted to determine the proportion of overlapping CPGs for LBP in appraisals that applied AGREE II. We also compared inter-rater reliability and variability across appraisals. Methods For this meta-epidemiological study we searched six databases for appraisals of CPGs for LBP. The general characteristics of the appraisals were collected; the unit of analysis was the CPG evaluated in each appraisal. The inter-rater reliability and the variability of AGREE II domain scores for the overall assessment were measured using the intraclass correlation coefficient and descriptive statistics. Results Overall, 43 CPGs out of 106 (40.6%) overlapped in 17 appraisals. About half of the appraisals (53%) reported a protocol registration. Reporting of the AGREE II assessment was heterogeneous and generally of poor quality: overall assessment 1 (overall CPG quality) was rated in 11 appraisals (64.7%) and overall assessment 2 (recommendation for use) in four (23.5%). Inter-rater reliability was substantial to perfect in 78.3% of overlapping CPGs. The domains with the most variability were Domain 6 (mean interquartile range [IQR] 38.6), Domain 5 (mean IQR 28.9), and Domain 2 (mean IQR 27.7). Conclusions More than one third of CPGs for LBP have been re-appraised in the last six years, with CPG quality confirmed in most assessments. Our findings suggest that before conducting a new appraisal, researchers should check systematic review registers for existing appraisals. Clinicians should rely on updated CPGs of high quality, confirmed by perfect agreement across multiple appraisals. 
Trial Registration Protocol Registration OSF: https://osf.io/rz7nh/
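The variability summarised above as a mean interquartile range per domain reduces to a simple computation; the sketch below uses hypothetical domain scores, not values from the appraisals, and a basic linear-interpolation quartile convention.

```python
def iqr(scores):
    """Interquartile range using linear interpolation between
    order statistics (one common quartile convention)."""
    s = sorted(scores)
    def quantile(q):
        pos = q * (len(s) - 1)
        lo = int(pos)
        frac = pos - lo
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + frac * (s[hi] - s[lo])
    return quantile(0.75) - quantile(0.25)

# Hypothetical Domain 5 scores (0-100) for one CPG rated by five appraisals
domain5 = [20, 35, 50, 60, 80]
print(iqr(domain5))  # → 25.0
```

Averaging such IQRs over all overlapping CPGs gives the per-domain "mean IQR" figures reported in the Results.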
 
Article
Objective Our study aimed to identify predictors as well as develop machine learning (ML) models to predict the risk of 30-day mortality in patients with sepsis-associated encephalopathy (SAE). Materials and methods ML models were developed and validated based on a public database named Medical Information Mart for Intensive Care (MIMIC)-IV. Models were compared by the area under the curve (AUC), accuracy, sensitivity, specificity, positive and negative predictive values, and the Hosmer–Lemeshow goodness-of-fit test. Results Of 6994 patients in MIMIC-IV included in the final cohort, a total of 1232 (17.62%) patients died following SAE. Recursive feature elimination (RFE) selected 15 variables, including acute physiology score III (APSIII), Glasgow coma score (GCS), sepsis related organ failure assessment (SOFA), Charlson comorbidity index (CCI), red blood cell volume distribution width (RDW), blood urea nitrogen (BUN), age, respiratory rate, PaO2, temperature, lactate, creatinine (CRE), malignant cancer, metastatic solid tumor, and platelet count (PLT). In the validation cohort, all ML approaches had higher discriminative ability than the bagged trees (BT) model, although the differences were not statistically significant. Furthermore, in terms of calibration performance, the artificial neural network (NNET), logistic regression (LR), and adaptive boosting (Ada) models were well calibrated (i.e., their predicted probabilities matched observed outcomes), with P-values of 0.831, 0.119, and 0.129, respectively. Conclusions The ML models, as demonstrated by our study, can be used to evaluate the prognosis of SAE patients in the intensive care unit (ICU). An online calculator could facilitate the sharing of the predictive models.
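The Hosmer–Lemeshow calibration comparison used above can be sketched as follows: sort patients by predicted risk, bin them, and compare observed with expected event counts. This computes the chi-square statistic only (the P-value additionally requires a chi-square CDF), and the toy data are hypothetical.

```python
def hosmer_lemeshow_chi2(probs, outcomes, groups=10):
    """Hosmer-Lemeshow chi-square: sort by predicted risk, split into
    `groups` bins, and compare observed with expected event counts."""
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    chi2 = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        m = len(chunk)
        expected = sum(p for p, _ in chunk)   # expected events in bin
        observed = sum(y for _, y in chunk)   # observed events in bin
        if 0 < expected < m:
            chi2 += (observed - expected) ** 2 / expected \
                  + ((m - observed) - (m - expected)) ** 2 / (m - expected)
    return chi2

# Toy data calibrated exactly: observed counts equal expected counts
probs = [0.25] * 8 + [0.75] * 8
outcomes = [1, 1, 0, 0, 0, 0, 0, 0] + [1] * 6 + [0, 0]
print(hosmer_lemeshow_chi2(probs, outcomes, groups=2))  # → 0.0
```

A small statistic (large P-value) indicates good calibration, which is the sense in which NNET, LR, and Ada performed well above.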
 
Model Performance. (c. F1 Score* and d. Balanced Accuracy*: some points in these graphs are missing due to NA values resulting from zero True Positives in the highly imbalanced datasets)
Average training time
Model performance with and without Pre-trained Word Embeddings. (c. F1 Score* and d. Balanced Accuracy*: some points in these graphs are missing due to NaN values resulting from zero True Positives in the highly imbalanced datasets)
Average training time with and without pre-trained word embeddings
Article
Background Discharge medical notes written by physicians contain important information about the health condition of patients. Many deep learning algorithms have been successfully applied to extract important information from unstructured medical notes that can lead to actionable results in the medical domain. This study aims to explore the performance of various deep learning algorithms in text classification tasks on medical notes under different disease class imbalance scenarios. Methods In this study, we employed seven artificial intelligence models: a CNN (Convolutional Neural Network), a Transformer encoder, a pretrained BERT (Bidirectional Encoder Representations from Transformers), and four typical sequence neural network models, namely, RNN (Recurrent Neural Network), GRU (Gated Recurrent Unit), LSTM (Long Short-Term Memory), and Bi-LSTM (Bi-directional Long Short-Term Memory), to classify the presence or absence of 16 disease conditions from patients’ discharge summary notes. We analyzed this task as 16 separate binary classification problems. The performance of the seven models on each of the 16 datasets, with various levels of imbalance between classes, was compared in terms of AUC-ROC (Area Under the Curve of the Receiver Operating Characteristic), AUC-PR (Area Under the Curve of Precision and Recall), F1 Score, and Balanced Accuracy, as well as training time. The models were also compared in combination with different word embedding approaches (GloVe, BioWordVec, and no pre-trained word embeddings). Results The analyses of these 16 binary classification problems showed that the Transformer encoder model performs the best in nearly all scenarios. 
In addition, when the disease prevalence is close to or greater than 50%, the Convolutional Neural Network model achieved a comparable performance to the Transformer encoder, and its training time was 17.6% shorter than the second fastest model, 91.3% shorter than the Transformer encoder, and 94.7% shorter than the pre-trained BERT-Base model. The BioWordVec embeddings slightly improved the performance of the Bi-LSTM model in most disease prevalence scenarios, while the CNN model performed better without pre-trained word embeddings. In addition, the training time was significantly reduced with the GloVe embeddings for all models. Conclusions For classification tasks on medical notes, Transformer encoders are the best choice if the computation resource is not an issue. Otherwise, when the classes are relatively balanced, CNNs are a leading candidate because of their competitive performance and computational efficiency.
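The missing points noted in the figure captions arise because F1 involves a 0/0 when a model predicts no positives on a highly imbalanced dataset. A minimal sketch of the metric arithmetic, using hypothetical confusion-matrix counts:

```python
def f1_and_balanced_accuracy(tp, fp, fn, tn):
    """Return (F1, balanced accuracy); F1 is None (NaN in many
    libraries) when precision or recall is an undefined 0/0."""
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp) if tn + fp else None
    balanced = (None if recall is None or specificity is None
                else (recall + specificity) / 2)
    return f1, balanced

print(f1_and_balanced_accuracy(8, 2, 2, 88))  # both metrics defined
print(f1_and_balanced_accuracy(0, 0, 5, 95))  # F1 undefined: no predicted positives
```

In the second case the classifier never predicts the positive class, so precision is 0/0 and libraries emit NaN, which is why those points vanish from the plots.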
 
Article
Background: Seasonality classification is a well-known and important part of time series analysis. Understanding the seasonality of a biological event can contribute to an improved understanding of its causes and help guide appropriate responses. Observational data, however, do not comprise biological events but timestamped diagnosis codes, combinations of which (along with additional requirements) are used as proxies for biological events. As different methods exist for determining the seasonality of a time series, it is necessary to know whether these methods exhibit concordance. In this study we seek to determine the concordance of these methods by applying them to time series derived from diagnosis codes in observational data residing in databases that vary in size, type, and provenance. Methods: We compared 8 methods for determining the seasonality of a time series at three levels of significance (0.01, 0.05, and 0.1) across 10 observational health databases. We evaluated 61,467 time series at each level of significance, totaling 184,401 evaluations. Results: Across all databases and levels of significance, concordance ranged from 20.2 to 40.2%. Across all databases and levels of significance, the proportion of time series classified seasonal ranged from 4.9 to 88.3%. For each database and level of significance, we computed the difference between the maximum and minimum proportion of time series classified seasonal by all methods. The median within-database difference was 54.8, 34.7, and 39.8% for p < 0.01, 0.05, and 0.1, respectively. Conclusion: Methods of binary seasonality classification, when applied to time series derived from diagnosis codes in observational health data, produce inconsistent results. The methods exhibit considerable discord within all databases, implying that the discord results from differences between the methods themselves and not from the choice of database. 
The results indicate that researchers relying on automated methods to assess the seasonality of time series derived from diagnosis codes in observational data should be aware that the methods are not interchangeable and thus the choice of method can affect the generalizability of their work. Seasonality determination is highly dependent on the method chosen.
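One of many possible binary seasonality classifiers, a lag-12 autocorrelation test against the approximate white-noise band, can be sketched as follows; other families of methods (spectral peaks, seasonal-dummy regression) can, per the results above, reach a different verdict on the same series.

```python
import math

def lag_autocorr(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return cov / var

def looks_seasonal(series, lag=12):
    """Classify as seasonal if the lag-12 autocorrelation lies outside
    the approximate 95% white-noise band of +/- 1.96 / sqrt(n)."""
    return abs(lag_autocorr(series, lag)) > 1.96 / math.sqrt(len(series))

# Ten years of monthly counts with a clear annual cycle
monthly = [10 + 5 * math.sin(2 * math.pi * t / 12) for t in range(120)]
print(looks_seasonal(monthly))  # → True
```

The significance threshold here mirrors the alpha levels varied in the study; as the results show, changing the method (or the level) can flip the binary label for borderline series.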
 
Article
Background The need to engage adults age 65 and older in clinical trials of conditions typical in older populations (e.g. hypertension, diabetes mellitus, Alzheimer’s disease and related dementias) is increasing exponentially. Older adults have been markedly underrepresented in clinical trials, a problem often exacerbated by exclusionary study criteria as well as functional dependencies that preclude participation. Such dependencies may further exacerbate communication challenges. Consequently, the evidence of what works in subject recruitment is less generalizable to older populations, even more so for those from racial and ethnic minority and low-income communities. Methods To build the capacity of research staff, we developed a virtual, three-station simulation (Group Objective Structured Clinical Experience; GOSCE) to teach research staff communication skills. This 2-h course included a discussion of challenges in recruiting older adults; skills practice with Standardized Participants (SPs) and a faculty observer who provided immediate feedback; and a debrief to highlight best practices. Each learner had opportunities for active learning and observational learning. Learners completed a retrospective pre-post survey about the experience. SPs completed an 11-item communication checklist evaluating the learner on a series of established behaviorally anchored communication skills (29). Results In the research staff survey, 92% reported the overall activity taught them something new; 98% reported it provided valuable feedback; 100% said they would like to participate again. In the SP evaluation there was significant variation: the percentage of items rated well-done ranged from 25–85% by case. Conclusions Results from this pilot suggest that GOSCEs are (1) an acceptable, (2) low-cost, and (3) differentiating mechanism for training and assessing research staff in the communication skills and structural competency necessary for participant recruitment.
 
Multi-state model of colorectal cancer. (A) Natural history process, (B) Observed transition pathways
Schematic representation of 3 possible observation processes leading to right-censoring (A) and interval-censoring (B and C). From top to bottom, all individuals are AF at time zero prior to the start of surveillance and (A) remain AF until the end of their follow-up vm, (B) are detected to be AA, or (C) are detected to be CRC
Flow chart of inclusion and exclusion criteria from the adenoma cohort [34, 35]. Never colonoscopy: a single entry with non-colonoscopic polypectomy at baseline, or no colonoscopic examination at any visit including baseline. No findings at baseline colonoscopy and no findings later: a single (i.e., baseline) entry with no findings, or all entries with no findings. CRC at baseline colonoscopy: a single (i.e., baseline) entry as CRC
Comparison between survival curves from NPMLE estimate and Weibull model for the first transition time to AA
Estimated cumulative incidence curves. (A) Cumulative distribution function (CDF) for patients treated with AA (red solid line) and NAA (blue dashed lines) since baseline. (B) CDF of CRC (black solid line) since AA onset, with 1000 bootstrapped CDF curves (grey lines)
Article
Background To optimize colorectal cancer (CRC) screening and surveillance, information regarding the time-dependent risk of advanced adenomas (AA) developing into CRC is crucial. However, since AA are removed after diagnosis, the time from AA to CRC cannot be observed in an ethically acceptable manner. We propose a statistical method to indirectly infer this time in a progressive three-state disease model using surveillance data. Methods Sixteen models were specified, with and without covariates. Parameters of the parametric time-to-event distributions from the adenoma-free state (AF) to AA and from AA to CRC were estimated simultaneously by maximizing the likelihood function. Model performance was assessed via simulation. The methodology was applied to a random sample of 878 individuals from a Norwegian adenoma cohort. Results Estimates of the parameters of the time distributions are consistent and the 95% confidence intervals (CIs) have good coverage. For the Norwegian sample (AF: 78%, AA: 20%, CRC: 2%), a Weibull model for both transition times was selected as the final model based on information criteria. The mean time to CRC among those who made the transition within 50 years of AA onset was estimated to be 4.80 years (95% CI: 0; 7.61). The 5-year and 10-year cumulative incidence of CRC from AA was 13.8% (95% CI: 7.8%; 23.8%) and 15.4% (95% CI: 8.2%; 34.0%), respectively. Conclusions The time-dependent risk from AA to CRC is crucial to explain differences in the outcomes of microsimulation models used for the optimization of CRC prevention. Our method allows for improving such models through the inclusion of data-driven time distributions.
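The kind of quantity reported above, a cumulative incidence of CRC within a horizon after AA onset, can be read off a fitted Weibull transition time by simulation. The shape and scale below are illustrative assumptions, not the fitted Norwegian estimates.

```python
import random

def crc_cum_incidence(shape, scale, horizon, n=100_000, seed=7):
    """Monte Carlo cumulative incidence of CRC within `horizon` years of
    AA onset, assuming a Weibull-distributed time from AA to CRC."""
    rng = random.Random(seed)
    # random.weibullvariate takes (scale, shape) in that order
    events = sum(rng.weibullvariate(scale, shape) <= horizon
                 for _ in range(n))
    return events / n

# Illustrative Weibull parameters only
print(round(crc_cum_incidence(shape=0.8, scale=40.0, horizon=5.0), 2))
```

The same value is available in closed form as 1 - exp(-(horizon/scale)^shape); the simulation route generalises directly to the two-transition model, where the AF-to-AA time is drawn first and the AA-to-CRC time added on top.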
 
Example realist study cycle. Adapted from Pawson et al. [17], Marchal et al. [31], Mukumbang et al. [32], and Sarkies et al. [33]
Realist concept of the ontological depth of reality stratification. Source: Adapted from Jagosh [39]
Initial Program Theory 2 for the clinical champion implementation strategy
Initial Program Theory 1 for the audit and feedback implementation strategy
Article
Implementation science in healthcare aims to understand how to get evidence into practice. Once this is achieved in one setting, it becomes increasingly difficult to replicate elsewhere. The problem is often attributed to differences in context that influence how and whether implementation strategies work. We argue that realist research paradigms provide a useful framework to express the effect of contextual factors within implementation strategy causal processes. Realist studies are theory-driven evaluations that focus on understanding how and why interventions work under different circumstances. They consider the interaction between contextual circumstances, theoretical mechanisms of change and the outcomes they produce, to arrive at explanations of conditional causality (i.e., what tends to work, for whom, under what circumstances). This Commentary provides example applications using preliminary findings from a large realist implementation study of system-wide value-based healthcare initiatives in New South Wales, Australia. If applied judiciously, realist implementation studies may represent a sound approach to help optimise delivery of the right care in the right setting and at the right time.
 
Article
Background Accurate measurement of trajectories in longitudinal studies, considered the gold standard method for tracking functional growth during adolescence, decline in aging, and change after head injury, is subject to confounding by testing experience. Methods We measured change in cognitive and motor abilities over four test sessions (baseline and three annual assessments) in 154 male and 165 female participants (baseline age 12–21 years) from the National Consortium on Alcohol and NeuroDevelopment in Adolescence (NCANDA) study. At each of the four test sessions, these participants were given a test battery using computerized administration and traditional pencil and paper tests that yielded accuracy and speed measures for multiple component cognitive (Abstraction, Attention, Emotion, Episodic memory, Working memory, and General Ability) and motor (Ataxia and Speed) functions. The analysis aim was to dissociate neurodevelopment from testing experience using an adaptation of the twice-minus-once-tested method, which calculated the difference between longitudinal change (comprising developmental plus practice effects) and practice-free initial cross-sectional performance for each consecutive pair of test sessions. Accordingly, the first set of analyses quantified the effects of learning (i.e., prior test experience) on accuracy and speed domain scores. Developmental effects were then determined for each accuracy and speed domain after removing the measured learning effects. Results The greatest gains in performance occurred between the first and second sessions, especially in younger participants, regardless of sex, but practice gains continued to accrue thereafter for several functions. For all 8 accuracy composite scores, the developmental effect after accounting for learning was significant across age and was adequately described by linear fits. 
The learning-adjusted developmental effects for speed were adequately described by linear fits for Abstraction, Emotion, Episodic Memory, General Ability, and Motor scores, although a nonlinear fit was better for Attention, Working Memory, and Average Speed scores. Conclusion Thus, what appeared as accelerated cognitive and motor development was, in most cases, attributable to learning. Recognition of the substantial influence of prior testing experience is critical for accurate characterization of normal development and for developing norms for clinical neuropsychological investigations of conditions affecting the brain.
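At its core, the twice-minus-once-tested adjustment is a subtraction: longitudinal change between two sessions mixes development with practice, while the cross-sectional difference between once-tested groups of the matching ages carries development only. The numbers below are hypothetical.

```python
def practice_effect(longitudinal_change, cross_sectional_difference):
    """Twice-minus-once-tested logic: subtracting the practice-free
    cross-sectional age difference from the within-person longitudinal
    change isolates the practice (learning) effect."""
    return longitudinal_change - cross_sectional_difference

# Hypothetical accuracy scores between sessions 1 and 2
change_retested = 6.0   # same participants tested twice
age_matched_gap = 2.5   # once-tested participants of the matching ages
print(practice_effect(change_retested, age_matched_gap))  # → 3.5
```

In this toy case more than half of the apparent gain is practice, mirroring the paper's conclusion that much of the seemingly accelerated development was attributable to learning.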
 
Article
Background A lack of data and statistical code published alongside journal articles is a significant barrier to open scientific discourse and to the reproducibility of research. Information governance restrictions inhibit the active dissemination of individual-level data to accompany published manuscripts. Realistic, high-fidelity time-to-event synthetic data can help accelerate methodological developments in survival analysis and beyond by enabling researchers to access and test published methods using data similar to those the methods were developed on. Methods We present methods to accurately emulate the covariate patterns and survival times found in real-world datasets using synthetic data techniques, without compromising patient privacy. We model the joint covariate distribution of the original data using covariate-specific sequential conditional regression models, then fit a complex flexible parametric survival model from which to generate survival times conditional on individual covariate patterns. We recreate the administrative censoring mechanism using the last observed follow-up date information from the initial dataset. Metrics for evaluating the accuracy of the synthetic data, and the non-identifiability of individuals from the original dataset, are presented. Results We successfully created a synthetic version of an example colon cancer dataset of 9064 patients that shows good similarity to both the covariate distributions and survival times of the original data without containing any exact information from it, therefore allowing it to be published openly alongside research. Conclusions We evaluate the effectiveness of the methods for constructing synthetic data, and provide evidence that there is minimal risk that a given patient from the original data could be identified from their individual unique patient information. 
Synthetic datasets using this methodology could be made available alongside published research without breaching data privacy protocols, and allow for data and code to be made available alongside methodological or applied manuscripts to greatly improve the transparency and accessibility of medical research.
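The generation step, drawing survival times conditional on covariates and then applying administrative censoring, can be sketched with a simple Weibull proportional-hazards model via inverse-transform sampling. The parameters and the single age covariate are illustrative assumptions; the paper's flexible parametric model is considerably richer.

```python
import math
import random

def synthetic_weibull_time(shape, scale, linpred, rng):
    """Inverse-transform draw from a Weibull proportional-hazards model:
    S(t) = exp(-(t / scale) ** shape * exp(linpred)); solve S(t) = U."""
    u = 1.0 - rng.random()  # in (0, 1], avoids log(0)
    return scale * (-math.log(u) / math.exp(linpred)) ** (1.0 / shape)

def synthesize(ages, shape, scale, beta_age, censor_time, seed=42):
    """Generate (time, event) pairs conditional on each patient's age,
    with administrative censoring at censor_time."""
    rng = random.Random(seed)
    out = []
    for age in ages:
        t = synthetic_weibull_time(shape, scale, beta_age * (age - 60), rng)
        out.append((min(t, censor_time), t <= censor_time))
    return out

# Illustrative parameters only, not those of the colon cancer example
data = synthesize([55, 62, 70, 80], shape=1.2, scale=8.0,
                  beta_age=0.03, censor_time=10.0)
print(all(t <= 10.0 for t, _ in data))  # → True
```

Solving S(t) = U for t guarantees the synthetic times follow the fitted survival distribution exactly, and the censoring step reproduces the administrative cut-off without copying any individual's real follow-up.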
 
Study and feedback timeline over one year
A sample TLS for sensemaking and engagement, two months after implementation
Article
Background Normalization process theory (NPT) has been widely used to better understand how new interventions are implemented and embedded. The NoMAD (Normalization Measurement Development questionnaire) is a 23-item instrument based on NPT. As the NoMAD is a relatively new instrument, the objectives of this paper are: to describe the experience of implementing the NoMAD, to describe its use as a feedback mechanism to gain insight into the normalization process of a complex health intervention, and to further explore the psychometric properties of the instrument. Methods Health TAPESTRY was implemented in six Family Health Teams (a total of seven sites) across Ontario. Healthcare team members at each site were invited to complete the NoMAD, plus three general questions about normalization, six times over a 12-month period. Each site was then provided a visual traffic light summary (TLS) reflecting the implementation of Health TAPESTRY. The internal consistency of each subscale and the validity of the NoMAD were assessed. Learnings from the implementation of the NoMAD and the subsequent feedback mechanism (TLS) are reported descriptively. Results In total, 56 diverse health care team members from six implementation sites completed the NoMAD. Each used it at least once during the 12-month study period. The implementation of the NoMAD and TLS was time-consuming, with multiple collection (and feedback) points. Most (60%) internal consistency values of the four subscales (pooled across sites) across the collection points were satisfactory. All correlations among NoMAD subscales were positive, and most (86%) were statistically significant. All but one correlation between the NoMAD subscales and the general questions were positive, and most (72%) were significant. Generally, scores on the subscales were higher at 12 months than at baseline, although they did not follow a linear pattern of change across implementation. 
Generally, scores were higher for experienced sites than for first-time implementers. Conclusion Our experience suggests fewer collection points: three timepoints spaced several months apart are adequate if repeated administration of the NoMAD is used for feedback loops. We provide additional evidence of the psychometric properties of the NoMAD. Trial Registration Registered at ClinicalTrials.gov: NCT03397836 .
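Internal consistency of subscales like the NoMAD's is commonly summarised with Cronbach's alpha; a minimal sketch of that computation, using hypothetical item scores rather than study data:

```python
def cronbach_alpha(items):
    """Cronbach's alpha for item-score columns (one list per item,
    respondents aligned by index); uses sample variances (n - 1)."""
    k = len(items)
    n = len(items[0])
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    totals = [sum(col[i] for col in items) for i in range(n)]
    return k / (k - 1) * (1 - sum(var(col) for col in items) / var(totals))

# Hypothetical 4-item subscale answered by five respondents
items = [[4, 5, 3, 4, 5],
         [4, 4, 3, 5, 5],
         [3, 5, 2, 4, 5],
         [4, 5, 3, 5, 4]]
print(round(cronbach_alpha(items), 2))  # → 0.9
```

Values around 0.7 or above are conventionally read as satisfactory, the threshold sense in which "most (60%) internal consistency values ... were satisfactory" above.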
 
Progress towards the 95-95-95 targets among participants (2014–2015). First 95: percentage of PLHIV who are aware of their HIV status; Second 95: percentage of those who knew their status who were on ART; Third 95: percentage of those on ART with low viral load; Composite low viral load: percentage of all PLHIV with low viral load
Multiple correspondence analysis plot showing associations between variables and patterns in the (A) 2014 and (B) 2015 surveys. Factors contributing to high viral load (≥ 400 copies/mL) in the 2014 survey (A) were: HIVstatusknew, ARV, ARVdose, perceivedRiskH, NhivLifeTest, HIVtest, HIVPreg, TBDgsd, TBTstd, Sex12mCA, condom12MCA, TBExp, sexpartner12mCA, currentnopartner, and lsp2CA; in the 2015 survey (B) they were: HIVstatusknew, ARV, TBExp, ARVDose, PerceivedRiskH, NhivLifetest, and HIVtest. The distance of a variable from the centre point denotes its contribution. See Table S3 in Additional file 1 for full descriptions of all variables and categories
Multiple correspondence analysis plot showing the contribution of variable categories to high viral load (≥ 400 copies/mL) in the (A) 2014 and (B) 2015 surveys. Variable categories in red contributed most, followed by those in orange; these are: 2014 survey (A): ARV_N, Hstat_NR, NHT_Nv, HVT_N, CNP_Non, SEX12_NR, CDM12_NR, SP12M_NR, LSP_NR; 2015 survey (B): ARV_N, HIVstat_negative, HIVsta_NR, TBEx_NR, HVT_N, NHT_Nv. See Table S3 in Additional file 1 for full descriptions of these categories
Multiple correspondence analysis factor map of individuals and variable categories with 95% confidence ellipses in the 2014 and 2015 surveys. The factor maps further show the level of correlation and interaction between variables, categories, and individual HIV-positive men and women. In the 2014 survey, factors such as perceived risk of contracting HIV, ever tested for HIV, on medication to prevent HIV, no money, had STI symptoms, and ever tested for TB show no convergence of confidence ellipses; in the 2015 survey, variables such as education level, gender, ever diagnosed with TB, on medication to prevent HIV, number of lifetime sex partners, meal cut, on medication to prevent TB, had STI symptoms, and number of sex partners in the last 12 months show no convergence, implying these factors contribute to high viral load in this population. More clustering is observed in 2014 than in 2015
Random Forest plot showing high-importance predictors of high viral load (2014 and 2015 surveys). High-importance predictors in 2014 were: ARV dosage, CD4 cells per μL, perceived risk of contracting HIV, ARV, knowledge of HIV status, alcohol, ever diagnosed with TB, ever tested for TB, on TB medication, total number of sex partners in the last 12 months, gender, total number of lifetime sex partners, place of residence, length of stay in the community, and education status. In 2015, high-importance predictors were: ARV dosage, CD4 cells per μL, exposed to TB in the last 12 months, ever diagnosed with TB, on TB medication, knowledge of HIV status, ARV, meal cut, no money, gender, perceived risk of contracting HIV, length of stay in the community, total number of lifetime sex partners, and education status
Article
Background Sustainable Human Immunodeficiency Virus (HIV) virological suppression is crucial to achieving the Joint United Nations Programme on HIV/AIDS (UNAIDS) 95-95-95 treatment targets to reduce the risk of onward HIV transmission. Exploratory data analysis is an integral part of statistical analysis which aids variable selection from complex survey data for further confirmatory analysis. Methods In this study, we explored participants’ epidemiological and biological factors associated with high HIV RNA viral load (HHVL) using data from an HIV Incidence Provincial Surveillance System (HIPSS) sequential cross-sectional survey conducted between 2014 and 2015 in KwaZulu-Natal, South Africa. Using multiple correspondence analysis (MCA) and random forest analysis (RFA), we analyzed the linkage between socio-demographic, behavioral, psycho-social, and biological factors associated with HHVL, defined as ≥ 400 copies/mL. Results Out of 3956 participants in 2014 and 3868 in 2015, 50.1% and 41%, respectively, had HHVL. MCA and RFA revealed that knowledge of HIV status, ART use, ARV dosage, current CD4 cell count, perceived risk of contracting HIV, number of lifetime HIV tests, number of lifetime sex partners, and ever diagnosed with TB were consistently identified as potential factors associated with high HIV viral load in the 2014 and 2015 surveys. Based on MCA findings, the categories of variables identified with HHVL were: did not know HIV status, not on ART, on multiple dosages of ARV, low perceived risk of contracting HIV, and having two or more lifetime sexual partners. Conclusion The high proportion of individuals with HHVL suggests that the UNAIDS 95-95-95 goal of HIV viral suppression is less likely to be achieved. 
Based on performance and visualization, MCA was selected as the preferred exploration tool for identifying and understanding significant associations and interactions among categorical variables, enhancing the epidemiological understanding of high HIV viral load at the individual level. When faced with complex survey data and challenges of variable selection in research, exploratory data analysis with robust graphical visualization and the ability to reveal diverse structures should be considered.
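Because the three 95-95-95 targets are conditional on one another, the overall suppression they imply is their product, which is why even full attainment leaves a sizeable unsuppressed fraction. A one-line sketch:

```python
def composite_suppression(aware, on_art_given_aware, suppressed_given_art):
    """Overall fraction of PLHIV with a suppressed viral load implied by
    the three conditional steps of the treatment cascade."""
    return aware * on_art_given_aware * suppressed_given_art

# Meeting all three 95% targets implies roughly 85.7% overall suppression
print(round(composite_suppression(0.95, 0.95, 0.95), 3))  # → 0.857
```

Against that 85.7% benchmark, the 50.1% (2014) and 41% (2015) of participants with high viral load reported above make clear how far this population was from the composite target.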
 
Article
Background The concept of standard of care (SoC) treatment is commonly utilized in clinical trials. However, in the setting of an emergent disease such as COVID-19, where there is no established effective treatment, it is unclear what investigators considered as the SoC in early clinical trials. The aim of this study was to analyze and classify the SoC reported in randomized controlled trial (RCT) registrations and in RCTs published in scholarly journals and on preprint servers concerning treatment interventions for COVID-19. Methods We conducted a cross-sectional study. We included RCTs registered in a trial registry, and/or published in a scholarly journal, and/or published on the preprint servers medRxiv and bioRxiv (any phase; any recruitment status; any language) that aimed to compare treatment interventions for COVID-19 against SoC, available from January 1, 2020, to October 8, 2020. Studies using "standard" treatment were eligible for inclusion if they reported using standard, usual, conventional, or routine treatment. When we found multiple reports of the same RCT, we treated those sources as one unit of analysis. Results Among 737 unique trials included in the analysis, 152 (21%) reported that the SoC was proposed by an institutional or national authority. There were 129 (18%) trials that reported component(s) of the SoC; the remaining trials simply reported that they used SoC, with no further detail. Among those 129 trials, the number of SoC components ranged from 1 to 10. The most commonly used groups of interventions in the SoC were antiparasitics (62% of the trials), antivirals (57%), antibiotics (31%), oxygen (17%), antithrombotics/anticoagulants (14%), vitamins (13%), immunomodulatory agents (13%), corticosteroids (12%), and analgesics/antipyretics (12%). Various combinations of those interventions were used in the SoC, with up to 7 different types of interventions combined. 
Posology, timing, and method of administration were frequently not reported for SoC components. Conclusion Most RCTs (82%) on treatment for COVID-19 that were registered or published in the first 9 months of the pandemic did not describe the "standard of care" they used. Many of those interventions have, by now, been shown to be ineffective or even detrimental.
 
Top-cited authors
Benjamin Djulbegovic
  • City of Hope National Medical Center
Iztok Hozo
  • Indiana University Northwest
Stela Pudar-Hozo
  • Indiana University Northwest
Sabi Redwood
  • University of Bristol
Nicola Gale
  • University of Birmingham