As part of a collaboration with ResearchGate, Springer Nature has made articles from this journal available on ResearchGate.

BMC Medical Research Methodology (BMC MED RES METHODOL)

Publisher: BioMed Central

Journal description

BMC Medical Research Methodology publishes original research articles in the design, performance, analysis, reporting, and interpretation of all types of health care research.

Additional details

Cited half-life: 6.20
Immediacy index: 0.45
Eigenfactor: 0.01
Article influence: 1.32
Website: http://www.biomedcentral.com/1471-2288/
Website description: BMC Medical Research Methodology website
Other titles: BioMed Central research methodology, Medical research methodology
OCLC: 46614457
Material type: Document, Periodical, Internet resource
Document type: Internet Resource, Computer File, Journal / Magazine / Newspaper

Publications in this journal

Background A prognostic model should not enter clinical practice unless it has been demonstrated that it performs a useful role. External validation denotes evaluation of model performance in a sample independent of that used to develop the model. Unlike for logistic regression models, external validation of Cox models is sparsely treated in the literature. Successful validation of a model means achieving satisfactory discrimination and calibration (prediction accuracy) in the validation sample. Validating Cox models is not straightforward because event probabilities are estimated relative to an unspecified baseline function. Methods We describe statistical approaches to external validation of a published Cox model according to the level of published information, specifically (1) the prognostic index only, (2) the prognostic index together with Kaplan-Meier curves for risk groups, and (3) the first two plus the baseline survival curve (the estimated survival function at the mean prognostic index across the sample). The most challenging task, requiring level 3 information, is assessing calibration, for which we suggest a method of approximating the baseline survival function. Results We apply the methods to two comparable datasets in primary breast cancer, treating one as derivation and the other as validation sample. Results are presented for discrimination and calibration. We demonstrate plots of survival probabilities that can assist model evaluation. Conclusions Our validation methods are applicable to a wide range of prognostic studies and provide researchers with a toolkit for external validation of a published Cox model.
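Discrimination of a prognostic index in a validation sample is commonly summarized with Harrell's concordance index (C-index). The sketch below is a minimal pure-Python illustration of that statistic on toy data, not the authors' implementation; the toy times, event flags and index values are made up for the example.

```python
from itertools import combinations

def harrell_c(times, events, prognostic_index):
    """Harrell's concordance index for a Cox-type prognostic index.

    A pair is comparable if the subject with the shorter observed time
    had an event; it is concordant when that subject also has the
    higher prognostic index (higher predicted risk)."""
    concordant = tied = comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # order the pair so subject a has the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or not events[a]:
            continue  # tied-time or censored-first pairs are not comparable
        comparable += 1
        if prognostic_index[a] > prognostic_index[b]:
            concordant += 1
        elif prognostic_index[a] == prognostic_index[b]:
            tied += 1
    return (concordant + 0.5 * tied) / comparable

# toy validation sample: higher index fails earlier throughout,
# so every comparable pair is concordant and C = 1.0
times = [2.0, 5.0, 1.0, 8.0, 3.0]
events = [1, 1, 1, 0, 1]
pi = [1.1, 0.4, 2.0, 0.1, 0.9]
c = harrell_c(times, events, pi)
```

Values near 0.5 indicate no discrimination; the paper's other tasks (calibration against an approximated baseline survival function) require the level 3 information described above.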
In order to reduce systematic errors (such as language bias) and increase the precision of the summary treatment effect estimate, comprehensive identification of randomised controlled trials (RCTs), irrespective of publication language, is crucial in systematic reviews and meta-analyses. We identified trials in the German general health care literature. Eight German-language general health care journals were searched for randomised controlled trials and analysed with respect to the number of RCTs published each year and the size of the trials. A total of 1618 trials were identified, with a median of 43 patients per trial. Between 1970 and 2004, a small but steady rise in sample size, from a median of 30 to 60 patients per trial, can be observed. The number of published trials was very low between 1948 and 1970, but increased between 1970 and 1986 to a maximum of 11.2 RCTs per journal and year. In the following period, a striking decline in the number of RCTs was observed: between 1999 and 2001 only 0.8 RCTs per journal and year were published; in the next three years, the number of published trials increased to 1.7 RCTs per journal and year. German-language general health care journals no longer have a role in the dissemination of trial results. The slight rise in the number of published RCTs in the last three years can be explained by a change in publication language from German to English in three of the analysed journals.
Background Seasonal variation in the occurrence of cardiovascular diseases has been recognized for decades. In particular, incidence rates of hospitalization with atrial fibrillation (AF) and stroke have been shown to exhibit a seasonal variation. Stroke in AF patients is common and often severe. Obtaining a description of a possible seasonal variation in the occurrence of stroke in AF patients is crucial in clarifying risk factors for developing stroke and initiating prophylactic treatment. Methods Using a dynamic generalized linear model, we were able to model gradually changing seasonal variation in hospitalization rates of stroke in AF patients from 1977 to 2011. The study population consisted of all Danes registered with a diagnosis of AF, comprising 270,017 subjects. During follow-up, 39,632 subjects were hospitalized with stroke. Incidence rates of stroke in AF patients were analyzed assuming the seasonal variation to be a sum of two sinusoids plus a local linear trend. Results The results showed that the peak-to-trough ratio decreased from 1.25 to 1.16 during the study period, and that the times of year for peak and trough changed slightly. Conclusion The present study indicates that using dynamic generalized linear models provides a flexible modeling approach for studying changes in seasonal variation of stroke in AF patients and yields plausible results.
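With the seasonal component modelled on the log-rate scale as a sum of two sinusoids (annual and semi-annual harmonics), the peak-to-trough ratio quoted above is exp(max minus min) of that component over one year. A sketch under that assumption follows; the coefficient values are hypothetical, and this grid search is only an illustration, not the paper's dynamic-model estimation.

```python
import math

def peak_to_trough(a1, b1, a2, b2, grid=10000):
    """Peak-to-trough rate ratio implied by a log-linear seasonal
    component built from two sinusoids (t measured in years)."""
    def s(t):
        w = 2 * math.pi * t
        return (a1 * math.cos(w) + b1 * math.sin(w)
                + a2 * math.cos(2 * w) + b2 * math.sin(2 * w))
    vals = [s(i / grid) for i in range(grid)]
    return math.exp(max(vals) - min(vals))

# single harmonic of amplitude 0.1 on the log scale:
# ratio = exp(2 * 0.1) ~ 1.22, in the range reported above
ratio = peak_to_trough(0.1, 0.0, 0.0, 0.0)
```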
It was still unclear whether the methodological reporting quality of randomized controlled trials (RCTs) in major hepato-gastroenterology journals improved after the Consolidated Standards of Reporting Trials (CONSORT) Statement was revised in 2001. RCTs published in 1998 or 2008 in five major hepato-gastroenterology journals were retrieved from MEDLINE using a high-sensitivity search method, and the reporting quality of their methodological details was evaluated based on the CONSORT Statement and the Cochrane Handbook for Systematic Reviews of Interventions. Changes in methodological reporting quality between 2008 and 1998 were quantified as risk ratios with 95% confidence intervals. A total of 107 RCTs published in 2008 and 99 RCTs published in 1998 were found. Compared with 1998, the proportions of RCTs reporting sequence generation (RR, 5.70; 95% CI 3.11-10.42), allocation concealment (RR, 4.08; 95% CI 2.25-7.39), sample size calculation (RR, 3.83; 95% CI 2.10-6.98), handling of incomplete outcome data (RR, 1.81; 95% CI 1.03-3.17) and intention-to-treat analysis (RR, 3.04; 95% CI 1.72-5.39) increased in 2008. Blinding and intention-to-treat analysis were reported better in multi-center trials than in single-center trials. Allocation concealment and blinding were reported better in industry-sponsored trials than in publicly funded trials. Compared with historical studies, methodological reporting quality improved with time. Although the reporting of several important methodological aspects improved in 2008 compared with 1998, which may indicate that researchers' awareness of and compliance with the revised CONSORT Statement had increased, some items were still reported poorly. There is much room for future improvement.
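A risk ratio comparing two reporting proportions, with a Wald confidence interval on the log scale, can be computed as below. This is a generic sketch using the standard Katz log method; the counts in the example are made up for illustration, not the study's actual 2x2 data.

```python
import math

def risk_ratio_ci(a, n1, b, n2, z=1.96):
    """Risk ratio of two proportions (a/n1 vs b/n2) with a Wald 95% CI
    on the log scale: SE(log RR) = sqrt(1/a - 1/n1 + 1/b - 1/n2)."""
    rr = (a / n1) / (b / n2)
    se = math.sqrt(1 / a - 1 / n1 + 1 / b - 1 / n2)
    lo = math.exp(math.log(rr) - z * se)
    hi = math.exp(math.log(rr) + z * se)
    return rr, lo, hi

# hypothetical: 60/107 trials reporting an item in 2008 vs 10/99 in 1998
rr, lo, hi = risk_ratio_ci(60, 107, 10, 99)
```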
The Australian and New Zealand Intensive Care Society (ANZICS) Adult Patient Database (APD) collects voluntary data on patient admissions to Australian and New Zealand intensive care units (ICUs). This paper presents an in-depth statistical analysis of risk-adjusted mortality of ICU admissions from 2000 to 2010 for the purpose of identifying ICUs with unusual performance. A cohort of 523,462 patients from 144 ICUs was analysed. For each ICU, the natural logarithm of the standardised mortality ratio (log-SMR) was estimated from a risk-adjusted, three-level hierarchical model. This is the first time a three-level model has been fitted to such a large ICU database anywhere. The analysis was conducted in three stages, which included the estimation of a null distribution to describe usual ICU performance. Log-SMRs with appropriate estimates of standard errors are presented in a funnel plot using 5% false discovery rate thresholds. False coverage-statement rate confidence intervals are also presented. The observed numbers of deaths for ICUs identified as unusual are compared to the predicted true worst numbers of deaths under the model for usual ICU performance. Seven ICUs were identified as performing unusually over the period 2000 to 2010, in particular, demonstrating high risk-adjusted mortality compared to the majority of ICUs. Four of the seven were ICUs in private hospitals. Our three-stage approach to the analysis detected outlying ICUs which were not identified in a conventional (single) risk-adjusted model for mortality using SMRs to compare ICUs. We also observed a significant linear decline in mortality over the decade. Distinct yearly and weekly respiratory seasonal effects were observed across regions of Australia and New Zealand for the first time. The statistical approach proposed in this paper is intended to be used for the review of observed ICU and hospital mortality.
Two important messages from our study are, firstly, that comprehensive risk-adjustment is essential in modelling patient mortality for comparing performance and, secondly, that the appropriate statistical analysis is complicated.
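The paper's estimates come from a three-level hierarchical model, which is not reproduced here. The sketch below shows only the naive ingredient of a funnel plot: the log-SMR (log of observed over expected deaths) with its usual Poisson-approximation standard error, SE(log SMR) being roughly 1/sqrt(O). The counts are hypothetical.

```python
import math

def log_smr(observed, expected):
    """Log standardised mortality ratio with a Poisson-approximation
    standard error (SE ~ 1/sqrt(observed)); this is the point and
    precision that would be placed on a funnel plot for one unit."""
    smr = observed / expected
    return math.log(smr), 1.0 / math.sqrt(observed)

# a hypothetical unit with 130 observed deaths against 100 expected
ls, se = log_smr(130, 100)
```

Units fall outside the funnel when their log-SMR exceeds the control limits implied by their standard error; the paper tightens this with 5% false discovery rate thresholds rather than plain z-limits.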
Background For the analysis of length-of-stay (LOS) data, which is characteristically right-skewed, a number of statistical estimators have been proposed as alternatives to the traditional ordinary least squares (OLS) regression with log dependent variable. Methods Using a cohort of patients identified in the Australian and New Zealand Intensive Care Society Adult Patient Database, 2008–2009, 12 different methods were used for estimation of intensive care (ICU) length of stay. These encompassed risk-adjusted regression analysis of firstly: log LOS using OLS, linear mixed model [LMM], treatment effects, skew-normal and skew-t models; and secondly: unmodified (raw) LOS via OLS, generalised linear models [GLMs] with log-link and 4 different distributions [Poisson, gamma, negative binomial and inverse-Gaussian], extended estimating equations [EEE] and a finite mixture model including a gamma distribution. A fixed covariate list and ICU-site clustering with robust variance were utilised for model fitting with split-sample determination (80%) and validation (20%) data sets, and model simulation was undertaken to establish over-fitting (Copas test). Indices of model specification using Bayesian information criterion [BIC: lower values preferred] and residual analysis as well as predictive performance (R2, concordance correlation coefficient (CCC), mean absolute error [MAE]) were established for each estimator. Results The data-set consisted of 111663 patients from 131 ICUs; with mean(SD) age 60.6(18.8) years, 43.0% were female, 40.7% were mechanically ventilated and ICU mortality was 7.8%. ICU length-of-stay was 3.4(5.1) (median 1.8, range (0.17-60)) days and demonstrated marked kurtosis and right skew (29.4 and 4.4 respectively). BIC showed considerable spread, from a maximum of 509801 (OLS-raw scale) to a minimum of 210286 (LMM). R2 ranged from 0.22 (LMM) to 0.17 and the CCC from 0.334 (LMM) to 0.149, with MAE 2.2-2.4. 
Superior residual behaviour was established for the log-scale estimators. There was a general tendency for over-prediction (negative residuals) and for over-fitting, the exception being the GLM negative binomial estimator. The mean-variance function was best approximated by a quadratic function, consistent with log-scale estimation; the link function was estimated (EEE) as 0.152(0.019, 0.285), consistent with a fractional-root function. Conclusions For ICU length of stay, log-scale estimation, in particular the LMM, appeared to be the most consistently performing estimator(s). Neither the GLM variants nor the skew-regression estimators dominated.
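The predictive-performance indices quoted above (R^2, Lin's concordance correlation coefficient and mean absolute error) follow directly from their textbook moment definitions. A minimal pure-Python sketch on hypothetical observed/predicted values, not the study's fitting code:

```python
from statistics import mean

def fit_metrics(y, yhat):
    """R^2, Lin's concordance correlation coefficient (CCC) and mean
    absolute error (MAE) for observed y and predictions yhat."""
    n = len(y)
    my, mp = mean(y), mean(yhat)
    sy = sum((a - my) ** 2 for a in y) / n
    sp = sum((b - mp) ** 2 for b in yhat) / n
    syp = sum((a - my) * (b - mp) for a, b in zip(y, yhat)) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - my) ** 2 for a in y)
    r2 = 1 - ss_res / ss_tot
    ccc = 2 * syp / (sy + sp + (my - mp) ** 2)  # penalises bias as well as scatter
    mae = mean(abs(a - b) for a, b in zip(y, yhat))
    return r2, ccc, mae

# toy length-of-stay values (days) and near-perfect predictions
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.1, 1.9, 3.2, 3.8]
r2, ccc, mae = fit_metrics(obs, pred)
```

Unlike Pearson's r, the CCC is reduced by any systematic shift between predictions and observations, which is why it is reported alongside R^2 above.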
Little is known about the impact of data collection method on self-reported cancer screening behaviours, particularly among hard-to-reach populations. The purpose of this study is to examine the effects of data collection mode on response to indicators of cancer screenings by unmarried middle-aged and older women. Three survey methods were evaluated for collecting data about mammography and Papanicolaou (hereafter, Pap) testing among heterosexual and sexual minority (e.g., lesbian and bisexual) women. Women ages 40-75 were recruited from June 2003 - June 2005 in Rhode Island. They were randomly assigned to receive: Self-Administered Mailed Questionnaire [SAMQ; N = 202], Computer-Assisted Telephone Interview [CATI; N = 200], or Computer-Assisted Self-Interview [CASI; N = 197]. Logistic regression models were computed to assess survey mode differences for 13 self-reported items related to cancer screenings, adjusting for age, education, income, race, marital status, partner gender, and recruitment source. Compared to women assigned to CATI, women assigned to SAMQ were less likely to report two or more years between most recent mammograms (CATI = 23.2% vs. SAMQ = 17.7%; AOR = 0.5, 95% CI = 0.3 - 0.8) and women assigned to CASI were slightly less likely to report being overdue for mammography (CATI = 16.5% vs. CASI = 11.8%; AOR = 0.5, 95% CI = 0.3 - 1.0) and Pap testing (CATI = 14.9% vs. CASI = 10.0%; AOR = 0.5, 95% CI = 0.2 - 1.0). There were no other consistent mode effects. Among participants in this sample, mode of data collection had little effect on the reporting of mammography and Pap testing behaviours. Other measures such as efficiency and cost-effectiveness of the mode should also be considered when determining the most appropriate form of data collection for use in monitoring indicators of cancer detection and control.
To assess the intra- and inter-rater agreement of chart abstractors from multiple sites involved in the evaluation of an Asthma Care Program (ACP). For intra-rater agreement, 110 charts randomly selected from 1,433 patients enrolled in the ACP across eight Ontario communities were re-abstracted by 10 abstractors. For inter-rater agreement, data abstractors reviewed a set of eight fictitious charts. Data abstraction involved information pertaining to six categories: physical assessment, asthma control, spirometry, asthma education, referral visits, and medication side effects. Percentage agreement and the kappa statistic (kappa) were used to measure agreement. Sensitivity and specificity estimates were calculated comparing results from all raters against the gold standard. Intra-rater re-abstraction yielded an overall kappa of 0.81. Kappa values for the chart abstraction categories were: physical assessment (kappa 0.84), asthma control (kappa 0.83), spirometry (kappa 0.84), asthma education (kappa 0.72), referral visits (kappa 0.59) and medication side effects (kappa 0.51). Inter-rater abstraction of the fictitious charts produced an overall kappa of 0.75, sensitivity of 0.91 and specificity of 0.89. Abstractors demonstrated agreement for physical assessment (kappa 0.88, sensitivity and specificity 0.95), asthma control (kappa 0.68, sensitivity 0.89, specificity 0.85), referral visits (kappa 0.77, sensitivity 0.88, specificity 0.95), and asthma education (kappa 0.49, sensitivity 0.87, specificity 0.77). Though the data were collected by multiple abstractors, the results show high sensitivity and specificity and substantial to excellent inter- and intra-rater agreement, supporting confidence in the use of chart abstraction for evaluating the ACP.
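The kappa statistic used throughout the abstract corrects observed agreement for the agreement expected by chance. A minimal sketch of Cohen's kappa from a square agreement table, with an illustrative (hypothetical) two-category table:

```python
def cohens_kappa(table):
    """Cohen's kappa from a square agreement table (rows: rater 1,
    columns: rater 2): kappa = (p_o - p_e) / (1 - p_e), where p_o is
    observed agreement and p_e is chance agreement from the margins."""
    n = sum(sum(row) for row in table)
    k = len(table)
    p_o = sum(table[i][i] for i in range(k)) / n
    row = [sum(table[i]) / n for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_e = sum(r * c for r, c in zip(row, col))
    return (p_o - p_e) / (1 - p_e)

# 90% observed agreement against 50% chance agreement gives kappa = 0.8
k = cohens_kappa([[45, 5], [5, 45]])
```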
The evaluation of academic research performance is nowadays a priority issue. Bibliometric indicators such as the number of publications, total citation counts and the h-index are an indispensable tool in this task, but their inherent association with the size of the research output may result in rewarding high production when evaluating institutions of disparate sizes. The aim of this study is to propose an indicator that may facilitate the comparison of institutions of disparate sizes. The Modified Impact Index (MII) was defined as the ratio of the observed h-index (h) of an institution over the h-index anticipated for that institution on average, given the number of publications (N) it produces, i.e. MII = h/(10^α × N^β), where α and β denote the intercept and the slope, respectively, of the line describing the dependence of the h-index on the number of publications on the log10 scale. MII values higher than 1 indicate that an institution performs better than the average, in terms of its h-index. Data on scientific papers published during 2002-2006 and within 36 medical fields for 219 Academic Medical Institutions from 16 European countries were used to estimate α and β and to calculate the MII of their total and field-specific production. From our biomedical research data, the slope β governing the dependence of the h-index on the number of publications in biomedical research was found to be similar to that estimated in other disciplines (approximately 0.4). The MII was positively associated with the average number of citations/publication (r = 0.653, p < 0.001), the h-index (r = 0.213, p = 0.002) and the number of publications with ≥ 100 citations (r = 0.211, p = 0.004), but not with the number of publications (r = -0.020, p = 0.765).
The MII was the indicator most strongly associated with the share of country-specific government budget appropriations or outlays for research and development as a percentage of GDP in 2004 (r = 0.229), followed by the average number of citations/publication (r = 0.153), whereas the corresponding correlation coefficient for the h-index was close to 0 (r = 0.029). The MII was calculated for the 10 top-ranked European universities in life sciences and biomedicine, as provided by the Times Higher Education ranking system, and their total and field-specific performance was compared. The MII should complement the use of the h-index when comparing the research output of institutions of disparate sizes. It has a conceptual interpretation and, with the data provided here, can be computed for the total research output as well as for field-specific publication sets of institutions in biomedicine.
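The MII definition reduces to a one-line formula once α and β have been estimated by regressing log10(h) on log10(N). The sketch below assumes such estimates are available; the α value used in the example is made up, while β = 0.4 echoes the slope quoted in the abstract.

```python
def modified_impact_index(h, n_pubs, alpha, beta):
    """MII = h / (10**alpha * N**beta): the observed h-index divided by
    the h-index expected on average for an institution publishing N
    papers, given regression estimates alpha (intercept) and beta
    (slope) on the log10 scale."""
    return h / (10 ** alpha * n_pubs ** beta)

# hypothetical: with alpha = 0 and beta = 0.4, an institution whose
# h-index exactly matches expectation (h = N**0.4) gets MII = 1
mii = modified_impact_index(10000 ** 0.4, 10000, 0.0, 0.4)
```

Values above 1 indicate above-average performance for the institution's size, which is the size-adjustment the plain h-index lacks.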
Background Opt-in consent is usually required for research, but is known to introduce selection bias. This is a particular problem for large scale epidemiological studies using only pre-collected health data. Most previous studies have shown that members of the public value opt-in consent and can perceive research without consent as an invasion of privacy. Past research has suggested that people are generally unaware of research processes and existing safeguards, and that education may increase the acceptability of research without prior informed consent, but this recommendation has not been formally evaluated. Our objectives were to determine the range of public opinion about the use of existing medical data for research and to explore views about consent to a secondary review of medical records for research. We also investigated the effect of the provision of detailed information about the potential effect of selection bias on public acceptability of the use of data for research. Methods We carried out a systematic review of existing literature on public attitudes to secondary use of existing health records identified by searching PubMed (1966-present), Embase (1974-present) and reference lists of identified studies to provide a general overview, followed by a qualitative focus group study with 19 older men recruited from rural and suburban primary care practices in the UK to explore key issues in detail. Results The systematic review identified twenty-seven relevant papers and the findings suggested that males and older people were more likely to consent to a review of their medical data. Many studies noted participants’ lack of knowledge about research processes and existing safeguards and this was reflected in the focus groups. Focus group participants became more accepting of the use of pre-collected medical data without consent after being given information about selection bias and research processes. 
All participants were keen to contribute to NHS-related research but some were concerned about data-sharing for commercial gain and the potential misuse of information. Conclusions Increasing public education about research and specific targeted information provision could promote trust in research processes and safeguards, which in turn could increase the acceptability of research without specific consent where the need for consent would lead to biased findings and impede research necessary to improve public health.
The determinants of participation in long-term follow-up studies of disasters have rarely been delineated. Even less is known from studies of events that occurred in eastern Europe. We examined the factors associated with participation in a longitudinal two-stage study conducted in Kyiv following the 1986 Chornobyl nuclear power plant accident. Six hundred child-mother dyads (300 evacuees and 300 classmate controls) were initially assessed in 1997 when the children were 11 years old, and followed up in 2005-6 when they were 19 years old. A population control group (304 mothers and 327 children) was added in 2005-6. Each assessment point involved home interviews with the children and mothers (stage 1), followed by medical examinations of the children at a clinic (stage 2). Background characteristics, health status, and Chornobyl risk perceptions were examined. The participation rates in the follow-up home interviews were 87.8% for the children (88.6% for evacuees; 87.0% for classmates) and 83.7% for their mothers (86.4% for evacuees and 81.0% for classmates). Children's and mothers' participation was predicted by one another's study participation and attendance at the medical examination at time 1. Mother's participation was also predicted by initial concerns about her child's health, greater psychological distress, and Chornobyl risk perceptions. In 1997, 91.2% of the children had a medical examination (91.7% of evacuees and 90.7% of classmates); in 2005-6, 85.2% were examined (83.0% of evacuees, 87.7% of classmates, 85.0% of population controls). At both times, poor health perceptions were associated with receiving a medical examination. In 2005-6, clinic attendance was also associated with the young adults' risk perceptions, depression or generalized anxiety disorder, lower standard of living, and female gender. Despite our low attrition rates, we identified several determinants of selective participation consistent with previous research. 
Although evacuee status was not associated with participation, Chornobyl risk perceptions were strong predictors of mothers' follow-up participation and attendance at the medical examinations. Understanding selective participation offers valuable insight for future longitudinal disaster studies that integrate psychiatric and medical epidemiologic research.
The study design with the smallest bias for causal inference is a perfect randomized clinical trial. Since this design is often not feasible in epidemiologic studies, an important challenge is to model bias properly and to take random and systematic variation properly into account. A value for a target parameter might be said to be "incompatible" with the data (under the model used) if the parameter's confidence interval excludes it. However, this "incompatibility" may be due to bias and/or extra-variation. We propose the following way of re-interpreting conventional results. Given a specified focal value for a target parameter (typically the null value, but possibly a non-null value such as that representing a twofold risk), the difference between the focal value and the nearest boundary of the confidence interval for the parameter is calculated. This represents the maximum correction of the interval boundary, for bias and extra-variation, that would still leave the focal value outside the interval, so that the focal value remained "incompatible" with the data. We describe a short example application concerning a meta-analysis of air versus pure oxygen resuscitation treatment in newborn infants. Some general guidelines are provided for how to assess the probability that the appropriate correction for a particular study would be greater than this maximum (e.g. using knowledge of the general effects of bias and extra-variation from published bias-adjusted results). Although this approach does not yet provide a formal method, because the latter probability cannot be objectively assessed, this paper aims to stimulate the re-interpretation of conventional confidence intervals, and more and better studies of the effects of different biases.
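The proposed quantity, the distance from a focal value to the nearest boundary of a confidence interval that excludes it, is straightforward to compute. A sketch under that reading of the abstract, with hypothetical interval endpoints:

```python
def max_correction(focal, ci_lower, ci_upper):
    """Distance from a focal parameter value to the nearest boundary of
    a confidence interval that excludes it: the largest correction for
    bias and extra-variation that would still leave the focal value
    outside the interval. Returns 0.0 if the focal value already lies
    inside the interval (no 'incompatibility' to protect)."""
    if ci_lower <= focal <= ci_upper:
        return 0.0
    return min(abs(focal - ci_lower), abs(focal - ci_upper))

# hypothetical: null log-risk-ratio of 0 against a CI of (0.3, 0.9);
# a correction of up to 0.3 still leaves the null excluded
gap = max_correction(0.0, 0.3, 0.9)
```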
Techniques for interim analysis, the statistical analysis of results while they are still accumulating, are highly-developed in the setting of clinical trials. But in the setting of laboratory experiments such analyses are usually conducted secretly and with no provisions for the necessary adjustments of the Type I error-rate. Laboratory researchers, from ignorance or by design, often analyse their results before the final number of experimental units (humans, animals, tissues or cells) has been reached. If this is done in an uncontrolled fashion, the pejorative term 'peeking' has been applied. A statistical penalty must be exacted. This is because if enough interim analyses are conducted, and if the outcome of the trial is on the borderline between 'significant' and 'not significant', ultimately one of the analyses will result in the magical P = 0.05. I suggest that Armitage's technique of matched-pairs sequential analysis should be considered. The conditions for using this technique are ideal: almost unlimited opportunity for matched pairing, and a short time between commencement of a study and its completion. Both the Type I and Type II error-rates are controlled. And the maximum number of pairs necessary to achieve an outcome, whether P = 0.05 or P > 0.05, can be estimated in advance. Laboratory investigators, if they are to be honest, must adjust the critical value of P if they analyse their data repeatedly. I suggest they should consider employing matched-pairs sequential analysis in designing their experiments.
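The inflation that 'peeking' causes can be made concrete with a small simulation: re-running a fixed-level two-sided z-test at several interim looks on data generated under the null. This Monte Carlo sketch is an illustration of the problem, not Armitage's matched-pairs sequential method; the number of looks and sample sizes are arbitrary choices for the example.

```python
import math
import random

def peeking_type1_rate(n_sims=2000, looks=(10, 20, 30, 40, 50), seed=1):
    """Monte Carlo estimate of the overall Type I error when a
    two-sided z-test at alpha = 0.05 is repeated at each interim look
    on accumulating null N(0, 1) data (known sd = 1). With several
    looks the overall rejection rate climbs well above 0.05."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        data = []
        for n in looks:
            while len(data) < n:
                data.append(rng.gauss(0.0, 1.0))
            z = (sum(data) / n) / (1.0 / math.sqrt(n))
            if abs(z) > 1.96:  # 'significant' at this peek: stop and claim an effect
                rejections += 1
                break
    return rejections / n_sims

rate = peeking_type1_rate()  # well above the nominal 0.05
```

Sequential designs such as Armitage's control this by widening the critical boundaries at each look instead of reusing 1.96.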
The purpose of this study was to validate the accuracy of an alternative cervical cancer test - visual inspection with acetic acid (VIA) - by addressing possible imperfections in the gold standard through latent class analysis (LCA). The data were originally collected at peri-urban health clinics in Zimbabwe. Conventional accuracy (sensitivity/specificity) estimates for VIA and two other screening tests using colposcopy/biopsy as the reference standard were compared to LCA estimates based on results from all four tests. For conventional analysis, negative colposcopy was accepted as a negative outcome when biopsy was not available as the reference standard. With LCA, local dependencies between tests were handled through adding direct effect parameters or additional latent classes to the model. Two models yielded good fit to the data, a 2-class model with two adjustments and a 3-class model with one adjustment. The definition of latent disease associated with the latter was more stringent, backed by three of the four tests. Under that model, sensitivity for VIA (abnormal+) was 0.74 compared to 0.78 with conventional analyses. Specificity was 0.639 versus 0.568, respectively. By contrast, the LCA-derived sensitivity for colposcopy/biopsy was 0.63. VIA sensitivity and specificity with the 3-class LCA model were within the range of published data and relatively consistent with conventional analyses, thus validating the original assessment of test accuracy. LCA probably yielded more likely estimates of the true accuracy than did conventional analysis with in-country colposcopy/biopsy as the reference standard. Colposcopy with biopsy can be problematic as a study reference standard and LCA offers the possibility of obtaining estimates adjusted for referent imperfections.
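Conventional accuracy against a reference standard reduces to two ratios from the 2x2 classification table. A minimal sketch; the counts below are illustrative per-100-diseased / per-1000-healthy figures chosen to echo the VIA estimates quoted above, not the Zimbabwe study's actual data, and the latent class adjustment itself is not reproduced here.

```python
def sens_spec(tp, fn, fp, tn):
    """Conventional test accuracy against a reference standard:
    sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# hypothetical counts: 74 of 100 diseased test positive,
# 639 of 1000 disease-free test negative
sens, spec = sens_spec(74, 26, 361, 639)
```

Latent class analysis replaces the observed reference standard with an unobserved disease class, which is why it can return different (often more plausible) accuracy estimates when the reference itself is imperfect.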
Contributing reviewers The editors of BMC Medical Research Methodology would like to thank all our reviewers who have contributed to the journal in Volume 12 (2012).
Route environments can positively influence people's active commuting and thereby contribute to public health. The Active Commuting Route Environment Scale (ACRES) was developed to study active commuters' perceptions of their route environments. However, bicycle commuters represent a small portion of the population in many cities and thus are difficult to study using population-based material. Therefore, the aim of this study is to expand the state of knowledge concerning the criterion-related validity of the ACRES and the representativity using an advertisement-recruited sample. Furthermore, by comparing commuting route environment profiles of inner urban and suburban areas, we provide a novel basis for understanding the relationship between environment and bikeability. Bicycle commuters from Greater Stockholm, Sweden, advertisement- (n = 1379) and street-recruited (n = 93), responded to the ACRES. Traffic planning and environmental experts from the Municipality of Stockholm (n = 24) responded to a modified version of the ACRES. The criterion-related validity assessments were based on whether or not differences between the inner urban and the suburban route environments, as indicated by the experts and by four existing objective measurements, were reflected by differences in perceptions of these environments. Comparisons of ratings between advertisement- and street-recruited participants were used for the assessments of representativity. Finally, ratings of inner urban and suburban route environments were used to evaluate commuting route environment profiles. Differences in ratings of the inner urban and suburban route environments by the advertisement-recruited participants were in accord with the existing objective measurements and corresponded reasonably well with those of the experts. Overall, there was a reasonably good correspondence between the advertisement- and street-recruited participants' ratings. 
Distinct differences in commuting route environment profiles were noted between the inner urban and suburban areas. Suburban route environments were rated as safer and more stimulating for bicycle-commuting than the inner urban ones. In general, the findings applied to both men and women. The overall results show: considerable criterion-related validity of the ACRES; ratings of advertisement-recruited participants mirroring those of street-recruited participants; and a higher degree of bikeability in the suburban commuting route environments than in the inner urban ones.
Familism and parental respect are culturally derived constructs rooted in Hispanic and Asian cultures, respectively. Measures of these constructs have been utilized in research and found to predict delays in substance use initiation and reduced levels of use. However, given that these measures are explicitly designed to tap constructs that are considered important by different racial/ethnic groups, there is a risk that the measurement properties may not be equivalent across groups. This study evaluated the measurement equivalence of measures of familism and parental respect in a large and diverse sample of middle school students in Southern California (n = 5646) using a multiple group confirmatory factor analysis approach. Results showed little evidence of measurement non-invariance across four racial/ethnic groups (African American, Hispanic, Asian, and non-Hispanic White), supporting the continued use of these measures in diverse populations. Some differences between latent variable means were identified - specifically, the Hispanic and non-Hispanic White groups differed on familism. No other evidence of non-invariance was found. However, the item distributions were highly positively skewed, indicating a tendency for youth to endorse the most positive response, which may reduce the reliability of the measures and suggests that refinement is possible.
Background The two most important considerations in the evaluation of survival prediction models are 1) predictability - the ability to predict survival risks accurately - and 2) reproducibility - the ability to generalize to predict samples generated from different studies. We present approaches for assessing the reproducibility of survival risk score predictions across medical centers. Methods Reproducibility was evaluated in terms of consistency and transferability. Consistency is the agreement of risk scores predicted between two centers. Transferability from one center to another center is the agreement of the risk scores of the second center predicted by each of the two centers. The transferability can be: 1) model transferability - whether a predictive model developed from one center can be applied to predict the samples generated from other centers and 2) signature transferability - whether signature markers of a predictive model developed from one center can be applied to predict the samples from other centers. We considered eight prediction models, including two clinical models, two gene expression models, and their combinations. Predictive performance of the eight models was evaluated by several common measures. Correlation coefficients between predicted risk scores of different centers were computed to assess reproducibility - consistency and transferability. Results Two public datasets, lung cancer data generated from four medical centers and colon cancer data generated from two medical centers, were analyzed. The risk score estimates for lung cancer patients predicted by three of the four centers agreed reasonably well. In general, a good prediction model showed better cross-center consistency and transferability. The risk scores for the colon cancer patients from one (Moffitt) medical center that were predicted by the clinical models developed from the other (Vanderbilt) medical center showed excellent model transferability and signature transferability. 
Conclusions This study illustrates an analytical approach to assessing the reproducibility of predictive models and signatures. Based on the analyses of the two cancer datasets, we conclude that the models with clinical variables appear to perform reasonably well, with a high degree of consistency and transferability. Further investigation of the cross-study reproducibility of prediction models that include gene expression data is warranted.
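The consistency measure described in the Methods - the correlation between two centers' predicted risk scores for the same patients - can be sketched in a few lines of Python. The risk scores below are hypothetical stand-ins for illustration, not data from the study.

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# hypothetical risk scores for the same five patients, predicted by
# models developed at two different centers
center_a = [0.12, 0.35, 0.50, 0.61, 0.84]
center_b = [0.15, 0.30, 0.55, 0.58, 0.80]
print(round(pearson_r(center_a, center_b), 3))  # → 0.987
```

A correlation near 1 would indicate high cross-center consistency in the paper's sense; low correlation would flag a model whose risk rankings do not transfer.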
The objective of most biomedical research is to determine an unbiased estimate of effect for an exposure on an outcome, i.e. to make causal inferences about the exposure. Recent developments in epidemiology have shown that traditional methods of identifying and adjusting for confounding may be inadequate. The traditional methods of adjusting for "potential confounders" may introduce conditional associations and bias rather than minimize it. Although previously published articles have discussed the role of the causal directed acyclic graph (DAG) approach with respect to confounding, many clinical problems require complicated DAGs, and investigators may therefore continue to use traditional practices because they lack the tools necessary to use the DAG approach properly. The purpose of this manuscript is to demonstrate a simple 6-step approach to the use of DAGs and to explain why the method works from a conceptual point of view. Using the simple 6-step DAG approach to confounding and selection bias discussed here is likely to reduce the degree of bias in the effect estimate from the chosen statistical model.
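As a toy illustration of the DAG idea (not the authors' 6-step procedure itself), the Python sketch below enumerates backdoor paths - undirected paths from exposure to outcome that begin with an arrow into the exposure - in a small hypothetical graph. Conditioning on a node along each such path blocks it, provided the path contains no colliders, as is the case in this example.

```python
# toy DAG with hypothetical variables: age confounds smoking -> cancer
edges = {("age", "smoking"), ("age", "cancer"), ("smoking", "cancer")}

def backdoor_paths(edges, exposure, outcome):
    """Undirected simple paths from exposure to outcome whose first
    edge points INTO the exposure (the classic backdoor paths)."""
    und = {}
    for a, b in edges:
        und.setdefault(a, set()).add(b)
        und.setdefault(b, set()).add(a)
    found = []
    def walk(node, path):
        if node == outcome:
            found.append(path)
            return
        for nxt in sorted(und.get(node, ())):
            if nxt not in path:
                walk(nxt, path + [nxt])
    # only start along edges that point into the exposure
    for parent in sorted(a for a, b in edges if b == exposure):
        walk(parent, [exposure, parent])
    return found

paths = backdoor_paths(edges, "smoking", "cancer")
print(paths)  # [['smoking', 'age', 'cancer']]
# conditioning on the intermediate nodes blocks each path (no colliders here)
adjustment_set = {node for p in paths for node in p[1:-1]}
print(adjustment_set)  # {'age'}
```

A full treatment needs d-separation rules to handle colliders; this sketch only shows why tracing backdoor paths, rather than adjusting for every available covariate, identifies the variables worth conditioning on.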
Adaptive designs are becoming increasingly important in clinical research. One approach subdivides the study into several (two or more) stages and combines the p-values of the different stages using Fisher's combination test. As an alternative to Fisher's test, the recently proposed truncated product method (TPM) can be applied to combine the p-values. The TPM uses the product of only those p-values that do not exceed some fixed cut-off value. Here, these two competing analyses are compared. When an early termination due to insufficient effects is not appropriate, such as in dose-response analyses, the probability of stopping the trial early with rejection of the null hypothesis is increased when the TPM is applied, and the expected total sample size is therefore decreased. This decrease in sample size is not accompanied by a loss in power. The TPM turns out to be less advantageous when an early termination of the study due to insufficient effects is possible, owing to a decrease in the probability of stopping the trial early. It is recommended to apply the TPM rather than Fisher's combination test whenever an early termination due to insufficient effects is not suitable within the adaptive design.
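Both combination rules are simple to state: Fisher's test refers X = -2 Σ ln p_i to a chi-square distribution with 2k degrees of freedom, while the TPM takes the product of only those p-values at or below a cut-off τ and evaluates its null distribution, here approximated by Monte Carlo. The Python sketch below is illustrative, with arbitrary example p-values rather than any dataset from the article.

```python
import math
import random

def chi2_sf(x, df):
    """Chi-square survival function for even df (closed-form series)."""
    k = df // 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= (x / 2) / i
        total += term
    return math.exp(-x / 2) * total

def fisher_combination(pvalues):
    """Fisher: X = -2 * sum(ln p_i) ~ chi-square with 2k df under H0."""
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    return chi2_sf(stat, 2 * len(pvalues))

def tpm_stat(pvalues, tau=0.05):
    """Truncated product: product of the p-values not exceeding tau
    (equals 1 when no p-value falls at or below tau)."""
    return math.prod(p for p in pvalues if p <= tau)

def tpm_pvalue(pvalues, tau=0.05, n_sim=20000, seed=0):
    """Monte Carlo null distribution of the truncated product."""
    random.seed(seed)
    k = len(pvalues)
    obs = tpm_stat(pvalues, tau)
    hits = sum(
        tpm_stat([random.random() for _ in range(k)], tau) <= obs
        for _ in range(n_sim)
    )
    return hits / n_sim

stage_p = [0.01, 0.04, 0.60]                 # arbitrary example p-values
print(round(fisher_combination(stage_p), 4))  # roughly 0.011
print(tpm_pvalue(stage_p))                    # only p <= 0.05 enter the product
```

Note how the large third p-value dilutes Fisher's statistic but is simply dropped by the TPM, which is the intuition behind the article's finding that the TPM stops earlier when early rejection, not early futility, is the goal.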
Background: The Computer Adaptive Test version of the Community Reintegration of Injured Service Members measure (CRIS-CAT) consists of three scales measuring Extent of, Perceived Limitations in, and Satisfaction with community integration. The CRIS-CAT was developed using item response theory methods. The purposes of this study were to assess the reliability, concurrent, known group and predictive validity and respondent burden of the CRIS-CAT. Methods: This was a three-part study that included 1) a cross-sectional field study of 517 homeless, employed, and Operation Enduring Freedom/Operation Iraqi Freedom (OEF/OIF) Veterans, who completed all items in the CRIS item set, 2) a cohort study with one-year follow-up of 135 OEF/OIF Veterans, and 3) a 50-person study of CRIS-CAT administration. Conditional reliability of simulated CAT scores was calculated from the field study data, and concurrent validity and known group validity were examined using Pearson product correlations and ANOVAs. Data from the cohort were used to examine the ability of the CRIS-CAT to predict key one-year outcomes. Data from the CRIS-CAT administration study were used to calculate ICC (2,1), minimum detectable change (MDC), and the average number of items used during CAT administration. Results: Reliability scores for all scales were above 0.75, but decreased at both ends of the score continuum. CRIS-CAT scores were correlated with concurrent validity indicators and differed significantly between the three Veteran groups (P < .001). The odds of having any Emergency Room visits were reduced for Veterans with better CRIS-CAT scores (Extent, Perceived and Satisfaction respectively: OR = 0.94, 0.93, 0.95; P < .05). 
CRIS-CAT scores were predictive of SF-12 physical and mental health-related quality of life scores at the one-year follow-up. Scales had ICCs > 0.9. MDCs were 5.9, 6.2, and 3.6, respectively, for the Extent, Perceived and Satisfaction subscales. Numbers of items (mean, SD) administered at Visit 1 were 14.6 (3.8), 10.9 (2.7) and 10.4 (1.7), respectively, for the Extent, Perceived and Satisfaction subscales. Conclusion: The CRIS-CAT demonstrated sound measurement properties, including reliability, construct, known group and predictive validity, and it was administered with minimal respondent burden. These findings support the use of this measure in assessing community reintegration.
We aimed to assess the degree of measurement error in essential fatty acid intakes from a food frequency questionnaire and the impact of correcting for such error on the precision and bias of odds ratios in logistic models. To assess these impacts, and for illustrative purposes, alternative approaches and methods were used with the binary outcome of cognitive decline in verbal fluency. Using the Atherosclerosis Risk in Communities (ARIC) study, we conducted a sensitivity analysis. The error-prone exposure - visit 1 fatty acid intake (1987-89) - was available for 7,814 subjects 50 years or older at baseline with complete data on cognitive decline between visits 2 (1990-92) and 4 (1996-98). Our binary outcome of interest was clinically significant decline in verbal fluency. Point estimates and 95% confidence intervals were compared between naive and measurement-error adjusted odds ratios of decline with every SD increase in fatty acid intake as % of energy. Two approaches were explored for adjustment: (A) external validation against biomarkers (plasma fatty acids in cholesteryl esters and phospholipids) and (B) internal repeat measurements at visits 2 and 3. The main difference between the two is that Approach B makes a stronger assumption regarding lack of error correlations in the structural model. Additionally, we compared results from regression calibration (RCAL) to those from simulation extrapolation (SIMEX). Finally, using structural equation modeling, we estimated attenuation factors associated with each dietary exposure to assess the degree of measurement error in a bivariate scenario for regression calibration of the logistic regression model. Attenuation factors for Approach A were smaller than for B, suggesting a larger amount of measurement error in the dietary exposure. Replicate measures (Approach B), unlike concentration biomarkers (Approach A), may lead to imprecise odds ratios due to larger standard errors. 
Using SIMEX rather than RCAL models tends to preserve precision of odds ratios. We found in many cases that bias in naïve odds ratios was towards the null. RCAL tended to correct for a larger amount of effect bias than SIMEX, particularly for Approach A.
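In its simplest linear form, regression calibration divides the naive slope by an attenuation factor λ estimated from replicate measurements - the same kind of quantity the authors estimate for their Approach B. The Python sketch below uses simulated data (not the ARIC study) to show the attenuation and its correction.

```python
import random
import statistics

random.seed(2)
n = 5000
true_x = [random.gauss(0, 1) for _ in range(n)]
# two error-prone replicates of the same exposure (classical error, var 1)
rep1 = [x + random.gauss(0, 1) for x in true_x]
rep2 = [x + random.gauss(0, 1) for x in true_x]
# outcome generated with true slope 1 on the error-free exposure
y = [x + random.gauss(0, 1) for x in true_x]

def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return num / sum((a - mx) ** 2 for a in xs)

naive = slope(rep1, y)  # attenuated towards 0 by measurement error
# attenuation factor lambda = var(true)/var(observed), estimated from
# the replicates as cov(rep1, rep2) / var(rep1)
m1, m2 = statistics.fmean(rep1), statistics.fmean(rep2)
cov12 = sum((a - m1) * (b - m2) for a, b in zip(rep1, rep2)) / (n - 1)
lam = cov12 / statistics.variance(rep1)
corrected = naive / lam  # regression-calibration style correction
# here lambda is 1/2, so naive sits near 0.5 and corrected recovers ~1.0
print(round(naive, 2), round(corrected, 2))
```

This linear sketch also makes the abstract's finding about bias "towards the null" concrete: classical error shrinks the naive coefficient by λ, and dividing by an estimated λ undoes the shrinkage at the cost of extra variance.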
Background Adjusting for laboratory test results may result in better confounding control when added to administrative claims data in the study of treatment effects. However, missing values can arise through several mechanisms. Methods We studied the relationship between availability of outpatient lab test results, lab values, and patient and system characteristics in a large healthcare database using LDL, HDL, and HbA1c in a cohort of initiators of statins or Vytorin (ezetimibe & simvastatin) as examples. Results Among 703,484 patients, 68% had at least one lab test performed in the 6 months before treatment. Performing an LDL test was negatively associated with several patient characteristics, including recent hospitalization (OR = 0.32, 95% CI: 0.29-0.34), MI (OR = 0.77, 95% CI: 0.69-0.85), or carotid revascularization (OR = 0.37, 95% CI: 0.25-0.53). Patient demographics, diagnoses, and procedures predicted well who would have a lab test performed (AUC = 0.89 to 0.93). Among those with test results available, claims data explained only 14% of the variation. Conclusions In a claims database linked with outpatient lab test results, we found that lab tests are performed selectively, corresponding to current treatment guidelines. Poor ability to predict lab values and the high proportion of missingness reduce the added value of lab tests for effectiveness research in this setting.
Missing data may bias the results of clinical trials and other studies. This study describes the response rate, questionnaire responses and financial costs associated with offering participants from a multilingual population the option to complete questionnaires over the telephone. Design: Before and after study of two methods of questionnaire completion. Participants and Setting: Seven hundred and sixty-five pregnant women from 25 general practices in two UK inner city Primary Care Trusts (PCTs) taking part in a cluster randomised controlled trial of offering antenatal sickle cell and thalassaemia screening in primary care. Two hundred and four participants did not speak English. Sixty-one women were offered postal questionnaire completion only and 714 women were offered a choice of telephone or postal questionnaire completion. Outcome measures: (i) Proportion of completed questionnaires, (ii) attitude and knowledge responses obtained from a questionnaire assessing informed choice. The response rate from women offered postal completion was 26% compared with 67% for women offered a choice of telephone or postal completion (41% difference, 95% CI 30 to 52). For non-English speakers offered a choice of completion methods the response rate was 56% compared with 71% for English speakers (95% CI of difference 7 to 23). No difference was found for knowledge by completion method, but telephone completion was associated with more positive attitude classifications than postal completion (87% vs 96%, 95% CI of difference 0.006 to 15). Compared with postal administration, the additional costs associated with telephone administration were £3.90 per questionnaire for English speakers and £71.60 per questionnaire for non-English speakers. 
Studies requiring data to be collected by questionnaire may obtain higher response rates from both English and non-English speakers when a choice of telephone or postal administration (and, where necessary, an interpreter) is offered compared with offering postal administration only. This approach will, however, incur additional research costs, and uncertainty remains about the equivalence of responses obtained from the two methods.
Background We developed and validated an automated database case definition for diabetes in children and youth to facilitate pharmacoepidemiologic investigations of medications and the risk of diabetes. Methods The present study was part of an in-progress retrospective cohort study of antipsychotics and diabetes in Tennessee Medicaid enrollees aged 6–24 years. Diabetes was identified from diabetes-related medical care encounters: hospitalizations, outpatient visits, and filled prescriptions. The definition required either a primary inpatient diagnosis or at least two other encounters of different types, most commonly an outpatient diagnosis with a prescription. Type 1 diabetes was defined by insulin prescriptions with at most one oral hypoglycemic prescription; other cases were considered type 2 diabetes. The definition was validated for cohort members in the 15-county region geographically proximate to the investigators. Medical records were reviewed and adjudicated for cases that met the automated database definition as well as for a sample of persons with other diabetes-related medical care encounters. Results The study included 64 cases that met the automated database definition. Records were adjudicated for 46 (71.9%), of which 41 (89.1%) met clinical criteria for newly diagnosed diabetes. The positive predictive value for type 1 diabetes was 80.0%. For type 2 and unspecified diabetes combined, the positive predictive value was 83.9%. The estimated sensitivity of the definition, based on adjudication for a sample of 30 cases not meeting the automated database definition, was 64.8%. Conclusion These results suggest that the automated database case definition for diabetes may be useful for pharmacoepidemiologic studies of medications and diabetes.
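The overall positive predictive value reported above follows directly from the adjudication counts in the abstract; a quick check in Python:

```python
# counts reported in the abstract: 46 definition-positive cases were
# adjudicated, of which 41 met clinical criteria for new-onset diabetes
adjudicated = 46
true_cases = 41
ppv = true_cases / adjudicated  # P(true case | definition positive)
print(f"{100 * ppv:.1f}%")  # → 89.1%
```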
This paper presents the psychometric properties of the nursing home administrator job satisfaction questionnaire (NHA-JSQ) and the steps used to develop the instrument. The NHA-JSQ subscales were developed from pilot survey activities with 93 administrators, content analysis, and a research panel. The resulting survey was sent to 1,000 nursing home administrators. Factor analyses were used to determine the psychometric properties of the instrument. Of the 1,000 surveys mailed, 721 usable surveys were returned (72 percent response rate). The factor analyses show that the items were representative of six underlying factors (i.e., coworkers, work demands, work content, work load, work skills, and rewards). The NHA-JSQ represents a short, psychometrically sound job satisfaction instrument for use in nursing homes.
Many studies have compared the degree of concordance between adolescents' and parents' reports on family socioeconomic status (SES). However, none of these studies analyzed whether the degree of concordance varies by level of household financial stress. This research examines whether the degree of concordance between adolescents' and parents' reports for the three traditional SES measures (parental education, parental occupation and household income) varied with parent-reported household financial stress and relative standard of living. Data from 2,593 adolescents with a mean age of 13 years, and one corresponding parent for each, from the Taiwan Longitudinal Youth Project conducted in 2000 were analyzed. Consistency of adolescents' and parents' reports on parental educational attainment, parental occupation and household income was examined by parent-reported household financial stress and relative standard of living. Parent-reported SES variables are closely associated with family financial stress. For all levels of household financial stress, the degree of concordance between adolescents' and parents' reports is highest for parental education (κ ranging from 0.87 to 0.71), followed by parental occupation (κ ranging from 0.50 to 0.34) and household income (κ ranging from 0.43 to 0.31). Concordance for father's education and parental occupation decreases with higher parent-reported financial stress. This pattern was less marked for parent-reported relative standard of living. Though the agreement between adolescents' and parents' reports on the three SES measures is generally judged to be good in most cases, using adolescents' reports of family SES may still be biased if the analysis is not stratified by family financial stress.
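The κ values quoted above are agreement statistics on paired categorical reports; unweighted Cohen's kappa is the standard choice for such data. A minimal Python implementation, with hypothetical teen/parent education ratings purely for illustration:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two paired categorical ratings:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum(ca[c] * cb[c] for c in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

# hypothetical paired reports of parental education (three categories)
teen = ["low", "mid", "mid", "high", "high", "low", "mid", "high"]
parent = ["low", "mid", "high", "high", "high", "low", "mid", "mid"]
print(round(cohens_kappa(teen, parent), 2))  # → 0.62
```

Unlike raw percent agreement (6/8 = 0.75 here), kappa discounts the agreement expected by chance from the marginal category frequencies, which is why it is preferred for concordance comparisons like those in this study.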
This study proposes a new approach for investigating bias in self-reported data on height and weight among adolescents by studying the relevance of participants' self-reported response capability. The objectives were 1) to estimate the prevalence of students with high and low self-reported response capability for weight and height in a self-administered questionnaire survey among 11–15 year old Danish adolescents, 2) to estimate the proportion of missing values on self-reported height and weight in relation to capability for reporting height and weight, and 3) to investigate the extent to which adolescents' response capability matters for the accuracy and precision of self-reported height and weight. The study also investigated the impact of students' response capability on estimating prevalence rates of overweight. Data were collected by a school-based cross-sectional questionnaire survey among students aged 11–15 years in 13 schools in Aarhus, Denmark (response rate = 89%, n = 2100). Response capability was based on students' reports of perceived ability to report weight/height and their weighing/height-measuring history. Direct measures of height and weight were collected by school health nurses. One third of the students had low response capability for weight and for height, respectively, and every second student had low response capability for BMI. The proportion of missing values on self-reported weight and height was significantly higher among students who had not been weighed and height-measured recently and among students who reported low recall ability. Among both boys and girls, the precision of self-reported height and weight tended to be lower among students with low response capability. Low response capability was related to BMI (z-score) and overweight prevalence among girls.
These findings were due to a larger systematic underestimation of weight among girls who were not weighed recently (-1.02 kg, p < 0.0001) and among girls with low recall ability for weight (-0.99 kg, p = 0.0024). This study indicates that response capability may be relevant for the accuracy of girls' self-reported measurements of weight and height. Consequently, by integrating items on response capability in survey instruments, participants with low capability can be identified. Similar analyses based on other and less selected populations are recommended.
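Accuracy and precision of self-reports of this kind are typically summarized as the mean and standard deviation of the differences between self-reported and directly measured values. A minimal illustrative sketch (the weights below are hypothetical, not the study's data):

```python
def report_bias(self_reported, measured):
    """Mean difference (accuracy/bias) and SD of differences (precision)
    between self-reported and directly measured values."""
    diffs = [s - m for s, m in zip(self_reported, measured)]
    n = len(diffs)
    mean = sum(diffs) / n
    sd = (sum((d - mean) ** 2 for d in diffs) / (n - 1)) ** 0.5
    return mean, sd

# Hypothetical self-reported vs. nurse-measured weights (kg)
bias, precision = report_bias([50, 48, 60, 55], [51, 50, 61, 55])
```

A negative mean difference, as in the -1.02 kg figure above, indicates systematic underestimation of weight in the self-reports.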
Background Statistical process control (SPC), originally an industrial initiative, has recently been applied in health care and public health surveillance. SPC methods assume independent observations, and process autocorrelation has been associated with an increase in false-alarm frequency. Methods Monthly mean raw mortality (at hospital discharge) time series, 1995–2009, at the individual Intensive Care Unit (ICU) level, were generated from the Australia and New Zealand Intensive Care Society adult patient database. Evidence for series (i) autocorrelation and seasonality was demonstrated using (partial) autocorrelation ((P)ACF) function displays and classical series decomposition, and (ii) “in-control” status was sought using risk-adjusted (RA) exponentially weighted moving average (EWMA) control limits (3 sigma). Risk adjustment was achieved using a random coefficient (intercept as ICU site and slope as APACHE III score) logistic regression model, generating an expected mortality series. Time-series methods were applied to an exemplar complete ICU series (1995 to end-2009) via Box-Jenkins methodology: autoregressive moving average (ARMA) and (G)ARCH ((Generalised) Autoregressive Conditional Heteroscedasticity) models, the latter addressing volatility of the series variance. Results The overall data set, 1995–2009, consisted of 491,324 records from 137 ICU sites; average raw mortality was 14.07%; average (SD) raw and expected mortalities ranged from 0.012 (0.113) and 0.013 (0.045) to 0.296 (0.457) and 0.278 (0.247), respectively. For the raw mortality series, 71 sites had continuous data for assessment up to or beyond lag 40, and 35% had autocorrelation through to lag 40; of 36 sites with continuous data for ≥ 72 months, all demonstrated marked seasonality. Similar numbers and percentages were seen with the expected series.
Out-of-control signalling was evident for the raw mortality series with respect to RA-EWMA control limits; a seasonal ARMA model, with GARCH effects, displayed white-noise residuals which were in-control with respect to EWMA control limits and one-step prediction error limits (3SE). The expected series was modelled with a multiplicative seasonal autoregressive model. Conclusions The data generating process of monthly raw mortality series at the ICU level displayed autocorrelation, seasonality and volatility. False-positive signalling of the raw mortality series was evident with respect to RA-EWMA control limits. A time series approach using residual control charts resolved these issues.
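The control-chart machinery behind this abstract can be illustrated with a plain (non-risk-adjusted) EWMA sketch. The target mortality, sigma and smoothing constant below are arbitrary illustrative assumptions, not values from the study, which used risk-adjusted limits from a logistic regression model:

```python
import math

def ewma_chart(x, target, sigma, lam=0.2, L=3.0):
    """EWMA statistic with time-varying L-sigma control limits.
    Returns a list of (ewma value, out-of-control flag) pairs."""
    z, out = target, []
    for t, xt in enumerate(x, start=1):
        z = lam * xt + (1 - lam) * z
        # exact EWMA standard deviation at time t (Montgomery's formula)
        se = sigma * math.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * t)))
        out.append((z, abs(z - target) > L * se))
    return out

# Hypothetical monthly mortality proportions: in-control, then an abrupt shift
res = ewma_chart([0.14] * 5 + [0.30] * 5, target=0.14, sigma=0.02)
```

If the series is autocorrelated, as the abstract reports for ICU mortality, the independence assumption behind these limits fails and the chart signals too often; fitting an ARMA/GARCH model first and charting the residuals is the remedy the authors describe.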
Background Verbal autopsy has been widely used to estimate causes of death in settings with inadequate vital registries, but little is known about its validity. This analysis was part of the Addis Ababa Mortality Surveillance Program and examined the validity of verbal autopsy for determining causes of death, compared with hospital medical records, among adults in the urban setting of Ethiopia. Methods This validation study compared verbal autopsy final diagnoses with hospital diagnoses taken as a “gold standard”. In public and private hospitals of Addis Ababa, 20,152 adult deaths (15 years and above) were recorded between 2007 and 2010. Within the same period, a verbal autopsy was conducted for 4,776 adult deaths, of which 1,356 had occurred in Addis Ababa hospitals. The verbal autopsy and hospital data sets were then merged using the following variables: full name of the deceased, sex, address, age, and place and date of death. We calculated sensitivity, specificity and positive predictive values with 95% confidence intervals. Results After merging, a total of 335 adult deaths were captured. For communicable diseases, the sensitivity, specificity and positive predictive value of verbal autopsy diagnosis were 79%, 78% and 68%, respectively. For non-communicable diseases, sensitivity of the verbal autopsy diagnoses was 69%, specificity 78% and positive predictive value 79%. For injury, sensitivity of the verbal autopsy diagnoses was 70%, specificity 98% and positive predictive value 83%. Higher sensitivity was achieved for HIV/AIDS and tuberculosis, but with lower specificity and relatively more false positives. Conclusion These findings indicate the potential of verbal autopsy to provide cost-effective information to guide policy on the double burden of communicable and non-communicable diseases among adults in Ethiopia.
Thus, a well-structured verbal autopsy method, followed by qualified physician review, could provide reasonable cause-specific mortality estimates in Ethiopia. However, this study has limited generalizability: the matched verbal autopsy deaths were all in-hospital deaths in an urban center, so the results may not extend to rural home deaths. Such application and refinement of existing verbal autopsy methods holds out the possibility of obtaining replicable, sustainable and internationally comparable mortality statistics of known quality. Similar validation studies need to be undertaken considering the limitations of medical records as a “gold standard”, since records may not be confirmed using laboratory investigations or medical technologies. Future validation studies should also address child and maternal causes of death and possibly all underlying causes of death.
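The validity measures used here come from a 2×2 cross-classification of the verbal autopsy diagnosis against the hospital gold standard. A minimal sketch with hypothetical counts (chosen for illustration, not taken from the study's tables):

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Sensitivity, specificity and positive predictive value
    from a 2x2 table of index diagnosis vs. gold standard."""
    sensitivity = tp / (tp + fn)   # true positives among gold-standard positives
    specificity = tn / (tn + fp)   # true negatives among gold-standard negatives
    ppv = tp / (tp + fp)           # gold-standard positives among index positives
    return sensitivity, specificity, ppv

# Hypothetical counts: verbal autopsy vs. hospital diagnosis for one cause group
sens, spec, ppv = diagnostic_accuracy(tp=79, fp=22, fn=21, tn=78)
```

Note that, unlike sensitivity and specificity, the positive predictive value also depends on how common the cause of death is in the sample.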
Despite its benefits, the nested case-control design is rarely applied in diagnostic research. We aim to show the advantages of this design for diagnostic accuracy studies. We used data from a full cross-sectional diagnostic study comprising a cohort of 1295 consecutive patients selected on suspicion of having deep vein thrombosis (DVT). We drew nested case-control samples from the full study population with case:control ratios of 1:1, 1:2, 1:3 and 1:4 (100 samples per ratio). We calculated diagnostic accuracy estimates for two tests that are used to detect DVT in clinical practice. Estimates of diagnostic accuracy in the nested case-control samples were very similar to those in the full study population. For example, for each case:control ratio, the positive predictive value of the D-dimer test was 0.30 in the full study population and 0.30 in the nested case-control samples (median of the 100 samples). As expected, variability of the estimates decreased with increasing sample size. Our findings support the view that the nested case-control study is a valid and efficient design for diagnostic studies and should be (re)appraised in current guidelines on diagnostic accuracy research.
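The sampling scheme can be sketched generically: all cases are retained and controls are subsampled at the chosen ratio. Because every case is kept, sensitivity is unchanged by construction, while specificity is estimated from the sampled controls. The cohort below is simulated for illustration and has nothing to do with the DVT data:

```python
import random

def nested_cc_sample(case_tests, control_tests, ratio, rng):
    """Keep every case; draw ratio x (number of cases) controls without replacement."""
    k = min(len(control_tests), ratio * len(case_tests))
    return case_tests, rng.sample(control_tests, k)

def sens_spec(case_tests, control_tests):
    sens = sum(case_tests) / len(case_tests)               # test positives among diseased
    spec = 1 - sum(control_tests) / len(control_tests)     # test negatives among non-diseased
    return sens, spec

rng = random.Random(42)
# Simulated full cohort: 1 = index test positive, 0 = negative
cases = [1] * 180 + [0] * 20        # true sensitivity 0.90
controls = [1] * 200 + [0] * 800    # true specificity 0.80
full = sens_spec(cases, controls)
sampled = sens_spec(*nested_cc_sample(cases, controls, ratio=2, rng=rng))
```

Repeating the sampling step many times, as the authors did with 100 samples per ratio, shows how close the subsample estimates stay to the full-cohort values.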
Analyzing time-to-onset of adverse drug reactions from treatment exposure contributes to meeting pharmacovigilance objectives, i.e. identification and prevention. Post-marketing data are available from reporting systems. Times-to-onset from such databases are right-truncated, because some patients who were exposed to the drug and who will eventually develop the adverse drug reaction will do so after the time of analysis and are thus not included in the data. Methods adapted to right-truncated data are not widely known and have never been used in pharmacovigilance. We assess the use of appropriate methods, as well as the consequences of not taking right truncation into account (the naïve approach), on parametric maximum likelihood estimation of the time-to-onset distribution. Both approaches, naïve and accounting for right truncation, were compared in a simulation study. We used twelve scenarios for the exponential distribution and twenty-four for the Weibull and log-logistic distributions. These scenarios are defined by a set of parameters: the parameters of the time-to-onset distribution, the probability of this distribution falling within an observable values interval, and the sample size. An application to lymphoma reported after anti-TNF-α treatment in the French pharmacovigilance database is presented. The simulation study shows that the bias and the mean squared error can in some instances be unacceptably large when right truncation is ignored, while the truncation-based estimator always performs better, often satisfactorily, and the gap may be large. For the real dataset, the estimated expected time-to-onset differs by at least 58 weeks between the two approaches, which is not negligible. This difference is obtained for the Weibull model, under which the estimated probability of the distribution falling within an observable values interval is close to 1.
It is necessary to take right truncation into account for estimating time-to-onset of adverse drug reactions from spontaneous reporting databases.
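The core idea can be sketched for the simplest (exponential) case: with right truncation at time T, each observed time contributes the conditional density f(t)/F(T) to the likelihood, whereas the naïve approach uses f(t) alone and therefore overestimates the rate. Everything below (true rate, truncation time, grid search instead of a proper optimizer) is an illustrative assumption, not the authors' model:

```python
import math
import random

def truncated_exp_loglik(lam, times, trunc):
    """Conditional log-likelihood sum of log[f(t)/F(T)] for
    right-truncated exponential data with rate lam."""
    return sum(math.log(lam) - lam * t - math.log(1 - math.exp(-lam * T))
               for t, T in zip(times, trunc))

def fit_rate(times, trunc, grid):
    """Maximum-likelihood rate by simple grid search."""
    return max(grid, key=lambda lam: truncated_exp_loglik(lam, times, trunc))

# Simulate: true rate 0.5, analysis at T = 2, so only early onsets are observed
rng = random.Random(1)
true_lam, T = 0.5, 2.0
times = []
while len(times) < 500:
    t = rng.expovariate(true_lam)
    if t <= T:
        times.append(t)
trunc = [T] * len(times)
grid = [i / 100 for i in range(5, 300)]
naive = len(times) / sum(times)       # ignores truncation: biased upward
adjusted = fit_rate(times, trunc, grid)
```

The naïve estimate is pulled well above the true rate because only early onsets reach the database; the truncation-adjusted likelihood removes most of that bias.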
Although the methods for conducting systematic reviews of efficacy are well established, there is much less guidance on how systematic reviews of adverse effects should be performed. In order to determine where methodological research is most needed to improve systematic reviews of adverse effects of health care interventions, we conducted a descriptive analysis of systematic reviews published between 1994 and 2005. We searched the Database of Abstracts of Reviews of Effects (DARE) and The Cochrane Database of Systematic Reviews (CDSR) to identify systematic reviews in which the primary outcome was an adverse effect or effects. We then extracted data on many of the elements of the systematic review process, including: types of interventions studied, adverse effects of interest, resources searched, search strategies, data sources included in reviews, quality assessment of primary data, nature of the data analysis, and source of funding. 256 reviews were included in our analysis, the majority of which evaluated drug interventions and pre-specified the adverse effect or effects of interest. A median of three resources was searched for each review, and very few reviews (13/256) provided sufficient information to reproduce their search strategies. Although more than three quarters (185/243) of the reviews sought to include data from sources other than randomised controlled trials, fewer than half (106/256) assessed the quality of the included studies. Data were pooled quantitatively in most of the reviews (165/256), but heterogeneity was not always considered. Fewer than half (123/256) of the reviews reported the source of funding. There is an obvious need to improve the methodology and reporting of systematic reviews of adverse effects. The methodology around identification and quality assessment of primary data is the main concern.
Disease-specific mortality is often used as an outcome rather than total mortality in clinical trials. This approach assumes that the classification of cause of death is unbiased. We explored whether using fungal infection-related mortality as the outcome rather than total mortality leads to bias in trials of antifungal agents in cancer patients. As an estimate of bias we used the relative risk of death among those patients the authors considered had not died from fungal infection. Our sample consisted of 69 trials included in four systematic reviews of prophylactic or empirical antifungal treatment in patients with cancer and neutropenia that we have published previously. Thirty trials met the inclusion criteria. The trials comprised 6,130 patients and 869 deaths, 220 (25%) of which were ascribed to fungal infection. The relative risk of death was 0.85 (95% CI 0.75-0.96) for total mortality, 0.57 (95% CI 0.44-0.74) for fungal mortality, and 0.95 (95% CI 0.82-1.09) for mortality among those who did not die from fungal infection. We could not support the hypothesis that the use of disease-specific mortality introduces bias in antifungal trials in cancer patients, as our estimate of the relative risk of mortality among those who survived the fungal infection was not increased. We conclude that it appears reliable to use fungal mortality as the primary outcome in trials of antifungal agents. Data on total mortality should be reported as well, however, to guard against the possible introduction of harmful treatments.
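The comparison rests on relative risks with 95% confidence intervals. A generic sketch using the standard Wald interval on the log scale (the event counts below are hypothetical, not taken from the trials):

```python
import math

def relative_risk(events1, n1, events2, n2):
    """Relative risk of an event between two arms, with a Wald 95% CI
    computed on the log scale."""
    rr = (events1 / n1) / (events2 / n2)
    se = math.sqrt(1 / events1 - 1 / n1 + 1 / events2 - 1 / n2)
    lo = math.exp(math.log(rr) - 1.96 * se)
    hi = math.exp(math.log(rr) + 1.96 * se)
    return rr, lo, hi

# Hypothetical trial: 50/500 deaths on treatment vs. 60/500 on control
rr, lo, hi = relative_risk(50, 500, 60, 500)
```

A confidence interval that spans 1, as in this example and in the paper's 0.95 (0.82-1.09) estimate for non-fungal deaths, is consistent with no effect in that subgroup.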
Log-linear association models have been extensively used to investigate the pattern of agreement between ordinal ratings. In 2007, log-linear non-uniform association models were introduced to estimate, from a cross-classification of two independent raters using an ordinal scale, varying degrees of distinguishability between distant and adjacent categories of the scale. In this paper, a simple simulation-based method is proposed to estimate the power of non-uniform association models to detect heterogeneities across distinguishabilities between adjacent categories of an ordinal scale, illustrating some possible scale defects. Different scenarios of distinguishability patterns were investigated, as well as different scenarios of marginal heterogeneity within raters. For a sample size of N = 50, the probabilities of detecting heterogeneities within the tables are lower than .80, whatever the number of categories. In addition, even for large samples, marginal heterogeneities within raters led to a decrease in power estimates. This paper provides guidance on how many objects must be classified by two independent observers (or by the same observer at two different times) to be able to detect a given scale-structure defect. Our results also highlight the importance of marginal homogeneity within raters to ensure optimal power when using non-uniform association models.
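The simulation-based logic generalizes beyond association models: power is estimated as the fraction of simulated datasets in which the test of interest rejects the null. As a deliberately simplified stand-in for the paper's setting (a two-sample z-test for proportions with arbitrary parameters, not the authors' log-linear models):

```python
import math
import random

def power_by_simulation(n, p1, p2, nsim=2000, z_crit=1.959964, seed=7):
    """Monte-Carlo power of a two-sample pooled z-test for proportions:
    the fraction of simulated datasets in which the test rejects at 5%."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(nsim):
        x1 = sum(rng.random() < p1 for _ in range(n))
        x2 = sum(rng.random() < p2 for _ in range(n))
        p_pool = (x1 + x2) / (2 * n)
        se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
        if se > 0 and abs(x1 / n - x2 / n) / se > z_crit:
            rejections += 1
    return rejections / nsim

# Power grows with sample size for a fixed effect (p1 = 0.5 vs p2 = 0.7)
p_small = power_by_simulation(50, 0.5, 0.7)
p_large = power_by_simulation(200, 0.5, 0.7)
```

Running such a loop over a grid of sample sizes answers the paper's practical question: how many objects must be rated before a given defect is detectable with power of at least .80.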
