
Test–retest reliability of evoked BOLD signals from a cognitive–emotive fMRI test battery


Abstract and Figures

Even more than in cognitive research applications, moving fMRI to the clinic and the drug development process requires the generation of stable and reliable signal changes. The performance characteristics of the fMRI paradigm constrain experimental power and may require different study designs (e.g., crossover vs. parallel groups), yet fMRI reliability characteristics can be strongly dependent on the nature of the fMRI task. The present study investigated both within-subject and group-level reliability of a combined three-task fMRI battery targeting three systems of wide applicability in clinical and cognitive neuroscience: an emotional (face matching), a motivational (monetary reward anticipation) and a cognitive (n-back working memory) task. A group of 25 young, healthy volunteers were scanned twice on a 3T MRI scanner with a mean test-retest interval of 14.6 days. FMRI reliability was quantified using the intraclass correlation coefficient (ICC) applied at three different levels ranging from a global to a localized and fine spatial scale: (1) reliability of group-level activation maps over the whole brain and within targeted regions of interest (ROIs); (2) within-subject reliability of ROI-mean amplitudes and (3) within-subject reliability of individual voxels in the target ROIs. Results showed robust evoked activation of all three tasks in their respective target regions (emotional task=amygdala; motivational task=ventral striatum; cognitive task=right dorsolateral prefrontal cortex and parietal cortices) with high effect sizes (ES) of ROI-mean summary values (ES=1.11-1.44 for the faces task, 0.96-1.43 for the reward task, 0.83-2.58 for the n-back task). Reliability of group level activation was excellent for all three tasks with ICCs of 0.89-0.98 at the whole brain level and 0.66-0.97 within target ROIs. 
Within-subject reliability of ROI-mean amplitudes across sessions was fair to good for the reward task (ICCs=0.56-0.62) and, depending on the particular ROI, also fair to good for the n-back task (ICCs=0.44-0.57), but lower for the faces task (ICCs from -0.02 to 0.16). In conclusion, all three tasks are well suited to between-subject designs, including imaging genetics. When specific recommendations are followed, the n-back and reward tasks are also suited for within-subject designs, including pharmaco-fMRI. The present study provides task-specific fMRI reliability performance measures that will inform the optimal use, powering and design of fMRI studies using comparable tasks.
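The abstract reports ICCs but does not say which ICC form was used. As a minimal sketch, assuming the two-way single-measure forms of Shrout and Fleiss, the following computes ICC(2,1) (absolute agreement) and ICC(3,1) (consistency) from a subjects-by-sessions matrix of ROI-mean amplitudes. The function name `icc_two_way` and the example values are hypothetical and are not taken from the study.

```python
import numpy as np

def icc_two_way(scores):
    """Return (ICC(2,1), ICC(3,1)) for an (n_subjects, k_sessions) array.

    Uses the mean squares of a two-way ANOVA without replication,
    following the Shrout and Fleiss definitions.
    """
    y = np.asarray(scores, dtype=float)
    n, k = y.shape
    grand = y.mean()

    # Sums of squares for subjects (rows), sessions (columns) and residual.
    ss_rows = k * ((y.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((y.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((y - grand) ** 2).sum() - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    # ICC(2,1): session effects count as error (absolute agreement).
    icc21 = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
    # ICC(3,1): session effects are ignored (consistency).
    icc31 = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    return icc21, icc31

# Hypothetical ROI-mean amplitudes: 5 subjects, 2 sessions.
roi_means = [[0.8, 0.7], [1.2, 1.0], [0.3, 0.5], [0.9, 1.1], [0.6, 0.4]]
icc21, icc31 = icc_two_way(roi_means)
```

ICC(3,1) stays high when every subject shifts by the same amount between sessions, whereas ICC(2,1) penalizes such a shift, so the choice between them depends on whether a systematic session effect should count against reliability.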
... In order to confidently relate neural activity to psychological constructs of interest and apply research findings to clinical settings, such as establishing neural biomarkers of certain disorders, the observed neural activity from fMRI must be reliably stable. The overall degree of fMRI test-retest reliability can vary between task types (Holiga et al., 2018), specific conditions of a task and the contrasts of interest (Raemaekers et al., 2007; Fröhner et al., 2019; Heckendorf et al., 2019; McDermott et al., 2020), as well as the brain region of interest [ROI; (Plichta et al., 2012; Li et al., 2020; Morales et al., 2020; Korucuoglu et al., 2021)]. Low reliability in imaging research limits inferences that relate individual difference measures to fMRI activation (Zeynep Enkavi et al., 2019). ...
... Across various tasks, measures of reliability, such as ICC, are greater at the group level than the individual level. This finding has been demonstrated during memory encoding tasks (Brandt et al., 2013; Holiga et al., 2018; Bossier et al., 2020), an intertemporal choice task (Fröhner et al., 2019), emotional face tasks (Plichta et al., 2012; Holiga et al., 2018; McDermott et al., 2020), an antisaccade paradigm (Raemaekers et al., 2007), reward-related tasks (Plichta et al., 2012; Holiga et al., 2018), N-back working memory tasks (Plichta et al., 2012; Holiga et al., 2018), a theory of mind task, and a response inhibition task (Holiga et al., 2018). It is currently unclear whether low reliability at the individual level is related to methodologies (e.g., measure of neural activation, imaging analysis) being used to calculate reliability or if fMRI is an inherently unreliable measure of neural activity. ...
Article
Test-retest reliability of fMRI is often assessed using the intraclass correlation coefficient (ICC), a numerical representation of reliability. Reports of low reliability at the individual level may be attributed to analytical approaches and inherent bias/error in the measures used to calculate ICC. It is unclear whether low reliability at the individual level is related to methodological decisions or if fMRI is inherently unreliable. The purpose of this study was to investigate methodological considerations when calculating ICC to improve understanding of fMRI reliability. fMRI data were collected from adolescent females (N = 23) at pre- and post-cognitive behavioral therapy. Participants completed an emotion processing task during fMRI. We calculated ICC values using contrasts and β coefficients separately from voxelwise and network (ICA) analyses of the task-based fMRI data. For both voxelwise analysis and ICA, ICC values were higher when calculated using β coefficients. This work provides support for the use of β coefficients over contrasts when assessing reliability of fMRI, and the use of contrasts may underlie low reliability estimates reported in the existing literature. Continued research in this area is warranted to establish fMRI as a reliable measure to draw conclusions and utilize fMRI in clinical settings.
... The sample size may be an important issue in neuromarketing research (Lim et al., 2019). In fact, the sample size was much smaller in fMRI neuroimaging studies than in other fields of study, mainly due to the financial burden of running studies, but showed acceptable levels of test-retest reliability (Bennett and Miller, 2010; Plichta et al., 2012). In a thorough evaluation of sample sizes of fMRI publications, the median sample size was 14.5, and high-impact neuroimaging journals in 2018 had a median sample size of 24 (Szucs and Ioannidis, 2020). ...
Article
Price and customer ratings are perhaps the two most important pieces of information consumers rely on when shopping online. This study aimed to elucidate the neural mechanism by which the introduction of these two types of information influences the purchase intention of potential consumers for hedonic products. Participants performed a lip-care product shopping task during functional magnetic resonance imaging, in which they re-disclosed purchase intentions referring to the information of price or rating provided about the products that they had previously disclosed their purchase intentions without any information. Data from 38 young female participants were analyzed to identify the underlying neural regions associated with the intention change and product information. The bilateral frontopolar cortex, bilateral dorsal anterior cingulate cortex (dACC), and left insula activated higher for the unchanged than changed intention condition. The right dACC and bilateral insula also activated more toward the price than the rating condition, whereas the medial prefrontal cortex and bilateral temporoparietal junction responded in the opposite direction. These results seem to reflect the shift to exploratory decision-making strategies and increased salience in maintaining purchase intentions despite referring to provided information and to highlight the involvement of social cognition-related regions in reference to customer ratings rather than price.
... Reports regarding this discrepancy between group-level and individual-level longitudinal reliability were recently highlighted for a number of (classic) experimental paradigms (Fröhner, Teckentrup, Smolka, & Kroemer, 2019;Hedge, Powell, & Sumner, 2018;Herting, Gautam, Chen, Mezher, & Vetter, 2018;Plichta et al., 2012). Our results add fear conditioning and extinction as assessed by SCRs and BOLD fMRI to this list and have important implications for . ...
Preprint
Here we follow the call to target measurement reliability as a key prerequisite for individual-level predictions in translational neuroscience by investigating i) longitudinal reliability at the individual and ii) group level, iii) cross-sectional reliability and iv) response predictability across experimental phases. 120 individuals performed a fear conditioning paradigm twice, six months apart. Analyses of skin conductance responses, fear ratings and BOLD-fMRI with different data transformations and included numbers of trials were conducted. While longitudinal reliability was generally poor to moderate at the individual level, it was good for acquisition but not extinction at the group level. Cross-sectional reliability was satisfactory. Higher responding in preceding phases predicted higher responding in subsequent experimental phases at a weak to moderate level depending on data specifications. In sum, the results suggest the feasibility of individual-level predictions for (very) short time intervals (e.g., cross-phases) while predictions for longer time intervals may be problematic.
... In prior studies, the reliability of fMRI face-emotion paradigms varied by task condition. For example, prior work typically finds moderate reliability for face vs. baseline contrasts, but poor reliability for contrasts between specific face-emotion types, for example, angry vs. neutral (Plichta et al., 2012; Sauder, Hajcak, Angstadt, & Phan, 2013; van den Bulk et al., 2013; White et al., 2016). The current study utilizes two tasks that differ in their cognitive demands. ...
Article
Assessing and improving test-retest reliability is critical to efforts to address concerns about replicability of task-based functional magnetic resonance imaging. The current study uses two statistical approaches to examine how scanner and task-related factors influence reliability of neural response to face-emotion viewing. Forty healthy adult participants completed two face-emotion paradigms at up to three scanning sessions across two scanners of the same build over approximately 2 months. We examined reliability across the main task contrasts using Bayesian linear mixed-effects models performed voxel-wise across the brain. We also used a novel Bayesian hierarchical model across a predefined whole-brain parcellation scheme and subcortical anatomical regions. Scanner differences accounted for minimal variance in temporal signal-to-noise ratio and task contrast maps. Regions activated during task at the group level showed higher reliability relative to regions not activated significantly at the group level. Greater reliability was found for contrasts involving conditions with clearly distinct visual stimuli and associated cognitive demands (e.g., face vs. nonface discrimination) compared to conditions with more similar demands (e.g., angry vs. happy face discrimination). Voxel-wise reliability estimates tended to be higher than those based on predefined anatomical regions. This work informs attempts to improve reliability in the context of task activation patterns and specific task contrasts. Our study provides a new method to estimate reliability across a large number of regions of interest and can inform researchers' selection of task conditions and analytic contrasts.
... However, correlations in VS activation between complete twin pairs and trending associations for monozygotic twin pairs suggest there is at least some (small) reliable signal of VS activation in those data. Consistent with studies on shortterm reliability of the neural response to reward in adults [45], the signal of VS activation seems to be more reliable within twin pairs during the total win compared to neutral contrast (i.e., when not combining the effects of both win and loss in a single contrast). Therefore, our study is contributing to the growing literature that suggests some contrasts, particularly those with a single active condition [42], may be more reliable than others. ...
Article
Adolescence is a period of increased risk-taking behavior, thought to be driven, in part, by heightened reward sensitivity. One challenge of studying reward processing in the field of developmental neuroscience is finding a task that activates reward circuitry, and is short, not too complex, and engaging for youth of a wide variety of ages and socioeconomic backgrounds. In the present study, we tested a brief child-friendly reward task for activating reward circuitry in two independent samples of youth ages 7-19 years old enriched for poverty (study 1: n = 464; study 2: n = 27). The reward task robustly activated the ventral striatum, with activation decreasing from early to mid-adolescence and increasing from mid- to late adolescence in response to reward. This response did not vary by gender, pubertal development, or income-to-needs ratio, making the task applicable for a wide variety of populations. Additionally, ventral striatum activation to the task did not differ between youth who did and did not expect to receive a prize at the end of the task, indicating that an outcome of points alone may be enough to engage reward circuitry. Thus, this reward task is effective for studying reward processing in youth from different socioeconomic backgrounds.
Article
Background: Major depressive disorder (MDD) is a prevalent neuropsychiatric illness for which it is important to resolve underlying brain mechanisms. Current treatments are often unsuccessful, precipitating a need to identify predictive markers. Aim: We evaluated (1) alterations in brain responses to an emotional faces functional magnetic resonance imaging (fMRI) paradigm in individuals with MDD, compared to controls, (2) whether pretreatment brain responses predicted antidepressant treatment response, and (3) pre-post change in brain responses following treatment. Methods: Eighty-nine medication-free, depressed individuals and 115 healthy controls completed the fMRI paradigm. Depressed individuals completed a nonrandomized, open-label, 8-week treatment with escitalopram, including the option to switch to duloxetine after 4 weeks. We examined patient-control group differences in regional fMRI responses at baseline, whether baseline fMRI responses predicted treatment response at 8 weeks, including early life stress moderating effects, and change in fMRI responses in 36 depressed individuals rescanned following 8 weeks of treatment. Results: Task reaction time was 5% slower in patients. Multiple brain regions showed significant task-related responses, but we observed no statistically significant patient-control group differences (Cohen's d < 0.35). Patient pretreatment brain responses did not predict antidepressant treatment response (area under the curve of the receiver operator characteristic (AUC-ROC) < 0.6) and brain responses were not statistically significantly changed after treatment (Cohen's d < 0.33). Conclusion: This represents the largest prediction study to date examining emotional faces fMRI features as predictors of antidepressant treatment response. 
Brain response to this fMRI emotional faces paradigm did not distinguish depressed individuals from healthy controls, nor was it predictive of antidepressant treatment response. Clinical Trial Registration: Site: https://clinicaltrials.gov, Trial Number: NCT02869035, Trial Title: Treatment Outcome in Major Depressive Disorder.
Article
Trait stability of measures is an essential requirement for individual differences research. Functional MRI has been increasingly used in studies that rely on the assumption of trait stability, such as attempts to relate task-related brain activation to individual differences in behavior and psychopathology. However, recent research using adult samples has questioned the trait stability of task-fMRI measures, as assessed by test-retest correlations. To date, little is known about trait stability of task fMRI in children. Here, we examined within-session reliability and long-term stability of individual differences in task-fMRI measures using fMRI measures of brain activation provided by the Adolescent Brain Cognitive Development (ABCD) Study Release v4.0 as an individual's average regional activity, using its tasks focused on reward processing, response inhibition, and working memory. We also evaluated the effects of factors potentially affecting reliability and stability. Reliability and stability (quantified as the ratio of non-scanner related stable variance to all variances) were poor in virtually all brain regions, with an average value of .088 and .072 for short term (within-session) reliability and long-term (between-session) stability, respectively, in regions of interest (ROIs) historically recruited by the tasks. Only one reliability or stability value in ROIs exceeded the ‘poor’ cut-off of .4, and values rarely exceeded even .2 (only 4.9%). Motion had a pronounced effect on estimated reliability/stability, with the lowest motion quartile of participants having a mean reliability/stability 2.5 times higher (albeit still ‘poor’) than the highest motion quartile. Poor reliability and stability of task-fMRI, particularly in children, diminishes potential utility of fMRI data due to a drastic reduction of effect sizes and, consequently, statistical power for the detection of brain-behavior associations. 
This essential issue urgently needs to be addressed through optimization of task design, scanning parameters, data acquisition protocols, preprocessing pipelines, and data denoising methods.
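The link between poor reliability and reduced effect sizes can be illustrated with the classical attenuation relationship from psychometrics, in which the expected observed correlation equals the true correlation scaled by the square root of the product of the two reliabilities. The sketch below is illustrative only: the function name, the true correlation of 0.5, and the behavioral reliability of 0.8 are assumptions; only the .088 average reliability comes from the abstract above.

```python
import math

def attenuated_correlation(true_r, reliability_x, reliability_y):
    """Expected observed correlation under the classical attenuation formula.

    observed_r = true_r * sqrt(reliability_x * reliability_y)
    """
    for rel in (reliability_x, reliability_y):
        if not 0.0 <= rel <= 1.0:
            raise ValueError("reliabilities must lie in [0, 1]")
    return true_r * math.sqrt(reliability_x * reliability_y)

# 0.088 is the average within-session reliability reported in the abstract.
# true_r = 0.5 and a behavioral reliability of 0.8 are hypothetical values.
observed = attenuated_correlation(0.5, 0.088, 0.8)
```

Under these assumed inputs, a sizeable true brain-behavior association shrinks to a small observed correlation, which in turn raises the sample size needed to detect it.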
Article
The dominant approach in investigating the individual reliability for event-related potentials (ERPs) is to extract peak-related features at electrodes showing the strongest group effects. Such a peak-based approach implicitly assumes ERP components showing a stronger group effect are also more reliable, but this assumption has not been substantially validated and few studies have investigated the reliability of ERPs beyond peaks. In this study, we performed a rigorous evaluation of the test-retest reliability of ERPs collected in a multisensory and cognitive experiment from 82 healthy adolescents, each having two sessions. By comparing group effects and individual reliability, we found that a stronger group-level response in ERPs did not guarantee higher reliability. A perspective of neural oscillation should be adopted for the analysis of reliability. Further, by simulating ERPs with an oscillation-based computational model, we found that the consistency between group-level ERP responses and individual reliability was modulated by inter-subject latency jitter and inter-trial variability. The current findings suggest that the conventional peak-based approach may underestimate the individual reliability in ERPs and a neural oscillation perspective on ERP reliability should be considered. Hence, a comprehensive evaluation of the reliability of ERP measurements should be considered in individual-level neurophysiological trait evaluation and psychiatric disorder diagnosis.
Article
Neurobiological factors contributing to violence in humans remain poorly understood. One approach to this question is examining allelic variation in the X-linked monoamine oxidase A (MAOA) gene, previously associated with impulsive aggression in animals and humans. Here, we have studied the impact of a common functional polymorphism in MAOA on brain structure and function assessed with MRI in a large sample of healthy human volunteers. We show that the low expression variant, associated with increased risk of violent behavior, predicted pronounced limbic volume reductions and hyperresponsive amygdala during emotional arousal, with diminished reactivity of regulatory prefrontal regions, compared with the high expression allele. In men, the low expression allele is also associated with changes in orbitofrontal volume, amygdala and hippocampus hyperreactivity during aversive recall, and impaired cingulate activation during cognitive inhibition. Our data identify differences in limbic circuitry for emotion regulation and cognitive control that may be involved in the association of MAOA with impulsive aggression, suggest neural systems-level effects of X-inactivation in human brain, and point toward potential targets for a biological approach toward violence.
Article
Historically, reproducibility has been the sine qua non of experimental findings that are considered to be scientifically useful. Typically, findings from functional magnetic resonance imaging (fMRI) studies are assessed with statistical parametric maps (SPMs) using a p value threshold. However, a smaller p value does not imply that the observed result will be reproducible. In this study, we suggest interpreting SPMs in conjunction with reproducibility evidence. Reproducibility is defined as the extent to which the active status of a voxel remains the same across replicates conducted under the same conditions. We propose a methodology for assessing reproducibility in functional MR images without conducting separate experiments. Our procedures include the empirical Bayes method for estimating effects due to experimental stimuli, the threshold optimization procedure for assigning voxels to the active status, and the construction of reproducibility maps. In an empirical example, we implemented the proposed methodology to construct reproducibility maps based on data from the study by Ishai et al. (2000). The original experiments involved 12 human subjects and investigated brain regions most responsive to visual presentation of 3 categories of objects: faces, houses, and chairs. The brain regions identified included occipital, temporal, and fusiform gyri. Using our reproducibility analysis, we found that subjects in one of the experiments exercised at least 2 mechanisms in responding to visual objects when performing alternately matching and passive tasks. One gave activation maps closer to those reported in Ishai et al., and the other had related regions in the precuneus and posterior cingulate. The patterns of activated regions are reproducible for at least 4 out of 6 subjects involved in the experiment. Empirical application of the proposed methodology suggests that human brains exhibit different strategies to accomplish experimental tasks when responding to stimuli. 
It is important to correlate activations to subjects' behavior such as reaction time and response accuracy. Also, the latency between the stimulus presentation and the peak of the hemodynamic response function varies considerably among individual subjects according to types of stimuli and experimental tasks. These variations per se also deserve scientific inquiries. We conclude by discussing research directions relevant to reproducibility evidence in fMRI.
Article
A Statistical Model for Reliability; Some Consequences of Unreliability; The Simple Replication Reliability Study; The Control of Unreliability by Replication; The Interexaminer Reliability Study
Article
Findings from animal as well as human neuroimaging studies suggest that reward delivery is associated with the activation of subcortical limbic and prefrontal brain regions, including the thalamus, the striatum, the anterior cingulate and the prefrontal cortex. The aim of the present study was to explore if these reward-sensitive regions are also activated during the anticipation of reinforcers that vary with regard to their motivational value. A differential conditioning paradigm was performed, with the presentation of a rewarded reaction time task serving as the unconditioned stimulus (US). Depending on their reaction time, subjects were given (or not given) a monetary reward, or were presented with a verbal feedback consisting of being fast or slow. In a third control condition no task needed to be executed. Each of the three conditions was introduced by a different visual cue (CS). Brain activation of 27 subjects was recorded using event-related functional magnetic resonance imaging. The results showed significant activation of the substantia nigra, thalamic, striatal, and orbitofrontal brain regions as well as of the insula and the anterior cingulate during the presentation of a CS signalling a rewarded task. The anticipation of a monetary reward produced stronger activation in these regions than the anticipation of positive verbal feedback. The results are interpreted as reflecting the motivation-dependent reactivity of the brain reward system with highly motivating stimuli (monetary reward) leading to a stronger activation than those less motivating ones (verbal reward).