Fig. 3
Value function and its RPE estimated using the reinforcement learning model. (A) Plots of V(p, m)_{t=10,000} against the mathematically defined expected value, i.e., probability times magnitude. (B) r_t − V(p, m)_{t=10,000} after reward, plotted against the positive component of the RPE, i.e., obtained reward magnitude minus the expected value. (C) r_t − V(p, m)_{t=10,000} after no reward (hence r_t is zero), plotted against the negative component of the RPE, i.e., zero minus the expected value. Plots were made for all stimuli as a function of different learning rates. r is the correlation coefficient.
Source publication
Research in the multidisciplinary field of neuroeconomics has mainly been driven by two influential theories regarding human economic choice: prospect theory, which describes decision-making under risk, and reinforcement learning theory, which describes learning for decision-making. We hypothesized that these two distinct theories guide decision-making ...
Contexts in source publication
Context 1
... models). We simulated this reinforcement learning model using different learning rates for each of the 100 lotteries over 10,000 trials in a monkey experimental setting. After 10,000 trials, which was sufficient to fully learn the lottery value, the algorithm produced lottery valuations, V(m, p)_{t=10,000}, which converged to the expected value (Fig. 3A). There were slight deviations in the predictions from the expected value (diagonal line) due to the RPE, which yielded trial-by-trial dynamics in V(m, p)_t even after substantial learning, and the extent of these fluctuations was causally related to the learning rate (Fig. 3A, comparison between panels). It was also clear that these ...
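A rough sketch of this kind of simulation is shown below (Python). The lottery grid, the learning-rate values, and the delta-rule form of the update, V ← V + A·(r − V), are assumptions for illustration rather than the exact published setup. It learns each lottery's value over 10,000 trials, then correlates V_{10,000} with the expected value p·m (panel A analogue) and the post-outcome RPEs with their theoretical counterparts m − EV and 0 − EV (panel B and C analogues).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical lottery grid: 10 probabilities x 10 magnitudes = 100 lotteries
probs = np.linspace(0.1, 1.0, 10)
mags = np.linspace(0.1, 1.0, 10)
lotteries = [(p, m) for p in probs for m in mags]

def simulate_lottery(p, m, learning_rate, n_trials=10_000):
    """Delta-rule value learning for one lottery; returns the final value and the last RPEs."""
    V = 0.0
    last_pos_rpe = last_neg_rpe = np.nan
    for _ in range(n_trials):
        r = m if rng.random() < p else 0.0    # stochastic all-or-nothing reward
        rpe = r - V                           # reward prediction error, r_t - V_t
        if r > 0:
            last_pos_rpe = rpe                # RPE on a rewarded trial (panel B analogue)
        else:
            last_neg_rpe = rpe                # RPE on a no-reward trial (panel C analogue)
        V += learning_rate * rpe              # value update
    return V, last_pos_rpe, last_neg_rpe

for A in (0.01, 0.1, 0.3):                    # assumed learning rates for the panel comparison
    V_final, ev, pos_rpe, pos_ref, neg_rpe, neg_ref = [], [], [], [], [], []
    for p, m in lotteries:
        V, rp, rn = simulate_lottery(p, m, A)
        V_final.append(V)
        ev.append(p * m)
        if not np.isnan(rp):
            pos_rpe.append(rp)
            pos_ref.append(m - p * m)         # obtained reward minus expected value
        if not np.isnan(rn):
            neg_rpe.append(rn)
            neg_ref.append(0.0 - p * m)       # zero minus expected value
    print(f"A={A}: corr(V_10000, EV)={np.corrcoef(V_final, ev)[0, 1]:.3f}, "
          f"corr(+RPE)={np.corrcoef(pos_rpe, pos_ref)[0, 1]:.3f}, "
          f"corr(-RPE)={np.corrcoef(neg_rpe, neg_ref)[0, 1]:.3f}")
```

With smaller learning rates the final valuations hug the diagonal more tightly, which mirrors the panel-to-panel comparison described above.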
Context 2
... V(m, p)_{t=10,000}, which converged to the expected value (Fig. 3A). There were slight deviations in the predictions from the expected value (diagonal line) due to the RPE, which yielded trial-by-trial dynamics in V(m, p)_t even after substantial learning, and the extent of these fluctuations was causally related to the learning rate (Fig. 3A, comparison between panels). It was also clear that these deviations were observed in both the positive and negative components of the RPE (Fig. 3, B and C), as they were estimated from r_t − V(m, p)_t. This simple simulation exercise demonstrated how the typical reinforcement learning model captured the trial-by-trial dynamics of ...
Context 3
... (diagonal line) due to the RPE, which yielded trial-by-trial dynamics in V(m, p)_t even after substantial learning, and the extent of these fluctuations was causally related to the learning rate (Fig. 3A, comparison between panels). It was also clear that these deviations were observed in both the positive and negative components of the RPE (Fig. 3, B and C), as they were estimated from r_t − V(m, p)_t. This simple simulation exercise demonstrated how the typical reinforcement learning model captured the trial-by-trial dynamics of gambling behavior in our experiments using V(m, p)_t. Reinforcement learning is limited in that it does not reveal whether utility curvature and/or ...
Context 4
... we checked how often choosers with separate RPE preferences are correctly identified according to the BIC. We found that, of our 40 simulated choosers, none were ever mistaken for EU choosers. However, we would mistakenly classify eight as P2 according to the BIC. For these simulated participants, the BIC differences between the models were minimal (fig. S3). Only one of them would be classified as a P2 chooser according to the Akaike Information Criterion (AIC), which applies a lower penalty for extra parameters. Overall, we concluded that it is very unlikely that we would classify EU or P2 choosers as separate RPE choosers, but it is possible to sometimes mistake separate RPE choosers for P2 ...
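For reference, the two model-selection criteria mentioned here have standard forms, BIC = k·ln(n) − 2·ln(L̂) and AIC = 2·k − 2·ln(L̂), so the BIC penalizes extra parameters more heavily whenever ln(n) > 2. A minimal sketch of such a comparison follows; the model names are taken from the text, but the log-likelihoods, parameter counts, and trial count are placeholders, not the study's actual fits.

```python
import numpy as np

def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion (lower is better)."""
    return n_params * np.log(n_obs) - 2.0 * log_lik

def aic(log_lik, n_params):
    """Akaike Information Criterion (lower is better); weaker penalty than BIC once n exceeds ~7."""
    return 2.0 * n_params - 2.0 * log_lik

# Placeholder fits for one simulated chooser (values are purely illustrative)
fits = {
    "EU":           {"log_lik": -412.0, "n_params": 2},
    "P2":           {"log_lik": -398.5, "n_params": 4},
    "separate RPE": {"log_lik": -396.0, "n_params": 6},
}
n_trials = 300   # assumed number of choice trials per participant

for name, f in fits.items():
    print(f"{name:>12}: BIC={bic(f['log_lik'], f['n_params'], n_trials):.1f}, "
          f"AIC={aic(f['log_lik'], f['n_params']):.1f}")

best_bic = min(fits, key=lambda k: bic(fits[k]["log_lik"], fits[k]["n_params"], n_trials))
best_aic = min(fits, key=lambda k: aic(fits[k]["log_lik"], fits[k]["n_params"]))
print("selected by BIC:", best_bic, "| selected by AIC:", best_aic)
```

With these placeholder numbers the BIC prefers P2 while the AIC prefers the separate RPE model, illustrating how the two criteria can disagree for small likelihood differences.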
Context 5
... that V(m, p)_0 = 0, we simulated the TD algorithm for 10,000 experimental trials for each lottery using different values of the learning rate A. In Fig. 3, we plot V(m, p)_{10,000}, which is the lottery valuation at which the algorithm arrived, against its expected value. Ultimately, the algorithm values the lotteries close to their expected values, particularly at low learning rates. The RPE estimated using the TD algorithm is close to the obtained reward minus the expected value. V(m, ...
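The claim that the valuations settle near the expected value, with fluctuations that grow with the learning rate, follows directly from the update rule. A short derivation, assuming the standard delta-rule form of the update and an i.i.d. all-or-nothing reward (consistent with the description above, though not spelled out in this excerpt):

```latex
% Delta-rule update with learning rate A and stochastic reward r_t \in \{m, 0\}:
%   V_{t+1} = V_t + A (r_t - V_t) = (1 - A) V_t + A r_t
\begin{align}
\mathbb{E}[V_{t+1}] &= (1-A)\,\mathbb{E}[V_t] + A\,\mathbb{E}[r_t]
  \;\xrightarrow{\;t\to\infty\;}\; \mathbb{E}[r_t] = p\,m
  && \text{(mean converges to the expected value)}\\[4pt]
\operatorname{Var}(V_\infty) &= \frac{A^{2}\operatorname{Var}(r_t)}{1-(1-A)^{2}}
  = \frac{A}{2-A}\,p(1-p)\,m^{2}
  && \text{(fluctuations grow with the learning rate } A\text{)}
\end{align}
```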
Citations
... Our definition of CPT-value employs a kind of "static" semantics, in the sense that we fix a complete strategy for the whole MDP and then view it as a prospect. A "dynamic" semantics is also of interest, where the current state and the history that led up to it affect the decision; such a dynamic prospect theory has recently been proposed [53]. This comes with several complications and design choices to be made, as discussed in [1, App. ...
Cumulative prospect theory (CPT) is the first theory for decision-making under uncertainty that combines full theoretical soundness and empirically realistic features [P.P. Wakker - Prospect theory: For risk and ambiguity, Page 2]. While CPT was originally considered in one-shot settings for risk-aware decision-making, we consider CPT in sequential decision-making. The most fundamental and well-studied models for sequential decision-making are Markov chains (MCs) and their generalization, Markov decision processes (MDPs). The complexity-theoretic study of MCs and MDPs with CPT is a fundamental problem that has not been addressed in the literature. Our contributions are as follows: First, we present an alternative viewpoint for the CPT-value of MCs and MDPs. This allows us to establish a connection with multi-objective reachability analysis and conclude the strategy complexity result that memoryless randomized strategies are necessary and sufficient for optimality. Second, based on this connection, we provide an algorithm for computing the CPT-value in MDPs with infinite-horizon objectives. We show that the problem is in EXPTIME and fixed-parameter tractable. Moreover, we provide a polynomial-time algorithm for the special case of MCs.
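As background for the "CPT-value" analyzed in this work: in the one-shot setting, CPT values a finite prospect by applying an S-shaped value function to gains and losses and rank-dependent (cumulative) probability weighting. A minimal sketch is below, using the Tversky-Kahneman (1992) functional forms and commonly cited parameter estimates as an assumed example; it is not the MDP algorithm described in the abstract.

```python
import numpy as np

def weight(p, gamma):
    """Tversky-Kahneman (1992) probability weighting function."""
    return p**gamma / (p**gamma + (1.0 - p)**gamma) ** (1.0 / gamma)

def cpt_value(outcomes, probs, alpha=0.88, lam=2.25, gamma_gain=0.61, gamma_loss=0.69):
    """CPT value of a finite one-shot prospect via cumulative (rank-dependent) weighting.

    outcomes/probs describe the lottery; parameter values are the commonly cited
    TK-1992 estimates and are an assumption here, not taken from the cited paper.
    """
    outcomes = np.asarray(outcomes, dtype=float)
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(outcomes)                          # rank outcomes worst to best
    x, p = outcomes[order], probs[order]

    v = np.where(x >= 0, x**alpha, -lam * (-x)**alpha)    # S-shaped value function

    # Decision weights: differences of weighted cumulative probabilities,
    # taken from the best outcome down for gains and from the worst up for losses.
    pi = np.zeros_like(p)
    gains, losses = x >= 0, x < 0
    cum_from_best = np.cumsum(p[::-1])[::-1]              # P(outcome >= x_i)
    cum_from_worst = np.cumsum(p)                         # P(outcome <= x_i)
    pi[gains] = weight(cum_from_best[gains], gamma_gain) - \
                weight(np.append(cum_from_best[1:], 0.0)[gains], gamma_gain)
    pi[losses] = weight(cum_from_worst[losses], gamma_loss) - \
                 weight(np.insert(cum_from_worst[:-1], 0, 0.0)[losses], gamma_loss)
    return float(np.sum(pi * v))

# Example: a mixed prospect (lose 50 w.p. 0.3, get 0 w.p. 0.2, gain 100 w.p. 0.5)
print(cpt_value([-50, 0, 100], [0.3, 0.2, 0.5]))
```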
... However, as captured in expected utility theory, decision-makers are usually not indifferent; they have risk preferences. Tversky and Kahneman (1981) introduced these kinds of problems to illustrate critical tests of the divergent predictions of expected utility theory versus prospect theory, both still widely used theories today (e.g., Barberis, 2013; Tymula et al., 2023). Prospect theory predicted gain-loss differences in risk preference, which was thought to rule out expected utility theory in its classic form. ...
Framing effects (risk preferences reverse for gains vs. losses) and the Allais paradox (risk preferences reverse when an option is certain vs. not) are major violations of rational choice theory. In contrast to typical samples, certified public accountants who are competent in working with probabilities and expected values should be an ideal test case for rational choice, especially high scorers on the cognitive reflection test (CRT). Although dual-process theories emphasize numeracy and cognitive reflection, fuzzy-trace theory emphasizes gist-based intuition to explain these effects among cognitively advanced decision-makers. Thus, we recruited a high-numeracy sample of certified public accountants (N = 259) and students (N = 648). We administered classic dread-disease framing, business framing, and Allais paradox problems and the CRT. Each participant received a gain and loss framing problem from different domains (one disease and one business), with presentation order counterbalanced across participants. Order of Allais problems was counterbalanced within participants. Within-participants (cross-domain) framing, between-participants (within-domain) framing, and the Allais paradox were observed for both samples. Accountants did not show domain-specific attenuation (differentially smaller framing) for business problems. Despite large expected-value differences between Allais problem options, accountants’ choices resembled students’ choices. Contrary to dual-process theories, CRT scores were positively related to framing for students (more framing with higher CRT) and inconsistently related for accountants, but high scorers had robust framing effects; high scorers also showed the Allais paradox. Results are consistent with fuzzy-trace theory’s expectation that experts show framing effects because they rely primarily on gist-based intuition, not because they lack numeracy or cognitive reflection.
... Schneider & Day, 2018;Tymula et al., 2023)。 鉴 于反 S 型概率权重的灵活性(Abdellaoui et al., 2010), ...
... This phenomenon results in a concave probability weighting function and constitutes a key component of prospect theory, a fundamental theory in economics [4,10]. Studies have explored the neural mechanisms underlying such probability distortion when outcome probabilities are explicitly shown to the subjects [11,14,15]. ...
Making decisions when outcomes are uncertain requires accurate judgment of the probability of outcomes, yet such judgments are often inaccurate, owing to reliance on heuristics that introduce systematic errors like overweighting of low probabilities. Here, using a decision-making task in which the participants were unaware of outcome probabilities, we discovered that both humans and mice exhibit a rarity-induced decision bias (RIDB), i.e., a preference towards rare rewards, which persists across task performance. Optogenetics experiments demonstrated that activity in the posterior parietal cortex (PPC) is required for the RIDB. Using in vivo electrophysiology, we found that rare rewards bidirectionally modulate choice-encoding PPC neurons to bias subsequent decisions towards rare rewards. Learning enhances stimulus-encoding of PPC neurons, which plays a causal role in stimulus-guided decisions. We then developed a dual-agent behavioural model that successfully recapitulates the decision-making and learning behaviours, and corroborates the specific functions of PPC neurons in mediating decision-making and learning. Thus, beyond expanding understanding of rare probability overweighting to a context where the outcome probability is unknown, and characterizing the neural basis for RIDB in the PPC, our study reveals an evolutionarily conserved heuristic that persistently impacts decision-making and learning under uncertainty.
... The aforementioned limitation introduces several potential problems in neuroeconomic studies (Camerer et al., 2005; Glimcher et al., 2008; Yamada et al., 2021; Imaizumi et al., 2022; Tymula et al., 2023) that employ experimental testing of reward valuation systems for economic choices. When measuring neural activity in the reward circuitry, the subjective value of any reward depends on the physical state of the subject (Nakano et al., 1984; Critchley and Rolls, 1996; de Araujo et al., 2006; Pritchard et al., 2008), even for money (Symmonds et al., 2010). ...
Hunger and thirst drive animals’ consumption behavior and regulate their decision-making concerning rewards. We previously assessed the thirst states of monkeys by measuring blood osmolality under controlled water access and examined how these thirst states influenced their risk-taking behavior in decisions involving fluid rewards. However, hunger assessment in monkeys remains poorly performed. Moreover, the lack of precise measures for hunger states leads to another issue regarding how hunger and thirst states interact with each other in each individual. Thus, when controlling food access to motivate performance, it remains unclear how these two physiological needs are satisfied in captive monkeys. Here, we measured blood ghrelin and osmolality levels to respectively assess hunger and thirst in four captive macaques. Using an enzyme-linked immunosorbent assay, we identified that the levels of blood ghrelin, a widely measured hunger-related peptide hormone in humans, were high after 20 h of no food access (with ad libitum water). This reflects a typical controlled food access condition. One hour after consuming a regular dry meal, the blood ghrelin levels in three out of four monkeys decreased to within their baseline range. Additionally, blood osmolality measured from the same blood sample, the standard hematological index of hydration status, increased after consuming the regular dry meal with no water access. Thus, ghrelin and osmolality may reflect the physiological states of individual monkeys regarding hunger and thirst, suggesting that these indices can be used as tools for monitoring hunger and thirst levels that mediate an animal's decision to consume rewards.
... On the behavioral level, logistic mixed-effects regression models were employed to determine the predictive capacity of emotional and reward PEs for punishment decisions (Fig. 1c). On the neural level, we aimed to investigate: 1) the common and distinct neural systems that support the prediction and experience of reward and emotion during social decisions; 2) whether distinct multivariate neural patterns are sensitive enough to capture variations in emotional and reward PEs and the associated punishment decisions, using machine-learning-based neural decoding approaches (given the higher precision of this approach for establishing process-specific neural signatures 30,31); and 3) whether the reward or emotional neurofunctional PE representations predict the neurofunctional decision to reject an offer, thus characterizing the different roles the PEs play in punishment decisions (Fig. 1c). The modified Ultimatum Game task and PEs computation. ...
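To make the modelling step concrete, a stripped-down sketch follows: reward and emotional PEs are computed as observed minus expected quantities, and a plain logistic regression (standing in for the mixed-effects models used in the study, so per-participant random effects are ignored) predicts whether an offer is rejected. All variable names and the simulated data are illustrative, not the study's data or pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_trials = 400

# Illustrative trial data for a modified Ultimatum Game (all values simulated)
offer = rng.uniform(0, 10, n_trials)             # monetary offer received
expected_offer = rng.uniform(3, 7, n_trials)     # expected offer before seeing it
rated_emotion = rng.uniform(-1, 1, n_trials)     # experienced emotion after the offer
expected_emotion = rng.uniform(-1, 1, n_trials)  # expected emotion before the offer

reward_pe = offer - expected_offer               # reward prediction error
emotion_pe = rated_emotion - expected_emotion    # emotional prediction error

# Simulate rejections that are more likely when the reward PE is negative
logit = -1.2 * reward_pe - 0.4 * emotion_pe
reject = (rng.random(n_trials) < 1 / (1 + np.exp(-logit))).astype(int)

# Fit a simple logistic regression of the rejection decision on both PEs
X = np.column_stack([reward_pe, emotion_pe])
model = LogisticRegression().fit(X, reject)
print("coefficients (reward PE, emotional PE):", model.coef_[0])
print("mean predicted rejection probability:", model.predict_proba(X)[:, 1].mean())
```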
Traditional decision-making models conceptualize humans as optimal learners aiming to maximize outcomes by leveraging reward prediction errors (PEs). While violated emotional expectations (emotional PEs) have recently been formalized, the underlying neurofunctional basis, and whether it differs from reward PEs, remains unclear. Using a modified fMRI Ultimatum Game with n = 43 participants, we modelled reward and emotional PEs in response to unfair offers and subsequent punishment decisions. Computational modelling revealed distinct contributions of reward and emotional PEs to punishment decisions, with reward PEs exerting a stronger impact. This process was neurofunctionally dissociable such that (1) reward engaged the dorsomedial prefrontal cortex while emotional experience recruited the anterior insula, and (2) multivariate decoding accurately separated reward and emotional PEs. Predictive neural expressions of reward, but not emotional, PEs in fronto-insular systems predicted neurofunctional and behavioral punishment decisions. Overall, these findings suggest that distinct neurocomputational processes underlie reward and emotional PEs, which uniquely impact social decisions.
Traditional decision-making models conceptualize humans as adaptive learners who use the differences between expected and actual rewards (prediction errors, PEs) to maximize outcomes, but they rarely consider the influence of violated emotional expectations (emotional PEs) and how it differs from reward PEs. Here, we conducted an fMRI experiment (n = 43) using a modified Ultimatum Game to examine how reward and emotional PEs affect punishment decisions in terms of rejecting unfair offers. Our results revealed that reward PEs, relative to emotional PEs, were a stronger predictor of punishment decisions. On the neural level, the left dorsomedial prefrontal cortex (dmPFC) was strongly activated during reward receipt, whereas emotions engaged the bilateral anterior insula. Reward and emotional PEs were also encoded differently in brain-wide multivariate patterns, with a more sensitive neural signature observed within fronto-insular circuits for reward PEs. We further identified a fronto-insular network encompassing the left anterior cingulate cortex, bilateral insula, left dmPFC, and inferior frontal gyrus that encoded punishment decisions. In addition, stronger fronto-insular pattern expression under reward PEs predicted more punishment decisions. These findings underscore that reward and emotional violations interact to shape decisions in complex social interactions, while the underlying neurofunctional PE computations are distinguishable.
Conventional models of decision-making are predicated on the notion of rational deliberation. However, empirical evidence has increasingly highlighted the pervasive role of bounded rationality in shaping decision outcomes. Bounded rationality manifests through a spectrum of cognitive biases and heuristics, including but not limited to anchoring, availability, the decoy effect, herd behavior, the nuanced dynamics of reward and punishment, and weighting and framing effects. This prospective study is dedicated to a comprehensive exploration of these factors, their impacts on the architecture and functionality of decision-making processes, and their potential for further research.