Article

Eliciting Human Judgment for Prediction Algorithms

Authors: Rouba Ibrahim, Song-Hee Kim, Jordan Tong

Abstract

Even when human point forecasts are less accurate than data-based algorithm predictions, they can still help boost performance by being used as algorithm inputs. Assuming one uses human judgment indirectly in this manner, we propose changing the elicitation question from the traditional direct forecast (DF) to what we call the private information adjustment (PIA): how much the human thinks the algorithm should adjust its forecast to account for information the human has that is unused by the algorithm. Using stylized models with and without random error, we theoretically prove that human random error makes eliciting the PIA lead to more accurate predictions than eliciting the DF. However, this DF-PIA gap does not exist for perfectly consistent forecasters. The DF-PIA gap is increasing in the random error that people make while incorporating public information (data that the algorithm uses) but is decreasing in the random error that people make while incorporating private information (data that only the human can use). In controlled experiments with students and Amazon Mechanical Turk workers, we find support for these hypotheses. This paper was accepted by Charles Corbett, operations management.
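To make the abstract's mechanism concrete, here is a minimal simulation sketch of the DF-PIA gap under an assumed additive model. The names, error structure, and parameter values are illustrative choices rather than the paper's actual specification, and for simplicity the elicited quantity is scored directly as a forecast instead of being fed into the algorithm:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Assumed additive outcome: public info x (the algorithm's data) plus
    # private info z (seen only by the human) plus irreducible noise.
    a, b = 1.0, 1.0
    x = rng.normal(size=n)
    z = rng.normal(size=n)
    y = a * x + b * z + rng.normal(scale=0.5, size=n)

    algo = a * x                        # algorithm: public data only
    sigma_pub, sigma_priv = 1.0, 0.3    # human random error per component

    # Direct forecast (DF): the human re-processes BOTH information
    # sources, incurring random error on each.
    df = (a * x + rng.normal(scale=sigma_pub, size=n)) \
       + (b * z + rng.normal(scale=sigma_priv, size=n))

    # Private information adjustment (PIA): the human reports only the
    # adjustment for z; the algorithm's handling of x stays untouched.
    pia = algo + b * z + rng.normal(scale=sigma_priv, size=n)

    mse = lambda f: np.mean((y - f) ** 2)
    print(f"algorithm alone: {mse(algo):.3f}")  # ~1.25 (misses z entirely)
    print(f"DF elicitation:  {mse(df):.3f}")    # ~1.34
    print(f"PIA elicitation: {mse(pia):.3f}")   # ~0.34

Setting sigma_pub to zero (a perfectly consistent forecaster) makes DF and PIA coincide, and raising it widens the gap in PIA's favor, mirroring the hypotheses above.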

... Humans increasingly rely on advice from artificial intelligence (AI) for prediction tasks in areas including logistics, finance, and healthcare (Ibrahim et al., 2021; Lehmann et al., 2022; Logg et al., 2019). AI systems are based on algorithms with the ability to interpret external data and learn from it (Turel & Kalhan, 2023). ...
... This behavior was also frequently observed when humans received advice in the form of a human-generated prediction (Bonaccio & Dalal, 2006). In forecasting settings, algorithms often outperform humans on their own (Ibrahim et al., 2021). If both the human and the algorithmic prediction are included with equal weight in the final prediction, predictive performance can be severely harmed. ...
... We focus on a scenario in which humans perform a surgery prediction task with advice in the form of an algorithmic ensemble prediction. We chose this task as an example of an application context in which algorithms on average outperform humans but in practice mostly function as their advisors (Ibrahim et al., 2021). This allows us to study our research question in a setting that is not only practically relevant but also provides a realistic judgment context. ...
Conference Paper
Full-text available
Humans frequently receive algorithmic advice for prediction tasks. However, they often combine their own judgment with advice in a biased way, which can harm predictive performance. So far, such biased behavior has been documented for humans receiving a single algorithmic advice. Humans are likely to be increasingly advised by more than one algorithm, as is already the case with ensemble methods. We study the effect of disclosing multiple algorithmic advisors on users' weight on advice. We conduct a between-subjects experiment (n = 192) in which we manipulate whether the underlying individual algorithmic predictions are shown in addition to an ensemble prediction. In line with findings on naïve diversification, we observe that humans place a higher weight on an ensemble when individual predictions are presented. Our findings contribute to the literature on biases in AI-advised settings and have implications for the presentation of ensemble predictions.
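For context on the dependent measure: in the advice-taking literature this stream builds on (e.g., Bonaccio & Dalal, 2006), the weight on advice is conventionally computed as

    WOA = (final estimate - initial estimate) / (advice - initial estimate),

so WOA = 0 means the advice was ignored, WOA = 1 that it was fully adopted, and WOA = 0.5 that own judgment and advice were weighted equally.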
... The human mind thrives in the face of special events, where flexibility, subjective evaluation, and additional information are necessary to interpret the event (Goodwin, 2002; Ibrahim et al., 2021). Statistical methods perform better relative to judgment when variability is low; they excel in stable environments where trend detection, the optimal weighting of evidence, and systematic integration allow for accurate forecasts (Lawrence et al., 2006; Sanders & Ritzman, 1992). ...
... The most common method of integration used in practice is Judgmental Adjustment (Fildes & Petropoulos, 2015; Siemsen & Aloysius, 2020). Human forecasters receive output from a model and then adjust it according to their intuition or other private information (Ibrahim et al., 2021). Another method of integration is Quantitative Correction (Fildes, 1991; Goodwin et al., 2011; Theil, 1971). ...
... Judgmental Adjustments are revisions to a forecast that rely on human judgment. Due to the risks of judgment biases, researchers have examined best practices when engaging in Judgmental Adjustment (e.g., Fildes et al., 2009; Ibrahim et al., 2021; Petropoulos et al., 2016). A common argument in the literature is that only experts should adjust forecasts (e.g., Arvan et al., 2019), as higher levels of expertise lead to improved forecast accuracy (Alvarado-Valencia et al., 2017). ...
Preprint
Our research examines how to integrate human judgment and statistical algorithms for demand planning in an increasingly data-driven and automated environment. We use a laboratory experiment combined with a field study to compare existing integration methods with a novel approach: Human-Guided Learning. This new method allows the algorithm to use human judgment to train a model using an iterative linear weighting of human judgment and model predictions. Human-Guided Learning is more accurate than the established integration methods of Judgmental Adjustment, Quantitative Correction of Human Judgment, Forecast Combination, and Judgment as a Model Input. Human-Guided Learning performs similarly to Integrative Judgment Learning, but under certain circumstances, Human-Guided Learning can be more accurate. Our studies demonstrate that the benefit of human judgment for demand planning processes depends on the integration method. Keywords: behavioral experiment, demand planning, digitization, field study, forecasting, human judgment, machine learning. Highlights: Human judgment is an essential element of demand forecasting but requires integration into the forecasting process. Giving people too much influence may introduce more noise than signal. Our research examines different ways of integrating human judgment into a forecasting process and shows that an effortless way of doing so (allowing people to indicate that a special event is affecting a forecast and an algorithm to estimate the impact of that event) performs remarkably well in comparison to other methods.
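As rough intuition for the "iterative linear weighting" idea, the sketch below fits a single linear weight on human judgment by grid search over past forecasts; this is a toy reading, not the authors' implementation, in which the weighting is embedded in model training and updated iteratively:

    import numpy as np

    def fit_weight(model_pred, human_pred, actuals):
        # Choose the weight w that minimizes the mean squared error of
        # the combination w*human + (1-w)*model on historical data.
        grid = np.linspace(0.0, 1.0, 101)
        errs = [np.mean((w * human_pred + (1 - w) * model_pred - actuals) ** 2)
                for w in grid]
        return grid[int(np.argmin(errs))]

    # Synthetic history: an unbiased but noisy model and a biased,
    # noisier human forecaster.
    rng = np.random.default_rng(1)
    actual = rng.normal(100, 10, size=200)
    model = actual + rng.normal(0, 5, size=200)
    human = actual + rng.normal(3, 8, size=200)

    w = fit_weight(model, human, actual)
    print(f"fitted weight on human judgment: {w:.2f}")  # well below 1

Because the fitted weight is typically interior, the approach automatically limits how much influence (and hence noise) the human can inject, consistent with the highlight above.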
... Some recent studies in Operations Management analyze the human-AI interaction in a theoretical framework (Agrawal et al. 2018, 2019, Boyaci et al. 2020, Dai and Singh 2021, de Véricourt and Gurkan 2022, Ibrahim et al. 2021). They focus on modeling the impact of AI-based predictions on the human decision-making process. ...
... They find the algorithm outperforms human buyers in terms of reducing out-of-stock rates and inventory levels. The most closely related empirical works to our study are Fogliato et al. (2022) and Ibrahim et al. (2021). ...
... Ibrahim et al. (2021) show how to exploit human domain knowledge to improve AI predictions of surgery duration. In particular, they suggest inputting the human adjustment (the so-called private information adjustment in the paper), instead of the human's direct forecast, into the prediction algorithm. ...
Preprint
Commercial AI solutions provide analysts and managers with data-driven business intelligence for a wide range of decisions, such as demand forecasting and pricing. However, human analysts may have their own insights and experiences about the decision making that are at odds with the algorithmic recommendation. In view of such a conflict, we provide a general analytical framework to study the augmentation of algorithmic decisions with human knowledge: the analyst uses the knowledge to set a guardrail by which the algorithmic decision is clipped if the algorithmic output is out of bounds and seems unreasonable. We study the conditions under which the augmentation is beneficial relative to the raw algorithmic decision. We show that when the algorithmic decision is asymptotically optimal with large data, the non-data-driven human guardrail usually provides no benefit. However, we point out three common pitfalls of the algorithmic decision: (1) lack of domain knowledge, such as market competition, (2) model misspecification, and (3) data contamination. In these cases, even with sufficient data, the augmentation from human knowledge can still improve the performance of the algorithmic decision.
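The guardrail described here amounts to clipping the algorithm's output to an analyst-chosen band; a minimal sketch with illustrative names and numbers:

    def guardrailed_decision(algo_output, lower, upper):
        # Clip the algorithmic decision to the analyst's guardrail
        # [lower, upper]; inside the band the algorithm is used as-is.
        return max(lower, min(algo_output, upper))

    # E.g., domain knowledge about market competition says any price
    # above 9.5 is unreasonable, so an algorithmic 14.99 gets clipped.
    print(guardrailed_decision(14.99, lower=6.0, upper=9.5))  # -> 9.5

Per the abstract's message, such a band is most useful exactly when the algorithm's large-data optimality breaks down: missing domain knowledge, misspecification, or contaminated data.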
... The data are typically recorded in quantitative form before experts' predictions are combined and then translated into a statistical model (Ashton, 1986; Kraan & Bedford, 2005; Lipscomb, Parmigiani, & Hasselblad, 1998). In this way, aggregated human judgment can be used as an input for building knowledge bases (Morris, 1977) and prediction algorithms (Ibrahim, Kim, & Tong, 2021), which, when deployed in organizations as tools (e.g., different forms of AI), can support human judgment makers and improve the accuracy of their judgments (Cooley & Hicks, 1983; Morwitz & Schmittlein, 1998). ...
... Evaluations in prediction studies tend to be more focused on improving the accuracy of judgment (Lenk & Floyd, 1988; Priem, 1994). For example, Ibrahim et al. (2021: 2314) sought to boost the performance of AI algorithms using human judgment. The authors evaluated the efficacy of using the private information adjustment (PIA) (in their words, "how much the human thinks the algorithm should adjust its forecast to account for the information that only the human has") for improving the forecasting accuracy of machine-learning algorithms. ...
Article
Judgment is an important concept in business and management research and related to several subfields, ranging from staff appraisal and entrepreneurship to strategic decision-making and business ethics. The popularity of the concept has given rise to a diversity of understandings, which, in some instances, lack theoretical precision or conceptual clarity. Our review offers a comprehensive overview and consolidates existing research on judgment in business and management research by identifying three perspectives: variance, prediction, and wisdom. We show how these perspectives converge by highlighting shared characteristics of judgment, such as it being evaluative, personal, and key to coping with uncertainty. In addition, our theoretical synthesis demonstrates how the three perspectives diverge along four central characteristics—theoretical inspiration, purpose, onto-epistemological orientation, and mode of reasoning—that shape how judgment is conceptualized and operationalized in business and management research. By developing a theoretical platform that configures judgment research into three distinct perspectives, our review opens up pathways for assessing the conceptual coherence and methodological implications of each perspective. Building on the latter, we explore how the three perspectives can complement each other and conclude by proposing future directions for the advancement of judgment research.
... Examples of judgment and decision making in retail supply chains include hiring and talent management (van Hoek et al., 2020); forecasting and demand planning (Arvan et al., 2019; Perera et al., 2019); supply planning (Gray & Dougherty, 2018); purchasing (Wehrle et al., 2020); and product removal (Kesavan & Kushwaha, 2020). Research has shown that in tasks requiring judgment and decision making, the optimal outcome often involves a partnership between technology (e.g., analytics and artificial intelligence [AI]) and humans (Blattberg & Hoch, 1990; Brau et al., 2023; Ibrahim et al., 2021). ...
... The need for the human factor working with technology is supported by a large body of research showing that humans and data-driven systems each have their unique strengths and weaknesses (Blattberg & Hoch, 1990; Brau et al., 2023; Moritz et al., 2014; Sanders, 2017). Specifically, humans often have up-to-date knowledge of changes and events occurring in their environment that can affect outcomes but lie outside the data enabling the analytics (Fildes et al., 2009; Fildes & Kingsman, 2011; Ibrahim et al., 2021). Algorithms based on data have the advantage of being objective, consistent, and powerful in processing large datasets, and they can consider relationships between many variables (Ng, 2016; Schoenherr & Speier-Pero, 2015). ...
Article
Our research reveals the continued and evolving role of the human factor in decision making in digitalized retail supply chains. We compare managerial roles in the pre- and post-COVID eras through in-depth interviews of 25 executives spanning the retail supply chain ecosystem. We use grounded theory to develop four main contributions. First, we find that the involvement of managerial judgment is progressively greater moving up the retail supply chain, away from the customer and the demand signal. Second, integration of analytics and judgment is now the primary method of decision making, and we identify elements needed for success. Third, we develop an essential framework for a successful integration process. Fourth, we isolate the necessary components of a successful process for analytics/artificial intelligence (AI) implementation. Our paper offers important insights into how analytics and AI are, and should be, used in judgment and decision making, and opportunities for researchers to understand the changing role of the human factor in digitalized retail supply chains.
... Recent research in several fields (information systems, operations, judgement and decision-making, computer science, and economics, among others) stresses that building performant decision support algorithms requires accounting for how they interact with humans over and above their accuracy [Kleinberg et al., 2018; Ludwig and Mullainathan, 2021; Malik, 2020; Kim et al., 2022; Donahue et al., 2022]. For example, prior work finds that decision support algorithms benefit from incorporating human judgement and feedback [Gao et al., 2021; Ibrahim et al., 2021; De-Arteaga et al., 2022a], and from capturing how humans deviate from the algorithms' predictions [Bastani et al., ...]. ... We find that human-friendliness significantly improves human-algorithm collaboration performance. Experimental subjects who were provided human-friendly decision support scored 97% more on our knowledge graph expansion task than subjects who were provided non-human-friendly decision support, which we attribute to an increase in their decision-making speed and accuracy (by 21% and 15%, respectively). ...
... Finally, we have not considered several other important dimensions of human-centric machine learning: incorporating human judgement and feedback [Ibrahim et al., 2021; Gao et al., 2021], quantifying and depicting algorithmic uncertainty [McGrath et al., 2020], exhibiting fairness [Fu et al., 2020; De-Arteaga et al., 2022b], and being interpretable and transparent [Smith-Renner et al., 2020]. We believe incorporating each of these dimensions is worthy of future research. ...
Preprint
Full-text available
Curated knowledge graphs encode domain expertise and improve the performance of recommendation, segmentation, ad targeting, and other machine learning systems in several domains. As new concepts emerge in a domain, knowledge graphs must be expanded to preserve machine learning performance. Manually expanding knowledge graphs, however, is infeasible at scale. In this work, we propose a method for knowledge graph expansion with humans-in-the-loop. Concretely, given a knowledge graph, our method predicts the "parents" of new concepts to be added to this graph for further verification by human experts. We show that our method is both accurate and provably "human-friendly". Specifically, we prove that our method predicts parents that are "near" concepts' true parents in the knowledge graph, even when the predictions are incorrect. We then show, with a controlled experiment, that satisfying this property increases both the speed and the accuracy of the human-algorithm collaboration. We further evaluate our method on a knowledge graph from Pinterest and show that it outperforms competing methods on both accuracy and human-friendliness. Upon deployment in production at Pinterest, our method reduced the time needed for knowledge graph expansion by ~400% (compared to manual expansion), and contributed to a subsequent increase in ad revenue of 20%.
... In addition, some recent modeling studies examine human-AI interaction, focusing primarily on the potential impact of the coexistence of humans and AI on decision-making performance and on how predictive performance can be enhanced or hindered compared with decisions made solely by humans or AI (e.g., Ibrahim et al. 2021, de Véricourt and Gurkan 2023, Boyacı et al. 2024). Different from this stream of literature, our paper specifically focuses on the emerging foundation models that have powered many new AI applications since the premiere of ChatGPT. ...
Preprint
Full-text available
AI is undergoing a paradigm shift with the rise of foundation models, pre-trained on broad data and adaptable to myriad downstream tasks. This paper presents an economic theory of the foundation model value chain that consists of upstream developers, downstream deployers, and end consumers. Central to understanding the market dynamics within the foundation model value chain is the topic of model openness. We explore how model openness affects the "fine-tuning game" as downstream deployers compete in adopting and investing in fine-tuning foundation models. We first find that as openness increases, the leading deployer might strategically limit its fine-tuning efforts to maintain a monopoly. This strategy restricts market expansion and negatively impacts the following deployer's profit, the upstream developer's profit, consumer surplus, and overall social welfare. This dynamic gives rise to what we term the "openness trap": a range of medium openness levels where the value chain's overall welfare is lower than it would be at zero openness. Furthermore, along the spectrum of model openness, we explore the welfare implications of prevalent market strategies employed by upstream developers, such as vertical integration and offering free trials. Our findings reveal that vertical integration proves beneficial for all stakeholders when the model openness is relatively low. However, for a certain medium range of model openness, vertical integration can unexpectedly backfire and should not be implemented. Regarding free trials, one might intuitively expect this strategy to benefit downstream deployers and consumers by effectively reducing deployers' foundation model sourcing costs. However, we find that such a cost benefit can surprisingly lead to a "triple-lose" outcome for the two deployers and consumers, due to the intricate relationships within the AI value chain. Overall, our theory offers valuable guidance for industry practitioners and policymakers in navigating the rapidly evolving landscape of foundation model development and deployment.
... A standard set of data, visuals, and an interactive platform to explore past incidence data, and in particular past peaks, could be created to support forecasts. Recent work also suggests that a different elicitation technique (a method for extracting a forecast from an individual) may be warranted when collecting human judgment predictions in service to a computational model [62]. Importantly, the cognitive energy of experts who generate forecasts is limited. ...
Preprint
Full-text available
Infectious disease forecasts can reduce mortality and morbidity by supporting evidence-based public health decision making. Most epidemic models train on surveillance and structured data (e.g., weather, mobility, media), missing contextual information about the epidemic. Human judgment forecasts are novel data, asking humans to generate forecasts based on surveillance data and contextual information. Our primary hypothesis is that an epidemic model trained on surveillance plus human judgment forecasts (a chimeric model) can produce more accurate long-term forecasts of incident hospitalizations than a control model trained only on surveillance data. Humans have a finite amount of cognitive energy to forecast, limiting them to forecasting a small number of states. Our secondary hypothesis is that a model can map human judgment forecasts from a small number of states to all states with similar performance. For the 2023/24 season, we collected weekly incident influenza hospitalizations for all US states, and 696 human judgment forecasts of peak epidemic week and the maximum number of hospitalizations (peak intensity) for ten of the most populous states. We found that a chimeric model outperformed a control model on long-term forecasts. Compared to human judgment, a chimeric model produced forecasts of peak epidemic week and peak intensity with similar or improved performance. Forecasts of peak epidemic week and peak intensity for the ten states where humans provided forecasts performed similarly to those from a model that extended these forecasts to all states. Our results suggest human judgment forecasts are a viable data source that can improve infectious disease forecasts and support public health decisions.
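One plausible, purely illustrative reading of the chimeric setup (the authors' epidemic model and elicitation pipeline are not specified here) is a regression whose design matrix appends a human judgment column to lagged surveillance features; note that the synthetic "human" column below helps by construction:

    import numpy as np

    rng = np.random.default_rng(2)
    weeks = 150
    hosp = rng.poisson(50, size=weeks).astype(float)  # fake surveillance

    # Control model: predict week t from lags t-1 and t-2 only.
    X_ctrl = np.column_stack([np.ones(weeks - 2), hosp[1:-1], hosp[:-2]])
    y = hosp[2:]

    # Chimeric model: append a human judgment signal (faked as a noisy
    # view of the target, standing in for elicited forecasts).
    human = y + rng.normal(0, 5, size=y.size)
    X_chim = np.column_stack([X_ctrl, human])

    beta_ctrl, *_ = np.linalg.lstsq(X_ctrl, y, rcond=None)
    beta_chim, *_ = np.linalg.lstsq(X_chim, y, rcond=None)
    rmse = lambda X, b: np.sqrt(np.mean((y - X @ b) ** 2))
    print(f"control RMSE:  {rmse(X_ctrl, beta_ctrl):.2f}")
    print(f"chimeric RMSE: {rmse(X_chim, beta_chim):.2f}")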
... Empirical research, in turn, has investigated human perceptions of and behavioral responses to machine decisions. Such research has included studies on human aversion to, and trust in, algorithms (Castelo et al. 2019; Dietvorst et al. 2018; Ibrahim et al. 2021; Kawaguchi 2021; Logg et al. 2019); the diffusion of responsibility between humans and machines (Gogoll/Uhl 2018; Kirchkamp/Strobel 2019; Parasuraman et al. 2000); and the role of AI in People Analytics (Newman et al. 2020; Tursunbayeva et al. 2021). ...
... At a broad level, we contribute to the growing literature in operations management that uses experimental methods (Schultz et al. 1998, Bolton and Katok 2008, de Véricourt et al. 2013, Buell et al. 2017, Beer et al. 2018, Davis and Leider 2018, Kraft et al. 2018, Liu and KC 2023), some of which examines various aspects of queueing (Kremer and Debo 2016, Shunko et al. 2018, Ülkü et al. 2020, Aksin et al. 2022, Luo et al. 2023); see Allon and Kremer (2019) for a review. Similar to the approach used in Kim et al. (2020), Ibrahim et al. (2021), and Kim and Tong (2024), we replicate our findings across different study populations to enhance their generalizability. We also contribute to the growing body of mostly analytical research examining the conditions under which dedicated queues may outperform pooled queues. ...
Article
Problem definition: Contrary to traditional queueing theory, recent field studies in B2C services indicate that pooled queues may be less efficient than dedicated queues. Methodology/results: We use two online experiments in the healthcare delivery context to replicate this finding and assess the interplay of servers’ customer ownership and queue length awareness as potential underlying mechanisms. We operationalize customer ownership as the extent to which servers feel ownership toward their customers and queue length awareness as the extent to which servers are able to accurately quantify their number of customers. We find that, following a change in queue configuration, dedicated queues outperform pooled queues with respect to processing speed without sacrificing quality. The reduction in processing times is partially mediated by the servers’ queue length awareness and partially suppressed by their ownership of customers in queue. The former is because servers turn out to be less likely to underestimate their load, which makes them work faster. The latter is because ownership of customers in queue may distract servers from the customer in service. When the queue configuration changes from a dedicated to a pooled one, the shorter processing times and higher levels of queue length awareness persist across the change, unlike the higher ownership of customers in the queue. Managerial implications: In discretionary service settings, switching to a dedicated queue is often beneficial in terms of operational performance, partly because the increased queue length awareness motivates servers to work faster; however, the increased degree of customer ownership of those in queue may distract them and result in a slowdown. Funding: This work was supported by the Wharton Behavioral Lab, the Claude Marion Endowed Faculty Scholar Award, the Wharton-INSEAD Alliance, and the Wharton Dean’s Research Fund. Supplemental Material: The online appendices are available at https://doi.org/10.1287/msom.2023.0202 .
... Other attempts include combining separate human and algorithm outputs (e.g., Blattberg and Hoch 1990, Goodwin 2000), introducing systems to elicit human judgment for prediction algorithms (Ibrahim et al. 2021), and learning the human experts' intuition for risk prediction (Orfanoudaki et al. 2022). Another approach to keeping humans in the loop is AI augmentation, where the main idea is to have AI systems work alongside humans and collaborate with them. ...
Preprint
Full-text available
The United States Food and Drug Administration's (FDA's) Premarket Notification 510(k) pathway allows manufacturers to gain approval for a medical device by demonstrating its substantial equivalence to another legally marketed device. However, the inherent ambiguity of this regulatory procedure has led to high recall rates for many devices cleared through this pathway. This trend has raised significant concerns regarding the efficacy of the FDA's current approach, prompting a reassessment of the 510(k) regulatory framework. In this paper, we develop a combined human-algorithm approach to assist the FDA in improving its 510(k) medical device clearance process by reducing the risk of potential recalls and the workload imposed on the FDA. We first develop machine learning methods to estimate the risk of recall of 510(k) medical devices based on the information available at the time of submission. We then propose a data-driven clearance policy that recommends acceptance, rejection, or deferral to the FDA's committees for in-depth evaluation. We conduct an empirical study using a unique large-scale dataset of over 31,000 medical devices and 12,000 national and international manufacturers from over 65 countries that we assembled based on data sources from the FDA and the Centers for Medicare and Medicaid Services (CMS). A conservative evaluation of our proposed policy based on this data shows a 38.9% improvement in the recall rate and a 43.0% reduction in the FDA's workload. Our analyses also indicate that implementing our policy could result in significant annual cost savings ranging between $2.4 billion and $2.7 billion, which highlights the value of using a holistic and data-driven approach to improve the FDA's current 510(k) medical device evaluation pathway.
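The proposed accept/reject/defer structure is, at its core, a two-threshold policy on the estimated recall risk; a minimal sketch with illustrative thresholds (the paper's calibrated cutoffs are not reproduced here):

    def clearance_recommendation(recall_risk, t_accept=0.10, t_reject=0.60):
        # Clear low-risk submissions, reject high-risk ones, and defer
        # the ambiguous middle band to FDA committees for in-depth review.
        if recall_risk < t_accept:
            return "accept"
        if recall_risk > t_reject:
            return "reject"
        return "defer"

    print(clearance_recommendation(0.05))  # -> accept
    print(clearance_recommendation(0.35))  # -> defer
    print(clearance_recommendation(0.80))  # -> reject

The thresholds encode the trade-off the abstract quantifies: widening the deferral band lowers recall risk at the cost of committee workload.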
... Recent studies show the superiority of human-machine collaborations over both full machine automation and human-only operations (Fügener et al. 2022) and shed light on the merits of the human in the loop (Fügener et al. 2021). On the one hand, machines can augment the capabilities of humans, such as managers (Davenport et al. 2020); on the other hand, humans can complement machines by contributing their general intelligence (Te'eni et al. 2023) and diverse ideas (Wang et al. 2023c, Zhang et al. 2024) and by incorporating private information (i.e., data that only humans can use, such as in-house data) (Choudhury et al. 2020, Ibrahim et al. 2021, Sun et al. 2022). Cao et al. (2021) show that, when ...
Article
Our study, conducted through a field experiment with a major Asian microloan company, examines the interaction between information complexity and machine explanations in human–machine collaboration. We find that human evaluators’ loan approval decision-making outcomes are significantly enhanced when they are equipped with both large information volumes and machine-generated explanations, underscoring the limitations of relying solely on human intuition or machine analysis. This blend fosters deep human engagement and rethinking, effectively reducing gender biases and increasing prediction accuracy by identifying overlooked data correlations. Our findings stress the crucial role of combining human discernment with artificial intelligence to improve decision-making efficiency and fairness. We offer specific training and system design strategies to bolster human–machine collaboration, advocating for a balanced integration of technological and human insights to navigate intricate decision-making scenarios efficiently. Specifically, the study suggests that, whereas machines manage borderline cases, humans can significantly contribute by reevaluating and correcting machine errors in random cases (i.e., those without explicitly congruent feature patterns) through stimulated active rethinking triggered by strategic information prompts. This approach not only amplifies the strengths of both humans and machines, but also ensures more accurate and fair decision-making processes.
... Many studies have shown that algorithms can perform better in making final decisions than humans do. For example, Ibrahim et al. (2021) show how to mitigate the biases in human judgment for prediction algorithms by incorporating the domain knowledge of humans into algorithms. Green and Plunkett (2022) use an eBay data set to train an RL agent to optimally bargain on eBay. ...
Article
Full-text available
Inventory management is one of the most important components of Alibaba’s business. Traditionally, human buyers make replenishment decisions: although artificial intelligence (AI) algorithms make recommendations, human buyers can choose to ignore these recommendations and make their own decisions. The company has been exploring a new replenishment system in which algorithmic recommendations are final. The algorithms combine state-of-the-art deep reinforcement learning techniques with the framework of fictitious play. By learning the supplier’s behavior, we are able to address the important issues of lead time and fill rate on order quantity, which have been ignored in the extant literature of stochastic inventory control. We present evidence that our algorithms outperform human buyers in terms of reducing out-of-stock rates and inventory levels. More interestingly, we have seen additional benefits amid the pandemic. Over the last two years, cities in China partially and intermittently locked down to mitigate COVID-19 outbreaks. We have observed panic buying from human buyers during lockdowns, leading to the bullwhip effect. By contrast, panic buying and the bullwhip effect can be mitigated using our algorithms due to their ability to recognize changes in the supplier’s behavior during lockdowns. History: This paper has been accepted for the INFORMS Journal on Applied Analytics Special Issue—2022 Daniel H. Wagner Prize for Excellence in the Practice of Advanced Analytics and Operations Research.
... However, "Data Type II" could also exist; that is, information available to H but not to AI for training. In the hiring context, for instance, Data Type II might take the form of what the candidate said in interviews, the observation of body language, and facial expressions that cannot easily be coded but can nonetheless be used (perhaps unconsciously) by H in decision-making (Ibrahim, Kim, & Tong, 2021). It might also be difficult to provide Data Type II to an AI because of privacy concerns or regulation, even when it is codifiable. ...
Article
Full-text available
An “ensemble” approach to decision making involves aggregating the results from different decision makers solving the same problem (i.e., a division of labor without specialization). We draw on the literatures on machine learning-based Artificial Intelligence (AI) as well as on human decision making to propose conditions under which human-AI ensembles can be useful. We argue that human and AI-based algorithmic decision making can be usefully ensembled even when neither has a clear advantage over the other in terms of predictive accuracy, and even if neither alone can attain satisfactory accuracy in absolute terms. Many managerial decisions have these attributes, and collaboration between humans and AI is usually ruled out in such contexts because the conditions for specialization are not met. However, we propose that human-AI collaboration through ensembling is still a possibility under the conditions we identify.
... The practical implications of interpretability for the adoption of ML models have been highlighted by numerous recent studies. Practitioners are more likely to use algorithms if they understand and are able to modify them (Dietvorst et al. 2018); this can be particularly beneficial when algorithmic decisions lack domain knowledge and suffer from model misspecification, or when human decision makers have access to private information that is unused by the algorithm (Ibrahim et al. 2021, Balakrishnan et al. 2022). Vice versa, interpretable ML can assist and affect human decisions (Gillis et al. 2021), or even help improve workers' performance by inferring tips and strategies from the model (Bastani et al. 2021). ...
Preprint
Full-text available
Owing to their inherently interpretable structure, decision trees are commonly used in applications where interpretability is essential. Recent work has focused on improving various aspects of decision trees, including their predictive power and robustness; however, their instability, albeit well-documented, has been addressed to a lesser extent. In this paper, we take a step towards the stabilization of decision tree models through the lens of real-world health care applications, due to the relevance of stability and interpretability in this space. We introduce a new distance metric for decision trees and use it to determine a tree's level of stability. We propose a novel methodology to train stable decision trees and investigate the existence of trade-offs that are inherent to decision tree models, including between stability, predictive power, and interpretability. We demonstrate the value of the proposed methodology through an extensive quantitative and qualitative analysis of six case studies from real-world health care applications, and we show that, on average, with a small 4.6% decrease in predictive power, we gain a significant 38% improvement in the model's stability.
... Bansal et al. (2020) study the problem of human-AI interaction and demonstrate that the best classifier is not always the one that leads to the best human decisions when the classifier is used as decision support. Recent work has studied how to develop algorithms in contexts in which the human is the sole and final decision-maker (Bansal et al. 2019, Ibrahim et al. 2021, Wolczynski et al. 2022. ...
Preprint
Full-text available
Human-AI complementarity is important when neither the algorithm nor the human yields dominant performance across all instances in a given context. Recent work on human-AI collaboration has considered decisions that correspond to classification tasks. However, in many important contexts where humans can benefit from AI complementarity, humans undertake courses of action. In this paper, we propose a framework for a novel human-AI collaboration for selecting advantageous courses of action, which we refer to as Learning Complementary Policy for Human-AI teams (LCP-HAI). Our solution aims to exploit human-AI complementarity to maximize decision rewards by learning both an algorithmic policy that aims to complement humans and a routing model that defers decisions to either a human or the AI to leverage the resulting complementarity. We then extend our approach to leverage opportunities and mitigate risks that arise in important contexts in practice: 1) when a team is composed of multiple humans with differential and potentially complementary abilities, 2) when the observational data includes consistent deterministic actions, and 3) when the covariate distribution of future decisions differs from that in the historical data. We demonstrate the effectiveness of our proposed methods using data on real human responses and semi-synthetic data, and find that our methods offer reliable and advantageous performance across settings, superior to when either the human or the AI makes decisions alone. We also find that the extensions we propose effectively improve the robustness of the human-AI collaboration performance in the presence of different challenging settings.
... In a non-health context, Miklós-Thal and Tucker (2019) found that ML algorithms can impact the degree to which firms collude with each other in their pricing strategy. Ibrahim et al. (2021) introduced a system to elicit human judgment for prediction algorithms, assuming that experts have at their disposal subject information that is not available in the model input. We propose the integration of expert advice into the ML system in the form of an exogenous predictive model that is trained on historical data of human judgment. ...
Preprint
Full-text available
There is growing evidence that machine learning (ML) algorithms can be used to develop accurate clinical risk scores for a wide range of medical conditions. However, the degree to which such algorithms can affect clinical decision-making is not well understood. Our work attempts to address this problem, investigating the effect of algorithmic predictions on human expert judgment. Leveraging a survey of medical providers and data from a leading U.S. hospital, we develop an ML algorithm and compare its performance with that of medical experts in the task of predicting 30-day readmissions after transplantation. We find that our algorithm is not only more accurate in predicting clinical risk but can also positively influence human judgment. However, its potential impact is mediated by the users’ degree of algorithm aversion and trust. We show that, while our ML algorithm establishes non-linear associations between patient characteristics and the outcome of interest, human experts mostly attribute risk in a linear fashion. To capture potential synergies between human experts and the algorithm, we propose a human-algorithm “centaur” model and a framework to evaluate its performance in a real-world system. We show that our centaur model can outperform human experts and the best ML algorithm by systematically enhancing algorithmic performance with human-based intuition. Our results suggest that a centaur-enhanced risk assessment process that combines the power of human intuition with ML-based predictions could yield significant cost savings in practice by reducing readmissions through more accurate identifications of patients at risk of return at the time of discharge.
... Work on aggregating direct human judgment predictions has focused on adjusting for correlated predictions between individuals, assessing the number of individual predictions to combine, and determining how to appropriately weight individuals based on past predictive performance [18,21]. Direct predictions take advantage of a human's ability to build a prediction from available structured data and information typically unavailable to a computational model, such as subjective information, intuition, and expertise [33]. ...
Article
Full-text available
Background: Past research has shown that various signals associated with human behavior (e.g., social media engagement) can benefit computational forecasts of COVID-19. One behavior that has been shown to reduce the spread of infectious agents is compliance with non-pharmaceutical interventions (NPIs). However, the extent to which the public adheres to NPIs is difficult to measure and consequently difficult to incorporate into computational forecasts of infectious disease. Soliciting judgments from many individuals (i.e., crowdsourcing) can lead to surprisingly accurate estimates of both current and future targets of interest. Therefore, asking a crowd to estimate community-level compliance with NPIs may prove to be an accurate and predictive signal of an infectious agent, such as COVID-19. Objective: We aimed to show that crowdsourced perceptions of compliance with NPIs can be a fast, reliable signal that can predict the spread of an infectious agent. We showed this by measuring the correlation between crowdsourced perceptions of NPI compliance and one- through four-week-ahead US incident cases of COVID-19, and by evaluating whether incorporating crowdsourced perceptions improves the predictive performance of a computational forecast of incident cases. Methods: For 36 weeks from September 2020 to April 2021, we asked two crowds twenty-one questions about their perceptions of their communities' adherence to NPIs and public health guidelines and collected 10,120 responses. Self-reported state residency was compared to estimates from the US census to determine the representativeness of the crowds. Crowdsourced NPI signals were mapped to 21 mean-perception-of-adherence (MEPA) signals and analyzed descriptively to investigate features such as how MEPA signals changed over time and whether MEPA time series clustered into groups based on patterns of responding. We investigated whether MEPA signals were associated with one- through four-week-ahead incident cases of COVID-19 by (i) estimating correlations between MEPA and incident cases, and (ii) including MEPA in computational forecasts. Results: The crowds were mostly geographically representative of the US population, with slight overrepresentation in the Northeast. MEPA signals tended to converge toward moderate levels of compliance throughout the survey period, and an unsupervised analysis revealed that signals clustered into four groups roughly based on the type of question being asked. Several MEPA signals linearly correlated with one- through four-week-ahead incident cases of COVID-19 at the US national level. Including questions related to social distancing, testing, and limiting large gatherings increased out-of-sample predictive performance for one- to three-week-ahead probabilistic forecasts of incident cases of COVID-19 when compared to a model trained on only past incident cases. Conclusions: Crowdsourced perceptions of non-pharmaceutical adherence may be an important signal to improve forecasts of the trajectory of an infectious agent and increase public health situational awareness.
... Lawrence et al., 2006; Palley and Soll, 2019), humans assist an AI (e.g., Hampshire et al., 2020; Ibrahim et al., 2021), or an algorithm optimizes advice given to human decision-makers (Bastani et al., 2021). ...
Preprint
When we use algorithms to produce recommendations, we typically think of these recommendations as providing helpful information, such as when risk assessments are presented to judges or doctors. But when a decision-maker obtains a recommendation, they may not only react to the information. The decision-maker may view the recommendation as a default action, making it costly for them to deviate, for example when a judge is reluctant to overrule a high-risk assessment of a defendant or a doctor fears the consequences of deviating from recommended procedures. In this article, we consider the effect and design of recommendations when they affect choices not just by shifting beliefs, but also by altering preferences. We motivate our model from institutional factors, such as a desire to avoid audits, as well as from well-established models in behavioral science that predict loss aversion relative to a reference point, which here is set by the algorithm. We show that recommendation-dependent preferences create inefficiencies where the decision-maker is overly responsive to the recommendation, which changes the optimal design of the algorithm towards providing less conservative recommendations. As a potential remedy, we discuss an algorithm that strategically withholds recommendations, and show how it can improve the quality of final decisions.
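One stylized way to write down the preference structure sketched in this abstract (an assumed illustration consistent with the description, not the article's exact model): given a recommendation r, the decision-maker picks an action a to solve

    max_a  E[u(a, ω)] − κ · 1{a ≠ r},   with κ ≥ 0,

so the recommendation acts as a default and is overruled only when the expected payoff gain exceeds the deviation cost κ, which institutional factors like audits, or loss aversion around the reference point r, can generate. For κ > 0 the decision-maker is overly responsive to r, which is why the optimal design shifts toward less conservative recommendations.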
... Accurate forecasts are a crucial ingredient of many operational decisions. In capacity planning, for example, a hospital may need surgery duration forecasts to schedule operating rooms (Ibrahim et al. 2021). In inventory management, a newsvendor's order quantity depends on the products' forecasted demand (Silver et al. 2016). ...
Preprint
Full-text available
What systems should we use to elicit and aggregate judgmental forecasts? Who should be asked to make such forecasts? We address these questions by assessing two widely-used crowd prediction systems: prediction markets and prediction polls. Our main test compares a prediction market against team-based prediction polls, using data from a large, multi-year forecasting competition. Each of these two systems uses inputs from either a large, sub-elite or a small, elite crowd. We find that small, elite crowds outperform larger ones, whereas the two systems are statistically tied. In addition to this main research question, we examine two complementary questions. First, we compare two market structures, continuous double auction (CDA) markets and logarithmic market scoring rule (LMSR) markets, and find that the LMSR market produces more accurate forecasts than the CDA market, especially on low-activity questions. Second, given the importance of elite forecasters, we compare the talent-spotting properties of the two systems, and find that markets and polls are equally effective at identifying elite forecasters. Overall, the performance benefits of "superforecasting" hold across systems. Managers should move towards identifying and deploying small, select crowds to maximize forecasting performance.
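For reference, the LMSR compared here is Hanson's logarithmic market scoring rule, whose standard cost function and instantaneous prices are

    C(q) = b · ln( Σ_i exp(q_i / b) ),    p_i(q) = exp(q_i / b) / Σ_j exp(q_j / b),

where q_i is the outstanding quantity of shares on outcome i and b > 0 is a liquidity parameter; a trade moving the market from q to q' costs C(q') − C(q). Larger b damps the price impact of any single trade, which is consistent with the finding that LMSR stays informative on low-activity questions where a continuous double auction can be too thin to produce meaningful prices.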
Article
Problem definition: A fundamental issue faced by operations management researchers relates to striking the right balance between rigor and relevance in their work. Another important aspect of operations management research relates to influencing and positively impacting businesses and society at large. We constantly struggle to achieve these objectives. Methodology/results: This MSOM Fellow forum article discusses key opportunities for increasing the relevance and impact of operations management research. In particular, it highlights two major areas, technology-enabled operations and society and operations, where unique opportunities exist for the field to make lasting contributions to business and society. Managerial implications: It concludes with a menu of approaches to enhance the practical impact of our research.
Article
Fueled by the widespread adoption of algorithms and artificial intelligence, the use of chatbots has become increasingly popular in various business contexts. In this paper, we study how to effectively and appropriately use voice chatbots, particularly by leveraging the two design features of identity disclosure and anthropomorphism, and evaluate their impact on firm operational performance. In collaboration with a large truck-sharing platform, we conducted a field experiment that randomly assigned 11,000 truck drivers to receive outbound calls from the voice chatbot dispatcher of our focal platform. Our empirical results suggest that disclosing the identity of the chatbot at the beginning of the conversation negatively affects operational performance, leading to a roughly 11% reduction in the response probability. However, humanizing the voice chatbot by adding our proposed anthropomorphism features (i.e., interjections and filler words) significantly improves response probability, conversation length, and the probability of order acceptance intention by over 5.6%, 24.9%, and 10.1%, respectively. Moreover, even when the chatbot’s identity is disclosed along with humanizing features, the operational outcomes still improve. Finally, we propose one plausible explanation for the performance improvement, the enhanced trust between humans and algorithms, and provide empirical evidence that drivers are more likely to disclose information to chatbot dispatchers with anthropomorphism features. Our proposed anthropomorphism improvement solutions are currently being implemented and utilized by our collaborator platform. This paper was accepted by Felipe Caro for the Special Issue on the Human-Algorithm Connection. Funding: This study is supported by the National Natural Science Foundation of China [Grants 72172169 and 91646125], the Program for Innovation Research at the Central University of Finance and Economics, and the Shanghai Pujiang Program. Supplemental Material: The online appendix and data files are available at https://doi.org/10.1287/mnsc.2022.03833 .
Article
Problem definition: Two disciplines increasingly applied in operations management (OM) are machine learning (ML) and behavioral science (BSci). Rather than treating these as mutually exclusive fields, we discuss how they can work as complements to solve important OM problems. Methodology/results: We illustrate how ML and BSci enhance one another in non-OM domains before detailing how each step of their respective research processes can benefit the other in OM settings. We then conclude by proposing a framework to help identify how ML and BSci can jointly contribute to OM problems. Managerial implications: Overall, we aim to explore how the integration of ML and BSci can enable researchers to solve a wide range of problems within OM, allowing future research to generate valuable insights for managers, companies, and society.
Article
Using high-quality nationwide social security data combined with machine learning tools, we develop predictive models of income support receipt intensities for any payment enrolee in the Australian social security system between 2014 and 2018. We show that machine learning algorithms can significantly improve predictive accuracy compared to simpler heuristic models or early warning systems currently in use. Specifically, the former predict the proportion of time individuals are on income support in the subsequent 4 years with greater accuracy, by a magnitude of at least 22% (a 14-percentage-point increase in the R-squared), compared to the latter. This gain can be achieved at no extra cost to practitioners, since the algorithms use administrative data currently available to caseworkers. Consequently, our machine learning algorithms can improve the detection of long-term income support recipients, which can potentially enable governments and institutions to offer timely support to these at-risk individuals.
Article
Artificial intelligence systems are increasingly demonstrating their capacity to make better predictions than human experts. Yet recent studies suggest that professionals sometimes doubt the quality of these systems and overrule machine-based prescriptions. This paper explores the extent to which a decision maker (DM) supervising a machine to make high-stakes decisions can properly assess whether the machine produces better recommendations. To that end, we study a setup in which a machine performs repeated decision tasks (e.g., whether to perform a biopsy) under the DM’s supervision. Because stakes are high, the DM primarily focuses on making the best choice for the task at hand. Nonetheless, as the DM observes the correctness of the machine’s prescriptions across tasks, the DM updates the DM’s belief about the machine. However, the DM is subject to a so-called verification bias such that the DM verifies the machine’s correctness and updates the DM’s belief accordingly only if the DM ultimately decides to act on the task. In this setup, we characterize the evolution of the DM’s belief and overruling decisions over time. We identify situations under which the DM hesitates forever whether the machine is better; that is, the DM never fully ignores but regularly overrules it. Moreover, the DM sometimes wrongly believes with positive probability that the machine is better. We fully characterize the conditions under which these learning failures occur and explore how mistrusting the machine affects them. These findings provide a novel explanation for human–machine complementarity and suggest guidelines on the decision to fully adopt or reject a machine. This paper was accepted by Elena Katok, special issue on the human–algorithm connection. Supplemental Material: The online appendix is available at https://doi.org/10.1287/mnsc.2023.4791 .
Article
The rapid adoption of artificial intelligence (AI) technologies by many organizations has recently raised concerns that AI may eventually replace humans in certain tasks. In fact, when used in collaboration, machines can significantly enhance the complementary strengths of humans. Indeed, because of their immense computing power, machines can perform specific tasks with incredible accuracy. In contrast, human decision makers (DMs) are flexible and adaptive but constrained by their limited cognitive capacity. This paper investigates how machine-based predictions may affect the decision process and outcomes of a human DM. We study the impact of these predictions on decision accuracy, the propensity and nature of decision errors, and the DM’s cognitive effort. To account for both flexibility and limited cognitive capacity, we model the human decision-making process in a rational inattention framework. In this setup, the machine provides the DM with accurate but sometimes incomplete information at no cognitive cost. We fully characterize the impact of machine input on the human decision process in this framework. We show that machine input always improves the overall accuracy of human decisions but may nonetheless increase the propensity of certain types of errors (such as false positives). The machine can also induce the human to exert more cognitive effort, even though its input is highly accurate. Interestingly, this happens when the DM is most cognitively constrained, for instance, because of time pressure or multitasking. Synthesizing these results, we pinpoint the decision environments in which human-machine collaboration is likely to be most beneficial. This paper was accepted by Jeannette Song, operations management. Supplemental Material: The data files and online appendices are available at https://doi.org/10.1287/mnsc.2023.4744.
Article
Problem definition: We study the adherence to the recommendations of a decision support system (DSS) for clearance markdowns at Zara, the Spanish fast fashion retailer. Our focus is on the behavioral drivers of the decision to deviate from the recommendation, and on the magnitude of the deviation when it occurs. Academic/practical relevance: A major obstacle in the implementation of prescriptive analytics is users’ lack of trust in the tool, which leads to status quo bias. Understanding the behavioral aspects of managers’ usage of these tools, as well as the specific biases that affect managers in revenue management contexts, is paramount for a successful rollout. Methodology: We use data collected by Zara during seven clearance sales campaigns to analyze the drivers of managers’ adherence to the DSS. Results: Adherence to the DSS’s recommendations was higher, and deviations were smaller, when the products were predicted to run out before the end of the campaign, consistent with the fact that inventory and sales were more salient to managers than revenue. When there was a higher number of prices to set, managers of Zara’s own stores were more likely to deviate from the DSS’s recommendations, whereas franchise managers showed the opposite (albeit weaker) tendency, adhering more often instead. Two interventions aimed at shifting salience from inventory and sales to revenue helped increase adherence and overall revenue. Managerial implications: Our findings provide insights on how to increase voluntary adherence that can be used in any context in which a company wants an analytical tool to be adopted organically by its users. We also shed light on two common biases that can affect managers in a revenue management context, namely salience of inventory and sales, and cognitive workload. Supplemental Material: The e-companion is available at https://doi.org/10.1287/msom.2022.1166 .
Preprint
Full-text available
A key challenge in the emerging field of precision nutrition entails providing diet recommendations that reflect both the (often unknown) dietary preferences of different patient groups and known dietary constraints specified by human experts. Motivated by this challenge, we develop a preference-aware constrained-inference approach in which the objective function of an optimization problem is not pre-specified and can differ across various segments. Among existing methods, clustering models from machine learning are not naturally suited for recovering the constrained optimization problems, whereas constrained inference models such as inverse optimization do not explicitly address non-homogeneity in given datasets. By harnessing the strengths of both clustering and inverse optimization techniques, we develop a novel approach that recovers the utility functions of a constrained optimization process across clusters while providing optimal diet recommendations as cluster representatives. Using a dataset of patients' daily food intakes, we show how our approach generalizes stand-alone clustering and inverse optimization approaches in terms of adherence to dietary guidelines and partitioning observations, respectively. The approach makes diet recommendations by incorporating both patient preferences and expert recommendations for healthier diets, leading to structural improvements in both patient partitioning and nutritional recommendations for each cluster. An appealing feature of our method is its ability to consider infeasible but informative observations for a given set of dietary constraints. The resulting recommendations correspond to a broader range of dietary options, even when they limit unhealthy choices.
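For orientation, a canonical inverse-optimization building block (a generic textbook form, not necessarily the paper's exact formulation) recovers a cost vector under which observed decisions are as close to optimal as possible:

\[
\min_{c,\;\epsilon_1,\dots,\epsilon_K}\ \sum_{k=1}^{K}\epsilon_k
\quad \text{s.t.} \quad
c^{\top} x_k \;\le\; \min_{\{x \,:\, Ax \le b\}} c^{\top} x \;+\; \epsilon_k,
\qquad \epsilon_k \ge 0,\ \ \|c\|_1 = 1,
\]

where the \(x_k\) are observed daily intakes and \(\{x : Ax \le b\}\) encodes the expert dietary constraints. Roughly, the paper's approach partitions the \(x_k\) into clusters and recovers one such objective per cluster, so that infeasible observations (\(\epsilon_k > 0\)) still inform the recovered preferences.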
Article
Full-text available
Firms require demand forecasts at different levels of aggregation to support a variety of resource allocation decisions. For example, a retailer needs store-level forecasts to manage inventory at the store, but also requires a regionally aggregated forecast for managing inventory at a distribution center. In generating an aggregate forecast, a firm can choose to make the forecast directly based on the aggregated data or indirectly by summing lower-level forecasts (i.e., bottom up). Our study investigates the relative performance of such hierarchical forecasting processes through a behavioral lens. We identify two judgment biases that affect the relative performance of direct and indirect forecasting approaches: a propensity for random judgment errors and a failure to benefit from the informational value that is embedded in the correlation structure between lower-level demands. Based on these biases, we characterize demand environments where one hierarchical process results in more accurate forecasts than the other. This paper was accepted by Martin Lariviere, operations management.
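A minimal simulation of the random-error channel (only one of the two biases the paper models; the correlation-information channel is omitted, and all parameters are illustrative):

```python
# Minimal sketch of the random-error channel only: each judgmental forecast
# carries an independent error of scale tau.
import numpy as np

rng = np.random.default_rng(1)
T = 200_000
mu = np.array([100.0, 100.0])   # store-level mean demands
tau = 5.0                       # scale of judgmental random error

# Direct: one noisy judgment of the aggregate mean.
direct = mu.sum() + rng.normal(scale=tau, size=T)
# Bottom-up: a noisy judgment per store, then summed; errors accumulate.
bottom_up = (mu + rng.normal(scale=tau, size=(T, 2))).sum(axis=1)

print("direct error var:   ", direct.var())     # ~ tau^2
print("bottom-up error var:", bottom_up.var())  # ~ 2 * tau^2
```

In this stripped-down setting the direct forecast carries one judgment error while the bottom-up forecast sums two, roughly doubling the error variance; the paper characterizes when the informational value embedded in the demand correlation structure outweighs this effect.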
Article
Full-text available
Decision and risk analysts have considerable discretion in designing procedures for eliciting subjective probabilities. One of the most popular approaches is to specify a particular set of exclusive and exhaustive events for which the assessor provides such judgments. We show that assessed probabilities are systematically biased toward a uniform distribution over all events into which the relevant state space happens to be partitioned so that probabilities are "partition-dependent." We surmise that a typical assessor begins with an "ignorance prior" distribution that assigns equal probabilities to all specified events, then adjusts those probabilities insufficiently to reflect his or her beliefs concerning how the likelihoods of the events differ. In five studies, we demonstrate partition dependence for both discrete events and continuous variables (Studies 1 and 2), show that the bias decreases with increased domain knowledge (Studies 3 and 4), and that top experts in decision analysis are susceptible to this bias (Study 5). We relate our work to previous research on the "pruning bias" in fault-tree assessment (e.g., Fischhoff, Slovic, & Lichtenstein, 1978) and show that previous explanations of pruning bias (enhanced availability of events that are explicitly specified, ambiguity in interpreting event categories, demand effects) cannot fully account for partition dependence. We conclude by discussing implications for decision analysis practice.
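One simple way to formalize the anchoring-and-insufficient-adjustment account (notation ours, not the paper's): with n events in the chosen partition, the assessed probability of event \(E_i\) can be written as

\[
\hat{p}(E_i) \;=\; \lambda \cdot \frac{1}{n} \;+\; (1-\lambda)\, p(E_i), \qquad i = 1,\dots,n,\quad 0 \le \lambda \le 1,
\]

where \(p(E_i)\) is the assessor's underlying belief and \(\lambda\) the weight left on the ignorance prior. Any \(\lambda > 0\) makes the assessment depend on how the state space happens to be partitioned, which is exactly the partition dependence the studies document.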
Article
Full-text available
The format of pricing contracts varies substantially across business contexts, a major variable being whether a contract imposes a fixed fee payment. This paper examines how the use of the fixed fee in pricing contracts affects market outcomes of a manufacturer-retailer channel. Standard economic theories predict that channel efficiency increases with the introduction of the fixed fee and is invariant to its framing. We conduct a laboratory experiment to test these predictions. Surprisingly, the introduction of the fixed fee fails to increase channel efficiency. Moreover, the framing of the fixed fee does make a difference: an opaque frame as quantity discounts achieves higher channel efficiency than a salient frame as a two-part tariff, although these two contractual formats are theoretically equivalent.
Article
Most data-driven decision support tools do not include input from people. We study whether and how to incorporate physician input into such tools in an empirical setting: predicting surgery duration. Using data from a hospital, we evaluate and compare the performance of three families of models: models based on physician forecasts, purely data-based models, and models that combine physician forecasts and data. We find that the combined models perform best, which suggests that physician forecasts carry valuable information above and beyond what is captured by the data. We also find that applying simple corrections to physician forecasts performs comparably well.
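A hedged sketch of the three model families (synthetic data; the case features, noise scales, and learner choices are our stand-ins, not the paper's):

```python
# Three model families: physician-only (linearly corrected), data-only,
# and combined (case features plus the physician's forecast as a feature).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 3_000
X = rng.normal(size=(n, 5))     # case features (stand-ins)
private = rng.normal(size=n)    # information only the physician observes
y = 60 + 10 * X[:, 0] + 8 * private + rng.normal(scale=5, size=n)      # true duration (min)
phys = 60 + 10 * X[:, 0] + 8 * private + rng.normal(scale=12, size=n)  # noisy physician forecast

X_tr, X_te, y_tr, y_te, p_tr, p_te = train_test_split(X, y, phys, random_state=0)

def mae(model, F_tr, F_te):
    return np.mean(np.abs(model.fit(F_tr, y_tr).predict(F_te) - y_te))

print("physician (corrected):", mae(LinearRegression(), p_tr.reshape(-1, 1), p_te.reshape(-1, 1)))
print("data only:            ", mae(RandomForestRegressor(random_state=0), X_tr, X_te))
print("combined:             ", mae(RandomForestRegressor(random_state=0),
                                    np.column_stack([X_tr, p_tr]),
                                    np.column_stack([X_te, p_te])))
```

Because the private signal reaches the combined model only through the physician forecast, the combined model typically wins, mirroring the paper's qualitative finding.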
Article
Product forecasts are a critical input into sourcing, procurement, production, inventory, logistics, finance and marketing decisions. Numerous quantitative models have been developed and applied to generate and improve product forecasts. The use of human judgement, either solely or in conjunction with quantitative models, has been well researched in the academic literature and is a popular forecasting approach in industry practice. In the context of judgemental forecasting, methods that integrate an expert's judgement into quantitative forecasting models are commonly referred to as “integrating forecasting” methods. This paper presents a systematic review of the literature of judgemental demand forecasting with a focus placed on integrating methods. We explore the role of expert opinion and contextual information and discuss the application of behaviourally informed support systems. We also provide important directions for further research in these areas.
Article
Importance Increasing value requires improving quality or decreasing costs. In surgery, estimates for the cost of 1 minute of operating room (OR) time vary widely. No benchmark exists for the cost of OR time, nor has there been a comprehensive assessment of what contributes to OR cost. Objectives To calculate the cost of 1 minute of OR time, assess cost by setting and facility characteristics, and ascertain the proportion of costs that are direct and indirect. Design, Setting, and Participants This cross-sectional and longitudinal analysis examined annual financial disclosure documents from all comparable short-term general and specialty care hospitals in California from fiscal year (FY) 2005 to FY2014 (N = 3044; FY2014, n = 302). The analysis focused on 2 revenue centers: (1) surgery and recovery and (2) ambulatory surgery. Main Outcomes and Measures Mean cost of 1 minute of OR time, stratified by setting (inpatient vs ambulatory), teaching status, and hospital ownership. The proportion of cost attributable to indirect and direct expenses was identified; direct expenses were further divided into salary, benefits, supplies, and other direct expenses. Results In FY2014, a total of 175 of 302 facilities (57.9%) were not for profit, 78 (25.8%) were for profit, and 49 (16.2%) were government owned. Thirty facilities (9.9%) were teaching hospitals. The mean (SD) cost for 1 minute of OR time across California hospitals was $37.45 ($16.04) in the inpatient setting and $36.14 ($19.53) in the ambulatory setting (P = .65). There were no differences in mean expenditures when stratifying by ownership or teaching status except that teaching hospitals had lower mean (SD) expenditures than nonteaching hospitals in the inpatient setting ($29.88 [$9.06] vs $38.29 [$16.43]; P = .006). Direct expenses accounted for 54.6% of total expenses ($20.40 of $37.37) in the inpatient setting and 59.1% of total expenses ($20.90 of $35.39) in the ambulatory setting. Wages and benefits accounted for approximately two-thirds of direct expenses (inpatient, $14.00 of $20.40; ambulatory, $14.35 of $20.90), with nonbillable supplies accounting for less than 10% of total expenses (inpatient, $2.55 of $37.37; ambulatory, $3.33 of $35.39). From FY2005 to FY2014, expenses in the OR have increased faster than the consumer price index and medical consumer price index. Teaching hospitals had slower growth in costs than nonteaching hospitals. Over time, the proportion of expenses dedicated to indirect costs has increased, while the proportion attributable to salary and supplies has decreased. Conclusions and Relevance The mean cost of OR time is $36 to $37 per minute, using financial data from California’s short-term general and specialty hospitals in FY2014. These statewide data provide a generalizable benchmark for the value of OR time. Furthermore, understanding the composition of costs will allow those interested in value improvement to identify high-yield targets.
Article
Most operations models assume individuals make decisions based on a perfect understanding of random variables or stochastic processes. In reality, however, individuals are subject to cognitive limitations and make systematic errors. We leverage established psychology on sample naivete to model individuals’ forecasting errors and biases in a way that is portable to operations models. The model has one behavioral parameter and embeds perfect rationality as a special case. We use the model to mathematically characterize point and error forecast behavior, reflecting an individual’s beliefs about the mean and variance of a random variable. We then derive 10 behavioral phenomena that are inconsistent with perfect rationality assumptions but supported by existing empirical evidence. Finally, we apply the model to two operations settings, inventory management and queuing, to illustrate the model’s portability and discuss its numerous predictions. For inventory management, we characterize order decisions assuming behavioral demand forecasting. The model predicts that even under automated cost optimization, one should expect a pull-to-center effect. It also predicts that this effect can be mitigated by separating point forecasting from error forecasting. For base stock models, it predicts that safety stocks are too small (large) for short (long) lead times. We also express the steady-state behavior of a queue with balking, assuming rational joining decisions but behavioral wait-time forecasts. The model predicts that joining customers tend to be disappointed in their experienced waits. Also, for long (short) lines, it predicts customers have more (less) disperse wait-time beliefs and tend to overestimate (underestimate) the true wait-time variance. This paper was accepted by Serguei Netessine, operations management.
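One way to make the sample-naivete idea concrete (our stylization, not the paper's exact model): a forecaster who forms beliefs as if from a small mental sample of size n produces unbiased but noisy point forecasts and systematically underestimates dispersion when n is small:

```python
# Sample-naivete sketch (our stylization): beliefs formed as if from a
# small mental sample of size n drawn from the true demand distribution.
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 100.0, 20.0
for n in (3, 10, 100):
    samples = rng.normal(mu, sigma, size=(50_000, n))
    point = samples.mean(axis=1)          # point forecast: unbiased but noisy
    error = samples.std(axis=1, ddof=1)   # error forecast: biased low for small n
    print(f"n={n:3d}  mean point fc={point.mean():6.1f}  "
          f"point fc var={point.var():6.1f}  mean error fc={error.mean():5.1f} "
          f"(true sigma = {sigma})")
```

The small-n rows show both effects at once: point forecasts scatter widely around the true mean, and the average error forecast falls short of the true standard deviation, consistent with phenomena such as undersized safety stocks.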
Article
Remarkably little is known about the cognitive processes which are employed in the solution of clinical problems. This paucity of information is probably accounted for in large part by the lack of suitable analytic tools for the study of the physician's thought processes. Here we report on the use of the computer as a laboratory for the study of clinical cognition.
Article
In many markets, it is common for headquarters to create a price list but grant local salespeople discretion to negotiate prices for individual transactions. How much (if any) pricing discretion headquarters should grant is a topic of debate within many firms. We investigate this issue using a unique data set from an indirect lender with local pricing discretion. We estimate that the local sales force adjusted prices in a way that improved profits by approximately 11% on average. A counterfactual analysis shows that using a centralized, data-driven pricing optimization system could improve profits even further, up to 20% over those actually realized. This suggests that centralized pricing—if appropriately optimized—can be more effective than field price discretion. We discuss the implications of these findings for auto lending and other industries with similar pricing processes. This paper was accepted by Serguei Netessine, operations management.
Article
When bidders incur a cost to learn their valuations, bidder entry can impact auction performance. Two common selling mechanisms in this environment are an English auction and a sequential bidding process. Theoretically, sellers should prefer the auction, because it generates higher expected revenues, whereas bidders should prefer the sequential mechanism, because it generates higher expected bidder profits. We compare the two mechanisms in a controlled laboratory environment, varying the entry cost, and find that, contrary to the theoretical predictions, average seller revenues tend to be higher under the sequential mechanism, whereas average bidder profits are approximately the same. We identify three systematic behavioral deviations from the theoretical model: (1) in the auction, bidders do not enter 100% of the time; (2) in the sequential mechanism, bidders do not set preemptive bids according to the predicted threshold strategy; and (3) subsequent bidders tend to overenter in response to preemptive bids by first bidders. We develop a model of noisy bidder-entry costs that is consistent with these behaviors, and we show that our model organizes the experimental data well. Data, as supplemental material, are available at http://dx.doi.org/10.1287/mnsc.2013.1800. This paper was accepted by Teck Ho, behavioral economics.
Article
This paper attempts greater precision and clarity of understanding concerning the nature and economic significance of knowledge and its variegated forms by presenting "the skeptical economist's guide to 'tacit knowledge'". It critically reconsiders the ways in which the concepts of tacitness and codification have come to be employed by economists and develops a more coherent re-conceptualization of these aspects of knowledge production and distribution activities. It also seeks to show that the proposed alternative framework for the study of knowledge codification activities offers a more useful guide for further research directed at informing public policies for science, technological innovation, and long-run economic growth.
Article
We investigate how uncertainty in retail sales can be explained by the return on a financial market index. This information can be employed in forecasting, hedging, and risk management. Our forecasting model expresses the total sales of a retailer as a function of sales forecasts generated by equity analysts, the term of the forecast, and the return on an aggregate financial market index over the term of the forecast. Using a panel of annual firm-level sales forecasts for 97 retailers over 10 years, each year containing multiple forecasts of varying terms, we show that a large and significant part of the sales forecast errors is explained by market returns. Surprisingly, this information is not accounted for in the analysts’ forecasts. Therefore, we develop a method of augmenting sales forecasts with market returns, thereby improving their accuracy. We conduct an extensive study of the model’s forecast-updating performance and show that the accuracy improvement can exceed 15% in out-of-sample tests under various performance metrics, compared to both equity analysts and a standard time-series method. We also demonstrate the usefulness of financial market data for operational hedging decisions.
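A simplified rendition of the augmentation step (synthetic panel, our variable names; the regression is fit in-sample for brevity):

```python
# Regress realized sales-forecast errors on the market return over the
# forecast term, then shift analyst forecasts by the fitted component.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n = 400
r_mkt = rng.normal(0.05, 0.15, size=n)   # market return over the forecast term
analyst = rng.normal(1_000, 50, size=n)  # analyst sales forecast
sales = analyst + 800 * r_mkt + rng.normal(scale=30, size=n)  # realized sales

err = sales - analyst                    # forecast error loads on the market
reg = LinearRegression().fit(r_mkt.reshape(-1, 1), err)
augmented = analyst + reg.predict(r_mkt.reshape(-1, 1))

print("MAE analyst:  ", np.mean(np.abs(sales - analyst)))
print("MAE augmented:", np.mean(np.abs(sales - augmented)))
```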
Article
We focus on ways of combining simple database models with managerial intuition. We present a model and method for isolating managerial intuition. For five different business forecasting situations, our results indicate that a combination of model and manager always outperforms either of these decision inputs in isolation, with an average R-squared increase of 0.09 (16%) above the best single decision input in cross-validated model analyses. We assess the validity of an equal weighting heuristic, 50% model + 50% manager, and then discuss why our results might differ from previous research on expert judgment.
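The combination schemes at issue can be sketched as follows (synthetic forecasts; cross-validated regression weights stand in for the paper's model analyses):

```python
# Equal-weighting heuristic vs. cross-validated regression weights for
# combining a database model's forecast with managerial intuition.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(5)
n = 500
truth = rng.normal(size=n)
model_fc = truth + rng.normal(scale=0.8, size=n)    # database model forecast
manager_fc = truth + rng.normal(scale=0.8, size=n)  # managerial intuition

fifty_fifty = 0.5 * model_fc + 0.5 * manager_fc
X = np.column_stack([model_fc, manager_fc])
fitted = cross_val_predict(LinearRegression(), X, truth, cv=5)

ss_tot = np.sum((truth - truth.mean()) ** 2)
for name, fc in [("model", model_fc), ("manager", manager_fc),
                 ("50/50", fifty_fifty), ("regression", fitted)]:
    r2 = 1 - np.sum((truth - fc) ** 2) / ss_tot
    print(f"{name:10s} R-squared = {r2:.2f}")
```

With independent errors of similar scale, both combinations beat either input alone, and the simple 50/50 heuristic comes close to the fitted weights, which is why equal weighting is worth assessing at all.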
Article
Technical expertise, human judgment, and the time spent by an analyst are often believed to be key factors in determining the accuracy of forecasts obtained with the use of a time series forecasting method. A controlled experiment was designed to test these beliefs empirically. It involved the participation of experts and persons with limited training. Forecasts were generated for 25 time series with the use of the Box-Jenkins, Holt-Winters and Carbone-Longini filtering methods. Results of the nonparametric tests used to compare the forecasts confirmed that technical expertise, judgmental adjustment, and individualized analyses were of little value in improving forecast accuracy compared with black-box approaches. In addition, simpler methods were found to provide significantly more accurate forecasts than the Box-Jenkins method when applied by persons with limited training.
Article
Demand forecasting is a crucial aspect of the planning process in supply-chain companies. The most common approach to forecasting demand in these companies involves the use of a computerized forecasting system to produce initial forecasts and the subsequent judgmental adjustment of these forecasts by the company’s demand planners, ostensibly to take into account exceptional circumstances expected over the planning horizon. Making these adjustments can involve considerable management effort and time, but do they improve accuracy, and are some types of adjustment more effective than others? To investigate this, we collected data on more than 60,000 forecasts and outcomes from four supply-chain companies. In three of the companies, on average, judgmental adjustments increased accuracy. However, a detailed analysis revealed that, while the relatively larger adjustments tended to lead to greater average improvements in accuracy, the smaller adjustments often damaged accuracy. In addition, positive adjustments, which involved adjusting the forecast upwards, were much less likely to improve accuracy than negative adjustments. They were also made in the wrong direction more frequently, suggesting a general bias towards optimism. Models were then developed to eradicate such biases. Based on both this statistical analysis and organisational observation, the paper goes on to analyse strategies designed to enhance the effectiveness of judgmental adjustments directly.
Article
The past 25 years have seen phenomenal growth of interest in judgemental approaches to forecasting and a significant change of attitude on the part of researchers to the role of judgement. While previously judgement was thought to be the enemy of accuracy, today judgement is recognised as an indispensable component of forecasting, and much research attention has been directed at understanding and improving its use. Human judgement can be demonstrated to provide a significant benefit to forecasting accuracy, but it can also be subject to many biases. Much of the research has been directed at understanding and managing these strengths and weaknesses. An indication of the explosion of research interest in this area can be gauged by the fact that over 200 studies are referenced in this review.
Article
As might be expected, surveys show accuracy is the most important criterion in selecting a forecasting strategy. This may lead to the expectation that computer-based software would be used to aid the forecasting effort. However, despite the wide availability of forecasting software, management judgement appears to be the preferred method. This leads to the question: does management judgement provide the most accurate forecast estimates? This paper reports a field study of judgemental sales forecasting across thirteen manufacturing organisations to investigate whether these forecasts are accurate, unbiased, and efficient, and whether better ex ante forecasts could have been developed using a computer-based model. The study shows that the company forecasts were not uniformly more accurate than a simple, non-seasonally-adjusted naïve forecast. Most of the error is due to both inefficiency (serial correlation in the errors) and bias in the forecasts. These two factors seemed to mask any contribution of contextual information to accuracy. The results are also discussed in terms of forecasting objectives the organisations may have other than accuracy.
Article
We empirically document factors that influence how local operating managers use discretion to balance the trade-off between service capacity costs and customer sensitivity to service time. Our findings, using data from one of the largest financial services providers in the United States, indicate that customer sensitivity to service time varies widely and predictably with observable market characteristics. In turn, we find evidence that local operating managers account for market-specific customer sensitivities to service times by deviating frequently and in predictable ways from the recommendations offered by a centralized capacity-planning model. Finally, we document that these discretionary capacity supply decisions exhibit a strong learning effect whereby experienced operating managers place more weight than their less-experienced counterparts on the market-specific trade-off between service capacity costs and customer sensitivity to service times. Overall, our results demonstrate both the importance of local knowledge as an input in service operations and the potential for incorporating richer data on customer behavior and preferences into service cost and productivity standard metrics.
Article
The "wisdom of crowds" in making judgments about the future or other unknown events is well established. The average quantitative estimate of a group of individuals is consistently more accurate than the typical estimate, and is sometimes even the best estimate. Although individuals' estimates may be riddled with errors, averaging them boosts accuracy because both systematic and random errors tend to cancel out across individuals. We propose exploiting the power of averaging to improve estimates generated by a single person by using an approach we call dialectical bootstrapping. Specifically, it should be possible to reduce a person's error by averaging his or her first estimate with a second one that harks back to somewhat different knowledge. We derive conditions under which dialectical bootstrapping fosters accuracy and provide an empirical demonstration that its benefits go beyond reliability gains. A single mind can thus simulate the wisdom of many.
Article
A crowd often possesses better information than do the individuals it comprises. For example, if people are asked to guess the weight of a prize-winning ox (Galton, 1907), the error of the average response is substantially smaller than the average error of individual estimates. This fact, which Galton interpreted as support for democratic governance, is responsible for the success of polling the audience in the television program "Who Wants to Be a Millionaire" (Surowiecki, 2004) and for the superiority of combined over individual financial forecasts (Clemen, 1989). Researchers agree that this wisdom-of-crowds effect depends on a statistical fact: the crowd's average will be more accurate as long as some of the error of one individual is statistically independent of the error of other individuals—as seems almost guaranteed to be the case. Whether a similar improvement can be obtained by averaging repeated estimates from a single individual is the question taken up here.
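The statistical logic can be made explicit with a standard identity (notation ours): if two estimates carry errors \(X_1, X_2\) with common variance \(\sigma^2\) and correlation \(\rho\), the error variance of their average is

\[
\operatorname{Var}\!\left(\frac{X_1 + X_2}{2}\right) \;=\; \frac{\sigma^2 (1 + \rho)}{2},
\]

so averaging helps whenever \(\rho < 1\), and the gain grows as the errors become more independent (\(\rho \to 0\)). Within-person averaging typically yields \(\rho\) well above zero but below one, which is why its benefit is real yet smaller than averaging across people.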
Article
A problem that operating room (OR) managers face in running an OR suite on the day of surgery is to identify "holes" in the OR schedule in which to assign "add-on" cases. This process necessitates knowing the typical and maximum amounts of time that the case is likely to require. The OR manager may know previous case durations for the particular surgeon performing a particular scheduled procedure. The "upper prediction bound" specifies with a certain probability that the duration of the surgeon's next case will be less than or equal to the bound. Prediction bounds were calculated by using methods that (1) do not assume that case durations follow a specific statistical distribution or (2) assume that case durations follow a log-normal distribution. These bounds were tested using durations of 48,847 cases based on 15,574 combinations of scheduled surgeon and procedure. Despite having 3 yr of data, 80 or 90% prediction bounds would not be able to be calculated using the distribution-free method for 35 or 49% of future cases versus 22 or 22% for the log-normal method, respectively. Prediction bounds based on the log-normal distribution overestimated the desired value less often than did the distribution-free method. The chance that the duration of the next case would be less than or equal to its 90% bound based on the log-normal distribution was within 2% of the expected rate. Prediction bounds classified by scheduled surgeon and procedure can be accurately calculated using a method that assumes that case durations follow a log-normal distribution.
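The log-normal bound can be sketched with the standard prediction-interval formula (illustrative durations; this follows the textbook form and is not necessarily the paper's exact procedure):

```python
# 90% upper prediction bound for the next case duration, assuming
# durations for this surgeon-procedure pair are log-normal.
import numpy as np
from scipy import stats

durations = np.array([95.0, 110.0, 120.0, 87.0, 140.0, 105.0, 99.0, 130.0])  # past cases (min)
x = np.log(durations)
n, m, s = len(x), x.mean(), x.std(ddof=1)

alpha = 0.10                               # 90% upper bound
t = stats.t.ppf(1 - alpha, df=n - 1)
upper = np.exp(m + t * s * np.sqrt(1 + 1 / n))
print(f"90% upper prediction bound: {upper:.0f} minutes")
```

The sqrt(1 + 1/n) term is what makes this a prediction bound for a single future case rather than a confidence bound on the mean, and working on the log scale is what encodes the log-normal assumption.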
Article
This paper attempts to answer the question: When is a random variable Y “more variable” than another random variable X?
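The answer developed in this literature is commonly stated as the convex order (standard definition, our notation): Y is "more variable" than X when

\[
\mathbb{E}[\phi(Y)] \;\ge\; \mathbb{E}[\phi(X)] \quad \text{for all convex functions } \phi \text{ for which the expectations exist},
\]

which, taking \(\phi(x) = \pm x\) and \(\phi(x) = x^2\), forces \(\mathbb{E}[Y] = \mathbb{E}[X]\) while allowing \(\operatorname{Var}(Y) \ge \operatorname{Var}(X)\).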
Article
A laboratory experiment and two field studies were used to compare the accuracy of three methods that allow judgmental forecasts to be integrated with statistical methods. In all three studies the judgmental forecaster had exclusive access to contextual (or non time-series) information. The three methods compared were: (i) statistical correction of judgmental biases using Theil’s optimal linear correction; (ii) combination of judgmental forecasts and statistical time-series forecasts using a simple average and (iii) correction of judgmental biases followed by combination. There was little evidence in any of the studies that it was worth going to the effort of combining judgmental forecasts with a statistical time-series forecast – simply correcting judgmental biases was usually sufficient to obtain any improvements in accuracy. The improvements obtained through correction in the laboratory experiment were achieved despite its effectiveness being weakened by variations in biases between periods.
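Method (i) can be sketched in a few lines (synthetic data; Theil's optimal linear correction amounts to regressing actuals on the judgmental forecasts and using the fitted values as corrected forecasts):

```python
# Theil's optimal linear correction: fit actual = a + b * judgment, then
# use the fitted values as the bias-corrected forecasts.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 300
actual = rng.normal(100, 15, size=n)
judgment = 20 + 0.7 * actual + rng.normal(scale=8, size=n)  # biased, noisy judge

lc = LinearRegression().fit(judgment.reshape(-1, 1), actual)
corrected = lc.predict(judgment.reshape(-1, 1))

print("MAE raw judgment:", np.mean(np.abs(judgment - actual)))
print("MAE corrected:   ", np.mean(np.abs(corrected - actual)))
```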
Article
This paper compares the relative predictive ability of several statistical models with analysts' forecasts. It is one of the first attempts to forecast quarterly earnings using an autoregressive conditional heteroskedasticity (ARCH) model. ARCH and autoregressive integrated moving average models are found to be superior statistical forecasting alternatives. The most accurate forecasts overall are provided by analysts. Analysts have both a contemporaneous and timing advantage over statistical models. When the sample is screened on those firms that have the largest structural change in the earnings process, the forecast accuracy of the best statistical models is similar to analysts' predictions. Copyright 1995 by MIT Press.
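A sketch of the ARCH ingredient using the third-party Python `arch` package (our library choice, which postdates the paper; the quarterly series below is synthetic):

```python
# AR(1) mean with ARCH(1) errors, the kind of specification compared
# against analysts' quarterly earnings forecasts.
import numpy as np
from arch import arch_model

rng = np.random.default_rng(8)
T = 200
eps = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):  # AR(1) with mildly heteroskedastic shocks
    y[t] = 0.6 * y[t - 1] + eps[t] * (1 + 0.5 * abs(eps[t - 1]))

am = arch_model(y, mean="AR", lags=1, vol="ARCH", p=1)
res = am.fit(disp="off")
fc = res.forecast(horizon=1)
print(res.params)
print("one-step-ahead mean forecast:", float(fc.mean.iloc[-1, 0]))
```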