Galit Shmueli

National Tsing Hua University | NTHU · Institute of Service Science

PhD

About

195 Publications · 80,831 Reads
8,782 Citations


Publications (195)
Preprint
Construct-based models have become a mainstay of management and information systems research. However, these models are likely overfit to the data samples they are estimated on, which makes them risky to use in explanatory, prescriptive, or predictive ways outside a given sample. Empirical researchers currently lack tools to analyze why and how the...
Article
The era of behavioural big data has created new avenues for data science research, with many new contributions stemming from academic researchers. Yet data controlled by platforms have become increasingly difficult for academics to access. Platforms now routinely use algorithmic behaviour modification techniques to manipulate users’ behaviour, leav...
Article
Forecasting hierarchical or grouped time series using a reconciliation approach involves two steps: computing base forecasts and reconciling the forecasts. Base forecasts can be computed by popular time series forecasting methods such as Exponential Smoothing (ETS) and Autoregressive Integrated Moving Average (ARIMA) models. The reconciliation step...
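The reconciliation step can take many forms; as a minimal sketch (not necessarily the method studied in this paper), bottom-up reconciliation replaces the aggregate's base forecast with the sum of its children's base forecasts. Base forecasts would normally come from ETS or ARIMA models; the hierarchy and numbers below are invented for illustration.

```python
# Bottom-up reconciliation: the base forecast of the aggregate series is
# replaced by the sum of its children's base forecasts, so that the
# forecast hierarchy adds up exactly.

base = {"total": 103.0, "A": 60.0, "B": 45.0}  # incoherent: 60 + 45 != 103

def bottom_up(base_forecasts, children):
    """Reconcile a 2-level hierarchy by summing bottom-level forecasts."""
    reconciled = dict(base_forecasts)
    reconciled["total"] = sum(base_forecasts[c] for c in children)
    return reconciled

rec = bottom_up(base, ["A", "B"])
print(rec["total"])  # 105.0 -- now coherent with A + B
```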
Preprint
Full-text available
Algorithms, from simple automation to machine learning, have been introduced into judicial contexts to ostensibly increase the consistency and efficiency of legal decision making. In this paper, we describe four types of inconsistencies introduced by risk prediction algorithms. These inconsistencies threaten to violate the principle of treating sim...
Preprint
Full-text available
Personalization should take the human person seriously. This requires a deeper understanding of how recommender systems can shape both our self-understanding and identity. We unpack key European humanistic and philosophical ideas underlying the General Data Protection Regulation (GDPR) and propose a new paradigm of humanistic personalization. Human...
Preprint
The fields of statistics and machine learning design algorithms, models, and approaches to improve prediction. Larger and richer behavioral data increase predictive power, as evident from recent advances in behavioral prediction technology. Large internet platforms that collect behavioral big data predict user behavior for internal purposes and for...
Preprint
The field of computational statistics refers to statistical methods or tools that are computationally intensive. Due to the recent advances in computing power some of these methods have become prominent and central to modern data analysis. In this article we focus on several of the main methods including density estimation, kernel smoothing, smooth...
Preprint
We propose a tree-based semi-varying coefficient model for the Conway-Maxwell-Poisson (CMP or COM-Poisson) distribution which is a two-parameter generalization of the Poisson distribution and is flexible enough to capture both under-dispersion and over-dispersion in count data. The advantage of tree-based methods is their scalability to high-dimen...
Article
We propose a tree-based semi-varying coefficient model for the Conway-Maxwell-Poisson (CMP or COM-Poisson) distribution which is a two-parameter generalization of the Poisson distribution and is flexible enough to capture both under-dispersion and over-dispersion in count data. The advantage of tree-based methods is their scalability to high-dimens...
Preprint
Though used extensively, the concept and process of machine learning (ML) personalization have generally received little attention from academics, practitioners, and the general public. We describe the ML approach as relying on the metaphor of the person as a feature vector and contrast this with humanistic views of the person. In light of the rece...
Conference Paper
Methodological research in Partial Least Squares Path Modeling (PLS-PM), a construct-based modeling technique, has seen a flurry of efforts to introduce predictive analytic methods. However, there is still confusion about how prediction can be applied to refine theory and integrate with this traditionally inferential technique. We feel that predict...
Article
We propose two methods for time-series clustering that capture temporal information (trend, seasonality, autocorrelation) and domain-relevant cross-sectional attributes. The methods are based on model-based partitioning (MOB) trees and can be used as automated yet transparent tools for clustering large collections of time series. We address the chal...
Article
Purpose: Partial least squares (PLS) has been introduced as a “causal-predictive” approach to structural equation modeling (SEM), designed to overcome the apparent dichotomy between explanation and prediction. However, while researchers using PLS-SEM routinely stress the predictive nature of their analyses, model evaluation assessment relies exclusi...
Preprint
Classification tasks are common across many fields and applications where the decision maker's action is limited by resource constraints. In direct marketing only a subset of customers is contacted; scarce human resources limit the number of interviews to the most promising job candidates; limited donated organs are prioritized to those with best f...
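One common way to operationalize such resource constraints, sketched here with invented scores (and not necessarily the approach this paper develops), is to rank cases by predicted score and act only on the top k:

```python
# When resources cap the number of cases acted on (k contacts, k interviews,
# k organs), a ranking view of classification selects the top k by predicted
# score instead of thresholding at a fixed cutoff. Scores are invented.

scores = {"c1": 0.91, "c2": 0.35, "c3": 0.78, "c4": 0.64, "c5": 0.12}
k = 2  # e.g., budget to contact only two customers

top_k = sorted(scores, key=scores.get, reverse=True)[:k]
print(top_k)  # ['c1', 'c3']
```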
Article
Rapid growth in the availability of behavioral big data (BBD) has outpaced the speed of updates to ethical research codes and regulation of data privacy and human subjects' data collection, storage, and use. The introduction of the European Union's (EU's) General Data Protection Regulation (GDPR) in May 2018 will have far-reaching effects on data s...
Article
Partial least squares path modeling (PLS-PM) has become popular in various disciplines to model structural relationships among latent variables measured by manifest variables. To fully benefit from the predictive capabilities of PLS-PM, researchers must understand the efficacy of predictive metrics used. In this research, we compare the performance...
Article
Full-text available
Exploring theoretically plausible alternative models for explaining the phenomenon under study is a crucial step in advancing scientific knowledge. This paper advocates model selection in Information Systems (IS) studies that use Partial Least Squares path modeling (PLS) and suggests the use of model selection criteria derived from Information Theo...
Article
Analytics is important for education planning. Deploying forecasting analytics requires management information systems (MISs) that collect the needed data and deliver the forecasts to stakeholders. A critical question is whether the data collected by a system is adequate for producing the analytics for decision making. We describe the case of a new...
Article
The Conway–Maxwell–Poisson (CMP) or COM–Poisson regression is a popular model for count data due to its ability to capture both under-dispersion and over-dispersion. However, CMP regression is limited when dealing with complex nonlinear relationships. With today's wide availability of count data, especially due to the growing collection of data on...
Article
Studying causal effects is central to research in operations management in manufacturing and services, from evaluating prevention procedures, to effects of policies and new operational technologies and practices. The growing availability of micro-level data creates challenges for researchers and decision makers in terms of choosing the right level...
Conference Paper
Generating predictions from PLS models is a recent and novel addition to the research and practice of structural equation modeling. Shmueli et al. (2016) gave us an explicit understanding of what prediction should entail in the context of PLS. That study also demonstrated how to generate predictions using the measurement items and structure of the...
Article
Behavioral big data (BBD) refers to very large and rich multidimensional data sets on human and social behaviors, actions, and interactions, which have become available to companies, governments, and researchers. A growing number of researchers in social science and management fields acquire and analyze BBD for the purpose of extracting knowledge a...
Article
Linear regression is among the most popular statistical models in social sciences research, and researchers in various disciplines use linear probability models (LPMs)—linear regression models applied to a binary outcome. Surprisingly, LPMs are rare in the IS literature, where researchers typically use logit and probit models for binary outcomes. R...
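The LPM caveat behind this comparison can be shown in a few lines: ordinary least squares fit to a 0/1 outcome yields fitted values interpreted as probabilities, which can fall outside [0, 1]. The toy data below are invented for illustration.

```python
# Linear probability model (LPM): ordinary least squares applied to a 0/1
# outcome, with the fitted value interpreted as P(y = 1 | x). A well-known
# caveat is that fitted "probabilities" can leave the [0, 1] interval.

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 1, 0, 1, 1, 1]  # binary outcome

n = len(x)
mx, my = sum(x) / n, sum(y) / n
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
intercept = my - slope * mx

def predict(xi):
    return intercept + slope * xi  # interpreted as P(y = 1 | x = xi)

print(round(predict(4.5), 3))  # 0.5 -- a valid probability mid-range
print(predict(20) > 1)         # True -- an extreme x exceeds 1
```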
Article
The field of computational statistics refers to statistical methods or tools that are computationally intensive. Due to the recent advances in computing power, some of these methods have become prominent and central to modern data analysis. In this paper, we focus on several of the main methods including density estimation, kernel smoothing, smooth...
Article
Full-text available
The term quality of statistical data, developed and used in official statistics and international organizations such as the International Monetary Fund (IMF) and the Organisation for Economic Co-operation and Development (OECD), refers to the usefulness of summary statistics generated by producers of official statistics. Similarly, in the context o...
Article
Full-text available
Count data are a popular outcome in many empirical studies, especially as big data has become available on human and social behavior. The Conway-Maxwell Poisson (CMP) distribution is popularly used for modeling count data due to its ability to handle both overdispersed and underdispersed data. Yet, current methods for estimating CMP regression mode...
Book
Provides an important framework for data analysts in assessing the quality of data and its potential to provide meaningful insights through analysis. Analytics and statistical analysis have become pervasive topics, mainly due to the growing availability of data and analytic tools. Technology, however, fails to deliver insights with added value if t...
Chapter
Full-text available
Research is about the advancement of knowledge. A main tool of research and empirical studies is the publication of results. Reports on difficulties in deriving repeated results under different circumstances or even with the original data have been growing, posing fundamental questions such as how to ensure integrity of research work. In some cases...
Chapter
Full-text available
The term quality of statistical data, developed and used in official statistics and international organizations such as the IMF and the OECD, refers to the usefulness of summary statistics and indicators generated by producers of official statistics. Similarly, in the context of survey quality, official agencies such as Eurostat, NCSES, and Statist...
Chapter
This chapter introduces the information quality (InfoQ) framework taking a structural approach, heretofore missing in the literature and education curricula. We introduce the InfoQ concept, its components and dimensions, and a formal definition. We then illustrate the InfoQ components by comparing several types of studies in the field of online auc...
Chapter
In this chapter we present seven case studies of data-driven research in the context of healthcare. The first case study is related to two influential reports prepared by the Institute of Medicine (IOM) that have significantly increased the understanding of the impact of current US healthcare processes on patient safety. Through the InfoQ lens, we...
Chapter
Customer surveys are designed to collect data on customer experience and customer opinions. They are a prime example of a situation in which operationalization and communication of results are key elements of a successful analysis. If a customer satisfaction survey does not lead to specific actions or is not adequately communicated to organizationa...
Chapter
This chapter describes statistical approaches designed to increase InfoQ at the postdata collection stage. Data can be primary, secondary, or semisecondary. At this stage, the data is affected by both a priori (ex ante) causes and a posteriori (ex post) causes. This creates a difference between data planned to be collected and data actually collect...
Chapter
Reviewers play a critical role in the publication process, an important landmark in scientific research. Yet, in many journals, acceptance of scientific papers for publication relies on the reviewer's experience and good sense, with no clear guidelines. The lack of guidance increases uncertainty and variability in the usefulness of reviews. This ch...
Chapter
The InfoQ components and dimensions presented in the previous chapters were applied to a wide range of domains such as education, healthcare, surveys, and official statistics. In this chapter, the focus is on education programs in areas such as data science, business analytics, or statistical methods. In this context, the focus is on practice-orien...
Chapter
Risk management is a prime example of a situation in which all InfoQ dimensions play a critical role. Risk assessment requires data at the right resolution, with proper integration, temporal relevance, and an analysis ensuring chronology of data and goals. Risks need to be addressed by decisions and actions; therefore, proper operationalization is...
Chapter
Full-text available
This chapter presents a breakdown of the InfoQ concept into eight dimensions for assessing the information quality (InfoQ) in a study. We start by describing approaches for assessing the concept of data quality, popular in marketing and medical research and government organizations. We then use a similar framework to create the eight dimensions of...
Chapter
This chapter introduces the application of the InfoQ framework to education-related studies. It includes four case studies. The first is the Missouri Assessment Program report card. Two other case studies are related to the application of value-added models (VAMs) in education. One study looks at the impact of value-added teachers on students’ long...
Chapter
This chapter examines established data collection and study design strategies aimed at increasing InfoQ at the predata collection stage. We also examine constraints such as resource limitations, ethical considerations, and human conformance that lower InfoQ. The two most applicable domains are surveys and experimental design. In experimental design...
Chapter
This chapter examines quality in terms of these information quality (InfoQ) components: quality of the analysis goal, data quality, analysis quality and quality of utility. Although the quality of each of the individual components affects InfoQ, it is the combination of the four that determines the level of InfoQ. The chapter aims to help the reade...
Article
The Bernoulli and Poisson processes are two popular discrete count processes; however, both rely on strict assumptions. We instead propose a generalized homogeneous count process (which we name the Conway-Maxwell-Poisson or COM-Poisson process) that not only includes the Bernoulli and Poisson processes as special cases, but also serves as a flexibl...
Article
The term “Big Data” evokes emotions ranging from excitement to exasperation in the statistics community. Looking beyond these emotions reveals several important changes that affect us as statisticians and as humans. I focus on Behavioral Big Data (BBD), or very large and rich multidimensional datasets on human behaviors, actions and interactions, w...
Conference Paper
Despite the growing interest in predictive analytics using PLS models, there are no practical studies that demonstrate the application of predictive PLS modeling. This study reexamines an established empirical model and reanalyzes it through the lens of predictive analytics. In implementing predictive PLS procedures in recent literature, we uncover...
Article
Attempts to introduce predictive performance metrics into partial least squares (PLS) path modeling have been slow and fall short of demonstrating impact on either practice or scientific development in PLS. This study contributes to PLS development by offering a comprehensive framework that identifies different dimensions of prediction and their ef...
Article
Reviewers play a critical role in the publication process, the hallmark of scientific advancement. Yet, in many journals, determining the contribution of a paper is left to the reviewer's experience and good sense without providing structured guidelines. This lack of guidance to authors and reviewers increases uncertainty and variability in the use...
Article
Attempts to introduce predictive performance metrics into Partial Least Squares (PLS) path modeling have been slow and fall short of demonstrating impact on both practice and scientific development in PLS. This study contributes to PLS development by offering a comprehensive framework that identifies different dimensions of prediction and their eff...
Article
The growing popularity of online dating websites is altering one of the most fundamental human activities: finding a date or a marriage partner. Online dating platforms offer new capabilities, such as extensive search, big-data based mate recommendations and varying levels of anonymity, whose parallels do not exist in the physical world. Yet, littl...
Article
Multivariate control charts are used for monitoring multiple series simultaneously, for the purpose of detecting shifts in the mean vector in any direction. In the context of disease outbreak detection, interest is in detecting only an increase in the process means. Two practical approaches for deriving directional Hotelling charts are Follmann...
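The statistic underlying such charts is the quadratic form T² = (x − μ)′ S⁻¹ (x − μ); note this standard form is direction-agnostic, and the directional variants discussed in the paper go further. A bivariate sketch with invented in-control parameters:

```python
# Hotelling T^2 for a bivariate observation x against an in-control mean mu
# and covariance S: T^2 = (x - mu)' S^{-1} (x - mu). A multivariate control
# chart signals when T^2 exceeds a control limit.

def hotelling_t2(x, mu, S):
    d = [x[0] - mu[0], x[1] - mu[1]]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]  # 2x2 determinant
    Sinv = [[ S[1][1] / det, -S[0][1] / det],
            [-S[1][0] / det,  S[0][0] / det]]    # 2x2 inverse
    # quadratic form d' Sinv d
    return (d[0] * (Sinv[0][0] * d[0] + Sinv[0][1] * d[1])
            + d[1] * (Sinv[1][0] * d[0] + Sinv[1][1] * d[1]))

mu = [10.0, 5.0]
S = [[4.0, 1.0], [1.0, 2.0]]  # invented in-control covariance
print(hotelling_t2([10.0, 5.0], mu, S))  # 0.0 at the in-control mean
print(hotelling_t2([14.0, 8.0], mu, S))  # larger shift, larger T^2 (44/7)
```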
Article
The term quality of statistical data, developed and used in official statistics and international organizations such as the IMF and the OECD, refers to the usefulness of summary statistics generated by producers of official statistics. Similarly, in the context of survey quality, official agencies such as Eurostat, NCSES and Statistics Canada creat...
Article
Prediction and variable selection are major uses of data mining algorithms but they are rarely the focus in social science research, where the main objective is causal explanation. Ideal causal modeling is based on randomized experiments, but because experiments are often impossible, unethical or expensive to perform, social science research often...
Article
The Bernoulli and Poisson are two popular discrete count processes; however, both rely on strict assumptions that motivate their use. We instead propose a generalized count process (the Conway-Maxwell-Poisson process) that not only includes the Bernoulli and Poisson processes as special cases, but also serves as a flexible mechanism to describe cou...
Article
Many employers expect to face a significant shortfall of workers with data science skills in the coming decade. This panel focuses on the opportunities and challenges this poses for the Information Systems (IS) community. Specifically, the panel focuses on three key questions at the nexus of data science, skills, and IS: a) characterizing the chang...
Article
Sizes of datasets used in IS research are growing quickly due to data available from digital technologies such as mobile, RFID, sensors, online markets, and more. It is not uncommon to see studies using tens or hundreds of thousands, or even millions, of records. Linear regression is among the most popular statistical models in social sciences resear...
Article
This work is aimed at finding potential Simpson's paradoxes in Big Data. Simpson's paradox (SP) arises when choosing the level of data aggregation for causal inference. It describes the phenomenon where the direction of a cause on an effect is reversed when examining the aggregate vs. disaggregates of a sample or population. The practical decision...
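The aggregation reversal described here can be reproduced with a small contingency table; the counts below follow the classic kidney-stone textbook illustration, not data from this paper.

```python
# Simpson's paradox: a treatment can have the higher success rate within
# every subgroup yet the lower rate in the aggregate, because subgroup
# sizes differ across treatments.

# (successes, trials) per (treatment, severity group)
data = {
    ("A", "mild"):   (81, 87),
    ("A", "severe"): (192, 263),
    ("B", "mild"):   (234, 270),
    ("B", "severe"): (55, 80),
}

def rate(successes, trials):
    return successes / trials

# Within each severity group, treatment A wins...
within = {g: rate(*data[("A", g)]) > rate(*data[("B", g)])
          for g in ("mild", "severe")}
print(within)  # {'mild': True, 'severe': True}

# ...yet aggregated over groups, treatment B looks better.
agg_A = rate(81 + 192, 87 + 263)   # 273/350 = 0.78
agg_B = rate(234 + 55, 270 + 80)   # 289/350 ~= 0.826
print(agg_A < agg_B)  # True: the direction reverses
```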
Article
Full-text available
The Internet has provided IS researchers with the opportunity to conduct studies with extremely large samples, frequently well over 10,000 observations. There are many advantages to large samples, but researchers using statistical inference must be aware of the p-value problem associated with them. In very large samples, p-values go quickly to zero...
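The p-value problem is easy to demonstrate: hold a practically negligible effect fixed and let n grow. A sketch using a one-sample z-test with an invented effect size:

```python
from math import erf, sqrt

# With a fixed, practically negligible effect, the two-sided p-value of a
# one-sample z-test shrinks toward zero purely because n grows.

def p_value(effect, sd, n):
    """Two-sided p-value for testing mean = 0 with known sd."""
    z = abs(effect) / (sd / sqrt(n))
    phi = 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF at z
    return 2 * (1 - phi)

effect, sd = 0.01, 1.0  # effect is only 1% of a standard deviation
for n in (100, 10_000, 1_000_000):
    print(n, p_value(effect, sd, n))
# the same tiny effect goes from clearly "insignificant" to p ~ 0 at large n
```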
Article
Full-text available
Bimodal truncated count distributions are frequently observed in aggregate survey data and in user ratings when respondents are mixed in their opinion. They also arise in censored count data, where the highest category might create an additional mode. Modeling bimodal behavior in discrete data is useful for various purposes, from comparing shapes o...
Conference Paper
Full-text available
Numbers are not data and data analysis does not necessarily produce information and knowledge. Statistics, data mining, and artificial intelligence are disciplines focused on extracting knowledge from data. They provide tools for testing hypotheses, predicting new observations, quantifying population effects, and summarizing data efficiently. In th...
Article
Full-text available
Biosurveillance, focused on the early detection of disease outbreaks, relies on classical statistical control charts for detecting disease outbreaks. However, such methods are not always suitable in this context. Assumptions of normality, independence and stationarity are typically violated in syndromic data. Furthermore, outbreak signatures are ty...
Article
Sizes of datasets used in academic research are growing quickly, with many studies using tens or hundreds of thousands, or even millions, of records. Linear regression is among the most popular statistical models in social sciences research. Linear probability models, which are linear regression models applied to a binary outcome, are commonly used f...
Article
We introduce a tree-based approach for assessing the performance impact of diverse self-selected interventions in management research. Our approach, which takes advantage of "Big Data", or observational data with large sample sizes and a large number of variables, offers important advantages over traditional propensity score matching. In particular...
Article
The growing popularity of online dating sites is altering one of the most fundamental human activities of finding a date or a marriage partner. Online dating platforms offer new capabilities, such as intensive search, big-data based mate recommendations and varying levels of anonymity, whose parallels do not exist in the physical world. In this stu...
Article
The current kidney allocation system in the United States fails to match donors and recipients well. In an effort to improve the allocation system, the United Network for Organ Sharing (UNOS) defined factors that should determine a new allocation policy, and particularly patients' potential remaining lifetime without a transplant (pre-transplant sur...
Article
We define the concept of Information Quality (InfoQ) as the potential of a dataset to achieve a specific (scientific or practical) goal using a given empirical analysis method. InfoQ is different from data quality and analysis quality, but is dependent on these components and on the relationship between them. We survey statistical methods for incre...
Article
The Poisson distribution is a popular distribution for modeling count data, yet it is constrained by its equidispersion assumption, making it less than ideal for modeling real data that often exhibit over-dispersion or under-dispersion. The COM-Poisson distribution is a two-parameter generalization of the Poisson distribution that allows for a wide...
Article
Full-text available
Modern biosurveillance is the monitoring of a wide range of prediagnostic and diagnostic data for the purpose of enhancing the ability of the public health infrastructure to detect, investigate, and respond to disease outbreaks. Statistical control charts have been a central tool in classic disease surveillance and also have migrated into modern bi...
Article
Full-text available
In this work we propose a modern statistical approach to the analysis and modeling of dynamics in online auctions. Online auction data usually arrive in the form of a set of bids recorded over the duration of an auction. We propose the use of a modern statistical approach called functional data analysis that preserves the entire temporal dimension...
Article
Full-text available
Electronic commerce, and in particular online auctions, have received an extreme surge of popularity in recent years. While auction theory has been studied for a long time from a game-theory perspective, the electronic implementation of the auction mechanism poses new and challenging research questions. Although the body of empirical research on...
Chapter
Full-text available
This chapter proposes an enhancement to currently used algorithms for monitoring daily counts of pre-diagnostic data. Rather than use a single algorithm or apply multiple algorithms simultaneously, our approach is based on ensembles of algorithms. The ensembles lead to better performance in terms of higher true alert rates for a given false alert r...
Article
This paper presents a novel intelligent bidding system, called SOABER (Simultaneous Online Auction BiddER), which monitors simultaneous online auctions of high-value fine art items. It supports decision-making by maximizing bidders' surpluses and their chances of winning an auction. One key element of the system is a dynamic forecasting model, whic...
Article
This research essay highlights the need to integrate predictive analytics into information systems (IS) research, and shows several concrete ways in which this can be accomplished. Predictive analytics include empirical methods (statistical and other) that generate data predictions as well as methods for assessing predictive power. Predictive analy...
Article
This research essay highlights the need to integrate predictive analytics into information systems research and shows several concrete ways in which this goal can be accomplished. Predictive analytics include empirical methods (statistical and other) that generate data predictions as well as methods for assessing predictive power. Predictive analyt...
Article
The Internet presents great opportunities for research about information technology, allowing IS researchers to collect very large and rich datasets. It is common to see research papers with tens or even hundreds of thousands of data points, especially when reading about electronic commerce. Large samples are better than smaller samples in that the...
Article
Full-text available
Statistical modeling is a powerful tool for developing and testing theories by way of causal explanation, prediction, and description. In many disciplines there is near-exclusive use of statistical modeling for causal explanation and the assumption that models with high explanatory power are inherently of high predictive power. Conflation between e...
Article
This paper presents a novel intelligent bidding system, called SOABER (Simultaneous Online Auction Bidder), which monitors simultaneous online auctions of high-value fine art items. It supports decision making by maximizing bidders’ surplus and their chances of winning an auction. One key element of the system is a dynamic forecasting model, which...
Chapter
The arrival process of bidders and bids in online auctions is important for studying and modeling supply and demand in the online marketplace. Whereas bid arrivals are observable in online auction data, bidder behavior is typically not. A popular assumption in the online auction literature is that a homogeneous Poisson bidder arrival proce...
Article
Full-text available
Biosurveillance involves monitoring measures of diagnostic and pre-diagnostic activity for early detection of disease outbreaks. Modern biosurveillance data include daily counts of diagnostic evidence such as lab results, and pre-diagnostic health-seeking behavior such as medication sales. A serious challenge to research in the field of biosurveil...
Article
Full-text available
Poisson regression is a popular tool for modeling count data and is applied in a vast array of applications from the social to the physical sciences and beyond. Real data, however, are often over- or under-dispersed and, thus, not conducive to Poisson regression. We propose a regression model based on the Conway-Maxwell-Poisson (COM-Poisson) distr...
Article
The path that the price takes during an on-line auction plays an important role in understanding and forecasting on-line auctions. Price dynamics, such as the price velocity or its acceleration, capture the speed at which auction information changes. The ability to estimate price dynamics accurately is especially important in real-time price forec...
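Price velocity and acceleration are the first and second derivatives of the price curve; the paper estimates them from smooth functional models, but a crude finite-difference sketch on an invented price path conveys the idea:

```python
# Price "dynamics" -- velocity (first derivative) and acceleration (second
# derivative) of an auction's price curve -- approximated here by finite
# differences on an evenly spaced, invented price path.

prices = [1.0, 1.5, 2.5, 4.5, 8.0]  # hypothetical current-price snapshots
dt = 1.0  # uniform time step between snapshots

velocity = [(p2 - p1) / dt for p1, p2 in zip(prices, prices[1:])]
acceleration = [(v2 - v1) / dt for v1, v2 in zip(velocity, velocity[1:])]

print(velocity)      # [0.5, 1.0, 2.0, 3.5]
print(acceleration)  # [0.5, 1.0, 1.5]
```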