## No full-text available

To read the full-text of this research, you can request a copy directly from the author.


... In the latter approach, there is less selection in the relationships being tested, and the probability is lower that the findings are actually true [15]. Three types of research questions can be answered by analyzing data, in increasing order of complexity: descriptive, predictive, and causal questions [16]. ...

... Researchers can use various techniques to answer descriptive, predictive, and causal questions in their data sets [16]. A selection of most commonly used techniques is provided in Table 1, and these have been described elsewhere in more detail [23][24][25][26][27][28][29][30][31][32][33][34][35][36][37][38][39][40]. ...

... Researchers are required to define this (to a varying extent per method) before performing the analysis. Given the three types of possible research questions [16], data can be either baseline characteristics ("X") or outcome ("Y"). Algorithms vary in the way they couple "X" with "Y" (Table 1), which determines what types of research questions they might feasibly answer, given appropriate data. ...
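
A minimal sketch of this coupling, using synthetic data (not the paper's): a descriptive question summarizes "X" alone, while a predictive question couples "X" with "Y" to estimate the outcome. The linear probability model here is only an illustration of the X-to-Y coupling, not a recommended technique.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # baseline characteristics ("X")
y = (X @ np.array([1.0, 0.5, 0.0])                 # outcome ("Y") driven by X
     + rng.normal(size=100) > 0).astype(float)

# Descriptive question: summarize X alone; the outcome is never used.
means = X.mean(axis=0)

# Predictive question: couple X with Y, here via a simple linear probability model.
A = np.column_stack([np.ones(100), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
risk = A @ beta                                    # predicted outcome given X
```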

Large and complex data sets are increasingly available for research in critical care. To analyze these data, researchers use techniques commonly referred to as statistical learning or machine learning (ML). The latter is known for large successes in the field of diagnostics, for example, by identification of radiological anomalies. In other research areas, such as clustering and prediction studies, there is more discussion regarding the benefit and efficiency of ML techniques compared with statistical learning. In this viewpoint, we aim to explain commonly used statistical learning and ML techniques and provide guidance for responsible use in the case of clustering and prediction questions in critical care. Clustering studies have been increasingly popular in critical care research, aiming to inform how patients can be characterized, classified, or treated differently. An important challenge for clustering studies is to ensure and assess generalizability. This limits the application of findings in these studies toward individual patients. In the case of predictive questions, there is much discussion as to what algorithm should be used to most accurately predict outcome. Aspects that determine the usefulness of ML, compared with statistical techniques, include the volume of the data, the dimensionality of the preferred model, and the extent of missing data. There are areas in which modern ML methods may be preferred. However, efforts should be made to implement statistical frameworks (e.g., for dealing with missing data or measurement error, both omnipresent in clinical data) in ML methods. To conclude, there are important opportunities but also pitfalls to consider when performing clustering or predictive studies with ML techniques. We advocate careful evaluation of new data-driven findings.
More interaction is needed between the engineer mindset of experts in ML methods, the insight in bias of epidemiologists, and the probabilistic thinking of statisticians to extract as much information and knowledge from data as possible, while avoiding harm.

... Similar to how the majority of psychological research has operated from a hypothetico-deductive perspective, thus obviating the necessity of justification, the same could be said for explanatory aims (e.g., Yarkoni and Westfall, 2017). While explanation can be contrasted with description and prediction (Shmueli, 2010; Hamaker et al., 2020; Mõttus et al., 2020), the distinction between description and explanation is often less than clear, and is subject to a researcher's point of view (Wilkinson, 2014; Yarkoni, 2020). While explanation is concerned with understanding underlying mechanisms ...

Psychological science is experiencing a rise in the application of complex statistical models and, simultaneously, a renewed focus on applying research in a confirmatory manner. This presents a fundamental conflict for psychological researchers as more complex forms of modeling necessarily eschew as stringent of theoretical constraints. In this paper, I argue that this is less of a conflict, and more a result of a continued adherence to applying the overly simplistic labels of exploratory and confirmatory. These terms mask a distinction between exploratory/confirmatory research practices and modeling. Further, while many researchers recognize that this dichotomous distinction is better represented as a continuum, this only creates additional problems. Finally, I argue that while a focus on preregistration helps clarify the distinction, psychological research would be better off replacing the terms exploratory and confirmatory with additional levels of detail regarding the goals of the study, modeling details, and scientific method.

... A multiple linear regression analysis was chosen in order to obtain a readily interpretable model. While nonlinear, or non-parametric regression analysis might provide a better fit with lower model residuals than multiple linear regression, the coefficients from such models become less readily interpretable (Shmueli, 2010). The predictor variable coefficients from multiple linear regression are readily interpretable as the mean change in the response variable for one standard deviation change in the associated predictor when based on standardized data, with their absolute values being directly comparable (Frost, 2020) for a given variety or across the different varieties. ...
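
The interpretability argument above (standardized coefficients as the mean change in the response per one-SD change in a predictor, with absolute values directly comparable) can be sketched on synthetic data, not the thesis's dataset:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 3)) * [2.0, 0.5, 1.0]        # predictors on different scales
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)

# Standardize X and y so each coefficient is the mean change in y
# (in SD units) per one-SD change in the associated predictor.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
ys = (y - y.mean()) / y.std()

beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)       # standardized coefficients
```

Because every variable is on the same (SD) scale, `abs(beta)` ranks predictor influence directly, which is the interpretability the snippet describes.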

This thesis covers two different but related topics regarding how different grapevine varieties respond to drought stress. Firstly, measurements of carbon isotope ratios in berry juice at maturity provide an integrated assessment of drought stress during berry ripening and, when collected over multiple seasons, an indication of a variety's response to this stress. Characterizing drought stress during this period is useful because it affects the quality of grapes and the resulting wine. Measurements of carbon isotope ratios in berry juice were carried out on 48 varieties planted in an experimental vineyard in Bordeaux over seven years, and differences were found across varieties. A hierarchical cluster analysis then created a classification of varieties based on their relative drought tolerance. In addition, using leaf water potential measurements collected over four seasons, a hydroscape approach was used to develop a list of metrics indicative of the sensitivity of stomatal regulation to water stress. Key hydroscape metrics were also well correlated with carbon isotope ratios. This study also examined how the carbon isotope ratios in wine spirit (eau de vie) produced by double distillation were potentially affected when compared to the source wine and parent grape must. A strong relationship was found between the carbon isotope ratios of grape must, wine, and eau de vie, suggesting the latter could be used to estimate the vine water status that existed during the corresponding berry ripening period. This could be useful for exploring how sensory attributes of eau de vie are linked to vine water status. Secondly, in wine growing regions around the world, climate change can affect vine transpiration and vineyard water use.
In order to characterize this response, a simplified method was presented to determine vine conductance using measurements of vine sap flow, temperature and humidity within the vine canopy, and estimates of net radiation absorbed by the vine canopy. Based on this method and measurements taken on several vines of five varieties in a non-irrigated vineyard in Bordeaux, France, bulk stomatal conductance was estimated at 15-minute intervals from July to mid-September 2020, producing values similar to those reported for vineyards in the literature. Sensitivity analysis using non-parametric regression found transpiration flux and vapor pressure deficit to be the most important input variables in the calculation of conductance, with absorbed net radiation and bulk boundary layer conductance being much less important. Multiple linear regression analysis found that variability of vapour pressure deficit over the day and predawn water potential over the season explained much of the variability in bulk stomatal conductance overall. For the regression analysis, it was important to address non-linearity and collinearity in the explanatory variables and to develop a model that was readily interpretable. Transpiration simulations based on the regression equations found similar differences between varieties in terms of daily and seasonal transpiration. These simulations also compared well with those from an accepted vineyard water balance model, although there were differences between the two approaches in the rate at which conductance decreased in response to drought stress.

... The intent of the above is to obtain a readily interpretable model with the best possible fit. While nonlinear, or non-parametric regression analysis might provide a better fit with lower residuals, the coefficients from such models become less readily interpretable (Shmueli, 2010). ...

In wine growing regions around the world, climate change has the potential to affect vine transpiration and overall vineyard water use due to related changes in daily atmospheric conditions and soil water deficits. Grapevines control their transpiration in response to such changes by regulating conductance of water through the soil-plant-atmosphere continuum. The response of bulk stomatal conductance, the vine canopy equivalent of stomatal conductance, to such changes was studied on Cabernet-Sauvignon, Merlot, Tempranillo, Ugni blanc, and Semillon vines in a non-irrigated vineyard in Bordeaux, France. Whole-vine sap flow, temperature and humidity in the vine canopy, and net radiation absorbed by the vine canopy were measured at 15-minute intervals from early July through mid-September 2020, together with periodic measurements of leaf area, canopy porosity, and predawn leaf water potential. From these data, bulk stomatal conductance was calculated at 15-minute intervals, and multiple linear regression analysis was performed to identify key variables and their relative effect on conductance. For the regression analysis, attention was focused on addressing non-linearity and collinearity in the explanatory variables and developing a model that was readily interpretable. Variability of vapour pressure deficit in the vine canopy over the day and predawn water potential over the season explained much of the variability in bulk stomatal conductance overall, with relative differences between varieties appearing to be driven in large part by differences in conductance response to predawn water potential between the varieties. Transpiration simulations based on the regression equations found similar differences between varieties in terms of daily and seasonal transpiration.
These simulations also compared well with those from an accepted vineyard water balance model, although there appeared to be differences between the two approaches in the rate at which conductance, and hence transpiration, is reduced as a function of decreasing soil water content (i.e., increasing water deficit stress). By better characterizing the response of bulk stomatal conductance, the dynamics of vine transpiration can be better parameterized in vineyard water use modeling of current and future climate scenarios.
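
As a rough, hedged illustration of what "calculating bulk stomatal conductance" can look like when the radiation and boundary-layer terms are neglected (the sensitivity analysis above found them much less important than transpiration flux and vapour pressure deficit), one can invert the simple diffusion relation E = g_b · VPD / P_atm. All numbers are illustrative, not measurements from the study:

```python
# One 15-minute time step, illustrative values only.
E = 2.0e-3      # transpiration flux, mol H2O m^-2 s^-1 (e.g., from sap flow per leaf area)
vpd = 1.5       # vapour pressure deficit in the canopy, kPa
p_atm = 101.3   # atmospheric pressure, kPa

# Simplified inversion of E = g_bulk * vpd / p_atm
g_bulk = E * p_atm / vpd   # bulk stomatal conductance, mol m^-2 s^-1
```

This is a deliberately simplified sketch of the inversion idea; the thesis's actual method also accounts for absorbed net radiation and bulk boundary layer conductance.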

... Finally, I had hoped that the editors of the book would clearly illustrate the difference between causal/statistical inference and predictive modeling (discussed only briefly in chapter 11), preferably at the outset of the methodological section. There is a clear distinction between explanatory modeling and predictive modeling, and the processes and strategies used for developing each type of model vary (Shmueli, 2010). Therefore, it would have been beneficial for readers coming from the social sciences, who are typically accustomed to causal explanation, to know in what ways predictive modeling differs from explanatory modeling. ...

The paper reviews a book entitled "Computational Psychometrics: New Methodologies for a New Generation of Digital Learning and Assessment".

... Our goal is to provide a detailed investigation of the tornado casualty landscape in Shreveport that can be used as a launching pad for more extensive regional analyses of tornado casualties. Furthermore, we seek to broaden the scope of data-driven statistical models for tornado casualties toward more theory-driven statistical models (Neyman 1939; Box et al. 2005; Shmueli 2010) in order to make them more responsible (Hand 2019) by showing the importance of including local, place-based data to address causality rather than predictive power or variance. ...

Tornadoes are among the most violent hazards in the world, capable of producing mass casualties. Much of what is known about the relationship between tornadoes and casualties—injuries and fatalities—is driven by quantitative methods that often omit individual community factors. In response, here we present a place-based analysis of tornado activity and casualties in Shreveport, Louisiana. Results show that tornado casualties are more likely in smooth and lower topography and in formally redlined neighborhoods. Results also indicate that areas around the local Barksdale Air Force Base have experienced fewer casualties than other parts of the city since the installation of a Doppler Radar in 1995 and that Shreveport has a greatly reduced casualty rate since the Super Outbreak of 2011. We argue that continued place-based approaches are necessary for an understanding of the multi-dimensional, structural, and historical legacies that produce disproportionate impacts to environmental hazards and that when combined with quantitative methods, place-based approaches have the potential to create regional or local intervention strategies that can reduce the loss of life.

... versus machine learning. Social psychology: a primary goal in empirical psychology is to describe the causal underpinnings of human behavior [139,179,210]. Researchers identify hypotheses representing predictions about variables that constitute observed data. ...

Recent concerns that machine learning (ML) may be facing a reproducibility and replication crisis suggest that some published claims in ML research cannot be taken at face value. These concerns inspire analogies to the replication crisis affecting the social and medical sciences, as well as calls for greater integration of statistical approaches to causal inference and predictive modeling. A deeper understanding of what reproducibility concerns in research in supervised ML have in common with the replication crisis in experimental science can put the new concerns in perspective, and help researchers avoid "the worst of both worlds" that can emerge when ML researchers begin borrowing methodologies from explanatory modeling without understanding their limitations, and vice versa. We contribute a comparative analysis of concerns about inductive learning that arise in different stages of the modeling pipeline in causal attribution as exemplified in psychology versus predictive modeling as exemplified by ML. We identify themes that re-occur in reform discussions like overreliance on asymptotic theory and non-credible beliefs about real-world data generating processes. We argue that in both fields, claims from learning are implied to generalize outside the specific environment studied (e.g., the input dataset or subject sample, modeling implementation, etc.) but are often impossible to refute due to forms of underspecification. In particular, many errors being acknowledged in ML expose cracks in long-held beliefs that optimizing predictive accuracy using huge datasets absolves one from having to make assumptions about the underlying data generating process. We conclude by discussing rhetorical risks like error misdiagnosis that arise in times of methodological uncertainty.

... To the critics and skeptics regarding the feasibility of a disruptive adoption of the data-driven approach in population and health studies, it is worth noting that the epistemological discussion cannot rest on dichotomous views of science of the "qualitative or quantitative" kind, frequently treated as antagonistic approaches. The proposal defended here is to build paths toward an open science [5,10] that is creative and innovative, that can simultaneously adopt mixed methods (qualitative and quantitative), and that can be guided by hybrid procedures (hypothesis- and data-driven) [11][12][13][14]. ...

Abstract Introduction The term "big data" is no longer a novelty in academia, becoming more common in scientific publications and research funding calls, and prompting a profound revision of the science that is done and taught. Objective To reflect on the possible changes that data science may bring to the fields of population and health studies. Method To foster this reflection, selected scientific articles from the area of big data in health and demography were contrasted with books and other scientific works. Results It is argued that data volume is not the most promising characteristic of big data for population and health studies; rather, the complexity of the data and the possibility of integration with conventional studies through interdisciplinary teams are what hold promise. Conclusion Within the health sector and population studies, the possibilities for integrating new data science methods with traditional research methods are broad, including a new set of tools for the analysis, monitoring, and prediction of events (cases) and health-disease situations in the population, and for the study of socio-environmental and demographic determinants.

... To obtain reliable estimates for the coefficients and evaluate the predictive accuracy simultaneously, we implemented a 250-fold cross-validation procedure (Shmueli, 2010). On each fold, the available data were randomly split into a training set (80% of data) and a test set (20% of data). ...
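
The repeated random-split procedure described in the snippet (many folds, each an 80/20 random split, evaluating predictive accuracy out of sample) can be sketched on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
X = rng.normal(size=(n, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(scale=0.5, size=n)

scores = []
for _ in range(250):                          # 250 random 80/20 splits
    idx = rng.permutation(n)
    train, test = idx[:80], idx[80:]
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    pred = X[test] @ beta
    ss_res = np.sum((y[test] - pred) ** 2)
    ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
    scores.append(1 - ss_res / ss_tot)        # out-of-sample R^2 for this fold

mean_r2 = float(np.mean(scores))              # averaged predictive accuracy
```

Averaging over many random splits gives a more stable accuracy estimate than a single train/test split, which is the rationale for the 250-fold procedure.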

Modern educational technology has the potential to support students to use their study time more effectively. Learning analytics can indicate relevant individual differences between learners, which adaptive learning systems can use to tailor the learning experience to individual learners. For fact learning, cognitive models of human memory are well suited to tracing learners’ acquisition and forgetting of knowledge over time. Such models have shown great promise in controlled laboratory studies. To work in realistic educational settings, however, they need to be easy to deploy and their adaptive components should be based on individual differences relevant to the educational context and outcomes. Here, we focus on predicting university students’ exam performance using a model-based adaptive fact-learning system. The data presented here indicate that the system provides tangible benefits to students in naturalistic settings. The model’s estimate of a learner’s rate of forgetting predicts overall grades and performance on individual exam questions. This encouraging case study highlights the value of model-based adaptive fact-learning systems in classrooms.
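
To make "rate of forgetting" concrete, here is a minimal sketch of the general idea, assuming a simple exponential-decay memory model (this is an illustration of the concept, not the specific cognitive model used by the system, and the rates are made-up values, not fitted estimates):

```python
import numpy as np

def p_recall(t, rate):
    """Probability of recalling a fact t seconds after study,
    under a simple exponential-decay memory model."""
    return np.exp(-rate * t)

# A learner with a higher estimated rate of forgetting retains less over time,
# so the system can schedule more frequent repetitions for them.
slow, fast = 0.0005, 0.002      # illustrative forgetting rates, per second
t = 3600.0                      # one hour after study
```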

... OLS produced lower standard errors than GLM. Since extrapolation models are not necessarily the best-fitted models, we also tested the performance of predictions [26], and OLS predictions were, on average, more accurate than those of the GLM. ...

Background:
One critical element to optimize funding decisions involves the cost and efficiency implications of implementing alternative program components and configurations. Program planners, policy makers and funders alike are in need of relevant, strategic data and analyses to help them plan and implement effective and efficient programs. Contrary to widely accepted conceptions in both policy and academic arenas, average costs per service (so-called "unit costs") vary considerably across implementation settings and facilities. The objective of this work is twofold: 1) to estimate the variation of voluntary medical male circumcision (VMMC) unit costs across service delivery platforms (SDPs) in Sub-Saharan countries, and 2) to develop and validate a strategy to extrapolate unit costs to settings for which no data exist.
Methods:
We identified high-quality VMMC cost studies through a literature review. Authors were contacted to request the facility-level datasets (primary data) underlying their results. We standardized the disparate datasets into an aggregated database which included 228 facilities in eight countries. We estimated multivariate models to assess the correlation between VMMC unit costs and scale, while simultaneously accounting for the influence of the SDP (defined as any combination of facility type, ownership, urbanicity, and country) on the unit cost variation. Finally, we extrapolated VMMC unit costs for all SDPs in 13 countries, including those not contained in our dataset.
Results:
The average unit cost was 73 USD (IQR: 28.3, 100.7). South Africa showed the highest within-country cost variation, as well as the highest mean unit cost (135 USD). Uganda and Namibia had minimal within-country cost variation, and Uganda had the lowest mean VMMC unit cost (22 USD). Our results showed evidence consistent with economies of scale. Private ownership and Hospitals were significant determinants of higher unit costs. By identifying key cost drivers, including country- and facility-level characteristics, as well as the effects of scale we developed econometric models to estimate unit cost curves for VMMC services in a variety of clinical and geographical settings.
Conclusion:
While our study did not produce new empirical data, our results increased tenfold the availability of unit cost estimates, covering 128 SDPs in 14 priority countries for VMMC. It is, to our knowledge, the most comprehensive analysis of VMMC unit costs to date. Furthermore, we provide a proof of concept of the ability to generate predictive cost estimates for settings where empirical data do not exist.
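
The core of such an extrapolation exercise, fitting a cost curve with economies of scale and then checking predictive accuracy on held-out facilities (as the snippet above describes), can be sketched on simulated data; none of these numbers come from the study:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
scale = rng.uniform(50, 2000, size=n)           # facility scale (services per year)
# Skewed, positive unit costs with economies of scale (purely illustrative).
mu = 200 * scale ** -0.3
cost = rng.gamma(shape=4.0, scale=mu / 4.0)

train, test = np.arange(150), np.arange(150, 200)
X = np.column_stack([np.ones(n), np.log(scale)])

# Log-log OLS: log(cost) ~ a + b * log(scale); b < 0 indicates economies of scale.
b_ols, *_ = np.linalg.lstsq(X[train], np.log(cost[train]), rcond=None)

# Test predictive performance on held-out facilities, not just in-sample fit.
pred_ols = np.exp(X[test] @ b_ols)
mae = float(np.mean(np.abs(pred_ols - cost[test])))
```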

... Gaining insights from data has long been the focus of several related fields, such as statistics. In this field, the tasks of predicting and understanding (or describing) the effects of attributes on a target variable are traditionally separate (Shmueli, 2010). These two distinct tasks are thus associated with different objectives, the fulfillment of which is achieved using different models: for instance, linear models for understanding feature effects, and Gaussian processes for prediction. ...

This thesis focuses on the field of XAI (eXplainable AI), and more particularly on the local post-hoc interpretability paradigm, that is to say, the generation of explanations for a single prediction of a trained classifier. In particular, we study a fully agnostic context, meaning that the explanation is generated without using any knowledge about the classifier (treated as a black box) or the data used to train it. In this thesis, we identify several issues that can arise in this context and that may be harmful to interpretability. We study each of these issues and propose novel criteria and approaches to detect and characterize them. The three issues we focus on are: the risk of generating explanations that are out of distribution; the risk of generating explanations that cannot be associated with any ground-truth instance; and the risk of generating explanations that are not local enough. These risks are studied through two specific categories of interpretability approaches: counterfactual explanations and local surrogate models.
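
A minimal sketch of a local surrogate model of the kind mentioned above: sample perturbations around one instance, query the black box, and fit a proximity-weighted linear model. The black-box function and kernel width here are stand-ins for illustration, not the thesis's methods:

```python
import numpy as np

rng = np.random.default_rng(3)

def black_box(X):
    """Stand-in for an opaque trained classifier; we may only query it."""
    return (np.sin(3 * X[:, 0]) + X[:, 1] > 0).astype(float)

x0 = np.array([0.2, -0.1])                 # instance whose prediction we explain

# Sample perturbations around x0 and weight them by proximity (locality).
Z = x0 + rng.normal(scale=0.3, size=(500, 2))
w = np.exp(-np.sum((Z - x0) ** 2, axis=1) / (2 * 0.3 ** 2))

# Weighted least squares: scale rows by sqrt(w) and solve with lstsq.
A = np.column_stack([np.ones(len(Z)), Z - x0])
sw = np.sqrt(w)
coef, *_ = np.linalg.lstsq(A * sw[:, None], black_box(Z) * sw, rcond=None)
local_effects = coef[1:]                   # local feature attributions at x0
```

Note that all three risks discussed in the thesis show up here: the perturbations may leave the data distribution, and the quality of the explanation depends on how local the kernel really is.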

... Favorable conditions that may trigger blooms include high water temperature (Beaulieu et al., 2013; Elliott, 2010; Haakonsson et al., 2017; Paerl and Huisman, 2008), low salinity (Engström-Öst et al., 2011), high nutrient concentrations (Downing et al., 2001; Gobler et al., 2016; Smith, 1986; Smith and Schindler, 2009), high light penetration (Davis and Koop, 2006) and water column stability (Carey et al., 2012; Wagner and Adrian, 2009). However, the variables that best explain occurrence are not always those that best predict it (Shmueli, 2010). In eutrophic ecosystems, nutrients (phosphorus, P, and nitrogen, N) are major determinants of CyanoHABs (Lancelot and Muylaert, 2011; Lehman et al., 2010), but other variables, such as water temperature, salinity and hydrological features, can be more effective as predictors of bloom dynamics (Olli et al., 2015; Robson and Hamilton, 2004; Taş et al., 2006). ...

Eutrophication and climate change scenarios engender the need to develop good predictive models for harmful cyanobacterial blooms (CyanoHABs). Nevertheless, modeling cyanobacterial biomass is a challenging task due to strongly skewed distributions that include many absences as well as extreme values (dense blooms). Most modeling approaches alter the natural distribution of the data by splitting them into zeros (absences) and positive values, assuming that different processes underlie these two components. Our objectives were (1) to develop a probabilistic model relating cyanobacterial biovolume to environmental variables in the Río de la Plata Estuary (35°S, 56°W, n = 205 observations) considering all biovolume values (zeros and positive biomass) as part of the same process; and (2) to use the model to predict cyanobacterial biovolume under different risk level scenarios using water temperature and conductivity as explanatory variables. We developed a compound Poisson-Gamma (CPG) regression model, an approach that has not previously been used for modeling phytoplankton biovolume, within a Bayesian hierarchical framework. Posterior predictive checks showed that the fitted model had a good overall fit to the observed cyanobacterial biovolume and to more specific features of the data, such as the proportion of samples crossing three threshold risk levels (0.2, 1 and 2 mm³ L⁻¹) at different water temperatures and conductivities. The CPG model highlights the strong control of cyanobacterial biovolume by nonlinear and interactive effects of water temperature and conductivity. The highest probability of crossing the three biovolume levels occurred at 22.2 °C and at the lowest observed conductivity (∼0.1 mS cm⁻¹). 
Cross-validation of the fitted model using out-of-sample observations (n = 72) showed the model's potential to be used in situ, as it enabled prediction of cyanobacterial biomass based on two readily measured variables (temperature and conductivity), making it an interesting tool for early alert systems and management strategies. Furthermore, this novel application demonstrates the potential of the Bayesian CPG approach for predicting cyanobacterial dynamics in response to environmental change.
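
The compound Poisson-Gamma distribution at the heart of the model can be simulated directly, which shows how a single process generates both exact zeros (bloom absences) and skewed positive biomass. Parameters below are illustrative, not the fitted values:

```python
import numpy as np

rng = np.random.default_rng(11)

def rcpg(lam, shape, scale, size):
    """Draw from a compound Poisson-Gamma: the sum of N ~ Poisson(lam)
    independent Gamma(shape, scale) terms; exact zeros occur when N == 0."""
    n_events = rng.poisson(lam, size=size)
    return np.array([rng.gamma(shape, scale, k).sum() if k else 0.0
                     for k in n_events])

# Illustrative parameters; units chosen to mimic biovolume in mm^3 L^-1.
biovolume = rcpg(lam=0.8, shape=2.0, scale=0.5, size=1000)
prop_zero = float(np.mean(biovolume == 0))   # expected near exp(-0.8) ≈ 0.45
```

This is why the CPG approach can treat absences and positive biomass "as part of the same process" instead of splitting the data into a zero model and a positive model.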

... In fact, sometimes the best predictors are interventions that counteract a causal process. Removing causal expectations means fewer restrictions on the variables a model can include, so long as the goal is properly understood to be an accurate and generalizable prediction, and not a deeper understanding of its biological significance (9). A useful prediction model should, therefore, satisfy three core criteria: ...

Prediction models aim to use available data to predict a health state or outcome that has not yet been observed. Prediction is primarily relevant to clinical practice, but is also used in research and administration. While prediction modeling involves estimating the relationship between patient factors and outcomes, it is distinct from causal inference. Prediction modeling thus requires unique considerations for development, validation, and updating. This document represents an effort from editors at 31 respiratory, sleep, and critical care medicine journals to consolidate contemporary best practices and recommendations related to prediction study design, conduct, and reporting. Herein, we address issues commonly encountered in submissions to our various journals. Key topics include considerations for selecting predictor variables, operationalizing variables, dealing with missing data, the importance of appropriate validation, model performance measures and their interpretation, and good reporting practices. Supplemental discussion covers emerging topics such as model fairness, competing risks, pitfalls of "modifiable risk factors", measurement error, and risk for bias. This guidance is not meant to be overly prescriptive; we acknowledge that every study is different, and no set of rules will fit all cases. Additional best practices can be found in the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines, to which we refer readers for further details.

... Right now, the central aim of developing a PRS is to produce accurate predictions that identify individuals at risk [98,99]. The use of SNPs and logistic regression when building a PRS could be improved, since logistic regression is designed for understanding the underlying process rather than optimized for prediction [100]. There are two approaches to building a PRS model, namely regression-based methods (e.g., logistic regression) and tree-based methods (e.g., random forest) [101,102]. ...

Recent studies have led to considerable advances in the identification of genetic variants associated with type 1 and type 2 diabetes. An approach for converting genetic data into a predictive measure of disease susceptibility is to sum the risk effects of loci into a polygenic risk score. In order to summarize the recent findings, we conducted a systematic review of studies comparing the accuracy of polygenic risk scores developed during the last two decades. We selected 15 risk scores from three databases (Scopus, Web of Science, and PubMed) for inclusion in this systematic review. We identified three polygenic risk scores that discriminate between type 1 diabetes patients and healthy people, one that discriminates between type 1 and type 2 diabetes, two that discriminate between type 1 and monogenic diabetes, and nine polygenic risk scores that discriminate between type 2 diabetes patients and healthy people. Prediction accuracy of polygenic risk scores was assessed by comparing the area under the curve. The actual benefits, potential obstacles, and possible solutions for the implementation of polygenic risk scores in clinical practice are also discussed. Developing strategies to establish the clinical validity of polygenic risk scores, by creating a framework for the interpretation of findings and their translation into actionable evidence, is the way to demonstrate their utility in medical practice.
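
At its core, a polygenic risk score is a weighted sum of risk-allele dosages across loci; a minimal sketch (effect sizes and genotypes below are made up for illustration, not taken from any study):

```python
import numpy as np

# Hypothetical per-SNP effect sizes (e.g., log-odds per risk allele) for 4 SNPs.
effect_sizes = np.array([0.12, -0.05, 0.30, 0.08])

# Risk-allele dosages (0, 1, or 2 copies) for 3 individuals.
genotypes = np.array([
    [0, 1, 2, 1],
    [2, 0, 0, 1],
    [1, 2, 1, 0],
])

# PRS for each individual = sum over SNPs of dosage * effect size.
prs = genotypes @ effect_sizes
```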

... Biologically, this concept can be viewed as analogous to quantifying and encoding the gene mutational burden observed in the data (14), which corresponds to the amount of (possibly damaging) mutations carried by the genes covered by the WES. This is nevertheless conceptually different from existing mutation burden testing approaches such as (15)(16)(17), since the goal of our study is mainly predictive and not just explanatory (18). ...

Whole exome sequencing (WES) data are allowing researchers to pinpoint the causes of many Mendelian disorders. In time, sequencing data will be crucial to solve the genome interpretation puzzle, which aims at uncovering the genotype-to-phenotype relationship, but for the moment many conceptual and technical problems need to be addressed. In particular, very few attempts at the in-silico diagnosis of oligo-to-polygenic disorders have been made so far, due to the complexity of the challenge, the relative scarcity of the data, and issues such as batch effects and data heterogeneity, which are confounding factors for machine learning (ML) methods. Here, we propose a method for the exome-based in-silico diagnosis of Crohn’s disease (CD) patients which addresses many of the current methodological issues. First, we devise a rational ML-friendly feature representation for WES data based on the gene mutational burden concept, which is suitable for datasets with small sample sizes. Second, we propose a Neural Network (NN) with parameter tying and heavy regularization, in order to limit its complexity and thus the risk of over-fitting. We trained and tested our NN on three CD case-control datasets, comparing the performance with the participants of previous CAGI challenges. We show that, notwithstanding the limited NN complexity, it outperforms the previous approaches. Moreover, we interpret the NN predictions by analyzing the learned patterns at the variant and gene level and investigating the decision process leading to each prediction.
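
The gene mutational burden representation can be sketched as a per-gene aggregation of weighted variant dosages. The gene names, weights, and aggregation rule below are placeholders for illustration, not the paper's exact encoding:

```python
# Toy sketch: collapse per-variant calls into one burden value per gene.
# "weight" stands in for a deleteriousness score; all values are made up.
variants = [
    {"gene": "GENE_A", "dosage": 1, "weight": 0.9},
    {"gene": "GENE_A", "dosage": 2, "weight": 0.4},
    {"gene": "GENE_B", "dosage": 1, "weight": 0.7},
]

burden = {}
for v in variants:
    burden[v["gene"]] = burden.get(v["gene"], 0.0) + v["dosage"] * v["weight"]

# The resulting per-gene vector (one feature per gene) is compact enough
# for small-sample-size datasets, unlike a raw per-variant encoding.
```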

... Corresponding tools include SAM [2] and limma [3]; related applications include protein fold recognition [17][18][19], protease substrate prediction [20,21] and protein backbone torsion angle prediction [22]. Thus, predictive variables [23][24][25] are selected according to the classification results of a certain classifier. Random forest [26,27] is a case in point. ...

Background:
Various methods for differential expression analysis have been widely used to identify the features that best distinguish between different categories of samples. Multiple hypothesis testing may leave out explanatory features, each of which may be composed of individually insignificant variables. Multivariate hypothesis testing holds a non-mainstream position, given the large computational overhead of large-scale matrix operations. Random forest provides a classification strategy for calculating variable importance; however, it may be unsuitable for some sample distributions.
Results:
Based on the idea of using an ensemble classifier, we developed a feature selection tool for differential expression analysis on expression profiles (ECFS-DEA for short). Considering the differences in sample distribution, a graphical user interface is designed to allow the selection of different base classifiers. Inspired by random forest, a common measure that is applicable to any base classifier is proposed for calculating variable importance. After interactive selection of a feature from the sorted individual variables, a projection heatmap is presented using k-means clustering. An ROC curve is also provided; both visualizations intuitively demonstrate the effectiveness of the selected feature.
Conclusions:
Feature selection through ensemble classifiers helps to select important variables and is thus applicable to different sample distributions. Experiments on simulated and real data demonstrate the effectiveness of ECFS-DEA for differential expression analysis on expression profiles. The software is available at http://bio-nefu.com/resource/ecfs-dea.
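A classifier-agnostic importance measure in this spirit (not necessarily ECFS-DEA's exact measure) is permutation importance: shuffle one variable's values and record the drop in accuracy of whatever base classifier is in use. A minimal sketch with a toy classifier:

```python
import random

def accuracy(clf, X, y):
    # Fraction of samples the classifier labels correctly.
    return sum(clf(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(clf, X, y, var, seed=0):
    # Importance of variable `var` = accuracy drop after shuffling it.
    rng = random.Random(seed)
    base = accuracy(clf, X, y)
    col = [row[var] for row in X]
    rng.shuffle(col)
    X_perm = [row[:var] + [v] + row[var + 1:] for row, v in zip(X, col)]
    return base - accuracy(clf, X_perm, y)

# Toy base classifier: thresholds variable 0 and ignores variable 1.
clf = lambda row: int(row[0] > 0.5)
X = [[0.1, 9], [0.9, 3], [0.2, 7], [0.8, 1]]
y = [0, 1, 0, 1]
print(permutation_importance(clf, X, y, var=1))  # 0.0: variable 1 is unused
```

Because the measure only queries the classifier's predictions, any base classifier can be plugged in, which is the point of the ensemble-classifier design above.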

... While the predictive performance of our proposed modeling framework (described in Section 4) is not directly impacted by multicollinearity, the presence of high-dimensional, correlated predictors could lead to a 'masking effect' of certain variables or model overfitting [29][30][31][32]. In other words, the model will still perform well in terms of predictive accuracy in the face of multicollinearity, but there may be indirect effects, such as the masking or suppression of uncorrelated predictor variables in favor of the correlated ones. ...

Projected climate change will significantly influence the shape of the end-use energy demand profiles for space conditioning—leading to a likely increase in cooling needs and a subsequent decrease in heating needs. This shift will put pressure on existing infrastructure and utility companies to meet a demand that was not accounted for in the initial design of the systems. Furthermore, the traditional linear models typically used to predict energy demand focus on isolating either the electricity or natural gas demand, even though the two demands are highly interconnected. This practice often leads to less accurate predictions for both demand profiles. Here, we propose a multivariate, multi-sector (i.e., residential, commercial, industrial) framework to model the climate sensitivity of the coupled electricity and natural gas demand simultaneously, leveraging advanced statistical learning algorithms. Our results indicate that the season-to-date heating and cooling degree-days, as well as the dew point temperature are the key predictors for both the electricity and natural gas demand. We also found that the energy sector is most sensitive to climate during the autumn and spring (intermediate) seasons, followed by the summer and winter seasons. Moreover, the proposed model outperforms a similar univariate model in terms of predictive accuracy, indicating the importance of accounting for the interdependence within the energy sectors. By providing accurate predictions of the electricity and natural gas demand, the proposed framework can help infrastructure planners and operators make informed decisions towards ensuring balanced energy delivery and minimizing supply inadequacy risks under future climate variability and change.
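The degree-day predictors highlighted above have a simple arithmetic definition: daily deviations of mean temperature from a balance point. The sketch below assumes the common 65 °F convention; the framework's exact balance point is not stated here:

```python
# Heating degree-days (HDD) accumulate how far the daily mean falls
# below the balance point; cooling degree-days (CDD) accumulate how far
# it rises above it.

BALANCE_F = 65.0  # a common convention, assumed here

def degree_days(daily_mean_temps):
    hdd = sum(max(BALANCE_F - t, 0) for t in daily_mean_temps)
    cdd = sum(max(t - BALANCE_F, 0) for t in daily_mean_temps)
    return hdd, cdd

# Hypothetical week of daily mean temperatures (degrees F).
temps = [50, 55, 60, 70, 75, 80, 65]
hdd, cdd = degree_days(temps)
print(hdd, cdd)  # 30.0 30.0
```

The season-to-date variants used as predictors are simply these sums accumulated from the start of the season to the day of interest.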

The chapter presents the development of machine learning in German official statistics. Starting with a methodological introduction and a brief reference to international activities as well as an overview for Germany, machine learning projects in the field of classification and editing & imputation are presented and discussed in more detail.

The current research aims to launch effective accounting fraud detection models using imbalanced ensemble learning algorithms for China A-Share listed firms. Based on a sample of 33,544 Chinese firm-year instances from 1998 to 2017, this research established one logistic regression and four ensemble learning classifiers (AdaBoost, XGBoost, CUSBoost, and RUSBoost) using 12 financial ratios and 28 raw financial data items. Additionally, we divided the sample into training and test observations to evaluate the classifiers' out-of-sample performance. In detail, we applied two metrics, namely, Area under the ROC (receiver operating characteristic) curve (AUC) and Area under the Precision-Recall curve (AUPR), to evaluate the classifiers' discriminability. In a supplementary test, this study put forward an algebraically fused model based on the four ensemble learning classifiers and introduced the sliding window technique. The empirical results showed that the ensemble learning classifiers can detect accounting fraud for the imbalanced China A-listed firms far more effectively than the logistic regression model. Moreover, the imbalanced ensemble learning classifiers (CUSBoost and RUSBoost) performed better on average than the common ensemble learning models (AdaBoost and XGBoost). The algebraically fused model in the supplementary test also obtained the highest average AUC and AUPR among all the employed algorithms. Our results offer firm support for the potential role of Machine Learning (ML)-based Artificial Intelligence (AI) approaches in reliably predicting accounting fraud with high accuracy. Similarly, in the Chinese setting, our ML-based AI approach offers a considerable advantage in forecasting accounting fraud. Finally, this paper fills the research gap on the applications of imbalanced ensemble learning in accounting fraud detection for Chinese listed firms.
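Both evaluation metrics are computed from classifier scores; the ROC AUC in particular equals the Mann-Whitney rank statistic: the probability that a randomly chosen positive case is scored above a randomly chosen negative one. A minimal sketch with invented scores and labels (not the study's data):

```python
# ROC AUC as a rank statistic: count score "wins" of positives over
# negatives, with ties counted as half a win.

def roc_auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy imbalanced sample: 2 fraud cases (label 1) among 6 firms.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0, 0]
print(roc_auc(scores, labels))  # 0.875
```

Under heavy class imbalance the AUPR complements the AUC, since precision is sensitive to the rarity of the positive class while the ROC curve is not.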

Big data and algorithmic risk prediction tools promise to improve criminal justice systems by reducing human biases and inconsistencies in decision‐making. Yet different, equally justifiable choices when developing, testing and deploying these socio‐technical tools can lead to disparate predicted risk scores for the same individual. Synthesising diverse perspectives from machine learning, statistics, sociology, criminology, law, philosophy and economics, we conceptualise this phenomenon as predictive inconsistency. We describe sources of predictive inconsistency at different stages of algorithmic risk assessment tool development and deployment and consider how future technological developments may amplify predictive inconsistency. We argue, however, that in a diverse and pluralistic society we should not expect to completely eliminate predictive inconsistency. Instead, to bolster the legal, political and scientific legitimacy of algorithmic risk prediction tools, we propose identifying and documenting relevant and reasonable ‘forking paths’ to enable quantifiable, reproducible multiverse and specification curve analyses of predictive inconsistency at the individual level.

Machine learning's (ML's) unique power to approximate functions and identify non-obvious regularities in data has attracted considerable attention from researchers in natural and social sciences. The emergence of predictive modeling applications in OM studies notwithstanding, it remains unclear how OM scholars can effectively leverage supervised ML for theory building and theory testing, the primary goals of scientific research. We attempt to fill this gap by conducting a literature review of recent developments in supervised ML in OM to identify vacancies in the extant literature, shedding light on how ML applications can move beyond problem-solving into theory building, and formulating a procedure to help OM scholars leverage ML for exploratory theory development. Our procedure employs the random forest with well-developed properties and inference toolkits that are crucial for empirical research. We then expand the boundary of ML usage and connect supervised ML to the explanatory modeling and hypothesis testing employed by OM empiricists for decades, and discuss the use of supervised ML for causal inference from observational data. We posit that contemporary ML can facilitate pattern exploration and enhance the validity of theory testing. We conclude by discussing directions for future empirical OM studies that aim to leverage ML.

The advent of technological developments is allowing the gathering of large amounts of data in several research fields. Learning analytics (LA)/educational data mining has access to big observational unstructured data captured from educational settings and relies mostly on unsupervised machine learning (ML) algorithms to make sense of such data. Generalized additive models for location, scale, and shape (GAMLSS) are a supervised statistical learning framework that allows modeling all the parameters of the distribution of the response variable with respect to the explanatory variables. This article overviews the power and flexibility of GAMLSS in relation to some ML techniques. GAMLSS' capability to be tailored toward causality via causal regularization is also briefly commented on. This overview is illustrated via a data set from the field of LA.
This article is categorized under: Application Areas > Education and Learning
Algorithmic Development > Statistics
Technologies > Machine Learning

The influence of controlling shareholder characteristics on corporate risk has been a popular topic for discussion in academic and theoretical circles. However, current research lacks systematic and quantitative conclusions based on predictive ability, as it only focuses on the causal relationship between a single characteristic of the controlling shareholder and corporate risk. This paper utilizes the back propagation neural network based on the gray wolf algorithm (GWO-BP), a machine learning method, for the first time, and takes the listed companies that publicly issue bonds in the Chinese bond market as a research sample. It summarizes the qualities of controlling shareholders from the perspective of controlling shareholders’ risk-taking and benefits expropriation and examines multi-dimensional controlling shareholder characteristics for predicting the debt default risk of companies. This research established that: (1) Overall, the characteristics of controlling shareholders improve the ability to predict the debt default of a company; (2) The features of the controlling shareholder's investment portfolio have greater power for predicting the debt default risk of a company, while the properties of equity structure and related transactions have less. This research not only uses machine learning methods to study controlling shareholders in China from a more comprehensive perspective but also provides a useful incentive for bondholders to protect their interests.

The vital role of motivation becomes even more evident when considering the digital transformation of learning and teaching environments, especially with the effect of the pandemic. Basic psychological needs and emotions, which have not been comprehensively examined together despite their important roles in motivation, draw attention. Accordingly, this study aims to reveal the psychological, emotional, and individual variables that influence pre-service teachers’ intention to use technology, and to evaluate and validate the predictive power of a proposed model. The technology acceptance model formed the basis of the proposed model, and the model was extended with self-determination theory (competence, autonomy, relatedness) and a framework of emotions (enjoyment, playfulness, anxiety, frustration). Data were collected online from 591 pre-service teachers studying in 10 different departments of a state university. In the data analysis, PLS-SEM, PLSpredict, and multi-group analysis were performed. The results revealed that the model explains 79.8% of the intention and that the predictive power of the model is high. The relationship between competence and perceived ease of use is the strongest relationship in the model, and the most influential construct on intention is enjoyment. These findings suggest that both intrinsic and extrinsic motivation play a major role in technology acceptance, especially during the pandemic. In addition, innovativeness, which is related to technology use and motivation, had various moderator effects on the relationships. The findings indicate that the model, which offers a motivational approach based on basic psychological needs and emotions, provides rare information and has high relevance for the field.

Semantic processing (SP) is one of the critical abilities of humans for representing and manipulating conceptual and meaningful information. Neuroimaging studies of SP typically collapse data from many subjects, but its neural organization and behavioral performance vary between individuals. It is not yet understood whether and how individual variabilities in neural network organization contribute to individual differences in SP behaviors. We aim to identify the neural signatures underlying SP variabilities by analyzing functional connectivity (FC) patterns based on a large-sample Human Connectome Project (HCP) dataset and rigorous predictive modeling. We used a two-stage predictive modeling approach to build an internally cross-validated model and to test the model's generalizability with unseen data from different HCP samples and other out-of-sample datasets. FC patterns within a putative semantic brain network were significantly predictive of individual SP scores summarized from five SP-related behavioral tests. This cross-validated model can be used to predict unseen HCP data. The model generalizability was enhanced in the language task compared with other tasks used during scanning and was better for females than males. The model constructed from the HCP dataset can be partially generalized to two independent cohorts that participated in different semantic tasks. FCs connecting to the Perisylvian language network show the most reliable contributions to predictive modeling and out-of-sample generalization. These findings contribute to our understanding of the neural sources of individual differences in SP, potentially laying the foundation for personalized education for healthy individuals and intervention for patients with SP and language deficits.

Reproducibility is not only essential for the integrity of scientific research but is also a prerequisite for model validation and refinement for the future application of predictive algorithms. However, reproducible research is becoming increasingly challenging, particularly in high-dimensional genomic data analyses with complex statistical or algorithmic techniques. Given that there are no mandatory requirements in most biomedical and statistical journals to provide the original data, analytical source code, or other relevant materials for publication, accessibility to these supplements naturally suggests a greater credibility of the published work. In this study, we performed a reproducibility assessment of the notable paper by Gerstung et al. (Nat Genet 49:332–340, 2017) by rerunning the analysis using their original code and data, which are publicly accessible. Despite an open science setting, it was challenging to reproduce the entire research project; reasons included: incomplete data and documentation, suboptimal code readability, coding errors, limited portability of intensive computing performed on a specific platform, and an R computing environment that could no longer be re-established. We learn that the availability of code and data does not guarantee transparency and reproducibility of a study; paradoxically, the source code is still liable to error and obsolescence, essentially due to methodological and computational complexity, a lack of reproducibility checking at submission, and updates for software and operating environment. The complex code may also hide problematic methodological aspects of the proposed research. Building on the experience gained, we discuss the best programming and software engineering practices that could have been employed to improve reproducibility, and propose practical criteria for the conduct and reporting of reproducibility studies for future researchers.

Background:
Accurate clinical prediction supports the effective treatment of alcohol use disorder (AUD) and other psychiatric disorders. Traditional statistical techniques have identified patient characteristics associated with treatment outcomes. However, less work has focused on systematically leveraging these associations to create optimal predictive models. The current study demonstrates how machine learning can be used to predict clinical outcomes in people completing outpatient AUD treatment.
Method:
We used data from the COMBINE multisite clinical trial (n = 1383) to develop and test predictive models. We identified three priority prediction targets, including (1) heavy drinking during the first month of treatment, (2) heavy drinking during the last month of treatment, and (3) heavy drinking between weekly/bi-weekly sessions. Models were generated using the random forest algorithm. We used "leave sites out" partitioning to externally validate the models in trial sites that were not included in the model training. Stratified model development was used to test for sex differences in the relative importance of predictive features.
Results:
Models predicting heavy alcohol use during the first and last months of treatment showed internal cross-validation area under the curve (AUC) scores ranging from 0.67 to 0.74. AUC was comparable in the external validation using data from held-out sites (AUC range = 0.69 to 0.72). The model predicting between-session heavy drinking showed strong classification accuracy in internal cross-validation (AUC = 0.89) and external test samples (AUC range = 0.80 to 0.87). Stratified analyses showed substantial sex differences in optimal feature sets.
Conclusion:
Machine learning techniques can predict alcohol treatment outcomes using routinely collected clinical data. This technique has the potential to greatly improve clinical prediction accuracy without requiring expensive or invasive assessment methods. More research is needed to understand how best to deploy these models.
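The "leave sites out" partitioning described in the Method can be sketched as a grouped split, where each trial site's participants are held out together so the model is always evaluated on sites it never saw during training. Site labels below are hypothetical:

```python
# Leave-sites-out external validation: one split per site, with that
# site's subjects as the test set and everyone else as training data.

def leave_sites_out_splits(site_of_subject):
    """site_of_subject: list mapping subject index -> site label;
    yields (held_out_site, train_indices, test_indices)."""
    for held_out in sorted(set(site_of_subject)):
        train = [i for i, s in enumerate(site_of_subject) if s != held_out]
        test = [i for i, s in enumerate(site_of_subject) if s == held_out]
        yield held_out, train, test

site_of_subject = ["A", "A", "B", "C", "B", "C"]
for site, train, test in leave_sites_out_splits(site_of_subject):
    print(site, train, test)
# A [2, 3, 4, 5] [0, 1]
# B [0, 1, 3, 5] [2, 4]
# C [0, 1, 2, 4] [3, 5]
```

Grouping by site rather than splitting subjects at random is what makes the reported AUCs an external-validation estimate rather than an internal one.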

How well adolescents get along with others such as peers and teachers is an important aspect of adolescent development. Current research on adolescent relationships with peers and teachers is limited by classical methods that lack an explicit test of predictive performance and cannot efficiently discover complex associations with potential non-linearity and higher-order interactions among a large set of predictors. Here, a transparently reported machine learning approach is utilized to overcome these limitations in concurrently predicting how well adolescents perceive themselves to get along with peers and teachers. The predictors were 99 items from four instruments examining internalizing and externalizing psychopathology, sensation-seeking, peer pressure, and parent-child conflict. The sample consisted of 3232 adolescents (M = 14.0 years, SD = 1.0 year, 49% female). Unlike classical methods, nonlinear machine learning classifiers predicted adolescents' relationships with peers and teachers with high performance. Using model explainability analyses at the item level, the results identified influential predictors related to somatic complaints and attention problems that interacted in nonlinear ways with internalizing behaviors. In many cases, these intrapersonal predictors outcompeted many interpersonal predictors in predictive power. Overall, the results suggest the need to cast a much wider net of variables for understanding and predicting adolescent relationships, and highlight the power of a data-driven machine learning approach, with implications for a predictive science of adolescence research.

The human brain is more complex than any other known structure in the universe. Vast amounts of information are stored in it: memories are kept, habits are learned, and personalities are shaped. Psychology came into action to understand how all of this works and how the mind responds to certain events. Most of the work in this field is based on history and impressions, while much treatment relies on guesswork; even the best psychiatrists choose the correct treatment for a given patient only about 37% of the time, a figure we ultimately want to improve. So, this paper presents the major connection between computer science and psychology, and how machine learning can change the overall approach and methods in the treatment of mental health and in the understanding of psychology.

In this paper, we analyze the impact of various factors on meeting service level agreements (SLAs) for information technology (IT) incident resolution. Using a large IT services incident dataset, we develop and compare multiple models to predict the value of a target Boolean variable indicating whether an incident met its SLA. Logistic regression and neural network models are found to have the best performance in terms of misclassification rates and average squared error. From the best-performing models, we identify a set of key variables that influence the achievement of SLAs. Based on model insights, we provide a thorough discussion of IT process management implications. We suggest several strategies that can be adopted by incident management teams to improve the quality and effectiveness of incident management processes, and recommend avenues for future research.
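As an illustration of one of the best-performing model classes above, a logistic regression scores an incident by passing a linear combination of features through the sigmoid. The feature names and coefficients below are invented for illustration and are not the study's estimates:

```python
import math

# Hypothetical logistic model of the probability that an incident
# meets its SLA: sigmoid of intercept + weighted feature sum.

def predict_sla_met(features, coefs, intercept):
    z = intercept + sum(coefs[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Made-up coefficients: higher priority and more reassignments hurt,
# arriving during business hours helps.
coefs = {"priority": -0.8, "reassignments": -0.5, "business_hours": 0.6}
incident = {"priority": 1, "reassignments": 2, "business_hours": 1}
p = predict_sla_met(incident, coefs, intercept=2.0)
print(round(p, 3))  # sigmoid(0.8), approximately 0.69
```

One attraction of this model class for process management is that each coefficient's sign and size directly suggest which incident attributes drive SLA breaches.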

Throughout semiarid western North America, flood irrigation and associated small reservoirs have created or augmented many wetlands that otherwise would not exist or persist through summer. Diversion of mountain snowmelt from rivers has thereby created widely scattered hotspots of biodiversity. Increased urban water demands, higher profits from sprinkler irrigation, and climate-driven declines in mountain snowpack threaten these wetlands. Knowledge of unique functions of different wetland types and their spatial interactions would aid conservation of wetland complexes. We characterized use by ducks of wetlands with varying salinities, vegetation, nearby land use, and spatial relations in the Laramie Basin, Wyoming, USA. All duck species and social groups had higher densities in smaller wetlands. Pairs and broods of diving ducks and some dabbling ducks had highest densities in oligosaline wetlands (0.5–5 ‰ salinity) which have emergent plants for nesting cover. However, these ducks were commonly observed in mesosaline wetlands (5–18 ‰) which lack emergent cover but have higher availability of near-surface foods, suggesting differential use of wetland types for nesting and feeding. Accordingly, densities of some dabbling and diving ducks were higher when mesosaline wetlands were within 1 km. Hayfields or livestock grazing nearby seldom affected duck densities in wetlands, suggesting that with sparse upland cover in shortgrass steppe, many upland nesters sought cover in dry portions of the emergent fringe. For ducks in such intermountain basins, mesosaline wetlands with less stable water levels but high prey availability should be maintained in complexes near oligosaline wetlands with variably flooded emergent cover.

We provide explanations of the general principles of machine learning, as well as the analytical steps required for successful machine learning-based predictive modeling, which is the focus of this series. In particular, we define the terms machine learning, artificial intelligence, and supervised and unsupervised learning, continuing by introducing optimization, that is, the minimization of an objective error function, as the central dogma of machine learning. In addition, we discuss why it is important to separate predictive and explanatory modeling, and most importantly state that a prediction model should not be used to make inferences. Lastly, we broadly describe a classical workflow for training a machine learning model, starting with data pre-processing and feature engineering and selection, continuing with a training structure consisting of a resampling method, hyperparameter tuning, and model selection, and ending with evaluation of model discrimination and calibration as well as robust internal or external validation of the fully developed model. Methodological rigor and clarity, as well as an understanding of the underlying reasoning behind the internal workings of a machine learning approach, are required; otherwise, predictive applications, despite being strong analytical tools, will not be well accepted into the clinical routine.
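The resampling-based training structure described above can be sketched with a toy one-parameter model, where k-fold cross-validation selects the hyperparameter (here, a decision threshold). The data and model are illustrative stand-ins, not a clinical example:

```python
import random

def accuracy(threshold, xs, ys):
    # Fraction of points where the "predict 1 if x > threshold" rule is right.
    return sum((x > threshold) == y for x, y in zip(xs, ys)) / len(ys)

def kfold_select(xs, ys, candidates, k=3, seed=0):
    # k-fold cross-validation over candidate hyperparameter values:
    # the value with the best mean validation accuracy wins.
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    def cv_score(t):
        return sum(
            accuracy(t, [xs[i] for i in f], [ys[i] for i in f]) for f in folds
        ) / k
    return max(candidates, key=cv_score)

# Toy 1-D data: class 1 sits above 0.5, class 0 below.
xs = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9, 0.15, 0.85]
ys = [0, 0, 0, 1, 1, 1, 1, 0, 1]
best = kfold_select(xs, ys, candidates=[0.25, 0.5, 0.75])
print(best)  # 0.5 is the only candidate that separates the classes
```

A full workflow would additionally hold out a test set untouched by this selection loop, so the final discrimination and calibration estimates are not optimistically biased.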

This article aims to illustrate one of the many applications of Industry 4.0 through the use of multivariate analytical procedures and multi-response machine learning models, as a way to analyze, model, and standardize the relationships between the various input and output variables that govern the formulation of jams. This research was carried out at a company dedicated to the production and commercialization of agricultural products; it describes the study methodology used, which made it possible to find the ranges of values for the sugar (°Bx) and acidity (pH) levels that mathematically and statistically satisfy the finished-product release parameters defined by the company itself.

Neuroscience and artificial intelligence (AI) share a long history of collaboration. Advances in neuroscience, alongside huge leaps in computer processing power over the last few decades, have given rise to a new generation of in silico neural networks inspired by the architecture of the brain. These AI systems are now capable of many of the advanced perceptual and cognitive abilities of biological systems, including object recognition and decision making. Moreover, AI is now increasingly being employed as a tool for neuroscience research and is transforming our understanding of brain functions. In particular, deep learning has been used to model how convolutional layers and recurrent connections in the brain’s cerebral cortex control important functions, including visual processing, memory, and motor control. Excitingly, the use of neuroscience-inspired AI also holds great promise for understanding how changes in brain networks result in psychopathologies, and could even be utilized in treatment regimes. Here we discuss recent advancements in four areas in which the relationship between neuroscience and AI has led to major advancements in the field; (1) AI models of working memory, (2) AI visual processing, (3) AI analysis of big neuroscience datasets, and (4) computational psychiatry.

Background
Long-term exposure to ambient air pollution has been linked to depression incidence, although the evidence remains limited and inconsistent.
Objectives
To investigate the effects of long-term air pollution exposure on depression risk prospectively in China.
Methods
The present study used data from the Yinzhou Cohort on adults without depression at baseline, followed up until April 2020. Two-year moving average concentrations of particulate matter with a diameter ≤ 2.5 μm (PM2.5), ≤ 10 μm (PM10) and nitrogen dioxide (NO2) were measured using land-use regression (LUR) models for each participant. Depression cases were ascertained using the Health Information System (HIS) of the local health administration by linking the unique identifiers. We conducted Cox regression models with time-varying exposures to estimate the hazard ratios (HRs) and 95% confidence intervals (95% CIs) of depression for each pollutant, after adjusting for a set of individual covariates such as demographic characteristics, lifestyles, and comorbidity. In addition, physical activity, baseline potential depressive symptoms, cancer status, the COVID-19 pandemic, different outcome definitions and air pollution exposure windows were considered in sensitivity analyses.
Results
Among the 30,712 adults with a mean age of 62.22 ± 11.25 years, 1024 incident depression cases were identified over a total of 98,619 person-years of observation. Interquartile range increments of the air pollutants were associated with increased risks of depression, and the corresponding HRs were 1.59 (95% CI: 1.46, 1.72) for PM2.5, 1.49 (95% CI: 1.35, 1.64) for PM10 and 1.58 (95% CI: 1.42, 1.77) for NO2. Subgroup analyses suggested that participants not taking any protective measures against air pollution were more susceptible. The results remained robust in all sensitivity analyses.
Conclusions
Long-term exposure to ambient air pollution was identified as a risk factor for depression onset. Strategies to reduce air pollution are necessary to decrease the disease burden of depression.
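A hazard ratio reported per interquartile-range (IQR) increment follows directly from the Cox model's per-unit log-hazard coefficient. The beta and IQR values below are hypothetical, not the study's estimates:

```python
import math

# HR for an IQR increment of an exposure: exp(beta * IQR), where beta
# is the per-unit log-hazard coefficient from the Cox model.

def hr_per_iqr(beta_per_unit, iqr):
    return math.exp(beta_per_unit * iqr)

# Hypothetical values: beta = 0.02 per ug/m3 of a pollutant, IQR = 20 ug/m3.
print(round(hr_per_iqr(0.02, 20.0), 3))  # exp(0.4), approximately 1.492
```

Reporting per IQR rather than per unit puts pollutants with different concentration scales on a comparable footing, which is why the HRs above can be read side by side.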

Econometric land use models study determinants of land use shares of different classes: “agriculture”, “forest”, “urban” and “other” for example. Land use shares have a compositional nature as well as an important spatial dimension. We compare two compositional regression models with a spatial autoregressive nature in the framework of land use. We study the impact of the choice of coordinate space and prove that a choice of coordinate representation does not have any impact on the parameters in the simplex as long as we do not impose further restrictions. We discuss parameters interpretation taking into account the non-linear structure as well as the spatial dimension. In order to assess the explanatory variables impact, we compute and interpret the semi-elasticities of the shares with respect to the explanatory variables and the spatial impact summary measures.
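One common definition of the semi-elasticity of a share with respect to an explanatory variable, given here as background (the paper's exact formulation, which also folds in the spatial impact measures, may differ), is:

```latex
% Semi-elasticity of land-use share $s_j$ with respect to explanatory
% variable $x_k$: the approximate percentage change in $s_j$ induced by
% a one-unit change in $x_k$.
\[
  \eta_{jk}
  = \frac{\partial \ln s_j}{\partial x_k}
  = \frac{1}{s_j}\,\frac{\partial s_j}{\partial x_k}
\]
```

Because the shares sum to one, a positive semi-elasticity for one class necessarily implies compensating changes in the others, which is why interpretation must respect the compositional structure.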

For a given research question, there are usually a large variety of possible analysis strategies acceptable according to the scientific standards of the field, and there are concerns that this multiplicity of analysis strategies plays an important role in the non-replicability of research findings. Here, we define a general framework on common sources of uncertainty arising in computational analyses that lead to this multiplicity, and apply this framework within an overview of approaches proposed across disciplines to address the issue. Armed with this framework, and a set of recommendations derived therefrom, researchers will be able to recognize strategies applicable to their field and use them to generate findings more likely to be replicated in future studies, ultimately improving the credibility of the scientific process.

A measure of relative importance of variables is often desired by researchers when the explanatory aspects of econometric methods are of interest. To this end, the author briefly reviews the limitations of conventional econometrics in constructing a reliable measure of variable importance. The author highlights the relative stature of explanatory and predictive analysis in economics and the emergence of fruitful collaborations between econometrics and computer science. Learning lessons from both, the author proposes a hybrid approach based on conventional econometrics and advanced machine learning (ML) algorithms, which are otherwise used in predictive analytics. The purpose of this article is two-fold: to propose a hybrid approach to assessing relative importance and demonstrate its applicability in addressing policy priority issues with an example of food inflation in India, followed by a broader aim to introduce the possibility of conflating ML and conventional econometrics to an audience of researchers in economics and the social sciences in general.

The opioid crisis in the United States (US) has been defined by waves of drug- and locality-specific Opioid use-Related Epidemics (OREs) of overdose and bloodborne infections, among a range of health harms. The ability to identify localities at risk of such OREs, and better yet, to predict which ones will experience them, holds the potential to mitigate further morbidity and mortality. This narrative review was conducted to identify and describe quantitative approaches aimed at the “risk assessment”, “detection” or “prediction” of OREs in the US. We implemented a PubMed search composed of terms for the: 1) objective (e.g. prediction), 2) epidemiologic outcome (e.g. outbreak), 3) underlying cause (i.e. opioid use), 4) health outcome (e.g. overdose, HIV), and 5) location (i.e. the US). In total, 46 studies were included, and the following information was extracted: discipline, objective, health outcome, drug/substance type, geographic region/unit of analysis, and data sources. The studies identified relied on clinical, epidemiological, behavioral and drug-market surveillance and applied a range of methods including statistical regression, geospatial analyses, dynamic modeling, phylogenetic analyses and machine learning. Studies for the prediction of overdose mortality at the national/state/county and zip code level are rapidly emerging. Geospatial methods are increasingly used to identify hotspots of opioid use and overdose. In the context of infectious disease OREs, routine genetic sequencing of patient samples to identify growing transmission clusters via phylogenetic methods could increase early detection capacity. A coordinated implementation of multiple, complementary approaches would increase our ability to successfully anticipate outbreak risk and respond preemptively. We present a multi-disciplinary framework for the prediction of OREs in the US and reflect on the challenges research teams will face in implementing such strategies, along with good practices.

This introduction sets the stage for what this series of volumes on information systems (IS) research seeks to accomplish, that is, to move the standards of IS research beyond its comfort zone of deriving legitimacy from its more established reference disciplines toward crafting fresh and original indigenous theory. The first step toward reaching this goal involves reaching agreement on the need for theory, and on the preeminent role of theory as the most distinctive product of human intellectual activity. Following Aristotle’s approach of addressing the “Why?” question by answering the “What?” question, this chapter reviews major discussions surrounding the definition of theory from multiple disciplines and proposes a novel, more inclusive view of theory that encompasses the views of these disciplines while, at the same time, highlighting the unique goals that each theory category addresses. These unique communicative goals: theory as proposition, model, paradigm, worldview, grand theory, methodology, explanation, significant description, prescription, and metatheory, offer researchers a wider space within which exciting and original theorizing can take place.

While phishing has evolved over the years, it still exploits one of the weakest links in any information system — humans. The present study aims at describing who the potential phishing victims are. We constructed two types of phishing messages that represented two basic categories of phishing e-mails: regular and spear-phishing. In cooperation with the IT management of a municipality in the southwestern region of the United States, we sent these messages to the municipality’s employees and collected demographic data about individuals employed by the organization. We then applied eight supervised learning methods to classify the municipality’s employees into two groups: phished and not-phished. Our results indicate that spear-phishing yields a significantly higher response rate than regular phishing and that some machine learning methods yield high classification accuracy in predicting phishing victims. We close with a discussion of the results and their future implications.

An accurate predictive model for estimating the timing of seasonal phenological stages of grape (Vitis L.) would be a valuable tool for crop management. Currently the most used index for predicting the phenological timing of fruit crops is growing degree days (GDD), but the predictive accuracy of the GDD index varies from season to season and is considered unsatisfactory for grapevines grown in the midwestern United States. We used the methods of multiple regression to analyze and model the effects of multiple factors on the number of days remaining until each of four phenological stages (budbreak, bloom, veraison, and harvest maturity) for five cold-climate wine grape cultivars (Frontenac, La Crescent, Marquette, Petit Ami, and St. Croix) grown in central Iowa. The factors (predictor variables) evaluated in models included cultivar, numerical day of the year (DOY), DOY of soil thaw or the previous phenological stage, photoperiod, GDD with a base temperature of 10 °C (GDD 10), soil degree days with a base temperature of 5 °C (SDD 5), and solar accumulation. Models were evaluated for predictive accuracy and goodness of fit by calculating the coefficient of determination (R2), the corrected Akaike information criterion (AICc), and the Bayesian information criterion (BIC); testing for normal distribution of residuals; and comparing the actual number of days remaining until a phenological stage with the number of days predicted by models. The top-performing models from the training set were also tested for predictive accuracy on a validation dataset (a set of data not used to build the model), which consisted of environmental and phenological data recorded for one popular Midwest cultivar (Marquette) in 2019. At all four phenological stages, inclusion of multiple factors (cultivar and four to six additional factors) resulted in predictive models that were more accurate and consistent than models using cultivar and GDD 10 alone.
Multifactor models generated from data of all five cultivars had high R2 values of 0.996, 0.985, 0.985, and 0.869 for budbreak, bloom, veraison, and harvest, respectively, whereas R2 values for models using only cultivar and GDD 10 were substantially lower (0.787, 0.904, 0.960, and 0.828, respectively). The average errors (differences from actual) for the top multifactor models were 0.70, 0.84, 1.77, and 3.80 days for budbreak, bloom, veraison, and harvest, respectively, and average errors for models that included only cultivar and GDD 10 were much larger (5.27, 2.24, 2.79, and 4.29 days, respectively). In the validation tests, average errors for budbreak, bloom, veraison, and harvest were 1.92, 1.31, 0.94, and 1.67 days, respectively, for the top multifactor models and 10.05, 2.54, 4.23, and 4.96 days, respectively, for models that included cultivar and GDD 10 only. Our results demonstrate the improved accuracy and utility of multifactor models for predicting the timing of phenological stages of cold-climate grape cultivars in the midwestern United States. Used together in succession, the models for budbreak, bloom, veraison, and harvest form a four-stage, multifactor calculator for improved prediction of phenological timing. Multifactor models of this type could be tailored for specific cultivars and growing regions to provide the most accurate predictions possible.
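The GDD index used as the baseline above is a simple accumulation of daily heat units above a base temperature. A minimal sketch of the standard averaging method (function names are ours, not from the study):

```python
def daily_gdd(t_min, t_max, base=10.0):
    """Growing degree days for one day: mean of daily min/max minus the base,
    floored at zero (days colder than the base contribute nothing)."""
    return max(0.0, (t_min + t_max) / 2.0 - base)

def accumulate_gdd(daily_temps, base=10.0):
    """Running cumulative GDD over a season from (t_min, t_max) pairs."""
    total, out = 0.0, []
    for t_min, t_max in daily_temps:
        total += daily_gdd(t_min, t_max, base)
        out.append(total)
    return out
```

A phenology model of the kind described would then use the accumulated value at each date as one predictor among several (DOY, photoperiod, SDD, and so on).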

Aim
In contrast to studies of defects found during code review, we aim to clarify whether code review measures can explain the prevalence of post-release defects.
Method
We replicate McIntosh et al.’s (Empirical Softw. Engg. 21(5): 2146–2189, 2016) study that uses additive regression to model the relationship between defects and code reviews. To increase external validity, we apply the same methodology on a new software project. We discuss our findings with the first author of the original study, McIntosh. We then investigate how to reduce the impact of correlated predictors in the variable selection process and how to increase understanding of the inter-relationships among the predictors by employing Bayesian Network (BN) models.
Context
As in the original study, we use the same measures the authors obtained for the Qt project. We mine data from the version control system and issue tracker of Google Chrome and operationalize measures that are close analogs to the large collection of code, process, and code review measures used in the replicated study.
Results
Both the data from the original study and the Chrome data showed high instability in the influence of code review measures on defects, with the results being highly sensitive to the variable selection procedure. Models without code review predictors had as good or better fit than those with review predictors. The replication, however, agrees with the bulk of prior work in showing that prior defects, module size, and authorship have the strongest relationship to post-release defects. The application of BN models helped explain the observed instability by demonstrating that the review-related predictors do not affect post-release defects directly but only through indirect effects. For example, changes that have no review discussion tend to be associated with files that have had many prior defects, which in turn increase the number of post-release defects. We hope that similar analyses of other software engineering techniques may also yield a more nuanced view of their impact. Our replication package, including our data and scripts, is publicly available (Krutauz et al. 2020).

Shilling attacks against collaborative filtering (CF) models are characterized by several fake user profiles mounted on the system by an adversarial party with the goal of steering recommendation outcomes toward a malicious end.
The vulnerability of CF engines is directly tied to their heavy reliance on underlying interaction data, such as the user-item rating matrix (URM), to train their models, and to their inherent inability to distinguish genuine profiles from non-genuine ones.
Against this background, the majority of works conducted so far on shilling attacks has focused on properties such as the confronted recommendation models, recommendation outputs, and even the users under attack. The under-researched yet significant element has been the impact of data and data characteristics on the effectiveness of shilling attacks on CF models.
Toward this goal, this work presents a systematic and in-depth study by using an explanatory modeling approach built on a regression model to test the hypothesis of whether URM properties can impact the outcome of CF recommenders under a shilling attack.
We ran extensive experiments involving 97,200 simulations across three domains (movie, business, and music) and showed that URM properties considerably affect the robustness of CF models in shilling attack scenarios.
The results can be of great help to system designers in understanding the causes of variation in recommender system performance under a shilling attack.

The differences in life-history traits and processes between organisms living in the same or different populations contribute to their ecological and evolutionary dynamics. We developed mixed-effect model formulations of the popular size-at-age von Bertalanffy and Gompertz growth functions to estimate individual and group variation in body growth, using as a model system four freshwater fish populations, where tagged individuals were sampled for more than 10 years. We used the software Template Model Builder to estimate the parameters of the mixed-effect growth models. Tests on data that were not used to estimate model parameters showed good predictions of individual growth trajectories using the mixed-effects models and starting from one single observation of body size early in life; the best models had R2 > 0.80 over more than 500 predictions. Estimates of asymptotic size from the Gompertz and von Bertalanffy models were not significantly correlated, but their predictions of size-at-age of individuals were strongly correlated (r > 0.99), which suggests that choosing between the best models of the two growth functions would have negligible effects on the predictions of size-at-age of individuals. Model results pointed to size ranks that are largely maintained throughout the lifetime of individuals in all populations.

A recently published model that predicted the risk of skin tears in older adults was compared with seven additional published models. Four models were excluded because of limitations in research design. Four models were compared for their relative predictive performance and accuracy using sensitivity, specificity, and the area under the curve (AUC), computed via receiver-operating characteristic analysis. The predictive ability of the skin tear models differed, with the AUC ranging between 0.673 and 0.854. Based on predictive ability, the selection of models could lead to different clinical decisions and health outcomes. The model that had been adjusted for potential confounders, consisting of five variables (male gender, history of skin tears, history of falls, clinical skin manifestations of elastosis, and purpura), was found to be the most parsimonious for predicting skin tears in older adults (AUC 0.854; 81.7% sensitivity; 81.4% specificity). Effective models serve as important clinical tools for identifying older individuals at risk of skin tears and can better direct more timely and targeted prevention strategies that improve health outcomes and reduce health care expenditure.
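The AUC used to rank the models above has a simple probabilistic reading: it is the probability that a randomly chosen case receives a higher risk score than a randomly chosen non-case. A minimal sketch (our own illustration, not the study's code):

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney formulation:
    the fraction of positive/negative pairs in which the positive case
    outscores the negative one, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to chance-level discrimination, and values such as the 0.854 reported above mean the model ranks a random at-risk individual above a random not-at-risk individual about 85% of the time.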

Objective:
To test the hypotheses that emerging viruses are associated with neurological hospitalizations and that statistical models can be used to predict neurological sequelae from viral infections.
Methods:
An ecological study was carried out to observe time trends in the number of hospitalizations with inflammatory polyneuropathy and Guillain-Barré syndrome (GBS) in the state of Rio de Janeiro from 1997 to 2017. Increases in GBS from month to month were assessed using a Farrington test. In addition, a cross-sectional study was conducted analyzing 50 adults hospitalized for inflammatory polyneuropathies from 2015 to 2017. The extent to which Zika virus symptoms explained GBS hospitalizations was evaluated using a calibration test.
Results:
There were significant increases (Farrington test, P<0.001) in the incidence of GBS following the introduction of influenza A/H1N1 in 2009, dengue virus type 4 in 2013, and Zika virus in 2015. Of 50 patients hospitalized, 14 (28.0%) were diagnosed with arboviruses, 9 (18.0%) with other viruses, and the remainder with other causes of such neuropathies. Statistical models based on cases of emerging viruses accurately predicted neurological sequelae, such as GBS.
Conclusion:
The introduction of novel viruses increases the incidence of inflammatory neuropathies.

Health data are increasingly being generated at a massive scale, at various levels of phenotyping and from different types of resources. Concurrent with recent technological advances in both data-generation infrastructure and data-analysis methodologies, there have been many claims that these events will revolutionize healthcare, but such claims are still a matter of debate. Addressing the potential and challenges of big data in healthcare requires an understanding of the characteristics of the data. Here we characterize various properties of medical data, which we refer to as ‘axes’ of data, describe the considerations and tradeoffs taken when such data are generated, and the types of analyses that may achieve the tasks at hand. We then broadly describe the potential and challenges of using big data in healthcare resources, aiming to contribute to the ongoing discussion of the potential of big data resources to advance the understanding of health and disease.

Multilevel modeling is an increasingly popular technique for analyzing hierarchical data. This article addresses the problem of predicting a future observable y*_j in the jth group of a hierarchical data set. Three prediction rules are considered, and several analytical results on the relative performance of these prediction rules are demonstrated. In addition, the prediction rules are assessed by means of a Monte Carlo study that extensively covers both the sample size and parameter space. Specifically, the sample size space concerns the various combinations of Level 1 (individual) and Level 2 (group) sample sizes, while the parameter space concerns different intraclass correlation values. The three prediction rules employ OLS, prior, and multilevel estimators for the Level 1 coefficients β_j. The multilevel prediction rule performs the best across all design conditions, and the prior prediction rule degrades as the number of groups, J, increases. Finally, this article investigates the robustness of the multilevel prediction rule to misspecifications of the Level 2 model.
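In the random-intercept special case, the multilevel prediction rule described above amounts to shrinking the group mean toward the grand mean in proportion to the group's reliability. A sketch under that simplification (variance components τ² and σ² assumed known; function names are ours):

```python
def shrinkage_prediction(group_values, grand_mean, tau2, sigma2):
    """Empirical-Bayes style predictor of a future observation in group j:
    a weighted average of the group mean and the grand mean, where the
    weight w is the reliability of the group mean (between-group variance
    tau2 relative to the sampling variance sigma2/n_j of the group mean)."""
    n_j = len(group_values)
    ybar_j = sum(group_values) / n_j
    w = tau2 / (tau2 + sigma2 / n_j)
    return w * ybar_j + (1 - w) * grand_mean
```

With many observations per group, w approaches 1 and the rule approaches the OLS (group-mean) predictor; with few observations or a small intraclass correlation, it pulls toward the prior (grand-mean) predictor, which is why it dominates both across design conditions.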

What has science actually achieved? A theory of achievement should (1) define what has been achieved, (2) describe the means or methods used in science, and (3) explain how such methods lead to such achievements. Predictive accuracy is one truth-related achievement of science, and there is an explanation of why common scientific practices (of trading off simplicity and fit) tend to increase predictive accuracy. Akaike's explanation for the success of AIC is limited to interpolative predictive accuracy. But therein lies the strength of the general framework, for it also provides a clear formulation of many open problems of research.

Akaike's framework for thinking about model selection in terms of the goal of predictive accuracy and his criterion for model selection have important philosophical implications. Scientists often test models whose truth values they already know, and they often decline to reject models that they know full well are false. Instrumentalism helps explain this pervasive feature of scientific practice, and Akaike's framework helps provide instrumentalism with the epistemology it needs. Akaike's criterion for model selection also throws light on the role of parsimony considerations in hypothesis evaluation. I explain the basic ideas behind Akaike's framework and criterion; several biological examples, including the use of maximum likelihood methods in phylogenetic inference, are considered.

This chapter reviews some recent advancements in financial applications of genetic algorithms and genetic programming. We start with the more familiar applications, such as forecasting, trading, and portfolio management. We then trace the recent extensions to cash flow management, option pricing, volatility forecasting, and arbitrage. The direction then turns to agent-based computational finance, a bottom-up approach to the study of financial markets. The review also sheds light on a few technical aspects of GAs and GP, which may play a vital role in financial applications.

Online reverse auctions generate real-time bidding data that could be used via appropriate statistical estimation to assist the corporate buyer's procurement decision. To this end, we develop a method, called BidAnalyzer, which estimates dynamic bidding models and selects the most appropriate of them. Specifically, we enable model estimation by addressing the problem of partial observability; i.e., only one of N suppliers' bids is realized, and the other (N − 1) bids remain unobserved. To address partial observability, BidAnalyzer estimates the latent price distributions of bidders by applying Kalman filtering theory. In addition, BidAnalyzer conducts model selection by applying multiple information criteria. Using empirical data from an automotive parts auction, we illustrate the application of BidAnalyzer by estimating several dynamic bidding models to obtain empirical insights, retaining a model for forecasting, and assessing its predictive performance out-of-sample. The resulting one-step-ahead price forecast is accurate to within a 2.95% median absolute percentage error. Finally, we suggest how BidAnalyzer can serve as a device for price discovery in online reverse auctions.

The classification problem is considered in which an output variable y assumes discrete values with respective probabilities that depend upon the simultaneous values of a set of input variables x = {x1, ..., xn}. At issue is how error in the estimates of these probabilities affects classification error when the estimates are used in a classification rule. These effects are seen to be somewhat counterintuitive in both their strength and nature. In particular, the bias and variance components of the estimation error combine to influence classification in a very different way than with squared error on the probabilities themselves. Certain types of (very high) bias can be canceled by low variance to produce accurate classification. This can dramatically mitigate the effect of the bias associated with some simple estimators like "naive" Bayes, and the bias induced by the curse-of-dimensionality on nearest-neighbor procedures. This helps explain why such simple methods are often competitive with and sometimes superior to more sophisticated ones for classification, and why "bagging/aggregating" classifiers can often improve accuracy. These results also suggest simple modifications to these procedures that can (sometimes dramatically) further improve their classification performance.
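The central observation above can be made concrete with a toy two-class example: zero-one classification error depends only on which side of 1/2 the probability estimate falls, so even large bias is harmless when it does not cross the decision boundary, while squared error penalizes it regardless. A minimal illustration (our own, not from the paper):

```python
def zero_one_disagrees(p_true, p_hat, threshold=0.5):
    """1 if the estimated class probability leads to a different predicted
    class than the true probability would, else 0."""
    return int((p_true >= threshold) != (p_hat >= threshold))

def squared_error(p_true, p_hat):
    """Squared error on the probability itself, which does penalize bias."""
    return (p_true - p_hat) ** 2
```

Here an estimate biased by 0.3 (0.9 estimated as 0.6) incurs substantial squared error yet yields exactly the same classification decision, which is the mechanism behind naive Bayes remaining competitive despite badly biased probability estimates.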

One result of the increasing sophistication and complexity of MIS theory and research is the number of studies hypothesizing and testing for

A separate and distinct interaction with both the actual e-vendor and with its IT Web site interface is at the heart of online shopping. Previous research has established, accordingly, that online purchase intentions are the product of both consumer assessments of the IT itself (specifically, its perceived usefulness and ease of use, per TAM) and trust in the e-vendor. But these perspectives have been examined independently by IS researchers. Integrating these two perspectives and examining the factors that build online trust in an environment that lacks the typical human interaction that often leads to trust in other circumstances advances our understanding of these constructs and their linkages to behavior. Our research on experienced repeat online shoppers shows that consumer trust is as important to online commerce as the widely accepted TAM use-antecedents, perceived usefulness and perceived ease of use. Together these variable sets explain a considerable proportion of variance in intended behavior. The study also provides evidence that online trust is built through (1) a belief that the vendor has nothing to gain by cheating, (2) a belief that there are safety mechanisms built into the Web site, (3) a typical interface, and (4) one that is, moreover, easy to use.

This paper extends Ajzen’s (1991) theory of planned behavior (TPB) to explain and predict the process of e-commerce adoption by consumers. The process is captured through two online consumer behaviors: (1) getting information and (2) purchasing a product from a Web vendor. First, we simultaneously model the association between these two contingent online behaviors and their respective intentions by appealing to consumer behavior theories and the theory of implementation intentions, respectively. Second, following TPB, we derive for each behavior its intention, attitude, subjective norm, and perceived behavioral control (PBC).

Research over two decades has advanced the knowledge of how to assess predictive validity. We believe this has value to information systems (IS) researchers. To demonstrate, we used a widely cited study of IS spending. In that study, price-adjusted diffusion models were proposed to explain and to forecast aggregate U.S. information systems spending. That study concluded that such models would produce more accurate forecasts than would simple linear trend extrapolation. However, one can argue that the validation procedure provided an advantage to the diffusion models. We reexamined the results using an alternative validation procedure based on three principles extracted from forecasting research: (1) use ex ante (out-of-sample) performance rather than the fit to the historical data, (2) use well-accepted models as a basis for comparison, and (3) use an adequate sample of forecasts. Validation using this alternative procedure did confirm the importance of the price-adjustment, but simple trend extrapolations were found to be more accurate than the price-adjusted diffusion models.

Traditional analyses of the curve fitting problem maintain that the data do not indicate what form the fitted curve should take. Rather, this issue is said to be settled by prior probabilities, by simplicity, or by a background theory. In this paper, we describe a result due to Akaike (1973), which shows how the data can underwrite an inference concerning the curve’s form based on an estimate of how predictively accurate it will be. We argue that this approach throws light on the theoretical virtues of parsimoniousness, unification, and non ad hocness, on the dispute about Bayesianism, and on empiricism and scientific realism.

The relationship between depression and cerebrovascular disease (CBVD) continues to be debated, although little research has compared the predictive power of depression for coronary heart disease (CHD) with that for CBVD within the same population. This study aimed to compare the importance of depression for CHD and CBVD within the same population of adults free of apparent cardiovascular disease.
A random sample of 23,282 adults (9507 men, 13,775 women) aged 20-54 years was followed up for 7 years. Fatal and first non-fatal CHD and CBVD events were documented by linkage to the national hospital discharge and mortality registers.
Sex-age-education-adjusted hazard ratio (HR) for CHD was 1.66 [95% confidence interval (CI) 1.24-2.24] for participants with mild to severe depressive symptoms, i.e. those scoring ≥10 on the 21-item Beck Depression Inventory, and 2.04 (1.27-3.27) for those who filled antidepressant prescriptions compared with those without depression markers in 1998, i.e. at study baseline. For CBVD, the corresponding HRs were 1.01 (0.67-1.53) and 1.77 (0.95-3.29). After adjustment for behavioural and biological risk factors these associations were reduced but remained evident for CHD, the adjusted HRs being 1.47 (1.08-1.99) and 1.72 (1.06-2.77). For CBVD, the corresponding multivariable adjusted HRs were 0.87 (0.57-1.32) and 1.52 (0.81-2.84).
Self-reported depression using a standardized questionnaire and clinical markers of mild to severe depression were associated with an increased risk for CHD. There was no clear evidence that depression is a risk factor for CBVD, but this needs further confirmation.

The advent of formal definitions of the simplicity of a theory has important implications for model selection. But what is the best way to define simplicity? Forster and Sober ([1994]) advocate the use of Akaike's Information Criterion (AIC), a non-Bayesian formalisation of the notion of simplicity. This forms an important part of their wider attack on Bayesianism in the philosophy of science. We defend a Bayesian alternative: the simplicity of a theory is to be characterised in terms of Wallace's Minimum Message Length (MML). We show that AIC is inadequate for many statistical problems where MML performs well. Whereas MML is always defined, AIC can be undefined. Whereas MML is not known ever to be statistically inconsistent, AIC can be. Even when defined and consistent, AIC performs worse than MML on small sample sizes. MML is statistically invariant under 1-to-1 re-parametrisation, thus avoiding a common criticism of Bayesian approaches. We also show that MML provides answers to many of Forster's objections to Bayesianism. Hence an important part of the attack on Bayesianism fails.
• Introduction
• The Curve Fitting Problem
• 2.1 Curves and families of curves
• 2.2 Noise
• 2.3 The method of Maximum Likelihood
• 2.4 ML and over-fitting
• Akaike's Information Criterion (AIC)
• The Predictive Accuracy Framework
• The Minimum Message Length (MML) Principle
• 5.1 The Strict MML estimator
• 5.2 An example: The binomial distribution
• 5.3 Properties of the SMML estimator
• 5.3.1 Bayesianism
• 5.3.2 Language invariance
• 5.3.3 Generality
• 5.3.4 Consistency and efficiency
• 5.4 Similarity to false oracles
• 5.5 Approximations to SMML
• Criticisms of AIC
• 6.1 Problems with ML
• 6.1.1 Small sample bias in a Gaussian distribution
• 6.1.2 The von Mises circular and von Mises—Fisher spherical distributions
• 6.1.3 The Neyman–Scott problem
• 6.1.4 Neyman–Scott, predictive accuracy and minimum expected KL distance
• 6.2 Other problems with AIC
• 6.2.1 Univariate polynomial regression
• 6.2.2 Autoregressive econometric time series
• 6.2.3 Multivariate second-order polynomial model selection
• 6.2.4 Gap or no gap: a clustering-like problem for AIC
• 6.3 Conclusions from the comparison of MML and AIC
• Meeting Forster's objections to Bayesianism
• 7.1 The sub-family problem
• 7.2 The problem of approximation, or, which framework for statistics?
• Conclusion
• Details of the derivation of the Strict MML estimator
• MML, AIC and the Gap vs. No Gap Problem
• B.1 Expected size of the largest gap
• B.2 Performance of AIC on the gap vs. no gap problem
• B.3 Performance of MML in the gap vs. no gap problem
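AIC itself, the criterion debated throughout the outline above, is simple to compute: for a Gaussian least-squares model it reduces to n·ln(RSS/n) + 2k, penalising goodness of fit by the number of free parameters. A small sketch comparing a constant model with a straight-line model on the same data (our own illustration, not from the paper):

```python
import math

def aic_gaussian(rss, n, k):
    """AIC for a least-squares fit with k free parameters
    (including the error variance), up to an additive constant."""
    return n * math.log(rss / n) + 2 * k

def rss_constant(ys):
    """Residual sum of squares of the best constant model (the mean)."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def rss_line(xs, ys):
    """Residual sum of squares of the least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
```

On data generated around a line, the line model attains the lower AIC despite its extra parameter; the MML-versus-AIC dispute in the paper concerns cases (small samples, many parameters) where this trade-off misleads AIC.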

When a scientist uses an observation to formulate a theory, it is no surprise that the resulting theory accurately captures that observation. However, when the theory makes a novel prediction—when it predicts an observation that was not used in its formulation—this seems to provide more substantial confirmation of the theory. This paper presents a new approach to the vexed problem of understanding the epistemic difference between prediction and accommodation. In fact, there are several problems that need to be disentangled; in all of them, the key is the concept of overfitting. We float the hypothesis that accommodation is a defective methodology only when the methods used to accommodate the data fail to guard against the risk of overfitting. We connect our analysis with the proposals that other philosophers have made. We also discuss its bearing on the conflict between instrumentalism and scientific realism.
Introduction
Predictivisms—a taxonomy
Observations
Formulating the problem
What might Annie be doing wrong?
Solutions
Observations explained
Mayo on severe tests
The miracle argument and scientific realism
Concluding comments

Practitioners of many skills face the need to make some realistic statement about the likely outcome of a future 'experiment of interest' on the basis of observed variability of outcomes in previously conducted related experiments. In this book the authors provide the predictor with the data and formulae which will assist in accurate forecasting, and suggest that an effective answer is to be found in the concept of predictive distribution within the framework of statistical prediction analysis. An applied mathematical approach is adopted throughout and the book is aimed at readers with some statistical knowledge, final year undergraduates, numerate scientists, technologists and medical workers interested in predictive techniques.

A generalized form of the cross‐validation criterion is applied to the choice and assessment of prediction using the data‐analytic concept of a prescription. The examples used to illustrate the application are drawn from the problem areas of univariate estimation, linear regression and analysis of variance.
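The cross-validation criterion discussed above can be sketched in its leave-one-out form: each observation is predicted from a prescription fitted to the remaining observations, and the average discrepancy assesses the prescription. A minimal sketch (our own names, not Stone's notation):

```python
def loo_cv_score(data, fit_predict):
    """Leave-one-out cross-validation: mean squared error of predicting
    each held-out point from a model fitted to the other points.
    fit_predict takes a training list and returns a single prediction."""
    errors = []
    for i in range(len(data)):
        train = data[:i] + data[i + 1:]
        errors.append((data[i] - fit_predict(train)) ** 2)
    return sum(errors) / len(errors)

def sample_mean(xs):
    """The simplest prescription from the univariate estimation example."""
    return sum(xs) / len(xs)
```

Comparing the scores of two competing prescriptions (say, the mean against a trimmed mean, or one regression model against another) is the "choice of prediction" the abstract refers to.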

This article presents an in-class Monte Carlo demonstration, designed to demonstrate to students the implications of multicollinearity in a multiple regression study. In the demonstration, students already familiar with multiple regression concepts are presented with a scenario in which the "true" relationship between the response and predictor variables is known. Two alternative data generation mechanisms are applied to this scenario, one in which the predictor variables are mutually independent, and another where two predictor variables are correlated. A number of independent realizations of data samples are generated under each scenario, and the regression coefficients for an appropriately specified model are estimated with respect to each sample. Scatter-plots of the estimated regression coefficients under the two scenarios provide a clear visual demonstration of the effects of multicollinearity. The two scenarios are also used to examine the effects of model specification error.
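The two data-generation mechanisms can be reproduced in a short simulation. Everything below (coefficients, sample size, correlation) is an invented example, not the article's own figures: OLS is refit on many realizations under independent versus correlated predictors, and the spread of the estimates shows the variance inflation.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = np.array([2.0, 3.0])                       # "true" coefficients

def coef_estimates(rho, reps=500, n=50):
    """Simulate y = X @ beta + noise repeatedly and refit OLS each time."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    draws = []
    for _ in range(reps):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = X @ beta + rng.normal(size=n)
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        draws.append(b)
    return np.array(draws)

sd_indep = coef_estimates(rho=0.0).std(axis=0)    # independent predictors
sd_collin = coef_estimates(rho=0.95).std(axis=0)  # highly collinear predictors
```

Both estimators remain unbiased; collinearity shows up purely as a much larger standard deviation of the coefficient estimates, which is what the scatter-plots in the demonstration visualize.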

In this paper a method of estimating the parameters of a set of regression equations is reported which involves application of Aitken's generalized least-squares [1] to the whole system of equations. Under conditions generally encountered in practice, it is found that the regression coefficient estimators so obtained are at least asymptotically more efficient than those obtained by an equation-by-equation application of least squares. This gain in efficiency can be quite large if “independent” variables in different equations are not highly correlated and if disturbance terms in different equations are highly correlated. Further, tests of the hypothesis that all regression equation coefficient vectors are equal, based on “micro” and “macro” data, are described. If this hypothesis is accepted, there will be no aggregation bias. Finally, the estimation procedure and the “micro-test” for aggregation bias are applied in the analysis of annual investment data, 1935–1954, for two firms.
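The two-step procedure (equation-by-equation OLS to estimate the disturbance covariance, then generalized least squares on the stacked system) can be sketched as follows. The two toy equations, sample size, and covariance values are invented for illustration; the structure follows the feasible-GLS recipe for seemingly unrelated regressions.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 300
x1, x2 = rng.normal(size=n), rng.normal(size=n)
# disturbances are highly correlated across the two equations
e = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=n)
y1 = 1.0 + 2.0 * x1 + e[:, 0]
y2 = -1.0 + 0.5 * x2 + e[:, 1]

X1 = np.column_stack([np.ones(n), x1])
X2 = np.column_stack([np.ones(n), x2])

# Step 1: equation-by-equation OLS, used only to estimate the error covariance.
b1, *_ = np.linalg.lstsq(X1, y1, rcond=None)
b2, *_ = np.linalg.lstsq(X2, y2, rcond=None)
R = np.column_stack([y1 - X1 @ b1, y2 - X2 @ b2])
S = R.T @ R / n                                   # estimated disturbance covariance

# Step 2: feasible GLS on the stacked system, Omega = S kron I_n.
X = np.block([[X1, np.zeros_like(X2)], [np.zeros_like(X1), X2]])
y = np.concatenate([y1, y2])
Om_inv = np.kron(np.linalg.inv(S), np.eye(n))
beta = np.linalg.solve(X.T @ Om_inv @ X, X.T @ Om_inv @ y)
```

The resulting `beta` stacks both equations' coefficients; with weakly correlated regressors and strongly correlated disturbances, as here, the GLS estimates are asymptotically more efficient than the equation-by-equation ones.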

This special issue contains six papers that address a variety of practical research process questions. The papers explore how theory and method inevitably interact in particular organization and management studies. Here we offer an overview of how theory and method have been treated to date by organization researchers and suggest that respecting both the primacy of theory and the primacy of evidence is no easy task but a necessary balancing practice that characterizes high-quality research.

A result can be regarded as routinely predictable when it has recurred consistently under a known range of different conditions. This depends on the previous analysis of many sets of data, drawn from different populations. There is no such basis of extensive experience when a prediction is derived from the analysis of only a single set of data. Yet that is what is mainly discussed in our statistical texts. The paper discusses the design and analysis of studies aimed at achieving routinely predictable results. It uses two running case history examples.

This paper gives an account of an experiment in the use of the so-called DELPHI method, which was devised in order to obtain the most reliable opinion consensus of a group of experts by subjecting them to a series of questionnaires in depth interspersed with controlled opinion feedback.

Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International Conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
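The two sources of randomness described here (bootstrap resampling of the data and random feature selection at each split) can be sketched with a deliberately tiny forest of one-split trees. The data, split rule, and all names below are invented for illustration; a real random forest grows full trees and chooses split points by an impurity criterion.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 4))
y = (X[:, 0] > 0).astype(int)      # only feature 0 carries signal

def stump_fit(X, y, feats):
    """Pick the best single-feature median split among a random feature subset."""
    best = None
    for j in feats:
        t = np.median(X[:, j])
        left, right = y[X[:, j] <= t], y[X[:, j] > t]
        lab_l, lab_r = int(round(left.mean())), int(round(right.mean()))
        err = (np.sum(left != lab_l) + np.sum(right != lab_r)) / len(y)
        if best is None or err < best[0]:
            best = (err, j, t, lab_l, lab_r)
    return best[1:]

def forest_fit(X, y, n_trees=25, m=2):
    """Each tree sees a bootstrap sample and a random subset of m features."""
    n, p = X.shape
    trees = []
    for _ in range(n_trees):
        boot = rng.integers(0, n, n)
        feats = rng.choice(p, size=m, replace=False)
        trees.append(stump_fit(X[boot], y[boot], feats))
    return trees

def forest_predict(trees, X):
    """Majority vote over the trees."""
    votes = np.array([np.where(X[:, j] <= t, l, r) for (j, t, l, r) in trees])
    return (votes.mean(axis=0) > 0.5).astype(int)

trees = forest_fit(X, y)
acc = (forest_predict(trees, X) == y).mean()
```

Even though half of the stumps are fit on uninformative features, the majority vote recovers the signal, which is the strength-versus-correlation trade-off the abstract describes.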

Assuming a set of observations is available from the general linear model, and assuming prior information about parameters, we propose a method of assessing the influence of specified subsets of the data when the goal is to predict future observations.

Introduction
Interactions between Qualitative Predictors
Interactions between Qualitative and Quantitative/Continuous Predictors
Interactions between Quantitative/Continuous Predictors
Multicategory Models
Additional Considerations
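The core device behind the chapters listed above, an interaction between two quantitative predictors entering the model as a product term, can be shown in a few lines. The coefficients and sample below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# true model: the effect of x1 on y depends on the level of x2 (interaction 1.5)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 1.5 * (x1 * x2) + rng.normal(scale=0.1, size=n)

# design matrix with intercept, main effects, and the product (interaction) term
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Fitting with the product column recovers all four coefficients; omitting it would force a single slope for x1 regardless of x2 and bias the fit.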

The primary aim of this paper is to show how graphical models can be used as a mathematical language for integrating statistical and subject-matter information. In particular, the paper develops a principled, nonparametric framework for causal inference, in which diagrams are queried to determine if the assumptions available are sufficient for identifying causal effects from nonexperimental data. If so the diagrams can be queried to produce mathematical expressions for causal effects in terms of observed distributions; otherwise, the diagrams can be queried to suggest additional observations or auxiliary experiments from which the desired inferences can be obtained.

The Akaike information criterion (AIC) derived as an estimator of the Kullback-Leibler information discrepancy provides a useful tool for evaluating statistical models, and numerous successful applications of the AIC have been reported in various fields of natural sciences, social sciences and engineering. One of the main objectives of this book is to provide comprehensive explanations of the concepts and derivations of the AIC and related criteria, including Schwarz's Bayesian information criterion (BIC), together with a wide range of practical examples of model selection and evaluation criteria. A secondary objective is to provide a theoretical basis for the analysis and extension of information criteria via a statistical functional approach. A generalized information criterion (GIC) and a bootstrap information criterion are presented, which provide unified tools for modeling and model evaluation for a diverse range of models, including various types of nonlinear models and model estimation procedures such as robust estimation, the maximum penalized likelihood method and a Bayesian approach.
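For Gaussian regression the AIC takes the familiar form n·log(RSS/n) + 2k (up to an additive constant), with k the number of estimated parameters. A minimal sketch on invented data, comparing polynomial degrees when the truth is quadratic:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(-2, 2, n)
y = 1 + 2 * x - x**2 + rng.normal(scale=0.5, size=n)   # truth is quadratic

def aic(degree):
    """Gaussian AIC (up to a constant): n*log(RSS/n) + 2*(number of parameters)."""
    X = np.vander(x, degree + 1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = degree + 2        # polynomial coefficients plus the error variance
    return n * np.log(rss / n) + 2 * k

scores = {d: aic(d) for d in range(1, 7)}
best = min(scores, key=scores.get)
```

The underfit linear model is heavily penalized through its residual sum of squares, while higher-degree models pay the 2k complexity penalty for negligible gains in fit.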

The prequential approach is founded on the premises that the purpose of statistical inference is to make sequential probability forecasts for future observations, rather than to express information about parameters. Many traditional parametric concepts, such as consistency and efficiency, prove to have natural counterparts in this formulation, which sheds new light on these and suggests fruitful extensions.

1. Effective biodiversity management can only be implemented if data are available on assemblage–environment relationships. The level of detail needs to be relevant to the scale of planning and decision making. A number of remote-sensing methods are available, but there are few studies that link information collected at both landscape and local scales. This is particularly true for arthropods even though these organisms are ecologically very important.
2. We assessed the predictive power of habitat variables measured by airborne laser scanning (light detection and ranging; LiDAR) to model the activity, richness and composition of assemblages of forest-dwelling beetles. We compared the results with data acquired using conventional field methods. We sampled beetles with pitfall traps and flight-interception traps at 171 sampling stations along an elevation gradient in a montane forest.
3. We found a high predictive power of LiDAR-derived variables, which captured most of the predictive power of variables measured in ground surveys. In particular, mean body size and species composition of assemblages showed considerable predictability using LiDAR-derived variables. The differences in the predictability of species richness and diversity of assemblages between trap types can be explained by sample size. We expect predictabilities with R² of up to 0.6 for samples with 250 individuals on average.
4. The statistical response of beetle data and the ecological interpretability of results showed that airborne laser scanning can be used for cost-effective mapping (LiDAR : field survey : beetles = 15 : 100 : 260 € ha⁻¹) of biodiversity even in remote mountain areas and in structurally complex habitats, such as forests.
5. Synthesis and applications. The strong relationship between characteristics of beetle assemblages to variables derived by laser scanning provides an opportunity to link data from local ground surveys of hyperdiverse taxa to data collected remotely at the landscape scale. This will enable conservation managers to evaluate habitats, define hotspots or map activity, richness and composition of assemblages at scales relevant for planning and management. In addition to the large area that can be sampled remotely, the grain of the data allows a single tree to be identified, which opens up the possibility of planning management actions at local scales.

Simplified models have many appealing properties and sometimes give better parameter estimates and model predictions, in the sense of mean-squared error, than extended models, especially when the data are not informative. In this paper, we summarize extensive quantitative and qualitative results in the literature concerned with using simplified or misspecified models. Based on confidence intervals and hypothesis tests, we develop a practical strategy to help modellers decide whether a simplified model should be used, and point out the difficulty in making such a decision. We also evaluate several methods for statistical inference for simplified or misspecified models.

This is a new epistemological approach to the inexact sciences. The purpose of all science is to explain and predict in an objective manner. While in the exact sciences explanation and prediction have the same logical structure, this is not so in the inexact sciences. This permits various methodological innovations in the inexact sciences, e.g., expert judgment and simulation.

Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and does a plurality vote when predicting a class. The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy.
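The mechanism is short enough to sketch end to end. The data, the choice of 1-nearest-neighbour regression as the unstable base learner, and the replicate count are all invented for illustration; the point, as in the abstract, is that averaging over bootstrap replicates helps precisely because the base predictor is unstable.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
x_test = np.linspace(0, 1, 101)
y_true = np.sin(2 * np.pi * x_test)

def nn_predict(xtr, ytr, xte):
    """Unstable base learner: 1-nearest-neighbour regression."""
    return ytr[np.abs(xtr[None, :] - xte[:, None]).argmin(axis=1)]

single = nn_predict(x, y, x_test)

# Bagging: refit the base learner on bootstrap replicates and average the versions.
preds = [nn_predict(x[b], y[b], x_test)
         for b in (rng.integers(0, n, n) for _ in range(50))]
bagged = np.mean(preds, axis=0)

mse_single = np.mean((single - y_true) ** 2)
mse_bagged = np.mean((bagged - y_true) ** 2)
```

Each bootstrap replicate picks a different nearest neighbour, so the average effectively smooths over several nearby observations and cuts the variance of the single unstable fit.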

To explain the phenomena in the world of our experience, to answer the question “why?” rather than only the question “what?”, is one of the foremost objectives of all rational inquiry; and especially, scientific research in its various branches strives to go beyond a mere description of its subject matter by providing an explanation of the phenomena it investigates. While there is rather general agreement about this chief objective of science, there exists considerable difference of opinion as to the function and the essential characteristics of scientific explanation. In the present essay, an attempt will be made to shed some light on these issues by means of an elementary survey of the basic pattern of scientific explanation and a subsequent more rigorous analysis of the concept of law and of the logical structure of explanatory arguments.

The propensity score is the conditional probability of assignment to a particular treatment given a vector of observed covariates. Both large and small sample theory show that adjustment for the scalar propensity score is sufficient to remove bias due to all observed covariates. Applications include: (i) matched sampling on the univariate propensity score, which is a generalization of discriminant matching, (ii) multivariate adjustment by subclassification on the propensity score where the same subclasses are used to estimate treatment effects for all outcome variables and in all subpopulations, and (iii) visual representation of multivariate covariance adjustment by a two-dimensional plot.
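Application (ii), subclassification on the estimated propensity score, can be sketched with simulated data. Everything below (the confounding structure, the effect size of 2, the hand-rolled logistic fit) is an invented example of the idea, not the paper's own analysis.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4000
x = rng.normal(size=n)                          # observed confounder
t = (rng.uniform(size=n) < 1 / (1 + np.exp(-1.5 * x))).astype(float)
y = 2.0 * t + 3.0 * x + rng.normal(size=n)      # true treatment effect is 2

naive = y[t == 1].mean() - y[t == 0].mean()     # biased: treated units have higher x

# Fit the propensity model P(T = 1 | x) by logistic regression (gradient ascent).
w = b = 0.0
for _ in range(2000):
    p = 1 / (1 + np.exp(-(w * x + b)))
    w += 0.1 * np.mean((t - p) * x)
    b += 0.1 * np.mean(t - p)

# Subclassify on propensity-score quintiles; average the within-stratum contrasts.
p = 1 / (1 + np.exp(-(w * x + b)))
edges = np.quantile(p, [0, 0.2, 0.4, 0.6, 0.8, 1.0])
strata = np.clip(np.searchsorted(edges, p) - 1, 0, 4)
effects = [y[(strata == s) & (t == 1)].mean() - y[(strata == s) & (t == 0)].mean()
           for s in range(5)]
adjusted = float(np.mean(effects))
```

Adjusting for the single scalar score removes most of the bias due to the observed covariate, even though the outcome model itself is never fit, which is the paper's central point.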

This paper identifies five common risk factors in the returns on stocks and bonds. There are three stock-market factors: an overall market factor and factors related to firm size and book-to-market equity. There are two bond-market factors, related to maturity and default risks. Stock returns have shared variation due to the stock-market factors, and they are linked to bond returns through shared variation in the bond-market factors. Except for low-grade corporates, the bond-market factors capture the common variation in bond returns. Most important, the five factors seem to explain average returns on stocks and bonds.

This research uses functional data modeling to study the price formation process in online auctions. It conceptualizes the price evolution and its first and second derivatives (velocity and acceleration respectively) as the primary objects of interest. Together these three functional objects permit us to talk about the dynamics of an auction, and how the influence of different factors varies throughout the auction. For instance, we find that the incremental impact of an additional bidder's arrival on the rate of price increase is smaller towards the end of the auction. Our analysis suggests that “stakes” do matter and that the rate of price increase is faster for more expensive items, especially at the start and the end of an auction. We observe that higher seller ratings (which correlate with experience) positively influence the price dynamics, but the effect is weaker in auctions with longer durations. Interestingly, we find that the price level is negatively related to auction duration when the seller has low rating whereas in auctions with high-rated sellers longer auctions achieve higher price levels throughout the auction, and especially at the start and end. Our methodological contributions include the introduction of functional data analysis as a useful toolkit for exploring the structural characteristics of electronic markets.

Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This paper first compares several different methods — predictive value imputation, the distribution-based imputation used by C4.5, and using reduced models — for applying classification trees to instances with missing values (and also shows evidence that the results generalize to bagged trees and to logistic regression). The results show that for the two most popular treatments, each is preferable under different conditions. Strikingly, the reduced-models approach, seldom mentioned or used, consistently outperforms the other two methods, sometimes by a large margin. The lack of attention to reduced modeling may be due in part to its (perceived) expense in terms of computation or storage. Therefore, we then introduce and evaluate alternative, hybrid approaches that allow users to balance between more accurate but computationally expensive reduced modeling and the other, less accurate but less computationally expensive treatments. The results show that the hybrid methods can scale gracefully to the amount of investment in computation/storage, and that they outperform imputation even for small investments.
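The contrast between value imputation and reduced models is easy to reproduce in a regression analogue (the paper studies classification trees; the linear setup and all numbers below are invented for illustration). When predictors are correlated, a reduced model refit on only the observed feature exploits that correlation, while mean imputation throws the missing feature's contribution away entirely.

```python
import numpy as np

rng = np.random.default_rng(7)
cov = [[1.0, 0.8], [0.8, 1.0]]                   # correlated predictors
X = rng.multivariate_normal([0, 0], cov, size=2000)
y = X @ np.array([1.0, 1.0]) + rng.normal(scale=0.3, size=2000)

Xt = rng.multivariate_normal([0, 0], cov, size=1000)
yt = Xt @ np.array([1.0, 1.0]) + rng.normal(scale=0.3, size=1000)

# Full model trained on both features.
full, *_ = np.linalg.lstsq(np.column_stack([np.ones(2000), X]), y, rcond=None)

# Reduced model: trained only on the feature that will be observed at prediction time.
red, *_ = np.linalg.lstsq(np.column_stack([np.ones(2000), X[:, 0]]), y, rcond=None)

# At prediction time feature 1 is missing.
pred_imputed = full[0] + full[1] * Xt[:, 0] + full[2] * X[:, 1].mean()  # mean imputation
pred_reduced = red[0] + red[1] * Xt[:, 0]

mse_imputed = np.mean((yt - pred_imputed) ** 2)
mse_reduced = np.mean((yt - pred_reduced) ** 2)
```

The reduced model's slope on the observed feature absorbs part of the missing feature's effect through their correlation, so its prediction error is substantially lower, mirroring the paper's finding.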

There are many different methods used by classification tree algorithms when missing data occur in the predictors, but few studies have been done comparing their appropriateness and performance. This paper provides both analytic and Monte Carlo evidence regarding the effectiveness of six popular missing data methods for classification trees applied to binary response data. We show that in the context of classification trees, the relationship between the missingness and the dependent variable, as well as the existence or non-existence of missing values in the testing data, are the most helpful criteria to distinguish different missing data methods. In particular, separate class is clearly the best method to use when the testing set has missing values and the missingness is related to the response variable. A real data set related to modeling bankruptcy of a firm is then analyzed. The paper concludes with discussion of adaptation of these results to logistic regression, and other potential generalizations.

Collopy, Adya and Armstrong (1994) (CAA) advocate the use of atheoretical "black box" extrapolation techniques to forecast information systems spending. In this paper, we contrast this approach with the positive modeling approach of Gurbaxani and Mendelson (1990), where the primary focus is on explanation based on economics and innovation diffusion theory. We argue that the objectives and premises of extrapolation techniques are so fundamentally different from those of positive modeling that the evaluation of positive models using the criteria of "black box" forecasting approaches is inadequate. We further show that even if one were to accept CAA's premises, their results are still inferior. Our results refute CAA's claim that linear trend extrapolations are appropriate for forecasting future IS spending and demonstrate the risks of ignoring the guidance of theory.

This paper develops a model of the growth of information systems expenditures in the United States. The model incorporates two major factors that influence the rate and pattern of spending growth - the diffusion of technological innovation and the effect of price on the demand for computing. Traditional studies have focused on the role of innovation while ignoring the effects of price on the growth process. We show that while information systems expenses initially grew following an S-curve, more recent growth has converged to an exponential pattern. These patterns are consistent with our integrative price-adjusted S-curve growth model.

During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting (the first comprehensive treatment of this topic in any book).
This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression and path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for "wide" data (p larger than n), including multiple testing and false discovery rates.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie co-developed much of the statistical modeling software and environment in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, projection pursuit and gradient boosting.

There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.