Figure 1 - available via license: Creative Commons Attribution 4.0 International


# Comparison of linear model and Poisson model. Note: The red line shows how expected y depends on x and how observations are distributed at x = 1, 2, 3, 4, and 5 (conditional distribution). Axes show kernel density plots of x and y (unconditional distribution).

Source publication

Transforming variables before analysis or applying a transformation as a part of a generalized linear model are common practices in organizational research. Several methodological articles addressing the topic, either directly or indirectly, have been published in the recent past. In this article, we point out a few misconceptions about transformat...

## Contexts in source publication

**Context 1**

... were the most common way to apply transformations, and therefore it is important to understand what GLMs do. To this end, we take linear regression, shown in the first plot of Figure 1, as a starting point. In linear regression, the mean (or more precisely, the expected value) of the dependent variable depends linearly on the independent variables, and the data are assumed to be normally distributed around the population regression line. ...

**Context 2**

... the distribution is a conditional distribution specified for a specific combination of independent variables, not an unconditional distribution of the dependent variable overall. In the linear regression model shown in the first plot of Figure 1, the observations are always normally distributed around the regression line. That is, if we only look at observations that have the same x value (e.g., x = 1), that subset will be normally distributed (conditional distribution). ...
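
The distinction between conditional and unconditional distributions can be illustrated with a small simulation. This is a minimal sketch, not taken from the source publication: the model `y = 1 + 2*x + e`, the sample size, and the seed are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical linear model: y = 1 + 2*x + e, with e ~ Normal(0, 1)
xs = [random.choice([1, 2, 3, 4, 5]) for _ in range(5000)]
ys = [1 + 2 * x + random.gauss(0, 1) for x in xs]

# Unconditional spread of y mixes the effect of x with the error term
pooled_sd = statistics.stdev(ys)

# Conditional spread: look only at observations that share the same x value
conditional_sd = {
    x: statistics.stdev([y for xi, y in zip(xs, ys) if xi == x])
    for x in [1, 2, 3, 4, 5]
}

print(f"SD of y overall (unconditional): {pooled_sd:.2f}")
for x, sd in conditional_sd.items():
    print(f"SD of y given x = {x} (conditional): {sd:.2f}")
```

Under these assumptions, each conditional standard deviation should be close to the error standard deviation of 1, while the unconditional one is much larger because it also reflects the variation in x. The normality assumption of linear regression refers to the former.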

**Context 3**

... specification of a GLM involves two key decisions: choosing the link function for the relationship between the independent variable and the expected value of the dependent variable and the (conditional) distribution for the dependent variable. Importantly, the distribution concerns the conditional distribution for a specific set of predictor variables, not the unconditional distribution that one would analyze in the data preparation and screening stage of research (see Figure 1). Of these, the first decision is much more important because it determines which functional form is estimated; the second decision only influences whether the chosen form is estimated correctly. ...
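
The two decisions (link function and conditional distribution) can be made concrete with a hand-rolled Poisson regression. The sketch below is an illustration under assumed values, not the source publication's code: the coefficients 0.2 and 0.4, the sample size, and the helper `poisson_draw` are hypothetical. It uses a log link, so the estimated functional form is exponential, and a Poisson conditional distribution, fitted by Newton-Raphson.

```python
import math
import random

rng = random.Random(1)  # fixed seed for reproducibility


def poisson_draw(lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


# Hypothetical data-generating process: E[y | x] = exp(0.2 + 0.4 * x)
TRUE_B0, TRUE_B1 = 0.2, 0.4
xs = [rng.choice([1, 2, 3, 4, 5]) for _ in range(2000)]
ys = [poisson_draw(math.exp(TRUE_B0 + TRUE_B1 * x)) for x in xs]

# Newton-Raphson on the Poisson log-likelihood with a log link
b0, b1 = math.log(sum(ys) / len(ys)), 0.0
for _ in range(25):
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        mu = math.exp(b0 + b1 * x)  # inverse link: linear predictor -> mean
        g0 += y - mu                # score for the intercept
        g1 += (y - mu) * x          # score for the slope
        h00 += mu                   # entries of the 2x2 information matrix
        h01 += mu * x
        h11 += mu * x * x
    det = h00 * h11 - h01 * h01
    # Solve the 2x2 system (information matrix) * step = score
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

print(f"estimated intercept: {b0:.3f} (assumed true value {TRUE_B0})")
print(f"estimated slope:     {b1:.3f} (assumed true value {TRUE_B1})")
```

Changing the link (for example, to identity) would change which curve is fitted, whereas changing only the assumed distribution would keep the exponential curve and affect how well it is estimated.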

**Context 4**

... example, fitting a parabola to a data set where y = log(x) would erroneously indicate the presence of a U-shape (first up, then down, or the other way) effect. An opposite incorrect conclusion of an increasing trend could be made if y first increases gradually but then decreases steeply (Simonsohn, 2018, Figures 2, 3). We show this effect in the first panel of Figure 8, which shows a binned scatterplot and second-order polynomial curve produced by the binsreg command and an exponential curve fitted with Poisson QML regression for reference. ...

**Context 5**

... bands can be included to accomplish the first task (Mize, 2019). To support the chosen functional form, we recommend that researchers not only visualize the predicted curves but also indicate where their data are in the same plot (Fife, 2020), either by plotting the cases (as done in, e.g., Figure 3) or, if not possible due to their large number, by using rug and contour plots, as shown in the first plot of Figure 10. A contour plot shows which parts of the plot the observations are located in (have higher density) using a set of nested simple closed curves. ...

**Context 6**

... demonstrate, we add prestige as a control variable in Model 4 in Table 7, causing the effect of education to disappear. The second plot of Figure 10 shows that the adjusted prediction curve does not explain the pattern in the data well. To better visualize this kind of effect, we can divide the data set into five quantiles by prestige and plot the prediction curves separately for each subsample. ...

**Context 7**

... linear Model 5 has a significant interaction term, whereas in the nonlinear Model 6, the interaction term is nonsignificant. Nevertheless, Figure 11 shows a clear moderation effect if absolute income is considered: The effect of education is much stronger for men-dominated occupations in both plots. It also shows why plotting the data is helpful in choosing a model: The linear model clearly extrapolates the data and underpredicts the incomes of higher education occupations, particularly those with more women, and hence does not fit as well as the nonlinear model. ...

**Context 8**

... reason is that in the exponential model, the effects are multiplicative instead of additive, as noted before. The lack of a significant interaction term thus does not mean a moderation effect would not exist in absolute terms (Russell & Dean, 2000); it is simply presented in an alternative form, as Figure 11 shows. Similarly, if the appropriate model for the data is an exponential model, then fitting a linear model can produce significant interactions that did not exist in the exponential model. ...
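
A short worked example shows why an exponential model without an interaction term can still imply moderation in absolute terms. The coefficients and variable names below are hypothetical and chosen only for illustration; they are not estimates from the source publication.

```python
import math

# Hypothetical coefficients of an exponential model WITHOUT an interaction term:
# E[y | x, group] = exp(B0 + B1 * x + B2 * group)
B0, B1, B2 = 1.0, 0.3, 0.5


def expected_y(x, group):
    """Expected outcome under the assumed exponential model."""
    return math.exp(B0 + B1 * x + B2 * group)


abs_effect, rel_effect = {}, {}
for group in (0, 1):
    low, high = expected_y(2, group), expected_y(3, group)
    abs_effect[group] = high - low  # additive (absolute) effect of x: 2 -> 3
    rel_effect[group] = high / low  # multiplicative (relative) effect of x
    print(f"group {group}: absolute effect = {abs_effect[group]:.2f}, "
          f"relative effect = {rel_effect[group]:.3f}")
```

The relative effect is identical in both groups (it equals `exp(B1)`), but the absolute effect is larger in the group with the higher baseline by a factor of `exp(B2)`. Whether this counts as moderation depends on whether the hypothesis is stated in relative or absolute terms.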

**Context 9**

... we interpret Figure 11 as supporting a moderation effect? Ideally, moderation hypotheses should be stated in a way that leaves no room for ambiguity (see also Gardner et al., 2017). ...

## Similar publications

This paper is concerned with optimal designs for a wide class of nonlinear models with information driven by the linear predictor. The aim of this study is to generate an R-optimal design which minimizes the product of the main diagonal entries of the inverse of the Fisher information matrix at certain values of the parameters. An equivalence theorem f...

## Citations

... We contribute the insight that performance is better characterized by the lognormal distribution, which differs from the exponential tail and power law with exponential cutoff distributions typically found in other employment and work contexts (Bradley and Aguinis, 2023; Joo et al., 2017). Thus, this study identifies the need to adapt prevailing statistical methodologies to account for lognormal distributions by drawing upon Paretian paradigms (McKelvey and Andriani, 2005) and scholarly guidelines for incorporating non-normality in statistical analyses (Becker et al., 2019; Rönkkö et al., 2022). Specifically, pervasive non-normality in entrepreneurship variables calls for (a) identification of the most suitable distributional shapes using a pitting or falsification approach, (b) examination of the proportion or weight that outliers carry in terms of the overall pool, (c) adoption of appropriate analytical methods (e.g., rank correlations, nonparametric regression) for hypothesis testing, and (d) theoretically informed expansion of the taxonomy of distribution shapes to account for novel generative mechanisms of extreme performance. ...

This study extends emerging theories of star performers to digital platforms, an increasingly prevalent entrepreneurial context. It hypothesizes that the unique characteristics of many digital platforms (e.g., low marginal costs, feedback loops, and network effects) produce heavy-tailed performance distributions, indicating the existence of star entrepreneurs. Using longitudinal data from an online learning platform, proportional differentiation is identified as the most likely generative mechanism and lognormal distribution as the most likely shape for distributions of entrepreneurial performance in digital contexts. This study contributes theory and empirical evidence for non-normal entrepreneurial performance with implications for scholars and practitioners of digital entrepreneurship.

Executive summary: The performance of 'star' entrepreneurs on digital platforms can be 100- or 1000-fold that of their average competitors. When performance is plotted as a distribution, star performers reside in the tails of these distributions. The assumption of a normal distribution of performance in the bulk of entrepreneurship research implies that most performance observations are clustered around the average. Instead, most entrepreneurs on digital platforms exhibit sub-par performance, while a minority captures a major fraction of the generated value. This paper argues that the unique characteristics of digital contexts (nearly zero marginal costs, feedback loops, and network effects) drive such extreme performance. Using data from Udemy, a digital platform where independent producers (entrepreneurs) offer educational videos (digital products) to a large pool of potential customers, we provide evidence that entrepreneurial performance is lognormally rather than normally distributed. We further identify proportional differentiation as the underlying generative mechanism. Thus, star performance on digital platforms is not driven only by the rich-get-richer effect. Instead, both the initial value of performance and the rate at which it is accumulated play important roles in explaining extreme performance outcomes. This discovery has important implications for entrepreneurship theory and practice. Our findings, for example, signal that some late entrants who successfully pursue high customer accumulation rates in domains with high knowledge intensity can become star entrepreneurs.

... In more extreme cases of non-normality, a two-step transformation has proven effective (Templeton et al., 2021). However, transformations should only be carried out when necessary and should align with the selected analysis technique (Rönkkö et al., 2021). ...

To ensure validity in survey research, it is imperative that we properly educate doctoral students on best practices in data quality procedures. A 14-year analysis of 679 studies in the AIS "Basket of 8" journals noted undercommunication in the most pertinent procedures, consistent across journals and time. Given recent calls for improvements in data transparency, scholars must be educated on the importance and methods for ensuring data quality. Thus, to guide the education of doctoral students, we present a "5-C Framework" of data quality procedures derived from a wide-ranging literature review. Additionally, we describe a set of guidelines regarding enacting and communicating data quality procedures in survey research.

... In the scope of modeling choices, future research could also examine to what extent data transformations may improve performance in social signal processing. Methods from generalized linear models (e.g., Rönkkö et al., 2022), such as log transformation, could potentially be applied to machine learning model performance when labels are not distributed evenly (as in the case of our labels for high and low cohesion, see Table A1 in Appendix A). Moreover, future research can investigate the potential interplay between micro-level behavioral mechanisms underlying task and social cohesion, as presently investigated, and larger-scale temporal dynamics of (emergent) task and social cohesion across longer interaction periods, such as entire meetings or series of meetings by utilizing multilevel modeling approaches. ...

Social signal processing develops automated approaches to detect, analyze, and synthesize social signals in human–human as well as human–machine interactions by means of machine learning and sensor data processing. Most works analyze individual or dyadic behavior, while the analysis of group or team interactions remains limited. We present a case study of an interdisciplinary work process for social signal processing that can develop automatized measures of complex team interaction dynamics, using team task and social cohesion as an example. In a field sample of 25 real project team meetings, we obtained sensor data from cameras, microphones, and a smart ID badge measuring acceleration. We demonstrate how fine-grained behavioral expressions of task and social cohesion in team meetings can be extracted and processed from sensor data by capturing dyadic coordination patterns that are then aggregated to the team level. The extracted patterns act as proxies for behavioral synchrony and mimicry of speech and body behavior which map onto verbal expressions of task and social cohesion in the observed team meetings. We reflect on opportunities for future interdisciplinary collaboration that can move beyond a simple producer–consumer model.

... For skewed data sets with many zeros, the common practice of adding a small positive constant to the observations (the "shift" parameter) before log transformation has little to recommend it as such a parameter has a highly significant effect on the estimator of the geometric mean [16]. Recent reviews have cautioned against the "routine" use of log-transformation in regression; rather GLM, or, as in the current paper, GLMM have been endorsed [14,52,53]. As noted by Deb et al., "Properly interpreting results from a log-transformed model requires substantially more effort" [54]. ...
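
The sensitivity to the "shift" parameter can be checked with a few lines of arithmetic. This is a minimal sketch on made-up data: the observations and the candidate constants are assumptions for illustration, and the back-transformed estimator `exp(mean(log(y + c))) - c` is one common way to compute a shifted geometric mean, not necessarily the estimator used in the cited work.

```python
import math

# Hypothetical skewed data with several zeros
observations = [0, 0, 0, 1, 2, 5, 10, 50]


def shifted_geometric_mean(values, shift):
    """Back-transformed mean of log(value + shift), minus the shift."""
    mean_log = sum(math.log(v + shift) for v in values) / len(values)
    return math.exp(mean_log) - shift


shifts = [0.001, 0.01, 0.1, 1.0]
estimates = [shifted_geometric_mean(observations, c) for c in shifts]

for c, est in zip(shifts, estimates):
    print(f"shift = {c:<5}: estimated geometric mean = {est:.3f}")
```

For these data, the estimate changes by roughly an order of magnitude across shift values that are all plausible "small constants", so the result is driven largely by an arbitrary choice.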

Background
Intensive care unit (ICU) length of stay (LOS) and the risk adjusted equivalent (RALOS) have been used as quality metrics. The latter measures entail either ratio or difference formulations or ICU random effects (RE), which have not been previously compared.
Methods
From calendar year 2016 data of an adult ICU registry-database (Australia & New Zealand Intensive Care Society (ANZICS) CORE), LOS predictive models were established using linear (LMM) and generalised linear (GLMM) mixed models. Model fixed effects quality-metric formulations were estimated as RALOSR for LMM (geometric mean derived from log(ICU LOS)) and GLMM (day) and observed minus expected ICU LOS (OMELOS from GLMM). Metric confidence intervals (95%CI) were estimated by bootstrapping; random effects (RE) were predicted for LMM and GLMM. Forest-plot displays of ranked quality-metric point-estimates (95%CI) were generated for ICU hospital classifications (metropolitan, private, rural/regional, and tertiary). Robust rank confidence sets (point estimate and 95%CI), both marginal (pertaining to a singular ICU) and simultaneous (pertaining to all ICU differences), were established.
Results
The ICU cohort was of 94,361 patients from 125 ICUs (metropolitan 16.9%, private 32.8%, rural/regional 6.4%, tertiary 43.8%). Age (mean, SD) was 61.7 (17.5) years; 58.3% were male; APACHE III severity-of-illness score 54.6 (25.7); ICU annual patient volume 1192 (702) and ICU LOS 3.2 (4.9). There was no concordance of ICU ranked model predictions, GLMM versus LMM, nor for the quality metrics used, RALOSR, OMELOS and site-specific RE for each of the ICU hospital classifications. Furthermore, there was no concordance between ICU ranking confidence sets, marginal and simultaneous for models or quality metrics.
Conclusions
Inference regarding adjusted ICU LOS was dependent upon the statistical estimator and the quality index used to quantify any LOS differences across ICUs. That is, there was no “one best model”; thus, ICU “performance” is determined by model choice and any rankings thereupon should be circumspect.

... As suggested by Rönkkö et al. (2022), such transformations should only be done if researchers have strong theoretical arguments for a log-transformation. With regard to total assets, diminishing returns of additional assets are often assumed. ...

Does a greater representation of women in top management teams (TMTs) contribute to higher firm performance? Although several studies have investigated this question, they have failed to sufficiently account for endogeneity. We address the endogeneity problem by using an instrumental variable (IV) design to estimate the causal effect of women's representation in TMTs on firm performance. We use a shift-share or Bartik-type instrument, which is well-established in economics but has received little attention in management and leadership research. We analyze the effect of TMT gender diversity on four types of firm performance: profitability, market-based performance, liquidity, and growth. Our sample is based on firms in the S&P 1,500, which we observe over 24 years (1997-2020). Our findings indicate that TMT gender diversity positively affects the profitability, liquidity, and growth of firms but does not impact market-based performance. We also analyze whether the effect of TMT gender diversity was stronger during two economic crises, namely the 2008/2009 financial crisis and the COVID-19 pandemic, but our instrumental variable analysis provides no evidence for such an interaction effect. Our results are robust to multiple alternative specifications. This study contributes to research on strategic leadership, specifically regarding the effect of women leaders, as well as the crisis leadership literature.

... While Silva and Tenreyro (2006) recommend the use of a Poisson distribution, this method leads to biased results if there is overdispersion in the dependent variable, which was the case for the data in this study. In situations like these, a negative binomial distribution produces better results (Rönkkö et al., 2022). Since the negative binomial distribution estimates the logit of the dependent variable, we also log-transformed the exposure variables (total student population of the host country, and population aged 15-24 of the origin country). ...

In this paper, we analyze the relationship between development and outgoing international student mobility (ISM) for the years 2003–2018 using data from UNESCO. Starting from migration transition theory, we expect that development and outgoing migration follows an inverted U-shape due to changes in capabilities and aspirations of populations. As predicted, we find that outgoing ISM also follows this pattern. Probing deeper into this finding, we investigated whether students from countries of different levels of development favor different destination countries, focusing on destination countries’ academic ranking, GDP per capita, and linguistic and colonial ties. We find that these destination country characteristics indeed have different effects for students from origin countries with different stages of development, and that these effects cannot simply be reduced to a dichotomy between developed/developing countries. Together, the findings highlight the nonlinearity of ISM processes. In turn this opens up new avenues of research regarding the diversity of international student populations.

... Beyond the descriptive statistics provided to illustrate de facto reporting standards for each transparency feature, we regressed article citations on the transparency index, along with the aforementioned author and article characteristics we controlled for. Given that article citations are a discrete dependent variable and follow a left-skewed distribution, they required a different modeling approach, for which generalized linear models (GLMs) are more suited (Rönkkö et al., 2022). A Poisson regression was possible, but it comes with a strict assumption of equidispersion where the variance and mean of the dependent variable must be equal (Blevins et al., 2015). ...

... A Poisson regression was possible, but it comes with a strict assumption of equidispersion where the variance and mean of the dependent variable must be equal (Blevins et al., 2015). Since this assumption was not satisfied, we had to address the presence of overdispersion in our data, which we did by running a negative binomial regression (Hilbe, 2011; Rönkkö et al., 2022). Negative binomial regression can be seen as a generalization of Poisson regression, possessing an extra parameter to model overdispersion and therefore being more appropriate for distributions with overdispersion (Blevins et al., 2015; Rönkkö et al., 2022). ...

... Since this assumption was not satisfied, we had to address the presence of overdispersion in our data, which we did by running a negative binomial regression (Hilbe, 2011; Rönkkö et al., 2022). Negative binomial regression can be seen as a generalization of Poisson regression, possessing an extra parameter to model overdispersion and therefore being more appropriate for distributions with overdispersion (Blevins et al., 2015; Rönkkö et al., 2022). Table 4 shows the percentage frequency of each of the coded features in our sample. ...
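
The equidispersion assumption mentioned in these contexts can be illustrated by comparing the variance-to-mean ratio of simulated counts. This is a rough sketch under assumed parameters (mean 4, gamma shape 2, 5,000 draws), not the cited authors' procedure; the overdispersed counts are generated as a gamma mixture of Poisson draws, which is one standard way to obtain a negative binomial distribution.

```python
import math
import random
import statistics

rng = random.Random(3)  # fixed seed for reproducibility


def poisson_draw(lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


MEAN, SHAPE, N = 4.0, 2.0, 5000  # hypothetical parameters

# Equidispersed counts: variance equals the mean
poisson_counts = [poisson_draw(MEAN) for _ in range(N)]

# Overdispersed counts: the Poisson rate itself varies (gamma with mean MEAN),
# giving variance MEAN + MEAN**2 / SHAPE
negbin_counts = [
    poisson_draw(rng.gammavariate(SHAPE, MEAN / SHAPE)) for _ in range(N)
]


def dispersion_ratio(counts):
    """Sample variance divided by sample mean; about 1 under equidispersion."""
    return statistics.variance(counts) / statistics.mean(counts)


ratio_poisson = dispersion_ratio(poisson_counts)
ratio_negbin = dispersion_ratio(negbin_counts)
print(f"variance/mean, Poisson counts:           {ratio_poisson:.2f}")
print(f"variance/mean, negative binomial counts: {ratio_negbin:.2f}")
```

Note that this unconditional ratio is only a rough screening device: strictly, equidispersion concerns the conditional distribution given the predictors, so a ratio above 1 in the raw outcome can also arise from predictors shifting the mean.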

Intuitively, there would appear to be a direct positive link between the transparency with which research procedures get reported and their appreciation (and citation) within the academic community. It is therefore not surprising that several guidelines exist, which demand the reporting of specific features for ensuring transparency of quantitative field studies. Unfortunately, it is currently far from clear which of these features do get reported, and how this affects the articles' citations. To rectify this, we review 200 quantitative field studies published in five major journals from the field of management research over a period of 20 years (1997-2016). Our results reveal that there are significant gaps in the transparent reporting of even the most basic features. On the other hand, our results show that copious reporting of transparency is productive only up to a certain degree, after which more transparent articles get cited less, pointing to a 'transparency sweet spot' that can be achieved by reporting mindfully.

... This allows a visual inspection of the possibility of a necessity relationship by identifying areas without observations ('empty spaces'). Second, the scatter plot assists with empirically selecting the functional form of the ceiling line (e.g., Rönkkö et al. 2022). Third, the scatter plot helps to identify possible outliers (Aguinis et al. 2013). ...

Necessary condition analysis (NCA) is an increasingly used or suggested method in many business and management disciplines including, for example, entrepreneurship, human resource management, international business, marketing, operations, public and nonprofit management, strategic management, and tourism. In the light of this development, our work delivers a review of the topics analyzed with NCA or in which NCA is proposed as a method. The review highlights the tremendous possibilities of using NCA, which hopefully encourages other researchers to try the method. To support researchers in future NCA studies, this article also provides detailed guidelines about how to best use NCA. These cover eight topics: theoretical justification, meaningful data, scatter plot, ceiling line, effect size, statistical test, bottleneck analysis, and further descriptions of NCA.

... Common mistake: Researchers often log-transform non-negative and skewed variables to make them "more normally distributed." It is a statistical myth that variables should be log-transformed to make their distributions less skewed: what matters is correctly modeling the functional form (Rönkkö, Aalto, Tenhunen, & Aguirre-Urreta, 2022; Villadsen & Wulff, 2021b). When authors log-transform their dependent variable, they change its relationship to the covariates to comply with criteria that are largely irrelevant (non-normality of the error term is only a problem in small samples; see Wooldridge, 2002, Chapter 5). ...

... Common mistake: Log transforming the outcome is problematic for three reasons. First, ordinary least squares (OLS) is generally inconsistent if used to estimate parameters in a linear regression with a log-transformed outcome (Santos Silva & Tenreyro, 2006; Rönkkö et al., 2022). Second, if the outcome contains zeros, adding an arbitrary constant (e.g., 1) before log-transformation becomes necessary, but biases the estimated parameters (Johnson & Rausser, 1971), goodness-of-fit measures (Ekwaru & Veugelers, 2018), and p-values (Feng et al., 2014). ...
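
The first reason can be illustrated with a simulation in which the multiplicative error is heteroskedastic. The sketch below is built on assumptions chosen for illustration (true slope 0.5, error standard deviation on the log scale equal to 0.2 times x); it is not the analysis from any of the cited papers. The expected outcome is exactly exp(0.5 x), yet OLS on log(y) targets E[log y | x], which here equals 0.5 x - 0.02 x^2.

```python
import math
import random

rng = random.Random(2)  # fixed seed for reproducibility

TRUE_SLOPE = 0.5  # hypothetical: E[y | x] = exp(TRUE_SLOPE * x)

xs, log_ys = [], []
for _ in range(5000):
    x = rng.choice([1, 2, 3, 4, 5])
    s = 0.2 * x  # error spread on the log scale grows with x
    # log(eta) ~ Normal(-s^2 / 2, s), so E[eta | x] = 1 and E[y | x] is unchanged
    log_eta = rng.gauss(-s * s / 2, s)
    xs.append(x)
    log_ys.append(TRUE_SLOPE * x + log_eta)  # log(y) = 0.5 * x + log(eta)

# OLS slope of log(y) on x: cov(x, log y) / var(x)
mean_x = sum(xs) / len(xs)
mean_ly = sum(log_ys) / len(log_ys)
cov = sum((x - mean_x) * (ly - mean_ly) for x, ly in zip(xs, log_ys))
var = sum((x - mean_x) ** 2 for x in xs)
ols_slope = cov / var

print(f"slope in E[y | x]:      {TRUE_SLOPE}")
print(f"OLS slope on log(y):    {ols_slope:.3f}")
```

Under these assumptions the OLS slope should settle near 0.38 rather than 0.5, and collecting more data would not close the gap, because the bias comes from the error variance depending on x rather than from sampling noise.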

... After variable selection, the final logistic regression models for both hourly and daily response variables included the explanatory variables distance, noise, power output, the interactions of distance-noise and distance-power output (Table 3). Visual inspection of the relationship with distance led us to include distance transformed to the second power [37], which contributed to an improved model fit. In summary, high levels of ambient noise and low transmitting output power significantly reduced the probability of a transmission being detected, whereby these negative effects were exacerbated at greater distance (Fig. 3). ...

Background
In acoustic telemetry studies, detection range is usually evaluated as the relationship between the probability of detecting an individual transmission and the distance between the transmitter and receiver. When investigating animal presence, however, few detections will suffice to establish an animal’s presence within a certain time frame. In this study, we assess detection range and its impacting factors with a novel approach aimed towards studies making use of binary presence/absence metrics. The probability of determining presence of an acoustic transmitter within a certain time frame is calculated as the probability of detecting a set minimum number of transmissions within that time frame. We illustrate this method for hourly and daily time bins with an extensive empirical dataset of sentinel transmissions and detections in a receiver array in a Belgian offshore wind farm.
Results
The accuracy and specificity of over 84% for both temporal resolutions showed the developed approach performs adequately. Using this approach, we found important differences in the predictive performance of distinct hypothetical range testing scenarios. Finally, our results demonstrated that the probability of determining presence over distance to a receiver did not solely depend on environmental and technical conditions, but would also relate to the temporal resolution of the analysis, the programmed transmitting interval and the movement behaviour of the tagged animal. The probability of determining presence differed distinctly from a single transmission’s detectability, with an increase of up to 266 m for the estimated distance at 50% detection probability (D50).
Conclusion
When few detections of multiple transmissions suffice to ascertain presence within a time bin, predicted range differs distinctly from the probability of detecting a single transmission within that time bin. We recommend the use of more rigorous range testing methodologies for acoustic telemetry applications where the assessment of detection range is an integral part of the study design, the data analysis and the interpretation of results.