Article

“An Analysis of Transformations.”

Authors:
  • G. E. P. Box
  • D. R. Cox — Nuffield College, Oxford

Abstract

In the analysis of data it is often assumed that observations y1, y2, …, yn are independently normally distributed with constant variance and with expectations specified by a model linear in a set of parameters θ. In this paper we make the less restrictive assumption that such a normal, homoscedastic, linear model is appropriate after some suitable transformation has been applied to the y's. Inferences about the transformation and about the parameters of the linear model are made by computing the likelihood function and the relevant posterior distribution. The contributions of normality, homoscedasticity and additivity to the transformation are separated. The relation of the present methods to earlier procedures for finding transformations is discussed. The methods are illustrated with examples.
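For reference, the power family analysed in the paper, and the profile log-likelihood used for inference about the transformation parameter, can be written in standard notation (a reconstruction of the well-known result, not a verbatim excerpt from the paper):

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda}-1}{\lambda}, & \lambda \neq 0,\\[6pt]
\log y, & \lambda = 0,
\end{cases}
$$

and, maximising over the linear-model parameters for each fixed $\lambda$,

$$
L_{\max}(\lambda) = -\frac{n}{2}\log \hat{\sigma}^{2}(\lambda) + (\lambda - 1)\sum_{i=1}^{n}\log y_{i},
$$

where $\hat{\sigma}^{2}(\lambda)$ is the residual variance from fitting the linear model to $y^{(\lambda)}$ and the second term is the log-Jacobian of the transformation.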


... The Box-Cox transformation encompasses a family of power transforms that aim to identify and apply the optimal normalisation transformation for non-normal data. Developed by Box and Cox (1964), it includes log, square root, inverse and other flexible power transforms as special cases. ...
... The Box-Cox procedure provides a data-driven approach to selecting an appropriate power transformation. It offers a wider range of normalising options compared to a single defined transform (Box and Cox 1964). ...
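In R, this data-driven selection is commonly carried out with the boxcox function from the MASS package, which profiles the likelihood over a grid of λ values. A minimal sketch with simulated data (none of this comes from the cited studies):

```r
library(MASS)  # provides boxcox()

set.seed(1)
x <- runif(100, 1, 10)
y <- exp(0.3 * x + rnorm(100, sd = 0.4))  # positively skewed response

# Profile the Box-Cox log-likelihood over a grid of lambda values
bc <- boxcox(lm(y ~ x), lambda = seq(-2, 2, 0.05), plotit = FALSE)
(lambda_hat <- bc$x[which.max(bc$y)])  # ML estimate; near 0 here, suggesting a log transform
```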
Article
Full-text available
Fitness traits described as a ratio often display non‐normal distributions; consequently, transformations are frequently applied to improve normality prior to the estimation of genetic parameters. However, the impact of different transformations on genetic parameter estimates depends on the dataset at hand. The objective of this study was to evaluate the effects of eight common transformations ( z ‐score, log, square root, probit, arcsine, logit, Box‐Cox and Yeo‐Johnson) on genetic parameter estimates for non‐normal fitness traits in turkeys. Three fertility traits in turkeys were analysed. Egg production rate, egg fertility rate and hatch of fertile eggs rate phenotypes were collected on 6667 turkeys. All three phenotypes exhibited a significant level of non‐normality. An informative pedigree file for the phenotyped birds was generated and consisted of 8612 animals. A mixed linear model that included the hatch year and regression on body weight at 18 weeks of age as fixed effects was used to analyse the transformed and untransformed phenotypes. To make the untransformed and transformed data comparable, they were all standardised to the same mean and variance. Results showed that the transformations significantly impacted genetic parameter estimates. In fact, the percentage variations in the estimates of the heritabilities of the three traits compared to the non‐transformed data ranged from −80% to 45%. Across the different comparison criteria, the Box‐Cox transformation seems to have the advantage compared to the other methods. Furthermore, it resulted in the highest heritability estimates. Although the genetic correlations showed fewer differences across transformations, the Spearman rank correlations ranged between 0.87 and 1, indicating some re‐ranking. These findings suggest that the choice of data transformation impacts inferences on the genetic properties of non‐normal traits, and careful consideration of the transformation method is needed prior to genetic analysis of skewed fitness data in turkeys and potentially other agricultural species. The results provide guidelines for the appropriate choice of transformations given observed levels of deviation from normality.
... The Box-Cox transformation is used when time series data are not stationary in variance. Table 1 presents commonly used values and their associated transformations (Box & Cox, 1964). Differencing is often used to make a series stationary in mean. ...
... The models developed in Stages I and II were suitable for the intended purpose, ensuring simplicity and interpretability without compromising forecast accuracy. ARIMA modelling methods, adopted from Box and Cox (1964), were used in this study as a common approach for modelling and forecasting time series data. If the time series data pass all the preliminary tests in Stage I of Box-Jenkins modelling, then Stage II, which is parameter estimation, can be employed. ...
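A sketch of that variance-then-mean stabilisation step, using the forecast package (the cited study reports using R, but not necessarily this package):

```r
library(forecast)

y <- AirPassengers               # monthly series whose variance grows with its level

lambda <- BoxCox.lambda(y)       # estimate lambda (Guerrero's method by default)
y_bc   <- BoxCox(y, lambda)      # stabilise the variance
dy     <- diff(y_bc)             # first difference to stabilise the mean

# Alternatively, let the ARIMA fit apply the Box-Cox transform internally
fit <- auto.arima(y, lambda = lambda)
summary(fit)
```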
Article
Full-text available
Forecasting the prices of RON97 fuel is crucial for economic planning and policy-making due to its impact on transportation costs and inflation. This study aims to analyse the RON97 fuel prices and develop a forecasting model using Box-Jenkins modelling. The study employs a comprehensive historical weekly dataset of RON97 fuel prices in Malaysia from 30th March 2017 to 18th April 2024 obtained from Malaysia's Official Open Data Portal. This research focuses on Stage I and Stage II of the Box-Jenkins approach. Stage I involves model identification, including a preliminary assessment of the data's stationarity through differencing and unit root tests. In Stage II, model estimation is applied to identify the most significant Box-Jenkins model to forecast the fuel price. This stage also includes identifying potential Box-Jenkins models based on autocorrelation functions (ACF) and partial autocorrelation functions (PACF). In this study, the analysis is done with the aid of R software, where the potential of this software in forecasting weekly RON97 fuel prices time series data is explored. The result from the analysis revealed that ARIMA (0,1,2) is the best model for forecasting RON97 fuel prices in Malaysia. Box-Jenkins modelling effectively captures the underlying patterns and trends in RON97 fuel prices, highlighting its applicability in economic and financial time series forecasting.
... Only the temperature met the assumptions of normality (A = 0.638, p-value = 0.086) and homogeneity of variance (B = 0.292, p-value = 0.864), so the Tukey test statistic was used for this variable. For the remaining variables, the following Box-Cox transformation [22] was used to stabilize the variance and bring the data closer to a normal distribution: ...
Article
Full-text available
Anthropic activities such as agriculture, livestock, and wastewater discharges affect water quality in the Tlapaneco River in the mountain region of the state of Guerrero, México, which is a tributary of the Balsas. The river flows from the mountain region and discharges into the Pacific Ocean; the water resource in the localities mentioned is used for agriculture, recreation, and domestic activities. The aim of this study was to evaluate water quality in the stretch of influence of two localities, Patlicha and Copanatoyac. The instrument used was the Biological Monitoring Working Party biotic index (BMWP) and physicochemical parameters. Nine sampling sites were selected according to the perception of the local community with respect to disturbance; the study area was divided into three parts: high, medium, and low. Twenty-seven collections of macroinvertebrates and water were analyzed, in dry and rainy seasons, through the presence–absence of these organisms and physicochemical analysis, to evaluate water quality. The results showed that the conditions of the riverbed associated with daily activities and domestic discharges are important factors in the composition of the families. Water quality was very poor to regular, according to the macroinvertebrate assemblages collected. The BMWP index was of acceptable quality when the orders (Family) Ephemeroptera (Leptohyphidae; Leptophlebiidae; Baetidae; Ephemerellidae), Diptera (Chironomidae; Simulidae), Trichoptera (Hydropsychidae), Hemiptera (Veliidae; Corixidae), Coleoptera (Hydrophylidae), and Odonata (Lestidae) were present; in sites with poor quality, the families Chironomidae, Leptophlebiidae, Veliidae, Corixidae, Hydropsychidae, Leptohyphidae, Hydrophilidae, Baetidae, and Simuliidae were found, while in very poor quality water, only family Corixidae was present.
... Data were transformed when necessary using SAS 9.4 (SAS, 2013). The transformation was determined using the Box-Cox method (Box & Cox, 1964, 1981). The significance level was set at α = 0.05 for all statistical analyses. ...
Article
Full-text available
Sunflower (Helianthus annuus L.), a dominant crop in Hetao Irrigation District, Western Inner Mongolia, is cultivated in arid and saline‐alkaline fields due to its salt and alkali tolerance, ensuring that farmers’ income from these fields is not lower than that from fertile lands. However, little is known about the integrated analysis of nitrogen (N) dynamics, including soil total N (TN), nitrate (NO3⁻), leachate TN and NO3⁻, nitrous oxide (N2O) fluxes, and N cycling microbial gene abundance in sunflower fields. The specific objective was to explore N dynamics for 2021 through 2023 in sunflower seeded to saline‐alkali croplands under arid conditions based on treatments of irrigation rate (I rate: I1, 5110; I2, 4050; I3, 2985 m³ ha⁻¹) for washing salinity by irrigation and nitrogen fertilization rate (N rate: N1, 750; N2, 600; N3, 450; N0, 0 kg ha⁻¹). Our findings indicated that I rate did not affect soil N dynamics; N rate significantly increased soil TN, NO3⁻ and N2O fluxes, especially showing an extremely significant increase for leachate TN and NO3⁻. The interaction of I and N rates impacted soil TN, NO3⁻, their leachates, and N cycling microbial gene abundances, especially denitrification genes. Soil leachate TN and NO3⁻ increased exponentially over time. Soil N2O fluxes increased annually with the growth of sunflowers. In the saline‐alkali sunflower fields, a low N rate (450 kg ha⁻¹) can be an optimal strategy, and the precise calibration of I and N rates can guarantee adequate N dynamics and yields, highlighting the intricate balance required for sustainable agricultural practices.
... Meteorological data of precipitation (mm), maximum temperature (ºC), minimum temperature (ºC), average temperature (ºC), relative humidity (%), and wind speed (m s -1 ) corresponding to the periods of winter cultivation (April 2 nd to October 4 th , 2019) and summer (October 15 th , 2019, to April 26 th , 2020) (INMET, 2022). Box and Cox's (1964) methodology was employed to determine the most suitable transformation for normal approximation. The transformation constants (λ) for each characteristic were as follows: VGMY (λ = 0.229), PRDM (λ = -0.552), ...
Article
Full-text available
Environmental conditions significantly impact the performance of sweet potato genotypes, necessitating the study of genotype x environment (GE) interactions to select genotypes adaptable to varying cultivation conditions. This study aimed to assess GE interactions in sweet potatoes for animal feed and identify high-performance genotypes suitable for different seasons. We conducted two tests during the Brazilian winter of 2019 and summer of 2020. Employing a partially balanced triple lattice experimental design with 100 treatments (92 sweet potato genotypes and eight controls) and three replications, we measured vine green matter yield (VGMY), percentage vine dry matter (PVDM), vine dry matter yield (VDMY), percentage of root dry matter (PRDM), and roots dry matter yield (RDMY). We ranked genotypes, highlighting the best performers for individual and combined seasons. Significant differences in VGMY, PRDM, and RDMY were observed for GE interaction. VGMY, VDMY, and PRDM favored the summer season, while PVDM and RDMY performed better in the winter season. Genotypes 2018-31-713, 2018-72-1438, 2018-31-666, 2018-12-252, 2018-19-461, 2018-19-389, 2018-38-946, 2018-31-689, and 2018-37-864 proved most suitable for VGMY and VDMY across growing seasons. Genotypes 2018-28-514, 2018-15-268, and 2018-19-443 demonstrated potential in percentage vine dry matter. Genotypes 2018-31-666, 2018-72-1438, and 2018-15-277 are recommended for PRDM in both seasons. Genotypes 2018-19-464, 2018-28-556, 2018-55-1154, 2018-28-543, 2018-53-1038, 2018-72-1432, and 2018-19-443 exhibited greater potential for RDMY, making them ideal for animal feed in both growing seasons.
... RIP and RDLG were modeled by linear random effects (Delgado et al. 2020). Given the skewness of the residuals, data were transformed following the method of Box and Cox (1964), using λ = 0.5 (RIPt) and λ = 0.25 (RDLGt) as estimated values. The "lmer" function of the lme4 library (Bates and Maechler 2010) of the R program (R Core Team 2013) allowed fitting the following statistical model: ...
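The model specification itself is truncated in the snippet; the following is a hypothetical sketch of the described workflow (fixed-λ Box-Cox transform, then a random-effects fit with lme4), with simulated data standing in for the trial records:

```r
library(lme4)

# Simulated stand-in for the trial data (hybrid, year, replication, RIP per plant)
set.seed(4)
d <- expand.grid(hybrid = factor(1:37), year = factor(1:3), rep = factor(1:3),
                 plant = 1:12)
d$RIP <- rgamma(nrow(d), shape = 3, scale = 0.2)   # skewed response, as described

box_cox <- function(y, lambda) (y^lambda - 1) / lambda  # valid for lambda != 0
d$RIPt <- box_cox(d$RIP, lambda = 0.5)   # lambda reported for RIP (0.25 for RDLG)

# Illustrative random-effects structure; the snippet does not show the actual terms
m <- lmer(RIPt ~ 1 + (1 | hybrid) + (1 | year) + (1 | hybrid:year), data = d)
summary(m)
```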
Article
Full-text available
In sunflower breeding, plant phenotyping of white rot (WR) resistance requires a significant amount of resources. Thus, the present study aimed to evaluate the possibility of phenotyping WR resistance in sunflower hybrids more efficiently per unit of allocated resources. The Degree of Genetic Determination (DGD) estimated from 37 commercial hybrids evaluated by their relative incubation period (RIP) and relative daily lesion growth (RDLG) for 3 years (y) in field experiments designed with 3 replications (r) and 12 plants/plot (pl/p), i.e. 108 plants/hybrid (pl/h), was compared with DGDs estimated using < 108 pl/h. When using fewer resources, DGD values were estimated with less precision in all year-replication-plant/plot combinations. The bias between the DGD averages estimated and the benchmarked DGD values of 0.78 (RIP) and 0.63 (RDLG) and, consequently, the inaccuracies of such estimations increased gradually. The 3y-2r-6pl/p combination was the level of allocated resources showing a still acceptable relative genotypic variability detected for RIP, given a 100% probability that the DGD estimated was higher than the proposed threshold value (DGD = 0.60), although the probability for RDLG was considerably lower. That combination also resulted in a not-to-be-overlooked gain in relative genotypic variability per unit of allocated resources in relation to that with 108 pl/h. So, the cost associated with resources, such as land, seeds, time, and personnel, allocated to assess WR resistance could be reduced without significantly altering the accuracy and precision of the DGD values estimated with respect to the benchmarked ones.
... For a univariate time series, there could be a linear stochastic trend, which may be removed by differencing, or there could be a possibly nonlinear deterministic trend, which may be removed by a regression device. Moreover, heterogeneity may be removed using a Box-Cox-type transformation [41]. In a local Gaussian approach, further analysis is facilitated by transforming the series into a Gaussian series; see [4]. ...
Article
Full-text available
Machine learning forecasting methods are compared to more traditional parametric statistical models. This comparison is carried out regarding a number of different situations and settings. A survey of the most used parametric models is given. Machine learning methods, such as convolutional networks, TCNs, LSTM, transformers, random forest, and gradient boosting, are briefly presented. The practical performance of the various methods is analyzed by discussing the results of the Makridakis forecasting competitions (M1–M6). I also look at probability forecasting via GARCH-type modeling for integer time series and continuous models. Furthermore, I briefly comment on entropy as a volatility measure. Cointegration and panels are mentioned. The paper ends with a section on weather forecasting and the potential of machine learning methods in such a context, including the very recent GraphCast and GenCast forecasts.
... Attributes with significant skewness may distort the overall similarity score, as the SD-score relies on the standard deviation as a measure of variability. Recommendations for improvement: preprocessing the skewed data with transformations such as log, square root, or Box-Cox [10,34] to reduce asymmetry and approximate normality before applying the SD-score will improve the SD-score's effectiveness. Also, providing additional statistical context, such as visualizations of the data distribution [18], can aid in interpreting the SD-score's findings. ...
Article
Full-text available
The ability to measure similarity or distance between data points is critical for various analytical tasks, including classification, clustering, and anomaly detection. However, traditional distance metrics such as Euclidean, Manhattan, and Hamming often struggle with mixed data types, varying attribute scales, and noise, limiting their robustness in diverse datasets. This paper introduces the Standard Deviation Score (SD-score), a novel similarity metric designed to address these challenges. By transforming traditional distance values into standard deviation units relative to a target point, the SD-score enables robust and interpretable similarity assessments. Extensive experimental evaluations demonstrate that the SD-score consistently outperforms conventional metrics in accuracy, precision, recall, and F-score within the k-Nearest Neighbors classification framework. Also, a comprehensive evaluation of the SD-score’s performance across Gaussian, skewed, and multimodal distributions showed promising results in the cluster coherence experiment, in which the Silhouette score was measured through the K-means clustering algorithm, emphasizing its adaptability to real-world data complexities. Additionally, the experiments detail improved handling of mixed numerical, ordinal, and categorical data types through a unified framework. The proposed metric incorporates inherent normalization mechanisms, reducing sensitivity to outliers and ensuring consistency across varying data scales and distributions, making it a versatile tool for real-world applications. This advancement in similarity measurement paves the way for more accurate and efficient data analysis across multiple domains.
... Linear mixed-effect models using the selected proximity measures were fitted on the behavioural measures probed in each analysis with the lme4 package in R, using maximum likelihood estimation (Bates et al., 2015). Potential transformations of the raw data were suggested using Box-Cox transformation analysis (Box & Cox, 1964), and the normality of regression residuals was examined using Q-Q plots. We do not report these analyses, but they are available in R code on the OSF site for this project. ...
Article
Full-text available
Recent research has shown that the compositional meaning of a compound is routinely constructed by combining meanings of constituents. However, this body of research has focused primarily on Germanic languages. It remains unclear whether this same computational process is also observed in Chinese, a writing system characterised by less systematicity of the meanings and functions of constituents across compounds. We quantified the ease of integrating the meanings of Chinese constituent characters into a compositional compound meaning using a computational model based on distributional semantics. We then showed that this metric predicted sensibility judgements on novel compounds (Study 1), lexical decision latencies for rejecting novel compounds (Study 2), and lexical decision latencies for recognising existing compounds (Study 3). These results suggest that a compositional process is involved in Chinese compound processing, even in tasks that do not explicitly require meaning combination. Our results also suggest that a generic statistical learning framework is able to capture the meaningful functions of Chinese compound constituents. We conclude by discussing the advantages of routine meaning construction during compound processing in Chinese reading.
... In this application, the effect of antidotes developed against poisons was tested on mice. In the poisons data set, obtained from [27] and available in the boot package in R, the response times of mice were recorded under treatment with antidotes A, B, C, and D. Parameter estimates and summary statistics (in weeks) are given in Table 10. ...
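This is the 3 × 4 factorial survival-time experiment analysed in Box and Cox's original paper, where the profile likelihood points towards the reciprocal transformation (interpretable as a rate of dying). A brief sketch against the boot copy of the data:

```r
library(MASS)                      # boxcox()
data(poisons, package = "boot")    # columns: time, poison, treat

fit <- lm(time ~ poison * treat, data = poisons)
bc  <- boxcox(fit, lambda = seq(-3, 1, 0.05), plotit = FALSE)
bc$x[which.max(bc$y)]              # maximises near -0.75; Box & Cox adopted lambda = -1
```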
Article
Full-text available
Testing the equality of means of several skewed populations, particularly in the presence of nuisance parameters, is a central challenge in statistics. While various tests have been proposed for distributions such as the log-normal, inverse-normal, and exponential, leveraging methods like the generalized p-value, parametric bootstrap, and the fiducial approach, there remains a notable gap in the literature: the absence of a computational-approach-based test for the two-parameter exponential distribution. Such a method is essential for achieving robust results in small sample sizes while considering power and Type I error probability. In response to this gap, our paper introduces and implements novel computational approach tests embedded in the doex package in R. Our focus is on assessing the equality of means for several skewed populations following a two-parameter exponential distribution. We conduct a comprehensive comparison of our proposed tests against existing alternatives, evaluating their penalized power and Type I error probability. Notably, our computational approach tests exhibit superior performance, particularly in cases involving small samples and balanced designs. Furthermore, to illustrate the practical relevance of our proposed tests, we present a real-world application using authentic data. This empirical demonstration serves to underscore the efficacy and applicability of our novel computational approach tests in real-world scenarios.
... Transformation models (in the sense of Box and Cox 1964; Hothorn et al. 2018) offer a compromise between the nonparametric and parametric worlds by transforming outcomes via a monotone nondecreasing transformation function h : Y → R such that the transformed outcome distribution is described by a simple cumulative distribution function G : R → [0, 1] with parameter-free, log-concave, absolutely continuous density. This results in F0(y) = G(h(y)) for the distribution under control. ...
Preprint
Full-text available
Treatment effects for assessing the efficacy of a novel therapy are typically defined as measures comparing the marginal outcome distributions observed in two or more study arms. Although one can estimate such effects from the observed outcome distributions obtained from proper randomisation, covariate adjustment is recommended to increase precision in randomised clinical trials. For important treatment effects, such as odds or hazard ratios, conditioning on covariates in binary logistic or proportional hazards models changes the interpretation of the treatment effect under noncollapsibility and conditioning on different sets of covariates renders the resulting effect estimates incomparable. We propose a novel nonparanormal model formulation for adjusted marginal inference. This model for the joint distribution of outcome and covariates directly features a marginally defined treatment effect parameter, such as a marginal odds or hazard ratio. Marginal distributions are modelled by transformation models allowing broad applicability to diverse outcome types. Joint maximum likelihood estimation of all model parameters is performed. From the parameters not only the marginal treatment effect of interest can be identified but also an overall coefficient of determination and covariate-specific measures of prognostic strength can be derived. A reference implementation of this novel method is available in R add-on package tram. For the special case of Cohen's standardised mean difference d, we theoretically show that adjusting for an informative prognostic variable improves the precision of this marginal, noncollapsible effect. Empirical results confirm this not only for Cohen's d but also for log-odds ratios and log-hazard ratios in simulations and three applications.
... In our case, the probability scale for binary responses, which is bounded between zero and one, is nonlinearly transformed to an unbounded latent scale by a link function in the model. Nonlinear transformations such as the logarithm, the exponential function, or the probit transformation can affect the interpretation of interactions (Box & Cox, 1964). The application of nonlinear transformations can introduce curvature or changes in the relationship between variables, thus influencing the nature of interaction effects (Fox, 2015). ...
Article
Full-text available
People judge repeated statements as more true than new ones. This repetition-based truth effect is a robust phenomenon when statements are ambiguous. However, previous studies provided conflicting evidence on whether repetition similarly affects truth judgments for plausible and implausible statements. Given the lack of a formal theory explaining the interaction between repetition and plausibility on the truth effect, it is important to develop a model specifying the assumptions regarding this phenomenon. In this study, we propose a Bayesian model that formalizes the simulation-based model by Fazio, Rand, and Pennycook (2019; Psychonomic Bulletin & Review ). The model specifies how repetition and plausibility jointly influence the truth effect in light of nonlinear transformations of binary truth judgments. We test our model in a reanalysis of experimental data from two previous studies by computing Bayes factors for four competing model variants. Our findings indicate that, while the truth effect is usually larger for ambiguous than for highly implausible or plausible statements on the probability scale, it can simultaneously be constant for all statements on the probit scale. Hence, the interaction between repetition and plausibility may be explained by a constant additive effect of repetition on a latent probit scale.
... To address these challenges, the Box-Cox transformation was subsequently applied to stabilize variance in the image data and approximate a normal distribution. By adjusting nonlinear data using the parameter λ, the Box-Cox transformation reduces skewness and reshapes the data closer to normality (Box and Cox, 1964). This step complements gamma correction, which primarily enhances contrast but does not correct pixel value distribution biases. ...
... For this, a two-way mixed-design analysis of variance (ANOVA) was conducted with the within-subject factor being frequency bands and the between-subject factor being two experiments. Since relative weights in some frequency bands (1 kHz in Experiment 1, 0.25- and 8-kHz bands in Experiment 2) violated the normality assumption, as assessed by the Shapiro-Wilk test (Shapiro & Wilk, 1965), the Box-Cox transformation (Box & Cox, 1964) with λ = 0.4 was applied. After the transformation, the normality assumption was satisfied according to the Shapiro-Wilk test. ...
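A sketch of that check-transform-recheck loop, with simulated skewed values standing in for the per-band relative weights:

```r
set.seed(2)
w <- rgamma(20, shape = 2, scale = 0.1)  # skewed stand-in for relative weights

shapiro.test(w)                          # normality check on the raw values

lambda <- 0.4
w_bc <- (w^lambda - 1) / lambda          # Box-Cox with the reported lambda
shapiro.test(w_bc)                       # recheck normality after transforming
```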
Article
Full-text available
Purpose: Although users can customize the frequency–gain response of hearing aids, the variability in their individual adjustments remains a concern. This study investigated the within-subject variability in the gain adjustments made within a single self-adjustment procedure. Method: Two experiments were conducted with 20 older adults with mild-to-severe hearing loss. Participants used a two-dimensional touchscreen to adjust hearing aid amplification across six frequency bands (0.25–8 kHz) while listening to continuous speech in background noise. In these two experiments, two user interface designs, differing in control-to-gain map, were tested. For each participant, the statistical properties of 30 repeated gain adjustments within a single self-adjustment procedure were analyzed. Results: When participants made multiple gain adjustments, their preferred gain settings showed the highest variability in the 4- and 8-kHz frequency bands and the lowest variability in the 1- and 2-kHz bands, suggesting that midfrequency bands are weighted more heavily in their preferences compared to high frequencies. Additionally, significant correlations were observed for the preferred gains between the 0.25- and 0.5-kHz bands, between the 0.5- and 1-kHz bands, and between the 4- and 8-kHz bands. Lastly, the standard error of the preferred gain reduced with an increasing number of trials, at a rate slightly shallower than would be expected for an invariant mean preference for most participants, suggesting convergent estimation of the underlying preference across trials. Conclusion: Self-adjustments of frequency–gain profiles are informative about the underlying preference; however, the contributions from various frequency bands are neither equal nor independent. Supplemental Material: https://doi.org/10.23641/asha.28405397
... The data were also transformed to reduce the differences between the extreme values. In particular, a Box-Cox transformation, commonly used in the environmental sciences and geosciences [44][45][46][47], was applied based on the results and previous experience. ...
Article
Full-text available
This study explores the relationship between bryophyte (mosses) diversity and environmental factors in the Veles region, North Macedonia, focusing on the spatial distribution of chemical elements in the moss and surface soil samples collected from the same locations. Eighteen moss samples were analyzed alongside surface soils. Advanced spectrometric techniques were used to identify potentially toxic elements (PTEs) and their links to anthropogenic and natural sources. While metal measurements are widely reported in the literature, the novelty of this study lies in its integrative approach, combining moss biodiversity analysis with a direct comparison of element concentrations in both moss and soil. The results show significant patterns of deposition of PTEs and highlight the long-term impact of industrial activities on biodiversity and air pollution. These findings provide valuable insights into conservation strategies and environmental management in the midst of ongoing ecological change. Five groups of elements were separated using factor analysis: G1 (Al, Cr, Cu, Fe, Li, Mg, Mn, Ni and V); G2 (Ba and Na); G3 (K, P and Mo), G4 (Pb and Zn), and G5 (Ag, As and Cd), of which two groups (G1 and G2) were found to be typical geochemical associations, while G4 and G5 are anthropogenic associations due to the emission of dust from contaminated soils and the slag heap of the Pb-Zn smelting plant. Group 3 represents a mixed geochemical and anthropogenic association. It was found that Pb, Zn, Cd, and As could indeed be detected in the moss in the study area, underlining its ability to detect pollutants in the air. A comparative analysis of moss and soil samples revealed significant differences in element concentrations, with most elements being more concentrated in soil. These results underline the role of moss as a bioindicator of atmospheric deposition, detecting pollution trends rather than direct soil contamination.
... If the p-value is less than 0.05, the null hypothesis (that the sample came from a normal distribution) is rejected, while a p-value higher than 0.05 indicates that the sample is consistent with a normal distribution. The Box-Cox transformation (Box & Cox, 1964) with lambda equal to 0 (i.e., a log transformation) was used to transform the data to fit a normal distribution. ...
Article
Full-text available
Purpose: Lockdown and movement restrictions imposed by governments have significantly changed customer behavior, making planning and decision-making processes more challenging. Providing accurate estimation of demand enables managers to take more successful decisions and allows optimizing inventory and resources; this is the main purpose of this study. Design/methodology/approach: An ensemble model is proposed based on combining Bayesian-optimized Long Short-Term Memory (BO-LSTM) and Gated Recurrent Unit (BO-GRU) networks. Experiments were carried out on an actual dataset obtained from a company specialized in food industries during the volatile situation of Covid-19. Findings: The proposed model significantly outperformed all hand-tuned ones and reduced the mean Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) by 2.80% and 4.74% compared to BO-LSTM and 3.14% and 3.60% compared to BO-GRU, respectively. Furthermore, using the BO algorithm for hyperparameter tuning improved the forecasting accuracy. Originality/value: The suggested model was statistically compared to its members in addition to other machine learning models using the t-test. Findings demonstrated the superiority of the proposed method over all benchmark models.
... The skewness and kurtosis of the distribution of elements in the set are calculated by Equations (5) and (6). Based on the skewness and kurtosis of the positively skewed distribution, the Box-Cox transformation [21] of Equation (7) was used to approximate a normal distribution. ...
Article
Full-text available
Due to the large fluctuation of blast furnace gas (BFG) generation and its complex production characteristics, it is difficult to accurately characterize its variation patterns. Therefore, this paper proposes a prediction method for BFG generation based on Bayesian networks. First, the BFG generation data are divided according to the production rhythm of the hot blast stove, and the training event set is constructed along the two dimensions of interval generation and interval time. Then, the Bayesian network of generation and the Bayesian network of time corresponding to the two dimensions are built. Finally, the state of each prediction interval is inferred, and the results of the reasoning are mapped and combined to obtain the prediction results for the BFG generation interval combination. In the experimental part, actual data from a large domestic iron and steel plant are used to carry out multiple comparison experiments, and the results show that the proposed method can effectively improve the prediction accuracy.
... To meet LMM assumptions, the distributions and power coefficients of all continuous dependent variables were inspected using the MASS package [86] and the Box-Cox procedure [77]. As a result, all continuous dependent variables were log-transformed. Model selection always started with the maximal random-effects structure. ...
... While Welch's ANOVA was appropriately selected as a robust alternative to the traditional ANOVA, several additional statistical procedures could further strengthen the analysis. Box-Cox transformations could be applied to stabilize variances across groups (33), or weighted least squares regression could be implemented to account for unequal variances (34). For non-parametric alternatives, the Brown-Forsythe test could be considered, as it maintains robustness under heteroscedasticity while potentially offering greater power than Welch's ANOVA in certain conditions (35). ...
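A sketch contrasting the procedures named here, using base R only (oneway.test for Welch's ANOVA; the weights argument of lm for weighted least squares; log as the λ = 0 Box-Cox transform):

```r
set.seed(3)
g <- factor(rep(c("A", "B", "C"), each = 30))
y <- rlnorm(90, meanlog = as.numeric(g) / 2)   # skewed, heteroscedastic by group

oneway.test(y ~ g, var.equal = FALSE)  # Welch's ANOVA, robust to unequal variances

# Alternative 1: stabilise variances first (log = Box-Cox with lambda = 0)
anova(lm(log(y) ~ g))

# Alternative 2: weighted least squares with inverse group-variance weights
w <- 1 / ave(y, g, FUN = var)
summary(lm(y ~ g, weights = w))
```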
Article
Full-text available
This study addresses the evaluation of the generation of domestic solid waste in Peruvian households using statistical techniques and the SEMMA and PCA data mining methodology. The objective is to explore how waste management, population, and the Per Capita Generation (PCG) index influence the production of this waste in Peruvian departments. The sample was obtained from the database of annual reports submitted by district and provincial municipalities to MINAM through the Information System for Solid Waste Management (SIGERSOL), including data from the 24 departments of Peru, with a total of 14,852 records organized in 196 registration forms. Statistical techniques and an adaptation of the SEMMA methodology were applied together with Principal Component Analysis (PCA) to examine the impacts of the accumulation of household solid waste in Peru. This study showed that the first component accounts for 80.2% of the inertia. Combining the first two components accounts for 99.8% of the total variation, suggesting that most of the meaningful information can be maintained using only two dimensions. Welch’s ANOVA showed significant differences in domestic solid waste generation among Peruvian departments [F(6, 94.310) = 790.444; p = 0.0, p < 0.05]. In addition, an eta squared of 99.09% revealed a very large effect size, indicating that population size explains 99.09% of the variation in the generation of this waste between the departments. The PCG index had a moderate effect, suggesting the need for further studies to explore the underlying causes of regional differences and assess the effectiveness of the waste management measures implemented. A positive relationship was found between the production of Domestic Solid Waste (DSW) and the number of inhabitants. Lima stood out with the highest DSW average (13,220.47 tons) and a PCG index of 50%. Using Ward’s method, three groups were obtained and PCA was applied to each group. In Group A, Lambayeque (5,616.48 tons), Loreto (2,946.44 tons), and San Martín (1,596.07 tons) registered the highest DSW averages, while Amazonas (441.1 tons) obtained the lowest; Ucayali (60%), Loreto (58%), and San Martín (57%) showed the highest PCG indexes. In Group B, Ayacucho (701.81 tons) had the highest DSW average and Apurimac (497 tons) the lowest; Tacna and Apurimac with 44% and Moquegua with 43% registered the highest PCG indexes, while Huancavelica (42%) and Pasco (41%) had the lowest. In Group C, Piura (4,476.53 tons) and La Libertad (3,478.46 tons) showed the highest DSW averages, while Huánuco (859.41 tons) and Cajamarca (812.74 tons) registered the lowest; Ica and Piura led with an average PCG of 48%, while Puno and Junín with 43% had the lowest values.
... One of the most important data transformation families is the Box-Cox transformation, which is a power transformation method first introduced in 1964. The Box-Cox transformation uses a parameter λ to transform a variable y with positive values using the following formula [23]: ...
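The formula itself is truncated in the snippet; a minimal R implementation of the standard definition, showing the familiar special cases:

```r
# Box-Cox power transform for positive y:
# (y^lambda - 1) / lambda when lambda != 0, and log(y) when lambda == 0
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))  # defined only for positive values
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

y <- c(0.5, 1, 2, 8)
box_cox(y, 0)     # log transform
box_cox(y, 0.5)   # shifted/scaled square-root transform
box_cox(y, -1)    # shifted/scaled inverse transform
```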
Article
Full-text available
This paper presents a method of choosing a single solution in the Pareto Optimal Front of the multi-objective problem of the spectral and energy efficiency trade-off in Massive MIMO (Multiple Input, Multiple Output) systems. It proposes the transformation of the group of non-dominated alternatives using the Box–Cox transformation with values of λ < 1 so that the graph with a complex shape is transformed into a concave graph. The Box–Cox transformation solves the selection bias shown by the decision-making algorithms in the non-concave part of the Pareto Front. After the transformation, four different MCDM (Multi-Criteria Decision-Making) algorithms were implemented and compared: SAW (Simple Additive Weighting), TOPSIS (Technique for Order of Preference by Similarity to Ideal Solution), PROMITHEE (Preference Ranking Organization Method for Enrichment Evaluations) and VIKOR (Vlse Kriterijumska Optimizacija Kompromisno Resenje). The simulations showed that the best value of the λ parameter is 0, and the MCDM algorithms which explore the Pareto Front completely for different values of weights of the objectives are VIKOR as well as SAW and TOPSIS when they include the Max–Min normalization technique.
... The original reaction time distributions from each participant significantly deviated from normality, as assessed by the Shapiro-Wilk test, ps < .001. The Box-Cox procedure (Box & Cox, 1964) was applied to transform the distributions toward normality. We analyzed normalized RTs of correct responses (henceforth RTs) with a linear mixed model (LMM) and counts of response accuracies with a generalized linear mixed model (GLMM). ...
Article
Full-text available
Motor interactions with single objects, as well as pairs of objects, can be automatically affected by visual asymmetries provided by protruding parts, whether the handle or not. Faster and more accurate performance is typically produced when task-defined responses correspond to the location of such protruding parts, relative to when they do not correspond (i.e., object-based spatial correspondence effects). In two experiments we investigated the mechanisms that underlie the spatial coding of tool-object pairs when semantic and action alignment relationships were orthogonally combined. Centrally presented pictures of "active" tools (depicted as potentially performing their proper action) were paired, on one side, with a "passive" object (the target of the tool's action). We observed S-R correspondence effects that depended on the location of the protruding side of tool-object pairs, and not on the non-protruding side of the tool handle. Thus, results further supported the location coding account of the effect, against the affordance activation one. The effect was only produced when tool-object pairs belonged to the same semantic category or were correctly aligned for action, but with no further interplay. This was not consistent with the idea that action links were coded between tool-object pairs, and that the resulting action direction interacted with response spatial codes. Alternatively, we claimed that semantic relation and action alignment acted, independently of each other, as perceptual grouping criteria, allowing the basic spatial coding of visual asymmetries to take place. This led to speculation, at the neurocognitive level, about independent processing along the ventral and ventro-dorsal streams.
... We added weights to control for unequal sampling across populations and locations: the number of samples for a given population per location divided by the maximum value, so that more weight would be given to migration distances estimated from locations that were more representative for a given population. Because migration distances (km) were not normally distributed, we applied a Box-Cox transformation (Box and Cox 1964; Venables and Ripley 2002). We selected the best-fit model with the anova R function and computed the estimated marginal means for each population combination with the emmeans R package (Lenth et al. 2023). ...
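The packages named in the snippet are real (MASS underlies the Venables & Ripley Box-Cox machinery; emmeans computes estimated marginal means), but the data frame and variable names below are simulated stand-ins sketching the described pipeline:

```r
library(MASS)     # boxcox(), as in Venables & Ripley (2002)
library(emmeans)  # estimated marginal means

# Simulated stand-in: migration distance (km), source population, sampling weight
set.seed(5)
d <- data.frame(population = factor(rep(1:5, each = 40)))
d$dist_km <- rlnorm(200, meanlog = 2 + as.numeric(d$population) / 4)
d$w <- runif(200, 0.2, 1)   # more weight to better-sampled locations

bc <- boxcox(lm(dist_km ~ population, data = d), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]
d$dist_t <- (d$dist_km^lambda - 1) / lambda   # assumes lambda != 0

m0 <- lm(dist_t ~ 1, data = d, weights = w)
m1 <- lm(dist_t ~ population, data = d, weights = w)
anova(m0, m1)                        # model comparison via the anova function
emmeans(m1, pairwise ~ population)   # marginal means for each population pair
```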
Article
Full-text available
The contributions of distinct populations to annual harvests provide key insights to conservation, especially in migratory species that return to specific reproductive areas. In this context, genetic stock identification (GSI) requires reference samples from source populations to assign harvested individuals, yet sampling might be challenging as reproductive areas could be remote and/or unknown. To investigate intraspecific variation in walleye (Sander vitreus) populations harvested in two large lakes in northern Quebec, we used genotyping-by-sequencing data to develop a panel of 303 filtered single-nucleotide polymorphisms. We then genotyped 1465 fish and assessed individual migration distances from GPS coordinates of capture locations. Samples were assigned to a source population using two methods, one requiring allele frequencies of known populations (RUBIAS) and the other without prior knowledge (STRUCTURE). Individual assignments to a known population reached 93% consistency between both methods in the main lake where we identified all five major source populations. However, the analyses also revealed up to three small unsampled populations. Furthermore, populations were characterised by large differences in average migration distance. In contrast, assignment consistency reached 99% in the neighbouring lake and walleye were assigned with high confidence to two populations having a similar distribution throughout the lake. The complex population structure and migration patterns in the main lake suggest a more heterogeneous habitat and thus, greater potential for local adaptation. This study highlights how combining analytical approaches can inform the robustness of GSI results in a given system and detect intraspecific diversity and complexity relevant for conservation.
Article
Water quality in streams is primarily affected by various land use practices. This study analyzes water quality data collected from the outlets of 113 watersheds across three South Atlantic states in the USA. The objective is to evaluate the relationship between different land use metrics and long-term stream water quality, specifically investigating whether incorporating the spatial proximity of various land uses to the stream and outlet can enhance predictions of stream water quality. To achieve this, four distinct metrics were utilized to assess their influence on stream water quality. The first metric, known as the Lumped method, assigns equal weight to all land uses. The second, the Inverse Distance Weights stream (IDWs), gives greater weight to land uses located closer to the stream. The third metric, the Inverse Distance Weights Outlet (IDWO), weights land uses according to their proximity to the watershed outlet. The final metric focuses on hydrologically sensitive areas (HSAs), which are areas within watersheds that generate the majority of runoff. The results indicated that the Lumped metric emphasizes the significance of forested lands, whereas the HSAs, IDWs, and IDWO metrics highlight the importance of the spatial distribution of agricultural and industrial lands within the watershed. These findings support the hypothesis that considering hotspot areas and their relative positions within the watershed can improve predictions of water quality. Overall, the incorporation of HSAs, IDWs, and IDWO metrics shows that not only is the extent of land use change within a watershed critical, but also the proximity of these land uses to a stream or outlet plays a significant role.
Article
Full-text available
In most tinnitus patients, tinnitus can be masked by external sounds. However, evidence for the efficacy of sound-based treatments is scarce. To elucidate the effect of sounds on tinnitus under real-world conditions, we collected data through the TrackYourTinnitus mobile platform over a ten-year period using Ecological Momentary Assessment and Mobile Crowdsensing. Using this dataset, we analyzed 67,442 samples from 572 users. Depending on the effect of environmental sounds on tinnitus, we identified three groups (T-, T+, T0) using Growth Mixture Modeling (GMM). Moreover, we compared these groups with respect to demographic, clinical, and user characteristics. We found that external sound reduces tinnitus (T-) in about 20% of users, increases tinnitus (T+) in about 5%, and leaves tinnitus unaffected (T0) in about 75%. The three groups differed significantly with respect to age and hearing problems, suggesting that the effect of sound on tinnitus is a relevant criterion for clinical subtyping.
Article
Full-text available
Learning to read affects speech perception. For example, the ability of listeners to recognize consistently spelled words faster than inconsistently spelled words is a robust finding called the Orthographic Consistency Effect (OCE). Previous studies located the OCE at the rime level and focused on languages with opaque orthographies. This study investigates whether the OCE also emerges at the phonemic level and is a general phenomenon of languages with alphabetic scripts, including those with transparent writing systems. Thirty French (opaque language) and 30 Spanish (transparent language) listeners participated in an auditory lexical decision task featuring words and pseudowords including either only consistently spelled phonemes or also inconsistently spelled phonemes. Our results revealed an OCE in both French and Spanish which surfaced as longer reaction times in response to inconsistently spelled words and pseudowords. However, when analyzing the data split by language, the OCE was only detectable in French but not in Spanish. Our findings have two theoretical implications. First, they show that auditory lexical processing is impacted by orthographic information that is retrieved at the phonemic level, not just the rime level. Second, they suggest that the OCE may be modulated by a language’s opacity. In conclusion, our study highlights the depth of literacy effects on auditory language processing and calls for further investigations involving highly transparent languages.
Article
Full-text available
Background: According to the hypersensitivity hypothesis, highly emotionally intelligent individuals perceive emotion information at a lower threshold, pay more attention to emotion information, and may be characterized by more intense emotional experiences. The goal of the present study was to investigate whether and how emotional intelligence (EI) is related to hypersensitivity, operationalized as heightened emotional and facial reactions when observing others narrating positive and negative life experiences. Methods: Participants (144 women) watched positive and negative videos in three different conditions: with no specific instructions (spontaneous condition), with instructions to put themselves in the character’s shoes (empathic condition), and with instructions to distinguish themselves from the character (distancing condition). The activity of the corrugator supercilii and zygomaticus major muscles was recorded, and after each video, the participants reported the arousal corresponding to their emotion during the video. The EI facets emotion recognition (ER), emotion understanding (EU), and emotion management (EM) were measured. Results: Participants’ self-reported arousal and facial motor responses increased in the empathic condition compared to the spontaneous condition and then decreased in the distancing condition. Although there was no effect of EI on reported arousal, EI, specifically EU and EM, seemed to influence facial reactions during the task. In the spontaneous and empathic conditions, EU was associated with a greater difference in zygomaticus activation between positive and negative videos, suggesting that individuals high on this EI facet may react more to the positive emotions of others. In the spontaneous and distancing conditions, EM predicted less corrugator activation when watching negative videos, suggesting that individuals high on this EI facet may spontaneously regulate their negative emotions. Conclusions: This study suggests that hypersensitivity effects might be better captured by implicit measures such as facial reactions rather than explicit ones such as the reporting of emotion. It also suggests that some EI facets and viewing conditions (spontaneous, empathic, and distancing) influence emotional facial reactivity.
Article
Yerba mate (Ilex paraguariensis A.St.-Hil.) is a crop of significant economic importance, especially for farmers in southern Brazil. Despite advances in yerba mate cultivation, there is a lack of studies on soil physical quality, particularly on the effects of soil compaction. Yerba mate is cultivated in association with native forest; however, the areas used for the expansion of cultivation present degraded soils, such as compacted grasslands, without any tillage to improve crop establishment. This study aimed to evaluate the effects of soil compaction on the growth of yerba mate seedlings and clonal plants. The experiment was conducted in a greenhouse, in pots, between November 2022 and July 2023. Seminal seedlings and clonal (genotype BRS BLD Yari) plants were planted in the upper layer of the pots, without soil compaction. Meanwhile, treatments were applied to the lower layers of the pots, based on the maximum bulk density obtained by the Proctor test. The treatments were high, medium, and no compaction. Plant and soil attributes were evaluated during the growth period. The behavior of the seminal and the Yari clonal plants was distinct. The shoots of the Yari clonal plants were more affected by the increase in soil compaction. However, the roots of both the Yari clonal and the seminal plants presented a strong reduction in density and percentage under compacted conditions. Soil compaction provoked greater effects on the Yari clonal plants than on the seminal plants during the study period, which must be considered during the implementation of a new yerba mate plantation.
Article
Full-text available
Grasses of the Brachiaria genus are widely used as cover crops in no-tillage areas of the Brazilian Cerrado. This study aimed to evaluate the ability of six Brachiaria cultivars to produce shoot and root biomass, and the potential of the root system to grow through a 0.01 m thick wax layer with 1.5 MPa penetration resistance. The plants were grown in PVC columns with a diameter of 0.1 m and a height of 0.7 m. The column was divided into an upper part measuring 0.25 m (top) and a lower part measuring 0.45 m (bottom). The wax layer was positioned between the two parts of the column as a physical barrier to be perforated by the roots. The columns were filled with peaty substrate. The Brachiaria cultivars used were: Brachiaria brizantha cv. BRS Piatã, Brachiaria decumbens cv. Basilisk, Brachiaria brizantha cv. BRS Paiaguás, Brachiaria ruziziensis cv. Ruziziensis, Brachiaria brizantha cv. Xaraés and Brachiaria brizantha cv. Marandu. The Ruziziensis cultivar accumulated a high root dry mass, but the Xaraés cultivar presented the highest wax layer perforation capacity (80 %). Decumbens was the species with the lowest wax layer perforation capacity (10 %). Brachiaria species and cultivars demonstrated differences in root penetration ability, which, together with differences in shoot, leaf, and root dry matter and in the distribution of roots in the soil column profile, can be used to recommend different Brachiaria species for different purposes. The Xaraés cultivar has potential to be used as a management strategy in soil recovery for degraded lands with mechanical impedance.
Article
Full-text available
Hydromorphic soils, prevalent in regions with shallow water tables, play crucial roles in water regulation, aquifer recharge, and supporting biodiversity, including endemic plant species. However, research into how their attributes respond to water flow and level fluctuations remains limited. This study evaluated the chemical and microbiological attributes of hydromorphic soils developed from basalt in riparian forest areas. It compared three hydromorphic soil areas with varying water content, derived from igneous rocks, against a well-drained control area in southern Brazil. Soil samples were collected from the 0–10 cm layer at eight points per area across all four seasons to assess the effects of moisture content and temperature on the soil’s chemical and microbiological factors. Results indicated that high-water-content soils exhibited increased basal respiration, microbial biomass carbon, and microbial quotient, particularly during the summer. Soils with medium and low water content demonstrated a greater microbial biomass quotient, indicative of higher organic carbon immobilization, and showed more variability in microbiological attributes due to water fluctuations. Well-drained soils had the lowest microbiological values and minimal seasonal variation. Chemical parameters remained stable across all areas; however, high-water soils were more acidic and enriched with aluminum, whereas medium and low-water soils contained higher levels of phosphorus, magnesium, and calcium. The reduced acidity in well-drained soils was associated with lower organic matter content. The study concludes that hydromorphic soils differ from well-drained soils in terms of chemical and microbiological attributes. While chemical properties remained stable, microbiological parameters varied seasonally, influenced by precipitation and temperature changes. Hence, these seasonal changes in microbial activity, as responses to rainfall and temperature, are crucial environmental factors as they significantly influence biogeochemical cycles, water quality, and ecosystem dynamics.
Chapter
As the abundance of collected data on products, processes and service-related operations continues to grow with technology that facilitates the ease of data collection, it becomes important to use the data adequately for decision making. The ultimate value of the data is realized once it can be used to derive information on product and process parameters and make appropriate inferences. Inferential statistics, where information contained in a sample is used to make inferences on unknown but appropriate population parameters, has existed for quite some time (Mendenhall, Reinmuth, & Beaver, 1993; Kutner, Nachtsheim, & Neter, 2004). Applications of inferential statistics to a wide variety of fields exist (Dupont, 2002; Mitra, 2006; Riffenburgh, 2006). In data mining, a judicious choice has to be made to extract observations from large databases and derive meaningful conclusions. Often, decision making using statistical analyses requires the assumption of normality. This chapter focuses on methods to transform variables, which may not necessarily be normal, to conform to normality.
Article
Full-text available
Suspensory locomotion differs significantly from upright quadrupedal locomotion in mammals. Nevertheless, we know little concerning joint kinematics of suspensory movement. Here, we report three‐dimensional kinematic data during locomotion in brown‐throated three‐toed sloths (Bradypus variegatus). Individuals were recorded with four calibrated high‐speed cameras while performing below‐branch locomotion on a simulated branch. The elbow (range 73°–177°; mean 114°) and knee (range 107°–175°; mean 140°) were extended throughout support phase, with elbow extension increasing with speed. Both the fore‐ and hindlimb displayed abducted proximal limb elements (i.e., arm and thigh) and adducted distal elements (i.e., forearm and leg) during all support phase points. Comparisons of elbow and knee angles between brown‐throated three‐toed sloths and Linnaeus's two‐toed sloths (Choloepus didactylus) showed that brown‐throated three‐toed sloths had significantly more extended joint positions during all support phase points. Additionally, across all kinematic measurements, brown‐throated three‐toed sloths showed significant differences between homologous fore‐ and hindlimb segments, with the knee being more extended than the elbow and the arm being more abducted than the thigh. These results are consistent with previously established morphological and behavioral differences between extant sloth genera, with three‐toed sloths showing significantly longer forelimbs than hindlimbs and typically favoring locomotion on angled supports. Our findings show that, despite overall similarities in the use of below‐branch quadrupedal locomotion, the two sloth lineages achieve this locomotor mode with differing kinematic strategies (e.g., degree of joint flexion). These differences may be attributed to the distinct evolutionary pathways through which obligate suspensory locomotion arose in each lineage.
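The abstract does not spell out how the joint angles are computed, but a conventional approach recovers them from the 3D marker positions on either side of the joint. A minimal sketch with hypothetical coordinates follows; the marker names and values are assumptions, not the study's data.

# Minimal sketch: a 3D joint angle (e.g., elbow) from three marker positions.
# Marker coordinates are hypothetical; the study's actual pipeline may differ.
import numpy as np

shoulder = np.array([0.00, 0.10, 0.95])
elbow    = np.array([0.05, 0.12, 0.70])
wrist    = np.array([0.02, 0.30, 0.55])

u = shoulder - elbow   # vector from the joint to the proximal marker
v = wrist - elbow      # vector from the joint to the distal marker
cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
angle_deg = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
print(f"elbow angle: {angle_deg:.1f} degrees")  # 180 degrees = fully extended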
Article
Full-text available
This study investigates the changes in population size, distribution, and habitat preferences of the Eurasian magpie Pica pica in Zielona Góra over 23 years, emphasising the effects of urbanisation and habitat transformation. A comprehensive survey conducted in 2022 identified 953 magpie pairs, with an average density of 8.8 pairs/km² across the current administrative boundaries of Zielona Góra (without forests), and 27.7 pairs/km² in strictly urbanised zones. The highest densities were observed in the old town (36.5 pairs/km²) and residential blocks (34.5 pairs/km²), while peripheral areas, like allotment gardens and industrial zones, showed significantly lower densities. The nests were predominantly located in coniferous trees, especially spruces, marking a shift from the previously favoured poplars. The mean nest height was 11.8 m, varying by habitat type, with the highest nests found in the old town and parks. Environmental factors, such as proximity to trash bins, water sources, and tall trees, were significant predictors of nest density and placement. These findings underscore the magpie’s adaptability to urban environments, influenced by the availability of anthropogenic resources, habitat structure, and surrounding urban features.
Article
Full-text available
This study explored the orthographic processing of flankers in two lexical decision task experiments. In the first experiment, we manipulated the location of the overlapped letters between targets and flankers with 5 experimental conditions: i) identity, ii) first three letters overlapped, iii) three internal letters overlapped, iv) three last letters overlapped, and v) unrelated. The results showed significant facilitatory effects for conditions i), ii) and iv), suggesting that only the letters adjacent to the targets are able to generate facilitation. In Experiment 2, we created 5 conditions to specifically control for the role of inner and outer letters. The experimental conditions were the following: i) identity, ii) first three letters of the targets are presented in an inner position, iii) last three letters of the targets are presented in an inner position, iv) three inner letters of the targets are presented in first position, and v) unrelated. The results showed faster responses for the ID condition with respect to all other experimental conditions, without other significant contrasts. These results suggest that both the location of the letter overlap and the role of outer/inner letters must be considered in explaining the results.
Article
Full-text available
This paper focuses on the relationship between hotel age and performance, challenging the prevailing theory that hotel performance uniformly declines with age. Using the most extensive dataset available, encompassing over 3,000 U.S. hotels across six recent years, we employ a mixed-methods approach to assess whether a U-shaped performance trend exists. Our analysis reveals a nuanced pattern wherein hotel performance declines during the initial years of operation but improves beyond a critical turning point at approximately 40 years old, particularly for hotels located in large metropolitan areas, and also based on hotel brand-affiliation status and class. The findings suggest that certain older hotels defy conventional depreciation models, experiencing a resurgence in performance. These insights have implications for hospitality researchers, as well as for practitioners, suggesting re-evaluating older hotels’ strategies and informing investment and taxation decisions.
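The abstract does not specify the model, but one standard way to test for the U-shape it describes is to regress performance on age and age squared and locate the turning point at -b1/(2·b2). A minimal sketch on simulated data follows; the coefficients and sample are illustrative, not the study's.

# Minimal sketch: testing for a U-shaped age effect with a quadratic fit.
# Simulated data; the study's actual model and controls are not specified here.
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(1, 80, size=3000)
perf = 100 - 1.6 * age + 0.02 * age**2 + rng.normal(0, 5, size=3000)

# Ordinary least squares on [1, age, age^2].
X = np.column_stack([np.ones_like(age), age, age**2])
b0, b1, b2 = np.linalg.lstsq(X, perf, rcond=None)[0]

# A U-shape requires b2 > 0; the minimum sits at -b1 / (2 * b2).
print(f"quadratic coefficient: {b2:.4f}")
print(f"turning point: {-b1 / (2 * b2):.1f} years")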
Chapter
In this chapter, we review the concepts required to successfully analyze the main hypothesis for a surgical randomized controlled trial. Three hypothesis frameworks are considered for the main hypothesis for the trial: superiority, noninferiority, and equivalence. Data harmonization, checking, and processing are described as well as data visualization techniques for both continuous and categorical variables, all of which should occur prior to formal analysis. The most common statistical assumptions are reviewed as well as options for analyses if these assumptions are not met. The difference between statistical significance and clinical significance is also discussed. This chapter concludes with a section on statistical pitfalls and best practices for surgical trials. Throughout this chapter, we have worked to make these statistical concepts accessible while also providing additional resources for those who require, or are interested in, a deeper conceptual understanding.
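As one concrete instance of the pattern the chapter covers (check an assumption, switch analyses if it fails), here is a minimal sketch that tests normality in each arm and falls back to a rank-based test. The data and threshold are illustrative; this is a generic pattern, not the chapter's prescribed workflow.

# Minimal sketch: checking the normality assumption for a two-arm comparison
# and falling back to a rank-based test if it fails. Data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treatment = rng.exponential(scale=2.0, size=60)  # skewed outcome
control   = rng.exponential(scale=2.5, size=60)

alpha = 0.05
normal = (stats.shapiro(treatment).pvalue > alpha and
          stats.shapiro(control).pvalue > alpha)

if normal:
    result = stats.ttest_ind(treatment, control)     # parametric test
else:
    result = stats.mannwhitneyu(treatment, control)  # rank-based fallback
print(result)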
Article
This paper gives a short account of the more important properties of the multivariate t-distribution, which arises in association with a set of normal sample deviates.
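The link to normal sample deviates can be made explicit: dividing correlated normal deviates by an independent chi-based scale factor yields multivariate t draws. A minimal simulation sketch follows; the dimension, degrees of freedom, and covariance are illustrative.

# Minimal sketch: the multivariate t as normal deviates divided by an
# independent chi-based scale factor. Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(7)
df, n = 5, 100_000
cov = np.array([[1.0, 0.5],
                [0.5, 1.0]])

z = rng.multivariate_normal(mean=[0, 0], cov=cov, size=n)  # normal deviates
w = rng.chisquare(df, size=n)                              # chi-square scales
t = z / np.sqrt(w / df)[:, None]                           # multivariate t draws

# The heavier tails inflate the sample covariance by df / (df - 2).
print("sample covariance:\n", np.cov(t, rowvar=False))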
Article
1—In a previous paper*, dealing with the importance of properties of sufficiency in the statistical theory of small samples, attention was mainly confined to the theory of estimation. In the present paper the structure of small sample tests, whether these are related to problems of estimation and fiducial distributions, or are of the nature of tests of goodness of fit, is considered further.
Article
A number of methods for examining the residuals remaining after a conventional analysis of variance or least-squares fitting have been explored during the past few years. These give information on various questions of interest, and in particular, aid in assessing the validity or appropriateness of the conventional analysis. The purpose of this paper is to make a variety of these techniques more easily available, so that they can be tried out more widely. Techniques of analysis, some graphical, some wholly numerical, and others mixed, are discussed in terms of the residuals that result from fitting row and column means to entries in a two-way array (or in several two-way arrays). Extensions to more complex situations, and some of the uses of the results of examination, are indicated.
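The residuals in question are those left after removing the grand mean and the row and column effects from a two-way array; a minimal sketch on a hypothetical 3×3 array:

# Minimal sketch: residuals after fitting row and column means to a
# two-way array, as described above. The data matrix is hypothetical.
import numpy as np

y = np.array([[12.0, 15.0, 11.0],
              [14.0, 18.0, 13.0],
              [ 9.0, 12.0, 10.0]])

grand = y.mean()
row_eff = y.mean(axis=1, keepdims=True) - grand   # row effects
col_eff = y.mean(axis=0, keepdims=True) - grand   # column effects

# residual_ij = y_ij - row mean_i - column mean_j + grand mean
residuals = y - grand - row_eff - col_eff
print(residuals)
print(residuals.sum(axis=0), residuals.sum(axis=1))  # each sums to ~0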
Article
A multivariate generalization of Student's t-distribution is considered. The bivariate case is treated in detail; exact and asymptotic expressions for the probability integral and an asymptotic expression for certain percentage points are obtained. The main results for the bivariate case are given as equations (10), (11), (23) and (30) below; these equations are used to construct tables for certain special cases.
Article
An analysis of the frequency distribution of local lesions, produced by viruses on half‐leaves of a number of plants, shows that their standard error increases with increasing mean. Hence analysis of variance and statistical tests of significance should not be applied to lesion numbers unless they are suitably transformed. The transformation into logarithms over‐corrects, so that the standard error decreases with increasing mean. A satisfactory transformation is y = log10(x + c), where x is the number of lesions and c is a constant. A method is given of assessing c for different experiments. Great accuracy is not needed; in an experiment discussed in detail a satisfactory transformation is obtained with any value for c between 15 and 80. On individual plants the numbers of lesions formed on half‐leaves are distributed more or less normally, whereas their distribution about the common mean for many plants is skew and ‘leptokurtic’. The distribution of the transformed numbers is almost normal, both for individual plants and about the common mean for a number of plants.
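The paper's own method of assessing c is not reproduced here, but a simple data-driven stand-in is to scan candidate values of c and keep the one that makes log10(x + c) least skewed. A minimal sketch on simulated skewed counts follows; the count model and range of c are assumptions.

# Minimal sketch: choosing c in y = log10(x + c) by scanning for the value
# that minimises the skewness of the transformed counts. This is a generic
# heuristic, not the assessment method given in the paper itself.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.negative_binomial(n=2, p=0.05, size=400)  # skewed lesion-like counts

candidates = np.arange(1, 101)
skews = [abs(stats.skew(np.log10(x + c))) for c in candidates]
best_c = candidates[int(np.argmin(skews))]
print(f"chosen c: {best_c}")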
Theory of Probability
  • H. Jeffreys