Figure 1. The condition number as a function of f.
Figure 2. Condition numbers vs. different values of p for y = β_0 + β_1 X_1 + β_2 D + ε.
Table 2. Regression analysis output for y = 23 + 1.5 X_1 + 3 X_2 + 0.5 D + ε.
Figure 3. Boxplot of 100 condition numbers for p = 0.95.
Figure 4. Results for the different values of p for y = β_0 + β_1 X_1 + β_2 D_A + β_3 D_B + ε.


Role of categorical variables in multicollinearity in linear regression model
January 2014


M. Wissmann



H. Toutenburg

The present article discusses the role of categorical variables in the problem of multicollinearity in the linear regression model. It applies the condition number, a standard diagnostic tool, to linear regression models with categorical explanatory variables and analyzes, both analytically and numerically, how the dummy variables and the choice of reference category can affect the degree of multicollinearity.
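To make the effect of the reference category concrete, the following minimal sketch (not the authors' code) computes the condition number for the simplest case: an intercept plus a single dummy variable D. The scaling of each column to unit length, the sample size of 100, the function name, and the restriction to one dummy are assumptions made for illustration; p = 0.95 mirrors the value named in the Figure 3 caption.

```python
import math

def condition_number_intercept_dummy(dummy):
    """Condition number of the design [1, D] after scaling columns to unit length.

    Assumes `dummy` contains at least one 1 and at least one 0.
    """
    n = len(dummy)
    columns = [[1.0] * n, [float(d) for d in dummy]]

    # Scale each column to unit Euclidean length (an assumed convention).
    scaled = []
    for col in columns:
        norm = math.sqrt(sum(v * v for v in col))
        scaled.append([v / norm for v in col])

    # Entries of the symmetric 2x2 cross-product matrix [[a, b], [b, d]].
    a = sum(v * v for v in scaled[0])
    d = sum(v * v for v in scaled[1])
    b = sum(u * v for u, v in zip(scaled[0], scaled[1]))

    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    lam_max, lam_min = mean + radius, mean - radius

    # Condition number: square root of the eigenvalue ratio.
    return math.sqrt(lam_max / lam_min)

# Illustrative data: 95 of 100 observations fall in the category coded 1.
dummy = [1] * 95 + [0] * 5
# Switching the reference category flips the coding of the dummy.
dummy_flipped = [1 - d for d in dummy]

kappa_high = condition_number_intercept_dummy(dummy)
kappa_low = condition_number_intercept_dummy(dummy_flipped)
print(kappa_high, kappa_low)
```

Under these assumptions the scaled cross-product matrix has off-diagonal entry √p, so the condition number equals sqrt((1 + √p) / (1 − √p)). It grows without bound as p approaches 1 and is smaller when the more frequent category is taken as the reference.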


Models for Categorical Response Variables

January 2010


Generalized linear models (GLMs) are a generalization of the classical linear models of regression analysis and analysis of variance, which model the relationship between the expectation of a response variable and unknown predictor variables according to

\begin{equation}
{\rm E}(y_i) = x_{i1}\beta_1 + \ldots + x_{ip}\beta_p = x^{\prime}_i \beta. \tag{8.1}
\end{equation}

The parameters are estimated according to the principle of least squares and are optimal according to the minimum dispersion theory or, in the case of a normal distribution, are optimal according to the ML theory (cf. Chapter 3).

Stein-Rule Estimation under an Extended Balanced Loss Function

October 2009


This paper extends the balanced loss function to a more general setup. The ordinary least squares and Stein-rule estimators are evaluated under this general loss function with quadratic loss structure in a linear regression model. Their risks are derived when the disturbances in the linear regression model are not necessarily normally distributed. The dominance of the ordinary least squares and Stein-rule estimators over each other and the effect of departures from the normality assumption of the disturbances on the risk property are studied.

Multifactor Experiments

September 2009


In practice, for most designed experiments it can be assumed that the response Y is not only dependent on a single variable but on a whole group of prognostic factors. If these variables are continuous, their influence on the response is taken into account by so-called factor levels. These are ranges (e.g., low, medium, high) that classify the continuous variables as ordinal variables. In Sections 1.7 and 1.8, we have already cited examples for designed experiments where the dependence of a response on two factors was to be examined.

The Linear Regression Model

September 2009


The main focus of this chapter will be the linear regression model and its basic principle of estimation. We introduce the fundamental method of least squares by looking at the least squares geometry and discussing some of its algebraic properties. In empirical work, it is quite often appropriate to specify the relationship between two sets of data by a simple linear function. For example, we model the influence of advertising time on the number of positive reactions from the public. From the scatterplot in Figure 3.1 one could suspect a linear function between advertising time (x-axis) and the number of positive reactions (y-axis). The study was done on 66 people in order to investigate the impact and cognition of advertising on TV.
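As a small illustration of the least squares principle and its geometry, the sketch below fits a straight line to two variables and checks that the residuals are orthogonal to both the constant and the regressor. The five data points are made up for illustration and are not the data of the 66-person study; the function and variable names are likewise hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Centered sum of squares and cross-products.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up illustrative data (NOT the study data referred to in the text).
advertising_time = [1.0, 2.0, 3.0, 4.0, 5.0]
positive_reactions = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(advertising_time, positive_reactions)

# Residuals of the fitted line.
residuals = [yi - (b0 + b1 * xi)
             for xi, yi in zip(advertising_time, positive_reactions)]

# Least squares geometry: residuals are orthogonal to the columns of the
# design matrix, i.e. to the constant column and to x.
sum_residuals = sum(residuals)
sum_x_residuals = sum(xi * ri for xi, ri in zip(advertising_time, residuals))
print(b0, b1, sum_residuals, sum_x_residuals)
```

Both sums are zero up to floating-point rounding, which is the algebraic counterpart of projecting the response onto the space spanned by the regressors.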

Comparison of Two Samples

September 2009


Problems of comparing two samples arise frequently in medicine, sociology, agriculture, engineering, and marketing. The data may have been generated by observation or may be the outcome of a controlled experiment. In the latter case, randomization plays a crucial role in gaining information about possible differences in the samples which may be due to a specific factor. Full nonrestricted randomization means, for example, that in a controlled clinical trial there is a constant chance of every patient getting a specific treatment. The idea of a blind, double blind, or even triple blind set-up of the experiment is that neither the patient, nor the clinician, nor the statistician knows what treatment has been given. This should exclude possible biases in the response variable, which would be induced by such knowledge. It becomes clear that careful planning is indispensable to achieve valid results. Another problem in the framework of a clinical trial may be a systematic effect on a subgroup of patients, e.g., males and females. If such a situation is to be expected, one should stratify the sample into homogeneous subgroups. Such a strategy proves to be useful in planned experiments as well as in observational studies.

Incomplete Block Designs

September 2009


In many situations the number of treatments to be compared is large. Then we need a large number of blocks to accommodate all the treatments and, in turn, more experimental material. This may increase the cost of experimentation in terms of money, labor, time, etc. The completely randomized design and the randomized block design may not be suitable in such situations because they require a large number of experimental units to accommodate all the treatments. When a sufficient number of homogeneous experimental units is not available to accommodate all the treatments in a block, incomplete block designs are used, in which each block receives only some and not all of the treatments to be compared. Sometimes the available blocks can only handle a limited number of treatments for several reasons. For example, suppose the effect of twenty medicines for a rare disease from different companies is to be tested on patients. These medicines can be treated as treatments. It may be difficult to get a sufficient number of patients having the disease to conduct a complete block experiment. In such a case, a possible solution is to have fewer than twenty patients in each block. Then not all of the twenty medicines can be administered in every block. Instead, a few medicines are administered to the patients in one block and the remaining medicines to the patients in other blocks. The incomplete block designs can be used in this setup. In another example, medical companies and biological experimentalists need animals to conduct their experiments to study the development of a new drug. Usually there is an ethics commission which studies the whole project and decides how many animals can be sacrificed in the experiment. Generally the limits prescribed by the ethics commission are not sufficient to conduct a complete block experiment.
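The twenty-medicine example can be sketched as a simple layout in which each block holds only a subset of the treatments. The block size of four, the labels M1 to M20, and the function name are assumptions chosen for illustration; the sketch only shows the bookkeeping and does not construct a balanced or randomized design.

```python
def assign_incomplete_blocks(treatments, block_size):
    """Split the treatments into consecutive blocks of at most `block_size`."""
    return [treatments[i:i + block_size]
            for i in range(0, len(treatments), block_size)]

# Hypothetical labels for the twenty medicines.
medicines = [f"M{i}" for i in range(1, 21)]

# Assumed block size: each block (group of patients) can take 4 medicines.
blocks = assign_incomplete_blocks(medicines, 4)

for number, block in enumerate(blocks, start=1):
    print(f"Block {number}: {block}")
```

In this layout every medicine appears in exactly one block, so differences between medicines in different blocks cannot be separated from block differences. Designs used in practice typically repeat treatments across blocks and randomize the allocation so that such comparisons become possible.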

Statistical Analysis of Incomplete Data

September 2009


A basic problem in the statistical analysis of data sets is the loss of single observations, of variables, or of single values. Rubin (1976) can be regarded as the pioneer of the modern theory of Nonresponse in Sample Surveys. Little and Rubin (1987) and Rubin (1987) have discussed fundamental concepts for handling missing data based on decision theory and models for the mechanism of nonresponse.

Single–Factor Experiments with Fixed and Random Effects

September 2009


The analysis of variance, which was originally developed by R.A. Fisher for field experiments, is one of the most widely used and one of the most general statistical procedures for testing and analyzing data. These procedures require a large amount of computation, especially in the case of complicated classifications. For this reason, these procedures are available as software.

Repeated Measures Model

September 2009


In contrast to the previous chapters, we now assume that instead of having only one observation per object/subject (e.g., patient) we have repeated observations. These repeated measurements are collected at exactly predefined times. The principal idea is that these observations give information about the development of a response Y. This response might, for instance, be the blood pressure (measured every hour) for a fixed therapy (treatment A), the blood sugar level (measured every day of the week), or the monthly training performance of sprinters for training method A, etc., i.e., variables which change with time (or a different scale of measurement). The aim of a design like this is not so much the description of the average behavior of a group (with a fixed treatment), but rather the comparison of two or more treatments and their effect across the scale of measurement (e.g., time), i.e., the treatment or therapy comparison. Before we deal with this interesting question, let us introduce the model for one treatment, i.e., for one sample from one population.

