Heinrich Heine University Düsseldorf
Question
Asked 5 August 2013
What is the best way to scale parameters before running a Principal Component Analysis (PCA)?
I am working on lake water chemistry parameters and am using the resulting factors in a multiple regression. Is there a difference between standardizing (to a mean of 0 and a SD of 1) and normalizing (log-transforming) the parameters to put them on the same scale?
Most recent answer
The easiest and most straightforward way is to use the ade4 package; its dudi.pca() function already has centering and scaling implemented. Just go through the vignette and you will find it.
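For illustration, a minimal sketch of that call (the data frame name chem is hypothetical and stands for a numeric table of water-chemistry variables):
library(ade4)
pca_cor <- dudi.pca(chem, center = TRUE, scale = TRUE,   # correlation-matrix PCA
                    scannf = FALSE, nf = 3)              # keep 3 axes, skip the interactive scree prompt
pca_cov <- dudi.pca(chem, center = TRUE, scale = FALSE,  # covariance-matrix PCA
                    scannf = FALSE, nf = 3)
pca_cor$eig      # eigenvalues
head(pca_cor$li) # row (sample) scores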
Popular answers (1)
Natural Resources Institute Finland (Luke)
Background:
PCA constructs orthogonal - mutually uncorrelated - linear combinations that (successively) explain as much of the common variation as possible. PCA can be based on the covariance matrix as well as on the correlation matrix, not only the latter. Scaling the data matrix so that all variables have zero mean and unit variance (also known as "normalizing", "studentizing", "z-scoring") makes the two approaches identical, because the covariance between two standardized variables *is* their correlation coefficient.
Should you log-transform?
If you have variables that only take positive values, such as length, weight, etc., and that show much more variation at higher values (heteroscedasticity), a log-normal distribution (i.e., normal after log-transformation) may describe the data clearly better than a normal distribution. In such cases I would log-transform before doing PCA. Log-transforming such variables makes the distributions more nearly normal and stabilizes the variances, but it also makes your model multiplicative on the raw scale instead of additive. That is of course the case for all types of linear models, such as t-tests or multiple regression, and is worth a thought when you interpret the results.
Should you scale the data to mean = 0, var = 1?
This depends on your study question and your data. As a rule of thumb, if all your variables are measured on the same scale and have the same unit, it might be a good idea *not* to scale the variables (i.e., PCA based on the covariance matrix). If you want to maximize variation, it is fair to let variables with more variation contribute more. On the other hand, if you have different types of variables with different units, it is probably wise to scale the data first (i.e., PCA based on the correlation matrix).
75 Recommendations
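To illustrate the covariance-versus-correlation point above, a minimal R sketch (the data frame name chem is hypothetical):
chem_z <- scale(chem)              # centre to mean 0 and scale to SD 1, column-wise
all.equal(cov(chem_z), cor(chem))  # TRUE: the covariance matrix of z-scored data is the correlation matrix
pca_cor <- prcomp(chem, center = TRUE, scale. = TRUE)   # PCA on the correlation matrix
pca_cov <- prcomp(chem, center = TRUE, scale. = FALSE)  # PCA on the covariance matrix
summary(pca_cor)                   # compare the variance explained under the two choices
summary(pca_cov)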
All Answers (45)
nature reserve of Saint-brieuc bay
In a PCA the variables are typically centred and reduced (scaled): centring subtracts the mean from each variable, and reduction divides each variable's values by its standard deviation.
Usually no prior transformation is necessary for physico-chemical variables.
A log-transformation is generally used to reduce the impact of very high values.
1 Recommendation
University Hospital Würzburg
I agree with John Finarelli (+1), and I would add that variables may include noise that is important to consider. Following his example, if both skull length and tooth length were measured with a normally distributed additive error with a standard deviation of 1 millimetre, then rescaling all variables to the same standard deviation will treat teeth and skull as equally “important” despite the higher relative error in the size of the teeth.
University Hospital Würzburg
@M.Doğa Ertürk: John Finarelli suggested “a rescale to 0/1” – isn't this the same as Z-scores?
University of Florida
Many thanks for the helpful responses. For Gabor Borgulya - you mentioned data measured with a normally distributed error - does this mean that log-transforming the data to give it a more normal distribution before scaling is a good choice if my data are highly variable? I understand that logging data before a PCA can help with outliers as well.
Augusta University
Yes, you should log-transform the data so that it follows a normal distribution more closely. Scaling the data to zero mean and unit SD will make comparison between variables easier. @Gabor: yes, scaling = z-scores.
Museum für Naturkunde Magdeburg
In R you can define whether the calculation of principal components ('princomp') uses the covariance or correlation matrix.
My experience is that if you choose 'cor = TRUE' (the correlation matrix) rather than the default (the covariance matrix), you will get a result that does not reflect differences in the variances of the considered variables as strongly.
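A minimal sketch of those two princomp() options (the data frame name x is hypothetical):
pc_cov <- princomp(x)              # default: covariance matrix, high-variance variables dominate
pc_cor <- princomp(x, cor = TRUE)  # correlation matrix, variables enter on a comparable footing
summary(pc_cov)                    # compare the proportions of variance explained
summary(pc_cor)
loadings(pc_cor)                   # inspect how each variable contributes to the components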
University of Ljubljana
One way is to scale against the biggest value, i.e. obtaining a 0-1 scale.
If possible, I would propose scaling to a "landmark" value, i.e. the one you expect to be most informative, instead of scaling against the largest value (which may not even be known beforehand).
But the best scaling, for me, is when you scale according to some other variable(s) so that the new variable is dimensionless. The principle is known in physics as the Pi theorem. E.g., for predicting the activity of a chemical compound (structure-activity relationship), the count (or mass) of hydrogen atoms in the compound relative to the other atoms (the mass of the compound) turns out to be extremely important!
STRS Consultant Services
PCA uses a correlation matrix, so scaling/transforming variables will make no difference to the output...
6 Recommendations
STRS Consultant Services
However, if you are using something like MDS or nMDS, then standardising/transforming your data will make a difference and is often necessary to avoid variables having undue bias on the output (ordination plot).
1 Recommendation
PCA: The goal of principal components analysis is to reduce an original set of variables to a smaller set of uncorrelated components that represent most of the information found in the original variables. PCA is most useful when there is a large number of variables.
3 Recommendations
University of Miami
The best approach is to log-transform and then standardize the column-wise data to a mean of zero and an SD of 1 to remove the scale effect of the variables.
3 Recommendations
University of Miami
In this way it is equivalent to doing a PCA on the correlation matrix, because the variance-covariance matrix and the correlation matrix are equivalent in this format.
1 Recommendation
REM Analytics
For PCA to work well, the data need to follow the same (or approximately) distribution. Do you have good reason to believe your data is normal, or following a known distribution?If such is the case,each distribution can be rescaled by it's parameters in an ad-hoc manner. Otherwise i advise applying the Inverse Empirical Distribution Function to each of your variables.
The procedure is as follows: Rank each data set (ie: sort them in ascending order and record their position on the ladder). Replace the value of each by the rank divided by the total number of observation.
The result is that the distribution of each variable will converge in probability towards a uniform distribution. This is independent of the original distribution they followed, and is true no matter what.
3 Recommendations
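A minimal R sketch of that rank-based transform (the data frame name dat is hypothetical):
to_uniform <- function(x) rank(x, ties.method = "average") / length(x)  # empirical CDF values in (0, 1]
dat_u <- as.data.frame(lapply(dat, to_uniform))                         # each column now roughly uniform
pca_u <- prcomp(dat_u, center = TRUE, scale. = FALSE)                   # all columns already share one scale
summary(pca_u)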
University of Miami
Paulo's idea is reasonable; alternatively, use the Box-Cox variable-transformation procedure to bring all the variables to approximate normality after transformation.
2 Recommendations
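One hedged sketch of choosing a Box-Cox exponent in R with MASS::boxcox (the variable y is hypothetical and must be strictly positive):
library(MASS)
bc <- boxcox(y ~ 1, lambda = seq(-2, 2, 0.1), plotit = FALSE)        # profile log-likelihood over lambda
lambda <- bc$x[which.max(bc$y)]                                      # lambda that maximises the likelihood
y_bc <- if (abs(lambda) < 1e-6) log(y) else (y^lambda - 1) / lambda  # apply the chosen transformation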
There is NO best way to "scale parameters before running a Principal Component Analysis (PCA)". Data pretreatment is problem dependent. Statisticians insist on transforming and scaling variables to get a normal distribution. However, this is not mandatory; chemists often refrain from standardization because the noise is over-weighted and the signal (peak) is down-weighted. A short summary of scaling and transformations can be found in R. van den Berg et al., BMC Genomics 2006, 7:142, DOI 10.1186/1471-2164-7-142 (open access).
5 Recommendations
Augusta University
@Paulo, I would appreciate if you can provide working code in R for the following example:
df<-structure(list(tnfr1 = c(16.808949430985, 8.97282518075153, 13.9296710360061,
15.2550597949848, 19.2169754657571, 5.7672449714808, 9.42840415098761,
17.5790575318415, 20.762967529801, 9.86246431992298, 4.79511213965356,
7.75947097954384, 14.3768170170218, 4.07683696421655, 12.810879414004,
14.6976323292678, 21.1304038206973, -7.35913458523375, 1.70961262426819,
2.14550869824633, 16.808949430985, 8.97282518075153, 13.9296710360061,
15.2550597949848, 19.2169754657571, 5.7672449714808, 9.42840415098761,
17.5790575318415, 20.762967529801, 9.86246431992298, 4.79511213965356,
7.75947097954384, 14.3768170170218, 4.07683696421655, 12.810879414004,
14.6976323292678, 21.1304038206973, -7.35913458523375, 1.70961262426819,
2.14550869824633), tnfr2 = c(-3.19198230202612, -6.36674975435633,
-0.469231836823039, 5.31396178440631, 6.35390439587416, 7.64604296315032,
-0.413060587484212, 11.7759570662747, -1.09874532764231, 10.4619763458391,
-1.00981294815041, -2.15558309552645, 14.4928066978267, 2.5806328915248,
7.71646233463449, 5.67060559508385, -0.903961568835792, 7.17852127012057,
1.26594674918207, 9.91392403988436, -3.19198230202612, -6.36674975435633,
-0.469231836823039, 5.31396178440631, 6.35390439587416, 7.64604296315032,
-0.413060587484212, 11.7759570662747, -1.09874532764231, 10.4619763458391,
-1.00981294815041, -2.15558309552645, 14.4928066978267, 2.5806328915248,
7.71646233463449, 5.67060559508385, -0.903961568835792, 7.17852127012057,
1.26594674918207, 9.91392403988436), dm = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("N", "Y"), class = "factor")), .Names = c("tnfr1",
"tnfr2", "dm"), row.names = c(NA, -40L), class = "data.frame")
1 Recommendation
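Not speaking for Paulo, but one possible R sketch of his rank-based rescaling applied to this example (the helper name to_uniform is made up):
to_uniform <- function(x) rank(x, ties.method = "average") / length(x)  # rank / n, as described above
num_cols <- c("tnfr1", "tnfr2")                    # dm is a factor, so it stays out of the PCA itself
df_u <- as.data.frame(lapply(df[num_cols], to_uniform))
pca <- prcomp(df_u, center = TRUE, scale. = FALSE)
summary(pca)                                       # variance explained per component
biplot(pca)                                        # points can then be labelled/coloured by df$dm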
University of Buenos Aires
Hi Gretchen
In a PCA on the raw data, eigenvalues may be extracted either from the covariance matrix or from the correlation matrix. The second option is "mandatory" for water-chemistry analysis if you have different types of variables with different units, i.e. the variables are measured on different scales, such as pH, conductivity, and nitrates. When PCA is based on the covariance matrix, a difference of e.g. 100 µS/cm in conductivity between samples (locations) carries far greater weight in the resulting ordination than the difference between pH 4 and 9 in the same samples, or differences between 0.5 and 8 mg/L in dissolved oxygen (i.e. between anoxia and oversaturation).
If all your variables are measured on the same scale and have the same unit, e.g., mg/L, you can use either the correlation matrix or the covariance matrix, or both. The ordination obtained in a PCA based on the correlation matrix results from the differences in the chemical species composition, since the data are standardized to variance 1. The ordination obtained in a PCA based on the covariance matrix results from the differences in the chemical species absolute concentrations. E.g., if the differences in ammonia concentration between 2 samples are five times greater than in nitrate, nitrite and phosphate concentrations, the ordination would reflect the ammonia difference in the first factor, even if the species composition in the samples is reversed, e.g., the limiting nutrient could be PRS in one sample and DIN in the other sample. To avoid dominant species taking over the analysis, you can log-transform the raw data to down-weight the larger values before extracting eigenvalues from the covariance matrix. This makes no sense when based on the correlation matrix.
Thus, as the information resulting from the two analyses is different, you can perform both and analyze the resulting ordinations considering both species composition and absolute concentrations.
No assumptions of data normality are needed for a PCA if you don't want to validate the axes using ANOVA techniques.
I don't know how you are using the resulting factors in a multiple regression; have you considered using direct ordination methods, such as RDA (linear) or CCA (unimodal), instead of this indirect ordination method?
Best regards
4 Recommendations
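If direct ordination is worth exploring, a hedged sketch with the vegan package (the data frames chem for the chemistry variables and env for the explanatory variables, and the predictor names, are hypothetical):
library(vegan)
rda_fit <- rda(chem ~ depth + catchment_area, data = env, scale = TRUE)  # linear, correlation-based
cca_fit <- cca(chem ~ depth + catchment_area, data = env)                # unimodal
anova(rda_fit, permutations = 999)   # permutation test of the constrained axes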
Hormozgan University of Medical Sciences
I think there is no need to scale the parameters, but if you want to analyze the subscales of a questionnaire, it is better to have the same scale for each domain to obtain a better result.
1 Recommendation
It depends on the data you have. PCA uses the first, second, ... components to explain the variation in your data, so if you are using PCA you are obviously interested in explaining variation. Check the variances from the covariance matrix, or just use the correlation matrix to standardize. It depends on your data.
Griffith University
Centering, scaling, and transformations: improving the biological information content of metabolomics data
Robert A van den Berg, Huub CJ Hoefsloot, Johan A Westerhuis, Age K Smilde, Mariët J van der Werf
Harper Adams University
I'm working in the same area and have been running PCA with axis rotation on my centred, SD = 1 data (a couple of my variables are in the hundreds while others range from negative decimals to positive decimals). The PCA has identified a certain set of parameters out of 45 that explain 75% of the variance. Super. I do a multiple linear regression using those variables and get an adjusted R-squared of 0.5656. However, if I put all my variables into the regression and then omit any variables without a significant p-value in the regression summary, I get a completely different set of variables and the adjusted R-squared goes up to 0.65 - so better if you're trying to improve your R-squared value. So what do I do next? Do I take the better adjusted R-squared and the variables that produce it, or do I follow the outcomes of the PCA?
Thanks in advance...I've been looking for any papers on this and I'm drawing a blank...
Koya University
Hello,
Normalization means casting a data set to a specific range such as [0, 1] or [-1, +1]. Why do we do that? To eliminate the influence of one factor (feature) over another. For example, suppose the amount of olives produced ranges from 5,000 to 90,000 tons, i.e. the range is [5000, 90000] tons, while the temperature ranges from -15 to 49 °C, i.e. the range is [-15, 49]. These two features are not in the same range, so you cast both of them into the same range, say [-1, +1]; this eliminates the influence of production over temperature and gives equal chances to both of them.
In addition, the gradient descent algorithm (GDA), which is the backpropagation algorithm used in neural networks, converges faster with normalized data.
If all features already lie in the same range, then no normalization is required. One drawback of normalization is when the data contain outliers (anomalies), because most of the data are then squeezed into a very small range and only the outliers lie near the boundaries.
The Z-score is a standardization method also used for scaling data, and it is useful for data containing outliers. It transforms the data to have zero mean and a standard deviation of 1.
10 Recommendations
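A small R sketch of the two scalings just described (the data frame name dat is hypothetical):
minmax <- function(x) (x - min(x)) / (max(x) - min(x))  # casts each variable to [0, 1]
dat_01 <- as.data.frame(lapply(dat, minmax))
dat_pm1 <- 2 * dat_01 - 1                               # rescale to [-1, +1] if preferred
dat_z <- scale(dat)                                     # z-scores: mean 0, SD 1, less sensitive to outliers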
The National Museum of Natural Sciences-CSIC
Hi, I have values of differences in the frequency of behaviors between two opponents, so sometimes I have negative values. How can I scale (or treat) my data before running a PCA?
1 Recommendation
STRS Consultant Services
It doesn't really matter that much with PCA, as it works off a correlation matrix, which standardises the data anyway.
4 Recommendations
The National Museum of Natural Sciences-CSIC
Thank you for your answer. If I understand well, I don't need to scale and/or transform my data? For my last PCA, with positive frequency values, I applied behaviors <- scale(log(behaviors + 1)), but this time I need to use the negative values.
University of Vienna
As Matthew stated, PCA works with correlations, so there is no urgent need for transformations (except if you have outliers, e.g. a few very large or very small values compared to the other values in the same variable). Negative values are no problem. The main problem you might run into is non-linear relations between variables. It is a good idea to check the communality values for each variable after the PCA; they tell you how well each variable is represented in the PCA (1 = perfect, 0 = not at all).
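One hedged way to compute those communalities after a correlation-matrix PCA in R (the data frame name dat and the choice of k are hypothetical):
pca <- prcomp(dat, center = TRUE, scale. = TRUE)
k <- 2                                                # number of retained components
load <- pca$rotation[, 1:k] %*% diag(pca$sdev[1:k])   # loadings: correlations between variables and PCs
communality <- rowSums(load^2)                        # variance of each variable reproduced by the k PCs
round(communality, 2)                                 # 1 = perfectly represented, 0 = not at all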
Loughborough University
If you mean-centre your data, the mean becomes 0; mean-centring can reduce the covariance between linear and interaction terms (see the following paper: http://pubsonline.informs.org/doi/abs/10.1287/mksc.1060.0263). Mean-centring is recommended if all your data are in the same units and is typically used to minimise collinearity, although whether this is actually achieved is debatable. The other alternative is autoscaling, which is used when you have several variables with different units. Autoscaling first mean-centres the data and then divides by the standard deviation, giving a mean of 0 and a standard deviation of 1. It is useful for obtaining information on the covariance of the variables (it is used a lot in genomics to classify genes). So which method you apply depends on the data you have, but look at some literature on both to decide on the best option for your purpose.
2 Recommendations
Christ University
Before running the input data through PCA, you should first standardize the input variables you are going to use, because the input data will generally be in different units of measurement. To get a reliable composite value, standardize the variables first. If all the values are already on the same scale, it is not necessary to standardize the data. You can also take the log transform of all the data before using PCA.
1 Recommendation
Federal University of Ceará
Sometimes one variable has a scale very different from the others (a concentration of 100-1000 mg versus 0.01-0.1 mg). When you want to test the influence of these variables on a biological process (let's say bacterial growth rate), the variable with the highest numerical value will have more weight in the model simply because its numerical value is greater. You can rescale all your variables to range between 0 and 1, changing the variables' values but keeping the same proportion between the values; all variables then have the same potential weight in the PCA. This procedure is called "ranging", and to achieve it you divide each variable by the largest value that variable takes in your dataset.
6 Recommendations
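A one-line R version of that "ranging" procedure (the data frame name dat is hypothetical and assumed to hold positive values):
m <- as.matrix(dat)
dat_ranged <- sweep(m, 2, apply(m, 2, max), "/")  # divide each column by its column maximum
range(dat_ranged)                                 # values now lie in (0, 1]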
University of Texas at Austin
I think it should be centered for sure (e.g. Z-scored) across variables. BUT, if you have more than one experimental group, you should compute the Z-score not on the overall population but on the control group. This is especially true if you have groups of different sizes (so one would weigh more than another).
1 Recommendation
University of Liverpool
Michela (Micky) Marinelli, I read your comment and I disagree with your statement. I think that if you scale based on only one of the groups, you will introduce bias into your outputs. The whole point of mean-centering is to allow all variables to have the same weight independently of their magnitude, so we can fully appraise the changes with respect to the mean trend of each variable. I may have missed something, so could you explain the rationale for your answer a bit more? Do you have any links to further literature we could consult?
3 Recommendations
Texas A&M University
Standardization is an advisable method of data transformation, especially when the variables in the original dataset are measured on significantly different scales.
1 Recommendation
Ethiopian Institute of Agricultural Research
This is to add one question to Andreas Lindén's answer. I appreciate your answer, but what if the variables are measured in the same unit, for example in percent, yet the percentage levels vary considerably among the variables?
2 Recommendations
University of Anbar
Dear Gretchen Lescord
Use the correlation or covariance matrix to obtain the eigenvectors and eigenvalues.
4 Recommendations
Ethiopian Institute of Agricultural Research
Dear Andreas Lindén,
Would you please elaborate on what you mean by "If you want to maximize variation, it is fair to let variables with more variation contribute more"?
2 Recommendations
I recently had a problem like that with physics data.
Many scientists (not all) very often show data for a quantity scaled to its peak value: they divide every value by the peak value observed in their data.
This type of normalization destroys much of the information embedded in the data.
The normalized quantity is often distributed as a Gaussian probability density function (pdf). That is why a Gaussian distribution centered on 1 is called a normal distribution.
An equivalent procedure maximizes the correlation matrix of the data. The correlation matrix contains the data normalized to the peak values of each quantity. The principal components are often distributed as a normal pdf (i.e., a normal distribution).
The investigator often maximizes the correlation about the mean to find his 'principal components'.
A correlation matrix is often used when the different quantities have different dimensions (e.g., units). If you have some data in newtons and some in grams, then you can get rid of the units merely by dividing by the peak value of whatever it is. However, some useful information often remains after the division by the peak.
However, I did something different.
Instead of normalizing to a peak value, I determined dimensionless parameters using 'universal' constants. I would cancel out the original units by taking ratios of one quantity relative to another. I divided by constants and quantities that are part of 'universal' laws of nature.
There is more than one way to form a dimensionless quantity. I chose dimensionless quantities based on the physics of the situation.
So instead of working directly with width, depth, and lifetime, I calculated the Grashof number and the Fourier number. I observed that these so-called 'constants' were actually dynamic variables: they changed during the experiment.
So I maximized covariance around the mean using dimensionless quantities rather than normalized quantities. Since all the parameters lacked units, I was able to use a covariance matrix instead of a correlation matrix.
So I formed covariance matrices with these dimensionless units. Of course, the correlation matrix just had the original data normalized to the peak.
I maximized correlation around the origin, correlation around the mean, covariance around the origin, and covariance around the mean. Each of these corresponds to a different way of 'weighting' the data.
There was something interesting: whereas the correlation matrix was consistent with a normal (i.e., Gaussian) pdf, the covariance matrix was consistent with a lognormal pdf.
When I plotted the eigenvalues of the correlation matrix, the Scree diagram did not make sense. When I plotted the fractional variance, the Scree diagram made a lot of sense.
Now, I know biologists don't work with the same equations that physicists use. However, they do take ratios. An animal's length divided by the length of the largest animal is a normalized parameter. An animal's length divided by its width is a dimensionless parameter.
The length of an animal divided by its width is the more meaningful quantity: it correlates well with the animal's age. The ratio of an animal's length to the longest length in the sample is not very meaningful.
So I suggest that, instead of normalizing to the peak, you form dimensionless parameters. Look for any equations in biology that you think relevant, and use the constants in those equations to cancel out the units.
2 Recommendations
University of Saskatchewan
Hi. I am working on a dataset containing more than 100 features. Most of these features are anthropometric measurements such as waist circumference, hip circumference, and height in millimetres, plus weight in kg.
I have selected 20 of these features based on the problem I am working on, and now I want to use PCA to find the most important features among these 20 and reduce them to a maximum of 8, since PCA is one method of feature selection in clustering analysis (an unsupervised task).
I have applied the Shapiro-Wilk test to all these features and found that they are not normally distributed, as the p-value for most of them is less than 0.05 and there is some skewness in the right tail.
My question is: what pre-processing or feature-engineering steps should I apply to my dataset before applying PCA to these 20 features? Is scaling the dataset with StandardScaler() the only thing I should apply to these skewed features,
or is log-transforming a better solution before using PCA?
I did read the comment you put under this question, where you mentioned that log-transformation can help when the units are the same. But for me the units are not the same: I have distance measurements and also weight in kg.
Thank you so much