Statistics

4
What does this warning message tell me (in R package quint) and what is 'gammafn'?

Hi all,

I am working with the R package 'quint' to test for qualitative treatment-subgroup interactions (personalized medicine). Every time I analyze my data I get some warnings that I cannot handle. All warnings are of the same sort. The main problem is that I do not understand exactly what the warning means:

Warning messages:
1: In computeD(tmat[kk, 1], tmat[kk, 2], tmat[kk, 3], tmat[kk, 4], :
value out of range in 'gammafn'

Is anybody well versed in this R package, or is it perhaps a universal warning that you have encountered in another context? I do not know what 'gammafn' could be or why the value is out of range.

I appreciate any comments and ideas!

Best,

Brian

The warnings, especially when they appear often, should not be disregarded. It may happen that the results produced by the software are meaningless. Brian, are you sure that your computed results make sense?

First question: what is the role of the gamma function in the applied statistical model? Since a treatment-subgroup interaction is being tested, I guess that an F-test is calculated, and the probability distribution functions of F (and of chi-square) use gamma functions. Getting extreme arguments for these functions indicates that something is wrong with the data. Maybe you have some empty classes in the assumed linear model, maybe too few degrees of freedom, or maybe the (non-zero) data have a variance of zero? Maybe there are too many parameters to estimate and too few data?

In any case, before the results can be considered trustworthy, the reason for the warnings should be clarified.

Sorry, this does not sound optimistic.

Anna
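
For what it is worth, 'gammafn' appears to be the internal C routine behind R's gamma() function, and this warning is raised when the result cannot be represented as a double. The sketch below is in Python rather than R and is not taken from quint; it only illustrates the overflow and the usual workaround of computing on the log scale. The function name gamma_ratio is made up for the illustration.

```python
import math

# math.gamma overflows double precision once its argument passes roughly 171.6,
# which is the same kind of range problem the R warning reports.
try:
    math.gamma(172.0)
    overflowed = False
except OverflowError:
    overflowed = True

def gamma_ratio(a, b):
    """Gamma(a) / Gamma(b), computed on the log scale to avoid overflow."""
    return math.exp(math.lgamma(a) - math.lgamma(b))

print(overflowed)
print(gamma_ratio(200.0, 198.0))  # mathematically Gamma(200)/Gamma(198) = 199 * 198
```

If the arguments passed to the gamma function grow with subgroup sizes or degrees of freedom, very large or very small subgroups would be one plausible trigger, but that is a guess that needs checking against the data.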

17
Which measure of inter-rater agreement is appropriate with diverse, multiple raters?
I want to calculate and quote a measure of agreement between several raters who rate a number of subjects into one of three categories. The individual raters are not identified and are, in general, different for each subject. The number of ratings per subject varies between subjects from 2 to 6.
In the literature I have found Cohen's kappa, Fleiss' kappa and a measure 'AC1' proposed by Gwet. So far, I think that Fleiss' measure is the most appropriate, although he derives it assuming that the number of ratings per subject is the same for all subjects.
The Gwet measure AC1 is supposed to deal with the apparent 'paradox' of low agreement values despite a large percentage agreement. Unfortunately, I do not understand the derivation of this measure.
I would be grateful for any comments and suggestions, particularly on the appropriateness of Fleiss Kappa and the soundness of Gwet's AC1.
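
A minimal sketch of how both coefficients can be computed when the number of ratings differs between subjects. It assumes the formulation in which observed agreement is the share of agreeing rater pairs per subject, and category prevalence is the average within-subject share; this is how I understand Gwet's presentation, so please check it against the original papers before quoting results. The function name and the example counts are made up.

```python
def agreement_stats(counts):
    """counts[i][k] = number of raters who put subject i in category k.
    Subjects may have different numbers of ratings (at least 2 each)."""
    n = len(counts)
    q = len(counts[0])
    # Observed agreement: proportion of agreeing rater pairs, averaged over subjects.
    pa = sum(
        sum(c * (c - 1) for c in row) / (sum(row) * (sum(row) - 1))
        for row in counts
    ) / n
    # Category prevalence: average within-subject share of each category.
    pi = [sum(row[k] / sum(row) for row in counts) / n for k in range(q)]
    pe_fleiss = sum(p * p for p in pi)                 # chance agreement, Fleiss-type
    pe_gwet = sum(p * (1 - p) for p in pi) / (q - 1)   # chance agreement, AC1
    kappa = (pa - pe_fleiss) / (1 - pe_fleiss)
    ac1 = (pa - pe_gwet) / (1 - pe_gwet)
    return pa, kappa, ac1

# Four subjects, three categories, 2 to 3 ratings each (invented numbers).
example = [[2, 0, 0], [1, 1, 0], [3, 0, 0], [2, 1, 0]]
pa, kappa, ac1 = agreement_stats(example)
print(pa, kappa, ac1)
```

When one category dominates, the Fleiss-type chance term is large and kappa can turn negative even though most rater pairs agree, while AC1 stays positive; that is the 'paradox' in miniature. The kappa line divides by zero if every rating falls in a single category.
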
• Ciaran O'Boyle asked a question:
What is your experience of the POSAS score?

I've been trying to use the POSAS score for evaluation of burn scar reconstruction. However, I find the Patient's part difficult to use. The patients all seem to become a bit embarrassed when trying to put scores on their symptoms. I'm left with the impression that they are all under-scoring their problems. This might be OK statistically, since if everything is under-scored, then trends will still be valid, but has anyone else experienced this? More to the point, has anyone overcome this problem?

5
Is there a way in statistics to treat multiple variables as one?

For example, to illustrate the development of ICT, there are six different indicators.

How can I combine these six indicators into one single variable in order to use it in further statistical tests?

Is there a way?

If anyone has knowledge of statistics, it would be great if you could help me.

Thank you

Elahe, there are several approaches, but three are the most common for this situation. Which one you choose depends on your goal in the analysis, the nature of your data, and what you believe to be the relations among the variables.

Since you mention ICD-10, for indicators of a diagnosis, the simplest thing is to sum (or average) them and use that as your variable.

A somewhat more complex version of the same thing is principal component analysis, as has been suggested. Technically speaking, a PCA will give you the "first eigenvector" of the correlations among your variables. Practically speaking, it gives you a weighted sum that represents the greatest possible amount of information from your data that one score can. However, if your indicators are binary, like presence/absence, PCA doesn't always perform well. The other caution is that it optimizes for your dataset and won't necessarily generalize well.

The third is factor analysis, variations of which can accommodate binary indicators, but I think that's getting more complex than you want.

Pat
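
The first option (sum or average) can be sketched as below. Standardizing each indicator before averaging is an assumption added here so that indicators on different scales contribute equally; the function name and the numbers are made up.

```python
from statistics import mean, stdev

def composite_score(indicators):
    """indicators: list of equal-length lists, one per indicator.
    Returns one score per case: the average of the z-scored indicators."""
    z = []
    for values in indicators:
        m, s = mean(values), stdev(values)
        z.append([(v - m) / s for v in values])
    return [mean(case) for case in zip(*z)]

# Two invented indicators measured on four cases.
scores = composite_score([[1, 2, 3, 4], [10, 30, 20, 40]])
print(scores)
```

An indicator with zero variance would cause a division by zero, and indicators where a high value means "worse" should be reversed before they go in.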

1
Which test should be used to find a significant difference in a non-Gaussian data set?

Statistician, Data Analyst

First, plot the data with a quantile-quantile plot against the normal distribution (so in R, qqnorm(x);qqline(x)). Deviations are easier to see this way than in a histogram.

The Shapiro-Wilk test is often used, see

https://stat.ethz.ch/R-manual/R-devel/library/stats/html/shapiro.test.html

Other packages will have similar abilities (and in R there are more packages/functions if your needs are more specific).

Sometimes people estimate skewness and kurtosis and see whether these differ from the values expected under normality. However, these are only two characteristics of the distribution, and estimating their standard errors is problematic (see attached).

• Source
Article: Problematic standard errors and confidence intervals for skewness and kurtosis
ABSTRACT: Many statistics packages print skewness and kurtosis statistics with estimates of their standard errors. The function most often used for the standard errors (e.g., in SPSS) assumes that the data are drawn from a normal distribution, an unlikely situation. Some textbooks suggest that if the statistic is more than about 2 standard errors from the hypothesized value (i.e., an approximate value for the critical value from the t distribution for moderate or large sample sizes when α = 5%), the hypothesized value can be rejected. This is an inappropriate practice unless the standard error estimate is accurate and the sampling distribution is approximately normal. We show distributions where the traditional standard errors provided by the function underestimate the actual values, often being 5 times too small, and distributions where the function overestimates the true values. Bootstrap standard errors and confidence intervals are more accurate than the traditional approach, although still imperfect. The reasons for this are discussed. We recommend that if you are using skewness and kurtosis statistics based on the 3rd and 4th moments, bootstrapping should be used to calculate standard errors and confidence intervals, rather than using the traditional standard. Software in the freeware R for this article provides these estimates.
Full-text · Article · Feb 2011 · Behavior Research Methods
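
The abstract recommends bootstrapping the standard error of skewness. The article's own software is in R; the following is only a rough Python sketch of the idea, with made-up data and function names, not the authors' code.

```python
import random
from statistics import mean, stdev

def skewness(x):
    """Moment-based sample skewness: m3 / m2**1.5."""
    m = mean(x)
    m2 = mean([(v - m) ** 2 for v in x])
    m3 = mean([(v - m) ** 3 for v in x])
    return m3 / m2 ** 1.5

def bootstrap_se(x, stat, n_boot=2000, seed=1):
    """Standard deviation of `stat` over resamples drawn with replacement."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_boot):
        sample = [rng.choice(x) for _ in x]
        if len(set(sample)) > 1:      # skip degenerate resamples with zero variance
            reps.append(stat(sample))
    return stdev(reps)

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 12, 20]   # invented, right-skewed
print(skewness(data), bootstrap_se(data, skewness))
```

As the abstract notes, bootstrap standard errors are more accurate than the traditional formula but still imperfect, so treat the result as approximate.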
• Thanyani Rambane asked a question:
When doing seasonal adjustments how do you remove residuals from the aggregate of components that have no residuals?
1
What is the best approach to treating Don't know responses when re-scaling a continuous variable?

To illustrate this, let me use an example. I have the question "How much do you trust your President?" The answer choices are 1-Not at all, 2-Just a little, 3-Somewhat, 4-A lot, 5-Don't know. When re-scaling, one analyst might recode this as 1-Not at all, 2-Just a little, 3-Don't know, 4-Somewhat, 5-A lot. Another analyst might use 1-Not at all/Don't know, 2-Just a little, 3-Somewhat, 4-A lot. Every analyst has a distinct way of doing this, but surely there must be a clear line we can all follow? Personally, I always code Don't know responses as system missing; however, I lose a lot of cases along the way. The bottom line: what is the best way of treating Don't know responses that is statistically defensible and does not violate any statistical principle?

Hello Ronald

With regards

Thawhidul Kabir

22
Is it better to use the "mean" or "median" in describing central value/tendency of a given population/sample?
I want to know which of the two, "mean" or "median", more appropriately reflects the central value of a given sample/population.

Thanks so much for the respective opinions shared. Your points are well noted and very much appreciated

5
Is there any difference in the effect size calculation (Cohen's d) for within groups and between groups?

Do you know if I can use the same formula (Cohen's d) for the calculation of within groups effect size (pre-post treatment of one group) and between groups effect size (post-treatment of two groups)?

The formula that I know is proposed by Durlak (2009, How to select, calculate and interpret effect sizes).

Thank you! =)

http://journal.frontiersin.org/article/10.3389/fpsyg.2013.00863/abstract
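
To make the distinction concrete, here is a sketch of two commonly used variants: the between-groups d with a pooled standard deviation, and a within-group d based on change scores (often written d_z). Whether these match the exact formulas in Durlak (2009) is not checked here; the function names are made up.

```python
from math import sqrt
from statistics import mean, stdev, variance

def cohens_d_between(g1, g2):
    """Difference in means divided by the pooled standard deviation."""
    n1, n2 = len(g1), len(g2)
    pooled_var = ((n1 - 1) * variance(g1) + (n2 - 1) * variance(g2)) / (n1 + n2 - 2)
    return (mean(g1) - mean(g2)) / sqrt(pooled_var)

def cohens_d_within(pre, post):
    """d_z: mean change divided by the SD of the change scores (paired data)."""
    diffs = [b - a for a, b in zip(pre, post)]
    return mean(diffs) / stdev(diffs)

print(cohens_d_between([2, 4, 6], [1, 3, 5]))
print(cohens_d_within([1, 2, 3, 4], [2, 4, 4, 6]))
```

The two numbers are not on the same scale: d_z depends on how strongly pre and post scores correlate, so a within-group d should not be compared directly with a between-groups d.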

11
Has anyone used PERMANOVA as a univariate test?
I have seen papers using PERMANOVA as a univariate test (for example, for comparing richness among sites or for comparing a species' density among seasons). Is it possible? To me this does not sound correct, since Anderson (2001) describes PERMANOVA as 'A new method for non-parametric multivariate analysis of variance'.
I believe that PERMANOVA is essentially a multivariate test, am I correct?

Dear Eduardo,

PERMDISP tests the heterogeneity of dispersions among groups. This is conceptually similar, but not quite the same thing as heterogeneity of variances. Dispersions can vary both in extent and 'shape' of the multivariate cloud. The short answer is: not really. As per my earlier answer, you should not be using PERMANOVA for univariate comparisons in the first place, and if concerned about normal distributions, you should be looking at models with appropriate distributions rather than trying to shoe-horn non-normal or heterogeneous data into Gaussian models. Try log-linear or negative binomial models for count data, for example.

I should mention that CAP does not provide information on data 'spread'. As a constrained ordination method, it looks for the 'best' means of separating a priori specified groups, and as such does not allow any assessment of either overall or between-group variability. Anderson & Willis (2003) specifically recommend using unconstrained methods (such as PCO or MDS) for this.

Hope this helps,

Trevor

39
How to interpret factor scores from Exploratory Factor Analysis?
I've applied different factor extraction methods to a fairly small dataset (low-level features extracted from image content). The problem is the interpretation of the factor scores obtained, which range from negative to positive values with unknown minimum and maximum. I have read some handbooks, but they usually focus on how to conduct factor analysis and very rarely discuss how to interpret the output.

Hello everybody
I have conducted a factor analysis to extract latent factors leading to gender inequality in the workplace. I extracted 5 factors from 20 variables. Now I want to rank all the variables by their contribution to gender inequality. I have ranked the variables by factor score; is this a correct method for my objective?
Please suggest ways to do the ranking.

19
How can we run any regression if the error term is explicitly assumed to be non-normally distributed?

Please look in the link below, bottom of p. 8, where W is specified in several ways, not necessarily distributed normally.

There is an extensive literature dealing with your question that is summarized in

Wilcox (2012). Introduction to Robust Estimation and Hypothesis Testing. Elsevier.

99+
Which correlation coefficient is better to use: Spearman or Pearson?
I´m performing a correlational study of two temporal series of data in order to identify positive or negative correlations between them. Which correlation coefficient is better to use: Spearman or Pearson?

Dear Walter. Maybe you should look for a method that takes into account the fact that you have time series. Pearson (based on the values) and Spearman (based on the ranks) do not take the specific features of time series into account.
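
A small sketch of both coefficients, plus first differencing as one simple (and by no means the only) way to remove a shared trend before correlating two series. The helper names are made up, and differencing is an assumption added here, not something the reply above prescribes.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Average ranks (1-based), so ties share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def diff(series):
    """First differences of a series."""
    return [b - a for a, b in zip(series, series[1:])]

x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]          # monotone but non-linear in x
print(pearson(x, y), spearman(x, y))
```

For two trending series, compare pearson(a, b) with pearson(diff(a), diff(b)): a high correlation in levels that disappears after differencing usually reflects the common trend rather than a relationship between the series.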

5
Do I need to weight my questionnaire data?

We have commissioned a survey company to undertake a telephone survey for us (n=500). The sample has been recruited from people who have taken part in a large national survey (Health Survey England) which is conducted every year with a different randomly selected population sample. We are speaking specifically to people with more than one long-term health condition.

The questionnaire data has been collected and the survey company is planning to provide us with the dataset next week. They have asked us whether we would like them to weight the data (for an additional fee). I am not an expert on weighting large datasets, and am unsure whether this is necessary. I was wondering whether anyone has any thoughts on the matter or guidance on what I should be considering in this decision, or could point me to any resources that might help me make it.

We would like the data to be broadly generalisable to this population. The sample has been recruited from 3 different years worth of HSE participants so we could get the numbers. I am assuming that they were more difficult to recruit from the earlier years as people will have changed phone numbers, become more ill, etc. I will need to check this assumption with the survey company.

Would really appreciate any advice. We are a charity and don't really have resources for additional expenditure, but I don't want to jeopardise the quality of the data.

Thank you

Thank you all. I think I am going to go with weights, as I know there will be bias in the sample and I would like to generalise. Going to see if a uni colleague will help me to avoid the costs! Fingers crossed....
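
For anyone attempting this without the survey company, the simplest form of weighting is post-stratification: each respondent is weighted by the population share of their group divided by the sample share of that group. The sketch below assumes a single grouping variable and known population shares; the groups and shares are invented, and real survey weighting (design weights, non-response adjustment, raking over several variables) is more involved.

```python
from collections import Counter

def poststrat_weights(sample_groups, population_shares):
    """One weight per respondent: population share / sample share of their group."""
    n = len(sample_groups)
    counts = Counter(sample_groups)
    return [population_shares[g] / (counts[g] / n) for g in sample_groups]

# Hypothetical example: the sample over-represents the under-65 group.
groups = ["<65", "<65", "<65", "<65", "65+", "65+"]
shares = {"<65": 0.5, "65+": 0.5}
w = poststrat_weights(groups, shares)
print(w)
```

With these weights the weighted share of each group matches the assumed population share, and the weights sum to the sample size.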

6
Is there any relationship or equation to find the total width of a Gaussian curve when the full width at half maximum is known?

If the premise of symmetry around the mean holds for positive values, then the width should be equal to twice the mean. So I am unable either to understand or to apply Gaussian curves. It would help if contributors explained their answers with a numeric example.
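
As a numeric example: a Gaussian has no finite total width, because its tails extend indefinitely, so a "total width" has to be defined by convention (for instance the width at 1% of the peak, or 6 standard deviations). What is fixed is the relation FWHM = 2*sqrt(2*ln 2)*sigma, roughly 2.355*sigma. The sketch below uses made-up function names and an FWHM of 10 in arbitrary units.

```python
from math import log, sqrt

FWHM_PER_SIGMA = 2 * sqrt(2 * log(2))   # about 2.3548

def sigma_from_fwhm(fwhm):
    """Standard deviation of a Gaussian with the given full width at half maximum."""
    return fwhm / FWHM_PER_SIGMA

def width_at_fraction(fwhm, fraction):
    """Full width where the curve has dropped to `fraction` of its peak height."""
    sigma = sigma_from_fwhm(fwhm)
    return 2 * sigma * sqrt(2 * log(1 / fraction))

fwhm = 10.0
sigma = sigma_from_fwhm(fwhm)
print(sigma)                          # standard deviation
print(width_at_fraction(fwhm, 0.01))  # full width at 1% of the peak
print(6 * sigma)                      # the +/- 3 sigma convention
```

The width does not depend on the mean at all; the mean only shifts the curve along the axis.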

5
What is the best non-parametric method for determining modality and potential mixture in a non-normal single-variable distribution?

I am an archaeologist studying bedrock mortars (the holes in boulders and rock outcrops that Native Americans would use to grind foodstuffs) in a large study area.  I’m exploring whether the depth of an individual bedrock mortar is an indicator of a specific kind of food that was meant to be ground in it.  More specifically, whether a bounded range of depths is correlated with a specific type of food (e.g. mortars between 0.25 – 5.5 cm deep were mostly used to grind acorn).  Unfortunately, I did not have the ability to test for the presence of food residues in the mortars, so I only have a single variable- depth- to work with.  Instead of performing regressions or similar multivariate tests to correlate specific foods with specific depth ranges, I am looking for patterns in the depth variable that suggest preferred ranges.  I assume these ranges will be evident as modes in the distribution curve, and potentially as mixed distributions.

I measured the depth (continuous interval) of 699 bedrock mortars from my study area, and can assume that is essentially the entire population.  When I plot the values, the resultant distribution curve is highly non-normal (right skewed, leptokurtic?, a long tail and lots of outliers on the right, p-value of <0.01 for Kolmogorov-Smirnov and Shapiro-Wilk tests).  There is an ever-present mode on the left side of the curve, which is definitely meaningful, but there are a couple of potential modes on the right side of the curve that are less obvious.  I have “bump hunted” by binning the data differently in the histograms, but am hoping for a more statistically powerful way to identify modes.  The values do not normalize under a log base 10 or square root transformation.  Is there a test or method I can use to identify statistically significant (probable?) modes in the distribution of this single variable?  Is there an effective non-parametric test for mixed distributions in this circumstance?

Any advice is greatly appreciated.  I have attached a spreadsheet of the values below.

Thanks!

KDE in R can be as easy as:

plot(density(variable.name))

... though there are many more options and packages that give further options.
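
For readers without R, here is a rough standard-library Python sketch of what a kernel density estimate does and how candidate modes can be read off it. The depths are invented, the function names are made up, and the bandwidth rule shown is the simple normal-reference rule, which is not necessarily what R's density() uses by default.

```python
from math import exp, pi, sqrt
from statistics import stdev

def kde(data, grid, bandwidth):
    """Gaussian kernel density estimate evaluated at each grid point."""
    norm = 1 / (len(data) * bandwidth * sqrt(2 * pi))
    return [norm * sum(exp(-0.5 * ((g - d) / bandwidth) ** 2) for d in data)
            for g in grid]

def normal_reference_bandwidth(data):
    """Rule of thumb 1.06 * sd * n^(-1/5); tends to oversmooth skewed data."""
    return 1.06 * stdev(data) * len(data) ** -0.2

def local_maxima(grid, dens):
    """Grid points whose density is strictly higher than both neighbours."""
    return [grid[i] for i in range(1, len(grid) - 1)
            if dens[i - 1] < dens[i] > dens[i + 1]]

# Invented depths (cm) with two clusters; replace with the measured values.
depths = [2.5, 3.0, 3.5, 12.5, 13.0, 13.5]
grid = [i / 10 for i in range(0, 161)]
modes = local_maxima(grid, kde(depths, grid, bandwidth=1.0))
print(modes)
```

The number of modes found depends strongly on the bandwidth, so it is worth repeating the count over a range of bandwidths and seeing which modes persist. That is an exploratory check, not a significance test.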

1
Can I create FMEA from the line items of pareto analysis?

I have created a Pareto analysis of the top contributors to error. Please let me know whether I can use the 80/20 list of line items to develop an FMEA to understand the root cause.

I think a Pareto analysis can give you the major contributors to each failure mode. FMEA is a rigorous process, and the historical failure modes from the Pareto analysis can help to enhance the FMEA. For a complete FMEA, besides these major failure modes, it is necessary to cover operating conditions, the failure process and its propagation, and many other cross-functional activities.

6
Can anyone direct me to this paper by Noether (1963) A note on the Kolmogorov-Smirnov statistic in the discrete case?

The complete reference is Noether GE: A note on the Kolmogorov-Smirnov statistic in the discrete case. Metrika 7:115-116, 1963.

Basically I'm looking for citations that prove that the Kolmogorov-Smirnov test for discrete variables is (way) too sensitive, and that its P-values are extremely small. Many thanks for your help.

2
We have done univariate analysis in SPSS with no problem getting the odds ratio, but an E+ error is showing in the multivariate analysis. How can we run it without the error?

We have done univariate analysis in SPSS with no problem getting the odds ratio, but an E+ error is showing in the multivariate analysis. Please give suggestions on how to run the multivariate analysis without the error.

Hello

You must conduct it via logistic regression (binary logistic regression) analysis.

Of course, you must first run the univariate analyses one by one and select the variables that have P less than 0.25 for the multivariate analysis.

3
How do I do a two-stage multiple imputation?

I am trying to do multiple imputation in two steps. I have successfully completed the first step (imputing the independent variable).

Now, using the imputed data set I created in the first step, I would like to impute the covariates I am using. I am using Stata and get this error:

"mi impute: imputations exist
40 imputations found; use replace to replace existing and/or add() to add new imputations. Before adding new
imputations, verify that the same imputation model is specified."

I appreciate any tips! Thank you!
Sarah

Hi Sarah,

I would use the methods included in the attached paper for missing data imputation.

They are easy to implement in R for imputing multiple variables.

I can guide you about that if you prefer to use R.

Best,

Watheq

• Source
Conference Paper: Comparative Statistical Algorithms for Imputation of Missing Measurements in Petrophysical Data
ABSTRACT: The missing data problem in petrophysical properties especially core measurements of permeability is a crucial step in reservoir characterization. It affects the multivariate statistical inference of these measurements leading to non-efficient prediction and less accurate geospatial modeling. Consequently, many imputation algorithms have been presented in this paper to comparatively predict the missing values of horizontal and vertical core permeability for a well in sandstone reservoir in a southern Iraqi oil field. The algorithms are Mean Substitution (MS), Iterative robust model-based imputation (IRMI), Multiple Imputation of Incomplete Multivariate Data (MIIMD), and Random Imputation of Missing Data (RIMD). These algorithms have been applied based on the deductive statistical inference to impute the incomplete data. All the algorithms above have been illustrated and the predictions have been depicted for the data before and after the imputation process for all the algorithms with respect to the histograms and the vertical data distribution given the well depth. The results have shown that the Random Imputation of Missing Data is the best algorithm because the histogram has preserved its shape before and after the imputation. Therefore, it is the best one for accurate imputation of incomplete petrophysical data.
Full-text · Conference Paper · Dec 2014
3
Is there any recommended test other than Sobel test in measuring mediating effect?

Sobel test in mediating effect

For a conceptually different approach, see http://imai.princeton.edu/projects/mechanisms.html

35
When I use AIC (Akaike information criterion) to find the best-fitting model, do I need to consider p-values?
Using the AIC method, I extracted the parameters which are the best fit to explain the variability in my dependent variable. My question is, when I want to publish my results, do I need to state my p-value that the AIC method gave me or the p-values that the regression model calculated?
I think Animesh Ghose has answered the question. The model with the lowest AIC value is considered as the best model. In addition, Akaike Weight and Evidence ratio should be calculated.
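
The Akaike weights and evidence ratio mentioned above follow directly from the AIC differences. A minimal sketch, with invented AIC values and made-up function names:

```python
from math import exp

def akaike_weights(aics):
    """Weight of each model: exp(-delta/2), normalised to sum to 1."""
    best = min(aics)
    rel = [exp(-(a - best) / 2) for a in aics]   # relative likelihoods
    total = sum(rel)
    return [r / total for r in rel]

def evidence_ratio(aic_a, aic_b):
    """Ratio of the Akaike weights of model a and model b (w_a / w_b)."""
    return exp((aic_b - aic_a) / 2)

aics = [100.0, 102.0, 110.0]       # invented values for three candidate models
weights = akaike_weights(aics)
print(weights)
print(evidence_ratio(aics[0], aics[1]))
```

An AIC difference of 2 corresponds to an evidence ratio of about e (2.7), which is why models within about 2 AIC units of the best one are usually not dismissed.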
3
Can I use fractional rank method to transform my data to normal distribution?

I have used log and square root transformations in order to normalise my data set, but they didn't work. Can I use the fractional rank method to get a normalised data set?

4
Why am I getting a low ICC even though agreement is objectively high?

I am assisting a student with a measure of agreement for test-retest data using an ICC (2, 1) for absolute agreement.  The data are ordinal on a 5-point scale, and according to the book by Streiner et al. (2015), the ICC is superior in these situations to (for example) kappa or weighted kappa.

We've encountered a situation where despite nearly all scores being identical between test 1 and test 2, we have an ICC of .000.  As far as I can tell this is because there is almost no variation in the scores.  For both test 1 and test 2, nearly all of the answers given are "2".  I've queried with the student whether this suggests that there was something wrong with the question in the first instance, since if you ask a question and everyone answers identically, was it really worth asking in the first place?

However what I'd like to know is WHY the ICC is 0 despite nearly all responses agreeing.  My understanding is that ICC is a version of correlation adjusted for the fact that the different variables are measuring the same thing.  Correlation looks at associations between two variables.  If when plotted on a scatter plot scores on both variables are, with very few exceptions, all "2", then it is not possible to model a relationship between the two variables.  They're not associated with one another, they're functionally identical.

That's the sort of answers I was expecting, thanks.  After thinking about it for a while longer I also realized that the ICC measures reliability, with agreement being secondary to this.  These two things are to an extent independent.  In the case of the data we have here we have high agreement but poor reliability.  If you're not doing an ICC for "absolute" agreement it is also possible to have the reverse be true - a situation where you have high reliability but low total agreement.
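
To see this numerically, here is a sketch of ICC(2,1) for absolute agreement computed from the two-way ANOVA mean squares, applied to invented test-retest data in which 8 of 10 subjects score "2" both times. The function names are made up and the data are not the student's.

```python
def icc_a1(data):
    """ICC(2,1), absolute agreement, single measure, from a two-way ANOVA.
    data[i][j] = score of subject i on occasion (or rater) j."""
    n, k = len(data), len(data[0])
    grand = sum(map(sum, data)) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    msr = ss_rows / (n - 1)                                   # between subjects
    msc = ss_cols / (k - 1)                                   # between occasions
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))  # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def exact_agreement(data):
    """Share of subjects whose scores are identical on every occasion."""
    return sum(len(set(row)) == 1 for row in data) / len(data)

scores = [[2, 3], [3, 2]] + [[2, 2]] * 8
print(exact_agreement(scores), icc_a1(scores))
```

Here 80% of subjects agree exactly, yet the ICC is slightly negative: the between-subject mean square is no larger than the error mean square, so the scores carry no information for telling subjects apart. If every score were identical, the denominator would be zero and the ICC would be undefined.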

7
Any information about statistics on GERD (Gross expenditure on research and development) and number of researchers broken down by field of science?

Dear colleagues, hello! My question is primarily for statisticians (statisticians of science, technology and innovation).

Do you have any information about statistics on GERD (Gross expenditure on research and development) and number of researchers broken down by field of science?

In the UNESCO Institute of Statistics (UIS), OECD.Stat and Eurostat databases, I've found such statistics for some countries (Croatia, Hungary, Portugal, Russia, Argentina etc.), but data for e.g. Germany, Japan, US, UK, Netherlands, France etc. exist only for one year or do not exist at all.

In OECD Stats, for example, there are more or less complete data (i.e., covering developed countries) on “Other national R-D expenditure by field of science and by source of funds”, but this is only a (quite small) part of total GERD.

I need data for “Gross domestic expenditure on R-D by sector of performance and field of science” (for total intramural sectors) and “R-D personnel by sector of employment and field of science” (for all sectors). Here the data are very fragmentary for the most developed countries. The situation is the same in UNESCO UIS and Eurostat, since these databases exchange data with each other.

Therefore, do you have any information on “Gross domestic expenditure on R-D by sector of performance and field of science” and “R-D personnel by sector of employment and field of science” for developed countries?

Are these data available in the databases of national statistical services?

Best regards,

Maxim Kotsemir

Have you checked the national statistical offices? Sometimes that is the only way to find the data; however, such data may not be fully comparable.

9
I categorized respondents into two groups on the basis of the mean of a percentage scale. Is this the correct approach?

I have grouped web searchers as efficient or inefficient on the basis of their experience of finding information online. I have responses from 281 volunteers to the question:

You find what you are searching for ______ of times.

options are: 0%   10%   20%   30%   40%   50%   60%   70%   80%   90%   100%

The mean is approximately 60%, so I formed two groups:

Group A % > 60

Group B % <= 60

Is this a statistically valid approach?

Thank you sir for adding this new dimension.

How has this high-dimensional test statistic been obtained?

Hi! I was reading this article by Feng and Sun, and I was wondering how the statistic was obtained. Any ideas?
Thanks.

• Source
Article: A note on high-dimensional two-sample test
ABSTRACT: We propose a new scalar and shift transform invariant test statistic for the high-dimensional two-sample location problem. Theoretical results and simulation studies show the good performance of our test under certain circumstances.
Full-text · Article · Oct 2015 · Statistics & Probability Letters
5
Can I include predictor variables that are correlated to the exposure variable in Negative binomial regression?

Hello, I have a question regarding negative binomial (NB) regression. I am not sure if I can include predictor variables that are correlated with the exposure variable (say time)? I'm concerned that the predictor variable (VIF 3.8) is correlated to the exposure variable (VIF 9.8). I have carried out NB regression despite the collinearity and the results are significant. The overall likelihood ratio test is 6.377, df = 1 and sig = 0.012. Can I include the predictor variable in this case?

I agree with Jochen and also with Andrey: if your model is an explanatory model, you do need to address multicollinearity, but if it is a predictive model, then it does not matter.

5
How can I calculate Cmax, Tmax, AUC, T1/2 in PK/PD studies?

Time (min): 0, 30, 60, 180, 300
Concentration of curcumin (µg): 0.00, 7.66, 84.27, 40.87, 7.84

Agree with both Konstantin and Jun.

Considering that you are quite new to PK concepts, I suggest a simple non-compartmental analysis. Compartmental analysis needs some understanding and software. Try the trial version of the Basica or Kinetica software.

Other software, such as Phoenix WinNonlin and GastroPlus, is paid and expensive!
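
For a first look without any dedicated software, a non-compartmental calculation can be sketched in a few lines. This assumes the linear trapezoidal rule for AUC (from the first to the last sampling time only, with no extrapolation) and a log-linear fit through the last points for the terminal half-life. The function name and the n_terminal parameter are made up, and the choice of terminal points is a judgment call with so few samples.

```python
from math import log

def nca(times, conc, n_terminal=2):
    """Return (Cmax, Tmax, AUC over the sampled interval, terminal half-life)."""
    cmax = max(conc)
    tmax = times[conc.index(cmax)]
    # Linear trapezoidal AUC from the first to the last sampling time.
    auc = sum((t2 - t1) * (c1 + c2) / 2
              for t1, t2, c1, c2 in zip(times, times[1:], conc, conc[1:]))
    # Terminal slope: least-squares line through ln(conc) for the last points.
    t = times[-n_terminal:]
    y = [log(c) for c in conc[-n_terminal:]]
    tm, ym = sum(t) / len(t), sum(y) / len(y)
    slope = (sum((a - tm) * (b - ym) for a, b in zip(t, y))
             / sum((a - tm) ** 2 for a in t))
    t_half = log(2) / -slope
    return cmax, tmax, auc, t_half

times = [0, 30, 60, 180, 300]                 # minutes, from the question
conc = [0.00, 7.66, 84.27, 40.87, 7.84]       # curcumin, from the question
cmax, tmax, auc, t_half = nca(times, conc)
print(cmax, tmax, auc, t_half)
```

With only five samples the half-life estimate is rough: it changes noticeably if three terminal points are used instead of two, so report which points were included.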

15
How do I test normality for two independent groups?

Hi,

I am doing comparison between two independent groups; patient group of 24 subjects and control group of 34 subjects. My questions are:

1) Is testing normality relative? I mean, is it sufficient to visually check the histograms and Q-Q plots and decide whether my variable is normally distributed or not? What are skewness and kurtosis? I wonder which conclusive tests I can rely on to test normality.

2) If one group shows normality while the other does not, does that mean that my data are not normally distributed and I have to use non-parametric tests?

3) I read about other assumptions for using parametric tests, even when the data of both groups are normally distributed. For example, I read about homogeneity of variance. How can I check this? I am using SPSS for my data analysis.

Thank you.

Thank you for all your replies.

Mr. Dennis,

You suggested that I fit the two means and look at the distribution of the residuals. Did you mean that I should combine both groups and plot them as one group? How do I check the distribution of the residuals?

Thank you again.
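
One reading of that suggestion (an assumption, since the original reply is not shown here) is: subtract each group's own mean from its values, pool the resulting residuals, and then inspect the pooled residuals with a single histogram or Q-Q plot. A sketch with invented numbers and a made-up function name:

```python
from statistics import mean

def pooled_residuals(*groups):
    """Subtract each group's own mean, then pool the residuals."""
    res = []
    for g in groups:
        m = mean(g)
        res.extend(x - m for x in g)
    return res

patients = [4.1, 5.0, 6.2, 5.5]          # invented values
controls = [7.0, 8.1, 7.5, 9.0, 8.4]     # invented values
res = pooled_residuals(patients, controls)
print(res)
```

The raw values of both groups combined can look bimodal simply because the group means differ; the residuals remove that difference, so they are the quantity whose distribution matters for a t-test.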