
# Statistics - Science topic

Statistical theory and its application.
Questions related to Statistics
Question
Hi, at a conference I once heard someone say, "The isolates within group A are similar to group B with R > 0.2."
Is there a way for me to calculate the R value between two different groups of isolates based on their nucleotide/amino acid sequences, or based on their sequence homology, so that I could reach a conclusion like the example I've provided? Thank you so much!
Regards,
Question
I want to draw a graph of predicted probabilities vs. observed probabilities. For the predicted probabilities I use this R code (see below). Is this code OK or not?
Could anyone tell me how I can get the observed probabilities and draw a graph of predicted vs. observed probability?
analysis10 <- glm(Response ~ Strain + Temp + Time + Conc.Log10
                  + Strain:Conc.Log10 + Temp:Time,
                  family = binomial)
predicted_probs <- data.frame(probs = predict(analysis10, type = "response"))
I have attached that data file
Plotting observed vs predicted is not sensible here.
You don't have observed probabilities; you have observed events. You might use "Temp", "Time", and "Conc.Log10" as factors (with 4 levels) and define 128 different "groups" (all combinations of all levels of all factors) and use the proportion of observed events within each of these 128 groups. But you have only 171 observations in total. There is no chance of getting any reasonable proportions (you would need some tens or hundreds of observations per group for this to work reasonably well).
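The binning idea described above can be sketched in Python (the thread itself uses R) on simulated stand-in data, since the poster's attached file isn't available; the variable names, sample size, and the perfectly calibrated "model" are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in data: a linear predictor, true probabilities, and
# 0/1 observed events (the thread's real data is in an attached file).
x = rng.normal(size=2000)
p_true = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))
events = rng.binomial(1, p_true)

# Suppose these are the model's predicted probabilities; using the true
# probabilities gives a perfectly calibrated model, for illustration.
p_hat = p_true

# Group observations into deciles of predicted probability and compare
# the mean prediction in each bin with the observed event proportion.
bins = np.quantile(p_hat, np.linspace(0, 1, 11))
idx = np.clip(np.digitize(p_hat, bins[1:-1]), 0, 9)
pred_mean = np.array([p_hat[idx == k].mean() for k in range(10)])
obs_prop = np.array([events[idx == k].mean() for k in range(10)])

# For a well-calibrated model the two columns track each other closely.
max_gap = np.abs(pred_mean - obs_prop).max()
```

With enough observations per bin, plotting `pred_mean` against `obs_prop` (a calibration plot) shows how far the model's predictions drift from the observed event rates; with only 171 observations, as noted above, the per-bin proportions would be far too noisy.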
Question
Hi,
I was trying to design a trial comparing one diagnostic tool (which gives 9 different outputs) to a common gold standard (which gives the same output categories). However, I'm confused about how to estimate the sample size for each output (or categorical group). My first thought was to use specificity and sensitivity for the calculation, but this is unlike normal binary variables.
Should I use the same method as for binary variables and apply it to each category (i.e., treat the category under study as positive and the rest as negative), so we can get a positive-group sample size and a negative group (which contains the other eight outputs)?
I'd really appreciate it if anyone could give me a hint or an approach to this.
I think a good metric for judging the performance of a diagnostic tool is the false-classification rate. You can determine the sample size needed to achieve a desired precision of the estimate of this false-classification rate.
Question
I want to estimate the half-life value for the virus as a function of strain and concentration, and as a continuous function of temperature.
Could anybody tell me how to calculate the half-life value in R?
I have attached a CSV file of the data
Ok, that's corrected.
What is the "Response"? Is it an indicator of whether the virus is detected? If so, then it's hard to estimate the half-life, as both the concentration and the sensitivity of the assay are unknown. I also wonder how Response could be 0 at Time 0.
It would be possible to fit a Cox model (as David suggested); you could also fit a binomial model and find the roots of the prediction function (on the logit scale), which is the log odds of "Response = 1" vs. "Response = 0". But it's not clear whether log odds = 0 corresponds to 50% survival.
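The "roots of the prediction function" idea can be sketched in Python (the thread uses R; the data here are simulated stand-ins and the detection model is an assumption): fit a logistic model of detection against time and solve for the time at which the log odds cross zero.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import expit

rng = np.random.default_rng(1)

# Hypothetical stand-in for the virus-detection data: the probability of
# a positive assay declines with time (hours), here with a known form.
time = rng.uniform(0, 48, 500)
p_detect = expit(3.0 - 0.2 * time)          # true P(Response = 1)
response = rng.binomial(1, p_detect)

# Fit a simple logistic regression by iteratively reweighted least
# squares (what R's glm(..., family = binomial) does under the hood).
X = np.column_stack([np.ones_like(time), time])
beta = np.zeros(2)
for _ in range(25):
    mu = expit(X @ beta)
    W = np.clip(mu * (1 - mu), 1e-9, None)
    z = X @ beta + (response - mu) / W
    beta = np.linalg.solve(X.T * W @ X, X.T * W @ z)

# The "half-detection" time is the root of the linear predictor:
# log odds = 0  <=>  P(Response = 1) = 0.5.
t_half = brentq(lambda t: beta[0] + beta[1] * t, 0, 48)
```

The caveat above still applies: this "half-detection" time equals the half-life of the virus only if a 50% detection rate really corresponds to 50% of the infectious material remaining.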
Question
I have to investigate
1) how the response depends on the strain, temperature, time, and concentration.
I applied logistic regression (glm) and got the reduced model. When I tried to make the logistic regression line and confidence interval, it looks like that in the picture. (pic attached below)
Could anybody tell me how to resolve this issue (I want only one logistic regression line and two confidence interval lines, not many)?
I have attached the data
For the confidence interval I use this:
prediction <- as.data.frame(predict(analysis13, se.fit = TRUE))
prediction.data <- data.frame(pred = prediction$fit,
                              upper = prediction$fit + (1.96 * prediction$se.fit),
                              lower = prediction$fit - (1.96 * prediction$se.fit))
plot(household$Conc.Log10, prediction.data$pred, type = "l",
     xlab = "width", ylab = "linear predictor", las = 1, lwd = 2, ylim = c(-10, 6))
lines(household$Conc.Log10, prediction.data$upper, lwd = 2, lty = 2, col = "dark red")
lines(household$Conc.Log10, prediction.data$lower, lwd = 2, lty = 2, col = "dark red")
Dear Zuffain
It appears you are looking to draw counterfactual plots (one predictor changing while the other predictors are held at a constant value, usually the average). Read from page 89 onwards of Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press, for more details.
However, I use Monte Carlo approximation instead of the glm function in R, but you can figure it out.
Best wishes
Question
Hello, dear community,
In the context of developing a quantum algorithm, I need at some point to evaluate the square of an expectation value <O>, with O a real diagonal matrix (and thus Hermitian).
I need to compute this efficiently using a quantum computer. I would like to know if there is a procedure, like the Hadamard test for evaluating <Ψ|O|Ψ>, to evaluate by using the minimum number of measurements the quantity (<Ψ|O|Ψ>)².
Multiplication of two numbers isn’t that hard.
Question
Hello Seniors I hope you are doing well
Recently I've read some very good research articles. In those articles the datasets were taken from V-Dem, Polity, and Freedom House. Although the authors shared links to the supplementary datasets and briefly described how they analysed them in SPSS or R, I couldn't understand or replicate their findings. It may be because I am not very good at quantitative data analysis.
So I want to know how I could better understand the analysis of datasets like V-Dem. Is there any good online course, lecture or conference video, or a good book?
Any help would be appreciated.
Thanks in anticipation.
You can find online courses for learning R on the edX and Coursera platforms.
Thanks ~PB
Question
For my research I want to test the influence of distance, according to Hall's (1966) interpersonal distances (intimate, personal, social and public), on facial expression (negative or positive). Since both my independent and dependent variables are categorical ordinal data, I thought of using Spearman's correlation. Is that the right statistical method?
Hello Tim,
If the distances variable truly represents an ordered/ordinal set, then Spearman correlation will work for you. If they are nominal categories, then, from a chi-square test of independence, the phi coefficient (a measure of correlation) = square root of (chi-squared / N).
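Both options can be sketched in Python (hypothetical data; the zone coding and the effect size are invented for illustration):

```python
import numpy as np
from scipy.stats import spearmanr, chi2_contingency

rng = np.random.default_rng(2)

# Hypothetical data: Hall's four ordered distance zones (0 = intimate,
# 1 = personal, 2 = social, 3 = public) and a binary expression rating.
distance = rng.integers(0, 4, 200)
# Invented effect: positive expressions more likely at closer distances.
expression = rng.binomial(1, 0.8 - 0.15 * distance)

# Ordinal-vs-ordinal view: Spearman rank correlation.
rho, p_rho = spearmanr(distance, expression)

# Nominal view: chi-square test of independence, with the phi-type
# effect size sqrt(chi2 / N) mentioned in the answer.
table = np.zeros((4, 2), dtype=int)
for d, e in zip(distance, expression):
    table[d, e] += 1
chi2, p_chi, dof, _ = chi2_contingency(table)
phi = np.sqrt(chi2 / distance.size)
```

Here `rho` comes out negative (positive expressions fall off with distance), while `phi` measures only the strength, not the direction, of the association.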
Question
Our group conducted undergraduate research on fiber-polypropylene composite panels. We have 4 different independent variables: fiber ratio (0%; 5%; 10%), fiber length (1 cm/2 cm), fiber age (old/young), and fiber treatment (treated/untreated), used to measure the strength (dependent variable) of the composite panel. Will factorial ANOVA do, or ANCOVA? If neither, what statistical treatment would be appropriate for our research?
Hello Neil,
It sounds as if factorial anova would work (3 x 2 x 2 x 2; or the analogous regression model). You didn't mention any possible covariate, so given the way your query is worded, it doesn't seem as if ancova is applicable here.
Of course, the suitability depends in part on how the data were collected and on whether the strength measurement is interval or ratio strength; your ability to examine higher-order interactions will depend on how many replications/cases per cell your design incorporated.
Question
We're conducting a research design as follows:
• An observational longitudinal study
• Time period: 5 years
• Myocardial infarction (MI) patients without prior heart failure are recruited (we'll name this number of people after 5 years of conducting our study A)
• Exclusion criteria: Death during MI hospitalization or no data for following up for 3-6 months after discharge.
• Outcome/endpoint: heart failure post MI (confirmed by an ejection fraction (EF) < 40%)
• These patients will then be followed up for a period of 3 to maximum 6 months. If their EF during this 3-6 months after discharge is <40% -> they are considered to have heart failure post MI. (we'll name this number of people after 5 years of conducting our study B)
• Otherwise they are not considered to have the aforementioned outcome/endpoint.
My questions are as follows:
1. What is A/B best called? Is it cumulative incidence? We're well aware of studies similar to ours, but the one main difference is that they did not limit the follow-up time (i.e., a patient can be considered to have heart failure post-MI even 4 years after recruitment). I wonder whether this factor limits the ability to calculate cumulative incidence in our study.
2. Is there a more appropriate measure to describe what we're looking to measure? How can we calculate incidence in this study?
3. We also wanted to find associated factors (risk factor?) with heart failure post-MI. We collected some data about the MI's characteristics, the patients' comorbidities during the MI hospitalization (when they were first recruited). Can we use Cox proportional hazards model to calculate the HR of these factors?
Hi,
There will be some censoring over these time points, and these patients may or may not be included under your criteria.
Question
Hello, I have two cohorts and I'd like to check if the distribution of cases is similar between them. For clarity I'll post the table.
I'd like to know whether the HEMS and GEMS groups differ significantly in the distribution of counts of the various coded calls (represented by their category numbers: 7, 14, 15, etc.).
I'm not sure how to run and check that with those variables - sorry for the basic question, I'd just like to prove similarity between the two groups and I'm a bit of a statistical novice.
Thank you.
Thank you both for responding :)
Question
Hi everyone,
I have performed a Spearman's Rho with several variables: 1 continuous dependent variable and 5 continuous independent variables. I did this as normality was violated so I couldn't do a Pearson's Correlation. From the Spearman's Rho, I have ordered the independent variables from the strongest correlation to the weakest. I am planning to run a regression where I enter the independent variables in order (from the strongest correlation to the weakest) but I cannot figure out which regression analysis I should run. Someone suggested a Stepwise regression but I am not sure if this is the correct analysis. Do you think I should just run a multiple regression (where I cannot choose the order of variables to be entered) or some other regression?
When you employ a five-point Likert scale, Pearson correlation is used for the relationships among variables, such as the relationships among the marketing-mix (7Ps) variables. But if you consider the marketing mix (7Ps) as predictors of customer satisfaction (the dependent variable), you have to employ multiple regression analysis (MRA) for prediction. The R-squared indicates how much of the phenomenon is explained, judged at a significance level of 0.05, 0.01 or 0.001.
Kindly visit the links for Multiple Regression Analysis and Pearson Correlation Analysis.
Question
Hi
I have a huge dataset for which I'd like to assess the independence of two categorical variables (x,y) given a third categorical variable (z).
My assumption: I have to do the independence test for each unique value of "z", and if even one of these tests rejects the null hypothesis (independence), independence is rejected for the whole dataset.
Results: I have done Chi-Sq, Chi with Yates correction, Monte Carlo and Fisher.
- Chi-squared is not a good method for my data due to the sparse contingency table
- Yates and Monte Carlo show rejection of the null hypothesis
- For Fisher, all the p-values are equal to 1
1) I would like to know if there is something I'm missing or not.
2) I have already discarded the "z"s that have DOF = 0. If I keep them how could I interpret the independence?
3) Why does Fisher's test result in a p-value of 1 all the time?
4) Any suggestion?
#### Apply Fisher's exact test
fish <- fisher.test(cont_table, workspace = 6e8, simulate.p.value = TRUE)
#### Apply the Chi^2 method
chi_cor <- chisq.test(cont_table, correct = TRUE)   ### Yates correction of the Chi^2
chi <- chisq.test(cont_table, correct = FALSE)
chi_monte <- chisq.test(cont_table, simulate.p.value = TRUE, B = 3000)
Hello Masha,
Why not use the Mantel-Haenszel test across all the z-level 2x2 tables for which there is some data? This allows you to estimate the aggregate odds ratio (and its standard error), thus you can easily determine whether a confidence interval includes 1 (no difference in odds, and hence, no relationship between the two variables in each table) or not.
That seems simpler than having to run a bunch of tests, and by so doing, increase the aggregate risk of a type I error (false positive).
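A hand-rolled sketch of the Mantel-Haenszel pooled odds ratio and the Cochran-Mantel-Haenszel test in Python (the strata, sample sizes, and common odds ratio are simulated assumptions, not the poster's data):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)

# Hypothetical stratified data: one 2x2 table of (x, y) counts per
# level of z, as in the question.
tables = []
for _ in range(8):
    n = rng.integers(30, 80)
    x = rng.binomial(1, 0.5, n)
    # y depends on x the same way in every stratum (common odds ratio)
    y = rng.binomial(1, np.where(x == 1, 0.7, 0.4))
    t = np.array([[np.sum((x == i) & (y == j)) for j in (1, 0)]
                  for i in (1, 0)])
    tables.append(t)

# Mantel-Haenszel pooled odds ratio and Cochran-Mantel-Haenszel test.
num = den = 0.0
stat_num = stat_den = 0.0
for t in tables:
    a, b = t[0]
    c, d = t[1]
    n = t.sum()
    num += a * d / n                      # numerator of the pooled OR
    den += b * c / n
    e_a = (a + b) * (a + c) / n           # E[a] under independence
    v_a = ((a + b) * (c + d) * (a + c) * (b + d)
           / (n ** 2 * (n - 1)))          # Var[a] (hypergeometric)
    stat_num += a - e_a
    stat_den += v_a
or_mh = num / den
cmh = stat_num ** 2 / stat_den            # CMH statistic, 1 df
p_value = chi2.sf(cmh, df=1)
```

In R the same analysis is available as `mantelhaen.test()` on the x-by-y-by-z table, which also reports the pooled odds ratio and its confidence interval.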
Question
Dear all,
For my graduation research, I am trying to create a composite score about household resilience out of data collected through a household survey. However, this data consists of ordinal variables (5-point Likert scale), binary variables (yes-no questions), and ratio variables (in proportions between 0-1).
My plan was to recode the data on the 5-point Likert scale into scores from 0-1 and do the same for the yes-no questions, with yes = 1 and no = 0 (since answering yes would mean a household is more resilient and thus have a higher score). However, this seems very off.
At this moment I am aware that it wasn't the best idea to create a survey with both types of questions, but I am unable to recollect the data.
Therefore my **question** is as follows: do you have any tips on how to create a composite score composed of both interval and binary variables?
Thank you in advance and have a lovely day.
Best,
Nina
Hello Nina,
In general, I would not recommend collapsing a measure having more than two values down to just two values unless there was some very compelling reason to do so. You're jettisoning potentially useful information by so doing.
The question here is, how do you intend to use the composite score you seek to construct? That should help to discern appropriate approaches from inappropriate methods. The fact that some elements of that composite are Likert-type response scale values whereas others are dichotomous variables isn't a deal-breaker.
When you say that you have "interval" variables, are you using individual Likert-type items as variables, or have you created and vetted one or more unidimensional scale scores by summing/averaging/otherwise combining scores over a set of related items? Generally, most would agree that individual Likert-type items involving self-report of human perception/sentiment/opinion/judgment (et al.) are ordinal strength scores, not interval strength scores. There are ways to get the results across related sets of such items into interval strength scores, however.
Question
I have a study in which there is an unbalanced design. Each person is put into 1 of three conditions. Then, there are two exemplars that each person receives. However, people are tested twice for exemplar 1 and only once for exemplar 2.
I was thinking of using a linear mixed effect model to account for this unbalanced design with condition*exemplar (to see if the effect of condition replicates across exemplar) and condition*time (to see if the effect of condition varied across time for exemplar 1). However, I am not sure of what the LMEM is doing when it gives me estimates for condition*time or time, given that there is only 1 time point for exemplar 2. Is it ignoring exemplar 2? Can I say that condition*time and time are looking at only exemplar 1?
Any help in clarifying this would be much appreciated! The other type of analysis I was thinking of doing would be analyzing the exemplars separately so that analysis #1 would look at the effect of condition*time for the exemplar with 2 time points, but analysis #2 would look at the effect of condition for the exemplar with 1 time point.
I have a question related to this. How can I deal with unbalanced sample sizes? I have two sets of data: one is the academic performance of 33 students, and the other is survey responses from 7 teachers. I want to look for a significant relationship between the two.
Question
How can I extract the p-values of Spearman correlations in the robCompositions package (corCoDa function) in R?
Thanks
Azzeddine.R
Hi @Daniel Wright,
Thank you for your kind response. May I clarify a few things here!
Firstly, when @Reghais said "robust", he meant robust from the compositional data analysis (CoDa) point of view, not robust in the usual sense. Secondly, you're right: using Spearman, whether in corCoDa or cor.test, serves the same purpose, in this case being less sensitive to outliers, as @Reghais indicated.
So both of you are right in your lines of argument, but @Reghais is working with compositional data and thus cannot use cor.test, because the solution to his data's manipulation is embedded in the corCoDa function of robCompositions.
I hope this helps.
Question
Hello,
I am working with my data with one IV (Likert Scale-Sports Engagement Scale) and one DV (6 Subscales-Likert Scale [Psychological Wellbeing Scale]). What statistical treatment is best to check their relationship and causal-relationship?
I hope you can answer it as soon as possible! Thank you so much!
Thank you so much!
Hello Joseph,
if you have 6 subscales, then you probably have 6 DVs, not one. Perhaps even more than 6, as most scales I know are under-complex and comprise more factors.
And of course Daniel Wright is right: the question is which confounders (common causes of the IV and the respective DV) of each of the 6 "effects" are plausible. However, these questions should have been addressed prior to the data collection.
Good luck
Holger
Question
Hi everyone,
I tried to run a Point Biserial Correlation with one continuous variable and several dummy coded nominal variables however, my continuous (dependent) variable violated the normality assumption.
Are there any alternatives for assessing the correlation between one continuous dependent variable and several dummy coded nominal variables?
Thank you!
If I understand correctly, you have several binary variables and one numeric variable, and you want to see if (and by how much, measured perhaps by Cohen's d or a correlation measure) the values of this numeric variable differ between the two values of each binary variable, BUT you do not feel a t-test is appropriate. Is this correct? If differences in means are all you want, then bootstrapping could be used. Otherwise, there are lots of transformations (including rank-based tests) that can be used. But I am not sure I understand your problem.
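The bootstrap option can be sketched as follows in Python (hypothetical skewed data; the group effect is an invented assumption):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: a skewed (non-normal) continuous outcome and one
# binary grouping variable, with group 1 shifted upward.
group = rng.integers(0, 2, 150)
y = rng.exponential(1.0, 150) + 0.8 * group

# Bootstrap the difference in group means and form a percentile CI,
# avoiding the normality assumption behind the t-test.
diffs = np.empty(5000)
for b in range(5000):
    idx = rng.integers(0, y.size, y.size)   # resample rows with replacement
    yb, gb = y[idx], group[idx]
    diffs[b] = yb[gb == 1].mean() - yb[gb == 0].mean()
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
```

If the 95% percentile interval `(ci_low, ci_high)` excludes zero, that is evidence of a difference in means, without any assumption about the shape of the outcome's distribution.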
Question
Hi Everyone,
I am running a multiple regression with several dummy-coded variables (initially multi-level ordinal variables). With the assumptions testing, do I carry it out as it's normally done? Do I need to do something special with the dummy-coded variables?
Thank you!
Before any regression analysis, it is better to examine whether a linear relationship exists between each independent variable and the dependent variable, using scatter plots and by calculating the correlation coefficients. Scatter plots are a good tool for this purpose.
Question
Hello!
I am looking to do a Pearson Correlation to determine the order in which my variables should be entered into a Hierarchical Multiple Regression. Some of my variables are continuous and some are categorical. So, I am thinking to perform Pearson Correlations such as this:
- Correlation between level of study (undergraduate, postgraduate), area of study (nutrition, counselling, psychology, medicine, etc), year of study (1st, 2nd) and a test score. A separate correlation test will be performed to find out the associations between each test score (I have a number of different tests) and the independent variables.
I will then use those correlations in the Hierarchical Multiple Regression to determine which of the independent variables can predict the test scores. I will dummy-code each of the categorical variables after the Pearson Correlation but before the Hierarchical Multiple Regression.
My question is, are these steps suitable?
Thank you!
It does not make sense to compute a Pearson correlation coefficient between a polytomous nominal variable (such as area of study) and a continuous variable (Pearson is OK for binary and continuous variables, but not for polytomous nominal variables).
I would suggest dummy coding and running the regression with all variables right away. The regression should tell you which variables are statistically significant predictors of your outcome variable, taking into account all variables at once.
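The suggested approach can be sketched in Python using plain least squares (the variables, levels, and coefficients are invented for illustration; the thread's own work is in SPSS/R):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data: a 3-level nominal predictor ("area of study"), one
# continuous predictor, and a continuous test score as the outcome.
n = 300
area = rng.integers(0, 3, n)          # 0, 1, 2 = three areas of study
hours = rng.normal(10, 2, n)
score = (50 + 5 * (area == 1) - 3 * (area == 2)
         + 1.5 * hours + rng.normal(0, 2, n))

# Dummy-code the nominal variable (reference category = area 0) and fit
# the regression with all predictors at once, instead of screening with
# pairwise Pearson correlations first.
X = np.column_stack([
    np.ones(n),
    (area == 1).astype(float),        # dummy for area 1
    (area == 2).astype(float),        # dummy for area 2
    hours,
])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
# beta recovers roughly [50, 5, -3, 1.5], up to sampling noise
```

Each dummy coefficient is the adjusted mean difference between that area and the reference area, which is exactly the information a Pearson correlation against a polytomous nominal code cannot give you.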
Question
Hello,
I am combining (averaging) survey items measured with a Likert scale, in order to get some new composite variables. I know there is some controversy regarding this, but I am following methods of similar research. If I take the average of a set of ordinal items, can I use the new variables in an ordinal regression as dependent/independent variables? I have heard multiple linear regression would be an option, but I violate some assumptions.
Thank you!
It depends on what you were measuring with your Likert scale. For instance, if 2 means "bad" and 5 means "good", then the average corresponds to something like "fair", which may not make any sense. Use an average only if it really represents your data; otherwise the "similar research" may itself not be well conceived.
Question
I am looking at gender equality in sports media. I have collected two screen time measures from TV coverage of a sport event - one time for male athletes and one time for female athletes.
I am looking for a statistical test to give evidence that one gender is favoured. I assume I have to compare each gender's time against the EXPECTED time given a 50/50 split (i.e., (male time + female time) / 2), as this would be the time if no gender were favoured.
My first thought was chi-square? But I'm not sure that works because there's really only one category. I am pregnant and so my brain is not working at the moment lol. I think the answer is really simple but I just can't think of anything.
For this sport event there was some fixed total time, I suppose. During this time, either male or female athletes could be shown. To report some aspect, one might think of a minimum required time, like half a minute or so. So one could cut the total time into a series of such short slots showing either a male or a female athlete. The distribution of these might be considered binomial, and one could test hypotheses about the proportion of slots covered by one of the sexes.
A crucial and so far unanswered aspect is what hypothesis should be tested. In the binomial example explained above, the null hypothesis should be the proportion expected when no sex is preferred. This is not simply 0.5, since the proportion of male and female athletes participating in the event must also be part of the equation. Further, there might be different disciplines, some of which can be reported in many short time slices, whereas others show fewer athletes for a longer time. If these disciplines are (more or less) sex-specific, then this introduces a bias that should also be considered. There are surely many more difficulties in defining what a "fair" screen-time split would be. So I don't see how such an analysis could provide "evidence" of favouring one of the sexes (except under some specific and rather unrealistic assumptions).
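The binomial idea can be sketched with SciPy (the slot counts and the 60%-male benchmark are invented purely for illustration):

```python
from scipy.stats import binomtest

# Hypothetical counts: cut the broadcast into 30-second slots and record
# which sex is shown in each (slots showing neither are dropped).
male_slots, female_slots = 130, 90
n_slots = male_slots + female_slots

# Null hypothesis: the proportion expected if no sex is favoured.
# As noted above, this need not be 0.5: if, say, 60% of the
# participating athletes were male, 0.6 is the fairer benchmark.
result_50 = binomtest(male_slots, n_slots, p=0.5)
result_60 = binomtest(male_slots, n_slots, p=0.6)
```

With these invented counts, the test against 0.5 rejects while the test against 0.6 does not, illustrating how much the conclusion hinges on choosing the right null proportion.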
Question
Dear Researchers,
I interpret mediation calculated with PROCESS 4.0 in my study using the approach of Zhao et al. (2010). They suggest that it is an indirect-only mediation when the a*b effect is the only significant one, and that the sign of this effect doesn't matter. The problem is that in my results only the indirect effect is statistically significant, and the total effect is smaller than the direct effect, which suggests a suppression or confounding model (MacKinnon et al., 2000). My question is: can I use the term indirect-only mediation in this situation, or should I interpret this result as an example of suppression/confounding?
Zhao, X., Lynch Jr, J. G., & Chen, Q. (2010). Reconsidering Baron and Kenny: Myths and truths about mediation analysis. Journal of consumer research, 37(2), 197-206.
MacKinnon, D.P., Krull, J.L. & Lockwood, C.M. (2000) Equivalence of the Mediation, Confounding and Suppression Effect. Prev Sci 1, 173–181.
Hello Jakub,
a) explain what your variables are
b) report exactly how large all effects are (mainly all direct effects)
Dirk Enzmann I don't know the MacKinnon paper. Could you explain how he views mediation and suppression occurring at the same time? As far as I remember, suppression means that x is unconditionally unrelated to y, which is in contrast to x being an indirect cause of y (although I could imagine a scenario in which x and y are additionally confounded in such a way that the resulting covariance is in the opposite direction to that of the indirect effect).
All the best,
Holger
Question
I need to run an ART (aligned rank transform) ANOVA and Tukey's HSD for the interactions among the treatments, but my dataset has a few NAs due to experimental errors.
When I run:
anova(model <- art(X ~ Y, data = d.f))
I get the error:
Error in (function (object) :
Aligned Rank Transform cannot be performed when fixed effects have missing data (NAs).
Manually shifting the values up is not an option, because each row is a sample and it would keep the NAs, simply in the wrong samples.
The issue is that you are using art() from ARTool to fit the model, and that function can't handle missing values. You could use listwise deletion by passing na.omit(d.f) to the art() function, though this could potentially bias results (though no more than letting na.action = na.omit drop incomplete rows in anova() or lm()).
A better solution is to use multiple imputation (e.g., with the mice package in R), though I'm not sure whether that works directly with art() models. Alternatively, use a different approach to handle your data (which presumably aren't suitable for linear models): a transformation, a generalized linear model, robust regression, etc., depending on the nature of the data.
Question
I would like to search for journals to publish in the area of statistics / quantitative methods that have special issues. Other than going through each journal one by one, is there any website (publisher-wise or as a collection from different journals) that list down the special issues. I know about MDPI and Taylor & Francis. How about Wiley, Springer, etc.?
Béatrice Marianne Ewalds-Kvist thank you. I probably worded it wrongly. I am looking to publish in special issues in the area of statistics.
Question
Dear researchers,
I read a paper from one well-known publisher. The paper is about gestational diabetes. At age 45+ they found 35 cases of diabetes and 90 women without it. They wrote: "The incidence of GDM at age ≥ 45 years was as high as 38.89%." I would suggest 28%. What do you think?
In addition, I have another question. If something increases from 50 to 200, is that a 4-times increase?
And what about folds? In my opinion, folds are not the same as times? The next article stated in the abstract: "neutralizing antibodies were increased by 10.3–28.9 times at 4 weeks after the booster" and then in the results "neutralizing antibody GMTs then increased during the 4 weeks after the booster dose until day 237 by 28.9, 10.3, and 11.9-fold". Was this written correctly?
38.89% is what you get if you divide 35/90; it should be 28% (35/125).
I think "fold change" is synonymous with "times", as in "3-fold larger" = "3 times larger".
Question
We are measuring "purchase intention" and have therefore built a framework in which four variables (involvement, argument quality, source credibility, and information usefulness) together make up purchase intention.
We conducted a survey with a number of questions for each variable mentioned above, with an answering scale of 1-6. The aim of the study is to compare the mean of each variable in order to see whether the two groups the survey was sent to (Gen Z and Millennials) differ in some way. We are also wondering how to combine these four variables in order to measure a mean purchase-intention score for each group.
If your goal is to compare mean scores between the two groups, then you can just use a series of t-Tests. If you want to determine which variables predict the group a person belongs to, based on comparing the full set of variables, then you can use either discriminant analysis or logistic regression.
Question
Did the authors include any validation items, and/or did they consult experts to review the sample? I cannot seem to find anything that answers these questions.
Factor Analysing
Question
Hi there,
I am working with SPSS and I noticed that I have a lot of missing values.
I can't delete the variables, so I have to replace the missing values. I read about a few different options online, but I am still not sure which one to choose. I can't replace the missing values with the mean, because they are ordinal variables. Is it an option to replace the missing values with the mode, or is it even better to replace them with the median?
I am a nursing student and it's the first time I am working with this program. I am basically a rookie.
A better option than mean or median imputation is multiple imputation of missing values. See, for example,
Enders, C. K. (2010). Applied missing data analysis. Guilford Press.
Enders nicely discusses the flaws of single-imputation procedures (as well as the advantages of multiple imputation and related procedures for handling missing data).
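A deliberately crude sketch of the multiple-imputation idea in Python (real MI, e.g. SPSS's multiple-imputation procedure or R's mice, draws each imputation from a model conditional on the other variables; here the draws are unconditional, just to show the m-completed-datasets structure):

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical ordinal item (1-5) with 20% of values missing at random.
x = rng.integers(1, 6, 200).astype(float)
miss = rng.random(200) < 0.2
x[miss] = np.nan

observed = x[~np.isnan(x)]

# Single imputation (mode/median) fills every hole with one value and
# understates the uncertainty. Multiple imputation instead creates m
# completed datasets; here each hole is filled by a random draw from
# the observed distribution (a crude stand-in for model-based draws).
m = 20
means = []
for _ in range(m):
    xi = x.copy()
    xi[np.isnan(xi)] = rng.choice(observed, np.isnan(xi).sum())
    means.append(xi.mean())

# Pool by averaging the estimate across the m completed datasets
# (Rubin's rules also combine the within- and between-imputation
# variances to get honest standard errors).
pooled_mean = np.mean(means)
```

The between-imputation spread of `means` is exactly the uncertainty that mode or median imputation silently throws away.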
Question
I want a statistics book that has explanations about Durbin-Watson and R square. If you know any, please let me know.
Question
Dear all,
I collected data from 14 participants using a Likert-scale item (1-5) related to workload measurement. The same participants rated the scale for six different-sized keyboard designs with respect to workload. To find out whether there is a significant difference among the six keyboard designs, I applied the non-parametric Friedman test. I found a statistically significant difference, so I applied the Wilcoxon signed-rank test with Bonferroni correction for the pairwise comparisons.
My question: While I found no significant difference between LL-SS, there is a significant difference between ML-SS, even though LL and ML have the same mean. Is this plausible? The adjusted alpha value is 0.05 / 15 = 0.00333, and the Wilcoxon result for ML-SS is 0.00306, so I considered it a significant difference. Did I do this correctly?
I attached the workload results and spss outcome.
Hi, Emmanuel Gabreyohannes . In general the signed-rank test isn't a test of medians. It's pretty easy to come up with an example of observations that have the same medians, but will result in a low p-value for this test. ... As a side note, technically the test can't be used with strictly ordinal data as the first step is subtracting paired values, which wouldn't make sense for strictly ordinal data.
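The Friedman-then-Wilcoxon pipeline from the question can be sketched with SciPy (simulated ratings; the 14-participant-by-6-design shape matches the question, but the effect is invented):

```python
import numpy as np
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(7)

# Simulated workload ratings: 14 participants x 6 keyboard designs;
# design 0 is made clearly more demanding than the rest.
n, k = 14, 6
base = rng.integers(1, 4, (n, 1))                 # participant baseline
ratings = (base + rng.integers(0, 3, (n, k))).astype(float)
ratings[:, 0] += 2.0                              # invented effect

# Omnibus Friedman test across the six related samples.
stat, p_friedman = friedmanchisquare(*[ratings[:, j] for j in range(k)])

# Pairwise Wilcoxon signed-rank tests with Bonferroni correction:
# 15 comparisons, so the adjusted alpha is 0.05 / 15 ≈ 0.00333.
n_pairs = k * (k - 1) // 2
alpha_adj = 0.05 / n_pairs
pvals = {(i, j): wilcoxon(ratings[:, i], ratings[:, j]).pvalue
         for i, j in combinations(range(k), 2)}
significant = {pair for pair, p in pvals.items() if p < alpha_adj}
```

As the answer above cautions, the signed-rank test compares the distributions of paired differences, not means or medians, so two designs with identical means can still produce different pairwise outcomes.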
Question
Recently I am trying to reproduce results from this paper Janich, P., Toufighi, K., Solanas, G., Luis, N. M., Minkwitz, S., Serrano, L., Lehner, B., & Benitah, S. A. (2013). Human Epidermal Stem Cell Function Is Regulated by Circadian Oscillations. Cell Stem Cell, 13(6), 745–753.
Here is the difficulty I met:
The author performed microassay to detect gene expression in mutiple overlapping time window.
Take time window 1 for example, let's say there are 100 genes, of which their expression are detected at 5h, 10h, 15h, 20h.
Then the author applied a quadratic regression model "expression = a(time.point)^2 + b(time.point) + c" to determine whether these genes change periodically within each time window (time.point can be 5, 10, 15, 20 in this example). If the coefficent "a" <0 and pvalue for the coefficient "a" < 0.05, this gene would be identified as "peak gene"; Otherwise, if the coefficent "a" >0 and pvalue for the coefficient "a" < 0.05, it would be labelled as "trough gene". But the problem is that, the author calculate the pvalue with two methods, the first one is based on t distribution i.e. pvalue = Pr(>|t|) and the R code would be:
summary(lm(expression ~ poly(time.point, 2, raw = TRUE)))$coefficients[3, "Pr(>|t|)"]
the other way is based on the normal distribution, i.e.
pvalue = pnorm(q = abs(t.score), lower.tail = FALSE) * 2
(That means if |t.statistics| > 1.96, the pvalue is guaranteed to be < 0.05.)
The author chose the latter one as the final pvalue. But is it right to do so in this situation?
From what I have learned, the t distribution should be preferred when the population standard deviation is unknown and the sample size is small (here each regression model has only 4 observations). Since the different p-values calculated by these two methods could greatly affect the final result and conclusion, could someone give me a detailed explanation? Any help would be appreciated!
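The gap between the two approaches can be sketched numerically. With 4 time points and a quadratic model there is only 4 - 3 = 1 residual degree of freedom, so the t-based and normal-based p-values for the same statistic diverge dramatically. The t statistic below is a made-up value for illustration.

```python
# Sketch: two-sided p-values from the t and normal distributions for the
# same test statistic, mirroring the paper's two calculation methods.
from scipy import stats

t_score = 2.5   # hypothetical t statistic for the quadratic coefficient
df = 1          # n - p = 4 - 3 residual degrees of freedom

p_t = 2 * stats.t.sf(abs(t_score), df)     # t-based (what summary(lm) reports)
p_norm = 2 * stats.norm.sf(abs(t_score))   # normal-based (the paper's choice)

print(f"t-based p:      {p_t:.4f}")
print(f"normal-based p: {p_norm:.4f}")
# With df = 1 the t-based p-value stays large even for a sizeable t
# statistic, so the normal approximation is far more liberal here.
```

This illustrates why the choice matters so much with only 4 observations per regression: the normal-based p-value can fall below 0.05 while the t-based one stays well above it.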
To follow on from the comment by Salvatore S. Mangiafico: you might think about what difference this step makes in the eventual results.
In this small part of the procedure, you have a test statistic consisting of a ratio, where the divisor is an estimated standard deviation, and the two versions you mention either do or do not take into account the fact that you have an estimated standard deviation.
(i) there is a possible variant where you replace the estimated standard deviation (estimated locally in the sequence) by something obtained more broadly, and so less subject to error, and for which the normal approximation is better.
(ii) the cases where the test statistic might be most misleading arise when the estimated standard deviation is either unusually small or unusually large. So you might want to look at whether the actual pattern of data in such cases really do justify being counted as turning points. Perhaps the short section of the series is too smooth or too rough to support a conclusion.
(iii) you might consider an alternative way of assessing this, or any other test statistic, (getting a "p-value") which could be via a permutation test of some sort.
But, for the two versions you mention, if the later arguments of the overall procedure really only choose to look at smallest apparent p-values without using a formal threshold (such as 0.05), then there may be no contradiction between the lists of points selected as turning points (except perhaps in the numbers in the lists).
Question
I have attached the data of my results but I'm not sure how to proceed, could you help me?
any software at the websites available
Question
I have a model that needs calibration, but I am afraid that if I calibrate using too many model parameters, I will overfit to my data, or the calibration will not be well-done.
Can anyone suggest a method to determine the maximum number of parameters I should use?
I think it is better to fit a physically motivated function rather than a simple polynomial. Of course, in all cases we need to keep outliers in mind. The advantage of fitting a function is that its parameters have physical significance according to your model. Also, from the chi-square/ndf of the different fits you can compare goodness of fit.
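The chi-square/ndf comparison above can be sketched as follows. The data, the "physical" model, and the measurement error are all invented for illustration: the truth is an exponential decay, and a generic quadratic is fitted alongside it.

```python
# Sketch: compare a physically motivated model against a plain polynomial
# by reduced chi-square (chi^2 / ndf). Data and errors are simulated.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
x = np.linspace(0.5, 5, 20)
sigma = 0.01                                   # assumed measurement error
y = np.exp(-x) + rng.normal(0, sigma, x.size)  # truth: exponential decay

def physical(x, a, b):        # hypothesised model with meaningful parameters
    return a * np.exp(-b * x)

def poly2(x, c0, c1, c2):     # generic quadratic, no physical meaning
    return c0 + c1 * x + c2 * x**2

for name, f, p0 in [("exponential", physical, (1.0, 1.0)),
                    ("quadratic  ", poly2, (1.0, 0.0, 0.0))]:
    popt, _ = curve_fit(f, x, y, p0=p0)
    chi2 = np.sum(((y - f(x, *popt)) / sigma) ** 2)
    ndf = x.size - len(popt)                   # degrees of freedom
    print(f"{name}: chi2/ndf = {chi2 / ndf:.2f}")
```

A reduced chi-square near 1 indicates a fit consistent with the assumed errors; a value well above 1 flags a model that does not describe the data.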
Question
Good afternoon dear colleagues.
I know for sure that researchers sometimes expand their group size. Let's say we are planning an experiment with 10 mice: 5 in the control group and 5 in the experimental group. At the end of the experiment, we see that we do not have enough data (e.g. 1 or 2 mice were excluded). By that, I mean we see a difference, but it is not statistically significant. I know that the required group size may be calculated using preliminary data, but I have not seen anybody use this strategy when performing the experiment (I guess because the number of animals in an experiment must be minimized as much as possible).
As I said, I know that in the case described, people may conduct additional experiments with another group of mice and then combine the data from the two experiments. Although the protocol and conditions are kept the same as much as possible, I doubt that this strategy is right. Moreover, I have read or heard somewhere an explanation of how bad this approach is and why, but I do not remember where, and now I cannot find the answer no matter how hard I try.
Could anybody please explain that moment?
If I understand your question, it is that you would like to know why looking at a preliminary result and using it to guide adding observations might turn out badly.
First, as JW has responded, you run a very strong risk of generating a type I error. There are approaches (e.g. sequential analysis) to running an experiment that manages this. But generally it is difficult to work out the probabilities if you are starting with a random result "that you like" and hoping that it becomes "more convincing".
There are other problems. One to consider is the pattern of missingness. How can you be sure that attrition wasn't related to your experiment? Well, there are tests, but with small numbers of animals you cannot be sure one way or another. Again, there are statistical approaches to tackle the problem, but you would need larger numbers before you can get a real answer. If the attrition was due to your experiment, then the measures you are taking do not reflect the overall risk of the situation, and this might make it impossible to characterize the situation correctly.
A final point is that even with "the same" protocols, it is hard to be absolutely convinced of equal treatment -even with mice. Is there a seasonal effect? Are you more practiced with the manipulation at this point? Are you now "less blind" or more committed to an outcome? Are you more or less patient?
Hope this helps.
Question
I am currently analysing a dataset from a survey that moved participants to a different section depending on their response to a previous question (e.g. people who said they were not farmers skipped the farming-related questions). There is also missing data that simply relates to respondents choosing to skip a question. The structurally missing data has led to a large amount of missingness overall, and I assume the data are MNAR.
I am wondering how to appropriately manage these two sets of missing data so I can progress to an ANOVA and then regression.
For context out of a sample of 22076, for each variable there are around 4500 structurally missing values (20%), and around 80 (.4%) missing values due to respondents choosing to not respond.
Any help would be hugely appreciated!
If people are only skipping questions that are not relevant to them, it sounds like a matter of defining your population for each question. If you want to know something about farmers, you only ask farmers. A non-farmer is not a missing case that needs imputation; such a person is simply not part of the farming population. If you want to know about parents, a non-parent is not part of the parent population. Does that address your question? Perhaps population definition by question is the issue? Then no action would be required other than defining your populations, which would be of varying sizes, and your sample sizes would vary as well.
Question
I carried out the Kruskal-Wallis H test in SPSS to do a pairwise comparison of three groups. I got some positive and negative values in the Test Statistic and Std. Test Statistic columns. I can interpret the results based on the p-value, but I don't know what the values in the Test Statistic and Std. Test Statistic columns indicate, or why some values are positive and some are negative. Some explanation would be appreciated. Thanks in advance.
Salvatore S. Mangiafico , Yes the second group has higher values but I don't know what should I conclude from higher values. Also, there is no explanation on negative test statistics. I tried to figure it out but unfortunately, I couldn't
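The sign question above can be made concrete. SPSS's pairwise follow-up to Kruskal-Wallis is Dunn's test: the test statistic is the difference in mean ranks between the two groups, standardized by its standard error, so a negative value only means the first-listed group has the lower mean rank. The toy data below are invented.

```python
# Sketch of Dunn's pairwise z statistic after Kruskal-Wallis (no ties).
# The sign is just (mean rank of group i) minus (mean rank of group j).
import numpy as np
from scipy.stats import rankdata, norm

groups = [[3, 5, 6, 8], [10, 12, 14, 15], [4, 7, 9, 11]]  # toy data
ranks = rankdata(np.concatenate(groups))   # ranks over the pooled sample
n = len(ranks)

sizes = [len(g) for g in groups]
bounds = np.cumsum([0] + sizes)
mean_ranks = [ranks[bounds[k]:bounds[k + 1]].mean() for k in range(3)]

i, j = 0, 1   # compare group 1 vs group 2
se = np.sqrt(n * (n + 1) / 12 * (1 / sizes[i] + 1 / sizes[j]))
z = (mean_ranks[i] - mean_ranks[j]) / se   # negative: group i ranks lower
p = 2 * norm.sf(abs(z))
print(f"mean ranks: {mean_ranks}, z = {z:.2f}, p = {p:.4f}")
```

Here group 1 has smaller values than group 2, so its mean rank is lower and z comes out negative; swapping i and j flips the sign without changing the p-value.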
Question
Data Mining and Machine Learning look similar to me. Can you elaborate on the difference between the two? As per my understanding:
Data Mining is about finding useful information and using that information in decision making. That means using the known properties of the data, we discover previously unknown properties, e.g. studying sales of computers in different regions and adjusting supply accordingly.
On the other hand, ML is about prediction. It uses known properties of the data to predict values for new data instances, e.g. predicting the price of a house 5 years from now from existing house-sales data.
Question
I have heard in videos that variation in R2 and path coefficients (before and after common method bias correction) should be <10% for unmeasured marker variable
and <30% for measured latent marker variable correction method.
Can anyone share the articles or references? Where does this cut-off value come from?
Please see the CMB related discussion in this latest article:
Syed Mahmudur Rahman, Jamie Carlson, Siegfried P. Gudergan, Martin Wetzels, Dhruv Grewal. (2022). Perceived Omnichannel Customer Experience (OCX): Concept, measurement, and impact. Journal of Retailing, https://doi.org/10.1016/j.jretai.2022.03.003
Question
Hello everyone,
I would like to ask whether somebody knows which R package should be used to perform the Marascuilo procedure. I have multiple groups with proportion values that I want to compare, so it seems to be the most suitable test (i.e. 5 out of 10 individuals from population 1 produce compound X = 0.5, while only 4 out of 20 individuals from population 2 produce compound X = 0.4, and so on). However, I cannot find whether there is a package available for this test.
Does anyone know what package and function should be used?
Thank you very much in advance!
Cheers,
Dear Marián,
Thank you very much for this suggestion! I will check this code as well.
Cheers,
K/
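Since an off-the-shelf package is hard to find, the Marascuilo procedure is also easy to code directly: every pair of proportions is compared against a chi-square-based critical range. The sketch below uses the counts from the question (5/10 and 4/20) plus an invented third group.

```python
# Sketch of the Marascuilo procedure: a pairwise difference |p_i - p_j| is
# significant if it exceeds the critical range
#   r_ij = sqrt(chi2_crit(k-1)) * sqrt(p_i(1-p_i)/n_i + p_j(1-p_j)/n_j).
from itertools import combinations
from scipy.stats import chi2

successes = [5, 4, 15]     # producers of compound X per population (toy)
totals = [10, 20, 20]
alpha = 0.05
k = len(totals)
crit = chi2.ppf(1 - alpha, k - 1)   # chi-square critical value, k-1 df

p = [s / n for s, n in zip(successes, totals)]
for i, j in combinations(range(k), 2):
    diff = abs(p[i] - p[j])
    r = (crit * (p[i] * (1 - p[i]) / totals[i]
                 + p[j] * (1 - p[j]) / totals[j])) ** 0.5
    verdict = "significant" if diff > r else "not significant"
    print(f"group {i+1} vs {j+1}: |diff| = {diff:.3f}, range = {r:.3f} -> {verdict}")
```

The same few lines translate directly to R if you prefer to stay there (qchisq in place of chi2.ppf).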
Question
For example, when is it better to use decision trees instead of SVM or KNN, based on underlying theory/distribution of the data ?
I would appreciate any empirical/theoretical advice or references.
Thank you.
Question
Say I have a 3 (young vs. middle-aged vs. old) × 2 (male vs. female) × 2 (smoker vs. non-smoker) ANOVA design with a three-way interaction. However, the number of non-smokers (N=2000) is twice the number of smokers (N=1000). In this case, the number of young male non-smokers is 600 whereas the number of young male smokers is 300. Would the unequal sample size in each cell be problematic for the ANOVA and multiple comparison tests?
Yes, you can use ANOVA, but beware of collinearity between the explanatory variables; for instance, the majority of smokers might also be "old". The variable "Age" might be better treated as continuous rather than split into 3 categories with arbitrary cut points.
Question
We are currently conducting a research project that focuses on organ rejection. For this purpose, we have taken blood samples of various patients, who have received an organ transplant pre- and postOP, although here we only consider postOP. Some of these patients have received an organ biopsy to diagnose a suspected organ rejection reaction. Blood samples were also taken during these times.
We want to compare the non-rejection (samples taken postOP when no biopsy was taken or samples corresponding to a negative biopsy result) to the rejection samples (samples corresponding to a positive biopsy result).
The problem we now face is the following: Not all patients have received a biopsy.
This means that some but not all of the patients in the non-rejection group have dependent (paired) samples in the rejection group.
How do we statistically account for the fact that some of the samples are paired? Any help is greatly appreciated!
You may start by calculating correlation coefficients (R) for the two cases you are interested in, and then calculating possible regressions of your data in order to analyze them. Furthermore, you may continue with ANOVA. Please also consult the paper attached, which you may find helpful for your analysis.
Question
A very interesting topic. In mathematics, "quantification of randomness" is sometimes referred to as complexity theory (although it is more about pseudorandomness than true randomness); the idea is that a more complicated series is more random. There are also tests for randomness in statistics, and perhaps the most intriguing measure comes from information theory: entropy (also relevant to, and a consequence of, the second law of thermodynamics). In addition there are pseudorandom number generators and true random number generators using quantum computing.
So what I've been trying to do is make a complete list of all available algorithms, books, or even random number generators that will tell me how random a series is, allowing me to "quantify randomness".
There are 125 unique infinite series, which are pseudorandom, that I have discovered and generated based on a rule. Now, how do I test for randomness and quantify it? Is a given series random, or is there a pattern, something that would allow me to predict the next number in the series given that I don't know what it is?
Now, does anyone know of any GitHub links related to any of the above (anything related to quantifying randomness in general that you think would be helpful)?
A book/books on quantifying randomness will be very very helpful too. Actually anything at all...
You should check out seminal and fundamental work by Gregory Chaitin starting in 1965 when he was a student in CUNY (City Univ. of New York) and continuing through the 1970's.
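Beyond Chaitin's algorithmic (Kolmogorov) complexity, two simple and computable ways to quantify how "random" a binary sequence looks are the Shannon entropy of fixed-length blocks and the Wald-Wolfowitz runs test. Neither proves randomness, but low block entropy or an extreme runs z-score flags a detectable pattern. A minimal sketch:

```python
# Sketch: two computable randomness measures for a 0/1 sequence.
import math

def block_entropy(bits, k=3):
    """Shannon entropy (bits per block) over overlapping k-blocks;
    the maximum possible value is k for a uniformly random sequence."""
    counts = {}
    for i in range(len(bits) - k + 1):
        b = tuple(bits[i:i + k])
        counts[b] = counts.get(b, 0) + 1
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def runs_z(bits):
    """Z statistic of the runs test: approximately N(0,1) under randomness."""
    n1, n2 = bits.count(1), bits.count(0)
    runs = 1 + sum(a != b for a, b in zip(bits, bits[1:]))
    mu = 2 * n1 * n2 / (n1 + n2) + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
    return (runs - mu) / math.sqrt(var)

periodic = [0, 1] * 50                  # perfectly patterned sequence
print(block_entropy(periodic))          # 1 bit, far below the maximum of 3
print(runs_z(periodic))                 # extreme z-score: too many runs
```

For a serious battery of such tests, the NIST SP 800-22 statistical test suite for random number generators is the standard reference, and implementations of it exist on GitHub.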
Question
Hello, friends!
I'm working on an imaging genetics projects with an aim to exploring whether certain allelic variations of a gene modulate BOLD responses and behaviors measured in a social-cognitive task.
When I solely looked into the behavioral data, I found no evidence that the genetic variable (e.g., polygenic risk score) significantly predicted individual participants' task performance (e.g., no zero-order correlation).
However, I found that the genetic variable is linearly associated with the activations in one brain region, and the activation values extracted from this area that shows significant genetic modulation in turn correlated with the same task performance analyzed above. Let's suppose that: A = polygenic risk score, B = brain activation, and C = behavioral task performance. All genetic, brain activation, and behavioral data are obtained from the same group of individuals.
What I'm seeing here is as follows:
1. Significant association between A->B
2. Significant association between B->C
3. Non-significant association between A->C.
My (potentially faulty) intuition was that maybe there is a path between these variables, where A is linked with C only via the action of B. Indeed, a mediation analysis based on bootstrapping revealed a significant indirect path linking: A->B->C. No direct effect was significant with or without the mediator. (I understand that this is problematic in Baron-Kenny approach, but I also learned that the A->C relationship is not required as it's equivalent to the total effects, which essentially is the combination between all possible indirect and direct effects.)
In this situation, is it permissible to conclude that the brain activation (B) is mediating the genetic (A) and behavioral (C) variable? I could see someone argue that A->B->C is a more accurate model as you may miss the significant indirect path if you only test the direct path. However, such a postulation just seems counterintuitive. It just doesn't seem to make sense that the genetic modulation on behaviors that was initially absent "suddenly" becomes significant when the brain data are combined.
Or is this just a misguided feeling due to the fact that I happened to perform the behavioral analysis first (mostly due to the format of the paper, where you typically introduce the behavioral results prior to the neuroimaging data), and now I feel like I'm making things up with neuroimaging data that weren't initially considered...
Any inputs will be greatly appreciated!
Thanks!
Although the direct effect is one of the conditions for mediating effect based on Baron and Kenny’s (1986) traditional approach, it is the consensus now that a significant direct relationship between the independent (i.e., three motivational states) and dependent (job performance) variables should not be required for the indirect effect (Mackinnon et al., 2002, 2004, 2007; Shrout and Bolger, 2002).
Baron RM and Kenny DA (1986) The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology 51:1173–1182.
MacKinnon DP, Fairchild AJ and Fritz MS (2007) Mediation analysis. Annual Review of Psychology 58:593–614.
MacKinnon DP, Lockwood CM, Hoffman JM, West SG and Sheets V (2002) A comparison of methods to test mediation and other intervening variable effects. Psychological Methods 7:83–104.
MacKinnon DP, Lockwood CM and Williams J (2004) Confidence limits for the indirect effect: Distribution of the product and resampling methods. Multivariate Behavioral Research 39:99–128.
Shrout PE and Bolger N (2002) Mediation in experimental and nonexperimental studies: New procedures and recommendations. Psychological Methods 7:422–445.
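The bootstrap approach those references recommend can be sketched in a few lines. The data below are simulated so that A affects C only through B, mimicking the situation described in the question; everything here (sample size, effect sizes) is invented for illustration.

```python
# Sketch: bootstrap CI for the indirect effect a*b in A -> B -> C,
# the resampling approach of MacKinnon et al. / Shrout & Bolger.
import numpy as np

rng = np.random.default_rng(1)
n = 200
A = rng.normal(size=n)                    # e.g. polygenic risk score
B = 0.5 * A + rng.normal(size=n)          # brain activation
C = 0.5 * B + rng.normal(size=n)          # task performance

def indirect(idx):
    a = np.polyfit(A[idx], B[idx], 1)[0]              # slope of B ~ A
    X = np.column_stack([np.ones(len(idx)), B[idx], A[idx]])
    b = np.linalg.lstsq(X, C[idx], rcond=None)[0][1]  # slope of C ~ B | A
    return a * b

boot = []
for _ in range(2000):
    idx = rng.integers(0, n, n)           # resample cases with replacement
    boot.append(indirect(idx))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"95% bootstrap CI for a*b: [{lo:.3f}, {hi:.3f}]")
```

If the CI excludes zero, the indirect path is supported even when the total effect A -> C is not itself significant, which is exactly the pattern described in the question.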
Question
I need to calculate the necessary sample sizes to reach alpha = 0.05 and Power = 0.8 for my experiment.
The problem is that I expect there to be no difference between my two groups. So how do I calculate Cohen's d in that case? How many replicates are enough to be sure that my (assumed) non-significant result is indeed because there is no difference, and not because my sample size was too small?
Johannes Plagge An important element in the formula for determining the sample size is the difference between the 2 means. When the difference is small, the required sample size is large; if the difference is zero, the required sample size approaches infinity.
Best !!
AN
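The point above can be shown numerically with the standard normal-approximation sample-size formula for a two-sample comparison, n per group = 2(z_{alpha/2} + z_{beta})^2 / d^2:

```python
# Sketch: required n per group explodes as the assumed Cohen's d shrinks.
from scipy.stats import norm

alpha, power = 0.05, 0.80
z = norm.ppf(1 - alpha / 2) + norm.ppf(power)

for d in [0.8, 0.5, 0.2, 0.05]:
    n = 2 * z**2 / d**2
    print(f"d = {d:>4}: n per group ~ {n:.0f}")
# d = 0.8 needs about 25 per group; d = 0.05 already needs about 6280.
```

This is also why "showing no difference" is usually framed as an equivalence (TOST) test: you pre-specify the smallest effect of interest and power the study for that, rather than for d = 0, which would require infinite n.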
Question
I am currently performing undergraduate research in forensics and I am comparing two types of width measurements (the widths of land and groove impressions on fired bullets), one taken by an automated system and the other performed by my associate manually using a comparison microscope. We are trying to see if the automated method is a more suitable replacement for the manual method. We were recommended to perform a simple linear regression (ordinary least squares) however when it comes to actually interpreting the results we had some slight trouble.
According to p. 218 of Howard Seltman's Experimental Design and Analysis, "sometimes it is reasonable to choose a different null hypothesis for β1. For example, if x is some gold standard for a particular measurement, i.e., a best-quality measurement often involving great expense, and y is some cheaper substitute, then the obvious null hypothesis is β1 = 1 with alternative β1 ≠ 1. For example, if x is percent body fat measured using the cumbersome whole body immersion method, and Y is percent body fat measured using a formula based on a couple of skin fold thickness measurements, then we expect either a slope of 1, indicating equivalence of measurements (on average) or we expect a different slope". In contrast to ordinary linear regression, where β1 = 0 is usually tested, I was wondering how you actually test the hypothesis proposed by Seltman: do we test it the same way you would test the hypotheses of a normal linear regression (finding t values, p-values, etc.)? Or is there a different approach?
I am also open to suggestions as to what other tests could be performed
A quick thank you in advance for those who take the time to help!
Reference:
Your question is specific : can the automated measurement replace the human?
You need to look at the literature on comparison of two measurement methods.
A graphical approach is very useful because you can see if the agreement is dependent on the value being measured. Bland and Altman's original paper is one of the most highly-cited methodology papers ever. https://www.semanticscholar.org/paper/Measuring-agreement-in-method-comparison-studies-Bland-Altman/6b118f54830361182c172306027e3af0516a3c08
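On the mechanics of the β1 = 1 test itself: it uses exactly the same machinery as the usual test, with the null value shifted, t = (b1 - 1) / SE(b1) on n - 2 degrees of freedom. The measurements below are simulated for illustration (the "automated" method deliberately overestimates by 10%).

```python
# Sketch: fit OLS, then form the t statistic for H0: slope = 1 by hand
# from the slope estimate and its standard error.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(1.0, 3.0, 30)               # "gold standard" manual widths
y = 1.10 * x + rng.normal(0, 0.05, 30)      # automated widths (simulated)

slope, intercept, r, p_beta0, se_slope = stats.linregress(x, y)
t = (slope - 1.0) / se_slope                # same machinery, null shifted to 1
p = 2 * stats.t.sf(abs(t), df=len(x) - 2)
print(f"slope = {slope:.3f}, t = {t:.2f}, p = {p:.4g} for H0: slope = 1")
```

Note that p_beta0 returned by linregress tests the default null β1 = 0 and is irrelevant here; only the slope and its standard error are reused. A Bland-Altman plot, as suggested above, complements this by showing whether disagreement depends on the magnitude of the measurement.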
Question
I need to find the number of full time (officer and civilian) law enforcement officers for as many years as possible. I know that uniform crime reports has some data, but I can't seem to find too many years. Anyone know of an already constructed data set?
If I could find part time data too that might be nice. I also need law enforcement spending (total - federal, state, and local).
Law Enforcement Management and Administrative Statistics (LEMAS) published by the Bureau of Justice Statistics (BJS) of the US Department of Justice (DOJ) is an excellent source. Also, some media outlets keep local files. For example the NJ based Asbury Park Press maintains a DataUniverse which contains information about policing in NJ, among other items.
Question
My understanding of conventional practice in this regard is that when more than two independent proportions are being compared (e.g., comparing the proportion of people who contracted COVID-19 in a given period between the <18 year-old, 18-64 year-old, and >64 year-old groups), one of the groups serves as a reference group (which automatically has an OR or RR = 1) from which the corresponding OR or RR of the remaining groups is derived. As far as I know, the OR or RR generated for the latter groups, whether through logistic regression or by-hand computation, has a p-value whose threshold for significance testing is not adjusted with respect to the number of pairwise comparisons performed.
I understand that in the case of more than two independent means, we implement one-way ANOVA/Kruskal-Wallis technique first as omnibus/global hypothesis test which is followed by the appropriate post-hoc tests with the p-value thresholds adjusted if the former test finds something "statistically significant." I imagine that if the same stringency is applied to more than two independent proportions, we should be doing something like a Chi-square test of association (with the assumptions of the test being met) first as omnibus/global hypothesis test, followed by an appropriate post-hoc procedure (possibly Fisher exact tests with p-value threshold adjustment depending on the number of pairwise comparisons performed) if the former test elicits a "statistically significant" difference between the independent proportions.
I would like to ask some clarification (i.e., what concepts/matters I am getting wrong) on this. Thank you in advance.
Indeed do run the chi-square test. In attachment an R script with post hoc tests.
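The two-step approach described in the question (omnibus chi-square, then adjusted pairwise tests) can be sketched as below. The counts per age group are invented for illustration, and Holm is used instead of plain Bonferroni since it is uniformly more powerful.

```python
# Sketch: omnibus chi-square on the full table, then pairwise Fisher
# exact tests with a Holm step-down adjustment computed by hand.
from itertools import combinations
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

table = np.array([[30, 170],    # <18:   cases, non-cases (invented)
                  [90, 210],    # 18-64
                  [80, 120]])   # >64

chi2, p_omnibus, dof, _ = chi2_contingency(table)
print(f"omnibus: chi2 = {chi2:.2f}, df = {dof}, p = {p_omnibus:.4g}")

pairs = list(combinations(range(3), 2))
raw = [fisher_exact(table[[i, j]])[1] for i, j in pairs]

# Holm adjustment: rank-k smallest p is multiplied by (m - k), with
# a running maximum to keep the adjusted values monotone.
order = sorted(range(len(raw)), key=lambda k: raw[k])
adj = [1.0] * len(raw)
running = 0.0
for rank, k in enumerate(order):
    running = max(running, (len(raw) - rank) * raw[k])
    adj[k] = min(1.0, running)
for (i, j), p in zip(pairs, adj):
    print(f"group {i+1} vs {j+1}: Holm-adjusted p = {p:.4g}")
```

Running the pairwise tests only after a significant omnibus result mirrors the ANOVA-then-post-hoc logic described in the question.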
Question
I have 350 patients, which I divide into 4 age groups. I am wondering if (in addition to calculating percentages) any of the age groups are statistically significantly over-represented. Unfortunately there is no "non-patient" group, so I can't create cross tabs for chi^2 test.
Try a linear regression with dummies as groups
Question
Is it possible to use Finite Population Correction (FPC) to decide the minimum required sample size when we use Respondent Driven Sampling (RDS) approach to recruit hidden populations? Kindly share any reading material on this? An introduction to RDS is attached for your information. Thanks in advance for kind support.
Suchira Suranga -
If your weights are good, so you expect reasonable inference, do you have a way to estimate variance? I am not familiar with this, but if you can estimate variance, then you generally want to only apply that to the data not in the sample. You estimate variance from what is in the sample, and apply it only to what was not in the sample. I have a Sage Encyclopedia entry for the finite population correction (fpc) factor, which I explained in terms of both design-based, and model-based methods. I suspect that the same idea applies here.
I obtained permission from Sage to post this.
Cheers - Jim
Question
In my research design, there are 10 groups of 7 people, and each group rates 3 out of the 10 chosen essays. This means each rater scores 3 essays and each essay gets 21 scores (for concrete details, please see the file attached). However, this means there will be missing data (empty cells). After looking at similar questions, it appears that SPSS (I am using ver. 25) would treat empty cells as though they were filled, resulting in skewed results.
Has SPSS solved this problem? Is there any other program that works?
Attached the statistics.
Question
N/A
"What test would offer insight as to group x condition?"
Only a parametric model can do this. As soon as you use ranks (i.e. some kind of "non-parametric analysis"), an interaction is not meaningfully interpretable.
There are more important things to consider when analysing an interaction: it makes a difference whether you assume that effects are additive or multiplicative. A meaningful interpretation also requires that the observed interaction is not due to ceiling or floor effects.
It's easy to do "some test" and to get "some result", but it is tricky to get a meaningful interpretation. I suggest to collaborate with a statistician.
Question
Is it possible to determine a  regression equation using SmartPLS 2?
A basic regression equation looks like this:
y= a + beta X, where a is a constant/intercept.
A classical regression requires several underlying assumptions, mainly normality, to make predictions using unstandardized data. However, SmartPLS doesn't make these parametric assumptions and only deals with standardized data (putting different variables on the same scale, for consistency). Since SmartPLS aims to maximize the variance explained in the criterion variable (rather than producing an equation to predict an absolute number), the constant of a regression equation is not directly available in SmartPLS.
The regression equation can be produced using SPSS, R or even Excel, and the coefficients and significance levels will not be very different from the SmartPLS output.
Question
The histogram is the distribution of the response (i.e., subjective social class), whereas the second image is the P-P plot of the residuals after the GLM using predictors including demographics such as age, gender, education, income... Is the GLM still suitable in this context? If not, what would be the best alternatives?
To extend the question asked by Ronán Michael Conroy , can the "social classes" reasonably be considered as being on an increasing scale in the order as numbered? Perhaps some are different-but-parallel?
If the classes are ordered, then you might consider something not based directly on distributions, but rather on developing a score procedure to quantify the goodness of any given decision rule that predicts a social class (or a distribution over the classes) given the explanatory variables.
One such score procedure is the "ranked probability score", which is based on counting how many categories there are between the "observed" and "predicted" categories. In the most general case, you might be able to specify a cost function to compare any pair of observed" and "predicted" categories.
If you have not done so already, I suggest that you start with a simple graphical approach whereby, for each explanatory variable, you produce two versions of your figure 1 histograms, classified according to whether the item is below or above the median for that explanatory variable. The idea is to see how far the two histograms differ, and which variable gives the most difference.
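The ranked probability score mentioned above is simple to compute: it compares the cumulative predicted distribution over the ordered classes with the cumulative observed (one-hot) outcome, so a miss by many categories costs more than a near miss. A minimal sketch with made-up predictions:

```python
# Sketch of the ranked probability score (RPS) for ordered categories.
import numpy as np

def rps(probs, observed):
    """probs: predicted probabilities over K ordered classes;
    observed: index of the observed class."""
    probs = np.asarray(probs, dtype=float)
    outcome = np.zeros_like(probs)
    outcome[observed] = 1.0
    return np.sum((np.cumsum(probs) - np.cumsum(outcome)) ** 2)

sharp = [0.1, 0.7, 0.1, 0.1]    # prediction peaked on class 2
flat = [0.25, 0.25, 0.25, 0.25]

print(rps(sharp, observed=1))   # peak on the truth: small score
print(rps(flat, observed=1))    # uninformative prediction: larger
print(rps(sharp, observed=3))   # off by two classes: largest of the three
```

Averaging this score over all observations gives a single number for comparing decision rules; the cost-function generalization mentioned above replaces the implicit "one unit per category" distance with any matrix of misclassification costs.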
Question
I have 2 equations, and in each equation I have a coefficient of interest.
Let's say:
eq1: Y = b·X + 2c·X² + 4
eq2: Y = a·X² + 2d·X + 12
Given that the values of a and b change over time,
I am aiming to record the values of a in a list A and the values of b in another list B,
and from their behaviour I want to draw conclusions about the strength of these coefficients.
But I am a bit confused about how to draw such conclusions, and about what the most representative way is to monitor how a and b change over time.
Or is it better to monitor the increase or decrease of a coefficient by summing the differences of the recorded values over time?
I have more coefficients to be monitored, and they may have a value or not. My aim is to build a meaningful classification that can categorise coefficients as useful or not.
Question
Dear community,
I am currently running some generalized linear mixed model analysis with R.
I have a lot of possible predictors (all either continuous or ordinal) and subject as a random factor. It is difficult for me to decide on which predictor makes more sense because some would make sense based on our current knowledge, however some were included because they were promising even though there are no current theory or proof directly supporting their potential role in explaining my dependent variable.
Looking for ways to either select a priori predictors or assess the "goodness of fit" of different possible models, I came across many fascinating posts about r-squared, its bias towards "encouraging" an increased number of predictors (as it necessarily goes up with additional predictors, if I understood correctly), and possible alternatives: marginal r-squared, conditional r-squared, adjusted r-squared, and my new favourite, predicted r-squared.
I find the idea behind predicted r-squared very convincing. However I cannot find anything about its use with mixed model, except this appendix :
Furthermore, in R there doesn't seem to be a direct way to calculate predicted r-squared. I found a home-made function created by Tom Hopper here:
However I am not sure again that it is usable with mixed models. Also because in the PRESS function, lm.influence is used to diagnose the quality of fit of the model, but I am not sure it works with a glmer model.
So, sorry for the very long post, hope it makes sense to somebody, and would be very curious to get your feedback on that.
Note: I am not a statistician (if that was not already obvious from my post), so my understanding of these methods might be a bit superficial.
EDIT: Little note. This question is born from my wish to avoid over-fitting, given the number of predictors that I used in my various attempts to model my data. So if there are better tools to evaluate over-fitting in mixed models, I am also interested.
Two thoughts. For linear outcomes, you can compute R-squares at each level of analysis; however, this is problematic because unless you are careful about how you center your predictors, the meaning of the variance components at higher levels will change and lead to nonsense results. A slightly better approach is a single R-square based on the sum of all variance components; however, the centering issue still remains (grand-mean centering all predictors is really the only way to go here). Pseudo R-squares based on log-likelihoods (assuming full maximum likelihood) are plausible and will work with generalized outcomes.
However, if over-fitting is the worry, consider techniques taught in data science texts related to data splitting, modeling, and prediction tests. Split your data, explore on a subset, and verify fit on another.
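The split-and-verify idea can be sketched with a toy (non-mixed) example: fit increasingly complex models on a training half and score them on a held-out half. Training error can only go down as complexity grows; held-out error typically stops improving, and often worsens, once the model starts overfitting. Data and model degrees here are invented.

```python
# Sketch: detect overfitting by comparing train vs held-out error
# for polynomial fits of growing degree; the true relation is linear.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 60)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, 60)

train, test = np.arange(0, 30), np.arange(30, 60)   # simple 50/50 split
for degree in [1, 3, 9]:
    coefs = np.polyfit(x[train], y[train], degree)
    mse_tr = np.mean((np.polyval(coefs, x[train]) - y[train]) ** 2)
    mse_te = np.mean((np.polyval(coefs, x[test]) - y[test]) ** 2)
    print(f"degree {degree}: train MSE = {mse_tr:.3f}, test MSE = {mse_te:.3f}")
```

The same logic carries over to glmer models: refit the candidate models on a training subset and compare their predictive deviance on the held-out subset, which is essentially what predicted r-squared / PRESS approximates without an explicit split.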
Question
• What is the best metric for model selection?
• Is accuracy derived by cross-validation a good metric? *
• Does the model selected in the model selection process, based on these metrics, surely lead to better results?
Hasan -
I like to suggest comparing performances of alternative models - perhaps picked based on subject matter knowledge or some other method, but not forward or backward selection which can leave out good possibilities - by using a "graphical residual analysis," which you can research. Such a comparison, on a single scatterplot, would be with regard to one specific sample, so you do then need to avoid overfitting to that particular sample, and that is where "cross-validation" is important.
Some comparison statistics may automatically assume homoscedasticity. That is not a good blanket assumption. See https://www.researchgate.net/publication/354854317_WHEN_WOULD_HETEROSCEDASTICITY_IN_REGRESSION_OCCUR.
Best wishes - Jim
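As a rough sketch of the cross-validation idea raised in the question, here is how two candidate models might be compared on out-of-sample error. The data are simulated and the polynomial models are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: a linear trend plus noise; names are illustrative.
n = 120
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

def design(x, degree):
    """Polynomial design matrix with intercept."""
    return np.vander(x, degree + 1, increasing=True)

def cv_mse(x, y, degree, k=5):
    """Mean squared prediction error under k-fold cross-validation."""
    idx = np.arange(len(x))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        beta, *_ = np.linalg.lstsq(design(x[train], degree), y[train],
                                   rcond=None)
        pred = design(x[fold], degree) @ beta
        errs.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(errs))

# Compare a linear model (degree 1) with a needlessly flexible one (degree 8):
# the cross-validated error, not the in-sample fit, is what you select on.
mse1 = cv_mse(x, y, degree=1)
mse8 = cv_mse(x, y, degree=8)
print(mse1, mse8)
```

A flexible model will always win on in-sample fit; cross-validated prediction error is one answer to the question of which metric guards against that.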
Question
I am interested in statistically analyzing a non-normally distributed histogram of the performance of 141 test cases in the 0-200 range, with most of them concentrated in the 50-100 range and the rest being outliers. I would like to answer the following questions:
1. How can a more "correct" average be found, given that the distribution is non-normal and has outliers?
2. How can the means with and without outliers be statistically compared to examine the effect of the outliers?
Since I am not a skilled statistician, I would like to know how to start analyzing such a case: which statistical procedures to use, which statistical tests to use, and which metrics to examine (mean, SD, p-value, significance level, etc.)? And how to analyze it step by step. I would also like to receive references that can help me.
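As one possible starting point for question 1, here is a sketch comparing the mean with two robust alternatives (median and trimmed mean), plus a bootstrap of the mean-median gap to gauge how much the outliers pull the mean. The simulated scores only loosely mimic the situation described and are not the actual data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative skewed sample: most values around 50-100 with a few
# large outliers, loosely mimicking the 141 test cases described.
scores = np.concatenate([rng.normal(75, 10, size=130),
                         rng.uniform(150, 200, size=11)])

mean = scores.mean()
median = np.median(scores)          # robust to outliers

def trimmed_mean(x, prop=0.1):
    """Mean after dropping the lowest and highest `prop` fraction."""
    x = np.sort(x)
    k = int(len(x) * prop)
    return x[k:len(x) - k].mean()

tmean = trimmed_mean(scores)

# Bootstrap percentile interval for (mean - median): if the interval
# sits far from zero, the outliers are materially shifting the mean.
diffs = []
for _ in range(2000):
    s = rng.choice(scores, size=len(scores), replace=True)
    diffs.append(s.mean() - np.median(s))
ci = np.percentile(diffs, [2.5, 97.5])
print(round(mean, 1), round(median, 1), round(tmean, 1), ci)
```

Reporting the median or a trimmed mean alongside the ordinary mean, with a bootstrap interval for their difference, directly addresses question 2 without requiring any normality assumption.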
Question
The conventional models used for dose-response meta-analyses only consider the variation for the response variable. I am trying to find a model for dose-response meta-analysis (both the dose and the response variables are continuous) which also takes the variation (SD) in the dose (exposure) variable and possible confounding variables (for adjustment in the model) into account. I will be thankful if you also guide me about the statistical package that I should use for such an analysis.
Dear Michael Stein,
Not yet. In fact, the DRMETA package in Stata lets you work with continuous response variables. However, I need a model that allows for variation in the dose variable. None of them worked for me.
Question
Hi,
I am currently working on a project where we examine the effect of an intervention on fatigue. The project has been carried out according to a one-group pre-test post-test design. I am uncertain about the best way to conduct a mediation analysis.
In our project we started with a 12-week control period, so we have a baseline measurement at T0 and the pre-test measurement at T12; that is, two measurements for one condition. The following 12 weeks are the intervention period, and we have a post-test measurement after these 12 weeks, at T24.
Our outcome (fatigue) and potential mediators have been measured at all three time points. We conducted a mediation analysis according to Montoya & Hayes (2017). However, this analysis only took into account the measurements at T12 and T24, so the baseline measurement is not considered at all. Now we are wondering: is this the correct way of conducting this analysis, or do we also need to include the baseline measurement? We had the following ideas:
• Conduct the analysis with T12 (pre-test) compared to T24 (post-test), as we did.
• Conduct the analysis with the average of T12 and T0 (pre-test) compared to T24 (post-test).
• Conduct two separate analyses, so compare T0 and T12, and compare T12 with T24.
• Conduct an analysis comparing the change score (T12-T0) with the change score (T24-T12).
What would you guys do?
Next time, plan your analysis before you collect your data. Best wishes, David Booth
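For concreteness only, the change-score idea in the last bullet could be sketched as below. This is a deliberate simplification of the within-subject approach of Montoya & Hayes (2017), not a substitute for it, and all data and variable names are simulated/hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)

n = 60
# Simulated within-person change scores over the intervention period:
# m_change = change in the candidate mediator (T24 - T12),
# y_change = change in fatigue (T24 - T12).
m_change = 0.8 + rng.normal(scale=1.0, size=n)
y_change = -0.5 * m_change + rng.normal(scale=1.0, size=n)

# Path a: mean change in the mediator (did the intervention move it?).
a = m_change.mean()

# Path b: slope of the outcome change on the centered mediator change.
mc = m_change - m_change.mean()
b = (mc @ y_change) / (mc @ mc)

indirect = a * b   # crude point estimate of the indirect effect
print(round(a, 2), round(b, 2), round(indirect, 2))
```

In a real analysis the T0-T12 control-period changes would serve as the comparison for these paths, and the indirect effect would be tested with bootstrap confidence intervals as MEMORE does, so treat this purely as a sketch of the logic.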
Question
I am planning to do a questionnaire study. My variables are:
students' performance (Y)
university performance (X)
lecturers' contribution (moderator)
1. For collecting lecturers' contribution, my sample population is lecturers.
2. For collecting university performance, my sample population is university administration staff.
3. For collecting students' performance, my sample population is both lecturers and university administration staff.
Please tell me: should I design two separate questionnaires in this one study? What would be the best way of designing this questionnaire (or these questionnaires)? Please refer me to any relevant research articles.
Thank you very much for your cooperation
I think you need to settle on a design and approach. I am very sure that whatever approach you decide to take, there will always be a statistical remedy. Let me know what you are settling for, and then I can advise on the statistical analysis to perform. Answer the following questions to enable me to understand the direction of the work.
1. Are you using GPA for students' performance?
2. Are you using teachers to rate themselves, or will students provide the teachers' performance ratings?
3. Are there specific objectives your study seeks to achieve?
All these questions should be answered, and if 'yes' to question 3, then we may need to see some of them.
Question
I have done a genetic association analysis in SPSS. The allelic association came out significant. But when I wanted to compute the odds ratio in both the allelic and the different genotypic models based on cross-tabulation, SPSS showed the odds ratio for control/case instead of case/control. Can I report this ratio in scientific journals? For better clarity, an SPSS file is attached below. Any suggestion will be appreciated. The analysis process is described briefly below.
> Denote control 1 and case 2 in SPSS. Also denote AG, GG genotype 1, and AA genotype 2 in SPSS (Dominant model).
> Import the excel sheet in SPSS.
> Go to Descriptive Statistics and click Crosstabs.
> Put population (case and control column) into row and genotype into the column.
> Select chi-square, percentage, and risk.
> Continue
Further to what Kelvyn Jones said, if one wishes to use CROSSTABS to compute an OR for Cases relative to Controls, then the group variable must be coded such that Case code < Control code.
* Reproduce the result in the original post.
* Denote control 1 and case 2 in SPSS.
* Also denote AG, GG genotype 1, and
* AA genotype 2 in SPSS (Dominant model).
NEW FILE.
DATASET CLOSE ALL.
DATA LIST LIST / Group gtype (2F1) n (F5.0).
BEGIN DATA
1 1 58
1 2 298
2 1 35
2 2 85
END DATA.
VARIABLE LABELS gtype "Genotype".
VALUE LABELS
Group 1 "Control" 2 "Case" /
gtype 1 "AG,GG" 2 "AA".
WEIGHT by n.
CROSSTABS Group by gtype / STATISTICS= RISK.
****************************************************.
* If you want to use CROSSTABS to get the OR for
* Cases relative to Controls, you must recode
* the Group variable so that Case code < Control code.
****************************************************.
RECODE Group (1=2)(2=1) INTO grp.
FORMATS grp (F1).
VALUE LABELS grp 1 "Case" 2 "Control".
CROSSTABS
TABLES Group by gtype /
TABLES grp by gtype / STATISTICS= RISK.
* If you use GENLIN to estimate a logit model,
* you can specify reference categories for both
* the explanatory variable and the outcome variable.
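As a quick arithmetic cross-check of the direction issue, the odds ratio for cases relative to controls can be computed directly from the counts used in the syntax above:

```python
# 2x2 counts from the example above (dominant model):
control = {"AG_GG": 58, "AA": 298}
case    = {"AG_GG": 35, "AA": 85}

# Odds of carrying AG/GG in each group.
odds_case = case["AG_GG"] / case["AA"]
odds_control = control["AG_GG"] / control["AA"]

# OR for cases relative to controls. With Case coded 2 > Control coded 1,
# CROSSTABS would instead report the reciprocal (controls vs. cases).
or_case_vs_control = odds_case / odds_control
print(round(or_case_vs_control, 3))        # cases vs. controls
print(round(1 / or_case_vs_control, 3))    # the "flipped" value
```

The two printed values are reciprocals of one another, which is exactly why the coding order of the group variable matters in CROSSTABS.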
Question
Let us suppose that we have an intervention, for example technology integration in the science classroom. Can we study which mediators could affect the results of the intervention, for example learning motivation? Can we study which moderators could affect the results of the intervention? And why?
For example, can we study how gender mediates the influence of the intervention on learning motivation? Or would it be better to consider the interaction of gender and the intervention?
You are probably mixing up moderators and mediators; you may google these terms. Interaction terms are for moderators, and path analysis is for mediation.
To answer your question on whether mediation or moderation can be tested in intervention studies: yes, you can. However, it should be theoretically sound, and the type of analysis is as mentioned above.
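The moderation-as-interaction point can be sketched as follows. The variables and effect sizes are invented purely for illustration, and a noiseless outcome is used so the interaction coefficient is recovered exactly:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 200
motivation = rng.normal(size=n)        # continuous predictor (hypothetical)
gender = rng.integers(0, 2, size=n)    # 0/1 moderator (hypothetical)

# Outcome built with a true interaction effect of 1.5.
y = 1.0 + 2.0 * motivation + 0.5 * gender + 1.5 * motivation * gender

# Design matrix: intercept, main effects, and the product (interaction) term.
X = np.column_stack([np.ones(n), motivation, gender, motivation * gender])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# beta[3] is the moderation (interaction) effect.
print(np.round(beta, 3))
```

A significant coefficient on the product term is the standard evidence that the moderator (here gender) changes the strength of the predictor's effect; mediation, by contrast, would be tested with a path model, not a product term.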
Question
As far as I know, I can plot either -log10(p-values) on the y-axis of a volcano plot, or the -log10(adjusted p-values) after adjusting them for example with Benjamini-Hochberg.
When plotting adjusted p-values, I can just set the cut-off to -log10(0.05) (see picture 1).
However, when plotting the original p-values, I need to set a different cut-off. You can see in the raw data table that Species 9 already has an adjusted p-value >0.05, while Species 8 is the first one with an adjusted p-value <0.05. Therefore, my cut-off when plotting the original p-values should lie between the original p-values of Species 8 and 9, i.e. between 0.00806 and 0.01165.
In the second picture, I set the cut-off to 0.01165. Is there any way to determine more accurately, where in between this cut-off should be set?
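One principled choice is to note that Benjamini-Hochberg rejects exactly those hypotheses whose raw p-values fall at or below (k/m)·α, where k is the number of rejections and m the number of tests; that value always lies between the last rejected and first non-rejected raw p-value, so it can serve as the line on the raw-p scale instead of an arbitrary point between the two species. A minimal sketch, using a toy p-value list rather than the questioner's data:

```python
import numpy as np

def bh_adjusted(pvals):
    """Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working back from the largest p-value.
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(adj, 1.0)
    return out

def bh_raw_cutoff(pvals, alpha=0.05):
    """BH rejection line on the raw p-value scale: (k/m)*alpha,
    where k is the number of rejected hypotheses."""
    p = np.sort(np.asarray(pvals, dtype=float))
    m = len(p)
    below = np.nonzero(p <= np.arange(1, m + 1) / m * alpha)[0]
    if below.size == 0:
        return 0.0
    k = below[-1] + 1   # number of rejections
    return k / m * alpha

# Toy example (not the data from the question):
p = [0.001, 0.01, 0.02, 0.04, 0.5]
print(bh_adjusted(p))     # adjusted p-values
print(bh_raw_cutoff(p))   # raw-scale cut-off for the volcano plot
```

With this, the horizontal line on a raw-p volcano plot is defined by the procedure itself rather than eyeballed between two neighbouring p-values.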
Question