# Advanced Statistical Analysis - Science topic

Questions related to Advanced Statistical Analysis
• asked a question related to Advanced Statistical Analysis
Question
Greetings,
I am currently in the process of conducting a Confirmatory Factor Analysis (CFA) on a dataset consisting of 658 observations, using a 4-point Likert scale. As I delve into this analysis, I have encountered an interesting dilemma related to the choice of estimation method.
Upon examining my data, I observed a slight negative kurtosis of approximately -0.0492 and a slight negative skewness of approximately -0.243 (please refer to the attached file for details). Considering these properties, I initially leaned towards utilizing the Diagonally Weighted Least Squares (DWLS) estimation method, as existing literature suggests that it takes into account the non-normal distribution of observed variables and is less sensitive to outliers.
However, to my surprise, when I applied the Unweighted Least Squares (ULS) estimation method, it yielded significantly better fit indices for all three factor solutions I am testing. In fact, it even produced a solution that seemed to align with the feedback provided by the respondents. In contrast, DWLS showed no acceptable fit for this specific solution, leaving me to question whether the assumptions of ULS are being violated.
In my quest for guidance, I came across a paper authored by Forero et al. (2009; DOI: 10.1080/10705510903203573), which suggests that if ULS provides a better fit, it may be a valid choice. However, I remain uncertain about the potential violations of assumptions associated with ULS.
I would greatly appreciate your insights, opinions, and suggestions regarding this predicament, as well as any relevant literature or references that can shed light on the suitability of ULS in this context.
Thank you in advance for your valuable contributions to this discussion.
Best regards, Matyas
1. Data Characteristics (Distribution): As you mentioned, DWLS is generally recommended when your data depart from normality, which is common in Likert-scale data. Negative skewness and negative kurtosis suggest non-normality. However, the degree of non-normality matters. If the deviation from normality is not severe, ULS may still be an acceptable choice.
2. Sample Size: You have 658 observations, which is a decent sample size. In CFA, a larger sample size tends to favor the use of ULS because it can provide more robust results even with non-normally distributed data.
3. Fit Indices: It's essential to consider the fit indices and how they relate to your research goals. If ULS produces better fit indices and aligns with your theoretical model, it may be a valid choice. However, don't rely solely on fit indices; consider the theoretical and practical significance of your findings as well.
4. Theoretical Justification: Consider whether ULS aligns with the underlying theory of your model. If it makes sense conceptually and theoretically, it can be a valid choice, even if it doesn't assume multivariate normality.
5. Assumption Violations: ULS does not require multivariate normality, but like other least-squares estimators it can still be affected by severe outliers or other data problems. If your data have such issues, ULS may not be appropriate.
6. Model Complexity: The complexity of your CFA model can influence the choice of estimator. More complex models with many parameters may benefit from DWLS, which can provide better estimation in such cases.
7. Literature and Prior Research: It's good that you're consulting the literature. If there's prior research or expert consensus suggesting that ULS is a suitable choice for similar situations, that can provide support for your decision.
8. Robustness Checks: You can perform robustness checks by comparing the results from both ULS and DWLS. If they consistently yield similar results for your specific research question, it adds confidence to your choice.
9. Reviewer or Advisor Feedback: If you're conducting this analysis for publication or as part of an academic project, consider seeking feedback from your peers, advisor, or reviewers. They might offer valuable insights into the choice of estimator.
10. Sensitivity Analysis: You can conduct sensitivity analyses by applying both ULS and DWLS and comparing the results. This can help you assess how sensitive your findings are to the choice of estimator.
• asked a question related to Advanced Statistical Analysis
Question
I have a longitudinal model and the stability coefficients for one construct change dramatically from the first and second time point (.04) to the second and third time point (.89). I have offered a theoretical explanation for why this occurs, but have been asked about potential model bias.
Why would this indicate model bias? (A link to research would be helpful).
How can I determine whether the model is biased or not? (A link to research would be helpful).
Thanks!
That makes sense. Are you comparing the cross-lagged panel (auto)regression (path) coefficients to zero-order correlations? This could be part of the issue (explain the "discrepancy"/low autoregressive stability coefficient). Regression coefficients are not equal to zero-order (bivariate) correlations. The regression coefficients take the correlation with other independent variables into account. This may explain why the autoregressive "stability" coefficients in your model look very different from the zero-order correlations. It is impossible to know without looking at your data and model in more detail.
The model fit does not look completely horrible at first sight but the chi-square test is significant and the RMSEA value is a bit high. I would take a look at model residuals and/or modification indices to find out where the model may be misspecified.
• asked a question related to Advanced Statistical Analysis
Question
Hello, could someone assist me in interpreting the results of the sequential Mann-Kendall Sneyer test? Indeed, according to Dufek (2008: Precipitation variability in São Paulo State, Brazil), "In the absence of any trend, the graphical representation of the direct series (u(t)) and the backward series (u'(t)) obtained with this method yields curves that overlap several times." In my case, I observe two to three overlaps, often with sequences that exhibit significant trends. Should I also conclude that there is an absence of trends in my dataset?
• asked a question related to Advanced Statistical Analysis
Question
First, I sent these numbers to the Bard chatbot:
6020 5603 54830 6750 23000 12370 10000
It responded with this:
You have listed 7 numbers.
• 6020
• 5603
• 54830
• 6750
• 23000
• 12370
• 10000
The sum of these numbers is 103573.
The average of these numbers is 14825.
The median of these numbers is 6750.
The mode of these numbers is 10000, which appears twice.
The range of these numbers is 23000 - 10000 = 13000.
The standard deviation of these numbers is 3844.74.
Is there anything else I can help you with?
Then I asked it to standardize them.
Sure. Standardizing the numbers will mean subtracting the mean and then dividing by the standard deviation. This will transform the numbers so that they have a mean of 0 and a standard deviation of 1.
The standardized numbers are:
• -2.18
• -2.88
• 2.44
• -1.38
• 3.22
• -0.78
• 0.00
----------------------------------------
Bing Chat responded to the same question with this:
I can standardize the numbers you provided. Standardization is the process of transforming data to have a mean of 0 and a standard deviation of 1. This is done by subtracting the mean from each data point and then dividing by the standard deviation.
Here are the standardized values for the numbers you provided:
• 6020: -0.705
• 5603: -0.801
• 54830: 2.891
• 6750: -0.524
• 23000: 0.693
• 12370: -0.191
• 10000: -0.363
Is there anything else I can help with? 😊
------------------------------------------
Based on the search results, it is possible to standardize the numbers provided by converting them into z-scores. Standardizing data is a common practice in statistics to compare variables that have different scales or units of measurement. To standardize the numbers provided, one would need to calculate the mean and standard deviation of the data and then transform each value to its corresponding z-score
Here are the steps to standardize the numbers:
1. Calculate the mean of the data: Mean = (6020 + 5603 + 54830 + 6750 + 23000 + 12370 + 10000) / 7 = 15038.43
2. Calculate the standard deviation of the data: Standard deviation = sqrt(((6020 - 15038.43)^2 + (5603 - 15038.43)^2 + (54830 - 15038.43)^2 + (6750 - 15038.43)^2 + (23000 - 15038.43)^2 + (12370 - 15038.43)^2 + (10000 - 15038.43)^2) / 6) = 17523.22
3. Calculate the z-score for each value: Z-score = (X - Mean) / Standard deviation
• Z-score for 6020 = (6020 - 15038.43) / 17523.22 = -0.706
• Z-score for 5603 = (5603 - 15038.43) / 17523.22 = -0.768
• Z-score for 54830 = (54830 - 15038.43) / 17523.22 = 2.238
• Z-score for 6750 = (6750 - 15038.43) / 17523.22 = -0.426
• Z-score for 23000 = (23000 - 15038.43) / 17523.22 = 0.455
• Z-score for 12370 = (12370 - 15038.43) / 17523.22 = -0.152
• Z-score for 10000 = (10000 - 15038.43) / 17523.22 = -0.853
By standardizing the numbers, we have transformed them into a common scale that allows for easier comparison and analysis. It is important to note that the interpretation of z-scores depends on the context and the distribution of the data.
-----------------------------------------------------
So, as you can see, they can't even get the sum of the data correct. My question is: why can't these models get the sum of these numbers right even when they apply all the correct steps?
Abolfazl Ghoodjani, can you be more specific with your answer? By the way, the text above is the output of ChatGPT and LaMDA. My question is why these models can't get the sum of these numbers right even when they apply all the correct steps.
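For reference, the figures quoted from the chatbots can be checked by actually executing the arithmetic rather than generating it as text. The sketch below is only an illustration using Python's standard library; it assumes the sample standard deviation (n - 1 denominator) is the one wanted, which the original prompts did not specify.

```python
import statistics

# The seven numbers from the question.
values = [6020, 5603, 54830, 6750, 23000, 12370, 10000]

total = sum(values)
mean = statistics.mean(values)
median = statistics.median(values)
value_range = max(values) - min(values)
sample_sd = statistics.stdev(values)  # assumption: n - 1 denominator

# Standardize: subtract the mean, then divide by the standard deviation.
z_scores = [(v - mean) / sample_sd for v in values]

print("sum:", total)
print("mean:", mean)
print("median:", median)
print("range:", value_range)
print("sample SD:", round(sample_sd, 2))
for v, z in zip(values, z_scores):
    print(v, round(z, 3))
```

Running it gives a sum of 118573 and a mean of 16939, so none of the three quoted answers started from the correct total. A plausible reading is that these models produce digits as likely text rather than by computation, which is why delegating the arithmetic to code is the safer route.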
• asked a question related to Advanced Statistical Analysis
Question
Hello !
Regarding the significance level of a test, I would like to know when we can increase the significance level α to 0.1.
Thank you in advance.
The significance level of a test is a predetermined threshold that is used to determine if the evidence from the sample provides enough support to reject the null hypothesis. It is typically denoted by the symbol α. The significance level is relevant in several ways: Type I error rate, confidence level, decision-making, and sample size determination.
In summary, the significance level guides the researcher in interpreting the test results and making informed decisions regarding the null and alternative hypotheses.
• asked a question related to Advanced Statistical Analysis
Question
In plant breeding, what are the uses of the discriminant function?
Discriminant function technique involves the development of selection criteria on a combination of various characters and aids the breeder in indirect selection for genetic improvement in yield. In plant breeding, the selection index refers to a linear combination of characters associated with yield.
• asked a question related to Advanced Statistical Analysis
Question
I am looking for a graphical tool, like Visual Basic, to attach R code to interactive graphical buttons and text boxes.
For example, I want to design a Windows application with a graphical interface for calculating body mass index (BMI). I want two boxes for entering weight and height and a button to run the calculation. When the button is clicked, I want the code below to be run.
BMI <- box1/(box2^2)
R in Power BI ?
It should not contain complex R syntaxes though.
• asked a question related to Advanced Statistical Analysis
Question
Some of the people who consult here are only users of statistics, while others are the ones who develop statistics, and we would love for people to use it correctly.
But I believe that many arrive late, always after the experimentation is finished, asking "what statistical procedure can I apply?". Perhaps they do not know that they should consult beforehand, with the question or hypothesis they wish to answer or verify, since that would allow a better answer. On the other hand, some come with simple queries, but usually a statistics class is given as an answer, which I feel in some cases comes too late. In some cases it is extremely necessary, but in others it opens a debate that leads to serendipity. Wouldn't it be better to try to advise them in a more precise way? I read them:
precisely: two sides of the same coin.
• asked a question related to Advanced Statistical Analysis
Question
I’ve got a data set and I want to calculate R2 for linear regression equation from another study.
For example, I have derived an equation from my data (with R2) and I want to test how other equations perform on my data (and thus calculate R2 for them). Then, I want to compare R2 from my data set with R2 from derivation studies.
Do you have any software for this? Any common statistical software could cope with this task (e.g. SPSS or SAS)? Maybe you have any tutorials on YouTube for this?
Hello Pavel Makovsky. In that case, modify the code I posted earlier. With your dataset open and active:
REGRESSION
/STATISTICS COEFF OUTS CI(95) R ANOVA
/DEPENDENT YourDV
/METHOD=ENTER your explanatory variables
/SAVE PRED(yhat1).
* Now use another set of coefficients to generate yhat2.
COMPUTE yhat2 = use the other regression coefficients here but with the variables in your dataset.
REGRESSION
/STATISTICS R
/DEPENDENT YourDV
/METHOD=ENTER yhat1.
REGRESSION
/STATISTICS R
/DEPENDENT YourDV
/METHOD=ENTER yhat2.
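As a rough illustration of what the syntax above computes: regressing the outcome on saved predictions reports the squared correlation between observed and predicted values, which can differ from the R2 obtained when the external equation is applied exactly as published (1 - SS_res / SS_tot). The sketch below shows both quantities in Python; the data and the "published" coefficients are hypothetical and only stand in for your own.

```python
def r_squared_fixed(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot, using the predictions exactly as given."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - hi) ** 2 for yi, hi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def squared_correlation(y, y_hat):
    """Squared Pearson correlation between observed and predicted values."""
    mean_y = sum(y) / len(y)
    mean_h = sum(y_hat) / len(y_hat)
    s_yh = sum((yi - mean_y) * (hi - mean_h) for yi, hi in zip(y, y_hat))
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    s_hh = sum((hi - mean_h) ** 2 for hi in y_hat)
    return s_yh ** 2 / (s_yy * s_hh)


# Hypothetical data set (stand-in for your own variables).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

# Hypothetical equation taken from another study: y = 1.0 + 1.5 * x
y_hat_external = [1.0 + 1.5 * xi for xi in x]

r2_fixed = r_squared_fixed(y, y_hat_external)
r2_corr = squared_correlation(y, y_hat_external)
print("R^2 with coefficients used as published:", round(r2_fixed, 3))
print("Squared correlation (what regressing y on yhat reports):", round(r2_corr, 3))
```

The squared correlation is never smaller than the fixed-coefficient R2, because it implicitly allows the predictions to be rescaled and shifted. Which of the two is the fairer comparison with the derivation study depends on whether recalibration is acceptable for your purpose.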
• asked a question related to Advanced Statistical Analysis
Question
Dear colleagues,
I analyzed my survey data using binary logistic regression, and I am trying to assess the results by looking at the p-value, B, and Exp(B) values. However, the task is also to specify the significance of the marginal effects. How to interpret the results of binary logistic regression considering the significance of the marginal effects?
Best,
To specify the significance of the marginal effects in binary logistic regression analysis, you can interpret the results by examining the p-values, B (coefficient estimates), and Exp(B) (exponentiated coefficient estimates) values. The p-value indicates the statistical significance of each predictor variable's effect on the outcome variable. A low p-value (typically less than 0.05) suggests a significant effect. The B values represent the estimated change in the log-odds of the outcome for a one-unit change in the predictor, with positive values indicating a positive association and negative values indicating a negative association. Exp(B) provides the odds ratio, which quantifies the change in odds for a one-unit increase in the predictor. An Exp(B) greater than 1 indicates an increased odds of the outcome, while a value less than 1 implies a decreased odds. By considering the significance of the marginal effects, you can determine the direction, magnitude, and statistical significance of the predictor variables' impacts on the binary outcome variable in your logistic regression analysis.
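The quantities described above (B and Exp(B)) are on the log-odds and odds scales, whereas a marginal effect is usually understood as the change in predicted probability. A minimal sketch of the average marginal effect for one continuous predictor is given below; the coefficients and predictor values are hypothetical, and the standard error needed to judge its significance (for example via the delta method or a bootstrap) is deliberately not shown.

```python
import math


def logistic(z):
    """Inverse logit: converts log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical fitted model: logit(p) = b0 + b1 * x
b0, b1 = -1.0, 0.8
x_values = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]  # hypothetical observed predictor values

# For a continuous predictor, dp/dx = b1 * p * (1 - p) at each observation.
predicted = [logistic(b0 + b1 * x) for x in x_values]
marginal_effects = [b1 * p * (1.0 - p) for p in predicted]

# Average marginal effect (AME): the mean of the observation-level effects.
ame = sum(marginal_effects) / len(marginal_effects)
odds_ratio = math.exp(b1)

print("Exp(B) (odds ratio):", round(odds_ratio, 3))
print("Average marginal effect on the probability scale:", round(ame, 4))
```

Because p(1 - p) can be at most 0.25, the marginal effect is always smaller in magnitude than B/4, which is one reason it is reported separately from the odds ratio.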
• asked a question related to Advanced Statistical Analysis
Question
I constructed a linear mixed-effects model in Matlab with several categorical fixed factors, each having several levels. Fitlme calculates confidence intervals and p values for n-1 levels of each fixed factor compared to a selected reference. How can I get these values for other combinations of factor levels? (e.g., level 1 vs. level 2, level 1 vs. level 3, level 2 vs. level 3).
Thanks,
Chen
First, to change the reference level, you can specify the order of the categories in the categorical array:
categorical(A,[1, 2, 3],{'red', 'green', 'blue'}) or
categorical(A,[3, 2, 1],{'blue', 'green', 'red'})
Second, you can specify the appropriate hypothesis (contrast) matrix for the coefTest function to compare any pair of categories.
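To make the idea of a hypothesis matrix concrete, the sketch below works through the arithmetic for one contrast (level 3 vs. level 2) in Python. The coefficient estimates and their covariance matrix are hypothetical, and a normal approximation is assumed for the p-value, so the result will not match a test that uses denominator degrees of freedom.

```python
import math

# Hypothetical fixed-effect estimates, with level 1 as the reference:
# [intercept, level2 - level1, level3 - level1]
beta = [10.0, 2.0, 5.0]

# Hypothetical covariance matrix of those estimates.
cov = [
    [0.50, -0.25, -0.25],
    [-0.25, 0.60, 0.30],
    [-0.25, 0.30, 0.70],
]

# Contrast row for "level 3 vs. level 2": (level3 - level1) - (level2 - level1).
h = [0.0, -1.0, 1.0]

estimate = sum(hi * bi for hi, bi in zip(h, beta))
variance = sum(h[i] * cov[i][j] * h[j] for i in range(3) for j in range(3))
std_error = math.sqrt(variance)

# Two-sided p-value under a normal approximation (an assumption of this sketch).
z = estimate / std_error
p_value = math.erfc(abs(z) / math.sqrt(2.0))

print("difference:", estimate)
print("standard error:", round(std_error, 4))
print("approximate p-value:", round(p_value, 5))
```

Contrasts against the reference level need only a single nonzero entry, which is why they already appear in the default coefficient table.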
• asked a question related to Advanced Statistical Analysis
Question
Has anyone conducted a meta-analysis with Comprehensive Meta-Analysis (CMA) software?
I have selected: comparison of two groups > means > Continuous (means) > unmatched groups (pre-post data) > means, SD pre and post, N in each group, Pre/post corr > finish
However, it is asking for pre/post correlations which none of my studies report. Is there a way to calculate this manually or estimate it somehow?
Thanks!
Yes, it is possible to estimate the pre-post correlation coefficient in a meta-analysis using various methods, such as imputing a value or using a range of plausible values. Here are a few options:
1. Imputing a value: If none of your studies report the pre-post correlation, you can impute a value based on previous research or assumptions. A commonly used estimate is a correlation coefficient of 0.5, which assumes a moderate positive relationship between the pre and post-measures. However, it is important to note that this value may not be appropriate for all studies or research questions.
2. Using a range of plausible values: Another option is to use a range of plausible correlation coefficients in the analysis, rather than a single value. This can help to account for the uncertainty and variability in the data. A common range is 0 to 0.8, which covers a wide range of possible correlations.
3. Contacting study authors: If possible, you can try to contact the authors of the included studies to request the missing information or clarification about the pre-post correlation coefficient. This can help to ensure that the analysis is based on accurate and complete data.
Once you have estimated the pre-post correlation coefficient, you can enter it into the appropriate field in the CMA software and proceed with the analysis. It is important to carefully consider the implications of the chosen correlation coefficient and to conduct sensitivity analyses to test the robustness of the results.
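The sensitivity analysis mentioned above rests on the relation SD_change^2 = SD_pre^2 + SD_post^2 - 2 * r * SD_pre * SD_post. The sketch below, using hypothetical standard deviations, shows how the change-score SD moves with the assumed correlation, and how r can be recovered when a study reports the SD of the change scores.

```python
import math


def sd_change(sd_pre, sd_post, r):
    """SD of the change score implied by a given pre-post correlation r."""
    return math.sqrt(sd_pre ** 2 + sd_post ** 2 - 2.0 * r * sd_pre * sd_post)


def implied_correlation(sd_pre, sd_post, sd_diff):
    """Pre-post correlation implied by a reported SD of the change scores."""
    return (sd_pre ** 2 + sd_post ** 2 - sd_diff ** 2) / (2.0 * sd_pre * sd_post)


# Hypothetical study values.
sd_pre, sd_post = 10.0, 10.0

# Sensitivity analysis over a range of plausible correlations.
for r in (0.0, 0.25, 0.5, 0.8):
    print("r =", r, "-> SD of change =", round(sd_change(sd_pre, sd_post, r), 3))

# If a study does report the SD of change, r can be backed out.
print("implied r:", implied_correlation(sd_pre, sd_post, sd_diff=10.0))
```

The higher the assumed correlation, the smaller the change-score SD and the larger the standardized effect, which is why the choice of r should be reported alongside the results.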
• asked a question related to Advanced Statistical Analysis
Question
Hello everyone,
I'm going to conduct a meta-analysis of psychological interventions relevant to a topic via Comprehensive Meta-Analysis (CMA) software. I have a few questions/points for clarification:
- From my understanding, I should only meta-analyse interventions that have used a pre-test, post-test (with and/or without follow-up) design, as meta-analysing post-test only designs with the others is not effective. Is my understanding correct?
- Can I combine between-subjects and within-subjects designs together or do I need to meta-analyse them separately?
Hello Ravisha,
If cases are randomly assigned to treatment condition, there's no reason that post-only design results should be considered uninformative.
Designs with pre-post measures can offer the added benefits of: (a) allowing for estimation of change (though unless scores are completely reliable, the change scores will be less reliable than either the pre- or post- score by itself); or (b) pre-scores can be used as a covariate, to adjust for randomly occurring differences across groups.
One noted threat to pre-post designs is that if the interval separating them is too short, the post-results, and therefore group comparisons, can be biased, especially with measures of affect.
Ultimately, the answer depends on what your target ES might be: If it is post-treatment differences across groups/conditions, then either design can contribute. You could estimate ES separately by study type to see whether inclusion of pre-test appears to account for differences.
If it is strictly pre-post change, then post-only designs can't contribute (again, though, note the caveats above).
Good luck with your work.
• asked a question related to Advanced Statistical Analysis
Question
I have ordinal data on happiness of citizens from multiple countries (from the European Value Study) and I have continuous data on the GDP per capita of multiple countries from the World Bank. Both of these variables are measured at multiple time points.
I want to test the hypothesis that countries with a low GDP per capita will see more of an increase in happiness with an increase in GDP per capita than countries that already have a high GDP per capita.
My first thought to approach this is that I need to make two groups; 1) countries with low GDP per capita, 2) countries with high GDP per capita. Then, for both groups I need to calculate the correlation between (change in) happiness and (change in) GDP per capita. Lastly, I need to compare the two correlations to check for a significant difference.
I am stuck, however, on how to approach the correlation analysis. For example, I don't know how to (and if I even have to) include the repeated measures of the different time points at which the data were collected. If I just base my correlations on one time point, I feel like I am not really testing my research question, considering I am talking about an increase in happiness and an increase in GDP, which is a change over time.
If anyone has any suggestions on the right approach, I would be very thankful! Maybe I am overcomplicating it (wouldn't be the first time)!
Collect data on both variables at the same time points and treat each paired observation as one sample. After collecting N such samples over time, perform a regression analysis on them, and the correlation coefficient will be obtained.
• asked a question related to Advanced Statistical Analysis
Question
Hello! I would like to address the experts regarding a question about conducting a statistical analysis using only nominal variables. Specifically, I would like to compare the responses of survey participants who answered "Yes" or "No" to the question of whether they take certain medications, and analyze the data by different criteria such as education level, economic status, marital status, etc. I have conducted a Chi-squared test to determine if there is a significant difference between the variables, but now I would like to compare whether or not this medicine is taken across the groups within each variable, for example in the education variable (higher, secondary, vocational and basic education). Is there a statistical test similar to Tukey's test that is suitable for nominal variables? I would also like to know if it is possible to create a column chart with asterisks above the columns indicating the significant differences between them based on this test for nominal variables.
I usually use Statistica (StatSoft) and RStudio. But none of my attempts to do post-hoc analysis for nominal variables in either of them were successful. In RStudio I tried pairwise.prop.test(cont_table, p.adjust.method = "bonferroni")
But I got an error:
Error in pairwise.prop.test(cont_table, p.adjust.method = "bonferroni") :
'x' must have 2 columns
I assume that this is due to the fact that I have groups in one of the variables and not two.
What should I do?
Thank you in advance for your help!
In attachment an R script with the BH post-hoc test based on Benjamini & Hochberg (1995). You could replace this with Bonferroni, but in my opinion this last method is too conservative.
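The attached script is not reproduced here, but the general idea (all pairwise comparisons of the "Yes" proportion, followed by a Benjamini-Hochberg adjustment) can be sketched as follows. This Python version is only an illustration with hypothetical counts: it uses a pooled two-proportion z-test without continuity correction, so its p-values will differ somewhat from tests that apply one.

```python
import math
from itertools import combinations


def two_proportion_p(yes_a, n_a, yes_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test (no continuity correction)."""
    pooled = (yes_a + yes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (yes_a / n_a - yes_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: education level -> (number answering "Yes", group size)
groups = {
    "higher": (40, 100),
    "secondary": (55, 100),
    "vocational": (30, 80),
    "basic": (20, 60),
}

pairs = list(combinations(groups, 2))
raw = [two_proportion_p(*groups[a], *groups[b]) for a, b in pairs]
adj = benjamini_hochberg(raw)

for (a, b), p_raw, p_adj in zip(pairs, raw, adj):
    print(a, "vs", b, "raw p =", round(p_raw, 4), "BH-adjusted p =", round(p_adj, 4))
```

Note the input layout: one row per group with a "Yes" count and a total (or a "Yes" and a "No" count). That two-column, one-row-per-group shape is also what the error message quoted in the question is asking for.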
• asked a question related to Advanced Statistical Analysis
Question
The variables I have- vegetation index and plant disease severity scores, were not normal. So, I did log10(y+2) transformation of vegetation index and sqrt(log10(y+2)) transformation of plant disease severity score. Plant disease severity is on the scale of 0, 10, 20, 30,..., 100 and were scored based on visual observations. Even after combined transformation, disease severity scoring data is non-normal but it improves the CV in simple linear regression.
Can I proceed with the parametric test, a simple linear regression between the log transformed vegetation index (normally distributed) and combined transformed (non-normal) disease severity data?
Why would these variables have to be normal? As far as I understand your problem, a logistic model might do well. You can try it with my software "FittingKVdm", but if you can send me some data, I can try it for you.
• asked a question related to Advanced Statistical Analysis
Question
Hi everyone! I need to examine interactions between categorical and continuous predictors in a path analysis model. Which strategy would be more accurate: 1) including the categorical variable, the continuous one and the interaction as separate terms, or 2) running a multigroup analysis?
I have the same problem with several models. For instance, examining potential differences in the effects of executive function (continuous predictor) on reading comprehension (outcome variable) among children from different grades (categorical predictor).
Thank you so much for your help!
Very helpful paper with references:
Best,
• asked a question related to Advanced Statistical Analysis
Question
I want to study the relationship between parameters for physical activity in a lifespan and the outcome of pain (binary). I have a longitudinal data with four measurement, hence repeated measures.
Should I use GEE or a mixed model? And does anyone have guides on how to rearrange my dataset so it fits these methods? I have tried GEE with both long and wide data, but I keep getting errors.
To clarify, my outcome is binary (at the last measurement) and further my independent variables are measured at four times (with the risk of them being correlated).
Yes, that would be correct.
As your outcome/ dependent measure is only at one time point you would not have to consider time in relation to the outcome, so not a longitudinal model (no variation over time to model).
That is not to say that time may or may not be important in your research question. If trends/ differences/ averages in the repeated measures of independent variables are important in relation to your outcome then you can find ways to incorporate these things into your modelling strategy (in the way that you choose to use your repeated independent measures - being guided by research questions).
• asked a question related to Advanced Statistical Analysis
Question
How can I define a graphics space to make plots like the attached figure below using the graphics package in R?
I need help locating each position (centering) using the "mar" argument.
Reghais.a
You can use layout() to define a matrix of plots with different heights/widths. In your case, this will produce a layout similar to your picture:
m <- rbind(c(0,1,0), c(2:4))
layout(m, widths = c(1,1,1.5), heights = c(1,1))
par(oma = c(3,3,3,3), mar = c(0,0,0,0), las = 1, xaxs = "i", yaxs="i")
plot(NA, xlim = c(-1, 9), ylim = c(-1, 4), xaxt="n", yaxt="n")
axis(2, at = 0:4)
axis(3, at = 0:4 * 2)
plot(NA, xlim = c(0, 4), ylim = c(0, 3.5), xaxt="n", yaxt="n")
axis(1, at = 0:4)
axis(2)
plot(NA, xlim = c(-1, 9), ylim = c(0, 3.5), xaxt="n", yaxt="n")
plot(NA, xlim = c(0, 7), ylim = c(0, 3.5), xaxt="n", yaxt="n")
axis(1, at = 0:7)
axis(4)
• asked a question related to Advanced Statistical Analysis
Question
I am tied up with the problem below, and it would be a pleasure to hear your ideas.
I've written a program in two languages, Python and R, but each came to a completely different result. Before you jump to a conclusion, I should state that:
- Every word of the code in two languages has multiple checks and is correct and represents the same thing.
- The used packages in two languages are the same version.
So, what do you think?
The code is about applying deep neural networks for time series data.
Good morning. Without the code it is difficult to know where the difference comes from. I do not use Python, I work in R, but maybe the difference is due to the dataset-splitting stage. Did you try setting the same value for the random number generator, for example seed(1234)? (If my memory is good, this function is also used in Python.) Were your results and evaluation metrics totally different? In that case, maybe there is a reliability issue in your model. You should check your data preparation and feature selection.
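The seeding suggestion can be illustrated on the Python side as below. The helper name `split_indices` is hypothetical, and the random shuffle is shown only to demonstrate reproducibility; for time series a chronological split is usually preferred.

```python
import random


def split_indices(n_rows, test_fraction, seed):
    """Return (train, test) row indices from a seeded shuffle."""
    rng = random.Random(seed)  # local generator, so other code cannot disturb it
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


# The same seed reproduces the same split on every run.
train_a, test_a = split_indices(100, 0.2, seed=1234)
train_b, test_b = split_indices(100, 0.2, seed=1234)
print("identical splits:", train_a == train_b and test_a == test_b)

# A different seed is expected to give a different split.
train_c, test_c = split_indices(100, 0.2, seed=4321)
print("first test rows, seed 1234:", test_a[:5])
print("first test rows, seed 4321:", test_c[:5])
```

One caveat: using the same seed value in R and in Python should not be expected to produce the same random numbers, since the generators are seeded differently. A more direct way to rule out the split as the cause is to save the train/test row indices to a file once and load that file in both programs. Random weight initialization in the networks is a further source of differences that a shared split does not remove.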
• asked a question related to Advanced Statistical Analysis
Question
Hi, I am looking for a way to derive standard deviations from estimated marginal means using mixed linear models with SPSS. I already figured where SPSS provides the pooled SD to calculate the SMD, however, I still need the SD of the means. Any help is appreciated!
I was unsure how to pool SD from the SE without knowing N. A method I found used the "baseline SD" for each group.
• asked a question related to Advanced Statistical Analysis
Question
I have data in a 30 x 1 matrix. Is it possible to find the optimal value using the gradient descent algorithm? If yes, please share the procedure or a link to the detailed background theory behind it. It will help me proceed further with my research.
It depends on the cost function and the model that you are using. Gradient descent will converge to the optimal value (or very close to it) of the training loss function, given a properly set learning rate, if the optimization problem is convex with respect to the parameters. That is the case for linear regression using the mean squared error loss, or logistic regression using cross entropy. For the case of neural networks with several layers and non-linearities none of these loss functions make the problem convex, therefore there is no guarantee that you will find the optimal value. The same would happen if you used logistic regression with the mean squared error instead of cross entropy.
An important thing to note is that when I talk about the optimal value, I mean the value that minimizes the loss in your training set. It is always possible to overfit, which means that you find the optimal parameters for your training set, but those parameters make inaccurate predictions on the test set.
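As a minimal sketch of the convex case described above, consider fitting a single constant to a 30 x 1 data vector under the mean squared error loss. The data below are hypothetical, and the learning rate and step count are arbitrary choices for illustration; the known optimum for this loss is the sample mean, which makes convergence easy to check.

```python
# Hypothetical 30 x 1 data vector.
data = [float(i % 7) + 0.5 * i for i in range(30)]


def gradient_descent_constant(values, learning_rate=0.1, steps=200):
    """Minimize L(theta) = mean((theta - v)^2) over a single parameter theta."""
    theta = 0.0
    n = len(values)
    for _ in range(steps):
        # dL/dtheta = (2 / n) * sum(theta - v)
        gradient = (2.0 / n) * sum(theta - v for v in values)
        theta -= learning_rate * gradient
    return theta


theta_hat = gradient_descent_constant(data)
sample_mean = sum(data) / len(data)

print("gradient descent estimate:", round(theta_hat, 6))
print("closed-form optimum (sample mean):", round(sample_mean, 6))
```

With this loss each step shrinks the distance to the optimum by a constant factor (0.8 for a learning rate of 0.1), so the estimate matches the mean to many decimal places. A learning rate of 1.0 or more would make the same iteration oscillate or diverge, which is the "properly set learning rate" condition in practice.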
• asked a question related to Advanced Statistical Analysis
Question
I want to display the bivariate distribution of two (laboratory) parameters in sets of patients. I have available the data of N, mean +- SD of the first and second parameters. I am looking for software that could draw a bivariate distribution = ellipse from the given parameters. Can someone help me? Thank you.
Dear Dr. Gaško,
I'm glad to hear that. You are very welcome.
Best wishes.
• asked a question related to Advanced Statistical Analysis
Question
Hi,
There is an article that I want to know which statistical method has been used, regression or Pearson correlation.
However, they don't say which one. They show the correlation coefficient and standard error.
Based on these two parameters, can I know if they use regression or Pearson correlation?
Not sure I understand your question. If there is a single predictor and by regression you mean linear OLS regression, then the r is the same. Can you provide more details?
• asked a question related to Advanced Statistical Analysis
Question
How can I run the bootstrap method to estimate the error rate in linear discriminant analysis using R code?
Best
reghais.A
Using R code, the bootstrap method can estimate the error rate in linear discriminant analysis. First, the data must be split into a training set and a test set and then normalized. The lda() function can then be used to run the calculations twice, with CV=TRUE for the first run to get predictions of class membership derived from leave-one-out cross-validation. The second run should use CV=FALSE to get predictions of class membership based on the entire training set. The true error rate estimator BT2 of the restricted linear or quadratic discriminant analysis can be calculated using the dawai package in R. Finally, resampling methods such as bootstrapping can be used to estimate the test error rate.
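To make the resampling step concrete, here is a small out-of-bag bootstrap sketch in Python. It is not the procedure from the answer above and does not use lda() or the dawai package; it assumes a deliberately simplified setting (two classes, one feature, equal priors, shared variance), in which linear discriminant analysis reduces to assigning each point to the nearer class mean. The data are simulated and hypothetical.

```python
import random


def fit_lda_1d(xs, labels):
    """Two-class, one-feature LDA with equal priors: only the class means are needed."""
    class0 = [x for x, g in zip(xs, labels) if g == 0]
    class1 = [x for x, g in zip(xs, labels) if g == 1]
    return sum(class0) / len(class0), sum(class1) / len(class1)


def predict(model, x):
    """With a shared variance and equal priors, LDA picks the nearer class mean."""
    mean0, mean1 = model
    return 0 if abs(x - mean0) <= abs(x - mean1) else 1


def bootstrap_oob_error(xs, labels, n_boot=200, seed=1):
    """Average misclassification rate on observations left out of each bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    errors = []
    for _ in range(n_boot):
        sample = [rng.randrange(n) for _ in range(n)]  # draw n rows with replacement
        in_bag = set(sample)
        out_of_bag = [i for i in range(n) if i not in in_bag]
        boot_labels = [labels[i] for i in sample]
        if not out_of_bag or len(set(boot_labels)) < 2:
            continue  # skip degenerate resamples
        model = fit_lda_1d([xs[i] for i in sample], boot_labels)
        wrong = sum(predict(model, xs[i]) != labels[i] for i in out_of_bag)
        errors.append(wrong / len(out_of_bag))
    return sum(errors) / len(errors)


# Hypothetical simulated data: two classes whose means are 3 SD apart.
data_rng = random.Random(42)
xs = [data_rng.gauss(0.0, 1.0) for _ in range(30)] + [data_rng.gauss(3.0, 1.0) for _ in range(30)]
labels = [0] * 30 + [1] * 30

oob_error = bootstrap_oob_error(xs, labels)
print("bootstrap out-of-bag error rate:", round(oob_error, 3))
```

The same loop structure carries over to a multivariate model: refit the classifier on each resample and score it only on the rows that resample left out.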
• asked a question related to Advanced Statistical Analysis
Question
Dear all,
I want to know your opinions
Also, there is good paper here
Also,
I've only glanced quickly at those two resources, but are you sure they are addressing the same thing? Yates' (continuity) correction as typically described entails subtracting 0.5 from |O-E| before squaring in the usual equation for Pearson's Chi2. E.g.,
But adding 0.5 to each cell in a 2x2 table is generally done to avoid division by 0 (e.g., when computing an odds ratio), not to correct for continuity (AFAIK). This is what makes me wonder if your two resources are really addressing the same issues. But as I said, I only had time for a very quick glance at each. HTH.
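The difference between the two adjustments can be seen in a short Python sketch with made-up 2x2 tables. The first function computes Pearson's chi-square with or without Yates' correction; the second adds 0.5 to every cell before forming an odds ratio, which only serves to avoid division by zero. Function names and counts are hypothetical.

```python
import math

def chi2_2x2(a, b, c, d, yates=False):
    """Pearson chi-square for the table [[a, b], [c, d]] and its p-value (1 df)."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    stat = 0.0
    for o, e in zip(observed, expected):
        diff = abs(o - e)
        if yates:
            diff = max(diff - 0.5, 0.0)  # continuity correction on |O - E|
        stat += diff ** 2 / e
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square tail area for 1 df
    return stat, p_value

def odds_ratio_plus_half(a, b, c, d):
    """Odds ratio after adding 0.5 to each cell (guards against zero cells)."""
    a, b, c, d = (v + 0.5 for v in (a, b, c, d))
    return (a * d) / (b * c)

plain, p_plain = chi2_2x2(12, 5, 3, 10)
corrected, p_corrected = chi2_2x2(12, 5, 3, 10, yates=True)
print(plain, corrected)                   # the corrected statistic is smaller
print(odds_ratio_plus_half(8, 0, 3, 9))   # finite despite the zero cell
```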
• asked a question related to Advanced Statistical Analysis
Question
How can I add robust 97.5% confidence ellipses to the variation diagrams (XY, ilr-transformed) in the robCompositions or compositions packages?
Best
Azzeddine
For everyone's benefit, I have checked a group of packages that can add robust 97.5% confidence ellipses.
They are listed here by package and function:
1- ellipses () using the package 'ellipse'
2- ellipses () using the package 'rrcov'
3- ellipses () using the package 'cluster'
• asked a question related to Advanced Statistical Analysis
Question
I am working as Scientist (Horticulture) and my research focus is improvement of tropical and semi arid fruits. I am also interested in working out role of nutrients in fruit based cropping systems.
Looking for collaborators from the field of Genetics and Plant Breeding, Horticulture, Agricultural Statistics, Soil Science and Agronomy.
Currently working on Genetic analysis for fruit traits in Jamun (Indian Blackberry).
Try to publish on your own then you have complete control. Collaborators will steal your data and treat you badly :)
• asked a question related to Advanced Statistical Analysis
Question
I am testing hypotheses about relationships between CEA and Innovation Performance (IP). If I am testing the relationship of one construct, say Management support, to IP, is it OK to use simple linear regression? Or should I be testing it in a multiple regression with all the constructs?
• asked a question related to Advanced Statistical Analysis
Question
What are current recommendations for reporting effect size measures from repeated measures multilevel model?
Concerning analytical approach, I have followed procedure by Garson (2020) with matrix for repeated measures: diagonal, and matrix for random effects: variance components.
In advance, thank you for your contributions.
You can use standard procedures for the fixed effects estimates as they are akin to regression model estimates if the response is continuous. Things are more complicated if the response is categorical.
• asked a question related to Advanced Statistical Analysis
Question
Merry Christmas everyone!
I used the Interpersonal Reactivity Index (IRI) subscales Empathic Concern (EC), Perspective Taking (PT) and Personal Distress (PD) in my study (N = 900). When I calculated Cronbach's alpha for each subscale, I got .71 for EC, .69 for PT and .39 for PD. The value for PD is very low. The analysis indicated that if I deleted one item, the alpha would increase to .53, which is still low but better than .39. However, as my study does not focus mainly on the psychometric properties of the IRI, what kind of arguments can I make to say the results are still valid? I did say findings (for the PD) should be taken with caution, but what else can I say?
A scale reliability of .39 (and even .53!) is very low. Even if your main focus is not on the psychometric properties of your measures, you should still care about those properties. Inadequate reliability and validity can jeopardize your substantive results.
My recommendation would be to examine why you get such a low alpha value. Most importantly, you should first check whether each scale (item set) can be seen as unidimensional (measuring a single factor). This is usually done by running a confirmatory factor analysis (CFA) or item response theory analysis. Unidimensionality is a prerequisite for a meaningful interpretation of Cronbach's alpha (alpha is a composite reliability index for essentially tau-equivalent measures). CFA allows you to test the assumption of unidimensionality/essential tau equivalence and to examine the item loadings.
Also, you can take a look at the item intercorrelations. If some items have low correlations with others, this may indicate that they do not measure the same factor (and/or that they contain a lot of measurement error). Another reason for a low alpha value can be an insufficient number of items.
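As a rough illustration of these checks, the Python sketch below computes Cronbach's alpha and the alpha obtained when each item is dropped in turn. The item scores are invented (the fourth item is deliberately unrelated to the others) and the function names are hypothetical; a CFA would still be needed to test unidimensionality.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of respondent scores per item, all of equal length."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1.0 - item_var / variance(totals))

def alpha_if_deleted(items):
    """Alpha of the remaining items after dropping each item in turn."""
    return [cronbach_alpha(items[:i] + items[i + 1:]) for i in range(len(items))]

# Invented scores for 6 respondents on 4 items.
items = [
    [1, 2, 3, 4, 5, 3],
    [2, 2, 3, 5, 5, 3],
    [1, 3, 3, 4, 4, 2],
    [4, 1, 5, 2, 3, 1],  # behaves differently from the other three
]
alpha = cronbach_alpha(items)
dropped = alpha_if_deleted(items)
print(round(alpha, 3), [round(a, 3) for a in dropped])
```

An item whose removal raises alpha sharply, like the fourth one here, is a candidate for a closer look at its wording and its correlations with the other items.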
• asked a question related to Advanced Statistical Analysis
Question
I have run my qPCR experiments and obtained some results. I used the DDCt method and calculated 2^(-DDCt), transformed my data to base-10 logarithms, and separated my samples into controls and patients. If I see, for example, a fold change 4 times higher in patients for my gene of interest, should I use a one-tailed or a two-tailed t-test? And if the distribution is not normal, should I use a non-parametric test, or can I drop the outliers and run the t-test? I am very confused by this statistical conundrum.
If your data are not normally distributed, you should use a non-parametric test such as the Wilcoxon rank-sum test (also known as the Mann-Whitney U test) to compare the expression levels between the two groups.
Regarding one-tailed versus two-tailed: a one-tailed test specifies the direction of the effect (positive or negative) in advance, whereas a two-tailed test covers both directions at the same time.
Best...
• asked a question related to Advanced Statistical Analysis
Question
Dear all,
I have conducted a research about snake chemical communication where I test the reaction of a few adult snake individuals (both males and females) to different chemical compounds. Every individual is tested 3 times with each of the compounds. Basically, I put a soaked paper towel in each of the individual terrariums and record the behavior for 10 minutes with a camera. The compounds are presented to the individuals in random order.
My grouping variable represents the reactions to each of the compounds for each of the sexes. For example, in the grouping variable I have categories titled “male reactions to compound X”, “male reactions to compound Y” etc. I have three dependent variables as follows: 1) whether there is an interest towards the compound presented or not (binary), 2) chin rubbing behavior recorded (I record how many times this behavior is exhibited) and 3) tongue-flick rate (average tongue-flicks per minute). The distribution is not normal.
What I would like to test is 1) whether there is a difference in the behavior between males and females, 2) whether there is a difference between the behavior of males snakes to the different compounds (basically if males react more to compound X, rather than to compound Y) and the same goes for females, and finally 3) whether males exhibit different behavior to different types of compounds (I want to combine for example compounds X, Y and Z, because they are lipids and A, B and C, because they are alkanes and check difference in male responses).
I thought that PERMANOVA will be enough, since it is a multivariate non-parametric test, but two reviewers wrote that I have to use Generalized linear mixed models, because of the repeated measures (as mentioned, I test each individual with each of the compounds 3 times). They think there might be some individual differences that could affect the results if not taken into consideration.
Unfortunately, I am a newbie in GLMM, and I do not really see how such model can help me answer my questions and test the respective hypotheses. Could you, please, advise me on that? And how should I build the data matrix in order to test for such differences?
Isn’t it also possible to check for differences between individuals with the Friedman test and then use PERMANOVA?
Thank you very much in advance!
In general, PERMANOVA is a test of the effect of two parallel variables on the organism. It is equivalent to two one-way ANOVAs. A GLMM, in contrast, is equivalent to the combined effect of all factors; in a GLMM you can derive the contribution of each variable to determine the magnitude of the contribution of each environmental factor. You can think of PERMANOVA as the parallel effect of several factors, while a GLMM is the combined effect. GLMMs are simple and easy to run in R.
• asked a question related to Advanced Statistical Analysis
Question
Holzinger (Psychometrika, vol. 9, no. 4, Dec. 1944) and Thurstone (Psychometrika, vol. 10, no. 2, June 1945; vol. 14, no. 1, March, 1949) discussed an alternative method for factoring a correlation matrix. The idea was to enter several clusters of items (tests) in the computer program beforehand, and then test them, optimize them and produce the residual matrix (which may show the necessity of further factoring). These clusters could stem from theoretical and substantive considerations, or from an inspection of the correlation matrix. It was an alternative to producing one factor at a time until the residual matrix becomes negligible, and was attractive because it spared much calculation time for the computers in that era. That reason soon lapsed but the method is still interesting as an alternative kind of confirmatory factor analysis.
My problem is: I would like to know the exact procedure (especially the one by Holzinger) but I cannot get hold of these three original publications (except the first two pages), unless against big expenses, nor can I find a thorough discussion of it in another publication, except perhaps in H.H. Harman (1976): Modern factor analysis, Section 11.5, but that book has disappeared from the university library, while on Google-books it is incomplete. Has anyone a copy of these publications, or is he/she familiar with this type of factor analysis?
In the last few months, a colleague of mine has written a version of the PCO-program in R. The first impressions are good, but we need a few more months to test it and prepare a publication about it.
• asked a question related to Advanced Statistical Analysis
Question
I am stuck. I am working on a therapy and trying to evaluate the changes in biomarker levels. I selected 5 patients and measured their biomarker levels before therapy, after the first therapy, and after the second therapy. The mean values appear to differ, but when I apply ANOVA the results are not significant, which I attribute to the large standard deviations,
as in the table below.
Group     Sample Size   Mean     Standard Deviation   SE of Mean
vb bio    5             314.24   223.53627            99.96846
cb1 bio   5             329.7    215.54712            96.3956
CB II     5             371.6    280.77869            125.56805
So I would like to ask statisticians who are familiar with clinical trial studies:
Am I performing the statistics correctly?
Should I not worry about the non-significant results?
Which statistical tests should I use?
How should I present my data for publication?
Please explain as you would to someone new to this field.
Please be mindful about this kind of stuff, uncertain and misleading conclusions can be very dangerous in medicine and it is healthier for the community to talk about them than to just generate some numbers from the data.
Best,
jan
• asked a question related to Advanced Statistical Analysis
Question
I have a distribution map produced with only presence data. And there is a certain number of presence data that is in no way included in the model. How can I evaluate the compatibility of the presence data not included in the model I have with the predictive values corresponding to these points in the potential distribution map? So we can also think like this: I have two columns. The first column has only 1 values, the second column has the predictive values. Which method would be the best approach to examine the relationship between these two columns?
I'm not sure I fully understand your question, but when you have a column with a constant value (e.g., 1), this constant by definition cannot covary with another column/variable. A constant does not have any variance, and therefore its covariance with another variable is zero by definition and the correlation is undefined.
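A quick numeric check makes this concrete. In the Python sketch below, the predicted values are invented; the point is only that a column of 1s has zero variance, so its covariance with anything is zero and a correlation coefficient cannot be formed.

```python
def covariance(x, y):
    """Sample covariance; covariance(x, x) is the sample variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

presence = [1, 1, 1, 1, 1]                   # presence-only column
predicted = [0.82, 0.40, 0.91, 0.65, 0.77]   # invented model predictions

cov_xy = covariance(presence, predicted)
var_presence = covariance(presence, presence)
print(cov_xy, var_presence)
# A correlation would divide cov_xy by sqrt(var_presence * var_predicted),
# which is a division by zero here.
```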
• asked a question related to Advanced Statistical Analysis
Question
Hello, I currently have a set of categorical variables, coded as Variable A,B,C,etc... (Yes = 1, No = 0). I would like to create a new variable called severity. To create severity, I know I'll need to create a coding scheme like so:
if Variable A = 1 and all other variables = 0, then severity = 1.
if Variable B = 1 and all other variables = 0, then severity = 2.
So on, and so forth, until I have five categories for severity.
How would you suggest I write a syntax in SPSS for something like this?
* Create a toy dataset to illustrate.
NEW FILE.
DATASET CLOSE ALL.
DATA LIST LIST / A B C D E (5F1).
BEGIN DATA
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
1 1 0 0 0
0 1 1 0 0
0 0 1 1 0
0 0 0 1 1
1 0 2 0 0
END DATA.
IF A EQ 1 and MIN(B,C,D,E) EQ 0 AND MAX(B,C,D,E) EQ 0 severity = 1.
IF B EQ 1 and MIN(A,C,D,E) EQ 0 AND MAX(A,C,D,E) EQ 0 severity = 2.
IF C EQ 1 and MIN(B,A,D,E) EQ 0 AND MAX(B,A,D,E) EQ 0 severity = 3.
IF D EQ 1 and MIN(B,C,A,E) EQ 0 AND MAX(B,C,A,E) EQ 0 severity = 4.
IF E EQ 1 and MIN(B,C,D,A) EQ 0 AND MAX(B,C,D,A) EQ 0 severity = 5.
FORMATS severity (F1).
LIST.
* End of code.
Q. Is it possible for any of the variables A to E to be missing? If so, what do you want to do in that case?
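For readers working outside SPSS, the same rule can be sketched in Python. This is only an illustration of the coding scheme above, not a replacement for the syntax: severity is assigned only when exactly one variable equals 1 and all others equal 0, and any row with a missing or out-of-range value is left uncoded, which is an assumption about how such rows should be treated.

```python
def severity(row):
    """row: dict with keys 'A'..'E' coded 1 = yes, 0 = no. Returns 1-5 or None."""
    order = ["A", "B", "C", "D", "E"]
    values = [row.get(key) for key in order]
    if any(v not in (0, 1) for v in values):
        return None  # missing or invalid code: leave severity uncoded (assumption)
    if sum(values) != 1:
        return None  # none or several variables set: no category defined
    return values.index(1) + 1

rows = [
    {"A": 1, "B": 0, "C": 0, "D": 0, "E": 0},
    {"A": 0, "B": 0, "C": 0, "D": 0, "E": 1},
    {"A": 1, "B": 1, "C": 0, "D": 0, "E": 0},
    {"A": 1, "B": 0, "C": 2, "D": 0, "E": 0},
]
print([severity(r) for r in rows])  # [1, 5, None, None]
```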
• asked a question related to Advanced Statistical Analysis
Question
I am using -corr2data- to simulate raw data from a correlation matrix. However, some of the variables that I need should be binary. How can I convert them?
Is it possible to convert the higher values to 1 (and the others to 0) so as to reach the same mean? How should I do it?
Is there a way in R?
(I want to perform a GSEM on a correlation matrix)
(I know -faux- package in R. But my problem is that just some of [not all of] my variables are binary.)
Maybe the attached is what you mean. David Booth
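The thresholding idea in the question can be sketched as follows, here in Python rather than Stata or R. The largest values of a simulated continuous variable are coded 1 until the share of 1s matches the target mean. The function name, the target of 0.30, and the simulated data are all hypothetical.

```python
import random

def dichotomize(values, target_mean):
    """Code the largest values as 1 so the share of 1s is about target_mean."""
    n_ones = round(target_mean * len(values))
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    binary = [0] * len(values)
    for i in ranked[:n_ones]:
        binary[i] = 1
    return binary

rng = random.Random(42)
continuous = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in for simulated data
binary = dichotomize(continuous, target_mean=0.30)
print(sum(binary) / len(binary))  # share of 1s
```

Note that cutting a continuous variable attenuates its correlations with the other variables, so the binary version will not reproduce the original correlation matrix exactly.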
• asked a question related to Advanced Statistical Analysis
Question
Hello, I currently have a set of categorical variables, coded as Variable A,B,C,etc... (Yes = 1, No = 0). I would like to create a new variable called severity. To create severity, I know I'll need to create a coding scheme like so:
if Variable A = 1 and all other variables = 0, then severity = 1.
if Variable B = 1 and all other variables = 0, then severity = 2.
So on, and so forth, until I have five categories for severity.
How would you suggest I write a syntax in SPSS for something like this? Thank you in advance!
Ange, I think the easiest way for you to find an answer to your question would be to google something such as "SPSS recode variables YouTube". You'll probably find several sites that demonstrate what you want to do.
All the best with your research.
• asked a question related to Advanced Statistical Analysis
Question
I am creating a hypothetical study in which there are two drugs being tested. Thus I have taken 60 participants and randomly split them into three groups: drug A, drug B and a control group. A YBOCS score will be taken before the trial after the trial has ended and then again at a 3-month follow-up. Which statistical test should I use to compare the three groups and to find out which was most effective?
What do you mean "hypothetical study?" Is this a homework question?
• asked a question related to Advanced Statistical Analysis
Question
For example:
If there are 40 species identical between two sites, they are the same. However, two sites can each have 40 species each, but none in common. So by species number they are identical but by species composition they are 0% alike.
How can I calculate or show the species composition of the two sites over time?
You use beta diversity (β), which is a measure of the difference in composition of species between locations :)
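Two common pairwise measures of this kind are the Jaccard and Sørensen similarity indices, which depend on shared species rather than on species counts. Here is a minimal Python sketch with made-up species lists; computing the index for each survey date gives a series that can be plotted over time.

```python
def jaccard_similarity(site_a, site_b):
    """Shared species divided by all species found at either site."""
    a, b = set(site_a), set(site_b)
    return len(a & b) / len(a | b)

def sorensen_similarity(site_a, site_b):
    """Twice the shared species divided by the sum of the two richness values."""
    a, b = set(site_a), set(site_b)
    return 2 * len(a & b) / (len(a) + len(b))

# Made-up example: both sites have 40 species but none in common.
site_1 = [f"sp{i}" for i in range(1, 41)]
site_2 = [f"sp{i}" for i in range(41, 81)]
print(jaccard_similarity(site_1, site_2))  # 0.0
print(jaccard_similarity(site_1, site_1))  # 1.0
```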
• asked a question related to Advanced Statistical Analysis
Question
During the lecture, the lecturer mentioned the properties of frequentist statistics, as follows:
Unbiasedness is only one of the frequentist properties — arguably, the most compelling from a frequentist perspective and possibly one of the easiest to verify empirically (and, often, analytically).
There are however many others, including:
1. Bias-variance trade-off: we would consider as optimal an estimator with little (or no) bias; but we would also value ones with small variance (i.e. more precision in the estimate). So when choosing between two estimators, we may prefer one with very little bias and small variance to one that is unbiased but with large variance;
2. Consistency: we would like an estimator to become more and more precise and less and less biased as we collect more data (technically, when n → ∞).
3. Efficiency: as the sample size increases indefinitely (n → ∞), we expect an estimator to become increasingly precise (i.e. its variance to reduce to 0, in the limit).
Why does frequentist statistics have these kinds of properties, and can we prove them? I think these properties could apply to many other statistical approaches.
Sorry, Jianhing, but I think you have misunderstood something in the lecture. Frequentist statistics rests on an interpretation of probability as something assigned on the basis of many random experiments.
In this setting, one designs functions of the data (also called statistics) that estimate certain quantities from the data. For example, the probability p of a coin landing heads is estimated from n independent trials with the same coin by just counting the fraction of heads. This fraction is then an estimator for the parameter p.
Each estimator should have desirable properties, such as unbiasedness, consistency, efficiency, low variance and so on. Not every estimator has these properties. But, in principle, one can prove whether a given estimator has them.
So, these are not characteristics of frequentist statistics, but properties of an individual estimator based on frequentist statistics.
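The coin example can also be checked empirically. The Python sketch below repeats the experiment many times for several sample sizes and looks at the mean and variance of the estimates: a mean close to the true p illustrates unbiasedness, and a variance that shrinks as n grows illustrates consistency. The true p of 0.3 and the number of repetitions are arbitrary choices.

```python
import random
from statistics import mean, variance

def simulate_estimates(p, n, n_experiments, rng):
    """Run n_experiments coin experiments of n tosses; return the head fractions."""
    estimates = []
    for _ in range(n_experiments):
        heads = sum(rng.random() < p for _ in range(n))
        estimates.append(heads / n)
    return estimates

rng = random.Random(0)
p_true = 0.3
results = {}
for n in (10, 100, 1000):
    est = simulate_estimates(p_true, n, 2000, rng)
    results[n] = (mean(est), variance(est))
    print(n, round(results[n][0], 3), round(results[n][1], 5))
```

A simulation like this is a sanity check, not a proof; the analytical argument is that the fraction of heads has expectation p and variance p(1 - p)/n.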
• asked a question related to Advanced Statistical Analysis
Question
Assuming that a researcher does not know the nature of population distribution (the parameters or the type e.g. normal, exponential, etc.), is it possible that the sampling distribution can indicate the nature of the population distribution.
According to the central limit theorem, the sampling distribution is likely to be normal. So, the exact population distribution cannot be known. Is the shape of the distribution for a large sample size enough, or does it have to be inferred logically based on different factors?
Am I missing some points? Any lead or literature will help.
Thank you
• asked a question related to Advanced Statistical Analysis
Question
Hello,
I have a variable in a dataset with ~500 answers; it essentially represents participants' answers to an essay question. I am interested in how many words each individual has used in the task and I cannot seem to find a function in R to calculate/count each participant's words in that particular variable.
Is there a way to do this in R? Any packages you think could help me do this? Thank you so much in advance!
Thank you so much, Daniel Wright , Jochen Wilhelm , Richard Johansen ! Your answers were very helpful. I was able to do it through the string package.
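For anyone landing on this thread, the underlying idea is simply to split each answer on whitespace and count the pieces. A minimal sketch in Python (not the R solution used above) with made-up answers:

```python
# Made-up essay answers; an empty answer should count as zero words.
answers = ["I think the policy is fair.", "", "  Too   many  spaces here "]

# str.split() with no argument splits on any run of whitespace
# and ignores leading and trailing spaces.
word_counts = [len(text.split()) for text in answers]
print(word_counts)  # [6, 0, 4]
```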
• asked a question related to Advanced Statistical Analysis
Question
Can you still have a good model despite a p-value < .05 for the H-L goodness of fit test? Any alternative testing in SAS or R?
What if the p-value of the HL test doesn't appear? It just appeared as ".". What does that mean? Thank you.
• asked a question related to Advanced Statistical Analysis
Question
For example, I want to compare BMI between two groups.
When I use the Shapiro-Wilk test to check the normality of BMI in each group,
one group is normally distributed
and the other is not.
Which test should I use: the t-test or Mann-Whitney?
It is not very clear what you are trying to do.
However, the Student's t-test (parametric) assumes normally distributed data in both groups: its test statistic is the ratio of a normally distributed quantity to the square root of a scaled chi-squared quantity.
The Mann-Whitney test (non-parametric) can be a valid alternative to the t-test, but the distributions must be independent.
If they are not, you can apply the Wilcoxon test, but with due caution, given the unclearness of your "problem".
Regards
• asked a question related to Advanced Statistical Analysis
Question
A few years ago, in a conversation that remained unfinished, a statistics expert told me that percentage variables should not be taken as a response variable in an ANOVA. Does anyone know if this is true, and why?
Javier Ernesto Vilaso Cadre, tests do not corroborate that a distribution is normal. They may only fail to corroborate that a distribution is not normal, and that may simply be due to the sample size. Actually, such tests only tell you if your sample size is already large enough to see that the normal distribution model (an idealized model!) does not account for all features of a real distribution. So actually they don't give you any useful information (you may fail to see relevant discrepancies because the sample size is too small, or you may get blinded by "statistically significant" discrepancies that are irrelevant for your problem). The only sensible way is to understand the variable and have some theoretical justification of its distribution, and then to judge if the presumed discrepancies are relevant for your problem. One may then certainly have a look at the empirical distribution of the observed data: if it screams at you that your thoughts and arguments are very likely very wrong, you may go back and refine or deepen your understanding of the data-generating process you would like to study.
• asked a question related to Advanced Statistical Analysis
Question
Hi all,
As a non-statistician, I have a (seemingly) complicated statistical question on my hands that I'm hoping to gather some guidance on.
For background, I am studying the spatial organization of a biological process over time (14 days), in a roughly-spherical structure. Starting with fluorescence images (single plane, no stacks), I generate one curve per experimental day that corresponds to the average intensity of the process as I pass through the structure; this is in the vein of line intensity profiling for immunofluorescence colocalization. I have one curve per day (see attached) and I'm wondering if there are any methods that can be used to compare these curves to check for statistical differences.
Any direction to specific methods or relevant literature is deeply appreciated, thank you!
Cheers,
Matt
Edit to add some additional information: the curves to be analyzed will be averages of curves generated from multiple biological replicates, and therefore will have error associated with them. Across the various time points and conditions, the number of values per curve ranges roughly from 200 -- 1000 (one per pixel).
Hi Mairtin Mac Siurtain, thanks so much for your reply!
This is great information. I'll take a look first at MANOVA and then at RBSCIs if indicated. Many thanks!
Best,
Matt
• asked a question related to Advanced Statistical Analysis
Question
We are altering our original analytical method to save time and cost and are trying to come up with a good footing to say if the method results are the same or if they are significantly different. We are taking actual samples and analyzing them through both the original analytical method that we validated and the alterations we made. I am not very knowledgeable about statistics, but is there a statistical way to say if the methods are producing results that are the same or significantly different? Or is there a more common method to determine if two analytical methods are the same? I have attached the results from each of the variations we tried along with the original method results.
Yes, it leads to different results, because each statistical tool is used according to its own specific criteria; several statistical methods cannot be used for the same purpose. David Morse
• asked a question related to Advanced Statistical Analysis
Question
Different steps and procedure with complete example of a meta analysis.
What is pooled prevalence?
You should try this text.
Harrer, M., Cuijpers, P., Furukawa, T.A., & Ebert, D.D. (2021). Doing Meta-Analysis with R: A Hands-On Guide. Boca Raton, FL and London: Chapman & Hall/CRC Press. ISBN 978-0-367-61007-4.
• asked a question related to Advanced Statistical Analysis
Question
I have two datasets (measured and modelled).
I want to calculate 95% confidence intervals for RMSE.
Thank you in advance
I found the code linked to by Jeff Rothschild to be a little difficult to follow. I adapted the code to make it a little more straightforward. Below is some R code which calculates the confidence interval by this method and by bootstrap. RMSE is calculated from a vector of Actual values and a vector of Predicted values.
### rmseCI function adapted from
### Bootstrapped confidence interval adapted from
if(!require(rcompanion)){install.packages("rcompanion")}
rmseCI = function(rmse, n, conf, digits=3){
  p_lower = 0.5-conf/2
  p_upper = 0.5+conf/2
  DF = data.frame(
    RMSE     = rmse,
    lower.ci = signif(sqrt(n / qchisq(p_upper, df = n)) * rmse,
                      digits=digits),
    upper.ci = signif(sqrt(n / qchisq(p_lower, df = n)) * rmse,
                      digits=digits))
  row.names(DF)=1
  return(DF)}
#######################
Actual = 1:24
Predicted = c(1,3,4,5,2,3,6,3,4,7,8,9,11,12,15,16,18,19,20,22,25,24,23,24)
library(rcompanion)
RMSE = efronRSquared(actual=Actual, predicted=Predicted, statistic="RMSE")
RMSE
#######################
N = length(Actual)
rmseCI(RMSE, n=N, 0.95)
###############################
library(boot)
Data = data.frame(Actual, Predicted)
Function = function(input, index){
  Input = input[index,]
  Result = efronRSquared(actual = Input$Actual,
                         predicted = Input$Predicted,
                         statistic = "RMSE")
  return(Result)}
Boot = boot(Data, Function, R=5000)
boot.ci(Boot, conf = 0.95, type = "bca")
RMSE
hist(Boot$t[,1])
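For readers without R, the bootstrap part can be sketched in plain Python using the same Actual and Predicted vectors. This version uses the simple percentile interval rather than the BCa interval requested from `boot.ci` above, so the limits will differ somewhat; the function names and the seed are hypothetical.

```python
import math
import random

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def bootstrap_rmse_ci(actual, predicted, conf=0.95, n_boot=5000, seed=123):
    """Percentile bootstrap interval: resample (actual, predicted) pairs together."""
    rng = random.Random(seed)
    n = len(actual)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(rmse([actual[i] for i in idx], [predicted[i] for i in idx]))
    stats.sort()
    lower = stats[int((1 - conf) / 2 * n_boot)]
    upper = stats[int((1 + conf) / 2 * n_boot) - 1]
    return lower, upper

actual = list(range(1, 25))
predicted = [1, 3, 4, 5, 2, 3, 6, 3, 4, 7, 8, 9, 11, 12, 15, 16,
             18, 19, 20, 22, 25, 24, 23, 24]
point = rmse(actual, predicted)
lower, upper = bootstrap_rmse_ci(actual, predicted)
print(round(point, 3), round(lower, 3), round(upper, 3))
```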
• asked a question related to Advanced Statistical Analysis
Question
Greetings Fellow Researchers,
I am a newbie in using survival analysis (Cox Regression). In my data-set 10-40% cases have missing values (Depending on the variable I include in my analysis). Based on this I have two questions,
1- Are there any recommendations on the accepted percentage of cases dropped (missing values) from the analysis?
2- Should I impute the missing values of all the cases that were dropped (let's say a maximum of 40%)?
Thank you so much for your time and kind consideration.
Best,
Sarang
I have no first-hand experience to offer you either. But this systematic review article may give you some ideas.
• asked a question related to Advanced Statistical Analysis
Question
I'm doing a germination assay of 6 Arabidopsis mutants under 3 different ABA concentrations in solid medium. I have 4 batches. Each batch has 2 plates for each mutant, 3 for the wild type, and each plate contains 8-13 seeds. Some seeds and plates are lost to contamination, so I don't have the same sample size for each mutant in each batch. In some cases the mutant is no longer present in the batch. I recorded the germination rate per mutant after a week and expressed it as a percentage. I'm using R. How can I best analyse the data to test whether the mutations affect the germination rate in the presence of ABA?
I've two main questions:
1. Do I consider each seed as a biological replicate with a categorical result (germinated/not germinated), or each plate with a numerical result (% germination)?
2. I compare treatments within the genotype. Should I compare mutant against wild type within the treatment, the treatment against itself within mutant, or both?
I suggest using mosaic plots rather than (stacked) barplots to visualize your data.
The chi²- and p-values can be calculated simply via chi²-tests (one for each ABA conc) -- assuming the data are all independent (again, please note that seedlings on the same plate are not independent). If you have no possibility to account for this (using a hierarchical/multilevel/mixed model), you may ignore this in the analysis but then interpret the results more carefully (e.g., use a more stringent level of significance than usual).
A binomial model (including genotype and ABA conc as well as their interaction) would allow you to analyse the difference between genotypes in conjunction with ABA conc. However, due to the given experimental design (only three different conc values) this is cumbersome to interpret (because you cannot establish a meaningful functional relationship between conc and probability of germination).
• asked a question related to Advanced Statistical Analysis
Question
For individual responses, how can I calculate the critical value against which outliers should be checked? I don't know how.
Kindly follow the SPSS structure for determining the critical value here: It is pretty simple and intuitive
• asked a question related to Advanced Statistical Analysis
Question
I need suggestions for groundwater assessment articles that used discriminant analysis in their study, as well as guidance on how to apply this analysis in R.
Reghais.A
Thanks
• asked a question related to Advanced Statistical Analysis
Question
I'm trying to construct a binary logistic regression model. The first model includes 4 predictor variables and its intercept is not statistically significant. In the second model, I excluded one variable from the first model and the intercept is significant.
The consideration that I take here is that:
The pseudo R² of the first model is higher than that of the second model.
Any suggestion which model should I use?
You should use the model that makes more sense, practically and/or theoretically. A high R² is not an indication of the "goodness" of the model. A higher R² can also mean that the model makes more wrong predictions with a higher precision.
Do not build your model based on observed data. Build your model based on understanding (theory) and the targeted purpose (simple prediction, extrapolation (e.g. forecast), testing meaningful hypotheses etc.)
Removing a variable from the model changes the meaning of the intercept. The intercepts in the two models have different meanings. They are (very usually) not comparable. The hypothesis tests of the intercepts of the two models test very different hypotheses.
PS: a "non-significant" intercept term just means that the data are not sufficient to statistically distinguish the estimated value (the log odds given all X=0) from 0, which means that you cannot distinguish the probability of the event (given all X=0) from 0.5 (the data are compatible with probabilities both above and below 0.5). This is rarely a sensible hypothesis to test.
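The link between a log odds of 0 and a probability of 0.5 follows from the inverse logit transform, sketched below in Python. The confidence limits for the intercept are invented for illustration: an interval on the log-odds scale that contains 0 maps to a probability interval that contains 0.5.

```python
import math

def inv_logit(log_odds):
    """Convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

print(inv_logit(0.0))  # 0.5

# Invented 95% confidence limits for an intercept, on the log-odds scale.
ci_log_odds = (-0.4, 0.6)
ci_prob = tuple(inv_logit(v) for v in ci_log_odds)
print(ci_prob)  # lower limit below 0.5, upper limit above 0.5
```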
• asked a question related to Advanced Statistical Analysis
Question
I need a statistical result to test my hypothesis, but my N is not large enough to put into the Cochran formula.
Besides that, I cannot collect more information to solve this issue. Do you know any other reliable method that fits this issue?
Hi dear Mario
I'm trying to classify hard-wired human settlement preferences based on archaeological evidence from the Paleolithic era. This is an effort to update the biophilia hypothesis.
But to do hypothesis testing I need to quantify the data. The first step is to build a database from as many Paleolithic geolocations as possible, then put them through GIS analysis and test my hypothesis in SEM based on the data acquired (mostly, which environmental attribute had how much impact on our decision to select a settlement in which period of paleohistory).
PS. I tested a small database, but I was wondering if there is a geolocation database of archaeological sites that I couldn't find yet.
Bests
SAM
• asked a question related to Advanced Statistical Analysis
Question
300 participants in my study viewed 66 different moral photos and had to make a binary choice (yes/no) in response to each. There were 3 moral photo categories (22 positive images, 22 neutral images and 22 negative images). I am running a multilevel logistic regression (we manipulated two other aspects of the images) and have found unnaturally high odds ratios (see below). We have no missing values. Could anyone please help me understand what the below might mean? I understand I need to approach this with extreme caution, so any advice would be highly appreciated.
Yes choice: morally negative compared to morally positive (OR=441.11; 95% CI [271.07,717.81]; p<.001)
Yes choice: morally neutral compared to morally positive (OR=0.94; 95% CI [0.47,1.87]; p=0.86)
It should be noted that when I plot the data, very very few participants chose yes in response to the neutral and positive images. Almost all yes responses were given in response to the negative images.
I think you have answered your question: "It should be noted that when I plot the data, very very few participants chose yes in response to the neutral and positive images. Almost all yes responses were given in response to the negative images."
This is what you'd expect even in a simple 2x2 design. If the probability of a yes response is very high in the negative condition and very low in the positive condition, then the OR can be very large, as it's the ratio of very large odds to very small odds.
This isn't unnatural unless the raw probabilities don't reflect this pattern. (There might still be issues but not from what you described).
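As a rough illustration of how such a pattern produces a very large OR, here is a small Python sketch. The two proportions are hypothetical values chosen to mimic the described pattern (many yes responses to negative images, almost none to positive ones). They are not the asker's data, and a multilevel model estimates a conditional OR that need not match this simple calculation.

```python
def odds(p: float) -> float:
    """Odds corresponding to a probability p (0 < p < 1)."""
    return p / (1.0 - p)


def odds_ratio(p_a: float, p_b: float) -> float:
    """Odds ratio of condition A relative to condition B."""
    return odds(p_a) / odds(p_b)


# Hypothetical raw proportions of "yes" responses (assumed, for illustration).
p_yes_negative = 0.90  # most negative images receive a "yes"
p_yes_positive = 0.02  # almost no positive images receive a "yes"

print(f"odds (negative) = {odds(p_yes_negative):.3f}")
print(f"odds (positive) = {odds(p_yes_positive):.4f}")
print(f"odds ratio      = {odds_ratio(p_yes_negative, p_yes_positive):.1f}")
```

With these made-up values the probabilities differ by a factor of 45, but the odds ratio is about 441, because the odds in the positive condition are close to zero.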
• asked a question related to Advanced Statistical Analysis
Question
Hi,
I have 4 animals (A, B, C and D) and conducted behavioural observations for 90 days. The factors that I want to test are temperature and time (whether temperature/time affects their behaviour or not). I categorized temperature as cool or hot, and time as morning or afternoon.
The questions are:
1) Is it correct to use a non-parametric test since the number of subjects is small (n < 10)?
2) If I want to differentiate the behaviour between the individual animals, do I use a paired t-test or an independent t-test?
3) If I want to see the interaction between individual subjects × temperature × time, are the data considered dependent or independent?
Thank you Noramira Nozmi for the question and professors for the useful answers.
• asked a question related to Advanced Statistical Analysis
Question
In my research I need to calculate the correlation between two variables. Each variable was measured 100 times for each subject (N=20), so in total I have 2000 data points, of which every 100 belong to the same participant.
I have calculated a simple bivariate correlation on these 2000 data points without taking the participant effects into account. I know that this is wrong and the participant effects should be accounted for, but I didn't manage to find a way to do this (in SPSS). Any help would be highly appreciated.
Although the ideas by Thom Baguley and Christian Geiser to use MLM seem reasonable, I still have 2 (blocks of) questions.
1) What is the hypothesis in the first place? What should be correlated? What are the 100 measurement points?
2) You say you have "two" variables, each measured 100 times. But wouldn't this be 200 data points per subject, so that with N=20 you should have 4000 data points and not 2000?
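To illustrate why ignoring the participant effect matters, here is a minimal Python sketch on simulated data (20 participants × 100 paired measurements; all numbers are invented). It is not the multilevel model suggested above. It only centers each variable on the participant's own mean before correlating, which removes between-participant differences from the estimate. Proper inference (standard errors, p-values) would still need a multilevel model or a comparable method.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def center_within(values, groups):
    """Subtract each participant's own mean from that participant's values."""
    sums, counts = {}, {}
    for v, g in zip(values, groups):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


# Simulated data (assumption): participants differ in overall level, while
# within a participant the two variables are negatively related.
random.seed(1)
ids, x, y = [], [], []
for pid in range(20):
    level = random.gauss(0, 3)  # participant-specific offset shared by x and y
    for _ in range(100):
        xi = random.gauss(0, 1)
        ids.append(pid)
        x.append(level + xi)
        y.append(level - 0.5 * xi + random.gauss(0, 1))

naive = pearson(x, y)
within = pearson(center_within(x, ids), center_within(y, ids))
print(f"pooled r = {naive:.2f}, within-participant r = {within:.2f}")
```

In this simulation the pooled correlation is positive while the within-participant correlation is negative, which shows how far the naive analysis of all 2000 points can be from the within-person relationship.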
• asked a question related to Advanced Statistical Analysis
Question
I am using an ARDL model however I am having some difficulties interpreting the results. I found out that there is a cointegration in the long run. I provided pictures below.
Mr a. D.
The ECT(-1) is always the lagged value of your dependent variable.
Regards
• asked a question related to Advanced Statistical Analysis
Question
I have long-term rainfall data and have calculated Mann-Kendall test statistics using the XLSTAT trial version (add-on in MS Word). There is an option for asymptotic and continuity correction in the XLSTAT drop-down menu.
• What do the terms "asymptotic" and "continuity correction" mean?
• When and under what circumstances should we apply it?
• Is there any assumption on time series before applying it?
• What are the advantages and limitations of these two processes?
I am not specifically an expert on the Mann-Kendall trend test, but it is related to classical non-parametric tests, like the Kendall correlation test that I know better. Be careful with XLSTAT (which works in Excel, not in Word). Indeed, with the procedure I used a few years ago, I had many problems and had to contact support. I think you should read more about the test and, more generally, about non-parametric tests. Asymptotic refers to the behaviour as the number of observations n grows to infinity. For small n, these tests are based on tables of exact critical values depending on n. When n is too large for the tables, use the asymptotic distribution, often normal with a given mean and a given variance (depending on n, of course). The continuity correction is needed because the test statistic takes discrete values whereas the asymptotic distribution is continuous. The same kind of correction appears with the normal approximation to a binomial distribution. Look in your statistics course notes.
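The two options can be sketched in a few lines of Python. This is a minimal illustration of the textbook Mann-Kendall statistic with the normal (asymptotic) approximation, assuming no tied values; it is not XLSTAT's implementation, and the rainfall numbers are invented.

```python
import math


def mann_kendall(series, continuity_correction=True):
    """Return (S, Z) for the Mann-Kendall trend test.

    Uses the asymptotic normal approximation and assumes no ties.
    """
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)  # sign of the difference
    var_s = n * (n - 1) * (2 * n + 5) / 18.0  # variance of S without ties
    sd = math.sqrt(var_s)
    if not continuity_correction:
        return s, s / sd
    # Continuity correction: S is an integer, so move it one unit toward zero.
    if s > 0:
        return s, (s - 1) / sd
    if s < 0:
        return s, (s + 1) / sd
    return s, 0.0


def two_sided_p(z):
    """Two-sided p-value from the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical annual rainfall totals in mm (assumed values).
rainfall = [812, 790, 845, 860, 831, 878, 902, 869, 915, 940]
for cc in (False, True):
    s, z = mann_kendall(rainfall, continuity_correction=cc)
    print(f"correction={cc}: S={s}, Z={z:.3f}, p={two_sided_p(z):.4f}")
```

The corrected Z is always slightly closer to zero than the uncorrected one, and the difference shrinks as n grows.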
• asked a question related to Advanced Statistical Analysis
Question
In confirmatory factor analysis (CFA) in Stata, the first observed variable is constrained by default (beta coefficient = 1, mean of latent variable = constant).
I don't understand this, because other software packages report beta coefficients for all observed variables.
So, I have two questions.
1- Which variable should be constrained in confirmatory factor analysis in Stata?
2- Is it possible to have a model without a constrained variable like other software packages?
Hello Seyyed,
I guess that by beta you mean the factor loading? Traditionally, these are denoted by lambda, but Stata probably treats these differently.
Fixing the loading of the "marker variable" is needed a) to assign a metric to the latent variable (that of the marker), and b) to identify the equation system.
As far as I know, it does not matter which variable you choose, as long as it is a valid indicator of the latent variable.
HTH
Holger
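Why some constraint is needed can be shown with a small Python sketch. For a one-factor model the implied covariance matrix is Σ = λφλ' + Θ, so multiplying all loadings by c and dividing the factor variance by c² leaves Σ unchanged. All parameter values below are hypothetical. The second scaling (factor variance fixed to 1, all loadings free) is a common alternative to the marker variable; whether and how a given package offers it should be checked in its documentation.

```python
def implied_covariance(loadings, factor_variance, residual_variances):
    """Model-implied covariance matrix of a one-factor model.

    Sigma[i][j] = loading_i * factor_variance * loading_j,
    plus the residual variance on the diagonal.
    """
    k = len(loadings)
    return [
        [
            loadings[i] * factor_variance * loadings[j]
            + (residual_variances[i] if i == j else 0.0)
            for j in range(k)
        ]
        for i in range(k)
    ]


# Hypothetical residual variances (assumed values).
theta = [0.5, 0.4, 0.6]

# Marker-variable scaling: first loading fixed to 1, factor variance free (4.0).
marker = implied_covariance([1.0, 0.8, 1.2], 4.0, theta)

# Standardized-factor scaling: factor variance fixed to 1, all loadings free.
# Each loading equals the marker-scaled loading times sqrt(4.0) = 2.
standardized = implied_covariance([2.0, 1.6, 2.4], 1.0, theta)

identical = all(
    abs(a - b) < 1e-12
    for row_a, row_b in zip(marker, standardized)
    for a, b in zip(row_a, row_b)
)
print("Same implied covariance matrix:", identical)
```

Both parameterizations describe the data equally well, which is why the software has to fix either one loading or the factor variance before it can estimate the rest.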
• asked a question related to Advanced Statistical Analysis
Question
For my bachelor thesis I'm conducting a study on the relationship between eye movements and memory. One of the hypotheses is that the number of fixations made during the viewing of a movie clip will be positively related to the memory of that movie clip.
Each participant viewed 100 movie clips, and the number of fixations was counted for each movie clip for each participant. Later, participants' memory of the clips was tested and each movie was categorized as "remembered" or "forgotten" for each participant.
So, for each participant there are 100 trials with the corresponding number of fixations and categorization as "remembered" or "forgotten".
My first idea was to do a paired-samples t-test (to compare the number of fixations between "remembered" and "forgotten"), but I didn't find a way to do that in SPSS with this file format, as there are 100 rows for each participant. I thought of calculating the average number of fixations for the remembered vs forgotten movies per participant and doing a t-test on these means (one mean per participant for both categories), but this way the means get distorted because some subjects remember way more clips than others (so the "mean of the means" is not the same as the overall mean).
Now I'm thinking that doing a t-test might not be appropriate at all, and that logistic regression would be a better choice (to see how well the number of fixations predicts whether a clip will be remembered vs forgotten), but I didn't manage to find out how to do this in SPSS for a within-subject design with multiple trials per participant. Any help/suggestions would be highly appreciated.
I believe Blaine Tomkins meant to describe the data as having a LONG format, not a wide format. Apart from that, I concur with his advice. SPSS can estimate that model. Look up the GENLINMIXED command.
A good resource is the book by Heck, Thomas & Tabata (if you can get your hands on it).
HTH.
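The asker's concern that the "mean of the means" differs from the overall mean when participants contribute unequal numbers of trials can be seen in a tiny Python example. The fixation counts are invented and only two participants are used.

```python
# Hypothetical fixation counts for "remembered" clips (assumed values).
remembered = {
    "P1": [10, 12, 11, 13, 14, 12, 11, 13],  # remembered many clips
    "P2": [20, 22],                          # remembered few clips
}

# One mean per participant, then the unweighted mean of those means.
participant_means = {p: sum(v) / len(v) for p, v in remembered.items()}
mean_of_means = sum(participant_means.values()) / len(participant_means)

# Pooled mean over all trials, ignoring which participant they came from.
all_trials = [x for v in remembered.values() for x in v]
pooled_mean = sum(all_trials) / len(all_trials)

print(f"mean of participant means = {mean_of_means:.1f}")
print(f"pooled mean over trials   = {pooled_mean:.1f}")
```

Neither number is wrong; they weight participants differently. A trial-level mixed model on long-format data avoids having to choose between them by hand.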
• asked a question related to Advanced Statistical Analysis
Question
Do serial correlation, autocorrelation and seasonality mean the same thing, or are they different terms? If so, what are the exact differences with respect to the field of statistical hydrology? What are the different statistical tests to determine (quantify) the serial correlation, autocorrelation and seasonality of a time series?
Kabbilawsh Peruvazhuthi, serial correlation and autocorrelation are the same thing, but seasonality is different.
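A short Python sketch may help to see the difference. Autocorrelation (serial correlation) is the correlation of a series with a lagged copy of itself; seasonality is a repeating pattern, which shows up as peaks in the autocorrelation at the seasonal lag. The series below is an artificial 12-month cycle, not hydrological data.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation coefficient of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return num / denom


# Artificial monthly series with a pure 12-month cycle over 3 years (assumed).
monthly = [math.sin(2 * math.pi * m / 12) for m in range(36)]

for lag in (1, 6, 12):
    print(f"lag {lag:2d}: r = {autocorrelation(monthly, lag):+.3f}")
```

For this series the coefficient is strongly negative at lag 6 (half a cycle) and positive again at lag 12 (a full cycle), which is the typical signature of seasonality in a correlogram.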
• asked a question related to Advanced Statistical Analysis
Question
Hi
I have applied several conditional independence testing methods:
1- Fisher's exact test
2- Monte-Carlo Chi-sq
3- Yates correction of Chi-sq
4- CMH
The number of distinct feature segments that reject the independence (null H) is different in each method. Which method is more reliable and why?
(The data satisfies the prerequisite of all of these methods)
With low expected cell counts there are various suggestions as to when Chi-sq is no longer trustworthy. I like the Np15 rule because it's simple and seems to work well. In this rule, N is the total number of observations, and p is the proportion in the smaller group. So if you have a table that contains 50 observations, and 10% of them fall into the smaller group (so p = 0.1), then Np is (50 x 0.1), which is 5. The table fails the Np > 15 test: there is not enough data to do a Chi-squared test.
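The rule of thumb is easy to script. The Python sketch below simply encodes the description given above; the function name and the second example table are made up for illustration.

```python
def np_rule(n_total, smaller_group_proportion, threshold=15):
    """Return (N * p, passes) for the Np > 15 rule of thumb.

    n_total: total number of observations in the table.
    smaller_group_proportion: proportion of observations in the smaller group.
    """
    np_value = n_total * smaller_group_proportion
    return np_value, np_value > threshold


# The worked example from the answer: 50 observations, 10% in the smaller group.
print(np_rule(50, 0.1))

# A hypothetical larger table with the same proportion.
print(np_rule(200, 0.1))
```

When the check fails, an exact or Monte Carlo test is usually the safer choice than the Chi-squared approximation.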
• asked a question related to Advanced Statistical Analysis
Question
I have six kinds of compounds which I then tested for antioxidant activity using the DPPH assay and also for anticancer activity on five types of cell lines, so I got two types of data groups:
1. Antioxidant activity data
2. Anticancer activity (5 types of cancer cell line)
Each data consisted of 3 replications. Which correlation test is the most appropriate to determine whether there is a relationship between the two activities?
Just do logistic regression is what I had in mind. The DV might be anticancer activity (yes/no), and the same for antioxidant activity. Best wishes, David Booth
• asked a question related to Advanced Statistical Analysis
Question
I want to draw a graph of predicted probabilities vs observed probabilities. For the predicted probabilities I use this R code (see below). Is this code OK or not?
Could anyone tell me how I can get the observed probabilities and draw a graph of predicted vs observed probabilities?
analysis10<-glm(Response~ Strain + Temp + Time + Conc.Log10
+ Strain:Conc.Log1+ Temp:Time