Science topic

# Advanced Statistics - Science topic

Question
I'm doing a germination assay of 6 Arabidopsis mutants under 3 different ABA concentrations in solid medium. I have 4 batches. Each batch has 2 plates for each mutant, 3 for the wild type, and each plate contains 8-13 seeds. Some seeds and plates are lost to contamination, so I don't have the same sample size for each mutant in each batch; in some cases a mutant is missing from a batch entirely. I've recorded the germination rate per mutant after a week and expressed it as a percentage. I'm using R. How can I best analyse the data to test whether the mutations affect the germination rate in the presence of ABA?
I've two main questions:
1. Do I consider each seed as a biological replicate with a categorical result (germinated/not germinated), or each plate as a replicate with a numerical result (% germination)?
2. Should I compare mutant against wild type within each treatment, treatments against each other within each genotype, or both?
I suggest using mosaic plots rather than (stacked) barplots to visualize your data.
The chi²- and p-values can be calculated simply via chi²-tests (one for each ABA concentration), assuming the data are all independent (again, note that seedlings on the same plate are not independent). If you have no way to account for this (using a hierarchical/multilevel/mixed model), you may ignore it in the analysis but should then interpret the results more cautiously (e.g., use a more stringent significance level than usual).
A binomial model (including genotype and ABA concentration as well as their interaction) would allow you to analyse the difference between genotypes in conjunction with ABA concentration. However, given the experimental design (only three different concentration values), this is cumbersome to interpret, because you cannot establish a meaningful functional relationship between concentration and the probability of germination.
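A minimal sketch of such a binomial model, one row per plate; the data frame `plates` and the column names `genotype`, `conc`, `germinated`, and `total` are assumptions, not names from the question:

```r
# Only 3 concentration values: treat concentration as a factor rather than
# assuming a functional dose-response form.
plates$conc <- factor(plates$conc)

# cbind(successes, failures) aggregates the seeds per plate.
fit <- glm(cbind(germinated, total - germinated) ~ genotype * conc,
           family = binomial, data = plates)
summary(fit)

# Likelihood-ratio test of the genotype-by-concentration interaction:
fit0 <- update(fit, . ~ genotype + conc)
anova(fit0, fit, test = "Chisq")
```

If plates within a batch are correlated, one could switch to `family = quasibinomial`, or to a mixed model (`glmer` in lme4) with a random effect for batch and/or plate.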
Question
I'm trying to construct a binary logistic model. The first model includes 4 predictor variables, and its intercept is not statistically significant. In the second model, I exclude one variable from the first model, and the intercept is significant.
The consideration that I take here is that:
The pseudo-R² of the first model is higher, so it appears to explain the data better than the second model.
Any suggestion which model should I use?
You should use the model that makes more sense, practically and/or theoretically. A high R² is not an indication of the "goodness" of the model; a higher R² can also mean that the model makes more wrong predictions with higher precision.
Do not build your model based on observed data. Build it based on understanding (theory) and the targeted purpose (simple prediction, extrapolation (e.g. forecasting), testing meaningful hypotheses, etc.).
Removing a variable from the model changes the meaning of the intercept. The intercepts in the two models have different meanings. They are (very usually) not comparable. The hypothesis tests of the intercepts of the two models test very different hypotheses.
PS: a "non-significant" intercept just means that the data are not sufficient to statistically distinguish the estimated value (the log odds given all X = 0) from 0, which means that you cannot distinguish the probability of the event (given all X = 0) from 0.5 (the data are compatible with probabilities both larger and smaller than 0.5). This is rarely a sensible hypothesis to test.
Question
Could you please elaborate on the specific differences between scale development and index development (based on formative measurement) in the context of management research? Is it essential to use only the pre-defined or pre-tested scales to develop an index, such as brand equity index, brand relationship quality index? Suggest some relevant references.
Kishalay Adhikari, you might find some useful information in Chapter 12 of the following book:
Hair, J. F., Babin, B. J., Anderson, R. E., & Black, W. C. (2019). Multivariate data analysis (8th ed.). Cengage.
I think that some of this chapter could have been written a bit more effectively, but overall it is helpful in drawing distinctions between scales and indexes.
All the best with your research.
Question
I am currently writing a research proposal for my thesis, and I wanted to know: is it possible to use two different econometric methods to obtain the findings?
Yes, that is possible, and comparing the results of the two methods can also serve as a robustness check.
Question
Dear all,
I have a question about a mediation hypothesis interpretation.
We have a model in which the direct effect of X on Y is significant, and its standardized estimate is greater than the indirect effect estimate (X -> M -> Y), which is significant too.
As far as I can understand, it should be a partial mediation, but should the indirect effect estimate be larger than the direct effect estimate to assess a partial mediation effect?
Or is the significance of the indirect effect sufficient to assess the mediation?
Marco
Marco Marini as far as I know, you must have two conditions both verified for a partial mediation hypothesis to be confirmed:
1 - the indirect effect must be significant (X -> M -> Y) *
2 - the direct effect must be significant (X -> Y)
If both conditions are satisfied, then you have a partial mediation. If condition 1 is satisfied, but not condition 2, then you have a full mediation (i.e., your mediator entirely explains the effect of X on Y).
As Christian Geiser suggested: "Partial mediation simply means that only some of the X --> Y effect is mediated through M".
To my knowledge, the ratio between direct and indirect effect has no role in distinguishing between partial vs. full mediation.
* Please note: "the indirect effect must be significant" does not mean that paths a and b must both be significant. All you need is for the product a × b to be significant (preferably bootstrapped).
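The bootstrapped indirect-effect test can be sketched in lavaan like this; the data frame `dat` and the variable names `X`, `M`, `Y` are placeholders for your own:

```r
library(lavaan)

model <- '
  M ~ a * X            # path a
  Y ~ b * M + c * X    # path b and direct effect c
  ab := a * b          # indirect effect
  total := c + (a * b)
'
fit <- sem(model, data = dat, se = "bootstrap", bootstrap = 1000)

# Percentile bootstrap CIs; look at the row for "ab":
parameterEstimates(fit, boot.ci.type = "perc")
```

If the CI for `ab` excludes 0 and the estimate of `c` is also significant, that matches the partial-mediation pattern described above.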
Question
300 Participants in my study viewed 66 different moral photos and had to make a binary choice (yes/no) in response to each. There were 3 moral photo categories (22 positive images, 22 neutral images and 22 negative images). I am running a multilevel logistic regression (we manipulated two other aspects of the images) and have found unnaturally high odds ratios (see below). We have no missing values. Could anyone please help me understand what the results below might mean? I understand I need to approach this with extreme caution, so any advice would be highly appreciated.
Yes choice: morally negative compared to morally positive (OR=441.11; 95% CI [271.07, 717.81]; p<.001)
Yes choice: morally neutral compared to morally positive (OR=0.94; 95% CI [0.47,1.87]; p=0.86)
It should be noted that when I plot the data, very very few participants chose yes in response to the neutral and positive images. Almost all yes responses were given in response to the negative images.
I think you have answered your question: "It should be noted that when I plot the data, very very few participants chose yes in response to the neutral and positive images. Almost all yes responses were given in response to the negative images."
This is what you'd expect even in a simple 2x2 design. If the probability of a yes response is very high in the negative condition and very low in the positive condition, the OR will be huge, since it is the ratio of very large odds to very small odds.
This isn't unnatural unless the raw probabilities don't reflect this pattern. (There might still be issues, but not from what you described.)
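To see how an OR of that size arises from ordinary probabilities, here is the arithmetic with illustrative values (these are not the poster's actual response rates):

```r
p_neg <- 0.90   # illustrative P(yes | morally negative)
p_pos <- 0.02   # illustrative P(yes | morally positive)

odds_neg <- p_neg / (1 - p_neg)   # 0.90 / 0.10 = 9
odds_pos <- p_pos / (1 - p_pos)   # 0.02 / 0.98 = 1/49

odds_neg / odds_pos               # OR = 9 * 49 = 441
```

So two perfectly plausible probabilities (90% vs 2%) already produce an OR in the hundreds.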
Question
Hi Folks,
I am working on a meta-analysis and I am trying to convert data into effect sizes (Cohen's d) to provide a robust synthesis of the evidence. All the studies used a one-group pre-post design and the outcome variables were assessed before and after the participation in an intervention.
Although the majority of the studies included in this meta-analysis reported either the effect sizes (Cohen's d) or the mean changes, a few of them reported the median changes. I am wondering if there is a way to calculate the effect sizes of these median changes.
For example, the values reported in one paper are:
Pre Median (IQR) = 280.5 (254.5 - 312.5)
Post Median (IQR) = 291.0 (263.5 - 321.0)
Is there any way I can convert these values into Cohen's d?
Thank you very much for your help.
I do not think I will include these estimated means and SDs in the meta-analysis, but I can definitely report them in the narrative synthesis, as they will add additional evidence (with all the precautions due to the assumptions) to the findings.
Thanks again David, I very much appreciated your help.
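For reference, a common way to obtain such estimated means and SDs from medians and IQRs (Wan et al., 2014, BMC Medical Research Methodology) assumes approximate normality: mean ≈ (q1 + median + q3)/3 and SD ≈ IQR/1.35. Applied to the values quoted above purely as an illustration:

```r
# Approximations valid only under roughly normal data (Wan et al., 2014)
approx_mean <- function(q1, med, q3) (q1 + med + q3) / 3
approx_sd   <- function(q1, q3)      (q3 - q1) / 1.35

pre_m   <- approx_mean(254.5, 280.5, 312.5)  # 282.5
pre_sd  <- approx_sd(254.5, 312.5)           # ~43.0
post_m  <- approx_mean(263.5, 291.0, 321.0)  # ~291.8
post_sd <- approx_sd(263.5, 321.0)           # ~42.6
```

Note that Cohen's d for a one-group pre-post design additionally requires the pre-post correlation (or the SD of the change scores), which these quantities alone do not give you, so the caution expressed above is well placed.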
Question
Hi All, I was wondering what statistical test I should use for this example: comparing participants' ratings of a person's (1) competence and (2) employability, based on the person's (1) level of education and (2) gender.
So there are two IVs:
(1) The person's level of Education [3 levels].
(2) The person's Gender [2 genders].
So there is a total of 6 conditions presented to the participants [3 levels of education x 2 genders]. However, each participant is only presented with 4 conditions, meaning there is a mixture of between-participants and within-participants factors in the study.
There are two DVs:
(1) Participants' rating of the person's Competence.
(2) Participants' rating of the person's Employability.
I was thinking the statistical test would be MANOVA, but want to confirm.
Also, if the participants used in the study are a mixture of between-participants, and within-participants, how can MANOVA work in this case?
Any advice or insight on the above would be really appreciated. Thank you.
Hello Paul,
The first question to address is: how do you characterize the strength/scale of your DVs? Nominal? Ordinal? Interval? The second question is: do you really aim to interpret the results multivariately (that is, for the vector of values on rated competence and rated employability), or is it more likely that your attention will be focused on these individually? If individually, then run univariate analyses, one for each DV; otherwise, go multivariate.
Interval / Multivariate:
Multivariate regression or Manova (either would involve a repeated measures factor having four levels: condition)
Interval / Univariate:
Regression or mixed (two-between, one-within) anova
Ordinal / Univariate:
Ordinal regression or an adaptation of aligned ranks anova
Nominal / univariate (Depends on number of levels of the nominal variable)
Possibly logistic regression (if two levels of DV, such as "Satisfactory/Unsatisfactory")
Question
I am using an ARDL model; however, I am having some difficulties interpreting the results. I found that there is cointegration in the long run. I have provided pictures below.
Mr a. D.
The ECT(-1) term is the one-period-lagged error-correction term, i.e., the lagged residual from the long-run (cointegrating) equation; its coefficient gives the speed of adjustment back towards the long-run equilibrium.
Regards
Question
I have long-term rainfall data and have calculated Mann-Kendall test statistics using the XLSTAT trial version ( addon in MS word). There is an option for asymptotic and continuity correction in XLSTAT drop-down menu.
• What do the terms "asymptotic" and "continuity correction" mean?
• When and under what circumstances should we apply it?
• Is there any assumption on time series before applying it?
• What are the advantages and limitations of these two processes?
I am not specifically an expert in the Mann-Kendall trend test, but it is related to classical non-parametric tests, like the Kendall correlation test, which I know better. Be careful with XLSTAT (which works in Excel, not in Word): in the procedure I used a few years ago, I had many problems and had to contact the support. I think you should read more about the test and, more generally, about non-parametric tests.
"Asymptotic" refers to the behaviour as the number of observations n grows to infinity. For small n, these tests are based on tables of critical values depending on n; when n is too large for the tables, one uses the asymptotic distribution instead, often normal with a given mean and variance (depending on n, of course).
The continuity correction is needed because the test statistic takes discrete values whereas the asymptotic distribution is continuous. The same kind of correction appears with the normal approximation to the binomial distribution. Look in your statistics course.
Question
In confirmatory factor analysis (CFA) in Stata, the first observed variable is constrained by default (beta coefficient = 1, mean of latent variable = constant).
I don't understand why, because other software packages report beta coefficients for all observed variables.
So, I have two questions.
1- Which variable should be constrained in confirmatory factor analysis in Stata?
2- Is it possible to have a model without a constrained variable, like in other software packages?
Hello Seyyed,
I guess by "beta" you mean the factor loading? Traditionally these are denoted lambda, but Stata apparently labels them differently.
The fixation of the "marker variable" is needed (a) to assign a metric to the latent variable (that of the marker) and (b) to identify the equation system.
As far as I know, it does not matter which variable you choose, as long as it is a valid indicator of the latent variable.
HTH
Holger
Question
I am working on two SNPs on the same gene, and I tested some biochemical parameters for 150 patients with hypothyroidism. I want to see if a certain haplotype has an impact on these biochemical parameters. How can I statistically calculate the haplotypes and their association with these parameters?
Question
Do serial correlation, autocorrelation and seasonality mean the same thing, or are they different terms? If different, what are the exact differences with respect to statistical hydrology? What are the statistical tests to determine (quantify) the serial correlation, autocorrelation and seasonality of a time series?
Kabbilawsh Peruvazhuthi, serial correlation and autocorrelation are the same thing (the correlation of a series with lagged copies of itself), but seasonality is different: it is a periodic pattern that repeats at a fixed interval.
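Both can be inspected directly in base R; here `rainfall` stands in for an assumed monthly rainfall vector:

```r
x <- ts(rainfall, frequency = 12)   # monthly series

acf(x)                                      # sample autocorrelation (= serial correlation)
Box.test(x, lag = 12, type = "Ljung-Box")   # portmanteau test for serial correlation

plot(stl(x, s.window = "periodic"))         # trend / seasonal / remainder decomposition
```

A pronounced seasonal component in the `stl()` decomposition indicates seasonality, while significant spikes in `acf()` (or a small Ljung-Box p-value) indicate serial correlation.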
Question
I want to draw a graph of predicted probabilities vs. observed probabilities. For the predicted probabilities I use the R code below. Is this code OK or not?
Could anyone tell me how I can get the observed probabilities and draw a graph of predicted vs. observed probability?
analysis10 <- glm(Response ~ Strain + Temp + Time + Conc.Log10
                  + Strain:Conc.Log10 + Temp:Time,
                  family = binomial, data = mydata)  # 'mydata': the attached data set
predicted_probs <- data.frame(probs = predict(analysis10, type = "response"))
I have attached that data file
Plotting observed vs predicted is not sensible here.
You don't have observed probabilities; you have observed events. You might use "Temp", "Time", and "Conc.Log10" as factors (with 4 levels each) and define 128 different "groups" (all combinations of all levels of all factors), then use the proportion of observed events within each of these 128 groups. But you have only 171 observations in total, so there is no chance of getting any reasonable proportions (you would need tens or hundreds of observations per group for this to work reasonably well).
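For completeness, if the design did have enough replication per cell, the comparison could be sketched like this (the data frame `dat` and the fitted model `analysis10` follow the question; the grouping is the one described above):

```r
# One group per combination of factor levels
dat$grp <- interaction(dat$Strain, dat$Temp, dat$Time, dat$Conc.Log10)

obs  <- tapply(dat$Response, dat$grp, mean)   # observed proportion of events per group
pred <- tapply(predict(analysis10, type = "response"), dat$grp, mean)

plot(pred, obs, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Mean predicted probability", ylab = "Observed proportion")
abline(0, 1, lty = 2)   # perfect-calibration line
```

With 171 observations most groups would contain only one or two points, which is exactly the problem described above.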
Question
Four homogeneity tests, namely the Standard Normal Homogeneity Test (SNHT), the Buishand Range (BR) test, the Pettitt test and the Von Neumann Ratio test (VNR), are applied for finding the break-point. Of these, SNHT, BR and Pettitt give the timestamp at which the break occurs, whereas VNR measures the amount of inhomogeneity. Multiple papers have made the claim that "SNHT finds break points at the beginning and end of the series, whereas the BR & Pettitt tests find break points in the middle of the series."
Is there any mathematical proof behind that claim? Is there any peer-reviewed journal article that has proved the claim, or any paper that has cross-checked it?
Say I have 100 years of data: does the "start of the time series" mean the first 10 years, the first 15 years, or the first 20 years? How does one come to a conclusion?
Well, I do not know much about the tests you mention; your last sentence is closer to my experience. I have prepared a paper on daily temperature series and used split-line models with a plateau phase followed by a linear or nonlinear phase (the same can be done when various trends are followed by a plateau phase, as happens in Mitscherlich's law of diminishing returns, i.e. the exponential curve). There is also a test that helps decide whether the linear trends in two sub-periods are equal or not.
Question
Hi
I have a huge dataset for which I'd like to assess the independence of two categorical variables (x,y) given a third categorical variable (z).
My assumption: I have to run the independence test for each unique value of "z", and if even one of these tests rejects the null hypothesis (independence), it is rejected for the whole data set.
Results: I have done Chi-Sq, Chi with Yates correction, Monte Carlo and Fisher.
- Chi-squared is not a good method for my data due to the sparse contingency table
- Yates and Monte Carlo show rejection of the null hypothesis
- For Fisher, all the p-values are equal to 1
1) I would like to know if there is something I'm missing or not.
2) I have already discarded the "z"s that have DOF = 0. If I keep them how could I interpret the independence?
3) Why does Fisher's test result in p-values of 1 all the time?
4) Any suggestion?
#### Apply Fisher exact test
fish = fisher.test(cont_table,workspace = 6e8,simulate.p.value=T)
#### Apply Chi^2 method
chi_cor = chisq.test(cont_table,correct=T); ### Yates correction of the Chi^2
chi = chisq.test(cont_table,correct=F);
chi_monte = chisq.test(cont_table,simulate.p.value=T, B=3000);
Hello Masha,
Why not use the Mantel-Haenszel test across all the z-level 2x2 tables for which there is some data? This allows you to estimate the aggregate odds ratio (and its standard error), thus you can easily determine whether a confidence interval includes 1 (no difference in odds, and hence, no relationship between the two variables in each table) or not.
That seems simpler than having to run a bunch of tests, and by so doing, increase the aggregate risk of a type I error (false positive).
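In R this is a one-liner on a 2×2×K array; the data frame `dat` and the variable names `x`, `y`, `z` here follow the question's notation:

```r
# Build a 2 x 2 x K table: categories of x, categories of y, strata of z
tab <- table(dat$x, dat$y, dat$z)

# Cochran-Mantel-Haenszel test plus the common odds ratio with its 95% CI
mantelhaen.test(tab)
```

If the CI for the common odds ratio includes 1, there is no evidence of an x-y association after conditioning on z.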
Question
1. In non-parametric statistics, the Theil–Sen estimator is a method for robustly fitting a line to sample points in the plane (simple linear regression) by choosing the median of the slopes of all lines through pairs of points. Many journals have applied Sen slope to find the magnitude and direction of the trend
2. It has also been called Sen's slope estimator, slope selection, the single median method, the Kendall robust line-fit method, and the Kendall–Theil robust line.
3. The major advantage of the Theil-Sen slope is that the estimator can be computed efficiently and is insensitive to outliers. It can be significantly more accurate than non-robust simple linear regression (least squares) for skewed and heteroskedastic data, and competes well against least squares even for normally distributed data in terms of statistical power.
My question is: are there any disadvantages/shortcomings of Sen's slope? Are there any assumptions on the time series before applying it? Is there any improved version of this method? Since the method dates from 1968, does any literature compare the power of the Sen slope with other non-parametric methods? What inference can be made by applying the Sen slope to a hydrologic time series specifically? What about the performance of the Sen slope when applied to an autocorrelated time series, such as rainfall or temperature?
Two points. First, the approach is quite similar to what Boscovich proposed in the 1700s, so when dating this type of procedure you can go even further back (Farebrother, R. W. (1999). Fitting linear relationships: A history of the calculus of observations 1750--1900. New York, NY: Springer.) Second, a disadvantage is that it will be slow for even medium-sized n. Here is a quick coding of Theil-Sen compared with the lm function in R (which is slower for n=10 because of all the checks it does before estimating the model) and what I assume is similar to the main computation in it. At n=10 they are similar-ish, but at n=100 Theil-Sen is much slower (note the different units).
> theilsen <- function(x,y){
+ n <- length(x) # assuming no missing
+ slopes <- {}
+ for (i in 1:(n-1))
+ for (j in (i+1):n)
+ slopes <- c(slopes, (y[i]-y[j])/(x[i]-x[j]))
+ beta1 <- median(slopes[is.finite(slopes)])
+ beta0 <- median(y - beta1*x)
+ return(list(beta0=beta0,beta1=beta1))
+ }
> lmb <- function(x,y) solve(t(x) %*% x) %*% t(x) %*% y
> library(microbenchmark)
> x <- rnorm(10); y <- rnorm(10)
> microbenchmark(theilsen(x,y),lm(x~y),lmb(x,y))
Unit: microseconds
expr min lq mean median uq max
theilsen(x, y) 249.101 271.3515 764.2949 303.4510 373.6505 42047.800
lm(x ~ y) 1222.101 1293.1510 1496.4859 1419.3010 1594.7010 5597.801
lmb(x, y) 100.001 103.3010 271.7730 120.2015 186.0010 7302.101
neval cld
100 ab
100 b
100 a
> x <- rnorm(100); y <- rnorm(100)
> microbenchmark(theilsen(x,y),lm(x~y),lmb(x,y))
Unit: microseconds
expr min lq mean median uq
theilsen(x, y) 60628.902 75446.151 91017.986 76951.5510 80715.9015
lm(x ~ y) 1187.001 1377.951 1619.025 1659.1510 1807.2015
lmb(x, y) 100.600 111.702 176.185 192.4505 215.2015
max neval cld
543952.501 100 b
2262.701 100 a
303.602 100 a
>
Question
My question concerns the problem of calculating odds ratios in logistic regression analysis when the input variables are on different scales (e.g.: 0.01-0.1, 0-1, 0-1000). Although the coefficients of the logistic regression look fine, the odds ratio values are, in some cases, enormous (see example below).
In the example there were no outlier values in each input variables.
What is the general rule: should we normalize all input variables before the analysis to obtain reliable OR values?
Sincerely
Mateusz Soliński
The OR is the exponential of the regression coefficient, i.e. the multiplicative change in the odds per one-unit increase of the predictor. If a predictor only ranges from 0.01 to 0.1, a "one-unit increase" lies far outside the data, so the corresponding OR is naturally enormous. Rescaling (e.g. standardizing) the input variables before fitting yields ORs per SD and makes them comparable across predictors.
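A sketch of the rescaling, assuming a fitted logistic model `fit` on a data frame `dat` whose binary outcome column is named `y` (both names are placeholders):

```r
# ORs per one-unit change -- huge when a predictor's whole range is, say, 0.01-0.1:
exp(coef(fit))

# Standardize the numeric predictors (not the 0/1 outcome 'y') and refit:
dat_z <- dat
pred_vars <- setdiff(names(dat_z)[sapply(dat_z, is.numeric)], "y")
dat_z[pred_vars] <- lapply(dat_z[pred_vars], function(v) as.numeric(scale(v)))
fit_z <- update(fit, data = dat_z)

exp(cbind(OR = coef(fit_z), confint.default(fit_z)))  # ORs per SD with Wald 95% CIs
```

Note that rescaling changes only the units of interpretation, not the model fit or the p-values of the predictors.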
Question
Dear RG members, how can I find R packages and lists specific to health and medical research? Furthermore, could you point me to online sources or guidelines for easy-to-study statistical analysis using R for medical research and its data visualization?
You can search via Google: enter the subject name followed by the phrase "R package".
Question
I need to run an aligned-rank-transform ANOVA and Tukey HSD for the interactions among the treatments, but my dataset has a few NAs due to experimental errors.
When I run :
anova(model<- art(X ~ Y, data = d.f))
I get the warning :
Error in (function (object) :
Aligned Rank Transform cannot be performed when fixed effects have missing data (NAs).
Manually shifting the values up to close the gaps is not an option, because each row is a sample; it would just move the NAs into the wrong samples.
The issue is that you are using art() from ARTool to fit the model, and it can't handle missing values. You could use listwise deletion by passing na.omit(d.f) to the art() function - though this could potentially bias results (though no more than using na.rm=TRUE in anova() or lm()).
A better solution is multiple imputation (e.g., with the mice package in R), though I'm not sure whether that works directly with art() models; alternatively, use a different approach to handle your data (which presumably aren't suitable for linear models): a transformation, a generalized linear model, robust regression, etc., depending on the nature of the data.
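The listwise-deletion route looks like this, reusing `d.f`, `X` and `Y` from the question (note that art() requires the fixed effects to be factors):

```r
library(ARTool)

d2 <- na.omit(d.f)            # listwise deletion: drop rows containing any NA
m  <- art(X ~ Y, data = d2)   # aligned rank transform model, as in the question
anova(m)

# Post-hoc contrasts in recent ARTool versions (uses emmeans under the hood):
# art.con(m, "Y")
```

Rows lost this way are the price of the simplicity; if many rows contain NAs, the imputation route mentioned above becomes more attractive.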
Question
Hello,
I am hoping that someone who is well versed in statistics can help me with my analysis and design. I am investigating the torque produced via stimulation from different quadriceps muscles. I have two groups (INJ & CON), three muscles (VM, RF, VL), three timepoints (Pre, Post, 48H) in which torque is measured at two different frequencies (20 & 80 Hz). In addition to the torque, we also want to look at the relative change from baseline for immediately Post and 48H in order to remove some of the baseline variability between muscles or subjects. A ratio of 1.0 indicates same torque values post and Pre. This is a complex design so I have a few questions.
If I wanted to use repeated-measures ANOVA, I first have to test for normality. When I run the normality test on the raw data in SPSS, one condition fails and others are close (p < 0.1). When I run the ratios, I also have a condition that fails normality. Does this mean I now have to do a non-parametric test for each? If so, which one? I am having a difficult time finding a non-parametric test that can account for all my independent variables. Friedman's test handles repeated measures, but it cannot account for group/frequency/muscle differences the way an ANOVA would.
Is repeated-measures ANOVA robust enough to account for this? If so, should I set this up as a four-way repeated-measures ANOVA? It seems like I am really increasing my risk of type I error. It could be separated by frequency (20 and 80 Hz), because it's established that a higher frequency produces higher torque, but as you can tell I have a lot of uncertainties in the design. I apologize if I am leaving out information vital to giving answers; please let me know and I can elaborate further.
Thank you,
Chris
You can use a plain repeated-measures ANOVA (over time), as per Bhogaraju Anand's suggestion. As for the normality test, forget it: it is not essential, and parametric tests are sufficiently robust to deviations from normality (see attached file).
Question
I have 18 rainfall time series. On calculating the variance, I found an appreciable change in the variance from one rainfall station to another. Parametric statistical tests are sensitive to variance; does this mean we need to apply robust statistical tests instead of parametric tests?
Kabbilawsh Peruvazhuthi, parametric tests generally assume equal variances across groups. You could potentially resolve the issue either with a data transformation or by switching to a non-parametric equivalent test.
Best !!
AN
Question
I carried out the Kruskal-Wallis H test in SPSS to do pairwise comparisons of three groups. I got some positive and negative values in the Test Statistic and Std. Test Statistic columns. I can state conclusions based on the p-value, but I don't know what the values in the Test Statistic and Std. Test Statistic columns indicate, and why some are positive and some negative. Some explanation would be appreciated. Thanks in advance.
Well, if the order of the groups were reversed, you would get a test statistic of the same magnitude but with the opposite sign; the sign simply reflects the direction of the difference given the order of the groups in each pair. You can play with a t-test to see this effect: changing the order of the groups will give you the same t value, but with a different sign.
Question
Hi -
I am looking for a way to quantify annual temporal variation in the intensity of space-use (per pixel across a reserve) as a single value. I was originally looking into using the coefficient of variation; however, the CV does not appropriately quantify the intensity of utilization. For example, areas with constant high utilization and constant low utilization will both have a value of 0.
I will have yearly intensity-of-utilization values for each pixel within a park, where 5 is the highest possible utilization and 0 is no utilization. For example:
        2017  2018  2019
pixel 1    5     5     4
pixel 2    3     1     5
pixel 3    1     0     1
I'm looking for one single value per pixel that can quantify the temporal variation while still accounting for the total intensity of utilization. Is there something similar to the CV that will also account for the magnitude of utilization per pixel?
Emma
Please refer to these articles, as they might guide you along the right track:
Question
I am currently performing undergraduate research in forensics and I am comparing two types of width measurements (the widths of land and groove impressions on fired bullets), one taken by an automated system and the other performed by my associate manually using a comparison microscope. We are trying to see if the automated method is a more suitable replacement for the manual method. We were recommended to perform a simple linear regression (ordinary least squares) however when it comes to actually interpreting the results we had some slight trouble.
According to pg 218 of Howard Seltmann's experimental design and analysis, "sometimes it is reasonable to choose a different null hypothesis for β1. For example, if x is some gold standard for a particular measurement, i.e., a best-quality measurement often involving great expense, and y is some cheaper substitute, then the obvious null hypothesis is β1 = 1 with alternative β1 ≠ 1. For example, if x is percent body fat measured using the cumbersome whole body immersion method, and Y is percent body fat measured using a formula based on a couple of skin fold thickness measurements, then we expect either a slope of 1, indicating equivalence of measurements (on average) or we expect a different slope". In comparison to normal linear regression where β1 = 0 is usually tested, I was just wondering how you actually test the hypothesis proposed by Seltmann: do we test it the same way you would test the hypotheses of a normal linear regression (finding T test values, p values, etc)? Or is there a different approach?
I am also open to suggestions as to what other tests could be performed
A quick thank you in advance for those who take the time to help!
Reference:
Your question is specific: can the automated measurement replace the human one?
You need to look at the literature on comparison of two measurement methods.
A graphical approach is very useful because you can see if the agreement is dependent on the value being measured. Bland and Altman's original paper is one of the most highly-cited methodology papers ever. https://www.semanticscholar.org/paper/Measuring-agreement-in-method-comparison-studies-Bland-Altman/6b118f54830361182c172306027e3af0516a3c08
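The Bland-Altman approach is simple to sketch in base R; the vectors `manual` and `automated` stand for paired width measurements of the same impressions by the two methods:

```r
d <- automated - manual           # per-impression difference between methods
a <- (automated + manual) / 2     # per-impression mean of the two methods

bias <- mean(d)                          # systematic difference
loa  <- bias + c(-1.96, 1.96) * sd(d)    # 95% limits of agreement

plot(a, d, xlab = "Mean of the two methods",
     ylab = "Difference (automated - manual)")
abline(h = bias)
abline(h = loa, lty = 2)
```

If the differences cluster around zero within limits of agreement that are narrow relative to forensic requirements, and show no trend across the measurement range, the automated method is a plausible substitute.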
Question
Other than R, which other software/app can I use to easily obtain volcano plots related to gene information?
Try the dash_bio or bioinfokit modules in Python for creating volcano plots.
But I believe R programming is easier.
Question
N/A
"What test would offer insight as to group x condition?"
Only a parametric model can do this. As soon as you use ranks (i.e., some kind of "non-parametric analysis"), an interaction is no longer meaningfully interpretable.
There are more important things to consider when analysing an interaction: it makes a difference whether you assume that effects are additive or multiplicative. A meaningful interpretation also requires that the observed interaction is not due to ceiling or floor effects.
It's easy to do "some test" and get "some result", but it is tricky to get a meaningful interpretation. I suggest collaborating with a statistician.
Question
1. Which is the correct order in pre-processing a rainfall time series: homogeneity testing followed by outlier detection & treatment, or outlier detection & treatment followed by homogeneity testing?
2. I have monthly rainfall data for 113 years. I am planning to run four homogeneity tests: the Buishand range test (BRT), the standard normal homogeneity test (SNHT), the von Neumann ratio (VNR) and the Pettitt test.
3. Which is the appropriate method for identifying outliers in a non-normal distribution?
4. Should descriptive statistics (DS) and exploratory data analysis (EDA) be conducted before or after treating the outliers, or should the EDA & DS be compared before and after treating the outliers?
You should first check the discordancy (in R); this will tell you whether any of the regions in your dataset are discordant (i.e., contain gross errors or outliers), and then do a homogeneity test.
Question
Conventionally, four homogeneity tests, namely the Standard Normal Homogeneity Test (SNHT), the Buishand Range (BR) test, the Pettitt test and the Von Neumann Ratio test (VNR), are applied for finding the break-point. SNHT, BR and Pettitt give the timestamp at which the break occurs, whereas VNR measures the amount of inhomogeneity. SNHT finds break points near the beginning and end of the series, whereas the BR & Pettitt tests find break points near the middle of the series. How does one come to a common conclusion (the break point of the time series) if the three tests give different timestamps?
Question
I want to check the Homogeneity of a rainfall time series. I want to apply the following techniques. Is there any R package available in CRAN for running the following test?
• The von Neumann Test
• Cumulative Deviations Test
• Bayesian Test
• Dunnett Test
• Bartlett Test
• Hartley Test
• Tukey Test for Multiple Comparisons
In XLSTAT software, four homogeneity tests are present, is there any other software where all the homogeneity tests are present?
Dear @Kabbilawsh Peruvazhuthi,
In R, you can use several packages:
- climtrends package: von Neumann test, Cumulative Deviations Test, and other homogeneity tests such as SNHT, Buishand, and Pettitt
- DescTools package: DunnettTest()
- stats package (base R): TukeyHSD() for Tukey multiple comparisons and bartlett.test(x, …) for Bartlett's test; multcompView can display the results
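For reference, the von Neumann ratio itself is simple enough to compute by hand; a Python sketch of the same formula the dedicated packages use (the example series are made up):

```python
from statistics import mean

def von_neumann_ratio(x):
    """Von Neumann ratio N = sum_t (x_t - x_{t+1})^2 / sum_t (x_t - xbar)^2.
    For a homogeneous, independent series E[N] is about 2; values well
    below 2 suggest a break or trend in the series."""
    xbar = mean(x)
    num = sum((a - b) ** 2 for a, b in zip(x, x[1:]))
    den = sum((v - xbar) ** 2 for v in x)
    return num / den

r_homog = von_neumann_ratio([1, 2, 1, 2, 1, 2, 1, 2])  # rapidly alternating
r_shift = von_neumann_ratio([1] * 10 + [10] * 10)      # clear mean shift
print(round(r_homog, 2), round(r_shift, 2))            # 3.5 0.2
```

The shifted series gives a ratio far below 2, which is exactly the signal the VNR homogeneity test looks for.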
Question
I want to do a trend analysis of temperature and precipitation datasets that show several continuous ups and downs (see the figure). For example, more than 10 breaks can be seen in my data sets. Is it advisable to do piecewise linear regression analysis in such cases? To overcome the limitations of such a parametric analysis, I have done nonparametric trend analysis such as Mann-Kendall.
No, piecewise regression is nonsense in this context.
You have the linear trends. It's perfectly fine to use a simple linear regression, and the data you show does not indicate that a more complicated model would give you more insight.
If you want a curve that follows the measurements more closely than a straight line, you might consider a spline or a scatterplot smoother (use Google...). You can choose the smoothness freely (from a straight line up to an extremely wiggly curve that touches every point) or use cross-validation to find the smoothness with the smallest cross-validation error.
If you are interested in multi-annual patterns or more precise forecasts (at least for the few following years), you might consider a time-series decomposition (use Google...). But your data is rather small for this task (few measurements, or too noisy given the number of measurements).
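Since the question mentions Mann-Kendall: the test is easy to sketch from first principles. A minimal Python version (no tie correction, which the dedicated R implementations do handle; the example series is made up):

```python
import math

def mann_kendall(x):
    """Mann-Kendall trend test (sketch, no tie correction).
    Returns (S, Z, p) for a two-sided test of monotonic trend."""
    n = len(x)
    # S counts concordant minus discordant pairs over all i < j
    S = sum((x[j] > x[i]) - (x[j] < x[i])
            for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if S > 0:
        Z = (S - 1) / math.sqrt(var_s)
    elif S < 0:
        Z = (S + 1) / math.sqrt(var_s)
    else:
        Z = 0.0
    p = math.erfc(abs(Z) / math.sqrt(2))   # two-sided normal p-value
    return S, Z, p

S, Z, p = mann_kendall(list(range(20)))    # a strictly increasing series
print(S, round(Z, 2), p)                   # S = 190, p far below 0.05
```

Note that the test only asks whether a monotonic trend exists; it says nothing about the "ups and downs" within the series, which is why it coexists comfortably with a simple linear trend description.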
Question
I want to develop a hybrid SARIMA-GARCH model for forecasting monthly rainfall data. The data is split 80% for training and 20% for testing. I initially fit a SARIMA model for rainfall and found that the residuals of the SARIMA model are heteroscedastic. To capture the information left in the SARIMA residuals, GARCH is applied to model the residual part. A GARCH model of order (p=1, q=1) is applied. But when the data is forecasted, I am getting a constant value. I tried different model orders for GARCH and still get a constant value. I have attached my code; kindly help me resolve it. Where have I made a mistake in the code, or should some other CRAN package be used?
library("tseries")
library("forecast")
library("fGarch")                          # note: the package is fGarch, not "fgarch"
setwd("C:/Users/Desktop")                  # Setting of the work directory
datats <- ts(data, frequency = 12, start = c(1982, 4))  # Converting data set into time series
plot.ts(datats)                            # Plot of the data set
diffdatats <- diff(datats, differences = 1)   # Differencing the series
datatsacf <- acf(datats, lag.max = 12)     # Obtaining the ACF plot
datapacf <- pacf(datats, lag.max = 12)     # Obtaining the PACF plot
auto.arima(diffdatats)                     # Finding the order of the ARIMA model
datatsarima <- arima(diffdatats, order = c(1, 0, 1), include.mean = TRUE)  # Fitting the ARIMA model
forearimadatats <- forecast(datatsarima, h = 12)  # forecast.Arima() is defunct; use forecast()
plot(forearimadatats)                      # Plot of the forecast
residualarima <- resid(datatsarima)        # Obtaining residuals
FinTS::ArchTest(residualarima, lags = 12)  # Test for heteroscedasticity (archTest() is not in the loaded packages)
# Fit the GARCH part to the SARIMA residuals, not to the raw series:
garchdatats <- garchFit(formula = ~ garch(1, 1), data = residualarima,
                        cond.dist = "norm", include.mean = TRUE,
                        trace = TRUE, algorithm = "nlminb")
# Forecasting: note that the *mean* forecast of a pure GARCH process is constant;
# only the variance forecast evolves, so constant point forecasts are expected here.
forecastgarch <- predict(garchdatats, n.ahead = 12, mse = "uncond", plot = FALSE)
plot.ts(forecastgarch$meanForecast)        # Plot of the forecast
In the beginning this happens to everyone; this is how we learn. I would advise you to check your theory and your code line by line. It will work for sure.
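On the constant forecasts in the question above: this is expected behaviour rather than a coding error. A pure GARCH model describes the conditional variance; its multi-step mean forecast is essentially constant, and the variance forecast decays geometrically towards the unconditional variance omega / (1 - alpha - beta). A tiny numeric sketch (the parameter values are made up):

```python
# h-step GARCH(1,1) variance forecast: sigma2[h] = omega + (alpha+beta)*sigma2[h-1].
# It converges to omega / (1 - alpha - beta); the mean forecast does not evolve.
omega, alpha, beta = 0.2, 0.1, 0.8
sigma2 = 5.0                      # last in-sample conditional variance (assumed)
path = []
for h in range(48):
    sigma2 = omega + (alpha + beta) * sigma2
    path.append(sigma2)
long_run = omega / (1 - alpha - beta)
print(round(path[0], 3), round(path[-1], 3), long_run)
```

So the GARCH layer improves the prediction *intervals* of the hybrid model, not its point forecasts; the point forecast should come from the SARIMA part, with the GARCH variance added on top.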
Question
While running a system GMM using xtabond2 in Stata, I came across the following error. The error is strange to me because this is not the first time I have run a system GMM using xtabond2. The screenshot is attached.
I tried using mata: mata set matafavor space, perm to trade speed for space, but I keep getting the same error repeatedly.
Thanks; I await your response.
System GMM using xtabond2 requires the collapse option to ensure that the number of instruments is not greater than the number of countries. See Roodman (2009) on instrument proliferation in system GMM.
What does 12 mean? Something is wrong in model estimation
Question
When running Cronbach's Alpha test for internal consistency...
I have some missing values in the data set coded as 999.
Are they included in calculations or dismissed by Stata software as default?
In other words, do I have to mark some option in Stata before running Cronbach's alpha calculations so the software will dismiss missing values?
Could anybody clarify? Many thanks in advance.
use the SPSS program to analyze your data
Question
Hello, In one of the projects, I conducted a questionnaire for the skills of students before the project (PRE survey), and after the completion of the project, I conducted a post-project survey.
I calculated and processed the results of the questionnaire (the percentage increase in the level of each skill).
But I have no experience in interpreting such results.
Could you help me, or point me to publications in this area?
Thank you.
If the study was based on the same subjects (or students) then the pre-survey mean could be compared with the post-survey mean by a paired-t test.
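A minimal sketch of the paired t computation suggested above (the scores are made up; in practice you would call a library routine and look up the p-value for df = n - 1):

```python
import math
from statistics import mean, stdev

def paired_t(pre, post):
    """Paired t statistic for pre/post scores of the same subjects (sketch).
    Returns (t, df); compare |t| to the t distribution with df = n - 1."""
    d = [b - a for a, b in zip(pre, post)]   # per-subject change scores
    n = len(d)
    t = mean(d) / (stdev(d) / math.sqrt(n))
    return t, n - 1

pre  = [3, 4, 2, 5, 3, 4]    # hypothetical PRE-survey skill ratings
post = [5, 5, 4, 6, 4, 6]    # hypothetical POST-survey ratings, same students
t, df = paired_t(pre, post)
print(round(t, 2), df)
```

The key point of the pairing is that the test is run on the per-subject differences, which removes between-student variability from the comparison.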
Question
Grubbs's test and Dixon's test are widely applied in the field of hydrology to detect outliers, but the drawback of these statistical tests is that they require the dataset to be approximately normally distributed. I have rainfall data for 113 years and the dataset is non-normally distributed. What statistical tests can find outliers in non-normally distributed datasets, and what values should we substitute in place of the outliers?
Hello Kabbilawsh,
If you believed your sample data accurately represented the target population, you could: (a) run a simulation study of random samples from such a population; and (b) identify exact thresholds for cases (either individual data points or sample means or medians, depending on which better fit your research situation) at whatever desired level of Type I risk you were willing to apply.
If you don't believe your sample data accurately represent the target population, you could invoke whatever distribution you believe to be plausible for the population, then proceed as above.
On the other hand, you could always construct a Chebyshev confidence interval for the mean at whatever confidence level you desired, though this would then identify thresholds beyond which no more than (100 - CI)% of sample means would be expected to fall, no matter what the shape of the distribution. This, of course, would apply only to samples of 2 or more cases, not to individual scores.
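A sketch of the Chebyshev interval described above (the data are made up; the interval is conservative by construction, holding for any distribution with finite variance):

```python
import math
from statistics import mean, stdev

def chebyshev_ci(x, confidence=0.95):
    """Distribution-free Chebyshev interval for the mean (sketch).
    Coverage >= `confidence` for any finite-variance distribution:
    xbar +/- k * s / sqrt(n), with k = 1 / sqrt(1 - confidence)."""
    n = len(x)
    k = 1.0 / math.sqrt(1.0 - confidence)   # k ~ 4.47 for 95%
    half = k * stdev(x) / math.sqrt(n)
    m = mean(x)
    return m - half, m + half

# Hypothetical skewed rainfall-like sample:
lo, hi = chebyshev_ci([12, 15, 9, 22, 14, 11, 30, 8, 13, 16])
print(round(lo, 2), round(hi, 2))
```

Compared with a normal-theory interval (k ~ 1.96 for 95%), the Chebyshev interval is much wider; that is the price paid for making no distributional assumption.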
Question
Dear colleagues, I've tried to construct a recurrent neural network using a learning sample of size 25, and I would like to get 178 columns in the output as a result (there are 25 columns and 178 lines in the learning sample), but I can use predict only for a single item:
pred <- predict(fit, inputs[-train[106,]]), so I need to change the numbers in train to get a column with the forecast.
library(RSNNS)     # provides elman() (assumed source, based on the function name)
library(quantmod)  # provides Lag() (Hmisc also has a Lag())
mi <- sslog
shift <- 25
S <- c()
for (i in 1:(length(mi)-shift+1))
{
s <- mi[i:(i+shift-1)]
S <- rbind(S,s)
}
train<-S
y<-as.data.frame(S, row.names=FALSE)
x1<-Lag(y,k=1)
x2<-Lag(y,k=2)
x3<-Lag(y,k=3)
x4<-Lag(y,k=4)
x5<-Lag(y,k=5)
x6<-Lag(y,k=6)
x7<-Lag(y,k=7)
x8<-Lag(y,k=8)
x9<-Lag(y,k=9)
x10<-Lag(y,k=10)
x11<-Lag(y,k=11)
x12<-Lag(y,k=12)
slog<-cbind(y,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12)
slog<-slog[-(1:12),]
inputs<-slog[,2:13]
outputs<-slog[,1]
fit<-elman(inputs[train[106,]],
outputs[train[106,]],
size=c(3,2),
learnFuncParams=c(0.2),
maxit=40)
#plotIterativeError(fit)
y<-as.vector(outputs[-train[106,]])
#plot(y,type="l")
pred<-predict(fit,inputs[-train[106,]])
a<-pred
print (a)
df <- data.frame(a)
Could you tell me, please, how it is possible to construct the data frame and get all 178 columns in the output as a result, not only a single column?
Thanks a lot for your help.
Try looking at the attached screenshot for a trick, but you don't have much to predict from... so ??????. David Booth
Question
I have monthly rainfall data from 1901-2013 for 29 stations covering the entire state of Kerala. I took the first 80% of the data for training the model and the remaining 20% for validating it. I developed a SARIMA monthly model for forecasting rainfall. The reviewer has asked: "What is the scientific basis for forecasting rainfall over a point location (station) over a longer time scale (a month)?" What was the reviewer trying to convey by this question?
Hi Kabbilawsh,
You developed a statistical model based on long-term observational data, which may have incorporated certain interannual variabilities intrinsic to the climate system affecting that specific place. However, you have not explained what causes the interannual variabilities, or why the monthly precipitation is predictable. The reviewer would like you to give an explanation (in the discussion) of the drivers or mechanisms underlying the variabilities.
Good luck.
Guoyu Ren
Question
Hello everyone,
Could you recommend papers, books or websites about mathematical foundations of artificial intelligence?
Thank you for your attention and valuable support.
Regards,
Cecilia-Irene Loeza-Mejía
Mathematics helps AI scientists solve challenging, deeply abstract problems using traditional methods and techniques known for hundreds of years. Math is needed for AI because computers see the world differently from humans. Where humans see an image, a computer sees a 2D or 3D matrix. With the help of mathematics, we can input these dimensions into a computer, and linear algebra is the tool for processing such data sets.
Here you can find good sources for this:
Question
Hi all, and happy new year. Please help me with my question. Recently I conducted a 2 (Group) × 4 (Condition) between-within repeated-measures ANOVA on my data. The results indicated that the main effect of Group and the main effect of Condition were significant, and the interaction was not. However, when I looked at the post hoc LSD for the interaction, the condition effect was significant in one group but not in the other. Please let me know: why, when the interaction isn't significant, does the post hoc LSD for each group indicate different comparisons? Which should I report: the main effect of condition across the two groups, or the comparisons for each group separately?
Dear Dr. Saemi
Hi,
Repeated-measures ANOVA is a rigorous test of the significance of the interaction effect, and is arguably the best choice for a 2 × 4 design. On the other hand, the LSD test can be considered the most lenient post-hoc test; in some cases this happens with LSD even when the one-way ANOVA is not significant.
Finally, when the interaction effect is not significant, you naturally do not need any post-hoc test for it. But, to be sure, try the Tukey and Bonferroni post-hoc tests, and you will probably see different results.
Also, for the main effects you can only run a post-hoc test for the condition main effect, because you have 4 conditions. For the group main effect, with only 2 groups, no post-hoc test is needed.
Best wishes,
Majid
Question
Context:
A systematic review with data synthesis is being conducted. The data synthesis aims to assess whether there are correlations between changes in two pairs of variables (e.g., disability and pain).
From each included study, only mean and SD (no more data available) have been extracted in different follow-up periods (1-month, 3-months, 6-months, 12-months).
Doubt/question:
I would like to run a correlation analysis of two pairs of Hedges'g effect size values to assess a correlation between variables' changes. What I would like to know is if it is possible to run spearman correlation weighted by sample size and number of follow-ups between Hedges'g values. And if yes, how it could be reported or justified.
I have already calculated repeated Hedges'g values for the same group in different follow-up times (1-month, 3-month, 6-month, 12-month) in different variables (i.e., variable1 and variable2).
An example of the Hedges'g matrix is:
                 Var1   Var2    n
g1 (3 months)    -0.5   -0.9   12
g1 (6 months)    -0.4   -1.6   12
g2 (3 months)    +0.1   +0.0   12
g3 (1 month)     -0.7   -0.3   40
g3 (3 months)    -0.6   -0.3   40
g3 (6 months)    -0.8   -0.4   40
g3 (12 months)   -1.0   -0.5   40
g4 (1 month)     -0.7   -0.2   40
g4 (3 months)    -0.5   -0.3   40
g4 (6 months)    -0.4   -0.3   40
g4 (12 months)   -0.6   -0.2   40
- If I run a Spearman correlation only on the Hedges' g values (without taking sample size into account), it would be wrong because sample size is ignored.
- If I run a Spearman correlation creating n cases (12 with the first pair of values, 12 with the second pair, 12 with the third, 40 with the fourth, 40 with the fifth, etc.), it would be wrong because for the same sample/group I am creating more cases than participants.
- So my question is: if I weight the Hedges' g values according to sample size and number of measurements (6 cases for the first pair, 6 for the second, 12 for the third, 10 for the fourth, 10 for the fifth, etc.) and then calculate a Spearman correlation, would that be correct?
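For what it's worth, a "Spearman correlation weighted by sample size" can be implemented as a weighted Pearson correlation computed on midranks. A Python sketch using the g values from the question and w = n as weights (whether n is the right weight, rather than inverse-variance weights, and how to handle repeated measures from the same study, are separate methodological questions):

```python
def ranks(x):
    """Midranks of x (average rank for ties), 1-based."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1                      # extend over the tie group
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def weighted_pearson(x, y, w):
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / (vx * vy) ** 0.5

def weighted_spearman(x, y, w):
    return weighted_pearson(ranks(x), ranks(y), w)

# The eleven (g, g, n) rows from the question's matrix:
g_var1 = [-0.5, -0.4, 0.1, -0.7, -0.6, -0.8, -1.0, -0.7, -0.5, -0.4, -0.6]
g_var2 = [-0.9, -1.6, 0.0, -0.3, -0.3, -0.4, -0.5, -0.2, -0.3, -0.3, -0.2]
n      = [12, 12, 12, 40, 40, 40, 40, 40, 40, 40, 40]
print(round(weighted_spearman(g_var1, g_var2, n), 3))
```

Note that weighting addresses only the unequal precision of the g values; the repeated follow-ups from the same study remain statistically dependent, so any resulting p-value should be interpreted cautiously.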
Question
I built a predictive machine learning model that generates the probability of default over the next 2 years for all the companies in a specific country. For training the algorithms I used financial data for all these companies and also the NACE codes (domains of activity), and I'm wondering whether I would develop a better model if I somehow segmented the B2B population into segments and ran distinct models on these segments.
You can work on different aspects. As cited, a demographic approach is a start, and it could also be a geographic one, depending on what information the dataset includes; you can also try to identify behavioral patterns within your data. You could go further by focusing on customer capabilities and the nature of the existing relationships.
Question
What is the method to compare the performances on two different cognitive tests (that measure different cognitive functions) of the same or different group(s)?
As two cognitive tests are inherently different from each other and many a times, have different parameters.
It will be helpful if anyone can direct me to some useful references.
Thank you
Joan Jiménez-Balado I should have clarified above I was speaking specifically with respect to making statistical comparisons between different cognitive scales within the same sample. You are correct, however, the asker mentioned "same or different group(s)".
If there are independent groups, as described in your example, one could easily make statistical comparisons on any cognitive scale. However, I can think of no way to compare (statistically) scores on two separate cognitive scales within a single sample - unless perhaps you used some variation of a rank-order test and assessed whether individual ranks on one cognitive test are similar to the individual ranks on the other cognitive scale.
Question
I'm using the ENRICH Marital Satisfaction scale (15 items) in my thesis, along with other variable scales.
Please guide me regarding the correlation and regression analyses. Out of the raw scores of the EMS and IDS and the PCT scores of both subscales, which scores should be taken in both analyses? Also, as we are using percentile scores, do I have to calculate percentile scores for the other variables, or should I consider the raw scores?
Dear @Nimra Naeem, regarding regression, please identify your dependent and independent variables. Also, the type of dependent variable is important for choosing a regression model. This link may be useful.
"Correlation vs. Regression Made Easy: Which to Use + Why" https://www.g2.com/articles/correlation-vs-regression
Question
Hi everyone! I have a statistical problem that is puzzling me. I have a very nested paradigm and I don't know exactly what analysis to employ to test my hypothesis. Here's the situation.
I have three experiments differing by one slight change (Exp 1, Exp 2, and Exp 3). Each subject could only participate in one experiment. Each experiment involves 3 lists of within-subjects trials (Lists A, B, and C); that is, the participants assigned to Exp 1 were presented with all three lists. Each list in turn comprised three subsets of within-subjects trials (let's call these subsets LEVEL, being I, II, and III).
The dependent variable is the response time (RT) and, strangely enough, is normally distributed (Kolmogorov–Smirnov test's p = .26).
My hypothesis is that no matter the experiment and the list, the effect of this last within-subjects variable (i.e., LEVEL) is significant. In the terms of the attached image, the effect of the LEVEL (I-II-III) is significant net of the effect of the Experiment and Lists.
Crucial info:
- the trials are made of the exact same stimuli with just a subtle variation among the LEVELS I, II, and III; therefore, they are comparable in terms of length, quality, and every other aspect.
- the lists are made to avoid that the same subject could be presented with the same trial in two different forms.
The main problem is that it is not clear to me how to conceptualize the LIST variable, in that it is on the one hand a between-subjects variable (different subjects are presented with different lists), but on the other hand, it is a within-subject variable, in that subjects from different experiments are presented with the same list.
For the moment, here are the solutions I've tried:
1 - Generalized Linear Mixed Model (GLMM). EXP, LIST, and LEVEL as fixed effect; and participants as a random effect. In this case, the problem is that the estimated covariance matrix of the random effects (G matrix) is not positive definite. I hypothesize that this happens because the GLMM model expects every subject to go through all the experiments and lists to be effective. Unfortunately, this is not the case, due to the nested design.
2 – Generalized Linear Model (GLM). Same family of model, but without the random effect of the participants’ variability. In this case, the analysis runs smoothly, but I have some doubts on the interpretation of the p values of the fixed effects, which appear to be massively skewed: EXP p = 1, LIST p = 1, LEVEL p < .0001. I’m a newbie in these models, so I don’t know whether this could be a normal circumstance. Is that the case?
3 – Three-way mixed ANOVA with EXP and LIST as between-subjects factors, and LEVEL as the within-subjects variable with three levels (I, II, and III). Also in this case, the analysis runs smoothly. Nevertheless, together with a good effect of the LEVEL variable (F= 15.07, p < .001, η2 = .04), I also found an effect of the LIST (F= 3.87, p = .022, η2 = .02) and no interaction LEVEL x LIST (p = .17).
The result seems satisfying to me, but is this analysis solid enough to claim that the effect of the LEVEL is by no means affected by the effect of the LIST?
Ideally, I would have preferred a covariation perspective (such as ANCOVA or MANCOVA), in which the test allows an assessment of the main effect of the between-subjects variables net of the effects of the covariates. Nevertheless, in my case the classic (M)ANCOVA variables pattern is reversed: “my covariates” are categorical and between-subjects (i.e., EXP and LIST), so I cannot use them as covariates; and my factor is in fact a within-subject one.
To sum up, my final questions are:
- Is the three-way mixed ANOVA good enough to claim what I need to claim?
- Is there a way to use categorical between-subjects variables as “covariates”? Perhaps moderation analysis with a not-significant role of the moderator(s)?
- do you propose any other better ways to analyze this paradigm?
I hope I have been clear enough, but I remain at your total disposal for any clarification.
Best,
Alessandro
P.S.: I've run a nested repeated measures ANOVA, wherein LIST is nested within EXP and LEVEL remain as the within-subjects variable. The results are similar, but the between-subjects nested effect LIST within EXP is significant (p = .007 η2 = .06). Yet, the question on whether I can claim what I need to claim remains.
Yes, of course: a three-way ANOVA.
Question
Hi,
I have 2 categorical predictors and 1 continuous predictor (3 predictors in total), and 1 continuous dependent variable. The 2 categorical variables have 3 and 2 levels, respectively. I have dummy coded the variable with 3 levels, but directly assigned 0 and 1 to the variable with only 2 levels (my understanding is that if a categorical variable has only 2 levels, further dummy coding is not necessary).
In this case, how do I carry out and interpret the assumption tests of multicollinearity, linearity, and homoscedasticity for multiple linear regression in SPSS?
Thank you!
Yufan Ye -
Have you looked at a "graphical residual analysis?" You can search on that term if you aren't familiar. It will help you study model fit, including heteroscedasticity. Also, a "cross-validation" may help you to avoid overfitting to the sample at hand to the point that you do not predict so well for the rest of that population or subpopulation which you wish to be modeling.
If this model is a good fit, I expect you will likely see heteroscedasticity. See https://www.researchgate.net/publication/354854317_WHEN_WOULD_HETEROSCEDASTICITY_IN_REGRESSION_OCCUR.
Cheers - Jim
Question
What is the method to compare the performances on two different cognitive tests (that measure different cognitive functions) of the same or different group(s)?
As two cognitive tests are inherently different from each other and many a times, have different parameters.
It will be helpful if anyone can direct me to some useful references.
Thank you.
If they are norm-referenced tests that yield standard scores it is easy. Just put the results on the same scale (deviation IQ, z score, or whatever) and compare directly. Remember, all standard scores are interchangeable - conversion tables are easily found. Differences are significant/interpretable based on the standard errors of measurement of the two tests. Usually those figures are reported in the Manuals; if not, they can be calculated easily from the reliability coefficients and standard deviations. The only caveat here is that the standardization samples may not be comparable, so exercise due caution in what you assert. That is, one test developer may inadvertently have recruited an especially able (or weak) set of people on whom the norms are based.
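The reliable-difference logic described above can be sketched as follows (the scores, SDs, and reliabilities are made up; the SEM and the SE of the difference follow the formulas mentioned in the answer):

```python
import math

def reliable_difference(score1, sd1, rel1, score2, sd2, rel2, z_crit=1.96):
    """Sketch: is the gap between two standard scores (put on the same
    scale) larger than measurement error alone would produce?
    SEM_i = SD_i * sqrt(1 - reliability_i); SE_diff = sqrt(SEM1^2 + SEM2^2)."""
    sem1 = sd1 * math.sqrt(1 - rel1)
    sem2 = sd2 * math.sqrt(1 - rel2)
    se_diff = math.sqrt(sem1 ** 2 + sem2 ** 2)
    return abs(score1 - score2) > z_crit * se_diff, se_diff

# Two deviation-IQ-style scores (mean 100, SD 15) with reliabilities .92 and .89:
significant, se = reliable_difference(112, 15, 0.92, 99, 15, 0.89)
print(significant, round(se, 2))
```

As the answer notes, this only works after both scores are converted to a common standard-score metric, and the comparability of the two standardization samples remains a caveat.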
Question
I need to present some data in the form of a radar graph, and I don't want to use Excel; I need other software.
What is the best way to make this kind of chart so that it is academically acceptable?
Thank you for your help and software suggestions.
Question
I have heard some academics argue that the t-test can only be used for hypothesis testing; that it is too weak a tool to be used to analyse a specific objective when carrying out academic research. For example, is the t-test an appropriate analytical tool to determine the effect of credit on farm output?
Depending on your objective statement: if your objective is to compare variables that influence a particular problem, you can use the t-test to compare them and then give your justification.
Question
Dear colleagues
If anyone could suggest code (MATLAB or R) for an ARIMA and GARCH model on real data (not simulated), I would appreciate it.
Thank you so much
Valeriia Bondarenko, in MATLAB it is extremely easy. For example, for real data of Danish stock returns:
load Data_Danish;
nr = DataTable.RN;
% (1) Do a figure to see the data:
figure; plot(dates,nr);
hold on; plot([dates(1) dates(end)],[0 0],'r:'); hold off;
title('Danish Nominal Stock Returns');
ylabel('Nominal return (%)'); xlabel('Year');
% (2) Create a GARCH(1,1) canonical model:
Mdl = garch('GARCHLags',1,'ARCHLags',1,'Offset',NaN);
% (3) Fit the model to the data:
EstMdl = estimate(Mdl,nr);
Done!
(MatLab m-file to fit the model attached)
Question
Hello Researchers,
I was working to check the validity of a mathematical equation. In doing so I have obtained a large experimental data set (>50,000 values). As the mathematical equation gives only a single value, I was wondering whether there is any way to compare that data set to the mathematical equation. Based on that comparison, I am willing to assign a constant that, when applied (+, -, ×, /) to the mathematical equation, results in values that match the experimental data set. There could be more than one constant, as the range of the data set is quite large compared to the values obtained from the mathematical equation.
Thanks for your reply. I would look at cross-validation methods and see what is appropriate for your situation. Best wishes, David Booth
Question
I cannot find sources that give a thorough explanation of PCA and how to assign principal components 1 & 2, including their computation. For example, my study will explore polyphenol profiles of a certain plant from different geographical areas; I will also test their antioxidant activity and analyze the data using a biochemometric approach. Which variables should be included in the principal components? I will also integrate data from this PCA to construct my OPLS-DA.
Hi Duri,
I organized some information about PCA and PCR on:
The original is in Portuguese but can be translated by Google Translator:
Best Regards,
Markos
Question
Supposing:
Landing Page A with 1300 leads is achieving a 3% conversion rate; Landing Page B with 1500 leads is achieving a 10% conversion rate.
First, I want to see if I have achieved enough samples (Leads or conversions) to have a statistically valid test.
Second, I want to confirm that conversion rate for Landing Page B is statistically significant, so better than the obtained for Landing Page A.
How do I determine that the sample of leads and conversions obtained are statistically representative? What would be the minimum sample to get the same success in the conversion ratio?
Conversion could mean different things for different campaigns, and correspondingly different ways could be used to measure it. What type of campaigns are you using in your research scenarios?
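For the second part of the question, a standard choice is a two-proportion z-test. A sketch using the numbers given (39 conversions = 3% of 1300 leads vs 150 = 10% of 1500):

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided two-proportion z-test with a pooled proportion (sketch).
    Returns (z, p)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value
    return z, p

# The numbers from the question: 3% of 1300 vs 10% of 1500 leads
z, p = two_proportion_z(39, 1300, 150, 1500)
print(round(z, 2), p)   # large z, p far below 0.05
```

With these counts the difference is very clearly significant; the related sample-size question is usually answered by a power calculation on the same two-proportion setup (e.g., the smallest per-page n that detects a 3% vs 10% gap at the chosen power).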
Question
I am currently doing a PCA on microbial data. After running a Parallel Analysis to determine the number of factors to retain from the PCA, the answer is 12. Since my idea is to save the factor scores and use them as independent variables for a GLM together with other variables, I was wondering:
• Should I definitely save the factor scores of all 12 factors (which would become too many variables) or I can save only a few of them (e.g., the first 3 which together explain a 50% of the variance) for the GLM?
• If I can save a lower number, should I re-run the PCA retaining only that lower number (e.g. 3) or just use the factor scores already obtained when retaining the 12 ones?
Thank you all for your time and help!
Hello Abdulmuhsin S. Shihab. The Preacher & MacCallum (2003) article I referred to in my earlier post explains (among many other things) why eigenvalues > 1 is a very poor way to determine the number of factors (or components) to retain:
HTH.
Question
Hi all,
My research project is based on meta-analysis. I have the empirical mean, sample size, and standard error calculated for two groups; Group 1 has 20 studies and Group 2 has 6 studies. I have already calculated the pooled weighted mean, SE, and CI for each group. I would like to know how to calculate the statistical significance of the difference between the two groups from these pooled weighted values.
Is it possible to assess statistical significance based on the confidence intervals of the two groups? If yes, what type of statistical test do I need to perform to calculate the p-value?
Thanks much
Is there a reason you can't run a meta analysis with 26 studies and including the grouping variable as a moderator?
If the two intervals are independent and the SEs of the effect sizes reasonably similar you could get an approximate CI by pooling the SEs.
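The pooled-SE comparison suggested above amounts to a z-test on the difference of the two pooled estimates. A sketch (the g and SE values are made up):

```python
import math

def compare_pooled(g1, se1, g2, se2):
    """Sketch: test whether two independent pooled estimates differ.
    z = (g1 - g2) / sqrt(se1^2 + se2^2); two-sided normal p-value."""
    z = (g1 - g2) / math.sqrt(se1 ** 2 + se2 ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Hypothetical pooled effects for the two subgroups:
z, p = compare_pooled(-0.60, 0.12, -0.20, 0.25)
print(round(z, 2), round(p, 3))
```

This is essentially what a moderator (subgroup) analysis in a full meta-analysis does, which is why running one model on all 26 studies with the grouping variable as a moderator, as suggested above, is usually preferable.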
Question
Hi everybody, I have a question about calculating sample size with software. There are various packages such as G*Power and NCSS/PASS in this area. Which one is better? Can anyone guide me? For example, how can I work with the NCSS/PASS software? Thanks
It is not free.
Question
Can you kindly suggest the best statistical test to compare the yield of a protein from bacterial cultures grown at different pH values? Is one-way ANOVA a suitable method?
There are many tests, have you tried a simple means comparison test?
Question
Hello,
I am performing statistical analysis of my research data by comparing mean values using the Tukey HSD test. I got homogeneous groups labelled in both lowercase and uppercase letters because of the large number of treatments in my study. Is this type of homogeneous grouping acceptable for publication in a journal?
You can use SPSS for this analysis, but it is mostly done in the Statistix 8.1 program.
Question
Hi everyone.
I have a question about finding a cost function for a problem. I will ask the question in a simplified form first, then I will ask the main question. I'll be grateful if you could possibly help me with finding the answer to any or both of the questions.
1- What methods are there for finding the optimal weights for a cost function?
2- Suppose you want to find the optimal weights for a problem where you can't measure the output (e.g., death). In other words, you know the factors contributing to death, but you don't know the weights, and you can't observe the output because you can't really test or simulate death. How can we find the optimal (or sub-optimal) weights of that cost function?
I know it's a strange question, but it has so many applications if you think of it.
Best wishes
Question
Dear All,
I have two time series data sets that I would like to correlate. One data set is the deposits, by month, for a list of different accounts. The other is the balances, by month, for the same list of accounts. In essence, I have two matrices whose correlation I want to understand without having to strip out each account separately. Furthermore, I want to cross-section the data into different segments.
This is being done with the goal of forecasting account balances in the future by looking at usage behavior (assuming there is a lag relationship).
How do I build an intermediate matrix of the correlations? Is there a way to do it in Python or RStudio? Is there a way to do it in Excel?
Thanks
Ryan
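A sketch of the "intermediate matrix" idea in Python, with no external libraries: correlate each account's deposit series with its balance series at a chosen lag (the account names and values below are made up; with pandas this reduces to a couple of `DataFrame` operations):

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def lagged_column_correlations(deposits, balances, lag=0):
    """Sketch: deposits and balances are dicts {account: [monthly values]}.
    Correlates each account's deposits at month t with its balance at
    month t + lag, returning {account: r}."""
    out = {}
    for acct in deposits:
        d, b = deposits[acct], balances[acct]
        if lag > 0:
            d, b = d[:-lag], b[lag:]   # align deposits with later balances
        out[acct] = pearson(d, b)
    return out

deposits = {"A": [100, 120, 90, 130, 110, 140],
            "B": [50, 55, 52, 60, 58, 64]}
balances = {"A": [500, 590, 700, 760, 880, 950],
            "B": [200, 245, 296, 342, 398, 452]}
corrs = lagged_column_correlations(deposits, balances, lag=1)
print(corrs)
```

Sweeping `lag` over several values and keeping the per-account correlations gives exactly the account-by-lag matrix described in the question, and the dict can then be sliced by whatever segments are of interest.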
Question
Dear Researchers, Modellers, and Mathematicians,
As we know that in mathematics, computer science, and physics, a deterministic system is a system in which no randomness is involved in the development of future states of the system. A deterministic model will thus always produce the same output from a given starting condition or initial state. In this regard, I am looking forward to having examples from daily life events which are deterministic. Thank you!
Sincerely,
Aman Srivastava
Thank you, Dr. Faical Barzi, Dr. Zakaria Yahia, Dr. José Robles for sharing your opinions. It can be concluded from your responses that "deterministic processes can never happen in nature due to a certain degree of uncertainty involved; however, humankind has considered such (relevant) processes to be deterministic for their convenience in model development, such as the case of automated system or examples from classical mechanics". I request readers of this question to kindly view their responses for more understanding. Thank you, kindly stay connected.
Question
I aim to allocate subjects to four different experimental groups by means of Permuted Block Randomization, in order to get equal group sizes.
This, according to Suresh (2011, J Hum Reprod Sci), can result in groups that are not comparable with respect to important covariates. In other words: there may be significant differences between treatments with respect to subject covariates, e.g. age, gender, education.
I want to achieve groups that are comparable with respect to these covariates. This is normally achieved with stratified randomization, which itself seems to be a type of block randomization in which the blocks are not treatment groups but covariate categories, e.g. low income and high income.
Is a combination of both approaches possible and practically feasible? With, e.g., 5 experimental groups and 3 covariates, each with 3 categories, a randomization that aims for groups balanced with respect to covariates and equal in size might be complicated.
Is it possible to perform Permuted Block Randomization to treatments for each "covariate-group", e.g. for low income, and high income groups separately, in order to achieve this goal?
Hi! You might want to check this free online randomization app. You can apply simple randomization, block randomization with random block sizes and also stratified randomization.
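If you'd rather script it, the combination you describe (permuted block randomization run separately within each covariate stratum) is straightforward. A minimal Python sketch, with invented stratum sizes and group labels:

```python
import random

def permuted_block_randomize(n_subjects, groups, block_size, rng):
    """Assign n_subjects to groups using permuted blocks of size block_size."""
    assert block_size % len(groups) == 0
    per_group = block_size // len(groups)
    assignments = []
    while len(assignments) < n_subjects:
        # Each block contains every group equally often, in random order.
        block = [g for g in groups for _ in range(per_group)]
        rng.shuffle(block)
        assignments.extend(block)
    return assignments[:n_subjects]

rng = random.Random(42)
groups = ["A", "B", "C", "D"]
# Stratified version: randomize separately within each covariate stratum.
strata = {"low_income": 16, "high_income": 12}
allocation = {s: permuted_block_randomize(n, groups, block_size=8, rng=rng)
              for s, n in strata.items()}
print(allocation)
```

Within each stratum the group sizes can differ by at most one incomplete block, so the design stays balanced both overall and per stratum.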
Question
Hi there,
in SPSS I can perform a PCA on my dataset, whose correlation matrix is not positive definite, since I have more variables (45) than cases (n = 31).
The results seem quite interesting; however, since my correlation matrix and therefore all the appropriateness criteria (anti-image matrix, MSA, etc.) are not available, am I allowed to perform such an analysis?
Or are the results of the PCA automatically nonsense? I can identify a common theme in each loaded factor and its items.
Thanks and best Greetings from Aachen, Germany
Alexander Kwiatkowski
Hello Alexander,
Trying to parse the information carried by 45 variables (i.e., 990 unique pairwise relationships) based on data from 31 cases is a bit like the Biblical story of the loaves and fishes: without divine intervention, you're simply not going to get good results.
In general, absence of a positive definite matrix implies that: (a) there is at least one variable that is linearly dependent on one or more of the other variables in the set (i.e., redundancy); and/or (b) the correlations are incoherent (which can occur if you use pairwise deletion for missing data, or some value(s) were miscoded on data entry). Either way, you'd need to re-inspect the data and data handling, and possibly jettison one or more of the variables.
If you find your results interesting, perhaps that should be the motivation to collect additional data so that you can be more confident that the resulting structure--whatever you decide it to be--is something more than a finding idiosyncratic to your data set.
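To make the first point concrete: with more variables than cases, the sample correlation matrix has rank at most n − 1, so some eigenvalues are necessarily zero and the matrix cannot be positive definite. A quick numpy check on simulated data with the dimensions described (31 cases, 45 variables):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(31, 45))          # 31 cases, 45 variables
R = np.corrcoef(X, rowvar=False)       # 45 x 45 correlation matrix

eigvals = np.linalg.eigvalsh(R)
print("smallest eigenvalue:", eigvals.min())
# Rank of R is at most n_cases - 1 = 30, so at least 15 eigenvalues are ~0.
print("eigenvalues near zero:", int(np.sum(eigvals < 1e-8)))
```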
Question
Hello,
I have a simple control-experimental research design with a pre-post exam and 12 persons in each group.
What is the appropriate way to compute the effect size? (What is the right formula: Cohen's d, eta squared, omega squared, or ...?)
Andrew Paul McKenzie Pegman, here is what you wrote concerning Cohen's d:
If the two groups have the same n, then the effect size is simply calculated by subtracting the means and dividing the result by the pooled standard deviation.
I added emphasis on the definite article the. To be fair, that wording does suggest that Cohen's d is the one and only acceptable measure of effect size. That may not be what you intended, but the wording does suggest it, IMO.
Second, I saw nothing critical or offensive in Thom's reply. You are being overly sensitive, IMO.
And finally, as we are expressing opinions about the value of standardizing effect sizes, I generally agree with the views that Thom expressed in his article on effect sizes. YMMV.
Baguley, T. (2009). Standardized or simple effect size: What should be reported? British Journal of Psychology, 100(3), 603-617. http://irep.ntu.ac.uk/id/eprint/23799/1/193490_1870%20baguely%20postprint.pdf
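For reference, the pooled-SD formula quoted above can be sketched in a few lines (the pre/post values below are invented, twelve per group to match the design described):

```python
import math

def cohens_d(x, y):
    """(mean(x) - mean(y)) / pooled standard deviation."""
    nx, ny = len(x), len(y)
    mx = sum(x) / nx
    my = sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    pooled_sd = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / pooled_sd

pre  = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 10]
post = [13, 15, 14, 16, 15, 14, 13, 15, 16, 14, 15, 13]
print(cohens_d(post, pre))
```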
Question
I performed a logistic regression to ascertain the effects of academics’ role at the institution, years of teaching, qualification, and type of HEIs on the likelihood that participants are ready to teach IR topics to accounting students. Below are my results.
Is it normal to have a significance of more than .05 in the Hosmer and Lemeshow test while having no significant independent variables? How do you interpret such a scenario? My sample size is 50, with 4 independent variables (research suggests 10 cases for each independent variable).
Hello Tishta,
The two results represent tests of different hypotheses, and therefore need not agree with one another.
The H-L test evaluates whether the performance of the final model is consistent across levels of predicted risk (useful when one or more of the IVs is continuous in form). In other words, does the model work coherently across the range of fitted probabilities (this is correctly implied by Abolfazi's response, above).
The test of individual IVs is simply a matter of whether including that IV yields a non-zero regression coefficient (which may correctly be construed as, does the IV contribute to the explanatory power of the model).
So, there's no reason to expect that the two tests will or must agree for any given data set.
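For intuition, the H-L statistic can be sketched by hand: sort cases into deciles of predicted probability and compare observed with expected event counts via chi-square. A rough Python sketch on simulated fitted probabilities (not real data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
p_hat = rng.uniform(0.1, 0.9, size=200)   # fitted probabilities
y = rng.binomial(1, p_hat)                # outcomes consistent with the fit

def hosmer_lemeshow(y, p_hat, n_groups=10):
    """Chi-square comparing observed vs expected events in probability deciles."""
    order = np.argsort(p_hat)
    groups = np.array_split(order, n_groups)
    chi2 = 0.0
    for g in groups:
        obs, exp, n = y[g].sum(), p_hat[g].sum(), len(g)
        chi2 += (obs - exp) ** 2 / exp + ((n - obs) - (n - exp)) ** 2 / (n - exp)
    return chi2, stats.chi2.sf(chi2, n_groups - 2)

chi2, p = hosmer_lemeshow(y, p_hat)
print(f"H-L chi2 = {chi2:.2f}, p = {p:.3f}")
```

Nothing in this statistic involves the individual coefficients, which is exactly why the two kinds of test need not agree.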
Question
I have 210 respondents who completed a 15-item true/false/I don't know questionnaire. What is the best way to analyse the data and determine each respondent's final score out of 15?
SPSS Coding: 1 = true, 2 = false, 3 = I don't know.
Each item/question is a unique variable.
The original measure indicates correct responses are allocated a score of 1, incorrect and “I don’t know” a score of 0, for a maximum of 15/15.
I have the correct answers as both true/false and scored in the ways indicated above, however I'm not sure how to check each respondent against the correct answers.
Would I try to match the existing answers to the correct answers and derive a score from that? Is there a fast way of doing it that doesn't involve individually checking each answer? I feel like I'm missing something ridiculously obvious.
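One way to avoid checking items one by one is to compare the whole response matrix against the answer key in a single vectorized step. A Python/pandas sketch (the item names and key below are invented; in SPSS the analogous step would be one COMPUTE/RECODE pass per item):

```python
import pandas as pd

# Answer key: the correct code (1 = true, 2 = false) for each of the 15 items.
key = {f"item{i}": (1 if i % 2 else 2) for i in range(1, 16)}

# Toy responses for 3 respondents (rows) x 15 items (columns); 3 = "I don't know".
responses = pd.DataFrame(
    [[key[c] for c in key],                                   # all correct
     [3] * 15,                                                # all "I don't know"
     [key[c] if i < 10 else 3 for i, c in enumerate(key)]],   # 10 correct
    columns=list(key),
)

# Compare every response to the key at once; wrong and "don't know" score 0.
score = (responses == pd.Series(key)).sum(axis=1)
print(score.tolist())
```

Because the comparison aligns on column names, "I don't know" (code 3) never matches the key and automatically scores 0, as the original measure requires.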