Questions related to Statistics
Hi, at a conference I once heard someone say, "The isolates within group A are similar to group B with R > 0.2."
Is there a way for me to calculate the R value between two different groups of isolates based on their nucleotide/amino acid sequences or their sequence homology, so that in the end I could reach a conclusion like the one in the example? Thank you so much!
I want to draw a graph of predicted probabilities vs. observed probabilities. For the predicted probabilities I use the R code below. Is this code OK or not?
Could anyone tell me how I can get the observed probabilities and draw a graph of predicted vs. observed probability?
analysis10 <- glm(Response ~ Strain + Temp + Time + Conc.Log10
                  + Strain:Conc.Log10 + Temp:Time,
                  family = binomial, data = mydata)   # 'mydata' stands for the attached data file
predicted_probs <- data.frame(probs = predict(analysis10, type = "response"))
I have attached that data file
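In case it helps to show what I mean, below is a rough sketch of one way I imagine getting "observed" probabilities: bin the predicted probabilities (here into deciles, an arbitrary choice) and take the mean of the 0/1 response within each bin. The binning approach and everything except analysis10 are my own assumptions, not tested code.
predicted_probs <- predict(analysis10, type = "response")
bins <- cut(predicted_probs,
            breaks = unique(quantile(predicted_probs, probs = seq(0, 1, 0.1))),
            include.lowest = TRUE)
observed  <- tapply(analysis10$y, bins, mean)     # observed proportion of 1s per bin
predicted <- tapply(predicted_probs, bins, mean)  # mean predicted probability per bin
plot(predicted, observed, xlab = "Predicted probability", ylab = "Observed probability")
abline(0, 1, lty = 2)                             # 45-degree reference line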
I was trying to design a trial comparing one diagnostic tool (which gives 9 different outputs) to a common gold standard (which gives the same output categories). However, I'm confused about how to estimate the sample size for each output (or categorical group). My first thought was to use specificity and sensitivity for the calculation, but this is unlike normal binary variables.
Should I use the same method as for binary variables and apply it against each category (i.e., treat the category under study as positive and the rest as negative), so that we get a positive-group sample size and a negative group (which contains the remaining eight outputs)?
I would really appreciate it if anyone could give me a hint or an approach to this.
I want to estimate the half-life value for the virus as a function of strain and concentration, and as a continuous function of temperature.
Could anybody tell me how to calculate the half-life value in R?
I have attached a CSV file of the data
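To make the question more concrete, here is the kind of approach I have been imagining, assuming a log-linear decay (so log10 titer falls linearly with time); the column names (Titer, Time, Strain, Conc, Temp) and the values in the new data frame are placeholders, not the actual names in the attached CSV.
d <- read.csv("data.csv")
fit <- lm(log10(Titer) ~ Time * Strain + Time * Conc + Time * Temp, data = d)
# decay slope (change in log10 titer per unit time) for one strain/conc/temp combination:
nd <- data.frame(Time = c(0, 1), Strain = "A", Conc = 5, Temp = 20)
slope <- diff(predict(fit, newdata = nd))
half_life <- log10(2) / (-slope)   # time for the titer to halve
half_life
Is something along these lines sensible, or is there a more standard way?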
I have to investigate
1) how the response depends on the strain, temperature, time, and concentration.
I applied logistic regression (glm) and got the reduced model. When I tried to plot the logistic regression line and confidence interval, it looks like the attached picture.
Could anybody tell me how to resolve this issue (I want only one logistic regression curve and two confidence interval lines, not many)?
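For reference, this is roughly what I am trying to achieve (a sketch only; the grid values and the fixed levels of the other predictors are assumptions on my part):
grid <- data.frame(Conc.Log10 = seq(0, 8, length.out = 100),
                   Strain = "A", Temp = 20, Time = 30)       # assumed fixed values
pr <- predict(analysis10, newdata = grid, type = "link", se.fit = TRUE)
fit_p <- plogis(pr$fit)
upper <- plogis(pr$fit + 1.96 * pr$se.fit)
lower <- plogis(pr$fit - 1.96 * pr$se.fit)
plot(grid$Conc.Log10, fit_p, type = "l", ylim = c(0, 1),
     xlab = "Conc.Log10", ylab = "Predicted probability")
lines(grid$Conc.Log10, upper, lty = 2)   # upper confidence band
lines(grid$Conc.Log10, lower, lty = 2)   # lower confidence band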
I have attached the data
For the confidence interval I use this:
Hello, dear community,
In the context of developing a quantum algorithm, I need at some point to evaluate the square of an expectation value <O>, with O a real diagonal matrix (and thus Hermitian).
I need to compute this efficiently using a quantum computer. I would like to know if there is a procedure, like the Hadamard test for evaluating <Ψ|O|Ψ>, to evaluate by using the minimum number of measurements the quantity (<Ψ|O|Ψ>)².
Thank you in advance for your answer
Hello seniors, I hope you are doing well.
Recently I've read some very good research articles. In those articles, datasets were taken from V-Dem, Polity, and Freedom House. Although they shared links to the supplementary datasets and briefly described how they analyzed them in SPSS or R, I couldn't understand or replicate the findings. It may be because I am not very good at quantitative data analysis.
So I want to know how I could better understand the analysis of datasets like V-Dem. Is there a good online course, lecture series, conference video, or book?
Any help would be appreciated.
Thanks in anticipation.
For my research I want to test the influence of distance, according to Hall's (1966) interpersonal distances (intimate, personal, social, and public), on facial expression (negative or positive). Since both my independent and dependent variables are categorical ordinal data, I thought of using Spearman's correlation. Is that the right statistical method?
Our group conducted undergraduate research on fiber-polypropylene composite panels. We have 4 different independent variables: fiber ratio (0%, 5%, 10%), fiber length (1 cm / 2 cm), fiber age (old/young), and fiber treatment (treated/untreated), and we measure the strength of the composite panel as the dependent variable. Will a factorial ANOVA do, or an ANCOVA? If neither, what statistical treatment would be appropriate for our research?
We're conducting a research design as follow:
- An observational longitudinal study
- Time period: 5 years
- Myocardial infarction (MI) patients without prior heart failure are recruited (we'll name this number of people after 5 years of conducting our study A)
- Exclusion criteria: death during MI hospitalization, or no follow-up data for 3-6 months after discharge.
- Outcome/endpoint: heart failure post MI (confirmed by an ejection fraction (EF) < 40%)
- These patients will then be followed up for a period of 3 to at most 6 months. If their EF during these 3-6 months after discharge is < 40%, they are considered to have heart failure post MI (we'll call the number of such people after 5 years of conducting our study B).
- Otherwise they are not considered to have the aforementioned outcome/endpoint.
My questions are as follows:
- What is the ratio B/A best called? Is it cumulative incidence? We're well aware of studies similar to ours, but the one main difference is that they did not limit the follow-up time (i.e., a patient could be considered to have heart failure post MI even 4 years after they were recruited). I wonder whether this factor limits our ability to calculate cumulative incidence in our study?
- Is there a more appropriate measure to describe what we're looking to measure? How can we calculate incidence in this study?
- We also want to find factors associated with heart failure post-MI (risk factors?). We collected some data on the MI's characteristics and the patients' comorbidities during the MI hospitalization (when they were first recruited). Can we use a Cox proportional hazards model to calculate the HR for these factors?
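For the last point, this is the kind of model we have in mind (a sketch only; the dataset and variable names, e.g. time_to_hf, hf_event, diabetes, are placeholders):
library(survival)
fit <- coxph(Surv(time_to_hf, hf_event) ~ diabetes + anterior_mi + age, data = cohort)
summary(fit)   # hazard ratios with 95% CIs for each candidate factor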
Hello, I have two cohorts and I'd like to check if the distribution of cases is similar between them. For clarity I'll post the table.
I'd like to know if the HEMS and GEMS groups differ significantly in the distribution of counts of the various coded calls (represented by their category numbers: 7, 14, 15, etc.).
I'm not sure how to run and check that with those variables - sorry for the basic question, I'd just like to prove similarity between the two groups and I'm a bit of a statistical novice.
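In case it clarifies the question, this is the kind of thing I imagine running on the table (the counts below are placeholders, not our real numbers):
calls <- matrix(c(12, 30, 25,    # HEMS counts for call categories 7, 14, 15, ...
                  20, 28, 40),   # GEMS counts for the same categories
                nrow = 2, byrow = TRUE,
                dimnames = list(c("HEMS", "GEMS"), c("7", "14", "15")))
chisq.test(calls)                              # test whether the two distributions differ
fisher.test(calls, simulate.p.value = TRUE)    # alternative if some counts are small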
I have performed a Spearman's Rho with several variables: 1 continuous dependent variable and 5 continuous independent variables. I did this as normality was violated so I couldn't do a Pearson's Correlation. From the Spearman's Rho, I have ordered the independent variables from the strongest correlation to the weakest. I am planning to run a regression where I enter the independent variables in order (from the strongest correlation to the weakest) but I cannot figure out which regression analysis I should run. Someone suggested a Stepwise regression but I am not sure if this is the correct analysis. Do you think I should just run a multiple regression (where I cannot choose the order of variables to be entered) or some other regression?
Thank you in advance!
I have a huge dataset for which I'd like to assess the independence of two categorical variables (x,y) given a third categorical variable (z).
My assumption: I have to run the independence test for each unique "z", and if even one of these tests rejects the null hypothesis (independence), it is rejected for the data as a whole.
Results: I have run the chi-square test, chi-square with Yates correction, a Monte Carlo version, and Fisher's exact test.
- Chi-square is not a good method for my data due to the sparse contingency tables
- Yates and Monte Carlo reject the null hypothesis
- For Fisher, all the p-values are equal to 1
1) I would like to know whether there is something I'm missing.
2) I have already discarded the "z"s that have DOF = 0. If I keep them, how should I interpret independence?
3) Why does Fisher's test result in p-value = 1 all the time?
4) Any suggestions?
#### Apply Fisher's exact test (with a simulated p-value)
fish <- fisher.test(cont_table, workspace = 6e8, simulate.p.value = TRUE)
#### Apply the Chi^2 methods
chi_cor   <- chisq.test(cont_table, correct = TRUE)                     ### Yates correction of the Chi^2
chi       <- chisq.test(cont_table, correct = FALSE)                    ### plain Chi^2
chi_monte <- chisq.test(cont_table, simulate.p.value = TRUE, B = 3000)  ### Monte Carlo Chi^2
For my graduation research, I am trying to create a composite score about household resilience out of data collected through a household survey. However, this data consists of ordinal variables (5-point Likert scale), binary variables (yes-no questions), and ratio variables (in proportions between 0-1).
My plan was to recode the data on the 5-point Likert scale into scores from 0-1 and do the same for the yes-no questions, with yes = 1 and no = 0 (since answering yes would mean a household is more resilient and thus has a higher score). However, this seems very off.
At this moment I am aware that it wasn't the best idea to create a survey with both types of questions, but I am unable to recollect the data.
Therefore my **question** is as follows: do you have any tips on how to create a composite score composed of both interval and binary variables?
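For completeness, here is the rescaling I had in mind (a sketch; the data frame and column names are hypothetical):
likert_01 <- (survey$likert_item - 1) / 4               # 1-5 Likert mapped to 0-1
binary_01 <- ifelse(survey$yes_no_item == "yes", 1, 0)  # yes/no mapped to 1/0
ratio_01  <- survey$ratio_item                          # already a proportion in 0-1
survey$resilience <- rowMeans(cbind(likert_01, binary_01, ratio_01), na.rm = TRUE)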
Thank you in advance and have a lovely day.
I have a study with an unbalanced design. Each person is put into one of three conditions. Then, there are two exemplars that each person receives. However, people are tested twice for exemplar 1 and only once for exemplar 2.
I was thinking of using a linear mixed-effects model to account for this unbalanced design, with condition*exemplar (to see if the effect of condition replicates across exemplars) and condition*time (to see if the effect of condition varied across time for exemplar 1). However, I am not sure what the LMEM is doing when it gives me estimates for condition*time or time, given that there is only one time point for exemplar 2. Is it ignoring exemplar 2? Can I say that condition*time and time are looking at only exemplar 1?
Any help in clarifying this would be much appreciated! The other type of analysis I was thinking of doing would be analyzing the exemplars separately so that analysis #1 would look at the effect of condition*time for the exemplar with 2 time points, but analysis #2 would look at the effect of condition for the exemplar with 1 time point.
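For clarity, the first (combined) option would look roughly like this in lme4 (the variable names are mine, not from my actual data):
library(lme4)
m <- lmer(score ~ condition * exemplar + condition * time + (1 | participant), data = d)
summary(m)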
How can I extract the p-values of the Spearman correlation in the robCompositions package (CorCoDa function) in R?
I am working with data with one IV (a Likert scale: the Sports Engagement Scale) and one DV (6 subscales on a Likert scale: the Psychological Wellbeing Scale). What statistical treatment is best to check their relationship and a causal relationship?
I hope you can answer it as soon as possible! Thank you so much!
I tried to run a Point Biserial Correlation with one continuous variable and several dummy-coded nominal variables; however, my continuous (dependent) variable violated the normality assumption.
Are there any alternatives for assessing the correlation between one continuous dependent variable and several dummy coded nominal variables?
I am running a multiple regression with several dummy-coded variables (initially multi-level ordinal variables). With the assumptions testing, do I carry it out as it's normally done? Do I need to do something special with the dummy-coded variables?
I am looking to do a Pearson Correlation to determine the order in which my variables should be entered into a Hierarchical Multiple Regression. Some of my variables are continuous and some are categorical. So, I am thinking to perform Pearson Correlations such as this:
- Correlation between level of study (undergraduate, postgraduate), area of study (nutrition, counselling, psychology, medicine, etc), year of study (1st, 2nd) and a test score. A separate correlation test will be performed to find out the associations between each test score (I have a number of different tests) and the independent variables.
I will then use those correlations in the Hierarchical Multiple Regression to determine which of the independent variables can predict the test scores. I will dummy-code each of the categorical variables after the Pearson Correlation but before the Hierarchical Multiple Regression.
My question is, are these steps suitable?
I am combining (averaging) survey items measured with a Likert scale, in order to get some new composite variables. I know there is some controversy regarding this, but I am following methods of similar research. If I take the average of a set of ordinal items, can I use the new variables in an ordinal regression as dependent/independent variables? I have heard multiple linear regression would be an option, but I violate some assumptions.
I am looking at gender equality in sports media. I have collected two screen time measures from TV coverage of a sport event - one time for male athletes and one time for female athletes.
I am looking for a statistical test to give evidence that one gender is favoured. I assume I have to compare each gender's time against the EXPECTED time given a 50/50 split (i.e., (male time + female time) / 2), as this would be the time if no gender were favoured.
My first thought was chi-square? But I'm not sure that works because there's really only one category. I am pregnant and so my brain is not working at the moment, lol. I think the answer is really simple but I just can't think of anything.
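To make it concrete, this is what I was toying with (the times are made-up numbers, and treating minutes of coverage as if they were counts is itself an assumption I am unsure about):
male_time   <- 540    # minutes of coverage for male athletes
female_time <- 180    # minutes of coverage for female athletes
chisq.test(c(male_time, female_time), p = c(0.5, 0.5))   # goodness of fit against a 50/50 split
binom.test(male_time, male_time + female_time, p = 0.5)  # exact version of the same idea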
In my study I interpret mediation calculated with PROCESS 4.0 using the Zhao et al. (2010) approach. They suggest that it is an indirect-only mediation when the a*b effect is the only significant one, and the sign of this effect doesn't matter. The problem is that in my results only the indirect effect is statistically significant, and the total effect is lower than the direct effect, which suggests a suppression or confounding model (MacKinnon et al., 2000). My question is: can I use the term indirect-only mediation in this situation, or should I interpret this result as an example of suppression/confounding?
Thanks in advance.
Zhao, X., Lynch Jr., J. G., & Chen, Q. (2010). Reconsidering Baron and Kenny: Myths and truths about mediation analysis. Journal of Consumer Research, 37(2), 197–206.
MacKinnon, D. P., Krull, J. L., & Lockwood, C. M. (2000). Equivalence of the mediation, confounding and suppression effect. Prevention Science, 1, 173–181.
I need to run an ART ANOVA and Tukey HSD for the interactions among the treatments, but my dataset has a few NAs due to experimental errors.
When I run :
anova(model<- art(X ~ Y, data = d.f))
I get the error:
Error in (function (object) :
Aligned Rank Transform cannot be performed when fixed effects have missing data (NAs).
Manually shifting values up is not an option because each row is a sample, and it would just leave the NAs in the wrong samples.
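The only workaround I have come up with so far is to drop the incomplete rows entirely before fitting (losing those samples) rather than shifting values around; the column names follow my call above:
library(ARTool)
d.complete <- d.f[complete.cases(d.f[, c("X", "Y")]), ]   # keep only rows with no NAs in the model variables
anova(m <- art(X ~ Y, data = d.complete))
Is there a better option that keeps those samples?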
I would like to find journals in the area of statistics / quantitative methods that have special issues to publish in. Other than going through each journal one by one, is there any website (per publisher or as a collection across journals) that lists the special issues? I know about MDPI and Taylor & Francis. How about Wiley, Springer, etc.?
Thank you in advance.
I read a paper from a well-known publisher. The paper is about gestational diabetes. At age 45+ they found 35 cases of diabetes and 90 women without it. They wrote: "The incidence of GDM at age ≥ 45 years was as high as 38.89%." I would suggest 28% (35/125)? What do you think?
In addition, I have another question. If something increases from 50 to 200, is that a 4-times increase?
What about folds? In my opinion, folds are not the same as times? The next article stated in the abstract: "neutralizing antibodies were increased by 10.3–28.9 times at 4 weeks after the booster" and then in the results "[n]eutralizing antibody GMTs then increased during the 4 weeks after the booster dose until day 237 by 28.9, 10.3, and 11.9-fold". Was this written correctly?
We are measuring "purchase intention" and have therefore built a framework where four variables (Involvement, Argument Quality, Source Credibility, and Information Usefulness) together make up purchase intention.
We have conducted a survey with a number of questions for each of the variables above, with an answer scale of 1-6. The aim of the study is to compare the mean of each variable in order to see whether the two groups the survey was sent to (Gen Z and Millennials) differ in some way. We are also wondering how to "combine" these four variables in order to measure a mean purchase intention for each group.
Thanks in advance
Did the authors include any validation items, and/or did they consult experts to review the sample? I cannot seem to find anything that answers these questions.
I am working with SPSS and I noticed that I have a lot of missing values.
I can't delete the variables, so I have to replace the missing values. I read about a few different options online, but I am still not sure which one to choose. I can't replace the missing values with the mean, because they are ordinal variables. Is it an option to replace the missing values with the mode, or is it better to replace them with the median?
I am a nursing student and it's the first time that I am working with this program. I am basically a rookie.
Thanks in advance!!
I collected data from 14 participants using a Likert-type scale (1-5) related to workload measurement. The same participants rated the scale for six different-sized keyboard designs by considering workload. To understand whether there is a significant difference among the six keyboard designs, I applied the non-parametric Friedman test. I found a statistical difference, so I applied the Wilcoxon signed-rank test with Bonferroni correction for pairwise comparisons.
My question: while I found no significant difference for LL-SS, there is a significant difference for ML-SS, even though LL and ML have the same mean. Can this be true? The adjusted alpha value is 0.05 / 15 = 0.00333, and the Wilcoxon result for ML-SS is 0.00306, so I considered it a significant difference. Did I do this correctly?
Looking forward to your support!
I attached the workload results and the SPSS output.
Recently I am trying to reproduce results from this paper Janich, P., Toufighi, K., Solanas, G., Luis, N. M., Minkwitz, S., Serrano, L., Lehner, B., & Benitah, S. A. (2013). Human Epidermal Stem Cell Function Is Regulated by Circadian Oscillations. Cell Stem Cell, 13(6), 745–753.
Here is the difficulty I met:
The authors performed microarrays to detect gene expression in multiple overlapping time windows.
Take time window 1 for example: say there are 100 genes, whose expression is measured at 5 h, 10 h, 15 h, and 20 h.
Then the authors applied a quadratic regression model, "expression = a(time.point)^2 + b(time.point) + c", to determine whether these genes change periodically within each time window (time.point can be 5, 10, 15, or 20 in this example). If the coefficient "a" < 0 and the p-value for "a" is < 0.05, the gene is identified as a "peak gene"; otherwise, if the coefficient "a" > 0 and the p-value for "a" is < 0.05, it is labelled a "trough gene". The problem is that the authors calculate the p-value in two ways. The first is based on the t-distribution, i.e., p-value = Pr(>|t|), and the R code would be:
summary(lm(expression ~ poly(time.point, 2, raw = TRUE)))$coefficients[3, "Pr(>|t|)"]
The other is based on the normal distribution, i.e.
pvalue = 2 * pnorm(-abs(t.score))
(That means that if |t.score| > 1.96, the p-value is guaranteed to be < 0.05.)
The authors chose the latter as the final p-value. But is it right to do so in this situation?
From what I have learned, the t-distribution should be preferred when the population standard deviation is unknown and the sample size is < 30 (each regression model here has only 4 observations). Since the different p-values calculated by these two methods could greatly affect the final result and conclusion, could someone give me a detailed explanation? Any help would be appreciated!
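To illustrate how much the two methods can differ with only 4 observations, here is a small made-up example I ran (the values are arbitrary):
expression <- c(2.1, 3.4, 3.0, 1.8)
time.point <- c(5, 10, 15, 20)
fit <- summary(lm(expression ~ poly(time.point, 2, raw = TRUE)))
t.score <- fit$coefficients[3, "t value"]
p.t <- fit$coefficients[3, "Pr(>|t|)"]    # t-based p-value (only 1 residual df here)
p.z <- 2 * pnorm(-abs(t.score))           # normal-based p-value, my reading of the paper's method
c(t_based = p.t, normal_based = p.z)      # the normal-based p-value comes out much smaller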
I have a model that needs calibration, but I am afraid that if I calibrate using too many model parameters, I will overfit to my data, or the calibration will not be well-done.
Can anyone suggest a method to determine the maximum number of parameters I should use?
Good afternoon dear colleagues.
I know for sure that researchers sometimes expand their group size. Let's say we are planning an experiment with 10 mice: 5 in the control group and 5 in the experimental group. At the end of the experiment, we see that we do not have enough data (e.g., 1 or 2 mice were excluded). By that I mean that we see a difference, but it is not statistically significant. I know that the required group size can be calculated from preliminary data, but I have not seen anybody use this strategy when performing the experiment (I guess because the number of animals in an experiment must be minimized as much as possible).
As I said, I know that in the case described people may conduct an additional experiment with another group of mice and then combine the data from the 2 experiments. Even though the protocol and conditions are kept the same as much as possible, I doubt that this strategy is right. Moreover, I have read or heard somewhere an explanation of how bad this approach is and why, but I do not remember where. Now I cannot find the answer no matter how hard I try.
Could anybody please explain this point?
I am happy to get any comments, links, or articles about the issue!
I am currently analysing a dataset from a survey that moved participants to a different section depending on their response to a previous question (e.g., people who said they were not farmers skipped the farming-related questions). There is also missing data that simply relates to respondents choosing to skip a question. The structurally missing data has led to a large amount of missingness, and I assume it has made the data MNAR.
I am wondering how to appropriately manage these two sets of missing data so I can progress to an ANOVA and then regression.
For context, out of a sample of 22,076, each variable has around 4,500 structurally missing values (20%) and around 80 (0.4%) missing values due to respondents choosing not to respond.
Any help would be hugely appreciated!
I carried out the Kruskal-Wallis H test in SPSS to do pairwise comparisons of three groups. I got some positive and negative values in the Test Statistic and Std. Test Statistic columns. I can draw conclusions based on the p-value, but I don't know what the values in the Test Statistic and Std. Test Statistic columns indicate and why some values are positive and some negative. I need some explanation, please. Thanks in advance.
Data mining and machine learning look similar to me. Can you elaborate on the difference between the two? As per my understanding:
Data mining is about finding useful information and using that information in decision making. That means that, using the known properties of the data, we find unknown properties of the data, e.g., studying sales of computers in different regions and supplying them accordingly.
On the other hand, ML is about predicting results. It uses known properties of the data to predict properties of new data instances, e.g., predicting the price of a house 5 years from now from existing house-sales data.
I have heard in videos that the variation in R² and path coefficients (before and after common method bias correction) should be < 10% for the unmeasured marker variable method and < 30% for the measured latent marker variable correction method.
Can anyone share articles or references? Where do these cut-off values come from?
I would like to ask whether somebody knows which R package should be used to perform the Marascuilo procedure. I have multiple groups with proportion values that I want to compare, so it seems to be the most suitable test (e.g., 5 out of 10 individuals from population 1 produce compound X, a proportion of 0.5, while only 4 out of 20 individuals from population 2 produce compound X, a proportion of 0.4, and so on). However, I cannot find out whether a package is available for this test.
Does anyone know what package and function should be used?
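In case no package exists, my fallback plan is to code the procedure by hand along these lines (a sketch; the third population in the usage example is made up):
marascuilo <- function(successes, n, alpha = 0.05) {
  p <- successes / n
  k <- length(p)
  crit <- sqrt(qchisq(1 - alpha, df = k - 1))   # critical value from the chi-square distribution
  pairs <- combn(k, 2)
  out <- t(apply(pairs, 2, function(ij) {
    i <- ij[1]; j <- ij[2]
    d <- abs(p[i] - p[j])                       # observed difference in proportions
    r <- crit * sqrt(p[i] * (1 - p[i]) / n[i] + p[j] * (1 - p[j]) / n[j])  # critical range
    c(group1 = i, group2 = j, abs_diff = d, critical_range = r, significant = d > r)
  }))
  as.data.frame(out)
}
marascuilo(successes = c(5, 4, 12), n = c(10, 20, 30))
But I would of course prefer an existing, tested implementation.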
Thank you very much in advance!
For example, when is it better to use decision trees instead of SVM or KNN, based on underlying theory/distribution of the data ?
I would appreciate any empirical/theoretical advice or references.
Say I have a 3 (young vs. middle-aged vs. old) × 2 (male vs. female) × 2 (smoker vs. non-smoker) ANOVA design with three-way interactions. However, the number of non-smokers (N = 2000) is twice the number of smokers (N = 1000). In this case, the number of young male non-smokers is 600 whereas the number of young male smokers is 300. Would the unequal sample size in each cell be problematic for the ANOVA and multiple comparison tests?
We are currently conducting a research project that focuses on organ rejection. For this purpose, we have taken blood samples from various patients who received an organ transplant, both pre- and post-OP, although here we only consider post-OP. Some of these patients received an organ biopsy to diagnose a suspected organ rejection reaction. Blood samples were also taken at these times.
We want to compare the non-rejection (samples taken postOP when no biopsy was taken or samples corresponding to a negative biopsy result) to the rejection samples (samples corresponding to a positive biopsy result).
The problem we now face is the following: Not all patients have received a biopsy.
This means that some but not all of the patients in the non-rejection group have dependent (paired) samples in the rejection group.
How do we statistically account for the fact that some of the samples are paired? Any help is greatly appreciated!
A very interesting topic, "quantification of randomness". In mathematics it is sometimes referred to as "complexity theory" (although it is more about pseudorandomness than randomness), based on the idea that a more complicated series is more random. Then there are tests for randomness in statistics, and perhaps the most intriguing one is related to the information-theoretic notion of "entropy" (also of relevance to, and a consequence of, the second law of thermodynamics). There are also random number generators (pseudorandom number generators) and true random number generators using quantum computing.
So what I have been trying to do is compile a complete list of all available algorithms, books, or even random number generators that can tell me how random a series is, allowing me to "quantify randomness".
I have discovered and generated, based on a rule, 125 unique infinite pseudorandom series; now how do I test for randomness and quantify it? That is, whether a series is random, or whether there is a pattern or something that would allow me to predict the next number in the series given that I do not know what it is.
Now, does anyone know of any GitHub links related to any of the above (anything about quantifying randomness in general that you think would be helpful)?
A book or books on quantifying randomness would be very helpful too. Actually, anything at all...
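The only thing I have tried myself so far is something very crude: the Shannon entropy of the symbol frequencies in a finite chunk of a series (sketch below, with a random stand-in for one of my series), so pointers to anything more principled would be great.
x <- sample(0:9, 1000, replace = TRUE)   # stand-in for a finite chunk of one series
p <- table(x) / length(x)                # relative frequency of each symbol
entropy <- -sum(p * log2(p))             # close to log2(10) bits if roughly uniform
entropy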
I'm working on an imaging genetics project with the aim of exploring whether certain allelic variations of a gene modulate BOLD responses and behaviors measured in a social-cognitive task.
When I solely looked into the behavioral data, I found no evidence that the genetic variable (e.g., polygenic risk score) significantly predicted individual participants' task performance (e.g., no zero-order correlation).
However, I found that the genetic variable is linearly associated with the activation in one brain region, and the activation values extracted from this area, which shows significant genetic modulation, in turn correlated with the same task performance analyzed above. Let's suppose that A = polygenic risk score, B = brain activation, and C = behavioral task performance. All genetic, brain activation, and behavioral data are obtained from the same group of individuals.
What I'm seeing here is as follows:
1. Significant association between A->B
2. Significant association between B->C
3. Non-significant association between A->C.
My (potentially faulty) intuition was that maybe there is a path between these variables, where A is linked with C only via the action of B. Indeed, a mediation analysis based on bootstrapping revealed a significant indirect path linking A->B->C. No direct effect was significant with or without the mediator. (I understand that this is problematic in the Baron-Kenny approach, but I also learned that the A->C relationship is not required, as it is equivalent to the total effect, which essentially is the combination of all possible indirect and direct effects.)
In this situation, is it permissible to conclude that the brain activation (B) is mediating the genetic (A) and behavioral (C) variable? I could see someone argue that A->B->C is a more accurate model as you may miss the significant indirect path if you only test the direct path. However, such a postulation just seems counterintuitive. It just doesn't seem to make sense that the genetic modulation on behaviors that was initially absent "suddenly" becomes significant when the brain data are combined.
Or is this just a misguided feeling due to the fact that I happened to perform the behavioral analysis first (mostly because of the format of the paper, where you typically introduce the behavioral results before the neuroimaging data), and now I feel like I'm making things up with the neuroimaging data that weren't initially considered...
Any inputs will be greatly appreciated!
I need to calculate the necessary sample sizes to reach alpha = 0.05 and Power = 0.8 for my experiment.
The problem is that I expect there to be no difference between my two groups. So how do I calculate Cohen's d in that case? How many replicates are enough to be confident that my (assumed) non-significant result is really because there is no difference and not because my sample size was too small?
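The only way I can see to make this concrete is to pick a smallest difference I would still care about (a smallest effect size of interest) and size the study to detect that; failing to find it then at least rules out differences of that magnitude or larger. For example, with d = 0.5 (an arbitrary choice on my part):
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)   # gives n per group for d = 0.5
But I am not sure whether this is the accepted way to handle an expected null result.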
I am currently performing undergraduate research in forensics, comparing two types of width measurements (the widths of land and groove impressions on fired bullets), one taken by an automated system and the other performed by my associate manually using a comparison microscope. We are trying to see if the automated method is a suitable replacement for the manual method. We were recommended to perform a simple linear regression (ordinary least squares); however, when it comes to actually interpreting the results, we ran into some slight trouble.
According to p. 218 of Howard Seltman's Experimental Design and Analysis, "sometimes it is reasonable to choose a different null hypothesis for β1. For example, if x is some gold standard for a particular measurement, i.e., a best-quality measurement often involving great expense, and y is some cheaper substitute, then the obvious null hypothesis is β1 = 1 with alternative β1 ≠ 1. For example, if x is percent body fat measured using the cumbersome whole body immersion method, and Y is percent body fat measured using a formula based on a couple of skin fold thickness measurements, then we expect either a slope of 1, indicating equivalence of measurements (on average) or we expect a different slope". In comparison to normal linear regression, where β1 = 0 is usually tested, I was just wondering how you actually test the hypothesis proposed by Seltman: do we test it the same way you would test the hypotheses of a normal linear regression (finding t-test values, p-values, etc.)? Or is there a different approach?
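My current attempt at testing H0: beta1 = 1 directly from the lm() output looks like this (x = manual widths, y = automated widths; the data frame name is mine), but I am not sure it is the intended approach:
fit <- lm(y ~ x, data = widths)
b1  <- coef(summary(fit))["x", "Estimate"]
se1 <- coef(summary(fit))["x", "Std. Error"]
t_stat <- (b1 - 1) / se1                               # same t-statistic, but against the null value 1
p_val  <- 2 * pt(-abs(t_stat), df = df.residual(fit))  # two-sided p-value
c(t = t_stat, p = p_val)
confint(fit)["x", ]   # equivalently, check whether 1 falls inside this interval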
I am also open to suggestions as to what other tests could be performed
A quick thank you in advance for those who take the time to help!
I need to find the number of full-time (officer and civilian) law enforcement personnel for as many years as possible. I know that the Uniform Crime Reports have some data, but I can't seem to find very many years. Does anyone know of an already constructed dataset?
If I could find part time data too that might be nice. I also need law enforcement spending (total - federal, state, and local).
My understanding of conventional practice in this regard is that when more than two independent proportions are being compared (e.g., comparing the proportion of people who contracted COVID-19 in a given period between the <18-year-old, 18-64-year-old, and >64-year-old groups), one of the groups serves as a reference group (which automatically has an OR or RR = 1), from which the corresponding OR or RR of the remaining groups is derived. As far as I know, the OR or RR generated for the latter groups, whether through logistic regression or by-hand computation, has a p-value whose threshold for significance testing is not adjusted for the number of pairwise comparisons performed.
I understand that in the case of more than two independent means, we implement one-way ANOVA/Kruskal-Wallis technique first as omnibus/global hypothesis test which is followed by the appropriate post-hoc tests with the p-value thresholds adjusted if the former test finds something "statistically significant." I imagine that if the same stringency is applied to more than two independent proportions, we should be doing something like a Chi-square test of association (with the assumptions of the test being met) first as omnibus/global hypothesis test, followed by an appropriate post-hoc procedure (possibly Fisher exact tests with p-value threshold adjustment depending on the number of pairwise comparisons performed) if the former test elicits a "statistically significant" difference between the independent proportions.
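To make sure I am describing the stricter two-step approach correctly, this is what I picture in R (the counts are invented for illustration):
cases <- c(40, 120, 60)     # e.g. COVID-19 cases in the <18, 18-64, and >64 groups
n     <- c(500, 800, 300)   # group sizes
prop.test(cases, n)                                            # omnibus test across the three proportions
pairwise.prop.test(cases, n, p.adjust.method = "bonferroni")   # adjusted pairwise comparisons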
I would like to ask for some clarification on this (i.e., which concepts or details I am getting wrong). Thank you in advance.
I have 350 patients, whom I divide into 4 age groups. I am wondering whether (in addition to calculating percentages) any of the age groups is statistically significantly over-represented. Unfortunately, there is no "non-patient" group, so I can't create cross-tabs for a chi² test.
Thanks for your help.
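The workaround I am currently considering is a goodness-of-fit test of the observed group counts against some expected distribution, equal shares here for illustration, though it could also be the age distribution of a reference population if one is available (the counts are placeholders):
counts <- c(60, 110, 105, 75)          # patients in the 4 age groups
chisq.test(counts, p = rep(1/4, 4))    # are any groups over- or under-represented vs. equal shares?
Is that a defensible approach without a non-patient group?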
Is it possible to use Finite Population Correction (FPC) to decide the minimum required sample size when we use Respondent Driven Sampling (RDS) approach to recruit hidden populations? Kindly share any reading material on this? An introduction to RDS is attached for your information. Thanks in advance for kind support.
In my research design, there are 10 groups of 7 people, and each group rates 3 of the 10 chosen essays. This means each rater scores 3 essays and each essay gets 21 scores (for concrete details, please see the attached file). However, this also means there will be missing data (empty cells). After looking at similar questions, it appears that SPSS (I am using version 25) would treat empty cells as though they were filled, resulting in skewed results.
Has SPSS solved this problem? Is there any other program that works?
The histogram shows the distribution of the response (i.e., subjective social class), whereas the second image is the P-P plot of the residuals from the GLM using predictors including demographics such as age, gender, education, income... Is a GLM still suitable in this context? If not, what would be the best alternatives?
I have 2 equations, and in each equation I have a coefficient of interest:
Given that the values of a and b change over time,
I am aiming to record the values of a in a list A and the values of b in another list B,
and from their behaviour I want to draw conclusions about the strength of these coefficients.
But I am a bit confused about how to draw such conclusions and what the most representative way is to monitor how a and b change over time.
Or is it better to monitor the increase or decrease of a coefficient by summing the differences of the recorded values over time?
I have more coefficients to monitor, and they may or may not carry value; my aim is to build a meaningful classification that can categorise coefficients as useful or not.
I am currently running some generalized linear mixed model analysis with R.
I have a lot of possible predictors (all either continuous or ordinal) and subject as a random factor. It is difficult for me to decide on which predictor makes more sense because some would make sense based on our current knowledge, however some were included because they were promising even though there are no current theory or proof directly supporting their potential role in explaining my dependent variable.
Looking for ways to either select predictors a priori or assess the "goodness of fit" of different candidate models, I came across many fascinating posts about r-squared, its bias towards "encouraging" an increased number of predictors (as it never goes down when predictors are added, if I understood correctly), and possible alternatives: marginal r-squared, conditional r-squared, adjusted r-squared, and my new favourite, predicted r-squared.
I find the idea behind predicted r-squared very convincing. However I cannot find anything about its use with mixed model, except this appendix :
Furthermore, in R there doesn't seem to be a direct way to calculate predicted r-squared. I found a home-made function created by Tom Hopper here:
However, again I am not sure that it is usable with mixed models, also because the PRESS function uses lm.influence to diagnose the quality of fit of the model, and I am not sure that works with a glmer model.
So, sorry for the very long post, hope it makes sense to somebody, and would be very curious to get your feedback on that.
Note: I am not a statistician (if that was not already obvious from my post), so my understanding of these methods might be a bit superficial.
Thanks in advance!
EDIT: A little note. This comes from my wish to avoid over-fitting, given the number of predictors I have used in my various attempts to model my data. So if there are better tools to evaluate over-fitting in mixed models, I am also interested.
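Since I cannot find a predicted r-squared for glmer, my current fallback idea is subject-level k-fold cross-validation, roughly as below (a sketch only; y, x1, x2, and subject are placeholders for my variables):
library(lme4)
set.seed(1)
subjects <- sample(unique(d$subject))                        # shuffle subjects before splitting
folds <- split(subjects, cut(seq_along(subjects), breaks = 5, labels = FALSE))
cv_err <- sapply(folds, function(hold_out) {
  train <- d[!d$subject %in% hold_out, ]
  test  <- d[ d$subject %in% hold_out, ]
  m <- glmer(y ~ x1 + x2 + (1 | subject), data = train, family = binomial)
  p <- predict(m, newdata = test, type = "response", allow.new.levels = TRUE)
  mean((test$y - p)^2)                                       # Brier-type out-of-sample error
})
mean(cv_err)   # compare across candidate predictor sets to spot over-fitting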
- What is the best metric for model selection?
- Is accuracy derived from cross-validation a good metric?
- Does the model selected on the basis of these metrics surely lead to better results?
I am interested in statistically analyzing a non-normally distributed histogram of the performance of 141 test cases in the 0-200 range, with most of them concentrated in the 50-100 range and the rest being outliers. I would like to answer the following questions:
- How can a more "correct" average be found, given that it is a non-normal distribution with outliers?
- How can the means with and without outliers be statistically compared to examine the effect of the outliers?
Since I am not a skilled statistician, I would like to know how to start analyzing such a case: which statistical procedures to use, which statistical tests to use, and which metrics to examine (mean, SD, p-value, significance level, etc.), and how to analyze it step by step. I would also like to receive references that can help me.
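To show where I am starting from, this is my first rough pass (the data here are simulated to look roughly like mine):
scores <- c(runif(120, 50, 100), runif(21, 120, 200))   # fake data shaped like my 141 test cases
median(scores)                # robust "average"
mean(scores, trim = 0.1)      # 10% trimmed mean, another robust option
iqr <- IQR(scores); q <- quantile(scores, c(0.25, 0.75))
keep <- scores > q[1] - 1.5 * iqr & scores < q[2] + 1.5 * iqr   # usual 1.5*IQR outlier rule
c(mean_all = mean(scores), mean_without_outliers = mean(scores[keep]))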
Thanks in advance.
The conventional models used for dose-response meta-analyses only consider the variation for the response variable. I am trying to find a model for dose-response meta-analysis (both the dose and the response variables are continuous) which also takes the variation (SD) in the dose (exposure) variable and possible confounding variables (for adjustment in the model) into account. I will be thankful if you also guide me about the statistical package that I should use for such an analysis.
I am currently working on a project where we examine the effect of an intervention on fatigue. The project has been carried out according to an one-group pre-test post-test design. I am uncertain about the best way to conduct a mediation analysis.
In our project we started with a 12-week control period. So we have a baseline measurement at T0 and the pre-test measurement at T12, i.e., two measurements for one condition. The following 12 weeks are the intervention period, and we have a post-test measurement after these 12 weeks, at T24.
Our outcome (fatigue) and the potential mediators have been measured at all three time points. We conducted a mediation analysis according to Montoya & Hayes (2017). However, this analysis only took into account the measurements at T12 and T24, so the baseline measurement is not considered at all. Now we are wondering: is this the correct way of conducting this analysis, or do we also need to include the baseline measurement? We had the following ideas:
- Conduct the analysis with T12 (pre-test) compared to T24 (post-test), as we did.
- Conduct the analysis with the average of T12 and T0 (pre-test) compared to T24 (post-test).
- Conduct two separate analyses, so compare T0 and T12, and compare T12 with T24.
- Conduct an analysis to compare the change score (T12-T0) with the change score (T24-T12)
What would you guys do?
I am planning to do a questionnaire study. My variables are
university performance (X)
lecturers' contribution (moderator)
1. For collecting lecturer contribution, my sample population are lecturers.
2. For collecting university performance, my sample population is university administration staff
3. For collecting students performance, my sample population are both lecturers and university administration staff
Please tell me whether I should design two separate questionnaires in one study. What would be the best way of designing this questionnaire or these questionnaires? Please refer me to any relevant research articles.
Thank you very much for your cooperation
I have done a genetic association analysis in SPSS. The allelic association came out as significant. But when I wanted to compute the odds ratio for both the allelic and the different genotypic models based on cross-tabulation, SPSS showed the odds ratio for control/case instead of case/control. Can I report this ratio in scientific journals? For clarity, one SPSS file is attached below. Any suggestion will be appreciated. The analysis process is briefly described below.
> Code controls as 1 and cases as 2 in SPSS. Also code the AG and GG genotypes as 1 and the AA genotype as 2 (dominant model).
> Import the Excel sheet into SPSS.
> Go to Descriptive Statistics and click Crosstabs.
> Put the population (case/control) column into rows and the genotype into columns.
> Select chi-square, percentage, and risk.
Let us suppose that we have an intervention, for example technology integration in the science classroom. Can we study which mediators could affect the results of the intervention, for example learning motivation? Can we study which moderators could affect the results of the intervention? And why?
For example, can we study how gender mediates the influence of the intervention on learning motivation? Or would it be better to consider the interaction of gender and the intervention?
As far as I know, I can plot either -log10(p-values) on the y-axis of a volcano plot, or the -log10(adjusted p-values) after adjusting them for example with Benjamini-Hochberg.
When plotting adjusted p-values, I can just set the cut-off to -log10(0.05) (see picture 1).
However, when plotting the original p-values, I need to set a different cut-off. You can see in the raw data table that Species 9 already has an adjusted p-value > 0.05, while Species 8 is the first one with an adjusted p-value < 0.05. Therefore, my cut-off when plotting the original p-values should lie between the original p-values of Species 8 and 9, i.e., between 0.00806 and 0.01165.
In the second picture, I set the cut-off to 0.01165. Is there any way to determine more precisely where in between the cut-off should be set?
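For reference, this is how I currently compute the two boundary values in R (pvals is my vector of original p-values); what I am unsure about is whether there is a principled point in between, or whether any value between the two boundaries is equally valid, in which case I would just take the midpoint:
bh <- p.adjust(pvals, method = "BH")
lo <- max(pvals[bh < 0.05])    # largest raw p-value still significant after BH (Species 8 here)
hi <- min(pvals[bh >= 0.05])   # smallest raw p-value no longer significant (Species 9 here)
cutoff <- (lo + hi) / 2        # any value between lo and hi separates the two sets
abline(h = -log10(cutoff), lty = 2)   # add the cut-off line to an existing volcano plot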
Is it possible to run a correlation test on a continuous DV and a categorical IV with 3 levels?
I'm investigating whether gender is associated with academic procrastination; however, my gender variable is coded as: 0 = Male, 1 = Female, 2 = Non-Binary.
Initially I ran a Pearson product-moment correlation test; however, I have now realised that this may not be the right procedure.
Any help would be greatly appreciated!
I have two treatment groups with 4 biological replicates each. I measured 100 lipid species in each of them and want to visualize the differences using a volcano plot.
Which of these two ways is the correct one to process my data?
- Calculate, for each lipid species, the averages, fold changes, and adjusted p-values between the two treatment groups, and then at the end log2-transform the fold changes and -log10-transform the p-values for plotting
- Log2 transform all the measured lipid concentrations first, calculate the averages, fold changes, adjusted p-va