# Biostatistical Methods - Science topic

Explore the latest questions and answers in Biostatistical Methods, and find Biostatistical Methods experts.
Questions related to Biostatistical Methods
Question
Hi,
I have performed an epidemiological survey on insomnia prevalence using the ISI and am looking to test its internal consistency using Cronbach's alpha. I have not been able to find a worked example of estimating this for each survey question, and would appreciate guidance from your expertise.
Daniel Ruivo Marques, pardon me for the late response. Thank you for your guidance.
Question
In many biostatistics books, the negative sign of the calculated t value is ignored, while in a left-tailed t test a minus sign is included in the critical value.
e.g., result of a left-tailed paired t test:
calculated t value = -2.57
critical value = -1.833 (df = 9; level of significance 5%; minus sign included since it is a left-tailed test)
Now we can accept or reject the null hypothesis:
if we keep the negative sign, -2.57 < -1.833, so the null hypothesis is rejected;
if we ignore the sign and compare absolute values, 2.57 > 1.833, so the null hypothesis is again rejected.
The decision agrees either way, provided the calculated and critical values are compared on the same side of the distribution; comparing -2.57 with +1.833 mixes the two conventions and gives the wrong answer.
The negative sign matters in mathematics in general and in statistics in particular, especially when commenting on results (for example, a positive correlation is the opposite of a negative linear correlation). The signs must be respected.
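The sign-handling above can be checked directly. A minimal standard-library sketch with made-up paired data (ten hypothetical pairs, so df = 9, matching the example in the question):

```python
import math
import statistics

# Hypothetical paired observations (n = 10, df = 9)
before = [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
after  = [6, 8, 8, 8, 11, 11, 12, 14, 13, 15]

# A paired t test works on the differences
diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)
t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Left-tailed test: compare the signed t with the signed critical value
critical = -1.833  # t(0.05, df = 9), one-tailed
print(round(t, 2), "reject H0" if t <= critical else "fail to reject H0")
```

Keeping the sign, t = -4.71 ≤ -1.833 rejects H0; using absolute values, 4.71 > 1.833 gives the same decision, which is why both conventions work as long as each is applied consistently.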
Question
I have 16S data sequenced on the Illumina MiSeq platform. The data come from an experiment testing the effects of different aquaculture additives on the growth and survival of larval sablefish. It consisted of 18 tanks: 6 replicates of 3 water treatments (clay, algae, and algae with a switch to clay after one week). I'm interested in the effects these additives have on the skin microbiome of the larval sablefish. The 16S data are from water samples from the tanks and from swab samples off the surfaces of 8-12 sablefish (to control for interindividual variation). There were also 3 different genotypic crosses, so that there were 2 replicates of each genotype for each of the 3 treatments.
I have sets of water and swab data from all 18 tanks for 3 time points (each a few days apart).
I'm interested in the following:
1) How reflective are the skin microbial communities of the surrounding seawater? (i.e. are they similar or very different from one another?)
For this question, I was thinking about using the weighted UniFrac measure and generating PCoA plots that include both the water and swab samples to see if they cluster together. I think that will be the most informative as it considers relative abundance and phylogeny, and that's something I'm interested in. Beyond that, I'm unsure if that's the most appropriate measure to use, if I should use additional measures like Bray-Curtis or unweighted UniFrac, and what statistical tests to use beyond that.
2) A. How is skin microbial composition/structure different between water treatments?
B. How does it change over time, with respect to each treatment?
C. How does the similarity between skin and water communities change over time?
For some of these questions, I was thinking of using a generalized linear model in R, but beyond that I'm really unsure of where to start.
3) How much of an effect does genotype play in the formation of the skin microbiome?
I was thinking of using a generalized linear mixed effects model (with genotype as a random effect, and seeing how that differs from treating it as a fixed effect; but since genotype would be the only random effect in this study, I don't know if that's appropriate). I could also use a generalized linear model to see if there's an interaction between genotype and treatment, and how much of an effect genotype has on its own.
Beyond what I've stated above, I'm unsure of which indices would be best to use (Shannon, Simpson, Chao1, etc), which statistical tests to use (since they come with their own assumptions and have their own limitations), which models to run, etc. Statistics in an ecological context is something I'm still learning, and I'm not very familiar with multivariate approaches. I am, however, familiar with R and QIIME.
Any and all assistance is greatly appreciated. Anything to at least point me in the right direction. Thank you in advance!
It sounds like you have a lot of interesting questions to investigate! Here are some suggestions on how to approach your analysis:
1. To investigate the similarity between skin microbial communities and the surrounding seawater, using the weighted UniFrac measure and generating PCoA plots that include both the water and swab samples is a good idea. In addition to UniFrac, you could also consider using other measures such as Bray-Curtis dissimilarity, Jaccard distance or Morisita-Horn distance. You can then use PERMANOVA or ANOSIM to test for significant differences between the skin and water communities. You may also want to consider using mixed-effects models to account for the non-independence of the data due to the repeated measures design.
2. A. To investigate the differences in skin microbial composition between water treatments, you could compare the community composition of the swab samples from each treatment using PERMANOVA or ANOSIM. B. To investigate how the skin microbial community changes over time with respect to each treatment, you could perform a longitudinal analysis using linear mixed effects models or generalized estimating equations. You can use multivariate techniques like redundancy analysis (RDA) or canonical correspondence analysis (CCA) to explore the relationship between microbial community composition and time. C. To investigate how the similarity between skin and water communities change over time, you could perform a similar longitudinal analysis using mixed effects models, and test for differences between the similarity of communities across the different treatments.
3. To investigate the effect of genotype on skin microbiome, a generalized linear mixed effects model would be appropriate. You can use the model to investigate the fixed effects of treatment, genotype, and their interaction on skin microbiome composition, while also including genotype as a random effect. You could also perform pairwise comparisons between genotypes to identify significant differences in the skin microbial composition.
For all these analyses, it would be useful to calculate alpha and beta diversity metrics such as Shannon diversity index and Bray-Curtis dissimilarity, and to visualize the data using ordination plots. Additionally, you may want to use differential abundance analysis to identify specific microbial taxa that are differentially enriched or depleted between treatments, time points or genotypes. There are several R packages available for performing these analyses, including vegan, nlme, and DESeq2.
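As a sanity check on what the packages compute, the two metrics mentioned most often above have simple closed forms. A minimal standard-library sketch (vegan and QIIME compute these for you; the count vectors here are invented):

```python
import math

def shannon(counts):
    """Shannon diversity H' = -sum(p_i * ln(p_i)) over taxa with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum of absolute differences over total abundance."""
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))

water = [30, 10, 5, 5]   # hypothetical taxon counts in a water sample
swab  = [5, 10, 20, 15]  # hypothetical taxon counts in a skin swab

print(round(shannon(water), 3))            # evenness-weighted richness
print(round(bray_curtis(water, swab), 3))  # 0 = identical, 1 = no shared taxa
```

PERMANOVA (e.g. `adonis2` in vegan) then tests whether such pairwise dissimilarities differ between groups; UniFrac additionally needs the phylogeny, which is why it is left to the dedicated packages.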
Question
Contact by E-mail: Kawsar_Ahmed@csu.edu.cn
Or, for further meeting discussion, you can set up an online meeting ID and send me the link. Thanks in advance.
I am interested
Question
Hello everyone. I am currently working on comparing antioxidant potency between my samples.
I have 4 samples = water kefir, water kefir infused with butterfly pea, water kefir infused with turmeric and water kefir infused with matcha green tea.
As of now, I am comparing them on the basis of how they fare in the two assays I have done: DPPH and FRAP. However, I am doubtful of this comparison method, as the assays are not directly comparable since they have different units and measurement scales.
I have recently encountered the concept of a relative antioxidant capacity index (RACI) in a few recently published papers. However, I am still not clear on how I could apply this concept to my results.
Could anyone provide me any insight on how does one use this concept to compare between antioxidant potency of different samples?
I feel that the people who posted answers completely ignored my first response. Each assay is based on its own chemistry, designed for certain goals, and used in different conditions. You need to understand well what you are measuring. Otherwise, you put apples, oranges, pears and unknown fruits into a basket and want to compare your basket with someone else's. How can you say which basket is better? My point is that the RACI does not have any chemical sense.
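For completeness, the RACI as usually defined in the literature is the mean of each sample's standard scores (z-scores) across assays, which removes the unit problem. A minimal sketch (the DPPH and FRAP values are invented, and the caveat above about mixing different chemistries still applies):

```python
import statistics

# Hypothetical assay results for 4 samples (different units per assay)
dpph = [55.0, 62.0, 48.0, 70.0]   # e.g. % inhibition
frap = [1.2, 2.5, 0.9, 3.1]       # e.g. mmol Fe2+ equivalents/L

def z_scores(values):
    """Standard scores: unitless distance from the assay mean in SD units."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]

# RACI: average the unitless standard scores across assays, per sample
raci = [statistics.mean(pair) for pair in zip(z_scores(dpph), z_scores(frap))]
ranking = sorted(range(len(raci)), key=lambda i: raci[i], reverse=True)
print([round(r, 2) for r in raci], ranking)
```

The RACI values only rank samples relative to each other within this set of assays; they carry no chemical meaning on their own.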
Question
Hi,
I have performed an insomnia prevalence study among academics using the ISI. I have come across the floor and ceiling effect in a cross-sectional survey. I want to estimate the floor/ceiling percentage for each ISI question and for the total score. It would be helpful to see a worked example of this calculation.
I would appreciate your help in enhancing my knowledge.
Hi,
Look at whether the distribution of scores is skewed. Correct the skew with a transformation and inspect the distribution again; you may then be able to do further analysis. Alternatively, the frequencies in the categories can be compared.
Here are a few references on the Psychometric properties and its use in some studies of the scale:
Morin CM, Belleville G, Bélanger L, Ivers H. The Insomnia Severity Index: psychometric indicators to detect insomnia cases and evaluate treatment response. Sleep. 2011 May 1;34(5):601-8. doi: 10.1093/sleep/34.5.601
Schulte, T., Hofmeister, D., Mehnert-Theuerkauf, A. et al. Assessment of sleep problems with the Insomnia Severity Index (ISI) and the sleep item of the Patient Health Questionnaire (PHQ-9) in cancer patients. Support Care Cancer 29, 7377–7384 (2021). https://doi.org/10.1007/s00520-021-06282-x
Yusufov M, Zhou ES, Recklitis CJ. Psychometric properties of the Insomnia Severity Index in cancer survivors. Psychooncology. 2019 Mar;28(3):540-546. doi: 10.1002/pon.4973
Ohayon MM. Epidemiology of insomnia: what we know and what we still need to learn. Sleep Med Rev. 2002 Apr;6(2):97-111. doi: 10.1053/smrv.2002.0186
Okajima I, Miyamoto T, Ubara A, Omichi C, Matsuda A, Sumi Y, Matsuo M, Ito K, Kadotani H. Evaluation of Severity Levels of the Athens Insomnia Scale Based on the Criterion of Insomnia Severity Index. Int J Environ Res Public Health. 2020 Nov 26;17(23):8789. doi: 10.3390/ijerph17238789
Gagnon C, Bélanger L, Ivers H, Morin CM. Validation of the Insomnia Severity Index in primary care. J Am Board Fam Med. 2013 Nov-Dec;26(6):701-10. doi: 10.3122/jabfm.2013.06.130064
Kraepelien M, Blom K, Forsell E, Hentati Isacsson N, Bjurner P, Morin CM, Jernelöv S, Kaldo V. A very brief self-report scale for measuring insomnia severity using two items from the Insomnia Severity Index - development and validation in a clinical population. Sleep Med. 2021 May;81:365-374. doi: 10.1016/j.sleep.2021.03.003
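Coming back to the worked example requested: since each ISI item is scored 0-4 (total 0-28), the floor/ceiling percentage is simply the share of respondents at the minimum/maximum score; a common rule of thumb flags a floor or ceiling effect when either share exceeds roughly 15%. A minimal sketch on made-up item responses:

```python
def floor_ceiling(scores, lo=0, hi=4):
    """Percent of respondents at the scale floor (lo) and ceiling (hi)."""
    n = len(scores)
    floor_pct = 100 * sum(s == lo for s in scores) / n
    ceil_pct = 100 * sum(s == hi for s in scores) / n
    return floor_pct, ceil_pct

item1 = [0, 0, 1, 2, 0, 4, 1, 3, 0, 2]  # hypothetical responses to one ISI item
f, c = floor_ceiling(item1)
print(f, c)  # repeat per item, and with lo=0, hi=28 for the total score
```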
Question
I am stuck here. I am working on a therapy and trying to evaluate the changes in biomarker levels. I selected 5 patients and analysed their biomarker levels before therapy, after the first therapy, and after the second therapy. When I apply ANOVA, the mean values differ, but because of the large standard deviations I am getting non-significant results, as in the table below.
| Sample | Size | Mean | Standard Deviation | SE of Mean |
|---|---|---|---|---|
| vb bio | 5 | 314.24 | 223.53627 | 99.96846 |
| cb1 bio | 5 | 329.7 | 215.54712 | 96.3956 |
| CB II | 5 | 371.6 | 280.77869 | 125.56805 |
So I want to ask the statisticians who are familiar with clinical trial studies:
Am I performing the statistics correctly?
Should I be worried about the non-significant results?
What statistical tests should I use?
How should I present my data for publication?
Please explain as if teaching a newcomer to this field.
Please be mindful about this kind of thing: uncertain and misleading conclusions can be very dangerous in medicine, and it is healthier for the community to talk about them than to just generate some numbers from the data.
Best,
jan
Question
Hi, I would like to know what is the type of study for the following research:
- The researcher conducted a descriptive study of medication errors in a hospital over 3 years. The number and characteristics of these medication errors formed the comparison sample.
- The researcher then implemented a medication error mitigation program.
- The researcher then studied the number and characteristics of medication errors in the 3 years after the implementation of the mitigation program, and compared these results with the 3-year sample prior to implementation.
What is the type of study for this research?
It is an observational study; it could be cross-sectional, prospective, or retrospective, depending on the type of error. Using a trigger tool for specific errors can help too.
Question
Hello everyone!
We are developing a phase I randomized clinical trial in 18 healthy volunteers, aimed at testing the safety and pharmacokinetics of an i.v. drug. However, we want to test two different doses of the drug (doses A and B), and each dose is to be administered at a specific infusion rate: dose A at X ml/min, and dose B at Y ml/min.
We need to randomize the 18 volunteers with a 2:1 ratio (active drug vs placebo), in blocks of size 6. However, to maintain blinding, we also need two different infusion rates for the placebo (X and Y).
What do you think is the best way to randomize the volunteers in this study?
One way could be to randomize the patients in a 2 x 2 factorial design: one axis to assign drug vs placebo, and the other axis to assign the drug dose with its infusion rate, maintaining a 2:1 ratio on the first axis and a 1:1 ratio on the second, in blocks of size 6. A second way could be to randomize to "three treatments" (dose A at infusion rate X, dose B at infusion rate Y, and placebo) in a 1:1:1 ratio, in blocks of size 6, and then randomize the patients assigned to placebo, in blocks of size two (or without blocks), to infusion rate X or Y.
Which do you think is the best way to randomize in methodological terms? In the case of the first option, do we need to test the interaction between dose and infusion rate? Do you have another idea for randomizing the patients in this study?
Thank you so much for your suggestions and help.
Hi,
As there are only 18 units and this is a single-dose study without repetition of treatments, why not use a simple randomization procedure with a random number generator to allocate the three groups (A, B, C)? With two of the groups on active drug, this gives the 2:1 ratio, and groups A, B and C could themselves be randomized.
Here is a good book on this subject.
Lachin JM, Rosenberger WF. Randomization in Clinical Trials: Theory and Practice. 2nd ed. Wiley Series in Probability and Statistics. John Wiley & Sons; 2016.
Question
I have two sets of human serum samples: pre- and post-immune sera. I need to find the average cut-off for immunogenic titres. I know how to do a t test in Excel, but my sample size is 34. Can someone show me how to perform a z test to calculate the cut-off values for the pre-immune sera?
Just type in your data and do it in Excel just as you have in the past; there's no major difference from what you did before. Best wishes, David Booth
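If by "z test cut-off" you mean the common serology convention of setting the positivity threshold at the mean of the pre-immune titres plus a z multiple of their SD (2 or 3 SD are typical choices; with n = 34 the normal approximation is reasonable), a minimal sketch with invented titre readings:

```python
import statistics

# Hypothetical pre-immune titre readings (e.g. OD values); n would be 34 in practice
pre_immune = [0.10, 0.12, 0.09, 0.15, 0.11, 0.13, 0.10, 0.14]

z = 2.0  # 2 SD is a common convention; 1.96 or 3 are also used
cutoff = statistics.mean(pre_immune) + z * statistics.stdev(pre_immune)
print(round(cutoff, 3))  # post-immune titres above this are called positive
```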
Question
For individual responses, I do not know how to calculate the reference value against which to check for outliers.
Kindly follow the SPSS procedure for determining the critical value here; it is pretty simple and intuitive.
Question
Hi, I was hoping someone could recommend papers that discuss the impact of using averaged data in random forest analyses or in making regression models with large data sets for ecology.
For example, if I had 4,000 samples each from 40 sites and did a random forest analysis (looking at predictors of SOC, for example) using environmental metadata, how would that compare with doing a random forest of the averaged sample values from the 40 sites (so 40 rows of averaged data vs. 4,000 raw data points)?
I ask this because a lot of the 4,000 samples have missing sample-specific environmental data in the first place, but there are other samples within the same site that do have that data available.
I'm just a little confused about: 1) the appropriateness of imputing average values in place of missing data (best practices/warnings); 2) the trade-offs between using smaller, averaged sample sizes to deal with missingness, using incomplete data sets, and using the significantly smaller sample of only "complete" cases; and 3) the geospatial rules for linking environmental data with samples (if 50% of plots in a site have soil texture data and 50% don't, yet they're all within the same site/area, what would be the best route for analysis? It could depend on the variable, and I have ~50 soil chemical/physical variables).
Thank you for any advice or paper or tutorial recommendations.
Thank you!
Question
We're conducting a research design as follow:
• An observational longitudinal study
• Time period: 5 years
• Myocardial infarction (MI) patients without prior heart failure are recruited (we'll name this number of people after 5 years of conducting our study A)
• Exclusion criteria: Death during MI hospitalization or no data for following up for 3-6 months after discharge.
• Outcome/endpoint: heart failure post MI (confirmed by an ejection fraction (EF) < 40%)
• These patients will then be followed up for a period of 3 to maximum 6 months. If their EF during this 3-6 months after discharge is <40% -> they are considered to have heart failure post MI. (we'll name this number of people after 5 years of conducting our study B)
• Otherwise they are not considered to have the aforementioned outcome/endpoint.
My question is as follow:
1. What is A/B best called? Is it cumulative incidence? We are well aware of studies similar to ours, but the one main difference is that they did not limit the follow-up time (i.e., a patient could be considered to have heart failure post-MI even 4 years after recruitment). I wonder if this factor limits the ability to calculate cumulative incidence in our study?
2. Is there a more appropriate measure to describe what we're looking to measure? How can we calculate incidence in this study?
3. We also wanted to find associated factors (risk factor?) with heart failure post-MI. We collected some data about the MI's characteristics, the patients' comorbidities during the MI hospitalization (when they were first recruited). Can we use Cox proportional hazards model to calculate the HR of these factors?
Hi,
The study starts with cohort A, and on follow-up, a patient with EF < 40% moves to group B. This shift means that survival (remaining free of the endpoint) decreases, i.e., survival analysis is applicable. Since factors affecting survival will be examined, the Cox proportional hazards model is applicable. Survival curves are cumulative curves.
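On question 1: with a fixed 3-6 month window and complete follow-up, B/A is the cumulative incidence (incidence proportion) over that window. When follow-up is incomplete, the Kaplan-Meier estimator handles the censoring. A minimal standard-library sketch (the follow-up times and event flags are invented):

```python
def kaplan_meier(times, events):
    """Product-limit estimator. events: 1 = heart failure observed, 0 = censored.
    Returns (time, survival) pairs at each event time."""
    pairs = sorted(zip(times, events))
    survival, at_risk, curve = 1.0, len(pairs), []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = censored = 0
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            censored += 1 - pairs[i][1]
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= deaths + censored
    return curve

# Months from discharge to EF < 40% (event = 1) or loss to follow-up (0)
months = [1, 2, 2, 3, 4, 5, 6, 6]
flags  = [1, 1, 0, 1, 0, 1, 0, 0]
print(kaplan_meier(months, flags))
# Cumulative incidence at time t is 1 - S(t)
```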
Question
We are currently doing an undergraduate thesis and are planning to assess the presence or absence of species at each elevation (our variable for community) during a certain month. We found ideas like the coefficient of community, but this only allows us to compare two communities.
You might try looking at amphibian/reptile studies, as many have been done at various altitudes for populations in vernal pools. I don't have such a paper on hand, but you can search for them, and they all contain some type of correlation factor. Joell
Question
EDIT: Please see below for the edited version of this question first (02.04.22)
Hi,
I am searching for a reliable normalization method. I have two ChIP-seq datasets to be compared with a t test, but the RPKM values are biased, so I need to fix this before the t test. For instance, when a value is high, it may not really be high: another factor can make it appear high, and in reality I should see a value closer to the mean. Likewise, if a value is low and the factor is strong, the factor may be the reason the value looks low, and we should have seen a value much closer to the mean. In brief, I want to eliminate the effect of this factor.
To this end, I have another dataset showing how strong this factor is for each value in the ChIP-seq data (again as RPKM values). Should I simply divide my RPKM values by the corresponding factor RPKM to get unbiased data? Or is it better to divide the RPKM values by the ratio RPKM/mean(RPKM) of the factor?
Do you have any other suggestions? How should I eliminate the factor?
Actually, the log transformation in the figure I attached was done according to the formula log((#1+1)/(#2+1)). Only later did I consider that, since I add 1 to my values to be able to carry out the log transformation (not to eliminate zero values), it might be more correct to add 1 to the adjusted values just before the transformation.
Thanks again :) Jochen Wilhelm
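The second option in the question (dividing each RPKM by that region's factor signal relative to the mean factor signal) has the advantage of leaving the overall scale of the data unchanged. A minimal sketch of both that ratio-to-mean adjustment and the log((x+1)/(y+1)) pseudocount transform mentioned above (all values invented):

```python
import math
import statistics

def ratio_to_mean_adjust(rpkm, factor):
    """Divide each value by its factor signal relative to the mean factor signal,
    so regions where the confounding factor is strong are scaled down."""
    m = statistics.mean(factor)
    return [v / (f / m) for v, f in zip(rpkm, factor)]

def log2_fc(a, b, pseudocount=1):
    """log2 fold change with a pseudocount so zero values stay defined."""
    return math.log2((a + pseudocount) / (b + pseudocount))

rpkm   = [10.0, 20.0, 30.0]  # hypothetical target ChIP-seq signal per region
factor = [1.0, 2.0, 3.0]     # hypothetical confounding-factor signal per region

print(ratio_to_mean_adjust(rpkm, factor))  # all 20.0 here: signal tracks the factor
print(round(log2_fc(3, 0), 2))
```

In this toy example the target signal is perfectly proportional to the factor, so the adjustment flattens it completely; real data would retain the residual, factor-independent variation you actually want to test.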
Question
I have two ChIP-seq datasets for different proteins, which I have aligned to a set of DNA fragments. Some of these fragments get a read count of zero for one dataset or for both. To be able to say that a fragment has much more of protein X than protein Y, I use Student's t test.
I wonder if it would be better to remove the zero values from both datasets of RPKM values. Moreover, the zeros pose a problem when I want to use a log scale during data visualization.
What would you suggest?
Thank you so much for both your answer and suggestion, David Eugene Booth.
Question
Hi,
We received a statistical reviewer comments on our manuscript and one of the comments goes as follows: '... Note that common tests of normality are not powered to detect departures from normality when n is small (eg n<6) and in these cases normality should be support by external information (eg from larger samples sizes in the literature) or non-parametric tests should be used.'
This is basically the same as saying that 'parametric tests cannot be used when n < 6', at least without matching external data that would permit an accurate assumption about the data distribution (and in real life such datasets rarely exist). This just doesn't seem right: the t test and ANOVA can be used with small sample sizes as long as their assumptions are satisfied, which according to the reviewer cannot be accurately assessed, and thus the tests cannot be used...
I see two possible ways of addressing this:
1. Argue that parametric tests are applicable and that normality can be assessed using residual plots, tests of homogeneity of variance, etc. This sounds like the more difficult, risky and laborious option.
2. Redo all the comparisons with non-parametric tests based on this one comment, which just doesn't seem right and empirically would not yield different results. This would apply to the 15-20 comparisons presented in the paper.
Maybe someone else would have other suggestions on the correct way to address this?
For every dataset in the paper, I assess the data distribution by identifying outliers (outliers: > Q3 + 1.5×IQR or < Q1 − 1.5×IQR; extreme outliers: > Q3 + 3×IQR or < Q1 − 3×IQR), testing the normality assumption with the Shapiro-Wilk test, and visually inspecting the distribution using frequency histograms, density plots and Q-Q (quantile-quantile) plots. Homogeneity of variance is tested using Levene's test.
Datasets are usually n=6 and are exploratory gene expression (qPCR) pairwise comparisons or functional in vivo and in vitro (blood pressure, nerve activity, response magnitude compared to baseline data) repeated measures data between 2-4 experimental groups.
This probably does not help you, but I thought I would have a look at the original Student (Gosset) paper of 1908, as the test was specifically designed for (very) small samples:
"If our sample be small, we have two sources of uncertainty: (1) owing to the "error of random sampling" the mean of our series of experiments deviates more or less widely from the mean of the population, and (2) the sample is not sufficiently large to determine what is the law of distribution of individuals. It is usual, however, to assume a normal distribution, because, in a very large number of cases, this gives an approximation so close that a small sample will give no real information as to the manner in which the population deviates from normality: since some law of distribution must be assumed it is better to work with a curve whose area and ordinates are tabled, and whose properties are well known. This assumption is accordingly made in the present paper, so that its conclusions are not strictly applicable to populations known not to be normally distributed; yet it appears probable that the deviation from normality must be very extreme to lead to serious error." My emphasis.
"Section X. Conclusions
1. A curve has been found representing the frequency distribution of standard deviations of samples drawn from a normal population.
2. A curve has been found representing the frequency distribution of the means of such samples, when these values are measured from the mean of the population in terms of the standard deviation of the sample.
3. It has been shown that the curve represents the facts fairly well even when the distribution of the population is not strictly normal." Again my emphasis.
There are several examples with a sample size below 10 in the paper.
When I used to teach this stuff (1st year geography students), I would demonstrate the Fisher Randomization and permutation test for very small numbers as the students could do this by hand and thereby see the underlying logic of the test. I would show that you could permute the data of the two variables under the null hypothesis of no difference and see how extreme a result you could get 'by chance' and then compare the observed value to this; no normality assumptions were needed in coming to some sort of judgement.
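The exact permutation test described above is easy to run in code for very small groups: enumerate every way of splitting the pooled values into the two group sizes and count how often the mean difference is at least as extreme as the observed one. A standard-library sketch with made-up data (n = 4 per group, so 70 possible splits):

```python
from itertools import combinations
from statistics import mean

def exact_permutation_test(a, b):
    """Two-sided exact permutation test on the difference of group means."""
    pooled = a + b
    observed = abs(mean(a) - mean(b))
    count = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in idx]
        # Count splits at least as extreme as the observed one (with float tolerance)
        count += abs(mean(g1) - mean(g2)) >= observed - 1e-12
        total += 1
    return count / total

control = [1.0, 2.0, 3.0, 4.0]       # hypothetical measurements
treated = [10.0, 11.0, 12.0, 13.0]
print(exact_permutation_test(control, treated))  # 2/70, about 0.029
```

No normality assumption is needed; the price is that with very small n the smallest attainable p-value is limited by the number of possible splits.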
Question
I would like to study correlation between four transcripts (fold changes of mRNA expression) at different time intervals (5 time points). How can I perform this analysis?
Try a correlation matrix in R, e.g., with the corrplot() function from the corrplot package.
Question
I am planning a cross-over design study (RCT) on the effect of a certain supplement/medicine on post-exercise muscle pain. There has not been any similar study to date on the effect of this medicine (or similar medicines) on post-exercise muscle pain, although some studies have been conducted on its effect on conditions such as hypertension.
As far as I have searched, formulas for estimating sample size need information (such as standard deviation, mean, effect size, etc.) from similar studies conducted before.
Is there any way to estimate a sample size for my RCT under the aforementioned conditions?
The calculation of the sample size depends on the variance of the outcome. Software such as G*Power can help you calculate the sample size based on the mean difference and the variance between the groups.
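When no prior study of the same outcome exists, the usual workaround is to specify a minimal clinically important difference and borrow (or conservatively guess) a standard deviation, then apply the standard two-group formula that tools like G*Power implement. A minimal sketch (the effect size and SD below are placeholders you would justify from a pilot or clinical judgement):

```python
import math
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Two-sample comparison of means:
    n per group = 2 * ((z_{1-alpha/2} + z_{power}) * sd / delta)^2."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_a + z_b) * sd / delta) ** 2)

# Placeholder values: detect a 5-point difference in a pain score with SD 10
print(n_per_group(delta=5, sd=10))  # 63 per group
```

For a cross-over design the required n is smaller, because each subject serves as their own control and the relevant SD is that of the within-subject differences.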
Question
What is difference between parametric and non-parametric tests?
Yes, I think Tami Barry's answer is good. Thank you very much.
Question
I am currently involved in a study that requires regression analysis to see the effect of each treatment on a few dependent variables. Can you help me: what is the minimum number of factor levels or treatments needed for a regression analysis?
The plastic dosage seems to be a quantitative variable, and I assume you are free to choose the amount. If the functional form of the relationship between plastic dosage (X) and compressive strength (Y) is not known, I would use as large a dynamic range as possible, from tiny amounts to huge amounts of X, and I would choose as many different values of X as practically feasible, more or less logarithmically spaced over the range of X. It may even be worth using only relatively few different values of X (5-10 or so) in a first round, just to get an idea of what range of X might be interesting, and then focus on that range in a second round.
Question
Dear All,
I am struggling with a persistent problem with a csv file while preparing data for a MuSSE model analysis: I have tried a bunch of things to fix it, but with no success; I always get the same error ("All names must be length 1"). I would be very grateful for your help! :)
library(diversitree)
dat <- read.table("MuSSE_hosts.csv", header=TRUE, dec=".", sep=",", row.names=1)
mat <- dat[, 2:ncol(dat)]
lik.0 <- make.musse.multitrait(tree, mat, depth=0)
Error in check.states.musse.multitrait(tree, states, strict = strict, : All names must be length 1
Thank you a lot in advance!
Thank you a lot for your help!
Question
I have an issue analysing qRT-PCR datasets. For my gene of interest, treatment A's mRNA fold-change values over the control group are 10, 40, and 200. For treatment B, the corresponding values are 0.5, 10, and 5. Between the two treatments, A's values are therefore always substantially higher than B's (20-, 4-, and 40-fold differences). However, if I perform a routine statistical test such as a t test, there is no significant difference because of the huge standard deviations.
Can you suggest a way to represent this data and also make proper sense statistically? Thanks in advance.
You need normalization
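Fold-change data are usually analysed on the log scale, where multiplicative variation becomes additive and the huge standard deviations shrink. A minimal sketch using the values from the question (an ordinary pooled-variance t statistic on log2 values; whether it reaches significance with n = 3 per group still depends on the spread):

```python
import math
from statistics import mean, variance

a = [10, 40, 200]   # treatment A fold changes (from the question)
b = [0.5, 10, 5]    # treatment B fold changes

la = [math.log2(x) for x in a]
lb = [math.log2(x) for x in b]

# Pooled-variance two-sample t statistic on the log2 scale
n1, n2 = len(la), len(lb)
sp2 = ((n1 - 1) * variance(la) + (n2 - 1) * variance(lb)) / (n1 + n2 - 2)
t = (mean(la) - mean(lb)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
print(round(t, 2))  # compare with the t critical value at df = 4
```

Plotting the data on a log axis (as is standard for fold changes) also makes the consistent direction of the effect visible despite the wide spread.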
Question
Hello everyone. I would like to ask whether the way the sample size of this study was calculated is valid. It is a study to evaluate the effect of gargling with povidone-iodine among COVID-19 patients. The text says: "For this pilot study, we looked at Eggers et al. (2015), using Betadine gargle on MERS-CoV, which showed a significant reduction of viral titer by a factor of 4.3 log10 TCID50/mL, and we calculated a sample size of 5 per arm or 20 samples in total." Based on this reduction of viral titre in a previous study on MERS-CoV, is it valid to calculate the sample size this way for a new study on COVID-19?
There are many different ways to estimate the sample size, and you can select the one suitable for your research.
Question
Dear all,
I am working on gene expression and Kaplan-Meier curves, dividing the patients into "high" and "low" groups using SPSS. I then want to do a Cox proportional hazards analysis combining, e.g., the mutational status of one gene. I am new to SPSS: how can I set up the analysis to find the hazard ratio of specific combinations, e.g., gene X "high" and gene Y "mut" or "wt"?
Good Morning,
I've never used SPSS and the last time I used SAS for survival analysis was 20 years ago when I was still in University. It is relatively simple to set up a Kaplan Meier analysis in R using the Survival package. Granted, you have to learn a bit of code, but there are lots of resources available online to walk you through it. On the positive side, once you get your code working, you can re-use it, and, while it is more work at the beginning, you are forced to think about and understand your data and each step in the analysis, which leads to a better understanding of the outcomes. A good introduction tutorial on survival analysis using R can be found here: https://www.emilyzabor.com/tutorials/survival_analysis_in_r_tutorial.html
Good Luck!
Question
For example, we are reviewing an article in which the sensitivity of a testing modality is 87%, with 50 patients included. How can we calculate its upper and lower limits at a 95% confidence interval when making a forest plot?
You may try this by using Chart Builder.
NEW FILE.
DATASET CLOSE ALL.
GET FILE "C:\SPSSdata\bankloan.sav".
DATASET NAME raw.
* OMS.
DATASET DECLARE logistic.
OMS
/SELECT TABLES
/IF COMMANDS=['Logistic Regression'] SUBTYPES=['Variables in the Equation']
/DESTINATION FORMAT=SAV NUMBERED=TableNumber_
OUTFILE='logistic' VIEWER=YES
/TAG = 'logistic'.
LOGISTIC REGRESSION VARIABLES default
/METHOD=ENTER age employ address income debtinc
/PRINT=CI(95)
/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
OMSEND TAG = ["logistic"].
DATASET ACTIVATE logistic.
COMPUTE Vfilter = Var2 NE "Constant".
FILTER by Vfilter.
VARIABLE LABELS Var2 "Variable".
* Chart Builder.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Var2 MAXIMUM(Upper)[name="MAXIMUM_Upper"]
MINIMUM(Lower)[name="MINIMUM_Lower"] MEAN(ExpB)[name="MEAN_ExpB"] MISSING=LISTWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Var2=col(source(s), name("Var2"), unit.category())
DATA: MAXIMUM_Upper=col(source(s), name("MAXIMUM_Upper"))
DATA: MINIMUM_Lower=col(source(s), name("MINIMUM_Lower"))
DATA: MEAN_ExpB=col(source(s), name("MEAN_ExpB"))
COORD: rect(dim(1,2), transpose())
GUIDE: axis(dim(1), label("Variable"))
GUIDE: axis(dim(2), label("Odds Ratio & 95% CI"))
SCALE: linear(dim(2), include(0))
ELEMENT: interval(position(region.spread.range(Var2*(MINIMUM_Lower+MAXIMUM_Upper))), shape(shape.ibeam))
ELEMENT: point(position(Var2*MEAN_ExpB), shape(shape.circle))
END GPL.
Good luck
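To add to the above: the 95% limits for a sensitivity of 87% with n = 50 can also be computed directly before charting. A minimal Python sketch (the only inputs are the figures quoted in the question; the Wilson interval is a common alternative with better coverage at small n):

```python
import math

def wald_ci(p, n, z=1.96):
    """Normal-approximation (Wald) 95% CI for a proportion."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

def wilson_ci(p, n, z=1.96):
    """Wilson score interval; better coverage when n is small."""
    centre = (p + z**2 / (2 * n)) / (1 + z**2 / n)
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)
    return centre - half, centre + half

lo, hi = wald_ci(0.87, 50)
print(f"Wald 95% CI: {lo:.3f} to {hi:.3f}")   # about 0.777 to 0.963
```

The Wald limits are what most forest-plot software reports by default; for proportions near 0 or 1 the Wilson limits are safer.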
Question
I need to calculate the prevalence ratios to show the trend of prevalence of a drug group used by pregnant women over the years (2001-2018). I would like to use the year 2001 with age distribution of pregnant women in 2001 as reference for following years.
Please show me how to calculate the standardized Prevalence ratio (95%CI) with SPSS (I am not familiar with other software).
The outcome (dependent) variable Drug group use (y/n).
Other variables: age at delivery of pregnant women (continuous, but can be re-coded into age groups), years of birth (2001-2018), date of the prescription of the drug group.
I tried to use GEE (generalized estimating equation), but I do not know which model to use: poisson, negative binomial or binary logistic?
GEE: because some women delivered several times during 2001-2018.
Hao Tran: with Poisson regression you get predicted lambdas; to convert them to the predicted probability of at least one event, compute 1 - exp(-lambda).
Good luck.
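To illustrate the conversion mentioned above, a one-line Python sketch (the rate of 0.5 is just an example value, not from the data):

```python
import math

def prob_at_least_one(lam):
    """Convert a predicted Poisson rate (lambda) to P(at least one event)."""
    return 1 - math.exp(-lam)

print(round(prob_at_least_one(0.5), 3))  # 0.393
```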
Question
The virus has a reproduction number of 2-4 new infections per infected person. There are many variables to take into consideration when calculating the time until the pandemic is over, so statistical methods could be researched for this question.
The COVID-19 pandemic is caused by SARS-CoV-2, a positive-sense, single-stranded RNA virus. The infection, which has spread worldwide, causes severe acute respiratory syndrome. The virus spreads through close contact and respiratory droplets, and infects human cells by binding the ACE2 receptor.
Any ideas would be appreciated,
Many mathematical models and projections were made regarding the COVID-19 pandemic. They predicted it would last so many weeks or months; some advocated complete lockdowns, and projections were then made accordingly.
But to the best of my knowledge, none of the projections was proven right. Now only time will tell how long its course will run. It will depend on the post-vaccination scenario and the emergence of new strains, probably with antigenic drift only. We will come to know the pathophysiology of this virus more clearly in due course, as more evidence comes in day by day. So I think we still don't know how long the pandemic will last.
Question
I have a dataset of 5 variables of quantitative continuous type: 4 independent and 1 dependent (see attached). I tried using linear multiple regression for this (using the standard lm function in R), but no statistical significance was obtained. Then I decided to try to build a nonlinear model using the nls function, but I have relatively little experience in this. Could you help me, please: how to choose the right "equation" for a nonlinear model? Or maybe I'm doing everything wrong at all? So far I have used the standard linear model in the "non-linear" model.
I would be very grateful for your help.
If you do not have the opportunity to open the code and see the result, I copy it here:
------
library(XLConnect)
INDEP <- NULL
DEP <- NULL
DEP <- as.numeric(db[,1])
for(i in 1:4){
INDEP[[i]] <- as.numeric(db[,i+1])
}
MODEL <- NULL
SUM <- NULL
MODEL <- nls(DEP ~ k0 + INDEP[[1]]*k1 + INDEP[[2]]*k2 + INDEP[[3]]*k3 + INDEP[[4]]*k4, start=list(k0=0,k1=0,k2=0,k3=0,k4=0))
SUM <- summary(MODEL)
-----
The result is:
-----
Formula: DEP ~ k0 + INDEP[] * k1 + INDEP[] * k2 + INDEP[] * k3 +
INDEP[] * k4
Parameters:
Estimate Std. Error t value Pr(>|t|)
k0 6.04275 1.30085 4.645 6.41e-06 ***
k1 0.03117 0.01922 1.622 0.107
k2 -0.02274 0.01663 -1.367 0.173
k3 -0.01224 0.01717 -0.713 0.477
k4 -0.01435 0.01541 -0.931 0.353
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.418 on 186 degrees of freedom
Number of iterations to convergence: 1
Achieved convergence tolerance: 2.898e-08
-----
It sounds like you already tested your hypothesis with the linear model and used up your p value. So, you can't do more p-value tests.
As far as choosing a model, that isn't a question for the stats folks, but is related to the theory you have (stats folks can say how to implement or approximate the model).
Question
Suppose the relative risks (RR) of kidney cancer for normal BMI vs. obese and normal BMI vs. overweight are 1.21 (95% CI 1.21-1.45) and 1.66 (95% CI 1.66-2.06), respectively.
How can we estimate the RR associated with a unit increase in BMI?
You can do it with the glst command in STATA for a single study. For more information: https://www.stata.com/meeting/nordic-and-baltic13/abstracts/materials/se13_orsini.pdf
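If only the category-level RRs are available, a rough back-of-the-envelope alternative to glst assumes a log-linear dose-response and divides ln(RR) by the exposure difference between assumed category midpoints. A Python sketch (the BMI midpoints below are my assumptions, not given in the question):

```python
import math

# Hypothetical BMI category midpoints (assumptions, not from the question)
bmi_normal, bmi_obese = 22.0, 32.0
rr_obese = 1.66          # RR for the obese category, from the question

# Under a log-linear dose-response assumption, ln(RR) is proportional
# to the exposure difference, so the per-unit RR is:
rr_per_unit = math.exp(math.log(rr_obese) / (bmi_obese - bmi_normal))
print(round(rr_per_unit, 3))  # 1.052
```

A proper dose-response meta-analysis (glst, or dosresmeta in R) additionally accounts for the correlation between category-specific estimates, so treat this only as a sanity check.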
Question
Is it possible with SPSS? Do the questionnaires vary across different countries and regions?
Step by Step Explanation.
Question
Can someone explain Cohen's d, in a simple way, please?
Please elaborate on it for medical students in simple words.
Cohen's d isn't a test. It is a measure of effect size.
It allows you to express the difference between two groups in terms of the naturally-occurring variation in the thing you are measuring. The variation is measured by using information from both groups and pooling it as the pooled standard deviation.
The trouble with Cohen's d is that people tend to convert it to tee-shirt sizes – small, medium, large. That seems very vague, somehow, when you have gone to the bother of doing all those calculations. And studies looking at typical values of d in different research areas suggest that it's not appropriate to use the same definitions of small, medium and large for all disciplines. I have a reference somewhere that I can post if I locate it!
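For concreteness, a minimal Python sketch of the calculation described above (the two groups are made-up numbers for illustration):

```python
import math
import statistics

def cohens_d(x, y):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x) +
                  (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled_var)

group_a = [5.1, 4.8, 6.2, 5.5, 5.0]   # hypothetical scores
group_b = [4.2, 4.5, 4.9, 4.1, 4.6]
print(round(cohens_d(group_a, group_b), 2))  # 1.9
```

So d = 1.9 means the group means are 1.9 pooled standard deviations apart.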
Question
I have data sets for S vs. t and X vs. t. The yield coefficient needs to be calculated. What is the procedure to calculate it? Do I take the data for the logarithmic growth phase only?
How can I find out X (microorganism concentration, i.e., biomass concentration) and S (substrate concentration) from blackwater characterisation data? I mean, do COD or VSS represent these?
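Regarding the calculation itself: the yield coefficient is the biomass formed per substrate consumed, Y_x/s = ΔX / (−ΔS), taken over the exponential phase only (in practice, the slope of X against S over that window). A minimal sketch with made-up numbers:

```python
# Hypothetical paired measurements over the exponential growth phase
S = [10.0, 8.0, 6.0, 4.0]   # substrate concentration, g/L
X = [1.0, 1.9, 2.8, 3.7]    # biomass concentration, g/L

# Yield coefficient Y_x/s = biomass formed per substrate consumed
yxs = (X[-1] - X[0]) / (S[0] - S[-1])
print(round(yxs, 2))  # 0.45 g biomass per g substrate
```

With noisy data, regressing X on S over the exponential phase and taking the (negated) slope is more robust than using just the endpoints.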
Question
Can anyone advise me which statistical method is suitable for assessing the association between TB infection and PM2.5?
And if you have the time from PM2.5 exposure until TB onset, then it is appropriate to use Cox regression and represent it through a KM curve as well.
Question
I have SPSS software and I am unable to find out how to produce a hierarchical summary receiver operating characteristic (HSROC) curve, Deeks' funnel plot and a forest plot. Can someone please guide me on how to perform these in SPSS, or suggest an alternative software or a free online web-based solution? Thanks
I would recommend using RevMan, but you could still use SPSS for meta-analysis.
Question
There are two groups in the study, group 1 and group 2. One of the groups received treatment, the other did not. When the mortality of the groups is compared, there seems to be no statistical difference. However, the expected mortality rate (calculated from the PRISM3 score) in the first group (the treatment group) was significantly higher than in the other. I think the treatment was successful in lowering a high expected mortality. However, I could not find how to show this statistically, or how to equalize this baseline imbalance in expected mortality between the groups.
Thanks
.
The adjustment method depends a lot on the data, but you can have a look at the following thread.
If you have good overlap between the score distributions in the two groups (despite different means), you could go for stratification, although it may leave some strata with very small samples.
.
Question
And if a GLM, then which family and link would make sense for interpreting the data well?
These are functional and taxonomic diversity indices of macrobenthic fauna attached in the file. and want to discuss on spatial differences within habitats.
Normal probability plots of the model residuals are useful for determining the pattern of the residuals: a higher concentration of points around the diagonal supports the claim of approximate linear dependence and the assumption of normally distributed residuals.
Question
Background info:
I have calculated the doubling times of wild-type cell lines and gene-knockdown cell lines. Growth curves were measured three times (day 0 to day 6), each time with 2 technical replicates. The technical replicates were plotted over time, and a doubling time was derived via log-linear regression.
I now want to test whether knockdown of this gene affects doubling time. As the variation between the different growth curves (doubling times) is quite large (likely due to random factors such as people opening the incubator more frequently that week and differences in confluency at plating, factors that are the same for both the wild-type and the knockdown cell line), I think I need to use a paired t-test.
However, from what I've seen, a paired t-test does not take into account standard error of those doubling times. So I'm wondering, is this correct? I do not have a background in statistics, but this feels somewhat wrong.
To clarify: for both the wild-type cell line and for the knockdown cell line I have three doubling times. I want to compare these to see if the knockdown has an effect on doubling times. As I derived the doubling times from log linear regression I think it's best to compare the slopes rather than convert those slopes to doubling times and compare those.
I'm not sure where you found that t-tests do not take into account standard errors of means. T-tests are based on the t statistic, which is the quotient of the mean difference and the standard error of that difference.
Do you mean that the t-test only takes into account the variability between growth curves of individual cells, but not the variability within each curve? T-tests assume that each individual data point (in your case, a growth-curve slope) comes from a single measurement. One underlying assumption of t-tests is that measurement error is normally distributed, so the average measurement error will approach (or be) zero as the number of data points increases.
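For concreteness, the paired t statistic on the slopes can be computed directly; a minimal Python sketch with hypothetical slope values (each pair comes from the same experimental run, which is what removes the run-to-run variation):

```python
import math
import statistics

def paired_t(x, y):
    """Paired t statistic: mean within-pair difference over its standard error."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    se = statistics.stdev(d) / math.sqrt(n)
    return statistics.mean(d) / se, n - 1   # t statistic, degrees of freedom

wt = [0.032, 0.029, 0.035]   # hypothetical log-linear slopes, wild type
kd = [0.021, 0.020, 0.026]   # matched knockdown slopes from the same runs
t, df = paired_t(wt, kd)
print(round(t, 2), df)  # 14.5 2
```

Note that only the slope estimates enter the test, not their individual standard errors; a mixed model fitted to the raw counts would use that extra information.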
Question
Hi,
I would like to compute I² for my meta-analyses, but I can't find any software to do it. I know that ESCI (the Excel file from Geoff Cumming, thanks to him!) can do the math, but the version I've used (https://thenewstatistics.com/itns/esci/) only computes the diamond ratio. I know the old version of ESCI computed I² and Q, but I can't find it (and it is not available on the website).
If someone has an idea, that would help!
Thanks :)
Louis
Attached you'll find it. However, I recommend:
1. Watching the Dr. Cumming's video explaining how to analyze heterogeneity in terms of fixed/random effects and diamond ratio (this is extended in his book): https://www.youtube.com/watch?v=_bB2k-1Hv9E
2. Running the "metaforest" or "mc.heterogeneity" function within R. Here you have the recent version of the package 'meta' (https://cran.r-project.org/web/packages/meta/meta.pdf) and a very nice hands-on guide (https://bookdown.org/MathiasHarrer/Doing_Meta_Analysis_in_R/).
Hope this helps.
Question
Situation:
Question: Gene X is suspected to have a general effect on growth (i.e., an effect on the growth of all cell types).
Knockdown cell lines (=strongly reduced expression of the gene) of different cell types were generated. Growth curves were established for the KD and control cell lines by seeding 10 wells on day 1 (stemming from the same cell mixture) and counting 2 wells each day for the following 5 days (day1: seeding; day2 - 6: counting cells). This experiment was performed 5 times.
For each experiment, for each day, the mean of the cell counts was calculated, Ln transformed and plotted against time. A linear fit was used to determine the linear part of the growth curve. To find the linear part I tried to obtain the highest R2 value with a minimum of 3 data points (sometimes 3 points gave the highest R2, sometimes 5 points gave the highest R2). I then used the mean cell count from the first and last point of the linear part to calculate the doubling time (which is often used in literature to represent growth rate of cells).
Formula used:
First image
I’m however not sure if I should calculate the doubling time using this formula (using the cell counts), or whether I should calculate the doubling time from the equation of the linear fit.
To calculate the error of the doubling time the following formula was used:
Second image
At this point I have 5 normalized doubling times and their error (normalized to the control, so the doubling time of the knockdown divided by the doubling time of the corresponding control).
I now want to assess whether there is a significant difference in doubling times (i.e., a significant difference in growth rate) between knockdown and control. Someone in my lab suggested a t-test and I agree, yet I have a few problems with this:
· I don't know how to test normality of my data since I (currently) only have 3 data points (I will have 5 in the end)
· I don't know how to test equality of variance with only 3 (in the end 5) data points.
· I don't know if this is the best method to accurately determine whether there is a significant difference in growth or not. (For example, maybe there’s a way to immediately compare the growth curves, I suspect that such a test would be more accurate, but also more complicated to such a degree that I myself might not be able to apply it)
I work in a lab where no one really has any expertise in this (or statistics in general), so I have to figure it out on my own. I have a lot of doubts on whether or not what I’m doing is correct or not, not helped by the fact that my understanding of statistics is very basic.
If you see any other mistakes, please do tell.
You can use a Mann-Whitney U test for non-parametric data.
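On the doubling-time calculation itself: both routes mentioned in the question are equivalent, since Td = ln(2)/slope when the slope comes from regressing ln(count) on time, and Td = (t2 − t1)·ln(2)/ln(N2/N1) when using two counts on the linear part. A minimal Python sketch with made-up counts:

```python
import math

def doubling_time(n1, n2, t1, t2):
    """Doubling time from two counts on the exponential part of the curve."""
    return (t2 - t1) * math.log(2) / math.log(n2 / n1)

# Hypothetical counts: 1e5 cells at day 1 grow to 8e5 cells by day 4
print(round(doubling_time(1e5, 8e5, 1, 4), 3))  # 1.0 (three doublings in three days)
```

Using the fitted slope rather than the two endpoint counts is generally preferable, because it uses all the points in the linear window.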
Question
Could someone please let me know what syntax should be used for calculating Hardy-Weinberg equilibrium in case-control studies?
Dear Rubina,
I am working on analyzing polymorphisms of the CYP3A5 gene. The percentage of the AA (CYP3A5 *1/*1) genotype in the population (50 patients) is 72%, of the Ab (CYP3A5 *1/*3) genotype 28%, and of the bb (CYP3A5 *3/*3) genotype 0%.
I am trying to test Hardy-Weinberg equilibrium in STATA but I don't know how I have to enter the data. Do I have to generate a variable named, for example, "Genotype" and enter the genotyping results as categories, i.e., "AA", "Ab" and "bb" (patient 1: AA, patient 2: Ab, and so on)? And what is the second variable against which I have to compare in order to obtain a p-value? I am very confused.
Please If anyone could help me, I would really appreciate it! Thanks!
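The HWE check itself is a small chi-square computation, which may clarify what the software is doing under the hood; a Python sketch using the genotype counts from the question (36 AA, 14 Ab, 0 bb out of 50):

```python
# Observed genotype counts from the question: 36 AA, 14 Ab, 0 bb (n = 50)
obs = {"AA": 36, "Ab": 14, "bb": 0}
n = sum(obs.values())

# Allele frequencies
p = (2 * obs["AA"] + obs["Ab"]) / (2 * n)   # frequency of allele A (0.86)
q = 1 - p

# Expected counts under Hardy-Weinberg equilibrium
expected = {"AA": n * p**2, "Ab": 2 * n * p * q, "bb": n * q**2}

chi2 = sum((obs[g] - expected[g]) ** 2 / expected[g] for g in obs)
print(round(chi2, 2))  # 1.33 -- below the 3.84 critical value (chi-square, 1 df)
```

So this sample is consistent with HWE; note that with an expected bb count below 5, an exact test (as implemented in most genetics software) is more appropriate than the chi-square.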
Question
I have only the total number of cases for each disease (measured in millions) and I'm trying to find a significant relationship for their coexistence throughout the 5 regions...
Yes, do use the chi-square test. Attached is an R script for this, with post-hoc tests.
Question
Hello everybody, I am new to meta-analysis of genome-wide data, so I have a doubt. I have read the METAL documentation (by far the most used meta-analysis software for both EWAS and GWAS microarray data), but I cannot figure out what the input for an EWAS analysis should be. As METAL was originally designed for GWAS, one of the required inputs is the reference and non-reference allele. Since EWAS arrays do not rely on allele frequencies but on a quantitative measure, I would like to know what the METAL input should be in this case. Thank you so much in advance for answering this question (which may be easy, but I certainly do not know).
Finally I got the answer, and it is: just do not provide the allele frequencies; the analysis will run fine. To confirm this, here is a GitHub manual on how to perform an EWAS meta-analysis (https://github.com/ammegandchips/meta_EWAS/blob/master/metal.md). As you can see, the parameters regarding frequencies and genomic control are off.
Question
For example, I have 4 cell treatments and technical repeat wells of each with confluence data every hour for 48 hours.
Would a two-way ANOVA be a good way to observe the differences between the cell treatments over time? I want to see whether the different drug concentrations affect cell proliferation over time.
I agree with Jochen. You can do a linear mixed model with treatment as a fixed factor and experiment as a random factor. This is easily done in R with lmer (package lme4):
model <- lmer(confluence ~ Treatment + (1|experiment), data=data)
If treatment is significant, you can do follow up pairwise comparisons to ascertain which treatment is having the strongest effect on your cell confluence.
Good luck!
Question
Do you know of any R package that can deal with an unbalanced data set?
I have agronomic traits from 7 years with different numbers of locations, replications, and genotypes.
Even though this is an older question, I would like to add that the aov function is intended for balanced designs (the results can be hard to interpret without balance).
Question
I am analyzing a dataset. There I have 4 variables that are used to diagnose a disease. Among them, 3 were "Lab test report findings" e.g. Test A, Test B, Test C and 1 "clinical findings" i.e. "Test D (which is obtained by the clinical examination of the patient and is not established for the confirmatory diagnosis of the disease).
To confirm the diagnosis of the disease e.g. "Dengue", Each of the 3 lab tests i.e. A, B, C can independently be used for the confirmatory diagnose of Dengue. In my research, patients had done at least one of the 3 tests to confirm the disease. Some might have done all the 3 tests.
Also, among the patients, a great proportion had shown the positive result of the "Test D".
I want to establish that, the "Test D" could be one of the confirmatory tests along with the other 3 tests i.e. A, B, C. On top of that, "Test D" could be more accurate and reliable to confirm the Dengue compared to other lab tests i.e. A, B, C.
So, what statistical test should I use to prove and compare the effectiveness of these clinical examination findings? Also, please suggest some graphs that can visualize this case.
N.B. All 4 tests had a dichotomous answer. The findings of these tests can either be positive or negative.
I think you can show an association of the clinical variable with your disease state using linear/logistic regression and calculate the odds ratio (OR). In addition, you may do the same for the other tests and compare their ORs. There is no need to test sensitivity, precision, and accuracy as for a diagnostic tool (i.e., ROC-AUC). It would be supportive for follow-up experiments.
Question
According to my sample size estimation, the sample size comes to 14. Would results from a trial with such a small number of subjects be valid? On the other hand, I cannot increase the sample size beyond the calculated one except to adjust for loss to follow-up. Please advise me.
Hi Sir... As discussed above, you can always revisit the effect size (Cohen's d) used for the sample size calculation. Taking a very large effect size may yield a very small sample size. Using G*Power software will help you calculate a precise effect size and sample size for your study.
Thanks
Question
I am new to stats at this level in ecology. I am trying to compare DNA and RNA libraries with thousands of OTUs. I summarized taxa to get the most abundant species, but I can obtain only relative abundances. I was thinking of using SIMPER, as I read in several comments, to test which species differ the most per station between the DNA- and RNA-based libraries. However, I read that SIMPER is only a more or less robust test. I was wondering whether manyglm would also be an alternative for my question, or whether you would suggest another approach. Thank you for your help!
Question
I'm involved in a meta-analysis in which some trial outcomes are reported as mean and standard deviation and some as median and inter-quartile range. As software functions require the group n, mean and SD, I looked around and found the following paper: http://www.biomedcentral.com/1471-2288/5/13. However, this simulation study shows that it is possible to estimate the mean and SD given the median and range (min and max values), not from the median and IQR. We checked each paper again for min and max values, but disappointingly none reported them. Therefore, I would very much appreciate a tip to help me work around this issue.
SOLUTION: Use Wan's method (2014), cited ~820 times and superior to Hozo's method. Attached are the PDF and an Excel file.
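For reference, Wan's median/IQR estimators are simple enough to compute by hand; a minimal Python sketch (the trial-arm numbers are made up for illustration):

```python
from statistics import NormalDist

def wan_mean_sd(q1, median, q3, n):
    """Wan et al. (2014) estimators of mean and SD from median, IQR and n."""
    mean = (q1 + median + q3) / 3
    # Quantile of the standard normal used to rescale the IQR
    z = NormalDist().inv_cdf((0.75 * n - 0.125) / (n + 0.25))
    sd = (q3 - q1) / (2 * z)
    return mean, sd

# Hypothetical trial arm: median 14, IQR 10-20, n = 40
m, s = wan_mean_sd(q1=10, median=14, q3=20, n=40)
print(round(m, 2), round(s, 2))  # 14.67 7.69
```

These estimators assume the underlying outcome is roughly normal, which is worth checking against the skewness suggested by the reported quartiles.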
Question
I have treated THP-1 and AGS cells for 12, 24 and 48 hours with bacterial toxin concentrations of 0, 5, 10, 20, 40, and 80 ug/ml. Now I want to support my results with statistical methods, but I'm confused about which one to use. Should it be ANOVA with post-hoc tests or simply a t-test? Should it be one-tailed or two-tailed, paired or unpaired?
As I understand it, the cells were randomly distributed over the 6 different concentrations, but each cell was treated for 12, 24 and 48 hours. If this is correct, then you have a Between (6 levels of Concentration) x Within (3 levels of Time (hours)) design. First run a factorial repeated-measures ANOVA. If there are significant effects (of Concentration, Time and/or the interaction), then you can run post-hoc tests. Before running these tests, check the distributions of the measures on the treated cells with histograms. If they obviously do not look normal, then run a non-parametric test such as f1.LD.f1 from the nparLD R package (https://www.jstatsoft.org/article/view/v050i12).
I suppose you normalize the images of the cells as in this study: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1224512
Question
Does anybody know an estimation method for calculating the prevalence of a given risk factor in the general population, given that the odds ratio/relative risk, the prevalence of the risk factor among the diseased, and the prevalence of the disease are available?
In the two-way table:

        D+   D-
E+      a    b
E-      c    d

disease odds ratio = (a/b)/(c/d) = ad/bc
exposure odds ratio = (a/c)/(b/d) = ad/bc
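Building on those odds-ratio identities, if the OR, the exposure prevalence among cases, and the disease prevalence are known, P(exposure) in the whole population follows from the law of total probability: derive the exposure odds among non-cases by dividing the case odds by the OR, then average the two conditional prevalences weighted by disease status. A Python sketch with made-up inputs:

```python
def exposure_prevalence(odds_ratio, p_exp_cases, p_disease):
    """Back-calculate P(exposure) in the whole population from the OR,
    the exposure prevalence among cases, and the disease prevalence."""
    odds_cases = p_exp_cases / (1 - p_exp_cases)
    odds_controls = odds_cases / odds_ratio     # OR = odds(cases) / odds(controls)
    p_exp_controls = odds_controls / (1 + odds_controls)
    # Law of total probability over disease status
    return p_exp_cases * p_disease + p_exp_controls * (1 - p_disease)

# Hypothetical numbers: OR = 3, 60% of cases exposed, disease prevalence 10%
print(round(exposure_prevalence(3.0, 0.60, 0.10), 2))  # 0.36
```

If an RR rather than an OR is reported, this only holds approximately, and only when the disease is rare.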
Question
I want to investigate whether there is a significant upregulation of certain genes between cells from WT and KO mice without stimulation (negative control) and after stimulation with substances A, B and C.
The sample size consists of n=3 KO and n=3 WT mice. Primary cell lines are cultivated from each mouse. Cells from each mouse are used in all four conditions stated above (repeated measures).
Because of the small sample size, I presumed a non-parametric test was in order (correct me if I am wrong).
If I would only need to compare the differences in gene expression between control, A, B and C in one type of mouse, I would use something like the Friedman test (the non-parametric alternative to the Repeated-measures One-way ANOVA). However, I am interested in the difference in expression between KO and WT mice. Should I use something like a non-parametric equivalent of a repeated-measures two-way ANOVA? And if so, which test should I use?
Nice link, Bruce. Bland and Altman have a great series about statistics in bmj. It's certainly worth reading their other articles, too.
Today I would add that we have statistical modelling techniques, available to everyone, that allow using proper distributional assumptions. Transforming the response should be avoided. I had some references showing that but I don't recall where I put them :( Here is one paper I found in haste:
Unfortunately, the authors also fall back on the ancient suggestion to use "non-parametric" approaches.
There are some papers I found saying that log-transformations of count data is inappropriate (e.g. )
The example given in the BMJ article should, as I see it, be analyzed using a gamma-GLM with log-link. The response should be the latencies (Tab. 1) or the concentrations (Tab. 2), and the baseline values should be used as a covariate.
In the Tab. 1 data set it is quite obvious that the 11th record is "outlying". The interesting question is: why? Is that value trustworthy? If yes, we should take it seriously, and in this case we must admit that the variability in the data is so large that we should not confidently conclude that the treatment reduces the AB concentration (p=0.3). If the value is in fact problematic and one can show that there was a mistake (e.g. the after-value was 1.22 instead of 12.2), the data very clearly support a decrease in the concentration (p<0.000001 when removing #11 or when changing it to 1.22).
A sign test (giving p = 0.004) is obscuring that there might be a problem with the data. In this case it is ok since changing the (possibly erroneous) value to 0 would still give p = 0.02 (using values > 12.2 would not change the result anyway). However, there are many zeros and ties, so "exact" p-values can not be calculated (I don't have any clue how big the error may be).
Question
As an example, gene expression levels for 50 genes are measured by qPCR for 5 different conditions. Likewise, gene expression levels for the same 50 genes were measured by a different technique for the same 5 conditions. Which method(s) can be applied to compare these two techniques?
Classify the data in a two-way table i.e. a table that is composed of the two different techniques as two rows and the five different conditions as five columns.
Then apply Analysis of Variance technique. You will be able to compare the difference between two techniques as well as the difference among the five conditions.
Ignoring the columns and considering the observations in the two rows, you can apply a t-test to compare the two techniques. However, with this approach you will not be able to compare the five conditions.
Question
I have read multiple articles that have used machine learning algorithms (convolutional neural network, random forest, support vector regression, and gaussian process regression) on cross-sectional MRI data. I am wondering whether it is possible to apply these same methods to longitudinal or clustered data with repeated measures? If so, is there an algorithm that might be better to use?
I would be interested in seeing how adding longitudinal data could improve the performance of these types of machine learning models. So far, I am only aware of using mixed effect-models or generalized estimating equation on longitudinal data, but I am reading books and papers to learn more. Any advice or resources would be greatly appreciated.
Hello Robert, there are extensions of recursive partitioning and trees for longitudinal and clustered data. They essentially include a mixed model element into the algorithm. I have used the RE-EM algorithm in the past (see DOI: 10.1007/s10994-011-5258-3 and DOI: 10.1016/j.csda.2015.02.004). There are also binary partitioning for continuous longitudinal data (DOI: 10.1002/sim.1266) and mixed-effect random forest (DOI: 10.1080/00949655.2012.741599). Implementations can be found in R packages: REEMtree, longRPart2, MixRF.
Question
My experiment is isolating primary cells from biopsy of different patients.
I then culture the primary cells under two conditions: one is the negative control, and the experimental set-up is cultured with added hormone. I cultured them in parallel (both freshly cultured immediately after isolation from the biopsy).
Then I got the assay results under the two culture conditions. I considered my results paired for each patient, so I performed a paired t-test.
But when I presented it to my colleagues, one said a paired-samples t-test may not be appropriate here because there are strict rules about using paired-samples t-tests with samples from cell culture. I searched briefly for a while but could not find any.
Can anyone tell me whether there is such regulation?
I also performed an independent-samples t-test on my results and got a much larger p-value. I think the individual differences between patients generated the larger variance in that analysis.
Could anyone tell me the best test for my data?
If you harvest during exponential growth, that may clearly play a part in your interpretation of the experimental results. However, that was not the issue. All cells were of the same cell type, to within a small number of spontaneous mutations perhaps. The only difference between the plates is that one contains hormone while the other does not, and the inoculum for each paired plate comes from the same individual.

I assume growth curves were obtained for cells with and without hormone, so you know whether the hormone has an effect on cell growth and, if so, what that effect is; if such an effect exists you can consider it along with the rest of the experimental results. Unless some strange effect happens that requires explanation, this is statistically a paired t-test. Any effect on growth is known, so any necessary correction can be made, such as was suggested by Dr. Ebert. Again, when all necessary corrections, transformations and what have you are finished, the comparison is the paired t-test.

This is exactly how we did chemical mutagenesis studies in Herman E. Brockman's lab many years ago. See the link for publications from some in the group, with Materials and Methods sections. I apologize for my poor description, but this was 50 years ago for me. Mea culpa.
Again, I see no reason why a poorly remembered something or other affects either the statistics or the biology. As always the assumptions of the test should be checked. Best, David Booth
Question
Very frequently in papers on parallel clinical trials, I encounter a situation where the calculated SD of the effect in each group is approximately equal to the SD of the effect difference between the two groups.
An example may be found, e.g. in the following paper (Table 2):
Pelubiprofen achieved an efficacy on the VAS scale of 26.2 with SD = 19.5. Celecoxib achieved an efficacy of 21.2 with SD = 20.8. However, the difference is 5.0 with SD = 20.1! I was expecting an SD about sqrt(2) times larger, since the samples are independent and have approximately equal sizes.
Amr Muhammed, yes, to be more precise the weights will not be exactly 1/2 but will depend on the group sizes. But in the mentioned table (and many other similar papers) this SD is attributed to the difference: difference mean = 5.0, difference SD = 20.1... Probably it is some kind of conventional notation for a pooled-sample SD, but it seems too confusing.
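A quick numeric check (using the SDs quoted from the paper's Table 2) supports the suspicion that the reported 20.1 is a pooled per-group SD rather than the SD of the difference:

```python
import math

sd1, sd2 = 19.5, 20.8   # per-group SDs quoted from the paper's Table 2

# SD of the difference between two independent observations
sd_diff = math.sqrt(sd1**2 + sd2**2)
print(round(sd_diff, 1))    # 28.5, i.e. roughly sqrt(2) times a per-group SD

# The pooled SD (equal weights), by contrast, stays near the per-group values
sd_pooled = math.sqrt((sd1**2 + sd2**2) / 2)
print(round(sd_pooled, 1))  # 20.2 -- close to the reported 20.1
```

So the reported 20.1 matches the pooled SD, not the SD of a difference; the SE of the mean difference would additionally divide by the group sizes.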
Question
Hi! I would like to compare cell proliferation rates.
The working hypothesis is that the proliferative effect of extracellular vesicles on cells cultured on the skin implant is increased compared to samples with pure cell culture.
There will be 4 samples:
1) cells (control)
2) cells + skin implant
3) cells + skin implant + extracellular vesicles
4) cells + extracellular vesicles
Cells will be from 3 donors, and experiments will be carried out 3 times with each donor culture.
Can someone help? Could you advice what method is the best?
In R the model could be formulated like
model = glmer(rates ~ implants*vesicles + (1|donor), family=Gamma)
where rates are the rate constants determined, implants and vesicles are binary factors (yes/no), and donor is a factor identifying the subject (used as a random-intercept factor in the model).
You are interested in the interaction of implants and vesicles. The difference-in-difference statistic can be obtained with
summary(model)
and you can get a p-value with
anova(model, test = "Chisq") or drop1(model, test = "Chisq")
Question
Suppose I am doing a case-control study. Let's say Group 1 is a clinical population (N=30) and Group 2 is a healthy control population (N=30). I have measured various variables (continuous data) in both groups and, using t-tests, I have found the differences between the two groups. Now, suppose I want to find the relationship between two variables: can both groups be clubbed together (N=60), or do I do a separate correlation analysis for each group?
For Example: If "satisfaction with life" and "quality of life" are research variables in two groups, specifically Patients with anxiety vs Healthy control. I can get continuous data for both these variables using a questionnaire, and I can do a t-test and establish if there is a difference in satisfaction with life and quality of life between these two groups. Now, if I want to know the association between satisfaction with life and quality of life, can I club both patient group and healthy control group together? If yes, is it applicable always, or, are there some conditions? Please explain as my research question is different, and I have just given an example here.
No need to club the data; find the correlation between the variables, not the groups. For example, take the variables height and weight in group one, and similarly in group two, then apply the correlation analysis between height and weight within each group to check how strongly they are correlated with each other.
If the data are normal, use Pearson's correlation; otherwise use Spearman's. After obtaining the correlation value (r or rho), simply check in which group the variables are more strongly correlated (the control or case group).
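The per-group procedure described above can be sketched as follows (the data are simulated placeholders, and Shapiro-Wilk stands in for whatever normality check you prefer):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, shapiro

rng = np.random.default_rng(3)
# hypothetical scores per group: satisfaction (x) and quality of life (y)
groups = {
    "patients": (rng.normal(20, 5, 30), rng.normal(50, 10, 30)),
    "controls": (rng.normal(28, 5, 30), rng.normal(65, 10, 30)),
}

results = {}
for name, (x, y) in groups.items():
    # normality check decides between Pearson and Spearman
    is_normal = shapiro(x).pvalue > 0.05 and shapiro(y).pvalue > 0.05
    r, p = pearsonr(x, y) if is_normal else spearmanr(x, y)
    results[name] = (r, p)
```

Comparing the two r values (rather than pooling the 60 cases) avoids a spurious correlation driven purely by the mean difference between patients and controls.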
Question
I would like to calculate under-5 mortality from survey data, and it is difficult to find a coherent resource that gives a step-by-step guide to calculating the under-5 mortality rate using a Cox regression model in SPSS. Anyone with resource recommendations, or perhaps ready to work through this project with me? Cheers!
Dear Emmanuel Nene Odjidja,
If you are familiar with R, then you can try the R survival package for the Cox regression model. Please follow the link: https://rviews.rstudio.com/2017/09/25/survival-analysis-with-r/. I hope the above link is helpful for you.
Question
I constructed a contingency table with three categorical variables (species, elevation, and year) and performed a chi-square analysis to test for independence between them. My initial goal was to determine the probability that observed differences in species elevational ranges between two surveys (years) were due to chance.
The test revealed a significant relationship between the variables. This, however, presents me with a new question: how do I go about determining precisely where this relationship exists (e.g. species & elevation vs. species & year vs. elevation & year, etc.)?
Any advice or suggestions on how I can figure this out? Thanks!
Hey, Evan
Absolutely, low < med < high ---> ordered scale (Likert-type)
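One common follow-up for locating the source of a significant overall chi-square is to test each two-way sub-table and inspect its standardized (Pearson) residuals; cells with |residual| greater than about 2 are the ones driving the association. A sketch with hypothetical counts for one species-by-elevation sub-table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# hypothetical 3x3 sub-table: species (rows) x elevation band (columns)
obs = np.array([[30, 12, 5],
                [10, 25, 8],
                [4, 9, 22]])

chi2, p, dof, expected = chi2_contingency(obs)

# standardized (Pearson) residuals: (observed - expected) / sqrt(expected)
resid = (obs - expected) / np.sqrt(expected)
```

Repeating this for the species & year and elevation & year sub-tables (with a multiple-testing correction across the sub-tests) shows which pairings carry the dependence.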
Question
I am working with birth defects data with case - control ratio 1:4. However, I would like to choose controls that are a better match with my cases and reduce the ratio to 1:1 or 1:2. I am planning to use propensity score approach to choose my controls from this database. Is this the best method to use?
Thanks for the information. I have used the psmatch package in Stata to perform propensity score matching, but I don't know how I can get 5 controls for 1 case. Can anyone help me with the Stata syntax?
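In Stata, the neighbor() option of psmatch2 is, I believe, the relevant switch for 1:k matching (check help psmatch2 to confirm). Conceptually, 1:5 nearest-neighbour matching on the propensity score without replacement can be sketched like this (all data and names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
ps_cases = rng.uniform(0.3, 0.9, size=10)      # fitted propensity scores, cases
ps_controls = rng.uniform(0.1, 0.8, size=200)  # fitted propensity scores, controls

k = 5
matched = {}
available = set(range(ps_controls.size))
for i, p in enumerate(ps_cases):
    # take the k closest unused controls by propensity-score distance
    pool = np.array(sorted(available))
    nearest = pool[np.argsort(np.abs(ps_controls[pool] - p))[:k]]
    matched[i] = nearest.tolist()
    available -= set(matched[i])
```

Matching without replacement, as here, requires a control pool large enough to supply k distinct controls per case.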
Question
I am using the complex sampling analysis method within SPSS. I would like to use the cox regression for my variable under complex sample, as my variable has a prevalence rate of greater than 10%, thus logistic regression should not be used. When using cox regression under the complex sampling analysis - is robust variance already controlled for?
In fact, since 2015 things have changed regarding the handling of heteroskedasticity, and it has now become almost mandatory (which, I think, is why the peer reviewer asked you about it). Happily, the latest versions of SPSS integrate it into Cox regression through sandwich estimators and, more importantly, HC estimators in general linear models.
Hope it helps,
Kind regards,
Question
Sometimes we want to conduct a reliability study of some diagnostic modality for a specific disease, but the gold standard for the diagnosis of that disease is either an invasive procedure or surgery, which is not justified to perform on normal individuals (the control group). In such a case, is it justified to take the control group as negative on the gold standard?
For example:
We want to diagnose infantile hypertrophic pyloric stenosis (IHPS) with the help of ultrasound, but the gold standard for its diagnosis is surgery. Suppose we perform ultrasound on 50 infants with projectile vomiting: the sonographic findings of 40 of them are suggestive of IHPS and 10 are normal, but after surgery (the gold standard) 38 were confirmed as IHPS and 2 were false positives. Now we want to perform ultrasound on 50 normal infants (controls). Is it justified to classify all 50 normal infants as true negatives of the gold standard, with 0 false positives, in order to perform the chi-square statistics?
Hello,
If I understood your question correctly, I think you can use Bayes' theorem here, based on the clinical literature if you have any about this specific test. We can consider the ultrasound results as a priori information about the clinical test, and the gold-standard one as the real data (x).
I can reformulate your problem as :
If your test results come back positive when using the ultrasound method, what are the chances that the children actually have the disease (per the gold-standard method)?
CHELLAI. F
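That reformulated question is the positive predictive value, which Bayes' theorem gives directly. A minimal sketch (the sensitivity, specificity, and prevalence values are hypothetical placeholders, not taken from the question above):

```python
# Bayes' theorem for the positive predictive value of a screening test
sens = 0.95    # P(ultrasound positive | IHPS)       - hypothetical
spec = 0.90    # P(ultrasound negative | no IHPS)    - hypothetical
prev = 0.003   # P(IHPS) in the screened population  - hypothetical

ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
```

Note how strongly the PPV depends on prevalence: with a rare disease, even a fairly specific test yields mostly false positives, which is why defining the control group carefully matters here.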
Question
The error message is as follows:
Warnings Box's Test of Equality of Covariance Matrices is not computed because there are fewer than two nonsingular cell covariance matrices.
However, the results are computed anyway.
I want to know what this warning actually means.
Does it affect the overall results?
How do I fix this error?
It should be noted that the sample size in my groups is different.
Baset
Hello Abdulbaset,
In order for SPSS (which is what I presume you are using for analysis) to compute and report a Box M test (multivariate homogeneity of covariance matrices test, a test that is not needed if you are strictly sticking to univariate repeated measures designs), you need:
1. Two or more independent groups (e.g., at least one between-subjects factor).
2. Scores on two or more measures per case.
3. SOME variance for each measure within each group.
4. The scores must not be linearly dependent (e.g., score2 can't be a simple function of score1).
Conditions #1, #2, and #3 don't impact whether a MANOVA can be executed or not; hence, you can get results (regardless of whether they are meaningful!). If condition #4 is violated, SPSS will omit at least one variable from the MANOVA.
Question
Biostatisticians calculate sample size with the help of a formula, but unfortunately there is no time duration mentioned in it. How can it be justified to calculate the same sample size for a Master's student with 9 months of research and a PhD student with 18 months of research, using the same formula?
Thanks for the fruitful information
Question
I would like to know why is linearity important in a sandwich assay curve for protein sample detection.
Hello, linearity is important to obtain a more reliable answer.
Within the linear range of a sandwich assay curve, protein concentration can be estimated by simple linear regression; outside that range, a non-linear model is needed.
An example of calculation with a non-linear curve is linked below and attached.
Good luck!
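For reference, when a sandwich assay curve leaves its linear range, a four-parameter logistic (4PL) fit is a common choice. A minimal sketch with synthetic, hypothetical calibration data (the parameter names and values are illustrative, not from any real assay):

```python
import numpy as np
from scipy.optimize import curve_fit

def four_pl(x, a, b, c, d):
    # a: response at zero concentration, d: response at saturation,
    # c: inflection point (EC50-like), b: slope factor
    return d + (a - d) / (1.0 + (x / c) ** b)

# synthetic calibration points (concentration vs. optical density)
conc = np.array([0.1, 0.3, 1.0, 3.0, 10.0, 30.0, 100.0, 300.0])
rng = np.random.default_rng(0)
od = four_pl(conc, 0.05, 1.2, 20.0, 2.5) + rng.normal(0, 0.01, conc.size)

# fit; p0 is a rough starting guess for the four parameters
popt, _ = curve_fit(four_pl, conc, od, p0=[0.05, 1.0, 10.0, 2.0])
```

Unknown samples are then read off the fitted curve by inverting four_pl, which is valid across the whole dynamic range rather than only the linear segment.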
Question
Suppose I have a Questionnaire for a Stress assessment that contains 30 questions, each question has 5 answers (0- no stress, 1-mild stress, 2- moderate , 3-High stress, 4- Severe stress). The Total score of the 30 question varies from 0 - 120.
How can we categorize the total score (range 0-120) into mild, moderate, and severe? Which cutoffs should I take for mild, moderate, and severe?
Hello there,
Step #1:
Compute the overall mean score for the items of that variable of interest
Step #2:
Make cutoff points using a calculator ((Maximum − Minimum) / n). How? In the case of a five-point Likert scale, you may, for example, have assigned the value (1) for Strongly Disagree, (2) Slightly Disagree, (3) Neutral, (4) Slightly Agree, (5) Strongly Agree. So in order to make cutoff points, you do this simple math, as per the formula stated above: (5 − 1) / 3. Upon calculation you will get the value 1.33, right! This is the interval value.
* "Highest" refers to the highest score of the given Likert scale (5, in our example)
* "Lowest" refers to the lowest (1)
* "n" refers to the number of CATEGORIES you intend to create
Step #3:
Do the math for the three categories (Low, Mid & High). How? Just add the interval value successively to build the three category ranges, as in the following:
Low (1 - 2.339),
Mid (2.34 - 3.669)
High (3.67 - 5)
Step #4:
If you are using SPSS, enter these category-related values by navigating through "Transform", and "Recode into Different Variable". Here you will need to recode the Mean score we've obtained in Step# 1. After you give it a new Name and Label, click on "Old and New Values". A new window will pop up. Select "Range", and then insert the values of the categories given above. You should give a value for each range (i.e. 1 for Range no 1), in the New Value empty field. Then click "Add". Follow the same routine for the other range values of your categories, 2 & 3. Click "Continue", and then "Ok".
Step #5:
Go to the Data Editor page and scroll down to find your newly created variable. You'll need to click on Value to make labels. Please add 1 in the Value field and the word "Low" in the Label field, and likewise 2 for (Mid) and 3 for (High). Click "Add" after each entry, and then "Ok".
That's all.
Cheers,
Ahmed
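The cutoff-and-recode steps above can also be sketched outside SPSS; here is an illustrative Python version (the example mean scores are made up):

```python
import numpy as np

scale_max, scale_min, n_cat = 5, 1, 3
interval = (scale_max - scale_min) / n_cat          # (5 - 1) / 3 = 1.33

# upper edges of Low and Mid; anything above the second edge is High
edges = [scale_min + interval, scale_min + 2 * interval]

# hypothetical per-respondent mean scores (the Step #1 output)
mean_scores = np.array([1.8, 2.5, 4.1, 3.2, 4.9])
labels = np.digitize(mean_scores, edges) + 1        # 1 = Low, 2 = Mid, 3 = High
```

For the 0-120 total score in the original question, the same formula with scale_min = 0, scale_max = 120, and n_cat = 3 gives equal-width bands of 40 points each.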
Question
Dear All,
I have 500 miRNAs with expressed read counts in 20 conditions. In the same 20 conditions, I have measurements of lymphocyte counts.
I would like to see how the miRNA counts are correlated with the lymphocyte counts.
These two variables are not equal in number; how do I go about that?
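One standard way to handle the unequal dimensions (a sketch with simulated data, not from this thread): treat each miRNA as a vector of 20 values and correlate it against the single length-20 lymphocyte vector, giving 500 correlations, then adjust the 500 p-values for multiple testing, e.g. with Benjamini-Hochberg:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
mirna = rng.poisson(50, size=(500, 20))   # hypothetical miRNA read counts
lymph = rng.normal(1.5, 0.3, size=20)     # hypothetical lymphocyte counts

# one correlation per miRNA: its 20 counts against the 20 lymphocyte values
rho = np.empty(mirna.shape[0])
pvals = np.empty(mirna.shape[0])
for i in range(mirna.shape[0]):
    rho[i], pvals[i] = spearmanr(mirna[i], lymph)

# Benjamini-Hochberg adjustment for the 500 simultaneous tests
order = np.argsort(pvals)
bh = pvals[order] * pvals.size / (np.arange(pvals.size) + 1)
qvals = np.minimum.accumulate(bh[::-1])[::-1][np.argsort(order)]
```

Spearman is used here because read counts are typically non-normal; with real sequencing data you would also normalize the counts (e.g. to library size) before correlating.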
Question
I am trying to compute the "E-Value", introduced by VanderWeele & Peng Ding (2017, Ann Intern Med).
For that, I need the Risk Ratio (also known as Relative Risk).
As I understand it, the Risk Ratio is only for dichotomous variables.
So, is it possible to compute a Risk Ratio from a correlation between two continuous variables?
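For reference, the E-value itself is RR + sqrt(RR × (RR − 1)). To get an approximate RR from a correlation, one route (as I recall from VanderWeele & Ding's paper; please verify against the original) is to convert r to Cohen's d and then use the approximation RR ≈ exp(0.91 d). A sketch:

```python
import math

def evalue(rr):
    # E-value for an observed risk ratio (VanderWeele & Ding, 2017)
    rr = max(rr, 1.0 / rr)           # use RR or its inverse, whichever >= 1
    return rr + math.sqrt(rr * (rr - 1.0))

def rr_from_r(r):
    # approximate conversion: correlation -> Cohen's d -> risk ratio
    d = 2.0 * r / math.sqrt(1.0 - r ** 2)
    return math.exp(0.91 * d)
```

For example, r = 0.3 gives an approximate RR of about 1.77 and an E-value of roughly 2.9; the approximation is rough, so report it as such.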