Applied Biostatistics - Science topic

Explore the latest questions and answers in Applied Biostatistics, and find Applied Biostatistics experts.
Questions related to Applied Biostatistics
  • asked a question related to Applied Biostatistics
Question
16 answers
This question just occurred to me.
Before statistically testing for a significant difference, shouldn't we first check whether the data fit a normal distribution? Are 3 replicates enough to test the hypothesis of normality? (In our undergraduate statistics course we used the Shapiro-Wilk test, which needs at least 8 samples.) Or, following the central limit theorem, do we not need to test whether the data fit a normal distribution at all?
Thanks!
Relevant answer
Answer
It is a misconception that you would "test the hypothesis of normal distribution". Tests used to "check for normal distribution" are actually tests for deviations from the normal distribution (the null hypothesis is that the data are realizations of iid normal random variables). Failing to reject this hypothesis is not a "proof" that the hypothesis is correct (it only[!!] indicates that the sample size is not large enough to clearly see that the real-world data will not perfectly align with some idealized theoretical model!), and being able to reject it is no "proof" that the deviation is of any relevance (it only demonstrates that the sample size is large enough to clearly see some deviation, which may not be relevant at all).
The assumptions are based on an understanding of the data-generating process. Often this understanding is rather limited, and one needs to make an educated guess. It is then appropriate (and recommended) to cross-check whether the data roughly behave "as expected" under these assumptions - provided the sample size is large enough to get a reliable idea of the distribution. This is not achieved by tests like Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov or others. It is better to check residual diagnostic plots, one of which is the normal quantile-quantile plot (normal QQ plot). Provided the sample size is large enough, this plot can show systematic deviations and give an impression of the kind and severity of the deviations. Together with other diagnostic plots, one can judge whether the (set of) assumptions is reasonable/acceptable or whether there is a better set of more reasonable assumptions.
With n < 30 (or so), you should not take the CLT as an argument that the normality assumption is irrelevant. Even for n > 30, relying on the CLT can be problematic when the distribution is strongly skewed (as is often the case in biology, e.g. for gene and protein expression, concentrations of biomolecules, titers of pathogens, etc.).
A particularly stupid example of applying a normality test is when the data are clearly binomial (yes/no, present/absent, coded as 0/1) or proportions (bounded between 0 and 1, or percentages bounded between 0% and 100%). Similarly misguided is testing counts or strictly positive variables (concentrations, intensities). In all these cases the assumption of a normal distribution may possibly be okay, but we obviously have more appropriate distributional assumptions (binomial, beta, Poisson, negative binomial, gamma or log-normal) and obviously more appropriate tools to deal with these kinds of distributions.
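To illustrate the QQ-plot approach described above, here is a minimal R sketch with simulated (skewed) data; all names and values are invented for illustration:
# Residual diagnostics instead of a normality test
set.seed(1)
group <- gl(2, 10, labels = c("control", "treated"))  # two groups, n = 10 each
y <- rlnorm(20, meanlog = ifelse(group == "treated", 1.2, 1))  # skewed responses
fit <- lm(y ~ group)
qqnorm(residuals(fit))  # systematic curvature indicates skewness
qqline(residuals(fit))
hist(residuals(fit))    # a second look at the residual distribution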
  • asked a question related to Applied Biostatistics
Question
4 answers
Assuming this is my hypothetical data set (attached figure), in which the thickness of a structure was evaluated in the defined positions (1-3) in 2 groups (control and treated). I emphasize that the structure normally increases and decreases in thickness from position 1 to 3. I would also like to point out that each position has data from 2 individuals (samples).
I would like to check if there is a statistical difference in the distribution of points (thickness) depending on the position. Suggestions were to use the 2-sample Kolmogorov-Smirnov test.
However, my data are not absolutely continuous, considering that the position of the measurement in this case matters (and the test ignores this factor, just ordering all values from smallest to largest and computing the statistics).
In this case, is the 2-sample Kolmogorov-Smirnov test misleading? Is there another type of statistical analysis that could be performed in this case?
Thanks in advance!
Relevant answer
Answer
Have a look at www.Stats4Edu.com
  • asked a question related to Applied Biostatistics
Question
4 answers
I'm excited to speak at this FREE conference for anyone interested in statistics in clinical research. 👇🏼👇🏼 The Effective Statistician conference features a lineup of scholars and practitioners who will speak about professional & technical issues affecting statisticians in the workplace. I'll be giving a gentle introduction to structural equation modeling! I hope to see you there. Sign up here:
Relevant answer
Answer
Thanks for this valuable share!!
  • asked a question related to Applied Biostatistics
Question
5 answers
Hi,
I have performed an epidemiological survey on insomnia prevalence using the ISI and am now looking to test internal consistency using Cronbach's alpha. I could not find any reference example for estimating it for each survey question. It would be helpful to receive assistance from your expertise.
I would appreciate your help in enhancing my knowledge.
Relevant answer
Answer
Daniel Ruivo Marques, pardon me for the late response. Thank you for your guidance.
  • asked a question related to Applied Biostatistics
Question
15 answers
Hello, I'm a master's student working with fungi. One of my studies evaluates the mycelium growth efficiency and biomass production of mushroom strains on different culture media and at different incubation temperatures. In my experiment, I'm working with 4 media (PDA, MYPA, YGA and Soy Agar) and 4 temperatures (20, 25, 30 and 35ºC). A two-way ANOVA shows a significant interaction between the two factors (medium and temperature). Now I would like to know if there is a statistical test that could quantify this interaction effect. I'd be glad if anyone could point me in a direction.
Thanks in advance,
Denis
Relevant answer
Answer
Hello! This is what I proposed at the beginning (the least-squares method - linear correlations); it is the most suitable.
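For quantifying an interaction detected by a two-way ANOVA, a common choice is an effect size such as partial eta squared, computed from the sums of squares of the ANOVA table. A minimal R sketch using the built-in ToothGrowth data as a stand-in for the medium x temperature design:
# Partial eta squared for the interaction term of a two-way ANOVA
fit <- lm(len ~ supp * factor(dose), data = ToothGrowth)
tab <- anova(fit)
ss_int <- tab["supp:factor(dose)", "Sum Sq"]
ss_res <- tab["Residuals", "Sum Sq"]
ss_int / (ss_int + ss_res)  # share of residual-adjusted variance due to the interaction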
  • asked a question related to Applied Biostatistics
Question
9 answers
In many biostatistics books, the negative sign of the calculated t value is ignored.
In a left-tailed t test we include a minus sign in the critical value.
e.g.
Result of a paired, left-tailed t test:
calculated t value = -2.57
critical value = -1.833 (df = 9; level of significance 5%) (the minus sign is included since it is a left-tailed test)
Now, we can accept or reject the null hypothesis:
if we do not ignore the negative sign, i.e. -2.57 < -1.833, the null hypothesis is accepted;
if we ignore the negative sign, i.e. 2.57 > 1.833, the null hypothesis is rejected.
Relevant answer
Answer
The negative sign, in mathematics in general and in statistics in particular, is fully meaningful, notably when commenting on results (for example, a positive correlation is the opposite of a negative linear correlation). The signs must be respected.
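For what it is worth, statistical software keeps the sign automatically when you specify the tail. A minimal R sketch of a paired, left-tailed t test (the data values are invented; 10 pairs give df = 9 as in the question):
# Paired, left-tailed t test: H1 is that mean(before - after) < 0
before <- c(12.1, 11.4, 13.0, 12.7, 11.9, 12.3, 12.8, 11.6, 12.2, 12.5)
after  <- c(12.9, 12.1, 13.5, 13.4, 12.6, 12.7, 13.6, 12.4, 12.9, 13.2)
t.test(before, after, paired = TRUE, alternative = "less")
# The sign of t is kept: reject H0 if t < the (negative) critical value,
# or simply read off the one-sided p value.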
  • asked a question related to Applied Biostatistics
Question
3 answers
Hi,
I have performed an insomnia prevalence study among academics using the ISI and have come across the floor and ceiling effect in a cross-sectional survey. I want to estimate the floor and ceiling percentage for each ISI question and for the total score. It would be helpful to see an example of how to calculate this.
I would appreciate your help in enhancing my knowledge.
Relevant answer
Answer
Hi,
Look at the distribution of scores to see whether they are skewed. Correct for the skew with some transformation and look at the distribution again; you may then be able to do further analysis. Alternatively, the frequencies in the categories can be compared.
Here are a few references on the psychometric properties of the scale and its use in some studies:
Morin CM, Belleville G, Bélanger L, Ivers H. The Insomnia Severity Index: psychometric indicators to detect insomnia cases and evaluate treatment response. Sleep. 2011 May 1;34(5):601-8. doi: 10.1093/sleep/34.5.601
Schulte, T., Hofmeister, D., Mehnert-Theuerkauf, A. et al. Assessment of sleep problems with the Insomnia Severity Index (ISI) and the sleep item of the Patient Health Questionnaire (PHQ-9) in cancer patients. Support Care Cancer 29, 7377–7384 (2021). https://doi.org/10.1007/s00520-021-06282-x
Yusufov M, Zhou ES, Recklitis CJ. Psychometric properties of the Insomnia Severity Index in cancer survivors. Psychooncology. 2019 Mar;28(3):540-546. doi: 10.1002/pon.4973
Ohayon MM. Epidemiology of insomnia: what we know and what we still need to learn. Sleep Med Rev. 2002 Apr;6(2):97-111. doi: 10.1053/smrv.2002.0186
Okajima I, Miyamoto T, Ubara A, Omichi C, Matsuda A, Sumi Y, Matsuo M, Ito K, Kadotani H. Evaluation of Severity Levels of the Athens Insomnia Scale Based on the Criterion of Insomnia Severity Index. Int J Environ Res Public Health. 2020 Nov 26;17(23):8789. doi: 10.3390/ijerph17238789
Gagnon C, Bélanger L, Ivers H, Morin CM. Validation of the Insomnia Severity Index in primary care. J Am Board Fam Med. 2013 Nov-Dec;26(6):701-10. doi: 10.3122/jabfm.2013.06.130064
Kraepelien M, Blom K, Forsell E, Hentati Isacsson N, Bjurner P, Morin CM, Jernelöv S, Kaldo V. A very brief self-report scale for measuring insomnia severity using two items from the Insomnia Severity Index - development and validation in a clinical population. Sleep Med. 2021 May;81:365-374. doi: 10.1016/j.sleep.2021.03.003
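Floor and ceiling effects are usually reported simply as the percentage of respondents at the minimum and maximum possible scores. A minimal R sketch, assuming the seven ISI items are scored 0-4 (total score 0-28); the data here are simulated:
# Percentage at floor (minimum) and ceiling (maximum) per item and for the total
set.seed(2)
isi <- as.data.frame(matrix(sample(0:4, 7 * 100, replace = TRUE), ncol = 7,
                            dimnames = list(NULL, paste0("item", 1:7))))
floor_ceiling <- function(x, lo, hi) {
  c(floor = 100 * mean(x == lo, na.rm = TRUE),
    ceiling = 100 * mean(x == hi, na.rm = TRUE))
}
sapply(isi, floor_ceiling, lo = 0, hi = 4)    # per question
floor_ceiling(rowSums(isi), lo = 0, hi = 28)  # total score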
  • asked a question related to Applied Biostatistics
Question
12 answers
I have diet composition of a species in an area (10 different components) for two different years. So I have two columns (year 1 and 2) and 10 rows (the food items), and the cells are filled with proportions. I want to test if there is a statistical difference in diet between the two years. What test do I use?
Relevant answer
Answer
A chi-square test is a good approach.
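A hedged sketch of how this could look in R. Note that chisq.test needs the underlying counts (numbers of food items), not the proportions, so the counts below are invented for illustration:
# Rows = 10 food items, columns = the two years (counts, not proportions)
diet <- matrix(c(23, 11,  8, 30,  5, 17,  9, 12,  4,  6,
                 18, 20, 10, 22,  9, 11, 14,  7,  6,  8),
               ncol = 2,
               dimnames = list(paste("item", 1:10), c("year1", "year2")))
chisq.test(diet)  # tests whether diet composition differs between the years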
  • asked a question related to Applied Biostatistics
Question
7 answers
Hi,
We received a statistical reviewer comments on our manuscript and one of the comments goes as follows: '... Note that common tests of normality are not powered to detect departures from normality when n is small (eg n<6) and in these cases normality should be support by external information (eg from larger samples sizes in the literature) or non-parametric tests should be used.'
This is basically the same as saying that 'parametric tests cannot be used when n<6', at least without some matching external data that would permit an accurate assumption about the data distribution (and in real life such datasets do not exist). This just doesn't seem right: the t-test and ANOVA can be used with small sample sizes as long as the test assumptions are satisfied - assumptions which, according to the reviewer, cannot be accurately verified, so the tests cannot be used...
I see two possible ways of addressing this:
  1. Argue that parametric tests are applicable and that normality can be assumed using residual plots, tests of homogeneity of variance, etc. This sounds like the more difficult, risky and laborious option.
  2. Redo all the comparisons with non-parametric tests based on this one comment, which just doesn't seem right and empirically would not yield a different result. This would apply to the 15-20 comparisons presented in the paper.
Maybe someone else would have other suggestions on the correct way to address this?
For every dataset in the paper, I assess the data distribution by identifying outliers (outliers: > Q3 + 1.5xIQR or < Q1 - 1.5xIQR; extreme outliers: > Q3 + 3xIQR or < Q1 - 3xIQR), testing the normality assumption with the Shapiro-Wilk test, and visually inspecting the distribution using frequency histograms, distribution density plots and Q-Q (quantile-quantile) plots. Homogeneity of variance was tested using Levene's test.
Datasets are usually n=6 and are exploratory gene expression (qPCR) pairwise comparisons or functional in vivo and in vitro (blood pressure, nerve activity, response magnitude compared to baseline data) repeated measures data between 2-4 experimental groups.
Relevant answer
Answer
This probably does not help you, but I thought that I would have a look at the original Student (Gosset) paper of 1908, as the test was specifically designed for (very) small samples:
"if our sample be small, we have two sources of uncertainty: (1) owing to the “error of random sampling” the mean of our series
of experiments deviates more or less widely from the mean of the population,
and (2) the sample is not sufficiently large to determine what is the law of
distribution of individuals. It is usual, however, to assume a normal distribution,
because, in a very large number of cases, this gives an approximation so close
that a small sample will give no real information as to the manner in which
the population deviates from normality: since some law of distribution must
he assumed it is better to work with a curve whose area and ordinates are
tabled, and whose properties are well known. This assumption is accordingly
made in the present paper, so that its conclusions are not strictly applicable to
populations known not to be normally distributed; yet it appears probable that
the deviation from normality must be very extreme to load to serious error. " My emphasis
" Section X. Conclusions
1. A curve has been found representing the frequency distribution of stan-
dard deviations of samples drawn from a normal population.
2. A curve has been found representing the frequency distribution of the
means of the such samples, when these values are measured from the mean of
the population in terms of the standard deviation of the sample.
3. It has been shown that the curve represents the facts fairly well even
when the distribution of the population is not strictly normal." Again my emphasis.
There are several examples with a sample size below 10 in the paper.
When I used to teach this stuff (to 1st-year geography students), I would demonstrate the Fisher randomization/permutation test for very small samples, as the students could do it by hand and thereby see the underlying logic of the test. I would show that you could permute the data of the two variables under the null hypothesis of no difference, see how extreme a result you could get 'by chance', and then compare the observed value to this; no normality assumptions were needed in coming to some sort of judgement.
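A minimal R sketch of such a permutation test for two small groups (the data are made up); it approximates the permutation distribution of the difference in means by random reshuffling:
# Permutation test for a difference in means, suitable for very small n
set.seed(42)
a <- c(5.1, 6.0, 4.8, 5.6, 5.9)  # group A (invented values)
b <- c(6.4, 7.1, 6.8, 6.2, 7.0)  # group B (invented values)
obs <- mean(a) - mean(b)
pooled <- c(a, b)
perm <- replicate(10000, {
  idx <- sample(length(pooled), length(a))
  mean(pooled[idx]) - mean(pooled[-idx])
})
mean(abs(perm) >= abs(obs))  # two-sided permutation p value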
  • asked a question related to Applied Biostatistics
Question
3 answers
For example, we are reviewing an article in which the sensitivity of a testing modality is 87%, based on 50 patients. How can we calculate the upper and lower limits of its 95% confidence interval when making a forest plot?
Relevant answer
Answer
You may try this by using Chart Builder.
NEW FILE.
DATASET CLOSE ALL.
GET FILE "C:\SPSSdata\bankloan.sav".
DATASET NAME raw.
* OMS.
DATASET DECLARE logistic.
OMS
/SELECT TABLES
/IF COMMANDS=['Logistic Regression'] SUBTYPES=['Variables in the Equation']
/DESTINATION FORMAT=SAV NUMBERED=TableNumber_
OUTFILE='logistic' VIEWER=YES
/TAG = 'logistic'.
LOGISTIC REGRESSION VARIABLES default
/METHOD=ENTER age employ address income debtinc
/PRINT=CI(95)
/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
OMSEND TAG = ["logistic"].
DATASET ACTIVATE logistic.
COMPUTE Vfilter = Var2 NE "Constant".
FILTER by Vfilter.
VARIABLE LABELS Var2 "Variable".
* Chart Builder.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Var2 MAXIMUM(Upper)[name="MAXIMUM_Upper"]
MINIMUM(Lower)[name="MINIMUM_Lower"] MEAN(ExpB)[name="MEAN_ExpB"] MISSING=LISTWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Var2=col(source(s), name("Var2"), unit.category())
DATA: MAXIMUM_Upper=col(source(s), name("MAXIMUM_Upper"))
DATA: MINIMUM_Lower=col(source(s), name("MINIMUM_Lower"))
DATA: MEAN_ExpB=col(source(s), name("MEAN_ExpB"))
COORD: rect(dim(1,2), transpose())
GUIDE: axis(dim(1), label("Variable"))
GUIDE: axis(dim(2), label("Odds Ratio & 95% CI"))
SCALE: linear(dim(2), include(0))
ELEMENT: interval(position(region.spread.range(Var2*(MINIMUM_Lower+MAXIMUM_Upper))),
shape(shape.ibeam))
ELEMENT: point(position(Var2*MEAN_ExpB), shape(shape.circle))
END GPL.
Good luck
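Note that the SPSS syntax above plots odds ratios from a logistic regression; for the original question (the 95% CI around a sensitivity of 87% from 50 patients) a simple binomial interval is enough. A minimal R sketch, assuming the 87% corresponds to roughly 44 of 50 true positives:
# Exact (Clopper-Pearson) 95% CI for a sensitivity of about 87% (44/50)
binom.test(44, 50)$conf.int
# Wilson interval as an alternative, often preferred in meta-analyses
prop.test(44, 50, correct = FALSE)$conf.int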
  • asked a question related to Applied Biostatistics
Question
7 answers
I have a dataset of 5 quantitative continuous variables: 4 independent and 1 dependent (see attached). I tried multiple linear regression (using the standard lm function in R), but no statistical significance was obtained. Then I decided to try to build a nonlinear model using the nls function, but I have relatively little experience with it. Could you help me, please: how do I choose the right "equation" for a nonlinear model? Or maybe I'm doing everything wrong altogether? So far I have only used the standard linear model inside the "nonlinear" model.
I would be very grateful for your help.
If you do not have the opportunity to open the code and see the result, I copy it here:
------
library(XLConnect)
wk <- loadWorkbook("base.xlsx")
db <- readWorksheet(wk, sheet=1)
INDEP <- NULL
DEP <- NULL
DEP <- as.numeric(db[,1])
for(i in 1:4){
INDEP[[i]] <- as.numeric(db[,i+1])
}
MODEL <- NULL
SUM <- NULL
MODEL<-nls(DEP ~ k0 + INDEP[[1]]*k1 + INDEP[[2]]*k2 + INDEP[[3]]*k3 + INDEP[[4]]*k4, start=list(k0=0,k1=0,k2=0,k3=0,k4=0))
SUM <- summary(MODEL)
-----
The result is:
-----
Formula: DEP ~ k0 + INDEP[[1]] * k1 + INDEP[[2]] * k2 + INDEP[[3]] * k3 +
INDEP[[4]] * k4
Parameters:
Estimate Std. Error t value Pr(>|t|)
k0 6.04275 1.30085 4.645 6.41e-06 ***
k1 0.03117 0.01922 1.622 0.107
k2 -0.02274 0.01663 -1.367 0.173
k3 -0.01224 0.01717 -0.713 0.477
k4 -0.01435 0.01541 -0.931 0.353
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.418 on 186 degrees of freedom
Number of iterations to convergence: 1
Achieved convergence tolerance: 2.898e-08
-----
Relevant answer
Answer
It sounds like you already tested your hypothesis with the linear model and used up your p value, so you can't do more p-value tests.
As for choosing a model, that isn't a question for the stats folks but is related to the theory you have (the stats folks can say how to implement or approximate the model).
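One technical note: the nls call in the question specifies a linear formula, so it just reproduces lm. If theory suggests a genuinely nonlinear relationship, nls needs that functional form plus sensible starting values. A self-contained R sketch with one predictor and an exponential-decay form (purely illustrative, not a recommendation for these data):
# nls needs an explicitly nonlinear formula and starting values
set.seed(7)
x <- runif(50, 0, 10)
y <- 5 * exp(-0.4 * x) + rnorm(50, sd = 0.2)  # simulated decay data
fit <- nls(y ~ a * exp(-b * x), start = list(a = 4, b = 0.3))
summary(fit)
plot(x, y)
curve(coef(fit)["a"] * exp(-coef(fit)["b"] * x), add = TRUE)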
  • asked a question related to Applied Biostatistics
Question
9 answers
Can someone explain Cohen's d test, in a simple way, please?
Please elaborate for medical students in simple words.
Relevant answer
Answer
Cohen's d isn't a test. It is a measure of effect size.
It allows you to express the difference between two groups in terms of the naturally-occurring variation in the thing you are measuring. The variation is measured by using information from both groups and pooling it as the pooled standard deviation.
The trouble with Cohen's d is that people tend to convert it to T-shirt sizes - small, medium, large. This seems very vague when you have gone to the bother of doing all those calculations. And studies looking at typical values of d in different research areas suggest that it's not appropriate to use the same definitions of small, medium and large for all disciplines. I have a reference somewhere that I can post if I locate it!
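A minimal R sketch of the calculation described above (two independent groups, pooled standard deviation):
# Cohen's d for two independent groups
cohens_d <- function(x, y) {
  nx <- length(x); ny <- length(y)
  sd_pooled <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / sd_pooled
}
set.seed(1)
cohens_d(rnorm(30, mean = 10), rnorm(30, mean = 9))  # toy example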
  • asked a question related to Applied Biostatistics
Question
6 answers
I have data set for S vs t and x vs t. The yield coefficient needs to be calculated. What is procedure to calculate it? Do I take the data for logarithmic growth phase only?
Relevant answer
Answer
How can I find out X (microorganism concentration, i.e. biomass concentration) and S (substrate concentration) from blackwater characterisation data? I mean, do COD or VSS represent these?
  • asked a question related to Applied Biostatistics
Question
34 answers
Journal of Multidisciplinary Applied Natural Science (abbreviated as J. Multidiscip. Appl. Nat. Sci.) is a double-blind peer-reviewed journal for multidisciplinary research activity on natural sciences and their application on daily life. This journal aims to make significant contributions to applied research and knowledge across the globe through the publication of original, high-quality research articles in the following fields: 1) biology and environmental science 2) chemistry and material sciences 3) physical sciences and 4) mathematical sciences.
We invite researchers whose work relates to our scope to join as section editors based on their interests, or as regional handling editors for their region. The role of the editor is to help us maintain and improve the Journal's standards and quality by:
  1. Support the Journal through the submission of your own manuscripts where appropriate;
  2. Encourage colleagues and peers to submit high quality manuscripts to the Journal;
  3. Support in promoting the Journal;
  4. Attend virtual Editorial Board meetings when possible;
  5. Be an ambassador for the journal: build, nurture, and grow a community around it;
  6. Increase awareness of the articles published in the journal in all relevant communities and amongst colleagues;
  7. Regularly agree to review papers when invited by Associate Editors, and handle these promptly to ensure fast turnaround times;
  8. Suggest referees for papers that you are unable to review yourself.
Relevant answer
Answer
Frank T. Edelmann, yes, sure, thanks for your good discussion. We created this journal along the lines of other multidisciplinary journals; for example, the Journal of King Saud University - Science (Scopus indexed) publishes peer-reviewed research articles in the fields of physics, astronomy, mathematics, statistics, chemistry, biochemistry, earth sciences, and life and environmental sciences.
As another example, PERIÓDICO TCHÊ QUÍMICA (Scopus indexed) also publishes peer-reviewed research articles in the same fields as our journal.
  • asked a question related to Applied Biostatistics
Question
5 answers
I am new to stats at this level in ecology. I am trying to compare DNA and RNA libraries with thousands of OTUs. I summarized taxa to get the most abundant species, but I can obtain only relative abundances. I was thinking of using SIMPER, as I read in several comments, to test which species differ the most per station between the DNA- and RNA-based libraries. However, I have read that SIMPER is only more or less robust. I was wondering if manyglm would also be an alternative for my question, or if you would suggest another way. Thank you for your help!
Relevant answer
  • asked a question related to Applied Biostatistics
Question
6 answers
I have treated THP1 and AGS cells for 12, 24 and 48 hours with bacterial toxin concentrations of 0, 5, 10, 20, 40 and 80 ug/ml. Now I want to support my results with statistical methods, but I'm confused about which one to use. Should it be an ANOVA with post hoc tests, or simply a t test? And should it be one-tailed or two-tailed, paired or unpaired?
Relevant answer
Answer
As I understand it, the cells were randomly distributed over the 6 different concentrations, but each culture was treated for 12, 24 and 48 hours. If this is correct, then you have a Between (6 levels of Concentration) x Within (3 levels of Time, in hours) design. First run a factorial repeated-measures ANOVA. If there are significant effects (of Concentration, Time and/or the interaction), then you can run post hoc tests. Before running these tests, check the distributions of the measures on the treated cells with histograms. If they obviously do not look normal, then run a non-parametric test such as f1.ld.f1 from the nparLD R package (https://www.jstatsoft.org/article/view/v050i12).
I suppose you normalize the images of the cells as in this study: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1224512
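A minimal R sketch of the suggested between x within ANOVA, with simulated data (5 cultures per concentration, each measured at 3 time points; all names are invented):
# Between (concentration) x within (time) repeated-measures ANOVA
set.seed(4)
dat <- expand.grid(id = factor(1:30), time = factor(c(12, 24, 48)))
conc_map <- rep(c(0, 5, 10, 20, 40, 80), each = 5)  # dose per culture id
dat$conc <- factor(conc_map[as.integer(dat$id)])
dat$response <- rnorm(nrow(dat), mean = 10)
fit <- aov(response ~ conc * time + Error(id / time), data = dat)
summary(fit)  # significant terms -> follow up with post hoc tests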
  • asked a question related to Applied Biostatistics
Question
4 answers
Does anybody know an estimation method for calculating the prevalence of a given risk factor in the general population, given that the odds ratio/relative risk, the prevalence of the risk factor among the diseased, and the prevalence of the disease are available?
Relevant answer
Answer
In the two-way table

      D+   D-
 E+   a    b
 E-   c    d

disease odds ratio = (a/b)/(c/d) = ad/bc
exposure odds ratio = (a/c)/(b/d) = ad/bc
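Building on the identity above (the disease and exposure odds ratios are equal), the exposure prevalence in the total population can be recovered from the three known quantities. A small R sketch of the arithmetic (the input values are invented):
# Known inputs (invented numbers for illustration)
OR <- 2.5          # odds ratio linking exposure and disease
p_exp_dis <- 0.40  # prevalence of the risk factor among the diseased
p_dis <- 0.10      # prevalence of the disease
# Odds of exposure among the diseased, then among the non-diseased via the OR
odds_dis <- p_exp_dis / (1 - p_exp_dis)
odds_nondis <- odds_dis / OR
p_exp_nondis <- odds_nondis / (1 + odds_nondis)
# Total-population prevalence as a weighted average
p_exp_dis * p_dis + p_exp_nondis * (1 - p_dis)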
  • asked a question related to Applied Biostatistics
Question
4 answers
I want to discriminate between type I and type II diabetes using certain factors.
I wish to do a discriminant analysis with the type of diabetes as the dependent variable, and I have both categorical and continuous independent factors. My doubt is: can I include categorical independent variables in a discriminant analysis?
Relevant answer
Answer
You can, but discriminant analysis assumes that the independent variables are normally distributed. Further, it assumes that the variances are equal across groups of the outcome (i.e., the multivariate normality assumption). Fisher and Van Belle (Biostatistics: A Methodology for the Health Sciences) have a nice review. Logistic regression is a far more flexible approach, since it doesn't make normality assumptions about the dependent or independent variables.
-p
  • asked a question related to Applied Biostatistics
Question
6 answers
This is an anti-tumor efficacy study. there are 2 compounds and each compound has 3 dose levels. For example:
Group 1: vehicle control group (n=10 mice)
Group 2: drug A treatment group, dose 1 (n=10 mice)
Group 3: drug A treatment group, dose 2 (n=10 mice)
Group 4: drug A treatment group, dose 3 (n=10 mice)
Group 5: drug B treatment group, dose 1 (n=10 mice)
Group 6: drug B treatment group, dose 2 (n=10 mice)
Group 7: drug B treatment group, dose 3 (n=10 mice)
At the end of the study, the mice will be euthanized and the tumors weighed, to compare whether the tumor weight of the treatment groups is significantly different from that of the vehicle group. The question is: when we use a one-way ANOVA to do the statistics, will all 7 groups be analyzed as a whole, or will drugs A and B be compared with the vehicle separately?
Many thanks in advance!
Relevant answer
Answer
Hi Alok
My suggestion is that you could use a one-way or two-way ANOVA followed by Dunnett's test as the post hoc test if you want to compare every mean to the control, or Tukey's test if you want to compare every mean with every other mean, e.g. using GraphPad Prism.
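A minimal R sketch of the Dunnett comparison against the vehicle control, using the multcomp package (the data are simulated and all names invented; the control must be the reference factor level):
# Dunnett's test: each treatment group vs the vehicle control
library(multcomp)
set.seed(5)
dat <- data.frame(group = rep(c("vehicle", paste0("A_dose", 1:3), paste0("B_dose", 1:3)), each = 10))
dat$tumor_weight <- rnorm(70, mean = 1.5, sd = 0.3)
dat$group <- relevel(factor(dat$group), ref = "vehicle")
fit <- aov(tumor_weight ~ group, data = dat)
summary(glht(fit, linfct = mcp(group = "Dunnett")))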
  • asked a question related to Applied Biostatistics
Question
16 answers
Hello everyone,
Currently I am trying to do k-means clustering on a microarray dataset which consists of 127 columns and 1000 rows. When I plot the graph, it gives an error: "figure margins too large". I then typed this in the R console:
par("mar") # shows the current margins
par(mar = c(1, 1, 1, 1)) # tried to shrink the margins
But it did not work. So, can anyone suggest another way of fixing this problem? (Part of the code is attached below.)
Thanks,
Hasan
--------------------------------------------------------------------------------------------------------------
x = as.data.frame(x)
km_out = kmeans(x, 2, nstart = 20)
km_out$cluster
plot(x, col = (km_out$cluster + 1), main = "K - Means Clustering Results with K=2", xlab = "", ylab = "", pch = 20, cex = 2)
>Error in plot.new() : figure margins too large
Relevant answer
Answer
  • asked a question related to Applied Biostatistics
Question
5 answers
Sometimes we want to conduct a reliability study of some diagnostic modality for a specific disease, but the gold standard for diagnosing that disease is an invasive procedure or surgery, which is not justified in normal individuals (the control group). In such a case, is it justified to classify the control group as negative by the gold standard?
For example:
We want to diagnose infantile hypertrophic pyloric stenosis (IHPS) with ultrasound, but the gold standard for its diagnosis is surgery. Suppose we perform ultrasound on 50 infants with projectile vomiting: the sonographic findings of 40 of them are suggestive of IHPS and 10 appear normal, but after surgery (the gold standard) 38 are confirmed as IHPS and 2 are false positives. Now we want to perform ultrasound on 50 normal (control) infants. Is it justified to record all 50 normal infants as true negatives (with 0 false positives) by the gold standard, in order to perform the chi-square statistics?
Relevant answer
Answer
Hello,
If I have understood your question correctly, I think you can use Bayes' theorem here, based on the clinical literature if it exists for this specific test. We can consider the ultrasound results as a priori information about the clinical test, and the gold-standard results as the real data (x).
I can reformulate your problem as:
If the test results come back positive using the ultrasound method, what are the chances that the children actually have the disease (per the gold-standard method)?
CHELLAI. F
  • asked a question related to Applied Biostatistics
Question
13 answers
"Comparison of scoring system 1 versus scoring system 2 predicting in-hospital mortality".
The study is non-interventional, will follow patients from admission to discharge. Please suggest the best suitable design.
Relevant answer
Answer
If you are following up your cases under the two scoring methodologies over a certain period of time to observe mortality, then I guess you must have in mind a certain number of patients to be followed up. This will be your cohort, and I interpret it as a cohort study, prospective in nature.
Regards
  • asked a question related to Applied Biostatistics
Question
6 answers
Hi,
This seems a bit unusual to me, since I could not find any related paper. This is my situation:
We have CT values (3 replicates each) for following conditions for gene A:
- Wild type (WT) cells, treated.
- WT cells, untreated,
- Mutated (MUT) cells, treated,
- MUT cells, untreated.
We are interested in studying the effect of a certain mutation (MUT) on the expression pattern of gene A in treated vs. untreated conditions. To do so, I simply calculated ddCT_WT as dCT_WT_treated - dCT_WT_untreated, and similarly for MUT: ddCT_MUT = dCT_MUT_treated - dCT_MUT_untreated. Finally, the log2 fold change in expression (FC) was calculated as log2(ddCT_MUT/ddCT_WT). I am not sure whether this approach makes sense, so I would appreciate your help to better interpret/represent/analyse my results.
Any reference to similar conditions is highly appreciated!
Thanks!
Relevant answer
Answer
Hi there,
In principle your calculation will tell you whether the fold change in the mutant is higher or lower than that in the WT. You may lose important information about the significance of such a change, though. If you have a bit of programming knowledge, this is what I would do. You have 4 conditions and 3 replicates for each. Calculate your log2(ddCT_MUT/ddCT_WT) as you did, and then, 1000 times, randomly shuffle the expression values of A among all 12 measurements. Each time, calculate log2(ddCT_MUT/ddCT_WT) and store the result. You will get a distribution of values that you can compare with your initial one. If your initial value is higher (or lower) than 95% of the random values, then there is probably a signal there. ........ just an idea!
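A minimal R sketch of this shuffling idea (the 12 invented values stand for the dCT measurements of the four condition-by-replicate groups; note that log2 is undefined whenever a shuffled ratio is negative, so those permutations are dropped):
# Permutation of the ratio-of-ratios statistic across the 12 measurements
set.seed(1)
vals <- c(21.3, 21.1, 21.6, 19.8, 20.1, 19.9,   # WT treated, WT untreated
          22.0, 22.2, 21.8, 19.2, 19.5, 19.3)   # MUT treated, MUT untreated
grp <- rep(c("WTt", "WTu", "MUTt", "MUTu"), each = 3)
stat <- function(v, g) {
  ddWT  <- mean(v[g == "WTt"]) - mean(v[g == "WTu"])
  ddMUT <- mean(v[g == "MUTt"]) - mean(v[g == "MUTu"])
  log2(ddMUT / ddWT)
}
obs <- stat(vals, grp)
perm <- replicate(1000, stat(sample(vals), grp))
mean(perm >= obs, na.rm = TRUE)  # one-sided permutation p value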
  • asked a question related to Applied Biostatistics
Question
6 answers
Hello everyone,
I would like to calculate the similarity between two clusterings of the same dataset, with the similarity statistic computed from the co-memberships of the observations. However, I could not implement the code in R. Is there anyone who can help?
Best regards,
Hasan
Relevant answer
Answer
See the cluster.stats function in the fpc R package.
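A minimal sketch of two common options in R: fpc::cluster.stats compares two clusterings via co-membership-based indices such as the corrected Rand index, and mclust::adjustedRandIndex does the same with less ceremony (the iris data serve as a stand-in for your dataset):
# Compare two clusterings of the same dataset via co-membership indices
library(fpc)     # cluster.stats
library(mclust)  # adjustedRandIndex
x <- scale(iris[, 1:4])
c1 <- kmeans(x, 2, nstart = 20)$cluster
c2 <- kmeans(x, 3, nstart = 20)$cluster
cluster.stats(dist(x), c1, alt.clustering = c2)$corrected.rand
adjustedRandIndex(c1, c2)  # same idea, simpler call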
  • asked a question related to Applied Biostatistics
Question
3 answers
Dear colleagues,
I just received the following comment from a reviewer: "What is the ordinate unit on each graph?"
Actually, I do not know the meaning of "ordinate unit". What is the ordinate unit? How can I calculate or display the ordinate unit in plots in R?
Note: I used the season package in R to create the plots in the submitted manuscript.
Relevant answer
Answer
Yes, I think it is that simple: the ordinate is the vertical (y) axis, so the reviewer is asking what unit of measurement the y-axis of each graph shows. It would have been a great help if you had posted the graph.
  • asked a question related to Applied Biostatistics
Question
5 answers
I want to calculate the standardized incidence ratio (SIR) of second primary malignancies.
I have a database of patients who were diagnosed with lung cancer as a first cancer and were followed up to find out whether they developed a second primary malignancy. The patients were diagnosed with lung cancer between 1990 and 2013, and the follow-up started in 1990 and finished in 2014.
To calculate the number of expected cases, one needs the person-years for every age category and also the age-specific incidence rates. I only have the age-specific incidence rates between 1999 and 2014.
How could I calculate the SIR? Should I exclude all the patients that were diagnosed with lung cancer before 1999 and:
- calculate the person-years from the date of first diagnosis (1999 onwards) until the outcome or end of follow-up, and count the observed cases from 1999 onwards?
Does anyone with experience in this topic have a suggestion?
Thanks in advance!
Relevant answer
Answer
I presume you are doing indirect standardization by age only, comparing the incidence of any second primary malignancy in those with a diagnosis of lung cancer to the incidence of any primary malignancy in the general population.
You appear to have an identified cohort of lung cancer patients. You know which of the patients developed a second primary after they developed lung cancer. The person-years at risk would be calculated from the time of the first cancer. The SIR for any second primary would be the actual number of second primaries / the expected number of these as a first primary. The expected number would be calculated from the average age-specific incidences for the general population over the period 1990-2014.
Assuming the general population is quite large (e.g. national), the age-specific rates for all cancers combined are unlikely to have varied much between 1990 and 1999, so you can use the 1999-2014 general population age-specific rates for the whole follow-up period 1990-2014.
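The arithmetic itself is compact: expected cases are the sum over age bands of cohort person-years times the reference rate, and the SIR is observed over expected. A minimal R sketch with invented numbers:
# SIR = observed / expected, with expected from age-specific reference rates
py <- c("40-49" = 1200, "50-59" = 3400, "60-69" = 5100, "70+" = 2600)  # person-years
rate <- c(0.004, 0.009, 0.016, 0.022)  # reference incidence per person-year
obs <- 210                             # observed second primaries in the cohort
expct <- sum(py * rate)
sir <- obs / expct
sir
# Approximate 95% CI treating the observed count as Poisson
sir * c(qchisq(0.025, 2 * obs) / (2 * obs), qchisq(0.975, 2 * (obs + 1)) / (2 * obs))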
  • asked a question related to Applied Biostatistics
Question
7 answers
I have rapid light curve data (ETR for each PAR value) for 24 different specimen of macroalgae. The dataset has three factors: species (species 1 and species 2), pH treatment (treatment 1 and treatment 2) and Day (day 1 of the experiment and day 8 of the experiment). 
I have fitted a model defined by Webb 1974 to 8 subsets of the data:
species 1,pH treatment 1, day 1
species 1, pH treatment 1, day 8
species 1, pH treatment 2, day 1...etc.
I have plotted the curves of the data that is predicted by the model. The model also gives the values and standard error of two parameters: alpha (the slope of the curve) and Ek (the light saturation coefficient). I have added an image of the scatterplot + 4 curves predicted by the model for species 1 (so each curve has a different combination of  the factors pH treatment and Day). 
I was wondering what the best way would be to statistically test whether the 8 curves differ from each other (or, in other words, how to test whether the slopes and Ek of the models are significantly different). When googling for answers, I found many ways to check which model fits the data better, but not how to test whether the different treatments also cause differences in the rapid light curves.
Any help would be greatly appreciated.
Cheers,
Luna
Relevant answer
Answer
Although the question is few years old, I’ll add an answer just in case someone finds this relevant.
For comparing treatments, you need to fit a single curve for each individual light curve measured, not one curve per treatment as your graph above implies. Then you can extract the parameters of all the curves and compare these between the treatments, as suggested by Denis.
There is a nice R package called “phytotools” (Silsbe & Malkin 2015) which I have used and found very convenient. It has four different light curve models: Eilers & Peeters 1988, Jassby & Platt 1976, Platt, Gallegos & Harrison 1980 and Webb et al. 1974. The manual also provides the equations for the models.
As the fitting functions are provided in the package, fitting the light curves is straightforward, and the package includes basic programming examples. One needs to do a bit of coding to loop through the data containing the curves and fit a model: each measured light curve needs a unique id, and when looping through the ids you extract the parameters (e.g. alpha, beta, and so on) from the fitted model, along with the treatment information for that particular curve, into a data frame.
After this it should be easy to perform the analyses, e.g. with ANOVA or something else, depending on whether your data show heterogeneity etc...
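A sketch of such a loop with plain nls and the Webb et al. 1974 model, ETR = ETRmax * (1 - exp(-alpha * PAR / ETRmax)); the data frame rlc and its columns are invented here (simulated so the sketch runs), and phytotools would wrap the fitting step for you:
# Fit the Webb 1974 model per curve id and collect alpha and ETRmax
set.seed(6)
rlc <- do.call(rbind, lapply(1:8, function(i) {
  PAR <- seq(20, 800, by = 60)
  data.frame(id = i, PAR = PAR,
             ETR = 60 * (1 - exp(-0.25 * PAR / 60)) + rnorm(length(PAR), sd = 2))
}))
fit_one <- function(d)
  coef(nls(ETR ~ ETRmax * (1 - exp(-alpha * PAR / ETRmax)),
           data = d, start = list(ETRmax = max(d$ETR), alpha = 0.2)))
pars <- do.call(rbind, lapply(split(rlc, rlc$id), fit_one))
pars
# merge the treatment info per id back in, then compare the parameters, e.g.
# anova(lm(alpha ~ species * pH * day, data = merged_pars))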
  • asked a question related to Applied Biostatistics
Question
13 answers
I was wondering if anyone had any resources on how to do a pooled prevalence in R? Is it possible to have a forest plot as a result? Any help would be greatly appreciated.
Thanks
Dearbhla 
Relevant answer
Answer
You can use the meta package in R.
First, run the metaprop command and store the result in an object (e.g. metaprop_results). The metaprop function needs three arguments: the number of events per study (event), the study sample size (n), and the data frame in which they are stored (data).
metaprop_results <- metaprop(event = your_events, n = your_n, data = df)
Second, plot your results as a forest plot:
forest(metaprop_results)
  • asked a question related to Applied Biostatistics
Question
7 answers
Hello people,
I want to know how to use a GLM to compare the mean number of granivore birds between "high water level" years and "low water level" years, as shown in the picture provided below. This is an arbitrary data set I made up, but my real data are similar and not normally distributed. What steps should I follow? Where should I start? Should I use a GLM or something else? Should I first determine whether the data fit a negative binomial or Poisson distribution? If so, how can I do that in R?
I tried the Mann-Whitney U test, but I think I should use something stronger. I would be glad if somebody could explain to me what to do in plain language. Thanks in advance.
Relevant answer
Answer
The objective (i.e. what you want to know) is not clear; it needs to be described clearly and specifically.
The data shown in the table are a kind of time series. It may be possible to extract information from them, but the dataset is small; it would help to collect more data.
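Since the question explicitly asks how this looks in R, here is a minimal sketch of the usual count-data workflow with simulated data (all names invented): fit a Poisson GLM, check for overdispersion, and fall back on a negative binomial GLM if needed:
# Count data: Poisson GLM first, negative binomial if overdispersed
library(MASS)  # glm.nb
set.seed(8)
birds <- data.frame(water = rep(c("high", "low"), each = 10),
                    counts = rnbinom(20, mu = rep(c(15, 25), each = 10), size = 2))
fit_pois <- glm(counts ~ water, family = poisson, data = birds)
deviance(fit_pois) / df.residual(fit_pois)  # clearly > 1 suggests overdispersion
fit_nb <- glm.nb(counts ~ water, data = birds)
summary(fit_nb)
AIC(fit_pois, fit_nb)  # compare the two fits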
  • asked a question related to Applied Biostatistics
Question
9 answers
Hi everyone. I have applied multiple logistic regression to create a model from my independent parameters (x, y and w). The fitted linear predictor is Z = ax + by + cw - d, and the probability of the occurrence of my dependent parameter is P = exp(Z)/(exp(Z)+1); all of the parameters are binary.
Now, in order to interpret the output, I have calculated the probability of the occurrence of my dependent variable for all possible permutations of the variables, as follows:
1: x=0, y=0, w=0 ------> P=0.74%
2: x=0, y=1, w=0 ------> P=2.3%
3: x=1, y=0, w=0 ------> P=1.35%
4: x=1, y=1, w=0 ------> P=4.14%
5: x=0, y=0, w=1-------> P=1.65%
.
.
8: x=1, y=1, w=1------> P=8.83%
Since the signs of all coefficients (a, b and c) are positive, apparently the highest probability occurs when x, y and w are all 1. But in this case the probability reaches its highest value at only 8.8%. Is this result rational?
And how can I interpret the magnitude of each independent parameter? Can I say that, since all the variables are binary and have positive coefficients, a variable with a bigger coefficient has a bigger impact on the probability derived from Z?
Thank you all in advance for your kind replies.
Relevant answer
Answer
Hello Mohsen,
Generally, yes … all variables being binary. However, I would argue (as you have in your opening question) that you need to transform your (logit) coefficients into probabilities. This helps in discussing and comparing the parameters in a way the reader can understand; it's next to impossible to communicate different types of non-linear parameters (logit, cubic, probit, etc.) in a write-up and make sense of them.
So I think you are on the right track with transforming the regression parameters into probabilities. Keep in mind that you don't really have to break them down into all possible groupings: each transformed parameter is the increase/decrease in the probability of the DV when that independent parameter = 1 (assuming your binary variables are 0 or 1), HOLDING all other variables in the equation equal.
However, because these predictors have covariance (as one would expect), the probabilities are not simply additive (e.g., the probability for x & y = 1 is not the same as adding up the probabilities of when only x = 1 and, separately, when only y = 1, as in your example at the top).
The odds ratio is a better way of showing the magnitude of for each independent parameter.
I have added a spreadsheet for doing the logit conversions, in case it is helpful.
Wishing you well,
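A minimal R sketch of the logit-to-probability conversion for all eight 0/1 combinations (the coefficient values are invented stand-ins for a, b, c and d):
# P = exp(Z)/(1 + exp(Z)) = plogis(Z), evaluated for every 0/1 combination
a <- 0.65; b <- 1.18; cw <- 0.84; d <- 4.9  # invented coefficients (cw plays the role of c)
combos <- expand.grid(x = 0:1, y = 0:1, w = 0:1)
combos$P <- with(combos, plogis(a * x + b * y + cw * w - d))
combos  # the largest P occurs at x = y = w = 1, as in the question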
  • asked a question related to Applied Biostatistics
Question
3 answers
I see MS Excel has several trend-line options: linear, logarithmic, polynomial, exponential, and power functions.
What is the basis/logic for selecting these functions for biological data?
For example, I'm interested in understanding changes in abundance of either transcripts or proteins; my data fit a polynomial trend line.
How can I compare different samples with this option?
Relevant answer
Answer
Hi Akila,
You should use the model which best fits your data. For example, if your data have an exponential trend and you use a linear trend line, you can get misleading results. Look at the graphs of your data: they give you a lot of information for deciding which trend line is best.
You can find some basic information at the link below:
All the best
  • asked a question related to Applied Biostatistics
Question
4 answers
Hi guys, I have recently conducted a meta-analysis comparing 3 different drugs against each other, and I am struggling to know which statistic from the meta-analysis to use to compare the 3 drugs.
Am I correct in saying that you would just compare the 3 WMDs in each subgroup alongside their confidence intervals? I have attached a picture of my meta-analysis below.
Relevant answer
Answer
A network meta-analysis (NMA), for sure.
  • asked a question related to Applied Biostatistics
Question
3 answers
My study aims to explain why there are more cases of a given disease in certain areas of a state. For that, I'm trying to use the number of occurrences as the dependent variable and land-use metrics plus economic data as the independent ones. I've tried linear regression, but it doesn't explain the data very well. If there is literature about this, and a certain method is already established as standard, please let me know.
Relevant answer
Answer
In the first instance, you should be trying Poisson regression, with the population count in each area as the exposure variable. Counts are not normally distributed but tend to follow a Poisson or negative binomial distribution.
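A minimal sketch of that Poisson regression with population as the exposure, entered as an offset (the data are simulated and all names invented):
# Area-level disease counts with population as exposure
set.seed(9)
areas <- data.frame(population = round(runif(40, 5e3, 2e5)),
                    landuse = runif(40), income = runif(40, 300, 3000))
areas$cases <- rpois(40, lambda = areas$population * 1e-4 * exp(0.8 * areas$landuse))
fit <- glm(cases ~ landuse + income + offset(log(population)),
           family = poisson, data = areas)
summary(fit)
# If overdispersed, consider MASS::glm.nb with the same offset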
  • asked a question related to Applied Biostatistics
Question
10 answers
Colleagues, I need help with Venn diagrams and transcriptomics. I have three lists of IDs (example: c58516_g4_i4) - only IDs, not the sequences. I need to make a Venn diagram to know which IDs are shared among the three lists, which are shared between only two of them, and which are present only in their original list. I could do it manually, but it's a huge number of IDs. Can you suggest some software for Windows or a script for Linux? Thanks!
Relevant answer
Answer
You can try the tool called "Venny". Cheers~
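If you prefer a script, the set logic behind the Venn diagram is a few lines of base R (the three short ID vectors below are invented):
# Shared and unique IDs among three character vectors
list1 <- c("c58516_g4_i4", "c100_g1_i1", "c200_g1_i1")
list2 <- c("c100_g1_i1", "c300_g2_i1", "c58516_g4_i4")
list3 <- c("c100_g1_i1", "c400_g1_i2")
Reduce(intersect, list(list1, list2, list3))  # shared by all three lists
setdiff(intersect(list1, list2), list3)       # only in lists 1 and 2
setdiff(list1, union(list2, list3))           # unique to list 1
# ...and analogously for the remaining pairs and lists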
  • asked a question related to Applied Biostatistics
Question
6 answers
I am conducting a research project in which I am using an SEM model. My exogenous variable (world-system position) is ordinal with 4 categories. I am not sure how creating so many dummy variables would work in an SEM model; thus I would like to treat it as a continuous variable. But I am not sure whether I would be violating any statistical assumption by doing this. Can somebody help me with a suggestion on this?
Relevant answer
Answer
As the others point out, it is possible to incorporate an ordinal variable. It also seems you are interested in whether you can treat it as continuous and that is another matter. There are several considerations.
1) Continuous scales assume equivalent differences between intervals on the scale. This might not be valid if all you have is the rank order from the ordinal scale. For example, can you reasonably assume that the distance between "agree" and "strongly agree" is the same for all respondents to your questionnaire?
2) I would also consider the purpose of the study. How high are the stakes in your results?
3) How has the scale been treated in your discipline? It is very common in social science research to treat ordinal scales as continuous. What about the scale you are using? How have others treated it?
In the strictest sense you have an ordinal scale. On the other hand, what are typically truly ordinal scales are often treated as continuous in social science research. Though it's not often explicitly discussed, I think one should be explicit about the treatment of the scale as continuous if it is novel in your field. Knowing nothing about your research, to be precise I would treat the variable as ordinal and, if you are curious about that particular scale, conduct research on how reasonable it is to treat it as continuous if such research doesn't already exist for your scale and population. Stevens (1946) is a starting point on this issue.
Best wishes
  • asked a question related to Applied Biostatistics
Question
9 answers
Can someone explain what "experimental unit", "replicate", "total sample size" and "treatment size" mean in biostatistics, with a practical biological example?
I see that in some places "n = ..." is used for replications. According to what I've been taught, this is totally wrong.
Does sample size equal the number of replicates? Then how, and why?
From what I've seen in different papers, I'll try to summarize what I've observed for a root-length measuring test; n.b. each case has a control and a treatment.
Case 1: plants grow in one plate/box (control vs treatment), each genotype has 20 seedlings, and they report n = 20 seedlings, e.g. as in Picture 1.
  • in this kind of experiment they consider each seedling a biological replicate
Case 2: plants grow in one plate/box (control vs treatment), each genotype has 20 seedlings, and they report n = 20 seedlings, 5 independent experiments, e.g. as in Picture 1.
  • in this kind of experiment they consider each seedling a biological replicate, across five independent experiments
Case 3: plants grow in one plate/box (control vs treatment), each genotype has 20 seedlings, and they report n = 3, e.g. as in Picture 2.
  • in this kind of experiment they consider each plate a replicate, and in each plate 10/15/20 seedlings are grown.
Relevant answer
Answer
Experimental unit: This is the (field plot/animal/gear/whatever) to which "treatments" are applied. Treatments can be directly applied (like a dose of insecticide to an insect) or they can be observational (sex, weather, disease). If you randomize, you typically randomize the experimental units. Note that there are some additional terms: subsample, technical replicate, pseudoreplicate. These three terms are used when multiple samples are taken from a single experimental unit.
Replicate: A replicate is one experimental unit in one treatment. The number of replicates is the number of experimental units in a treatment.
Total sample size: My guess is that this is a count of the number of experimental units in all treatments. This is not very informative, and leads to trouble if the design is unbalanced. After all, I could say that my total sample size was 100 in four treatments. Sounds good, unless I reveal that one treatment had 60 replicates, one had 30, and the other two had five each.
Treatment size: I am not sure, but I would guess that this is the number of experimental units in each treatment.
Recognize that a huge variety of people use statistics. They may not all use all of the same terminology in exactly the same way.
The problem with examples is that they can be arranged in a multitude of different ways. Methods sections often don't provide sufficient detail to replicate an experiment. In your example there is a treatment (herbicide) and a control (water). Say I want to evaluate herbicide resistance in 6 genotypes, 20 plants per genotype.
1) I place all of the plants in one box. I mix up a tank of herbicide, and spray half the box. Technically, this is one replicate, no matter how I process the plants.
2) I place 12 plants in a box. Half get sprayed. This is two plants from every genotype. I still made up a single tank, so this is one replicate.
3) I place 12 plants in a box, just like in 2. However, I mix up a new solution each time. If I am really good, I might use a different sprayer/nozzles. I will end with 10 replicates.
People often argue that I really don't care about the sprayer effect, nor the effect of the person using it. So I will go with plan 2 (above), and treat this as 10 replicates. This is great so long as you assume that the sprayer that you are using is typical, and no errors were made in mixing up the herbicide. With those untestable assumptions holding true, this is fine.
  • asked a question related to Applied Biostatistics
Question
7 answers
We are working on a health survey project with more than 100,000 participants, and we are unsure whether to use the mean or the median.
Please help us.
The data do not follow a normal distribution.
Relevant answer
Answer
In this case, it may be useful to calculate not only the median but a set of percentile points, for example the 0.1%, 0.5%, 1%, 5%, 10%, 20%, 50%, 80%, 90%, 95%, 99%, 99.5% and 99.9% points.
I still do not know if you just want to set some benchmarks or if you have additional research goals.
With this amount of data, I do not think it would be useful to apply inferential procedures to either means or medians.
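A one-liner in R for the percentile set suggested above (x stands in for the survey variable; here it is simulated):
# Percentile summary instead of a single mean/median
x <- rlnorm(1e5)  # skewed toy data
probs <- c(0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.5, 0.8, 0.9, 0.95, 0.99, 0.995, 0.999)
quantile(x, probs = probs, na.rm = TRUE)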
  • asked a question related to Applied Biostatistics
Question
2 answers
Hello, everybody. I would like to know if it is methodologically correct to use a pool of patient samples to analyze microRNA expression by qPCR in my population. I have 3 groups of ~50 individuals each, and therefore the cost of an exploratory study on each patient is extremely high. I was thinking of preparing one pool of cDNA per group and performing the qPCR for each group instead of each individual, and observing the expression trend.
Thanks in advance!
Relevant answer
Answer
Yes, this is correct, as you want to see systematic differences in miRNA expression between groups.
You save money and you pay for this by a slight loss of statistical power and the inability to identify individual profiles (so it is impossible to identify outliers, cluster individuals, do a subgroup analysis etc).
I suggest not to pool more than 5 individuals, so that you have 10 pools/group.
Pooling all the samples of each group (getting 1 pool/group) is usually not a good idea, because then you would lose any information about the variability, so you cannot judge the statistical significance of your results (you will have *no* statistical power). However, if you need to screen a large number of miRNAs, it might be ok to run a first round on such "total pools" to pick the miRNAs that showed the largest change, and a second round with fewer individuals per pool (5, say) to check the significance. If this is a preliminary experiment to select candidates for further experiments, significance may not be important and you can directly proceed with the experiments testing the biological relevance of your selected candidates.
  • asked a question related to Applied Biostatistics
Question
6 answers
Suppose, you have measured 4 clinical parameters (A, B, C, D) at the time of admission of 60 patients with the same disease. You observe the outcome as "severe disease" and "non-severe disease". Now you want to calculate the severity predictive values of:
1. Individual parameters: i] A, ii] B, iii] C, iv] D individually
2. Combinations of v] A+B, vi] A+C, vii] A+D, viii] B+C, ix] B+D, x] C+D
3. Combinations of xi] A+B+C, xii] A+B+D, xiii] A+C+D, xiv] B+C+D
4. and xv] A+B+C+D
How can one compare the results of these 15 combinations and tell which combination gives the highest specificity, sensitivity, positive predictive value and negative predictive value? Please enlighten me.
Thank you 
Relevant answer
Answer
Thanx Jochen. Have a nice weekend.
  • asked a question related to Applied Biostatistics
Question
6 answers
The hsa-miR-4454+hsa-miR-7975 probe shows high read counts in most of the available NanoString data, including ours, while sequencing analysis does not support the abundance of these microRNAs. Has anybody had the same issue with microRNA NanoString data?
Relevant answer
Answer
Sorry, but none of these answers addresses the question, which is why hsa-miR-4454+hsa-miR-7975 gives rise to high counts in NanoString, whereas with RT-qPCR or miRNA-Seq this microRNA is not detectable.
Sorry, Ali, but I do not have an answer yet.
  • asked a question related to Applied Biostatistics
Question
3 answers
Hi,
I am just starting to work with DESeq. I have a question regarding the basic biological interpretation of DESeq-based differential gene expression. There are two situations, listed below, and I would like to know which one is more biologically relevant.
I have two treatment groups, treatment 1 and treatment 2, and I am comparing them with a control group, all with three replicates. I devised my study as follows:
1. I create one dataframe containing the counts of all 9 count files, and from this dataframe I make the comparisons T1 vs Control, T2 vs Control and T2 vs T1.
2. I create a dataframe every time I make a comparison: when comparing T1 vs Control, I create a dataframe with 6 count files; when comparing T2 vs Control, I create another dataframe with 6 count files.
I want to know which of these two design strategies will give me a more accurate result as to what effects T1 and T2 cause compared with the control, and how T1 and T2 differ as well as resemble each other.
Relevant answer
Answer
Hi Mrigaya,
I would certainly not split the analysis in multiple data frames. DESeq will estimate some negative binomial parameters based on your data, and the more data you provide, the more reliable the estimates. Also, you want to work with the same estimates for all comparisons.
There are ways of putting the 3 comparisons into one model but the specific details depend on which contrasts you are interested in. I wouldn't know the exact syntax, but I'm sure that the community can help you if you specify precisely what you want to test. 
By the way: there are many good software packages for the detection of differential expression. For more complex designs, I personally really like EBSeq, which has a very straightforward syntax for testing multiple hypotheses. Just saying.
Good luck!
Rik
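A minimal sketch of the single-model approach in DESeq2 (the successor of DESeq): all 9 samples inform the shared dispersion estimates, and each comparison is pulled out as a contrast. The toy count matrix is simulated; with real data you would plug in your own counts and column data:
# One model for all 9 samples; comparisons extracted as contrasts
library(DESeq2)
count_matrix <- matrix(rnbinom(9 * 1000, mu = 100, size = 1), ncol = 9)  # toy counts
col_data <- data.frame(condition = factor(rep(c("Control", "T1", "T2"), each = 3)))
dds <- DESeqDataSetFromMatrix(countData = count_matrix, colData = col_data,
                              design = ~ condition)
dds <- DESeq(dds)
results(dds, contrast = c("condition", "T1", "Control"))
results(dds, contrast = c("condition", "T2", "Control"))
results(dds, contrast = c("condition", "T2", "T1"))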
  • asked a question related to Applied Biostatistics
Question
1 answer
what genes are under study?
Relevant answer
Dear Martin,
if you have information about the real size of the total target population (in your case, the Venezuelan mestizo population), you can calculate the research sample size here:
But if you don't have information about the size of this specific population, then your research sample would be purposive, and the number of respondents is not strictly defined (as many respondents as you can reach is ok).
In both cases, if the difference between the genes is important for the research, then you should have a (nearly) equal number of respondents from each category (min. 30 respondents from each sub-sample).
  • asked a question related to Applied Biostatistics
Question
4 answers
I am analysing abundance data using PRIMER 7 and I am a bit confused about how to pre-treat the data before carrying out SIMPER. I don't know whether I have to standardise the samples by total or whether I need to standardise the variables (species).
Many thanks!
Many thanks!
Relevant answer
Answer
Hello Paz Aranega Bou.
When the unit of sampling cannot be tightly controlled, standardisation (of the samples by total) may be necessary (Clarke and Gorley 2006); this turns abundance data into values of relative abundance (percentages).
Clarke K., Gorley R. 2006. Primer v6: User Manual/Tutorial. PRIMER-E, Plymouth, UK, 193 pp.
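Outside Primer, this sample-by-total standardisation is a one-liner; a small Python/pandas sketch with made-up abundances:

```python
import pandas as pd

# Rows = samples, columns = species; values are raw abundances (made up).
abund = pd.DataFrame(
    {"sp1": [12, 0, 5], "sp2": [30, 8, 2], "sp3": [3, 14, 9]},
    index=["site_A", "site_B", "site_C"],
)

# Divide each sample by its total: abundances become relative abundances (%).
rel_abund = abund.div(abund.sum(axis=1), axis=0) * 100
print(rel_abund.round(1))
```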
  • asked a question related to Applied Biostatistics
Question
12 answers
The power or sensitivity of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true. Should this be addressed before every clinical study?
Relevant answer
Answer
A priori power analysis is intended to avoid studies that can't address their primary question or studies that waste precious resources by being larger than they need to be. They also force you to define your primary question and think about clinically meaningful effect sizes. In that sense, I regard them as mandatory. As a result, I tend to reject grants or papers sent to me for review that don't include a sample size calculation. That said,  most statisticians eschew a posteriori power calculations, since once the study is done, you either saw the effect or didn't. For the latter, it's a matter of taste and philosophy.
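For illustration, a minimal a priori calculation with statsmodels in Python (the effect size, power and alpha below are conventional placeholders, not recommendations for any particular study):

```python
from statsmodels.stats.power import TTestIndPower

# Sample size per group for a two-sample t-test, to detect a standardized
# effect of Cohen's d = 0.5 with 80% power at two-sided alpha = 0.05.
n_per_group = TTestIndPower().solve_power(effect_size=0.5, power=0.80, alpha=0.05)
print(round(n_per_group))  # roughly 64 per group
```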
  • asked a question related to Applied Biostatistics
Question
9 answers
I am currently working on a functional candidate gene study. I have three SNPs and need to calculate the allele substitution effect and the additive and dominance effects for single-SNP marker associations with quantitative traits (i.e. body weight). Can anyone suggest the best statistical way to do this?
Relevant answer
Answer
The easiest way would be through a linear regression (you can use R, SAS, even MS Excel). All you have to do is code your alleles as 0-1-2 {AA, Aa, aa}, then regress the phenotype on that SNP. As simple as that: the beta coefficient is your allele substitution effect.
For dominance, the procedure is the same, but you code your alleles as 0-1-0 {AA, Aa, aa}. You may want to consider fitting both additive and dominance effects at the same time, in which case the regression coefficient for dominance indicates what kind of dominance this locus displays (underdominance, overdominance, complete dominance, etc.).
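A minimal Python sketch of this joint fit, using simulated genotypes and phenotypes in place of real data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

genotype = rng.integers(0, 3, size=200)        # 0 = AA, 1 = Aa, 2 = aa
additive = genotype                            # 0-1-2 coding
dominance = (genotype == 1).astype(int)        # 0-1-0 coding
weight = 50 + 2.5 * additive + 1.0 * dominance + rng.normal(0, 3, 200)

# Fit both effects at once: the coefficient on 'additive' is the allele
# substitution effect, the one on 'dominance' shows the type of dominance.
X = sm.add_constant(np.column_stack([additive, dominance]))
fit = sm.OLS(weight, X).fit()
print(fit.params)  # [intercept, additive effect, dominance effect]
```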
  • asked a question related to Applied Biostatistics
Question
3 answers
This
Relevant answer
Answer
Please be more specific about your question. Beyond that, there is probably no "best" method, just more or less appropriate ones.
  • asked a question related to Applied Biostatistics
Question
5 answers
Hi. What statistics can I use if my protocol is: 3 different animals to obtain cells for culture experiments, then the same 3 treatments/interventions on 3 dishes at 1 and 2 h? The animals, cell culture experiments, buffers etc. are all done/made fresh on separate days.
Relevant answer
Answer
Yes.
If you have (for instance) several measurements per dish, several dishes per cell extraction, and several extractions per animal, then you may use a multilevel (hierarchical) model, which can additionally tell you at which level of (technical) replication the most variance is introduced (that may be helpful for planning further experiments). For your final scientific question, however, it will (should) give you the same amount of information as is obtained from the averages per animal.
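One possible way to set this up (a sketch only, assuming a hypothetical long-format file with columns y, treatment, animal and dish) is statsmodels' MixedLM with a dish-level variance component nested inside animal:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file: one row per measurement.
df = pd.read_csv("culture_measurements.csv")  # columns: y, treatment, animal, dish

model = smf.mixedlm(
    "y ~ treatment",                     # treatment as fixed effect
    data=df,
    groups="animal",                     # random intercept per animal
    vc_formula={"dish": "0 + C(dish)"},  # dish-to-dish variance within animal
)
print(model.fit().summary())
```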
  • asked a question related to Applied Biostatistics
Question
5 answers
I have 4 time points, and for each time point I have 3 animals. From each animal cells are collected, which can be either negative or positive. I need to compare the time points to see whether the proportion of positive vs. negative cells differs. I am thinking about a chi-square test. Thanks for the help.
Relevant answer
Answer
Ok, then it is not "repeated measures". The best option is still a logistic regression; you just do not need to consider "animal" (neither as a random nor as a fixed factor).
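A minimal sketch of such a binomial (logistic) regression in Python, with made-up cell counts and time as a categorical factor:

```python
import pandas as pd
import statsmodels.api as sm

# Made-up counts: positive and negative cells per animal per time point.
df = pd.DataFrame({
    "time": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "pos":  [12, 9, 15, 20, 18, 25, 30, 28, 33, 26, 31, 24],
    "neg":  [88, 91, 85, 80, 82, 75, 70, 72, 67, 74, 69, 76],
})

# Binomial GLM on (successes, failures); time entered as a factor.
endog = df[["pos", "neg"]].to_numpy()
exog = sm.add_constant(
    pd.get_dummies(df["time"], prefix="t", drop_first=True).astype(float))
fit = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
print(fit.summary())
```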
  • asked a question related to Applied Biostatistics
Question
3 answers
Dear all, how do I normalize my data if it is not normalized after applying log10 and square-root transformations?
Relevant answer
Answer
Which transformation to apply depends on the data; guidance is available in standard texts.
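One data-driven option beyond log10 and square root (my addition, not from the answer above) is the Box-Cox family of power transformations, which estimates the power that brings the data closest to normality; a small sketch on simulated right-skewed data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=1.0, sigma=0.8, size=200)  # right-skewed, positive

print("skewness raw:  ", stats.skew(x))
print("skewness log10:", stats.skew(np.log10(x)))
print("skewness sqrt: ", stats.skew(np.sqrt(x)))

# Box-Cox requires positive data and estimates the best lambda itself.
x_bc, lam = stats.boxcox(x)
print("Box-Cox lambda:", lam, "-> skewness:", stats.skew(x_bc))
```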
  • asked a question related to Applied Biostatistics
Question
2 answers
I'm developing a cost-effectiveness analysis where I need to calculate time-dependent reintervention rates derived from published sources. I have multiple studies with different follow-up times and different presentations of the data (Kaplan-Meier risk estimates, cumulative probabilities...) 
How would I go about calculating a yearly probability of recurrence that could be applied consistently throughout my model, reducing the probability by a constant factor every year?
Relevant answer
Answer
Your best bet is to convert everything to cumulative density functions. That's easy if you know the distribution (Poisson, Weibull, or whatever) and its parameters. The Kaplan-Meier plot is essentially a non-parametric CDF. See, e.g., this link
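And once everything is on a cumulative scale, a reported cumulative probability converts to a constant yearly probability in one line, if you are willing to assume a constant hazard over the horizon (the 30%-at-5-years figure below is invented):

```python
def annual_probability(p_cum, years):
    """Yearly probability implied by a cumulative probability, assuming a
    constant hazard: 1 - p_cum = (1 - p_year) ** years."""
    return 1 - (1 - p_cum) ** (1 / years)

# E.g. a study reporting 30% cumulative reintervention at 5 years:
print(round(annual_probability(0.30, 5), 4))  # ~0.0689 per year
```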
  • asked a question related to Applied Biostatistics
Question
8 answers
I'm analyzing data from experiments, where in a net pen we exposed fish to different noise frequencies. However I'm uncertain as to what statistical tests I should use to test for possible differences before/during/after the exposure.
We have data in one second intervals of the area of the school of fish and the velocity of the centroid of the school (plus X/Y coordinates), I'm interested in testing for differences within and between different exposures. Data for each exposure starts 60 seconds before the onset of sound, which lasts for about two minutes, and ends with a 60 second tail after the noise has ended.
Should one use time series analysis, like ARIMA, ARMA or ARIMAX to test the data, or something different altogether? At the moment I'm using SPSS to test the data.
Relevant answer
Answer
Stephen,
   What an interesting idea, and especially cool if there is sufficient data to support using the method. The article can be found at:
One great thing about RG is that you sometimes find articles from other disciplines that you would never ever find otherwise.
  • asked a question related to Applied Biostatistics
Question
2 answers
Population A (Formicidae) 
Tajima's D Value = -2.66825 (p <0.001)
h = 0.3845 ± 0.0724
π = 0.002474 ± 0.001572
Population B (Formicidae)
Tajima's D Value = -1.40150 (p <0.01)
h = 0.6268 ± 0.0452
π = 0.001186 ± 0.000921
(Fu Fs test was not significant for any of the populations)
Relevant answer
Answer
I concur with Paul Chiou on this.
  • asked a question related to Applied Biostatistics
Question
3 answers
I am trying to find the best correlation between a biotic and an environmental dataset. My question is: if I have one variable in the biota dataset and several variables in the environmental dataset, can I apply the BEST analysis?
Relevant answer
Answer
Yes you can use BEST with a single response variable.  For a single response variable you use Euclidean distance to make the 'target' matrix, then search for the subset of (transformed) explanatory variables that gives the best match (with whatever the appropriate resemblance measure for those variables is). Other methods are available (as alluded to in previous answers) but each has its own set of assumptions.
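For illustration only, a toy Python version of the BEST/BIOENV idea with a single response variable (random data; a real analysis would use Primer's implementation and a resemblance measure appropriate to the explanatory variables):

```python
from itertools import combinations

import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
biota = rng.normal(size=30)          # the single response variable
env = rng.normal(size=(30, 4))       # four environmental variables

# Target matrix: Euclidean distances of the single response.
target = pdist(biota[:, None], metric="euclidean")

# Rank-correlate the target with the distance matrix of every subset of
# normalised explanatory variables and keep the best match.
env_z = (env - env.mean(axis=0)) / env.std(axis=0)
best = max(
    (spearmanr(target, pdist(env_z[:, list(sub)]))[0], sub)
    for r in range(1, env.shape[1] + 1)
    for sub in combinations(range(env.shape[1]), r)
)
print("best rho = %.3f with variable subset %s" % best)
```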
  • asked a question related to Applied Biostatistics
Question
6 answers
I have collected leaf temperature data (ranging from 29 °C to 32 °C) for different organic matter levels. Can I perform an ANOVA to test the difference in leaf temperature between organic matter levels, or what is the suitable statistical test for this?
Relevant answer
Answer
Yes, I think ANOVA is the correct method with treatment as the fixed effect.  Random effect would be reps nested within treatment.  Variation among samples is the residual term.  
Alternatively, you could average the samples first and then the model would only include the fixed treatment effect.
In either case remember to verify the assumptions of normality and homogeneity of residuals before making any conclusions about the treatment effect.
Just wondering, how large was each EU? And the distance between EUs was 25 cm? That doesn't seem very far apart. Why 10 samples per unit? Were you expecting a lot of variation within each unit? You may want to consider using more reps and fewer samples in the future. With only 3 reps the inference space is rather limited. Were the samples taken all at the same time or over time? If over time, then you have a repeated-measures ANOVA, which is a bit more complex to analyze.
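A minimal sketch of the assumption checks plus the one-way ANOVA in Python, using invented per-rep averages for three organic-matter levels:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Invented leaf temperatures (deg C), one value per rep after averaging samples.
low = rng.normal(31.5, 0.4, 3)
mid = rng.normal(30.5, 0.4, 3)
high = rng.normal(29.5, 0.4, 3)

# Check normality and homogeneity of the residuals, then test.
resid = np.concatenate([g - g.mean() for g in (low, mid, high)])
print("Shapiro-Wilk:", stats.shapiro(resid))
print("Levene:      ", stats.levene(low, mid, high))
print("ANOVA:       ", stats.f_oneway(low, mid, high))
```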
  • asked a question related to Applied Biostatistics
Question
1 answer
ADMA has been shown to inhibit nitric oxide synthase (NOS). I am looking at the production of nitrite in HMEC-1 cells treated with 1 μM, 5 μM, 10 μM, 50 μM and 100 μM ADMA, as well as in control cells treated with DPBS.
Relevant answer
Answer
It sounds like you have only one response and 6 treatments. In this case a one-way ANOVA could be used, provided the data meet the assumptions of this test, but more details about the experimental design are necessary to give you a better and more accurate answer.
  • asked a question related to Applied Biostatistics
Question
1 answer
Hi 
I was wondering if anyone could help with my data analysis for a TaqMan low-density array assay I have run, as I am quite new to this and am unsure whether I have done it correctly and where to go next.
Basically, I am assaying the expression of 96 genes in tissue and models. I have run 5 tissue samples (separate patients) and 2 models (5 repeats of 1 and 3 repeats of the other) on a gene card. 
I have used NormFinder to identify the best combination of housekeeping genes (using the median of each model). I then found the ∆Ct of each data point using the geometric mean of the housekeepers, removed outliers using Grubbs' test, found the mean of my two models (i.e. the mean of the 2 means) and the mean of the tissue, and calculated the fold change by dividing one by the other. I've subsequently taken log2 of the fold changes. I now want to find whether the changes are significant. I was going to use a t-test, but my sample sizes are too small to test normality, and I've read that Mann-Whitney U tests are hard to conduct with low sample sizes.
Have I analysed the data correctly thus far and can anyone recommend a test for the significance?
Thanks for all your help! 
  • asked a question related to Applied Biostatistics
Question
3 answers
For example: I'm evaluating 50 genotypes and I have two treatments, say 'A' as control and 'B' as treatment. Can we analyze the genetic diversity of these genotypes through Mahalanobis D2 statistics on both treatments together, or do we have to evaluate them separately? Can anyone shed some light on this matter?
Relevant answer
Answer
I have discriminated six different types of ordinary datasets and six microarray datasets using 10 discriminant functions. See my papers from 2012 onwards.
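Coming back to the original question, a minimal sketch of computing Mahalanobis D2 between two genotype groups from a pooled covariance matrix (simulated trait data; extending to 50 genotypes means repeating this for every pair):

```python
import numpy as np

rng = np.random.default_rng(3)
# Simulated trait matrices for two genotypes (rows = plants, cols = traits).
g1 = rng.normal([10, 5, 2], 1.0, size=(20, 3))
g2 = rng.normal([12, 4, 2], 1.0, size=(20, 3))

# Mahalanobis D^2 between group means, using the pooled covariance matrix.
pooled = (np.cov(g1, rowvar=False) * (len(g1) - 1) +
          np.cov(g2, rowvar=False) * (len(g2) - 1)) / (len(g1) + len(g2) - 2)
diff = g1.mean(axis=0) - g2.mean(axis=0)
d2 = diff @ np.linalg.inv(pooled) @ diff
print("Mahalanobis D2 =", round(float(d2), 3))
```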
  • asked a question related to Applied Biostatistics
Question
3 answers
Dear friends,
There are means and SDs of two parameters, blood glucose and serum insulin, which have been extracted from about 14 articles for a meta-analysis.
Now I want to obtain a ratio between these parameters.
  • How should I obtain or estimate the standard deviation of this ratio, which has not been reported in the articles?
I would be pleased if someone could answer.
Relevant answer
Answer
You can estimate the standard deviation of the ratio using propagation of errors. See the link below, and see the "Multiplication or Division" row in Table 1.
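That row boils down to a one-line formula; a small Python sketch (it assumes the two parameters are independent, and the example means and SDs are invented):

```python
import math

def ratio_sd(mx, sx, my, sy):
    """First-order (delta-method) SD of x/y for independent x and y:
    SD(x/y) ~= |x/y| * sqrt((sx/x)^2 + (sy/y)^2)."""
    return abs(mx / my) * math.sqrt((sx / mx) ** 2 + (sy / my) ** 2)

# E.g. glucose 5.5 +/- 0.8 mmol/L and insulin 60 +/- 15 pmol/L:
print(round(ratio_sd(5.5, 0.8, 60, 15), 4))  # ~0.0265
```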
  • asked a question related to Applied Biostatistics
Question
3 answers
Detailed steps for both manual and computerized analysis would be appreciated.
Relevant answer
Answer
Hi Umar,
If there is homogeneity, you can just add them up. But since there is a confounding variable, the traditional method will definitely lose some information if you simply add all the k contingency tables together to make a single 2 by 2 table.
I recommend the Mantel-Haenszel method.
$\mu_k = E(n_{11k}) = n_{1+k}\,n_{+1k}/n_{++k}$
$\nu_k^2 = \mathrm{Var}(n_{11k}) = n_{1+k}\,n_{2+k}\,n_{+1k}\,n_{+2k}\,/\,[n_{++k}^2\,(n_{++k}-1)]$
The statistic is $\left[\sum_k n_{11k} - \sum_k \mu_k\right]^2 / \sum_k \nu_k^2$.
hope this will help you.
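If you would rather not compute this by hand, statsmodels implements the Mantel-Haenszel machinery; a minimal sketch with invented stratified tables:

```python
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

# Invented 2x2 tables, one per stratum of the confounder.
tables = [
    np.array([[20, 30], [10, 40]]),
    np.array([[15, 25], [12, 38]]),
    np.array([[8, 22], [5, 35]]),
]

st = StratifiedTable(tables)
print("pooled odds ratio:", st.oddsratio_pooled)
print("CMH test:", st.test_null_odds(correction=True))
```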
  • asked a question related to Applied Biostatistics
Question
1 answer
We are using an alpha lattice design in our genetic designs, and we have some questions:
1. We need the expected mean squares for the sources of variation in order to calculate the genetic variance components in alpha lattice and rectangular lattice designs.
2. How can we use covariance analysis in an alpha lattice to adjust the treatments?
3. Is it possible to get the adjusted data in order to obtain adjusted means?
4. Do the above items still apply when the alpha lattice is used in a combined analysis?
Can we have the procedures for these in SAS or CropStat?
I would be very grateful.
  • asked a question related to Applied Biostatistics
Question
6 answers
Which statistical test will help me compare the three zones and state their significance? Are there any indices by which I can calculate the association/proximity of two species?
Relevant answer
Answer
What are you interested in comparing? The species number and the species composition among the zones? If that's the case, a good start would be to read:
Marti J. Anderson 2001. A new method for non-parametric multivariate analysis of variance, and
Chao et al. 2014. Rarefaction and extrapolation with Hill numbers: a framework for sampling and estimation in species diversity studies
  • asked a question related to Applied Biostatistics
Question
3 answers
I want to know which statistical tests I should apply for the experiments below.
1. Particle size of nanoparticles checked every day for a week, with 3 readings on each day. I want to correlate this with the stability of the nanoparticles over time. Is the correlation coefficient the best choice?
2. Cumulative drug release from nanoparticles over 48 h at different time points under two treatment conditions, pH 5.2 and pH 7.4.
Thanks.
Relevant answer
Answer
Hi Hardik,
From your description, a linear fixed effects model would be suitable for your first question, and a linear mixed effects model for your second.
  • asked a question related to Applied Biostatistics
Question
6 answers
For example, I have two groups of samples. I calculated the distance-decay rates of the two groups separately and got two values, DDRa and DDRb. How can I test the difference between the two values?
Thank you for your reply!
Relevant answer
Answer
Another way to compare two slopes from two different groups is to use the bootstrap. Bootstrap each group, generating the two slopes. Do this 1000 or 10,000 times saving the results. You will have a distribution of the slopes and how often they overlap. Probabilities can be calculated from the distribution. Wish you well.
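A minimal sketch of that bootstrap in Python, with simulated distance-decay data for the two groups:

```python
import numpy as np

rng = np.random.default_rng(4)

def boot_slopes(x, y, n_boot=5000):
    """Bootstrap distribution of the OLS slope by resampling (x, y) pairs."""
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))
    return np.array([np.polyfit(x[i], y[i], 1)[0] for i in idx])

# Simulated distance-decay data for groups a and b.
xa = rng.uniform(0, 10, 50)
ya = 1.0 - 0.05 * xa + rng.normal(0, 0.1, 50)
xb = rng.uniform(0, 10, 50)
yb = 1.0 - 0.08 * xb + rng.normal(0, 0.1, 50)

diff = boot_slopes(xa, ya) - boot_slopes(xb, yb)
p = 2 * min((diff <= 0).mean(), (diff >= 0).mean())  # two-sided bootstrap p
print("95% CI of slope difference:", np.percentile(diff, [2.5, 97.5]), "p ~", p)
```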
  • asked a question related to Applied Biostatistics
Question
1 answer
I'm planning on doing a family-based association study in which I'll be doing exome analysis on twenty trios with affected probands. I've noticed the affected phenotype runs in families and I want to seek out associated variants. Since I am no statistician (the software does most of that work), I'm not sure how to write the statistical considerations section of my IRB. Does anyone have a template or an example of something close that I can work from? It's the only thing that's holding up my protocol in my university's IRB dept.
If it helps, the proposed experimental design mirrors the one used in this study: http://www.ncbi.nlm.nih.gov/pubmed/?term=25737299
  • asked a question related to Applied Biostatistics
Question
4 answers
I have a study group (N subjects with disease A) and a control group (2N subjects WITHOUT disease A). I want to compare the two groups in terms of outcomes (categorical and continuous variables). Which tests should be applied?
Relevant answer
Answer
For the comparison of two independent groups, an independent-samples t-test (or the Mann-Whitney nonparametric test) is appropriate for continuous outcomes, and a chi-squared or Fisher's exact test for categorical outcomes. (The Wilcoxon signed-rank test is for paired data, which does not apply here.)
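A small sketch of these tests in Python/scipy, with simulated outcomes for an N vs 2N design:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
disease = rng.normal(5.2, 1.0, 40)   # continuous outcome, study group (N)
control = rng.normal(4.8, 1.0, 80)   # continuous outcome, control group (2N)

print(stats.ttest_ind(disease, control))     # parametric
print(stats.mannwhitneyu(disease, control))  # nonparametric alternative

# Categorical outcome: counts of outcome present/absent in each group.
table = np.array([[18, 22],   # disease group
                  [20, 60]])  # control group
print(stats.chi2_contingency(table))
print(stats.fisher_exact(table))
```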
  • asked a question related to Applied Biostatistics
Question
3 answers
Hi!
I have several sets (e.g. 100) of essential genes + non-essential ones, all were extracted from a certain database of essential genes for a particular disease under different conditions. How can I compare these sets to check if the results obtained under different conditions (100) are significant?
For instance, set #1 contains 2000 genes containing 100 essential genes (found in the hypothetical database) + 1900 non-essential genes, and so forth.
Thanks in advance
Relevant answer
Answer
The "enrichment" of essential gene in a set can be tested with Fisher's exact test (or the chi-squared test as approximative test). This is alos called "overrepresentation analysis".
  • asked a question related to Applied Biostatistics
Question
4 answers
Hi,
I have a data set of 20 algae sampled every 2 months for a year for % lipid. I am looking for the correct model to analyse the change in % lipid over time, but seem to be running into roadblocks everywhere I turn:
-The data is heteroscedastic
-There is data missing for some individuals at different time points
-The samples taken in the month of spawning have huge increases in % lipid followed by a large decrease the next month (so non-linear?)
I know that repeated-measures ANOVA is not an option due to the missing data points. But is a GLM possible given the heteroscedasticity and non-linearity? Or would a non-linear model be more appropriate?
I have trawled the internet for answers but there are a lot of opinions, contradictions, and baffling results.
Any guidance would be greatly appreciated.
Relevant answer
Answer
If you want to go Bayesian, you can use a hierarchical Bayesian model with mixed effects.
These kinds of models behave quite well with missing data. For instance, if you have no other variables besides time, you can consider a transformation of the response as Jochen Wilhelm said, e.g. the logit. In this case, letting $y$ represent the transformed data, $y \sim N(m(t), \sigma)$ with $m(t) = \beta_0 + \beta_1 t + b_0 + b_1 t$, where $\mathbf{b} = (b_0, b_1) \sim N(0, \Sigma)$ is a vector of random effects, and the relation is linear. But if you suspect that the relation is not linear, you can add $t^2$ and $t^3$ terms and the respective random effects. From my point of view this is the simple strategy; for a more complex one you can think about splines.
  • asked a question related to Applied Biostatistics
Question
4 answers
Almost always in clinical research, the log-rank test (Mantel-Haenszel test) is employed to test the equality of two survival curves. Is there a good way to decide whether other tests may be more sensitive in detecting differences between the groups? For example, if the Kaplan-Meier curves cross as opposed to being roughly parallel over time? Does the total number of events influence the choice of the test statistic? What if we are also adjusting for a covariate, does this situation affect which test should be used to compare the two survival curves?
Relevant answer
Answer
Thanks a lot Paul, very helpful!
-David
  • asked a question related to Applied Biostatistics
Question
6 answers
I proposed a new method for estimating the weights of 8 elements for the description of a pattern. Each weight ranges from 0 to 1, and the sum of the 8 elements' weights for the description of each pattern is equal to 1.
I want to compare my estimated weights with those given by a standard method. Using the Bland-Altman plots attached below, the agreement limits seem not to be acceptable for my study. In fact, a difference of 0.1 between paired results (obtained with the new method and the standard one) is really important. So I need to define a difference limit between paired results to judge whether the compared methods are convergent or not.
Can it be defined arbitrarily, or is there a method to do it?
Relevant answer
Answer
The standard limits of agreement are based on the mean of the differences between the two ratings (generally two measurements or assessments) and the standard deviation (SD) of those differences, defined as
(mean − 2 SD, mean + 2 SD),
or more precisely 1.96 SD rather than 2 SD.
But you could additionally define something like "logical agreement limits", which are not defined statistically but come from the definition of the task.
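A minimal sketch of both kinds of limits in Python, with simulated paired weights (the ±0.05 "logical" limit is just an example):

```python
import numpy as np

rng = np.random.default_rng(6)
new = rng.uniform(0, 1, 40)             # weights from the new method
std = new + rng.normal(0, 0.03, 40)     # weights from the standard method

diff = new - std
bias = diff.mean()
loa = (bias - 1.96 * diff.std(ddof=1), bias + 1.96 * diff.std(ddof=1))
print("bias = %.4f, 95%% limits of agreement = (%.4f, %.4f)" % (bias, *loa))

# Compare against an a priori "logical" limit coming from the task itself:
print("within +/- 0.05 of each other:", (np.abs(diff) <= 0.05).mean() * 100, "%")
```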
  • asked a question related to Applied Biostatistics
Question
4 answers
Hi all,
I would need to calculate the percentage of ice-free area in a certain radius around each of about 200 sampling points in Antarctica. I guess this would be a similar approach as to calculate vegetation cover etc.?
A long time ago I learned some GIS basics using Idrisi, but I haven't used it in 7-8 years and it was very basic. Now, I quite urgently need this data (2-3 weeks, but the sooner the better :) ).
Therefore, any advice to get me started is welcome.
I have the possibility to use ArcGIS and QGIS.
I came across Quantarctica for QGIS, which might provide the map. Any other sources for georeferenced maps of Antarctica?
I guess I can then plot the samples' coordinates, but do I then have to manually delineate the ice-free regions by drawing polygons? Or are there layers available giving the ice-free surface or, conversely, ice coverage, such that I would only need to subtract this from the circle surface?
Can such an analysis be automated?
Sorry for these probably very basic questions. If you can provide me with a good crash-course tutorial (on GIS in general or a similar problem), that would be very welcome too :)
Thanks in advance!
Bjorn
Relevant answer
Answer
Hi Bjorn,
There are many ways of answering this question. And of course you can do this yourself. The internet has made our lives much easier nowadays. If you have ArcGIS or any other GIS software it can be done by just typing some keywords on the web. Moreover, there are thousands of online youtube videos that can assist you further. I have the following suggestion for you to give you a head start:
1. After you start ArcMap there is a possibility to add a Basemap Layer. See if you can find Antarctica base Maps.
2. Once you have the base map, you can import the coordinates of the sampling points from an excel sheet and convert it into a point shape file.
3. Draw a buffer around each point. This is an automatic process.
The rest I think you can figure out yourself. Check the links that I have attached. 
However, the questions that I have for you are these: are you interested in any specific dates, or have you collected any satellite imagery of your study area in Antarctica? Is the area you are studying covered by ice all year round, or does the coverage change with the seasons?
  • asked a question related to Applied Biostatistics
Question
4 answers
I have 4 sites with a total of 22 species (i.e. site1 has 7 species, site 2 has 15, etc.). I also have multiple weeks’ species abundance data for each site. I want to analyze the species diversity on temporal and spatial scale based on high throughput sequencing.
Several methods have been proposed to compare sites for species richness, many of which use only presence/absence data. I want to use abundance data, and do the following:
 -use entire dataset, compare the similarity statistically, and obtain an optimum species richness/diversity value (say x number of species needed to reach 95% coverage of the whole dataset)
-use subsampled dataset (time-wise and site-wise), and analyze at which stage the previously obtained optimum number of species is reached (i.e., at 2 months of sampling instead of 12 or in one site instead of 4, etc.)
 Any recommendations on the use of abundance data for answering these questions?  
Relevant answer
Answer