# Biostatistical Methods - Science topic

Questions related to Biostatistical Methods

Hi,

I have performed an epidemiological survey on insomnia prevalence using the Insomnia Severity Index (ISI) and would like to test internal consistency using Cronbach's alpha. I could not find a worked example of estimating it for each survey question. I would be grateful for your expert assistance.

I would appreciate your help in enhancing my knowledge.
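For readers looking for a worked example, here is a minimal Python sketch of Cronbach's alpha and of "alpha if item deleted", which is the usual per-question summary. The response matrix `demo` is hypothetical, the function names are illustrative, and complete cases are assumed (no missing answers).

```python
from statistics import variance

def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (complete cases only)."""
    k = len(rows[0])                        # number of items
    items = list(zip(*rows))                # transpose: one tuple per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - item_var_sum / total_var)

def alpha_if_item_deleted(rows):
    """Alpha recomputed with each item left out in turn."""
    k = len(rows[0])
    return [cronbach_alpha([[v for j, v in enumerate(r) if j != i] for r in rows])
            for i in range(k)]

# Hypothetical responses: 5 respondents x 7 ISI items (each scored 0-4).
demo = [
    [0, 1, 0, 1, 0, 1, 0],
    [2, 2, 1, 2, 2, 1, 2],
    [3, 4, 3, 3, 4, 3, 3],
    [1, 1, 2, 1, 1, 2, 1],
    [4, 3, 4, 4, 3, 4, 4],
]
print(round(cronbach_alpha(demo), 3))
print([round(a, 3) for a in alpha_if_item_deleted(demo)])
```

An item whose removal raises alpha noticeably is the one that agrees least with the rest of the scale.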

In many biostatistics books, the negative sign of the calculated t value is ignored.

In a left-tailed t test, we include a minus sign in the critical value.

E.g., result of a left-tailed paired t test:

calculated t value = -2.57

critical value = -1.833 (df = 9; level of significance 5%; minus sign included since it is a left-tailed test)

Now we can accept or reject the null hypothesis.

If we do not ignore the negative sign, i.e. -2.57 < 1.833, the null hypothesis is accepted.

If we ignore the negative sign, i.e. 2.57 > 1.833, the null hypothesis is rejected.
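The two comparisons above disagree only because the sign is kept on the calculated value but dropped from the critical value. A minimal Python sketch of the consistent rule for a left-tailed test, using the numbers from the question (the function name is illustrative):

```python
def reject_left_tailed(t_calc, t_crit_abs):
    """Left-tailed test: reject H0 when t_calc falls below -|critical value|."""
    return t_calc < -abs(t_crit_abs)

t_calc = -2.57   # calculated paired-t statistic from the question
t_crit = 1.833   # tabulated one-tailed 5% value for df = 9

# Signs kept on both sides: is -2.57 < -1.833 ?
print(reject_left_tailed(t_calc, t_crit))

# Signs dropped on both sides: is 2.57 > 1.833 ?
# This gives the same decision, but only because t_calc is negative.
print(abs(t_calc) > abs(t_crit))
```

Dropping the sign is safe only when the calculated t lies on the side of the alternative hypothesis; a positive t can never reject in a left-tailed test, however large it is.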

I have 16S data sequenced on the Illumina MiSeq platform. The data come from an experiment testing the effects of different aquaculture additives on the growth and survival of larval sablefish. It consisted of 18 tanks with 6 replicates of 3 water treatments: clay, algae, and algae with a switch to clay after one week. I'm interested in the effects these additives have on the skin microbiome of the larval sablefish. The 16S data are from water samples from the tanks and from swab samples off the surfaces of 8-12 sablefish (to control for interindividual variation). There were also 3 different genotypic crosses used, so that there were 2 replicates of each genotype for each of the 3 treatments.

I have sets of water and swab data from all 18 tanks for 3 time points (each a few days apart).

I'm interested in the following:

1) How reflective are the skin microbial communities of the surrounding seawater? (i.e. are they similar or very different from one another?)

For this question, I was thinking about using the weighted UniFrac measure and generating PCoA plots that include both the water and swab samples to see if they cluster together. I think that will be the most informative as it considers relative abundance and phylogeny, and that's something I'm interested in. Beyond that, I'm unsure if that's the most appropriate measure to use, if I should use additional measures like Bray-Curtis or unweighted UniFrac, and what statistical tests to use beyond that.

2) A. How is skin microbial composition/structure different between water treatments?

B. How does it change over time, with respect to each treatment?

C. How does the similarity between skin and water communities change over time?

For some of these questions, I was thinking of using a generalized linear model in R, but beyond that I'm really unsure of where to start.

3) How much of an effect does genotype play in the formation of the skin microbiome?

I was thinking maybe using a generalized linear mixed effects model (using genotype as a random effect, and seeing how that might be different than using it as a fixed effect, but seeing as genotype is the only random effect in this study, then I don't know if that's appropriate). I could also use a generalized linear model to see if there's an interaction between genotype and treatment, and how much of an effect genotype has on its own.

Beyond what I've stated above, I'm unsure of which indices would be best to use (Shannon, Simpson, Chao1, etc), which statistical tests to use (since they come with their own assumptions and have their own limitations), which models to run, etc. Statistics in an ecological context is something I'm still learning, and I'm not very familiar with multivariate approaches. I am, however, familiar with R and QIIME.

Any and all assistance is greatly appreciated. Anything to at least point me in the right direction. Thank you in advance!

Contact by E-mail: Kawsar_Ahmed@csu.edu.cn

Or, for further discussion, you can set up an online meeting and send me the link. Thanks in advance.

Hello everyone. I am currently working on comparing antioxidant potency between my samples.

I have 4 samples = water kefir, water kefir infused with butterfly pea, water kefir infused with turmeric and water kefir infused with matcha green tea.

For now, I am comparing them on the basis of how they fare in the two assays I have done, the DPPH and FRAP assays. However, I am doubtful about this comparison, as the two assays use different units and measure different things.

I recently encountered the concept of the relative antioxidant capacity index (RACI) in a few recently published papers. However, I am still not clear on how I could apply this concept to my results.

Could anyone provide any insight on how to use this concept to compare the antioxidant potency of different samples?

Hi,

I have performed an insomnia prevalence study among academics using the ISI. I have come across the floor and ceiling effect in cross-sectional surveys. I want to estimate the floor and ceiling percentages for each ISI question and for the total score. It would be helpful to see a worked example of this calculation.

I would appreciate your help in enhancing my knowledge.
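A minimal Python sketch of the calculation, assuming the floor and ceiling effects are defined as the percentage of respondents at the lowest and highest possible score. The scores are hypothetical; ISI items are scored 0-4 and the total ranges from 0 to 28.

```python
def floor_ceiling(scores, lowest, highest):
    """Percentage of respondents at the lowest and highest possible score."""
    n = len(scores)
    floor_pct = 100 * sum(s == lowest for s in scores) / n
    ceiling_pct = 100 * sum(s == highest for s in scores) / n
    return floor_pct, ceiling_pct

# Hypothetical data: one ISI item (0-4) and ISI totals (0-28) for 10 respondents.
item_1 = [0, 0, 1, 2, 4, 3, 0, 4, 2, 1]
totals = [0, 3, 7, 12, 28, 19, 0, 22, 14, 9]

print(floor_ceiling(item_1, 0, 4))    # (30.0, 20.0)
print(floor_ceiling(totals, 0, 28))   # (20.0, 10.0)
```

The same function is applied once per item and once to the total score; only the `lowest` and `highest` arguments change.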

Please share this question with an expert in statistics if you don't know the answer.

I am stuck here. I am working on a therapy and trying to evaluate the changes in biomarker levels. I selected 5 patients and analysed their biomarker levels prior to therapy, after the first therapy, and after the second therapy. The mean values appear to differ, but when I apply ANOVA I get non-significant results because of the large standard deviations, as in the table below.

| Sample | Size | Mean | Standard Deviation | SE of Mean |
|---|---|---|---|---|
| vb bio | 5 | 314.24 | 223.53627 | 99.96846 |
| cb1 bio | 5 | 329.7 | 215.54712 | 96.3956 |
| CB II | 5 | 371.6 | 280.77869 | 125.56805 |

So I would like to hear from statisticians who are familiar with clinical trial studies.

Please suggest:

Am I performing the statistics correctly?

Should I not worry about the non-significant results?

What statistical tests should I use?

How should I present my data for publication purposes?

Please be elaborate in your answers.

Please explain as you would to someone new to this field.

Hi, I would like to know what is the type of study for the following research:

- The researcher conducted a descriptive study of medication errors in a hospital over 3 years. The number and characteristics of medication errors were the comparison sample.

- The researcher implemented a medication error mitigation program.

- Then the researcher studied the number and characteristics of medication errors in the 3 years after the implementation of the mitigation program. These results were compared with the 3-year sample from before implementation.

What is the type of study for this research?

Hello everyone!

We are developing a **phase I randomized clinical trial, in 18 healthy volunteers**, aimed at testing the safety and pharmacokinetics of an i.v. drug. However, we want to test **two different doses** of the drug (doses A and B), and each dose is to be administered with a specific **infusion rate: dose A will be administered at X ml/min, and dose B at Y ml/min.** We need to randomize the 18 volunteers with a **2:1 ratio (active drug vs placebo), in blocks** of size 6. However, to maintain the blind, we would also need two different infusion rates for the placebo (X and Y). **What do you think is the best way to randomize the volunteers in this study?**

**One way** could be to randomize the volunteers in a 2 x 2 factorial design: one axis to assign drug vs placebo, and the other axis to assign the drug dose with its infusion rate, maintaining a 2:1 ratio on the first axis and a 1:1 ratio on the second, in blocks of size 6.

**A second way** could be to randomize "three treatments" (dose A with infusion rate X, dose B with infusion rate Y, and placebo) in a 1:1:1 ratio, in blocks of size 6, and then to randomize the volunteers assigned to placebo, in blocks of size two (or without blocks), to infusion rate X or Y.

**What do you think is the best way to randomize in methodological terms?** In the case of the first way, do we need to test the interaction between dose and infusion rate? Do you have another idea for randomizing the volunteers in this study?

Thank you so much for your suggestions and help.
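One possible reading of the second option, sketched in Python: treat it as four arms inside each block of six (two on dose A at rate X, two on dose B at rate Y, one placebo at X, one placebo at Y), which keeps the 2:1 active-to-placebo ratio and balances the placebo infusion rates within every block. The arm labels, the function name and the seed are all hypothetical, and this is an illustration rather than a recommendation of one design over the other.

```python
import random

# Hypothetical arm labels: dose A at rate X, dose B at rate Y, placebo at X or Y.
BLOCK = ["A@X", "A@X", "B@Y", "B@Y", "placebo@X", "placebo@Y"]

def permuted_blocks(n_subjects, block=BLOCK, seed=None):
    """Return an allocation list built from independently shuffled blocks."""
    rng = random.Random(seed)
    schedule = []
    while len(schedule) < n_subjects:
        shuffled = block[:]          # copy so the template is not mutated
        rng.shuffle(shuffled)
        schedule.extend(shuffled)
    return schedule[:n_subjects]

schedule = permuted_blocks(18, seed=2024)
for start in range(0, 18, 6):
    print(schedule[start:start + 6])
```

With 18 volunteers this gives three complete blocks, so 12 receive active drug (6 per dose) and 6 receive placebo (3 per infusion rate).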

I have two sets of human serum samples: pre-immune and post-immune sera. I need to find the average cut-off for immunogenic titers. I know how to do that in Excel with a t-test, but my sample size is 34. Can someone help me by explaining how to perform a z test to calculate the cut-off values for the pre-immune sera?

For individual responses, I do not know which value the outliers should be checked against.

Please help in this regard.

Hi, I was hoping someone could recommend papers that discuss the impact of using averaged data in random forest analyses or in making regression models with large data sets for ecology.

For example, if I had 4,000 samples each from 40 sites and did a random forest analysis (looking at predictors of SOC, for example) using environmental metadata, how would that compare with doing a random forest of the averaged sample values from the 40 sites (so 40 rows of averaged data vs. 4,000 raw data points)?

I ask this because a lot of the 4,000 samples have missing sample-specific environmental data in the first place, but there are other samples within the same site that do have that data available.

I'm just a little confused on 1.) the appropriateness of interpolating average values based on missingness (best practices/warnings), 2.) the drawbacks of using smaller, averaged sample sizes to deal with missingness vs. using incomplete data sets vs. using significantly smaller sample sizes from only "complete" data, and 3.) the geospatial rules for linking environmental data with samples? (if 50% of plots in a site have soil texture data, and 50% of plots don't, yet they're all within the same site/area, what would be the best route for analysis?) (it could depend on variable, but I have ~50 soil chemical/physical variables?)

Thank you for any advice or paper or tutorial recommendations.

We're conducting a study with the following design:

- An observational longitudinal study
- Time period: 5 years
- Myocardial infarction (MI) patients *without* prior heart failure are recruited (we'll call the number of these people after 5 years of conducting our study **A**)
- Exclusion criteria: death during MI hospitalization or no follow-up data for 3-6 months after discharge.
- Outcome/endpoint: heart failure post MI (confirmed by an ejection fraction (EF) < 40%)
- These patients will then be followed up for **a period of 3 to maximum 6 months**. If their EF during these 3-6 months after discharge is < 40%, they are considered to have heart failure post MI (we'll call the number of these people after 5 years of conducting our study *B*). Otherwise they are not considered to have the aforementioned outcome/endpoint.

My questions are as follows:

- What is *A/B* best called? Is it *cumulative incidence*? We're well aware of studies similar to ours, but the one main difference is that they *did not limit the follow-up time* (i.e. a patient can be considered to have heart failure post MI even 4 years after they were recruited). I wonder if this factor limits the ability to calculate cumulative incidence in our study.
- Is there a more appropriate measure to describe what we're looking to measure? How can we calculate *incidence* in this study?
- We also want to find factors (risk factors?) associated with heart failure post MI. We collected some data about the MI's characteristics and the patients' comorbidities during the MI hospitalization (when they were first recruited). Can we use a Cox proportional hazards model to calculate the HR of these factors?

We are currently doing an undergrad thesis and we are planning to assess the presence or absence of species in each elevation (our variable for community) during a certain month. We were able to find ideas like the coefficient of community but this only allows us to assess two communities.

EDIT: Please see below for the edited version of this question first (02.04.22)

Hi,

I am searching for a reliable normalization method. I have two ChIP-seq datasets to be compared with a t-test, but the RPKM values are biased, so I need to fix this before the t-test. For instance, when a value is high, it does not mean it is high in reality; another factor may be making the value appear high, and in reality I should see a value closer to the mean. Likewise, if a value is low and the factor is strong, we can say that the factor is the reason we see the low value, and we should have seen a value much closer to the mean. In brief, what I want is to eliminate the effect of this factor.

For this purpose, I have another dataset showing how strong this factor is for each value in the ChIP-seq datasets (again as RPKM values). Should I simply divide my RPKM values by the corresponding factor RPKM to get unbiased data? Or is it better to divide my RPKM values by the ratio RPKM / mean(RPKMs)?

Do you have any other suggestions? How should I eliminate the factor?

I have two different ChIP-seq datasets for different proteins, and I have aligned them to some fragments in the DNA. Some of these fragments get a zero read count for one of them or for both. To be able to say that these fragments have much more of protein X than of protein Y, I use Student's t-test.

I wonder if it would be better to remove the zero values from both datasets of per-fragment RPKM values. Moreover, they pose a problem when I want to use a log scale for data visualization.

What would you suggest?

Hi,

We received a statistical reviewer's comments on our manuscript, and one of the comments goes as follows:

'*... Note that common tests of normality are not powered to detect departures from normality when n is small (eg n<6) and in these cases normality should be support by external information (eg from larger samples sizes in the literature) or non-parametric tests should be used.*'

This is basically the same as saying that **'parametric tests cannot be used when n<6'**, at least without the use of some matching external data which would permit *accurate* assumption of the data distribution (of course, in real life such datasets do not exist). And this just doesn't seem right. The t-test and ANOVA can be used with small sample sizes as long as the test assumptions are satisfied, which according to the reviewer cannot be accurately assessed, and thus the tests cannot be used... I see two possible ways of addressing this:

- Argue that parametric tests are applicable and that normality can be assumed using residual plots, testing homogeneity of variance, etc. This sounds like the more difficult, risky and really laborious option.
- Redo all the comparisons with non-parametric tests based on this one comment, which just doesn't seem right and empirically would not yield a different result. It would apply to 15-20 comparisons presented in the paper.

**Maybe someone else would have other suggestions on the correct way to address this?**

For every dataset in the paper, I assess the data distribution by identifying outliers (outliers: > Q3 + 1.5×IQR or < Q1 - 1.5×IQR; extreme outliers: > Q3 + 3×IQR or < Q1 - 3×IQR), testing the normality assumption with the Shapiro-Wilk test and visually inspecting the data distribution using frequency histograms, distribution density and Q-Q (quantile-quantile) plots. Homogeneity of variance was tested using Levene's test.

Datasets are usually n=6 and are exploratory gene expression (qPCR) pairwise comparisons or functional *in vivo* and *in vitro* (blood pressure, nerve activity, response magnitude compared to baseline data) repeated measures data between 2-4 experimental groups.

I would like to study the correlation between four transcripts (fold changes of mRNA expression) at different time intervals (5 time points). How can I perform this analysis?

I am planning a cross-over study (RCT) on the effect of a certain supplement/medicine on post-exercise muscle pain. To date, there has been no similar study on the effect of this medicine (or similar medicines) on post-exercise muscle pain. However, some studies have been conducted on the effect of this medicine on certain conditions such as hypertension.

All the sample size formulas I have found need information (such as standard deviation, mean, effect size, etc.) from similar studies conducted earlier.

Is there any way to estimate a sample size for my RCT under these conditions?

I am currently involved in a study that requires a regression analysis to see the effect of each treatment on a few dependent variables. Can you tell me the minimum number of factor levels or treatments needed for a regression analysis?

Dear All,

I am struggling with a persistent problem with a CSV file while preparing data for a MuSSE model analysis. I have tried many things to fix it, but without success: I always get the same error ("All names must be length 1"). I would be very grateful for your help! :)

```r
library(diversitree)
dat <- read.table("MuSSE_hosts.csv", header=TRUE, dec=".", sep=",", row.names=1)
mat <- dat[,2:ncol(dat)]
lik.0 <- make.musse.multitrait(tree, mat, depth=0)
```

```
Error in check.states.musse.multitrait(tree, states, strict = strict, :
All names must be length 1
```

Thank you a lot in advance!

I have an issue in analysing qRT-PCR datasets. For my gene of interest, treatment A's mRNA fold change values over the control group are 10, 40, and 200. For treatment B, the corresponding values are 0.5, 10, and 5. Therefore, between the two treatments, I know that A's value is always higher than B's, and by a large margin (20-, 4-, and 40-fold differences). However, if I perform routine statistical tests like a t-test, there is no significant difference because of the huge standard deviations.

Can you suggest a way to represent this data and also make proper sense statistically? Thanks in advance.
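Fold changes are ratios, so one common way to handle them is on a log scale, and if the three A and B values come from the same three experiments they can be compared as pairs. A minimal Python sketch under that pairing assumption (the question says "corresponding values" but does not confirm it):

```python
from math import log2, sqrt
from statistics import mean, stdev

fold_a = [10, 40, 200]   # treatment A vs control, from the question
fold_b = [0.5, 10, 5]    # treatment B vs control, from the question

# Assumption: the i-th values of A and B come from the same experiment (paired).
log_ratios = [log2(a / b) for a, b in zip(fold_a, fold_b)]

# One-sample t statistic on the paired log2 ratios (H0: mean log ratio = 0)
n = len(log_ratios)
t_stat = mean(log_ratios) / (stdev(log_ratios) / sqrt(n))

print([round(x, 2) for x in log_ratios])
print(round(t_stat, 2), "on", n - 1, "degrees of freedom")
```

With only three pairs there are 2 degrees of freedom, for which the two-sided 5% critical value is 4.303, so even a consistent difference of this size is hard to demonstrate without more replicates. Plotting the paired log2 ratios, or geometric means on a log axis, shows the pattern more honestly than arithmetic means with standard deviations.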

Hello everyone, I would like to ask whether the way the sample size of this research was calculated is valid. It is a study to evaluate the effect of gargling with povidone iodine among COVID-19 patients. The text says “For this pilot study, we looked at Eggers et al. (2015), using Betadine gargle on MERS-CoV which showed a significant reduction of viral titer by a factor of 4.3 log10 TCID50/mL, and we calculated a sample size of 5 per arm or 20 samples in total”. Given this reduction in viral titer from a previous study on MERS-CoV, is it valid to calculate the sample size this way for a new study on COVID-19?

Dear all,

I am working on gene expression and Kaplan-Meier curves, dividing the patients into "high" and "low" groups using SPSS. I then want to do a Cox proportional hazards analysis that also includes, for example, the mutational status of one gene. I am new to SPSS. How can I set up the analysis to find the hazard ratio of specific combinations, e.g. gene X "high" and gene Y "mut" or "wt"?

For example, we are reviewing an article in which the sensitivity of a testing modality is 87% in a sample of 50 patients. How can we calculate the upper and lower limits of its 95% confidence interval for a forest plot?
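A minimal Python sketch using the Wilson score interval, one of several reasonable choices for a proportion. It assumes that all 50 patients had the disease, so that 50 is the denominator of the sensitivity; if only some of the 50 were diseased, that smaller number should be used instead.

```python
from math import sqrt

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 for 95%)."""
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Assumption: n = 50 diseased patients, reported sensitivity 0.87.
low, high = wilson_interval(0.87, 50)
print(round(low, 3), round(high, 3))
```

The interval is not symmetric around 0.87, which is expected for a proportion close to 1.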

I need to calculate the prevalence ratios to show the trend of prevalence of a drug group used by pregnant women over the years (2001-2018). I would like to use the year 2001 with age distribution of pregnant women in 2001 as reference for following years.

Please show me how to calculate the standardized Prevalence ratio (95%CI) with SPSS (I am not familiar with other software).

The outcome (dependent) variable Drug group use (y/n).

Other variables: age at delivery of pregnant women (continuous, but can be re-coded into age groups), years of birth (2001-2018), date of the prescription of the drug group.

I tried to use GEE (generalized estimating equation), but I do not know which model to use: poisson, negative binomial or binary logistic?

GEE: because some women delivered several times during 2001-2018

Please help me.

The virus has a reproductive number of 2-4 persons per new infected person. There are many variables to take into consideration concerning calculating the time until the pandemic is over. Therefore, statistical methods could be researched for this question.

The COVID-19 pandemic is caused by SARS-CoV-2, a positive-sense single-stranded RNA virus. This infection, which has spread worldwide, causes severe acute respiratory syndrome. The virus spreads through close contact and respiratory droplets. It infects human cells by binding the ACE-2 receptor.

Any ideas would be appreciated,

I have a dataset of 5 variables of quantitative continuous type: 4 independent and 1 dependent (see attached). I tried using linear multiple regression for this (using the standard ***lm*** function in R), but no statistical significance was obtained. Then I decided to try to build a nonlinear model using the ***nls*** function, but I have relatively little experience in this. Could you help me, please: how to choose the right "equation" for a nonlinear model? Or maybe I'm doing everything wrong at all? So far I have used the standard linear model in the "non-linear" model. I would be very grateful for your help.

If you do not have the opportunity to open the code and see the result, I copy it here:

```r
library(XLConnect)

# Read the first worksheet of the workbook
wk <- loadWorkbook("base.xlsx")
db <- readWorksheet(wk, sheet=1)

# Column 1 is the dependent variable; columns 2-5 are the independent variables
DEP <- as.numeric(db[,1])
INDEP <- list()
for(i in 1:4){
  INDEP[[i]] <- as.numeric(db[,i+1])
}

# The formula passed to nls() is linear in the parameters k0..k4
MODEL <- nls(DEP ~ k0 + INDEP[[1]]*k1 + INDEP[[2]]*k2 + INDEP[[3]]*k3 + INDEP[[4]]*k4, start=list(k0=0,k1=0,k2=0,k3=0,k4=0))
SUM <- summary(MODEL)
```

The result is:

```
Formula: DEP ~ k0 + INDEP[[1]] * k1 + INDEP[[2]] * k2 + INDEP[[3]] * k3 +
    INDEP[[4]] * k4

Parameters:
   Estimate Std. Error t value Pr(>|t|)
k0  6.04275    1.30085   4.645 6.41e-06 ***
k1  0.03117    0.01922   1.622    0.107
k2 -0.02274    0.01663  -1.367    0.173
k3 -0.01224    0.01717  -0.713    0.477
k4 -0.01435    0.01541  -0.931    0.353
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.418 on 186 degrees of freedom

Number of iterations to convergence: 1
Achieved convergence tolerance: 2.898e-08
```

Suppose the relative risks (RR) of kidney cancer for Normal BMI vs Obese and Normal BMI vs Overweight are 1.21 (95% CI: 1.21-1.45) and 1.66 (95% CI: 1.66-2.06).

How can we estimate the RR associated with a unit increase in BMI?

Is it possible with SPSS? Are questionnaires varied in different countries and regions?

Can someone explain Cohen's d in a simple way, please?

It is kindly requested to elaborate it for medical students in simple words.
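In simple words, Cohen's d is an effect size rather than a test: it expresses the difference between two group means in units of their pooled standard deviation. A minimal Python sketch with hypothetical blood pressure values (the function name and data are illustrative):

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group1, group2):
    """Difference in means divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1)
                  + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var)

# Hypothetical systolic blood pressures (mmHg)
treated = [120, 125, 130, 135, 140]
control = [130, 135, 140, 145, 150]
print(round(cohens_d(treated, control), 2))   # -1.26
```

Here the treated mean lies about 1.26 pooled standard deviations below the control mean. Cohen's conventional labels are roughly 0.2 for a small effect, 0.5 for medium and 0.8 for large.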

I have data sets for S vs t and X vs t. The yield coefficient needs to be calculated. What is the procedure to calculate it? Should I use the data from the logarithmic growth phase only?

Can anyone advise me which statistical method is suitable for examining the association between TB infection and PM 2.5?

I have SPSS software and I am unable to find out how to produce a hierarchical summary receiver operating characteristic (HSROC) curve, Deeks' funnel plot and a forest plot. Could someone please guide me on how to do this in SPSS, or suggest alternative software or a free web-based solution? Thanks

There are two groups in the study, group 1 and group 2. One of the groups received treatment, but the other did not. When the mortality of the groups is compared, there seems to be no statistical difference. However, the expected mortality rate (calculated from the PRISM3 score) in the first group (treatment group) was significantly higher than in the other. I think the treatment was successful in lowering the high expected mortality. However, I could not find how to show this statistically, or how to adjust for this baseline imbalance (expected mortality) between the groups.

Thanks

And if a GLM, which family and link would make sense for interpreting the data well?

The attached file contains functional and taxonomic diversity indices of macrobenthic fauna. I want to discuss spatial differences within habitats.

Background info:

I have calculated the doubling times of wild-type cell lines and gene knockdown cell lines. Growth curves were measured three times (day 0 to day 6), each time with 2 technical replicates. The technical replicates were plotted over time, and a doubling time was derived via log-linear regression.

I now want to test whether knockdown of this gene affects doubling time. As the variation between the different growth curves (doubling times) is quite large (likely due to random things like people opening the incubator more frequently that week and differences in confluency at plating, things that are the same for both the wild-type and the knockdown cell line), I think I need to use a paired t-test.

However, from what I've seen, a paired t-test does not take into account standard error of those doubling times. So I'm wondering, is this correct? I do not have a background in statistics, but this feels somewhat wrong.

*To clarify: for both the wild-type cell line and for the knockdown cell line I have three doubling times. I want to compare these to see if the knockdown has an effect on doubling times. As I derived the doubling times from log linear regression I think it's best to compare the slopes rather than convert those slopes to doubling times and compare those.*

Hi,

I would like to compute I² for my meta-analyses, but I can't find any software to do it. I know that ESCI (the Excel file from Geoff Cumming, thanks to him!) can do this calculation, but the version I've used ( https://thenewstatistics.com/itns/esci/ ) only computes the diamond ratio. I know the old version of ESCI computed I² and Q, but I can't find it (and it is not available on the website).

If anyone has an idea, that would help!

Thanks :)

Louis
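For reference, I² follows directly from Cochran's Q, so it can be computed by hand or in a few lines of code. A minimal Python sketch under a fixed-effect (inverse-variance) weighting, with hypothetical effect sizes and variances:

```python
def heterogeneity(effects, variances):
    """Cochran's Q and I² (%) from per-study effects and their variances."""
    weights = [1 / v for v in variances]                 # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0  # truncated at 0
    return q, i2

# Hypothetical standardized mean differences and their variances
effects = [0.10, 0.30, 0.50, 0.80]
variances = [0.02, 0.03, 0.02, 0.04]
q, i2 = heterogeneity(effects, variances)
print(round(q, 2), round(i2, 1))
```

The inputs are the per-study effect estimates and their squared standard errors, which any meta-analysis spreadsheet already contains.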

Situation:

Question: Gene X is suspected to have a general effect on growth (i.e. an effect on the growth of all cell types).

Knockdown cell lines (=strongly reduced expression of the gene) of different cell types were generated. Growth curves were established for the KD and control cell lines by seeding 10 wells on day 1 (stemming from the same cell mixture) and counting 2 wells each day for the following 5 days (day1: seeding; day2 - 6: counting cells). This experiment was performed 5 times.

For each experiment, for each day, the mean of the cell counts was calculated, Ln transformed and plotted against time. A linear fit was used to determine the linear part of the growth curve. To find the linear part I tried to obtain the highest R2 value with a minimum of 3 data points (sometimes 3 points gave the highest R2, sometimes 5 points gave the highest R2). I then used the mean cell count from the first and last point of the linear part to calculate the doubling time (which is often used in literature to represent growth rate of cells).

Formula used:

*First image*

I’m however not sure if I should calculate the doubling time using this formula (using the cell counts), or whether I should calculate the doubling time from the equation of the linear fit.
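The two options can be compared side by side. Since the formula image is not reproduced here, the sketch below assumes the standard two-point expression, doubling time = Δt · ln 2 / ln(N₂/N₁), and sets it against the doubling time taken from the least-squares slope of ln(count) versus time. The cell counts and function names are hypothetical.

```python
from math import log

def doubling_time_two_point(t1, n1, t2, n2):
    """Doubling time from the first and last point of the linear phase."""
    return (t2 - t1) * log(2) / log(n2 / n1)

def doubling_time_from_fit(times, counts):
    """Doubling time from the least-squares slope of ln(count) against time."""
    y = [log(c) for c in counts]
    t_bar = sum(times) / len(times)
    y_bar = sum(y) / len(y)
    slope = (sum((t - t_bar) * (v - y_bar) for t, v in zip(times, y))
             / sum((t - t_bar) ** 2 for t in times))
    return log(2) / slope

# Hypothetical mean cell counts on days 2-6
days = [2, 3, 4, 5, 6]
counts = [21000, 39000, 83000, 158000, 330000]
print(round(doubling_time_two_point(days[0], counts[0], days[-1], counts[-1]), 3))
print(round(doubling_time_from_fit(days, counts), 3))
```

The two-point version uses only the end points, so noise in either count carries straight into the result, while the fitted slope uses every point in the chosen window. For perfectly exponential data the two agree exactly.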

To calculate the error of the doubling time the following formula was used:

*Second image*

At this point I have 5 normalized doubling times and their error (normalized to the control, so the doubling time of the knockdown divided by the doubling time of the corresponding control).

I now want to assess whether there is a significant difference in doubling times (= significant difference in growth rate) between knockdown and control. Someone in my lab suggested a t-test and I agree, yet I have a few problems with this:

- I don't know how to test normality of my data since I (currently) only have 3 data points (I will have 5 in the end).
- I don't know how to test equality of variance with only 3 (in the end 5) data points.
- I don't know if this is the best method to accurately determine whether there is a significant difference in growth or not. (For example, maybe there’s a way to compare the growth curves directly. I suspect that such a test would be more accurate, but also more complicated, to such a degree that I might not be able to apply it myself.)

I work in a lab where no one really has any expertise in this (or statistics in general), so I have to figure it out on my own. I have a lot of doubts on whether or not what I’m doing is correct or not, not helped by the fact that my understanding of statistics is very basic.

If you see any other mistakes, please do tell.

Could someone please let me know what syntax should be used to test Hardy-Weinberg equilibrium in case-control studies?
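Whatever package is used, the underlying calculation is a chi-square goodness-of-fit test with 1 degree of freedom for a biallelic marker. A minimal Python sketch with hypothetical genotype counts; in case-control studies the check is usually applied to the control group.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 df) comparing genotype counts with HWE expectations."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)        # frequency of allele A
    q = 1 - p                              # frequency of allele B
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n_aa, n_ab, n_bb]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical control-group genotype counts (AA, AB, BB)
chi2 = hwe_chi_square(298, 489, 213)
# 3.841 is the 5% critical value of the chi-square distribution with 1 df
print(round(chi2, 3), "reject HWE at 5%" if chi2 > 3.841 else "consistent with HWE")
```

For small samples or rare alleles the chi-square approximation is poor and an exact test is usually preferred.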

I have only the total number of cases for each disease (in millions), and I'm trying to find a significant relationship for their coexistence across the 5 regions...

Hello everybody, I am new to meta-analysis of genome-wide data, so I have this doubt. I have read the documentation for METAL, which is by far the most used meta-analysis software for both EWAS and GWAS microarray data, but I cannot figure out what the input should be for an EWAS analysis. As METAL was originally designed for GWAS, one of the inputs is the reference and non-reference allele. Since EWAS arrays do not rely on allele frequencies but on a quantitative measure, I would like to know what the METAL input should be in this case. Thank you so much in advance for answering this question (which may be easy, but I certainly do not know the answer).

For example, I have 4 cell treatments and technical repeat wells of each, with confluence data every hour for 48 hours.

Would a two-way ANOVA be a good way to observe the differences between the cell treatments over time? I want to see whether the different drug concentrations affect cell proliferation over time.

Do you know of any R package that can deal with an unbalanced data set?

I have agronomic traits from 7 years with different numbers of locations, different numbers of replications, and different genotypes.

I am analyzing a dataset. There I have **4 variables** that are used to diagnose a disease. Among them, **3 were "Lab test report findings"**, e.g. **Test A, Test B, Test C**, and *1 was a "clinical finding", i.e. "Test D"* **(which is obtained by the clinical examination of the patient and is not established for the confirmatory diagnosis of the disease).**

To confirm the diagnosis of the disease, e.g. "Dengue", each of the **3 lab tests i.e. A, B, C** can independently be used for the confirmatory diagnosis of Dengue. In my research, patients had done at least one of the 3 tests to confirm the disease. Some might have done all 3 tests. Also, among the patients, a great proportion had shown a positive result on *"Test D"*.

I want to establish that *"Test D"* could be one of the confirmatory tests along with the other **3 tests i.e. A, B, C.** On top of that, *"Test D"* could be more accurate and reliable for confirming Dengue compared to the other **lab tests i.e. A, B, C.** So, what statistical test should I use to prove and compare the effectiveness of this clinical examination finding? Also, please suggest some graphs that can visualize this case.

N.B. All 4 tests had a dichotomous answer. The findings of these tests can either be positive or negative.

According to my sample size estimation, the sample size comes to be 14. Would results from a trial with such a little number of subjects be valid? On the other hand I cannot increase the sample size beyond my calculated sample size except for adjustment for loss to follow up. Please suggest me.

I am new to statistics at this level in ecology. I am trying to compare DNA and RNA libraries with thousands of OTUs. I summarized taxa to get the most abundant species, but I can obtain only relative abundances. I was thinking of using SIMPER, as I read in several comments, to test which species differ the most per station between DNA- and RNA-based libraries. However, I read that SIMPER is only a more or less robust test. I was wondering whether manyglm is also an alternative for my question, or whether you would suggest another way. Thank you for your help!

I'm involved in a meta-analysis where some trial outcomes are reported as mean and standard deviation and some as median and interquartile range. As the software functions require the group n, mean and SD, I looked around and found the following paper: http://www.biomedcentral.com/1471-2288/5/13. However, this simulation study states that it is possible to estimate the mean and SD from the median and range (min and max values), not from the median and IQR. We checked each paper again for min and max values, but it was very disappointing, as none reported them. Therefore, I would be very grateful if anyone has a tip to help me work around this issue.
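One rough workaround can be sketched under a strong assumption: if the outcome is approximately normally distributed, the interquartile range spans about 1.35 standard deviations, and the mean can be approximated by averaging the two quartiles and the median. The function name below is hypothetical and the approximation degrades for skewed data or very small groups.

```python
from statistics import NormalDist


def mean_sd_from_quartiles(q1, median, q3):
    """Approximate mean and SD from quartiles, assuming near-normal data."""
    # For a normal distribution, Q3 - Q1 = 2 * z(0.75) * SD, about 1.35 * SD.
    iqr_in_sd_units = 2 * NormalDist().inv_cdf(0.75)
    approx_mean = (q1 + median + q3) / 3
    approx_sd = (q3 - q1) / iqr_in_sd_units
    return approx_mean, approx_sd


# Hypothetical trial arm reporting median 20 with IQR 10 to 30.
approx_mean, approx_sd = mean_sd_from_quartiles(10, 20, 30)
print(round(approx_mean, 2), round(approx_sd, 2))
```

A sensitivity analysis that excludes the trials with converted values would show how much the pooled estimate depends on this approximation.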

I have treated THP1 and AGS cells for 12, 24 and 48 hours with a bacterial toxin at concentrations of 0, 5, 10, 20, 40, and 80 µg/ml. Now I want to support my results with statistical methods, but I'm confused about which one to use. Should it be an ANOVA with a post hoc test or simply a t-test? And should it be one-tailed or two-tailed, paired or unpaired?

Does anybody know an estimation method for calculating the prevalence of a given risk factor among general population, given that the odds ratio/relative risk, the prevalence of the risk factor among diseased and the prevalence of the disease are available?
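The quantities listed can be combined with the law of total probability. A minimal sketch, with hypothetical function names and inputs: for an odds ratio, the exposure odds among non-diseased equal the exposure odds among diseased divided by the OR; for a relative risk, the population prevalence follows directly from the prevalence among the diseased.

```python
def prevalence_from_odds_ratio(odds_ratio, p_exposed_in_diseased, p_disease):
    """Population prevalence of the risk factor, given an odds ratio."""
    odds_diseased = p_exposed_in_diseased / (1 - p_exposed_in_diseased)
    # The exposure odds ratio equals the disease odds ratio.
    odds_healthy = odds_diseased / odds_ratio
    p_exposed_in_healthy = odds_healthy / (1 + odds_healthy)
    # Law of total probability over diseased and non-diseased.
    return (p_exposed_in_diseased * p_disease
            + p_exposed_in_healthy * (1 - p_disease))


def prevalence_from_relative_risk(relative_risk, p_exposed_in_diseased):
    """Population prevalence of the risk factor, given a relative risk."""
    q = p_exposed_in_diseased
    # Solves q = RR * p / (RR * p + 1 - p) for p.
    return q / (relative_risk - q * (relative_risk - 1))


# Hypothetical inputs: OR or RR of 2, 50% of cases exposed, 10% disease prevalence.
p_or = prevalence_from_odds_ratio(2.0, 0.5, 0.1)
p_rr = prevalence_from_relative_risk(2.0, 0.5)
print(round(p_or, 4), round(p_rr, 4))
```

Both versions assume the OR or RR is unconfounded and applies to the same population as the prevalence figures.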

I want to investigate whether there is a significant upregulation of certain genes between cells from WT and KO mice without stimulation (negative control) and after stimulation with substances A, B and C.

The sample size consists of n=3 KO and n=3 WT mice. Primary cell lines are cultivated from each mouse. Cells from each mouse are used in all four conditions stated above (repeated measures).

Because of the small sample size, I presumed a non-parametric test was in order (correct me if I am wrong).

If I would only need to compare the differences in gene expression between control, A, B and C in one type of mouse, I would use something like the Friedman test (the non-parametric alternative to the Repeated-measures One-way ANOVA). However, I am interested in the difference in expression between KO and WT mice. Should I use something like a non-parametric equivalent of a repeated-measures two-way ANOVA? And if so, which test should I use?

As an example, gene expression levels for 50 genes are measured by qPCR for 5 different conditions. Likewise, gene expression levels for the same 50 genes were measured by a different technique for the same 5 conditions. Which method(s) can be applied to compare these two techniques?

I have read multiple articles that have used machine learning algorithms (convolutional neural network, random forest, support vector regression, and Gaussian process regression) on cross-sectional MRI data. I am wondering whether it is possible to apply these same methods to longitudinal or clustered data with repeated measures. If so, is there an algorithm that might be better to use?

I would be interested in seeing how adding longitudinal data could improve the performance of these types of machine learning models. So far, I am only aware of using mixed-effects models or generalized estimating equations on longitudinal data, but I am reading books and papers to learn more. Any advice or resources would be greatly appreciated.

My experiment involves isolating primary cells from biopsies of different patients.

I then culture the primary cells in two conditions: one is the negative control, and the experimental set-up is culture with more hormone. I cultured them in parallel (both were freshly cultured immediately after I isolated them from the biopsy).

I then obtained the assay results for the two culture conditions. I considered my results to be paired within each patient, so I performed a paired t-test.

But when I presented this to my colleagues, one of them said that a paired-sample t-test may not be appropriate here because there are some strict regulations about using a paired-sample t-test with samples from **cell culture**. I briefly searched for this but could not find anything. Can anyone tell me whether there is such a regulation?

I also performed an independent-samples t-test on my results and got a much larger p-value. I think the individual differences between patients generated a larger variance in that analysis.

Could anyone tell me the best test for my data?
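The gap between the two p-values can be illustrated with a small sketch. The data are hypothetical (five patients, each measured under both conditions), and only the t statistics are computed. The paired statistic uses the within-patient differences, so the large spread between patients drops out, whereas the unpaired (Welch) statistic keeps that spread in its standard error.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(control, treated):
    """t statistic based on within-patient differences."""
    diffs = [t - c for c, t in zip(control, treated)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def welch_t(control, treated):
    """t statistic treating the two conditions as independent groups."""
    se = sqrt(stdev(control) ** 2 / len(control)
              + stdev(treated) ** 2 / len(treated))
    return (mean(treated) - mean(control)) / se


# Hypothetical assay values: patients differ a lot, the hormone effect is small.
control = [10, 20, 30, 40, 50]
treated = [12, 23, 33, 45, 54]
t_paired = paired_t(control, treated)
t_welch = welch_t(control, treated)
print(round(t_paired, 2), round(t_welch, 2))
```

With these toy numbers the same mean difference of 3.4 gives a paired t of about 6.7 but an unpaired t of about 0.33, which is the pattern described in the question.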

Very frequently, in papers on parallel-group clinical trials, I encounter a situation where the calculated SD of the effect in each group is approximately equal to the SD of the difference in effect between the two groups.

An example may be found, e.g. in the following paper (Table 2):

Pelubiprofen achieved an efficacy on the VAS scale of 26.2 with SD = 19.5. Celecoxib achieved an efficacy of 21.2 with SD = 20.8. However, the difference is 5.0 with SD = 20.1! I was expecting an SD about sqrt(2) times larger, since the samples are independent and have approximately equal sizes.
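One possible reading, which cannot be confirmed without the paper itself, is that the SD reported next to the difference is the pooled within-group SD rather than the SD of individual differences. The sketch below checks that arithmetic; the group size of 100 per arm is a placeholder assumption, since the question does not give it.

```python
from math import sqrt


def pooled_sd(sd1, sd2, n1, n2):
    """Pooled within-group standard deviation of two independent groups."""
    return sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))


def se_of_difference(sd1, sd2, n1, n2):
    """Standard error of the difference between two independent means."""
    return sqrt(sd1**2 / n1 + sd2**2 / n2)


# SDs from the question; n = 100 per group is an assumed placeholder.
sd_pooled = pooled_sd(19.5, 20.8, 100, 100)
se_diff = se_of_difference(19.5, 20.8, 100, 100)
print(round(sd_pooled, 2), round(se_diff, 2))
```

With equal group sizes the pooled SD is about 20.16, close to the reported 20.1. The sqrt(2) factor would apply to the SD of a difference between two independent individual values, which a parallel design does not observe directly.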

Hi! I would like to compare cell proliferation rates.

The working hypothesis is that the proliferative effect of extracellular vesicles on cells cultured on the skin implant is increased compared to samples with pure cell culture.

There will be 4 samples:

1) cells (control)

2) cells + skin implant

3) cells + skin implant + extracellular vesicles

4) cells + extracellular vesicles

Cells will be from 3 donors, and experiments will be carried out 3 times with each donor culture.

Can someone help? Could you advise which method is best?

Thank you in advance, your help is much appreciated.

Suppose I am doing a case-control study. Let's say Group 1 is a clinical population (N=30) and Group 2 is a healthy control population (N=30). I have measured various variables (continuous data) in both groups, and using a t-test I have found a difference between the two groups. Now, suppose I want to find the relationship between two variables: can both groups be clubbed together (N=60), or do I do a separate correlation analysis for each group?

For Example: If "satisfaction with life" and "quality of life" are research variables in two groups, specifically Patients with anxiety vs Healthy control. I can get continuous data for both these variables using a questionnaire, and I can do a t-test and establish if there is a difference in satisfaction with life and quality of life between these two groups. Now, if I want to know the association between satisfaction with life and quality of life, can I club both patient group and healthy control group together? If yes, is it applicable always, or, are there some conditions? Please explain as my research question is different, and I have just given an example here.
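Why pooling can matter is easiest to see with a deliberately extreme toy example. The scores below are hypothetical: within each group the two variables are perfectly negatively correlated, yet because the groups differ in their means on both variables, the pooled correlation comes out positive.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical questionnaire scores (variable 1, variable 2) per group.
patients_x, patients_y = [1, 2, 3, 4, 5], [5, 4, 3, 2, 1]
controls_x, controls_y = [6, 7, 8, 9, 10], [10, 9, 8, 7, 6]

r_patients = pearson(patients_x, patients_y)
r_controls = pearson(controls_x, controls_y)
r_pooled = pearson(patients_x + controls_x, patients_y + controls_y)
print(round(r_patients, 2), round(r_controls, 2), round(r_pooled, 2))
```

Real data are rarely this extreme, but comparing the within-group and pooled coefficients in this way shows whether the group difference is driving the pooled association.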

I would like to calculate under-5 mortality from survey data, and it is difficult to find a coherent resource that gives a step-by-step guide to calculating the under-5 mortality rate using a Cox regression model in SPSS. Does anyone have any resource recommendations, or would anyone be ready to work through this project with me? Cheers!

I constructed a contingency table with three categorical variables (species, elevation, and year) and performed a chi-square analysis to test for independence between them. My initial goal was to determine the probability that observed differences in species elevational ranges between two surveys (years) were due to chance.

The test revealed a significant relationship between the variables. This, however, presents me with a new question: how do I go about determining precisely where this relationship exists (e.g. species & elevation vs. species & year vs. elevation & year, etc.)?

Any advice or suggestions on how I can figure this out? Thanks!

I am working with birth defects data with a case-control ratio of 1:4. However, I would like to choose controls that are a better match for my cases and reduce the ratio to 1:1 or 1:2. I am planning to use a propensity score approach to choose my controls from this database. Is this the best method to use?

I am using the complex sampling analysis method within SPSS. I would like to use Cox regression for my variable under complex sampling, as my variable has a prevalence greater than 10%, and thus logistic regression should not be used. When using Cox regression under the complex sampling analysis, is the robust variance already accounted for?

Sometimes we want to conduct a reliability study on a diagnostic modality for a specific disease, but the gold standard for diagnosing that disease is either an invasive procedure or surgery, which is not justified in normal individuals (the control group). In such a case, is it justified to treat the control group as negative on the gold standard?

For example:

We want to diagnose infantile hypertrophic pyloric stenosis (IHPS) with the help of ultrasound, but the gold standard for its diagnosis is surgery. Suppose we perform ultrasound on 50 infants with projectile vomiting, and the sonographic findings are suggestive of IHPS in 40 of them and normal in 10. After surgery (the gold standard), 38 were confirmed as IHPS and 2 were false positives. Now we want to perform ultrasound on 50 normal infants (controls). Is it justified to count all 50 normal infants as true negatives, with 0 false positives on the gold standard, in order to perform chi-square statistics?
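From the numbers given, the quantity that can be computed without any assumption about unverified infants is the positive predictive value: 38 confirmed cases out of 40 positive scans. A minimal sketch follows, using a Wilson score interval, which is a choice made here for illustration and not something the question specifies.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, level=0.95):
    """Wilson score confidence interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Counts from the question: 40 positive scans, 38 confirmed at surgery.
ppv = 38 / 40
lower, upper = wilson_interval(38, 40)
print(round(ppv, 3), round(lower, 3), round(upper, 3))
```

Sensitivity and specificity would additionally need gold-standard results for the scan-negative infants, which is exactly the part the question asks whether it may be assumed.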

The error message is as follows:

**Warnings Box's Test of Equality of Covariance Matrices is not computed because there are fewer than two nonsingular cell covariance matrices.**

However, the results are computed anyway.

**I want to know what it actually means.**

**Does it affect the overall results?**

**How can I fix this error?**

It should be noted that the **sample size** differs between my groups. I would appreciate your guidance.

Baset

*Please share your view; it is very important.* Biostatisticians calculate sample size with the help of a formula, but unfortunately no time duration is mentioned in it. How can it be justified to calculate the same sample size, with the same formula, for a Master's student with a 9-month research project and a PhD student with an 18-month research project?

I would like to know why linearity is important in a sandwich assay curve for protein sample detection.

Suppose I have a questionnaire for stress assessment that contains 30 questions, and each question has 5 possible answers (0 - no stress, 1 - mild stress, 2 - moderate stress, 3 - high stress, 4 - severe stress). The total score across the 30 questions ranges from 0 to 120.

How can we categorize the total score (range 0-120) into mild, moderate and severe? Which cut-offs should I take for mild, moderate and severe?

Dear All,

I have 500 miRNAs with expressed read counts in 20 conditions. In the same 20 conditions, I have measurements of lymphocyte counts.

I would like to see how the miRNA counts are correlated with the lymphocyte counts.

These two variables are not equal in number, so how should I go about that?
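One way to frame the unequal numbers is that each miRNA contributes its own vector of 20 values, which can be correlated with the single vector of 20 lymphocyte counts, giving 500 coefficients. The sketch below uses Spearman's rank correlation (a choice made here, since read counts are often skewed) on a hypothetical toy set of three miRNAs and five conditions.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient (undefined for constant input)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: lymphocyte counts and read counts over 5 conditions.
lymphocytes = [1.0, 2.0, 3.0, 4.0, 5.0]
mirna_counts = {
    "miR-a": [10, 20, 30, 40, 50],
    "miR-b": [50, 40, 30, 20, 10],
    "miR-c": [5, 5, 9, 1, 7],
}
rho = {name: spearman(counts, lymphocytes)
       for name, counts in mirna_counts.items()}
print({name: round(value, 3) for name, value in rho.items()})
```

With 500 miRNAs, the resulting p-values would need a multiple-testing correction, which this sketch leaves out.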

I am trying to compute the "E-value" introduced by VanderWeele & Peng Ding (2017, *Ann Intern Med*). For that, I need to have the risk ratio (also known as relative risk).

As I understood, Risk Ratio is only for dichotomous variables.

So, is it possible to compute a risk ratio from a correlation between two continuous variables?
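Once a risk ratio is available, the E-value for the point estimate is RR + sqrt(RR × (RR − 1)), with a risk ratio below 1 first inverted. A minimal sketch follows (the function name is hypothetical); it does not address how to get from a correlation to a risk ratio.

```python
from math import sqrt


def e_value(risk_ratio):
    """E-value for a risk ratio point estimate."""
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    # Protective effects are inverted so the formula applies to RR >= 1.
    rr = risk_ratio if risk_ratio >= 1 else 1 / risk_ratio
    return rr + sqrt(rr * (rr - 1))


print(round(e_value(2.0), 3), round(e_value(0.5), 3), e_value(1.0))
```

A risk ratio of 1 gives an E-value of 1, meaning no unmeasured confounding is needed to explain a null association.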

I have rapid light curve data (ETR for each PAR value) for 24 different specimens of macroalgae. The dataset has three factors: species (species 1 and species 2), pH treatment (treatment 1 and treatment 2) and day (day 1 of the experiment and day 8 of the experiment).

I have fitted a model defined by Webb 1974 to 8 subsets of the data:

species 1,pH treatment 1, day 1

species 1, pH treatment 1, day 8

species 1, pH treatment 2, day 1...etc.

I have plotted the curves of the data that is predicted by the model. The model also gives the values and standard error of two parameters: alpha (the slope of the curve) and Ek (the light saturation coefficient). I have added an image of the scatterplot + 4 curves predicted by the model for species 1 (so each curve has a different combination of the factors pH treatment and Day).

I was wondering what the best way would be to statistically test whether the 8 curves differ from each other (or in other words: how to test whether the slopes and Ek of the models are significantly different). When googling for answers, I found many ways to check which model fits your data better, but not how to test whether the different treatments also cause differences in rapid light curves.

Any help would be greatly appreciated.

Cheers,

Luna
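Since each fit already returns a parameter estimate with its standard error, one simple option is a Wald-type z-test between two estimates. This is a sketch under assumptions: the fits are independent, the estimates are approximately normally distributed, and the alpha values below are hypothetical placeholders, not results from the data described.

```python
from math import sqrt
from statistics import NormalDist


def compare_estimates(est1, se1, est2, se2):
    """Two-sided z-test for the difference of two independent estimates."""
    z = (est1 - est2) / sqrt(se1**2 + se2**2)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical alpha estimates (with standard errors) from two of the 8 fits.
z_alpha, p_alpha = compare_estimates(0.30, 0.02, 0.22, 0.03)
print(round(z_alpha, 2), round(p_alpha, 3))
```

With 8 curves there are many pairwise comparisons, so the p-values would need a multiple-comparison adjustment. An alternative is to fit one nonlinear model to all the data, with alpha and Ek allowed to depend on species, pH treatment and day.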