
Advanced Statistics - Science topic

Explore the latest questions and answers in Advanced Statistics, and find Advanced Statistics experts.
Questions related to Advanced Statistics
  • asked a question related to Advanced Statistics
Question
5 answers
Hello ResearchGate community,
I am looking for a statistician with experience in metagenomic data analysis to assist with a research project. The data involves genotypic diversity within microbial profiles, and we require statistical expertise to ensure accurate and robust analysis. Specifically, I am seeking someone who is skilled in handling large datasets and can provide insights through advanced statistical methods.
If you have expertise in this area or know someone who does, please feel free to reach out. I’d be happy to discuss further details regarding the project and potential collaboration.
Thank you in advance for your support and recommendations.
Relevant answer
Answer
Please contact me
  • asked a question related to Advanced Statistics
Question
2 answers
Hello!
I am performing a study to introduce a new test for a specific eye disease diagnosis. The new test has continuous values, the disease can be present in one or both eyes, and the disease severity by eye could also be different. Furthermore, the presence of the disease in one eye increases the probability of having the disease in the other eye.
Because we aim to estimate the diagnostic performance of the new test, we performed the new test and the gold standard for the disease in both eyes in a sample of patients. However, the repeated measurements within each patient could introduce intra-class correlation into the data, so the results cannot be treated as i.i.d. Therefore, diagnostic performance derived directly from a logistic regression model or ROC curve could be incorrect.
What do you think is the best approach to calculate the AUC, sensitivity, specificity, and predictive values in this case?
I think that a mixed-effects model with the patient as a random intercept could be useful. However, I do not know if there is any method to estimate the diagnostic performance with this type of models.
Thank you in advance.
Relevant answer
Answer
Hi Abraham,
I think this has been previously addressed in various epi-oriented papers.
A good reference and tutorial for doing this in several software packages is:
Genders TS, Spronk S, Stijnen T, Steyerberg EW, Lesaffre E, Hunink MG. Methods for calculating sensitivity and specificity of clustered data: a tutorial. Radiology. 2012 Dec;265(3):910-6. doi: 10.1148/radiol.12120509. Epub 2012 Oct 23. PMID: 23093680.
I hope this helps!
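To add a rough idea of what this can look like in practice, here is a minimal R sketch (simulated data; variable names such as patient_id and test_value are assumptions, not from your study). It fits a random-intercept logistic model with patient as the cluster and computes an AUC on the population-level predictions; confidence intervals should then account for clustering, e.g. by resampling whole patients.

# Sketch only: cluster-aware diagnostic accuracy via a random-intercept model.
# install.packages(c("lme4", "pROC"))
library(lme4)
library(pROC)

set.seed(1)
n_pat <- 150
re <- rnorm(n_pat, sd = 1.2)                      # shared patient effect
eyes <- data.frame(
  patient_id = rep(seq_len(n_pat), each = 2),     # two eyes per patient
  test_value = rnorm(2 * n_pat)
)
eyes$disease <- rbinom(2 * n_pat, 1,
                       plogis(-0.5 + 1.5 * eyes$test_value + re[eyes$patient_id]))

fit <- glmer(disease ~ test_value + (1 | patient_id),
             data = eyes, family = binomial)

pred <- predict(fit, type = "response", re.form = NA)  # population-level predictions
roc_obj <- roc(eyes$disease, pred)
auc(roc_obj)
# For confidence intervals on AUC, sensitivity and specificity, bootstrap by
# resampling patients (not individual eyes), so the within-patient correlation
# is respected.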
  • asked a question related to Advanced Statistics
Question
3 answers
I am at the end of conducting a large systematic review and meta-analysis. I have experience of meta-analysis and have attempted to meta-analyse the studies myself, but I am not happy with my method. The problem is that almost all the studies are crossover studies and I am not sure how to analyse them correctly. I have consulted the Cochrane Handbook, and it seems to suggest a paired analysis is best, but I do not have the expertise to do this - https://training.cochrane.org/handbook/current/chapter-23#section-23-2-6
I am seeking a statistician familiar with meta-analysis to consult with, and if possible, undertake the meta-analysis. There are only two authors on this paper (me and a colleague), so you would either be second or last author. We aim to publish in a Q1 or Q2 journal, and from my own analysis I can see we have very interesting results.
Please let me know if you are interested.
Relevant answer
Answer
Depending on the structure of the data (how much pre-processing has been already done), I would be ready to conduct the meta-analysis as well. Please feel free to reach out by PM.
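Not a substitute for a statistician, but to illustrate the paired approach the Handbook section describes: below is a hedged R sketch using metafor, computing a within-person mean difference per crossover trial with an imputed within-person correlation (here 0.5, an assumption that should be varied in sensitivity analyses). All study names and numbers are made up.

# Sketch only: approximate paired analysis of crossover trials.
# install.packages("metafor")
library(metafor)

trials <- data.frame(
  study = c("A", "B", "C"),                        # hypothetical trials
  m1 = c(5.1, 4.8, 6.0), sd1 = c(1.2, 1.1, 1.4),   # treatment period
  m2 = c(4.2, 4.5, 5.1), sd2 = c(1.3, 1.0, 1.5),   # control period
  n  = c(24, 30, 18)                               # participants (each gets both)
)
r <- 0.5                                           # assumed within-person correlation
trials$yi  <- trials$m1 - trials$m2                # paired mean difference
trials$sdd <- with(trials, sqrt(sd1^2 + sd2^2 - 2 * r * sd1 * sd2))
trials$sei <- trials$sdd / sqrt(trials$n)

res <- rma(yi = yi, sei = sei, data = trials, method = "REML")
summary(res)
# Repeat with r = 0.3 and r = 0.7 as a sensitivity analysis on the imputed correlation.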
  • asked a question related to Advanced Statistics
Question
6 answers
I want to use SPSS Amos to calculate SEM because I use SPSS for my statistical analysis. I have already found some workarounds, but they are not useful for me. For example, using a correlation matrix where the weights are already applied seems way too confusing to me and is really error prone since I have a large dataset. I already thought about using Lavaan with SPSS, because I read somewhere that you can apply weights in the syntax in Lavaan. But I don't know if this is true and if it will work with SPSS. Furthermore, to be honest, I'm not too keen on learning another syntax again.
So I hope I'm not the first person who has problems adding weights in Amos (or SEM in general) - if you have any ideas or workarounds I'll be forever grateful! :)
Relevant answer
Answer
You can see www.Stats4Edu.com
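On the lavaan point raised in the question: recent lavaan versions do accept a sampling.weights argument (the name of a weight column in your data), so you would not need to pre-weight a correlation matrix. A hedged sketch using lavaan's built-in example data with a purely illustrative weight variable:

# Sketch only: SEM with sampling weights in lavaan.
# install.packages("lavaan")
library(lavaan)

dat <- HolzingerSwineford1939
set.seed(1)
dat$wt <- runif(nrow(dat), 0.5, 1.5)      # hypothetical survey weights

model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
'
fit <- sem(model, data = dat, sampling.weights = "wt", estimator = "MLR")
summary(fit, fit.measures = TRUE)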
  • asked a question related to Advanced Statistics
Question
4 answers
We are looking for a highly qualified researcher with expertise in advanced statistical analysis to contribute to a scientific article to be submitted to a prestigious journal by the end of the year (2024). The article will focus on the adoption of digital innovations in agriculture.
Key responsibilities:
- Carry out in-depth statistical analysis using a provided database (the dataset is ready and available in SPSS format).
- Apply advanced statistical techniques, including structural equation modelling and/or random forest models.
- Work closely to interpret the results and contribute to the manuscript.
The aim is to fully analyse the data and prepare it for publication.
If you are passionate about agricultural innovation and have the necessary statistical expertise, we would like to hear from you.
Relevant answer
Answer
Carlos Parra-López this sounds interesting. I'm interested, but if you like, we can have a preliminary discussion earlier.
  • asked a question related to Advanced Statistics
Question
1 answer
Hi everyone.
When running a GLMM, I need to turn the data from wide format to the long format (stacked).
When checking for assumptions like normality, do I check them for the stacked variable (e.g., outcomemeasure_time) or for each variable separately (e.g., outcomemeasure_baseline, outcomemeasure_posttest, outcomemeasure_followup)?
Also, when identifying covariates via correlations (Pearson's or Spearman's), do I use the separate variables or the stacked one?
Normality: say normality for outcomemeasure_baseline is violated but normality for the others (outcomemeasure_posttest and outcomemeasure_followup) is not. Normality for the stacked variable is also not violated. In this case, when running the GLMM, do I adjust for normality violations because normality for one of the separate measures was violated?
Covariates: say age was identified as a covariate for outcomemeasure_baseline but not the others (separately: outcomemeasure_posttest and outcomemeasure_followup, or the stacked variable). In this case, do I include age as a covariate since it was identified as one for one of the separate variables?
Thank you so much in advance!
Relevant answer
Answer
The normality assumption only matters for a model with normally (Gaussian) distributed errors (an LMM), and it concerns the model residuals: these should be approximately normal for the assumption to be reasonable. Since you use the word GLMM, have you selected a model with a different distribution and link function? If these terms sound like gibberish, it might help to look up the terminology I just used or to read a few introductory articles or books. Best
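To make the residual point concrete, here is a hedged R sketch (simulated data; the id and outcome_* names are assumptions): reshape from wide to long, fit a mixed model with a random intercept per participant, and check normality on the model residuals rather than on each raw variable.

# Sketch only.
# install.packages(c("lme4", "tidyr"))
library(lme4)
library(tidyr)

set.seed(1)
wide <- data.frame(
  id = 1:60,
  age = sample(18:65, 60, replace = TRUE),
  outcome_baseline = rnorm(60, 50, 8),
  outcome_posttest = rnorm(60, 55, 8),
  outcome_followup = rnorm(60, 54, 8)
)

long <- pivot_longer(wide, cols = starts_with("outcome_"),
                     names_to = "time", names_prefix = "outcome_",
                     values_to = "outcome")

fit <- lmer(outcome ~ time + age + (1 | id), data = long)

# Assumption checks are done on the residuals of the fitted model:
qqnorm(resid(fit)); qqline(resid(fit))
shapiro.test(resid(fit))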
  • asked a question related to Advanced Statistics
Question
6 answers
Hi everyone,
Does anyone have a detailed SPSS (v. 29) guide on how to conduct Generalised Linear Mixed Models?
Thanks in advance!
Relevant answer
Answer
Ravisha Jayawickrama Don't thank Onipe Adabenege Yahaya, thank ChatGPT; you could have gotten the same answer yourself.
  • asked a question related to Advanced Statistics
Question
1 answer
I have a thermocouple which outputs a voltage level after signal conditioning. I need to convert it to the desired units in centigrade. Below is the formula I am using for the conversion. I need to prove that this formula ensures uniform conversion of all thermocouple voltage levels to centigrade units, such that 0 volts corresponds to -200 centigrade and 10 volts corresponds to 1500 centigrade.
Maximum voltage and minimum voltage are from DAQ after signal conditioning.
Maximum Reading Range and minimum reading range are values in centigrade.
We need to prove that a voltage range of, say, 0 V to 10 V will be uniformly converted to the -200 to 1500 centigrade reading range.
Below is the formula for which we need a proof.
Precision Factor = (Maximum Voltage - Minimum Voltage) / (Maximum Reading Range - Minimum Reading range)
Desired output value in Centigrade = ((Input Voltage level - Minimum Voltage)/ Precision Factor) + Minimum Reading Range
Relevant answer
Answer
As the temperature range is too extreme to use an external reference source to test your measurement system, the next best thing is to simulate thermocouple response.
There is data available on the relationship between temperature and voltage for the different thermocouple types. A standard multi-calibrator tool can generate voltages to simulate specific temperatures. Simply disconnect the thermocouple, simulate the thermocouple's response to temperature using the multi-calibrator, and compare your readout to the expected value.
As thermocouple responses tend to be curved rather than linear, there may be bigger errors at certain parts of the measurement range depending on how accurately the relationship can be expressed as an equation.
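For the formula itself, the mapping in the question is simply an affine (straight-line) transformation, so "uniform conversion" follows directly: equal voltage steps give equal temperature steps, and the endpoints map to -200 and 1500. A quick numerical check in R with the values stated in the question:

# Check the endpoints and the linearity of the conversion formula.
v_min <- 0;    v_max <- 10        # volts
t_min <- -200; t_max <- 1500      # centigrade

precision_factor <- (v_max - v_min) / (t_max - t_min)
to_centigrade <- function(v) (v - v_min) / precision_factor + t_min

to_centigrade(c(0, 5, 10))        # -200, 650, 1500
diff(to_centigrade(0:10))         # constant step of 170 centigrade per volt: uniform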
  • asked a question related to Advanced Statistics
Question
9 answers
Hi everyone,
I'm working on a project where I need to compare the similarity between line curves on two separate charts, and I could use some guidance. Here’s the situation:
  1. First Chart Details: Contains two curves, both of which are moving averages. These curves are drawn on a browser canvas by a user. I have access to the x and y data points of these curves.
  2. Second Chart Details: Contains two curves, with accessible x and y data points. In this chart, the x-axis represents time, and the y-axis represents values.
Challenge:
  • The two charts do not share the same coordinate system values.
Goal:
  • I would like to compare the similarity in patterns between individual lines across the two charts (i.e., one line from the first chart vs. one line from the second chart).
  • Additionally, I want to compare the overall shape formed by both lines on the first chart to the shape formed by both lines on the second chart.
Could anyone provide advice on methodologies or algorithms that could help in assessing the similarity of these line curves?
Thank you for any help.
Lovro Bajc
I have attached
Relevant answer
Answer
Adnan Majeed Thank you for the extensive answer. I am searching for a quantitative comparison solution. I need to find the similarity between the shape formed by two curves in one 2D space and the shape formed by two other curves in a second 2D space.
Calculating the area between two curves is not a suitable solution, as the shape is not taken into consideration.
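One simple, scale-free way to compare shapes when the coordinate systems differ is to resample each curve onto a common parameter grid, min-max normalise both axes, and then compute a distance or correlation between the normalised curves. A base-R sketch (the curves below are synthetic placeholders for your canvas and time-series data):

# Sketch: normalise and resample two curves, then compare their shapes.
normalise <- function(z) (z - min(z)) / (max(z) - min(z))

resample_curve <- function(x, y, n = 200) {
  s <- normalise(x)                      # map x onto [0, 1]
  approx(s, normalise(y), xout = seq(0, 1, length.out = n))$y
}

# Synthetic example curves on different coordinate systems
x1 <- seq(0, 50, length.out = 120);  y1 <- sin(x1 / 8) * 30 + 100
x2 <- seq(0, 10, length.out = 300);  y2 <- sin(x2 / 1.6) * 2 + 5

c1 <- resample_curve(x1, y1)
c2 <- resample_curve(x2, y2)

cor(c1, c2)                 # shape similarity (1 = identical shape)
sqrt(mean((c1 - c2)^2))     # RMSE between normalised curves (0 = identical)
# More shape-aware alternatives: dynamic time warping (package 'dtw') or the
# discrete Frechet distance, which tolerate local shifts along the x-axis.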
  • asked a question related to Advanced Statistics
Question
3 answers
This question is well answered now. Thanks!
Relevant answer
Answer
There is ongoing debate in the scientific community about the validity of p-values in scientific research. Many scientists and statisticians are calling for abandoning statistical significance tests and p-values. The only way to avoid p-hacking is to not use p-values. You should make scientific inference based on some descriptive statistics and your domain knowledge (not p-values).
  • asked a question related to Advanced Statistics
Question
2 answers
Hello everyone,
I am currently undertaking a research project that aims to assess the effectiveness of an intervention program. However, I am encountering difficulties in locating suitable resources for my study.
Specifically, I am in search of papers and tutorials on multivariate multigroup latent change modelling. My research involves evaluating the impact of the intervention program in the absence of a control group, while also investigating the influence of pre-test scores on subsequent changes. Additionally, I am keen to explore how the scores differ across various demographic groups, such as age, gender, and knowledge level (all measured as categorical variables).
Although I have come across several resources on univariate/bivariate latent change modelling with more than three time points, I have been unable to find papers that specifically address my requirements—namely, studies focusing on two time points, multiple latent variables (n >= 3), and multiple indicators for each latent variable (n >= 2).
I would greatly appreciate your assistance and guidance in recommending any relevant papers, tutorials, or alternative resources that pertain to my research objectives.
Best,
V. P.
Relevant answer
Answer
IYH Dear Vivian Parker
Muthén, B. (2004). Latent variable analysis: Growth mixture modeling and related techniques for longitudinal data (Ch. 19). In D. Kaplan (Ed.), Handbook of quantitative methodology for the social sciences. Newbury Park, CA: Sage.
Although this reference does not concentrate exclusively on two-time-point cases, it does contain discussions revolving around multiple latent variables and multiple indicators for those latent constructs. https://users.ugent.be/~wbeyers/workshop/lit/Muthen%202004%20LGMM.pdf
It contains rich content concerning latent growth curve models and elaborates on multivariate implementations.
While conceptually broader, it presents the crucial components necessary for building and applying two-time-point, multivariate latent change models.
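In case it helps to see the structure, here is a hedged lavaan sketch of a two-wave latent change score model for one construct with three indicators (simulated data; indicator names are placeholders). Extending it to several constructs means repeating the block per construct and adding covariances among the change factors; demographic comparisons are added via the group argument.

# Sketch only; follows the usual McArdle-type specification.
# install.packages("lavaan")
library(lavaan)

set.seed(1)
n <- 300
eta1 <- rnorm(n); eta2 <- eta1 + rnorm(n, sd = 0.6)
dat <- data.frame(
  y1_t1 = eta1 + rnorm(n, 0, .4), y2_t1 = .8*eta1 + rnorm(n, 0, .4), y3_t1 = .7*eta1 + rnorm(n, 0, .4),
  y1_t2 = eta2 + rnorm(n, 0, .4), y2_t2 = .8*eta2 + rnorm(n, 0, .4), y3_t2 = .7*eta2 + rnorm(n, 0, .4),
  group = sample(c("female", "male"), n, replace = TRUE)
)

model <- '
  # measurement model, loadings held equal across waves
  eta1 =~ 1*y1_t1 + l2*y2_t1 + l3*y3_t1
  eta2 =~ 1*y1_t2 + l2*y2_t2 + l3*y3_t2

  # latent change score
  delta =~ 1*eta2        # change factor loads on time 2
  eta2  ~  1*eta1        # unit autoregression
  eta2 ~~ 0*eta2         # time-2 variance fully decomposed
  delta ~  eta1          # pre-test score predicts change
'
fit <- sem(model, data = dat)
summary(fit, standardized = TRUE)
# fit_mg <- sem(model, data = dat, group = "group")   # multigroup version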
  • asked a question related to Advanced Statistics
Question
2 answers
The fact that a feature is of a complementary distribution does not seem to be a sufficient reason to discard the feature as irrelevant; especially as they seem phenomenologically relevant.
Relevant answer
Answer
That indeed does seem to be the case. Thank you for your answer!
  • asked a question related to Advanced Statistics
Question
5 answers
I would like to test whether the general relationship between the number of years of education and the wage is linear, exponential, etc. Or in other words, does going from 1 year to 2 years of education have the same impact on wages as going from 10 to 11. I want a general assessment for the world and not for a specific country.
I got standardized data from surveys on several countries and multiple times (since 2000). My idea is to build a multilevel mixed-effects model, with a fixed effect for the number of years of education and random effects for the country, the year of the survey and other covariates (age, sex, etc.). I’m not so used to this type of model: do you think it makes sense? Is this the most appropriate specification of the model for my needs?
Relevant answer
Answer
Check these.
  1. Multilevel Mixed-Effects Model: This model includes a fixed effect for the number of years of education and random effects for the country, the year of the survey, and other covariates (age, sex, etc.).
  2. Hierarchical Linear Modeling (HLM): HLM is an extension of regression analysis that allows for the modeling of hierarchical data structures, such as individuals nested within countries and years. HLM is useful because it allows for the estimation of both fixed and random effects and can accommodate missing data.
  3. Ordinary Least Squares (OLS): A simple OLS or conditional correlation.
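Following on from the options above, here is a hedged sketch of how the linearity question can be tested within the multilevel setup described in the question (simulated data; variable names are assumptions): fit one model with a linear education term and one with a flexible spline, both with random intercepts for country and survey year, and compare them.

# Sketch only: linear vs. flexible education effect in a mixed model.
# install.packages("lme4")
library(lme4)
library(splines)

set.seed(1)
n <- 3000
dat <- data.frame(
  educ_years = sample(0:18, n, replace = TRUE),
  age        = sample(20:60, n, replace = TRUE),
  sex        = sample(c("f", "m"), n, replace = TRUE),
  country    = sample(paste0("c", 1:30), n, replace = TRUE),
  year       = factor(sample(seq(2000, 2020, 2), n, replace = TRUE))
)
cty_eff <- rnorm(30, sd = 0.3)
dat$log_wage <- 1 + 0.07 * dat$educ_years + 0.005 * dat$age +
  cty_eff[as.integer(factor(dat$country))] + rnorm(n, sd = 0.5)

m_linear <- lmer(log_wage ~ educ_years + age + sex +
                   (1 | country) + (1 | year), data = dat, REML = FALSE)
m_spline <- lmer(log_wage ~ ns(educ_years, df = 4) + age + sex +
                   (1 | country) + (1 | year), data = dat, REML = FALSE)
anova(m_linear, m_spline)   # likelihood-ratio test: is there evidence of non-linearity?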
  • asked a question related to Advanced Statistics
Question
2 answers
Relevant answer
Answer
What could your "political inclinations" possibly have to do with the scientific issues discussed on this website?
  • asked a question related to Advanced Statistics
Question
7 answers
Hello everyone! As you understand, this concerns high-precision positioning using global navigation satellite systems, or simply the high-precision determination of a random variable. At what point does your estimate's precision fall into the "high-precision" category? Is this always a convention associated with the method of determining the random variable, or is there a general formulation for classifying estimates as highly precise?
Relevant answer
Answer
The precision of a class must be defined by specifications, which define the RMS and the instruments to be used. If there are no specifications, then people involved in such a class get together and decide about the specifications for the class.
The number of digits also characterizes the precision of an instrument. If a theodolite measures an angle with a direct reading of one second, its precision is one second. If you want to test it, you measure the three angles of a triangle several times and see how large the closing error is.
In any case, you define precision by specifications; you test precision by statistical analysis of measurements of a well-designed experiment.
  • asked a question related to Advanced Statistics
Question
3 answers
What are the possible ways of rectifying a lack-of-fit test that shows up as significant? Context: optimization of lignocellulosic biomass acid hydrolysis (dilute acid) mediated by nanoparticles.
  • asked a question related to Advanced Statistics
Question
6 answers
Hello,
I have the following problem. I have made three measurements of the same event under the same measurement conditions.
Each measurement has a unique probability distribution. I have already calculated the mean and standard deviation for each measurement.
My goal is to combine my three measurements to get a general result of my experiment.
I know how to calculate the combined mean: (x_comb = (x1_mean+x2_mean+x3_mean)/3)
I don't know how to calculate the combined standard deviation.
Please let me know if you can help me. If you have any other questions, don't hesitate to ask me.
Thank you very much! :)
Relevant answer
Answer
What is the pooled standard deviation?
The pooled standard deviation is a method for estimating a single standard deviation to represent all independent samples or groups in your study when they are assumed to come from populations with a common standard deviation. The pooled standard deviation is the average spread of all data points about their group mean (not the overall mean). It is a weighted average of each group's standard deviation.
Attached is the formula.
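To make the two notions explicit (a sketch with made-up numbers): the pooled SD averages the within-measurement spreads, while the SD of the combined dataset also has to account for how far the three means sit from the overall mean (law of total variance).

# Pooled SD (within-spread only) vs. SD of the combined sample.
means <- c(10.2, 10.5, 9.9)      # hypothetical means of the 3 measurements
sds   <- c(0.8, 0.7, 0.9)        # their standard deviations
n     <- c(100, 100, 100)        # observations per measurement

pooled_sd <- sqrt(sum((n - 1) * sds^2) / sum(n - 1))

grand_mean  <- sum(n * means) / sum(n)
combined_sd <- sqrt((sum((n - 1) * sds^2) + sum(n * (means - grand_mean)^2)) /
                    (sum(n) - 1))

c(pooled = pooled_sd, combined = combined_sd)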
  • asked a question related to Advanced Statistics
Question
7 answers
Is ex ante power analysis the same as a priori power analysis or is it something different in the domain of SEM and multiple regression analysis? If it is different, then what are the recommended methods or procedures? Any citations for it?
Thank you for precious time and help!
Relevant answer
Answer
Zubaida Abdul Sattar Thanks a lot for sharing detailed information.
  • asked a question related to Advanced Statistics
Question
6 answers
I want to ask about the usage of parametrical and non-parametrical tests if we have an enormous sample size.
Let me describe a case for discussion:
- I have two groups of samples of a continuous variable (let's say: Pulse Pressure, so the difference between systolic and diastolic pressure at a given time), let's say from a) healthy individuals (50 subjects) and b) patients with hypertension (also 50 subjects).
- there are approx. 1000 samples of the measured variable from each subject; thus, we have 50*1000 = 50000 samples for group a) and the same for group b).
My null hypothesis is: that there is no difference in distributions of the measured variable between analysed groups.
I calculated two different approaches, providing me with a p-value:
Option A:
- I took all samples from group a) and b) (so, 50000 samples vs 50000 samples),
- I checked the normality in both groups using the Shapiro-Wilk test; both distributions were not normal
- I used the Mann-Whitney test and found significant differences between distributions (p<0.001), although the median value in group a) was 43.0 (Q1-Q3: 33.0-53.0) and in group b) 41.0 (Q1-Q3: 34.0-53.0).
Option B:
- I averaged the variable's values over all participants (so, 50 samples in group a) and 50 samples in group b))
- I checked the normality in both groups using the Shapiro-Wilk test; both distributions were normal,
- I used t Student test and obtained p-value: 0.914 and median values 43.1 (Q1-Q3: 33.3-54.1) in group a) and 41.8 (Q1-Q3: 35.3-53.1) in group b).
My intuition is that I should use option B and average the signal before the testing. Otherwise, I reject the null hypothesis, having a very small difference in median values (and large Q1-Q3), which is quite impractical (I mean, visually, the box plots look very similar, and they overlap each other).
What is your opinion about these two options? Are both correct but should be used depending on the hypothesis?
Relevant answer
Answer
You have 1000 replicate measurements from each subject. These 1000 values are correlated and should not be analyzed as if they were independent. So your model is wrong and you should identify a more sensible model. Eventually, the test of the difference between your groups should not have more than 98 degrees of freedom (it should have less, since a sensible model will surely include some other parameters than just the two means). Having 1000 replicate measurements seems an overkill to me if there is no other aspect that should be considered in the analysis (like a change over time, with age, something like that). If there is nothing else that should be considered, the simplest analysis is to average the 1000 values per patient and do a t-test on 2x50 (averaged) values.
If you had thousands of independent samples per group, estimation would be more interesting than testing. You should then interpret the 95% confidence interval of the estimate (biological relevance) rather than the (in this respect silly) fact of whether it is just in the positive or in the negative range.
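A small sketch of the aggregate-then-test approach described above, next to the equivalent mixed-model formulation (simulated data; column names are assumptions):

# 50 subjects per group, 1000 replicate measurements per subject.
set.seed(1)
subjects <- data.frame(
  subject = 1:100,
  group   = rep(c("healthy", "hypertension"), each = 50),
  mu      = 42 + rep(c(0, -1), each = 50) + rnorm(100, sd = 5)  # true subject means
)
dat <- subjects[rep(1:100, each = 1000), c("subject", "group")]
dat$pp <- rnorm(nrow(dat), mean = subjects$mu[dat$subject], sd = 8)

# Option B, done properly: one mean per subject, then a t-test on 2 x 50 values.
subj_means <- aggregate(pp ~ subject + group, data = dat, FUN = mean)
t.test(pp ~ group, data = subj_means)

# Equivalent clustered view:
# library(lme4); summary(lmer(pp ~ group + (1 | subject), data = dat))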
  • asked a question related to Advanced Statistics
Question
2 answers
Hello,
In the realm of economic and social analysis, one question looms large: can the intricacies of advanced statistical and mathematical models effectively capture and withstand the complexities of the real world?
Thanks
Relevant answer
Answer
An age-old and widely debated question: there is no univocal and precise answer.
  • asked a question related to Advanced Statistics
Question
10 answers
Is it possible to run a regression with both secondary and primary data in the same model? I mean, when the dependent variable is primary data sourced via questionnaire and the independent variables are secondary data gathered from published financial statements?
For example: if the topic is capital budgeting moderators and shareholders' wealth (SHW). Capital budgeting moderators are proxied by inflation, management attitude to risk, economic conditions and political instability, while SHW is proxied by market value, profitability and retained earnings.
Relevant answer
Answer
There should be a causal effect of the independent variables on the dependent variable in regression analysis. Primary data gathered through a questionnaire for the dependent variable would be influenced by current happenings, while the independent variables based on secondary data were influenced by past or historical happenings. Therefore, there would not be true linkages between the independent variables and the dependent variable, and running a regression with both secondary and primary data in the same model would not give you the best outcome.
  • asked a question related to Advanced Statistics
Question
12 answers
I have non-parametric continuous data. I want to apply correlation analysis. However, I am undecided about which correlation analysis I should apply to continuous data that does not have a normal distribution: Pearson, Spearman, Kendall, or another method. I know there are many methods for non-parametric data. But which one should I choose for correlation?
Relevant answer
Answer
A mean is a model. It is an assumption that, barring other effects, every member of a population will exhibit the mean value. Variation about the mean (i.e., the difference between the model and the real measurements) can be normally distributed, exhibiting an exponentially decreasing frequency that is symmetric on either side of the mean. It can also be log-normal, exponential, triangular, uniform, or 'other'.
For normally and log-normally distributed error about the mean we can employ parametric tests (e.g., the t-test or F-test) to compare, with an assignable probability, the difference between one or more populations. These parametric tests also let you set a confidence interval on the model parameters (i.e., the mean +/- t(alpha, df)*std dev). When the model is more complex you can distribute the confidence intervals to all the parameters of the model (e.g., to the slope and intercept of a line).
There are no similarly simple tests for the other distributions. Most non-parametric tests (e.g., Chi-squared, Mann-Whitney, Kolmogorov-Smirnov) require that you bin the error (i.e., create a histogram of the error) to compare two or more populations. The act of binning the data converts the error from a continuous to a discrete distribution. Discrete distributions include the Poisson, binomial and multinomial distributions, Chi-squared, etc.
Ultimately, when comparing two or more models, creating an ROC curve, based on true vs. false positives, can also be used.
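For completeness, the rank-based options mentioned in the question are one line each in R (synthetic skewed data, purely for illustration):

set.seed(1)
x <- rexp(60)                     # skewed, non-normal
y <- x^1.5 + rnorm(60, sd = 0.3)  # monotone but non-linear relation

cor.test(x, y, method = "spearman")   # Spearman's rho
cor.test(x, y, method = "kendall")    # Kendall's tau
cor.test(x, y, method = "pearson")    # for comparison: linear association only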
  • asked a question related to Advanced Statistics
Question
10 answers
I have non-parametric continuous data. I want to apply correlation analysis. However, I am undecided about which correlation analysis I should apply to continuous data that does not have a normal distribution: Pearson, Spearman, Kendall, or another method. I know there are many methods for non-parametric data. But which one should I choose for correlation?
Relevant answer
Answer
Fraud alert: Shital Choudhar's answer is simply copied and pasted from ChatGPT.
This means that Shital has no idea whether the information and advice in the answer is right or wrong, helpful or misleading. And they don't care.
  • asked a question related to Advanced Statistics
Question
10 answers
These are a few questions for your reference:
How much did you learn about managing your money from your parents?
· None
· Hardly at all
· Little
· Some
· A lot
How often were you influenced by, or did you discuss, finances with your parents?
· Never
· Once a year
· Every few months
· Twice a month
· Weekly
What is your current investment amount in stocks/shares? (Portfolio value)
· 1 - 90,000
· 90,000–170,000
· 170,000–260,000
· 260,000–340,000
· More than 340,000
The above questions are allocated weights from 1 to 5.
Relevant answer
Answer
You can set the variable type to ordinal for the analysis in SPSS.
  • asked a question related to Advanced Statistics
Question
6 answers
In plant breeding, what are the uses of the discriminant function?
Relevant answer
Answer
The discriminant function technique involves developing selection criteria based on a combination of various characters, and it aids the breeder in indirect selection for genetic improvement in yield. In plant breeding, the selection index refers to a linear combination of characters associated with yield.
  • asked a question related to Advanced Statistics
Question
4 answers
Assuming that $X \in \mathbb{R}^{p \times n}$ is the data matrix, where p is the dimension, n is the sample size. We obtain the data permutation matrices by randomly permuting entries in each column of the data matrix. What are the statistical applications of the data permutation matrices obtained in this way?
Relevant answer
Answer
Thank you for your reply, David Morse. Best regards.
  • asked a question related to Advanced Statistics
Question
2 answers
Is nonparametric regression used in psychology research? And if yes, what types? I know about quantile regression, but I can't find much literature where researchers use kernel regression or local regression. Is this because they aren't useful for such research?
Relevant answer
Answer
Nonparametric regression techniques, including kernel and local regression, can be used in psychology research. While they may not be as commonly employed as other regression methods in psychology, they can be valuable in certain contexts where the assumptions of parametric regression models may not hold or when researchers are interested in exploring non-linear relationships.
Kernel regression, also known as kernel smoothing or kernel density estimation, is a nonparametric method that estimates the conditional expectation of a dependent variable given an independent variable. It can be particularly useful when the relationship between variables is not expected to be linear and no specific functional form is assumed. Kernel regression uses a kernel function to weight the data points around the target point, producing a smoothed estimate.
Local regression, often referred to as locally weighted scatterplot smoothing (LOWESS) or LOESS, is another nonparametric regression technique that allows for the flexible modeling of relationships. It fits a separate regression model to each data point by giving more weight to nearby points and less weight to distant points. Local regression can capture non-linear patterns and is suitable for situations where the relationship between variables may change across the range of the independent variable.
While there may be less literature specifically focused on the application of kernel regression and local regression in psychology research compared to parametric methods, it does not necessarily mean they are not useful. The choice of regression technique often depends on the specific research question, the nature of the data, and the assumptions that can be reasonably made. Parametric regression models, such as linear regression or generalized linear models, are more commonly used in psychology due to their simplicity and interpretability. However, nonparametric regression techniques can be valuable when exploring complex relationships or when assumptions of parametric models are not met.
It's worth noting that the application of nonparametric regression techniques in psychology research may be influenced by factors such as the availability of specialized software, the familiarity of researchers with these methods, and the specific research traditions within the field. Nonetheless, when appropriate, nonparametric regression approaches can offer valuable insights into relationships that may not be captured by parametric models.
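As a concrete illustration of the second technique described above, local regression is available in base R via loess(); the sketch below compares it to a straight-line fit on simulated data with a clearly non-linear relation.

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

fit_loess  <- loess(y ~ x, span = 0.5)   # local regression
fit_linear <- lm(y ~ x)

ord <- order(x)
plot(x, y, col = "grey50")
lines(x[ord], fitted(fit_loess)[ord], lwd = 2)   # follows the sine-shaped trend
abline(fit_linear, lty = 2)                      # misses the non-linearity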
  • asked a question related to Advanced Statistics
Question
4 answers
I'm working on my PhD thesis and I'm stuck around expected analysis.
I'll briefly explain the context then write the question.
I'm studying moral judgment in the cross-context between Moral Foundations Theory and Dual Process theory.
Simplified: MFT states that moral judgments are almost always intuitive, while DPT states that better reasoners (higher on cognitive capability measures) will make moral judgments through analytic processes.
I have another idea: people will make moral judgments intuitively only for their primary moral values (e.g., for conservatives those are the binding foundations - respecting authority, ingroup loyalty and purity), while for the values they aren't much concerned about, they'll have to use analytical processes to figure out what judgment to make.
To test this idea, I'm giving participants:
- a few moral vignettes to judge (one concerning progressive values and one concerning conservative values) on 1-7 scale (7 meaning completely morally wrong)
- moral foundations questionnaire (measuring 5 aspects of moral values)
- CTSQ (Comprehensive Thinking Styles Questionnaire), CRT and belief bias tasks (8 syllogisms)
My hypothesis is therefore that cognitive measures of intuition (such as intuition preference from CTSQ) will predict moral judgment only in the situations where it concerns primary moral values.
My study design is correlational. All participants are answering all of the questions and vignettes. So I'm not quite sure how to analyse the findings to test the hypothesis.
I was advised to do a regression analysis where moral values (5 from the MFQ) or moral judgments from the two different vignettes would be predictors, and the intuition measure would be the dependent variable.
My concern is that this analysis is the wrong choice because I'll have both progressives and conservatives in the sample, which means both groups of values should predict intuition if my assumption is correct.
I think I need to either split people into groups based on their MFQ scores than do this analysis, or introduce some kind of multi-step analysis or control or something, but I don't know what would be the right approach.
If anyone has any ideas please help me out.
How would you test the given hypothesis with available variables?
Relevant answer
Answer
There are several statistical analysis techniques available, and the choice of method depends on various factors such as the type of data, research question, and the hypothesis being tested. Here is a step-by-step guide on how to approach hypothesis testing:
  1. Formulate your research question and null hypothesis: Start by clearly defining your research question and the hypothesis you want to test. The null hypothesis (H0) represents the default position, stating that there is no significant relationship or difference between variables.
  2. Select an appropriate statistical test: The choice of statistical test depends on the nature of your data and the research question. Here are a few common examples: Student's t-test, used to compare means between two groups; Analysis of Variance (ANOVA), used to compare means among more than two groups; the Chi-square test, used to analyze categorical data and test for independence or association between variables; correlation analysis, used to examine the relationship between two continuous variables; and regression analysis, used to model the relationship between a dependent variable and one or more independent variables.
  3. Set your significance level and determine the test statistic: Specify your desired level of significance, often denoted as α (e.g., 0.05). This value represents the probability of rejecting the null hypothesis when it is true. Based on your selected test, identify the appropriate test statistic to calculate.
  4. Collect and analyze your data: Gather the necessary data for your analysis. Perform the chosen statistical test using statistical software or programming languages like R or Python. The specific steps for analysis depend on the chosen test and software you are using.
  5. Calculate the p-value: The p-value represents the probability of obtaining the observed results (or more extreme) if the null hypothesis is true. Compare the p-value to your significance level (α). If the p-value is less than α, you reject the null hypothesis and conclude that there is evidence for the alternative hypothesis (Ha). Otherwise, you fail to reject the null hypothesis.
  6. Interpret the results: Based on the outcome of your analysis, interpret the results in the context of your research question. Consider the effect size, confidence intervals, and any other relevant statistical measures.
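The generic steps above aside, one way to handle the specific concern in the question is not to split the sample at all but to model the moderation directly: let the intuition measure interact with how strongly each respondent endorses the foundation the vignette targets. A hedged R sketch with simulated data (all variable names are assumptions):

set.seed(1)
n <- 400
dat <- data.frame(
  intuition   = rnorm(n),              # e.g. CTSQ intuition preference
  mfq_binding = rnorm(n),              # endorsement of binding foundations
  crt         = rnorm(n),
  age         = sample(18:70, n, replace = TRUE)
)
# Simulated outcome: intuition predicts judgments more strongly at high endorsement
dat$judgment_binding <- 4 + 0.1 * dat$intuition + 0.2 * dat$mfq_binding +
  0.4 * dat$intuition * dat$mfq_binding + rnorm(n)

fit <- lm(judgment_binding ~ intuition * mfq_binding + crt + age, data = dat)
summary(fit)   # the intuition:mfq_binding term carries the hypothesis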
  • asked a question related to Advanced Statistics
Question
6 answers
I constructed a linear mixed-effects model in Matlab with several categorical fixed factors, each having several levels. Fitlme calculates confidence intervals and p values for n-1 levels of each fixed factor compared to a selected reference. How can I get these values for other combinations of factor levels? (e.g., level 1 vs. level 2, level 1 vs. level 3, level 2 vs. level 3).
Thanks,
Chen
Relevant answer
Answer
First, to change the reference level, you can specify the order of the items in the categorical array:
categorical(A,[1, 2, 3],{'red', 'green', 'blue'}) or
categorical(A,[3, 2, 1],{'blue', 'green', 'red'})
Second, you can specify the appropriate hypothesis (contrast) matrix for the coefTest function to compare every pair of categories.
  • asked a question related to Advanced Statistics
Question
1 answer
I am attempting to use the Seurat FindAllMarkers function to validate markers for rice taken from the plantsSCRNA-db. I want to use the ROC test in order to get a good idea of how effective any of the markers are. While doing a bit of research, different stats forums say: "If we must label certain scores as good or bad, we can reference the following rule of thumb from Hosmer and Lemeshow in Applied Logistic Regression (p. 177):
0.5 = No discrimination; 0.5-0.7 = Poor discrimination; 0.7-0.8 = Acceptable discrimination; 0.8-0.9 = Excellent discrimination; >0.9 = Outstanding discrimination."
For more background, the output of the function returns a dataframe with a row for each gene, showing myAUC: area under the Receiver Operating Characteristic, and Power: the absolute value of myAUC - 0.5 multiplied by 2. Some other statistics are included as well such as average log2FC and the percent of cells expressing the gene in one cluster vs all other clusters.
With this being said, I would assume a myAUC score of 0.7 or above would imply the marker is effective. However given the formula used to calculate power, a myAUC score of 0.7 would correlate to a power of 0.4. So with this being said, would it be fair to assume that myAUC should be ignored for the purposes of validating markers? Or should both values be taken into account somehow?
Relevant answer
Answer
In the Seurat R package for analyzing single-cell RNA-seq data, "power" and "myAUC" are both functions used for selecting the most informative features or genes in the dataset. However, they employ different approaches and criteria to achieve this.
  1. Power: The "power" function in Seurat is used for identifying highly variable genes (HVGs) based on their expression dispersion relative to their mean expression level. This approach aims to capture genes that display biological variability across cells and are likely to be driving the observed heterogeneity in the dataset. By default, the "power" function calculates the power of a statistical test to detect differences in expression between two groups of cells, such as treatment vs. control or different cell types. It estimates the relationship between the mean expression and variance of each gene using a trend line and defines highly variable genes as those with expression levels deviating significantly from the trend line. The function outputs a list of highly variable genes ranked by their deviation.
  2. myAUC: The "myAUC" function in Seurat stands for "Area Under the Curve" and is used to rank genes based on their differential expression between two predefined groups or conditions. It employs the area under the receiver operating characteristic (ROC) curve as a measure of differential expression, where the ROC curve represents the true positive rate against the false positive rate at various gene expression thresholds. The myAUC algorithm evaluates the discriminatory power of each gene in distinguishing between the two groups and ranks them accordingly. Genes with higher AUC values have greater discriminatory power and are considered more differentially expressed between the groups of interest.
In summary, the "power" function identifies highly variable genes based on their expression dispersion relative to mean expression, while the "myAUC" function ranks genes based on their ability to discriminate between two predefined groups or conditions using the area under the ROC curve. Both functions aim to identify genes that are potentially important for distinguishing between different cell types, states, or experimental conditions, but they use different statistical and computational approaches to achieve this goal.
  • asked a question related to Advanced Statistics
Question
4 answers
Hi everyone,
I need to convert standard error (SE) into standard deviation (SD). The formula for that is
SE times the square root of the sample size
By 'sample size', does it mean the total sample size or sample sizes of individual groups? For example, the intervention group has 40 participants while the control group has 39 (so the total sample size is 79) So, when calculating SD for the intervention group, do I use 40 as the sample size or 79?
Thank you!
Relevant answer
Answer
7.7.3.2 Obtaining standard deviations from standard errors and (cochrane.org)
Also, there is a useful calculator in the attached Excel file from Cochrane.
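On the group-vs-total question: the n in SD = SE x sqrt(n) is the n that the standard error was computed from, i.e. the size of that group (40 for the intervention arm, 39 for control), not the combined 79. A tiny sketch with a hypothetical SE:

se_intervention <- 0.8                       # hypothetical reported SE
n_intervention  <- 40                        # group size, not the total sample
sd_intervention <- se_intervention * sqrt(n_intervention)
sd_intervention                              # about 5.06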
  • asked a question related to Advanced Statistics
Question
12 answers
My device is quite old and can't run SPSS. Is there any acceptable alternative available?
Relevant answer
Answer
PSPP has the same look and feel as SPSS but is less good for printed output. But it's free, and there are workarounds to offset its limitations.
  • asked a question related to Advanced Statistics
Question
4 answers
I have ordinal data on happiness of citizens from multiple countries (from the European Value Study) and I have continuous data on the GDP per capita of multiple countries from the World Bank. Both of these variables are measured at multiple time points.
I want to test the hypothesis that countries with a low GDP per capita will see more of an increase in happiness with an increase in GDP per capita than countries that already have a high GDP per capita.
My first thought to approach this is that I need to make two groups; 1) countries with low GDP per capita, 2) countries with high GDP per capita. Then, for both groups I need to calculate the correlation between (change in) happiness and (change in) GDP per capita. Lastly, I need to compare the two correlations to check for a significant difference.
I am stuck, however, on how to approach the correlation analysis. For example, I don't know how to (and if I even have to) include the repeated measures from the different time points at which the data was collected. If I just base my correlations on one time point, I feel like I am not really testing my research question, considering I am talking about an increase in happiness and an increase in GDP, which is a change over time.
If anyone has any suggestions on the right approach, I would be very thankful! Maybe I am overcomplicating it (wouldn't be the first time)!
Relevant answer
Answer
Collect data on the two variables at the same time points, treating each time point as a sample. After collecting N samples over time, perform a regression analysis on them; the correlation coefficient will be obtained.
  • asked a question related to Advanced Statistics
Question
2 answers
Hello to everyone,
I've had several discussions with my colleagues about setting up field experiments to be replicated in different environments.
We agree that each experiment must have exactly the same experimental design to ensure data comparability.
I've been told that these experiments must also have the exact same randomization. I don't agree, because I believe that it is the experimental design itself that ensures data comparability. Below I attach a drawing to better explain the issue:
In the attached file, I have the same experimental design between locations, with the same randomization within subplots. Shouldn't we randomize the treatments (i, ii, iii and iv) within each subplot? Does it make sense to have an exact copy of experimental fields?
Thanks in advance!
Relevant answer
Answer
It is randomization either way; whether the locations share the same randomization or use different ones, I see no problem. Regards.
  • asked a question related to Advanced Statistics
Question
9 answers
I am carrying out statistical testing on a paired sample - before and after a medication review in 50 patients. I think I am correct in using the Wilcoxon signed-rank test, as the data is not normally distributed. However, I want to check: the review could only reduce the number of medicines, so there are no positive ranks, only negative ones, and therefore my test statistic comes out at 0. This obviously rejects the null hypothesis, indicating a significant reduction in the number of medicines, but I just wanted to check: is this normal? Should a different test be used, or is this OK?
Thanks
Relevant answer
Answer
Oh that makes sense, thank you! I only have limited statistics teaching behind me so this has been a struggle.
  • asked a question related to Advanced Statistics
Question
6 answers
Hello everyone,
I am currently doing research on the impact of online reviews on consumer behavior. Unfortunately, statistics are not my strong point, and I have to test three hypotheses.
The hypotheses are as follows: H1: There is a connection between the level of reading online reviews and the formation of impulsive buying behavior in women.
H2: There is a relationship between the age of the respondents and susceptibility to the influence of online reviews when making a purchase decision.
H3: There is a relationship between respondents' income level and attitudes that online reviews strengthen the desire to buy.
Questions related to age, level of income and level of reading online reviews were set as ranks (e.g. 18-25 years; 26-35 years...; 1000-2000 Eur; 2001-3000 Eur; every day; once a week; once a month etc.), and the questions measuring attitudes and impulsive behavior were formed in the form of a Likert scale.
What statistical method should be used to test these hypotheses?
Relevant answer
Answer
Go with the test of association (chi-square test).
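Since all of the variables involved here are categorical or ordinal, a chi-square test of association on the cross-tabulation is the simplest route; with ordered categories you could additionally report a rank correlation (Kendall's tau-b or Spearman's rho), which respects the ordering. A sketch with simulated categories (the labels are placeholders for your actual response options):

set.seed(1)
n <- 300
age_band    <- sample(c("18-25", "26-35", "36-45", "46+"), n, replace = TRUE)
review_freq <- sample(c("daily", "weekly", "monthly", "rarely"), n, replace = TRUE)

tab <- table(age_band, review_freq)
chisq.test(tab)          # H2-style test of association

# Ordinal-by-ordinal strength of association:
cor.test(as.integer(factor(age_band,    levels = c("18-25", "26-35", "36-45", "46+"))),
         as.integer(factor(review_freq, levels = c("daily", "weekly", "monthly", "rarely"))),
         method = "kendall")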
  • asked a question related to Advanced Statistics
Question
5 answers
Relevant answer
Answer
I've only glanced quickly at those two resources, but are you sure they are addressing the same thing? Yates' (continuity) correction as typically described entails subtraction 0.5 from |O-E| before squaring in the usual equation for Pearson's Chi2. E.g.,
But adding 0.5 to each cell in a 2x2 table is generally done to avoid division by 0 (e.g., when computing an odds ratio), not to correct for continuity (AFAIK). This is what makes me wonder if your two resources are really addressing the same issues. But as I said, I only had time for a very quick glance at each. HTH.
  • asked a question related to Advanced Statistics
Question
2 answers
Hello everyone,
I would like to investigate three factors using a central composite design.
Each factor has 5 levels (+1 and -1, 2 axial points and 1 center point).
I chose my high and low levels (+ and -1) based on a screening DoE I did previously using the same factors.
I chose an alpha of 1.681 for the axial points because I would like my model to be rotatable. However, for one of the three factors, one of the axial points is outside the feasible range (a negative CaCl concentration...). I thought of increasing my low level for this factor to avoid this - let's say, increasing the value from 0.05 to 0.1 to avoid reaching the negative range with the axial point - but I was wondering if this would affect the reliability of my model?
Another option would be to change the design to one that has no axial points outside the design points. However, this is actually my area of interest.
Can anyone help?
Relevant answer
Answer
Dear Thuy,
In our last publication, we had the same problem with the CCD alpha generating negative values, so I entered the levels in terms of alphas and it worked perfectly.
You can check our publication
Kind regards
  • asked a question related to Advanced Statistics
Question
5 answers
Hello, I have a question regarding using a binary-coded dependent variable on the Mann-Whitney U test.
I have a test with 15 questions from 3 different categories in my study. The answers are forced answers and have one correct answer. I coded the answers as binary values with 1 being correct and 0 being incorrect.
Therefore, for 3 different categories, the participants have a mean score between 0 and 1 representing their success (I took the mean because I have many participants who did not answer 2 or 3 questions).
Does it make sense to use the mean of binary-coded values as the dependent variable in a nonparametric test, or does that sound weird and should I apply something else, like chi-square or logistic regression?
Relevant answer
Answer
It depends, but first, why did some people not answer some questions? Were these not presented to them by some random mechanism, or were they the last few (and perhaps easier or harder than others), or did people see them but decide not to answer because they thought they would miss them or that it would take too long to answer them, etc.? Second, I am not sure what the three categories mean. Is it that they got, say, 5 items on math, 5 on reading, and 5 on science? Or that there were three groups? The choice of statistic will depend on this. Third, what research questions do you have? And finally, the Mann-Whitney test is usually used for two categories of people (i.e., two groups).
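To add one concrete option that the question already hints at: once the missing-answer mechanism is clarified, the binary responses can be modelled directly with a binomial (logistic) model on the number of correct answers out of the number attempted, rather than comparing means of 0/1 codes. A hedged sketch with simulated data (names are assumptions):

set.seed(1)
n <- 120
dat <- data.frame(
  group     = sample(c("A", "B"), n, replace = TRUE),
  attempted = sample(12:15, n, replace = TRUE)        # 15 items, some skipped
)
p <- ifelse(dat$group == "A", 0.65, 0.55)             # simulated true success rates
dat$correct <- rbinom(n, size = dat$attempted, prob = p)

fit <- glm(cbind(correct, attempted - correct) ~ group,
           family = binomial, data = dat)
summary(fit)   # group effect on the log-odds of a correct answer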
  • asked a question related to Advanced Statistics
Question
6 answers
I am currently studying Research and Methodology and got this query. Please can anyone answer this?
Relevant answer
Answer
Shubh Yadav Type I and Type II errors are both dangerous and, depending on the circumstances, can have catastrophic repercussions.
A Type I error, often known as a false positive, happens when a statistical test rejects the null hypothesis even though it is true. This can result in inaccurate inferences and inappropriate actions being taken on the basis of such findings. A Type I error might arise, for example, if a medical test is meant to identify a certain disease and the test wrongly suggests that a patient has the condition when they do not. This might lead to the patient receiving unnecessary and perhaps hazardous therapy.
A Type II error, often known as a false negative, arises when a statistical test fails to reject the null hypothesis despite the fact that the null hypothesis is wrong. This can also lead to erroneous conclusions and inappropriate actions. For example, if the same medical test described above wrongly suggests that a patient does not have the disease while, in reality, they do, the patient may not receive the appropriate therapy.
Making a Type I error may be more dangerous in some instances, while making a Type II error may be more significant in others. In the end, it is determined by the context and the potential implications of each sort of error.
  • asked a question related to Advanced Statistics
Question
3 answers
Hello!
In general, as a rule of thumb, what is the acceptable value for standardised factor loadings produced by a confirmatory factor analysis?
And, what could be done/interpretation if the obtained loadings are lower than the acceptable value?
How does everyone approach this?
Relevant answer
Answer
@ Ravisha Jayawickrama Most sources accept standardised factor loadings above 0.4.
  • asked a question related to Advanced Statistics
Question
4 answers
Merry Christmas everyone!
I used the Interpersonal Reactivity Index (IRI) subscales Empathic Concern (EC), Perspective Taking (PT) and Personal Distress (PD) in my study (N = 900). When I calculated Cronbach's alpha for each subscale, I got .71 for EC, .69 for PT and .39 for PD. The value for PD is very low. The analysis indicated that if I deleted one item, the alpha would increase to .53, which is still low but better than .39. However, as my study does not focus mainly on the psychometric properties of the IRI, what kind of arguments can I make to say the results are still valid? I did say findings (for the PD) should be taken with caution, but what else can I say?
Relevant answer
Answer
A scale reliability of .39 (and even .53!) is very low. Even if your main focus is not on the psychometric properties of your measures, you should still care about those properties. Inadequate reliability and validity can jeopardize your substantive results.
My recommendation would be to examine why you get such a low alpha value. Most importantly, you should first check whether each scale (item set) can be seen as unidimensional (measuring a single factor). This is usually done by running a confirmatory factor analysis (CFA) or item response theory analysis. Unidimensionality is a prerequisite for a meaningful interpretation of Cronbach's alpha (alpha is a composite reliability index for essentially tau-equivalent measures). CFA allows you to test the assumption of unidimensionality/essential tau equivalence and to examine the item loadings.
Also, you can take a look at the item intercorrelations. If some items have low correlations with others, this may indicate that they do not measure the same factor (and/or that they contain a lot of measurement error). Another reason for a low alpha value can be an insufficient number of items.
  • asked a question related to Advanced Statistics
Question
5 answers
Respected researchers,
I am using the Friedman test to determine whether there are statistically significant differences between the sessions within a group. Can I do this?
Relevant answer
Answer
Yes. You can use the Friedman test to assess whether there are significant differences between the sessions within a group if the data are not normally distributed or are of ordinal type.
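A minimal R example on a sessions-within-subjects layout (made-up scores), including one common follow-up:

set.seed(1)
scores <- matrix(rnorm(10 * 3, mean = rep(c(5, 6, 6.5), each = 10)),
                 nrow = 10, ncol = 3,
                 dimnames = list(paste0("subj", 1:10),
                                 c("session1", "session2", "session3")))
friedman.test(scores)          # rows = subjects (blocks), columns = sessions

# Post-hoc pairwise comparisons, e.g. Wilcoxon signed-rank with a correction:
pairwise.wilcox.test(as.vector(scores), rep(colnames(scores), each = 10),
                     paired = TRUE, p.adjust.method = "holm")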
  • asked a question related to Advanced Statistics
Question
2 answers
Can someone please share how to select the best algorithm to use for the child nodes when you have the parent nodes' probabilities, as there are so many algorithms to choose from in the GeNIe software?
The algorithms are
Relevance-based decomposition, polytree, EPIS sampling, AIS sampling, Logic sampling, Backward sampling, Likelihood sampling, Self-importance
Relevant answer
Selecting the best algorithm for calculating the child nodes from the parent nodes using GeNIe software can depend on several factors, including the specific characteristics of your data, the computational resources available to you, and the desired properties of the resulting model. Some general considerations to keep in mind when choosing an algorithm include:
  1. Data characteristics: Different algorithms may be more or less suitable for different types of data. For example, some algorithms may be better suited for data with high levels of noise or missing values, while others may perform better with more structured data.
  2. Computational resources: Some algorithms may be more computationally intensive than others, which may be a consideration if you have limited computational resources or are working with large datasets.
  3. Model properties: Different algorithms may result in models with different properties, such as different levels of accuracy or interpretability. You may want to consider the desired properties of the resulting model when selecting an algorithm.
To help you decide which algorithm is best for your particular needs, you may want to consider consulting documentation or literature on the algorithms available in GeNIe software, as well as experimenting with different algorithms on your data to see which one performs best.
  • asked a question related to Advanced Statistics
Question
2 answers
I am currently working on my master thesis and I ran into this statistical problem. Hopefully one of you can help me, because so far I can only see that a mediation analysis with a MANCOVA isn't possible.
Relevant answer
Answer
Yes, you can run such a model (with all variables included simultaneously) within the framework of path analysis using programs for structural equation modeling such as Mplus, lavaan (in R), and AMOS. In the free workshop below, I show how to run path analysis in Mplus: https://www.goquantfish.com/courses/path-analysis-with-mplus
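To show what that looks like in lavaan (free and R-based), here is a minimal single-mediator sketch with bootstrapped indirect effects on simulated data; covariates and additional outcome variables are added as further equations in the model string. All variable names are placeholders, not from the thesis.

library(lavaan)

set.seed(1)
n <- 250
x <- rnorm(n)
m <- 0.5 * x + rnorm(n)
y <- 0.4 * m + 0.2 * x + rnorm(n)
dat <- data.frame(x = x, m = m, y = y)

model <- '
  m ~ a * x
  y ~ b * m + c * x
  indirect := a * b
  total    := c + a * b
'
fit <- sem(model, data = dat, se = "bootstrap", bootstrap = 500)
parameterEstimates(fit, boot.ci.type = "perc")   # bootstrap CIs for the indirect effect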
  • asked a question related to Advanced Statistics
Question
6 answers
I am measuring two continuous variables over time in four groups. Firstly, I want to determine if the two variables correlate in each group. I then want to determine if there is significant differences in these correlations between groups.
For context, one variable is weight, and one is a behaviour score. The groups are receiving various treatment and I want to test if weight change influences the behaviour score differently in each group.
I have found the r package rmcorr (Bakdash & Marusich, 2017) to calculate correlation coefficients for each group, but am struggling to determine how to correctly compare correlations between more than two groups. The package diffcorr allows comparing between two groups only.
I came across this article describing a different method in SPSS:
However, I don't have access to SPSS so am wondering if anyone has any suggestions on how to do this analysis in r (or even Graphpad Prism).
Or I could the diffcorr package to calculate differences for each combination of groups, but then would I need to apply a multiple comparison correction?
Alternatively, Mohr & Marcon (2005) describe a different method using Spearman correlation that seems like it might be more relevant; however, I wonder why their method doesn't seem to have been used by other researchers. It also looks difficult to implement, so I'm unsure if it's the right choice.
Any advice would be much appreciated!
Relevant answer
Answer
You wrote: "For context, one variable is weight, and one is a behaviour score. The groups are receiving various treatment and I want to test if weight change influences the behaviour score differently in each group."
I'm not sure this is best tested with a correlation coefficient. This sounds like an interaction hypothesis (or moderation if you prefer). What you need I think is the interaction of weight change by group. This is usually tested by the regression coefficient for the interaction. You can standardize this to scale it similarly to a correlation coefficient (though that's actually best done outside the model for interactions).
You can compare correlations but that isn't necessarily sensible because you risk confounding the effects of interest with changes in SD of the variables across groups (and there seems no rationale for needing that).
A further complication: including weight change without baseline weight as a covariate might be a poor choice. Even if groups are randomized, including baseline weight may increase the precision of the estimates of the other effects.
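As a rough illustration of the interaction (moderation) approach described above, here is a minimal R sketch; the data frame and column names (dat, behaviour_score, weight_change, baseline_weight, group, id) are hypothetical placeholders, not part of the original question:
# Fixed-effects version (one observation per animal):
dat$group <- factor(dat$group)
m1 <- lm(behaviour_score ~ weight_change * group + baseline_weight, data = dat)
summary(m1)                                          # interaction terms: does the slope differ by group?
anova(update(m1, . ~ . - weight_change:group), m1)   # overall F-test of the interaction

# With repeated measures over time, a mixed-model analogue could be:
# library(lme4)
# m2 <- lmer(behaviour_score ~ weight_change * group + baseline_weight + (1 | id), data = dat)
# End of sketch.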
  • asked a question related to Advanced Statistics
Question
3 answers
Hello, I currently have a set of categorical variables, coded as Variable A,B,C,etc... (Yes = 1, No = 0). I would like to create a new variable called severity. To create severity, I know I'll need to create a coding scheme like so:
if Variable A = 1 and all other variables = 0, then severity = 1.
if Variable B = 1 and all other variables = 0, then severity = 2.
So on, and so forth, until I have five categories for severity.
How would you suggest I write a syntax in SPSS for something like this?
Relevant answer
Answer
* Create a toy dataset to illustrate.
NEW FILE.
DATASET CLOSE ALL.
DATA LIST LIST / A B C D E (5F1).
BEGIN DATA
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
1 1 0 0 0
0 1 1 0 0
0 0 1 1 0
0 0 0 1 1
1 0 2 0 0
END DATA.
IF A EQ 1 and MIN(B,C,D,E) EQ 0 AND MAX(B,C,D,E) EQ 0 severity = 1.
IF B EQ 1 and MIN(A,C,D,E) EQ 0 AND MAX(A,C,D,E) EQ 0 severity = 2.
IF C EQ 1 and MIN(B,A,D,E) EQ 0 AND MAX(B,A,D,E) EQ 0 severity = 3.
IF D EQ 1 and MIN(B,C,A,E) EQ 0 AND MAX(B,C,A,E) EQ 0 severity = 4.
IF E EQ 1 and MIN(B,C,D,A) EQ 0 AND MAX(B,C,D,A) EQ 0 severity = 5.
FORMATS severity (F1).
LIST.
* End of code.
Q. Is it possible for any of the variables A to E to be missing? If so, what do you want to do in that case?
  • asked a question related to Advanced Statistics
Question
3 answers
Hello, I currently have a set of categorical variables, coded as Variable A,B,C,etc... (Yes = 1, No = 0). I would like to create a new variable called severity. To create severity, I know I'll need to create a coding scheme like so:
if Variable A = 1 and all other variables = 0, then severity = 1.
if Variable B = 1 and all other variables = 0, then severity = 2.
So on, and so forth, until I have five categories for severity.
How would you suggest I write a syntax in SPSS for something like this? Thank you in advance!
Relevant answer
Answer
Ange, I think the easiest way for you to find an answer to your question would be to google something such as "SPSS recode variables YouTube". You'll probably find several sites that demonstrate what you want to do.
All the best with your research.
  • asked a question related to Advanced Statistics
Question
3 answers
If someone could please share any report/paper/thesis, it would be highly appreciated.
Relevant answer
Answer
The choice of "technique" is of little consequence here.
Have you drawn a representative sample from a population?
Did you submit a structured questionnaire to the sample units?
Good: start analyzing the data that describe the phenomenon you are studying and the rest will come by itself.
  • asked a question related to Advanced Statistics
Question
3 answers
Dear community,
I am planning on conducting a GWAS analysis with two groups of patients differing in a binary characteristic. As this cohort is naturally very rare, our sample size is limited to a total of approximately 1500 participants (a low number for GWAS). Therefore, we are thinking of studying associations between pre-selected genes that might be phenotypically relevant to our outcome. As there exist no pre-data/arrays that studied similar outcomes in a different patient cohort, we need to identify regions of interest bioinformatically.
1) Do you know any tools that might help me harvest genetic information for known pathways involved in relevant cell-functions and allow me to downscale my number of SNPs whilst still preserving the exploratory character of the study design? e.g. overall thrombocyte function, endothelial cell function, immune function etc.
2) Alternatively: are there bioinformatic ways (AI etc.) that circumvent the problem of multiple testing in GWAS studies and would allow me to robustly explore my dataset for associations even at lower sample sizes (n < 1500 participants)?
Thank you very much in advance!
Kind regards,
Michael Eigenschink
Relevant answer
Answer
For the second part of your problem, you can try the vcf2gwas pipeline, which is very easy to run as a Docker image.
  • asked a question related to Advanced Statistics
Question
4 answers
Dear community,
I've been reading a lot about dealing with omics data that lie outside the limits of quantification; there are a bunch of different recommendations on how to approach this. One paper drew the <LLOQ data at random from a normal distribution (interval: 0 to LLOQ) and used a log-normal distribution for data >ULOQ. Is that a sound idea? Does anyone have comments on this or further suggestions?
I am looking forward to your responses.
Relevant answer
Answer
If there aren't too many censored values on the low end (< LLOQ), what precise values these take shouldn't matter much in the analysis. (Or, we shouldn't do an analysis where the precise values of these observations matter much). If there are many < LLOQ values, then you might consider a different analysis.
Just thinking about it, I'm not sure what to do with values censored at the high end (> HLOQ). I would be wary of replacing them with draws from some kind of random distribution with no upper limit, since it's possible to get a really high value by chance. I have done analyses where I replaced > HLOQ values with just HLOQ; because there weren't that many values, it didn't really affect the overall conclusions.
Section 4.7, VALUES BELOW DETECTION LIMITS, of the following (USEPA, 2000) has some simple advice about values less than the detection limit. They say, if there are < 15% of values below the detection limit, "Replace nondetects with DL/2, DL, or a very small number." They have more involved methods when non-detects are 15%-50% of observations.
For observations < LLOQ, maybe random values from a uniform distribution would make more sense than from a normal distribution.
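For what it's worth, the simple substitution options mentioned above could look like this in R; this is only a sketch, assuming censored values are stored as NA in a column called conc and that the LLOQ is known (all object and column names are placeholders):
lloq  <- 0.5                     # hypothetical lower limit of quantification
below <- is.na(dat$conc)         # indicator of values reported as < LLOQ

# Option 1: fixed substitution (e.g., LLOQ/2, as in the USEPA guidance quoted above)
dat$conc_half <- dat$conc
dat$conc_half[below] <- lloq / 2

# Option 2: random draws from a uniform distribution on (0, LLOQ)
set.seed(1)
dat$conc_unif <- dat$conc
dat$conc_unif[below] <- runif(sum(below), min = 0, max = lloq)
# End of sketch.
For data sets with a substantial fraction of non-detects, dedicated censored-data methods (for example, those implemented in the NADA package) are generally preferable to any substitution.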
  • asked a question related to Advanced Statistics
Question
3 answers
Assume we have a program with different instructions. Due to some limitations in the field, it is not possible to test all the instructions. Instead, assume we have tested 4 instructions and calculated their rank for a particular problem.
the rank of Instruction 1 = 0.52
the rank of Instruction 2 = 0.23
the rank of Instruction 3 = 0.41
the rank of Instruction 4 = 0.19
Then we calculated the similarity between the tested instructions using cosine similarity (after converting the instructions from text form to vectors via machine-learning instruction embeddings).
The question: is it possible to create a mathematical formula, using the rank values and the similarities between instructions, such that for an untested instruction we can calculate, estimate, or predict its rank based on its similarity to a tested instruction?
For example, we measure the similarity between instruction 5 and instruction 1. Is it possible to calculate the rank of instruction 5 based on its similarity with instruction 1? Is it possible to create a model or mathematical formula? If yes, then how?
Relevant answer
Answer
As far as I understand your problem, you first need a mathematical relation between the instructions and their ranks; that is, each instruction (represented numerically) should map to a rank value, which means you require a mathematical function.
There are various methods/tools for finding a suitable (as accurate as you want) mathematical function from given discrete values, such as curve-fitting methods or machine learning.
Further, once you obtain the mathematical function, run your code a few times and you will get a set of (instruction, rank) combinations. These values will serve as feedback for your derived function. Make changes based on the feedback, and you will get a much more accurate function.
I hope this is what you are looking for.
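One very simple formula of the kind described in the question is a similarity-weighted average of the known ranks. The sketch below uses the ranks from the question, but the cosine similarities are made up for illustration and would in practice come from the embedding:
ranks <- c(i1 = 0.52, i2 = 0.23, i3 = 0.41, i4 = 0.19)   # ranks of the tested instructions
sims  <- c(i1 = 0.91, i2 = 0.30, i3 = 0.65, i4 = 0.12)   # hypothetical cosine similarities to instruction 5

# Instructions more similar to the untested one contribute more to its predicted rank.
predicted_rank <- sum(sims * ranks) / sum(sims)
predicted_rank
# End of sketch.
With more tested instructions you could instead fit (and cross-validate) a regression of rank on the embedding vectors, which is essentially the machine-learning route suggested above.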
  • asked a question related to Advanced Statistics
Question
4 answers
During the lecture, the lecturer mentioned the properties of Frequentist. As following
Unbiasedness is only one of the frequentist properties — arguably, the most compelling from a frequentist perspective and possibly one of the easiest to verify empirically (and, often, analytically).
There are however many others, including:
1. Bias-variance trade-off: we would consider as optimal an estimator with little (or no) bias; but we would also value ones with small variance (i.e. more precision in the estimate), So when choosing between two estimators, we may prefer one with very little bias and small variance to one that is unbiased but with large variance;
2. Consistency: we would like an estimator to become more and more precise and less and less biased as we collect more data (technically, when n → ∞).
  3. Efficiency: as the sample size increases indefinitely (n → ∞), we expect an estimator to become increasingly precise (i.e. its variance to reduce to 0, in the limit).
Why does the frequentist approach have these kinds of properties, and can we prove them? I think these properties can be applied to many other statistical approaches.
Relevant answer
Answer
Sorry, Jianhing, but I think you have misunderstood something in the lecture. Frequentist statistics rests on an interpretation of probability as a long-run frequency over many repeated random experiments.
In this setting, one designs functions of the data (also called statistics) which estimate certain quantities from the data. For example, the probability p of a coin landing heads is estimated from n independent trials with the same coin by counting the fraction of heads; this fraction is then an estimator for the parameter p.
Each estimator should have desirable properties, such as unbiasedness, consistency, efficiency, low variance and so on. Not every estimator has these properties, but in principle one can prove whether a given estimator has them.
So, it is not a characteristic of frequentist statistics as a whole, but a property of an individual estimator constructed within frequentist statistics.
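A small simulation makes the distinction concrete: the sample proportion (as an estimator of p in the coin example) is unbiased and consistent, which can be checked empirically. The numbers below are arbitrary; this is only an illustrative sketch in R:
set.seed(42)
p <- 0.3   # true (unknown) probability of heads

for (n in c(10, 100, 1000, 10000)) {
  est <- replicate(5000, mean(rbinom(n, size = 1, prob = p)))   # 5000 repeated experiments of size n
  cat(sprintf("n = %5d: mean of estimates = %.4f, variance = %.6f\n",
              n, mean(est), var(est)))
}
# The mean of the estimates stays close to p (unbiasedness), while their variance
# shrinks towards 0 as n grows (consistency).
# End of sketch.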
  • asked a question related to Advanced Statistics
Question
4 answers
Assuming that a researcher does not know the nature of the population distribution (its parameters or its type, e.g. normal, exponential, etc.), is it possible for the sampling distribution to indicate the nature of the population distribution?
According to the central limit theorem, the sampling distribution of the mean is likely to be approximately normal, so the exact population distribution cannot be recovered from it. Is the shape of the distribution for a large sample size enough, or does it have to be inferred logically based on other factors?
Am I missing some points? Any lead or literature will help.
Thank you
  • asked a question related to Advanced Statistics
Question
9 answers
Presently I am handling a highly positively skewed geochemical dataset. After several attempts, I have prepared a 3 parameter lognormal distribution (using natural log and additive constant c). The descriptive statistic parameters obtained are log-transformed mean (α) and standard deviation (β). The subsequent back-transformed mean and standard deviation (BTmean and BTsd) are based on the formula
BTmean = e ^ (α + (β^2/2)) - c
BTsd = Sqrt[ (BTmean)^2 * (e^(β^2) - 1) ] - c
However, someone suggests to use Lagrange Multiplier. I am not sure about the
1) Equation using the Lagrange Multiplier
2) How to derive the value of Lagrange multiplier in my case.
Kindly advise.
Regards
Relevant answer
Answer
David will likely think this is complete rubbish, but here is what one librarian had to say about Z-Library.
<quote>
So there are outright “We pirate stuff’ sites like Mobilism and ZLibrary. These are places that are basically set up to pirate things and have no veneer of legality to them.
</quote>
  • asked a question related to Advanced Statistics
Question
7 answers
What technique can I use to convert annual ESG score data to monthly, weekly, or daily data with good accuracy?
And how can I apply it in Python?
Relevant answer
Answer
I agree with Yongkang Stanley Huang and Michel Charifzadeh, because an ESG score is not periodic (continuous) time-series data but rather a rating usually provided as symbols such as A, B and C...
  • asked a question related to Advanced Statistics
Question
6 answers
Hi everyone,
I am struggling a bit with data analysis.
If I have 2 separate groups, A and B.
And each group has 3 repeats, A1, A2, A3 and B1, B2, B3, for 10 time points.
How would I determine statistical significance between the 2 groups?
If I then added a third group, C, with 3 repeats C1,C2,C3 for each timepoint.
What statistical analysis would I use then?
Thanks in advance
Relevant answer
Answer
If we want to test the differences between the two groups, we use a t-test;
or, alternatively, we could use the Pearson correlation coefficient.
  • asked a question related to Advanced Statistics
Question
7 answers
To be more precise, my dependent variable was the mental well-being of students. The first analysis was chi-square (mental well-being x demographic variable), hence I treated the dv as categorical. Then, in order to find the influence of mental well-being on my independent variable, I treated the dv as a continuous variable so that I can analyse it using multiple regression.
Is it appropriate and acceptable? and is there any previous study that did the same thing?
Need some advice from all of you here. Thank you so much
Relevant answer
Answer
If we do not use precise statistical tools suited to this field for the research variables, it will lead to statistical errors in the accuracy of the results; however, it is possible to use it as a dependent variable in two separate studies.
  • asked a question related to Advanced Statistics
Question
3 answers
Modelling biology is often a challenge, even more so when dealing with behavioural data. Models quickly become extremely complex, full of variables and random effects. When trying to deal with a complex data set, there are often several variables (or questions) you're interested in that might explain the variation of the response variable. But is it better to fit one very complex model or several simpler ones? Let me put an example:
We would like to know more about the relationship between nursing behaviour and rank in a wild primate. For that, we record nursing duration and the rank of the mother. However, we think that the age of the mother and the infant are also interesting sources of variation. We will also record variables that we think might affect but that we are not necessarily interested in like the weather.
My first intuition is to put everything in the model:
  • nursing duration ~ rank + mothers' age + infants' age + mothers' age*infants' age + (1| weather)
I want to believe that by including all variables you reduce type I errors. But I have not been able to find an explanation of why that is the case.
Would it be statistically correct to perform two models instead, one for each question?
  • nursing duration ~ rank + (1| weather)
  • nursing duration ~ mothers' age + infants' age + mothers' age*infants' age + (1| weather)
I have been told that a common practice is to fit the most complex model first and then remove variables until you arrive at the lowest AIC. But I am not sure there is a better way to assess how many variables you should include in a model.
Please let me know if you know of any books or further reading addressing these kinds of questions, ideally focusing on statistics for biologists or behavioural ecologists.
Relevant answer
Answer
Hi Maria, in general, if you have a strong theoretical framework that justifies the decision to include these predictors, then include them: this is your hypothesis. Although you could remove variables based on p-values, AIC, or BIC, your specific hypothesis, nursing duration ~ rank + mothers' age*infants' age + (1| weather), would then no longer be addressed. By sequentially removing variables based on p > .05 you inflate your error rate, and by removing them based on AIC or BIC you are not addressing your hypothesis. If p [for any predictor] is not < .05, then you simply have not obtained enough data to provide evidence against H0; in that case, repeat the study with a larger n (and never add more data to a study after peeking at the p-value).
I am in favor of simpler models (in the sense of fewer predictors, as they are easier to understand and address), although I have created horrible models in the past; I luckily seem to have learned from that (I hope). Note that your model is not all that complex. Also, I removed mothers' age + infants' age because R includes the main effects automatically when you use *. And if you run the model, consider giving the point and interval estimates for future repetition/replication (e.g., using the parameters package https://easystats.github.io/blog/posts/parameters_new_models/).
Good luck.
Useful books are:
https://hastie.su.domains/ISLR2/ISLRv2_website.pdf (only Chapters 1-3 for the basics)
(extremely well written, not free)
(note that any book has it weaknesses and strengths, and you often only use a particular chapter)
---------------------
Some additional, possibly useless, information. From a Popperian view, "simplicity" of the model is to be preferred (Logic of Scientific Discovery, Chapter 7). He describes simplicity in terms of how easily the argumentation can be rejected; it has to be "testable" (easy to "reject" or "invalidate"). In your case this would be the "complex" model, since any of the assertions made about the predictors in the model rank + mothers' age*infants' age can be tested (which is not a bad thing, it is a good thing!). Note also that Popper did not really try to define or find the essence of the word "simplicity".
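For what it's worth, fitting the single theory-driven model and reporting estimates (rather than selecting by AIC) might look like the sketch below; the variable names follow the question, the data-frame name is a placeholder, and this is only one possible specification:
library(lme4)
library(parameters)   # for tidy point and interval estimates, as suggested above

fit <- lmer(nursing_duration ~ rank + mother_age * infant_age + (1 | weather),
            data = nursing_data)

model_parameters(fit, ci = 0.95)   # report estimates and intervals for all terms of interest
# End of sketch.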
  • asked a question related to Advanced Statistics
Question
9 answers
Well,
I am a very curious person. During Covid-19 in 2020, working with coded data and taking only the last name, I noticed in my country that people with certain surnames were more likely to die than others (and this pattern has remained unchanged over time). Using ratios and proportions, I applied a "conversion" so that all surnames had the same weighting, and inconsistencies emerged. The rest was a simple exercise in probability and statistics, which revealed this controversial fact.
Of course, what I did was a shallow study, just a data mining exercise, but it has been something that caught my attention, even more so when talking to an Indian researcher who found similar patterns within his country about another disease.
In the context of pandemics (both to bring this one to an end and to prepare for others that may come),
I think it would be interesting to have a line of research involving different professionals, such as data scientists, statisticians/mathematicians, sociologists and demographers, and researchers from the human and biological sciences, to compose a more refined study on this premise.
Some questions still remain:
What if we could have such answers? How should research ethics be handled? Could we warn people to take extra care? How would people with certain last names considered at risk react? And the other way around? From a sociological point of view, could such a recommendation divide society into groups with "superior" or "inferior" genes?
What do you think about it?
=================================
Note: Due to important personal matters I took a break and returned to my activities today, February 13, 2023. I am very happy to have come across so much interesting feedback.
Relevant answer
Answer
It is just coincidental
  • asked a question related to Advanced Statistics
Question
6 answers
I'm doing a germination assay of 6 Arabidopsis mutants under 3 different ABA concentrations in solid medium. I have 4 batches. Each batch has 2 plates for each mutant, 3 for the wild type, and each plate contains 8-13 seeds. Some seeds and plates are lost to contamination, so I don't have the same sample size for each mutant in each batch; in some cases the mutant is no longer present in the batch. I've recorded the germination rate per mutant after a week and expressed it as a percentage. I'm using R. How can I best analyse the data to test whether the mutations affect the germination rate in the presence of ABA?
I've two main questions:
1. Do I consider each seed as a biological replicate with a categorical result (germinated/not germinated), or each plate as a replicate with a numerical result (% germination)?
2. I compare treatments within each genotype. Should I compare the mutant against the wild type within each treatment, the treatments against each other within each mutant, or both?
Relevant answer
Answer
I suggest using mosaic plots rather than (stacked) barplots to visualize your data.
The chi²- and p-values can be calculated simply via chi²-tests (one for each ABA conc) -- assuming the data are all independent (again, please note that seedlings on the same plate are not independent). If you have no possibility to account for this (using a hierarchical/multilevel/mixed model), you may ignore this in the analysis but then interpret the results more carefully (e.g., use a more stringent level of significance than usual).
A binomial model (including genotype and ABA conc as well as their interaction) would allow you to analyse the difference between genotypes in conjunction with ABA conc. However, due to the given experimental design (only three different conc values) this is cumbersome to interpret (because you cannot establish a meaningful functional relationship between conc and probability of germination).
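A hierarchical binomial model of the kind mentioned above could be sketched in R as follows, assuming the data are aggregated to one row per plate with counts of germinated and non-germinated seeds; all column and data-frame names are placeholders:
library(lme4)

plates$aba_conc <- factor(plates$aba_conc)   # treat the three concentrations as a factor

fit <- glmer(cbind(germinated, not_germinated) ~ genotype * aba_conc + (1 | batch/plate),
             data = plates, family = binomial)
summary(fit)
# Genotype-by-concentration contrasts against the wild type can then be extracted,
# e.g. with the emmeans package.
# End of sketch.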
  • asked a question related to Advanced Statistics
Question
6 answers
I'm trying to construct a binary logistic regression model. The first model includes 4 predictor variables and its intercept is not statistically significant. Meanwhile, in the second model, I exclude one variable from the first model and the intercept is significant.
The consideration I am weighing here is that the pseudo R² of the first model is higher, i.e. the first model appears to explain the data better than the second.
Any suggestion as to which model I should use?
Relevant answer
Answer
You should use the model that makes more sense, practically and/or theoretically. A high R² is not an indication of the "goodness" of the model; a higher R² can also mean that the model makes more wrong predictions with higher precision.
Do not build your model based on the observed data. Build your model based on understanding (theory) and the targeted purpose (simple prediction, extrapolation (e.g. forecast), testing meaningful hypotheses, etc.).
Removing a variable from the model changes the meaning of the intercept. The intercepts in the two models have different meanings. They are (very usually) not comparable. The hypothesis tests of the intercepts of the two models test very different hypotheses.
PS: a "non-significant" intercept term just means that the data are not sufficient to statistically distinguish the estimated value (the log odds given all X=0) from 0, which means that you cannot distinguish the probability of the event (given all X=0) from 0.5 (the data are compatible with probabilities larger and lower than 0.5). This is rarely a sensible hypothesis to test.
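A tiny simulated example illustrates the point that the intercept's estimate and "significance" depend on the coding of the predictors, not on the quality of the model; all numbers below are made up:
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)
y <- rbinom(200, size = 1, prob = plogis(-5 + 0.1 * x))

m_raw      <- glm(y ~ x, family = binomial)               # intercept = log odds at x = 0
m_centered <- glm(y ~ I(x - mean(x)), family = binomial)  # intercept = log odds at the mean of x

coef(summary(m_raw))["(Intercept)", ]
coef(summary(m_centered))["(Intercept)", ]
# Identical fitted model, very different intercept estimates and p-values.
# End of sketch.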
  • asked a question related to Advanced Statistics
Question
4 answers
Could you please elaborate on the specific differences between scale development and index development (based on formative measurement) in the context of management research? Is it essential to use only pre-defined or pre-tested scales to develop an index, such as a brand equity index or a brand relationship quality index? Please suggest some relevant references.
Relevant answer
Answer
Kishalay Adhikari, you might find some useful information in Chapter 12 of the following book:
Hair, J. F., Babin, B. J., Anderson, R. E., & Black, W. C. (2019). Multivariate data analysis (8th ed.). Cengage.
I think that some of this chapter could have been written a bit more effectively, but overall it is helpful in drawing distinctions between scales and indexes.
All the best with your research.
  • asked a question related to Advanced Statistics
Question
5 answers
Dear all,
I have a question about a mediation hypothesis interpretation.
We have a model in which the direct effect of X on Y is significant, and its standardized estimate is greater than the indirect effect estimate (X -> M -> Y), which is significant too.
As far as I can understand, it should be a partial mediation, but should the indirect effect estimate be larger than the direct effect estimate to assess a partial mediation effect?
Or is the significance of the indirect effect sufficient to assess the mediation?
THanks in advance,
Marco
Relevant answer
Answer
Marco Marini as far as I know, you must have two conditions both verified for a partial mediation hypothesis to be confirmed:
1 - the indirect effect must be significant (X -> M -> Y) *
2 - the direct effect must be significant (X -> Y)
If both conditions are satisfied, then you have a partial mediation. If condition 1 is satisfied, but not condition 2, then you have a full mediation (i.e., your mediator entirely explains the effect of X on Y).
As Christian Geiser suggested: "Partial mediation simply means that only some of the X --> Y effect is mediated through M".
To my knowledge, the ratio between direct and indirect effect has no role in distinguishing between partial vs. full mediation.
* Please note: "the indirect effect must be significant" doesn't mean that paths a and b must each be significant. All you need is for the product a × b to be significant (preferably with a bootstrapped confidence interval).
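If it helps, the indirect effect and its bootstrapped confidence interval can be obtained in lavaan roughly as follows; X, M, Y and mydata are placeholders, and the number of bootstrap draws is arbitrary:
library(lavaan)

med_model <- '
  M ~ a*X
  Y ~ b*M + c*X
  indirect := a*b
  total    := c + a*b
'

fit <- sem(med_model, data = mydata, se = "bootstrap", bootstrap = 2000)
parameterEstimates(fit, boot.ci.type = "perc")   # percentile bootstrap CIs for a*b and the total effect
# End of sketch.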
  • asked a question related to Advanced Statistics
Question
3 answers
300 participants in my study viewed 66 different moral photos and had to make a binary choice (yes/no) in response to each. There were 3 moral photo categories (22 positive images, 22 neutral images and 22 negative images). I am running a multilevel logistic regression (we manipulated two other aspects of the images) and have found unnaturally high odds ratios (see below). We have no missing values. Could anyone please help me understand what the results below might mean? I understand I need to approach them with extreme caution, so any advice would be highly appreciated.
Yes choice: morally negative compared morally positive (OR=441.11; 95% CI [271.07,717.81]; p<.001)
Yes choice: morally neutral compared to morally positive (OR=0.94; 95% CI [0.47,1.87]; p=0.86)
It should be noted that when I plot the data, very very few participants chose yes in response to the neutral and positive images. Almost all yes responses were given in response to the negative images.
Relevant answer
Answer
I think you have answered your question: "It should be noted that when I plot the data, very very few participants chose yes in response to the neutral and positive images. Almost all yes responses were given in response to the negative images."
This is what you'd expect even in a simple 2x2 design. If the probability of a yes response is very high in one condition and very low in the other, then the OR can be very large, because it is the ratio of very large odds to very small odds.
This isn't unnatural unless the raw probabilities don't reflect this pattern. (There might still be issues but not from what you described).
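A back-of-the-envelope check (with made-up probabilities, not the actual data) shows how quickly lopsided response rates produce odds ratios of this magnitude:
p_yes_negative <- 0.90   # hypothetical probability of "yes" for morally negative images
p_yes_positive <- 0.02   # hypothetical probability of "yes" for morally positive images

odds_negative <- p_yes_negative / (1 - p_yes_negative)
odds_positive <- p_yes_positive / (1 - p_yes_positive)
odds_negative / odds_positive   # = 441: an OR in the hundreds is plausible with cell probabilities like these
# End of sketch.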
  • asked a question related to Advanced Statistics
Question
2 answers
Hi Folks,
I am working on a meta-analysis and I am trying to convert data into effect sizes (Cohen's d) to provide a robust synthesis of the evidence. All the studies used a one-group pre-post design and the outcome variables were assessed before and after the participation in an intervention.
Although the majority of the studies included in this meta-analysis reported either the effect sizes (Cohen's d) or the mean changes, a few of them reported the median changes. I am wondering if there is a way to calculate the effect sizes of these median changes.
For example, the values reported in one paper are:
Pre Median (IQR) = 280.5 (254.5 - 312.5)
Post Median (IQR) = 291.0 (263.5 - 321.0)
Is there any way I can convert these values into Cohen's d?
Thank you very much for your help.
Relevant answer
Answer
Thank you very much for your answer. It is really helpful and makes sense.
I do not think I will include these estimated means and SDs in the meta-analysis, but I can definitely report them in the narrative synthesis, as they will add additional evidence (with all the precautions due to the assumptions) to the findings.
Thanks again David, I very much appreciated your help.
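For reference, the commonly used normal-theory approximations (e.g., Wan et al., 2014, BMC Medical Research Methodology) estimate the mean as (q1 + median + q3)/3 and the SD as IQR/1.349. Applied to the example values quoted in the question, a sketch in R would be:
pre  <- c(q1 = 254.5, med = 280.5, q3 = 312.5)
post <- c(q1 = 263.5, med = 291.0, q3 = 321.0)

est <- function(x) c(mean = unname((x["q1"] + x["med"] + x["q3"]) / 3),
                     sd   = unname((x["q3"] - x["q1"]) / 1.349))
est(pre)    # roughly mean 282.5, SD 43.0
est(post)   # roughly mean 291.8, SD 42.6
# These approximations are only defensible if the underlying distributions are roughly
# symmetric, and a pre-post Cohen's d would additionally need the SD of the change scores
# (or the pre-post correlation), which these summaries do not provide.
# End of sketch.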
  • asked a question related to Advanced Statistics
Question
4 answers
Hi All, I was wondering what statistical test I should use for this example: comparing participants' ratings of a person's (1) competence and (2) employability, based on the person's (1) level of education and (2) gender.
So there are two IVs:
(1) The person's level of Education [3 levels].
(2) The person's Gender [2 genders].
So there is a total of 6 conditions presented to the participants [ 3 levels of education x 2 genders]. However, each participant is only presented with 4 conditions; meaning, there is a mixture of between-participants and within-participants used in the study.
There are two DVs:
(1) Participants' rating of the person's Competence.
(2) Participants' rating of the person's Employability.
I was thinking the statistical test would be MANOVA, but want to confirm.
Also, if the participants used in the study are a mixture of between-participants, and within-participants, how can MANOVA work in this case?
Any advice or insight on the above would be really appreciated. Thank you.
Relevant answer
Answer
Hello Paul,
The first question to address is: how do you characterize the strength/scale of your DVs? Nominal? Ordinal? Interval? The second question is: do you really aim to interpret the results multivariately (that is, for the vector of values on rated competence and rated employability), or is it more likely that your attention will be focused on these individually? If individually, then run univariate analyses, one for each DV; otherwise, go multivariate.
Interval / Multivariate:
Multivariate regression or Manova (either would involve a repeated measures factor having four levels: condition)
Interval / Univariate:
Regression or mixed (two-between, one-within) anova
Ordinal / Univariate:
Ordinal regression or an adaptation of aligned ranks anova
Nominal / univariate (Depends on number of levels of the nominal variable)
Possibly logistic regression (if two levels of DV, such as "Satisfactory/Unsatisfactory")
Good luck with your work.
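As a rough illustration of the univariate mixed-model route (which copes naturally with each participant seeing only 4 of the 6 conditions), here is an R sketch with placeholder column names; the exact between/within structure depends on how conditions were assigned, so treat this only as a starting point:
library(lme4)
library(lmerTest)   # adds p-values for the fixed effects

# ratings_long: one row per participant x condition, with the rating and the condition codes
fit_comp <- lmer(competence ~ education * gender + (1 | participant), data = ratings_long)
summary(fit_comp)

# Repeat with employability as the outcome. If a genuinely multivariate conclusion is needed,
# the two ratings would instead have to be modelled jointly (e.g., via MANOVA).
# End of sketch.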
  • asked a question related to Advanced Statistics
Question
9 answers
In statistics, Cramér's V is a measure of association between two nominal variables, giving a value between 0 and 1 (inclusive). It was first proposed by Harald Cramér (1946).
Many papers I came across consider a threshold value of 0.15 (sometimes even 0.1) to be meaningful, i.e., as giving hints of at least a weak association between the variables being tested. Do you have any reference, mathematical foundation or explanation for why this threshold is relevant?
Regards,
Roland.
Relevant answer
Answer
I made a mistake in my answer: I gave you the information for the contingency coefficient. For Cramér's V, there is a table of effect sizes that depends on the degrees of freedom associated with Cramér's V. See: https://www.real-statistics.com/chi-square-and-f-distributions/effect-size-chi-square/
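For completeness, Cramér's V and the df-dependent benchmarks from that table can be computed directly in R; the 3x2 contingency table below is made up purely for illustration:
tab <- matrix(c(20, 35,
                25, 30,
                15, 40), nrow = 3, byrow = TRUE)

chi <- chisq.test(tab, correct = FALSE)
n   <- sum(tab)
k   <- min(nrow(tab), ncol(tab))
V   <- unname(sqrt(chi$statistic / (n * (k - 1))))
V

# Cohen-style benchmarks scale with df* = k - 1: small = 0.1/sqrt(df*), medium = 0.3/sqrt(df*),
# large = 0.5/sqrt(df*), which is one reason a single fixed cut-off such as 0.15 has no
# general justification.
# End of sketch.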
  • asked a question related to Advanced Statistics
Question
5 answers
I am using an ARDL model; however, I am having some difficulties interpreting the results. I found that there is cointegration in the long run. I have provided pictures below.
Relevant answer
Answer
Mr A. D.,
The ECT(-1) is the lagged error-correction term, i.e. the one-period lag of the residual from the long-run (cointegrating) equation.
Regards