Advanced Statistics - Science topic
Explore the latest questions and answers in Advanced Statistics, and find Advanced Statistics experts.
Questions related to Advanced Statistics
Hello ResearchGate community,
I am looking for a statistician with experience in metagenomic data analysis to assist with a research project. The data involves genotypic diversity within microbial profiles, and we require statistical expertise to ensure accurate and robust analysis. Specifically, I am seeking someone who is skilled in handling large datasets and can provide insights through advanced statistical methods.
If you have expertise in this area or know someone who does, please feel free to reach out. I’d be happy to discuss further details regarding the project and potential collaboration.
Thank you in advance for your support and recommendations.
Hello!
I am performing a study to introduce a new test for a specific eye disease diagnosis. The new test has continuous values, the disease can be present in one or both eyes, and the disease severity by eye could also be different. Furthermore, the presence of the disease in one eye increases the probability of having the disease in the other eye.
Because we aim to estimate the diagnostic performance of the new test, we performed both the new test and the gold standard in both eyes of a sample of patients. However, the repeated measurements per patient introduce intra-class correlation, so the results cannot be treated as i.i.d. Therefore, diagnostic performance derived directly from a logistic regression model or an ROC curve may not be correct.
What do you think is the best approach to calculate the AUC, sensitivity, specificity, and predictive values in this case?
I think that a mixed-effects model with the patient as a random intercept could be useful. However, I do not know whether there is any method to estimate diagnostic performance with this type of model.
Thank you in advance.
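One way to respect the eye-within-patient clustering, sketched in R (the data frame and variable names below are hypothetical): fit a mixed-effects logistic model with a patient random intercept, build the ROC curve from its population-level predictions, and get uncertainty from a patient-level bootstrap.

```r
# A minimal sketch, assuming a long-format data frame `eyes` with columns
# patient, disease (0/1 gold standard) and test (continuous new test value).
library(lme4)
library(pROC)

# Mixed-effects logistic regression with a random intercept per patient
fit <- glmer(disease ~ test + (1 | patient), data = eyes, family = binomial)

# Population-level predicted probabilities (random effects set to zero)
p_hat <- predict(fit, type = "response", re.form = NA)

roc_obj <- roc(eyes$disease, p_hat)
auc(roc_obj)

# For confidence intervals on AUC, sensitivity, specificity and predictive
# values that respect the clustering, resample patients (not eyes) in a
# cluster bootstrap and repeat the whole procedure on each resample.
```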
I am at the end of conducting a large systematic review and meta-analysis. I have experience with meta-analysis and have attempted to meta-analyse the studies myself, but I am not happy with my method. The problem is that almost all the studies are crossover studies, and I am not sure how to analyse them correctly. I have consulted the Cochrane Handbook, and it seems to suggest that a paired analysis is best, but I do not have the expertise to do this - https://training.cochrane.org/handbook/current/chapter-23#section-23-2-6
I am seeking a statistician familiar with meta-analysis to consult with, and if possible, undertake the meta-analysis. There are only two authors on this paper (me and a colleague), so you would either be second or last author. We aim to publish in a Q1 or Q2 journal, and from my own analysis I can see we have very interesting results.
Please let me know if you are interested.
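For what it is worth, the paired approach from the Handbook can be sketched fairly directly in R with metafor, assuming each study reports period means and SDs; the within-participant correlation usually has to be imputed and varied in a sensitivity analysis. Column names below are hypothetical.

```r
# A minimal sketch: paired mean differences from crossover studies, pooled
# with a random-effects model. dat has columns m1, m2, sd1, sd2, n per study.
library(metafor)

r_w <- 0.5   # imputed within-participant correlation (assumption)

dat$yi  <- dat$m1 - dat$m2
dat$sei <- sqrt(dat$sd1^2 + dat$sd2^2 - 2 * r_w * dat$sd1 * dat$sd2) / sqrt(dat$n)

res <- rma(yi, sei = sei, data = dat)
summary(res)

# Re-run with, e.g., r_w = 0.3 and 0.7 as a sensitivity analysis.
```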
I want to use SPSS Amos to calculate SEM because I use SPSS for my statistical analysis. I have already found some workarounds, but they are not useful for me. For example, using a correlation matrix where the weights are already applied seems way too confusing to me and is really error prone since I have a large dataset. I already thought about using Lavaan with SPSS, because I read somewhere that you can apply weights in the syntax in Lavaan. But I don't know if this is true and if it will work with SPSS. Furthermore, to be honest, I'm not too keen on learning another syntax again.
So I hope I'm not the first person who has problems adding weights in Amos (or SEM in general) - if you have any ideas or workarounds I'll be forever grateful! :)
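As far as I know, recent versions of lavaan accept sampling weights directly (the sampling.weights argument), and an SPSS file can be read with haven, so this workaround keeps your SPSS data as-is. A minimal sketch, with a hypothetical weight variable "wt" and a hypothetical model:

```r
library(haven)
library(lavaan)

dat <- read_sav("mydata.sav")   # exported straight from SPSS

model <- '
  # hypothetical measurement and structural model
  F1 =~ y1 + y2 + y3
  F2 =~ y4 + y5 + y6
  F2 ~ F1
'

# lavaan applies the weights with a pseudo-maximum-likelihood (MLR) estimator
fit <- sem(model, data = dat, sampling.weights = "wt", estimator = "MLR")
summary(fit, standardized = TRUE, fit.measures = TRUE)
```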
We are looking for a highly qualified researcher with expertise in advanced statistical analysis to contribute to a scientific article to be submitted to a prestigious journal by the end of the year (2024). The article will focus on the adoption of digital innovations in agriculture.
Key responsibilities:
- Carry out in-depth statistical analysis using a provided database (the dataset is ready and available in SPSS format).
- Apply advanced statistical techniques, including structural equation modelling and/or random forest models.
- Work closely to interpret the results and contribute to the manuscript.
The aim is to fully analyse the data and prepare it for publication.
If you are passionate about agricultural innovation and have the necessary statistical expertise, we would like to hear from you.
Hi everyone.
When running a GLMM, I need to turn the data from wide format to the long format (stacked).
When checking for assumptions like normality, do I check them for the stacked variable (e.g., outcomemeasure_time) or for each variable separately (e.g., outcomemeasure_baseline, outcomemeasure_posttest, outcomemeasure_followup)?
Also, when identifying covariates via correlations (Pearson's or Spearman's), do I use the separate variables or the stacked one?
Normality: say normality is violated for outcomemeasure_baseline but not for the others (outcomemeasure_posttest and outcomemeasure_followup), and normality for the stacked variable is also not violated. In this case, when running the GLMM, do I adjust for normality violations because normality was violated for one of the separate measures?
Covariates: say age was identified as a covariate for outcomemeasure_baseline but not for the others (separately: outcomemeasure_posttest and outcomemeasure_followup, or the stacked variable). In this case, do I include age as a covariate since it was identified as one for one of the separate variables?
Thank you so much in advance!
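A common recommendation is to check distributional assumptions on the residuals of the fitted model (using the stacked, long-format data), rather than on each raw variable separately. A minimal sketch in R, with hypothetical variable names:

```r
library(tidyr)
library(lme4)

# wide -> long (stacked) format
long <- pivot_longer(wide_data,
                     cols = c(outcome_baseline, outcome_posttest, outcome_followup),
                     names_to = "time", values_to = "outcome")

fit <- lmer(outcome ~ time + age + (1 | id), data = long)

# Assumption checks on the model residuals, not the raw variables
qqnorm(resid(fit)); qqline(resid(fit))
plot(fitted(fit), resid(fit))
```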
Hi everyone,
Does anyone have a detailed SPSS (v. 29) guide on how to conduct Generalised Linear Mixed Models?
Thanks in advance!
I have a thermocouple that outputs a voltage level after signal conditioning. I need to convert it to the desired units in centigrade. Below is the formula I am using for the conversion. I need to prove that this formula ensures a uniform conversion of all thermocouple voltage levels to centigrade, such that 0 Volt corresponds to -200 centigrade and 10 Volt corresponds to 1500 centigrade.
Maximum voltage and minimum voltage are from DAQ after signal conditioning.
Maximum Reading Range and minimum reading range are values in centigrade.
We need to prove that a voltage range of, say, 0 V to 10 V will be uniformly converted to the -200 to 1500 centigrade reading range.
Below is the formula for which we need a proof.
Precision Factor = (Maximum Voltage - Minimum Voltage) / (Maximum Reading Range - Minimum Reading range)
Desired output value in Centigrade = ((Input Voltage level - Minimum Voltage)/ Precision Factor) + Minimum Reading Range
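For the record, the proof amounts to noting that the formula is an affine (first-degree) map of the input voltage, so equal voltage steps always produce equal temperature steps; substituting the endpoints then confirms the calibration. In the notation of the formulas above:

```latex
PF = \frac{V_{\max}-V_{\min}}{T_{\max}-T_{\min}},
\qquad
T(V) = \frac{V - V_{\min}}{PF} + T_{\min}
     = \frac{(V - V_{\min})\,(T_{\max}-T_{\min})}{V_{\max}-V_{\min}} + T_{\min}.
```

With V_min = 0 V, V_max = 10 V, T_min = -200 °C and T_max = 1500 °C this gives T(0) = -200 °C and T(10) = -200 + 1700 = 1500 °C, and the slope (T_max - T_min)/(V_max - V_min) = 170 °C per volt is constant, which is exactly the uniformity to be shown.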
Hi everyone,
I'm working on a project where I need to compare the similarity between line curves on two separate charts, and I could use some guidance. Here’s the situation:
- First Chart Details: Contains two curves, both of which are moving averages. These curves are drawn on a browser canvas by a user. I have access to the x and y data points of these curves.
- Second Chart Details: Contains two curves, with accessible x and y data points. In this chart, the x-axis represents time, and the y-axis represents values.
Challenge:
- The two charts do not share the same coordinate system values.
Goal:
- I would like to compare the similarity in patterns between individual lines across the two charts (i.e., one line from the first chart vs. one line from the second chart).
- Additionally, I want to compare the overall shape formed by both lines on the first chart to the shape formed by both lines on the second chart.
Could anyone provide advice on methodologies or algorithms that could help in assessing the similarity of these line curves?
Thank you for any help.
Lovro Bajc
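Since the two charts do not share a coordinate system, one common approach is to resample each curve to a common number of points, z-score normalise them (which removes offset and scale), and then compare shapes with Pearson correlation; dynamic time warping or the Fréchet distance are alternatives if the curves may be shifted or stretched in time. A minimal sketch in R (curve1 and curve2 are hypothetical data frames with columns x and y):

```r
shape_similarity <- function(curve1, curve2, n = 200) {
  # resample both curves onto n evenly spaced positions along their own x-range
  y1 <- approx(curve1$x, curve1$y, n = n)$y
  y2 <- approx(curve2$x, curve2$y, n = n)$y
  # z-score normalisation removes differences in offset and scale
  z1 <- (y1 - mean(y1)) / sd(y1)
  z2 <- (y2 - mean(y2)) / sd(y2)
  cor(z1, z2)   # 1 = identical shape, 0 = unrelated, -1 = mirrored
}
```

For the overall shape formed by both lines on each chart, the same idea can be applied to the difference (or average) of the two normalised lines per chart, or to the two normalised lines concatenated into one vector per chart.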
I have attached
Hello everyone,
I am currently undertaking a research project that aims to assess the effectiveness of an intervention program. However, I am encountering difficulties in locating suitable resources for my study.
Specifically, I am in search of papers and tutorials on multivariate multigroup latent change modelling. My research involves evaluating the impact of the intervention program in the absence of a control group, while also investigating the influence of pre-test scores on subsequent changes. Additionally, I am keen to explore how the scores differ across various demographic groups, such as age, gender, and knowledge level (all measured as categorical variables).
Although I have come across several resources on univariate/bivariate latent change modelling with more than three time points, I have been unable to find papers that specifically address my requirements—namely, studies focusing on two time points, multiple latent variables (n >= 3), and multiple indicators for each latent variable (n >= 2).
I would greatly appreciate your assistance and guidance in recommending any relevant papers, tutorials, or alternative resources that pertain to my research objectives.
Best,
V. P.
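In case it helps to see the building block: a heavily simplified two-wave latent change score specification in R/lavaan for one construct with three hypothetical indicators is sketched below. Additional constructs are added the same way (giving the multivariate version), and group = "gender" (or age group, knowledge level) gives the multigroup version; loading and intercept invariance over time is assumed, as required for change scores.

```r
library(lavaan)

lcs <- '
  # measurement model, loadings and intercepts equal over time
  eta1 =~ 1*y1_t1 + l2*y2_t1 + l3*y3_t1
  eta2 =~ 1*y1_t2 + l2*y2_t2 + l3*y3_t2
  y1_t1 ~ 0*1  ;  y1_t2 ~ 0*1      # marker intercepts fixed for identification
  y2_t1 ~ i2*1 ;  y2_t2 ~ i2*1
  y3_t1 ~ i3*1 ;  y3_t2 ~ i3*1
  y1_t1 ~~ y1_t2; y2_t1 ~~ y2_t2; y3_t1 ~~ y3_t2   # correlated indicator residuals

  # latent change score: eta2 = eta1 + delta
  delta =~ 1*eta2
  eta2  ~ 1*eta1
  eta2 ~~ 0*eta2
  eta2  ~ 0*1

  delta ~ 1          # mean change
  delta ~ eta1       # change depends on the pre-test level
  eta1  ~ 1
'

fit <- sem(lcs, data = dat, group = "gender")
summary(fit, standardized = TRUE, fit.measures = TRUE)
```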
The fact that a feature has a complementary distribution does not seem to be a sufficient reason to discard the feature as irrelevant, especially as such features seem phenomenologically relevant.
I would like to test whether the general relationship between the number of years of education and the wage is linear, exponential, etc. Or in other words, does going from 1 year to 2 years of education have the same impact on wages as going from 10 to 11. I want a general assessment for the world and not for a specific country.
I got standardized data from surveys on several countries and multiple times (since 2000). My idea is to build a multilevel mixed-effects model, with a fixed effect for the number of years of education and random effects for the country, the year of the survey and other covariates (age, sex, etc.). I’m not so used to this type of model: do you think it makes sense? Is this the most appropriate specification of the model for my needs?
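The multilevel idea seems reasonable; one concrete way to test the functional form, sketched in R with hypothetical variable names, is to compare a strictly linear effect of years of education against a flexible spline, using log wages so that a linear effect on the log scale corresponds to an exponential wage profile:

```r
library(lme4)
library(splines)

m_linear <- lmer(log(wage) ~ educ_years + age + sex +
                   (1 | country) + (1 | survey_year),
                 data = dat, REML = FALSE)

m_spline <- lmer(log(wage) ~ ns(educ_years, df = 4) + age + sex +
                   (1 | country) + (1 | survey_year),
                 data = dat, REML = FALSE)

anova(m_linear, m_spline)   # does allowing non-linearity improve the fit?
```

If you expect the return to education to vary across countries, a random slope, e.g. (1 + educ_years | country), is a natural extension.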
Hello everyone! As you will have gathered, this concerns high-precision positioning using global navigation satellite systems, or simply the high-precision determination of a random variable. At what point does the precision of your estimates fall into the "high-precision" category? Is this always a convention tied to the method of determining the random variable, or is there a general formulation for classifying estimates as high-precision?
What are the possible ways of rectifying a lack-of-fit test that shows up as significant? Context: optimization of lignocellulosic biomass acid hydrolysis (dilute acid) mediated by nanoparticles.
Hello,
I have the following problem. I have made three measurements of the same event under the same measurement conditions.
Each measurement has a unique probability distribution. I have already calculated the mean and standard deviation for each measurement.
My goal is to combine my three measurements to get a general result of my experiment.
I know how to calculate the combined mean: (x_comb = (x1_mean+x2_mean+x3_mean)/3)
I don't know how to calculate the combined standard deviation.
Please let me know if you can help me. If you have any other questions, don't hesitate to ask me.
Thank you very much! :)
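Assuming the three measurements are based on the same number of observations and are equally weighted, the combined variance follows from the law of total variance: the average within-measurement variance plus the spread of the three means around the combined mean.

```latex
s_{\mathrm{comb}}^{2}
  = \underbrace{\frac{1}{3}\sum_{i=1}^{3} s_i^{2}}_{\text{within-measurement}}
  + \underbrace{\frac{1}{3}\sum_{i=1}^{3}\bigl(\bar{x}_i - \bar{x}_{\mathrm{comb}}\bigr)^{2}}_{\text{between-measurement}},
\qquad
s_{\mathrm{comb}} = \sqrt{s_{\mathrm{comb}}^{2}}.
```

With unequal numbers of observations the two terms are weighted by the n_i (or n_i - 1 for sample variances), and if what you actually need is the uncertainty of the combined mean itself, the standard error of the three means is the quantity to report instead.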
Is ex ante power analysis the same as a priori power analysis or is it something different in the domain of SEM and multiple regression analysis? If it is different, then what are the recommended methods or procedures? Any citations for it?
Thank you for precious time and help!
I want to ask about the usage of parametrical and non-parametrical tests if we have an enormous sample size.
Let me describe a case for discussion:
- I have two groups of samples of a continuous variable (let's say: Pulse Pressure, so the difference between systolic and diastolic pressure at a given time), let's say from a) healthy individuals (50 subjects) and b) patients with hypertension (also 50 subjects).
- there are approx. 1000 samples of the measured variable from each subject; thus, we have 50*1000 = 50000 samples for group a) and the same for group b).
My null hypothesis is: that there is no difference in distributions of the measured variable between analysed groups.
I calculated two different approaches, providing me with a p-value:
Option A:
- I took all samples from group a) and b) (so, 50000 samples vs 50000 samples),
- I checked the normality in both groups using the Shapiro-Wilk test; both distributions were not normal
- I used the Mann-Whitney test and found significant differences between distributions (p<0.001), although the median value in group a) was 43.0 (Q1-Q3: 33.0-53.0) and in group b) 41.0 (Q1-Q3: 34.0-53.0).
Option B:
- I averaged the variable's values over all participants (so, 50 samples in group a) and 50 samples in group b))
- I checked the normality in both groups using the Shapiro-Wilk test; both distributions were normal,
- I used Student's t-test and obtained a p-value of 0.914, with median values of 43.1 (Q1-Q3: 33.3-54.1) in group a) and 41.8 (Q1-Q3: 35.3-53.1) in group b).
My intuition is that I should use option B and average the signal before the testing. Otherwise, I reject the null hypothesis, having a very small difference in median values (and large Q1-Q3), which is quite impractical (I mean, visually, the box plots look very similar, and they overlap each other).
What is your opinion about these two options? Are both correct but should be used depending on the hypothesis?
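For what it is worth, Option A treats the 1000 within-subject samples as independent, which inflates the effective sample size (pseudoreplication). Two alternatives that keep the subject as the unit of inference are sketched in R below (column names hypothetical): aggregate per subject first, or keep all samples but model the subject as a random effect.

```r
library(lme4)

# Option B made explicit: average within subject, then compare subjects
subj_means <- aggregate(pp ~ subject + group, data = dat, FUN = mean)
t.test(pp ~ group, data = subj_means)

# Alternative using all samples without treating them as independent:
fit <- lmer(pp ~ group + (1 | subject), data = dat)
summary(fit)
```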
Hello,
In the realm of economic and social analysis, one question looms large: can advanced statistical and mathematical models effectively capture and withstand the complexities of the real world?
Thanks
Is it possible to run a regression with both secondary and primary data in the same model? I mean, when the dependent variable is primary data to be sourced via a questionnaire and the independent variable is secondary data to be gathered from published financial statements?
For example: suppose the topic is capital budgeting moderators and shareholders' wealth (SHW). The capital budgeting moderators are proxied by inflation, management attitude to risk, economic conditions and political instability, while SHW is proxied by market value, profitability and retained earnings.
I have non-parametric continuous data that does not have a normal distribution, and I want to apply correlation analysis. However, I am undecided about which correlation method I should use: Pearson, Spearman, Kendall, or another method. I know there are many methods for non-parametric data, but which one should I choose for correlation?
These are a few questions for your reference:
How much did you learn about managing your money from your parents?
· None
· Hardly at all
· Little
· Some
· A lot
How often did you discuss finances with, or were you influenced by, your parents?
· Never
· Once a year
· Every few months
· Twice a month
· Weekly
What is your current investment amount in stocks/shares? (Portfolio value)
· 1 - 90,000
· 90,000–170,000
· 170,000–260,000
· 260,000–340,000
· More than 340,000
The above questions are allocated weights from 1 to 5.
In plant breeding, what are the uses of the discriminant function?
Assume that $X \in \mathbb{R}^{p \times n}$ is the data matrix, where p is the dimension and n is the sample size. We obtain data permutation matrices by randomly permuting the entries in each column of the data matrix. What are the statistical applications of data permutation matrices obtained in this way?
Is nonparametric regression used in psychology research? And if yes, what types? I know about quantile regression, but I can't find much literature where researchers use kernel regression or local regression. Is this because they aren't useful for such research?
I'm working on my PhD thesis and I'm stuck around expected analysis.
I'll briefly explain the context then write the question.
I'm studying moral judgment in the cross-context between Moral Foundations Theory and Dual Process theory.
Simplified: MFT states that moral judgments are almost always intuitive, while DPT states that better reasoners (those higher on cognitive-capability measures) will make moral judgments through analytic processes.
I have another idea - people will make moral judgments intuitively only for their primary moral values (e.g., for conservatives those are the binding foundations - respecting authority, ingroup loyalty and purity), while for the values they are not much concerned about they will have to use analytical processes to figure out what judgment to make.
To test this idea, I'm giving participants:
- a few moral vignettes to judge (one concerning progressive values and one concerning conservative values) on 1-7 scale (7 meaning completely morally wrong)
- moral foundations questionnaire (measuring 5 aspects of moral values)
- CTSQ (Comprehensive Thinking Styles Questionnaire), CRT and belief bias tasks (8 syllogisms)
My hypothesis is therefore that cognitive measures of intuition (such as intuition preference from CTSQ) will predict moral judgment only in the situations where it concerns primary moral values.
My study design is correlational. All participants are answering all of the questions and vignettes. So I'm not quite sure how to analyse the findings to test the hypothesis.
I was advised to do a regression analysis where the moral values (the 5 from the MFQ) or the moral judgments from the two different vignettes would be predictors, and the intuition measure would be the dependent variable.
My concern is that this analysis is the wrong choice, because I'll have both progressives and conservatives in the sample, which means both groups of values should predict intuition if my assumption is correct.
I think I need to either split people into groups based on their MFQ scores and then do this analysis, or introduce some kind of multi-step analysis or control, but I don't know what the right approach would be.
If anyone has any ideas please help me out.
How would you test the given hypothesis with available variables?
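One way to avoid splitting the sample is to express the "only for primary values" idea as an interaction: intuition should predict the judgment of a vignette only when endorsement of the corresponding foundation is high. A rough sketch in R, with hypothetical variable names:

```r
# Judgment of the conservative-themed vignette, predicted by intuition
# preference, endorsement of binding foundations (MFQ), and their interaction;
# the interaction term carries the hypothesis.
m_cons <- lm(judgment_conservative ~ intuition_ctsq * binding_mfq +
               crt + belief_bias, data = dat)
summary(m_cons)

# Mirror-image model for the progressive-themed vignette
m_prog <- lm(judgment_progressive ~ intuition_ctsq * individualizing_mfq +
               crt + belief_bias, data = dat)
```

A single multilevel model with vignette type as a within-person factor (e.g. judgment ~ intuition * foundation_match + (1 | participant)) is another way to express the same hypothesis without dichotomising participants.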
I constructed a linear mixed-effects model in Matlab with several categorical fixed factors, each having several levels. Fitlme calculates confidence intervals and p values for n-1 levels of each fixed factor compared to a selected reference. How can I get these values for other combinations of factor levels? (e.g., level 1 vs. level 2, level 1 vs. level 3, level 2 vs. level 3).
Thanks,
Chen
I am attempting to use the Seurat FindAllMarkers function to validate markers for rice taken from the plantsSCRNA-db. I want to use the ROC test in order to get a good idea of how effective any of the markers are. While doing a bit of research, different stats forums say: "If we must label certain scores as good or bad, we can reference the following rule of thumb from Hosmer and Lemeshow in Applied Logistic Regression (p. 177):
0.5 = No discrimination
0.5-0.7 = Poor discrimination
0.7-0.8 = Acceptable discrimination
0.8-0.9 = Excellent discrimination
>0.9 = Outstanding discrimination"
For more background, the output of the function returns a dataframe with a row for each gene, showing myAUC: area under the Receiver Operating Characteristic, and Power: the absolute value of myAUC - 0.5 multiplied by 2. Some other statistics are included as well such as average log2FC and the percent of cells expressing the gene in one cluster vs all other clusters.
With this being said, I would assume a myAUC score of 0.7 or above would imply the marker is effective. However given the formula used to calculate power, a myAUC score of 0.7 would correlate to a power of 0.4. So with this being said, would it be fair to assume that myAUC should be ignored for the purposes of validating markers? Or should both values be taken into account somehow?
Hi everyone,
I need to convert standard error (SE) into standard deviation (SD). The formula for that is
SE times the square root of the sample size
By 'sample size', does it mean the total sample size or the sample sizes of the individual groups? For example, the intervention group has 40 participants while the control group has 39 (so the total sample size is 79). So, when calculating the SD for the intervention group, do I use 40 as the sample size or 79?
Thank you!
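The conversion uses each group's own sample size, not the total. A two-line illustration in R (the SE values are placeholders):

```r
# SD = SE * sqrt(n), with the group-specific n
sd_intervention <- se_intervention * sqrt(40)
sd_control      <- se_control      * sqrt(39)
```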
My device is quite old and can't run SPSS. Is there any acceptable alternative available?
I have ordinal data on happiness of citizens from multiple countries (from the European Value Study) and I have continuous data on the GDP per capita of multiple countries from the World Bank. Both of these variables are measured at multiple time points.
I want to test the hypothesis that countries with a low GDP per capita will see more of an increase in happiness with an increase in GDP per capita than countries that already have a high GDP per capita.
My first thought to approach this is that I need to make two groups; 1) countries with low GDP per capita, 2) countries with high GDP per capita. Then, for both groups I need to calculate the correlation between (change in) happiness and (change in) GDP per capita. Lastly, I need to compare the two correlations to check for a significant difference.
I am stuck, however, on how to approach the correlation analysis. For example, I don't know how (and whether) to include the repeated measures from the different time points at which the data were collected. If I just base my correlations on one time point, I feel I am not really testing my research question, since I am talking about an increase in happiness and an increase in GDP, which is a change over time.
If anyone has any suggestions on the right approach, I would be very thankful! Maybe I am overcomplicating it (it wouldn't be the first time)!
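Rather than comparing two correlations, one option is a single multilevel model on the country-year panel in which the GDP-happiness slope is allowed to differ between low- and high-GDP countries via an interaction. A rough sketch in R (variable names hypothetical; treating the ordinal happiness score as numeric is a simplification):

```r
library(lme4)

dat$log_gdp <- log(dat$gdp_pc)   # log GDP already encodes diminishing returns

fit <- lmer(happiness ~ log_gdp * baseline_gdp_group + year +
              (1 | country), data = dat)
summary(fit)   # the interaction tests whether the GDP-happiness slope differs
               # between countries that started low vs high
```

If you prefer to respect the ordinal scale, an ordinal multilevel model (e.g. ordinal::clmm) with the same fixed effects is a more faithful alternative.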
Hello to everyone,
I've had several discussions with my colleagues about setting up field experiments to be replicated in different environments.
We agree that each experiment must have exactly the same experimental design to ensure data comparability.
I've been told that these experiments must also have the exact same randomization. I don't agree, because I believe that it is the experimental design itself that ensures data comparability. Below I attach a drawing to better explain the issue:
In the attached file, I have the same experimental design between locations, with the same randomization within subplots. Shouldn't we randomize the treatments (i, ii, iii and iv) within each subplot? Does it make sense to have an exact copy of experimental fields?
Thanks in advance!
I am carrying out statistical testing on a paired sample - before and after a medication review in 50 patients. I think I am correct in using the Wilcoxon signed-rank test, as the data are not normally distributed. However, I want to check: the review could only reduce the number of medicines, so there are no positive ranks, only negative ones, and my test statistic therefore comes out at 0. This obviously rejects the null hypothesis, indicating a significant reduction in the number of medicines, but I just wanted to check - is this normal? Should a different test be used, or is this OK?
Thanks
Hello everyone,
I am currently doing research on the impact of online reviews on consumer behavior. Unfortunately, statistics are not my strong point, and I have to test three hypotheses.
The hypotheses are as follows: H1: There is a connection between the level of reading online reviews and the formation of impulsive buying behavior in women.
H2: There is a relationship between the age of the respondents and susceptibility to the influence of online reviews when making a purchase decision.
H3: There is a relationship between respondents' income level and attitudes that online reviews strengthen the desire to buy.
Questions related to age, level of income and level of reading online reviews were set as ranks (e.g. 18-25 years; 26-35 years...; 1000-2000 Eur; 2001-3000 Eur; every day; once a week; once a month etc.), and the questions measuring attitudes and impulsive behavior were formed in the form of a Likert scale.
What statistical method should be used to test these hypotheses?
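Since all three hypotheses are associations between ordinal (ranked or Likert-type) variables, a rank correlation per hypothesis is one simple, defensible choice. A minimal sketch in R with hypothetical variable names:

```r
# H1 concerns women only, so restrict the sample first
dat_f <- subset(dat, gender == "female")
cor.test(dat_f$review_reading_level, dat_f$impulsive_buying_score,
         method = "spearman")                                          # H1
cor.test(dat$age_group,    dat$review_influence_score, method = "spearman")  # H2
cor.test(dat$income_level, dat$desire_to_buy_score,    method = "spearman")  # H3
```

Kendall's tau (method = "kendall") is an equally reasonable alternative with many tied ranks.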
Dear all,
I want to know your opinions
Also, there is a good paper here:
Also,
Hello everyone,
I would like to investigate three factors using a central composite design.
Each factor has 5 levels (+1 and -1, 2 axial points and 1 center point).
I chose my high and low levels (+ and -1) based on a screening DoE I did previously using the same factors.
I chose an alpha of 1.681 for the axial points because I would like my model to be rotatable. However, for one of the three factors, one of the axial points is outside the feasible range (a negative CaCl concentration...). I thought of increasing my low level for this factor to avoid this - let's say, increasing the value from 0.05 to 0.1 so that the axial point no longer reaches the negative range - but I was wondering whether this would affect the reliability of my model?
Another option would be to change the design to one that has no axial points outside the design points. However, this is actually my area of interest.
Can anyone help?
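If it helps to compare options concretely, the rsm package can generate both a rotatable CCD and a face-centred one (axial points at the factorial levels, so they cannot go negative). The factor names, centres and half-ranges below are purely illustrative.

```r
library(rsm)

codings <- list(x1 ~ (CaCl - 0.55)/0.45,   # hypothetical centre 0.55, half-range 0.45
                x2 ~ (B    - 50)/10,
                x3 ~ (C    - 7)/2)

# Rotatable design: alpha ~ 1.682, axial points fall outside the -1/+1 cube
rot <- ccd(3, alpha = "rotatable", coding = codings)

# Face-centred design: alpha = 1, axial points stay at the factorial levels
fc  <- ccd(3, alpha = 1, coding = codings)
```

An inscribed CCD (inscribed = TRUE), which shrinks the factorial points so the axial points sit at the original limits, is a further alternative when the limits are hard constraints.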
Hello, I have a question regarding using a binary-coded dependent variable on the Mann-Whitney U test.
I have a test with 15 questions from 3 different categories in my study. The answers are forced answers and have one correct answer. I coded the answers as binary values with 1 being correct and 0 being incorrect.
Therefore, for 3 different categories, the participants have a mean score between 0 and 1 representing their success (I took the mean because I have many participants who did not answer 2 or 3 questions).
Does it make sense to use the mean of binary-coded values as a dependent variable in a nonparametric test, or does that sound odd, and should I apply something else, like chi-square or logistic regression?
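One option that avoids averaging altogether is a binomial model on the counts, which uses the number of correct answers out of the number of questions actually answered and so handles the skipped items naturally. A rough sketch in R (variable names hypothetical):

```r
# One row per participant and category: n_correct, n_answered, group
fit <- glm(cbind(n_correct, n_answered - n_correct) ~ group * category,
           family = binomial, data = dat)
summary(fit)

# With the three categories repeated within each participant, a mixed version
# is more appropriate:
# lme4::glmer(cbind(n_correct, n_answered - n_correct) ~ group * category +
#               (1 | participant), family = binomial, data = dat)
```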
I am currently studying Research and Methodology and got this query. Please can anyone answer this?
Hello!
In general, as a rule of thumb, what is the acceptable value for standardised factor loadings produced by a confirmatory factor analysis?
And, what could be done/interpretation if the obtained loadings are lower than the acceptable value?
How does everyone approach this?
Merry Christmas everyone!
I used the Interpersonal Reactivity Index (IRI) subscales Empathic Concern (EC), Perspective Taking (PT) and Personal Distress (PD) in my study (N = 900). When I calculated Cronbach's alpha for each subscale, I got .71 for EC, .69 for PT and .39 for PD. The value for PD is very low. The analysis indicated that if I deleted one item, the alpha would increase to .53, which is still low but better than .39. However, as my study does not focus mainly on the psychometric properties of the IRI, what kind of arguments can I make to say the results are still valid? I did say the findings (for PD) should be taken with caution, but what else can I say?
Respected researchers,
I am using the Friedman test to determine whether there are statistically significant differences between the session means within a group. Can I do this?
Can someone please share how to select the best algorithm to use for the child nodes when you have the parent nodes' probabilities, as there are so many algorithms to choose from in the GeNIe software?
The algorithms are
Relevance-based decomposition, polytree, EPIS sampling, AIS sampling, Logic sampling, Backward sampling, Likelihood sampling, Self-importance
I am currently working on my master thesis and I ran into this statistical problem. Hopefully one of you can help me, because so far I can only see that a mediation analysis with a MANCOVA isn't possible.
I am measuring two continuous variables over time in four groups. Firstly, I want to determine if the two variables correlate in each group. I then want to determine if there is significant differences in these correlations between groups.
For context, one variable is weight, and one is a behaviour score. The groups are receiving various treatment and I want to test if weight change influences the behaviour score differently in each group.
I have found the r package rmcorr (Bakdash & Marusich, 2017) to calculate correlation coefficients for each group, but am struggling to determine how to correctly compare correlations between more than two groups. The package diffcorr allows comparing between two groups only.
I came across this article describing a different method in SPSS:
However, I don't have access to SPSS so am wondering if anyone has any suggestions on how to do this analysis in r (or even Graphpad Prism).
Or I could use the diffcorr package to calculate differences for each combination of groups, but then would I need to apply a multiple-comparison correction?
Alternatively, Mohr & Marcon (2005) describe a different method using Spearman correlation that seems like it might be more relevant; however, I wonder why their method doesn't seem to have been used by other researchers. It also looks difficult to implement, so I'm unsure if it's the right choice.
Any advice would be much appreciated!
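One workable route in R (and yes, a multiple-comparison correction is needed for all pairwise contrasts) is to compare the per-group correlation coefficients with Fisher's z transformation and then adjust the p-values. Treating the rmcorr error degrees of freedom as if they came from ordinary correlations is an approximation, so the sketch below should be read as such; the r and df values are hypothetical.

```r
compare_r <- function(r1, df1, r2, df2) {
  # Fisher z test for two independent correlations; df = n - 2 for ordinary
  # correlations, or the error df reported by rmcorr (approximation)
  z1 <- atanh(r1); z2 <- atanh(r2)
  se <- sqrt(1 / (df1 - 1) + 1 / (df2 - 1))
  2 * pnorm(-abs((z1 - z2) / se))          # two-sided p-value
}

groups <- data.frame(name = c("A", "B", "C", "D"),
                     r    = c(0.42, 0.10, 0.35, -0.05),
                     df   = c(58, 61, 55, 60))

pairs <- combn(nrow(groups), 2)
p_raw <- apply(pairs, 2, function(idx)
  compare_r(groups$r[idx[1]], groups$df[idx[1]],
            groups$r[idx[2]], groups$df[idx[2]]))
p.adjust(p_raw, method = "holm")
```

An alternative that sidesteps pairwise correlation comparisons entirely is a single mixed model with a group-by-weight interaction on the behaviour score.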
Hello, I currently have a set of categorical variables, coded as Variable A,B,C,etc... (Yes = 1, No = 0). I would like to create a new variable called severity. To create severity, I know I'll need to create a coding scheme like so:
if Variable A = 1 and all other variables = 0, then severity = 1.
if Variable B = 1 and all other variables = 0, then severity = 2.
So on, and so forth, until I have five categories for severity.
How would you suggest I write a syntax in SPSS for something like this? Thank you in advance!
If someone could please share any report, paper, or thesis, it would be highly appreciated.
Dear community,
I am planning to conduct a GWAS analysis with two groups of patients differing in a binary characteristic. As this cohort is naturally very rare, our sample size is limited to a total of approximately 1500 participants (a low number for GWAS). Therefore, we are considering studying associations between pre-selected genes that might be phenotypically relevant to our outcome. As there are no pre-existing data/arrays that studied similar outcomes in a different patient cohort, we need to identify regions of interest bioinformatically.
1) Do you know any tools that might help me harvest genetic information for known pathways involved in relevant cell-functions and allow me to downscale my number of SNPs whilst still preserving the exploratory character of the study design? e.g. overall thrombocyte function, endothelial cell function, immune function etc.
2) Alternatively: are there bioinformatic ways (AI etc.) that circumvent the problem of multiple testing in GWAS studies and would allow me to robustly explore my dataset for associations even at lower sample sizes (n < 1500 participants)?
Thank you very much in advance!
Kind regards,
Michael Eigenschink
Dear community,
I've been reading a lot about dealing with omics data that lie outside the limits of quantification - there are a bunch of different recommendations on how to approach this. One paper drew the <LLOQ data at random from a normal distribution (interval: 0 to LLOQ) and used a log-normal distribution for data >ULOQ. Is that a sound idea? Does anyone have comments on this or further suggestions?
I am looking forward to your responses.
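For concreteness, the <LLOQ part of that approach can be sketched in R with a truncated normal draw; the mean and SD fed into the truncated distribution are themselves assumptions (here taken from the quantifiable part of the data) and should be justified, and repeating the imputation several times is a common safeguard.

```r
library(truncnorm)

impute_lloq <- function(x, lloq) {
  obs    <- x[!is.na(x) & x >= lloq]            # quantifiable observations
  below  <- is.na(x) | x < lloq                 # censored / missing values
  x[below] <- rtruncnorm(sum(below), a = 0, b = lloq,
                         mean = mean(obs), sd = sd(obs))
  x
}
```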
Assume we have a program with different instructions. Due to some limitations in the field, it is not possible to test all the instructions. Instead, assume we have tested 4 instructions and calculated their rank for a particular problem.
the rank of Instruction 1 = 0.52
the rank of Instruction 2 = 0.23
the rank of Instruction 3 = 0.41
the rank of Instruction 4 = 0.19
Then we calculated the similarity between the tested instructions using cosine similarity (after converting the instructions from text form to vectors- machine learning instruction embedding).
Question: is it possible to create a mathematical formula, using the rank values and the similarity between instructions, such that, given an untested instruction, we can calculate, estimate, or predict its rank based on its similarity to a tested instruction?
For example, we measure the similarity between instruction 5 and instruction 1. Is it possible to calculate the rank of instruction 5 based on its similarity with instruction 1? is it possible to create a model or mathematical formula? if yes, then how?
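One simple formalisation is a similarity-weighted average of the tested ranks (a kernel / nearest-neighbour style estimator); whether it predicts well for your problem is an empirical question. A minimal sketch in R, where the similarities are hypothetical:

```r
ranks <- c(0.52, 0.23, 0.41, 0.19)   # instructions 1-4, from the post
sims  <- c(0.90, 0.35, 0.50, 0.10)   # hypothetical cosine similarities of
                                     # instruction 5 to instructions 1-4

predict_rank <- function(ranks, sims) sum(sims * ranks) / sum(sims)
predict_rank(ranks, sims)            # similarity-weighted estimate for instruction 5
```

With more tested instructions this generalises to kernel regression or k-nearest-neighbour regression on the embedding vectors, which would also let you validate the idea by leave-one-out prediction of the ranks you already know.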
During the lecture, the lecturer mentioned the properties of frequentist estimators, as follows:
Unbiasedness is only one of the frequentist properties — arguably, the most compelling from a frequentist perspective and possibly one of the easiest to verify empirically (and, often, analytically).
There are however many others, including:
1. Bias-variance trade-off: we would consider as optimal an estimator with little (or no) bias; but we would also value ones with small variance (i.e. more precision in the estimate), So when choosing between two estimators, we may prefer one with very little bias and small variance to one that is unbiased but with large variance;
2. Consistency: we would like an estimator to become more and more precise and less and less biased as we collect more data (technically, when n → ∞).
3. Efficiency: as the sample size increases indefinitely (n → ∞), we expect an estimator to become increasingly precise (i.e. its variance reduces to 0 in the limit).
Why do frequentist estimators have these kinds of properties, and can we prove them? I think these properties could also be applied to many other statistical approaches.
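These are statements about how an estimator behaves over repeated samples, so they can be proved analytically for specific estimators (for the sample mean, unbiasedness follows from linearity of expectation and consistency from the law of large numbers, since Var(x̄) = σ²/n) and illustrated, though not proved, by simulation. A small sketch in R:

```r
# Sample mean of a N(5, 2^2) variable: bias stays near zero while the
# sampling variance shrinks as n grows (consistency / increasing precision).
set.seed(1)
sim <- function(n, reps = 5000) {
  est <- replicate(reps, mean(rnorm(n, mean = 5, sd = 2)))
  c(n = n, bias = mean(est) - 5, variance = var(est))
}
t(sapply(c(10, 100, 1000), sim))
```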
Assuming that a researcher does not know the nature of the population distribution (the parameters or the type, e.g. normal, exponential, etc.), is it possible that the sampling distribution can indicate the nature of the population distribution?
According to the central limit theorem, the sampling distribution of the mean is likely to be normal, so the exact population distribution cannot be recovered from it. Is the shape of the distribution for a large sample enough, or does it have to be inferred logically based on other factors?
Am I missing some points? Any lead or literature will help.
Thank you
Presently I am handling a highly positively skewed geochemical dataset. After several attempts, I have prepared a 3 parameter lognormal distribution (using natural log and additive constant c). The descriptive statistic parameters obtained are log-transformed mean (α) and standard deviation (β). The subsequent back-transformed mean and standard deviation (BTmean and BTsd) are based on the formula
BTmean = e ^ (α + (β^2/2)) - c
BTsd = Sqrt[ (BTmean)^2 * (e^(β^2) - 1) ] - c
However, someone suggested using a Lagrange multiplier. I am not sure about:
1) Equation using the Lagrange Multiplier
2) How to derive the value of Lagrange multiplier in my case.
Kindly advise.
Regards
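For reference, assuming the three-parameter model X = e^Z − c with Z ~ N(α, β²), the back-transformed moments are as below; note that the additive constant c shifts the mean but leaves the standard deviation unchanged. Lagrange multipliers normally enter only if the parameters are estimated under an explicit constraint, which is a separate issue from the back-transformation itself.

```latex
\mathrm{E}[X] = e^{\alpha + \beta^{2}/2} - c,
\qquad
\mathrm{SD}[X] = e^{\alpha + \beta^{2}/2}\,\sqrt{e^{\beta^{2}} - 1}
              = \bigl(\mathrm{E}[X] + c\bigr)\sqrt{e^{\beta^{2}} - 1}.
```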
What technique should I use to convert annual ESG score data into monthly, weekly, or daily data with good accuracy?
And how can I implement it in Python?
Hi everyone,
I am struggling a bit with data analysis.
If I have 2 separate groups, A and B.
And each group has 3 repeats, A1, A2, A3 and B1, B2, B3, for 10 time points.
How would I determine statistical significance between the 2 groups?
If I then added a third group, C, with 3 repeats C1,C2,C3 for each timepoint.
What statistical analysis would I use then?
Thanks in advance
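One general-purpose option, sketched in R with hypothetical column names (long format: value, group, time, replicate), treats the replicate as the repeated unit measured at every time point; the same model works unchanged when group C is added.

```r
library(lmerTest)   # lme4 interface with p-values in anova()
library(emmeans)

fit <- lmer(value ~ group * time + (1 | replicate), data = dat)
anova(fit)                              # overall group, time and interaction effects
emmeans(fit, pairwise ~ group | time)   # group comparisons at each time point
```

With only three replicates per group, a summary-measure approach (e.g. comparing each replicate's area under the time curve between groups) is a more conservative fallback.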
To be more precise, my dependent variable was the mental well-being of students. The first analysis was chi-square (mental well-being x demographic variable), hence I treated the dv as categorical. Then, in order to find the influence of mental well-being on my independent variable, I treated the dv as a continuous variable so that I can analyse it using multiple regression.
Is it appropriate and acceptable? and is there any previous study that did the same thing?
Need some advice from all of you here. Thank you so much
Modelling biology is often a challenge, even more so when dealing with behavioural data. Models quickly become extremely complex, full of variables and random effects. When trying to deal with a complex data set there are often several variables (or questions) you're interested in that might explain the variation in the response variable. But is it better to fit one very complex model or several? Let me put an example:
We would like to know more about the relationship between nursing behaviour and rank in a wild primate. For that, we record nursing duration and the rank of the mother. However, we think that the age of the mother and the infant are also interesting sources of variation. We will also record variables that we think might affect but that we are not necessarily interested in like the weather.
My first intuition is to put everything in the model:
- nursing duration ~ rank + mothers' age + infants' age + mothers' age*infants' age + (1| weather)
I want to believe that by including all variables you reduce type I errors. But I have not been able to find an explanation of why that is the case.
Would it be statistically correct to perform two models instead, one for each question?
- nursing duration ~ rank + (1| weather)
- nursing duration ~ mothers' age + infants' age + mothers' age*infants' age + (1| weather)
I have been told that a common practice is to fit the most complex model first and then remove variables until you arrive at the lowest AIC. But I am not sure there is a better way to assess how many variables you should include in a model.
Please let me know if you know of any books or further reading addressing these kinds of questions, ideally focusing on statistics for biologists or behavioural ecologists.
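One alternative to stepwise deletion is to specify the candidate models up front (full model, rank-only, age-only, etc.) and compare them directly on an information criterion; a rough sketch in R, keeping the random-effect structure exactly as written in the post:

```r
library(lme4)

m_full <- lmer(nursing ~ rank + mother_age * infant_age + (1 | weather),
               data = dat, REML = FALSE)   # ML fits so AICs are comparable
m_rank <- lmer(nursing ~ rank + (1 | weather), data = dat, REML = FALSE)
m_age  <- lmer(nursing ~ mother_age * infant_age + (1 | weather),
               data = dat, REML = FALSE)

AIC(m_full, m_rank, m_age)
```

For reading aimed at exactly this audience, Zuur et al. (2009), "Mixed Effects Models and Extensions in Ecology with R", and Burnham & Anderson's work on model selection and multimodel inference discuss the one-big-model versus candidate-set trade-off in depth.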
Well,
I am a very curious person. During COVID-19 in 2020, working with coded data and taking only the last name, I noticed in my country that people with certain surnames were more likely to die than others (and this pattern has remained unchanged over time). Using simple ratios and proportions, I performed a "conversion" so that all surnames had the same weighting, and inconsistencies emerged. The rest was a simple exercise in probability and statistics, which revealed this controversial fact.
Of course, what I did was a shallow study, just a data mining exercise, but it has been something that caught my attention, even more so when talking to an Indian researcher who found similar patterns within his country about another disease.
In the context of pandemics (for the end of these and others that may come)
I think it would be interesting to have a line of research involving different professionals such as data scientists; statisticians/mathematicians; sociology and demographics; human sciences; biological sciences to compose a more refined study on this premise.
Some questions still remain:
What if we could have such answers? How should Research Ethics be handled? Could we warn people about care? How would people with certain last names considered at risk react? And the other way around? From a sociological point of view, could such a recommendation divide society into "superior" or "inferior" genes?
What do you think about it?
=================================
Note: Due to important personal matters I have taken a break and returned with my activities today, February 13, 2023. I am too happy to come across many interesting feedbacks.
I'm doing a germination assay of 6 Arabidopsis mutants under 3 different ABA concentrations in solid medium. I have 4 batches. Each batch has 2 plates for each mutant, 3 for the wild type, and each plate contains 8-13 seeds. Some seeds and plates were lost to contamination, so I don't have the same sample size for each mutant in each batch, and in some cases a mutant is no longer present in a batch. I recorded the germination rate per mutant after a week and expressed it as a percentage. I'm using R. How can I best analyse these data to test whether the mutations affect the germination rate in the presence of ABA?
I've two main questions:
1. Do I consider each seed as a biological replica with categorical type of result (germinated/not-germinated) or each plate with a numerical result (% germination)?
2. I compare treatments within the genotype. Should I compare mutant against wild type within the treatment, the treatment against itself within mutant, or both?
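On question 1, one common choice is to analyse plate-level counts (germinated out of total seeds) with a binomial GLMM, which handles the unequal seed numbers and missing plates naturally and makes the percentage transformation unnecessary. A minimal sketch in R, with hypothetical column names:

```r
library(lme4)

# one row per plate: germinated, total, genotype, ABA (concentration), batch
fit <- glmer(cbind(germinated, total - germinated) ~ genotype * ABA +
               (1 | batch), family = binomial, data = plates)
summary(fit)

# Mutant vs wild type within each ABA level, e.g.:
# emmeans::emmeans(fit, pairwise ~ genotype | ABA, type = "response")
```

On question 2, the genotype-by-ABA interaction in this model is what tests whether the ABA response differs between mutants and the wild type, with mutant-vs-wild-type contrasts within each concentration as the follow-up comparisons.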
I'm trying to construct a binary logistic regression model. The first model includes 4 predictor variables, and the intercept is not statistically significant. Meanwhile, in the second model, I exclude one variable from the first model and the intercept is significant.
The consideration that I take here is that:
The pseudo R² of the first model is higher, so it explains the data better than the second model.
Any suggestion which model should I use?
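Since the second model is nested in the first (one predictor removed), a likelihood-ratio test compares them directly and is usually more informative than the significance of the intercept. A minimal sketch in R with hypothetical variable names:

```r
m1 <- glm(y ~ x1 + x2 + x3 + x4, family = binomial, data = dat)
m2 <- glm(y ~ x1 + x2 + x3,      family = binomial, data = dat)

anova(m2, m1, test = "Chisq")   # does the extra predictor significantly improve fit?
AIC(m1, m2)                     # lower AIC = better trade-off of fit and complexity
```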
Could you please elaborate on the specific differences between scale development and index development (based on formative measurement) in the context of management research? Is it essential to use only the pre-defined or pre-tested scales to develop an index, such as brand equity index, brand relationship quality index? Suggest some relevant references.
Dear all,
I have a question about a mediation hypothesis interpretation.
We have a model in which the direct effect of X on Y is significant, and its standardized estimate is greater than the indirect effect estimate (X -> M -> Y), which is significant too.
As far as I can understand, it should be a partial mediation, but should the indirect effect estimate be larger than the direct effect estimate to assess a partial mediation effect?
Or is the significance of the indirect effect sufficient to assess the mediation?
Thanks in advance,
Marco
300 participants in my study viewed 66 different moral photos and had to make a binary choice (yes/no) in response to each. There were 3 moral photo categories (22 positive images, 22 neutral images and 22 negative images). I am running a multilevel logistic regression (we manipulated two other aspects of the images) and have found unnaturally high odds ratios (see below). We have no missing values. Could anyone please help me understand what the results below might mean? I understand I need to approach them with extreme caution, so any advice would be highly appreciated.
Yes choice: morally negative compared to morally positive (OR=441.11; 95% CI [271.07,717.81]; p<.001)
Yes choice: morally neutral compared to morally positive (OR=0.94; 95% CI [0.47,1.87]; p=0.86)
It should be noted that when I plot the data, very very few participants chose yes in response to the neutral and positive images. Almost all yes responses were given in response to the negative images.
Hi Folks,
I am working on a meta-analysis and I am trying to convert data into effect sizes (Cohen's d) to provide a robust synthesis of the evidence. All the studies used a one-group pre-post design and the outcome variables were assessed before and after the participation in an intervention.
Although the majority of the studies included in this meta-analysis reported either the effect sizes (Cohen's d) or the mean changes, a few of them reported the median changes. I am wondering if there is a way to calculate the effect sizes of these median changes.
For example, the values reported in one paper are:
Pre Median (IQR) = 280.5 (254.5 - 312.5)
Post Median (IQR) = 291.0 (263.5 - 321.0)
Is there any way I can convert these values into Cohen's d?
Thank you very much for your help.
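One commonly used option is to approximate means and SDs from the reported medians and IQRs (e.g. the large-sample formulas of Wan et al., 2014, BMC Medical Research Methodology) and then compute d as usual. A rough sketch in R using the values quoted above; if you want a change-score-based d, the pre-post correlation has to be assumed as well.

```r
pre  <- c(med = 280.5, q1 = 254.5, q3 = 312.5)
post <- c(med = 291.0, q1 = 263.5, q3 = 321.0)

# Wan et al. (2014) large-sample approximations
est_mean <- function(x) (x["q1"] + x["med"] + x["q3"]) / 3
est_sd   <- function(x) (x["q3"] - x["q1"]) / 1.35

m_pre  <- est_mean(pre);  s_pre  <- est_sd(pre)
m_post <- est_mean(post); s_post <- est_sd(post)

# Cohen's d for a one-group pre-post design, standardised by the pooled
# pre/post SD (an assumed pre-post correlation would be needed instead if you
# standardise by the SD of the change scores)
d <- (m_post - m_pre) / sqrt((s_pre^2 + s_post^2) / 2)
d
```

Skewness is the main caveat: these approximations assume roughly symmetric distributions, so a sensitivity analysis excluding the converted studies is worth reporting.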
Hi All, I was wondering what statistical test do I use for this example. Comparing participants' ratings of a person's (1) competence and (2) employability, based on the person's (1) level of education and (2) gender.
So there are two IVs:
(1) The person's level of Education [3 levels].
(2) The person's Gender [2 genders].
So there is a total of 6 conditions presented to the participants [ 3 levels of education x 2 genders]. However, each participant is only presented with 4 conditions; meaning, there is a mixture of between-participants and within-participants used in the study.
There are two DVs:
(1) Participants' rating of the person's Competence.
(2) Participants' rating of the person's Employability.
I was thinking the statistical test would be MANOVA, but want to confirm.
Also, if the participants used in the study are a mixture of between-participants, and within-participants, how can MANOVA work in this case?
Any advice or insight on the above would be really appreciated. Thank you.
In statistics, Cramér's V is a measure of association between two nominal variables, giving a value between 0 and 1 (inclusive). It was first proposed by Harald Cramér (1946).
In many papers I came across, a threshold value of 0.15 (sometimes even 0.1) is considered meaningful, giving a hint of a low association between the variables being tested. Do you have any reference, mathematical foundation or explanation for why this threshold is relevant?
Regards,
Roland.
I am using an ARDL model; however, I am having some difficulties interpreting the results. I found that there is cointegration in the long run. I have provided pictures below.