
Statistics - Science topic

Statistical theory and its application.
Questions related to Statistics
  • asked a question related to Statistics
Question
3 answers
Hi all!
I've run both LFQ and TMT 18-plex proteomics on the same protein extracts.
My experiment consists of two study conditions and 8 biological replicates.
After digesting my protein extractions, I ran half of each peptide preparation by DDA with four technical replicates; the other half I TMT-tagged (18-plex, two reference channels, one mixture), fractionated, and ran using an SPS-MS3 method on the Fusion Lumos.
I've done the searches in PD2.4, and summarised the results with `MSstats` and `MSstatsTMT`.
I'm currently working out how to deal with two different datasets from the same experiment. The original plan was to use the LFQ dataset for its improved coverage and the TMT dataset for its improved quantification.
One thing I've noticed is that while the TMT dataset has much better adjusted p-values, the fold changes are less pronounced than in the LFQ dataset, meaning that quite a few proteins fail the biological significance threshold. See the attached volcano plots (the vertical dotted lines mark a log2 fold change of 0.58, the horizontal line an adjusted p-value of 0.05). Apologies that the scales are not consistent between the plots!
I'm aware that MS2 TMT methods have an issue with reporter ion compression blunting fold change values, and was hoping that it would be less of an issue with my MS3 method. Is there a correction for this, or does this reflect a lack of dramatic fold-change in my biology?
Any other tips for integrating LFQ and TMT data would also be appreciated!
Sam
Relevant answer
Answer
Thank you for your answer, Nikhil Dev Narendradev.
I had a suspicion that this would be the case. I have looked at the correlation of TMT and LFQ fold changes, and it's unfortunately not great (plot attached). However, proteins with increased abundance in one technique do tend to show similar changes in the other.
I have decided to focus my analysis on the 25 proteins showing significance in both methods - this is a manageable list! I hadn't thought of focusing on the proteins that follow a linear relationship, however; that's an interesting idea.
  • asked a question related to Statistics
Question
3 answers
I am currently conducting a study using a photothrombotic stroke model in C57BL6 mice and measuring motor function outcome following strokes to determine if a pharmacological treatment can help improve their motor recovery. To measure motor recovery, I am using the tapered beam task and the grid walk task. Both of these tasks measure the number of errors that the mice make during trials. One thing that I've noticed is that a handful of the mice in the placebo group (no pharmacological treatment, just saline) are unable to complete the tasks on the first day of behavior due to the severity of the injury and the lack of treatment.
As such, I'm wondering if there is a standard way to handle missing data that is a result of severe injuries and is important for accurately reflecting differences between my groups. The methods that I can think of would either be filling with the mean for the group, filling with the highest number of errors of the group (e.g. the worst recorded score was 93 errors in the placebo group, presumably the mice unable to complete the task have more severe strokes and should receive the max number of errors observed), or multiple imputation using the MICE package in R. My understanding is that multiple imputation is the standard for filling in data that is not missing at random, but I want to ensure that is true in this scenario as well. Any citations (especially those specific to animal models) to support these methods would be greatly appreciated as well.
Relevant answer
Answer
A few questions: 1. are the mice "pre-trained to an established baseline" prior to injury? 2. If so, are there any "qualitative scores" of ambulation? 3. are the mice able to do any part of the task, or simply fail immediately? 4. finally, do they recover in the subsequent days, post injury i.e. repeated task on day 2, 3,4...?
If the mice are trained well before the insult, to a zero or near-zero error standard, any animal that can't perform would be scored at maximal error (which could be set as the maximum time allowed for the task, e.g., 60 s to traverse the beam). Piloting, as suggested, may also be helpful: if you pilot the number of errors on day 1 in untreated controls, you can find the upper threshold to set as a true failure limit. The scoring should have a wide enough range to accommodate both total failure and complete success. Bracketing the scores may be in order as well (0-2 foot slips is perfect vs. 3-10 foot slips is mildly impaired, etc.), with each bracket assigned a score number. In this manner, a repeated-measures ANOVA of Group (mean score) x Day would show the significance of deficits and improvement.
It appears from your question that you only have a few missing day-1 data points in the non-treated injured mice. As long as the injury is standard across all groups and repeatable, it should be fine to set "failure limits." Furthermore, we don't want to eliminate mice that fail to perform but eventually recover spontaneously. Titrating the injury too much can also have unintended consequences, ending up with no significant treatment effects.
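As a side note on the multiple-imputation option raised in the question: below is a minimal sketch of how the mice package is typically run in R, assuming a hypothetical data frame beam_data with a group column and a day-1 error count that is NA for mice unable to complete the task. Keep in mind that standard multiple imputation assumes the data are missing at random; when missingness is driven by injury severity (missing not at random), it should be combined with sensitivity analyses or with the maximal-error scoring described above.
# Minimal multiple-imputation sketch (hypothetical data frame and column names)
library(mice)
imp <- mice(beam_data, m = 20, method = "pmm", seed = 123)  # 20 imputed data sets, predictive mean matching
fit <- with(imp, lm(errors_day1 ~ group))                   # fit the analysis model in each imputed data set
summary(pool(fit))                                          # pool the results with Rubin's rules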
  • asked a question related to Statistics
Question
4 answers
P(X >= k - 1) for X ~ Binomial(n - 1, p)
P(X >= k) for X ~ Binomial(n, p)
Relevant answer
Answer
Hi,
We know that if X1 ~ Bin(n-1, p) and X2 ~ Bin(1, p) are independent, then Y = X1 + X2 ~ Bin(n, p).
Since 0 <= X2 <= 1, the event {X1 + X2 >= k} implies the event {X1 >= k - 1}; that is, {Y >= k} is a subset of {X1 >= k - 1}. Therefore
P(Y >= k) <= P(X1 >= k - 1),
i.e., P(X >= k) for X ~ Bin(n, p) is never larger than P(X >= k - 1) for X ~ Bin(n - 1, p). The difference between the two probabilities equals P(X1 = k - 1)(1 - p), so the inequality is strict whenever that quantity is positive.
Regards,
Hamid
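A quick numerical sanity check in R, with arbitrary example values (n = 20, p = 0.3, k = 7 are assumptions for illustration only):
n <- 20; p <- 0.3; k <- 7
1 - pbinom(k - 2, n - 1, p)   # P(X >= k - 1) for X ~ Bin(n - 1, p)
1 - pbinom(k - 1, n, p)       # P(Y >= k) for Y ~ Bin(n, p); never larger than the line above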
  • asked a question related to Statistics
Question
5 answers
I have seen many comments implying that if a newly developed scale has a solid theoretical background, EFA can (or better, should) be skipped. For a cognitive scale that I recently developed, I had a clear design for my items, based on previous theory. However, after administering it to my study population, I ran a WLSMV CFA with two first-order factors and saw that some items (out of a total of 50) have weak (<0.30) or very weak (<0.10) loadings and possible cross-loadings.
My fit indices improved to an excellent range after deleting some of the lowest-loading items. Even after that, I still have items with factor loadings of ~0.20. Reliability is good when they stay, and they don't look bad theoretically. After pruning to a minimum loading of 0.3, not only do my already good fit indices not improve much, but my reliability gets lower, and I lose a good chunk of items. You don't want to assess cognitive skills with 15 items, since almost all batteries have a minimum of 30-40 items. Should I keep them?
Also, some of the items with ceiling effect (98% correct responses) stay in the CFA model with good loadings. Should I keep them?
There are clear guidelines on item-deleting strategies for EFA. What about CFA?
Relevant answer
Answer
A reviewer will rip you apart if you skip EFA for a newly developed scale. One of the key points of an EFA is to uncover any discrepancies between the hypothesized factor structure and theory, but it will also reveal item redundancy, poor loadings, and potential cross-loadings (the latter two of which you observed), which already could tell you which items to remove.
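For reference, a minimal R sketch of the diagnostics being discussed here, assuming a data frame dat that contains only the ordinal item responses; the item and factor names are placeholders, the lavaan call mirrors the kind of WLSMV CFA described in the question, and psych::fa gives a quick exploratory check on the same items:
library(lavaan)
library(psych)
# Two correlated first-order factors on ordinal items (placeholder item names)
model <- '
  f1 =~ i01 + i02 + i03 + i04
  f2 =~ i05 + i06 + i07 + i08
'
fit <- cfa(model, data = dat, ordered = TRUE, estimator = "WLSMV")
summary(fit, fit.measures = TRUE, standardized = TRUE)   # loadings and fit indices
modindices(fit, sort. = TRUE)[1:10, ]                    # large modification indices flag possible cross-loadings
fa(dat, nfactors = 2, cor = "poly")                      # quick EFA on polychoric correlations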
  • asked a question related to Statistics
Question
4 answers
What’s the most common programming paradigm of no-code platforms? Why?
Relevant answer
Answer
The most common programming paradigm of no-code platforms is declarative programming.
Why Declarative Programming?
  • Focus on "What" not "How": No-code platforms allow users to describe what they want to achieve (e.g., "create a form," "generate a report") rather than how to do it. This makes it accessible to users who may not have traditional programming skills.
  • Visual Interfaces: These platforms often provide drag-and-drop interfaces, workflows, and rule-based systems, which align well with declarative principles where users specify the desired outcomes rather than writing detailed procedural code.
  • Ease of Use: Declarative paradigms reduce the complexity involved in traditional coding. Users can build applications, workflows, and automations by configuring pre-built components or templates, which is ideal for business users or non-developers.
  • Abstraction: Declarative programming abstracts the underlying logic and implementation details. This allows the platform to handle complex tasks behind the scenes, which is crucial for empowering users without deep technical knowledge.
These factors make declarative programming the ideal paradigm for no-code platforms, supporting their goal of democratizing software development and making it more accessible to a broader audience
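To make the "what, not how" contrast concrete, here is a small illustration in R with a hypothetical sales table: the loop spells out the procedure step by step, while the aggregate() call only states the desired result and leaves the execution strategy to the engine, much as a no-code platform does behind its visual configuration.
# Imperative: describe HOW to compute regional totals, step by step
totals <- numeric(0)
for (g in unique(sales$region)) {
  totals[g] <- sum(sales$amount[sales$region == g])
}
# Declarative: describe WHAT is wanted and let the engine work out the rest
aggregate(amount ~ region, data = sales, FUN = sum)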
  • asked a question related to Statistics
Question
3 answers
I am at the end of conducting a large systematic review and meta-analysis. I have experience of meta-analysis and have attempted to meta-analyse the studies myself, but I am not happy with my method. The problem is that almost all the studies are crossover studies and I am not sure how to analyse them correctly. I have consulted the Cochrane Handbook, and it seems to suggest a paired analysis is best, but I do not have the expertise to do this - https://training.cochrane.org/handbook/current/chapter-23#section-23-2-6
I am seeking a statistician familiar with meta-analysis to consult with, and if possible, undertake the meta-analysis. There are only two authors on this paper (me and a colleague), so you would either be second or last author. We aim to publish in a Q1 or Q2 journal, and from my own analysis I can see we have very interesting results.
Please let me know if you are interested.
Relevant answer
Answer
Depending on the structure of the data (how much pre-processing has been already done), I would be ready to conduct the meta-analysis as well. Please feel free to reach out by PM.
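For anyone following this thread, the paired (within-participant) approach from the Cochrane Handbook chapter linked in the question can be sketched in R with metafor roughly as follows. The column names are placeholders, and the within-participant correlation r is an assumption that typically has to be imputed from external data and varied in sensitivity analyses.
library(metafor)
# dat: one row per crossover trial with mean_t, mean_c, sd_t, sd_c (treatment/control) and n participants
r <- 0.5                                                                   # imputed within-participant correlation (assumption)
dat$yi <- dat$mean_t - dat$mean_c                                          # mean difference
dat$vi <- (dat$sd_t^2 + dat$sd_c^2 - 2 * r * dat$sd_t * dat$sd_c) / dat$n  # variance of the paired difference
res <- rma(yi, vi, data = dat)                                             # random-effects meta-analysis
summary(res)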
  • asked a question related to Statistics
Question
6 answers
Hello network,
Is there anyone who could help me with my empirical research - especially with the statistical elaboration - on the topic of entrepreneurial intention and business succession in German SMEs, or who has experience with the preparation of Structural Equation Modeling?
Please feel free to send me a private message.
Thank you and best regards
Julia
Relevant answer
Answer
Do you already have results that you want help interpreting, or have you not performed the analysis yet?
  • asked a question related to Statistics
Question
3 answers
Why are people of color supporting unchecked white power?
Relevant answer
Answer
Thanks for the link Alexander Ohnemus
  • asked a question related to Statistics
Question
4 answers
I plan to perform a t-test between the risk index and leakage values to determine whether there is a significant difference between the leakage values of low-, medium-, and high-risk pipes. However, the leakage data and the risk index were derived from the same raw data (e.g., pipe age, pipe material, pipe diameter, pressure).
Relevant answer
Answer
If you have one independent variable - the pipes, classified into 3 categories (low-, medium-, and high-risk) based on several factors (e.g., pipe age, pipe material, pipe diameter, pressure) - and one dependent variable, the leakage values, then you can use a one-way ANOVA test if its assumptions are met.
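A minimal R sketch of that one-way ANOVA, assuming a hypothetical data frame pipes with a leakage column and a risk column (low/medium/high):
pipes$risk <- factor(pipes$risk, levels = c("low", "medium", "high"))
fit <- aov(leakage ~ risk, data = pipes)
summary(fit)                                  # overall test for differences among risk categories
TukeyHSD(fit)                                 # pairwise comparisons between categories
shapiro.test(residuals(fit))                  # check normality of residuals
bartlett.test(leakage ~ risk, data = pipes)   # check homogeneity of variances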
  • asked a question related to Statistics
Question
8 answers
Hi all, I wanted to see whether extrinsic or intrinsic motivation affects the frequency of cosplaying. The questionnaire for motivation has 4 subscales, in which I mark them as E1 (social), E2 (social avoidance), I1 (intellectual), I2 (competency mastery). The frequency was measured as whether participants have cosplayed (meaning dressing up as a character at any event) once, twice, or more than thrice. Logically, intrinsic and extrinsic motivation influence each other, but I thought that it would be interesting to see whether there are any significant differences between each sub for this particular sample. In other words, it's like finding out, for example, which one contributes more to the behaviour of cosplaying; is it because of increasing skill mastery or a sense of belonging?
IV: types of motivation with 2 levels (intrinsic, extrinsic). possibly it's counted as 4 levels if I regard each subconstruct (social, social avoidance, intellectual, competency master) as 1 level.
DV: frequency of cosplay
My current hypotheses are kept simple for now:
H1: There is a relationship between extrinsic motivation and the frequency of cosplaying.
H2: There is a relationship between intrinsic motivation and the frequency of cosplaying.
I'm getting confused after all the research for a suitable test. For now, I'm clear that I have an ordinal IV and DV. However, I'm unsure where I should begin looking if I want to find differences in effect, or to predict which type of motivation subconstruct (E1, E2, I1 or I2) contributes most to the frequency of the behaviour. I'm looking into ordinal regression, linear regression, ANOVA... but none of them seems suitable. Originally, I thought of doing a Pearson correlation since that seems an obvious choice, but I'd like to explore the 4 subconstructs further. Unless I'm not looking at the whole picture correctly, I would really appreciate input and assistance. I'm more than happy to give more details about the research to help with this inquiry. Thank you!
Relevant answer
Answer
If there are no obviously applicable model structures with associated theory for significance testing, then you could consider going back to the basic principles of statistics. Specifically, you consider the two steps separately: (a) finding a test statistic that measures what you want to measure; (b) finding a null distribution for that test statistic.
The question of finding a null-distribution for a test-statistic can always be solved by applying a randomization argument, and often a permutation approach would be suitable.
The question of applying a randomization approach when you are using some complicated model-fitting procedure is solved by applying the whole model-fitting procedure to each version of the randomized data-set. The model-fitting procedure gives some maximised or minimised goodness-of-fit value and you use this as the test-statistic.
In your application, the goodness-of-fit value might be based on the ranked probability score, but you might want to construct a modified version of this tailored to your context.
You may wish to find an experienced statistician for face-to-face guidance.
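To make the permutation idea concrete, here is a minimal R sketch assuming a hypothetical data frame dat with a frequency column (the DV) and the four subscale scores. The lm/R-squared statistic is only a placeholder for whatever model-fitting procedure and goodness-of-fit measure (e.g., one based on the ranked probability score, as suggested above) you settle on.
# Goodness-of-fit statistic from the chosen model (placeholder: R-squared of a linear model)
fit_stat <- function(d) summary(lm(frequency ~ E1 + E2 + I1 + I2, data = d))$r.squared
obs <- fit_stat(dat)                        # observed statistic
perm <- replicate(2000, {
  d <- dat
  d$frequency <- sample(d$frequency)        # permute the DV to break any association with the IVs
  fit_stat(d)
})
p_value <- (sum(perm >= obs) + 1) / (length(perm) + 1)   # permutation p-value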
  • asked a question related to Statistics
Question
3 answers
Relevant answer
Answer
No, tail risks do not prompt cryptozoology.
  • asked a question related to Statistics
Question
4 answers
The factor loadings are above 0.92, and the AVE and CR are above 0.95. Is this acceptable? If not: I have tried deleting items, but the more I delete, the higher the factor loadings become. How can I deal with this?
Relevant answer
Answer
Hi. I agree with Heba Ramadan. It seems your observed variables (indicators) are too similar and will not provide enough variability to measure the latent variables.
It is strange to say, but such a level of AVE and CR (above 0.95) is bad news. Usually, when we delete items the situation is the opposite: we have several observed items with low factor loadings, and these pull down the AVE and CR.
So we eliminate the indicators with the lowest factor loadings to improve AVE and CR.
I think that in your case, if you do so, you will push AVE and CR even higher, which is bad.
Sorry to say, but I agree with Heba Ramadan, and I also think you should consider revising or rephrasing the indicators to increase the diversity of their content.
  • asked a question related to Statistics
Question
4 answers
I have 6 ecosystems, 3 of which are on substrate A and the other 3 on substrate B. Each ecosystem has about 10 species. I have calculated a Simpson's index value for each ecosystem and a Simpson's index value for each substrate. I would like to statistically compare the two index values of substrates A and B; is this possible in any way? Since I would like to statistically compare the biodiversity between the two substrates, what is the best way to go about this?
Relevant answer
You use a Mann-Whitney test since the data do not have a normal distribution :)
  • asked a question related to Statistics
Question
4 answers
I have six ecosystems in two substrate categories (triplicates, essentially). I have determined Shannon-Wiener index values for each ecosystem and also for the two categories separately. I have done this for two separate sets of data that were sampled in two separate years. Is it possible to statistically compare the development of biodiversity between the categories, i.e., the development of biodiversity in ecosystem 1 between the two years, using the Shannon-Wiener values somehow? Are there any other tests that could work? I am aware of the Hutcheson t-test; however, some of my data are not normally distributed.
I would really appreciate some help!
Relevant answer
Answer
To statistically compare Shannon-Wiener index values between two years:
  1. Calculate Shannon-Wiener Index: Compute the Shannon-Wiener index separately for each year using appropriate ecological data.
  2. Normality Check: Ensure that the index values follow a normal distribution, typically assessed using statistical tests like the Shapiro-Wilk test or visual inspection (e.g., histograms).
  3. Choose a Test: Use a paired t-test if the data for both years are paired (i.e., measurements from the same sites or samples) and normally distributed. Alternatively, use a Wilcoxon signed-rank test if the data are not normally distributed or if the assumptions for the t-test are not met.
  4. Perform the Test: Conduct the chosen statistical test to compare the mean or median Shannon-Wiener index values between the two years.
  5. Interpret the Results: Evaluate the test statistic and p-value to determine if there is a statistically significant difference in the Shannon-Wiener index values between the two years. Adjust for multiple comparisons if necessary.
By following these steps, you can effectively compare Shannon-Wiener index values between two different years in a statistically rigorous manner.
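A minimal R sketch of steps 2-4, assuming hypothetical vectors year1 and year2 holding one Shannon-Wiener value per ecosystem, in the same ecosystem order (note that with only three ecosystems per substrate these tests will have very little power):
diffs <- year2 - year1
shapiro.test(diffs)                       # normality of the paired differences
t.test(year2, year1, paired = TRUE)       # paired t-test if roughly normal
wilcox.test(year2, year1, paired = TRUE)  # Wilcoxon signed-rank test otherwise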
  • asked a question related to Statistics
Question
1 answer
Relevant answer
Answer
The concept of fine tuning suggests that each being gets one life in the material realm followed by an eternal afterlife. This view posits that life on Earth is intricately balanced and designed to support a singular physical existence, after which individuals transition to an eternal spiritual existence or afterlife.
  • asked a question related to Statistics
Question
7 answers
I have six ecosystems in two substrate categories (triplicates, essentially). I have determined Shannon-Wiener index values for each ecosystem and also for the two categories separately. I have done this for two separate sets of data that were sampled in two separate years. Is it possible to statistically compare the development of biodiversity between the categories, i.e., the development of biodiversity in ecosystem 1 between the two years, using the Shannon-Wiener values somehow? Are there any other tests that could work? I am aware of the Hutcheson t-test; however, some of my data are not normally distributed.
I would really appreciate some help!
Relevant answer
Answer
To statistically compare Shannon index values (a measure of diversity) between two years, you can use several methods depending on the data's nature and distribution. Here's a step-by-step guide:
1. Prepare Your Data
Ensure you have the Shannon index values for the two years. Your data might look something like this:
Year Shannon_Index
2022 2.3
2022 2.5
2022 2.1
2023 2.7
2023 2.8
2023 2.6
2. Check Normality
Determine if the Shannon index values follow a normal distribution. This can be done using tests such as the Shapiro-Wilk test.
3. Choose a Statistical Test
Based on the normality of the data, choose an appropriate test:
If data is normally distributed:
Use an independent t-test if the variances between the groups are equal (you can check for this using Levene's test).
Use Welch's t-test if the variances are not equal.
If data is not normally distributed:
Use a non-parametric test such as the Mann-Whitney U test (also known as the Wilcoxon rank-sum test).
4. Perform the Test
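A minimal R sketch of step 4, using the example values from the table above:
dat <- data.frame(Year = factor(rep(c(2022, 2023), each = 3)),
                  Shannon = c(2.3, 2.5, 2.1, 2.7, 2.8, 2.6))
tapply(dat$Shannon, dat$Year, shapiro.test)   # normality check per year
t.test(Shannon ~ Year, data = dat)            # Welch's t-test (unequal variances by default)
wilcox.test(Shannon ~ Year, data = dat)       # Mann-Whitney U test if normality is doubtful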
  • asked a question related to Statistics
Question
5 answers
Is it very literally subbing in shannon wiener index values instead of species abundances?
Relevant answer
Answer
By the laws of statistics, no crime, it is allowed.
  • asked a question related to Statistics
Question
5 answers
My student has a question that I cannot answer either. She is analysing the effect of ICT on labour productivity using an 8-year panel with 4 independent variables in EViews 13. Frankly, I was quite surprised that the R-squared in her results is 0.94 with only 2 significant variables. Theoretically, in a simple regression model such a high R-squared most likely indicates statistical problems. Recently, I asked her to rerun the data in Stata, and the results show an R-squared of only 0.51 with exactly the same coefficients.
I've searched some articles about it; some say that EViews might be wrong, and some say that Stata is wrong. Can someone explain what I should do and which software I should use?
Note:
1. Some articles say to use the areg command in Stata to obtain a value similar to EViews, but I have doubts, because areg is used for regression with an absorbed categorical variable in Stata and does not quite fit a panel regression model.
2. Some say that the EViews calculation is wrong.
Relevant answer
Answer
Because the two software packages are different.
  • asked a question related to Statistics
Question
2 answers
Hello,
I discovered that a compound I use, which gets incorporated into RNA, might have an unspecific impact on the mRNA stability of a gene of interest.
To confirm this, I pretreated my cells for 2 h with the compound to let it be incorporated into RNA, then did a time course with actinomycin D to block transcription and followed the mRNA stability of my target of interest over time.
After qPCR, I have a list of RQ values, all calculated using an untreated timepoint-0 control as a reference. My conditions are the following: DMSO, compound, actinomycin D, and compound + actinomycin D at 1, 2, 3 and 4 hours, plus the timepoint-0 control. The experiment was performed 4 times.
Once plotted, the results give me 4 lines, each representing the impact of one treatment over time. To show that my compound affects mRNA stability, I need to show that the compound + actinomycin D line is statistically significantly lower than the actinomycin D-alone line.
Could you help me to select the best statistical test to use for this question ?
So far, here are the other strategies I tried:
1 - I analyzed the qPCR data using each DMSO condition as the RQ reference for its respective timepoint, giving me a histogram comparing all treatments at each timepoint, and performed a two-way ANOVA on them. If I'm correct, this strategy assesses at which timepoints the treatments differ from each other. However, I would now like to analyse the data globally, not separately timepoint by timepoint.
2 - I followed the data analysis section of this paper :
Which, if I understand it correctly, calculates how well each of my curves will follow a decay model to calculate a decay rate. However, this is not exactly the answer I want and moreover, Prism gave me the following answer :
One phase decay - Least squares fit
Prism has identified at least one unstable parameter. This suggests that your data may be incomplete or don't fully describe the selected model. Because of this, confidence intervals for all parameters should be interpreted with caution.
For at least one parameter, Prism was able to find a best-fit value, but was unable to calculate a complete confidence interval. This best-fit value should be interpreted with caution.
As I think this other strategy doesn't fit my needs and my data don't seem to be adequate for it, I do not plan on doing more on this second strategy.
Thank you very much for your help.
Relevant answer
Answer
Thank you very much for your answer.
I took the time to discuss with different people and to read the Prism tutorials as much as I could. While I agree that a one-way ANOVA can be useful if I take each timepoint separately to confirm the impact of treatment, I don't think this strategy is sufficient if I want to take all the data into consideration in one go. This is why I did a two-way ANOVA with Tukey's test, as you suggest, on the RQ values, using the DMSO conditions as controls and looking at all timepoints together.
Looking at other papers, I found one comparing similar types of data which used a two-way ANOVA with Bonferroni's multiple comparison test, comparing the means of each treatment condition. This seems correct to me so far, as Prism indeed compared all treatments to each other.
Do you or other people think this strategy could be the right one?
Thank you again.
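For what it's worth, a minimal R sketch of that kind of two-way analysis, assuming a hypothetical long-format data frame qpcr with columns RQ, treatment (4 levels), timepoint (factor) and replicate (experiment 1-4). The simple aov() call ignores the pairing across the four experiments; the lmer() line shows one way to account for it.
fit <- aov(RQ ~ treatment * timepoint, data = qpcr)
summary(fit)
TukeyHSD(fit, "treatment")       # which treatments differ, averaged over timepoints
# Mixed model honouring the repeated experiments (random intercept per replicate)
library(lmerTest)
fit_mixed <- lmer(RQ ~ treatment * timepoint + (1 | replicate), data = qpcr)
anova(fit_mixed)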
  • asked a question related to Statistics
Question
2 answers
Relevant answer
Answer
In my opinion this approach is valid, because short texts are more attractive to people: they expect to read them in a few minutes and to understand them easily. Actually, producing short and understandable texts in academia is not that simple (because we want to write everything out clearly), yet readers and researchers need this type of reading.
  • asked a question related to Statistics
Question
4 answers
Assuming this is my hypothetical data set (attached figure), in which the thickness of a structure was evaluated in the defined positions (1-3) in 2 groups (control and treated). I emphasize that the structure normally increases and decreases in thickness from position 1 to 3. I would also like to point out that each position has data from 2 individuals (samples).
I would like to check if there is a statistical difference in the distribution of points (thickness) depending on the position. Suggestions were to use the 2-sample Kolmogorov-Smirnov test.
However, my data are not absolutely continuous, considering that the position of the measurement matters in this case (and the test ignores this factor, simply ordering all values from smallest to largest and computing the statistic).
In this case, is the 2-sample Kolmogorov-Smirnov test misleading? Is there any other type of statistical analysis that could be performed in this case?
Thanks in advance!
Relevant answer
Answer
You might consult www.Stats4Edu.com.
  • asked a question related to Statistics
Question
1 answer
"DNA is SO unpredictable that they are either fractals or something less predictable, thus a gene is never known to manifest into a trait, debunking hereditarianism and vindicating CRT" (Ohnemus 2024).
Relevant answer
Answer
These self-similar patterns are the result of a simple equation or mathematical statement. You create fractals by repeating this equation through a feedback loop in a process called iteration, where the results of one iteration form the input value for the next.
  • asked a question related to Statistics
Question
3 answers
Dear colleagues,
I would like to ask whether it is possible to compare the quality of different models based on the same data but containing different numbers of variables using IRT analysis, namely the log-likelihood, AIC and BIC statistics?
Specifically, I have a model with 36 items; I am gradually eliminating some problematic variables, and I want to compare the overall quality of the resulting models with each other based on the above statistics.
Is this procedure possible?
Thank you for your answer.
Relevant answer
Answer
In trying to use 3 different criteria for model evaluation, the important thing is to know the underlying meanings of the different criteria. Comparison of log-likelihoods is concerned only with information you have in your existing data and the accuracy of estimates and predictions available from that data, while the "information criteria" make some adjustments related to possibly obtaining extra data that would allow improved estimates and predictions to be obtained.
In practice, the different criteria can indicate different model-structures as being "best".
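A minimal R sketch with the mirt package, assuming a response matrix resp (placeholder name); it illustrates how the likelihood-based criteria are obtained by comparing two models fitted to the same items. One caution, in line with the answer above: once items are actually dropped, the response data change, so log-likelihoods, AIC and BIC of the reduced and full models are no longer on a common scale and cannot be compared directly.
library(mirt)
m_rasch <- mirt(resp, model = 1, itemtype = "Rasch", verbose = FALSE)  # constrained model
m_2pl   <- mirt(resp, model = 1, itemtype = "2PL", verbose = FALSE)    # less constrained model
anova(m_rasch, m_2pl)   # reports logLik, AIC, BIC and a likelihood-ratio test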
  • asked a question related to Statistics
Question
4 answers
Dear colleagues,
I would like to ask for your advice on testing the criterion-related validity of the measuring instrument. It is common practice to test this type of validity by correlation with other relevant variables. However, I received a comment from a reviewer that if I calculate only Pearson correlation, the measurement error is not taken into account and the correlation is underestimated.
He said I should use reliability-corrected correlations, or report the correlations by fitting an SEM model in which the three factors correlate with the external variables (my measurement instrument has a simple structure with three correlated factors).
Could I ask your advice on how to calculate this? Personally, I do not know how I should proceed. Alternatively, what is your opinion?
Thank you very much.
Relevant answer
Answer
Marius Ole Johansen Thank you very much. It looks very good and it seems to work. Could I ask you for another script if I need a correlation for all factors together? Thank you very much.
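For other readers of this thread, a generic lavaan sketch of the reviewer's suggestion (item, factor and criterion names are placeholders): the three factors and the external criterion are included in one model, and the standardized factor-criterion covariances serve as the measurement-error-corrected correlations.
library(lavaan)
model <- '
  f1 =~ x1 + x2 + x3
  f2 =~ x4 + x5 + x6
  f3 =~ x7 + x8 + x9
  # covariances between each factor and the observed criterion variable
  f1 ~~ criterion
  f2 ~~ criterion
  f3 ~~ criterion
'
fit <- cfa(model, data = dat)
standardizedSolution(fit)   # the standardized f ~~ criterion rows are the corrected correlations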
  • asked a question related to Statistics
Question
4 answers
Relevant answer
Answer
White privilege is based on societal advantages for white people, so it is unlikely to disappear entirely, because societal ideas about race can persist even if everyone looked the same. The goal is to create a society where race doesn't affect opportunity. We can work towards this by promoting diversity and equal treatment.
  • asked a question related to Statistics
Question
5 answers
Relevant answer
Answer
The vastness or sustainability of a belief system doesn't necessarily indicate its plausibility. Plausibility depends on evidence, logical coherence, and consistency with observable phenomena, rather than the popularity or longevity of a belief. While a belief system's widespread acceptance or enduring nature may influence perceptions, plausibility is determined by its ability to withstand critical scrutiny and align with empirical reality.
  • asked a question related to Statistics
Question
7 answers
I have a mixed-effects model with two random-effect variables. I want to rank the relative importance of the variables. The relimpo package doesn't work for mixed-effects models. I am interested in the fixed-effect variables anyway, so would it be okay if I only take the fixed variables and use relimp? Or should I use Akaike weights across a set of models that alternately omit the variables?
Which one is more acceptable?
Relevant answer
Answer
install.packages("glmm.hp")
library(glmm.hp)
library(MuMIn)
library(lme4)
mod1 <- lmer(Sepal.Length ~ Petal.Length + Petal.Width+(1|Species),data = iris)
r.squaredGLMM(mod1)
glmm.hp(mod1)
a <- glmm.hp(mod1)
plot(a)
  • asked a question related to Statistics
Question
4 answers
Hello all,
I am running into a problem I have not encountered before with my mediation analyses. I am running a simple mediation X > M > Y in R.
Generally, I understand that the total effect does not have to be significant for there to be a mediation effect, and in the case I am describing this would be a logical occurrence, since the effects of paths a and b are both significant and are -.142 and .140 respectively, thus resulting in a 'null effect' for the total effect.
However, my c path X > Y is not simply 'non-significant' as I would expect; rather, the regression does not fit (see below):
(Residual standard error: 0.281 on 196 degrees of freedom Multiple R-squared: 0.005521, Adjusted R-squared: 0.0004468 F-statistic: 1.088 on 1 and 196 DF, p-value: 0.2982).
Usually I would say you cannot interpret models that do not fit, and since this path is part of my model, I hesitate to interpret the mediation at all. However, the other paths do fit and are significant. Could the lack of fit also be a result of the paths cancelling one another out?
Note: I am running bootstrapped results for the indirect effects, but the code does utilize the 'total effect' path, which does not fit on its own, therefore I am concerned.
Note 2: I am working with a clinical sample, therefore the sample size is not as large as I'd like: group 1: 119; group 2: 79 (N = 198).
Please let me know if additional information is needed and thank you in advance!
Relevant answer
Answer
Somehow it is not clear to me what you mean by "does not fit". Could you please provide the output of the whole analysis? I think this would be helpful.
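While preparing that output, it might also help to cross-check the model in lavaan; a minimal sketch of a simple mediation with a bootstrapped indirect effect, using placeholder variable names x, m and y:
library(lavaan)
model <- '
  m ~ a * x
  y ~ b * m + cp * x
  ab    := a * b        # indirect effect
  total := cp + a * b   # total effect
'
fit <- sem(model, data = dat, se = "bootstrap", bootstrap = 2000)
parameterEstimates(fit, boot.ci.type = "perc")   # percentile bootstrap confidence intervals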
  • asked a question related to Statistics
Question
4 answers
Dear colleagues
Could you please tell me how to construct a boxplot from a data frame in RStudio?
df9 <- data.frame(Kmeans= c(1,0.45,0.52,0.54,0.34,0.39,0.57,0.72,0.48,0.29,0.78,0.48,0.59),hdbscan= c(0.64,1,0.32,0.28,0.33,0.56,0.71,0.56,0.33,0.19,0.53,0.45,0.39),sectralpam=c(0.64,0.31,1,0.48,0.24,0.32,0.52,0.66,0.32,0.44,0.28,0.25,0.47),fanny=c(0.64,0.31,0.38,1,0.44,0.33,0.48,0.73,0.55,0.51,0.32,0.39,0.57),FKM=c(0.64,0.31,0.38,0.75,1,0.26,0.55,0.44,0.71,0.38,0.39,0.52,0.53), FKMnoise=c(0.64,0.31,0.38,0.75,0.28,1,0.42,0.45,0.62,0.31,0.25,0.66,0.67), Mclust=c(0.64,0.31,0.38,0.75,0.28,0.46,1,0.36,0.31,0.42,0.47,0.66,0.53), PAM=c(0.64,0.31,0.37,0.75,0.28,0.46,0.58,1,0.73,0.43,0.39,0.26,0.41) ,
AGNES=c(0.64,0.31,0.37,0.75,0.28,0.46,0.58,0.55,1,0.31,0.48,0.79,0.31), Diana=c(0.64,0.31,0.37,0.75,0.28,0.46,0.58,0.55,0.42,1,0.67,0.51,0.43),
zones2=c(0.64,0.31,0.37,0.75,0.28,0.46,0.58,0.55,0.42,0.45,1,0.69,0.35),
zones3=c(0.64,0.31,0.37,0.75,0.28,0.46,0.58,0.55,0.42,0.45,0.59,1,0.41),
gsa=c(0.64,0.31,0.37,0.75,0.28,0.46,0.58,0.55,0.42,0.45,0.59,0.36,1), method=c("kmeans", "hdbscan", "spectralpam", "fanny", "FKM","FKMnoise", "Mclust", "PAM", "AGNES", "DIANA","zones2","zones3","gsa"))
head(df9)
library(dplyr)
df9 <- df9 %>% mutate(across(-method, ~ as.numeric(as.character(.))))  # exclude "method" so it is not coerced to NA
Thank you very much
Relevant answer
Answer
Dear Valeriia Bondarenko
First you need to install and load the "ggplot2" and "reshape2" packages.
library(ggplot2)
library(reshape2)
# Reshape the data frame to long format, keeping "method" as the identifier
df9_melted <- melt(df9, id.vars = "method")
# Draw the boxplot
ggplot(df9_melted, aes(x = method, y = value)) +
  geom_boxplot() +
  labs(x = "Method", y = "Value", title = "Boxplot of methods")
  • asked a question related to Statistics
Question
10 answers
In the domain of clinical research, where the stakes are as high as the complexities of the data, a new statistical aid emerges: bayer: https://github.com/cccnrc/bayer
This R package is not just an advancement in analytics - it’s a revolution in how researchers can approach data, infer significance, and derive conclusions
What Makes `Bayer` Stand Out?
At its heart, bayer is about making Bayesian analysis robust yet accessible. Born from the powerful synergy with the wonderful brms::brm() function, it simplifies the complex, making the potent Bayesian methods a tool for every researcher’s arsenal.
Streamlined Workflow
bayer offers a seamless experience, from model specification to result interpretation, ensuring that researchers can focus on the science, not the syntax.
Rich Visual Insights
Understanding the impact of variables is no longer a trudge through tables. bayer brings you rich visualizations, like the one above, providing a clear and intuitive understanding of posterior distributions and trace plots.
Big Insights
Clinical trials, especially in rare diseases, often grapple with small sample sizes. `Bayer` rises to the challenge, effectively leveraging prior knowledge to bring out the significance that other methods miss.
Prior Knowledge as a Pillar
Every study builds on the shoulders of giants. `Bayer` respects this, allowing the integration of existing expertise and findings to refine models and enhance the precision of predictions.
From Zero to Bayesian Hero
The bayer package ensures that installation and application are as straightforward as possible. With just a few lines of R code, you’re on your way from data to decision:
# Installation
devtools::install_github("cccnrc/bayer")
# Example usage: Bayesian logistic regression
library(bayer)
model_logistic <- bayer_logistic(
  data = mtcars,
  outcome = 'am',
  covariates = c('mpg', 'cyl', 'vs', 'carb')
)
You then have plenty of functions to further analyse your model - take a look at bayer.
Analytics with An Edge
bayer isn’t just a tool; it’s your research partner. It opens the door to advanced analyses like IPTW, ensuring that the effects you measure are the effects that matter. With bayer, your insights are no longer just a hypothesis — they’re a narrative grounded in data and powered by Bayesian precision.
Join the Brigade
bayer is open-source and community-driven. Whether you’re contributing code, documentation, or discussions, your insights are invaluable. Together, we can push the boundaries of what’s possible in clinical research.
Try bayer Now
Embark on your journey to clearer, more accurate Bayesian analysis. Install `bayer`, explore its capabilities, and join a growing community dedicated to the advancement of clinical research.
bayer is more than a package — it’s a promise that every researcher can harness the full potential of their data.
Explore bayer today and transform your data into decisions that drive the future of clinical research: bayer - https://github.com/cccnrc/bayer
Relevant answer
Answer
Many thanks for your efforts!!! I will try it out as soon as possible and will provide feedback on github!
All the best,
Rainer
  • asked a question related to Statistics
Question
5 answers
What may be a good, strong and convincing example demonstrating the power of copulas by uncovering some not obvious statistical dependencies?
I am especially interested in the example contrasting copula vs a simple calculation of a correlation coefficient for the original distributions.
Something like this - the (properly normalized) correlation coefficient of components of a bivariate distribution does not suggest a strong statistical dependence between them, but the copula distribution of these two components shows a clear dependence between them (possibly manifested in the value of a correlation coefficient calculated for the copula distribution?). Or the opposite - the correlation coefficient of the original bivariate distribution suggests strong dependence, but its copula shows that the statistical dependence is "weak", or just absent.
Mostly interested in an example described in terms of formulae (so that the samples could be generated, e.g. in MATLAB), but if somebody can point to the specific pre-generated bivariate distribution dataset (or its plots), that will work too.
Thank you!
Relevant answer
Answer
I used the two sets of bivariate normal distributions generated by you, [x y1] and [x y2], representing strong and weak dependencies, calculating empirical copula distributions, [U V1] and [U V2] and also Pearson correlation coefficients for all bivariate distribution samples, then plotting them. The MATLAB / Octave script and the plots are attached.
I still do not understand what advantage the empirical copula distributions provide in this example. The weak and strong dependence are evident from the original plots of the samples of [x y1] and [x y2] and from the values of the Pearson correlation coefficients for them. The Pearson correlation coefficients of the copulas are rather close to the corresponding coefficients of the original distributions. Yes, the manifestation of weak dependence for the copula [U V2] looks different (samples of the copula distribution look like those from a uniform bivariate distribution) compared to the manifestation of the same in the original samples [x y2]. But why is this difference important?
Am I mis-interpreting your example, or am I missing something in the interpretation of the results?
Thank you,
  • asked a question related to Statistics
Question
5 answers
I want to estimate the half-life value for the virus as a function of strain and concentration, and as a continuous function of temperature.
Could anybody tell me, how to calculate the half-life value in R programming?
I have attached a CSV file of the data
Relevant answer
Answer
Estimating the half-life of a virus involves understanding its stability and decay rate under specific environmental or biological conditions. This is a crucial parameter in virology, impacting everything from the design of disinfection protocols to the assessment of viral persistence in the environment or within a host. Here's a structured approach to estimating the half-life values for a virus:
  1. Defining Conditions
     - Environment: specify the environmental conditions such as temperature, humidity, UV exposure, and presence of disinfectants, as these factors significantly affect viral stability.
     - Biological: in biological systems, consider the impact of host factors such as immune response, tissue type, and presence of antiviral agents.
  2. Experimental Setup
     - Sampling: begin by preparing a known concentration of the virus under controlled conditions.
     - Time Points: collect samples at predetermined time points that are appropriate based on preliminary data or literature values suggesting the expected rate of decay.
  3. Quantitative Assays
     - Plaque Assay: one of the most accurate methods for quantifying infectious virus particles; it measures the number of plaque-forming units (PFU), which reflect viable virus particles.
     - PCR-Based Assays: these can measure viral RNA or DNA but do not distinguish between infectious and non-infectious particles; adjustments or complementary assays might be required to correlate these results with infectivity.
     - TCID50 (Tissue Culture Infective Dose): this assay determines the dilution of virus required to infect 50% of cultured cells, providing another measure of infectious virus titer.
  4. Data Analysis
     - Plot Decay Curves: use logarithmic plots of the viral titer (e.g., PFU/mL or TCID50/mL) against time. The decay of viral concentration should ideally follow first-order kinetics in the absence of complicating factors.
     - Calculate Half-Life: the half-life can be calculated from the slope (k) of the linear portion of the decay curve on a logarithmic scale: t1/2 = ln(2) / k.
     - Statistical Analysis: ensure statistical methods are used to analyze the data, providing estimates of variance and confidence intervals for the half-life.
  5. Validation and Replication
     - Replicate Studies: conduct multiple independent experiments to validate the half-life estimation; variability in viral preparations and experimental conditions can affect the reproducibility of results.
     - Peer Review: consider external validation or peer review of the methodology and findings to ensure robustness and accuracy.
  6. Interpretation and Application
     - Contextual Interpretation: understand that the estimated half-life is context-specific; results obtained under laboratory conditions may differ significantly from those in natural or clinical settings.
     - Application in Risk Assessment: use the half-life data to inform risk assessments, disinfection strategies, or predictive modeling of viral spread and persistence.
By meticulously following these steps and ensuring the precision of each phase of the process, one can accurately estimate the half-life of a virus under specific conditions. This information is essential for developing effective control strategies and understanding the dynamics of viral infections.
Perhaps this protocol list can give us more information to help solve the problem.
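To the original question about doing this in R: a minimal sketch, assuming a hypothetical data frame virus with columns titer, time_h, strain and temperature, and assuming first-order (log-linear) decay as described in step 4 above.
# Fit log-linear decay per strain/temperature combination and convert the slope to a half-life
fit_halflife <- function(d) {
  k <- coef(lm(log(titer) ~ time_h, data = d))[["time_h"]]  # decay rate (per hour)
  log(2) / abs(k)                                           # half-life in hours
}
halflives <- sapply(split(virus, interaction(virus$strain, virus$temperature, drop = TRUE)),
                    fit_halflife)
halflives
# To treat temperature as a continuous predictor instead, include its interaction with time, e.g.
# lm(log(titer) ~ time_h * temperature * strain, data = virus), and derive k from the fitted
# time slope at any temperature of interest.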
  • asked a question related to Statistics
Question
3 answers
I am studying the impact of leadership style on job satisfaction. In the data collection instrument, there are 13 questions on leadership style, divided among a couple of leadership styles. On the other hand, there are only four questions on job satisfaction. How do I run correlational tests on these variables? What values do I select to analyze in Excel?
Relevant answer
Answer
First, you need to compute the correlation between your target variable and each of your potential independent variables, and check which independent variables are most strongly correlated with the target (as mentioned earlier, a correlation coefficient closest to -1 or +1). Once you have decided from these correlation coefficients which variables to include in your model, you need to make sure there is no multicollinearity in your model. To do that, run correlation tests between each pair of independent variables; if two independent variables are too strongly correlated, you should include only one of them in your model (e.g., the one with the higher correlation with your dependent variable).
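As a concrete illustration of the procedure above, assuming you have already averaged the items for each leadership style and for job satisfaction into composite scores: in Excel the CORREL() worksheet function, or the Correlation tool in the Analysis ToolPak, gives the pairwise coefficients; in R (placeholder column names) the whole matrix comes from one call.
# scores: data frame with one composite column per leadership style and one for job satisfaction (placeholder names)
cor(scores, use = "pairwise.complete.obs")                      # full correlation matrix
cor.test(scores$style_A, scores$job_sat, method = "spearman")   # one pair, with a p-value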
  • asked a question related to Statistics
Question
67 answers
I explain here the connection between the pre-scientific Law of Universal Causality and all sorts of statistical explanations in physical sciences. The way it takes may look strange, but it will be interesting enough to consider.
To repeat in short what is already said a few times: by all possible assumptions, to exist (which is To Be with respect to Reality-in-total) is non-vacuous. Hence, any existent must have Extension, have finite-content parts. These parts, by the only other possible assumption, must yield impacts on other parts both external and internal. This is Change.
These impacts are always finite in the content and measured extents. The measured extents of Extension and Change are space and time. Without measurements we cannot speak of space and time as existing or as pertaining to existents. What pertain to all existents as most essential are Extension and Change. Existence in Extension and Change means that finitely extended objects give origin to finite impacts. This is Causality. Every existent is Extension-Change-wise existent, and hence everything is causal.
As pertinents to existents, Extension and Change are the most applicable qualities / universals of the group of all entities, i.e., Reality-in-total, because they belong to all that exist. Since Extension and Change are not primarily in our minds, let us call them as ontological universals. As is clear now, Extension and Change are the widest possible and most general ontological universals. All universals are pure qualities. All qualities other than ontological universals are mixtures of pure qualities.
There are physical-ontological universals / qualities that are not as universal as Extension and Change. ‘Colouredness’ / ‘being coloured’, ‘redness’, ‘unity’ / ‘being a unit’, ‘being malleable’, ‘being rigid’, etc. are also pure qualities. These are pertinents not merely of one existent process. They belong to many. These many are a group of existent processes of one kind, based on the one classification quality. Such groups of Extension-Change-wise existent entities are termed natural kinds.
Ontological universals can be reflected in minds too, but in very meagre ways, not always, and not always to the same extent of correspondence with ontological universals, because they are primarily in existent processes. A direct reflection is impossible. The many individuals who get them reflected meagrely formulate them differently.
The supposed common core of ontological universals in minds is a pure notion, but they are mere notions idealized by minds. These ideals are also not inherited of the pertinent ontological universals of all relevant existent things, but at least by way of absorption from some existents, in whatever manner of correspondence with ontological universals. I call them connotative universals, because they are the pure aspects of the conceptual activity of noting objectual processes together.
In brains connotative universals can show themselves only as a mixture of the relevant connotative universals and the relevant brain elements. Please note that this is not a brain-scientific statement. It is the best imaginable philosophical common-sense on the brain-scientific aspect of the formation of connotative universals, and hence it is acceptable to all brain scientists. In brains there are processes that define such activities. But it needs only to be accepted that these processes too are basically of Extension-Change-wise existence, and hence are causal in all senses.
Connotatives are just representations of all kinds of ontological universals. Connotatives are concatenated in various ways in connection with brain elements – in every case highly conceptually and symbolically. These concatenations of connotatives among themselves are imaginations, emotions, reflections, theories, etc., as considered exclusively in the mind.
Note here also that the lack of exact correspondence between ontological and connotative universals is what makes ALL our statements essentially statistical and non-exact at the formation of premises and at the jump from premises into conclusions. The statistical aspect here is part of the process of formation, by brains, of connotatives from ontological universals. This is the case in every part of imaginations, emotions, reflections, theories, etc., even when statistical measurements are not actually being made part of the inquiry as a matter of mentally guided linguistic and mathematical procedures.
Further, connotative universals are formulated in words expressed as terms, connected with connectives of processes, and concatenated in statements. These are the results of the symbolic functioning of various languages including mathematics. These are called denotative universals and their concatenations. All symbolic activities function at this level.
Now coming to statistics as an applied expression of mathematics. It is nothing but denotative universals concatenated in a quantitatively qualitative manner. Even here there is a lot of lack of exactness, which are known as uncertainty, randomness, etc. Pay attention to the fact that language, mathematics, and its statistical part work at the level of denotative universals and their concatenations. These are naturally derived from the conglomerations of ontological universals via concatenations of connotatives and then translated with further uncertainties unto denotative concatenations.
Causation works at the level of the conglomerations of ontological universals, which are in existent things themselves. That is, statistical connections appear not at the ontological level, but at the denotative level. When I say that this laptop is in front of me, there is a directness of acceptance of images from the ontological universals and their conglomerations into the connotative realm of connotations and from there into the denotative realm of connotations. But in roundabout conclusions regarding causal processes at the physical-ontological level into the statistical level, the amount or extent of directness of judgement is very much lacking.
Relevant answer
  • asked a question related to Statistics
Question
4 answers
What is the specific importance of a bachelor’s degree in the hiring process?
Relevant answer
Answer
A bachelor's degree signals foundational knowledge and transferable skills to employers, making it a plus in many fields. It can also be a screening tool for employers. However, its importance varies. Some professions require a specific degree, while experience or alternative credentials like certifications might be valued more in others. Overall, a degree can be an asset but isn't always the only thing that matters for getting hired.
  • asked a question related to Statistics
Question
5 answers
Why parsimoniously does fertility negatively correlate with socioeconomic status? How?
Relevant answer
Overall, the negative correlation between fertility and socioeconomic status can be attributed to a combination of economic, educational, cultural, and structural factors that shape individuals' reproductive choices and opportunities. Understanding these mechanisms is essential for policymakers, healthcare providers, and social scientists seeking to address disparities in reproductive health outcomes and promote equitable access to family planning resources.
  • asked a question related to Statistics
Question
4 answers
Hi,
I am hoping to get some help on what type of statistical test to run to validate my data. I have run 2 ELISAs with the same samples for each test. I did perform a Mann-Whitney U-test to compare the groups, and my results were good.
However, my PI wants me to also run a statistical test to determine that there wasn't any significant difference in the measurement of each sample between the runs. He wants to know that my results are concordant/reproducible.
I am trying to compare each sample individually, and since I don't have 3 data points, I can't run an ANOVA. What types of statistical tests will give me that information? Also, is there a test that will run all the samples simultaneously but only compare within the same sample?
For example, if my data looked like this.
A: 5, 5.7
B: 6, 8
C: 10, 20
I need a test to determine if there is any significant difference between the values for samples A, B, and C separately and not compare the group variance between A-C.
Relevant answer
Answer
If you want to see how comparable the results from the two ELISAs are, simply plot the results of the first ELISA against those of the second ELISA.
Another option is to make a mean-difference plot (aka "Bland-Altman plot"): plot the differences between the ELISA results against the mean of the ELISA results.
Doing a statistical test and interpreting a non-significant result as "there is no difference" or "the groups/runs/ELISAs are comparable" is logically flawed and complete nonsense. Don't do this, ever!
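A minimal R sketch of both plots, assuming two hypothetical numeric vectors run1 and run2 holding the two ELISA measurements of the same samples in the same order:
plot(run1, run2, xlab = "ELISA run 1", ylab = "ELISA run 2")
abline(0, 1)                                                # line of identity
# Bland-Altman (mean-difference) plot
avg <- (run1 + run2) / 2
dif <- run1 - run2
plot(avg, dif, xlab = "Mean of runs", ylab = "Difference between runs")
abline(h = mean(dif))                                       # average bias
abline(h = mean(dif) + c(-1.96, 1.96) * sd(dif), lty = 2)   # limits of agreement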
  • asked a question related to Statistics
Question
2 answers
Hello everyone,
I am currently undertaking a research project that aims to assess the effectiveness of an intervention program. However, I am encountering difficulties in locating suitable resources for my study.
Specifically, I am in search of papers and tutorials on multivariate multigroup latent change modelling. My research involves evaluating the impact of the intervention program in the absence of a control group, while also investigating the influence of pre-test scores on subsequent changes. Additionally, I am keen to explore how the scores differ across various demographic groups, such as age, gender, and knowledge level (all measured as categorical variables).
Although I have come across several resources on univariate/bivariate latent change modelling with more than three time points, I have been unable to find papers that specifically address my requirements—namely, studies focusing on two time points, multiple latent variables (n >= 3), and multiple indicators for each latent variable (n >= 2).
I would greatly appreciate your assistance and guidance in recommending any relevant papers, tutorials, or alternative resources that pertain to my research objectives.
Best,
V. P.
Relevant answer
Answer
IYH Dear Vivian Parker
Ch. 19 Muthén, B. Latent variable analysis: Growth mixture modeling and related techniques for longitudinal data. In D. Kaplan (ed.), Handbook of quantitative methodology for the social sciences. Newbury Park, CA: Sage.
Although this reference does not exclusively concentrate on two-time-point cases, it does contain discussions revolving around multiple latent variables and multiple indicators for those latent constructs. https://users.ugent.be/~wbeyers/workshop/lit/Muthen%202004%20LGMM.pdf
It contains rich content concerning latent growth curve models and elaborates on multivariate implementations.
While conceptually broader, it presents crucial components necessary for building and applying two-time-point, multivariate latent change models.
  • asked a question related to Statistics
Question
8 answers
I want to examine the relationship between school grades and self-esteem and was planning to do a linear regression analysis.
Here's where my problem is: I have three more variables: socioeconomic status, age and sex. I wanted to treat those as moderator variables, but I'm not sure if that's the best solution. Maybe a multiple regression analysis would be enough? Or should I control for those variables?
Also, if I go for a moderation analysis, how would I go about analysing it in SPSS? I can find a lot of videos about moderation analysis, but I can't seem to find cases with more than one moderator.
I've researched a lot already but can't seem to find an answer. Also, my statistics skills aren't the best, so maybe that's why.
I'd be really thankful for your input!
Relevant answer
Answer
Hi Daniel Wright. Sure, I'm fine with just calling it an interaction. I'm just saying that if one wanted to use some other term, I prefer effect modification over moderation because it is neutral with respect to the nature of the interaction.
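For reference, here is what that looks like as a single regression model in R, with grades as the focal predictor and the three covariates entered as potential effect modifiers (the data frame dat and the variable names are hypothetical; SPSS will fit the same model if you add the centred product terms as predictors):

# Center the continuous variables so the lower-order terms stay interpretable
dat$grades_c <- as.numeric(scale(dat$grades, scale = FALSE))
dat$ses_c    <- as.numeric(scale(dat$ses,    scale = FALSE))
dat$age_c    <- as.numeric(scale(dat$age,    scale = FALSE))

# Self-esteem regressed on grades, covariates, and the grades-by-covariate interactions
fit <- lm(self_esteem ~ grades_c * (ses_c + age_c + sex), data = dat)
summary(fit)

# Compare with the model that only controls for the covariates (no interactions)
fit0 <- lm(self_esteem ~ grades_c + ses_c + age_c + sex, data = dat)
anova(fit0, fit)

If the model comparison shows no improvement, the simpler "control variables only" model is usually the more defensible choice.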
  • asked a question related to Statistics
Question
13 answers
I recently had a strange question from a non-statistician about confidence intervals. His understanding is that all the sample values that were used to calculate the confidence interval should lie within that interval. I have tried my best to answer him, but couldn't convince him. Is there a good way to explain why this need not be the case, and that the purpose of the interval is not what he thinks it is? How would you handle this question?
Thanks in advance.
Relevant answer
Answer
If we use the example of the CI for the population mean: you may argue that the estimate should become "better" (more precise) when more data (information) is available. So the expected width of the CI should decrease with sample size. By choosing an arbitrarily large sample size you can get an arbitrarily narrow expected CI. But the sample size has no effect on the variance of the data itself, so the individual sample values cannot all be expected to fall inside the interval: the CI is a statement about the mean, not about where the data lie.
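A small simulation in R makes this concrete (purely illustrative numbers): the CI for the mean shrinks as n grows, but the spread of the individual values does not, so most observations end up outside the interval.

set.seed(1)
for (n in c(10, 100, 1000)) {
  x  <- rnorm(n, mean = 50, sd = 10)       # data variance stays the same
  ci <- t.test(x)$conf.int                 # 95% CI for the population mean
  inside <- mean(x >= ci[1] & x <= ci[2])  # share of data points inside the CI
  cat(sprintf("n = %4d  CI = [%.1f, %.1f]  width = %.2f  data inside CI = %.0f%%\n",
              n, ci[1], ci[2], diff(ci), 100 * inside))
}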
  • asked a question related to Statistics
Question
1 answer
Relevant answer
Answer
Metaphysics is the branch of philosophy that deals with the fundamental nature of reality. There are a few metaphysical ideas that could potentially end stratification, the division of people into different social classes. One such idea is the concept of social justice.
This is the idea that all people are equal and deserve to be treated fairly. Another idea is the concept of social mobility: the idea that people should have the opportunity to move up or down the social ladder based on their own efforts.
Another challenge is that some people may be unwilling to share power or resources with others.
  • asked a question related to Statistics
Question
3 answers
Respectfully, across reincarnation belief and scientific materialism, why is considering the individual self as an illusion a commonality? 1)
Relevant answer
Answer
I can only address this question with mathematical structures. The individual self is a multi-dimensional manifold embedded within a much, much larger manifold of infinite dimensions. One may think of it as a vector space of tremendous size. As vast as it is, a human existence is but a small subspace of the infinite-dimensional manifold. When released from physical existence, the aspects of the individual self convolve with the larger space. In some sense, you may refer to that as the commonality.
  • asked a question related to Statistics
Question
7 answers
Dear all,
I am sharing the model below that illustrates the connection between attitudes, intentions, and behavior, moderated by prior knowledge and personal impact perceptions. I am seeking your input on the preferred testing approach, as I've come across information suggesting one may be more favorable than the other in specific scenarios.
Version 1 - Step-by-Step Testing
Step 1: Test the relationship between attitudes and intentions, moderated by prior knowledge and personal impact perceptions.
Step 2: Test the relationship between intentions and behavior, moderated by prior knowledge and personal impact perceptions.
Step 3: Examine the regression between intentions and behavior.
Version 2 - Structural Equation Modeling (SEM)
Conduct SEM with all variables considered together.
I appreciate your insights on which version might be more suitable and under what circumstances. Your help is invaluable!
Regards,
Ilia
Relevant answer
Answer
Ilia, some thoughts on your model. According to your path diagram you have 4 moderator effects. For such a large model, you need a large sample size to detect all moderator effects simultaneously. Do you have a justification for all of these nonlinear relationships?
Some relationships in the path diagram are missing. First, prior knowledge, personal impact, and attitude should be correlated - these are the predictor variables. Second, prior knowledge and personal impact should have direct effects on the dependent variables behavioral intentions and behavior (this is necessary).
As this model is quite complex, I would suggest starting with the linear model. If this model fits the data well, I would then include the interaction effects one by one. Keep in mind that you need to use a robust estimation method for parameter estimation because of the interaction effects: if these effects exist in the population, then behavioral intentions and behavior should be non-normally distributed.
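A minimal sketch of this two-step approach in R/lavaan (the data frame dat and the variable names are hypothetical; the interaction is formed here from observed, mean-centred scores, which is the simplest option):

library(lavaan)

# Step 1: linear path model without interactions, robust (MLR) estimation
linear_model <- '
  intention ~ attitude + knowledge + impact
  behavior  ~ intention + knowledge + impact
'
fit1 <- sem(linear_model, data = dat, estimator = "MLR")
summary(fit1, fit.measures = TRUE)

# Step 2: add one moderator effect at a time, e.g. knowledge x attitude
dat$att_c      <- as.numeric(scale(dat$attitude,  scale = FALSE))
dat$know_c     <- as.numeric(scale(dat$knowledge, scale = FALSE))
dat$att_x_know <- dat$att_c * dat$know_c

moderated_model <- '
  intention ~ att_c + know_c + impact + att_x_know
  behavior  ~ intention + know_c + impact
'
fit2 <- sem(moderated_model, data = dat, estimator = "MLR")
summary(fit2, fit.measures = TRUE)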
Kind regards, Karin
  • asked a question related to Statistics
Question
2 answers
ResearchGate does a pretty good job of tracking publication analytics such as reads and citations over time. The recommendations feature can also be an interesting indicator for a publication's resonance with the scholarly community.
This progress allows ideas to be developed about how to make the analytics features even better in the future. Here are some ideas I have been thinking about:
  • something equivalent to Altmetric that tracks social media mentions across multiple platforms and mentions in news articles, conference proceedings, etc.
  • more longitudinal data for individual publications by month and year
  • the ability to compare the performance of one's own publications, with perhaps a way to rank them in analytic reports by reads, citations, etc.
  • More specific analytics to allow for comparisons within and between departments on an individual and collective basis, which can be sorted by discipline, field, etc.
Are there any additional analytics features that you would like to see on ResearchGate?
Relevant answer
Answer
That's a fair point. I am interested in metrics like reads because my research focuses on student government, and I have noticed that a lot of my readers appear to be students. These students may not yet be publishing academic works, or they may be in a different discipline, but even though they are not formally citing the publications in journals, they may be using them to support student government activities in practice. Keeping an eye on reads and the countries where the readers are may give indications about new student-led initiatives that are emerging. I'm curious about what the potential impacts could be and how they may correspond with changes in metrics like reads over time.
  • asked a question related to Statistics
Question
8 answers
Meta-analyses and systematic reviews seem to be the shortcut to academic success, as they usually have a better chance of getting published in accredited journals, tend to be read more, and bring home a lot of citations. Interestingly enough, apart from being time-consuming, they are very easy; they are actually nothing but carefully followed protocols of online data collection and statistical analysis, if any.
The point is that most of this can be easily done (at least in theory) by a simple computer algorithm. A combination of if/then statements would allow the software to decide on the statistical parameters to be used, not to mention more advanced approaches available to expert systems.
The only part needing a much more advanced algorithm, like a very good artificial intelligence, is the part that is supposed to search the articles, read them, accurately understand them, include/exclude them accordingly, and extract data from them. It seems that today's level of AI is becoming more and more sufficient for this purpose. AI can now easily read papers and understand them quite accurately. So AI programs that can either do the whole meta-analysis themselves, or do the heavy lifting and let a human check and polish/correct the final results, are on the rise. All that would be needed is the topic of the meta-analysis; the rest is done automatically or semi-automatically.
We can even have search engines that actively monitor academic literature, and simply generate the end results (i.e., forest plots, effect sizes, risk of bias assessments, result interpretations, etc.), as if it is some very easily done “search result”. Humans then can get back to doing more difficult research instead of putting time on searching and doing statistical analyses and writing the final meta-analysis paper. At least, such search engines can give a pretty good initial draft for humans to check and polish them.
When we ask a search engine a medical question, it will not only give us a summary of relevant results (the way the currently available LLM chatbots do) but will also calculate and produce an initial meta-analysis for us based on the available scientific literature. It will also warn the reader that the results are generated by AI and should not be trusted too deeply, but can be used as a rough guess. This is of course needed until the accuracy of generative AI surpasses that of humans.
It just needs some enthusiasts with enough free time and resources on their hands to train some available open-source, open-parameter LLMs to do this specific task. Maybe even the big players are currently working on this concept behind the scenes to optimize their proprietary LLMs for meta-analysis generation.
Any thoughts would be most welcome.
Vahid Rakhshan
Relevant answer
Answer
There was a recent well-publicised event where an actual legal court case included legal documents prepared by AI that included supposed legal citations to cases that did not ever exist.
So, you have two problems:
(1) Constructing code that does actually work;
(2) Persuading others that you have code that actually works.
  • asked a question related to Statistics
Question
3 answers
hi, i'm currently writing my psychology dissertation where i am investigating "how child-oriented perfectionism relates to behavioural intentions and attitudes towards children in a chaotic versus calm virtual reality environment".
therefore i have 3 predictor variables/independent variables: calm environment, chaotic environment and child-oriented perfectionism
my outcome/dependent variables are: behavioural intentions and attitudes towards children.
my hypotheses are:
  1. participants will have more negative behavioural intentions and attitudes towards children in the chaotic environment than in the calm environment.
  2. these differences (highlighted above) will be magnified in participants high in child-oriented perfectionism compared to participants low in child oriented perfectionism.
i used a questionnaire measuring child-oriented perfectionism which will calculate a score. then participants watched the calm environment video and then answered the behavioural intentions and attitudes towards children questionnaires in relation to the children shown in the calm environment video. participants then watched the chaotic environment video and then answered the behavioural intentions and attitudes towards children questionnaire in relation to the children in the chaotic environment video.
i am unsure whether to use a multiple linear regression or repeated measures anova with a continuous moderator (child-oriented perfectionism) to answer my research question and hypotheses. please please can someone help!
Relevant answer
Answer
1. participants will have more negative behavioural intentions and attitudes towards children in the chaotic environment than in the calm environment.
--- Because there are only two conditions (levels of your factor), you can use a paired t-test (or a Wilcoxon signed-rank test if nonparametric) to compare the behavioral intentions/attitudes between the calm and chaotic environments, since the same participants were exposed to both.
2. these differences (highlighted above) will be magnified in participants high in child-oriented perfectionism compared to participants low in child oriented perfectionism.
--- This is indeed a simple linear regression (not a multiple one): start by creating a new dependent variable (y) as the difference in behavioral intentions/attitudes between the calm and chaotic environments, then regress it on the perfectionism score (x) as the independent variable.
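A minimal R sketch of both steps, for one outcome at a time (e.g. behavioural intentions), assuming one row per participant with the hypothetical columns calm, chaotic and perfectionism in a data frame dat:

# Hypothesis 1: within-subject comparison of the two environments
t.test(dat$chaotic, dat$calm, paired = TRUE)
# or, if the paired differences are clearly non-normal:
wilcox.test(dat$chaotic, dat$calm, paired = TRUE)

# Hypothesis 2: does perfectionism predict the size of the environment effect?
dat$diff <- dat$chaotic - dat$calm           # difference score per participant
fit <- lm(diff ~ perfectionism, data = dat)  # simple linear regression
summary(fit)

The same two steps would then be repeated for the attitudes outcome.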
  • asked a question related to Statistics
Question
2 answers
Relevant answer
Answer
What could your "political inclinations" possibly have to do with the scientific issues discussed on this website?
  • asked a question related to Statistics
Question
1 answer
RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.
res = hypotest_fun_out(*samples, **kwds)
The above warning occurred in Python. First, the dataset was normalised, and then this warning appeared while performing the t-test, although the output was still displayed. Kindly suggest some methods to avoid this warning.
Relevant answer
Answer
Why do you normalize before testing? If you are doing a pairwise t-test and the differences are small, this only makes the differences smaller. https://www.stat.umn.edu/geyer/3701/notes/arithmetic.html
  • asked a question related to Statistics
Question
3 answers
I am somewhat Hegelian because I do not believe in martyrdom and/or dying on a hill, and usually the popular and/or traditional opinion has a deeper, less obvious reason.
Relevant answer
Answer
I value politics, I believe in politics, and I exercise my political right as a citizen.
  • asked a question related to Statistics
Question
7 answers
Can anyone help me with one biostatistics question? It is about finding the sample size from a power analysis. I have the variables; I just need assistance with the calculations.
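For reference, the general pattern in base R for a two-group comparison of means looks like this (only a sketch with made-up numbers; the pwr package offers analogous functions for other tests):

# Sample size per group for a two-sample t-test
# delta = expected mean difference, sd = expected standard deviation
power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "two.sided")

Plugging in your own effect size, alpha and target power returns the required n per group.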
  • asked a question related to Statistics
Question
5 answers
As a Computer Science student inexperienced in statistics, I'm looking for some advice on selecting the appropriate statistical test for my dataset.
My data, derived from brain scans, are structured into the columns subject, channels, freqbands, measures, value, and group. They involve recording multiple channels (electrodes) per patient, dividing each signal into various frequency bands (freqbands), and calculating measures like Shannon entropy for each. So every channel x freqband x measure combination is reduced to a single data point. This results in 1425 data points per subject (19 channels x 5 freqbands x 15 measures), for around 170 subjects in total.
I aim to determine if there's a significant difference in values (linked to specific channel, freqband, and measure combinations) between two groups. Additionally, I'm interested in identifying any significant differences at the channel, measure or freqband level.
What would be a suitable statistical test for this scenario?
Thanks in advance for any help!
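One common starting point (only a sketch, not the only defensible choice) is a mass-univariate approach: compare the two groups separately for every channel x freqband x measure combination and then correct for multiple testing, e.g. with the Benjamini-Hochberg FDR procedure. Assuming a long-format data frame df with the columns named above:

# One data frame per channel x freqband x measure combination
combos <- split(df, list(df$channels, df$freqbands, df$measures), drop = TRUE)

# Two-group comparison within each combination (Wilcoxon as a robust default)
p_raw <- sapply(combos, function(d) wilcox.test(value ~ group, data = d)$p.value)
p_adj <- p.adjust(p_raw, method = "BH")   # FDR correction over all 1425 tests

# Combinations that still differ between the groups after correction
names(p_adj)[p_adj < 0.05]

A mixed-effects model with subject as a random effect is an alternative if you want to test channel, freqband or measure effects jointly rather than one combination at a time.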
  • asked a question related to Statistics
Question
2 answers
Has anyone gone through the Wohlers Report 2023 yet? What are its pros and cons? What are the ways to obtain an e-copy? The subscription is very costly for an ordinary researcher (around 750 USD per user). Are there any alternatives that provide similar data to the report?
Relevant answer
Answer
Thanks David A. Jones sir!!
  • asked a question related to Statistics
Question
4 answers
I'm excited to speak at this FREE conference for anyone interested in statistics in clinical research. 👇🏼👇🏼 The Effective Statistician conference features a lineup of scholars and practitioners who will speak about professional & technical issues affecting statisticians in the workplace. I'll be giving a gentle introduction to structural equation modeling! I hope to see you there. Sign up here:
Relevant answer
Answer
Thanks for this valuable share!!
  • asked a question related to Statistics
Question
4 answers
How to test for common method bias in CB-SEM?
Relevant answer
Answer
Check Harman's single-factor test, which is applied to the measurement instrument by loading all items of the dependent and independent variables onto one factor and checking how much of the variance that single factor explains.
You can also test for it within covariance-based structural equation modelling (CB-SEM) itself, for example by adding a common method factor to the measurement model.
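A minimal R sketch of Harman's single-factor check, assuming all questionnaire items are in a data frame called items (hypothetical name); if the single factor accounts for the majority of the variance (a common rule of thumb is more than 50%), common method bias is a concern:

# Force all items onto a single factor
efa1 <- factanal(items, factors = 1)

# Proportion of total variance explained by that single factor
sum(efa1$loadings^2) / ncol(items)

Note that Harman's test is only a coarse diagnostic; the common-method-factor approach within CB-SEM is generally considered more informative.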
  • asked a question related to Statistics
Question
5 answers
Is it correct to choose the principal components method in order to show the relationship of species with biotopes?
Relevant answer
Answer
Olena Yarys If you are looking for patterns and relationships among those variables (species and biotopes), additional approaches like Canonical Correspondence Analysis (CCA) or regression models may be more appropriate. You could then validate your results and perform a sensitivity analysis.
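If it helps, here is a minimal sketch of the CCA route in R with the vegan package (hypothetical object names: a sites-by-species abundance matrix spe and a data frame of biotope descriptors env):

library(vegan)

cca_fit <- cca(spe ~ ., data = env)   # constrain species composition by the biotope variables
summary(cca_fit)
anova(cca_fit, permutations = 999)    # permutation test of the overall species-biotope relationship
plot(cca_fit)                         # triplot of species, sites and biotope variables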
  • asked a question related to Statistics
Question
3 answers
My answer: Yes, in order to interpret history, disincentives are the most rigorous guide. How? Due to the many assumptions of inductive logic, deductive logic is more rigorous. Throughout history, incentives are less rigorous because no entity (besides God) is completely rational and/or self-interested; thus what incentivizes an act is less rigorous than what disincentivizes the same action. And, as a heuristic, all entities (besides God) have a finite existence before their energy (eternal consciousness) goes to the afterlife (paraphrased from these sources: 1), thus interpretation through disincentives is more rigorous than interpreting through incentives.
Relevant answer
Answer
People's behavior in history is based on different motives, ideologies and personal views. Although motivational factors may influence decision making, individuals and groups often act within the context of their own authority and time.
  • asked a question related to Statistics
Question
2 answers
Who agrees life is more about preventing tragedies than performing miracles? I welcome elaborations.
Relevant answer
Answer
Maybe a bit cheesy, but "preventing tragedies IS performing miracles" in my opinion. Then again, negative news is always more widely reported and recognized than positive news, so if someone performs an extraordinarily good feat, they will be rewarded, if at all, only for a very short time.
  • asked a question related to Statistics
Question
4 answers
If you're using a number such as a statistic from a reference study you want to cite, should you write the number with the confidence interval? And how to effectively prevent plagiarism when dealing with numbers?
Thank you!
Relevant answer
Answer
Iltimass Gouazar when you're citing statistics from a reference study, it's generally good practice to report the confidence intervals as well. Including them adds context to the statistic and gives a sense of the uncertainty or variability associated with the estimate. As for plagiarism, simply cite each of your sources and paraphrase the surrounding text in your own words.
  • asked a question related to Statistics
Question
4 answers
Are people more likely to mix up words if they are fluent in more languages? How? Why?
Relevant answer
Answer
Certainly! A person who is fluent in more than one language is more likely to code-switch and mix up words from different languages within her L1. Language users, whether consciously or unconsciously, seek whatever makes things easier for themselves. Reasons for this interference vary:
1/ Similarities in pronunciation, grammar and vocabulary among language systems such as French, English and Spanish play an important role in a multilingual society. When more than one language is known, for historical reasons for example, mixing up words becomes common when people communicate with speakers of different languages. A person who is fluent in French may easily mix up words when using English, and the same thing happens to learners who mix in French words when writing or speaking English.
2/ Language dominance: a bilingual speaker who uses the second language all day at work and with colleagues may not be able to stop herself from mixing up words when using her mother tongue at home.
3/ Prestige is another reason why people mix up words. For example, in Algeria people who use French (a second language) words or sentences alongside Arabic are considered intellectual.
4/ Actually, language interference and code-switching occur even within the same language. For instance, a person who lives or works in an area far from home may be noticed because she uses different vocabulary and body language, and the same thing happens when that person mixes up words while using her own language at home.
  • asked a question related to Statistics
Question
4 answers
Hi!
This might be a bit of a stupid question, but I am currently writing my master's thesis. One of the things I am doing is a factor analysis on a scale developed in Canada. This scale has only been validated on the Canadian workforce (the developers have done one exploratory factor analysis and two confirmatory factor analyses). I am doing an exploratory and a confirmatory factor analysis in the Norwegian workforce to see what factor structure I find here, and whether it is the same as in Canada. As this is only one of three things I am doing in my master's, and I have hypotheses for all the other findings, my supervisor would like me to have a hypothesis for the factor structure as well. Whenever I try to come up with arguments, I always feel like I am just arguing for the same attitudes in both countries, rather than for the factor structure.
My question is: How do you make a hypothesis for this where you argue for the same/a different factor structure without arguing for the same/different attitudes?
Thank you in advance! :)
Relevant answer
Answer
Factor analysis can be used to identify the factors that contribute to the structure of the workforce.
It can also be used to identify the key skills and competencies required for different roles in an organisation, to find gaps in the current workforce, and to develop strategies to address those gaps.
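In code terms, the EFA-then-CFA plan described in the question might look roughly like this in R (the item names, the number of factors and the data frame norway_items are all hypothetical; the CFA model string would encode the factor structure reported for the Canadian sample):

library(psych)
library(GPArotation)   # needed for the oblimin rotation
library(lavaan)

# Exploratory step in the Norwegian sample: how many factors, which items load where?
fa.parallel(norway_items)
efa <- fa(norway_items, nfactors = 3, rotate = "oblimin")
print(efa$loadings, cutoff = 0.3)

# Confirmatory step: test the Canadian factor structure in the Norwegian data
cfa_model <- '
  F1 =~ item1 + item2 + item3
  F2 =~ item4 + item5 + item6
  F3 =~ item7 + item8 + item9
'
fit <- cfa(cfa_model, data = norway_items)
fitMeasures(fit, c("cfi", "tli", "rmsea", "srmr"))

The hypothesis can then be framed in terms of these fit indices (e.g. acceptable CFA fit of the Canadian structure in the Norwegian sample) rather than in terms of the attitudes themselves.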
  • asked a question related to Statistics
Question
7 answers
I came across a commentary titled 'Tests of Statistical Significance – their (ab)use in the social sciences', and it made me reflect on the validity of using my sample for statistical testing. I have a sample of 24 banks, and they were not randomly selected: they were drawn from the top 50 banks ranked by The Banker, and I narrowed the sample down to 24 because only those banks were usable for my study. I wanted to test the association between these banks using McNemar's test, but any result I obtain (I obtained non-significant results) would be meaningless, right? Because they are not a random selection. I do not want to generalise, but I wanted to know if I could still comment on the non-significance of their association.
Relevant answer
Answer
A new book worthy of our serious attention: The Myth of Statistical Inference (2021) by Michael C. Acree.
  • asked a question related to Statistics
Question
3 answers
Hello. We understand that a volcano plot is a graphical representation of differential values (proteins or genes), and it requires two parameters: fold change and p-value. However, in IP-MS (immunoprecipitation-mass spectrometry) data, many proteins are identified with an intensity in the IP (immunoprecipitation) group but are not detected at all in the IgG (control) group (the data are blank). This means that we cannot calculate the p-value and fold change for these "present (IP) --- absent (IgG)" proteins, and therefore we cannot plot them on a volcano plot. However, in many articles we see that such proteins are successfully plotted on a volcano plot. How did the authors accomplish this? Are there data-fitting methods available to assist with the plot? Do we need imputation, and would it reflect the real degree of interaction?
Relevant answer
Answer
Albert Lee: the issue with doing this is that it makes the fold changes entirely arbitrary. Imagine I have a protein I detect in my test samples at "arbitrary value 10" but do not detect in my control samples at all.
If I call the ctrl value 0.5, then 0.5 vs 10.5 is a 21-fold increase.
If I call the ctrl value 0.1, then 0.1 vs 10.1 is a 101-fold increase.
If I call the ctrl value 0.0001, then 0.0001 vs 10.0001 is a ~100,000-fold increase.
In reality, the increase is effectively "infinite fold", but what this is really highlighting is that fold changes are not an appropriate metric here.
A lot (most) of statistical analysis is predicated on the measurement of change in values, not "present/absent" scenarios.
For disease biomarkers, for example, something that is present/absent is of use as a diagnostic biomarker, but not as a monitoring biomarker: you can say "if you see this marker at all, you have the disease", but you cannot really use it to track therapeutic efficacy, because all values of this marker other than "N/A" are indicative of disease.
For monitoring biomarkers you really want "healthy" and "diseased" values such that you can track the shift from one to the other.
David Genisys: I agree with Jochen Wilhelm, and would not plot my data in this manner.
A lot will depend on the kind of reviewers you get and the type of paper you're trying to produce, but it would be more appropriate to note that these markers are entirely absent in one group, and then to comment on the robustness of their detection in the other. You wouldn't necessarily run stats, because as noted, stats are horrible for yes/no markers, but you could use the combination of presence/absence and the actual level in the detected group to make inferences as to biological effect. If a marker goes from "not detected" to "detected but barely", then it might be indicative of dysregulated, aberrant expression behaviour, or perhaps stochastic low-level damage. Interesting, but perhaps not of biological import or diagnostic utility. If instead it goes from "not detected" to "readily detected, at high levels", then it's probably very useful as a diagnostic biomarker, and also indicative of some active biological process, be it widespread damage/release or active expression of novel targets.
In either case you can make biological inferences without resorting to making up numbers so you can stick them on a volcano plot (and to be honest, if you get the kind of reviewers that demand volcano plots, you can always use the trick Albert suggests).
Volcano plots are primarily a way to take BIG DATA and present it in a manner that allows you to highlight the most interesting targets that have changed between groups: if you have whole swathes of genes that are instead present/absent, then those could be presented as a table, perhaps sorted by GO terms or something (if it looks like there are shared ontological categories you could use to infer underlying biology).
  • asked a question related to Statistics
Question
3 answers
We measured three aspects (i.e. variables) of self-regulation. We have 2 groups and our sample size is ~30 in each group. We anticipate that three variables will each contribute unique variance to a self-regulation composite. How do we compare if there are group differences in the structure/weighting of the composite? What analysis should be conducted?
Relevant answer
Answer
Are you thinking of self-regulation as a latent variable with the 3 "aspects" as manifest indicators? If so, you could use a two-group SEM, although your sample size is a bit small.
You've not said what software you use, but this part of the Stata documentation might help you get the general idea anyway.
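The answer above points to the Stata documentation; purely for reference, the same two-group idea in R/lavaan would look roughly like this (hypothetical variable names, with self-regulation as a latent variable and the three measured aspects as indicators):

library(lavaan)

model <- 'selfreg =~ aspect1 + aspect2 + aspect3'

# Configural model: same structure in both groups, loadings free
fit_free  <- cfa(model, data = dat, group = "group")

# Constrained model: loadings (the "weighting" of the composite) equal across groups
fit_equal <- cfa(model, data = dat, group = "group", group.equal = "loadings")

# If the constrained model fits significantly worse, the weighting differs between groups
anova(fit_free, fit_equal)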
  • asked a question related to Statistics
Question
3 answers
I have a paper that proposed a hypothesis test that is heavily based on existing tests (so it is pretty much a procedure built on existing statistical tests). It was rejected by a few journals claiming that it was not innovative, although I demonstrated that it outperforms some commonly used tests.
Are there any journals that take this sort of paper?
Relevant answer
Answer
There are two different strategies for submitting this type of work: 1) find a statistical journal that accepts more applied work, or 2) find a scientific journal that would consider your work of interest. What scientific, engineering, or medical problem are you trying to solve with your new method? What does your work add or provide to the community that is not addressed in the current literature? Once you know the answers to these two questions, you can better determine to which journal to submit.
  • asked a question related to Statistics
Question
6 answers
I want to ask about the use of parametric and non-parametric tests when we have an enormous sample size.
Let me describe a case for discussion:
- I have two groups of samples of a continuous variable (let's say: Pulse Pressure, so the difference between systolic and diastolic pressure at a given time), let's say from a) healthy individuals (50 subjects) and b) patients with hypertension (also 50 subjects).
- there are approx. 1000 samples of the measured variable from each subject; thus, we have 50*1000 = 50000 samples for group a) and the same for group b).
My null hypothesis is that there is no difference in the distributions of the measured variable between the analysed groups.
I tried two different approaches, each providing me with a p-value:
Option A:
- I took all samples from group a) and b) (so, 50000 samples vs 50000 samples),
- I checked the normality in both groups using the Shapiro-Wilk test; both distributions were not normal
- I used the Mann-Whitney test and found significant differences between distributions (p<0.001), although the median value in group a) was 43.0 (Q1-Q3: 33.0-53.0) and in group b) 41.0 (Q1-Q3: 34.0-53.0).
Option B:
- I averaged the variable's values within each participant (so, 50 values in group a) and 50 values in group b)),
- I checked the normality in both groups using the Shapiro-Wilk test; both distributions were normal,
- I used Student's t-test and obtained a p-value of 0.914, with median values of 43.1 (Q1-Q3: 33.3-54.1) in group a) and 41.8 (Q1-Q3: 35.3-53.1) in group b).
My intuition is that I should use option B and average the signal before the testing. Otherwise, I reject the null hypothesis, having a very small difference in median values (and large Q1-Q3), which is quite impractical (I mean, visually, the box plots look very similar, and they overlap each other).
What is your opinion about these two options? Are both correct but should be used depending on the hypothesis?
Relevant answer
Answer
You have 1000 replicate measurements from each subject. These 1000 values are correlated, and they should not be analyzed as if they were independent. So your model is wrong and you should identify a more sensible model. Ultimately, the test of the difference between your groups should not have more than 98 degrees of freedom (it should have fewer, since a sensible model will surely include some parameters other than just the two means). Having 1000 replicate measurements seems like overkill to me if there is no other aspect that should be considered in the analysis (like a change over time, with age, something like that). If there is nothing else to consider, the simplest analysis is to average the 1000 values per patient and do a t-test on the 2x50 (averaged) values.
If you did have thousands of independent samples per group, estimation would be more interesting than testing. You should then interpret the 95% confidence interval of the estimate (biological relevance) rather than the (in this respect silly) fact of whether it lies just in the positive or in the negative range.
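A small R sketch of the suggested analysis, assuming a long data frame dat with the hypothetical columns subject, group and pp holding the ~1000 pulse-pressure values per subject; a mixed model is shown as the alternative that keeps the replicate structure:

# Average the replicate measurements within each subject, then compare the groups
subj_means <- aggregate(pp ~ subject + group, data = dat, FUN = mean)
t.test(pp ~ group, data = subj_means)   # 2 x 50 averaged values, at most 98 df

# Alternative that keeps the replicates but models the within-subject correlation
library(lme4)
fit <- lmer(pp ~ group + (1 | subject), data = dat)
summary(fit)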
  • asked a question related to Statistics
Question
5 answers
Neurons were treated with four different types of drugs, and then a full transcriptome was produced. I am interested in looking at the effects of these drugs on two specific pathways, each with around 20 genes. Would it be appropriate for me to just set up a simple comparative test (like a t-test) and run it for each gene? Or should I still use a differential gene expression package like DESeq2, even though only a few genes are going to be analysed? The aim of my experiment is a very targeted analysis, with the hopes that I may be able to uncover interesting relationships by cutting out the noise (i.e., the rest of the genes that are not of interest).
Relevant answer
Answer
Heather Macpherson oh yay, that is much better. I think edgeR or limma would be highly appropriate for processing your data. The edgeR and limma user guides are excellent resources with many tutorials on their proper use. As Jochen Wilhelm explained very well, you will not want to subset. In edgeR and limma you can filter on the whole experiment, which requires a design matrix. I would also generate a contrast matrix for the group comparisons. After your group-wise comparisons you can subset as you like and highlight those genes in a volcano plot or smear plot. If you have not done this before, I would highly suggest starting from HOMER step 1 and then going directly to the edgeR user guide vignette. Good luck!! http://homer.ucsd.edu/homer/basicTutorial/index.html
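A compressed sketch of one such route (limma-voom) in R, with hypothetical object names: a raw counts matrix counts, a factor group coding the treatments plus control, and a vector pathway_genes with the ~40 genes of interest; the subsetting happens only after the genome-wide fit:

library(edgeR)
library(limma)

dge  <- DGEList(counts = counts, group = group)
keep <- filterByExpr(dge, group = group)          # filter on the full experiment
dge  <- dge[keep, , keep.lib.sizes = FALSE]
dge  <- calcNormFactors(dge)

design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
contr <- makeContrasts(drugA - control, levels = design)  # hypothetical contrast name

v   <- voom(dge, design)
fit <- eBayes(contrasts.fit(lmFit(v, design), contr))
tab <- topTable(fit, number = Inf)

# Only now pull out the pathway genes of interest
tab[rownames(tab) %in% pathway_genes, ]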
  • asked a question related to Statistics
Question
5 answers
When writing a review, published articles are usually collected from popular data sources like PubMed, Google Scholar, Scopus, etc.
My questions are:
1. How can we confirm that all the articles published in a certain period (e.g., 2000 to 2020) are collected and considered in the screening process (inclusion and exclusion criteria)?
2. When the articles are not open access, how can we minimize the challenges of understanding the data for the meta-analysis?
Relevant answer