
Analytical Statistics - Science topic

Explore the latest questions and answers in Analytical Statistics, and find Analytical Statistics experts.
Questions related to Analytical Statistics
  • asked a question related to Analytical Statistics
Question
1 answer
Hi, I have 4 samples belonging to 2 groups, so 2 replicates per group. I am using edgeR for differential analysis, and I find that many variables with fairly large logFC values are not significantly changed (FDR > 0.05), even though on inspection they look like genuine differences.
I think the problem is that the between-sample variance is large because I only have 2 replicates, so it is hard to pass the significance test in edgeR.
How can I reduce the sample variance properly? Which mathematical and statistical methods should I use?
Thanks for your attention!
Relevant answer
Answer
With only 2 replicates per group there is not really enough data for a formal statistical analysis; you could still try a bootstrap to test the group differences.
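For what it's worth, a minimal sketch of the standard edgeR quasi-likelihood pipeline, which shares dispersion information across genes and so stabilises the variance estimates when replicates are few; the simulated `counts` matrix below just stands in for your own count table.
library(edgeR)
set.seed(1)
counts <- matrix(rnbinom(4000, mu = 50, size = 5), ncol = 4,
                 dimnames = list(paste0("g", 1:1000), paste0("s", 1:4)))
group <- factor(c("A", "A", "B", "B"))
y <- DGEList(counts = counts, group = group)
y <- y[filterByExpr(y), , keep.lib.sizes = FALSE]    # drop weakly expressed features first
y <- calcNormFactors(y)
design <- model.matrix(~ group)
y <- estimateDisp(y, design, robust = TRUE)          # moderated, robust dispersion estimates
fit <- glmQLFit(y, design, robust = TRUE)
topTags(glmQLFTest(fit, coef = 2))                   # features ranked by FDR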
  • asked a question related to Analytical Statistics
Question
4 answers
I have a dummy variable as the possible mediator of a relationship in my model. Reading Baron and Kenny's (1986) steps, I see that in the second step you have to test the relationship between the independent variable and the mediator, using the latter as a dependent variable. However, you normally wouldn't use OLS when you have a dummy as the dependent variable. Should I use a Probit in this case?
Relevant answer
Answer
No, you don’t have to use a Probit model specifically when analyzing a possible mediation effect that includes a dummy variable. While Probit models are commonly used for binary dependent variables (where the outcome is a 0 or 1), you can analyze mediation effects with other models depending on the nature of your variables. For example, if your outcome variable is continuous, you might use linear regression to analyze mediation. However, if the mediator or outcome is binary, a Probit or Logit model could be appropriate. The key is to choose a model that aligns with the characteristics of your dependent and mediator variables, and the mediation analysis can be conducted using techniques like the Baron and Kenny method or bootstrapping regardless of the model type.
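As an illustration of the last point, a minimal sketch in R of a mediation analysis with a binary (dummy) mediator, using a probit GLM for the mediator model and bootstrapped indirect effects via the mediation package; the variables x, m and y below are simulated stand-ins, not your data.
library(mediation)
set.seed(1)
dat <- data.frame(x = rbinom(300, 1, 0.5))                  # independent variable
dat$m <- rbinom(300, 1, pnorm(-0.3 + 0.8 * dat$x))          # binary mediator
dat$y <- 0.4 * dat$x + 0.6 * dat$m + rnorm(300)             # continuous outcome
med.fit <- glm(m ~ x, family = binomial(link = "probit"), data = dat)  # step 2: probit, not OLS
out.fit <- lm(y ~ x + m, data = dat)
med.out <- mediate(med.fit, out.fit, treat = "x", mediator = "m",
                   boot = TRUE, sims = 500)
summary(med.out)   # average causal mediation effect with bootstrap CIs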
  • asked a question related to Analytical Statistics
Question
4 answers
How does P(X >= k - 1) for X ~ Binomial(n - 1, p) compare with P(X >= k) for X ~ Binomial(n, p)?
Relevant answer
Answer
Hi,
We know that if X1 ~ Bin(n-1, p) and X2 ~ Bin(1, p) are independent, then Y = X1 + X2 ~ Bin(n, p).
Since 0 <= X2 <= 1, the event {X1 + X2 >= k} implies {X1 >= k - 1}, and the event {X1 >= k} implies {X1 + X2 >= k}. Therefore
P(X1 >= k) <= P(X1 + X2 >= k) <= P(X1 >= k - 1),
i.e. P(X >= k - 1) for X ~ Bin(n - 1, p) is at least as large as P(X >= k) for X ~ Bin(n, p).
Regards,
Hamid
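A quick numerical check of the inequality in R, using arbitrary illustrative values of n, k and p:
n <- 20; k <- 8; p <- 0.35
p1 <- pbinom(k - 2, size = n - 1, prob = p, lower.tail = FALSE)  # P(X >= k-1), X ~ Bin(n-1, p)
p2 <- pbinom(k - 1, size = n,     prob = p, lower.tail = FALSE)  # P(X >= k),   X ~ Bin(n,   p)
c(p1, p2, p1 >= p2)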
  • asked a question related to Analytical Statistics
Question
4 answers
What is the process for extracting ASI microdata with Stata and SPSS?
I have microdata in Stata and SPSS formats. I want to know about the process. Is there any tutorial on YouTube for ASI microdata?
Relevant answer
Answer
Good morning Sir Florian Schütze
Thank you very much for your reply/comment.
I have visited there. I found videos for PLFS and NSS, but not for ASI.
From the MoSPI microdata catalog I have downloaded the data, but I am unable to get the quantities for specific variables. Variables like the number of firms and the number of operating firms I do get, but I am unable to get fixed capital, input, output and other variables. I merged two blocks and applied the formula, but perhaps there is some mistake, so I am not getting the values.
  • asked a question related to Analytical Statistics
Question
6 answers
Hi everyone,
Does anyone have a detailed SPSS (v. 29) guide on how to conduct Generalised Linear Mixed Models?
Thanks in advance!
Relevant answer
Answer
Ravisha Jayawickrama don't thank Onipe Adabenege Yahaya but ChatGPT; you could have gotten the same answer yourself.
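Not an SPSS v29 guide, but for comparison, this is roughly what such a generalised linear mixed model looks like in R with lme4 (simulated binary outcome, one fixed predictor, a random intercept per subject; all names are illustrative):
library(lme4)
set.seed(1)
dat <- data.frame(subject = factor(rep(1:40, each = 5)), x = rnorm(200))
dat$y <- rbinom(200, 1, plogis(0.5 * dat$x + rnorm(40)[as.integer(dat$subject)]))
fit <- glmer(y ~ x + (1 | subject), family = binomial, data = dat)
summary(fit)   # fixed effects, random-intercept variance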
  • asked a question related to Analytical Statistics
Question
1 answer
"Dear Researchers,
In the context of the Brief COPE Inventory (B-COPE), which comprises 14 scales (each with two items) with responses ranging from 1 (“I haven't done this at all”) to 4 (“I have done this a lot”), I noticed that for statistical analysis, the questionnaire is often divided into adaptive and maladaptive coping subscales. The adaptive coping subscale is derived from a cumulative score of 8 scales, while the maladaptive coping subscale is based on the remaining 6 scales.
My question pertains to the methodological implications of this division: How do you ensure a fair and balanced comparison between adaptive and maladaptive coping strategies when they are based on an unequal number of scales? Specifically, I am interested in understanding the statistical rationale behind this approach and how it might influence the interpretation of a participant's coping strategies as more adaptive or maladaptive. Additionally, are there any considerations or adjustments made during the analysis to account for the discrepancy in the number of scales between these two subscales?
Thank you
Relevant answer
Answer
You can either standardize the variables before your analysis (mean of zero and std. dev. of 1), or if they are independent variables in a regression, you can use the standardized coefficients.
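A small R sketch of that suggestion; the stand-in data frame below just holds hypothetical raw subscale sums named adaptive (8 scales) and maladaptive (6 scales).
set.seed(1)
dat <- data.frame(adaptive = sample(8:32, 50, replace = TRUE),
                  maladaptive = sample(6:24, 50, replace = TRUE))
dat$adaptive_z    <- as.numeric(scale(dat$adaptive))      # mean 0, SD 1
dat$maladaptive_z <- as.numeric(scale(dat$maladaptive))   # mean 0, SD 1
# Alternative: per-scale mean scores put both subscales back on the 1-4 item metric
dat$adaptive_mean    <- dat$adaptive / 8
dat$maladaptive_mean <- dat$maladaptive / 6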
  • asked a question related to Analytical Statistics
Question
2 answers
Short Course: Statistics, Calibration Strategies and Data Processing for Analytical Measurements
Pittcon 2024, San Diego, CA, USA (Feb 24-28, 2024)
Time: Saturday, February 24, 2024, 8:30 AM to 5:00 PM (Full day course)
Short Course: SC-2561
Presenter: Dr. Nimal De Silva, Faculty Scientist, Geochemistry Laboratories, University of Ottawa, Ontario, Canada K1N 6N5
Abstract:
Over the past few decades, instrumental analysis has come a long way in terms of sensitivity, efficiency, automation, and the use of sophisticated software for instrument control and data acquisition and processing. However, the full potential of such sophistication can only be realized with the user’s understanding of the fundamentals of method optimization, statistical concepts, calibration strategies and data processing, to tailor them to the specific analytical needs without blindly accepting what the instrument can provide. The objective of this course is to provide the necessary knowledge to strategically exploit the full potential of such capabilities and commonly available spreadsheet software. Topics to be covered include Analytical Statistics, Propagation of Errors, Signal Noise, Uncertainty and Dynamic Range, Linear and Non-linear Calibration, Weighted versus Un-Weighted Regression, Optimum Selection of Calibration Range and Standard Intervals, Gravimetric versus Volumetric Standards and their Preparation, Matrix effects, Signal Drift, Standard Addition, Internal Standards, Drift Correction, Matrix Matching, Selection from multiple responses, Use and Misuse of Dynamic Range, Evaluation and Visualization of Calibrations and Data from Large Data Sets of Multiple Analytes using EXCEL, etc. Although the demonstration data sets will be primarily selected from ICPES/MS and Chromatographic measurements, the concepts discussed will be applicable to any analytical technique, and scientific measurements in general.
Learning Objectives:
After this course, you will be familiar with:
- Statistical concepts, and errors relevant to analytical measurements and calibration.
- Pros and cons of different calibration strategies.
- Optimum selection of calibration type, standards, intervals, and accurate preparation of standards.
- Interferences, and various remedies.
- Efficient use of spreadsheets for post-processing of data, refining, evaluation, and validation.
Access to a personal laptop for the participants during the course would be helpful, although internet access during the course is not necessary. However, some sample and worked spreadsheets and course material need to be distributed (emailed) to the participants the day before the course.
Target Audience: Analytical Technicians, Chemists, Scientists, Laboratory Managers, Students
Register for Pittcon: https://pittcon.org/register
Relevant answer
Answer
Dear Thiphol:
Many thanks for your interest. Currently, I don't have a recorded video. However, I may offer this course in the future on-line in a webinar format if there is sufficient interest/inquiries.
Thanks again.
Nimal
  • asked a question related to Analytical Statistics
Question
3 answers
We measured three aspects (i.e. variables) of self-regulation. We have 2 groups and our sample size is ~30 in each group. We anticipate that three variables will each contribute unique variance to a self-regulation composite. How do we compare if there are group differences in the structure/weighting of the composite? What analysis should be conducted?
Relevant answer
Answer
Are you thinking of self-regulation as a latent variable with the 3 "aspects" as manifest indicators? If so, you could use a two-group SEM, although your sample size is a bit small.
You've not said what software you use, but this part of the Stata documentation might help you get the general idea anyway.
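For illustration, a minimal sketch of that two-group SEM in R with lavaan, using lavaan's built-in HolzingerSwineford1939 data as a stand-in; swap in your own three indicators and grouping variable.
library(lavaan)
model <- 'selfreg =~ x1 + x2 + x3'
fit_free  <- cfa(model, data = HolzingerSwineford1939, group = "school")
fit_equal <- cfa(model, data = HolzingerSwineford1939, group = "school",
                 group.equal = "loadings")
anova(fit_free, fit_equal)   # does constraining the loadings to be equal across groups worsen fit?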
  • asked a question related to Analytical Statistics
Question
2 answers
I have 3 papers suitable for inclusion in my systematic review looking at high versus low platelet to red cell ratio in TBI, and want advice as to whether I can combine their estimates of effect in a meta-analysis.
One RCT which provides an unadjusted odds ratio and adjusted odds ratio of 28-day mortality for two groups (one intervention (high ratio) and one control (low ratio), adjusted for differences in baseline characteristics).
One retrospective cohort study which provides absolute unadjusted 28-day mortality data for two groups (one exposed to high ratio, and another exposed to a low ratio). They have also performed a sophisticated propensity analysis to adjust for the few differences between the groups and multivariate cox regression to adjust for factors associated with mortality, and presented hazard ratios.
Finally, a post-hoc analysis of a RCT, which compares outcomes for participants grouped according to presence/absence of haemorrhagic shock (HS) and TBI. This generates 4 groups - neither HS nor TBI, HS only, TBI only and TBI + HS. I am interested in the latter two as they included patients with TBI. One group was exposed to a high ratio, whereas the other a lower ratio. The authors provided unadjusted mortality data for all groups, and they adjust for differences in admission characteristics, to generate odds ratio of 28-day mortality. However, they present these adjusted odds ratios of death at 28days for the HS only, TBI only and TBI + HS groups compared to the neither TBI nor HS group, not to each other.
I could analyse unadjusted mortality in a meta-analysis, but I want to know whether I can instead combine all or some of the adjusted outcome measures I have described. Any help greatly appreciated.
Relevant answer
Answer
Thanks James
  • asked a question related to Analytical Statistics
Question
4 answers
Hi everyone,
I need to convert standard error (SE) into standard deviation (SD). The formula for that is
SE times the square root of the sample size
By 'sample size', does it mean the total sample size or the sample sizes of the individual groups? For example, the intervention group has 40 participants while the control group has 39 (so the total sample size is 79). So, when calculating the SD for the intervention group, do I use 40 as the sample size, or 79?
Thank you!
Relevant answer
Answer
See Section 7.7.3.2, 'Obtaining standard deviations from standard errors', in the Cochrane Handbook (cochrane.org).
Also, there is a useful calculator in the attached Excel file from Cochrane.
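To the specific question: the standard error of a group mean is that group's SD divided by the square root of that group's own n, so you back-convert with each group's own sample size (40 and 39 here), not the total of 79. For example, with hypothetical SE values:
se_int <- 1.2; se_ctl <- 1.4        # hypothetical standard errors
sd_int <- se_int * sqrt(40)         # intervention group, n = 40
sd_ctl <- se_ctl * sqrt(39)         # control group, n = 39
c(sd_int, sd_ctl)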
  • asked a question related to Analytical Statistics
Question
10 answers
Hi,
There is an article for which I want to know which statistical method was used: regression or Pearson correlation.
However, they don't say which one. They report the correlation coefficient and the standard error.
Based on these two parameters, can I tell whether they used regression or Pearson correlation?
Relevant answer
Answer
Not sure I understand your question. If there is a single predictor and by regression you mean linear OLS regression, then the r is the same. Can you provide more details?
  • asked a question related to Analytical Statistics
Question
6 answers
Hello everyone,
I am currently doing research on the impact of online reviews on consumer behavior. Unfortunately, statistics are not my strong point, and I have to test three hypotheses.
The hypotheses are as follows: H1: There is a connection between the level of reading online reviews and the formation of impulsive buying behavior in women.
H2: There is a relationship between the age of the respondents and susceptibility to the influence of online reviews when making a purchase decision.
H3: There is a relationship between respondents' income level and attitudes that online reviews strengthen the desire to buy.
Questions on age, income level and the frequency of reading online reviews were asked as ordered categories (e.g. 18-25 years, 26-35 years, ...; 1000-2000 Eur, 2001-3000 Eur, ...; every day, once a week, once a month, etc.), and the questions measuring attitudes and impulsive behavior were in the form of a Likert scale.
What statistical method should be used to test these hypotheses?
Relevant answer
Answer
Go with a test of association (chi-squared test).
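A minimal sketch of that chi-squared test of association in R; the simulated data frame and column names below are purely illustrative.
set.seed(1)
dat <- data.frame(age_group = sample(c("18-25", "26-35", "36+"), 200, replace = TRUE),
                  influenced = sample(c("low", "medium", "high"), 200, replace = TRUE))
tab <- table(dat$age_group, dat$influenced)
chisq.test(tab)   # H0: no association between age group and susceptibility to reviews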
  • asked a question related to Analytical Statistics
Question
4 answers
Merry Christmas everyone!
I used the Interpersonal Reactivity Index (IRI) subscales Empathic Concern (EC), Perspective Taking (PT) and Personal Distress (PD) in my study (N = 900). When I calculated Cronbach's alpha for each subscale, I got .71 for EC, .69 for PT and .39 for PD. The value for PD is very low. The analysis indicated that if I deleted one item, the alpha would increase to .53, which is still low but better than .39. However, as my study does not focus mainly on the psychometric properties of the IRI, what kind of arguments can I make to say the results are still valid? I did say the findings (for PD) should be taken with caution, but what else can I say?
Relevant answer
Answer
A scale reliability of .39 (and even .53!) is very low. Even if your main focus is not on the psychometric properties of your measures, you should still care about those properties. Inadequate reliability and validity can jeopardize your substantive results.
My recommendation would be to examine why you get such a low alpha value. Most importantly, you should first check whether each scale (item set) can be seen as unidimensional (measuring a single factor). This is usually done by running a confirmatory factor analysis (CFA) or item response theory analysis. Unidimensionality is a prerequisite for a meaningful interpretation of Cronbach's alpha (alpha is a composite reliability index for essentially tau-equivalent measures). CFA allows you to test the assumption of unidimensionality/essential tau equivalence and to examine the item loadings.
Also, you can take a look at the item intercorrelations. If some items have low correlations with others, this may indicate that they do not measure the same factor (and/or that they contain a lot of measurement error). Another reason for a low alpha value can be an insufficient number of items.
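For illustration, a minimal sketch in R of the checks described above; the simulated item responses below just stand in for the seven PD items.
set.seed(1)
f <- rnorm(300)
iri <- as.data.frame(sapply(1:7, function(i) round(2.5 + 0.6 * f + rnorm(300, sd = 0.9))))
names(iri) <- paste0("pd", 1:7)
library(psych)
alpha(iri)   # Cronbach's alpha, item-total correlations, alpha if an item is dropped
library(lavaan)
fit <- cfa('PD =~ pd1 + pd2 + pd3 + pd4 + pd5 + pd6 + pd7', data = iri)
summary(fit, fit.measures = TRUE, standardized = TRUE)   # unidimensionality and item loadings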
  • asked a question related to Analytical Statistics
Question
6 answers
I am measuring two continuous variables over time in four groups. Firstly, I want to determine if the two variables correlate in each group. I then want to determine if there is significant differences in these correlations between groups.
For context, one variable is weight, and one is a behaviour score. The groups are receiving various treatment and I want to test if weight change influences the behaviour score differently in each group.
I have found the R package rmcorr (Bakdash & Marusich, 2017) to calculate correlation coefficients for each group, but am struggling to determine how to correctly compare correlations between more than two groups. The package diffcorr allows comparing between two groups only.
I came across an article describing a different method in SPSS:
However, I don't have access to SPSS, so I am wondering if anyone has any suggestions on how to do this analysis in R (or even GraphPad Prism).
Or I could use the diffcorr package to calculate differences for each combination of groups, but then would I need to apply a multiple comparison correction?
Alternatively, Mohr & Marcon (2005) describe a different method using Spearman correlation that seems like it might be more relevant; however, I wonder why their method doesn't seem to have been used by other researchers. It also looks difficult to implement, so I'm unsure if it's the right choice.
Any advice would be much appreciated!
Relevant answer
Answer
You wrote: "For context, one variable is weight, and one is a behaviour score. The groups are receiving various treatment and I want to test if weight change influences the behaviour score differently in each group."
I'm not sure this is best tested with a correlation coefficient. This sounds like an interaction hypothesis (or moderation if you prefer). What you need I think is the interaction of weight change by group. This is usually tested by the regression coefficient for the interaction. You can standardize this to scale it similarly to a correlation coefficient (though that's actually best done outside the model for interactions).
You can compare correlations but that isn't necessarily sensible because you risk confounding the effects of interest with changes in SD of the variables across groups (and there seems no rationale for needing that).
A further complication is that including weight change without baseline weight as a covariate might be a poor choice. Even if groups are randomized, including baseline weight may increase the precision of the estimates of the other effects.
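A minimal sketch of that interaction test in R; the column names are hypothetical and the data are simulated only so the code runs.
set.seed(1)
dat <- data.frame(group = factor(rep(c("T1", "T2", "T3", "T4"), each = 25)),
                  baseline_weight = rnorm(100, 70, 8),
                  weight_change = rnorm(100, 0, 2))
dat$behaviour <- 5 + 0.2 * dat$weight_change + rnorm(100)
fit0 <- lm(behaviour ~ baseline_weight + weight_change + group, data = dat)
fit1 <- lm(behaviour ~ baseline_weight + weight_change * group, data = dat)
anova(fit0, fit1)   # overall test: does the weight-change slope differ across the four groups?
summary(fit1)       # group-specific differences in slope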
  • asked a question related to Analytical Statistics
Question
3 answers
Hello, I currently have a set of categorical variables, coded as Variable A,B,C,etc... (Yes = 1, No = 0). I would like to create a new variable called severity. To create severity, I know I'll need to create a coding scheme like so:
if Variable A = 1 and all other variables = 0, then severity = 1.
if Variable B = 1 and all other variables = 0, then severity = 2.
So on, and so forth, until I have five categories for severity.
How would you suggest I write a syntax in SPSS for something like this?
Relevant answer
Answer
* Create a toy dataset to illustrate.
NEW FILE.
DATASET CLOSE ALL.
DATA LIST LIST / A B C D E (5F1).
BEGIN DATA
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
1 1 0 0 0
0 1 1 0 0
0 0 1 1 0
0 0 0 1 1
1 0 2 0 0
END DATA.
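* Assign severity only when exactly one of A to E equals 1 and the rest are all 0; rows matching none of the rules (e.g. the last data row above) are left system-missing.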
IF A EQ 1 and MIN(B,C,D,E) EQ 0 AND MAX(B,C,D,E) EQ 0 severity = 1.
IF B EQ 1 and MIN(A,C,D,E) EQ 0 AND MAX(A,C,D,E) EQ 0 severity = 2.
IF C EQ 1 and MIN(B,A,D,E) EQ 0 AND MAX(B,A,D,E) EQ 0 severity = 3.
IF D EQ 1 and MIN(B,C,A,E) EQ 0 AND MAX(B,C,A,E) EQ 0 severity = 4.
IF E EQ 1 and MIN(B,C,D,A) EQ 0 AND MAX(B,C,D,A) EQ 0 severity = 5.
FORMATS severity (F1).
LIST.
* End of code.
Q. Is it possible for any of the variables A to E to be missing? If so, what do you want to do in that case?
  • asked a question related to Analytical Statistics
Question
9 answers
Presently I am handling a highly positively skewed geochemical dataset. After several attempts, I have prepared a 3 parameter lognormal distribution (using natural log and additive constant c). The descriptive statistic parameters obtained are log-transformed mean (α) and standard deviation (β). The subsequent back-transformed mean and standard deviation (BTmean and BTsd) are based on the formula
BTmean = e^(α + β^2/2) - c
BTsd = sqrt[ (BTmean)^2 * (e^(β^2) - 1) ] - c
However, someone suggested using a Lagrange multiplier. I am not sure about the
1) Equation using the Lagrange Multiplier
2) How to derive the value of Lagrange multiplier in my case.
Kindly advise.
Regards
Relevant answer
Answer
David will likely think this is complete rubbish, but here is what one librarian had to say about Z-Library.
<quote>
So there are outright “We pirate stuff’ sites like Mobilism and ZLibrary. These are places that are basically set up to pirate things and have no veneer of legality to them.
</quote>
  • asked a question related to Analytical Statistics
Question
6 answers
Hi everyone,
I am struggling a bit with data analysis.
If I have 2 separate groups, A and B.
And each group has 3 repeats, A1, A2, A3 and B1, B2, B3, for 10 time points.
How would I determine statistical significance between the 2 groups?
If I then added a third group, C, with 3 repeats C1,C2,C3 for each timepoint.
What statistical analysis would I use then?
Thanks in advance
Relevant answer
Answer
If we want to test the differences between the two groups, we can use a t-test,
or alternatively the Pearson correlation coefficient.
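One option not mentioned above is a linear mixed model on long-format data; a minimal R sketch with simulated values and hypothetical column names (a third group C simply adds another level to `group`):
library(lme4)
set.seed(1)
dat <- expand.grid(replicate = factor(1:3), time = 1:10, group = c("A", "B"))
dat$value <- 10 + 0.5 * dat$time + (dat$group == "B") * 0.3 * dat$time + rnorm(nrow(dat))
dat$id <- interaction(dat$group, dat$replicate)          # each replicate is its own unit
fit0 <- lmer(value ~ group + time + (1 | id), data = dat, REML = FALSE)
fit1 <- lmer(value ~ group * time + (1 | id), data = dat, REML = FALSE)
anova(fit0, fit1)   # does the time course differ between the groups?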
  • asked a question related to Analytical Statistics
Question
8 answers
I am hoping to see if there is a statistically significant difference between the number of trauma patients receiving a certain procedure between two time frames, but am unsure on what test I should be using.
Time frame 1: 474 trauma patients admitted, 7 received the procedure
Time frame 2: 365 trauma patients admitted, 9 received the procedure
I would be grateful for any advice and can provide more information as needed.
Many thanks!
Relevant answer
Answer
I believe you've computed the risk ratio, Sal. The odds ratio would be (9/356) / (7/467) = 1.6865971. ;-)
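For the original question, a minimal sketch of testing the two proportions on a 2x2 table in R; with event counts this small an exact test is a sensible default.
tab <- matrix(c(7, 474 - 7,     # time frame 1: procedure / no procedure
                9, 365 - 9),    # time frame 2: procedure / no procedure
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Period 1", "Period 2"), c("Procedure", "None")))
fisher.test(tab)   # exact test of the difference in proportions (also reports an odds ratio)
chisq.test(tab)    # large-sample alternative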
  • asked a question related to Analytical Statistics
Question
2 answers
In Hayes model 58, the moderator enters both path 1 (independent variable -> mediator) and path 2 (mediator -> dependent variable). When the moderation effect on path 1 is rejected and only the moderation effect on path 2 is supported, how should the moderated mediation effect be interpreted?
Even in this case, the conditional effects at -1 SD, the mean, and +1 SD are presented. If the bootstrapped LLCI and ULCI of all these effects do not include 0, can the result be interpreted as a moderated mediation effect?
We look forward to hearing from our seniors.
  • asked a question related to Analytical Statistics
Question
23 answers
I need suggestions for groundwater-assessment articles that used discriminant analysis in their study, as well as guidance on how to apply this analysis in R.
Reghais.A
Thanks
Relevant answer
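For the R part of the question, a minimal sketch of linear discriminant analysis with the MASS package, shown on the built-in iris data so it runs as-is; replace the data frame and class column with your groundwater groups and hydrochemical variables.
library(MASS)
fit <- lda(Species ~ ., data = iris)       # class variable ~ all numeric predictors
fit                                        # discriminant functions and group means
pred <- predict(fit)
table(observed = iris$Species, predicted = pred$class)   # resubstitution confusion matrix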
  • asked a question related to Analytical Statistics
Question
4 answers
I'm doing a study comparing data from two orbital sensors, and in the study I'm basing mine on there is this normalization formula for the rasters: ((Bi <= 0) * 0) + ((Bi >= 10000) * 1) + ((Bi >= 0) & (Bi < 10000)) * Float((Bi)/10000), where "Bi" means "band". Is there someone who understands and could explain this formula? Thanks very much.
Relevant answer
Answer
It just maps values less than or equal to 0 to 0, and values greater than or equal to 10000 to 1, while anything in between ends up between 0 and 1 because the value is divided by 10000.
This way of writing the formula is conditional logic without using an if statement.
For example: (x>=0)*0 + (x<0)*1 evaluates to 0 for non-negative numbers and 1 for negatives; the expression (x>0) is either 1 or 0, depending on the value of x.
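The same formula written out as a small R function, just to make the three cases explicit:
normalize_band <- function(b) {
  (b <= 0) * 0 +                          # at or below 0 -> 0
  (b >= 10000) * 1 +                      # at or above 10000 -> 1
  (b >= 0 & b < 10000) * (b / 10000)      # in between -> scaled to 0-1
}
normalize_band(c(-50, 0, 2500, 10000, 12000))
### 0.00 0.00 0.25 1.00 1.00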
  • asked a question related to Analytical Statistics
Question
3 answers
I have six kinds of compounds which I tested for antioxidant activity using the DPPH assay and for anticancer activity on five types of cell lines, so I have two types of data:
1. Antioxidant activity data
2. Anticancer activity (5 types of cancer cell line)
Each data set consists of 3 replications. Which correlation test is the most appropriate to determine whether there is a relationship between the two activities?
Relevant answer
Answer
Just doing logistic regression is what I had in mind. The DV might be anticancer activity (yes/no), and the same for antioxidant activity. Best wishes, David Booth
  • asked a question related to Analytical Statistics
Question
2 answers
I want to know the list of residual (error) tests for time series models, and also the list of stationarity tests for time series data.
Relevant answer
Answer
I agree with Medhat about the book. It will give you the basics of what you want. For stationarity tests you can use the Python library pmdarima. It has useful, automatic tests of time series stationarity, but the book will help you understand the test results.
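If R is your environment instead, the usual checks look like this; a minimal sketch on a simulated AR(1) series.
library(tseries)
set.seed(1)
x <- arima.sim(model = list(ar = 0.5), n = 200)
fit <- arima(x, order = c(1, 0, 0))
Box.test(residuals(fit), lag = 10, type = "Ljung-Box")  # residual autocorrelation
adf.test(x)    # Augmented Dickey-Fuller: H0 = unit root (non-stationary)
kpss.test(x)   # KPSS: H0 = (level) stationarity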
  • asked a question related to Analytical Statistics
Question
6 answers
I am looking at gender equality in sports media. I have collected two screen time measures from TV coverage of a sport event - one time for male athletes and one time for female athletes.
I am looking for a statistical test to give evidence that one gender is favoured. I assume I have to compare each gender's time against the EXPECTED time given a 50/50 split (so (male time + female time) / 2), as this would be the time if no gender was favoured.
My first thought was chi-square? But I'm not sure that works because there's really only one category. I am pregnant and so my brain is not working at the moment lol. I think the answer is really simple but I just can't think of anything.
Relevant answer
Answer
An independent-samples t-test would be best.
  • asked a question related to Analytical Statistics
Question
4 answers
If I want the annual average of the country's oil production for 2019 and I have 25 stations,
1 - should I take the sum (of 12 months) for each station individually, so that I get the annual sum for each station, and then divide by 25 to calculate the country average, or
2 - should I take the sum of January for the 25 stations, then February, etc., and then divide by 12 (the number of months) to get the annual average for the country?
Relevant answer
Answer
These are 2 different averages. The numerator is the same for both 1 and 2: it is the sum of the production of the 25 stations over 12 months, i.e. the total annual production of all 25 stations.
Dividing this numerator by 25 gives you the annual average production per station.
Dividing it by 12 gives you the average monthly production of all 25 stations.
There is no single correct average. The average depends on how you define it and what you want to characterize: production per station or production per month.
  • asked a question related to Analytical Statistics
Question
4 answers
Hi, I have data from various patients; each patient has 7 values corresponding to different time points, and I would like to average over all the patients in GraphPad to create an XY plot that represents the average of all patients for each X value, but I don't know how to do it. I have all the datasets and the graph for each patient alone in separate GraphPad documents. Is there a way to average all of them in GraphPad without arranging the values manually?
Thank you in advance
Relevant answer
Answer
Hi Elena,
You can choose a grouped graph, so that you enter the patients' names/codes in the row section and the data (X time points) in the column part.
This will make a master sheet for you, and will make it easier for you to copy and paste only the parts you want into an XY graph.
I am not aware of any other way to do the XY plot all at once.
Best,
  • asked a question related to Analytical Statistics
Question
7 answers
Hello everyone,
I need help understanding whether my two groups are paired or not.
I am collecting data from one group of cells. We have developed two different workflows (let's call them A, and B) for data analysis. We want to test whether these two workflows give the 'same' results for the same set of cells.
At the end, I obtain:
  1. Group 1 (contains the variable obtained with workflow A)
  2. Group 2 (contains the variable obtained with workflow B).
I have been considering the two groups as independent because the two workflows do not interfere with each other. However, the fact that both workflows operate on the same cells is throwing me off and I am wondering if these groups are actually paired.
Could you advise me on this and on what test is best to use?
The hypothesis for the test would be:
  • the distributions of the variable is the same with both workflow A and B; and/or
  • the median of the distribution from workflow A equals the one from workflow B
Thank you.
GN
Relevant answer
Answer
If you have two samples, from two measurement methods (workflows), but from the same subject (single group of cells), then you can use a paired samples test.
The benefit of this is that it has more statistical power than an unpaired test.
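A minimal sketch of the paired comparison in R, assuming `a` and `b` hold the per-cell values from workflow A and workflow B in the same cell order (simulated here just so the code runs):
set.seed(1)
a <- rnorm(30, mean = 10)                   # workflow A, one value per cell
b <- a + rnorm(30, mean = 0.1, sd = 0.3)    # workflow B, same cells
t.test(a, b, paired = TRUE)                 # paired t-test on the mean difference
wilcox.test(a, b, paired = TRUE)            # Wilcoxon signed-rank test if normality is doubtful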
  • asked a question related to Analytical Statistics
Question
8 answers
Hi, I have this model with many parameters (variables), and I wonder if there is a statistical method to determine how big the influence of each variable is. Does anyone have an idea? Thanks.
Relevant answer
Answer
What is the model? The beta coefficients tell you the size of the "influence". Are you asking how to select features to include in the model if you have too many variables? Search for feature engineering; EDA should point you in the direction of what to include.
  • asked a question related to Analytical Statistics
Question
6 answers
Is it possible to determine a  regression equation using SmartPLS 2?
Relevant answer
Answer
A basic regression equation looks like this:
y= a + beta X, where a is a constant/intercept.
A classical regression requires several underlying assumptions, mainly normality, to make predictions using unstandardized data. However, SmartPLS doesn't follow this parametric assumption and only deals with standardized data (putting different variables on the same scale, to be consistent). Since SmartPLS aims to maximize the variance explained in the criterion variable (rather than producing an equation to predict an absolute number), the constant of a regression equation is not directly available in SmartPLS.
The regression equation can be produced using SPSS, R or even Excel, and the coefficients and significance levels will not be too different from the SmartPLS output.
  • asked a question related to Analytical Statistics
Question
6 answers
Howdy.
I am currently working with a set of samples, divided in 3 consecutive phases (8 samples for phase 1, 9 samples for phase 2 and 6 samples for phase 3). My data are homoscedastic and not normally distributed. What test does SPSS (version 21) employ to analyse the pairwise comparisons after the Kruskal-Wallis’ test? Are the values I get (from the post hoc testing) the result of a Dunn’s test? If that is so, how should I report them on my abstract? Something like “Subsequent pairwise comparisons with the Dunn’s test showed a significant increase between phase 1 and phase 2 (p < 0.05)” or should I take into account even the value in the first column (the one labelled as “test statistic”, which I highlighted in red in the attached image)?
Is it correct to use this kind of post hoc testing for my data or should I employ some other kind of test (Behrens-Fisher test or Steel Test – Nonparametric Multiple Comparisons), since I got a different number of samples for each phase?
Thank you
Relevant answer
Answer
Hi,
I know this response comes 7 years too late, but in the Kruskal-Wallis pairwise comparisons, Dunn's post hoc is the test that SPSS applies.
Best,
  • asked a question related to Analytical Statistics
Question
5 answers
1. Is randomizing subjects to one of the four groups a must?
2. What statistical analysis can be used? Could these tests be affected if samples are not randomized? Which statistics is preferable?
Relevant answer
Answer
how do we justify Solomon four-group as Quasi-experimental?
  • asked a question related to Analytical Statistics
Question
4 answers
As per the attachment: There are three sets of students. Each set is evaluated by individual judges. Hence the marks are very varied in the three sets. How to give rational marks to all? I wish to make the highest and lowest marks of all three sets equal, and the other marks follow. Is this possible?
Relevant answer
Answer
There is no definitive way to do what you want to do because the effect of the raters is conflated with the effect of the students. That is, Rater B gave low ratings, but you don't know if that's because Rater B always rates students low or because the students that Rater B happened to get were poor students. Ideally, you would have each student rated by at least two raters, but preferably have each student rated by all of the raters. ... One thing you could do is assume that the highest-rated student for each rater is really good and the lowest for each rater is really not good. You can re-code the ratings for each rater so that the highest is 10 and the lowest is 1. This may not be fair to all students, though. Or you could modify this, for example, by making sure no student gets a lower score than they did originally.
You can run the following code at this website, without installing R: https://rdrr.io/snippets/ . Lines beginning with # are comments and don't need to be run.
library(rcompanion)
A = c(5, 4, 9, 7, 8, 10)
blom(A, method="scale", min=1, max=10)
### 2.5 1.0 8.5 5.5 7.0 10.0
B = c(3, 5, 6, 5, 4, 5)
blom(B, method="scale", min=1, max=10)
### 1 7 10 7 4 7
C = c(2, 9, 8, 7, 6, 8)
blom(C, method="scale", min=1, max=10)
### 1.000000 10.000000 8.714286 7.428571 6.142857 8.714286
  • asked a question related to Analytical Statistics
Question
5 answers
Hello, there is a dataset with several KPIs, each varying between (0, 1). What is the best analytical approach to split the data and define a line in two-dimensional space (or a plane in multi-dimensional space) based on the data behavior and practical assumptions/considerations (there are recommended ranges for each KPI, etc.)?
For instance in the attached screenshot, I want to flag the individuals/observations in Ae area for more investigation. I want to be able to apply the proposed approach in multi-dimensional space with several KPIs as well. Any thoughts would be appreciated.
Relevant answer
Answer
If you want to 'flag' individuals by some data-driven approach, building a cluster tree could be helpful, and you can visually check whether the cluster tree flags the observations you were anticipating. This can also help avoid Simpson's paradox and similar issues.
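A minimal sketch of the cluster-tree idea in R, shown on the built-in USArrests data so it runs as-is; with your own KPI table, the point of scale() is that no single KPI dominates the distances.
d <- dist(scale(USArrests))          # standardize each column, then Euclidean distances
tree <- hclust(d, method = "ward.D2")
plot(tree)                           # inspect the dendrogram
groups <- cutree(tree, k = 4)        # cut into, say, 4 clusters and flag the cluster of interest
table(groups)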
  • asked a question related to Analytical Statistics
Question
7 answers
Please, I am working on a pre- and post-survey for which I obtained the results shown in the attachments (test of normality and paired-samples t-test). Can you help me write the interpretation of these results so that I can add it to my scientific article?
Thank you very much in advance.
Relevant answer
Answer
The confidence interval provides a range of likely values for the population mean difference. For example, a 95% confidence level indicates that if you take 100 random samples from the population, you could expect approximately 95 of the samples to produce intervals that contain the population mean difference. The confidence interval helps you assess the practical significance of your results. Use your specialized knowledge to determine whether the confidence interval includes values that have practical significance for your situation. If the interval is too wide to be useful, consider increasing your sample size. For more information, go to Ways to get a more precise confidence interval.
  • asked a question related to Analytical Statistics
Question
3 answers
Hello, in one of my projects I conducted a questionnaire on students' skills before the project (pre survey), and after the completion of the project I conducted a post-project survey.
I calculated the results of the questionnaire (the percentage increase in the level of each skill),
but I have no experience in interpreting these results.
If you can help me or provide me with publications in this area.
Thank you.
Relevant answer
Answer
If the study was based on the same subjects (or students) then the pre-survey mean could be compared with the post-survey mean by a paired-t test.
  • asked a question related to Analytical Statistics
Question
6 answers
I want to do a descriptive analysis using the World Values Survey dataset, which has N = 1200. However, even though I have searched a lot, I haven't found the methodology or a tool to calculate the sample size I need to get meaningful comparisons when I cross variables. For example, I want to know how many observations I need in every category if I want to compare the social position attributed to the elderly across sex AND across ethnic group. That is (to be even more concrete), the difference between black and indigenous women on my variable of interest. What if I have 150 observations for black women? Is that enough? How do I set the threshold?
Expressing my gratitude in advance,
Santiago.
Relevant answer
Answer
When you divide a sample into subgroups, your maximum power is where the groups are of equal size. So the first step is to calculate power for two samples of 600.
Where the groups are of unequal size, power goes down compared with the ideal case of a 50:50 split. With a 60:40 split, the effective sample is reduced by only 4%, but as you get to 80:20, the reduction is 36% and at 90:10 it's 64%. So with a 90:10 split your power is what you would have with a sample that was 64% smaller split 50:50 between the groups.
Effective sample size calculations are very simple. The ideal sample is 0.5 x 0.5, which is 0.25. A 30:70 sample is 0.3 x 0.7, which is 0.21, i.e. 16% lower than 0.25, and so on.
  • asked a question related to Analytical Statistics
Question
5 answers
When should household panel data be weighted to increase the representativeness of the data? Does this depend on the number of reporting households?
Relevant answer
Answer
The weighted mean is used in meta-analysis. It gives higher weight to more accurate studies and lower weight to less accurate ones.
  • asked a question related to Analytical Statistics
Question
9 answers
I am researching gender bias in sports media and have done a survey involving 8 sets of 4 images of athletes (4 male and 4 female sets), each followed by 3 questions. Participants had to select which image they thought best fit each of the 3 questions (so I ended up with 8 answers to each question).
I'm struggling to figure out how to analyse my data. I need to keep the data in terms of 'number of times this image was chosen', so I need it as whole numbers (image 1, 2, 3, 4), but everything I try gives me the mean answer from 1-4 across ALL the images for the question.
Questions I am trying to answer are:
Was a certain image chosen more often in the female athlete sets than the male athlete sets (and vice versa)?
Did male and female participants differ from each other in their responses? (Was one gender more likely to select one type of image than the other gender?)
Happy to answer follow-up questions. I feel like the answer is simple, but I haven't done statistical analysis in ages and I just can't think of anything.
Relevant answer
Answer
Daniel Wright 8 athletes, different settings/outfits. n around 60 and question is basically just do attitudes towards male and female athletes differ within the particular sport’s community - the survey is only one part of the overall study, but the other parts aren’t relevant to the analysis of this bit.
  • asked a question related to Analytical Statistics
Question
3 answers
Hello. I am struggling with a problem. I can measure two ratios of three independent normal random variables with non-zero means and variances: Z1 = V1/V0, Z2 = V2/V0, with V0 ~ N(m0, s0), V1 ~ N(m1, s1), V2 ~ N(m2, s2). These are measurements of the speeds of a vehicle. Now I need to estimate the means and the variances of these ratios. We can see this is a Cauchy-type distribution with no mean and variance, but it has analogues in the form of location and scale. Are there mathematical relations between mean and location, and between variance and scale? Can we approximate a Cauchy by a Normal? I have heard that if we bound the estimated value we can obtain a mean and variance.
Relevant answer
Answer
Well, the Cauchy distribution has no finite moments. You might be interested in the attached Google search, which is about robust statistical methods, but I would probably try the first reference in the second search first. Good luck if you do. David Booth
  • asked a question related to Analytical Statistics
Question
3 answers
I have purchase data from supermarkets that include a variable for weighting. This weighting is supposed to represent how representative a household is.
I want to separate my data into two very unequal groups and aggregate the values into months. Can or should I use the weighting here?
Relevant answer
Answer
I find it a little difficult to answer this as you have not spelled out your aims.
I would keep the detailed data (presumably repeated measures over occasions on households?) and then include a set of dummies to identify the month and a dummy to identify the two groups (using between and within random effects for households and occasions). An interaction between the dummies would then allow an assessment of in which months the dummied group has a different mean. In this form I would then do a weighted and unweighted analysis as a sensitivity test to see what is the scale of the difference. Of course the weighted and unweighted analyses represent a different target of inference. By using random effects you could also include such variables as type of household if this was of interest. Using a mixed model, you could include an underlying 'smooth' (such as a spline) to capture the underlying time trend in the purchasing behaviour.
  • asked a question related to Analytical Statistics
Question
12 answers
Hello,
I am performing a statistical analysis of my research data, comparing the mean values using the Tukey HSD test. I got homogeneous groups labelled with both lowercase and capital letters, because of the large number of treatments in my study. Is this type of homogeneous grouping acceptable for publication in a journal?
Relevant answer
Answer
You can use SPSS for this analysis, but it is mostly done in the Statistix 8.1 program.
  • asked a question related to Analytical Statistics
Question
7 answers
Just getting a gauge from various sides of the community regarding which statistical analysis method is underrated.
Thank you.
Relevant answer
Answer
Cluster analysis.
  • asked a question related to Analytical Statistics
Question
19 answers
Hi everyone.
I have a question about finding a cost function for a problem. I will ask the question in a simplified form first, then I will ask the main question. I'd be grateful if you could help me find the answer to either or both of the questions.
1- What methods are there for finding the optimal weights for a cost function?
2- Suppose you want to find the optimal weights for a problem where you can't measure the output (e.g., death). In other words, you know the factors contributing to death, but you don't know the weights, and you don't know the output because you can't really test or simulate death. How can we find the optimal (or sub-optimal) weights of that cost function?
I know it's a strange question, but it has many applications if you think about it.
Best wishes
  • asked a question related to Analytical Statistics
Question
7 answers
Hello,
I replicated a study in which participants are asked to rate the importance of some user experience dimensions (like efficiency, usefulness, perspicuity, etc.) for a product (0-7 rating).
I added some new dimensions like ease of use. How can I statistically determine whether the added dimension measures something new? It is obvious that the new dimension has a strong correlation with the other pragmatic dimensions. My question is how to show that the new dimension is worth including and is different from the current ones.
P.S. There is no list of items per dimension. Definitions of the dimensions are provided and participants just rate their importance.
Thank you
Relevant answer
Answer
You can calculate the rank of the data matrix before and after adding the new variable. If it makes a difference, the new variable can be considered a new dimension.
  • asked a question related to Analytical Statistics
Question
8 answers
Hello,
I am a bit confused about the naming and calculation of validation parameters.
Am I right that accuracy is a qualitative (not quantitative) parameter and is a combination of trueness and precision?
Am I right with this calculation of trueness:
ref / x_avg,
where ref is the reference value and x_avg is the average of the values (but sometimes I read the median)?
And precision depends on systematic error and is calculated as
s / x_avg,
where s is the standard deviation based on a sample (or the entire population?) and x_avg is the average of the values.
Thanks for every answer
Relevant answer
Answer
Dear Denisa Macková,
This is answers to your questions.
1. You are right that accuracy is a qualitative concept (as well as its constituents, trueness and precision) and is described in qualitative terms such as “good” or “bad”. Nevertheless, accuracy can be quantified. The two separate measures of accuracy, the estimated standard deviation and the bias, should be evaluated. These two figures cannot be rigorously combined in any way to give an overall index of accuracy in view of the different nature of the two error components, random and systematic.
2. Trueness is usually expressed quantitatively in terms of bias (B) that is the difference between the mean of the results (x(mean)) and a suitable reference value (x(ref)). Other related measures to express trueness are the relative bias B(%) = (x(mean) – x(ref))·100/x(ref), the recovery R(%) = x(mean)·100/x(ref), and the spike recovery R(spike)(%) = (x(mean)' – x(mean))·100/x(ref) where x(mean)' and x(mean) are the mean values of spiked and unspiked samples, respectively.
3. A common measure of precision is the standard deviation s (and the relative standard deviation RSD = s/x(mean) also known as CV(%)). Another measure of precision is the precision limit that is the 95 % confidence interval for the difference between two replicate results, calculated as 2.8 s. Note that in estimating precision, the measurement conditions should be clearly specified, with repeatability conditions and reproducibility conditions representing the two extreme cases of variability in results.
Best regards,
Rouvim Kadis
  • asked a question related to Analytical Statistics
Question
3 answers
Dear colleagues,
I have been helping analyse a sustainability project that compares % of biomass in composts.
As the design is 5x2 (4 replicates each; that's what I was given) with 15 predictors, I'm using PERMANOVA and will later analyse the power to see if the analysis is valid.
However, the variables (chemical compounds and physical characteristics) have different units and quite different value ranges, and I need to standardize them (I'm using z-scores).
Have been looking for a while, but can't find an answer to the questions:
Should I apply the standardization by variable, meaning use each variable's own mean and standard deviation, or should I use the central point (the mean and standard deviation of the whole dataset, applied to each measurement)?
They give me different results and I would like to be able to support the choice I will make.
Would love to hear some insights and references into that.
All the best,
Erica
Relevant answer
Answer
Hello Erica,
Standardize within each variable, so that each has a mean of 0 and SD of 1. The consequence of not standardizing is that the variable(s) with largest SD(s) will exert more influence on measured distances between cases.
Good luck with your work.
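A minimal sketch of that choice in R; the matrix `vars` (15 variables on very different scales) and the factor `treatment` below are simulated stand-ins, and scale() standardizes each column, i.e. per variable, which is the option recommended above.
library(vegan)
set.seed(1)
treatment <- factor(rep(LETTERS[1:5], each = 8))
vars <- matrix(rnorm(40 * 15, mean = rep(1:15, each = 40), sd = rep(1:15, each = 40)),
               nrow = 40)
vars_z <- scale(vars)                                       # per-variable z-scores (mean 0, SD 1)
adonis2(vars_z ~ treatment, method = "euclidean", permutations = 999)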
  • asked a question related to Analytical Statistics
Question
10 answers
Imagine there is a surface, with points randomly spread all over it. We know the surface area S, and the number of points N, therefore we also know the point density "p".
If I blindly draw a square/rectangle (area A) over such a surface, what is the probability that it will encompass at least one of those points?
P.s.: I need to solve this "puzzle" as part of a random-walk problem, where a "searcher" looks for targets in a 2D space. I'll use it to calculate the probability the searcher has of finding a target at each one of his steps.
Thank you!
Relevant answer
Answer
@Jochen Wilhelm, the solutions are not equivalent because
For Poisson: P(at least one point) = 1 - P(K=0) = 1 - e^(-N/S*A)
For Binomial: P(at least one point) = 1 - ( (S - A)/S )^N
The general formula for the Binomial case is the following:
P(the rectangle encompasses k points)=(N choose k) ( A/S )^k ( (S - A)/S )^(N - k)
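A quick numerical comparison of the two expressions in R, with arbitrary illustrative values:
S <- 100; A <- 4; N <- 50
1 - exp(-(N / S) * A)        # Poisson approximation:  ~0.865
1 - ((S - A) / S)^N          # exact binomial form:    ~0.870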
  • asked a question related to Analytical Statistics
Question
15 answers
I have a dataset of particulate concentration (A) and the corresponding emissions from cars (B), factories (C) and soil (D). I have 100 observations of A and the corresponding B, C and D. Let's say no other factors contribute to the particulate concentration (A) besides B, C and D. Correlation analysis shows A has a linear relationship with B, an exponential relationship with C and a logarithmic relationship with D. I want to know which factor contributes most to the concentration of A (the predominant factor). I also want to know whether a model can be built like the following equation, from the dataset I have:
A = m*B + n*exp(C) + p*log(D), where m, n and p are constants.
Relevant answer
Answer
Maybe you can consider the recursive least squares algorithm (RLS). RLS is the recursive application of the well-known least squares (LS) regression algorithm, so that each new data point is taken in account to modify (correct) a previous estimate of the parameters from some linear (or linearized) correlation thought to model the observed system. The method allows for the dynamical application of LS to time series acquired in real-time. As with LS, there may be several correlation equations with the corresponding set of dependent (observed) variables. For the recursive least squares algorithm with forgetting factor (RLS-FF), acquired data is weighted according to its age, with increased weight given to the most recent data.
Years ago, while investigating adaptive control and energetic optimization of aerobic fermenters, I have applied the RLS-FF algorithm to estimate the parameters from the KLa correlation, used to predict the O2 gas-liquid mass-transfer, hence giving increased weight to most recent data. Estimates were improved by imposing sinusoidal disturbance to air flow and agitation speed (manipulated variables). The power dissipated by agitation was accessed by a torque meter (pilot plant). The proposed (adaptive) control algorithm compared favourably with PID. Simulations assessed the effect of numerically generated white Gaussian noise (2-sigma truncated) and of first order delay. This investigation was reported at (MSc Thesis):
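On the second part of the question (building a model of the stated form), one simple option is ordinary least squares on transformed predictors; a minimal R sketch with simulated data standing in for the 100 observations.
set.seed(1)
dat <- data.frame(B = runif(100, 0, 10), C = runif(100, 0, 3), D = runif(100, 1, 100))
dat$A <- 2 * dat$B + 0.5 * exp(dat$C) + 3 * log(dat$D) + rnorm(100)
fit <- lm(A ~ B + I(exp(C)) + I(log(D)), data = dat)
summary(fit)                                   # estimates of m, n and p
# Standardized coefficients give a rough ranking of relative influence:
z <- data.frame(A = scale(dat$A), B = scale(dat$B),
                expC = scale(exp(dat$C)), logD = scale(log(dat$D)))
coef(lm(A ~ B + expC + logD, data = z))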
  • asked a question related to Analytical Statistics
Question
4 answers
Hi,
I am going to treat mice with a drug and want to see whether there is any effect of the intervention compared to sham controls in a mouse model. Do you have any idea about a priori power calculation tools/methods used in animal intervention studies?
Many thanks,
Nirmal
Relevant answer
Answer
George Vasilakos, thanks for noticing. I fixed it. The new link should work.
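For a simple two-group design, base R's power.t.test gives an a priori estimate; here assuming, purely for illustration, that the expected drug effect is about one standard deviation in size:
power.t.test(delta = 1, sd = 1, sig.level = 0.05, power = 0.8)
### n is approximately 17 animals per group (two-sided test)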
  • asked a question related to Analytical Statistics
Question
7 answers
Hello Everyone,
I need to compare two sets of sensor readings and determine whether the two patterns are the same or not. The values are measured each day, and the number of data points for each day isn't the same,
e.g. day 1: 700 data points,
day 2: 1000 data points.
Data are collected at the same time interval, but the sensor didn't capture everything, which is why the number of data points varies.
Similarity here means how close the day 1 pattern is to the day 2 pattern.
As these data points don't follow any standard distribution, I have applied multiple non-parametric tests (Kruskal-Wallis, Mann-Whitney, etc.), but these tests aren't consistent. Can anyone recommend how to proceed here, or what the best approach to this problem is?
I have attached the sample plot for two different dates.
Relevant answer
Answer
Since the only practical definition you offer is that the patterns are the same, you should be able to reject the hypothesis that they are exactly the same in their populations even without a test.
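One technique worth naming (not mentioned in the thread) for comparing series of unequal length is dynamic time warping, which aligns the two patterns before measuring their distance; a minimal sketch with the R dtw package on two simulated "days":
library(dtw)
day1 <- sin(seq(0, 2 * pi, length.out = 700))  + rnorm(700, sd = 0.05)
day2 <- sin(seq(0, 2 * pi, length.out = 1000)) + rnorm(1000, sd = 0.05)
alignment <- dtw(day1, day2)
alignment$normalizedDistance    # smaller = more similar; comparable across pairs of days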
  • asked a question related to Analytical Statistics
Question
6 answers
I am conducting a design analysis of simplex-centroid mixtures, with 7 points and 2 repetitions of the central point, totalling 9 tests.
When I perform regression analysis, the only model that fits is the special cubic; however, the Total Error and Pure Error values are the same, so the lack-of-fit value is equal to 0.
I would like to know whether the fit of the special cubic model is adequate for the responses obtained.
How can I know whether I have generated an over-fitted equation?
Relevant answer
Thank you all!
  • asked a question related to Analytical Statistics
Question
4 answers
I have a very simple model where I measure the effect of tone of voice on purchase intention, moderated by brand alignment. I have a 2x2 between subjects model, with tone of voice levels being informal vs formal, and brand alignment levels being warm vs competent.
When I run a two-way ANOVA, I see that the main effect of tone of voice is not significant. However, when I run a one-way ANOVA, this effect is significant. Can someone explain why this is, and whether it would be incorrect to report the one-way ANOVA?
This is for my bachelor's thesis.
  • asked a question related to Analytical Statistics
Question
8 answers
I consider three readers pairwise in a circular fashion in Bland-Altman 2-by-2 comparisons; therefore I obtained 3 Bland-Altman analyses with corresponding Limits of Agreement (and confidence intervals). In order to get more accurate estimates, I would like to average the three Limits of Agreement, but I am not sure this is correct. That's my first question. My second concern is about the confidence interval when averaging standard deviations that themselves have CIs. How do I compute the confidence interval for an averaged standard deviation (or Limit of Agreement)? Thanks for the help.
Relevant answer
Answer
Interesting.
  • asked a question related to Analytical Statistics
Question
11 answers
So I'm doing a meta-analysis, and I have some questions about one of the included studies.
This is the data for one group(lower intensity group):
pre intervention mean(SD): 274(70.5)
post intervention mean(SD): 286.7(73.9)
mean difference pre and post (95%CI): 12.7 (-27.1, 40.0)
This is the data for other group(higher intensity group)
pre intervention mean(SD): 267.2(61.3)
post intervention mean(SD): 291.3(63.7)
Mean difference pre and post(95%CI): 24.2(-12,63.7)
As you can see, the 95% CI for the pre-post difference is asymmetrical around the mean difference. What causes this?
I'm comparing the mean differences (between pre and post) of the higher-intensity group with those of the lower-intensity group using RevMan; how do I use these data for the meta-analysis?
Here's the link to the study: (table 2, 6mwd)
Relevant answer
Answer
It doesn't mean anything. Not all methods of calculating confidence intervals yield symmetrical results - for example if means were log transformed or converted to some kinds of effect sizes during calculation or if they were bootstrapped, there will often be an asymmetrical CI. I can't tell you the exact reason in this case given that you haven't said how the CI was derived, but it doesn't have any implicit meaning. You can still use this data for the analysis without any complications.
For RevMan, you can just input the mean and SD for each study arm and the MD will be calculated automatically.
  • asked a question related to Analytical Statistics
Question
19 answers
We have some experimental data on the mechanical strength of rock material. We compare these data with the estimated strength (calculated using several existing criteria) and also determine the error percentage for each criterion.
So I want to know:
what is the maximum percentage of error that is acceptable for rock mechanics purposes, especially when we compare the experimental data with the estimated ones?
Relevant answer
Answer
The error for my study (prediction of unconfined compressive strength, 1120 data points) is about 30%.
Do you have any reference regarding your answer that may help me?
  • asked a question related to Analytical Statistics
Question
11 answers
I'm working in a lab that is currently doing some research on 2 species of duckweed and I did a simple experiment to compare whether or not a certain way of cleaning jars has an effect on the growth of duckweed. The data is an exponential distribution since duckweed has exponential growth.
I've attached a picture of my data along with the equations of the trendlines.
I'm having trouble trying to figure out the best way to determine whether the differences between the sets of data for each species are significant. How should I go about comparing them? (I know that the method labeled "washer" appears to have less growth, but I want to make sure that the difference is statistically significant.)
I've been searching around the internet but I haven't really found anything that makes complete sense to me.
Relevant answer
Answer
Babak Jamshidi, "distribution" is the wrong word used by Kitt Kroeger. He obviously meant an exponential relationship between time and frond count, and the question was how to test the difference in growth rates between two exponential curves.
The distribution of the response, frond count, should actually be Poisson or negative binomial. However, the variance does not seem to increase with increasing means, which is a bit strange. This is difficult to assess from the plot because (I suppose) only averages are shown.
I also take Kitt's word for the exponential relationship. The data seem to follow more of a logistic model (with seemingly different asymptotes for the two groups; this might be an interesting aspect to explain...). But this again is hard to tell from the very few available data points and without subject-matter knowledge.
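If the raw frond counts per jar are available, one way to test for a difference in growth rate (rather than eyeballing trendlines) is a Poisson or negative binomial GLM with a log link: exponential growth becomes a linear effect of time, and the time-by-method interaction is the growth-rate difference. A minimal sketch in R with hypothetical data and column names:

# hypothetical data: frond count, day of measurement, jar-cleaning method
d <- data.frame(
  count  = c(10, 18, 33, 60,  9, 14, 22, 35),
  day    = c(0, 3, 6, 9,  0, 3, 6, 9),
  method = rep(c("control", "washer"), each = 4)
)
fit <- glm(count ~ day * method, family = poisson, data = d)
summary(fit)   # the day:methodwasher term tests for a difference in growth rate
# if the counts are overdispersed, MASS::glm.nb(count ~ day * method, data = d) is an alternative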
  • asked a question related to Analytical Statistics
Question
4 answers
Hello everyone, I would like to ask whether the way that the sample size of this research was calculated is valid or correct. It is a study to evaluate the effect of gargling with povidone iodine among COVID-19 patients. The text says “For this pilot study, we looked at Eggers et al. (2015), using Betadine gargle on MERS-CoV which showed a significant reduction of viral titer by a factor of 4.3 log10 TCID50/mL, and we calculated a sample size of 5 per arm or 20 samples in total”. From this data on the reduction of the viral titer in a previous study on MERS-CoV, is it valid to calculate the sample size this way for a new study on COVID-19?
Relevant answer
Answer
There are many different ways to estimate the sample size, and you can select the one that suits your research.
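For example, if the planned analysis is a comparison of mean log10 viral titres between two arms, a conventional calculation in R would look like the sketch below; the assumed difference and SD are placeholders that would have to be justified from prior data such as Eggers et al. (2015), so this only illustrates the mechanics:

# assumed true difference in log10 TCID50/mL and assumed SD -- placeholders, not study values
power.t.test(delta = 2, sd = 1, sig.level = 0.05, power = 0.8)
# the output's n is the number required per arm for a two-sample t-test under these assumptions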
  • asked a question related to Analytical Statistics
Question
8 answers
Hello, I need some help on what statistical test I should use for my data analysis. I have a Data set A, which is an array of 5000 numbers, all of which are zero, and a Data set B, which is an array of random continuous numbers that do not necessarily follow a normal distribution. Both data sets can be plotted on a histogram for visual aid. Data set A is my "ideal" and Data set B is my "measured" - I would like to compare the similarity of Data set B to Data set A (ideally it would be a single output figure such as a % similarity). I would then go on to test another Data set C (the same style of array as Data set B - it does not have a normal distribution and contains continuous numbers) and compare its % similarity with Data set A. I would then be able to make a "ranking" on whether Data set B or Data set C was most similar to Data set A. Some of the considerations:
- The similarity value has to account for the shape (i.e. the histogram of Data set B will rank with a higher similarity % the closer it is to a vertical straight line as shown in Data set A)
- The similarity value has to account for x-axis distance on the histogram (i.e. the further from zero, the poorer the % similarity to Data set A)
- The weighting of each has to be equal (i.e. neither the shape nor the distance on the x-axis is more important)
- Because the weighting is equal, if Data set B was a straight line at -5, it should have the same % similarity to Data set A as if it had been a straight line at +5
- The order of the values in array B does not matter
I'm essentially trying to rank data sets against the "ideal" data set A (but taking into account non normal distribution, histogram shape similarity, distance etc). I have no idea what test to apply that can give me a % similarity to the ideal under these conditions.
Thank you so so so much.
Relevant answer
Answer
Use both KS (Kolmogorov–Smirnov) and KL (Kullback–Leibler divergence), but also remember that percentages and elasticities are possibilities (depending on the type of data).
  • asked a question related to Analytical Statistics
Question
5 answers
I have tested the effect of EO on 4 different bacteria and I have repeated it 5 times. I am not sure whether I can use a t-test, as it usually compares two groups and not 4. If not, what can I use instead?
Relevant answer
Answer
Ladan Majdi, you can use analysis of variance methods to do this. A complete review of these methods can be found in its Wikipedia article and this wonderful review article by Martin G. Larson
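A minimal sketch of that suggestion in R, assuming the response is something like an inhibition-zone diameter recorded for each of the 4 bacteria across the 5 repeats (toy numbers, hypothetical names):

d <- data.frame(
  response  = c(12, 14, 13, 15, 11,  20, 22, 19, 21, 23,  8, 9, 7, 10, 9,  16, 15, 17, 14, 16),
  bacterium = rep(c("A", "B", "C", "D"), each = 5)
)
fit <- aov(response ~ bacterium, data = d)
summary(fit)      # overall test: does the mean response differ between the 4 bacteria?
TukeyHSD(fit)     # pairwise comparisons with family-wise error control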
  • asked a question related to Analytical Statistics
Question
3 answers
Hello,
So I am supposed to do a study in order to answer a statistical question regarding the association between two categorical variables, or whether the 2 variables are dependent. I did a survey with 2 multiple-choice questions (one with 3 choices and the second with 4). When we hear about association and categorical data we usually go for chi-square, and this is what we were asked to use. However, my sample is too small (N=50) and as a result I had a contingency table as in the attachment:
As a result of the values I got, I cannot use the chi-square test because about 50% of my cells have expected values of less than 5. While I was trying to find an answer, I saw someone stating that if I combined 2 columns I could get higher expected frequencies and as a result could use the chi-square test. I did combine A and B and as a result I had higher expected values in all cells. Will this affect the results of my hypothesis test, and what should I do in this case? Which test shall I use? Thank you in advance.
NB: I want to do it using manual calculation, not SPSS. Kindly look at the second reply for clarification.
Again thank you
Relevant answer
Answer
You can run the Fisher exact test.
In the attachment is a script for such a test using R. This yields a p-value = .0005.
The script includes a pairwise comparison post-hoc test, with Benjamini–Hochberg adjustments. It is inspired by the R Companion:
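The attached script itself is not reproduced here, but a minimal sketch of the same idea in R (hypothetical counts; substitute the actual 3x4 table):

tab <- matrix(c(5, 2, 7, 4,
                3, 6, 2, 5,
                1, 8, 4, 3), nrow = 3, byrow = TRUE)   # hypothetical 3 x 4 table, N = 50
fisher.test(tab)                                       # exact test, no minimum-expected-count rule
# pairwise post hoc: compare pairs of rows, then adjust with Benjamini-Hochberg
p_raw <- combn(nrow(tab), 2, function(i) fisher.test(tab[i, ])$p.value)
p.adjust(p_raw, method = "BH")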
  • asked a question related to Analytical Statistics
Question
16 answers
Hi,
I’m studying the effect of joint (cracks) sets spacing and persistence on blasted rock size so I have two independent categorical variables (labelled SP for spacing and PER for persistence) that have 5 levels of measurement ranges each. The dependent variable is the blasted rock size (Xc ) i.e I want to know how the spacing and persistence of the existing joints on a rock face would affect the size of blasted rocks. Measurement levels for both spacing and persistence are listed below
spacing levels:
SP1: less than 60mm SP2: 60-200mm SP3: 200-600mm SP4:0.6-2m SP5: more than 2m
persistence levels:
PER1: more than 20m PER2: 10-20m PER3: 3-10m PER4: 1-3m PER5: less than 1m
Spacing and Persistence were recoded as ranges since they were estimated and not measured individually as it'd take too much time to measure each one (1 set of joint may have at least 10 joints, some can reach 50 or more and the measurement are not exactly the same between joints belonging to the same set. Measurement was done manually on site)
Initially, I ran the regression with these two variables as categorical variables but the problem is the levels are not mutually exclusive. 1 rock slope could consist of 2 or more crack sets hence the situation where more than 1 levels of spacing and persistence can be observed. As an example, rock face A consist of 3 crack sets:
Set 1 (quantity: 25) SP3 PER5 Set 2 (quantity: 30) SP4 PER6 set 3 (quantity: 56) SP2 PER3
As can be seen, 1 rock face contains 3 different levels of SP and PER.
Technically, these are ordinal variables, and as explained above, if I choose to treat them as categorical I face the problem of non-mutually exclusive levels. Recently, I found out that ordinal variables can be treated as continuous, which seems to solve my problems with the non-mutually exclusive levels I get if I enter the variables as categorical. My main concern is to look at the variable as a whole, not by its levels, so it might be what I need.
My question is, is it correct if I assign the numerical value to the levels like this in order to treat the variables as continuous? 1 to 5, from lowest to highest.
Spacing: 1: less than 60mm 2: 60-200mm 3: 200-600mm 4:0.6-2m 5: more than 2m
persistence: 1: less than 1m 2: 1-3m 3: 3-10m 4: 10-20m 5: more than 20m
and then run regression as I would with the usual continuous variable? Plus, for prediction, once I get the equation, do I insert the value 1-5 as the X in the equation? I am still confused with the prediction step since even if I treat it as continuous I'd still have the problem with the presence of different levels of SP and PER. Or is there another way around this problem?
2nd question is: as provided in data example for rock face A, is it correct to repeat the data input according to the quantity? as in I’d entered set 1 data 25 times, set 2 data 30 times and set 3 data 56 times.
I am very new to statistics and learning it on my own, so I might be wrong about something in this field. Any answers, suggestions and advice are very much appreciated. Thank you in advance!
Relevant answer
Answer
Employing ordered-probit models is one way to deal with ordinal data; Liddell & Kruschke (2018) advocate a Bayesian approach, though a frequentist approach is another option. You may want to read that paper for clarification.
Good luck.
  • asked a question related to Analytical Statistics
Question
29 answers
I have a dataset with one nominal independent variable with 10 different levels and one dichotomous dependent variable.
What would be the appropriate statistical test to compare the different levels of the IV?
Example:
Independent variable: "Favourite color". Different non-ranked levels: Yellow, green, orange, blue, red, purple, black, white, brown and pink.
Dependent variable: Dichotomous: Smoker: Yes (1) vs. No (0)
I am not interested in choosing a reference level (in this example a specific colour) since there is no solid way to decide which of these colours should be the reference.
The only idea I can come up with for statistical testing is chi square (Fishers) comparing each level of the IV to the combination of all the other levels. In other words creating dummy variables (but without a reference level) - e.g. "Yellow vs. not yellow" and then perform chi square. Next "green vs. not green" etc. till the end (with all the levels).
Is this an accepted way to compare the different levels of a nominal variable?
My results will then be something like shown here:
"People with favourite colour yellow smoke significantly more than others".
"People with favourite colours green, orange, blue, red, purple, black, white and brown does not smoke significantly more (or less) than others".
"People with favourite colour pink smoke significantly less than others".
This analysis is easy to perform but is it statistically sound?
Are there any better alternatives?
Or should I simply stick to descriptives without any statistical comparison?
Thank you
Relevant answer
Answer
In my opinion, you can run a binary logistic regression. For such a regression the IV can be either categorical/nominal or continuous.
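A minimal sketch in R with simulated data standing in for the real survey; the overall likelihood-ratio test avoids choosing a reference colour, and colour-vs-rest follow-ups can then be adjusted for multiplicity (variable names are hypothetical):

set.seed(42)
d <- data.frame(
  colour = factor(sample(c("yellow", "green", "orange", "blue", "red",
                           "purple", "black", "white", "brown", "pink"), 200, replace = TRUE)),
  smoker = rbinom(200, 1, 0.3)
)
fit  <- glm(smoker ~ colour, family = binomial, data = d)
null <- glm(smoker ~ 1, family = binomial, data = d)
anova(null, fit, test = "Chisq")     # overall test: is colour associated with smoking at all?
# "yellow vs. not yellow" style contrasts, adjusted for the 10 tests
p_raw <- sapply(levels(d$colour), function(lv)
  fisher.test(table(d$colour == lv, d$smoker))$p.value)
p.adjust(p_raw, method = "BH")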
  • asked a question related to Analytical Statistics
Question
7 answers
Dear all,
I am processing nondiagnostic pottery shards. I analysed fragmentation and abrasion in the field.
I first divided my nondiagnostics into clay groups. After that for each clay group I sorted out the individual shards according to size categories using a grid (1x1cm, 2x2cm, 3x3cm etc) and I also analysed shards from each clay group according to 3 levels of abrasion.
Now, I want to calculate the level of fragmentation and abrasion of each clay group. Any suggestion on what would be the best way to do this?
Best wishes,
Uros
Relevant answer
Answer
Dear Uros,
I agree with Christoph Keller: the sherd weight is a proxy for the sherd volume/size and already an indicator of pottery fragmentation. You can simply investigate sherd weights within each clay group and abrasion category (e.g. with boxplots or summary statistics). Best regards. Robin
  • asked a question related to Analytical Statistics
Question
5 answers
I have this set of data:
Treatment 1: 30.5; 34.5; 24.4
Treatment 2: 24.8; 20.8; 16.8
Treatment 3: 19.1; 21.4; 21.0
Treatment 4: 22.3; 26.1; 27.1
Treatment 5: 26.5; 31.2; 22.9
Analysis with SAS gave the following result:
ANOVA: p = 0.047 (significant)
Tukey test (alpha = 0.05):
Treatment 1: a
Treatment 2: a
Treatment 3: a
Treatment 4: a
Treatment 5: a
So which one is correct? How can I interpret this result?
Relevant answer
Answer
The overall ANOVA asks about the relationships among all the levels; tests like Tukey's HSD compare individual levels. Those are two different questions. Which one are you interested in? Perhaps both? Note that a marginally significant omnibus F (p = 0.047) can easily coexist with no significant pairwise Tukey difference, because Tukey's procedure controls the family-wise error rate across all ten pairwise comparisons and is therefore less powerful for any single pair.
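For reference, the same analysis is easy to reproduce with the data above; this is just the standard aov/TukeyHSD workflow in R, not the original SAS code:

y   <- c(30.5, 34.5, 24.4,  24.8, 20.8, 16.8,  19.1, 21.4, 21.0,
         22.3, 26.1, 27.1,  26.5, 31.2, 22.9)
trt <- factor(rep(paste0("T", 1:5), each = 3))
fit <- aov(y ~ trt)
summary(fit)      # omnibus F-test (p close to 0.05, as in the SAS output)
TukeyHSD(fit)     # pairwise CIs: none need exclude 0 even when the omnibus test is significant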
  • asked a question related to Analytical Statistics
Question
3 answers
What is the appropriate statistical analysis to show a correlation between the number of hours spent on online classes and their perceived effect on students' mental health? The number of hours was one multiple-choice question with 4 choices, and the perceived effect on students' mental health is a Likert scale of 5 intervals indicating likelihood (never, rarely, sometimes, often, always) with 10 different questions. I see people recommending Pearson or Spearman and I am thoroughly confused since I have ten questions. How do I condense the ten questions into one variable?
Furthermore, I was initially planning to see the effect of online learning on students' perceived mental health, but does that mean using the mean and standard deviation? Is that the correct analysis for the hypothesis "online learning does not have an effect on students' mental health"?
Relevant answer
Answer
Hello Ja Su,
A couple of general observations about Likert-type scales (though do have a look at the resources given below):
1. The original conceptualization of a scale was that the researcher would combine responses across a set of items (usually by summing the 1-5, or whatever arbitrary values are given to a single response) that are unidimensional--that is, they all pertain to the same target object, concept, attribute, characteristic, person, or idea. If sufficient items were included, then: (a) the resultant score would be more reliable than any single response: and (b) the resultant score would tend to behave like a continuous variable (technically, it's still ordinal, but a lot of practitioners treat it as if interval strength).
2. Item response theory models allow the aggregation of such items (again, dimensionality is a consideration) in a way that yields interval strength scores for the resultant scale.
3. Factor analysis would offer yet another way to combine scores over a set of items.
4. The tendency of researchers to use single items which use a Likert-type response scale as a variable is not best practice.
5. To answer your apparent question, good old Spearman correlation would be satisfactory.
Now, for the resources:
David Morgan has posted a lot of information on how to handle a set of Likert-type response items here on ResearchGate. You could also look at Rensis Likert's original monograph (see this link: https://legacy.voteview.com/pdf/Likert_1932.pdf). A more recent, and somewhat simplistic view of Likert-type scale item sets may be found here: https://www.scribbr.com/methodology/likert-scale/
Good luck with your work.
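Following points 1 and 5 above, a minimal sketch in R: sum the ten mental-health items into one scale score, then correlate that score with the hours variable using Spearman (simulated stand-in data, hypothetical names):

set.seed(7)
items <- as.data.frame(matrix(sample(1:5, 100 * 10, replace = TRUE), ncol = 10))  # 10 Likert items
hours <- sample(1:4, 100, replace = TRUE)                                         # hours category
scale_score <- rowSums(items)                       # one summed score per respondent
cor.test(hours, scale_score, method = "spearman")   # rank-based correlation (ties give a warning)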
  • asked a question related to Analytical Statistics
Question
4 answers
I am checking the relative expression of 10 genes via qPCR between control and patient cell lines (iPSC vs. NSC vs. Neurons). After analysis I have obtained:
a. 2^-ΔCt values for each cell line, which I have plotted as 3 grouped bar graphs for each line (multiple t-tests)
b. ΔΔCt
c. 2^-ΔΔCt (fold change?)
Please help me represent b & c in the most useful way also, feedbacks on a. (particularly statistical analysis) would be greatly appreciated
Thanks in advance.
error bars represent SEM.
Relevant answer
Answer
Hi Bilal,
You only plot graphs for 2^-ΔΔCt. For statistical analysis you need to use a one-way ANOVA, where you compare all of your test sample datasets with the control dataset. I hope you can solve your problem easily now.
  • asked a question related to Analytical Statistics
Question
4 answers
I have the energy spectrum acquired from experimental data. After normalization, it can be used as a probability density function (PDF). I can construct a cumulative distribution function (CDF) on a given interval using its definition as the integral of the PDF. This integral simplifies to a sum because the PDF is given in discrete form. I want to generate random numbers from this CDF.
I used inverse transform sampling, replacing the CDF integral with a sum. From then on I am following the standard routine of inverse transform sampling, solving it for a sum instead of an integral.
My sampling visually fits the experimental data, but I wonder if this procedure is mathematically correct and how it could be proved.
Relevant answer
Answer
The ideas are OK, but you need to show that your sums converge to the integral. Referring to a text on harmonic analysis or numerical analysis would probably be beneficial.
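For what it is worth, the discrete procedure described in the question can be written in a few lines; this is only a sketch with a made-up spectrum, not the actual experimental data:

energy <- seq(0.1, 10, by = 0.1)                          # hypothetical energy bin centres
counts <- exp(-energy / 3) * (1 + 0.2 * sin(energy))      # stand-in for the measured spectrum
pdf <- counts / sum(counts)                               # normalize to a discrete PDF
cdf <- cumsum(pdf)                                        # the sum replacing the CDF integral
u <- runif(10000)                                         # uniform draws
idx <- pmin(findInterval(u, cdf) + 1, length(energy))     # first bin whose CDF exceeds u
samples <- energy[idx]
hist(samples, breaks = 50)                                # should reproduce the spectrum's shape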
  • asked a question related to Analytical Statistics
Question
5 answers
Suppose that X1, X2 are random variables with given probability distributions fx1(x), fx2(x).
Let fy(x) = fy( fx1(x) , fx2(x) ) be a known probability distribution of "unknown" random variable Y. Is it possible to determine how the variable Y is related to X1 and X2?
  • asked a question related to Analytical Statistics
Question
4 answers
After literature review, current draft as below:
Variables:
Independent = B (represented by I, M, O, S, C),
Moderating = W
Dependent = F
Control = CV
Hypotheses:
H1a. I has significant positive relationship with F.
H1b. M has significant positive relationship with F.
H1c. O has significant positive relationship with F.
H1d. S has significant positive relationship with F.
H1e. C has significant positive relationship with F.
H2. W has significant positive relationship with F.
H3a. W strengthens the positive relationship between I and F.
H3b. W strengthens the positive relationship between M and F.
H3c. W strengthens the positive relationship between O and F.
H3d. W strengthens the positive relationship between S and F.
H3e. W strengthens the negative relationship between C and F.
H4. B has a significant positive relationship with F.
H5. W strengthens the positive relationship between B and F.
Proposed Equations:
1) Pit = a + β1Iit + β2Mit + β3Oit + β4Sit + β5Cit + β6CVit + εit [H1a-e]
2) Pit = a + β1Iit + β2Mit + β3Oit + β4Sit + β5Cit + β6Wit + β7Iit x Wit + β8Mit x Wit + β9Oit x Wit + β10Sit x Wit + β11Cit x Wit +β12CVit + εit [H2 & H3a-e]
Question:
1) Are the proposed equations appropriate to test the respective hypotheses?
2) Should an equation (e.g. a weighted score for B based on I, M, O, S, C) be formulated to test H4 & H5? Or is it not necessary, so that H4 & H5 can be concluded as a whole based on the individual estimation results for H1a-e, H2 and H3a-e?
Thanks for advice / sharing in advance.
Relevant answer
Answer
The purpose of a moderating variable is to check whether that factor (variable) influences or drives the relationship between the dependent and independent variables. So the hypothesis should reflect that,
e.g. "W significantly changes or moderates the relationship between I and F", or
"The relationship between I and F is stronger (weaker) when there is W" (this works well if W is a dummy variable of 0 and 1).
Hope this helps. ALL THE BEST
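As a sketch of how equation (2) is typically estimated (ignoring the panel structure for simplicity; with panel data a fixed-effects estimator such as the plm package would be used instead), here it is in R with simulated data and the variable names from the post; centring W before forming the interactions is a common, optional choice:

set.seed(1)
n <- 200
d <- data.frame(I = rnorm(n), M = rnorm(n), O = rnorm(n), S = rnorm(n),
                C = rnorm(n), W = rnorm(n), CV = rnorm(n))
d$F  <- 0.3 * d$I + 0.2 * d$M + 0.25 * d$W + 0.15 * d$I * d$W + rnorm(n)   # toy outcome
d$Wc <- d$W - mean(d$W)                                                    # centre the moderator
fit <- lm(F ~ I + M + O + S + C + Wc + I:Wc + M:Wc + O:Wc + S:Wc + C:Wc + CV, data = d)
summary(fit)   # the five interaction coefficients correspond to H3a-e; the Wc term to H2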
  • asked a question related to Analytical Statistics
Question
3 answers
This is so far the procedure I was trying upon and then I couldn't fix it
As per my understanding here some definitions:
- lexical frequencies, that is, the frequencies with which correspondences occur in a dictionary or, as here, in a word list;
- lexical frequency is the frequency with which the correspondence occurs when you count all and only the correspondences in a dictionary.
- text frequencies, that is, the frequencies with which correspondences occur in a large corpus.
- text frequency is the frequency with which a correspondence occurs when you count all the correspondences in a large set of pieces of continuous prose ...;
You will see that lexical frequency produces much lower counts than text frequency because in lexical frequency each correspondence is counted only once per word in which it occurs, whereas text frequency counts each correspondence multiple times, depending on how often the words in which it appears occur.
When referring to the frequency of occurrence, two different frequencies are used: type and token. Type frequency counts a word once.
So I understand that lexical frequencies probably deal with types, counting the words once, and text frequencies deal with tokens, counting the words multiple times in a corpus; therefore, for the latter, we need to take into account the frequency of the words in which those phonemes and graphemes occur.
So far I managed phoneme frequencies as it follows
Phoneme frequencies:
Lexical frequency is: (single count of a phoneme per word/total number of counted phonemes in the word list)*100= Lexical Frequency % of a specific phoneme in the word list.
Text frequency is similar but then I fail when trying to add the frequencies of the words in the word list: (all counts of a phoneme per word/total number of counted phonemes in the word list)*100 vs (sum of the word frequencies of the targeted words that contain the phoneme/total sum of all the frequencies of all the words in the list)= Text Frequency % of a specific phoneme in the word list.
PLEASE HELP ME TO FIND A FORMULA ON HOW TO CALCULATE THE LEXICAL FREQUENCY AND THE TEXT FREQUENCY of phonemes and graphemes.
Relevant answer
Answer
Hello,
For the calculation of the lexical frequency of simple or complex units, WordSmith or AntCon is usually used.
Regards
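For what it is worth, here is one possible operationalization of the two definitions given in the question, sketched in R with a toy word list (phonemes space-separated, plus each word's corpus frequency); whether this matches the convention in your source material should be checked:

words <- data.frame(
  phonemes = c("k a t", "d o g", "k a r", "a t"),   # toy transcriptions
  freq     = c(120, 80, 40, 300),                   # toy corpus frequencies of the words
  stringsAsFactors = FALSE
)
phon_list <- strsplit(words$phonemes, " ")
# lexical frequency: each phoneme counted at most once per word, every word weighted equally
lex_counts <- table(unlist(lapply(phon_list, unique)))
100 * lex_counts / sum(lex_counts)
# text frequency: every occurrence counted, weighted by how often the word occurs in the corpus
txt_counts <- table(unlist(mapply(rep, phon_list, words$freq)))
100 * txt_counts / sum(txt_counts)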
  • asked a question related to Analytical Statistics
Question
3 answers
This is so far the procedure I was trying upon and then I couldn't fix it
As per my understanding:
- lexical frequencies, that is, the frequencies with which correspondences occur in a dictionary or, as here, in a word list;
- lexical frequency is the frequency with which the correspondence occurs when you count all and only the correspondences in a dictionary.
- text frequencies, that is, the frequencies with which correspondences occur in a large corpus.
- text frequency is the frequency with which a correspondence occurs when you count all the correspondences in a large set of pieces of continuous prose ...;
You will see that lexical frequency produces much lower counts than text frequency because in lexical frequency each correspondence is counted only once per word in which it occurs, whereas text frequency counts each correspondence multiple times, depending on how often the words in which it appears occur.
When referring to the frequency of occurrence, two different frequencies are used: type and token. Type frequency counts a word once.
So I understand that lexical frequencies probably deal with types, counting the words once, and text frequencies deal with tokens, counting the words multiple times in a corpus; therefore, for the latter, we need to take into account the frequency of the words in which those phonemes and graphemes occur.
So far I managed phoneme frequencies as it follows
Phoneme frequencies:
Lexical frequency is: (single count of a phoneme per word/total number of counted phonemes in the word list)*100= Lexical Frequency % of a specific phoneme in the word list.
Text frequency is similar but then I fail when trying to add the frequencies of the words in the word list: (all counts of a phoneme per word/total number of counted phonemes in the word list)*100 vs (sum of the word frequencies of the targeted words that contain the phoneme/total sum of all the frequencies of all the words in the list)= Text Frequency % of a specific phoneme in the word list.
PLEASE HELP ME TO FIND A FORMULA ON HOW TO CALCULATE THE LEXICAL FREQUENCY AND THE TEXT FREQUENCY of phonemes and graphemes.
Relevant answer
Answer
It will help if you use suitable and powerful qualitative research software such as Atlas.ti (https://atlasti.com/) or equivalent. This software allows you to introduce and research large amounts of text, written or oral, images, videos, etc. Then, you can select diverse research techniques, including frequencies, correlations, modulations, structures, and several other tools.
  • asked a question related to Analytical Statistics
Question
7 answers
I wish to estimate the expression of two types of markers by immunohistochemistry in the biopsy specimens of a particular type of cancer, and correlate their expression with clinical parameters and outcomes. For this type of cancer, we see approximately 400-500 patients every year at our centre, which is about 10% of all of our cancer patients. The crude rate of this cancer in India is also around 10%. The estimated prevalence of the two markers varies between 50% and 75% as per published studies. What should my ideal sample size be?
Relevant answer
Answer
At least a few hundred are needed for reliable research. I think you have enough samples here.
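If a more formal justification is wanted, the usual sample-size formula for estimating a proportion with a given margin of error is one option; a minimal sketch in R, using the most conservative prevalence in the 50-75% range and a ±5% margin (both values are assumptions to adapt):

p <- 0.5                           # assumed marker prevalence (0.5 is the most conservative choice)
d <- 0.05                          # assumed margin of error: +/- 5 percentage points
z <- qnorm(0.975)                  # 95% confidence
ceiling(z^2 * p * (1 - p) / d^2)   # about 385 biopsies; a wider margin shrinks this quickly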
  • asked a question related to Analytical Statistics
Question
6 answers
Hello,
I have measured morphometric parameters of plants grown in vitro (height, root mass, etc.). I have one variable, thus two test groups - control and treatment group. I've made three independent biological replicates of the experiment, 30 plants per each biological replicate. In total there are 90 plants for control group and 90 plants for treatment group. I have done single factor ANOVA for each replicate and achieved high F numbers and very low p-values. My question is, is there a way, similar to ANOVA for repeated measurements, to analyse this data as three independent units, or should I just merge the 3 replicates into one data set?
Relevant answer
Answer
I am not sure what you mean by biological replicate. For me, each individual plant is a biological replicate. So you had 15 bio. reps (= different plants) in each group (15 x 2 groups = 30 per experiment), and you replicated the entire experiment three times (with new, different plants, so these are again all bio. reps). Please correct me if I am wrong.
If this is correct, you can simply analyze the data from all 90 plants in one model. There might be some systematic differences between the experiments, which can be accounted for by adding the experiment ID as a random factor.
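A minimal sketch of that suggestion in R with the lme4 package, using simulated stand-in data (one row per plant, hypothetical column names):

library(lme4)
set.seed(3)
plants <- data.frame(
  group      = rep(rep(c("control", "treatment"), each = 15), times = 3),
  experiment = factor(rep(1:3, each = 30))
)
plants$height <- 10 + 2 * (plants$group == "treatment") +
                 c(-0.5, 0, 0.5)[as.integer(plants$experiment)] + rnorm(90)
fit <- lmer(height ~ group + (1 | experiment), data = plants)
summary(fit)   # 'group' is the treatment effect; the random intercept absorbs between-run shifts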
  • asked a question related to Analytical Statistics
Question
11 answers
Hello,
I have some doubts about the statistical model that I am using to analyze my data. I have two groups of residue study data, group 1 with n=7 and group 2 with n=47. They are independent, and the studies are expensive and rare, so I couldn't increase the sample size by any means.
I tested the normality of both groups using SPSS and found them not to be normally distributed; I then transformed all the data using the square-root transformation, after which they fit the normal distribution, p = 0.2 (more than 0.05).
Which test should I use, especially given that I am using the SPSS package?
thanks
Relevant answer
Answer
You said you "need to compare the mean[s] of the two groups," but then you are applying a non-linear transformation so that you are no longer comparing means. e.g.,
G1: 1 1 1 1 1 64, mean 11.5
G2: 4 4 4 4 4 4 4 4, mean 4
taking the square root
G1*: 1 1 1 1 1 8, mean 2.5
G2*: 2 2 ... 2, mean 2
So with the raw data the first group is higher, but after taking the square root is higher. If you NEED to compare means (and sometimes this is important) do not transform the data prior to either Welch's or Student's test.
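A minimal sketch of Welch's test on the untransformed data in R (hypothetical values; the Welch correction is the default in t.test):

set.seed(2)
g1 <- c(0.12, 0.30, 0.05, 0.44, 0.21, 0.09, 0.35)   # hypothetical residues, n = 7
g2 <- runif(47, 0, 0.6)                             # hypothetical residues, n = 47
t.test(g1, g2)                                      # Welch's t-test (var.equal = FALSE by default)
wilcox.test(g1, g2)                                 # rank-based alternative if skew is a concern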
  • asked a question related to Analytical Statistics
Question
3 answers
I have two datasets. The first one has 20 patients. While changing the LBNP pressure for each patient , (in a period of time), the physiological signals (ECG and blood pressure and Spo2) are recorded.
The second data set has 30 patients and again for a period of time , the same signals are recorded.
In total the first dataset has 400 samples where each sample corresponds to an ECG, blood pressure, Spo2 reading and for each sample there is the output (LBNP). (3 features and one output)
The second dataset is the same except we have 600 samples. For each sample we change the LBNP and read ECG, blood pressure, and Spo2.
My question is how to compare the two datasets so we know whether or not they are different from each other. Each dataset comes from a different clinical team. What statistical tests can be used to do the comparison?
Relevant answer
Answer
Dear colleague: tests for repeated measurements are used in this situation.
  • asked a question related to Analytical Statistics
Question
6 answers
There are two groups in the study, group 1 and group 2. One of the groups received the treatment, but the other did not. When the mortality of the groups is compared, there is no statistical difference. However, the expected mortality rate (calculated based on the PRISM3 score) in the first group (the treatment group) was significantly higher than in the other. I think the treatment was successful in lowering the high expected mortality. However, I could not find how to show this statistically, or how to equalize this imbalance (in expected mortality) between the groups at baseline.
Thanks
Relevant answer
Answer
Your expected mortality rate is a confounding variable in your analysis.
The adjustment method depends a lot on the data, but you can have a look at the following thread.
If you have good overlap between the score distributions in the two groups (despite different means), you could go for stratification, although it may leave some strata with too few patients.
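Besides stratification, a common alternative is to put the baseline risk score directly into a regression model; a minimal sketch in R with simulated stand-in data (hypothetical names, and not a substitute for careful confounding control):

set.seed(5)
n <- 120
d <- data.frame(treated = rbinom(n, 1, 0.5), prism3 = rnorm(n, 10, 4))
d$died <- rbinom(n, 1, plogis(-3 + 0.15 * d$prism3 - 0.5 * d$treated))
fit <- glm(died ~ treated + prism3, family = binomial, data = d)
summary(fit)                  # 'treated' is the treatment effect adjusted for baseline risk
exp(coef(fit)["treated"])     # adjusted odds ratio for treatment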
  • asked a question related to Analytical Statistics
Question
3 answers
Hello everyone,
I have some difficulties with the procedure in my empirical evaluation. My structural model consists of seven independent variables and one dependent variable and so far I have evaluated 150 German and 150 American data sets individually. However, since I want to check in which country the significant relationships between IV and DV are stronger, I am unsure how to proceed correctly.
Is it possible to compare just the significant path coefficients to find out whether the relationship between IV and DV in country A is stronger than in country B?
Or does this have to be evaluated by means of a multi-group analysis?
Thanks for your help,
Baxmauer
Relevant answer
Answer
Hello Michael,
as David has already mentioned, the most effective way would be to go to a multigroup SEM. There is a ton of literature on "measurement invariance" that describes such an analysis as a set of tests.
1) You begin by testing the structure. If you have just a regression structure with no effects fixed to zero, there is no real test. But if you have some restrictions (e.g., a full mediation) this would be a good thing, and you can test whether this structure holds in both groups. This step is called "configural invariance". Perhaps you have latent measurement models. These are tested as well (i.e., whether the measurement models are equivalent across countries).
2) Then you have the option to specify "equality constraints". That means that you tell the algorithm to estimate a certain coefficient but under the premise that this coefficient must be identical in both countries. The empirical question then is whether this is correct or a serious violation (because the countries differ in the values of the coefficient). Hence this step consists of comparing the fit of the model with the configurally invariant model. If the fit gets a bit worse but not to a significant degree, then the H0 of "no difference" holds. If the fit gets significantly worse, then you reject the H0 and assume a difference. The powerful and efficient aspect is that you can apply equality constraints for all parameters of interest at once, leading to a strong omnibus test of "all parameters are equal across both countries".
Again, if you have a measurement model, you would usually start by testing for "metric invariance", namely whether the factor loadings are identical, followed by a test of "structural invariance" testing your substantive cross-country hypothesis.
I have an introductory book on SEM (in German language) with the lavaan package in R that has a chapter on multigroup models and how you do it (if you forgive me the shameless self-advertisement :)
HTH
--Holger
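A minimal sketch of those two steps with the lavaan package in R, using a simulated stand-in for the real data and a deliberately tiny hypothetical model (the grouping variable is country):

library(lavaan)
set.seed(11)
d <- data.frame(IV1 = rnorm(300), IV2 = rnorm(300), IV3 = rnorm(300),
                country = rep(c("DE", "US"), each = 150))
d$DV <- 0.4 * d$IV1 + 0.2 * d$IV2 + ifelse(d$country == "US", 0.5, 0.1) * d$IV3 + rnorm(300)
model <- ' DV ~ IV1 + IV2 + IV3 '                                          # hypothetical model
fit_free  <- sem(model, data = d, group = "country")                       # configural model
fit_equal <- sem(model, data = d, group = "country", group.equal = "regressions")  # constrained
anova(fit_free, fit_equal)   # chi-square difference test: do the paths differ by country?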
  • asked a question related to Analytical Statistics
Question
6 answers
The original series is nonstationary, as it has a clear increasing trend and its ACF plot decays only gradually. To make the series stationary, what optimum order of differencing (d) is needed?
Furthermore, what if the ACF and PACF plots of the differenced series do not cut off after a definite number of lags but have peaks at certain intermittent lags? How should the optimum values of p and q be chosen in such a case?
Relevant answer
Answer
You can use the auto.arima function from the 'forecast' package for R.
Alternatively, if you have many observations, you can try out-of-sample comparison of alternative models with different values of d.
To compare alternative models, you can use the instructions described here:
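A minimal sketch of that suggestion in R (hypothetical series; ndiffs suggests d, and auto.arima searches over p, d, q by information criteria):

library(forecast)
set.seed(4)
y <- ts(cumsum(rnorm(120, mean = 0.5)))   # hypothetical trending series
ndiffs(y)                                 # suggested order of differencing d
fit <- auto.arima(y)                      # selects p, d, q (by AICc by default)
summary(fit)
checkresiduals(fit)                       # Ljung-Box check that the residuals are white noise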
  • asked a question related to Analytical Statistics
Question
4 answers
I have seen that some researchers just compare the difference in R² between two models: one in which the variables of interest are included and one in which they are excluded. However, in my case this difference is small (0.05). Is there any method by which I can be sure (or at least have some support for the argument) that this change is not just due to luck or noise?
Relevant answer
Answer
A partial F-test will be useful here. After the 1st variable is in, you add the other variables ONE at a time. After the 2nd variable is added, you have your y-variable as a function of 2 variables, giving a model with 2 df and a certain regression sum of squares (SS). From the 2-variable regression SS subtract the 1-variable regression SS; that change in SS has 1 df. So the extra-variable SS divided by 1 is the change in regression mean square (regMS). Then divide the 2-variable residual SS by the 2-variable residual df to get the residual mean square (resMS). Now divide the change in regMS by resMS to get the partial F-value and look up tables for the probability of that partial F-value. If significant, keep the 2nd variable in and do the same for any further independent variable you may want to add to your model. Adjusted R-squared (%) = 100*[1 - {(residual SS/residual df)/(total SS/total df)}].
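In R this partial F-test is exactly what anova() reports when two nested models are compared; a minimal sketch with simulated variables:

set.seed(9)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
d$y <- 0.5 * d$x1 + 0.3 * d$x2 + rnorm(100)
small <- lm(y ~ x1, data = d)
big   <- lm(y ~ x1 + x2 + x3, data = d)
anova(small, big)                                               # partial F-test for the added block
c(summary(small)$adj.r.squared, summary(big)$adj.r.squared)     # adjusted R-squared of each model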
  • asked a question related to Analytical Statistics
Question
13 answers
To illustrate my point, I present a hypothetical case with the following equation:
wage = C + 0.5*education + 0.3*rural area
Where the variable "education" measures the number of years of education a person has and rural area is a dummy variable that takes the value of 1 if the person lives in the rural area and 0 if she lives in the urban area.
In this situation (and assuming no other relevant factors affecting wage), my questions are:
1) Is the 0.5 coefficient of education reflecting the difference between (1) the mean of the marginal return of an extra year of education on the wage of an urban worker and (2) the mean of the marginal return of an extra year of education of an rural worker?
a) If my reasoning is wrong, what would be the intuition of the mechanism of "holding constant"?
2) Mathematically, how is it that just adding the rural variable works to "hold constant" the effect of living in a rural area on the relationship between education and wage?
Relevant answer
Answer
"Holding constant" means evaluating the partial variation in the dependent variable due to variation in one independent variable, while the other variables in the model are assumed not to change.
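One way to see the mechanics is a small simulation: when the rural dummy is left out and is correlated with education, the education coefficient absorbs part of the rural effect; once the dummy is added, the education effect is estimated from comparisons at a fixed value of rural. A hedged sketch in R using the coefficients from the hypothetical equation above:

set.seed(8)
n <- 5000
rural <- rbinom(n, 1, 0.4)
educ  <- 12 - 3 * rural + rnorm(n, 0, 2)          # rural residents get fewer years of education
wage  <- 5 + 0.5 * educ + 0.3 * rural + rnorm(n)
coef(lm(wage ~ educ))            # education coefficient is biased (rural omitted)
coef(lm(wage ~ educ + rural))    # education coefficient recovers 0.5, "holding rural constant"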
  • asked a question related to Analytical Statistics
Question
9 answers
Hi everyone,
I am planning on constructing a Fama-French 3-factor model for the period 1.1.1998-31.12.2015 for a portfolio of about 120 stocks. I have collected the monthly returns for each stock over 36 months since its IPO. The process of doing a Fama-French 3-factor model for a single stock is very straightforward, as seen in this video: https://www.youtube.com/watch?v=b2bO23z7cwg
However, how should I proceed with a portfolio with returns that all have different starting dates (as each firms have a different IPO date)?
My thought was as follows:
  1. Calculate the average 1 month return, 2 month return, 3 month return, ….36 month return from all the stocks in the portfolio.
  2. Calculate the 1 month average, 2 month average, 3 month average, ….36 month average of the Rf, HML, SMB, Mkt-Rf
  3. Subtract 1 month average Rf from average 1 month return, repeat until the 36th month.
  4. Proceed with running the regression.
Many papers, such as the one by Levis (The Performance of Private Equity-Backed IPOs), have used the Fama-French 3-factor model but do not explain the mechanics behind the process.
Any help is greatly appreciated.
-Sebastian
Relevant answer
Answer
Hello, I am writing research on whether an ESG factor impacts stocks' market returns. I am conducting this research with the three factors of the Fama-French model plus a fourth factor, ESG.
(Stock Return - Rf) = b0 + b1 (RM - Rf) + b2 HML + b3 SMB + b4 ESG + e
I want to create this analysis at the market level.
I can create this model at the individual stock level; however, I am unable to use this model at the market level because the Fama-French three factors are constant across all stocks, so I can't select them as my independent variables.
Dependent Variable:
  • Stock returns - 60 companies stock's yearly returns.
Independent Variable:
  • Market factor (CAPM): FTSE 100/S&P100 - However, market returns would be common for each stock. So, this variable is not changing with each stock as the market return is common for all stocks. 
  • Firm Size (SMB): I can calculate the SMB factor using six portfolios formed using Size and Book to market value. However, this factor will also be common for each stock. Hence this variable is not changing as the change in the dependent variable.
  • Book to Market Value (HMB): I have calculated the HMB factor using six portfolios formed using Size and Book to market value. However, this factor will also be common for each stock. 
  • ESG Factor: I have an ESG score of all stocks for five years. Now, this factor is changing each year with a change in stock returns
Is it possible to use the Fama-French factors at the market level?
  • asked a question related to Analytical Statistics
Question
5 answers
Hello everyone,
I am trying to statistically analyze whether data from 3 thermometers differs significantly. At the moment, because of COVID-19, several control points have come up at the company for which I work. We have been using infrared thermometers to check up on people and to be aware if they have a fever or not. However, we don't own a control thermometer with which we could easily calibrate our equipment, we thought that using a statistical test would be helpful, but at this point, we are lost.
Normally, we would compare our data to our control thermometer and that would be it. Our other thermometers are allowed to have a difference of +-1°C at max when we compare them to their controls; we can't do that now.
What I have been doing is collecting 5 to 10 measurements from each thermometer, comparing them through an ANOVA, and then assessing the results (when needed) by running Fisher's Least Significant Difference test. I don't know if it is right to do so, because sometimes the data I collect does not seem to vary a lot (the mean difference is NEVER greater than ±1°C), and even so the test concludes that they differ significantly.
What would be right here? We don't want to work with the wrong kind of equipment or put away operating thermometers without a solid reason, we just want to do what's best to our people.
Could you guys please help me?
Relevant answer
Answer
The only way to honestly solve your problem is to a priori set a reference cutoff. The question is one of informed reliability and instrument validation testing. If you think about it, the question is not that complex, and applying ANOVA would just confuse matters.
  • You have three thermometers and you don't know whether any of them perform entirely accurately.
  • You do not know if thermometers provide precise measurements upon retesting - within re-test consistency.
  • You don't know if the thermometers perform in a similar fashion in comparison to each other - between method reliability.
The only way to honestly solve your problem is to set an unstandardized tolerance cutoff that you would then use to determine if it makes sense to continue using these instruments. The cutoff could be ±1 degree C, just as an arbitrary example.
My approach:
  • Test each instrument many times on the same standard.
  • My favorite statistic is the RANGE. Do any thermometers perform outside of your cutoff? If so, then somebody may be incorrectly told they have COVID, so it is best not to use that device. It really depends on your context.
  • There are many wonderful descriptive and reliability analyses that can be done in your situation. Get the median absolute deviation (MAD) of each thermometer and decide if the dispersion is acceptable.
  • Check histograms of each and look for skew. If it exists, is it acceptable, given the direction?
  • Plot the three measurements in a modified Bland-Altman plot with average between the three in the X-axis and SORTED temperatures in Y axis. Preferably you know the temperature reference, so draw a horizontal line in the plot at that point. Otherwise use the mean.
  • Visually check the residuals vs the reference. Maybe get a root mean square error of Y values vs reference.
  • You may want to repeat the process with another subject or temperature range.
  • In the end, make an informed value judgment about the precision and probable validity of the instruments. If the values are too far apart, then you cannot use the thermometers.
This is really the only way to make these kinds of decisions, in my opinion. I am open to others discussing reliability analysis, but be wary if anyone suggests repeated-measures ANOVA, two one-sided t-tests, or even Cronbach's alpha. Your goal is to get the most representative assessment of the situation, not to p-hack your way to safety!
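As a concrete illustration of the range and MAD suggestions above, a minimal sketch in R with made-up repeated readings of the same reference by each device:

t1 <- c(36.5, 36.6, 36.4, 36.5, 36.7)   # hypothetical repeated readings, thermometer 1
t2 <- c(36.9, 37.0, 36.8, 37.1, 36.9)   # thermometer 2
t3 <- c(36.4, 36.5, 36.6, 36.5, 36.4)   # thermometer 3
sapply(list(t1, t2, t3), range)         # does any device stray beyond the +/- 1 degree C tolerance?
sapply(list(t1, t2, t3), mad)           # median absolute deviation: within-device spread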
  • asked a question related to Analytical Statistics
Question
10 answers
Hello,
I have some doubts about the statistical model that I am using to analyze my data. I have two groups of residue study data, group 1 with n=7 and group 2 with n=47. They are independent, and the studies are expensive and rare, so I couldn't increase the sample size by any means.
I tested the normality of both groups using SPSS and found them not to be normally distributed; I then transformed all the data using the square-root transformation, after which they fit the normal distribution, p = 0.2 (more than 0.05).
The data were then analyzed in SPSS with an independent t-test to compare the two means, and the p-value for unequal variances was selected to act as the significance value instead of the usual t-test (in other words, Welch's test was used); kindly have a look at the attached data.
The means were also used to plot the 95% CI for each group in a bar chart.
Do you agree that using a sample of 7 is not affecting or biasing the test result? And if n=5 in one group and n=16 in the other group, is it still a valid method to use? Or should I use another test for small samples?
I wish if someone confirm my thoughts?
Thanks
Relevant answer
Answer
Bruce Weaver, thank you so much for the information. It seems professional and I will need to have a detailed look.
By residue studies I mean: pesticide residue on and in plant matrices compared to residue on plant leaves; they are completely independent of each other and known to be not normally distributed due to the many different physiological and meteorological conditions associated with each study.
Thank you again
  • asked a question related to Analytical Statistics
Question
17 answers
I've created a playlist on YouTube that helps researchers in analysing their survey data. Please suggest any statistical test that you think is useful and not present in the list to work on it and prepare it. You can subscribe to the channel to watch future videos.
Relevant answer
Answer
Novel macros using MINITAB serve both teaching and research purposes; MINITAB does not cost as much as SPSS or SAS. However, it is not able to solve very large problems, only small and mid-size problems.
  • asked a question related to Analytical Statistics
Question
5 answers
This is how I interpreted it:
Results of a binary logistic regression analysis to assess the effect of demographic factors such as farmers education, farmland size, location of farmland in the catchment, and land use type on the likelihood that a farmer would adopt an SWC measure (coded as Yes) or not adopt an SWC measure (coded as No) on his/her farm are presented in the Table below. The full model containing all predictors was statistically significant, x2(df = 4, n = 73, Yes = 56, No = 17) = 9.723, p = 0.045, indicating that the model was able to distinguish between farmers who are likely to adopt or not adopt SWC measures on their farm. Overall success rate of the model was 66%. Based on the odd ratios, farmers with formal education were 13 times more likely than those with no formal education to implement SWC measures on their farm.
Relevant answer
Answer
Hmmm, not really. For example, "Based on the odd ratios, farmers with formal education were 13 times more likely than those with no formal education to implement SWC measures on their farm" is not right. This value is for education after conditioning on the other variables in the model. It is important to stress this. That 13 can go up or down depending on what else is included in the model. If you want to make a statement about the unconditional association between these two, include only education in the model. If you want to talk about the causal impacts of education, there are other issues.
  • asked a question related to Analytical Statistics
Question
4 answers
In my study I want to compare a regression coefficient between two groups. The coefficients were negative for both groups. Based on my understanding, a negative sign of a coefficient indicates an inverse relationship between X and Y, but not the magnitude of the relationship. If the coefficient for group A = -0.5 and for group B = -0.6, does this mean that the coefficient for A is larger than that for B?
When I calculate the coefficient difference between the two groups, the system (Stata) considers that the coefficient for group A > B.
Is this estimation correct? What is the appropriate way to test the difference between the coefficients?
Relevant answer
Answer
Hi, you need to show your model because your description isn't clear. Best, D. Booth
  • asked a question related to Analytical Statistics
Question
4 answers
I have a data set as follows:
0.65, 0.86, 1, 1, 1, 1, 1, 1, 1, 1. When I am drawing a box and whisker plot in excel 2016, the whiskers on the lower side are not appearing. Instead two data points representing 0.86 and 0.65 are shown below the Q1 value of the box. I am unable to figure out the reason for this?
As per my understanding, there should have been a whisker at the min value i.e. 0.65 connecting it to the lower end (Q1) of the box.
Kindly help.
Regards
Sanchit S Agarwal
Relevant answer
Answer
Hello Sanchit,
Mostly for those who didn't read through the citation provided by Anton N. Gvozdetsky , here's the direct explanation.
Whiskers extend up to 1.5X the "mid-band spread" from the upper hinge and the lower hinge points in a box and whisker plot. Any data points further above or further below those thresholds are usually depicted as dots ("outlying cases"). The mid-band spread is the distance between the upper hinge (analogous to the 75th percentile, but is computed slightly differently) and the lower hinge (analogous to the 25th percentile).
For your data set, lower hinge = 1, median = 1, upper hinge = 1. Therefore, mid-band spread = 0, and the lowest two data points (0.65, 0.86) are correctly shown as outlying cases (and, no whisker appears at either end).
Good luck with your work.
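The same behaviour can be reproduced outside Excel; in R, boxplot.stats() applied to the data in the question shows the hinges and whisker ends all equal to 1 and flags the two low values as outlying points:

x <- c(0.65, 0.86, 1, 1, 1, 1, 1, 1, 1, 1)
boxplot.stats(x)
# $stats: whisker ends and hinges are all 1 (the mid-spread is 0); $out: 0.65 and 0.86,
# which is why they are drawn as separate points rather than connected by a whisker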
  • asked a question related to Analytical Statistics
Question
6 answers
I am facing a problem when I try to calculate the HR from two different survival curves. Here is the problem: in the first plot the experimental group's curve is closer to the placebo group's than in the second plot, even though the first plot's HR is smaller than the second plot's. I wonder what the possible reasons are. Can you guys help me to solve this problem? Thanks.
Relevant answer
Answer
Hi, Song Lui,
How did you calculate the HR?
It seems that the two HRs are not statistically significant, and neither differs from 1.
Of course, you can always compare two HRs in a statistical sense (R software, library survcomp, function hr.comp2: a function to statistically compare two hazard ratios).
And only then should you ask why.
  • asked a question related to Analytical Statistics
Question
6 answers
Dear Researchers,
I badly need assistance from a statistician to interpret my data, with references. The analysed results are attached.
IVs - Idealized influence, inspirational motivation, intellectual stimulation, individualized consideration
DV - Employee green behaviour
1. What is the relationship between the dimensions of environmental transformational leadership (ETL) exhibited by managers and the green behaviour of field officers of ABC Plantations?
2. What is the most influential dimension of ETL on the green behaviour of field officers of ABC Plantations?
3. What is the relationship between ETL and green behaviour of field officers of ABC Plantations ?
4. What are the recommended strategies to be adopted to increase the level of employee green behaviour in ABC Plantations?
Given the time limitations, I considered only the total population of field officers in ABC Plantations (which is 85; 81 responded), and data were collected through self-administered questionnaires.
Thank you
Relevant answer
Answer
Hello again Rookantha,
Presuming that the four IVs represent the target dimensions of ETL, then:
1. If this RQ is more concerned with the zero-order correlations, then your table of correlations of the IVs with the DV (the 7th one in your post) gives the answers to this. The individual correlations with Green Behavior (DV) vary from .37 (for the IV of IM) to .46 (for II). Note that the other two IVs have nearly the same (zero-order) relationship with the DV as does II.
2. As there is correlation among the IVs, the preferred way to address this question would be via partial correlations or the standardized regression coefficients (from the regression model, given in the last table in your post): From these, II (with value of .324) is given more emphasis in the model than the next closest IV (IC, with value .270), and far more than the other two IVs (IM, IS). As only II and IC have estimated regression coefficients that are significantly different from zero (at the .05 level), the IM and IS scores could be said to not be contributing to the explanatory power of the model, given that II and IC scores were part of that model.
3. For your sample, the four IVs collectively explain (R-squared = ) 30.7% of the observed variance in Green Behavior scores; that is an amount that is statistically distinguishable from zero: F(4, 76) = 8.402, p < .001.
4. I think that this RQ calls for inference beyond the data!
Good luck with your work.
  • asked a question related to Analytical Statistics
Question
4 answers
Hi all,
I'm interested in comparing the ratios of yes/ no results between 2 strains:
yes no
strain a 10 20
strain b 50 50
For each experiment, I have 3 biological replicates (same strains, same protocol, different days).
I was wondering whether I can use the
Cochran–Mantel–Haenszel test for repeated tests of independence to answer whether there is a difference between the yes/no proportion of strain a and b?
If so-
what is the best way to graph it?
Thanks!
Relevant answer
Answer
Hello again Chen,
I'd say it depends on what research question you're trying to answer. The MH test will address the comparison of strains (across replications). If you're more worried about whether replications yielded consistent results, you could:
1. Run three Fisher exact tests, one for each day. The collective likelihood could be computed via Fisher's method for combining independent outcomes:
X² = -2 * SUM(ln(prob_i)), which is approximately distributed as a chi-square with 2*k df (where k = number of probabilities being combined; ln = natural log; and SUM is the summation operator). Of course, if the trends differ, your next chore is to try to figure out a plausible explanation!
2. Switch gears and use log-linear analysis, with a 2 x 2 x 3 design. In this approach, the main effect of replication could be evaluated (along with the other main effects and interactions). Here's a good starting resource: http://userwww.sfsu.edu/efc/classes/biol710/loglinear/Log%20Linear%20Models.htm
Good luck with your work.
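For the CMH option named in the question, base R already has mantelhaen.test(); a minimal sketch with a hypothetical 2 x 2 x 3 array (strain x result x replicate day), where day 1 uses the counts from the question and the other two days are made up:

tab <- array(c(10, 50, 20, 50,   12, 48, 18, 52,   9, 51, 21, 49),
             dim = c(2, 2, 3),
             dimnames = list(strain = c("a", "b"), result = c("yes", "no"), day = 1:3))
mantelhaen.test(tab)   # common odds ratio for strain a vs b, stratified over the three replicates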
  • asked a question related to Analytical Statistics
Question
5 answers
I am from a social science background and am looking for material that sums up data imputation methods in one reading. The articles that I found describe the common methods used, but I would like to know how each method is actually done. Thanks!
Relevant answer
Answer
I highly recommend this article which discusses the dangers of imputation, mean imputation in particular. http://people.oregonstate.edu/~acock/missing/working%20with%20missing%20values.pdf
  • asked a question related to Analytical Statistics
Question
4 answers
I tried using structural equation modeling to analyse data from a cross-sectional study. The dependent variable is categorical (dichotomous), and I have 8 latent variables in my model (independent variables measured by scales) and 2 observed independent variables. The model fit results are: CFI/TLI: 0.745/0.727, RMSEA: 0.046, number of free parameters: 162.
I also tried to modify my model based on the modification indices, but there was still no improvement in model fit.
Do you have any suggestions for dealing with this poorly fitting model? The software I used for the analysis is Mplus version 7.4.
Thank you for giving any comments!
Relevant answer
Answer
You could post a diagram of the model to help you. An error covariance correlates the measurement errors of factors or indicators of the model that seem to share the same source of error, for instance for semantic reasons such as similar wording.
Saris, W. E., Satorra, A., & van der Veld, W. M. (2009). Testing structural equation models or detection of misspecifications? Structural Equation Modeling, 16, 561–582. doi: 10.1080/10705510903203433
  • asked a question related to Analytical Statistics
Question
7 answers
Reading Wooldridge's book on introductory econometrics, I observe that the F-test allows us to see whether, in a group, at least one of the coefficients is statistically significant. However, in my model, one of the variables of the group I want to test is already individually statistically significant (measured by the t-test). If that is the case, I expect that, no matter which variables I test, if I include the one that is already individually significant, the F-test will also be significant. Is there any useful way to use the F-test in this case?
Relevant answer
Answer
Hello Santiago and colleagues,
A t-test for one variable is identical to an F-test in a simple regression model (with one explanatory variable); as a matter of fact, in this case the square of the t-statistic value is exactly equal to the F-statistic value (t² = F).
Once you include additional independent variables, the F-test is the one that you rely on to report the results. If you have a significant F-test, then you report the regression results and you look at which explanatory variables are significant using the t-test. However, if the F-test is insignificant you stop there and you do not report the regression. You need to find another model that at the minimum passes the F-test. This is a specification issue.
Kind regards,
George K Zestos
  • asked a question related to Analytical Statistics
Question
4 answers
I am writing a paper applying non-parametric/parametric statistical analysis to three independent data series (Human Development Index, Gini Index, US Aid) for 10 countries, observed annually over the last 10 years. I want to find out whether the Gini index can be described as a predictor of a country's human development, and whether US aid affects this relationship.
I want to know which tests I should conduct to draw inferences from my data.
Relevant answer
Answer
Taking into account that you have only 10 observations for each variable, I suggest analyzing the variables in pairs, i.e., (Human Development Index, Gini Index) and (Human Development Index, US Aid). You can try the exact test for the correlation coefficient (see for example https://documentation.statsoft.com/STATISTICAHelp.aspx?path=Power/PowerAnalysis/Examples/Example9ExactTestsandConfidenceIntervalsfortheCorrelationCoefficient).
Of course, using the exact tests requires some assumptions about the distribution of the observed random variables, but some of them do not require normality.
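With only 10 paired observations, a distribution-free permutation test of the correlation is one practical complement to the exact test mentioned above; the values below are hypothetical:

import numpy as np

rng = np.random.default_rng(1)

# hypothetical values for 10 countries
gini = np.array([0.32, 0.41, 0.38, 0.29, 0.45, 0.35, 0.50, 0.27, 0.44, 0.39])
hdi  = np.array([0.80, 0.71, 0.74, 0.85, 0.66, 0.78, 0.60, 0.88, 0.69, 0.72])

def pearson_r(x, y):
    return np.corrcoef(x, y)[0, 1]

observed = pearson_r(gini, hdi)

# Monte Carlo permutation test: shuffle one variable to break the pairing
n_perm = 100_000
perm_r = np.array([pearson_r(rng.permutation(gini), hdi) for _ in range(n_perm)])
p_two_sided = (np.sum(np.abs(perm_r) >= abs(observed)) + 1) / (n_perm + 1)

print(observed, p_two_sided)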
  • asked a question related to Analytical Statistics
Question
4 answers
I have two models; the second is the first plus additional control variables. I see that the coefficient of a variable present in both models has decreased in the second model, and I want to know whether this difference is statistically significant. For this to be true, is it necessary that the confidence interval of the variable in each model excludes the other model's coefficient? Or is the condition that the two confidence intervals do not overlap at all?
Relevant answer
Answer
You say you have the "same variable" in both models.
Although the values are identical, the coefficients on this variable have different meanings in the two models! It is usually not very sensible to compare the coefficients of the "same variable" across different models. The meanings may be as different as "velocity" and "mass". How would you compare such coefficients? Is 5 m/s more or less than 6 kg? Are 3 m/s equal to 3 kg?
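With the caveat above in mind, if a formal comparison is still wanted, one common approach is to bootstrap the change in the coefficient when the controls are added, refitting both specifications on the same resamples. This is only a sketch under standard bootstrap assumptions, with hypothetical variables:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)                 # variable of interest
z = 0.6 * x + rng.normal(size=n)       # additional control correlated with x
y = 1.0 + 0.5 * x + 0.4 * z + rng.normal(size=n)

def coef_difference(idx):
    """Coefficient on x without controls minus coefficient on x with controls."""
    b_small = sm.OLS(y[idx], sm.add_constant(x[idx])).fit().params[1]
    X_big = sm.add_constant(np.column_stack([x[idx], z[idx]]))
    b_big = sm.OLS(y[idx], X_big).fit().params[1]
    return b_small - b_big

n_boot = 2000
diffs = np.array([coef_difference(rng.integers(0, n, n)) for _ in range(n_boot)])
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
print(ci_low, ci_high)   # if 0 lies outside this interval, the change is distinguishable from 0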
  • asked a question related to Analytical Statistics
Question
7 answers
I have reported life satisfaction as my dependent variable and many independent variables of different kinds. One of them is the area in which the individual lives (urban/rural) and another is access to publicly provided water service. When the area variable is included in the model, the water-service variable is non-significant. However, when area is excluded, the water-service variable becomes significant at the 95% confidence level. The two variables are moderately and negatively correlated (r = -0.45).
What possible explanations do you see for this phenomenon?
Relevant answer
Answer
Mr. Valdivieso, I think multicollinearity is almost certainly the problem. Try the simple VIF test. Then try changing the way you measure the Area and Access variables. Good luck.
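As a sketch of the VIF check suggested above, using statsmodels (the variables here are hypothetical stand-ins for the area and water-access indicators):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 500
urban = rng.integers(0, 2, n)                                        # urban/rural indicator
water = np.where(urban == 1, rng.binomial(1, 0.8, n), rng.binomial(1, 0.4, n))  # access depends on area
income = rng.normal(10, 2, n)

X = pd.DataFrame({"urban": urban, "water": water, "income": income})
X = sm.add_constant(X)

# VIF for each predictor (constant excluded); values well above ~5-10 signal collinearity
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=X.columns[1:],
)
print(vifs)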
  • asked a question related to Analytical Statistics
Question
1 answer
All I can find is for public health data, which is different, so I need a closer suggestion.
Relevant answer
Answer
  • asked a question related to Analytical Statistics
Question
3 answers
Dear Scientists,
Greetings
Please, could anyone suggest an alternative way to analyse data generated from an augmented block design layout?
The following well-known software packages are not working! Does anyone know why? I urgently need your help!
Here are the packages/links:
Indian Agricultural Research Institute, New Delhi
•Statistical Package for Augmented Designs (SPAD)
•SAS macro called augment.sas
CIMMYT – SAS macro called UNREPLICATE
•Developed in 2000 – uses some older SAS syntax
Thanks in advance for your help
Regards
Relevant answer
Answer
None of your links worked, so perhaps explain what you are trying to achieve. Have you thought of using R, which is freely available and actively supported open-source software?
There are augmented block designs in the R package agricolae.
These are designs for two types of treatments: the check (control) treatments and the augmented (new) treatments. The check treatments are applied in complete randomized blocks, and the augmented treatments are assigned at random. Each treatment should appear in any block only once. The check treatments are usually of greater interest; the standard error of a difference involving them is much smaller than that between two augmented treatments in different blocks.
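If R is not an option, a basic fixed-effects analysis of an augmented block design can also be sketched in Python with statsmodels: the repeated checks estimate the block effects, and treatment means are adjusted for them. This is a simplified illustration with hypothetical data, not a replacement for the dedicated packages listed above:

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# hypothetical augmented design: 2 checks repeated in every block, new entries unreplicated
data = pd.DataFrame({
    "block":     ["B1", "B1", "B1", "B1", "B2", "B2", "B2", "B2", "B3", "B3", "B3", "B3"],
    "treatment": ["chk1", "chk2", "new1", "new2",
                  "chk1", "chk2", "new3", "new4",
                  "chk1", "chk2", "new5", "new6"],
    "yield_":    [5.1, 4.8, 5.6, 4.9, 5.3, 5.0, 6.1, 4.7, 5.0, 4.6, 5.4, 5.8],
})

# fixed-effects model: block effects are estimated from the repeated checks,
# and treatment effects are adjusted for them
model = smf.ols("yield_ ~ C(block) + C(treatment)", data=data).fit()
print(anova_lm(model, typ=1))
print(model.params.filter(like="treatment"))   # block-adjusted treatment contrasts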
  • asked a question related to Analytical Statistics
Question
5 answers
I'm looking for a free and user-friendly tool. I'm familiar with Python (and a bit of R).
Thank you
Relevant answer
Answer
Aryan Shahabian The assumptions change as the methods change, and the methods change as the problem changes. In my experience, sensitivity analysis is about how the solution to a problem changes as the input conditions or input data or ... change. It might help you to consider what a sensitivity analysis on the methods in one or the other of the attached papers might mean. In my opinion this is especially interesting because adaptive lasso in both of these situations has an oracle property, as is mentioned. Best wishes, David Booth
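On the tooling side of the original question, SALib is one free Python library commonly used for global (Sobol) sensitivity analysis; below is a minimal sketch on a toy model with hypothetical input bounds:

import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol

# toy model: y = a + 2b + b*c, with hypothetical input ranges
problem = {
    "num_vars": 3,
    "names": ["a", "b", "c"],
    "bounds": [[0, 1], [0, 1], [0, 1]],
}

param_values = saltelli.sample(problem, 1024)   # generates N*(2D+2) input samples
y = param_values[:, 0] + 2 * param_values[:, 1] + param_values[:, 1] * param_values[:, 2]

Si = sobol.analyze(problem, y)
print(Si["S1"])    # first-order indices: how much each input alone drives output variance
print(Si["ST"])    # total-order indices: including interaction effects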
  • asked a question related to Analytical Statistics
Question
9 answers
There are statistical tests for comparing two Pearson correlation coefficients (e.g., via Fisher's r-to-z transformation). I want to know whether there is a statistical test for the significance of the difference between two concordance correlation coefficients, so that they can be compared and one can claim that one is "stronger" than the other.
Relevant answer
Answer
You can use a t-test as well. From Genstat (64-bit Release 19.1), the PRCORRELATION procedure calculates probabilities for product-moment correlations (R.W. Payne).
Method: PRCORRELATION uses the fact that, for a correlation r based on n observations, the variable t = r × √((n − 2) / (1 − r²)) has a t distribution on n − 2 degrees of freedom.
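Beyond the Genstat t-test above, one distribution-free way to compare two concordance correlation coefficients is to bootstrap the difference between them. The sketch below codes Lin's CCC directly and uses hypothetical measurements of two methods against a common reference:

import numpy as np

def ccc(x, y):
    """Lin's concordance correlation coefficient."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    sxy = np.mean((x - mx) * (y - my))
    return 2 * sxy / (vx + vy + (mx - my) ** 2)

rng = np.random.default_rng(0)
n = 60
truth = rng.normal(10, 2, n)                     # hypothetical reference measurements
method_a = truth + rng.normal(0, 0.5, n)         # method A: close agreement
method_b = 0.9 * truth + rng.normal(0, 1.0, n)   # method B: weaker agreement

observed_diff = ccc(truth, method_a) - ccc(truth, method_b)

# bootstrap the difference in CCCs over the same resampled subjects
n_boot = 5000
diffs = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, n, n)
    diffs[i] = ccc(truth[idx], method_a[idx]) - ccc(truth[idx], method_b[idx])

ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
print(observed_diff, (ci_low, ci_high))   # an interval excluding 0 suggests one agreement is stronger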