Science topic

Advanced Statistical Modeling - Science topic

Explore the latest questions and answers in Advanced Statistical Modeling, and find Advanced Statistical Modeling experts.
Questions related to Advanced Statistical Modeling
  • asked a question related to Advanced Statistical Modeling
Question
3 answers
Dear Colleagues,
I was wondering if you could suggest how to analyze insect community (multivariate) data collected across multiple sites and time points. Specifically, I aim to assess differences in community composition between Treatment and Control conditions.
What are my options regarding:
  • Multivariate Time Series modeling?
  • Multivariate Mixed-Effect Models?
  • Latent Models (e.g., Generalized Linear Latent Variable Models, gllvm)?
  • Machine Learning approaches?
I’m aware of Multivariate Time Series analysis applied in fields like Finance and likely many other approaches that could be relevant. However, I am struggling to determine the most appropriate method for my case.
If you have experience in this area or can recommend a good blog, tutorial, or resource, I would greatly appreciate your suggestions.
Thank you for your time!
Relevant answer
Answer
To analyze insect community data across multiple sites and time points, consider these approaches:
1. Multivariate Time Series Modeling: Use models like VAR or state-space to capture temporal dynamics, though they require large datasets and may assume linearity.
2. Multivariate Mixed-Effect Models: Linear or generalized mixed models (LMM/GLMM) handle nested data (e.g., sites within treatments), but can be computationally intensive for multivariate responses.
3. Latent Models (GLLVM): Generalized Linear Latent Variable Models are ideal for high-dimensional count data, handling correlations among species.
4. Machine Learning: Use random forests, gradient boosting, or neural networks to model non-linear relationships and temporal patterns.
5. Ecological Analyses: Use PERMANOVA, ANOSIM, and ordination methods (e.g., NMDS) to test and visualize community differences between treatments.
Start with ordination for exploration, test with PERMANOVA, and apply GLLVMs or mixed models for deeper insights. Resources: R packages like vegan, gllvm, and tutorials on R-bloggers.
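For a concrete starting point, here is a minimal R sketch of the workflow described above (ordination plus PERMANOVA with vegan, then a GLLVM); treat it as a template rather than a recipe, since the object and column names (comm, env, Treatment, Site, Time) are placeholders for your own data.
library(vegan)
library(gllvm)
# comm: site-by-species abundance matrix; env: data frame with Treatment, Site, Time
ord <- metaMDS(comm, distance = "bray")           # NMDS ordination for visual exploration
plot(ord); ordihull(ord, groups = env$Treatment)  # overlay the treatment groups
# PERMANOVA: does composition differ between Treatment and Control?
# Restricting permutations to blocks (sites) respects the repeated sampling over time
adonis2(comm ~ Treatment, data = env,
        permutations = how(blocks = env$Site, nperm = 999))
# Model-based alternative: GLLVM for correlated count data
fit <- gllvm(y = comm, X = env, formula = ~ Treatment + Time,
             family = "negative.binomial", num.lv = 2)
summary(fit)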
  • asked a question related to Advanced Statistical Modeling
Question
4 answers
Hi everyone,
I ran a Generalised Linear Mixed Model to see if an intervention condition (video 1, video 2, control) had any impact on an outcome measure across time (baseline, immediate post-test and follow-up). I am having trouble interpreting the Fixed Coefficients table. Can anyone help?
Also, why are the last four lines empty?
Thanks in advance!
Relevant answer
Answer
Alexander Pabst I would add that the first thing to do is a likelihood ratio test to see whether the model with the fixed effects fits better than a model without them. I see that two of the interaction terms may be significant, but that's contingent on the overall system of variables being 'significant'. Personally I don't use Wald tests; their approximation sometimes isn't very good. I would use stepwise LRTs to determine whether a term (or system of terms) should be included in the model (although for some situations in a mixed model one needs to use something like the BIC).
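To illustrate the likelihood ratio testing described above, a minimal lme4 sketch (the names outcome, condition, time, and id are placeholders; both models must be fitted with ML rather than REML for the LRT to be valid):
library(lme4)
m0 <- lmer(outcome ~ 1 + (1 | id), data = dat, REML = FALSE)                # no fixed effects
m1 <- lmer(outcome ~ condition * time + (1 | id), data = dat, REML = FALSE) # full fixed-effect system
anova(m0, m1)  # likelihood ratio test of the fixed effects as a whole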
  • asked a question related to Advanced Statistical Modeling
Question
1 answer
I would like to perform a literature review at this time on augmented learning and learning-augmented algorithms to enhance performance-guided surgery.
Relevant answer
Answer
1. **Define the Scope and Objectives**:
- Clearly define the objectives of your literature review. For example, you may want to focus on understanding the current state of research in using augmented learning and learning-augmented algorithms to enhance surgical performance and guidance.
- Determine the key aspects you want to cover, such as the specific applications of these techniques in the context of performance-guided surgery, the methodologies employed, the reported outcomes and benefits, as well as any challenges or limitations.
2. **Search and Gather Relevant Literature**:
- Identify relevant databases and search engines, such as PubMed, IEEE Xplore, ACM Digital Library, and Google Scholar, to search for peer-reviewed journal articles, conference proceedings, and other relevant publications.
- Use a combination of keywords, such as "augmented learning", "learning-augmented algorithms", "performance-guided surgery", "surgical guidance", "surgical decision support", etc., to conduct your searches.
- Ensure you include both recent and seminal publications in your search to capture the latest advancements as well as the foundations of the field.
3. **Review and Critically Analyze the Literature**:
- Carefully read and analyze the selected publications, focusing on the key aspects identified in the scope and objectives.
- Identify the main themes, methodologies, findings, and contributions reported in the literature.
- Assess the quality, validity, and reliability of the studies, and identify any gaps, inconsistencies, or areas that require further investigation.
4. **Synthesize the Findings**:
- Organize the literature review in a logical and coherent manner, potentially using a thematic or chronological approach.
- Synthesize the key insights, trends, and conclusions drawn from the literature, highlighting the potential applications, benefits, and limitations of using augmented learning and learning-augmented algorithms in performance-guided surgery.
5. **Identify Future Research Directions**:
- Based on your analysis of the literature, identify areas that require further research, such as specific surgical procedures or applications that could benefit from these techniques, methodological improvements, or the integration of these approaches with other emerging technologies.
- Provide recommendations for future research that could contribute to the advancement of this field and address the identified gaps.
6. **Structure and Write the Literature Review**:
- Organize your literature review into a well-structured document, including an introduction, background, review of the literature, synthesis of findings, and a conclusion.
- Use appropriate headings, subheadings, and transitions to ensure the flow and readability of your review.
- Properly cite the references using a consistent citation style, such as APA or IEEE.
Good luck; partial credit AI
  • asked a question related to Advanced Statistical Modeling
Question
1 answer
I'm currently working on a project involving group-based trajectory modelling and am seeking advice on handling multi-level factors within this context. Specifically, I'm interested in understanding the following:
  1. Multi-Level Factors in Trajectory Modelling: How can multi-level factors (e.g., individual-level and group-level variables) be effectively addressed in group-based trajectory modelling? Are there specific methods or best practices recommended for incorporating these factors?
  2. Flexmix Package: I’ve come across the Flexmix package in R, which supports flexible mixture modelling. How can this package be utilised to handle multi-level factors in trajectory modelling? Are there specific advantages or limitations of using Flexmix compared to other methods?
  3. Comparison with Other Approaches: In what scenarios would you recommend using Flexmix over other trajectory modelling approaches like LCMM, TRAJ, or GBTM? How do these methods compare in terms of handling multi-level data and providing accurate trajectory classifications?
  4. Adjusting for Covariates: When identifying initial trajectories (e.g., highly adherent, moderately adherent, low adherent), is it necessary to adjust for covariates such as age, sex, and socioeconomic status (SES)? Or is focusing on adherence levels at each time point sufficient for accurate trajectory identification? What are the best practices for incorporating these covariates into the modelling process?
Any insights, experiences, or references to relevant literature would be greatly appreciated!
Relevant answer
Answer
Addressing multi-level factors in group-based trajectory modeling (GBTM) is crucial for accurately capturing the hierarchical structure of the data. This often involves accounting for individual-level and group-level effects, which can significantly influence the trajectory analysis. Here, we’ll explore different methods for addressing these multi-level factors and discuss the role of the flexmix package in R.
Methods for Addressing Multi-Level Factors in GBTM
  1. Hierarchical Linear Models (HLMs) or Multilevel Models: These models explicitly account for the nested structure of the data, allowing for random intercepts and slopes at different levels. Useful when you have nested data (e.g., students within schools, patients within hospitals).
  2. Latent Class Growth Analysis (LCGA): This method identifies distinct trajectory groups without accounting for nested data structures. Useful when the primary interest is in identifying distinct groups of trajectories but less effective for multi-level data.
  3. Growth Mixture Modeling (GMM): Extends LCGA by allowing for within-class variation in growth trajectories. Can incorporate random effects to account for multi-level structures but can be complex and computationally intensive.
  4. Multilevel Growth Mixture Modeling: Combines features of multilevel modeling and GMM to address hierarchical data. Allows for the inclusion of both individual and group-level random effects.
  5. Two-Stage Approaches: First stage involves fitting individual-level growth trajectories. Second stage models the extracted parameters (e.g., intercepts, slopes) as outcomes in a higher-level model.
Role of the flexmix Package in R
The flexmix package in R is a powerful tool for finite mixture modeling, including GBTM. It allows for the specification of various types of mixture models, including those with multi-level data structures.
Key Features of flexmix:
  • Flexibility: Can handle different types of mixture models (e.g., normal, Poisson, binomial).
  • Customization: Users can define their own models and likelihood functions.
  • Integration: Works well with other R packages, enabling complex modeling frameworks.
Addressing multi-level factors in GBTM requires careful consideration of the hierarchical structure of the data. Combining tools like flexmix for trajectory identification with multilevel modeling packages can provide a robust framework for analyzing complex data structures. By leveraging the strengths of different methods, you can achieve more accurate and insightful results in your trajectory analysis.
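As a rough, hedged illustration of fitting trajectory classes with flexmix (the data frame dat with columns adherence, time, and id is hypothetical; group-level predictors would still need to be handled separately, e.g. in a second stage or via concomitant variable models):
library(flexmix)
# Fit mixtures with 1-5 latent trajectory classes; "| id" groups repeated
# observations so that each person is assigned to a single class
set.seed(1)
fits <- stepFlexmix(adherence ~ poly(time, 2) | id, data = dat, k = 1:5,
                    model = FLXMRglm(family = "gaussian"), nrep = 5)
best <- getModel(fits, which = "BIC")  # choose the number of classes by BIC
summary(best)
clusters(best)                         # class membership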
  • asked a question related to Advanced Statistical Modeling
Question
1 answer
Hi everyone.
When running a GLMM, I need to turn the data from wide format to the long format (stacked).
When checking for assumptions like normality, do I check them for the stacked variable (e.g., outcomemeasure_time) or for each variable separately (e.g., outcomemeasure_baseline, outcomemeasure_posttest, outcomemeasure_followup)?
Also, when identifying covariates via correlations (Pearson's or Spearman's), do I use the separate variables or the stacked one?
Normality: say normality for outcomemeasure_baseline is violated but normality for the others (outcomemeasure_posttest and outcomemeasure_followup) is not. Normality for the stacked variable is also not violated. In this case, when running the GLMM, do I adjust for normality violations because normality for one of the separate measures was violated?
Covariates: say age was identified as a covariate for outcomemeasure_baseline but not for the others (separately: outcomemeasure_posttest and outcomemeasure_followup, or the stacked variable). In this case, do I include age as a covariate since it was identified as one for one of the separate variables?
Thank you so much in advance!
Relevant answer
Answer
The normality assumption only matters for a model with normally (Gaussian) distributed errors (an LMM): it is the residuals of the fitted model that should be approximately normal, and you check that assumption on the model residuals, not on the raw variables. Given that you use the term GLMM, have you selected a model with a different distribution and link function? If these words sound like gibberish, it might help to look up the terminology I just used or find a few introductory articles or books. Best
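As a small illustration of checking assumptions on the fitted model rather than on the raw variables, a hedged lme4/DHARMa sketch (outcome, condition, time, and id are placeholder names):
library(lme4)
library(DHARMa)
m <- lmer(outcome ~ condition * time + (1 | id), data = dat_long)
qqnorm(resid(m)); qqline(resid(m))  # approximate normality of residuals
plot(fitted(m), resid(m))           # roughly constant spread?
plot(simulateResiduals(m))          # simulation-based residuals, also valid for non-Gaussian GLMMs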
  • asked a question related to Advanced Statistical Modeling
Question
1 answer
I have data from population-based observation (not questionnaires but yearly observations from a secondary database), and I already have a common model for each population (6 groups; each has the same latent variables and observed variables, and the structural models also have the same shape). As the study is longitudinal in nature (the observations are not independent of each other), am I still able to use MGA (Multi-Group Analysis)? My result does not pass the MICOM procedure; is running MICOM obligatory prior to MGA in my specific case?
Relevant answer
Answer
They should pass MICOM. Otherwise, you may only qualitatively compare the results of group differences (but not based on a test).
  • asked a question related to Advanced Statistical Modeling
Question
7 answers
I have a mixed-effect model with two random-effect variables. I wanted to rank the relative importance of the variables. The relaimpo package doesn't work for mixed-effect models. I am interested in the fixed-effect variables anyway, so would it be okay if I only take the fixed variables and use relimp? Or should I use Akaike weights across a set of alternative models, each with one of the variables left out?
Which one is more acceptable?
Relevant answer
Answer
install.packages("glmm.hp")
library(glmm.hp)
library(MuMIn)
library(lme4)
mod1 <- lmer(Sepal.Length ~ Petal.Length + Petal.Width+(1|Species),data = iris)
r.squaredGLMM(mod1)
glmm.hp(mod1)
a <- glmm.hp(mod1)
plot(a)
  • asked a question related to Advanced Statistical Modeling
Question
2 answers
Suppose that we have three variables (X, Y, Z). According to past literature, Y mediates the relationship between X and Z, while X mediates the relationship between Y and Z. Can I analyze these interrelationships in a single SEM using a duplicate variable for either X (i.e., X_iv and X_dv) or Y (Y_iv and Y_dv)?
Relevant answer
Answer
It is possible to use the same variable twice, once as a mediator and once as an independent variable. This methodology enables a more comprehensive examination of the connections inside the model.
For Reference:
  • asked a question related to Advanced Statistical Modeling
Question
3 answers
What are the possible ways of rectifying a lack of fit test showing up as significant. Context: Optimization of lignocellulosic biomass acid hydrolysis (dilute acid) mediated by nanoparticles
  • asked a question related to Advanced Statistical Modeling
Question
3 answers
We measured three aspects (i.e. variables) of self-regulation. We have 2 groups and our sample size is ~30 in each group. We anticipate that three variables will each contribute unique variance to a self-regulation composite. How do we compare if there are group differences in the structure/weighting of the composite? What analysis should be conducted?
Relevant answer
Answer
Are you thinking of self-regulation as a latent variable with the 3 "aspects" as manifest indicators? If so, you could use a two-group SEM, although your sample size is a bit small.
You've not said what software you use, but this part of the Stata documentation might help you get the general idea anyway.
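If you do go the two-group SEM route, a minimal lavaan sketch might look like the following (the indicator names sr1-sr3 and the grouping variable group are placeholders); comparing a model with freely estimated loadings against one with loadings constrained equal across groups tests whether the weighting of the composite differs between groups:
library(lavaan)
model <- 'selfreg =~ sr1 + sr2 + sr3'
fit_free  <- cfa(model, data = dat, group = "group")       # loadings estimated per group
fit_equal <- cfa(model, data = dat, group = "group",
                 group.equal = "loadings")                 # loadings constrained equal
anova(fit_free, fit_equal)  # chi-square difference test of equal weighting
Keep the small-sample caveat above in mind: with roughly 30 per group, this comparison will have limited power.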
  • asked a question related to Advanced Statistical Modeling
Question
1 answer
I have a set of data measured with a spectrum analyser of the power emitted by an antenna, "mW_NLOS", as a function of frequency. How can I fit this data to a Rician distribution using MATLAB?
Note that I used my_dist = fitdist(mW_NLOS, 'Rician'), but the result does not seem correct to me.
Relevant answer
Answer
There are two main methods to fit a distribution to a set of data in MATLAB:
1. Using the fitdist function
The fitdist function is the most versatile and commonly used method for fitting distributions to data in MATLAB. It takes two required arguments:
  • x: The data vector
  • distname: The name of the distribution to fit
For example, to fit a normal distribution to the data vector x, you would use the following command:
pd = fitdist(x, 'normal');
The fitdist function returns a probability distribution object pd that contains the estimated parameters of the fitted distribution. You can use the pd object to evaluate the probability density function (PDF), cumulative distribution function (CDF), quantile function, and other properties of the distribution.
2. Using the Distribution Fitter app
The Distribution Fitter app is a graphical user interface (GUI) that provides a convenient way to fit distributions to data in MATLAB. To use the Distribution Fitter app, follow these steps:
  1. Open the Distribution Fitter app by clicking on the Apps tab in the MATLAB toolbar and selecting Math > Statistics and Optimization > Distribution Fitter.
  2. Select the data vector you want to fit a distribution to.
  3. Choose the distribution you want to fit from the list of available distributions.
  4. Click the Fit button.
The Distribution Fitter app will display a variety of plots and statistics that can be used to assess the goodness of fit of the distribution.
Additional options
Both the fitdist function and the Distribution Fitter app provide a number of additional options that you can use to customize the fitting process. For example, you can specify the fitting method (e.g., maximum likelihood, least squares), set confidence intervals, and plot the fitted distribution along with the data.
Which method to use?
The best method for fitting a distribution to data in MATLAB depends on your specific needs. If you need more control over the fitting process, then the fitdist function is a good choice. However, if you are looking for a more user-friendly interface, then the Distribution Fitter app is a better option.
  • asked a question related to Advanced Statistical Modeling
Question
5 answers
Greetings,
I am currently in the process of conducting a Confirmatory Factor Analysis (CFA) on a dataset consisting of 658 observations, using a 4-point Likert scale. As I delve into this analysis, I have encountered an interesting dilemma related to the choice of estimation method.
Upon examining my data, I observed a slight negative kurtosis of approximately -0.0492 and a slight negative skewness of approximately -0.243 (please refer to the attached file for details). Considering these properties, I initially leaned towards utilizing the Diagonally Weighted Least Squares (DWLS) estimation method, as existing literature suggests that it takes into account the non-normal distribution of observed variables and is less sensitive to outliers.
However, to my surprise, when I applied the Unweighted Least Squares (ULS) estimation method, it yielded significantly better fit indices for all three factor solutions I am testing. In fact, it even produced a solution that seemed to align with the feedback provided by the respondents. In contrast, DWLS showed no acceptable fit for this specific solution, leaving me to question whether the assumptions of ULS are being violated.
In my quest for guidance, I came across a paper authored by Forero et al. (2009; DOI: 10.1080/10705510903203573), which suggests that if ULS provides a better fit, it may be a valid choice. However, I remain uncertain about the potential violations of assumptions associated with ULS.
I would greatly appreciate your insights, opinions, and suggestions regarding this predicament, as well as any relevant literature or references that can shed light on the suitability of ULS in this context.
Thank you in advance for your valuable contributions to this discussion.
Best regards, Matyas
Relevant answer
Answer
Thank you for your question. I have searched the web for information about the Diagonally Weighted Least Squares (DWLS) and Unweighted Least Squares (ULS) estimators, and I have found some relevant sources that may help you with your decision.
One of the factors that you should consider when choosing between DWLS and ULS is the sample size. According to Forero et al. (2009) [1], DWLS tends to perform better than ULS when the sample size is small (less than 200), but ULS tends to perform better than DWLS when the sample size is large (more than 1000). Since your sample size is 658, it falls in the intermediate range, where both methods may provide similar results.
Another factor that you should consider is the degree of non-normality of your data. According to Finney and DiStefano (2006), DWLS is more robust to non-normality than ULS, especially when the data are highly skewed or kurtotic. However, ULS may be more efficient than DWLS when the data are moderately non-normal or close to normal. Since your data have slight negative skewness and kurtosis, it may not be a serious violation of the ULS assumptions.
A third factor that you should consider is the model fit and parameter estimates. According to Forero et al. (2009) [1], both methods provide accurate and similar results overall, but ULS tends to provide more accurate and less variable parameter estimates, as well as more precise standard errors and better coverage rates. However, DWLS has higher convergence rates than ULS, which means that it is less likely to encounter numerical problems or estimation errors.
Based on these factors, it seems that both DWLS and ULS are reasonable choices for your data and model, but ULS may have some advantages over DWLS in terms of efficiency and accuracy. However, you should also check the sensitivity of your results to different estimation methods, and compare them with other criteria such as theoretical plausibility, parsimony, and interpretability.
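If it helps, that sensitivity check can be run directly in lavaan; a hedged sketch (the model syntax and item names are placeholders), fitting the same CFA with DWLS- and ULS-based robust estimators for ordinal indicators and comparing fit:
library(lavaan)
model <- 'F1 =~ i1 + i2 + i3 + i4
          F2 =~ i5 + i6 + i7 + i8'
items <- paste0("i", 1:8)
fit_dwls <- cfa(model, data = dat, ordered = items, estimator = "WLSMV")  # DWLS with robust corrections
fit_uls  <- cfa(model, data = dat, ordered = items, estimator = "ULSMV")  # ULS with robust corrections
fitMeasures(fit_dwls, c("cfi.scaled", "tli.scaled", "rmsea.scaled", "srmr"))
fitMeasures(fit_uls,  c("cfi.scaled", "tli.scaled", "rmsea.scaled", "srmr"))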
I hope this answer helps you with your analysis. If you need more information, you can refer to the sources that I have cited below.
[1] Forero, C. G., Maydeu-Olivares, A., & Gallardo-Pujol, D. (2009). Factor analysis with ordinal indicators: A Monte Carlo study comparing DWLS and ULS estimation. British Journal of Mathematical and Statistical Psychology.
[2] Finney, S. J., & DiStefano, C. (2006). Non-normal and categorical data in structural equation modeling. In Structural equation modeling: A second course.
Good luck
  • asked a question related to Advanced Statistical Modeling
Question
4 answers
I have a longitudinal model and the stability coefficients for one construct change dramatically from the first and second time point (.04) to the second and third time point (.89). I have offered a theoretical explanation for why this occurs, but have been asked about potential model bias.
Why would this indicate model bias? (A link to research would be helpful).
How can I determine whether the model is biased or not? (A link to research would be helpful).
Thanks!
Relevant answer
Answer
That makes sense. Are you comparing the cross-lagged panel (auto)regression (path) coefficients to zero-order correlations? This could be part of the issue (explain the "discrepancy"/low autoregressive stability coefficient). Regression coefficients are not equal to zero-order (bivariate) correlations. The regression coefficients take the correlation with other independent variables into account. This may explain why the autoregressive "stability" coefficients in your model look very different from the zero-order correlations. It is impossible to know without looking at your data and model in more detail.
The model fit does not look completely horrible at first sight but the chi-square test is significant and the RMSEA value is a bit high. I would take a look at model residuals and/or modification indices to find out where the model may be misspecified.
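For the diagnostic step mentioned above, a short lavaan sketch (assuming your fitted model object is called fit):
library(lavaan)
resid(fit, type = "cor")                           # correlation residuals; large entries flag local misfit
modindices(fit, sort. = TRUE, maximum.number = 10) # largest modification indices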
  • asked a question related to Advanced Statistical Modeling
Question
5 answers
In recent years, quite a few reports have been published with results based on statistical information processing. For example, a study establishes that the use of a certain remedy (some food, drink, nutritional supplement, drug, treatment method, etc.) reduces (or increases) the value of some output parameter by 20 ... 30 ... 40%. The output parameter can be the frequency of onset of the analyzed disease, the frequency of its successful cure, etc. Based on this finding, the conclusion is made that the studied factor significantly influences the output parameter. How trustworthy can such a conclusion be?
For further details see, please,
Relevant answer
Answer
Thank you, Sergey.
  • asked a question related to Advanced Statistical Modeling
Question
1 answer
Question background. There is an equipartition theorem, and it is without doubt correct. But it has its conditions of applicability, which are not always satisfied. There are well-known examples of a chain of connected oscillators, the spectral density of a black body, the new example of an ideal gas in a round vessel I have studied. How may or may not the energy be partitioned in such cases, when the equipartition theorem is not applicable? Can anyone provide more systems with known uneven laws of energy partitioning?
Relevant answer
Answer
This subject is covered in some textbooks on microscopic thermodynamics (e.g., Pierce 1968) and most textbooks on gas dynamics (e.g., Owczarek 1964). Basically, it is an unstable and temporary state, often called "frozen".
  • asked a question related to Advanced Statistical Modeling
Question
3 answers
I am using a fixed-effect panel data model with 100 observations (20 groups), 1 dependent and three independent variables. I would like to get a regression output from it. My question: is it necessary to run any normality test and linearity test for panel data? And what difference would it make if I don't run these tests?
Relevant answer
Answer
Rede Ganeshkumar Dilipkumar So, to test the hypotheses about the relationship between regressor and regressand, do we really need a normality test? In a causal-relationship design we have to establish whether the alternative hypothesis or the null hypothesis answers the question of the regressors' influence on the regressand. So after finding the best-fitting regression model (pooled, fixed effects, or random effects), we then test the influence of the regressors on the regressand. Are you saying that this hypothesis test definitely requires a normality test and the other assumption tests? Please give a recommended theory or reference to strengthen your argument. Thank you for the enlightenment.
  • asked a question related to Advanced Statistical Modeling
Question
4 answers
The variables I have, a vegetation index and plant disease severity scores, were not normal. So I applied a log10(y+2) transformation to the vegetation index and a sqrt(log10(y+2)) transformation to the plant disease severity score. Plant disease severity is on a scale of 0, 10, 20, 30, ..., 100 and was scored based on visual observations. Even after the combined transformation, the disease severity scoring data is non-normal, but it improves the CV in simple linear regression.
Can I proceed with the parametric test, a simple linear regression between the log-transformed vegetation index (normally distributed) and the combined-transformed (non-normal) disease severity data?
Relevant answer
Answer
Why would these variables have to be normal? As far as I understand your problem, a logistic model might do well. You can try it with my software "FittingKVdm", but if you can send me some data, I can try it for you.
  • asked a question related to Advanced Statistical Modeling
Question
5 answers
Hi,
I have data from a study that included 3 centers. I will conduct a multiple regression (10 IVs, 1 non-normally distributed DV) but I am unsure how to handle the variable "center" in these regressions. Should I:
1) Include "centre" as one predictor along with the other 10 IVs.
2) Utilize multilevel regression
Thanks in advance for any input
Kind regards
  • asked a question related to Advanced Statistical Modeling
Question
5 answers
Hello everyone, I need a bit of help with statistical analysis methods.
My partner (MD) is conducting research as part of her residency exam on how years of occupation impact workers' hearing. Her known variables are years of employment, years of employment at the current job position, age, and percentage of hearing loss (calculated with the Fowler-Sabine formula, so in %).
She had a statistician working on her study who did a multivariate linear regression (explaining he used it because one variable is in %).
However, one of her professors said she should use log regression analysis instead. Why? Is multivariate linear regression not OK, and if not, why not?
Can anyone help explain which one should be used / is better and why? We tried Google, but as we are not statisticians or experienced researchers this is quite hard for us to understand. However, she needs this done correctly as this study is part of her residency exam.
Any help is much appreciated.
Many thanks!
Anze&Ana
Relevant answer
Answer
I believe there is a misunderstanding in this discussion. Could "log regression" not mean logarithmic rather than logistic? That makes more sense as % is still numeric.
  • asked a question related to Advanced Statistical Modeling
Question
1 answer
How can I add robust 97.5% confidence ellipses to the variation diagrams (XY, ilr-transformed) in the robCompositions or compositions packages?
Best
Azzeddine
Relevant answer
Answer
For the benefit of others, I have checked a group of packages that can add the robust 97.5% confidence ellipses.
You can view them here by package and function:
1. ellipses() using the package 'ellipse'
2. ellipses() using the package 'rrcov'
3. ellipses() using the package 'cluster'
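As one hedged example of the general idea (not tied to the specific packages listed above): after ilr-transforming the data, a robust covariance estimate can be turned into a 97.5% ellipse with base tools. The column names x and y stand in for your two ilr coordinates.
library(MASS)     # cov.rob(): robust (MCD) location and scatter
library(ellipse)  # ellipse(): contour at a chosen confidence level
rob <- cov.rob(dat[, c("x", "y")], method = "mcd")
plot(dat$x, dat$y, xlab = "ilr.1", ylab = "ilr.2")
lines(ellipse(rob$cov, centre = rob$center, level = 0.975), col = "red")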
  • asked a question related to Advanced Statistical Modeling
Question
3 answers
Hello!
In general, as a rule of thumb, what is the acceptable value for standardised factor loadings produced by a confirmatory factor analysis?
And, what could be done/interpretation if the obtained loadings are lower than the acceptable value?
How does everyone approach this?
Relevant answer
Answer
@ Ravisha Jayawickrama: Most sources accept standardised factor loadings above 0.4.
  • asked a question related to Advanced Statistical Modeling
Question
4 answers
Merry Christmas everyone!
I used the Interpersonal Reactivity Index (IRI) subscales Empathic Concern (EC), Perspective Taking (PT) and Personal Distress (PD) in my study (N = 900). When I calculated Cronbach's alpha for each subscale, I got .71 for EC, .69 for PT and .39 for PD. The value for PD is very low. The analysis indicated that if I deleted one item, the alpha would increase to .53, which is still low but better than .39. However, as my study does not focus mainly on the psychometric properties of the IRI, what kind of arguments can I make to say the results are still valid? I did say the findings (for PD) should be taken with caution, but what else can I say?
Relevant answer
Answer
A scale reliability of .39 (and even .53!) is very low. Even if your main focus is not on the psychometric properties of your measures, you should still care about those properties. Inadequate reliability and validity can jeopardize your substantive results.
My recommendation would be to examine why you get such a low alpha value. Most importantly, you should first check whether each scale (item set) can be seen as unidimensional (measuring a single factor). This is usually done by running a confirmatory factor analysis (CFA) or item response theory analysis. Unidimensionality is a prerequisite for a meaningful interpretation of Cronbach's alpha (alpha is a composite reliability index for essentially tau-equivalent measures). CFA allows you to test the assumption of unidimensionality/essential tau equivalence and to examine the item loadings.
Also, you can take a look at the item intercorrelations. If some items have low correlations with others, this may indicate that they do not measure the same factor (and/or that they contain a lot of measurement error). Another reason for a low alpha value can be an insufficient number of items.
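A short R sketch of the checks suggested above (assuming the PD items sit in columns pd1-pd7 of your data frame; the names are placeholders):
library(psych)
library(lavaan)
pd_items <- dat[, paste0("pd", 1:7)]
alpha(pd_items)     # alpha, item-total correlations, alpha-if-item-deleted
lowerCor(pd_items)  # item intercorrelations
# one-factor CFA to check unidimensionality and inspect the loadings
model <- 'PD =~ pd1 + pd2 + pd3 + pd4 + pd5 + pd6 + pd7'
fit <- cfa(model, data = dat)
summary(fit, fit.measures = TRUE, standardized = TRUE)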
  • asked a question related to Advanced Statistical Modeling
Question
2 answers
If we have multiple experts providing the prior probabilities for the parent nodes, how will the experts fill in the node probabilities (such as low, medium and high), and how will we get the consensus of all the experts about the probability distribution of the parent node?
If someone can please share any paper/Questionnaire/expert based Bayesian network where all these queries are explained it will be highly appreciated.
Relevant answer
Answer
Ette Etuk Thank you so much for the feedback. Actually, if you have a lot of stakeholders and you want to create a consensus among them, how do we incorporate their probabilities into the parent nodes of the Bayesian network?
  • asked a question related to Advanced Statistical Modeling
Question
7 answers
Hi,
I have used a central composite design with four variables and 3 levels, which gives me 31 experiments. After performing the experiments, I found that the model is not significant. However, when I used different data (which I had previously obtained), I got a good model.
How do I justify using user-defined data? And why did the CCD fail to provide a significant model?
I would be really thankful for your response.
Relevant answer
Answer
There are a few questions necessary to ask.
Are you sure you have used a Central Composite Design? CCD requires 5 levels for each factor: -axial, -1, 0, 1, +axial. Perhaps you used a Box-Behnken Design, which requires -1, 0, 1?
Next, what do you mean by "used different data, which gave you a good model"? Did you already have responses for this exact design? If not, maybe the previous data does not represent your current experiment?
Finally, are you sure your factors significantly affect the response? They might not, and in that case a significant model cannot be found.
  • asked a question related to Advanced Statistical Modeling
Question
6 answers
I'm trying to construct a binary logistic model. The first model includes 4 predictor variables, and the intercept is not statistically significant. Meanwhile, in the second model, I exclude one variable from the first model and the intercept is significant.
The consideration that I take here is that:
The pseudo R² of the first model is better than that of the second model.
Any suggestion as to which model I should use?
Relevant answer
Answer
You should use the model that makes more sense, practically and/or theoretically. A high R² is not an indication of the "goodness" of the model. A higher R² can also mean that the model makes more wrong predictions with higher precision.
Do not build your model based on observed data. Build your model based on understanding (theory) and the targeted purpose (simple prediction, extrapolation (e.g. forecast), testing meaningful hypotheses, etc.).
Removing a variable from the model changes the meaning of the intercept. The intercepts in the two models have different meanings. They are (very usually) not comparable. The hypothesis tests of the intercepts of the two models test very different hypotheses.
PS: a "non-significant" intercept term just means that the data are not sufficient to statistically distinguish the estimated value (the log odds given all X=0) from 0, what means that you cannot distinguish the probability of the event (given all X=0) from 0.5 (the data are compatible with probabilities larger and lower 0.5). This is rarely a sensible hypothesis to test.
  • asked a question related to Advanced Statistical Modeling
Question
5 answers
I am using an ARDL model; however, I am having some difficulties interpreting the results. I found that there is cointegration in the long run. I have provided pictures below.
Relevant answer
Answer
Mr a. D.
The ECT(-1) is always the lagged value of your dependent variable.
Regards
  • asked a question related to Advanced Statistical Modeling
Question
5 answers
What are the most important updates that distinguish the last update of SMART PLS (4) from the previous one (3)?
Relevant answer
Answer
Syed Arslan Haider : SmartPLS gives a very generous 50% discount to academics. But your idea of using the Big Mac index for country-specific pricing is quite interesting. However, this requires a solution to minimize the potential for fraud and abuse (an additional investment). In the end, they need to be able to fund the software development.
  • asked a question related to Advanced Statistical Modeling
Question
3 answers
Dear fellow researchers,
Usually we use lavaan for continuous variable, so can we still use lavaan for categorical variable (e.g. high and low ethnic diversity composition)?
Thank you very much!
Best,
Edita
Relevant answer
Answer
Hello Edita,
A categorical variable having only two levels (e.g., coded 0/1) can be used in any linear model as an IV or antecedent variable.
If such a variable is the DV, however, it likely makes more sense to switch from linear to logistic models.
Good luck with your work.
  • asked a question related to Advanced Statistical Modeling
Question
2 answers
I recently included GEE models in the statistical analysis and evaluated them with the Wald chi-square test.
Does anyone know how to correctly report the findings according to APA guidelines?
E.g., we would report the findings of an rANOVA as follows:
"No main effect of group factors F(1,92)=.52, p > .05"
How do you report these findings? Please find an output of the model attached. Thank you!
Relevant answer
Answer
Hello Melanie,
Along with your table, something as simple as this:
"Of the three tests, only that for scoresED_1 was statistically significant via the Wald test, W(1) = 5.203, p = .023."
Then, of course, go on to explain the meaning of this effect in the context of your research question(s) as well as implications of the non-significant results for time and score*time interaction.
Good luck with your work.
  • asked a question related to Advanced Statistical Modeling
Question
1 answer
Dear all, I want to replicate an Eview plot (attached as Plot 1) in STATA after performing a time series regression. I made an effort to produce this STATA plot (attached as Plot 2). However, I want Plot 2 to be exactly the same thing as Plot 1.
Please, kindly help me out. Below are the STATA codes I run to produce Plot 2. What exactly did I need to include?
The codes:
twoway (tsline Residual, yaxis(1) ylabel(-0.3(0.1)0.3)) (tsline Actual, yaxis(2)) (tsline Fitted, yaxis(2)),legend(on)
  • asked a question related to Advanced Statistical Modeling
Question
3 answers
One dependent variable (continuous) ~ two continuous and two categorical (nominal) independent variables
I'm seeking the best method for prediction with a dataset of more than 100 sites. None of the continuous variables is normally distributed.
Relevant answer
Answer
Beyond the scarcity of information, are you sure of the relationship between variables?
  • asked a question related to Advanced Statistical Modeling
Question
6 answers
I have previously conducted laboratory experiments on a photovoltaic panel under the influence of artificial soiling in order to be able to obtain the short circuit current and the open-circuit voltage data, which I analyzed later using statistical methods to draw a performance coefficient specific to this panel that expresses the percentage of the decrease in the power produced from the panel with the increase of accumulating dust. Are there any similar studies that relied on statistical analysis to measure this dust effect?
I hope I can find researchers interested in this line of research and that we can do joint work together!
Article link:
Relevant answer
Answer
Dear Dr Younis
Find attached:
1. Spatial Management for Solar and Wind Energy in Kuwait (ResearchGate)
2. Cost and effect of native vegetation change on aeolian sand, dust, microclimate and sustainable energy in Kuwait (ResearchGate)
regards
Ali Al-Dousari
  • asked a question related to Advanced Statistical Modeling
Question
4 answers
Which one of these multilevel models is better? Should the random-equation variables also be added as covariates?
Model A: with the random-equation variables as covariates
Model B: without the random-equation variables as covariates
* Model A gave the same results as a routine ologit. So, if Model A is better than Model B, what is the point of using multilevel mixed models (given the same result as ologit)?
Relevant answer
Answer
You may want to deepen your understanding of multilevel random coefficient models by using the resources given here:
and
it includes instructions for Stata
Modules
  1. Using quantitative data in research (watch video introduction)
  2. Introduction to quantitative data analysis (watch video introduction)
  3. Multiple regression
  4. Multilevel structures and classifications (watch video introduction)
  5. Introduction to multilevel modelling
  6. Regression models for binary responses
  7. Multilevel models for binary responses
  8. Multilevel modelling in practice: Research questions, data preparation and analysis
  9. Single-level and multilevel models for ordinal responses
  10. Single-level and multilevel models for nominal responses
  11. Three-level multilevel models
  12. Cross-classified multilevel models
  13. Multiple membership multilevel models
  14. Missing Data
  15. Multilevel Modelling of Repeated Measures Data
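To make the Model A / Model B distinction concrete, here is a hedged R sketch using the ordinal package (the question mentions ologit, which suggests Stata, so treat this purely as an illustration; y, x, and group are placeholder names). Model A includes x both as a fixed effect (the population-average trend) and as a random slope; Model B keeps only the random slope.
library(ordinal)  # clmm(): cumulative-link (ordinal logit) mixed models
mA <- clmm(y ~ x + (1 + x | group), data = dat)  # Model A
mB <- clmm(y ~ 1 + (1 + x | group), data = dat)  # Model B
anova(mB, mA)  # does the fixed effect of x improve the fit?
If the estimated random-effect variances are close to zero, the multilevel model effectively collapses to a plain ologit, which may be why Model A reproduces the single-level results.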
  • asked a question related to Advanced Statistical Modeling
Question
2 answers
I am working with the phyr::pglmm package in R, which uses Pagel's lambda to correct for phylogenetic non-independence. I wish to report this value to give an idea of the strength of the phylogenetic signal. However, contrary to other functions such as pgls in caper and the like, the results do not show the lambda used to generate the model.
Is there any function to extract this value from the model summary?
Thanks
Relevant answer
Answer
If you are using {phyr}, you might need to calculate the lambda manually from the results, taking the outputs from the random effects. Also, in {phyr} you can use the function cor_phylo to estimate the phylogenetic signal and the correlation between multiple traits; more details here: https://daijiang.github.io/phyr/reference/cor_phylo.html.
  • asked a question related to Advanced Statistical Modeling
Question
3 answers
I have been working with a GAM model with numerous features (>10). Although I have tuned it to satisfaction in my business application, I was wondering what the correct way is to fine-tune a GAM model, i.e., whether there is any specific way to tune the regularizers and the number of splines, and whether there is a way to say which model is accurate.
The question actually comes from the observation that, at different levels of tuning and regularization, we can reduce the variability of the effect of a specific variable, i.e., reduce the number of ups and downs in the fitted smooth, and so on. So I don't understand at this point which model represents the objective truth and which one doesn't, since other variables also end up influencing each fitted smooth.
Relevant answer
Answer
The cross-validation tools in Python's scikit-learn work very well for tuning hyper-parameters.
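For GAMs specifically, much of this tuning can be delegated to the smoothing-penalty machinery; a hedged mgcv sketch (y, x1, x2, x3 are placeholder names):
library(mgcv)
# REML estimates the smoothing parameters; select = TRUE adds an extra penalty
# that can shrink entire smooths out of the model (built-in regularization)
fit <- gam(y ~ s(x1, k = 10) + s(x2, k = 10) + s(x3, k = 10),
           data = dat, method = "REML", select = TRUE)
gam.check(fit)  # is the basis dimension k large enough?
summary(fit)    # effective degrees of freedom show how wiggly each smooth ended up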
  • asked a question related to Advanced Statistical Modeling
Question
4 answers
Hi
I'm working on research to develop a nonlinear model (e.g., exponential, polynomial, etc.) between a dependent variable (Y) and 30 independent variables (X1, X2, ..., X30).
As you know, I need to choose the variables that have the most impact on estimating Y.
But the question is: can I use the Pearson correlation coefficient matrix to choose the best variables?
I know that the Pearson correlation coefficient measures the linear correlation between two variables, but I want to use the variables for nonlinear modeling, and I don't know of another way to choose my best variables.
I used PCA (Principal Component Analysis) to reduce my variables, but acceptable results were not obtained.
I used the HeuristicLab software to develop a Genetic Programming-based regression model and R to develop a Support Vector Regression model as well.
Thanks
Relevant answer
Answer
Hello Amirhossein Haghighat. The type of univariable pre-screening of candidate predictors you are describing is a recipe for producing an overfitted model. See Frank Harrell's Author Checklist (link below), and look especially under the following headings:
  • Use of stepwise variable selection
  • Lack of insignificant variables in the final model
There are much better alternatives you could take a look at--e.g., LASSO (2nd link below). If you indicate what software you use, someone may be able to give more detailed advice or resources. HTH.
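Since LASSO was mentioned, a minimal glmnet sketch (X is the 30-column predictor matrix and y the response; the names are placeholders). For nonlinear effects, you could expand X with polynomial or spline terms before fitting.
library(glmnet)
X <- as.matrix(dat[, paste0("X", 1:30)])
y <- dat$Y
set.seed(1)
cvfit <- cv.glmnet(X, y, alpha = 1)  # alpha = 1 is the LASSO penalty, tuned by cross-validation
coef(cvfit, s = "lambda.1se")        # predictors with non-zero coefficients are retained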
  • asked a question related to Advanced Statistical Modeling
Question
2 answers
To reduce the dimensionality of large datasets and to carry out correlation among the parameters do we use only inlet or outlet parameters individually or use both of them to see the correlation?
  • asked a question related to Advanced Statistical Modeling
Question
14 answers
I have a set of experimental data (EXP) which I have fitted with two analytical models (AN1 & AN2).
In order to estimate the precision and accuracy of both analytical models I can study statistics of the ratios EXP/AN1 and EXP/AN2 or AN1/EXP and AN2/EXP.
Well, the point is that statistics of such ratios are not coincident.
I see that many researchers adopt the first approach, whereas I would instinctively go for the second, because I can then compare two different analytical models by normalizing them with respect to the same experimental variable.
Is there anybody who can help me out with this?
thanks.
Relevant answer
Answer
Hello.
I am facing the same problem. I think that by now you may have an answer you can share with me.
Second thing: I have a set of experimental data on dampers and 10 parameters to play with (stiffness, radius of nozzles, orifice, and so forth).
I would like your advice on how to attack the problem so I can come up with an analytical model of it to optimise the design.
Thank you.
  • asked a question related to Advanced Statistical Modeling
Question
4 answers
I have run an ARDL model on time-series cross-sectional data, but the output does not report the R-squared. What could be the reason(s)?
Thank you.
Maliha Abubakari
Relevant answer
Answer
I thought the PMG estimator (a form of ARDL) would be more appropriate.
  • asked a question related to Advanced Statistical Modeling
Question
6 answers
I want to do a descriptive analysis using the World Values Survey dataset, which has N = 1200. However, even though I have searched a lot, I haven't found the methodology or a tool to calculate the sample size I need to get meaningful comparisons when I cross variables. For example, I want to know how many observations I need in every category if I want to compare the social position attributed to the elderly over sex AND over ethnic group. That is (to exemplify even further), the difference between black vs indigenous women in my variable of interest. What if I have 150 observations of black women? Is that enough? How do I set the threshold?
Expressing my gratitude in advance,
Santiago.
Relevant answer
Answer
When you divide a sample into subgroups, your maximum power is where the groups are of equal size. So the first step is to calculate power for two samples of 600.
Where the groups are of unequal size, power goes down compared with the ideal case of a 50:50 split. With a 60:40 split, the effective sample is reduced by only 4%, but as you get to 80:20, the reduction is 36%, and at 90:10 it's 64%. So with a 90:10 split your power is what you would have with a sample that was 64% smaller, split 50:50 between the groups.
Effective sample size calculations are very simple. The ideal split is 0.5 x 0.5, which is 0.25. A 30:70 split is 0.3 x 0.7 = 0.21, which is 16% lower than 0.25, and so on.
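A tiny R helper that reproduces these numbers (this is just the arithmetic above, not a full power analysis):
# relative efficiency of a p:(1-p) split compared with a 50:50 split
split_efficiency <- function(p) (p * (1 - p)) / 0.25
split_efficiency(c(0.5, 0.6, 0.7, 0.8, 0.9))
# 1.00 0.96 0.84 0.64 0.36  -> reductions of 0%, 4%, 16%, 36%, 64%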
  • asked a question related to Advanced Statistical Modeling
Question
9 answers
What are the best methods to handle imbalanced data? And do these methods introduce more bias?
Relevant answer
Answer
1. Under-sampling
2. Over-sampling
3. Ensemble learning
4. Adjusting class weights
5. The right evaluation metrics
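As a small, hedged illustration of two of these options in base R (a 0/1 outcome y in a data frame dat; all names are placeholders):
# 4. Adjusting class weights: weight each class inversely to its frequency
w <- ifelse(dat$y == 1, 1 / mean(dat$y == 1), 1 / mean(dat$y == 0))
fit_w <- glm(y ~ ., data = dat, family = binomial, weights = w)
# (R warns about non-integer weights for a binomial fit; the weighted likelihood is still maximized)
# 2. Over-sampling: resample the minority class up to the size of the majority class
minority <- dat[dat$y == 1, ]
majority <- dat[dat$y == 0, ]
set.seed(1)
boot_min <- minority[sample(nrow(minority), nrow(majority), replace = TRUE), ]
fit_os <- glm(y ~ ., data = rbind(majority, boot_min), family = binomial)
On the bias question: both tricks deliberately change the implied class prior, so predicted probabilities for the minority class are shifted upward and need recalibration if you care about the probabilities rather than just the ranking.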
  • asked a question related to Advanced Statistical Modeling
Question
1 answer
Dear colleagues,
I am approaching the hotspot analysis for the first time.
My main goal is to understand the advantages and disadvantages of different methods used for the hotspot analysis (e.g., Moran I, Getis, etc.).
There is something I do not understand. Imagine I have occurrence data points (i.e., points on a map that all carry the same value, as each point indicates an occurrence event). Is it necessary to aggregate the occurrence data? What I mean is: if all input values are "1", can Getis-Ord G* still work, or should I aggregate the data onto a grid prior to the analysis?
Thank you very much in advance,
Chiara
  • asked a question related to Advanced Statistical Modeling
Question
4 answers
Hello,
I have a question about which longitudinal mediation model would be best for my study. My study contains two groups (A: intervention versus WL), two measures — a mediator (M) and an outcome measure (B) — and three measurement points (T0, T1, T2). I want to know whether a change in M precedes a change in B across the different time points.
Now I was wondering whether a cross-lagged panel model or a latent change score model would be best to use for this mediation. Would someone have any advice about this for me?
If the best solution is a latent change score model, does anyone have any recommendations for a tutorial on how to do this (preferably in R)?
(The study is about a parenting program which changes parenting skills (M) to reduce children's externalizing behavior (B).)
Thank you very much in advance!
Suzanne
Relevant answer
Answer
Hello Suzanne,
we would agree with Amalia Raquel Pérez Nebra and use the random-intercept cross-lagged panel model that is suitable when you have 3+ waves and alows to model within-person and between person effects.
Mulder, J. D., & Hamaker, E. L. (2021). Three extensions of the random intercept cross-lagged panel model. Structural Equation Modeling: A Multidisciplinary Journal, 1-11.
Mund, M., & Nestler, S. (2019). Beyond the cross-lagged panel model: Next-generation statistical tools for analyzing interdependencies across the life course. Advances in Life Course Research, 41, 100249.
All the best,
Holger
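In case a concrete starting point helps, below is a condensed, hedged lavaan sketch of the bivariate RI-CLPM following the Mulder and Hamaker specification (M0-M2 and B0-B2 are placeholder names for the three waves of the mediator and the outcome); the group argument lets you compare the intervention and waitlist arms:
library(lavaan)
riclpm <- '
  # random intercepts (stable between-person differences)
  RIm =~ 1*M0 + 1*M1 + 1*M2
  RIb =~ 1*B0 + 1*B1 + 1*B2
  # within-person (wave-specific) components
  wm0 =~ 1*M0;  wm1 =~ 1*M1;  wm2 =~ 1*M2
  wb0 =~ 1*B0;  wb1 =~ 1*B1;  wb2 =~ 1*B2
  # autoregressive and cross-lagged paths (M -> B is the mediation-relevant direction)
  wm1 + wb1 ~ wm0 + wb0
  wm2 + wb2 ~ wm1 + wb1
  # within-wave covariances and random-intercept covariance
  wm0 ~~ wb0;  wm1 ~~ wb1;  wm2 ~~ wb2
  RIm ~~ RIb
  # variances of the random intercepts and within-person components
  RIm ~~ RIm;  RIb ~~ RIb
  wm0 ~~ wm0;  wm1 ~~ wm1;  wm2 ~~ wm2
  wb0 ~~ wb0;  wb1 ~~ wb1;  wb2 ~~ wb2
'
fit <- lavaan(riclpm, data = dat, missing = "ML",
              meanstructure = TRUE, int.ov.free = TRUE, group = "condition")
summary(fit, standardized = TRUE)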
  • asked a question related to Advanced Statistical Modeling
Question
5 answers
Hi everyone! I have a statistical problem that is puzzling me. I have a very nested paradigm and I don't know exactly what analysis to employ to test my hypothesis. Here's the situation.
I have three experiments differing in one slight change (Exp 1, Exp 2, and Exp 3). Each subject could only participate in one experiment. Each experiment involves 3 lists of within-subjects trials (List A, B, and C), namely, the participants assigned to Exp 1 were presented with all the three lists. Subsequently, each list presented three subsets of within-subjects trials (let's call these subsets LEVEL, being I, II, and III).
The dependent variable is the response time (RT) and, strangely enough, is normally distributed (Kolmogorov–Smirnov test's p = .26).
My hypothesis is that no matter the experiment and the list, the effect of this last within-subjects variable (i.e., LEVEL) is significant. In the terms of the attached image, the effect of the LEVEL (I-II-III) is significant net of the effect of the Experiment and Lists.
Crucial info:
- the trials are made of the exact same stimuli with just a subtle variation among the LEVELS I, II, and III; therefore, they are comparable in terms of length, quality, and every other aspect.
- the lists are made to avoid that the same subject could be presented with the same trial in two different forms.
The main problem is that it is not clear to me how to conceptualize the LIST variable, in that it is on the one hand a between-subjects variable (different subjects are presented with different lists), but on the other hand, it is a within-subject variable, in that subjects from different experiments are presented with the same list.
For the moment, here's the solutions I've tried:
1 - Generalized Linear Mixed Model (GLMM). EXP, LIST, and LEVEL as fixed effect; and participants as a random effect. In this case, the problem is that the estimated covariance matrix of the random effects (G matrix) is not positive definite. I hypothesize that this happens because the GLMM model expects every subject to go through all the experiments and lists to be effective. Unfortunately, this is not the case, due to the nested design.
2 – Generalized Linear Model (GLM). Same family of model, but without the random effect of the participants’ variability. In this case, the analysis runs smoothly, but I have some doubts on the interpretation of the p values of the fixed effects, which appear to be massively skewed: EXP p = 1, LIST p = 1, LEVEL p < .0001. I’m a newbie in these models, so I don’t know whether this could be a normal circumstance. Is that the case?
3 – Three-way mixed ANOVA with EXP and LIST as between-subjects factors, and LEVEL as the within-subjects variable with three levels (I, II, and III). Also in this case, the analysis runs smoothly. Nevertheless, together with a good effect of the LEVEL variable (F= 15.07, p < .001, η2 = .04), I also found an effect of the LIST (F= 3.87, p = .022, η2 = .02) and no interaction LEVEL x LIST (p = .17).
The result seems satisfying to me, but is this analysis solid enough to claim that the effect of the LEVEL is by no means affected by the effect of the LIST?
Ideally, I would have preferred a covariation perspective (such as ANCOVA or MANCOVA), in which the test allows an assessment of the main effect of the between-subjects variables net of the effects of the covariates. Nevertheless, in my case the classic (M)ANCOVA variables pattern is reversed: “my covariates” are categorical and between-subjects (i.e., EXP and LIST), so I cannot use them as covariates; and my factor is in fact a within-subject one.
To sum up, my final questions are:
- Is the three-way mixed ANOVA good enough to claim what I need to claim?
- Is there a way to use categorical between-subjects variables as “covariates”? Perhaps moderation analysis with a not-significant role of the moderator(s)?
- do you propose any other better ways to analyze this paradigm?
I hope I have been clear enough, but I remain at your total disposal for any clarification.
Best,
Alessandro
P.S.: I've run a nested repeated measures ANOVA, wherein LIST is nested within EXP and LEVEL remain as the within-subjects variable. The results are similar, but the between-subjects nested effect LIST within EXP is significant (p = .007 η2 = .06). Yet, the question on whether I can claim what I need to claim remains.
Relevant answer
Answer
Yes, of course: a three-way ANOVA.
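Regarding the asker's option 1, here is a hedged lme4 sketch of the mixed-model specification with only a random intercept per participant (EXP and LIST enter as fixed between-subjects factors, since each participant sits in exactly one experiment and one list; all names are placeholders). A non-positive-definite G matrix usually just signals that one of the random-effect variances was estimated at or near zero, not that the design is unusable.
library(lme4)
library(lmerTest)  # Satterthwaite-based F tests for the fixed effects
m <- lmer(RT ~ LEVEL + EXP + LIST + (1 | participant), data = dat)
anova(m)  # tests LEVEL while adjusting for EXP and LIST as fixed "covariates"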
  • asked a question related to Advanced Statistical Modeling
Question
4 answers
Dear colleagues,
Actually, I have two files with two different resolutions, and I am looking for code (Python, MATLAB, R) to estimate the correlation coefficient, bias and other statistical indices between a specific point and its nearest point in the other file. I will be thankful for any help.
Thanks in advance
Regards,
Relevant answer
Answer
The internet should have a lot of such codes.
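Since this is mainly a how-to question, here is a minimal, hedged R sketch (the column names lon, lat, and value in the data frames grid_a and grid_b are placeholders): for each point in the first file it finds the nearest point in the second file and then computes correlation, bias and RMSE on the matched pairs.
# nearest-neighbour matching on squared lon/lat distance
# (fine for small areas; use great-circle distances for large domains)
nearest <- sapply(seq_len(nrow(grid_a)), function(i) {
  d <- (grid_b$lon - grid_a$lon[i])^2 + (grid_b$lat - grid_a$lat[i])^2
  which.min(d)
})
x <- grid_a$value
y <- grid_b$value[nearest]
cor(x, y, use = "complete.obs")      # correlation coefficient
mean(y - x, na.rm = TRUE)            # bias
sqrt(mean((y - x)^2, na.rm = TRUE))  # RMSE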
  • asked a question related to Advanced Statistical Modeling
Question
3 answers
Hello. I am struggling with the following problem. I can measure two ratios of three independent normal random variables with non-zero means and variances: Z1 = V1/V0 and Z2 = V2/V0, where V0 ~ N(m0, s0), V1 ~ N(m1, s1), V2 ~ N(m2, s2). These are measurements of the speeds of a vehicle. Now I should estimate the means and the variances of these ratios. We can see that such a ratio follows a Cauchy-type distribution with no mean and variance, but it has analogues in the form of location and scale. Are there mathematical relations between mean and location, and between variance and scale? Can we approximate the Cauchy by a Normal? I have heard that if we limit the estimated value we can obtain the mean and variance.
Relevant answer
Answer
Well, the Cauchy distribution has no finite moments. You might be interested in the attached Google search, which is about robust statistical methods, but I would probably try the first reference in the second search first. Good luck if you do. David Booth
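One hedged addition on the practical question: when the denominator mean m0 is several standard deviations s0 away from zero (typically true for vehicle speeds), the ratio Z = V1/V0 is Cauchy-like only in the far tails, and a first-order (delta-method) approximation, assuming V0 and V1 independent and the s-values denoting standard deviations, gives usable moments:
\mathbb{E}[Z] \approx \frac{m_1}{m_0}, \qquad \operatorname{Var}(Z) \approx \frac{m_1^2}{m_0^2}\left(\frac{s_1^2}{m_1^2} + \frac{s_0^2}{m_0^2}\right)
This is essentially the "limit the estimated value" idea: conditioning on V0 staying away from zero is what makes these approximate moments meaningful.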
  • asked a question related to Advanced Statistical Modeling
Question
11 answers
The aim of my research is to analyse the correlation between two delta values (change between two timepoints) via regression analysis.
Let the variables be X, Y, and Z, and t0 represent pre-intervention and t1 represent post-intervention. X is a psychometric value (Visual Analogue Scale ranging from 0 to 100), Y and Z are biological values.
For example, I want to calculate the correlation between delta (Xt1 - Xt0), delta (Yt1 - Yt0), and delta (Zt1 - Zt0).
I am aware that delta values are statistically inefficient; therefore, plain Pearson or Spearman correlation is out. I would appreciate any advice or model examples. Thanks!
Relevant answer
Answer
One main issue with correlations of observed variable difference (change) scores is that these correlations may be strongly attenuated due to measurement error (unreliability). In terms of classical test theory, observed variable difference (change) scores often have low reliabilities because measurement error from both pretest and posttest affect the error variance component of the difference (change) score. To avoid the problem of low change score reliability, latent difference (change) scores can be used that are based on differences between true scores rather than observed scores. (True scores are by definition free of measurement error.) The correlations between "true difference scores" can be estimated using methods of structural equation modeling/longitudinal confirmatory factor analysis.
To mathematically identify latent difference score variables, you either need multiple (at least 2) observed variables ("indicators", measures) for each construct X, Y, and Z at each of the two time points or find a way to otherwise identify the error variance component of each observed variable X, Y, Z through appropriate constraints.
If you have appropriate estimates of the reliabilities of X, Y, and Z, you could derive (compute) the error variance components and specify them accordingly in a latent difference (change) score model with fixed error variances. Other options may be available for your design. You could check out the extensive literature on latent change score modeling to explore this option further and see if it works for your design, e.g.:
Another option that may be applicable in your case is a relatively simple computational correction for attenuation, see, e.g.:
Again, this also requires that the reliability of the change scores be known.
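For the simple correction-for-attenuation route, a minimal R sketch with placeholder numbers (the reliabilities must be those of the change scores themselves, not of the raw pre- or post-test scores):

# Spearman's correction for attenuation applied to a change-score correlation.
r_observed <- 0.30   # observed correlation between delta X and delta Y (placeholder)
rel_dX     <- 0.55   # assumed reliability of the X change score
rel_dY     <- 0.60   # assumed reliability of the Y change score

r_corrected <- r_observed / sqrt(rel_dX * rel_dY)
r_corrected          # estimate of the correlation between true change scores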
  • asked a question related to Advanced Statistical Modeling
Question
7 answers
For example, how do you analyze the effect of speed on a binary performance (success or failure), knowing that the expected probabilities do not necessarily follow a straight line but could form an inverted U-shaped curve?
To illustrate, I created a dataset in R and have put the script at your disposal. I have also attached a graph that shows the frequency of success as a function of speed.
Thank you.
Relevant answer
Answer
Hello Pierre,
How best to analyze likely depends on your specific aims. If what you wish to do is model the relationship, then clearly you'd need to include not just the value of the IV (speed, in your example), but a function of the squared IV as well (whether centered or not is up to you) as a second IV, as the relationship appears quadratic in form. Logistic regression could work (with the DV being the dichotomous outcome of success).
If you have something else in mind, perhaps you could elaborate your query.
Good luck with your work.
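As a concrete illustration of that suggestion, a minimal R sketch, assuming a data frame dat with a 0/1 column success and a numeric column speed:

# Logistic regression with a quadratic term to capture an inverted-U effect.
dat$speed_c <- as.numeric(scale(dat$speed, scale = FALSE))   # centred speed
fit <- glm(success ~ speed_c + I(speed_c^2), family = binomial, data = dat)
summary(fit)

# If the quadratic coefficient is negative, the fitted probability peaks where
# the derivative of the linear predictor is zero: speed_c = -b1 / (2 * b2).
b <- coef(fit)
-b["speed_c"] / (2 * b["I(speed_c^2)"])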
  • asked a question related to Advanced Statistical Modeling
Question
6 answers
Hi
I'm using three different performance criteria for evaluating my model:
1.Nash–Sutcliffe (NSE)
2.Percent bias (PBIAS)
3.Root mean square error (RMSE)
You can suppose that I used a regression model to estimate time series data such as river mean daily discharge or something like that.
But for a single model and a single dataset, we see different performance ratings from each criterion.
Is this possible? I expected all three criteria to give the same result.
You can see the variation of these criteria in the attached picture.
Thanks
Relevant answer
Answer
This is because each criterion weights the errors differently, so they need not agree.
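To make that concrete: NSE compares squared errors with the variance of the observations, PBIAS summarises the average volume error, and RMSE reports the typical error magnitude in the data units, so they routinely rank the same simulation differently. A small R sketch with invented data (note that the sign convention for PBIAS varies between references):

nse   <- function(obs, sim) 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)
pbias <- function(obs, sim) 100 * sum(obs - sim) / sum(obs)   # positive = underestimation here
rmse  <- function(obs, sim) sqrt(mean((obs - sim)^2))

# Toy example: a simulation that tracks the dynamics well but is biased low
# can score a high NSE and a poor PBIAS at the same time.
set.seed(42)
obs <- rlnorm(365, meanlog = 3, sdlog = 0.5)
sim <- 0.9 * obs + rnorm(365, sd = 1)
c(NSE = nse(obs, sim), PBIAS = pbias(obs, sim), RMSE = rmse(obs, sim))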
  • asked a question related to Advanced Statistical Modeling
Question
12 answers
Hello,
I am performing statistical analysis of my research data by comparing the mean values using the Tukey HSD test. I got homogeneous groups labelled with both lower-case and capital letters, because of the large number of treatments in my study. Is this type of homogeneous grouping acceptable for publication in a journal?
Relevant answer
Answer
You can use SPSS for this analysis, but it is most often done in the Statistix 8.1 program.
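If R is also an option, the agricolae package produces the compact letter display for Tukey's HSD directly; a minimal sketch, assuming a data frame dat with columns yield and treatment:

library(agricolae)
model <- aov(yield ~ treatment, data = dat)
out   <- HSD.test(model, trt = "treatment", group = TRUE)
out$groups   # treatment means with their letter codes (a, b, ab, ...)

With very many treatments the letter display becomes hard to read in any software, so grouping treatments or reporting a reduced set of planned contrasts may communicate the results better.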
  • asked a question related to Advanced Statistical Modeling
Question
5 answers
Hi there,
in SPSS I can perform a PCA with my dataset, which does not show a positive definite correlation matrix, since I have more variables (45) than cases (n = 31).
The results seem quite interesting; however, since my correlation matrix, and therefore all the criteria for appropriateness (anti-image, MSA, etc.), are not available, am I allowed to perform such an analysis?
Or are the results of the PCA automatically nonsense? I can identify a common theme in each of the loaded factors and its items.
Thanks and best Greetings from Aachen, Germany
Alexander Kwiatkowski
Relevant answer
Answer
Hello Alexander,
Trying to parse the information carried by 45 variables (e.g., 990 unique relationships) based on data from 31 cases is a bit like the Biblical story of the loaves and fishes: without divine intervention, you're simply not going to get good results.
In general, absence of a positive definite matrix implies that: (a) there is at least one variable that is linearly dependent on one or more of the other variables in the set (e.g., redundancy); and/or (b) the correlations are incoherent (which can occur if you use pairwise missing data technique, or some value(s) were miscoded on data entry). Either way, you'd need to re-inspect the data and data handling, and possibly jettison one or more of the variables.
If you find your results interesting, perhaps that should be the motivation to collect additional data so that you can be more confident that the resulting structure--whatever you decide it to be--is something more than a finding idiosyncratic to your data set.
Good luck with your work.
  • asked a question related to Advanced Statistical Modeling
Question
7 answers
Hello everyone! For my dissertation I am using network analysis to model my data. I have 11 variables, and all but 3 of them are Likert scales. I am struggling to test linearity for my data (linearity is an assumption of network analysis). Obviously, when I try to test linearity using standardised regression plots (ZPRED against ZRESID), the scatterplot is not homoscedastic because of the Likert scales. Is anyone familiar with network analysis assumption testing for Likert-type data? Any help appreciated :) My data are not normally distributed; however, I am using npn transformations (in JASP) to solve this issue for the networks. I just don't know how to test for linearity, as relations among variables need to be assumed to be linear.
I am using SPSS for data cleaning etc. and JASP to run the network.
Relevant answer
Answer
From my basic understanding of Likert scale analysis, it can be difficult to establish linearity for constructs measured on Likert scales unless they are transformed into a continuous form.
Since you are using SPSS for data cleaning, you can transform the Likert responses by computing a summed (or averaged) score across the items measured on Likert scales, generating a continuous variable for each construct that can then be used to explore linearity.
For instance, if the 3 variables are measured on a 4-point Likert scale coded 0, 1, 2 and 3 (in ascending or descending order of the construct they measure), the maximum possible total score is 3 × 3 = 9. The continuous score computed out of this total of 9 can be used for testing linear relationships (correlation/regression), as long as the outcome variable is also continuous; it can also be dichotomized around the mean of the reference scale. However, since I cannot tell whether the items of each variable measure the same thing, I cannot say whether transformation by computation or transformation by recoding is what will help.
  • asked a question related to Advanced Statistical Modeling
Question
6 answers
Hi, I am a beginner in the field of cancer genomics. I am reading gene expression profiling papers in which researchers classify the cancer samples into two groups based on the expression of a group of genes, for example a "high group" and a "low group", do survival analysis, and then associate these groups with other molecular and clinical parameters, for example serum B2M levels, serum creatinine levels, 17p del, trisomy of 3. Some researchers classify the cancer samples into 10 groups. Now, if I am proposing a cancer classification scheme and presenting a survival model based on 2 groups or 10 groups, how should I assess the predictive power of my proposed classification model, and at the same time how do I compare its predictive power with that of other survival models? Thank you in advance.
Relevant answer
Answer
The survAUC R package provides a number of ways to compare models; see this link: https://stats.stackexchange.com/questions/181634/how-to-compare-predictive-power-of-survival-models
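As a complement, here is a hedged R sketch of one common way to compare classification schemes: fit a Cox model per scheme and compare discrimination (Harrell's C) and penalised fit. It assumes a recent version of the survival package, whose concordance() accepts fitted models, and placeholder column names time, status, group2 (2 classes) and group10 (10 classes):

library(survival)
fit2  <- coxph(Surv(time, status) ~ group2,  data = dat)   # 2-group scheme
fit10 <- coxph(Surv(time, status) ~ group10, data = dat)   # 10-group scheme

concordance(fit2, fit10)   # C-index of each model, with standard errors
AIC(fit2, fit10)           # penalised fit, usable for non-nested schemes
anova(fit2, fit10)         # likelihood-ratio test, only if one scheme coarsens the other

Time-dependent AUC (for example via survAUC, as linked above) and calibration checks on an independent or cross-validated dataset are useful complements, since the C-index alone can miss overfitting in a 10-group scheme.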
  • asked a question related to Advanced Statistical Modeling
Question
3 answers
Background of the problem for people not familiar with transfusion:
Every day, hospital transfusion services must guarantee they have enough blood to meet patient transfusion demand. This is very much like managing a food stock in a refrigerator, but even easier (just one product, one expiration date).
Assumptions:
- RBC (red blood cell) units are kept on a refrigerator with enough capacity
- Maximum expiration date for RBC is 42 days.
- We make a routine order to the blood bank provider (our grocery store) every day from Monday to Friday. We can make additional orders if we run out of stock.
- We practice first in first out.
The model should satisfy the following targets:
- Minimize the expiration rate (the number of RBC units expiring every certain time)
- The less time on the refrigerator the better (eat fresh)
- Avoid extraordinary orders as much as possible (some extraordinary order might be considered to cope with transfusion peaks of demand)
- Note: I have purposely simplified the problem (no considerations of ABO group distribution, crossmatched reserved units, or reduced expiration date of the irradiated blood…).
Non controlled external factors:
- Transfusion rate
- Average remaining time to expiration of the received blood.
What we are looking for is the optimal number of RBC units we must order on a routine basis (M-F), which is the only factor we can adjust to optimize the equation. The resulting figure must be recalculated when the transfusion rate or the average expiration date of the received units changes.
Intuitively, I see how the number of extraordinary orders and RBC freshness are at opposite sides of a balance, because if a higher number of extraordinary orders is tolerated, then it should result in a smaller but fresher RBC inventory. On the other hand, in order to avoid extraordinary orders, you need a greater inventory thus increasing the average age of your products and the risk of expiration.
Any idea or suggestion? Any available script on Excel/SPSS/Stata/R/SQL? Any related bibliography?
I can provide raw data for simulations on demand if someone wants to collaborate.
Relevant answer
Answer
Good work; I'm interested. Best of luck.
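Since raw data for simulation were offered, here is a deliberately crude discrete-event sketch in R of the trade-off described in the question: an order-up-to level S, FIFO issuing, Poisson daily demand and a fixed remaining shelf life at receipt. Every number in it (demand rate, shelf life at receipt, candidate S values) is invented and should be replaced by the hospital's own figures; routine receipt is assumed to arrive before the day's demand.

simulate_stock <- function(S, lambda = 8, life_at_receipt = 35, days = 3 * 365) {
  stock <- integer(0)                 # remaining shelf life (days) of each unit on hand
  expired <- 0; extra_orders <- 0; ages_issued <- integer(0)
  for (d in seq_len(days)) {
    stock <- stock - 1                              # one day passes
    expired <- expired + sum(stock <= 0)            # discard outdated units
    stock <- stock[stock > 0]
    if (d %% 7 %in% 1:5 && length(stock) < S) {     # routine weekday order up to S
      stock <- c(stock, rep(life_at_receipt, S - length(stock)))
    }
    demand <- rpois(1, lambda)                      # today's transfusion demand
    if (demand > length(stock)) {                   # emergency order to cover the shortfall
      extra_orders <- extra_orders + 1
      stock <- c(stock, rep(life_at_receipt, demand - length(stock)))
    }
    if (demand > 0) {
      issue <- order(stock)[seq_len(demand)]        # FIFO: oldest units (lowest remaining life) first
      ages_issued <- c(ages_issued, 42 - stock[issue])   # age since collection = 42 - remaining life
      stock <- stock[-issue]
    }
  }
  c(S                     = S,
    expired_per_year      = expired / (days / 365),
    extra_orders_per_year = extra_orders / (days / 365),
    mean_age_at_issue     = mean(ages_issued))
}

set.seed(7)
t(sapply(seq(10, 60, by = 10), simulate_stock))     # scan candidate order-up-to levels

Running the scan on the real demand history (or a resampled version of it) shows directly how expiries, emergency orders and mean age at transfusion move as S changes, which is exactly the balance described in the question.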
  • asked a question related to Advanced Statistical Modeling
Question
7 answers
In the development of forecasting, prediction or estimation models, we have recourse to information criteria so that the model is parsimonious. So why, and when, should one or the other of these information criteria be used?
Relevant answer
Answer
You need to be mindful of what any one IC is doing for you. They can look at 3 different contexts:
(a) you select a model structure now, fit the model to the data you have now and keep using those now-fitted parameters from now on.
(b) you select a model structure now and keep that structure, but will refit the model to an expanded dataset (reducing parameter-estimation variation but not bias).
(c) you select a model structure now and keep that structure, but will continually refit the model as expanded datasets are available (eliminating parameter-estimation variation but not bias).
  • asked a question related to Advanced Statistical Modeling
Question
15 answers
I have a data set of particulate concentration (A) and the corresponding emissions from cars (B), factories (C) and soil (D). I have 100 observations of A and the corresponding B, C and D. Let's say no factor other than B, C and D contributes to the particulate concentration (A). Correlation analysis shows that A has a linear relationship with B, an exponential relationship with C and a logarithmic relationship with D. I want to know which factor contributes most to the concentration of A (the predominant factor). I also want to know whether a model like the following equation can be built from the data set I have:
A = m*B + n*exp(C) + p*log(D), where m, n and p are constants.
Relevant answer
Answer
Maybe you can consider the recursive least squares algorithm (RLS). RLS is the recursive application of the well-known least squares (LS) regression algorithm, so that each new data point is taken into account to modify (correct) a previous estimate of the parameters from some linear (or linearized) correlation thought to model the observed system. The method allows for the dynamical application of LS to time series acquired in real-time. As with LS, there may be several correlation equations with the corresponding set of dependent (observed) variables. For the recursive least squares algorithm with forgetting factor (RLS-FF), acquired data is weighted according to its age, with increased weight given to the most recent data.
Years ago, while investigating adaptive control and energetic optimization of aerobic fermenters, I applied the RLS-FF algorithm to estimate the parameters of the KLa correlation used to predict O2 gas-liquid mass transfer, hence giving increased weight to the most recent data. Estimates were improved by imposing sinusoidal disturbances on the air flow and agitation speed (the manipulated variables). The power dissipated by agitation was measured with a torque meter (pilot plant). The proposed (adaptive) control algorithm compared favourably with PID. Simulations assessed the effect of numerically generated white Gaussian noise (2-sigma truncated) and of a first-order delay. This investigation was reported at (MSc Thesis):
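For readers who want to experiment with the idea on the model in the question, here is a generic textbook sketch of RLS with a forgetting factor applied to the linearised regressors (B, exp(C), log(D)). It is not the implementation referenced above, and the simulated data and coefficient values are invented:

# Generic recursive least squares with forgetting factor (RLS-FF).
rls_ff <- function(X, y, lambda = 0.98, delta = 1e3) {
  p <- ncol(X)
  theta <- rep(0, p)              # parameter estimates (m, n, p)
  P <- diag(delta, p)             # large initial "covariance"
  for (t in seq_len(nrow(X))) {
    x <- X[t, ]
    k <- as.vector(P %*% x) / (lambda + as.numeric(t(x) %*% P %*% x))   # gain vector
    theta <- theta + k * (y[t] - sum(x * theta))                        # update estimates
    P <- (P - outer(k, as.vector(t(x) %*% P))) / lambda                 # update P
  }
  theta
}

# Invented example with true coefficients m = 2, n = 0.5, p = 3:
set.seed(1)
B <- runif(200, 0, 10); C <- runif(200, 0, 3); D <- runif(200, 1, 50)
A <- 2 * B + 0.5 * exp(C) + 3 * log(D) + rnorm(200, sd = 0.5)
X <- cbind(B = B, expC = exp(C), logD = log(D))
rls_ff(X, A)   # estimates should approach c(2, 0.5, 3)

For a static data set, an ordinary linear fit on the same transformed regressors, e.g. lm(A ~ B + I(exp(C)) + I(log(D))), answers the original question; standardised coefficients from that fit give one simple way to judge which factor predominates.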
  • asked a question related to Advanced Statistical Modeling
Question
8 answers
How does one make a marketing mix more agile with respect to new channels and an ever-changing environment? What models are used for this analysis, and how are they interpreted?
Our paper on MMM discusses advertising effects using models, both simple and complex, and their interpretations, collating it all together.
Please read, review, and suggest how we can enhance our research going forward.
Relevant answer
Answer
Well, an interesting approach, although I personally have a problem with the "econometrisation" of marketing.
I like new perspectives on basic marketing concepts, for example the SAVE concept instead of the classic 4Ps.
  • asked a question related to Advanced Statistical Modeling
Question
8 answers
I want to run 5 different models to estimate stream flow. In order to optimize the characteristics of these models I use the Taguchi method, so I have to run the models according to the Taguchi orthogonal array. Therefore, I have different models with different inputs and different data lengths. For example, the first test uses rainfall and temperature in an ANFIS model with a 2-year data length, while the second test uses rainfall, temperature and the previous day's discharge in an SVR model with a 10-year data length. So the inputs, the data length and the model type all change across these tests. What is the best performance evaluation criterion for this study? NRMSE could be a good criterion because it normalizes the RMSE and in this way removes the effect of the data range.
Now, I want to know if there is any better solution for this problem.
Relevant answer
Answer
  • asked a question related to Advanced Statistical Modeling
Question
5 answers
Hi
I have a dataset and I need to check the goodness of fit of a Pearson type III distribution for it.
How can I do it?
Any software? Any MATLAB code?
Thanks
Relevant answer
Answer
Amirhossein Haghighat, the definition of the chi-squared statistic for a sample is the sum of (x - E(x))^2 / s^2, where s^2 is the variance of the data. When the variable is Poisson-distributed, you are in the particular case where s^2 = E(x).
Good luck
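One concrete route in R, assuming the lmom package interface (samlmu / pelpe3 / cdfpe3 / quape3) and a numeric data vector x: fit the Pearson type III distribution by L-moments and check the fit with a Kolmogorov-Smirnov test and a Q-Q plot. Note that the KS p-value is only approximate when the parameters are estimated from the same sample.

library(lmom)
para <- pelpe3(samlmu(x))          # fitted (mu, sigma, gamma) parameters
para
ks.test(x, cdfpe3, para = para)    # KS goodness-of-fit test against the fitted CDF

# Visual check: fitted quantiles vs sample quantiles
qqplot(quape3(ppoints(length(x)), para), sort(x),
       xlab = "Fitted Pearson III quantiles", ylab = "Sample quantiles")
abline(0, 1)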
  • asked a question related to Advanced Statistical Modeling
Question
5 answers
My study is about the development of an intumescent coating through the addition of additives A and B. In my research I used samples with different ratios of the two, with the control being 0:0. I performed the horizontal fire test and recorded the temperature of the coated steel over time. I need to show whether there is a significant difference between the samples and compare the correlation coefficients of the samples (whether the differences are statistically significant). I tried using a one-way ANOVA and a post hoc test, but I think time affects the temperatures. Should I try a two-way ANOVA?
Relevant answer
Answer
If normality is OK, definitely Dunnett's test (see Google).
Good luck
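If the analysis is moved to R, the DescTools package implements Dunnett's test directly; a minimal sketch, assuming its DunnettTest interface, a data frame dat with columns temp and sample_ratio, and "0:0" as the control label:

library(DescTools)
# Each additive ratio compared against the 0:0 control
DunnettTest(temp ~ sample_ratio, data = dat, control = "0:0")

Because the temperature was recorded repeatedly over time, it makes sense to run this at selected time points, or to fit a model that includes time explicitly (the two-way / repeated-measures direction you mention).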
  • asked a question related to Advanced Statistical Modeling
Question
4 answers
Dear Colleagues ... Greetings
I would like to compare some robust regression methods based on the bootstrap technique. The comparison will be done using Monte Carlo simulation (the regression coefficients are known), so I wonder how I can bootstrap the coefficient of determination and the MSE.
Thanks in advance
Huda
Relevant answer
Answer
Here you can follow these steps:
1 - Draw the bootstrap sample (size m).
2 - Estimate the coefficients you need from this sample using the robust method.
3 - Find the MSE of the coefficients.
4 - Repeat the above bootstrap process B times; the bootstrap MSE of the coefficients is then estimated by averaging over the B replications.
5 - This whole loop can be repeated N times in the Monte Carlo simulation.
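A minimal R sketch of steps 1-4 using the boot package, with MASS::rlm standing in as an example robust estimator and invented data; in the Monte Carlo study, the data generation and this whole block would be wrapped in the outer loop of N repetitions with known coefficients:

library(boot)
library(MASS)

boot_stat <- function(data, idx) {
  d   <- data[idx, ]                      # bootstrap sample
  fit <- rlm(y ~ x1 + x2, data = d)       # robust regression (example method)
  res <- d$y - fitted(fit)
  c(coef(fit),
    MSE = mean(res^2),
    R2  = 1 - sum(res^2) / sum((d$y - mean(d$y))^2))   # pseudo-R2 for the robust fit
}

# Invented data with known coefficients (1, 2, -1.5):
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(100)

b <- boot(dat, boot_stat, R = 1000)       # B bootstrap replications
colMeans(b$t)                             # bootstrap estimates of coefficients, MSE, R2
apply(b$t, 2, sd)                         # bootstrap standard errors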
  • asked a question related to Advanced Statistical Modeling
Question
4 answers
I'm a community ecologist (for soil microbes), and I find hurdle models are really neat/efficient for modeling the abundance of taxa with many zeros and high degrees of patchiness (separate mechanisms governing likelihood of existing in an environment versus the abundance of the organism once it appears in the environment). However, I'm also very interested in the interaction between organisms, and I've been toying with models that include other taxa as covariates that help explain the abundance of a taxon of interest. But the abundance of these other taxa also behave in a way that might be best understood with a hurdle model. I'm wondering if there's a way of constructing a hurdle model with two gates - one that is defined by the taxon of interest (as in a classic hurdle model); and one that is defined by a covariate such that there is a model that predicts the behavior of taxon 1 given that taxon 2 is absent, and a model that predicts the behavior of taxon 1 given that taxon 2 is present. Thus there would be three models total:
Model 1: Taxon 1 = 0
Model 2: Taxon 1 > 0 ~ Environment, Given Taxon 2 = 0
Model 3: Taxon 1 > 0 ~ Environment, Given Taxon 2 > 0
Is there a statistical framework / method for doing this? If so, what is it called? / where can I find more information about it? Can it be implemented in R? Or is there another similar approach that I should be aware of?
To preempt a comment I expect to receive: I don't think co-occurrence models get at what I'm interested in. These predict the likelihood of taxon 1 existing in a site given the distribution of taxon 2. These models ask the question do taxon 1 and 2 co-occur more than expected given the environment? But I wish to ask a different question: given that taxon 1 does exist, does the presence of taxon 2 change the abundance of taxon 1, or change the relationship of taxon 1 to the environmental parameters?
Relevant answer
Answer
Thank you Remal Al-Gounmeein for sharing! I think it's interesting because I have somewhat the opposite problem that this paper addresses; many people in my field use simple correlation to relate the abundance of taxa to one another, but typically those covariances can be explained by an environmental gradient. So including covariates actually vastly decreases the number of "significant" relationships. But still it's a point well-taken because explaining that e.g. taxon1 and taxon2 don't likely interact directly even though they are positively or negatively correlated would in fact require presenting the results of both models. Thanks!
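For anyone following this thread, one pragmatic sketch (not a true "two-gate" hurdle, just a hedged approximation with the pscl package): let the presence/absence of taxon 2 enter both parts of an ordinary hurdle model and interact with the environmental predictor, or fit the model separately in the two strata to mimic Models 2 and 3. The column names taxon1, taxon2 and env are assumptions.

library(pscl)
dat$taxon2_present <- as.integer(dat$taxon2 > 0)

# Single model: environment effects allowed to differ with taxon 2 presence
fit <- hurdle(taxon1 ~ env * taxon2_present | env * taxon2_present,
              data = dat, dist = "negbin")
summary(fit)

# Stratified version, mirroring Models 2 and 3 explicitly
fit_absent  <- hurdle(taxon1 ~ env, data = subset(dat, taxon2_present == 0), dist = "negbin")
fit_present <- hurdle(taxon1 ~ env, data = subset(dat, taxon2_present == 1), dist = "negbin")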
  • asked a question related to Advanced Statistical Modeling
Question
6 answers
Hi,
I am currently working on a project titled 'How does internet use affect interpersonal trust, a comparison across 4 European countries'. However, I have been struggling to decide which model best suits my project, as both my dependent variable, interpersonal trust, and my main independent variable are ordinal; i.e., interpersonal trust ranges from 0 "You can't be too careful" to 10 "Most people can be trusted", and internet use frequency ranges from 1 "Never", 2 "Only occasionally", 3 "A few times a week", 4 "Most days", to 5 "Every day". I was wondering whether there is a model suited to this, or whether I would need to recode my internet use variable. Also, seeing as I will be doing a cross-country analysis, what model would be suitable for that? I am working with Stata, so any advice on how to do this in Stata would also be greatly appreciated.
Thank you.
Relevant answer
Answer
Hi Maya,
the DV has enough scale points, with at least semi-interval level, so it can be used as such. The IV looks a bit shaky, so if you want to be hyper-correct, make 4 dummies out of it. Much more important are potential control variables that block unobserved confounding effects.
Best,
Holger
  • asked a question related to Advanced Statistical Modeling
Question
3 answers
I am trying to build an SEM model in AMOS for my dissertation but am having some trouble getting a good fit. I am looking for relationships between self-efficacy reported on a Likert scale and scores on an assessment (the 2 instruments in my study). The items on the left are the items from the self-efficacy questionnaire, and the skills listed on the right are scores on different categories of the test. I have run Pearson's r product-moment correlations, and the covariances I specified were the significant results from the Pearson's r. However, when I set the model up in AMOS and run it, the values I get do not meet the good-fit indexes :(
Here is what I get: Chi-square = 354.624, Df = 18, Probability level = .000, CFI = .064, TLI rho2 = -.456, RMSEA = .289
When I tried to use the modification indices, it suggested adding covariances between the items on the self-efficacy scale and between the categories of scores on the assessment component. 
I tried it out with the suggested covariances and I get: Chi-square = 5.223, Df= 6, Probability level = .516, CFI = 1.00, TLI rho2 = 1.010, RMSEA = .000
Would that be considered a good fit? Can I still use that model even though my research question is to examine the relationships between the two instruments and not between the items within each instrument?
Relevant answer
  • asked a question related to Advanced Statistical Modeling
Question
5 answers
Are the two related or similar?
What difference do they make in selecting the best features?
Relevant answer
Hi there. Well, the predictive power score reflects the model's ability to achieve a certain accuracy in its overall performance, while the feature importance score tells you how much each feature increases or decreases that accuracy. First of all, you need to select the most important features that allow you to achieve good accuracy in your model's performance, and with that your predictive power score will be great, or at least acceptable 😉
  • asked a question related to Advanced Statistical Modeling
Question
8 answers
I have three outputs, each with a different unit. I am searching for an error estimator with the help of which I can select the most suitable Neural Network configuration. I will have to compute the training error as well as the testing error.
With one output, any error estimator (like MAPE or RMSE) would have done the job, but I am unsure which will be best suited in this case, as I have three different outputs and I need one single number as the error to compare.
Will SSE be a good fit for it?
Relevant answer
Answer
  • asked a question related to Advanced Statistical Modeling
Question
14 answers
Configurations:
Rstudio v1.3.1093
Packages: car, lme4, MuMIn, DescTools
Model: myGLM <- glm(interaction ~ species + individual + interaction_item_type, family = poisson)
Response variable (461 interactions): count data
Predictor variables ('species' (n = 10), 'individuals' (n = 39), 'interaction item type' (n = 3)): categorial data
Description:
Regardless of how I order the predictor variables, the ANOVA (Anova(myGLM, type = "II", test.statistic = "LR")) outputs zero Df and no other results for the variable ‘species’.
‘Individual’ is completely nested in ‘species’, since every individual belongs to a certain species. If tested alone, each predictor variable (‘individual’ and ‘species’) shows a statistically significant effect. The variable ‘individual’ is therefore a finer-grained categorical ‘version’ of ‘species’, but it is an important focus for my study. Compared by AICc and McFadden’s pseudo-R², the model with ‘individual’ shows a significantly better fit than the model with the predictor variable ‘species’. My cautious interpretation is currently that the ‘real’ effect is individual- rather than species-specific, while the two predictors are collinear. Is that a decent interpretation? Are there other possible reasons why ‘species’ gets no ANOVA results? Maybe there is a solution for the missing ANOVA results for ‘species’? I wasn’t able to find much about this topic. Maybe someone can help me. Thanks! :)
Relevant answer
Answer
Dear Gerrit,
You wrote that "The multilevel approach is given due to utilising GLM with additive predictor variables (I do not use interaction terms).".
Maybe the term "multilevel" has different meanings; I, however, was thinking of adding individuals as random effects in a generalized linear mixed model.
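A minimal sketch of that suggestion with lme4 (already in the package list above), assuming the variables sit in a data frame dat and that individual IDs are unique across species:

library(lme4)
# Species and item type as fixed effects, individuals as a random intercept
myGLMM <- glmer(interaction ~ species + interaction_item_type + (1 | individual),
                family = poisson, data = dat)
summary(myGLMM)

# Likelihood-ratio test for the species fixed effect
myGLMM0 <- update(myGLMM, . ~ . - species)
anova(myGLMM0, myGLMM)

This keeps 'species' estimable as a fixed effect because the individual-level variation is absorbed by the random intercept instead of competing for the same degrees of freedom.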