Science topic
Advanced Statistical Modeling - Science topic
Explore the latest questions and answers in Advanced Statistical Modeling, and find Advanced Statistical Modeling experts.
Questions related to Advanced Statistical Modeling
Dear Colleagues,
I was wondering if you could suggest how to analyze insect community (multivariate) data collected across multiple sites and time points. Specifically, I aim to assess differences in community composition between Treatment and Control conditions.
What are my options regarding:
- Multivariate Time Series modeling?
- Multivariate Mixed-Effect Models?
- Latent Models (e.g., Generalized Linear Latent Variable Models, gllvm)?
- Machine Learning approaches?
I’m aware of Multivariate Time Series analysis applied in fields like Finance and likely many other approaches that could be relevant. However, I am struggling to determine the most appropriate method for my case.
If you have experience in this area or can recommend a good blog, tutorial, or resource, I would greatly appreciate your suggestions.
Thank you for your time!
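A common starting point for Treatment vs. Control differences in community composition is PERMANOVA on a Bray-Curtis dissimilarity matrix (in R, vegan::adonis2 does this; gllvm is the model-based alternative). As a minimal, self-contained sketch of the idea, here is a one-way PERMANOVA in Python on entirely synthetic count data (all names and numbers are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def permanova(counts, groups, n_perm=999, seed=0):
    """One-way PERMANOVA: pseudo-F on Bray-Curtis distances plus a permutation p-value."""
    counts = np.asarray(counts, dtype=float)
    groups = np.asarray(groups)
    n = len(groups)
    d = squareform(pdist(counts, metric="braycurtis"))

    def pseudo_f(g):
        # Sums of squares computed directly from pairwise distances (Anderson 2001)
        ss_total = np.sum(squareform(d) ** 2) / n
        ss_within = 0.0
        for lvl in np.unique(g):
            idx = np.where(g == lvl)[0]
            ss_within += np.sum(squareform(d[np.ix_(idx, idx)]) ** 2) / len(idx)
        a = len(np.unique(g))
        return ((ss_total - ss_within) / (a - 1)) / (ss_within / (n - a))

    rng = np.random.default_rng(seed)
    f_obs = pseudo_f(groups)
    f_perm = np.array([pseudo_f(rng.permutation(groups)) for _ in range(n_perm)])
    p_value = (1 + np.sum(f_perm >= f_obs)) / (1 + n_perm)
    return f_obs, p_value
```

Model-based approaches such as gllvm additionally account for the mean-variance relationship of count data, which distance-based tests ignore, so they are often preferred for inference; the permutation test above is mainly useful as a simple first look.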
Hi everyone,
I ran a Generalised Linear Mixed Model to see if an intervention condition (video 1, video 2, control) had any impact on an outcome measure across time (baseline, immediate post-test and follow-up). I am having trouble interpreting the Fixed Coefficients table. Can anyone help?
Also, why are the last four lines empty?
Thanks in advance!
I would like to conduct a literature review on augmented learning and learning-augmented algorithms to enhance performance-guided surgery.
I'm currently working on a project involving group-based trajectory modelling and am seeking advice on handling multi-level factors within this context. Specifically, I'm interested in understanding the following:
- Multi-Level Factors in Trajectory Modelling: How can multi-level factors (e.g., individual-level and group-level variables) be effectively addressed in group-based trajectory modelling? Are there specific methods or best practices recommended for incorporating these factors?
- Flexmix Package: I’ve come across the Flexmix package in R, which supports flexible mixture modelling. How can this package be utilised to handle multi-level factors in trajectory modelling? Are there specific advantages or limitations of using Flexmix compared to other methods?
- Comparison with Other Approaches: In what scenarios would you recommend using Flexmix over other trajectory modelling approaches like LCMM, TRAJ, or GBTM? How do these methods compare in terms of handling multi-level data and providing accurate trajectory classifications?
- Adjusting for Covariates: When identifying initial trajectories (e.g., highly adherent, moderately adherent, low adherent), is it necessary to adjust for covariates such as age, sex, and socioeconomic status (SES)? Or is focusing on adherence levels at each time point sufficient for accurate trajectory identification? What are the best practices for incorporating these covariates into the modelling process?
Any insights, experiences, or references to relevant literature would be greatly appreciated!
Hi everyone.
When running a GLMM, I need to turn the data from wide format to the long format (stacked).
When checking for assumptions like normality, do I check them for the stacked variable (e.g., outcomemeasure_time) or for each variable separately (e.g., outcomemeasure_baseline, outcomemeasure_posttest, outcomemeasure_followup)?
Also, when identifying covariates via correlations (Pearson's or Spearman's), do I use the separate variables or the stacked one?
Normality: say normality is violated for outcomemeasure_baseline but not for the others (outcomemeasure_posttest and outcomemeasure_followup). Normality for the stacked variable is also not violated. In this case, when running the GLMM, do I adjust for normality violations because normality was violated for one of the separate measures?
Covariates: say age was identified as a covariate for outcomemeasure_baseline but not for the others (separately: outcomemeasure_posttest and outcomemeasure_followup, or the stacked variable). In this case, do I include age as a covariate since it was identified as one for one of the separate variables?
Thank you so much in advance!
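On the reshaping step itself, here is a sketch of the wide-to-long conversion in Python/pandas, with hypothetical values and the column names from the question:

```python
import pandas as pd

# Hypothetical wide-format data: one row per participant
wide = pd.DataFrame({
    "id": [1, 2, 3],
    "outcomemeasure_baseline": [10.0, 12.0, 9.0],
    "outcomemeasure_posttest": [14.0, 15.0, 11.0],
    "outcomemeasure_followup": [13.0, 14.0, 12.0],
})

# Stack the three time columns into one outcome column plus a time factor
long = wide.melt(
    id_vars="id",
    value_vars=["outcomemeasure_baseline",
                "outcomemeasure_posttest",
                "outcomemeasure_followup"],
    var_name="time",
    value_name="outcome",
)
long["time"] = long["time"].str.replace("outcomemeasure_", "", regex=False)
```

Note that in (G)LMMs the normality assumption applies to the model residuals (and random effects), not to each raw time-point variable, so it is usually checked on residuals from the model fitted to the long-format data rather than on each wide-format column separately.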
I have data from population-based observation (not questionnaires but yearly observations from a secondary database), and I already have a common model for each population (6 groups; each has the same latent variables, observed variables, and structural model). As the study is longitudinal in nature (observations are not independent of each other), can I still use MGA (Multi-Group Analysis)? My result does not pass the MICOM procedure; is the MICOM procedure an obligatory prerequisite for MGA in my specific case?
I have a mixed-effects model with two random-effect variables, and I want to rank the relative importance of the variables. The relaimpo package doesn't work for mixed-effects models. Since I am interested in the fixed-effect variables anyway, would it be okay to take only the fixed variables and use relimp? Or should I compare Akaike weights across candidate models that each omit one of the variables in turn?
Which approach is more acceptable?
Suppose we have three variables (X, Y, Z). According to past literature, Y mediates the relationship between X and Z, while X mediates the relationship between Y and Z. Can I analyze these interrelationships in a single SEM using a duplicate variable for either X (i.e., X_IV and X_DV) or Y (Y_IV and Y_DV)?
What are the possible ways of rectifying a lack of fit test showing up as significant. Context: Optimization of lignocellulosic biomass acid hydrolysis (dilute acid) mediated by nanoparticles
We measured three aspects (i.e. variables) of self-regulation. We have 2 groups and our sample size is ~30 in each group. We anticipate that three variables will each contribute unique variance to a self-regulation composite. How do we compare if there are group differences in the structure/weighting of the composite? What analysis should be conducted?
I have a set of data measured with a spectrum analyser: the power emitted by an antenna, "mW_NLOS", as a function of frequency. How can I fit this data to a Rician distribution using MATLAB?
Note that I used my_dist = fitdist(mW_NLOS, 'Rician'), but the result doesn't seem correct to me.
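As a cross-check outside MATLAB, scipy can fit the same distribution; this sketch fits synthetic Rician samples with the location pinned at zero (all parameter values are illustrative). One common pitfall worth checking: the Rician distribution describes the amplitude (envelope), so fitting it directly to power values (mW) or to dBm-scaled data can look wrong; taking the square root of the linear-scale power first may be what's missing.

```python
import numpy as np
from scipy import stats

# Simulate Rician amplitudes: shape b = nu/sigma, scale = sigma
rng = np.random.default_rng(0)
samples = stats.rice.rvs(b=2.0, scale=1.5, size=5000, random_state=rng)

# Fit with the location fixed at 0, matching the usual two-parameter Rician
b_hat, loc_hat, scale_hat = stats.rice.fit(samples, floc=0)
```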
Greetings,
I am currently in the process of conducting a Confirmatory Factor Analysis (CFA) on a dataset consisting of 658 observations, using a 4-point Likert scale. As I delve into this analysis, I have encountered an interesting dilemma related to the choice of estimation method.
Upon examining my data, I observed a slight negative kurtosis of approximately -0.0492 and a slight negative skewness of approximately -0.243 (please refer to the attached file for details). Considering these properties, I initially leaned towards utilizing the Diagonally Weighted Least Squares (DWLS) estimation method, as existing literature suggests that it takes into account the non-normal distribution of observed variables and is less sensitive to outliers.
However, to my surprise, when I applied the Unweighted Least Squares (ULS) estimation method, it yielded significantly better fit indices for all three factor solutions I am testing. In fact, it even produced a solution that seemed to align with the feedback provided by the respondents. In contrast, DWLS showed no acceptable fit for this specific solution, leaving me to question whether the assumptions of ULS are being violated.
In my quest for guidance, I came across a paper authored by Forero et al. (2009; DOI: 10.1080/10705510903203573), which suggests that if ULS provides a better fit, it may be a valid choice. However, I remain uncertain about the potential violations of assumptions associated with ULS.
I would greatly appreciate your insights, opinions, and suggestions regarding this predicament, as well as any relevant literature or references that can shed light on the suitability of ULS in this context.
Thank you in advance for your valuable contributions to this discussion.
Best regards, Matyas
I have a longitudinal model and the stability coefficients for one construct change dramatically from the first and second time point (.04) to the second and third time point (.89). I have offered a theoretical explanation for why this occurs, but have been asked about potential model bias.
Why would this indicate model bias? (A link to research would be helpful).
How can I determine whether the model is biased or not? (A link to research would be helpful).
Thanks!
In recent years, quite a few reports have been published of results based on statistical information processing. For example, a study establishes that the use of a certain remedy (some food, drink, nutritional supplement, drug, treatment method, etc.) reduces (or increases) the value of some output parameter by 20 ... 30 ... 40%. The output parameter can be the frequency of onset of the analyzed disease, the frequency of its successful cure, etc. Based on this finding, the conclusion is made that the studied factor significantly influences the output parameter. How trustworthy can such a conclusion be?
For further details, please see:
Question background. There is an equipartition theorem, and it is without doubt correct. But it has its conditions of applicability, which are not always satisfied. There are well-known examples of a chain of connected oscillators, the spectral density of a black body, the new example of an ideal gas in a round vessel I have studied. How may or may not the energy be partitioned in such cases, when the equipartition theorem is not applicable? Can anyone provide more systems with known uneven laws of energy partitioning?
I am using a fixed-effects panel data model with 100 observations (20 groups), one dependent and three independent variables, and I would like to get a regression output from it. My question is: is it necessary to run any normality test and linearity test for panel data? And what difference would it make if I don't run these tests?
The variables I have, vegetation index and plant disease severity scores, were not normally distributed. So I applied a log10(y+2) transformation to the vegetation index and a sqrt(log10(y+2)) transformation to the plant disease severity score. Plant disease severity is on a scale of 0, 10, 20, 30, ..., 100 and was scored based on visual observations. Even after the combined transformation, the disease severity scores remain non-normal, but the transformation improves the CV in simple linear regression.
Can I proceed with the parametric test, a simple linear regression between the log transformed vegetation index (normally distributed) and combined transformed (non-normal) disease severity data?
Hi,
I have data from a study that included 3 centers. I will conduct a multiple regression (10 IVs, 1 non-normally distributed DV) but I am unsure how to handle the variable "center" in these regressions. Should I:
1) Include "center" as one predictor along with the other 10 IVs.
2) Utilize multilevel regression
Thanks in advance for any input
Kind regards
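Both options can be sketched side by side; here is an illustration with statsmodels on synthetic data (the variable x1 and the three-center effect sizes are hypothetical placeholders for the 10 IVs):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: 3 centers with different baseline levels of the DV
rng = np.random.default_rng(0)
n = 300
center = rng.integers(0, 3, n)
x1 = rng.normal(size=n)                      # stands in for the 10 IVs
y = 2.0 + 0.5 * x1 + np.array([0.0, 1.0, -1.0])[center] + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x1": x1, "center": center.astype(str)})

# Option 1: center as a fixed-effect covariate (dummy-coded)
fixed = smf.ols("y ~ x1 + C(center)", data=df).fit()

# Option 2: center as a random intercept (multilevel model)
mixed = smf.mixedlm("y ~ x1", data=df, groups=df["center"]).fit()
```

With only 3 centers, the random-intercept variance is estimated from just 3 units and is therefore poorly identified, so treating center as a fixed covariate (option 1) is often recommended when the number of clusters is this small.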
Hello everyone, I need a bit of help with statistical analysis methods.
My partner (an MD) is conducting research as part of her residency exam on how years of occupation affect workers' hearing. Her known variables are years of employment, years of employment at the current job position, age, and percentage of hearing loss (calculated with the Fowler-Sabine formula, so in %).
She had a statistician working on her study, and he used multivariate linear regression (explaining that he chose it because one variable is a percentage).
However, one of her professors said she should use log regression analysis instead. Why? Is multivariate linear regression not OK, and if so, why not?
Can anyone help explain which one should be used, or which is better and why? We tried Google, but as we are not statisticians or experienced researchers, this is quite hard for us to understand. However, she needs this done correctly, as this study is part of her residency exam.
Any help is much appreciated.
Many thanks!
Anze&Ana
How can I add robust 97.5% confidence ellipses to the variation diagrams (XY, ilr-transformed) in the robCompositions or compositions packages?
Best
Azzeddine
Hello!
In general, as a rule of thumb, what is the acceptable value for standardised factor loadings produced by a confirmatory factor analysis?
And, what could be done/interpretation if the obtained loadings are lower than the acceptable value?
How does everyone approach this?
Merry Christmas everyone!
I used the Interpersonal Reactivity Index (IRI) subscales Empathic Concern (EC), Perspective Taking (PT) and Personal Distress (PD) in my study (N = 900). When I calculated Cronbach's alpha for each subscale, I got .71 for EC, .69 for PT and .39 for PD. The value for PD is very low. The analysis indicated that if I deleted one item, the alpha would increase to .53, which is still low but better than .39. However, as my study does not focus mainly on the psychometric properties of the IRI, what kind of arguments can I make to say the results are still valid? I did say findings (for the PD) should be taken with caution, but what else can I say?
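For context, Cronbach's alpha is k/(k-1) * (1 - sum of item variances / variance of the item total); a few lines of numpy reproduce it, which can help when exploring item deletions or reporting inter-item behaviour alongside the alpha values:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of the sum score
    return k / (k - 1) * (1 - item_vars / total_var)
```

A very low alpha such as .39 often signals that the items do not form a single dimension; reporting the mean inter-item correlation (and, where possible, an alternative reliability coefficient such as McDonald's omega) alongside alpha is one common way to support a "interpret with caution" argument.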
If we have multiple experts providing the prior probabilities for the parent nodes, how should the experts fill in the node probabilities (such as low, medium, and high), and how do we obtain a consensus among all the experts about the probability distribution of a parent node?
If someone can please share any paper, questionnaire, or expert-based Bayesian network study where these queries are explained, it would be highly appreciated.
Hi,
I have used a central composite design with four variables and 3 levels, which gives me 31 experiments. After performing the experiments, I found that the model is not significant. However, when I used different data (which I had previously obtained), I got a good model.
How do I justify using user-defined data? And why did the CCD fail to provide a significant model?
I would be really thankful for your response.
I'm trying to construct a binary logistic regression model. The first model includes 4 predictor variables, and the intercept is not statistically significant. Meanwhile, in the second model, I exclude one variable from the first model and the intercept is significant.
The consideration that I take here is that:
The pseudo-R² of the first model is higher than that of the second model.
Any suggestion on which model I should use?
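Intercept significance is generally a poor basis for choosing between models; since the second model is nested in the first, a likelihood-ratio test (or AIC) compares them directly. A self-contained sketch with synthetic data (not your variables), fitting the logistic models by maximum likelihood with scipy rather than a dedicated GLM routine:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def fit_logit(X, y):
    """Fit a logistic regression by maximum likelihood; returns (coefs, log-lik)."""
    Xd = np.column_stack([np.ones(len(X)), X])     # prepend an intercept column

    def negll(beta):
        z = Xd @ beta
        return np.sum(np.logaddexp(0, z) - y * z)  # numerically stable -log-lik

    res = minimize(negll, np.zeros(Xd.shape[1]), method="BFGS")
    return res.x, -res.fun

# Synthetic data: the outcome depends on x1 only, x2 is irrelevant
rng = np.random.default_rng(0)
n = 500
X_full = rng.normal(size=(n, 2))
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.0 * X_full[:, 0]))))

coef_full, ll_full = fit_logit(X_full, y)        # larger model
coef_red, ll_red = fit_logit(X_full[:, :1], y)   # reduced model

# Likelihood-ratio test for the dropped variable (df = 1)
lr_stat = 2 * (ll_full - ll_red)
p_value = chi2.sf(lr_stat, df=1)
```

A non-significant likelihood-ratio test says the dropped variable adds nothing beyond chance, in which case the simpler model is usually preferred regardless of what the intercept's p-value does.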
I am using an ARDL model; however, I am having some difficulties interpreting the results. I found that there is long-run cointegration. I have provided pictures below.
What are the most important updates that distinguish the last update of SMART PLS (4) from the previous one (3)?
Dear fellow researchers,
Usually we use lavaan for continuous variables; can we still use lavaan for categorical variables (e.g., high vs. low ethnic diversity composition)?
Thank you very much!
Best,
Edita
I recently included GEE models in my statistical analysis and evaluated the effects with Wald chi-square tests.
Does anyone know how to correctly report the findings according to APA guidelines?
e.g., we would report the findings of a repeated-measures ANOVA as follows:
"No main effect of group factors F(1,92)=.52, p > .05"
How do you report these findings? Please find an output of the model attached. Thank you!
Dear all, I want to replicate an EViews plot (attached as Plot 1) in Stata after performing a time series regression. I made an effort to produce this Stata plot (attached as Plot 2). However, I want Plot 2 to be exactly the same as Plot 1.
Please kindly help me out. Below are the Stata commands I ran to produce Plot 2. What exactly do I need to include?
The codes:
twoway (tsline Residual, yaxis(1) ylabel(-0.3(0.1)0.3)) (tsline Actual, yaxis(2)) (tsline Fitted, yaxis(2)), legend(on)
One dependent variable (continuous) ~ two continuous and two categorical (nominal) independent variables
I'm seeking the best method for prediction on a dataset with more than 100 sites. None of the continuous variables are normally distributed.
I have previously conducted laboratory experiments on a photovoltaic panel under artificial soiling in order to obtain short-circuit current and open-circuit voltage data, which I later analyzed using statistical methods to derive a performance coefficient specific to this panel, expressing the percentage decrease in the power produced by the panel as dust accumulates. Are there any similar studies that relied on statistical analysis to measure this dust effect?
I hope I can find researchers interested in this line of research and that we can do joint work together!
Article link:
Which one of these multilevel models is better? Should the random-equation variables also be added as covariates?
Model A: with random equation variables as covariates
Model B: without random equation variables as covariates
* Model A produced the same results as a routine ologit. So, if Model A is better than Model B, what is the point of using multilevel mixed models (given that the result is the same as ologit)?
I am working with the phyr::pglmm function in R, which uses Pagel's lambda to correct for phylogenetic non-independence. I wish to report this value to give an idea of the strength of the phylogenetic signal. However, contrary to other functions such as pgls in caper and the like, the results do not show the lambda used to generate the model.
Is there a function to extract this value from the model summary?
Thanks
I have been working with a GAM model with numerous features (>10). Although I have tuned it to satisfaction in my business application, I was wondering what the correct way to fine-tune a GAM model is, i.e., whether there is a principled way to tune the regularizers and the number of splines, and a way to say which model is accurate.
The question actually comes from the observation that, at different levels of tuning and regularization, we can reduce the variability of the effect of a specific variable, i.e., reduce the number of ups and downs in the transformed variable, and so on. So I don't understand at this point which model represents the objective truth and which one doesn't, since the other variables end up influencing each single transformed variable too.
Hi
I'm working on research to develop a nonlinear model (e.g., exponential, polynomial, etc.) between a dependent variable (Y) and 30 independent variables (X1, X2, ..., X30).
As you know, I need to choose the variables that have the most impact on estimating Y.
But the question is: can I use the Pearson correlation coefficient matrix to choose the best variables?
I know that the Pearson correlation coefficient measures the linear correlation between two variables, but I want to use the variables for nonlinear modeling, and I don't know another way to choose my best variables.
I used PCA (Principal Component Analysis) to reduce my variables, but acceptable results were not obtained.
I used HeuristicLab software to develop a Genetic Programming-based regression model, and R to develop a Support Vector Regression model as well.
Thanks
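One small extension of the Pearson screen: Spearman's rank correlation captures any monotonic (not just linear) relationship, though both coefficients can miss non-monotonic ones entirely, as this synthetic sketch illustrates:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 500)

# Monotonic but nonlinear relationship: Spearman stays near 1, Pearson drops
y_mono = np.exp(x) + rng.normal(scale=0.1, size=500)
r_p, _ = pearsonr(x, y_mono)
r_s, _ = spearmanr(x, y_mono)

# Non-monotonic (U-shaped) relationship: both coefficients sit near zero
y_quad = x ** 2 + rng.normal(scale=0.1, size=500)
r_p2, _ = pearsonr(x, y_quad)
r_s2, _ = spearmanr(x, y_quad)
```

For genuinely non-monotonic screening, mutual information or permutation importance from a tree ensemble are common alternatives to a correlation matrix.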
To reduce the dimensionality of large datasets and to examine correlations among the parameters, should we use only the inlet parameters, only the outlet parameters, or both?
I have a set of experimental data (EXP) which I have fitted with two analytical models (AN1 & AN2).
To estimate the precision and accuracy of both analytical models, I can study the statistics of the ratios EXP/AN1 and EXP/AN2, or of AN1/EXP and AN2/EXP.
Well, the point is that the statistics of these two sets of ratios do not coincide.
I see that many researchers adopt the first approach, whereas I would instinctively go for the second, because it lets me compare the two analytical models by normalizing both with respect to the same experimental variable.
Is there anybody who can help me out with this?
thanks.
I have run an ARDL model on time-series cross-sectional data, but the output does not report the R-squared. What could be the reason(s)?
Thank you.
Maliha Abubakari
I want to do a descriptive analysis using the World Values Survey dataset, which has N = 1200. However, even though I have searched a lot, I haven't found a methodology or tool to calculate the sample size I need to make meaningful comparisons when I cross variables. For example, I want to know how many observations I need in every category if I want to compare the social position attributed to the elderly across sex AND ethnic group. That is (exemplifying even more), the difference between Black vs. Indigenous women on my variable of interest. What if I have 150 observations of Black women? Is that enough? How do I set the threshold?
Expressing my gratitude in advance,
Santiago.
What are the best methods to handle imbalanced data? And do these methods introduce additional bias?
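As one concrete baseline, random oversampling simply duplicates minority-class rows until the classes balance. A sketch (note the bias concern is real: resampling before the train/test split leaks duplicates into the test set, so it should be applied to training folds only):

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until all classes are balanced."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx_all = []
    for c in classes:
        idx = np.where(y == c)[0]
        extra = rng.choice(idx, size=n_max - len(idx), replace=True)
        idx_all.append(np.concatenate([idx, extra]))
    idx_all = np.concatenate(idx_all)
    return X[idx_all], y[idx_all]
```

Common alternatives include synthetic oversampling (SMOTE), class-weighted losses, and simply tuning the decision threshold; class weights and threshold tuning avoid duplicating data at all.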
Dear colleagues,
I am approaching the hotspot analysis for the first time.
My main goal is to understand the advantages and disadvantages of different methods used for the hotspot analysis (e.g., Moran I, Getis, etc.).
Maybe you can help me to better understand how Getis G* works (https://pro.arcgis.com/en/pro-app/latest/tool-reference/spatial-statistics/h-how-hot-spot-analysis-getis-ord-gi-spatial-stati.htm).
There is something I do not understand. Imagine I have occurrence data points (i.e., points on a map that all have the same value, as each point simply indicates an occurrence event). Is it necessary to aggregate the occurrence data? What I mean is: if all input values are "1", can Getis G* still work, or should I aggregate the data into a grid prior to the analysis?
Thank you very much in advance,
Chiara
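On the aggregation question: if every input value is 1, the local mean and variance that Gi* relies on are degenerate, so incident points are indeed aggregated (for example, into grid-cell counts) before running the statistic; the ArcGIS page linked above discusses exactly this for incident data. A minimal sketch of the aggregation step on synthetic points:

```python
import numpy as np

# Hypothetical occurrence points (every event carries the same value, 1)
rng = np.random.default_rng(0)
cluster = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(80, 2))   # a hotspot
background = rng.uniform(0, 10, size=(120, 2))                  # noise events
pts = np.vstack([cluster, background])

# Aggregate to a 10x10 grid: cell values become event COUNTS, which vary,
# so Getis-Ord Gi* then has something to measure
counts, xedges, yedges = np.histogram2d(
    pts[:, 0], pts[:, 1], bins=10, range=[[0, 10], [0, 10]]
)
```

The resulting count surface (high cells around the simulated cluster, near-zero elsewhere) is what the hot-spot statistic is then computed on.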
Hello,
I have a question about which longitudinal mediation model would be best for my study. My study contains two groups (A: intervention versus waitlist), two measures, a mediator (M) and an outcome measure (B), and three measurement points (T0, T1, T2). I want to know whether a change in M precedes a change in B across the different time points.
Now I was wondering whether a cross-lagged panel model or a latent change score model would be best to use for this mediation. Would anyone have any advice on this?
If the best solution is a latent change score model, does anyone have any recommendations for a tutorial on how to do this (preferably in R)?
(The study is about a parenting program that changes parenting skills (M) to reduce children's externalizing behavior (B).)
Thank you very much in advance!
Suzanne
Hi everyone! I have a statistical problem that is puzzling me. I have a very nested paradigm and I don't know exactly what analysis to employ to test my hypothesis. Here's the situation.
I have three experiments differing in one slight change (Exp 1, Exp 2, and Exp 3). Each subject could only participate in one experiment. Each experiment involves 3 lists of within-subjects trials (List A, B, and C), namely, the participants assigned to Exp 1 were presented with all the three lists. Subsequently, each list presented three subsets of within-subjects trials (let's call these subsets LEVEL, being I, II, and III).
The dependent variable is the response time (RT) and, strangely enough, is normally distributed (Kolmogorov–Smirnov test's p = .26).
My hypothesis is that no matter the experiment and the list, the effect of this last within-subjects variable (i.e., LEVEL) is significant. In the terms of the attached image, the effect of the LEVEL (I-II-III) is significant net of the effect of the Experiment and Lists.
Crucial info:
- the trials are made of the exact same stimuli with just a subtle variation among the LEVELS I, II, and III; therefore, they are comparable in terms of length, quality, and every other aspect.
- the lists are made to avoid that the same subject could be presented with the same trial in two different forms.
The main problem is that it is not clear to me how to conceptualize the LIST variable, in that it is on the one hand a between-subjects variable (different subjects are presented with different lists), but on the other hand, it is a within-subject variable, in that subjects from different experiments are presented with the same list.
For the moment, here are the solutions I've tried:
1 - Generalized Linear Mixed Model (GLMM). EXP, LIST, and LEVEL as fixed effect; and participants as a random effect. In this case, the problem is that the estimated covariance matrix of the random effects (G matrix) is not positive definite. I hypothesize that this happens because the GLMM model expects every subject to go through all the experiments and lists to be effective. Unfortunately, this is not the case, due to the nested design.
2 – Generalized Linear Model (GLM). Same family of model, but without the random effect of the participants’ variability. In this case, the analysis runs smoothly, but I have some doubts on the interpretation of the p values of the fixed effects, which appear to be massively skewed: EXP p = 1, LIST p = 1, LEVEL p < .0001. I’m a newbie in these models, so I don’t know whether this could be a normal circumstance. Is that the case?
3 – Three-way mixed ANOVA with EXP and LIST as between-subjects factors, and LEVEL as the within-subjects variable with three levels (I, II, and III). Also in this case, the analysis runs smoothly. Nevertheless, together with a good effect of the LEVEL variable (F= 15.07, p < .001, η2 = .04), I also found an effect of the LIST (F= 3.87, p = .022, η2 = .02) and no interaction LEVEL x LIST (p = .17).
The result seems satisfying to me, but is this analysis solid enough to claim that the effect of the LEVEL is by no means affected by the effect of the LIST?
Ideally, I would have preferred a covariation perspective (such as ANCOVA or MANCOVA), in which the test allows an assessment of the main effect of the between-subjects variables net of the effects of the covariates. Nevertheless, in my case the classic (M)ANCOVA variables pattern is reversed: “my covariates” are categorical and between-subjects (i.e., EXP and LIST), so I cannot use them as covariates; and my factor is in fact a within-subject one.
To sum up, my final questions are:
- Is the three-way mixed ANOVA good enough to claim what I need to claim?
- Is there a way to use categorical between-subjects variables as “covariates”? Perhaps moderation analysis with a not-significant role of the moderator(s)?
- do you propose any other better ways to analyze this paradigm?
I hope I have been clear enough, but I remain at your total disposal for any clarification.
Best,
Alessandro
P.S.: I've run a nested repeated-measures ANOVA, wherein LIST is nested within EXP and LEVEL remains the within-subjects variable. The results are similar, but the between-subjects nested effect of LIST within EXP is significant (p = .007, η2 = .06). Yet the question of whether I can claim what I need to claim remains.
Dear colleagues,
Actually, I have two files with two different resolutions, and I am looking for code (Python, Matlab, R) to estimate the correlation coefficient, bias, and other statistical indices between a specific point and its nearest point in the other file. I will be thankful for any help.
Thanks in advance
Regards,
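A sketch of the matching step in Python with entirely hypothetical coordinates and values: find each point's nearest neighbour in the other file with a KD-tree, then compute the comparison statistics on the matched pairs.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import pearsonr

# Hypothetical data: (x, y) coordinates and values for two files
rng = np.random.default_rng(0)
coords_a = rng.uniform(0, 10, size=(200, 2))
vals_a = np.sin(coords_a[:, 0]) + rng.normal(scale=0.1, size=200)
coords_b = rng.uniform(0, 10, size=(500, 2))     # the finer-resolution file
vals_b = np.sin(coords_b[:, 0]) + rng.normal(scale=0.1, size=500)

# For each point in file A, take the value at the nearest point of file B
tree = cKDTree(coords_b)
_, nearest = tree.query(coords_a)
matched = vals_b[nearest]

r, _ = pearsonr(vals_a, matched)                 # correlation coefficient
bias = np.mean(matched - vals_a)                 # mean bias
rmse = np.sqrt(np.mean((matched - vals_a) ** 2)) # root mean square error
```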
Hello. I am struggling with a problem. I can measure two ratios of three independent normal random variables with known, non-zero means and variances: Z1 = V1/V0 and Z2 = V2/V0, with V0 ~ N(m0, s0), V1 ~ N(m1, s1), V2 ~ N(m2, s2). These are measurements of vehicle speeds. Now I need to estimate the means and variances of these ratios. We can see that such a ratio follows a Cauchy-like distribution with no mean and variance, but it has analogues in the form of location and scale. Are there mathematical relations between mean and location, and between variance and scale? Can we approximate the Cauchy by a Normal? I have heard that if we bound the estimated value, we can obtain a mean and variance.
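When |m0| is large relative to s0 (the denominator speed is far from zero), the Cauchy-type pathology is irrelevant in practice: the ratio is approximately normal, and the first-order delta method gives mean ≈ m1/m0 and variance ≈ s1²/m0² + m1²·s0²/m0⁴. A Monte Carlo sketch with purely illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
m0, s0 = 20.0, 1.0       # denominator well away from zero: |m0| >> s0
m1, s1 = 5.0, 0.5

V0 = rng.normal(m0, s0, 100_000)
V1 = rng.normal(m1, s1, 100_000)
Z = V1 / V0              # the measured ratio

# First-order delta-method approximation
mean_approx = m1 / m0
var_approx = s1 ** 2 / m0 ** 2 + m1 ** 2 * s0 ** 2 / m0 ** 4

mean_mc = Z.mean()
var_mc = Z.var()
```

When V0 has appreciable mass near zero, the mean and variance genuinely do not exist, and robust location/scale summaries (median, interquartile range) are then the appropriate analogues; bounding (truncating) the ratio, as you mention, is another way to force finite moments.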
The aim of my research is to analyse the correlation between two delta values (change between two timepoints) via regression analysis.
Let the variables be X, Y, and Z, and t0 represent pre-intervention and t1 represent post-intervention. X is a psychometric value (Visual Analogue Scale ranging from 0 to 100), Y and Z are biological values.
For example, I want to calculate the correlation between delta (Xt1 - Xt0), delta (Yt1 - Yt0), and delta (Zt1 - Zt0).
I am aware that the delta value is statistically inefficient; therefore, Pearson's or Spearman's correlation is out. I would appreciate any advice or example models. Thanks!
For example, how to analyze the effect of speed on a binary performance (success or failure), knowing that the expected probabilities do not necessarily form a straight line but could be an inverted u-shaped curve.
To understand better I created a dataset on R and I put the script at your disposal. I have also attached a graph that shows the frequency of success as a function of speed.
Thank you.
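For a possibly inverted-U relationship, the standard move is logistic regression with a quadratic term for speed. A self-contained sketch (synthetic data; fitted by maximum likelihood with scipy rather than a dedicated GLM routine):

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic inverted-U data: success is most likely at intermediate speeds
rng = np.random.default_rng(0)
speed = rng.uniform(0, 10, 400)
logit_true = -6 + 3 * speed - 0.3 * speed ** 2      # peak at speed = 5
success = rng.binomial(1, 1 / (1 + np.exp(-logit_true)))

# Logistic regression with linear + quadratic speed terms
X = np.column_stack([np.ones_like(speed), speed, speed ** 2])

def negll(beta):
    z = X @ beta
    return np.sum(np.logaddexp(0, z) - success * z)  # stable -log-likelihood

beta_hat = minimize(negll, np.zeros(3), method="BFGS").x
peak_speed = -beta_hat[1] / (2 * beta_hat[2])        # vertex of the parabola
```

In R, the same model is glm(success ~ speed + I(speed^2), family = binomial); a significantly negative quadratic coefficient supports the inverted-U shape, and the vertex estimates the speed at which success peaks.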
Hi
I'm using three different performance criteria for evaluating my model:
1. Nash–Sutcliffe efficiency (NSE)
2. Percent bias (PBIAS)
3. Root mean square error (RMSE)
You can suppose that I used a regression model to estimate a time series data such as river mean daily discharge or something like that.
But for a single model and a single dataset, we see different performance rankings for each criterion.
Is this possible? I expected all three criteria to give the same result.
You can see the variation diagram of these criteria in the attached picture.
Thanks
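Yes, this is expected: the three criteria weight different aspects of the error, so they need not agree. A small numeric sketch (note that the sign convention for PBIAS varies between sources):

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 matches the mean, <0 is worse."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def pbias(obs, sim):
    """Percent bias (one common convention: positive = underestimation)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 100 * np.sum(obs - sim) / np.sum(obs)

def rmse(obs, sim):
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return np.sqrt(np.mean((obs - sim) ** 2))

# Two simulations of the same observations can rank differently per criterion
obs = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
sim_a = obs + 1.0               # constant offset: biased, tracks the dynamics
sim_b = np.full(5, obs.mean())  # the mean: unbiased, ignores the dynamics
```

Here sim_a beats sim_b on NSE (0.9 vs 0) and RMSE (1.0 vs about 3.16) but loses badly on PBIAS (-25% vs 0%), so disagreement between the criteria is normal rather than a sign of an error; each one answers a different question about the fit.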
Hello,
I am performing statistical analysis of my research data by comparing mean values using the Tukey HSD test. I obtained homogeneous groups labelled with both lowercase and uppercase letters, because of the large number of treatments in my study. Is this type of homogeneous grouping acceptable for publication in a journal?
Hi there,
in SPSS I can perform a PCA on my dataset, which does not have a positive definite correlation matrix, since I have more variables (45) than cases (n = 31).
The results seem quite interesting; however, since my correlation matrix, and therefore all criteria for appropriateness (anti-image, MSA, etc.), are not available, am I allowed to perform such an analysis?
Or are the results of the PCA automatically nonsense? I can identify a common theme in each of the loaded factors and its items.
Thanks and best Greetings from Aachen, Germany
Alexander Kwiatkowski
Hello everyone! For my dissertation I am using network analysis to model my data. I have 11 variables, and all but 3 of them are Likert scales. I am struggling to test for linearity (linearity is an assumption of network analysis). When I try to test linearity using standardised regressions (ZPRED against ZRESID), the scatterplot is not homoscedastic because of the Likert scales. Is anyone familiar with network analysis assumption testing for Likert-type data? Any help appreciated :) My data is not normally distributed, but I am using nonparanormal (npn) transformations (in JASP) to address this for the networks. I just don't know how to test for linearity, as relations among variables need to be assumed to be linear.
I am using SPSS for data cleaning etc. and JASP to run the network.
Hi, I am a beginner in the field of cancer genomics. I am reading gene expression profiling papers in which researchers classify cancer samples into two groups based on the expression of a set of genes, for example a "high group" and a "low group", do survival analysis, and then associate these groups with other molecular and clinical parameters, for example serum B2M levels, serum creatinine levels, 17p deletion, or trisomy 3. Some researchers classify the cancer samples into 10 groups. Now, if I am proposing a cancer classification scheme and presenting a survival model based on 2 or 10 groups, how should I assess the predictive power of my proposed classification model, and how do I compare its predictive power with that of other survival models? Thank you in advance.
Background of the problem, for people not familiar with transfusion:
Every day, hospital transfusion services must guarantee they have enough blood to meet patient transfusion demand. This is much like managing a food stock in a refrigerator, but even easier (just one product, one expiration date).
Assumptions:
- RBC (red blood cell) units are kept in a refrigerator with sufficient capacity.
- The maximum shelf life of an RBC unit is 42 days.
- We place a routine order with the blood bank provider (our "grocery store") every day from Monday to Friday. We can place additional orders if we run out of stock.
- We practice first-in, first-out (FIFO).
The model should satisfy the following targets:
- Minimize the expiration rate (the number of RBC units expiring per unit of time).
- The less time units spend in the refrigerator, the better ("eat fresh").
- Avoid extraordinary orders as much as possible (some extraordinary orders may be acceptable to cope with peaks in transfusion demand).
- Note: I have purposely simplified the problem (no consideration of ABO group distribution, crossmatched reserved units, or the reduced expiration date of irradiated blood…).
Uncontrolled external factors:
- Transfusion rate
- Average remaining time to expiration of the received blood.
What we are looking for is the optimal number of RBC units to order on a routine basis (Mon–Fri), which is the only factor we can adjust to optimize the system. The resulting figure must be recalculated whenever the transfusion rate or the average expiration date of the received units changes.
Intuitively, I see how the number of extraordinary orders and RBC freshness sit on opposite sides of a balance: if a higher number of extraordinary orders is tolerated, the result is a smaller but fresher RBC inventory. Conversely, to avoid extraordinary orders you need a larger inventory, which increases the average age of your products and the risk of expiration.
Any ideas or suggestions? Any available script in Excel/SPSS/Stata/R/SQL? Any related bibliography?
I can provide raw data for simulations on request if someone wants to collaborate.
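Lacking a closed-form answer, a small Monte Carlo simulation is an easy way to explore the trade-off: fix a candidate routine order size, simulate a year of FIFO stock management under a demand model, and record expiries, emergency orders, and mean age at transfusion. Below is a rough Python sketch; the demand draw, the shelf life at receipt, and all parameter values are assumptions you would replace with your own data:

```python
import random

def simulate(daily_order, days=365, mean_demand=4.0, shelf_life=35, seed=1):
    """Simulate a FIFO RBC stock. Returns (expired, emergency_orders, mean_age).
    `daily_order` units arrive Mon-Fri with `shelf_life` days remaining
    (42-day maximum minus an assumed transport/processing delay)."""
    random.seed(seed)
    stock = []                     # remaining days to expiry for each unit
    expired = emergency = 0
    ages_used = []
    for day in range(days):
        if day % 7 < 5:                               # routine order Mon-Fri
            stock.extend([shelf_life] * daily_order)
        demand = random.randint(0, int(2 * mean_demand))  # crude demand draw;
        for _ in range(demand):                           # use your own data
            if not stock:
                emergency += 1                        # ad-hoc order, used at once
                continue
            stock.sort()                              # FIFO: oldest unit first
            ages_used.append(shelf_life - stock.pop(0))
        stock = [d - 1 for d in stock]                # one day passes
        expired += sum(1 for d in stock if d <= 0)
        stock = [d for d in stock if d > 0]
    mean_age = sum(ages_used) / len(ages_used) if ages_used else 0.0
    return expired, emergency, mean_age

# sweep the routine order size to see the expiry-vs-emergency trade-off
for q in range(2, 8):
    print(q, simulate(q))
```

Sweeping the order size and plotting the three outputs makes the balance you describe visible; you would then pick the smallest routine order that keeps emergency orders at an acceptable level.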
In the development of forecasting, prediction, or estimation models, we resort to information criteria so that the model stays parsimonious. So why, and when, should one or another of these information criteria be used?
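Briefly: AIC (Akaike) aims at predictive accuracy and tends to select richer models; BIC (Schwarz) penalizes parameters more heavily as n grows (k·ln n versus 2k) and is consistent for recovering the "true" model when it is among the candidates; AICc corrects AIC for small samples and is usually preferred when n/k is small. The formulas are simple enough to compute directly; here is a small Python sketch with hypothetical log-likelihoods:

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return k * math.log(n) - 2 * log_lik

def aicc(log_lik, k, n):
    """Small-sample corrected AIC."""
    return aic(log_lik, k) + 2 * k * (k + 1) / (n - k - 1)

# two hypothetical candidate models fitted to the same n = 100 observations
print("simple:", aic(-250.0, 3), bic(-250.0, 3, 100))   # k = 3 parameters
print("rich:  ", aic(-245.0, 6), bic(-245.0, 6, 100))   # k = 6 parameters
```

With these made-up numbers AIC prefers the richer model (502 < 506) while BIC prefers the simpler one (513.8 < 517.6), which is the typical pattern: the two criteria encode different goals, not different arithmetic.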
I have a dataset of particulate concentration (A) and the corresponding emissions from cars (B), factories (C), and soil (D). I have 100 observations of A and the corresponding B, C, and D. Let's say no factor other than B, C, and D contributes to the particulate concentration (A). Correlation analysis shows that A has a linear relationship with B, an exponential relationship with C, and a logarithmic relationship with D. I want to know which factor contributes most to the concentration of A (the predominant factor). I would also like to know whether a model can be built from the dataset I have, such as the following equation:
A = m*B + n*exp(C) + p*log(D), where m, n, and p are constants estimated from the dataset I have
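Since a model of the form A = m·B + n·exp(C) + p·log(D) is linear in the unknown constants (the nonlinearity lies only in known transforms of the predictors), ordinary least squares on the transformed columns recovers m, n, p. A sketch on simulated data (all numbers are invented; substitute your 100 observations); to judge which factor dominates, compare the terms after putting them on a common scale:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
B = rng.uniform(1, 5, N)          # hypothetical car emissions
C = rng.uniform(0, 2, N)          # hypothetical factory emissions
D = rng.uniform(1, 10, N)         # hypothetical soil emissions
# simulated "truth": m = 2.0, n = 0.5, p = 3.0, plus small noise
A = 2.0 * B + 0.5 * np.exp(C) + 3.0 * np.log(D) + rng.normal(0, 0.1, N)

# the model is linear in (m, n, p) once the columns are transformed
X = np.column_stack([B, np.exp(C), np.log(D)])
coef, *_ = np.linalg.lstsq(X, A, rcond=None)
print("m, n, p =", coef)

# dominance: compare each term's contribution on a common, standardized scale
contrib = coef * X.std(axis=0)
print("std. contributions of B, exp(C), log(D):", contrib)
```

The standardized contributions (coefficient times the spread of its transformed column) are one simple way to rank the predominant factor; variance-decomposition methods would be a more formal alternative.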
How can one make their marketing mix more agile with respect to new channels and an ever-changing environment? What models are used for this analysis, and how are they interpreted?
Our paper on MMM (marketing mix modelling) uses both simple and complex models to discuss advertising effects and collate it all together.
Please read, review, and suggest how we can enhance our research going forward.
I want to run 5 different models to estimate streamflow. To optimize the characteristics of these models I use the Taguchi method, so I have to run the models according to a Taguchi orthogonal array. Therefore, I have different models with different inputs and different data lengths. For example, the first test uses rainfall and temperature in an ANFIS model with a 2-year data length, while the second test uses rainfall, temperature, and the previous day's discharge in an SVR model with a 10-year data length. So the inputs, data length, and model type all change across these tests. What is the best performance evaluation criterion for this study? NRMSE could be a good criterion, because it normalizes the RMSE and thereby removes the effect of the data range.
Now, I want to know whether there is a better solution to this problem.
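NRMSE is indeed a reasonable choice when the tests differ in inputs, scale, and record length, precisely because the normalization makes the score unit-free. A minimal Python sketch of the range-normalized variant (normalizing by the standard deviation or the mean of the observations are common alternatives; whichever you choose, use the same variant for every test):

```python
import math

def nrmse(observed, predicted):
    """RMSE normalized by the observed range, so scores are comparable
    across test sets with different units and spans."""
    n = len(observed)
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    return rmse / (max(observed) - min(observed))

# invented flows for illustration
obs  = [10.0, 20.0, 30.0, 40.0]
pred = [12.0, 18.0, 33.0, 39.0]
print(nrmse(obs, pred))   # -> about 0.071
```

Since any single metric has blind spots, pairing NRMSE with a second, differently behaved criterion such as Nash–Sutcliffe efficiency (standard in streamflow modelling) is a common safeguard.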
Hi
I have a dataset, and I need to check the goodness of fit of the Pearson type III distribution for it.
How can I do it?
Any software? Any MATLAB code?
Thanks
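If SciPy is an option, `scipy.stats.pearson3` implements the Pearson type III distribution, so you can fit it by maximum likelihood and run a Kolmogorov–Smirnov test against the fitted curve. A sketch on simulated data standing in for your dataset (caveat: the standard KS p-value is optimistic when the parameters were estimated from the same sample; a parametric bootstrap of the KS statistic corrects this):

```python
from scipy import stats

# simulated sample standing in for your dataset
data = stats.pearson3.rvs(skew=1.0, loc=10.0, scale=2.0, size=200,
                          random_state=42)

# maximum-likelihood fit of the three Pearson III parameters
skew, loc, scale = stats.pearson3.fit(data)
print("fitted skew/loc/scale:", skew, loc, scale)

# Kolmogorov-Smirnov test against the fitted distribution
ks_stat, p_value = stats.kstest(data, "pearson3", args=(skew, loc, scale))
print("KS statistic = %.4f, p = %.4f" % (ks_stat, p_value))
```

In MATLAB, the same can be done with `mle` and a custom Pearson III pdf (it is a shifted gamma distribution); a probability plot of empirical versus fitted quantiles is a good visual complement to the test.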
My study is about the development of an intumescent coating through the addition of additives A and B. In my research I used samples with different ratios of the two additives, with the control at 0:0. I performed the horizontal fire test and recorded the temperature of the coated steel over time. I need to show whether there is a significant difference between the samples and compare the correlation coefficients of the samples (whether they are statistically significant). I tried using one-way ANOVA and a post hoc test, but I think time affects the temperatures. Should I try two-way ANOVA?
Dear Colleagues ... Greetings
I would like to compare some robust regression methods based on the bootstrap technique. The comparison will use Monte Carlo simulation (the regression coefficients are known), so I wonder how I can bootstrap the coefficient of determination and the MSE.
Thanks in advance
Huda
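Within each Monte Carlo replicate you can nest a case (pairs) bootstrap: resample the (x, y) rows with replacement, refit, and record R² and MSE each time; the percentiles of those bootstrap values give standard errors or confidence intervals. A plain-numpy sketch for OLS — swap in your robust estimator at the marked function; all simulation settings are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_boot = 50, 1000
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 2, n)   # known coefficients, as in your design

def fit_stats(x, y):
    """OLS fit; returns (R^2, MSE). Replace with your robust estimator."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var(), np.mean(resid ** 2)

# case (pairs) bootstrap: resample rows with replacement and refit each time
boot = np.array([fit_stats(x[idx], y[idx])
                 for idx in rng.integers(0, n, size=(n_boot, n))])
r2_ci = np.percentile(boot[:, 0], [2.5, 97.5])
mse_ci = np.percentile(boot[:, 1], [2.5, 97.5])
print("R^2 95% CI:", r2_ci, " MSE 95% CI:", mse_ci)
```

A residual bootstrap (resample residuals and add them back to the fitted values) is the usual alternative when the design points are treated as fixed; for robust estimators the pairs bootstrap shown here is generally the safer default.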
I'm a community ecologist (for soil microbes), and I find hurdle models are really neat/efficient for modeling the abundance of taxa with many zeros and high degrees of patchiness (separate mechanisms governing likelihood of existing in an environment versus the abundance of the organism once it appears in the environment). However, I'm also very interested in the interaction between organisms, and I've been toying with models that include other taxa as covariates that help explain the abundance of a taxon of interest. But the abundance of these other taxa also behave in a way that might be best understood with a hurdle model. I'm wondering if there's a way of constructing a hurdle model with two gates - one that is defined by the taxon of interest (as in a classic hurdle model); and one that is defined by a covariate such that there is a model that predicts the behavior of taxon 1 given that taxon 2 is absent, and a model that predicts the behavior of taxon 1 given that taxon 2 is present. Thus there would be three models total:
Model 1: Taxon 1 = 0
Model 2: Taxon 1 > 0 ~ Environment, Given Taxon 2 = 0
Model 3: Taxon 1 > 0 ~ Environment, Given Taxon 2 > 0
Is there a statistical framework or method for doing this? If so, what is it called, and where can I find more information about it? Can it be implemented in R? Or is there another, similar approach I should be aware of?
To preempt a comment I expect to receive: I don't think co-occurrence models get at what I'm interested in. These predict the likelihood of taxon 1 existing in a site given the distribution of taxon 2. These models ask the question do taxon 1 and 2 co-occur more than expected given the environment? But I wish to ask a different question: given that taxon 1 does exist, does the presence of taxon 2 change the abundance of taxon 1, or change the relationship of taxon 1 to the environmental parameters?
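I am not aware of a standard name for a "double-gated" hurdle, but the structure described is directly estimable: because the three parts share no parameters, the likelihood factorizes, so you can fit (1) a logistic gate for presence of taxon 1, and (2)/(3) count models for the positive abundances stratified by presence of taxon 2 — or, equivalently, one positive-part model with a full taxon2-by-environment interaction, which also yields a test of whether taxon 2 changes the environmental relationships. A rough numpy sketch of the two stratified positive parts, using plain Poisson IRLS as a stand-in for the zero-truncated count model a real hurdle would use; all simulated numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400
env = rng.normal(size=n)                 # one environmental covariate
t2_present = rng.integers(0, 2, n)       # taxon 2 presence (0/1)

# simulate taxon 1: presence gate, then stratum-specific abundance
present1 = rng.random(n) < 1 / (1 + np.exp(-(0.5 + env)))
lam = np.exp(np.where(t2_present == 1, 0.5 + 0.8 * env,    # with taxon 2
                                       1.5 - 0.2 * env))   # without taxon 2
count1 = np.where(present1, rng.poisson(lam), 0)

def poisson_irls(X, y, n_iter=30):
    """Poisson log-link regression via IRLS (sketch, no convergence checks)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        z = X @ beta + (y - mu) / mu     # working response
        w = np.sqrt(mu)                  # sqrt of the IRLS weights
        beta, *_ = np.linalg.lstsq(X * w[:, None], z * w, rcond=None)
    return beta

X = np.column_stack([np.ones(n), env])
pos = count1 > 0
# Model 2: taxon 1 abundance where it occurs and taxon 2 is absent
b_no_t2 = poisson_irls(X[pos & (t2_present == 0)], count1[pos & (t2_present == 0)])
# Model 3: taxon 1 abundance where it occurs and taxon 2 is present
b_t2 = poisson_irls(X[pos & (t2_present == 1)], count1[pos & (t2_present == 1)])
print("without taxon 2:", b_no_t2, " with taxon 2:", b_t2)
```

In R, the same idea is two `pscl::hurdle` fits on the subsets, or one hurdle fit with `env * taxon2_presence` in the count part; a likelihood-ratio test of the interaction terms then answers the question of whether taxon 2 changes taxon 1's relationship to the environment.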
Hi,
I am currently working on a project titled "How does internet use affect interpersonal trust? A comparison across 4 European countries". However, I have been struggling to decide which model best suits my project, as both my dependent variable (interpersonal trust) and my main independent variable (internet use frequency) are ordinal: interpersonal trust ranges from 0 "You can't be too careful" through to 10 "Most people can be trusted", and internet use frequency ranges over 1 "Never", 2 "Only occasionally", 3 "A few times a week", 4 "Most days", 5 "Every day". I was wondering whether there is a model suited to this, or whether I would need to recode my internet use variable. Also, since I will be conducting a cross-country analysis, what model would be suitable for that? I am working with Stata, so any advice on how to do this in Stata would also be greatly appreciated.
Thank you.
I am trying to build an SEM model in AMOS for my dissertation but am having trouble getting a good fit. I am looking for relationships between self-efficacy reported on a Likert scale and scores on an assessment (the two instruments in my study). The items on the left are the items from the self-efficacy questionnaire, and the skills listed on the right are scores on different categories of the test. I computed Pearson's r product-moment correlations, and the covariances I specified were the significant results from the Pearson's r. However, when I set the model up in AMOS and run it, the values I get do not meet the good-fit thresholds :(
Here is what I get: Chi-square = 354.624, Df = 18, Probability level = .000, CFI = .064, TLI rho2 = -.456, RMSEA = .289
When I tried to use the modification indices, it suggested adding covariances between the items on the self-efficacy scale and between the categories of scores on the assessment component.
I tried it out with the suggested covariances and I get: Chi-square = 5.223, Df= 6, Probability level = .516, CFI = 1.00, TLI rho2 = 1.010, RMSEA = .000
Would that be considered a good fit? Can I still use that model even though my research question is to examine the relationships between the two instruments and not between the items within each instrument?
Are the two related or similar?
What difference do they make when selecting the best features?
I have three outputs, each with a different unit. I am searching for an error estimator with which I can select the most suitable neural network configuration. I will have to compute the training error as well as the test error.
With one output, any error estimator (like MAPE or RMSE) would have done the job, but I am unsure what is best suited for this case, as I have three different outputs and I need one single number as the error for comparison.
Would SSE be a good fit for this?
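Raw SSE would let the output with the largest numeric scale dominate, since the three outputs have different units. A common fix is to make each output's error unit-free first (divide its RMSE by that output's standard deviation or range) and then average, giving one comparable number. A small sketch with invented values:

```python
import math

def rmse(y_true, y_pred):
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def std(y):
    m = sum(y) / len(y)
    return math.sqrt(sum((v - m) ** 2 for v in y) / len(y))

def combined_error(trues, preds):
    """Mean standardized RMSE: each output's RMSE is divided by that
    output's standard deviation before averaging, so no single output
    dominates because of its unit or scale."""
    return sum(rmse(t, p) / std(t) for t, p in zip(trues, preds)) / len(trues)

# three outputs on very different scales, same relative error on each
trues = [[1.0, 2.0, 3.0], [100.0, 200.0, 300.0], [0.1, 0.2, 0.3]]
preds = [[1.1, 1.9, 3.2], [110.0, 190.0, 320.0], [0.11, 0.19, 0.32]]
print(combined_error(trues, preds))
```

Equivalently, you can standardize the three targets (z-scores) before training and evaluation and then use ordinary RMSE on the standardized values; both choices yield a single scale-free number.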
Configurations:
RStudio v1.3.1093
Packages: car, lme4, MuMIn, DescTools
Model: myGLM <- glm(interaction ~ Species + Individual + InteractionItemType, family = poisson)
Response variable (461 interactions): count data
Predictor variables ('species' (n = 10), 'individual' (n = 39), 'interaction item type' (n = 3)): categorical data
Description:
Regardless of how I order the predictor variables, the ANOVA (Anova(myGLM, type = "II", test.statistic = "LR")) outputs zero Df and no other results for the variable 'species'.
'Individual' is fully nested in 'species', since every individual belongs to exactly one species. If tested alone, each of the predictor variables 'individual' and 'species' shows a statistically significant effect. The variable 'individual' is therefore a finer-grained categorical 'version' of 'species', but it is an important focus of my study. Compared by AICc and McFadden's pseudo-R², the model with 'individual' shows a significantly better fit than the model with the predictor variable 'species'. My cautious interpretation is that the 'real' effect is individual- rather than species-specific, while the two predictors are collinear. Is that a reasonable interpretation? Are there other possible reasons why 'species' gets no ANOVA results? Is there perhaps a solution for the missing ANOVA results for 'species'? I wasn't able to find much on this topic. Maybe someone can help me. Thanks! :)
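The zero Df for 'species' is exactly what full nesting predicts: every species dummy is the sum of the dummies of its individuals, so once 'individual' is in the model, 'species' adds no new columns to the design matrix (it is aliased), and a Type II test has nothing left to test. A small numpy demonstration of the aliasing (the 12-individuals/3-species layout is invented for illustration):

```python
import numpy as np

n_ind, obs_per_ind = 12, 5
individual = np.repeat(np.arange(n_ind), obs_per_ind)   # like your 39 individuals
species = individual // 4                               # 3 species, fully nested

def dummies(codes):
    """One-hot encode an integer factor."""
    return (codes[:, None] == np.unique(codes)[None, :]).astype(float)

X_ind = dummies(individual)
X_both = np.column_stack([X_ind, dummies(species)])

# species columns are sums of individual columns -> the rank does not increase,
# which is why Anova() reports zero extra Df for 'species'
print(np.linalg.matrix_rank(X_ind), np.linalg.matrix_rank(X_both))
```

The usual remedy is to keep species as a fixed effect and model individual as a random effect nested in species (e.g. `glmer(interaction ~ Species + (1 | Species:Individual), family = poisson)` in lme4), which separates the two levels instead of letting them compete for the same degrees of freedom.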