- Daniel Lorenz Winter asked a question: How to define a positive interaction in two-hybrid screens?
I am having some trouble finding a good definition of a positive interaction (relative to the negative and positive controls) when using a two-hybrid system to detect protein-protein interactions.
There are, of course, thousands of papers out there, but so far I have had no luck finding one that explicitly describes how the authors decided on a cut-off using assays similar to mine: I'm working with a system based on the bacterial two-hybrid (BACTH) system, using an enzymatic assay to measure the activity of the reporter gene, ß-galactosidase.
I often hear about using a signal twice as strong as the negative control as the threshold. But what about the standard error of the negative control? Should I instead use twice (the average + standard error) as the threshold, to be more stringent? Any other ideas?
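For what it's worth, the two candidate thresholds discussed above are easy to compute side by side; a sketch in which the replicate activities are hypothetical example values:

```python
import numpy as np

# Hypothetical beta-galactosidase activities (e.g. Miller units) for
# replicate negative controls -- example values only.
neg = np.array([102.0, 95.0, 110.0, 98.0])

mean_neg = neg.mean()
sem_neg = neg.std(ddof=1) / np.sqrt(len(neg))   # standard error of the mean

threshold_simple = 2 * mean_neg                 # "twice the negative control"
threshold_strict = 2 * (mean_neg + sem_neg)     # more stringent variant

print(threshold_simple, threshold_strict)
```

With the example values, the stringent threshold is only slightly higher; with noisier controls the gap between the two rules grows, which is exactly when the choice matters.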
- A'ang Subiyakto added an answer: What statistics does one use for a Likert scale? My data were collected with a Likert scale. Are they ordinal or interval data? And can I use factor analysis to analyze these data?
Mario O. Bourgoin, do you have a justification for the interval assumption? It could be articles or other sources. Thank you
- How to interpret odds ratios that are smaller than 1?
OR = 1 means that both groups have the same odds.
When OR > 1 (e.g., if OR = 3), the study group has 3 times the odds of having the event compared to the control group.
However, when OR < 1, how does one explain it statistically?
Usually, when the OR is between 1 and 2, we don't consider the exposure a potent risk factor. So if the OR is between 0 and 1, should we consider that the exposure is not a potent protective factor?
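To make the OR < 1 case concrete, a small sketch with hypothetical 2x2 counts: an OR below 1 means the exposed group has lower odds of the event, and taking the reciprocal expresses the same association as an OR above 1 with the groups swapped.

```python
# Hypothetical 2x2 table: exposed/unexposed vs event/no event.
exposed_event, exposed_no = 10, 90
control_event, control_no = 25, 75

odds_exposed = exposed_event / exposed_no   # 0.111...
odds_control = control_event / control_no   # 0.333...
oratio = odds_exposed / odds_control        # 0.333... (< 1: lower odds in exposed)

# The reciprocal flips the reference group: 1/OR = 3 here, i.e. the
# *unexposed* group has 3 times the odds of the event.
print(oratio, 1 / oratio)
```

So "OR = 0.33 for the exposure" and "OR = 3 for the absence of the exposure" describe the same data; reporting the reciprocal is often the easiest way to explain an OR below 1.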
- Fausto Galetto added an answer:Does this probability distribution have a name?
The result of a Bernoulli trial is a variable with 2 possible outcomes. The corresponding probability distribution is called the "Bernoulli distribution".
Consider an experiment whose result is a variable with more than 2 possible outcomes (say, the result of rolling a die, or of checking the base at a mutagenic site in a gene): what is the name of the probability distribution for such sample spaces? Is there a special name? Or is it simply termed a "discrete probability distribution"?
PS: I am not asking for the names of derived distributions like the binomial, geometric, multinomial, hypergeometric and so forth. I am asking for the name of the distribution that is equivalent to the Bernoulli distribution for k > 2 possible mutually exclusive outcomes.
Serban C. Musca · European University of Brittany
You wrote: (warning: do not feed the troll!)
A question (from the Troll!): did you ever read any of my papers and CASES?
A suggestion (from the Troll!): read some of them, e.g.
- Case n° ELEVEN; another WRONG Taguchi application, REFEREES_INCOMPETENT!!!! FIRST part Quality MUST be loved, DISquality MUST be hated
- Case n° SIXTEEN; SECOND PART, other WRONG ideas of D.C. MONTGOMERY!!!!! Quality MUST be loved, DISquality MUST be hated
- Are confidence intervals continuous, or are they discrete?
In the parametric statistical paradigms, 95% CIs are conceptualized as continuous random variables distributed in the manner dictated by the assumed underlying parent population. In contrast to this parametric conceptualization of the nature of classical data, the fundamental premise of both the optimal (“maximum accuracy”) data analysis (ODA) paradigm and novometric theory is that classical phenomena are fundamentally discrete in nature, and no reference is made concerning a hypothetical underlying parent distribution.
In parametric methods, 95% CIs are computed for model effects, but not for error. If a statistical model is applied to a data set consisting of random data, does one expect the model to classify all observations incorrectly? Is a 95% CI also needed for chance?
In novometric theory, CIs for the model and for chance are both discrete. Here is an article discussing these issues:
Greetings Emilio, thank you for your comment.
“It is accepted that random datasets and ‘classical phenomena are fundamentally discrete in nature’."
I am grateful that you begin your comment by mentioning this fundamental insight which ultimately led to the discovery of novometric theory (NT).
“But when we use mathematics like calculus, we use to define any interval as continuous due the analytical advantages that this language and logics gives in many cases.”
In light of tremendous advances made in theoretical and applied physics, chemistry, and engineering, it is very tempting to make such an assumption and attempt to follow in the footsteps of the developers of theoretical and applied quantum mechanics (QM). And, no matter how widely and consistently mediocre the results of parametric methods prove in the modeling of classical phenomena, optimistic researchers are reluctant to develop an alternative approach. However, in my view atomic and molecular phenomena are different than classical phenomena. While the first axiom of QM is that the data represent a complete Hilbert space (and so calculus applies), the first axiom of NT is that the data represent a statistical sample and NOT a complete Hilbert space (and so calculus doesn’t apply). 
“I am not a believer of the supposed benefits of 95% CI, confident intervals, neither of measurements of ‘likelihoods’…”
Neither am I a believer in the benefits of these methods, so long as these entities derive from parametric methods. I believe that for classical phenomena modeled using maximum-accuracy methods the CI constitutes a discrete distribution: no assumptions are made about the character of the distribution, it is what it is.
“…my view is that you may model discrete events and datasets using continuous models to represent good approximations to them and to their inner premises.”
In this respect our ideas diverge, depending on your definition of "good". I believe that QM methods model atomic phenomena well but don't do well when modeling classical phenomena, and that NT methods model classical phenomena well but don't do well when modeling atomic phenomena. For example, at least 600 publications using parametric statistical methods have addressed diffusion phenomena associated with drug-eluting stents. Several recent articles model drug release profiles in terms of an interaction between coating drug diffusivity, conceptualized as a molecular phenomenon, and the physical and functional properties of the surrounding arterial wall: this approach and the resulting models render parametric methods obsolete in this application [e.g., 3].
“Interpreting (probabilities associated with random variables) from datasets is the task of statistics and its applied models to arrive to proxy conclusions and support ‘sound’ decisions under uncertainty due to our limited datasets and methods.”
If I understand your comment correctly (I hope so), we agree: obviously this is a significant challenge, it is our objective as researchers, and it is what ultimately leads to ecologically meaningful substantive discoveries and advances.
- T. Ansah-Narh asked a question: Which numpy syntax can be used to select specific elements in a numpy array?
How can I select some specific elements from a numpy array?
Say I have imported numpy as np:
y = np.random.uniform(0, 6, 20)
I then want to select all elements of y satisfying y <= 1. Thanks in advance
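Boolean-mask indexing does exactly this; a minimal sketch following the setup in the question:

```python
import numpy as np

y = np.random.uniform(0, 6, 20)

selected = y[y <= 1]            # boolean-mask indexing: the values themselves
indices = np.where(y <= 1)[0]   # np.where gives the positions instead

print(selected, indices)
```

`y <= 1` produces a boolean array of the same shape as `y`; using it as an index keeps only the True positions.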
- Satish S Poojary added an answer: What statistics should be applied to compare telomere length with its methylation status and with shelterin protein expression?
I want to correlate telomere length with shelterin protein expression and the methylation status of subtelomeric regions.
It is continuous.
- Emilio José Chaves added an answer:Anyone familiar with the Central Limit Theorem assumptions justification?
In the application of the Central Limit Theorem to sampling statistics, the key assumptions are that the samples are independent and identically distributed. How do you justify these assumptions (i.e. why are they likely to be true?)
I agree that "symmetry of the density (of the random variables) is not very important... as n -> infinity", and would add that this holds even if n is not big (less than 20). That is important because it means that most statistical texts and teachers may be outdated, if not wrong. You have repeated this idea several times for the sake of good science.
Your sentence "It is USEFUL to reach a 'reasonable' approximation also when n is as small as 4 or 5!" must be taken cautiously, because it requires that each value of the dataset be somewhat close to the average of each one of the N possible intervals, in order to be "representative" and to give a close U value. (I do not care about sigma variances.)
"The Central Limit Theorem is a PROBABILITY theorem (proved AND comprised in the field of Probability Theory, NOT in the field of 'Statistics')." My view is that statistics must relate probabilities (frequencies, and other related terms like chances) to datasets, building proxy models of the measured phenomenon.
Thanks a lot, Emilio
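The claim that the normal approximation can already be reasonable for n as small as 4 or 5 is easy to probe by simulation; a minimal sketch (the skewed exponential parent distribution is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 5, 100_000

# Draw many samples of size n from a skewed parent (exponential, mean 1)
# and look at the distribution of the sample means.
means = rng.exponential(1.0, size=(reps, n)).mean(axis=1)

print(means.mean())        # close to the parent mean, 1.0
print(means.std(ddof=1))   # close to the CLT scaling 1/sqrt(5) ~ 0.447
```

A histogram of `means` is already visibly bell-shaped for n = 5, though with a residual right skew inherited from the exponential parent, which is the caution raised above.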
- When should we apply Frequentist statistics and when should we choose Bayesian statistics?
There are plenty of debates in the literature about which statistical practice is better. Both approaches have many advantages, but also some shortcomings. Could you suggest any references that describe which approach to choose and when? Thank you for your valuable help!
The frequency interpretation is based on a tautological (circular) argument. Therefore "frequentist statistics" actually has no sensible philosophical foundation.
- The old battle revisited: why, say frequentists, can a hypothesis not have a probability?
Frequentists, as far as I understand, define probability as a limiting relative frequency. And they say that frequencies can only be defined for data (not for hypotheses), so probabilities can only be given for data and never for a hypothesis. The latter is either true or false, which is generally simply unknown.
I get lost in this line of argumentation when it comes to the problem of the reference set. The relative frequency must be defined with respect to a reference set (related to the "population"). The assumption is that any element of the reference set will be observed with the same relative frequency as n -> infinity. Von Mises introduced the place-selection criterion to further define the kind of "randomness" required to ensure that data will behave this way. I do not understand how place selection is different from the statement that we cannot know the order of the elements in the reference set, and therefore have to expect similar frequencies from any subsequence. If there is no difference, then the whole frequentist approach essentially has an epistemic foundation and is, in principle, not different from approaches where probability is more directly linked to a "state of knowledge".
Now, consider the archetypical example of a coin toss. Based on the binomial distribution with parameter p, one can calculate P(k times heads in n tosses). A typical argument of a frequentist is that k is a random variable and thus can have a probability assigned to it (which may be estimated from actual data). The parameter p, in contrast, is a hypothesis, somehow related to the reference set (or the population), which has a fixed (but unknown) value. It will not vary over different tosses and thus is not a random variable and thus has no probability.
The definition of the reference set remains unclear. What is the infinite sequence of coin tosses? Under which conditions are the coins tossed? The typical answer is: under identical conditions. But if the conditions were identical, the results would be identical, too. So the key is that the conditions of the repetitions are just assumed not to be identical. But what are the actual and possible differences that are allowed (and required)? And what is the frequency distribution of the varying conditions in the infinite set of replications? And why is the parameter assumed to be constant over different replications (I really do not see a justification for this assumption)?
In the end, so it looks to me, the foundation of the whole argumentation is that we do not know what frequency distribution there is. Thus, the whole procedure around frequencies in "replicated" experiments is ultimately about our uncertainty about the precise conditions, or, to put it differently so as to avoid a discussion about determinism: our inability to precisely predict the results/data.
After this very long (and possibly not very helpful) excursus, back to my question:
The frequentist interpretation requires the imagination of an infinite series of replications (with an unknown extent of variation), and uses this to assign a probability. I do not see a difference between the replication of an experiment in this world ad infinitum under different conditions and the "replication" of an experiment in infinitely many similar (but somewhat different) worlds. The latter is often used in cases where the event under consideration is practically impossible to replicate under sufficiently similar conditions (let alone what "sufficiently" means here). For instance, take the beginning of the Second World War, or the extinction of the polio virus, or the eruption of Vesuvius, etc.
But then: if I have some data from which I estimate a parameter (e.g., this p of the binomial distribution), could one not imagine an infinite number of (similar) worlds in which my copies all obtain some (more or less different) data and get (more or less different) estimates? And if so, where is the problem in assigning a probability (distribution) to the parameter (i.e., the hypothesis)?
Looking forward to your comments and critiques.
@Fabrice: I also found this (it's from Hajek, too):
- Is there a meta-analysis out there that supports the Kappa statistic?
If yes, what software can I use? At least, what paper support this type of meta-analysis?
Any help would be much appreciated!
My comment is off-topic, but may be of interest, so I'll err on the side of rigor.
In short, Kappa isn't the best method of assessing reliability/agreement.
If the articles using kappa which you are collecting include the raw data in tabular format, this can easily be assessed, and a summary of your findings would be most illuminating in a companion review.
Sample articles discussing the problems with kappa, and a superior methodology for assessing agreement/reliability, include:
- Can anyone explain a statistical instrument to test independent factors with a dependent variable?
Please note that the data are non-normally distributed (according to the normality test that was run), and all the variables were measured on the same scale or with the same weighting.
WRT "Dr.", I am simply forecasting the future using optimal methods. :-) Seriously, when someone with your interest in methods speaks like an open-minded scientist, that is an easy call to make. If possible, please check out the ODA book; it is the best broad introduction to the paradigm. I can only imagine with relish the advances you may generate...
Paul ("being-a-doctor-doesn't-matter-nearly-as-much-as-thinking-like-a-doctor") Yarnold
- Fausto Galetto added an answer:Does anyone know how to solve this statistical problem?
Assume we have N balls in a box. Each ball has a specific value on it. The distribution of these values is known. Given a specific value x, we draw n balls from the box each time and note the average value of these balls as y. The question is: what is the minimum n such that y > x at a specified confidence level? Thanks in advance. Any guesses and discussion are welcome.
Dear friends, I'm going to upload a document, "Probability Theory, The problem of Faustine Cheng, National University of Singapore, SOLVED-17-10-14".
I am NOT SURE... Can you please analyze it and criticize it?
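The original question also lends itself to a Monte-Carlo sketch: for a given n, estimate P(y > x) by repeated draws without replacement, then increase n until that probability reaches the desired confidence level. Every distribution and value below is a hypothetical placeholder, not part of the original problem:

```python
import numpy as np

rng = np.random.default_rng(1)

values = rng.normal(10.0, 3.0, size=200)   # hypothetical known ball values (N = 200)
x = 8.5                                    # hypothetical target threshold
confidence = 0.95                          # desired confidence level

def prob_mean_exceeds(n, reps=5_000):
    """Monte-Carlo estimate of P(mean of n balls drawn without replacement > x)."""
    draws = np.array([rng.choice(values, size=n, replace=False).mean()
                      for _ in range(reps)])
    return (draws > x).mean()

# Smallest n whose estimated P(y > x) reaches the confidence level.
n = 1
while prob_mean_exceeds(n) < confidence:
    n += 1
print(n)
```

For a normal-ish value distribution one can also solve this analytically (mean of n draws has standard deviation sigma/sqrt(n), shrunk by the finite-population correction), but the simulation works for any known distribution.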
- Sheeja Krishnan asked a question: ABC for parameter estimation from aggregate data?
How do I estimate parameters using ABC when the quantity predicted by the model is given by the product of two observations (with no access to raw data)? For example, the model predicts a mean distribution of various phenotypes in blood. The experimental data come in two parts. First, I have the mean and SD of the total cell number from blood at various time points for n patients. The second dataset consists of the percentage distribution of cells (mean and SD known) in the total cells at the same time points. The individual raw data are not available. So to obtain the mean distribution, I could multiply total * percentage distribution / 100. But how do I incorporate the experimental error in ABC? Any thoughts?
- Pedro Range added an answer:How is the proportion of variation explained by each variable calculated in DISTLM marginal tests?
I have several cases where the cumulative value exceeds one. I assume this is caused by strong collinearity among variables? How does it affect the analysis?
I was referring to sequential tests, where each variable is considered separately. It is now clear to me that there is no logic in adding up the "Prop." values in these tests (that is probably why the cumulative values are not included in the output); thank you for pointing this out, Sokratis. When the RDA axes are considered, the cumulative percentage explained is always less than 100, as Mathieu mentioned.
Thank you both for the very helpful replies. Best regards
- Senthilvel Vasudevan added an answer: Hello, I need a database / library of statistics on the cosmetics industry in Australia
Do you know of online statistical information or research papers with statistics for Australian business?
Hi, good morning,
For your study, i.e., statistics on the cosmetics industry in Australia:
- Franz Strauss added an answer: Does anyone know how to calculate the sample size in an "in vitro" study using ATCC bacteria?
I'm doing research testing the viability of P. gingivalis (ATCC) against ozone, and I need to know whether a sample size calculation is necessary.
Thank you very much, Federico
- Matthias Templ added an answer:Is anyone familiar with Logarithmic calibration and geometric means?
I have a logarithmic calibration line:
MachineSignal = -3.21 log(concentration) + 21.9
It seems obvious to me that if I have a triplicate measurement of concentrations (measured indirectly through MachineSignal), they should be "averaged" with the geometric mean, not the arithmetic mean. The arithmetic mean would be O.K. only for raw MachineSignal values.
Can I have some authoritative-looking citation for that? I need it for a research paper I am writing.
If you deal with concentrations, you might also think of compositional data analysis and log-ratio transformations instead of the log transformation, i.e., in this case, the average of log-ratios.
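The intuition in the question can be checked numerically: with a log-linear calibration, the geometric mean of the back-transformed concentrations corresponds exactly to the concentration implied by the arithmetic mean of the raw signals. A sketch using the calibration line above (assuming "log" means log10; the triplicate signal values are hypothetical):

```python
import numpy as np

def signal_to_conc(s):
    # Inverse of MachineSignal = -3.21 * log10(conc) + 21.9
    # (assuming the calibration uses log10)
    return 10 ** ((21.9 - s) / 3.21)

signals = np.array([12.0, 13.0, 14.0])        # hypothetical triplicate signals
concs = signal_to_conc(signals)

geo_mean = np.exp(np.mean(np.log(concs)))     # geometric mean of concentrations
arith_mean = concs.mean()                     # arithmetic mean, for comparison

print(geo_mean, arith_mean)
```

By the AM-GM inequality the arithmetic mean of the concentrations is systematically larger, while the geometric mean agrees with `signal_to_conc(signals.mean())`, i.e. with averaging on the signal scale first.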
- Is anyone familiar with full factorial design?
I did a 2-level full factorial design with 3 factors in order to find the significance of each factor and their interactions on a response. After ANOVA on the non-transformed response, I found no factors with significant p-values, while 2 factors were observed with significant p-values when the same response was log-transformed. What can I say about the significance of the factors? Are they significant or not?
In stark contrast to parametric methods, both Type I error and effect strength (classification accuracy) for optimal (maximum-accuracy) methods are invariant over any monotonic transformation of the data.
For example, see: Yarnold, P.R., Soltysik, R.C., & Martin, G.J. (1994). Heart rate variability and susceptibility for sudden cardiac death: An example of multivariable optimal discriminant analysis. Statistics in Medicine, 13, 1015-1021.
- Thom S Baguley added an answer: Is Cohen's d equal to the z statistic? How can I calculate Cohen's d using Z scores?
I wanted to know if Cohen's d is equal to Z scores.
It is worth adding that Cohen's d for a data set can be calculated in several different ways, depending on what sort of data it is, what sort of question you are interested in, and what assumptions you want to make.
This becomes important in meta-analysis because applying the same formula to two different data sets might produce estimates of two different quantities.
Also, more generally, Cohen's d - as usually calculated - is a measure of the discriminability or detectability of an effect, not a measure of how big it is. That is because it is scaled in terms of sample variability (the SD), and thus anything that influences the SD influences d (even if it doesn't change the size of the effect).
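For reference, the most common variant for two independent groups (pooled-SD d) can be computed as follows; as noted above, this is only one of several possible formulas, and the example data are made up:

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled SD."""
    na, nb = len(a), len(b)
    var_a, var_b = np.var(a, ddof=1), np.var(b, ddof=1)
    pooled_sd = np.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled_sd

a = np.array([5.0, 6.0, 7.0, 8.0])   # hypothetical group 1
b = np.array([3.0, 4.0, 5.0, 6.0])   # hypothetical group 2
print(cohens_d(a, b))
```

Note how the denominator is the sample SD, not a standard error: unlike a z statistic, d does not grow with sample size, which is exactly the distinction drawn in the answer above.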
- Valarmathi Srinivasan added an answer:What is the difference between Pilot Study and Pre test? Which one should be used for sample size calculation?
Is there any similarity between a pretest and a pilot study?
Calculate the sample size for the main study. Take 5% of the calculated sample size for the pilot study.
If it is like a pre-test and post-test design, then you have to calculate the sample size based on a paired t-test.
- Gigi Voinea added an answer:How do I obtain a synchrony score?
I want to measure the degree of synchrony between some individual cyclic events for a group. What is the best indicator for synchrony? A synchrony score? Is there in statistics such an indicator/method? Which is the best way to calculate such an indicator?
I am sorry about the incomplete answer; I wrote it from my phone and I do not know what happened.
Can each tree be considered a different observation? Are the measurements taken at the same time intervals? If so, I would recommend Functional Data Analysis. Its techniques are implemented in R in the fda package (parametric version) and fda.usc (non-parametric version). I have done some work with FDA, and I can help you with the computation.
Keep in touch!
- Priya Chaudhary added an answer: Is there any standard / robust method to identify outliers? I have performed linear regression analysis. I wish to know if there is any standard procedure to identify outliers with precision. I am using Matlab for statistical analysis. If anyone has come across any specific function in Matlab for this, kindly let me know.
Thanks a lot, everyone.
- Rami Ben Haj-Kacem added an answer: What is the best statistical technique to analyse the impact of X on Y in qualitative research? Say, for example, we are trying to assess the impact of X on Y, where Y in turn consists of parameters K, L, M, N, O, P, which are qualitative in nature. More concretely, suppose we are trying to assess the impact of a particular government scheme on poverty alleviation among the scheme's beneficiaries. Which statistical techniques can best be used to assess the impact?
If the explanatory variables are qualitative, you have to convert each possible state to a binary variable and use Ordinary Least Squares (OLS). However, as your variables are socioeconomic indicators, OLS is exposed to the risk of violated hypotheses such as heteroscedasticity... Thus, think about using non-parametric estimation techniques.
If your endogenous variable is qualitative, you have to use an ordered or multinomial logit, depending on the type of your endogenous variable.
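The dummy-coding step described above can be sketched with plain numpy; the scheme categories and outcome values are hypothetical, and one level is dropped as the reference to avoid perfect collinearity with the intercept:

```python
import numpy as np

# Hypothetical qualitative explanatory variable with 3 states.
scheme = np.array(["none", "partial", "full", "full",
                   "partial", "none", "full", "partial"])
y = np.array([2.0, 3.5, 5.0, 5.5, 3.0, 1.5, 6.0, 4.0])  # hypothetical outcome

# One-hot encode, dropping "none" as the reference category.
levels = ["partial", "full"]
X = np.column_stack([np.ones(len(y))]
                    + [(scheme == lv).astype(float) for lv in levels])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)   # [intercept, effect of "partial", effect of "full"] vs the reference
```

Each coefficient is then the difference in the mean outcome between that category and the reference level, which is what "impact of the scheme" means in this coding.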
- Michael Von Kutzschenbach added an answer: What are the most interesting and challenging phenomena to model? I'm putting together a list of ongoing modeling works (from different research areas) which try to understand and predict the behavior of an interesting phenomenon. Just a few examples to kick off the discussion:
- In environmental systems, global climate change modeling seems to be a perpetual challenge. Making such models accurate enough for quantitative prediction remains an open problem.
- In obesity and nutrition, the current literature provides over 100 statistical equations to estimate basal metabolic rate (BMR) as a function of different attributes (e.g. age, weight, height, etc.), yet understanding how BMR is precisely modeled based on those attributes is an interesting area of research.
There seem to be tons of examples in biology, psychology, economics, engineering, etc, but what are the 'publicly interesting' challenges that you would like to add here?
We will research how system dynamics modeling can be applied to enable the balanced scorecard to fulfill its potential as a strategic management tool. We will develop a simplified model of an organization in the context of the balanced scorecard. The original balanced scorecard consists of four fundamental sectors; in our model, we include a fifth sector, called the environment, to represent the ecological aspects of sustainability. (See http://kutzschenbach.eu/zur-person-2/activites/)
- How reliable is a fitted model with the following statistics?
How reliable is a model with:
- Model p-value < 0.05
- Lack of fit p-value > 0.1
- R2=0.99, Adj. R2= 0.99, Pred. R2=0.99
- normal probability plot attached (residuals do not follow a straight line).
Impossible to answer. You will have to specify what model you mean (structural and random parts). Further, you should have some criterion of "reliability". How would you measure it? What are acceptable values for the "reliability"?
The values you gave are quite meaningless. The p-value refers to some null hypothesis of a specified model. If the model is mis-specified, neither the null nor the p-value provides any reasonable insight. The same applies to the lack of fit. The R² values would not be useful even if the model were correctly specified, since they give only the proportion of explained variance, which is not related to reliability.
Showing "R²" and a "normal probability plot" implies that the random component of your model is based on the normal distribution. The plots indicate that the underlying assumptions are far from reasonable in your case.
- Honglan Shi added an answer:How to get the regression equation using the PLS model?
When we use a PLS model, the original data x and y are projected onto t and u, so the dimension can be reduced; for example, the original dimension is 12, and in the PLS model we use 4 as the optimal dimension, meaning we select 4 latent variables (LVs). Then we can get the regression equation u = a*(LV1) + b*(LV2) + c*(LV3) + d*(LV4), but what about an equation in x and y? Can we get an equation of the form y = e*(x1) + f*(x2) + g*(x3) + h*(x4)? If so, how do we get this equation? Or is it impossible?
Thank you, guys. Your suggestions are very useful to me.
- Valarmathi Srinivasan added an answer:How to implement this idea in SAS?
I use multinomial logistic regression in SAS to study a 3-level dependent variable, say the variable name is disease (the 3 levels are normal, undiagnosed disease, and diagnosed disease), and independent variables, for example, protein intake. The SAS code is as follows:
PROC SURVEYLOGISTIC DATA = dataset;
CLASS disease (ref = 'normal') / param = ref;
MODEL disease = protein / link = glogit;
RUN;
And I got part of the resultant table from SAS output (in the attachment).
If I want to investigate whether the association with the variable protein differs between diagnosed and undiagnosed disease, then I was advised to do a likelihood ratio test comparing two nested models for the independent variable protein. The first model is the multinomial logistic model you have already run. The second model is a constrained logistic model, in which you force the association of the independent variable "protein" with both diagnosed and undiagnosed disease to be the same. Then you compare the likelihoods obtained from these two models. If the p-value is significant, it implies that the unconstrained model fits the data much better than the constrained model, and the independent variable has a different association with diagnosed and undiagnosed disease.
Here is my question: how do I use SAS to implement this idea? Has anyone already done similar work and could provide me a template of SAS code? This problem has been upsetting me for a couple of days :).
Also, I read some research papers that have a p-value for each of the odds ratios. How did they get that? I did not see p-values for odds ratios in the SAS and SPSS output.
Thank you very much!
Instead of concentrating on p-values, one could check the confidence interval (CI) of the OR.
If the CI for an OR does not include 1 (i.e., no difference), the OR is statistically significant.
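On the likelihood-ratio test itself: once the two models are fitted, the test statistic is just the difference of their -2 Log L values, referred to a chi-square distribution with df equal to the number of constraints. A sketch of that arithmetic (the -2 Log L values below are hypothetical placeholders, not real SAS output; in SAS they come from the "-2 Log L" entries of the two fits):

```python
from scipy.stats import chi2

# Hypothetical -2 log-likelihoods from the two fitted models.
neg2ll_constrained = 512.4     # protein effect forced equal across outcomes
neg2ll_unconstrained = 505.1   # separate protein effect per outcome

lr_stat = neg2ll_constrained - neg2ll_unconstrained   # = 2*(logL_u - logL_c)
df = 1                                                # one constraint relaxed
p_value = chi2.sf(lr_stat, df)

print(lr_stat, p_value)   # small p => the two associations differ
```

The same subtraction works whatever software produced the log-likelihoods, which is why the nested-model comparison can be done by hand from the two SAS fit summaries.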