Science topic

Statistical Computing - Science topic

Explore the latest questions and answers in Statistical Computing, and find Statistical Computing experts.
Questions related to Statistical Computing
  • asked a question related to Statistical Computing
Question
1 answer
I need a scientific paper research topic with a hypothesis and statistical computation.
Relevant answer
Answer
Would you have any interest in a follow-up to the attached paper? Best wishes, David Booth
  • asked a question related to Statistical Computing
Question
3 answers
Hey,
I want to calculate the standard deviation for each substituent of two molecules using Excel, and then calculate the average of all the values (not the S.D.).
For the S.D. I used STDEV.P and for the average I used AVERAGE. Is this the right way? Or should I use STDEV.S? Or should I calculate the range (largest minus smallest) instead of the average?
Relevant answer
Answer
"R squared is higher" -- compared to what?
"STDEV isn't lower" -- compared to what?
The STDEV of the (absolute) differences between x- and y-values is related to the correlation between x and y: the higher the correlation between x and y, the smaller the STDEV of their differences should be. But in your case I don't see what you are comparing your results with - you have only a single sample.
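As a small aside, that relationship can be illustrated with a quick R sketch (simulated data, not the molecules in the question): for standardized variables, Var(x - y) = 2(1 - cor(x, y)), so a higher correlation means a smaller SD of the differences.
# Illustrative only: simulated data
set.seed(1)
n  <- 50
x  <- rnorm(n)
y1 <- 0.9 * x + sqrt(1 - 0.9^2) * rnorm(n)   # strongly correlated with x
y2 <- 0.2 * x + sqrt(1 - 0.2^2) * rnorm(n)   # weakly correlated with x
cor(x, y1); sd(x - y1)   # high correlation, small SD of the differences
cor(x, y2); sd(x - y2)   # low correlation, larger SD of the differences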
  • asked a question related to Statistical Computing
Question
5 answers
I am conducting a study and had two people to vet the results for me. Person A used SPSS 25, while Person B and I used SPSS 28. We all got the same frequency table results (same central tendencies, range, standard deviation, case count), same ordinal regression results too, but when we ran bivariate correlation the SPSS 25 results were different. It gave significant results for 4 of the 6 variables, while the SPSS 28 gave insignificant results for 5 of 6. We double checked the data tables and variable types, and everything is identical. We had the data run in STATA and the correlation results were very close or matching to the SPSS 25 results. We went over the steps to check for variation and there's none. Does anyone have any idea what could cause this? Any recommendations on which results you would go with if we can't find a resolution to getting data to match?
Relevant answer
Answer
Hello Shelly-Ann,
Do the computed correlations match?
Are all variables in your data set defined the same way (e.g., "Scale", "Ordinal", or "Nominal") when they are run on the two different versions of SPSS?
If so, the discrepancy may be caused by an adjustment for the set of hypothesis tests (making individual significance tests less powerful). Or, a new probability distribution function for version 28 (for t-distribution and/or F-distribution).
The way to check the first possibility is to enter just a pair of variables in the correlation dialog box and execute the analysis.
The way to check the second possibility is to evaluate some known results in both versions for the t-distribution and F-distribution.
Example:
Prob_t = compute 1 - CDF.T(2.086, 20) (should yield .0250, to 4 dec places)
Prob_F = compute 1 - CDF.F(4.35, 1, 20) (should yield .0500, to 4 dec places)
I'd also suggest you request both summary statistics and cross-product deviations and covariances (available under "Options" in the correlation/bivariate dialog box) in each version of SPSS and compare these. Any discrepancy would signal a potential problem.
Failing these steps, save the generated syntax for each SPSS version of the correlation subprogram, even if you're using dialog boxes to elicit the analyses, and make sure that these are identical across versions.
Good luck with your work.
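For what it's worth, the same sanity checks can be reproduced outside SPSS; here is a minimal R sketch whose reference values match the CDF.T/CDF.F examples above:
# Upper-tail probabilities corresponding to the SPSS CDF.T / CDF.F checks
1 - pt(2.086, df = 20)           # should be ~0.0250
1 - pf(4.35, df1 = 1, df2 = 20)  # should be ~0.0500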
  • asked a question related to Statistical Computing
Question
3 answers
I have six compounds which I tested for antioxidant activity using the DPPH assay, and also for anticancer activity on five types of cell lines, so I have two groups of data:
1. Antioxidant activity data
2. Anticancer activity (5 types of cancer cell line)
Each data consisted of 3 replications. Which correlation test is the most appropriate to determine whether there is a relationship between the two activities?
Relevant answer
Answer
Just doing a logistic regression is what I had in mind. The DV might be anticancer activity (yes/no), and the same for antioxidant activity. Best wishes, David Booth
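If you go the route David suggests, a minimal R sketch could look like the following; the data frame and the yes/no coding of "active" are hypothetical, and with only six compounds any such model should be read as purely descriptive.
# Hypothetical example data: one row per compound, activities dichotomized
d <- data.frame(
  antiox_active  = c(1, 1, 0, 1, 0, 0),  # antioxidant activity coded yes/no
  anticancer_act = c(1, 0, 0, 1, 0, 1)   # anticancer activity coded yes/no
)
# Logistic regression of anticancer activity on antioxidant activity
fit <- glm(anticancer_act ~ antiox_active, family = binomial, data = d)
summary(fit)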
  • asked a question related to Statistical Computing
Question
8 answers
Dear RG experts,
The statistical study includes a number of tests, some of which are well-known, while others are controversial. For me, applying such dubious standards to real-world problems is a major issue. I learn all of these in order to solve real-life problems.
Please help me with the application of the run test.
Relevant answer
Answer
Respected James R Knaub Sir,
Thank you so much. Further, I will ask about some problems with it.
  • asked a question related to Statistical Computing
Question
4 answers
What type of design was used and how was statistical computation and graphing achieved? Was a software package used?
Relevant answer
Answer
Yes to all your questions. The tutorial is currently under review.
  • asked a question related to Statistical Computing
Question
3 answers
I found that most equations, such as those for d' and A', are for balanced designs with AA and AB pairs.
I am sincerely writing to ask for help concerning a method or equation for calculating sensitivity with unequal weights. Thanks a million.
Relevant answer
Answer
Min,
Perhaps the attached paper will help. It was published in "Psychological Review," a highly prestigious and influential journal. The senior author is a well-known expert in signal detection.
Best,
Don Polzella
  • asked a question related to Statistical Computing
Question
3 answers
Dear all: I am wondering if someone has R code (R functions) to run the test procedures described in the paper "Estimation and comparison of lognormal parameters in the presence of censored data" by Stavros Pouloukas, Journal of Statistical Computation & Simulation, Vol. 74, No. 3, March 2004, pp. 157-169. I can send a copy of the paper if necessary. With many thanks, Abou
  • asked a question related to Statistical Computing
Question
6 answers
Hello!
I have successfully developed and implemented ANFIS in R with the help of the FRBS package. The only thing remaining is to visualize the ANFIS network.
Currently, due to some COVID-related constraints, I don't have access to Matlab while working from home, so I was wondering if there is any way to do this in R.
Relevant answer
Answer
Ritesh Pabari
No, I couldn't. I used Matlab instead for ANFIS. It was quite robust: you could view the architecture, rules, etc. quite easily, and you could also customize membership functions with ease.
  • asked a question related to Statistical Computing
Question
5 answers
Which is the best book for understanding social science statistical analysis tools?
Hello Friends!
I have been searching for a good book for understanding and applying social science statistical analysis tools. I am new to this field; please recommend some good books on the topic.
Thanks
Relevant answer
Answer
Read the fundamentals of the research.
Champion, D. J. (1970). Basic statistics for social research. Scranton: Chandler Publishing Company.
  • asked a question related to Statistical Computing
Question
8 answers
Prediction bounds are important because, unlike a confidence interval, they account for the error of the regression model (the variation of individual observations around the fitted line) and not only the sampling error of the estimated mean.
Relevant answer
Answer
Thank you, dear Firdos.
Thanks, Mohammed.
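To make the distinction raised in the question concrete, here is a minimal R sketch with simulated data: prediction intervals are wider than confidence intervals because they add the model's residual error to the uncertainty of the fitted mean.
set.seed(42)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)
fit <- lm(y ~ x)
new <- data.frame(x = c(2, 5, 8))
predict(fit, new, interval = "confidence")  # uncertainty of the mean response only
predict(fit, new, interval = "prediction")  # adds residual (model) error: wider bounds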
  • asked a question related to Statistical Computing
Question
17 answers
What are the most valuable lessons you've learnt after using R?
Relevant answer
Answer
In my view, R is efficient for quick data analysis and visualization. It is also simple to expose results through an interactive web app with the package shiny.
However, I agree with Clément Poiret that Python is almost inevitable when you work in industry, thanks to its versatility and its support for continuous-integration workflows. I first learned coding with R at school; a few years later I had to use Java and C++, which led me to review and improve the way I coded in R. After that I learned Python for OOP and statistical programming, which led me to improve my R skills again. Now, with a few years of experience, I often switch between R and Python depending on the task. As I said at the start, I always use R and RStudio for data exploration (I find it powerful for dealing with big datasets and parallel processing) and quick reports with Rmarkdown (HTML and TeX). I necessarily use Python when I need robustness for heavy projects and OOP (e.g., to improve factorization and inheritance).
Finally, I would say that there is ALWAYS a rigorous way to code in R, but it is not mandatory, and the language does allow bad practice that obtains the same results. It is not just about code performance, but also about the syntax of the functions or even simply about the code implementation. For statisticians working in teams who learned only R, I would recommend using packages with documentation (e.g. roxygen2) and unit tests (e.g. testthat).
  • asked a question related to Statistical Computing
Question
4 answers
A linear mixed model was fitted using the lmer function of the lme4 package in R. Does anyone have an idea of how to extract studentized conditional residuals for individual data points?
Relevant answer
Answer
The function stats::rstudent() extracts studentized residuals.
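A short, hedged illustration: stats::rstudent() works directly on lm/glm fits, as shown below. For a merMod object from lme4 I am not aware of a built-in rstudent() method; residuals(fit, scaled = TRUE) returns the conditional residuals divided by the estimated residual standard deviation, which is often used as an approximation (it is not a fully studentized residual, which would also involve the hat values).
# Ordinary linear model: rstudent() gives studentized residuals directly
fit_lm <- lm(dist ~ speed, data = cars)
head(rstudent(fit_lm))
# Linear mixed model (lme4): scaled conditional residuals as an approximation
library(lme4)
fit_lmm <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
head(residuals(fit_lmm, scaled = TRUE))  # conditional residuals / sigma(fit_lmm)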
  • asked a question related to Statistical Computing
Question
4 answers
DDoS attacks can be detected using statistical-based, soft-computing-based, knowledge-based, and data-mining/machine-learning-based methods. These methods have proved efficient at detecting attacks but lag behind in automation. Also, these DDoS attack detection methods are localized standalone systems that predict DDoS attacks based on data traffic rather than detecting them on the spot.
Relevant answer
Answer
You can also run the IDS in a decentralized (non-centralized) setup.
Regards
  • asked a question related to Statistical Computing
Question
7 answers
I have a list of phi, psi angles derived from a number of PDB files based on some criteria. I have plotted a scatter plot of them using matplotlib. But I want to show the favoured, allowed and generously allowed regions in different shades of color in the background. For better understanding, I am providing the scatter plot (D_torsionAngle.png) I have already made and an example of the plot (1hmp.png) I want to produce.
Relevant answer
Answer
  • asked a question related to Statistical Computing
Question
5 answers
We used SPSS to conduct a mixed model linear analysis of our data. How do we report our findings in APA format? If you can direct us to a source that explains how to format our results, we would greatly appreciate it. Thank you. 
Relevant answer
Answer
The lack of a standard error depends on your software, and even then it only applies to the variance terms. The reason for this is that the variance cannot go negative, so the sampling distribution can often be expected to be skewed rather than asymptotically normal. So just explain this in your results table.
  • asked a question related to Statistical Computing
Question
10 answers
It is known that the FPE gives the time evolution of the probability density function of a stochastic differential equation.
I could not find any reference that relates the PDF obtained from the FPE to the trajectories of the SDE.
For instance, suppose the solution of the FPE corresponding to an SDE converges to pdf = \delta_{x0} asymptotically in time.
Does it mean that all the trajectories of the SDE will converge to x0 asymptotically in time?
Relevant answer
Answer
The Fokker-Planck equation can be treated as a so-called forward Kolmogorov equation for a certain diffusion process.
To derive a stochastic equation for this diffusion process, it is very useful to know the generator of the process. To find the form of the generator, you have to consider the PDE dual to the Fokker-Planck equation, which is called the backward Kolmogorov equation. The elliptic operator in the backward Kolmogorov equation coincides with the generator of the required diffusion process. Let me give you an example.
Assume that you consider the Cauchy problem for the Fokker-Planck type equation
u_t=Lu, u(0,x)=u_0(x),
where Lu(t,x)=[A^2(x)u(t,x)]_{xx}-[a(x)u(t,x)]_x.
The dual equation is h_t+L^*h=0, where L^*h= A^2(x)h_{xx}+a(x)h_x.
As a result the required diffusion process x(t) satisfies the SDE
dx(t)=a(x(t))dt+A(x(t))dw(t), x(0)= \xi,
where w(t) is a Wiener process and \xi is a random variable independent of w(t) with distribution density u_0(x).
You may see the book Bogachev V.I., Krylov N.V., Röckner M., Shaposhnikov S.V. "Fokker-Planck-Kolmogorov equations"
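To connect the two objects numerically, here is a hedged R sketch (my own toy example, not from the book cited above): simulate many trajectories of the Ornstein-Uhlenbeck SDE dx = -x dt + sqrt(2) dw by Euler-Maruyama and compare the empirical distribution at a large time with the stationary solution of the corresponding Fokker-Planck equation, which is N(0, 1). The distribution converges even though individual trajectories keep fluctuating.
set.seed(1)
n_paths <- 5000; n_steps <- 2000; dt <- 0.005
x <- rnorm(n_paths, mean = 3, sd = 0.1)                     # initial distribution u_0
for (k in 1:n_steps) {
  x <- x + (-x) * dt + sqrt(2) * sqrt(dt) * rnorm(n_paths)  # Euler-Maruyama step
}
# Empirical density of x(T) vs. stationary Fokker-Planck solution N(0, 1)
hist(x, breaks = 50, freq = FALSE, main = "x(T) vs stationary FPE density")
curve(dnorm(x, 0, 1), add = TRUE, lwd = 2)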
  • asked a question related to Statistical Computing
Question
2 answers
I am doing research on colon cancer and want to compare survival between two different treatments in retrospective data. I want to do propensity score matching to adjust for bias, using the inverse propensity score weighting method. I calculated the inverse propensity score weight from the propensity score and was able to get an odds ratio by using this weight variable in SPSS. But when I try to calculate the KM curve I get the following error: "No statistics are computed because nonpositive or fractional case weights were found". Any suggestions on how to proceed?
Relevant answer
Answer
Would you like to provide R code for IPTW?
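In case an R route is acceptable, here is a minimal, hedged sketch of an IPTW-weighted Kaplan-Meier analysis with the survival package; the data frame dat and its columns (time, status, treat, x1, x2) are hypothetical placeholders, and, unlike the SPSS procedure, survfit() accepts fractional case weights.
library(survival)
# Hypothetical data frame 'dat' with: time, status (1 = event), treat (0/1),
# and covariates x1, x2 used for the propensity score
ps    <- glm(treat ~ x1 + x2, family = binomial, data = dat)$fitted.values
dat$w <- ifelse(dat$treat == 1, 1 / ps, 1 / (1 - ps))   # inverse probability weights
# Weighted Kaplan-Meier curves (fractional weights are allowed here)
fit <- survfit(Surv(time, status) ~ treat, data = dat, weights = w)
plot(fit, col = 1:2)
# Weighted Cox model with a robust variance to account for the weighting
coxph(Surv(time, status) ~ treat, data = dat, weights = w, robust = TRUE)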
  • asked a question related to Statistical Computing
Question
25 answers
Production Engineering is an engineering area that deals with the problems of productive operations, with emphasis on the production of goods and services. Operational Research (OR) is defined as the area responsible for solving real problems, addressing decision-making situations with mathematical models. OR is an applied science focused on solving real problems that applies knowledge from other disciplines, such as mathematics, statistics and computation, to improve rationality in decision-making processes.
Operational Research (OR) is responsible for solving real problems through mathematical and statistical models. How have we used OR in our research?
Relevant answer
Answer
Dear Leopoldino
I guess that you are referring to OR (Operational Research) and you use the Brazilian OP (Pesquisa Operacional). If so, I believe that the most important aspect of your remark is that OR, and most especially Linear Programming (LP), solves real problems, which is possible because of its algebraic structure of inequalities. It allows the construction of a scenario model far more representative than in any MCDM method.
LP was developed in 1939 by Leonid Kantorovich, and for this development he was awarded the Nobel Memorial Prize in Economic Sciences in 1975. It is thus the granddad of all present-day methods for MCDM.
The actual algorithm, the Simplex, is due to the genius of George Dantzig, who developed it in 1947. This same algorithm is still used today, after 70 years, and is, according to some sources, used by about 70,000 large companies world-wide.
LP processes large amounts of information; the number of alternatives and criteria is irrelevant and can be in the thousands each, and it is so important that Excel has incorporated it as an add-in since 1991.
However, LP has two severe drawbacks: it works with only one objective and with only quantitative criteria, which is not very realistic in today's projects. Nowadays, these two problems have been superseded by new methods and software based on LP, which do not yield an optimal solution, as the original Simplex does, but one which is probably very close to it.
What is important is that LP allows for modelling very complex scenarios, incorporating features that none of the more than two dozen MCDM methods on the market can handle, and for this reason it is able to model real problems by establishing restrictions, dependencies and even correlation.
In addition, it does not produce Rank Reversal.
I have been using LP for decades, and in the late 70s I was fortunate enough to act as a counterpart of the Massachusetts Institute of Technology (MIT), solving by LP, in a two-year project, a very complex problem related to a river basin.
Since then, I solved more than one hundred problems in very different areas, and many of them have been published in my books, and at present, I am trying to promote its use in RG, where, as you properly say, rationality is paramount.
LP mathematics is a little complex; however, a user does not need to know it, in the same way that he does not need to know the mathematics of AHP, PROMETHEE or TOPSIS. As a matter of fact, LP is easier to use than other methods since no weights are needed.
A couple of weeks ago I proposed in RG to develop a course on LP, however, nobody was interested.
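For readers who would like to experiment with LP from R rather than Excel, here is a minimal sketch using the lpSolve package; the two-variable toy problem is mine, purely to show the mechanics.
library(lpSolve)
# Maximize 3x + 2y subject to: x + y <= 4, x + 3y <= 6, x >= 0, y >= 0
obj <- c(3, 2)
con <- matrix(c(1, 1,
                1, 3), nrow = 2, byrow = TRUE)
dir <- c("<=", "<=")
rhs <- c(4, 6)
sol <- lp("max", obj, con, dir, rhs)
sol$solution   # optimal x and y (here x = 4, y = 0)
sol$objval     # optimal objective value (here 12)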
  • asked a question related to Statistical Computing
Question
3 answers
How to describe the process of analyzing statistical data carried out with the help of Business Intelligence in Big Data database systems?
How to describe the main models of statistical data analysis carried out with the help of computerized Business Intelligence tools used for the analysis of large data sets processed in the cloud and analyzed multi-criteria in Big Data database systems?
Please reply
Dear Colleagues and Friends from RG
Some of the currently developing aspects and determinants of the applications of data processing technologies in Big Data database systems are described in the following publications:
The issues of the use of information contained in Big Data database systems for the purposes of conducting Business Intelligence analyzes are described in the publications:
I invite you to discussion and cooperation.
Best wishes
Relevant answer
Answer
I found the answer in this paper.
  • asked a question related to Statistical Computing
Question
20 answers
Big Data database systems can significantly facilitate the analytical processes of advanced processing and testing of large data sets for the needs of statistical surveys.
The current technological revolution, known as Industry 4.0, is determined by the development of the following technologies of advanced information processing: Big Data database technologies, cloud computing, machine learning, Internet of Things, artificial intelligence, Business Intelligence and other advanced data mining technologies. All these advanced data processing and analysis technologies can significantly change and facilitate the analysis of large statistical datasets in the future.
Do you agree with my opinion on this matter?
In view of the above, I am asking you the following question:
Will analytics based on data processing in Big Data database systems facilitate the analysis of statistical data?
Please reply
I invite you to discussion and scientific cooperation
Best wishes
Relevant answer
Answer
I am with Alexander & James on this. First, nearly all so-called big data is "it-is-what-it-is" data (i.e., the analyst has no control over how the data are collected; under these circumstances the selection process for the sample is non-random, the selection probabilities are typically unknowable, and one is often not even certain whether the sample comes from one unique population). Under such circumstances biases can arise and their effects are not measurable. For something like sales, big data can be highly applicable and population-specific because it represents a sample from one's particular population: your customers. Yet if it represents health records, one has to wonder whether the source of these records misses some parts of the population or over-represents sub-populations. I think solid "statistical thinking" can help big-data analysis more than the other way around.
  • asked a question related to Statistical Computing
Question
10 answers
Hello everyone :)
I am currently conducting a comprehensive meta-analysis on customer loyalty, with a large number of articles that use SEM to evaluate the strengths of the relationships between the different variables I am interested in (satisfaction, loyalty, trust, etc.).
I saw that, for most meta-analyses, the effect size metric is r. But since all my articles of interest use SEM, I could only extract the beta coefficients, t-values and p-values. Is it okay to use these kinds of metrics to conduct a meta-analysis?
I saw an article by Peterson (2005) explaining how to transform a beta into an r coefficient for the articles where r is not available. This is a start, but it does not give me a comprehensive method for conducting a meta-analysis only with SEM articles (what metrics should I code? what statistics should I compute? etc.).
My question is then: is it possible to conduct a meta-analysis with articles using SEM? If yes, do you have references explaining how to code the metrics and compute the statistics for the meta-analysis?
Thanks in advance for your help ! :)
Kathleen Desveaud
Relevant answer
Answer
1. You can use a structural equation approach to meta-analysis, as suggested by Gordon. I recommend another book by Mike Cheung: https://www.amazon.de/Meta-Analysis-Structural-Equation-Modeling-Approach/dp/1119993431. For this, you would use the individual correlation matrices of the primary studies to generate a meta-analytical variance-covariance matrix and run SEM on that. This method is preferable to first meta-analysing the bilateral relation and using the matrix of those relations as input to SEM.
2. You can use the output of SEM as you would treat any regression method for inclusion in meta-analysis. Hence, you can use the regression coefficients (or its derivative: partial correlation) and run the meta-analysis on that. Check the first question here: http://www.erim.eur.nl/research-facilities/meta-essentials/frequently-asked-questions/ (disclaimer: I wrote the answer). This is based on the work of Ariel Aloe and others: DOIs: 10.1080/00221309.2013.853021 and 10.3102/1076998610396901
Good luck!
Robert
PS: I think Martin Bjørn Stausholm misunderstood your question as referring to standard errors. His answer is absolutely correct (I think) but doesn't relate to structural equation modelling (SEM)
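As a concrete illustration of option 2 (meta-analysing one bilateral relation at a time), here is a hedged R sketch with the metafor package; the correlations and sample sizes below are made up for illustration.
library(metafor)
# Hypothetical per-study correlations (e.g., satisfaction-loyalty) and sample sizes
dat <- data.frame(ri = c(0.45, 0.52, 0.38, 0.60),
                  ni = c(120, 300, 85, 210))
# Fisher's z transformation, then a random-effects meta-analysis
dat <- escalc(measure = "ZCOR", ri = ri, ni = ni, data = dat)
res <- rma(yi, vi, data = dat)
predict(res, transf = transf.ztor)   # pooled estimate back-transformed to r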
  • asked a question related to Statistical Computing
Question
10 answers
I have 40 observations and 9 items. When I try to run factor analysis on SPSS I get an error message. "There are fewer than two cases, at least one of the variables has zero variance, there is only one variable in the analysis, or correlation coefficients could not be computed for all pairs of variables. No further statistics will be computed."
Is the problem inadequate sample size?
If I increase the sample size, will that solve it?
Thanks
Relevant answer
Answer
It means that all respondents gave the same answer to one of your items, so there is no variability in that item - for example, all respondents answered item 1 as 3. Check the item descriptive statistics and you will see that at least one item has a standard deviation of zero.
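A quick way to locate the offending item(s) in R (a minimal sketch; "items" stands for your 40 x 9 data frame):
# Standard deviation of every item; an SD of 0 (or NA) flags a constant item
sds <- sapply(items, sd, na.rm = TRUE)
sds[sds == 0 | is.na(sds)]
# Pairwise correlations fail for such items, which is what triggers the SPSS message
round(cor(items, use = "pairwise.complete.obs"), 2)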
  • asked a question related to Statistical Computing
Question
2 answers
Can the histon that is computed from the histogram refer to both the upper and lower approximations in rough entropy?
Relevant answer
Answer
  • asked a question related to Statistical Computing
Question
18 answers
What does the Sen's slope value indicate when performing the Mann-Kendall trend test using XLSTAT? Can anyone explain this value using the XLSTAT tutorial example?
Relevant answer
Answer
Putting it simply: when your trend analysis gives you a significant trend (positive or negative), Sen's slope captures the magnitude of that trend. Say your MK test revealed that temperature increased yearly between 1950 and 2000; Sen's slope will tell you, on average, how much the temperature changed each year.
Hope it helps you
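If you ever want to cross-check XLSTAT's output, here is a minimal sketch with the R package trend (the yearly series below is simulated):
library(trend)
set.seed(1)
temp <- ts(15 + 0.03 * (1:51) + rnorm(51, sd = 0.3), start = 1950)  # fake yearly series
mk.test(temp)       # Mann-Kendall test: is there a monotonic trend?
sens.slope(temp)    # Sen's slope: median change per year (magnitude of the trend)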
  • asked a question related to Statistical Computing
Question
4 answers
I've gathered some binary data and my observations look like this:
trt1 trt2 trt3 trt4
p1 1 1 0 1
p2 1 1 1 1
p3 1 1 1 1
p4 1 1 1 1
p5 1 1 1 1
p6 1 1 1 1
p7 1 1 0 1
p8 1 1 0 0
p9 1 1 1 1
When I tried to calculate the Phi coefficient between these 4 columns using SPSS, I ran into a problem: the software wouldn't calculate this coefficient for the first column, saying: "No statistics are computed because trt1 is a constant."
Can anyone help by suggesting another way to calculate some sort of correlation coefficient or by solving this?
Relevant answer
Answer
Try tetrachoric correlation coefficients; the psych package implements them.
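A minimal, hedged sketch with the psych package, using the binary data from the question; note that constant columns (trt1 and trt2 above are all 1s) are a problem for any correlation measure, so they are dropped first.
library(psych)
dat <- data.frame(
  trt1 = c(1,1,1,1,1,1,1,1,1),
  trt2 = c(1,1,1,1,1,1,1,1,1),
  trt3 = c(0,1,1,1,1,1,0,0,1),
  trt4 = c(1,1,1,1,1,1,1,0,1)
)
# Drop constant columns: they carry no information about association
dat2 <- dat[, sapply(dat, function(x) length(unique(x)) > 1)]
tetrachoric(dat2)$rho   # matrix of tetrachoric correlations for the remaining items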
  • asked a question related to Statistical Computing
Question
1 answer
I ran test-retest reliability using unweighted kappa for my questionnaire, and 2 of my items gave the result "no statistics are computed because constant". How do I interpret this message, and can I still use these items in my questionnaire? Is there any article to support this?
Relevant answer
Answer
If you have no variance in these items you will run into a number of problems (like missing correlations where they are theoretically expected). If you aim to use your questionnaire in a group similar to the one you got your test-retest data from, you should consider excluding these items. I would keep them only if they cover something very important that is not assessed by any other item, and if there is at least the possibility that a subject gives a different answer from all the others.
  • asked a question related to Statistical Computing
Question
1 answer
I ran test-retest reliability using unweighted kappa for my questionnaires, and 2 of my items gave the result "no statistics are computed because constant". How do I interpret this message, and can I still use these items in my questionnaire? Is there any article to support this?
Relevant answer
Answer
In my experience, this happens when you don't have variation in responses during either the test or retest.
For example, if you were calculating kappa for a yes/no question, and everyone answered yes during the test but a few reported no during the retest, you would not be able to calculate kappa.
You would be able to report % agreement and note that you can't report kappa because test or retest data are a constant.
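A small R sketch of that fallback (hypothetical yes/no ratings for one item):
# Hypothetical test-retest ratings for one yes/no item (1 = yes, 0 = no)
test   <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)   # everyone answered yes at time 1
retest <- c(1, 1, 1, 0, 1, 1, 1, 1, 0, 1)   # two respondents switched to no
mean(test == retest)   # percent agreement = 0.8, which you can still report
table(test, retest)    # the cross-table has an empty row because 'test' is constant,
                       # which is why kappa cannot be meaningfully computed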
  • asked a question related to Statistical Computing
Question
2 answers
The MKT Z statistic was computed for annual mean of pollutant concentration to see whether the concentration was increasing or decreasing in a time period of 20 years. The linear regression of annual mean concentration was computed for the same data (ppb/year). The linear regression slope is negative and MKT Z values are large and negative. How do I write about the comparison between the two?
Thanks
Relevant answer
Answer
Dear Basith,
In one of my papers (see at the end of this response) I clarified that
"Trend direction can be significant for a very small magnitude of increase or decrease which may be unimportant in practice. On the contrary, Trend direction may be insignificant for change whose magnitude must not be ignored in environmental management decision-making".
To reliably interpret the results of the trend analysis, you need answers to the following questions:
  1. is the computed and population trend slopes significantly different?
  2. is the non-zero magnitude of the trend significant?
  3. is the trend direction significant?
  4. how strongly are the sub-trends varying in the series? here you can have strong sub-trends yet the overall (or long-term) trend from the full time series is insignificant
  5. what are the possible bottlenecks or sources of uncertainty on the trend results?
For answers to the above questions, as well as useful information on how to interpret your trend results, you may find the following papers relevant to your case.
best wishes
  • asked a question related to Statistical Computing
Question
1 answer
From our statistical analyses with R and Genstat, the outputs for the means are different for the same values and treatments analyzed. For instance, the means for A, B, C, D with Genstat are 32.65, 4.69, 18.23 and 48.96 respectively, while for the same variables R gives A, B, C, D as 34.17, 4.13, 18.23 and 42.68. What could be the reason for the disparity in the means?
Relevant answer
Answer
Can you give a simple, reproducible example?
  • asked a question related to Statistical Computing
Question
4 answers
I'm trying to match gene IDs from one database with the GO IDs in another database; the second database is longer than the first one. Why do I get this error? Error in 1:longitud[j] : NA/NaN argument
Here is the script
d1<-as.matrix(datos)
longitud1=numeric(0)
for(i in 1:length(datos$Datos1)){
longitud1[i]=length(which(datos1$Cod1==d1[i]))}
longitud=longitud1[-which(longitud==0)]
i=1
mat1=matrix(0,sum(longitud),2)
for(j in 1:length(longitud)){
for(k in 1:longitud[j]){
mat1[i,]=as.matrix(datos1[which(datos1$Cod1==d1[j]),][k,])
i=i+1
}}
Relevant answer
Answer
If you do not post a reproducible example (i.e., part of the data) it is hard to help. In general, I discourage using nested loops in R.
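In that spirit, here is a hedged sketch of how the same gene-ID-to-GO-ID matching can usually be done without loops, using merge(); the objects datos and datos1 and their columns follow the names in the question, but their exact structure is assumed.
# Assumption: datos$Datos1 holds the gene IDs, datos1$Cod1 the gene IDs with GO terms.
# merge() returns one row per matching pair, i.e., what the nested loops build into mat1.
matched <- merge(datos, datos1, by.x = "Datos1", by.y = "Cod1")
head(matched)
# If you only need how many GO entries each gene has:
table(datos1$Cod1[datos1$Cod1 %in% datos$Datos1])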
  • asked a question related to Statistical Computing
Question
12 answers
I have generated a number of data sets that follow a specific distribution. They represent different layers of a certain system, but they should be correlated with a predetermined correlation coefficient without affecting their statistical distributions.
Hence, I am thinking of arranging every two subsequent data sets so as to impose their corresponding correlation parameter. Any ideas on how to do that, especially in R, will be appreciated.
Relevant answer
Answer
Dear Timothy
It is only about illustrating the correlation, so the extreme values should be of minor importance (anyway, in the case of extreme values, correlation is not really appropriate).
For other purposes, rerandomisation might be better than the bootstrap, even though I think that both methods are relatively poor with rare events, as such events usually are not represented in the original sample and so will not show up in either the bootstrap distribution or the rerandomisation.
Hiba, you should use R to do either the bootstrap or the rerandomisation.
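One concrete way to impose a target correlation between two data sets without changing either marginal distribution is a rank-reordering (Iman-Conover-style) trick; here is a hedged R sketch with two arbitrary marginals standing in for your layers.
library(MASS)  # for mvrnorm
set.seed(1)
n   <- 1000
x   <- rgamma(n, shape = 2)       # first layer, some non-normal marginal
y   <- rweibull(n, shape = 1.5)   # second layer, another marginal
rho <- 0.7                        # target (rank) correlation
# Generate correlated normals, then reorder x and y to follow their rank pattern
z  <- mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), 2, 2))
x2 <- sort(x)[rank(z[, 1])]
y2 <- sort(y)[rank(z[, 2])]
cor(x2, y2, method = "spearman")  # close to the target
all.equal(sort(x), sort(x2))      # marginal distribution of x is untouched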
  • asked a question related to Statistical Computing
Question
3 answers
Are there any statistical tests (parametric or non-parametric) that can be applied to test the goodness of fit of a candidate probability distribution (other than normal) estimated from autocorrelated data?
Relevant answer
Answer
Dear Vasana,
As far as I know this is a problem with no general solution that works in every situation. Given multivariate observations, I take it that you are asking for a GOF test for the marginal distribution, correct? There are several results for normality in the time-series framework. In general, to construct a GOF test it is necessary to specify the marginal distribution (Gamma, Weibull, Poisson, Binomial, ...) and the correlational model (AR(1), ARMA(p,q), ...) as well. For instance, in the following paper you can find an extension of the classical Fisher dispersion test for the Poisson distribution with an AR(1) structure (INAR(1) model):
Schweer S, Weiß CH (2014) Compound Poisson INAR(1) processes: stochastic properties and testing for overdispersion. Comput Stat Data Anal 77:267–284
Best regards.
  • asked a question related to Statistical Computing
Question
14 answers
I need to compare average runtime execution of two different C++ algorithms. The question is: how many times do I need to repeat the experiment in order to calculate the average runtime for each algorithm? 10? 100 times? Can you provide me with a paper regarding this issue? Thanks in advance!
Relevant answer
Answer
Standard statistics. It depends on what degree of variability you can tolerate. The standard error of the mean is proportional to 1/sqrt(n), where n is the number of runs. So, if the run-to-run standard deviation is roughly as large as the mean itself, a 10% standard error (i.e., within about 10% of the true value) will require on the order of 100 experiments; for 1%, about 10,000 measurements. People never like the conclusions, but the math is simple.
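A hedged sketch of the standard calculation in R: estimate the run-to-run standard deviation from a pilot set of timings, then solve for n from the requirement that the approximate half-width of a confidence interval for the mean be a given fraction of the mean.
set.seed(1)
pilot <- rnorm(30, mean = 120, sd = 25)   # pilot runtimes in ms (simulated here)
rel_err <- 0.05                           # want the mean within +/- 5%
z       <- qnorm(0.975)                   # ~1.96 for a 95% interval
n_needed <- (z * sd(pilot) / (rel_err * mean(pilot)))^2
ceiling(n_needed)                         # number of runs needed per algorithm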
  • asked a question related to Statistical Computing
Question
10 answers
I taught an introductory statistics class and one of the subtopics involved confidence intervals (CI). I decided to focus my examples on the meaning and applications of CIs. For example, students were involved in obtaining the confidence interval for the population mean from a random sample of data that they themselves collected. I discovered from students' work that interpreting the meaning of a confidence interval was truly challenging for many. How have others approached this topic in introductory statistics?
Relevant answer
Answer
Stephen Politzer-Ahles has it right.
There is a subtle difference between a confidence interval and its associated random interval. For example, a 95% confidence interval for the mean is one instance of an associated random interval. It is the random interval that has a 95% chance of containing the mean, not the confidence interval.
Both the mean and a confidence interval are fixed.  They do not vary. Either the confidence interval contains the mean or it does not.  There is no probability that can be logically assigned to it.  On the other hand, there can be a probability assigned to a random interval containing the mean.
As explained below, I believe that the issue comes down to understanding the difference between the "mean" \mu and the "sample mean" \bar{X}.  The sample mean is a random variable, while the mean is a fixed number.
For a known variance \sigma^2, and a normally distributed random variable X, the random interval associated with a confidence interval for the mean is [\bar{X} - z\sigma/sqrt{n},\bar{X} + z\sigma/sqrt{n}], where z is a fixed number that depends on the confidence level (e.g. 95%) and n is the sample size. This interval is random because it contains the random variable \bar{X}.  Once you assign an instance of \bar{X}, the interval is no longer random.  It either contains \mu or it does not.
For teaching confidence intervals in an introductory statistics class, one might consider the following outline.
(1) Explain what a random variable is, (2) Explain what a random sample is, including what IID means, (3) Define \bar{X} in terms of a random sample (4) Explain the difference between \mu and \bar{X} (\bar{X} estimates \mu, \bar{X} is random, \mu is not). (5) Explain what a confidence interval is (for the mean).
There are of course several steps in between to explain, but these are essential from my perspective.  It may be that doing something less will lead to the sorts of confusion already mentioned.  On the other hand, it may be too much for an introductory class, depending on the ability of the students involved.  Ultimately, it's a judgement call.  
By something less, I mean the approach taken in the following reference:
Orkin, M. and Drogin R., "Vital Statistics", McGraw-Hill, 1975.
The approach I've suggested above is more in line with the following:
Meyer, Paul L., "Introductory Probability and Statistical Applications", Addison-Wesley, 1970.
Both are introductory, but assume different levels of mathematical experience and/or ability; the latter being at a higher level.
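One classroom device that makes the "random interval" idea tangible is a coverage simulation; here is a minimal R sketch that draws many samples and counts how often the 95% interval for the mean contains the true mu.
set.seed(1)
mu <- 10; sigma <- 2; n <- 25; reps <- 10000
covered <- replicate(reps, {
  x  <- rnorm(n, mu, sigma)
  ci <- mean(x) + c(-1, 1) * qt(0.975, n - 1) * sd(x) / sqrt(n)
  ci[1] <= mu && mu <= ci[2]
})
mean(covered)   # close to 0.95: 95% of the *random intervals* capture mu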
  • asked a question related to Statistical Computing
Question
8 answers
Dear all, I would like to run a spatial autocorrelation analysis on my data in R (or other software such as Minitab, Past or Python). My data comprise 100 1 m2 plots, each with a paired control plot 1 m away from the treatment plot. In all plots I measured plant cover, and I want to measure species co-occurrence in each plot. All plots are georeferenced with latitude and longitude in degrees, minutes and seconds. I want to know if there is autocorrelation in my sampling. Can someone help me?
Best wishes,
Jhonny
Relevant answer
Answer
Hi Jhonny.
You can fit linear models with correlation structures for the error using package nlme (https://cran.r-project.org/web/packages/nlme/nlme.pdf). There is an argument `correlation` in the `lme` function to model spatial correlation. Also the function `Variogram` is used to compute the semi-variogram. Argument `form = ~ x + y` represents a two-dimensional position vector with coordinates x and y, which I think is your case.
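As a hedged sketch of what that might look like in practice (the data frame veg and the variables cover, treatment, x, y are placeholders for your own data, and the lat/long coordinates should first be converted from degrees-minutes-seconds to decimal or metric units); gls() takes the same correlation argument as lme().
library(nlme)
# Model with spatially correlated errors (exponential correlation in x, y)
fit <- gls(cover ~ treatment,
           correlation = corExp(form = ~ x + y),
           data = veg)
# Empirical semi-variogram of the residuals: a flat variogram suggests little autocorrelation
plot(Variogram(fit, form = ~ x + y))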
  • asked a question related to Statistical Computing
Question
5 answers
Hi all,
I am working with the R package 'quint' to test for qualitative treatment-subgroup interactions (personalized medicine). Every time I analyze my data there are some warnings which I cannot handle. All the warnings are of the same sort; the main problem is that I do not understand what the warning means exactly:
Warning messages:
1: In computeD(tmat[kk, 1], tmat[kk, 2], tmat[kk, 3], tmat[kk, 4], :
value out of range in 'gammafn'
Is anybody well versed in this R package, or is it perhaps a general warning that you have been confronted with in another context? I do not know what 'gammafn' could be or why the value is out of range.
I appreciate any comments and ideas!
Best,
Brian
Relevant answer
Answer
The warning you are facing is related to the values passed to R's gamma function not fitting its requirements: the function is being evaluated at arguments for which the result is out of range (it overflows). Maybe you need to change the method or function you are using for the analysis, or transform your data so that the values handed to the gamma function stay within range.
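For what it's worth, 'gammafn' is the C-level gamma function inside R, and the warning can be reproduced directly; it typically means the gamma function was evaluated at an argument so large that the result overflows.
gamma(200)     # Inf, with the warning: value out of range in 'gammafn'
lgamma(200)    # working on the log scale avoids the overflow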
  • asked a question related to Statistical Computing
Question
5 answers
Independent component analysis (ICA) is a statistical and computational technique for revealing hidden factors that underlie sets of random variables, measurements, or signals.
Relevant answer
ICA is usually applied in ECG signal processing for denoising, artifact removal and source separation (e.g. fetal ECG, in which it can be used to separate the mother's signal from the fetus's).
You can find more information in the book Advances in Electrocardiograms - Methods and Analysis, Chapter 19: Independent Component Analysis in ECG Signal Processing.
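As a toy illustration of that idea, here is a hedged R sketch with synthetic signals rather than real ECG leads:
library(fastICA)
set.seed(1)
t  <- seq(0, 1, length.out = 1000)
s1 <- sin(2 * pi * 7 * t)                  # "maternal" source (toy signal)
s2 <- sign(sin(2 * pi * 1.3 * t))          # "fetal"/artifact source (toy signal)
S  <- cbind(s1, s2)
A  <- matrix(c(0.8, 0.3, 0.4, 0.9), 2, 2)  # unknown mixing matrix
X  <- S %*% A                              # two observed "leads" = mixtures
ica <- fastICA(X, n.comp = 2)
matplot(t, ica$S, type = "l")              # recovered sources (up to sign/scale/order)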
  • asked a question related to Statistical Computing
Question
11 answers
I need to perform an a priori power analysis for:
1) a MANOVA with 1 IV with 3 levels and 4 DVs
2) a MANOVA with 2 IVs (1 with 2 levels, 1 with 3 levels) and 4 DVs
(only global effects, not interactions)
Can anyone help me calculate the sample size for a .5 effect size and alpha = .05?
Thank you!
Relevant answer
Answer
To do a power analysis to estimate your sample size, you have to write your hypothesis and, based on that, decide which statistical test you will use (one of the inferential statistics). You then need to determine the following: alpha (conventionally .05), power (conventionally .80), and effect size (small, moderate, or large; each test has its own values, which you can find online). Then download a free program to calculate the sample size, such as G*Power.
  • asked a question related to Statistical Computing
Question
3 answers
I am looking for a script to calculate average wind direction. I am wondering if someone has it already. 
Relevant answer
Answer
  • openair package
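If you prefer a few lines of base R over a package, the standard vector-averaging approach looks like this (directions in degrees, "from" convention assumed; supply wind speed as a weight if you want a speed-weighted mean):
mean_wind_dir <- function(dir_deg, speed = 1) {
  # Average the unit (or speed-weighted) vectors, then convert back to degrees
  rad <- dir_deg * pi / 180
  u   <- mean(speed * sin(rad))
  v   <- mean(speed * cos(rad))
  (atan2(u, v) * 180 / pi) %% 360
}
mean_wind_dir(c(350, 10))        # 0: correctly averages across the 360/0 boundary
mean_wind_dir(c(45, 90, 135))    # 90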
  • asked a question related to Statistical Computing
Question
5 answers
My experimental design has two factors (the dosage of a drug x the gender of the animal subject). The subjects were fed with the drug for a long period of time, and we divided the experimental period into three intervals. I randomly choose some of my subjects for collecting data at the end of each interval.
However, it turns out that the genders of these subjects are not balanced across the dosage treatments (because it was hard to tell the gender from appearance). Some treatments even have no replication (I picked 8 animals per dosage treatment but only one was female). I wonder how to do an ANOVA on this.
I've searched a relevant website as below--
I'm not sure whether it suits my situation, and I'm figuring out how to make some modifications as I calculate my data.
Any advice and suggestions will be greatly appreciated.
Relevant answer
Answer
Dear Bruce,
Thank you for your useful suggestions! I'll try that and see if it's suitable.
  • asked a question related to Statistical Computing
Question
11 answers
Hi. I am studying for my master's degree and I'm currently working on my dissertation.
I have a question. I think I will use ordinal regression; however, I'm struggling with how to report this type of regression. I used ordinal data as the dependent variable and scale data as the independent variables. In SPSS, I entered all independent variables as covariates, and I am not quite sure what exactly I should report. I saw many examples on websites, but most of them use ordinal variables as independent variables.
So can anyone help me or explain how to report this type of regression, or point me to a textbook or journal article that explains this?
Thank you.
Relevant answer
Answer
Logistic regression analysis is commonly used when the outcome is categorical. By using the natural log of the odds of the outcome as the dependent variable, we usually examine the odds of an outcome occurring or not, and the relationships are linearized similarly to multiple linear regression (Hosmer, 1989). When the outcome is not dichotomous, we should distinguish the outcome type first, which is often ignored by some investigators. There are two kinds of logistic modeling: multinomial logistic regression handles a multi-way categorical dependent variable (Friedman, 2010), and ordered logistic regression handles ordinal dependent variables (Simon, 2014). The predictors can be continuous or categorical for both types of logistic regression. A nominal outcome with k levels gives k-1 response functions, hence k-1 intercepts, k-1 parameters for each continuous predictor, and (k-1)*(m-1) parameters for each m-level categorical predictor. By treating a categorical predictor as ordinal and assigning reasonable scores, we can replace the m-1 parameters with a single parameter and so increase testing power.
 Multinomial logistic model
The data set in the following example contains the results of a hypothetical user test of three brands of computer games. Users rated the games on a five-point scale from very good (vg) to very bad (vb). The analysis is performed to estimate the differences in the ratings of the three games. The variable score contains the rating, the variable game contains the games tested, and the variable count contains the number of testers giving each game each rating. Baseline-category logits are used to define the regression functions: using one level as the reference, log(p_ij/p_im) is the outcome function. See the following SAS program.
There are 4 outcome functions, the log odds relative to the score "very bad" (reference group):
ln(p(vg)/p(vb)), ln(p(g)/p(vb)), ln(p(m)/p(vb)), ln(p(b)/p(vb)). The SAS coding is:
Data Compgame;
input count game$ score$ @@;
datalines;
70 game1 vg 71 game1 g 151 game1 m 60 game1 b 46 game1 vb
20 game2 vg 36 game2 g 130 game2 m 80 game2 b 70 game2 vb
50 game3 vg 55 game3 g 140 game3 m 72 game3 b 50 game3 vb
;
Proc logistic data=Compgame rorder=data; /* rorder keeps the outcome categories in the same order as in the data: vg, g, m, b, vb */
freq count;
class game /param=glm;
model score = game /link=glogit;
run;
Table 10 summarizes the model fitting and estimating results.
Table 10: Multinomial logistic regression*

Model fit statistics (as reported by PROC LOGISTIC, intercept only / with covariates): AIC 3356 / 3323; SC 3376 / 3383; -2 Log L 3348 / 3299

Parameter   score   Estimate   Std. error   P value   OR      95% CI
Intercept   vg       0.000     0.200        1.000
Intercept   g        0.095     0.195        0.626
Intercept   m        1.030     0.165        <.0001
Intercept   b        0.365     0.184        0.048
game1       vg       0.420     0.276        0.128     1.522   0.886 - 2.612
game1       g        0.339     0.272        0.213     1.403   0.823 - 2.391
game1       m        0.159     0.236        0.500     1.172   0.739 - 1.860
game1       b       -0.099     0.269        0.713     0.906   0.535 - 1.534
game2       vg      -1.253     0.323        0.000     0.286   0.152 - 0.538
game2       g       -0.760     0.283        0.007     0.468   0.268 - 0.815
game2       m       -0.411     0.222        0.064     0.663   0.430 - 1.024
game2       b       -0.182     0.245        0.457     0.833   0.515 - 1.347

* The reference group is game3 and score = vb.
Reading the results:
1. There are 4 intercepts for the four baseline logit outcomes, the first being for the first logit, ln(p(vg)/p(vb)), which is essentially 0 (-1.25E-13). It is the log odds for Game 3 (see the raw data: for Game 3, vg = vb = 50, hence ln(p1/p5) = 0). Similarly, all intercepts are log odds relative to p5 for Game 3.
2. There are 8 regression coefficients β, which are the differences in the four logits between Game 1 or Game 2 and Game 3, respectively. The first β of Game 1, for vg, is 0.42, which is the logit difference between Game 1 and Game 3 for the response function ln(p(vg)/p(vb)). It is positive, meaning that the evaluation of Game 1 is better than that of Game 3, with more "very good" evaluations received. And exp(0.4199) = 1.522 is the odds ratio between Game 1 and Game 3 for comparing the very good score with the very bad score: the odds of rating Game 1 very good rather than very bad are roughly 52% higher than for Game 3. However, the difference is not significant; the p-value is only about 0.13.
3. The interpretation is similar for the others. For Game 2, the odds of receiving a vg versus a vb score are significantly lower than for Game 3, with an odds ratio of 0.286; the very good rating for Game 2 is much lower compared with Game 3.
If we only use score = vg and score = vb, the outcome is binary; using logistic regression we get the same results as the multinomial logistic regression for the response function log(p(vg)/(1 - p(vg))) = log(p(vg)/p(vb)). The parameter estimate for Game 1 is 0.4199 and the odds ratio of vg to vb is 1.52. Hence, for these data, the multinomial model is just the dichotomous logistic regressions combined. However, if there are other cofactors in the model, the model will be more complex; if the model allows different intercepts and different slopes, the combined multinomial logistic model and the binary logistic models will give the same results.
Ordered logistic model:
Since the score measurement is clearly ordered, let p1, p2, p3, p4, p5 be the probabilities of being rated vg, g, m, b and vb. The cumulative logit functions are defined as follows and are assumed to be proportional:
Response function                                     Assigned score
Very good: ln(p1/(p2 + p3 + p4 + p5))                 0
      good: ln((p1 + p2)/(p3 + p4 + p5))              1
   medium: ln((p1 + p2 + p3)/(p4 + p5))               2
        bad: ln((p1 + p2 + p3 + p4)/p5)               3
The regression equation (proportional odds model) is
     ln((p1 + ... + pj)/(p(j+1) + ... + p5)) = alpha_j + beta_game,  j = 1, ..., 4     (2.5.2)
The SAS coding is
proc logistic  data=Compgame  rorder=data; /* rorder function assigns the order of the four outcome functions as the order of the data, the scores are 0, 1, 2 and 3*/
freq count;
class game /param=glm;
model score = game/link=clogit ; /* clogit performs the ordinal (cumulative logit) logistic regression, assuming the increase across the four logits is constant (proportional odds) */
run;
The parameter estimates are listed in Table 11:
Table 11: Analysis of Maximum Likelihood Estimates

Parameter          DF   Estimate   Std. Error   Wald Chi-Square   Pr > ChiSq
Intercept  vg       1   -1.9087    0.1189       257.5924          <.0001
Intercept  g        1   -0.9356    0.1022        83.7943          <.0001
Intercept  m        1    0.7305    0.1004        52.9434          <.0001
Intercept  b        1    1.8493    0.1162       253.4371          <.0001
game       game1    1    0.3098    0.1307         5.6179           0.0178
game       game2    1   -0.5748    0.1365        17.7412          <.0001
game       game3    0    0.0000    .              .                .
It shows that, assuming the odds to be proportional, clogit uses the cumulative odds: the game effect on the assigned score is modelled with the four cumulative logits treated as equally spaced, with scores ranging from 0 to 3. The slope estimate for the game factor is the change in the logarithm of the cumulative odds relative to Game 3. For Game 1 the slope is 0.3098, which implies the counts are distributed differently from Game 3; the logarithm of the cumulative odds ratio is 0.3098 for the change from one category to the next, with vb as the reference level. Hence there are more good ratings for Game 1 compared with Game 3. For Game 2 the slope is -0.5748, hence there are more bad ratings for Game 2 compared with Game 3.
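For readers working in R rather than SAS, roughly the same proportional-odds model can be fitted with MASS::polr; here is a hedged sketch re-using the counts from the SAS example above. Note that polr and PROC LOGISTIC use different sign conventions and reference levels, so the estimates are comparable only up to parameterization.
library(MASS)
compgame <- expand.grid(score = c("vg", "g", "m", "b", "vb"),
                        game  = c("game1", "game2", "game3"))
compgame$count <- c(70, 71, 151, 60, 46,
                    20, 36, 130, 80, 70,
                    50, 55, 140, 72, 50)
compgame$score <- factor(compgame$score,
                         levels = c("vg", "g", "m", "b", "vb"), ordered = TRUE)
fit <- polr(score ~ game, weights = count, data = compgame, Hess = TRUE)
summary(fit)   # cumulative-logit (proportional odds) model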
  • asked a question related to Statistical Computing
Question
5 answers
I'm getting an error : "Error in if (colnames(tm.class)[j] == "fixed") tm.final[i, j] = 0 :
missing value where TRUE/FALSE needed". All I tried to do was a simple snk.test(lm(values ~ factor1*factor2)), and the estimates function keeps returning this error. I'm not sure what tm.class is, but I have no idea why the column names seem to be NA for whatever the estimates function is testing.
Relevant answer
Answer
I had the same problem - I had to make sure that the factor was fixed and that seemed to solve it! 
  • asked a question related to Statistical Computing
Question
3 answers
Dear SAS users,
I want to perform Barnard's exact test for a 2x2 table, which is described as an option in the EXACT statement of PROC FREQ. While I can use all other options for the EXACT statement, the option BARNARD seems to be unavailable.
I use SAS 9.3 for Windows. Is anybody aware of that problem?
I know about the controversies on the use of Barnard's test (or alternatives) as compared to Fisher's exact test and try to avoid a discussion on that issue here. I have to use Barnard's test for one project.
Thank you in advance!
Relevant answer
Answer
Try SAS 9.1, as recommended in an article by John Ludbrook that I read earlier: "Analysis of 2 x 2 tables of frequencies: matching test to experimental design", International Journal of Epidemiology 2008;37:1430-1435, doi:10.1093/ije/dyn162 (published by Oxford University Press on behalf of the International Epidemiological Association; advance access publication 18 August 2008).
  • asked a question related to Statistical Computing
Question
7 answers
Hi,
I want to ask for suggestions of suitable software, besides SPSS, to run multinomial and mixed logit (statistical) models.
Thanks
Relevant answer
Answer
R
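To make the "R" answer a bit more concrete, a hedged sketch: nnet::multinom fits a multinomial logit, and the mlogit package (not shown) is a common choice for mixed logit. The data below are just a built-in example, not a claim about the asker's data.
library(nnet)
# Multinomial logit: 3-category outcome (iris species) on two predictors
fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris, trace = FALSE)
summary(fit)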
  • asked a question related to Statistical Computing
Question
2 answers
I used GDP data to find structural breaks in the series, i.e., whether there is any significant change in the pattern of the series. I used the 'breakpoints' command and found some breakpoints.
My question is: how does it work? Is it based on regressing the series on itself?
Relevant answer
Answer
Function "breakpoints" in package strucchange is based on piecewise linear models. It uses dynamic programming to find breakpoints that minimize residual sum of squares (RSS) of a linear model with m + 1 segments. The Bayesian Information criterion (BIC) is used to find an optimal model as compromise between RSS and number of parameters.
The linear model can be quite general and may include covariates. A parameter "h" can be set to determine minimal segment size. This is especially important for relatively small samples.
More can be found:
- in the very well readable book: Kleiber, C. & Zeileis, A. Applied Econometrics with R Springer, 2008,
- in several papers of Prof. Achim Zeileis and
- in additional material found at the homepage of the author: http://eeecon.uibk.ac.at/~zeileis/
It's a really smart and powerful technique -- I like it :)
Thomas
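A minimal sketch of the mechanics Thomas describes (simulated series rather than GDP data): fit breakpoints for a constant-mean model and let BIC choose the number of breaks.
library(strucchange)
set.seed(1)
# Simulated series with a mean shift after observation 60
y <- ts(c(rnorm(60, mean = 0), rnorm(40, mean = 2)))
bp <- breakpoints(y ~ 1, h = 0.15)   # piecewise-constant model, min segment = 15% of n
summary(bp)                          # RSS and BIC for 0, 1, 2, ... breaks
plot(bp)                             # BIC curve; its minimum picks the optimal number of breaks
breakpoints(bp)                      # breakpoints of the BIC-optimal segmentation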
  • asked a question related to Statistical Computing
Question
2 answers
I'm trying to run the additive macro (for additive hazards models) written by Alicia Howell and John Klein, but it takes a very long time to run. I'm using SAS 9.3. I always break it off after an hour without getting any output. I followed all the steps as per Alicia Howell's paper. It does not show any errors; I just don't get any output. I'm not sure whether to leave it running overnight.
Relevant answer
Answer
Thanks Ellen Hertzmark, I have 642 observations with only 3 variables. I will try the steps you have mentioned and see what happens. Funnily enough, if I leave out a semicolon it quickly gives me an error; I was hoping the output would be that quick.
  • asked a question related to Statistical Computing
Question
3 answers
I am working on prediction of hatchability in chickens using egg quality traits. I need statistical software applications for efficient data analysis. Please list out some good applications for genetic and breeding studies in chicken.
Relevant answer
Answer
We need more information on your data and experimental design to give a meaningful answer. Either post information about the design (and how you measure hatchability) or email me, and I will respond. Good luck :)
  • asked a question related to Statistical Computing
Question
7 answers
When I run the Monte Carlo simulation it shows a warning: "Warning from spectre during Monte Carlo analysis `mc1'.
mc1: Attempt to run Monte Carlo analysis with process and mismatch variations, but no process variations were specified in statistics block." How can one specify the statistics block for a Monte Carlo simulation?
Relevant answer
Answer
I have required the "standard deviation" (strictly, the rms) of the parameters I am evaluating to be less than 20% for a single "run."
  • asked a question related to Statistical Computing
Question
5 answers
Hi everybody,
I have a question about predict (raster package) with gam.
This is my script:
pred.data <- brick (sst, par)
gamCrus <- gam(logCrus~s(sst)+s(par), family=gaussian(), data=zoo.data)
predCrus <- predict(pred.data, gamCrus)
I have an error message:
> predCrus <- predict(pred.data, gamCrus)
Error in model.frame.default(ff, data = newdata, na.action = na.act) :
object is not a matrix
In addition: Warning messages:
1: In predict.gam(model, blockvals, ...) :
not all required variables have been supplied in newdata!
Relevant answer
Answer
You basically need two arguments: the model you calibrated with some data, and the new data for which you want predicted values:
predict(your previously fitted model, the new data you want to predict)
This site may also help you:
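One specific thing worth checking for the error above (a hedged sketch, reusing the object names from the question): raster::predict() takes the Raster* object first and the model second, and it builds newdata from the layer names, so those names must match the variables in the gam formula (sst and par here).
library(raster)
pred.data <- brick(sst, par)              # as in the question
names(pred.data) <- c("sst", "par")       # layer names must equal the model terms
predCrus <- predict(pred.data, gamCrus)   # raster first, fitted model second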
  • asked a question related to Statistical Computing
Question
2 answers
spatial analysis, spatial econometrics, spatial statistics, computer or statistics
Relevant answer
Answer
You can use the spatstat package (http://www.spatstat.org/spatstat/), which offers the possibility to estimate Ripley's K function. Spatstat also seems to work quite well for large data sets. You might have a look at the function Kest().
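A minimal sketch of that suggestion (toy random points in a unit square; replace them with your own coordinates and observation window):
library(spatstat)
set.seed(1)
X <- ppp(runif(200), runif(200), window = owin(c(0, 1), c(0, 1)))  # toy point pattern
K <- Kest(X)      # Ripley's K with edge corrections
plot(K)           # compare the estimates against the CSR benchmark pi * r^2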
  • asked a question related to Statistical Computing
Question
6 answers
I usually fit curves in Matlab using the "fminsearch" function, which is a really useful and powerful function.
As I'm currently using R more than Matlab, I wonder if that kind of function or script exists in R.
It would be perfect if you could provide an example both in Matlab and in R.
Regards
Relevant answer
Answer
See the pracma package and the note at:
The package neldermead may also have an implementation.
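The closest base-R equivalent to Matlab's fminsearch is optim() with its default Nelder-Mead method; here is a small side-by-side-style sketch (the Matlab call is shown only as a comment, and the toy least-squares fit is my own example).
# Matlab:  p = fminsearch(@(p) sum((y - p(1)*exp(-p(2)*x)).^2), [1 0.1])
# R equivalent with optim (Nelder-Mead is the default method)
set.seed(1)
x <- seq(0, 10, by = 0.5)
y <- 3 * exp(-0.4 * x) + rnorm(length(x), sd = 0.05)
sse <- function(p) sum((y - p[1] * exp(-p[2] * x))^2)
fit <- optim(c(1, 0.1), sse)          # method = "Nelder-Mead" by default
fit$par                               # estimates close to c(3, 0.4)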
  • asked a question related to Statistical Computing
Question
3 answers
I am trying to resolve a problem with count data. At the beginning, I fitted a Poisson regression model; however, I found underdispersion in my model.
I then tried to use a restricted generalized Poisson regression model, but I ran into problems with the SAS code. Can anyone propose a suitable SAS procedure in this case?
Relevant answer
Answer
Another option might be to use a generalized Poisson distribution.  I know this can be done in STATA, but I don't know if/where one would do it in SAS.
Or, you could transform your data monotonically - say, via square root or log function.  This would be workable, since your minimum is >0 on the original scale.
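If switching software is an option, one simple (and different) way to handle underdispersion is a quasi-Poisson fit in R, which keeps the Poisson mean structure but estimates the dispersion when computing standard errors; this is not the restricted generalized Poisson model asked about, just a pragmatic alternative.
set.seed(1)
d <- data.frame(x = runif(100))
d$y <- rbinom(100, size = 8, prob = plogis(-1 + d$x))   # counts with variance below the mean
fit <- glm(y ~ x, family = quasipoisson, data = d)
summary(fit)$dispersion   # < 1 indicates underdispersion; SEs are scaled accordingly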
  • asked a question related to Statistical Computing
Question
7 answers
I have my own function and I need to generate 10,000 random values from it, similar to the Rayleigh fading values generated using h=randn(N,1)+i*rand(N,1), which produces N random values. My function takes three input arguments whose values differ from one scenario to another. As I understand it, I can fix the three argument values, evaluate my function, and get one value; the output values of my function are complex. Do I need to compute the variance and average of my function values and then loop over values in between to get 10,000 values in an array? Please tell me whether this is the right way or what I should do, and how to write the code in Matlab.
Relevant answer
Answer
Usually, you would take a random number generator (RNG) that provides uniformly distributed values between 0 and 1. Here's how the mapping is generally done: first, from your probability density function (PDF) you generate the cumulative distribution function (CDF) through integration over x. The CDF can always be inverted (using a generalized inverse where it is flat). Now, you can pick any random number from the uniform distribution and look up the corresponding x-value through the inverse CDF. This x-value will be a random number from your PDF. Sometimes the CDF cannot be determined by a formula, so you'll have to find a numerical approach. With the right keywords you should be able to find papers on this.
Here's a link on what Wikipedia has to say on this:
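A minimal sketch of this inverse-CDF mapping in R (the same few lines carry over to MATLAB). It uses the exponential distribution as a stand-in because its CDF F(x) = 1 - exp(-lambda*x) inverts in closed form; for your own PDF you would replace the last mapping line with your (possibly numerical) inverse CDF:
lambda <- 2
u <- runif(10000)             # uniform draws on (0, 1)
x <- -log(1 - u) / lambda     # mapped through the inverse CDF of the exponential
hist(x, breaks = 50)          # matches the target density dexp(x, rate = lambda)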
  • asked a question related to Statistical Computing
Question
6 answers
Applications of the different types/theories of entropy, and the best entropy method used in image processing.
Relevant answer
Answer
1872 – Ludwig Boltzmann presents his H-theorem, and with it the formula Σ p_i log p_i for the entropy of a single gas particle.
1878 – J. Willard Gibbs defines the Gibbs entropy: the probabilities in the entropy formula are now taken as probabilities of the state of the whole system.
1924 – Harry Nyquist discusses quantifying "intelligence" and the speed at which it can be transmitted by a communication system.
1927 – John von Neumann defines the von Neumann entropy, extending the Gibbs entropy to quantum mechanics.
1928 – Ralph Hartley introduces Hartley information as the logarithm of the number of possible messages, with information being communicated when the receiver can distinguish one sequence of symbols from any other (regardless of any associated meaning).
1929 – Leó Szilárd analyses Maxwell's Demon, showing how a Szilard engine can sometimes transform information into the extraction of useful work.
1940 – Alan Turing introduces the deciban as a measure of information inferred about the German Enigma machine cypher settings by the Banburismus process.
1944 – Claude Shannon's theory of information is substantially complete.
1947 – Richard W. Hamming invents Hamming codes for error detection and correction. For patent reasons, the result is not published until 1950.
1948 – Claude E. Shannon publishes A Mathematical Theory of Communication.
1949 – Claude E. Shannon publishes Communication in the Presence of Noise – the Nyquist–Shannon sampling theorem and the Shannon–Hartley law.
1949 – Claude E. Shannon's Communication Theory of Secrecy Systems is declassified.
1949 – Robert M. Fano publishes Transmission of Information (M.I.T. Press, Cambridge, Mass.) – Shannon–Fano coding.
1949 – Leon G. Kraft discovers Kraft's inequality, which shows the limits of prefix codes.
1949 – Marcel J. E. Golay introduces Golay codes for forward error correction.
1951 – Solomon Kullback and Richard Leibler introduce the Kullback–Leibler divergence.
1951 – David A. Huffman invents Huffman encoding, a method of finding optimal prefix codes for lossless data compression.
1953 – August Albert Sardinas and George W. Patterson devise the Sardinas–Patterson algorithm, a procedure to decide whether a given variable-length code is uniquely decodable.
1954 – Irving S. Reed and David E. Muller propose Reed–Muller codes.
1955 – Peter Elias introduces convolutional codes.
1957 – Eugene Prange first discusses cyclic codes.
1959 – Alexis Hocquenghem, and independently the next year Raj Chandra Bose and Dwijendra Kumar Ray-Chaudhuri, discover BCH codes.
1960 – Irving S. Reed and Gustave Solomon propose Reed–Solomon codes.
1962 – Robert G. Gallager proposes low-density parity-check codes; they go unused for 30 years due to technical limitations.
1965 – Dave Forney discusses concatenated codes.
1967 – Andrew Viterbi reveals the Viterbi algorithm, making decoding of convolutional codes practicable.
1968 – Elwyn Berlekamp invents the Berlekamp–Massey algorithm; its application to decoding BCH and Reed–Solomon codes is pointed out by James L. Massey the following year.
1968 – Chris Wallace and David M. Boulton publish the first of many papers on Minimum Message Length (MML) statistical and inductive inference.
1970 – Valerii Denisovich Goppa introduces Goppa codes.
1972 – J. Justesen proposes Justesen codes, an improvement on Reed–Solomon codes.
1973 – David Slepian and Jack Wolf discover and prove the Slepian–Wolf coding limits for distributed source coding.[1]
1974 – George H. Walther and Harold F. O'Neil, Jr., conduct the first empirical study of satisfaction factors in the user-computer interface.[2]
1976 – Gottfried Ungerboeck gives the first paper on trellis modulation; a more detailed exposition in 1982 leads to a raising of analogue modem POTS speeds from 9.6 kbit/s to 33.6 kbit/s.
1976 – R. Pasco and Jorma J. Rissanen develop effective arithmetic coding techniques.
1977 – Abraham Lempel and Jacob Ziv develop Lempel–Ziv compression (LZ77).
1989 – Phil Katz publishes the .zip format including DEFLATE (LZ77 + Huffman coding), later to become the most widely used archive container and most widely used lossless compression algorithm.
1993 – Claude Berrou, Alain Glavieux and Punya Thitimajshima introduce Turbo codes.
1994 – Michael Burrows and David Wheeler publish the Burrows–Wheeler transform, later to find use in bzip2.
1995 – Benjamin Schumacher coins the term qubit and proves the quantum noiseless coding theorem.
2001 – Sam Kwong and Yu Fan Ho propose Statistical Lempel–Ziv.
2008 – Erdal Arıkan introduces polar codes, the first practical construction of codes that achieves capacity for a wide array of channels.
  • asked a question related to Statistical Computing
Question
1 answer
See above.
Relevant answer
Answer
Hi Peter,
Exactly the same as in ENVI!
Have a peek at Envi Freelook and take inspiration from it.
Cheers,
Frank
  • asked a question related to Statistical Computing
Question
1 answer
A metric can be used to evaluate performance
Relevant answer
Answer
You can calculate R2: 1-rss/tss.
The residual sum of squares (rss) can be calculated using ctree's "predict" function, so if "model_ctree" is your ctree model and "y" is your dependent variable:
rss <- sum((y - predict(model_ctree))^2)
tss <- sum((y-mean(y))^2)
r2 <- 1-rss/tss
Note that this R2 is computed on the learning data. If you want to predict on new data, you will have to specify the test set containing the new data via the "newdata" parameter of the "predict" function. If your test set is "testset", use:
predict(model_ctree, newdata = testset)
If you do not have fresh data for the test set, you can hold out a number of cases from your entire sample by randomly splitting it into a learning sample (used to fit the ctree model) and a test sample (used for prediction).
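A minimal sketch of that random hold-out split, using made-up data and ctree() from the party package (partykit's ctree works the same way); all object names are hypothetical:
library(party)
set.seed(1)
dat <- data.frame(x1 = runif(200), x2 = runif(200))          # toy predictors
dat$y <- 3 * dat$x1 + rnorm(200, sd = 0.3)                   # toy outcome
idx     <- sample(nrow(dat), size = round(0.7 * nrow(dat)))  # 70% learning sample
learn   <- dat[idx, ]
testset <- dat[-idx, ]
model_ctree <- ctree(y ~ ., data = learn)
pred    <- as.numeric(predict(model_ctree, newdata = testset))
r2_test <- 1 - sum((testset$y - pred)^2) / sum((testset$y - mean(testset$y))^2)
r2_test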
  • asked a question related to Statistical Computing
Question
2 answers
Coding sheet
Relevant answer
Answer
Characters or variables? The question requires elaboration.
  • asked a question related to Statistical Computing
Question
1 answer
I need to evaluate the security of a sequence.
Relevant answer
Answer
To evaluate the security of a random sequence for cryptography (I assume that is what you mean by security), check out NIST SP 800-22 (http://csrc.nist.gov/publications/nistpubs/800-22-rev1a/SP800-22rev1a.pdf).
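As an illustration (not a replacement for the full suite), the first test in SP 800-22, the frequency/monobit test, is easy to compute by hand. A minimal R sketch with a made-up bit vector standing in for the sequence under test:
bits  <- sample(0:1, 10000, replace = TRUE)   # stand-in for your sequence
n     <- length(bits)
S     <- sum(2 * bits - 1)                    # map 0/1 to -1/+1 and sum
s_obs <- abs(S) / sqrt(n)
p_val <- 2 * pnorm(s_obs, lower.tail = FALSE) # equals erfc(s_obs / sqrt(2))
p_val                                         # p >= 0.01 is consistent with randomness for this test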
  • asked a question related to Statistical Computing
Question
2 answers
Out-of-equilibrium thermodynamics often studies changes in a structure over a long time scale. Could first-order transitions give information about out-of-equilibrium evolving structures? I mean: the time scale in a first-order transition tends to zero (for example the liquid-solid transition), but the systems involved are often simpler. I would suggest that scientists study first-order transitions as if they were out of equilibrium (during the duration of the transition).
Relevant answer
Answer
That's the reason why Becker-Döring cluster theory and its modifications (coagulation-fragmentation equations) are still quite popular amongst mathematicians and physicists. One investigates, e.g., the first-order phase transition "gas -> liquid" via the formation of larger and larger clusters out of a monomer bath (i.e. the gaseous phase). Here a cluster consists of several monomers sticking together. Once a critical cluster size is exceeded, one identifies these "large" clusters with the occurrence of droplets (i.e. the liquid phase).
The stochastic processes of cluster formation are usually described by master equations. For Markovian stochastic processes these master equations are represented by infinite systems of ordinary differential equations of first order in the time parameter t. And as you observed correctly, under certain conditions one recovers a time scale tending to zero, i.e. metastable states. Furthermore, as you also mentioned, the physical systems are far from equilibrium, so Onsager's reciprocity-relations approach is not applicable. Oliver Penrose, a former post-doc of Lars Onsager, dedicated quite some time to researching these issues in the 70s, 80s and 90s, as did Kurt Binder, Joel Lebowitz and Enzo Olivieri, to name but a few pioneers.
A drawback of this mathematical picture, however, is the need for a micro-physical theory of the transition rates. If one really had a serious theory of the transition rates for water and the "liquid -> solid" phase transition ("freezing"), say, then one could investigate a possible theoretical explanation for the Mpemba effect.
  • asked a question related to Statistical Computing
Question
6 answers
Suppose I am looking at millions of data sets each with millions of data points and I need to capture details about each of those distributions with as much accuracy as possible. Histograms are a concise way to capture information about the distributions so that one can construct a CDF or calculate approximate quantiles at a later time from the stored histograms, and they can efficiently be calculated over many computers in parallel for large data sets.
What statistical methods best capture the information loss for a given set of histogram breakpoints for a given empirical distribution?
For example, suppose I have the data set 1,1,1,1,1,9,9,9,9,9. Histogram 1 uses breakpoints 0,5,10 and Histogram 2 uses breakpoints 0,2,4,6,8,10
So histogram 1 looks like :
[0,5] : 5
[5,10]: 5
Histogram 2 looks like:
[0,2]: 5
[2,4]: 0
[4,6]: 0
[6,8]: 0
[8,10]: 5
Clearly Histogram 1 has more information loss than histogram 2 since the bimodal nature of the underlying distribution is lost with the unfortunate breakpoints chosen in histogram 1 compared to the breakpoints in histogram 2 which show the bimodal nature of the underlying distribution.
Since I don't know if the underlying distribution is normal, I am currently using a worst-case metric which essentially generates the worst possible distributions that could be represented by the same histogram and takes the Kolmogorov-Smirnov statistic (or just the maximum distance apart of the two CDFs approximated from the histograms, as represented by the yellow boxes in the right-most column of the attached plots).
Do any statistical software packages calculate KS or information loss metrics directly from histograms? Are there other methods besides KS which capture this information loss? I couldn't find anything for R on CRAN.
Relevant answer
Answer
Here are a few thoughts; see if they help.
1. This is the classic bias-variance trade-off problem: too few bins result in too much bias (i.e., deviation from the true but unknown underlying density) but low variability (of the histogram) across different data realizations (from the same data-generating process), whereas too many bins lead to little bias but too much variability. The basis for dealing with such bias-variance trade-off problems is usually a squared loss between the histogram estimator and the true but unknown density, and the associated risk function (i.e., expected loss). All optimal bin-size prescriptions (such as Freedman-Diaconis) attempt to balance the bias of a histogram with its variance. This standard formalism can be found in, e.g., Larry Wasserman's All of Statistics and All of Nonparametric Statistics (both Springer).
If you use an optimal bin-size prescription, and the data meet the assumptions that the prescription is based on, then information loss is automatically minimal according to that prescription.
2. Also remember that a histogram is just one estimator of the probability density of your data, and a discrete one at that. If the underlying density is known to be continuous, then other density estimators (e.g., kernel density estimators) may be more appropriate.
3. The CDF need not be estimated from the histogram. There is, e.g., a nonparametric estimator called the empirical distribution function (EDF) that estimates the CDF directly from the data. The EDF does not depend on any adjustable parameter such as the bin size, and its formal properties are well-understood.
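To make points 1 and 3 concrete with the toy data from the question, here is a minimal R sketch; hist() with an explicit breaks vector reproduces the two histograms, and ecdf() gives the binning-free empirical distribution function:
x  <- c(rep(1, 5), rep(9, 5))                       # toy data from the question
h1 <- hist(x, breaks = c(0, 5, 10),          plot = FALSE)
h2 <- hist(x, breaks = c(0, 2, 4, 6, 8, 10), plot = FALSE)
h1$counts   # 5 5        -- the bimodality is hidden
h2$counts   # 5 0 0 0 5  -- the bimodality is visible
# an automatic rule such as Freedman-Diaconis: hist(x, breaks = "FD")
Fn <- ecdf(x)            # empirical distribution function, no binning at all
Fn(c(1, 5, 9))           # 0.5 0.5 1.0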
  • asked a question related to Statistical Computing
Question
2 answers
I am looking for an explicit formula that weighs the predictions from each database into a combined one.
Relevant answer
Answer
Assuming you're not interested in the more high-end approaches like random forests or model averaging, you might want to try a mixed effects model where you include database as a random effect.
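A minimal sketch of that suggestion with lme4, assuming a stacked data frame in which each row is one prediction, with columns outcome, predictor, and a database identifier (all names below are hypothetical):
library(lme4)
set.seed(1)
dat <- data.frame(
  database  = rep(paste0("db", 1:5), each = 40),   # made-up database identifier
  predictor = runif(200)
)
dat$outcome <- 1 + 2 * dat$predictor +
               rnorm(5)[as.integer(factor(dat$database))] +   # database-specific offsets
               rnorm(200, sd = 0.3)
fit <- lmer(outcome ~ predictor + (1 | database), data = dat)
summary(fit)   # the random intercept absorbs systematic offsets between databases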