# Data Science vs Statistics


According to a recent poll ( http://www.kdnuggets.com/2013/05/poll-results-with-big-data-statistics-will-become-more-important.htm ), a large majority (68%) of the KDnuggets audience thought that in the era of Big Data, Statistics will become more important as the foundation of Data Science.

Paraphrasing an old definition of mine for Computational Statistics (Computational Statistics & Data Analysis 23: 191–193, 1996), it can be said that "Data Science is Statistics in the computer and Internet era."

Statisticians nowadays have to deal not only with abundant data but also with new kinds of data and more complex data structures. New technologies play an important role in facing these data. All of this has strongly affected the paradigm and reasoning of classical statistics, which is mainly based on sampling theory and the related data-analysis methods.

The main goal of Data Science is to provide a suitable statistical framework for studying the problem of gaining knowledge, making predictions, making decisions, or constructing models for specific domains. Here, modeling should be understood in a soft sense (soft modeling), relaxing strong distributional hypotheses in favor of structural ones.

## All Answers (23)

Conrad Barnes· University of Texas Southwestern Medical Center
http://radar.oreilly.com/2010/06/what-is-data-science.html#data-scientists

Alexandr Savinov· Bosch GmbH
Big Data is not just volume. It is more about the variety of data sources, the complexity of the structures, and the semantics and reasoning about data. So I am not sure that classical statistics can help much (at least directly) with these new problems.

January Weiner· Max Planck Institute for Infection Biology
What was your question again?

@Hemanta

Do you think that "numerical" and "statistical" are somewhat mutually exclusive or opposite? If yes, why?

@Mitchell Maltenfort:

I think that "Big Data" is mostly a buzzword. However, at least in biology I see an increase in high-dimensional data, which leads to a qualitative shift in the statistical methods required, towards dimension-reduction techniques and variable selection; among other reasons, precisely because high dimensionality can lead to inflated type I error rates if one were to test the variables independently. Imagine what face William Gosset would make if I told him that I intend to run a few hundred thousand t-tests before breakfast!
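
A minimal simulation sketches the point (plain Python, synthetic data; the z-test approximation and all the numbers are illustrative assumptions, not anyone's actual analysis): running many tests on pure noise yields "significant" results at roughly the alpha rate.

```python
import math
import random

random.seed(42)

def two_sample_z_p(x, y):
    """Two-sided p-value for the difference in means, using a
    normal (z) approximation to the two-sample t-test."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail

n_tests, alpha, n_per_group = 2_000, 0.05, 50
false_positives = 0
for _ in range(n_tests):
    x = [random.gauss(0, 1) for _ in range(n_per_group)]
    y = [random.gauss(0, 1) for _ in range(n_per_group)]  # same population!
    if two_sample_z_p(x, y) < alpha:
        false_positives += 1

print(false_positives)  # roughly alpha * n_tests, i.e. about 100 "discoveries"
```

Both groups are drawn from the same population, so every rejection is a false positive; with hundreds of thousands of tests the count scales accordingly, which is exactly why multiple-testing corrections and variable selection matter.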

Hemanta K. Baruah· Bodoland University
Mutually exclusive perhaps, yes, but not opposite, no.

For example, when you fit a mathematical model with linear parameters, such as Y = a + bX + e, where 'e' is a probabilistic error term satisfying the Gauss-Markov conditions, you can apply the method of least squares to estimate the parameters. But suppose you have a mathematical model with non-linear parameters. In that case, the method of least squares cannot be applied directly. However, the parameters can then be estimated using a Taylor-expansion-based iterative method. The first example is a statistical one, while the second is a numerical one. In statistical matters, we deal with probabilistic analysis. In numerical analysis, the question of probability is not there. Incidentally, for the example mentioned above, the Taylor-expansion-based iterative method returns the same estimates for a and b in just one iteration.
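
That last remark can be checked directly. The iteration described is essentially the Gauss-Newton scheme; for a model that is linear in its parameters, the Jacobian is constant, so a single step from any starting point reproduces the ordinary-least-squares estimates. A minimal sketch in plain Python (the data values are made up for illustration):

```python
def solve2(m11, m12, m22, v1, v2):
    """Solve the symmetric 2x2 system [[m11, m12], [m12, m22]] t = [v1, v2]."""
    det = m11 * m22 - m12 * m12
    return (m22 * v1 - m12 * v2) / det, (m11 * v2 - m12 * v1) / det

def ols(xs, ys):
    """Ordinary least squares for Y = a + b X via the normal equations."""
    n, sx = len(xs), sum(xs)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return solve2(n, sx, sxx, sum(ys), sxy)

def gauss_newton_step(xs, ys, a, b):
    """One Gauss-Newton step: residuals r_i = y_i - (a + b x_i),
    Jacobian rows (1, x_i), solve (J'J) d = J'r and update."""
    r = [y - (a + b * x) for x, y in zip(xs, ys)]
    n, sx = len(xs), sum(xs)
    sxx = sum(x * x for x in xs)
    da, db = solve2(n, sx, sxx, sum(r), sum(x * ri for x, ri in zip(xs, r)))
    return a + da, b + db

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(ols(xs, ys))                         # least-squares estimates
print(gauss_newton_step(xs, ys, 0.0, 0.0)) # identical after ONE step
```

Because the model is linear, the linearization is exact and the normal equations solved in the single step are the same ones ordinary least squares solves.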

January Weiner· Max Planck Institute for Infection Biology
I think you can perfectly well combine your Taylor-expansion method with bootstrapping to get confidence intervals and p-values for your parameters (the first few Google hits show me that many people are actually doing just that). I think that you only see the areas as mutually exclusive because you *define* "numeric" as approaches for which "the question of probability is not there".

Also note that estimates of fitted coefficients derived outside a statistical framework (i.e., without any measure of their robustness, or of their relation to the population as opposed to the sample) would in many scientific contexts be just as useless as a poll result reported without the poll size. ("The number of my readers is now twice as high!" -- "Congratulations, I didn't know you got married.")
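
The bootstrapping mentioned above can be sketched in a few lines of plain Python (synthetic data, case resampling, percentile interval; the true slope of 2 and the noise level are illustrative assumptions):

```python
import random

random.seed(1)

def fit_slope(pairs):
    """Ordinary-least-squares slope for a simple linear fit."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx

# Synthetic data: true slope 2, noisy observations.
data = [(x, 2 * x + random.gauss(0, 1)) for x in range(30)]

# Case-resampling bootstrap: refit on resampled (x, y) pairs.
slopes = sorted(
    fit_slope([random.choice(data) for _ in range(len(data))])
    for _ in range(2000)
)
lo, hi = slopes[50], slopes[1949]  # 2.5% and 97.5% percentiles
print(f"95% bootstrap CI for the slope: ({lo:.3f}, {hi:.3f})")
```

The point estimate alone says nothing about how it would vary under resampling; the interval does, and it wraps any fitting procedure, numerical or otherwise.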

P.S. Then of course, we have non-linear regressions and even non-linear least squares.

Hemanta K. Baruah· Bodoland University
Let me cite another simple example. While fitting a probability law, say the normal probability density function, to a given set of data, we do not use the method of least squares, because we cannot use that method in this case. We instead use a procedure based on the standard tables of the concerned law of probability to fit the density curve. However, if you use the Taylor-expansion-based iteration scheme in this case, you would get the same results. Now, according to you, this is a meaningless exercise, just like the poll results! It is based on differential calculus, not on probability; that was what I meant.

Anyway, I mentioned non-linear parameters, not non-linear regression. Try using your standard technique to fit something like Y = exp(aX) + exp(bX). This is not transformable to a linear form. The Taylor-expansion-based iteration technique will return results in this case too.
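
For what it is worth, the iteration described here does handle this model. A minimal sketch in plain Python (the Gauss-Newton linearization; the data are generated from assumed true values a = 0.5, b = -1.0, and the starting point is chosen near them -- all illustrative, not anyone's actual code):

```python
import math

def model(a, b, x):
    return math.exp(a * x) + math.exp(b * x)

def gauss_newton(xs, ys, a, b, iters=50):
    """Fit y = exp(a x) + exp(b x) by linearizing the model in (a, b)
    at the current estimate (first-order Taylor expansion) and solving
    the resulting linear least-squares step."""
    for _ in range(iters):
        # Partial derivatives of the model: x e^{ax} and x e^{bx}.
        ja = [x * math.exp(a * x) for x in xs]
        jb = [x * math.exp(b * x) for x in xs]
        r = [y - model(a, b, x) for x, y in zip(xs, ys)]
        # Normal equations (J'J) d = J'r for the 2x2 system.
        m11 = sum(v * v for v in ja)
        m12 = sum(u * v for u, v in zip(ja, jb))
        m22 = sum(v * v for v in jb)
        v1 = sum(u * ri for u, ri in zip(ja, r))
        v2 = sum(v * ri for v, ri in zip(jb, r))
        det = m11 * m22 - m12 * m12
        a, b = a + (m22 * v1 - m12 * v2) / det, b + (m11 * v2 - m12 * v1) / det
    return a, b

xs = [0.25 * i for i in range(1, 9)]      # avoid x = 0 (zero gradient there)
ys = [model(0.5, -1.0, x) for x in xs]    # exact data, true a = 0.5, b = -1.0
print(gauss_newton(xs, ys, 0.4, -0.8))    # converges to (0.5, -1.0)
```

Note the model is symmetric in a and b, so the starting point decides which label ends up on which exponent.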

With regards.

John W. Kern· Kern Statistical Services, Inc., University of Wyoming, Montana State University
Data are used to develop and test theories, estimate quantities and make decisions. As has always been the case, available data sources have generally outstripped computational capabilities, so scientists (physicists, computer scientists, mathematicians etc.) have struggled to balance incorporation of large and diverse data sources within the then-current computational limits.

Consider the FFT, developed before the advent of computers, yet later becoming the workhorse of signal processing and also statistical analysis of stationary processes.

The core problems have not changed fundamentally---Big or Small data scientists are integrating diverse data of a variety of sources to study processes. The uncertainty in these processes remains critical to making accurate and reliable predictions and decisions, but the options for quantifying that uncertainty have grown with increased computational resources and at times larger sample sizes.

Interestingly, with respect to genomics research, sample sizes have actually become smaller while the number of variables has grown astronomically. This is really small data.

January Weiner· Max Planck Institute for Infection Biology
"Try using your standard technique to fit something like Y = exp(aX) + exp(bX). This is not transformable to a linear form."

Sure, I would not use linear regression for that. That doesn't mean I don't have a statistical wrench in my toolbox that can perfectly well handle this case: generalized nonlinear models. And I get all the usual goodies: errors, confidence intervals, etc. Thanks for demonstrating my point.

"However, if you use the Taylor-expansion-based iteration scheme in this case, you would get the same results. Now, according to you, this is a meaningless exercise, just like the poll results!"

Yep. Sorry about that. The fact that I can calculate a mean of two values or find a perfect fit to any data set is not informative in itself if I have no means of understanding how this fit depends on the sampling. Say, I remove one point from your data set: how does your fit change? The separation between population and sample is the key to understanding my standpoint.
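
The "remove one point" question can be made concrete with a leave-one-out (jackknife-style) refit. A minimal Python sketch, using a made-up data set containing one outlier:

```python
def fit_line(pairs):
    """OLS intercept and slope for y = a + b x."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    b = sxy / sxx
    return my - b * mx, b

# A small data set with one outlier at the end.
data = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (10, 20.0)]

full_a, full_b = fit_line(data)
print(f"slope with all points: {full_b:.3f}")

# Leave-one-out: refit with each point removed in turn.
for i, left_out in enumerate(data):
    _, b = fit_line(data[:i] + data[i + 1 :])
    print(f"slope without {left_out}: {b:.3f}")
```

With these numbers, dropping the outlier changes the slope from about 2.19 to about 0.94 -- exactly the kind of sensitivity information a bare point estimate hides.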

all the best,

j.

Hemanta K. Baruah· Bodoland University
January,

Actually, my understanding is as follows. In the last two decades or so, particularly after the advent of very fast computers, data analysis has become more dependent on computers than on mathematical principles. In the process, mathematics has already started to be on the back foot!

Who am I to object when people go for analyzing data using standard software without much bothering to think whether classical mathematical principles support such uses or not? I am just another teacher teaching mathematical matters, and while doing so, I like to stick to principles put forward by the mathematicians of the last two centuries.

Anyway, if there is no uncertainty involved, one can use a numerical-analysis approach to study the data. If there is uncertainty involved, then one has to see whether the uncertainty is probabilistic. If so, the theory of probability can take over. If the uncertainty is fuzzy, then there is the mathematics of fuzziness to deal with the problem. The data may in fact be chaotic, in which case one should use the principles of chaos theory.

My point is that if there is a probabilistic error term involved in a mathematical model, then it is the calculus of probability which we shall have to use. If there is no uncertainty involved in the model, then the matters can be dealt with using Newtonian calculus.

Mine are classical sort of understandings. I may be wrong!

Regards.

Hemanta K. Baruah· Bodoland University
When someone uses software where it should not be used, the software itself cannot be blamed for that.

Take the case of the multiple correlation coefficient R^2. Before using any software that returns the value of R^2, the user should first check whether the situation is fit for computing R^2 for statistical conclusions. Nowadays, hardly anyone bothers to look into such matters.

Let me cite a concrete example. Suppose (X, Y) are given as (1, 3), (2, 7), (3, 13), (4, 21) and (5, 31). Any standard software would return the value of R^2 as 0.9722. This means that 97.22% (the coefficient of determination) of the variation is explained by the linear relationship between X and Y, while the remaining 2.78% (the coefficient of non-determination) is attributed to probabilistic error.

Incidentally, the values of X and Y stated above follow the exact mathematical relation Y = 1 + X + X^2. There is no error term here! In that case, how can we say that 2.78% of the relationship is due to errors when there was no error at all?

In fact, the user should first check whether the relationship is at least approximately linear. In this case, the relationship is quadratic anyway. Misusing R^2 therefore leads to such a nonsensical conclusion.
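
The example can be reproduced in a few lines of plain Python: the data satisfy Y = 1 + X + X^2 exactly, yet the straight-line R^2 comes out at about 0.9722, apparently attributing 2.78% of the variation to an error term that does not exist.

```python
def linear_r_squared(xs, ys):
    """Squared Pearson correlation: the R^2 of a straight-line fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

xs = [1, 2, 3, 4, 5]
ys = [1 + x + x * x for x in xs]  # exact relation Y = 1 + X + X^2, no error term
print(round(linear_r_squared(xs, ys), 4))  # 0.9722, with zero actual noise
```

A quadratic fit to the same data is exact, so the 2.78% "unexplained" variation is an artifact of applying a linear summary to a non-linear relationship.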

In fact, for a quadratic relationship, we already have the concept of the correlation ratio. But people just compute R^2, and conclude accordingly. Even when the model is exponential, you would find articles using this concept of R^2.

This is just one example. There are too many such examples in which Statistical principles have been violated.

In Physics, for example, no one would dare to misuse a principle. But in statistical matters, this kind of violation is a regular affair.

John W. Kern· Kern Statistical Services, Inc., University of Wyoming, Montana State University
Then we commiserate over coffee about how statistics is misused and how all would have been well had "they" just consulted a statistician at the beginning of the project.

In the meantime, computer scientists, physicists, engineers and mathematicians are developing new and creative methods to deal with the quickly evolving nature of complex numerical models, with an ever-expanding volume and variety of data sources, while many in statistics are waiting to be "consulted" at the beginning of the project.

I guess that was quite a rant! I must have some pent up anger!

Hemanta K. Baruah· Bodoland University
I would like to add one further comment. Misuses like the one cited in my earlier comment were initially started by people outside the statistics fraternity. However, instead of telling them to stop such misuses, some applied-statistics-oriented researchers have actually started to follow them! For example, the misuse of the coefficient of determination cited earlier perhaps started in the life sciences. A time came when biostatisticians started to use the coefficient of determination wrongly. Indeed, in a non-linear biostatistical model there would surely be a probabilistic error component; that is why they believe the coefficient of non-determination may be associated with the error term. In my example, I have shown that for a non-linear model, even though there may be no error term, some variation would still be allotted to an error that is actually absent from the model!

People tend to forget that the idea of the correlation coefficient appears very naturally from the postulates necessary to define a Euclidean space, and a Euclidean space has to be a linear space first. Hence, as soon as we step out of the postulates defining a linear space, the question of the correlation coefficient does not arise, and therefore computing the coefficient of determination does not arise in such cases.

As you have said, some in the statistics fraternity rue such misuse. But who cares! Most of the statistics fraternity have meanwhile forgotten that Statistics is, after all, a mathematics-based subject, and that we need to show proper regard to mathematical matters.

Boris Shmagin· South Dakota State University
Let me bring to your attention that there is a controversy: Data Science is now developing as a science, and it cannot be compared to statistics, which is not a science.

(http://www.youtube.com/watch?feature=player_embedded&v=TbNymweHW4E)

There is an article also on this point:

Dennis V. Lindley, "The Philosophy of Statistics", Journal of the Royal Statistical Society, Series D (The Statistician), Vol. 49, No. 3 (2000), pp. 293–337. Published by Blackwell Publishing for the Royal Statistical Society. Stable URL: http://www.jstor.org/stable/2681060

Hemanta K. Baruah· Bodoland University
John and I have been discussing the misuse of statistical formalisms. In our communications, we have not made any comment either for or against Data Science.

Yes, Statistics is not a 'science' in the sense that it actually comprises "tools of science". Such tools, however, are based on mathematics. From this standpoint, Statistics, like Computer Science, is indeed a mathematical science.

In Data Science too, to deal with a particular situation, first there must exist some logic, based on which there has to follow some mathematics, based on which algorithms will be prepared, after which computers would take over.

I hope, Data Science would not grow as a computer dependent subject devoid of mathematical logic.
