Project

Philosophy of Data Science review for big-data analytics

Goal: A short essay about the Philosophy of Data Science. It is time to shake up the ancient 17th-century scientific model, which I think is responsible for the current deadlock of big-data analytics. The key points at which my new GT model departs from the old one are (a) induction through rules, (b) substantiation of rules, and (c) a different method of sampling. My guide in this (ad)venture is Prof. Joseph Agassi - a long-time warrior for the logic of science against technocracy.

Methods: Data Mining, Philosophy of Science

Date: 30 July 2016

Project log

Edith Ohri
added a research item
Bank survey data records of 35 managers and over 200 questions were given for interpretation. The survey was meant to find out the managers' positions regarding automation in the bank. The solution algorithms define Similarity Groups and their key factors. The groups are homogeneous and thus reduce the number of variables. In this case, the groups were quite distinct, which enabled logical operations such as group comparison and identifying emerging behavior in one of the groups…
Edith Ohri
added an update
Rule #15 of data science – The more one consumes data, the larger it grows and the higher its added value!
The very use of data creates additional data. In fact, one sign of information theft is that the amount of information on the stolen subject suddenly increases. This special quality of data generates an ever-growing supply of data resources and added value, which pushes forward what we witness as the data revolution.
 
Edith Ohri
added a research item
This is a short presentation of a GT application for higher Fab yields. Fabs are almost the last remnant of mass-production lines, and as such they face challenges of rigidity, expensive infrastructure, and complexity in planning and control. On the other hand, Fabs by nature are closest to the 4th industrial revolution. It seems that a middle stage of AI-driven Quality Control can bridge the gap between the ancient (100-year-old) and contemporary concepts of management, by using big-data power with GT. The strengths of the GT solution for this task are mainly three: GT's ability to find root causes, its use of unsupervised data, and its creation of simple controls that can be embedded in existing systems. It may sound like too much to expect, yet it is attainable, as shown in the presentation's real-case example.
Edith Ohri
added an update
Rule #11 of data science: prediction by laws of behavior* is more reliable than prediction from past data alone, since laws (unlike event records) stay valid beyond the data-collection period, thus providing a more stable ground for extending deduced relations into the future.
Definition: a law of behavior is a multivariate function that explains the variations in a pattern.
Example: prediction based on past data is like driving by looking in the rear-view mirror. Although the traffic flow in front and in the rear is almost the same, relying on it to predict the road ahead is bound to result in traffic jams and chain accidents. To improve the prediction, rule #11 suggests interpreting the data with laws of behavior: for instance, considering the traffic load, low speed as an indicator of a nearing junction, the time between traffic lights (inside a city), the effects of weather and road conditions, the crossroad signs, and the fact that some drivers use navigation guides (in which case it is safe to follow their decisions as reflected in the rear-view mirror).
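To make the contrast concrete, here is a minimal sketch (an illustration only, not the GT algorithm): a naive prediction built from past records alone versus a prediction from a fitted multivariate law of behavior. The variables (traffic load, weather, junction proximity) and their coefficients are invented for the example.

```python
# A minimal sketch, not the GT algorithm: predicting from past data alone
# versus predicting from a fitted multivariate "law of behavior".
# The variables (load, weather, junction) and coefficients are invented.
import numpy as np

rng = np.random.default_rng(0)
n = 500
load = rng.uniform(0, 1, n)        # traffic load
weather = rng.uniform(0, 1, n)     # 0 = clear, 1 = heavy rain
junction = rng.uniform(0, 1, n)    # proximity to the next junction

# Assumed underlying law of behavior: speed is a multivariate function
speed = 90 - 40 * load - 15 * weather - 20 * junction + rng.normal(0, 3, n)

train, test = slice(0, n // 2), slice(n // 2, n)

# "Past data" prediction: the future looks like the average of the past
naive_pred = np.full(n - n // 2, speed[train].mean())

# "Law of behavior" prediction: fit the multivariate function on the past
X = np.column_stack([np.ones(n), load, weather, junction])
coef, *_ = np.linalg.lstsq(X[train], speed[train], rcond=None)
law_pred = X[test] @ coef

truth = speed[test]
print("naive RMSE:", round(float(np.sqrt(np.mean((naive_pred - truth) ** 2))), 2))
print("law   RMSE:", round(float(np.sqrt(np.mean((law_pred - truth) ** 2))), 2))
```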
 
Edith Ohri
added an update
Rule #10 of data science: indirect variables are relevant and need to be included in big data analysis just as direct variables are.
Definition: Indirect variables are dependent variables, environment descriptors, metadata, and processing or analysis information.
Note: Big-data variables are not orthogonal or monotone and, in general, do not have any predetermined behavior. Therefore they are analyzed by algorithms, not closed-form mathematical formulas.
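As a small illustration (assuming pandas; all names and values are hypothetical), indirect variables such as environment descriptors and processing metadata can simply be joined onto the direct event records so that the analysis sees both:

```python
# A minimal sketch: attaching indirect variables (environment descriptors,
# source metadata, processing information) to the direct event records.
# All names and values here are invented for illustration.
import pandas as pd

events = pd.DataFrame({
    "event_id": [1, 2, 3],
    "machine": ["A", "A", "B"],
    "yield_pct": [97.1, 92.4, 95.8],        # direct (result) variable
})

environment = pd.DataFrame({
    "machine": ["A", "B"],
    "humidity": [41, 55],                    # environment descriptor
    "sensor_firmware": ["v2.1", "v1.9"],     # metadata about the data source
})

# Indirect variables enter the analysis table just like direct ones
flat = events.merge(environment, on="machine", how="left")
flat["analysis_batch"] = "batch_01"          # processing / analysis information
print(flat)
```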
 
Edith Ohri
added an update
Rule #9 of data science – result variables that are also causes produce a feedback cyclic effect.
The new data science defines result-cause variables in three steps: (a) locate patterns with a cyclic effect, (b) select variables whose derivatives flip sign through the cycle, and (c) find a rationale that explains the mechanism behind this behavior.
Notes:
  1. Many systems include explicit or implicit result-cause variables.
  2. The cyclic (oscillation) effect may take the shape of vibrations, internal resonance, waves, fluctuations, economic bubble cycles, etc.
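A minimal sketch of steps (a)-(b) follows (synthetic stand-in series, not GT code): flag series whose first derivative repeatedly flips sign, which marks candidates for a result-cause feedback cycle; step (c), the rationale, remains a human judgement.

```python
# A minimal sketch of steps (a)-(b): count how often a series' derivative
# flips sign; oscillating (result-cause) candidates flip, monotone ones don't.
# Step (c), finding the rationale, is left to the analyst.
import numpy as np

def derivative_sign_flips(series: np.ndarray) -> int:
    """Number of times the first derivative changes sign."""
    d = np.sign(np.diff(series))
    d = d[d != 0]                 # ignore flat steps
    return int(np.sum(d[1:] != d[:-1]))

t = np.linspace(0, 20, 400)
price = 100 + 10 * np.sin(t)      # oscillating result-cause candidate
trend = 50 + 0.5 * t              # monotone variable, no feedback cycle

for name, s in [("price", price), ("trend", trend)]:
    print(name, "derivative sign flips:", derivative_sign_flips(s))
```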
 
Edith Ohri
added an update
Meet the 4 pillars of the new QC:
Pinpointed
Proactive
Pervasive and
Predictive.
All four qualities originate in the same Group Technology hierarchical-clustering feature, as follows. The distinction among groups enables pinpointing items of interest. The hierarchy allows tracing specific item qualities to their common root causes and intervening proactively in the process. The structure of rules (laws) helps in their ubiquitous application and pervasive scale-up. And the group definitions help reach greater resolution and a more granular, precise prediction.
 
Edith Ohri
added an update
Rule #8 of data science – There is a third type of error (Gamma, γ). It is caused by wrong hypotheses*, unlike the first two types, FALSE-POSITIVE and FALSE-NEGATIVE, which are attributed to the stochastic nature of data.
The Gamma type is indifferent to testing; hypotheses that lead to these errors are either ALWAYS WRONG or ALWAYS RIGHT, regardless of the tests.
*Wrong hypotheses include logical mistakes, bad terminology, and non-verifiable claims.
Notes:
1. The Gamma error is prevalent in cases that involve many variables or unsupervised data.
2. Long tails and many outliers indicate the existence of a third-type error.
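To illustrate the idea with a stand-in example (not a formal GT test): a wrong hypothesis, such as fitting a straight line to data whose true behavior is periodic, stays wrong no matter how much data it is tested on, unlike α/β errors, which shrink as samples grow.

```python
# A minimal illustration of a Gamma-type error (a stand-in example only):
# a mis-specified "linear" hypothesis fitted to periodic data keeps roughly
# the same residual error however large the test sample becomes.
import numpy as np

rng = np.random.default_rng(1)

for n in (100, 1_000, 10_000):
    x = np.linspace(0, 12, n)
    y = np.sin(x) + rng.normal(0, 0.1, n)     # true behaviour is periodic
    slope, intercept = np.polyfit(x, y, 1)    # always-wrong linear hypothesis
    rmse = np.sqrt(np.mean((slope * x + intercept - y) ** 2))
    print(f"n={n:>6}  residual RMSE={rmse:.2f}")   # does not shrink with n
```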
 
Edith Ohri
added an update
Rule #7 of data science - CAUSES are variables that change consistently with the target* and have a logical or fundamental explanation. If they do not have an explanation they are considered ASSOCIATIONS (this rule extends rule #4 on creating effective hypotheses).
The importance of causes lies in their increased accuracy, detail and stability, and in their pinpointing of factors through which it is possible to intervene in processes proactively.
Example:
Suppose that an analysis of football game results indicates that winning is affected consistently by the following four variables: "number of goals the team scored in the previous season", "player ratings", "the team coach's rating", and "the number of foreign players recruited in the last season".
All four variables make sense – it is easy to see the contribution to winning of demonstrated winning capability, talented players, a reputable coach, etc. The trouble is that the opposite rationale also makes sense – winning increases the team's income, as well as its reputation, motivation, and attractiveness to good recruits. Also, several theories float around about what makes a success – some consider the players' younger age a key factor, some trust only the coach, and some analysts believe that the best predictor is the team's last-season success... The management would surely want to know what the actual root causes are.
Rule #7 makes it possible to track down root causes through the hierarchical structure of patterns. Here, the root causes** turn out to be a combination of two variables: "the number of goals in the last season" and "the team coach's rating". The two variables serve in a prediction function that is simple to test, handle and run on a data stream.
The full new-data-science model includes defining exceptions and early alerts within newly created laws of behavior.
Notes:
* A target variable can be stated through aspects of the target. In this example, the aspects can be the prediction of "income", "number of wins", "number of goals"...
** The defined root causes also capture a number of indirect and hard-to-quantify factors, such as team motivation, fan-club enthusiasm, and local-authority support.
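A minimal sketch of the screening half of rule #7 follows (invented data, not the GT root-cause procedure): keep only candidate variables whose relation to the target has a consistent sign across groups; the second filter, a fundamental explanation, stays with the analyst.

```python
# A minimal sketch of the consistency screen in rule #7: candidates must
# correlate with the target with the same sign in every group. Data and the
# 0.1 threshold are invented; the explanation filter stays manual.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "group": rng.choice(["league_A", "league_B", "league_C"], 300),
    "goals_last_season": rng.poisson(40, 300),
    "coach_rating": rng.uniform(1, 10, 300),
    "foreign_players": rng.integers(0, 12, 300),
})
df["wins"] = (0.5 * df.goals_last_season + 3 * df.coach_rating
              + rng.normal(0, 5, 300))        # foreign_players has no effect

candidates = ["goals_last_season", "coach_rating", "foreign_players"]
corr = df.groupby("group")[candidates + ["wins"]].corr()["wins"].unstack()
consistent = [c for c in candidates
              if (corr[c] > 0.1).all() or (corr[c] < -0.1).all()]
print("consistent candidates:", consistent)
```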
 
Edith Ohri
added an update
Rule #6 of data science – definitions are hierarchical.
Definitions of entities, attributes, events and patterns – all of them can be perceived as hierarchical, where lower-level definitions mostly inherit their traits from the pertinent upper-level definition.
Definitions always have at least two levels, where the upper level is the name of the attribute and the lower level includes the attribute's values. For instance: 1st level – "Personal physicals"; 2nd level – {"Height", "Weight", ...}; 3rd level – {{Height values}, {Weight values}, ...}. In an exception record, instead of "Weight" there may be, for example, "Deviation from Normal Weight", or different measurement units.
Notes:
(1) A hierarchy is just a logical arrangement of a collection of items, which enables the observer to focus on details or extend the view, back and forth, efficiently, i.e. with very few modifications.
(2) The hierarchy depends on the specific subject and context of inquiry.
(3) Inheritance of traits from upper to lower-level definitions is always incomplete in unsupervised data, due to variations, "noise" and errors, exceptions, mixed events, evolving processes and mutations.
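A minimal sketch of this hierarchy as a data structure (an illustrative representation, not GT's internal one), using the Height/Weight example above, including an exception record that overrides the inherited definition:

```python
# A minimal sketch of rule #6: a multi-level definition in which lower levels
# inherit traits from the upper level unless an exception overrides them.
# The representation is an illustration, not the GT internal format.
personal_physicals = {                    # level 1: attribute-group name
    "unit_system": "metric",              # trait inherited downwards
    "attributes": {                       # level 2: attribute names
        "Height": {"values": [172, 181, 165]},   # level 3: attribute values
        "Weight": {"values": [70, 88, 61]},
    },
}

# An exception record replaces "Weight" with a deviation in different units
exception = {
    "attributes": {
        "Weight": {"name": "Deviation from Normal Weight",
                   "unit": "kg above/below norm",
                   "values": [4, -2]},
    },
}

def resolve(base, override, attr):
    """Lower level inherits the base definition unless explicitly overridden."""
    merged = dict(base["attributes"].get(attr, {}))
    merged.update(override.get("attributes", {}).get(attr, {}))
    merged.setdefault("unit_system", base["unit_system"])
    return merged

print(resolve(personal_physicals, exception, "Weight"))   # overridden traits
print(resolve(personal_physicals, {}, "Height"))          # fully inherited
```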
 
Edith Ohri
added an update
In the early years of the 17th century the questions were different. People looked for ways to understand nature in order to handle it better. Scientists devoted many years of empirical as well as scholastic study so that they could come up with new theories. The IT revolution has changed all that. The way we experience the world now is different, the questions are different, and researchers need tools to cope with the flood of information. We need a massive model for finding hidden cause-effect explanations, a hypothesis search machine, the kind that goes beyond existing expert knowledge and the limited human ability to conceptualize multivariate mixed-phenomena events.
Can humans create a machine that is smarter than humans, and by what means? The answer is positive, if one uses self-clustering (grouping). Why grouping? Because it reduces the number of space dimensions, eliminates noise, and most importantly creates new terminology that enables logical operations. Why self-grouping? In order to break the chicken-and-egg cycle. It is possible to sum up the endeavor of automating hypotheses as reverse-engineering of data, or "formulas that turn a given set of chaotic data into rationally explained events". For a number of reasons (e.g., as mentioned here earlier, the number of big-data variables exceeds the number of instances), the task of reverse-engineering big data is out of reach of mathematics (including statistics). The solution is hence left to search algorithms. This is in essence what GT (Group Technology) data mining stands for.
 
Edith Ohri
added an update
Rule #5 of data science – samples for empirical research; any authentic set of records can be used for learning and testing.
A dataset should cover as many kinds of phenomena and variable values as possible (including extreme points and exceptions).
The final structure of a dataset is a "flat table" where the rows are event records and the columns are descriptive variables (attributes).
To the original record one should add the following descriptive variables: the environment, the data sources, and the Delta – the difference between a record and its neighboring records (in an attempt to preserve the information embedded in the sequence of events).
Note-1:
A hypothesis-making model should notify the user when data inconsistencies reach a level at which there is a risk of contradictory conclusions being drawn from the same dataset.
Note-2:
The new data science frees the analyst from the old mandatory prerequisites of "statistically representative sampling" and "unified data form and quality" (this can be proven from Rules 1-4). It means less preparation, more data objectivity, and better support for the requirement of fast learning from data as-is.
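A minimal sketch of such a flat table (assuming pandas; the columns and values are invented): rows are event records, columns are attributes, and Delta columns preserve part of the information held in the sequence of events.

```python
# A minimal sketch of rule #5's flat table: event records as rows, attributes
# as columns, plus environment/source columns and Delta columns holding the
# difference from the neighbouring (previous) record. Names are hypothetical.
import pandas as pd

flat = pd.DataFrame({
    "timestamp": pd.to_datetime(["2017-05-13 08:00", "2017-05-13 09:00",
                                 "2017-05-13 10:00", "2017-05-13 11:00"]),
    "source": ["sensor_1", "sensor_1", "sensor_2", "sensor_1"],  # data source
    "ambient_temp": [21.0, 21.4, 25.2, 21.1],                    # environment
    "pressure": [1.01, 1.02, 0.98, 1.01],
})

# Delta columns: difference between each record and its neighbouring record
for col in ("ambient_temp", "pressure"):
    flat[f"delta_{col}"] = flat[col].diff()

print(flat)
```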
13-05-2017
Home of GT data mining
 
Edith Ohri
added an update
Rule #4 of data science – hypotheses checking; hypotheses need to have a cause-effect rationale that is traceable down to individual events.
Making sound hypotheses is the main purpose of data mining analytics.
A hypothesis has a four-part structure:
- Conditions (or pattern scopes)
- Variables (shared by most of the data sources)
- Results
- A function that connects the variables to the results.
Definition:
"Hypothesis" is a term coined in statistics, meaning a theory before being tested. Data-mining analytics can be viewed in this sense as a preliminary stage of preparing hypotheses for statistics tests.
28-Apr-2017
 
Edith Ohri
added an update
Rule #3 of Data Science: generating insights.
Generating insights is based on group LOGICAL OPERATORS such as comparison, breaking down, generalization, and analogy.
Example: suppose that a large retail-shop network has two similar selling patterns (or "groups"), of which one is much more profitable than the other. Eliminating the characteristics the groups share leaves (supposedly) just two characteristics that are unique to each group – Region and Means-of-Payment. The company's Operations manager approves the findings and explains that the two groups have different product selections, and that there have been special sales-promotion campaigns for holders of a particular credit card that is typical of the higher-profit group. Induction from this explanation may lead to the insight that it is possible to increase profit in all of the shops by launching a general sales promotion that offers products currently not available in some of the shops.
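A minimal sketch of the comparison operator from this example (invented data and column names): profile each group by its dominant value per attribute and drop the attributes on which the groups agree, leaving the distinguishing ones.

```python
# A minimal sketch of the group COMPARISON operator: keep only the attributes
# whose dominant value differs between the groups. Data are invented.
import pandas as pd

shops = pd.DataFrame({
    "group": ["high_profit", "high_profit", "low_profit", "low_profit"],
    "region": ["north", "north", "south", "south"],
    "means_of_payment": ["card_X", "card_X", "cash", "cash"],
    "store_format": ["mall", "mall", "mall", "mall"],     # shared trait
})

# Profile each group by the most frequent value of every attribute
profiles = shops.groupby("group").agg(lambda s: s.mode().iloc[0])

# Eliminate shared characteristics; what remains distinguishes the groups
distinguishing = profiles.loc[:, profiles.nunique() > 1]
print(distinguishing)     # leaves only region and means_of_payment
```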
 
Edith Ohri
added an update
Rule #2 of Data Science: every data collection is fully divisible, according to the target notion, into relatively homogeneous hierarchical groups.
Unlike in classification, in big data different groups (patterns) may have different sorts of characteristics. In marketing, for example, one group may be characterized by specific clients and buying habits, while another group is characterized by demographics and geographic areas.
Definition of Target Notion:
The term target notion describes various aspects of the target. This broad definition is needed in the learning stage, before the exact term is established (after the learning stage, the target is expected to be more focused).
For example: suppose that the target is to find high-performance factors. The various target aspects could then be Productivity, Company Value, Orders Trend, Equity, and the rate of new-product turnout.
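As a rough illustration (ordinary agglomerative clustering from scipy used as a stand-in, not the GT grouping algorithm or its criteria), records can be divided into relatively homogeneous groups guided by several target-notion aspects:

```python
# A minimal sketch, using plain agglomerative clustering as a stand-in for
# GT's hierarchical grouping: divide records into relatively homogeneous
# groups guided by target-notion aspects. Aspect columns are hypothetical.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(3)
# target-notion aspects: e.g. productivity, company value, orders trend
aspects = np.vstack([
    rng.normal([1.0, 0.8, 0.5], 0.1, size=(40, 3)),   # one kind of behaviour
    rng.normal([0.3, 0.2, 0.9], 0.1, size=(40, 3)),   # another kind
])

tree = linkage(aspects, method="ward")                 # the hierarchy itself
groups = fcluster(tree, t=2, criterion="maxclust")     # cut into 2 groups
print("records per group:", np.bincount(groups)[1:])
```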
 
Edith Ohri
added an update
Rule #1 of Data Science: In big data, the number of variables (or "attributes") always exceeds the number of records.
The meaning of this rule is that in big data there are more variables than equations – a fact that prevents a purely mathematical solution.
It seems absurd – how can big data have a shortage of records?
The explanation is rooted in the collection of big data from unsupervised sources, where each record has individual environment conditions that may interact with the results and therefore need to be documented and added to the data. This addition of individual attributes is what inflates the number of variables.
To remedy this impossible constraint, the GT solution eliminates most of the environment variables by dividing the data into groups that share the same conditions – as stated in "GT New Data Science Principles" principle #10 at https://www.researchgate.net/project/Philosophy-of-Data-Science-review-for-big-data-analytics/update/58ee2d161042bf333c67c5ef
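A minimal numeric sketch of the constraint and the remedy (invented numbers, not the GT procedure): with more columns than rows the linear system is underdetermined, while inside a group that shares identical environment conditions the environment columns carry no within-group information and effectively drop out.

```python
# A minimal sketch of rule #1: more variables than records leaves the system
# underdetermined; within a group sharing the same environment, the constant
# environment columns drop out. Numbers are invented, not the GT procedure.
import numpy as np

rng = np.random.default_rng(4)
n_records, n_direct, n_env = 50, 5, 100      # 105 variables > 50 records
X_direct = rng.normal(size=(n_records, n_direct))
X_env = rng.normal(size=(n_records, n_env))  # individual environment columns
X = np.hstack([X_direct, X_env])
print("rank", np.linalg.matrix_rank(X), "for", X.shape[1], "unknowns")
# rank <= 50 < 105: the coefficients cannot be determined uniquely

# Within one group the environment is identical, so those columns are constant
# and (after centring) contribute nothing; only the direct variables remain.
shared_env = np.tile(rng.normal(size=(1, n_env)), (n_records, 1))
X_group = np.hstack([X_direct, shared_env - shared_env.mean(axis=0)])
print("informative columns within a group:", np.linalg.matrix_rank(X_group))
```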
 
Edith Ohri
added an update
Data Science Principles (Following GT data mining)
1. The goal of big data analysis is to generate knowledge or Hypotheses out of data.
2. The analysis's first and mandatory step is breaking the data mass into HIERARCHICAL GROUPS (patterns of behavior).
3. SAMPLING of big data does not have to be representative of the whole; however, it should include various phenomena, not just the one under study, in order to help draw conclusions about its behavior.
4. THE VARIABLES should include background information about the environment, circumstances and conditions that may affect directly or indirectly the phenomenon under study.
5. ASSUMPTIONS ARE NOT ALLOWED, unless they can be verified.
6. DATA CLEANING is not allowed for objectivity and authenticity reasons.
7. DEPENDENCIES among variables are assumed at the start (variables without dependencies are considered redundant).
8. The results should be EXPLANATORY and produce INSIGHTS. * INSIGHT definition: a generalization that expands the scope of a law by inferring to different sets of conditions or cause-result relations, through logical and analogical considerations.
 
Edith Ohri
added an update
Unsupervised data fails big data analytics
--------------------------------------------------------------------
In the analysis of big data there exists one "silent" assumption that is a cornerstone of statistical testing but doesn't receive much attention – the assumption of data supervision (to ensure data qualities such as consistency, authenticity, full coverage, etc.). This requirement, which was traditionally an inherent part of analytics models, is now almost impossible to fulfill.
To understand the extreme extent of the change in the world of data and its analytics challenge, it is enough to examine just a single aspect: the way data are produced. In the old days, data were collected for specific pre-assigned research aims, while today large quantities of data pour in from automated systems and remote digital processes designed for different aims, over which researchers have little if any control.
In view of this change it is quite understandable why the assumption of big-data supervision becomes increasingly unattainable. Yet the even more critical consequence of this collapse of a fundamental assumption is the decreased validity of the scientific model that rests on it, as seen in the increasing α, β and γ type errors (I will write about this separately).
Note that if unsupervised big data were not a major cause of failure – if it were just a technical issue – it could have been easily resolved by selecting, out of the abundant data, a statistical sample that matches the supervision directives. Yet, as a fact, sampling big data has not become a solution, which shows that a deeper change in the scientific model is needed.
Next chapter: My proposition for a new data science model.
The new model is made to function in a world of mixed types of events, multiple data sources, many variables, many unknown interrelations, hidden phenomena, dynamics and noise (i.e., it is free from the need to supervise data), and above all – its main target is to create hypotheses, not just to test them. This solution is a non-statistical one.
- Home of GT data mining
 
Edith Ohri
added an update
On 'dark' data
Whether data are 'dark' depends entirely on our method of observation. In fact, the existence of extensive 'darkness' indicates that something is very wrong with the method. I can name a number of faults that can cause analytics to crash, just off the cuff :) To mention a few: meddling with the data (e.g. cleaning the outliers), using bad hypotheses, never checking presumptions(!!!)... Each of the above is enough to destroy a scientific method, all the more so when they act together. As long as this goes on, even seemingly 'right' analytics results could be potentially wrong, for example when the dark data concentrate in one part of the scale, or when the supposed cause incidentally coincides with a variable that escapes the study.
As always, my remedy is grouping the native data into hierarchical clusters. Among the clusters, all the unaccounted-for patterns should have SCOPES & CHARACTERISTICS that bring them to light.
 
Edith Ohri
added an update
The law of Large Numbers fails in big data
The Law of Large Numbers is often regarded as a sort of "law of nature" by which variables' averages always gravitate to fixed, clear values. The question is: does the law of large numbers hold true in the case of big data?
The key to answering this question lies in understanding the law's underlying assumptions, especially regarding sample representation and data stability.
One of the qualities that signify big data is volatility. Volatility thrives in large, multivariate, closely-packed interrelated events (this can be proven mathematically), such as usually exist in big data. It is the dynamics that accompany big-data events which interfere with the convergence of averages and prevent it from happening.
Moreover, in my view, even if the law of large numbers were true for big data, it would not have been of much use, due to its focus on common "average" behavior that is mostly already known, rather than on irregularities and exceptions that are yet unknown and therefore require research – for example, research into early-detection indicators, adverse effects, fraud detection, quality assurance, customer retention, accidents, and long-tail marketing, to mention a few. I'll discuss the long-tail phenomenon later, but for now just one remark: long tails actually consist of hidden phenomena, so their discovery has to be, by definition, governed by rules other than those already considered in the law of large numbers.
The above weak points of the law of large numbers are just a small part of the analytics "peculiarities" that can be expected in big data. This paragraph is part of a series of essays on a proposed new concept of science in the wake of the IT industrial revolution, written in collaboration with @Joseph Agassi.
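A small numeric illustration of the failure mode described above (a stand-in, not a proof): for a stable i.i.d. series the running average settles quickly, while for a drifting, non-stationary series – standing in for big-data volatility – it never converges to a fixed value.

```python
# A small illustration, not a proof: running averages converge for a stable
# i.i.d. process but keep moving when the process drifts (a crude stand-in
# for the volatility of big data).
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
iid = rng.normal(0.0, 1.0, n)                     # stable process
drifting = rng.normal(np.linspace(0, 5, n), 1.0)  # mean drifts over time

for name, series in [("iid", iid), ("drifting", drifting)]:
    running_mean = np.cumsum(series) / np.arange(1, n + 1)
    checkpoints = running_mean[[9_999, 49_999, 99_999]]
    print(name, "running mean at 10k/50k/100k:", np.round(checkpoints, 2))
```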
 
Edith Ohri
added an update
Is Machine Learning chasing its own tail of pre-assumption?
Machine Learning (ML) as a method of learning is indeed a machine, i.e. it operates consistently, repeatedly and predictably, by a designed method made for specific conditions; but its "learning" part is more like "training" or "verification" than the acquisition of new knowledge that the name suggests. Practically speaking, ML is made to improve prescribed response formulas, not to invent such formulas (and I know the statement might seem controversial), not even to correct them.
Here is then my take on the issue:
Law #1 A dog (or a cat) chasing its tail for long enough time will eventually catch it.
Law #2 The catching will hurt!
Law #3 Getting painful results will not stop the chase; it will stop only due to boredom or the exhaustion of all energy resources. 
 
Edith Ohri
added an update
Your thoughts, questions, comments..
 
Edith Ohri
added an update
Identifying patterns of behavior, key factors, indicators, prediction formulas, and early detection parameters.
 
Edith Ohri
added an update
I'm working on
(a) Creating a new concept of a scientific model that aims to remedy some faults in the traditional science model and serve better in today's IT reality.
(b) Placing my GT data-mining solution with respect to the current and the new models.
 
Edith Ohri
added a project goal
A short essay about the Philosophy of Data Science. It is time to shake up the ancient 17th-century scientific model, which I think is responsible for the current deadlock of big-data analytics. The key points at which my new GT model departs from the old one are (a) induction through rules, (b) substantiation of rules, and (c) a different method of sampling. My guide in this (ad)venture is Prof. Joseph Agassi - a long-time warrior for the logic of science against technocracy.