# Information Theory - SEE PROJECT LOG

Goal: A.I. and Physics. All of the material on this site is subject to my copyright policy: https://derivativedribble.wordpress.com/copyright-policy/


## Project log

It just dawned on me that my paper, Information, Knowledge, and Uncertainty, seems to allow us to measure the amount of information a predictive function provides about a variable:

I'll be presenting my AutoML software, Black Tree, during a Meetup event, which you can dial into here:
I'll be walking through basically everything in some detail, so it should be an interesting presentation, with an opportunity to ask questions at the end.
Attached is a set of slides I'll be using to present.
Regards,
Charles

Imagine you have N lightbulbs, and each lightbulb can be either on or off. Call the set of lightbulbs S. Further, posit that you don't know what causes any of the lightbulbs to be on or off. For example, you don't know whether there's a single switch, or N switches, and moreover, we leave open the possibility that there's a logic that determines the state of a given bulb, given the state of all other bulbs. We can express this formally as a rule of implication F from a subset C of S to S, such that given the state of any C, the state of S is determined.
This is awfully formal, but it's worth it, as you can now describe what you know to be true: there's basically no way there are N independent switches for any N lightbulbs in the same room.
The rule of implication F is by definition a graph in that if the state of lightbulb A implies the state of lightbulb B, then you can draw an edge from A to B. There is only one graph that has no rule of implication, i.e., the empty graph. As a consequence, the least likely outcome, given no information ex ante, is that there is no correlation between the states of the lightbulbs. This would require N independent switches.
As a general matter, given N variables, the least likely outcomes are that all variables are perfectly independent, or that all variables are perfectly dependent, since there is only one empty graph, and only one complete graph. It is therefore more reasonable to assume some correlation between a set of variables than not, given no information ex ante. This is similar to Ramsey's theorem on friends and strangers, except that it's probabilistic, and not limited to a graph of a single size; it is instead true in general.
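The counting behind this claim can be made concrete. For N labeled lightbulbs there are 2^C(N,2) possible (undirected, for simplicity) implication graphs, of which exactly one is empty and exactly one is complete; a quick sketch:

```python
from itertools import combinations

def num_graphs(n):
    # Each of the C(n, 2) possible edges is either present or absent,
    # so there are 2^C(n, 2) undirected graphs on n labeled vertices.
    return 2 ** len(list(combinations(range(n), 2)))

# Among all graphs on 5 lightbulbs, only one is empty (full independence)
# and only one is complete (full dependence); every other graph implies
# some partial correlation structure.
total = num_graphs(5)
print(total)        # 1024
print(2 / total)    # probability of the two extreme cases under a uniform prior
```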
More generally, this implies that the larger a set of signals is, the less likely it is that the signals are independent of each other. To test this empirically, I generated a dataset of 1,000 random vectors, each assigned a random classifier from the set {1, 2, 3, 4}. Predicting the classifier of a vector should produce an accuracy of 25%, since the vectors are randomly generated, and so are their classifiers. I ran an ML prediction algorithm on this dataset, which again, should produce an accuracy of 25%, unless something unusual is going on. Unfortunately this software is not free, but you can build something like it, since it pulls clusters near a given vector and then uses the modal class of the cluster as the predicted class of the given vector. The bigger the cluster, the higher the confidence metric. However, the bigger the cluster, the less likely the vectors in the cluster are to be independent of each other, according to the analysis above. And as it turns out, the accuracy actually increases as a function of cluster size, to well beyond chance, peaking just below 70%. This makes no sense in the absence of a theory like the one above.
Here’s the code I used to generate the dataset:
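The original code is attached to the note; as a stand-in, here is a minimal Python sketch of such a dataset together with a simple modal-cluster predictor (dimension and cluster size are invented for illustration, and this simple version should stay near chance on truly random data):

```python
import math
import random
from collections import Counter

random.seed(0)

# 1,000 random vectors in R^10, each assigned a uniformly random class in {1, 2, 3, 4}.
N, DIM = 1000, 10
dataset = [[random.random() for _ in range(DIM)] for _ in range(N)]
labels = [random.randint(1, 4) for _ in range(N)]

def modal_predict(i, k=25):
    # Stand-in for cluster-based modal prediction: take the k nearest
    # vectors and predict the modal class of that cluster.
    order = sorted(range(N), key=lambda j: math.dist(dataset[i], dataset[j]))
    cluster = order[1:k + 1]  # exclude the vector itself
    return Counter(labels[j] for j in cluster).most_common(1)[0][0]

accuracy = sum(modal_predict(i) == labels[i] for i in range(N)) / N
print(accuracy)  # near 0.25 for this simple version, since everything is random
```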

I've put together a very simple model of genetic inheritance that mimics sexual reproduction. The basic idea is that parents meet randomly, and if they satisfy each other's mating criteria, they have children that then participate in the same population. Individuals die at a given age, and the age of mortality is a function of genetic fitness, and overall it's already pretty good, though I have planned improvements.
More on my personal blog here:

I've assembled a dataset using complete mtDNA genomes from the NIH, for 10 individuals from each of five populations (Kazakh, Nepalese, Iberian Roma, Japanese, and Italian), for a total of 50 complete mtDNA genomes. Using Nearest Neighbor alone on the raw sequence data, the accuracy is about 80%, and basic filtering, by simply counting the number of matching bases, brings the accuracy up to 100%. This is empirical evidence for the claim that heritage can be predicted using mtDNA alone. One interesting result, which could simply be bad data: the Japanese population (classifier 4 in the dataset) contains three anomalous genomes that have an extremely low number of matching bases with their Nearest Neighbors. However, what's truly bizarre is that whether or not you include these individuals in the dataset (the attached code contains a segment that removes them), generating clusters using matching bases suggests an affinity between Japanese and Italian mtDNA. This could be known, but it struck me as very strange. Note that because matching bases is plainly indicative of common heritage, this simply cannot be dismissed.
Code, data, and charts here:
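The linked code is the authoritative version; the matching-bases idea itself can be illustrated with a toy example (the sequences and population labels below are invented):

```python
def matching_bases(a, b):
    # Count positions where the two aligned sequences agree.
    return sum(x == y for x, y in zip(a, b))

def nearest_neighbor_class(query, training):
    # training: list of (sequence, label); predict the label of the
    # training sequence sharing the most bases with the query.
    best_seq, best_label = max(training, key=lambda t: matching_bases(query, t[0]))
    return best_label

training = [("ACGTACGT", "pop_A"), ("ACGTTTGT", "pop_A"), ("TTTTACGA", "pop_B")]
print(nearest_neighbor_class("ACGTACGA", training))  # pop_A
```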

The Shannon Entropy is not a good measure of structural randomness, for the simple reason that all uniform distributions maximize the Shannon Entropy. As a consequence, e.g., a sequence of alternating heads and tails (H, T, H, T, ...) maximizes the Shannon Entropy, despite having an obvious structure. The Kolmogorov Complexity solves this, since the shortest program that generates an alternating sequence of heads and tails is obviously going to be much shorter than the length of the sequence, and therefore such a sequence is not Kolmogorov-Random, which requires the complexity to be at least the length of the sequence, up to an additive constant.
The full note is here:
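The point about uniform distributions can be checked directly: the per-symbol Shannon entropy of a perfectly alternating sequence equals that of a fair random one, even though only the latter is structurally random:

```python
import math
import random
from collections import Counter

def empirical_entropy(seq):
    # Shannon entropy of the symbol frequencies, in bits per symbol.
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

alternating = "HT" * 500  # obvious structure, trivially compressible
random.seed(0)
fair = "".join(random.choice("HT") for _ in range(1000))

# Both are at (or near) the 1-bit-per-symbol maximum.
print(empirical_entropy(alternating))  # 1.0
print(empirical_entropy(fair))
```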

We present a series of Machine Learning algorithms that can quickly classify and analyze large datasets of genetic sequences. All of the algorithms presented have a worst-case polynomial runtime, even when run in serial. As an application, we show that human mtDNA can be used to quickly and reliably predict ethnicity across 36 global ethnicities, using a total of 403 complete mtDNA genomes, each from the National Institutes of Health database. None of these techniques make use of genes or haplogroups, and instead simply compare the raw genomes in full. This work demonstrates unambiguously that at least some genetic analysis can be accomplished without regard for anything other than the raw genomes themselves, thereby eliminating the need to partition genomes into genes and other segments for such classifications. This in turn allows for completely unsupervised classification and analysis by machines. We also present new statistical theories of heredity and imputation, together with experimental evidence for these theories.
My primary work is in A.I., and I'm now applying it to genetics specifically, and this has yielded what I think is an interesting and simple algorithm that when run in parallel, allows for entire populations to be analyzed at the basepair level. Specifically, it allows for a machine to quickly find the longest sequence of basepairs common to a given population. My understanding is that this is otherwise a difficult problem.
This should help e.g., identify sequences common to people with diseases, or beneficial traits, and in general, should expedite the process of identifying sequences common to populations.
Here's a brief explainer, together with the code you can run yourself that actually implements this on real genetic data, from my research blog:
I'm certainly not new to A.I. or computer science generally, but I am new to genetics, so thoughts are more than welcomed.
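In serial form, the underlying problem is finding the longest substring common to every genome in a population; a simple exhaustive sketch (not the parallel algorithm from the note, and far slower on real genomes):

```python
def common_substrings(seq, k):
    # All length-k substrings of a sequence.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def longest_common_run(population):
    # Longest string of bases present in every genome of the population,
    # found by trying candidate lengths from longest to shortest.
    shortest = min(population, key=len)
    for k in range(len(shortest), 0, -1):
        shared = common_substrings(shortest, k)
        for seq in population:
            shared &= common_substrings(seq, k)
            if not shared:
                break
        if shared:
            return next(iter(shared))
    return ""

pop = ["ACGTTACG", "TTACGGGA", "CATTACGT"]  # toy "genomes"
print(longest_common_run(pop))  # TTACG
```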

Quantize a space of amplitudes to achieve a code. That is, each one of finitely many peak amplitudes corresponds to some symbol or number. So if e.g., a signal has a peak amplitude of 4, then the classifier / symbol it encodes is 4. Now posit a transmitter and a receiver for the signals. Send a signal from the transmitter to the receiver, and record both the known peak amplitude label transmitted (i.e., the classifier), and the received amplitudes at the receiver. Because of noise, the transmitted amplitudes could be different from the received amplitudes, including the peak amplitude, which we’re treating as a classifier. For every given received signal, use A.I. to predict the true transmitted peak amplitude / classifier. To do this, take in a fixed window of observations around each peak, and provide that to the ML algorithm. The idea is that by taking a window around the peak amplitude, you are taking in more information about the signal, rather than just the peak itself, and so even with noise, as long as the true underlying amplitude is known in the training dataset, all transmitted signals subject to noise should be similarly incorrect, allowing an ML algorithm to predict the true underlying signal. Attached is an original clean signal (left), with a peak amplitude / classifier of 5, and the resultant signal with some noise (right). Note that the amplitudes are significantly different, but nonetheless my classification algorithms can predict the true underlying peak amplitude with great accuracy, because the resultant noisy curves are all similarly incorrect.
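A toy version of this setup can be sketched as follows; the pulse shape, window size, and noise level are all invented for illustration:

```python
import math
import random

random.seed(0)

def pulse(peak, n=32, noise=0.0):
    # Half-sine pulse with the given peak amplitude, plus optional Gaussian
    # noise; the whole window around the peak is the feature, not just the peak.
    return [peak * math.sin(math.pi * i / (n - 1)) + random.gauss(0, noise)
            for i in range(n)]

# Training set: clean windows labeled by their quantized peak amplitude.
train = [(pulse(a), a) for a in range(1, 6)]

def predict(window):
    # Nearest neighbor over entire windows.
    return min(train, key=lambda t: math.dist(t[0], window))[1]

# Noisy received signal with true peak 5: the raw received peak may differ,
# but the full window remains closest to the correct clean template.
received = pulse(5, noise=0.3)
print(predict(received))  # 5
```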

It just dawned on me that we might be able to cure diseases associated with individual genes that code for specific proteins, by simply suppressing the resultant mRNA. This could be accomplished by flooding cells with molecules that are highly reactive with the mRNA produced by the “bad gene”, and also flooding cells with the mRNA produced by the correct “good gene”. This would cause the bad gene to fail to produce the related protein (presumably the source of the related disease), and instead cause the cell to produce the correct protein of a healthy person since they’re given the mRNA produced by the good gene.

I noted in the past that a UTM plus a clock, or any other source of ostensibly random information, is ultimately equivalent to a UTM for the simple reason that any input can eventually be generated by simply iterating through all possible inputs to the machine in numerical order. As a consequence, any random input generated to a UTM will eventually be generated by a second UTM that simply generates all possible inputs, in some order.
See this link on my personal blog:
However, random inputs can quickly approach a size where the amount of time required to generate them iteratively exceeds practical limitations. This is a problem in computer science generally, where the amount of time needed to solve problems exhaustively can at times exceed the amount of time elapsed since the Big Bang. As a consequence, as a practical matter, a UTM plus a set of inputs, whether they're random, or specialized to a particular problem, could in fact be superior to a UTM alone, since it would actually solve a given problem in some sensible amount of time, whereas a UTM without some specialized input would not. This suggests a practical hierarchy that subdivides finite time by what could be objective scales, e.g., the age of empires (about 1,000 years), the age of life (a few billion years), and the age of the Universe itself (i.e., the time since the Big Bang). This hierarchy is useful, because it helps you think about what kinds of processes could be at work solving problems, and it plainly has implications in genetics, because again you're dealing with molecules so large that even random sources don't really make sense, suggesting yet another means of computation.
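A tiny sketch of the enumeration argument: a generator that produces every finite binary string in length-lexicographic order will eventually emit any given "random" input, though the index at which it appears grows exponentially with the input's length:

```python
from itertools import count, product

def all_binary_strings():
    # Enumerate every finite binary string in length-lexicographic order;
    # any particular input eventually appears.
    for n in count(1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

target = "10110"  # stand-in for a "random" input to the UTM
for i, s in enumerate(all_binary_strings()):
    if s == target:
        print(i)  # 52: 2 + 4 + 8 + 16 shorter strings, then position 22 among length-5
        break
```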

It just dawned on me that you can simply store positively and negatively charged particles separately (e.g., electrons and protons), just like cells do, to generate a difference in electrostatic charge, and therefore motion / electricity. I'm fairly confident space near Earth is filled with charged particles, since my understanding is that the atmosphere is our primary defense against charged particles, which are also the source of the Aurora Borealis. So, by logical implication, you can collect and separate positively and negatively charged particles in space, bring them back down to Earth, and you have a battery. Moreover, because it's subatomic particles, and not compounds, you have no chemical degradation, since you're dealing with basically perfectly stable particles. As a consequence, you should be able to reverse the process indefinitely, assuming the battery is utilized by causing the electrons / protons to cross a wire and commingle. Said otherwise, there's no reason why we can't separate them again, producing yet another battery, repeating this indefinitely. I'm not an engineer, and so I don't know the costs, but this is plainly clean energy, and given what a mess we've made of this place, I'm pretty sure any increased costs would be justified. Just an off the cuff idea: a negatively charged fluid could be poured into the commingled chambers, and drained, which should cause the protons to follow the fluid out, separating the electrons from the protons again.

In a footnote to one of my papers on physics (A Unified Model of the Gravitational, Electrostatic, and Magnetic Forces), I introduced but didn't fully unpack a simple theory that defines a space in which time itself exists, in that all things that actually happen are in effect stored in some space. See Footnote 7. The basic idea is that as the Universe changes, it's literally moving in that space. That said, you could dispense with time altogether as an independent variable in my model, since time is the result of physical change, and so if there were no change at all to any system, you would have no way of measuring time; you could therefore argue that time is simply a secondary property imposed upon reality, that is measured through physical change.
However, we know that reality does in fact change, and we also have memories, which are quite literally representations of prior states of reality. This at least suggests the possibility that reality also has memory, that stores the prior, and possibly the future states of the Universe. Ultimately, this may be unnecessary, and therefore false, but it turns out you can actually test the model I’m going to present experimentally, and some known experiments are consistent with the model, in particular the existence of dark energy, and the spontaneous, temporary appearance of virtual particles at extremely small scales.
The full note is on my blog:

Below I present an optimization algorithm that appears to be universal, in that it can solve high-dimensional interpolation problems, problems involving physical objects, and even sort a list, in each case without any specialization. The runtime is fixed ex ante, though the algorithm is itself non-deterministic.
So I just came to an amazing insight on the nature of life, and other systems generally:
There's a big difference between superficial unity (e.g., something looks like a single object), and functional unity (e.g., an atom behaves as a single object). We know of examples of arguably random systems like cellular automata that exhibit superficial unity (they literally contain triangle-shaped outputs, or Sierpinski Triangles). But that's not the same thing as an atom, that generally interacts as a single unit, or a composite particle, that again, despite being a complex object, behaves as a single unitary whole in general when interacting with other systems. And the reason I think this is connected to life, is because at least some people claim the origin of life stems from random interactions -
This is deeply unsatisfying, and in my opinion, an incomplete explanation of what's happening.
Think about the probability of randomly producing something as large and complex as a DNA molecule, that has deep unitary functions, that copies itself, consumes its environment, and ends up generating macroscopic systems, that are also functionally unitary -
This is a ridiculous idea. For intuition, generate random characters on a page, and try to run them in C++, or whatever language you like -
What's the probability you'll even produce a program that runs, let alone does something not only useful, but astonishing, and with an output that is orders of magnitude larger than the input program? There's no amount of time that will make this idea make sense, and you'll probably end up needing periods of time that exceed the age of the Universe itself. A simple for loop contains about 20 characters, and there are about 50 characters in scope in programming languages -
This is not realistic thinking, as a program of any real utility will quickly vanish into the truly impossible, with even a simple for loop having a probability that is around $O(\frac{1}{10^{30}})$. For context, there have been about $O(10^{17})$ seconds since the Big Bang. Let's assume someone had a computer running at the time of the Big Bang, that could generate 1 billion possible programs per second. Every program generated is either a success or a failure, and let's assume the probability of success is again $p = O(\frac{1}{10^{30}})$. The binomial probability of exactly one success in this case reduces to,
$\bar{p} = np(1-p)^{n-1},$
where $n$ is the number of trials and $p$ is the probability of generating code that runs. Because we've assumed that our machine, which has been running since inception, can test one billion possibilities per second, we would have a total number of trials given by $n = 10^{17} \times 10^{9} = 10^{26}$. This yields a comically low probability of $\bar{p} = O(\frac{1}{10^4})$, even given the absurd assumption that calculations have been running since the Big Bang.
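These figures can be checked numerically; a quick sketch, using the approximation $(1-p)^{n-1} \approx e^{-np}$ since $1-p$ underflows to 1.0 in floating point:

```python
import math

# Checking the back-of-the-envelope numbers: p is the probability that a
# random program runs, n is the number of trials since the Big Bang at one
# billion attempts per second.
p = 1e-30
n = 1e26  # ~1e17 seconds x 1e9 programs per second

# P(exactly one success) = n * p * (1 - p)^(n - 1); since (1 - p) rounds
# to 1.0 in floating point, use (1 - p)^(n - 1) ~ exp(-n * p).
p_bar = n * p * math.exp(-n * p)
print(p_bar)  # ~1e-4, matching the estimate above
```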
Expressed in these terms, the idea that life is the product of chance, sounds absurd, and it is, but this doesn't require direct creationism, though I think philosophers and scientists are still stuck with the fact that the Universe is plainly structured, which is strange. Instead, I think that we can turn to the atom, and composite particles, for an intuition as to how a molecule as complex as DNA could come into being. Specifically, I think that high energy, high complexity interactions cause responses from Nature, that impose simplicity, and literally new physics:
The physics inside an atom, is not the same as the physics outside an atom;
The physics inside a composite particle, is not the same as the physics outside a composite particle.
This does not preclude a unified theory, but instead perhaps e.g., different subsets or instances of the same general rules apply under different conditions, and that the rules change as a function of energy and complexity. So if e.g., you have a high-energy event, that is high in complexity, at the scale of the atom, then perhaps, this could give rise to a system like DNA. This is however, distinct from random processes that produce superficial or apparent unity (i.e., it looks like a shape), and instead is a process of Nature that imposes functional unity on systems that are sufficiently high in energy and complexity.
I am in some sense calling into question at least some of the work of people like Stephen Wolfram, who, from what I remember, argues that the behavior of automata can be used to describe the behavior of living systems. I think instead you can correctly claim that automata produce patterns that are superficially similar to e.g., the coloring and shapes you find in Nature, but that's not the same as producing a dynamic and unitary system that superficially resembles the patterns you find in that area of computer science generally. The idea being that you have a discrete change from a complex churning physical system, into a single ordered system that behaves as a whole, and has its own physics that are again, distinct from the physics prior to this discrete change. It's the difference between producing a static artifact, that again has elements that are similar to known objects, and producing a dynamic artifact with those same superficial qualities. What's interesting, is that we know heavy elements are produced in stars, which are plainly very high energy, very complex systems. Now think about how much larger and more complex DNA is compared to even a large atom. If this theory is correct, then we would be looking for systems that aren't necessarily as large as stars, but perhaps have even greater energy densities and complexities -
That's quite the conundrum, because I don't think such things occur routinely if at all on Earth. I think the admittedly possibly spurious logic suggests that higher energy stars, specifically black holes, might be capable of producing larger artifacts like large molecules. The common misconception that nothing escapes a black hole is simply incorrect, and not because of Hawking's work, but because black holes are believed to have a lifespan, just like stars. As a result, any objects they create, could be released -
The intuition would be, powerful stars produce heavy elements, so even more powerful stars, like black holes, could produce molecules. And because the physics of black holes is plainly high energy and complex, it's a decent candidate.
However, even if all of this is true, and we are able to someday replicate the conditions that give rise to life, we are still stuck with the alarming realization that there are apparently inviolate rules of the Universe, beyond physics, the causes of which are arguably inaccessible to us. Specifically, the theorems of combinatorics are absolute, and more primary than physics, since they are not subject to change or refinement -
They are absolute, inviolate rules of the Universe, and as a result, they don't appear to have causes in time, like the Big Bang. They instead follow logically, almost outside time, from assumption. How did this come to be? And does that question even mean anything, for rules that seem to have no ultimate temporal cause? For these reasons, I don't think there's a question for science there, because it's beyond cause. This is where I think the role of philosophy and religion truly comes into play, because there is, as far as I can tell, no access to causes beyond the mere facts of mathematics. That is, we know a given theorem is true, because there is a proof from a set of assumptions that are again totally inviolate -
Counting will never be incorrect, and neither will the laws of logic. And so we are instead left with an inquiry into why counting is correct, and why logic is correct, and I don't think that's subject to scientific inquiry. It simply is the case, beyond empiricism, though you could argue we trust the process because it's always been observed to be the case. But this is dishonest, because everyone knows, in a manner that at least I cannot articulate in words, that you don't need repeated observation to know that counting is inviolate. Moreover, I don't think this is limited to combinatorics, but instead, would include any theorems that follow from apparently inviolate assumptions about the Universe. For example, the results I presented in,
fit into this category, because they follow from a tautology that simply cannot be avoided. Further, the results I present in Section 1.4 of,
also fit this description, because all measurements of time, made by human beings, will be discrete, and so the results again follow from apparently inviolate assumptions about the Universe. And so we are stuck with the alarming realization that there are apparently inviolate rules of the Universe, the causes of which are arguably inaccessible to us. That is, the laws of combinatorics seem to be perennially true, following from idea itself, independent of any substance, and without succession from cause in time.
So in this view, even if the conditions of the origins of life are someday replicable, the real mystery is the context that causes life to spring into existence -
The causes of the laws of the Universe, being inaccessible, perhaps in contrast to the conditions that produce life.
To take this literally, the laws of combinatorics, actually exist, outside time and space, and yet impose conditions upon time and space, that are absolute and inviolate. This space, assuming it exists, would have a relational structure that is also absolute and inviolate, as the logical relationships among the theorems of combinatorics are also absolute and inviolate. It is in this view, only the peculiarity of our condition, that requires the progression through time, which allows for computation, and in turn, the discovery of these laws and relationships, whereas in reality, they simply exist, literally beyond our Universe, and are alluded to, by observation of our Universe. To return to the topic of life, without life, there would be no ideas, and instead, only the operations of their consequences (i.e., the undisturbed laws of physics). Because life exists, there is, in this view, a connection, between the space of ideas, and our Universe. To lend credence to this view, consider the fact that there are no constraints on the Universe, other than those imposed by some source. For example, the written laws of mankind, the limitations of the human body, the observed laws of physics, and ultimately, the theorems of combinatorics, which will never change. The argument above suggests that, therefore, the theorems of combinatorics have a source that is exogenous to both time and space, yet to deny a source to the most primordial and absolute restrictions of our Universe, seems awkward at best. Moreover, humanity's progression through science has repeatedly struggled with that which is not directly observable by the human body, such as radio waves, magnetic fields, and gravity, yet we know something must be there. 
In this case, logic implies that the laws of combinatorics exist in a space, that by definition, must be beyond both time and space, though its consequences are obvious, and with us constantly, and so that space, must have dominion over time and space, and is, from our condition, absolute and inviolate.

The professional website for my AutoML Software is up and running, and you can already download a Free Version from the site:
I will continue to post academic updates to ResearchGate, and my personal blog, though the website will contain the actual releases of new software going forward.

I’m pleased to announce that after years of basically non-stop research and coding, I’ve completed the Free Version of my AutoML software, Black Tree (Beta 1.0). The results are astonishing, producing consistently high accuracy, with simply unprecedented runtimes, all in an easy to use GUI written in Swift for MacOS (version 10.10 or later):
Dataset / Accuracy / Total Runtime (Seconds)
UCI Credit (2,500 rows): 98.46% / 217.23
UCI Ionosphere (351 rows): 100.0% / 0.7551
UCI Iris (150 rows): 100.0% / 0.2379
UCI Parkinsons (195 rows): 100.0% / 0.3798
UCI Sonar (208 rows): 100.0% / 0.5131
UCI Wine (178 rows): 100.0% / 0.3100
UCI Abalone (2,500 rows): 100.0% / 10.597
MNIST Fashion (2,500 images): 97.76% / 25.562

I'm testing a slight tweak to eliminate non-determinism from my Normalization Algorithm, and I figured I'd share it in the interim.
The idea is to simply use zero when a dimension has an average number of digits equal to zero. The logarithm function is undefined at zero, so rather than adding noise (my initial approach), you can just set log(0) to zero.
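A minimal sketch of the deterministic fix described above (the function name is mine, not from the original code):

```python
import math

def safe_log(avg_digits):
    # Deterministic fix: when a dimension's average number of digits is zero,
    # return zero directly instead of perturbing with noise, since log(0)
    # is undefined.
    return 0.0 if avg_digits == 0 else math.log(avg_digits)

print(safe_log(0))    # 0.0, instead of an error or added noise
print(safe_log(2.5))
```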

This is the code that implements the technique mentioned in Section 1.2 of my paper, "Vectorized Deep Learning", which has always been public, but I thought it best to isolate the code, given how accurate this technique is.
The variable "dataset_file" is the path to the dataset; simply set the second variable "user_selection_vector" to the vector [0,1], which turns normalization on.
This algorithm generates the following accuracies, with a trivial runtime:
UCI Credit: 98.72%
UCI Ionosphere: 100%
UCI Iris: 100%
UCI Parkinsons: 100%
UCI Sonar: 100%
UCI Wine: 100%

Attached is an algorithm that uses cluster prediction, rather than nearest neighbor, to size the digits of a dataset.

Attached is the code needed to run my supervised modal prediction algorithm, with the UCI Credit Dataset the suggested dataset, which is also attached (2,500 random rows, out of 30,000).

Below are several papers that contain fundamental results in Artificial Intelligence, Computer Theory, and Set Theory. When appropriate, code is attached.
Here's a free version of my AutoML software for MacOS. Note that you'll probably have to right-click, and select "Open", because of Mac's default security settings, but otherwise, it's ready to run!
Enjoy!
Charles

Attached are all of my algorithms, as of today, which are of course subject to my copyright policy:

Below we present a fundamental equation of epistemology that relates information, knowledge, and uncertainty, and apply that equation to random variables using information theory. Also attached is software that applies the results to a deep learning classification problem.
I’ve written a few posts lately that make use of two measures of confidence, and it turns out they don’t work consistently (i.e., across datasets) on an individual basis, but they do work when combined, which you can see below.
For results see here:
For a high-level summary, see here:

I've stuck to the same handful of UCI and MNIST datasets for consistency during the development of my algorithms, but I'm now planning to port my software to macOS, for offer on the App Store, and just like the Matlab version, the demo will be free, but I will simultaneously issue a simple commercial version that allows for unrestricted rows and dimensions. As a result, I've started to expand the datasets to which I've applied my algorithms, just to be sure they work as a practical matter, despite a formal proof that they must work.
And it turns out, they do, and in particular, I've just applied my basic supervised image classification algorithm ...
Full note with code here:

Below please find all of my algorithms, as of today, September 3, 2021.
Please note that as always, these algorithms are subject to my copyright policy:

Below we introduce two theorems showing that a set of real numbers is sorted if and only if (1) the distance between adjacent pairs in the resultant ordering is minimized, and (2) the amount of information required to encode the resultant ordering, when expressed as a particular class of recurrence relations, is minimized. As such, these two theorems demonstrate an equivalence between sorting and minimizing information. These theorems imply what appears to be the fastest known sorting algorithm, with a worst-case O(log(N)) runtime, when run on a parallel machine.
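The first claim can be checked by brute force on a small set: the ordering that minimizes total adjacent distance is the sorted one (or its reversal, by symmetry), since any non-monotonic ordering retraces part of the number line:

```python
from itertools import permutations

def adjacent_distance(ordering):
    # Total distance between adjacent pairs in the ordering.
    return sum(abs(b - a) for a, b in zip(ordering, ordering[1:]))

values = [3.2, -1.0, 7.5, 0.4, 2.2]
best = min(permutations(values), key=adjacent_distance)

# The minimizer is monotonic: sorted, or reverse-sorted.
is_monotonic = list(best) in (sorted(values), sorted(values, reverse=True))
print(is_monotonic)                       # True
print(adjacent_distance(sorted(values)))  # equals max(values) - min(values)
```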
Attached are all of my algorithms as of today, June 21, 2021.
UPDATE (June 23, 2021):
Also attached are updated versions of the command line files that go along with my pitch deck,

My letter to The Congressional A.I. Caucus regarding likely anticompetitive practices in the market for A.I. software.
I noted on Twitter that I came up with a really efficient way of calculating spatial entropy, and though I plan on writing something more formal on the topic, as it’s related to information theory generally, I wanted to at least write a brief note on the method, which is as follows:

Attached are all of my algorithms as of 3-15-21.
Note, again, these algorithms may NOT be used for commercial purposes, without my express prior written consent.
Also attached are the command line files for my article -
together with a zip file containing the images for a dataset referenced in that article.

Below is a set of lemmas, corollaries, and proofs, related to the logarithm of Aleph_0, at the intersection of information theory, computer theory, and set theory.
Many consumer devices can be used to perform parallel computations, and in a series of approximately five-hundred research notes, I introduced a new and comprehensive model of artificial intelligence rooted in information theory and parallel computing, that allows for classification and prediction in worst-case polynomial time, using what is effectively AutoML, since the user only needs to provide the datasets. This results in run-times that are simply incomparable to any other approach to A.I. of which I'm aware, with classifications at times taking seconds over datasets comprised of tens of millions of vectors, even when run on consumer devices. Below is a series of formal lemmas, corollaries, and proofs, that form the theoretical basis of my model of A.I.
The probability of throwing a billion heads in a row using a fair coin is no different than any other outcome involving a billion coin tosses. If we measure the surprisal of observing a billion heads in a row using Shannon information, it’s the same as every other outcome, since their probabilities are all equal, forming a uniform distribution. At the same time, observing a billion heads in a row is beyond surprising, as a practical matter, and would probably be a story you tell for quite a while (putting the time constraints aside).
If, however, we look at the distribution of outcomes within each throw, then we get a very different measure of surprisal. Specifically, there is only one outcome that consists of only heads, and so the probability of that distribution is extremely low, in context, since there are going to be an enormous number of possible distributions over a billion coin tosses. We can abstract further, and look at the structure of the distribution, which in this case consists entirely of a single outcome. There are exactly two distributions with this structure –
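The two levels of surprisal can be made concrete with a small Python sketch, using a modest number of tosses rather than a billion so the arithmetic stays exact:

```python
from math import comb

n = 30  # number of fair coin tosses (kept small for illustration)

# Every individual sequence has the same probability, 2^-n, so the Shannon
# surprisal of "all heads" as a *sequence* is n bits, like any other sequence.
sequence_surprisal_bits = n

# But as a *distribution* (a count of heads), all-heads is rare: exactly one
# sequence realizes it, versus comb(n, n // 2) sequences for the modal count.
p_all_heads_count = comb(n, n) / 2 ** n
p_half_heads_count = comb(n, n // 2) / 2 ** n

assert p_half_heads_count / p_all_heads_count == comb(n, n // 2)
```

So the same observation is unsurprising at the sequence level, yet enormously surprising at the distribution level, which is exactly the distinction drawn above.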

One of the first useful conclusions I came to in applying information theory to physics was that you could use Shannon entropy to measure the complexity of the motion of an object. For example, light always travels in a perfectly straight line in a vacuum, and so it has a zero entropy distribution of motion. In contrast, a baseball has some wobble when observed at a sufficiently close scale, and therefore the entropy of its distribution of motion is greater than zero. Said otherwise, there’s some mix of velocities to most objects, even if what you see appears to be rectilinear at our scale of observation. To cut through any noise in observation, you could cluster the observed velocities, since this will cause similar velocities to be clustered together, and you can then measure the entropy of the distribution of the resultant clusters, which is something most of my clustering algorithms do anyway.
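The idea can be sketched in a few lines of Python, using simple quantization as a crude stand-in for the clustering algorithms mentioned above (the resolution parameter is a hypothetical choice, not part of the original method):

```python
from collections import Counter
from math import log2

def motion_entropy(velocities, resolution=1.0):
    # Quantize velocities into bins (a crude stand-in for clustering), then
    # measure the Shannon entropy of the resulting bin distribution.
    bins = Counter(round(v / resolution) for v in velocities)
    total = sum(bins.values())
    return -sum((c / total) * log2(c / total) for c in bins.values())

# Light in a vacuum: a single repeated velocity, i.e., zero-entropy motion.
assert motion_entropy([299792458.0] * 100) == 0.0

# A "wobbling" object: a mix of nearby velocities, so entropy is positive.
assert motion_entropy([10.0, 10.6, 11.9, 10.1, 12.2]) > 0.0
```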

A letter I wrote to the U.S. Congressional Artificial Intelligence Caucus regarding the dangers of A.I. in small devices.
In a series of lemmas and corollaries, I proved that under certain reasonable assumptions, you can classify and cluster datasets with literally perfect accuracy. Of course, real world datasets don't perfectly conform to the assumptions, but my work nonetheless shows that worst-case polynomial runtime algorithms can produce astonishingly high accuracies. This results in run-times that are simply incomparable to any other approach to A.I. of which I'm aware, with classifications at times taking seconds over datasets comprised of tens of millions of vectors, even when run on consumer devices. Below is a summary of the results of this model as applied to benchmark datasets, including UCI and MNIST datasets, as well as several novel datasets rooted in thermodynamics. All of the code necessary to follow along is available on my ResearchGate Homepage, and www.blacktreeautoml.com.
Attached is a zip file containing all of my algorithms as of today, together with five command line scripts that implement the examples in my latest research paper, Vectorized Deep Learning.

In a previous article, I introduced a rigorous method for thinking about partial information, though for whatever reason, I left out a rather obvious corollary, which is that you can quantify both uncertainty and knowledge in terms of information, though I did go on a rant on Twitter, where I explained the concept in some detail.
Though I’ve done a ton of work on the issue, which I’ll expand upon in other articles, the fundamental equation that follows is,
I = K + U,
where I is the total information of the system, K is the subjective knowledge with respect to the system, held by some observer, and U is the uncertainty of that same observer with respect to the system, all having units of bits.
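A minimal numeric illustration of the equation, assuming a system of N independent binary states (the lightbulbs from the earlier note), where observing the state of one bulb contributes one bit of knowledge:

```python
# I = K + U, all in bits. Assumes N independent binary states, so the total
# information of the system is N bits, and observing k states yields k bits
# of knowledge, leaving N - k bits of uncertainty.
N = 10           # total information I of the system, in bits
known_bulbs = 4  # states the observer has actually observed

I = N
K = known_bulbs
U = N - known_bulbs  # remaining uncertainty, in bits

assert I == K + U
```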

I’ve written an algorithm that vectorizes polynomial interpolation, producing extremely fast approximations of functions. The algorithm works by generating all possible polynomials of a given degree, and then evaluating all of them, but because these operations are vectorized, the runtime is excellent, even on a home computer:
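The enumerate-and-evaluate idea can be sketched as follows. This is a plain-Python stand-in: the actual algorithm vectorizes these evaluations in Octave, and the coefficient grid here is a hypothetical choice made for illustration:

```python
from itertools import product

def brute_force_fit(xs, ys, degree=2, coeff_grid=None):
    # Generate every polynomial of the given degree over a coefficient grid,
    # evaluate each one at all sample points, and keep the best fit. The
    # comprehension-based enumeration here stands in for true vectorization.
    if coeff_grid is None:
        coeff_grid = [c / 2 for c in range(-6, 7)]  # -3.0 .. 3.0, step 0.5
    def evaluate(coeffs, x):
        return sum(c * x ** k for k, c in enumerate(coeffs))
    def error(coeffs):
        return sum((evaluate(coeffs, x) - y) ** 2 for x, y in zip(xs, ys))
    return min(product(coeff_grid, repeat=degree + 1), key=error)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 1.5, 3.0, 5.5]  # samples of y = 1 + 0.5 * x**2
assert brute_force_fit(xs, ys) == (1.0, 0.0, 0.5)
```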

We present a new model of artificial intelligence and a new model of physics, each rooted in information theory, computer theory, and combinatorics. Chapter II covers the standard topics in machine learning and deep learning, including image processing, data classification, and function prediction. More advanced topics include vectorized image processing algorithms that in turn allow for video classification, including real-time video classification, and real-time function prediction. Chapter III introduces a new model of physics, including time-dilation, gravity, charge, and magnetism, as well as selected topics from quantum mechanics, together presenting a new and unified model of physics. All of the algorithms referenced in the text are available on my ResearchGate Homepage, as implemented in Octave / Matlab.
Attached is a complete zip file with all of my algorithms.
As noted above, I retain all rights to these algorithms, copyright and otherwise.

We present a model of physics rooted in discrete mathematics that implies the correct equations for time-dilation, gravity, charge, magnetism, and is consistent with the fundamentals of quantum mechanics. We show that the model presented herein is consistent with all experiments, of which we are aware, that test the Theory of Relativity, and propose experiments that would allow for the model presented herein to be distinguished from the Theory of Relativity. However, the differences between equations implied by the Theory of Relativity and the model presented herein are so small, that they are beyond the error of any experiments of which we are aware. Moreover, unlike the Theory of Relativity, the model presented herein makes use of objective time, but nonetheless implies time-dilation will occur, as a local phenomenon, with the rate at which a system progresses through its states determined by the ratio between its kinetic energy to its total energy. Finally, unlike the Theory of Relativity, the model presented herein follows deductively from assumptions about the mechanics of elementary particles, and is not rooted in any assumptions about the nature of time itself.
Continuing my work on the applications of A.I. to physics, in this article I’m going to present very simple software that can take a large number of observations of an object with macroscopically rectilinear motion, and calculate its velocity.
The reason this software is interesting is that the motions of the objects are random, but nonetheless generate a macroscopically rectilinear motion. This is how real-world objects actually behave, in that the parts of a projectile don’t move in the same direction at the same time, but instead have variability that simply isn’t perceptible to human beings, yet might nonetheless get picked up by a sensor, making the problem of measuring velocity non-trivial when given a lot of data.
Though you don’t need to read it to appreciate the software I’ll present in this article, all of this is based upon my model of motion, which I presented in my main paper on physics:
I’m simply memorializing the model using code, which will eventually allow for a complete model of physics, including the correct equations of time-dilation, that can be easily analyzed using my entire library of A.I. software.
My ultimate goal is to allow for the autonomous discovery of useful models of physical systems, leaving physicists free to do the higher work of discovering theoretical connections between observables, and then allowing machines to uncover approximations of their relationships, which should in turn assist physicists in their discovery of the correct, exact, mathematical relationships. This software will also be useful for the cases where the correct relationship is impossible to discover, due to intractability, or because it simply doesn’t exist in the first instance, which is a real possibility, if reality is non-computable. There’s also, obviously, the commercial case, where you don’t need the correct equation, and a model that works is all you really need.
The example in this case consists of a statistical sphere with 1,000 independently moving points that change position over a series of 16 frames. The velocities of the points are independent, but selected in a manner that ensures that the relative positions of the points remain approximately constant on average over time. For an explanation of the mathematics underlying the model, see Section 3.3 of the paper linked to above, but you can plainly see combinatorial relationships unfolding in the paths of the individual points, with the outcomes more dispersed in the center of each path, which I’ll explore in a follow-up article, clustering the paths of the particles, and generating state graphs.
The method itself is incredibly simple:
The algorithm approximates the origin of the shape in each frame, by taking the average over each of the x, y, and z coordinates, separately, for each frame, assuming the overall shape is a sphere, which it obviously isn’t, since there’s some dispersion. However, it doesn’t matter, because the dispersion is roughly symmetrical, producing a meaningful point from which we can measure displacement.
The rate of displacement in this approximated origin is then calculated for each frame, and compared to the true, average displacement over the entire set of points for each frame. If there are N frames, then this process will calculate the true average velocity for each frame, expressed as an N x 3 matrix, and compare that to the approximated average velocity for each frame, also expressed as an N x 3 matrix (note that the x, y, and z velocities are independent).
Approximating the average velocity takes about 2 seconds running on an iMac, and the average error is O(10^(-14)).
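The centroid-and-displacement steps above can be sketched in Python, assuming frames is a list of point clouds, each an N x 3 list of (x, y, z) coordinates (the actual implementation is in Octave and vectorized):

```python
def centroid(points):
    # Approximate the origin of the shape as the per-axis mean of its points.
    n = len(points)
    return tuple(sum(p[axis] for p in points) / n for axis in range(3))

def estimated_velocities(frames):
    # Per-frame displacement of the approximated origin, one 3-vector per
    # consecutive pair of frames (the x, y, and z velocities are independent).
    origins = [centroid(frame) for frame in frames]
    return [tuple(b[axis] - a[axis] for axis in range(3))
            for a, b in zip(origins, origins[1:])]

# Two frames of a 3-point cloud translated by (1, 0, 0) per frame.
frame0 = [(0.0, 0.0, 0.0), (1.0, 0.0, 3.0), (2.0, 3.0, 0.0)]
frame1 = [(p[0] + 1.0, p[1], p[2]) for p in frame0]
assert estimated_velocities([frame0, frame1]) == [(1.0, 0.0, 0.0)]
```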

I’ve long noted the connections between standard deviation, entropy, and complexity, and in fact, my first image clustering algorithm was based upon the connections between these measures.
See Section 2 of my paper:
However, the focus of my work shifted to entropy-based clustering, simply because it is so effective. I’ve since discovered an even simpler, third model of clustering, that is based on counting the number of points in a cluster as you iterate through cluster sizes, and so it is incredibly efficient, since it requires only basic operations to work.
However, it requires a number that allows us to express distance in the dataset, in order to operate. Specifically, the algorithm operates by beginning at a given point in the dataset, and moving out in quantized distances, counting all the points that are within that distance of the original point. Any sufficiently small value will eventually work, but it will obviously affect the number of iterations necessary. That is, if your quantized distance is too small for the context of the dataset, then your number of iterations will be extremely large, causing the algorithm to be slow. If your quantized distance is too big, then you’re going to jump past all the data, and the algorithm simply won’t work.
As noted in the article below this one, on identifying macroscopic objects, I initially used my value “delta”, which you can read about in my original paper linked to above. It is reasonably fast to calculate, but can nonetheless start to take time once your dataset has hundreds of thousands, or millions, of vectors. Since the goal of this round of articles is to build up a model of thermodynamics, I need to be able to quickly process millions of vectors, preferably tens of millions, to get a meaningful snapshot of the microstates of a thermodynamic system.
What I realized this morning is that you can take the difference between adjacent entries in the dataset, after it’s been sorted, and this will give you a meaningful measure of how far apart items in the dataset really are. More importantly, this is a basically instantaneous calculation, which in turn allows my new clustering algorithm to run with basically no preprocessing.
The results are simply astonishing:
Using a dataset of 2,619,033 Euclidean 3-vectors, that together comprise 5 statistical spheres, the clustering algorithm took only 16.5 seconds to cluster the dataset into exactly 5 clusters, with absolutely no errors at all, running on an iMac.
The actual complexity of the algorithm is as follows:
Sort the dataset by row values, and let X_min be the minimum element, and X_max be the maximum element.
Then take the norm of the difference between adjacent entries, Norm(i) = ||X(i) - X(i+1)||.
Let avg be the average over that set of norms.
The complexity is O(||X_min - X_max||/avg), i.e., it's independent of the number of vectors.
This assumes that all vectorized operations are truly parallel, which is probably not the case for extremely large datasets run on a home computer.
However, while I don't know the particulars of the implementation, it is clear, based upon actual performance, that languages such as MATLAB successfully implement vectorized operations in a parallel manner, even on a home computer.
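The preprocessing and expansion steps described above can be sketched in one dimension (a simplification on my part; the actual algorithm operates on vectors and is vectorized in MATLAB):

```python
def quantum(data):
    # Sort, take adjacent differences, and use their average as the
    # quantized step size for the expansion below.
    xs = sorted(data)
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    return sum(gaps) / len(gaps)

def expand_from(seed, data, step, max_steps=100):
    # Counts of points within 1, 2, 3, ... quanta of the seed point; the
    # count plateaus at a cluster boundary until the expansion finally
    # reaches the next cluster.
    return [sum(1 for x in data if abs(x - seed) <= k * step)
            for k in range(1, max_steps + 1)]

data = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]  # two obvious clusters
step = quantum(data)
counts = expand_from(0.0, data, step)
assert counts[0] == 3   # first quantum captures exactly the near cluster
assert counts[-1] == 6  # eventually the expansion captures everything
```

Note how the data-derived step avoids both failure modes mentioned above: it is never so small that the iteration count explodes, and never so large that it jumps past the data.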
My entire A.I. library is attached as well, together with the code specific to this example.

Following up on yesterday’s article, in which I introduced an efficient method for clustering points in massive datasets, below is an algorithm that can actually identify macroscopic objects in massive datasets, with perfect precision. That is, it can cluster points into objects such that the clusters correspond perfectly to the actual objects in the scene, with no classification errors, and exactly the correct number of objects.

In a previous article, I introduced an algorithm that can cluster a few hundred thousand N-dimensional vectors in about a minute or two, depending upon the dataset, by first compressing the data down to a single dimension.
The impetus for that algorithm was thermodynamics, specifically, clustering data expanding about a point, e.g., a gas expanding in a volume. That algorithm doesn’t work for all datasets, but it is useful in thermodynamics, and probably object tracking as well, since it lets you easily identify the perimeter of a set of points.
Below is a full-blown clustering algorithm that can nonetheless handle enormous datasets efficiently. Specifically, attached is a simple classification example consisting of two classes of 10,000, 10-dimensional vectors each, for a total of 20,000 vectors.
The classification task takes about 14 seconds, running on an iMac, with 100% accuracy.
In addition to clustering the data, a compressed representation of the dataset is generated by the classification algorithm, which in turn allows for the utilization of the nearest-neighbor method, an incredibly efficient method for prediction that is, in many real-world cases, mathematically impossible to beat in terms of accuracy.
Said otherwise, even though nearest-neighbor is extremely efficient, with a dataset of this size, it could easily start to get slow, since you are still comparing an input vector to the entire dataset. As a result, this method of clustering allows you to utilize nearest-neighbor on enormous datasets, simply because the classification process generates a compressed representation of the entire dataset.
In the specific case attached below, the dataset consists of 20,000 vectors, and the compressed dataset fed to the nearest-neighbor algorithm consists of just 4 vectors.
Predictions occurred at a rate of about 8,000 predictions per second, with absolutely no errors at all, over all 20,000 vectors.
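The prediction step can be sketched as follows, assuming the compressed representation is a small set of labeled representative vectors (the 4 vectors mentioned above): nearest neighbor runs against the representatives rather than the full dataset, which is what keeps prediction fast at this scale.

```python
def nearest_label(x, representatives):
    # representatives: list of (vector, label) pairs; classify x by the
    # nearest representative under squared Euclidean distance.
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(representatives, key=lambda rl: dist2(x, rl[0]))[1]

# A hypothetical two-class compressed representation.
reps = [((0.0, 0.0), "A"), ((10.0, 10.0), "B")]
assert nearest_label((1.0, -0.5), reps) == "A"
assert nearest_label((9.0, 11.0), reps) == "B"
```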

As noted in previous articles, I’ve turned my attention to applying A.I. to thermodynamics, which requires the analysis of enormous datasets of Euclidean vectors. Though arguably not necessary, I generalized a clustering algorithm I’ve been working on to N-dimensions, and it is preposterously efficient:
The attached example takes a dataset of 5 million, 10-dimensional vectors, and clusters them in about 1.17 seconds, with absolutely no errors.
The example is a simple two class dataset, but nonetheless, the efficiency is simply remarkable, and completely eclipses anything I’ve ever heard of, even my own work.
This is years into my work, and rather than reexplain things, if you’re interested in how this process works, I’d suggest reading my original paper on machine learning and information theory:
This is another instance of the same ideas in information theory, but in this case, making use of radical compression.

In previous articles, I introduced my algorithms for analyzing mass-scale datasets, in particular, clustering, and other algorithms. However, those algorithms deliberately avoided operations on the underlying microstate data, since these datasets are comprised of tens of millions of Euclidean vectors.
I’ve now turned my attention to analyzing changes in thermodynamic states, which would benefit from direct measurements of the underlying point data in the microstates themselves. At the same time, any sensible observation of a thermodynamic system is going to consist of at least tens of thousands, and possibly millions, of observations. My algorithms can already handle this, but they avoid the underlying point data, and instead use compression as a workaround.
In contrast, the algorithm below actually clusters an enormous number of real value points, with radical efficiency, with the intent being to capture changes in position in individual points in thermodynamic systems.
Attached is a simple example that clusters 225,000 real number values in about 120 seconds, running on an iMac –
This is simply ridiculous.
I’m still refining this algorithm, but thought I would share it now, in the interest of staking my claim to it.
The accuracy is just under 100%, though this is a simple example.

Attached is software that can generate a graph that represents the transitions of a thermodynamic system from one state to the next. States that are sequential in time are by definition adjacent, with a directed edge from the initial state to the next state.
For a thermodynamic system, the underlying state graph is likely to be noisy, and mostly disconnected, because a given microstate could of course occur only once, meaning it has only one next-state neighbor.
So in addition, there is code that first clusters the data, into similar states, and then produces a graph based upon items in one cluster transitioning to items in other clusters.
As an example, the attached code uses the expanding gas dataset, over 50 sequences of expansion. So you’d expect the clustering to cause all of the initial states to be clustered together, the later states clustered together, and so on, and this is exactly what happens, just as it did in the previous article:
As a result, the graph produced should be a path connecting the initial cluster to the final cluster, and this is exactly what happens.
I’ll write some code that allows for visualization of the graphs, but for now, you can examine the matrix to get a sense of its structure (see attached):
The integer entries indicate how many items the cluster represented by the row is adjacent to. So entry (1,2) shows that cluster 1 is connected to all 50 states in cluster 2, which is exactly what you’d expect, suggesting that the expanding gas always forms a sequence from one state of expansion to the next.
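The graph construction can be sketched in Python, assuming we already have a cluster label for the state at each time step in each sequence: each consecutive pair of states contributes one directed edge, and entry (i, j) counts transitions from cluster i to cluster j.

```python
def transition_matrix(sequences, num_clusters):
    # sequences: lists of cluster labels, one label per time step.
    M = [[0] * num_clusters for _ in range(num_clusters)]
    for labels in sequences:
        for a, b in zip(labels, labels[1:]):
            M[a][b] += 1  # directed edge: state at time t -> state at t+1
    return M

# Three sequences of an "expanding gas", each passing through clusters
# 0 -> 1 -> 2 in order, as in the example above.
sequences = [[0, 1, 2]] * 3
M = transition_matrix(sequences, 3)
assert M[0][1] == 3 and M[1][2] == 3  # every transition follows the path
assert M[1][0] == 0                   # and no transitions run backward
```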
I’ll follow up later this week, possibly today, with software that then uses these graphs to measure how ordered a thermodynamic system is using graph theoretic measures, such as number of maximal paths, how ordered maximal paths are, etc.
NOTE: I corrected a minor bug in a subroutine related to the dictionary insert function, which is updated and attached.

Attached is a zip-file containing all of my algorithms, as of today, 7-22.

This is just the first post in what will be a torrent of new algorithms I've developed for dealing efficiently with datasets comprised of large numbers of observations.
This set of algorithms implements the dictionary of states referenced in my original article on the topic (see the post below), allowing for radically efficient comparison between very complex observations, in this case reducing a set of observations to a single dictionary of states.
The net effect of this particular set of algorithms (attached below) is to achieve compression, taking a set of complex observations, that could each consist of thousands of Euclidean datapoints (or significantly more), and quickly eliminate duplicate states, reducing the dataset to a set of unique observations.
The example in the code below takes a sequence of observations intended to model a simple example in thermodynamics, of an expanding gas, which is plotted below.
One sequence from the expanding gas dataset is attached as an image to this update.
Each state of the gas consists of 10,000 points in Euclidean space, and there are 10 observations per sequence, with 10 sequences, for a total of 1,000,000 vectors.
Compression is in this case achieved in about 8 seconds, on an iMac.
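The deduplication step can be sketched in Python, assuming each observation is a set of Euclidean points: observations are keyed in a dictionary by a hashable canonical form, so duplicate states collapse to a single entry (the rounding precision here is a hypothetical choice):

```python
def compress(observations, decimals=6):
    # Key each observation by a canonical, hashable form of its point set,
    # so duplicate states (regardless of point order) collapse to one entry.
    seen = {}
    for obs in observations:
        key = tuple(sorted(tuple(round(c, decimals) for c in p) for p in obs))
        seen.setdefault(key, obs)
    return list(seen.values())

state_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
state_b = [(2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
# Three observations, two of which are the same underlying state, with the
# duplicate listing its points in a different order.
unique = compress([state_a, state_b, list(reversed(state_a))])
assert len(unique) == 2
```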
This could be useful on its own, but I will follow up with another set of algorithms sometime this week that allow this initial step to facilitate fast prediction over these types of datasets, which consist of massive numbers of observations, even on consumer devices, where this would otherwise be intractable. I will then follow up with a comprehensive approach to natural language processing using these same algorithms.
The additional code necessary for this example is attached.
The balance of the code is available here:
and here:

In a previous article, I introduced software that can quickly classify datasets consisting of millions of observations. Attached is code that builds upon this software, by embedding mass-scale data in Euclidean space. Specifically, the attached code can take a set of observations in 3-space, and map them to the real line, allowing a large number of observations to be analyzed as a simple set of real numbers. This allows for extremely efficient manipulation after the fact, since you can compress an extremely large number of observations to a single real number.
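One way the embedding step could look, with the mean distance from the origin as a hypothetical choice of embedding (the attached code may well use a different map; this is only to fix ideas):

```python
def embed(observation):
    # Map an observation (a set of points in 3-space) to a single real
    # number: here, the mean distance of its points from the origin.
    return sum((x * x + y * y + z * z) ** 0.5
               for x, y, z in observation) / len(observation)

compact = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
expanded = [(3.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
# Expansion of the point set shows up as movement along the real line.
assert embed(compact) < embed(expanded)
```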
This is just the first part of a larger package of software that I’m working on that is geared towards solving problems in thermodynamics, that will ultimately allow for radically efficient analysis of complex problems in physics on consumer devices.

Shannon’s equation is often used as a measure of uncertainty, and that’s not unreasonable: he provides a mathematical argument as to why it works as a measure of uncertainty in his seminal paper on information theory.
But I’ve introduced a different way of thinking about uncertainty, rooted in partial information, that is quite elegant: as you’re given partial information about a system, your uncertainty approaches zero as a matter of arithmetic.

I was listening to In Utero by Nirvana last night, and I was reminded by how great they were as a band. That album in particular has a really unique, and unusual sound, and is harmonically quite strange. I was also reminded of the fact that Cobain is certainly not the best guitarist in the world. Nonetheless, his songs are legitimately interesting. You could say that this is the result of working well with what you have, and in some sense, this must be true, because this is exactly what he’s doing. It turns out that we can think more rigorously about this distinction between creative brain power, and technical competence, using computer theory.
For example, if I give you a set of notes that are all in the same key, it’s trivial to construct a melody that fits them, and machines can certainly do this. If, however, I give you a set of notes that are not in the same key, then constructing a listenable melody using those notes is non-trivial, and I’m not sure that a machine can do this. This is exactly what Nirvana’s music does, which is to take a set of disparate key signatures, and connect them with listenable melodies that your average listener probably wouldn’t second guess at all. That is, their music generally sounds normal, but harmonically, it’s actually quite interesting. The same is certainly true of Chris Cornell‘s music as well, in that he also makes use of what is effectively multi-tonal music, in a manner that isn’t obvious in the absence of conscious reflection.

I’ve already introduced a new model of A.I. that allows for autonomous real-time deep learning on cheap consumer devices, and below I’ll introduce a new algorithm that can solve classifications over datasets that consist of tens of millions of observations, quickly and accurately on cheap consumer devices. The deep learning algorithms that I’ve already introduced are incomparably more efficient than typical deep learning algorithms, and the algorithm below takes my work to the extreme, allowing ordinary consumer devices to solve classification problems that even an industrial quality machine would likely struggle to solve in any reasonable amount of time when using traditional deep learning algorithms.
Running on a $200 Lenovo laptop, the algorithm correctly classified a dataset of 15 million observations comprised of points in Euclidean 3-space in 10.12 minutes, with an accuracy of 100%. When applied to a dataset of 1.5 million observations, the algorithm classified the dataset in 52 seconds, again with an accuracy of 100%. As a general matter, the runtimes suggest that this algorithm would allow for efficient processing of datasets containing hundreds of millions of observations on a cheap consumer device, but Octave runs out of memory at around 15 million observations, so I cannot say for sure.

The software that powers my deep learning engine Prometheus is already public, and so are the related academic papers, but putting it all in one place, in plain English, adds value, particularly to commercially interested non-experts. And so I’ve put together a short pitch deck, attached below. Feel free to send this around to anyone that might be interested in investing in the product, or leasing a commercial version of the software.

Carl Jung had a big influence on me when I was in college, and though these days I stick to more practical psychological considerations for application to A.I., when I was 20 years old, I was willing to entertain some radical thinking. And this is not to suggest that Jung’s ideas on psychology aren’t practical; in fact, quite the opposite: personally, I believe that species memory, archetypal thinking, and a “shadow” animal nature are all very real aspects of the human condition. But Jung was unafraid to explore mystical thinking, in a manner and on a scale that is arguably unfashionable for a storied, and relatively contemporary, figure in the history of Western thought.
In particular, he had this idea of “synchronicity”, which I think I’ve reduced to mathematics, though I’ve called it coincidence, to dress it down a bit.

I introduced an abstraction algorithm a while back that can take a dataset, and generate a representative abstraction of the dataset using a nearest neighbor technique. We can take this process a step further, to generate a set of representations for a given dataset. Specifically, if any of the elements are not within delta of the abstraction, then we gather those elements up into a new dataset. We then generate a new abstraction for that leftover set. We repeat this process until every element of the original dataset is within delta of an abstraction. The number of abstractions needed to do this can serve as a measure of the complexity of the dataset, since the more abstractions you need, the more unique elements there are. It’s similar to the idea of a basis in linear algebra: a minimal set of representative elements that allows me to express the fundamental elements of a dataset. We could also calculate the value of delta by iterating through different values, repeating this process, and selecting the value that causes the entropy of the categorization structure to change the most, treating each element associated with a given abstraction as a category. Here’s some updated code, implementing the algorithm described above. Simply place this code in a loop through values of delta to use this approach to generate an independent value of delta, though this will be more time consuming than the approach I've used, which is to use my categorization algorithm to generate delta, and then apply the abstraction algorithm.

Attached is an updated abstraction algorithm that generates an abstraction of a dataset. This particular version is keyed to the MNIST dataset, but the approach is a general one.
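The iterated abstraction process described above can be sketched in one dimension, with a representative element standing in for the nearest-neighbor abstraction (a simplification on my part; the attached code operates on full datasets):

```python
def abstraction_set(data, delta):
    # Repeatedly take an abstraction (here, a representative element),
    # gather the elements not within delta of it into a leftover set, and
    # recurse until every element is within delta of some abstraction.
    abstractions = []
    remaining = list(data)
    while remaining:
        a = remaining[0]  # representative element as the abstraction
        abstractions.append(a)
        remaining = [x for x in remaining if abs(x - a) > delta]
    return abstractions

# Two tight groups: one abstraction per group suffices, so the count of
# abstractions acts as a complexity measure for the dataset.
data = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
assert abstraction_set(data, delta=0.5) == [0.0, 10.0]
# With a tiny delta, every element is "unique" and needs its own abstraction.
assert len(abstraction_set(data, delta=0.05)) == len(data)
```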
I've updated it to iterate through all possible abstractions, identifying the one that is within delta of the most items from the dataset, which should therefore correspond to the abstraction that is most representative of the dataset.

My theory of art is a practical theory rooted in basic psychology and information theory. The primary tools I use to think about a piece are mathematics, to analyze the structure of the piece, and basic distinctions between exogenous signals, internal biological responses, and the psychological associations that follow. It’s probably closer to how a copywriter, psychologist, or propagandist would think about art than, say, a literary theorist would.

Attached is an updated version of an algorithm I introduced in a previous post that can imitate a dataset, thereby expanding the domain of a dataset. This version is incomparably more efficient than the previous version, since it uses my new categorization and prediction algorithms. In the example I’ve attached, which imitates the shape of a 3D vase, the algorithm generates approximately 140 new points per second, and copies the entire shape in about 11 seconds, running on a cheap laptop. Attached are images showing the original shape, the point information, and the replicated shape generated by the algorithm.

Attached is an algorithm that can abstract shapes from a dataset of Euclidean objects. I've applied it to the MNIST dataset. I'm still working on this, but the initial results are quite good, and the speed is excellent.

I did some work while I was in Stockholm on distributions over countable sets, but stopped in the interest of finishing up what seemed to be, at the time, more important work in physics and A.I. But yesterday, I realized that you can construct a simple model of probability on countable sets using the work that I had abandoned.
Following up on my previous post, attached is a fully vectorized version of my original prediction function, which is extremely efficient, and basically instantaneous. Also attached is a command line script that demonstrates how to use it.

Attached is a completely vectorized, highly efficient version of my original clustering algorithm. Its runtime is drastically shorter, and comparable to my real-time algorithms, though it is based upon my original technique of iterating through different levels of discernment until we find the level that generates the greatest change in the entropy of the categorization. Also attached is a command line script that demonstrates how to apply it to a dataset.

Here's a polynomial-time algorithm that can identify, and then track, objects in three-space over time, given 3-D point information from a sensor, or other input. Because the algorithm tracks at the point level, rather than at an object level (i.e., it builds its own objects from the point data), it can be used to track objects that have irregular, and even diffuse, shapes, like smoke, or steam. Running on a $200 laptop, it tracked 15 Newtonian projectiles in about 8 seconds, with an accuracy of about 98% to 100%.
All of the other functions you need can be found in my full library here:

I've updated Prometheus to include autonomous noise filtering.
This means that you can give it data that has dimensions that you're not certain contribute to the classification, and might instead be noise. This allows Prometheus to take datasets that might currently produce very low accuracy classifications, and autonomously eliminate dimensions until it produces accurate classifications.
It can handle significant amounts of noise:
I've given it datasets where 50% of the dimensions were noise, and it was able to uncover the actual dataset within a few minutes.
In short, you can give it garbage, and it will turn it into gold, on its own.
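As a rough illustration of the idea only (this is my own sketch, not the attached Prometheus code, and all function names are hypothetical), autonomous noise filtering can be framed as a greedy search that drops whichever dimension's removal most improves leave-one-out nearest-neighbor accuracy:

```python
import numpy as np

def loo_nn_accuracy(X, y):
    """Leave-one-out 1-nearest-neighbor accuracy, fully vectorized."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)  # exclude each point from its own neighborhood
    return np.mean(y[D.argmin(axis=1)] == y)

def filter_noise(X, y):
    """Greedily drop the dimension whose removal most improves accuracy,
    stopping when no removal helps."""
    dims = list(range(X.shape[1]))
    best = loo_nn_accuracy(X, y)
    while len(dims) > 1:
        trials = [(loo_nn_accuracy(X[:, [i for i in dims if i != d]], y), d)
                  for d in dims]
        acc, d = max(trials)
        if acc <= best:
            break
        best, dims = acc, [i for i in dims if i != d]
    return dims, best
```

On a dataset where one dimension carries the signal and another is pure noise, the search recovers the informative dimension and the accuracy jumps accordingly.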
I've proven that it's basically mathematically impossible to beat nearest neighbor using real-world Euclidean data:
And since I've come up with a vectorized implementation of nearest neighbor, this version of Prometheus uses only nearest neighbor-based methods.
As a result, the speed is insane.
If you don't use noise filtering, classifications occur basically instantaneously on a cheap laptop. If you do have some noise, it still takes only a few minutes for a dataset of a few hundred vectors to be processed, even on a cheap laptop.
Attached is my full library as a zip file, including the new Prometheus.
Also attached is a command line script that demonstrates how to use the program.
Enjoy, and if you're interested in a commercial version, please let me know.

I’ve been using nearest neighbor quite a bit lately, and its accuracy is remarkable; I’ve been trying to understand why.
It turns out that you can prove that for certain datasets, you actually can’t do better than the nearest neighbor algorithm:
Assume that for a given dataset, classifications are constant within a radius of R of any data point.
Also assume that each point has a neighbor that is within R.
That is, if x is a point in the dataset, then there is another point y in the dataset such that ||x-y|| < R.
In plain terms, classifications don’t change unless you travel further than R from a given data point, and every point in the dataset has a neighbor that is within R of that point.
Now let’s assume we’re given a new data point from the testing set that is within the boundaries of the training set.
By the density assumption, the new point is within R of some point from the training set, and by local consistency, it has the same class as that point.
This proves the result.
This means that given a sufficiently dense, locally consistent dataset, it is mathematically impossible to make predictions that are more accurate than nearest neighbor, since it will be 100% accurate in this case.
Unless you’re doing discrete mathematics, which can be chaotic (e.g., nearest neighbor obviously won’t work for determining whether a number is prime), your dataset will probably be locally consistent for a small enough value of R.
And since we have the technology to make an enormous number of observations, we can probably produce datasets that satisfy the criteria above.
The inescapable conclusion is that if you have a sufficiently large number of observations, you can probably achieve a very high accuracy simply using nearest neighbor.
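The argument above is easy to check numerically. The following sketch (my own, not the attached code) samples a smooth two-class region densely enough that every point has a nearby same-class neighbor, and shows that plain vectorized 1-NN is then nearly perfect:

```python
import numpy as np

rng = np.random.default_rng(1)

# A "locally consistent" labeling: class changes only across a smooth boundary.
def label(p):
    return int(p[0] ** 2 + p[1] ** 2 < 1.0)  # inside vs. outside the unit circle

# Sample densely so that typical nearest-neighbor distances are far smaller
# than the scale over which the class changes.
train = rng.uniform(-2, 2, size=(5000, 2))
y_train = np.array([label(p) for p in train])
test = rng.uniform(-2, 2, size=(500, 2))
y_test = np.array([label(p) for p in test])

# Vectorized 1-nearest-neighbor prediction over the whole testing set at once.
D = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=2)
pred = y_train[D.argmin(axis=1)]
acc = np.mean(pred == y_test)
```

The only errors occur in the thin band around the class boundary where a testing point's nearest neighbor lies on the other side; as the training set grows, that band shrinks toward zero.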

Attached is a very stripped down image partition algorithm.

Attached is another version of my delimiter algorithm, that iterates through different grid sizes, selecting the grid size that generates the greatest change in the entropy of the resultant image partition.
It's essentially a combination of my original image partition algorithm and my image delimiter algorithm. I'm still tweaking it, but it certainly works as is.

The electromagnetic spectrum appears to be arbitrarily divisible. This means that for any subset of the spectrum, we can subdivide that subset into an arbitrarily large number of intervals. So, for example, we can, as a theoretical matter, subdivide the visible spectrum into as many intervals as we'd like. This means that if we have a light source that can emit a particular range of frequencies, and a dataset that we'd like to encode using frequencies of light from that range, we can assign each element of the dataset a unique frequency.
So let's assume we have a dataset of N items, and a light source that is capable of emitting at least N unique frequencies of light. It follows that we can encode the dataset using the frequencies of light emitted by the light source, using a simple mapping, where vector x_i is represented by some unique frequency f_i. So in short, if we want to represent the observation of vector x_i, we could cause the light source to generate frequency f_i. Because laser and interferometer technology is very sensitive, this shouldn't be a problem for most practical purposes, and we can probably represent fairly large datasets this way.
If we shine two beams of light, each with their own frequency, through a refractive medium, like a prism, then the exiting angle of those beams will be different. As a result, given an input vector, and a dataset of vectors, all of which are then encoded as beams of light, it should be very easy to find which vector from the dataset is most similar to the input vector, using some version of refraction, or perhaps interferometry. This would allow for nearest-neighbor deep learning to be implemented using a computer driven by light, rather than electricity.
I've already shown that vectorized nearest-neighbor is an extremely fast and accurate method of implementing machine learning. Perhaps by using a beam-splitter, we can simultaneously test an input vector against an entire dataset, which would allow for something analogous to vectorization using beams of light.
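As a toy illustration of the encoding step only (the refraction-based comparison would happen in hardware; the names and the linear frequency spacing here are my own assumptions), the mapping from dataset indices to unique visible-band frequencies might look like this:

```python
import numpy as np

def assign_frequencies(n, f_min=430e12, f_max=750e12):
    """Assign each of n dataset items a unique frequency in the visible band
    (roughly 430 to 750 THz), linearly spaced for simplicity."""
    return np.linspace(f_min, f_max, n)

def decode(f_observed, freqs):
    """Recover the dataset index from an observed frequency, tolerating
    measurement error smaller than half the spacing between frequencies."""
    return int(np.argmin(np.abs(freqs - f_observed)))
```

For N = 1000 items, the spacing is about 320 GHz per item, so even substantial drift in the emitted or measured frequency decodes to the correct index.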

Attached is yet another algorithm, based upon my real-time clustering algorithm, that is capable of generating multiple independent predictions. Unlike the rest of my real-time algorithms, there is significant training time, though the actual prediction function is very fast, and comparable to my other real-time prediction algorithms. The command line code is in "9-23-Notes".
Also attached are my real-time clustering algorithms, slightly tweaked.

Attached is a script that calls my real-time prediction functions to generate a series of independent predictions.
This is probably a solid base for applying other machine learning techniques to extract more information from these predictions.
Also attached is a rough cut at a function that applies some analysis to the resulting vectors of predictions (“layered filter prediction”).

In a previous paper, I introduced a new model of artificial intelligence rooted in information theory that can solve deep learning problems quickly and accurately in polynomial time. In this paper, I'll present another set of algorithms that are so efficient, they allow for real-time deep learning on consumer devices. The obvious corollary of this paper is that existing consumer technology can, if properly exploited, drive artificial intelligence that is vastly more powerful than traditional machine learning and deep learning techniques.
Attached are all of my algorithms as of today, 9/17/2019.
Though also included in the zip file, for convenience, I've separately attached the code for my most recent research paper:

I can't tell if this is a problem that only I'm experiencing, but it seems that the attachments to the posts are not visible. This site has done some strange things in the past, so I'm assuming it's temporary.
Charles

Attached is an algorithm that does real-time deep learning.
The “observations” input is the dataset, and it is meant to simulate a buffer: as data is read from it, that data is used to build a training dataset, from which predictions are generated. However, once the accuracy of the predictions meets or exceeds the value of “threshold”, the algorithm stops learning and only makes predictions. If the accuracy drops below “threshold”, then learning begins again. “N” is the dimension of the dataset, above which data is ignored.
Also attached is a command line script demonstrating how to use the algorithm.
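For readers without the attachment, the gated loop described above can be sketched as follows. This is a minimal stand-in, not the attached code: it uses 1-NN as the prediction function, and the names and the warmup parameter are my own:

```python
import numpy as np

def realtime_learn(observations, labels, threshold=0.95, warmup=10):
    """Read items from a simulated buffer; predict each with 1-NN over the
    data accumulated so far; stop growing the training set once the running
    accuracy meets `threshold`, and resume if it drops back below."""
    train_X, train_y, preds = [], [], []
    correct = seen = 0
    for x, y in zip(observations, labels):
        if train_X:
            d = np.linalg.norm(np.array(train_X) - x, axis=1)
            p = train_y[int(d.argmin())]
            preds.append(p)
            correct += int(p == y)
            seen += 1
        acc = correct / seen if seen else 0.0
        if seen < warmup or acc < threshold:  # learn only while below threshold
            train_X.append(x)
            train_y.append(y)
    return preds, len(train_X)
```

On a cleanly separable stream, the training set stops growing early, and the frozen model keeps predicting correctly thereafter.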

Attached is an algorithm that does real-time video classification, at an average rate of approximately 3 frames per second, running on an iMac.
Each video consists of 10 frames of HD images, roughly 700 KB per frame. The individual unprocessed frames are assumed to be available in memory, simulating reading from a buffer.
This algorithm requires no prior training, and learns on the fly as new frames become available.
The particular task solved by this algorithm is classifying the gestures in the video:
I raise either my left-hand or my right-hand in each video.
The accuracy in this case is 97%, in that the algorithm correctly classified 43 of the 44 videos.
Though the algorithm is generalized, this particular instance could obviously be used to do gesture classification in real-time, allowing for human-machine interactions to occur, without the machine having any prior knowledge about the gestures that will be used.
That is, this algorithm can autonomously distinguish between gestures in real-time, at least when the motions are sufficiently distinct, as they are in this case. By pre-loading a delimiter gesture, we can train the algorithm on an effectively unlimited number of gestures.
This is the same classification task I presented here:
The only difference is that in this case, I used the real-time prediction methods I've been working on, and not Prometheus.
The image files from the video frames are available here:
Though there is a testing and training loop in the code, this is just a consequence of the dataset, which I previously used in a supervised model. That is, predictions are based upon only the data that has already been observed, and not the entire dataset.
Note that you'll have to adjust the file paths in the script for your machine.
The time-stamps printed to the command line represent the amount of time elapsed per video classification, not the amount of time elapsed per frame. Simply divide the time per video by 10 to obtain the average time per frame.
This approach doesn't monitor accuracy on a running basis, but it's trivial to do so, which would allow us to stop adding new data to the dataset once the desired accuracy is reached. We could then turn learning back on if the accuracy slips. We could also easily create a buffer for predictions that are expected to be wrong - there's a measure of difference generated each time the prediction function is called, so when this increases too much, we can pause prediction and learn a bit more.
In short, this is an extremely accurate, real-time learning algorithm that can probably do pretty much anything, on the fly. Even trivial deviations on this theme will allow for powerful applications to real-world problems.

Attached is a short script that extracts the background from an image. I'm still tweaking it.

Attached is a script that allows for real-time function prediction.
Specifically, it can take in a training set of millions of observations, and an input vector, and immediately return a prediction for any missing data in the input vector.
Running on an iMac, using a training set of 1.5 million vectors, the prediction algorithm had an average run time of 0.027 seconds per prediction.
Running on a Lenovo laptop, also using a training set of 1.5 million vectors, the prediction algorithm had an average run time of 0.12268 seconds per prediction.
Note that this happens with no training beforehand, which means that the training set can be updated continuously, allowing for real-time prediction.
So if our function is of the form z = f(x,y), then our training set would consist of points over the domain for which the function was evaluated, and our input vector would be a given (x,y) pair within the domain of the function, but outside the training set.
We could, therefore, use the main prediction algorithm (or the real-time classification prediction algorithm I presented in the posts below) to make predictions on the fly as the training set gets built up, and then stop reading the training data once the desired accuracy is reached. We could then turn reading the training data on and off as predictions vary in accuracy on the fly.
I've attached a command line script that demonstrates how to use the algorithm, applying it to a sin curve in three-space (see "9-6-19NOTES").
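A minimal sketch of the idea (my own, not the attached algorithm): predict the missing z-coordinate of an input (x, y) as the z of the nearest training point in the plane. A smooth sinusoidal surface stands in here for the sin example in the attachment:

```python
import numpy as np

rng = np.random.default_rng(2)

# Training set: points over the domain at which the function was evaluated.
xy = rng.uniform(0, 2 * np.pi, size=(100000, 2))
z = np.sin(xy[:, 0]) * np.cos(xy[:, 1])  # the surface being learned
train = np.column_stack([xy, z])

def predict_z(query_xy, train):
    """Return the z of the training point nearest to the query in the plane.
    No training beforehand: `train` can be appended to continuously."""
    d = np.linalg.norm(train[:, :2] - query_xy, axis=1)
    return train[int(d.argmin()), 2]

q = np.array([1.0, 2.0])  # an (x, y) pair outside the training set
err = abs(predict_z(q, train) - np.sin(1.0) * np.cos(2.0))
```

Because the function is smooth and the training set is dense, the nearest neighbor's z is within a small error of the true value, and the distance computation is a single vectorized pass.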

Attached is a stripped down version of Prometheus that runs in less than one second on a consumer device.
It's essentially the same program from the user's perspective: there's a GUI, it normalizes data, and does machine learning. And the accuracy is comparable to the previous version.
This would be perfect for a non-data scientist looking to solve machine learning and deep learning classifications quickly, without access to a high-power machine.
The main difference is the engine, which is a stripped down version intended for real-time prediction, and the fact that it does not do function prediction, which is probably not important unless you're actually a scientist.
For how to use Prometheus Express, you can follow the examples given here:
Also attached is my full AI library as a zip file.

Following up on the post below, attached is an algorithm that can generate a cluster for a single input vector in a fraction of a second. See "no_model_optimize_cluster".
This will allow you to extract items that are similar to a given input vector without any training time, basically instantaneously.
Further, I presented a related hypothesis that there is a single objective value that warrants distinction for any given dataset in the following research note:
To test this hypothesis again, I've also attached a script that repeatedly calls the clustering function over an entire dataset, and measured the norm of the difference between the items in each cluster.
The resulting difference appears to be very close to the value of delta generated by my categorization algorithm, providing further evidence for this hypothesis.

Attached is a script that can solve machine learning classifications with no model whatsoever.
Its runtime is effectively instantaneous, even on a consumer laptop, and generally between 0.002 and 0.008 seconds for a dataset of a few hundred vectors.
Its accuracy is as follows for the following UCI datasets:
• Ionosphere Dataset: 83.774%
• Iris Dataset: 94.348%
• Wine Dataset: 91.852%
• Parkinsons Dataset: 83.667%
The attached "no_model_ML" contains all the command line code you need to run and apply the algorithm to the datasets, which are courtesy of the UCI Machine Learning Repository:
The algorithm consists of just 27 lines of code.
My boundary detection algorithm runs in between 1 and 2 seconds per image, which means that by combining this algorithm with my boundary detection algorithm, we can classify HD video at a rate of about 1 frame every two seconds, on a consumer device. That is, the run time of the actual classification algorithm is negligible, meaning that we can build a dataset on the fly and classify each frame as soon as it's ready.
This would allow for real-time gesture detection on cheap devices.
We can also use this approach to generate clusters, by iteratively selecting best fit predictions using this algorithm, and then assembling those matches into a cluster associated with a given input vector.
I've considered two approaches, one of which is attached. The attached approach ("no_model_generate_cluster") seeks to maximize the number of unique correct items in each category.
Another approach would be to select the final clustering by measuring the rate of change in entropy over each cluster generated, selecting the cluster that is the result of the greatest change in entropy.
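As a sketch of the clustering idea (this is my own stand-in, not the attached "no_model_generate_cluster": it cuts at the largest jump in sorted distance, a simple proxy for the entropy-based stopping rule, and the function name is hypothetical):

```python
import numpy as np

def generate_cluster(x, data):
    """Greedily pull the nearest remaining items into the cluster around x,
    then cut at the largest jump in distance, on the theory that a sharp
    jump marks the boundary of the natural cluster."""
    d = np.linalg.norm(data - x, axis=1)
    order = d.argsort()          # dataset indices, nearest first
    jumps = np.diff(d[order])    # gaps between consecutive sorted distances
    cut = int(jumps.argmax()) + 1  # keep everything before the biggest gap
    return order[:cut]
```

Given two well-separated clouds of points, a query near one cloud recovers exactly that cloud's indices, since the gap between the clouds dominates every within-cloud gap.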