Content uploaded by Charles Davi

Author content

All content in this area was uploaded by Charles Davi on Oct 17, 2021

Content may be subject to copyright.

Information, Knowledge, and Uncertainty

Charles Davi

October 16, 2021

Abstract

Below we present a fundamental equation of epistemology that relates information, knowledge, and uncertainty, and apply that equation to random variables using information theory. Also attached is software that applies the results to deep learning classification problems, with a discussion of that application included below.^1

^1 Any code not attached can be found in my full A.I. library, available on ResearchGate. The dataset is courtesy of Harvard University, and is available here


1 Introduction

Italian mathematicians, including Luca Pacioli, developed a system of accounting based upon the following fundamental equation of accounting:

\[ A = E + L, \tag{1} \]

where $A$ is assets, $E$ is the owner’s equity in those assets, and $L$ is the owner’s liabilities with respect to those assets. The intuition for Equation (1) is that all money is either yours, or someone else’s. Therefore, when purchasing an asset, you are using either your money (represented as equity in the asset, $E$), or someone else’s money (represented as a liability to another, $L$). You cannot prove the equation beyond the observation of the tautology that all money is either yours, or someone else’s. Equation (1) is an instance of the more general tautology that all things are either in a given set, or not in that set.^2

Similarly, we introduce an equation of epistemology rooted in another instance of this tautology: all that can be known about a given system is either known to you, or unknown to you. Expressed symbolically, if $I$ is the measure of all that can be known about a given system, $K$ is the measure of your knowledge of the system, and $U$, uncertainty, is the measure of what is unknown to you about the system, then it must be the case that,

\[ I = K + U. \tag{2} \]

Information theory allows us to take this simple accounting, and apply it rigorously, by calculating specific values for $I$, $K$, and $U$, based upon the properties of a given system.

2 Basic Application

Imagine you’re given a series of boxes labelled 1 through $N$, exactly one of which contains a pebble, though which box contains the pebble is unknown to you. In reality, you can do many things with a set of boxes, including stacking them upon each other, shuffling them about, removing the lids, etc., but for our purposes, we are interested only in the location of the pebble. As a consequence, the only actions that change the state of the system in question are those that move the pebble from one box to another. Because there are $N$ boxes, and only 1 pebble, the system has exactly $N$ states, each defined by a unique location of the pebble.

^2 Note that there are no paradoxes if you assume the universal set does not contain any sets, and contains only individual elements.

Because the system has $N$ states, this treatment of the set of boxes and single pebble can be used to store exactly $\log(N)$ bits, where $\log$ denotes the base-2 logarithm throughout. For the same reason, at most $\log(N)$ bits are required to specify any given state of the system, with each state of the system uniquely characterized by a unique binary string of length $\log(N)$ (rounded up to a whole number of bits when $N$ is not a power of 2). Returning to Equation (2) above, because our inquiry is limited to the location of the pebble, the most you can ever know about the system is the location of the pebble, which requires at most $\log(N)$ bits to specify. As such, it must be the case that,

\[ I = \log(N). \]

If you know absolutely nothing about the possible location of the pebble, then it must be the case that your knowledge, $K$, is zero. This in turn implies that, ex ante,

\[ I = U = \log(N). \]
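The ex ante accounting can be sketched in a few lines of Octave. The value N = 8 and the variable names are illustrative assumptions, not taken from the text or the attached code; the point is only that each of the N states receives a distinct binary label of log(N) bits, and that with K = 0 the uncertainty equals the total information.

```octave
% A minimal sketch of the ex ante case, assuming N = 8 boxes (illustrative).
N = 8;                                  % number of boxes, i.e., states of the system
num_bits = ceil(log2(N));               % bits needed to label every state
codes = dec2bin(0 : N - 1, num_bits);   % one row per state, each a distinct binary string

I = log2(N);                            % all that can be known, in bits
K = 0;                                  % nothing is known ex ante
U = I - K;                              % Equation (2) rearranged

printf("I = %g bits, K = %g bits, U = %g bits\n", I, K, U);
disp(codes);
```

Choosing a power of 2 for N keeps the label length equal to log(N) exactly; for other values the label length is the next whole number of bits.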

Now instead assume that you’re told ex ante, from a perfectly reliable source, that the pebble is not in the first box, but you otherwise have no information about the location of the pebble. Upon receipt of this information, you are now considering a system that is equivalent to one comprised of $N-1$ boxes and a single pebble, since you know with certainty that the pebble is not in the first box. A system comprised of $N-1$ boxes and a single pebble can be in exactly $N-1$ states, which requires at most $\log(N-1)$ bits to represent. Because the original system is, upon receipt of this information, equivalent to a system comprised of $N-1$ boxes and a single pebble, it must be the case that your uncertainty is exactly what it would be when considering a system of $N-1$ boxes and a single pebble for the same purpose. This implies that, upon receipt of the information,

\[ \log(N) = K + \log(N-1), \]

and so,

\[ K = \log(N) - \log(N-1). \]

As a result, your knowledge is non-zero, but still incomplete. This example also implies the following definition:


Knowledge is information that reduces uncertainty.^3
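The pebble example can be checked numerically. The sketch below assumes base-2 logarithms (so results are in bits) and uses N = 8 and the hypothetical variable name num_excluded purely for illustration; neither appears in the text.

```octave
% A minimal sketch of the pebble example, assuming base-2 logarithms.
% N = 8 and num_excluded are illustrative assumptions.
N = 8;                        % number of boxes
num_excluded = 1;             % boxes ruled out by a perfectly reliable source

I = log2(N);                  % total information: log(N)
U = log2(N - num_excluded);   % remaining uncertainty: log(N - 1)
K = I - U;                    % knowledge, from I = K + U

printf("I = %.4f, U = %.4f, K = %.4f bits\n", I, U, K);

% Receiving the same message a second time leaves U, and therefore K, unchanged.
U_repeat = log2(N - num_excluded);
K_repeat = I - U_repeat;
printf("K after a repeated message = %.4f bits\n", K_repeat);
```

Ruling out one box of eight yields only a small amount of knowledge, and a repeated message adds none, which is the sense in which knowledge is tied to a reduction in uncertainty.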

3 The Uncertainty of a Random Variable

Imagine you’re told the pebble is equally likely to be in any one of the $N$ boxes, which implies a uniform distribution. In the first example above, you were told nothing about the distribution ex ante, other than that the pebble is in one of the $N$ boxes, and in that case, your knowledge was therefore zero. In this case, we have the additional information that the distribution over the possible outcomes is uniform. However, if you know nothing about the distribution, then the only distribution consistent with knowing nothing about the distribution is a uniform distribution, since it treats all outcomes as equally likely, whereas any other distribution will imply a pair of probabilities, $p_i$ and $p_j$, for which $p_i < p_j$. Because you have no knowledge about the system, you have no basis to assume that one probability differs from any other. As a result, in the absence of information about the system, assuming a uniform distribution is the only option, assuming there is a distribution in the first instance.^4

This leads to the following proposition:

Proposition 3.1. When your knowledge about a system is zero, your ex ante expectation as to the distribution of states of the system is given by the uniform distribution, and moreover, your uncertainty is given by,

\[ I = U = \log(N), \tag{3} \]

where $N$ is the number of states of the system.

Now imagine that you have two sequences of observations, generated by two sources, $A = [a, a, a, b, b]$ and $B = [a, a, a, b, c]$. In both cases, the modal event is $a$, with a probability of $\frac{3}{5}$. However, in the latter case of $B$, there’s an additional possibility of state $c$, that is not present in $A$. In both cases, the rational prediction is $a$, as the most likely state, and moreover, because the probability of $a$ is the same in both cases, the expected number of errors is the same, for any given number of observations. Intuitively, the greater multiplicity of $B$ creates greater uncertainty, for the simple reason that more states are apparently possible in the case of $B$. As a result, $B$ presents greater uncertainty than $A$, despite the most likely outcome being the same, with the same probability, in both cases.

^3 Note that if you receive the same message twice, the second message will not convey any knowledge, since your uncertainty will be unchanged, given receipt of the first message.

^4 Imagine instead that you’re told there is no distribution, in that the distribution is unstable, and changes over time. For example, imagine that the position of the pebble is deliberately selected so that exactly this occurs over time. Even in this case, you have no reason to assume that one box is more likely to contain the pebble than any other. You simply have additional information, that over time, recording the frequency with which each box contains the pebble will not produce any stable distribution. As a result, ex ante, your expectation is that each box is equally likely to contain the pebble, producing a uniform distribution, despite knowing that the actual observed distribution cannot be a uniform distribution, since you are told beforehand that there is no stable distribution at all.

In the case where you don’t know the possible states of a system, your knowledge is again zero, but your uncertainty is infinite, because the number of states is, from your perspective, unknown, resulting in an infinite set of possible states.

It turns out that we can measure the uncertainty of a probability distribution precisely using an equation presented by Claude Shannon in 1948 [1],^5 that provides a lower bound on the average code length that can be used to encode a source, as a function of its distribution of states. Specifically,

\[ H = \sum_{i=1}^{k} p_i \log\left(\frac{1}{p_i}\right), \tag{4} \]

where the distribution of the states of the source is given by $\{p_1, \ldots, p_k\}$. Said in words, if a source has a particular distribution of states, then the minimum average number of bits required to encode a single state of the source is given by Equation (4).
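Equation (4) can be applied directly to the two sources A and B described above. The sketch below assumes base-2 logarithms and takes the observed frequencies as the distribution; the name shannon_entropy is a hypothetical helper, not part of the attached code.

```octave
% A minimal sketch of Equation (4), assuming base-2 logarithms (bits).
% shannon_entropy is a hypothetical helper name.
shannon_entropy = @(p) sum(p .* log2(1 ./ p));

p_A = [3 2] / 5;     % A = [a a a b b]: frequencies of a and b
p_B = [3 1 1] / 5;   % B = [a a a b c]: frequencies of a, b, and c

H_A = shannon_entropy(p_A);
H_B = shannon_entropy(p_B);

% Both sources share the modal probability 3/5, but B has the higher entropy.
printf("modal probability: A = %.2f, B = %.2f\n", max(p_A), max(p_B));
printf("H(A) = %.4f bits, H(B) = %.4f bits\n", H_A, H_B);
```

The helper expects strictly positive probabilities; states that were never observed should be left out of the vector rather than passed as zeros.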

A lesser known result in that same paper proves that equations of the form,

\[ H = C \sum_{i=1}^{k} p_i \log\left(\frac{1}{p_i}\right), \]

where $C$ is some positive constant, are the only equations that satisfy three primordial assumptions about the uncertainty of a distribution.^6 As a result, we can interpret Equation (4) as providing not only a lower bound on the average amount of information required to encode a source, but also a measure of the uncertainty of the distribution of the states of the source. Note that Equation (4) is maximized for the uniform distribution, and as such, for a system with $N$ states, uncertainty, as measured by Equation (4), is in that case given by $\log(N)$, which is consistent with Equation (3) above.
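The consistency with Equation (3) follows in one line. Substituting the uniform distribution $p_i = 1/N$ into Equation (4), with $k = N$ states:

```latex
\[
H = \sum_{i=1}^{N} \frac{1}{N} \log\left(\frac{1}{1/N}\right)
  = N \cdot \frac{1}{N} \log(N)
  = \log(N).
\]
```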

^5 See, “A Mathematical Theory of Communication”.

^6 See Section 6 of [1]: 1. The function is continuous over all possible probabilities; 2. If all the probabilities $\{p_1, \ldots, p_k\}$ are equal, then the function should increase as a function of $k$; 3. The total uncertainty of sequential outcomes is given by the weighted sum of the individual uncertainties, where the weight is determined by the frequency of each outcome. As a consequence, for a series of independent outcomes, the total uncertainty is given by the sum of the individual uncertainties.

As a general matter, consistent with Shannon’s proof, we have the following proposition:

Proposition 3.2. The average uncertainty of a random variable over a distribution $\{p_1, \ldots, p_k\}$ is given by,

\[ U = H = \sum_{i=1}^{k} p_i \log\left(\frac{1}{p_i}\right). \tag{5} \]

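Note that a string generated by $N$ independent observations of a system with $k$ possible states itself has $k^N$ possible states. It follows in that case,

\[ I = N \log(k). \]

Because the total uncertainty of $N$ independent observations is the sum of the individual uncertainties, $N H$, substituting into Equation (2) yields the following proposition:

Proposition 3.3. Your total knowledge of a system with a distribution $\{p_1, \ldots, p_k\}$, given $N$ observations, is given by,

\[ K = N \log(k) - N H. \tag{6} \]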
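As a rough illustration of Proposition 3.3, the sketch below computes I, U, and K for a short sequence of observations. It assumes base-2 logarithms, states labelled 1 through k, and that the distribution is the observed frequency of each state; the sequence and the value of k are made up for the example.

```octave
% A minimal sketch of Equation (6). The observations and k are illustrative.
k = 4;                                            % number of possible states
observations = [1 1 1 1 2 2 3 1];                 % N observations, each in 1..k
N = numel(observations);

counts = accumarray(observations(:), 1, [k 1]);   % frequency of each state
p = counts(counts > 0) / N;                       % drop unobserved states (their term is taken as 0)
H = sum(p .* log2(1 ./ p));                       % Equation (5)

I = N * log2(k);                                  % total information in N observations
U = N * H;                                        % total uncertainty
K = I - U;                                        % Equation (6)

printf("I = %.4f, U = %.4f, K = %.4f bits\n", I, U, K);
```

Two boundary cases are worth keeping in mind: if every observation is the same, H = 0 and K = I, and if the observed frequencies are uniform over all k states, H = log(k) and K = 0.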

4 Interpretation

We derived the value of $K$ from tautologies and a theorem, and so if you accept all assumptions made along the way, then you must accept Equation (6). That is, as a matter of logic, if you accept Shannon’s assumptions regarding uncertainty, then you must accept Equation (4) as a measure of uncertainty. Moreover, because $I$ is fixed for any system, and since its value is equal to $U$ in the case of $K = 0$, the value of $I$ is always given by the logarithm of the number of states of the system. If you have some number of independent observations of the system, then the total uncertainty is in that case given by the sum of those individual uncertainties, which is an assumption in Shannon’s original paper (see footnote 6 above). And so in all cases, you must accept the value of $I$ presented in Equation (6). $K$ must measure your knowledge, for all that exists is either knowledge or uncertainty, and never both. And so if you accept Shannon’s assumptions as they relate to uncertainty, you must accept Equation (6) as a measure of your knowledge given a distribution.

Note that this discussion relates only to the observed distribution, and has nothing to do with some purported true distribution that drives the behavior of the system in question. That is, Shannon’s equation measures your uncertainty given the observed distribution under consideration, and does not purport to relate to any underlying, ostensibly true, but unobserved distribution. This is necessarily the case if you’re observing a possibly truly random system, since the distribution would in that case be unbounded, and so the entropy could, for example, start out low, and then escalate at any point in the future, reducing your knowledge to arbitrarily close to zero. This holds absent additional assumptions about the system.

5 Application to Prediction

We can apply Equation (6) to a problem that arises often in machine learning, which is predicting the classification of an observation: take a given observation, and predict, based upon the data within the observation, to which particular class, among some set of possible classes, the observation belongs. In this case, we’ll begin with image classification, specifically, where a machine is responsible for determining what hand-written digit is displayed in an image, and the class is given by the number displayed, resulting in 10 classes, 0 through 9. In order to generate observations that lend themselves to the analysis above, given an input image, we will retrieve a set of similar images, known as a cluster.^7 We can implement clustering by retrieving all images from a dataset that are sufficiently similar to the input image. This cluster will generate a distribution of classes associated with a given input image.

For example, given an image of a 1, let’s assume the related cluster of images consists of the following distribution of classes: $[1, 1, 1, 7, 9]$. That is, the cluster associated with a given image of a 1 consists of, including itself, 3 images of 1’s, an image of a 7, and an image of a 9. This distribution is not unrealistic, since the method in question makes use of Euclidean distance between image files, and the numbers 1, 7, and 9, when drawn by hand, can at times all resemble each other.
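One way such a cluster could be retrieved is sketched below: keep every row of the dataset within a fixed Euclidean distance, delta, of the input row. This is an assumption about how the step might look, not the attached find_delta_cluster function, and the data, labels, and delta are made up.

```octave
% A minimal sketch of retrieving a cluster by Euclidean distance.
% The data, labels, and delta below are hypothetical.
dataset = [0.0 0.1; 0.1 0.0; 0.2 0.1; 2.0 2.1; 2.1 1.9];   % one row per observation
labels  = [1; 1; 7; 9; 9];                                 % hidden class of each row
delta   = 0.5;                                             % similarity threshold

input_row = 1;                                             % row whose cluster we want
diffs = dataset - dataset(input_row, :);                   % difference from the input row
dists = sqrt(sum(diffs .^ 2, 2));                          % Euclidean distance to every row

cluster_rows   = find(dists <= delta);                     % includes the input row itself
cluster_vector = labels(cluster_rows)';                    % classes found in the cluster

disp(cluster_vector);
```

The resulting cluster_vector plays the role of the class distribution discussed above: it contains the input's own class together with the classes of its neighbors.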

^7 See the code attached, written in Octave / Matlab, which instead classifies skin cancer lesions, using the exact same process. We describe the digit recognition hypothetical, to which the same code can be applied, e.g., using the MNIST Numerical Dataset, for simplicity.

In this case, the machine does not know the class of the input image, and is instead given only the distribution of classes in the cluster, which includes the class of the input image. We then predict the class of the input image by selecting the modal class as our predicted class. For example, given the same cluster vector $[1, 1, 1, 7, 9]$, the modal class is 1, the modal probability is $\frac{3}{5}$, the entropy of the distribution is $H = 1.3710$, the value of $I$ is given by $5 \log(10)$, and so $K = I - 5H = 9.7549$. In contrast, assume our observed cluster vector is $[1, 1, 1, 1, 1, 7]$. Recalculating all of these values, the modal class is 1, the modal probability is $\frac{5}{6}$, the entropy of the distribution is $H = 0.65002$, the value of $I$ is given by $6 \log(10)$, and so $K = I - 6H = 16.031$. Not surprisingly, the value of $K$ is higher in the second case, because the vector is more consistent, and longer. It turns out that if you set a minimum threshold for both knowledge and the modal probability,^8 and increase that threshold, accuracy of prediction increases as a function of that threshold.^9
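The two worked examples can be recomputed with a short script. The sketch assumes base-2 logarithms and k = 10 digit classes, as in the text; the variable names are illustrative and the script is not the attached cluster_prediction function.

```octave
% A sketch that recomputes the two worked examples, assuming base-2
% logarithms and k = 10 digit classes. Variable names are illustrative.
k = 10;                                        % classes 0 through 9
clusters = {[1 1 1 7 9], [1 1 1 1 1 7]};

K_values = zeros(1, numel(clusters));
for c = 1 : numel(clusters)
  v = clusters{c};
  N = numel(v);
  [modal_class, modal_count] = mode(v);        % most frequent class and its count
  [~, ~, idx] = unique(v);                     % relabel the observed classes as 1..m
  p = accumarray(idx(:), 1) / N;               % observed class distribution
  H = sum(p .* log2(1 ./ p));                  % Equation (4)
  K_values(c) = N * log2(k) - N * H;           % Equation (6)
  printf("modal class %d, modal probability %.4f, H = %.4f, K = %.4f\n", ...
         modal_class, modal_count / N, H, K_values(c));
endfor
```

The printed values should agree with the figures quoted above to the stated precision, with the longer and more consistent vector producing the larger K.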

Reporting a cluster of classes associated with an input is no different than making repeated observations of the same system, and recording the results.^10 Intuitively, the more consistent the observations, the less uncertainty there is with respect to the observations. More specifically, if there are a high number of possible outcomes and you make a high number of observations, then $I$ will be high. If the resultant entropy is low, then you have a high degree of knowledge in your observations, since many outcomes could have occurred, but failed to. In contrast, if the resultant entropy is high, or you make a low number of observations, then you have a low degree of knowledge in your observations. And so this method allows you to compare, objectively, your knowledge in two potentially totally different sets of observations.^11 That this method works in practice is empirical evidence for the application of Equation (6) to probabilities, specifically, that probabilities become more reliable as a function of knowledge, and so in particular, the modal probability becomes more reliable in practice as a function of knowledge.^12
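The joint threshold on knowledge and modal probability can be sketched as a simple filter. The vectors below are made-up values chosen only to show the mechanics; they are not output of the attached code and do not by themselves demonstrate the empirical claim.

```octave
% A minimal sketch of joint thresholding on hypothetical values.
knowledge   = [9.75 16.03 2.10 12.40 5.00];   % K for each prediction's cluster
probability = [0.60 0.83 0.40 0.90 0.75];     % modal probability for each cluster
is_error    = [1 0 1 0 0];                    % 1 where the modal prediction was wrong

confidence = knowledge / max(knowledge);      % map knowledge to [0,1] by its maximum

threshold = 0.7;
keep = (confidence >= threshold) & (probability >= threshold);   % must pass both tests

num_predictions = sum(keep);
if (num_predictions > 0)
  accuracy = 1 - sum(is_error(keep)) / num_predictions;
  printf("kept %d of %d predictions, accuracy = %.2f\n", ...
         num_predictions, numel(keep), accuracy);
endif
```

Note that the fifth prediction passes the probability test but fails the knowledge test, so it is excluded; this is the case where thresholding on modal probability alone would behave differently.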

^8 Knowledge can be mapped to $[0, 1]$ by simply dividing by the maximum knowledge observed for any cluster. Again, see the code attached.

^9 This is demonstrated in the code attached, which can be applied to any single object image classification task.

^10 You can perhaps debate this point, but clustering is, literally, in this case, the repeated observation of a region of Euclidean space.

^11 Note that if your observations do not consist of discrete labels, and instead consist of continuous data, then you can first cluster the observations, which will produce discrete categories of observations, to which this method can be applied. Alternatively, you can apply an analogous method using the equations we presented in a previous paper, “Sorting, Information, and Recursion”.

^12 This implies that raising the threshold for the modal probability alone should not increase accuracy, and this is indeed the case. That is, you must also increase the threshold for knowledge in order to increase accuracy, which you can see demonstrated in the code attached.

You can develop a mechanical intuition for why this works, using combinatorics, which is that the number of ways you can be wrong depends upon the frequency distribution and the modal frequency, with the number of ways you can be wrong increasing as a function of increasingly equal frequencies (i.e., tending towards the uniform distribution) and a decreasing modal frequency. However, note that while modal frequency of course determines the expected number of prediction errors using the modal outcome, the combinatorial outcome space itself can change, even if the expected number of errors remains constant. As such, this view is consistent with Equation (4), which varies as a function of multiplicity of outcome generally. Specifically, because the algorithm does not know which entry in the cluster vector contains the input class, we can simply fix the first entry as the presumed index of the input class, and consider all permutations of the cluster vector.^13 The more unique permutations there are, the greater the number of opportunities for error there are, and the number of unique permutations increases as the frequencies become more uniform.^14 Further, the lower the modal frequency, the greater the number of permutations there are that cause the wrong class to appear in the first entry, but as noted, this measure of uncertainty varies as a function of multiplicity generally. So perhaps a better mechanical intuition comes from multiplicity itself, in that by increasing the minimum threshold for knowledge, you are eliminating observations that have a large number of permutations, and therefore, it requires less data to generate a dataset that actually contains a greater portion of the possible permutations of a given observation, which will cause predictions to more closely comport with the probabilities implied by those observations.
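The counting argument can be made concrete with the standard formula for unique permutations with repetition, n! / (f_1! ... f_m!), where the f_i are the class frequencies. The helper names below are hypothetical, and the three frequency vectors are chosen only for illustration.

```octave
% A minimal sketch of the counting argument. Helper names are hypothetical.
num_permutations = @(f) factorial(sum(f)) / prod(factorial(f));

f_A = [3 2];     % frequencies in [a a a b b]
f_B = [3 1 1];   % frequencies in [a a a b c]
f_C = [4 1];     % a less uniform split of five observations

n_A = num_permutations(f_A);
n_B = num_permutations(f_B);
n_C = num_permutations(f_C);

% Fraction of unique permutations with a non-modal class in the first entry.
wrong_first = @(f) 1 - max(f) / sum(f);

printf("unique permutations: A = %d, B = %d, C = %d\n", n_A, n_B, n_C);
printf("wrong first entry: A = %.2f, B = %.2f, C = %.2f\n", ...
       wrong_first(f_A), wrong_first(f_B), wrong_first(f_C));
```

The first two vectors share the same modal frequency, and so the same fraction of permutations with the wrong class in the first entry, yet the second has more unique permutations, which is the sense in which multiplicity varies independently of the expected number of errors.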

Finally, note that we are not imputing any true underlying distribution or

value to a set of observations, even if such a distribution or value exists. We

are instead measuring knowledge in a set of observations, and treating those

observations as all that is known. In the context of probabilities, it turns out

that, empirically, increasing both knowledge and the modal probability, causes

accuracy to increase. We can interpret this result as evidence for the claim

that observed probabilities become more reliable as a function of increasing

knowledge, regardless of whether or not there is additional data left unobserved.

As a general matter, the hypothesis to be tested is:

Observation literally carries information, and so the more information a set of observations carries, as measured by Equation (6), the more the logical implications of those observations comport with reality itself.

^13 Note there is an equivalence between permuting the labels of the elements of a vector while holding the elements fixed, and permuting the elements while holding the labels fixed.

^14 This is trivial to prove: consider $A = k!\,m!$, for $m < k$, and consider $B = \frac{m+1}{k} A = (k-1)!\,(m+1)! \le A$, and apply it to the formula for unique permutations with repetition. However, this admittedly trivial fact suggests the possibility of a combinatorial theory of uncertainty that is consistent with Shannon’s, where uncertainty is again the result of multiplicity.

%===============================================================================

%===============================================================================

%===============================================================================

%VECTORIZED IMAGE CLASSIFICATION - SKIN CANCER STAT CLUSTER PREDICTION

%===============================================================================

%===============================================================================

%===============================================================================

%COPYRIGHT CHARLES DAVI, 2021

%===============================================================================

%LOADS THE DATASET

%===============================================================================

clear

clc

pkg load image

%image directory

directory = '/Users/charlesdavi/Desktop/Datasets/Skin_Cancer/dataverse_files/1/';

%file that contains the full list of image file names

file = '/Users/charlesdavi/Desktop/Datasets/Skin_Cancer/dataverse_files/Image_ID.txt';

A = textread (file, "%s");

%file that contains the full list of corresponding classifiers

file = '/Users/charlesdavi/Desktop/Datasets/Skin_Cancer/dataverse_files/Class_ID.txt';

B = textread (file, "%f");

%file that contains the full list of patient IDs

file = '/Users/charlesdavi/Desktop/Datasets/Skin_Cancer/dataverse_files/Patient_ID.txt';

C = textread (file, "%s");

%there are multiple similar images of the same patients, so we take uniques-----

unique_patient_IDs = unique(C);

max_num_images = numel(unique_patient_IDs); %scalar count, so the loops below iterate as intended

num_images = max_num_images; %number of images we select from the dataset

%generates indexes for the dataset, preventing similar duplicate images

for i = 1 : num_images

temp_ID = unique_patient_IDs{i};

counter = 0;

current_ID ="HAM_ZZZZZZZ"; %this patient ID does not exist

%iterates until we find the first image associated with the patient

while(sum(current_ID == temp_ID) < 11) %each patient ID is 11 characters long

counter = counter + 1;

current_ID = C{counter};

endwhile

dataset_rows(i) = counter;

endfor

%-------------------------------------------------------------------------------

%loads the images and classifiers into memory

for i = 1 : num_images

image_file = A{dataset_rows(i)};

image_file = [directory image_file '.jpg'];

I = imread(image_file);

I = rgb2gray(I);

IMG_array{i} = I;

IMG_category(i) = B(dataset_rows(i)); %loads the classifier for each image

endfor

%===============================================================================

%EXTRACTS SHAPE INFORMATION

%===============================================================================

tic;

%this is to size the partitions for the entire dataset

[final_avg_matrix final_indexes] = partition_image_vectorized_gs(IMG_array{1});

toc

N = size(final_avg_matrix,1);

tic;

%iterates through entire dataset

for i = 1 : num_images

I = IMG_array{i};

[avg_matrix] = calc_avg_color_vect(final_indexes, I, N); %this extracts shape information

input_vector = reshape(avg_matrix, [1 N^2]);

input_vector(N^2+1) = IMG_category(i); %this is the hidden classifier

dataset(i,:) = input_vector;

endfor

toc

%===============================================================================

%GENERATES CLUSTERS

%===============================================================================

tic;

N = N^2;

s = std(dataset(:,1:N)); %calculates the standard deviation of the dataset in each dimension

s = mean(s); %takes the average standard deviation

s = s*N;

num_rows = size(dataset,1);

cluster_matrix = zeros(num_rows,num_rows);

delta = s/24; %this is the value of delta based upon experimentation

for i = 1 : num_rows

input_vector = dataset(i,:);

[cluster_vector diff_vector] = find_delta_cluster(input_vector, dataset, delta, N);

cluster_matrix(i,:) = cluster_vector;

endfor

toc

%===============================================================================

%GENERATES TRAINING / TESTING DATASET

%===============================================================================

num_iterations = 500;

accuracy_p = [];

accuracy_c = [];

accuracy_cp = [];

num_rows = size(dataset,1);

for i = 1 : num_iterations

%permutes the dataset

%Generates a training and testing dataset---------------------------------------

num_training_rows = floor(.85*num_rows); %selects a portion of the dataset

training_rows = randperm(num_rows,num_training_rows);

testing_dataset = dataset;

testing_dataset(training_rows,:) = [];

training_dataset = dataset(training_rows,:);

num_testing_rows = size(testing_dataset,1);

%===============================================================================

%CLUSTER PREDICTION STEP

%===============================================================================

[predicted_class_vector confidence_vector probability_vector CP_accuracy error_vector] = ...
cluster_prediction(dataset, training_dataset, testing_dataset, training_rows, ...
cluster_matrix, N);

%===============================================================================

%BENCHMARK PREDICTION STEP - NEAREST NEIGHBOR

%===============================================================================

num_errors = 0;

for j = 1 : num_testing_rows

input_vector = testing_dataset(j,:);

[predicted_vector predicted_class prediction_row diff_vector] = ...
find_NN(input_vector, training_dataset,N);

actual_class = input_vector(N+1);

if(predicted_class != actual_class)

num_errors = num_errors + 1;

endif

endfor

NNaccuracy(i) = 1 - num_errors/num_testing_rows;

%===============================================================================

%PROBABILITY; CONFIDENCE

%===============================================================================

%probability--------------------------------------------------------------------

counter = 1;

increment = .01;

num_levels = size(0 : increment : 1,2);

for j = 0 : increment : 1

x = find(probability_vector >= j);

num_errors = sum(error_vector(x));

num_predictions = size(x,2);

if(num_predictions > 0)

accuracy_p(i,counter) = 1 - num_errors/num_predictions;

endif

counter = counter + 1;

endfor

%confidence---------------------------------------------------------------------

counter = 1;

increment = .01;

num_levels = size(0 : increment : 1,2);

confidence_vector = confidence_vector/max(confidence_vector);

for j = 0 : increment : 1

x = find(confidence_vector >= j);

num_errors = sum(error_vector(x));

num_predictions = size(x,2);

if(num_predictions > 0)

accuracy_c(i,counter) = 1 - num_errors/num_predictions;

endif

counter = counter + 1;

endfor

%confidence and probability-----------------------------------------------------

counter = 1;

increment = .001;

num_levels = size(0 : increment : 1,2);

confidence_vector = confidence_vector/max(confidence_vector);

for j = 0 : increment : 1

x = find(probability_vector >= j);

y = find(confidence_vector >= j);

temp1 = zeros(num_testing_rows,1);

temp2 = zeros(num_testing_rows,1);

temp1(x) = 1;

temp2(y) = 1;

z = temp1.*temp2;

z = find(z == 1);

z = z';

num_errors = sum(error_vector(z));

num_predictions = size(z,2);

if(num_predictions > 0)

accuracy_cp(i,counter) = 1 - num_errors/num_predictions;

endif

counter = counter + 1;

endfor

endfor %end of outer loop

plot_data_p = mean(accuracy_p);

plot_data_c = mean(accuracy_c);

plot_data_cp = mean(accuracy_cp);

figure, plot(plot_data_p)

figure, plot(plot_data_c)

figure, plot(plot_data_cp) %this is the plot generated by both thresholds