
A Brief Introduction to Shannon's Information Theory

Ricky Xiaofeng Chen∗

Abstract

This is an introduction to Shannon's information theory. It is more of a long note than a complete survey, and it is by no means completely mathematically rigorous. It covers two main topics: entropy and channel capacity. Hopefully, it will be interesting and helpful to those in the process of learning information theory.

Keywords: information, entropy, channel capacity, encoding, decoding

1 Preface

Claude Shannon's paper "A mathematical theory of communication" [1], published in July and October of 1948, is the Magna Carta of the information age. Shannon's discovery of the fundamental laws of data compression and transmission marks the birth of Information Theory.

In this note, we first introduce the two main fundamental results of information theory, entropy and channel capacity, following (hopefully) Shannon's logic. For other aspects, we refer the readers to the papers [1, 2, 3] and the references therein. At the end, we offer some open discussion comments.

2 Information and Entropy

What is information? Or, what does it mean when somebody says he has gotten some information regarding something? It means that before someone else "communicated" some stuff about this "something", he was not sure what this "something" was about. Note that anything can be described by sentences in a language, for instance, English. A sentence in English can be viewed as a sequence of letters ('a', 'b', 'c', ...) and symbols (',', '.', ' ', ...). So we can think of sentences conveying different meanings as different sequences. Thus, "he is not sure of what this "something" is about" can

∗Biocomplexity Institute and Dept. of Mathematics, Virginia Tech, 1015 Life Sciences Circle, Blacksburg, VA 24061, USA. Email: chen.ricky1982@gmail.com, chenshu731@sina.com


be understood as "he is not sure which sequence this "something" corresponds to". Of course, we can assume he is aware of all possible sequences; only which one of them remains uncertain. He gets some information when someone else picks one sequence and "communicates" it to him. In this sense, we can say this sequence, and even each letter or symbol in it, contains a certain amount of information. Another aspect of these sequences is that not all sequences, words, or letters appear equally often. They appear following some probability distribution. For example, the sequence "how are you" is more likely to appear than "ahaojiaping mei"; the letter 'e' is more likely to appear than the letter 'z' (the reader may have noticed that the letter 'z' has not appeared in the text so far). The rough ideas above are the underlying motivation of the following more formal discussion on what information is, how to measure information, and so on.

2.1 How many sequences are there

To formalize the ideas we have just discussed, we assume there is an alphabet A with n letters, i.e., A = {x_1, x_2, ..., x_n}. For example, A = {a, b, ..., z, ',', '.', ' ', ...}, or just as simple as A = {0, 1}. We will next be interested in sequences with entries from the alphabet. We assume each letter x_i appears in all sequences of interest with probability 0 ≤ p_i ≤ 1, where Σ_{i=1}^{n} p_i = 1. To keep things simple, we further assume that for any such sequence s = s_1 s_2 s_3 ··· s_T, where each s_i = x_j for some j, the exact letters taken by s_i and s_j are independent (but subject to the probability distribution p_i) for all i ≠ j. Now comes the fundamental question: with these assumptions, how many sequences are there?

It should be noted that a short sequence consisting of these letters will not properly reflect the statistical properties we assumed. Thus, the length T of the sequences we are interested in should be quite large, and we will consider the situation as T goes to infinity, denoted by T → ∞. Now, from the viewpoint of statistics, each sequence of length T can be viewed as a series of T independent experiments, where the possible outcomes of each experiment are the events (letters) in A, and the event x_i happens with probability p_i. By the law of large numbers, for T large enough, in each series of T independent experiments the event x_i will (almost surely) appear T × p_i (T p_i for short) times. Assume we label these experiments 1, 2, ..., T. Now, the only thing we do not know is in which experiments the event x_i happens. Therefore, the number of sequences we are interested in is the number of different ways of placing T p_1 copies of x_1, T p_2 copies of x_2, and so on, into T positions. Equivalently, it is the number of different ways of placing T different balls into n different boxes such that there are T p_1 balls in the first box, T p_2 balls in the second box, and so on and so forth.

Now it should be easy to enumerate the number of these sequences. Let's first consider a toy example:

Example 2.1. Assume there are T = 5 balls and 2 boxes. How many different ways are there to place 2 balls in the first box and 3 balls in the second? The answer is that there are in total C(5, 2) × C(5 − 2, 3) = 10 different ways, where C(n, m) = n!/(m!(n − m)!) and n! = n × (n − 1) × (n − 2) × ··· × 1.
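The count in Example 2.1 can be checked directly; a minimal sketch in Python using only the standard library:

```python
from math import comb, factorial

# Ways to place T = 5 balls with 2 in the first box and 3 in the second:
# C(5, 2) * C(5 - 2, 3), as in Example 2.1.
ways = comb(5, 2) * comb(5 - 2, 3)

# The same count as the multinomial coefficient 5! / (2! * 3!).
multinomial = factorial(5) // (factorial(2) * factorial(3))

print(ways, multinomial)  # 10 10
```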

As in the example above, for our general setting here, the total number of sequences we are interested in is

K = C(T, T p_1) × C(T − T p_1, T p_2) × C(T − T p_1 − T p_2, T p_3) × ··· × C(T − T p_1 − ··· − T p_{n−1}, T p_n),  (1)

in the binomial-coefficient notation C(n, m) of Example 2.1.

2.2 Average amount of storage resources

Next, if we want to index each sequence among these K sequences using binary digits, i.e., by a sequence using only 0 and 1, what is the minimum length of the binary sequence? Again, let us look at an example first.

Example 2.2. If K = 4, the 4 sequences can be respectively indexed by 00, 01, 10 and 11. So, a binary sequence of length log_2 4 = 2 suffices to index each sequence.

Therefore, the binary sequence should have length log_2 K in order to index each sequence among all these K sequences. In terms of Computer Science, we need log_2 K bits to index (and store) a sequence. Next, we will derive a more explicit expression for log_2 K.

If m is large enough, m! can be quite accurately approximated by the Stirling formula:

m! ≈ √(2πm) (m/e)^m.  (2)
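As a quick numerical sanity check (not in the original text), one can compare eq. (2) against the exact factorial; the ratio tends to 1 as m grows:

```python
from math import factorial, sqrt, pi, e

def stirling(m: int) -> float:
    """Stirling approximation sqrt(2*pi*m) * (m/e)**m, as in eq. (2)."""
    return sqrt(2 * pi * m) * (m / e) ** m

# The relative error shrinks roughly like 1/(12m).
ratios = {m: stirling(m) / factorial(m) for m in (10, 50, 100)}
for m, r in ratios.items():
    print(m, r)
```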

Thus, for fixed a, b ≥ 0 and T → ∞, we have the approximation:

C(Ta, Tb) = (Ta)! / [(Tb)! (Ta − Tb)!]
≈ [√(2πTa) (Ta/e)^{Ta}] / [√(2πTb) (Tb/e)^{Tb} × √(2πT(a−b)) (T(a−b)/e)^{T(a−b)}]  (3)
= [√a × a^{Ta}] / [√(2πT b(a−b)) × b^{Tb} × (a−b)^{T(a−b)}].  (4)

Notice that for any fixed p_i > 0, T p_i → ∞ as T → ∞, which means we can apply the approximation eq. (4) to every term in eq. (1). Using it (or applying eq. (2) directly to K = T!/((T p_1)! ··· (T p_n)!)), we obtain

log_2 K ≈ −(n − 1) log_2 √(2πT) − log_2 √(p_1) − log_2 √(p_2) − ··· − log_2 √(p_n)
− T p_1 log_2 p_1 − T p_2 log_2 p_2 − ··· − T p_n log_2 p_n.  (5)

Now, if we consider the average number of bits a letter needs for indexing a sequence of length T, a minor miracle happens: as T → ∞,

log_2 K / T → − Σ_{i=1}^{n} p_i log_2 p_i.  (6)

The expression on the right hand side of eq. (6) is the celebrated quantity associated with a probability distribution, called the Shannon entropy.
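Equation (6) can be checked numerically; a minimal sketch, where the distribution p = (1/2, 1/4, 1/4) is just an illustrative choice:

```python
from math import comb, log2

# Check of eq. (6): for p = (1/2, 1/4, 1/4), log2(K)/T should approach
# the Shannon entropy -sum_i p_i log2 p_i = 1.5 bits as T grows.
p = (0.5, 0.25, 0.25)
entropy = -sum(q * log2(q) for q in p)

def log2_count(T: int) -> float:
    """log2 of K from eq. (1): C(T, Tp1) * C(T - Tp1, Tp2) * C(T - Tp1 - Tp2, Tp3)."""
    k1, k2, k3 = int(T * p[0]), int(T * p[1]), int(T * p[2])
    K = comb(T, k1) * comb(T - k1, k2) * comb(T - k1 - k2, k3)
    return log2(K)

for T in (100, 1000, 10000):
    print(T, log2_count(T) / T, entropy)
```

The per-letter bit count creeps up toward 1.5 bits; the gap that remains at finite T comes from the lower-order terms in eq. (5), which vanish after dividing by T.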


Let's review a little what we have done: we have K sequences in total, and all sequences appear equally likely. Assume they encode different messages. Regardless of the specific messages they encode, we regard them as having the same amount of information. Let's simply use the number of bits needed to encode a sequence to measure the amount of information a sequence encodes (or can provide). Then, log_2 K / T can be viewed as the average amount of information a letter carries. This suggests that we can actually define the amount of information of each letter. Here, we say "average" because we think the amounts of information different letters carry should differ, as they may not "contribute equally" in a sequence, depending on the respective probabilities of the letters. Indeed, if we look into formula (6), it depends only on the probability distribution of the letters in A. If we reformulate the formula as

Σ_{i=1}^{n} p_i × log_2 (1/p_i),

it is clearly the expectation (i.e., the average in the sense of probability) of the quantity log_2 (1/p_i) associated with the letter x_i, for 1 ≤ i ≤ n. This matches the term "average", so we can define the amount of information of a letter x_i with probability p_i to be log_2 (1/p_i) bits.

With this definition of information, we observe that a letter with a higher probability carries less information, and vice versa. In other words, more uncertainty means more information. Just like a lottery: winning the first prize is unlikely but shocking when it happens, while winning a prize of 10 bucks is not a big deal since it is very likely. Hence, this definition agrees with our intuition as well.

In the remainder of the paper, we will omit the base of the logarithm function. Theoretically, the base could be any number; it is 2 by default. Now we summarize information and Shannon entropy in the following definition:

Definition 2.3. Let X be a random variable taking value x_i with probability p_i, for 1 ≤ i ≤ n. Then, the quantity I(p_i) = log(1/p_i) is the amount of information encoded in x_i (or p_i), while the average amount of information Σ_{i=1}^{n} p_i × log(1/p_i) is called the Shannon entropy of the random variable X (or of the distribution P), and is denoted by H(X).

Question: among all possible probability distributions, which distribution gives the largest Shannon entropy? The answer is given in the following proposition:

Proposition 2.4. For finite n, when p_i = 1/n for 1 ≤ i ≤ n, the Shannon entropy attains its maximum

Σ_{i=1}^{n} (1/n) × log n = log n.
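Definition 2.3 and Proposition 2.4 are easy to experiment with; a small sketch, where the skewed distribution is an arbitrary example:

```python
from math import log2

def shannon_entropy(dist):
    """H(P) = sum_i p_i * log2(1/p_i); terms with p_i = 0 contribute 0."""
    return sum(p * log2(1 / p) for p in dist if p > 0)

# Proposition 2.4: on 4 letters the uniform distribution attains log2(4) = 2 bits;
# any skewed distribution falls strictly below it.
h_uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25])
h_skewed = shannon_entropy([0.7, 0.1, 0.1, 0.1])
print(h_uniform, h_skewed)  # 2.0 and about 1.357
```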

2.3 Further definitions and properties

The definitions of information and entropy can be extended to continuous random variables. Let X be a random variable taking real values and let f(x) be its probability density function. Then, the probability P(X = x) ≈ f(x)∆x. Mimicking the discrete finite case, the entropy of X can be defined by

H(X) = Σ_x −P(x) log P(x) = lim_{∆x→0} Σ_x −[f(x)∆x] log[f(x)∆x]  (7)
= lim_{∆x→0} Σ_x −[f(x)∆x] (log f(x) + log ∆x)  (8)
= −∫ f(x) log f(x) dx − log dx,  (9)

where we used the definition of the (Riemann) integral and the fact that ∫ f(x) dx = 1. The last formula here is called the absolute entropy of the random variable X. Note that, regardless of the probability distribution, there is always a positively infinite term −log dx (i.e., −log ∆x → ∞ as ∆x → 0). So, we can drop this term and define the (relative) entropy of X to be

−∫ f(x) log f(x) dx.

Proposition 2.5. Among all real random variables with expectation µ and variance σ², the Gaussian distribution X ∼ N(µ, σ²) attains the maximum entropy

H(X) = log √(2πeσ²).
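Proposition 2.5 can be checked numerically by approximating the integral −∫ f log_2 f dx with a Riemann sum; a sketch, where σ = 1.5 is an arbitrary choice:

```python
from math import exp, log2, pi, sqrt, e

# Numerical check of Proposition 2.5: the (relative) entropy of a
# Gaussian N(mu, sigma^2) equals log2 sqrt(2*pi*e*sigma^2).
mu, sigma = 0.0, 1.5

def f(x):
    """Gaussian density with mean mu and standard deviation sigma."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Riemann sum of -f(x) log2 f(x) over [mu - 10*sigma, mu + 10*sigma).
dx = 1e-3
xs = [mu - 10 * sigma + i * dx for i in range(int(20 * sigma / dx))]
h_numeric = -sum(f(x) * log2(f(x)) * dx for x in xs)
h_closed = log2(sqrt(2 * pi * e * sigma ** 2))
print(h_numeric, h_closed)  # both about 2.63 bits
```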

Note that joint distributions and conditional distributions are still probability distributions. Hence, we can define corresponding entropies for them.

Definition 2.6. Let X, Y be two random variables with joint distribution P(X = x, Y = y) (P(x, y) for short). Then the joint entropy H(X, Y) is defined by

H(X, Y) = − Σ_{x,y} P(x, y) log P(x, y).  (10)

Definition 2.7. Let X, Y be two random variables with joint distribution P(x, y) and conditional distribution P(x|y). Then the conditional entropy H(X|Y) is defined by

H(X|Y) = − Σ_{x,y} P(x, y) log P(x|y).  (11)

Remark 2.8. Fixing X = x, P(Y|x) is also a probability distribution. Its entropy equals

H(Y|x) = − Σ_y P(y|x) log P(y|x),

which can be viewed as a function of x (or a random variable depending on X). It can be checked that H(Y|X) is actually the expectation of H(Y|x), i.e.,

H(Y|X) = Σ_x P(x) H(Y|x),

using the fact that P(x, y) = P(x) P(y|x).
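Remark 2.8 implies the chain rule H(X, Y) = H(X) + H(Y|X); a small check, using an arbitrary joint distribution chosen for illustration:

```python
from math import log2

# An arbitrary joint distribution P(x, y) on {0, 1} x {0, 1}, used to check
# the identity H(X, Y) = H(X) + H(Y|X) implied by Remark 2.8.
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal P(x) = sum_y P(x, y).
Px = {x: sum(p for (a, _), p in P.items() if a == x) for x in (0, 1)}

H_joint = -sum(p * log2(p) for p in P.values())  # eq. (10)
H_X = -sum(p * log2(p) for p in Px.values())
# H(Y|X) = -sum_{x,y} P(x, y) log2 P(y|x), with P(y|x) = P(x, y) / P(x).
H_Y_given_X = -sum(p * log2(p / Px[x]) for (x, _), p in P.items())

print(H_joint, H_X + H_Y_given_X)  # the two values coincide
```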


3 Channel Capacity

In a communication system, we have three basic ingredients: the source, the destination, and the media between them. We call the media the (communication) channel. A channel could be in any form: physical wires, cables, the open environment in wireless communication, antennas, and certain combinations of them.

3.1 Channel without error

Suppose we are given a channel and a set A of letters (or symbols) which can be transmitted via the channel. Now suppose an information source generates letters in A following a probability distribution P (so we have a random variable X taking values in A), and sends the generated letters to the destination through the channel.

Assume the channel carries the exact letters generated by the source to the destination. Then, what is the amount of information received by the destination? Certainly, the destination will receive exactly the same amount of information generated (or provided) by the source, which is T H(X) in a time period of length T symbols (with T large enough). Namely, in a time period of symbol-length T, the source will generate a sequence of length T, and the destination will receive the same sequence, no matter what sequence was generated at the source. Hence, the amount of information received at the destination is on average H(X) per symbol.

The channel capacity of a channel is the maximum amount of information that can on average be obtained at the destination in a fixed time duration, e.g., per second or per symbol (time). Put differently, the channel capacity can be characterized by the maximum number of sequences on A we can select and transmit on the channel such that the destination can in principle determine, without error, the corresponding sequences fed into the channel based on the received sequences.

If the channel is errorless, what is the capacity of the channel? Well, as discussed above, the maximum amount of information that can be received at the destination equals the maximum amount of information that can be generated at the source. Therefore, the channel capacity C for this case is

C = max_X H(X), per symbol,  (12)

where X ranges over all possible distributions on A.

For example, if A contains n letters, with n finite, then we know from Proposition 2.4 that the uniform distribution achieves the channel capacity C = log n.

3.2 Channel with error

What is the channel capacity of a channel with error? A channel with error means that when the source generates a letter x_i ∈ A and transmits it to the destination via the channel, due to some unpredictable error, the received letter at the destination may be x_j. Assume that, statistically, x_j is received with probability p(x_j | x_i) when x_i is transmitted. These probabilities are called the transition probabilities of the channel. We assume that, once the channel is given, the transition probabilities are determined and do not change.

We start with some examples.

Example 3.1. Assume A = {0, 1}. If the transition probabilities of the channel are

p(1|0) = 0.5, p(0|0) = 0.5,
p(1|1) = 0.5, p(0|1) = 0.5,

what is the channel capacity? The answer is 0, i.e., the destination cannot obtain any information at all.

This is because, no matter what is sent, the received sequence at the destination could be any 0-1 sequence, with equal probability. From the received sequence, we can neither determine which sequence is the one generated at the source, nor determine which sequences are not the one generated at the source.

In other words, the received sequence has no binding relation with the transmitted sequence at all; we could actually flip a fair coin to generate a sequence ourselves instead of looking at the one actually received at the destination.

Example 3.2. Assume A = {0, 1}. If the transition probabilities of the channel are

p(1|0) = 0.1, p(0|0) = 0.9,
p(1|1) = 0.9, p(0|1) = 0.1,

what is the channel capacity? The answer is not 0, i.e., the destination can determine something regarding the transmitted sequence.

Suppose further that the source generates 0 and 1 with equal probability, i.e., 1/2. Observe the outcome at the destination for a long enough time, that is, a sequence long enough; for computation purposes, say a 10000-letter sequence is long enough (to guarantee that the law of large numbers is effective). With these assumptions, there are 5000 1's and 5000 0's in the generated sequence at the source. After the channel, 5000 × 0.1 = 500 of the 1's will be changed to 0's and vice versa. Thus, the received sequence also has 5000 1's and 5000 0's. Assume the sequence received at the destination has 5000 1's in the first half of its entries and 5000 0's in the second half.

With these probabilities and the received sequence known, what can we say about the generated sequence at the source? Well, it is not possible to immediately know what the generated sequence is based on this knowledge, because there is more than one sequence that can lead to the received sequence after going through the channel. But the sequence generated at the source can certainly not be the sequence that contains 5000 0's in the first half and 5000 1's in the second half, or any sequence with most of its 0's concentrated in the first half. For if that were the generated sequence, the received sequence should contain about 4500 0's in its first half, which is not what is observed in the received sequence.

This is unlike the example right above, in which we could neither determine which sequence was generated nor which were not. Thus, the information obtained by the destination should not be 0.


Let us come back to determining the capacity of the channel. Recall that the capacity is the maximum number of sequences on A we can select and transmit on the channel such that the destination can in principle determine, without error, the corresponding sequences fed into the channel based on the received sequences. Since there is error in the transmission on the channel, we cannot select two sequences which potentially lead to the same sequence after going through the channel, since otherwise we could never determine which of the two is the transmitted one based on the same (received) sequence at the destination.

Basically, the possible outputs at the destination are also sequences on A, where the element x_i, for 1 ≤ i ≤ n, appears in these sequences with probability

p_Y(x_i) = Σ_{x_j ∈ A} p(x_j) p(x_i | x_j).

Note that this probability distribution depends only on the distribution of X, since the transition probabilities are fixed. Denote by Y the random variable associated to this probability distribution at the destination (note that Y changes as X changes). Shannon [1] proved that for a given distribution X, we can choose at most 2^{T[H(X)−H(X|Y)]} sequences (satisfying the given distribution) to be the sequences to transmit on the channel such that the destination can determine, in principle, without error, the transmitted sequence based on the received sequences. That is, the destination can obtain H(X) − H(X|Y) bits of information per symbol. Therefore, the channel capacity for this case is

C = max_X [H(X) − H(X|Y)], per symbol,  (13)

where X ranges over all probability distributions on A.
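Since H(X) − H(X|Y) equals the mutual information I(X;Y), which can equivalently be written H(Y) − H(Y|X), eq. (13) can be evaluated numerically for the binary symmetric channels of Examples 3.1 and 3.2; a sketch, where the brute-force sweep over input distributions is just for illustration:

```python
from math import log2

def h2(p: float) -> float:
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def mutual_info(px1: float, eps: float) -> float:
    """I(X;Y) = H(Y) - H(Y|X) for a binary symmetric channel with transition
    probability p(flip) = eps and input distribution P(X = 1) = px1."""
    py1 = px1 * (1 - eps) + (1 - px1) * eps  # p_Y(1), as in Section 3.2
    return h2(py1) - h2(eps)

# Sweep input distributions to approximate eq. (13); the uniform input
# attains the known closed form C = 1 - h2(eps).
for eps in (0.5, 0.1):  # Example 3.1 and Example 3.2
    cap = max(mutual_info(q / 1000, eps) for q in range(1001))
    print(eps, cap, 1 - h2(eps))
```

With eps = 0.5 the capacity is 0, matching Example 3.1, while eps = 0.1 gives about 0.531 bits per symbol, a quantitative version of the conclusion of Example 3.2.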

4 Open Discussion

There are many articles and news items claiming that the Shannon capacity limit defined above has been broken. In fact, these are just advertisements for new technologies with more advanced settings than Shannon's original theory, e.g., multiple-antenna transmitting/receiving technologies (MIMO). Essentially, these technologies are still based on Shannon capacity. They have not broken the Shannon capacity limit at all.

Can the Shannon capacity limit be broken?

In my personal opinion, all "limit" problems are optimization problems under certain conditions. So, it is really about the settings. If you believe Shannon's settings are the most suitable ones for modelling a communication system, then I suspect you cannot break the Shannon capacity limit. Anyway, we end this note with some open discussions:

1. There is no problem with modelling information sources as random processes. However, given a channel and the set A of letters transmittable on the channel, why are we, in discussing the capacity of the channel, only allowed to select sequences obeying the same probability distribution? Given two probability distributions P_1 and P_2 on A, if there exists x_i for some 1 ≤ i ≤ n such that

Σ_{x_j ∈ A} P_1(x_j) p(x_i | x_j) ≠ Σ_{x_j ∈ A} P_2(x_j) p(x_i | x_j),

we call X_1 and X_2 compatible. Note that, in this case, if we transmit a sequence satisfying distribution X_1 and another sequence satisfying distribution X_2, the destination can tell, by inspecting the number of x_i's in the received sequence, whether the transmitted sequence is from the X_1-class or the X_2-class. We call a set F of probability distributions on A, such that any two probability distributions in F are compatible, an admissible set. Should the capacity of the channel be defined as

C = max_F Σ_{X ∈ F} [H(X) − H(X|Y)],  (14)

where F ranges over all admissible sets on A?

2. Setting theoretical discussion aside, there are lots of real systems and simulations. Why is there no report of breaking the Shannon capacity limit? Possibly, the reason is that the channel encoding methods used just generate sequences (of modulation symbols, e.g., QPSK, 16QAM, etc.) with almost the same probability distribution, i.e., they agree with Shannon's settings for attaining the channel capacity. Is that the case?

References

[1] C.E. Shannon, A mathematical theory of communication, Bell Syst. Tech. J., vol. 27, pp. 379-423, 623-656, July-Oct. 1948.

[2] C.E. Shannon, Communication in the presence of noise, Proc. IRE, 37 (1949), 10-21.

[3] S. Verdú, Fifty years of Shannon theory, IEEE Transactions on Information Theory, 44 (1998), 2057-2078.