
A Brief Introduction on Shannon's Information Theory



This is an introduction to Shannon's information theory. It covers two main topics: entropy and channel capacity, which are developed in a combinatorial flavor. Some open discussion on whether the Shannon capacity limit can be broken is presented as well. Note: updated versions entitled "A brief introduction to Shannon's information theory" are available on arXiv (2018 v2, 2021 v3).
A Brief Introduction on Shannon’s
Information Theory
Ricky Xiaofeng Chen
This is an introduction to Shannon's information theory. Basically, it is more like
a long note, so it is by no means a complete survey, nor is it completely mathematically
rigorous. It covers two main topics: entropy and channel capacity. Hopefully, it will
be interesting and helpful to those in the process of learning information theory.
Keywords: information, entropy, channel capacity, encoding, decoding
1 Preface
Claude Shannon's paper "A mathematical theory of communication" [1], published in July
and October of 1948, is the Magna Carta of the information age. Shannon's discovery of
the fundamental laws of data compression and transmission marks the birth of Information Theory.
In this note, we will first introduce two main fundamental results in information theory:
entropy and channel capacity, following Shannon’s logic (hopefully). For more aspects,
we refer the readers to the papers [1, 2, 3] and the references therein. At the end, we have
some open discussion comments.
2 Information and Entropy
What is information? Or what does it mean when somebody says he has received some
information regarding something? It means that before someone else "communicates"
some stuff about this "something", he is not sure what this "something" is about.
Note that anything can be described by sentences in a language, for instance, English.
A sentence in English can be viewed as a sequence of letters ('a', 'b', 'c', ...)
and symbols (',', '.', ' ', ...). So we can think of sentences conveying different meanings
as different sequences. Thus, "he is not sure what this 'something' is about" can
be understood as "he is not sure which sequence this 'something' corresponds to".
(Author affiliation: Biocomplexity Institute and Dept. of Mathematics, Virginia Tech, 1015 Life Sciences Circle, Blacksburg, VA 24061, USA.)
Of course, we can assume he is aware of all possible sequences; only which one of them
remains uncertain. He gains information when someone else picks one sequence
and "communicates" it to him. In this sense, this sequence, and even each letter or
symbol in it, contains a certain amount of information. Another aspect of these sequences
is that not all sequences, words, or letters appear equally. They appear following some
probability distribution. For example, the sequence “how are you” is more likely to
appear than “ahaojiaping mei”; the letter ‘e’ is more likely to appear than the letter ‘z’
(the reader may have noticed that the letter ‘z’ has not appeared in the text so far). The
rough ideas above are the underlying motivation of the following more formal discussion
on what information is, how to measure information, and so on.
2.1 How many sequences are there?
To formalize the ideas we have just discussed, we assume there is an alphabet $A$ with
$n$ letters, i.e., $A = \{x_1, x_2, \ldots, x_n\}$. For example, $A = \{a, b, \ldots, z, `,', `.', `\ ', \ldots\}$, or just
as simple as $A = \{0, 1\}$. We will next be interested in sequences with entries from the
alphabet. We assume each letter $x_i$ appears in all sequences of interest with probability
$0 \le p_i \le 1$, where $\sum_{i=1}^{n} p_i = 1$. To make it simple, we further assume that for any such
sequence $s = s_1 s_2 s_3 \cdots s_T$, where $s_i = x_j$ for some $j$, the exact letters taken by $s_i$ and $s_j$
are independent (but subject to the probability distribution $p_i$) for all $i \neq j$. Now comes
the fundamental question: with these assumptions, how many sequences are there?
It should be noted that a short sequence consisting of these letters will not properly
reflect the statistical properties we assumed. Thus, the length $T$ of the sequences we are
interested in should be quite large, and we will consider the situation as $T$ goes to infinity,
denoted by $T \to \infty$. Now, from the viewpoint of statistics, each sequence of length $T$ can
be viewed as a series of $T$ independent experiments, where the possible outcomes of each
experiment are the events (letters) in $A$, and the event $x_i$ happens with probability
$p_i$. By the law of large numbers, for $T$ large enough, in each series of $T$ independent
experiments, the event $x_i$ will (almost surely) appear $T \times p_i$ ($Tp_i$ for short) times. Assume
we label these experiments by $1, 2, \ldots, T$. Now, the only thing we do not know is in which
experiments the event $x_i$ happens. Therefore, the number of sequences we are interested in
is the number of different ways of placing $Tp_1$ copies of $x_1$, $Tp_2$ copies of $x_2$, and so
on, into $T$ positions. Equivalently, it is the number of different ways of placing $T$ different
balls into $n$ different boxes such that there are $Tp_1$ balls in the first box, $Tp_2$ balls in the
second box, and so on and so forth.
Now it should be easy to enumerate the number of these sequences. Let’s first consider
a toy example:
Example 2.1. Assume there are $T = 5$ balls and 2 boxes. How many different ways are there to
place 2 balls in the first box and 3 balls in the second? The answer is that there are in total
$\binom{5}{2} = 10$ different ways, where $\binom{n}{m} = \frac{n!}{m!\,(n-m)!}$ and $n! = n \times (n-1) \times (n-2) \times \cdots \times 1$.
The same as in the example above, for our general setting here, the total number $K$ of
sequences we are interested in is
$$K = \binom{T}{Tp_1} \binom{T - Tp_1}{Tp_2} \binom{T - Tp_1 - Tp_2}{Tp_3} \cdots \binom{T - Tp_1 - \cdots - Tp_{n-1}}{Tp_n}. \qquad (1)$$
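A quick sanity check of eq. (1): the telescoping product of binomial coefficients collapses to the single multinomial coefficient $T!/((Tp_1)!\,(Tp_2)!\cdots(Tp_n)!)$. A minimal Python sketch, where $T = 20$ and the letter counts $(10, 6, 4)$ are made-up values:

```python
from math import comb, factorial

T = 20
counts = [10, 6, 4]  # hypothetical counts T*p_i for a three-letter alphabet

# Telescoping product of binomial coefficients, as in eq. (1)
K, remaining = 1, T
for c in counts:
    K *= comb(remaining, c)
    remaining -= c

# The same count as one multinomial coefficient T! / prod((T*p_i)!)
multinomial = factorial(T)
for c in counts:
    multinomial //= factorial(c)

print(K, multinomial)  # both equal 38798760
```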
2.2 Average amount of storage resources
Next, if we want to index each sequence among these $K$ sequences using binary digits,
i.e., a sequence using only 0 and 1, what is the minimum length of the binary sequence?
Still, let us look at an example first.
Example 2.2. If $K = 4$, the 4 sequences can be respectively indexed by 00, 01, 10 and 11.
So, the binary sequence should have length $\log_2 4 = 2$ to index each sequence.
Therefore, the binary sequence should have length $\log_2 K$ in order to index each se-
quence among all these $K$ sequences. In terms of computer science, we need $\log_2 K$
bits to index (and store) a sequence. Next, we will derive a more explicit expression of $\log_2 K$.
If $m$ is large enough, $m!$ can be quite accurately approximated by the Stirling formula:
$$m! \approx \sqrt{2\pi m}\left(\frac{m}{e}\right)^m.$$
Thus, for fixed $a \ge b \ge 0$ and $T \to \infty$, we have the approximation:
$$\binom{Ta}{Tb} = \frac{(Ta)!}{(Tb)!\,(Ta - Tb)!} \approx \frac{\sqrt{2\pi Ta}\left(\frac{Ta}{e}\right)^{Ta}}{\sqrt{2\pi Tb}\left(\frac{Tb}{e}\right)^{Tb}\sqrt{2\pi T(a-b)}\left(\frac{T(a-b)}{e}\right)^{T(a-b)}} = \frac{\sqrt{a}\;a^{Ta}}{\sqrt{2\pi T b(a-b)}\;b^{Tb}\,(a-b)^{T(a-b)}}. \qquad (4)$$
Notice that for any fixed $p_i > 0$, $Tp_i \to \infty$ as $T \to \infty$, which means we can apply
approximation eq. (4) to every term in eq. (1). Using it, we obtain
$$\log_2 K \approx \frac{1-n}{2}\log_2(2\pi T) - \frac{1}{2}\left(\log_2 p_1 + \log_2 p_2 + \cdots + \log_2 p_n\right) - Tp_1\log_2 p_1 - Tp_2\log_2 p_2 - \cdots - Tp_n\log_2 p_n. \qquad (5)$$
Now, if we consider the average number of bits a letter needs in indexing a sequence of
length $T$, a minor miracle happens: as $T \to \infty$,
$$\frac{\log_2 K}{T} \longrightarrow -\sum_{i=1}^{n} p_i \log_2 p_i. \qquad (6)$$
The expression on the right-hand side of eq. (6) is the celebrated quantity associated
with a probability distribution, called the Shannon entropy.
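The convergence in eq. (6) can be watched numerically. The sketch below (the three-letter distribution is a made-up example) computes $\log_2 K$ exactly via the log-gamma function and compares $\log_2 K / T$ with the entropy as $T$ grows:

```python
from math import lgamma, log

def log2_K(T, p):
    # log2 of K = T! / prod((T*p_i)!), computed via log-gamma to avoid huge integers
    return (lgamma(T + 1) - sum(lgamma(T * pi + 1) for pi in p)) / log(2)

def entropy(p):
    # Shannon entropy in bits: -sum_i p_i * log2(p_i)
    return -sum(pi * log(pi, 2) for pi in p)

p = (0.5, 0.25, 0.25)            # made-up distribution with H(p) = 1.5 bits
for T in (100, 10_000, 1_000_000):
    print(T, log2_K(T, p) / T)   # approaches H(p) as T grows
print("H(p) =", entropy(p))
```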
Let's review a little what we have done: we have $K$ sequences in total, and all
sequences appear equally likely. Assume they encode different messages. Regardless of the
specific messages they encode, we regard them as having the same amount of information.
Let's just use the number of bits needed to encode a sequence to count the amount of
information a sequence encodes (or can provide). Then, $\frac{\log_2 K}{T}$ can be viewed as the average
amount of information a letter has. This suggests that we can actually define the amount
of information of each letter. Here, we say "average" because we think the amounts of
information different letters have should be different, as they may not "contribute equally"
in a sequence, depending on the respective probabilities of the letters. Indeed, if we look
into formula (6), it only depends on the probability distribution of the letters in $A$. If
we reformulate the formula as $\sum_{i=1}^{n} p_i \log_2 \frac{1}{p_i}$,
it is clearly the expectation (i.e., average in the sense of probability) of the quantity $\log_2 \frac{1}{p_i}$
associated with the letter $x_i$, for $1 \le i \le n$. This matches the term "average", so we
can define the amount of information a letter $x_i$ with probability $p_i$ has to be $\log_2 \frac{1}{p_i}$.
In this definition of information, we observe that if a letter has a higher probability, it
has less information, and vice versa. In other words, more uncertainty means more information.
Just like a lottery: winning the first prize is less likely but more shocking when it happens,
while you may think winning a prize of 10 bucks is not a big deal since it is very likely.
Hence, this definition agrees with our intuition as well.
In the remainder of this note, we will omit the base of the logarithm function.
Theoretically, the base could be any number; it is 2 by default. Now we summarize
information and Shannon entropy in the following definition:
Definition 2.3. Let $X$ be a random variable, taking value $x_i$ with probability $p_i$, for
$1 \le i \le n$. Then, the quantity $I(p_i) = \log \frac{1}{p_i}$ is the amount of information encoded in $x_i$
(or $p_i$), while the average amount of information $\sum_{i=1}^{n} p_i \log \frac{1}{p_i}$ is called the Shannon
entropy of the random variable $X$ (or of the distribution $P$), denoted by $H(X)$.
Question: among all possible probability distributions, which distribution gives the
largest Shannon entropy? The answer is given in the following proposition:
Proposition 2.4. For finite $n$, when $p_i = \frac{1}{n}$ for $1 \le i \le n$, the Shannon entropy attains
its maximum $\sum_{i=1}^{n} \frac{1}{n} \log n = \log n$.
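Proposition 2.4 is easy to probe numerically. A small sketch (the alphabet size and number of random trials are arbitrary choices): no randomly drawn distribution beats the uniform one.

```python
import random
from math import log2

def entropy(p):
    # H(P) = sum_i p_i * log2(1/p_i), with 0 * log(1/0) taken as 0
    return sum(pi * log2(1 / pi) for pi in p if pi > 0)

n = 8
uniform = [1 / n] * n
assert abs(entropy(uniform) - log2(n)) < 1e-12   # H = log n in the uniform case

random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in range(n)]
    total = sum(w)
    q = [wi / total for wi in w]                 # a random distribution on n letters
    assert entropy(q) <= entropy(uniform) + 1e-12
print("maximum entropy:", entropy(uniform), "= log2(8) = 3 bits")
```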
2.3 Further definitions and properties
The definitions of information and entropy can be extended to continuous random variables.
Let $X$ be a random variable taking real values and let $f(x)$ be its
probability density function. Then, the probability $P(X = x) \approx f(x)\Delta x$ for a small interval $\Delta x$. Mimicking the
discrete finite case, the entropy of $X$ can be defined by
$$H(X) = -\sum_x P(x)\log P(x) = -\lim_{\Delta x \to 0} \sum_x [f(x)\Delta x]\log[f(x)\Delta x] \qquad (7)$$
$$= -\lim_{\Delta x \to 0} \sum_x [f(x)\Delta x]\left(\log f(x) + \log \Delta x\right) \qquad (8)$$
$$= -\int f(x)\log f(x)\,dx - \lim_{\Delta x \to 0} \log \Delta x, \qquad (9)$$
where we used the definition of the (Riemann) integral and the fact $\int f(x)\,dx = 1$. The last
formula here is called the absolute entropy of the random variable $X$. Note that, regardless of
the probability distribution, there is always a positively infinite term $-\lim_{\Delta x \to 0}\log \Delta x$. So, we can
drop this term and define the (relative) entropy of $X$ to be
$$H(X) = -\int f(x)\log f(x)\,dx.$$
Proposition 2.5. Among all real random variables with expectation $\mu$ and variance $\sigma^2$,
the Gaussian distribution $X \sim \mathcal{N}(\mu, \sigma^2)$ attains the maximum entropy
$$H(X) = \frac{1}{2}\log(2\pi e \sigma^2).$$
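Proposition 2.5's closed form can be checked against a brute-force Riemann-sum evaluation of $-\int f(x)\log_2 f(x)\,dx$. A sketch with made-up parameters $\mu$ and $\sigma$:

```python
from math import pi, e, exp, log, sqrt

mu, sigma = 0.0, 1.7   # made-up Gaussian parameters
def f(x):
    # density of N(mu, sigma^2)
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Riemann sum of -f(x) * log2(f(x)) over a wide interval around the mean
dx, H = 1e-4, 0.0
x = mu - 12 * sigma
while x < mu + 12 * sigma:
    H -= f(x) * (log(f(x)) / log(2)) * dx
    x += dx

closed_form = 0.5 * log(2 * pi * e * sigma ** 2) / log(2)   # (1/2) log2(2*pi*e*sigma^2)
print(H, closed_form)   # the two values agree to several decimal places
```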
Note that joint distributions and conditional distributions are still probability distributions.
Then, we can correspondingly define entropy for them.
Definition 2.6. Let $X, Y$ be two random variables with joint distribution $P(X = x, Y = y)$
($P(x, y)$ for short). Then the joint entropy $H(X, Y)$ is defined by
$$H(X, Y) = -\sum_{x,y} P(x, y)\log P(x, y). \qquad (10)$$
Definition 2.7. Let $X, Y$ be two random variables with joint distribution $P(x, y)$ and
conditional distribution $P(x|y)$. Then the conditional entropy $H(X|Y)$ is defined by
$$H(X|Y) = -\sum_{x,y} P(x, y)\log P(x|y). \qquad (11)$$
Remark 2.8. Fixing $X = x$, $P(Y|x)$ is also a probability distribution. Its entropy equals
$$H(Y|x) = -\sum_y P(y|x)\log P(y|x),$$
which can be viewed as a function of $x$ (or a random variable depending on $X$). It can
be checked that $H(Y|X)$ is actually the expectation of $H(Y|x)$, i.e.,
$$H(Y|X) = \sum_x P(x)\,H(Y|x),$$
using the fact that $P(x, y) = P(x)P(y|x)$.
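Definitions 2.6 and 2.7, together with Remark 2.8, imply the chain rule $H(X, Y) = H(X) + H(Y|X)$, which a few lines of Python can verify on a made-up joint distribution:

```python
from math import log2

# A made-up joint distribution P(x, y) on {0, 1} x {0, 1}
P = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
Px = {x: sum(p for (a, _), p in P.items() if a == x) for x in (0, 1)}  # marginal P(x)

H_XY = -sum(p * log2(p) for p in P.values())                        # eq. (10)
H_X = -sum(p * log2(p) for p in Px.values())
H_Y_given_X = -sum(p * log2(p / Px[x]) for (x, _), p in P.items())  # uses P(y|x) = P(x,y)/P(x)

print(H_XY, H_X + H_Y_given_X)  # chain rule: the two values coincide
```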
3 Channel Capacity
In a communication system, we have three basic ingredients: the source, the destination,
and the media between them. We call the media the (communication) channel. A channel
could be in any form: physical wires, cables, the open environment in wireless
communication, antennas, or certain combinations of them.
3.1 Channel without error
Consider a channel and a set $A$ of letters (or symbols) which can be transmitted via the
channel. Now suppose an information source generates letters in $A$ following a probabil-
ity distribution $P$ (so we have a random variable $X$ taking values in $A$), and sends the
generated letters to the destination through the channel.
Assume the channel carries the exact letters generated by the source to the desti-
nation. Then, what is the amount of information received by the destination? Certainly,
the destination will receive exactly the same amount of information generated or provided
by the source, which is $T H(X)$ in a time period of $T$ symbols (with $T$ large
enough). Namely, in a time period of symbol-length $T$, the source will generate a sequence
of length $T$, and the destination will receive the same sequence, no matter what the sequence
generated at the source is. Hence, the amount of information received at the destination
is on average $H(X)$ per symbol.
The channel capacity of a channel is the maximum amount of information that can on average
be obtained at the destination in a fixed time duration, e.g., per second or per symbol
(time). Put differently, the channel capacity can be characterized by the maximum
number of sequences on $A$ we can select and transmit on the channel such that the
destination can in principle determine, without error, the corresponding sequences fed
into the channel based on the received sequences.
If the channel is errorless, what is the capacity of the channel? Well, as discussed
above, the maximum amount of information that can be received at the destination equals
the maximum amount of information that can be generated at the source. Therefore, the
channel capacity $C$ for this case is
$$C = \max_X H(X) \quad \text{per symbol}, \qquad (12)$$
where $X$ ranges over all possible distributions on $A$.
For example, if $A$ contains $n$ letters, $n$ finite, then we know from Proposi-
tion 2.4 that the uniform distribution achieves the channel capacity $C = \log n$.
3.2 Channel with error
What is the channel capacity of a channel with error? A channel with error means that
when the source generates a letter $x_i \in A$ and transmits it to the destination via the channel,
due to some unpredictable error, the received letter at the destination may be $x_j$. Assume that,
statistically, $x_j$ is received with probability $p(x_j|x_i)$ when $x_i$ is transmitted. These
probabilities are called the transit probabilities of the channel. We assume that, once the channel
is given, the transit probabilities are determined and do not change.
We start with some examples.
Example 3.1. Assume $A = \{0, 1\}$. If the transit probabilities of the channel are
$$p(1|0) = 0.5, \quad p(0|0) = 0.5,$$
$$p(1|1) = 0.5, \quad p(0|1) = 0.5,$$
what is the channel capacity? The answer should be 0, i.e., the destination cannot obtain
any information at all.
This is because, no matter what is sent to the destination, the received sequence at
the destination could be any 0–1 sequence, with equal probability. From the received
sequence, we can neither determine which sequence is the one generated at the source,
nor can we determine which sequences are not the one generated at the source.
In other words, the received sequence has no binding relation with the transmitted
sequence at all; we could actually flip a fair coin to generate a sequence
ourselves instead of looking at the one actually received at the destination.
Example 3.2. Assume $A = \{0, 1\}$. If the transit probabilities of the channel are
$$p(1|0) = 0.1, \quad p(0|0) = 0.9,$$
$$p(1|1) = 0.9, \quad p(0|1) = 0.1,$$
what is the channel capacity? The answer should not be 0, i.e., the destination can
determine something in regard to the transmitted sequence.
Further suppose the source generates 0 and 1 with equal probability, i.e., $\frac{1}{2}$ each. Observe
the outcome at the destination for a long enough time, that is, a sequence long enough; for
computation purposes, say a 10000-letter sequence is long enough (to guarantee the
law of large numbers to be effective). With these assumptions, there are 5000 1's and 5000 0's
in the generated sequence at the source. After the channel, about $5000 \times 0.1 = 500$
1's will have been changed to 0's and vice versa. Thus, the received sequence also has 5000 1's
and 5000 0's. Assume the sequence received at the destination has 5000 1's in the first
half of its entries and 5000 0's in the second half.
With these probabilities and the received sequence known, what can we say about the
generated sequence at the source? Well, it is not immediately possible to know what
the generated sequence is based on this intelligence, because there is more than one
sequence which can lead to the received sequence after going through the channel. But
the sequence generated at the source can certainly not be the sequence that contains 5000
0's in the first half and 5000 1's in the second half, or any sequence with most of its 0's
concentrated in the first half. For if that one were the generated one, the
received sequence should contain about 4500 0's in its first half,
which is not the case observed in the received sequence.
This is unlike the example right above, for which we can neither determine which sequence was
generated nor which were not generated. Thus, the information obtained by the destination
should not be 0.
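The counting argument above can be reproduced by a short simulation. A sketch, where the sequence length, the random seed, and the "all 1's then all 0's" source sequence are simply the example's assumptions:

```python
import random

random.seed(1)
T, p_flip = 10_000, 0.1

source = [1] * 5000 + [0] * 5000                 # 5000 1's followed by 5000 0's
# The channel flips each letter independently with probability 0.1
received = [s ^ (random.random() < p_flip) for s in source]

ones_first_half = sum(received[:5000])
ones_second_half = sum(received[5000:])
print(ones_first_half, ones_second_half)
# about 4500 ones survive in the first half and about 500 spurious ones appear
# in the second half, so the received sequence still rules out sources whose
# 1's were concentrated in the second half
```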
Let us come back to determining the capacity of the channel. Recall that the capacity is
the maximum number of sequences on $A$ we can select and transmit on the channel
such that the destination can in principle determine, without error, the corresponding
sequences fed into the channel based on the received sequences. Since there is error in
the transmission on the channel, we cannot select two sequences which potentially lead
to the same sequence after going through the channel; otherwise we can
never determine which one of the two is the transmitted one based on the
same (received) sequence at the destination.
Basically, the possible outputs at the destination are also sequences on $A$, where
the element $x_i$, for $1 \le i \le n$, appears in these sequences with probability
$$p_Y(x_i) = \sum_{j=1}^{n} p(x_i|x_j)\,P(X = x_j).$$
Note that this probability distribution depends only on the distribution of $X$, since the transit
probabilities are fixed. Denote the random variable associated with this probability dis-
tribution at the destination by $Y$ (note $Y$ will change as $X$ changes). Shannon [1] proved
that for a given distribution $X$, we can choose at most $2^{T[H(X)-H(X|Y)]}$ sequences (satisfy-
ing the given distribution) to be the sequences to transmit on the channel such that the
destination can determine, in principle without error, the transmitted sequence based
on the received sequences. That is, the destination can obtain $H(X) - H(X|Y)$ bits of
information per symbol. Therefore, the channel capacity for this case is
$$C = \max_X \left[H(X) - H(X|Y)\right] \quad \text{per symbol}, \qquad (13)$$
where $X$ ranges over all probability distributions on $A$.
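For the binary channels of Examples 3.1 and 3.2, the maximization in eq. (13) can be carried out by brute force. The sketch below uses the equivalent form $H(Y) - H(Y|X)$ of the quantity $H(X) - H(X|Y)$ (both equal the mutual information between input and output) and a plain grid search over input distributions:

```python
from math import log2

def H2(q):
    # binary entropy function
    return 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)

def mutual_information(p1, eps):
    # I(X;Y) = H(Y) - H(Y|X) for a binary channel flipping each bit with
    # probability eps, when the input X is 1 with probability p1
    q1 = p1 * (1 - eps) + (1 - p1) * eps   # P(Y = 1)
    return H2(q1) - H2(eps)

def capacity(eps, steps=10_000):
    # eq. (13) by grid search over all input distributions (p1, 1 - p1)
    return max(mutual_information(i / steps, eps) for i in range(steps + 1))

print(capacity(0.5))   # Example 3.1: 0 bits per symbol
print(capacity(0.1))   # Example 3.2: about 1 - H2(0.1) = 0.531 bits per symbol
```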
4 Open Discussion
There are many articles and news items claiming that the Shannon capacity limit defined above
has been broken. In fact, these are just advertisements for new technologies with
more advanced settings than Shannon's original theory, e.g., multiple-antenna transmit-
ting/receiving technologies (MIMO). Essentially, these technologies are still based on Shannon
capacity. They have not broken the Shannon capacity limit at all.
Can the Shannon capacity limit be broken?
In my personal opinion, all "limit" problems are optimization problems under certain
conditions. So, it is really about settings. If you believe Shannon's settings are the
most suitable ones for modelling a communication system, then I suspect you cannot break
the Shannon capacity limit. Anyway, we end this note with some open discussion:
1. There is no problem in modelling information sources as random processes. However,
given a channel and the set $A$ of letters transmittable on the channel, to discuss
the capacity of the channel, why are we only allowed to select sequences obeying
the same probability distribution? Given two probability distributions $P_1$ and $P_2$ on $A$
(with corresponding random variables $X_1$ and $X_2$): if there exists $x_i$, for some $1 \le i \le n$,
such that the induced output distributions differ, i.e., $\sum_j p(x_i|x_j)P_1(x_j) \neq \sum_j p(x_i|x_j)P_2(x_j)$,
we call $X_1$ and $X_2$ compatible. Note, in this case, if we transmit a sequence satisfying
distribution $X_1$ and another sequence satisfying distribution $X_2$, the destination
should know, based on inspecting the number of $x_i$'s in the received sequence, whether
the transmitted sequence is from the $X_1$-class or the $X_2$-class. We call a set $F$ of
probability distributions on $A$, such that any two probability distributions in $F$ are
compatible, an admissible set. Should the capacity of the channel then be defined as
$$C = \max_F \frac{1}{T}\log_2 \sum_{X \in F} 2^{T[H(X)-H(X|Y)]},$$
where $F$ ranges over all admissible sets on $A$?
2. Setting theoretical discussion aside, there are lots of real systems and simulations. Why is
there no report of breaking the Shannon capacity limit? Possibly, the reason is that the
channel encoding methods used just generate sequences (of modulation symbols,
e.g., QPSK, 16QAM, etc.) satisfying almost the same probability distribution, i.e., they agree
with Shannon's settings for obtaining channel capacity. Is that the case?
References
[1] C.E. Shannon, A mathematical theory of communication, Bell Syst. Tech. J., vol. 27,
pp. 379–423, 623–656, July–Oct. 1948.
[2] C.E. Shannon, Communication in the presence of noise, Proc. IRE, vol. 37, pp. 10–21, 1949.
[3] S. Verdú, Fifty years of Shannon theory, IEEE Trans. Inf. Theory, vol. 44, pp. 2057–2078, 1998.
