Chapter 1 Information Theory for Discrete Variables
(1) Message Sets
What is a message? Shannon suggested that a message has to do with choice and sets. In his words: "The significant aspect is that the actual message is one selected from a set of possible messages."
(2) Measuring Choice
Shannon was interested in defining a quantity that would measure, in some sense, how much information is "produced" by choosing messages. He suggested that the logarithm of the number of elements in the message set can be regarded as a measure of the information produced when one message is chosen from the set, all choices being equally likely.
For example, consider a fair coin that takes on values in $\{\text{Heads}, \text{Tails}\}$ and suppose we flip the coin $n$ times. The string of $n$ coin flips takes on one of $2^n$ values, all equally likely, and the information produced is
$$\log_2 2^n = n \text{ bits.}$$
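As a concrete instance, with $n = 3$ flips there are $2^3 = 8$ equally likely strings (HHH, HHT, ..., TTT), and the information produced is
$$\log_2 8 = 3 \text{ bits.}$$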
For non-equally likely choices, Shannon developed a measure that has the same form as the entropy in statistical mechanics. For example, consider a biased coin that takes on the value Heads with probability $p$ and Tails with probability $1-p$. The string of $n$ coin flips takes on approximately $2^{\,n\left(-p\log_2 p-(1-p)\log_2(1-p)\right)}$ likely values (since the expected number of Heads is $np$). Shannon proposed that the information produced when flipping the coin $n$ times is
$$n\big(-p\log_2 p-(1-p)\log_2(1-p)\big) \text{ bits.}$$
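For instance (the values $p = 0.1$ and $n = 100$ are chosen only for illustration), we have
$$-0.1\log_2 0.1 - 0.9\log_2 0.9 \approx 0.469,$$
so $100$ flips of this biased coin produce only about $46.9$ bits of information, rather than the $100$ bits produced by a fair coin.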
(3) Entropy
Let $\operatorname{supp}(P_X)$ be the support set of the function $P_X$, i.e., the set of $x \in \mathcal{X}$ such that $P_X(x) > 0$. We define the entropy or uncertainty of the discrete random variable $X$ as
$$H(X) = -\sum_{x \in \operatorname{supp}(P_X)} P_X(x) \log_2 P_X(x).$$
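As a small worked example (the alphabet and distribution are chosen only for illustration), let $\mathcal{X} = \{a, b, c\}$ with $P_X(a) = 1/2$ and $P_X(b) = P_X(c) = 1/4$. Then
$$H(X) = \tfrac12\log_2 2 + \tfrac14\log_2 4 + \tfrac14\log_2 4 = \tfrac12 + \tfrac12 + \tfrac12 = 1.5 \text{ bits.}$$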
Example 1.1
The entropy $H(X)$ depends only on the probability distribution $P_X$, and not on the letters in the alphabet $\mathcal{X}$ of $X$. We thus have
$$H(f(X)) = H(X)$$
for any invertible or bijective function $f$, since $f$ simply relabels the letters in the alphabet. Examples include $f(x) = x + a$ and $f(x) = x^3$ for real-valued alphabets.
Example 1.2
Consider the Bernoulli distribution that has an alphabet $\mathcal{X} = \{0, 1\}$ and $P_X(1) = p = 1 - P_X(0)$. The entropy of $X$ is
$$H(X) = -p\log_2 p - (1-p)\log_2(1-p) = H_2(p)$$
and $H_2(\cdot)$ is called the binary entropy function.
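A few values illustrate the shape of $H_2(\cdot)$: it satisfies $H_2(0) = H_2(1) = 0$ (with the convention $0\log_2 0 = 0$), $H_2(1/2) = 1$ bit, and $H_2(p) = H_2(1-p)$, and it is concave with its maximum at $p = 1/2$. For example,
$$H_2\!\left(\tfrac14\right) = \tfrac14\log_2 4 + \tfrac34\log_2\tfrac43 \approx 0.811 \text{ bits.}$$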
Theorem 1.1
For a discrete random variable $X$ with alphabet $\mathcal{X}$ we have
$$0 \le H(X) \le \log_2|\mathcal{X}|,$$
with equality on the left iff there is one letter $a \in \mathcal{X}$ with $P_X(a) = 1$, and with equality on the right iff $P_X(x) = 1/|\mathcal{X}|$ for all $x \in \mathcal{X}$, i.e., $X$ is uniformly distributed.
Proof of the right equality.
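One standard argument applies Jensen's inequality to the concave function $\log_2(\cdot)$:
$$H(X) = \sum_{x \in \operatorname{supp}(P_X)} P_X(x)\log_2\frac{1}{P_X(x)} \le \log_2\!\left(\sum_{x \in \operatorname{supp}(P_X)} P_X(x)\frac{1}{P_X(x)}\right) = \log_2|\operatorname{supp}(P_X)| \le \log_2|\mathcal{X}|.$$
Equality in the first step requires $1/P_X(x)$ to be constant on $\operatorname{supp}(P_X)$, and equality in the second step requires $\operatorname{supp}(P_X) = \mathcal{X}$; together these give $P_X(x) = 1/|\mathcal{X}|$ for all $x \in \mathcal{X}$.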
(4) Example Distributions
(5) Conditional Entropy
Consider the joint distribution $P_{XY}$ where $X$ and $Y$ are discrete random variables.
Focusing on one row of the table of the joint distribution, we get the conditional entropy of $X$ given the event $Y = y$ with $P_Y(y) > 0$:
$$H(X \mid Y = y) = -\sum_{x \in \operatorname{supp}(P_{X|Y}(\cdot\,|\,y))} P_{X|Y}(x \mid y)\log_2 P_{X|Y}(x \mid y).$$
Similarly to Theorem 1.1,
$$0 \le H(X \mid Y = y) \le \log_2|\mathcal{X}|,$$
with equality on the left iff $P_{X|Y}(a \mid y) = 1$ for some $a \in \mathcal{X}$, and with equality on the right iff $P_{X|Y}(x \mid y) = 1/|\mathcal{X}|$ for any $x \in \mathcal{X}$.
We average the conditional entropy over all $y$ with $P_Y(y) > 0$ and get the conditional entropy of $X$ given $Y$:
$$H(X \mid Y) = \sum_{y \in \operatorname{supp}(P_Y)} P_Y(y)\, H(X \mid Y = y).$$
Also,
$$0 \le H(X \mid Y) \le \log_2|\mathcal{X}|,$$
with equality on the left iff for every $y \in \operatorname{supp}(P_Y)$ there is one $a \in \mathcal{X}$ such that $P_{X|Y}(a \mid y) = 1$, and with equality on the right iff $P_{X|Y}(x \mid y) = 1/|\mathcal{X}|$ for any $x \in \mathcal{X}$ and for every $y \in \operatorname{supp}(P_Y)$.
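As a small worked example (the joint distribution is chosen only for illustration), let $X$ and $Y$ be binary with $P_{XY}(0,0) = 1/2$, $P_{XY}(0,1) = P_{XY}(1,1) = 1/4$, and $P_{XY}(1,0) = 0$. Then $P_Y(0) = P_Y(1) = 1/2$, and
$$H(X \mid Y = 0) = 0, \qquad H(X \mid Y = 1) = 1 \text{ bit}, \qquad H(X \mid Y) = \tfrac12 \cdot 0 + \tfrac12 \cdot 1 = 0.5 \text{ bits},$$
which is smaller than $H(X) = H_2(1/4) \approx 0.811$ bits.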
Example 1.3
We say that $Y$ essentially determines $X$ if $H(X \mid Y) = 0$.
Recall Example 1.1, where $f$ is invertible, so that $H(f(X)) = H(X)$. $X$ essentially determines $f(X)$ and $f(X)$ essentially determines $X$, i.e., $H(f(X) \mid X) = H(X \mid f(X)) = 0$.
For a non-invertible function $f$, we may have $H(f(X)) \ne H(X)$, since the probability distributions of $X$ and $f(X)$ might be different. $X$ still essentially determines $f(X)$, i.e., $H(f(X) \mid X) = 0$, but $f(X)$ no longer determines $X$ in general, i.e., $H(X \mid f(X))$ can be positive.
Examples include $f(x) = x^2$ and $f(x) = |x|$ on alphabets containing both $x$ and $-x$.
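For instance (the alphabet and function are chosen only for illustration), let $X$ be uniform on $\{1, 2, 3, 4\}$ and let $f(x) = x \bmod 2$. Then $f(X)$ is uniform on $\{0, 1\}$ and
$$H(X) = 2 \text{ bits}, \quad H(f(X)) = 1 \text{ bit}, \quad H(f(X) \mid X) = 0, \quad H(X \mid f(X)) = 1 \text{ bit},$$
so $X$ essentially determines $f(X)$, but $f(X)$ does not essentially determine $X$: given the parity of $X$, two values remain equally likely.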
(6) Joint Entropy
The joint entropy of $X$ and $Y$ is defined by considering the concatenation $XY$ (not multiplication!) of $X$ and $Y$ as a new discrete random variable, i.e., we have
$$H(XY) = -\sum_{(x,y) \in \operatorname{supp}(P_{XY})} P_{XY}(x,y)\log_2 P_{XY}(x,y)$$
and
$$\max\big(H(X), H(Y)\big) \le H(XY) \le H(X) + H(Y),$$
with equality on the left iff $X$ essentially determines $Y$, or $Y$ essentially determines $X$, or both, and with equality on the right iff $P_{XY}(x,y) = P_X(x)P_Y(y)$ for all $(x,y)$, i.e., $X$ and $Y$ are statistically independent.
Example 1.4 Chain Rule for Entropy
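In its basic form, the chain rule states that
$$H(XY) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y),$$
which follows by expanding $P_{XY}(x,y) = P_X(x)P_{Y|X}(y \mid x)$ inside the logarithm of the joint entropy:
$$H(XY) = -\sum_{(x,y) \in \operatorname{supp}(P_{XY})} P_{XY}(x,y)\big(\log_2 P_X(x) + \log_2 P_{Y|X}(y \mid x)\big) = H(X) + H(Y \mid X).$$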
Example 1.5
Proof.
Example 1.6
Using the conclusions about conditional entropy in Example 1.3, we study the joint entropy of $X$ and $f(X)$.
For an invertible function $f$,
$$H(X f(X)) = H(X) + H(f(X) \mid X) = H(X) = H(f(X)).$$
For a non-invertible function $f$, we still have
$$H(X f(X)) = H(X) + H(f(X) \mid X) = H(X),$$
but now $H(X f(X)) = H(f(X)) + H(X \mid f(X)) \ge H(f(X))$, with strict inequality whenever $H(X \mid f(X)) > 0$.
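Continuing the illustration (alphabet and function chosen only for illustration), let $X$ be uniform on $\{-1, 0, 1\}$ and $f(x) = x^2$. Then $f(X) \in \{0, 1\}$ with $P_{f(X)}(0) = 1/3$ and $P_{f(X)}(1) = 2/3$, so
$$H(X f(X)) = H(X) = \log_2 3 \approx 1.585 \text{ bits}, \qquad H(f(X)) = H_2(1/3) \approx 0.918 \text{ bits},$$
and indeed $H(f(X)) + H(X \mid f(X)) \approx 0.918 + \tfrac23 \cdot 1 \approx 1.585$ bits.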
(7) Information Divergence / Kullback-Leibler Distance
The informational divergence (or Kullback-Leibler distance) between $P_X$ and $Q_X$ is defined as
$$D(P_X \| Q_X) = \sum_{x \in \operatorname{supp}(P_X)} P_X(x)\log_2\frac{P_X(x)}{Q_X(x)}.$$
We define $D(P_X \| Q_X) = \infty$ if $Q_X(a) = 0$ and $P_X(a) > 0$ for some $a \in \mathcal{X}$.
Observe that the definition is not symmetric in $P_X$ and $Q_X$, i.e., we have $D(P_X \| Q_X) \ne D(Q_X \| P_X)$ in general.
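For example (the two distributions are chosen only for illustration), let $P_X$ and $Q_X$ be Bernoulli distributions with $P_X(1) = 1/2$ and $Q_X(1) = 1/4$. Then
$$D(P_X \| Q_X) = \tfrac12\log_2\frac{1/2}{1/4} + \tfrac12\log_2\frac{1/2}{3/4} \approx 0.208 \text{ bits}, \qquad D(Q_X \| P_X) = \tfrac14\log_2\frac{1/4}{1/2} + \tfrac34\log_2\frac{3/4}{1/2} \approx 0.189 \text{ bits},$$
so the two divergences differ.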
To avoid the case $D(P_X \| Q_X) = \infty$ we introduce the notation $P_X \ll Q_X$ to mean that $Q_X(x) = 0$ implies $P_X(x) = 0$ for all $x \in \mathcal{X}$. In other words, $P_X \ll Q_X$ implies that $\operatorname{supp}(P_X) \subseteq \operatorname{supp}(Q_X)$, or $|\operatorname{supp}(P_X)| \le |\operatorname{supp}(Q_X)|$ for finite sets.
Given a third discrete random variable $Z$, we define the conditional informational divergence between $P_{X|Z}$ and $Q_{X|Z}$ as
$$D(P_{X|Z} \| Q_{X|Z} \mid P_Z) = \sum_{z \in \operatorname{supp}(P_Z)} P_Z(z)\, D\big(P_{X|Z}(\cdot \mid z) \,\big\|\, Q_{X|Z}(\cdot \mid z)\big).$$
Theorem 1.2
We have
$$D(P_X \| Q_X) \ge 0,$$
with equality iff $P_X(x) = Q_X(x)$ for all $x \in \mathcal{X}$.
Proof.
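One standard argument uses the inequality $\ln t \le t - 1$, with equality iff $t = 1$:
$$-D(P_X \| Q_X) = \sum_{x \in \operatorname{supp}(P_X)} P_X(x)\log_2\frac{Q_X(x)}{P_X(x)} \le \log_2(e)\sum_{x \in \operatorname{supp}(P_X)} P_X(x)\left(\frac{Q_X(x)}{P_X(x)} - 1\right) = \log_2(e)\left(\sum_{x \in \operatorname{supp}(P_X)} Q_X(x) - 1\right) \le 0.$$
Equality in the first step requires $Q_X(x) = P_X(x)$ for all $x \in \operatorname{supp}(P_X)$, and equality in the last step requires $\sum_{x \in \operatorname{supp}(P_X)} Q_X(x) = 1$, i.e., $Q_X$ puts no probability outside $\operatorname{supp}(P_X)$; together these give $P_X(x) = Q_X(x)$ for all $x \in \mathcal{X}$.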
Example 1.7
If $Q_X$ is uniform on $\mathcal{X}$, then for any $P_X$ on $\mathcal{X}$,
$$D(P_X \| Q_X) = \log_2|\mathcal{X}| - H(X).$$
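To verify this, expand the definition with $Q_X(x) = 1/|\mathcal{X}|$:
$$D(P_X \| Q_X) = \sum_{x \in \operatorname{supp}(P_X)} P_X(x)\big(\log_2 P_X(x) + \log_2|\mathcal{X}|\big) = \log_2|\mathcal{X}| - H(X).$$
Combined with Theorem 1.2, this gives another proof of the right inequality of Theorem 1.1: $H(X) \le \log_2|\mathcal{X}|$, with equality iff $P_X$ is uniform.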
Example 1.8
If $P_X$ is uniform on $\mathcal{X}$, then for any $Q_X$ with $Q_X(x) > 0$ for all $x \in \mathcal{X}$,
$$D(P_X \| Q_X) = \frac{1}{|\mathcal{X}|}\sum_{x \in \mathcal{X}}\log_2\frac{1}{Q_X(x)} - \log_2|\mathcal{X}|.$$
Example 1.9
If $P_X(a) = 1$ for some $a \in \mathcal{X}$, then
$$D(P_X \| Q_X) = \log_2\frac{1}{Q_X(a)}.$$
Example 1.10 Chain Rule for Information Divergence
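In its basic form, the chain rule states that, for joint distributions with $P_{XY} \ll Q_{XY}$,
$$D(P_{XY} \| Q_{XY}) = D(P_X \| Q_X) + D(P_{Y|X} \| Q_{Y|X} \mid P_X),$$
which follows by expanding $P_{XY} = P_X P_{Y|X}$ and $Q_{XY} = Q_X Q_{Y|X}$ inside the logarithm and using the definition of conditional informational divergence.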