Chapter 1 Information Theory for Discrete Variables
(1) Message Sets
What is a message? Shannon suggested that a message has to do with choice and sets. In his words: "The significant aspect is that the actual message is one selected from a set of possible messages."
(2) Measuring Choice
Shannon was interested in defining a quantity that would measure, in some sense, how much information is "produced" by choosing messages. He suggested that the logarithm of the number of elements in the message set can be regarded as a measure of the information produced when one message is chosen from the set, all choices being equally likely.
For example, consider a fair coin that takes on values in $\{\text{Heads}, \text{Tails}\}$ with equal probability. Choosing one of the two equally likely messages produces $\log_2 2 = 1$ bit of information.
For non-equally likely choices, Shannon developed a measure that has the same form as the entropy in statistical mechanics. For example, consider a biased coin that takes on the value Heads with probability $p$ and Tails with probability $1-p$; the information produced should then depend on $p$.
(3) Entropy
Let $\operatorname{supp}(P_X)$ be the support set of the function $P_X$, i.e., the set of $x$ such that $P_X(x) > 0$. We define the entropy or uncertainty of the discrete random variable $X$ as
$$H(X) = -\sum_{x \in \operatorname{supp}(P_X)} P_X(x) \log_2 P_X(x).$$
Example 1.1
The entropy satisfies $H(f(X)) = H(X)$ for any invertible or bijective function $f$, since relabeling the letters does not change the probabilities.
Example 1.2
Consider the Bernoulli distribution that has an alphabet $\mathcal{X} = \{0, 1\}$ and $P_X(1) = p$, $P_X(0) = 1 - p$. Its entropy is the binary entropy function
$$H_2(p) = -p \log_2 p - (1-p) \log_2 (1-p).$$
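As a quick numerical sketch of the binary entropy function (the function name and sample values below are chosen for illustration):

```python
import math

def binary_entropy(p: float) -> float:
    """Binary entropy H_2(p) in bits, with H_2(0) = H_2(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # 1.0 bit: a fair coin
print(binary_entropy(0.11))  # about 0.5 bit: a biased coin is less uncertain
```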
Theorem 1.1
$$0 \le H(X) \le \log_2 |\mathcal{X}|,$$
with equality on the left iff. there is one letter $a$ in $\mathcal{X}$ with $P_X(a) = 1$, and with equality on the right iff. $P_X(x) = 1/|\mathcal{X}|$ for any $x \in \mathcal{X}$, i.e., $X$ is uniformly distributed.
Proof of the right equality.
(4) Example Distributions
(5) Conditional Entropy
Consider the joint distribution $P_{XY}$ of two discrete random variables $X$ and $Y$.
Focusing on one line of the table of the joint distribution, we get the conditional entropy of $X$ given the event $Y = y$:
$$H(X \mid Y = y) = -\sum_{x} P_{X|Y}(x \mid y) \log_2 P_{X|Y}(x \mid y).$$
Similarly, $0 \le H(X \mid Y = y) \le \log_2 |\mathcal{X}|$,
with equality on the left iff. $P_{X|Y}(a \mid y) = 1$ for some letter $a$, and with equality on the right iff. $P_{X|Y}(\cdot \mid y)$ is uniform.
We average the conditional entropy over all $y$ to obtain the conditional entropy of $X$ given $Y$:
$$H(X \mid Y) = \sum_{y \in \operatorname{supp}(P_Y)} P_Y(y)\, H(X \mid Y = y) = -\sum_{(x,y)} P_{XY}(x,y) \log_2 P_{X|Y}(x \mid y).$$
Also, $0 \le H(X \mid Y) \le \log_2 |\mathcal{X}|$,
with equality on the left iff. there is one $x$ with $P_{X|Y}(x \mid y) = 1$ for every $y$ with $P_Y(y) > 0$, i.e., $X$ is a function of $Y$.
Example 1.3
We say that $X$ essentially determines $Y$ if $H(Y \mid X) = 0$, i.e., $Y = f(X)$ for some function $f$; in particular $H(f(X) \mid X) = 0$.
Recall Example 1.1: for an invertible function $f$ we also have $H(X \mid f(X)) = 0$.
For a non-invertible function $f$ we only have $H(X \mid f(X)) \ge 0$, and the inequality can be strict.
Examples include
(6) Joint Entropy
The joint entropy of $X$ and $Y$ is defined as
$$H(XY) = -\sum_{(x,y) \in \operatorname{supp}(P_{XY})} P_{XY}(x,y) \log_2 P_{XY}(x,y),$$
and it satisfies $\max\{H(X), H(Y)\} \le H(XY) \le H(X) + H(Y)$,
with equality on the left iff. one of the variables essentially determines the other, and with equality on the right iff. $X$ and $Y$ are statistically independent.
Example 1.4 Chain Rule for Entropy
$$H(XY) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y), \qquad H(X_1 X_2 \cdots X_N) = \sum_{i=1}^{N} H(X_i \mid X_1 \cdots X_{i-1}).$$
Example 1.5
Proof.
Example 1.6
Using the conclusions about conditional entropy in Example 1.3, we study the joint distribution of $X$ and $f(X)$. By the chain rule, $H(X f(X)) = H(X) + H(f(X) \mid X) = H(X)$.
For an invertible function $f$ we have $H(X \mid f(X)) = 0$, so $H(f(X)) = H(X)$.
For a non-invertible function $f$ we have $H(X \mid f(X)) \ge 0$, so $H(f(X)) = H(X) - H(X \mid f(X)) \le H(X)$.
(7) Information Divergence / Kullback-Leibler Distance
The informational divergence between $P_X$ and $P_Y$ (two distributions on the same alphabet) is defined as
$$D(P_X \| P_Y) = \sum_{x \in \operatorname{supp}(P_X)} P_X(x) \log_2 \frac{P_X(x)}{P_Y(x)}.$$
We define $D(P_X \| P_Y) = \infty$ if $P_Y(x) = 0$ for some $x \in \operatorname{supp}(P_X)$. Observe that the definition is not symmetric in $P_X$ and $P_Y$, i.e., we have $D(P_X \| P_Y) \ne D(P_Y \| P_X)$ in general.
To avoid the case $D(P_X \| P_Y) = \infty$, one usually requires $\operatorname{supp}(P_X) \subseteq \operatorname{supp}(P_Y)$.
Given a third discrete random variable $Z$, the conditional informational divergence is defined as
$$D(P_{X|Z} \| P_{Y|Z} \mid P_Z) = \sum_{z \in \operatorname{supp}(P_Z)} P_Z(z)\, D\big(P_{X|Z}(\cdot \mid z) \,\big\|\, P_{Y|Z}(\cdot \mid z)\big).$$
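A minimal sketch of how the informational divergence can be evaluated numerically; the dictionary-based representation of the pmfs and the example values are assumptions made for illustration:

```python
import math

def kl_divergence(p, q):
    """D(P||Q) in bits for pmfs given as dicts over the same alphabet.
    Returns infinity if Q(x) = 0 for some x with P(x) > 0."""
    d = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf
        d += px * math.log2(px / qx)
    return d

p = {'a': 0.5, 'b': 0.5}
q = {'a': 0.9, 'b': 0.1}
print(kl_divergence(p, q))  # ~0.737 bits
print(kl_divergence(q, p))  # ~0.531 bits: the divergence is not symmetric
```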
Theorem 1.2
$D(P_X \| P_Y) \ge 0$, with equality iff. $P_X(x) = P_Y(x)$ for all $x$.
Proof.
Example 1.7
If
Example 1.8
If
Example 1.9
If
Example 1.10 Chain Rule for Information Divergence
$$D(P_{XY} \| Q_{XY}) = D(P_X \| Q_X) + D(P_{Y|X} \| Q_{Y|X} \mid P_X).$$
Example 1.11 instead of
(8) Mutual Information
The mutual information $I(X;Y)$ between two random variables $X$ and $Y$ is defined as
$$I(X;Y) = D(P_{XY} \| P_X P_Y) = \sum_{(x,y) \in \operatorname{supp}(P_{XY})} P_{XY}(x,y) \log_2 \frac{P_{XY}(x,y)}{P_X(x) P_Y(y)}.$$
Note that $I(X;Y) = 0$ means that $X$ and $Y$ are statistically independent. Thus, $I(X;Y)$ measures the dependence of $X$ and $Y$.
Theorem 1.3
$I(X;Y) = H(X) - H(X \mid Y) \ge 0$, with equality iff. $X$ and $Y$ are statistically independent.
Proof.
The inequality $H(X \mid Y) \le H(X)$ means that conditioning doesn't increase entropy, or colloquially that conditioning reduces entropy. Note, however, that $H(X \mid Y = y)$ may exceed $H(X)$ for particular values of $y$; only the average over $y$ cannot.
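A small illustrative sketch (the joint pmf below is a made-up example) that computes mutual information directly from its definition as a divergence between $P_{XY}$ and $P_X P_Y$:

```python
import math

def mutual_information(p_xy):
    """I(X;Y) in bits from a joint pmf given as {(x, y): probability}."""
    p_x, p_y = {}, {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    total = 0.0
    for (x, y), p in p_xy.items():
        if p > 0:
            total += p * math.log2(p / (p_x[x] * p_y[y]))
    return total

# Y is a noisy copy of X: it agrees with X with probability 0.9.
p_xy = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
print(mutual_information(p_xy))  # ~0.531 bits; 0 would mean independence
```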
Example 1.12
We know that
When
Example 1.13 Chain Rule for Mutual Information
$$I(X; Y_1 Y_2 \cdots Y_N) = \sum_{i=1}^{N} I(X; Y_i \mid Y_1 \cdots Y_{i-1}).$$
Example 1.14
Using the chain rule for mutual information and conclusions of Example 1.11, we have
(9) Cross Entropy
The cross entropy of two distributions $P_X$ and $Q_X$ with the same domain is defined as
$$H(P_X, Q_X) = -\sum_{x \in \operatorname{supp}(P_X)} P_X(x) \log_2 Q_X(x).$$
Theorem 1.4
$H(X) \le H(P_X, Q_X)$, i.e., the cross entropy is never smaller than the entropy, with equality iff. $Q_X(x) = P_X(x)$ for any $x$.
Theorem 1.4 gives a useful tool to upper bound entropy: every choice of $Q_X$ yields an upper bound on $H(X)$, namely $H(X) \le -\sum_x P_X(x) \log_2 Q_X(x)$.
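A short numerical check of Theorem 1.4, using a made-up distribution $P_X$ and the uniform distribution as $Q_X$; the function names are illustrative:

```python
import math

def entropy(p):
    """H(X) in bits for a pmf given as {letter: probability}."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def cross_entropy(p, q):
    """-sum_x P(x) log2 Q(x); requires Q(x) > 0 wherever P(x) > 0."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

p = {'a': 0.7, 'b': 0.2, 'c': 0.1}
q = {'a': 1/3, 'b': 1/3, 'c': 1/3}
print(entropy(p))           # ~1.157 bits
print(cross_entropy(p, q))  # log2(3) ~ 1.585 bits, an upper bound on H(X)
print(cross_entropy(p, p))  # equals H(X): the bound is tight for Q = P
```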
(10) Inequalities
Theorem 1.5 Log-sum Inequality
Let $a_1, \dots, a_n$ and $b_1, \dots, b_n$ be non-negative numbers, and define $a = \sum_{i=1}^n a_i$ and $b = \sum_{i=1}^n b_i$. We have
$$\sum_{i=1}^{n} a_i \log_2 \frac{a_i}{b_i} \ge a \log_2 \frac{a}{b},$$
with equality iff. $a_i / b_i = a / b$ for all $i$.
Proof.
Example 1.15
with equality iff.
Theorem 1.6 Data Processing Inequalities
Any processing decreases (or at least does not increase) the mutual information:
If $X - Y - Z$ forms a Markov chain, we have $I(X; Z) \le I(X; Y)$ and $I(X; Z) \le I(Y; Z)$.
If $P_Y$ and $Q_Y$ are the outputs of a channel $P_{Y|X}$ with inputs $P_X$ and $Q_X$, we have $D(P_Y \| Q_Y) \le D(P_X \| Q_X)$.
Proof.
Using the conclusions of Example 1.13, $I(X; YZ) = I(X;Y) + I(X;Z \mid Y) = I(X;Z) + I(X;Y \mid Z)$. For a Markov chain $X - Y - Z$ we have $I(X;Z \mid Y) = 0$, so $I(X;Z) \le I(X;Y)$; the bound $I(X;Z) \le I(Y;Z)$ follows by symmetry.
Using the chain rule of information divergence, we have
$$D(P_{XY} \| Q_{XY}) = D(P_X \| Q_X) + D(P_{Y|X} \| Q_{Y|X} \mid P_X) = D(P_Y \| Q_Y) + D(P_{X|Y} \| Q_{X|Y} \mid P_Y),$$
which gives $D(P_Y \| Q_Y) \le D(P_X \| Q_X)$, because the same channel is used for both inputs ($P_{Y|X} = Q_{Y|X}$) and divergence is non-negative.
Theorem 1.9 Fano's Inequality
Suppose both $X$ and $\hat{X}$ take on values in the same alphabet $\mathcal{X}$, and let $P_e = \Pr[\hat{X} \ne X]$. We have
$$H(X \mid \hat{X}) \le H_2(P_e) + P_e \log_2(|\mathcal{X}| - 1),$$
with equality iff. for all $a$ and $b$ in $\mathcal{X}$ with $b \ne a$, we have $P_{X|\hat{X}}(a \mid a) = 1 - P_e$ and $P_{X|\hat{X}}(b \mid a) = P_e / (|\mathcal{X}| - 1)$.
Proof.
Let $E = \mathbb{1}(X \ne \hat{X})$ be the error indicator, so that $\Pr[E = 1] = P_e$. Expanding $H(XE \mid \hat{X})$ with the chain rule in two ways gives
$$H(X \mid \hat{X}) \le H(E) + H(X \mid \hat{X} E),$$
where $H(E) = H_2(P_e)$, and
where $H(X \mid \hat{X}, E = 1) \le \log_2(|\mathcal{X}| - 1)$, since given an error $X$ can take at most $|\mathcal{X}| - 1$ values; the term with $E = 0$ contributes zero.
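A small sketch that evaluates the right-hand side of Fano's inequality; the chosen error probability and alphabet size are illustrative:

```python
import math

def fano_bound(p_e: float, alphabet_size: int) -> float:
    """Right-hand side of Fano's inequality in bits:
    H_2(P_e) + P_e * log2(|X| - 1), assuming at least two letters."""
    if p_e in (0.0, 1.0):
        h2 = 0.0
    else:
        h2 = -p_e * math.log2(p_e) - (1 - p_e) * math.log2(1 - p_e)
    return h2 + p_e * math.log2(alphabet_size - 1)

# With error probability 0.01 over an alphabet of 16 letters, the residual
# uncertainty H(X | Xhat) is at most roughly 0.12 bits.
print(fano_bound(0.01, 16))
```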
(11) Convexity
Theorem 1.10 Concavity of Entropy
For any two distributions $P_1$ and $P_2$ on the same alphabet and any $0 \le \lambda \le 1$,
$$H\big(\lambda P_1 + (1-\lambda) P_2\big) \ge \lambda H(P_1) + (1-\lambda) H(P_2).$$
Theorem 1.11 Convexity of Information Divergence
For any distribution pairs $(P_1, Q_1)$ and $(P_2, Q_2)$ on the same alphabet and any $0 \le \lambda \le 1$,
$$D\big(\lambda P_1 + (1-\lambda) P_2 \,\big\|\, \lambda Q_1 + (1-\lambda) Q_2\big) \le \lambda D(P_1 \| Q_1) + (1-\lambda) D(P_2 \| Q_2).$$
Theorem 1.12 Mutual Information
We write $I(X;Y)$ as $I(P_X, P_{Y|X})$. The function $I(P_X, P_{Y|X})$ is concave in $P_X$, i.e., the input distribution, if $P_{Y|X}$, i.e., the channel, is fixed: for any $0 \le \lambda \le 1$,
$$I\big(\lambda P_1 + (1-\lambda) P_2,\, P_{Y|X}\big) \ge \lambda I(P_1, P_{Y|X}) + (1-\lambda) I(P_2, P_{Y|X}).$$
The function $I(P_X, P_{Y|X})$ is convex in $P_{Y|X}$, i.e., the channel, if $P_X$, i.e., the input distribution, is fixed: for any $0 \le \lambda \le 1$,
$$I\big(P_X,\, \lambda P^{(1)}_{Y|X} + (1-\lambda) P^{(2)}_{Y|X}\big) \le \lambda I(P_X, P^{(1)}_{Y|X}) + (1-\lambda) I(P_X, P^{(2)}_{Y|X}).$$
Theorem 1.13 Linearity and Convexity of Cross Entropy
The cross entropy $-\sum_x P_X(x) \log_2 Q_X(x)$ is linear in $P_X$ and convex in $Q_X$.
Example 1.16 Jensen's Inequality
Let $f$ be a concave function and let $X$ be a real-valued random variable. Then $\mathbb{E}[f(X)] \le f(\mathbb{E}[X])$; for convex $f$ the inequality is reversed.
Proof.
Since $f$ is concave, there is a line through the point $(\mathbb{E}[X], f(\mathbb{E}[X]))$ that lies on or above $f$, i.e., $f(x) \le f(\mathbb{E}[X]) + c\,(x - \mathbb{E}[X])$ for some constant $c$ and all $x$. Taking expectations of both sides gives the claim.
Example 1.17 Definitions of Convexity
These definitions are equivalent
Proof.
Using Jensen's Inequality,
Chapter 2 Information Theory for Continuous Variables
(1) Differential Entropy
Consider a real-valued random variable $X$ with density $p_X$. The differential entropy of $X$ is defined as
$$h(X) = -\int_{\operatorname{supp}(p_X)} p_X(x) \log_2 p_X(x)\, dx.$$
Example 2.1
Consider the uniform distribution with density $p_X(x) = 1/A$ for $x \in [0, A)$ and $p_X(x) = 0$ otherwise. Its differential entropy is $h(X) = \log_2 A$, which, unlike entropy, can be negative (e.g., for $A < 1$).
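A minimal numerical sketch (function names and the value of $A$ are chosen for illustration) that approximates the defining integral and confirms that differential entropy can be negative:

```python
import numpy as np

def differential_entropy(pdf, a, b, n=200_000):
    """Approximate h(X) = -integral of p(x) log2 p(x) dx over [a, b]."""
    x = np.linspace(a, b, n)
    p = pdf(x)
    integrand = np.zeros_like(p)
    mask = p > 0
    integrand[mask] = -p[mask] * np.log2(p[mask])
    return float(np.sum(integrand) * (x[1] - x[0]))

A = 0.5  # width of the uniform density
uniform_pdf = lambda t: np.full_like(t, 1.0 / A)
print(differential_entropy(uniform_pdf, 0.0, A))  # ~ log2(0.5) = -1 bit
```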
Rules for Differential Entropy
Translation does not change differential entropy, $h(X + c) = h(X)$, while scaling shifts it by a constant, $h(aX) = h(X) + \log_2 |a|$ for $a \ne 0$.
Conditional Entropy
The conditional differential entropy is defined as $h(X \mid Y) = -\int p_{XY}(x, y) \log_2 p_{X|Y}(x \mid y)\, dx\, dy$, and, as in the discrete case, $h(X \mid Y) \le h(X)$.
(2) Mixed Distributions
(3) Example Distributions
(4) Informational Divergence
The information divergence between the densities $p_X$ and $p_Y$ is
$$D(p_X \| p_Y) = \int_{\operatorname{supp}(p_X)} p_X(x) \log_2 \frac{p_X(x)}{p_Y(x)}\, dx.$$
(5) Cross Entropy
The cross-entropy of two densities $p_X$ and $q_X$ is $-\int p_X(x) \log_2 q_X(x)\, dx$. As in the discrete case it upper bounds the differential entropy, $h(X) \le -\int p_X(x) \log_2 q_X(x)\, dx$,
with equality iff. $q_X = p_X$ almost everywhere.
(6) Maximum Entropy
Alphabet or Volume Constraint
The uniform distribution maximizes the entropy. If the support of $p_X$ is contained in a set of finite volume (length) $V$, then $h(X) \le \log_2 V$,
with equality iff. $p_X$ is uniform over that set.
First Moment Constraint
Under the first-moment constraint $\mathbb{E}[X] \le m$ for a non-negative random variable $X$, we have $h(X) \le \log_2(e\, m)$, and the exponential density with mean $m$ achieves the bound.
We use cross entropy to upper bound the entropy: choosing $q_X$ as the exponential density with mean $m$ gives $h(X) \le -\int p_X(x) \log_2 q_X(x)\, dx = \log_2 m + \frac{\mathbb{E}[X]}{m} \log_2 e \le \log_2(e\, m)$.
Second Moment Constraint
Under the second-moment constraint $\mathbb{E}[X^2] \le \sigma^2$, we have $h(X) \le \frac{1}{2} \log_2(2 \pi e \sigma^2)$.
Gaussian random variables maximize differential entropy under the second moment constraint.
Example 2.2
The second moment constraint for a scalar is $\mathbb{E}[X^2] \le \sigma^2$.
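A small numerical comparison, using standard closed-form differential entropies, of three densities with the same variance; the Gaussian value is the largest, as the maximum-entropy statement above asserts (the variable names and the chosen variance are illustrative):

```python
import math

sigma2 = 1.0  # common variance for all three densities

# Closed-form differential entropies in bits:
h_gauss = 0.5 * math.log2(2 * math.pi * math.e * sigma2)
h_uniform = math.log2(math.sqrt(12 * sigma2))              # width sqrt(12*sigma2)
h_laplace = math.log2(2 * math.e * math.sqrt(sigma2 / 2))  # scale b = sqrt(sigma2/2)

print(h_gauss, h_uniform, h_laplace)  # the Gaussian value is the largest
```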
Chapter 3 Coding of Memoryless Sources
We encode a source of independent and identically distributed random variables, a discrete memoryless source (DMS), into binary codewords.
A code is said to be uniquely decodable if every finite sequence of code symbols has at most one interpretation as a sequence of its codewords. A prefix-free code is automatically also a uniquely decodable code.
(1) Kraft Inequality
Theorem 3.1 Kraft Inequality
A $D$-ary prefix-free code with codeword lengths $l_1, l_2, \dots, l_L$ exists iff.
$$\sum_{i=1}^{L} D^{-l_i} \le 1.$$
Equality holds iff. the code has no unused leaves in its $D$-ary tree.
Kraft's inequality is thus both necessary and sufficient for the existence of a prefix-free code with the given codeword lengths. More generally, a uniquely decodable code with a given list of codeword lengths exists if and only if these lengths satisfy Kraft's inequality, so nothing is lost by restricting attention to prefix-free codes.
Proof.
Path Length Lemma: in a rooted tree with leaf and node probabilities, the average depth of the leaves equals the sum of the probabilities of the intermediate nodes (the root included).
Proof.
Leaf-Entropy Lemma: the entropy of the leaf distribution equals the sum, over the intermediate nodes, of the node probability times the branching entropy at that node.
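A minimal sketch that evaluates the Kraft sum for a few lists of codeword lengths; the example lengths are chosen for illustration:

```python
def kraft_sum(lengths, D=2):
    """Sum of D**(-l); a D-ary prefix-free code with these codeword
    lengths exists iff the sum is at most 1."""
    return sum(D ** (-l) for l in lengths)

print(kraft_sum([1, 2, 3, 3]))  # 1.0   -> achievable, no unused leaves
print(kraft_sum([1, 2, 2, 3]))  # 1.125 -> no prefix-free code exists
print(kraft_sum([2, 2, 2]))     # 0.75  -> achievable, with an unused leaf
```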
(2) Entropy Bounds on Source Coding
Theorem 3.2 Coding Theorem for a Single Random Variable
The average code word length $\mathbb{E}[L]$ of an optimum prefix-free code for a random variable $X$ satisfies
$$H(X) \le \mathbb{E}[L] < H(X) + 1.$$
Equality holds on the left iff. $P_X(x)$ is a negative integer power of $2$ for all $x$.
Proof.
For the lower bound, we use the Path Length Lemma and the Leaf-Entropy Lemma,
and get $H(X) \le \mathbb{E}[L]$: the leaf entropy is $\sum_n P_n H_n \le \sum_n P_n = \mathbb{E}[L]$, because each binary branching has entropy at most one bit.
For the upper bound, we use Kraft's Inequality.
Let $l(x) = \lceil -\log_2 P_X(x) \rceil$ for every letter $x$; these lengths satisfy Kraft's inequality, so a prefix-free code with these lengths exists.
We choose the shortest codes among all such prefix-free codes.
Therefore,
$$\mathbb{E}[L] \le \sum_x P_X(x) \lceil -\log_2 P_X(x) \rceil < \sum_x P_X(x) \big(1 - \log_2 P_X(x)\big),$$
and we get $\mathbb{E}[L] < H(X) + 1$.
Theorem 3.3 Coding Theorem for a DMS
The average code word length per source letter, $\mathbb{E}[L]/N$, of an optimum prefix-free code for a DMS that is parsed into blocks of length $N$ satisfies
$$H(X) \le \frac{\mathbb{E}[L]}{N} < H(X) + \frac{1}{N}.$$
Proof.
We already have the coding theorem for a single random variable; applying it to the block $X_1 \cdots X_N$ gives $H(X_1 \cdots X_N) \le \mathbb{E}[L] < H(X_1 \cdots X_N) + 1$.
Since the source is a DMS, we have $H(X_1 \cdots X_N) = N H(X)$, and dividing by $N$ completes the proof.
(3) Huffman Codes
The prefix-free codes that minimize $\mathbb{E}[L]$ for a random variable $X$ are called Huffman codes. The branches of the code tree correspond to the codewords of a prefix-free code with minimal $\mathbb{E}[L]$.
Optimal prefix-free codes assign shorter code words to more probable letters. All used leaves in the tree of an optimal prefix-free code have a sibling. The number of unused leaves in the tree of an optimal prefix-free code is zero for binary codes and at most $D - 2$ for $D$-ary codes.
Proof.
A
The number of unused leaves is
Since
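A compact sketch of the Huffman construction (repeatedly merging the two least probable subtrees); the pmf and function names are illustrative, and ties are broken arbitrarily, so the codewords may differ from a hand construction while the average length stays optimal:

```python
import heapq

def huffman_code(pmf):
    """Binary Huffman code for a pmf given as {letter: probability}.
    Returns {letter: codeword string}; ties are broken arbitrarily."""
    heap = [(p, i, {x: ''}) for i, (x, p) in enumerate(pmf.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # two least probable subtrees
        p1, _, c1 = heapq.heappop(heap)
        merged = {x: '0' + w for x, w in c0.items()}
        merged.update({x: '1' + w for x, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

pmf = {'a': 0.4, 'b': 0.3, 'c': 0.2, 'd': 0.1}
code = huffman_code(pmf)
avg_len = sum(pmf[x] * len(w) for x, w in code.items())
print(code)     # e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
print(avg_len)  # 1.9 bits, compared with H(X) ~ 1.846 bits
```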
(4) Tunstall Codes
The proper dictionary codes with a fixed codeword block length that maximize the expected dictionary word length $\mathbb{E}[W]$ for a random variable $X$ are called Tunstall codes. The branches of the tree are source sequences (the dictionary words) chosen to maximize $\mathbb{E}[W]$.
A proper dictionary satisfies the following two conditions: The dictionary (or message set) satisfies the prefix-free condition, i.e., the dictionary words are leaves of a rooted tree. No leaves are unused.
Choose the largest dictionary size $M$ that satisfies $M \le 2^{b}$, where $b$ is the codeword block length. Similar to the Huffman code, we now have entropy bounds on the performance of Tunstall codes.
Proof.
Using the Leaf-Entropy Lemma, we get that the entropy of the dictionary-word distribution equals $\mathbb{E}[W]\, H(X)$ for a DMS, since every branching of the parsing tree has branching entropy $H(X)$ and the node probabilities sum to $\mathbb{E}[W]$ by the Path Length Lemma.
Combining both bounds finishes the proof.
Theorem 3.4 Converse to the Coding Theorem for a DMS.
A prefix-free encoding of a proper dictionary for a DMS satisfies
$$\frac{\mathbb{E}[L]}{\mathbb{E}[W]} \ge H(X),$$
i.e., the average number of code bits per encoded source letter is at least the source entropy.
This is a more general lower bound: it applies to first using a variable-length-to-block encoder with a proper dictionary, followed by a prefix-free block-to-variable-length encoder.
First, we apply a Tunstall code to parse the source into dictionary words with expected length $\mathbb{E}[W]$.
Then, we apply a Huffman code to encode each dictionary word into a prefix-free codeword.
(5) Huffman Codes and Tunstall Codes
Huffman codes and Tunstall codes are both built on prefix-free trees, in which no codeword (or dictionary word) is a prefix of any other, allowing sequences to be uniquely decoded. However, there are some key differences between these two types of codes:
Huffman codes (block to variable length) are constructed using a method called Huffman coding, which is a lossless data compression algorithm that assigns shorter codes to more probable symbols and longer codes to less probable symbols, in order to minimize the expected length of the encoded message.
Tunstall codes (variable length to block), on the other hand, are constructed using the Tunstall algorithm, a lossless data compression method that builds a prefix-free dictionary of variable-length source phrases and assigns a fixed-length codeword to each phrase; the source is parsed into non-overlapping dictionary words.
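A minimal sketch of the Tunstall construction described above: keep expanding the most probable leaf of the source tree while the dictionary still fits into the $2^b$ available fixed-length codewords. The pmf, block length, and function names are illustrative:

```python
import heapq

def tunstall_dictionary(pmf, block_len):
    """Tunstall dictionary for a DMS with pmf {letter: probability}: keep
    expanding the most probable leaf while the leaves fit into 2**block_len
    fixed-length codewords."""
    K = len(pmf)
    leaves = [(-p, word) for word, p in pmf.items()]  # max-heap via negation
    heapq.heapify(leaves)
    # expanding one leaf into K leaves adds K - 1 leaves to the tree
    while len(leaves) + (K - 1) <= 2 ** block_len:
        neg_p, word = heapq.heappop(leaves)
        for letter, p in pmf.items():
            heapq.heappush(leaves, (neg_p * p, word + letter))
    return sorted(word for _, word in leaves)

pmf = {'a': 0.7, 'b': 0.3}
print(tunstall_dictionary(pmf, 2))  # ['aaa', 'aab', 'ab', 'b']
```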
Chapter 4 Coding of Stationary Sources
(1) Discrete Stationary Sources
A DSS outputs a sequence $X_1, X_2, \dots$ whose joint statistics are invariant to time shifts.
The entropy rate is defined as
$$H_\infty(X) = \lim_{N \to \infty} \frac{1}{N} H(X_1 X_2 \cdots X_N).$$
It has the property that $\frac{1}{N} H(X_1 \cdots X_N)$ is non-increasing in $N$, so the limit exists for a DSS.
Theorem 4.1 Entropy Rate for a DSS
For a DSS, the entropy rate exists and satisfies $H_\infty(X) = \lim_{N \to \infty} H(X_N \mid X_1 \cdots X_{N-1})$.
Proof.
Similarly,
Thus,
And
Proof.
(2) Entropy Bounds on Source Coding a DSS
Theorem 4.2 Coding Theorem for a DSS
The average code word length per source letter, $\mathbb{E}[L]/N$, of an optimum prefix-free code for a DSS that is parsed into blocks of length $N$ satisfies
$$\frac{1}{N} H(X_1 \cdots X_N) \le \frac{\mathbb{E}[L]}{N} < \frac{1}{N} H(X_1 \cdots X_N) + \frac{1}{N}.$$
Proof.
We already have the coding theorem for a single random variable.
Now we replace the single random variable with a block of length $N$ and obtain $H(X_1 \cdots X_N) \le \mathbb{E}[L] < H(X_1 \cdots X_N) + 1$.
Using the notation $H_N = \frac{1}{N} H(X_1 \cdots X_N)$ and dividing by $N$ gives the bounds of the theorem.
Since $H_N$ converges to the entropy rate of the DSS, the code rate per source letter approaches the entropy rate as $N \to \infty$.
(3) Elias Code for Positive Integers
x | Elias Code | Elias Code Length |
---|---|---|
1 | 1 | 1 |
2 | 010.0 | 4 |
3 | 010.1 | 4 |
4 | 011.00 | 5 |
5 | 011.01 | 5 |
6 | 011.10 | 5 |
7 | 011.11 | 5 |
8 | 00100.000 | 8 |
9 | 00100.001 | 8 |
Proof.
For a positive integer $x$, the plain base-2 representation is:
x | Binary Representation | Code Length |
---|---|---|
1 | 1 | 1 |
2 | 10 | 2 |
3 | 11 | 2 |
4 | 100 | 3 |
5 | 101 | 3 |
6 | 110 | 3 |
7 | 111 | 3 |
8 | 1000 | 4 |
9 | 1001 | 4 |
This base-2 code is not prefix-free, and neither is a ternary or any other base representation. Therefore, we pre-pad each codeword with $L(x) - 1$ zeros, where $L(x)$ is the number of binary digits of $x$, so that a decoder can tell where each codeword ends:
x | Zero-pre-padded Binary Representation | Code Length |
---|---|---|
1 | 1 | 1 |
2 | 0.10 | 3 |
3 | 0.11 | 3 |
4 | 00.100 | 5 |
5 | 00.101 | 5 |
6 | 00.110 | 5 |
7 | 00.111 | 5 |
8 | 000.1000 | 7 |
9 | 000.1001 | 7 |
Elias pre-pads differently to shorten this code: instead of $L(x) - 1$ zeros, he prepends the zero-pre-padded binary representation of $L(x)$ itself, giving:
x | Elias-pre-padded Binary Representation | Code Length |
---|---|---|
1 | 1 | 1 |
2 | 010.10 | 5 |
3 | 010.11 | 5 |
4 | 011.100 | 6 |
5 | 011.101 | 6 |
6 | 011.110 | 6 |
7 | 011.111 | 6 |
8 | 00100.1000 | 9 |
9 | 00100.1001 | 9 |
Finally, we get the Elias code by omitting the first "1" after ".", since this bit is always 1 and carries no information; the resulting codeword length is $\lfloor \log_2 x \rfloor + 2 \lfloor \log_2(\lfloor \log_2 x \rfloor + 1) \rfloor + 1$.
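A minimal sketch of the Elias encoder and decoder just described; variable names are illustrative:

```python
def elias_encode(n: int) -> str:
    """Elias code of a positive integer, as in the table above: prepend the
    zero-pre-padded binary length, then drop the leading '1' of binary(n)."""
    b = bin(n)[2:]               # binary representation of n
    k = bin(len(b))[2:]          # binary representation of its length
    return '0' * (len(k) - 1) + k + b[1:]

def elias_decode(bits: str) -> int:
    """Decode one Elias codeword from the start of a bit string."""
    z = 0
    while bits[z] == '0':        # count leading zeros
        z += 1
    length = int(bits[z:2 * z + 1], 2)       # number of bits of n
    body = bits[2 * z + 1:2 * z + length]    # the length-1 trailing bits of n
    return int('1' + body, 2)

for n in (1, 2, 3, 4, 8, 9):
    print(n, elias_encode(n))
print(elias_decode(elias_encode(9)))  # 9
```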
(4) Elias-Willems Universal Source Coding
We now describe a specific algorithm for source coding when the source statistics are unknown. An important result that we state without proof is that the expected recurrence time of a stationary and ergodic source satisfies
$$\mathbb{E}[T_i \mid X_i = a] = \frac{1}{P_X(a)}$$
for all times $i$ and letters $a$, where $T_i$ is the time back to the most recent previous occurrence of the letter $a$ before position $i$.
Example 4.1
Consider a DMS
We apply the binary Elias code to
The second inequality follows from the concavity of logarithm and Jensen's inequality.
Taking expectations we have
Consider a block of length
Theorem 4.3 Universal Source Coding Theorem for a DSS
The average code word length per source letter
of a binary code for a DSS that is parsed into blocks of length $N$, and for which the Elias code is used to map the time to the most recent occurrence of each block of letters, satisfies an upper bound that exceeds $\frac{1}{N} H(X_1 \cdots X_N)$ only by terms that vanish as $N$ grows.
Moreover, the scheme is universal: the code rate approaches the entropy rate of the DSS as $N \to \infty$, even though the encoder does not know the source statistics.
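A simplified sketch of the recurrence-time idea behind Elias-Willems coding (it reuses elias_encode from the sketch above); the handling of blocks seen for the first time here is a simplification chosen for illustration, not the exact rule of the scheme:

```python
def recurrence_times(blocks):
    """Time to the most recent previous occurrence of each block; blocks seen
    for the first time get a placeholder index (a simplification)."""
    last_seen = {}
    deltas = []
    for t, b in enumerate(blocks):
        if b in last_seen:
            deltas.append(t - last_seen[b])
        else:
            deltas.append(len(last_seen) + 1)  # placeholder for first visits
        last_seen[b] = t
    return deltas

# Frequent blocks recur quickly, so their recurrence times are small and the
# Elias code assigns them short codewords.
blocks = ['aa', 'ab', 'aa', 'aa', 'ba', 'aa', 'ab']
deltas = recurrence_times(blocks)
print(deltas)                                    # [1, 2, 2, 1, 3, 2, 5]
print(''.join(elias_encode(d) for d in deltas))  # bit stream sent to the decoder
```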
Chapter 5 Channel Coding
Data Processing
$$I(M; \hat{M}) \le I(X^n; Y^n) \le \sum_{i=1}^{n} I(X_i; Y_i),$$
where the first inequality follows from the Markov chain $M - X^n - Y^n - \hat{M}$ and from the fact that data processing doesn't increase mutual information, and the second inequality follows from the channel being memoryless, i.e., $Y_i$ only depends on $X_i$.
(1) Rate, Reliability and Cost
By reducing the rate, one can increase reliability; we thus expect a rate-reliability tradeoff. If one introduces a cost on the channel inputs, we additionally expect a tradeoff between rate and cost.
(2) Memoryless Channels
An encoder maps a source message $m$, drawn from a message set $\{1, 2, \dots, 2^{nR}\}$, to a channel input sequence $x^n(m)$; the memoryless channel maps $x^n$ to an output sequence $y^n$, and a decoder maps $y^n$ to a message estimate $\hat{m}$.
(3) Cost Functions
Suppose that transmitting the letter $x$ incurs a cost $s(x)$.
We consider the average block cost $\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[s(X_i)]$ and require it to be at most $S$.
The largest rate at which reliable transmission is possible under this cost constraint is the capacity-cost function $C(S)$.
Example 5.1 Power Constraint
The cost function $s(x) = x^2$ gives the average power constraint $\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[X_i^2] \le P$.
(4) Block and Bit Error Probability
We define the channel coding problem by requiring the block error probability $P_B = \Pr[\hat{M} \ne M]$
to be small.
But we also want to minimize the average bit error probability $P_b = \frac{1}{K} \sum_{i=1}^{K} \Pr[\hat{M}_i \ne M_i]$, where $M_1, \dots, M_K$ are the message bits.
Theorem 5.1 Block/Bit Error Probability
$P_b \le P_B \le K \cdot P_b$, where $K$ is the number of message bits per block. Equality holds on the left if all bits in a block are wrong whenever the block is wrong, and equality holds on the right if exactly one bit is wrong in each erroneous block.
Proof. 1
Intuitively,
Therefore,
Proof. 2
Theorem 5.2 Lower Bounds on $P_B$ and $P_b$, and Rate-Reliability Tradeoff
The block and bit error probabilities of any code of rate $R$ used on a channel with capacity $C$ are bounded from below in terms of $R$ and $C$.
We see that the rate $R$ is at most approximately $C$ if reliability is good ($P_B$ or $P_b$ is small).
Proof of the first bound.
Using Fano's inequality and
We suppose the source messages are equiprobable and get
If
Proof of the second bound.
where
Using the chain rule
we get the following bound
If
(5) Random Coding
Capacity-Cost Function
$$C(S) = \max_{P_X : \mathbb{E}[s(X)] \le S} I(X; Y).$$
(6) Concavity and Converse
Theorem
The capacity-cost function $C(S)$ is concave in $S$.
$C(S)$ is non-decreasing in $S$.
Proof.
where
Rate-reliability Tradeoff
(7) Discrete Alphabet Examples
<1> Binary Symmetric Channel
The best choice for the input distribution is the uniform one, which gives the BSC capacity $C = 1 - H_2(p)$ for crossover probability $p$.
<2> Binary Erasure Channel
Therefore, the capacity of the BEC with erasure probability $e$ is $C = 1 - e$, again achieved by a uniform input.
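A small sketch evaluating the BSC and BEC capacities stated above; the parameter values are illustrative:

```python
import math

def h2(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(crossover):
    """C = 1 - H_2(p) bits per use, achieved by a uniform input."""
    return 1.0 - h2(crossover)

def bec_capacity(erasure):
    """C = 1 - e bits per use, achieved by a uniform input."""
    return 1.0 - erasure

print(bsc_capacity(0.11))  # ~0.5 bits per channel use
print(bec_capacity(0.5))   # 0.5 bits per channel use
```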
<3> Strongly Symmetric Channels
Notice the entries in one row of the channel transition matrix add up to $1$.
Uniformly dispersive (entries in rows are the same, their order may change): for every input letter $x$, the list of probabilities $\{P_{Y|X}(y \mid x) : y \in \mathcal{Y}\}$ is the same. Therefore, $H(Y \mid X = x)$ is the same for all $x$.
Uniformly focusing (entries in columns are the same, their order may change): for every output letter $y$, the list of probabilities $\{P_{Y|X}(y \mid x) : x \in \mathcal{X}\}$ is the same. A uniform input to a uniformly focusing channel implies a uniform output.
Strongly symmetric: a channel is said to be strongly symmetric if it is both uniformly dispersive and uniformly focusing.
BEC is u.d. but not u.f. BSC is u.d. and u.f.
Strongly Symmetric Channel Capacity
Capacity is achieved by a uniform input distribution: $C = \log_2 |\mathcal{Y}| - H(Y \mid X = x)$ for any input letter $x$.
<4> Symmetric Channels
A BEC can be decomposed into two strongly symmetric channels:
Capacity of Symmetric DMC
If a symmetric channel decomposes into strongly symmetric sub-channels with capacities $C_k$ that are used with probabilities $q_k$, its capacity is $C = \sum_k q_k C_k$, achieved by a uniform input.
Example 5.2 BPSK over AWGN
We use a BPSK constellation at the channel input and a symmetric quantizer at the output; the resulting discrete channel is symmetric, so a uniform (equiprobable) input achieves its capacity.
(8) Continuous Alphabet Examples
Consider the additive white Gaussian noise channel with $Y = X + Z$, where $Z \sim \mathcal{N}(0, \sigma^2)$ is independent of $X$, under the power constraint $\mathbb{E}[X^2] \le P$.
We now maximize $I(X;Y) = h(Y) - h(Y \mid X) = h(Y) - h(Z)$ over the input distribution. Since $\mathbb{E}[Y^2] \le P + \sigma^2$, the maximum-entropy result gives $h(Y) \le \frac{1}{2} \log_2\big(2 \pi e (P + \sigma^2)\big)$,
with equality iff. $Y$ is Gaussian, which happens iff. $X \sim \mathcal{N}(0, P)$.
Finally, we get
$$C = \frac{1}{2} \log_2\Big(1 + \frac{P}{\sigma^2}\Big).$$
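A one-line evaluation of the AWGN capacity formula; the SNR values are illustrative:

```python
import math

def awgn_capacity(snr):
    """C = 0.5 * log2(1 + P/sigma^2) bits per real channel use."""
    return 0.5 * math.log2(1.0 + snr)

for snr_db in (0, 10, 20):
    snr = 10 ** (snr_db / 10)
    print(snr_db, "dB ->", awgn_capacity(snr), "bits/use")
```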
Example 5.3 AWGN channel with BPSK
We transmit symbols $X \in \{-\sqrt{P}, +\sqrt{P}\}$ with equal probability over the AWGN channel; the mutual information is then at most 1 bit per channel use and is evaluated numerically.
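A numerical sketch (assuming unit noise variance and the SNR values shown, with function names chosen for illustration) that evaluates the BPSK-constrained mutual information as $h(Y) - h(N)$ by integrating the bimodal output density:

```python
import numpy as np

def bpsk_awgn_mi(snr, n=20001, span=12.0):
    """Mutual information in bits of equiprobable BPSK over AWGN with unit
    noise variance, computed as h(Y) - h(N) by numerical integration."""
    a = np.sqrt(snr)                    # BPSK amplitude for noise variance 1
    y = np.linspace(-span, span, n)
    gauss = lambda m: np.exp(-(y - m) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
    p_y = 0.5 * gauss(a) + 0.5 * gauss(-a)      # bimodal output density
    h_y = -np.sum(p_y * np.log2(p_y)) * (y[1] - y[0])
    h_n = 0.5 * np.log2(2.0 * np.pi * np.e)     # differential entropy of N
    return h_y - h_n

for snr_db in (-10, 0, 10):
    snr = 10 ** (snr_db / 10)
    print(snr_db, "dB ->", round(bpsk_awgn_mi(snr), 3), "bits/use")
# The value saturates at 1 bit/use at high SNR, below the unconstrained
# Gaussian-input capacity 0.5 * log2(1 + SNR).
```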