Chapter 1 Information Theory for Discrete Variables
(1) Message Sets
What is a message? Shannon suggested that a message has to do with choice and sets. In his words: "The significant aspect is that the actual message is one selected from a set of possible messages."
(2) Measuring Choice
Shannon was interested in defining a quantity that would measure, in some sense, how much information is "produced" by choosing messages. He suggested that the logarithm of the number of elements in the message set can be regarded as a measure of the information produced when one message is chosen from the set, all choices being equally likely.
For example, consider a fair coin that takes on values in $\{\text{Heads}, \text{Tails}\}$ with equal probability. Choosing one of the two equally likely messages produces $\log_2 2 = 1$ bit of information.
For non-equally likely choices, Shannon developed a measure that has the same form as the entropy in statistical mechanics. For example, consider a biased coin that takes on the value Heads with probability $p$ and Tails with probability $1-p$; the information produced should then depend on $p$.
(3) Entropy
Let $\operatorname{supp}(P_X)$ be the support set of the function $P_X$, i.e., the set of $x$ such that $P_X(x) > 0$. We define the entropy or uncertainty of the discrete random variable $X$ as
$$H(X) = -\sum_{x \in \operatorname{supp}(P_X)} P_X(x) \log_2 P_X(x).$$
Example 1.1
The entropy satisfies $H(f(X)) = H(X)$ for any invertible or bijective function $f$, since relabeling the letters does not change the probabilities.
Example 1.2
Consider the Bernoulli distribution that has an alphabet $\mathcal{X} = \{0, 1\}$ and $P_X(1) = p$, $P_X(0) = 1 - p$. Its entropy is the binary entropy function
$$H_2(p) = -p \log_2 p - (1-p) \log_2 (1-p).$$
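As a quick numerical sketch of the binary entropy function (the function name and sample values below are chosen for illustration):

```python
import math

def binary_entropy(p: float) -> float:
    """Binary entropy H_2(p) in bits, with H_2(0) = H_2(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # 1.0 bit: a fair coin
print(binary_entropy(0.11))  # about 0.5 bit: a biased coin is less uncertain
```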
Theorem 1.1
$$0 \le H(X) \le \log_2 |\mathcal{X}|,$$
with equality on the left iff. there is one letter $a$ in $\mathcal{X}$ with $P_X(a) = 1$, and with equality on the right iff. $P_X(x) = 1/|\mathcal{X}|$ for any $x \in \mathcal{X}$, i.e., $X$ is uniformly distributed.
Proof of the right equality.
(4) Example Distributions
(5) Conditional Entropy
Consider the joint distribution $P_{XY}$ of two discrete random variables $X$ and $Y$.
Focusing on one line of the table of the joint distribution, we get the conditional entropy of $X$ given the event $Y = y$:
$$H(X \mid Y = y) = -\sum_{x} P_{X|Y}(x \mid y) \log_2 P_{X|Y}(x \mid y).$$
Similarly, $0 \le H(X \mid Y = y) \le \log_2 |\mathcal{X}|$,
with equality on the left iff. $P_{X|Y}(a \mid y) = 1$ for some letter $a$, and with equality on the right iff. $P_{X|Y}(\cdot \mid y)$ is uniform.
We average the conditional entropy over all $y$ to obtain the conditional entropy of $X$ given $Y$:
$$H(X \mid Y) = \sum_{y \in \operatorname{supp}(P_Y)} P_Y(y)\, H(X \mid Y = y) = -\sum_{(x,y)} P_{XY}(x,y) \log_2 P_{X|Y}(x \mid y).$$
Also, $0 \le H(X \mid Y) \le \log_2 |\mathcal{X}|$,
with equality on the left iff. there is one $x$ with $P_{X|Y}(x \mid y) = 1$ for every $y$ with $P_Y(y) > 0$, i.e., $X$ is a function of $Y$.
Example 1.3
We say that $X$ essentially determines $Y$ if $H(Y \mid X) = 0$, i.e., $Y = f(X)$ for some function $f$; in particular $H(f(X) \mid X) = 0$.
Recall Example 1.1: for an invertible function $f$ we also have $H(X \mid f(X)) = 0$.
For a non-invertible function $f$ we only have $H(X \mid f(X)) \ge 0$, and the inequality can be strict.
Examples include
(6) Joint Entropy
The joint entropy of $X$ and $Y$ is defined as
$$H(XY) = -\sum_{(x,y) \in \operatorname{supp}(P_{XY})} P_{XY}(x,y) \log_2 P_{XY}(x,y),$$
and it satisfies $\max\{H(X), H(Y)\} \le H(XY) \le H(X) + H(Y)$,
with equality on the left iff. one of the variables essentially determines the other, and with equality on the right iff. $X$ and $Y$ are statistically independent.
Example 1.4 Chain Rule for Entropy
$$H(XY) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y), \qquad H(X_1 X_2 \cdots X_N) = \sum_{i=1}^{N} H(X_i \mid X_1 \cdots X_{i-1}).$$
Example 1.5
Proof.
Example 1.6
Using the conclusions about conditional entropy in Example 1.3, we study the joint distribution of $X$ and $f(X)$. By the chain rule, $H(X f(X)) = H(X) + H(f(X) \mid X) = H(X)$.
For an invertible function $f$ we have $H(X \mid f(X)) = 0$, so $H(f(X)) = H(X)$.
For a non-invertible function $f$ we have $H(X \mid f(X)) \ge 0$, so $H(f(X)) = H(X) - H(X \mid f(X)) \le H(X)$.
(7) Information Divergence / Kullback-Leibler Distance
The informational divergence between $P_X$ and $P_Y$ (two distributions on the same alphabet) is defined as
$$D(P_X \| P_Y) = \sum_{x \in \operatorname{supp}(P_X)} P_X(x) \log_2 \frac{P_X(x)}{P_Y(x)}.$$
We define $D(P_X \| P_Y) = \infty$ if $P_Y(x) = 0$ for some $x \in \operatorname{supp}(P_X)$. Observe that the definition is not symmetric in $P_X$ and $P_Y$, i.e., we have $D(P_X \| P_Y) \ne D(P_Y \| P_X)$ in general.
To avoid the case $D(P_X \| P_Y) = \infty$, one usually requires $\operatorname{supp}(P_X) \subseteq \operatorname{supp}(P_Y)$.
Given a third discrete random variable $Z$, the conditional informational divergence is defined as
$$D(P_{X|Z} \| P_{Y|Z} \mid P_Z) = \sum_{z \in \operatorname{supp}(P_Z)} P_Z(z)\, D\big(P_{X|Z}(\cdot \mid z) \,\big\|\, P_{Y|Z}(\cdot \mid z)\big).$$
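A minimal sketch of how the informational divergence can be evaluated numerically; the dictionary-based representation of the pmfs and the example values are assumptions made for illustration:

```python
import math

def kl_divergence(p, q):
    """D(P||Q) in bits for pmfs given as dicts over the same alphabet.
    Returns infinity if Q(x) = 0 for some x with P(x) > 0."""
    d = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf
        d += px * math.log2(px / qx)
    return d

p = {'a': 0.5, 'b': 0.5}
q = {'a': 0.9, 'b': 0.1}
print(kl_divergence(p, q))  # ~0.737 bits
print(kl_divergence(q, p))  # ~0.531 bits: the divergence is not symmetric
```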
Theorem 1.2
$D(P_X \| P_Y) \ge 0$, with equality iff. $P_X(x) = P_Y(x)$ for all $x$.
Proof.
Example 1.7
If
Example 1.8
If
Example 1.9
If
Example 1.10 Chain Rule for Information Divergence
$$D(P_{XY} \| Q_{XY}) = D(P_X \| Q_X) + D(P_{Y|X} \| Q_{Y|X} \mid P_X).$$
Example 1.11 instead of
(8) Mutual Information
The mutual information $I(X;Y)$ between two random variables $X$ and $Y$ is defined as
$$I(X;Y) = D(P_{XY} \| P_X P_Y) = \sum_{(x,y) \in \operatorname{supp}(P_{XY})} P_{XY}(x,y) \log_2 \frac{P_{XY}(x,y)}{P_X(x) P_Y(y)}.$$
Note that $I(X;Y) = 0$ means that $X$ and $Y$ are statistically independent. Thus, $I(X;Y)$ measures the dependence of $X$ and $Y$.
Theorem 1.3
$I(X;Y) = H(X) - H(X \mid Y) \ge 0$, with equality iff. $X$ and $Y$ are statistically independent.
Proof.
The inequality $H(X \mid Y) \le H(X)$ means that conditioning doesn't increase entropy, or colloquially that conditioning reduces entropy. Note, however, that $H(X \mid Y = y)$ may exceed $H(X)$ for particular values of $y$; only the average over $y$ cannot.
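A small illustrative sketch (the joint pmf below is a made-up example) that computes mutual information directly from its definition as a divergence between $P_{XY}$ and $P_X P_Y$:

```python
import math

def mutual_information(p_xy):
    """I(X;Y) in bits from a joint pmf given as {(x, y): probability}."""
    p_x, p_y = {}, {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    total = 0.0
    for (x, y), p in p_xy.items():
        if p > 0:
            total += p * math.log2(p / (p_x[x] * p_y[y]))
    return total

# Y is a noisy copy of X: it agrees with X with probability 0.9.
p_xy = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
print(mutual_information(p_xy))  # ~0.531 bits; 0 would mean independence
```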
Example 1.12
We know that
When
Example 1.13 Chain Rule for Mutual Information
$$I(X; Y_1 Y_2 \cdots Y_N) = \sum_{i=1}^{N} I(X; Y_i \mid Y_1 \cdots Y_{i-1}).$$
Example 1.14
Using the chain rule for mutual information and conclusions of Example 1.11, we have
(9) Cross Entropy
The cross entropy of two distributions $P_X$ and $Q_X$ with the same domain is defined as
$$H(P_X, Q_X) = -\sum_{x \in \operatorname{supp}(P_X)} P_X(x) \log_2 Q_X(x).$$
Theorem 1.4
$H(X) \le H(P_X, Q_X)$, i.e., the cross entropy is never smaller than the entropy, with equality iff. $Q_X(x) = P_X(x)$ for any $x$.
Theorem 1.4 gives a useful tool to upper bound entropy: every choice of $Q_X$ yields an upper bound on $H(X)$, namely $H(X) \le -\sum_x P_X(x) \log_2 Q_X(x)$.
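A short numerical check of Theorem 1.4, using a made-up distribution $P_X$ and the uniform distribution as $Q_X$; the function names are illustrative:

```python
import math

def entropy(p):
    """H(X) in bits for a pmf given as {letter: probability}."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def cross_entropy(p, q):
    """-sum_x P(x) log2 Q(x); requires Q(x) > 0 wherever P(x) > 0."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

p = {'a': 0.7, 'b': 0.2, 'c': 0.1}
q = {'a': 1/3, 'b': 1/3, 'c': 1/3}
print(entropy(p))           # ~1.157 bits
print(cross_entropy(p, q))  # log2(3) ~ 1.585 bits, an upper bound on H(X)
print(cross_entropy(p, p))  # equals H(X): the bound is tight for Q = P
```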
(10) Inequalities
Theorem 1.5 Log-sum Inequality
Let $a_1, \dots, a_n$ and $b_1, \dots, b_n$ be non-negative numbers, and define $a = \sum_{i=1}^n a_i$ and $b = \sum_{i=1}^n b_i$. We have
$$\sum_{i=1}^{n} a_i \log_2 \frac{a_i}{b_i} \ge a \log_2 \frac{a}{b},$$
with equality iff. $a_i / b_i = a / b$ for all $i$.
Proof.
Example 1.15
with equality iff.
Theorem 1.6 Data Processing Inequalities
Any processing decreases (or at least does not increase) the mutual information:
If $X - Y - Z$ forms a Markov chain, we have $I(X; Z) \le I(X; Y)$ and $I(X; Z) \le I(Y; Z)$.
If $P_Y$ and $Q_Y$ are the outputs of a channel $P_{Y|X}$ with inputs $P_X$ and $Q_X$, we have $D(P_Y \| Q_Y) \le D(P_X \| Q_X)$.
Proof.
Using the conclusions of Example 1.13, $I(X; YZ) = I(X;Y) + I(X;Z \mid Y) = I(X;Z) + I(X;Y \mid Z)$. For a Markov chain $X - Y - Z$ we have $I(X;Z \mid Y) = 0$, so $I(X;Z) \le I(X;Y)$; the bound $I(X;Z) \le I(Y;Z)$ follows by symmetry.
Using the chain rule of information divergence, we have
$$D(P_{XY} \| Q_{XY}) = D(P_X \| Q_X) + D(P_{Y|X} \| Q_{Y|X} \mid P_X) = D(P_Y \| Q_Y) + D(P_{X|Y} \| Q_{X|Y} \mid P_Y),$$
which gives $D(P_Y \| Q_Y) \le D(P_X \| Q_X)$, because the same channel is used for both inputs ($P_{Y|X} = Q_{Y|X}$) and divergence is non-negative.
Theorem 1.9 Fano's Inequality
Suppose both $X$ and $\hat{X}$ take on values in the same alphabet $\mathcal{X}$, and let $P_e = \Pr[\hat{X} \ne X]$. We have
$$H(X \mid \hat{X}) \le H_2(P_e) + P_e \log_2(|\mathcal{X}| - 1),$$
with equality iff. for all $a$ and $b$ in $\mathcal{X}$ with $b \ne a$, we have $P_{X|\hat{X}}(a \mid a) = 1 - P_e$ and $P_{X|\hat{X}}(b \mid a) = P_e / (|\mathcal{X}| - 1)$.
Proof.
Let $E = \mathbb{1}(X \ne \hat{X})$ be the error indicator, so that $\Pr[E = 1] = P_e$. Expanding $H(XE \mid \hat{X})$ with the chain rule in two ways gives
$$H(X \mid \hat{X}) \le H(E) + H(X \mid \hat{X} E),$$
where $H(E) = H_2(P_e)$, and
where $H(X \mid \hat{X}, E = 1) \le \log_2(|\mathcal{X}| - 1)$, since given an error $X$ can take at most $|\mathcal{X}| - 1$ values; the term with $E = 0$ contributes zero.
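A small sketch that evaluates the right-hand side of Fano's inequality; the chosen error probability and alphabet size are illustrative:

```python
import math

def fano_bound(p_e: float, alphabet_size: int) -> float:
    """Right-hand side of Fano's inequality in bits:
    H_2(P_e) + P_e * log2(|X| - 1), assuming at least two letters."""
    if p_e in (0.0, 1.0):
        h2 = 0.0
    else:
        h2 = -p_e * math.log2(p_e) - (1 - p_e) * math.log2(1 - p_e)
    return h2 + p_e * math.log2(alphabet_size - 1)

# With error probability 0.01 over an alphabet of 16 letters, the residual
# uncertainty H(X | Xhat) is at most roughly 0.12 bits.
print(fano_bound(0.01, 16))
```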
(11) Convexity
Theorem 1.10 Concavity of Entropy
For any two distributions $P_1$ and $P_2$ on the same alphabet and any $0 \le \lambda \le 1$,
$$H\big(\lambda P_1 + (1-\lambda) P_2\big) \ge \lambda H(P_1) + (1-\lambda) H(P_2).$$
Theorem 1.11 Convexity of Information Divergence
For any distribution pairs $(P_1, Q_1)$ and $(P_2, Q_2)$ on the same alphabet and any $0 \le \lambda \le 1$,
$$D\big(\lambda P_1 + (1-\lambda) P_2 \,\big\|\, \lambda Q_1 + (1-\lambda) Q_2\big) \le \lambda D(P_1 \| Q_1) + (1-\lambda) D(P_2 \| Q_2).$$
Theorem 1.12 Mutual Information
We write $I(X;Y)$ as $I(P_X, P_{Y|X})$. The function $I(P_X, P_{Y|X})$ is concave in $P_X$, i.e., the input distribution, if $P_{Y|X}$, i.e., the channel, is fixed: for any $0 \le \lambda \le 1$,
$$I\big(\lambda P_1 + (1-\lambda) P_2,\, P_{Y|X}\big) \ge \lambda I(P_1, P_{Y|X}) + (1-\lambda) I(P_2, P_{Y|X}).$$
The function $I(P_X, P_{Y|X})$ is convex in $P_{Y|X}$, i.e., the channel, if $P_X$, i.e., the input distribution, is fixed: for any $0 \le \lambda \le 1$,
$$I\big(P_X,\, \lambda P^{(1)}_{Y|X} + (1-\lambda) P^{(2)}_{Y|X}\big) \le \lambda I(P_X, P^{(1)}_{Y|X}) + (1-\lambda) I(P_X, P^{(2)}_{Y|X}).$$
Theorem 1.13 Linearity and Convexity of Cross Entropy
The cross entropy $-\sum_x P_X(x) \log_2 Q_X(x)$ is linear in $P_X$ and convex in $Q_X$.
Example 1.16 Jensen's Inequality
Let $f$ be a concave function and let $X$ be a real-valued random variable. Then $\mathbb{E}[f(X)] \le f(\mathbb{E}[X])$; for convex $f$ the inequality is reversed.
Proof.
Since $f$ is concave, there is a line through the point $(\mathbb{E}[X], f(\mathbb{E}[X]))$ that lies on or above $f$, i.e., $f(x) \le f(\mathbb{E}[X]) + c\,(x - \mathbb{E}[X])$ for some constant $c$ and all $x$. Taking expectations of both sides gives the claim.
Example 1.17 Definitions of Convexity
These definitions are equivalent
Proof.
Using Jensen's Inequality,
Chapter 2 Information Theory for Continuous Variables
(1) Differential Entropy
Consider a real-valued random variable $X$ with density $p_X$. The differential entropy of $X$ is defined as
$$h(X) = -\int_{\operatorname{supp}(p_X)} p_X(x) \log_2 p_X(x)\, dx.$$
Example 2.1
Consider the uniform distribution with density $p_X(x) = 1/A$ for $x \in [0, A)$ and $p_X(x) = 0$ otherwise. Its differential entropy is $h(X) = \log_2 A$, which, unlike entropy, can be negative (e.g., for $A < 1$).
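A minimal numerical sketch (function names and the value of $A$ are chosen for illustration) that approximates the defining integral and confirms that differential entropy can be negative:

```python
import numpy as np

def differential_entropy(pdf, a, b, n=200_000):
    """Approximate h(X) = -integral of p(x) log2 p(x) dx over [a, b]."""
    x = np.linspace(a, b, n)
    p = pdf(x)
    integrand = np.zeros_like(p)
    mask = p > 0
    integrand[mask] = -p[mask] * np.log2(p[mask])
    return float(np.sum(integrand) * (x[1] - x[0]))

A = 0.5  # width of the uniform density
uniform_pdf = lambda t: np.full_like(t, 1.0 / A)
print(differential_entropy(uniform_pdf, 0.0, A))  # ~ log2(0.5) = -1 bit
```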
Rules for Differential Entropy
Translation does not change differential entropy, $h(X + c) = h(X)$, while scaling shifts it by a constant, $h(aX) = h(X) + \log_2 |a|$ for $a \ne 0$.
Conditional Entropy
The conditional differential entropy is defined as $h(X \mid Y) = -\int p_{XY}(x, y) \log_2 p_{X|Y}(x \mid y)\, dx\, dy$, and, as in the discrete case, $h(X \mid Y) \le h(X)$.
(2) Mixed Distributions
(3) Example Distributions
(4) Informational Divergence
The information divergence between the densities $p_X$ and $p_Y$ is
$$D(p_X \| p_Y) = \int_{\operatorname{supp}(p_X)} p_X(x) \log_2 \frac{p_X(x)}{p_Y(x)}\, dx.$$
(5) Cross Entropy
The cross-entropy of two densities $p_X$ and $q_X$ is $-\int p_X(x) \log_2 q_X(x)\, dx$. As in the discrete case it upper bounds the differential entropy, $h(X) \le -\int p_X(x) \log_2 q_X(x)\, dx$,
with equality iff. $q_X = p_X$ almost everywhere.
(6) Maximum Entropy
Alphabet or Volume Constraint
The uniform distribution maximizes the entropy. If the support of $p_X$ is contained in a set of finite volume (length) $V$, then $h(X) \le \log_2 V$,
with equality iff. $p_X$ is uniform over that set.
First Moment Constraint
Under the first-moment constraint $\mathbb{E}[X] \le m$ for a non-negative random variable $X$, we have $h(X) \le \log_2(e\, m)$, and the exponential density with mean $m$ achieves the bound.
We use cross entropy to upper bound the entropy: choosing $q_X$ as the exponential density with mean $m$ gives $h(X) \le -\int p_X(x) \log_2 q_X(x)\, dx = \log_2 m + \frac{\mathbb{E}[X]}{m} \log_2 e \le \log_2(e\, m)$.
Second Moment Constraint
Under the second-moment constraint $\mathbb{E}[X^2] \le \sigma^2$, we have $h(X) \le \frac{1}{2} \log_2(2 \pi e \sigma^2)$.
Gaussian random variables maximize differential entropy under the second moment constraint.
Example 2.2
The second moment constraint for a scalar is $\mathbb{E}[X^2] \le \sigma^2$.
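A small numerical comparison, using standard closed-form differential entropies, of three densities with the same variance; the Gaussian value is the largest, as the maximum-entropy statement above asserts (the variable names and the chosen variance are illustrative):

```python
import math

sigma2 = 1.0  # common variance for all three densities

# Closed-form differential entropies in bits:
h_gauss = 0.5 * math.log2(2 * math.pi * math.e * sigma2)
h_uniform = math.log2(math.sqrt(12 * sigma2))              # width sqrt(12*sigma2)
h_laplace = math.log2(2 * math.e * math.sqrt(sigma2 / 2))  # scale b = sqrt(sigma2/2)

print(h_gauss, h_uniform, h_laplace)  # the Gaussian value is the largest
```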
Chapter 3 Coding of Memoryless Sources
We encode a source of independent and identically distributed random variables, a discrete memoryless source (DMS), into binary codewords.
A code is said to be uniquely decodable if every finite sequence of code symbols has at most one interpretation as a sequence of its codewords. A prefix-free code is automatically also a uniquely decodable code.
(1) Kraft Inequality
Theorem 3.1 Kraft Inequality
A $D$-ary prefix-free code with codeword lengths $l_1, l_2, \dots, l_L$ exists iff.
$$\sum_{i=1}^{L} D^{-l_i} \le 1.$$
Equality holds iff. the code has no unused leaves in its $D$-ary tree.
Kraft's inequality is thus both necessary and sufficient for the existence of a prefix-free code with the given codeword lengths. More generally, a uniquely decodable code with a given list of codeword lengths exists if and only if these lengths satisfy Kraft's inequality, so nothing is lost by restricting attention to prefix-free codes.
Proof.
Path Length Lemma: in a rooted tree with leaf and node probabilities, the average depth of the leaves equals the sum of the probabilities of the intermediate nodes (the root included).
Proof.
Leaf-Entropy Lemma: the entropy of the leaf distribution equals the sum, over the intermediate nodes, of the node probability times the branching entropy at that node.
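A minimal sketch that evaluates the Kraft sum for a few lists of codeword lengths; the example lengths are chosen for illustration:

```python
def kraft_sum(lengths, D=2):
    """Sum of D**(-l); a D-ary prefix-free code with these codeword
    lengths exists iff the sum is at most 1."""
    return sum(D ** (-l) for l in lengths)

print(kraft_sum([1, 2, 3, 3]))  # 1.0   -> achievable, no unused leaves
print(kraft_sum([1, 2, 2, 3]))  # 1.125 -> no prefix-free code exists
print(kraft_sum([2, 2, 2]))     # 0.75  -> achievable, with an unused leaf
```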
(2) Entropy Bounds on Source Coding
Theorem 3.2 Coding Theorem for a Single Random Variable
The average code word length $\mathbb{E}[L]$ of an optimum prefix-free code for a random variable $X$ satisfies
$$H(X) \le \mathbb{E}[L] < H(X) + 1.$$
Equality holds on the left iff. $P_X(x)$ is a negative integer power of $2$ for all $x$.
Proof.
For the lower bound, we use the Path Length Lemma and the Leaf-Entropy Lemma,
and get $H(X) \le \mathbb{E}[L]$: the leaf entropy is $\sum_n P_n H_n \le \sum_n P_n = \mathbb{E}[L]$, because each binary branching has entropy at most one bit.
For the upper bound, we use Kraft's Inequality.
Let $l(x) = \lceil -\log_2 P_X(x) \rceil$ for every letter $x$; these lengths satisfy Kraft's inequality, so a prefix-free code with these lengths exists.
We choose the shortest codes among all such prefix-free codes.
Therefore,
$$\mathbb{E}[L] \le \sum_x P_X(x) \lceil -\log_2 P_X(x) \rceil < \sum_x P_X(x) \big(1 - \log_2 P_X(x)\big),$$
and we get $\mathbb{E}[L] < H(X) + 1$.
Theorem 3.3 Coding Theorem for a DMS
The average code word length per source letter, $\mathbb{E}[L]/N$, of an optimum prefix-free code for a DMS that is parsed into blocks of length $N$ satisfies
$$H(X) \le \frac{\mathbb{E}[L]}{N} < H(X) + \frac{1}{N}.$$
Proof.
We already have the coding theorem for a single random variable; applying it to the block $X_1 \cdots X_N$ gives $H(X_1 \cdots X_N) \le \mathbb{E}[L] < H(X_1 \cdots X_N) + 1$.
Since the source is a DMS, we have $H(X_1 \cdots X_N) = N H(X)$, and dividing by $N$ completes the proof.
(3) Huffman Codes
The prefix-free codes that minimize $\mathbb{E}[L]$ for a random variable $X$ are called Huffman codes. The branches of the code tree correspond to the codewords of a prefix-free code with minimal $\mathbb{E}[L]$.
Optimal prefix-free codes assign shorter code words to more probable letters. All used leaves in the tree of an optimal prefix-free code have a sibling. The number of unused leaves in the tree of an optimal prefix-free code is zero for binary codes and at most $D - 2$ for $D$-ary codes.
Proof.
A
The number of unused leaves is
Since
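A compact sketch of the Huffman construction (repeatedly merging the two least probable subtrees); the pmf and function names are illustrative, and ties are broken arbitrarily, so the codewords may differ from a hand construction while the average length stays optimal:

```python
import heapq

def huffman_code(pmf):
    """Binary Huffman code for a pmf given as {letter: probability}.
    Returns {letter: codeword string}; ties are broken arbitrarily."""
    heap = [(p, i, {x: ''}) for i, (x, p) in enumerate(pmf.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # two least probable subtrees
        p1, _, c1 = heapq.heappop(heap)
        merged = {x: '0' + w for x, w in c0.items()}
        merged.update({x: '1' + w for x, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

pmf = {'a': 0.4, 'b': 0.3, 'c': 0.2, 'd': 0.1}
code = huffman_code(pmf)
avg_len = sum(pmf[x] * len(w) for x, w in code.items())
print(code)     # e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
print(avg_len)  # 1.9 bits, compared with H(X) ~ 1.846 bits
```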
(4) Tunstall Codes
The proper dictionary codes with a fixed codeword block length that maximize the expected dictionary word length $\mathbb{E}[W]$ for a random variable $X$ are called Tunstall codes. The branches of the tree are source sequences (the dictionary words) chosen to maximize $\mathbb{E}[W]$.
A proper dictionary satisfies the following two conditions: The dictionary (or message set) satisfies the prefix-free condition, i.e., the dictionary words are leaves of a rooted tree. No leaves are unused.
Choose the largest dictionary size $M$ that satisfies $M \le 2^{b}$, where $b$ is the codeword block length. Similar to the Huffman code, we now have entropy bounds on the performance of Tunstall codes.
Proof.
Using the Leaf-Entropy Lemma, we get that the entropy of the dictionary-word distribution equals $\mathbb{E}[W]\, H(X)$ for a DMS, since every branching of the parsing tree has branching entropy $H(X)$ and the node probabilities sum to $\mathbb{E}[W]$ by the Path Length Lemma.
Combining both bounds finishes the proof.
Theorem 3.4 Converse to the Coding Theorem for a DMS.
A prefix-free encoding of a proper dictionary for a DMS satisfies
$$\frac{\mathbb{E}[L]}{\mathbb{E}[W]} \ge H(X),$$
i.e., the average number of code bits per encoded source letter is at least the source entropy.
This is a more general lower bound: it applies to first using a variable-length-to-block encoder with a proper dictionary, followed by a prefix-free block-to-variable-length encoder.
First, we apply a Tunstall code to parse the source into dictionary words with expected length $\mathbb{E}[W]$.
Then, we apply a Huffman code to encode each dictionary word into a prefix-free codeword.
(5) Huffman Codes and Tunstall Codes
Huffman codes and Tunstall codes are both built on prefix-free trees, in which no codeword (or dictionary word) is a prefix of any other, allowing sequences to be uniquely decoded. However, there are some key differences between these two types of codes:
Huffman codes (block to variable length) are constructed using a method called Huffman coding, which is a lossless data compression algorithm that assigns shorter codes to more probable symbols and longer codes to less probable symbols, in order to minimize the expected length of the encoded message.
Tunstall codes (variable length to block), on the other hand, are constructed using the Tunstall algorithm, a lossless data compression method that builds a prefix-free dictionary of variable-length source phrases and assigns a fixed-length codeword to each phrase; the source is parsed into non-overlapping dictionary words.
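A minimal sketch of the Tunstall construction described above: keep expanding the most probable leaf of the source tree while the dictionary still fits into the $2^b$ available fixed-length codewords. The pmf, block length, and function names are illustrative:

```python
import heapq

def tunstall_dictionary(pmf, block_len):
    """Tunstall dictionary for a DMS with pmf {letter: probability}: keep
    expanding the most probable leaf while the leaves fit into 2**block_len
    fixed-length codewords."""
    K = len(pmf)
    leaves = [(-p, word) for word, p in pmf.items()]  # max-heap via negation
    heapq.heapify(leaves)
    # expanding one leaf into K leaves adds K - 1 leaves to the tree
    while len(leaves) + (K - 1) <= 2 ** block_len:
        neg_p, word = heapq.heappop(leaves)
        for letter, p in pmf.items():
            heapq.heappush(leaves, (neg_p * p, word + letter))
    return sorted(word for _, word in leaves)

pmf = {'a': 0.7, 'b': 0.3}
print(tunstall_dictionary(pmf, 2))  # ['aaa', 'aab', 'ab', 'b']
```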
Chapter 4 Coding of Stationary Sources
(1) Discrete Stationary Sources
A DSS outputs a sequence $X_1, X_2, \dots$ whose joint statistics are invariant to time shifts.
The entropy rate is defined as
$$H_\infty(X) = \lim_{N \to \infty} \frac{1}{N} H(X_1 X_2 \cdots X_N).$$
It has the property that $\frac{1}{N} H(X_1 \cdots X_N)$ is non-increasing in $N$, so the limit exists for a DSS.
Theorem 4.1 Entropy Rate for a DSS
For a DSS, the entropy rate exists and satisfies $H_\infty(X) = \lim_{N \to \infty} H(X_N \mid X_1 \cdots X_{N-1})$.
Proof.
Similarly,
Thus,
And
Proof.
(2) Entropy Bounds on Source Coding a DSS
Theorem 4.2 Coding Theorem for a DSS
The average code word length per source letter, $\mathbb{E}[L]/N$, of an optimum prefix-free code for a DSS that is parsed into blocks of length $N$ satisfies
$$\frac{1}{N} H(X_1 \cdots X_N) \le \frac{\mathbb{E}[L]}{N} < \frac{1}{N} H(X_1 \cdots X_N) + \frac{1}{N}.$$
Proof.
We already have the coding theorem for a single random variable.
Now we replace the single random variable with a block of length $N$ and obtain $H(X_1 \cdots X_N) \le \mathbb{E}[L] < H(X_1 \cdots X_N) + 1$.
Using the notation $H_N = \frac{1}{N} H(X_1 \cdots X_N)$ and dividing by $N$ gives the bounds of the theorem.
Since $H_N$ converges to the entropy rate of the DSS, the code rate per source letter approaches the entropy rate as $N \to \infty$.
(3) Elias Code for Positive Integers
x | Elias Code | Elias Code Length |
---|---|---|
1 | 1 | 1 |
2 | 010.0 | 4 |
3 | 010.1 | 4 |
4 | 011.00 | 5 |
5 | 011.01 | 5 |
6 | 011.10 | 5 |
7 | 011.11 | 5 |
8 | 00100.000 | 8 |
9 | 00100.001 | 8 |
Proof.
For a positive integer $x$, the plain base-2 representation is:
x | Binary Representation | Code Length |
---|---|---|
1 | 1 | 1 |
2 | 10 | 2 |
3 | 11 | 2 |
4 | 100 | 3 |
5 | 101 | 3 |
6 | 110 | 3 |
7 | 111 | 3 |
8 | 1000 | 4 |
9 | 1001 | 4 |
This base-2 code is not prefix-free, and neither is a ternary or any other base representation. Therefore, we pre-pad each codeword with $L(x) - 1$ zeros, where $L(x)$ is the number of binary digits of $x$, so that a decoder can tell where each codeword ends:
x | Zero-pre-padded Binary Representation | Code Length |
---|---|---|
1 | 1 | 1 |
2 | 0.10 | 3 |
3 | 0.11 | 3 |
4 | 00.100 | 5 |
5 | 00.101 | 5 |
6 | 00.110 | 5 |
7 | 00.111 | 5 |
8 | 000.1000 | 7 |
9 | 000.1001 | 7 |
Elias pre-pads differently to shorten this code: instead of $L(x) - 1$ zeros, he prepends the zero-pre-padded binary representation of $L(x)$ itself, giving:
x | Elias-pre-padded Binary Representation | Code Length |
---|---|---|
1 | 1 | 1 |
2 | 010.10 | 5 |
3 | 010.11 | 5 |
4 | 011.100 | 6 |
5 | 011.101 | 6 |
6 | 011.110 | 6 |
7 | 011.111 | 6 |
8 | 00100.1000 | 9 |
9 | 00100.1001 | 9 |
Finally, we get the Elias code by omitting the first "1" after ".", since this bit is always 1 and carries no information; the resulting codeword length is $\lfloor \log_2 x \rfloor + 2 \lfloor \log_2(\lfloor \log_2 x \rfloor + 1) \rfloor + 1$.
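A minimal sketch of the Elias encoder and decoder just described; variable names are illustrative:

```python
def elias_encode(n: int) -> str:
    """Elias code of a positive integer, as in the table above: prepend the
    zero-pre-padded binary length, then drop the leading '1' of binary(n)."""
    b = bin(n)[2:]               # binary representation of n
    k = bin(len(b))[2:]          # binary representation of its length
    return '0' * (len(k) - 1) + k + b[1:]

def elias_decode(bits: str) -> int:
    """Decode one Elias codeword from the start of a bit string."""
    z = 0
    while bits[z] == '0':        # count leading zeros
        z += 1
    length = int(bits[z:2 * z + 1], 2)       # number of bits of n
    body = bits[2 * z + 1:2 * z + length]    # the length-1 trailing bits of n
    return int('1' + body, 2)

for n in (1, 2, 3, 4, 8, 9):
    print(n, elias_encode(n))
print(elias_decode(elias_encode(9)))  # 9
```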
(4) Elias-Willems Universal Source Coding
We now describe a specific algorithm for source coding when the source statistics are unknown. An important result that we state without proof is that the expected recurrence time of a stationary and ergodic source satisfies
$$\mathbb{E}[T_i \mid X_i = a] = \frac{1}{P_X(a)}$$
for all times $i$ and letters $a$, where $T_i$ is the time back to the most recent previous occurrence of the letter $a$ before position $i$.
Example 4.1
Consider a DMS
We apply the binary Elias code to
The second inequality follows from the concavity of logarithm and Jensen's inequality.
Taking expectations we have
Consider a block of length
Theorem 4.3 Universal Source Coding Theorem for a DSS
The average code word length per source letter
of a binary code for a DSS that is parsed into blocks of length $N$, and for which the Elias code is used to map the time to the most recent occurrence of each block of letters, satisfies an upper bound that exceeds $\frac{1}{N} H(X_1 \cdots X_N)$ only by terms that vanish as $N$ grows.
Moreover, the scheme is universal: the code rate approaches the entropy rate of the DSS as $N \to \infty$, even though the encoder does not know the source statistics.
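A simplified sketch of the recurrence-time idea behind Elias-Willems coding (it reuses elias_encode from the sketch above); the handling of blocks seen for the first time here is a simplification chosen for illustration, not the exact rule of the scheme:

```python
def recurrence_times(blocks):
    """Time to the most recent previous occurrence of each block; blocks seen
    for the first time get a placeholder index (a simplification)."""
    last_seen = {}
    deltas = []
    for t, b in enumerate(blocks):
        if b in last_seen:
            deltas.append(t - last_seen[b])
        else:
            deltas.append(len(last_seen) + 1)  # placeholder for first visits
        last_seen[b] = t
    return deltas

# Frequent blocks recur quickly, so their recurrence times are small and the
# Elias code assigns them short codewords.
blocks = ['aa', 'ab', 'aa', 'aa', 'ba', 'aa', 'ab']
deltas = recurrence_times(blocks)
print(deltas)                                    # [1, 2, 2, 1, 3, 2, 5]
print(''.join(elias_encode(d) for d in deltas))  # bit stream sent to the decoder
```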
Chapter 5 Channel Coding
Data Processing
$$I(M; \hat{M}) \le I(X^n; Y^n) \le \sum_{i=1}^{n} I(X_i; Y_i),$$
where the first inequality follows from the Markov chain $M - X^n - Y^n - \hat{M}$ and from the fact that data processing doesn't increase mutual information, and the second inequality follows from the channel being memoryless, i.e., $Y_i$ only depends on $X_i$.
(1) Rate, Reliability and Cost
By reducing the rate, one can increase reliability; we thus expect a rate-reliability tradeoff. If one introduces a cost on the channel inputs, we additionally expect a tradeoff between rate and cost.
(2) Memoryless Channels
An encoder maps a source message $m$, drawn from a message set $\{1, 2, \dots, 2^{nR}\}$, to a channel input sequence $x^n(m)$; the memoryless channel maps $x^n$ to an output sequence $y^n$, and a decoder maps $y^n$ to a message estimate $\hat{m}$.
(3) Cost Functions
Suppose that transmitting the letter $x$ incurs a cost $s(x)$.
We consider the average block cost $\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[s(X_i)]$ and require it to be at most $S$.
The largest rate at which reliable transmission is possible under this cost constraint is the capacity-cost function $C(S)$.
Example 5.1 Power Constraint
The cost function $s(x) = x^2$ gives the average power constraint $\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[X_i^2] \le P$.
(4) Block and Bit Error Probability
We define the channel coding problem by requiring the block error probability $P_B = \Pr[\hat{M} \ne M]$
to be small.
But we also want to minimize the average bit error probability $P_b = \frac{1}{K} \sum_{i=1}^{K} \Pr[\hat{M}_i \ne M_i]$, where $M_1, \dots, M_K$ are the message bits.
Theorem 5.1 Block/Bit Error Probability
$P_b \le P_B \le K \cdot P_b$, where $K$ is the number of message bits per block. Equality holds on the left if all bits in a block are wrong whenever the block is wrong, and equality holds on the right if exactly one bit is wrong in each erroneous block.
Proof. 1
Intuitively,
Therefore,
Proof. 2
Theorem 5.2 Lower Bounds on $P_B$ and $P_b$, and Rate-Reliability Tradeoff
The block and bit error probabilities of any code of rate $R$ used on a channel with capacity $C$ are bounded from below in terms of $R$ and $C$.
We see that the rate $R$ is at most approximately $C$ if reliability is good ($P_B$ or $P_b$ is small).
Proof of the first bound.
Using Fano's inequality and
We suppose the source messages are equiprobable and get
If
Proof of the second bound.
where
Using the chain rule
we get the following bound
If
(5) Random Coding
Capacity-Cost Function
$$C(S) = \max_{P_X : \mathbb{E}[s(X)] \le S} I(X; Y).$$
(6) Concavity and Converse
Theorem
The capacity-cost function $C(S)$ is concave in $S$.
$C(S)$ is non-decreasing in $S$.
Proof.
where
Rate-reliability Tradeoff
(7) Discrete Alphabet Examples
<1> Binary Symmetric Channel
The best choice for the input distribution is the uniform one, which gives the BSC capacity $C = 1 - H_2(p)$ for crossover probability $p$.
<2> Binary Erasure Channel
Therefore, the capacity of the BEC with erasure probability $e$ is $C = 1 - e$, again achieved by a uniform input.
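A small sketch evaluating the BSC and BEC capacities stated above; the parameter values are illustrative:

```python
import math

def h2(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(crossover):
    """C = 1 - H_2(p) bits per use, achieved by a uniform input."""
    return 1.0 - h2(crossover)

def bec_capacity(erasure):
    """C = 1 - e bits per use, achieved by a uniform input."""
    return 1.0 - erasure

print(bsc_capacity(0.11))  # ~0.5 bits per channel use
print(bec_capacity(0.5))   # 0.5 bits per channel use
```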
<3> Strongly Symmetric Channels
Notice the entries in one row of the channel transition matrix add up to $1$.
Uniformly dispersive (entries in rows are the same, their order may change): for every input letter $x$, the list of probabilities $\{P_{Y|X}(y \mid x) : y \in \mathcal{Y}\}$ is the same. Therefore, $H(Y \mid X = x)$ is the same for all $x$.
Uniformly focusing (entries in columns are the same, their order may change): for every output letter $y$, the list of probabilities $\{P_{Y|X}(y \mid x) : x \in \mathcal{X}\}$ is the same. A uniform input to a uniformly focusing channel implies a uniform output.
Strongly symmetric: a channel is said to be strongly symmetric if it is both uniformly dispersive and uniformly focusing.
BEC is u.d. but not u.f. BSC is u.d. and u.f.
Strongly Symmetric Channel Capacity
Capacity is achieved by a uniform input distribution: $C = \log_2 |\mathcal{Y}| - H(Y \mid X = x)$ for any input letter $x$.
<4> Symmetric Channels
A BEC can be decomposed into two strongly symmetric channels:
Capacity of Symmetric DMC
If a symmetric channel decomposes into strongly symmetric sub-channels with capacities $C_k$ that are used with probabilities $q_k$, its capacity is $C = \sum_k q_k C_k$, achieved by a uniform input.
Example 5.2 BPSK over AWGN
We use a BPSK constellation at the channel input and a symmetric quantizer at the output; the resulting discrete channel is symmetric, so a uniform (equiprobable) input achieves its capacity.
(8) Continuous Alphabet Examples
Consider the additive white Gaussian noise channel with $Y = X + Z$, where $Z \sim \mathcal{N}(0, \sigma^2)$ is independent of $X$, under the power constraint $\mathbb{E}[X^2] \le P$.
We now maximize $I(X;Y) = h(Y) - h(Y \mid X) = h(Y) - h(Z)$ over the input distribution. Since $\mathbb{E}[Y^2] \le P + \sigma^2$, the maximum-entropy result gives $h(Y) \le \frac{1}{2} \log_2\big(2 \pi e (P + \sigma^2)\big)$,
with equality iff. $Y$ is Gaussian, which happens iff. $X \sim \mathcal{N}(0, P)$.
Finally, we get
$$C = \frac{1}{2} \log_2\Big(1 + \frac{P}{\sigma^2}\Big).$$
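A one-line evaluation of the AWGN capacity formula; the SNR values are illustrative:

```python
import math

def awgn_capacity(snr):
    """C = 0.5 * log2(1 + P/sigma^2) bits per real channel use."""
    return 0.5 * math.log2(1.0 + snr)

for snr_db in (0, 10, 20):
    snr = 10 ** (snr_db / 10)
    print(snr_db, "dB ->", awgn_capacity(snr), "bits/use")
```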
Example 5.3 AWGN channel with BPSK
We transmit symbols $X \in \{-\sqrt{P}, +\sqrt{P}\}$ with equal probability over the AWGN channel; the mutual information is then at most 1 bit per channel use and is evaluated numerically.
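A numerical sketch (assuming unit noise variance and the SNR values shown, with function names chosen for illustration) that evaluates the BPSK-constrained mutual information as $h(Y) - h(N)$ by integrating the bimodal output density:

```python
import numpy as np

def bpsk_awgn_mi(snr, n=20001, span=12.0):
    """Mutual information in bits of equiprobable BPSK over AWGN with unit
    noise variance, computed as h(Y) - h(N) by numerical integration."""
    a = np.sqrt(snr)                    # BPSK amplitude for noise variance 1
    y = np.linspace(-span, span, n)
    gauss = lambda m: np.exp(-(y - m) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
    p_y = 0.5 * gauss(a) + 0.5 * gauss(-a)      # bimodal output density
    h_y = -np.sum(p_y * np.log2(p_y)) * (y[1] - y[0])
    h_n = 0.5 * np.log2(2.0 * np.pi * np.e)     # differential entropy of N
    return h_y - h_n

for snr_db in (-10, 0, 10):
    snr = 10 ** (snr_db / 10)
    print(snr_db, "dB ->", round(bpsk_awgn_mi(snr), 3), "bits/use")
# The value saturates at 1 bit/use at high SNR, below the unconstrained
# Gaussian-input capacity 0.5 * log2(1 + SNR).
```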