Information Theory: Theorem Summary and Derivations


Chapter 1 Information Theory for Discrete Variables
  (1) Message Sets
  (2) Measuring Choice
  (3) Entropy
    Example 1.1
    Example 1.2
  (4) Example Distributions
  (5) Conditional Entropy
    Example 1.3
  (6) Joint Entropy
    Example 1.4 Chain Rule for Entropy
    Example 1.5
    Example 1.6
  (7) Information Divergence / Kullback-Leibler Distance
    Example 1.7
    Example 1.8
    Example 1.9
    Example 1.10 Chain Rule for Information Divergence
    Example 1.11 $P_1 P_2$ instead of $P_{X_1 X_2}$
  (8) Mutual Information
    Example 1.12
    Example 1.13 Chain Rule for Mutual Information
    Example 1.14
  (9) Cross Entropy
  (10) Inequalities
    Example 1.15
  (11) Convexity
    Example 1.16 Jensen's Inequality
    Example 1.17 Definitions of Convexity
Chapter 2 Information Theory for Continuous Variables
  (1) Differential Entropy
    Example 2.1
    Rules for Differential Entropy
  (2) Mixed Distributions
  (3) Example Distributions
  (4) Informational Divergence
  (5) Cross Entropy
  (6) Maximum Entropy
    Alphabet or Volume Constraint
    First Moment Constraint
    Second Moment Constraint
    Example 2.2
Chapter 3 Coding of Memoryless Sources
  (1) Kraft Inequality
  (2) Entropy Bounds on Source Coding
  (3) Huffman Codes
  (4) Tunstall Codes
  (5) Huffman Codes and Tunstall Codes
Chapter 4 Coding of Stationary Sources
  (1) Discrete Stationary Sources
  (2) Entropy Bounds on Source Coding a DSS
  (3) Elias Code for Positive Integers
  (4) Elias-Willems Universal Source Coding
    Example 4.1
Chapter 5 Channel Coding
  (1) Rate, Reliability and Cost
  (2) Memoryless Channels
  (3) Cost Functions
    Example 5.1 Power Constraint
  (4) Block and Bit Error Probability
  (5) Random Coding
  (6) Concavity and Converse
  (7) Discrete Alphabet Examples
    <1> Binary Symmetric Channel
    <2> Binary Erasure Channel
    <3> Strongly Symmetric Channels
    <4> Symmetric Channels
    Example 5.2 BPSK over AWGN
  (8) Continuous Alphabet Examples
    Example 5.3 AWGN channel with BPSK

Chapter 1 Information Theory for Discrete Variables

(1) Message Sets

What is a message? Shannon suggested that a message has to do with choice and sets. In his words: The significant aspect is that the actual message is one selected from a set of possible messages.

(2) Measuring Choice

Shannon was interested in defining a quantity that will measure, in some sense, how much information is “produced” by choosing messages. He suggested that the logarithm of the number of elements in the message set can be regarded as a measure of the information produced when one message is chosen from the set, all choices being equally likely.

For example, consider a fair coin that takes on values in $\{\text{Heads},\text{Tails}\}$ and suppose we flip the coin $n$ times. The string of coin flips takes on one of $2^n$ values, all equally likely, and the information produced is $\log_2 2^n = n$ bits.


For non-equally likely choices, Shannon developed a measure that has the same form as the entropy in statistical mechanics. For example, consider a biased coin that takes on the value Heads with probability $p$ and Tails with probability $1-p$. The string of coin flips typically takes on one of roughly $\binom{n}{np}$ values (since the expected number of Heads is $np$). Shannon proposed that the information produced when flipping the coin $n$ times is

$$\log_2 \binom{n}{np} \approx n\left(p \log_2 \frac{1}{p} + (1-p)\log_2 \frac{1}{1-p}\right) \text{ bits.}$$
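As a quick sanity check of this approximation, the following Python sketch compares $\log_2\binom{n}{np}$ with $n\,H_2(p)$; the values $n=1000$ and $p=0.3$ are made up for illustration:

```python
# Numerical check of log2 C(n, np) ≈ n * H2(p); n and p are illustrative values.
from math import comb, log2

n, p = 1000, 0.3
k = round(n * p)  # expected number of Heads

exact = log2(comb(n, k))                                      # log2 of the number of typical strings
approx = n * (p * log2(1 / p) + (1 - p) * log2(1 / (1 - p)))  # n * H2(p)

print(exact, approx)  # ≈ 876.1 vs ≈ 881.3 bits; the per-flip gap vanishes as n grows
```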

(3) Entropy

Let $\mathrm{supp}(f)$ be the support set of the function $f$, i.e., the set of $a$ such that $f(a) > 0$. We define the entropy or uncertainty of the discrete random variable $X$ as

$$H(X) = -\sum_{a \in \mathrm{supp}(P_X)} P_X(a) \log_2 P_X(a) = \mathrm{E}\left[-\log_2 P_X(X)\right].$$
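A minimal Python sketch of this definition (the helper name `entropy` is made up for this note; the sum runs only over the support):

```python
# Entropy of a discrete distribution given as a list of probabilities.
from math import log2

def entropy(p):
    """H(X) = -sum over supp(P_X) of P_X(a) * log2 P_X(a)."""
    return -sum(pa * log2(pa) for pa in p if pa > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit (fair coin)
print(entropy([1.0, 0.0]))  # 0.0 bits (no uncertainty)
print(entropy([0.25] * 4))  # 2.0 bits (uniform on four letters)
```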

Example 1.1

The entropy $H(X)$ depends only on the probability distribution $P_X(\cdot)$, and not on the letters $a$ in the alphabet $\mathcal{X}$ of $X$. We thus have

$$H(X) = H(g(X))$$

for any invertible (bijective) function $g(\cdot)$, since $g(\cdot)$ simply relabels the letters in the alphabet. Examples include $Y = 2X$.

Example 1.2

Consider the Bernoulli distribution that has the alphabet $\{0,1\}$ and $P_X(1) = p$. The entropy of $X$ is

$$H_2(p) = -p\log_2 p - (1-p)\log_2(1-p)$$

and $H_2(\cdot)$ is called the binary entropy function.
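A small sketch of $H_2(\cdot)$, showing that it peaks at one bit for $p = 1/2$; the test values are arbitrary:

```python
# Binary entropy function H2(p); H2(0) = H2(1) = 0 by the convention 0 * log2(0) = 0.
from math import log2

def h2(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(h2(0.5))   # 1.0 bit, the maximum
print(h2(0.11))  # ≈ 0.4999 bits, roughly half a bit of uncertainty
```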

Theorem 1.1

$$0 \le H(X) \le \log_2|\mathcal{X}|$$

with equality on the left iff. there is one letter $a \in \mathcal{X}$ with $P_X(a) = 1$, and with equality on the right iff. $P_X(a) = \frac{1}{|\mathcal{X}|}$ for all $a \in \mathcal{X}$, i.e., $X$ is uniformly distributed.

Proof (of the right equality).

Using $\log_2(x) = \ln(x)\log_2(e) \le (x-1)\log_2(e)$, we have
$$\begin{aligned}
H(X) &= \mathrm{E}\left[\log_2 \frac{1}{|\mathcal{X}|\,P_X(X)}\right] + \log_2|\mathcal{X}| \\
&\le \mathrm{E}\left[\frac{1}{|\mathcal{X}|\,P_X(X)} - 1\right]\log_2(e) + \log_2|\mathcal{X}| \\
&= \log_2(e)\sum_{a \in \mathrm{supp}(P_X)} P_X(a)\left(\frac{1}{|\mathcal{X}|\,P_X(a)} - 1\right) + \log_2|\mathcal{X}| \\
&= \log_2(e)\left(\frac{|\mathrm{supp}(P_X)|}{|\mathcal{X}|} - 1\right) + \log_2|\mathcal{X}| \\
&\le \log_2|\mathcal{X}|.
\end{aligned}$$
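A quick numerical spot-check of Theorem 1.1 on a few made-up distributions over a 4-letter alphabet:

```python
# Verify 0 <= H(X) <= log2|X| for some hand-picked distributions with |X| = 4.
from math import log2

def entropy(p):
    return -sum(pa * log2(pa) for pa in p if pa > 0)

for p in ([1.0, 0.0, 0.0, 0.0], [0.7, 0.1, 0.1, 0.1], [0.25] * 4):
    h = entropy(p)
    assert 0 <= h <= log2(4) + 1e-12
    print(p, h)  # 0.0, ≈ 1.357, and 2.0 bits; equality at the two extremes
```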

(4) Example Distributions

(5) Conditional Entropy

Consider the joint distribution $P_{XY}(\cdot)$ where $X$ and $Y$ are discrete random variables.

Focusing on one row of the joint distribution table, the conditional entropy of $X$ given the event $Y=b$ with $\Pr[Y=b] > 0$ is

$$H(X|Y=b) = -\sum_{a \in \mathrm{supp}(P_{X|Y}(\cdot|b))} P_{X|Y}(a|b)\log_2 P_{X|Y}(a|b) = \mathrm{E}\left[-\log_2 P_{X|Y}(X|b) \,\middle|\, Y=b\right].$$

Similarly,

$$0 \le H(X|Y=b) \le \log_2|\mathcal{X}|$$

with equality on the left iff. $P_{X|Y}(a|b) = 1$ for some $a$, and with equality on the right iff. $P_{X|Y}(a|b) = \frac{1}{|\mathcal{X}|}$ for all $a \in \mathcal{X}$.


We average the conditional entropy over all $b \in \mathrm{supp}(P_Y)$ and get the conditional entropy of $X$ given $Y$:

$$H(X|Y) = \sum_{b \in \mathrm{supp}(P_Y)} P_Y(b)\, H(X|Y=b) = -\sum_{(a,b) \in \mathrm{supp}(P_{XY})} P_{XY}(a,b)\log_2 P_{X|Y}(a|b) = \mathrm{E}\left[-\log_2 P_{X|Y}(X|Y)\right].$$

Also,

$$0 \le H(X|Y) \le \log_2|\mathcal{X}|$$

with equality on the left iff. for every $b \in \mathrm{supp}(P_Y)$ there is some $a$ with $P_{X|Y}(a|b) = 1$, and with equality on the right iff. $P_{X|Y}(a|b) = \frac{1}{|\mathcal{X}|}$ for all $a \in \mathcal{X}$ and every $b \in \mathrm{supp}(P_Y)$.
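The following sketch computes $H(X|Y)$ directly from a made-up joint table $P_{XY}$, using the averaging form above:

```python
# H(X|Y) = -sum_{(a,b)} P_XY(a,b) * log2 P_{X|Y}(a|b), with P_{X|Y}(a|b) = P_XY(a,b) / P_Y(b).
from math import log2

p_xy = {('a', 0): 0.4, ('b', 0): 0.1, ('a', 1): 0.1, ('b', 1): 0.4}  # illustrative joint table

p_y = {}
for (a, b), p in p_xy.items():
    p_y[b] = p_y.get(b, 0.0) + p                 # marginal P_Y

h_x_given_y = -sum(p * log2(p / p_y[b]) for (a, b), p in p_xy.items() if p > 0)
print(h_x_given_y)  # ≈ 0.722 bits, smaller than H(X) = 1 bit for this table
```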

Example 1.3

We say that $Y$ essentially determines $X$ if $H(X|Y) = 0$.

Recall Example 1.1: for an invertible function $g(\cdot)$ we have $H(X) = H(g(X))$. Here $X$ determines $g(X)$ and $g(X)$ determines $X$, so

$$H(g(X)|X) = H(X|g(X)) = 0.$$

For a non-invertible function $f(\cdot)$, we have $H(X) \ge H(f(X))$, since $f$ may map several letters of $\mathcal{X}$ to the same value. $X$ still determines $f(X)$, but $f(X)$ no longer determines $X$:

$$H(f(X)|X) = 0, \qquad H(X|f(X)) \ge 0.$$

Examples include $Y = |X|$ and $Y = \cos X$.
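As a concrete check of Example 1.3, the sketch below uses the made-up choice of $X$ uniform on $\{-1,0,1\}$ with $f(X) = |X|$:

```python
# H(f(X)|X) = 0 since f(X) is a function of X, while H(X|f(X)) can be positive.
from math import log2

p_x = {-1: 1/3, 0: 1/3, 1: 1/3}
p_joint = {(x, abs(x)): p for x, p in p_x.items()}   # joint distribution of (X, |X|)

p_fx = {}
for (x, y), p in p_joint.items():
    p_fx[y] = p_fx.get(y, 0.0) + p                   # marginal of f(X) = |X|

h_x_given_fx = -sum(p * log2(p / p_fx[y]) for (x, y), p in p_joint.items() if p > 0)
print(h_x_given_fx)  # ≈ 0.667 bits: observing |X| = 1 still leaves one bit of doubt about the sign
```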

(6) Joint Entropy

The joint entropy of X and Y is defined by considering the concatenation (not multiplication!) XY of X and Y as a new discrete random variable, i.e., we have

$$H(XY) = -\sum_{(a,b) \in \mathrm{supp}(P_{XY})} P_{XY}(a,b)\log_2 P_{XY}(a,b) = \mathrm{E}\left[-\log_2 P_{XY}(X,Y)\right]$$

and

$$\max\left(H(X), H(Y)\right) \le H(XY) \le \log_2\left(|\mathcal{X}||\mathcal{Y}|\right)$$

with equality on the left iff. $X$ essentially determines $Y$, or $Y$ essentially determines $X$, or both; and with equality on the right iff. $P_{XY}(a,b) = \frac{1}{|\mathcal{X}||\mathcal{Y}|}$ for all $(a,b) \in \mathcal{X} \times \mathcal{Y}$.

Example 1.4 Chain Rule for Entropy

$$H(XY) = H(X) + H(Y|X) = H(Y) + H(X|Y)$$
$$H(X^n) = \sum_{i=1}^{n} H(X_i \mid X^{i-1}), \quad \text{where } X^{i-1} = (X_1, \dots, X_{i-1}) \text{ and } X^0 \text{ is a constant.}$$
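A numerical check of the two-variable chain rule on a made-up joint distribution:

```python
# Verify H(XY) = H(X) + H(Y|X) on a small joint table.
from math import log2

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

p_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}

p_x = {}
for (a, b), p in p_xy.items():
    p_x[a] = p_x.get(a, 0.0) + p

h_y_given_x = -sum(p * log2(p / p_x[a]) for (a, b), p in p_xy.items() if p > 0)
print(entropy(p_xy), entropy(p_x) + h_y_given_x)  # both 1.5 bits
```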

Example 1.5

$$H(XY|Z) \ge H(X|Z).$$

Proof.

$$H(XY|Z) = H(X|Z) + H(Y|XZ) \ge H(X|Z).$$

Example 1.6

Using the conclusions about conditional entropy in Example 1.3, we study the joint entropies $H(Xf(X))$ and $H(Xg(X))$.

For an invertible function $g(\cdot)$,

$$H(g(X)|X) = H(X|g(X)) = 0 \;\Rightarrow\; H(Xg(X)) = H(X) = H(g(X)).$$

For a non-invertible function $f(\cdot)$,

$$H(f(X)|X) = 0,\; H(X|f(X)) \ge 0 \;\Rightarrow\; H(Xf(X)) = H(X) \ge H(f(X)).$$
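The sketch below illustrates Example 1.6 with the made-up choice of $X$ uniform on $\{-2,-1,1,2\}$ and $f(X) = |X|$: pairing $X$ with $f(X)$ adds no entropy, while $f(X)$ alone loses some.

```python
# Compare H(X), H(X f(X)) and H(f(X)) for f = abs.
from math import log2

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

p_x = {-2: 0.25, -1: 0.25, 1: 0.25, 2: 0.25}
p_x_fx = {(x, abs(x)): p for x, p in p_x.items()}    # joint distribution of (X, f(X))

p_fx = {}
for (x, y), p in p_x_fx.items():
    p_fx[y] = p_fx.get(y, 0.0) + p

print(entropy(p_x), entropy(p_x_fx), entropy(p_fx))  # 2.0, 2.0 and 1.0 bits
```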

(7) Information Divergence / Kullback-Leibler Distance

The informational divergence between $P_X(\cdot)$ and $P_Y(\cdot)$ is defined as

$$D(P_X \| P_Y) = \sum_{a \in \mathrm{supp}(P_X)} P_X(a)\log_2\frac{P_X(a)}{P_Y(a)} = \mathrm{E}\left[\log_2\frac{P_X(X)}{P_Y(X)}\right].$$

We define $D(P_X \| P_Y) = \infty$ if $P_Y(a) = 0$ for some $a \in \mathrm{supp}(P_X)$.

Observe that the definition is not symmetric in $P_X$ and $P_Y$, i.e., we have $D(P_X \| P_Y) \ne D(P_Y \| P_X)$ in general.

To avoid the case $D(P_X \| P_Y) = \infty$ we introduce the notation $P_Y \gg P_X$ to mean that $P_Y(a) = 0 \Rightarrow P_X(a) = 0$ for all $a \in \mathcal{X}$. In other words, $P_Y \gg P_X$ implies that $\mathrm{supp}(P_X) \subseteq \mathrm{supp}(P_Y)$, and hence $D(P_X \| P_Y) < \infty$ for finite sets.

Given a third discrete random variable $Z$, we define the conditional information divergence between $P_{X|Z}(\cdot)$ and $P_{Y|Z}(\cdot)$ as

$$\begin{aligned}
D(P_{X|Z} \| P_{Y|Z} \mid P_Z) &= \sum_{c \in \mathrm{supp}(P_Z)} P_Z(c)\, D\!\left(P_{X|Z}(\cdot|c) \,\|\, P_{Y|Z}(\cdot|c)\right) \\
&= \sum_{c \in \mathrm{supp}(P_Z)} P_Z(c) \sum_{a \in \mathrm{supp}(P_{X|Z}(\cdot|c))} P_{X|Z}(a|c)\log_2\frac{P_{X|Z}(a|c)}{P_{Y|Z}(a|c)} \\
&= \sum_{(a,c) \in \mathrm{supp}(P_{XZ})} P_{XZ}(a,c)\log_2\frac{P_{X|Z}(a|c)}{P_{Y|Z}(a|c)} \\
&= \mathrm{E}\left[\log_2\frac{P_{X|Z}(X|Z)}{P_{Y|Z}(X|Z)}\right].
\end{aligned}$$
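A minimal sketch of both divergences; the helper names `kl` and `conditional_kl` and all of the distributions below are made up for this note:

```python
# D(P||Q) and D(P_{X|Z} || P_{Y|Z} | P_Z) for distributions given as dicts.
from math import log2, inf

def kl(p, q):
    """Informational divergence in bits; +inf if supp(P) is not contained in supp(Q)."""
    total = 0.0
    for a, pa in p.items():
        if pa == 0:
            continue
        if q.get(a, 0) == 0:
            return inf
        total += pa * log2(pa / q[a])
    return total

def conditional_kl(p_x_given_z, p_y_given_z, p_z):
    """Average the per-z divergences with weights P_Z(c)."""
    return sum(pc * kl(p_x_given_z[c], p_y_given_z[c]) for c, pc in p_z.items() if pc > 0)

p_xz = {0: {'a': 0.9, 'b': 0.1}, 1: {'a': 0.5, 'b': 0.5}}  # P_{X|Z}(.|c)
p_yz = {0: {'a': 0.5, 'b': 0.5}, 1: {'a': 0.5, 'b': 0.5}}  # P_{Y|Z}(.|c)
print(conditional_kl(p_xz, p_yz, {0: 0.5, 1: 0.5}))        # ≈ 0.266 bits
```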

Theorem 1.2

$$D(P_X \| P_Y) \ge 0$$

with equality iff. $P_X(a) = P_Y(a)$ for all $a \in \mathrm{supp}(P_X)$.

Proof.

Using $\log_2(x) = \ln(x)\log_2(e) \le (x-1)\log_2(e)$, we have
$$\begin{aligned}
-D(P_X \| P_Y) &= \mathrm{E}\left[\log_2\frac{P_Y(X)}{P_X(X)}\right] \\
&\le \mathrm{E}\left[\frac{P_Y(X)}{P_X(X)} - 1\right]\log_2(e) \\
&= \sum_{a \in \mathrm{supp}(P_X)} P_X(a)\left[\frac{P_Y(a)}{P_X(a)} - 1\right]\log_2(e) \\
&\le 0.
\end{aligned}$$

Example 1.7

If $P_Y(\cdot)$ is uniform on $\mathcal{X}$, then $P_Y(a) = \frac{1}{|\mathcal{X}|}$ for all $a \in \mathrm{supp}(P_X)$, and

$$\begin{aligned}
D(P_X \| P_Y) &= \sum_{a \in \mathrm{supp}(P_X)} P_X(a)\log_2\frac{P_X(a)}{P_Y(a)} \\
&= \sum_{a \in \mathrm{supp}(P_X)} P_X(a)\log_2 P_X(a) - \sum_{a \in \mathrm{supp}(P_X)} P_X(a)\log_2 P_Y(a) \\
&= -H(X) + \log_2|\mathcal{X}|.
\end{aligned}$$
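A quick check of this identity on a made-up $P_X$ over four letters:

```python
# D(P_X || uniform) should equal log2|X| - H(X).
from math import log2

p_x = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
uniform = {a: 1 / len(p_x) for a in p_x}

d = sum(p * log2(p / uniform[a]) for a, p in p_x.items() if p > 0)
h = -sum(p * log2(p) for p in p_x.values() if p > 0)
print(d, log2(len(p_x)) - h)  # both 0.25 bits
```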

Example 1.8

If $P_X(\cdot)$ is uniform on $\mathcal{X}$, then $P_X(a) = \frac{1}{|\mathcal{X}|}$ for all $a \in \mathrm{supp}(P_X)$, and

$$\begin{aligned}
D(P_X \| P_Y) &= \sum_{a \in \mathrm{supp}(P_X)} P_X(a)\log_2\frac{P_X(a)}{P_Y(a)} \\
&= \sum_{a \in \mathrm{supp}(P_X)} P_X(a)\log_2 P_X(a) - \sum_{a \in \mathrm{supp}(P_X)} P_X(a)\log_2 P_Y(a) \\
&= -\log_2|\mathcal{X}| - \frac{1}{|\mathcal{X}|}\sum_{a \in \mathrm{supp}(P_X)}\log_2 P_Y(a).
\end{aligned}$$

Example 1.9

If $P_X(a) = 1$ for some $a \in \mathcal{X}$, then

$$D(P_X \| P_Y) = -\log_2 P_Y(a).$$

Example 1.10 Chain Rule for Information Divergence