Information Theory: Theorem Summary and Derivations



Chapter 1 Information Theory for Discrete Variables

(1) Message Sets

What is a message? Shannon suggested that a message has to do with choice and sets. In his words: "The significant aspect is that the actual message is one selected from a set of possible messages."

(2) Measuring Choice

Shannon was interested in defining a quantity that will measure, in some sense, how much information is “produced” by choosing messages. He suggested that the logarithm of the number of elements in the message set can be regarded as a measure of the information produced when one message is chosen from the set, all choices being equally likely.

For example, consider a fair coin that takes on values in {Heads, Tails} and suppose we flip the coin $n$ times. The string of coin flips takes on one of $2^n$ values, all equally likely, and the information produced is $\log_2 2^n = n$ bits.


For non-equally likely choices, Shannon developed a measure that has the same form as the entropy in statistical mechanics. For example, consider a biased coin that takes on the value Heads with probability $p$ and Tails with probability $1-p$. The string of coin flips typically takes on one of roughly $\binom{n}{np}$ values (since the expected number of Heads is $np$). Shannon proposed that the information produced when flipping the coin $n$ times is

$$\log_2\binom{n}{np} \approx n\left(p\log_2\frac{1}{p} + (1-p)\log_2\frac{1}{1-p}\right) \text{ bits}$$

(3) Entropy

Let supp(f) be the support set of the function f, i.e., the set of a such that f(a)>0. We define the entropy or uncertainty of the discrete random variable X as

$$H(X) = -\sum_{a\in\mathrm{supp}(P_X)} P_X(a)\log_2 P_X(a) = \mathbb{E}\left[-\log_2 P_X(X)\right]$$

Example 1.1

The entropy $H(X)$ depends only on the probability distribution $P_X(\cdot)$, not on the letters $a$ in the alphabet of $X$. We thus have

$$H(X) = H(g(X))$$

for any invertible or bijective function $g(\cdot)$, since $g(\cdot)$ simply relabels the letters in the alphabet. Examples include $Y = 2X$.

Example 1.2

Consider the Bernoulli distribution with alphabet $\{0,1\}$ and $P_X(1)=p$. The entropy of $X$ is

$$H_2(p) = -p\log_2 p - (1-p)\log_2(1-p)$$

and $H_2(\cdot)$ is called the binary entropy function.
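As a quick numerical illustration (a minimal sketch, not part of the original notes), the entropy and the binary entropy function can be computed in Python; the probability values below are arbitrary examples:

```python
import math

def entropy(pmf, base=2.0):
    """H(X) = -sum_a P_X(a) log P_X(a), summed over the support of the pmf."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

def binary_entropy(p):
    """Binary entropy function H2(p)."""
    return entropy([p, 1.0 - p])

print(binary_entropy(0.5))                  # 1.0 bit, the maximum
print(binary_entropy(0.11))                 # ~0.4999 bits
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits
```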

Theorem 1.1

$$0 \le H(X) \le \log_2|\mathcal{X}|$$

with equality on the left iff. there is one letter $a$ in $\mathcal{X}$ with $P_X(a)=1$, and with equality on the right iff. $P_X(a) = \frac{1}{|\mathcal{X}|}$ for all $a\in\mathcal{X}$, i.e., $X$ is uniformly distributed.

Proof of the upper bound. Using $\log_2(x) = \ln(x)\log_2(e) \le (x-1)\log_2(e)$:

$$\begin{aligned} H(X) &= \mathbb{E}\left[\log_2\frac{1}{|\mathcal{X}|P_X(X)}\right] + \log_2|\mathcal{X}| \le \mathbb{E}\left[\frac{1}{|\mathcal{X}|P_X(X)} - 1\right]\log_2(e) + \log_2|\mathcal{X}| \\ &= \log_2(e)\sum_{a\in\mathrm{supp}(P_X)} P_X(a)\left(\frac{1}{|\mathcal{X}|P_X(a)} - 1\right) + \log_2|\mathcal{X}| = \log_2(e)\left(\frac{|\mathrm{supp}(P_X)|}{|\mathcal{X}|} - 1\right) + \log_2|\mathcal{X}| \le \log_2|\mathcal{X}| \end{aligned}$$

(4) Example Distributions

(5) Conditional Entropy

Consider the joint distribution $P_{XY}(\cdot)$ where $X$ and $Y$ are discrete random variables.

Focusing on one row of the joint distribution table, the conditional entropy of $X$ given the event $Y=b$ with $\Pr[Y=b]>0$ is

$$H(X|Y=b) = -\sum_{a\in\mathrm{supp}(P_{X|Y}(\cdot|b))} P_{X|Y}(a|b)\log_2 P_{X|Y}(a|b) = \mathbb{E}\left[-\log_2 P_{X|Y}(X|Y=b)\right].$$

Similarly,

$$0 \le H(X|Y=b) \le \log_2|\mathcal{X}|$$

with equality on the left iff. $P_{X|Y}(a|b)=1$ for some $a$, and with equality on the right iff. $P_{X|Y}(a|b)=\frac{1}{|\mathcal{X}|}$ for all $a\in\mathcal{X}$.


We average the conditional entropy over all $b\in\mathcal{Y}$ and get the conditional entropy of $X$ given $Y$:

$$H(X|Y) = \sum_{b\in\mathrm{supp}(P_Y)} P_Y(b)\,H(X|Y=b) = -\sum_{(a,b)\in\mathrm{supp}(P_{XY})} P_{XY}(a,b)\log_2 P_{X|Y}(a|b) = \mathbb{E}\left[-\log_2 P_{X|Y}(X|Y)\right].$$

Also,

$$0 \le H(X|Y) \le \log_2|\mathcal{X}|$$

with equality on the left iff. for every $b$ there is one $a$ such that $P_{X|Y}(a|b)=1$, and with equality on the right iff. $P_{X|Y}(a|b)=\frac{1}{|\mathcal{X}|}$ for all $a\in\mathcal{X}$ and every $b$.

Example 1.3

We say that $Y$ essentially determines $X$ if $H(X|Y)=0$.

Recall Example 1.1: for an invertible $g(\cdot)$ we have $H(X)=H(g(X))$. $X$ determines $g(X)$ and $g(X)$ determines $X$, so

$$H(g(X)|X) = H(X|g(X)) = 0.$$

For a non-invertible function $f(\cdot)$, $H(X) \ge H(f(X))$, since several letters of $X$ may be mapped to the same value. $X$ still determines $f(X)$, but $f(X)$ no longer determines $X$:

$$H(f(X)|X) = 0, \qquad H(X|f(X)) \ge 0.$$

Examples include $Y=|X|$ and $Y=\cos X$.

(6) Joint Entropy

The joint entropy of $X$ and $Y$ is defined by considering the concatenation (not multiplication!) $XY$ of $X$ and $Y$ as a new discrete random variable, i.e., we have

$$H(XY) = -\sum_{(a,b)\in\mathrm{supp}(P_{XY})} P_{XY}(a,b)\log_2 P_{XY}(a,b) = \mathbb{E}\left[-\log_2 P_{XY}(X,Y)\right]$$

and

$$\max\big(H(X),H(Y)\big) \le H(XY) \le \log_2\big(|\mathcal{X}||\mathcal{Y}|\big)$$

with equality on the left iff. $X$ essentially determines $Y$, or $Y$ essentially determines $X$, or both, and with equality on the right iff. $P_{XY}(a,b)=\frac{1}{|\mathcal{X}||\mathcal{Y}|}$ for all $(a,b)$.

Example 1.4 Chain Rule for Entropy

$$H(XY) = H(X) + H(Y|X) = H(Y) + H(X|Y)$$

$$H(X^n) = \sum_{i=1}^n H(X_i|X^{i-1}), \quad \text{where } X^0 \text{ is a constant.}$$
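A small numerical check of the chain rule $H(XY) = H(X) + H(Y|X)$ (a sketch; the joint distribution below is an arbitrary example):

```python
import math

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Example joint distribution P_XY as a dict {(a, b): probability}.
P_XY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal P_X and the conditional entropy H(Y|X) = sum_a P_X(a) H(Y|X=a).
P_X = {a: sum(p for (x, _), p in P_XY.items() if x == a) for a in {x for x, _ in P_XY}}
H_Y_given_X = sum(
    pa * H([P_XY[(a, b)] / pa for (x, b) in P_XY if x == a])
    for a, pa in P_X.items()
)

print(H(list(P_XY.values())))               # H(XY)
print(H(list(P_X.values())) + H_Y_given_X)  # H(X) + H(Y|X), same value
```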

Example 1.5

$$H(XY|Z) \ge H(X|Z).$$

Proof.

$$H(XY|Z) = H(X|Z) + H(Y|XZ) \ge H(X|Z).$$

Example 1.6

Using the conclusions about conditional entropy in Example 1.3, we study the joint entropies $H(Xf(X))$ and $H(Xg(X))$.

For an invertible function $g(\cdot)$,

$$H(g(X)|X) = H(X|g(X)) = 0 \;\Rightarrow\; H(Xg(X)) = H(X) = H(g(X)).$$

For a non-invertible function $f(\cdot)$,

$$H(f(X)|X) = 0,\; H(X|f(X)) \ge 0 \;\Rightarrow\; H(Xf(X)) = H(X) \ge H(f(X)).$$

(7) Information Divergence / Kullback-Leibler Distance

The informational divergence between $P_X(\cdot)$ and $P_Y(\cdot)$ is defined as

$$D(P_X\|P_Y) = \sum_{a\in\mathrm{supp}(P_X)} P_X(a)\log_2\frac{P_X(a)}{P_Y(a)} = \mathbb{E}\left[\log_2\frac{P_X(X)}{P_Y(X)}\right]$$

We define $D(P_X\|P_Y) = \infty$ if $P_Y(a)=0$ for some $a\in\mathrm{supp}(P_X)$.

Observe that the definition is not symmetric in $P_X$ and $P_Y$, i.e., we have $D(P_X\|P_Y) \ne D(P_Y\|P_X)$ in general.

To avoid the case $D(P_X\|P_Y)=\infty$ we introduce the notation $P_X \ll P_Y$ to mean that $P_Y(a)=0 \Rightarrow P_X(a)=0$ for all $a\in\mathcal{X}$. In other words, $P_X \ll P_Y$ implies that $\mathrm{supp}(P_X) \subseteq \mathrm{supp}(P_Y)$, and hence $D(P_X\|P_Y) < \infty$ for finite sets.

Given a third discrete random variable $Z$, we define the conditional informational divergence between $P_{X|Z}(\cdot)$ and $P_{Y|Z}(\cdot)$ as

$$\begin{aligned} D(P_{X|Z}\|P_{Y|Z}|P_Z) &= \sum_{c\in\mathrm{supp}(P_Z)} P_Z(c)\,D\big(P_{X|Z}(\cdot|c)\,\big\|\,P_{Y|Z}(\cdot|c)\big) \\ &= \sum_{c\in\mathrm{supp}(P_Z)} P_Z(c)\sum_{a\in\mathrm{supp}(P_{X|Z}(\cdot|c))} P_{X|Z}(a|c)\log_2\frac{P_{X|Z}(a|c)}{P_{Y|Z}(a|c)} \\ &= \sum_{(a,c)\in\mathrm{supp}(P_{XZ})} P_{XZ}(a,c)\log_2\frac{P_{X|Z}(a|c)}{P_{Y|Z}(a|c)} = \mathbb{E}\left[\log_2\frac{P_{X|Z}(X|Z)}{P_{Y|Z}(X|Z)}\right] \end{aligned}$$

Theorem 1.2

$$D(P_X\|P_Y) \ge 0$$

with equality iff. $P_X(a)=P_Y(a)$ for all $a\in\mathrm{supp}(P_X)$.

Proof.

Using $\log_2(x) = \ln(x)\log_2(e) \le (x-1)\log_2(e)$:

$$-D(P_X\|P_Y) = \mathbb{E}\left[\log_2\frac{P_Y(X)}{P_X(X)}\right] \le \mathbb{E}\left[\frac{P_Y(X)}{P_X(X)} - 1\right]\log_2(e) = \sum_{a\in\mathrm{supp}(P_X)} P_X(a)\left[\frac{P_Y(a)}{P_X(a)} - 1\right]\log_2(e) \le 0$$
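A numerical sketch of the informational divergence and Theorem 1.2 (assuming $\mathrm{supp}(P_X) \subseteq \mathrm{supp}(P_Y)$; the distributions are arbitrary examples):

```python
import math

def kl_divergence(P, Q):
    """D(P||Q) in bits; assumes Q(a) > 0 wherever P(a) > 0."""
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.3, 0.2]
Q = [1/3, 1/3, 1/3]
print(kl_divergence(P, Q))  # >= 0; equals log2(3) - H(X), as in Example 1.7 below
print(kl_divergence(Q, P))  # generally a different value: D is not symmetric
print(kl_divergence(P, P))  # 0, the equality case
```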

Example 1.7

If $P_Y(\cdot)$ is uniform on $\mathcal{X}$, then $P_Y(a)=\frac{1}{|\mathcal{X}|}$ for all $a\in\mathrm{supp}(P_X)$ and

$$D(P_X\|P_Y) = \sum_{a\in\mathrm{supp}(P_X)} P_X(a)\log_2\frac{P_X(a)}{P_Y(a)} = \sum_{a\in\mathrm{supp}(P_X)} P_X(a)\log_2 P_X(a) - \sum_{a\in\mathrm{supp}(P_X)} P_X(a)\log_2 P_Y(a) = -H(X) + \log_2|\mathcal{X}|$$

Example 1.8

If $P_X(\cdot)$ is uniform on $\mathcal{X}$, then $P_X(a)=\frac{1}{|\mathcal{X}|}$ for all $a\in\mathrm{supp}(P_X)$ and

$$D(P_X\|P_Y) = \sum_{a\in\mathrm{supp}(P_X)} P_X(a)\log_2\frac{P_X(a)}{P_Y(a)} = \sum_{a\in\mathrm{supp}(P_X)} P_X(a)\log_2 P_X(a) - \sum_{a\in\mathrm{supp}(P_X)} P_X(a)\log_2 P_Y(a) = -\log_2|\mathcal{X}| - \frac{1}{|\mathcal{X}|}\sum_{a\in\mathrm{supp}(P_X)}\log_2 P_Y(a)$$

Example 1.9

If $P_X(a)=1$ for some $a\in\mathcal{X}$, then

$$D(P_X\|P_Y) = -\log_2 P_Y(a)$$

Example 1.10 Chain Rule for Information Divergence

$$\begin{aligned} D(P_{X^n}\|P_{Y^n}) &= \sum_{x^n\in\mathrm{supp}(P_{X^n})} P_{X^n}(x^n)\log_2\frac{P_{X^n}(x^n)}{P_{Y^n}(x^n)} = \sum_{x^n\in\mathrm{supp}(P_{X^n})} P_{X^n}(x^n)\log_2\frac{\prod_{i=1}^n P_{X_i|X^{i-1}}(x_i|x^{i-1})}{\prod_{i=1}^n P_{Y_i|Y^{i-1}}(x_i|x^{i-1})} \\ &= \sum_{i=1}^n\sum_{x^n\in\mathrm{supp}(P_{X^n})} P_{X^n}(x^n)\log_2\frac{P_{X_i|X^{i-1}}(x_i|x^{i-1})}{P_{Y_i|Y^{i-1}}(x_i|x^{i-1})} = \sum_{i=1}^n\sum_{x^i\in\mathrm{supp}(P_{X^i})} P_{X^i}(x^i)\log_2\frac{P_{X_i|X^{i-1}}(x_i|x^{i-1})}{P_{Y_i|Y^{i-1}}(x_i|x^{i-1})} \\ &= \sum_{i=1}^n\mathbb{E}\left[\log_2\frac{P_{X_i|X^{i-1}}(X_i|X^{i-1})}{P_{Y_i|Y^{i-1}}(X_i|X^{i-1})}\right] = \sum_{i=1}^n D\big(P_{X_i|X^{i-1}}\,\big\|\,P_{Y_i|Y^{i-1}}\,\big|\,P_{X^{i-1}}\big) \end{aligned}$$

where the second-to-last step marginalizes over $x_{i+1},\ldots,x_n$.

Example 1.11 $P_1P_2$ instead of $P_{X_1X_2}$

$$\begin{aligned} D(P_1P_2\|Q_1Q_2) &= \sum_{(a,b)\in\mathrm{supp}(P_1P_2)} P_1(a)P_2(b)\log_2\frac{P_1(a)P_2(b)}{Q_1(a)Q_2(b)} = \sum_{(a,b)\in\mathrm{supp}(P_1P_2)} P_1(a)P_2(b)\left(\log_2\frac{P_1(a)}{Q_1(a)} + \log_2\frac{P_2(b)}{Q_2(b)}\right) \\ &= \sum_{a\in\mathrm{supp}(P_1)} P_1(a)\log_2\frac{P_1(a)}{Q_1(a)} + \sum_{b\in\mathrm{supp}(P_2)} P_2(b)\log_2\frac{P_2(b)}{Q_2(b)} = D(P_1\|Q_1) + D(P_2\|Q_2). \end{aligned}$$

 

(8) Mutual Information

The mutual information $I(X;Y)$ between two random variables $X$ and $Y$ is defined as

$$I(X;Y) = D(P_{XY}\|P_XP_Y) = \sum_{(a,b)\in\mathrm{supp}(P_{XY})} P_{XY}(a,b)\log_2\frac{P_{XY}(a,b)}{P_X(a)P_Y(b)}$$

Note that $P_{XY}=P_XP_Y$ means that $X$ and $Y$ are statistically independent. Thus, $I(X;Y)$ measures the dependence of $X$ and $Y$.

$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(XY)$$
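A sketch computing $I(X;Y)$ from a joint pmf and checking the identity $I(X;Y) = H(X) + H(Y) - H(XY)$ (the joint distribution is an arbitrary example):

```python
import math

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

P_XY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
P_X = [0.5, 0.5]   # marginal of X
P_Y = [0.6, 0.4]   # marginal of Y

# I(X;Y) = D(P_XY || P_X P_Y)
I = sum(p * math.log2(p / (P_X[a] * P_Y[b])) for (a, b), p in P_XY.items() if p > 0)
print(I)                                          # ~0.1245 bits
print(H(P_X) + H(P_Y) - H(list(P_XY.values())))   # same value
```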

Theorem 1.3

$$I(X;Y) \ge 0, \qquad H(X|Y) \le H(X), \qquad H(Y|X) \le H(Y), \qquad H(XY) \le H(X) + H(Y)$$

with equality iff. X and Y are statistically independent.

Proof.

If $X$ and $Y$ are statistically independent, then $P_{XY}(a,b)=P_X(a)P_Y(b)$ and we have

$$\begin{aligned} H(X|Y) &= -\sum_{(a,b)\in\mathrm{supp}(P_{XY})} P_{XY}(a,b)\log_2 P_{X|Y}(a|b) = -\sum_{(a,b)\in\mathrm{supp}(P_{XY})} P_{XY}(a,b)\log_2 P_X(a) = -\sum_{a\in\mathrm{supp}(P_X)} P_X(a)\log_2 P_X(a) = H(X), \\ H(Y|X) &= H(Y), \\ H(XY) &= -\sum_{(a,b)\in\mathrm{supp}(P_{XY})} P_{XY}(a,b)\log_2 P_{XY}(a,b) = -\sum_{(a,b)\in\mathrm{supp}(P_{XY})} P_X(a)P_Y(b)\log_2\big(P_X(a)P_Y(b)\big) \\ &= -\sum_{a\in\mathrm{supp}(P_X)} P_X(a)\log_2 P_X(a) - \sum_{b\in\mathrm{supp}(P_Y)} P_Y(b)\log_2 P_Y(b) = H(X) + H(Y). \end{aligned}$$

The inequality means that conditioning doesn't increase entropy, or colloquially that conditioning reduces entropy. Note, however, that H(X|Y=b) can be larger than H(X).

Example 1.12

$$I(X;Y|Z) = H(X|Z) - H(X|YZ) = H(Y|Z) - H(Y|XZ)$$

We know that $0 \le I(X;Y|Z) \le \min\big(H(X|Z), H(Y|Z)\big)$.

When $I(X;Y|Z)=0$, we say that $X - Z - Y$ forms a Markov chain: given $Z$, $X$ and $Y$ are statistically independent, so $H(X|Z) = H(X|YZ)$ and

$$I(X;Y|Z) = D(P_{XY|Z}\|P_{X|Z}P_{Y|Z}|P_Z) = D(P_{X|Z}P_{Y|XZ}\|P_{X|Z}P_{Y|Z}|P_Z) \overset{(a)}{=} D(P_{X|Z}P_{Y|Z}\|P_{X|Z}P_{Y|Z}|P_Z) = 0,$$

where (a) uses the Markov property $P_{Y|XZ} = P_{Y|Z}$.

Example 1.13 Chain Rule for Mutual Information

$X^n = X_1X_2\ldots X_n$ is a string of $n$ discrete random variables and $Y$ is some discrete random variable. Then

$$I(X^n;Y) = H(X^n) - H(X^n|Y) = \sum_{i=1}^n H(X_i|X^{i-1}) - \sum_{i=1}^n H(X_i|X^{i-1}Y) = \sum_{i=1}^n\big(H(X_i|X^{i-1}) - H(X_i|X^{i-1}Y)\big) = \sum_{i=1}^n I(X_i;Y|X^{i-1})$$

Example 1.14

Using the chain rule for mutual information and the non-negativity of (conditional) mutual information, we have

$$I(X;Y) \le I(X;YZ), \qquad I(X;Y) \le I(XZ;Y).$$

(9) Cross Entropy

The cross entropy of two distributions $P_X(\cdot)$ and $P_Y(\cdot)$ with the same domain is defined as

$$\mathbb{X}(P_X\|P_Y) = -\sum_{a\in\mathrm{supp}(P_X)} P_X(a)\log_2 P_Y(a) = \mathbb{E}\left[-\log_2 P_Y(X)\right].$$

Theorem 1.4

$$\mathbb{X}(P_X\|P_Y) = H(X) + D(P_X\|P_Y) \ge H(X)$$

with equality iff. $P_X(a)=P_Y(a)$ for all $a\in\mathcal{X}$.

Theorem 1.4 gives a useful tool to upper bound entropy: every choice of $P_Y(\cdot)$ gives an upper bound on $H(X)$. To simplify the computation, one usually chooses $P_Y(\cdot)$ to have an exponential form, such as a Gaussian distribution for continuous random variables.
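A quick numerical sketch of Theorem 1.4 and the resulting upper bound on $H(X)$ (arbitrary example distributions):

```python
import math

def H(P):
    return -sum(p * math.log2(p) for p in P if p > 0)

def D(P, Q):
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

def cross_entropy(P, Q):
    """X(P_X||P_Y) = -sum_a P_X(a) log2 P_Y(a)."""
    return -sum(p * math.log2(q) for p, q in zip(P, Q) if p > 0)

P = [0.7, 0.2, 0.1]
Q = [0.5, 0.25, 0.25]
print(cross_entropy(P, Q))   # equals H(P) + D(P||Q), hence an upper bound on H(P)
print(H(P) + D(P, Q))
print(H(P))
```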

(10) Inequalities

Theorem 1.5 Log-sum Inequality

Let $S_a = \sum_{i=1}^n a_i$ and $S_b = \sum_{i=1}^n b_i$, and define $P_X(i) = a_i/S_a$ and $P_Y(i) = b_i/S_b$. We have

$$\sum_{i=1}^n a_i\log\frac{a_i}{b_i} = S_a D(P_X\|P_Y) + S_a\log\frac{S_a}{S_b} \ge S_a\log\frac{S_a}{S_b}$$

with equality iff. $a_i/b_i = S_a/S_b$ for all $i$.

Proof.

$$\sum_{i=1}^n a_i\log\frac{a_i}{b_i} = \sum_{i=1}^n a_i\log\left(\frac{P_X(i)}{P_Y(i)}\cdot\frac{S_a}{S_b}\right) = S_a\sum_{i=1}^n P_X(i)\log\frac{P_X(i)}{P_Y(i)} + S_a\log\frac{S_a}{S_b} = S_a D(P_X\|P_Y) + S_a\log\frac{S_a}{S_b}.$$

Example 1.15

$$a_1\log\frac{a_1}{b_1} + a_2\log\frac{a_2}{b_2} \ge (a_1+a_2)\log\frac{a_1+a_2}{b_1+b_2}$$

with equality iff. $a_1/b_1 = a_2/b_2 = (a_1+a_2)/(b_1+b_2)$.

Theorem 1.6 Data Processing Inequalities

Processing cannot increase mutual information:

If $X - Y - Z$ forms a Markov chain, we have $I(X;Z) \le I(X;Y)$ and $I(X;Z) \le I(Y;Z)$.

If $Y_1$ and $Y_2$ are the outputs of a channel $P_{Y|X}(\cdot)$ with inputs $X_1$ and $X_2$, respectively, we have $D(P_{Y_1}\|P_{Y_2}) \le D(P_{X_1}\|P_{X_2})$.

Proof.

Using the conclusions of Examples 1.13 and 1.14,

$$I(X;Z) \le I(X;YZ) = I(X;Y) + I(X;Z|Y) = I(X;Y), \qquad I(X;Z) \le I(XY;Z) = I(Y;Z) + I(X;Z|Y) = I(Y;Z).$$

Using the chain rule of information divergence, we have

$$\begin{aligned} D(P_{X_1Y_1}\|P_{X_2Y_2}) &= D(P_{X_1}\|P_{X_2}) + D(P_{Y_1|X_1}\|P_{Y_2|X_2}|P_{X_1}) = D(P_{X_1}\|P_{X_2}) \\ &= D(P_{Y_1}\|P_{Y_2}) + D(P_{X_1|Y_1}\|P_{X_2|Y_2}|P_{Y_1}) \end{aligned}$$

where $D(P_{Y_1|X_1}\|P_{Y_2|X_2}|P_{X_1}) = 0$ because both outputs are generated by the same channel $P_{Y|X}$,

which gives

$$D(P_{Y_1}\|P_{Y_2}) \le D(P_{X_1}\|P_{X_2}).$$

Theorem 1.9 Fano's Inequality

Suppose both $X$ and $\hat{X}$ take on values in the same alphabet $\mathcal{X}$, and let $P_e = \Pr[\hat{X} \ne X]$. We have

$$H(X|\hat{X}) \le H_2(P_e) + P_e\log_2(|\mathcal{X}|-1)$$

with equality iff. for all $a$ and $b$ we have

$$P_{X|\hat{X}}(a|b) = \begin{cases} 1-P_e & \text{if } a=b \\ \dfrac{P_e}{|\mathcal{X}|-1} & \text{if } a\ne b \end{cases}$$

Proof.

Let $E = \mathbb{1}(\hat{X}\ne X)$. Using the chain rule, we expand $H(EX|\hat{X})$ in two ways:

$$H(EX|\hat{X}) = H(X|\hat{X}) + H(E|X\hat{X}) \overset{(a)}{=} H(X|\hat{X})$$

where (a) follows because $X\hat{X}$ essentially determines $E$.

$$\begin{aligned} H(EX|\hat{X}) &= H(E|\hat{X}) + H(X|\hat{X}E) = H(E|\hat{X}) + \Pr[E=0]\,H(X|\hat{X},E=0) + \Pr[E=1]\,H(X|\hat{X},E=1) \\ &\overset{(b)}{=} H(E|\hat{X}) + \Pr[E=1]\,H(X|\hat{X},E=1) = H(E|\hat{X}) + P_e\,H(X|\hat{X},E=1) \\ &\overset{(c)}{\le} H(E|\hat{X}) + P_e\log_2(|\mathcal{X}|-1) \le H(E) + P_e\log_2(|\mathcal{X}|-1) = H_2(P_e) + P_e\log_2(|\mathcal{X}|-1) \end{aligned}$$

where (b) follows because $E=0$ means $X=\hat{X}$, so $\hat{X}$ determines $X$, and (c) follows because $E=1$ means $X\ne\hat{X}$, so $X$ can take at most $|\mathcal{X}|-1$ values.

(11) Convexity

Theorem 1.10 Concavity of Entropy

For any 0<λ<1

$$\lambda H(P_X) + (1-\lambda)H(Q_X) \le H\big(\lambda P_X + (1-\lambda)Q_X\big).$$

Theorem 1.11 Convexity of Information Divergence

For any 0<λ<1

$$\lambda D(P_X\|P_Y) + (1-\lambda)D(Q_X\|Q_Y) \ge D\big(\lambda P_X + (1-\lambda)Q_X\,\big\|\,\lambda P_Y + (1-\lambda)Q_Y\big).$$

Theorem 1.12 Mutual Information

We write $I(X;Y)$ as $I(P_X; P_{Y|X})$. The function $I(P_X; P_{Y|X})$ is concave in $P_X$ (the input distribution) if $P_{Y|X}$ (the channel) is fixed:

$$\lambda I(P_X; P_{Y|X}) + (1-\lambda)I(Q_X; P_{Y|X}) \le I\big(\lambda P_X + (1-\lambda)Q_X;\, P_{Y|X}\big)$$

for any 0<λ<1.

The function $I(P_X; P_{Y|X})$ is convex in $P_{Y|X}$ (the channel) if $P_X$ (the input distribution) is fixed:

$$\lambda I(P_X; P_{Y|X}) + (1-\lambda)I(P_X; Q_{Y|X}) \ge I\big(P_X;\, \lambda P_{Y|X} + (1-\lambda)Q_{Y|X}\big).$$

for any 0<λ<1.

Theorem 1.13 Linearity and Convexity of Cross Entropy

$$\lambda\,\mathbb{X}(P_X\|P_Y) + (1-\lambda)\,\mathbb{X}(Q_X\|P_Y) = \mathbb{X}\big(\lambda P_X + (1-\lambda)Q_X\,\big\|\,P_Y\big)$$

$$\lambda\,\mathbb{X}(P_X\|P_Y) + (1-\lambda)\,\mathbb{X}(P_X\|Q_Y) \ge \mathbb{X}\big(P_X\,\big\|\,\lambda P_Y + (1-\lambda)Q_Y\big)$$

 

Example 1.16 Jensen's Inequality

Let $f$ be a convex function on an interval $I$ and let $X$ be a random variable taking values in $I$. Then $\mathbb{E}[f(X)] \ge f(\mathbb{E}[X])$.

Proof.

Since $f(\cdot)$ is convex, for any $x_0$ there exists a real number $m$ such that $f(x) \ge f(x_0) + m(x-x_0)$ (a supporting line). We choose $x_0 = \mathbb{E}[X]$ and get $f(x) \ge f(\mathbb{E}[X]) + m(x-\mathbb{E}[X])$. This holds for every $x$, so taking the expectation over $X$:

$$\mathbb{E}[f(X)] \ge f(\mathbb{E}[X]) + m\big(\mathbb{E}[X]-\mathbb{E}[X]\big) = f(\mathbb{E}[X]).$$

Example 1.17 Definitions of Convexity

The following definitions of convexity are equivalent (the last one assuming $f$ is twice differentiable):

$$f(x) \ge f(x_0) + m(x-x_0), \qquad \lambda f(x_1) + (1-\lambda)f(x_2) \ge f\big(\lambda x_1 + (1-\lambda)x_2\big), \qquad f''(x) \ge 0$$

Proof.

Using Jensen's inequality with $\Pr[X=x_1]=\lambda$ and $\Pr[X=x_2]=1-\lambda$,

$$\lambda f(x_1) + (1-\lambda)f(x_2) = \mathbb{E}[f(X)] \ge f(\mathbb{E}[X]) = f\big(\lambda x_1 + (1-\lambda)x_2\big).$$

Chapter 2 Information Theory for Continuous Variables

(1) Differential Entropy

Consider a real-valued random variable $X$ with density $p_X(\cdot)$. The differential entropy of $X$ is defined as

$$h(X) = -\int_{\mathrm{supp}(p_X)} p_X(a)\log p_X(a)\,da = \mathbb{E}\left[-\log p_X(X)\right].$$

Example 2.1

Consider the uniform distribution with $p_X(a)=1/A$ for $a\in[0,A)$. Then

$$h(X) = \log(A)$$

and $h(X)$ is negative for $A<1$.

Rules for Differential Entropy

$$h(X+c) = h(X), \qquad h(cX) = h(X) + \log|c|, \qquad h(Y+cX|X) = h(Y|X)$$

Conditional Entropy

$$h(Y|X) = \int_{\mathrm{supp}(p_X)} p_X(a)\,h(Y|X=a)\,da$$

(2) Mixed Distributions

(3) Example Distributions

(4) Informational Divergence

The informational divergence between the densities $p_X$ and $p_Y$ is

$$D(p_X\|p_Y) = \int_{\mathrm{supp}(p_X)} p_X(a)\log\frac{p_X(a)}{p_Y(a)}\,da$$

  • $I(X;Y) = D(p_{XY}\|p_Xp_Y) \ge 0$

  • $D(p_X\|p_Y) \ge 0$

(5) Cross Entropy

The cross entropy of two densities $p_X$ and $p_Y$ is

$$\mathbb{X}(p_X\|p_Y) = -\int_{\mathrm{supp}(p_X)} p_X(a)\log p_Y(a)\,da$$

$$\mathbb{X}(p_X\|p_Y) = h(X) + D(p_X\|p_Y), \qquad D(p_X\|p_Y) \ge 0 \;\Rightarrow\; \mathbb{X}(p_X\|p_Y) \ge h(X)$$

with equality iff. $p_X(a)=p_Y(a)$ for almost all $a\in\mathrm{supp}(p_X)$.

(6) Maximum Entropy

Alphabet or Volume Constraint

The uniform distribution maximizes entropy. If $X$ is limited to a set $\mathcal{S}$ with volume $|\mathcal{S}| = \int_{\mathcal{S}} 1\,dx$, then

$$h(X) \le \log|\mathcal{S}|$$

with equality iff. $p_X(a) = 1/|\mathcal{S}|$ for almost all $a\in\mathcal{S}$.

First Moment Constraint

Under the first-moment constraints $\mathbb{E}[X_i] \le m_i$, where the $X_i$ are non-negative, independent exponential random variables maximize (differential) entropy under the first-moment and non-negativity constraints:

$$p_{E_i}(a) = \frac{1}{m_i}e^{-a/m_i}, \quad a\ge 0.$$

We use cross entropy to upper bound the entropy

$$\begin{aligned} h(X) &\le \mathbb{X}(p_X\|p_E) = -\int_{\mathrm{supp}(p_X)} p_X(a)\log p_E(a)\,da = -\int_{\mathrm{supp}(p_X)} p_X(a)\sum_i\left(-\frac{a_i}{m_i}\log e - \log m_i\right)da \\ &= \sum_i\left(\frac{\int_{\mathrm{supp}(p_X)} p_X(a)\,a_i\,da}{m_i}\log e + \log m_i\right) \le \sum_i\left(\frac{m_i}{m_i}\log e + \log m_i\right) = \sum_i\log(e\,m_i) \end{aligned}$$

Second Moment Constraint

Consider a second-moment constraint on $X$, e.g., $|\mathbf{Q}_X| \le D$ for some constant $D$, where $\mathbf{Q}_X$ is the covariance matrix of $X$. Let $G$ be Gaussian with the same covariance matrix $\mathbf{Q}_X$ and the same mean $m$ as $X$. Then

$$\begin{aligned} h(X) &\le \mathbb{X}(p_X\|p_G) = -\int_{\mathrm{supp}(p_X)} p_X(a)\log p_G(a)\,da \\ &= \int_{\mathrm{supp}(p_X)} p_X(a)\left[\frac{1}{2}\log\big((2\pi)^n|\mathbf{Q}_X|\big) + \frac{1}{2}(a-m)^T\mathbf{Q}_X^{-1}(a-m)\log e\right]da = \frac{1}{2}\log\big((2\pi e)^n|\mathbf{Q}_X|\big) \end{aligned}$$

Gaussian random variables maximize differential entropy under the second-moment constraint.

Example 2.2

The second-moment constraint for a scalar is $\mathrm{Var}[X] \le D$ and the bound implies

$$h(X) \le \frac{1}{2}\log(2\pi e D)$$

Chapter 3 Coding of Memoryless Sources

We encode a source random variable with a $K$-ary alphabet into a sequence of $D$-ary digits.

  • A $D$-ary code is said to be uniquely decodable if every finite $D$-ary sequence has at most one interpretation as a sequence of its codewords. A prefix-free code is automatically also a uniquely decodable code.

(1) Kraft Inequality

Theorem 3.1 Kraft Inequality

A $D$-ary prefix-free code with codeword lengths $l_k$, $k=1,2,\ldots,K$, exists iff.

$$\sum_{k=1}^K D^{-l_k} \le 1.$$

Equality holds iff. the code has no unused leaves in its $D$-ary tree.

  • Kraft's inequality is necessary not only for prefix-free codes: a $D$-ary uniquely decodable code with a given list of codeword lengths exists if and only if these lengths satisfy Kraft's inequality.
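A minimal sketch that checks Kraft's inequality for a given list of codeword lengths (the length lists are hypothetical examples):

```python
def kraft_sum(lengths, D=2):
    """Return sum_k D^(-l_k); a D-ary prefix-free code with these lengths exists iff the sum is <= 1."""
    return sum(D ** (-l) for l in lengths)

print(kraft_sum([1, 2, 3, 3]))        # 1.0  -> exists, full binary tree (no unused leaves)
print(kraft_sum([1, 2, 2]))           # 1.0  -> exists
print(kraft_sum([1, 1, 2]))           # 1.25 -> no binary prefix-free code with these lengths
print(kraft_sum([1, 1, 2, 2], D=3))   # ~0.89 -> a ternary prefix-free code exists
```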

Proof.

  • Path Length Lemma

$$\sum_{k=1}^K p_k l_k = \sum_{j=1}^J P_j$$

where $p_k$ and $l_k$ are the probability and depth of leaf $k$, and $P_j$ is the probability of intermediate node $j$ (the sum of the probabilities of the leaves below it).

Proof.

$$\sum_{k=1}^K p_k l_k = \sum_{k=1}^K p_k\sum_{j=1}^J \mathbb{1}(\text{node } j \text{ is on the path to leaf } k) = \sum_{j=1}^J\sum_{k=1}^K p_k\,\mathbb{1}(\text{node } j \text{ is on the path to leaf } k) = \sum_{j=1}^J P_j$$

  • Leaf-Entropy Lemma

$$H_{\text{leaf}} = \sum_{j=1}^J P_j H(Q_j)$$

where $H_{\text{leaf}}$ is the entropy of the leaf distribution and $Q_j$ is the branching distribution at intermediate node $j$.

(2) Entropy Bounds on Source Coding

Theorem 3.2 Coding theorem for a Single Random Variable.

The average codeword length $\mathbb{E}[L]$ of an optimum $D$-ary prefix-free code for a $K$-ary random variable $X$ satisfies

$$\frac{H(X)}{\log_2 D} \le \mathbb{E}[L] < \frac{H(X)}{\log_2 D} + 1$$

Equality holds on the left iff. $p_k = D^{-l_k}$ for all $k$.

Proof.

For the lower bound, we use the Path Length Lemma and the Leaf-Entropy Lemma:

$$H(X) = H_{\text{leaf}} = \sum_{j=1}^J P_j H(Q_j) \le \sum_{j=1}^J P_j\log_2 D = \mathbb{E}[L]\log_2 D \text{ bits}$$

and get

$$\frac{H(X)}{\log_2 D} \le \mathbb{E}[L].$$

For the upper bound, we use Kraft's inequality,

$$\sum_{k=1}^K D^{-l_k} \le 1 = \sum_{k=1}^K p_k.$$

Let

$$D^{-l_k} \le p_k, \quad \text{i.e.,} \quad l_k \ge -\log_D p_k,$$

and choose the shortest such codeword lengths

$$l_k = \lceil -\log_D p_k\rceil.$$

Therefore,

$$l_k < -\log_D p_k + 1,$$

and we get

$$\mathbb{E}[L] = \sum_{k=1}^K p_k l_k < \sum_{k=1}^K p_k\big(-\log_D p_k\big) + \sum_{k=1}^K p_k = \frac{-\sum_{k=1}^K p_k\log_2 p_k}{\log_2 D} + 1 = \frac{H(X)}{\log_2 D} + 1.$$

Theorem 3.3 Coding Theorem for a DMS

The average codeword length $\mathbb{E}[L]$ of an optimum $D$-ary prefix-free code for a $K$-ary DMS $P_X(\cdot)$ that is parsed into blocks of length $B$ satisfies

$$\frac{H(X)}{\log_2 D} \le \frac{\mathbb{E}[L]}{B} < \frac{H(X)}{\log_2 D} + \frac{1}{B}$$

Proof.

We already have the coding theorem for a single random variable $X$; here we treat a block of length $B$ as a new single random variable $X^B$ and get

$$\frac{H(X^B)}{\log_2 D} \le \mathbb{E}[L] < \frac{H(X^B)}{\log_2 D} + 1.$$

Since the source is a DMS, we have $H(X^B) = B\,H(X)$; dividing by $B$ finishes the proof.

(3) Huffman Codes

The $D$-ary prefix-free codes that minimize $\mathbb{E}[L]$ for a $K$-ary random variable are called Huffman codes.

The codewords are the paths from the root to the leaves of a code tree built to minimize $\mathbb{E}[L]$.

  • Optimal $D$-ary prefix-free codes assign shorter codewords to more probable letters.

  • Every used leaf in the tree of an optimal $D$-ary prefix-free code has a sibling.

  • The number of unused leaves in the tree of an optimal $D$-ary prefix-free code is

$$\big((K-D)(D-2)\big) \bmod (D-1)$$

Proof.

A $D$-ary tree with $J$ internal nodes has $D + (J-1)(D-1)$ leaves, of which $K$ are used as codewords.

The number of unused leaves is $r = D + (J-1)(D-1) - K$, i.e.,

$$(D-K) = -(J-1)(D-1) + r.$$

Since $0 \le r < D-1$ for an optimal code, we have

$$r = (D-K) \bmod (D-1),$$

which equals $\big((K-D)(D-2)\big) \bmod (D-1)$ because $D-2 \equiv -1 \pmod{D-1}$.
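A minimal binary ($D = 2$) Huffman construction in Python, returning only the codeword lengths; this is a sketch, and the general $D$-ary case would additionally require dummy letters so that no leaves are wasted. The source probabilities are an arbitrary example.

```python
import heapq
import math

def huffman_lengths(probs):
    """Codeword lengths of a binary Huffman code for the given probabilities."""
    # Heap entries: (subtree probability, tie-breaking counter, source letters in the subtree).
    heap = [(p, k, [k]) for k, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for k in s1 + s2:      # merging two subtrees adds one digit to every leaf below them
            lengths[k] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

probs = [0.4, 0.3, 0.2, 0.1]
lengths = huffman_lengths(probs)            # [1, 2, 3, 3]
E_L = sum(p * l for p, l in zip(probs, lengths))
H_X = -sum(p * math.log2(p) for p in probs)
print(lengths, E_L, H_X)                    # H(X) <= E[L] < H(X) + 1, as in Theorem 3.2
```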

(4) Tunstall Codes

The proper $D$-ary dictionary codes with fixed block length $L$ that maximize $\mathbb{E}[B]$ for a $K$-ary random variable are called Tunstall codes.

The dictionary words are the paths from the root to the leaves of a parsing tree built to maximize $\mathbb{E}[B]$.

  • A proper dictionary satisfies the following two conditions: The dictionary (or message set) satisfies the prefix-free condition, i.e., the dictionary words are leaves of a rooted tree. No leaves are unused.

  • Choose the largest $J$ that satisfies $K + (K-1)(J-1) \le D^L$.

  • Similar to the Huffman code bounds, we now have

$$\frac{L}{\mathbb{E}[B]} \ge \frac{H(X)}{\log_2 D}$$

Proof.

Using the Leaf-Entropy Lemma, we get

$$H(U^L) = \sum_{j=1}^J P_j H(Q_j) = \sum_{j=1}^J P_j H(X) = \mathbb{E}[B]\,H(X), \qquad H(U^L) \le \log_2\big(D^L\big) = L\log_2 D$$

where $U^L$ is the block of $L$ code digits.

Combine both and finish the proof.

Theorem 3.4 Converse to the Coding Theorem for a DMS.

A $D$-ary prefix-free encoding of a proper dictionary for a DMS satisfies

$$\frac{\mathbb{E}[L]}{\mathbb{E}[B]} \ge \frac{H(X)}{\log_2 D}$$

This is a more general lower bound that applies to first using a variable-length-to-block encoder with a proper dictionary followed by a prefix-free block-to-variable-length encoder.

First, we use a proper dictionary (e.g., a Tunstall code) to parse the source into words of average length $\mathbb{E}[B]$, each represented by a block $U^L$ of length $L$, so that

$$H(U^L) = \mathbb{E}[B]\,H(X).$$

Then, we apply a Huffman code to encode the single $D^L$-ary random variable $U^L$ into a sequence of $D$-ary letters,

$$\mathbb{E}[L] \ge \frac{H(U^L)}{\log_2 D}.$$

(5) Huffman Codes and Tunstall Codes

Huffman codes and Tunstall codes are both types of prefix-free codes, which are coding schemes in which no code is a prefix of any other code, allowing messages to be uniquely decoded. However, there are some key differences between these two types of codes:

Huffman codes (block to variable length) are constructed using a method called Huffman coding, which is a lossless data compression algorithm that assigns shorter codes to more probable symbols and longer codes to less probable symbols, in order to minimize the expected length of the encoded message.

Tunstall codes (variable length to block), on the other hand, are constructed using the Tunstall algorithm, a lossless data compression method that builds a prefix-free parsing dictionary by repeatedly extending the most probable dictionary word, parses the source into non-overlapping variable-length words from this dictionary, and assigns a fixed-length codeword to each word.

Chapter 4 Coding of Stationary Sources

(1) Discrete Stationary Sources

A DSS output X1,X2,...,XB, we define the entropy per source letter as

HB(X)=1BH(XB).

The entropy rate is defined as

H(X)=limB1BH(XB).

It has the property

$$H_\infty(X) = \lim_{B\to\infty}H(X_B|X^{B-1}).$$

Theorem 4.1 Entropy Rate for a DSS

$H(X_B|X^{B-1})$ is nonincreasing in $B$.
$H(X_B|X^{B-1}) \le H_B(X)$ for all $B\ge 1$.
$H_B(X)$ is nonincreasing in $B$.

Proof.

$$H(X_B|X^{B-1}) = H(X_B|X_1X_2\ldots X_{B-1}) \le H(X_B|X_2X_3\ldots X_{B-1}) = H(X_{B-1}|X_1X_2\ldots X_{B-2}) = H(X_{B-1}|X^{B-2})$$

where the second equality uses stationarity.

Similarly,

$$H(X_{B-1}|X^{B-2}) = H(X_{B-1}|X_1X_2\ldots X_{B-2}) \le H(X_{B-1}|X_2X_3\ldots X_{B-2}) = H(X_{B-2}|X_1X_2\ldots X_{B-3}) = H(X_{B-2}|X^{B-3})$$

Thus, $H(X_B|X^{B-1})$ is non-increasing in $B$:

$$H(X_B|X^{B-1}) \le H(X_{B-1}|X^{B-2}) \le H(X_{B-2}|X^{B-3}) \le \cdots$$

And

$$H(X^B) = \sum_{i=1}^B H(X_i|X^{i-1}) = H(X_B|X^{B-1}) + H(X_{B-1}|X^{B-2}) + \cdots + H(X_1) \ge B\,H(X_B|X^{B-1}),$$

which proves the second property after dividing by $B$.

Proof of the third property.

$$H_{B+1}(X) = \frac{1}{B+1}H(X^{B+1}) = \frac{H(X_{B+1}|X^B) + H(X^B)}{B+1} \le \frac{H(X_B|X^{B-1}) + H(X^B)}{B+1} \le \frac{H_B(X) + B\,H_B(X)}{B+1} = H_B(X).$$

(2) Entropy Bounds on Source Coding a DSS

Theorem 4.2 Coding Theorem for a DSS

The average codeword length $\mathbb{E}[L]$ of an optimum $D$-ary prefix-free code for a $K$-ary DSS that is parsed into blocks of length $B$ satisfies

$$\frac{H_\infty(X)}{\log_2 D} \le \frac{\mathbb{E}[L]}{B} < \frac{H_B(X)}{\log_2 D} + \frac{1}{B}$$

Proof.

We already have the coding theorem for a single random variable $X$:

$$\frac{H(X)}{\log_2 D} \le \mathbb{E}[L] < \frac{H(X)}{\log_2 D} + 1.$$

Now we replace the single random variable with a block of length $B$:

$$\frac{H(X^B)}{\log_2 D} \le \mathbb{E}[L] < \frac{H(X^B)}{\log_2 D} + 1.$$

Using the notation $H_B(X) = \frac{1}{B}H(X^B)$, i.e., the entropy per source symbol, we get

$$\frac{H_B(X)}{\log_2 D} \le \frac{\mathbb{E}[L]}{B} < \frac{H_B(X)}{\log_2 D} + \frac{1}{B}.$$

Since $H_B(X)$ is non-increasing in $B$ and converges to the entropy rate, we have $H_B(X) \ge H_\infty(X)$ and

$$\frac{H_\infty(X)}{\log_2 D} \le \frac{\mathbb{E}[L]}{B} < \frac{H_B(X)}{\log_2 D} + \frac{1}{B}.$$

(3) Elias Code for Positive Integers

| $x$ | Elias code $U_L(x)$ | Elias code length $L(x)$ |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 010.0 | 4 |
| 3 | 010.1 | 4 |
| 4 | 011.00 | 5 |
| 5 | 011.01 | 5 |
| 6 | 011.10 | 5 |
| 7 | 011.11 | 5 |
| 8 | 00100.000 | 8 |
| 9 | 00100.001 | 8 |

(The "." only marks the boundary between the length prefix and the remaining digits; it is not part of the code.)

Proof.


For the $D$-ary representation of integers, we have

$$L_1(x) = \lfloor\log_D(x)\rfloor + 1$$

| $x$ | Binary representation $U_{L_1}(x)$ | Code length $L_1(x)$ |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 10 | 2 |
| 3 | 11 | 2 |
| 4 | 100 | 3 |
| 5 | 101 | 3 |
| 6 | 110 | 3 |
| 7 | 111 | 3 |
| 8 | 1000 | 4 |
| 9 | 1001 | 4 |

This base-2 code is not prefix-free, and neither is a ternary, etc., representation. Therefore, we pre-pad each codeword $U_{L_1}(x)$ with $L_1(x)-1$ zeros to make it prefix-free. The decoder counts the number of leading zeros to determine how many subsequent $D$-ary symbols it should read after the first non-zero digit.

For the $D$-ary representation of integers, we have

$$L_2(x) = 2L_1(x) - 1 = 2\lfloor\log_D(x)\rfloor + 1$$

| $x$ | Zero-pre-padded binary representation $U_{L_2}(x)$ | Code length $L_2(x)$ |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 0.10 | 3 |
| 3 | 0.11 | 3 |
| 4 | 00.100 | 5 |
| 5 | 00.101 | 5 |
| 6 | 00.110 | 5 |
| 7 | 00.111 | 5 |
| 8 | 000.1000 | 7 |
| 9 | 000.1001 | 7 |

Elias pre-pads differently to shorten this code: the number of digits of $x$ (including the leading 1), i.e., $L_1(x)$, is itself encoded with $U_{L_2}$ and prepended.

For the $D$-ary representation of integers, we have

$$L_3(x) = L_1(x) + L_2(L_1(x)) = \lfloor\log_D(x)\rfloor + 1 + 2\big\lfloor\log_D\big(\lfloor\log_D(x)\rfloor + 1\big)\big\rfloor + 1$$

| $x$ | Elias-pre-padded binary representation $U_{L_3}(x)$ | Code length $L_3(x)$ |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 010.10 | 5 |
| 3 | 010.11 | 5 |
| 4 | 011.100 | 6 |
| 5 | 011.101 | 6 |
| 6 | 011.110 | 6 |
| 7 | 011.111 | 6 |
| 8 | 00100.1000 | 9 |
| 9 | 00100.1001 | 9 |

Finally, we get the Elias code by omitting the first "1" after the "." (it is redundant), with length

$$L(x) = \lfloor\log_D(x)\rfloor + 2\big\lfloor\log_D\big(\lfloor\log_D(x)\rfloor + 1\big)\big\rfloor + 1$$
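A sketch of the binary ($D = 2$) Elias encoder described above: the zero-pre-padded code of $L_1(x)$ is followed by the binary representation of $x$ with its leading 1 omitted.

```python
def elias_encode(x):
    """Binary Elias code for a positive integer x, following the construction above."""
    b = bin(x)[2:]                                # binary representation of x, starts with '1'
    l1_bits = bin(len(b))[2:]                     # binary representation of L1(x)
    prefix = "0" * (len(l1_bits) - 1) + l1_bits   # zero-pre-padded code U_{L2}(L1(x))
    return prefix + b[1:]                         # append x without its redundant leading '1'

for x in [1, 2, 3, 4, 8, 9]:
    print(x, elias_encode(x), len(elias_encode(x)))   # lengths match L(x) in the table above
```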

(4) Elias-Willems Universal Source Coding

We now describe a specific algorithm for source coding when the source statistics are unknown. Let $\Delta_i$ be the time since the most recent previous occurrence of the letter $X_i$. An important result that we state without proof is that the expected recurrence time of a stationary and ergodic source satisfies

$$\mathbb{E}[\Delta_i \mid X_i = a] = \frac{1}{P_{X_1}(a)}$$

for all times i and all letters a.

Example 4.1

Consider a DMS $P_X$ for which the probability of waiting $j$ symbols before seeing the letter $a$ is $(1-P_X(a))^{j-1}P_X(a)$. The expected waiting time is therefore

$$\mathbb{E}[\Delta_i \mid X_i = a] = \sum_{j=1}^\infty j\,(1-P_X(a))^{j-1}P_X(a) = \frac{1}{P_X(a)}$$

We apply the binary Elias code to $\Delta_i$ and compute

$$\begin{aligned} \mathbb{E}[L_i \mid X_i=a] &= \mathbb{E}[L(\Delta_i) \mid X_i=a] \le \mathbb{E}\big[\log_2(\Delta_i) + 2\log_2\big(\log_2(\Delta_i)+1\big) + 1 \,\big|\, X_i=a\big] \\ &\le \log_2\mathbb{E}[\Delta_i\mid X_i=a] + 2\log_2\big(\log_2\mathbb{E}[\Delta_i\mid X_i=a]+1\big) + 1 \\ &= \log_2\frac{1}{P_{X_1}(a)} + 2\log_2\left(\log_2\frac{1}{P_{X_1}(a)}+1\right) + 1 \end{aligned}$$

The second inequality follows from the concavity of logarithm and Jensen's inequality.

Taking the expectation over $a$, and using Jensen's inequality once more, we have

$$\mathbb{E}[L_i] \le H(X_1) + 2\log_2\big(H(X_1)+1\big) + 1.$$

Considering a block of length $B$ instead of a single random variable, we have

$$\mathbb{E}[L_i] \le H(X^B) + 2\log_2\big(H(X^B)+1\big) + 1.$$

Theorem 4.3 Universal Source Coding Theorem for a DSS

The average codeword length $\mathbb{E}[L]$ of a binary code for a $K$-ary DSS that is parsed into blocks of length $B$, and for which the Elias code is used to encode the time to the most recent occurrence of each block, satisfies

$$\frac{\mathbb{E}[L]}{B} < H_B(X) + \frac{2}{B}\log_2\big(B\,H_B(X)+1\big) + \frac{1}{B}$$

Moreover, $\mathbb{E}[L]/B$ can be made as close to $H_\infty(X)$ as desired by choosing $B$ sufficiently large.

Chapter 5 Channel Coding

Data Processing

$$\begin{aligned} I(W;\hat{W}) &\overset{(a)}{\le} I(W;Y^n) = H(Y^n) - H(Y^n|W) = \sum_{i=1}^n\big(H(Y_i|Y^{i-1}) - H(Y_i|Y^{i-1}X^n)\big) \\ &\overset{(b)}{\le} \sum_{i=1}^n\big(H(Y_i) - H(Y_i|X_i)\big) = \sum_{i=1}^n I(X_i;Y_i) \end{aligned}$$

where (a) follows from $\hat{W} = f(Y^n)$ and the fact that data processing does not increase mutual information, and (b) follows because conditioning does not increase entropy and the channel is memoryless, i.e., $Y_i$ depends only on $X_i$. Therefore

$$I(W;\hat{W}) \le \sum_{i=1}^n I(X_i;Y_i) \le n\max_i I(X_i;Y_i) \le n\max_{P_X} I(X;Y) = nC$$

(1) Rate, Reliability and Cost

By reducing the rate, one can increase reliability. We thus expect a rate-reliability tradeoff. If one introduces a cost $S$ associated with transmission, then there is additionally a capacity-cost tradeoff.

(2) Memoryless Channels

An encoder maps a source message $W$ into a codeword $X^n$ of length $n$. We view $W$ as a string $V^{nR}$ of $nR$ independent bits. The rate $R$ measures how many message bits are carried per channel symbol $X$.

(3) Cost Functions

Suppose that transmitting $x^n$ and receiving $y^n$ incurs a cost of $s_n(x^n,y^n)$ units. Naturally, we want to bound the average cost; otherwise we could send information with infinite energy and ignore the noise:

$$\mathbb{E}[s_n(X^n,Y^n)] \le S$$

We take $s_n(\cdot)$ to be the average of a per-letter cost function $s(\cdot)$:

$$s_n(x^n,y^n) = \frac{1}{n}\sum_{i=1}^n s(x_i,y_i)$$

The largest reliably achievable rate as a function of the cost $S$ is called the capacity-cost function, denoted by $C(S)$.

Example 5.1 Power Constraint

$s(x,y) = x^2$ and $s_n(x^n,y^n) = \frac{1}{n}\sum_{i=1}^n x_i^2$; this is the most common constraint.

(4) Block and Bit Error Probability

We define reliable channel coding by requiring the block error probability

$$P_e = \Pr[\hat{W} \ne W]$$

to be small.

But we also want to minimize the average bit error probability

$$P_b = \frac{1}{b}\sum_{i=1}^b \Pr[\hat{V}_i \ne V_i]$$

Theorem 5.1 Block/Bit Error Probability

$$P_b \overset{(a)}{\le} P_e \overset{(b)}{\le} b\,P_b$$

with equality on the left if every block error corrupts all bits of the block, and equality on the right if every block error corrupts exactly one bit.

Proof. 1

Intuitively,

$$[\text{bit } i \text{ in error}] \subseteq [\text{block error}], \qquad [\text{block error}] \subseteq [\text{at least one bit in error}]$$

Therefore,

$$P_b = \frac{1}{b}\sum_{i=1}^b\Pr[\text{bit } i \text{ in error}] \le \Pr[\text{block error}] = P_e, \qquad P_e = \Pr[\text{block error}] \le \Pr[\text{at least one bit in error}] \le \sum_{i=1}^b\Pr[\text{bit } i \text{ in error}] = b\,P_b$$

Proof. 2

Theorem 5.2 Lower Bounds on Pe and Pb

Rate-reliability Tradeoff

$$nR \le \frac{I(W;\hat{W}) + H_2(P_e)}{1-P_e} \le \frac{nC + H_2(P_e)}{1-P_e}, \qquad nR \le \frac{I(W;\hat{W})}{1-H_2(P_b)} \le \frac{nC}{1-H_2(P_b)}$$

where $C = \max_{P_X} I(X;Y)$.

We see that $nR$ is at most roughly $I(W;\hat{W})$ if the reliability is good ($P_e$ is small).

Proof of the $P_e$ bound. Using Fano's inequality and $|\mathcal{W}| = 2^{nR}$, we have

$$H(W|\hat{W}) \le H_2(P_e) + P_e\log_2(|\mathcal{W}|-1) \le H_2(P_e) + P_e\,nR$$

We suppose the source messages are equiprobable and get

$$H(W|\hat{W}) = H(W) - I(W;\hat{W}) = nR - I(W;\hat{W}) \le H_2(P_e) + nR\,P_e \;\Rightarrow\; nR \le \frac{I(W;\hat{W}) + H_2(P_e)}{1-P_e}$$

If $P_e \to 0$, then $nR \le I(W;\hat{W})$ (up to vanishing terms). That means that for reliable transmission, the number of bits we can transmit over any channel is at most the mutual information.

Proof of the $P_b$ bound.

$$H_2(P_b) = H_2\!\left(\frac{1}{b}\sum_{i=1}^b\Pr[\hat{V}_i\ne V_i]\right) \overset{(a)}{\ge} \frac{1}{b}\sum_{i=1}^b H_2\big(\Pr[\hat{V}_i\ne V_i]\big) \overset{(b)}{\ge} \frac{1}{b}\sum_{i=1}^b H(V_i|\hat{V}_i)$$

where (a) follows by the concavity of $H_2(\cdot)$, and (b) follows by Fano's inequality in the case $|\mathcal{X}|=2$:

$$H(X|\hat{X}) \le H_2(P_e) + P_e\log_2(|\mathcal{X}|-1) = H_2(P_e)$$

Using the fact that conditioning does not increase entropy,

$$H(V_i|\hat{V}_i) \ge H(V_i|\hat{V}_i\hat{V}_{i-1}\ldots\hat{V}_1) \ge H(V_i|\hat{V}^b) \ge H(V_i|V^{i-1}\hat{V}^b),$$

and the chain rule, we get the following bound:

$$\begin{aligned} H_2(P_b) &\ge \frac{1}{b}\sum_{i=1}^b H(V_i|V^{i-1}\hat{V}^b) = \frac{1}{b}H(V^b|\hat{V}^b) = \frac{1}{b}\big(H(V^b) - I(V^b;\hat{V}^b)\big) = \frac{1}{b}\big(H(W) - I(W;\hat{W})\big) \\ &= \frac{1}{b}\big(b - I(W;\hat{W})\big) = 1 - \frac{I(W;\hat{W})}{nR} \;\Rightarrow\; nR \le \frac{I(W;\hat{W})}{1-H_2(P_b)} \end{aligned}$$

If Pb0, nRI(W;W^). That means, for reliable transmission, the number of bits we transmit over any channel is at most the mutual information.

(5) Random Coding

Capacity-Cost Function

$$C(S) = \max_{P_X:\,\mathbb{E}[s(X,Y)]\le S} I(X;Y)$$

 

(6) Concavity and Converse

Theorem

C(S) is concave in S.

C(S) is non-decreasing in S.

Proof.

We have $I(W;\hat{W}) \le \sum_{i=1}^n I(X_i;Y_i)$. Since letter $i$ carries mutual information $I(X_i;Y_i) \le C\big(\mathbb{E}[s(X_i,Y_i)]\big)$ at cost $\mathbb{E}[s(X_i,Y_i)]$, we get

$$I(W;\hat{W}) \le \sum_{i=1}^n I(X_i;Y_i) \le \sum_{i=1}^n C\big(\mathbb{E}[s(X_i,Y_i)]\big) = n\cdot\frac{1}{n}\sum_{i=1}^n C\big(\mathbb{E}[s(X_i,Y_i)]\big) \overset{(a)}{\le} n\,C\!\left(\frac{1}{n}\sum_{i=1}^n\mathbb{E}[s(X_i,Y_i)]\right) \le n\,C(S)$$

where (a) follows from the concavity of $C(S)$ and Jensen's inequality $\mathbb{E}[C(S)] \le C(\mathbb{E}[S])$.

Rate-reliability Tradeoff

$$nR \le \frac{I(W;\hat{W}) + H_2(P_e)}{1-P_e} \le \frac{nC(S) + H_2(P_e)}{1-P_e}, \qquad nR \le \frac{I(W;\hat{W})}{1-H_2(P_b)} \le \frac{nC(S)}{1-H_2(P_b)}$$

(7) Discrete Alphabet Examples

<1> Binary Symmetric Channel

The best choice for $P_X$ is $P_X(0)=P_X(1)=1/2$, so that $C = 1 - H_2(p)$, where $p$ is the crossover probability.
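A sketch computing the BSC capacity $C = 1 - H_2(p)$ and cross-checking it against a brute-force maximization of $I(X;Y)$ over the input distribution:

```python
import math

def H2(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_mutual_information(q, p):
    """I(X;Y) for a BSC with crossover probability p and input distribution P_X(1) = q."""
    r = q * (1 - p) + (1 - q) * p      # P_Y(1)
    return H2(r) - H2(p)               # H(Y) - H(Y|X)

p = 0.11
print(1 - H2(p))                                                      # capacity formula
print(max(bsc_mutual_information(q / 1000, p) for q in range(1001)))  # brute force, same value
```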

<2> Binary Erasure Channel

$$P_Y(\Delta) = P_X(0)\,p + P_X(1)\,p = p, \qquad P_Y(1) = P_X(1)(1-p), \qquad P_Y(0) = P_X(0)(1-p)$$

$$H(X|Y) = P_Y(0)H(X|Y=0) + P_Y(1)H(X|Y=1) + P_Y(\Delta)H(X|Y=\Delta) = 0 + 0 + p\,H(X)$$

Therefore, $C = \max_{P_X} H(X)(1-p)$; choosing $P_X$ to be uniform (coin-flipping), we get $C = 1-p$.

<3> Strongly Symmetric Channels

$$P_{Y|X} = \begin{pmatrix} P(y_1|x_1) & P(y_2|x_1) & \cdots & P(y_{|\mathcal{Y}|}|x_1) \\ \vdots & & & \vdots \\ P(y_1|x_{|\mathcal{X}|}) & P(y_2|x_{|\mathcal{X}|}) & \cdots & P(y_{|\mathcal{Y}|}|x_{|\mathcal{X}|}) \end{pmatrix}$$

Notice that the entries in one row add up to 1, while the entries in one column don't necessarily add up to 1.

Uniformly dispersive (every row contains the same entries, possibly in a different order):

for every input letter $x$, the list of probabilities $\{P(y|x), y\in\mathcal{Y}\}$ is the same. Therefore, $H(Y|X=x)$ is the same for all $x\in\mathcal{X}$.

Uniformly focusing (every column contains the same entries, possibly in a different order):

for every output letter $y$, the list of probabilities $\{P(y|x), x\in\mathcal{X}\}$ is the same. A uniform input to a uniformly focusing channel implies a uniform output.

Strongly symmetric:

A channel is said to be strongly symmetric if it is both uniformly dispersive and uniformly focusing.

The BEC is u.d. but not u.f.; the BSC is both u.d. and u.f.

Strongly Symmetric Channel Capacity

$$C = H(Y) - H(Y|X) = \log_2|\mathcal{Y}| + \sum_{y\in\mathcal{Y}} P(y|x)\log_2 P(y|x)$$

where the sum is the same for every input $x$. Capacity is achieved by a uniform $P_X$.

<4> Symmetric Channels

A BEC can be decomposed into two strongly symmetric subchannels:

$$P_{Y|X} = \begin{pmatrix} 1-p & 0 & p \\ 0 & 1-p & p \end{pmatrix}, \qquad \text{subchannel 1: } (1-p)\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \text{ is strongly symmetric}, \qquad \text{subchannel 2: } p\begin{pmatrix} 1 \\ 1 \end{pmatrix} \text{ is strongly symmetric}$$

$$C = q_1C_1 + q_2C_2 = (1-p)\cdot 1 + p\cdot 0 = 1-p$$

Capacity of a Symmetric DMC

$$C = \sum_{i=1}^L q_iC_i$$

where $q_i$ is the total probability of subchannel $i$ and $C_i$ is its capacity.

Example 5.2 BPSK over AWGN

We use a 2-bit quantizer:

$$P_{Y|X} = \begin{pmatrix} p_0 & p_1 & p_2 & p_3 \\ p_3 & p_2 & p_1 & p_0 \end{pmatrix}$$

Subchannel 1: $(p_0+p_3)\begin{pmatrix} \frac{p_0}{p_0+p_3} & \frac{p_3}{p_0+p_3} \\ \frac{p_3}{p_0+p_3} & \frac{p_0}{p_0+p_3} \end{pmatrix}$ is a BSC (strongly symmetric).

Subchannel 2: $(p_1+p_2)\begin{pmatrix} \frac{p_1}{p_1+p_2} & \frac{p_2}{p_1+p_2} \\ \frac{p_2}{p_1+p_2} & \frac{p_1}{p_1+p_2} \end{pmatrix}$ is a BSC (strongly symmetric).

$$C = (p_0+p_3)\left(1 - H_2\!\left(\frac{p_0}{p_0+p_3}\right)\right) + (p_1+p_2)\left(1 - H_2\!\left(\frac{p_1}{p_1+p_2}\right)\right)$$
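A sketch checking the decomposition numerically: with hypothetical quantizer transition probabilities $p_0,\ldots,p_3$ (any non-negative values summing to 1), the subchannel formula agrees with $I(X;Y)$ evaluated for a uniform input on the full channel matrix.

```python
import math

def H2(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Hypothetical transition probabilities of the 2-bit quantized BPSK channel.
p0, p1, p2, p3 = 0.60, 0.25, 0.10, 0.05

# Capacity via the decomposition into two BSC subchannels.
C = (p0 + p3) * (1 - H2(p0 / (p0 + p3))) + (p1 + p2) * (1 - H2(p1 / (p1 + p2)))

# Cross-check: I(X;Y) for a uniform input on the full 2x4 transition matrix.
rows = [[p0, p1, p2, p3], [p3, p2, p1, p0]]
P_Y = [0.5 * rows[0][j] + 0.5 * rows[1][j] for j in range(4)]
H_Y = -sum(p * math.log2(p) for p in P_Y if p > 0)
H_Y_given_X = -sum(0.5 * p * math.log2(p) for row in rows for p in row if p > 0)
print(C, H_Y - H_Y_given_X)   # the two values coincide
```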

(8) Continuous Alphabet Examples

Consider the additive white Gaussian noise channel $Y_i = X_i + Z_i$ with $Z_i\sim\mathcal{N}(0,\sigma^2)$. We have the power constraint $s(x,y)=x^2$ and $\mathbb{E}\big[\frac{1}{n}\sum_{i=1}^n X_i^2\big] \le S = P$.

$$C(S) = C(P) = \max_{X:\,\mathbb{E}[X^2]\le P} I(X;Y), \qquad I(X;Y) = h(Y) - h(Y|X) = h(Y) - h(Z|X) = h(Y) - h(Z) = h(Y) - \frac{1}{2}\log(2\pi e\sigma^2)$$

We now maximize $h(Y)$:

$$Z\sim\mathcal{N}(0,\sigma^2),\; Y = X+Z \;\Rightarrow\; \mathrm{Var}[Y] = \mathrm{Var}[X] + \sigma^2 \le \mathbb{E}[X^2] + \sigma^2 \le P + \sigma^2 \;\Rightarrow\; h(Y) \le \frac{1}{2}\log\big(2\pi e(P+\sigma^2)\big)$$

with equality iff. $X\sim\mathcal{N}(0,P)$.

Finally, we get

$$C(S) = C(P) = \frac{1}{2}\log\left(1 + \frac{P}{\sigma^2}\right)$$
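A sketch evaluating $C(P) = \frac{1}{2}\log_2\big(1 + P/\sigma^2\big)$ in bits per channel use for a few SNR values:

```python
import math

def awgn_capacity(P, sigma2):
    """Capacity of the AWGN channel in bits per channel use (log base 2)."""
    return 0.5 * math.log2(1 + P / sigma2)

for snr_db in [0, 3, 10, 20]:
    snr = 10 ** (snr_db / 10)                 # P / sigma^2
    print(snr_db, "dB ->", round(awgn_capacity(snr, 1.0), 3), "bits/use")
```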

 

Example 5.3 AWGN channel with BPSK

We transmit symbols $X\in\{\pm\sqrt{P}\}$ and calculate

$$C(P) = \max_{P_X} I(X;Y) = \max_{P_X}\big(h(Y) - h(Y|X)\big), \qquad p_Y(y) = \sum_x P_X(x)\,p_{Y|X}(y|x), \qquad h(Y|X) = h(Z) = \frac{1}{2}\log(2\pi e\sigma^2)$$

 
