Entropy (information theory)


In information theory, the entropy HXH_{X} or H(X)H(X) of a discrete Random variable XX defined in the sample space X\mathcal{X} and following some Probability distribution described by the Probability mass function p(x)p(x), is

HX=E[logp(x)]=xXp(x)logp(x)H_{X}=\mathrm{E}[-\log p(x)]=-\sum_{x \in \mathcal{X}}p(x)\log p(x)

where E\mathrm{E} is the expected value and logp(x)-\log p(x) is called self-information. It is hence the expectation of the self-information. It is often called Shannon entropy due to it being originally introduced by Claude Shannon. If the Probability is encoded in a density matrix ρ^\hat{\rho}, we can alternatively write

H=Tr(ρ^log2ρ^)H=-\text{Tr}(\hat{\rho}\log_{2} \hat{\rho})

using the trace.

The base of the logarithm is arbitrary and may be any real number. Base two is the most common choice as it encodes the idea of "true or false" outcomes and is measured in bits. In physics, base ee (i.e. ln\ln) is a common choice as it spontaneously arises in nature. Entropy in base ee is measured in nats.

Interpretation

Shannon entropy can be described as a measure of the uncertainty of XX. HH can be seen as a sort of "missing information" or as a measure of "surprisingness" : the greater the entropy, the less information I have about XX a priori and the more unexpected the result is going to be. For instance, if XX only has one possible value, entropy is zero as there is no uncertainty; I already know everything that there is to know about XX.

Joint entropy

If XX and YY are two random variables in sample spaces X\mathcal{X} and Y\mathcal{Y}, joint entropy is defined by just changing the normal distribution function to a Joint distribution function:

HX,Y=xXyYpX,Y(x,y)logpX,Y(x,y)H_{X,Y}=-\sum_{x \in \mathcal{X}} \sum_{y\in \mathcal{Y}}p_{X,Y}(x,y)\log p_{X,Y}(x,y)

If XX and YY are independent variables, it simplifies to

HX,Y=xXyYpX,Y(x,y)[logpX(x)+logpY(y)]=HX+HYH_{X,Y}=-\sum_{x \in \mathcal{X}}\sum _{y \in \mathcal{Y}}p_{X,Y}(x,y)[\log p_{X}(x)+\log p_{Y}(y)]=H_{X}+H_{Y}

Properties

  • It is non-negative: HX0H_{X}\geq 0. It is zero only if XX is certain (i.e. it only has one possible outcome).
  • If XX has more than one possible outcome, then HXH_{X} is maximized when all those outcomes are equiprobable.
  • Joint entropy is generally less than the sum of the single entropies: HX,YHX+HYH_{X,Y}\leq H_{X}+H_{Y}. It is equal only if XX and YY are independent.
  • Entropy is additive across non-interacting systems (not across variables; see previous property): H=H1+H2H=H_{1}+H_{2}. The entropy of a system is equal to the sum of entropies of disjoint subsystems (keyword being disjoint: if they are interacting, this is not true: HH1+H2H\geq H_{1}+H_{2}).

Boltzmann's entropy

In the specific case that all outcomes are equally likely, p(x)p(x) is a constant. If there are WW possible outcomes, we have p=1/Wp=1/W, in which case the entropy reduces to

H=logWH=\log W

This chiefly occurs in statistical mechanics under the equal a priori probability hypothesis, where WW is interpreted as the number of microstates a physical system can be found in. Turning it into physical entropy through the Boltzmann constant, we get Boltzmann's formula for entropy:

S=kBlogW\boxed{S=k_{B}\log W}

Second law of thermodynamics

This particular case has a very useful property that's true for a system with many particles, say N1023N\sim 10^{23} and defined energy EtotalE_{\text{total}} (a microcanonical ensemble). Split the system in two non-interacting halves (that still exchange energy, somehow) with respective energies E1E_{1} and E2E_{2} (Etotal=E1+E2E_{\text{total}}=E_{1}+E_{2}) and call W=Γ(Ei)W=\Gamma(E_{i}) the function that counts all the states at energy EiE_{i}[^1]. Γ1(Ei)\Gamma_{1}(E_{i}) and Γ2(Ei)\Gamma_{2}(E_{i}) do the same, but for the two subsystems. We therefore have[^2]

Γ(Etotal)=iΓ1(Ei)Γ2(EtotalEi)=iexp(S1(Ei)kB+S2(EtotalEi)kB)\Gamma(E_\text{total})=\sum_{i}\Gamma_{1}(E_{i})\Gamma_{2}(E_\text{total}-E_{i})=\sum_{i}\exp\left( \frac{S_{1}(E_{i})}{k_{B}}+ \frac{S_{2}(E_\text{total}-E_{i})}{k_{B}} \right)

Now, since the state count goes like ΓeN\Gamma\sim e^{N} and the entropy goes like SNS\sim N, the sum above iterates over terms that go like eN\sim e^{N} (remember that NN is itself exponentially large, 1023\sim 10^{23}). Now, say that there is even a single term for which the exponent is twice as large as any other term (so the entropy is twice as large). That means it goes like e2N=eNeN\sim e^{2N}=e^{N}e^{N}. But that means that this terms is eN\sim e^{N} times larger than any other terms, i.e. so massively large every other term becomes vanishingly small in comparison. As such, this entire sum can be collapsed to just the term with maximum entropy (of energy EmaxE_\text{max}), which is found in exponentially more microstates than any other[^3]:

Γ(Etotal)exp(S1(Emax)kB+S2(EtotalEmax)kB)\Gamma(E_\text{total})\simeq \exp\left( \frac{S_{1}(E_{\text{max}})}{k_{B}} + \frac{S_{2}(E_\text{total}-E_\text{max})}{k_{B}} \right)

and so

S(Etotal)S1(Emax)+S2(EtotalEmax)S1(E1)+S2(E2)S(E_\text{total})\simeq S_{1}(E_\text{max})+S_{2}(E_\text{total}-E_\text{max})\geq S_{1}(E_{1})+S_{2}(E_{2})

In our case, EmaxE_\text{max} is found when

S1(Emax)ES2(EtotalEmax)E=0(1)\frac{ \partial S_{1}(E_\text{max}) }{ \partial E }- \frac{ \partial S_{2}(E_\text{total}-E_\text{max}) }{ \partial E } = 0\tag{1}

In essence, when the two systems make contact, they will near certainly end up with fixed energies EmaxE_\text{max} and EtotalEmaxE_\text{total}-E_\text{max}. It’s worth stressing that there is no a priori reason why this is the case. But the large number of particles involved means that system 1 is overwhelmingly likely to be found with energy EmaxE_\text{max} which maximizes the number of states Γ\Gamma of the combined system. Conversely, once in this bigger set of states, it is highly unlikely that the system will ever be found back in a state with energy E1E_{1} or, indeed, any other energy other than EmaxE_{\text{max}}. This is precisely the second law of thermodynamics and as we can see, it is actually a probabilistic law, not a deterministic one. When it is said that an isolated system's entropy never increases, it would in fact be more correct to state that an isolated system's entropy almost surely never increases, and it is wrong to state that the chance does not exist. However, this chance is so trivially small[^4] that it might as well not exist, and thus for any purpose, real, theoretical, academic or educational, it is neglected.

Temperature

This definition of entropy is also satisfactory for the definition of absolute temperature. To see this, let's analyze if TT obeys the laws that we'd expect of it. If we take two objects in thermal equilibrium and at the same temperature, and we make the touch, nothing should happen. This is true: we've shown that two systems in contact maximize their entropy, which means that they're energies need to end up as E1=EmaxE_{1}=E_\text{max} and E2=EtotalEmaxE_{2}=E_\text{total}-E_\text{max}. For nothing to happen, the systems need to have these energies from the get-go. But if they do, then (1)(1) is

S1(E1)E=S2(E2)ET1=T2=T\frac{ \partial S_{1}(E_{1}) }{ \partial E } =\frac{ \partial S_{2}(E_{2}) }{ \partial E } \quad\Rightarrow \quad T_{1}=T_{2}=T

which is exactly what we stated.

Say they are now at different temperatures. We know that they will exchange energy when they come into contact so, by conservation of energy, δE1=δE2\delta E_{1}=-\delta E_{2}. The total differential of SS is

δS=S1(E1)E1δE1+S2(E2)E2δE2=(S1(E1)ES2(E2)E)δE1=(1T11T2)δE1\delta S=\frac{ \partial S_{1}(E_{1}) }{ \partial E_{1} }\delta E_{1}+ \frac{ \partial S_{2}(E_{2}) }{ \partial E_{2} } \delta E_{2}=\left( \frac{ \partial S_{1}(E_{1}) }{ \partial E } - \frac{ \partial S_{2}(E_{2}) }{ \partial E } \right)\delta E_{1}=\left( \frac{1}{T_{1}}- \frac{1}{T_{2}} \right)\delta E_{1}

δS>0\delta S>0 because of the second law, so if T1>T2T_{1}>T_{2} we need δE1<0\delta E_{1}<0, so the hotter systems needs to lose energy. This behavior is also consistent, so this definition of temperature is valid. However, it doesn't prove why it's exactly 1/T1/T and not, say, 1/T21/T^{2}, as the behavior would be the same. To prove this, we can plug this definition into a known system (probably an ideal gas) and show it is exactly 1/T1/T that provides the correct result.

Examples

> For $N$ fair coins, all tossed simultaneously, the probability of each combination of results is $p(x)=1/2^{N}=2^{-N}$ since they are all equiprobable and there are $2^{N}$ possible combinations. Thus, since all terms in the sum are identical, we have > $$H_{X}=- 2^{N}(2^{-N}\log_{2}2^{-N})=N\log_{2}2=N \text{ bits} > A fair $N$-sided die has a $1/N$ chance of landing on each face, so again > $$H_{X}=-N\left( \frac{1}{N}\log_{2} \frac{1}{N} \right)=\log_{2} N \text{ bits} > This function is important enough that it was given a special name: the [[Bernoulli entropy]]. It is applicable to any binary process of probability $p$. > [!example] Horse racing > Say there is a race with eight horses. The predictions say that each horse has the following chance to win the race > $$\left( \frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{16}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64} \right)

What is the entropy of this sequence? Well, this is a sequence of probabilities, so we can apply the definition of entropy directly:

> Same as two fair coin tosses. (For comparison, if the horses were all equally likely to win, it would have been $H=\log_{2}8=3\text{ bits}$). Now, say the winner of the race needs to be broadcasted. The simplest message to send would be the number of the horse that won in binary form so that we can take care of all possibilities. Since there are eight horses, and eight in binary is $111$, this makes us send over three bits of information per message. But note that the *actual* sequence above only has two bits of entropy. So, in theory, there's a method that, *on average*, only needs to send over two bits of information to fully convey the winning horse. We do this by exploiting the fact that we already know who is most likely to win. > > To reduce average information being sent, we can make the most likely message shorter, at the cost of making the least likely ones longer. Thus, instead of sending over a message that is always three bits long as before, we can use the following convention to identify the eight horses: > $$(0,10,110,1110,111100,111101,111110,111111)

Here the message length is variable: the shortest message is a measly 1 bit, whereas the longest ones are 6 bits. But if we weigh these lengths by the likelihood of each being sent (i.e. that horse winning the race), we find that the average length is actually lower:

> So while messages following this convention can on occasion be longer than before, they usually aren't, so on average, over many messages being sent out, this reduces the amount of bits being transferred by a lot. Note that the effectiveness of this convention is entirely dependent on the fact that we know those probabilities beforehand: had all horses been equally likely to win, this convention would have been on average worse. [^1]: This is assuming discrete energy levels like the ones that occur in quantum mechanics. If $E$ is continuous, just change $E_{i}$ to $E$ and the sums to integrals. The result is the same. [^2]: It can be argued that we have no guarantee that $E_\text{total}-E_{i}$ is a valid energy level for the other half if we are working with quantum levels. That said, energy levels for systems with massive $N$ are so closely packed they may as well be considered a continuum, so any energy level is pretty much guaranteed to be valid. To be more rigorous though, we would need to introduce an interaction term that guarantees the exchange of energy in a valid way. [^3]: Since integrals don't have terms, the logic comes down to saying that the integral can be evaluated using [[Laplace's method]]. [^4]: I don't believe "trivially small" quite gets the point across. The chance of violating the second law of thermodynamics is so small that the average time it would take to find one violating event is so high, the *entire lifetime of the universe becomes negligible*.