5.3. Information Theory#
In layman's terms, data analysis is the process of extracting useful information from data. But what is information? How do we measure it?
In this notebook, we will explore the concept of information, uncertainty in random variables, and how it can be quantified. We will also look at a way to quantify the magnitude of an update from one set of beliefs to another using the concept of KL Divergence.
5.3.1. Entropy#
Entropy is the average amount of “information”, “surprise”, or “uncertainty” inherent in the variable’s possible outcomes.
For a random variable $X$ with probability distribution $p$, the entropy is defined as:

$$H(X) = -\sum_{x} p(x) \log p(x)$$

where $p(x)$ is the probability of outcome $x$.
Here we share a simple intuition for entropy. Suppose we have two independent events $x$ and $y$; then the information gained on observing both should be the sum of the information gained on observing $x$ and $y$ individually:

$$h(x, y) = h(x) + h(y)$$

Since for independent events $p(x, y) = p(x)\,p(y)$, a function that turns this product into a sum is the logarithm, giving $h(x) = -\log p(x)$ (the minus sign makes the information non-negative, since $p(x) \le 1$).

Now, for a random variable $X$, the entropy is the expected information over all outcomes:

$$H(X) = \mathbb{E}[h(x)] = -\sum_{x} p(x) \log p(x)$$
The base of the logarithm is arbitrary. If we use base 2, then the unit of information is the bit. If we use base e, then the unit of information is the nat. If we use base 10, then the unit of information is the hartley (also called the dit, or decimal digit).
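As a quick numeric sketch (with a made-up three-outcome distribution), the entropy in bits and in nats differs only by the constant factor $\ln 2$:

```python
import math

# Entropy of a made-up three-outcome distribution, in different log bases
p = [0.5, 0.25, 0.25]

H_bits = -sum(pi * math.log2(pi) for pi in p)   # base 2 -> bits
H_nats = -sum(pi * math.log(pi) for pi in p)    # base e -> nats

print(H_bits)                # 1.5 bits
print(H_nats)                # ~1.0397 nats
print(H_bits * math.log(2))  # converting bits to nats gives the same number
```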
Entropy as Length of Transmitted Code
If we have 8 states (a, b, c, d, e, f, g, h) with probabilities $\left(\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{16}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}\right)$, then the entropy is

$$H = \frac{1}{2}\log_2 2 + \frac{1}{4}\log_2 4 + \frac{1}{8}\log_2 8 + \frac{1}{16}\log_2 16 + \frac{4}{64}\log_2 64 = 2 \text{ bits}$$

This means that on average, we need 2 bits to encode the state of the random variable. We can represent the codes as (a: 0, b: 10, c: 110, d: 1110, e: 111100, f: 111101, g: 111110, h: 111111). The average length of this code is $\frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2 + \frac{1}{8}\cdot 3 + \frac{1}{16}\cdot 4 + \frac{4}{64}\cdot 6 = 2$ bits, matching the entropy.
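A small sketch verifying the numbers above: the entropy of these probabilities and the expected length of the given code both come out to 2 bits.

```python
import math

# Probabilities of the 8 states a..h and the lengths of the codes given above
probs   = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
lengths = [1, 2, 3, 4, 6, 6, 6, 6]   # codes 0, 10, 110, 1110, 111100, 111101, 111110, 111111

entropy = -sum(p * math.log2(p) for p in probs)
avg_len = sum(p * l for p, l in zip(probs, lengths))
print(entropy, avg_len)   # both equal 2.0 bits
```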
Differential Entropy
The continuous version of entropy is called differential entropy. It is defined as:

$$h(X) = -\int p(x) \log p(x)\, dx$$
Note that differential entropy can be negative, since a pdf can be greater than 1.
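A minimal sketch of this using PyTorch's `Normal` distribution: a Riemann sum of $-p(x)\log p(x)$ recovers the closed-form differential entropy, and a narrow Gaussian (whose pdf exceeds 1) has negative differential entropy.

```python
import torch

# Riemann-sum estimate of the differential entropy of a standard normal,
# compared with the closed form returned by .entropy()
dist = torch.distributions.Normal(0.0, 1.0)
xs = torch.linspace(-10.0, 10.0, 20001)
dx = xs[1] - xs[0]
log_p = dist.log_prob(xs)
h_numeric = -(log_p.exp() * log_p).sum() * dx

print(h_numeric.item(), dist.entropy().item())   # both ~1.4189 nats

# A narrow Gaussian has a pdf that exceeds 1, so its differential entropy is negative
print(torch.distributions.Normal(0.0, 0.1).entropy().item())   # ~ -0.88 nats
```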
5.3.2. Joint Entropy#
Joint entropy is the extension of entropy to multiple random variables. For two random variables $X$ and $Y$, the joint entropy is defined as:

$$H(X, Y) = -\sum_{x}\sum_{y} p(x, y) \log p(x, y)$$

where $p(x, y)$ is the joint probability distribution of $X$ and $Y$.
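For instance, with a made-up $2 \times 2$ joint table, the joint entropy is just the entropy of the flattened joint distribution:

```python
import torch

# Joint entropy of a made-up 2x2 joint distribution p(x, y)
p_xy = torch.tensor([[0.25, 0.25],
                     [0.40, 0.10]])

H_xy = -(p_xy * torch.log2(p_xy)).sum()
print(H_xy.item())   # ~1.86 bits
```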
5.3.3. Cross Entropy#
Cross-entropy is a measure of the difference between two probability distributions defined over the same random variable or set of events.
Till now, we have been looking at the entropy of a single random variable. But what if we have two probability distributions $p$ and $q$ over the same random variable? The cross-entropy is defined as:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
The intuition behind why it gives a measure of the difference between the two probability distributions will become clearer after we look at KL Divergence.
It shows up in the cross-entropy loss function in logistic regression and classification, where we are trying to minimize the difference between the actual output (a one-hot vector, which acts like a probability distribution) and the predicted distribution of probabilities over the classes.
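A small example (with made-up predicted probabilities): computing $-\sum_x p(x)\log q(x)$ for a one-hot target matches what `torch.nn.functional.cross_entropy` reports when given the corresponding logits and class index.

```python
import torch
import torch.nn.functional as F

# Cross entropy between a one-hot target p and a made-up predicted distribution q
p = torch.tensor([0.0, 1.0, 0.0])    # true class is class 1
q = torch.tensor([0.1, 0.7, 0.2])    # predicted class probabilities

print(-(p * torch.log(q)).sum().item())   # -log(0.7) ~ 0.357

# The same number from torch's loss, which takes logits and a class index;
# log-probabilities are valid logits here since softmax(log q) = q
print(F.cross_entropy(torch.log(q).unsqueeze(0), torch.tensor([1])).item())
```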
5.3.4. Mutual Information#
Mutual information is a measure of the amount of information that one random variable contains about another random variable.
For two random variables $X$ and $Y$, the mutual information is defined as:

$$I(X; Y) = \sum_{x}\sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$$

It intuitively measures how much information is shared between the two random variables. If $X$ and $Y$ are independent, then $p(x, y) = p(x)\,p(y)$ and the mutual information is zero.

Here we can recognise that

$$I(X; Y) = D_{KL}\big(p(x, y) \,\|\, p(x)\,p(y)\big)$$
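A quick numeric sketch with a made-up $2 \times 2$ joint distribution, computing the MI directly from the definition (which is exactly the KL divergence between the joint and the product of marginals):

```python
import torch

# Mutual information of a made-up 2x2 joint distribution, computed as
# KL( p(x, y) || p(x) p(y) )
p_xy = torch.tensor([[0.30, 0.10],
                     [0.20, 0.40]])
p_x = p_xy.sum(dim=1, keepdim=True)   # marginal of x, shape (2, 1)
p_y = p_xy.sum(dim=0, keepdim=True)   # marginal of y, shape (1, 2)

mi = (p_xy * torch.log2(p_xy / (p_x * p_y))).sum()
print(mi.item())   # ~0.125 bits shared between X and Y
```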
5.3.5. KL Divergence#
KL Divergence is a measure of how one probability distribution is different from a second, reference probability distribution.
Information in itself is a vague and abstract concept, but we can quantify the magnitude of an update from one set of beliefs to another very well using the concept of KL Divergence. Here we will show the KL divergence formula, understand some of its properties, and relate them to entropy and cross entropy.
For two probability distributions $p$ and $q$ over the same set of outcomes, the KL divergence is defined as:

$$D_{KL}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

In the continuous case, it is defined as:

$$D_{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx$$

This quantifies the information update on changing from a prior belief $q$ to a posterior belief $p$.
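A minimal discrete example with made-up distributions, measuring (in bits) the information gained on updating from a uniform prior $q$ to a peaked posterior $p$:

```python
import torch

# Information gained (in bits) on updating from a uniform prior q to a peaked posterior p
q = torch.tensor([0.25, 0.25, 0.25, 0.25])
p = torch.tensor([0.70, 0.10, 0.10, 0.10])

kl_pq = (p * torch.log2(p / q)).sum()
print(kl_pq.item())   # ~0.643 bits
```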
5.3.5.1. KLD properties#
1. Continuity of KL Divergence
KL Divergence is continuous in the limits $p(x) \to 0$ and $q(x) \to 0$.

The information gain should be continuous. The KL divergence, by its formula, is continuous wherever $p(x) > 0$ and $q(x) > 0$, but we have to check the cases where $p(x) \to 0$ or $q(x) \to 0$.

In the limit $p(x) \to 0$, the term $p(x)\log\frac{p(x)}{q(x)} \to 0$, since $\lim_{t \to 0^{+}} t \log t = 0$. So we adopt the convention $0 \log \frac{0}{q} = 0$.

For $q(x) \to 0$ with $p(x) > 0$, the term $p(x)\log\frac{p(x)}{q(x)} \to \infty$, so we take $p \log \frac{p}{0} = \infty$: observing an outcome that the prior considered impossible corresponds to an infinite information gain.
2. Non Negativity of KL Divergence
KL Divergence is non-negative, and is equal to zero if and only if $p = q$:

$$D_{KL}(p \,\|\, q) \ge 0$$
The information gain should be non-negative regardless of the probability distributions, as we never lose information on updating our beliefs.
We will make use of Jensen's Inequality, which states that for a convex function $f$ and a random variable $X$:

$$\mathbb{E}[f(X)] \ge f(\mathbb{E}[X])$$

Fig. 5.1 Jensen Inequality#
Jensen's Inequality is the formal statement of the fact that, for a convex function, all secant lines lie above the graph of the function. This is also evident from the figure above.
Proof:

$$D_{KL}(p \,\|\, q) = -\sum_{x} p(x) \log \frac{q(x)}{p(x)} \;\ge\; -\log \sum_{x} p(x)\,\frac{q(x)}{p(x)} = -\log \sum_{x} q(x) = -\log 1 = 0$$

where we applied Jensen's Inequality to the convex function $-\log$. The equality holds when $p(x) = q(x)$ for all $x$.
The non-negativity of the KL divergence is a very important property, as it allows us to get bounds on expressions.
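A quick sanity check of non-negativity on randomly generated distributions (softmax of Gaussian noise, an arbitrary choice):

```python
import torch

# Sanity check: KL divergence between randomly generated distributions is never negative
torch.manual_seed(0)
for _ in range(5):
    p = torch.softmax(torch.randn(10), dim=0)
    q = torch.softmax(torch.randn(10), dim=0)
    print((p * torch.log(p / q)).sum().item() >= 0)   # prints True every time
```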
3. Chain Rule for KL Divergence
KL Divergence is additive, i.e.

$$D_{KL}\big(p(x, y) \,\|\, q(x, y)\big) = D_{KL}\big(p(x) \,\|\, q(x)\big) + D_{KL}\big(p(y \mid x) \,\|\, q(y \mid x)\big)$$

For probabilities we have the chain rule (product rule) as:

$$p(x, y) = p(x)\, p(y \mid x)$$
Is there a similar rule for KL divergence? Can we split the information gain from two variables in a chain rule as we did for probabilities?
Using the product rule inside the logarithm,

$$D_{KL}\big(p(x, y) \,\|\, q(x, y)\big) = \sum_{x}\sum_{y} p(x, y) \log \frac{p(x)\, p(y \mid x)}{q(x)\, q(y \mid x)} = \sum_{x}\sum_{y} p(x, y) \log \frac{p(x)}{q(x)} + \sum_{x}\sum_{y} p(x, y) \log \frac{p(y \mid x)}{q(y \mid x)}$$

Here we have put

$$D_{KL}\big(p(y \mid x) \,\|\, q(y \mid x)\big) = \sum_{x} p(x) \sum_{y} p(y \mid x) \log \frac{p(y \mid x)}{q(y \mid x)}$$

which is the conditional KL divergence, averaged over $p(x)$.
This means that just like the probabilities, we can also use a chain rule for the KL divergence. This is a very important property, as it allows us to split the information gain from multiple variables into individual information gains.
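A numeric check of the chain rule on made-up $2 \times 2$ joint distributions $p(x, y)$ and $q(x, y)$: the joint KL equals the marginal KL plus the (averaged) conditional KL.

```python
import torch

# Chain rule check on made-up 2x2 joints p(x, y) and q(x, y)
p_xy = torch.tensor([[0.30, 0.10],
                     [0.20, 0.40]])
q_xy = torch.tensor([[0.25, 0.25],
                     [0.25, 0.25]])

kl_joint = (p_xy * torch.log(p_xy / q_xy)).sum()

# KL between the marginals of x
p_x, q_x = p_xy.sum(dim=1), q_xy.sum(dim=1)
kl_marginal = (p_x * torch.log(p_x / q_x)).sum()

# Conditional KL, averaged over p(x)
p_y_given_x = p_xy / p_x.unsqueeze(1)
q_y_given_x = q_xy / q_x.unsqueeze(1)
kl_cond = (p_x.unsqueeze(1) * p_y_given_x * torch.log(p_y_given_x / q_y_given_x)).sum()

print(kl_joint.item(), (kl_marginal + kl_cond).item())   # the two values match
```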
4. KL Divergence is invariant to reparametrisation
KL Divergence is invariant to reparametrisation of the variables.
What happens when we reparametrise our functions from $x$ to a new variable $y = f(x)$?

If we change the variable from $x$ to $y = f(x)$, the densities transform so that $p(x)\,dx = \tilde{p}(y)\,dy$ and $q(x)\,dx = \tilde{q}(y)\,dy$. Then

$$D_{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx = \int \tilde{p}(y) \log \frac{\tilde{p}(y)}{\tilde{q}(y)}\, dy = D_{KL}(\tilde{p} \,\|\, \tilde{q})$$

since the Jacobian factors in the numerator and denominator of the logarithm cancel.
Because of this reparameterization invariance we can rest assured that when we measure the KL divergence between two distributions we are measuring something about the distributions and not the way we choose to represent the space in which they are defined. We are therefore free to transform our data into a convenient basis of our choosing, such as a Fourier basis for images, without affecting the result.
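As a sketch of this invariance, consider an affine reparametrisation $y = ax + b$ of two Gaussians; the transformed densities are again Gaussian, and the closed-form KL divergence (via `torch.distributions.kl_divergence`) is unchanged.

```python
import torch
from torch.distributions import Normal, kl_divergence

# y = a*x + b maps Normal(mu, sigma) to Normal(a*mu + b, |a|*sigma);
# the KL divergence between the two transformed densities is unchanged
p, q = Normal(0.0, 1.0), Normal(1.0, 2.0)

a, b = 3.0, -5.0
p_y = Normal(a * 0.0 + b, abs(a) * 1.0)
q_y = Normal(a * 1.0 + b, abs(a) * 2.0)

print(kl_divergence(p, q).item(), kl_divergence(p_y, q_y).item())   # equal (~0.443 nats)
```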
KL Divergence is not Symmetric
KL Divergence is not symmetric, i.e. in general

$$D_{KL}(p \,\|\, q) \ne D_{KL}(q \,\|\, p)$$

By looking at its formula itself, we can say KL Divergence is not symmetric. But shouldn't it be? After all, shouldn't the information gained on changing our belief from $q$ to $p$ be the same as that gained on changing from $p$ to $q$? The examples below show that it is not.

Fig. 5.2 Asymmetry of KLD#
In this example, we took two different Beta distributions, Beta1 = Beta($\alpha_1$, $\beta_1$) and Beta2 = Beta($\alpha_2$, $\beta_2$), and computed the KL divergence in both directions: the information gained on updating from Beta1 to Beta2 is not the same as that gained on updating from Beta2 to Beta1.
As another example, suppose we have a biased coin. Initially we were told that it shows heads with probability 0.443, and then told that no, actually it has a probability 0.975 of showing heads. The information gain is:

$$0.975 \log_2 \frac{0.975}{0.443} + 0.025 \log_2 \frac{0.025}{0.557} \approx 0.998 \text{ bits}$$

If we flip the game, starting from the belief that heads has probability 0.975 and updating to 0.443, then the information gained is:

$$0.443 \log_2 \frac{0.443}{0.975} + 0.557 \log_2 \frac{0.557}{0.025} \approx 1.99 \text{ bits}$$
Thus we see that starting with a distribution that is nearly even and moving to one that is nearly certain takes about 1 bit of information, or one well designed yes/no question. To instead move us from near certainty in an outcome to something that is akin to the flip of a coin requires more persuasion.
import math

# D_KL(p || q) for a Bernoulli coin: information gained updating from prior q to posterior p
q = 0.443; p = 0.975
print("Info gain1: ", p*math.log2(p/q) + (1-p)*math.log2((1-p)/(1-q)))

# Swapping the roles of prior and posterior gives a different answer
p = 0.443; q = 0.975
print("Info gain2: ", p*math.log2(p/q) + (1-p)*math.log2((1-p)/(1-q)))
Info gain1: 0.9977011988907731
Info gain2: 1.9898899560575691
Calculating the KL divergences between the Beta distributions:
import torch
import matplotlib.pyplot as plt
from ipywidgets import interact
def calc_info_gain(xs, y1, y2):
    # Riemann-sum approximation of KL(posterior || prior) = integral of y2 * log(y2 / y1) dx
    dx = xs[1] - xs[0]
    return dx * torch.sum(y2 * torch.log(y2 / y1))

def plot_beta(alpha1, beta1, alpha2, beta2):
    Beta1 = torch.distributions.Beta(concentration1=alpha1, concentration0=beta1)
    Beta2 = torch.distributions.Beta(concentration1=alpha2, concentration0=beta2)
    xs = torch.linspace(0.01, 0.99, 100)
    ys1 = Beta1.log_prob(xs).exp()
    plt.plot(xs, ys1, color='C0', label="prior Beta")
    ys2 = Beta2.log_prob(xs).exp()
    plt.plot(xs, ys2, color='C1', label="post Beta")
    plt.legend()
    # Filled area
    plt.fill_between(xs, ys1, color='C0', alpha=0.2)
    plt.fill_between(xs, ys2, color='C1', alpha=0.2)
    # write the info gain in the title, rounded to 5 decimal places
    plt.title('Information Gain: ' + str(round(calc_info_gain(xs, ys1, ys2).item(), 5)))
interact(plot_beta,alpha1=(1.0, 19, 1.0), beta1=(1.0, 19, 1.0), alpha2=(1.0, 19, 1.0), beta2=(1.0, 19, 1.0))
<function __main__.plot_beta(alpha1, beta1, alpha2, beta2)>

Fig. 5.3 Slider Example#
5.3.5.2. Entropy and KLD#
Entropy of a distribution $p$ is a constant minus the KL Divergence of that distribution from the uniform distribution.
The entropy of a distribution tries to capture the information or uncertainty of the pdf, and the KL Divergence gives us the information gained on updating our belief about a pdf. So it is natural to ask if there is a relation between the two.
For Discrete Case:

For a distribution $p$ over $K$ states and the uniform distribution $u(x) = \frac{1}{K}$, we can use

$$D_{KL}(p \,\|\, u) = \sum_{x} p(x) \log \frac{p(x)}{1/K} = \sum_{x} p(x) \log p(x) + \log K = \log K - H(p)$$

Hence,

$$H(p) = \log K - D_{KL}(p \,\|\, u)$$
This also feels intuitive: the uniform distribution $U$ is the most uncertain distribution, so it should have the maximum entropy, and it does, since it has the smallest KLD with $U$ (namely 0).
It also makes sense in another way: for any other distribution $p$, a larger KLD with $U$ means more information is gained on updating our belief from $U$ to $p$, which means $p$ is more peaked (more certain), and hence has lower entropy, as shown by the formula.
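A quick check of $H(p) = \log_2 K - D_{KL}(p \,\|\, u)$ on a made-up distribution over $K = 4$ states:

```python
import math
import torch

# Check H(p) = log2(K) - KL(p || u) for a made-up distribution over K = 4 states
p = torch.tensor([0.70, 0.10, 0.10, 0.10])
K = p.numel()
u = torch.full((K,), 1.0 / K)

H_p = -(p * torch.log2(p)).sum()
kl_pu = (p * torch.log2(p / u)).sum()

print(H_p.item(), math.log2(K) - kl_pu.item())   # the two values match (~1.357 bits)
```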
For Continuous Case:
The relation is similar in the continuous case, where entropy is the differential (continuous) entropy, but with one change: discrete entropy is always non-negative, while differential entropy can be negative, since a pdf can be greater than 1. (The uniform distribution on $[0, 1]$ has differential entropy 0, as $\log 1 = 0$.)
You can see from the slider example that the entropy of a distribution is a constant minus the KL Divergence of that distribution from the uniform distribution. (Set the prior as Beta($1$, $1$), which is the uniform distribution on $[0, 1]$; the displayed information gain is then the negative of the posterior's differential entropy, since the constant $\log 1$ is 0.)
5.3.5.3. Cross Entropy and KLD#
The cross entropy was our previous way to get some measure of closeness of two probability distributions, and it is related to the KL Divergence:

$$H(p, q) = -\sum_{x} p(x) \log q(x) = -\sum_{x} p(x) \log p(x) + \sum_{x} p(x) \log \frac{p(x)}{q(x)} = H(p) + D_{KL}(p \,\|\, q)$$

where $H(p)$ is the entropy of $p$ and $H(p, q)$ is the cross entropy between $p$ and $q$.
One way to think about it: in KL Divergence, we are measuring the information gained on updating our belief from $q$ to $p$. The cross entropy $H(p, q)$ is the average number of bits needed to encode data coming from $p$ when we use a code optimized for $q$: it is the entropy of $p$ (the bits needed with the optimal code) plus the extra bits, $D_{KL}(p \,\|\, q)$, that we pay for using the wrong distribution.
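A small numeric check of $H(p, q) = H(p) + D_{KL}(p \,\|\, q)$ on made-up distributions:

```python
import torch

# Check H(p, q) = H(p) + KL(p || q) on made-up distributions
p = torch.tensor([0.70, 0.10, 0.10, 0.10])
q = torch.tensor([0.25, 0.25, 0.25, 0.25])

cross_entropy = -(p * torch.log2(q)).sum()
entropy_p = -(p * torch.log2(p)).sum()
kl_pq = (p * torch.log2(p / q)).sum()

print(cross_entropy.item(), (entropy_p + kl_pq).item())   # both 2.0 bits here
```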
5.3.5.4. MI and KLD#
MI measures the information gain if we update from a model that treats the two variables as independent p(x)p(y) to one that models their true joint density p(x, y).
KL Divergence and MI are obviously linked: KLD tells us how much information we gain on updating from one distribution to another, while MI tells us how much information about $Y$ is contained in $X$.
Now, by the formula, MI is measuring the information gained on updating our belief from the independence model $p(x)\,p(y)$ to the true joint distribution $p(x, y)$:

$$I(X; Y) = D_{KL}\big(p(x, y) \,\|\, p(x)\,p(y)\big) = \sum_{x}\sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$$

So the more the joint distribution differs from the product of the marginals, i.e. the more dependent $X$ and $Y$ are, the larger the mutual information.
Another interpretation: expanding the definition gives

$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$

So MI quantifies (how uncertain we are about $X$) minus (how uncertain we are about $X$ given that we know $Y$); it is the difference between the entropy of one random variable and its conditional entropy given the other. This means that MI is the information that the two random variables share with each other.
Thus we can interpret the MI between $X$ and $Y$ as the reduction in uncertainty about $X$ after observing $Y$, or, by symmetry, the reduction in uncertainty about $Y$ after observing $X$. Incidentally, this result gives an alternative proof that conditioning, on average, reduces entropy. In particular, we have

$$H(X \mid Y) = H(X) - I(X; Y) \le H(X)$$

since $I(X; Y) \ge 0$.
We can also show that

$$I(X; Y) = H(X) + H(Y) - H(X, Y)$$
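A quick numeric check of this identity (and of the MI definition) on the same kind of made-up $2 \times 2$ joint table used earlier:

```python
import torch

# Check I(X; Y) = H(X) + H(Y) - H(X, Y) on a made-up 2x2 joint distribution
p_xy = torch.tensor([[0.30, 0.10],
                     [0.20, 0.40]])
p_x, p_y = p_xy.sum(dim=1), p_xy.sum(dim=0)

H_x  = -(p_x * torch.log2(p_x)).sum()
H_y  = -(p_y * torch.log2(p_y)).sum()
H_xy = -(p_xy * torch.log2(p_xy)).sum()
mi   = (p_xy * torch.log2(p_xy / (p_x.unsqueeze(1) * p_y))).sum()

print(mi.item(), (H_x + H_y - H_xy).item())   # both ~0.125 bits
```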
These relations are captured perfectly by this figure from Kevin Murphy

Fig. 5.4 Entropies#
Information Of Data
How do we characterise the information content of 111000 vs 100101 vs 101010?