Entropy, Cross-Entropy, KL-Divergence

Santhosh
4 min read · Apr 23, 2020

Entropy:

Entropy measures uncertainty: on average, how much useful information you gain when the outcome of an uncertain event is revealed.

Entropy comes from Claude Shannon's information theory, where the goal is to send information from a sender to a recipient as efficiently as possible. We send information using bits; a bit is either 0 or 1.

When we use one bit to send a piece of information, we reduce the recipient's uncertainty by a factor of 2.

Suppose there are two types of weather, Rainy and Sunny, and the forecast predicts that the next day will be rainy. The uncertainty has been reduced by a factor of 2 (initially there were 2 options; after the prediction only 1 remains), so we received one bit of useful information, irrespective of how many bits were used to encode it. (If the forecast sends the string "Rainy" to communicate the result, it uses 5 characters * 8 bits (1 byte) = 40 bits, but conveys only 1 bit of information.)

Now suppose there are 8 different types of weather, and the forecast predicts one of the 8 as the next day's weather. We are reducing the uncertainty by a factor of 8, so we get 3 bits of useful information.

You can compute the number of bits of useful information by taking the base-2 logarithm of the number of events:

1st case: number of events = 2 → log2(2) = 1 bit

2nd case: number of events = 8 → log2(8) = 3 bits
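As a quick check, both cases can be computed with Python's standard math module (a minimal sketch; the comments show the expected values):

```python
import math

# Bits of useful information when all outcomes are equally likely
print(math.log2(2))  # 1.0 bit  (rainy vs sunny)
print(math.log2(8))  # 3.0 bits (8 equally likely weather types)
```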

In the above two cases, all the events are equally likely, i.e. they have equal probability.

Let's take a case where the probability of being sunny is 75% and the probability of being rainy is 25%.

If the forecast says tomorrow will be rainy, we reduce the uncertainty by a factor of 4. (Think of it as 4 equally likely options, 3 sunny and 1 rainy; after the prediction we are left with only 1 option, so the uncertainty dropped by a factor of 4.)

log2(4) = log2(1/0.25) = 2 bits of useful information. (Intuitively, every bit sent reduces the uncertainty by a factor of 2, so reducing it by a factor of 4 takes 2 bits.)

If the forecast says the next day will be sunny, we reduce the uncertainty only by a factor of 4/3 (the 4 initial options shrink to the 3 sunny options).

log2(4/3) = log2(1/0.75) ≈ 0.41 bits of useful information

In the second case the amount of useful information is much smaller, because we already knew sunny was 75% probable.

So the average amount of useful information we get from the forecast is 0.25 * 2 + 0.75 * 0.41 ≈ 0.81 bits.

Let's calculate it for the case where rainy and sunny have equal probability:

-0.5*log2(0.5) - 0.5*log2(0.5) = 1 bit

Now the formula should make sense. In general, Entropy = -Σ p(x) * log2(p(x)): it gives us the average amount of useful information we get from a set of events.
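Here is a minimal sketch of that formula in Python; the helper name `entropy` is my own, not something from the article:

```python
import math

def entropy(probs):
    """Average useful information in bits: H(p) = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit    (50/50 rainy vs sunny)
print(entropy([0.75, 0.25]))  # ~0.81 bits (75/25 sunny vs rainy)
```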

When the events are very uncertain (50/50) and the forecast makes a prediction, you learn a lot; when the events are less uncertain (75%/25%), you learn less.

Cross-Entropy

Cross-entropy is the average number of bits actually used to encode the information, under a chosen encoding scheme.

Take an example where we have 8 kinds of weather and we use 3 bits to represent each outcome.

The entropy is -(0.35*log2(0.35) + 0.35*log2(0.35) + 0.1*log2(0.1) + …) = 2.23 bits. Even though we are using 3 bits to represent each outcome, the average amount of useful information we are getting is only 2.23 bits.

For the example above:

Entropy = 2.23 bits

Cross-entropy = average number of bits used = 0.35*3 + 0.35*3 + 0.1*3 + … = 3 bits
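Only the first few probabilities of the 8-weather distribution appear in the text; assuming the full distribution is [0.35, 0.35, 0.1, 0.1, 0.04, 0.04, 0.01, 0.01] (which reproduces the 2.23-bit figure), the two numbers can be checked like this:

```python
import math

p = [0.35, 0.35, 0.1, 0.1, 0.04, 0.04, 0.01, 0.01]  # assumed full distribution

entropy = -sum(pi * math.log2(pi) for pi in p)
cross_entropy_fixed = sum(pi * 3 for pi in p)  # every outcome encoded with 3 bits

print(round(entropy, 2))              # 2.23 bits
print(round(cross_entropy_fixed, 2))  # 3.0 bits
```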

If we change the encoding scheme so that more likely outcomes get shorter codes (2 bits for the two most likely outcomes, 3 bits for the next, and so on):

Cross-entropy = 0.35*2 + 0.35*2 + 0.1*3 + … = 2.42 bits, which is close to the entropy but not equal to it.

Now reverse the distribution while keeping the same encoding scheme, so the most likely outcomes get the longest codes:

Cross-entropy = 0.01*2 + 0.01*2 + 0.04*3 + … = 4.58 bits, roughly twice the entropy.
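A sketch of both calculations, assuming code lengths of [2, 2, 3, 3, 4, 4, 5, 5] bits (these are not spelled out in the text, but they reproduce the 2.42 and 4.58 figures):

```python
p = [0.35, 0.35, 0.1, 0.1, 0.04, 0.04, 0.01, 0.01]  # assumed true distribution
code_lengths = [2, 2, 3, 3, 4, 4, 5, 5]             # assumed encoding: short codes for likely outcomes

# Cross-entropy when the encoding roughly matches the distribution
cross_entropy = sum(pi * L for pi, L in zip(p, code_lengths))
print(round(cross_entropy, 2))  # 2.42 bits, close to the 2.23-bit entropy

# Reverse the distribution but keep the same codes: likely outcomes now get long codes
p_reversed = list(reversed(p))
cross_entropy_reversed = sum(pi * L for pi, L in zip(p_reversed, code_lengths))
print(round(cross_entropy_reversed, 2))  # 4.58 bits, roughly twice the entropy
```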

This is because, by choosing an encoding, we are implicitly assuming a distribution: giving an outcome a code of length L amounts to predicting its probability as 1/2^L.

When the true distribution does not match the predicted distribution (the implicit distribution behind the chosen encoding scheme), the cross-entropy is greater than the entropy. The difference is what we call the KL divergence:

Cross-entropy = Entropy + KL divergence
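Putting the three quantities together (a minimal sketch using the same assumed distribution and code lengths as above; q is the distribution the encoding implicitly predicts, q_i = 1/2^L_i):

```python
import math

p = [0.35, 0.35, 0.1, 0.1, 0.04, 0.04, 0.01, 0.01]  # assumed true distribution
q = [2 ** -L for L in (2, 2, 3, 3, 4, 4, 5, 5)]     # distribution implied by the code lengths

entropy = -sum(pi * math.log2(pi) for pi in p)
cross_entropy = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))
kl_divergence = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

print(round(entropy, 2))                  # 2.23
print(round(kl_divergence, 2))            # 0.19
print(round(entropy + kl_divergence, 2))  # 2.42, which equals the cross-entropy
print(round(cross_entropy, 2))            # 2.42
```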
