Entropy Maximization

Finding the Most Likely Distribution Under Constraints

Entropy maximization is one of the most elegant ideas in statistical mechanics and information theory. The core question is: given what we know, what probability distribution should we use? The answer: the one that maximizes entropy — the most “honest” or least biased distribution consistent with the constraints.


The Problem

Given a discrete random variable $X$ taking values $\{x_1, x_2, \ldots, x_n\}$ with probabilities $\{p_1, p_2, \ldots, p_n\}$, we want to maximize the Shannon entropy:

$$H(p) = -\sum_{i=1}^{n} p_i \log p_i$$

subject to some or all of the following constraints:

$$\sum_{i=1}^{n} p_i = 1 \quad \text{(normalization)}$$

$$\sum_{i=1}^{n} p_i \cdot x_i = \mu \quad \text{(mean constraint)}$$

$$\sum_{i=1}^{n} p_i \cdot x_i^2 = \mu^2 + \sigma^2 \quad \text{(variance constraint)}$$
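When the variance constraint is also imposed, each active constraint contributes its own Lagrange multiplier, and the maximizer takes a discrete Gaussian-like form:

$$p_i^* = \frac{e^{\lambda_1 x_i + \lambda_2 x_i^2}}{Z(\lambda_1, \lambda_2)}, \quad Z(\lambda_1, \lambda_2) = \sum_{j} e^{\lambda_1 x_j + \lambda_2 x_j^2}$$

The worked example below keeps only the normalization and mean constraints, which drops the $x_i^2$ term.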


Concrete Example

Suppose we have a die with faces $\{1, 2, 3, 4, 5, 6\}$. We observe:

  • Mean: $\mu = 3.5$ (consistent with a fair die)
  • Alternatively: $\mu = 4.0$ (biased toward higher faces)

We ask: what probability distribution maximizes entropy given this mean?

By the method of Lagrange multipliers, the solution takes the form of a Gibbs/Boltzmann distribution:

$$p_i^* = \frac{e^{\lambda x_i}}{Z(\lambda)}, \quad Z(\lambda) = \sum_{j} e^{\lambda x_j}$$

where $\lambda$ is chosen so that $\sum_i p_i x_i = \mu$.
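Here is the Lagrangian calculation behind that form. Introducing multipliers $\lambda_0$ for normalization and $\lambda$ for the mean:

$$\mathcal{L}(p, \lambda_0, \lambda) = -\sum_{i} p_i \log p_i + \lambda_0 \Big( \sum_{i} p_i - 1 \Big) + \lambda \Big( \sum_{i} p_i x_i - \mu \Big)$$

$$\frac{\partial \mathcal{L}}{\partial p_i} = -\log p_i - 1 + \lambda_0 + \lambda x_i = 0 \quad \Longrightarrow \quad p_i = e^{\lambda_0 - 1}\, e^{\lambda x_i}$$

Normalization fixes $e^{\lambda_0 - 1} = 1/Z(\lambda)$, which is exactly the Gibbs form above.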


Python Code

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from scipy.optimize import brentq
from scipy.special import entr

# ─────────────────────────────────────────
# 1. Core functions
# ─────────────────────────────────────────

def boltzmann_dist(lam, x):
    """Gibbs distribution for given lambda and support x."""
    e = np.exp(lam * x - np.max(lam * x))  # shift exponent for numerical stability
    return e / e.sum()

def mean_from_lambda(lam, x):
    p = boltzmann_dist(lam, x)
    return np.dot(p, x)

def solve_lambda(mu_target, x):
    """Find lambda such that E[X] = mu_target using Brent's method."""
    lo, hi = -20.0, 20.0
    f = lambda lam: mean_from_lambda(lam, x) - mu_target
    return brentq(f, lo, hi, xtol=1e-12)

def shannon_entropy(p):
    return float(np.sum(entr(p)))  # entr(p) = -p*log(p), handles p=0

# ─────────────────────────────────────────
# 2. Example: biased die
# ─────────────────────────────────────────

x = np.arange(1, 7, dtype=float) # die faces {1,...,6}
mu_fair = 3.5
mu_biased = 4.0

lam_fair = solve_lambda(mu_fair, x)
lam_biased = solve_lambda(mu_biased, x)

p_fair = boltzmann_dist(lam_fair, x)
p_biased = boltzmann_dist(lam_biased, x)
p_uniform = np.ones(6) / 6 # true uniform for reference

print("=== Fair-mean constraint (μ=3.5) ===")
for xi, pi in zip(x, p_fair):
    print(f"  P(X={int(xi)}) = {pi:.6f}")
print(f"  λ = {lam_fair:.6f},  H = {shannon_entropy(p_fair):.6f} nats")

print("\n=== Biased-mean constraint (μ=4.0) ===")
for xi, pi in zip(x, p_biased):
    print(f"  P(X={int(xi)}) = {pi:.6f}")
print(f"  λ = {lam_biased:.6f},  H = {shannon_entropy(p_biased):.6f} nats")

print(f"\n Uniform entropy H = {shannon_entropy(p_uniform):.6f} nats")

# ─────────────────────────────────────────
# 3. Sweep over mu → H(mu) curve
# ─────────────────────────────────────────

mu_vals = np.linspace(1.05, 5.95, 400)
lam_vals = np.array([solve_lambda(m, x) for m in mu_vals])
p_matrix = np.array([boltzmann_dist(l, x) for l in lam_vals]) # (400, 6)
H_vals = np.array([shannon_entropy(p) for p in p_matrix])

# ─────────────────────────────────────────
# 4. 3-D surface: mu × face → probability
# ─────────────────────────────────────────

MU, FACE = np.meshgrid(mu_vals, x, indexing='ij') # (400,6) each

# ─────────────────────────────────────────
# 5. Plotting
# ─────────────────────────────────────────

fig = plt.figure(figsize=(18, 14))
fig.suptitle("Entropy Maximization Under Mean Constraint\n(MaxEnt Gibbs Distribution on a Die)",
             fontsize=15, fontweight='bold', y=0.98)

gs = gridspec.GridSpec(2, 3, figure=fig, hspace=0.45, wspace=0.35)

# --- (a) Bar chart: fair vs biased ---
ax1 = fig.add_subplot(gs[0, 0])
bar_w = 0.27
bars1 = ax1.bar(x - bar_w, p_uniform, bar_w, label='Uniform', color='steelblue', alpha=0.85)
bars2 = ax1.bar(x, p_fair, bar_w, label='MaxEnt μ=3.5', color='darkorange', alpha=0.85)
bars3 = ax1.bar(x + bar_w, p_biased, bar_w, label='MaxEnt μ=4.0', color='seagreen', alpha=0.85)
ax1.set_xlabel("Die Face", fontsize=11)
ax1.set_ylabel("Probability", fontsize=11)
ax1.set_title("(a) MaxEnt Distributions vs Uniform", fontsize=11)
ax1.set_xticks(x)
ax1.legend(fontsize=8)
ax1.set_ylim(0, 0.28)
ax1.grid(axis='y', alpha=0.4)

# --- (b) H vs μ ---
ax2 = fig.add_subplot(gs[0, 1])
ax2.plot(mu_vals, H_vals, color='purple', lw=2.0)
ax2.axvline(mu_fair, color='darkorange', ls='--', lw=1.5, label=f'μ=3.5, H={shannon_entropy(p_fair):.3f}')
ax2.axvline(mu_biased, color='seagreen', ls='--', lw=1.5, label=f'μ=4.0, H={shannon_entropy(p_biased):.3f}')
ax2.axhline(shannon_entropy(p_uniform), color='steelblue', ls=':', lw=1.5, label=f'Uniform H={shannon_entropy(p_uniform):.3f}')
ax2.set_xlabel("Mean μ", fontsize=11)
ax2.set_ylabel("Entropy H (nats)", fontsize=11)
ax2.set_title("(b) MaxEnt Entropy vs Mean Constraint", fontsize=11)
ax2.legend(fontsize=8)
ax2.grid(alpha=0.4)

# --- (c) λ vs μ ---
ax3 = fig.add_subplot(gs[0, 2])
ax3.plot(mu_vals, lam_vals, color='crimson', lw=2.0)
ax3.axhline(0, color='k', lw=0.8, ls='--')
ax3.axvline(mu_fair, color='darkorange', ls='--', lw=1.5, label=f'μ=3.5 → λ={lam_fair:.3f}')
ax3.axvline(mu_biased, color='seagreen', ls='--', lw=1.5, label=f'μ=4.0 → λ={lam_biased:.3f}')
ax3.set_xlabel("Mean μ", fontsize=11)
ax3.set_ylabel("Lagrange Multiplier λ", fontsize=11)
ax3.set_title("(c) Lagrange Multiplier λ vs Mean", fontsize=11)
ax3.legend(fontsize=8)
ax3.grid(alpha=0.4)

# --- (d) 3-D surface ---
ax4 = fig.add_subplot(gs[1, :2], projection='3d')
surf = ax4.plot_surface(MU, FACE, p_matrix, cmap='viridis', alpha=0.88,
                        linewidth=0, antialiased=True)
ax4.set_xlabel("Mean μ", fontsize=10, labelpad=8)
ax4.set_ylabel("Die Face", fontsize=10, labelpad=8)
ax4.set_zlabel("Probability", fontsize=10, labelpad=8)
ax4.set_title("(d) 3-D: MaxEnt Probability Surface\n(μ × face → p*)", fontsize=11)
ax4.set_yticks(x)
fig.colorbar(surf, ax=ax4, shrink=0.5, aspect=10, pad=0.1)

# --- (e) Heatmap ---
ax5 = fig.add_subplot(gs[1, 2])
im = ax5.imshow(p_matrix.T, aspect='auto', origin='lower',
                extent=[mu_vals[0], mu_vals[-1], 0.5, 6.5],
                cmap='plasma')
ax5.set_xlabel("Mean μ", fontsize=11)
ax5.set_ylabel("Die Face", fontsize=11)
ax5.set_title("(e) Heatmap: p*(face | μ)", fontsize=11)
ax5.set_yticks(x)
fig.colorbar(im, ax=ax5, label='Probability')

plt.savefig("maxent_die.png", dpi=150, bbox_inches='tight')
plt.show()
print("Figure saved.")

Code Walkthrough

boltzmann_dist(lam, x) — computes $p_i = e^{\lambda x_i} / Z$ with the log-sum-exp trick (- np.max(...)) to avoid numerical overflow for large $|\lambda|$.

solve_lambda(mu_target, x) — the mean $\mathbb{E}_\lambda[X]$ is a smooth, strictly increasing function of $\lambda$ (its derivative is the variance $\mathrm{Var}_\lambda(X) > 0$), so $\mathbb{E}_\lambda[X] - \mu = 0$ has a unique root. Brent’s method (brentq) finds it to high precision in a handful of iterations, with none of the step-size tuning a gradient method would need.
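The monotonicity claim is easy to check numerically: the derivative of the constrained mean with respect to $\lambda$ should equal the variance under $p_\lambda$. A small sketch (boltzmann re-implements boltzmann_dist from the listing, evaluated at an arbitrary $\lambda = 0.3$):

```python
import numpy as np

def boltzmann(lam, x):
    # Same Gibbs form as boltzmann_dist above, shifted for stability
    e = np.exp(lam * x - np.max(lam * x))
    return e / e.sum()

x = np.arange(1, 7, dtype=float)
lam = 0.3

p = boltzmann(lam, x)
mean = p @ x
var = p @ x**2 - mean**2

# Central finite difference of E[X] with respect to lambda
h = 1e-6
dmean = (boltzmann(lam + h, x) @ x - boltzmann(lam - h, x) @ x) / (2 * h)

print(f"Var(X)     = {var:.8f}")
print(f"dE[X]/dlam = {dmean:.8f}")  # agrees with Var(X) up to finite-difference error
```

Since the derivative is a variance, it is positive everywhere, which is what guarantees brentq a unique root to bracket.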

shannon_entropy(p) — uses scipy.special.entr which safely computes $-p \log p$ and returns 0 when $p = 0$.
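To see the $p = 0$ handling in isolation, a tiny sketch:

```python
import numpy as np
from scipy.special import entr

p = np.array([0.5, 0.5, 0.0])
H = float(entr(p).sum())  # the zero entry contributes 0, not NaN
print(H)                  # equals log(2) for a fair coin with a dead outcome
```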

Sweep over $\mu$ — we solve for $\lambda(\mu)$ at 400 values between the extremes, building a full picture of how the MaxEnt distribution changes with the constraint.


Graph Explanations

(a) Bar Chart shows the three distributions side-by-side. With $\mu = 3.5$, the MaxEnt solution is identical to the uniform — because no additional information is encoded beyond normalization. With $\mu = 4.0$, higher faces get larger probability, but the distribution is as spread out as possible given that constraint.

(b) H vs $\mu$ — entropy is maximized at $\mu = 3.5$ (the symmetric center of $\{1,\ldots,6\}$) and falls off toward the extremes where the distribution must concentrate mass. This curve is concave and symmetric around 3.5.

(c) $\lambda$ vs $\mu$ — $\lambda = 0$ corresponds to the uniform distribution ($\mu = 3.5$). Positive $\lambda$ tilts mass toward high faces; negative $\lambda$ toward low faces. The relationship is nonlinear and steeper near the extremes.

(d) 3-D Surface — the most visually rich panel. Each “slice” at a fixed $\mu$ is one MaxEnt distribution. You can see the probability mass shifting smoothly from low faces (left) to high faces (right) as $\mu$ sweeps across its range.

(e) Heatmap — same information as (d) but in 2-D. Bright regions = high probability. The diagonal bright band shows mass continuously tracking the constraint mean.


Key Takeaway

The MaxEnt principle says: use the distribution that is maximally noncommittal about everything you don’t know, while respecting everything you do know. When the only constraint is the mean, the answer is always a Gibbs/Boltzmann exponential family distribution — a result that bridges information theory, statistical mechanics, and Bayesian inference.


Sample Output

=== Fair-mean constraint (μ=3.5) ===
  P(X=1) = 0.166667
  P(X=2) = 0.166667
  P(X=3) = 0.166667
  P(X=4) = 0.166667
  P(X=5) = 0.166667
  P(X=6) = 0.166667
  λ = 0.000000,  H = 1.791759 nats

=== Biased-mean constraint (μ=4.0) ===
  P(X=1) = 0.103065
  P(X=2) = 0.122731
  P(X=3) = 0.146148
  P(X=4) = 0.174034
  P(X=5) = 0.207240
  P(X=6) = 0.246782
  λ = 0.174629,  H = 1.748506 nats

  Uniform entropy H = 1.791759 nats

Figure saved.