The Central Limit Theorem

A Visual Exploration with Python

I’ll solve a probability theory problem using Python, explain the mathematical concepts, and visualize the results with graphs.

Let’s look at a classic probability problem: the Central Limit Theorem demonstration.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Set random seed for reproducibility
np.random.seed(42)

# Parameters
n_samples = 10000                     # Number of samples to generate
sample_sizes = [1, 2, 5, 10, 30, 50]  # Different sample sizes to demonstrate CLT
distributions = {
    'uniform': stats.uniform(0, 1),
    'exponential': stats.expon(scale=1),
    'poisson': stats.poisson(5)
}

# Function to generate samples and calculate means
def generate_sample_means(distribution, sample_size, n_samples):
    # Generate n_samples sets of sample_size samples and calculate their means
    sample_means = np.array([
        np.mean(distribution.rvs(size=sample_size))
        for _ in range(n_samples)
    ])
    return sample_means

# Setup the plots
fig, axes = plt.subplots(len(distributions), len(sample_sizes), figsize=(20, 12))
fig.suptitle('Central Limit Theorem Demonstration', fontsize=16)

# Create plots for each distribution and sample size
for i, (dist_name, distribution) in enumerate(distributions.items()):
    # Calculate the theoretical mean and std of the distribution
    if dist_name == 'uniform':
        true_mean = 0.5
        true_std = 1/np.sqrt(12)
    elif dist_name == 'exponential':
        true_mean = 1
        true_std = 1
    else:  # poisson
        true_mean = 5
        true_std = np.sqrt(5)

    for j, sample_size in enumerate(sample_sizes):
        # Generate sample means
        means = generate_sample_means(distribution, sample_size, n_samples)

        # Calculate expected standard deviation of the sampling distribution
        expected_std = true_std / np.sqrt(sample_size)

        # Plot histogram
        ax = axes[i, j]
        ax.hist(means, bins=50, density=True, alpha=0.6, color='skyblue')

        # Plot the normal distribution curve for comparison
        x = np.linspace(min(means), max(means), 1000)
        normal_pdf = stats.norm.pdf(x, true_mean, expected_std)
        ax.plot(x, normal_pdf, 'r-', linewidth=2)

        # Add theoretical mean line
        ax.axvline(true_mean, color='green', linestyle='--', linewidth=1.5)

        # Set title and labels
        if i == 0:
            ax.set_title(f'Sample Size = {sample_size}')
        if j == 0:
            ax.set_ylabel(f'{dist_name.capitalize()}')

        # Add some stats
        sample_mean = np.mean(means)
        sample_std = np.std(means)
        ax.text(0.05, 0.95, f'Mean: {sample_mean:.4f}\nSTD: {sample_std:.4f}',
                transform=ax.transAxes, fontsize=8,
                verticalalignment='top', bbox=dict(boxstyle='round', facecolor='white', alpha=0.7))

plt.tight_layout()
plt.subplots_adjust(top=0.9)

# Create additional visualization for convergence of variance
plt.figure(figsize=(10, 6))
colors = ['blue', 'red', 'green']
markers = ['o', 's', '^']

for i, (dist_name, distribution) in enumerate(distributions.items()):
    if dist_name == 'uniform':
        true_std = 1/np.sqrt(12)
    elif dist_name == 'exponential':
        true_std = 1
    else:  # poisson
        true_std = np.sqrt(5)

    std_ratios = []
    for sample_size in range(1, 51):
        means = generate_sample_means(distribution, sample_size, n_samples)
        sample_std = np.std(means)
        expected_std = true_std / np.sqrt(sample_size)
        std_ratio = sample_std / expected_std
        std_ratios.append(std_ratio)

    plt.plot(range(1, 51), std_ratios, color=colors[i], marker=markers[i],
             linestyle='-', label=f'{dist_name.capitalize()}', markersize=4)

plt.axhline(1, color='black', linestyle='--')
plt.xlabel('Sample Size')
plt.ylabel('Ratio of Sample STD to Expected STD')
plt.title('Convergence of Sampling Distribution Standard Deviation')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Let's also visualize the original distributions
plt.figure(figsize=(15, 5))
x_values = {
    'uniform': np.linspace(0, 1, 1000),
    'exponential': np.linspace(0, 5, 1000),
    'poisson': np.arange(0, 20)
}

for i, (dist_name, distribution) in enumerate(distributions.items()):
    plt.subplot(1, 3, i+1)

    if dist_name == 'poisson':
        # For discrete distribution
        x = x_values[dist_name]
        plt.bar(x, distribution.pmf(x), alpha=0.6, color=colors[i])
    else:
        # For continuous distributions
        x = x_values[dist_name]
        plt.plot(x, distribution.pdf(x), color=colors[i], linewidth=2)
        plt.fill_between(x, distribution.pdf(x), alpha=0.3, color=colors[i])

    plt.title(f'{dist_name.capitalize()} Distribution')
    plt.xlabel('Value')
    plt.ylabel('Density' if dist_name != 'poisson' else 'Probability')
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Central Limit Theorem Example

The Central Limit Theorem (CLT) is one of the most important results in probability theory.

It states that when independent random variables are added, their properly normalized sum tends toward a normal distribution even if the original variables are not normally distributed.

Mathematical Formulation

Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed random variables with mean $\mu$ and finite variance $\sigma^2$.

The CLT states that as $n$ approaches infinity, the distribution of:

$$Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}$$

converges to the standard normal distribution $N(0,1)$, where $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ is the sample mean.

In other words, the sampling distribution of the mean approaches a normal distribution with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$ as the sample size increases.
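
As a quick numerical sanity check of this standardized form, here is a minimal sketch (my addition, not part of the original script) that draws repeated samples from the same exponential distribution used in the demo, standardizes the sample means with an illustrative $n = 30$, and compares summary statistics and a few quantiles against $N(0,1)$:

import numpy as np
from scipy import stats

np.random.seed(42)

# Illustrative check of the standardized form of the CLT:
# draw many samples of size n from an exponential distribution (mu = 1, sigma = 1),
# standardize each sample mean, and compare the result with N(0, 1).
n = 30             # sample size (illustrative choice)
n_samples = 10000  # number of repeated samples
mu, sigma = 1.0, 1.0

samples = stats.expon(scale=1).rvs(size=(n_samples, n))
z_scores = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

print(f"mean of z: {z_scores.mean():.3f}  (theory: 0)")
print(f"std of z:  {z_scores.std():.3f}  (theory: 1)")

# Empirical quantiles vs. standard normal quantiles; the residual skewness
# at n = 30 shows up as small differences in the tails.
for q in (0.025, 0.5, 0.975):
    print(f"q={q:>5}: empirical {np.quantile(z_scores, q):+.3f}  vs  N(0,1) {stats.norm.ppf(q):+.3f}")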

Code Explanation

My code demonstrates the Central Limit Theorem by:

  1. Generating samples from three different distributions (uniform, exponential, and Poisson)
  2. Computing the mean of each sample for various sample sizes
  3. Plotting the distribution of these sample means alongside a normal distribution curve
  4. Tracking how quickly the standard deviation of the sample means converges to the theoretical value

Key components:

  • generate_sample_means() creates multiple samples of a specified size and calculates their means (a short standalone usage sketch follows this list)
  • The first plot shows histograms of sample means for different distributions and sample sizes
  • The second plot shows how the standard deviation ratio converges to the expected value
  • The third plot visualizes the original distributions to highlight their different shapes
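
For example, the helper can be run on its own to check the $\sigma/\sqrt{n}$ prediction for a single case. The snippet below is an illustrative sketch that assumes generate_sample_means() and the imports from the script above are already in scope:

# Standalone use of generate_sample_means() from the script above.
# For exponential(scale=1), sigma = 1, so the standard deviation of the
# sample means should be close to 1/sqrt(30) ~ 0.183 when sample_size = 30.
means = generate_sample_means(stats.expon(scale=1), sample_size=30, n_samples=10000)
print(np.mean(means))   # should be close to the true mean, 1.0
print(np.std(means))    # should be close to 1/np.sqrt(30) ~ 0.183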

Theoretical Insights

For each distribution, I’ve calculated the true mean and standard deviation:

  1. Uniform distribution on [0,1]:

    • Mean: $\mu = \frac{0+1}{2} = 0.5$
    • Variance: $\sigma^2 = \frac{(1-0)^2}{12} = \frac{1}{12}$
  2. Exponential distribution with rate $\lambda = 1$ (scale $= 1/\lambda = 1$):

    • Mean: $\mu = \frac{1}{\lambda} = 1$
    • Variance: $\sigma^2 = \frac{1}{\lambda^2} = 1$
  3. Poisson distribution with λ=5:

    • Mean: $\mu = \lambda = 5$
    • Variance: $\sigma^2 = \lambda = 5$

The standard deviation of the sampling distribution should be $\frac{\sigma}{\sqrt{n}}$, where $n$ is the sample size.
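
These hand-computed moments can be cross-checked against SciPy itself, since frozen distribution objects expose mean() and std() methods. A minimal sketch:

from scipy import stats

# Cross-check the hand-computed moments against SciPy's frozen distributions.
for name, dist in {
    'uniform': stats.uniform(0, 1),
    'exponential': stats.expon(scale=1),
    'poisson': stats.poisson(5),
}.items():
    print(f"{name:12s} mean={dist.mean():.4f}  std={dist.std():.4f}")

# Expected output (up to rounding):
#   uniform      mean=0.5000  std=0.2887   (0.2887 ~ 1/sqrt(12))
#   exponential  mean=1.0000  std=1.0000
#   poisson      mean=5.0000  std=2.2361   (2.2361 ~ sqrt(5))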

Results Interpretation

The graphs demonstrate several important aspects of the CLT:

  1. As sample size increases, the distribution of sample means becomes more normal (bell-shaped), regardless of the original distribution shape
  2. The larger the sample size, the smaller the standard deviation of the sampling distribution
  3. The convergence happens relatively quickly - even with sample sizes of $30$, the approximation is quite good
  4. The standard deviation of the sample means scales proportionally to $\frac{1}{\sqrt{n}}$

The convergence plot, which shows the ratio of the observed to the theoretical standard deviation, confirms that the experimental results match the theory: the ratio approaches $1.0$ for all three distributions as the sample size increases.
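
To make the $\frac{1}{\sqrt{n}}$ scaling concrete, take the exponential distribution ($\sigma = 1$):

$$\frac{\sigma}{\sqrt{30}} \approx 0.183, \qquad \frac{\sigma}{\sqrt{50}} \approx 0.141,$$

so moving from $n = 30$ to $n = 50$ narrows the spread of the sample means by a factor of $\sqrt{30/50} \approx 0.77$.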

This demonstrates why the CLT is so powerful in statistics: it allows us to make inferences about population parameters regardless of the underlying distribution, as long as we have a sufficiently large sample size.
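
As one illustration of such an inference (a sketch added here for illustration, not part of the original script), the CLT justifies a normal-approximation confidence interval for a population mean even when the population itself is skewed:

import numpy as np
from scipy import stats

np.random.seed(42)

# Illustrative CLT-based 95% confidence interval for a population mean,
# using a single sample from a skewed (exponential) population.
n = 50                              # illustrative sample size
data = stats.expon(scale=1).rvs(size=n)

x_bar = data.mean()
se = data.std(ddof=1) / np.sqrt(n)  # estimated standard error of the mean
z = stats.norm.ppf(0.975)           # ~1.96 for a 95% interval

print(f"95% CI for the mean: [{x_bar - z*se:.3f}, {x_bar + z*se:.3f}]")
# If the experiment were repeated many times, intervals built this way would
# cover the true mean (1.0) roughly 95% of the time for moderately large n.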