Estimating the Effect of Education on Wages

Practical Example in Econometrics: Estimating the Effect of Education on Wages

We will use a simplified dataset to estimate how education affects wages using Ordinary Least Squares (OLS) regression.

This example demonstrates a common econometric problem: understanding causal relationships using regression analysis.


Problem

The objective is to estimate the relationship between education (measured in years) and wages (hourly wage).

We hypothesize that higher education leads to higher wages.

Assumptions

  • The relationship is linear: $\text{Wages} = \beta_0 + \beta_1 \cdot \text{Education} + \epsilon$, where $\epsilon$ is the error term.
  • For simplicity, we assume there is no omitted variable bias: the error term $\epsilon$ is uncorrelated with education.
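
To see why the second assumption matters, here is a minimal simulation sketch, not part of the original example. It introduces a hypothetical confounder called `ability` that raises both education and wages; the coefficients 1.0 and 3.0 attached to it are made-up illustration values. When `ability` is left out of the regression, the slope on education picks up part of its effect. Under these assumed values the expected bias is 3.0 × cov(education, ability) / var(education) = 3.0 × 1/5 = 0.6.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical confounder (an assumption for illustration only):
# "ability" raises both education and wages.
ability = rng.normal(0, 1, n)
education = 12 + 1.0 * ability + rng.normal(0, 2, n)
wages = 5 + 2.5 * education + 3.0 * ability + rng.normal(0, 5, n)


def ols_slope(x, y):
    """Slope of a simple regression of y on x: cov(x, y) / var(x)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


# Short regression: ability is omitted, so the slope absorbs part of its effect.
slope_short = ols_slope(education, wages)

# Long regression: least squares on [1, education, ability] controls for ability.
X = np.column_stack([np.ones(n), education, ability])
coef_long, *_ = np.linalg.lstsq(X, wages, rcond=None)
slope_long = coef_long[1]

print(f"slope omitting ability:        {slope_short:.2f}")
print(f"slope controlling for ability: {slope_long:.2f}")
```

The synthetic data used in the rest of this example draws the error independently of education, so the assumption holds there by construction.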

Python Implementation

Below is the Python implementation:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(42)
n = 100
education = np.random.normal(12, 2, n) # Average education years: 12
error = np.random.normal(0, 5, n)
wages = 5 + 2.5 * education + error # True relationship: beta_0=5, beta_1=2.5

# Create a DataFrame
data = pd.DataFrame({'Education': education, 'Wages': wages})

# OLS Regression
X = sm.add_constant(data['Education']) # Add intercept
model = sm.OLS(data['Wages'], X).fit()
print(model.summary())

# Plot the data and regression line
plt.scatter(data['Education'], data['Wages'], color='blue', label='Observed Data')
plt.plot(data['Education'], model.predict(X), color='red', label='Fitted Line')
plt.xlabel('Education (Years)')
plt.ylabel('Wages (Hourly)')
plt.title('Relationship Between Education and Wages')
plt.legend()
plt.show()

Explanation of Code

  1. Data Generation:

    • education is randomly generated to simulate years of schooling.
    • error introduces random noise to mimic real-world data variability.
    • wages is computed using the true relationship with some error.
  2. OLS Regression:

    • sm.OLS (from statsmodels.api) is used to estimate the parameters $\beta_0$ (intercept) and $\beta_1$ (slope). sm.add_constant adds the column of ones that lets the model fit an intercept.
  3. Visualization:

    • A scatter plot shows observed data (blue dots).
    • The regression line (red) represents the predicted relationship between education and wages.
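
For a single regressor, the OLS estimates have a closed form: the slope is the sample covariance of education and wages divided by the sample variance of education, and the intercept makes the line pass through the point of means. The sketch below regenerates the same synthetic data and computes both by hand with NumPy only; `np.polyfit` is used purely as a cross-check and is an addition, not part of the original code.

```python
import numpy as np

# Same synthetic data as in the example above
np.random.seed(42)
n = 100
education = np.random.normal(12, 2, n)
error = np.random.normal(0, 5, n)
wages = 5 + 2.5 * education + error

# Closed-form OLS for one regressor
x_bar, y_bar = education.mean(), wages.mean()
beta_1 = np.sum((education - x_bar) * (wages - y_bar)) / np.sum((education - x_bar) ** 2)
beta_0 = y_bar - beta_1 * x_bar

# Cross-check against a degree-1 least-squares polynomial fit.
# np.polyfit returns coefficients from highest degree down: [slope, intercept].
slope_check, intercept_check = np.polyfit(education, wages, deg=1)

print(f"beta_0 = {beta_0:.3f}, beta_1 = {beta_1:.3f}")
print(f"polyfit: intercept = {intercept_check:.3f}, slope = {slope_check:.3f}")
```

These values should agree with the `const` and `Education` rows of the statsmodels summary, since both solve the same least-squares problem.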

Key Outputs

  • Regression Summary:

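    • Coefficients ($\beta_0$, $\beta_1$): $\beta_1$ is the estimated change in hourly wage for one additional year of education, and $\beta_0$ is the predicted wage at zero years of education. Because the data were simulated with $\beta_1 = 2.5$, the estimate should land near that value, up to sampling error.
    • R-squared: The share of the variance in wages explained by the model, between $0$ and $1$. Values closer to $1$ mean the fitted line tracks the data more closely.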
  • Graph:

    • The red line demonstrates the estimated relationship.
      The slope corresponds to the $\beta_1$ value, showing how wages change with education.
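
As a rough illustration of how to read these two outputs, the sketch below recomputes R-squared from the residuals and checks the slope interpretation by comparing predicted wages at 12 and 13 years of education. It again uses NumPy only, with `np.polyfit` standing in for the statsmodels fit (an assumption made to keep the snippet self-contained). In the data-generating process the signal variance is 2.5² × 2² = 25 and the noise variance is 5² = 25, so the population R-squared is 0.5; the sample value will differ somewhat.

```python
import numpy as np

# Same synthetic data as in the example above
np.random.seed(42)
n = 100
education = np.random.normal(12, 2, n)
error = np.random.normal(0, 5, n)
wages = 5 + 2.5 * education + error

# Fit the line (np.polyfit returns [slope, intercept] for deg=1)
beta_1, beta_0 = np.polyfit(education, wages, deg=1)
fitted = beta_0 + beta_1 * education

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((wages - fitted) ** 2)
ss_tot = np.sum((wages - wages.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Slope interpretation: one more year of education shifts the prediction by beta_1
wage_12 = beta_0 + beta_1 * 12
wage_13 = beta_0 + beta_1 * 13

print(f"R-squared: {r_squared:.3f}")
print(f"Predicted wage difference (13 vs 12 years): {wage_13 - wage_12:.3f}")
print(f"Estimated slope beta_1:                     {beta_1:.3f}")
```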