Analyzing the Relationship Between Education Level and Income Using Python

Let’s consider a sociological example where we analyze the relationship between education level and income.

We’ll use Python to simulate data, analyze it, and visualize the results.

Example: Education Level and Income

Hypothesis: Higher education levels are associated with higher income.

Step 1: Simulate Data

We’ll create a synthetic dataset where:

  • Education Level is measured in years (e.g., $12$ years for high school, $16$ for a bachelor’s degree, etc.).
  • Income is measured in thousands of dollars per year.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set a seed for reproducibility
np.random.seed(42)

# Simulate data
n = 100 # Number of samples
education_level = np.random.randint(12, 21, n) # Years of education (12 to 20 years)
income = 20 * education_level + np.random.normal(0, 10, n) # Income in thousands of dollars

# Create a DataFrame
data = pd.DataFrame({'Education_Level': education_level, 'Income': income})

# Display the first few rows
print(data.head())
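
Before fitting a model, it can help to confirm that the simulated data look the way we intended. The following is a minimal sketch that rebuilds the same dataset and summarizes it by education level; the `group_summary` table and its `expected_mean` column are illustrative names that do not appear in the code above.

```python
import numpy as np
import pandas as pd

# Rebuild the simulated dataset exactly as in Step 1
np.random.seed(42)
n = 100
education_level = np.random.randint(12, 21, n)
income = 20 * education_level + np.random.normal(0, 10, n)
data = pd.DataFrame({'Education_Level': education_level, 'Income': income})

# Overall summary statistics for both columns
print(data.describe())

# Group size and mean income for each number of years of education
group_summary = data.groupby('Education_Level')['Income'].agg(['count', 'mean'])

# Mean income implied by the simulation formula (20 * years), for comparison
group_summary['expected_mean'] = 20 * group_summary.index.to_numpy()
print(group_summary)
```

Each group mean should sit close to its `expected_mean`, with the gap reflecting the random noise added in the simulation.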

Step 2: Analyze the Data

We’ll use a simple linear regression to analyze the relationship between education level and income.

from sklearn.linear_model import LinearRegression

# Prepare the data
X = data[['Education_Level']]
y = data['Income']

# Fit the model
model = LinearRegression()
model.fit(X, y)

# Get the coefficients
slope = model.coef_[0]
intercept = model.intercept_

print(f"Slope (Coefficient for Education Level): {slope}")
print(f"Intercept: {intercept}")
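
The slope and intercept alone do not say how well the line fits. As a hedged sketch, the snippet below adds two common measures: the coefficient of determination from scikit-learn's `score` method, and the standard error and p-value of the slope from `scipy.stats.linregress`. Using SciPy here is an assumption on our part, since the original analysis relies only on scikit-learn.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Rebuild the simulated dataset exactly as in Step 1
np.random.seed(42)
n = 100
education_level = np.random.randint(12, 21, n)
income = 20 * education_level + np.random.normal(0, 10, n)

# Fit the same model as above and compute R^2 on the training data
X = education_level.reshape(-1, 1)
model = LinearRegression().fit(X, income)
r_squared = model.score(X, income)

# linregress fits the same least-squares line and also reports uncertainty
result = stats.linregress(education_level, income)

print(f"R^2: {r_squared:.3f}")
print(f"Slope: {result.slope:.3f} (standard error {result.stderr:.3f})")
print(f"p-value for the null hypothesis of zero slope: {result.pvalue:.3g}")
```

Both libraries solve the same least-squares problem, so the two slope estimates should agree up to floating-point error.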

Step 3: Visualize the Results

We’ll plot the data points and the regression line to visualize the relationship.

# Plot the data points
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Education_Level', y='Income', data=data, color='blue', label='Data Points')

# Plot the regression line
plt.plot(X, model.predict(X), color='red', label='Regression Line')

# Add labels and title
plt.xlabel('Education Level (Years)')
plt.ylabel('Income (Thousands of Dollars)')
plt.title('Relationship Between Education Level and Income')
plt.legend()

# Show the plot
plt.show()
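
A residual plot is a useful companion to the scatter plot, because it shows whether the straight line leaves any visible pattern unexplained. The sketch below is one possible way to draw it with Matplotlib; the variable names `fitted` and `residuals` are our own and are not part of the code above.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Rebuild the simulated dataset exactly as in Step 1
np.random.seed(42)
n = 100
education_level = np.random.randint(12, 21, n)
income = 20 * education_level + np.random.normal(0, 10, n)

# Fit the model and compute residuals (observed minus predicted income)
X = education_level.reshape(-1, 1)
model = LinearRegression().fit(X, income)
fitted = model.predict(X)
residuals = income - fitted

# Plot residuals against fitted values with a horizontal reference line at zero
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(fitted, residuals, color='blue', label='Residuals')
ax.axhline(0, color='red', label='Zero line')
ax.set_xlabel('Fitted Income (Thousands of Dollars)')
ax.set_ylabel('Residual (Thousands of Dollars)')
ax.set_title('Residuals vs. Fitted Values')
ax.legend()

# Call plt.show() here when running interactively
```

Because the noise was simulated with a constant standard deviation, the residuals should scatter evenly around zero with no obvious trend.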

Step 4: Interpret the Results

Running the code above produces the following output:

   Education_Level      Income
0               18  363.268452
1               15  299.188810
2               19  384.677948
3               16  327.361224
4               18  352.202981
Slope (Coefficient for Education Level): 19.71155398870412
Intercept: 5.404657619066882

  • Slope (Coefficient for Education Level): This value estimates how much income changes, on average, for each additional year of education.

Here the fitted slope is about $19.71$, which means that each additional year of education is associated with roughly $\$19{,}710$ more in annual income on average. This is close to the value of $20$ used to simulate the data.

  • Intercept: This is the expected income when the education level is $0$ years.
    In this context, it has little practical meaning because the simulated data cover only $12$ to $20$ years of education, so $0$ lies far outside the observed range.
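
To make the interpretation concrete, the fitted line can be used to compare predicted incomes at two education levels. The sketch below copies the slope and intercept from the printed output above; `predict_income` is a hypothetical helper introduced only for this illustration.

```python
# Coefficients copied from the printed output above
slope = 19.71155398870412
intercept = 5.404657619066882


def predict_income(years_of_education):
    """Return predicted income in thousands of dollars for the given years."""
    return intercept + slope * years_of_education


high_school = predict_income(12)  # 12 years of education
bachelors = predict_income(16)    # 16 years of education

print(f"Predicted income at 12 years: {high_school:.2f}")         # 241.94
print(f"Predicted income at 16 years: {bachelors:.2f}")           # 320.79
print(f"Predicted difference: {bachelors - high_school:.2f}")     # 78.85
```

The difference of about 78.85 thousand dollars is simply four times the slope, since the two education levels are four years apart.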

Step 5: Conclusion

The scatter plot with the regression line shows a positive relationship between education level and income.

As education level increases, income tends to increase as well.

This is consistent with our hypothesis that higher education levels are associated with higher income, although the simulated data have this relationship built in by construction.
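
One way to judge how firmly the data support a positive relationship is to estimate the uncertainty in the slope. The following is a rough sketch of a percentile bootstrap, which is our own addition rather than part of the original analysis: it refits the line on resampled data and reports the middle 95% of the resulting slopes. The number of resamples (`n_boot = 2000`) and the resampling seed are arbitrary choices.

```python
import numpy as np

# Rebuild the simulated dataset exactly as in Step 1
np.random.seed(42)
n = 100
education_level = np.random.randint(12, 21, n)
income = 20 * education_level + np.random.normal(0, 10, n)

# Separate generator for resampling (seed chosen arbitrarily)
rng = np.random.default_rng(0)
n_boot = 2000
boot_slopes = np.empty(n_boot)

for i in range(n_boot):
    # Draw n row indices with replacement and refit a degree-1 polynomial
    idx = rng.integers(0, n, size=n)
    boot_slopes[i] = np.polyfit(education_level[idx], income[idx], 1)[0]

# Percentile bootstrap interval for the slope
lower, upper = np.percentile(boot_slopes, [2.5, 97.5])
print(f"Approximate 95% interval for the slope: [{lower:.2f}, {upper:.2f}]")
```

An interval that lies entirely above zero indicates that the positive slope is not an artifact of this particular sample.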

Full Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

# Set a seed for reproducibility
np.random.seed(42)

# Simulate data
n = 100 # Number of samples
education_level = np.random.randint(12, 21, n) # Years of education (12 to 20 years)
income = 20 * education_level + np.random.normal(0, 10, n) # Income in thousands of dollars

# Create a DataFrame
data = pd.DataFrame({'Education_Level': education_level, 'Income': income})

# Display the first few rows
print(data.head())

# Prepare the data
X = data[['Education_Level']]
y = data['Income']

# Fit the model
model = LinearRegression()
model.fit(X, y)

# Get the coefficients
slope = model.coef_[0]
intercept = model.intercept_

print(f"Slope (Coefficient for Education Level): {slope}")
print(f"Intercept: {intercept}")

# Plot the data points
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Education_Level', y='Income', data=data, color='blue', label='Data Points')

# Plot the regression line
plt.plot(X, model.predict(X), color='red', label='Regression Line')

# Add labels and title
plt.xlabel('Education Level (Years)')
plt.ylabel('Income (Thousands of Dollars)')
plt.title('Relationship Between Education Level and Income')
plt.legend()

# Show the plot
plt.show()

This code simulates data, fits a linear regression model, and visualizes the relationship between education level and income.

In this simulated dataset, higher education levels are associated with higher income, a pattern that mirrors a common finding in sociological research.