A Basic Guide to Linear Regression Using Statsmodels in Python

Here’s a basic but useful example of how to use statsmodels for linear regression in Python.

This example demonstrates how to fit a linear regression model, check the summary, and make predictions.

Linear Regression Example with statsmodels

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Generate synthetic data
np.random.seed(42)
n = 100
X = np.random.rand(n)
y = 2 * X + np.random.randn(n) * 0.1 # y = 2*X + noise

# Create a DataFrame
df = pd.DataFrame({
'X': X,
'y': y
})

# Add a constant (intercept) column to the independent variable
X_const = sm.add_constant(df['X'])

# Fit the linear regression model
model = sm.OLS(df['y'], X_const).fit()

# Print the summary of the model
print(model.summary())

# Predict using the model (in-sample predictions on the training data)
df['y_pred'] = model.predict(X_const)

# Print the first few predictions
print(df.head())

# If you prefer using formulas (like in R):
formula = 'y ~ X'
model_formula = smf.ols(formula=formula, data=df).fit()

# Print the summary of the model fitted using formulas
print(model_formula.summary())

Explanation:

  1. Generating Data:

    • We create synthetic data where y is linearly dependent on X with some added noise.
  2. DataFrame:

    • We store the data in a Pandas DataFrame.
  3. Adding a Constant:

    • In linear regression, we usually include an intercept.
      sm.add_constant() adds a column of ones to the regressors to account for this (see the short sketch after this list).
  4. Fitting the Model:

    • We use sm.OLS() to define the Ordinary Least Squares (OLS) regression model and .fit() to estimate the coefficients.
  5. Summary:

    • model.summary() provides a detailed summary of the model, including R-squared, coefficients, p-values, etc.
  6. Prediction:

    • After fitting the model, we use it to make predictions with .predict().
  7. Formula API:

    • You can also use smf.ols() with a formula string, similar to R’s syntax.
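
To make points 3 and 6 concrete, here is a short sketch that reuses the model and model_formula objects fitted above; the new_data values are made up purely for illustration:

# Inspect the design matrix produced by add_constant(): a 'const' column of ones plus X
print(sm.add_constant(df['X']).head())

# Predict for new, unseen X values with the array-based model
# (illustrative values; the column order must match the training design matrix: const, X)
new_data = pd.DataFrame({'X': [0.2, 0.5, 0.8]})
print(model.predict(sm.add_constant(new_data, has_constant='add')))

# With the formula interface, the intercept is handled automatically
print(model_formula.predict(new_data))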

This example covers basic regression, but statsmodels also offers more advanced models, such as time series analysis (ARIMA) and logistic regression.
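
As a taste of the latter, here is a minimal, hypothetical sketch of a logistic regression using the formula API on freshly generated synthetic data (the names z, outcome, and binary_df are illustrative and not part of the example above):

# Hypothetical sketch: logistic regression with smf.logit() on synthetic binary data
z = np.random.rand(n)
outcome = (2 * z + np.random.randn(n) * 0.5 > 1).astype(int)  # binary target
binary_df = pd.DataFrame({'z': z, 'outcome': outcome})

logit_model = smf.logit('outcome ~ z', data=binary_df).fit()
print(logit_model.summary())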

Explanation of the output

Output:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.976
Model:                            OLS   Adj. R-squared:                  0.976
Method:                 Least Squares   F-statistic:                     4065.
Date:                Sun, 25 Aug 2024   Prob (F-statistic):           1.35e-81
Time:                        23:46:40   Log-Likelihood:                 99.112
No. Observations:                 100   AIC:                            -194.2
Df Residuals:                      98   BIC:                            -189.0
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0215      0.017      1.263      0.210      -0.012       0.055
X              1.9540      0.031     63.754      0.000       1.893       2.015
==============================================================================
Omnibus:                        0.900   Durbin-Watson:                   2.285
Prob(Omnibus):                  0.638   Jarque-Bera (JB):                0.808
Skew:                           0.217   Prob(JB):                        0.668
Kurtosis:                       2.929   Cond. No.                         4.18
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

          X         y    y_pred
0  0.374540  0.757785  0.753370
1  0.950714  1.871528  1.879227
2  0.731994  1.473164  1.451842
3  0.598658  0.998560  1.191302
4  0.156019  0.290070  0.326374

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.976
Model:                            OLS   Adj. R-squared:                  0.976
Method:                 Least Squares   F-statistic:                     4065.
Date:                Sun, 25 Aug 2024   Prob (F-statistic):           1.35e-81
Time:                        23:46:40   Log-Likelihood:                 99.112
No. Observations:                 100   AIC:                            -194.2
Df Residuals:                      98   BIC:                            -189.0
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0215      0.017      1.263      0.210      -0.012       0.055
X              1.9540      0.031     63.754      0.000       1.893       2.015
==============================================================================
Omnibus:                        0.900   Durbin-Watson:                   2.285
Prob(Omnibus):                  0.638   Jarque-Bera (JB):                0.808
Skew:                           0.217   Prob(JB):                        0.668
Kurtosis:                       2.929   Cond. No.                         4.18
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

This output is the summary of an Ordinary Least Squares (OLS) linear regression model.

Let’s break down the key parts of the output:

1. Model Information

  • Dep. Variable: y

    • The dependent variable being predicted (in this case, y).
  • Model: OLS

    • The type of model used (Ordinary Least Squares regression).
  • Method: Least Squares

    • The method used to estimate the coefficients of the model.
  • No. Observations: 100

    • The number of observations (data points) used in the model.
  • Df Residuals: 98

    • The degrees of freedom of the residuals (number of observations minus the number of estimated parameters, including the intercept).
  • Df Model: 1

    • The degrees of freedom of the model (number of estimated parameters excluding the intercept).
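
If you prefer to read these quantities programmatically rather than off the printed summary, the fitted results object exposes them as attributes; a small sketch using the model fitted above:

print(model.nobs)      # number of observations (100)
print(model.df_resid)  # residual degrees of freedom (98)
print(model.df_model)  # model degrees of freedom (1)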

2. Statistical Measures

  • R-squared: 0.976

    • This is the coefficient of determination.
      It indicates that 97.6% of the variance in the dependent variable y is explained by the independent variable X.
      A value close to 1 indicates a good fit.
  • Adj. R-squared: 0.976

    • The adjusted R-squared accounts for the number of predictors in the model.
      It’s also high, which confirms the model fits well.
  • F-statistic: 4065.0

    • This is the test statistic for the overall significance of the model.
      A high F-statistic suggests that the model is statistically significant.
  • Prob (F-statistic): 1.35e-81

    • The p-value associated with the F-statistic. A very small value (much less than 0.05) indicates strong evidence against the null hypothesis, suggesting that the model is statistically significant.
  • Log-Likelihood: 99.112

    • A measure of model fit. Higher values indicate a better fit.
  • AIC (Akaike Information Criterion): -194.2

    • A lower AIC suggests a better model.
      It balances model fit with the number of parameters to avoid overfitting.
  • BIC (Bayesian Information Criterion): -189.0

    • Similar to AIC but with a stronger penalty for models with more parameters.
      Lower is better.
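
All of these measures are also available as attributes on the fitted results object, which is convenient for comparing models in code; a short sketch:

print(model.rsquared)               # R-squared (about 0.976)
print(model.rsquared_adj)           # adjusted R-squared
print(model.fvalue, model.f_pvalue) # F-statistic and its p-value
print(model.llf)                    # log-likelihood
print(model.aic, model.bic)         # information criteria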

3. Coefficients Table

  • coef:

    • The estimated coefficients for the model.
    • const: The intercept is 0.0215.
    • X: The slope is 1.9540, meaning that for every one-unit increase in X, y increases by about 1.954 units.
  • std err:

    • The standard error of the coefficient estimate.
      Smaller values indicate more precise estimates.
  • t:

    • The t-statistic for the hypothesis test that the coefficient is zero.
      For X, it is 63.754, indicating that X is a significant predictor.
  • P>|t|:

    • The p-value for the t-test.
      A p-value less than 0.05 indicates that the coefficient is significantly different from zero.
    • For X, the displayed p-value is 0.000 (i.e., below display precision), indicating it is highly significant.
  • [0.025, 0.975]:

    • The 95% confidence interval for the coefficients.
      For X, the true slope is likely between 1.893 and 2.015.
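
The coefficient table can likewise be pulled out as pandas objects if you want to post-process the estimates; a brief sketch:

print(model.params)     # estimated coefficients (const and X)
print(model.bse)        # standard errors
print(model.tvalues)    # t-statistics
print(model.pvalues)    # p-values
print(model.conf_int()) # 95% confidence intervals by default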

4. Model Diagnostics

  • Omnibus: 0.900, Prob(Omnibus): 0.638

    • These tests check for normality of the residuals.
      A p-value greater than 0.05 means the test finds no evidence against normality of the residuals (which is what we want).
  • Jarque-Bera (JB): 0.808, Prob(JB): 0.668

    • Another test for normality.
      As with the Omnibus test, a p-value above 0.05 gives no evidence that the residuals deviate from a normal distribution.
  • Skew: 0.217

    • The skewness of the residuals.
      A value close to zero suggests symmetry.
  • Kurtosis: 2.929

    • Kurtosis measures the “tailedness” of the distribution. A value close to 3 indicates normal kurtosis (similar to a normal distribution).
  • Durbin-Watson: 2.285

    • This statistic tests for autocorrelation in the residuals.
    • A value around 2 suggests that there is little or no autocorrelation (which is good).
  • Cond. No.: 4.18

    • The condition number is a gauge of multicollinearity.
      Values above 30 may indicate problematic multicollinearity, but 4.18 is quite low, indicating no issues here.
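
Several of these diagnostics can be recomputed directly from the residuals, for example with helpers from statsmodels.stats.stattools; a sketch:

from statsmodels.stats.stattools import durbin_watson, jarque_bera

print(durbin_watson(model.resid))  # around 2 means little autocorrelation
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(model.resid)
print(jb_stat, jb_pvalue, skew, kurtosis)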

5. Predictions

  • y_pred:
    • These are the predicted values of y based on the fitted model.
      The table at the bottom shows the first few predictions alongside the actual y values and the X values.
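
A quick way to sanity-check the predictions is to look at the residuals (actual minus predicted); a minimal sketch:

residuals = df['y'] - df['y_pred']       # equals model.resid for in-sample data
print(residuals.abs().mean())            # mean absolute error on the training data
print(np.sqrt((residuals ** 2).mean()))  # root mean squared error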

Conclusion:

This linear regression model fits the data well, with a high R-squared and a highly significant slope coefficient (the intercept is not significantly different from zero, consistent with how the data were generated).

The residuals appear to be normally distributed, and there is no evidence of autocorrelation or multicollinearity.