A Basic Guide to Linear Regression Using Statsmodels in Python

Here’s a basic but useful example of how to use statsmodels for linear regression in Python.

This example demonstrates how to fit a linear regression model, check the summary, and make predictions.

Linear Regression Example with statsmodels

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Generate synthetic data
np.random.seed(42)
n = 100
X = np.random.rand(n)
y = 2 * X + np.random.randn(n) * 0.1 # y = 2*X + noise

# Create a DataFrame
df = pd.DataFrame({
'X': X,
'y': y
})

# Add a constant (intercept) column to the independent variable
X_const = sm.add_constant(df['X'])

# Fit the linear regression model
model = sm.OLS(df['y'], X_const).fit()

# Print the summary of the model
print(model.summary())

# Predict using the model (in-sample predictions on the training data)
df['y_pred'] = model.predict(X_const)

# Print the first few predictions
print(df.head())

# If you prefer using formulas (like in R):
formula = 'y ~ X'
model_formula = smf.ols(formula=formula, data=df).fit()

# Print the summary of the model fitted using formulas
print(model_formula.summary())

Explanation:

  1. Generating Data:

    • We create synthetic data where y is linearly dependent on X with some added noise.
  2. DataFrame:

    • We store the data in a Pandas DataFrame.
  3. Adding a Constant:

    • In linear regression, we usually include an intercept.
      sm.add_constant() adds a column of ones to the regressors to account for this (see the short sketch after this list).
  4. Fitting the Model:

    • We use sm.OLS() to define the Ordinary Least Squares (OLS) regression model and .fit() to estimate the coefficients.
  5. Summary:

    • model.summary() provides a detailed summary of the model, including R-squared, coefficients, p-values, etc.
  6. Prediction:

    • After fitting the model, we use it to make predictions with .predict().
  7. Formula API:

    • You can also use smf.ols() with a formula string, similar to R’s syntax.
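
To make points 3 and 6 concrete, here is a short sketch that reuses the model and model_formula objects fitted above; the new_data values are made up purely for illustration:

# Inspect the design matrix produced by add_constant(): a 'const' column of ones plus X
print(sm.add_constant(df['X']).head())

# Predict for new, unseen X values with the array-based model
# (illustrative values; the column order must match the training design matrix: const, X)
new_data = pd.DataFrame({'X': [0.2, 0.5, 0.8]})
print(model.predict(sm.add_constant(new_data, has_constant='add')))

# With the formula interface, the intercept is handled automatically
print(model_formula.predict(new_data))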

This example covers basic regression, but statsmodels also offers more advanced models, such as time series analysis (ARIMA) and logistic regression.
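
As a taste of the latter, here is a minimal, hypothetical sketch of a logistic regression using the formula API on freshly generated synthetic data (the names z, outcome, and binary_df are illustrative and not part of the example above):

# Hypothetical sketch: logistic regression with smf.logit() on synthetic binary data
z = np.random.rand(n)
outcome = (2 * z + np.random.randn(n) * 0.5 > 1).astype(int)  # binary target
binary_df = pd.DataFrame({'z': z, 'outcome': outcome})

logit_model = smf.logit('outcome ~ z', data=binary_df).fit()
print(logit_model.summary())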

Explanation of the output

Output:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.976
Model:                            OLS   Adj. R-squared:                  0.976
Method:                 Least Squares   F-statistic:                     4065.
Date:                Sun, 25 Aug 2024   Prob (F-statistic):           1.35e-81
Time:                        23:46:40   Log-Likelihood:                 99.112
No. Observations:                 100   AIC:                            -194.2
Df Residuals:                      98   BIC:                            -189.0
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0215      0.017      1.263      0.210      -0.012       0.055
X              1.9540      0.031     63.754      0.000       1.893       2.015
==============================================================================
Omnibus:                        0.900   Durbin-Watson:                   2.285
Prob(Omnibus):                  0.638   Jarque-Bera (JB):                0.808
Skew:                           0.217   Prob(JB):                        0.668
Kurtosis:                       2.929   Cond. No.                         4.18
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

          X         y    y_pred
0  0.374540  0.757785  0.753370
1  0.950714  1.871528  1.879227
2  0.731994  1.473164  1.451842
3  0.598658  0.998560  1.191302
4  0.156019  0.290070  0.326374

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.976
Model:                            OLS   Adj. R-squared:                  0.976
Method:                 Least Squares   F-statistic:                     4065.
Date:                Sun, 25 Aug 2024   Prob (F-statistic):           1.35e-81
Time:                        23:46:40   Log-Likelihood:                 99.112
No. Observations:                 100   AIC:                            -194.2
Df Residuals:                      98   BIC:                            -189.0
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0215      0.017      1.263      0.210      -0.012       0.055
X              1.9540      0.031     63.754      0.000       1.893       2.015
==============================================================================
Omnibus:                        0.900   Durbin-Watson:                   2.285
Prob(Omnibus):                  0.638   Jarque-Bera (JB):                0.808
Skew:                           0.217   Prob(JB):                        0.668
Kurtosis:                       2.929   Cond. No.                         4.18
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

This output is the summary of an Ordinary Least Squares (OLS) linear regression model.

Let’s break down the key parts of the output:

1. Model Information

  • Dep. Variable: y

    • The dependent variable being predicted (in this case, y).
  • Model: OLS

    • The type of model used (Ordinary Least Squares regression).
  • Method: Least Squares

    • The method used to estimate the coefficients of the model.
  • No. Observations: 100

    • The number of observations (data points) used in the model.
  • Df Residuals: 98

    • The degrees of freedom of the residuals (number of observations minus the number of estimated parameters, including the intercept).
  • Df Model: 1

    • The degrees of freedom of the model (number of estimated parameters excluding the intercept).
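
If you prefer to read these quantities programmatically rather than off the printed summary, the fitted results object exposes them as attributes; a small sketch using the model fitted above:

print(model.nobs)      # number of observations (100)
print(model.df_resid)  # residual degrees of freedom (98)
print(model.df_model)  # model degrees of freedom (1)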

2. Statistical Measures

  • R-squared: 0.976

    • This is the coefficient of determination.
      It indicates that 97.6% of the variance in the dependent variable y is explained by the independent variable X.
      A value close to 1 indicates a good fit.
  • Adj. R-squared: 0.976

    • The adjusted R-squared accounts for the number of predictors in the model.
      It’s also high, which confirms the model fits well.
  • F-statistic: 4065.0

    • This is the test statistic for the overall significance of the model.
      A high F-statistic suggests that the model is statistically significant.
  • Prob (F-statistic): 1.35e-81

    • The p-value associated with the F-statistic. A very small value (much less than 0.05) indicates strong evidence against the null hypothesis, suggesting that the model is statistically significant.
  • Log-Likelihood: 99.112

    • A measure of model fit. Higher values indicate a better fit.
  • AIC (Akaike Information Criterion): -194.2

    • A lower AIC suggests a better model.
      It balances model fit with the number of parameters to avoid overfitting.
  • BIC (Bayesian Information Criterion): -189.0

    • Similar to AIC but with a stronger penalty for models with more parameters.
      Lower is better.
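
All of these measures are also available as attributes on the fitted results object, which is convenient for comparing models in code; a short sketch:

print(model.rsquared)               # R-squared (about 0.976)
print(model.rsquared_adj)           # adjusted R-squared
print(model.fvalue, model.f_pvalue) # F-statistic and its p-value
print(model.llf)                    # log-likelihood
print(model.aic, model.bic)         # information criteria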

3. Coefficients Table

  • coef:

    • The estimated coefficients for the model.
    • const: The intercept is 0.0215.
    • X: The slope is 1.9540, meaning that for every one-unit increase in X, y increases by about 1.954 units.
  • std err:

    • The standard error of the coefficient estimate.
      Smaller values indicate more precise estimates.
  • t:

    • The t-statistic for the hypothesis test that the coefficient is zero.
      For X, it is 63.754, indicating that X is a significant predictor.
  • P>|t|:

    • The p-value for the t-test.
      A p-value less than 0.05 indicates that the coefficient is significantly different from zero.
    • For X, the displayed p-value is 0.000 (i.e., below display precision), indicating it is highly significant.
  • [0.025, 0.975]:

    • The 95% confidence interval for the coefficients.
      For X, the true slope is likely between 1.893 and 2.015.
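
The coefficient table can likewise be pulled out as pandas objects if you want to post-process the estimates; a brief sketch:

print(model.params)     # estimated coefficients (const and X)
print(model.bse)        # standard errors
print(model.tvalues)    # t-statistics
print(model.pvalues)    # p-values
print(model.conf_int()) # 95% confidence intervals by default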

4. Model Diagnostics

  • Omnibus: 0.900, Prob(Omnibus): 0.638

    • These tests check for normality of the residuals.
      A p-value greater than 0.05 means the test finds no evidence against normality of the residuals (which is what we want).
  • Jarque-Bera (JB): 0.808, Prob(JB): 0.668

    • Another test for normality.
      As with the Omnibus test, a p-value above 0.05 gives no evidence that the residuals deviate from a normal distribution.
  • Skew: 0.217

    • The skewness of the residuals.
      A value close to zero suggests symmetry.
  • Kurtosis: 2.929

    • Kurtosis measures the “tailedness” of the distribution. A value close to 3 indicates normal kurtosis (similar to a normal distribution).
  • Durbin-Watson: 2.285

    • This statistic tests for autocorrelation in the residuals.
    • A value around 2 suggests that there is little or no autocorrelation (which is good).
  • Cond. No.: 4.18

    • The condition number is a gauge of multicollinearity.
      Values above 30 may indicate problematic multicollinearity, but 4.18 is quite low, indicating no issues here.
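
Several of these diagnostics can be recomputed directly from the residuals, for example with helpers from statsmodels.stats.stattools; a sketch:

from statsmodels.stats.stattools import durbin_watson, jarque_bera

print(durbin_watson(model.resid))  # around 2 means little autocorrelation
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(model.resid)
print(jb_stat, jb_pvalue, skew, kurtosis)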

5. Predictions

  • y_pred:
    • These are the predicted values of y based on the fitted model.
      The table at the bottom shows the first few predictions alongside the actual y values and the X values.
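
A quick way to sanity-check the predictions is to look at the residuals (actual minus predicted); a minimal sketch:

residuals = df['y'] - df['y_pred']       # equals model.resid for in-sample data
print(residuals.abs().mean())            # mean absolute error on the training data
print(np.sqrt((residuals ** 2).mean()))  # root mean squared error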

Conclusion:

This linear regression model fits the data well, with a high R-squared and a highly significant slope coefficient (the intercept is not significantly different from zero, consistent with how the data were generated).

The residuals appear to be normally distributed, and there is no evidence of autocorrelation or multicollinearity.