A Basic Guide to Linear Regression Using Statsmodels in Python
Here’s a basic, practical example of how to use statsmodels for linear regression in Python.
This example demonstrates how to fit a linear regression model, check the summary, and make predictions.
Linear Regression Example with statsmodels
    import numpy as np
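A minimal, self-contained sketch of the kind of example the explanation below walks through; the random seed, noise level, sample size, and exact variable values are illustrative assumptions:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Generate synthetic data: y depends linearly on X plus Gaussian noise
    np.random.seed(0)
    X = np.random.rand(100)
    y = 2 * X + np.random.normal(scale=0.1, size=100)

    # Store the data in a pandas DataFrame
    df = pd.DataFrame({"X": X, "y": y})

    # Add a constant column of ones so the model includes an intercept
    X_with_const = sm.add_constant(df["X"])

    # Define the OLS model and estimate its coefficients
    model = sm.OLS(df["y"], X_with_const).fit()

    # Detailed summary: R-squared, coefficients, p-values, etc.
    print(model.summary())

    # Predictions from the fitted model
    y_pred = model.predict(X_with_const)

    # Equivalent fit using the formula API (R-style syntax)
    formula_model = smf.ols("y ~ X", data=df).fit()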
Explanation:
Generating Data:
- We create synthetic data where y is linearly dependent on X with some added noise.
DataFrame:
- We store the data in a Pandas DataFrame.
Adding a Constant:
- In linear regression, we often include an intercept. sm.add_constant() adds a column of ones to X to account for this.
Fitting the Model:
- We use sm.OLS() to define the Ordinary Least Squares (OLS) regression model and .fit() to estimate the coefficients.
Summary:
- model.summary() provides a detailed summary of the model, including R-squared, coefficients, p-values, etc.
Prediction:
- After fitting the model, we use it to make predictions with .predict().
Formula API:
- You can also use smf.ols() with a formula string, similar to R’s syntax.
This example covers basic regression, but statsmodels also offers more advanced models, such as logistic regression, time series analysis (ARIMA), and more.
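For instance, logistic regression can be fit with almost the same formula-API call; a minimal sketch, assuming a hypothetical binary outcome called passed:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Illustrative binary outcome derived from a noisy threshold on X
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"X": rng.random(100)})
    df["passed"] = (df["X"] + rng.normal(scale=0.2, size=100) > 0.5).astype(int)

    # Logistic regression with an R-style formula
    logit_model = smf.logit("passed ~ X", data=df).fit()
    print(logit_model.summary())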
Explanation of the output
Output:
    OLS Regression Results
This output is the summary of an Ordinary Least Squares (OLS) linear regression model.
Let’s break down the key parts of the output:
1. Model Information
Dep. Variable: y
- The dependent variable being predicted (in this case, y).
Model: OLS
- The type of model used (Ordinary Least Squares regression).
Method: Least Squares
- The method used to estimate the coefficients of the model.
No. Observations: 100
- The number of observations (data points) used in the model.
Df Residuals: 98
- The degrees of freedom of the residuals (the number of observations minus the number of estimated parameters, including the intercept).
Df Model: 1
- The degrees of freedom of the model (the number of estimated parameters, excluding the intercept).
2. Statistical Measures
R-squared: 0.976
- This is the coefficient of determination. It indicates that 97.6% of the variance in the dependent variable y is explained by the independent variable X. A value close to 1 indicates a good fit.
Adj. R-squared: 0.976
- The adjusted R-squared accounts for the number of predictors in the model. It is also high, which confirms the model fits well.
F-statistic: 4065.0
- This is the test statistic for the overall significance of the model. A high F-statistic suggests that the model is statistically significant.
Prob (F-statistic): 1.35e-81
- The p-value associated with the F-statistic. A very small value (much less than 0.05) indicates strong evidence against the null hypothesis, suggesting that the model is statistically significant.
Log-Likelihood: 99.112
- A measure of model fit. Higher values indicate a better fit.
AIC (Akaike Information Criterion): -194.2
- A lower AIC suggests a better model. It balances model fit with the number of parameters to avoid overfitting.
BIC (Bayesian Information Criterion): -189.0
- Similar to AIC but with a stronger penalty for models with more parameters. Lower is better.
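These fit statistics can likewise be read directly off the results object; again assuming the fitted results are in model:

    # Fit statistics from the OLS results object (illustrative)
    print(model.rsquared)      # R-squared
    print(model.rsquared_adj)  # adjusted R-squared
    print(model.fvalue)        # F-statistic
    print(model.f_pvalue)      # Prob (F-statistic)
    print(model.llf)           # log-likelihood
    print(model.aic, model.bic)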
3. Coefficients Table
coef:
- The estimated coefficients for the model.
  - const: The intercept is 0.0215.
  - X: The slope is 1.9540, meaning that for every one-unit increase in X, y increases by about 1.954 units.
std err:
- The standard error of the coefficient estimate. Smaller values indicate more precise estimates.
t:
- The t-statistic for the hypothesis test that the coefficient is zero. For X, it is 63.754, indicating that X is a significant predictor.
P>|t|:
- The p-value for the t-test. A p-value less than 0.05 indicates that the coefficient is significantly different from zero. For X, the p-value is 0.000, indicating it is highly significant.
[0.025, 0.975]:
- The 95% confidence interval for the coefficients. For X, the true slope is likely between 1.893 and 2.015.
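The pieces of this table are exposed as attributes of the results object as well; for example, with the fitted model from the sketch above:

    # Coefficient table components from the OLS results (illustrative)
    print(model.params)      # estimated coefficients (const and X)
    print(model.bse)         # standard errors
    print(model.tvalues)     # t-statistics
    print(model.pvalues)     # p-values
    print(model.conf_int())  # 95% confidence intervals by default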
4. Model Diagnostics
Omnibus: 0.900, Prob(Omnibus): 0.638
- These tests check for normality of the residuals. A p-value greater than 0.05 suggests that the residuals are normally distributed (which is good).
Jarque-Bera (JB): 0.808, Prob(JB): 0.668
- Another test for normality. As with the Omnibus test, a p-value above 0.05 indicates that the residuals follow a normal distribution.
Skew: 0.217
- The skewness of the residuals. A value close to zero suggests symmetry.
Kurtosis: 2.929
- Kurtosis measures the “tailedness” of the distribution. A value close to 3 indicates normal kurtosis (similar to a normal distribution).
Durbin-Watson: 2.285
- This statistic tests for autocorrelation in the residuals. A value around 2 suggests that there is no autocorrelation (which is good).
Cond. No.: 4.18
- The condition number checks for multicollinearity. Values above 30 may indicate problematic multicollinearity, but 4.18 is quite low, indicating no issues here.
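Some of these diagnostics can be recomputed from the residuals with statsmodels helper functions; a minimal sketch, again using the fitted model from the example above:

    from statsmodels.stats.stattools import durbin_watson, jarque_bera

    resid = model.resid

    # Durbin-Watson statistic for autocorrelation in the residuals
    print(durbin_watson(resid))

    # Jarque-Bera normality test: statistic, p-value, skewness, kurtosis
    jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(resid)
    print(jb_stat, jb_pvalue, skew, kurtosis)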
5. Predictions
y_pred:
- These are the predicted values of y based on the fitted model. The table at the bottom shows the first few predictions alongside the actual y values and the X values.
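Such a comparison table can be assembled directly from the fitted model; a short sketch, assuming the df, X_with_const, and model objects from the example sketch above:

    # Compare the first few predictions with the actual values (illustrative)
    y_pred = model.predict(X_with_const)
    comparison = pd.DataFrame({"X": df["X"], "y": df["y"], "y_pred": y_pred})
    print(comparison.head())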
Conclusion:
This linear regression model fits the data well, with a high R-squared and statistically significant coefficients.
The residuals appear to be normally distributed, and there is no evidence of autocorrelation or multicollinearity.