Least Squares and Regression with CVXPY

Least squares is a popular method for solving regression problems, where we want to fit a model to data by minimizing the sum of the squared differences (errors) between the observed values and the values predicted by the model.

In this example, we will demonstrate how to use CVXPY to solve a linear regression problem using least squares.

Problem Description: Linear Regression

We are given a dataset consisting of several observations.
Each observation includes:

  • A set of features (independent variables).
  • A target value (dependent variable).

Our goal is to find a linear relationship between the features and the target, meaning we want to find the best-fitting line (or hyperplane) that predicts the target based on the features.

Mathematically, we can express this as:
$$
y = X\beta + \epsilon
$$
Where:

  • $X$ is a matrix of features (each row corresponds to an observation and each column to a feature).
  • $y$ is a vector of observed target values.
  • $\beta$ is the vector of unknown coefficients we want to estimate.
  • $\epsilon$ represents the errors (differences between the predicted and observed values).

We want to find $\beta$ that minimizes the sum of squared errors:
$$
\min_\beta \| X\beta - y \|_2^2
$$
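
For reference, this unconstrained least squares problem also has a well-known closed-form solution. Assuming $X$ has full column rank, setting the gradient of the objective to zero yields the normal equations:
$$
X^\top X \, \hat{\beta} = X^\top y
\quad\Longrightarrow\quad
\hat{\beta} = (X^\top X)^{-1} X^\top y
$$
We will nevertheless solve the problem with CVXPY, since the same modeling pattern carries over unchanged if constraints or penalty terms are ever added.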

Step-by-Step Solution with CVXPY

Here is how to solve this least squares problem using CVXPY:

import numpy as np
import cvxpy as cp
import matplotlib.pyplot as plt

# Step 1: Generate synthetic data for a regression problem
np.random.seed(1)
n = 50 # Number of data points
m = 3 # Number of features

# Randomly generate feature matrix X and true coefficients beta_true
X = np.random.randn(n, m)
beta_true = np.array([3, -1, 2]) # True coefficients
y = X @ beta_true + 0.5 * np.random.randn(n) # Observed values with some noise

# Step 2: Define the variable (beta) for the regression coefficients
beta = cp.Variable(m)

# Step 3: Define the objective function (least squares error)
objective = cp.Minimize(cp.sum_squares(X @ beta - y))

# Step 4: Define and solve the problem
problem = cp.Problem(objective)
problem.solve()

# Step 5: Output the results
print("Status:", problem.status)
print("Estimated coefficients (beta):", beta.value)
print("True coefficients:", beta_true)

# Step 6: Visualize the data and the fitted values
y_pred = X @ beta.value
plt.scatter(range(n), y, color='blue', label='Observed')
plt.scatter(range(n), y_pred, color='red', label='Predicted')
plt.xlabel('Data Point')
plt.ylabel('Target Value')
plt.title('Least Squares Regression Fit')
plt.legend()
plt.show()
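
As a quick sanity check, the CVXPY estimate can be compared against NumPy's closed-form least squares solver, which computes the normal-equations solution discussed earlier. This is a minimal sketch that assumes the variables X, y, and beta from the script above are still in scope:

import numpy as np

# Closed-form least squares solution for comparison.
# np.linalg.lstsq returns (solution, residuals, rank, singular values);
# rcond=None selects the current default cutoff for small singular values.
beta_closed_form, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print("CVXPY estimate:         ", beta.value)
print("np.linalg.lstsq estimate:", beta_closed_form)
print("Max absolute difference: ", np.max(np.abs(beta.value - beta_closed_form)))

The two estimates should agree up to solver tolerance (typically to within about 1e-6).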

Detailed Explanation:

  1. Data Generation:
    In this example, we generate synthetic data for simplicity.
    The matrix $X$ contains 50 data points with 3 features, and we generate the target values $y$ using a known set of coefficients $\beta_{\text{true}} = [3, -1, 2]$ plus some noise to simulate real-world observations.

  2. Decision Variable:
    The unknowns are the regression coefficients $\beta$, represented as a CVXPY variable beta of size 3 (since we have 3 features).

  3. Objective Function:
    The goal is to minimize the sum of squared errors between the observed target values $y$ and the predicted values $X\beta$.
    In CVXPY, this is expressed with cp.sum_squares(X @ beta - y), which computes the sum of the squared residuals (an equivalent norm-based formulation is sketched just after this list).

  4. Problem Definition and Solution:
    The cp.Problem constructor defines the least squares optimization problem, and problem.solve() finds the optimal solution for $\beta$.

  5. Results:
    After solving the problem, the estimated coefficients $\beta$ are printed.
    These should be close to the true coefficients used to generate the data.
    We also visualize the observed values versus the predicted values to show how well the model fits the data.
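
As noted in step 3 above, the same fit can be expressed in an equivalent norm-based way: minimizing the 2-norm of the residual rather than its square. The minimizing $\beta$ is identical; only the reported objective value differs (it is the square root of the sum of squared errors). This is a small sketch that assumes X, y, and m from the script above:

import cvxpy as cp

# Equivalent formulation: minimize the residual norm itself rather than its square.
# The optimal beta is the same; only the objective value changes.
beta_alt = cp.Variable(m)
problem_alt = cp.Problem(cp.Minimize(cp.norm(X @ beta_alt - y, 2)))
problem_alt.solve()
print("Norm-based formulation estimate:", beta_alt.value)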

Output:

  • Estimated Coefficients: The estimated coefficients $\hat{\beta}$ are close to the true coefficients used to generate the data ($\beta_{\text{true}} = [3, -1, 2]$), which indicates a good fit.

Visualization of Results:

The following plot shows the observed values (in blue) and the predicted values (in red).
A good fit would show the predicted values closely following the observed ones.

  • Blue points: Actual (observed) target values.
  • Red points: Predicted target values based on the fitted model.
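
To complement the observed-versus-predicted scatter plot, a residual plot is a common additional diagnostic: for a good fit, the residuals should scatter randomly around zero with no visible pattern. A minimal sketch, assuming y, y_pred, and n from the script above:

import matplotlib.pyplot as plt

# Residuals: differences between observed and predicted target values.
residuals = y - y_pred

plt.scatter(range(n), residuals, color='green')
plt.axhline(0, color='black', linewidth=0.8)
plt.xlabel('Data Point')
plt.ylabel('Residual')
plt.title('Residuals of the Least Squares Fit')
plt.show()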

Interpretation:

  • Minimizing the Sum of Squared Errors:
    The least squares method minimizes the squared differences between the predicted and observed values.
    This produces the “best fit” line for the given data, in the sense that the total prediction error is minimized.

  • Optimal Solution:
    Since the status of the problem is optimal, CVXPY successfully found the regression coefficients that minimize the objective function (a minimal status-check sketch follows below).
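
Solvers can also report statuses such as infeasible or unbounded. That cannot happen for this unconstrained least squares problem, but it becomes possible once constraints are added, so checking the status before reading the solution is good practice. A minimal sketch, assuming problem and beta from the script above:

import cvxpy as cp

# Only read the solution if the solver reports an optimal (or near-optimal) status.
if problem.status in (cp.OPTIMAL, cp.OPTIMAL_INACCURATE):
    print("Optimal objective value:", problem.value)
    print("Estimated coefficients: ", beta.value)
else:
    print("Solver did not find an optimal solution; status:", problem.status)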

Conclusion:

In this example, we demonstrated how to solve a linear regression problem using least squares with CVXPY.
The least squares method is widely used in machine learning and statistics for fitting linear models to data.

By minimizing the sum of squared errors, we can estimate the coefficients that best explain the relationship between the features and the target values.