!pip install otter-grader
# Initialize Otter
import otter
grader = otter.Notebook("lab8.ipynb")
# Lab 8: Fitting Models to Data
In this lab, you will practice using the numerical optimization package `cvxpy` to compute solutions to optimization problems. The examples we will use are a linear fit and a quadratic fit.
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
## Objectives for Lab 8:
Fitting models to data is a common task in data science. In this lab, you will practice fitting models to data. The models you will fit are:
* Linear fit
* Quadratic fit
## Boston Housing Dataset
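# Note: `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2.
from sklearn.datasets import load_boston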
boston_dataset = load_boston()
print(boston_dataset['DESCR'])
housing = pd.DataFrame(boston_dataset['data'], columns=boston_dataset['feature_names'])
housing['MEDV'] = boston_dataset['target']
housing.head()
fig, ax = plt.subplots(figsize=(10, 7))
sns.scatterplot(x='LSTAT', y='MEDV', data=housing)
plt.show()
The model for the relationship between the response variable MEDV ($y$) and the predictor variable LSTAT ($u$) is
$$ y_i = \beta_0 + \beta_1 u_i + \epsilon_i, $$
where $\epsilon_i$ is random noise.
In order to fit the linear model to data, we minimize the sum of squared errors of all observations, $i=1,2,\dots,n$.
$$\begin{aligned}
&\min_{\beta} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 u_i )^2 = \min_{\beta} \sum_{i=1}^n (y_i - x_i^T \beta)^2 = \min_{\beta} \|y - X \beta\|_2^2
\end{aligned}$$
where $\beta = (\beta_0,\beta_1)^T$ and $x_i^T = (1, u_i)$. Here, $y = (y_1, y_2, \dots, y_n)^T$ and the $i$-th row of $X$ is $x_i^T$.
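The equality of the summation form and the matrix-norm form of the objective can be checked numerically. The following is a minimal sketch on synthetic data; the arrays `u` and `y` and the candidate `beta` are made-up stand-ins for illustration, not the housing data or a fitted solution.

```python
import numpy as np

# Synthetic data for illustration only (not the housing data).
rng = np.random.default_rng(0)
n = 50
u = rng.uniform(0, 30, size=n)                  # stand-in predictor
y = 35.0 - 1.0 * u + rng.normal(0, 2, size=n)   # stand-in response with noise

# Design matrix: the i-th row is x_i^T = (1, u_i).
X = np.column_stack([np.ones(n), u])

# An arbitrary candidate coefficient vector (beta_0, beta_1).
beta = np.array([30.0, -0.8])

# Summation form: sum over i of (y_i - beta_0 - beta_1 * u_i)^2.
sse_sum = sum((y[i] - beta[0] - beta[1] * u[i]) ** 2 for i in range(n))

# Matrix form: squared Euclidean norm of the residual vector y - X beta.
sse_norm = np.linalg.norm(y - X @ beta) ** 2

print(np.isclose(sse_sum, sse_norm))
```

Both forms evaluate the same loss for any choice of `beta`; the optimizer's job is to find the `beta` that makes it smallest.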
## Question 1: Constructing Data Variables
Define $y$ and $X$ from `housing` data.
y = housing['MEDV']
X1 = housing['LSTAT'].to_frame()
# Prepend a column of ones so that beta_0 acts as the intercept.
X1.insert(0, 'intercept', np.ones(len(y)))
grader.check("q1")
## Installing CVXPY
First, install the `cvxpy` package by running the following shell command:
!pip install cvxpy
## Question 2: Fitting Linear Model to Data
Read this example of how a cvxpy problem is set up and solved: https://www.cvxpy.org/examples/basic/least_squares.html
The usage of cvxpy parallels our conceptual understanding of components in an optimization problem:
* `beta` is the variable $\beta$
* `loss` is the sum of squared errors
* `prob` minimizes the loss by choosing $\beta$
Make sure to extract the underlying data arrays of data frames (or series) using `values`, e.g., `X.values`.
import cvxpy as cp
beta2 = cp.Variable(2)
loss2 = cp.sum_squares(y.values-X1.values @ beta2)
prob2 = cp.Problem(cp.Minimize(loss2))
prob2.solve()
yhat2 = X1.values@beta2.value
grader.check("q2")
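One way to sanity-check an optimizer's answer for a least-squares problem is to compare it with a direct linear-algebra solution. The sketch below uses synthetic data (all names and values are illustrative assumptions) and shows that `np.linalg.lstsq` and the normal equations $X^T X \beta = X^T y$ give the same coefficients. On the lab data, `beta2.value` would be expected to agree with `np.linalg.lstsq(X1.values, y.values, rcond=None)[0]` up to solver tolerance.

```python
import numpy as np

# Synthetic data for illustration only (not the housing data).
rng = np.random.default_rng(1)
n = 100
u = rng.uniform(0, 30, size=n)
y = 34.0 - 0.95 * u + rng.normal(0, 3, size=n)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), u])

# Least-squares solution from NumPy's solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Same solution from the normal equations X^T X beta = X^T y.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_lstsq)
print(beta_normal)
```

The normal equations are fine for a small, well-conditioned problem like this one; `lstsq` is the more numerically robust choice in general.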
## Question 3: Visualizing resulting Linear Fit
Visualize the fitted model by plotting `MEDV` against `LSTAT`.
fig, ax = plt.subplots(figsize=(10, 7))
sns.scatterplot(x='LSTAT', y='MEDV', data=housing, ax = ax, label='Data')
sns.scatterplot(x=housing['LSTAT'], y=yhat2, label='Fit', ax=ax)
plt.legend()
plt.show()
## Question 4: Fitting Quadratic Model to Data
Add a column of squared `LSTAT` values to `X`. The new model is
$$ y_i = \beta_0 + \beta_1 u_i + \beta_2 u_i^2 + \epsilon_i. $$
Then, fit the quadratic model to the data.
X2 = X1.copy()
X2.insert(2, 'LSTAT^2', X2['LSTAT']**2)
beta4 = cp.Variable(3)
loss4 = cp.sum_squares(y.values-X2.values @ beta4)
prob4 = cp.Problem(cp.Minimize(loss4))
prob4.solve()
yhat4 = X2.values@beta4.value
grader.check("q4a")
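Because the quadratic model contains the linear model as the special case where the coefficient on the squared term is zero, its minimized sum of squared errors can never exceed that of the linear model on the same data. The sketch below illustrates this on synthetic data; the helper `sse` and all data values are hypothetical, introduced only for this example.

```python
import numpy as np

# Synthetic data with mild curvature, for illustration only.
rng = np.random.default_rng(2)
n = 100
u = rng.uniform(0, 30, size=n)
y = 40.0 - 2.0 * u + 0.04 * u**2 + rng.normal(0, 3, size=n)

# Design matrices: columns (1, u) and (1, u, u^2).
X_lin = np.column_stack([np.ones(n), u])
X_quad = np.column_stack([np.ones(n), u, u**2])

def sse(X, y):
    """Return the minimized sum of squared errors for design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

sse_lin = sse(X_lin, y)
sse_quad = sse(X_quad, y)

# The quadratic fit is at least as good as the linear fit in-sample.
print(sse_quad <= sse_lin)
```

A lower in-sample error does not by itself mean the quadratic model predicts better on new data; it only reflects the extra flexibility of the added column.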
Visualize quadratic fit:
fig, ax = plt.subplots(figsize=(10, 7))
sns.scatterplot(x='LSTAT', y='MEDV', data=housing, ax = ax, label='Data')
sns.scatterplot(x=housing['LSTAT'], y=yhat4, label='Fit', ax=ax)
plt.legend()
plt.show()
---
To double-check your work, the cell below will rerun all of the autograder tests.
grader.check_all()
## Submission
Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**
# Save your notebook first, then run this cell to export your submission.
grader.export()