
### <p style="text-align: right;"> ✅ Put your name here</p>

# __CMSE 201 – Spring 2021__

<img src="https://cmse.msu.edu/sites/_cmse/assets/Image/image002.jpg" alt="CMSE Logo" align="right" height="100" width="100" />

# Homework 4: Fitting data using NumPy and SciPy

## Goals

### By the end of the homework assignment you will have practiced:

1. Loading in data

2. Polynomial fitting using NumPy

3. Fitting curves using SciPy

4. Plotting data and the corresponding best-fit results

5. Calculating residuals when fitting models to data

## Assignment instructions

Work through the following assignment, making sure to follow all of the directions and answer all of the questions.

**This assignment is due at 11:59pm on Friday, March 26.**

It should be uploaded to D2L in the appropriate "Homework" submission folder. Submission instructions can be found at the end of the notebook as well.

## Grading

- Academic Integrity (1 pt)

- Part 1 (20 pts)

- Part 2 (24 pts)

**Total:** 45 pts

---

# Academic integrity statement (1 point)

In the markdown cell below, put your personal academic integrity statement (composed during the Day04 In-Class Assignment). By including this statement, you are confirming that the work you submit in the assignment is wholly your own.

<font size=6 color="#009600">✎</font> *Put your personal academic integrity statement here.*

---

## Part 1: Fitting Piece-wise Linear Functions to Data

In this problem, we will look at fitting more unusual functions to data.

### ✅ Question 1.1: Setting up your Python modules (3 points)

In this homework, you will be mainly using Matplotlib, Pandas, NumPy, and SciPy's `curve_fit` function. Make sure to include all of the important `import` statements here.

```python
# Load needed modules here
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import matplotlib.dates as mdates
```

### ✅ Question 1.2: Generate and Plot Data (4 points)

Consider a function $y=f(x)$ such that $f(x) = 2x$ for $0 \leq x \leq 0.5$ and $f(x)=2-2x$ for $0.5 \leq x \leq 1$. Plot this function over 200 equispaced points ($x$) in the range of $x = $ 0 to 1 (end points included). Then, generate a noisy version of this function by adding Gaussian noise with mean zero and standard deviation 0.2 to all of the $y$ values from this function. Essentially, make $y_{noisy} = f(x) \, + $ Gaussian($x$). Plot the noisy data (versus $x$ values) in a new plot. Label all axes on the plots.

```python
# Generate 200 equispaced points
x = np.linspace(0, 1, 200)

# Generate y according to the piecewise definition of f(x)
y = np.zeros(200)
for i in range(len(x)):
    xi = x[i]
    if xi >= 0 and xi <= 0.5:
        y[i] = 2*xi
    else:
        y[i] = 2 - 2*xi

# Generate noisy version: add Gaussian noise with mean 0 and std 0.2
y_noisy = y + np.random.normal(0, 0.2, 200)

# Plot both signals side by side
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(13, 5))
axes[0].plot(x, y)
axes[0].set_title('Normal Signal')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y(x)')
axes[1].plot(x, y_noisy)
axes[1].set_title('Noisy Signal')
axes[1].set_xlabel('x')
axes[1].set_ylabel('y_noisy(x)')
plt.show()
```

### ✅ Question 1.3: Polynomial Fitting (5 points)

How would you use NumPy's `polyfit` function to fit a model like $f(x)$ (i.e., polynomial in two pieces) to the corrupted or noisy data? Explain your approach. Use only first order polynomials for this. Print all the fitted model parameters. Then use `poly1d` to help **plot the data together with the best fit model**. Include a legend to make it clear what information represents the data and what information represents the best fit. *Hint: More than one type of polynomial may offer a good fit to this type of data.*

<font size=+3>✎</font> *Looking at the graph, the data could plausibly belong to two kinds of curve: a triangular wave or a trigonometric one (half a cycle of a sine or cosine). However, since a first-order (linear) fit is required, we opt for the triangular interpretation. To do this, we can split the data into two halves and perform two separate linear fits.*

```python
# Put your pseudocode and code here
# Split the data at x = 0.5 and fit a first-order polynomial to each half
x1 = x[:len(x)//2]
y1 = y_noisy[:len(y_noisy)//2]
x2 = x[len(x)//2:]
y2 = y_noisy[len(y_noisy)//2:]

z1 = np.polyfit(x1, y1, 1)
z2 = np.polyfit(x2, y2, 1)

# Build the fitted functions
p1 = np.poly1d(z1)
p2 = np.poly1d(z2)

print("Estimated functions by NumPy's polyfit are:")
print("f1(x) = {0:.8f}x{1:+.8f}".format(z1[0], z1[1]))
print("f2(x) = {0:.8f}x{1:+.8f}".format(z2[0], z2[1]))

# Evaluate and plot
plt.plot(x, y_noisy, label='Data')
plt.plot(x1, p1(x1), label='Fit of first half')
plt.plot(x2, p2(x2), label='Fit of second half')
plt.legend()
plt.xlabel('x')
plt.ylabel('f(x)')
plt.grid(True)
plt.show()
```

---

### ✅ Question 1.4: Using Curve Fit for the data (7 points)

Write a function called `fitting_function` that takes the following arguments: $x$, $m_1$, $b_1$, $m_2$, $b_2$, where $x$ is a NumPy array and the rest are all floating point parameter values corresponding to the slopes of the lines/pieces and their intercepts. The function should return an array with values $f(x) = m_1 x + b_1$ when $0 \leq x \leq 0.5$ and $f(x) = m_2 x + b_2$ when $0.5 < x \leq 1$.

Use SciPy's `curve_fit` function to find the best fit model parameters for the corrupted data using your `fitting_function`. **Plot the data and the best fit. Label all axes and include a legend. Also print the resulting parameters.**

```python
# Put your pseudocode and code here
def fitting_function(x, m1, b1, m2, b2):
    """Piecewise-linear model: m1*x + b1 for x <= 0.5, m2*x + b2 otherwise."""
    y = np.zeros(len(x))
    for i in range(len(x)):
        xi = x[i]
        if xi <= 0.5:
            y[i] = m1*xi + b1
        else:
            y[i] = m2*xi + b2
    return y

popt, pcov = curve_fit(fitting_function, x, y_noisy)

print("Estimated functions by SciPy's curve_fit are:")
print("f1(x) = {0:.8f}x{1:+.8f}".format(popt[0], popt[1]))
print("f2(x) = {0:.8f}x{1:+.8f}".format(popt[2], popt[3]))

# Plot noisy data and fitted model
plt.plot(x, y_noisy, label='Original')
plt.plot(x, fitting_function(x, *popt), label='Fitted')
plt.grid(True)
plt.legend()
plt.xlabel('x')
plt.ylabel('y(x)')
plt.show()
```

---

### ✅ Question 1.5: How good are the estimates? (1 point)

Comment on how close the parameter estimates in Questions 1.3 and 1.4 are to the values used to generate the data.

<font size=+3>✎</font> *The parameters calculated in both questions are printed above. Both pairs of estimates (the two slopes and the two y-intercepts) are very similar: the only difference appears in the parameter m1, and only at the eighth decimal place. A difference at the eighth decimal is not significant, so for practical purposes the two methods recover the same parameters.*
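To make "close" concrete, the estimates can also be checked numerically against the generating values ($m_1 = 2$, $b_1 = 0$, $m_2 = -2$, $b_2 = 2$). A self-contained sketch with a fresh, seeded noisy sample (so the exact numbers will differ slightly from the cells above):

```python
import numpy as np

# Regenerate the piecewise data with a fixed seed so the check is reproducible
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y_true = np.where(x <= 0.5, 2*x, 2 - 2*x)
y_noisy = y_true + rng.normal(0, 0.2, x.size)

half = x.size // 2
z1 = np.polyfit(x[:half], y_noisy[:half], 1)   # true values: slope 2, intercept 0
z2 = np.polyfit(x[half:], y_noisy[half:], 1)   # true values: slope -2, intercept 2

print("first half:  slope %.3f, intercept %.3f" % tuple(z1))
print("second half: slope %.3f, intercept %.3f" % tuple(z2))
```

With 100 points per piece and noise of standard deviation 0.2, the fitted slopes typically land within a few tenths of the true values.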

---

## Part 2: Fitting Models to Real Data

In this problem, we will look at fitting models to real weather data. The data is available at [Fivethirtyeight.](https://github.com/fivethirtyeight/data/tree/master/us-weather-history) Download and check out the file `KMDW.csv` from D2L. You will use your model-fitting skills to see if you can find a good mathematical model for how the data behaves.

### ✅ Question 2.1: Reading and inspecting the data (3 points)

Read in the data in `KMDW.csv` using Pandas. Print out a few rows of the data as well as the summary statistics of the data (mean, standard deviation, etc.) to get a sense of the data.

```python
df = pd.read_csv('KMDW.csv')
print(df.head())  # a few rows of the data
df.describe()     # summary statistics (mean, standard deviation, etc.)
```

### ✅ Question 2.2: Convert date format (2 points)

The `date` column has the year-month-day format. Convert this to the `ordinal` format so you have a single number associated with each date. Add this as a new column to the dataframe. We will use this as the independent variable for regression.

```python
# First, copy the original 'date' column into a new one named 'date_ordinal'
df['date_ordinal'] = df['date']

# The column 'date_ordinal' contains dates in string format,
# so we need to convert them to datetime first
df['date_ordinal'] = pd.to_datetime(df['date_ordinal'])

# Now, convert to ordinal
df['date_ordinal'] = df['date_ordinal'].apply(lambda d: d.toordinal())
df.head()
```
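As a side note (an equivalent alternative, not required by the assignment), the copy, `to_datetime`, and `toordinal` steps can be collapsed into a single expression. A self-contained sketch on a hypothetical three-row frame standing in for the `date` column of `KMDW.csv`:

```python
import pandas as pd

# Hypothetical mini-frame with dates in the same year-month-day format
df_demo = pd.DataFrame({'date': ['2014-7-1', '2014-7-2', '2014-7-3']})

# One-step conversion: parse to datetime, then map each date to its ordinal
df_demo['date_ordinal'] = pd.to_datetime(df_demo['date']).apply(lambda d: d.toordinal())
print(df_demo)
```

Consecutive days differ by exactly 1 in the ordinal column, which is what makes it a convenient numeric independent variable for regression.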

### ✅ Question 2.3: Polynomial fitting (7 points)

Using the NumPy `polyfit` function, fit the first 300 entries of the `actual_mean_temp`, `actual_min_temp`, and `actual_max_temp` columns of the dataframe with polynomials (one for each column). This will be for dates 7/1/2014 through 4/26/2015. Choose a degree for the polynomial and justify the choice. Use `poly1d` to **plot the data together with the best fit models**. Create three different plots (one for each different temperature statistic) and label all axes and include legends to make clear what information represents the data and what information represents the best fit. The dates should all show up in the yyyy-mm-dd format on the x-axes. Rotate the x-ticks for better display. When successful, one of the plots should look something like this:

<img src="https://raw.githubusercontent.com/msu-cmse-courses/cmse201-S21-student/master/assets/img/mean_temp_vs_data.png">

```python
N = 300  # Number of points to pick. For Question 2.5, just change this value to 200, 100, etc.

# Pick the first N entries
x = df[:N]['date_ordinal']
y1 = df[:N]['actual_mean_temp']
y2 = df[:N]['actual_min_temp']
y3 = df[:N]['actual_max_temp']

# Real dates for the x-axis, so ticks show in yyyy-mm-dd format
x_dates = pd.to_datetime(df[:N]['date'])

# Fit 3rd-degree polynomials (one per column) on the ordinal dates
z1 = np.polyfit(x, y1, 3)
z2 = np.polyfit(x, y2, 3)
z3 = np.polyfit(x, y3, 3)
p1 = np.poly1d(z1)
p2 = np.poly1d(z2)
p3 = np.poly1d(z3)

# Plot the data together with the best-fit models
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(10, 15))
for ax, y, p, label in zip(axes, (y1, y2, y3), (p1, p2, p3),
                           ('Mean Temperatures', 'Min Temperatures', 'Max Temperatures')):
    ax.scatter(x_dates, y, label='data')
    ax.plot(x_dates, p(x), label='fitted', color='red')
    ax.legend()
    ax.set_ylabel(label)
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
    ax.tick_params(axis='x', rotation=45)
    ax.grid()
plt.show()
```

<font size=+3>✎</font> *At first, I was thinking of fitting the data with a second-order polynomial (a parabola), but then I noticed that the curves have inflection points, so I thought it would be better to fit a 3rd-order polynomial.*

### ✅ Question 2.4: How does the model predict forward in time? (3 points)

Use the models learned above for just the initial 300 days for the `actual_mean_temp`, `actual_min_temp`, and `actual_max_temp` columns to predict the temperatures for the remaining dates not used during regression; Essentially, predict the temperatures for 4/27/2015 to the end of the dataset. **Now plot the entire 365 days of actual temperatures together with the models' predictions**. Create three different plots corresponding to the three columns and label all axes and include legends on all plots. The dates should show up in the yyyy-mm-dd format on the x-axes. Rotate the x-ticks for better display. How does the choice of polynomial degree affect the forward predictions?

```python
# Now predict for the rest of the dataset
x_rest = df[N:]['date_ordinal']
y1_rest = df[N:]['actual_mean_temp']
y2_rest = df[N:]['actual_min_temp']
y3_rest = df[N:]['actual_max_temp']

# Real dates for the x-axis
x_dates_rest = pd.to_datetime(df[N:]['date'])

# Plot the held-out data together with the models' forward predictions
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(10, 15))
for ax, y, p, label in zip(axes, (y1_rest, y2_rest, y3_rest), (p1, p2, p3),
                           ('Mean Temperatures', 'Min Temperatures', 'Max Temperatures')):
    ax.scatter(x_dates_rest, y, label='data')
    ax.plot(x_dates_rest, p(x_rest), label='fitted', color='red')
    ax.legend()
    ax.set_ylabel(label)
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
    ax.tick_params(axis='x', rotation=45)
    ax.grid()
plt.show()
```

<font size=+3>✎</font> *Now the fitted polynomials look poor: the predictions are very different from the actual temperatures. This is because a 3rd-order polynomial is too simple a model to extrapolate beyond the range it was fit to.*

### ✅ Question 2.5: Effect of the number of data points used to fit model parameters. (1 point)

Try varying the number of data points used to fit the model parameters, e.g., using only the first 100 days or 200 days, etc. for fitting, and then using the estimated models to predict the temperatures for the remaining dates not used for fitting the model. What do you observe with regards to accuracy of future prediction with fewer or more data points to fit the model?

<font size=+3>✎</font> *Increasing or decreasing the number of points does not really improve the accuracy of the model on unseen data. Using more points does increase the accuracy over the first N points (where N is the number of points used for fitting), but not over the rest of the dataset.*
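The observation above can be sketched as a small stand-alone experiment. Since `KMDW.csv` may not be at hand here, this version uses synthetic seasonal "temperatures" (a cosine plus noise, an assumption made purely for illustration) and compares forward-prediction MSE for several values of N:

```python
import numpy as np

# Synthetic year of daily temperatures: seasonal cosine plus noise
rng = np.random.default_rng(1)
days = np.arange(365)
temps = 55 - 25*np.cos(2*np.pi*(days - 25)/365) + rng.normal(0, 5, 365)

# Fit a cubic on the first N days, then score it on the remaining days
results = {}
for N in (100, 200, 300):
    model = np.poly1d(np.polyfit(days[:N], temps[:N], 3))
    results[N] = np.mean((temps[N:] - model(days[N:]))**2)
    print(f"N = {N}: forward-prediction MSE = {results[N]:.1f}")
```

The in-sample fit improves as N grows, but the forward-prediction error stays large whenever the fitting window misses part of the seasonal turn, which matches the observation above.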

### ✅ Question 2.6: Fitting the Model with all the data. (4 points)

Now try to fit polynomial models (explain the choice of degree) to the entire 365 days worth of data. Do this for the `actual_mean_temp`, `actual_min_temp`, and `actual_max_temp` columns. **Then plot the entire 365 days of actual temperatures together with the new models' predictions**. Create three different plots and follow all plotting guidelines as before. How do these models seem to compare visually to the best fit ones in Question 2.4?

```python
# Fit to the entire 365 days of data
xAll = df['date_ordinal']
y1All = df['actual_mean_temp']
y2All = df['actual_min_temp']
y3All = df['actual_max_temp']

# Real dates for the x-axis
x_dates_all = pd.to_datetime(df['date'])

# Fit 3rd-degree polynomials to all the data
z1 = np.polyfit(xAll, y1All, 3)
z2 = np.polyfit(xAll, y2All, 3)
z3 = np.polyfit(xAll, y3All, 3)
p1All = np.poly1d(z1)
p2All = np.poly1d(z2)
p3All = np.poly1d(z3)

# Plot the data together with the new fits
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(10, 15))
for ax, y, p, label in zip(axes, (y1All, y2All, y3All), (p1All, p2All, p3All),
                           ('Mean Temperatures', 'Min Temperatures', 'Max Temperatures')):
    ax.scatter(x_dates_all, y, label='data')
    ax.plot(x_dates_all, p(xAll), label='fitted', color='red')
    ax.legend()
    ax.set_ylabel(label)
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
    ax.tick_params(axis='x', rotation=45)
    ax.grid()
plt.show()
```

<font size=+3>✎</font> *The model follows the overall trend of the data, but only because we are using all of the data. If we select a small portion of the data, there are short-term oscillations that the fitted polynomial does not capture.*

### ✅ Question 2.7: Quantifying Fitting Errors. (3 points)

We can also determine the quality of the fit by computing the *residuals*, $r_i$, as the difference between each known data point $y_i$ and the *expected* $y$ value from the best fit model evaluated at $x_i$, i.e., $$r_i = y_i - f(x_i).$$

The residuals can be aggregated over all data points and summarized with the mean squared error $$MSE=\frac{1}{N}\sum_{i=1}^{N}r_i^2.$$
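As a quick sanity check of these definitions, here is a tiny worked example on made-up numbers (four data points chosen only for illustration):

```python
import numpy as np

# Made-up data points y_i and model predictions f(x_i)
y = np.array([1.0, 2.0, 3.0, 4.0])
f_x = np.array([1.5, 2.0, 2.5, 4.0])

r = y - f_x          # residuals r_i = y_i - f(x_i)
mse = np.mean(r**2)  # MSE = (1/N) * sum(r_i**2)
print(mse)           # 0.125
```

Squaring the residuals before averaging keeps positive and negative errors from canceling each other out.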

Using Python, compute the mean squared error over the `actual_mean_temp` column using the model fitted to all 365 days of data and using the model estimated using only the first 300 days of data. Pick the same degree of polynomial (say 3, matching the fits above) for both cases. In which case are the residuals smaller (i.e., the fit better)?

```python
# Residuals for each model, evaluated on the data it was fit to
r1 = y1 - p1(x)           # model fit on the first 300 points
r2 = y1All - p1All(xAll)  # model fit on all the data

# Mean squared errors
MSE1 = np.mean(r1**2)
MSE2 = np.mean(r2**2)

print("The Mean Squared Error for the first 300 datapoints is: {0:.4f}".format(MSE1))
print("The Mean Squared Error for all datapoints is: {0:.4f}".format(MSE2))
```

<font size=+3>✎</font> *The fit over the first 300 points is better, with an MSE of 84.10, versus an MSE of 117.72 over all datapoints.*

### ✅ Question 2.8: Thinking more models. (1 point)

If the data were to be repeated over several years, what pattern would you expect to see in the data? Would a polynomial model make sense for predicting the temperatures over several years? What other functions would you use to model such data?

<font size=+3>✎</font> *Put your explanation here.*

---

## Assignment Wrap-up

Please fill out the following Google Form before you submit your assignment. **You must completely fill this out in order to receive credit for the assignment!**

**COMPLETE THIS SURVEY through [this link](https://forms.office.com/Pages/ResponsePage.aspx?id=MHEXIi9k2UGSEXQjetVofddd5T-Pwn1DlT6_yoCyuCFUQjRHUktCSEtQR0hLUVU0NkdWUlFSME9OMC4u) or through the cell below.**


Instructors, you can access the responses to the survey [here](https://forms.office.com/Pages/DesignPage.aspx?fragment=FormId%3DMHEXIi9k2UGSEXQjetVofddd5T-Pwn1DlT6_yoCyuCFUQjRHUktCSEtQR0hLUVU0NkdWUlFSME9OMC4u%26Token%3Daafe169eea6d4c18b22d4401ef23158b).

```python
from IPython.display import HTML
HTML(
"""
<iframe
    src="https://forms.office.com/Pages/ResponsePage.aspx?id=MHEXIi9k2UGSEXQjetVofRzD77LVqLNIpscq-NmZsrdUNENQVDBCRDhKVEowQ0dSOFVLWVNNS0Q3NS4u"
    width="80%"
    height="1200px"
    frameborder="0"
    marginheight="0"
    marginwidth="0">
Loading...
</iframe>
"""
)
```

---

### Congratulations, you're done!

Submit this assignment by uploading it to the course Desire2Learn web page.

Go to the "Homework Assignments" section, find the appropriate submission folder link, and upload it there.

© Copyright 2020, [Department of Computational Mathematics, Science and Engineering](https://cmse.msu.edu) at Michigan State University.