# Create a Program to Implement Polynomial Degree Selection in Python: Assignment Solution

## Instructions

Objective

Write a Python program to implement polynomial degree selection for a regression model.
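At its core, the task is to fit polynomials of increasing degree and compare their errors. The following is a minimal illustrative sketch using only NumPy and synthetic data — it is not part of the assignment's solution, and the curve and noise level are made up:

```python
import numpy as np

# Illustrative data only: a noisy quadratic standing in for real measurements
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x - 0.3 * x**2 + rng.normal(0, 0.5, 50)

# Fit polynomials of degree 1..5 and record the training RMSE of each
rmses = []
for degree in range(1, 6):
    coeffs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coeffs, x)
    rmses.append(np.sqrt(np.mean((y - y_hat) ** 2)))
    print(f"degree {degree}: training RMSE = {rmses[-1]:.4f}")
```

Note that the training RMSE can only decrease as the degree grows (higher-degree models contain the lower-degree ones), so picking the degree on training error alone favours overfitting; question 1.c of the assignment touches on the cross-validation remedy.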

## Requirements and Specifications

## Source Code

STUDENT NAME, STUDENT NUMBER (TO BE FILLED BY THE STUDENT)

# Advanced Data Analysis - Assignment 2

This notebook contains **Assignment 2** of the Advanced Data Analysis course. The assignment consists of performing linear regression on National Health and Nutrition Examination data.

### DEADLINE: 10-October-2021

The assignment is **individual**. You should submit your solution on Moodle by the deadline. While doing this assignment, you can use or adapt any code from the lectures if you want. Students have three grace days that they can use across all assignments and the group project, which allows them to deliver late. Use these grace days carefully.

### Notebook Instructions

* You only need to deliver this notebook file (note that a notebook file has the extension .ipynb) - data files must not be submitted
* You don't need to create additional cells. Try to use the ones that are already available
* The notebook should be delivered with the outputs already available

# Dataset

The file children.csv contains two columns. The first column is the age of each child in months, and the second is the weight in kg. The data is from the National Health and Nutrition Examination Survey of 2017-2018 and represents a sample of children up to 24 months old. The following code loads the children.csv file:

```python
# This code cell does not need to be changed
import os
import pandas as pd

# dataFileName = os.path.join("../assignment2", "children.csv")
dataFileName = "children.csv"
dataDF = pd.read_csv(dataFileName)
dataDF.head()
```

# Assignment

In this assignment, we aim to predict the weight of a child up to 24 months old based on the child's age.
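The NHANES children.csv file itself is not reproduced here. If you want to run the notebook without it, a stand-in file with the same two-column layout can be generated; the values below are synthetic and purely illustrative, not real survey data:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for children.csv: age in months, weight in kg.
# The growth curve and noise level are made up for illustration only.
rng = np.random.default_rng(42)
age = np.sort(rng.uniform(0, 24, 100))
weight = 3.5 + 0.35 * age + rng.normal(0, 0.8, 100)
pd.DataFrame({"age": age, "weight": weight}).to_csv("children.csv", index=False)

dataDF = pd.read_csv("children.csv")
print(dataDF.head())
```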
```python
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score

x = dataDF[['age']]
y = dataDF[['weight']]
```

## Question 1

In this question, we aim to identify the best polynomial degree.

### **1.a)** Find the best polynomial degree from 1 to 12 (6 points out of 20).

```python
# Solve question here. Add a Markdown cell after this cell if you want to
# add some comment on your solution.
r2_scores = []
rmse_vals = []
min_rmse = 1e10
min_rmse_dg = -1
for nd in range(1, 13):  # degrees from 1 to 12
    poly = PolynomialFeatures(degree=nd)
    X = poly.fit_transform(x)
    # Fit a linear model on the polynomial features
    model = linear_model.LinearRegression()
    model.fit(X, y)
    y_new = model.predict(X)
    # Calculate the RMS error and the R2 score
    rmse = mean_squared_error(y, y_new) ** 0.5
    r2_val = r2_score(y, y_new)
    print(f"The RMS error for a degree of {nd} is {rmse} and the R2 value is {r2_val}")
    r2_scores.append(r2_val)
    rmse_vals.append(rmse)
    if rmse < min_rmse:
        min_rmse = rmse
        min_rmse_dg = nd
print(f"The minimum RMSE obtained was {min_rmse}, for a polynomial of degree {min_rmse_dg}")
```

So, from the RMS errors, it seems that the best fit for the data is a polynomial of degree 10. This degree also gives the highest R2 coefficient.

### **1.b)** Plot the results obtained (for each degree, the score obtained) (2 points out of 20).

```python
# Solve question here.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 8))
axes[0].plot(range(1, 13), rmse_vals)
axes[0].set_xlabel('Polynomial Degree')
axes[0].set_ylabel('RMS Error')
axes[0].grid(True)
axes[1].plot(range(1, 13), r2_scores)
axes[1].set_xlabel('Polynomial Degree')
axes[1].set_ylabel('R2')
axes[1].grid(True)
```

### **1.c)** Why is the k-fold cross-validation approach important to evaluate the performance of predictive models? (1 point out of 20)

The k-fold procedure is used to evaluate the skill of a model on data it was not trained on.
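As a concrete sketch of this idea (not part of the graded solution, and using synthetic stand-in data rather than children.csv), the degree search from 1.a can be repeated with 5-fold cross-validation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (age in months, weight in kg); illustrative only
rng = np.random.default_rng(0)
age = rng.uniform(0, 24, 200).reshape(-1, 1)
weight = 3.5 + 0.35 * age.ravel() + rng.normal(0, 0.8, 200)

# Score each candidate degree by mean cross-validated RMSE instead of training RMSE
best_degree, best_score = None, -np.inf
for degree in range(1, 13):
    pipe = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    score = cross_val_score(pipe, age, weight, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    if score > best_score:
        best_degree, best_score = degree, score

print(f"Best degree by 5-fold CV: {best_degree} (mean RMSE {-best_score:.3f})")
```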
This procedure is often used when the dataset is small or has many features, and it gives a more reliable estimate of how well the model will generalize than a single train/test split.

## Question 2 (10 points out of 20)

Here, we aim to build a model to predict the weight of children based on their age.

### **2.a)** Using the best degree found, find the coefficients of the best curve (4 points out of 20).

```python
# Solve question here. Add a Markdown cell after this cell if you want to
# add some comment on your solution.
# The coefficients are:
poly = PolynomialFeatures(degree=min_rmse_dg)
X = poly.fit_transform(x)
# Fit values
model = linear_model.LinearRegression()
model.fit(X, y)
print(model.coef_)
```

### **2.b)** Plot the train and test set and the model computed (3 points out of 20)

```python
# Solve question here. Add a Markdown cell after this cell if you want to
# add some comment on your solution.
# First, split into train and test sets
from sklearn.model_selection import train_test_split

# We use 70% of the dataset for training
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

# Now, create the model again
poly = PolynomialFeatures(degree=min_rmse_dg)
X_train_transform = poly.fit_transform(X_train)
X_test_transform = poly.transform(X_test)  # reuse the fitted transformer on the test set

# Fit values
model = linear_model.LinearRegression()
model.fit(X_train_transform, y_train)

# Plot train and test data
plt.figure()
plt.scatter(X_train, y_train, label='Train Data', color='lightcoral')
plt.scatter(X_test, y_test, label='Test Data', color='steelblue')

# Plot the model predictions on the training points
y_predict = model.predict(X_train_transform)
plt.scatter(X_train, y_predict, label='Model', color='black')

plt.legend()
plt.grid(True)
plt.xlabel('Age')
plt.ylabel('Weight')
plt.show()
```

### **2.c)** What is the mean squared error (MSE) on the test set? (1 point out of 20)

```python
# Solve question here. Add a Markdown cell after this cell if you want to
# add some comment on your solution.
```
```python
# Predict the values in the test set
y_predict = model.predict(X_test_transform)
# Calculate the mean squared error on the test set
mse = mean_squared_error(y_test, y_predict)
print(f"The MSE on the test set is: {mse}")
```

# Question 3

### **3.a)** In the plot made in 2.b), also represent the uncertainty of the model with shades at the 95% and 99% confidence levels (3 points out of 20). Discuss the results achieved.

```python
import numpy as np

# Solve question here.
plt.figure()
y_predict = model.predict(X_train_transform)
# Use the absolute residual at each training point as a rough error band
error = np.abs(y_predict - y_train.values)
ages = X_train.values.ravel()
y_lower = (y_predict - error).ravel()
y_upper = (y_predict + error).ravel()
plt.scatter(ages, y_lower, label='Lower band')
plt.scatter(ages, y_upper, label='Upper band')
plt.scatter(X_train, y_predict, label='Model', color='black')
# plt.fill_between requires the x values to be sorted; left commented out as in the original
# plt.fill_between(ages, y_lower, y_upper, alpha=0.5,
#                  edgecolor='#CC4F1B', facecolor='#FF9848')
plt.legend()
plt.grid(True)
plt.xlabel('Age')
plt.ylabel('Weight')
plt.show()
```

From the results obtained in figure 2.b), we can see that the model is quite good, since the trend is centered on the test and training data. If we observe each point of the obtained model, we can see that it lies approximately on the mean of the dataset values at that age. This means that the model is not overfitting: it does not try to recreate each oscillation of the data, because for each x value there are several y values, but our model predicts a single y value for each x. Also, the test error obtained in part 2.c) is relatively low, which indicates that the predictions are close to the observed values.
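Question 3 asks for shades at the 95% and 99% confidence levels, which the scatter-based plot above only approximates. One hedged way to produce such shades, assuming roughly normal residuals with constant variance (a simplification), is to shade ±1.96σ and ±2.576σ around the fitted curve with `fill_between`. The sketch below uses synthetic stand-in data and an arbitrary degree, not the notebook's variables:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (age in months, weight in kg); illustrative only
rng = np.random.default_rng(2)
age = np.sort(rng.uniform(0, 24, 200)).reshape(-1, 1)
weight = 3.5 + 0.35 * age.ravel() + rng.normal(0, 0.8, 200)

# Fit a low-degree polynomial model (degree 3 is an arbitrary illustrative choice)
poly = PolynomialFeatures(degree=3)
X = poly.fit_transform(age)
model = LinearRegression().fit(X, weight)
y_hat = model.predict(X)

# Approximate the prediction uncertainty with the residual standard deviation
sigma = np.std(weight - y_hat)
xs = age.ravel()

plt.figure()
plt.scatter(xs, weight, s=10, color="steelblue", label="Data")
plt.plot(xs, y_hat, color="black", label="Model")
# Normal quantiles: 1.96 sigma for 95%, 2.576 sigma for 99%
plt.fill_between(xs, y_hat - 1.96 * sigma, y_hat + 1.96 * sigma,
                 alpha=0.3, label="95% band")
plt.fill_between(xs, y_hat - 2.576 * sigma, y_hat + 2.576 * sigma,
                 alpha=0.15, label="99% band")
plt.legend()
plt.xlabel("Age")
plt.ylabel("Weight")
plt.show()
```

Note that `fill_between` needs the x values sorted, which is why the synthetic ages are generated with `np.sort`; with the notebook's unsorted `X_train` the data would have to be sorted first.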