## Instructions

**Objective**

## Requirements and Specifications

**Source Code**

# STUDENT NAME, STUDENT NUMBER (TO BE FILLED BY THE STUDENT)

# Advanced Data Analysis - Assignment 2

This notebook contains **Assignment 2** of the Advanced Data Analysis course.

The assignment consists of performing linear regression on National Health and Nutrition Examination Survey data.

### DEADLINE: 10-October-2021

The assignment is **individual**. You should submit your solution on Moodle by the deadline. While doing this assignment, you can use or adapt any code from the lectures if you want.

Students have three grace days in total, shared across all assignments and the group project, which allow them to deliver work late. Use these grace days carefully.

[//]: # (We will be using latex for formulas)

### Notebook Instructions

* You only need to deliver this notebook file (note that a notebook file has the extension .ipynb). Data files must not be submitted.

* You don't need to create additional cells. Try to use the ones that are already available

* The notebook should be delivered with the outputs already available

# Dataset

The file children.csv contains a table with two columns. The first column is the age of each child in months, and the second is the weight in kg. The data comes from the National Health and Nutrition Examination Survey of 2017-2018 and represents a sample of children up to 24 months old.

The following code loads the children.csv file:

# This code cell does not need to be changed

import os

import pandas as pd

#dataFileName = os.path.join( "../assignment2", "children.csv")

dataFileName = "children.csv"

dataDF = pd.read_csv(dataFileName)

dataDF.head()

# Assignment

In this assignment, we aim to predict the weight of children up to 24 months old based on their age.

from sklearn import linear_model

from sklearn.preprocessing import PolynomialFeatures

import matplotlib.pyplot as plt

import os

from sklearn.metrics import mean_squared_error, r2_score

x = dataDF[['age']]

y = dataDF[['weight']]

## Question 1

In this question, we aim to identify the best polynomial degree.

### **1.a)** Find the best polynomial degree from 1 to 12 (6 points out of 20).

# Solve question here. Add a Markdown cell after this cell if you want to add some comment on your solution.

r2_scores = []

rmse_vals = []

min_rmse = 1e10

min_rmse_dg = -1

for nd in range(1, 13): # from 1 to 12

    poly = PolynomialFeatures(degree = nd)

    X = poly.fit_transform(x)

    # Fit values

    model = linear_model.LinearRegression()

    model.fit(X, y)

    y_new = model.predict(X)

    # Calculate RMS error

    rmse = (mean_squared_error(y, y_new))**(1/2)

    r2_val = r2_score(y,y_new)

    print(f"The RMS error for a degree of {nd} is {rmse} and the R2 value is {r2_val}")

    r2_scores.append(r2_val)

    rmse_vals.append(rmse)

    if rmse < min_rmse:

        min_rmse = rmse

        min_rmse_dg = nd

print(f"The min RMSE obtained was of {min_rmse} and it was for a polynomial of degree: {min_rmse_dg}")

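Based on the RMS errors, the best fit for the data is a polynomial of degree 10. This degree also gives the highest R2 coefficient. Note that both scores are computed on the same data used to fit each model, so they measure goodness of fit on the training data rather than performance on unseen data.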

### **1.b)** Plot the results obtained (the score obtained for each degree) (2 points out of 20).

# Solve question here.

fig, axes = plt.subplots(nrows = 1, ncols = 2, figsize=(8,8))

axes[0].plot(range(1, 13), rmse_vals)

axes[0].set_xlabel('Polynomial Degree')

axes[0].set_ylabel('RMS Error')

axes[0].grid(True)

axes[1].plot(range(1,13), r2_scores)

axes[1].set_xlabel('Polynomial Degree')

axes[1].set_ylabel('R2')

axes[1].grid(True)

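### **1.c)** Why is the k-fold cross-validation approach important to evaluate the performance of predictive models? (1 point out of 20)

K-fold cross-validation estimates how well a model generalizes to data it has not seen. The dataset is split into k folds; each fold is used once as the validation set while the model is trained on the remaining k-1 folds, and the k scores are averaged. Because every observation is used for both training and validation, the estimate depends less on one particular train/test split, which is especially valuable when the dataset is small. It also gives a fair basis for choosing hyperparameters such as the polynomial degree, since a score computed on the training data alone tends to favor overly complex models.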
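As an illustration of the idea, the sketch below selects a polynomial degree by 5-fold cross-validation. It is a minimal example under stated assumptions: the data are synthetic (generated from an arbitrary quadratic curve plus noise, not the survey values), and the names `age`, `weight`, `cv_rmse`, and `best_degree` are introduced here only for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the age/weight data (an assumption, not the real survey values).
rng = np.random.default_rng(0)
age = rng.uniform(0, 24, size=200).reshape(-1, 1)
weight = 3.5 + 0.9 * age.ravel() - 0.02 * age.ravel() ** 2 + rng.normal(0, 1.0, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_rmse = {}
for degree in range(1, 7):
    # The pipeline refits the polynomial expansion and the regression inside each fold.
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    # scikit-learn maximizes scores, so the MSE is returned negated.
    neg_mse = cross_val_score(model, age, weight, cv=cv, scoring="neg_mean_squared_error")
    cv_rmse[degree] = float(np.sqrt(-neg_mse.mean()))

best_degree = min(cv_rmse, key=cv_rmse.get)
print(cv_rmse)
print("Degree with the lowest cross-validated RMSE:", best_degree)
```

Unlike a score computed on the training data, the cross-validated RMSE typically stops improving, or gets worse, once the degree exceeds what the data support.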

## Question 2 (10 points out of 20)

Here, we aim to build a model to predict the weight of children based on their age.

### **2.a)** Using the best degree found, find the coefficients of the best curve (4 points out of 20).

# Solve question here. Add a Markdown cell after this cell if you want to add some comment on your solution.

# The coefficients are:

poly = PolynomialFeatures(degree = min_rmse_dg)

X = poly.fit_transform(x)

# Fit values

model = linear_model.LinearRegression()

model.fit(X, y)

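# PolynomialFeatures adds a constant column, so the constant term of the curve is reported in model.intercept_

print(model.intercept_)

print(model.coef_)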
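To make the printed coefficients easier to read, the following sketch shows how each entry of `coef_` maps to a power of the input. It uses noise-free toy data from a known curve, so the recovered values can be checked by eye; the curve and the names `x_demo`, `y_demo`, `poly_demo`, and `model_demo` are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Noise-free toy data from a known curve: y = 2 + 3*x - 0.5*x^2 (illustrative values only).
x_demo = np.linspace(0, 10, 50).reshape(-1, 1)
y_demo = 2 + 3 * x_demo.ravel() - 0.5 * x_demo.ravel() ** 2

poly_demo = PolynomialFeatures(degree=2)  # include_bias=True adds a constant column
X_demo = poly_demo.fit_transform(x_demo)
model_demo = LinearRegression().fit(X_demo, y_demo)

# powers_ lists the exponent of each generated feature: here 0, 1, 2.
for power, coef in zip(poly_demo.powers_.ravel(), model_demo.coef_):
    print(f"coefficient of x^{power}: {coef:.4f}")

# The constant term is stored separately because LinearRegression fits its own intercept.
print(f"intercept: {model_demo.intercept_:.4f}")
```

In the notebook, `y` is a one-column DataFrame rather than a 1-D array, so `model.coef_` has shape `(1, n_features)` and `model.intercept_` is a length-1 array; the mapping between entries and powers is the same.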

### **2.b)** Plot the train and test set and the model computed (3 points out of 20)

# Solve question here. Add a Markdown cell after this cell if you want to add some comment on your solution.

# First, split into train and test

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3) # we use 70% of the dataset for training

# Now, create model again

poly = PolynomialFeatures(degree = min_rmse_dg)

X_train_transform = poly.fit_transform(X_train)

X_test_transform = poly.transform(X_test) # reuse the transformer fitted on the training set

# Fit values

model = linear_model.LinearRegression()

model.fit(X_train_transform, y_train)

# Plot train and test data

plt.figure()

plt.scatter(X_train, y_train, label = 'Train Data', color = 'lightcoral')

plt.scatter(X_test, y_test, label = 'Test Data', color = 'steelblue')

y_predict = model.predict(X_train_transform)

plt.scatter(X_train, y_predict, label = 'Model', color = 'black')

# Plot the model

plt.legend()

plt.grid(True)

plt.xlabel('Age')

plt.ylabel('Weight')

plt.show()

### **2.c)** What is the mean squared error (MSE) on the test set? (1 point out of 20)

# Solve question here. Add a Markdown cell after this cell if you want to add some comment on your solution.

# Predict the values in the test set

y_predict = model.predict(X_test_transform)

# Calculate the MSE and, for reference, the RMSE

mse = mean_squared_error(y_test, y_predict)

rmse = mse**(1/2)

print(f"The MSE in the test set is: {mse} (RMSE: {rmse})")
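Since the question asks for the MSE while the earlier questions report the RMSE, a small worked example may help keep the two apart. The numbers below are made up for illustration and are not taken from the dataset.

```python
import math

# Made-up observed and predicted weights in kg, for illustration only.
y_true = [4.0, 6.5, 9.0, 11.0]
y_pred = [4.5, 6.0, 9.5, 10.0]

squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
mse = sum(squared_errors) / len(squared_errors)  # units: kg^2
rmse = math.sqrt(mse)                            # units: kg

print(f"MSE = {mse:.4f} kg^2, RMSE = {rmse:.4f} kg")
# prints: MSE = 0.4375 kg^2, RMSE = 0.6614 kg
```

The RMSE is the square root of the MSE, so it is expressed in the same unit as the weight, which makes it easier to interpret.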

## Question 3

### **3.a)** In the plot made in 2.b), also represent the uncertainty of the model, using different shades for the 95% and 99% confidence intervals (3 points out of 20). Discuss the results achieved.

import numpy as np

# Plot train and test data

plt.figure()

y_predict = model.predict(X_train_transform)

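# Absolute residuals on the training set (a rough per-point error band, not a true confidence interval)

error = np.abs(y_predict - y_train.values)

# Use a separate name so the feature DataFrame x is not overwritten

x_train_vals = X_train.values.ravel()

y2 = y_predict.ravel() + error.ravel()

y1 = y_predict.ravel() - error.ravel()

plt.scatter(x_train_vals, y1, label = 'Prediction - |residual|')

plt.scatter(x_train_vals, y2, label = 'Prediction + |residual|')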

plt.scatter(X_train, y_predict, label = 'Model', color = 'black')

#plt.fill_between(x, y1, y2,

# alpha=0.5, edgecolor='#CC4F1B', facecolor='#FF9848')

# Plot the model

plt.legend()

plt.grid(True)

plt.xlabel('Age')

plt.ylabel('Weight')

plt.show()

From the plot obtained in 2.b) we can see that the model fits the data well, since the fitted curve runs through the center of both the training and the test data. At each age, the predicted weight lies approximately at the mean of the observed weights. This suggests that the model is not overfitting: it does not try to reproduce every oscillation of the data, and it could not do so anyway, because each age has several observed weights while the model predicts a single weight per age. The RMSE obtained in part 2.c) is also relatively low, which indicates that the predictions are close to the observed values.
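One possible way to draw shaded 95% and 99% bands is sketched below. It is not the method used in the cell above: it builds approximate prediction bands as the fitted curve plus or minus a normal quantile times the residual standard deviation, which assumes normally distributed residuals with constant variance and ignores the uncertainty in the fitted coefficients. The data are synthetic, and `degree = 3`, `bands`, and `draw_bands` are hypothetical names and choices made for this sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the training data (assumed values, not the survey data).
rng = np.random.default_rng(1)
age_train = rng.uniform(0, 24, size=300).reshape(-1, 1)
weight_train = (3.5 + 0.9 * age_train.ravel() - 0.02 * age_train.ravel() ** 2
                + rng.normal(0, 1.0, size=300))

degree = 3  # hypothetical choice for this sketch
model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
model.fit(age_train, weight_train)

# Residual standard deviation, with degrees of freedom reduced by the fitted parameters.
residuals = weight_train - model.predict(age_train)
sigma = np.sqrt(np.sum(residuals ** 2) / (len(residuals) - (degree + 1)))

# Evaluate on a sorted grid so that fill_between draws clean bands.
age_grid = np.linspace(0, 24, 100).reshape(-1, 1)
mean_curve = model.predict(age_grid)

# Two-sided normal quantiles for 95% and 99% coverage.
z_values = {"95%": 1.960, "99%": 2.576}
bands = {level: (mean_curve - z * sigma, mean_curve + z * sigma)
         for level, z in z_values.items()}


def draw_bands(ax):
    """Shade the bands on a matplotlib Axes, widest first so the 95% band stays visible."""
    for level, alpha in (("99%", 0.2), ("95%", 0.4)):
        lower, upper = bands[level]
        ax.fill_between(age_grid.ravel(), lower, upper,
                        color="steelblue", alpha=alpha, label=f"{level} band")
    ax.plot(age_grid.ravel(), mean_curve, color="black", label="Model")
    ax.legend()
```

In the notebook, after `fig, ax = plt.subplots()` and the scatter plots of the training and test data, calling `draw_bands(ax)` would overlay the two shaded regions and the fitted curve. Roughly 95% of the points should then fall inside the darker band if the assumptions hold.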

# Solve question here. Add a Markdown cell after this cell if you want to add some comment on your solution.
