## Worksheet 8
### Constructing, Evaluating, and Visualizing Pipelines
### Due on 6/12/21 @ 11:55 pm EST (see Assignment Folder in Sakai)
## Authorized help and collaboration rules
You **may not** collaborate with friends or teammates. You **may** use your notes and class-provided resources (e.g., web links, notebooks, videos, slides) to help you solve the problems below. For effective learning, you should try to complete the worksheet on your own before looking for help.
If you have any questions regarding what is, or is not, authorized you must ask. Saying after the fact you didn't understand, or were not sure, is not a valid excuse.
## Python modules
In the coding cell below, include all the Python modules needed to run your project. Some commonly used modules have already been included for you (System, Numpy, Matplotlib, and Pandas).
import sys
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler
plt.rcParams.update({"font.size": 16, "legend.loc": "upper right"})
## Python Version Check
Verify you're running Python version `3.7.0` or later.
print(sys.version)
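
Printing the version string leaves the comparison to the reader. A minimal sketch of an explicit check follows, assuming the `3.7.0` minimum stated above; the names `MIN_VERSION` and `version_ok` are hypothetical and not part of the worksheet.

```python
import sys

# Hypothetical minimum, taken from the "3.7.0 or later" requirement above.
MIN_VERSION = (3, 7, 0)


def version_ok(current=sys.version_info, minimum=MIN_VERSION):
    """Return True when the (major, minor, micro) version meets the minimum."""
    return tuple(current[:3]) >= minimum


if not version_ok():
    raise RuntimeError("Python 3.7.0 or later is required, found " + sys.version)

print("Python version OK:", sys.version.split()[0])
```

Tuples compare element by element, so `(3, 10, 1) >= (3, 7, 0)` holds without any string parsing.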
****
### Kaggle: Heart Attack Analysis & Prediction Dataset
<a href="https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset?select=heart.csv">Heart Attack</a> dataset includes the attributes listed below. In total, there are 77 samples (i.e. patients) that make up this dataset.
- Age : Age of the patient
- Sex : Sex of the patient
- cp : chest pain type
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic
- trtbps : resting blood pressure (in mm Hg)
- chol : cholesterol in mg/dl fetched via BMI sensor
- fbs : fasting blood sugar > 120 mg/dl
    - Value 1: true
    - Value 0: false
- rest_ecg : resting electrocardiographic results
    - Value 0: normal
    - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
- thalach : maximum heart rate achieved
- pcp : peak cardiac power
- output : target values
    - Value 0: healthy
    - Value 1: heart condition
heart_df = pd.read_csv("heart.csv")
heart_df
****
## Question 1: Visualizing Data Relationships (5 Points)
<img src="w8_q1_plot.png" width="550" style="float: right"/>
In the coding cell below, create a combined scatter plot that visualizes the relationship between **healthy** and **heart condition** patients using the following data attributes:
- ``Resting Blood Pressure`` and ``Cholesterol``,
- ``Resting Blood Pressure`` and ``Age``,
- ``Resting Blood Pressure`` and ``Maximum Heart Rate Achieved``, and
- ``Resting Blood Pressure`` and ``Peak Cardiac Power``.
To receive full credit, your plotting solution must use the data provided in the **heart_df** pandas DataFrame and the <a href="https://matplotlib.org/">Matplotlib</a> Python library, and it must be **visually identical** to the `plot shown on the right`.
You may assume:
- The colors used to generate plots are red (heart condition subjects) and blue (healthy subjects)
- The figsize=(15,15)
- The minimum xtick value is 80, the maximum ytick value is 240, and ticks increase in increments of 20
# Select the rows for healthy patients
healthy = heart_df[heart_df['output'] == 0]
# Select patients with heart condition
non_healthy = heart_df[heart_df['output'] == 1]
# 2x2 grid of subplots; figsize matches the (15,15) stated in the question
fig, axes = plt.subplots(nrows = 2, ncols = 2, figsize=(15,15))
# Resting Blood Pressure and Cholesterol
axes[0,0].scatter(healthy['trtbps'], healthy['chol'], label = 'healthy', color = 'blue')
axes[0,0].scatter(non_healthy['trtbps'], non_healthy['chol'], label = 'heart condition', color = 'red')
axes[0,0].legend()
axes[0,0].set_ylabel('Cholesterol')
axes[0,0].set_xlabel('Resting Blood Pressure')
# Resting Blood Pressure and Age
axes[0,1].scatter(healthy['trtbps'], healthy['age'], label = 'healthy', color = 'blue')
axes[0,1].scatter(non_healthy['trtbps'], non_healthy['age'],label = 'heart condition', color = 'red')
axes[0,1].legend()
axes[0,1].set_ylabel('Age')
axes[0,1].set_xlabel('Resting Blood Pressure')
# Resting Blood Pressure and Maximum Heart Rate Achieved
axes[1,0].scatter(healthy['trtbps'], healthy['thalach'], label = 'healthy', color = 'blue')
axes[1,0].scatter(non_healthy['trtbps'], non_healthy['thalach'], label = 'heart condition', color = 'red')
axes[1,0].legend()
axes[1,0].set_ylabel('Maximum Heart Rate Achieved')
axes[1,0].set_xlabel('Resting Blood Pressure')
# Resting Blood Pressure and Peak Cardiac Power
axes[1,1].scatter(healthy['trtbps'], healthy['pcp'], label = 'healthy', color = 'blue')
axes[1,1].scatter(non_healthy['trtbps'], non_healthy['pcp'], label = 'heart condition', color ='red')
axes[1,1].legend()
axes[1,1].set_ylabel('Peak Cardiac Power')
axes[1,1].set_xlabel('Resting Blood Pressure')
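
The four panels above follow the same pattern, so they can also be drawn in a loop. The sketch below is one possible restructuring, not the reference solution: `demo_df` is a synthetic stand-in for `heart_df` with random values, and the tick range (x ticks from 80 to 240 in steps of 20) is an assumed reading of the tick rule in the question.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for heart_df; the values are random, not from heart.csv.
rng = np.random.default_rng(0)
n = 40
demo_df = pd.DataFrame({
    "trtbps": rng.integers(90, 200, size=n),
    "chol": rng.integers(150, 400, size=n),
    "age": rng.integers(30, 80, size=n),
    "thalach": rng.integers(80, 200, size=n),
    "pcp": rng.uniform(0.5, 2.0, size=n),
    "output": rng.integers(0, 2, size=n),
})

# (column, y-axis label) for each panel, in row-major order.
panels = [
    ("chol", "Cholesterol"),
    ("age", "Age"),
    ("thalach", "Maximum Heart Rate Achieved"),
    ("pcp", "Peak Cardiac Power"),
]
# (output value, legend label, color) for each patient group.
groups = [(0, "healthy", "blue"), (1, "heart condition", "red")]

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))
for ax, (column, ylabel) in zip(axes.flat, panels):
    for value, label, color in groups:
        subset = demo_df[demo_df["output"] == value]
        ax.scatter(subset["trtbps"], subset[column], label=label, color=color)
    # Assumption: x ticks run from 80 to 240 in steps of 20.
    ax.set_xticks(np.arange(80, 241, 20))
    ax.set_xlabel("Resting Blood Pressure")
    ax.set_ylabel(ylabel)
    ax.legend()
```

Keeping the panel definitions in one list means a label or column change is made in a single place.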
*****
## Question 2: Construct a Sklearn Pipeline (5 Points)
In the coding cell below, create a 2-stage Sklearn pipeline that has two models in this order:
1. <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html">MinMaxScaler</a> Sklearn model. Please name this model ``min_max_scaler`` and set its **feature_range** parameter to (-1,1).
2. <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">LogisticRegression</a> Sklearn model. Please name this model ``logit_classifier``.
This is a very simple question; don't overthink it :)
# scaler
min_max_scaler = MinMaxScaler(feature_range = (-1,1))
# classifier
logit_classifier = LogisticRegression()
# Create pipeline
# Step names follow the model names requested in the question
pipe = Pipeline([('min_max_scaler', min_max_scaler), ('logit_classifier', logit_classifier)])
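
To see what such a pipeline does when fitted, here is a minimal sketch on synthetic data. `X_demo`, `y_demo`, and `demo_pipe` are hypothetical names, the data is random rather than taken from `heart.csv`, and the step names follow the model names requested in the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic two-feature data (not from heart.csv): class 1 has larger values.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(120, 10, size=(20, 2)),
    rng.normal(160, 10, size=(20, 2)),
])
y_demo = np.array([0] * 20 + [1] * 20)

demo_pipe = Pipeline([
    ("min_max_scaler", MinMaxScaler(feature_range=(-1, 1))),
    ("logit_classifier", LogisticRegression()),
])

# fit() fits the scaler, transforms X_demo, then fits the classifier on the result.
demo_pipe.fit(X_demo, y_demo)

# The fitted scaler is reachable by its step name.
scaled = demo_pipe.named_steps["min_max_scaler"].transform(X_demo)
print("scaled min per feature:", scaled.min(axis=0))
print("scaled max per feature:", scaled.max(axis=0))
print("predictions for first rows:", demo_pipe.predict(X_demo[:3]))
```

Because the scaler lives inside the pipeline, cross-validation refits it on each training fold, so the held-out fold never influences the scaling.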
*****
## Question 3: Pipeline Performance Evaluation (5 points)
<img src="w8_q3_plot.png" width="550" style="float: right"/>
In the coding cell below, evaluate the classification performance of the pipeline using a 10-fold cross-validation approach. The data input (i.e., X matrix) into the pipeline should only include ``Resting Blood Pressure`` and ``Maximum Heart Rate Achieved`` values, and the classification labels (i.e., y vector) should only include the output values. To receive full credit, your plotting solution must use the data provided in the **heart_df** pandas DataFrame and the <a href="https://matplotlib.org/">Matplotlib</a> Python library, and it must be **visually identical** to the `plot shown on the right`.
Please ensure that you:
- Use Sklearn <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate">cross_validate</a>
- Set the cross_validate **scoring** parameter to ``accuracy``
- Set the cross_validate **cv** parameter to <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold">StratifiedKFold</a>
- Set the StratifiedKFold **n_splits** parameter to 10
- Report Q1, Q2, and Q3 to two decimal places
Hint:
- To compute the first quartile (Q1), second quartile (Q2, or median), and third quartile (Q3) values, use the <a href="https://numpy.org/doc/stable/reference/generated/numpy.quantile.html">quantile</a> function in NumPy.
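
The quartile computation in the hint can be sketched on a small array. The accuracies below are made up for illustration and are not results from the worksheet data.

```python
import numpy as np

# Made-up fold accuracies for illustration; not results from the worksheet data.
fold_scores = np.array([0.50, 0.60, 0.70, 0.80, 0.90])

# np.quantile takes fractions in [0, 1]: 0.25 -> Q1, 0.50 -> Q2 (median), 0.75 -> Q3.
q1, q2, q3 = np.round(np.quantile(fold_scores, [0.25, 0.50, 0.75]), 2)

print("Q1 = {:.2f}, Q2 = {:.2f}, Q3 = {:.2f}".format(q1, q2, q3))
# prints: Q1 = 0.60, Q2 = 0.70, Q3 = 0.80
```

With five sorted values, the three quantiles fall exactly on the second, third, and fourth elements, so no interpolation is needed in this example.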
# Define X_data
X = heart_df[['trtbps', 'thalach']] # Resting Blood pressure and Maximum heart rate achieved
y = heart_df['output']
cv_results = cross_validate(pipe, X, y, scoring = 'accuracy', cv = StratifiedKFold(n_splits = 10))
results = np.quantile(cv_results['test_score'], [0.25, 0.5, 0.75])
results = np.round(results, 2)
Q1 = results[0]
Q2 = results[1]
Q3 = results[2]
plt.figure()
# Plot all 10 fold accuracies; the quartiles alone would distort the box
plt.boxplot(cv_results['test_score'], vert=False)
plt.xlabel('Classification Accuracy')
plt.title('Resting Blood Pressure vs. Maximum Heart Rate Achieved')
plt.text(0.5, 0.5, "Q1 = {:.2f}, Q2 = {:.2f}, Q3 = {:.2f}".format(Q1, Q2, Q3))
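
As a rough illustration of what the stratified 10-fold evaluation does, the sketch below runs the same calls on synthetic, balanced data. `X_demo`, `y_demo`, and `demo_pipe` are hypothetical names and the data is random, so the scores say nothing about the heart dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, balanced two-class data (50 samples per class); not from heart.csv.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0, 1, size=(50, 2)),
    rng.normal(2, 1, size=(50, 2)),
])
y_demo = np.array([0] * 50 + [1] * 50)

demo_pipe = Pipeline([
    ("min_max_scaler", MinMaxScaler(feature_range=(-1, 1))),
    ("logit_classifier", LogisticRegression()),
])
cv = StratifiedKFold(n_splits=10)

# Stratification keeps the class ratio of y_demo in every test fold.
fold_class_counts = [
    np.bincount(y_demo[test_idx]).tolist()
    for _, test_idx in cv.split(X_demo, y_demo)
]
print("class counts per test fold:", fold_class_counts)

# cross_validate returns a dict; 'test_score' holds one accuracy per fold.
demo_results = cross_validate(demo_pipe, X_demo, y_demo, scoring="accuracy", cv=cv)
scores = demo_results["test_score"]
print("number of fold scores:", len(scores))
```

The quartiles and the boxplot are both computed from the ten values in `test_score`.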
*****
## Question 4: Pipeline Performance Evaluation (5 points)
<img src="w8_q4_plot.png" width="550" style="float: right"/>
In the coding cell below, evaluate the classification performance of the pipeline using a 10-fold cross-validation approach. The data input (i.e., X matrix) into the pipeline should only include ``Resting Blood Pressure`` and ``Peak Cardiac Power`` values, and the classification labels (i.e., y vector) should only include the output values. To receive full credit, your plotting solution must use the data provided in the **heart_df** pandas DataFrame and the <a href="https://matplotlib.org/">Matplotlib</a> Python library, and it must be **visually identical** to the `plot shown on the right`.
Please ensure that you:
- Use Sklearn <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate">cross_validate</a>
- Set the cross_validate **scoring** parameter to ``accuracy``
- Set the cross_validate **cv** parameter to <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold">StratifiedKFold</a>
- Set the StratifiedKFold **n_splits** parameter to 10
- Report Q1, Q2, and Q3 to two decimal places
Hint:
- To compute the first quartile (Q1), second quartile (Q2, or median), and third quartile (Q3) values, use the <a href="https://numpy.org/doc/stable/reference/generated/numpy.quantile.html">quantile</a> function in NumPy.
# Define X_data
X = heart_df[['trtbps', 'pcp']] # Resting Blood pressure and Peak Cardiac Power
y = heart_df['output']
cv_results = cross_validate(pipe, X, y, scoring = 'accuracy', cv = StratifiedKFold(n_splits = 10))
results = np.quantile(cv_results['test_score'], [0.25, 0.5, 0.75])
results = np.round(results, 2)
Q1 = results[0]
Q2 = results[1]
Q3 = results[2]
plt.figure()
# Plot all 10 fold accuracies; the quartiles alone would distort the box
plt.boxplot(cv_results['test_score'], vert=False)
plt.xlabel('Classification Accuracy')
plt.title('Resting Blood Pressure vs. Peak Cardiac Power')
plt.text(0.5, 0.5, "Q1 = {:.2f}, Q2 = {:.2f}, Q3 = {:.2f}".format(Q1, Q2, Q3))
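
One way to keep the quartile annotation in a predictable spot is to position it in axes-relative coordinates. This is a sketch, not the layout of the reference plot: the accuracies are made up, and the text position (0.05, 0.9) is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up fold accuracies for illustration; not results from the worksheet data.
fold_scores = np.array([0.55, 0.60, 0.62, 0.65, 0.70, 0.71, 0.75, 0.78, 0.80, 0.85])
q1, q2, q3 = np.quantile(fold_scores, [0.25, 0.50, 0.75])

fig, ax = plt.subplots()
ax.boxplot(fold_scores, vert=False)  # horizontal boxplot of all fold scores
ax.set_xlabel("Classification Accuracy")

# transform=ax.transAxes interprets (x, y) as fractions of the axes area,
# so the label stays in the upper left regardless of the data limits.
label = ax.text(
    0.05, 0.9,
    "Q1 = {:.2f}, Q2 = {:.2f}, Q3 = {:.2f}".format(q1, q2, q3),
    transform=ax.transAxes,
)
```

Without the `transform` argument, the coordinates are read in data units, so the text moves whenever the accuracy range changes.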