## Instructions

**Objective**

## Requirements and Specifications

- The “LRS_Pre_Assessment_trimmed_rank.csv” physical fitness testing dataset is provided for use in this exercise. The data was collected from an active-duty squadron and the samples were deidentified at the point of collection. The independent variables include numeric and categorical data related to demographics, mental health surveys, fitness participation surveys, injury history surveys, physical performance measures, and body composition assessments. The dependent variable is whether or not the member passed their fitness test, and is titled APFT_1_is_pass. For this label, pass = 1, and fail = 0.
- In a marked-up Jupyter notebook (*.ipynb), use the statsmodels Logit algorithm to predict the labels of the dataset:
  - Break your code into logical chunks, using multiple “text” and “code” sections, similar to the examples given in class.
  - Drop the “flight” column and one-hot-encode the “rank” & “gender” columns with df = pd.get_dummies(df, drop_first=True). drop_first is needed so the columns are linearly independent.
  - Create a Data Understanding table using .describe() and include 3 Data Understanding visualizations such as a scatterplot, histogram, pairplot or correlation matrix.
  - Based on your data understanding visualizations, perform data preparation transformations as required, such as normalization or log transform.
  - In your Modeling section:
    - Split your dataset into 70% train & 30% test.
    - Include three variations on your modeling method. Variations could include adding, removing or transforming input variables.
    - For the “best” variation, using the train dataset, create a ROC curve plot and calculate the accuracy, odds ratio, AUC, classification report and confusion matrix.
    - For the “best” variation, using the test dataset, create a ROC curve plot and calculate the accuracy, odds ratio, AUC, classification report and confusion matrix.
  - Include a text block related to Business/Mission Understanding:
    - Review the week 2 “Binary Classification metric summary” file.
    - Mention what the majority class is (passing or failing the test), the % of datapoints in the majority class and whether or not the dataset is balanced.
    - Discuss the penalty (if any) associated with a False Negative or False Positive.
    - Discuss the metrics (accuracy/f1/etc) that would be most appropriate for this problem based on the balance & penalties.
  - Include a summary text block:
    - Your justification for the “best” variation in part 2.e.
    - The contribution of the most important 2-3 input variables to your model, such as z or p test scores.
    - A discussion of the performance metrics from part 2.e. Based on comparing the model performance on the train/test datasets, mention if any of the models overfit the data.
    - Write in a formal writing style based on the Appendix B guidance, with the exception that references and citations are not required.

**Upload your .ipynb python file to Canvas as your homework submission.**

**Source Code**

import numpy as np

import tensorflow as tf

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

import pandas as pd

from sklearn.linear_model import LogisticRegression

import seaborn as sns

from sklearn.metrics import roc_curve

from sklearn.metrics import roc_auc_score

from sklearn.metrics import confusion_matrix

## Load Data into a DataFrame

df = pd.read_csv('LRS_Pre_Assessment_trimmed_rank.csv')

df.head()

## Drop the 'flight' column

df = df.drop(columns=['Flight'])

df.head()

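## One-Hot encode the 'Rank' and 'Gender' columns

# drop_first=True drops one dummy per variable so the remaining columns stay linearly independent

# dtype=int keeps the dummy columns numeric for the arithmetic used in later steps

df = pd.get_dummies(df, drop_first=True, dtype=int)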

df.head()
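To illustrate why `drop_first=True` matters, the following minimal sketch uses a small made-up frame. The values and the `rank_with_intercept` helper are hypothetical and are not part of the dataset or the notebook. It compares the rank of the design matrix, including an intercept column, with and without dropping the first category.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column names mirror the dataset, values are invented.
toy = pd.DataFrame({
    "Rank": ["JrEnlisted", "SrEnlisted", "Officer", "JrEnlisted",
             "SrEnlisted", "Officer", "JrEnlisted", "SrEnlisted"],
    "Gender": ["M", "F", "M", "F", "M", "F", "F", "M"],
    "BodyFatPerc": [18.0, 25.5, 21.0, 30.2, 22.4, 27.1, 19.8, 24.3],
})

full = pd.get_dummies(toy, dtype=int)                      # one dummy per category
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)  # first category of each variable dropped


def rank_with_intercept(frame):
    """Return (matrix rank, column count) of the frame with an intercept column added."""
    design = np.column_stack([np.ones(len(frame)), frame.to_numpy(dtype=float)])
    return np.linalg.matrix_rank(design), design.shape[1]


# When every category is kept, the dummies of each variable sum to the intercept column,
# so the rank falls short of the number of columns.
print("all dummies:", rank_with_intercept(full))
print("drop_first: ", rank_with_intercept(reduced))
```

A rank below the column count means the coefficients are not uniquely determined, which is the situation `drop_first=True` avoids.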

## Describe Dataset

### Show statistical description of the dataset

df.describe()

### Show relation between Body Fat Percent and Muscle Percent

passed = df[df['APFT_1_is_pass'] == 1]

not_passed = df[df['APFT_1_is_pass'] == 0]

plt.figure()

ax = passed.plot.scatter(x='BodyFatPerc', y = 'MusclePerc', label = 'Passed')

not_passed.plot.scatter(x='BodyFatPerc', y = 'MusclePerc', label = 'Not Passed',ax = ax, color='red')

plt.grid(True)

plt.show()

The figure shows that Muscle Percent decreases as Body Fat Percent increases. It also shows that most of the members who failed the test have a higher Body Fat Percent (the bottom-right records).

## Show a bar plot for the number of people that passed and failed the test, grouped by sex

pd.crosstab(df.APFT_1_is_pass, df.Gender_M).plot(kind='bar', rot=0)

plt.grid(True)

# The crosstab columns are ordered Gender_M = 0 (female), then Gender_M = 1 (male)

plt.legend(['Female', 'Male'])

plt.show()

The gender with the most members who passed the test is also the gender with the most members who failed it.

### Now plot a correlation map

fig, ax = plt.subplots(figsize=(10,10))

corr = df.corr()

sns.heatmap(corr)

plt.show()

## Normalize the data so all numeric values in the dataset are between 0 and 1

We will use Min-Max Normalization

df_norm = (df - df.min())/(df.max()-df.min())

df_norm.head()
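Min-max normalization maps each value to x' = (x - min) / (max - min). The notebook scales the full dataset before splitting. A commonly used alternative, sketched below with invented values, computes the minimum and range on the training rows only and reuses them for the test rows, so that no test information enters the scaling. The frames and their values are hypothetical.

```python
import pandas as pd

# Hypothetical values for illustration only.
train = pd.DataFrame({"BodyFatPerc": [15.0, 20.0, 30.0], "MusclePerc": [35.0, 40.0, 45.0]})
test = pd.DataFrame({"BodyFatPerc": [25.0, 33.0], "MusclePerc": [38.0, 34.0]})

# Statistics come from the training rows only.
col_min = train.min()
col_range = train.max() - train.min()

train_norm = (train - col_min) / col_range
test_norm = (test - col_min) / col_range  # values may fall outside [0, 1]

print(train_norm)
print(test_norm)
```

With this approach a test value outside the training range lands outside the interval from 0 to 1, which is expected and harmless for logistic regression.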

## Split Dataset into 70% train, 30% test

train_df, test_df = train_test_split(df_norm, test_size=0.3)

y_train = train_df['APFT_1_is_pass'].values

X_train = train_df.drop(columns = ['APFT_1_is_pass']).values

y_test = test_df['APFT_1_is_pass'].values

X_test = test_df.drop(columns = ['APFT_1_is_pass']).values

print(f"There are {len(y_train)} rows in the train dataset, and {len(y_test)} rows in the test dataset.")
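The assignment also asks for the share of the majority class. The sketch below, on synthetic labels with made-up proportions, shows one way to compute that share and to keep it equal in both parts of the split. The `stratify` and `random_state` arguments are not used in the notebook and are added here as assumptions for a reproducible, class-balanced split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 80 passes and 20 fails (invented proportions).
y = np.array([1] * 80 + [0] * 20)
X = rng.normal(size=(100, 3))

majority_share = max(np.mean(y == 1), np.mean(y == 0))
print(f"Majority class share: {majority_share:.0%}")

# stratify=y keeps the pass/fail ratio the same in train and test;
# random_state makes the split repeatable between runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(len(y_tr), len(y_te), y_tr.mean(), y_te.mean())
```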

## Function to compute accuracy

This function will compute accuracy given the real output and the predicted output

def calc_accuracy(y_real, y_pred):

    N = len(y_pred)

    # Compute the fraction of predictions that are equal to the real values

    return np.where(y_real == y_pred)[0].shape[0]/N
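As a quick check of the accuracy calculation, the short sketch below uses invented labels to show that the share of matching entries equals the value returned by scikit-learn's `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented labels: three of the five predictions match.
y_real = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0])

manual = np.mean(y_real == y_pred)          # fraction of matching entries
library = accuracy_score(y_real, y_pred)    # same quantity from scikit-learn

print(manual, library)
```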

## Model 1: Using all variables

Create a logistic regression model using all the variables in the dataset.

model1 = LogisticRegression()

model1.fit(X_train, y_train)

# Now predict

y_pred1 = model1.predict(X_test)

# Print accuracy

accuracy1 = calc_accuracy(y_test, y_pred1)

print("The accuracy of Model 1 is: {:.2f}%".format(accuracy1*100.0))

## Model 2: Removing the ORS_Total variable

The correlation map shown before indicates that the correlation between the ***APFT_1_is_pass*** variable and the ***ORS_Total*** variable is close to zero, so this model removes the ***ORS_Total*** variable.

***ORS_Total*** is column 1 in the X_train/X_test arrays, so we will remove that column.

X_train2 = np.delete(X_train, 1, 1)

X_test2 = np.delete(X_test, 1, 1)
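Deleting a column by position depends on the column order of the frame. One possible alternative, sketched here on a small made-up frame that reuses the dataset's column names, is to drop the column by name before converting to an array. The values are invented.

```python
import pandas as pd

# Hypothetical frame with the same column names as the dataset; values are made up.
train_df = pd.DataFrame({
    "APFT_1_is_pass": [1, 0, 1],
    "BodyFatPerc": [0.2, 0.8, 0.4],
    "ORS_Total": [0.5, 0.6, 0.1],
    "PTSD_Score": [0.0, 0.3, 0.1],
})

features = train_df.drop(columns=["APFT_1_is_pass"])

# Position that a positional delete would need for this column.
ors_position = features.columns.get_loc("ORS_Total")
print(ors_position)

# Dropping by name does not depend on that position.
X_train2 = features.drop(columns=["ORS_Total"]).values
print(X_train2.shape)
```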

Create Model 2

model2 = LogisticRegression()

model2.fit(X_train2, y_train)

# Now predict

y_pred2 = model2.predict(X_test2)

# Print accuracy

accuracy2 = calc_accuracy(y_test, y_pred2)

print("The accuracy of Model 2 is: {:.2f}%".format(accuracy2*100.0))

The accuracy did not change, so removing the column did not affect the test accuracy.

## Model 3: Removing all variables with an absolute correlation below 0.1

corr

The variables whose correlation with APFT_1_is_pass has an absolute value below 0.1 are: ORS_Total, PTSD_Score and Rank_SrEnlisted
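Reading the low-correlation variables off the printed matrix can also be done programmatically. The sketch below builds a small deterministic frame in which the hypothetical `strong` and `noise` columns stand in for real predictors, and selects the columns whose absolute correlation with the label is below 0.1.

```python
import numpy as np
import pandas as pd

# Deterministic toy data: "strong" tracks the label, "noise" is uncorrelated with it.
target = np.tile([0, 1, 0, 1], 25)
df_toy = pd.DataFrame({
    "APFT_1_is_pass": target,
    "strong": target + np.tile([0.1, -0.1, -0.1, 0.1], 25),
    "noise": np.tile([1.0, 1.0, -1.0, -1.0], 25),
})

# Correlation of every predictor with the label, excluding the label itself.
corr_with_target = df_toy.corr()["APFT_1_is_pass"].drop("APFT_1_is_pass")

# Names of the predictors below the 0.1 threshold in absolute value.
low_corr = corr_with_target[corr_with_target.abs() < 0.1].index.tolist()
print(low_corr)
```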

X_train3 = np.delete(X_train, [1,2,9], 1)

X_test3 = np.delete(X_test, [1,2,9], 1)

model3 = LogisticRegression()

model3.fit(X_train3, y_train)

# Now predict

y_pred3 = model3.predict(X_test3)

# Print accuracy

accuracy3 = calc_accuracy(y_test, y_pred3)

print("The accuracy of Model 3 is: {:.2f}%".format(accuracy3*100.0))

Removing the variables with low correlation does not affect the model, so we keep the original model with all variables.

## ROC, AUC, accuracy and Confusion Matrix for Train Dataset

lr_probs = model1.predict_proba(X_train)

# keep probabilities for the positive outcome only

lr_probs = lr_probs[:, 1]

# calculate scores

lr_auc = roc_auc_score(y_train, lr_probs)

# summarize scores

print('Logistic: ROC AUC=%.3f' % (lr_auc))

lr_fpr, lr_tpr, _ = roc_curve(y_train, lr_probs)

# plot the roc curve for the model

plt.plot(lr_fpr, lr_tpr, marker='.', label='Logistic Regression')

# axis labels

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

# show the legend

plt.legend()

# show the plot

plt.show()

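Confusion Matrix and accuracy

y_pred_train = model1.predict(X_train)

conf_mat = confusion_matrix(y_train, y_pred_train)

print(conf_mat)

print("The train accuracy of Model 1 is: {:.2f}%".format(calc_accuracy(y_train, y_pred_train)*100.0))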
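To connect the confusion matrix to the False Positive and False Negative discussion in the assignment, the sketch below uses invented labels to unpack the four cells. scikit-learn places the true labels in rows and the predicted labels in columns, so for 0/1 labels the flattened order is TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels for illustration (1 = pass, 0 = fail).
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_hat = np.array([1, 0, 1, 0, 1, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted passes that really passed
recall = tp / (tp + fn)     # share of real passes that were predicted as passes

print(tn, fp, fn, tp, precision, recall)
```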

## ROC, AUC, accuracy and Confusion Matrix for Test Dataset

lr_probs = model1.predict_proba(X_test)

# keep probabilities for the positive outcome only

lr_probs = lr_probs[:, 1]

# calculate scores

lr_auc = roc_auc_score(y_test, lr_probs)

# summarize scores

print('Logistic: ROC AUC=%.3f' % (lr_auc))

lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs)

# plot the roc curve for the model

plt.plot(lr_fpr, lr_tpr, marker='.', label='Logistic Regression')

# axis labels

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

# show the legend

plt.legend()

# show the plot

plt.show()

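Confusion Matrix and accuracy

y_pred_test = model1.predict(X_test)

conf_mat = confusion_matrix(y_test, y_pred_test)

print(conf_mat)

print("The test accuracy of Model 1 is: {:.2f}%".format(calc_accuracy(y_test, y_pred_test)*100.0))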
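The assignment also lists a classification report and a train/test comparison for overfitting. A minimal sketch of both, on synthetic data with invented coefficients, could look like the following. The `metrics` dictionary is a hypothetical helper structure and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features and label (invented relationship between them).
X = rng.normal(size=(400, 4))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.5, 400) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)

# Collect the same metrics on both splits so they can be compared side by side.
metrics = {}
for name, X_part, y_part in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
    metrics[name] = {
        "accuracy": accuracy_score(y_part, model.predict(X_part)),
        "auc": roc_auc_score(y_part, model.predict_proba(X_part)[:, 1]),
    }
    print(name, metrics[name])

# Per-class precision, recall and F1 on the test split.
print(classification_report(y_te, model.predict(X_te), target_names=["fail", "pass"]))
```

A train score that is much higher than the test score would suggest overfitting, while similar scores on both splits suggest the model generalizes.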