Creating a Classification System in Python: Assignment Solution


Instructions

Objective
Write a Python program that builds a classification system.

Requirements and Specifications

In this assignment you will solve a classification problem using Python.
  1. The “LRS_Pre_Assessment_trimmed_rank.csv” physical fitness testing dataset is provided for use in this exercise. The data was collected from an active-duty squadron and the samples were deidentified at the point of collection. The independent variables include numeric and categorical data related to demographics, mental health surveys, fitness participation surveys, injury history surveys, physical performance measures, and body composition assessments. The dependent variable is whether or not the member passed their fitness test, and is titled APFT_1_is_pass. For this label, pass = 1, and fail = 0.
  2. In a marked-up Jupyter notebook (*.ipynb), use the statsmodels Logit algorithm to predict the labels of the dataset:
    1. Break your code into logical chunks, using multiple “text” and “code” sections, similar to the examples given in class.
    2. Drop the “flight” column and one-hot-encode the “rank” & “gender” columns with df = pd.get_dummies(df, drop_first=True). drop_first is needed so the columns are linearly independent.
    3. Create a Data Understanding table using .describe() and include 3 Data Understanding visualizations such as a scatterplot, histogram, pairplot or correlation matrix.
    4. Based on your data understanding visualizations, perform data preparation transformations as required, such as normalization or log transform.
    5. In your Modeling section:
      1. Split your dataset into 70% train & 30% test.
      2. Include three variations on your modeling method. Variations could include adding, removing or transforming input variables.
      3. For the “best” variation, using the train dataset, create a ROC curve plot and calculate the accuracy, odds ratio, AUC, classification report and confusion matrix.
      4. For the “best” variation, using the test dataset, create a ROC curve plot and calculate the accuracy, odds ratio, AUC, classification report and confusion matrix.
    6. Include a text block related to Business/Mission Understanding:
      1. Review the week 2 “Binary Classification metric summary” file
      2. Mention what the majority class is (passing or failing the test), the % of datapoints in the majority class and whether or not the dataset is balanced.
      3. Discuss the penalty (if any) associated with a False Negative or False Positive
      4. Discuss the metrics (accuracy/f1/etc) that would be most appropriate for this problem based on the balance & penalties
    7. Include a summary text block:
      1. Your justification for the “best” variation in part 2.e.
      2. The contribution of the most important 2-3 input variables to your model, such as z or p test scores.
      3. A discussion of the performance metrics from part 2.e. Based on comparing the model performance on the train/test datasets, mention if any of the models overfit the data.
      4. Write in a formal writing style based on the Appendix B guidance, with the exception that references and citations are not required.
    Upload your .ipynb notebook file to Canvas.

    Source Code

    import numpy as np

    from sklearn.model_selection import train_test_split

    import matplotlib.pyplot as plt

    import pandas as pd

    from sklearn.linear_model import LogisticRegression

    import seaborn as sns

    from sklearn.metrics import roc_curve

    from sklearn.metrics import roc_auc_score

    from sklearn.metrics import confusion_matrix

    ## Load Data into a DataFrame

    df = pd.read_csv('LRS_Pre_Assessment_trimmed_rank.csv')

    df.head()

    ## Drop the 'flight' column

    df = df.drop(columns=['Flight'])

    df.head()

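    ## One-Hot encode the categorical columns ('Rank' and 'Gender')

    # drop_first=True keeps the dummy columns linearly independent;

    # dtype=int yields 0/1 integers so the arithmetic used later for normalization works

    df = pd.get_dummies(df, drop_first=True, dtype=int)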
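To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up frame. The category values and the `Age` column are assumptions for illustration only and are not taken from the dataset.

```python
import pandas as pd

# Toy frame; the category values are hypothetical stand-ins for the real data
toy = pd.DataFrame({
    "Rank": ["JrEnlisted", "SrEnlisted", "Officer", "JrEnlisted"],
    "Gender": ["F", "M", "M", "F"],
    "Age": [22, 35, 41, 27],
})

# One dummy column per category, minus the first (alphabetical) level of each
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

print(list(encoded.columns))
```

The dropped level of each column (here `JrEnlisted` and `F`) becomes the baseline that the remaining dummy columns are compared against.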

    df.head()

    ## Describe Dataset

    ### Show statistical description of the dataset

    df.describe()
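The requirements also ask for the majority class and its share of the data. A minimal sketch of that check follows; the label values are invented for illustration, and in the notebook the series would be `df['APFT_1_is_pass']`.

```python
import pandas as pd

# Hypothetical labels standing in for df['APFT_1_is_pass'] (1 = pass, 0 = fail)
labels = pd.Series([1, 1, 1, 0, 1, 0, 1, 1], name="APFT_1_is_pass")

# Share of each class, as fractions that sum to 1
class_share = labels.value_counts(normalize=True)

majority_class = class_share.idxmax()
majority_pct = class_share.max() * 100

print(f"Majority class: {majority_class} ({majority_pct:.1f}% of rows)")
```

A majority share far above 50% would suggest an imbalanced dataset, in which case accuracy alone can be misleading.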

    ### Show relation between Body Fat Percent and Muscle Percent

    passed = df[df['APFT_1_is_pass'] == 1]

    not_passed = df[df['APFT_1_is_pass'] == 0]

    plt.figure()

    ax = passed.plot.scatter(x='BodyFatPerc', y = 'MusclePerc', label = 'Passed')

    not_passed.plot.scatter(x='BodyFatPerc', y = 'MusclePerc', label = 'Not Passed',ax = ax, color='red')

    plt.grid(True)

    plt.show()

    The figure shows that muscle percentage decreases as body fat percentage increases. Also note that most of the people who failed the test have a higher body fat percentage (the records in the bottom-right of the plot).

    ## Show a bar plot for the number of people that passed and failed the test, grouped by sex

    pd.crosstab(df.APFT_1_is_pass, df.Gender_M).plot(kind='bar', rot=0)

    plt.grid(True)

    # Crosstab columns are ordered Gender_M = 0 (female), then Gender_M = 1 (male)

    plt.legend(['Female', 'Male'])

    plt.show()

    In both groups (passed and failed), the larger bar corresponds to Gender_M = 1, so most of the members in each group are male.

    ### Now plot a correlation map

    fig, ax = plt.subplots(figsize=(10,10))

    corr = df.corr()

    sns.heatmap(corr)

    plt.show()

    ## Normalize the data so all numeric values in the dataset are between 0 and 1

    We will use Min-Max Normalization

    df_norm = (df - df.min())/(df.max()-df.min())

    df_norm.head()
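The notebook scales the whole dataset before splitting it. A commonly used alternative, sketched below with scikit-learn's `MinMaxScaler`, learns the minimum and maximum from the training rows only and reuses them for the test rows. This is a swapped-in technique, not the notebook's method, and the arrays are made-up values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrices (rows = members, columns = features)
X_train_raw = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 300.0]])
X_test_raw = np.array([[25.0, 500.0]])

scaler = MinMaxScaler()
# Learn min and max from the training rows only
X_train_scaled = scaler.fit_transform(X_train_raw)
# Reuse the training min/max for the test rows
X_test_scaled = scaler.transform(X_test_raw)

print(X_train_scaled)
print(X_test_scaled)
```

With this approach a test value can fall outside the range 0 to 1 when it lies beyond the training minimum or maximum, which is expected.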

    ## Split Dataset into 70% train, 30% test

    train_df, test_df = train_test_split(df_norm, test_size=0.3)

    y_train = train_df['APFT_1_is_pass'].values

    X_train = train_df.drop(columns = ['APFT_1_is_pass']).values

    y_test = test_df['APFT_1_is_pass'].values

    X_test = test_df.drop(columns = ['APFT_1_is_pass']).values

    print(f"There are {len(y_train)} rows in the train dataset, and {len(y_test)} rows in the test dataset.")
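The split above is random, so the accuracies change from run to run. The sketch below shows one way to make it repeatable and to keep the pass/fail ratio equal in both parts, using the `random_state` and `stratify` parameters of `train_test_split`. The data and the seed value 42 are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 3 features, 80% positive labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 80 + [0] * 20)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(len(y_tr), len(y_te), y_tr.mean(), y_te.mean())
```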

    ## Function to compute accuracy

    This function will compute accuracy given the real output and the predicted output

    def calc_accuracy(y_real, y_pred):

        N = len(y_pred)

        # Compute the number of predictions that are equal to the real values

        return np.where(y_real == y_pred)[0].shape[0]/N
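As a quick sanity check, the same quantity can be computed as the mean of an element-wise comparison and compared against scikit-learn's `accuracy_score`. The arrays below are made-up values for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions
y_real = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

# Fraction of positions where prediction and label agree
manual = np.mean(y_real == y_pred)
library = accuracy_score(y_real, y_pred)

print(manual, library)
```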

    ## Model 1: Using all variables

    Create a LogisticRegression model using all the variables in the dataset

    model1 = LogisticRegression()

    model1.fit(X_train, y_train)

    # Now predict

    y_pred1 = model1.predict(X_test)

    # Print accuracy

    accuracy1 = calc_accuracy(y_test, y_pred1)

    print("The accuracy of Model 1 is: {:.2f}%".format(accuracy1*100.0))

    ## Model 2: Removing ORS_Total

    The correlation map above shows essentially no correlation between ***APFT_1_is_pass*** and ***ORS_Total***, so this variation removes the ***ORS_Total*** variable.

    ***ORS_Total*** is column 1 in the X_train/X_test arrays, so we remove that column.

    X_train2 = np.delete(X_train, 1, 1)

    X_test2 = np.delete(X_test, 1, 1)
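Hard-coded column positions break silently if the column order changes. A minimal sketch of looking the position up by name is shown below; the frame and its values are hypothetical, and only the column names `ORS_Total` and `BodyFatPerc` come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical feature frame; "Age" and all values are invented for illustration
features = pd.DataFrame({
    "Age": [22, 35, 41],
    "ORS_Total": [30, 28, 35],
    "BodyFatPerc": [18.0, 25.5, 21.0],
})
X = features.values

# Look up the position by name instead of hard-coding it
col_idx = features.columns.get_loc("ORS_Total")
X_reduced = np.delete(X, col_idx, axis=1)

print(col_idx, X_reduced.shape)
```

In the notebook, the lookup would be done on the columns of `train_df.drop(columns=['APFT_1_is_pass'])`, since that is the frame the arrays are built from.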

    Create Model 2

    model2 = LogisticRegression()

    model2.fit(X_train2, y_train)

    # Now predict

    y_pred2 = model2.predict(X_test2)

    # Print accuracy

    accuracy2 = calc_accuracy(y_test, y_pred2)

    print("The accuracy of Model 2 is: {:.2f}%".format(accuracy2*100.0))

    We see that the accuracy did not change, so the removed column did not affect the output.

    ## Model 3: Removing all variables with an absolute correlation below 0.1

    corr

    The variables whose correlation with ***APFT_1_is_pass*** has an absolute value below 0.1 are: ORS_Total, PTSD_Score and Rank_SrEnlisted (columns 1, 2 and 9 of the X_train/X_test arrays)
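Instead of reading the weakly correlated variables off the table, they can be selected programmatically. The sketch below does this on a made-up frame; the `strong` and `weak` columns are hypothetical, and in the notebook the same steps would run on the encoded `df`.

```python
import pandas as pd

# Hypothetical data; "strong" and "weak" are invented feature names
toy = pd.DataFrame({
    "APFT_1_is_pass": [1, 1, 0, 0, 1, 0, 1, 0],
    "strong": [9, 8, 2, 1, 7, 3, 8, 2],
    "weak": [1, 2, 1, 2, 2, 1, 1, 2],
})

target = "APFT_1_is_pass"
# Absolute correlation of every feature with the target
abs_corr = toy.corr()[target].drop(target).abs()

# Keep only the names of the features below the 0.1 threshold
low_corr_cols = abs_corr[abs_corr < 0.1].index.tolist()

print(low_corr_cols)
```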

    X_train3 = np.delete(X_train, [1,2,9], 1)

    X_test3 = np.delete(X_test, [1,2,9], 1)

    model3 = LogisticRegression()

    model3.fit(X_train3, y_train)

    # Now predict

    y_pred3 = model3.predict(X_test3)

    # Print accuracy

    accuracy3 = calc_accuracy(y_test, y_pred3)

    print("The accuracy of Model 3 is: {:.2f}%".format(accuracy3*100.0))

    ### Removing variables with low correlation does not affect the accuracy, so we keep the original model with all variables

    ## ROC, AUC, accuracy and Confusion Matrix for Train Dataset

    lr_probs = model1.predict_proba(X_train)

    # keep probabilities for the positive outcome only

    lr_probs = lr_probs[:, 1]

    # calculate scores

    lr_auc = roc_auc_score(y_train, lr_probs)

    # summarize scores

    print('Logistic: ROC AUC=%.3f' % (lr_auc))

    lr_fpr, lr_tpr, _ = roc_curve(y_train, lr_probs)

    # plot the roc curve for the model

    plt.plot(lr_fpr, lr_tpr, marker='.', label='Logistic Regression')

    # axis labels

    plt.xlabel('False Positive Rate')

    plt.ylabel('True Positive Rate')

    # show the legend

    plt.legend()

    # show the plot

    plt.show()

    Confusion Matrix

    y_pred_train = model1.predict(X_train)

    conf_mat = confusion_matrix(y_train, y_pred_train)

    print(conf_mat)
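The requirements also list accuracy and a classification report. A minimal sketch of deriving both from labels and predictions is shown below; the arrays are made-up values, and in the notebook they would be `y_train` and `y_pred_train`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels and predictions (1 = pass, 0 = fail)
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_hat = np.array([1, 1, 0, 0, 1, 1, 0, 1])

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
acc = accuracy_score(y_true, y_hat)

print(f"TN={tn} FP={fp} FN={fn} TP={tp} accuracy={acc:.3f}")
# Per-class precision, recall and F1
print(classification_report(y_true, y_hat))
```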

    ## ROC, AUC, accuracy and Confusion Matrix for Test Dataset

    lr_probs = model1.predict_proba(X_test)

    # keep probabilities for the positive outcome only

    lr_probs = lr_probs[:, 1]

    # calculate scores

    lr_auc = roc_auc_score(y_test, lr_probs)

    # summarize scores

    print('Logistic: ROC AUC=%.3f' % (lr_auc))

    lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs)

    # plot the roc curve for the model

    plt.plot(lr_fpr, lr_tpr, marker='.', label='Logistic Regression')

    # axis labels

    plt.xlabel('False Positive Rate')

    plt.ylabel('True Positive Rate')

    # show the legend

    plt.legend()

    # show the plot

    plt.show()

    Confusion Matrix

    y_pred_test = model1.predict(X_test)

    conf_mat = confusion_matrix(y_test, y_pred_test)

    print(conf_mat)