Instructions
Requirements and Specifications
- Predict whether a liability customer will buy a loan or not.
- Identify which variables are most significant for making the prediction.
- Determine which segment of customers should be targeted more.
Source Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing, tree
import seaborn as sns
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
### Read Data
df = pd.read_csv('My_Bank.csv')
df.head(10)
print(f"This dataset has {len(df)} rows")
### Show the number of NaN values in each column
df.isnull().sum()
### Remove non-useful columns
df = df.drop(columns = ['CUST_ID'])
### Convert ACC_OP_DATE to Numeric
df['ACC_OP_DATE'] = pd.to_datetime(df['ACC_OP_DATE']).dt.strftime("%m%d%Y").astype(int)
df.head(5)
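### Note on the date encoding (sketch)
The conversion above turns each date into an MMDDYYYY integer, which is simple but not monotonic in time. A possible alternative, shown only as a sketch and not used in the rest of the notebook, encodes each date as the number of days since the earliest account-opening date:
# Sketch of an alternative encoding; it would replace the conversion above
# and must be applied to the raw ACC_OP_DATE column, before the int cast
dates = pd.to_datetime(df['ACC_OP_DATE'])
df['ACC_OP_DATE'] = (dates - dates.min()).dt.days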
### Categorize object columns
object_columns = df.select_dtypes(include=['object']).columns
for col in object_columns:
    values = df[col].unique()
    values_dict = {value: code for code, value in enumerate(values)}
    df[col] = df[col].map(values_dict)
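### Equivalent encoding with pandas factorize (sketch)
For reference, the same kind of integer encoding can be obtained in a single call with pandas' factorize; this is only an equivalent sketch, not a change to the pipeline above:
# Equivalent integer encoding of the object columns
for col in df.select_dtypes(include=['object']).columns:
    df[col], _ = pd.factorize(df[col])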
### Normalize data
df_norm = (df-df.min())/(df.max()-df.min())
df_norm.head()
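### Leakage-free scaling (sketch)
Normalizing the whole dataframe before the train/test split lets the test rows influence the per-column min and max. The effect is usually small for min-max scaling, but a leakage-free variant, sketched here with scikit-learn's MinMaxScaler, fits the scaler on the training rows only; the rest of the notebook keeps the simpler whole-frame normalization:
# Sketch: fit the scaler on the training split only
from sklearn.preprocessing import MinMaxScaler

X_raw = df.drop(columns=['TARGET'])
Y_raw = df['TARGET']
X_tr, X_te, Y_tr, Y_te = train_test_split(X_raw, Y_raw, test_size=0.3, random_state=42)
scaler = MinMaxScaler().fit(X_tr)      # min/max computed from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)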
### Extract target column
Y = df_norm['TARGET']
X = df_norm.drop(columns=['TARGET'])
X.head()
print(f"There are {len(X.columns)} variables and {len(X)} records")
### Display correlation map to see the relation between variables
f = plt.figure(figsize = (10,10))
plt.matshow(df_norm.corr(), fignum = f.number)
plt.colorbar()
plt.xticks(range(len(df_norm.columns)), df_norm.columns, rotation=90);
plt.yticks(range(len(df_norm.columns)), df_norm.columns);
plt.show()
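### Alternative rendering with seaborn (sketch)
seaborn is imported at the top but never used; the same correlation matrix can also be drawn as a heatmap with it, which is often easier to read. A minimal equivalent sketch:
# Alternative rendering of the same correlation matrix using seaborn
plt.figure(figsize=(12, 10))
sns.heatmap(df_norm.corr(), cmap='coolwarm', center=0)
plt.title('Correlation matrix of the normalized variables')
plt.show()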
### Split data into train and test
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42)
### Build LogisticRegression Model
model = LogisticRegression()
model.fit(X_train, Y_train)
### Score
model.score(X_test, Y_test)
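### Per-class evaluation (sketch)
Accuracy alone can be misleading if only a small fraction of customers actually buy the loan. As a supplementary check, shown only as a sketch, scikit-learn's confusion matrix and classification report give precision and recall for the positive class:
# Per-class view of the logistic regression predictions
from sklearn.metrics import confusion_matrix, classification_report
Y_pred = model.predict(X_test)
print(confusion_matrix(Y_test, Y_pred))
print(classification_report(Y_test, Y_pred))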
### Create a plot of model's accuracy vs. K best features
scores = []
for k in range(1, len(X.columns)):
    # chi2 requires non-negative features, which the min-max normalization guarantees
    X_new = SelectKBest(chi2, k=k).fit_transform(X, Y)
    X_train2, X_test2, Y_train2, Y_test2 = train_test_split(X_new, Y, test_size=0.3, random_state=42)
    model = LogisticRegression()
    model.fit(X_train2, Y_train2)
    score = model.score(X_test2, Y_test2)
    scores.append(score)
plt.plot(range(1, len(X.columns)), scores)
plt.grid(True)
plt.xlabel('Number of Features')
plt.ylabel("Model's Accuracy")
### Pick optimal number of features
kopt = range(1, len(X.columns))[np.argmax(scores)]
print(f"The optimal number of features is {kopt}, giving a model accuracy of {max(scores)*100.0}%")
# Build a new model, but only with the best features
### Select best features
Xopt_lr = SelectKBest(chi2, k=kopt).fit_transform(X, Y)
### Split into Train and Test with the selected features
X_train2, X_test2, Y_train2, Y_test2 = train_test_split(Xopt_lr, Y, test_size=0.3, random_state=42)
### Build Model
model2 = LogisticRegression()
model2.fit(X_train2, Y_train2)
model2.score(X_test2, Y_test2)
# Decision Tree
treeClf = tree.DecisionTreeClassifier()
treeClf.fit(X_train, Y_train)
treeClf.score(X_test, Y_test)
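### Ranking the variables by importance (sketch)
One of the requirements is to identify the most significant variables. Since this tree was fit on all features, its feature_importances_ attribute can be paired with the column names to rank them; a minimal sketch:
# Rank variables by the decision tree's impurity-based importances
importances = pd.Series(treeClf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))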
### Select K best features and run again the decision tree
scoresTree = []
for k in range(1, len(X.columns)):
    X_new = SelectKBest(chi2, k=k).fit_transform(X, Y)
    X_train3, X_test3, Y_train3, Y_test3 = train_test_split(X_new, Y, test_size=0.3, random_state=42)
    treeClf = tree.DecisionTreeClassifier()
    treeClf.fit(X_train3, Y_train3)
    score = treeClf.score(X_test3, Y_test3)
    scoresTree.append(score)
plt.plot(range(1, len(X.columns)), scoresTree)
plt.grid(True)
plt.xlabel('Number of Features')
plt.ylabel("Model's Accuracy")
koptTree = range(1, len(X.columns))[np.argmax(scoresTree)]
print(f"The optimal number of features for Decision Tree is {koptTree}, giving a model accuracy of {max(scoresTree)*100.0}%")
Xopt_tree = SelectKBest(chi2, k = koptTree).fit_transform(X, Y)
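### Names of the selected features (sketch)
fit_transform returns a plain array and drops the column names, so to see which variables SelectKBest actually kept (i.e. the most significant ones from the requirements), the selector's get_support mask can be applied to X.columns. A short sketch:
# Names of the variables chosen by SelectKBest for the decision tree
selector = SelectKBest(chi2, k=koptTree).fit(X, Y)
print(list(X.columns[selector.get_support()]))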
### Plot Scores of both LogisticRegression and DecisionTree vs. Number of features
plt.plot(range(1, len(X.columns)), scores, label = 'LogisticRegression')
plt.plot(range(1, len(X.columns)), scoresTree, label = 'DecisionTree')
plt.legend()
plt.grid(True)
plt.xlabel('Number of Features')
plt.ylabel("Model's Accuracy")
plt.show()
From the plot we can see that the Decision Tree reaches a higher accuracy than the Logistic Regression model.
For the Decision Tree the optimal number of features is 19, while for Logistic Regression it is 29.
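### Which segment to target (sketch)
For the remaining requirement, which segment of customers to target, a simple starting point (a sketch, assuming TARGET is the usual 0/1 purchase flag) is to compare the average normalized feature values of buyers and non-buyers; the features whose means differ most between the two groups hint at the segment worth targeting:
# Compare average (normalized) feature values for buyers vs. non-buyers
group_means = df_norm.groupby('TARGET').mean()
print(group_means.T)
# Features with the largest mean difference between the two classes
print((group_means.loc[1] - group_means.loc[0]).abs().sort_values(ascending=False).head(10))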