Program to Implement Data Science Python Stack in Python

Instructions

Objective
Write a Python program that uses the Python data science stack.
Requirements and Specifications

Goals
To acquire a basic understanding of the Python "data science stack" (NumPy, Pandas, Matplotlib, Scikit-Learn).
To have an early experience of manipulating, summarizing, and visualizing small datasets.
To demonstrate the ability to write Python code to answer questions and test hypotheses based on the contents of those datasets.
To learn how to implement several different machine learning classification models in Python.
To learn how to test a model and produce a set of plots and performance measures.
Source Code
# COP 4045 - Python Programming - Dr. Marques - Summer 2021
# Assignment 10: Introducing the Python Data Science stack
## STARTER
### Goals
- To acquire a basic understanding of the Python "data science stack" (NumPy, Pandas, Matplotlib, Scikit-Learn).
- To have an early experience of manipulating, summarizing, and visualizing small datasets.
- To demonstrate the ability to write Python code to answer questions and test hypotheses based on the contents of those datasets.
- To learn how to implement several different machine learning classification models in Python
- To learn how to test a model and produce a set of plots and performance measures
### Instructions
- This assignment is structured in two parts.
- For each part, there will be some Python code to be written and questions to be answered.
- At the end, you should export your notebook to PDF format; it will "automagically" become your report.
- Submit the report (PDF), notebook (.ipynb file), and the link to the "live" version of your solution on Google Colaboratory via Canvas.
- The number of points is indicated next to each part. They add up to 100.
- There are additional bonus items (worth 10 points), which are, of course, optional.
### Important
- It is OK to attempt the bonus points, but please **do not overdo it!**
- Remember: this is an early exercise in exploring datasets; learning the syntax and "tricks" of Python, Jupyter notebooks, Numpy, Pandas, and Matplotlib; and writing code to use data to test simple hypotheses, produce answers to simple questions, or make predictions.
---------
### Imports + Google Drive
# Imports
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
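from scipy.stats import pearsonr
# Note: `from __future__ import division` is omitted. It is unnecessary in Python 3
# and raises a SyntaxError unless it is the first statement in the cell.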
import seaborn as sns
sns.set(style='ticks', palette='Set2')
%matplotlib inline
# OPTIONAL
# Mount Google Drive
# from google.colab import drive
# drive.mount('/content/drive')
-------------------
## Part 1: EDA
The Python code below will load a dataset containing the salaries and demographic data of more than 1000 employees of a hypothetical company, available in the file *salaries.csv*, which is a simple comma-separated list of labels and values.
#salaries = pd.read_csv('/content/drive/My Drive/Colab Notebooks/data/salaries.csv')
salaries = pd.read_csv('salaries.csv')
"""
from google.colab import files
uploaded = files.upload()
import io
salaries = pd.read_csv(io.StringIO(uploaded['salaries.csv'].decode('utf-8')))
"""
print(salaries.shape)
print(salaries.count())
salaries.head()
salaries.describe()
--------------------
### Summary statistics and correlations
Let's explore the dataset by plotting some graphs and displaying summary statistics.
The code below should display:
- Min, max, average, and median salary (global)
- A histogram of salaries
- A scatterplot correlating salaries and years of education
- The (Pearson) correlation coefficient between the two variables.
This should help us get started.
salary = np.array(salaries['earn'])
print("--- Salary statistics ---")
print("Minimum salary (global): ${:6.2f}".format(np.min(salary)))
print("Maximum salary (global): ${:6.2f}".format(np.max(salary)))
print("Average salary (global): ${:6.2f}".format(np.mean(salary)))
print("Median salary (global): ${:6.2f}".format(np.median(salary)))
plt.hist(salary)
plt.title('Salary Distribution')
plt.xlabel('Salary')
plt.ylabel('Number of Employees')
plt.show()
years = np.array(salaries['ed'])
plt.title('Salary vs. Education Level')
plt.ylabel('Salary')
plt.xlabel('Years of education');
plt.scatter(years, salary, alpha=0.5)
plt.show()
# Compute Pearson coefficient
corr, _ = pearsonr(salary,years)
print('Correlation coefficient: ',corr)
The [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) (a value between -1 and 1) can be used to summarize the strength of the linear relationship between two data samples.
A simplified way to interpret the result is (see table 1 [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6107969/)):
- A value of 0 means no correlation
- Values below -0.5 or above 0.5 indicate a notable (negative/positive) correlation
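As a rough illustration of what `pearsonr` computes, the coefficient can be reproduced with plain NumPy: it is the sum of the products of the deviations from the mean, divided by the square root of the product of the summed squared deviations. The arrays below are made-up toy values, not taken from *salaries.csv*.

```python
import numpy as np
from scipy.stats import pearsonr

# Toy data (hypothetical values, for illustration only)
x = np.array([8.0, 10.0, 12.0, 14.0, 16.0, 18.0])
y = np.array([21.0, 19.5, 30.0, 28.0, 41.0, 39.0])

# Deviations of each sample from its own mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# Reference value from SciPy
r_scipy, p_value = pearsonr(x, y)

print("Manual r: {:.4f}".format(r_manual))
print("SciPy r:  {:.4f}".format(r_scipy))
```

The two values should agree up to floating-point rounding.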
### 1.1 Your turn! (10-14 points)
Write code to:
1. Display the total headcount and the number (and %) of male and female employees. (2 pts)
2. Compute and display the min, max, average, and median salary *per gender*. (8 pts)
3. (OPTIONAL) Plot meaningful graphs that could provide insight into the gender inequality (*if any is present*) associated with the salaries in the company. (<= 4 bonus points)
# Enter your code here
# 1) Total headcount, plus count and percentage per gender
total = len(salaries)
counts = salaries['sex'].value_counts()
print("Total headcount: {}".format(total))
for sex, n in counts.items():
    print("{}: {} ({:.2f}%)".format(sex, n, n / total * 100.0))
# 2) Min, max, mean, and median salary per gender
salaries.groupby('sex')['earn'].agg(['min', 'max', 'mean', 'median'])
# 3) Graph: kernel density estimate of salaries for each gender
fig, ax = plt.subplots(figsize=(8,6))
# Label each curve with its own group key; groupby yields 'female' before 'male',
# so a hard-coded legend of ['male', 'female'] would swap the labels
for sex, group in salaries.groupby('sex'):
    group['earn'].plot(kind='kde', ax=ax, label=sex)
ax.set_xlabel('Salary')
ax.legend()
The density plot suggests that salaries for males tend to be higher than salaries for females.
--------------------
### Signs of inequality
As you can possibly tell by now, this dataset may help us test hypotheses and answer questions related to possible sources of inequality associated with the salary distribution: gender, age, race, etc.
Let's assume, for the sake of argument, that the number of years of education should correlate well with a person's salary (this is clearly a weak argument and the plot and Pearson correlation coefficient computation above suggest that this is *not* the case) and that other suspiciously high (positive or negative) correlations could be interpreted as a sign of inequality.
See Notebooks 1 and 2 from [my ICMLA 2019 tutorial with Christian Garbin](https://github.com/fau-masters-collected-works-cgarbin/ieee-icmla-2019-data-science-tutorial) for additional insights.
---------------------
### Hypotheses H1, H2, H3
At this point, we will formulate 3 different hypotheses that might suggest that the salary distribution is biased by factors such as age, gender, or race:
- H1: Older employees are paid less (i.e., ageism)
- H2: Female employees are paid less (i.e., gender bias)
- H3: Non-whites are paid less (i.e., race bias).
### 1.2 Your turn! (24-30 points)
Write Python code to test hypotheses H1, H2, and H3 (and some text to explain whether they were confirmed or not).
Feel free to (also) use plots, but make your code independent of a human being interpreting those plots.
**Weight**: 24 pts, i.e., 8 pts per hypothesis.
Up to 6 bonus points for insightful additional hypotheses, code, and/or comments.
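One way to make such a test independent of visual inspection is a one-sided two-sample t-test. The sketch below uses synthetic salaries (generated with an arbitrary seed, not the assignment data) and a hypothetical helper, `one_sided_pvalue`, that converts SciPy's two-sided p-value into a one-sided one. The 0.05 significance level is an assumption, not something the assignment prescribes.

```python
import numpy as np
from scipy import stats

def one_sided_pvalue(a, b):
    """p-value for the alternative 'mean(a) > mean(b)' (hypothetical helper)."""
    t_stat, p_two_sided = stats.ttest_ind(a, b)
    # Halve the two-sided p-value when the statistic points in the tested direction
    return p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

# Synthetic groups, for illustration only
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50000, scale=5000, size=200)
group_b = rng.normal(loc=40000, scale=5000, size=200)

pval = one_sided_pvalue(group_a, group_b)
# A small p-value rejects the null hypothesis of "no difference",
# which counts as evidence for "group_a earns more than group_b"
supported = pval < 0.05
print("p-value: {:.3g}; hypothesis supported: {}".format(pval, supported))
```

Note the direction of the conclusion: a p-value below the threshold rejects the null hypothesis and therefore supports the hypothesis being tested.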
# Required imports
from scipy import stats
H1:
# For this test, "older" means employees aged 60 or above
younger = salaries[salaries['age'] < 60]['earn']
older = salaries[salaries['age'] >= 60]['earn']
# One-sided t-test; alternative: younger employees earn more on average
# (the `alternative` argument requires SciPy >= 1.6)
pval = stats.ttest_ind(younger, older, alternative='greater').pvalue
print(pval)
# A small p-value rejects the null hypothesis of no difference, which supports H1
if pval < 0.05:
    print("The data support ", end="")
else:
    print("The data do not support ", end="")
print("the hypothesis H1 that older employees are paid less than younger employees.")
H2:
# Enter your code here
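male = salaries[salaries['sex'] == "male"]['earn']
female = salaries[salaries['sex'] == "female"]['earn']
# One-sided t-test; alternative: male employees earn more on average
# (the `alternative` argument requires SciPy >= 1.6)
pval = stats.ttest_ind(male, female, alternative='greater').pvalue
print(pval)
# A small p-value rejects the null hypothesis of no difference, which supports H2
if pval < 0.05:
    print("The data support ", end="")
else:
    print("The data do not support ", end="")
print("the hypothesis H2 that female employees are paid less than male employees.")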
H3:
# Enter your code here
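whites = salaries[salaries['race'] == "white"]['earn']
nonwhites = salaries[salaries['race'] != "white"]['earn']
# One-sided t-test; alternative: white employees earn more on average
# (the `alternative` argument requires SciPy >= 1.6)
pval = stats.ttest_ind(whites, nonwhites, alternative='greater').pvalue
print(pval)
# A small p-value rejects the null hypothesis of no difference, which supports H3
if pval < 0.05:
    print("The data support ", end="")
else:
    print("The data do not support ", end="")
print("the hypothesis H3 that non-white employees are paid less than white employees.")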
-------------------
## Part 2: Classification
### 2a. Iris flower classification
The Python code below will load a dataset containing information about three types of Iris flowers that had the size of their petals and sepals carefully measured.
Fisher’s Iris dataset contains 150 observations with 4 features each:
- sepal length in cm;
- sepal width in cm;
- petal length in cm; and
- petal width in cm.
The class for each instance is stored in a separate column called “species”. In this case, the first 50 instances belong to class Setosa, the following 50 belong to class Versicolor and the last 50 belong to class Virginica.
See:
https://archive.ics.uci.edu/ml/datasets/Iris for additional information.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
iris = sns.load_dataset("iris")
iris.head()
#### Histograms, pair plots and summary statistics
The code below:
1. Computes and displays relevant summary statistics for the whole dataset.
2. Displays the pair plots for all (4) attributes for all (3) categories / species / classes in the Iris dataset.
# Display pair plot
sns.pairplot(iris, hue='species', height=2.5);
# Display summary statistics for the whole dataset
iris.describe()
#### 2.1 Your turn! (25 points)
Write code to:
1. Build a decision tree classifier using scikit-learn's `DecisionTreeClassifier` (using the default options). Check documentation at https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
2. Plot the resulting decision tree.
(Note: if `graphviz` gives you headaches, a text-based 'plot'-- using `export_text` -- should be OK.)
3. Perform k-fold cross-validation using k=3 and display the results.
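Because the Iris rows are ordered by species, the way folds are drawn matters. The sketch below uses only the label layout (50 instances per class, in order) to compare an unshuffled `KFold` with `StratifiedKFold`. The variable and function names are illustrative, and the feature matrix is a placeholder.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Labels laid out like the Iris dataset: 50 of each class, in order
y_demo = np.repeat([0, 1, 2], 50)
X_demo = np.zeros((150, 1))  # placeholder features; only the indices matter here

def classes_per_test_fold(splitter):
    """Return the set of classes present in each test fold."""
    return [set(y_demo[test_idx].tolist())
            for _, test_idx in splitter.split(X_demo, y_demo)]

plain = classes_per_test_fold(KFold(n_splits=3))
stratified = classes_per_test_fold(StratifiedKFold(n_splits=3))

print("KFold (no shuffle):", plain)
print("StratifiedKFold:   ", stratified)
```

Without shuffling, each test fold holds a single species that the model never saw during training, so passing `shuffle=True` to `KFold` (or using a stratified splitter) is advisable on this dataset.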
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split
import graphviz
from sklearn import tree
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
# Transform the species to number
species = np.unique(iris['species'])
species_dict = {species[i]: i for i in range(len(species))}
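# Assign the result back instead of calling replace(..., inplace=True) on a column
# selection, which newer pandas versions may not apply to the original DataFrame
iris["species"] = iris["species"].map(species_dict)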
# Part 1)
# Separate features (X) from labels (y), then split into training and test sets
y = iris['species']
X = iris.drop(columns=['species'])
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Create tree
clf = DecisionTreeClassifier()
# Train Decision Tree Classifier
clf = clf.fit(X_train, y_train)
# Part 2): Plot tree
# class_names must be a list of names in ascending class order, not a single string
tree.export_graphviz(clf,
    out_file="tree.dot",
    feature_names=list(X.columns),
    class_names=[str(s) for s in species],
    filled=True)
tree.plot_tree(clf)
# Part 3): k-fold cross-validation
k = 3
kf = KFold(n_splits=k, shuffle=True, random_state=42)
cnt = 1
# split() generates the indices that divide the data into training and test sets
for train_index, test_index in kf.split(X, y):
    print(f'Fold:{cnt}, Train set: {len(train_index)}, Test set:{len(test_index)}')
    cnt += 1
# Accuracy suits a classifier; mean squared error is a regression metric
score = cross_val_score(clf, X, y, cv=kf, scoring="accuracy")
print(f'Scores for each fold: {score}')
print(f'Mean accuracy: {score.mean():.3f}')
### 2b. Digit classification
The MNIST handwritten digit dataset consists of a training set of 60,000 examples, and a test set of 10,000 examples. Each image in the dataset has 28$\times$28 pixels.
The Python code below loads the images from the MNIST dataset, flattens them, normalizes them (i.e., maps the intensity values from [0..255] to [0..1]), and displays a few images from the training set.
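The flattening and scaling steps can be sketched on a small synthetic batch, which avoids downloading MNIST. The array contents are random and purely illustrative.

```python
import numpy as np

# Synthetic stand-in for a batch of 28x28 grayscale images with intensities in [0, 255]
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Flatten each 28x28 image into a 784-element vector
flat = images.reshape(images.shape[0], -1).astype('float32')

# Map intensities from [0, 255] to [0, 1]
scaled = flat / 255

print(flat.shape)
print(scaled.min(), scaled.max())
```

Using `-1` in `reshape` lets NumPy infer the 784 columns, so the same line works for any batch size.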
from keras.datasets import mnist
# Model / data parameters
num_classes = 10
input_shape = (28, 28, 1)
# the data, split between train and validation sets
(X_train, y_train), (X_valid, y_valid) = mnist.load_data()
X_train.shape
y_train.shape
y_train[0:12]
plt.figure(figsize=(5,5))
for k in range(12):
 plt.subplot(3, 4, k+1)
 plt.imshow(X_train[k], cmap='Greys')
 plt.axis('off')
plt.tight_layout()
plt.show()
X_valid.shape
y_valid.shape
y_valid[0]
plt.imshow(X_valid[0], cmap='Greys')
plt.axis('off')
plt.show()
# Reshape (flatten) images
X_train_reshaped = X_train.reshape(60000, 784).astype('float32')
X_valid_reshaped = X_valid.reshape(10000, 784).astype('float32')
# Scale images to the [0, 1] range
X_train_scaled_reshaped = X_train_reshaped / 255
X_valid_scaled_reshaped = X_valid_reshaped / 255
# Renaming for conciseness
X_training = X_train_scaled_reshaped
X_validation = X_valid_scaled_reshaped
print("X_training shape (after reshaping + scaling):", X_training.shape)
print(X_training.shape[0], "train samples")
print("X_validation shape (after reshaping + scaling):", X_validation.shape)
print(X_validation.shape[0], "validation samples")
import tensorflow as tf
# convert class vectors to binary class matrices
y_training = tf.keras.utils.to_categorical(y_train, num_classes)
y_validation = tf.keras.utils.to_categorical(y_valid, num_classes)
print(y_valid[0])
print(y_validation[0])
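For intuition, the one-hot conversion performed by `to_categorical` can be approximated in plain NumPy by indexing an identity matrix. This is a sketch of the idea rather than the Keras implementation, and the labels and variable names are made up.

```python
import numpy as np

n_classes_demo = 10
labels_demo = np.array([5, 0, 4, 1, 9])  # made-up digit labels

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(n_classes_demo, dtype='float32')[labels_demo]

print(one_hot.shape)
print(one_hot[0])

# argmax along each row recovers the original integer labels
recovered = one_hot.argmax(axis=1)
print(recovered)
```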
#### A baseline classifier
The code below is an example of how to:
1. Build and fit a 10-class Naive Bayes classifier using scikit-learn's `MultinomialNB()` with default options and using the raw pixel values as features.
2. Make predictions on the test data, compute the overall accuracy, and plot the resulting confusion matrix.
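As a small illustration of how overall accuracy relates to the confusion matrix, the sketch below uses made-up labels for a 3-class problem. Accuracy is the sum of the diagonal (correct predictions) divided by the total number of samples.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels for a 3-class problem
y_true_demo = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred_demo = np.array([0, 1, 1, 1, 2, 0, 2, 1])

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true_demo, y_pred_demo)

# Accuracy is the diagonal sum over the total count
acc_from_cm = np.trace(cm) / cm.sum()

print(cm)
print("Accuracy from matrix: {:.3f}".format(acc_from_cm))
print("accuracy_score:       {:.3f}".format(accuracy_score(y_true_demo, y_pred_demo)))
```

The heatmaps in this notebook plot the transposed matrix (`mat.T`), so true labels run along the x-axis there.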
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_training, y_train)
pred_labels = model.predict(X_validation)
pred_labels.shape
print(pred_labels)
print(y_valid)
from sklearn.metrics import confusion_matrix
import seaborn as sns
mat = confusion_matrix(y_valid, pred_labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False, cmap="YlGnBu")
plt.xlabel('true label')
plt.ylabel('predicted label');
from sklearn.metrics import accuracy_score
accuracy_score(y_valid, pred_labels)
#### 2.2 Your turn! (20 points)
Write code to:
1. Build and fit a 10-class Random Forest classifier using scikit-learn's `RandomForestClassifier()` with default options (don't forget `random_state=0`) and using the raw pixel values as features.
2. Make predictions on the test data, compute the overall accuracy, and plot the resulting confusion matrix.
Hint: your accuracy should be > 90%
# ENTER YOUR CODE HERE
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=0)
rf.fit(X_training, y_train)
pred_labels = rf.predict(X_validation)
mat = confusion_matrix(y_valid, pred_labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False, cmap="YlGnBu")
plt.xlabel('true label')
plt.ylabel('predicted label');
print("The accuracy is: {:.2f}%".format(accuracy_score(y_valid, pred_labels)*100.0))
## Conclusions (21 points)
Write your conclusions and make sure to address the issues below:
- What have you learned from this assignment?
- Which parts were the most fun, time-consuming, enlightening, tedious?
- What would you do if you had an additional week to work on this?
We see that the `RandomForestClassifier` performed better than the Naive Bayes classifier, reaching an accuracy of 97%, which represents an increase of almost 14 percentage points.