# Create a Program to Implement the Python Data Science Stack: Assignment Solution

## Instructions

Objective
Write a program that uses the Python data science stack.

## Requirements and Specifications

Goals
- To acquire a basic understanding of the Python "data science stack" (NumPy, Pandas, Matplotlib, Scikit-Learn).
- To have an early experience of manipulating, summarizing, and visualizing small datasets.
- To demonstrate the ability to write Python code to answer questions and test hypotheses based on the contents of those datasets.
- To learn how to implement several different machine learning classification models in Python.
- To learn how to test a model and produce a set of plots and performance measures.
Source Code
# COP 4045 - Python Programming - Dr. Marques - Summer 2021

# Assignment 10: Introducing the Python Data Science stack

## STARTER

### Goals

- To acquire a basic understanding of the Python "data science stack" (NumPy, Pandas, Matplotlib, Scikit-Learn).
- To have an early experience of manipulating, summarizing, and visualizing small datasets.
- To demonstrate the ability to write Python code to answer questions and test hypotheses based on the contents of those datasets.
- To learn how to implement several different machine learning classification models in Python.
- To learn how to test a model and produce a set of plots and performance measures.

### Instructions

- This assignment is structured in two parts.
- For each part, there will be some Python code to be written and questions to be answered.
- At the end, you should export your notebook to PDF format; it will "automagically" become your report.
- Submit the report (PDF), notebook (.ipynb file), and the link to the "live" version of your solution on Google Colaboratory via Canvas.
- The number of points is indicated next to each part. They add up to 100.
- There are additional (10 points worth of) bonus items, which are, of course, optional.

### Important

- It is OK to attempt the bonus points, but please **do not overdo it!**
- Remember: this is an early exercise in exploring datasets; learning the syntax and "tricks" of Python, Jupyter notebooks, NumPy, Pandas, and Matplotlib; and writing code to use data to test simple hypotheses, produce answers to simple questions, or make predictions.

---

### Imports + Google Drive

```python
# Imports
# (true division is the default in Python 3, so no __future__ import is needed)
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
import seaborn as sns
sns.set(style='ticks', palette='Set2')
%matplotlib inline

# OPTIONAL
# Mount Google Drive
# from google.colab import drive
# drive.mount('/content/drive')
```

---

## Part 1: EDA

The Python code below will load a dataset containing the salaries and demographic data of more than 1000 employees of a hypothetical company, available in the file *salaries.csv*, which is a simple comma-separated list of labels and values.

```python
#salaries = pd.read_csv('/content/drive/My Drive/Colab Notebooks/data/salaries.csv')
salaries = pd.read_csv('salaries.csv')

# Alternative: upload the file from your machine when running on Colab
# from google.colab import files
# uploaded = files.upload()
# import io
# salaries = pd.read_csv(io.StringIO(uploaded['salaries.csv'].decode('utf-8')))

print(salaries.shape)
print(salaries.count())
```

```python
salaries.head()
```

```python
salaries.describe()
```

---

### Summary statistics and correlations

Let's explore the dataset by plotting some graphs and displaying summary statistics.

The code below should display:

- Min, max, average, and median salary (global)
- A histogram of salaries
- A scatterplot correlating salaries and years of education
- The (Pearson) correlation coefficient between the two variables.

This should help us get started.

```python
salary = np.array(salaries['earn'])

print("--- Salary statistics ---")
print("Minimum salary (global): ${:6.2f}".format(np.min(salary)))
print("Maximum salary (global): ${:6.2f}".format(np.max(salary)))
print("Average salary (global): ${:6.2f}".format(np.mean(salary)))
print("Median salary (global): ${:6.2f}".format(np.median(salary)))
```

```python
plt.hist(salary)
plt.title('Salary Distribution')
plt.xlabel('Salary')
plt.ylabel('Number of Employees');
```

```python
years = np.array(salaries['ed'])

plt.title('Salary vs. Education Level')
plt.ylabel('Salary')
plt.xlabel('Years of education');
plt.scatter(years, salary, alpha=0.5)
plt.show()
```

```python
# Compute Pearson coefficient
corr, _ = pearsonr(salary, years)
print('Correlation coefficient: ', corr)
```

The [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) (a value between -1 and 1) can be used to summarize the strength of the linear relationship between two data samples.

A simplified way to interpret the result is (see table 1 [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6107969/)):

- A value of 0 means no correlation
- Values below -0.5 or above 0.5 indicate a notable (negative/positive) correlation

### 1.1 Your turn! (10-14 points)

Write code to:

1. Display the total headcount and the number (and %) of male and female employees. (2 pts)
2. Compute and display the min, max, average, and median salary *per gender*. (8 pts)
3. (OPTIONAL) Plot meaningful graphs that could provide insight into the gender inequality (*if any is present*) associated with the salaries in the company. (<= 4 bonus points)

```python
# Enter your code here
# 1) Headcount, counts, and percentages
total = len(salaries)
n_males = len(salaries[salaries['sex'] == "male"])
n_females = len(salaries[salaries['sex'] == "female"])
print("Total headcount: {}".format(total))
print("There are {} males ({:.2f}%) and {} females ({:.2f}%).".format(
    n_males, n_males / total * 100.0, n_females, n_females / total * 100.0))
```

```python
# 2) Display mean, std, min, median (the 50% column), max... for earn, grouped by sex
salaries.groupby('sex')['earn'].describe()
```

```python
# Part 3) Graph
# Plot the salary density for each sex; labels come from the group keys,
# so the legend cannot be swapped by accident
fig, ax = plt.subplots(figsize=(8,6))
for sex, group in salaries.groupby('sex'):
    group['earn'].plot(kind='kde', ax=ax, label=sex)
ax.set_xlabel('Salary')
ax.legend()
```

The density plot suggests that salaries for males tend to be higher than salaries for females.

---

### Signs of inequality

As you can possibly tell by now, this dataset may help us test hypotheses and answer questions related to possible sources of inequality associated with the salary distribution: gender, age, race, etc.

Let's assume, for the sake of argument, that the number of years of education should correlate well with a person's salary (this is clearly a weak argument, and the plot and Pearson correlation coefficient computation above suggest that this is *not* the case) and that other suspiciously high (positive or negative) correlations could be interpreted as a sign of inequality.

See Notebooks 1 and 2 from [my ICMLA 2019 tutorial with Christian Garbin](https://github.com/fau-masters-collected-works-cgarbin/ieee-icmla-2019-data-science-tutorial) for additional insights.

---

### Hypotheses H1, H2, H3

At this point, we will formulate 3 different hypotheses that might suggest that the salary distribution is biased by factors such as age, gender, or race:

- H1: Older employees are paid less (i.e., ageism)
- H2: Female employees are paid less (i.e., gender bias)
- H3: Non-whites are paid less (i.e., race bias).

### 1.2 Your turn! (24-30 points)

Write Python code to test hypotheses H1, H2, and H3 (and some text to explain whether they were confirmed or not).

Feel free to (also) use plots, but make your code independent of a human being interpreting those plots.

**Weight**: 24 pts, i.e., 8 pts per hypothesis.
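
One way to make such a test independent of plot interpretation is a one-sided two-sample t-test, where a small p-value supports the claim that one group earns less than the other. The following is a minimal sketch on synthetic data, not the assignment's dataset: the group sizes, the means, and the helper name `is_paid_less` are hypothetical, and it assumes SciPy 1.6 or later for the `alternative` argument.

```python
import numpy as np
from scipy import stats

# Synthetic salaries (hypothetical numbers, NOT taken from salaries.csv).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50_000, scale=10_000, size=300)  # e.g. "younger"
group_b = rng.normal(loc=40_000, scale=10_000, size=300)  # e.g. "older"


def is_paid_less(lower, higher, alpha=0.05):
    """Test the hypothesis mean(lower) < mean(higher).

    Runs Welch's one-sided t-test (unequal variances allowed) and returns
    (supported, p_value). A small p-value supports the hypothesis.
    """
    result = stats.ttest_ind(higher, lower, equal_var=False,
                             alternative='greater')
    return bool(result.pvalue < alpha), float(result.pvalue)


supported, p_value = is_paid_less(group_b, group_a)
print(f"p-value = {p_value:.3g}; hypothesis supported: {supported}")
```

With real data, `lower` and `higher` would be the `earn` values of the two groups being compared. A two-sided test would only say that the means differ, not in which direction.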
Up to 6 bonus points for insightful additional hypotheses, code, and/or comments.

```python
# Required imports
from scipy import stats
```

H1:

```python
# For this test, we consider "older" the employees aged 60 or more
younger = salaries[salaries['age'] < 60]['earn']
older = salaries[salaries['age'] >= 60]['earn']
# One-sided test: a small p-value supports mean(younger) > mean(older)
pval = stats.ttest_ind(younger, older, alternative='greater').pvalue
print(pval)
if pval < 0.05:
    print("We accept ", end="")
else:
    print("We reject ", end="")
print("the hypothesis H1 that older employees are paid less than younger employees.")
```

H2:

```python
# Enter your code here
male = salaries[salaries['sex'] == "male"]['earn']
female = salaries[salaries['sex'] == "female"]['earn']
# One-sided test: a small p-value supports mean(male) > mean(female)
pval = stats.ttest_ind(male, female, alternative='greater').pvalue
print(pval)
if pval < 0.05:
    print("We accept ", end="")
else:
    print("We reject ", end="")
print("the hypothesis H2 that female employees are paid less than male employees.")
```

H3:

```python
# Enter your code here
whites = salaries[salaries['race'] == "white"]['earn']
nonwhites = salaries[salaries['race'] != "white"]['earn']
# One-sided test: a small p-value supports mean(white) > mean(nonwhite)
pval = stats.ttest_ind(whites, nonwhites, alternative='greater').pvalue
print(pval)
if pval < 0.05:
    print("We accept ", end="")
else:
    print("We reject ", end="")
print("the hypothesis H3 that non-white employees are paid less than white employees.")
```

---

## Part 2: Classification

### 2a. Iris flower classification

The Python code below will load a dataset containing information about three types of Iris flowers that had the size of their petals and sepals carefully measured.

Fisher's Iris dataset contains 150 observations with 4 features each:

- sepal length in cm;
- sepal width in cm;
- petal length in cm; and
- petal width in cm.

The class for each instance is stored in a separate column called “species”. In this case, the first 50 instances belong to class Setosa, the following 50 belong to class Versicolor and the last 50 belong to class Virginica.
See: https://archive.ics.uci.edu/ml/datasets/Iris for additional information.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

iris = sns.load_dataset("iris")
iris.head()
```

#### Histograms, pair plots and summary statistics

The code below:

1. Computes and displays relevant summary statistics for the whole dataset.
2. Displays the pair plots for all (4) attributes for all (3) categories / species / classes in the Iris dataset.

```python
# Display pair plot
sns.pairplot(iris, hue='species', height=2.5);
```

```python
# Display summary statistics for the whole dataset
iris.describe()
```

#### 2.1 Your turn! (25 points)

Write code to:

1. Build a decision tree classifier using scikit-learn's DecisionTreeClassifier (using the default options). Check documentation at https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
2. Plot the resulting decision tree. (Note: if graphviz gives you headaches, a text-based 'plot' using export_text should be OK.)
3. Perform k-fold cross-validation using k=3 and display the results.
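
The text-based alternative mentioned in item 2 and the 3-fold cross-validation in item 3 can be sketched as follows. This is an illustration under stated assumptions rather than the graded solution: it uses scikit-learn's bundled copy of Iris (`load_iris`) instead of the seaborn DataFrame, fixes `random_state=0` only for reproducibility, and the `_demo` names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: scikit-learn's bundled Iris data stands in for the seaborn copy.
data = load_iris()
X_demo, y_demo = data.data, data.target

# Fit a tree and render it as indented text (no graphviz needed).
clf_demo = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
tree_text = export_text(clf_demo, feature_names=list(data.feature_names))
print(tree_text)

# 3-fold cross-validation scored by accuracy, the usual metric for a classifier.
cv_demo = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
fold_scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                              X_demo, y_demo, cv=cv_demo, scoring="accuracy")
print("Accuracy per fold:", fold_scores)
print("Mean accuracy:", fold_scores.mean())
```

Stratified folds keep the three species balanced in every split, which matters here because the rows are ordered by class.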

```python
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.model_selection import KFold, cross_val_score

# Transform the species to numbers
species = np.unique(iris['species'])
species_dict = {species[i]: i for i in range(len(species))}
iris["species"] = iris["species"].map(species_dict)

# Part 1)
# Split into features (X) and labels (y), then into train and test sets
y = iris['species']
X = iris.drop(columns=['species'])
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Create tree
clf = DecisionTreeClassifier()
# Train Decision Tree Classifier
clf = clf.fit(X_train, y_train)

# Part 2): Plot tree
# class_names must be a list of names, one per class
tree.export_graphviz(clf, out_file="tree.dot", feature_names=list(X.columns), class_names=list(species), filled=True)
tree.plot_tree(clf, feature_names=list(X.columns), class_names=list(species), filled=True)

# Part 3
k = 3
kf = KFold(n_splits=k, shuffle=True, random_state=42)
cnt = 1
# The split() method generates indices to split data into training and test sets.
for train_index, test_index in kf.split(X, y):
    print(f'Fold:{cnt}, Train set: {len(train_index)}, Test set:{len(test_index)}')
    cnt += 1
# Accuracy is the appropriate score for a classifier (not mean squared error)
score = cross_val_score(clf, X, y, cv=kf, scoring="accuracy")
print(f'Scores for each fold: {score}')
score.mean()
```

### 2b. Digit classification

The MNIST handwritten digit dataset consists of a training set of 60,000 examples, and a test set of 10,000 examples. Each image in the dataset has 28$\times$28 pixels.

The Python code below loads the images from the MNIST dataset, flattens them, normalizes them (i.e., maps the intensity values from [0..255] to [0..1]), and displays a few images from the training set.
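
The flattening and normalization steps can be checked in isolation on a tiny stand-in array before touching the full dataset. The sketch below uses random `uint8` "images" (an assumption for illustration, not MNIST itself); the variable names are hypothetical.

```python
import numpy as np

# Five fake 28x28 grayscale images with intensities in [0, 255].
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Flatten each 28x28 image into a 784-element row vector.
flat = fake_images.reshape(len(fake_images), -1).astype("float32")

# Map intensities from [0, 255] to [0, 1].
scaled = flat / 255

print(flat.shape)
print(scaled.min(), scaled.max())
```

Using `-1` in `reshape` lets NumPy infer the 784 columns, so the same line works for any number of images.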

```python
from keras.datasets import mnist

# Model / data parameters
num_classes = 10
input_shape = (28, 28, 1)

# the data, split between train and validation sets
(X_train, y_train), (X_valid, y_valid) = mnist.load_data()
```

```python
print(X_train.shape)
print(y_train.shape)
print(y_train[0:12])
```

```python
plt.figure(figsize=(5,5))
for k in range(12):
    plt.subplot(3, 4, k+1)
    plt.imshow(X_train[k], cmap='Greys')
    plt.axis('off')
plt.tight_layout()
plt.show()
```

```python
print(X_valid.shape)
print(y_valid.shape)
print(y_valid[0])
```

```python
plt.imshow(X_valid[0], cmap='Greys')
plt.axis('off')
plt.show()
```

```python
# Reshape (flatten) images
X_train_reshaped = X_train.reshape(60000, 784).astype('float32')
X_valid_reshaped = X_valid.reshape(10000, 784).astype('float32')

# Scale images to the [0, 1] range
X_train_scaled_reshaped = X_train_reshaped / 255
X_valid_scaled_reshaped = X_valid_reshaped / 255

# Renaming for conciseness
X_training = X_train_scaled_reshaped
X_validation = X_valid_scaled_reshaped

print("X_training shape (after reshaping + scaling):", X_training.shape)
print(X_training.shape[0], "train samples")
print("X_validation shape (after reshaping + scaling):", X_validation.shape)
print(X_validation.shape[0], "validation samples")
```

```python
import tensorflow as tf

# convert class vectors to binary class matrices
y_training = tf.keras.utils.to_categorical(y_train, num_classes)
y_validation = tf.keras.utils.to_categorical(y_valid, num_classes)
```

```python
print(y_valid[0])
print(y_validation[0])
```

#### A baseline classifier

The code below is an example of how to:

1. Build and fit a 10-class Naive Bayes classifier using scikit-learn's MultinomialNB() with default options and using the raw pixel values as features.
2. Make predictions on the test data, compute the overall accuracy, and plot the resulting confusion matrix.
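
To see how a confusion matrix relates to overall accuracy, it helps to work through a case small enough to check by hand. The labels below are made up for illustration (three classes, seven samples); only `confusion_matrix` and `accuracy_score` come from scikit-learn.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for a 3-class problem.
y_true_demo = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 2, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Correct predictions sit on the diagonal, so accuracy = trace / total.
overall = np.trace(cm) / cm.sum()
print("Accuracy from matrix:", overall)
print("accuracy_score:      ", accuracy_score(y_true_demo, y_pred_demo))

# Per-class recall: fraction of each true class that was predicted correctly.
per_class_recall = np.diag(cm) / cm.sum(axis=1)
print("Per-class recall:", per_class_recall)
```

Plotting the transpose of this matrix, with true labels on the x-axis and predicted labels on the y-axis, only changes the orientation, not the counts.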

```python
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_training, y_train)
```

```python
pred_labels = model.predict(X_validation)
print(pred_labels.shape)
print(pred_labels)
print(y_valid)
```

```python
from sklearn.metrics import confusion_matrix
import seaborn as sns

mat = confusion_matrix(y_valid, pred_labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False, cmap="YlGnBu")
plt.xlabel('true label')
plt.ylabel('predicted label');
```

```python
from sklearn.metrics import accuracy_score

accuracy_score(y_valid, pred_labels)
```

#### 2.2 Your turn! (20 points)

Write code to:

1. Build and fit a 10-class Random Forest classifier using scikit-learn's RandomForestClassifier() with default options (don't forget random_state=0) and using the raw pixel values as features.
2. Make predictions on the test data, compute the overall accuracy and plot the resulting confusion matrix.

Hint: your accuracy should be > 90%

```python
# ENTER YOUR CODE HERE
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=0)
rf.fit(X_training, y_train)
pred_labels = rf.predict(X_validation)

mat = confusion_matrix(y_valid, pred_labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False, cmap="YlGnBu")
plt.xlabel('true label')
plt.ylabel('predicted label');

print("The accuracy is: {:.2f}%".format(accuracy_score(y_valid, pred_labels)*100.0))
```

## Conclusions (21 points)

Write your conclusions and make sure to address the issues below:

- What have you learned from this assignment?
- Which parts were the most fun, time-consuming, enlightening, tedious?
- What would you do if you had an additional week to work on this?

We see that the RandomForestClassifier performed better than the Naive Bayes classifier, reaching an accuracy of 97%, which represents an increase of almost 14 percentage points.
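
The Naive Bayes versus Random Forest comparison can be reproduced in miniature without downloading MNIST. The sketch below is an assumption-laden stand-in: it uses scikit-learn's bundled 8×8 digits dataset (`load_digits`), a hypothetical 75/25 split, and `_demo`-style names, so its accuracies will differ from the MNIST figures quoted above.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Assumption: the small bundled digits dataset (8x8 images, values 0..16)
# stands in for MNIST so the sketch runs offline and quickly.
digits = load_digits()
X_small = digits.data / 16.0  # scale pixel values to [0, 1]
y_small = digits.target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_small, y_small, test_size=0.25, random_state=0, stratify=y_small)

# Fit both classifiers on the same split and score them on the same test set.
nb_acc = accuracy_score(y_te, MultinomialNB().fit(X_tr, y_tr).predict(X_te))
rf_acc = accuracy_score(
    y_te, RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te))

print("Naive Bayes accuracy:   {:.2f}%".format(nb_acc * 100.0))
print("Random Forest accuracy: {:.2f}%".format(rf_acc * 100.0))
print("Difference: {:.2f} percentage points".format((rf_acc - nb_acc) * 100.0))
```

Reporting the gap in percentage points keeps it distinct from a relative improvement, which would be computed as a ratio of the two accuracies.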