## Instructions

**Objective**

Write a Python program that implements a health informatics system.

## Requirements and Specifications

Health informatics covers a broad aspect of the healthcare industry, offering insights on both the macro and micro level. When combined with computing standards and data visualization tools, healthcare analytics helps professionals in the field to operate better by providing real-time information that can support decisions and deliver actionable insights.

For your fourth and final assignment, you will have the opportunity to work with a model to build a predictive solution or tool. You will pull all of your model components together in a Jupyter notebook. Details are listed below.

**Exercise #1 - Selecting a Topic**

I am focusing on Diabetes Mellitus for this assignment.

**Exercise #2 - Selecting a Dataset & Model**

Once you have selected your topic area, you need to figure out what type of model you want to use and what type of data you need to do so. You can choose any public dataset or you can create a synthetic dataset using Synthea. **If you decide to create your own dataset, you will need to explicitly outline the steps you took to create it, so that it can be duplicated.**

Example diabetes datasets that can be used:

- https://archive.ics.uci.edu/ml/datasets/diabetes
- https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html
- http://odds.cs.stonybrook.edu/pima-indians-diabetes-dataset/
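
Before committing to one of the datasets above, it can help to inspect its size and target type. The following is a minimal sketch, assuming scikit-learn is installed, that loads the `load_diabetes` dataset from the list and prints its basic shape. Note that this particular dataset has a continuous target (a measure of disease progression), so it fits a numeric-outcome model rather than a binary classifier.

```python
from sklearn.datasets import load_diabetes

# Load the diabetes dataset bundled with scikit-learn (no download needed).
data = load_diabetes()
X, y = data.data, data.target

# Basic facts that matter when choosing a model type.
n_rows, n_features = X.shape
print(f"Rows: {n_rows}, features: {n_features}")
print(f"Feature names: {list(data.feature_names)}")
print(f"Target range: {y.min():.1f} to {y.max():.1f} (continuous, not 0/1)")
```

The same three checks (row count, feature count, target type) apply to whichever dataset is finally chosen.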

**Model Ideas / Groups**

- Classification Model - Binary type problem statements
- Clustering Model - Groupings or correlations
- Forecasting Model - Numeric Outcome
- Outlier Model - Looking for anomalies or odd behavior
- Time Series - 6 months, 2 days, 1 year, etc.

Please limit your dataset to no more than 100,000 observations.
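
One way to respect this limit is to take a reproducible random sample whenever a dataset is too large. The sketch below assumes pandas is available; the helper name `cap_rows`, the constant `MAX_ROWS`, and the synthetic `glucose` column are illustrative and not part of the assignment.

```python
import numpy as np
import pandas as pd

MAX_ROWS = 100_000  # observation limit stated in the assignment


def cap_rows(df: pd.DataFrame, max_rows: int = MAX_ROWS, seed: int = 42) -> pd.DataFrame:
    """Return df unchanged if it is small enough, else a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    # A fixed random_state makes the sample repeatable across runs.
    return df.sample(n=max_rows, random_state=seed).reset_index(drop=True)


# Toy demonstration on synthetic data (not real patient data).
rng = np.random.default_rng(0)
big = pd.DataFrame({"glucose": rng.normal(120, 30, size=150_000)})
small = pd.DataFrame({"glucose": rng.normal(120, 30, size=768)})

capped_big = cap_rows(big)
capped_small = cap_rows(small)
print(len(capped_big), len(capped_small))  # prints "100000 768"
```

Random sampling is only one option; for time series data, keeping a contiguous window would usually be more appropriate than sampling rows at random.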

**Exercise #3 - Algorithm Selection**

You are free to choose any algorithm(s) to work with. Keep in mind that some algorithms work better with certain models and vice versa.

**You are free to use any pre-existing code and/or libraries**, but you need to clearly and repeatedly cite any and all sources of code that is not of your own creation. Your only restriction is that you must use Jupyter notebooks for your coded analysis. You will need to explain why you selected your algorithm(s) and how they complement your model.

**Popular Algorithms**

- Random Forest
- Generalized Linear Model (GLM)
- Gradient Boosted Model (GBM)
- k-Means
- Decision Trees
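
To justify an algorithm choice, a quick side-by-side comparison is often enough. The sketch below, assuming scikit-learn is installed, compares two of the listed algorithms on a binary classification task using 5-fold cross-validation. The data is a synthetic stand-in generated with `make_classification` (768 rows and 8 features were chosen only to mirror the size of the Pima dataset), so the scores say nothing about real diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a binary-outcome dataset (NOT real patient data).
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=4, n_redundant=2, random_state=42
)

# Two candidates from the list: a GLM (logistic regression) and a random forest.
candidates = {
    "GLM (logistic regression)": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, estimator in candidates.items():
    # Mean accuracy over 5 cross-validation folds.
    cv_scores = cross_val_score(estimator, X, y, cv=5, scoring="accuracy")
    scores[name] = cv_scores.mean()
    print(f"{name}: mean CV accuracy = {scores[name]:.3f}")
```

Swapping in the real feature matrix and target for `X` and `y` gives a like-for-like comparison that can be cited when explaining the final choice.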

**Popular Libraries**

- scikit-learn (Python)
- Weka (Java)
- randomForest (R)

**Exercise #4 - Jupyter Notebook**

Once you have completed Exercises 1-3, you will now begin pulling together your model. You are limited to Python, R and Java, which each have kernels available for Jupyter Notebooks.

**Source Code**

```
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2
# Data
# The dataset used in this project is the Pima Indians Diabetes Database, found at:
# https://www.kaggle.com/uciml/pima-indians-diabetes-database
# It contains 768 rows, 8 variables and 1 target.
df = pd.read_csv('diabetes.csv')
df.head()
# Descriptive Analysis
df.describe()
# For this dataset, we see that there are no categorical variables apart from the target variable.
### Check number of missing values per column
df.isnull().sum()
# The output above shows that there are no missing values.
# Graphs
### Pair Plot
sns.pairplot(df)
# Data Cleaning
# There is no need to fill or remove missing values because there are none.
# The only step carried out here is normalization: all variables are scaled to the range [0, 1],
# which can improve accuracy for distance-based methods such as k-means.
### Extract X and Y values from dataset and convert them to numpy arrays
Y = df['Outcome'].values.reshape((len(df),1))
X = df.drop(columns = ['Outcome']).values
### Normalize
scaler = preprocessing.MinMaxScaler().fit(X)
X = scaler.transform(X)
# Model
### Split into Train and Test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 42)
print(f"There are {len(X_train)} training samples")
print(f"There are {len(X_test)} testing samples")
### Build Model
# We will build a model with 2 clusters because there are only two possible output values: 0 or 1
model = KMeans(n_clusters = 2)
model.fit(X_train)
### Predict
y_predicted = model.predict(X_test).reshape((len(X_test),1))
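### Score
# Note: KMeans.score returns the negative of the k-means objective on X_test, not a classification accuracy
model.score(X_test)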
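### Compute Accuracy
# Note: k-means cluster ids are arbitrary, so this formula assumes cluster 0/1 lines up with Outcome 0/1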
accuracy = 1.0 - np.abs(y_predicted-Y_test).sum()/len(Y_test)
print(f"The accuracy of the model is {accuracy*100.0}%")
# Improve Model
# For each sample we have 8 variables, but some of these variables may be unhelpful or may cause the model to overfit.
# To address this, we will use feature selection to find the best number of variables.
### Run 8 models to check for which number of features we obtain the highest accuracy
accuracies = []
for k in range(1,9):
    X_new = SelectKBest(chi2, k = k).fit_transform(X, Y)
    X_train2, X_test2, Y_train2, Y_test2 = train_test_split(X_new, Y, test_size = 0.33, random_state = 42)
    model2 = KMeans(n_clusters = 2)
    model2.fit(X_train2)
    y_predicted2 = model2.predict(X_test2).reshape((len(X_test2),1))
    accuracy2 = 1.0 - np.abs(y_predicted2-Y_test2).sum()/len(X_test2)
    accuracies.append(accuracy2)
print(f"The highest accuracy was obtained for {range(1,9)[np.argmax(accuracies)]} features")
### Improved Model
X_new = SelectKBest(chi2, k = 7).fit_transform(X, Y)
X_train2, X_test2, Y_train2, Y_test2 = train_test_split(X_new, Y, test_size = 0.33, random_state = 42)
model2 = KMeans(n_clusters = 2)
model2.fit(X_train2)
y_predicted2 = model2.predict(X_test2).reshape((len(X_test2),1))
accuracy2 = 1.0 - np.abs(y_predicted2-Y_test2).sum()/len(X_test2)
print(f"The new accuracy is: {accuracy2*100.0}%")
```
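
The accuracy formula in the notebook compares k-means cluster ids directly with the `Outcome` labels. Because cluster ids are assigned arbitrarily, cluster 0 may correspond to `Outcome` 1 and vice versa, in which case the reported accuracy is the complement of the true agreement. The sketch below shows one possible correction for the two-cluster case; the helper name `aligned_accuracy` and the toy arrays are illustrative and do not come from the notebook.

```python
import numpy as np


def aligned_accuracy(y_true, cluster_labels):
    """Accuracy of a 2-cluster assignment under the better of the two label mappings."""
    y_true = np.asarray(y_true).ravel()
    cluster_labels = np.asarray(cluster_labels).ravel()
    # Agreement when cluster ids are read directly as class labels.
    direct = float(np.mean(y_true == cluster_labels))
    # With binary labels, swapping 0 and 1 turns every match into a mismatch.
    return max(direct, 1.0 - direct)


# Toy example: the clustering is nearly perfect, but with swapped ids.
y_true = np.array([0, 0, 1, 1, 1, 0])
clusters = np.array([1, 1, 0, 0, 1, 1])

naive = float(np.mean(y_true == clusters))
aligned = aligned_accuracy(y_true, clusters)
print(f"naive accuracy:   {naive:.3f}")    # 1 of 6 match directly
print(f"aligned accuracy: {aligned:.3f}")  # 5 of 6 match after swapping ids
```

Choosing the mapping with the test labels gives a slightly optimistic number. A stricter variant would pick the mapping on the training labels and then apply it unchanged to the test predictions.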