Program to Implement NYC Datasets in Python Assignment Solution.

Instructions

Objective

Write a Python program that cleans, merges, and visualizes two NYC Open Data datasets.

Requirements and Specifications

Step 1 (20 pts.): Select two datasets from NYC Open Data (https://opendata.cityofnewyork.us/). Write a paragraph about how they might be related and why looking at the data might be helpful. Be sure to check the data dictionary, which explains each column/attribute. NYC school datasets are an example of a category that is easy to merge because the datasets share a common key. Alternative datasets may be used, but must have my approval.

Step 2 (50 pts.): Write a Python program that cleans and merges the data. Get rid of the columns you aren't using. I expect to see the use of head() where relevant, so that I can see what is going on in your dataframes. Use comments to explain how you dealt with missing data and other issues in the dataset.

Step 3 (30 pts.): Use seaborn (regression plot) or other plots/graphs to graphically illustrate the relationship between some variables in each dataset. Write another paragraph on what insights you've gained.

Upload all Jupyter notebooks, along with your code saved as a .pdf. Save your notebooks with output! Otherwise, I might need to get your data from you to test your work.

.py files are not accepted as a substitute for Jupyter notebooks. Any submission that is only a .py file will receive a grade of 0.

Source Code

# Datasets
### COVID-19 Daily Counts of Cases, Hospitalizations, and Deaths
https://data.cityofnewyork.us/Health/COVID-19-Daily-Counts-of-Cases-Hospitalizations-an/rc75-m7u3
**Download link:** https://data.cityofnewyork.us/api/views/rc75-m7u3/rows.csv?accessType=DOWNLOAD
### Emergency Department Visits and Admissions for Influenza-like Illness and/or Pneumonia
https://data.cityofnewyork.us/Health/Emergency-Department-Visits-and-Admissions-for-Inf/2nwg-uqyg
**Download link:** https://data.cityofnewyork.us/api/views/2nwg-uqyg/rows.csv?accessType=DOWNLOAD
The first dataset reports the number of COVID-19 cases, hospitalizations, and deaths recorded in New York City per day. The second contains the number of emergency department (ER) visits for pneumonia, influenza-like illness, or similar symptoms. The plan is to check for a direct relationship between the two, starting from the date the first COVID-19 cases were reported in the city.
import pandas as pd
import requests
import matplotlib.pyplot as plt
import io
import seaborn as sns
## Step 1: Download Datasets
data1 = pd.read_csv('https://data.cityofnewyork.us/api/views/rc75-m7u3/rows.csv?accessType=DOWNLOAD')
data1.head()
data2 = pd.read_csv('https://data.cityofnewyork.us/api/views/2nwg-uqyg/rows.csv?accessType=DOWNLOAD')
data2.head()
## Step 2: Clean Datasets
### For first dataset, select only the columns of interest
columns_of_interest = ['DATE_OF_INTEREST', 'CASE_COUNT', 'HOSPITALIZED_COUNT', 'DEATH_COUNT']
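# Take an explicit copy so that later column assignments do not trigger SettingWithCopyWarning
df1 = data1[columns_of_interest].copy()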
### Convert 'DATE_OF_INTEREST' to Datetime
df1['DATE_OF_INTEREST'] = pd.to_datetime(df1['DATE_OF_INTEREST'])
df1.head()
### Let's do the same for the second dataset
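columns_of_interest = ['extract_date', 'date', 'total_ed_visits', 'ili_pne_visits']
# Take an explicit copy so that later column assignments do not trigger SettingWithCopyWarning
df2 = data2[columns_of_interest].copy()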
df2['extract_date'] = pd.to_datetime(df2['extract_date'])
df2['date'] = pd.to_datetime(df2['date'])
df2.head()
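The assignment asks for comments on how missing data was handled, which the cleaning step above does not show. A minimal sketch of one way to check for and handle missing values follows. The toy frame and its values are made up for illustration, and filling missing counts with 0 is an assumption that may not suit the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df2; the values are invented for illustration only.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-01", "2020-03-02", None, "2020-03-04"]),
    "total_ed_visits": [100, np.nan, 80, 90],
    "ili_pne_visits": [10, 12, np.nan, 9],
})

# Count missing values per column to see where the gaps are.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# A row without a date cannot be placed on the timeline, so drop it.
cleaned = toy.dropna(subset=["date"])

# Assumption: a missing count means "no visits recorded", so fill with 0.
cleaned = cleaned.fillna({"total_ed_visits": 0, "ili_pne_visits": 0})
print(cleaned.head())
```

Dropping rows is the safer choice when a gap could mean "not reported" rather than "zero"; which reading applies should be checked against each dataset's data dictionary.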
### Find the date of the first reported COVID-19 case
# Use the earliest parsed date rather than assuming row 0 is the first day
start_date = df1['DATE_OF_INTEREST'].min()
print(start_date)
### Find all rows in second dataset from start_date to present
df2 = df2[df2['date'] >= start_date]
df2.head()
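The assignment also asks for the two datasets to be merged. A minimal sketch of a date-based merge is shown here on toy frames that reuse the column names from this notebook. The values are invented, and the sketch assumes the ER table can hold several rows per day, so it is summed to one row per day first.

```python
import pandas as pd

# Toy stand-ins for df1 and df2; the values are made up for illustration.
covid = pd.DataFrame({
    "DATE_OF_INTEREST": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"]),
    "CASE_COUNT": [1, 3, 7],
})
ed = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-01", "2020-03-01", "2020-03-02", "2020-03-04"]),
    "ili_pne_visits": [10, 5, 20, 30],
})

# Collapse the ER table to one row per day before merging.
ed_daily = ed.groupby("date", as_index=False)["ili_pne_visits"].sum()

# An inner join keeps only the days that appear in both tables.
merged = (
    covid.merge(ed_daily, left_on="DATE_OF_INTEREST", right_on="date", how="inner")
         .drop(columns="date")  # the join key is now duplicated, keep one copy
)
print(merged.head())
```

An inner join silently drops days that are missing from either side; `how="outer"` would keep them with NaN in the unmatched columns, which makes the gaps visible.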
## Step 3: Plots
### Plot the number of COVID-19 cases and the number of ER visits per day
**NOTE:** We normalize the counts in both datasets to the range 0 to 1, because we only want to compare the shapes of the curves, not their absolute values.
df1_grouped = df1.groupby(by=['DATE_OF_INTEREST']).sum().sort_values(by=['DATE_OF_INTEREST'], ascending=True)
df1_grouped = (df1_grouped-df1_grouped.min())/(df1_grouped.max() - df1_grouped.min())
df1_grouped.head()
# Sum only the numeric count columns; 'extract_date' is a datetime and cannot be summed
df2_grouped = df2.groupby(by=['date'])[['total_ed_visits', 'ili_pne_visits']].sum().sort_index()
df2_grouped = (df2_grouped-df2_grouped.min())/(df2_grouped.max() - df2_grouped.min())
df2_grouped.head()
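The min-max normalization applied to both grouped frames can be wrapped in a small helper. This is a sketch only: `min_max_scale` is a hypothetical name, not part of pandas, and the toy values are invented. It also guards against a constant column, where the max minus min is 0 and the division would otherwise produce infinities or NaN unpredictably.

```python
import pandas as pd


def min_max_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale each column to [0, 1]; constant columns become NaN."""
    low = frame.min()
    span = frame.max() - low
    # Replace a zero span with NaN so the division is well defined.
    span = span.where(span != 0)
    return (frame - low) / span


# Toy data: column "a" varies, column "b" is constant.
toy = pd.DataFrame({"a": [0, 5, 10], "b": [2, 2, 2]})
scaled = min_max_scale(toy)
print(scaled)
```

Scaling each column independently means the plotted curves share a 0 to 1 axis but no longer say anything about relative magnitude, which matches the stated goal of comparing shapes only.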
### Plot
# Create one figure and draw both curves on the same axes
fig, ax = plt.subplots()
df1_grouped.plot(y='CASE_COUNT', label='COVID-19 Cases', ax=ax)
df2_grouped.plot(y='ili_pne_visits', label='Hospital Visits', ax=ax)
ax.legend()
plt.show()
The two curves appear clearly correlated. At the beginning they are very similar, plausibly because when the pandemic began people were frightened and went to the ER at the first symptom, however mild. As time passed and social distancing measures were implemented, the number of COVID-19 cases decreased, as did visits to the ER. By October 2020 the number of cases increased again (the second wave) and ER visits also increased, although by less, possibly because people were no longer as scared.
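The visual impression can be backed by a number. The sketch below aligns two daily series on their shared dates and computes the Pearson correlation with `Series.corr`. The series are toy values chosen to be perfectly linear, not results from the real datasets.

```python
import pandas as pd

# Toy daily series with partially overlapping date ranges (invented values).
cases = pd.Series(
    [1, 2, 4, 8, 16],
    index=pd.date_range("2020-03-01", periods=5, freq="D"),
    name="CASE_COUNT",
)
visits = pd.Series(
    [5, 9, 17, 33, 40],
    index=pd.date_range("2020-03-02", periods=5, freq="D"),
    name="ili_pne_visits",
)

# Keep only the dates present in both series.
aligned = pd.concat([cases, visits], axis=1, join="inner")

# Pearson correlation on the raw counts.
r = aligned["CASE_COUNT"].corr(aligned["ili_pne_visits"])

# Min-max scaling is a positive linear rescaling, so it leaves r unchanged.
scaled = (aligned - aligned.min()) / (aligned.max() - aligned.min())
r_scaled = scaled["CASE_COUNT"].corr(scaled["ili_pne_visits"])

print(len(aligned), r, r_scaled)
```

A high coefficient would support the reading of the plot, but it would not by itself show that one series drives the other.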
## Correlation maps for each dataset
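# Restrict to numeric columns; date strings cannot be correlated and raise an error in recent pandas
corr1 = data1.corr(numeric_only=True)
corr2 = data2.corr(numeric_only=True)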
### Dataset 1
sns.heatmap(corr1)
### Dataset 2
sns.heatmap(corr2)
