Write a Python program to analyze NYC Open Data datasets.
Requirements and Specifications
Step 1 (20 pts.): Select two datasets from NYC Open Data https://opendata.cityofnewyork.us/. Write a paragraph about how they might be related and why looking at the data might be helpful. Be sure to check the data dictionary, which explains each column/attribute. NYC School Datasets are an example of a category that is easy to merge, as the datasets share a common key. Alternative datasets could be used, but must have my approval.
Step 2 (50 pts.): Write a notebook that cleans and merges the data. Get rid of the columns you aren't using. I expect to see the use of head() in your data where relevant, so that I can see what is going on in your dataframes. Use comments to explain how you dealt with missing data and other issues in the dataset.
Step 3 (30 pts.): Use seaborn (regression plot) or other plots/graphs to graphically illustrate the relationship between some variables in each dataset. Write another paragraph on what insights you've gained.
Upload all Jupyter notebooks, along with your code saved as a .pdf. Save your notebooks with output! Otherwise, I might need to get your data from you to test your work.
.py files are not accepted as a substitute for Jupyter notebooks. Any submission that is only a .py file will receive a grade of 0.
# Datasets

### COVID-19 Daily Counts of Cases, Hospitalizations, and Deaths

Category: Health

https://data.cityofnewyork.us/Health/COVID-19-Daily-Counts-of-Cases-Hospitalizations-an/rc75-m7u3

**Download link:** https://data.cityofnewyork.us/api/views/rc75-m7u3/rows.csv?accessType=DOWNLOAD

### Emergency Department Visits and Admissions for Influenza-like Illness and/or Pneumonia

https://data.cityofnewyork.us/Health/Emergency-Department-Visits-and-Admissions-for-Inf/2nwg-uqyg

**Download link:** https://data.cityofnewyork.us/api/views/2nwg-uqyg/rows.csv?accessType=DOWNLOAD

The first dataset reports the number of COVID-19 cases, hospitalizations, and deaths recorded in NYC per day. The second contains the number of emergency department visits for pneumonia, influenza, or similar symptoms. The plan is to check whether the two datasets are directly related, starting from the date the first COVID-19 cases were reported in NYC.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

# Step 1: Download Datasets

    data1 = pd.read_csv('https://data.cityofnewyork.us/api/views/rc75-m7u3/rows.csv?accessType=DOWNLOAD')
    data1.head()

    data2 = pd.read_csv('https://data.cityofnewyork.us/api/views/2nwg-uqyg/rows.csv?accessType=DOWNLOAD')
    data2.head()

# Step 2: Clean Datasets

### For first dataset, select only the columns of interest

    columns_of_interest = ['DATE_OF_INTEREST', 'CASE_COUNT', 'HOSPITALIZED_COUNT', 'DEATH_COUNT']
    # .copy() gives an independent frame, so the column assignment below does not
    # trigger pandas' SettingWithCopyWarning
    df1 = data1[columns_of_interest].copy()

### Convert 'DATE_OF_INTEREST' to Datetime

    df1['DATE_OF_INTEREST'] = pd.to_datetime(df1['DATE_OF_INTEREST'])
    df1.head()

### Let's do the same for the second dataset

    columns_of_interest = ['extract_date', 'date', 'total_ed_visits', 'ili_pne_visits']
    df2 = data2[columns_of_interest].copy()
    df2['extract_date'] = pd.to_datetime(df2['extract_date'])
    df2['date'] = pd.to_datetime(df2['date'])
    df2.head()

### Let's find the date of the first reported COVID-19 case

    # Use the earliest parsed date rather than relying on row 0 of the raw data
    start_date = df1['DATE_OF_INTEREST'].min()
    print(start_date)

### Find all rows in second dataset from start_date to present

    df2 = df2[df2['date'] >= start_date]
    df2.head()

# Step 3: Plots

### Plot the number of COVID-19 cases and number of ER visits per day

**NOTE:** We will normalize the counts in both datasets to the range 0 to 1, because we only want to compare the shape of the curves and not the values.

    # groupby sorts by the date key; sort_index() makes the ordering explicit
    df1_grouped = df1.groupby('DATE_OF_INTEREST').sum().sort_index()
    # Min-max normalization of every column
    df1_grouped = (df1_grouped - df1_grouped.min()) / (df1_grouped.max() - df1_grouped.min())
    df1_grouped.head()

    # Sum only the count columns; 'extract_date' is a datetime and cannot be summed
    value_columns = ['total_ed_visits', 'ili_pne_visits']
    df2_grouped = df2.groupby('date')[value_columns].sum().sort_index()
    df2_grouped = (df2_grouped - df2_grouped.min()) / (df2_grouped.max() - df2_grouped.min())
    df2_grouped.head()

### Plot

    # Draw both curves on the same axes
    fig, ax = plt.subplots()
    df1_grouped.plot(y='CASE_COUNT', label='COVID-19 Cases', ax=ax)
    df2_grouped.plot(y='ili_pne_visits', label='Hospital Visits', ax=ax)
    ax.legend()
    plt.show()

The two curves are clearly correlated. At the beginning they are very similar, probably because when the pandemic began people were very scared and went to the ER at the first symptom, however minor. As time passed and social distancing measures were implemented, the number of COVID-19 cases decreased, as did visits to the ER. By October 2020 the number of cases increased again (second wave) and ER visits also increased, although by less, possibly because people were no longer as scared.

## Correlation maps for each dataset

    # numeric_only=True restricts the correlation to numeric columns;
    # recent pandas versions no longer drop non-numeric columns automatically
    corr1 = data1.corr(numeric_only=True)
    corr2 = data2.corr(numeric_only=True)

### Dataset 1

    sns.heatmap(corr1)
    plt.show()

### Dataset 2

    sns.heatmap(corr2)
    plt.show()
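
The comparison above is visual only, and the two datasets are never joined on their shared date key. As a minimal sketch of how that merge could look, the example below uses small made-up frames in place of the real per-day data (the values are purely illustrative, and the `min_max` helper and the `cases_norm`/`visits_norm` column names are hypothetical). It merges on the date, normalizes both counts, and computes a Pearson correlation.

```python
import pandas as pd

# Hypothetical toy data standing in for the cleaned, per-day frames.
covid = pd.DataFrame({
    "DATE_OF_INTEREST": pd.to_datetime(
        ["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04"]),
    "CASE_COUNT": [1, 4, 9, 16],
})
visits = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-03-02", "2020-03-03", "2020-03-04", "2020-03-05"]),
    "ili_pne_visits": [10, 20, 30, 40],
})


def min_max(series):
    """Rescale a numeric Series to the [0, 1] range."""
    return (series - series.min()) / (series.max() - series.min())


# Inner join keeps only the dates present in both frames;
# the duplicate key column from the right frame is dropped afterwards.
merged = covid.merge(
    visits, left_on="DATE_OF_INTEREST", right_on="date", how="inner"
).drop(columns="date")

# Normalized columns are useful for overlaying the curves on one plot.
merged["cases_norm"] = min_max(merged["CASE_COUNT"])
merged["visits_norm"] = min_max(merged["ili_pne_visits"])

# Pearson correlation between the two daily counts.
r = merged["CASE_COUNT"].corr(merged["ili_pne_visits"])

print(merged)
print(f"Pearson r = {r:.3f}")
```

Pearson correlation is unchanged by min-max scaling, so the normalized columns matter only for plotting. With a merged frame like this, a call such as `sns.regplot(data=merged, x="CASE_COUNT", y="ili_pne_visits")` would give the regression plot the assignment asks for.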