
Python Program to Detect Air Quality Assignment Solution.


Instructions

Objective
Write a Python program to detect air quality.

Requirements and Specifications


Source Code

# Application Assignment 1:

## *** Please Read:

### Air quality in U.S. cities

You are given some data on air quality in U.S. metropolitan areas over time together with several questions of interest, and your objective is to answer the questions.

However, there is no explicit instruction provided about *how* to answer the questions or where exactly to begin. Thus, you will need to discern for yourself how to manipulate and summarize the data in order to answer the questions of interest, and you will need to write your own code from scratch to obtain results. It is recommended that you examine the data, consider the questions, and plan a rough approach before you begin any computations.

You have some latitude for creativity: **although there are accurate answers to each question** -- namely, those that are consistent with the data -- **there is no singularly correct answer**. Most people will perform similar operations and obtain similar answers, but no one specific result is required in order to answer the questions accurately.

The questions can be answered using computing skills taught in class and basic internet searches for domain background; for this assignment, you may wish to refer to previous assignments and labs for code examples and the [EPA website on PM pollution](https://www.epa.gov/pm-pollution) for background. However, you are also encouraged to refer to external resources (package documentation, vignettes, stackexchange, internet searches, etc.) as needed.

The broader goal of this assignment is to cultivate your problem-solving ability in an unstructured setting. Your work will be evaluated based on the following:

- choice of method(s) used to answer questions;

- clarity of presentation;

- code style and documentation.

Please write up your results separately from your code; code should be included at the end of the notebook.

---

## Part I: Dataset

Merge the city information with the air quality data and tidy the dataset (see notes below). Write a brief description of the data.

In your description, answer the following questions:

- What is a CBSA (the geographic unit of measurement)?

- How many CBSA's are included in the data?

- In how many states and territories do the CBSA's reside? (*Hint: `str.split()`*)

- In which years were data values recorded?

- How many observations are recorded?

- How many variables are measured?

- Which variables are non-missing most of the time (*i.e.*, in at least 50% of instances)?

- What is PM 2.5 and why is it important?

Please write your description in narrative fashion; _**please do not list answers to the questions above one by one**_. A few brief paragraphs should suffice; please limit your data description to three paragraphs or fewer.

### Air quality data

*A CBSA (Core-Based Statistical Area) is the geographical unit used for each measurement: a region consisting of one or more counties anchored by an urban center, used here to categorize the measurements by area.*

*This dataset contains 1134 observations of 24 variables. The measurements were taken across 351 different CBSAs spanning 86 distinct state groupings (a CBSA can cross state lines, so the state portion of its name may combine several states).*

*The records span 20 years, from 2000 to 2019.*

*One of the measured pollutants is PM 2.5, which refers to the particles present in breathable air (dust, dirt, ash, soot, etc.) with a diameter of 2.5 micrometers or less.*

*Regarding the variables that are non-missing most of the time: we computed the weighted annual summary and the 98% quantile and found that the non-missing variables are all the variables in the dataset except those for the years 2016 and 2019.*

## Part II: Descriptive analysis

Focus on the PM2.5 measurements that are non-missing most of the time. Answer each of the following questions in a brief paragraph or two. Your paragraph(s) should indicate both your answer and a description of how you obtained it; _**please do not include code with your answers**_.

### Has PM 2.5 air pollution improved in the U.S. on the whole since 2000?

Yes. The first graph displayed in this notebook shows a decreasing trend: in 2000 the PM2.5 level was near 24 units, while by 2019 it was around 14 units.

### Over time, has PM 2.5 pollution become more variable, less variable, or about equally variable from city to city in the U.S.?

The second graph displayed in this notebook shows how PM2.5 changed in each state grouping in the United States. The legend is omitted because there are 86 series, and a legend that size would overwhelm the plot. The key takeaway from this graph is that the amount of PM2.5 has been decreasing over time in every state.

### Which state has seen the greatest improvement in PM 2.5 pollution over time? Which city has seen the greatest improvement?

The state with the greatest improvement in PM2.5 is New York (NY), with an improvement of 81%.

The city with the greatest improvement in PM2.5 is Jamestown-Dunkirk-Fredonia, with an improvement of 99.02%.

Improvement is defined as the percent difference between the first recorded value (first year) and the last (last year): improvement = (first - last) / first x 100. For example, New York had a concentration of 30.4 in 2000 and 5.43 in 2019, a reduction of roughly 81%, so its improvement is about 81%.

### Choose a location with some meaning to you (e.g. hometown, family lives there, took a vacation there, etc.). Was that location in compliance with EPA primary standards as of the most recent measurement?

The EPA primary annual standard for PM2.5 is 12 micrograms per cubic meter. In 2019, Urban Honolulu, HI had a PM2.5 concentration of about 5, so it meets the EPA standard.

## Imputation

One strategy for filling in missing values ('imputation') is to use non-missing values to predict the missing ones; the success of this strategy depends in part on the strength of relationship between the variable(s) used as predictors of missing values.

Identify one other pollutant that might be a good candidate for imputation based on the PM 2.5 measurements and explain why you selected the variable you did. Can you envision any potential pitfalls to this technique?

**One pollutant similar to PM2.5 is PM1, which refers to particles in breathable air with a diameter of 1 micrometer or less. If there were missing PM2.5 values, the corresponding PM1 values could be used to fill them, selected by state and by year so that the measurements are comparable (for example, a missing value for NY in 2007 would be filled with the PM1 value for the same state and year).**

**Among the disadvantages of this method is the fact that the pollutants are not the same: environmental factors (for example, the presence of factories or coal-burning industry) can produce differences between the two pollutants' concentrations in the same state. Imputing this way could therefore introduce unrealistic values into the dataset.**

---

# Codes

# packages

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

# raw data

air_raw = pd.read_csv('air-quality.csv')

cbsa_info = pd.read_csv('cbsa-info.csv')

## Part I

# Display first rows of air_raw

air_raw.head()

# Display first rows of cbsa_info

cbsa_info.head()

### Required answers for questions above

# Number of CBSA

print(f"There are {len(air_raw['CBSA'].unique())} CBSAs in the dataset")

# In how many states and territories do the CBSAs reside? Split the 'Core Based Statistical Area' column

states = []

for c in cbsa_info['Core Based Statistical Area']:

    state = c.split(',')[1].strip()

    if state not in states:

        states.append(state)

print(f"There are {len(states)} states where CBSA resides in")

# Get years

years = air_raw.iloc[:,4:].columns

print("The years containing data are:")

print(', '.join(years))

# Observations

print(f"There are {len(air_raw)} observations and {air_raw.shape[1]} variables.")

### Check for non-missing variables

# Select the variables that are non-missing most of the time (non-missing in at least 50% of instances)

data = air_raw.groupby(['Pollutant']).mean(numeric_only=True)

data = data.drop(columns = ['CBSA', 'Number of Trends Sites'])

# Now print variables where 98% quantile is higher than 50% (non-missing most of time)

data = data.quantile(0.98)

print(data[data>50])

## Part II

### Plot the amount of PM 2.5 measured over the years

data = air_raw[air_raw['Pollutant'] == 'PM2.5']

data = data.groupby(['Pollutant']).mean(numeric_only=True)

data = data.drop(columns = ['CBSA', 'Number of Trends Sites'])

y = data.to_numpy()[0]

x = range(2000,2020)

plt.figure(figsize=(12,10))

plt.plot(x, y)

plt.xlabel('Year')

plt.ylabel('Amount of PM2.5 in the US')

plt.grid(True)

plt.title('Amount of PM2.5 per year')

plt.xticks(range(2000,2020))

plt.show()

### Plot the amount of pollutant per state

# Create a copy of the dataset and add a new column containing the state. The states are obtained from the cbsa_info dataset

states_dict = dict()

cities_dict = dict()

# Map each CBSA code to its state grouping and its city name

for _, row in cbsa_info.iterrows():

    CBSA = row['CBSA']

    state = row['Core Based Statistical Area'].split(',')[1].strip()

    city = row['Core Based Statistical Area'].split(',')[0].strip()

    states_dict[CBSA] = state

    cities_dict[CBSA] = city

# Now, set the state to the air_raw dataset copy

data = air_raw.copy()

for CBSA, state in states_dict.items():

    data.loc[data['CBSA'] == CBSA, 'State'] = state

# Now add a new column with the city

for CBSA, city in cities_dict.items():

    data.loc[data['CBSA'] == CBSA, 'City'] = city
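
The state and city columns can also be attached with a single merge, as described in the notes at the bottom of the notebook. This is a sketch of an equivalent alternative, assuming both frames share the 'CBSA' column; it writes to a separate variable so it does not disturb the steps above.

# split the CBSA name into city and state parts, then merge onto the air data

parts = cbsa_info['Core Based Statistical Area'].str.split(',', expand=True)

cbsa_info2 = cbsa_info.assign(City=parts[0].str.strip(), State=parts[1].str.strip())

data_alt = pd.merge(air_raw, cbsa_info2[['CBSA', 'City', 'State']], how='left', on='CBSA')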

# Now, group the data by state and compute the mean

data2 = data.groupby(['State']).mean(numeric_only=True)

data2 = data2.drop(columns = ['CBSA', 'Number of Trends Sites'])

# Plot

fig, ax = plt.subplots()

data2.transpose().plot(use_index=True, ax=ax)

ax.get_legend().remove()

plt.show()

# First, group by state

data2 = data.groupby(['State']).mean(numeric_only=True)

data2 = data2.drop(columns=['CBSA', 'Number of Trends Sites'])

# Now, take only the columns for the first and last year

data2 = data2.iloc[:, [0,-1]]

# Compute, in a new column, the percent change from the first year to the last

data2['Percentage change'] = abs((data2['2019'] - data2['2000']))/data2['2000'] *100.0

# Sort so that the state with the largest percentage change comes first

data2 = data2.sort_values(by=['Percentage change'], ascending=False)

fig, ax = plt.subplots(figsize=(18,10))

data2['Percentage change'].plot.bar(ax=ax)

plt.show()

# Print

print('The state with the highest change in PM2.5 is: {0} with a change of {1:.2f}%'.format(data2.index[0], data2['Percentage change'].iloc[0]))

# Now, do the same but compute the change by city

# First, group by city

data2 = data.groupby(['City']).mean(numeric_only=True)

data2 = data2.drop(columns=['CBSA', 'Number of Trends Sites'])

# Now, take only the columns for the first and last year

data2 = data2.iloc[:, [0,-1]]

# Compute, in a new column, the percent change from the first year to the last

data2['Percentage change'] = abs((data2['2019'] - data2['2000']))/data2['2000'] *100.0

# Sort so that the city with the largest percentage change comes first

data2 = data2.sort_values(by=['Percentage change'], ascending=False)

fig, ax = plt.subplots(figsize=(18,10))

data2['Percentage change'].plot.bar(ax=ax)

plt.show()

print('The city with the highest change in PM2.5 is: {0} with a change of {1:.2f}%'.format(data2.index[0], data2['Percentage change'].iloc[0]))

### Check concentration at Urban Honolulu, HI in 2019 to see if it meets the EPA standard

value = data.groupby(['City']).mean(numeric_only=True)['2019']['Urban Honolulu']

print("The concentration in Urban Honolulu, HI at 2019 is {:.2f}".format(value))

if value < 12:

    print("The city meets the EPA standard")

else:

    print("The city does not meet the EPA standard")

---

## Notes on merging (keep at bottom of notebook)

To combine datasets based on shared information, you can use the `pd.merge(A, B, how = ..., on = SHARED_COLS)` function, which will match the rows of `A` and `B` based on the shared columns `SHARED_COLS`. If `how = 'left'`, then only rows in `A` will be retained in the output (so `B` will be merged *to* `A`); conversely, if `how = 'right'`, then only rows in `B` will be retained in the output (so `A` will be merged *to* `B`).

A simple example of the use of `pd.merge` is illustrated below:

# toy data frames

A = pd.DataFrame(

    {'shared_col': ['a', 'b', 'c'],

    'x1': [1, 2, 3],

    'x2': [4, 5, 6]}

)

B = pd.DataFrame(

    {'shared_col': ['a', 'b'],

    'y1': [7, 8]}

)

A

B

Below, if `A` and `B` are merged retaining the rows in `A`, notice that a missing value is introduced because `B` has no row where the shared column (on which the merging is done) has value `c`. In other words, the third row of `A` has no match in `B`.

# left join

pd.merge(A, B, how = 'left', on = 'shared_col')

If the direction of merging is reversed, and the row structure of `B` is dominant, then the third row of `A` is dropped altogether because it has no match in `B`.

# right join

pd.merge(A, B, how = 'right', on = 'shared_col')