## Instructions

**Objective**

Write a Python homework program that implements weather prediction.

## Requirements and Specifications

Create a weather prediction system using machine learning.

**Source Code**

```python
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Loading and previewing the dataset.
weatherdata = pd.read_csv('../input/weather-dataset-rattle-package/weatherAUS.csv')
weatherdata.head(10)

# List the columns of the dataframe and show its shape.
print(weatherdata.columns)
print("Shape of the dataframe: ", weatherdata.shape)

# Loading the descriptive statistics summary of the dataframe.
weatherdata.describe()

# Inspecting the data types of each column in the dataframe.
weatherdata.dtypes

# Visualizing the correlations between the columns in a heatmap.
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(rc = {'figure.figsize':(15,8)})
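# numeric_only = True skips the text columns; pandas 2.0 and later raise an error on them otherwise
corrplot = sns.heatmap(weatherdata.corr(numeric_only = True), cmap = 'YlGnBu', annot = True)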
plt.show()

# From the above heatmap, we can see that the strongest positive correlation occurs between
# MaxTemp and Temp3pm, and the strongest negative correlation occurs between Sunshine and Cloud3pm.
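
# A minimal sketch (not part of the original notebook) of how the strongest pairs
# could be read from the correlation matrix itself rather than from the plot.
# The helper name `extreme_correlations` and the toy data are assumptions made
# for illustration only.
import numpy as np
import pandas as pd

def extreme_correlations(df):
    """Return ((col_a, col_b), r) for the most positive and the most negative pair."""
    corr = df.corr(numeric_only=True)
    # Keep the strict upper triangle so each pair appears once and the diagonal is skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return (pairs.idxmax(), pairs.max()), (pairs.idxmin(), pairs.min())

toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 9], "c": [4, 3, 2, 0]})
most_positive, most_negative = extreme_correlations(toy)
print("Most positive pair:", most_positive)
print("Most negative pair:", most_negative)
# On the real data the call would presumably be extreme_correlations(weatherdata)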

# We want to make sure that there is no missing data in the columns containing numeric data,
# so we first inspect how many NAs each of those columns has.
# Count how many null values in columns containing numeric data
numeric_cols = ['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'WindSpeed9am',
                'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm',
                'Temp9am', 'Temp3pm']
na_sum_numeric_cols = []
for n in numeric_cols:
    na_sum_numeric_cols.append(weatherdata[n].isnull().sum())
for i in range(len(numeric_cols)):
    print("Sum of NAs in", numeric_cols[i], ":", na_sum_numeric_cols[i])

# We check whether the numeric columns have outliers using Seaborn boxplots,
# to decide how we will fill the NA values.
fig, ax = plt.subplots(len(numeric_cols), 1, figsize = (20, 65))
fig.suptitle("Distribution of Outliers")
for n in numeric_cols:
    sns.boxplot(x = weatherdata[n], data = weatherdata, palette = "crest", ax = ax[numeric_cols.index(n)], width = 0.4)
    ax[numeric_cols.index(n)].set_title("")
plt.show()

# From the above boxplots, we can see that many numeric columns contain outliers.
# The median is less sensitive to outliers than the mean, so we replace the NA values
# with the median of each column.
for n in numeric_cols:
    # Assign the result back: fillna(inplace = True) on a selected column
    # may not update the dataframe in recent pandas versions
    weatherdata[n] = weatherdata[n].fillna(weatherdata[n].median())
weatherdata.head(10)

# Making sure that there is no NA value left in each column.
for n in numeric_cols:
    print('Amount of null values in column', n, ":", weatherdata[n].isnull().sum())

# We do the same for the categorical columns WindGustDir, WindDir9am, and WindDir3pm.
# The mean and median are undefined for categorical values, so we use the 'ffill'
# (forward-fill) method here; filling with the mode would be another option.
categorical_cols = ["WindGustDir", "WindDir9am", "WindDir3pm"]
na_sum_categorical_cols = []
# Count the number of NAs in categorical columns
for col in categorical_cols:
    na_sum_categorical_cols.append(weatherdata[col].isnull().sum())
for i in range(len(categorical_cols)):
    print("Sum of NAs in", categorical_cols[i], ":", na_sum_categorical_cols[i])
# Replace NAs by forward fill and check whether all NAs have been replaced
for col in categorical_cols:
    weatherdata[col] = weatherdata[col].ffill()
    print('Amount of null values in column', col, "after preprocessing is:",
          weatherdata[col].isnull().sum())
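
# A small illustration (not from the original notebook) of one limitation of forward fill:
# it copies the previous valid value downward, so an NA in the very first row has
# nothing to copy from and stays NA. Filling the remainder with the most frequent
# value (the mode) is one possible fallback. The helper name `ffill_then_mode`
# and the toy series are assumptions made for this sketch.
import pandas as pd

def ffill_then_mode(series):
    """Forward-fill, then fill any leading NAs with the most frequent value."""
    filled = series.ffill()
    # mode() ignores NAs; this assumes the series has at least one non-NA value
    return filled.fillna(series.mode().iloc[0])

toy_wind = pd.Series([None, "N", None, "SE", "N"], dtype="object")
print("NAs left after ffill alone:", toy_wind.ffill().isnull().sum())
print("After ffill plus mode:", ffill_then_mode(toy_wind).tolist())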

# We check how many NA values exist in RainToday.
print("The number of NAs in RainToday is", weatherdata['RainToday'].isna().sum(),
      "while the total number of rows in RainToday is", len(weatherdata['RainToday']))
print("The fraction of NAs among all rows is", weatherdata['RainToday'].isna().sum()/len(weatherdata['RainToday']))

# Since NA values constitute only 2.24% of all rows, we may drop the NAs in RainToday.
weatherdata.dropna(subset = ["RainToday"], inplace = True)
print(weatherdata.shape)

# We do the same to RainTomorrow.
print("The number of NAs in RainTomorrow is", weatherdata['RainTomorrow'].isna().sum(),
      "while the total number of rows in RainTomorrow is", len(weatherdata['RainTomorrow']))
print("The fraction of NAs among all rows is", weatherdata['RainTomorrow'].isna().sum()/len(weatherdata['RainTomorrow']))
weatherdata.dropna(subset = ['RainTomorrow'], inplace = True)
print(weatherdata.shape)

# Finding the lowest and highest temperatures in degrees Celsius.
print("The lowest temperature recorded is", weatherdata['MinTemp'].min(), "degrees Celsius")
print("The highest temperature recorded is", weatherdata['MaxTemp'].max(), "degrees Celsius")

# Finding the largest amount of rainfall recorded in a day.
print("The largest amount of rainfall recorded in a day is", weatherdata['Rainfall'].max(), "mm")

# Now let's get started with the training and test data.
# Some columns contain categorical data, so we apply label encoding
# to convert the categorical values into numerical values.
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
for col in categorical_cols:
    weatherdata[col] = label_encoder.fit_transform(weatherdata[col])
    print(weatherdata[col].unique())

# We do the same for Date, Location, RainToday and RainTomorrow.
for col in ['Date', 'Location', 'RainToday', 'RainTomorrow']:
    weatherdata[col] = label_encoder.fit_transform(weatherdata[col])
    print(weatherdata[col].unique())
weatherdata.head(10)
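
# Sketch (an illustration, not the notebook's own code): LabelEncoder numbers the
# distinct labels of a column in sorted order. Because the single encoder above is
# refit for every column, only the mapping of the last column survives. Keeping one
# dictionary per column is one way to preserve the mappings for later decoding.
# `build_label_mapping` and the toy values are hypothetical names for this sketch.
def build_label_mapping(values):
    """Map each distinct label to an integer code, in sorted label order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

toy_directions = ["W", "N", "SE", "N"]
mapping = build_label_mapping(toy_directions)
encoded = [mapping[value] for value in toy_directions]
# The inverse dictionary turns codes back into the original labels
inverse = {code: label for label, code in mapping.items()}
decoded = [inverse[code] for code in encoded]
print(mapping)
print(encoded)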

# Let's start splitting the dataset into training and test sets.
X = weatherdata.drop(columns = ['RainTomorrow'])
y = weatherdata['RainTomorrow']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Checking the shapes of training and test sets.
print ('Training set:', X_train.shape, y_train.shape)
print ('Test set:', X_test.shape, y_test.shape)

# We will test some machine learning algorithms, namely Decision Tree, Naive Bayes,
# and Logistic Regression. We start with Decision Tree.
# We also tune the hyperparameters with RandomizedSearchCV to find and use
# the parameters that give the best cross-validated score.
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn import metrics  # needed for metrics.accuracy_score below
from sklearn.metrics import classification_report, confusion_matrix
param_dist = {"max_depth": [3, None], "max_features": randint(1, 9), "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}
# Instantiate the classifier
weathertree = DecisionTreeClassifier()
# Instantiate the hyperparameter tuning algorithm
weathertree_cv = RandomizedSearchCV(weathertree, param_dist, cv = 5)
# Train the training data
weathertree_cv.fit(X_train, y_train)
# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(weathertree_cv.best_params_))
print("Best score is {}\n".format(weathertree_cv.best_score_))
# Inspecting the accuracy of the fine-tuned model
print("Confusion matrix: \n {}".format(confusion_matrix(y_true=y_test, y_pred=weathertree_cv.predict(X_test))))
print("\nClassification report: \n \n {}".format(classification_report(y_test, y_pred=weathertree_cv.predict(X_test))))
print("Accuracy using Decision Tree: ", metrics.accuracy_score(y_test, weathertree_cv.predict(X_test)))

# We can see that, using Decision Tree, the fine-tuned model has 80.22% accuracy.
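
# Sketch of how the accuracy figure relates to the confusion matrix printed above.
# scikit-learn's confusion_matrix puts the true classes in rows and the predicted
# classes in columns, so correct predictions sit on the diagonal. The counts in
# `toy_cm` are made up for illustration and are not results from this dataset;
# both helper names are hypothetical.
def accuracy_from_confusion(cm):
    """Share of all samples that lie on the diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def recall_per_class(cm):
    """For each true class, the share of its samples that were predicted correctly."""
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

toy_cm = [[90, 10],
          [15, 35]]
print("Accuracy:", accuracy_from_confusion(toy_cm))
print("Recall per class:", recall_per_class(toy_cm))
# Accuracy alone can hide a weak recall on the less frequent class,
# which is why the classification report is printed as well.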

# Next, we try classifying using Naive Bayes.
from sklearn.naive_bayes import GaussianNB
weathernb = GaussianNB()
weathernb.fit(X_train, y_train)
# Inspecting the metrics of the model
print("Confusion matrix: \n {}".format(confusion_matrix(y_true=y_test, y_pred=weathernb.predict(X_test))))
print("\nClassification report: \n \n {}".format(classification_report(y_test, y_pred=weathernb.predict(X_test))))
print("\nAccuracy using GaussianNB: ", metrics.accuracy_score(y_test, weathernb.predict(X_test)))

# From the above, we can see that, using Gaussian Naive Bayes, the accuracy
# is slightly better than Decision Tree at 80.30%.
# Finally, we try classifying using Logistic Regression.
from sklearn.linear_model import LogisticRegression
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}
weatherLR = LogisticRegression(solver = 'liblinear')
weatherLR_cv = RandomizedSearchCV(weatherLR, param_grid, cv = 5)
weatherLR_cv.fit(X_train, y_train)
# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(weatherLR_cv.best_params_))
print("Best score is {}".format(weatherLR_cv.best_score_))
# Inspecting the metrics of the model
print("\nConfusion matrix: \n {}".format(confusion_matrix(y_true=y_test, y_pred=weatherLR_cv.predict(X_test))))
print("\nClassification report: \n \n {}".format(classification_report(y_test, y_pred=weatherLR_cv.predict(X_test))))
print("\nAccuracy using Logistic Regression: ", metrics.accuracy_score(y_test, weatherLR_cv.predict(X_test)))

# From the above, we may conclude that Logistic Regression with C = 163789.37 performs
# better than the other two algorithms, with an accuracy of 84%.
```
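
The tuned value `C = 163789.37` is not arbitrary: it is one of the 15 candidates produced by `np.logspace(-5, 8, 15)`, from which `RandomizedSearchCV` samples 10 by default (`n_iter = 10`). The sketch below rebuilds that grid in plain Python to show where the value sits. The helper name `log_grid` is an assumption made for illustration and is not part of the original program.

```python
# Pure-Python equivalent of np.logspace(-5, 8, 15): the exponents are evenly
# spaced from -5 to 8 and each one is used as a power of 10.
def log_grid(start, stop, num):
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]

c_grid = log_grid(-5, 8, 15)
for index, c in enumerate(c_grid):
    print(index, round(c, 2))
```

The reported best value is the grid point at index 11. Since the search is randomized and no `random_state` is set, a different run may sample a different subset of the grid and settle on another candidate.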