# Create a Program to Implement Weather Prediction in Python Assignment Solution

July 02, 2024
Dr. David
🇦🇺 Australia
Python
Dr. David Adams, a distinguished Computer Science scholar, holds a PhD from the University of Melbourne, Australia. With over 5 years of experience in the field, he has completed over 300 Python assignments, showcasing his deep understanding and expertise in the subject matter.
Key Topics
• Instructions
• Requirements and Specifications

## Instructions

Objective
Write a Python homework program to implement weather prediction.

## Requirements and Specifications

Create a weather prediction system using machine learning.
Source Code
```python
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
```

Loading and previewing the dataset.

```python
weatherdata = pd.read_csv('../input/weather-dataset-rattle-package/weatherAUS.csv')
weatherdata.head(10)
```

List the columns of the dataframe and print its shape.

```python
print(weatherdata.columns)
print("Shape of the dataframe: ", weatherdata.shape)
```

Loading the descriptive statistics summary of the dataframe.

```python
weatherdata.describe()
```

Inspecting the data type of each column in the dataframe.

```python
weatherdata.dtypes
```

Visualizing the correlation between the numeric columns as a heatmap.

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(rc={'figure.figsize': (15, 8)})
# numeric_only=True skips the text columns; pandas 2.0+ raises an error without it
corrplot = sns.heatmap(weatherdata.corr(numeric_only=True), cmap='YlGnBu', annot=True)
plt.show()
```

From the above heatmap, we can see that the strongest positive correlation occurs between MaxTemp and Temp3pm, and the strongest negative correlation occurs between Sunshine and Cloud3pm.

We want to make sure that there is no missing data in the numeric columns, so we first count the NAs in each of them.

```python
# Count the null values in the columns containing numeric data
numeric_cols = ['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed',
                'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am',
                'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm']

for n in numeric_cols:
    print("Sum of NAs in", n, ":", weatherdata[n].isnull().sum())
```

We check whether the numeric columns have outliers using Seaborn boxplots, to decide how to replace the NA values.

```python
fig, ax = plt.subplots(len(numeric_cols), 1, figsize=(20, 65))
fig.suptitle("Distribution of Outliers")
for i, n in enumerate(numeric_cols):
    sns.boxplot(x=weatherdata[n], data=weatherdata, palette="crest", ax=ax[i], width=0.4)
    ax[i].set_title("")
plt.show()
```

The boxplots show that many numeric columns contain outliers. Because the median is less sensitive to outliers than the mean, we replace the NA values with the median of each column.

```python
for n in numeric_cols:
    # Assign the result back instead of using inplace=True on a column selection
    weatherdata[n] = weatherdata[n].fillna(weatherdata[n].median())
weatherdata.head(10)
```

Making sure that there is no NA value left in each numeric column.

```python
for n in numeric_cols:
    print('Amount of null values in column', n, ":", weatherdata[n].isnull().sum())
```

We do the same for the columns containing categorical values, namely WindGustDir, WindDir9am, and WindDir3pm. However, we use the forward-fill ('ffill') method here, since the median and mean are not defined for categorical values.

```python
categorical_cols = ["WindGustDir", "WindDir9am", "WindDir3pm"]

# Count the number of NAs in the categorical columns
for col in categorical_cols:
    print("Sum of NAs in", col, ":", weatherdata[col].isnull().sum())

# Replace NAs by forward fill and check whether all NAs have been replaced
for col in categorical_cols:
    weatherdata[col] = weatherdata[col].ffill()
    print('Amount of null values in column', col, "after preprocessing is:",
          weatherdata[col].isnull().sum())
```

We check how many NA values exist in RainToday.

```python
print("The number of NAs in RainToday is", weatherdata['RainToday'].isna().sum(),
      "while the total number of rows in RainToday is", len(weatherdata['RainToday']))
print("The fraction of rows that are NA is",
      weatherdata['RainToday'].isna().sum() / len(weatherdata['RainToday']))
```

Since NA values constitute only 2.24% of all rows, we may drop the rows where RainToday is NA.

```python
weatherdata.dropna(subset=["RainToday"], inplace=True)
print(weatherdata.shape)
```

We do the same for RainTomorrow.

```python
print("The number of NAs in RainTomorrow is", weatherdata['RainTomorrow'].isna().sum(),
      "while the total number of rows in RainTomorrow is", len(weatherdata['RainTomorrow']))
print("The fraction of rows that are NA is",
      weatherdata['RainTomorrow'].isna().sum() / len(weatherdata['RainTomorrow']))
weatherdata.dropna(subset=['RainTomorrow'], inplace=True)
print(weatherdata.shape)
```

Finding the lowest and highest recorded temperatures in degrees Celsius.

```python
print("The lowest temperature recorded is", weatherdata['MinTemp'].min(), "degrees Celsius")
print("The highest temperature recorded is", weatherdata['MaxTemp'].max(), "degrees Celsius")
```

Finding the largest amount of rainfall recorded in a day.

```python
print("The largest amount of rainfall recorded in a day is", weatherdata['Rainfall'].max(), "mm")
```

Now let's prepare the training and test data. Some columns contain categorical data, so we apply label encoding to convert the categorical values into numerical values.

```python
from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
for col in categorical_cols:
    weatherdata[col] = label_encoder.fit_transform(weatherdata[col])
    print(weatherdata[col].unique())
```

We do the same for Date, Location, RainToday, and RainTomorrow.

```python
for col in ['Date', 'Location', 'RainToday', 'RainTomorrow']:
    weatherdata[col] = label_encoder.fit_transform(weatherdata[col])
    print(weatherdata[col].unique())
weatherdata.head(10)
```

Let's split the dataset into training and test sets.

```python
from sklearn.model_selection import train_test_split

X = weatherdata.drop(columns=['RainTomorrow'])
y = weatherdata['RainTomorrow']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

Checking the shapes of the training and test sets.

```python
print('Training set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)
```

We will test three machine learning algorithms: Decision Tree, Naive Bayes, and Logistic Regression.

We start with Decision Tree. We also tune the hyperparameters with RandomizedSearchCV to find parameters that may yield a better-performing model.

```python
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

# Instantiate the classifier
weathertree = DecisionTreeClassifier()

# Instantiate the hyperparameter search
weathertree_cv = RandomizedSearchCV(weathertree, param_dist, cv=5)

# Fit the search on the training data
weathertree_cv.fit(X_train, y_train)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(weathertree_cv.best_params_))
print("Best score is {}\n".format(weathertree_cv.best_score_))

# Inspect the metrics of the tuned model on the test set
y_pred_tree = weathertree_cv.predict(X_test)
print("Confusion matrix: \n {}".format(confusion_matrix(y_true=y_test, y_pred=y_pred_tree)))
print("\nClassification report: \n \n {}".format(classification_report(y_test, y_pred=y_pred_tree)))
print("Accuracy using Decision Tree: ", accuracy_score(y_test, y_pred_tree))
```

Using Decision Tree, the tuned model reaches 80.22% accuracy.

Next, we try classifying with Naive Bayes.

```python
from sklearn.naive_bayes import GaussianNB

weathernb = GaussianNB()
weathernb.fit(X_train, y_train)

# Inspect the metrics of the model on the test set
y_pred_nb = weathernb.predict(X_test)
print("Confusion matrix: \n {}".format(confusion_matrix(y_true=y_test, y_pred=y_pred_nb)))
print("\nClassification report: \n \n {}".format(classification_report(y_test, y_pred=y_pred_nb)))
print("\nAccuracy using GaussianNB: ", accuracy_score(y_test, y_pred_nb))
```

Using Gaussian Naive Bayes, the accuracy is slightly better than Decision Tree at 80.30%.

Finally, we try classifying with Logistic Regression.

```python
from sklearn.linear_model import LogisticRegression

c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

weatherLR = LogisticRegression(solver='liblinear')
weatherLR_cv = RandomizedSearchCV(weatherLR, param_grid, cv=5)
weatherLR_cv.fit(X_train, y_train)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(weatherLR_cv.best_params_))
print("Best score is {}".format(weatherLR_cv.best_score_))

# Inspect the metrics of the model on the test set
y_pred_lr = weatherLR_cv.predict(X_test)
print("\nConfusion matrix: \n {}".format(confusion_matrix(y_true=y_test, y_pred=y_pred_lr)))
print("\nClassification report: \n \n {}".format(classification_report(y_test, y_pred=y_pred_lr)))
print("\nAccuracy using Logistic Regression: ", accuracy_score(y_test, y_pred_lr))
```

From the above, we may conclude that Logistic Regression with C = 163789.37 performs better than the other two algorithms, with an accuracy of 84%.
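
Accuracy alone can be hard to interpret when one class dominates, as can happen with rain/no-rain labels. The following is a minimal sketch, using made-up counts rather than results from the run above, of how accuracy, precision, recall, and a majority-class baseline can be read off a 2x2 confusion matrix laid out like the output of scikit-learn's `confusion_matrix` (rows are true classes, columns are predicted classes). The function names `summarize` and `majority_baseline` are hypothetical helpers introduced only for illustration.

```python
# Hypothetical confusion-matrix counts, NOT results from the weatherAUS run.
# Layout: rows = true class, columns = predicted class.
confusion = [
    [80, 5],   # true "no rain": 80 predicted no rain, 5 predicted rain
    [10, 5],   # true "rain":    10 predicted no rain, 5 predicted rain
]


def summarize(cm):
    """Return accuracy, precision and recall for the positive class (index 1)."""
    tn, fp = cm[0]
    fn, tp = cm[1]
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


def majority_baseline(cm):
    """Accuracy obtained by always predicting the more frequent true class."""
    row_totals = [sum(row) for row in cm]
    return max(row_totals) / sum(row_totals)


accuracy, precision, recall = summarize(confusion)
baseline = majority_baseline(confusion)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} baseline={baseline:.2f}")
```

With these hypothetical counts the accuracy equals the majority-class baseline, so comparing each reported accuracy against the share of the majority class in `y_test` is a useful sanity check before ranking the models.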
