Create a Program to Implement Regression in Python Assignment Solution.

Instructions

Objective
Write a program to implement regression in Python.

Requirements and Specifications

Problem Statement
You are required to model the price of cars using the available independent variables. Your management team will use the model to understand exactly how prices vary with those variables, so that they can adjust the design of the cars, the business strategy, etc. to meet certain price levels. The model will also help management understand the pricing dynamics of a new market.
In general, your company would like for you to answer the following:
• Which variables are significant in predicting the price of a car
• How well those variables describe the price of a car
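The two questions above correspond to standard regression outputs: the t-statistic of each coefficient indicates whether a variable is significant, and R^2 indicates how well the variables together describe the price. The following is a minimal sketch on synthetic data, not the assignment's dataset. The helper `ols_summary` and the columns `engine_size` and `paint_code` are hypothetical names introduced only for illustration.

```python
import numpy as np


def ols_summary(X, y):
    """Fit y ~ intercept + X by least squares.

    Returns the coefficients, their t-statistics and the R^2 of the fit.
    """
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])            # prepend an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # least-squares coefficients
    resid = y - A @ beta
    dof = n - A.shape[1]                            # residual degrees of freedom
    sigma2 = resid @ resid / dof                    # estimated noise variance
    cov = sigma2 * np.linalg.inv(A.T @ A)           # coefficient covariance matrix
    t_stats = beta / np.sqrt(np.diag(cov))
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, t_stats, r2


# Synthetic data (an assumption for illustration): price depends on
# engine_size only; paint_code is unrelated to price.
rng = np.random.default_rng(0)
n = 200
engine_size = rng.normal(size=n)
paint_code = rng.normal(size=n)
price = 3.0 * engine_size + rng.normal(size=n)

beta, t_stats, r2 = ols_summary(np.column_stack([engine_size, paint_code]), price)
for name, b, t in zip(["intercept", "engine_size", "paint_code"], beta, t_stats):
    print(f"{name:12s} coef={b:7.3f}  t={t:7.2f}")
print(f"R^2 = {r2:.3f}")
```

As a rule of thumb, a coefficient whose absolute t-statistic is well above 2 is usually treated as significant; a library such as statsmodels reports exact p-values.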
Source Code
```python
# Homework 1 (Regression)
# COSC 3337, Dr. Rizk

# Part 1. Reading and Understanding the Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
# rcParams['figure.figsize'] = 8, 5
# sns.set_style('darkgrid')

df = pd.read_csv('car_data.csv')
df.head()
df.describe()
df.info()

# Part 2. Data Cleaning and Preparation
# The company name is the first word of CarName.
df['CompanyName'] = df['CarName'].apply(lambda x: x.split(' ')[0])
print('The unique companies in our dataset are: \n', df.CompanyName.unique())

# Fix misspelled and inconsistently cased company names.
df['CompanyName'] = df['CompanyName'].replace({
    'toyouta': 'toyota',
    'Nissan': 'nissan',
    'maxda': 'mazda',
    'vokswagen': 'volkswagen',
    'vw': 'volkswagen',
    'porcshce': 'porsche',
})
print('The unique companies in our dataset are: \n', df.CompanyName.unique())

# Part 3. Visualising Categorical Data

# Create the following plots:
# 1. A plot of the unique company names on the x-axis and the value counts on the y-axis.
# 2. A plot of the unique car bodies on the x-axis and the value counts on the y-axis.
fig, ax = plt.subplots(figsize=(15, 5))
plt1 = sns.countplot(x=df['CompanyName'], order=df['CompanyName'].value_counts().index)
plt1.set(xlabel='CompanyName', ylabel='Count')
plt1.set_title("Company Counts")
plt.tight_layout()  # must be called before show() to take effect
plt.show()

fig, ax = plt.subplots(figsize=(15, 5))
plt1 = sns.countplot(x=df['carbody'], order=df['carbody'].value_counts().index)
plt1.set(xlabel='carbody', ylabel='Count')
plt1.set_title("Car Body Counts")
plt.tight_layout()
plt.show()

# Describe what we can conclude from them:
# - Toyota is the most common company in the dataset.
# - Sedan is the most common car body.

# Create the following plots:
# 1. A plot of the unique company names on the x-axis and each company's average price on the y-axis.
# 2. A plot of the unique car bodies on the x-axis and each car body's average price on the y-axis.
plt.figure(figsize=(20, 8))
df_ = pd.DataFrame(df.groupby(['CompanyName'])['price'].mean().sort_values(ascending=False)).reset_index()
# Keep the returned Axes in `ax`; rebinding `plt` would shadow matplotlib.pyplot.
ax = sns.barplot(x='CompanyName', y='price', data=df_)
ax.set_title("Company vs. Average Price")
plt.show()

plt.figure(figsize=(20, 8))
df_ = pd.DataFrame(df.groupby(['carbody'])['price'].mean().sort_values(ascending=False)).reset_index()
ax = sns.barplot(x='carbody', y='price', data=df_)
ax.set_title("Car Body vs. Average Price")
plt.show()

# Describe what we can conclude from them:
# - Jaguar, Buick and Porsche seem to have the highest average prices.
# - Hardtop and convertible bodies have higher average prices.

# Create the following plots:
# 1. A plot of the unique symboling values on the x-axis and the value counts on the y-axis.
# 2. A box plot of the unique symboling values on the x-axis and price on the y-axis.
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.title('Symboling Counts')
sns.countplot(x=df.symboling)
plt.subplot(1, 2, 2)
plt.title('Symboling vs Price')
sns.boxplot(x=df.symboling, y=df.price)
plt.show()

# Describe what we can conclude from them:
# - Symboling 0 is the most frequent value.
# - Cars with symboling -1 are high priced.

# Create the following plots:
# 1. A plot of enginetype on the x-axis and the value counts on the y-axis.
# 2. A box plot of enginetype on the x-axis and price on the y-axis.
plt.figure(figsize=(20, 8))
plt.subplot(1, 2, 1)
plt.title('Engine Type Counts')
sns.countplot(x=df.enginetype)
plt.subplot(1, 2, 2)
plt.title('Engine Type vs Price')
sns.boxplot(x=df.enginetype, y=df.price)
plt.show()

# Describe what we can conclude from them:
# - ohc is the most common engine type.
# - ohcv has the widest price range.

# Create the following plots:
# 1. A plot of cylindernumber on the x-axis and the value counts on the y-axis.
# 2. A box plot of cylindernumber on the x-axis and price on the y-axis.
plt.figure(figsize=(20, 8))
plt.subplot(1, 2, 1)
plt.title('Cylinder Number Counts')
sns.countplot(x=df.cylindernumber)
plt.subplot(1, 2, 2)
plt.title('Cylinder Number vs Price')
sns.boxplot(x=df.cylindernumber, y=df.price)
plt.show()

# Describe what we can conclude from them:
# - Four cylinders is the most common configuration.
# - Six-cylinder cars have the widest price range.

# Create the following plots:
# 1. A plot of fuelsystem on the x-axis and the value counts on the y-axis.
# 2. A box plot of fuelsystem on the x-axis and price on the y-axis.
plt.figure(figsize=(20, 8))
plt.subplot(1, 2, 1)
plt.title('Fuel Systems Counts')
sns.countplot(x=df.fuelsystem)
plt.subplot(1, 2, 2)
plt.title('Fuel Systems vs Price')
sns.boxplot(x=df.fuelsystem, y=df.price)
plt.show()

# Describe what we can conclude from them:
# - mpfi is the most common fuel system.
# - The idi fuel system has the widest price range.

# Create the following plots:
# 1. A plot of drivewheel on the x-axis and the value counts on the y-axis.
# 2. A box plot of drivewheel on the x-axis and price on the y-axis.
plt.figure(figsize=(20, 8))
plt.subplot(1, 2, 1)
plt.title('Drive Wheel Counts')
sns.countplot(x=df.drivewheel)
plt.subplot(1, 2, 2)
plt.title('Drive Wheel vs Price')
sns.boxplot(x=df.drivewheel, y=df.price)
plt.show()

# Describe what we can conclude from them:
# - fwd is the most common drive wheel type.
# - rwd has the widest price range.

# Create the following plots:
# 1. A plot of enginelocation on the x-axis and the value counts on the y-axis.
# 2. A box plot of enginelocation on the x-axis and price on the y-axis.
plt.figure(figsize=(20, 8))
plt.subplot(1, 2, 1)
plt.title('Engine Location Counts')
sns.countplot(x=df.enginelocation)
plt.subplot(1, 2, 2)
plt.title('Engine Location vs Price')
sns.boxplot(x=df.enginelocation, y=df.price)
plt.show()

# Describe what we can conclude from them:
# - Front is the most common engine location.
# - Front-engine cars have the widest price range.

# Create the following plots:
# 1. A plot of fueltype on the x-axis and the value counts on the y-axis.
# 2. A box plot of fueltype on the x-axis and price on the y-axis.
plt.figure(figsize=(20, 8))
plt.subplot(1, 2, 1)
plt.title('Fuel Type Counts')
sns.countplot(x=df.fueltype)
plt.subplot(1, 2, 2)
plt.title('Fuel Type vs Price')
sns.boxplot(x=df.fueltype, y=df.price)
plt.show()

# Describe what we can conclude from them:
# - Gas is the most common fuel type.
# - Diesel has the widest price range.

# Create the following plots:
# 1. A plot of doornumber on the x-axis and the value counts on the y-axis.
# 2. A box plot of doornumber on the x-axis and price on the y-axis.
plt.figure(figsize=(20, 8))
plt.subplot(1, 2, 1)
plt.title('Number of Doors Counts')
sns.countplot(x=df.doornumber)
plt.subplot(1, 2, 2)
plt.title('Number of Doors vs Price')
sns.boxplot(x=df.doornumber, y=df.price)
plt.show()

# Describe what we can conclude from them:
# - Four-door cars are the most common.

# Create the following plots:
# 1. A plot of aspiration on the x-axis and the value counts on the y-axis.
# 2. A box plot of aspiration on the x-axis and price on the y-axis.
plt.figure(figsize=(20, 8))
plt.subplot(1, 2, 1)
plt.title('Aspiration Counts')
sns.countplot(x=df.aspiration)
plt.subplot(1, 2, 2)
plt.title('Aspiration vs Price')
sns.boxplot(x=df.aspiration, y=df.price)
plt.show()

# Describe what we can conclude from them:
# - std is the most common aspiration type.

# Create the following plots:
# 1. A plot showing the price distribution.
# 2. A box plot of price.
plt.figure(figsize=(20, 8))
plt.subplot(1, 2, 1)
plt.title('Car Price Distribution')
sns.histplot(df.price, kde=True)  # histplot replaces the deprecated distplot
plt.subplot(1, 2, 2)
plt.title('Car Price Spread')
sns.boxplot(y=df.price)
plt.show()

# Describe what we can conclude from them:
# - The distribution is right-skewed, which means most prices are in the low range.

# Create the following plots:
# 1. A scatter plot of carlength vs price.
# 2. A scatter plot of carwidth vs price.
# 3. A scatter plot of carheight vs price.
# 4. A scatter plot of car weight (curbweight) vs price.
plt.figure(figsize=(20, 8))
plt.subplot(2, 2, 1)
plt.title('Car Length vs Price')
sns.regplot(x=df['carlength'], y=df['price'], color="blue")
plt.subplot(2, 2, 2)
plt.title('Car Width vs Price')
sns.regplot(x=df['carwidth'], y=df['price'], color="orange")
plt.subplot(2, 2, 3)
plt.title('Car Height vs Price')
sns.regplot(x=df['carheight'], y=df['price'], color="green")
plt.subplot(2, 2, 4)
plt.title('Car Weight vs Price')
sns.regplot(x=df['curbweight'], y=df['price'], color="red")
plt.tight_layout()
plt.show()

# Describe what we can conclude from them:
# - Car width, car length and car weight have a positive correlation with price; car height does not.

# Create the following plots:
# 1. A scatter plot of enginesize vs price.
# 2. A scatter plot of boreratio vs price.
# 3. A scatter plot of stroke vs price.
# 4. A scatter plot of compressionratio vs price.
# 5. A scatter plot of horsepower vs price.
# 6. A scatter plot of peakrpm vs price.
# 7. A scatter plot of wheelbase vs price.
# 8. A scatter plot of citympg vs price.
# 9. A scatter plot of highwaympg vs price.
plt.figure(figsize=(20, 8))
plt.subplot(3, 3, 1)
plt.title('Engine Size vs Price')
sns.regplot(x=df['enginesize'], y=df['price'], color="blue")
plt.subplot(3, 3, 2)
plt.title('Boreratio vs Price')
sns.regplot(x=df['boreratio'], y=df['price'], color="orange")
plt.subplot(3, 3, 3)
plt.title('Stroke vs Price')
sns.regplot(x=df['stroke'], y=df['price'], color="green")
plt.subplot(3, 3, 4)
plt.title('Compression Ratio vs Price')
sns.regplot(x=df['compressionratio'], y=df['price'], color="red")
plt.subplot(3, 3, 5)
plt.title('Horse Power vs Price')
sns.regplot(x=df['horsepower'], y=df['price'], color="blue")
plt.subplot(3, 3, 6)
plt.title('Peak RPM vs Price')
sns.regplot(x=df['peakrpm'], y=df['price'], color="blue")
plt.subplot(3, 3, 7)
plt.title('Wheel Base vs Price')
sns.regplot(x=df['wheelbase'], y=df['price'], color="orange")
plt.subplot(3, 3, 8)
plt.title('City MPG vs Price')
sns.regplot(x=df['citympg'], y=df['price'], color="green")
plt.subplot(3, 3, 9)
plt.title('Highway MPG vs Price')
sns.regplot(x=df['highwaympg'], y=df['price'], color="red")
plt.tight_layout()
plt.show()

# Describe what we can conclude from them:
# - enginesize, boreratio, horsepower and wheelbase have a positive correlation with price.

# Create a heatmap or correlation matrix to inspect the correlations in our dataset.
plt.figure(figsize=(30, 25))
# numeric_only=True skips the text columns, which recent pandas versions refuse to correlate.
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="YlGnBu")
plt.show()

# Describe what we can conclude from them:
# - curbweight, enginesize, horsepower and carwidth are highly correlated.

# For example, citympg and highwaympg can be combined into a single feature.
# Create a new column (named 'fueleconomy' here) that is a combination of the two.
df['fueleconomy'] = (0.55 * df['citympg']) + (0.45 * df['highwaympg'])
df.head()

# Part 5. Data Pre-Processing
# Perform the following:
# - Convert your categorical variables into dummy variables.
# - Scale the data using a scaler of your choice.
# - Split your data into a training and testing set, with a test size of 0.30.
df_ = df[['price', 'fueltype', 'aspiration', 'carbody', 'drivewheel', 'wheelbase',
          'curbweight', 'enginetype', 'cylindernumber', 'enginesize', 'boreratio',
          'horsepower', 'fueleconomy', 'carlength', 'carwidth']].copy()


def dummies(x, df):
    """Replace categorical column `x` with dummy columns, dropping the first level."""
    temp = pd.get_dummies(df[x], drop_first=True, dtype=int)
    df = pd.concat([df, temp], axis=1)
    df = df.drop(columns=[x])
    return df


df_ = dummies('fueltype', df_)
df_ = dummies('aspiration', df_)
df_ = dummies('carbody', df_)
df_ = dummies('drivewheel', df_)
df_ = dummies('enginetype', df_)
df_ = dummies('cylindernumber', df_)
df_.head()

from sklearn.model_selection import train_test_split

np.random.seed(0)
df_train, df_test = train_test_split(df_, train_size=0.7, test_size=0.3, random_state=30)

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
num_vars = ['wheelbase', 'curbweight', 'enginesize', 'boreratio', 'horsepower',
            'fueleconomy', 'carlength', 'carwidth', 'price']

# Fit the scaler on the training set only.
df_train[num_vars] = scaler.fit_transform(df_train[num_vars])
y_train = df_train.pop('price')
X_train = df_train

# Reuse the scaler fitted on the training set; refitting on the test set would
# put the two sets on different scales.
df_test[num_vars] = scaler.transform(df_test[num_vars])
y_test = df_test.pop('price')
X_test = df_test

# Part 6. Model Creation and Evaluation
# Perform the following using sklearn:
# 1. Create a linear regression model, and train (fit) it on the training data.
# 2. Run the test data through your model to obtain predictions. Save these predictions into a variable called 'predictions'.
# 3. Create a scatter plot of the true price labels vs the predicted price values of your model.
# 4. Create a histogram of the residuals.
# 5. Print the R^2 of your model.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Use a lowercase instance name so the LinearRegression class is not overwritten.
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Recursive feature elimination down to 10 features
# (n_features_to_select must be passed by keyword in recent sklearn versions).
rfe = RFE(lin_reg, n_features_to_select=10)
rfe = rfe.fit(X_train, y_train)

# Predictions on the held-out test set.
predictions = lin_reg.predict(X_test)

fig = plt.figure()
sns.histplot(y_test - predictions, bins=20, kde=True)
fig.suptitle('Residuals Histogram')
plt.xlabel('Residual (scaled price)')
plt.show()

fig = plt.figure()
plt.scatter(y_test, predictions)
fig.suptitle('True y vs. Prediction')
plt.xlabel('True y', fontsize=18)
plt.ylabel('Predicted y', fontsize=18)
plt.show()

print('R-Squared:', r2_score(y_test, predictions))

# Lastly, create a dataframe of your model's coefficients.
coef = pd.DataFrame({'Features': X_train.columns, 'coef': lin_reg.coef_.round(2)})
coef = coef.sort_values(by="coef", ascending=False)
coef

# Variance inflation factors: a separate check for multicollinearity, not coefficients.
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values.astype(float), i)
              for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by="VIF", ascending=False)
vif
```
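The solution fits `RFE` but never reads its result. A minimal sketch of how the selected features could be inspected and used, assuming synthetic data rather than `car_data.csv`: the column names `noise_a`, `noise_b` and `noise_c` and the coefficients 5.0 and 4.0 are made up for illustration, and `enginesize` and `curbweight` are borrowed from the dataset only as labels.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic features (an assumption for illustration): five standard-normal columns.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame(
    rng.normal(size=(n, 5)),
    columns=["enginesize", "curbweight", "noise_a", "noise_b", "noise_c"],
)
# Synthetic target: only the first two columns influence it.
y = 5.0 * X["enginesize"] + 4.0 * X["curbweight"] + rng.normal(size=n)

# RFE repeatedly drops the feature with the smallest coefficient magnitude
# until n_features_to_select features remain.
rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)

# support_ is a boolean mask of kept features; ranking_ is 1 for kept features.
selected = list(X.columns[rfe.support_])
ranking = pd.Series(rfe.ranking_, index=X.columns).sort_values()
print("Selected features:", selected)
print(ranking)

# Refit a plain linear model on the selected columns only.
final_model = LinearRegression().fit(X[selected], y)
print("R^2 on selected features:", final_model.score(X[selected], y))
```

Because `RFE` compares coefficient magnitudes, it is most meaningful when the features share a common scale, which is one reason to scale the data before this step.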