## Instructions

**Objective**

## Requirements and Specifications

**Source Code
**

# LEGO Regression Project

## Introduction

In this cumulative lab you'll perform a full linear regression analysis and report the findings of your final model, including both predictive model performance metrics and interpretation of fitted model parameters.

## Objectives

You will be able to:

* Perform a full linear regression analysis with iterative model development

* Evaluate your final model and interpret its predictive performance metrics

* Apply an inferential lens to interpret relationships between variables identified by the model

# Task: Develop a LEGO Pricing Algorithm

![pile of legos](images/legos.jpg)

Photo by Xavi Cabrera on Unsplash

## Business Understanding

You just got hired by LEGO! Your first project is going to be to develop a pricing algorithm to help set a target price for new LEGO sets that are released to market. The goal is to save the company some time and to help ensure consistency in pricing between new products and past products.

The main purpose of this algorithm is *predictive*, meaning that **your model should be able to take in attributes of a LEGO set that does not yet have a set price, and to predict a good price**. The effectiveness of your predictive model will be measured by how well it predicts prices in our test set, where we know what the actual prices were but the model does not.

The secondary purpose of this algorithm is *inferential*, meaning that **your model should be able to tell us something about the relationship between the attributes of a LEGO set and its price**. You will apply your knowledge of statistics to include appropriate caveats about these relationships.

## Data Understanding

You have access to a dataset containing over 700 LEGO sets released in the past, including attributes of those sets as well as their prices. You can assume that the numeric attributes in this dataset have already been preprocessed appropriately for modeling (i.e. that there are no missing or invalid values), while the text attributes are simply there for your visual inspection and should not be used for modeling. Also, note that some of these attributes cannot be used in your analysis because they will be unavailable for future LEGO products or are otherwise irrelevant.

You do not need to worry about inflation or differences in currency; just predict the same kinds of prices as are present in the past data, which have already been converted to USD.

## Loading the Data

Load both the train and test datasets from the `data` folder in this repository, then split them into feature and target DataFrames. Remember, both of these datasets contain prices, but we are using the test set as a stand-in for future LEGO products where the price has not yet been determined. The model will be trained on just the train set, then we will compare its predictions on the test set to the actual prices on the test set.

# Run this cell without changes

import pandas as pd

import matplotlib.pyplot as plt

%matplotlib inline

# Replace None with your code

train = pd.read_csv('data/lego_train.csv')

test = pd.read_csv('data/lego_test.csv')

X_train = train[['piece_count', 'min_age', 'max_age', 'difficulty_level', 'num_reviews', 'star_rating']]

y_train = train['list_price']

X_test = test[['piece_count', 'min_age', 'max_age', 'difficulty_level', 'num_reviews', 'star_rating']]

y_test = test['list_price']

## Exploring the Data

Inspect your data to start making sense of it

# Replace None with your code

train.head()

Produce high-level descriptive information about your training data

# Replace None with your code

train.describe()

Visualize the distribution of the target variable

# Replace None with your code

fig, ax = plt.subplots()

y_train.plot.hist(ax = ax)

ax.set_xlabel('List price')

ax.set_ylabel('Frequency')

plt.show()

## Requirements

### 1. Interpret a Correlation Heatmap to Build a Baseline Model

You'll start modeling by choosing the feature that is most correlated with our target, and build and evaluate a linear regression model with just that feature.

### 2. Build a Model with All Relevant Numeric Features

Now, add in the rest of the relevant numeric features of the training data, and compare that model's performance to the performance of the baseline model.

### 3. Select the Best Combination of Features

Using statistical properties of the fitted model, the `sklearn.feature_selection` submodule, and some custom code, find the combination of relevant numeric features that produces the best scores.

### 4. Build and Evaluate a Final Predictive Model

Using the best features selected in the previous step, create a final model, fit it on all rows of the training dataset, and evaluate it on all rows of the test dataset in terms of both r-squared and RMSE.

### 5. Interpret the Final Model

Determine what, if any, understanding of the underlying relationship between variables can be determined with this model. This means you will need to interpret the model coefficients as well as checking whether the assumptions of linear regression have been met.

### 6. Create Presentation Notebook

Edit this notebook or create a new one to showcase your work.

# 1. Interpret a Correlation Heatmap to Build a Baseline Model

## Correlation Heatmap

Produce a heatmap showing the correlations between all of the numeric values in our training data. The x and y axis labels should indicate the pair of values that are being compared, and then the color and the number should represent the correlation.

The most important column or row shows the correlations between the target (listing price) and other attributes.

# Run this cell without changes

import seaborn as sns

import numpy as np

# Replace None with your code

# Compute correlation matrix

corr = train.corr()

sns.heatmap(corr)

Based on the heatmap, which feature is most strongly correlated with the target (`listing_price`)? In other words, which feature has the strongest positive or negative correlation — the correlation with the greatest magnitude?

# Replace None with the name of the feature (a string)

most_correlated_feature = 'piece_count'

Create a scatter plot of that feature vs. listing price:

# Replace None with your code

plt.figure()

piece_count = train['piece_count'].to_numpy()

list_price = train['list_price'].to_numpy()

plt.scatter(piece_count, list_price)

plt.grid(True)

plt.xlabel('Piece count')

plt.ylabel('List price')

plt.show()

Assuming you correctly identified `piece_count` (the number of pieces in the LEGO set) as the most correlated feature, you should have a scatter plot that shows a fairly clear linear relationship between that feature and the target. It looks like we are ready to proceed with making our baseline model without any additional transformation.

## Building a Baseline Model

Now, we'll build a linear regression model using just that feature, which will serve as our baseline model:

# Run this cell without changes

from sklearn.linear_model import LinearRegression

baseline_model = LinearRegression()

Then we evaluate the model using `cross_validate` and `ShuffleSplit`, which essentially means that we perform 3 separate train-test splits within our `X_train` and `y_train`, then we find both the train and the test scores for each.

# Run this cell without changes

from sklearn.model_selection import cross_validate, ShuffleSplit

splitter = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

baseline_scores = cross_validate(

estimator=baseline_model,

X=X_train[[most_correlated_feature]],

y=y_train,

return_train_score=True,

cv=splitter

)

print("Train score: ", baseline_scores["train_score"].mean())

print("Validation score:", baseline_scores["test_score"].mean())

Interpret these scores below. What are we measuring? What can we learn from this?

**Hint:** when you use `cross_validate`, it uses the `.score` method of the estimator by default. See [documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score) for that method of `LinearRegression`.

# Replace None with appropriate text

"""

We see that the score is around ~78%. It means that

the model is capable of predict the listing price correctly 78% of the times

"""

# 2. Build a Model with All Numeric Features

Now that we have established a baseline, it's time to move on to more complex models.

## Numeric Feature Selection

One thing that you will almost always need to do in a modeling process is remove non-numeric data prior to modeling. While you could apply more-advanced techniques such as one-hot encoding or NLP in order to convert non-numeric columns into numbers, this time just create a dataframe `X_train_numeric` that is a copy of `X_train` that only contains numeric columns.

You can look at the `df.info()` printout above to do this manually, or there is a handy `.select_dtypes` method ([documentation here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html)).

# Replace None with appropriate code

X_train_numeric = train[['prod_id', 'piece_count', 'min_age', 'max_age', 'difficulty_level', 'num_reviews', 'star_rating']]

X_train_numeric

The following code checks that your answer was correct:

# Run this cell without changes

# X_train_numeric should be a dataframe

assert type(X_train_numeric) == pd.DataFrame

# Check for correct shape

assert X_train_numeric.shape == (558, 7)

Now we can look at scatter plots of all numeric features compared to the target (skipping `piece_count` since we already looked at that earlier):

# Run this cell without changes

scatterplot_data = X_train_numeric.drop("piece_count", axis=1)

fig, axes = plt.subplots(ncols=3, nrows=2, figsize=(12, 8))

fig.set_tight_layout(True)

for index, col in enumerate(scatterplot_data.columns):

ax = axes[index//3][index%3]

ax.scatter(X_train_numeric[col], y_train, alpha=0.2)

ax.set_xlabel(col)

ax.set_ylabel("listing price")

## Feature Selection Using Domain Understanding

Ok, now all of the remaining features can technically go into a model with scikit-learn. But do they make sense?

Some reasons you might not want to include a given numeric column include:

1. The column represents a unique identifier, not an actual numeric feature

2. The column is something that will not be available when making future predictions

Recall that the business purpose here is creating an algorithm to set the price for a newly-released LEGO set. Which columns should we drop because of the issues above?

# Replace None with appropriate text

"""

The prod_id column contains an identifier of the product, but its value does not affects its price

so we should drop the 'prod_id' column

Another column that should not affect the price is the 'num_reviews' and 'star_rating' since these values are assigned

to each lego set after it is sold

"""

Now, create a variable `X_train_second_model`, which is a copy of `X_train_numeric` where those irrelevant columns have been removed:

# Replace None with appropriate code

X_train_second_model = train[['piece_count', 'min_age', 'max_age', 'difficulty_level']]

X_train_second_model

## Building and Evaluating the Second Model

In the cell below, we use the same process to evaluate a model using `X_train_second_model` rather than using just `piece_count`.

# Run this cell without changes

second_model = LinearRegression()

second_model_scores = cross_validate(

estimator=second_model,

X=X_train_second_model,

y=y_train,

return_train_score=True,

cv=splitter

)

print("Current Model")

print("Train score: ", second_model_scores["train_score"].mean())

print("Validation score:", second_model_scores["test_score"].mean())

print()

print("Baseline Model")

print("Train score: ", baseline_scores["train_score"].mean())

print("Validation score:", baseline_scores["test_score"].mean())

Interpret these results. Did our second model perform better than the baseline? Any ideas about why or why not?

**Hint:** because the purpose of this model is to set future prices that have not been determined yet, the most important metric for evaluating model performance is the validation score, not the train score.

# Replace None with appropriate text

"""

The train score increased by 15 but the validation score is worst than in the baseline model.

This could be happening because the values are not normalized

"""

# 3. Select the Best Combination of Features

As you likely noted above, adding all relevant numeric features did not actually improve the model performance. Instead, it led to overfitting.

## Investigating Multicollinearity

This potentially indicates that our model is performing poorly because these features violate the independence assumption (i.e. there is strong multicollinearity). In other words, maybe the minimum age, maximum age, and difficulty level are not really providing different information than the number of pieces in the LEGO set, and instead are just adding noise. Then the model is using that noise to get a slightly better score on the training data, but but a worse score on the validation data.

While `LinearRegression` from scikit-learn has a lot of nice functionality for working with a predictive framing (e.g. compatibility with the `cross_validate` function), it doesn't have anything built in to detect strong multicollinearity. Fortunately the same linear regression model is also available from StatsModels ([documentation here](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html)), where it is called `OLS` (for "ordinary least squares"). Models in StatsModels, including `OLS`, are not really designed for predictive model validation, but they do give us a lot more statistical information.

In the cell below, use StatsModels to fit a linear regression model on the same features used in the second model, then display the summary.

# Replace None with your code

import statsmodels.api as sm

result = sm.OLS(y_train, X_train_second_model).fit()

print(result.summary())

A condition number of 10-30 indicates multicollinearity, and a condition number above 30 indicates strong multicollinearity. This print-out shows a condition number of `2.77e+03`, i.e. 2770, which is well above 30.

In a predictive context (we are currently trying to build a model to assign prices to future LEGO sets, not a model primarily intended for understanding the relationship between prices and attributes of past LEGO sets), we do not *always* need to be worried when we identify strong multicollinearity. Sometimes there are features that are highly collinear but they also are individually communicating useful information to the model. In this case, however, it seems like strong multicollinearity might be what is causing our second model to have worse performance than the first model.

## Selecting Features Based on p-values

Given that we suspect our model's issues are related to multicollinearity, let's try to narrow down those features. In this case, let's use the p-values assigned to the coefficients of the model.

Looking at the model summary above, ***which features are statistically significant, with p-values above 0.05***? (P-values are labeled **P>|t|** in a StatsModels summary.)

# Replace None with appropriate text

"""

The values that are statistically significant are 'min_age' and 'max_age'

"""

**Important note:** There are many limitations to using coefficient p-values to select features. See [this StackExchange answer](https://stats.stackexchange.com/a/291239) with examples in R for more details. However for now you can proceed using just the p-values technique until the more-advanced techniques have been covered.

In the cell below, create a list `significant_features` that contains the names of the columns whose features have statistically significant coefficient p-values. You should not include `"const"` in that list because `LinearRegression` from scikit-learn automatically adds a constant term and there is no column of `X_train` called `"const"`.

(You do not need to extract this information programmatically, just write them out like `"column_name_1", "column_name_2"` etc.)

# Replace None with appropriate code

significant_features = ['min_age', 'max_age']

Now let's build a model using those significant features only:

# Run this cell without changes

third_model = LinearRegression()

X_train_third_model = X_train[significant_features]

third_model_scores = cross_validate(

estimator=third_model,

X=X_train_third_model,

y=y_train,

return_train_score=True,

cv=splitter

)

print("Current Model")

print("Train score: ", third_model_scores["train_score"].mean())

print("Validation score:", third_model_scores["test_score"].mean())

print()

print("Second Model")

print("Train score: ", second_model_scores["train_score"].mean())

print("Validation score:", second_model_scores["test_score"].mean())

print()

print("Baseline Model")

print("Train score: ", baseline_scores["train_score"].mean())

print("Validation score:", baseline_scores["test_score"].mean())

Interpret the results below. What happened when we removed the features with high p-values?

# Replace None with appropriate text

"""

Now, the model is worst than the previous 2 as the validation score is only ~21.85%

"""

Interesting. So, this algorithm is saying that our baseline model, with `piece_count` as the only feature, is the best one it could find.

However, note that this is based on the "importances" of the features, which means the coefficients in the context of a linear regression. It is possible that we can still get a better model by including multiple features, if we try removing columns using a different strategy.

## A Brute Force Approach

Given that we have only four columns and only a few hundred rows, one other option we have is something more computationally expensive: write custom code that goes over multiple different permutations of the columns, to see if we can find something better than the p-values approach.

The code below assumes that we want to keep the `piece_count` column, then attempts a linear regression with all possible combinations of 1-2 additional features. Don't worry too much if you don't understand everything that is happening here — an approach like this should be a last resort and you may not ever need to use it!

# Run this cell without changes

from itertools import combinations

features = ["piece_count", "min_age", "max_age", "difficulty_level"]

# Make a dataframe to hold the results (not strictly necessary

# but it makes the output easier to read)

results_df = pd.DataFrame(columns=features)

# Selecting just piece_count

results_df = results_df.append({

"train_score": baseline_scores["train_score"].mean(),

"val_score": baseline_scores["test_score"].mean()

}, ignore_index=True)

# Selecting 1 additional feature

for feature in features[1:]:

scores = cross_validate(

estimator=second_model,

X=X_train[["piece_count", feature]],

y=y_train,

return_train_score=True,

cv=splitter

)

# Note: this technique of appending to a df is quite inefficient

# Here it works because it's only happening 6 times, but avoid

# doing this for a whole dataset

results_df = results_df.append({

feature: "Yes",

"train_score": scores["train_score"].mean(),

"val_score": scores["test_score"].mean()

}, ignore_index=True)

# Selecting 2 additional features

for (feature1, feature2) in list(combinations(features[1:], 2)):

scores = cross_validate(

estimator=second_model,

X=X_train[["piece_count", feature1, feature2]],

y=y_train,

return_train_score=True,

cv=splitter

)

results_df = results_df.append({

feature1: "Yes",

feature2: "Yes",

"train_score": scores["train_score"].mean(),

"val_score": scores["test_score"].mean()

}, ignore_index=True)

# Including all 3 additional features

results_df = results_df.append({

"min_age": "Yes", "max_age": "Yes", "difficulty_level": "Yes",

"train_score": second_model_scores["train_score"].mean(),

"val_score": second_model_scores["test_score"].mean()

}, ignore_index=True)

# Fill in remaining values where appropriate

results_df["piece_count"] = "Yes"

results_df.fillna("No", inplace=True)

results_df

Interpret the table above. It shows both training and validation scores for `piece_count` as well as all combinations of 0, 1, 2, or 3 other features.

Which features make the best model? Which make the worst? How does this align with the previous discussion of multicollinearity? And how much does feature selection seem to matter in general for this dataset + model algorithm, once we have identified the most correlated feature for the baseline?

# Replace None with appropriate text

"""

The features that makes the best model are 'piece_count', 'max_age' and 'difficulty_level'

while the features making the worst model are 'piece_count', 'min_age' and 'max_age'

This shows that the multicollinearity between the variables sometimes improves the model and sometimes

it just adds noise

"""

# 4. Build and Evaluate a Final Predictive Model

In the cell below, create a list `best_features` which contains the names of the best model features based on the findings of the previous step:

# Replace None with appropriate code

best_features = ['piece_count', 'max_age', 'difficulty_level']

Now, we prepare the data for modeling:

# Run this cell without changes

X_train_final = X_train[best_features]

X_test_final = X_test[best_features]

In the cell below, instantiate a `LinearRegression` model called `final_model`, then fit it on the training data and score it on the test data.

# Replace None with appropriate code

final_model = LinearRegression()

# Fit the model on X_train_final and y_train

final_model.fit(X_train_final, y_train)

# Score the model on X_test_final and y_test

# (use the built-in .score method)

final_model.score(X_test_final, y_test)

## User-Friendly Metrics

The score above is an r-squared score. Let's compute the MAE as well, since this would be more applicable to a business audience.

# Run this cell without changes

from sklearn.metrics import mean_absolute_error

mean_absolute_error(y_test, final_model.predict(X_test_final))

What does this value mean in the current business context?

# Replace None with appropriate text

"""

This is the error between the predicted values and the real values. The MAE function computes

the absolute error abs(Yp_i - Y_i) for each sample i and the sums all these values. As the values are closer (predicted to real)

the error decreases

"""

# 5. Interpret the Final Model

Display the coefficients and intercept for the final model:

# Run this cell without changes

print(pd.Series(final_model.coef_, index=X_train_final.columns, name="Coefficients"))

print()

print("Intercept:", final_model.intercept_)

Interpret these values below. What is the pricing algorithm you have developed?

# Replace None with appropriate text

"""

The algorithm returned a linear function of 3 variables that estimates the listing price.

The function returned by the algorithm is:

listing_price = intercept + 0.058633*piece_count - 0.043271*max_age + 2.044057*difficulty_level

"""

Before assuming that these coefficients give us inferential insight into past pricing decisions, we should investigate each of the assumptions of linear regression, in order to understand how much our model violates them.

## Investigating Linearity

First, let's check whether the linearity assumption holds.

# Run this cell without changes

preds = final_model.predict(X_test_final)

fig, ax = plt.subplots()

perfect_line = np.arange(y_test.min(), y_test.max())

ax.plot(perfect_line, linestyle="--", color="orange", label="Perfect Fit")

ax.scatter(y_test, preds, alpha=0.5)

ax.set_xlabel("Actual Price")

ax.set_ylabel("Predicted Price")

ax.legend();

Are we violating the linearity assumption?

# Replace None with appropriate text

"""

No. We can see that the relation between the X variables and the target variable 'y' is linear

"""

### Investigating Normality

Now let's check whether the normality assumption holds for our model.

# Run this code without changes

import scipy.stats as stats

residuals = (y_test - preds)

sm.graphics.qqplot(residuals, dist=stats.norm, line='45', fit=True);

Are we violating the normality assumption?

# Replace None with appropriate text

"""

We see that the normality assumption is violated. The majority of the observations does not lies on the straight line (red lne)

"""

### Investigating Multicollinearity (Independence Assumption)

Another way to measure multicollinearity is with [variance inflation factor](https://en.wikipedia.org/wiki/Variance_inflation_factor) ([StatsModels documentation here](https://www.statsmodels.org/stable/generated/statsmodels.stats.outliers_influence.variance_inflation_factor.html#statsmodels.stats.outliers_influence.variance_inflation_factor)). A "rule of thumb" for VIF is that 5 is too high (i.e. strong multicollinearity).

Run the code below to find the VIF for each feature.

# Run this cell without changes

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = [variance_inflation_factor(X_train_final.values, i) for i in range(X_train_final.shape[1])]

pd.Series(vif, index=X_train_final.columns, name="Variance Inflation Factor")

Do we have too high of multicollinearity?

# Replace None with appropriate text

"""

No, we do not. We see that the variance inflation factor for each variable is less than 2.

"""

### Investigating Homoscedasticity

Now let's check whether the model's errors are indeed homoscedastic or if they violate this principle and display heteroscedasticity.

# Run this cell without changes

fig, ax = plt.subplots()

ax.scatter(preds, residuals, alpha=0.5)

ax.plot(preds, [0 for i in range(len(X_test))])

ax.set_xlabel("Predicted Value")

ax.set_ylabel("Actual - Predicted Value");

Are we violating the homoscedasticity assumption?

# Replace None with appropriate text

"""

I do not think the model violates the homosceasticity assumption. If we see the graph, the variance of residuals is almost the same (Centered around the horizonal line)

"""

### Linear Regression Assumptions Conclusion

Given your answers above, how should we interpret our model's coefficients? Do we have a model that can be used for inferential as well as predictive purposes? What might your next steps be?

# Replace None with appropriate text

"""

We have determined a model that is capable of estimating the price of Lego pieces with an accuracy of approx. 78%.

Although the model is bad, it still needs to be improved to achieve an accuracy of at least 90%. In the model, we have been able to observe that the value of the intercept is ~9.

This means that a lego set without parts would cost approx. $9 which makes no sense.

"""

# 6. Create Presentation Notebook

Now that you've completed your project, let's put it into an easily presentable format so you can add it to your portfolio. To do this, we recommend completing the following steps outside of this notebook.

1. Create a new GitHub repository for your project.

2. Save a copy of this notebook and the `data` subfolder into your local repository.

3. Edit the text and images in the notebook to present your project and help someone else understand it.

4. Run your notebook from start to finish, then save it.

5. Create a README.md file in your repository with a brief summary of your project.

6. Push your updated repository to GitHub to share with your instructor and employers!

# Level Up: Project Enhancements

After completing the project, you could consider the following enhancements if you have time:

* Engineer new features to improve the predictive power of your model

* Identify and remove outliers, then redo the analysis

* Identify models with high or low value for LEGO buyers, using the differences between actual and predicted prices

* Conduct statistical tests using the numeric features in the dataset to make inferences about the population of LEGO sets

## Summary

Well done! As you can see, regression can be a challenging task that requires you to make decisions along the way, try alternative approaches, and make ongoing refinements.