# Python Program to Implement Data Management Assignment Solution.

## Instructions

Objective
Write a program to implement data management in Python.

## Requirements and Specifications

Source Code

```python
!pip install otter-grader

# Initialize Otter
import otter
grader = otter.Notebook("assignment3.ipynb")
```

---

# Assignment 3: Exploratory Data Analysis in Professional Basketball

In this assignment we'll conduct an exploratory data analysis of professional basketball data. Basketball is a team sport in which the goal is to outscore the opponent in a fixed amount of time. Points are scored (either 2 or 3 points) by putting the ball through a hoop on one end of the court. An attempt at putting the ball through the hoop is known as a "shot". If helpful, you can read more about [the rules of basketball](https://en.wikipedia.org/wiki/Rules_of_basketball).

The National Basketball Association (NBA) is the professional basketball league in the United States and provides a nice website with many statistics gathered on teams and players in the league: [http://stat.nba.com](http://stat.nba.com).

## Question 1: Managing data files

We will use data that is available from the NBA. Although the NBA doesn't officially make its data API (application programming interface) public, people have figured out ways to access the data programmatically ([1](https://datavizardry.com/2020/01/28/nba-shot-charts-part-1/), [2](https://github.com/swar/nba_api), [3](https://github.com/seemethere/nba_py/wiki/stats.nba.com-Endpoint-Documentation)). While these approaches will work when Python is installed and running on your computer, the NBA seems to block (pun intended) connections from Google Cloud, where our course JupyterHub is running. Therefore, in this assignment, the raw data downloads are provided to you in a zip file: https://ucsb.box.com/shared/static/z6y3etgikbzbnf0ld4brvc95xtgjcrie.zip

Download and unzip the file to a directory named `data` using command line commands (unzipping on Windows and Mac may not work because different operating systems have different constraints on filename lengths, etc.). Prefixing a command with an exclamation point in a Jupyter notebook cell indicates that the bash shell interpreter will execute it.
```bash
wget -nc https://ucsb.box.com/shared/static/z6y3etgikbzbnf0ld4brvc95xtgjcrie.zip -O nba-data.zip
unzip -o nba-data.zip -d data
```

What these commands are doing:

* `wget` downloads files ([what do each of the pieces do?](https://explainshell.com/explain?cmd=wget+-nc+https%3A%2F%2Fucsb.box.com%2Fshared%2Fstatic%2F940qiuxyp798gv4nx8iwvazu9qqjs37e.zip+-O+nba-data.zip))
* `unzip` will unzip `nba-data.zip` into the directory named `data` (specified by `-d data`) and will overwrite any files with the same names when extracting (specified by `-o`).

```python
# Run your commands in this cell
!wget -nc https://ucsb.box.com/shared/static/z6y3etgikbzbnf0ld4brvc95xtgjcrie.zip -O nba-data.zip
!unzip -o nba-data.zip -d data
```

After unzipping the files, you will find three types of files in the `data/` directory:

* Team data: `commonTeamYears?LeagueID=00&Season=2018-19`
* Player data: `commonallplayers?LeagueID=00&Season=2018-19&IsOnlyCurrentSeason=0`
* Player's shot data: `shotchartdetail?PlayerID=[PlayerID]&PlayerPosition=&Season=2018-19&ContextMeasure=FGA&DateFrom=&DateTo=&GameID=&GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&Period=0&Position=&RookieYear=&SeasonSegment=&`

Each player's shot data is identified by replacing `[PlayerID]` with their numeric ID.

Here is how we will read in the data:

* Each data file contains text in [JSON (JavaScript Object Notation) format](https://www.w3schools.com/python/python_json.asp).
* First, read the data content as text (using [`Path.read_text()`](https://docs.python.org/3/library/pathlib.html#pathlib.Path.read_text) from the `pathlib` module)
* Second, convert it to a Python dictionary (using [`json.loads()`](https://docs.python.org/3/library/json.html#json.loads) in the `json` module)
* Third, identify the DataFrame content
* Fourth, identify the DataFrame header
* Fifth, assemble the DataFrame

### Question 1a: Team Data

Read the team data file into a pandas data frame named `allteams`, starting from the given code below.

```python
from pathlib import Path
import json
import pandas as pd
import numpy as np

fname = 'data/commonTeamYears?LeagueID=00&Season=2018-19'  # directory_name/file_name

step_1 = Path(fname).read_text()             # str
step_2 = json.loads(step_1)                  # dict
step_3 = step_2['resultSets'][0]['rowSet']   # list
step_4 = step_2['resultSets'][0]['headers']  # list
```

```python
# print out each of step_1 through step_4 and understand what each line does
print(step_1)  # The raw text read from the file
print(step_2)  # The text parsed from JSON into a Python dict
print(step_3)  # The 'rowSet' entry of the first element of 'resultSets': the data rows
print(step_4)  # The 'headers' entry of the first element of 'resultSets': the column names
```

Use the variables constructed above to assemble the `allteams` DataFrame. Drop any teams that no longer exist as of 2019. These teams show `None` in the `ABBREVIATION` column.

```python
allteams = pd.DataFrame(data=step_3, columns=step_4)
# Drop defunct teams, i.e. rows whose ABBREVIATION is missing
allteams = allteams.dropna(subset=['ABBREVIATION'])
allteams.head()
```

```python
grader.check("q1a")
```

### Question 1b: Player Data

`pathlib` has flexible ways to specify file and directory paths.
For example, the following are equivalent:

* `Path('data/commonallplayers?LeagueID=00&Season=2018-19&IsOnlyCurrentSeason=0')`
* `Path('data') / 'commonallplayers?LeagueID=00&Season=2018-19&IsOnlyCurrentSeason=0'`
* `Path('data').joinpath('commonallplayers?LeagueID=00&Season=2018-19&IsOnlyCurrentSeason=0')`

Read the players data file named `data/commonallplayers?LeagueID=00&Season=2018-19&IsOnlyCurrentSeason=0`. Assemble a pandas DataFrame named `allplayers`. Set the row index to be `PERSON_ID` and call `sort_index`.

```python
dirname = 'data'  # directory_name
filename = 'commonallplayers?LeagueID=00&Season=2018-19&IsOnlyCurrentSeason=0'  # file_name

step_1 = Path(dirname).joinpath(filename).read_text()  # str
step_2 = json.loads(step_1)                            # dict
step_3 = step_2['resultSets'][0]['rowSet']             # list
step_4 = step_2['resultSets'][0]['headers']            # list

allplayers = pd.DataFrame(data=step_3, columns=step_4)
allplayers = allplayers.set_index('PERSON_ID')
allplayers = allplayers.sort_index()
allplayers.head()
```

```python
grader.check("q1b")
```

### Question 1c: Shots Data

`pathlib` can also find all filenames that match a given pattern using the [`Path.glob()` method](https://docs.python.org/3/library/pathlib.html#pathlib.Path.glob). For example, teams data and players data start with the pattern `common` followed by a wildcard `*`: `common*`. We can use this to retrieve two file names with one call:

```python
two_files = Path('data').glob('common*')  # generator: https://www.educative.io/edpresso/generator-vs-iterator-in-python
list(two_files)                           # list
```

All file names for shots data start with `shotchartdetail`. Use this as the pattern to

* First, read all file names into `allshots_files`
* Second, loop over each file in `allshots_files` and assemble a dataframe
* Third, add it as an element to a list named `allshots_list` (each file is a data frame item in the list).
* Fourth, concatenate all dataframes into one dataframe named `allshots`.
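The four steps above can be tried out without the NBA download. The following is a minimal sketch under stated assumptions: the two file names and their contents are hypothetical stand-ins written to a temporary directory, and only the `resultSets` / `headers` / `rowSet` layout is taken from the files described in this assignment.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for two shot files (not real NBA data)
payloads = {
    "shotchartdetail_demo_1": {
        "resultSets": [{"headers": ["PLAYER_ID", "LOC_X"],
                        "rowSet": [[2, 10], [2, -5]]}]
    },
    "shotchartdetail_demo_2": {
        "resultSets": [{"headers": ["PLAYER_ID", "LOC_X"],
                        "rowSet": [[1, 7]]}]
    },
}

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    for name, payload in payloads.items():
        (data_dir / name).write_text(json.dumps(payload))

    # Steps 1-3: find matching files, build one data frame per file
    frames = []
    for path in sorted(data_dir.glob("shotchartdetail*")):
        result = json.loads(path.read_text())["resultSets"][0]
        frames.append(pd.DataFrame(result["rowSet"], columns=result["headers"]))

# Step 4: concatenate, then index by player and sort
demo_shots = pd.concat(frames).set_index("PLAYER_ID").sort_index()
print(demo_shots)
```

Because `Path.glob()` returns a generator in no guaranteed order, wrapping it in `sorted()` keeps the result reproducible from run to run.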
Set the row index to be `PLAYER_ID` and call `sort_index`.

```python
allshots_files = list(Path('data').glob('shotchartdetail*'))
allshots_files.sort()

allshots_list = list()
for f in allshots_files:
    # Read file
    step_1 = Path(f).read_text()
    step_2 = json.loads(step_1)                  # dict
    step_3 = step_2['resultSets'][0]['rowSet']   # list
    step_4 = step_2['resultSets'][0]['headers']  # list
    allshots = pd.DataFrame(data=step_3, columns=step_4)
    allshots_list.append(allshots)

# Now concatenate
allshots = pd.concat(allshots_list)
allshots = allshots.set_index('PLAYER_ID')
allshots = allshots.sort_index()
allshots.head()
```

```python
grader.check("q1c")
```

### Question 1d: Extract Stephen Curry's Shot Data

Use [`allplayers.query()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) to find the player id (index) associated with the player named "[Stephen Curry](https://en.wikipedia.org/wiki/Stephen_Curry)". Set the value of `PlayerID` as `curry_id` of type `str`.

Subset all of Stephen Curry's shots in a data frame named `curry_data`. Also, set the dtype of `SHOT_MADE_FLAG` to `'bool'` in one command. Something like:

```
curry_data = allshots.query(???).astype(????)
```

```python
allplayers.head()
```

```python
query_str = 'DISPLAY_FIRST_LAST == "Stephen Curry"'
curry_id = str(allplayers.query(query_str).index.values[0])
curry_data = allshots.query('PLAYER_ID == ' + curry_id).astype({'SHOT_MADE_FLAG': bool})
```

```python
grader.check("q1d")
```

## Question 2: Visualization

### Question 2a: All Shots Scatter Plot

Use `seaborn` to create a scatter plot of the location of Stephen Curry's shot attempts from this year (`LOC_X` and `LOC_Y`). When you call a scatterplot, seaborn returns the plot's Axes object, which we'll call `ax`. We can set properties of the plot by calling methods on `ax`. Use this approach to set the x-axis limits to span (-300, 300) and the y-axis limits to span (-100, 500).
```python
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=[12, 11])
ax2a = sns.scatterplot(x=curry_data['LOC_X'], y=curry_data['LOC_Y'])

# Set x/y limits
ax2a.set(xlim=(-300, 300), ylim=(-100, 500))
plt.show()
```

```python
grader.check("q2a")
```

Understanding any dataset is difficult without context. Let's add some important context by adding the relevant court lines to our diagram. If you are interested, you can read more about the lines and dimensions on the [NBA basketball court](https://en.wikipedia.org/wiki/Basketball_court). We will use code from [http://savvastjortjoglou.com/nba-shot-sharts.html](http://savvastjortjoglou.com/nba-shot-sharts.html) to add the court markings to our diagram. The `draw_court` function below will do this for us. The cell below will generate an example court.

```python
## code is from http://savvastjortjoglou.com/nba-shot-sharts.html
def draw_court(ax=None, color='black', lw=1, outer_lines=False):

    from matplotlib.patches import Circle, Rectangle, Arc
    from matplotlib.pyplot import gca

    # If an axes object isn't provided to plot onto, just get current one
    if ax is None:
        ax = gca()

    # Create the various parts of an NBA basketball court

    # Create the basketball hoop
    # Diameter of a hoop is 18" so it has a radius of 9", which is a value
    # 7.5 in our coordinate system
    hoop = Circle((0, 0), radius=7.5, linewidth=lw, color=color, fill=False)

    # Create backboard
    backboard = Rectangle((-30, -7.5), 60, 0, linewidth=lw, color=color)

    # The paint
    # Create the outer box of the paint, width=16ft, height=19ft
    outer_box = Rectangle((-80, -47.5), 160, 190, linewidth=lw, color=color,
                          fill=False)
    # Create the inner box of the paint, width=12ft, height=19ft
    inner_box = Rectangle((-60, -47.5), 120, 190, linewidth=lw, color=color,
                          fill=False)

    # Create free throw top arc
    top_free_throw = Arc((0, 142.5), 120, 120, theta1=0, theta2=180,
                         linewidth=lw, color=color, fill=False)
    # Create free throw bottom arc
    bottom_free_throw = Arc((0, 142.5), 120, 120, theta1=180, theta2=0,
                            linewidth=lw, color=color, linestyle='dashed')
    # Restricted Zone, it is an arc with 4ft radius from center of the hoop
    restricted = Arc((0, 0), 80, 80, theta1=0, theta2=180, linewidth=lw,
                     color=color)

    # Three point line
    # Create the side 3pt lines, they are 14ft long before they begin to arc
    corner_three_a = Rectangle((-219, -47.5), 0, 140, linewidth=lw,
                               color=color)
    corner_three_b = Rectangle((219, -47.5), 0, 140, linewidth=lw, color=color)
    # 3pt arc - center of arc will be the hoop, arc is 23'9" away from hoop
    # I just played around with the theta values until they lined up with the
    # threes
    three_arc = Arc((0, 0), 475, 475, theta1=22.5, theta2=157.5, linewidth=lw,
                    color=color)

    # Center Court
    center_outer_arc = Arc((0, 422.5), 120, 120, theta1=180, theta2=0,
                           linewidth=lw, color=color)
    center_inner_arc = Arc((0, 422.5), 40, 40, theta1=180, theta2=0,
                           linewidth=lw, color=color)

    # List of the court elements to be plotted onto the axes
    court_elements = [hoop, backboard, outer_box, inner_box, top_free_throw,
                      bottom_free_throw, restricted, corner_three_a,
                      corner_three_b, three_arc, center_outer_arc,
                      center_inner_arc]

    if outer_lines:
        # Draw the half court line, baseline and side out bound lines
        outer_lines = Rectangle((-250, -47.5), 500, 470, linewidth=lw,
                                color=color, fill=False)
        court_elements.append(outer_lines)

    # Add the court elements onto the axes
    for element in court_elements:
        ax.add_patch(element)

    return ax
```

```python
plt.figure(figsize=(12, 11))
draw_court(outer_lines=True)
plt.xlim(-300, 300)
plt.ylim(-100, 500)
plt.show()
```

### Question 2b: All Shots Scatter Plot + Court Outline

Again use seaborn to make a scatter plot of Stephen Curry's shots. Again, set the x-axis limits to span (-300, 300) and the y-axis limits to span (-100, 500), and color the points by whether the shot was made or missed. Set the missed shots to have an 'x' symbol and made shots to be a circular symbol. Call the `draw_court` function with `outer_lines` set to true.
Save the Axes returned by the plot call in a variable called `ax`.

```python
plt.figure(figsize=(12, 11))
# Missed shots (False) get an 'X' marker, made shots (True) a circle
markers = {False: "X", True: "o"}
ax = sns.scatterplot(x=curry_data['LOC_X'], y=curry_data['LOC_Y'],
                     hue=curry_data['SHOT_MADE_FLAG'],
                     style=curry_data['SHOT_MADE_FLAG'], markers=markers)
ax.set(xlim=(-300, 300), ylim=(-100, 500))
draw_court(outer_lines=True)
plt.show()
```

### Question 2c: Analyzing the Visualization

In a few sentences, discuss what makes this an effective or ineffective visualization for understanding the types of shots that Stephen Curry likes to take and is good at taking, relative to other players in the league. Are there ways it can be improved?

The previous graph shows precisely where Stephen Curry took each shot and whether the shot was successful. This makes it possible to identify the areas of the court where most of his made and missed shots are concentrated, and therefore the positions from which he is most likely to make a shot.

### Question 2d: A Hexbin plot

Visualize Stephen Curry's shots by using a [hexbin plot with marginal histograms](https://seaborn.pydata.org/examples/hexbin_marginals.html). Also refer to setting [figure aesthetics](https://seaborn.pydata.org/tutorial/aesthetics.html) for what the commands below do.
```python
sns.set_style("white")
# kind="hex" requests the hexbin joint plot that the question asks for
joint_shot_chart = sns.jointplot(x=curry_data['LOC_X'], y=curry_data['LOC_Y'], kind="hex")
joint_shot_chart.fig.set_size_inches(12, 11)

# A joint plot has 3 Axes, the first one called ax_joint
# is the one we want to draw our court onto and adjust some other settings
ax = joint_shot_chart.ax_joint
draw_court(ax, outer_lines=True)

# Adjust the axis limits and orientation of the plot in order
# to plot half court, with the hoop by the top of the plot
ax.set_xlim(-300, 300)
ax.set_ylim(500, -100)

# Get rid of axis labels and tick marks
ax.set_xlabel('')
ax.set_ylabel('')
ax.tick_params(labelbottom=False, labelleft=False)

# Add a title
ax.set_title('Stephen Curry, 2018-19, FGA', y=1.2, fontsize=10)

# Add data source and author
ax.text(-250, 445, '\n This plot is based on code by Savvas Tjortjoglou (savvastjortjoglou.com)', fontsize=12);
```

## Question 3: Binning and Smoothing Shots

So far, we have worked with dataframes which represent each shot as a single observation (row) within the dataset. However, this isn't a convenient data structure for the kinds of spatial analyses we will pursue below. In this part, we will divide the court into square regions and create a matrix which includes the number of shots taken by a player in each region. We divide the court up into square bins (i.e. a 2d histogram) and, for each player, count the number of shots that fall into each bin. Fortunately, this function is relatively simple to write using the `numpy` module.

### Question 3a: 2D Smoothing

Fill in the `bin_shots` function below. Use `np.histogram2d` to count the shots in each bin. The bins are defined by `bin_edges`, which is a tuple of the form `(xedges, yedges)`. If `density = True`, call `ndimage.filters.gaussian_filter` on the result of `np.histogram2d` with smoothing parameter `sigma`.
This will create a smoothed version of the raw data histograms.

```python
def bin_shots(df, bin_edges, density=False, sigma=1):
    """Given a data frame of shots, compute a 2d matrix of binned counts

    Args:
        df: data frame of shotchartdetail from nba.com.
            At the minimum, variables named LOC_X and LOC_Y are required.
        bin_edges: bin edge definition: edges in x and edges in y
        density: if True, smooth the counts and normalize them to sum to 1
        sigma: smoothing parameter passed to the Gaussian filter

    Returns:
        binned: counts
        xedges: bin edges in X direction
        yedges: bin edges in Y direction
    """
    import numpy as np
    from scipy import ndimage

    ## Call np.histogram2d
    binned, xedges, yedges = np.histogram2d(df['LOC_X'], df['LOC_Y'], bins=bin_edges)

    if density:
        # Recompute 'binned' using "gaussian_filter"
        # (ndimage.gaussian_filter is the same function; the older
        # ndimage.filters namespace is deprecated in recent SciPy releases)
        binned = ndimage.gaussian_filter(binned, sigma=sigma)

        # Normalize the histogram to be a "density", e.g. mass across all bins sums to 1.
        binned /= np.sum(binned)

    return binned, xedges, yedges
```

```python
grader.check("q3a")
```

### Question 3b: Visualize the binning on `curry_data`

Call `bin_shots` on `curry_data` to create a binned but unsmoothed matrix of shot counts (call this `curry_binned_unsmoothed`), a binned and smoothed matrix of counts with `sigma=1` (call this `curry_binned_smoothed1`) and one with `sigma=5` (call this `curry_binned_smoothed5`). Use the bin edges defined below:

```python
## bin edge definitions in inches
xedges = np.linspace(start=-300, stop=300, num=151)
yedges = np.linspace(start=-48, stop=372, num=106)

bin_edges = (xedges, yedges)

curry_binned_unsmoothed, xe, ye = bin_shots(curry_data, bin_edges, density=False)
curry_binned_smoothed1, xe, ye = bin_shots(curry_data, bin_edges, density=True, sigma=1)
curry_binned_smoothed5, xe, ye = bin_shots(curry_data, bin_edges, density=True, sigma=5)
```
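To see what binning followed by smoothing does independently of the NBA files, here is a small sketch on synthetic data. The uniformly random shot locations and the choice `sigma=2` are assumptions made only for illustration; the bin edges match the ones defined in Question 3b.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

# Synthetic shot locations (an assumption for illustration, not NBA data)
x = rng.uniform(-300, 300, size=500)
y = rng.uniform(-48, 372, size=500)

xedges = np.linspace(-300, 300, 151)
yedges = np.linspace(-48, 372, 106)

# Raw counts: one row per x bin, one column per y bin
counts, _, _ = np.histogram2d(x, y, bins=(xedges, yedges))

# Smooth the counts, then normalize so the mass over all bins sums to 1
smoothed = ndimage.gaussian_filter(counts, sigma=2)
density = smoothed / smoothed.sum()

print(counts.shape)
print(counts.sum())
print(density.sum())
```

The raw histogram is mostly zeros with isolated counts, while the smoothed version spreads each shot over neighboring bins, which is what makes charts from different players comparable bin by bin.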
The function below can be used to visualize the shots as a heatmap:

```python
def plot_shotchart(binned_counts, xedges, yedges, ax=None, use_log=False, cmap='Reds'):
    """Plots 2d heatmap from binned counts

    Args:
        binned_counts: 2d output of numpy.histogram2d
        xedges, yedges: bin edges in arrays
        ax: figure axes [None]
        use_log: will convert count x to log(x+1) to increase visibility [False]
        cmap: Set the color map https://matplotlib.org/examples/color/colormaps_reference.html

    Returns:
        ax: axes with plot
    """
    import numpy as np
    import matplotlib.pyplot as plt

    X, Y = np.meshgrid(xedges, yedges)

    # Plot the log-transformed counts only when requested
    counts = binned_counts
    if use_log:
        counts = np.log(binned_counts + 1)

    if ax is None:
        fig, ax = plt.subplots(1, 1)

    ax.pcolormesh(X, Y, counts.T, cmap=cmap)
    ax.set_aspect('equal')
    draw_court(ax)

    return ax
```

Create 3 side by side plots of `curry_binned_unsmoothed`, `curry_binned_smoothed1` and `curry_binned_smoothed5`.

```python
fig, ax = plt.subplots(1, 3, figsize=(20,60))

plot_shotchart(curry_binned_unsmoothed, xe, ye, ax=ax[0])
plot_shotchart(curry_binned_smoothed1, xe, ye, ax=ax[1])
plot_shotchart(curry_binned_smoothed5, xe, ye, ax=ax[2])

fig.show()
```

### Vectorize Shot Images

- Here we proceed to create a list of smoothed patterns, each vectorized into a 1-d array (like Lab 6)
- In this case, the object `all_smooth` is a list that consists of arrays of length 15750.
- Each entry in `all_smooth` represents the smoothed frequency of shots along the bins generated in the code above for a given player.
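The length 15750 follows from the bin grid: 151 x-edges and 106 y-edges give 150 × 105 bins. The sketch below, using a placeholder array rather than real shot counts, shows that flattening with `reshape(-1)` and reshaping back to `(nx, ny)` recovers the original 2d histogram.

```python
import numpy as np

nx, ny = 150, 105  # number of bins implied by 151 x-edges and 106 y-edges

# Placeholder "histogram" whose entries are just their own flat position
grid = np.arange(nx * ny, dtype=float).reshape(nx, ny)

vec = grid.reshape(-1)            # flatten row by row into a 1-d array
restored = vec.reshape((nx, ny))  # undo the flattening

print(vec.shape)
print(np.array_equal(grid, restored))
```

Since NumPy flattens in row-major order by default, entry `vec[i * ny + j]` corresponds to bin `(i, j)`, which is why the same `(nx, ny)` shape must be used when converting a vector back into a chart.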
```python
## number of bins is one less than number of edges (remember homework 1)
nx = xedges.size - 1
ny = yedges.size - 1

## 2d histogram containers for binned counts and smoothed binned counts
all_counts = []
all_smooth = []
pids = []

## data matrix: players (row) by vectorized 2-d court locations (column)
for i, one in enumerate(allshots.groupby('PLAYER_ID')):

    ## what does this line do?
    ## each groupby item is a (player id, data frame of that player's shots) pair
    pid, pdf = one

    num_shots = len(pdf.index)
    if num_shots > 100:

        tmp1, xedges, yedges = bin_shots(pdf, bin_edges=(xedges, yedges), density=True, sigma=2)
        tmp2, xedges, yedges = bin_shots(pdf, bin_edges=(xedges, yedges), density=False)

        ## vectorize and store into list
        all_smooth += [tmp1.reshape(-1)]
        all_counts += [tmp2.reshape(-1)]
        pids += [pid]

X = np.vstack(all_smooth).T
p, n = X.shape

print('Number of shot regions (p):', p)
print('Number of players (n):', n)
```

## Question 4: Non-negative Matrix Factorization (NMF)

The non-negative matrix factorization is a dimension reduction technique that is often applied to image data. It is similar to PCA except that it is only applicable to non-negative data. We can apply the NMF to vectorized versions of the shot surface. This is useful because we can convert the observed matrix of shot surfaces into:

* Bases: Identifying modes of shooting style (number of modes is determined by the `n_components` argument to the NMF function below)
* Coefficients: How each player's shooting style could be expressed as a (positive) linear combination of these bases

The NMF solves the following problem: given a $p\times n$ matrix $X$, NMF computes the following factorization:

$$\min_{W,H} \| X - WH \|_F \quad \text{subject to } W\geq 0,\ H\geq 0,$$

where $W$ is a ${p\times r}$ matrix and $H$ is an ${r\times n}$ matrix.

In this homework, we have the following:

#### The data matrix $X$

$X$ is a $p\times n$ matrix, where $n$ = {number of players} and $p$ = {number of total square bins on the court}.
Each column corresponds to a player, with entries corresponding to a "flattened" or "vectorized" version of the 2d histograms plotted in Question 3b.

#### Bases matrix: $W$

Columns $W_i$ contain the shot "bases". First, we will try it with $r=3$ bases in Question 4a, and then with $r=10$ bases in Question 4e.

#### Coefficient matrix: $H$

Each column of $H$ gives a coefficient for each of the basis vectors in $W$, and there are $n$ columns, one for each player.

The `sklearn` library is one of the main Python machine learning libraries. It has a built-in NMF function for us. The function below runs this function and normalizes the basis surfaces to sum to 1.

```python
## Non-negative Matrix Factorization
def non_negative_marix_decomp(n_components, array_data):
    import sklearn.decomposition as skld
    model = skld.NMF(n_components=n_components, init='nndsvda', max_iter=500, random_state=0)
    W = model.fit_transform(array_data)

    # Normalize basis vectors to sum to 1
    Wsum = W.sum(axis=0)
    W = W / Wsum

    ## fix H correspondingly, so that the product W @ H is unchanged
    H = model.components_
    H = (H.T * Wsum).T

    nmf = (W, H)
    return nmf
```

### Question 4a: Computing NMF Factorization

Compute the NMF on all players' shot charts, `X`, with `n_components = 3` (i.e. each shot chart can be represented as a positive linear combination of 3 "basis" shot charts). Fill in `plot_vectorized_shotchart`. This takes a vector of binned shot counts, converts it back to a matrix of the appropriate size and then calls `plot_shotchart` on the matrix.
The numpy function `reshape` will be useful here: [https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html](https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html)

```python
W3, H3 = non_negative_marix_decomp(3, X)
```

```python
grader.check("q4a")
```

### Question 4b: Visualizing Shot Types

Plot the first three basis images by calling `plot_vectorized_shotchart` below on the columns of `W3`.

```python
def plot_vectorized_shotchart(vec_counts, xedges, yedges, ax=None, use_log=False, cmap='Reds'):
    """Plots 2d heatmap from vectorized heatmap counts

    Args:
        vec_counts: vectorized output of numpy.histogram2d
        xedges, yedges: bin edges in arrays
        ax: figure axes [None]
        use_log: will convert count x to log(x+1) to increase visibility [False]
        cmap: Set the color map https://matplotlib.org/examples/color/colormaps_reference.html

    Returns:
        ax: axes with plot
    """
    nx = xedges.size - 1
    ny = yedges.size - 1

    # use reshape to convert a vectorized counts back into a 2d histogram
    two_d_counts = vec_counts.reshape((nx, ny))

    return plot_shotchart(two_d_counts, xedges, yedges, ax=ax, use_log=use_log, cmap=cmap)

fig, ax = plt.subplots(1, 3, figsize=(20,60))

## Write a for loop
for i in range(3):
    # Call plot_vectorized_shotchart
    plot_vectorized_shotchart(W3[:,i], xedges, yedges, ax=ax[i])
    ax[i].set_title('Shot Basis %i' % (i+1))
```

### Question 4c: Reconstruction Error

Below we reconstruct the shooting pattern for a single player. By "reconstructing" we mean using the approximation $$\hat{X} = WH$$ obtained via NMF. Find $\hat X$ by multiplying W and H. In Python the `@` symbol is used for matrix multiplication.

```python
X3_hat = W3 @ H3
```

Plot $X$, $\hat X$ and the residual ($X - \hat X$) for the player named LaMarcus Aldridge. Remember, each column of $X$ is a vectorized matrix corresponding to the binned (or smoothed binned) shot information.
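As a small numerical check of the reconstruction $\hat{X} = WH$, and of why normalizing the columns of $W$ does not change it, here is a sketch on toy matrices. The sizes and the random non-negative `W` and `H` are assumptions for illustration only; they stand in for the output of an NMF fit rather than being produced by one.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, r = 6, 4, 2  # toy sizes, chosen only for illustration

W = rng.uniform(0.1, 1.0, size=(p, r))  # stand-in for NMF bases
H = rng.uniform(0.1, 1.0, size=(r, n))  # stand-in for NMF coefficients

# Rescale so each basis (column of W) sums to 1, and compensate in H
Wsum = W.sum(axis=0)
W_norm = W / Wsum
H_scaled = (H.T * Wsum).T

X_hat = W @ H
X_hat_norm = W_norm @ H_scaled

# A "data" matrix that the factorization only approximates
X = X_hat + 0.01 * rng.uniform(size=(p, n))
residual = X - X_hat_norm

print(np.allclose(X_hat, X_hat_norm))
print(W_norm.sum(axis=0))
print(np.abs(residual).max())
```

Dividing column $k$ of $W$ by a constant and multiplying row $k$ of $H$ by the same constant leaves every term of the product unchanged, so the normalized bases can be read as densities while the coefficients absorb the scale.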
```python
# Find the player_id of LaMarcus Aldridge
player_id = int(allplayers.query('DISPLAY_FIRST_LAST == "LaMarcus Aldridge"').index.values[0])

## find index in X corresponding to that player
to_plot_idx = pids.index(player_id)

fig, ax = plt.subplots(1, 3, figsize=(20,60))

## Call plot_vectorized_shotchart
original_shotchart = plot_vectorized_shotchart(X[:, to_plot_idx], xedges, yedges, ax=ax[0])
reconstructed_shotchart = plot_vectorized_shotchart(X3_hat[:, to_plot_idx], xedges, yedges, ax=ax[1])
residual_chart = plot_vectorized_shotchart(X[:, to_plot_idx] - X3_hat[:, to_plot_idx], xedges, yedges, ax=ax[2])

ax[0].set_title('Original Shooting Pattern')
ax[1].set_title('Reconstructed Shooting pattern (r=3)')
ax[2].set_title('Residual Shooting Pattern (r=3)')

fig.show()
```

### Question 4d: Choice of Colormap

Why does it make sense to use a _sequential_ palette for the original and reconstructed shot charts and a _diverging_ palette for the residual? _Hint:_ Read the introduction to colormaps [here](https://matplotlib.org/users/colormaps.html).

It makes sense to use a sequential palette for the original and reconstructed shot charts because they show non-negative shot intensities, and a sequential palette maps increasing intensity to increasing saturation. A diverging palette suits the residual because the residual can be positive or negative: two different colors show the two signs and meet in the middle at an unsaturated color, which marks values near zero.

What areas of the court does this player shoot more from, and where less, relative to the reconstructed area? If it's helpful, you can refer to court locations by name using this legend [here](https://en.wikipedia.org/wiki/Basketball_court#/media/File:Basketball_terms.png]).
By looking at the residual graph, it can be seen that the player shoots more from the left side, between the arc and the lane line.

### Question 4e: More Detailed Modeling

Re-run the analysis, this time for 10 basis vectors instead of 3. Again plot the bases using `plot_vectorized_shotchart` on the columns of `W10`.

**Hint**: Study the following code

```python
fig, ax = plt.subplots(2, 5, figsize=(20, 7))
ax = ax.flatten()  # turn ax into a flat array
ax[0].set_title('hello')
ax[9].set_title('there')
fig.show()
```

```python
W10, H10 = non_negative_marix_decomp(10, X)

fig, ax = plt.subplots(2, 5, figsize=(20, 7))

## Write a for loop
for i in range(10):
    plot_vectorized_shotchart(W10[:,i], xedges, yedges, ax=ax[i//5, i%5])
    ax[i//5, i%5].set_title('Shot Basis %i' % (i+1))
```

If you did things correctly, you should be really impressed! We've identified potentially interesting patterns of shooting styles without actually specifying anything about the way basketball is played or where the relevant lines are on the court. The resulting images are based only on the actual behavior of the players. Even more impressive is that we're capturing similarity in regions that are far apart on the court. One reason we can do this is that a basketball court is symmetric along the length of the court (i.e. symmetric about x=0). However, people tend to be left or right hand dominant, which might affect their preferences. Look carefully at the shot basis plots above: is there any evidence of _asymmetry_ in player shooting behavior? Refer to specific basis images in your answer.

This player does not seem to have a preference for left or right.
We can see similarities on both sides, and the highest number of shots comes from the center line.

Repeat Question 4c, and again plot original, reconstructed and residual shot charts for LaMarcus Aldridge.

```python
X10_hat = W10 @ H10

fig, ax = plt.subplots(1, 3, figsize=(20,60))

# Reuse to_plot_idx, the column of X found above for LaMarcus Aldridge
original_shotchart = plot_vectorized_shotchart(X[:, to_plot_idx], xedges, yedges, ax=ax[0])
reconstructed_shotchart = plot_vectorized_shotchart(X10_hat[:, to_plot_idx], xedges, yedges, ax=ax[1])
residual_chart = plot_vectorized_shotchart(X[:, to_plot_idx] - X10_hat[:, to_plot_idx], xedges, yedges, ax=ax[2])

ax[0].set_title('Original Shooting Pattern')
ax[1].set_title('Reconstructed Shooting pattern (r=10)')
ax[2].set_title('Residual Shooting Pattern (r=10)');
```

### Question 4f: Comparing Players

With the `H10` matrix, it is possible to compare any pair of players. For all pairs of players $i$ and $j$, compare using the Euclidean distance between their coefficients:

$$\text{player-distance}(i,j) = \| H_i - H_j \|_2 = \left(\sum_{k=1}^{10} (H_{ki} - H_{kj})^2 \right)^{1/2}$$

Create a heatmap of the pairwise player distance matrix. Find the two pairs of players with smallest distances.
Also, find two pairs of players with largest distances.

```python
H10.shape
```

```python
player_distance = np.zeros((len(pids), len(pids)))
for i in range(len(pids)):
    for j in range(len(pids)):
        if i != j:
            player_distance[i,j] += np.sum(np.power(H10[:,i] - H10[:,j], 2))
player_distance = np.sqrt(player_distance)

# Plot heatmap
plt.figure(figsize=(10,10))
sns.heatmap(player_distance)

# Mask the zero entries (the diagonal) so a player is not matched with themselves
player_distance[player_distance <= 0] = np.inf
min_dist = np.min(player_distance)

# The matrix is symmetric, so the minimum appears at (a, b) and (b, a);
# the row indices of those two entries identify the two players
min_dist_players = np.where(player_distance == min_dist)[0]
player_a = pids[min_dist_players[0]]
player_b = pids[min_dist_players[1]]
print("The minimum distance is {0:.2f} and it is between players with id {1} and {2}".format(min_dist, player_a, player_b))
```

We can see that the minimum distance between two different players is 0.05 and it is between players with id 203468 and 1627750. These distances were obtained by comparing the columns of the matrix `H10` for each pair of distinct player IDs.

### Question 4g: Residuals

The residual between `Xhat` and `X` gives a sense of how well a player is described by the NMF-computed matrices W and H. Calculate RMSE for each player, and plot the histogram. Comment on this distribution and players with smallest and largest RMSEs (use 10 components).

```python
# Calculate the residual norm for each player (one column of X per player).
# Note: this sums the squared residuals over bins rather than averaging them,
# so it equals the RMSE multiplied by sqrt(p).
RMSE = np.sqrt(np.sum(np.power(X - X10_hat, 2), axis=0))

# Histogram
plt.figure()
plt.hist(RMSE, bins=10)
plt.show()
```

We can see that most of the players have an error of approximately 0.006 (0.6%), while a few have a lower error. It is also interesting to note that even fewer players have a high error.

### Question 4h: Proposing improvements

One of the main purposes of exploratory data analysis is to generate new ideas, directions, and hypotheses for future analyses and experiments. Take two players of your choice and compare their shooting patterns with various visualizations.
State any insights and defend your conclusions with visual and/or numerical comparisons.

```python
# Take data for Bradley Beal
query_str = 'DISPLAY_FIRST_LAST == "Bradley Beal"'
beal_id = str(allplayers.query(query_str).index.values[0])
beal_data = allshots.query('PLAYER_ID == ' + beal_id).astype({'SHOT_MADE_FLAG': bool})

# Show shooting pattern on same graph
plt.figure(figsize=(12, 11))
# Missed shots (False) get an 'X' marker, made shots (True) a circle
markers = {False: "X", True: "o"}
ax = sns.scatterplot(x=curry_data['LOC_X'], y=curry_data['LOC_Y'],
                     style=curry_data['SHOT_MADE_FLAG'], markers=markers, color='blue')
ax = sns.scatterplot(x=beal_data['LOC_X'], y=beal_data['LOC_Y'],
                     style=beal_data['SHOT_MADE_FLAG'], markers=markers, color='red', ax=ax)
ax.set(xlim=(-300, 300), ylim=(-150, 500))
draw_court(outer_lines=True)
plt.show()
```

We can see the density of shots by Stephen Curry and Bradley Beal. Both players have similar shot patterns, but Stephen Curry has taken more long-range shots than Beal.

---

To double-check your work, the cell below will rerun all of the autograder tests.

```python
grader.check_all()
```

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

```python
# Save your notebook first, then run this cell to export your submission.
grader.export()
```