Program to Implement Data Management Assignment Solution.

Instructions

Objective
Write a Python assignment where you need to create a program to implement data management in Python. This assignment will require you to develop a program that effectively handles various data manipulation tasks using Python's built-in data structures and functions. You'll be working with concepts like data input, storage, retrieval, and possibly even data analysis. This assignment will provide you with valuable hands-on experience in applying Python programming skills to practical data management scenarios.
Requirements and Specifications

Source Code
!pip install otter-grader
# Initialize Otter
import otter
grader = otter.Notebook("assignment3.ipynb")
--
# Assignment 3: Exploratory Data Analysis in Professional Basketball
In this assignment we'll conduct an exploratory data analysis of professional basketball data. Basketball is a team sport in which the goal is to outscore the opponent in a fixed amount of time. Points are scored (either 2 or 3 points) by putting the ball through a hoop on one end of the court. An attempt at putting the ball through the hoop is known as a "shot". If helpful, you can read more about [the rules of basketball](https://en.wikipedia.org/wiki/Rules_of_basketball).
The National Basketball Association (NBA) is the professional basketball league in the United States and provides a nice website with many statistics gathered on teams and players in the league: [http://stat.nba.com](http://stat.nba.com).
## Question 1: Managing data files
We will use data that is available from the NBA. Although the NBA doesn't officially make its data API (application programming interface) public, people have figured out ways to access the data programmatically ([1](https://datavizardry.com/2020/01/28/nba-shot-charts-part-1/), [2](https://github.com/swar/nba_api), [3](https://github.com/seemethere/nba_py/wiki/stats.nba.com-Endpoint-Documentation)). While these approaches will work when Python is installed and running on your computer, the NBA seems to block (pun intended) connections from Google Cloud, where our course JupyterHub is running.
Therefore, in this assignment, the raw data downloads are provided to you in a zip file: https://ucsb.box.com/shared/static/z6y3etgikbzbnf0ld4brvc95xtgjcrie.zip
Download and unzip the file to a directory named `data` using command line commands (unzipping on Windows and Mac may not work because different operating systems have different constraints on filename lengths, etc.). Adding an exclamation point at the start of a line in a Jupyter notebook cell indicates that the `bash` shell interpreter will execute your command.
```bash
wget -nc https://ucsb.box.com/shared/static/z6y3etgikbzbnf0ld4brvc95xtgjcrie.zip -O nba-data.zip
unzip -o nba-data.zip -d data
```
What these commands are doing:
* `wget` downloads files ([what do each of the pieces do?](https://explainshell.com/explain?cmd=wget+-nc+https%3A%2F%2Fucsb.box.com%2Fshared%2Fstatic%2F940qiuxyp798gv4nx8iwvazu9qqjs37e.zip+-O+nba-data.zip))
* `unzip` will unzip `nba-data.zip` into a directory named `data` (specified by `-d data`) and will overwrite any files with the same names when extracting (specified by `-o`).
The following screencast videos show the terminal versus Jupyter notebook's `!` way of running command line commands.
# Run your commands in this cell
!wget -nc https://ucsb.box.com/shared/static/z6y3etgikbzbnf0ld4brvc95xtgjcrie.zip -O nba-data.zip
!unzip -o nba-data.zip -d data
After unzipping the files, you will find three types of files in `data/` directory:
* Team data: `commonTeamYears?LeagueID=00&Season=2018-19`
* Player data: `commonallplayers?LeagueID=00&Season=2018-19&IsOnlyCurrentSeason=0`
* Player's shot data: `shotchartdetail?PlayerID=[PlayerID]&PlayerPosition=&Season=2018-19&ContextMeasure=FGA&DateFrom=&DateTo=&GameID=&GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&Period=0&Position=&RookieYear=&SeasonSegment=&`
Each player's shot data is identified by replacing `[PlayerID]` with their numeric ID.
Here is how we will read in the data:
* Each data file contains text in [JSON (JavaScript Object Notation) format](https://www.w3schools.com/python/python_json.asp).
* First, read the data content as text (using [Path.read_text()](https://docs.python.org/3/library/pathlib.html#pathlib.Path.read_text) from `pathlib` module)
* Second, we convert it to a Python dictionary format (using [json.loads()](https://docs.python.org/3/library/json.html#json.loads) in `json` module)
* Third, identify DataFrame content
* Fourth, identify DataFrame header
* Fifth, assemble DataFrame
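The steps above can be sketched on a small in-memory example. This is only an illustration: the JSON string below is made up (two fake teams), only the `resultSets` / `headers` / `rowSet` layout is taken from the files described in this assignment, and `result_set_to_frame` is a hypothetical helper name, not part of the assignment.

```python
import json
import pandas as pd

# Made-up payload that only mimics the layout of the NBA stats files.
raw_text = json.dumps({
    "resultSets": [{
        "headers": ["TEAM_ID", "ABBREVIATION"],
        "rowSet": [[1, "AAA"], [2, None]],
    }]
})

def result_set_to_frame(text, which=0):
    """Hypothetical helper: turn one result set of a JSON string into a DataFrame."""
    parsed = json.loads(text)                   # str -> dict
    result = parsed["resultSets"][which]
    rows = result["rowSet"]                     # DataFrame content
    header = result["headers"]                  # DataFrame header
    return pd.DataFrame(rows, columns=header)   # assemble

toy_teams = result_set_to_frame(raw_text)
print(toy_teams.shape)
```

With the real files, the text would come from `Path(...).read_text()` instead of the made-up `raw_text`.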
### Question 1a: Team Data
Read team data file into a pandas data frame named `allteams` starting from the given code below.
from pathlib import Path
import json
import pandas as pd
import numpy as np
fname = 'data/commonTeamYears?LeagueID=00&Season=2018-19' # directory_name/file_name
step_1 = Path(fname).read_text() # str
step_2 = json.loads(step_1) # dict
step_3 = step_2['resultSets'][0]['rowSet'] # list
step_4 = step_2['resultSets'][0]['headers'] # list
# print out each of step_1 through step_4 and understand what each line does
print(step_1) # Raw text read from the file (str)
print(step_2) # The same content parsed from JSON text into a Python dict
print(step_3) # 'rowSet' of the first entry in 'resultSets': the table rows
print(step_4) # 'headers' of the first entry in 'resultSets': the column names
Use variables constructed above to assemble `allteams` DataFrame.
Drop any teams that no longer exist as of 2019. These teams show None in `ABBREVIATION` column.
allteams = pd.DataFrame(data = step_3, columns = step_4)
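# Drop only the rows whose ABBREVIATION is missing (teams that no longer exist)
allteams = allteams.dropna(subset = ['ABBREVIATION'])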
allteams.head()
grader.check("q1a")
### Question 1b: Player Data
`pathlib` has flexible ways to specify file and directory paths. For example, the following are equivalent:
* `Path('data/commonallplayers?LeagueID=00&Season=2018-19&IsOnlyCurrentSeason=0')`
* `Path('data') / 'commonallplayers?LeagueID=00&Season=2018-19&IsOnlyCurrentSeason=0'`
* `Path('data').joinpath('commonallplayers?LeagueID=00&Season=2018-19&IsOnlyCurrentSeason=0')`
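As a quick check of the equivalence claimed above, the three spellings can be compared directly. This is a minimal sketch; the variable names `p1`, `p2`, `p3` are ours, and nothing here touches the disk.

```python
from pathlib import Path

name = 'commonallplayers?LeagueID=00&Season=2018-19&IsOnlyCurrentSeason=0'

p1 = Path('data/' + name)          # one string with a separator
p2 = Path('data') / name           # the / operator
p3 = Path('data').joinpath(name)   # the joinpath method

# All three build the same path object.
print(p1 == p2 == p3)
print(p1.name == name, p1.parent == Path('data'))
```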
Read players data file with name `data/commonallplayers?LeagueID=00&Season=2018-19&IsOnlyCurrentSeason=0`.
Assemble pandas DataFrame with name `allplayers`. Set row index to be `PERSON_ID` and `sort_index`.
dirname = 'data' # directory_name
filename = 'commonallplayers?LeagueID=00&Season=2018-19&IsOnlyCurrentSeason=0' # file_name
step_1 = Path(dirname).joinpath(filename).read_text() # str
step_2 = json.loads(step_1) # dict
step_3 = step_2['resultSets'][0]['rowSet'] # list
step_4 = step_2['resultSets'][0]['headers'] # list
allplayers = pd.DataFrame(data = step_3, columns = step_4)
allplayers = allplayers.set_index('PERSON_ID')
allplayers = allplayers.sort_index()
allplayers.head()
grader.check("q1b")
### Question 1c: Shots Data
`pathlib` can also find all filenames that match a given pattern using [`Path.glob()` method](https://docs.python.org/3/library/pathlib.html#pathlib.Path.glob).
For example, teams data and players data start with the pattern `common` followed by a wildcard `*`: `common*`.
We can use this to retrieve two file names with one call:
two_files = Path('data').glob('common*') # generator: https://www.educative.io/edpresso/generator-vs-iterator-in-python
list(two_files) # list
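To see how `glob` behaves without the real data, here is a small sketch in a temporary directory. The file names are made up and only mimic the prefixes used in this assignment. It also shows why the result is wrapped in `list(...)`: the generator can be consumed only once.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    # Made-up file names, for illustration only.
    for fake_name in ['commonA', 'commonB', 'shotchartdetail_1',
                      'shotchartdetail_2', 'shotchartdetail_3']:
        (data_dir / fake_name).write_text('{}')

    matches = data_dir.glob('common*')                 # lazy generator
    common_names = sorted(p.name for p in matches)     # first pass consumes it
    leftover = list(matches)                           # second pass finds nothing
    shot_names = sorted(p.name for p in data_dir.glob('shotchartdetail*'))

print(common_names, leftover, shot_names)
```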
All file names for shots data start with `shotchartdetail`.
Use this as the pattern to
* First, read all file names into `allshots_files`
* Second, loop over each file in `allshots_files` and assemble a dataframe
* Third, add each data frame as an element of a list named `allshots_list`.
* Fourth, concatenate all dataframes into one dataframe named `allshots`. Set the row index to be `PLAYER_ID` and `sort_index`.
allshots_files = list(Path('data').glob('shotchartdetail*'))
allshots_files.sort()
allshots_list = list()
for f in allshots_files:
    # Read one file and assemble its data frame
    step_1 = Path(f).read_text() # str
    step_2 = json.loads(step_1) # dict
    step_3 = step_2['resultSets'][0]['rowSet'] # list
    step_4 = step_2['resultSets'][0]['headers'] # list
    # Use a separate name so the per-file frame is not confused with the final `allshots`
    shots_df = pd.DataFrame(data = step_3, columns = step_4)
    allshots_list.append(shots_df)
# Now concatenate
allshots = pd.concat(allshots_list)
allshots = allshots.set_index('PLAYER_ID')
allshots = allshots.sort_index()
allshots.head()
grader.check("q1c")
### Question 1d: Extract Stephen Curry's Shot Data
Use [`allplayers.query()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) to find the player id (index) associated with the player named "[Stephen Curry](https://en.wikipedia.org/wiki/Stephen_Curry)". Set the value of `PlayerID` as `curry_id` of type `str`.
Subset all of Stephen Curry's shots in a data frame named `curry_data`. Also, set the dtype of `SHOT_MADE_FLAG` to `'bool'` in one command. Something like:
```
curry_data = allshots.query(???).astype(????)
```
allplayers.head()
# fill-in all ...
query_str = 'DISPLAY_FIRST_LAST == "Stephen Curry"'
curry_id = str(allplayers.query(query_str).index.values[0])
curry_data = allshots.query('PLAYER_ID == ' + curry_id).astype({'SHOT_MADE_FLAG':bool})
grader.check("q1d")
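The `query(...).astype(...)` pattern used above can be tried on a toy table. This is a sketch with made-up shots and made-up player ids; it assumes, as in this assignment, that `PLAYER_ID` is the (named) row index, which `query` can refer to just like a column.

```python
import pandas as pd

# Made-up shots for two fake player ids.
toy_shots = pd.DataFrame({
    'PLAYER_ID': [10, 10, 20],
    'SHOT_MADE_FLAG': [1, 0, 1],
    'LOC_X': [5, -30, 120],
}).set_index('PLAYER_ID')

toy_id = str(10)  # kept as a str to mirror `curry_id`

# Filter on the named index, then convert a single column with a dict.
one_player = toy_shots.query('PLAYER_ID == ' + toy_id).astype({'SHOT_MADE_FLAG': bool})
print(one_player.dtypes)
```

Passing a dict to `astype` converts only the listed column, so `LOC_X` keeps its integer dtype.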
## Question 2: Visualization
### Question 2a: All Shots Scatter Plot
Use `seaborn` to create a scatter plot of the location of Stephen Curry's shot attempts from this year (`LOC_X` and `LOC_Y`). When you call a scatterplot, seaborn returns a matplotlib `Axes` object, which we'll call `ax`. We can set properties of the plot by calling methods on `ax`. Use this approach to set the x-axis limits to span (-300, 300) and the y-axis limits to span (-100, 500).
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=[12, 11])
ax2a = sns.scatterplot(x=curry_data['LOC_X'], y = curry_data['LOC_Y'])
# Set x/y limits and labels
ax2a.set(xlim=(-300,300),ylim=(-100,500))
plt.show()
grader.check("q2a")
Understanding any dataset is difficult without context. Let's add some important context by adding the relevant court lines to our diagram. If you are interested, you can read more about the lines and dimensions on the [NBA basketball court](https://en.wikipedia.org/wiki/Basketball_court). We will use code from [http://savvastjortjoglou.com/nba-shot-sharts.html](http://savvastjortjoglou.com/nba-shot-sharts.html) to add the court markings to our diagram. The `draw_court` function below will do this for us. The cell below will generate an example court.
## code is from http://savvastjortjoglou.com/nba-shot-sharts.html
def draw_court(ax=None, color='black', lw=1, outer_lines=False):
    from matplotlib.patches import Circle, Rectangle, Arc
    from matplotlib.pyplot import gca
    # If an axes object isn't provided to plot onto, just get current one
    if ax is None:
        ax = gca()
    # Create the various parts of an NBA basketball court
    # Create the basketball hoop
    # Diameter of a hoop is 18" so it has a radius of 9", which is a value
    # 7.5 in our coordinate system
    hoop = Circle((0, 0), radius=7.5, linewidth=lw, color=color, fill=False)
    # Create backboard
    backboard = Rectangle((-30, -7.5), 60, 0, linewidth=lw, color=color)
    # The paint
    # Create the outer box of the paint, width=16ft, height=19ft
    outer_box = Rectangle((-80, -47.5), 160, 190, linewidth=lw, color=color,
                          fill=False)
    # Create the inner box of the paint, width=12ft, height=19ft
    inner_box = Rectangle((-60, -47.5), 120, 190, linewidth=lw, color=color,
                          fill=False)
    # Create free throw top arc
    top_free_throw = Arc((0, 142.5), 120, 120, theta1=0, theta2=180,
                         linewidth=lw, color=color, fill=False)
    # Create free throw bottom arc
    bottom_free_throw = Arc((0, 142.5), 120, 120, theta1=180, theta2=0,
                            linewidth=lw, color=color, linestyle='dashed')
    # Restricted Zone, it is an arc with 4ft radius from center of the hoop
    restricted = Arc((0, 0), 80, 80, theta1=0, theta2=180, linewidth=lw,
                     color=color)
    # Three point line
    # Create the side 3pt lines, they are 14ft long before they begin to arc
    corner_three_a = Rectangle((-219, -47.5), 0, 140, linewidth=lw,
                               color=color)
    corner_three_b = Rectangle((219, -47.5), 0, 140, linewidth=lw, color=color)
    # 3pt arc - center of arc will be the hoop, arc is 23'9" away from hoop
    # I just played around with the theta values until they lined up with the
    # threes
    three_arc = Arc((0, 0), 475, 475, theta1=22.5, theta2=157.5, linewidth=lw,
                    color=color)
    # Center Court
    center_outer_arc = Arc((0, 422.5), 120, 120, theta1=180, theta2=0,
                           linewidth=lw, color=color)
    center_inner_arc = Arc((0, 422.5), 40, 40, theta1=180, theta2=0,
                           linewidth=lw, color=color)
    # List of the court elements to be plotted onto the axes
    court_elements = [hoop, backboard, outer_box, inner_box, top_free_throw,
                      bottom_free_throw, restricted, corner_three_a,
                      corner_three_b, three_arc, center_outer_arc,
                      center_inner_arc]
    if outer_lines:
        # Draw the half court line, baseline and side out bound lines
        outer_lines = Rectangle((-250, -47.5), 500, 470, linewidth=lw,
                                color=color, fill=False)
        court_elements.append(outer_lines)
    # Add the court elements onto the axes
    for element in court_elements:
        ax.add_patch(element)
    return ax
plt.figure(figsize=(12,11))
draw_court(outer_lines=True)
plt.xlim(-300,300)
plt.ylim(-100,500)
plt.show()
### Question 2b: All Shots Scatter Plot + Court Outline
Again use seaborn to make a scatter plot of Stephen Curry's shots. Again, set the x-axis limits to span (-300, 300) and the y-axis limits to span (-100, 500), and color the points by whether the shot was made or missed. Set the missed shots to have an 'x' symbol and made shots to be a circular symbol. Call the `draw_court` function with `outer_lines` set to `True`. Save the `Axes` returned by the plot call in a variable called `ax`.
plt.figure(figsize=(12, 11))
# Made shots are drawn as circles, missed shots as X symbols
markers = {True : "o", False : "X"}
# hue colors the points and style picks the marker, both by SHOT_MADE_FLAG
ax = sns.scatterplot(data = curry_data, x = 'LOC_X', y = 'LOC_Y',
                     hue = 'SHOT_MADE_FLAG', style = 'SHOT_MADE_FLAG', markers = markers)
ax.set(xlim = (-300,300), ylim = (-100,500))
draw_court(ax, outer_lines = True)
plt.show()
### Question 2c: Analyzing the Visualization
In a few sentences, discuss what makes this an effective or ineffective visualization for understanding the types of shots that Stephen Curry likes to take and is good at taking, relative to other players in the league. Are there ways it can be improved?
The previous graph shows precisely where Stephen Curry took his shots and which of them were made or missed. This makes it possible to identify the areas where most made and missed shots are concentrated, and therefore the court positions from which he is most likely to score.
### Question 2d: A Hexbin plot
Visualize Stephen Curry's shots by using a [hexbin plot with marginal histograms](https://seaborn.pydata.org/examples/hexbin_marginals.html). Also refer to setting [figure aesthetics](https://seaborn.pydata.org/tutorial/aesthetics.html) for what commands below do.
sns.set_style("white")
# kind='hex' requests the hexbin joint plot; the default would be a scatter plot
joint_shot_chart = sns.jointplot(x=curry_data['LOC_X'], y = curry_data['LOC_Y'], kind='hex')
joint_shot_chart.fig.set_size_inches(12,11)
# A joint plot has 3 Axes, the first one called ax_joint
# is the one we want to draw our court onto and adjust some other settings
ax = joint_shot_chart.ax_joint
draw_court(ax, outer_lines=True)
# Adjust the axis limits and orientation of the plot in order
# to plot half court, with the hoop by the top of the plot
ax.set_xlim(-300, 300)
ax.set_ylim(500, -100)
# Get rid of axis labels and tick marks
ax.set_xlabel('')
ax.set_ylabel('')
ax.tick_params(labelbottom=False, labelleft=False)
# Add a title
ax.set_title('Stephen Curry, 2018-19, FGA',
 y=1.2, fontsize=10)
# Add data source and author
ax.text(-250,445,'\n This plot is based on code by Savvas Tjortjoglou (savvastjortjoglou.com)',
 fontsize=12);
## Question 3: Binning and Smoothing Shots
So far, we have worked with dataframes which represent each shot as a single observation (row) within the dataset. However, this isn't a convenient data structure for the kinds of spatial analyses we will pursue below.
In this part, we will divide the court into square regions and create a matrix which includes the number of shots taken by a player in that region. We divide the court up into square bins (i.e. a 2d histogram) and, for each player, count number of shots that fall into each bin. Fortunately, this function is relatively simple to write using `numpy` module.
### Question 3a: 2D Smoothing
Fill in the `bin_shots` function below. Use `np.histogram2d` to count the shots in each bin. The bins are defined by `bin_edges`, which is a pair of the form `(xedges, yedges)`. If `density = True`, call `ndimage.filters.gaussian_filter` on the result of `np.histogram2d` with smoothing parameter `sigma`. This will create a smoothed version of the raw data histograms.
def bin_shots(df, bin_edges, density=False, sigma=1):
    """Given a data frame of shots, compute a 2d matrix of binned counts.
    Args:
        df: data frame of shotchartdetail from nba.com.
            At the minimum, variables named LOC_X and LOC_Y are required.
        bin_edges: bin edge definition: edges in x and edges in y
        density: if True, smooth the counts and normalize them to sum to 1 [False]
        sigma: smoothing parameter passed to the Gaussian filter [1]
    Returns:
        binned: counts
        xedges: bin edges in X direction
        yedges: bin edges in Y direction
    """
    import numpy as np
    from scipy import ndimage
    ## Call np.histogram2d
    binned, xedges, yedges = np.histogram2d(df['LOC_X'], df['LOC_Y'], bins = bin_edges)
    if density:
        # Recompute 'binned' using "gaussian_filter"
        # (ndimage.gaussian_filter is the same function as ndimage.filters.gaussian_filter;
        # the `filters` namespace is deprecated in recent SciPy releases)
        binned = ndimage.gaussian_filter(binned, sigma = sigma)
        # Normalize the histogram to be a "density", e.g. mass across all bins sums to 1.
        binned /= np.sum(binned)
    return(binned, xedges, yedges)
grader.check("q3a")
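To make the binning step concrete, here is a minimal sketch of `np.histogram2d` on three made-up shot locations with a deliberately coarse grid. It leaves out the Gaussian smoothing and only shows the shape convention and the normalization to a density. All `toy_*` names and values are ours.

```python
import numpy as np

# Made-up shot locations: three shots, two of them in the same bin.
loc_x = np.array([-250.0, -250.0, 100.0])
loc_y = np.array([0.0, 0.0, 300.0])

toy_xedges = np.linspace(-300, 300, 4)   # 4 edges -> 3 bins in x
toy_yedges = np.linspace(-48, 372, 3)    # 3 edges -> 2 bins in y

counts, xe_out, ye_out = np.histogram2d(loc_x, loc_y, bins=(toy_xedges, toy_yedges))

print(counts.shape)              # first axis follows x, second follows y
density = counts / counts.sum()  # mass across all bins sums to 1
print(density.sum())
```

Because the first axis of the result follows x, plotting code that expects rows to follow y (such as `pcolormesh`) needs the transpose of this matrix.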
### Question 3b: Visualize the binning on `curry_data`
Call `bin_shots` on `curry_data` to create a binned but unsmoothed matrix of shot counts (call this `curry_binned_unsmoothed`), a binned and smoothed matrix of counts with `sigma=1` (call this `curry_binned_smoothed1`) and one with `sigma=5` (call this `curry_binned_smoothed5`). Use the bin edges defined below:
## bin edge definitions in inches
xedges = np.linspace(start=-300, stop=300, num=151)
yedges = np.linspace(start=-48, stop=372, num=106)
bin_edges = (xedges, yedges)
curry_binned_unsmoothed, xe, ye = bin_shots(curry_data, (xedges, yedges), False)
curry_binned_smoothed1, xe, ye = bin_shots(curry_data, (xedges, yedges), True, 1)
curry_binned_smoothed5, xe, ye = bin_shots(curry_data, (xedges, yedges), True, 5)
...
The function below can be used to visualize the shots as a heatmap:
def plot_shotchart(binned_counts, xedges, yedges, ax=None, use_log=False, cmap = 'Reds'):
    """Plots 2d heatmap from binned counts
    Args:
        binned_counts: 2d output of numpy.histogram2d
        xedges, yedges: bin edges in arrays
        ax: figure axes [None]
        use_log: will convert count x to log(x+1) to increase visibility [False]
        cmap: Set the color map https://matplotlib.org/examples/color/colormaps_reference.html
    Returns:
        ax: axes with plot
    """
    import numpy as np
    import matplotlib.pyplot as plt
    ## number of x and y bins.
    nx = xedges.size - 1
    ny = yedges.size - 1
    X, Y = np.meshgrid(xedges, yedges)
    # Plot log(x+1) when requested, otherwise the counts themselves
    if use_log:
        counts = np.log(binned_counts + 1)
    else:
        counts = binned_counts
    if ax is None:
        fig, ax = plt.subplots(1,1)
    # histogram2d puts x on the first axis, so transpose before plotting
    ax.pcolormesh(X, Y, counts.T, cmap=cmap)
    ax.set_aspect('equal')
    draw_court(ax)
    return(ax)
Create 3 side by side plots of `curry_binned_unsmoothed`, `curry_binned_smoothed1` and `curry_binned_smoothed5`
fig, ax = plt.subplots(1, 3, figsize=(20,60))
plot_shotchart(curry_binned_unsmoothed, xe, ye, ax = ax[0])
plot_shotchart(curry_binned_smoothed1, xe, ye, ax = ax[1])
plot_shotchart(curry_binned_smoothed5, xe, ye, ax = ax[2])
...
fig.show()
### Vectorize Shot Images
- Here we create a list of smoothed patterns, each vectorized into a 1-d array (like Lab 6)
- In this case, the object `all_smooth` is a list of arrays, each of length `15750`.
- Each entry in `all_smooth` represents the smoothed frequency of shots along the bins generated in the code above for a given player.
## number of bins is one less than number of edges (remember homework 1)
nx = xedges.size - 1
ny = yedges.size - 1
## 2d histogram containers for binned counts and smoothed binned counts
all_counts = []
all_smooth = []
pids = []
## 2d histogram containers for binned counts and smoothed binned counts
## data matrix: players (row) by vectorized 2-d court locations (column)
for i, one in enumerate(allshots.groupby('PLAYER_ID')):
    ## unpack the (player id, data frame of that player's shots) pair produced by groupby
    pid, pdf = one
    num_shots = len(pdf.index)
    if(num_shots > 100):
        tmp1, xedges, yedges = bin_shots(pdf, bin_edges=(xedges, yedges), density=True, sigma=2)
        tmp2, xedges, yedges = bin_shots(pdf, bin_edges=(xedges, yedges), density=False)
        ## vectorize and store into list
        all_smooth += [tmp1.reshape(-1)]
        all_counts += [tmp2.reshape(-1)]
        pids += [pid]
X = np.vstack(all_smooth).T
p, n = X.shape
print('Number of shot regions (p):', p)
print('Number of players (n):', n)
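The vectorization used above is just a reshape, and it can be undone. The sketch below uses a tiny made-up 3 x 2 grid in place of the real 150 x 105 one; the `toy_*` names are ours.

```python
import numpy as np

toy_nx, toy_ny = 3, 2                      # stand-ins for nx = 150 and ny = 105
grid = np.arange(toy_nx * toy_ny).reshape((toy_nx, toy_ny))

vec = grid.reshape(-1)                     # flatten row by row (C order)
restored = vec.reshape((toy_nx, toy_ny))   # undo the flattening

# Stack two fake "players" as columns, as in X = np.vstack(all_smooth).T
toy_X = np.vstack([vec, vec * 2]).T
print(toy_X.shape)                         # (bins, players)
print((151 - 1) * (106 - 1))               # number of bins for the real edges
```

The last line reproduces the length `15750` quoted above from the 151 x-edges and 106 y-edges.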
## Question 4: Non-negative Matrix Factorization (NMF)
Non-negative matrix factorization is a dimension reduction technique that is often applied to image data. It is similar to PCA except that it is only applicable to non-negative data. We can apply the NMF to vectorized versions of the shot surface. This is useful because we can convert the observed matrix of shot surfaces into:
* Bases: Identifying modes of shooting style (number of modes is determined by `n_components` argument to `NMF` function below)
* Coefficients: How each player's shooting style could be expressed as a (positive) linear combination of these bases
The NMF solves the following problem: given a $p\times n$ matrix $X$, NMF computes the following factorization:
$$ \min_{W,H} \| X - WH \|_F\\
\text{ subject to } W\geq 0,\ H\geq 0, $$
where $W$ is ${p\times r}$ matrix and $H$ is ${r\times n}$ matrix.
In this homework, we have the following:
#### The data matrix $X$
$X$ has dimension $p \times n$, where $p$ is the total number of square bins on the court and $n$ is the number of players. Each column corresponds to a player, with entries corresponding to a "flattened" or "vectorized" version of the 2d histograms plotted in Question 3b.
#### Bases matrix: $W$
Columns $W_i$ contain the shot "bases". First, we will try it with $r=3$ bases in Question 4a, and then with $r=10$ bases in Question 4e.
#### Coefficient matrix: $H$
Each column of $H$ gives a coefficient for each of the basis vectors in $W$, and there are $n$ columns, one for each player.
The `sklearn` library is one of the main Python machine learning libraries. It has a built in NMF function for us. The function below runs this function and normalizes the basis surfaces to sum to 1.
## Non-negative Matrix Factorization
def non_negative_marix_decomp(n_components, array_data):
    import sklearn.decomposition as skld
    model = skld.NMF(n_components=n_components, init='nndsvda', max_iter=500, random_state=0)
    W = model.fit_transform(array_data)
    # Normalize basis vectors to sum to 1
    Wsum = W.sum(axis=0)
    W = W/Wsum
    ## fix H correspondingly
    H = model.components_
    H = (H.T * Wsum).T
    nmf = (W, H)
    return(nmf)
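The normalization in the function above does not change the product $WH$: dividing column $k$ of $W$ by its sum and multiplying row $k$ of $H$ by the same number cancels out. A minimal numeric check follows, using made-up random non-negative matrices rather than an actual NMF fit; all names here are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
W_raw = rng.random((6, 3))   # made-up non-negative bases, p = 6, r = 3
H_raw = rng.random((3, 4))   # made-up non-negative coefficients, n = 4

col_sums = W_raw.sum(axis=0)
W_norm = W_raw / col_sums             # each basis (column) now sums to 1
H_scaled = (H_raw.T * col_sums).T     # scale row k of H by the k-th column sum

print(np.allclose(W_raw @ H_raw, W_norm @ H_scaled))
print(W_norm.sum(axis=0))
```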
### Question 4a: Computing NMF Factorization
Compute the NMF on all players' shot charts, `X`, with `n_components` = 3 (i.e. each shot chart can be represented as a positive linear combination of 3 "basis" shot charts). Fill in `plot_vectorized_shotchart`. This takes a vector of binned shot counts, converts it back to a matrix of the appropriate size and then calls `plot_shotchart` on the matrix. The numpy function `reshape` will be useful here: [https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html](https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html)
W3, H3 = non_negative_marix_decomp(3, X)
grader.check("q4a")
### Question 4b: Visualizing Shot Types
Plot the first three basis images by calling `plot_vectorized_shotchart` below on the columns of `W3`.
def plot_vectorized_shotchart(vec_counts, xedges, yedges, ax=None, use_log=False, cmap = 'Reds'):
    """Plots 2d heatmap from vectorized heatmap counts
    Args:
        vec_counts: vectorized output of numpy.histogram2d
        xedges, yedges: bin edges in arrays
        ax: figure axes [None]
        use_log: will convert count x to log(x+1) to increase visibility [False]
        cmap: Set the color map https://matplotlib.org/examples/color/colormaps_reference.html
    Returns:
        ax: axes with plot
    """
    nx = xedges.size-1
    ny = yedges.size-1
    # use reshape to convert a vectorized counts back into a 2d histogram
    two_d_counts = vec_counts.reshape((nx, ny))
    return(plot_shotchart(two_d_counts, xedges, yedges, ax=ax, use_log=use_log, cmap=cmap))
fig, ax = plt.subplots(1, 3, figsize=(20,60))
## Write a for loop
for i in range(3):
 # Call plot_vectorized_shot_chart
 plot_vectorized_shotchart(W3[:,i], xedges, yedges, ax = ax[i])
 ax[i].set_title('Shot Basis %i' % (i+1))
### Question 4c: Reconstruction Error
Below we reconstruct the shooting pattern for a single player. By "reconstructing" we mean using the approximation $$\hat{X} = WH$$ obtained via NMF. Find $\hat X$ by multiplying W and H. In Python the `@` symbol is used for matrix multiplication.
X3_hat = W3@H3
Plot $X$, $\hat X$ and the residual ($X - \hat X$) for the player named LaMarcus Aldridge. Remember, each column of $X$ is a vectorized matrix corresponding to the binned (or smoothed binned) shot information.
# Find the player_id of LaMarcus Aldridge
player_id = int(allplayers.query('DISPLAY_FIRST_LAST == "LaMarcus Aldridge"').index.values[0])
## find index in X corresponding to that player
#to_plot_idx = np.where(pids == player_id)[0][0]
to_plot_idx = pids.index(player_id)
fig, ax = plt.subplots(1, 3, figsize=(20,60))
## Call plot_vectorized_shot_chart
original_shotchart = plot_vectorized_shotchart(X[:,to_plot_idx], xedges, yedges, ax=ax[0])
reconstructed_shotchart = plot_vectorized_shotchart(X3_hat[:, to_plot_idx], xedges, yedges, ax=ax[1])
residual_chart = plot_vectorized_shotchart(X[:, to_plot_idx]-X3_hat[:,to_plot_idx], xedges, yedges, ax=ax[2])
# print(max(abs(X3_hat[:, to_plot_idx] - X[:, to_plot_idx])))
ax[0].set_title('Original Shooting Pattern')
ax[1].set_title('Reconstructed Shooting pattern (r=3)')
ax[2].set_title('Residual Shooting Pattern (r=3)')
fig.show()
### Question 4d: Choice of Colormap
Why does it make sense to use a _sequential_ palette for the original and reconstructed shot charts and a _diverging_ palette for the residual? _Hint:_ Read the introduction to colormaps [here](https://matplotlib.org/users/colormaps.html). 
It makes sense to use a sequential palette for the original and reconstructed shot charts because their values are non-negative intensities, and a sequential palette shows changes in intensity through lightness and saturation, as in any intensity map.
A diverging palette suits the residual because residuals can be positive or negative: two different colors show the sign, and they meet in the middle at an unsaturated color for values near zero.
In which areas of the court does this player shoot more, and where less, relative to the reconstructed area? If it's helpful, you can refer to court locations by name using this legend [here](https://en.wikipedia.org/wiki/Basketball_court#/media/File:Basketball_terms.png).
Looking at the residual graph, the player shoots more from the left side, between the arc and the lane line.
### Question 4e: More Detailed Modeling
Re-run the analysis, this time for 10 basis vectors instead of 3. Again plot the bases using `plot_vectorized_shotchart` on the columns of `W10`.
**Hint**: Study the following code
```
fig, ax = plt.subplots(2, 5, figsize=(20, 7))
ax = ax.flatten() # turn ax into a flat array
ax[0].set_title('hello')
ax[9].set_title('there')
fig.show()
```
W10, H10 = non_negative_marix_decomp(10, X)
fig, ax = plt.subplots(2, 5, figsize=(20, 7))
## Write a for loop
for i in range(10):
 plot_vectorized_shotchart(W10[:,i], xedges, yedges, ax = ax[i//5,i%5])
 ax[i//5,i%5].set_title('Shot Basis %i' % (i+1))
If you did things correctly, you should be really impressed! We've identified potentially interesting patterns of shooting styles without actually specifying anything about the way basketball is played or where the relevant lines are on the court. The resulting images are based only on the actual behavior of the players. Even more impressive is that we're capturing similarity in regions that are far apart on the court. One reason we can do this is that a basketball court is symmetric along the length of the court (i.e. symmetric about x=0). However, people tend to be left or right hand dominant, which might affect their preferences. Look carefully at the shot basis plots above: is there any evidence of _asymmetry_ in player shooting behavior? Refer to specific basis images in your answer.
This player does not seem to have a preference for left or right. The patterns on both sides are similar, and the highest number of shots comes from the center line.
Repeat part 4c, and again plot original, reconstructed and residual shot charts for LaMarcus Aldridge.
X10_hat = W10@H10
fig, ax = plt.subplots(1, 3, figsize=(20,60))
# Reuse to_plot_idx from Question 4c, i.e. the column of X for LaMarcus Aldridge
original_shotchart = plot_vectorized_shotchart(X[:, to_plot_idx], xedges, yedges, ax = ax[0])
reconstructed_shotchart = plot_vectorized_shotchart(X10_hat[:, to_plot_idx], xedges, yedges, ax = ax[1])
residual_chart = plot_vectorized_shotchart(X[:, to_plot_idx]-X10_hat[:, to_plot_idx], xedges, yedges, ax = ax[2])
ax[0].set_title('Original Shooting Pattern')
ax[1].set_title('Reconstructed Shooting pattern (r=10)')
ax[2].set_title('Residual Shooting Pattern (r=10)');
### Question 4f: Comparing Players
With the `H10` matrix, it is possible to compare any pair of players. For all pairs of players $i$ and $j$, compare using the Euclidean distance between their coefficients:
$$ \text{player-distance}(i,j) = \| H_i - H_j \|_2 = \left(\sum_{k=1}^{10} (H_{ki} - H_{kj})^2 \right)^{1/2} $$
Create a heatmap for comparing pair-wise player distance matrix. Find the two pairs of players with smallest distances. Also, find two pairs of players with largest distances.
H10.shape
player_distance = np.zeros((len(pids),len(pids)))
for i in range(len(pids)):
    for j in range(len(pids)):
        if i != j:
            # sum of squared differences between the coefficient columns of players i and j
            player_distance[i,j] += np.sum(np.power(H10[:,i]-H10[:,j], 2))
player_distance = np.sqrt(player_distance)
# Plot heatmap
plt.figure(figsize=(10,10))
sns.heatmap(player_distance)
player_distance[player_distance <= 0] = np.inf
min_dist = np.min(player_distance)
# Row and column of the smallest off-diagonal entry
player_i, player_j = np.unravel_index(np.argmin(player_distance), player_distance.shape)
player_a = pids[player_i]
player_b = pids[player_j]
print("The minimum distance is {0:.2f} and it is between players with id {1} and {2}".format(min_dist, player_a, player_b))
We can see that the minimum distance between two different players is 0.05 and it is between players with id 203468 and 1627750.
These distances were obtained by comparing each different ID in matrix H10
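The same pairwise distances can also be computed without explicit loops, which makes it easy to rank all pairs and read off both the closest and the farthest ones. The sketch below uses a made-up coefficient matrix `toy_H` (2 components, 4 fake players) rather than `H10`; the names and values are ours.

```python
import numpy as np

# Made-up coefficient matrix: 2 components (rows) for 4 fake players (columns).
toy_H = np.array([[0.0, 0.1, 3.0, 0.0],
                  [0.0, 0.0, 4.0, 1.0]])

diff = toy_H[:, :, None] - toy_H[:, None, :]   # shape (r, n, n)
toy_dist = np.sqrt((diff ** 2).sum(axis=0))    # shape (n, n), zeros on the diagonal

# Look only at pairs i < j so each pair is counted once and the diagonal is skipped.
i_idx, j_idx = np.triu_indices(toy_dist.shape[0], k=1)
pair_dist = toy_dist[i_idx, j_idx]
order = np.argsort(pair_dist)

closest = (i_idx[order[0]], j_idx[order[0]])
farthest = (i_idx[order[-1]], j_idx[order[-1]])
print(closest, farthest)
```

Taking `order[:2]` and `order[-2:]` instead would give the two closest and the two farthest pairs; column indices map back to player ids through `pids`.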
### Question 4g: Residuals
The residual between `Xhat` and `X` gives a sense of how well a player is described by the NMF-computed matrices `W` and `H`. Calculate RMSE for each player, and plot the histogram. Comment on this distribution and players with smallest and largest RMSEs (use 10 components).
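# Calculate the error for each player (one value per column of X)
# Note: this takes the root of the *summed* squared error, i.e. the residual norm;
# it is not divided by the number of bins, as a strict RMSE would be
RMSE = np.sqrt(np.sum(np.power(X-X10_hat, 2), axis = 0))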
# Histogram
plt.figure()
plt.hist(RMSE, bins = 10)
plt.show()
We can see that most players have an error of approximately 0.006, while a few have a lower error. Even fewer players have a high error, which suggests that the 10 bases describe most players' shooting patterns reasonably well.
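For reference, the quantity computed in the cell above (root of the summed squared error) and a root *mean* squared error differ only by a constant factor, the square root of the number of bins, so the shape of the histogram is the same either way. A small sketch with made-up 4-bin "shot charts" for two fake players; all values are ours:

```python
import numpy as np

# Made-up shot charts (columns = players) and made-up reconstructions.
toy_X = np.array([[0.4, 0.1],
                  [0.3, 0.2],
                  [0.2, 0.3],
                  [0.1, 0.4]])
toy_X_hat = np.array([[0.4, 0.2],
                      [0.3, 0.2],
                      [0.2, 0.3],
                      [0.1, 0.3]])

sq_err = (toy_X - toy_X_hat) ** 2
root_sum = np.sqrt(sq_err.sum(axis=0))    # root of the summed squared error
rmse = np.sqrt(sq_err.mean(axis=0))       # root mean squared error

p_bins = toy_X.shape[0]
print(np.allclose(root_sum, rmse * np.sqrt(p_bins)))
```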
### Question 4h: Proposing improvements
One of the main purposes of exploratory data analysis is to generate new ideas, directions, and hypotheses for future analyses and experiments. Take two players of your choice and compare their shooting patterns with various visualizations.
State any insights and defend your conclusions with visual and/or numerical comparisons.
# Take the data for Bradley Beal
query_str = 'DISPLAY_FIRST_LAST == "Bradley Beal"'
beal_id = str(allplayers.query(query_str).index.values[0])
beal_data = allshots.query('PLAYER_ID == ' + beal_id).astype({'SHOT_MADE_FLAG':bool})
# Show shooting pattern on same graph
plt.figure(figsize=(12, 11))
markers = {0 : "X", 1 : "o"}
markers_lst1 = [markers[int(x)] for x in curry_data['SHOT_MADE_FLAG']]
markers_lst2 = [markers[int(x)] for x in beal_data['SHOT_MADE_FLAG']]
ax = sns.scatterplot(x = curry_data['LOC_X'], y = curry_data['LOC_Y'], style = markers_lst1, color='blue')
ax = sns.scatterplot(x = beal_data['LOC_X'], y = beal_data['LOC_Y'], style = markers_lst2, color='red', ax = ax)
ax.set(xlim = (-300,300),ylim=(-150,500))
draw_court(outer_lines = True)
plt.show()
The plot overlays the shots taken by Stephen Curry and Bradley Beal. Both players have similar shot distributions, but Stephen Curry has taken more long-range shots than Beal.
---
To double-check your work, the cell below will rerun all of the autograder tests.
grader.check_all()
## Submission
Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**
# Save your notebook first, then run this cell to export your submission.
grader.export()