## Instructions

**Objective**

## Requirements and Specifications

**Solutions**

**Source Code
**

# CS 237: Homework 6 Programming Portion

# add your imports here

import random

# here are some examples of imports

import matplotlib.pyplot as plt # for plotting

import numpy as np # for simulating random choices

import bisect as bi # for binary search on sorted lists

# Exercise 1: Data structures for discrete RVs: PDFs and CDFs

In this exercise, we'll build simple data structures to represent PDFs and CDFs of discrete random variables (RVs).

As input, the user will provide a dictionary of values of a discrete RV and their associate percentage probabilities. At the end of this notebook are some test cases involving frequencies of English letters in documents and Super Bowl win probabilities of NFL teams. You're welcome and encouraged to try others, including larger datasets.

Concretely, the input is a Python dictionary with tuples of the form (RV_value: percentage_prob)

For example, if the user creates an RV named X with '{4: 50.0, 22: 30.0, 2: 20.0}', we will want to represent the following information (in sorted order), sorted by values:

PDF f(X): Pr[X = 2] = 0.2, Pr[X = 4] = 0.5, Pr[X = 22] = 0.3

CDF F(X): Pr[X < 2] = 0, Pr[2 <=X < 4] = 0.2, Pr[4 <= X < 22] = 0.7, Pr[22 <= X] = 1

We also need to check that the user has specified a valid PDF (probabilities sums to 1).

To manipulate PDFs and CDFs efficiently, we will store the values in sorted order. To be space-efficient, we only need to store those values where the PDF has mass, or equivalently, where the CDF has a discontinuity. So our random variable X can be described by:

values: 2 4 22

--------------------------------

PDF f() 0.2 0.5 0.3

CDF F() 0.2 0.7 1.0

# note that F() always starts at zero, so we don't need to store that explicitly

===

# Exercise 1a) Complete the implementation of the __init__ method of the RV class below

class RV:

# Create an RV from a user-specified dictionary. See examples of input below.

# Member variables are

# self.values (sorted list)

# self.PDF (list of probabilities)

# self.CDF (list of cumulative probabilities)

#

def __init__(self, dict):

# create sorted list of values

self.values = []

for x in dict:

self.values.append(x)

self.values.sort() # create a sorted list of outcomes

# create PDF

self.PDF = [None] * len(self.values)

for i in range(len(self.values)):

pct = dict[self.values[i]]

self.PDF[i] = round(pct * 10000)/1000000

# round to 8 decimal digits to avoid numerical precision problems

# see what bad things happen if you just use pct/100

# compute F_X from f_X

self.CDF = [None] * len(self.values)

total = 0.0

for i in range(len(self.values)):

total += self.PDF[i]

self.CDF[i] = total

# Exercise 1b) Complete the implementation of prob() below

# Find the leftmost index i in sorted list a such that a[i] == x

def index(a, x):

i = bi.bisect_left(a, x)

if i != len(a) and a[i] == x:

return i

return None

# Given an RV value return its likelihood: Pr[R=value]

def prob (R, value):

i = index(R.values, value)

if i is None:

return 0.0

return R.PDF[i]

# Exercise 1c) Complete the implementation of rangeprob() below

# Find the leftmost index in sorted list A strictly greater than x

def find_gt(a, x):

i = bi.bisect_right(a, x)

if i != len(a):

return i

return None

# Find the rightmost index in sorted list A less than or equal to x

def find_le(a, x):

i = bi.bisect_right(a, x)

if i:

return i-1

return None

# STUDENTS TO IMPLEMENT

# Given a range of outcomes specified by an interval: [left, right] and an RV, return Pr[left <= x < right]

def rangeprob (R, left, right):

i_from = find_le(R.values, left)

i_to = find_gt(R.values, right)

if i_from >= i_to:

return None

total = 0.0

for j in range(i_from, i_to):

total += R.PDF[j]

return total

# Random Sampling

To generate a random sample from a distribution, it is easiest to use the CDF.

Here's pseudocode of the method you will implement:

- Generate a random number y from [0, 1).

- Search the sorted list of the CDF to find the smallest outcome l such that y < F(l)

(note that there will always be one such l provided F is a valid CDF)

- return l

Intuitively, y serves to pick a random offset into the CDF, which maps onto an interval

within the CDF associated with one value of the random variable.The length of that interval is exactly to the probability of the associated value!

For example, with the CDF of our previous random variable

F_X = [2: 0.2], [4: 0.7], [22, 1.0]

We might select y = 0.78. The smallest value l satisfying y < F[l] is

l = 22, F[l] = 1.0

So we return '22'.

Alternatively, if we selected y = 0.152, we would return '2'.

# Exercise 1d) Complete the implementation of sample() below

# STUDENTS TO IMPLEMENT

# Generate a random sample from a random variable R

def sample(R):

r = random.uniform(0, 1)

i = find_gt(R.CDF, r)

return R.values[i]

# feel free to insert your own test cases here

# Exercise 2) Run our test cases on your code.

# English letter frequencies as a dictionary

letterFrequency = {'E' : 12.0,

'T' : 9.10,

'A' : 8.12,

'O' : 7.68,

'I' : 7.31,

'N' : 6.95,

'S' : 6.28,

'R' : 6.02,

'H' : 5.92,

'D' : 4.32,

'L' : 3.98,

'U' : 2.88,

'C' : 2.71,

'M' : 2.61,

'F' : 2.30,

'Y' : 2.11,

'W' : 2.09,

'G' : 2.03,

'P' : 1.82,

'B' : 1.49,

'V' : 1.11,

'K' : 0.69,

'X' : 0.17,

'Q' : 0.12,

'J' : 0.11,

'Z' : 0.08 }

# Test initialization

Letter = RV(letterFrequency)

print (Letter.values)

print (Letter.PDF)

print (Letter.CDF)

# Test prob() and rangeprob()

print ("Pr[Letter = 'Z'] = " + str(prob(Letter, 'Z')))

print ("Pr[Letter = 'CS237'] = " + str(prob(Letter, 'CS237'))) # This should return None (not a valid value of the RV)

print ("Pr['M' <= Letter < 'X']= " + str(rangeprob(Letter, 'M', 'X')))

print ("Pr['X' <= Letter < 'M']= " + str(rangeprob(Letter, 'X', 'M'))) # This should return None or raise an error

# Test sample()

## Let's see if letter frequencies are enough to produce any reasonable 5-letter English words

for i in range (51):

word = (""+ sample(Letter)+sample(Letter)+sample(Letter)+sample(Letter)+sample(Letter))

print (word)

truncatedLetterFrequency = {'E' : 12.0,

'T' : 9.10,

'A' : 8.12,

'O' : 7.68,

'I' : 7.31}

TruncLetter = RV(truncatedLetterFrequency)

## this is hard to deal with well in Python, but we just want you to set the CDF to None

print (TruncLetter.CDF)

# from fivethirtyeight.com, as of 10/12

SuperBowlWinPercentages = {"Bills": 17,

"Bucs": 13.9,

"Ravens": 10.7,

"Chargers": 8.8,

"Cardinals": 9,

"Rams": 8.8,

"Cowboys": 5,

"Packers": 4.9,

"Chiefs": 4,

"Browns": 3.8,

"Saints": 3.6,

"Titans": 2,

"Seahawks": 0.9,

"49ers": 0.9,

"Broncos": 0.8,

"Bengals": 0.7,

"Vikings": 0.7,

"Patriots": 0.7,

"Colts": 0.7,

"Raiders": 0.5,

"Panthers": 0.5,

"Bears": 0.5,

"Washington": 0.4,

"Eagles": 0.4,

"Steelers": 0.4,

"Falcons": 0.2,

"Dolphins": 0.1,

"Giants": 0.05,

"Jets": 0.03,

"Texans": 0.02

}

Winner = RV(SuperBowlWinPercentages)

print (Winner.values)

print (Winner.PDF)

print (Winner.CDF)

while True:

winner = sample(Winner)

print (winner)

if winner == "Patriots":

break