Python Program to Solve Questions Related to Datasets

Instructions

Objective

Write a program to solve questions related to datasets in python.

Requirements and Specifications

Problem:

Write a python assignment for scheduling a building paint job. Both interior and exterior.

The program will do the following:

Get the following data from the customer:

- Ask the customer to enter his/her name.

- Ask the customer to enter the month number that they want to schedule the paint job for (Let the customer know the valid entries are 1 – 12).

- Ask the customer to enter the size of the interior and exterior walls (per square feet) that they want painted.

o example: interior: 7600

exterior: 6400

- Ask the customer for their credit card number (Let the user know AMEX, VISA, & MC cards are accepted)

o (Have in mind that AMEX cards are 15 digits, VISA/MC cards are 16 digits)

o example: 312345678912345 (Amex example)

4987654321234567 (Visa example)

- Note: Make sure to read the values from the same line as the question.

(Question and answer in same line)

Perform the following calculations:

- Calculate the following:

o Total number of gallons(cans) of paint needed for the interior (a whole number)

o Total number of gallons(cans) of paint needed for the exterior (a whole number)

o Total number of gallons(cans) of paint needed altogether (a whole number)

§ Perform your calculations based on the following information:

Each 1 gallon can of paint can paint 400 SQFT
Make sure you always have enough paint
Hint: You can add a can (gallon) to the result to make sure you have enough.
[XC] Extra Credit Point: Instead of just adding a can, actually calculate how many cans (gallons) you need.

o total cost of interior paint

o total cost of exterior paint

o total cost of paint all together

§ Perform your calculations based on the following pricing:

Exterior paint is weatherproof and more expensive.
Price to paint the interior walls: $300.00 (per gallon).
Price to paint the exterior walls: $600.00 (per gallon).

Create a receipt for the customer that looks as:

- [tab]Date: followed by [Today’s date]

o Code for today’s Date is: DateTime.Now.ToString("MM/dd/yyyy HH:mm:ss")

- [tab]Name: followed by [Customer name]

- [tab]Card Number: followed by [Customer credit card number as described below]

o Display the card as six (6) stars followed by the last 4 digits of their card number.

o Example, if card number: 1234567890123456, display: ******3456

o Use Substring.

- [tab]Month Scheduled: followed by [month]

o The month will be displayed in the following format:

§ The 3-letter name of the month followed by a space and then the month number inside a pair parenthesis.

§ Example: if month = 11, then display: NOV (11)

§ Use enum (if not sure how, replace month name with “???”, ex: ??? (11))

Line 1: [tab]Date: [Today’s Date]

Line 2: [tab]Name: Name

Line 3: [tab]Card Number: ******[Last 4 digits]

Line 4: [tab]Month Scheduled: NOV (11)

<------------ 40 chars ----------->

Only 10 digits of the card number is displayed

Month Scheduled, is the [3-letter Month Name] ([Month Number])

Followed by receipt details.

- A line of space, followed by details.

o Details are as follows:

§ Line 1: [tab]Header row (Type, Area (sqft), Paint (gal), Cost ($) )

§ Line 2: [tab]dashes

§ Line 3: [tab]Interior detail

§ Line 4: [tab]Exterior detail

§ Line 5: [tab]dashes

§ Line 6: [tab]Totals

o Note: Format and display all $ values with $ sign and one decimal points

- The receipt will look something like the following example:

Date: Todays Date

Name: xxxx

Card Number: ******3450

Month Scheduled: NOV (11)

Type Area (SqFt) Paint (Gal) Cost ($)

--------- ----------- ----------- -----------

Interior 7600 19 $5,700.0

Exterior 5500 14 $8,400.0

--------- ----------- ----------- -----------

Total 13100 33 $14,100.0

Comments:

Make sure to provide comments in your code as needed to describe your work.

Also include the following program info at the top of your program inside the Class().

Source Code
## Full name: Krystean-Heart Jenport
## R#: 11765042
## Title of the notebook: CTDS: Exam 2
## Date: 10-21-21
> # Important Notes:  

> ### 0. Work directly in this notebook, upon completion render the notebook as a PDF file.
> ### 1. Please make sure that you have your information ***in the box above!***
> ### 2. You ***CAN*** use your ***notes, notebooks, Google, and the internet.***
> ### 3. You ***CAN NOT*** collaborate in any ways or forms with anyone. It is and will be considered as cheating.
> ### 4. Please ***READ the questions CAREFULLY.***
> ### 5. Then, ***READ the questions again.***
> ### 6. ***Provide comments*** for your codes. Your code should be ***readable for someone with very basic Python skills.***
> ### 7. If you are using the resources on the internet, make sure to ***observe all the steps of the CCMR process: Copy, Cite, Modify, and Run.***
> ### 8. Consulting with each other is **not** permitted. Sharing work with each other is **not** permitted. Working together is **not** permitted.
> ### 9. YOU NEED TO SUBMIT 2 FILES: YOUR NOTEBOOK (.ipynb file) ***AND*** ITS PDF (.pdf file) on ***Blackboard before the deadline.***
> ### 10. You have ***48 hours (2 days)*** to submit your answers. ***Multiple attempts*** are allowed but ***only the last submission will be graded.***
___
# Question 0 (5 pts): To begin ...
#### Run the cell below, and leave the results in your notebook.
#### RUN! this Cell ####
import sys
! hostname
! whoami
print(sys.executable) # OK if generates an exception message on Windows machines
# tested ok MacOS, arm linux, x86-64 linux, Windows 10VM, Windows Server
___
# Question 1 (15 pts): To cause or to correlate; that is the question!
![](https://aspectmr.com/wp-content/uploads/2019/11/chicken_makes_you_go_to_the_moon.jpg) 

### Answer the following questions in your own words:
#### A) What is the message behind the above image in relation to causation and correlation?
#### B) Why is it important to understand the difference between causation and correlation?
#### C) Write two examples of your own to illustrate the difference.
**A)** The relation is like blaming something that has no connection with the problem. Eating chicken does not make you go to the moon. The problem illustrates a correlation problem, not a causation problem.
**B)** It is important to understand their differences because the terms mean completely different things. Causation is referred to the result from an activity/cause (Cause and Effect), so it refers to: if x happened, then y follows.
Correlation refers to the relation between two variables.
**C)**
* **Causation**: I went to the beach, so I got a tan
* **Correlation**: If the demand increases, the price increases
___
# Question 2 (40 pts): On "Bottled Poetry" ...
![](https://w7w5t4b3.rocketcdn.me/wp-content/uploads/2016/04/game-of-thonres-tyrion-lannister-i-drink-and-i-know-things.jpg) 

### The "winequality.csv" dataset is provided with information related to red vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests. Follow the steps and answer the question. *Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).*
#### The datasets consists of several Input variables (based on physicochemical tests).
|Columns|Info.|
|---:|---:|
|fixed acidity |most acids involved with wine or fixed or nonvolatile (do not evaporate readily)|
|volatile acidity |the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste|
|citric acid |found in small quantities, citric acid can add 'freshness' and flavor to wines|
|residual sugar |the amount of sugar remaining after fermentation stops, it's rare to find wines with less than 1 gram/liter|
|chlorides |the amount of salt in the wine|
|free sulfur dioxide |the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion|
|total sulfur dioxide |amount of free and bound forms of S02; in low concentrations, SO2 is mostly undetectable in wine|
|density |the density of water is close to that of water depending on the percent alcohol and sugar content|
|pH |describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4|
|sulphates |a wine additive which can contribute to sulfur dioxide gas (S02) levels, wich acts as an antimicrobial|
|alcohol |the percent alcohol content of the wine|
|quality (score between 0 and 10) |output variable (based on sensory data, score between 0 and 10)|
*Acknowledgements:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.*
### Based on the provided dataset, "winequality.csv", follow the steps and answer the questions.
#### A) Read the data as a data frame and print the first few rows. Check if it is the right dataset and explain your answer.
import pandas as pd
wine_data = pd.read_csv('winequality.csv')
wine_data.head()
#### B) Use the appropriate function and get a summary of information on the data frame. Explain what you can learn from this summary report (at least 3 items).
wine_data.info()
#### C) Are there any missing values in the data? Justify your answer.
# Count the number of NaN values in each column
wine_data.isna().sum()
#### D) Use the appropriate function and get the 5-number summary for the data frame. Explain what you can learn from this summary report for each column with regards to statistics such as mean, median, etc.
# Calculate the Quartiles Mean
wine_data.describe()
#### E) Rename the "quality (score _0to10)" column heading to "quality"
wine_data = wine_data.rename(columns ={"quality (score _0to10)": "quality"})
wine_data.head()
#### F) Make a subset of all the wines with a quality above 7. Name this subset "TopQ".
TopQ = wine_data[wine_data["quality"] > 7]
TopQ.head()
#### G) What percentage of wines in "TopQ" has an alcohol content less than 10%? What is this percentage out of the entire set of wine (the original data)?
# Get the number of rows for wines with alcohol content less than 10
wines_less10 = TopQ[TopQ['alcohol'] < 10]
# Calculate the percentage
perc = len(wines_less10) / len(wine_data)
print("The percentage of wines in TopQ with an alcohol concent less than 10% is: {:.2f}%".format(perc*100.0))
#### H) What percentage of all wines has a ph between 2.5 and 3.5 and sulphates content higher than 0.6?
# get a sub-dataframe of all wines with a ph between 2.5 and 3.5
wines_ph = wine_data[wine_data['pH'] >= 2.5]
wines_ph = wines_ph[wines_ph['pH'] <= 3.5]
# Calculate percentage
perc = len(wines_ph)/len(wine_data)
print("The percentage of all wines with a pH between 2.5 and 3.5 is: {:.2f}%".format(perc*100.0))
#### I) Print the above subset of the dataframe, sorted by wine quality.
wines_ph = wines_ph.sort_values(by=['quality'])
wines_ph.head(20)
#### J) Define a function that labels the wines based on their quality according to the table below: 

|Quality Score|Label.|
|---:|---:|
|q>= 7|Top|

|5|q<=5|Low|
def label_data(wine_data):
  wine_data['Label'] = ''
  wine_data.loc[wine_data['quality'] >= 7, 'Label'] = 'Top'
  wine_data.loc[(wine_data['quality'] > 5) & (wine_data['quality'] <7), 'Label'] = 'Average'
  wine_data.loc[wine_data['quality'] <= 5, 'Label'] = 'Low'
#### K) Apply the function on the data frame and store the result in a new column "Qlabel".
label_data(wine_data)
wine_data.head()
#### L) Report the share of each quality label in percentage.
wine_data.groupby(['Label']).size() / len(wine_data) *100.0
#### M) Plot a histogram of pH for all the Low quality wines. Explain what you can infer from this plot (at least 2 items).
# Get all wines with low quality
low_wines = wine_data[wine_data['Label'] == 'Low']
low_wines['pH'].hist()
It can be seen that for low-quality wines, most have a pH between 3.2 and 3.4
#### N) Make a similar histogram for pH for all the Top quality wines. Put the new histogram and the previous one next to each other. Make sure that both histograms have the same range for the x axis. Explain what you can infer by comparing them.
# Get all wines with Top quality
top_wines = wine_data[wine_data['Label'] == 'Top']
top_wines['pH'].hist()
In this case, high-quality wines have a pH between 3.2 and 3.4, but there are also some with a higher pH of up to 3.8
___
# Question 3 (15 pts): The Experiment!
![](https://onthewards.org/wp-content/uploads/2017/03/dropping-the-ball-part-1.jpg) 

### An experiment has been carried out in which a ball is droped from a known height and its speed is recorded at every second. The table below contains that information.
|Elapsed Time (s)|Speed (m/s)|
|---:|---:|
|0 |0|
|1.0 |3|
|2.0 |7.4|
|3.0 |16.2|
|4.0 |23.5|
|5.0 |32.2|
|6.0 | 42.2|
|7.0 | 65.1 |
|8.0 | 73.5 |
|9.0 | 99.3 |
|10.0| 123.4|
### Based on the provided data, follow the steps and answer the questions.
#### A) Create a 1D numpy array for "Elapsed Time" and another one for "Speed".
import numpy as np
time = np.array([0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
speed = np.array([0, 3, 7.4, 16.2, 23.5, 32.2, 42.2, 65.1, 73.5, 99.3, 123.4])
#### B) Use the right numpy function(s) and calculate the average speed over the duration of the experiment.
avg_speed = speed.mean()
print("The average speed ins: {:.2f} (m/s)".format(avg_speed))
#### C) Plot the speed vs time (speed on y-axis, time on x-axis) using a scatter plot. Use blue markers.
import matplotlib.pyplot as plt
plt.figure()
plt.scatter(time, speed, color='blue')
plt.xlabel('Time (s)')
plt.ylabel('Speed (m/s)')
plt.grid(True)
plt.show()
#### D) Plot a red line on the scatterplot based on the linear model $f(x) = mx + b$
# Define the matrix of values
N = len(time) # number of points
A = np.ones((N,2))
A[:,0] = 1
A[:,1] = time
by = speed
coeff = np.matmul(np.matmul(np.linalg.inv(np.matmul(A.T, A)), A.T), by)
# Get slope and y-intercept
b = coeff[0]
m = coeff[1]
# Calculate fitted values
y_fit = m*time + b
# Plot
plt.figure()
plt.scatter(time, speed, color='blue', label = 'Data')
plt.plot(time, y_fit, color='red', label = 'Fit')
plt.xlabel('Time (s)')
plt.ylabel('Speed (m/s)')
plt.legend()
plt.grid(True)
plt.show()
#### E) By trial-and-error find values of $m$ and $b$ that provide a good visual fit (i.e. makes the red line explain the blue markers).
# We can use the values obtained in the previous cell, since these are the best ones
m = 11.973
b = -15.7
y_fit2 = m*time + b
# Plot
plt.figure()
plt.scatter(time, speed, color='blue', label = 'Data')
plt.plot(time, y_fit2, color='red', label = 'Fit')
plt.xlabel('Time (s)')
plt.ylabel('Speed (m/s)')
plt.legend()
plt.grid(True)
plt.show()
#### F) Using this data model estimate the speed at $t = 15~\texttt{sec.}$
speed_at15 = m*15 + b
print("The speed at t = 15s is {:.2f} (m/s)".format(speed_at15))
___
# Bonus Question for extra credit!
![](https://images.squarespace-cdn.com/content/v1/53e9e56be4b08138fefa5095/1526088540947-YAAMTJM58SEAWNKCJUEE/160912_The_N_That_Connects_1-1.4.1.gif) 

### This is a simulation of a real-world data analysis project. It will require your knowledge and skills but above all, it will require your engineering judgement. For each question, you need to provide a clear and concise answer, a well-documented code, solid justification for your answer and every decision and assumption you make, and an appropriate visualization if you see fit.
#### A dataset is provided that contains tv shows and movies available on Netflix. This dataset is available at https://www.kaggle.com/shivamb/netflix-shows.
> FEATURES:

SHOW-ID - Unique id of each show (not much of a use for us in this notebook)

TYPE - The category of a show, can be either a Movie or a TV Show

TITLE - Name of the show

DIRECTOR - Name of the director(s) of the show

CAST - Name of the show

COUNTRY - Name of countries the show is available to watch on Netflix

DATE ADDED - Date when the show was added on Netflix

RATING - Show rating on netflix

RELEASE YEAR - Release year of the show

DURATION - Time duration of the show

LISTED IN - Genre of the show

### Questions:
#### A) Describe the dataset with as much detail as you see appropriate.
#### B) Which country has the most TV Shows and which countries has the least Movies?
#### C) How has the Netflix contents changed (in number) for the titles added over the years from 2000 to 2020 (on a yearly basis). Is it different for Movies and TV Shows?
#### D) What is the most common rating (age restriction) on Netflix? How has it changed for the products released before 2000 and after 2000?
## Load Dataset
netflix = pd.read_csv('NetflixData.csv')
netflix.head()
## Part A)
netflix.info()
## Part B)
# Group the data by country, and then count the number of entries for each one
netflix_country = netflix.groupby(['country']).size().sort_values()
print("The country with the least shows is {0} with {1} shows".format(netflix_country.index[0], netflix_country[0]))
print("The country with the most shows is {0} with {1} shows".format(netflix_country.index[-1], netflix_country[-1]))
## part C)
# Group by Type and release_year
# First, select only the titles between 2000 and 2020
shows = netflix[(netflix['release_year'] >= 2000) & (netflix['release_year'] <= 2020)]
shows = shows.groupby(['release_year', 'type']).count()['show_id'].unstack().plot()
plt.grid(True)
plt.xlabel('Year of Release')
plt.ylabel('Number of releases')
## Part D)
# Group by release_year and Rating
# First, select only the titles between 2000 and 2020
shows = netflix[(netflix['release_year'] >= 2000) & (netflix['release_year'] <= 2020)]
shows = shows.groupby(['release_year', 'rating']).count()['show_id'].unstack().plot()
plt.grid(True)
plt.xlabel('Year of Release')
plt.ylabel('Number of releases')
![](https://freight.cargo.site/w/1200/i/c96b2ce17b2aee95c6e837552e3e38d058ac4ad2a6759448ed78128963790744/Happy_Coding_BC_6-03-05.png)

Python Program to Solve Questions Related to Datasets Assignment Solution.

Instructions

Requirements and Specifications