• Home
  • Data Analytics Projects
    • Data Analytics Using Python
      • Mini Project: AirBnB Recommender App
      • Data Visualization
      • Text Mining (Text Nomalisation)
      • Statistics Analysis
      • A Study of Food Consumption in the World
    • DataCamp Projects
    • Tableau Visualizations
    • Experimenting with Google Teachable Machines
    • Data Analytics Using R
      • Udemy Projects
      • Using Dplyr
      • Hypothesis Testing
      • Linear Regression (Project1)
      • Linear Regression (Project 2)
      • Linear Regressions Exploring Interactions
      • Regression Models
      • Multiple Regression (Project 2)
  • Philanthropy
    • 2018 Water for Life
    • 2019 Habitat for Humanity Global Build
  • My Thoughts
  • Contact
BarbaraYam.com
  • Data Analytics Projects
    • Data Analytics Using Python
      • Mini Project: AirBnB Recommender App
      • Data Visualization
      • Text Mining (Text Nomalisation)
      • Statistics Analysis
      • A Study of Food Consumption in the World
    • DataCamp Projects
    • Tableau Visualizations
    • Experimenting with Google Teachable Machines
    • Data Analytics Using R
      • Udemy Projects
      • Using Dplyr
      • Hypothesis Testing
      • Linear Regression (Project1)
      • Linear Regression (Project 2)
      • Linear Regressions Exploring Interactions
      • Regression Models
      • Multiple Regression (Project 2)
  • Philanthropy
    • 2018 Water for Life
    • 2019 Habitat for Humanity Global Build
  • My Thoughts
  • Contact

Data Visualization using Python

In this exercise, we explored Data Visualization using Matplotlib and the Seaborn Library

Barbara Yam_-ds102-problem-set-week3

DS102 Problem Set - Week 3

Hackwagon Academy

HWA-DS102-PS3-201810


If you have any questions to the Problem Set, feel free to send a message to your TAs or Instructors in the Telegram chat. You are encouraged to also discuss the problem sets within your own groups.

Before you start, fill in your details for this notebook and rename the notebook to `_-problem-set-week-3`, for example `jane-doe-problem-set-week-3`.
In [1]:
# Rename the notebook. Then, fill in your details here.
# Name  :Barbara Yam
# Class : 5 May 2019, 7 pm

Datasets Required:¶

  • sgp-literacy-rate.csv

  • insurance-claims.csv

  • ks-projects-15k.csv

  • wines-200.csv

  • loans-by-grade.csv

In [8]:
# import the relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set matplotlib inline
%matplotlib inline

Part 1 (5 Marks)¶

Mark breakdowns are stated in the question e.g. [1]. For questions wtih short answers, **both the code and typed answers must be provided to receive full credit.** **For plotting, all features must be present to receive full credit.**

1A.

i) What are the key features of a plot that needs to be explicitly stated? [0.5]

In [3]:
# What are the key features of a plot that needs to be explicitly stated?
# Code is not required.
# (Type your answer here)
#Title
#Legend
#Axes Labels for both the horizontal or  x -axis and the vertical or  y -axis
#Reasonable tick values

ii) You are given the following data in sgp-literacy-rate.csv, taken from World Bank Open Data. The definition of each column is as follows:

sgp-literacy-rate.csv

col name definition datatype example
literacy_rate The percentage of people aged 15 and above who can both read and write float 96.716660
year The year in which the literacy rate was obtained int 2010

Read from CSV to df

Read the data from the sgp-literacy-rate.csv CSV file into a DataFrame called sgp_lr_df.

In [35]:
# Read from CSV to sgp_lr_df. Write your code here.

sgp_lr_df = pd.read_csv('sgp-literacy-rate.csv',sep="|")

sgp_lr_df
sgp_lr_df.reset_index()
Out[35]:
index literacy_rate year
0 0 95.857330 2010
1 1 96.186081 2011
2 2 96.369789 2012
3 3 96.548080 2013
4 4 96.716660 2014
5 5 96.825493 2015
6 6 97.049591 2016

The following code to plot the line graph is given to you. It plots the literacy rate of Singapore (% of people ages 15 and above) from 2010 to 2016. Perform the following steps:

  1. Run the code to plot the line graph.
  2. Using your answer from 1A or otherwise, modify the code to add the missing features of the plot. [1]
In [36]:
# Instructions: modify the code to add the missing features of the plot

DIMS = (10,5)
fig, ax1 = plt.subplots()
ax1.set_title("Singapore's Literacy Rate, 2010 to 2016", fontsize=16)
sgp_lr_df.plot(kind='line',x ='year',y ='literacy_rate',figsize=DIMS, ax=ax1, marker='o')

plt.xlabel("Year")
plt.ylabel('Literacy Rate')
ax1.legend(["Singapore's Literacy Rate"], fontsize=12)
# Display the line graph
plt.show()

1B. You are given a baggage loss dataset from the TSA Claims Database. In this dataset, the case_number represents the unique ID associated with the case from the insurer and claim_amount is the amount of claim filed by the passenger, or the loss due to travelling borne by the passenger.

Read from CSV to df¶

Read the dataset from insurance-claims.csv to a DataFrame called ins_df.

In [13]:
# Read from CSV to ins_df. Write your code here.
#

ins_df= pd.read_csv('insurance-claims.csv')

ins_df.head()
Out[13]:
case_number claim_amount
0 2004103065920 50.00
1 0226059M 312.50
2 2006012099982 90.00
3 2005010372123 216.00
4 2004092263584 178.74

i) Plot a histogram for claim_amount using $8$ bins. Use the bins parameter to specify this. Also, as we do not need the KDE component, set the parameter kde=False. Ensure that all features of the plot have to be shown for full credit. Clue: Use sns.distplot(). [1]

Note: If you see this error:

Warning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.

You can ignore it. The plot will not be affected.

In [14]:
# Plot a histogram for claim_amount using 8 bins.
# Write your code here.

fig = plt.figure(figsize=(18, 8))
ax1 = fig.add_subplot(111)

sns.distplot(ins_df['claim_amount'],kde=False,ax=ax1,bins=8)

ax1.set_title("Histogram of Claim Amount, 8 bins", fontsize=16)
ax1.set_ylabel("Number of Claims")
Out[14]:
Text(0, 0.5, 'Number of Claims')

ii) The following DataFrame called ins_df_agg_c is given to you. Run the code. What is the interpretation of count in ins_df_agg_c with respect to the plotted histogram in i) ? [0.5]

In [15]:
# No additional code is required.
ins_df_agg = pd.cut(ins_df['claim_amount'], 8).reset_index(name='bin_range')
ins_df_agg_c = ins_df_agg.groupby('bin_range').size().reset_index(name='count')
ins_df_agg_c

# What is the interpretation of count in ins_df_agg_c with respect to the plotted histogram?
# (Type your answer here)
# The number of claims within the amount that is specified the bin_range. 
# Bin range is gotten by dividing insurance amounts into 8 equal intervals.
Out[15]:
bin_range count
0 (1.011, 125.625] 1689
1 (125.625, 249.25] 440
2 (249.25, 372.875] 181
3 (372.875, 496.5] 76
4 (496.5, 620.125] 48
5 (620.125, 743.75] 26
6 (743.75, 867.375] 27
7 (867.375, 991.0] 13

1C. You are given the details of some projects taken from the KickStarter Projects dataset. The definition of each column is as follows:

ks-projects-15k.csv

col name definition datatype example
ID internal kickstart ID int 59178964
name name of a kickstarter project str I Believe in King The Kid
main_category category of a campaign str Music
backers the number of backers a project has int 444
usd_pledged_real the amount that was pledged for the project, in USD float 27591.46
usd_goal_real the fundraising goal of the project, in USD float 55000

Read from CSV to df¶

Read the dataset from ks-projects-15k.csv to a DataFrame called kst_df.

In [16]:
# Read from CSV to kst_df. Write your code here.

kst_df = pd.read_csv('ks-projects-15k.csv')

kst_df.sample()
Out[16]:
ID name main_category backers usd_pledged_real usd_goal_real
1937 2060559238 GREAT FILM COMPOSERS & THE PIANO: CD+VIDEO by ... Music 98 21291.0 16000.0

i) Filter for all Kickstarter Projects where the total amount pledged in USD, or usd_pledged_real is greater than $\$10000$. Store this in a new DataFrame called kst_df_10k. [0.5]

In [17]:
# Filter for all Kickstarter Projects where the usd_pledged_real is greater than 10000.
# (Write your code here)

kst_df_10k = kst_df[kst_df['usd_pledged_real']>10000]

kst_df_10k.sample()
Out[17]:
ID name main_category backers usd_pledged_real usd_goal_real
1319 885975580 Seeds of Reprisal: Monsanto vs. Michael White ... Publishing 27 23106.0 25000.0

ii) Create a new DataFrame, kst_df_agg where the first column is the main_category and the second column is called n. In kst_df_agg, main_category is the main category of kst_df_10k and n is the number of projects belonging to that main_category. Hint: use df.groupby().size(). [0.5]

In [18]:
# Create an aggregated df with 1 column being the main_category and the other column
# being the no. of projects in that category.
# (Write your code here)

kst_df_agg = kst_df.groupby('main_category').size()
kst_df_agg.reset_index(name='n')                
                   
Out[18]:
main_category n
0 Art 77
1 Comics 70
2 Crafts 13
3 Dance 11
4 Design 289
5 Fashion 94
6 Film & Video 386
7 Food 155
8 Games 280
9 Journalism 12
10 Music 256
11 Photography 34
12 Publishing 122
13 Technology 241
14 Theater 47

iii) Using your answers to (i) and (ii) or otherwise, plot a bar graph where the $x$-axis is the main_category and the $y$-axis is the number of projects belonging to that main_category. [1]

In [19]:
# Plot the bar graph where the x-axis is the main_category and 
# the y-axis is the number of projects belonging to that main_category
# (Write your code here)

fig = plt.figure(figsize=(18, 7))
ax = fig.add_subplot(111)
kst_df_agg.plot(kind='bar', x='main_category', y='n',ax=ax,color='skyblue')
# Add the ylabel using ax.set_ylabel() and the title using ax.set_title()
ax.set_ylabel("Number of Projects")
ax.set_xlabel('Main Category')
ax.set_title("Number of Projects in each Main Category")
Out[19]:
Text(0.5, 1.0, 'Number of Projects in each Main Category')
**Checkpoint**: When you have completed Part 1, upload your progress to eLearn. eLearn only accepts the latest notebooks so if you have an existing notebook, your latest upload will **override** it.

Part 2 (7 Marks)¶

For questions wtih short answers, **both the code and typed answers must be provided to receive full credit.** **For plotting, all features must be present to receive full credit.**

2A. You are given the score and price of some bottles of wines taken from the Wine Reviews dataset. The definition of each column is as follows:

wines-200.csv

col name definition datatype example
country The country that the wine is from str US
points The number of points WineEnthusiast rated the wine on a scale of 1-100 int 86
price The price of the bottle of wine float 12.0

Read from CSV to df¶

Read the dataset from wines-200.csv to a DataFrame called wines_df.

In [20]:
# Read from CSV to wines_df. Write your code here.

wines_df = pd.read_csv('wines-200.csv')

wines_df.head()
Out[20]:
country points price
0 US 86 50.0
1 Italy 92 48.0
2 France 89 19.0
3 US 92 34.0
4 US 84 20.0

i). Using dropna(), remove all records where the price value is not given. What is the shape of wines_df after removing the values?[0.5]

In [21]:
# Remove the records where the price is not given
# (Write your code here)

wines_df = wines_df.dropna(subset=['price'])
# What is the shape of wines_df after removing the values?
# (Write your code here)

#wine_df.dropna(inplace=True)

wines_df.shape

#185 entries, 3 columns
Out[21]:
(185, 3)

ii). Using your answers to i) or otherwise, plot a scatterplot where the $x$-axis represents the number of points a wine has received, and the $y$-axis is the price of the wine. Only plot for points where both the point and price value exist. The title, $x$-axis label and $y$-axis label must be shown to obtain full credit. [1]

In [22]:
# Plot the scatterplot where the x-axis represents the number of points a wine has received, 
# and the y-axis is the price of the wine.
# Write your code here

fig = plt.figure(figsize=(8, 7))
ax = fig.add_subplot(111)

wines_df.plot(kind='scatter',ax=ax,x='points',y='price',title='Scatterplot of Price to Points of Wine')
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ac1e6ba848>

iii) Complete the code below to plot a boxplot showing the distribution of the points of the bottles of wines. [0.5]

In [40]:
# Complete the code below to plot a boxplot.
sns.set(style="whitegrid")
sns.boxplot(wines_df['points'], orient="h", width=0.2)
plt.title('Boxplot of point values of wine')
# Show the boxplot
plt.show()
In [24]:
#sns.reset_orig()

#reset the seaborn library

iv) Using the plot of iii) to help you or otherwise, find the number of outliers for the points distribtution of the wines dataset. [0.5]

In [25]:
# Find the number of outliers for the points distribtution of the wines dataset.
# (Write your code here)
# Hint: Use the following code to help you:
p25 = np.percentile(wines_df['points'], 25)
p75 = np.percentile(wines_df['points'], 75)
iqr = p75 - p25
# p25, p75 = np.percentile(wines_df['points'],[25,75])
#lo, hi = p25-1.5*iqr, p75+1.5*iqr

lower_bound = p25 - 1.5*iqr
upper_bound = p75 + 1.5*iqr

wine_df = wines_df[(wines_df['points'] < lower_bound) | (wines_df['points'] > upper_bound)]
wine_df.count()
# How many outliers are there for the points distribtution of the wines dataset?
# (Type your answer here)

#1 outlier
Out[25]:
country    1
points     1
price      1
dtype: int64


2B. You are given the number of loans per term for each grade taken from a sample from the Lending Club Dataset. The definition of each column is as below.

loans-by-grade.csv

col name definition datatype example
grade The grade scored str A
36 months The number of loans with a 36 month term (repayment must be complete in 3 years) int 18572
60 months The number of loans with a 60 month term (repayment must be complete in 5 years) int 8013

Read from CSV to df¶

Read the dataset from loans-by-grade.csv to a DataFrame called loans_df.

In [26]:
# Read from CSV to loans_df. Write your code here.

loans_df= pd.read_csv('loans-by-grade.csv')

loans_df.head()
Out[26]:
grade 36 months 60 months
0 A 18572 670
1 B 27570 5572
2 C 21087 10739
3 D 9973 8013
4 E 2859 6296

i) Add a column to loans_df called Total Loans. This represents the total number of 36 & 60 month loans belonging to the grade. [0.5].

In [27]:
#Add a column to loans_df called 'Total Loans'
# Write your code here

loans_df['Total Loans'] = loans_df['36 months'] + loans_df['60 months']

loans_df
Out[27]:
grade 36 months 60 months Total Loans
0 A 18572 670 19242
1 B 27570 5572 33142
2 C 21087 10739 31826
3 D 9973 8013 17986
4 E 2859 6296 9155
5 F 580 2353 2933
6 G 96 620 716

ii)

  • What proportion of the loans in Grade A are 36 month loans?
  • What proportion of the loans in Grade A are 60 month loans?

[0.5]

In [28]:
# What proportion of the loans in Grade A are 36 month loans?
# Code is not required. However, you could modify this line of code to help you
18572 / (18572 + 670) *100

# (Type your answer here)
# 96.51803346845442% of loans in Grade A are 36 month loans.
Out[28]:
96.51803346845442
In [29]:
# What proportion of the loans in Grade A are 60 month loans?
# Code is not required. However, you could modify this line of code to help you
670 / (18572 + 670)* 100

# (Type your answer here)
# 3.481966531545577% of loans in Grade A are 60 month loans.
Out[29]:
3.4819665315455772

iii) Using the answers to i) and ii) or otherwise, add 2 columns to loan_df:

  • 36 months percentage is the proportion of loans that has a 36 month term belonging to the grade

  • 60 months percentage is the proportion of loans that has a 60 month term belonging to the grade

[1]

In [30]:
# Write the code for adding the columns to loans_df below.
# Write your code here

loans_df['36 months percentage'] = loans_df['36 months']/ loans_df['Total Loans'] * 100
loans_df['60 months percentage'] = loans_df['60 months']/ loans_df['Total Loans'] * 100

loans_df
Out[30]:
grade 36 months 60 months Total Loans 36 months percentage 60 months percentage
0 A 18572 670 19242 96.518033 3.481967
1 B 27570 5572 33142 83.187496 16.812504
2 C 21087 10739 31826 66.257148 33.742852
3 D 9973 8013 17986 55.448682 44.551318
4 E 2859 6296 9155 31.228837 68.771163
5 F 580 2353 2933 19.774974 80.225026
6 G 96 620 716 13.407821 86.592179

iv) Plot a stacked bar-graph where each bar in the $x$-axis is a grade and the $y$-axis represents the percentage of loans belonging to the grade. Hint: As each bar represents a percentage, all bars have the same height of $100\%$. Also, use DataFrame.plot(kind='bar', stacked=True) and add more parameters accordingly. [1]

In [31]:
# Write the code for plotting the stacked bar-graph below.
# Write your code here

#loans_df = loans_df[['grade'],['36 months percentage'],['60 months percentage']]
#loans_df.index = loans_df['grade']

loans_df1 = loans_df.drop(columns = ['36 months','60 months','Total Loans'])
loans_df1.head()

fig, ax1 = plt.subplots()
loans_df1.plot(kind='bar', stacked=True, figsize=(18,10),  
                                     title="No. of Loans by Grade", ax=ax1)
ax1.set_ylabel("No. of Loans")
plt.show()

2C. You will be using the wines-200.csv again for this question. Read from the wines-200.csv dataset from 2A to wines_df2. Remember to perform the dropna() step.

In [32]:
# Read from CSV file to wines_df2
# Remember to dropna()
# Write your code here.

wines_df2 = pd.read_csv('wines-200.csv')

wines_df2 = wines_df2.dropna()
wines_df2.head()
Out[32]:
country points price
0 US 86 50.0
1 Italy 92 48.0
2 France 89 19.0
3 US 92 34.0
4 US 84 20.0

i) Find the top 2 countries with the most wines in the dataset. [0.5]

In [43]:
# What are the top 2 countries in terms of number of wines produced?
# Write your code here
#
country_count = wines_df2.groupby('country').size().reset_index(name='c')
country_count.sort_values('c',ascending=False,inplace=True)
print()
top_2_countries = country_count.iloc[:2]['country']
print(top_2_countries)
top_2_countries = top_2_countries.tolist()
print()
print(top_2_countries)

#country_count = wines_df.grouby('country').size().reset_index(name='c')
#country_count.sort_values('c',ascending=False, inplace= True)

#store the list of top 2 into a list

# (Type your answer here)
# US, Italy has the most wines in the dataset. 
15       US
9     Italy
Name: country, dtype: object

['US', 'Italy']

ii) Using your answer in (i) or otherwise, debug and complete the following code to plot a violin plot of the wine prices of the top 2 countries with the most wines in the dataset. There should be a seperate violin plot for each country in the top 2. Refer to the seaborn documentation, this Stackoverflow answer and this Stackoverflow answer to help you. [1]

In [45]:
# Debug and complete the code for plotting the violin below.

#wines_df = wines_df[wines_df['country'].isin(top_2_countries)]
#wines_df['one']=1
#sns.violinplot(x="one", y='price', hue="country',data=wines_df3, palette="muted")

wines_df3= wines_df[wines_df['country'].isin(top_2_countries)]
wines_df3['one'] =1 
fig= plt.figure(figsize=(8,8))
ax= fig.add_subplot(111)
sns.violinplot(x="one", y='price', hue='country', data=wines_df3, split=True, palette="muted",ax=ax)
ax.set_xticklabels(list(top_2_countries))


ax.set_ylabel("Price")
ax.set_xlabel("")
ax.set_xticklabels(['country'])
ax.set_title("Violin plot of wine prices by country", fontsize=14)

plt.show()
#plot a double joint violin plot to see the distribution 
C:\Users\yamba\Anaconda3\lib\site-packages\ipykernel_launcher.py:8: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
**Checkpoint**: When you have completed Part 2, upload your progress to eLearn. eLearn only accepts the latest notebooks so if you have an existing notebook, your latest upload will **override** it.

Disclaimer: 

This is a personal website. The opinions expressed here represent my own and not those of my employer. 

In addition, my thoughts and opinions change from time to time I consider this a necessary consequence of having an open mind.

All rights reserved 2024 

Privacy Policy applies 

Terms and Conditions apply.