• Home
  • Data Analytics Projects
    • Data Analytics Using Python
      • Mini Project: AirBnB Recommender App
      • Data Visualization
      • Text Mining (Text Nomalisation)
      • Statistics Analysis
      • A Study of Food Consumption in the World
    • DataCamp Projects
    • Tableau Visualizations
    • Experimenting with Google Teachable Machines
    • Data Analytics Using R
      • Udemy Projects
      • Using Dplyr
      • Hypothesis Testing
      • Linear Regression (Project1)
      • Linear Regression (Project 2)
      • Linear Regressions Exploring Interactions
      • Regression Models
      • Multiple Regression (Project 2)
  • Philanthropy
    • 2018 Water for Life
    • 2019 Habitat for Humanity Global Build
  • My Thoughts
  • Contact
BarbaraYam.com
  • Data Analytics Projects
    • Data Analytics Using Python
      • Mini Project: AirBnB Recommender App
      • Data Visualization
      • Text Mining (Text Nomalisation)
      • Statistics Analysis
      • A Study of Food Consumption in the World
    • DataCamp Projects
    • Tableau Visualizations
    • Experimenting with Google Teachable Machines
    • Data Analytics Using R
      • Udemy Projects
      • Using Dplyr
      • Hypothesis Testing
      • Linear Regression (Project1)
      • Linear Regression (Project 2)
      • Linear Regressions Exploring Interactions
      • Regression Models
      • Multiple Regression (Project 2)
  • Philanthropy
    • 2018 Water for Life
    • 2019 Habitat for Humanity Global Build
  • My Thoughts
  • Contact

Statistics Analysis using Python

Here, we explore statistics concepts of mean, median and mode. Skewness of the data is also being explored. And we end with the kernal density estimation (KDE) plot. 

Barbara Yam_-ds102-problem-set-week5

DS102 Problem Set - Week 5

Hackwagon Academy

HWA-DS102-PS5-201810


If you have any questions to the Problem Set, feel free to send a message to your TAs or Instructors in the Telegram chat. You are encouraged to also discuss the problem sets within your own groups.

Before you start, fill in your details for this notebook and rename the notebook to `_-problem-set-week-5`, for example `jane-doe-problem-set-week-5`.
In [ ]:
# Rename the notebook. Then, fill in your details here.
# Name  : Barbara Yam
# Class : 5 May 2019

Datasets Required:¶

  • insurance-claims.csv

In this problem set, you will be using what you have learnt in DS102 to understand an applied problem in actuarial science, in particular the concept of moral hazard.

Problem Statement¶

In insurance, an insurance policy is a contract between a policy holder and the insurance company. For a fee or premium, the policy holder receives financial protection in the event of a loss. This claim due to the loss will be paid by the insurance company.

The claim that is paid by the insurance company or insurer to the policy holder or insured (usually an individual) is subject to a deductible, a co-insurance payment and a policy limit. The definition for each of the terms is as follows:

  • A Deductible is the minimum sum that is payable by the policy holder, before the insurance company needs to pay out to the policy holder.

  • A Co-insurance is the proportion of the bill that is payable by the insurance company. The remaining proportion is payable by the policy holder.

  • A Out-of-pocket Limit / Policy Limit is the maximum amount that the insurance company has to pay for a claim. The remaining amount needs to be paid by the policy holder.

Consider the following example:

Two travellers, A and B paid for a travel insurance policy before their flights. This policy has a deductible of $d=\$300$. The **co-insurance** ratio is $c=0.2$ or $20\%$ and the** policy limit** is $l=\$5,000$.

A. Traveller A was not able to fly back due to a typhoon at the holiday destination. A files for a claim of $\$2,000$ for A's flight back. The payout from the insurer is calculated as follows:

  1. After subjecting to a deductible, the amount to be paid by the insurer is $2000-d=2000-300=1700$
  2. After subjecting to co-insurance, the amount to be paid is $(1-c)\times 1700 = (1-0.2)\times 1700=1360$

and hence A will receive a claim amount of $\$1360$ from the insurer and his out-of-pocket payment is $\$640$.

B. Traveller B's valuables and laptop was stolen while travelling. B files for a claim of $\$12,000$ which was the value of all the items stolen. The payout from the insurer is calculated as follows:

  1. After subjecting to a deductible, the amount to be paid by the insurer is $12000-d=12000-300=11700$
  2. After subjecting to co-insurance, the amount to be paid is $(1-c)\times 11700 = (1-0.2)\times 11700=9360$
  3. The policy limit is $l=5000$. Since the claim amount, $\$9360$ is higher than $\$5000$, the amount to be paid is only the policy limit of $\$5000$

and hence B will receive a claim amount of $\$5000$ from the insurer and his out-of-pocket payment is $\$7000$.

**Problem Statement **
In this problem, the task is to study the effects of each of the following instruments on the total amount the insurance company needs to pay out.

Dataset¶

You are given a baggage loss dataset from the TSA Claims Database. In this dataset, the case_number represents the unique ID associated with the case from the insurer and claim_amount is the amount of claim filed by the passenger, or the loss due to travelling borne by the passenger.

Answer the following questions. Mark breakdowns are stated in the question e.g. [1]. **Where stated, both the code and typed answers must be provided to receive full credit.**

In [28]:
# import the relevant packages

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

Read from CSV to df¶

Read the dataset from insurance-claims.csv to a DataFrame called df.

In [5]:
# Read the dataset from CSV to a df. Write your code here

df= pd.read_csv('insurance-claims.csv')

Part 1 (6 Marks)¶

1 i) What is the mean of claim_amount? [0.5]

In [6]:
# What is the mean of claim_amount?
# (Write your code here)
#
print(df['claim_amount'].mean())
# (Type your answer here)
# The mean is $131.41266399999995. 
131.41266399999995

1 ii) What is the variance of claim_amount? [0.5]

In [13]:
# What is the variance of claim_amount? 
# (Write your code here)
#
print(df['claim_amount'].var(axis=0))
# (Type your answer here)
#The variance is $25235.105180495306.
25235.105180495306

1 iii)

  • What is the skewness of claim_amount? [0.5]

  • How do you interpret the result of claim_amount's skewness value? [0.5]

In [14]:
# What is the skewness of claim_amount? 
# (Write your code here)
#
print(df['claim_amount'].skew())
# (Type your answer here)
#The skew is 2.4650649305683907

# How do you interpret the result of claim_amount's skewness value?
# (Type your answer here)
# The skewness is positive (>0), indicating that claim_amount is skewed right (right-tailed). 
2.4650649305683907

1 iv) Fill in the blanks: Using the interquartile-range method, a claim_amount is considered an outlier if it is below _1.5*IQR_____ or above _1.5*IQR_____. [1]

In [16]:
# Find the range of values an outlier of claim_amount can take (find the lower bound & upper bound)
# (Write your code here)
#
quartile_1 = np.percentile(df['claim_amount'], 25)
quartile_3 = np.percentile(df['claim_amount'], 75)
iqr = quartile_3 - quartile_1

lower_bound = quartile_1 - 1.5*iqr
upper_bound = quartile_3 + 1.5*iqr
                              
print(lower_bound)
print(upper_bound)                            
# (Type your answer here)
# A claim_amount is considered an outlier if it is below _-153.81_____ or above _352.53_____
-153.81
352.53

1 v) What proportion of the observations in claim_amount are outliers? [1]

In [23]:
# What proportion of the observations in claim_amount are outliers?
# (Write your code here)
#
outliers_df = df[(df['claim_amount']<-153.81)|(df['claim_amount']>352.53)]
print(len(outliers_df)/len(df)*100)
# (Type your answer here)
#8.04% of the observations are outliers. 
8.04

1 vi) Plot a Kernel Density Estimate (KDE) plot for claim_amount using sns.kdeplot(). Use the Gaussian kernel with bandwidth 25. In other words, add the parameters kernel='gau' and bw=25 when plotting. Also add the parameter shade=True to shade the area under the graph.[1]

Hint: Use the following lines to help you.

ax1.set_xlabel("claim_amount")
ax1.set_ylabel("density")
ax1.set_title("KDE plot of claim_amount")
In [67]:
# Plot a KDE plot for claim_amount
# (Write your code here)
fig = plt.figure(figsize=(16,8))
ax1 = fig.add_subplot(111)

sns.kdeplot(df['claim_amount'], kernel='gau', bw=25, shade= True, ax=ax1)
ax1.set_xlabel("claim_amount")
ax1.set_ylabel("density")
ax1.set_title("KDE plot of claim_amount")
ax1.legend(['claim_amount'])
Out[67]:
<matplotlib.legend.Legend at 0x1abff4c96d8>

1 vii)

  • What is the difference when calculating ($\text{median}$ - $\text{mean}$) of claim_amount? How will you interpret this value, in relation to the skewness of the distribution? Hint: Use the answers in 1 i) and 1 vii) to help you. [1]
In [40]:
# What is the difference when calculating (median - mean) of claim_amount? 
# (Write your code here)
#
print(df['claim_amount'].median()-df['claim_amount'].mean())
# (Type your answer here)
# The difference is $-60.28766399999995.

# How will you interpret the value of (median - mean), in relation to the skewness
# of the distribution?
# (Type your answer here)
# Since (median-mean) < 0, it indicates that median is on the left of the mean. 
# The graph is positively skewed and skewed to the right (right tailed). 
-60.28766399999995

**Checkpoint**: When you have completed Part 1, upload your progress to eLearn. eLearn only accepts the latest notebooks so if you have an existing notebook, your latest upload will **override** it.

Part 2 (6 Marks)¶

For this question, the following values are defined:

Instrument Variable Value
Deductible $d$ $50$
Co-insurance ratio $c$ $0.3$
Policy Limit $l$ $500$

2 i) Write a function compute_claim_after_deductible(claim_amount) that takes in a claim amount claim_amount as the only parameter. compute_claim_after_deductible() returns 0 if the claim_amount is less than 50, and returns (claim_amount - 50) if the claim amount is above 50.

Hint: You may use the following code to help you. If you do, take note that you need to correct the logic errors. [0.5]

In [64]:
# Write a function compute_claim_after_deductible(claim_amount)

# (Write your code here)
def compute_claim_after_deductible(claim_amount):
    claim = claim_amount
    d = 50
    if d <= claim:
        return(claim - d)
    else:
        return(0)

2 ii) Now, create a new column claim_amount_with_deductible. This is the result of applying compute_claim_after_deductible to the column claim_amount. [0.5]

In [66]:
# Create a new column claim_amount_with_deductible by using Series.apply()
# (Write your code here)
df['claim_amount_with_deductible'] = df['claim_amount'].apply(compute_claim_after_deductible)

2 iii) Plot a KDE plot for both claim_amount and claim_amount_with_deductible using sns.kdeplot(). They MUST be in the same KDE plot. Use the Gaussian kernel with bandwidth 25. In other words, add the parameters kernel='gau', bw=25 when plotting. Also, add the parameters shade=True to shade the area under the KDE plot. [1]

Hint: Use the following lines to help you.

fig2 = plt.figure(figsize=(16, 8))
ax1 = fig2.add_subplot(111)
In [74]:
# Plot a KDE plot for both claim_amount and claim_amount_with_deductible. They MUST be
# in the same plot.
# (Write your code here)
fig2 = plt.figure(figsize=(16, 8))
ax1 = fig2.add_subplot(111)

sns.kdeplot(df['claim_amount'], kernel='gau', bw=25, shade= True, ax=ax1,color='blue')
ax1.set_xlabel("claim_amount")
ax1.set_ylabel("density")
ax1.set_title("KDE plot of claim_amount")
ax1.legend(['claim_amount','claim_amount_with_deductible'])

sns.kdeplot(df['claim_amount_with_deductible'], kernel='gau', bw=25, shade= True, ax=ax1,color='red')
Out[74]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ab80ac1ef0>

2 iv) How has the deductible impacted the shape of the KDE plot? Describe the change in terms of the shift of the mean and the proportion of the observations having the value of 0.

Hint: Consider using the results of df.describe() to explain your answer. [1]

In [73]:
# How has the deductible impacted the shape of the KDE plot? 
# Code is not required. However, you could use df.describe() to help you.
#
df.describe()
# (Type your answer here)
# The mean for claim_amount_with_deductible is lower than that of claim_amount. 
# The mean is shifted to the left. 
# The standard deviation has also been reduced; this makes the graph more skewed to the right.
# The claims less than $50 become $0 with deductible. 
# There is a higher percentage of observations having 0, the bottom 25% claims 0 now,
# compared to the values in the claim_amount column. 
Out[73]:
claim_amount claim_amount_with_deductible
count 2500.000000 2500.000000
mean 131.412664 89.682040
std 158.855611 153.500587
min 2.000000 0.000000
25% 36.067500 0.000000
50% 71.125000 21.125000
75% 162.652500 112.652500
max 991.000000 941.000000

2 v) Write a function compute_claim_payable_by_insurer(claim_amount) that takes in a claim amount claim_amount as the only parameter. compute_claim_payable_by_insurer() performs the following:

  1. if the claim_amount is less than 50, return 0
  2. if the claim amount is above 50, deduct 50 from the claim_amount, then calculate the copayment paid by the insurer which is $1-0.3=0.7$ of the remaining amount
  3. if the value calculated in step 2 exceeds the policy limit of 500, then return 500. If not, return the value calculated in step 2

Hint: You may use the following code to help you. If you do, take note that you need to correct the logic errors. [1]

In [85]:
# Write a function compute_claim_payable_by_insurer(claim_amount)

# (Write your code here)
#
def compute_claims_payable_by_insurer(claim_amount):
    d = 50
    c = 0.3
    l = 500

    claim = claim_amount
    payout = 0

    if claim < d:
        payout = payout
    else:
        payout = (claim-d) * (1-c)
        if payout > l:
            payout = l
    return (payout)

2 vi) Now, create a new column claim_amount_by_insurer. This is the result of applying compute_claim_payable_by_insurer() to the column claim_amount Be careful! Choose the correct column. [0.5]

In [86]:
# Create a new column claim_amount_with_deductible by using Series.apply()
# (Write your code here)

df['claim_amount_by_insurer']= df['claim_amount'].apply(compute_claims_payable_by_insurer)

df.head()
Out[86]:
case_number claim_amount claim_amount_with_deductible claim_amount_by_insurer
0 2004103065920 50.00 0.00 0.000
1 0226059M 312.50 262.50 183.750
2 2006012099982 90.00 40.00 28.000
3 2005010372123 216.00 166.00 116.200
4 2004092263584 178.74 128.74 90.118

2 vii) Run the following cell to find out key statistics of the df. Fill in the blanks:

The maximum of claim_amount_by_insurer is this equal to the ( ? ). [0.5]

In [87]:
# Uncomment and run the following lines of code to see the report.
print("=== SUMMARY STATISTICS ===")
print(df.describe())
print()
print("=== SKEWNESS OF SERIES ===")
print(df.skew())
print()
print("=== SUM OF SERIES ========")
print(df[['claim_amount', 'claim_amount_with_deductible', 'claim_amount_by_insurer']].sum())
print()
print("=== VARIANCE OF SERIES ===")
print(df[['claim_amount', 'claim_amount_with_deductible', 'claim_amount_by_insurer']].var())

# Uncomment and run the following lines to see the KDE plot for both series.
fig3 = plt.figure(figsize=(16, 6))
ax1 = fig3.add_subplot(111)
sns.kdeplot(df['claim_amount'], shade=True, kernel='gau', bw=25, ax=ax1, color='red')
sns.kdeplot(df['claim_amount_by_insurer'], shade=True, kernel='gau', bw=25, ax=ax1, color='purple')
ax1.set_xlabel("claim_amount")
ax1.set_ylabel("density")
ax1.set_title("KDE plot of claim_amount, loss and paid by insurer",fontsize=16)
plt.show()

# No additional code is required.
# Answer the following questions:
# Fill in the blanks with one of the following: deductible, co-payment, policy limit
# The maximum of claim_amount_by_insurer is this equal to the __policy limit____.
=== SUMMARY STATISTICS ===
       claim_amount  claim_amount_with_deductible  claim_amount_by_insurer
count   2500.000000                   2500.000000              2500.000000
mean     131.412664                     89.682040                61.883404
std      158.855611                    153.500587               103.325938
min        2.000000                      0.000000                 0.000000
25%       36.067500                      0.000000                 0.000000
50%       71.125000                     21.125000                14.787500
75%      162.652500                    112.652500                78.856750
max      991.000000                    941.000000               500.000000

=== SKEWNESS OF SERIES ===
claim_amount                    2.465065
claim_amount_with_deductible    2.635603
claim_amount_by_insurer         2.412770
dtype: float64

=== SUM OF SERIES ========
claim_amount                    328531.660
claim_amount_with_deductible    224205.100
claim_amount_by_insurer         154708.511
dtype: float64

=== VARIANCE OF SERIES ===
claim_amount                    25235.105180
claim_amount_with_deductible    23562.430327
claim_amount_by_insurer         10676.249418
dtype: float64

2 viii) The definition of moral hazard. It states, with the application to property:

When a property owner obtains insurance on a property, the contract is based on the idea that the property owner will avoid situations that may damage the property. The moral hazard exists that the property owner, because of the availability of the insurance, may be less inclined to protect the property, since the payment from an insurance company lessens the burden on the property owner in the case of a disaster.

Using the KDE plot and report in 2 vii), answer the following questions.

i) When a traveller buys travel insurance for his/her trip, how does moral hazard exist? Explain by comparing how his / her behaviour changes when there is and when there isn't travel insurance for the trip.

ii) By considering the burden of payment of both the insurer and insured, and using the concept of moral hazard, explain why an insurance company needs to implement a deductible, co-insurance and a policy limit for this travel insurance policy.

[1]

In [ ]:
# No additional code is required.
# Answer the following questions: When a traveller buys travel insurance for his/her 
# trip, how does moral hazard exist?
# (Type your answer here)
For example, since there is availability of insurance, the traveller is less likely to take care of 
his/her own belongings, or even himself/herself up to the policy limit. 
If personal items get stolen, and traveller can claim and buy a new one at no 
additional costs on his/her part, traveller is more inclined to declare the stolen items. 
In [ ]:
# No additional code is required.
# Why does an insurance company need to implement a deductible, 
# co-insurance and a policy limit for this travel insurance policy.
# (Type your answer here)
Since there is now a copayment where insurer and insured to share the costs of the claims, 
70% and 30% respectively after deductible, there is less incentives for the insured to lose their items 
or to be reckless, sometimes or to even to make a claim, especially if a loss is less than $50. 
This encourages travellers to take partial responsibility, and to reduce unethical behaviours.
The policy limit safeguards the insurance company from going out of business as there is a policy limit
to each claim. 
**Checkpoint**: When you have completed Part 2, upload your progress to eLearn. eLearn only accepts the latest notebooks so if you have an existing notebook, your latest upload will **override** it.

Credits

  • healthcare.gov for the definitions of instruments
  • Investopedia for the definitions
  • TSA Claims Database for the dataset

Disclaimer: 

This is a personal website. The opinions expressed here represent my own and not those of my employer. 

In addition, my thoughts and opinions change from time to time I consider this a necessary consequence of having an open mind.

All rights reserved 2024 

Privacy Policy applies 

Terms and Conditions apply.