BarbaraYam.com

Udemy projects

So I have fallen in love with this new instructor on Udemy, Kirill Eremenko from www.superdatascience.com, and this is my second course with him.

 

I have been exposed to R for a while, but I could never figure out what is so amazing about this programming language that mathematicians and some academics swear by it. As he explained each topic in the course, I had one huge aha moment after another.

 

Interestingly, he was re-introduced to me by my Machine Learning lecturer, who said in Feb 2020 that Kirill's course was one of the best for learning Machine Learning and that it gave him new realizations each day. I went back and realized that I had actually bought a ton of Kirill's courses; they were sitting in my Udemy account, yet I had never found the motivation to get started. Thanks to the circuit breaker (and the extended circuit breaker), I am finding more focused blocks of time for the self-improvement that I have always put on the back burner. Here are the projects from his Udemy course: R Programming from A to Z: R for Data Science with Real Exercises.

 

Enjoy!  

 

Investigating the Law of Large Numbers

Barbara Yam

5/1/2020

# Investigating the law of Large Numbers

N <- 1000
counter <- 0
for (i in rnorm(N)){
  if(i> -1 & i <1){
    counter <- counter +1 
  
  }
}

answer <- counter/N *100
answer 

# Thoughts: N could be replaced with a bigger and bigger number to see
# if the answer converges to the expected proportion of about 68.3%
# (the share of a standard normal distribution that lies between -1 and 1)
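
The idea in the closing comment can be tried out directly. The sketch below is one possible way to do it, not code from the course: it swaps the loop for a vectorised comparison, and the helper name `share_within_one_sd`, the seed, and the list of sample sizes are all made up for illustration.

```r
# Share (in %) of N standard-normal draws that fall strictly between -1 and 1.
# share_within_one_sd is a hypothetical helper name, not part of the course code.
share_within_one_sd <- function(N) {
  x <- rnorm(N)
  mean(x > -1 & x < 1) * 100
}

set.seed(42)  # arbitrary seed so that a run can be repeated
sizes <- c(10, 100, 1000, 10000, 100000, 1000000)
shares <- sapply(sizes, share_within_one_sd)

# Theoretical value from the normal CDF: P(-1 < X < 1), in %
expected <- (pnorm(1) - pnorm(-1)) * 100

# One row per sample size, with the distance from the theoretical value
data.frame(N = sizes, share = shares, gap = abs(shares - expected))
```

The `gap` column should generally shrink as `N` grows, which is what the Law of Large Numbers predicts, although any single small sample can still land far from the theoretical value.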

Financial Analysis Project

Barbara Yam

6 May 2020

#Data

revenue <- c(14574.49, 7606.46, 8611.41, 9175.41, 8058.65, 8105.44, 11496.28, 9766.09, 10305.32, 14379.96, 10713.97, 15433.50)
expenses <- c(12051.82, 5695.07, 12319.20, 12089.72, 8658.57, 840.20, 3285.73, 5821.12, 6976.93, 16618.61, 10054.37, 3803.96)

The task is to calculate the following financial metrics:

- profit for each month
- profit after tax for each month (the tax rate is 30%)
- profit margin for each month, equal to profit after tax divided by revenue
- good months, where the profit after tax was greater than the mean for the year
- bad months, where the profit after tax was less than the mean for the year
- the best month, where the profit after tax was max for the year
- the worst month, where the profit after tax was min for the year

Note:

  1. Results for dollar values need to be calculated with $0.01 precision, but need to be presented in units of $1000 with no decimal points.
  2. Results for the profit margin ratio need to be presented in units of % with no decimal points.

Solution

#profit for each month

profit <- revenue - expenses
profit
##  [1]  2522.67  1911.39 -3707.79 -2914.31  -599.92  7265.24  8210.55  3944.97
##  [9]  3328.39 -2238.65   659.60 11629.54

#profit after tax for each month (tax is 30%)

profit_after_tax <- round(0.7 * profit,2)
profit_after_tax
##  [1]  1765.87  1337.97 -2595.45 -2040.02  -419.94  5085.67  5747.38  2761.48
##  [9]  2329.87 -1567.06   461.72  8140.68

profit margin for each month

profit_margin <- round(profit_after_tax/revenue,2) *100 
profit_margin
##  [1]  12  18 -30 -22  -5  63  50  28  23 -11   4  53

good months - where profit after tax is greater than the mean for the year

mean_profit_after_tax <- mean(profit_after_tax)
good_months <- profit_after_tax > mean_profit_after_tax
good_months
##  [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE

bad months

bad_months <- !good_months
bad_months
##  [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE

best month where profit after tax is the max for the year

best_month <- profit_after_tax == max(profit_after_tax)
best_month
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

worst month where profit after tax is the min for the year

worst_month <- profit_after_tax == min(profit_after_tax)
worst_month
##  [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
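
The logical vectors above are easier to read once they are turned into month names. The following is a minimal sketch, assuming the twelve values run from January to December (the exercise does not say which month comes first) and using R's built-in `month.abb` constant. The variable names `good`, `best` and `worst` are chosen just for this example.

```r
revenue <- c(14574.49, 7606.46, 8611.41, 9175.41, 8058.65, 8105.44,
             11496.28, 9766.09, 10305.32, 14379.96, 10713.97, 15433.50)
expenses <- c(12051.82, 5695.07, 12319.20, 12089.72, 8658.57, 840.20,
              3285.73, 5821.12, 6976.93, 16618.61, 10054.37, 3803.96)
profit_after_tax <- round(0.7 * (revenue - expenses), 2)

# Assumption: position 1 is January and position 12 is December
names(profit_after_tax) <- month.abb

# Names of the months above the yearly mean, and of the max and min months
good <- names(profit_after_tax)[profit_after_tax > mean(profit_after_tax)]
best <- names(which.max(profit_after_tax))
worst <- names(which.min(profit_after_tax))

good
best
worst
```

With this data the selected positions match the TRUE entries printed above: 1, 6, 7, 8, 9 and 12 for the good months, 12 for the best month and 3 for the worst month.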

units of thousands

revenue_1000 <- round(revenue/1000,0)
expenses_1000 <- round(expenses/1000,0)
profit_1000 <- round(profit/1000,0)
profit_after_tax_1000 <- round(profit_after_tax/1000,0)
profit_margin
##  [1]  12  18 -30 -22  -5  63  50  28  23 -11   4  53
good_months
##  [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
bad_months
##  [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
best_month
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
worst_month
##  [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

#matrices

report_matrix <- rbind(revenue_1000,expenses_1000,profit_1000,profit_after_tax_1000,
           profit_margin,good_months,bad_months,best_month,worst_month)

report_matrix
##                       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## revenue_1000            15    8    9    9    8    8   11   10   10    14    11
## expenses_1000           12    6   12   12    9    1    3    6    7    17    10
## profit_1000              3    2   -4   -3   -1    7    8    4    3    -2     1
## profit_after_tax_1000    2    1   -3   -2    0    5    6    3    2    -2     0
## profit_margin           12   18  -30  -22   -5   63   50   28   23   -11     4
## good_months              1    0    0    0    0    1    1    1    1     0     0
## bad_months               0    1    1    1    1    0    0    0    0     1     1
## best_month               0    0    0    0    0    0    0    0    0     0     0
## worst_month              0    0    1    0    0    0    0    0    0     0     0
##                       [,12]
## revenue_1000             15
## expenses_1000             4
## profit_1000              12
## profit_after_tax_1000     8
## profit_margin            53
## good_months               1
## bad_months                0
## best_month                1
## worst_month               0

BasketBall Trends

Barbara Yam

9 May 2020

Dear Student,

Welcome to the dataset for the homework exercise.

Instructions for this dataset: You have only been supplied vectors. You will need to create the matrices yourself. Matrices:

- FreeThrows
- FreeThrowAttempts

Sincerely, Kirill Eremenko www.superdatascience.com

Copyright: These datasets were prepared using publicly available data. However, these scripts are subject to Copyright Laws. If you wish to use these R scripts outside of the R Programming Course by Kirill Eremenko, you may do so by referencing www.superdatascience.com in your work.

Comments: Seasons are labeled based on the first year in the season, e.g. the 2012-2013 season is presented as simply 2012.

Notes and Corrections to the data:

- Kevin Durant: 2006 - College Data Used
- Kevin Durant: 2005 - Proxied With 2006 Data
- Derrick Rose: 2012 - Did Not Play
- Derrick Rose: 2007 - College Data Used
- Derrick Rose: 2006 - Proxied With 2007 Data
- Derrick Rose: 2005 - Proxied With 2007 Data

Seasons

Seasons <- c("2005","2006","2007","2008","2009","2010","2011","2012","2013","2014")

Players

Players <- c("KobeBryant","JoeJohnson","LeBronJames","CarmeloAnthony","DwightHoward","ChrisBosh","ChrisPaul","KevinDurant","DerrickRose","DwayneWade")

Free Throws

KobeBryant_FT <- c(696,667,623,483,439,483,381,525,18,196)
JoeJohnson_FT <- c(261,235,316,299,220,195,158,132,159,141)
LeBronJames_FT <- c(601,489,549,594,593,503,387,403,439,375)
CarmeloAnthony_FT <- c(573,459,464,371,508,507,295,425,459,189)
DwightHoward_FT <- c(356,390,529,504,483,546,281,355,349,143)
ChrisBosh_FT <- c(474,463,472,504,470,384,229,241,223,179)
ChrisPaul_FT <- c(394,292,332,455,161,337,260,286,295,289)
KevinDurant_FT <- c(209,209,391,452,756,594,431,679,703,146)
DerrickRose_FT <- c(146,146,146,197,259,476,194,0,27,152)
DwayneWade_FT <- c(629,432,354,590,534,494,235,308,189,284)

Matrix

FreeThrows <- rbind(KobeBryant_FT,JoeJohnson_FT,LeBronJames_FT,CarmeloAnthony_FT,DwightHoward_FT,ChrisBosh_FT,ChrisPaul_FT,KevinDurant_FT,DerrickRose_FT,DwayneWade_FT)
colnames(FreeThrows) <- Seasons
rownames(FreeThrows) <- Players
FreeThrows
##                2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## KobeBryant      696  667  623  483  439  483  381  525   18  196
## JoeJohnson      261  235  316  299  220  195  158  132  159  141
## LeBronJames     601  489  549  594  593  503  387  403  439  375
## CarmeloAnthony  573  459  464  371  508  507  295  425  459  189
## DwightHoward    356  390  529  504  483  546  281  355  349  143
## ChrisBosh       474  463  472  504  470  384  229  241  223  179
## ChrisPaul       394  292  332  455  161  337  260  286  295  289
## KevinDurant     209  209  391  452  756  594  431  679  703  146
## DerrickRose     146  146  146  197  259  476  194    0   27  152
## DwayneWade      629  432  354  590  534  494  235  308  189  284

Free Throw Attempts

KobeBryant_FTA <- c(819,768,742,564,541,583,451,626,21,241)
JoeJohnson_FTA <- c(330,314,379,362,269,243,186,161,195,176)
LeBronJames_FTA <- c(814,701,771,762,773,663,502,535,585,528)
CarmeloAnthony_FTA <-c(709,568,590,468,612,605,367,512,541,237)
DwightHoward_FTA <- c(598,666,897,849,816,916,572,721,638,271)
ChrisBosh_FTA <- c(581,590,559,617,590,471,279,302,272,232)
ChrisPaul_FTA <- c(465,357,390,524,190,384,302,323,345,321)
KevinDurant_FTA <- c(256,256,448,524,840,675,501,750,805,171)
DerrickRose_FTA <- c(205,205,205,250,338,555,239,0,32,187)
DwayneWade_FTA <- c(803,535,467,771,702,652,297,425,258,370)

Matrix

# rbind, not cbind: each player's vector must become one row (players x seasons)
FreeThrowsAttempts <- rbind(KobeBryant_FTA,JoeJohnson_FTA,LeBronJames_FTA,
                    CarmeloAnthony_FTA,DwightHoward_FTA,ChrisBosh_FTA,
                    ChrisPaul_FTA,KevinDurant_FTA,DerrickRose_FTA,DwayneWade_FTA)
colnames(FreeThrowsAttempts) <- Seasons
rownames(FreeThrowsAttempts) <- Players
FreeThrowsAttempts
##                2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## KobeBryant      819  768  742  564  541  583  451  626   21  241
## JoeJohnson      330  314  379  362  269  243  186  161  195  176
## LeBronJames     814  701  771  762  773  663  502  535  585  528
## CarmeloAnthony  709  568  590  468  612  605  367  512  541  237
## DwightHoward    598  666  897  849  816  916  572  721  638  271
## ChrisBosh       581  590  559  617  590  471  279  302  272  232
## ChrisPaul       465  357  390  524  190  384  302  323  345  321
## KevinDurant     256  256  448  524  840  675  501  750  805  171
## DerrickRose     205  205  205  250  338  555  239    0   32  187
## DwayneWade      803  535  467  771  702  652  297  425  258  370

Game Matrix

KobeBryant_G <- c(80,77,82,82,73,82,58,78,6,35)
JoeJohnson_G <- c(82,57,82,79,76,72,60,72,79,80)
LeBronJames_G <- c(79,78,75,81,76,79,62,76,77,69)
CarmeloAnthony_G <- c(80,65,77,66,69,77,55,67,77,40)
DwightHoward_G <- c(82,82,82,79,82,78,54,76,71,41)
ChrisBosh_G <- c(70,69,67,77,70,77,57,74,79,44)
ChrisPaul_G <- c(78,64,80,78,45,80,60,70,62,82)
KevinDurant_G <- c(35,35,80,74,82,78,66,81,81,27)
DerrickRose_G <- c(40,40,40,81,78,81,39,0,10,51)
DwayneWade_G <- c(75,51,51,79,77,76,49,69,54,62)

Games <- rbind(KobeBryant_G, JoeJohnson_G, LeBronJames_G, CarmeloAnthony_G, DwightHoward_G, ChrisBosh_G, ChrisPaul_G, KevinDurant_G, DerrickRose_G, DwayneWade_G)
rm(KobeBryant_G, JoeJohnson_G, CarmeloAnthony_G, DwightHoward_G, ChrisBosh_G, LeBronJames_G, ChrisPaul_G, DerrickRose_G, DwayneWade_G, KevinDurant_G)
colnames(Games) <- Seasons
rownames(Games) <- Players

Plot free throw attempts per game

myplot2<-function(){
  matplot(t(FreeThrowsAttempts/Games),type="b",pch=15:18,col=c(1:4,6),main ="Free Throw Attempts Per Game")
  legend("topleft",cex=0.5,inset=0.05,legend=Players[1:10], col=c(1:4,6),pch=15:18, horiz=F)
}

myplot2()

Plot accuracy of free throws

round(FreeThrows/FreeThrowsAttempts,2)
##                2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## KobeBryant     0.85 0.87 0.84 0.86 0.81 0.83 0.84 0.84 0.86 0.81
## JoeJohnson     0.79 0.75 0.83 0.83 0.82 0.80 0.85 0.82 0.82 0.80
## LeBronJames    0.74 0.70 0.71 0.78 0.77 0.76 0.77 0.75 0.75 0.71
## CarmeloAnthony 0.81 0.81 0.79 0.79 0.83 0.84 0.80 0.83 0.85 0.80
## DwightHoward   0.60 0.59 0.59 0.59 0.59 0.60 0.49 0.49 0.55 0.53
## ChrisBosh      0.82 0.78 0.84 0.82 0.80 0.82 0.82 0.80 0.82 0.77
## ChrisPaul      0.85 0.82 0.85 0.87 0.85 0.88 0.86 0.89 0.86 0.90
## KevinDurant    0.82 0.82 0.87 0.86 0.90 0.88 0.86 0.91 0.87 0.85
## DerrickRose    0.71 0.71 0.71 0.79 0.77 0.86 0.81  NaN 0.84 0.81
## DwayneWade     0.78 0.81 0.76 0.77 0.76 0.76 0.79 0.72 0.73 0.77
myplot3<- function(){
  matplot(t(round(FreeThrows/FreeThrowsAttempts,1)),type="b",pch=15:18,col=c(1:4,6), main="Accuracy of Free Throws")
  legend("topleft",cex=0.5,inset=0.02,legend=Players[1:10], col=c(1:4,6),pch=15:18, horiz=F)
}

myplot3()
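
The plotting functions in this section repeat the same `matplot` and `legend` calls and differ only in the matrix and the title. One way to avoid the repetition is a single helper that takes both as arguments. The sketch below is an illustration rather than the course's solution: `plot_stat`, its arguments, and the small `demo` matrix are hypothetical names introduced here, and the season labels on the x-axis are an addition.

```r
# plot_stat is a hypothetical helper, not a function from the course.
# stat:  numeric matrix with players in rows and seasons in columns
# title: text for the plot title
# rows:  which players to draw (defaults to all of them)
plot_stat <- function(stat, title, rows = seq_len(nrow(stat))) {
  data <- t(stat[rows, , drop = FALSE])  # matplot draws one line per column
  k <- ncol(data)
  symbols <- rep_len(15:18, k)           # same symbols and colours as above,
  colours <- rep_len(c(1:4, 6), k)       # recycled explicitly to k players
  matplot(data, type = "b", pch = symbols, col = colours, main = title,
          xaxt = "n", xlab = "Season", ylab = "")
  axis(1, at = seq_len(nrow(data)), labels = rownames(data))
  legend("topleft", cex = 0.5, inset = 0.02, legend = colnames(data),
         col = colours, pch = symbols, horiz = FALSE)
  invisible(data)                        # return the plotted matrix quietly
}

# Small demo matrix built from two of the free-throw vectors above
demo <- rbind(KobeBryant = c(696, 667, 623), JoeJohnson = c(261, 235, 316))
colnames(demo) <- c("2005", "2006", "2007")

plotted <- plot_stat(demo, "Free Throws (demo)")
dim(plotted)
```

With the full matrices in the workspace, a call such as `plot_stat(FreeThrows/FreeThrowsAttempts, "Accuracy of Free Throws")` would play the role of `myplot3()`, and the `rows` argument allows a subset of players to be compared.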

Points Matrix

KobeBryant_PTS <-
c(2832,2430,2323,2201,1970,2078,1616,2133,83,782)
JoeJohnson_PTS <-
c(1653,1426,1779,1688,1619,1312,1129,1170,1245,1154)
LeBronJames_PTS <- c(2478,2132,2250,2304,2258,2111,1683,2036,2089,1743)
CarmeloAnthony_PTS <- c(2122,1881,1978,1504,1943,1970,1245,1920,2112,966)
DwightHoward_PTS <- c(1292,1443,1695,1624,1503,1784,1113,1296,1297,646)
ChrisBosh_PTS <- c(1572,1561,1496,1746,1678,1438,1025,1232,1281,928)
ChrisPaul_PTS <- c(1258,1104,1684,1781,841,1268,1189,1186,1185,1564)
KevinDurant_PTS <- c(903,903,1624,1871,2472,2161,1850,2280,2593,686)
DerrickRose_PTS <- c(597,597,597,1361,1619,2026,852,0,159,904)
DwayneWade_PTS <- c(2040,1397,1254,2386,2045,1941,1082,1463,1028,1331)

Points <- rbind(KobeBryant_PTS, JoeJohnson_PTS, LeBronJames_PTS, CarmeloAnthony_PTS, DwightHoward_PTS, ChrisBosh_PTS, ChrisPaul_PTS, KevinDurant_PTS, DerrickRose_PTS, DwayneWade_PTS)
rm(KobeBryant_PTS, JoeJohnson_PTS, LeBronJames_PTS, CarmeloAnthony_PTS, DwightHoward_PTS, ChrisBosh_PTS, ChrisPaul_PTS, KevinDurant_PTS, DerrickRose_PTS, DwayneWade_PTS)

colnames(Points) <- Seasons
rownames(Points) <- Players

Field Goals Matrix

KobeBryant_FG <- c(978,813,775,800,716,740,574,738,31,266)
JoeJohnson_FG <- c(632,536,647,620,635,514,423,445,462,446)
LeBronJames_FG <- c(875,772,794,789,768,758,621,765,767,624)
CarmeloAnthony_FG <- c(756,691,728,535,688,684,441,669,743,358)
DwightHoward_FG <- c(468,526,583,560,510,619,416,470,473,251)
ChrisBosh_FG <- c(549,543,507,615,600,524,393,485,492,343)
ChrisPaul_FG <- c(407,381,630,631,314,430,425,412,406,568)
KevinDurant_FG <- c(306,306,587,661,794,711,643,731,849,238)
DerrickRose_FG <- c(208,208,208,574,672,711,302,0,58,338)
DwayneWade_FG <- c(699,472,439,854,719,692,416,569,415,509)

FieldGoals <- rbind(KobeBryant_FG, JoeJohnson_FG, LeBronJames_FG, CarmeloAnthony_FG, DwightHoward_FG, ChrisBosh_FG, ChrisPaul_FG, KevinDurant_FG, DerrickRose_FG, DwayneWade_FG)
rm(KobeBryant_FG, JoeJohnson_FG, LeBronJames_FG, CarmeloAnthony_FG, DwightHoward_FG, ChrisBosh_FG, ChrisPaul_FG, KevinDurant_FG, DerrickRose_FG, DwayneWade_FG)
colnames(FieldGoals) <- Seasons
rownames(FieldGoals) <- Players

Plot player playing style (2 vs 3 points preference) excluding Free Throws

PointWithoutFreeThrows <- Points - FreeThrows
PointWithoutFreeThrows/FieldGoals
##                    2005     2006     2007     2008     2009     2010     2011
## KobeBryant     2.184049 2.168512 2.193548 2.147500 2.138268 2.155405 2.151568
## JoeJohnson     2.202532 2.222015 2.261206 2.240323 2.203150 2.173152 2.295508
## LeBronJames    2.145143 2.128238 2.142317 2.167300 2.167969 2.121372 2.086957
## CarmeloAnthony 2.048942 2.057887 2.079670 2.117757 2.085756 2.138889 2.154195
## DwightHoward   2.000000 2.001901 2.000000 2.000000 2.000000 2.000000 2.000000
## ChrisBosh      2.000000 2.022099 2.019724 2.019512 2.013333 2.011450 2.025445
## ChrisPaul      2.122850 2.131234 2.146032 2.101426 2.165605 2.165116 2.185882
## KevinDurant    2.267974 2.267974 2.100511 2.146747 2.161209 2.203938 2.206843
## DerrickRose    2.168269 2.168269 2.168269 2.027875 2.023810 2.180028 2.178808
## DwayneWade     2.018598 2.044492 2.050114 2.103044 2.101530 2.091040 2.036058
##                    2012     2013     2014
## KobeBryant     2.178862 2.096774 2.203008
## JoeJohnson     2.332584 2.350649 2.271300
## LeBronJames    2.134641 2.151239 2.192308
## CarmeloAnthony 2.234679 2.224764 2.170391
## DwightHoward   2.002128 2.004228 2.003984
## ChrisBosh      2.043299 2.150407 2.183673
## ChrisPaul      2.184466 2.192118 2.244718
## KevinDurant    2.190150 2.226148 2.268908
## DerrickRose         NaN 2.275862 2.224852
## DwayneWade     2.029877 2.021687 2.056974
myplot4 <- function(){
  matplot(t(round(PointWithoutFreeThrows/FieldGoals,1)),type="b",pch=15:18,col=c(1:4,6), main="Player Playing Style")
  legend("topleft",cex=0.5,inset=0.02,legend=Players[1:10], col=c(1:4,6),pch=15:18, horiz=F)
}

myplot4()
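
The `NaN` for Derrick Rose in 2012 comes from dividing 0 by 0, since he did not play that season. The plot simply leaves a gap there, but summary functions such as `mean()` return `NaN` unless the value is skipped. A minimal sketch of one way to handle it, using only his 2011 to 2013 numbers from the vectors above (the variable names are made up for this example):

```r
# Derrick Rose, seasons 2011 to 2013, copied from the vectors above
points      <- c(852, 0, 159)
free_throws <- c(194, 0, 27)
field_goals <- c(302, 0, 58)

style <- (points - free_throws) / field_goals
style  # the middle value is NaN because 0 / 0 is undefined

# Mark the undefined season as missing so that summaries can skip it
style[is.nan(style)] <- NA
mean(style, na.rm = TRUE)
```

Replacing `NaN` with `NA` is a design choice that makes the missing season explicit; `na.rm = TRUE` would also drop a `NaN` directly.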


World Bank

Barbara Yam

12 May 2020

Project details: You are required to produce a scatterplot depicting Life Expectancy (y-axis) and Fertility Rate (x-axis) statistics by Country.

The scatterplot also needs to be categorised by each country's region. Two years' worth of data has been supplied, 1960 and 2013, and a visualisation is required for each of these years.

Finally, provide insights into how the two periods compare.

stats2 <- read.csv("Section5-Homework-Data.csv")

#split data into data1960 and data2013
data1960 <- stats2[stats2$Year==1960,]
data1960$Region <- factor(data1960$Region)
head(data1960)
##           Country.Name Country.Code       Region Year Fertility.Rate
## 1                Aruba          ABW The Americas 1960          4.820
## 2          Afghanistan          AFG         Asia 1960          7.450
## 3               Angola          AGO       Africa 1960          7.379
## 4              Albania          ALB       Europe 1960          6.186
## 5 United Arab Emirates          ARE  Middle East 1960          6.928
## 6            Argentina          ARG The Americas 1960          3.109
str(data1960)
## 'data.frame':    187 obs. of  5 variables:
##  $ Country.Name  : chr  "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ Country.Code  : chr  "ABW" "AFG" "AGO" "ALB" ...
##  $ Region        : Factor w/ 6 levels "Africa","Asia",..: 6 2 1 3 4 6 2 6 5 3 ...
##  $ Year          : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
##  $ Fertility.Rate: num  4.82 7.45 7.38 6.19 6.93 ...
data2013 <- stats2[stats2$Year==2013,]
data2013$Region <- factor(data2013$Region)
head(data2013)
##             Country.Name Country.Code       Region Year Fertility.Rate
## 188                Aruba          ABW The Americas 2013          1.669
## 189          Afghanistan          AFG         Asia 2013          5.050
## 190               Angola          AGO       Africa 2013          6.165
## 191              Albania          ALB       Europe 2013          1.771
## 192 United Arab Emirates          ARE  Middle East 2013          1.801
## 193            Argentina          ARG The Americas 2013          2.335
str(data2013)
## 'data.frame':    187 obs. of  5 variables:
##  $ Country.Name  : chr  "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ Country.Code  : chr  "ABW" "AFG" "AGO" "ALB" ...
##  $ Region        : Factor w/ 6 levels "Africa","Asia",..: 6 2 1 3 4 6 2 6 5 3 ...
##  $ Year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ Fertility.Rate: num  1.67 5.05 6.16 1.77 1.8 ...
nrow(data2013)
## [1] 187
nrow(data1960)
## [1] 187
nrow(data2013)
## [1] 187
#showing equal split
#Execute below code to generate three new vectors
Country_Code <- c("ABW","AFG","AGO","ALB","ARE","ARG","ARM","ATG","AUS","AUT","AZE","BDI","BEL","BEN","BFA","BGD","BGR","BHR","BHS","BIH","BLR","BLZ","BOL","BRA","BRB","BRN","BTN","BWA","CAF","CAN","CHE","CHL","CHN","CIV","CMR","COG","COL","COM","CPV","CRI","CUB","CYP","CZE","DEU","DJI","DNK","DOM","DZA","ECU","EGY","ERI","ESP","EST","ETH","FIN","FJI","FRA","FSM","GAB","GBR","GEO","GHA","GIN","GMB","GNB","GNQ","GRC","GRD","GTM","GUM","GUY","HKG","HND","HRV","HTI","HUN","IDN","IND","IRL","IRN","IRQ","ISL","ITA","JAM","JOR","JPN","KAZ","KEN","KGZ","KHM","KIR","KOR","KWT","LAO","LBN","LBR","LBY","LCA","LKA","LSO","LTU","LUX","LVA","MAC","MAR","MDA","MDG","MDV","MEX","MKD","MLI","MLT","MMR","MNE","MNG","MOZ","MRT","MUS","MWI","MYS","NAM","NCL","NER","NGA","NIC","NLD","NOR","NPL","NZL","OMN","PAK","PAN","PER","PHL","PNG","POL","PRI","PRT","PRY","PYF","QAT","ROU","RUS","RWA","SAU","SDN","SEN","SGP","SLB","SLE","SLV","SOM","SSD","STP","SUR","SVK","SVN","SWE","SWZ","SYR","TCD","TGO","THA","TJK","TKM","TLS","TON","TTO","TUN","TUR","TZA","UGA","UKR","URY","USA","UZB","VCT","VEN","VIR","VNM","VUT","WSM","YEM","ZAF","COD","ZMB","ZWE")
Life_Expectancy_At_Birth_1960 <- c(65.5693658536586,32.328512195122,32.9848292682927,62.2543658536585,52.2432195121951,65.2155365853659,65.8634634146342,61.7827317073171,70.8170731707317,68.5856097560976,60.836243902439,41.2360487804878,69.7019512195122,37.2782682926829,34.4779024390244,45.8293170731707,69.2475609756098,52.0893658536585,62.7290487804878,60.2762195121951,67.7080975609756,59.9613658536585,42.1183170731707,54.2054634146342,60.7380487804878,62.5003658536585,32.3593658536585,50.5477317073171,36.4826341463415,71.1331707317073,71.3134146341463,57.4582926829268,43.4658048780488,36.8724146341463,41.523756097561,48.5816341463415,56.716756097561,41.4424390243903,48.8564146341463,60.5761951219512,63.9046585365854,69.5939268292683,70.3487804878049,69.3129512195122,44.0212682926829,72.1765853658537,51.8452682926829,46.1351219512195,53.215,48.0137073170732,37.3629024390244,69.1092682926829,67.9059756097561,38.4057073170732,68.819756097561,55.9584878048781,69.8682926829268,57.5865853658537,39.5701219512195,71.1268292682927,63.4318536585366,45.8314634146342,34.8863902439024,32.0422195121951,37.8404390243902,36.7330487804878,68.1639024390244,59.8159268292683,45.5316341463415,61.2263414634146,60.2787317073171,66.9997073170732,46.2883170731707,64.6086585365854,42.1000975609756,68.0031707317073,48.6403170731707,41.1719512195122,69.691756097561,44.945512195122,48.0306829268293,73.4286585365854,69.1239024390244,64.1918292682927,52.6852682926829,67.6660975609756,58.3675853658537,46.3624146341463,56.1280731707317,41.2320243902439,49.2159756097561,53.0013170731707,60.3479512195122,43.2044634146342,63.2801219512195,34.7831707317073,42.6411951219512,57.303756097561,59.7471463414634,46.5107073170732,69.8473170731707,68.4463902439024,69.7868292682927,64.6609268292683,48.4466341463415,61.8127804878049,39.9746829268293,37.2686341463415,57.0656341463415,60.6228048780488,28.2116097560976,67.6017804878049,42.7363902439024,63.7056097560976,48.3688048780488,35.0037073170732,
43.4830975609756,58.7452195121951,37.7736341463415,59.4753414634146,46.8803902439024,58.6390243902439,35.5150487804878,37.1829512195122,46.9988292682927,73.3926829268293,73.549756097561,35.1708292682927,71.2365853658537,42.6670731707317,45.2904634146342,60.8817073170732,47.6915853658537,57.8119268292683,38.462243902439,67.6804878048781,68.7196097560976,62.8089268292683,63.7937073170732,56.3570487804878,61.2060731707317,65.6424390243903,66.0552926829268,42.2492926829268,45.6662682926829,48.1876341463415,38.206,65.6598292682927,49.3817073170732,30.3315365853659,49.9479268292683,36.9658780487805,31.6767073170732,50.4513658536585,59.6801219512195,69.9759268292683,68.9780487804878,73.0056097560976,44.2337804878049,52.768243902439,38.0161219512195,40.2728292682927,54.6993170731707,56.1535365853659,54.4586829268293,33.7271219512195,61.3645365853659,62.6575853658537,42.009756097561,45.3844146341463,43.6538780487805,43.9835609756098,68.2995365853659,67.8963902439025,69.7707317073171,58.8855365853659,57.7238780487805,59.2851219512195,63.7302195121951,59.0670243902439,46.4874878048781,49.969512195122,34.3638048780488,49.0362926829268,41.0180487804878,45.1098048780488,51.5424634146342)
Life_Expectancy_At_Birth_2013 <- c(75.3286585365854,60.0282682926829,51.8661707317073,77.537243902439,77.1956341463415,75.9860975609756,74.5613658536585,75.7786585365854,82.1975609756098,80.890243902439,70.6931463414634,56.2516097560976,80.3853658536585,59.3120243902439,58.2406341463415,71.245243902439,74.4658536585366,76.5459512195122,75.0735365853659,76.2769268292683,72.4707317073171,69.9820487804878,67.9134390243903,74.1224390243903,75.3339512195122,78.5466585365854,69.1029268292683,64.3608048780488,49.8798780487805,81.4011219512195,82.7487804878049,81.1979268292683,75.3530243902439,51.2084634146342,55.0418048780488,61.6663902439024,73.8097317073171,62.9321707317073,72.9723658536585,79.2252195121951,79.2563902439025,79.9497804878049,78.2780487804878,81.0439024390244,61.6864634146342,80.3024390243903,73.3199024390244,74.5689512195122,75.648512195122,70.9257804878049,63.1778780487805,82.4268292682927,76.4243902439025,63.4421951219512,80.8317073170732,69.9179268292683,81.9682926829268,68.9733902439024,63.8435853658537,80.9560975609756,74.079512195122,61.1420731707317,58.216487804878,59.9992682926829,54.8384146341464,57.2908292682927,80.6341463414634,73.1935609756098,71.4863902439024,78.872512195122,66.3100243902439,83.8317073170732,72.9428536585366,77.1268292682927,62.4011463414634,75.2682926829268,68.7046097560976,67.6604146341463,81.0439024390244,75.1259756097561,69.4716829268293,83.1170731707317,82.290243902439,73.4689268292683,73.9014146341463,83.3319512195122,70.45,60.9537804878049,70.2024390243902,67.7720487804878,65.7665853658537,81.459756097561,74.462756097561,65.687243902439,80.1288780487805,60.5203902439024,71.6576829268293,74.9127073170732,74.2402926829268,49.3314634146342,74.1634146341464,81.7975609756098,73.9804878048781,80.3391463414634,73.7090487804878,68.811512195122,64.6739024390244,76.6026097560976,76.5326585365854,75.1870487804878,57.5351951219512,80.7463414634146,65.6540975609756,74.7583658536585,69.0618048780488,54.641512195122,62.8027073170732,
74.46,61.466,74.567512195122,64.3438780487805,77.1219512195122,60.8281463414634,52.4421463414634,74.514756097561,81.1048780487805,81.4512195121951,69.222,81.4073170731707,76.8410487804878,65.9636829268293,77.4192195121951,74.2838536585366,68.1315609756097,62.4491707317073,76.8487804878049,78.7111951219512,80.3731707317073,72.7991707317073,76.3340731707317,78.4184878048781,74.4634146341463,71.0731707317073,63.3948292682927,74.1776341463415,63.1670487804878,65.878756097561,82.3463414634146,67.7189268292683,50.3631219512195,72.4981463414634,55.0230243902439,55.2209024390244,66.259512195122,70.99,76.2609756097561,80.2780487804878,81.7048780487805,48.9379268292683,74.7157804878049,51.1914878048781,59.1323658536585,74.2469268292683,69.4001707317073,65.4565609756098,67.5223658536585,72.6403414634147,70.3052926829268,73.6463414634147,75.1759512195122,64.2918292682927,57.7676829268293,71.159512195122,76.8361951219512,78.8414634146341,68.2275853658537,72.8108780487805,74.0744146341464,79.6243902439024,75.756487804878,71.669243902439,73.2503902439024,63.583512195122,56.7365853658537,58.2719268292683,59.2373658536585,55.633)

#(c) Kirill Eremenko, www.superdatascience.com
life_expectancydf <- data.frame(Code=Country_Code, LifeExpectancy1960 = Life_Expectancy_At_Birth_1960,
                                LifeExpectancy2013= Life_Expectancy_At_Birth_2013)
str(life_expectancydf)
## 'data.frame':    187 obs. of  3 variables:
##  $ Code              : chr  "ABW" "AFG" "AGO" "ALB" ...
##  $ LifeExpectancy1960: num  65.6 32.3 33 62.3 52.2 ...
##  $ LifeExpectancy2013: num  75.3 60 51.9 77.5 77.2 ...
head(life_expectancydf)
##   Code LifeExpectancy1960 LifeExpectancy2013
## 1  ABW           65.56937           75.32866
## 2  AFG           32.32851           60.02827
## 3  AGO           32.98483           51.86617
## 4  ALB           62.25437           77.53724
## 5  ARE           52.24322           77.19563
## 6  ARG           65.21554           75.98610
head(data1960)
##           Country.Name Country.Code       Region Year Fertility.Rate
## 1                Aruba          ABW The Americas 1960          4.820
## 2          Afghanistan          AFG         Asia 1960          7.450
## 3               Angola          AGO       Africa 1960          7.379
## 4              Albania          ALB       Europe 1960          6.186
## 5 United Arab Emirates          ARE  Middle East 1960          6.928
## 6            Argentina          ARG The Americas 1960          3.109
summary(data1960)
##  Country.Name       Country.Code                Region        Year     
##  Length:187         Length:187         Africa      :53   Min.   :1960  
##  Class :character   Class :character   Asia        :33   1st Qu.:1960  
##  Mode  :character   Mode  :character   Europe      :40   Median :1960  
##                                        Middle East :12   Mean   :1960  
##                                        Oceania     :13   3rd Qu.:1960  
##                                        The Americas:36   Max.   :1960  
##  Fertility.Rate 
##  Min.   :1.940  
##  1st Qu.:4.311  
##  Median :6.210  
##  Mean   :5.537  
##  3rd Qu.:6.806  
##  Max.   :8.187
merge1960 <- merge(data1960,life_expectancydf, by.x="Country.Code",by.y="Code")
merge1960$LifeExpectancy2013 <-NULL
merge1960$Year <- NULL
#year not necessary because we already know this is for 1960.

merge2013 <- merge(data2013,life_expectancydf, by.x="Country.Code",by.y="Code" )
merge2013$LifeExpectancy1960 <- NULL
merge2013$Year <- NULL
#year not necessary because we already know this is for 2013. 

head(merge2013)
##   Country.Code         Country.Name       Region Fertility.Rate
## 1          ABW                Aruba The Americas          1.669
## 2          AFG          Afghanistan         Asia          5.050
## 3          AGO               Angola       Africa          6.165
## 4          ALB              Albania       Europe          1.771
## 5          ARE United Arab Emirates  Middle East          1.801
## 6          ARG            Argentina The Americas          2.335
##   LifeExpectancy2013
## 1           75.32866
## 2           60.02827
## 3           51.86617
## 4           77.53724
## 5           77.19563
## 6           75.98610
str(merge2013)
## 'data.frame':    187 obs. of  5 variables:
##  $ Country.Code      : chr  "ABW" "AFG" "AGO" "ALB" ...
##  $ Country.Name      : chr  "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ Region            : Factor w/ 6 levels "Africa","Asia",..: 6 2 1 3 4 6 2 6 5 3 ...
##  $ Fertility.Rate    : num  1.67 5.05 6.16 1.77 1.8 ...
##  $ LifeExpectancy2013: num  75.3 60 51.9 77.5 77.2 ...
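One thing worth keeping in mind with `merge()` is that it does an inner join by default, so a country code missing from either table would disappear without any warning. Here is a minimal sketch of how that could be checked. The two tiny data frames are stand-ins I made up for illustration: only the ARG values come from the output on this page, and the "XXX" and "YYY" rows are hypothetical.

```r
# Toy stand-ins for the real tables. Only the ARG row uses values printed
# on this page; the "XXX" and "YYY" rows are made up for illustration.
fertility <- data.frame(
  Country.Code   = c("ARG", "XXX"),
  Fertility.Rate = c(2.335, 3.000),
  stringsAsFactors = FALSE
)
life <- data.frame(
  Code               = c("ARG", "YYY"),
  LifeExpectancy1960 = c(65.21554, 50.0),
  LifeExpectancy2013 = c(75.98610, 60.0),
  stringsAsFactors = FALSE
)

# Default merge() is an inner join: only codes present in both tables survive
inner <- merge(fertility, life, by.x = "Country.Code", by.y = "Code")

# all.x = TRUE keeps every fertility row and fills unmatched columns with NA
left <- merge(fertility, life, by.x = "Country.Code", by.y = "Code",
              all.x = TRUE)

# Codes that the inner join would drop silently
dropped <- setdiff(fertility$Country.Code, life$Code)

print(inner)
print(left)
print(dropped)
```

On the real data, running the same `setdiff()` on the two code columns before merging would show whether any country is about to be lost.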
library(ggplot2)
qplot(data=merge1960, x=Fertility.Rate, y=LifeExpectancy1960,
      colour=Region,alpha=I(0.6),main="Life Expectancy vs Fertility (1960)")

qplot(data=merge2013, x=Fertility.Rate, y=LifeExpectancy2013,
      colour=Region,alpha=I(0.6),main="Life Expectancy vs Fertility (2013)")
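As far as I know, `qplot()` has been marked as deprecated in recent ggplot2 releases, so the same chart can also be written with the full `ggplot()` grammar. This is only a sketch: `sample2013` is a hypothetical name holding just the six rows printed by `head(merge2013)`, so that the block runs on its own.

```r
library(ggplot2)

# The six rows shown by head(merge2013); the real table has 187 rows
sample2013 <- data.frame(
  Region = c("The Americas", "Asia", "Africa", "Europe",
             "Middle East", "The Americas"),
  Fertility.Rate = c(1.669, 5.050, 6.165, 1.771, 1.801, 2.335),
  LifeExpectancy2013 = c(75.32866, 60.02827, 51.86617,
                         77.53724, 77.19563, 75.98610)
)

# geom_point(alpha = 0.6) plays the role of alpha = I(0.6) in qplot()
p <- ggplot(sample2013,
            aes(x = Fertility.Rate, y = LifeExpectancy2013, colour = Region)) +
  geom_point(alpha = 0.6) +
  ggtitle("Life Expectancy vs Fertility (2013)")

print(p)
```

Swapping `sample2013` for `merge2013` should give the same picture as the `qplot()` call.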

Analysis: European and American countries tend to sit at the top left of both graphs, where fertility rate is lower and life expectancy is higher.

In 1960, European countries had a fertility rate between 2 and 4 and a life expectancy between 60 and 70 years.
In 2013, their life expectancy had risen to between 70 and 80 years, and their fertility rate had fallen below 2.

In 1960, African nations had a fertility rate between 5 and 8 and a life expectancy below 55 years.
In 2013, the same African nations had a slightly lower fertility rate, between 3 and 7, and a life expectancy that had risen to between 50 and 70 years.
Overall, there seems to be a trend: as life expectancy increases, fertility rate falls.
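The trend can also be checked with a number rather than by eye. The sketch below is not part of the course exercise: it uses only the six rows printed by `head(merge2013)` earlier (stored under the hypothetical name `sample2013`), so the figures it prints describe that tiny sample and not the full 187 countries.

```r
# The six rows shown by head(merge2013); the real table has 187 rows
sample2013 <- data.frame(
  Country.Code = c("ABW", "AFG", "AGO", "ALB", "ARE", "ARG"),
  Fertility.Rate = c(1.669, 5.050, 6.165, 1.771, 1.801, 2.335),
  LifeExpectancy2013 = c(75.32866, 60.02827, 51.86617,
                         77.53724, 77.19563, 75.98610)
)

# Pearson correlation: a negative value means the two move in opposite directions
r <- cor(sample2013$Fertility.Rate, sample2013$LifeExpectancy2013)

# Slope of a simple linear fit: years of life expectancy per extra child
fit <- lm(LifeExpectancy2013 ~ Fertility.Rate, data = sample2013)
slope <- unname(coef(fit)["Fertility.Rate"])

print(r)
print(slope)
```

Running the same two calls on `merge1960` and `merge2013` would show whether the relationship holds across all countries in both years.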

Movies Visualisation

Barbara Yam

15 May 2020

#task is to reproduce the boxplot that is created by the end of
# this exercise

movies <- read.csv("Section6-Homework-Data.csv")
# data prep
head(movies)
##   Day.of.Week                Director  Genre       Movie.Title Release.Date
## 1      Friday               Brad Bird action      Tomorrowland   22/05/2015
## 2      Friday             Scott Waugh action    Need for Speed   14/03/2014
## 3      Friday          Patrick Hughes action The Expendables 3   15/08/2014
## 4      Friday Phil Lord, Chris Miller comedy    21 Jump Street   16/03/2012
## 5      Friday         Roland Emmerich action  White House Down   28/06/2013
## 6      Friday              David Ayer action              Fury   17/10/2014
##                Studio Adjusted.Gross...mill. Budget...mill. Gross...mill.
## 1 Buena Vista Studios                  202.1            170         202.1
## 2 Buena Vista Studios                  204.2             66         203.3
## 3           Lionsgate                  207.1            100         206.2
## 4                Sony                  208.8             42         201.6
## 5                Sony                  209.7            150         205.4
## 6                Sony                  212.8             80         211.8
##   IMDb.Rating MovieLens.Rating Overseas...mill. Overseas. Profit...mill.
## 1         6.7             3.26            111.9      55.4           32.1
## 2         6.6             2.97            159.7      78.6          137.3
## 3         6.1             2.93            166.9      80.9          106.2
## 4         7.2             3.62             63.1      31.3          159.6
## 5         8.0             3.65            132.3      64.4           55.4
## 6         5.8             2.85              126      59.5          131.8
##   Profit. Runtime..min. US...mill. Gross...US
## 1    18.9           130       90.2       44.6
## 2   208.0           132       43.6       21.4
## 3   106.2           126       39.3       19.1
## 4   380.0           109      138.4       68.7
## 5    36.9           131       73.1       35.6
## 6   164.8           134       85.8       40.5
summary(movies)
##  Day.of.Week          Director            Genre           Movie.Title       
##  Length:608         Length:608         Length:608         Length:608        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Release.Date          Studio          Adjusted.Gross...mill. Budget...mill.  
##  Length:608         Length:608         Length:608             Min.   :  0.60  
##  Class :character   Class :character   Class :character       1st Qu.: 45.00  
##  Mode  :character   Mode  :character   Mode  :character       Median : 80.00  
##                                                               Mean   : 92.47  
##                                                               3rd Qu.:130.00  
##                                                               Max.   :300.00  
##  Gross...mill.       IMDb.Rating    MovieLens.Rating Overseas...mill.  
##  Length:608         Min.   :3.600   Min.   :1.490    Length:608        
##  Class :character   1st Qu.:6.375   1st Qu.:3.038    Class :character  
##  Mode  :character   Median :6.900   Median :3.365    Mode  :character  
##                     Mean   :6.924   Mean   :3.340                      
##                     3rd Qu.:7.600   3rd Qu.:3.672                      
##                     Max.   :9.200   Max.   :4.500                      
##    Overseas.     Profit...mill.        Profit.        Runtime..min.  
##  Min.   : 17.2   Length:608         Min.   :    7.7   Min.   : 30.0  
##  1st Qu.: 49.9   Class :character   1st Qu.:  201.8   1st Qu.:100.0  
##  Median : 58.2   Mode  :character   Median :  338.6   Median :116.0  
##  Mean   : 57.7                      Mean   :  719.3   Mean   :117.8  
##  3rd Qu.: 66.3                      3rd Qu.:  650.1   3rd Qu.:130.2  
##  Max.   :100.0                      Max.   :41333.3   Max.   :238.0  
##    US...mill.      Gross...US  
##  Min.   :  0.0   Min.   : 0.0  
##  1st Qu.:107.0   1st Qu.:33.7  
##  Median :141.7   Median :41.8  
##  Mean   :167.1   Mean   :42.3  
##  3rd Qu.:202.1   3rd Qu.:50.1  
##  Max.   :760.5   Max.   :82.8
str(movies)
## 'data.frame':    608 obs. of  18 variables:
##  $ Day.of.Week           : chr  "Friday" "Friday" "Friday" "Friday" ...
##  $ Director              : chr  "Brad Bird" "Scott Waugh" "Patrick Hughes" "Phil Lord, Chris Miller" ...
##  $ Genre                 : chr  "action" "action" "action" "comedy" ...
##  $ Movie.Title           : chr  "Tomorrowland" "Need for Speed" "The Expendables 3" "21 Jump Street" ...
##  $ Release.Date          : chr  "22/05/2015" "14/03/2014" "15/08/2014" "16/03/2012" ...
##  $ Studio                : chr  "Buena Vista Studios" "Buena Vista Studios" "Lionsgate" "Sony" ...
##  $ Adjusted.Gross...mill.: chr  "202.1" "204.2" "207.1" "208.8" ...
##  $ Budget...mill.        : num  170 66 100 42 150 80 50 85 70 5 ...
##  $ Gross...mill.         : chr  "202.1" "203.3" "206.2" "201.6" ...
##  $ IMDb.Rating           : num  6.7 6.6 6.1 7.2 8 5.8 6 6.8 6.3 5.9 ...
##  $ MovieLens.Rating      : num  3.26 2.97 2.93 3.62 3.65 2.85 3.16 3.45 2.92 2.9 ...
##  $ Overseas...mill.      : chr  "111.9" "159.7" "166.9" "63.1" ...
##  $ Overseas.             : num  55.4 78.6 80.9 31.3 64.4 59.5 39.9 39.3 73.9 49.8 ...
##  $ Profit...mill.        : chr  "32.1" "137.3" "106.2" "159.6" ...
##  $ Profit.               : num  18.9 208 106.2 380 36.9 ...
##  $ Runtime..min.         : int  130 132 126 109 131 134 125 115 92 84 ...
##  $ US...mill.            : num  90.2 43.6 39.3 138.4 73.1 ...
##  $ Gross...US            : num  44.6 21.4 19.1 68.7 35.6 40.5 60.1 60.7 26.1 50.2 ...
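The `str()` output reports several money columns as `chr` even though the values look numeric. One common cause is a thousands separator such as "1,234.5" somewhere in the column. That is an assumption on my part, since the raw CSV is not shown here. Under that assumption, a minimal sketch of the conversion could look like this (`gross_chr` is a hypothetical vector, not the real column):

```r
# Hypothetical values: the first two mimic the output above, the third adds
# a thousands separator, which is assumed (not confirmed) to be in the CSV
gross_chr <- c("202.1", "203.3", "1,234.5")

# as.numeric() alone would turn "1,234.5" into NA with a warning,
# so the commas are stripped first
gross_num <- as.numeric(gsub(",", "", gross_chr, fixed = TRUE))

print(gross_num)
```

The boxplot exercise only uses columns that are already numeric, so this step is optional here.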
# renaming some columns to easier names

colnames(movies) <- c("DayofWeek","Director","Genre","MovieTitle",
                      "ReleaseDate","Studio","AdjustedGrossinMillions",
                      "BudgetinMillions","GrossinMillions","IMDBRating",
                      "MovieLensRating","OverseasinMillions","OverseasPercent",
                      "ProfitinMillions","ProfitPercent","RuntimeinMin",
                      "USinMillions","GrossPercentUS")

# change studio and genre from character to factors
movies$Studio <- factor(movies$Studio)
movies$Genre <- factor(movies$Genre)
# start mini visualizations
library(ggplot2)

#off topic
ggplot(data=movies, aes(x=DayofWeek)) +geom_bar()

# interestingly most movies are released on Fridays and
# no movies are released on Mondays!
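Because `DayofWeek` is a character column, the bars come out in alphabetical order and a day with no releases is simply missing from the chart. A small sketch of one way to make the empty day explicit, using a hypothetical vector `days` in place of `movies$DayofWeek`:

```r
# Hypothetical release days standing in for movies$DayofWeek
days <- c("Friday", "Wednesday", "Friday", "Thursday", "Friday", "Tuesday")

weekday_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                   "Friday", "Saturday", "Sunday")

# A factor with explicit levels keeps calendar order and remembers
# days that never occur in the data
days_f <- factor(days, levels = weekday_order)

# table() reports a count for every level, including zeros
counts <- table(days_f)
print(counts)
```

For the bar chart itself, adding `scale_x_discrete(drop = FALSE)` should keep the empty levels on the axis as well.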
v <- ggplot(data=movies,aes(x=Genre, y=GrossPercentUS,
                            colour=Studio))

v +geom_boxplot(size=1.2)

# too much data! keep only the needed genres and studios
str(movies)
## 'data.frame':    608 obs. of  18 variables:
##  $ DayofWeek              : chr  "Friday" "Friday" "Friday" "Friday" ...
##  $ Director               : chr  "Brad Bird" "Scott Waugh" "Patrick Hughes" "Phil Lord, Chris Miller" ...
##  $ Genre                  : Factor w/ 15 levels "action","adventure",..: 1 1 1 5 1 1 2 1 1 10 ...
##  $ MovieTitle             : chr  "Tomorrowland" "Need for Speed" "The Expendables 3" "21 Jump Street" ...
##  $ ReleaseDate            : chr  "22/05/2015" "14/03/2014" "15/08/2014" "16/03/2012" ...
##  $ Studio                 : Factor w/ 36 levels "Art House Studios",..: 2 2 11 25 25 25 2 31 31 20 ...
##  $ AdjustedGrossinMillions: chr  "202.1" "204.2" "207.1" "208.8" ...
##  $ BudgetinMillions       : num  170 66 100 42 150 80 50 85 70 5 ...
##  $ GrossinMillions        : chr  "202.1" "203.3" "206.2" "201.6" ...
##  $ IMDBRating             : num  6.7 6.6 6.1 7.2 8 5.8 6 6.8 6.3 5.9 ...
##  $ MovieLensRating        : num  3.26 2.97 2.93 3.62 3.65 2.85 3.16 3.45 2.92 2.9 ...
##  $ OverseasinMillions     : chr  "111.9" "159.7" "166.9" "63.1" ...
##  $ OverseasPercent        : num  55.4 78.6 80.9 31.3 64.4 59.5 39.9 39.3 73.9 49.8 ...
##  $ ProfitinMillions       : chr  "32.1" "137.3" "106.2" "159.6" ...
##  $ ProfitPercent          : num  18.9 208 106.2 380 36.9 ...
##  $ RuntimeinMin           : int  130 132 126 109 131 134 125 115 92 84 ...
##  $ USinMillions           : num  90.2 43.6 39.3 138.4 73.1 ...
##  $ GrossPercentUS         : num  44.6 21.4 19.1 68.7 35.6 40.5 60.1 60.7 26.1 50.2 ...
movies_filter <- (movies$Genre=="action") | (movies$Genre=="adventure")|
                          (movies$Genre=="animation") | (movies$Genre == "comedy") |
                          (movies$Genre == "drama")

movies_filter2 <- (movies$Studio=="Buena Vista Studios") | (movies$Studio=="Fox")|
                  (movies$Studio=="Paramount Pictures") | (movies$Studio=="Sony") |
                  (movies$Studio=="Universal") | (movies$Studio=="WB")
movies_filter2
##   [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
##  [13] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [25]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
##  [37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
##  [49]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [61]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
##  [73] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
##  [85]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
##  [97]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
## [109]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
## [121]  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [133]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [145]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [157] FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [169]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
## [181]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [193]  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
## [205]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
## [217]  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [229]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE
## [241]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
## [253]  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## [265]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
## [277]  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [289]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
## [301]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## [313]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [325]  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## [337]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
## [349]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
## [361]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## [373]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
## [385]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [397]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## [409]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE
## [421] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [433]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [445]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## [457]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [469] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [481] FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## [493]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [505]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE
## [517]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
## [529]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## [541] FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [553]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [565]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## [577]  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [589]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [601]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
movies_filtered <- movies[movies_filter & movies_filter2,]
head(movies_filtered)
##   DayofWeek                Director     Genre       MovieTitle ReleaseDate
## 1    Friday               Brad Bird    action     Tomorrowland  22/05/2015
## 2    Friday             Scott Waugh    action   Need for Speed  14/03/2014
## 4    Friday Phil Lord, Chris Miller    comedy   21 Jump Street  16/03/2012
## 5    Friday         Roland Emmerich    action White House Down  28/06/2013
## 6    Friday              David Ayer    action             Fury  17/10/2014
## 7  Thursday            Rob Marshall adventure   Into the Woods  25/12/2014
##                Studio AdjustedGrossinMillions BudgetinMillions GrossinMillions
## 1 Buena Vista Studios                   202.1              170           202.1
## 2 Buena Vista Studios                   204.2               66           203.3
## 4                Sony                   208.8               42           201.6
## 5                Sony                   209.7              150           205.4
## 6                Sony                   212.8               80           211.8
## 7 Buena Vista Studios                   213.9               50           212.9
##   IMDBRating MovieLensRating OverseasinMillions OverseasPercent
## 1        6.7            3.26              111.9            55.4
## 2        6.6            2.97              159.7            78.6
## 4        7.2            3.62               63.1            31.3
## 5        8.0            3.65              132.3            64.4
## 6        5.8            2.85                126            59.5
## 7        6.0            3.16               84.9            39.9
##   ProfitinMillions ProfitPercent RuntimeinMin USinMillions GrossPercentUS
## 1             32.1          18.9          130         90.2           44.6
## 2            137.3         208.0          132         43.6           21.4
## 4            159.6         380.0          109        138.4           68.7
## 5             55.4          36.9          131         73.1           35.6
## 6            131.8         164.8          134         85.8           40.5
## 7            162.9         325.8          125        128.0           60.1
str(movies_filtered)
## 'data.frame':    423 obs. of  18 variables:
##  $ DayofWeek              : chr  "Friday" "Friday" "Friday" "Friday" ...
##  $ Director               : chr  "Brad Bird" "Scott Waugh" "Phil Lord, Chris Miller" "Roland Emmerich" ...
##  $ Genre                  : Factor w/ 15 levels "action","adventure",..: 1 1 5 1 1 2 1 1 3 8 ...
##  $ MovieTitle             : chr  "Tomorrowland" "Need for Speed" "21 Jump Street" "White House Down" ...
##  $ ReleaseDate            : chr  "22/05/2015" "14/03/2014" "16/03/2012" "28/06/2013" ...
##  $ Studio                 : Factor w/ 36 levels "Art House Studios",..: 2 2 25 25 25 2 31 31 34 25 ...
##  $ AdjustedGrossinMillions: chr  "202.1" "204.2" "208.8" "209.7" ...
##  $ BudgetinMillions       : num  170 66 42 150 80 50 85 70 80 60 ...
##  $ GrossinMillions        : chr  "202.1" "203.3" "201.6" "205.4" ...
##  $ IMDBRating             : num  6.7 6.6 7.2 8 5.8 6 6.8 6.3 4.5 5.6 ...
##  $ MovieLensRating        : num  3.26 2.97 3.62 3.65 2.85 3.16 3.45 2.92 2.17 2.84 ...
##  $ OverseasinMillions     : chr  "111.9" "159.7" "63.1" "132.3" ...
##  $ OverseasPercent        : num  55.4 78.6 31.3 64.4 59.5 39.9 39.3 73.9 50.3 60.6 ...
##  $ ProfitinMillions       : chr  "32.1" "137.3" "159.6" "55.4" ...
##  $ ProfitPercent          : num  18.9 208 380 36.9 164.8 ...
##  $ RuntimeinMin           : int  130 132 109 131 134 125 115 92 80 133 ...
##  $ USinMillions           : num  90.2 43.6 138.4 73.1 85.8 ...
##  $ GrossPercentUS         : num  44.6 21.4 68.7 35.6 40.5 60.1 60.7 26.1 49.7 39.4 ...
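Two things stand out in the filtering step: the long chains of `==` joined by `|`, and the fact that `str(movies_filtered)` still lists 15 genre levels and 36 studio levels after filtering. The sketch below shows a shorter alternative on a made-up miniature table (`toy` and its rows are hypothetical, and this is not how the course does it):

```r
# Hypothetical miniature of the movies table
toy <- data.frame(
  Genre  = factor(c("action", "comedy", "horror", "drama")),
  Studio = factor(c("Sony", "Fox", "Lionsgate", "WB"))
)

keep_genres  <- c("action", "adventure", "animation", "comedy", "drama")
keep_studios <- c("Buena Vista Studios", "Fox", "Paramount Pictures",
                  "Sony", "Universal", "WB")

# %in% tests membership in one step instead of chaining == with |
toy_filtered <- toy[toy$Genre %in% keep_genres &
                      toy$Studio %in% keep_studios, ]

# Subsetting keeps unused factor levels; droplevels() removes them
toy_filtered <- droplevels(toy_filtered)

print(levels(toy_filtered$Genre))
print(levels(toy_filtered$Studio))
```

Dropping unused levels mainly tidies up legends and summaries, since a level with no rows can still take up a legend entry or an axis slot in some plots.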
library(ggplot2)
w <- ggplot(data=movies_filtered,aes(x=Genre, y=GrossPercentUS))

w <- w + geom_jitter(aes(size=BudgetinMillions,colour=Studio))+
  ylab("Gross % US")+
  ggtitle("Domestic Gross % by Genre") +
  geom_boxplot(alpha=0.7,outlier.colour=NA) +
  theme(
    axis.title.x = element_text(colour="Blue",size=20),
    axis.title.y = element_text(colour="Blue",size=20),
    axis.text.x=element_text(size=10),
    axis.text.y=element_text(size=10),
    plot.title=element_text(size=25),
    legend.title=element_text(size=10),
    legend.text=element_text(size=10),
    text=element_text(family="Comic Sans MS")
  )

# label the size legend
w <- w + labs(size="Budget $M")
w
## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not
## found in Windows font database

## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

Disclaimer: 

This is a personal website. The opinions expressed here represent my own and not those of my employer. 

In addition, my thoughts and opinions change from time to time. I consider this a necessary consequence of having an open mind.

All rights reserved 2024 
