Introduction

This is my first Kaggle exercise! The goal is to predict which passengers survived the Titanic shipwreck, given a set of attributes about the passengers. This file takes < 60 seconds to execute, and as of May 2020, the result places within the top 12% of submissions.

Kaggle allows us to skip the processes of data gathering, architecture, governance, and extraction, so we can move straight to importing and cleaning the data.

Import Data

The training data has 891 observations, and the test data has 418 observations.

For each passenger, there are 10 predictors along with their passenger ID:

  • Pclass (ticket class, proxy for socio-economic status)
  • Name (passenger name: Last, First)
  • Sex (male/female)
  • Age (years)
  • SibSp (# siblings and spouses aboard the Titanic)
  • Parch (# parents and children aboard the Titanic)
  • Ticket (ticket number)
  • Fare (passenger fare)
  • Cabin (cabin number)
  • Embarked (port of embarkation: Cherbourg, Queenstown, or Southampton)
train <- read_csv("train.csv")
test <- read_csv("test.csv")
head(train)
## # A tibble: 6 x 12
##   PassengerId Survived Pclass Name  Sex     Age SibSp Parch Ticket  Fare Cabin
##         <dbl>    <dbl>  <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
## 1           1        0      3 Brau… male     22     1     0 A/5 2…  7.25 <NA> 
## 2           2        1      1 Cumi… fema…    38     1     0 PC 17… 71.3  C85  
## 3           3        1      3 Heik… fema…    26     0     0 STON/…  7.92 <NA> 
## 4           4        1      1 Futr… fema…    35     1     0 113803 53.1  C123 
## 5           5        0      3 Alle… male     35     0     0 373450  8.05 <NA> 
## 6           6        0      3 Mora… male     NA     0     0 330877  8.46 <NA> 
## # … with 1 more variable: Embarked <chr>

Data Wrangling

Exploration: Sex

Sex had a very disproportionate effect on survival on the Titanic. We can see that approximately 75% of females survived, compared to less than 20% of males.

train %>%
  group_by(Sex) %>%
  summarize(Survived = sum(Survived),
            Total = n(),
            Proportion = Survived/Total) %>%
  ggplot() +
    geom_col(mapping = aes(x = Sex, y = Proportion)) + 
    labs(title = "Titanic Survival Rate by Sex", y = "Survival Rate")

Exploration: Ticket Class

Ticket class seems to greatly affect survival rate as well. More than 60% of 1st class passengers survived, while less than 25% of 3rd class passengers did.

train %>%
  group_by(Pclass) %>%
  summarize(Survived = sum(Survived),
            Total = n(),
            Proportion = Survived/Total) %>%
  ggplot() +
    geom_col(mapping = aes(x = Pclass, y = Proportion)) + 
    labs(title = "Titanic Survival Rate by Ticket Class", x = "Ticket Class", y = "Survival Rate")

Feature Engineering: Title

I created a new variable called “Title” to represent the passenger’s title extracted from their name.

There were a number of infrequently occurring titles, so I decided to fit them all into five categories: “Master”, “Mr”, “Mrs”, “Miss”, and “Royal” (for titles that indicated royalty). We can see that passengers with different titles have significantly different survival rates.

### create Title from Name (train)
titles <- (str_extract(train$Name, pattern = "(?<=, ).+(?=\\. )"))
titles <- str_replace_all(titles, c("Jonkheer" = "Mr",
                                "Capt" = "Mr",
                                "Col" = "Mr",
                                "Don" = "Mr",
                                "Major" = "Mr", 
                                "Rev" = "Mr",
                                "Sir" = "Royal",
                                "Dr" = "Mr",
                                "Lady" = "Royal",
                                "Mme" = "Mrs",
                                "Mrs. Martin \\(Elizabeth L" = "Mrs",
                                "the Countess" = "Royal",
                                "Mlle" = "Mrs",
                                "Ms" = "Miss"))
titles[797] <- "Mrs" # handle special case of female doctor
train$Title <- titles
table(train$Title) # Master, Miss, Mr, Mrs, Royal
## 
## Master   Miss     Mr    Mrs  Royal 
##     40    183    536    129      3
### create Title from Name (test)
titles <- (str_extract(test$Name, pattern = "(?<=, ).+(?=\\. )"))
titles <- str_replace_all(titles, c("Col" = "Mr",
                                "Dona" = "Mrs",
                                "Dr" = "Mr",
                                "Ms" = "Miss",
                                "Rev" = "Mr"))
test$Title <- titles
table(test$Title) # Master, Miss, Mr, Mrs
## 
## Master   Miss     Mr    Mrs 
##     21     79    245     73
rm(titles)

### graph survival rate by title
train %>%
  group_by(Title) %>%
  summarize(Survived = sum(Survived), 
            Total = n(),
            Proportion = Survived/Total) %>%
  ggplot() +
    geom_col(mapping = aes(x = Title, y = Proportion)) +
    labs(title = "Titanic Survival Rate By Passenger Title", y = "Survival Rate")