Survival Guide

Visualizing the Titanic Manifest Dataset

The famed British passenger liner the R.M.S. Titanic sank on its maiden voyage from Southampton in the UK to New York City in the early morning hours of April 15, 1912, plunging into the icy waters of the North Atlantic Ocean. Of the estimated 2,224 passengers and crew on board, more than 1,500 were lost.

The dataset I work with here is a moderately well-known one, the Titanic Manifest Dataset. It contains data for 1309 of the approximately 1317 passengers on board the Titanic (the rest of those on board being crew). The data have been split into training and testing CSV files for the purposes of supervised machine learning to predict passenger survival. However, I am only interested in visualizing passenger survival here, so I have recombined these into a single CSV file that includes all passengers.
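
The recombination step itself is not shown in this notebook. A minimal sketch of how it could be done with pd.concat, assuming both pieces carry the same columns; the two small frames and their values are hypothetical stand-ins for the real files:

```python
import pandas as pd

# Hypothetical stand-ins for the two files; real code would load them with pd.read_csv.
train_part = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1], 'Pclass': [3, 1]})
test_part = pd.DataFrame({'PassengerId': [3, 4], 'Survived': [1, 0], 'Pclass': [3, 2]})

# Stack the rows and rebuild a clean 0..n-1 index.
combined = pd.concat([train_part, test_part], ignore_index=True)
print(combined.shape)
```

Passing ignore_index=True avoids duplicate index labels, which would otherwise survive from the two source frames.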

Read in data and import packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

titanic = pd.read_csv('./passengers.csv')

Data Dictionary

The data dictionary below shows what information is provided for each passenger listed in the data set.

Passenger class can be interpreted as a proxy for socioeconomic status, with class 1 tickets being the most expensive and class 3 the least. Age is in years, given as a fraction if less than 1, and with a .5 appended if the age is estimated.

[Image: data dictionary table]

Family relation variables as given (i.e. sibsp and parch) exclude some possible familial or pseudo-familial relationships. Spouses refer strictly to husbands and wives (mistresses and fiancés are ignored). Sibling, child and parent relationships include stepsiblings and stepchildren. Family relations not represented: cousins, nephews/nieces, aunts/uncles, in-laws. Some children travelled only with a nanny, so parch=0 for them. Others travelled with very close friends or neighbors from their village, but the definitions do not capture such relations.

Data Cleaning

In [2]:
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
PassengerId    1309 non-null int64
Survived       1309 non-null int64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1046 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1308 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 122.8+ KB
In [3]:
titanic.describe().T
Out[3]:
count mean std min 25% 50% 75% max
PassengerId 1309.0 655.000000 378.020061 1.00 328.0000 655.0000 982.000 1309.0000
Survived 1309.0 0.380443 0.485681 0.00 0.0000 0.0000 1.000 1.0000
Pclass 1309.0 2.294882 0.837836 1.00 2.0000 3.0000 3.000 3.0000
Age 1046.0 29.832380 14.343425 0.17 21.0000 28.0000 39.000 80.0000
SibSp 1309.0 0.498854 1.041658 0.00 0.0000 0.0000 1.000 8.0000
Parch 1309.0 0.385027 0.865560 0.00 0.0000 0.0000 0.000 9.0000
Fare 1308.0 33.295479 51.758668 0.00 7.8958 14.4542 31.275 512.3292
In [4]:
titanic.head()
Out[4]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [5]:
titanic.isnull().sum()
Out[5]:
PassengerId       0
Survived          0
Pclass            0
Name              0
Sex               0
Age             263
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1014
Embarked          2
dtype: int64

There are 1014 null values for 'Cabin' out of 1309 total passengers (about 77%). Additionally, the 'Age' column has 263 missing values (about 20%), 'Embarked' has 2, and 'Fare' has 1.
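
Raw null counts are easier to compare as shares of the column. A small sketch of that calculation on a made-up frame (the toy values are illustrative, not taken from the manifest):

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing entries.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', 'S', 'S'],
})

# isnull() gives booleans; their column mean is the fraction of missing values.
missing_share = toy.isnull().mean()
print(missing_share)
```

The same one-liner applied to the titanic frame would give the per-column missing fractions directly.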

In [6]:
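#Delete rows with Embarked information missing
#(.copy() makes the result an independent frame, so the column assignments below
# do not trigger pandas' SettingWithCopyWarning)
titanic = titanic[titanic['Embarked'].notnull()].copy()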
In [7]:
#Fill null values in Cabin column to allow manipulation
titanic['Cabin'] = titanic['Cabin'].fillna('empty')

Feature Engineering

In [8]:
#Combines parents/children and siblings/spouses columns to create a feature for total family members on board 
titanic['FamilyCount'] = titanic['SibSp'] + titanic['Parch']

#One-hot encoding embark port and gender
titanic[['Embarked_C','Embarked_Q','Embarked_S']] = pd.get_dummies(titanic['Embarked'])
titanic[['Female','Male']] = pd.get_dummies(titanic['Sex'])

#Title feature: the lambda function splits the Name string at the comma
#and again at the period following, to isolate passenger title.
titanic['Title'] = titanic['Name'].map(lambda x: x.split(', ')[1].split('.')[0])
titanic['Title'].unique()
Out[8]:
array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms',
       'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'the Countess',
       'Jonkheer', 'Dona'], dtype=object)
In [9]:
#One-hot encoding titles
#(feeds in alphabetically sorted list of unique titles)
titanic[sorted(list(titanic['Title'].unique()))] = pd.get_dummies(titanic['Title'])
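
To see what the title extraction does on individual names, here is a small sketch that uses a regular expression with Series.str.extract instead of the chained split calls. This is an alternative formulation for illustration, not the method used above, and the sample names are simply strings in the manifest's "Surname, Title. Given names" format:

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# Capture everything between the first comma and the next period.
titles = names.str.extract(r',\s*([^.]+)\.', expand=False)
print(titles.tolist())
```

Multi-word titles such as 'the Countess' come through intact because the pattern stops only at the period.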

Survival Correlation Heatmap

In [10]:
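#The following heatmap shows the correlation of all variables with Survival
#(restricting to numeric/boolean columns keeps corr() from failing on text columns in newer pandas)
plt.figure(figsize=(10,10))
sns.heatmap(titanic.select_dtypes(include=['number','bool']).corr()[['Survived']],cmap="RdBu_r",center=0.0, annot=True);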

Right off the bat, we can see that gender is a primary predictor of survival, with females more likely to survive and males less likely. Other associations that jump out: the lower (i.e., better) your class number and the higher your fare, the higher your chance of survival. We also notice that those embarking at port C seem to do slightly better, and those embarking at port S slightly worse. However, this may be confounded by, say, a higher proportion of low-fare passengers embarking at port S.
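
One way to probe that possible confounding is to look at the class mix at each port. A minimal sketch with pd.crosstab on a made-up frame (the rows are illustrative only; on the real data the call would use titanic['Embarked'] and titanic['Pclass']):

```python
import pandas as pd

toy = pd.DataFrame({
    'Embarked': ['C', 'C', 'C', 'S', 'S', 'S', 'S', 'Q'],
    'Pclass':   [1,   1,   3,   3,   3,   2,   1,   3],
})

# normalize='index' turns counts into row shares:
# the fraction of each port's passengers travelling in each class.
class_mix = pd.crosstab(toy['Embarked'], toy['Pclass'], normalize='index')
print(class_mix.round(2))
```

If one port's row is dominated by class 3, a lower survival rate at that port may reflect ticket class rather than the port itself.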

Survival by Gender, Passenger Class, Port of Embarkation

In [11]:
#Overall survival rate
titanic['Survived'].mean()
Out[11]:
0.3794950267788829
In [12]:
#Survival rate by sex
titanic.groupby('Sex')['Survived'].mean()
Out[12]:
Sex
female    0.726293
male      0.188612
Name: Survived, dtype: float64
In [13]:
#Survival rate by passenger class (1 is most expensive)
titanic.groupby('Pclass')['Survived'].mean()
Out[13]:
Pclass
1    0.616822
2    0.429603
3    0.252468
Name: Survived, dtype: float64
In [14]:
#Survival rate by port of embarkation (potentially confounded with passenger class)
titanic.groupby('Embarked')['Survived'].mean()
Out[14]:
Embarked
C    0.555556
Q    0.357724
S    0.330416
Name: Survived, dtype: float64

The overall survival rate on the manifest is 37.9%. The standout gap is by gender: 72.6% of females survived, compared to 18.9% of males. This is likely due to the "women and children first" rules when filling the lifeboats. As one might expect, passengers with higher class tickets were also more likely to survive (61.7% in class 1 versus 25.2% in class 3).

Survival by Missing Data

In [15]:
#Survival rate for passengers with no cabin listing
print('Without cabin listing: ', titanic[titanic['Cabin']=='empty']['Survived'].mean())

#Survival rate for passengers WITH cabin listing
print('With cabin listing: ', titanic[titanic['Cabin']!='empty']['Survived'].mean())

#Survival rate for passengers with no age information
print('Without age data: ', titanic[titanic['Age'].isnull()]['Survived'].mean())

#Survival rate for passengers with age information
print('With age data: ', titanic[titanic['Age'].notnull()]['Survived'].mean())
Without cabin listing:  0.30078895463510846
With cabin listing:  0.6518771331058021
Without age data:  0.2737642585551331
With age data:  0.4061302681992337

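Survival is noticeably lower where information is missing: 30.1% for passengers without a cabin listing versus 65.2% with one, and 27.4% for passengers without age data versus 40.6% with it. Accurate documentation may have been harder to obtain for passengers who did not survive.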

Survival by Title

In [16]:
#Survival breakdown by title with counts
titles = pd.DataFrame(titanic.groupby('Title')['Survived'].sum())
titles['Total Count'] = titanic.groupby('Title')['Survived'].count()
titles['Survival Rate'] = titanic.groupby('Title')['Survived'].mean()
titles.rename(columns = {'Survived':'Survival Count'}, inplace=True)
titles
Out[16]:
Survival Count Total Count Survival Rate
Title
Capt 0 1 0.000000
Col 2 4 0.500000
Don 0 1 0.000000
Dona 1 1 1.000000
Dr 4 8 0.500000
Jonkheer 0 1 0.000000
Lady 1 1 1.000000
Major 1 2 0.500000
Master 31 61 0.508197
Miss 175 259 0.675676
Mlle 2 2 1.000000
Mme 1 1 1.000000
Mr 121 757 0.159841
Mrs 154 196 0.785714
Ms 1 2 0.500000
Rev 0 8 0.000000
Sir 1 1 1.000000
the Countess 1 1 1.000000

VISUALIZATION: Survival by Family Members on Board

In [17]:
#Correlation between survival rate and total family on board, overall
np.corrcoef(titanic['FamilyCount'],titanic['Survived'])[0,1]
Out[17]:
0.02879335191814032

A correlation of 0.029 suggests next to no linear association between survival and number of family members. Let's look closer though. First, let's look at having family vs. not having family, as a binary.
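
As an aside before the binary split: a near-zero coefficient does not rule out a relationship, because the correlation coefficient only picks up linear trends. The sketch below uses made-up numbers (not the manifest) in which survival rises and then falls with family size, yet the correlation comes out at zero:

```python
import numpy as np

# Illustrative data only: two passengers at each family size 0..4.
family = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
survived = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])

# Pearson correlation between family size and survival.
r = np.corrcoef(family, survived)[0, 1]

# Survival rate at each family size shows the rise-then-fall pattern.
rates = {int(k): float(survived[family == k].mean()) for k in np.unique(family)}
print(round(r, 3), rates)
```

The per-group rates differ a great deal even though the single summary number does not, which is why a breakdown by category is worth doing.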

In [18]:
print('Survival rate with no family on board: {}'.format(titanic[titanic['FamilyCount']==0]['Survived'].mean()))
print('Survival rate with family on board: {}'.format(titanic[titanic['FamilyCount']>0]['Survived'].mean()))
Survival rate with no family on board: 0.29949238578680204
Survival rate with family on board: 0.5009633911368016

So, as a binary, people with family on board had a 50.1% survival rate, compared to a 29.9% survival rate among people with no family on board. Let's take one closer look at the distribution here with a visualization:

In [19]:
#BUILDING PLOT DATA
#----------------------------------
#Creates binary column for whether a passenger has family on board
titanic['FamilyOnBoard'] = titanic['FamilyCount'].map(lambda x: 1 if x>0 else 0)
rates_binary = list(titanic.groupby('FamilyOnBoard')['Survived'].mean())

#This shows us which values exist for FamilyCount and sorts them in a list (x-values):
nums_members = sorted(titanic['FamilyCount'].unique())

#Creates a function that takes a number and returns survival rate for people
#with that number of family members on board
def survival_rate_by_family_count(n):
    survival_rate = titanic[titanic['FamilyCount']==n]['Survived'].mean()
    return survival_rate

#Returns survival rates by count of family members (y-values)
rates_members = [survival_rate_by_family_count(i) for i in nums_members]

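#Returns the raw counts for each category that will be in the plot 
#(e.g., how many people are there with family? how many people are there with 6 family members?)
#sort_index() orders the counts by category value so they line up with the bars;
#value_counts() alone orders by frequency, which can attach counts to the wrong bars.
counts_binary = list(titanic['FamilyOnBoard'].value_counts().sort_index())
counts_members = list(titanic['FamilyCount'].value_counts().sort_index())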


#SUBPLOTS WITH MATPLOTLIB, SEABORN
#------------------------------------------
#Create subplots layout with width proportions of 2 to 9 (achieves equal bar width)
fig, ax = plt.subplots(figsize=(12,6), nrows=1, ncols=2, gridspec_kw = {'width_ratios':[2,9]})
plt.suptitle('Survival Rate by Family Members on Board', fontsize=16)

#Plot Survival Rate by Family on Board, Binary
ax[0].set_xlabel('Family on Board?')
ax[0].set_ylabel('Survival Rate')
ax[0].set_ylim([0,1])
sns.barplot(x=[0,1],
            y=rates_binary,
            #Divergent color palette adds impact to visualization of survival rate
            palette="coolwarm_r",
            hue=rates_binary,
            #dodge=False prevents presentation issues with redundant data mapping
            dodge=False,
            ax=ax[0],
            hue_order=sorted(rates_binary + rates_members))
ax[0].legend_.remove()
#for loop with ax.text to add count labels to the categories
for i in range(len(rates_binary)):
    ax[0].text(x=i,y=.02+rates_binary[i],s=str(counts_binary[i]), ha='center', va='bottom')

#Plot Survival Rate by Family Member Count
ax[1].set_xlabel('Number of Family Members on Board')
ax[1].set_ylim([0,1])
sns.barplot(x=nums_members,
            y=rates_members,
            palette="coolwarm_r",
            hue=rates_members,
            dodge=False,
            ax=ax[1],
            #setting hue_order to the combination of rates_binary and rates_members...
            #...puts divergent color map onto a single scale across both plots.
            hue_order=sorted(rates_binary + rates_members))
ax[1].legend_.remove();
for i in range(len(rates_members)):
    ax[1].text(x=i,y=.02+rates_members[i],s=str(counts_members[i]), ha='center', va='bottom')

#Used an empty artist to create legend with text string clarifying labels
empty = plt.plot([], [], '')
plt.legend(empty,'',title="Numerical Labels = Counts of Original Passengers in Each Category");

Here we can see clearly that having family on board is associated with a higher likelihood of survival overall, but there is more to the story. Survival rate rises with the number of family members at low counts, peaking at 69.8% for people with 3 family members, then falls off sharply for people with 4 or more. Some of this may be due to chance (the counts for 3+ family members are low), and it's also possible that larger families held lower class tickets. It is also conceivable that having family members to assist in getting to safety was a boon up to a point, while larger families were hindered by the effort of making sure everyone was accounted for.
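
To put a number on how much chance could matter for the small categories, one option is a confidence interval for each rate. Below is a minimal sketch of the Wilson score interval, written from the standard formula; the function name and the example counts (7 of 10 versus 700 of 1000) are hypothetical and not taken from the plot:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Same observed rate (70%), very different certainty.
small = wilson_interval(7, 10)
large = wilson_interval(700, 1000)
print(small, large)
```

With only ten people the interval spans roughly 40% to 89%, so differences between sparsely populated bars should be read with caution.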

VISUALIZATION: Survival Rate by Sex in Different Age Groups

The following plot idea required a high degree of customization, but I felt it would offer a particularly elegant view of two major survival determinants.

In [21]:
#generate lists of the gender survival rates for each five-year age group between 0 and 65
ratelistmale = [titanic[(titanic['Female']==0)&(titanic['Age']>=i)&(titanic['Age']<(i+5))]['Survived'].mean() for i in range(0,65,5)]
ratelistfemale = [titanic[(titanic['Female']==1)&(titanic['Age']>=i)&(titanic['Age']<(i+5))]['Survived'].mean() for i in range(0,65,5)]

#create figure
fig,ax = plt.subplots(figsize=(10,5))
plt.title('Survival Rate by Gender in Different Age Groups')
plt.ylabel('Survival Rate')

#generate side by side barplots
barWidth = 1.5
plt.bar(np.arange(2,67,5), ratelistmale, color='navy', width=barWidth, label='Male')
plt.bar(np.arange(3.5,68.5,5), ratelistfemale, color='orange', width=barWidth, label='Female')

plt.xticks(np.arange(0,70,5))

plt.legend();

What is so interesting here is not just the clear contrast between female and male survival, but also that female survival seems to increase with age, while male survival drops off past a certain age. In particular, this plot shows that children under the age of 5, or even under the age of 10, were treated mostly the same in terms of lifeboat allotments or priority, but from age 12 or so on an enormous gap opens between the genders, indicating that males of age were expected to yield lifeboat seats to women and children. A mere 7.8% of male passengers aged 15 to 19 survived the disaster (5 out of 64, as seen below).
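
The per-bin rates behind a plot like this can also be computed with pd.cut and a single groupby rather than a list comprehension per sex. A sketch on a toy frame (the rows and the label format are illustrative choices, not part of the analysis above):

```python
import pandas as pd

toy = pd.DataFrame({
    'Sex': ['male', 'male', 'female', 'female', 'male', 'female'],
    'Age': [2.0, 17.0, 4.0, 18.0, 19.0, 33.0],
    'Survived': [1, 0, 1, 1, 0, 1],
})

# Left-closed five-year bins [0,5), [5,10), ..., [60,65),
# mirroring the ">= i and < i+5" filters.
edges = list(range(0, 70, 5))
labels = ['{}-{}'.format(b, b + 4) for b in edges[:-1]]
toy['AgeGroup'] = pd.cut(toy['Age'], bins=edges, labels=labels, right=False)

# One row per observed age group, one column per sex.
rates = toy.groupby(['AgeGroup', 'Sex'], observed=True)['Survived'].mean().unstack()
print(rates)
```

Rows with missing Age fall outside every bin and are dropped by the groupby, just as they fail the comparison filters in the list comprehensions.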

In [26]:
titanic[(titanic['Female']==0)&(titanic['Age']>=15)&(titanic['Age']<20)]['Survived'].value_counts()
Out[26]:
0    59
1     5
Name: Survived, dtype: int64

VISUALIZATION: Survival by Age and Fare

In [28]:
fig, ax = plt.subplots(figsize=(16,10))
plt.xlim(-5,200)

with sns.plotting_context('notebook',font_scale=1.2):

    ax = sns.scatterplot(
        x='Fare',
        y='Age',
        hue='Survived',
        palette=['red','blue'],
        data=titanic,
        ax=ax
    )

In this seaborn scatterplot, we get a really nice presentation of the available tiered fares, and the density of passengers with certain fares and at certain ages. Fares over 200 have not been included, for scaling reasons. Red indicates that the passenger perished, while blue indicates survival. We can see very clearly that the greatest density of deaths occurred among lower-fare passengers between the ages of 17 and 45. We can also see a trend toward survival as fares increase, and as age decreases.

VISUALIZATION: Survival by Class and Sex, Subdivided

In [29]:
#The Seaborn catplot allows us to break things down by sub-category
with sns.plotting_context('notebook',font_scale=1.2):

    ax = sns.catplot(
        x='Pclass',
        y='Survived',
        hue='Sex',
        kind='bar',
        data=titanic,
    )
In [30]:
#for the actual survival rate values in the plot above:
titanic.groupby(['Pclass','Sex'])['Survived'].mean().unstack()
Out[30]:
Sex female male
Pclass
1 0.964789 0.340782
2 0.886792 0.146199
3 0.490741 0.148073

The basic takeaway from this plot is that women with Class 1 or Class 2 status were all but guaranteed to survive. In a tragedy where only 38% of passengers survived overall, the rate of survival for Class 1 women was 96%, and for Class 2 women 89%. The drop-off to 49% survival for Class 3 women makes it clear that this was not purely a situation of "Ladies First". We can also see that males in Class 1 had an advantage over their lower class counterparts (34% survival versus about 15% in Classes 2 and 3).

VISUALIZATION: Age and Gender Distribution of Survivors and Perished in Each Passenger Class

In [31]:
with sns.plotting_context('notebook',font_scale=1.2):

    ax = sns.catplot(
        x='Survived',
        y='Age',
        hue='Sex',
        col='Pclass',
        data=titanic,
        orient='v',
        kind='boxen',
    )

While the survival rate bar charts above are more easily comprehended, this plot (Seaborn's boxen catplot) deserves a bit of explication. The passenger manifest is first divided by class. Within each class, it is divided into those who perished and those who survived. Within this division we see the age distribution of the male and female members of the division. The weakness of this layout is that it doesn't express much about the relative size of the Survived group vs the Perished group in each class. However it does give us a sense of the summary statistics for each subdivision, and a quick picture of what ages and genders comprised the different groups. It also allows us to look at age after stratification for class and gender.

It is worth noting that the middle box represents the interquartile range, or the middle 50% by age in that class/gender subdivision - essentially dividing the distribution into fourths, as in a standard boxplot. The innovation of the boxenplot, elsewhere referred to as the letter-value plot, is to further subdivide the distribution into eighths, sixteenths, and so on, with increasingly narrow boxes. This gives a more complete view of the distribution than a boxplot, without requiring the subjective parameter choices that violin plots do.
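
The successive halving can be made concrete by computing the box edges by hand. The sketch below is only the idea: seaborn chooses the number of boxes automatically and may compute the quantiles differently, and the function name and the ages used here are made up:

```python
import numpy as np

# Illustrative sample: ages 1 through 80, one passenger each.
ages = np.arange(1, 81)

def letter_values(values, depth=3):
    """Return (lower, upper) quantile pairs at tail areas 1/4, 1/8, 1/16, ..."""
    pairs = []
    for k in range(2, 2 + depth):
        tail = 0.5 ** k
        pairs.append((float(np.quantile(values, tail)),
                      float(np.quantile(values, 1 - tail))))
    return pairs

boxes = letter_values(ages)
for lower, upper in boxes:
    print(lower, upper)
```

The first pair is the ordinary interquartile box; each later pair reaches further into the tails and encloses the one before it.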

Some interesting takeaways: Class 2 male survivors are largely children. We can see that the children, male and female, were saved entirely in Class 2, with no young passengers among the perished (min age of perished ~ 17). This is in sharp contrast to Class 3, where the fate of children was more indiscriminate. We can see that in general, within stratifications for class and gender, the perished tended to be slightly older than the survivors, with the effect more pronounced among men.

In [32]:
with sns.plotting_context('notebook',font_scale=1.2):

    sns.catplot(
        x='Survived',
        y='Age',
        hue='Sex',
        col='Pclass',
        data=titanic,
        orient='v',
        kind='swarm',
    )

This very similar plot, a swarm plot, lacks an explicit representation of summary statistics. However, it has the great advantage of indicating the size of the subdivisions, and also showing the precise nature of the distribution by displaying every point. We can see the weakness of the boxenplot in the way it misrepresents small counts as full distributions, in particular Class 1 females who perished, a group containing only five individuals. We also have a clearer picture of the Class 2 male survivor distribution which we discussed above. Finally, in addition to giving a sense of the age distributions for each sex, the intermingling lets you see the shape of the combined, gender-independent age distribution for survived and perished in each class.

This may be my preferred visualization, if I were to choose only one, as it represents survival by class and gender, with a full distribution by age, and every passenger is visualized with a point on the chart.

Can we make it even better?

In [34]:
#Using plt.subplots allows us more precise control of the swarmplot functionality
fig, ax = plt.subplots(figsize=(12,6), nrows=1, ncols=3)
plt.suptitle('Age Distribution of Survivors and Perished by Gender in Each Class', fontsize=16)

for i in range(3):
    ax[i].set_title('Class {}'.format(i+1))
    ax[i].set_ylim(-5,85)
    sns.swarmplot(data=titanic[titanic['Pclass']==i+1],
                  x='Survived',
                  y='Age',
                  hue='Sex',
                  hue_order=['male','female'],
                  size=3,
                  ax=ax[i])

ax[1].set_ylabel(None)
ax[2].set_ylabel(None)

ax[0].legend_.remove()
ax[1].legend_.remove()

This is a proper visualization. Each year of age corresponds to a line, and manipulating the marker size allows the final visual to be crisper. The age distributions being lower for survivors after stratification for class is very clear here. We can see that the swarmplot is especially well suited to relatively small datasets (<2000 individuals). But what if we want to separate the genders out as well?

In [35]:
#We can set dodge as True in the swarmplot to split the distributions
fig, ax = plt.subplots(figsize=(12,6), nrows=1, ncols=3)
plt.suptitle('Age Distribution of Survivors and Perished by Gender in Each Class', fontsize=16)

for i in range(3):
    ax[i].set_title('Class {}'.format(i+1))
    ax[i].set_ylim(-5,85)
    sns.swarmplot(data=titanic[titanic['Pclass']==i+1],
                  x='Survived',
                  y='Age',
                  hue='Sex',
                  hue_order=['male','female'],
                  size=3,
                  dodge=True,
                  ax=ax[i])

ax[1].set_ylabel(None)
ax[2].set_ylabel(None)

ax[0].legend_.remove()
ax[1].legend_.remove()

Here we have achieved a very appealing view of the dataset in which every individual is represented with a point, and class, age and gender have all been represented intuitively. One can look at this and ask, as a thirty-year-old man, how likely would I have been to survive, depending on my ticket class, and see the answer clearly.
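
The same question can be asked of the data numerically. A rough sketch of an empirical lookup, assuming a frame with the Sex, Pclass, Age and Survived columns used throughout; the function name, the five-year window and the toy rows are all hypothetical choices for illustration:

```python
import pandas as pd

def empirical_rate(df, sex, pclass, age, window=5):
    """Survival share and group size for same-sex, same-class passengers within +/- window years."""
    mask = (
        (df['Sex'] == sex)
        & (df['Pclass'] == pclass)
        & (df['Age'].between(age - window, age + window))
    )
    group = df.loc[mask, 'Survived']
    return group.mean(), len(group)

# Made-up rows, not the manifest.
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'male', 'female', 'male'],
    'Pclass': [1, 1, 3, 3, 1, 1],
    'Age': [28.0, 33.0, 30.0, 31.0, 30.0, 50.0],
    'Survived': [1, 0, 0, 0, 1, 1],
})

print(empirical_rate(toy, 'male', 1, 30))
print(empirical_rate(toy, 'male', 3, 30))
```

Returning the group size alongside the rate makes it obvious when an answer rests on only a handful of passengers.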