# Tutorial: Part 1 - Basic operations

![](img/titanic.jpg)

## Titanic dataset

Dataset source: https://www.kaggle.com/c/titanic/data

Features:
- **PassengerId:** Id of every passenger.
- **Survived:** Whether the passenger survived: 0 for not survived and 1 for survived.
- **Pclass:** Ticket class of the passenger: 1 (first), 2 (second) or 3 (third).
- **Name:** Name of passenger.
- **Sex:** Gender of passenger.
- **Age:** Age of passenger.
- **SibSp:** Number of siblings and spouses the passenger had aboard.
- **Parch:** Number of parents and children the passenger had aboard.
- **Ticket:** Ticket number of the passenger.
- **Fare:** Fare paid by the passenger.
- **Cabin:** Cabin number of the passenger.
- **Embarked:** Port where the passenger embarked.
- **Initial:** Initial name of passenger.

### Loading data

In [None]:
import pandas

titanic = pandas.read_csv('data/titanic.csv.gz')

### Exploring the data

Sample of the data

In [None]:
titanic.head()

In [None]:
titanic.info()

In [None]:
titanic.describe()

In [None]:
%matplotlib inline

titanic[['Age', 'Fare']].boxplot();

### Passenger gender

In [None]:
titanic['Sex'].head()

In [None]:
titanic['Sex'].value_counts()

In [None]:
titanic['Sex'].value_counts(normalize=True)

In [None]:
from matplotlib import pyplot

titanic['Sex'].value_counts().plot(kind='bar')

pyplot.title('Number of Titanic passengers by gender')
pyplot.ylabel('Number of passengers');

In [None]:
titanic['Sex'].replace({'male': 'M', 'female': 'F'}).value_counts().plot(kind='bar')

pyplot.title('Number of Titanic passengers by gender')
pyplot.ylabel('Number of passengers')
pyplot.xticks(rotation=0);

<div class="alert alert-success">
    <p><b>EXERCISE 1:</b> Checking where the passengers embarked</p>
    <img alt="" src="img/titanic_route.png"/>
    <p>Tasks:
        <ul>
            <li>Check the names of the ports in the dataset.</li>
            <li>Replace the port abbreviations with their full names.</li>
            <li>Compute the number of passengers that embarked at each port.</li>
            <li>Compute the proportion of passengers that embarked at each port.</li>
            <li>Plot the proportion of passengers that embarked at each port.</li>
        </ul>
    </p>
    <p>Hints:
        <ul>
            <li>The column <code>Embarked</code> has the port where the passenger embarked.</li>
            <li>Unique values in a column can be obtained with <code>.unique()</code>.</li>
            <li>The method <code>.replace()</code> takes a dictionary mapping old values to new values,
                and returns the column with the values replaced.</li>
            <li>To add or overwrite a column with another (possibly calculated) column, you can use
                the method <code>.assign()</code>, which returns a new DataFrame.</li>
            <li>The method <code>plot()</code> supports different <code>kind</code> of plots,
                such as <code>bar</code>, <code>barh</code> and <code>pie</code>.</li>
        </ul>
    </p>
</div>
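
The hints above can be tried first on a small, made-up column before applying them to `Embarked`. This is only a sketch: the `orders` DataFrame and its `size` codes are hypothetical and are not part of the Titanic data.

```python
import pandas

# Hypothetical toy data, unrelated to the Titanic dataset
orders = pandas.DataFrame({'size': ['S', 'M', 'S', 'L', 'S', 'M']})

# Distinct values in the column, in order of first appearance
sizes = orders['size'].unique()

# .replace() returns a new column with the codes swapped for full names;
# .assign() returns a new DataFrame where 'size' is overwritten with it
orders = orders.assign(
    size=orders['size'].replace({'S': 'Small', 'M': 'Medium', 'L': 'Large'})
)

# Absolute counts, then the same counts as fractions that add up to 1
counts = orders['size'].value_counts()
proportions = orders['size'].value_counts(normalize=True)
```

Calling `proportions.plot(kind='bar')` (or `kind='barh'`, `kind='pie'`) on the result draws the proportions, in the same way as the gender plots earlier in this notebook.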

In [None]:
%load solutions/titanic_1.py

### Passenger classes

In [None]:
titanic.pivot_table(values='PassengerId', index='Pclass', columns='Survived', aggfunc='count')

In [None]:
(titanic.assign(Survived=titanic['Survived'].replace({0: 'No', 1: 'Yes'}))
        .pivot_table(values='PassengerId', index='Pclass', columns='Survived', aggfunc='count')
        .loc[:, ['Yes', 'No']])

<div class="alert alert-success">
    <p><b>EXERCISE 2:</b> Checking survival by sex and class</p>
    <img alt="" src="img/titanic_classes.jpg"/>
    <p>Tasks:
        <ul>
            <li>Display a table with the passenger class as columns, and sex as rows.</li>
            <li>Compute the proportion of passengers who survived in each class.</li> 
            <li>Format the proportions as percentages (e.g. <code>0.5</code> -> <code>50%</code>).</li>
            <li>Rename the column <code>Pclass</code> to <code>Class</code> in the visualization.</li>
        </ul>
    </p>
    <p>Hints:
        <ul>
            <li>The proportion of passengers who survived is <code>survived / total</code> and can be
                computed as <code>titanic.Survived.sum() / titanic.Survived.count()</code>, which is equivalent to
                <code>titanic.Survived.mean()</code>.</li>
            <li>The method <code>.pivot_table()</code> accepts the parameter <code>aggfunc='mean'</code>.</li>
            <li>The values of a table/DataFrame can be formatted as percentages by calling
                <code>.style.format('{:.2%}')</code>.</li>
            <li>Columns can be renamed with the method <code>.rename()</code> using the parameter
                <code>columns=</code> with a dictionary of the columns to rename.</li>
        </ul>
    </p>
</div>
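
To see why the mean of a 0/1 column is a proportion, and how `aggfunc='mean'` combines with `.rename()`, here is a minimal sketch on made-up data. The `exams` DataFrame and its column names are assumptions for illustration only.

```python
import pandas

# Hypothetical toy data: 1 means passed, 0 means failed
exams = pandas.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'B', 'A'],
    'shift': ['day', 'night', 'day', 'day', 'night', 'day'],
    'passed': [1, 0, 1, 0, 0, 1],
})

# For a 0/1 column, sum / count and mean give the same proportion
manual = exams['passed'].sum() / exams['passed'].count()
shortcut = exams['passed'].mean()

# Rename a column first, then compute the pass rate per shift (rows) and team (columns)
rates = (exams.rename(columns={'group': 'team'})
              .pivot_table(values='passed', index='shift', columns='team', aggfunc='mean'))

# The format string from the hint, applied to a single number
formatted = '{:.2%}'.format(shortcut)
```

In a notebook, `rates.style.format('{:.2%}')` applies that same format string to every cell of the table when it is displayed.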

In [None]:
%load solutions/titanic_2.py

### Passenger names

In [None]:
titanic['Name'].head()

In [None]:
name = 'Futrelle, Mrs. Jacques Heath (Lily May Peel)'
name

In [None]:
splitted_names = name.split(',')
splitted_names

In [None]:
reversed_splitted_names = splitted_names[::-1]
reversed_splitted_names

In [None]:
joined_names = ' '.join(reversed_splitted_names)
joined_names

In [None]:
joined_names.strip()

In [None]:
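full_names = (titanic['Name'].str.split(',')   # 'Surname, Title First' -> ['Surname', ' Title First']
                             .str[::-1]        # reverse the order of the parts
                             .str.join(' ')    # join the parts back into one string
                             .str.strip())     # remove the leading space
full_names.head()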

In [None]:
full_names.str.startswith('Rev.').head()

In [None]:
full_names[full_names.str.startswith('Rev.')]

In [None]:
titanic[full_names.str.startswith('Rev.')]

<div class="alert alert-success">
    <p><b>EXERCISE 3:</b> Who was the Titanic Captain?</p>
    <img alt="" src="img/titanic_captain.jpg"/>
    <p>Tasks:
        <ul>
            <li>Obtain a Series with the titles (e.g. Miss, Mr...) from the names of the passengers,
                and how many of each exist in the data.</li>
            <li>Identify which title corresponds to the captain.</li> 
            <li>Filter the original dataset to return only the row with the captain.</li>
            <li>Display the captain's name, age, and the class he was travelling in.</li>
        </ul>
    </p>
    <p>Hints:
        <ul>
            <li>A DataFrame can be filtered by both rows and columns at the same time,
                using the method <code>.loc[rows_filter, columns_filter]</code>.</li>
            <li>A filter can be a label, a list of labels, a boolean array, or a slice.</li>
        </ul>
    </p>
</div>
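
The four kinds of filter mentioned in the hints can be illustrated with a small sketch. The `crew` DataFrame below is invented for this example and uses a hypothetical letter index so that labels are easy to tell apart from positions.

```python
import pandas

# Hypothetical toy data with a labelled index
crew = pandas.DataFrame({
    'name': ['Ann', 'Bob', 'Cid', 'Dee'],
    'role': ['cook', 'pilot', 'cook', 'medic'],
    'age': [31, 45, 28, 39],
}, index=['a', 'b', 'c', 'd'])

# A single row label and a single column label return one value
single = crew.loc['b', 'name']

# A boolean array selects rows; a list of labels selects columns
cooks = crew.loc[crew['role'] == 'cook', ['name', 'age']]

# Label slices in .loc include both endpoints
span = crew.loc['b':'c', 'name':'role']
```

Note that, unlike regular Python slices, a label slice such as `'b':'c'` includes the row labelled `'c'`.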

In [None]:
%load solutions/titanic_3.py