{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial: Part 1 - Basic operations\n", "\n", "\n", "\n", "## Titanic dataset\n", "\n", "Dataset source: https://www.kaggle.com/c/titanic/data\n", "\n", "Features:\n", "- **PassengerId:** Unique ID of each passenger.\n", "- **Survived:** 0 if the passenger did not survive, 1 if the passenger survived.\n", "- **Pclass:** Passenger class. There are 3 classes: 1, 2 and 3.\n", "- **Name:** Name of the passenger.\n", "- **Sex:** Gender of the passenger.\n", "- **Age:** Age of the passenger.\n", "- **SibSp:** Number of siblings and spouses the passenger had aboard.\n", "- **Parch:** Number of parents and children the passenger had aboard.\n", "- **Ticket:** Ticket number of the passenger.\n", "- **Fare:** Fare paid by the passenger.\n", "- **Cabin:** Cabin of the passenger.\n", "- **Embarked:** Port where the passenger embarked.\n", "- **Initial:** Initial name of the passenger." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas\n", "\n", "titanic = pandas.read_csv('data/titanic.csv.gz')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exploring the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sample of the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "titanic.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "titanic.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "titanic.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "titanic[['Age', 'Fare']].boxplot();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Passenger gender" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": 
[], "source": [ "titanic['Sex'].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "titanic['Sex'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "titanic['Sex'].value_counts(normalize=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from matplotlib import pyplot\n", "\n", "titanic['Sex'].value_counts().plot(kind='bar')\n", "\n", "pyplot.title('Number of Titanic passengers by gender')\n", "pyplot.ylabel('Number of passengers');" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "titanic['Sex'].replace({'male': 'M', 'female': 'F'}).value_counts().plot(kind='bar')\n", "\n", "pyplot.title('Number of Titanic passengers by gender')\n", "pyplot.ylabel('Number of passengers')\n", "pyplot.xticks(rotation=0);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
EXERCISE 1: Checking where the passengers embarked\n",
"\n",
"Tasks:\n",
"\n",
"Hints:\n",
"- `Embarked` has the port where the passenger embarked.\n",
"- `.unique()` returns the distinct values of a column.\n",
"- `.replace()` receives a dictionary with lookups, and returns the column with the values replaced.\n",
"- `.assign()` returns a copy of the DataFrame with columns added or replaced.\n",
"- `.plot()` supports different `kind` of plots, such as `bar`, `barh` and `pie`.\n",
"\n",
"EXERCISE 2: Checking survival by sex and class
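\n",
"\n",
"Tasks:\n",
"- Display the survival rates as percentages (`0.5` -> `50%`).\n",
"- Rename `Pclass` to `Class` in the visualization.\n",
"\n",
"Hints:\n",
"- The survival rate is `survived / total` and can be computed as `titanic.Survived.sum() / titanic.Survived.count()`, which is equivalent to `titanic.Survived.mean()`.\n",
"- `.pivot_table()` accepts the parameter `aggfunc='mean'`.\n",
"- Values can be displayed as percentages with `.style.format('{:.2%}')`.\n",
"- Columns can be renamed with `.rename()` using the parameter `columns=` with a dictionary of the columns to rename.\n",
"\n",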
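The Exercise 2 hints can be tried out on a small table before applying them to `titanic`. The sketch below is only an illustration: the `sample` DataFrame and its rows are made up, not taken from the real dataset, and the percentages are built as plain strings so the result can be printed outside a notebook.

```python
import pandas

# Hypothetical sample rows for illustration, not taken from the real dataset.
sample = pandas.DataFrame({
    'Survived': [1, 0, 1, 1, 0, 0, 1, 0],
    'Sex': ['female', 'male', 'female', 'female', 'male', 'male', 'male', 'female'],
    'Pclass': [1, 1, 2, 3, 2, 3, 1, 3],
})

# survived / total is the same as the mean of a 0/1 column.
rate_by_division = sample.Survived.sum() / sample.Survived.count()
rate_by_mean = sample.Survived.mean()

# Rename the column with a dictionary of old name -> new name.
renamed = sample.rename(columns={'Pclass': 'Class'})

# Survival rate for each combination of sex (rows) and class (columns).
rates = renamed.pivot_table(index='Sex', columns='Class',
                            values='Survived', aggfunc='mean')

# Plain-text counterpart of rates.style.format('{:.2%}'):
# format every value as a percentage string, column by column.
rates_text = rates.apply(lambda column: column.map('{:.2%}'.format))
print(rates_text)
```

In a notebook, `rates.style.format('{:.2%}')` would render the same numbers as a styled table while leaving the underlying values numeric.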
" EXERCISE 3: Who was the Titanic Captain?
\n",
"\n",
"Tasks:\n",
"\n",
"Hints:\n",
"- Rows and columns can be selected with `.loc[rows_filter, columns_filter]`.
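As a rough sketch of how `.loc[rows_filter, columns_filter]` behaves, the example below uses a made-up `people` DataFrame (the names and values are hypothetical) and an arbitrary age condition chosen only for illustration. The row filter is a boolean Series and the column filter is a list of column labels.

```python
import pandas

# Hypothetical rows for illustration only, not from the real dataset.
people = pandas.DataFrame({
    'Name': ['Doe, Mr. John', 'Roe, Mr. Richard', 'Poe, Miss. Jane'],
    'Age': [30.0, 61.0, 22.0],
    'Pclass': [3, 1, 2],
})

# Row filter: a boolean Series, True for the rows to keep.
rows_filter = people['Age'] > 60

# Column filter: a list with the labels of the columns to keep.
columns_filter = ['Name', 'Age']

# Keep only the matching rows and the requested columns.
selection = people.loc[rows_filter, columns_filter]
print(selection)
```

The same pattern should work with any condition that produces a boolean Series, for example one built from a string column with `.str.contains()`.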