{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial: pandas website views\n", "\n", "One measure of how relevant a class or function is to pandas users\n", "is the number of visits to its page in the documentation.\n", "\n", "We use Google Analytics to track visits to the pandas website.\n", "This is the dashboard of page views in 2018:\n", "\n", "![](img/pandas_website_views.png)\n", "\n", "While Google Analytics has an API, it doesn't make it easy for users\n", "to download the data. So, we downloaded all the visits from that page\n", "with the `Export` option, which downloads the information visible on the page.\n", "We did that 20 times, for the first 20 pages of results, and saved the data\n", "in the `data/pandas_website` directory.\n", "\n", "In this tutorial we will load the data from the CSV files, concatenate them,\n", "and transform the data into a format suitable for analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas\n", "\n", "DATA_DIR = os.path.join('data', 'pandas_website')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load the first CSV file into a pandas DataFrame\n", "\n", "- Load the data from the first CSV file into a DataFrame\n", "- Explore the data: its size, the data types of the columns, what the values look like, whether there are missing values, ..." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convert data into the right format with the right types\n", "\n", "- Drop the `Page Value` column, since it doesn't contain useful information\n", "- Set the `Page` column as the index, so we can access rows by page\n", "- Convert every column to its appropriate numeric type, so we can operate on them" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Repeat for all data\n", "\n", "- Repeat the same steps for all available files, and combine the results into a single `DataFrame` with all the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Save data\n", "\n", "- Save the final `DataFrame` into the file `pandas_page_views_2018.parquet`\n", "- Find information about the Parquet format, and discuss its advantages compared to other formats" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Solution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load solutions/page_views_wrangling.py" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }