|
@@ -0,0 +1,252 @@
|
|
|
|
+{
|
|
|
|
+ "cells": [
|
|
|
|
+ {
|
|
|
|
+ "cell_type": "markdown",
|
|
|
|
+ "metadata": {},
|
|
|
|
+ "source": [
|
|
|
|
+ "# Tutorial: Errors in the pandas API reference\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "In Python, documentation of objects is defined in the objects themselves. For example:\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "```python\n",
|
|
|
|
+ "def divide(dividend, divisor):\n",
|
|
|
|
+ " \"\"\"\n",
|
|
|
|
+ " Compute the division of two floating point numbers.\n",
|
|
|
|
+ " \n",
|
|
|
|
+ " Parameters\n",
|
|
|
|
+ " ----------\n",
|
|
|
|
+ " dividend : float\n",
|
|
|
|
+ " Number to divide.\n",
|
|
|
|
+ " divisor : float\n",
|
|
|
|
+ " Number to divide by.\n",
|
|
|
|
+ " \n",
|
|
|
|
+ " Returns\n",
|
|
|
|
+ " -------\n",
|
|
|
|
+ " float:\n",
|
|
|
|
+ " The result of the division.\n",
|
|
|
|
+ " \"\"\"\n",
|
|
|
|
+ " return dividend / divisor\n",
|
|
|
|
+ "```\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "There are tools to extract this documentation (named docstrings), and generate the\n",
|
|
|
|
+ "web version of it.\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "To make sure the documentation is formatted correctly in the web, and to keep consistency among\n",
|
|
|
|
+ "the pages, there are some standards that we aim to follow. For historical reasons, many\n",
|
|
|
|
+ "docstrings don't follow these standards.\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "Some of the errors are next (they are codified with a code):\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "- **SS02**: Summary does not start with a capital letter\n",
|
|
|
|
+ "- **SS03**: Summary does not end with a period\n",
|
|
|
|
+ "- **PR01**: Parameters {missing_params} not documented\n",
|
|
|
|
+ "- **RT01**: No Returns section found\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "The next docstring would return them:\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "```python\n",
|
|
|
|
+ "def divide(dividend, divisor):\n",
|
|
|
|
+ " \"\"\"\n",
|
|
|
|
+ " compute the division of two floating point numbers\n",
|
|
|
|
+ " \n",
|
|
|
|
+ " Parameters\n",
|
|
|
|
+ " ----------\n",
|
|
|
|
+ " dividend : float\n",
|
|
|
|
+ " Number to divide.\n",
|
|
|
|
+ " \"\"\"\n",
|
|
|
|
+ " return dividend / divisor\n",
|
|
|
|
+ "```\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "We developed a script that is able to automatically detect these errors and report\n",
|
|
|
|
+ "them. It can return all the errors in the whole pandas code base in json format with\n",
|
|
|
|
+ "the next command:\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "```\n",
|
|
|
|
+ "./scripts/validate_docstrings.py --format=json > docstring_errors_pandas023.json.gz\n",
|
|
|
|
+ "```\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "In this tutorial we will load the output of this script, and we will transform it\n",
|
|
|
|
+ "to keep the relevant information"
|
|
|
|
+ ]
|
|
|
|
+ },
|
|
|
|
+ {
|
|
|
|
+ "cell_type": "code",
|
|
|
|
+ "execution_count": null,
|
|
|
|
+ "metadata": {},
|
|
|
|
+ "outputs": [],
|
|
|
|
+ "source": [
|
|
|
|
+ "import os\n",
|
|
|
|
+ "import pandas\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "DATA_FNAME = os.path.join('data', 'docstring_errors_pandas023.json.gz')"
|
|
|
|
+ ]
|
|
|
|
+ },
|
|
|
|
+ {
|
|
|
|
+ "cell_type": "markdown",
|
|
|
|
+ "metadata": {},
|
|
|
|
+ "source": [
|
|
|
|
+ "### Load data\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "- Load data from the json file `DATA_FNAME`\n",
|
|
|
|
+ "- Try reading the data with different `orient` values, and read it so every row is a docstring"
|
|
|
|
+ ]
|
|
|
|
+ },
|
|
|
|
+ {
|
|
|
|
+ "cell_type": "code",
|
|
|
|
+ "execution_count": null,
|
|
|
|
+ "metadata": {},
|
|
|
|
+ "outputs": [],
|
|
|
|
+ "source": []
|
|
|
|
+ },
|
|
|
|
+ {
|
|
|
|
+ "cell_type": "markdown",
|
|
|
|
+ "metadata": {},
|
|
|
|
+ "source": [
|
|
|
|
+ "### New columns\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "- Create a column `docstring_length` with the number of characters of the docstring\n",
|
|
|
|
+ "- Create a column `problems` with the list or errors and warnings in a single list"
|
|
|
|
+ ]
|
|
|
|
+ },
|
|
|
|
+ {
|
|
|
|
+ "cell_type": "code",
|
|
|
|
+ "execution_count": null,
|
|
|
|
+ "metadata": {},
|
|
|
|
+ "outputs": [],
|
|
|
|
+ "source": []
|
|
|
|
+ },
|
|
|
|
+ {
|
|
|
|
+ "cell_type": "markdown",
|
|
|
|
+ "metadata": {},
|
|
|
|
+ "source": [
|
|
|
|
+ "### Delete information not needed\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "- Remove docstrings of functions being deprecated\n",
|
|
|
|
+ "- Remove columns `errors` and `warnings`\n",
|
|
|
|
+ "- Remove the `docstring`"
|
|
|
|
+ ]
|
|
|
|
+ },
|
|
|
|
+ {
|
|
|
|
+ "cell_type": "code",
|
|
|
|
+ "execution_count": null,
|
|
|
|
+ "metadata": {},
|
|
|
|
+ "outputs": [],
|
|
|
|
+ "source": []
|
|
|
|
+ },
|
|
|
|
+ {
|
|
|
|
+ "cell_type": "markdown",
|
|
|
|
+ "metadata": {},
|
|
|
|
+ "source": [
|
|
|
|
+ "### Create a row per problem\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "- Discuss possible ways of creating a row for each problem in the lists of the column `problem`\n",
|
|
|
|
+ "- Check the size of the `DataFrame`\n",
|
|
|
|
+ "- Calculate the expected new size\n",
|
|
|
|
+ "- Perform the transformation to have one row per problem\n",
|
|
|
|
+ "- Check that the new `DataFrame` has the expected size"
|
|
|
|
+ ]
|
|
|
|
+ },
|
|
|
|
+ {
|
|
|
|
+ "cell_type": "code",
|
|
|
|
+ "execution_count": null,
|
|
|
|
+ "metadata": {},
|
|
|
|
+ "outputs": [],
|
|
|
|
+ "source": []
|
|
|
|
+ },
|
|
|
|
+ {
|
|
|
|
+ "cell_type": "markdown",
|
|
|
|
+ "metadata": {},
|
|
|
|
+ "source": [
|
|
|
|
+ "### Extract problem information\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "- Get the problem information of the first row in the `DataFrame`\n",
|
|
|
|
+ "- How can we get the values for the `code` and the `message` independently\n",
|
|
|
|
+ "- Implement it for the whole column at the same time\n",
|
|
|
|
+ "- Discuss if there are other ways to extract them"
|
|
|
|
+ ]
|
|
|
|
+ },
|
|
|
|
+ {
|
|
|
|
+ "cell_type": "code",
|
|
|
|
+ "execution_count": null,
|
|
|
|
+ "metadata": {},
|
|
|
|
+ "outputs": [],
|
|
|
|
+ "source": []
|
|
|
|
+ },
|
|
|
|
+ {
|
|
|
|
+ "cell_type": "markdown",
|
|
|
|
+ "metadata": {},
|
|
|
|
+ "source": [
|
|
|
|
+ "### Save data as categories\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "- Check the number of unique values in every column\n",
|
|
|
|
+ "- Discuss what are the advantages of using categories\n",
|
|
|
|
+ "- Check which is the memory usage of the `DataFrame`\n",
|
|
|
|
+ "- Convert to categories the columns that make sense\n",
|
|
|
|
+ "- Check again the memory usage"
|
|
|
|
+ ]
|
|
|
|
+ },
|
|
|
|
+ {
|
|
|
|
+ "cell_type": "code",
|
|
|
|
+ "execution_count": null,
|
|
|
|
+ "metadata": {},
|
|
|
|
+ "outputs": [],
|
|
|
|
+ "source": []
|
|
|
|
+ },
|
|
|
|
+ {
|
|
|
|
+ "cell_type": "markdown",
|
|
|
|
+ "metadata": {},
|
|
|
|
+ "source": [
|
|
|
|
+ "### Save data to disk\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "- Save data into `data/docstring_errors_pandas023.hd5`\n",
|
|
|
|
+ "- Discuss what is the effect of the parameter `key` and try more than one value\n",
|
|
|
|
+ "- Load the data again from the format\n",
|
|
|
|
+ "- Check whether the data is still the same after reloading it, what is the cause if not, and how to fix it"
|
|
|
|
+ ]
|
|
|
|
+ },
|
|
|
|
+ {
|
|
|
|
+ "cell_type": "code",
|
|
|
|
+ "execution_count": null,
|
|
|
|
+ "metadata": {},
|
|
|
|
+ "outputs": [],
|
|
|
|
+ "source": []
|
|
|
|
+ },
|
|
|
|
+ {
|
|
|
|
+ "cell_type": "markdown",
|
|
|
|
+ "metadata": {},
|
|
|
|
+ "source": [
|
|
|
|
+ "### Solution"
|
|
|
|
+ ]
|
|
|
|
+ },
|
|
|
|
+ {
|
|
|
|
+ "cell_type": "code",
|
|
|
|
+ "execution_count": null,
|
|
|
|
+ "metadata": {},
|
|
|
|
+ "outputs": [],
|
|
|
|
+ "source": [
|
|
|
|
+ "%load solutions/pandas_docstrings.py"
|
|
|
|
+ ]
|
|
|
|
+ }
|
|
|
|
+ ],
|
|
|
|
+ "metadata": {
|
|
|
|
+ "kernelspec": {
|
|
|
|
+ "display_name": "Python 3",
|
|
|
|
+ "language": "python",
|
|
|
|
+ "name": "python3"
|
|
|
|
+ },
|
|
|
|
+ "language_info": {
|
|
|
|
+ "codemirror_mode": {
|
|
|
|
+ "name": "ipython",
|
|
|
|
+ "version": 3
|
|
|
|
+ },
|
|
|
|
+ "file_extension": ".py",
|
|
|
|
+ "mimetype": "text/x-python",
|
|
|
|
+ "name": "python",
|
|
|
|
+ "nbconvert_exporter": "python",
|
|
|
|
+ "pygments_lexer": "ipython3",
|
|
|
|
+ "version": "3.7.3"
|
|
|
|
+ }
|
|
|
|
+ },
|
|
|
|
+ "nbformat": 4,
|
|
|
|
+ "nbformat_minor": 2
|
|
|
|
+}
|