# Tutorial: Errors in the pandas API reference

In Python, documentation of objects is defined in the objects themselves. For example:

```python
def divide(dividend, divisor):
    """
    Compute the division of two floating point numbers.

    Parameters
    ----------
    dividend : float
        Number to divide.
    divisor : float
        Number to divide by.

    Returns
    -------
    float
        The result of the division.
    """
    return dividend / divisor
```

These documentation strings are called docstrings. There are tools that extract them
and generate the web version of the documentation.

To make sure the documentation renders correctly on the web, and to keep the pages
consistent with each other, there are some standards that we aim to follow. For historical
reasons, many docstrings don't follow these standards.

Each kind of error is identified by a code. Some examples:

- **SS02**: Summary does not start with a capital letter
- **SS03**: Summary does not end with a period
- **PR01**: Parameters {missing_params} not documented
- **RT01**: No Returns section found

The following docstring would trigger all of them:

```python
def divide(dividend, divisor):
    """
    compute the division of two floating point numbers

    Parameters
    ----------
    dividend : float
        Number to divide.
    """
    return dividend / divisor
```
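To make the four codes concrete, here is a minimal sketch of how such checks could work. The function `check_docstring` is hypothetical and deliberately simplified (it only looks for `name : type` lines and a `Returns` header); it is not the validation script used in this tutorial.

```python
import inspect


def divide(dividend, divisor):
    """
    compute the division of two floating point numbers

    Parameters
    ----------
    dividend : float
        Number to divide.
    """
    return dividend / divisor


def check_docstring(func):
    """Return a list of (code, message) tuples for a few simple checks."""
    # inspect.getdoc removes the common indentation and surrounding blank lines.
    lines = (inspect.getdoc(func) or "").splitlines()
    summary = lines[0] if lines else ""
    problems = []

    if summary and not summary[0].isupper():
        problems.append(("SS02", "Summary does not start with a capital letter"))
    if summary and not summary.endswith("."):
        problems.append(("SS03", "Summary does not end with a period"))

    # Simplification: any "name : type" line counts as a documented parameter.
    documented = {line.split(" : ")[0].strip() for line in lines if " : " in line}
    missing = [name for name in inspect.signature(func).parameters
               if name not in documented]
    if missing:
        problems.append(("PR01", f"Parameters {missing} not documented"))

    if "Returns" not in lines:
        problems.append(("RT01", "No Returns section found"))
    return problems


for code, message in check_docstring(divide):
    print(f"{code}: {message}")
```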

We developed a script that automatically detects and reports these errors. The
following command returns every error in the pandas code base in JSON format:

```
./scripts/validate_docstrings.py --format=json | gzip > docstring_errors_pandas023.json.gz
```

In this tutorial we will load the output of this script and transform it
to keep only the relevant information.

```python
import os
import pandas

DATA_FNAME = os.path.join('data', 'docstring_errors_pandas023.json.gz')
```

### Load data

- Load data from the json file `DATA_FNAME`
- Try reading the data with different `orient` values, and read it so every row is a docstring
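As a rough illustration of what `orient` changes, the sketch below reads a tiny in-memory JSON document twice. The structure (one entry per object name, with `docstring`, `deprecated`, `errors` and `warnings` fields) and the object names are assumptions made for the example, not a guaranteed description of the real file.

```python
import io

import pandas

# Hypothetical miniature of the script output, keyed by object name.
raw_json = """
{
  "pandas.Series.demo": {
    "docstring": "Return a demo value.",
    "deprecated": false,
    "errors": [["RT01", "No Returns section found"]],
    "warnings": []
  },
  "pandas.DataFrame.demo": {
    "docstring": "compute something",
    "deprecated": false,
    "errors": [["SS02", "Summary does not start with a capital letter"]],
    "warnings": []
  }
}
"""

# orient="index": the outer keys become the row labels (one row per docstring).
by_index = pandas.read_json(io.StringIO(raw_json), orient="index")

# orient="columns": the outer keys become the column labels instead.
by_columns = pandas.read_json(io.StringIO(raw_json), orient="columns")

print(by_index.shape)
print(by_columns.shape)
```

With the real file, passing `DATA_FNAME` instead of the `StringIO` object should work the same way, since `read_json` infers gzip compression from the `.gz` extension by default.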

### New columns

- Create a column `docstring_length` with the number of characters of the docstring
- Create a column `problems` with the errors and warnings combined in a single list
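One possible way to build both columns, sketched on toy data. The assumption here is that `errors` and `warnings` each hold a list of `[code, message]` pairs per row; the rows themselves are invented for the example.

```python
import pandas

# Toy data standing in for the loaded file (contents are illustrative only).
df = pandas.DataFrame(
    {
        "docstring": ["Return a demo value.", "compute something"],
        "errors": [
            [["RT01", "No Returns section found"]],
            [["SS02", "Summary does not start with a capital letter"],
             ["SS03", "Summary does not end with a period"]],
        ],
        "warnings": [[["SA01", "See Also section not found"]], []],
    },
    index=["pandas.Series.demo", "pandas.DataFrame.demo"],
)

# Number of characters in each docstring.
df["docstring_length"] = df["docstring"].str.len()

# Adding two object columns applies `+` row by row, which concatenates the lists.
df["problems"] = df["errors"] + df["warnings"]

print(df[["docstring_length", "problems"]])
```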

### Delete information not needed

- Remove docstrings of functions being deprecated
- Remove columns `errors` and `warnings`
- Remove the `docstring`
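A minimal sketch of this clean-up, assuming a boolean `deprecated` column marks the docstrings to discard. The column name and the toy rows are assumptions.

```python
import pandas

# Toy data; only the column names matter for this sketch.
df = pandas.DataFrame(
    {
        "docstring": ["Return a demo value.", "Old function.", "compute something"],
        "deprecated": [False, True, False],
        "errors": [[], [], []],
        "warnings": [[], [], []],
        "problems": [[], [], []],
    },
    index=["pandas.a", "pandas.b", "pandas.c"],
)

# Keep only the rows that are not deprecated (boolean mask, negated with ~).
df = df[~df["deprecated"]]

# Drop the columns that are no longer needed.
df = df.drop(columns=["errors", "warnings", "docstring"])

print(df)
```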

### Create a row per problem

- Discuss possible ways of creating a row for each problem in the lists of the column `problems`
- Check the size of the `DataFrame`
- Calculate the expected new size
- Perform the transformation to have one row per problem
- Check that the new `DataFrame` has the expected size
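One way to perform this transformation is `DataFrame.explode`, which assumes pandas 0.25 or later (it is not available in pandas 0.23). The sketch below uses toy data and is only one of the possible approaches.

```python
import pandas

# Toy data: each row holds a list of [code, message] problems.
df = pandas.DataFrame(
    {
        "docstring_length": [20, 17, 30],
        "problems": [
            [["SS02", "Summary does not start with a capital letter"],
             ["RT01", "No Returns section found"]],
            [["PR01", "Parameters {missing_params} not documented"]],
            [],
        ],
    },
    index=["pandas.a", "pandas.b", "pandas.c"],
)

print(df.shape)

# Expected number of rows: the total number of problems across all lists.
expected_rows = df["problems"].str.len().sum()

# explode creates one row per list element and repeats the index label.
# An empty list becomes a single missing value, so those rows are dropped.
exploded = df.explode("problems").dropna(subset=["problems"])

print(expected_rows, len(exploded))
```

Note that only the outer list is expanded: each value in `problems` is still a `[code, message]` pair after the call.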

### Extract problem information

- Get the problem information of the first row in the `DataFrame`
- How can we get the values for the `code` and the `message` independently?
- Implement it for the whole column at the same time
- Discuss if there are other ways to extract them
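The sketch below shows two possible ways to split the problems, assuming each value is a two-element `[code, message]` list. That structure and the toy rows are assumptions.

```python
import pandas

# Toy data: one [code, message] pair per row.
df = pandas.DataFrame(
    {
        "problems": [
            ["SS02", "Summary does not start with a capital letter"],
            ["RT01", "No Returns section found"],
        ],
    },
    index=["pandas.a", "pandas.b"],
)

# First row only: plain Python unpacking.
code, message = df["problems"].iloc[0]
print(code, "-", message)

# Whole column at once: the .str accessor also indexes into lists.
df["code"] = df["problems"].str[0]
df["message"] = df["problems"].str[1]

# Alternative: build a two-column DataFrame from the list of pairs.
split = pandas.DataFrame(
    df["problems"].tolist(), index=df.index, columns=["code", "message"]
)

print(df[["code", "message"]])
```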

### Save data as categories

- Check the number of unique values in every column
- Discuss the advantages of using categories
- Check the memory usage of the `DataFrame`
- Convert to categories the columns that make sense
- Check again the memory usage
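A small sketch of these checks on synthetic data: a column with only four distinct strings repeated many times, which is the kind of column where categories tend to help. The column names and values are illustrative.

```python
import pandas

# Synthetic data: 1000 rows, but only 4 distinct error codes.
codes = ["SS02", "SS03", "PR01", "RT01"]
df = pandas.DataFrame(
    {
        "code": codes * 250,
        "docstring_length": list(range(1000)),
    }
)

# Columns with few unique values are the candidates for categories.
print(df.nunique())

# deep=True also counts the memory used by the Python strings themselves.
memory_before = df.memory_usage(deep=True).sum()

# A category column stores each distinct value once, plus small integer codes.
df["code"] = df["code"].astype("category")
memory_after = df.memory_usage(deep=True).sum()

print(memory_before, memory_after)
```

Converting `docstring_length` would make little sense here, since nearly every value is unique.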

### Save data to disk

- Save data into `data/docstring_errors_pandas023.hd5`
- Discuss what is the effect of the parameter `key` and try more than one value
- Load the data again from the HDF5 file
- Check whether the data is still the same after reloading it. If it is not, find the cause and how to fix it
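The following sketch writes a toy `DataFrame` to a temporary HDF5 file under two different keys and reads one of them back. It assumes the optional PyTables dependency is installed; the helper `hdf_roundtrip`, the file name, the key names and the choice of `format="table"` are all choices made for this example.

```python
import os
import tempfile

import pandas

df = pandas.DataFrame(
    {"code": ["SS02", "RT01"], "docstring_length": [120, 45]}
)


def hdf_roundtrip(frame, path, key):
    """Write `frame` under `key` in the HDF5 file at `path` and read it back."""
    # to_hdf appends to an existing file by default, so several keys can coexist.
    frame.to_hdf(path, key=key, format="table")
    return pandas.read_hdf(path, key=key)


reloaded = None
try:
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "docstring_errors_demo.hd5")
        reloaded = hdf_roundtrip(df, path, key="errors")
        hdf_roundtrip(df.head(1), path, key="first_row")
        # Each key identifies a separate object stored in the same file.
        with pandas.HDFStore(path, mode="r") as store:
            print(sorted(store.keys()))
except ImportError:
    print("PyTables is not installed, skipping the HDF5 example")

if reloaded is not None:
    print(reloaded.equals(df))
```

`DataFrame.equals` is one way to compare the reloaded data with the original. Comparing `dtypes` and the index separately can help locate any difference.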

### Solution

```python
%load solutions/pandas_docstrings.py
```