{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"[Home Page](../START_HERE.ipynb)\n",
"\n",
"[Previous Notebook](01-Intro_to_Dask.ipynb)\n",
" \n",
" \n",
" \n",
" \n",
"[1](01-Intro_to_Dask.ipynb)\n",
"[2]\n",
"[3](03-CuML_and_Dask.ipynb)\n",
"[4](04-Challenge.ipynb)\n",
"[5](05-Challenge_Solution.ipynb)\n",
" \n",
" \n",
" \n",
" \n",
"[Next Notebook](03-CuML_and_Dask.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to cuDF and Dask-cuDF\n",
"\n",
"Modeled after 10 Minutes to Pandas, this is a short introduction to cuDF and Dask-cuDF, geared mainly toward new users. The tutorial is split into modules with embedded exercises for you to practice the concepts. Each concept is shown with both cuDF and Dask-cuDF syntax so you can compare the two APIs side by side.\n",
"\n",
"\n",
"\n",
"### What are these Libraries?\n",
"\n",
"[cuDF](https://github.com/rapidsai/cudf) is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data.\n",
"\n",
"[Dask](https://dask.org/) is a flexible library for parallel computing in Python that makes scaling out your workflow smooth and simple.\n",
"\n",
"[Dask-cuDF](https://github.com/rapidsai/dask-cudf) is a library that provides a partitioned, GPU-backed dataframe, using Dask.\n",
"\n",
"\n",
"\n",
"### When to use cuDF and Dask-cuDF\n",
"\n",
"If your workflow is fast enough on a single GPU, or your data fits comfortably in a single GPU's memory, you should use cuDF. If you need to distribute your workflow across multiple GPUs, have more data than fits in a single GPU's memory, or want to analyze data spread across many files at once, you should use Dask-cuDF.\n",
"\n",
"\n",
"\n",
"## Contents of this lab\n",
"\n",
"\n",
"- Creating Dask-cuDF Objects: This module shows you how to work with Dask-cuDF dataframes, the distributed GPU equivalent of pandas dataframes, for faster data transactions. It covers creating Dask-cuDF objects, viewing data, selecting data, Boolean indexing, and dealing with missing data.\n",
"- Operations: Learn how to view descriptive statistics, perform string operations, histogramming, concatenation, joins, appends, grouping, and `applymap`.\n",
"- Time Series: An introduction to working with time-series data in Dask-cuDF.\n",
"- Categoricals: An introduction to using categorical data in Dask-cuDF.\n",
"- Converting Data Representations: Converting between data representations, including Arrow, pandas, and NumPy, as commonly required in data science pipelines.\n",
"- Getting Data In and Out: Transferring Dask-cuDF dataframes to and from CSV and Parquet files.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import cudf\n",
"import dask_cudf\n",
"\n",
"np.random.seed(12)\n",
"\n",
"#### Portions of this were borrowed and adapted from the\n",
"#### cuDF cheatsheet, existing cuDF documentation,\n",
"#### and 10 Minutes to Pandas."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"Object Creation\n",
"---------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating a `cudf.Series` and `dask_cudf.Series`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s = cudf.Series([1,2,3,None,4])\n",
"print(s)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds = dask_cudf.from_cudf(s, npartitions=2) \n",
"print(ds.compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating a `cudf.DataFrame` and a `dask_cudf.DataFrame` by specifying values for each column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = cudf.DataFrame({'a': list(range(20)),\n",
" 'b': list(reversed(range(20))),\n",
" 'c': list(range(20))\n",
" })"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ddf = dask_cudf.from_cudf(df, npartitions=2) \n",
"print(ddf.compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating a `cudf.DataFrame` from a pandas `DataFrame` and a `dask_cudf.DataFrame` from a `cudf.DataFrame`.\n",
"\n",
"*Note that best practice for using Dask-cuDF is to read data directly into a `dask_cudf.DataFrame` with something like `read_csv` (discussed below).*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pdf = pd.DataFrame({'a': [0, 1, 2, 3],'b': [0.1, 0.2, None, 0.3]})\n",
"gdf = cudf.DataFrame.from_pandas(pdf)\n",
"print(gdf)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dask_gdf = dask_cudf.from_cudf(gdf, npartitions=2)\n",
"print(dask_gdf.compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"Viewing Data\n",
"-------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Viewing the top rows of a GPU dataframe."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(df.head(2))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ddf.head(2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sorting by values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(df.sort_values(by='b'))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ddf.sort_values(by='b').compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"Selection\n",
"------------\n",
"\n",
"## Getting"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Selecting a single column (equivalent to `df.a`), which initially yields a `cudf.Series` or `dask_cudf.Series`. Calling `compute` on the Dask-cuDF version returns a `cudf.Series`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(df['a'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ddf['a'].compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Selection by Label"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Selecting rows from index 2 to index 5 from columns 'a' and 'b'."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(df.loc[2:5, ['a', 'b']])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ddf.loc[2:5, ['a', 'b']].compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Selection by Position"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Selecting via integers and integer slices, like numpy/pandas. Note that this functionality is not available for Dask-cuDF DataFrames."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(df.iloc[0])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(df.iloc[0:3, 0:2])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also select elements of a `DataFrame` or `Series` with direct index access."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(df[3:5])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(s[3:5])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Boolean Indexing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Selecting rows in a `DataFrame` or `Series` by direct Boolean indexing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(df[df.b > 15])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ddf[ddf.b > 15].compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Selecting values from a `DataFrame` where a Boolean condition is met, via the `query` API."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(df.query(\"b == 3\")) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ddf.query(\"b == 3\").compute()) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also pass local variables to Dask-cuDF queries via the `local_dict` keyword argument. With standard cuDF, you may either use `local_dict` or reference the variable directly with the `@` prefix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cudf_comparator = 3\n",
"print(df.query(\"b == @cudf_comparator\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dask_cudf_comparator = 3\n",
"print(ddf.query(\"b == @val\", local_dict={'val':dask_cudf_comparator}).compute()) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Supported logical operators include `>`, `<`, `>=`, `<=`, `==`, and `!=`."
]
},
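{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conditions can also be combined inside a single query expression with `and`/`or`. A quick sketch reusing `df` and `ddf` from above (the chosen thresholds are illustrative only):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Rows where both conditions hold\n",
"print(df.query(\"a > 10 and b > 5\"))\n",
"print(ddf.query(\"a > 10 and b > 5\").compute())"
]
},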
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## MultiIndex"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"cuDF supports hierarchical indexing of DataFrames using MultiIndex. Grouping hierarchically (see `Grouping` below) automatically produces a DataFrame with a MultiIndex."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"arrays = [['a', 'a', 'b', 'b'],\n",
" [1, 2, 3, 4]]\n",
"tuples = list(zip(*arrays))\n",
"idx = cudf.MultiIndex.from_tuples(tuples)\n",
"idx"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This index can back either axis of a DataFrame."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gdf1 = cudf.DataFrame({'first': np.random.rand(4), 'second': np.random.rand(4)})\n",
"gdf1.index = idx\n",
"print(gdf1.to_pandas())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gdf2 = cudf.DataFrame({'first': np.random.rand(4), 'second': np.random.rand(4)}).T\n",
"gdf2.columns = idx\n",
"print(gdf2.to_pandas())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Accessing values of a DataFrame with a MultiIndex. Note that slicing is not yet supported."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(gdf1.loc[('b', 3)].to_pandas())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"Missing Data\n",
"------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Missing data can be replaced by using the `fillna` method."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(s.fillna(999))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ds.fillna(999).compute())"
]
},
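{
"cell_type": "markdown",
"metadata": {},
"source": [
"Missing values can also be dropped rather than replaced. A minimal sketch, assuming `dropna` behaves as it does in pandas:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Remove the null entry instead of filling it\n",
"print(s.dropna())\n",
"print(ds.dropna().compute())"
]
},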
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"Operations\n",
"------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Stats"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calculating descriptive statistics for a `Series`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(s.mean(), s.var())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ds.mean().compute(), ds.var().compute())"
]
},
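{
"cell_type": "markdown",
"metadata": {},
"source": [
"Several summary statistics can be computed in one call with `describe`, as in pandas. A quick sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# count, mean, std, min, quartiles and max in a single call\n",
"print(s.describe())\n",
"print(ds.describe().compute())"
]
},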
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Applymap"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Applying functions to a `Series`. Note that applying user-defined functions directly with Dask-cuDF is not yet implemented. For now, you can use [map_partitions](http://docs.dask.org/en/stable/dataframe-api.html#dask.dataframe.DataFrame.map_partitions) to apply a function to each partition of the distributed dataframe."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def add_ten(num):\n",
" return num + 10\n",
"\n",
"print(df['a'].applymap(add_ten))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ddf['a'].map_partitions(add_ten).compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Histogramming"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Counting the number of occurrences of each unique value in a column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(df.a.value_counts())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ddf.a.value_counts().compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## String Methods"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Like pandas, cuDF provides string processing methods in the `str` attribute of `Series`. Full documentation of string methods is a work in progress. Please see the cuDF API documentation for more information."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s = cudf.Series(['A', 'B', 'C', 'Aaba', 'Baca', None, 'CABA', 'dog', 'cat'])\n",
"print(s.str.lower())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds = dask_cudf.from_cudf(s, npartitions=2)\n",
"print(ds.str.lower().compute())"
]
},
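{
"cell_type": "markdown",
"metadata": {},
"source": [
"Beyond `lower`, the `str` accessor also supports pattern matching. A minimal sketch using `contains`, which mirrors the pandas method (note the null entry propagates as null):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Boolean mask: which strings contain the letter 'a'\n",
"print(s.str.contains('a'))\n",
"print(ds.str.contains('a').compute())"
]
},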
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Concat"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Concatenating `Series` and `DataFrames` row-wise."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s = cudf.Series([1, 2, 3, None, 5])\n",
"print(cudf.concat([s, s]))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds2 = dask_cudf.from_cudf(s, npartitions=2)\n",
"print(dask_cudf.concat([ds2, ds2]).compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Join"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Performing SQL style merges. Note that the dataframe order is not maintained, but may be restored post-merge by sorting by the index."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_a = cudf.DataFrame()\n",
"df_a['key'] = ['a', 'b', 'c', 'd', 'e']\n",
"df_a['vals_a'] = [float(i + 10) for i in range(5)]\n",
"\n",
"df_b = cudf.DataFrame()\n",
"df_b['key'] = ['a', 'c', 'e']\n",
"df_b['vals_b'] = [float(i+100) for i in range(3)]\n",
"\n",
"merged = df_a.merge(df_b, on=['key'], how='left')\n",
"print(merged)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ddf_a = dask_cudf.from_cudf(df_a, npartitions=2)\n",
"ddf_b = dask_cudf.from_cudf(df_b, npartitions=2)\n",
"\n",
"merged = ddf_a.merge(ddf_b, on=['key'], how='left').compute()\n",
"print(merged)"
]
},
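{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `how` argument selects the join type. For example, an inner join keeps only the keys present in both frames. A quick sketch with the same data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Only keys 'a', 'c' and 'e' appear in both frames\n",
"print(df_a.merge(df_b, on=['key'], how='inner'))\n",
"print(ddf_a.merge(ddf_b, on=['key'], how='inner').compute())"
]
},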
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Append"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Appending values from another `Series` or array-like object."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(s.append(s))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ds2.append(ds2).compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Grouping"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Like pandas, cuDF and Dask-cuDF support the Split-Apply-Combine groupby paradigm."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['agg_col1'] = [1 if x % 2 == 0 else 0 for x in range(len(df))]\n",
"df['agg_col2'] = [1 if x % 3 == 0 else 0 for x in range(len(df))]\n",
"\n",
"ddf = dask_cudf.from_cudf(df, npartitions=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Grouping and then applying the `sum` function to the grouped data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(df.groupby('agg_col1').sum())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ddf.groupby('agg_col1').sum().compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Grouping hierarchically then applying the `sum` function to grouped data. We send the result to a pandas dataframe only for printing purposes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(df.groupby(['agg_col1', 'agg_col2']).sum().to_pandas())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ddf.groupby(['agg_col1', 'agg_col2']).sum().compute().to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Grouping and applying statistical functions to specific columns, using `agg`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(df.groupby('agg_col1').agg({'a':'max', 'b':'mean', 'c':'sum'}))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ddf.groupby('agg_col1').agg({'a':'max', 'b':'mean', 'c':'sum'}).compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Transpose"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Transposing a dataframe, using either the `transpose` method or `T` property. Currently, all columns must have the same type. Transposing is not currently implemented in Dask-cuDF."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample = cudf.DataFrame({'a':[1,2,3], 'b':[4,5,6]})\n",
"print(sample)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(sample.transpose())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"Time Series\n",
"------------\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`DataFrame` objects support `datetime`-typed columns, which allow users to interact with and filter data based on specific timestamps."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import datetime as dt\n",
"\n",
"date_df = cudf.DataFrame()\n",
"date_df['date'] = pd.date_range('11/20/2018', periods=72, freq='D')\n",
"date_df['value'] = np.random.sample(len(date_df))\n",
"\n",
"search_date = dt.datetime.strptime('2018-11-23', '%Y-%m-%d')\n",
"print(date_df.query('date <= @search_date'))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"date_ddf = dask_cudf.from_cudf(date_df, npartitions=2)\n",
"print(date_ddf.query('date <= @search_date', local_dict={'search_date':search_date}).compute())"
]
},
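{
"cell_type": "markdown",
"metadata": {},
"source": [
"Individual components of a timestamp column can be extracted through the `dt` accessor, mirroring pandas. A minimal sketch, assuming `dt.month` is available in your cuDF version:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Extract the month from each timestamp\n",
"print(date_df['date'].dt.month)"
]
},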
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"Categoricals\n",
"------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`DataFrames` support categorical columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pdf = pd.DataFrame({\"id\":[1,2,3,4,5,6], \"grade\":['a', 'b', 'b', 'a', 'a', 'e']})\n",
"pdf[\"grade\"] = pdf[\"grade\"].astype(\"category\")\n",
"\n",
"gdf = cudf.DataFrame.from_pandas(pdf)\n",
"print(gdf)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dgdf = dask_cudf.from_cudf(gdf, npartitions=2)\n",
"print(dgdf.compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Accessing the categories of a column. Note that this is currently not supported in Dask-cuDF."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gdf.grade.cat.categories"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Accessing the underlying code values of each categorical observation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(gdf.grade.cat.codes)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(dgdf.grade.cat.codes.compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"\n",
"Converting Data Representation\n",
"--------------------------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Converting a cuDF and Dask-cuDF `DataFrame` to a pandas `DataFrame`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(df.head().to_pandas())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ddf.compute().head().to_pandas())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Numpy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Converting a cuDF or Dask-cuDF `DataFrame` to a numpy `ndarray`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(df.as_matrix())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ddf.compute().as_matrix())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Converting a cuDF or Dask-cuDF `Series` to a numpy `ndarray`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(df['a'].to_array())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ddf['a'].compute().to_array())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Arrow"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Converting a cuDF or Dask-cuDF `DataFrame` to a PyArrow `Table`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(df.to_arrow())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"print(ddf.compute().to_arrow())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"Getting Data In/Out\n",
"------------------------\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## CSV"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Writing to a CSV file by first sending the data to a pandas `DataFrame` on the host."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if not os.path.exists('example_output'):\n",
" os.mkdir('example_output')\n",
" \n",
"df.to_pandas().to_csv('example_output/foo.csv', index=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"ddf.compute().to_pandas().to_csv('example_output/foo_dask.csv', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading from a CSV file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = cudf.read_csv('example_output/foo.csv')\n",
"print(df)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ddf = dask_cudf.read_csv('example_output/foo_dask.csv')\n",
"print(ddf.compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading all CSV files in a directory into a single `dask_cudf.DataFrame`, using the star wildcard."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ddf = dask_cudf.read_csv('example_output/*.csv')\n",
"print(ddf.compute())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Parquet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Writing to parquet files using the CPU via PyArrow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.to_parquet('example_output/temp_parquet')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading parquet files with a GPU-accelerated parquet reader."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = cudf.read_parquet('example_output/temp_parquet')\n",
"print(df.to_pandas())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Writing to parquet files from a `dask_cudf.DataFrame` using PyArrow under the hood."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ddf.to_parquet('example_files') "
]
},
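{
"cell_type": "markdown",
"metadata": {},
"source": [
"The partitioned Parquet data can be read back into a `dask_cudf.DataFrame` with `read_parquet`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Read the directory of Parquet partitions written above\n",
"ddf2 = dask_cudf.read_parquet('example_files')\n",
"print(ddf2.compute())"
]
},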
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Conclusion\n",
"\n",
"Now we are familiar with creating Dask-cuDF dataframes and with selecting, viewing, and manipulating data. The operations are almost identical to their pandas counterparts, so they can easily replace the pandas operations in a traditional data science pipeline. While results will vary across GPUs, distributed GPU acceleration can make a significant difference: we can get much faster results with the same code! The next tutorial shows how to use cuML with Dask, letting us boost our models with distributed GPU programming as well."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licensing\n",
" \n",
"This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Previous Notebook](01-Intro_to_Dask.ipynb)\n",
" \n",
" \n",
" \n",
" \n",
"[1](01-Intro_to_Dask.ipynb)\n",
"[2]\n",
"[3](03-CuML_and_Dask.ipynb)\n",
"[4](04-Challenge.ipynb)\n",
" \n",
" \n",
" \n",
" \n",
"[Next Notebook](03-CuML_and_Dask.ipynb)\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"[Home Page](../START_HERE.ipynb)"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}