{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "     \n",
    "     \n",
    "     \n",
    "     \n",
    "     \n",
    "   \n",
    "[Home Page](../START_HERE.ipynb)\n",
    "\n",
    "     \n",
    "     \n",
    "     \n",
    "     \n",
    "     \n",
    "   \n",
    "[1]\n",
    "[2](02-Intro_to_cuDF_UDFs.ipynb)\n",
    "[3](03-Cudf_Exercise.ipynb)\n",
    "     \n",
    "     \n",
    "     \n",
    "     \n",
    "[Next Notebook](02-Intro_to_cuDF_UDFs.ipynb)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Intro to cuDF\n",
    "=======================\n",
    "\n",
    "Welcome to first cuDF tutorial notebook! This is a short introduction to cuDF, partly modeled after 10 Minutes to Pandas, geared primarily for new users. cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data. While we'll only cover some of the cuDF functionality, but at the end of this tutorial we hope you'll feel confident creating and analyzing GPU DataFrames. The tutorial is split into different modules as shown in the list below, based on their individual content and also contain embedded exercises for your practice.\n",
    "\n",
    "For more information about CuDF refer to the documentation here: https://docs.rapids.ai/api/cudf/stable/\n",
    " \n",
    "We'll start by introducing the pandas library, and quickly move on cuDF."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Here is the list of contents in the lab:\n",
    "- Pandas
 A short introduction to the Pandas library, used commonly in data science for data manipulation and convenient storage.\n",
    "- CuDF Dataframes
 This module shows you how to work with CuDF dataframes, the GPU equivalent of Pandas dataframes, for faster data transactions. It includes creating CuDF objects, viewing data, selecting data, boolean indexing and dealing with missing data.\n",
    "- Operations
 Learn how to view descriptive statistics, perform string operations, concatenate, joins, append, group data and use applymap.\n",
    "- TimeSeries
 Introduction to using TimeSeries data format in CuDF   \n",
    "- Converting Data Representations
 Here we will work with converting data representations, including CuPy, Pandas and Numpy, that are commonly required in data science pipelines.\n",
    "- Getting Data In and Out
 Transfering CuDf dataframes to and from CSV files.\n",
    "- Performance
 Study the performance metrics of code written using CuDF compared to the regular data science operations and observe the benefits.\n",
    "- Sensor Data Example
 Apply the fundamental methods of CuDF to work with a sample of sensor data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import numpy as np\n",
    "import math\n",
    "np.random.seed(12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Pandas\n",
    "\n",
    "Data scientists typically work with two types of data: unstructured and structured. Unstructured data often comes in the form of text, images, or videos. Structured data - as the name suggests - comes in a structured form, often represented by a table or CSV. We'll focus the majority of these tutorials on working with structured data.\n",
    "\n",
    "There exist many tools in the Python ecosystem for working with structured, tabular data but few are as widely used as Pandas. Pandas represents data in a table and allows a data scientist to manipulate the data to perform a number of useful operations such as filtering, transforming, aggregating, merging, visualizing and many more. \n",
    "\n",
    "For more information on Pandas, check out the excellent documentation: http://pandas.pydata.org/pandas-docs/stable/\n",
    "\n",
    "Below we show how to create a Pandas DataFrame, an internal object for representing tabular data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd; print('Pandas Version:', pd.__version__)\n",
    "\n",
    "\n",
    "# here we create a Pandas DataFrame with\n",
    "# two columns named \"key\" and \"value\"\n",
    "df = pd.DataFrame()\n",
    "df['key'] = [0, 0, 2, 2, 3]\n",
    "df['value'] = [float(i + 10) for i in range(5)]\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can perform many operations on this data. For example, let's say we wanted to sum all values in the in the `value` column. We could accomplish this using the following syntax:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aggregation = df['value'].sum()\n",
    "print(aggregation)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## cuDF\n",
    "\n",
    "Pandas is fantastic for working with small datasets that fit into your system's memory and workflows that are not computationally intense. However, datasets are growing larger and data scientists are working with increasingly complex workloads - the need for accelerated computing is increasing rapidly.\n",
    "\n",
    "cuDF is a package within the RAPIDS ecosystem that allows data scientists to easily migrate their existing Pandas workflows from CPU to GPU, where computations can leverage the immense parallelization that GPUs provide.\n",
    "\n",
    "Below, we show how to create a cuDF DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cudf; print('cuDF Version:', cudf.__version__)\n",
    "\n",
    "\n",
    "# here we create a cuDF DataFrame with\n",
    "# two columns named \"key\" and \"value\"\n",
    "df = cudf.DataFrame()\n",
    "df['key'] = [0, 0, 2, 2, 3]\n",
    "df['value'] = [float(i + 10) for i in range(5)]\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As before, we can take this cuDF DataFrame and perform a `sum` operation over the `value` column. The key difference is that any operations we perform using cuDF use the GPU instead of the CPU."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aggregation = df['value'].sum()\n",
    "print(aggregation)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note how the syntax for both creating and manipulating a cuDF DataFrame is identical to the syntax necessary to create and manipulate Pandas DataFrames; the cuDF API is based on the Pandas API. This design choice minimizes the cognitive burden of switching from a CPU based workflow to a GPU based workflow and allows data scientists to focus on solving problems while benefitting from the speed of a GPU!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "# DataFrame Basics with cuDF\n",
    "\n",
    "In the following tutorial, you'll get a chance to familiarize yourself with cuDF. For those of you with experience using pandas, this should look nearly identical.\n",
    "\n",
    "Along the way you'll notice small exercises. These exercises are designed to help you get a feel for writing the code yourself, but if you get stuck, you can take a look at the solutions.\n",
    "\n",
    "Portions of this were borrowed from the 10 Minutes to cuDF guide."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "Object Creation\n",
    "---------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creating a `cudf.Series`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s = cudf.Series([1,2,3,None,4])\n",
    "print(s)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creating a `cudf.DataFrame` by specifying values for each column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = cudf.DataFrame({'a': list(range(20)),\n",
    "'b': list(reversed(range(20))),\n",
    "'c': list(range(20))})\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creating a `cudf.DataFrame` from a `pd.Dataframe`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pdf = pd.DataFrame({'a': [0, 1, 2, 3],'b': [0.1, 0.2, None, 0.3]})\n",
    "gdf = cudf.DataFrame.from_pandas(pdf)\n",
    "print(gdf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "Viewing Data\n",
    "-------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Viewing the top rows of a GPU dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(df.head(2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sorting by values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(df.sort_values(by='b'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "\n",
    "Selection\n",
    "------------\n",
    "\n",
    "## Getting data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting a single column, which initially yields a `cudf.Series` (equivalent to `df.a`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(df['a'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Selection by Label"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting rows from index 2 to index 5 from columns `a` and `b`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(df.loc[2:5, ['a', 'b']])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "## Selection by Position"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting via integers and integer slices, like numpy/pandas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.iloc[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.iloc[0:3, 0:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also select elements of a `DataFrame` or `Series` with direct index access."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[3:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s[3:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "## Exercise 1\n",
    "\n",
    "Try to select only the rows at index `4` and `9` from `df`.\n",
    "\n",
    "Solution
\n",
    "   \n",
    "    
print(df.iloc[[4,9]])\n",
    "   \n",
    "Solution
\n",
    "   \n",
    "    
print(df.query(\"b > c + 6\"))\n",
    "   \n",
    "Solution
\n",
    "   \n",
    "    
print(s.str.upper())\n",
    "   \n",
    "Solution
\n",
    "   \n",
    "    
print(df_a.merge(df_b, on=['key'], how='inner'))\n",
    "   \n",
    "Solution
\n",
    "   \n",
    "    
df.groupby(['agg_col1', 'agg_col2']).agg({'a':'mean', 'b':'min'})\n",
    "   \n",
    "Solution
\n",
    "   \n",
    "    
\n",
    "    search_date = pd.to_datetime('2018-11-23')\n",
    "    date_df.loc[date_df.date <= search_date]\n",
    "            \n",
    "   \n",
    "