{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction: Linear Regression with Medium Articles\n", "\n", "In this notebook, we'll perform some basic linear regression on my Medium articles. This is a continuation of the earlier data analysis of those articles. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2018-12-27T03:24:38.358494Z", "start_time": "2018-12-27T03:24:36.875877Z" } }, "outputs": [ { "data": { "text/html": [ "" ], "text/vnd.plotly.v1+html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Data science imports\n", "import pandas as pd\n", "import numpy as np\n", "\n", "from scipy import stats\n", "\n", "# Options for pandas\n", "pd.options.display.max_columns = 20\n", "\n", "# Display all cell outputs\n", "from IPython.core.interactiveshell import InteractiveShell\n", "InteractiveShell.ast_node_interactivity = 'all'\n", "\n", "# Interactive plotting\n", "import plotly.plotly as py\n", "import plotly.graph_objs as go\n", "from plotly.offline import iplot\n", "import cufflinks\n", "cufflinks.go_offline()\n", "\n", "%load_ext autoreload\n", "%autoreload 2\n", "\n", "from timeit import default_timer as timer\n", "\n", "from collections import Counter, defaultdict\n", "from itertools import chain\n", "\n", "from bs4 import BeautifulSoup\n", "import re\n", "\n", "import requests\n", "from multiprocessing import Pool" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2018-12-27T03:24:39.422536Z", "start_time": "2018-12-27T03:24:39.350187Z" } }, "outputs": [ { "data": { "text/html": [ "" ], "text/vnd.plotly.v1+html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from utils import process_in_parallel, get_links, make_iplot" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2018-12-27T03:24:41.134418Z", "start_time": 
"2018-12-27T03:24:40.976549Z" } }, "outputs": [ { "data": { "text/plain": [ "\"Medium(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){\\n(i[r].q=i[r].q||[\"" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup = BeautifulSoup(open('data/published.html', 'r'))\n", "soup.text[:100]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2018-12-27T03:24:42.032639Z", "start_time": "2018-12-27T03:24:41.871801Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 212 entries.\n", "Total Read Time of Entries: 1361 minutes.\n" ] } ], "source": [ "links = get_links(soup)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2018-12-27T03:24:54.163230Z", "start_time": "2018-12-27T03:24:44.240823Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Processed 212 entries in 10 seconds.\n" ] }, { "data": { "text/html": [ "
\n", " | claps | \n", "read_time | \n", "tags | \n", "text | \n", "time_published | \n", "title | \n", "word_count | \n", "response | \n", "claps_per_word | \n", "words_per_minute | \n", "<tag>Education | \n", "<tag>Data Science | \n", "<tag>Towards Data Science | \n", "<tag>Machine Learning | \n", "<tag>Python | \n", "<tag>Programming | \n", "<tag>Statistics | \n", "<tag>Data Analysis | \n", "<tag>Books | \n", "<tag>Review | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "66 | \n", "1 | \n", "[] | \n", "This wasn’t entirely serious, but I do think a... | \n", "2018-11-28 19:44:26.105000-05:00 | \n", "response-2018-11-28 19:44:26.105000-05:00 | \n", "158 | \n", "response | \n", "0.417722 | \n", "158.0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
1 | \n", "0 | \n", "1 | \n", "[] | \n", "Thanks! I’m glad that others find my work usef... | \n", "2018-11-28 20:01:40.532000-05:00 | \n", "response-2018-11-28 20:01:40.532000-05:00 | \n", "30 | \n", "response | \n", "0.000000 | \n", "30.0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
3 | \n", "0 | \n", "1 | \n", "[] | \n", "I wrote another article about deploying this t... | \n", "2018-11-28 20:05:02.529000-05:00 | \n", "response-2018-11-28 20:05:02.529000-05:00 | \n", "78 | \n", "response | \n", "0.000000 | \n", "78.0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
4 | \n", "0 | \n", "1 | \n", "[] | \n", "These API’s are really fun (and sometimes usef... | \n", "2018-12-01 20:30:50.845000-05:00 | \n", "response-2018-12-01 20:30:50.845000-05:00 | \n", "31 | \n", "response | \n", "0.000000 | \n", "31.0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
5 | \n", "0 | \n", "1 | \n", "[] | \n", "Thanks for your kind words! A lot of the time ... | \n", "2018-12-15 13:56:37.536000-05:00 | \n", "response-2018-12-15 13:56:37.536000-05:00 | \n", "95 | \n", "response | \n", "0.000000 | \n", "95.0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "