| 
					
				 | 
			
			
				@@ -2,39 +2,38 @@ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				  "cells": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "markdown", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "cognitive-explanation", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "guided-latin", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "source": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "# \n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "# About data cleaning for own language \n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "---\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "## Learning Objectives\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "- **The goal of this lab is to go through some basic data cleansing methods which should be cautiously applied to own langauge dataset  **\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "    - langauge detection and filtering \n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "    - finding sentence boundary, and give some examples :\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "    it is importance to be able to find good sentence boundary per given document, see [Megatron-LM/tools/preprocess_data.py](https://github.com/NVIDIA/Megatron-LM/blob/main/tools/preprocess_data.py#L84)\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "        1. [sentence-splitter](https://github.com/mediacloud/sentence-splitter)\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "        2. [NLTK](https://github.com/nltk/nltk)\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "        3. write your own sentence splitter, a home-made example\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "    - deduplicate documents based on similarity score\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "    - (additional advice) discuss filtering on number of sentences per document and number of tokens per sentence \n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "        - example for English GPT training, it is recommand to check on the stats of your raw data and come-up with a good rule-of-thumb to proceed filtering, it is,however,recommanded to look into language-specific cleaning and follow up robust clearning procedure to obtain quality corpus.\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "The goal of this lab is to go through some basic data cleansing methods, which should be applied with care to a dataset in your own language.\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "- language detection  \n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "- finding sentence boundaries, with some examples:\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "it is important to be able to find good sentence boundaries in a given document, see [Megatron-LM/tools/preprocess_data.py](https://github.com/NVIDIA/Megatron-LM/blob/main/tools/preprocess_data.py#L84)\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "    1. [sentence-splitter](https://github.com/mediacloud/sentence-splitter)\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "    2. [NLTK](https://github.com/nltk/nltk)\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "    3. write your own sentence splitter, a home-made example\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "- deduplicate documents based on similarity score\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "- (additional advice) discuss filtering on number of sentences per document and number of tokens per sentence \n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "    - example for English GPT training: it is recommended to check the statistics of your raw data and come up with a good rule of thumb for filtering. It is, however, also recommended to look into language-specific cleaning and follow a robust cleaning procedure to obtain a quality corpus.\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "In this notebook, we will adopt the method provided by the [Megatron-LM repo](https://github.com/NVIDIA/Megatron-LM/tree/main/tools/openwebtext) and introduce other complementary methods that might be of interest for data cleaning.\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "---------------\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "### At the end, there's a **mini challenge** for hands-on practice identifying number of duplicates approach groudtruth number!" 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "**Note**: At the end, there's a **mini challenge** for hands-on practice, where the number of duplicates you identify should approach the ground-truth number!" 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    ] 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   }, 
			 | 
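The "additional advice" bullet in the cell above (filter on the number of sentences per document and the number of tokens per sentence) could be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the helper names `doc_stats` and `keep_document`, the naive regex sentence split, the whitespace tokenization, and both thresholds are assumptions chosen for the example and should be tuned from the statistics of your own raw data.

```python
import re
import statistics


def doc_stats(doc):
    """Return (number of sentences, whitespace-token count per sentence)."""
    # Naive split: ., ! or ? followed by whitespace (assumption for this sketch).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", doc.strip()) if s]
    return len(sentences), [len(s.split()) for s in sentences]


def keep_document(doc, min_sentences=2, min_mean_tokens=3.0):
    """Hypothetical rule of thumb: enough sentences, and sentences not too short."""
    n_sents, token_counts = doc_stats(doc)
    if n_sents < min_sentences:
        return False
    return statistics.mean(token_counts) >= min_mean_tokens


corpus = [
    "Detta är ett stycke. Den innehåller flera meningar.",
    "Klicka här!",
    "Ok. Ja. Nej.",
]
kept = [d for d in corpus if keep_document(d)]
print(kept)
```

Only the first document passes both checks here; the second has too few sentences and the third has sentences that are too short.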
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "markdown", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "injured-appraisal", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "structured-superior", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "source": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "--------------------------------------------------------------------------------------------------------------------\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "#### start by install necessary libraries -\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "### Install necessary libraries -\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "    install LSH - \n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "\n", 
			 | 
		
	
	
		
			
				| 
					
				 | 
			
			
				@@ -59,7 +58,7 @@ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "code", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "execution_count": 1, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "elect-chair", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "sensitive-violation", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "outputs": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     { 
			 | 
		
	
	
		
			
				| 
					
				 | 
			
			
				@@ -138,17 +137,17 @@ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   }, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "markdown", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "excessive-madison", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "conscious-provision", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "source": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "-------------------------------------------------------------------------------\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "## detect and filter undesired langauge in the raw text corpus" 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "## Language detection" 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    ] 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   }, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "code", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "execution_count": 2, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "choice-nicholas", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "entitled-company", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "outputs": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     { 
			 | 
		
	
	
		
			
				| 
					
				 | 
			
			
				@@ -171,7 +170,7 @@ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "code", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "execution_count": 3, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "nonprofit-statistics", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "fancy-translator", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "outputs": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     { 
			 | 
		
	
	
		
			
				| 
					
				 | 
			
			
				@@ -193,7 +192,7 @@ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "code", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "execution_count": 4, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "adverse-robertson", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "experimental-burns", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "outputs": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     { 
			 | 
		
	
	
		
			
				| 
					
				 | 
			
			
				@@ -214,24 +213,24 @@ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   }, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "markdown", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "interpreted-links", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "sporting-involvement", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "source": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "-----------------------------------------------------------\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "## finding sentence boundary - NLTK" 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "## Finding sentence boundary (alternative 1) - NLTK " 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    ] 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   }, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "code", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "execution_count": 5, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "typical-accused", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "execution_count": 1, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "prerequisite-london", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "outputs": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				      "name": "stderr", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				      "output_type": "stream", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				      "text": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-      "[nltk_data] Downloading package punkt to /home/x_zench/nltk_data...\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+      "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				       "[nltk_data]   Unzipping tokenizers/punkt.zip.\n" 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				      ] 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     }, 
			 | 
		
	
	
		
			
				| 
					
				 | 
			
			
				@@ -241,7 +240,7 @@ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				        "True" 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				       ] 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				      }, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-     "execution_count": 5, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+     "execution_count": 1, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				      "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				      "output_type": "execute_result" 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     } 
			 | 
		
	
	
		
			
				| 
					
				 | 
			
			
				@@ -253,8 +252,8 @@ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   }, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "code", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "execution_count": 6, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "ranking-semester", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "execution_count": 20, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "worldwide-fruit", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "outputs": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     { 
			 | 
		
	
	
		
			
				| 
					
				 | 
			
			
				@@ -262,16 +261,20 @@ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				      "output_type": "stream", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				      "text": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				       "original doc is :\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-      "  På sommarhalvåret längst vattnet och i skogarna på Lidingö eller på Djurgården – på vintern blir det löpbandet på gymmet .Händelsen skapade mycket oro inom klubben , eftersom skridskoåkning äger rum på sjöar och vattendrag där det givetvis inte finns någon gatuadress att uppge vid olycka .\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+      "  Detta är ett stycke. Den innehåller flera meningar. \"Men varför\", frågar du?\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				       "------- sentence 0 -------\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-      "På sommarhalvåret längst vattnet och i skogarna på Lidingö eller på Djurgården – på vintern blir det löpbandet på gymmet .Händelsen skapade mycket oro inom klubben , eftersom skridskoåkning äger rum på sjöar och vattendrag där det givetvis inte finns någon gatuadress att uppge vid olycka .\n" 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+      "Detta är ett stycke.\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+      "------- sentence 1 -------\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+      "Den innehåller flera meningar.\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+      "------- sentence 2 -------\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+      "\"Men varför\", frågar du?\n" 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				      ] 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     } 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    ], 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "source": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "import nltk\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "from nltk.tokenize import sent_tokenize\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "text='På sommarhalvåret längst vattnet och i skogarna på Lidingö eller på Djurgården – på vintern blir det löpbandet på gymmet .Händelsen skapade mycket oro inom klubben , eftersom skridskoåkning äger rum på sjöar och vattendrag där det givetvis inte finns någon gatuadress att uppge vid olycka .'\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "text='Detta är ett stycke. Den innehåller flera meningar. \"Men varför\", frågar du?'\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "print(\"original doc is :\\n \", text)\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "sents=sent_tokenize(text)\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "i=0\n", 
			 | 
		
	
	
		
			
				| 
					
				 | 
			
			
				@@ -283,33 +286,33 @@ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   }, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "markdown", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "portable-bumper", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "micro-interface", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "source": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "-----------------------------------------------------------\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "## finding sentence boundary - Sentence-Splitter" 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "## Finding sentence boundary (alternative 2) - Sentence-Splitter" 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    ] 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   }, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "code", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "execution_count": 8, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "environmental-rating", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "third-caution", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "outputs": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				      "name": "stdout", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				      "output_type": "stream", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				      "text": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-      "--2021-09-15 11:26:03--  https://github.com/mediacloud/sentence-splitter/blob/develop/sentence_splitter/non_breaking_prefixes/sv.txt\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-      "Resolving github.com (github.com)... 140.82.121.4\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-      "Connecting to github.com (github.com)|140.82.121.4|:443... connected.\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+      "--2021-09-24 18:41:51--  https://github.com/mediacloud/sentence-splitter/blob/develop/sentence_splitter/non_breaking_prefixes/sv.txt\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+      "Resolving github.com (github.com)... 140.82.121.3\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+      "Connecting to github.com (github.com)|140.82.121.3|:443... connected.\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				       "HTTP request sent, awaiting response... 200 OK\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				       "Length: unspecified [text/html]\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				       "Saving to: ‘sv.txt’\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				       "\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-      "sv.txt                  [ <=>                ] 191.69K  --.-KB/s    in 0.08s   \n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+      "sv.txt                  [ <=>                ] 190.41K  --.-KB/s    in 0.1s    \n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				       "\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-      "2021-09-15 11:26:04 (2.23 MB/s) - ‘sv.txt’ saved [196294]\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+      "2021-09-24 18:41:52 (1.76 MB/s) - ‘sv.txt’ saved [194983]\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				       "\n" 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				      ] 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     } 
			 | 
		
	
	
		
			
				| 
					
				 | 
			
			
				@@ -321,7 +324,7 @@ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "code", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "execution_count": 9, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "collective-medication", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "spectacular-eligibility", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "outputs": [], 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "source": [ 
			 | 
		
	
	
		
			
				| 
					
				 | 
			
			
				@@ -330,8 +333,8 @@ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   }, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "code", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "execution_count": 10, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "secure-encounter", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "execution_count": 21, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "imported-probability", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "outputs": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     { 
			 | 
		
	
	
		
			
				| 
					
				 | 
			
			
				@@ -339,7 +342,11 @@ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				      "output_type": "stream", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				      "text": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				       "------- sentence 0 -------\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-      "På sommarhalvåret längst vattnet och i skogarna på Lidingö eller på Djurgården – på vintern blir det löpbandet på gymmet .Händelsen skapade mycket oro inom klubben , eftersom skridskoåkning äger rum på sjöar och vattendrag där det givetvis inte finns någon gatuadress att uppge vid olycka .\n" 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+      "Detta är ett stycke.\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+      "------- sentence 1 -------\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+      "Den innehåller flera meningar.\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+      "------- sentence 2 -------\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+      "\"Men varför\", frågar du?\n" 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				      ] 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     } 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    ], 
			 | 
		
	
	
		
			
				| 
					
				 | 
			
			
				@@ -356,17 +363,17 @@ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   }, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "markdown", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "varying-province", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "conventional-medium", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "source": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "-----------------------------------------------------------\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "## finding sentence boundary - create your own sentence splitter" 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "## Finding sentence boundary - create your own sentence splitter" 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    ] 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   }, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "code", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "execution_count": 11, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "voluntary-madness", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "execution_count": 22, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "selective-nelson", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "outputs": [], 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "source": [ 
			 | 
		
	
	
		
			
				| 
					
				 | 
			
			
				@@ -378,7 +385,7 @@ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "def cut_sentence_with_quotation_marks(text):\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "    p = re.compile(\"“.*?”\")\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "    list = []\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "    ls = []\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "    index = 0\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "    length = len(text)\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "    for i in p.finditer(text):\n", 
			 | 
		
	
	
		
			
				| 
					
				 | 
			
			
				@@ -389,29 +396,29 @@ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "            temp += text[j]\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "        if temp != '':\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "            temp_list = normal_cut_sentence(temp)\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "            list += temp_list\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "            ls += temp_list\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "        temp = ''\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "        for k in range(start, end):\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "            temp += text[k]\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "        if temp != ' ':\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "            list.append(temp)\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "            ls.append(temp)\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "        index = end\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "    return list" 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "    return ls" 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    ] 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   }, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "code", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "execution_count": 12, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "raising-salad", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "execution_count": 23, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "statistical-croatia", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "outputs": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				      "name": "stdout", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				      "output_type": "stream", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				      "text": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-      "------- sentence 1 -------\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+      "------- sentence 3 -------\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				       "Andersson pekas ut som nästa partiledare:\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-      "------- sentence 2 -------\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+      "------- sentence 4 -------\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				       "“Medlemmarna ska säga sitt”\n" 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				      ] 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     } 
			 | 
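For readers following the home-made splitter in this hunk, a compact sketch in the same spirit is given below. It is not the notebook's implementation: `normal_cut_sentence` is not visible in the diff, so the stand-in `split_plain` (split on `.`, `!` or `?` followed by whitespace) is an assumption, as is the name `split_with_quotes`. Like the notebook's regex, it keeps each curly-quoted span `“...”` as a single unit.

```python
import re

# Sentence-final punctuation followed by whitespace marks a boundary (assumption).
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")
# Curly-quoted spans are kept intact, mirroring the regex used in the notebook.
_QUOTED = re.compile(r"“.*?”")


def split_plain(text):
    """Split unquoted text on ., ! or ? followed by whitespace."""
    return [s.strip() for s in _BOUNDARY.split(text) if s.strip()]


def split_with_quotes(text):
    """Split text into sentences, emitting each “...” span as one unit."""
    sentences = []
    index = 0
    for match in _QUOTED.finditer(text):
        # Text before the quoted span is split with the plain rule.
        sentences.extend(split_plain(text[index:match.start()]))
        sentences.append(match.group())
        index = match.end()
    # Whatever follows the last quoted span.
    sentences.extend(split_plain(text[index:]))
    return sentences


doc = "Andersson pekas ut som nästa partiledare: “Medlemmarna ska säga sitt”"
for i, sent in enumerate(split_with_quotes(doc)):
    print(f"------- sentence {i} -------")
    print(sent)
```

Slicing the text between matches avoids the character-by-character string concatenation used in the original loop, which does the same work more slowly.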
		
	
	
		
			
				| 
					
				 | 
			
			
				@@ -427,17 +434,17 @@ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   }, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "markdown", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "facial-trading", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "ultimate-geometry", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "source": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     "-----------------------------------------------------------\n", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "## deduplicate text based on similarity score  " 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "## Deduplicate text based on similarity score  " 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    ] 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   }, 
			 | 
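To make "similarity score" concrete, here is a small stand-alone sketch of near-duplicate removal using Jaccard similarity over character shingles. It deliberately swaps in an exhaustive pairwise comparison instead of the LSH library the notebook installs, so it only suits small collections. The function names, the shingle size `k=5`, and the `0.7` threshold are assumptions for illustration, and the sample documents are made up.

```python
def shingles(text, k=5):
    """Return the set of lowercase character k-grams of whitespace-normalised text."""
    norm = " ".join(text.lower().split())
    if len(norm) <= k:
        return {norm}
    return {norm[i:i + k] for i in range(len(norm) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs, threshold=0.7):
    """Keep a document only if it stays below the threshold against every kept one."""
    kept, kept_shingles = [], []
    for doc in docs:
        sh = shingles(doc)
        if all(jaccard(sh, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(sh)
    return kept


docs = [
    "NVIDIA announces a new GPU for data centers.",
    "NVIDIA announces a new GPU for data centres.",
    "Sentence splitting matters for language model training.",
]
print(deduplicate(docs))
```

The second document differs from the first only in spelling, so it is dropped as a near-duplicate. LSH approximates the same decision without comparing every pair, which is what makes it practical on a large corpus.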
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "code", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "execution_count": 1, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "agricultural-onion", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "internal-opposition", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "outputs": [], 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "source": [ 
			 | 
		
	
	
		
			
				| 
					
				 | 
			
			
				@@ -486,7 +493,7 @@ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "code", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "execution_count": 2, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "nonprofit-panama", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "provincial-cross", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "outputs": [], 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "source": [ 
			 | 
		
	
	
		
			
				| 
					
				 | 
			
			
				@@ -500,7 +507,7 @@ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "code", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "execution_count": 3, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "empty-while", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "successful-consent", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "outputs": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     { 
			 | 
		
	
	
		
			
				| 
					
				 | 
			
			
				@@ -526,16 +533,16 @@ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   }, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "markdown", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "blind-union", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "disciplinary-resolution", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "source": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-    "## dataset extracted from NVIDIA blog urls " 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+    "## Load dataset extracted from NVIDIA blog URLs " 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    ] 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   }, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				   { 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "cell_type": "code", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "execution_count": 4, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				-   "id": "instant-grade", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				+   "id": "based-pizza", 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "metadata": {}, 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				    "outputs": [ 
			 | 
		
	
		
			
				 | 
				 | 
			
			
				     { 
			 | 
		
	
	
		
			
				| 
					
				 | 
			
			
@@ -610,16 +617,16 @@
   },
   {
    "cell_type": "markdown",
-   "id": "attached-candle",
+   "id": "continuing-palmer",
    "metadata": {},
    "source": [
-    "## create our own groudtruth dataset"
+    "## Create our own groudtruth dataset"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "constant-mouth",
+   "id": "convenient-clear",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -650,7 +657,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "accepting-truck",
+   "id": "vertical-philosophy",
    "metadata": {},
    "outputs": [
     {
@@ -749,7 +756,7 @@
   {
    "cell_type": "code",
    "execution_count": 7,
-   "id": "acting-tiffany",
+   "id": "agricultural-bubble",
    "metadata": {},
    "outputs": [
     {
@@ -773,7 +780,7 @@
   {
    "cell_type": "code",
    "execution_count": 8,
-   "id": "wicked-youth",
+   "id": "incorporated-nepal",
    "metadata": {},
    "outputs": [
     {
@@ -794,7 +801,7 @@
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "rural-lotus",
+   "id": "future-worry",
    "metadata": {},
    "outputs": [
     {
@@ -888,7 +895,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "dominican-trick",
+   "id": "secure-rapid",
    "metadata": {},
    "source": [
     "---\n",
@@ -898,7 +905,7 @@
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "married-straight",
+   "id": "equivalent-victor",
    "metadata": {},
    "outputs": [
     {
@@ -998,7 +1005,7 @@
   {
    "cell_type": "code",
    "execution_count": 11,
-   "id": "rocky-courage",
+   "id": "suffering-bangkok",
    "metadata": {},
    "outputs": [
     {
@@ -1022,7 +1029,7 @@
   {
    "cell_type": "code",
    "execution_count": 12,
-   "id": "boring-piece",
+   "id": "efficient-angola",
    "metadata": {},
    "outputs": [
     {
@@ -1039,7 +1046,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "fresh-norfolk",
+   "id": "featured-packet",
    "metadata": {},
    "source": [
     "---\n",
@@ -1050,7 +1057,7 @@
   {
    "cell_type": "code",
    "execution_count": 13,
-   "id": "starting-arabic",
+   "id": "scheduled-publication",
    "metadata": {},
    "outputs": [
     {
@@ -1145,7 +1152,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "banner-dispute",
+   "id": "judicial-circular",
    "metadata": {},
    "source": [
     "---\n",
@@ -1155,7 +1162,7 @@
   {
    "cell_type": "code",
    "execution_count": 14,
-   "id": "spread-entity",
+   "id": "bronze-salem",
    "metadata": {},
    "outputs": [
     {
@@ -1179,7 +1186,7 @@
   {
    "cell_type": "code",
    "execution_count": 15,
-   "id": "exempt-juice",
+   "id": "engaging-persian",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -1189,7 +1196,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "abandoned-valuation",
+   "id": "forward-attachment",
    "metadata": {},
    "source": [
     "<a id=\"TheChallenge\"></a>"
@@ -1197,13 +1204,13 @@
   },
   {
    "cell_type": "markdown",
-   "id": "thick-external",
+   "id": "supposed-columbia",
    "metadata": {},
    "source": [
     "---\n",
     "# Mini Challenge - approaching the groundtruth !\n",
     "\n",
-    "Task : Aiming to approach the number 31 modifying the below parameters\n",
+    "**Task : Aiming to approach the number 31 modifying the below parameters**\n",
     "rerun cell <a href=\"./Day3-2_SentenceBoundary_and_Deduplicate.ipynb#Rerun_Cell\">Jump to ReRun Cell</a>\n",
     "\n",
     "Consider yourself pass this mini challenge when you approach the number **31 +/- 3** ! \n",
@@ -1222,12 +1229,11 @@
   {
    "cell_type": "code",
    "execution_count": 42,
-   "id": "sophisticated-boating",
+   "id": "automated-liver",
    "metadata": {
     "collapsed": true,
     "jupyter": {
-     "outputs_hidden": true,
-     "source_hidden": true
+     "outputs_hidden": true
     },
     "tags": []
    },
@@ -1257,11 +1263,8 @@
   {
    "cell_type": "code",
    "execution_count": 114,
-   "id": "meaningful-sample",
+   "id": "anticipated-senator",
    "metadata": {
-    "jupyter": {
-     "source_hidden": true
-    },
     "tags": []
    },
    "outputs": [],
@@ -1278,12 +1281,11 @@
   {
    "cell_type": "code",
    "execution_count": 115,
-   "id": "operational-steps",
+   "id": "pregnant-temple",
    "metadata": {
     "collapsed": true,
     "jupyter": {
-     "outputs_hidden": true,
-     "source_hidden": true
+     "outputs_hidden": true
     },
     "tags": []
    },
@@ -1314,7 +1316,25 @@
   },
   {
    "cell_type": "markdown",
-   "id": "revolutionary-framing",
+   "id": "junior-voluntary",
+   "metadata": {},
+   "source": [
+    "--- \n",
+    "\n",
+    "## Additional Resources\n",
+    "\n",
+    "Language Detect : https://github.com/Mimino666/langdetect\n",
+    "\n",
+    "NLTK Sentence Tokenizer  : https://www.nltk.org/api/nltk.tokenize.html\n",
+    "\n",
+    "Sentence Splitter : https://github.com/mediacloud/sentence-splitter\n",
+    "\n",
+    "Local Sensitive Hashing : http://snap.stanford.edu/class/cs246-2012/slides/03-lsh.pdf\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "banned-telling",
    "metadata": {},
    "source": [
     "---\n",
@@ -1328,7 +1348,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "cutting-greeting",
+   "id": "controlling-boulder",
    "metadata": {},
    "source": [
     "-----\n",
			 |