|
@@ -2,39 +2,38 @@
|
|
"cells": [
|
|
"cells": [
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "cognitive-explanation",
|
|
|
|
|
|
+ "id": "guided-latin",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
- "# \n",
|
|
|
|
- "\n",
|
|
|
|
"# Data cleaning for your own language \n",
|
|
"# Data cleaning for your own language \n",
|
|
"---\n",
|
|
"---\n",
|
|
"\n",
|
|
"\n",
|
|
"## Learning Objectives\n",
|
|
"## Learning Objectives\n",
|
|
- "- **The goal of this lab is to go through some basic data cleansing methods which should be cautiously applied to own langauge dataset **\n",
|
|
|
|
- " - langauge detection and filtering \n",
|
|
|
|
- " - finding sentence boundary, and give some examples :\n",
|
|
|
|
- " it is importance to be able to find good sentence boundary per given document, see [Megatron-LM/tools/preprocess_data.py](https://github.com/NVIDIA/Megatron-LM/blob/main/tools/preprocess_data.py#L84)\n",
|
|
|
|
- " 1. [sentence-splitter](https://github.com/mediacloud/sentence-splitter)\n",
|
|
|
|
- " 2. [NLTK](https://github.com/nltk/nltk)\n",
|
|
|
|
- " 3. write your own sentence splitter, a home-made example\n",
|
|
|
|
- " - deduplicate documents based on similarity score\n",
|
|
|
|
- " - (additional advice) discuss filtering on number of sentences per document and number of tokens per sentence \n",
|
|
|
|
- " - example for English GPT training, it is recommand to check on the stats of your raw data and come-up with a good rule-of-thumb to proceed filtering, it is,however,recommanded to look into language-specific cleaning and follow up robust clearning procedure to obtain quality corpus.\n",
|
|
|
|
|
|
+ "The goal of this lab is to go through some basic data cleansing methods, which should be applied with caution to a dataset in your own language:\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "- language detection \n",
|
|
|
|
+ "- finding sentence boundaries, with some examples:\n",
|
|
|
|
+ "it is important to be able to find good sentence boundaries in a given document, see [Megatron-LM/tools/preprocess_data.py](https://github.com/NVIDIA/Megatron-LM/blob/main/tools/preprocess_data.py#L84)\n",
|
|
|
|
+ " 1. [sentence-splitter](https://github.com/mediacloud/sentence-splitter)\n",
|
|
|
|
+ " 2. [NLTK](https://github.com/nltk/nltk)\n",
|
|
|
|
+ " 3. write your own sentence splitter, a home-made example\n",
|
|
|
|
+ "- deduplicate documents based on similarity score\n",
|
|
|
|
+ "- (additional advice) discuss filtering on number of sentences per document and number of tokens per sentence \n",
|
|
|
|
+ " - example for English GPT training: it is recommended to check the statistics of your raw data and come up with a good rule of thumb for filtering. It is, however, also recommended to look into language-specific cleaning and follow a robust cleaning procedure to obtain a quality corpus.\n",
|
|
"\n",
|
|
"\n",
|
|
"In this notebook, we will follow the method provided by the [Megatron-LM repo](https://github.com/NVIDIA/Megatron-LM/tree/main/tools/openwebtext) and introduce other complementary methods that might be of interest for data cleaning.\n",
|
|
"In this notebook, we will follow the method provided by the [Megatron-LM repo](https://github.com/NVIDIA/Megatron-LM/tree/main/tools/openwebtext) and introduce other complementary methods that might be of interest for data cleaning.\n",
|
|
"\n",
|
|
"\n",
|
|
- "---------------\n",
|
|
|
|
- "### At the end, there's a **mini challenge** for hands-on practice identifying number of duplicates approach groudtruth number!"
|
|
|
|
|
|
+ "\n",
|
|
|
|
+ "**Note**: At the end, there's a **mini challenge** for hands-on practice: identify a number of duplicates that approaches the ground-truth number!"
|
|
]
|
|
]
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "injured-appraisal",
|
|
|
|
|
|
+ "id": "structured-superior",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"--------------------------------------------------------------------------------------------------------------------\n",
|
|
"--------------------------------------------------------------------------------------------------------------------\n",
|
|
- "#### start by install necessary libraries -\n",
|
|
|
|
|
|
+ "### Install necessary libraries -\n",
|
|
"\n",
|
|
"\n",
|
|
" install LSH - \n",
|
|
" install LSH - \n",
|
|
"\n",
|
|
"\n",
|
|
@@ -59,7 +58,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"execution_count": 1,
|
|
- "id": "elect-chair",
|
|
|
|
|
|
+ "id": "sensitive-violation",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [
|
|
"outputs": [
|
|
{
|
|
{
|
|
@@ -138,17 +137,17 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "excessive-madison",
|
|
|
|
|
|
+ "id": "conscious-provision",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"-------------------------------------------------------------------------------\n",
|
|
"-------------------------------------------------------------------------------\n",
|
|
- "## detect and filter undesired langauge in the raw text corpus"
|
|
|
|
|
|
+ "## Language detection"
|
|
]
|
|
]
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"execution_count": 2,
|
|
- "id": "choice-nicholas",
|
|
|
|
|
|
+ "id": "entitled-company",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [
|
|
"outputs": [
|
|
{
|
|
{
|
|
@@ -171,7 +170,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"execution_count": 3,
|
|
- "id": "nonprofit-statistics",
|
|
|
|
|
|
+ "id": "fancy-translator",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [
|
|
"outputs": [
|
|
{
|
|
{
|
|
@@ -193,7 +192,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"execution_count": 4,
|
|
- "id": "adverse-robertson",
|
|
|
|
|
|
+ "id": "experimental-burns",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [
|
|
"outputs": [
|
|
{
|
|
{
|
|
@@ -214,24 +213,24 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "interpreted-links",
|
|
|
|
|
|
+ "id": "sporting-involvement",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"-----------------------------------------------------------\n",
|
|
"-----------------------------------------------------------\n",
|
|
- "## finding sentence boundary - NLTK"
|
|
|
|
|
|
+ "## Finding sentence boundary (alternative 1) - NLTK "
|
|
]
|
|
]
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
- "execution_count": 5,
|
|
|
|
- "id": "typical-accused",
|
|
|
|
|
|
+ "execution_count": 1,
|
|
|
|
+ "id": "prerequisite-london",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [
|
|
"outputs": [
|
|
{
|
|
{
|
|
"name": "stderr",
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"text": [
|
|
- "[nltk_data] Downloading package punkt to /home/x_zench/nltk_data...\n",
|
|
|
|
|
|
+ "[nltk_data] Downloading package punkt to /root/nltk_data...\n",
|
|
"[nltk_data] Unzipping tokenizers/punkt.zip.\n"
|
|
"[nltk_data] Unzipping tokenizers/punkt.zip.\n"
|
|
]
|
|
]
|
|
},
|
|
},
|
|
@@ -241,7 +240,7 @@
|
|
"True"
|
|
"True"
|
|
]
|
|
]
|
|
},
|
|
},
|
|
- "execution_count": 5,
|
|
|
|
|
|
+ "execution_count": 1,
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
"output_type": "execute_result"
|
|
}
|
|
}
|
|
@@ -253,8 +252,8 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
- "execution_count": 6,
|
|
|
|
- "id": "ranking-semester",
|
|
|
|
|
|
+ "execution_count": 20,
|
|
|
|
+ "id": "worldwide-fruit",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [
|
|
"outputs": [
|
|
{
|
|
{
|
|
@@ -262,16 +261,20 @@
|
|
"output_type": "stream",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"text": [
|
|
"original doc is :\n",
|
|
"original doc is :\n",
|
|
- " På sommarhalvåret längst vattnet och i skogarna på Lidingö eller på Djurgården – på vintern blir det löpbandet på gymmet .Händelsen skapade mycket oro inom klubben , eftersom skridskoåkning äger rum på sjöar och vattendrag där det givetvis inte finns någon gatuadress att uppge vid olycka .\n",
|
|
|
|
|
|
+ " Detta är ett stycke. Den innehåller flera meningar. \"Men varför\", frågar du?\n",
|
|
"------- sentence 0 -------\n",
|
|
"------- sentence 0 -------\n",
|
|
- "På sommarhalvåret längst vattnet och i skogarna på Lidingö eller på Djurgården – på vintern blir det löpbandet på gymmet .Händelsen skapade mycket oro inom klubben , eftersom skridskoåkning äger rum på sjöar och vattendrag där det givetvis inte finns någon gatuadress att uppge vid olycka .\n"
|
|
|
|
|
|
+ "Detta är ett stycke.\n",
|
|
|
|
+ "------- sentence 1 -------\n",
|
|
|
|
+ "Den innehåller flera meningar.\n",
|
|
|
|
+ "------- sentence 2 -------\n",
|
|
|
|
+ "\"Men varför\", frågar du?\n"
|
|
]
|
|
]
|
|
}
|
|
}
|
|
],
|
|
],
|
|
"source": [
|
|
"source": [
|
|
"import nltk\n",
|
|
"import nltk\n",
|
|
"from nltk.tokenize import sent_tokenize\n",
|
|
"from nltk.tokenize import sent_tokenize\n",
|
|
- "text='På sommarhalvåret längst vattnet och i skogarna på Lidingö eller på Djurgården – på vintern blir det löpbandet på gymmet .Händelsen skapade mycket oro inom klubben , eftersom skridskoåkning äger rum på sjöar och vattendrag där det givetvis inte finns någon gatuadress att uppge vid olycka .'\n",
|
|
|
|
|
|
+ "text='Detta är ett stycke. Den innehåller flera meningar. \"Men varför\", frågar du?'\n",
|
|
"print(\"original doc is :\\n \", text)\n",
|
|
"print(\"original doc is :\\n \", text)\n",
|
|
"sents=sent_tokenize(text)\n",
|
|
"sents=sent_tokenize(text)\n",
|
|
"i=0\n",
|
|
"i=0\n",
|
|
@@ -283,33 +286,33 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "portable-bumper",
|
|
|
|
|
|
+ "id": "micro-interface",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"-----------------------------------------------------------\n",
|
|
"-----------------------------------------------------------\n",
|
|
- "## finding sentence boundary - Sentence-Splitter"
|
|
|
|
|
|
+ "## Finding sentence boundary (alternative 2) - Sentence-Splitter"
|
|
]
|
|
]
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"execution_count": 8,
|
|
- "id": "environmental-rating",
|
|
|
|
|
|
+ "id": "third-caution",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [
|
|
"outputs": [
|
|
{
|
|
{
|
|
"name": "stdout",
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"text": [
|
|
- "--2021-09-15 11:26:03-- https://github.com/mediacloud/sentence-splitter/blob/develop/sentence_splitter/non_breaking_prefixes/sv.txt\n",
|
|
|
|
- "Resolving github.com (github.com)... 140.82.121.4\n",
|
|
|
|
- "Connecting to github.com (github.com)|140.82.121.4|:443... connected.\n",
|
|
|
|
|
|
+ "--2021-09-24 18:41:51-- https://github.com/mediacloud/sentence-splitter/blob/develop/sentence_splitter/non_breaking_prefixes/sv.txt\n",
|
|
|
|
+ "Resolving github.com (github.com)... 140.82.121.3\n",
|
|
|
|
+ "Connecting to github.com (github.com)|140.82.121.3|:443... connected.\n",
|
|
"HTTP request sent, awaiting response... 200 OK\n",
|
|
"HTTP request sent, awaiting response... 200 OK\n",
|
|
"Length: unspecified [text/html]\n",
|
|
"Length: unspecified [text/html]\n",
|
|
"Saving to: ‘sv.txt’\n",
|
|
"Saving to: ‘sv.txt’\n",
|
|
"\n",
|
|
"\n",
|
|
- "sv.txt [ <=> ] 191.69K --.-KB/s in 0.08s \n",
|
|
|
|
|
|
+ "sv.txt [ <=> ] 190.41K --.-KB/s in 0.1s \n",
|
|
"\n",
|
|
"\n",
|
|
- "2021-09-15 11:26:04 (2.23 MB/s) - ‘sv.txt’ saved [196294]\n",
|
|
|
|
|
|
+ "2021-09-24 18:41:52 (1.76 MB/s) - ‘sv.txt’ saved [194983]\n",
|
|
"\n"
|
|
"\n"
|
|
]
|
|
]
|
|
}
|
|
}
|
|
@@ -321,7 +324,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 9,
|
|
"execution_count": 9,
|
|
- "id": "collective-medication",
|
|
|
|
|
|
+ "id": "spectacular-eligibility",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"outputs": [],
|
|
"source": [
|
|
"source": [
|
|
@@ -330,8 +333,8 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
- "execution_count": 10,
|
|
|
|
- "id": "secure-encounter",
|
|
|
|
|
|
+ "execution_count": 21,
|
|
|
|
+ "id": "imported-probability",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [
|
|
"outputs": [
|
|
{
|
|
{
|
|
@@ -339,7 +342,11 @@
|
|
"output_type": "stream",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"text": [
|
|
"------- sentence 0 -------\n",
|
|
"------- sentence 0 -------\n",
|
|
- "På sommarhalvåret längst vattnet och i skogarna på Lidingö eller på Djurgården – på vintern blir det löpbandet på gymmet .Händelsen skapade mycket oro inom klubben , eftersom skridskoåkning äger rum på sjöar och vattendrag där det givetvis inte finns någon gatuadress att uppge vid olycka .\n"
|
|
|
|
|
|
+ "Detta är ett stycke.\n",
|
|
|
|
+ "------- sentence 1 -------\n",
|
|
|
|
+ "Den innehåller flera meningar.\n",
|
|
|
|
+ "------- sentence 2 -------\n",
|
|
|
|
+ "\"Men varför\", frågar du?\n"
|
|
]
|
|
]
|
|
}
|
|
}
|
|
],
|
|
],
|
|
@@ -356,17 +363,17 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "varying-province",
|
|
|
|
|
|
+ "id": "conventional-medium",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"-----------------------------------------------------------\n",
|
|
"-----------------------------------------------------------\n",
|
|
- "## finding sentence boundary - create your own sentence splitter"
|
|
|
|
|
|
+ "## Finding sentence boundary - create your own sentence splitter"
|
|
]
|
|
]
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
- "execution_count": 11,
|
|
|
|
- "id": "voluntary-madness",
|
|
|
|
|
|
+ "execution_count": 22,
|
|
|
|
+ "id": "selective-nelson",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"outputs": [],
|
|
"source": [
|
|
"source": [
|
|
@@ -378,7 +385,7 @@
|
|
"\n",
|
|
"\n",
|
|
"def cut_sentence_with_quotation_marks(text):\n",
|
|
"def cut_sentence_with_quotation_marks(text):\n",
|
|
" p = re.compile(\"“.*?”\")\n",
|
|
" p = re.compile(\"“.*?”\")\n",
|
|
- " list = []\n",
|
|
|
|
|
|
+ " ls = []\n",
|
|
" index = 0\n",
|
|
" index = 0\n",
|
|
" length = len(text)\n",
|
|
" length = len(text)\n",
|
|
" for i in p.finditer(text):\n",
|
|
" for i in p.finditer(text):\n",
|
|
@@ -389,29 +396,29 @@
|
|
" temp += text[j]\n",
|
|
" temp += text[j]\n",
|
|
" if temp != '':\n",
|
|
" if temp != '':\n",
|
|
" temp_list = normal_cut_sentence(temp)\n",
|
|
" temp_list = normal_cut_sentence(temp)\n",
|
|
- " list += temp_list\n",
|
|
|
|
|
|
+ " ls += temp_list\n",
|
|
" temp = ''\n",
|
|
" temp = ''\n",
|
|
" for k in range(start, end):\n",
|
|
" for k in range(start, end):\n",
|
|
" temp += text[k]\n",
|
|
" temp += text[k]\n",
|
|
" if temp != ' ':\n",
|
|
" if temp != ' ':\n",
|
|
- " list.append(temp)\n",
|
|
|
|
|
|
+ " ls.append(temp)\n",
|
|
" index = end\n",
|
|
" index = end\n",
|
|
- " return list"
|
|
|
|
|
|
+ " return ls"
|
|
]
|
|
]
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
- "execution_count": 12,
|
|
|
|
- "id": "raising-salad",
|
|
|
|
|
|
+ "execution_count": 23,
|
|
|
|
+ "id": "statistical-croatia",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [
|
|
"outputs": [
|
|
{
|
|
{
|
|
"name": "stdout",
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"text": [
|
|
- "------- sentence 1 -------\n",
|
|
|
|
|
|
+ "------- sentence 3 -------\n",
|
|
"Andersson pekas ut som nästa partiledare:\n",
|
|
"Andersson pekas ut som nästa partiledare:\n",
|
|
- "------- sentence 2 -------\n",
|
|
|
|
|
|
+ "------- sentence 4 -------\n",
|
|
"“Medlemmarna ska säga sitt”\n"
|
|
"“Medlemmarna ska säga sitt”\n"
|
|
]
|
|
]
|
|
}
|
|
}
|
|
@@ -427,17 +434,17 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "facial-trading",
|
|
|
|
|
|
+ "id": "ultimate-geometry",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"-----------------------------------------------------------\n",
|
|
"-----------------------------------------------------------\n",
|
|
- "## deduplicate text based on similarity score "
|
|
|
|
|
|
+ "## Deduplicate text based on similarity score "
|
|
]
|
|
]
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"execution_count": 1,
|
|
- "id": "agricultural-onion",
|
|
|
|
|
|
+ "id": "internal-opposition",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"outputs": [],
|
|
"source": [
|
|
"source": [
|
|
@@ -486,7 +493,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"execution_count": 2,
|
|
- "id": "nonprofit-panama",
|
|
|
|
|
|
+ "id": "provincial-cross",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"outputs": [],
|
|
"source": [
|
|
"source": [
|
|
@@ -500,7 +507,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"execution_count": 3,
|
|
- "id": "empty-while",
|
|
|
|
|
|
+ "id": "successful-consent",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [
|
|
"outputs": [
|
|
{
|
|
{
|
|
@@ -526,16 +533,16 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "blind-union",
|
|
|
|
|
|
+ "id": "disciplinary-resolution",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
- "## dataset extracted from NVIDIA blog urls "
|
|
|
|
|
|
+ "## Load dataset extracted from NVIDIA blog URLs "
|
|
]
|
|
]
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"execution_count": 4,
|
|
- "id": "instant-grade",
|
|
|
|
|
|
+ "id": "based-pizza",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [
|
|
"outputs": [
|
|
{
|
|
{
|
|
@@ -610,16 +617,16 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "attached-candle",
|
|
|
|
|
|
+ "id": "continuing-palmer",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
- "## create our own groudtruth dataset"
|
|
|
|
|
|
+ "## Create our own ground-truth dataset"
|
|
]
|
|
]
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"execution_count": 5,
|
|
- "id": "constant-mouth",
|
|
|
|
|
|
+ "id": "convenient-clear",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"outputs": [],
|
|
"source": [
|
|
"source": [
|
|
@@ -650,7 +657,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"execution_count": 6,
|
|
- "id": "accepting-truck",
|
|
|
|
|
|
+ "id": "vertical-philosophy",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [
|
|
"outputs": [
|
|
{
|
|
{
|
|
@@ -749,7 +756,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"execution_count": 7,
|
|
- "id": "acting-tiffany",
|
|
|
|
|
|
+ "id": "agricultural-bubble",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [
|
|
"outputs": [
|
|
{
|
|
{
|
|
@@ -773,7 +780,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"execution_count": 8,
|
|
- "id": "wicked-youth",
|
|
|
|
|
|
+ "id": "incorporated-nepal",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [
|
|
"outputs": [
|
|
{
|
|
{
|
|
@@ -794,7 +801,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 9,
|
|
"execution_count": 9,
|
|
- "id": "rural-lotus",
|
|
|
|
|
|
+ "id": "future-worry",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [
|
|
"outputs": [
|
|
{
|
|
{
|
|
@@ -888,7 +895,7 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "dominican-trick",
|
|
|
|
|
|
+ "id": "secure-rapid",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"---\n",
|
|
"---\n",
|
|
@@ -898,7 +905,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 10,
|
|
"execution_count": 10,
|
|
- "id": "married-straight",
|
|
|
|
|
|
+ "id": "equivalent-victor",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [
|
|
"outputs": [
|
|
{
|
|
{
|
|
@@ -998,7 +1005,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 11,
|
|
"execution_count": 11,
|
|
- "id": "rocky-courage",
|
|
|
|
|
|
+ "id": "suffering-bangkok",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [
|
|
"outputs": [
|
|
{
|
|
{
|
|
@@ -1022,7 +1029,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 12,
|
|
"execution_count": 12,
|
|
- "id": "boring-piece",
|
|
|
|
|
|
+ "id": "efficient-angola",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [
|
|
"outputs": [
|
|
{
|
|
{
|
|
@@ -1039,7 +1046,7 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "fresh-norfolk",
|
|
|
|
|
|
+ "id": "featured-packet",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"---\n",
|
|
"---\n",
|
|
@@ -1050,7 +1057,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 13,
|
|
"execution_count": 13,
|
|
- "id": "starting-arabic",
|
|
|
|
|
|
+ "id": "scheduled-publication",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [
|
|
"outputs": [
|
|
{
|
|
{
|
|
@@ -1145,7 +1152,7 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "banner-dispute",
|
|
|
|
|
|
+ "id": "judicial-circular",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"---\n",
|
|
"---\n",
|
|
@@ -1155,7 +1162,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 14,
|
|
"execution_count": 14,
|
|
- "id": "spread-entity",
|
|
|
|
|
|
+ "id": "bronze-salem",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [
|
|
"outputs": [
|
|
{
|
|
{
|
|
@@ -1179,7 +1186,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 15,
|
|
"execution_count": 15,
|
|
- "id": "exempt-juice",
|
|
|
|
|
|
+ "id": "engaging-persian",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"outputs": [],
|
|
"source": [
|
|
"source": [
|
|
@@ -1189,7 +1196,7 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "abandoned-valuation",
|
|
|
|
|
|
+ "id": "forward-attachment",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"<a id=\"TheChallenge\"></a>"
|
|
"<a id=\"TheChallenge\"></a>"
|
|
@@ -1197,13 +1204,13 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "thick-external",
|
|
|
|
|
|
+ "id": "supposed-columbia",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"---\n",
|
|
"---\n",
|
|
"# Mini Challenge - approaching the ground truth!\n",
|
|
"# Mini Challenge - approaching the ground truth!\n",
|
|
"\n",
|
|
"\n",
|
|
- "Task : Aiming to approach the number 31 modifying the below parameters\n",
|
|
|
|
|
|
+ "**Task: get as close as possible to the number 31 by modifying the parameters below**\n",
|
|
"rerun cell <a href=\"./Day3-2_SentenceBoundary_and_Deduplicate.ipynb#Rerun_Cell\">Jump to ReRun Cell</a>\n",
|
|
"rerun cell <a href=\"./Day3-2_SentenceBoundary_and_Deduplicate.ipynb#Rerun_Cell\">Jump to ReRun Cell</a>\n",
|
|
"\n",
|
|
"\n",
|
|
"Consider this mini challenge passed when your result is within **31 +/- 3**! \n",
|
|
"Consider this mini challenge passed when your result is within **31 +/- 3**! \n",
|
|
@@ -1222,12 +1229,11 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 42,
|
|
"execution_count": 42,
|
|
- "id": "sophisticated-boating",
|
|
|
|
|
|
+ "id": "automated-liver",
|
|
"metadata": {
|
|
"metadata": {
|
|
"collapsed": true,
|
|
"collapsed": true,
|
|
"jupyter": {
|
|
"jupyter": {
|
|
- "outputs_hidden": true,
|
|
|
|
- "source_hidden": true
|
|
|
|
|
|
+ "outputs_hidden": true
|
|
},
|
|
},
|
|
"tags": []
|
|
"tags": []
|
|
},
|
|
},
|
|
@@ -1257,11 +1263,8 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 114,
|
|
"execution_count": 114,
|
|
- "id": "meaningful-sample",
|
|
|
|
|
|
+ "id": "anticipated-senator",
|
|
"metadata": {
|
|
"metadata": {
|
|
- "jupyter": {
|
|
|
|
- "source_hidden": true
|
|
|
|
- },
|
|
|
|
"tags": []
|
|
"tags": []
|
|
},
|
|
},
|
|
"outputs": [],
|
|
"outputs": [],
|
|
@@ -1278,12 +1281,11 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 115,
|
|
"execution_count": 115,
|
|
- "id": "operational-steps",
|
|
|
|
|
|
+ "id": "pregnant-temple",
|
|
"metadata": {
|
|
"metadata": {
|
|
"collapsed": true,
|
|
"collapsed": true,
|
|
"jupyter": {
|
|
"jupyter": {
|
|
- "outputs_hidden": true,
|
|
|
|
- "source_hidden": true
|
|
|
|
|
|
+ "outputs_hidden": true
|
|
},
|
|
},
|
|
"tags": []
|
|
"tags": []
|
|
},
|
|
},
|
|
@@ -1314,7 +1316,25 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "revolutionary-framing",
|
|
|
|
|
|
+ "id": "junior-voluntary",
|
|
|
|
+ "metadata": {},
|
|
|
|
+ "source": [
|
|
|
|
+ "--- \n",
|
|
|
|
+ "\n",
|
|
|
|
+ "## Additional Resources\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "Language Detect : https://github.com/Mimino666/langdetect\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "NLTK Sentence Tokenizer : https://www.nltk.org/api/nltk.tokenize.html\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "Sentence Splitter : https://github.com/mediacloud/sentence-splitter\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "Locality Sensitive Hashing : http://snap.stanford.edu/class/cs246-2012/slides/03-lsh.pdf\n"
|
|
|
|
+ ]
|
|
|
|
+ },
|
|
|
|
+ {
|
|
|
|
+ "cell_type": "markdown",
|
|
|
|
+ "id": "banned-telling",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"---\n",
|
|
"---\n",
|
|
@@ -1328,7 +1348,7 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "cutting-greeting",
|
|
|
|
|
|
+ "id": "controlling-boulder",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"-----\n",
|
|
"-----\n",
|