|
@@ -2,7 +2,7 @@
|
|
"cells": [
|
|
"cells": [
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "encouraging-melissa",
|
|
|
|
|
|
+ "id": "fixed-species",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"## Customize preprocess_data.py\n",
|
|
"## Customize preprocess_data.py\n",
|
|
@@ -27,7 +27,7 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "regulation-medium",
|
|
|
|
|
|
+ "id": "comparative-render",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"1. Convert the extracted raw Swedish text from webnyheter2013.txt to webnyheter2013.json."
|
|
"1. Convert the extracted raw Swedish text from webnyheter2013.txt to webnyheter2013.json."
|
|
@@ -36,7 +36,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"execution_count": null,
|
|
- "id": "alone-asian",
|
|
|
|
|
|
+ "id": "alien-spanking",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"outputs": [],
|
|
"source": [
|
|
"source": [
|
|
@@ -45,7 +45,7 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "corporate-honor",
|
|
|
|
|
|
+ "id": "quiet-innocent",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"Below is the expected outputs :\n",
|
|
"Below is the expected outputs :\n",
|
|
@@ -58,7 +58,7 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "diagnostic-mercury",
|
|
|
|
|
|
+ "id": "relative-execution",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"2. Generate the mmap format files by default preprocess_data.py as the first step to ensure we have data necessary for the next notebook to run, in case time runs out."
|
|
"2. Generate the mmap format files by default preprocess_data.py as the first step to ensure we have data necessary for the next notebook to run, in case time runs out."
|
|
@@ -67,7 +67,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"execution_count": null,
|
|
- "id": "unique-assistant",
|
|
|
|
|
|
+ "id": "known-illness",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"outputs": [],
|
|
"source": [
|
|
"source": [
|
|
@@ -81,7 +81,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"execution_count": null,
|
|
- "id": "affecting-brave",
|
|
|
|
|
|
+ "id": "least-platform",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"outputs": [],
|
|
"source": [
|
|
"source": [
|
|
@@ -99,7 +99,7 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "mathematical-crawford",
|
|
|
|
|
|
+ "id": "lined-literacy",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"Below is the expected outputs :\n",
|
|
"Below is the expected outputs :\n",
|
|
@@ -116,7 +116,7 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "involved-geneva",
|
|
|
|
|
|
+ "id": "periodic-treaty",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"Now we get the default mmap files (xxx.bin and xxx.idx ) and therefore guarantee we have the data needed for the next notebook to run disregard whether we finish the mini-challenge or not. \n",
|
|
"Now we get the default mmap files (xxx.bin and xxx.idx ) and therefore guarantee we have the data needed for the next notebook to run disregard whether we finish the mini-challenge or not. \n",
|
|
@@ -131,7 +131,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"execution_count": null,
|
|
- "id": "understood-poultry",
|
|
|
|
|
|
+ "id": "norman-accreditation",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"outputs": [],
|
|
"source": [
|
|
"source": [
|
|
@@ -140,7 +140,7 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "north-pension",
|
|
|
|
|
|
+ "id": "maritime-bunny",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"<a id=\"Custom-Sentence-Splitter\"></a>"
|
|
"<a id=\"Custom-Sentence-Splitter\"></a>"
|
|
@@ -148,7 +148,7 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "textile-australian",
|
|
|
|
|
|
+ "id": "foreign-advocacy",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"The custom sentence-splitter `cut_sentence_with_quotation_marks` function is provided below for your convenience, please integrate this custom function into `MYpreprocess_data.py`."
|
|
"The custom sentence-splitter `cut_sentence_with_quotation_marks` function is provided below for your convenience, please integrate this custom function into `MYpreprocess_data.py`."
|
|
@@ -157,7 +157,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"execution_count": null,
|
|
- "id": "swedish-column",
|
|
|
|
|
|
+ "id": "celtic-latter",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"outputs": [],
|
|
"source": [
|
|
"source": [
|
|
@@ -192,7 +192,7 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "heated-drunk",
|
|
|
|
|
|
+ "id": "bacterial-consequence",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"<a id=\"Mini-Challenge\"></a>"
|
|
"<a id=\"Mini-Challenge\"></a>"
|
|
@@ -200,7 +200,7 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "normal-rachel",
|
|
|
|
|
|
+ "id": "separated-occupation",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"---\n",
|
|
"---\n",
|
|
@@ -222,7 +222,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"execution_count": null,
|
|
- "id": "adopted-yeast",
|
|
|
|
|
|
+ "id": "unknown-seven",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"outputs": [],
|
|
"source": [
|
|
"source": [
|
|
@@ -435,7 +435,7 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "accessory-banana",
|
|
|
|
|
|
+ "id": "ruled-service",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"Below cell block specify all the input parameters in order to run `MYpreprocess_data.py`. \n",
|
|
"Below cell block specify all the input parameters in order to run `MYpreprocess_data.py`. \n",
|
|
@@ -446,20 +446,20 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"execution_count": null,
|
|
- "id": "overhead-hydrogen",
|
|
|
|
|
|
+ "id": "simplified-antarctica",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"outputs": [],
|
|
"source": [
|
|
"source": [
|
|
"INPUT_JSON_FILE='../dataset/SV/webnyheter2013.json'\n",
|
|
"INPUT_JSON_FILE='../dataset/SV/webnyheter2013.json'\n",
|
|
"OUTPUT_PATH='../dataset/SV/customSentenceSplit'\n",
|
|
"OUTPUT_PATH='../dataset/SV/customSentenceSplit'\n",
|
|
- "VOCAB_FILE='../dataset/SV/32k/vocab.json'\n",
|
|
|
|
- "MERGE_FILE='../dataset/SV/32k/merges.txt'\n",
|
|
|
|
|
|
+ "VOCAB_FILE='../dataset/SV/56k/vocab.json'\n",
|
|
|
|
+ "MERGE_FILE='../dataset/SV/56k/merges.txt'\n",
|
|
"NUM_CPUS=16"
|
|
"NUM_CPUS=16"
|
|
]
|
|
]
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "interior-healthcare",
|
|
|
|
|
|
+ "id": "understanding-things",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"Below code block is a ReRun cell to launch `MYpreprocess_data.py` and produce the customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files, if the script runs successfully.\n",
|
|
"Below code block is a ReRun cell to launch `MYpreprocess_data.py` and produce the customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files, if the script runs successfully.\n",
|
|
@@ -472,7 +472,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"execution_count": null,
|
|
- "id": "compact-access",
|
|
|
|
|
|
+ "id": "exclusive-region",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"outputs": [],
|
|
"source": [
|
|
"source": [
|
|
@@ -490,7 +490,7 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "automatic-gravity",
|
|
|
|
|
|
+ "id": "armed-german",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"Check whether these two files : `customSentenceSplit_text_document.bin` and `customSentenceSplit_text_document.idx` files were successfully generated and is in the correct folder under dataset."
|
|
"Check whether these two files : `customSentenceSplit_text_document.bin` and `customSentenceSplit_text_document.idx` files were successfully generated and is in the correct folder under dataset."
|
|
@@ -499,7 +499,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"execution_count": null,
|
|
- "id": "substantial-spare",
|
|
|
|
|
|
+ "id": "fantastic-harmony",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"outputs": [],
|
|
"source": [
|
|
"source": [
|
|
@@ -509,7 +509,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"execution_count": null,
|
|
- "id": "civic-airplane",
|
|
|
|
|
|
+ "id": "final-stomach",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"outputs": [],
|
|
"source": [
|
|
"source": [
|
|
@@ -519,7 +519,7 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "velvet-tennessee",
|
|
|
|
|
|
+ "id": "still-movement",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"-----\n",
|
|
"-----\n",
|
|
@@ -528,7 +528,7 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "rising-ready",
|
|
|
|
|
|
+ "id": "organized-mother",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"-----\n",
|
|
"-----\n",
|