@@ -2,7 +2,7 @@
 "cells": [
 {
 "cell_type": "markdown",
- "id": "fixed-species",
+ "id": "amateur-threat",
 "metadata": {},
 "source": [
 "## Customize preprocess_data.py\n",
@@ -10,11 +10,11 @@
 "\n",
 "## Learning Objectives\n",
 "\n",
- "We fetched our own Swedish raw text data in `Lab2-1_acquiring_data.ipynb`, we learned how to find sentence boundary with custom functions in `Lab2-2_SentenceBoundary_and_Deduplicate.ipynb` , we also trained a GPTBPETokenizer and fitted it to our raw Swedish text with `Lab2-3_train_own_GPT2BPETokenizer.ipynb`. \n",
+ "We fetched our own Swedish raw text data in `Lab2-1_acquiring_data.ipynb`, learned how to find sentence boundaries with custom functions in `Lab2-2_SentenceBoundary_and_Deduplicate.ipynb`, and trained a GPTBPETokenizer fitted to our raw Swedish text with `Lab2-3_train_own_GPT2BPETokenizer.ipynb`. \n",
 "\n",
- "We are now ready to incorporate the custom sentence-splitter into preprocess_data.py and covert the raw Swedish text to, first json format, and then mmap format.\n",
+ "We are now ready to incorporate the custom sentence-splitter into preprocess_data.py and convert the raw Swedish text first to json format, and then to mmap format.\n",
 "\n",
- "Therefore, the goal of this notebook is to integrate all knowledge gained from both Lab 1 as well as the above notebooks, and challenge ourselves to further customize the preprocess_data.py with a <a href=\"./Lab2-4_customize_process2mmap.ipynb#Custom-Sentence-Splitter\">custom sentence-splitter</a> function, and in the process, convert the new raw Sweden text to mmap format.\n",
+ "Therefore, the goal of this notebook is to integrate the knowledge gained from both Lab 1 and the above notebooks, and challenge ourselves to further customize preprocess_data.py with a <a href=\"./Lab2-4_customize_process2mmap.ipynb#Custom-Sentence-Splitter\">custom sentence-splitter</a> function. In the process, we'll convert the new raw Swedish text to mmap format.\n",
 "\n",
 "More specifically, this notebook will cover the steps to :\n",
 "\n",
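The txt-to-json conversion described above can be sketched as follows. This is a minimal illustration, not the notebook's own cell: the file names and the one-document-per-line assumption are placeholders, and the `{"text": ...}` loose-JSON layout is the format that preprocess_data.py reads.

```python
import json

def txt_to_loose_json(in_path, out_path):
    # Each non-empty line of raw text becomes one {"text": ...} record,
    # i.e. the loose-JSON (jsonl) layout consumed by preprocess_data.py.
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if line:
                fout.write(json.dumps({"text": line}, ensure_ascii=False) + "\n")

# Tiny stand-in input; the lab uses webnyheter2013.txt instead.
with open("sample.txt", "w", encoding="utf-8") as f:
    f.write("Vi har en bra generation .\nEn annan mening .\n")
txt_to_loose_json("sample.txt", "sample.json")
```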
@@ -27,7 +27,7 @@
 },
 {
 "cell_type": "markdown",
- "id": "comparative-render",
+ "id": "pleasant-brake",
 "metadata": {},
 "source": [
 "1. Convert the extracted raw Swedish text from webnyheter2013.txt to webnyheter2013.json."
@@ -36,7 +36,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
- "id": "alien-spanking",
+ "id": "diverse-winning",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -45,10 +45,10 @@
 },
 {
 "cell_type": "markdown",
- "id": "quiet-innocent",
+ "id": "independent-houston",
 "metadata": {},
 "source": [
- "Below is the expected outputs :\n",
+ "Below is the expected output:\n",
 "\n",
 " process 1000000 documents so far ...\n",
 " example: – Vi har en bra generation som spelat tillsammans ett tag .\n",
@@ -58,16 +58,16 @@
 },
 {
 "cell_type": "markdown",
- "id": "relative-execution",
+ "id": "every-equilibrium",
 "metadata": {},
 "source": [
- "2. Generate the mmap format files by default preprocess_data.py as the first step to ensure we have data necessary for the next notebook to run, in case time runs out."
+ "2. Generate the mmap format files with the default preprocess_data.py as the first step, to ensure we have the necessary data for the next notebook to run in case time runs out."
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
- "id": "known-illness",
+ "id": "black-schedule",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -81,7 +81,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
- "id": "least-platform",
+ "id": "cloudy-brighton",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -99,10 +99,10 @@
 },
 {
 "cell_type": "markdown",
- "id": "lined-literacy",
+ "id": "driven-terminal",
 "metadata": {},
 "source": [
- "Below is the expected outputs :\n",
+ "Below is the expected output:\n",
 "\n",
 " Processed 1248300 documents (52998.601302473544 docs/s, 5.869853647730749 MB/s).\n",
 " Processed 1248400 documents (53001.39142986273 docs/s, 5.870136451906283 MB/s).\n",
@@ -116,14 +116,14 @@
 },
 {
 "cell_type": "markdown",
- "id": "periodic-treaty",
+ "id": "superior-stuff",
 "metadata": {},
 "source": [
- "Now we get the default mmap files (xxx.bin and xxx.idx ) and therefore guarantee we have the data needed for the next notebook to run disregard whether we finish the mini-challenge or not. \n",
+ "Now we have the default mmap files (xxx.bin and xxx.idx) and can therefore guarantee we have the data needed for the next notebook to run, regardless of whether we finish the mini-challenge. \n",
 "\n",
- "We can now move on. We start by copy the old preprocess_data.py and rename it to `MYpreprocess_data.py`. \n",
+ "We can now move on. We start by copying the old preprocess_data.py and renaming it to `MYpreprocess_data.py`. \n",
 "\n",
- "Note: As best practice, one never overwrites original python script existed in the given repo directly, one copies the original python script and rename it to a new python script, then work on the new python script, in case of irreversible failures, one can always refer to the original python script, and start again.\n",
+ "Note: As a best practice, never overwrite an original python script in the given repo directly. Copy the original script, rename the copy, and work on the new script; in case of irreversible failures, you can always go back to the original script and start again.\n",
 "\n",
 "The below code block will duplicate the preprocess_data.py script and renamed the copied python script into a new python script called `MYpreprocess_data.py`."
 ]
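The copy-then-edit practice described in that note is a one-liner with `shutil`. A minimal sketch — the file names match the notebook, but the stand-in file creation exists only so the snippet runs outside the lab repo:

```python
import os
import shutil

src = "preprocess_data.py"    # original script; never edited directly
dst = "MYpreprocess_data.py"  # working copy that we customize

# Stand-in so this sketch runs anywhere; in the lab the file already exists.
if not os.path.exists(src):
    with open(src, "w", encoding="utf-8") as f:
        f.write("# original preprocess_data.py\n")

shutil.copyfile(src, dst)     # the untouched original remains as a fallback
```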
@@ -131,7 +131,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
- "id": "norman-accreditation",
+ "id": "protective-topic",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -140,7 +140,7 @@
 },
 {
 "cell_type": "markdown",
- "id": "maritime-bunny",
+ "id": "restricted-holiday",
 "metadata": {},
 "source": [
 "<a id=\"Custom-Sentence-Splitter\"></a>"
@@ -148,7 +148,7 @@
 },
 {
 "cell_type": "markdown",
- "id": "foreign-advocacy",
+ "id": "funny-evaluation",
 "metadata": {},
 "source": [
 "The custom sentence-splitter `cut_sentence_with_quotation_marks` function is provided below for your convenience, please integrate this custom function into `MYpreprocess_data.py`."
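To give a feel for what such a splitter does, here is an illustrative sketch of the idea — this is NOT the lab's `cut_sentence_with_quotation_marks` implementation (which the notebook provides), only a minimal regex-based stand-in: end a sentence at `.`, `!` or `?`, but keep a closing quotation mark attached so dialogue is not cut mid-quote.

```python
import re

def split_keeping_quotes(text):
    # Illustrative splitter (not the lab's implementation): a sentence runs
    # up to a terminator, optionally followed by a closing quote character,
    # so 'Han sa "Hej!"' stays one sentence instead of breaking inside it.
    pattern = r'[^.!?]*[.!?]["\u201d\u00bb]?'
    return [s.strip() for s in re.findall(pattern, text) if s.strip()]

print(split_keeping_quotes('Han sa "Hej!" Vi gick hem.'))
# → ['Han sa "Hej!"', 'Vi gick hem.']
```

Note that this sketch silently drops any trailing text without a terminator; the notebook's own function is the one to integrate.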
@@ -157,7 +157,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
- "id": "celtic-latter",
+ "id": "federal-midwest",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -192,7 +192,7 @@
 },
 {
 "cell_type": "markdown",
- "id": "bacterial-consequence",
+ "id": "reported-silver",
 "metadata": {},
 "source": [
 "<a id=\"Mini-Challenge\"></a>"
@@ -200,11 +200,11 @@
 },
 {
 "cell_type": "markdown",
- "id": "separated-occupation",
+ "id": "dress-container",
 "metadata": {},
 "source": [
 "---\n",
- "## **Mini-Challenge ** - integrate the custom sentence splitter into MYpreprocess_data.py\n",
+ "## **Mini-Challenge** - integrate the custom sentence splitter into MYpreprocess_data.py\n",
 "\n",
 "Task : Modify and overwrite `MYpreprocess_data.py` below to incoporate the custom `cut_sentence_with_quotation_marks`\n",
 "\n",
@@ -222,7 +222,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
- "id": "unknown-seven",
+ "id": "modern-bunny",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -435,18 +435,18 @@
 },
 {
 "cell_type": "markdown",
- "id": "ruled-service",
+ "id": "continuing-digest",
 "metadata": {},
 "source": [
- "Below cell block specify all the input parameters in order to run `MYpreprocess_data.py`. \n",
+ "The below cell block specifies all the input parameters needed to run `MYpreprocess_data.py`. \n",
 "\n",
- "Please do **NOT** modify anything in below cell."
+ "Please do **NOT** modify anything in the below cell."
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
- "id": "simplified-antarctica",
+ "id": "changed-indiana",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -459,20 +459,20 @@
 },
 {
 "cell_type": "markdown",
- "id": "understanding-things",
+ "id": "unauthorized-manor",
 "metadata": {},
 "source": [
- "Below code block is a ReRun cell to launch `MYpreprocess_data.py` and produce the customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files, if the script runs successfully.\n",
+ "The below code block is a ReRun cell that launches `MYpreprocess_data.py` and, if the script runs successfully, produces the customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files.\n",
 "\n",
 "<a id=\"Rerun_Cell\"></a>\n",
 "\n",
- "Go back and modify `MYpreprocess_data.py`, click on this shortcut link to <a href=\"./Lab2-4_customize_process2mmap.ipynb#MODIFY_CELL\">Jump to Modify MYpreprocess_data.py</a> "
+ "Go back and modify `MYpreprocess_data.py`. Click on this shortcut link to <a href=\"./Lab2-4_customize_process2mmap.ipynb#MODIFY_CELL\">Jump to Modify MYpreprocess_data.py</a> "
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
- "id": "exclusive-region",
+ "id": "specific-presence",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -490,16 +490,16 @@
 },
 {
 "cell_type": "markdown",
- "id": "armed-german",
+ "id": "compound-photographer",
 "metadata": {},
 "source": [
- "Check whether these two files : `customSentenceSplit_text_document.bin` and `customSentenceSplit_text_document.idx` files were successfully generated and is in the correct folder under dataset."
+ "Check whether the two files `customSentenceSplit_text_document.bin` and `customSentenceSplit_text_document.idx` were successfully generated and are in the correct folder under dataset."
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
- "id": "fantastic-harmony",
+ "id": "quarterly-mediterranean",
 "metadata": {},
 "outputs": [],
 "source": [
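That existence check can be sketched in a few lines; the `dataset` folder name is a placeholder for wherever the lab actually writes its output files.

```python
import os

def check_mmap_outputs(out_dir="dataset"):
    # Report whether the .bin/.idx pair produced by MYpreprocess_data.py
    # exists under the (placeholder) dataset folder.
    names = ["customSentenceSplit_text_document.bin",
             "customSentenceSplit_text_document.idx"]
    return {n: os.path.exists(os.path.join(out_dir, n)) for n in names}

print(check_mmap_outputs())
```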
@@ -509,7 +509,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
- "id": "final-stomach",
+ "id": "distinguished-latitude",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -519,7 +519,7 @@
 },
 {
 "cell_type": "markdown",
- "id": "still-movement",
+ "id": "nervous-farming",
 "metadata": {},
 "source": [
 "-----\n",
@@ -528,7 +528,7 @@
 },
 {
 "cell_type": "markdown",
- "id": "organized-mother",
+ "id": "neural-motor",
 "metadata": {},
 "source": [
 "-----\n",