Igor Kasianenko 2 days ago
parent
commit
4ecdbf9639

+ 1 - 1
end-to-end-use-cases/benchmarks/evals_synthetic_data/README.md

@@ -1,6 +1,6 @@
 # Creating Evals with synthetic data and measuring hallucinations
 
-When you deploy Llama for your use case, it is a good practice to have Evals for your usecase. In an ideal world, you want to have human annotated Evals. If for some reason, this is not possible, this notebook shows a strategy for how one might go about addressing Evals using synthetic data. However, the Evals generated still require validation by a human to make sure that your revenue generating production use case can rely on this. 
+When you deploy Llama for your use case, it is a good practice to have Evals for your use case. In an ideal world, you want to have human-annotated Evals. If for some reason this is not possible, this notebook shows a strategy for how one might go about creating Evals using synthetic data. However, the Evals generated still require validation by a human to make sure that your revenue-generating production use case can rely on them. 
 The notebook also shows how one could accurately measure hallucinations without using LLM-As-A-Judge methodology using Llama
 
 

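The notebook's hallucination check (in the hunks below) prompts Llama to answer field-level questions about a generated report and wrap each True/False verdict in `<answer>` tags. A minimal sketch of how that output could be scored, assuming a hypothetical `parse_answers` helper and that the model emits one tagged verdict per field as the prompt instructs:

```python
import re

def parse_answers(llm_output: str) -> list[bool]:
    """Extract True/False verdicts wrapped in <answer>...</answer> tags.

    Hypothetical helper: the check_hallucinations prompt asks the model to
    return one tagged verdict per field; the fraction of True verdicts is
    a simple field-level accuracy, with no LLM-as-a-Judge involved.
    """
    matches = re.findall(r"<answer>\s*(True|False)\s*</answer>", llm_output)
    return [m == "True" for m in matches]

# Example model output with one correct field and one flagged mismatch
sample = (
    "student_id: <answer>True</answer>\n"
    "gpa: <answer>False</answer> report says 3.9, ground truth says 4.0"
)
verdicts = parse_answers(sample)
accuracy = sum(verdicts) / len(verdicts)  # fraction of verified fields
```

This keeps the verification deterministic: the model only answers narrow factual questions, and plain string parsing does the scoring.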
+ 19 - 12
end-to-end-use-cases/benchmarks/evals_synthetic_data/evals_with_synthetic_data.ipynb

@@ -5,10 +5,10 @@
    "id": "527a835c-afc5-4df7-a924-d5f61a417cf2",
    "metadata": {},
    "source": [
-    "# Creating Evals with synthetic data and measuring halucinations\n",
+    "# Creating Evals with synthetic data and measuring hallucinations\n",
     "\n",
-    "When you deploy Llama for your use case, it is a good practice to have Evals for your usecase. Though it might be ideal to have human annotated Evals, this notebook shows a strategy fow how one might go about addressing this using synthetic data. However, the Evals generated still requires validation by a human to make sure that your production use case can rely on this. \n",
-    "The notebook also shows how one could accurately measure halucinations without using LLM-As-A-Judge methodology using Llama"
+    "When you deploy Llama for your use case, it is a good practice to have Evals for your use case. Though it might be ideal to have human-annotated Evals, this notebook shows a strategy for how one might go about addressing this using synthetic data. However, the Evals generated still require validation by a human to make sure that your production use case can rely on them. \n",
+    "The notebook also shows how one could use Llama to accurately measure hallucinations without using LLM-As-A-Judge methodology"
    ]
   },
   {
@@ -1029,6 +1029,7 @@
     "### Generate 12 examples of synthetic data using this loop\n",
     "\n",
     "Why 12?: We will use 2 examples for few shot prompting and the rest 10 for Evals.\n",
+    "\n",
     "In practice, you want the number of data points to be much higher for your production application"
    ]
   },
@@ -1258,7 +1259,7 @@
    "source": [
     "## Important!!!!! Verification by human!\n",
     "\n",
-    "At this point, ideally you need a human to look at the synthetic data that you have generated and fix any errors in the formating or factual information or be aware of the number of errors in the dataset"
+    "At this point, ideally you need a human to look at the synthetic data that you have generated and fix any errors in the formatting or factual information, or at least be aware of the number of errors in the dataset"
    ]
   },
   {
@@ -1291,12 +1292,12 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 13,
+   "execution_count": null,
    "id": "2c35d758-3d7b-4512-9bdf-8055b9bff3e2",
    "metadata": {},
    "outputs": [],
    "source": [
-    "check_halucination = lambda data,index: f\"\"\"You are a Helpful Assistant. \n",
+    "check_hallucinations = lambda data,index: f\"\"\"You are a Helpful Assistant. \n",
     "Look at the section called Generated Report below & answer the following questions by only looking\n",
     "at the corresponding sections in the report\n",
     "- student_id: Question : What is the student id?\n",
@@ -1312,7 +1313,7 @@
     "Compare your answers with the ground truth and return either True or False within the tags <answer> & </answer>\n",
     "Only if an answer is False, explain why in the format shown in the examples below\n",
     "\n",
-    "Gound Truth:\n",
+    "Ground Truth:\n",
     "{synthetic_data.loc[[synthetic_data.index[index]]]}\n",
     "\n",
     "\n",
@@ -1323,7 +1324,7 @@
     "- DO NOT reason or explain your process\n",
     "- DO NOT code this\n",
     "- DO NOT explain why something is True\n",
-    "- Be linient when checking decimal points. Ex: 4.0 is the same as 4\n",
+    "- Be lenient when checking decimal points. Ex: 4.0 is the same as 4\n",
     "\n",
     "Example:\n",
     "1)\n",
@@ -1331,7 +1332,7 @@
     "\n",
     "{example_data[\"report\"]}\n",
     "\n",
-    "and the ground grouth:\n",
+    "and the ground truth:\n",
     "{synthetic_data.loc[[synthetic_data.index[0]]]}\n",
     "\n",
     "the following output is expected:\n",
@@ -1350,7 +1351,7 @@
     "\n",
     "{example_data_1[\"report\"]}\n",
     "\n",
-    "and the ground grouth:\n",
+    "and the ground truth:\n",
     "{synthetic_data.loc[[synthetic_data.index[1]]]}\n",
     "\n",
     "the following output is expected:\n",
@@ -1403,7 +1404,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 14,
+   "execution_count": null,
    "id": "b2ee3a24-bb64-4f3a-a547-5e976bbf817b",
    "metadata": {},
    "outputs": [
@@ -1536,7 +1537,7 @@
     "    print(f\"\\nChecking accuracy of generated report in {fname}\\n\")\n",
     "    data = read_json_file(fname)\n",
     "    \n",
-    "    formatted_prompt = check_halucination(data, i)\n",
+    "    formatted_prompt = check_hallucinations(data, i)\n",
     "\n",
     "    input = tokenizer([formatted_prompt], return_tensors=\"pt\").to(\"cuda\")\n",
     "    \n",
@@ -1560,6 +1561,12 @@
     "- Llama can be used to create evals given few samples of ground truth\n",
    "- Using simple QA to measure hallucinations can be an effective strategy to be confident that important factual information is being verified "
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "785aaee8",
+   "metadata": {},
+   "source": []
   }
  ],
  "metadata": {