{ "cells": [ { "cell_type": "markdown", "id": "536715fe-e842-4f3b-99a6-0598b2fd6584", "metadata": {}, "source": [ "# Structured Data Extraction: Social Determinants of Health from Clinical Notes with Groq API's JSON Mode" ] }, { "cell_type": "markdown", "id": "67d6ce26-dfa7-4fae-bbd7-68fdc79c45e6", "metadata": {}, "source": [ "Since the dawn of the Electronic Health Record, deriving meaningful insights about the social determinants of health of a patient population has been the holy grail of healthcare analytics. While discrete clinical data (vitals, lab results, diagnoses, etc.) is well understood, social determinants - things like financial insecurity, which can determine patient outcomes and barriers to care as much as the patient's clinical chart - are often hidden in clinical notes and unused by analytics departments. While some providers code social determinants using [Z codes](https://www.cms.gov/files/document/zcodes-infographic.pdf), these are often documented too inconsistently to rely on, and many risk models seeking to add a social determinant score simply default to using zip code as a crude proxy. With the emergence of Large Language Models, AI has the ability to extract and structure meaningful insights from free-text clinical notes at scale, enabling more effective patient outreach, better risk modeling and a more robust understanding of a patient population as a whole." ] }, { "cell_type": "markdown", "id": "d6cf8245-7e56-45bd-a7d1-2d17b39acc5f", "metadata": {}, "source": [ "This notebook shows how we can use Groq API's [JSON mode](https://console.groq.com/docs/text-chat#json-mode-object-object) feature to extract social determinants of health from fake clinical notes, structure them into a neat table that can be used for analytics and load them into [BigQuery](https://cloud.google.com/bigquery). With JSON mode, we can return structured data from the chat completion in a pre-defined format, making it a great feature for structuring unstructured data.
We will read in each note, ask the LLM to determine if certain social determinant features are met, output structured data and load it into a database to be incorporated with the rest of our clinical data marts." ] }, { "cell_type": "markdown", "id": "08fc3ceb-935c-47dd-a13b-7c63c36dbf3a", "metadata": {}, "source": [ "### Setup" ] }, { "cell_type": "code", "execution_count": 1, "id": "25b843f4-cbd0-443d-ad2b-fe1aab6ae4a6", "metadata": {}, "outputs": [], "source": [ "# Import packages\n", "from groq import Groq\n", "import pandas as pd\n", "import os\n", "from IPython.display import Markdown\n", "import json\n", "from google.cloud import bigquery\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "id": "c20ccfc5-ad0a-4d79-86ce-1db248b31c3e", "metadata": {}, "source": [ "This code block loads in clinical notes from [our repository](https://github.com/groq/groq-api-cookbook/json-mode-social-determinants-of-health/clinical_notes) and displays the first note. As you can see, this hypothetical patient has quite a few notable social determinants of health that contribute to their health outcomes and treatment:" ] }, { "cell_type": "code", "execution_count": 2, "id": "7801227a-a0e9-4a08-891a-340cc10689ea", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "**Date:** March 28, 2024\n", "\n", "**Patient:** David Smith, 42 years old\n", "\n", "**MRN:** 00456321\n", "\n", "**Chief Complaint:** \"I've been feeling really tired lately, more than usual.\"\n", "\n", "**History of Present Illness:** The patient is a 42-year-old who presents with a 3-week history of increased fatigue, decreased energy, and occasional headaches. The patient reports struggling with sleep due to stress from work and personal life. The patient is currently working two part-time jobs but still finds it hard to make ends meet, indicating financial stress. 
They express concern over the cost of medications and healthcare visits.\n", "\n", "**Past Medical History:** Type 2 Diabetes Mellitus, Hypertension\n", "\n", "**Social History:**\n", "The patient juggles two jobs to make ends meet, one at a local retail store and another in a fast-food chain, neither offering full-time hours or benefits. Despite the long hours, the patient mentions financial difficulties, especially with covering rent and providing. They share an apartment with three others in an area described as 'not the safest,' due to recent break-ins and a noticeable police presence. Meals are often missed or minimal, as the patient tries to stretch their budget, sometimes seeking help from local food banks when things get particularly tight.\n", "\n", "Educationally, the patient completed high school but hasn't pursued further studies, citing lack of funds and the immediate need to support their family after graduation. They rely on buses and trains for transportation, which complicates timely access to healthcare, often causing delays or missed appointments. Socially, the patient admits to feeling isolated, with most of their family living out of state after their divorce and a personal life that has been 'on hold' due to work and financial pressures. They have a basic health insurance plan with high co-payments, which has led to skipping some recommended medical tests and treatments.\n", "\n", "**Review of Systems:** Denies chest pain, shortness of breath, or fever. 
Reports occasional headaches.\n", "\n", "**Physical Examination:**\n", "- General: Appears tired but is alert and oriented.\n", "- Vitals: BP 142/89, HR 78, Temp 98.6°F, Resp 16/min\n", "\n", "**Assessment/Plan:**\n", "- Continue to monitor blood pressure and diabetes control.\n", "- Discuss affordable medication options with a pharmacist.\n", "- Refer to a social worker to address food security, housing concerns, and access to healthcare services.\n", "- Encourage the patient to engage with community support groups for social support.\n", "- Schedule a follow-up appointment in 4 weeks or sooner if symptoms worsen.\n", "\n", "**Comments:** The patient's health concerns are compounded by socioeconomic factors, including employment status, housing stability, food security, and access to healthcare. Addressing these social determinants of health is crucial for improving the patient's overall well-being.\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Define the directory path\n", "folder_path = 'clinical_notes/'\n", "\n", "# List all files in the directory\n", "file_list = os.listdir(folder_path)\n", "text_files = sorted([file for file in file_list if file.endswith('.txt')])\n", "\n", "with open(os.path.join(folder_path, text_files[0]), 'r') as file:\n", " clinical_note = file.read()\n", "\n", "display(Markdown(clinical_note))" ] }, { "cell_type": "markdown", "id": "589d5075-8ffe-4fae-ac15-db8578229209", "metadata": {}, "source": [ "### Define System and User Prompts" ] }, { "cell_type": "markdown", "id": "e98d9d36-fcaf-44b8-abb8-7aaa5e12a4ee", "metadata": {}, "source": [ "Crafting clear and effective prompts is crucial for generating valid LLM responses. In our case, we've defined the exact JSON schema for our social determinants of health table we expect the LLM to output and are including it in the system prompt. Then in the user prompt, we will include the entire clinical note in the context window." 
] }, { "cell_type": "code", "execution_count": 3, "id": "aa7ee55f-f7f4-4cad-a5ed-7af74c44cac3", "metadata": {}, "outputs": [], "source": [ "# Define system prompt (note: system prompt must contain \"JSON\" in it)\n", "system_prompt = '''\n", "You are a medical coding API specializing in social determinants of health that responds in JSON.\n", "Your job is to extract structured SDOH data from an unstructured clinical note and output the structured data in JSON.\n", "The JSON schema should include:\n", " {\n", " \"employment_status\": \"string (categorical: 'Unemployed', 'Part-time', 'Full-time', 'Retired')\",\n", " \"financial_stress\": \"boolean (TRUE if the patient mentions financial difficulties)\",\n", " \"housing_insecurity\": \"boolean (TRUE if the patient does not live in stable housing conditions)\",\n", " \"neighborhood_unsafety\": \"boolean (TRUE if the patient expresses concerns about safety)\",\n", " \"food_insecurity\": \"boolean (TRUE if the patient does not have reliable access to sufficient food)\",\n", " \"education_level\": \"string (categorical: 'None', 'High School', 'College', 'Graduate')\",\n", " \"transportation_inaccessibility\": \"boolean (TRUE if the patient does not have reliable transportation to healthcare appointments)\",\n", " \"social_isolation\": \"boolean (TRUE if the patient mentions feeling isolated or having a lack of social support)\",\n", " \"health_insurance_inadequacy\": \"boolean (TRUE if the patient's health insurance is insufficient)\",\n", " \"skipped_care_due_to_cost\": \"boolean (TRUE if the patient mentions skipping medical tests or treatments due to cost)\",\n", " \"marital_status\": \"string (categorical: 'Single', 'Married', 'Divorced', 'Widowed')\",\n", " \"language_barrier\": \"boolean (TRUE if the patient has language barriers to healthcare access)\"\n", " }\n", "'''" ] }, { "cell_type": "code", "execution_count": 4, "id": "74d387ec-eced-4364-b285-f039e1898a70", "metadata": {}, "outputs": [], "source": [ "# Define 
user prompt template\n", "user_prompt_template = '''\n", "Use information from the following clinical note to construct the proper JSON output:\n", "\n", "{clinical_note}\n", "'''" ] }, { "cell_type": "markdown", "id": "3c1f156f-2b89-4694-841a-fe9e535786ba", "metadata": {}, "source": [ "### Executing Chat Completions with JSON Mode " ] }, { "cell_type": "markdown", "id": "487b65b1-2df4-452d-a3eb-f4581c2cb1ac", "metadata": {}, "source": [ "Now that we have our notes and our prompts, let's try running a Groq chat completion with JSON mode enabled on the first clinical note to see if the speedy ```llama3-8b-8192``` model can correctly identify this patient's social determinants. Note that you will need a Groq API Key to proceed and can create an account [here](https://console.groq.com/) to generate one for free:" ] }, { "cell_type": "code", "execution_count": 5, "id": "32b4c870-9e85-495a-bc6d-63b2a09cdf5e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"employment_status\": \"Part-time\",\n", " \"financial_stress\": true,\n", " \"housing_insecurity\": true,\n", " \"neighborhood_unsafety\": true,\n", " \"food_insecurity\": true,\n", " \"education_level\": \"High School\",\n", " \"transportation_inaccessibility\": true,\n", " \"social_isolation\": true,\n", " \"health_insurance_inadequacy\": true,\n", " \"skipped_care_due_to_cost\": true,\n", " \"marital_status\": \"Divorced\",\n", " \"language_barrier\": false\n", "}\n" ] } ], "source": [ "# Establish client with GROQ_API_KEY environment variable\n", "client = Groq(api_key=os.getenv('GROQ_API_KEY'))\n", "model = \"llama3-8b-8192\"\n", "\n", "# Create chat completion object with JSON response format\n", "chat_completion = client.chat.completions.create(\n", " messages = [\n", " {\n", " \"role\": \"system\",\n", " \"content\": system_prompt\n", " },\n", " {\n", " \"role\": \"user\",\n", " \"content\": user_prompt_template.format(clinical_note=clinical_note),\n", " }\n", " 
],\n", " model = model,\n", " response_format = {\"type\": \"json_object\"} # Add this response format to configure JSON mode\n", ")\n", "\n", "social_determinants_json_string = chat_completion.choices[0].message.content\n", "print(social_determinants_json_string)" ] }, { "cell_type": "markdown", "id": "5bac117b-2fc0-4853-8511-06ead4ea3731", "metadata": {}, "source": [ "Looks good! The patient does in fact work part time, is divorced and has expressed concerns pertaining to their financial, housing and transportation situations, food insecurity, social isolation and healthcare costs. They do not have a language barrier.\n", "\n", "Now, let's wrap this in a function and apply it to the rest of our clinical notes:" ] }, { "cell_type": "code", "execution_count": 6, "id": "986d26a7-0908-42da-b13c-cf0af1880731", "metadata": {}, "outputs": [], "source": [ "def extract_sdoh_json(system_prompt,user_prompt,model):\n", " \n", " # Establish client with GROQ_API_KEY environment variable\n", " client = Groq(api_key=os.getenv('GROQ_API_KEY'))\n", " \n", " # Create chat completion object with JSON response format\n", " chat_completion = client.chat.completions.create(\n", " messages = [\n", " {\n", " \"role\": \"system\",\n", " \"content\": system_prompt\n", " },\n", " {\n", " \"role\": \"user\",\n", " \"content\": user_prompt, # Use the prompt passed in by the caller rather than the global clinical_note\n", " }\n", " ],\n", " model = model,\n", " response_format = {\"type\": \"json_object\"} # Add this response format to configure JSON mode\n", " )\n", " \n", " social_determinants_json_string = chat_completion.choices[0].message.content\n", "\n", " # Return json object of the chat output\n", " return json.loads(social_determinants_json_string)" ] }, { "cell_type": "code", "execution_count": 7, "id": "f202c6f1-8d63-4274-b4dc-09dbea2ac5e6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr style=\"text-align: right;\"><th></th><th>mrn</th><th>employment_status</th><th>financial_stress</th><th>housing_insecurity</th><th>neighborhood_unsafety</th><th>food_insecurity</th><th>education_level</th><th>transportation_inaccessibility</th><th>social_isolation</th><th>health_insurance_inadequacy</th><th>skipped_care_due_to_cost</th><th>marital_status</th><th>language_barrier</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>00456321</td><td>Part-time</td><td>True</td><td>True</td><td>True</td><td>True</td><td>High School</td><td>True</td><td>True</td><td>True</td><td>True</td><td>Divorced</td><td>False</td></tr>\n",
"<tr><th>1</th><td>00567289</td><td>Full-time</td><td>True</td><td>False</td><td>False</td><td>False</td><td>Bachelor's</td><td>False</td><td>True</td><td>False</td><td>False</td><td>Single</td><td>False</td></tr>\n",
"<tr><th>2</th><td>00678934</td><td>Retired</td><td>False</td><td>False</td><td>False</td><td>False</td><td>College</td><td>True</td><td>True</td><td>False</td><td>False</td><td>Widowed</td><td>False</td></tr>\n",
"<tr><th>3</th><td>00785642</td><td>Full-time</td><td>False</td><td>False</td><td>False</td><td>False</td><td>College</td><td>False</td><td>False</td><td>False</td><td>False</td><td>Married</td><td>False</td></tr>\n",
"<tr><th>4</th><td>00893247</td><td>Unemployed</td><td>True</td><td>True</td><td>True</td><td>True</td><td>High School</td><td>True</td><td>True</td><td>True</td><td>True</td><td>Single</td><td>True</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " mrn employment_status financial_stress housing_insecurity \\\n", "0 00456321 Part-time True True \n", "1 00567289 Full-time True False \n", "2 00678934 Retired False False \n", "3 00785642 Full-time False False \n", "4 00893247 Unemployed True True \n", "\n", " neighborhood_unsafety food_insecurity education_level \\\n", "0 True True High School \n", "1 False False Bachelor's \n", "2 False False College \n", "3 False False College \n", "4 True True High School \n", "\n", " transportation_inaccessibility social_isolation \\\n", "0 True True \n", "1 False True \n", "2 True True \n", "3 False False \n", "4 True True \n", "\n", " health_insurance_inadequacy skipped_care_due_to_cost marital_status \\\n", "0 True True Divorced \n", "1 False False Single \n", "2 False False Widowed \n", "3 False False Married \n", "4 True True Single \n", "\n", " language_barrier \n", "0 False \n", "1 False \n", "2 False \n", "3 False \n", "4 True " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Total latency: 4s\n", "\n", "model = \"llama3-8b-8192\"\n", "\n", "patients_data = []\n", "# Loop through each patient clinical note, extract structured SDOH and compile a list of JSON objects\n", "for file_name in text_files:\n", " with open(os.path.join(folder_path, file_name), 'r') as file:\n", " clinical_note = file.read()\n", " user_prompt = user_prompt_template.format(clinical_note=clinical_note)\n", " social_determinants_json = extract_sdoh_json(system_prompt,user_prompt,model)\n", " social_determinants_json['mrn'] = file_name[:-4] # The name of the file is the patient's MRN \n", " patients_data.append(social_determinants_json)\n", "\n", "# Flatten the results into a dataframe\n", "flattened_data = []\n", "for patient in patients_data:\n", " flattened_data.append({'mrn': patient['mrn'],\n", " 'employment_status': patient['employment_status'],\n", " 'financial_stress': patient['financial_stress'],\n", " 
'housing_insecurity': patient['housing_insecurity'],\n", " 'neighborhood_unsafety': patient['neighborhood_unsafety'],\n", " 'food_insecurity': patient['food_insecurity'],\n", " 'education_level': patient['education_level'],\n", " 'transportation_inaccessibility': patient['transportation_inaccessibility'],\n", " 'social_isolation': patient['social_isolation'],\n", " 'health_insurance_inadequacy': patient['health_insurance_inadequacy'],\n", " 'skipped_care_due_to_cost': patient['skipped_care_due_to_cost'],\n", " 'marital_status': patient['marital_status'],\n", " 'language_barrier': patient['language_barrier']})\n", "\n", "\n", "sdoh_df = pd.DataFrame(flattened_data)\n", "\n", "sdoh_df" ] }, { "cell_type": "markdown", "id": "8394734d-7ff5-41ad-a008-a4b6b33dd1a9", "metadata": {}, "source": [ "Nice! In just 4 seconds we've parsed through five clinical notes, extracted discrete features and structured them into a neat table. That low latency is important for scaling up, and is why Groq's speed makes it a strong fit for this type of task - at roughly 0.8 seconds per note, even a simple sequential loop like this one could process about 4,500 notes in an hour for a healthcare network with many providers." ] }, { "cell_type": "markdown", "id": "8ed70669-9461-4c79-876e-51484d5ee9dd", "metadata": {}, "source": [ "### Analyzing Structured Data" ] }, { "cell_type": "markdown", "id": "98b23483-82d7-4cb5-ad2a-762d11103be6", "metadata": {}, "source": [ "Now that our Social Determinants of Health are stored in a neat, structured format, we can analyze them much more easily than when they're trapped in unstructured clinical notes. 
Here is a bar plot showing the percentage of the patient population impacted by each social determinant - but with more data we could do far more advanced analyses, such as showing which determinants are most correlated with each other, or which ones are most predictive of various chronic conditions or negative health outcomes:" ] }, { "cell_type": "code", "execution_count": 8, "id": "13815785-c003-4046-8d30-1266ad387aef", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Limit dataframe to boolean fields\n", "df = sdoh_df[['financial_stress','housing_insecurity','neighborhood_unsafety','food_insecurity','transportation_inaccessibility','social_isolation','health_insurance_inadequacy','skipped_care_due_to_cost','language_barrier']]\n", "\n", "# Calculate the percentage of 'True' values for each boolean field\n", "percentages = df.mean() * 100 # df.mean() computes the mean for each column, 'True' is treated as 1, 'False' as 0\n", "\n", "# Plotting\n", "plt.figure(figsize=(10, 6))\n", "percentages.plot(kind='bar')\n", "plt.title('Percentage of Patients with Social Determinants')\n", "plt.ylabel('% of Patient Population')\n", "plt.xlabel('Social Determinant')\n", "plt.xticks(rotation=45)\n", "plt.grid(axis='y')\n", "\n", "# Display the plot\n", "plt.show()\n" ] }, { "cell_type": "markdown", "id": "c2ab9531-f8e6-4fe0-a8f8-962446eeaf08", "metadata": {}, "source": [ "### Loading to a Database" ] }, { "cell_type": "markdown", "id": "d8bf4c71-f864-4707-9788-0b0917ba2049", "metadata": {}, "source": [ "Finally, we will use pandas' ```to_gbq``` method (which relies on the pandas-gbq package) to load the results to our database - in this case, a BigQuery dataset called ```clinical```. In a real production environment, we could use a tool like [Airflow](https://airflow.apache.org/) to orchestrate the scheduling of this script and process any new notes from recent appointments. 
" ] }, { "cell_type": "code", "execution_count": 9, "id": "6846b38e-873e-451d-8762-24b5d88d3596", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 6034.97it/s]\n" ] } ], "source": [ "# Append results to a pre-existing BigQuery table\n", "client = bigquery.Client()\n", "sdoh_df.to_gbq('clinical.social_determinants',client.project,credentials=client._credentials,if_exists='append')" ] }, { "cell_type": "markdown", "id": "b0b1f47f-e0bf-4459-b2a8-76b5b213dc19", "metadata": {}, "source": [ "### Conclusion" ] }, { "cell_type": "markdown", "id": "7419eade-2599-46de-9a7c-4e1ba1a01e02", "metadata": {}, "source": [ "In this notebook, we've used Llama3 with Groq API's JSON mode to extract social determinants of health from clinical notes and structure them in a relational table, then loaded the results into BigQuery where they can be combined and analyzed with the rest of our patient data. With our social determinants of health now structured in our clinical data warehouse, our analytics team can use them in countless ways, from delivering much-needed SDOH insights to enhancing risk models. This allows the clinical practice to not just identify high-risk patients in their population, but to implement more targeted interventions by better understanding their barriers to care.\n", "\n", "More broadly, we've shown how to use Groq to build an LLM-infused data pipeline, one that transforms unstructured text data into structured, relational data that can reside in a warehouse. And with Groq's low latency, the ability to process more files per minute makes for a more efficient pipeline." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.3" } }, "nbformat": 4, "nbformat_minor": 5 }