{ "cells": [ { "cell_type": "markdown", "id": "072150ea-1f44-4428-94ae-695ba94b2f7d", "metadata": {}, "source": [ "# Retrieval-Augmented Generation for Presidential Speeches using Groq API and LangChain" ] }, { "cell_type": "markdown", "id": "d7a4fc92-eb9a-4273-8ff6-0fc5b96236d7", "metadata": {}, "source": [ "Retrieval-Augmented Generation (RAG) is a widely used technique for retrieving pertinent information from an external data source and providing it to a Large Language Model (LLM). It helps address two of the biggest limitations of LLMs: knowledge cutoffs, in which information after a certain date or from a specific source is not available to the LLM, and hallucination, in which the LLM makes up an answer to a question it doesn't have the information for. With RAG, we can supply the LLM with relevant information for the question at hand." ] }, { "cell_type": "markdown", "id": "ea1ae66c-a322-467d-b789-f7ce5a636ad7", "metadata": {}, "source": [ "In this notebook we will use [Groq API](https://console.groq.com), [LangChain](https://www.langchain.com/) and [Pinecone](https://www.pinecone.io/) to perform RAG on [presidential speech transcripts](https://millercenter.org/the-presidency/presidential-speeches) from the Miller Center at the University of Virginia. To do so, we will create vector embeddings for each speech, store them in a vector database, retrieve the speech excerpts most relevant to the user prompt and include them in the context for the LLM." 
] }, { "cell_type": "markdown", "id": "d7784880-495e-4d7c-a045-d12b7f57b65d", "metadata": {}, "source": [ "### Setup" ] }, { "cell_type": "code", "execution_count": 1, "id": "b4679c23-7035-4276-b3d6-95cd89916477", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from groq import Groq\n", "import os\n", "import pinecone\n", "\n", "from langchain_community.vectorstores import Chroma\n", "from langchain.text_splitter import TokenTextSplitter\n", "from langchain.docstore.document import Document\n", "from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings\n", "from langchain_pinecone import PineconeVectorStore\n", "from transformers import AutoModelForCausalLM, AutoTokenizer\n", "from sklearn.metrics.pairwise import cosine_similarity\n", "\n", "from IPython.display import display, HTML" ] }, { "attachments": {}, "cell_type": "markdown", "id": "4c18688b-178f-439d-90a4-590f99ade11f", "metadata": {}, "source": [ "A Groq API key is required for this demo; you can generate one for free [here](https://console.groq.com/). We will use Pinecone as our vector database, which also requires an API key (its Starter plan lets you create one index for a small project for free). We will also show how the same workflow works with [Chroma DB](https://www.trychroma.com/), a free, open-source alternative that stores vector embeddings in memory. The LLM for this demo is the Llama3 8b model." 
] }, { "cell_type": "code", "execution_count": 2, "id": "14fd5b33-360e-4fbe-ad29-11d5f759b0d3", "metadata": {}, "outputs": [], "source": [ "groq_api_key = os.getenv('GROQ_API_KEY')\n", "pinecone_api_key = os.getenv('PINECONE_API_KEY')\n", "\n", "client = Groq(api_key = groq_api_key)\n", "model = \"llama3-8b-8192\"" ] }, { "cell_type": "markdown", "id": "469e5b3a-6c5d-49cd-a547-222d45d7a996", "metadata": {}, "source": [ "### RAG Basics with One Document" ] }, { "cell_type": "markdown", "id": "283183cd-ba64-4e98-a0d9-a6165e88494e", "metadata": {}, "source": [ "The presidential speeches we'll be using are stored in this [.csv file](https://github.com/groq/groq-api-cookbook/blob/main/presidential-speeches-rag/presidential_speeches.csv). Each row of the .csv contains fields for the date, president, party, speech title, speech summary and speech transcript, and includes every recorded presidential speech through the Trump presidency:" ] }, { "cell_type": "code", "execution_count": 2, "id": "d1017409-cb0e-402b-9c53-c61729296bd2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | Date | President | Party | Speech Title | Summary | Transcript | URL |
|---|---|---|---|---|---|---|---|
| 0 | 1789-04-30 | George Washington | Unaffiliated | First Inaugural Address | Washington calls on Congress to avoid local an... | Fellow Citizens of the Senate and the House of... | https://millercenter.org/the-presidency/presid... |
| 1 | 1789-10-03 | George Washington | Unaffiliated | Thanksgiving Proclamation | At the request of Congress, Washington establi... | Whereas it is the duty of all Nations to ackno... | https://millercenter.org/the-presidency/presid... |
| 2 | 1790-01-08 | George Washington | Unaffiliated | First Annual Message to Congress | In a wide ranging speech, President Washington... | Fellow Citizens of the Senate and House of Rep... | https://millercenter.org/the-presidency/presid... |
| 3 | 1790-12-08 | George Washington | Unaffiliated | Second Annual Message to Congress | Washington focuses on commerce in his second a... | Fellow citizens of the Senate and House of Rep... | https://millercenter.org/the-presidency/presid... |
| 4 | 1790-12-29 | George Washington | Unaffiliated | Talk to the Chiefs and Counselors of the Senec... | The President reassures the Seneca Nation that... | I the President of the United States, by my ow... | https://millercenter.org/the-presidency/presid... |