NotebookLlama: An Open Source version of NotebookLM

Author: Sanyam Bhutani

This is a guided series of tutorials/notebooks that can be used as a reference or a course for building a PDF-to-Podcast workflow.

It assumes zero prior knowledge of LLMs, prompting, or audio models; everything is covered in the respective notebooks.

Outline:

Requirements: a GPU server or an API provider for running the 70B, 8B, and 1B Llama models.

Note: For our GPU-poor friends, you can also use the 8B and lower models for the entire pipeline. There is no strong recommendation here; the pipeline below is simply what worked best in the first few tests. You should experiment and see what works best for you!

Here is the current outline:

  • Step 1: Pre-process PDF: Use Llama-3.2-1B to pre-process and save a PDF
  • Step 2: Transcript Writer: Use the Llama-3.1-70B model to write a podcast transcript from the text (a minimal sketch of this kind of call follows the list)
  • Step 3: Dramatic Re-Writer: Use the Llama-3.1-8B model to make the transcript more dramatic
  • Step 4: Text-To-Speech Workflow: Use parler-tts/parler-tts-mini-v1 and suno/bark to generate a conversational podcast
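
To make Steps 2 and 3 concrete, here is a minimal sketch of one writing step as a single chat completion with the transformers text-generation pipeline. The system prompt, the placeholder input text, and the 8B checkpoint are illustrative assumptions; the real prompts live in the Step 2 and Step 3 notebooks.

import torch
from transformers import pipeline

# Assumption: each writing step is one chat completion; the real system
# prompts are defined in Step-2-Transcript-Writer.ipynb and Step-3-Re-Writer.ipynb.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # swap in the 70B checkpoint for Step 2 if you have the GPUs
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

clean_text = "..."  # placeholder for the pre-processed PDF text from Step 1
messages = [
    {"role": "system", "content": "You are a podcast writer. Turn the text into a two-speaker transcript."},
    {"role": "user", "content": clean_text},
]
transcript = generator(messages, max_new_tokens=2048)[0]["generated_text"][-1]["content"]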

Steps to running the notebooks:

Right now there is one known issue: Parler needs transformers 4.43.3 or earlier, while the Llama generation steps need the latest version, so I am just switching versions on the fly (see the sketch below).
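
A minimal sketch of that switch as notebook cells; only the 4.43.3 ceiling comes from the note above, the exact commands are assumptions:

# Before Steps 1-3 (Llama generation): use a recent transformers
!pip install -U transformers

# Before Step 4 (Parler-TTS): pin the older version
!pip install "transformers<=4.43.3"

# Restart the kernel after switching so the new version is actually loaded.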

TODO-MORE

Next-Improvements/Further ideas:

  • Speech Model experimentation: The TTS model is the main limit on how natural this will sound. This can probably be improved with a better pipeline
  • LLM vs LLM Debate: Another approach to writing the podcast would be to have two agents debate the topic of interest and write the podcast outline. Right now we use a single LLM (70B) to write the podcast outline
  • Testing 405B for writing the transcripts
  • Better prompting
  • Support for ingesting a website, audio file, YouTube links and more. We welcome community PRs!

Scratch-pad/Running Notes:

Actually, this IS THE MOST CONSISTENT PROMPT:

Small:

description = """
Laura's voice is expressive and dramatic in delivery, speaking at a fast pace with a very close recording that almost has no background noise.
"""

Large:

description = """
Alisa's voice is consistent, quite expressive and dramatic in delivery, with a very close recording that almost has no background noise.
"""

Small:

description = """
Jenna's voice is consistent, quite expressive and dramatic in delivery, with a very close recording that almost has no background noise.
"""

Bark is cool, but only the v6 speaker works great; I tried v9, but it's quite robotic, and that is sad.

So Parler is next - it's quite cool for prompting.

XTTS-v2 by Coqui is cool; however, I need to check the license - I think an example is allowed.

Tortoise is blocking because it needs an HF transformers version that doesn't work with the Llama-3.2 models, so I will probably need to make a separate env - need to evaluate if it's worth it.

Side note: The TTS library is a really cool effort!

Bark-Tests: Best results for speaker v6 are at:

speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
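
For completeness, a minimal self-contained version of that Bark call via transformers; the spoken text is made up, and the v2/en_speaker_6 voice preset is my reading of "speaker v6" above.

from transformers import AutoProcessor, BarkModel
from IPython.display import Audio

processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark")

# Assumption: "speaker v6" refers to the v2/en_speaker_6 voice preset.
inputs = processor("Hello! [laughs] Welcome back to the show.", voice_preset="v2/en_speaker_6")

speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
sampling_rate = model.generation_config.sample_rate  # 24 kHz for Bark
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)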

Tested sound effects:

  • Laugh is probably the most effective
  • Sigh is hit or miss
  • Gasps don't work
  • A single hyphen is effective
  • Capitalisation makes it louder (all of these are combined in the example below)
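
A hypothetical prompt string combining the effects above (the sentence is made up; the bracketed laugh tag, the single hyphen, and the capitalisation are the knobs being demonstrated):

text_prompt = "And the results were INCREDIBLE - honestly, I didn't expect that. [laughs]"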

Ignore/Delete this in final stages; right now this is a "vibe-check" for TTS model(s):


Vibe check:

Higher barrier to testing (in other words - I was too lazy to test):

Try later:

Resources for further learning: