Repository contents:
- Experiments
- resources
- README.md
- Step-1 PDF-Pre-Processing-Logic.ipynb
- Step-2-Transcript-Writer.ipynb
- Step-3-Re-Writer.ipynb
- Step-4-TTS-Workflow.ipynb
Author: Sanyam Bhutani
This is a guided series of tutorials/notebooks that can be used as a reference or course for building a PDF-to-Podcast workflow.
It assumes zero knowledge of LLMs, prompting, and audio models; everything is covered in the respective notebooks.
Requirements: a GPU server or an API provider for running the 70B, 8B, and 1B Llama models.
Note: for our GPU-poor friends, you can also run the entire pipeline with the 8B and smaller models. There is no strong recommendation; the pipeline below is simply what worked best in the first few tests. You should try it and see what works best for you!
Here is the current outline:
- Llama-3.2-1B to pre-process and save the PDF text
- Llama-3.1-70B model to write a podcast transcript from the text
- Llama-3.1-8B model to make the transcript more dramatic
- parler-tts/parler-tts-mini-v1 and bark/suno to generate a conversational podcast

So right now there is one issue: Parler needs transformers 4.43.3 or earlier, while generating with the newer Llama models needs the latest release, so for now I am just switching versions on the fly.
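The version clash above can be flagged with a small runtime check. A minimal sketch, assuming the `4.43.3` cutoff stated in the note; the helper names are mine, not from the notebooks:

```python
# Sketch: decide whether the installed transformers version can serve the
# Parler-TTS step. Parler needs transformers 4.43.3 or earlier, while the
# Llama-model steps need a newer release, so the environment is switched
# between steps. Helper names here are illustrative.

def parse_version(v: str) -> tuple[int, ...]:
    """Turn a version string like '4.43.3' into (4, 43, 3) for comparison."""
    return tuple(int(part) for part in v.split("."))

# Last transformers release Parler-TTS works with (from the note above).
PARLER_MAX = parse_version("4.43.3")

def parler_compatible(installed: str) -> bool:
    """True if the installed transformers version can run Parler-TTS."""
    return parse_version(installed) <= PARLER_MAX
```

For example, `parler_compatible("4.44.0")` is `False`, so you would reinstall `transformers<=4.43.3` before the Parler step and switch back to the newer release for the Llama steps.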
TODO-MORE
Actually, this IS THE MOST CONSISTENT PROMPT so far.

Small:
description = """
Laura's voice is expressive and dramatic in delivery, speaking at a fast pace with a very close recording that almost has no background noise.
"""
Large:
description = """
Alisa's voice is consistent, quite expressive and dramatic in delivery, with a very close recording that almost has no background noise.
"""
Small:
description = """
Jenna's voice is consistent, quite expressive and dramatic in delivery, with a very close recording that almost has no background noise.
"""
Bark is cool, but only the v6 speaker works great; I tried v9 but it's quite robotic, which is sad.
So Parler is next; it's quite cool for prompting.
xTTS-v2 by coqui is cool; however, I need to check the license. I think an example is allowed.
Tortoise is blocked because it needs a HF transformers version that doesn't work with the Llama 3.2 models, so I would probably need to make a separate env; need to evaluate whether it's worth it.
Side note: the TTS library is a really cool effort!
Bark tests: best results for speaker v6 are at speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
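Wrapped up as a self-contained function, the Bark settings above look roughly like this. A sketch, assuming the "v6 speaker" refers to the `v2/en_speaker_6` preset and that Bark is loaded via the transformers `BarkModel` API; imports are deferred inside the function since they need torch and model weights:

```python
def bark_line(text: str, preset: str = "v2/en_speaker_6"):
    """Generate one line with Bark using the settings that tested best above
    (sketch, not a tested implementation)."""
    from transformers import AutoProcessor, BarkModel

    processor = AutoProcessor.from_pretrained("suno/bark")
    model = BarkModel.from_pretrained("suno/bark")

    inputs = processor(text, voice_preset=preset)
    # temperature / semantic_temperature values copied from the note above.
    speech = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
    # Return raw samples plus the sample rate for playback / saving.
    return speech[0].cpu().numpy(), model.generation_config.sample_rate
```

The returned array and rate can then be played back with `Audio(samples, rate=rate)` as in the snippet above.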
Tested sound effects:
Ignore/delete this in the final stages; right now this is a "vibe check" for the TTS model(s):
Vibe check:
Higher barrier to testing (in other words, I was too lazy to test):
Try later: