Hi there! "Author" here - glad to see this picking up. This was a fascinating project to work on and I learned a ton in the process. As is often the case, I would do a lot of things differently if I were starting from scratch today.
This is incredible! I've sent it to several people already. Is there any chance you could provide more details as to the tech stack / training / technical setup?
I haven't tried Tortoise, thanks for pointing me to it.
The voices were cloned by fine-tuning a VITS model with coqui.ai. I used about two hours of speech for each speaker. With more time and resources, I'm certain those voices could be made considerably better.
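For anyone who wants to try something similar, a fine-tuning run with Coqui's TTS library looks roughly like the sketch below. It's along the lines of their published VITS recipe, not a drop-in script: config fields vary between releases, and every path, name, and hyperparameter here is a placeholder.

    from trainer import Trainer, TrainerArgs
    from TTS.tts.configs.shared_configs import BaseDatasetConfig
    from TTS.tts.configs.vits_config import VitsConfig
    from TTS.tts.datasets import load_tts_samples
    from TTS.tts.models.vits import Vits
    from TTS.tts.utils.text.tokenizer import TTSTokenizer
    from TTS.utils.audio import AudioProcessor

    # ~2 hours of clips for one speaker, with an LJSpeech-style metadata.csv
    # mapping each clip to its transcript (paths are placeholders).
    dataset_config = BaseDatasetConfig(
        formatter="ljspeech",
        meta_file_train="metadata.csv",
        path="data/herzog/",
    )

    config = VitsConfig(
        run_name="vits_herzog_finetune",
        batch_size=16,
        epochs=1000,
        text_cleaner="english_cleaners",
        use_phonemes=True,
        phoneme_language="en-us",
        phoneme_cache_path="runs/phoneme_cache",
        datasets=[dataset_config],
        output_path="runs/",
    )

    ap = AudioProcessor.init_from_config(config)
    tokenizer, config = TTSTokenizer.init_from_config(config)
    train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

    model = Vits(config, ap, tokenizer, speaker_manager=None)

    # restore_path starts from a pretrained checkpoint, which is what makes
    # this fine-tuning rather than training from scratch.
    trainer = Trainer(
        TrainerArgs(restore_path="checkpoints/vits_pretrained.pth"),
        config,
        output_path="runs/",
        model=model,
        train_samples=train_samples,
        eval_samples=eval_samples,
    )
    trainer.fit()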
It was fine-tuning, so the process was a lot faster than I originally anticipated; I'd say between 36 and 72 hours per voice. I worked on a Gradient notebook from Paperspace, which guaranteed me A6000 instances (48 GB of GPU RAM) at a reasonable flat rate. I discovered them after being repeatedly frustrated by the random GPU allocation on Colab's Pro+ plan.
I don’t know if this is useful, but Herzog has a distinctly Bavarian accent. And of course he has spent most of his adult life far from there, so it’s not quite Bavarian either.
Training a Herzogbot on recordings/transcriptions of, say, Kinski would be a waste of time accent-wise.
I use Aeneas[1], a set of tools for forced alignment. I found it in equal measure an amazing and a hard-to-navigate resource. It took me a while to set up and configure everything to the point that it was usable, but when it works, it works well.
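If it helps anyone get started, the basic task-execution API follows the pattern in their docs, roughly like this (file paths are placeholders; the same thing can also be driven from the command line via python -m aeneas.tools.execute_task):

    from aeneas.executetask import ExecuteTask
    from aeneas.task import Task

    # Task configuration: language, plain-text input, JSON sync map output.
    task = Task(config_string="task_language=eng|is_text_type=plain|os_task_file_format=json")
    task.audio_file_path_absolute = "/data/herzog_interview.mp3"
    task.text_file_path_absolute = "/data/herzog_interview.txt"
    task.sync_map_file_path_absolute = "/data/herzog_interview.json"

    # Run the forced alignment and write out the fragment timestamps.
    ExecuteTask(task).execute()
    task.output_sync_map_file()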
At this point, zero. The framework I built automatically rejects certain patterns that aren't conducive to an interesting conversation. The only thing I still do manually, and will probably automate, is deciding when to stop a generated segment.
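To give a flavor of what a filter like that can look like: the sketch below is purely illustrative, with a few generic stand-in rules for degenerate generations (too short, immediate phrase repetition, trailing off mid-sentence), not the actual patterns the framework screens for.

    import re

    # Illustrative stand-in rules only; the real filter's patterns may differ.
    def is_acceptable(segment: str) -> bool:
        if len(segment.split()) < 5:             # near-empty generation
            return False
        # A multi-word phrase immediately repeated ("and then and then ...").
        if re.search(r"\b(\w+(?:\s+\w+)+)\s+\1\b", segment, re.IGNORECASE):
            return False
        if not segment.rstrip().endswith((".", "!", "?")):
            return False                         # cut off mid-sentence
        return True

    print(is_acceptable("The jungle is not about dreams."))  # True
    print(is_acceptable("and then and then and then."))      # False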
The main issue is that there was no sniffle symbol in the transcript. The generated text wouldn't contain one either, because (thankfully) sniffles are pruned out of the written interviews I used to train the language model.
Thanks for the explanation. I had some assumptions but wasn’t totally sure how this was trained.
How would you make it sniffle in a natural way, too? It’s not a usual speech mannerism, and the way he does it is distinct. I wouldn’t know how to efficiently represent it with text. Maybe it’s easier than I’m imagining.
The TTS model is trained on two things: speech samples and their transcripts. If you add a sniffle symbol every time a sniffle appears in the speech, I'm confident the model would pick up on it, and you could then reproduce a sniffle at generation time. The more time-consuming part would be adding those sniffle symbols to the language model's training data, so that they'd show up organically in the text-generation phase.
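Concretely, the annotation step could look something like the sketch below. Everything in it is hypothetical: the token name is made up, and it assumes you have word-level start times (e.g. from forced alignment) plus a list of labeled sniffle times.

    # Hypothetical sketch: splice a dedicated token into a transcript wherever
    # a sniffle was labeled.
    SNIFFLE = "[sniffle]"

    def annotate(words, word_times, sniffle_times):
        out, i = [], 0
        sniffles = sorted(sniffle_times)
        for word, t in zip(words, word_times):
            # Emit the token before the first word that starts after it.
            while i < len(sniffles) and sniffles[i] <= t:
                out.append(SNIFFLE)
                i += 1
            out.append(word)
        out.extend(SNIFFLE for _ in sniffles[i:])  # sniffles past the last word
        return " ".join(out)

    print(annotate(["the", "jungle", "wins"], [0.0, 0.4, 0.9], [0.6]))
    # -> the jungle [sniffle] wins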
But seriously, it's not worth it. I think he's a brilliant man with an idiosyncratic speech; let's leave it at that.
I agree; I personally don't hear his sniffles when I'm listening to him intently. It's irrelevant. I was mostly curious whether and how, generally speaking, a model could be trained to sniffle. Now that you describe it, though, it seems fairly clear, so thanks!