Hacker News

Hi there! "Author" here - glad to see this picking up. This was a fascinating project to work on and I learned a ton in the process. As is often the case, I would do a lot of things differently if I were starting from scratch today.


How did you manage to get us to read this in their respective voices?


Ah! Welcome to my head.


This is incredible! I've sent it to several people already. Is there any chance you could provide more details as to the tech stack / training / technical setup?


Thanks for asking, I think I'll do a write-up later this week. Let me know if you have any specific questions.


That'd be great.


how did you not go insane working on this?!


One of the guiding lights at the very beginning of this project was the question: "who would I never get tired of listening to?"


John Malkovich?


amazing work!

Curious how you cloned the voices - tortoise? I've previously tried Herzog, but couldn't quite train the German accent...


I haven't tried Tortoise, thanks for pointing me to it. The voices were cloned by fine-tuning a VITS model with coqui.ai. I used about two hours of speech for each speaker. With more time and resources, I'm certain those voices could be made considerably better.
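For a rough sense of scale, here's a small stdlib-only sketch (not the author's code; the function names are invented, and the two-hour target is just the figure from the comment above) for tallying how much speech a WAV dataset contains:

```python
import wave

def total_duration_seconds(paths):
    """Sum the play time of a list of WAV files."""
    total = 0.0
    for p in paths:
        with wave.open(p, "rb") as w:
            total += w.getnframes() / w.getframerate()
    return total

def enough_for_cloning(paths, target_hours=2.0):
    """Hypothetical check: does the dataset reach the ~2 hours
    per speaker the author reports using?"""
    return total_duration_seconds(paths) >= target_hours * 3600
```
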


Can I get an invite link?


No need to be invited. Between their GitHub[1] page and the documentation[2], you'll find everything you need to get started.

[1] https://github.com/coqui-ai/TTS

[2] https://tts.readthedocs.io/en/latest/


How long did you train the models for each speaker, and what hardware were you using?


It was fine-tuning, so the process was a lot faster than I originally anticipated. I'd say it was between 36 and 72 hours for each voice. I have been working in a Gradient notebook provided by Paperspace, which guaranteed me A6000 instances (48GB GPU RAM) at a reasonable flat rate. I discovered them after being repeatedly frustrated by the random allocation of GPUs on Colab's Pro+ plan.


How much input audio would you need to produce audiobook quality? Hint Hint...



I don’t know if this is useful, but Herzog has a distinctly Bavarian accent. And of course has spent most of his adult life far from there, so it’s not quite Bavarian either.

Training a Herzogbot on recordings/transcriptions of, say, Kinski would be a waste of time accent-wise.


Good stuff. I have a question. How did you align the transcribed interview with the audio?


I use Aeneas[1], a set of tools for forced alignment. I found it in equal measure an amazing and a hard-to-navigate resource. It took me a while to set up and configure everything to the point that it was usable. But when it works, it works well.

[1] https://github.com/readbeyond/aeneas
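Aeneas can emit its sync map as JSON: a top-level "fragments" list whose items carry "begin"/"end" timestamps (seconds, encoded as strings) and the aligned "lines" of text. A minimal sketch of consuming such a file (the function name is mine, and the format assumption reflects aeneas's default JSON output as I understand it, not the author's pipeline):

```python
import json

def load_fragments(sync_map_json):
    """Return (begin, end, text) triples from an aeneas JSON sync map.

    Assumes the default JSON layout: {"fragments": [{"begin": "...",
    "end": "...", "lines": [...]}, ...]} with times in seconds.
    """
    data = json.loads(sync_map_json)
    return [
        (float(f["begin"]), float(f["end"]), " ".join(f["lines"]))
        for f in data["fragments"]
    ]
```
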


How much manual pruning/selection from the generated text do you have to do?


At this point, zero. The framework I built automatically rejects certain patterns that are not conducive to an interesting conversation. The only thing I still do manually, and will probably automate, is deciding when to stop a generated segment.
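The author hasn't published the rejection rules, but the general shape of such a filter can be sketched with a reject-list of regex patterns (every pattern here is an invented example, not the real list):

```python
import re

# Hypothetical reject-list; the author's actual patterns are not public.
REJECT_PATTERNS = [
    re.compile(r"^\s*$"),                                # empty output
    re.compile(r"\b(\w+)(\s+\1\b){3,}", re.IGNORECASE),  # a word repeated 4+ times
    re.compile(r"https?://"),                            # stray URLs in "speech"
]

def accept(segment: str) -> bool:
    """Keep a generated segment only if no reject pattern fires."""
    return not any(p.search(segment) for p in REJECT_PATTERNS)
```
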


Impressive!


oh and zizek needs more sniffles lol


The main issue is that there was no sniffling symbol in the transcript. And the generated text wouldn't contain it either, because (thankfully) they are pruned out of the written interviews I used to train the model.


Thanks for the explanation. I had some assumptions but wasn’t totally sure how this was trained.

How would you make it sniffle in a natural way, too? It’s not a usual speech mannerism, and the way he does it is distinct. I wouldn’t know how to efficiently represent it with text. Maybe it’s easier than I’m imagining.


The TTS model is trained on two things: speech samples and their transcripts. If you add a sniffle symbol every time a sniffle appears in the speech, I am confident the model would pick up on that, and you would then be able to reproduce a sniffle in the generation step. The more time-consuming bit would be adding those sniffle symbols to the language model's training data, so that they would be organically produced in the text-generation phase.
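A sketch of what that transcript-annotation step might look like: given word-level timestamps (which forced alignment can supply) and a list of times where sniffles were heard, splice a symbol into the transcript at the right spots. The function name, the argument names, and the bracket token are all illustrative, not anything from the actual project:

```python
def tag_sniffles(words, word_times, sniffle_times, symbol="[sniffle]"):
    """Insert a sniffle symbol into a transcript at the right positions.

    words: list of transcript words; word_times: start time (s) of each
    word; sniffle_times: times (s) at which a sniffle occurs in the audio.
    """
    out = []
    sniffles = sorted(sniffle_times)
    i = 0
    for word, t in zip(words, word_times):
        # Emit any sniffles that happen before this word starts.
        while i < len(sniffles) and sniffles[i] <= t:
            out.append(symbol)
            i += 1
        out.append(word)
    # Sniffles after the last word.
    out.extend(symbol for _ in sniffles[i:])
    return " ".join(out)
```
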

But seriously, it's not worth it. I think he's a brilliant man with an idiosyncratic speech, let's leave it to that.


I agree, I personally don't hear his sniffles when I'm listening to him intently. It's irrelevant. I was mostly curious if and how, generally speaking, a model could be trained to sniffle. Now that you describe it though it seems fairly clear, so thanks!


in all seriousness how do you not hear them?


Just stop the audio output every 5 seconds and insert a sniffle sound; at least that's what it sounds like in real life haha


I assumed it would be difficult to include, it was just something I noticed about him



