Hi there! "Author" here - glad to see this picking up. This was a fascinating project to work on and I learned a ton in the process. As is often the case, I would do a lot of things differently if I were starting from scratch today.
This is incredible! I've sent it to several people already. Is there any chance you could provide more details as to the tech stack / training / technical setup?
I haven't tried Tortoise, thanks for pointing me to it.
The voices were cloned by fine-tuning a VITS model with coqui.ai. I used about two hours of speech for each speaker. With more time and resources, I'm certain those voices could be made considerably better.
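For anyone who wants to try something similar, a fine-tuning run with Coqui's TTS library looks roughly like the sketch below. It's along the lines of their published VITS recipe, not a drop-in script: config fields vary between releases, and every path, name, and hyperparameter here is a placeholder.

    from trainer import Trainer, TrainerArgs
    from TTS.tts.configs.shared_configs import BaseDatasetConfig
    from TTS.tts.configs.vits_config import VitsConfig
    from TTS.tts.datasets import load_tts_samples
    from TTS.tts.models.vits import Vits
    from TTS.tts.utils.text.tokenizer import TTSTokenizer
    from TTS.utils.audio import AudioProcessor

    # ~2 hours of clips for one speaker, with an LJSpeech-style metadata.csv
    # mapping each clip to its transcript (paths are placeholders).
    dataset_config = BaseDatasetConfig(
        formatter="ljspeech",
        meta_file_train="metadata.csv",
        path="data/herzog/",
    )

    config = VitsConfig(
        run_name="vits_herzog_finetune",
        batch_size=16,
        epochs=1000,
        text_cleaner="english_cleaners",
        use_phonemes=True,
        phoneme_language="en-us",
        phoneme_cache_path="runs/phoneme_cache",
        datasets=[dataset_config],
        output_path="runs/",
    )

    ap = AudioProcessor.init_from_config(config)
    tokenizer, config = TTSTokenizer.init_from_config(config)
    train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

    model = Vits(config, ap, tokenizer, speaker_manager=None)

    # restore_path starts from a pretrained checkpoint, which is what makes
    # this fine-tuning rather than training from scratch.
    trainer = Trainer(
        TrainerArgs(restore_path="checkpoints/vits_pretrained.pth"),
        config,
        output_path="runs/",
        model=model,
        train_samples=train_samples,
        eval_samples=eval_samples,
    )
    trainer.fit()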
It was fine-tuning, so the process was a lot faster than I originally anticipated; I'd say between 36 and 72 hours per voice. I worked on a Gradient notebook from Paperspace, which guaranteed me A6000 instances (48 GB of GPU RAM) at a reasonable flat rate. I discovered them after being repeatedly frustrated by the random GPU allocation on Colab's Pro+ plan.
I don’t know if this is useful, but Herzog has a distinctly Bavarian accent. And of course he has spent most of his adult life far from there, so it’s not quite Bavarian either.
Training a Herzogbot on recordings/transcriptions of, say, Kinski would be a waste of time accent-wise.
I use Aeneas[1], a set of tools for forced alignment. I found it in equal measure an amazing and a hard-to-navigate resource. It took me a while to set up and configure everything to the point that it was usable, but when it works, it works well.
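If it helps anyone get started, the basic task-execution API follows the pattern in their docs, roughly like this (file paths are placeholders; the same thing can also be driven from the command line via python -m aeneas.tools.execute_task):

    from aeneas.executetask import ExecuteTask
    from aeneas.task import Task

    # Task configuration: language, plain-text input, JSON sync map output.
    task = Task(config_string="task_language=eng|is_text_type=plain|os_task_file_format=json")
    task.audio_file_path_absolute = "/data/herzog_interview.mp3"
    task.text_file_path_absolute = "/data/herzog_interview.txt"
    task.sync_map_file_path_absolute = "/data/herzog_interview.json"

    # Run the forced alignment and write out the fragment timestamps.
    ExecuteTask(task).execute()
    task.output_sync_map_file()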
At this point, zero. The framework I built automatically rejects certain patterns that aren't conducive to an interesting conversation. The only thing I still do manually, and will probably automate, is deciding when to stop a generated segment.
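To give a flavor of what a filter like that can look like: the sketch below is purely illustrative, with a few generic stand-in rules for degenerate generations (too short, immediate phrase repetition, trailing off mid-sentence), not the actual patterns the framework screens for.

    import re

    # Illustrative stand-in rules only; the real filter's patterns may differ.
    def is_acceptable(segment: str) -> bool:
        if len(segment.split()) < 5:             # near-empty generation
            return False
        # A multi-word phrase immediately repeated ("and then and then ...").
        if re.search(r"\b(\w+(?:\s+\w+)+)\s+\1\b", segment, re.IGNORECASE):
            return False
        if not segment.rstrip().endswith((".", "!", "?")):
            return False                         # cut off mid-sentence
        return True

    print(is_acceptable("The jungle is not about dreams."))  # True
    print(is_acceptable("and then and then and then."))      # False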
The main issue is that there was no sniffle symbol in the transcript. The generated text wouldn't contain one either, because (thankfully) sniffles are pruned out of the written interviews I used to train the language model.
Thanks for the explanation. I had some assumptions but wasn’t totally sure how this was trained.
How would you make it sniffle in a natural way, too? It’s not a usual speech mannerism, and the way he does it is distinct. I wouldn’t know how to efficiently represent it with text. Maybe it’s easier than I’m imagining.
The TTS model is trained on two things: speech samples and their transcripts. If you add a sniffle symbol every time a sniffle appears in the speech, I'm confident the model would pick up on it, and you could then reproduce a sniffle at generation time. The more time-consuming part would be adding those sniffle symbols to the language model's training data, so that they'd show up organically in the text-generation phase.
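Concretely, the annotation step could look something like the sketch below. Everything in it is hypothetical: the token name is made up, and it assumes you have word-level start times (e.g. from forced alignment) plus a list of labeled sniffle times.

    # Hypothetical sketch: splice a dedicated token into a transcript wherever
    # a sniffle was labeled.
    SNIFFLE = "[sniffle]"

    def annotate(words, word_times, sniffle_times):
        out, i = [], 0
        sniffles = sorted(sniffle_times)
        for word, t in zip(words, word_times):
            # Emit the token before the first word that starts after it.
            while i < len(sniffles) and sniffles[i] <= t:
                out.append(SNIFFLE)
                i += 1
            out.append(word)
        out.extend(SNIFFLE for _ in sniffles[i:])  # sniffles past the last word
        return " ".join(out)

    print(annotate(["the", "jungle", "wins"], [0.0, 0.4, 0.9], [0.6]))
    # -> the jungle [sniffle] wins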
But seriously, it's not worth it. I think he's a brilliant man with an idiosyncratic speech; let's leave it at that.
I agree; I personally don't hear his sniffles when I'm listening to him intently. It's irrelevant. I was mostly curious whether and how, generally speaking, a model could be trained to sniffle. Now that you describe it, though, it seems fairly clear, so thanks!