
There's a great blog post from Sander Dieleman about exactly this - why do we need a two-step pipeline, in particular for images and audio? https://sander.ai/2025/04/15/latents.html

For text, there are a few papers that train the tokenization and language model end-to-end; see: https://arxiv.org/abs/2305.07185


I train for 1M steps (batch size 64, block size 2048), which is enough for the model to more-or-less converge.
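For scale, that works out to roughly:

    1,000,000 steps x 64 sequences x 2,048 tokens ≈ 1.3e11 ≈ 130B tokens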

It's also a tiny model by LLM standards, with 150M parameters. The goal wasn't really to reach state of the art but to show how the performance of a single language model architecture can be vastly different when you just change the tokenizer.


To get close to state of the art, how many parameters would be needed with your approach?


Author here, thanks for the kind words! I think such a physics-based codec is unlikely to happen: in general, machine learning is always moving from handcrafted domain-specific assumptions to leaving as much as possible to the model. The more assumptions you bake in, the smaller the space of sounds you can model, so the quality is capped. Basically, modern ML is just about putting the right data into transformers.

That being said, having a more constrained model can also lead to some really cool stuff. The DDSP paper learns how to control a synthesizer to mimic instruments: https://arxiv.org/abs/2001.04643

You could probably do something similar for a speech model. The result would not sound as good, but you could get away with far fewer parameters, because much of the modelling work is done by the assumptions you put in.

Compare also KokoroTTS, a TTS model that's so tiny because it uses a handcrafted system to turn text into phonemes and then just synthesizes from those phonemes: https://huggingface.co/spaces/hexgrad/Kokoro-TTS


thanks, will fix tomorrow


I use Descript to edit videos/podcasts and it works great for this kind of thing! It transcribes your audio and then you can edit it as if you were editing text.


Yeah, that stuff is just freaking amazing. I don't know what the transcription quality is like, but if I were doing this as a job and it was good at transcription, I'd definitely be using it all the time.


I don't know about linear models, but this kind of hierarchical modelling is quite a common idea in speech research. For example, OpenAI's Jukebox (2020) [1], which uses a proto-neural audio codec, has three levels of encoding that get coarser and coarser. They use a language model to predict continuations in the coarsest level and then have models to upscale to the finer levels and finally back to audio.

The recent MiMo-Audio groups tokens into "patches" of four timesteps and has the model predict those. [2]

[1] https://arxiv.org/abs/2005.00341

[2] https://github.com/XiaomiMiMo/MiMo-Audio/blob/main/MiMo-Audi...
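A minimal sketch of the patching idea (tensor shapes and names are made up for illustration, not taken from MiMo-Audio's code):

    import torch

    codes = torch.randint(0, 1024, (1, 4096))  # (batch, timesteps) of audio codec tokens
    patch_size = 4
    # group every 4 consecutive timesteps into one patch for the LM to predict jointly
    patches = codes.view(1, 4096 // patch_size, patch_size)  # (batch, num_patches, 4)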


Author here. Speech-to-text is more or less solved: it's easy to automatically get captions, including precise timestamps. For training Moshi, Kyutai's audio LLM, my colleagues used whisper-timestamped to transcribe 7 million hours of audio.

See Section 4.2 in the Moshi paper: https://arxiv.org/pdf/2410.00037
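If you want to try it yourself, whisper-timestamped usage looks roughly like this (file name and model size are placeholders; check the project's README for the exact API):

    import whisper_timestamped as whisper

    audio = whisper.load_audio("recording.wav")          # placeholder file name
    model = whisper.load_model("medium", device="cuda")  # pick a size that fits your GPU
    result = whisper.transcribe(model, audio)
    # result["segments"] contains the text along with word-level timestamps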


Sweet!


Author here. There are a few reasons, but the biggest one is simply the compression ratio.

The OG neural audio codec SoundStream (whose first author is Neil, now at Kyutai) can sound decent at 3 kbps, whereas MP3 typically runs at around 128 kbps, as you say. Interestingly, it was originally developed for audio compression in Google Meet, not for LLMs. Today's neural codecs have even better compression.
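Where does a number like 3 kbps come from? A back-of-the-envelope example (the exact frame rate and codebook sizes vary by codec): a codec emitting 75 frames per second, each frame quantized with 4 RVQ codebooks of 1024 entries, needs

    75 frames/s x 4 codebooks x log2(1024) bits = 3,000 bits/s = 3 kbps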

The more modern MP3 alternative is Opus, which can work OK at 12 kbps, but it's still less efficient than neural audio codecs. However, these traditional codecs are a lot less CPU-hungry, so they have that going for them.


That makes sense.

Why RVQ though, rather than using the raw VAE embedding?

If I compare rvq-without-quantization-v4.png with rvq-2-level-v4.png, the quality seems oddly similar, but the former takes a 32-sized vector, while the latter takes two 32-sized (one-hot) vectors (2 = number of levels, 32 = number of quantization cluster centers). Isn't that more?


I had a part about this but I took it out: for compression, you could keep the embeddings unquantized and it would still compress quite well, depending on the embedding dimension and the number of quantization levels.
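To make the sizes concrete (taking the numbers from your example and assuming float32 for the unquantized case):

    quantized:   2 levels x log2(32) bits     = 10 bits per frame
    unquantized: 32 dims  x 32 bits (float32) = 1,024 bits per frame

So the two one-hot vectors are conceptually larger but take far fewer bits to store.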

But categorical distributions are better for modelling. It's a little difficult to explain here without using diagrams. The intuition is that if you try to have a model predict the next embedding and not the next token, you can't model multimodal distributions - you'll end up predicting the mean of the possible continuations and not the mode, which is not what you want.

Check out Section 5.3 and Figure 6 from PixelRNN, where they discuss this phenomenon: https://arxiv.org/pdf/1601.06759
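A toy numpy sketch of the phenomenon (not from the post): if the next embedding is equally likely to be -1 or +1, the MSE-optimal constant prediction is 0, which is never a valid continuation.

    import numpy as np

    targets = np.random.choice([-1.0, 1.0], size=10_000)  # bimodal "next embedding"
    candidates = np.linspace(-1.5, 1.5, 301)
    mse = [np.mean((targets - c) ** 2) for c in candidates]
    print(candidates[int(np.argmin(mse))])  # ~0.0: the mean, far from both modes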

At the bottom of the blog, I link two articles that do make continuous embeddings work. One of them is the Kyutai paper Continuous Audio Language Models: https://arxiv.org/abs/2509.06926


Hmm, I think a mixture of Beta distributions could work just as well as categorical here. I'm going to train a PixelRNN with it, but it's going to take hours or days to train (it's a very inefficient and unparallelizable architecture). I'll report back tomorrow.


Update 2:

After another 24 hours of training and around 100 epochs, we get down to 4.4 bits/dim and colors are starting to emerge[1]. However, an issue a friend brought up is that the Beta log-likelihood weights values near 0 and 1 much more heavily:

     log(Beta likelihood) = (alpha - 1) * log(x) + (beta - 1) * log(1 - x) - log(B(alpha, beta))
                                          ^
                 --> +oo as x --> 0 when alpha < 1 (and similarly at x = 1 when beta < 1)
This means we should see most outputs be pure colors: black, white, red, blue, green, cyan, magenta, or yellow. 3.6% of the channels are 0 or 255, up from 1.4% after 50 epochs[2]. Apparently, an earth-mover loss might be better:

    E_{x ~ output distribution}[|correct - x|]
I could retrain this for another day or two, but PixelRNN is really slow, and I want to use my GPU for other things. Instead, I trained a 50x faster PixelCNN for 50 epochs with this new loss and... it just went to the average pixel value (0.5). There's probably a way to train a mixture of betas, but I haven't figured it out yet.

[1]: https://imgur.com/kGbERDg [2]: https://imgur.com/iJYwHr0


Update 1: After ~12 hours of training and 45 epochs on CIFAR, I'm starting to see textures.

https://imgur.com/MzKUKhH


Update 3:

Okay, so my PixelCNN masking was wrong... which is why it went to the mean. The earth-mover loss did get better results than negative log-likelihood, but I found a better solution!

The issue with negative log-likelihood was that the neural network could optimize solely around zero and one because there are poles there. The key insight is that the color value in the image is not exactly zero or one. If we are given #00, all we really know is that the real-world image had a brightness between #00 and #01, so we should be integrating the probability density function from 0 to 1/256 to get the likelihood.
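In other words, the discretized likelihood for a channel value k is a difference of Beta CDFs. A non-differentiable sketch using scipy (the helper function is made up for illustration):

    from scipy.special import betainc  # regularized incomplete beta function = Beta CDF

    def discretized_beta_likelihood(alpha, beta, k, bins=256):
        # probability mass that Beta(alpha, beta) assigns to the bin [k/bins, (k+1)/bins)
        lo, hi = k / bins, (k + 1) / bins
        return betainc(alpha, beta, hi) - betainc(alpha, beta, lo)

For training, this needs to be differentiable in alpha and beta, which is where the custom implementations below come in.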

It turns out PyTorch does not have a good implementation of Beta.cdf(), so I had to roll my own. Realistically, I just asked the chatbots to tell me what good algorithms there were and to write me code. I ended up with two:

(1) There's a known continued-fraction form for the CDF, so combined with Lentz's algorithm it can be computed.

(2) Apparently there's a pretty good closed-form approximation as well (Temme [1]).

The first one was a little unstable in training, but worked well enough (output: [2], color hist: [3]). The second was a little more stable in training, but had issues with NaNs near zero and one, so I had to clamp things there, which makes it a little less accurate (output: [4], color hist: [5]).

The bits/dim gets down to ~3.5 for both of these, which isn't terrible, but there's probably something that can be done better to get it below 3.0. I don't have any clean code to upload, but I'll probably do that tomorrow and edit (or reply to) this comment. But that's it for the experiments!

Anyway, I ran this experiment because this sentence was really bothering me:

> But categorical distributions are better for modelling.

And when I investigated why you said that, it turns out the PixelRNN authors used a mixture of Gaussians, and even said they're probably losing some bits because Gaussians go out of bounds and need to be clipped! So I really wanted to say, "seems like a skill issue, just use Beta distributions," but then I had to go check whether that really did work. My hypothesis was that Betas should work even better than a categorical distribution, because the categorical model has to learn that nearby outputs are indeed nearby, while this is baked into the Beta model. We see the issue show up in the PixelRNN paper, where their outputs are very noisy compared to mine (histogram for a random pixel: [6]).

[1]: https://ir.cwi.nl/pub/2294/2294D.pdf [2]: https://imgur.com/e8xbcfu [3]: https://imgur.com/z0wnqu3 [4]: https://imgur.com/Z2Tcoue [5]: https://imgur.com/p7sW4r9 [6]: https://imgur.com/P4ZV9n4


Author here. I think it's more of a capability issue than a safety issue. Since learning audio is still harder than learning text, audio models don't generalize as well. To fix that, audio models rely on combining information from text and audio (having a single model that consumes/produces both text and audio tokens), and the audio tokens basically end up being an integrated speech-to-text/text-to-speech. This reflects my colleagues' experience working on Moshi, and it seems to be the case for other models too; see the Conclusion section.

Part of the reason can also be synthetic data: if you fine-tune on data generated from text via a text-to-speech system, the tone of the voice doesn't carry any information, so the model learns to ignore it.


Audio models for speech not understanding pitch seems similar to how text LLMs often don't understand spelling: it's not what they were trying to recognize.


There was an example on the OpenAI blog of ChatGPT copying the speaker's voice and responding in it mid-conversation. This was presented as an example of non-alignment.


> generated from text via a text-to-speech

Yes, frustratingly we don't have good speech-to-text (STT/ASR) to transcribe such differences.

I recently fine-tuned a TTS* to be able to emit laughter, and hunting for transcriptions that include non-verbal sounds was the hardest part of it. Whisper and other popular transcription systems will ignore sighs, sniffs, laughs, etc., and can't detect mispronunciations.

* = https://github.com/coezbek/PlayDiffusion


IIRC, the 15.ai dev was training on fan-made "My Little Pony" transcriptions, specifically because they included more emotive clues in the transcription and supported a syntax to control the emotive aspect of the speech.


Where can I read about this?


> During this phase, 15 discovered the Pony Preservation Project, a collaborative project started by /mlp/, the My Little Pony board on 4chan.[47] Contributors of the project had manually trimmed, denoised, transcribed, and emotion-tagged thousands of voice lines from My Little Pony: Friendship Is Magic and had compiled them into a dataset that provided ideal training material for 15.ai.[48]

From https://en.wikipedia.org/wiki/15.ai#2016%E2%80%932020:_Conce...


I had no idea this existed, the internet is amazing


I used AI Studio and it understood pitch and even emotion with an uploaded MP3.


Accent detection or consciously ignoring it is a filter step.


Yes! The fact that isochrones become circles is one of my favorite things about this. I also discuss this in the video (https://youtu.be/rC2VQ-oyDG0?t=134) I made about these.

