> The tricky part is coming up with a reasonable set of "add noise" transformations.
Yes, as well as dealing with a variable-length window.
When generating images with diffusion, one specifies the image size ahead of time. When generating text with diffusion, it's a bit more open-ended. How long do we want this paragraph to go? Well, that depends on what goes into it -- so how do we adjust for that? Do we use a hierarchical tree-structure approach? Chunk it and do a chain of overlapping segments that are all of fixed length (possibly combined with a transformer model)?
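To make the chunking idea concrete, here is a minimal sketch of splitting a variable-length token sequence into fixed-length, overlapping windows that a fixed-window diffusion model could denoise one at a time, each conditioned on its overlap with the previous chunk. The function name, window/overlap parameters, and `<pad>` token are all illustrative assumptions, not an existing API:

```python
def overlapping_chunks(tokens, window=8, overlap=2):
    """Yield fixed-length windows that share `overlap` tokens with their
    predecessor; the final window is padded up to full length.

    Hypothetical helper for illustration -- a real system would denoise
    each chunk with the overlap region held fixed as conditioning.
    """
    if window <= overlap:
        raise ValueError("window must exceed overlap")
    stride = window - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), stride):
        chunk = tokens[start:start + window]
        chunk = chunk + ["<pad>"] * (window - len(chunk))  # pad final chunk
        chunks.append(chunk)
    return chunks

seq = [f"t{i}" for i in range(13)]
for c in overlapping_chunks(seq):
    print(c)
```

The overlap is what stitches the chain together: tokens shared between adjacent windows act as the conditioning context, which is roughly where a transformer could slot in to propagate longer-range information across chunks.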
Hard to say what would work in the end, but I think this is the sort of thing that YLC is talking about when he encourages students to look beyond LLMs. [1]
* [1] https://x.com/ylecun/status/1793326904692428907