Suno's RVQ-token-based language model is tuned to give you an acceptable song that most of their userbase would prefer every single time, but it isn't very diverse. Our diffusion model is much more diverse and has higher vocal audio quality, but the results aren't always consistent (much like Flux and other diffusion models). However, since we offer unlimited generations, this can be worked around. We're also never going to preference-tune our model, because I think the stuff that is lost in that process is valuable.
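The workaround amounts to best-of-N sampling: draw several candidates from the inconsistent generator and keep the one a quality metric likes most. A minimal sketch, where `generate` and `score` are hypothetical stand-ins (not any real Suno or diffusion API):

```python
import random

def pick_best(generate, score, n=8):
    """Best-of-N selection: draw n candidates from an
    inconsistent generator and return the highest-scoring one."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: a noisy "generator" whose output quality varies
# run to run, and a made-up scalar quality score.
random.seed(0)
generate = lambda: random.gauss(0.0, 1.0)
score = abs  # hypothetical quality metric
best = pick_best(generate, score, n=16)
```

With unlimited generations, raising `n` trades compute for consistency while keeping the underlying model's diversity intact.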