In my opinion (again, I'm a user, not a developer), it feels like the data/compute/training requirements are MUCH higher. Think about the RGB values per pixel in an image vs. a raw audio waveform... it just takes more $$$. Let me see if I can find a note where someone calculated the cost to train Jukebox... it was definitely in the tens of millions.
According to the paper, Jukebox was trained for approximately 7 weeks on 512 V100 GPUs. That's roughly $300k per training run assuming $0.5/hr/GPU. Obviously you need several runs to experiment with, so I'm guessing $1-2M to develop something similar. But sure, for a much larger Jukebox-2 it could be 10x more.
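For anyone who wants to check the arithmetic, here's the back-of-envelope calculation (the 7 weeks / 512 GPUs figures are from the Jukebox paper; the $0.5/hr/GPU rate is just my assumption):

```python
# Back-of-envelope cost estimate for a single Jukebox training run.
weeks = 7                # training duration from the paper
gpus = 512               # V100 GPUs from the paper
rate_per_gpu_hour = 0.5  # assumed $/GPU-hour (bulk/reserved pricing)

hours = weeks * 7 * 24           # 1,176 hours
gpu_hours = hours * gpus         # ~602k GPU-hours
cost = gpu_hours * rate_per_gpu_hour

print(f"{gpu_hours:,} GPU-hours -> ${cost:,.0f} per run")
# -> 602,112 GPU-hours -> $301,056 per run
```

At on-demand cloud rates ($2-3/GPU-hour) the same run would be more like $1.2-1.8M, which is probably why estimates vary so widely.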
Still, OpenAI has plenty of funding, so either they don't see much value in developing it further, or they can't improve it. I'm surprised no one else has tried either. For example, with GPT models every big AI lab started competing to outdo each other. The same is happening now with DALL-E. But no one has tried to train a Jukebox successor.