Well, yeah, it's a bigger set of models (particularly the language model) that takes more resources (both to train and for inference). That's the tradeoff.
> Here’s my question: are there any image models where, if you prompt “1+1”, you get an image showing “3”?
You want a t2i model that does arithmetic in the prompt, translates it to "text displaying the number <result>", but also does the arithmetic wrong?
Yeah, I don’t think that combination of features is in any existing model or, really, in any of the datasets used for evaluation, or otherwise on anyone’s roadmap.
"Actually thinking about your prompt" is a necessary part of being able to make the prompts natural language instead of a long list of fantasy google image search terms.
A useful example is "my bedroom but in a new color". But some things I've typed into Midjourney that don't work include "a really long guinea pig" (you get a regular-size one) and "world's best coffee" (the coffee cup gets a world on it). It's just too literal.
I don't think they're saying that's a goal; I think they're curious whether it's the case. LLMs are bad at arithmetic, and this uses an LLM to process the prompt, so that class of result seems plausible.
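For what it's worth, here's a rough sketch of the kind of two-stage setup being described (all names are hypothetical, not any particular model's API): an LLM rewrites the user prompt before a separate text-to-image model renders it, so if the LLM "helpfully" evaluates "1+1" and slips, the wrong number is baked into the caption the image model sees.

    def rewrite_prompt(llm, user_prompt: str) -> str:
        # Hypothetical LLM call that expands a terse prompt into a detailed caption.
        return llm.complete(
            f"Rewrite this as a detailed image caption: {user_prompt!r}"
        )

    def generate(llm, t2i_model, user_prompt: str):
        # e.g. "1+1" might become 'a chalkboard showing "2"' (or "3" if the LLM slips)
        caption = rewrite_prompt(llm, user_prompt)
        # The image model just renders whatever text it's given.
        return t2i_model.sample(caption)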