
What I take from this is that LLMs are somewhat miraculous in generation but terrible at revision. Especially with images, they are very resistant to adjusting initial approaches.

I wonder if there is a consistent way to force structural revisions. I have found Nano Banana particularly terrible at revisions; even with something as simple as "change the image dimensions to...", it will confidently claim success but do nothing.



A thing I've been noticing across the board is that current generative AI systems are horrible at composition. It's most obvious in image generation models, where the composition and blocking tend to be jarringly simple and formulaic (hyper-symmetry, everything in the middleground, or one of maybe three canned "artistic" compositions) no matter how you prompt them, but you see it in text output as well once you notice it.

I suspect this is either a training data issue, or an issue with the people building these things not recognizing the problem, but it's weird how persistent and cross-model the issue is, even in model releases that specifically call out better/more steerable composition behavior.


I almost always get better results from LLMs by going back and editing my prompt and starting again, rather than trying to correct/guide it interactively. Almost as if having mistakes in your context window is an instruction to generate more mistakes! (I'm sure it's not quite that simple)
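The restart-with-an-edited-prompt workflow can be sketched like this (a minimal illustration; `revise_prompt` and the in-comment `complete()` are hypothetical stand-ins, not any real API):

```python
# Instead of appending corrections to a growing transcript (which leaves
# the mistakes in the context window), fold each fix back into a fresh,
# single-turn prompt and start again.

def revise_prompt(base_prompt: str, fixes: list[str]) -> str:
    """Fold accumulated corrections into one clean prompt for a fresh run."""
    if not fixes:
        return base_prompt
    constraints = "\n".join(f"- {f}" for f in fixes)
    return f"{base_prompt}\n\nRequirements:\n{constraints}"

# Usage: rather than complete(history + [correction]), do
#   complete([{"role": "user", "content": revise_prompt(prompt, fixes)}])
print(revise_prompt("Draw a cat as an SVG", ["use a 512x512 viewBox"]))
```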


I see this all the time when asking Claude or ChatGPT to produce a single-page two-column PDF summarizing the conclusions of our chat. Literally 99% of the time I get a multi-page unpredictably-formatted mess, even after gently asking over and over for specific fixes to the formatting mistakes.

And as you say, they cheerfully assert that they've done the job, for real this time, every time.


Ask for the AsciiDoc source and the asciidoctor command to make a PDF instead. Chat bots aren't designed to make PDFs. They are just trying to use tools in the background, probably starting with markdown.
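For example, have the model emit AsciiDoc and do the PDF conversion yourself with the `asciidoctor-pdf` tool (a sketch; the document content is obviously made up):

```python
# Write a minimal AsciiDoc source; converting it to PDF is then a
# deterministic tool step rather than an LLM formatting task.
from pathlib import Path

adoc = """= Chat Summary

== Conclusions

* First conclusion
* Second conclusion
"""

Path("summary.adoc").write_text(adoc)
# Then, with the asciidoctor-pdf gem installed:
#   asciidoctor-pdf summary.adoc   # produces summary.pdf
```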


Tools are still evolving out of the VLM/LLM split [0]. The reason image-to-image tasks are so variable in quality and vastly inferior to text-to-image tasks is because there is an entirely separate model that is trained on transforming an input image into tokens in the LLM's vector space.

The naive approach, which gets you results like ChatGPT's, is to produce output tokens based on the prompt and generate a new image from those tokens. It is really difficult to maintain details from the input image with this approach.

A more advanced approach is to generate a stream of "edits" to the input image instead. You see this with Gemini, which sometimes maintains original image details to a fault; e.g. it will preserve human faces at all cost, probably as a result of training.

I think the round-trip through SVG is an extreme challenge to train through and essentially forces the LLM to progressively edit the SVG source, which can result in something like the Gemini approach above.

[0]: https://www.groundlight.ai/blog/how-vlm-works-tokens
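The contrast between the two approaches above can be illustrated with a toy sketch (a dict standing in for an image; no real model involved):

```python
# Toy contrast between the two strategies: full regeneration discards
# input details, while an edit stream preserves everything it doesn't
# explicitly touch.

def regenerate(prompt: str) -> dict:
    """Naive path: build a brand-new 'image' from the prompt alone."""
    return {"subject": prompt, "background": "generic", "face": "new"}

def apply_edits(image: dict, edits: list[tuple[str, str]]) -> dict:
    """Edit-stream path: patch only the named regions, keep the rest."""
    out = dict(image)
    for region, value in edits:
        out[region] = value
    return out

original = {"subject": "cat", "background": "garden", "face": "original"}
print(regenerate("cat with a hat"))   # garden and face are lost
print(apply_edits(original, [("subject", "cat with a hat")]))  # both kept
```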


Revision should be much easier than generation, e.g. reflection style CoT (draft-critique-revision) is typically the simplest way to get things done with these models. It's always possible to overthink, though.
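A minimal draft-critique-revise loop looks something like this (`llm` is a hypothetical stand-in for any completion function, faked here so the sketch actually runs):

```python
# Reflection-style CoT: draft, ask for concrete flaws, then revise
# with the critique in view. Repeat for a fixed number of rounds.

def reflect(task: str, llm, rounds: int = 2) -> str:
    draft = llm(f"Task: {task}\nProduce a first draft.")
    for _ in range(rounds):
        critique = llm(f"Task: {task}\nDraft:\n{draft}\nList concrete flaws.")
        draft = llm(f"Task: {task}\nDraft:\n{draft}\nFlaws:\n{critique}\nRevise.")
    return draft

# Fake model so the sketch is runnable end to end.
fake = lambda prompt: f"response[{len(prompt)} chars]"
print(reflect("summarize the thread", fake))
```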

Nano Banana is rather terrible at multi-turn chats, just like any other model, despite the claim it's been trained for it. Scattered context and irrelevant distractors are always bad, compressing the conversation into a single turn fixes this.
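Compressing a scattered multi-turn chat into a single turn can be as simple as merging the user messages before re-prompting (an illustrative sketch; the message format mirrors common chat APIs, and the example history is made up):

```python
# Collapse a multi-turn history into one consolidated instruction,
# dropping assistant turns and their accumulated distractors.

def compress_turns(messages: list[dict]) -> str:
    """Merge all user turns into a single instruction string."""
    user_parts = [m["content"] for m in messages if m["role"] == "user"]
    return "\n".join(user_parts)

history = [
    {"role": "user", "content": "Draw a pelican on a bicycle."},
    {"role": "assistant", "content": "(image)"},
    {"role": "user", "content": "Make the bicycle red."},
    {"role": "user", "content": "Put the pelican's feet on the pedals."},
]
print(compress_turns(history))  # one clean single-turn prompt
```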


I’m not quite sure. I think adversarial networks work pretty well at image generation.

I think the problem here is that SVG is structured information while an image is an unstructured blob, and translating between them requires planning and understanding. Treating an SVG like a raster image in the prompt may be the wrong approach; prompting it like code (which SVG basically is) would probably produce better outputs.

This is just my uninformed opinion.


The prompt just said to iterate until they were satisfied. Adding something like "don't be afraid to change your approach or make significant revisions" would probably give different results.


> I wonder if there is a consistent way to force structural revisions.

Ask for multiple solutions?



