
> but since it doesn't have a fundamental understanding of the patterns/objects like arms, it just blends many images of arms together and create things like 3 arms.

But it would rarely put out, say, 8 arms. And on repeat artifacts it's miles ahead of earlier stuff like CLIPDraw or Disco Diffusion. So it does seem to have some idea of what's going on; it just isn't perfect yet. It gets much worse away from the 512x512 resolution: if you push both dimensions, it loses scene coherence a lot more.




Actually, I should have mentioned this in the original post, but come to think of it, the "3 arms" thing is kind of a bad example. In general, at least with SD, it's very unlikely to create 3 arms or 8 arms if you, for example, ask for a person. Mostly the output looks like a person because the text prompt maps to training data of persons, so the results will generally look like people with 2 arms.

However, where I find it struggles is with finer details, and also the _placement_ of things like arms and eyes and the relationships between them. I think this is because it only has a general idea of the shape of a person, but no data on the exact specifics, like where the arms, legs, eyes and so on should be placed in an anatomically realistic way. This is where I think the challenge is: the gap between a general pattern of a person and one that is extremely specific yet still general enough that the model can modify and transform it the way a real human artist can. I'm not sure that's in the data exactly.



