
By simply pasting your comment into 4o, with no other context about the paper, I was able to get a pretty good analysis of the dual-head concept's implications.

https://chatgpt.com/share/67058973-ba94-8008-bed7-c7f9d08dc5...




Uh, this is extracting a LOT from very little data. I don't understand where it's coming from, but its explanation just keeps going into more and more detail ... that doesn't seem to follow from the data it's got.

I just don't see how you could answer these questions without trying it out. And ChatGPT DEFINITELY isn't doing that.

Plus the obvious question I'd pose is not in there. What's the difference in performance between this trick and just "softmax() - 0.5 * 2"? That seems very relevant.
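
For anyone who wants to poke at that question directly, here is a rough sketch of the two things being compared. It rests on assumptions the thread does not confirm: that the quoted expression is meant as (softmax(x) - 0.5) * 2, and that the "trick" amounts to subtracting one softmax map from another. The function names, the lam weight, and the score values are all made up for illustration. It only compares the shape of the outputs and says nothing about model performance, which would need an actual training run.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def rescaled_softmax(xs):
    # Assumed reading of the quoted expression: (softmax(x) - 0.5) * 2.
    # Maps each weight from [0, 1] to [-1, 1].
    return [(p - 0.5) * 2 for p in softmax(xs)]

def softmax_difference(xs_a, xs_b, lam=1.0):
    # Hypothetical stand-in for the dual-head trick: one softmax map
    # minus a weighted second one. lam is an illustrative parameter.
    pa, pb = softmax(xs_a), softmax(xs_b)
    return [a - lam * b for a, b in zip(pa, pb)]

# Made-up attention scores for two heads over four positions.
scores_a = [2.0, 1.0, 0.1, -1.0]
scores_b = [0.5, 1.5, 0.1, -0.5]

rescaled = rescaled_softmax(scores_a)
difference = softmax_difference(scores_a, scores_b)

print("rescaled:  ", [round(v, 3) for v in rescaled], "sum =", round(sum(rescaled), 3))
print("difference:", [round(v, 3) for v in difference], "sum =", round(sum(difference), 3))
```

Both outputs stay within [-1, 1], but they are not the same kind of object. The rescaled weights always sum to 2 - n for n positions and are a fixed affine function of a single softmax, while the difference sums to roughly zero when lam is 1 and depends on two separate score vectors.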



