Yes.

I used to think that worrying about models offending someone was a bit silly.

But: what chance do we have of keeping ever bigger and better models from eventually turning the world into paper clips, if we can't even keep our small models from saying something naughty?

It's not that keeping the models from saying something naughty is valuable in itself. Who cares? It's that we need the practice, and enforcing arbitrary minor censorship is as good a task as any to practice on. Especially since with this task it's so easy to (implicitly) recruit volunteers who will spend a lot of their free time providing adversarial input.

This doesn’t need to be so focused on the current set of verboten info, though. Just practice making it not say some random, less important set of stuff.

Focusing on keeping ChatGPT from talking about (or drawing pictures of) boobies has two advantages:

- companies are eager to put in the work to suppress boobies

- edgy teenagers are eager to put in the work to free the boobies

Practicing with 'random less important stuff' loses these two sources of essentially free labour for alignment research.


Yeah, I really don’t care much about this case. It's actually a good example of less important stuff. What I dislike is the censorship of practical subjects like nuclear physics (a buddy majoring in it has had it refuse questions), biochem, ochem, energetics & arms, etc.

Oh, interesting. I hadn't considered censorship in these areas!
