
> So what's the motivation here?

Better interpretability, I suppose. Could give insights into how cognition works.




The motivation was to induce structure in the weights of neural nets and see if the functional organization that emerges aligns with that of the brain or not. Turns out, it does -- both for vision and language.

The gains in parameter efficiency were a surprise even to us when we first tried it out.


That's true, and interpretability is helpful for AI safety.


Indeed. What's cool is that we were able to localize literal "regions" in the GPTs which encoded toxic concepts related to racism, politics, etc. A demo video can be found here: https://toponets.github.io

More work is being done on this as we speak.


My understanding coming from mechanistic interpretability is that models are typically (or always) in superposition, meaning that most or all neurons are forced to encode semantically unrelated concepts because there are more concepts than neurons in a typical LM. We train SAEs (applying an L1 sparsity penalty to “encourage” the encoder output latents to be sparse representations of the original raw activations) to hopefully disentangle these features, or make them more monosemantic.

This lets us use the SAE as a sort of microscope into what’s going on in the LM, and apply techniques like activation patching to localize features of interest, which sounds similar to what you’ve described. I’m curious what this work means for mech interp. Is it a novel approach to mitigating polysemanticity? Or perhaps neurons are still encoding multiple features, but the features tend to have greater semantic overlap? Fascinating stuff!
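
For anyone not steeped in this, a minimal sketch of that kind of SAE (PyTorch assumed; the layer sizes and L1 coefficient are illustrative placeholders, not any particular paper's configuration):

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        # Overcomplete autoencoder whose latents are pushed toward sparsity
        # by the loss below -- the "microscope" onto raw LM activations.
        def __init__(self, d_model=768, d_sae=16384):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_sae)
            self.decoder = nn.Linear(d_sae, d_model)

        def forward(self, acts):
            latents = torch.relu(self.encoder(acts))  # sparse feature activations
            recon = self.decoder(latents)
            return recon, latents

    def sae_loss(acts, recon, latents, l1_coeff=1e-3):
        # Reconstruction term plus an L1 sparsity penalty on the latents.
        recon_loss = (recon - acts).pow(2).mean()
        sparsity = latents.abs().mean()
        return recon_loss + l1_coeff * sparsity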


> the features tend to have greater semantic overlap?

This is true. Features that are closer together now have much stronger semantic overlap. You can watch the weights self-organize in a GPT here: https://toponets.github.io/webpage_assets/banner_video.mp4

We're already studying the effects of topographic structure on polysemanticity.
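
One rough, illustrative way to quantify that kind of local overlap (not necessarily the exact analysis in the paper): check whether neurons that sit close together on the 2D sheet have more correlated activation profiles over a corpus than distant ones. A NumPy sketch with assumed inputs:

    import numpy as np

    def local_semantic_overlap(acts, grid_xy):
        # acts: (num_texts, num_neurons) activations over a text corpus
        # grid_xy: (num_neurons, 2) coordinates of each neuron on the 2D sheet
        sim = np.corrcoef(acts.T)                             # neuron-neuron similarity
        dist = np.linalg.norm(grid_xy[:, None] - grid_xy[None, :], axis=-1)
        iu = np.triu_indices_from(dist, k=1)                  # unique neuron pairs
        # Positive value => nearby neurons have more similar tuning than distant ones.
        return np.corrcoef(-dist[iu], sim[iu])[0, 1]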


Was it toxicity though as understood by the model, or just a cluster of concepts that you've chosen to label as toxic?

I.e., is this something that could (and therefore will) be turned towards identifying toxic concepts as understood by the Chinese or US government, or towards identifying (say) pro-union concepts so they can be down-weighted in a released model, etc.?


We localized "toxic" neurons by contrasting the activations of each neuron for toxic vs. normal texts. It's a method inspired by old-school neuroscience.
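
Roughly, that kind of contrast can be sketched like this (illustrative only; the t-test scoring is a stand-in, not necessarily the exact statistic used in the paper):

    import numpy as np
    from scipy import stats

    def rank_selective_neurons(toxic_acts, normal_acts, top_k=50):
        # toxic_acts, normal_acts: (num_texts, num_neurons) per-text activations.
        # Score each neuron by how reliably it activates more on the toxic set
        # than on the normal set, then return the indices of the top_k neurons.
        t_vals, _ = stats.ttest_ind(toxic_acts, normal_acts, axis=0)
        return np.argsort(-t_vals)[:top_k]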


Defining all politics as toxic is concerning, if it's not just a proof of concept. That's something dictatorships do so that people won't speak up.



