*> So what's the motivation here?* Better interpretability, I suppose. Could giv...

mayukhdeb · 2025-01-31T04:24:48 1738297488

The motivation was to induce structure in the weights of neural nets and see if the functional organization that emerges aligns with that of the brain or not. Turns out, it does -- both for vision and language.

The gains in parameter efficiency was a surprise even to us when we first tried it out.

energy123 · 2025-01-31T03:59:19 1738295959

That's true, and interpretability is helpful for AI safety.

mayukhdeb · 2025-01-31T04:26:59 1738297619

Indeed. What's cool is that we were able to localize literal "regions" in the GPTs which encoded toxic concepts related to racism, politics, etc. A similar video can be found here: https://toponets.github.io

More work is being done on this as we speak.

fakeparmesean · 2025-01-31T05:16:08 1738300568

My understanding coming from mechanistic interpretability is that models are typically (or always) in superposition, meaning that most or all neurons are forced to encode semantically unrelated concepts because there are more concepts than neurons in a typical LM. We train SAEs (where we apply L1 reg and a sparsity penalty to “encourage” the encoder output latents to yield sparse representations of the originating raw activations), to hopefully disentangle these features, or make them more monosemantic.This allows us to use the SAE as a sort of microscope to see what’s going on in the LM, and apply techniques like activation patching to localize features of interest, which sounds similar to what you’ve described. I’m curious what this work means for mech interp. Is this a novel alternative to mitigating polysemanticity? Or perhaps neurons are still encoding multiple features, but the features tend to have greater semantical overlap? Fascinating stuff!

mayukhdeb · 2025-01-31T14:58:56 1738335536

> the features tend to have greater semantical overlap?

This is true. The features closer together now have much stronger semantic overlap. You can watch how the weights self-organize in a GPT here: https://toponets.github.io/webpage_assets/banner_video.mp4

We're already studying the effects of topographic structure on polysemanticity.

cwillu · 2025-01-31T04:35:42 1738298142

Was it toxicity though as understood by the model, or just a cluster of concepts that you've chosen to label as toxic?

I.e., is this something that could (and therefore, will) be turned towards identifying toxic concepts as understood by the chinese or us government, or to identify (say) pro-union concepts so they can be down-weighted in a released model, etc?

mayukhdeb · 2025-01-31T04:43:51 1738298631

We localized "toxic" neurons by contrasting the activations of each neuron for toxic v/s normal texts. It's a method inspired by old-school neuroscience.

immibis · 2025-01-31T09:41:49 1738316509

Defining all politics as toxic is concerning, if it's not just a proof of concept. That's something dictatorships do so that people won't speak up.