The main reason topography emerges in physical brains is because spatially dista...

AYBABTME · 2025-01-31T05:53:46 1738302826

Locality of data and computation is very important in neural nets. It's the number one reason why training and inference are as slow as they are. It's why GPUs need super expensive HBM memory, why NVLink is a thing, why Infiniband is a thing.

If the problem of training and inference on neural networks can be optimized so that a topology can be used to keep closely related data together, we will see huge advancements in training and inference speed, and probably in model size as a result.

And speed isn't just speed. Speed makes impossible (not enough time in our lifetime) things possible.

A huge factor in Deepseek being able to train on H800 (half HBM bandwith as H100) is that they used GPU cores to compress/decompress the data moved around between the GPU memory and the compute units. This reduces latency in accessing data and made up for the slower memory bandwith (which translates in higher latency when fetching data). Anything that reduces the latency of memory accesses is a huge accelerator for neural nets. The number one way to achieve this is to keep related data next to each other, so that it fits in the closest caches possible.

mirekrusin · 2025-01-31T06:24:40 1738304680

It's true, but isn't OP also correct? Ie. it's about speed, which implies locality, which implies approaches like MoE which does exactly that and it's unlike physical brain topology?

Having said that it would be fun to see things like rearrangement data moves based on temerature of silicon parts after training cycle.

nickpsecurity · 2025-01-31T15:20:07 1738336807

Well, locality and the global nature of pre-training methods. The brain mostly uses local learning (Hebbian learning) which requires less, data movement. AI firms putting as much money into making that scale as they did on backpropagation might drop costs a lot.

vlovich123 · 2025-01-31T03:48:51 1738295331

Unless GPUs work markedly differently somehow or there’s been some fundamental shift in computer architecture I’m not aware of, spatial locality is still a factor in computers.

Aside from HW acceleration today, designs like Cebras would benefit heavily by reducing the amount of random access from accessing the weights (and thus freeing up cross-chip memory bandwidth for other things).

whynotminot · 2025-01-31T03:55:03 1738295703

This makes me remember game developers back when games could still be played directly from the physical disc. They would often duplicate data to different parts of the disc, knowing that certain data would often be streamed from disc together, so that seek times were minimized.

But those game devs knew where everything was spatially on the disc, and how the data would generally be used during gameplay. It was consistent.

Do engineers have a lot of insight into how models get loaded spatially onto a given GPU at run time? Is this constant? Is it variable on a per GPU basis? I would think it would have to be.

Hard to optimize for this.

jaek · 2025-01-31T04:26:46 1738297606

This brings to mind The Story of Mel from programming folklore.

http://beza1e1.tuxen.de/lore/story_of_mel.html

abrookewood · 2025-01-31T05:28:31 1738301311

Such a good read - some people really are on another level in their chosen field.

vlovich123 · 2025-01-31T15:55:58 1738338958

Right now models have no structure so that access is random but you definitely know where the data is located in memory since you put it there. It doesn’t matter about the physical location - it’s all through a TLB but if you ask the GPU for a contiguos memory allocation it gives it to you. This is probable the absolute easiest thing to optimize for if your data access pattern is amenable to it.

harles · 2025-01-31T04:21:16 1738297276

That could explain compute efficiency, but has nothing to do with the parameter efficiency pointed at in the paper.

vlovich123 · 2025-01-31T15:57:22 1738339042

Haven’t read the paper but my guess around that is that the same reason sparse attention networks (where they 0 out many weights) just have the sparse tensors be larger.

mayukhdeb · 2025-02-01T13:57:57 1738418277

In this paper, we don't zero out the weights. We remove them.

vlovich123 · 2025-02-02T20:40:08 1738528808

Thanks for the correction! Can it be retrofitted into existing models through distillation or do you have to train the model from scratch?

cma · 2025-01-31T03:56:11 1738295771

> The main reason topography emerges in physical brains is because spatially distant connections are physically difficult and expensive in biological systems.

The brain itself seems to have bottlenecks that aren't distance related, like hemispheres and the corpus callosum that are preserved over all placental mammals and other mammalian groups have something similar and still hemispheres. Maybe it's just an artifact of bilateral symmetry that is stuck in there from path dependence, or forcing a redundancy to make damage more recoverable, but maybe it has a big regularizing or alternatively specializing effect (regularization like dropout tends to force more distributed representations which seems kind of opposite to this work and other work like "Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability," https://arxiv.org/abs/2305.08746 ).

jlpom · 2025-01-31T19:53:44 1738353224

It increases modularity and small-worldness, which are in my book critical for AGI (surprised by the way that this publication doesn't cite https://www.nature.com/articles/s42256-023-00748-9).

mayukhdeb · 2025-01-31T20:19:47 1738354787

Thank you for sharing this! We'll read through this and update the camera-ready version accordingly for ICLR 2025.

exe34 · 2025-01-31T17:16:50 1738343810

> CNN will beat ViT on small data tasks, but that flips with enough scale because ViT imposes less inductive bias

any idea why this is the case? CNN have the bias that neighbouring pixels are somehow relevant - they are neighbours. ViTs have to re-learn this from scratch. So why do they end up doing better than CNN?

TZubiri · 2025-01-31T04:21:01 1738297261

Maybe this would be relevant for datacenters with significant distance between machines, or multidatacenter systems.

xpl · 2025-01-31T03:44:46 1738295086

> So what's the motivation here?

Better interpretability, I suppose. Could give insights into how cognition works.

mayukhdeb · 2025-01-31T04:24:48 1738297488

The motivation was to induce structure in the weights of neural nets and see if the functional organization that emerges aligns with that of the brain or not. Turns out, it does -- both for vision and language.

The gains in parameter efficiency was a surprise even to us when we first tried it out.

energy123 · 2025-01-31T03:59:19 1738295959

That's true, and interpretability is helpful for AI safety.

mayukhdeb · 2025-01-31T04:26:59 1738297619

Indeed. What's cool is that we were able to localize literal "regions" in the GPTs which encoded toxic concepts related to racism, politics, etc. A similar video can be found here: https://toponets.github.io

More work is being done on this as we speak.

fakeparmesean · 2025-01-31T05:16:08 1738300568

My understanding coming from mechanistic interpretability is that models are typically (or always) in superposition, meaning that most or all neurons are forced to encode semantically unrelated concepts because there are more concepts than neurons in a typical LM. We train SAEs (where we apply L1 reg and a sparsity penalty to “encourage” the encoder output latents to yield sparse representations of the originating raw activations), to hopefully disentangle these features, or make them more monosemantic.This allows us to use the SAE as a sort of microscope to see what’s going on in the LM, and apply techniques like activation patching to localize features of interest, which sounds similar to what you’ve described. I’m curious what this work means for mech interp. Is this a novel alternative to mitigating polysemanticity? Or perhaps neurons are still encoding multiple features, but the features tend to have greater semantical overlap? Fascinating stuff!

mayukhdeb · 2025-01-31T14:58:56 1738335536

> the features tend to have greater semantical overlap?

This is true. The features closer together now have much stronger semantic overlap. You can watch how the weights self-organize in a GPT here: https://toponets.github.io/webpage_assets/banner_video.mp4

We're already studying the effects of topographic structure on polysemanticity.

cwillu · 2025-01-31T04:35:42 1738298142

Was it toxicity though as understood by the model, or just a cluster of concepts that you've chosen to label as toxic?

I.e., is this something that could (and therefore, will) be turned towards identifying toxic concepts as understood by the chinese or us government, or to identify (say) pro-union concepts so they can be down-weighted in a released model, etc?

mayukhdeb · 2025-01-31T04:43:51 1738298631

We localized "toxic" neurons by contrasting the activations of each neuron for toxic v/s normal texts. It's a method inspired by old-school neuroscience.

immibis · 2025-01-31T09:41:49 1738316509

Defining all politics as toxic is concerning, if it's not just a proof of concept. That's something dictatorships do so that people won't speak up.

jv22222 · 2025-01-31T04:42:59 1738298579

I had this idea the other day. Not sure if it relates but maybe?

https://twitter.com/justinvincent/status/1884357300703400274

mercer · 2025-01-31T04:03:36 1738296216

I imagine it could be easier to make sense of the 'biological' patterns that way? like, having bottlenecks or spatially-related challenges might have to be simulated too, to make sense of the ingested 'biological' information.

ziofill · 2025-01-31T04:23:44 1738297424

Perhaps they are more easily compressible? Once a bunch of nearby weights have similar roles one may not need all of them.

mayukhdeb · 2025-01-31T04:40:20 1738298420

Yep. That is exactly the idea here. Our compression method is super duper naive. We literally keep every n-th weight column and discard the rest. Turns out that even after getting rid of 80% of the weight columns in this way, we were able to retain the same performance in a 125M GPT.

w-m · 2025-01-31T10:48:25 1738320505

If you have things organized neatly together, you can also use pre-existing compression algorithms, like JPEG, to compress your data. That's what we're doing in Self-Organizing Gaussians [0]. There we take an unorganised (noisy) set of primitives that have 59 attributes and sort them into 59 2D grids which are locally smooth. Then we use off-the-shelf image formats to store the attributes. It's an incredibly effective compression scheme, and quite simple.

[0]: https://fraunhoferhhi.github.io/Self-Organizing-Gaussians/