I've tried that in a small transformer trained from scratch, and it didn't make any measurable difference. I also made a trainable version, most likely by replacing the 1 in the denominator with a learnable per-layer constant, and that didn't make any difference either.
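For concreteness, here's a minimal PyTorch sketch of both variants as drop-in replacements for the attention softmax. The names and the exp parametrization of the learnable offset are illustrative, not necessarily what my actual code did:

```python
import torch

def softmax1(x, dim=-1):
    """Miller's proposed softmax: exp(x_i) / (1 + sum_j exp(x_j))."""
    m = x.max(dim=dim, keepdim=True).values
    e = torch.exp(x - m)  # shift by the max for numerical stability
    # after the shift, the extra "+1" term becomes exp(-m)
    return e / (torch.exp(-m) + e.sum(dim=dim, keepdim=True))

class TrainableSoftmax1(torch.nn.Module):
    """Trainable variant: the 1 becomes a learnable per-layer offset,
    kept positive by parametrizing it as exp(log_offset)."""
    def __init__(self):
        super().__init__()
        # exp(0) = 1, so training starts from Miller's version
        self.log_offset = torch.nn.Parameter(torch.zeros(()))

    def forward(self, x, dim=-1):
        m = x.max(dim=dim, keepdim=True).values
        e = torch.exp(x - m)
        # offset * exp(-m) == exp(log_offset - m)
        return e / (torch.exp(self.log_offset - m) + e.sum(dim=dim, keepdim=True))
```

Either one just replaces `torch.softmax(scores, dim=-1)` on the attention scores, so the attention weights can sum to less than 1 when every score is strongly negative.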
I didn't follow Miller's proposal exactly as he wrote it, though: I put the mechanism in all the layers rather than leaving it out of the last one.
My test doesn't absolutely rule out usefulness (there are always different ways of applying something), but I saw no indication of it.