
I've tried that in a small transformer that I trained from scratch and it didn't really make any difference. I also made a trainable version, replacing the 1 with a learnable per-layer constant, and that didn't make any difference either.
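
Roughly what the two variants looked like, as a PyTorch-style sketch (the stability shift and the names are mine, not from Miller's post):

    import torch
    import torch.nn as nn

    def softmax_one(x, dim=-1):
        # Miller's "quiet" softmax: exp(x_i) / (1 + sum_j exp(x_j)).
        # Shift by the (non-negative) max for numerical stability; the +1
        # in the denominator becomes exp(-m) after the shift.
        m = torch.clamp(x.max(dim=dim, keepdim=True).values, min=0)
        e = torch.exp(x - m)
        return e / (torch.exp(-m) + e.sum(dim=dim, keepdim=True))

    class SoftmaxPlusC(nn.Module):
        # Variant with the 1 replaced by a learnable per-layer constant.
        def __init__(self):
            super().__init__()
            self.log_c = nn.Parameter(torch.zeros(1))  # c = exp(log_c), starts at 1

        def forward(self, x, dim=-1):
            c = torch.exp(self.log_c)  # keeps the offset positive
            m = torch.clamp(x.max(dim=dim, keepdim=True).values, min=0)
            e = torch.exp(x - m)
            return e / (c * torch.exp(-m) + e.sum(dim=dim, keepdim=True))

Both drop in where the usual softmax over the attention scores goes.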

I didn't follow Miller's proposal quite as he wrote it, though: I put the mechanism in all the layers rather than leaving it out at the end.

My test doesn't absolutely rule out usefulness; there are always different ways of applying something, but I saw no indication of a benefit.




I guess the next step is to see if you're getting those mega activations as he describes.
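
Something like this is probably enough to check (a rough PyTorch sketch; the hook bookkeeping is just one way to do it, and `model` is whatever checkpoint you're probing):

    import torch

    def track_max_activations(model):
        # Record the largest |activation| seen at each leaf module's output.
        stats, handles = {}, []

        def make_hook(name):
            def hook(module, inputs, output):
                if isinstance(output, torch.Tensor):
                    stats[name] = max(stats.get(name, 0.0), output.abs().max().item())
            return hook

        for name, module in model.named_modules():
            if len(list(module.children())) == 0:  # leaf modules only
                handles.append(module.register_forward_hook(make_hook(name)))
        return stats, handles

    # Run a few batches through the model, then look for entries in `stats`
    # that sit orders of magnitude above the rest (the "mega activations").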

A/B test the two models and compare?

Would be interesting to see if these activations only show up in larger models, or whether there's some relation to model size.


https://news.ycombinator.com/item?id=36871528

Hah. Yes. It looks like they only show up in models with 6.7B parameters or more.

The problem can start at 125M. Small enough to test on a whim.
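
For scale, a GPT-2-small-shaped model lands right around there (a sketch of the dimensions only; the config class itself is just illustrative):

    from dataclasses import dataclass

    @dataclass
    class Small125MConfig:
        # GPT-2-small-ish shape, roughly 124M parameters
        n_layer: int = 12
        n_head: int = 12
        d_model: int = 768
        vocab_size: int = 50257
        context_length: int = 1024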

So train a model that exhibits these behaviours, then try it out.



