I've tried that in a small transformer trained from scratch, and it didn't make any measurable difference. I also made a trainable version, most likely by replacing the 1 in the denominator with a learnable per-layer constant, and that didn't make any difference either.
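For concreteness, here's a minimal PyTorch sketch of both variants as drop-in replacements for the attention softmax. The names and the exp parametrization of the learnable offset are illustrative, not necessarily what my actual code did:

```python
import torch

def softmax1(x, dim=-1):
    """Miller's proposed softmax: exp(x_i) / (1 + sum_j exp(x_j))."""
    m = x.max(dim=dim, keepdim=True).values
    e = torch.exp(x - m)  # shift by the max for numerical stability
    # after the shift, the extra "+1" term becomes exp(-m)
    return e / (torch.exp(-m) + e.sum(dim=dim, keepdim=True))

class TrainableSoftmax1(torch.nn.Module):
    """Trainable variant: the 1 becomes a learnable per-layer offset,
    kept positive by parametrizing it as exp(log_offset)."""
    def __init__(self):
        super().__init__()
        # exp(0) = 1, so training starts from Miller's version
        self.log_offset = torch.nn.Parameter(torch.zeros(()))

    def forward(self, x, dim=-1):
        m = x.max(dim=dim, keepdim=True).values
        e = torch.exp(x - m)
        # offset * exp(-m) == exp(log_offset - m)
        return e / (torch.exp(self.log_offset - m) + e.sum(dim=dim, keepdim=True))
```

Either one just replaces `torch.softmax(scores, dim=-1)` on the attention scores, so the attention weights can sum to less than 1 when every score is strongly negative.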
I didn't follow Miller's proposal exactly as he wrote it, though: I put the mechanism in all the layers rather than leaving it out of the last one.
My test doesn't absolutely rule out usefulness (there are always different ways of applying something), but I saw no indication of it.