> Differential attention takes the difference between two softmax attention functions to eliminate attention noise
If I understand correctly, this architecture trades twice as much attention memory in exchange for either a higher-quality model or fewer parameters at a similar quality.
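Roughly, as I read the paper's description, each layer computes two independent softmax attention maps and subtracts one (scaled by a learned constant) from the other before applying V. A toy sketch with made-up shapes and names, not the authors' code:

```python
# Toy sketch of differential attention (my reading, not the authors' code).
# Two independent softmax attention maps are computed and subtracted,
# which is where the "twice as much attention memory" comes from.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(q1, k1, q2, k2, v, lam=0.5):
    # q1, q2, k1, k2: (seq, d); v: (seq, 2d); lam: a learned scalar in the paper
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))   # first attention map,  shape (seq, seq)
    a2 = softmax(q2 @ k2.T / np.sqrt(d))   # second attention map, shape (seq, seq)
    return (a1 - lam * a2) @ v             # subtract, then apply to V

# Toy usage
seq, d = 4, 8
rng = np.random.default_rng(0)
q1, k1, q2, k2 = (rng.standard_normal((seq, d)) for _ in range(4))
v = rng.standard_normal((seq, 2 * d))
print(diff_attention(q1, k1, q2, k2, v).shape)  # (4, 16)
```

The two (seq × seq) maps are where the doubled attention memory comes from; the parameter cost is a separate question, which the paper addresses by halving the head count (see below).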
> According to the fitted curves, 6.8B-size DIFF Transformer achieves a validation loss comparable to 11B-size Transformer, requiring only 62.2% of parameters
This raises a few questions for me:
- Would having only ~60% of the parameters negate the doubled attention memory, leaving a memory profile similar to that of a traditional transformer?
- Does that tradeoff change noticeably between training and inference?
My understanding was that the extra parameters required for the second attention mechanism are included in those 6.8B parameters (i.e. those are the total parameters of the model, not some made-up metric of would-be parameter count in a standard transformer). This makes the result doubly impressive!
Here's the bit from the paper:
> We set the number of heads h = d_model / 2d, where d is equal to the head dimension of Transformer. So we can align the parameter counts and computational complexity.
In other words, they make up for it by having only half as many attention heads per layer.
I think they mitigated the extra memory/compute by using half the number of heads overall and doubling V and O. Without actually checking the math, I think it should be equivalent in FLOPs, not counting the extra (cheap) multiply-by-constant and subtract.
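Here's a quick back-of-the-envelope check of the parameter counts, under my reading that each head carries two d-dim Q/K projections and a 2d-dim V, with h = d_model / 2d heads (the variable names are mine, not from the paper):

```python
# Back-of-the-envelope parameter count for the attention projections of one layer.
# Assumption (mine): h = d_model / (2d) heads, each with two Q and two K projections
# of dim d and one V projection of dim 2d, plus a d_model x d_model output projection.
d_model = 4096
d = 128                                      # head dim of the baseline Transformer

# Standard multi-head attention: d_model/d heads of dim d -> Q, K, V, O projections
std_params = 4 * d_model * d_model

# DIFF-style: half as many heads, Q/K doubled (two maps), V doubled
h_diff = d_model // (2 * d)
qk_params = 2 * (h_diff * 2 * d * d_model)   # two Q and two K of dim d per head
v_params  = h_diff * 2 * d * d_model         # one V of dim 2d per head
o_params  = d_model * d_model                # output projection
diff_params = qk_params + v_params + o_params

print(std_params, diff_params, std_params == diff_params)  # -> equal counts
```

If that reading is right, the per-layer projection parameters come out identical to a standard transformer; the extra cost is the second (seq × seq) attention map, not extra parameters.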
I think it would negate the RAM savings, but it would also reduce the amount of storage needed at rest and possibly reduce initial start-up times, depending on storage speed and model size. So, possibly good for low-end models on consumer devices?