> Differential attention takes the difference between two softmax attention functions to eliminate attention noise
If I understand correctly, this architecture trades twice as much attention memory in exchange for either a higher-quality model or fewer parameters at a similar quality.
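Roughly, as I read the paper's description, each layer computes two independent softmax attention maps and subtracts one (scaled by a learned constant) from the other before applying V. A toy sketch with made-up shapes and names, not the authors' code:

```python
# Toy sketch of differential attention (my reading, not the authors' code).
# Two independent softmax attention maps are computed and subtracted,
# which is where the "twice as much attention memory" comes from.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(q1, k1, q2, k2, v, lam=0.5):
    # q1, q2, k1, k2: (seq, d); v: (seq, 2d); lam: a learned scalar in the paper
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))   # first attention map,  shape (seq, seq)
    a2 = softmax(q2 @ k2.T / np.sqrt(d))   # second attention map, shape (seq, seq)
    return (a1 - lam * a2) @ v             # subtract, then apply to V

# Toy usage
seq, d = 4, 8
rng = np.random.default_rng(0)
q1, k1, q2, k2 = (rng.standard_normal((seq, d)) for _ in range(4))
v = rng.standard_normal((seq, 2 * d))
print(diff_attention(q1, k1, q2, k2, v).shape)  # (4, 16)
```

The two (seq × seq) maps are where the doubled attention memory comes from; the parameter cost is a separate question, which the paper addresses by halving the head count (see below).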
> According to the fitted curves, 6.8B-size DIFF Transformer achieves a validation loss comparable to 11B-size Transformer, requiring only 62.2% of parameters
This raises a few questions for me:
- Would having only ~60% of the parameters negate the doubled attention memory, leaving a memory profile similar to that of a traditional transformer?
- Does that tradeoff change noticeably between training and inference?
My understanding was that the extra parameters required for the second attention mechanism are included in those 6.8B parameters (i.e. those are the total parameters of the model, not some made-up metric of would-be parameter count in a standard transformer). This makes the result doubly impressive!
Here's the bit from the paper:
> We set the number of heads h = d_model / 2d, where d is equal to the head dimension of Transformer. So we can align the parameter counts and computational complexity.
In other words, they make up for it by having only half as many attention heads per layer.
I think they mitigated the extra memory/compute by using half the number of heads overall and doubling V and O. Without actually checking the math, I think it should be equivalent in FLOPs, not counting the extra (cheap) multiply-by-constant and subtract.
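Here's a quick back-of-the-envelope check of the parameter counts, under my reading that each head carries two d-dim Q/K projections and a 2d-dim V, with h = d_model / 2d heads (the variable names are mine, not from the paper):

```python
# Back-of-the-envelope parameter count for the attention projections of one layer.
# Assumption (mine): h = d_model / (2d) heads, each with two Q and two K projections
# of dim d and one V projection of dim 2d, plus a d_model x d_model output projection.
d_model = 4096
d = 128                                      # head dim of the baseline Transformer

# Standard multi-head attention: d_model/d heads of dim d -> Q, K, V, O projections
std_params = 4 * d_model * d_model

# DIFF-style: half as many heads, Q/K doubled (two maps), V doubled
h_diff = d_model // (2 * d)
qk_params = 2 * (h_diff * 2 * d * d_model)   # two Q and two K of dim d per head
v_params  = h_diff * 2 * d * d_model         # one V of dim 2d per head
o_params  = d_model * d_model                # output projection
diff_params = qk_params + v_params + o_params

print(std_params, diff_params, std_params == diff_params)  # -> equal counts
```

If that reading is right, the per-layer projection parameters come out identical to a standard transformer; the extra cost is the second (seq × seq) attention map, not extra parameters.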
I think it would negate the RAM savings, but it would also reduce the amount of storage needed at rest and possibly reduce initial start-up times, depending on storage speed and model size. So, possibly good for low-end models on consumer devices?