> It's unclear to me what "convex hull" means though.
The convex hull (https://en.wikipedia.org/wiki/Convex_hull) of a set is the smallest convex shape that includes that set. Geometrically, it's what you'd get if you "shrink wrapped" the thing you're looking at: edges still protrude, but any indentations get smoothed over.
In this context, the grandparent comment is pointing out that in a traditional transformer block, the value computed for a token is a weighted average of the values of the attended-to tokens, so it can never "stick out" past their convex hull, whereas this differential attention formalism does allow that.
The convex hull of a set of points is the region "between" those points. So the convex hull of three points (that do not lie on the same line) is a triangle with those three points as vertices. If you add a fourth point inside the triangle, the convex hull remains the same, but if you add it outside then the convex hull becomes the four-sided region with those points as vertices (see the sketch below).
In the context of standard transformer attention, each output lies in the convex hull of ("somewhere between") the input values. With the modification in this paper, the input values can be scaled a little so that the outputs of different heads can land in different "regions" and thus not interfere with each other (so yes to your third question: the two softmaxes are performed separately for each head).
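If it helps to see the geometry, here's a tiny numpy/scipy sketch of the triangle example above (my own illustration with arbitrary coordinates, nothing from the paper):

    import numpy as np
    from scipy.spatial import ConvexHull

    triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # three non-collinear points

    inside  = np.vstack([triangle, [[0.2, 0.2]]])   # fourth point inside the triangle
    outside = np.vstack([triangle, [[1.0, 1.0]]])   # fourth point outside the triangle

    print(len(ConvexHull(triangle).vertices))  # 3 -> the hull is the triangle itself
    print(len(ConvexHull(inside).vertices))    # 3 -> an interior point leaves the hull unchanged
    print(len(ConvexHull(outside).vertices))   # 4 -> an exterior point grows the hull to a four-sided region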
O_i = softmax(...) * V_i, and each softmax weight is between 0 and 1, so O_i = alpha * V_i for some alpha between 0 and 1. That makes it convex, and it makes O_i just a shrunken version of V_i. Whereas if you take the diff of two softmaxes, you get O_i = (alpha - beta) * V_i, which can range from -V_i to +V_i, so the output can rescale or flip V_i. And yes, this happens in every head in parallel; the heads then get summed.
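Here's a minimal numpy sketch of that weight arithmetic (my own toy numbers, not code from the paper; as I understand it the paper also scales the second softmax by a learned lambda, which I've left out):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    s1 = rng.normal(size=5)  # made-up attention scores from the first (Q1, K1) pair
    s2 = rng.normal(size=5)  # made-up attention scores from the second (Q2, K2) pair

    w_std  = softmax(s1)                # standard attention: every weight in [0, 1], weights sum to 1
    w_diff = softmax(s1) - softmax(s2)  # differential attention: each weight in (-1, 1), weights sum to 0

    print(w_std,  w_std.sum())   # the output w_std @ V stays inside the convex hull of the value vectors
    print(w_diff, w_diff.sum())  # negative weights let the output rescale or flip individual V_i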
By simply inputting your comment into 4o, with no other context about the paper, I was able to get a pretty good analysis of the dual-head concept's implications.
Uh, this is extracting a LOT from very little data. I don't understand where it's coming from, but its explanation just keeps going into more and more detail... that doesn't seem to follow from the data it's got.
I just don't see how you could answer these questions without trying it out. And ChatGPT DEFINITELY isn't doing that.
Plus the obvious question I'd pose is not in there: what's the difference in performance between this trick and just "(softmax() - 0.5) * 2"? That seems very relevant (see the sketch below).
Also, where is each softmax happening here? For each attention head?
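On the "(softmax() - 0.5) * 2" question: no idea about performance (that would need an actual training run), but here's a quick numpy sketch with made-up numbers of how it differs in what it can express. The rescale is a fixed transform of a single attention distribution, while the paper's version subtracts a second, independently learned distribution:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    rng = np.random.default_rng(1)
    s1 = rng.normal(size=5)  # scores from the first set of query/key projections (made up)
    s2 = rng.normal(size=5)  # scores from a second, independently learned set (made up)

    w_rescaled = (softmax(s1) - 0.5) * 2    # one softmax with a fixed shift and scale
    w_diff     = softmax(s1) - softmax(s2)  # difference of two separately parameterized softmaxes

    print(w_rescaled, w_rescaled.sum())  # sums to 2 - n (here -3); which tokens go negative is set by the one score vector
    print(w_diff,     w_diff.sum())      # sums to 0; the subtracted distribution is itself learned and content-dependent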