As I understand it, the shortened mantissa matters less because these numbers never appear alone. The total number of bits across all the numbers in the set provides enough precision to make fine distinctions between states of the network.
Even for regular calculations, it is unfortunate that the conventional split reserved only five bits of exponent. A single extra bit there would make it much more useful, and the loss to the mantissa would be an easy tradeoff.
It’s definitely conceptually simpler, but both conversions are a single fully-pipelined operation on any CPU made in the past 5 years, and can be folded into the arithmetic operation on custom HW. In practice the cost of conversion isn’t really an issue; the win with bfloat16 is the added dynamic range.
I'm a computational scientist. Do ML problems not involve computations that are sensitive to input precision? And if they don't, is it too naive to ask whether one really needs ML for such a problem rather than plain old fitting and stats?
These are not used for the data but for the computation of the internal coefficients. Said coefficients would be stored as F32s, and the job of updating them involves computing a lot of multiplications, none of which need to be that precise.
Conversion from Bfloat16 to f32 is just extending the mantissa with zeroes.
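A minimal sketch of both directions in Python, working at the bit level with `struct` (the function names are mine; real hardware does this with a plain 16-bit shift, and this version truncates rather than rounding to nearest):

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Narrow a float32 to bfloat16 by truncation: keep the sign bit,
    the full 8-bit exponent, and the top 7 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 16

def bf16_bits_to_f32(b: int) -> float:
    """Widen bfloat16 back to float32 by appending 16 zero bits
    to the mantissa -- no rounding or renormalization needed."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

# Pi survives with roughly 2-3 significant decimal digits:
# 3.14159265 -> 3.140625 after the round trip.
roundtrip = bf16_bits_to_f32(f32_to_bf16_bits(3.14159265))
```

Because the exponent field is untouched, any finite float32 maps to a finite bfloat16 of the same order of magnitude, which is exactly the dynamic-range win mentioned upthread.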
Basically, what they are computing is:
f32 acc = Ca0 * Cb0 + Ca1 * Cb1 + Ca2 * Cb2 + ...
with very many coefficients, all of which are Bfloat16. The precision of the coefficients is not that important, but they can be of substantially different magnitude, so the coefficients can use few bits in the mantissa but the accumulator needs to be wider.
My understanding is that the optimization steps of neural networks (typically gradient descent + backpropagation) act a bit like Lloyd's algorithm. Neuron weights will push each other into place. Often, what matters is how they compare to each other, such that the error is minimized; reaching the theoretically-perfect weight is less important.
Sorry for going off topic, but does Intel calculate bonuses based on HN karma (or, more officially, "impact")? I've seen this bf16 story multiple times now, and it looks like the authors are dying for a Christmas bonus.
To me it looks like a clever optimization. Same range as FP32, but half the size and less precise, and it can be converted back and forth just by truncating or appending zeros.
Google uses it on their TPUs [0]. If you're interested in how it would affect the numerical stability of an algorithm you want to use, there is a Julia package that makes prototyping linear algebra over this datatype pretty straightforward [1].
And Facebook is taking this even further. And while all these things are very cool, do not let ASIC designers claim they are barriers to entry for GPUs and CPUs. Whatever variants of this precision potpourri catch on are but a generation away from incarnation in general processors IMO...
I would be extremely surprised if the motivation for putting bfloat16 in tensorflow was not the TPU. That first public commit was ~1.5 years before TPUv2 was announced at I/O, so it was almost certainly already in development.
bfloat16 was first in DistBelief, so it actually predates TensorFlow and TPUs (I worked on both systems). IIRC the motivation was more about minimizing parameter exchange bandwidth for large-scale CPU clusters rather than minimizing memory bandwidth within accelerators, but the idea generalized.
Why is it clever to change the mantissa and exponent sizes? I thought the clever one was Nervana's Flexpoint, which seemed at least partially novel. And it's interesting that Intel isn't pushing that format, given that Nervana's ASIC had it.