As I understand it, the shortened mantissa matters less because these numbers never appear alone. The total number of bits across all the numbers in the set provides enough precision to make fine distinctions between states of the network.
Even for regular calculations, it is unfortunate that the conventional split reserved only five bits of exponent. A single extra bit there would make it much more useful, and the loss to the mantissa would be an easy tradeoff.
It’s definitely conceptually simpler, but both conversions are a single fully-pipelined operation on any CPU made in the past 5 years, and can be folded into the arithmetic operation on custom HW. In practice the cost of conversion isn’t really an issue; the win with bfloat16 is the added dynamic range.
I'm a computational scientist. Do ML problems not involve computations that are sensitive to input precision? And if they don't, is it too naive to ask whether one really needs ML for such a problem rather than plain old fitting and stats?
These are not used for the data but for the computation of the internal coefficients. Said coefficients would be stored as F32s, and the job of updating them involves computing a lot of multiplications, none of which need to be that precise.
Conversion from Bfloat16 to f32 is just extending the mantissa with zeroes.
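A minimal sketch of both directions in Python, working at the bit level with `struct` (the function names are mine; real hardware does this with a plain 16-bit shift, and this version truncates rather than rounding to nearest):

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Narrow a float32 to bfloat16 by truncation: keep the sign bit,
    the full 8-bit exponent, and the top 7 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 16

def bf16_bits_to_f32(b: int) -> float:
    """Widen bfloat16 back to float32 by appending 16 zero bits
    to the mantissa -- no rounding or renormalization needed."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

# Pi survives with roughly 2-3 significant decimal digits:
# 3.14159265 -> 3.140625 after the round trip.
roundtrip = bf16_bits_to_f32(f32_to_bf16_bits(3.14159265))
```

Because the exponent field is untouched, any finite float32 maps to a finite bfloat16 of the same order of magnitude, which is exactly the dynamic-range win mentioned upthread.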
Basically, what they are computing is:
f32 acc = Ca0 * Cb0 + Ca1 * Cb1 + Ca2 * Cb2 + ...
with very many coefficients, all of which are Bfloat16. The precision of the coefficients is not that important, but they can be of substantially different magnitude, so the coefficients can use few bits in the mantissa but the accumulator needs to be wider.
My understanding is that the optimization steps of neural networks (typically gradient descent + backpropagation) act a bit like Lloyd's algorithm. Neuron weights will push each other into place. Often, what matters is how they compare to each other, such that the error is minimized; reaching the theoretically-perfect weight is less important.
Sorry for going off topic, but does Intel calculate bonuses based on HN karma (or, more officially, "impact")? I've seen this bf16 story multiple times now, and it looks like the authors are dying for a Christmas bonus.
To me it looks like a clever optimization. Same range as FP32, but half the size and less precise, and it can be converted back and forth just by truncating or appending zeros.
Google uses it on their TPUs [0]. If you're interested in how it would affect the numerical stability of an algorithm you want to use, there is a Julia package that makes prototyping linear algebra over this datatype pretty straightforward [1].
And Facebook is taking this even further. And while all these things are very cool, do not let ASIC designers claim they are barriers to entry for GPUs and CPUs. Whatever variants of this precision potpourri catch on are but a generation away from incarnation in general processors IMO...
I would be extremely surprised if the motivation for putting bfloat16 in tensorflow was not the TPU. That first public commit was ~1.5 years before TPUv2 was announced at I/O, so it was almost certainly already in development.
bfloat16 was first in DistBelief, so it actually predates TensorFlow and TPUs (I worked on both systems). IIRC the motivation was more about minimizing parameter exchange bandwidth for large-scale CPU clusters rather than minimizing memory bandwidth within accelerators, but the idea generalized.
Why is it clever to change the mantissa and exponent sizes? I thought the clever one was Nervana's Flexpoint, which seemed at least partially novel. And it's interesting that Intel isn't pushing that format, given that Nervana's ASIC had it.