Scale, and only scale. As with most extensions, there's negative value if you're only doing a small number of calculations, but it pays off for larger workloads like ML training.
We've been multiplying matrices since time immemorial, but AMX has only recently become a thing; Intel has just introduced their own AMX instructions as well.
I don't know; if you want to perform DL inference on 4K/8K video in real time, you're going to need some heavy-duty matrix multiplication resources. A GPU is great for batched inference, but for quick, no-PCIe-transfer, small-to-no-batch inference you want something close to the CPU...
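Here's a rough sketch of the PCIe tax I mean, assuming PyTorch and a CUDA card are available (the matrix sizes and iteration counts are arbitrary):

    # Rough latency comparison: a small matmul on the CPU vs. on the GPU
    # when each call pays the host->device->host copy, like a one-off
    # low-latency inference would. Assumes PyTorch and a CUDA-capable GPU.
    import time
    import torch

    def time_cpu(a, b, iters=100):
        # Warm up once, then time the pure CPU matmul.
        torch.matmul(a, b)
        t0 = time.perf_counter()
        for _ in range(iters):
            torch.matmul(a, b)
        return (time.perf_counter() - t0) / iters

    def time_gpu_with_transfer(a, b, iters=100):
        # Every iteration includes the PCIe round trip plus the kernel.
        dev = torch.device("cuda")
        torch.matmul(a.to(dev), b.to(dev))  # warm-up / lazy CUDA init
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            c = torch.matmul(a.to(dev), b.to(dev)).cpu()
        torch.cuda.synchronize()
        return (time.perf_counter() - t0) / iters

    if __name__ == "__main__":
        a = torch.randn(1, 256, 256)  # "small-to-no-batch" sized problem
        b = torch.randn(1, 256, 256)
        print(f"CPU:                {time_cpu(a, b) * 1e6:.1f} us")
        if torch.cuda.is_available():
            print(f"GPU incl. transfer: {time_gpu_with_transfer(a, b) * 1e6:.1f} us")

For a single small matmul the host-to-device round trip tends to dominate, which is the whole argument for having a matmul block sitting right next to the CPU cores.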
Then it's good that the A13/A14/M1 have a Neural Engine, the M1's being rated at 11 trillion operations per second and sharing memory with the CPU.
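If you want to poke at it from Python, here's a minimal sketch with coremltools; the model path and input name are made up, and Core ML only dispatches to the Neural Engine when it decides it's worthwhile:

    # Sketch: let Core ML schedule inference on CPU, GPU, or Neural Engine.
    # Assumes coremltools >= 5 on an Apple Silicon Mac, plus an existing
    # compiled model at "model.mlpackage" with an input named "input"
    # (both hypothetical).
    import numpy as np
    import coremltools as ct

    # ComputeUnit.ALL lets Core ML pick the backend; with unified memory
    # there is no explicit host<->device copy to manage.
    model = ct.models.MLModel("model.mlpackage",
                              compute_units=ct.ComputeUnit.ALL)
    out = model.predict({"input": np.random.rand(1, 3, 224, 224).astype(np.float32)})
    print(out.keys())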
We're talking about INT4/INT8 or bfloat16 TOPS, right? And if it's anything like other neural inference engines, VPUs, TPUs, etc., it's probably powered off except for heavy-duty work and slow to power back up, whereas re-powering an in-CPU matmul block might be faster?
I don't otherwise see the need for an effort such as a matmul-dedicated instruction set. What's your guess?
How is this different from normal matrix operations?