Scale, and only scale. As with most extensions, there's negative value if you're only doing a small number of calculations, but it pays off for larger workloads like ML training.
We've been multiplying matrices since time immemorial, but AMX has only recently become a thing; Intel has just introduced their own AMX instructions as well.
I don't know; if you want to perform DL inference on 4K/8K video in real time, you're going to need some heavy-duty matrix multiplication resources. A GPU is great for batched inference, but for quick, no-PCIe-transfer, small-to-no-batch inference you want something close to the CPU...
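Here's a rough sketch of the PCIe tax I mean, assuming PyTorch and a CUDA card are available (the matrix sizes and iteration counts are arbitrary):

    # Rough latency comparison: a small matmul on the CPU vs. on the GPU
    # when each call pays the host->device->host copy, like a one-off
    # low-latency inference would. Assumes PyTorch and a CUDA-capable GPU.
    import time
    import torch

    def time_cpu(a, b, iters=100):
        # Warm up once, then time the pure CPU matmul.
        torch.matmul(a, b)
        t0 = time.perf_counter()
        for _ in range(iters):
            torch.matmul(a, b)
        return (time.perf_counter() - t0) / iters

    def time_gpu_with_transfer(a, b, iters=100):
        # Every iteration includes the PCIe round trip plus the kernel.
        dev = torch.device("cuda")
        torch.matmul(a.to(dev), b.to(dev))  # warm-up / lazy CUDA init
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            c = torch.matmul(a.to(dev), b.to(dev)).cpu()
        torch.cuda.synchronize()
        return (time.perf_counter() - t0) / iters

    if __name__ == "__main__":
        a = torch.randn(1, 256, 256)  # "small-to-no-batch" sized problem
        b = torch.randn(1, 256, 256)
        print(f"CPU:                {time_cpu(a, b) * 1e6:.1f} us")
        if torch.cuda.is_available():
            print(f"GPU incl. transfer: {time_gpu_with_transfer(a, b) * 1e6:.1f} us")

For a single small matmul the host-to-device round trip tends to dominate, which is the whole argument for having a matmul block sitting right next to the CPU cores.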
Then it's good that the A13/A14/M1 have a Neural Engine, the M1's being rated at 11 trillion operations per second and sharing memory with the CPU.
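If you want to poke at it from Python, here's a minimal sketch with coremltools; the model path and input name are made up, and Core ML only dispatches to the Neural Engine when it decides it's worthwhile:

    # Sketch: let Core ML schedule inference on CPU, GPU, or Neural Engine.
    # Assumes coremltools >= 5 on an Apple Silicon Mac, plus an existing
    # compiled model at "model.mlpackage" with an input named "input"
    # (both hypothetical).
    import numpy as np
    import coremltools as ct

    # ComputeUnit.ALL lets Core ML pick the backend; with unified memory
    # there is no explicit host<->device copy to manage.
    model = ct.models.MLModel("model.mlpackage",
                              compute_units=ct.ComputeUnit.ALL)
    out = model.predict({"input": np.random.rand(1, 3, 224, 224).astype(np.float32)})
    print(out.keys())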
We're talking about INT4/INT8 or bfloat16 TOPS, right? And if it's anything like other neural inference engines, VPUs, TPUs, etc., it's probably powered off except for heavy-duty work and slow to power back up, whereas re-powering an in-CPU matmul block might be faster?
I don't otherwise see the need for an effort such as a matmul-dedicated instruction set. What's your guess?
How is this different from normal matrix operations?