Then it's good that the A13/A14/M1 have a neural inference engine, the latest of which is rated at 11 trillion operations per second and shares memory with the CPU.
We're talking about INT4/INT8 or bfloat16 TOPS, right? And if it behaves like other neural inference engines, VPUs, TPUs, etc., it's probably powered off except for heavy-duty work and slow to power back up, whereas re-powering an in-CPU matmul block might be faster?
Other than that, I don't see the need for an effort such as a dedicated matmul instruction set. What's your guess?
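For what it's worth, a low-latency CPU-side path already exists in practice: on Apple silicon, the Accelerate framework's BLAS routines are reported to be dispatched to the in-CPU matmul blocks rather than to the Neural Engine, so a small GEMM never has to wait for a separate accelerator to wake up. A minimal sketch using the standard CBLAS interface (the routing to the matmul hardware is an internal detail of Accelerate, not something the API exposes, so treat that part as an assumption):

```c
// Small single-precision GEMM via Accelerate's CBLAS interface.
// Build on macOS with: clang sgemm.c -framework Accelerate
#include <Accelerate/Accelerate.h>
#include <stdio.h>

int main(void) {
    enum { M = 4, N = 4, K = 4 };
    float A[M * K], B[K * N], C[M * N];

    // Fill A and B with simple test values.
    for (int i = 0; i < M * K; i++) A[i] = (float)i;
    for (int i = 0; i < K * N; i++) B[i] = 1.0f;

    // C = 1.0 * A * B + 0.0 * C  (row-major, no transposes)
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0f, A, K,
                B, N,
                0.0f, C, N);

    printf("C[0][0] = %f\n", C[0]);  // expect 0+1+2+3 = 6
    return 0;
}
```

The point of the sketch is just that a matmul this small is latency-bound, not throughput-bound, which is exactly the regime where an always-on in-CPU block makes more sense than waking a power-gated NPU.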