Apple added special AMX instructions specifically for matrix operations. They first shipped them back in the A13, then improved them for the M1. They're aimed primarily at machine learning training, where backpropagation runs through huge matrix operations.
They provide a set of libraries with very wide coverage that runs optimally on every supported platform. You interact with AMX through those libraries, and using the instructions directly is considered unsupported. Apple may change AMX completely in the next iteration, in which case they can simply update those libraries and every arm's-length application gets the benefits for free.
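For a sense of what that looks like in practice, here's a minimal sketch (assuming Accelerate's vDSP_mmul as one such entry point; the tiny matrices are made up for illustration). The caller writes an ordinary library call and never touches AMX; which silicon does the work is the library's business:

    import Accelerate

    // Multiply A (2x3) by B (3x2) into C (2x2), all row-major Float buffers.
    let a: [Float] = [1, 2, 3,
                      4, 5, 6]
    let b: [Float] = [7,  8,
                      9,  10,
                      11, 12]
    var c = [Float](repeating: 0, count: 4)

    // vDSP_mmul(A, strideA, B, strideB, C, strideC, M, N, P) computes the
    // M x N product of an M x P and a P x N matrix. Whatever the hardware
    // offers (NEON, AMX, ...) is dispatched to under the hood.
    vDSP_mmul(a, 1, b, 1, &c, 1, 2, 2, 3)

    print(c)  // [58.0, 64.0, 139.0, 154.0]

If Apple reworks AMX in a future chip, the same call just gets faster; nothing in the application changes.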
Apple sells whole systems. They aren't selling CPUs to third parties, so there's no reason for them to encourage anyone to target their proprietary extensions. Indeed, it would be more evil if they asked people to pepper their own code with this instead of just using libraries that abstract you from the magic.
As a tiny, tiny vendor in the computing space -- as HN often assures us -- I'm not seeing the great evil many are claiming.
(Indeed, of course this is downvoted, but I chuckle that another comment insists Apple is evil because this will lead to code and binaries that only work on Apple devices. That is quite literally the opposite of what Apple is doing, which is forcing use through optimized libraries that abstract away their own proprietary extensions. People just love bitching.)
Scale, and only scale. As with most extensions, there's negative value if you're doing a small number of calculations, but it pays off for bigger workloads like ML training.
We've been multiplying matrices since time immemorial, yet AMX has only recently become a thing. Intel has just introduced its own AMX instructions as well.
I don't know -- if you want to run DL inference on 4K/8K video in real time, you're going to need some heavy-duty matrix multiplication resources. A GPU is great for batched inference, but for quick, no-PCIe-transfer, small-to-no-batch inference you want something close to the CPU...
Then it's good that the A13/A14/M1 have a Neural Engine for inference, the M1's delivering 11 trillion operations per second and sharing memory with the CPU.
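And apps don't program it directly either; as a rough sketch of the usual path (the model path below is hypothetical, and .all merely allows the Neural Engine -- Core ML decides where each layer actually runs):

    import CoreML

    let config = MLModelConfiguration()
    config.computeUnits = .all  // CPU, GPU, and Neural Engine all allowed

    do {
        // Hypothetical compiled model; substitute a real .mlmodelc bundle.
        let modelURL = URL(fileURLWithPath: "/path/to/Model.mlmodelc")
        let model = try MLModel(contentsOf: modelURL, configuration: config)
        // Predictions via model.prediction(from:) now run wherever Core ML
        // routed them -- the Neural Engine is never programmed directly.
        _ = model
    } catch {
        print("Failed to load model: \(error)")
    }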
We're talking about INT4/8 or bfloat16 TOPS, right? And if it's anything like other neural inference engines, VPUs, TPUs, etc., it's probably powered down except for heavy-duty work and slow to power up again, whereas re-powering an in-CPU matmul block might be faster?
So I don't see the need for an effort like a matmul-dedicated instruction set elsewhere? What's your guess?