
Worse still is Intel's rollout of AVX512 specifically, which started nearly a decade ago but to this day is still not available across their whole product stack, so the countdown to it becoming ubiquitous hasn't even started yet. They painted themselves into a corner by making 512-bit vectors a mandatory feature, which they then decided wasn't feasible to support in their small E-cores, so now they're walking it all back with a new "AVX10" spec, which is just a redux of AVX512 except that 512-bit vectors are optional this time.

Then we'll have to wait another decade or so for AVX10 to become baseline, so AVX2 will probably be old enough to drink (in the US) before it's fully phased out.



While it does seem that AVX10 was mainly designed for consumer CPUs so they could use modern vector instructions without 512-bit vectors, the upcoming Arrow Lake will not have it.[1]

I guess we will have to wait for at least one more generation.

[1] - According to Intel® Architecture Instruction Set Extensions Programming Reference: https://cdrdv2-public.intel.com/826290/architecture-instruct...


Not only does Arrow Lake not have AVX10; even Panther Lake, the 2025/2026 Intel CPU, does not have it.

Panther Lake will introduce FRED (Flexible Return and Event Delivery), a new way of handling interrupts, exceptions, and system calls.

FRED will bring tremendous changes to operating system kernels, but it will have little influence on user programs, except that the computer will spend less time running OS kernel code than it does now.

For now, it is expected that Intel will introduce AVX10 in its consumer CPUs only with Nova Lake, the 2026/2027 Intel CPU.

Meanwhile, AMD Zen 4 and Zen 5 already happily support AVX10 in everything except implementing the AVX10 CPUID flags. AVX10.1 differs from AVX-512 only by adding a simpler method for identifying which instructions are supported. AVX10.2 will add only some instructions that are not needed on CPUs that support the 512-bit AVX-512 instructions, like Zen 4 and Zen 5. AVX10.3 has not been defined yet and is far in the future.
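Since the simpler identification method is the whole point of AVX10.1, here is a minimal sketch of what that enumeration looks like, going by Intel's published AVX10 spec (the leaf/bit assignments below are my reading of it and could still change while the spec is being revised):

    #include <cpuid.h>   // GCC/Clang CPUID helpers
    #include <cstdio>

    int main() {
      unsigned eax, ebx, ecx, edx;

      // CPUID.(EAX=7, ECX=1):EDX bit 19 enumerates AVX10 support.
      if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx) ||
          !(edx & (1u << 19))) {
        puts("AVX10 not supported");
        return 0;
      }

      // Leaf 0x24 then reports a single converged version number plus the
      // maximum vector length: EBX[7:0] = AVX10 version, EBX bit 18 = 512-bit.
      __get_cpuid_count(0x24, 0, &eax, &ebx, &ecx, &edx);
      printf("AVX10.%u, 512-bit vectors: %s\n", ebx & 0xFFu,
             (ebx & (1u << 18)) ? "yes" : "no");
      return 0;
    }

Compare that with AVX-512, where software has to probe half a dozen separate feature flags (F, VL, BW, DQ, VBMI, ...) to know what it can use.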


Thanks for nerd sniping me into FRED! :)


Intel is making Xeons out of E-cores (up to 288 of them on one chip) so I assume those will also be motivating the rollout of AVX10, not just their consumer parts.


But surely they could just double-pump like AMD does on Zen 4(c) and also on (some?) Zen 5c.

It's weird to see Intel so... broke? That they're seemingly forced to recycle old architectures endlessly.


Another concern besides register file size is the shuffle instructions, which can transfer any byte of a 512-bit register to any other position. One variant (vpermt2b) can even select any byte across two such registers, i.e. it picks from 128 bytes and does 64 such selections in one instruction.

You can't emulate that via just two regular 256-bit uops; you need four (maybe more for blending the results together). And if you don't have the two-register-table variant as a single 256-bit uop (e.g. Tiger Lake doesn't have it as one uop, though that's for the 512-bit form of course; it splits it into three uops), that'd end up at a rather massive 12 uops.
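For anyone who hasn't used it, here's a minimal sketch of that two-register byte shuffle through the vpermt2b intrinsic; it assumes an AVX512VBMI-capable CPU and something like -mavx512vbmi at compile time, and the table/index setup is just for illustration:

    #include <immintrin.h>
    #include <cstdint>
    #include <cstdio>

    int main() {
      // Two 64-byte tables; their concatenation forms a 128-byte lookup table.
      alignas(64) uint8_t table_lo[64], table_hi[64];
      for (int i = 0; i < 64; ++i) { table_lo[i] = i; table_hi[i] = 64 + i; }
      const __m512i lo = _mm512_load_si512(table_lo);
      const __m512i hi = _mm512_load_si512(table_hi);

      // Indices 127 down to 64: read the second table back to front.
      alignas(64) uint8_t idx_bytes[64];
      for (int i = 0; i < 64; ++i) idx_bytes[i] = 127 - i;
      const __m512i idx = _mm512_load_si512(idx_bytes);

      // One vpermt2b: 64 independent byte selections, each from 128 bytes.
      const __m512i result = _mm512_permutex2var_epi8(lo, idx, hi);

      alignas(64) uint8_t out[64];
      _mm512_store_si512(out, result);
      printf("%d %d\n", out[0], out[63]);  // prints "127 64"
      return 0;
    }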


I think Intel's E-cores are quite a bit smaller than the Zen 4c/5c cores; maybe at that scale it's prohibitive to even double up the register file? That's required even if the logic is double-pumped. AIUI the small Zen cores are mostly the same design as the big ones, just with less cache, a silicon layout retuned for density rather than speed, and the removal of the 3D V-Cache stacking vias, while Intel's small cores are clean-sheet designs with next to nothing in common with their big cores, so they have the opportunity to shrink them a lot more.


Yes. While the big Intel cores are much bigger than the big AMD cores (e.g. 5 square mm in Meteor Lake vs. 3.8 square mm for Zen 4), the Intel small cores are much smaller than the AMD compact cores (e.g. 1.5 square mm in Meteor Lake vs. 2.5 square mm for Zen 4c).

The smaller size of the Intel E-cores is due not only to their different microarchitecture, but also to the fact that only their L1 caches are private: their L2 caches are shared within groups of 4 E-cores.

The shared L2 cache may not matter much for many general-purpose programs, but for multi-threaded programs that depend on high total L2 transfer throughput, the performance of each group of 4 E-cores becomes similar to that of a single core, instead of being 4 times greater.

The AMD compact cores have the same private caches as the big cores; only the shared L3 cache blocks that serve a group of compact cores are smaller than for the same number of big cores.


My non-expert brain immediately jumped to double-pumping, plus maybe working with their thread director to have tasks using a lot of AVX512 instructions prefer P-cores. It feels like such an obvious solution to a really dumb problem that I assumed there was something simple I was missing.

The register file size makes sense; I didn't think it was that much of the die on those processors, but I guess they had to be pretty aggressive to meet power goals?


> The register file size makes sense; I didn't think it was that much of the die on those processors

https://i.imgur.com/WdMPX8S.jpeg

According to this, Zen 4's FP register file is almost as big as its FP execution units. It's a pretty sizable chunk of silicon.


I was having trouble finding an E-core die shot, but that helps put it into perspective a bit anyway. Thanks!


If/once they follow through on their X86S architecture, maybe they'll have the transistor budget to support proper AVX512 on their efficiency cores.


Skymont little cores have 4x 128-bit execution units. They could quadruple-pump.

But it looks more like they're giving up on people writing code for wide vectors, and are instead settling for trying to make existing code faster.


Well, they don't support it either. According to the document I linked, neither the just-released Sierra Forest nor the planned Clearwater Forest supports AVX10.


AVX10 is still pretty much in the proposal phase, and has been recently updated based on feedback Intel has received. It takes several years to get from that stage to shipping hardware.


Granite Rapids, to be launched in a few months, is said by Intel to support AVX10.1/512 (which is identical to the ISA supported by Zen 5, except for a few additional flags reported by CPUID; Zen 4 lacks only VP2INTERSECT from AVX10.1).

Only the availability of AVX10/256 in Intel's consumer CPUs and in its server CPUs with E-cores is in the proposal phase, mainly because Intel has yet to design and launch an E-core supporting AVX10/256 as the successor of the Skymont core being launched now; this is expected only in H2 2026.


I don't think you can phase out AVX2. It's the base of AVX512: you can't always go 512 bits wide, and without it you'd have no backwards compatibility.


I know AVX2 will continue to exist in hardware forever for backwards compatibility. By "fully phased out" I mean the eventual point when software no longer has to maintain a dedicated path for hardware which supports AVX2 but doesn't support AVX10, because all relevant hardware supports AVX10.


The EVEX prefix can address XMM/YMM/ZMM registers, so you can apply the AVX512 instruction set to 128-bit and 256-bit registers too.
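Concretely, that's what the AVX512VL extension exposes. Here's a minimal sketch of an AVX512-style masked operation on plain 256-bit YMM registers (assuming AVX512F+AVX512VL hardware and e.g. -mavx512vl at compile time):

    #include <immintrin.h>
    #include <cstdio>

    int main() {
      const __m256i a = _mm256_set1_epi32(10);
      const __m256i b = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);

      // EVEX-only feature at 256-bit width: per-lane masking with zeroing.
      // Lanes whose mask bit is 0 are zeroed instead of computed.
      const __mmask8 k = 0b10101010;
      const __m256i sum = _mm256_maskz_add_epi32(k, a, b);

      alignas(32) int out[8];
      _mm256_store_si256(reinterpret_cast<__m256i*>(out), sum);
      for (int i = 0; i < 8; ++i) printf("%d ", out[i]);  // 0 11 0 13 0 15 0 17
      printf("\n");
      return 0;
    }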


I do not understand why ubiquity or baselines should be required in order to use CPU features :)

For many years now, performance-critical libraries have used runtime/dynamic dispatch.

Our github.com/google/highway intrinsics even automate this. You write your code once, it is compiled for each instruction set, and the best codepath is selected at runtime, roughly as in the sketch below.
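(A condensed sketch of the pattern from Highway's documentation; the function and file names are placeholders, and the remainder loop for n not a multiple of the lane count is omitted for brevity:)

    // mul.cc - written once; foreach_target.h re-includes it per target.
    #undef HWY_TARGET_INCLUDE
    #define HWY_TARGET_INCLUDE "mul.cc"
    #include "hwy/foreach_target.h"  // must come before highway.h
    #include "hwy/highway.h"

    HWY_BEFORE_NAMESPACE();
    namespace demo {
    namespace HWY_NAMESPACE {
    namespace hn = hwy::HWY_NAMESPACE;

    // Lane count adapts to the current target: 8 floats for AVX2,
    // 16 for AVX-512, etc.
    void MulImpl(const float* a, const float* b, float* out, size_t n) {
      const hn::ScalableTag<float> d;
      for (size_t i = 0; i < n; i += hn::Lanes(d)) {
        hn::Store(hn::Mul(hn::Load(d, a + i), hn::Load(d, b + i)), d, out + i);
      }
    }

    }  // namespace HWY_NAMESPACE
    }  // namespace demo
    HWY_AFTER_NAMESPACE();

    #if HWY_ONCE
    namespace demo {
    HWY_EXPORT(MulImpl);
    // Call site: selects the best compiled-in codepath at runtime.
    void Mul(const float* a, const float* b, float* out, size_t n) {
      HWY_DYNAMIC_DISPATCH(MulImpl)(a, b, out, n);
    }
    }  // namespace demo
    #endif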


Dynamic dispatch adds headaches to the build process; they are surmountable for sure, but in my experience the build wrangling to make it all happen is harder than the original work of rewriting your code with intrinsics!

The other major problem I have with dynamic dispatch, at least for the SIMD code I've written, is that you have to do it at a fairly coarse granularity. Most optimized routines do as much fusion & cache tiling as possible, so the dispatch has to happen at the level of the higher-level function rather than the more array-op-like components within it. And mostly, that means you've written your (often quite complicated) procedure several times over, instead of simply accelerating components within it.

I have not used Highway - if it dramatically simplifies the above, that's excellent!


:) Yes indeed, no changes to the build required. Example: https://gcc.godbolt.org/z/KM3ben7ET

I agree dispatch should be at a reasonably high level. This hasn't been a problem in my experience, we have been able to inline together most SIMD code and dispatch infrequently.



