
I actually think the CPU and GPU meeting at the idea of SIMT would be very apropos. AVX-512/AVX10 has mask registers that work just like CUDA lane predication: they allow lockstep iteration while masking off the lanes where an operation "doesn't happen", preserving the illusion of per-thread individuality. With a mask register, an AVX lane is effectively a CUDA thread.
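
To make that concrete, here's a minimal sketch of the per-lane branch "if (x[i] < limit) x[i] += 1.0f;" expressed with a mask register (this assumes an AVX-512F target, and the function name is just illustrative):

    #include <immintrin.h>

    // Per-lane branch "if (x[i] < limit) x[i] += 1.0f;" across 16
    // "threads", using one AVX-512 mask register for predication.
    void bump_below(float *x, float limit) {
        __m512 v   = _mm512_loadu_ps(x);
        __m512 lim = _mm512_set1_ps(limit);
        // One bit per lane: set where the "thread" takes the branch.
        __mmask16 k = _mm512_cmp_ps_mask(v, lim, _CMP_LT_OQ);
        // Masked add: lanes with a 0 bit keep their old value, which
        // is exactly the "it didn't happen here" illusion.
        v = _mm512_mask_add_ps(v, k, v, _mm512_set1_ps(1.0f));
        _mm512_storeu_ps(x, v);
    }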

Obviously there are compromises in terms of bandwidth, but it's also a lot easier to mix into a broader program when you don't have to send data across the bus, which opens up other potential use cases too.

But if you take the CUDA lane idea one step further and add Independent Thread Scheduling, you can also generalize to these lanes having their own "independent" instruction pointer and control flow, which means you're free to reorder and speculate across the whole 1024-bit window, independently of your warp/execution width.
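
For contrast, here's roughly what divergence looks like under today's lockstep model, hand-written with AVX-512F intrinsics (a sketch, not compiler output, and the function name is made up): every lane steps through the loop together, finished lanes just get masked off, and the "warp" can't retire until the slowest lane is done. Independent per-lane instruction pointers are what would let the hardware schedule those leftover iterations freely instead.

    #include <immintrin.h>

    // Lockstep execution of the data-dependent loop
    //   while (x[i] > 1.0f) x[i] *= 0.5f;
    // All 16 lanes iterate together; a lane that finishes early is
    // masked off, and the loop exits only when the active mask empties.
    void halve_until_small(float *x) {
        __m512 v    = _mm512_loadu_ps(x);
        __m512 one  = _mm512_set1_ps(1.0f);
        __m512 half = _mm512_set1_ps(0.5f);
        __mmask16 active = _mm512_cmp_ps_mask(v, one, _CMP_GT_OQ);
        while (active) {
            // Multiply only the still-active lanes.
            v = _mm512_mask_mul_ps(v, active, v, half);
            // A lane stays active only if it was active AND still > 1.
            active = _mm512_mask_cmp_ps_mask(active, v, one, _CMP_GT_OQ);
        }
        _mm512_storeu_ps(x, v);
    }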

The optimization problem you solve then becomes: advance all the instruction pointers until they hit a threadfence, at the lowest total execution cost. And technically you may not know in advance exactly where that fence is going to be! Things like self-modifying code are another headache, and they aren't allowed in GPGPU either; there will certainly be some idioms that don't translate well, but that stuff is at least thankfully rare in AVX code.


