RDNA 4's “Out-of-Order” Memory Accesses

jauntywundrkind · 2025-03-24T03:44:35 1742787875

I've been super curious to see what was at stake here! This sounds better than I'd dared to hope for.

I kind of thought this was just gonna be some kind of deferred texture loading thing to help with streaming assets.

If it actually allows inter-warp sequencing, it sounds like it might solve the chief complaint supreme GUI master Raph Levien recently raised in "I want a good parallel computer": even though we can dynamically add shaders & construct a dynamic workgraph (largely thanks to VK_AMDX_shader_enqueue?), there isn't any sequencing/fencing/barrier-ing between the sections. https://raphlinus.github.io/gpu/2025/03/21/good-parallel-com... https://news.ycombinator.com/item?id=43440174

Not applicable to GPUs, but since I ran into it recently, it's interesting to see how io_uring handles sequenced submissions. Here's the Lord of the io_uring write-up: https://unixism.net/loti/tutorial/link_liburing.html#link-li...

Edit: having read the article more fully, I'm not sure this is about waves depending on each other. Maybe more about them trying to access memory. Apologies. Hopefully someday!

Terr_ · 2025-03-24T00:10:17 1742775017

At first glance at the title, I thought it was going to be about some twist on DNA 3' and DNA 5' reading frames.

https://en.wikipedia.org/wiki/Reading_frame

pyinstallwoes · 2025-03-24T08:04:43 1742803483

What’s interesting about that glass you?

IshKebab · 2025-03-24T10:36:42 1742812602

Presumably this didn't matter hugely because the memory access patterns for each wave are going to be extremely similar anyway?

Ah yeah he says that at the end. Doesn't really matter for rasterisation but might make more of a difference for ray tracing.

shmerl · 2025-03-23T23:19:21 1742771961

Does AMD have its own flavor of GPU assembly, and what is it called?

dragontamer · 2025-03-23T23:21:56 1742772116

Yes, and it's slightly different per architecture. The differences are mostly new instructions (like the one discussed in this article).

Just search for "RDNA4 ISA" and you'll find it: https://www.amd.com/content/dam/amd/en/documents/radeon-tech...

Terrascale from 2008 was very different. Ignore it.

GCN is mostly the same as RDNA, and GCN is practically identical to CDNA. So you can go back to older guides as far back as GCN1 (early 2010s era). The only fundamental difference is that RDNA natively runs 32-wide waves (wave32) while GCN/CDNA runs 64-wide waves (wave64).

--------

NVidia has an intermediate assembly language called PTX. NVidia's true assembly language is undocumented (not secret, just not intended for general-purpose coding). Search for NVidia's PTX manual and you'll see ...

GZGavinZhao · 2025-03-23T23:51:17 1742773877

Slightly tangential, but AMD is also working on amdgcnspirv (i.e. AMD-flavored SPIR-V) that'll hopefully result in a user experience similar to PTX [1].

[1]: https://github.com/ROCm/ROCm/issues/3985#issuecomment-254616...

shmerl · 2025-03-24T00:06:36 1742774796

Mesa uses NIR as intermediate representation for its drivers. Is that comparable?

winocm · 2025-03-24T05:12:24 1742793144

Kind of. NIR is more oriented towards lowering and optimizing code for driver backends, as far as I know. SPIR-V is targeted towards the other end of the spectrum.

GZGavinZhao · 2025-03-24T02:38:04 1742783884

From my limited understanding of SPIR-V (since AMDGCNSPIRV is in essence SPIR-V), I would say yes.

shmerl · 2025-03-23T23:26:57 1742772417

Interesting, thanks!

Looking forward to the aco compiler using the new features of RDNA4 to improve ray tracing performance with radv.