When you play a video, you're (usually) sending the encoded video to the GPU for hardware-accelerated decode. The resulting decoded frames then live purely on the GPU.
However, with hardware-accelerated decode, rather than each video frame being decoded into a distinct VRAM texture with its own manipulable GPU handle, you just tell the GPU to allocate a single buffer — a mutable canvas, or "draw context" — and then tell it that, as the chunks of your encoded video stream are played, decoded, and rasterized into frames by the GPU, it should write each successive frame it generates into this buffer.
In this setup, the only thing you can get a GPU handle for (besides the encoded video stream) is that buffer. That buffer is special, being the target of the GPU's own asynchronous hardware-accelerated rendering process; so that buffer is referred to as a "hardware-accelerated draw context."
(These days GPUs have enough VRAM that you really could get away with dumping each frame out to its own immutable VRAM texture buffer; but back in the early '90s when hardware-accelerated video decode was first invented, reusing a single buffer [or maybe a pair of them] for decoding was a practical necessity. Hardware-accelerated draw contexts have been core to how GPUs work since there were GPUs.)
In general, the GPU doesn't allow you to read directly from the VRAM backing a hardware-accelerated draw context (because doing so could block the GPU from writing to it, and also because doing so wouldn't get you a coherent single frame's state out.) The GPU locks that VRAM up and considers it to be "its to use" for as long as the draw context is the target of a hardware-accelerated pipeline. So the GPU will ignore any attempt to use any existing handle you had to the VRAM backing the draw context; only the handle to the wrapping draw-context ADT will work, and that handle doesn't have "read from" as part of its command-set.
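To make the shape of that relationship concrete, here's a minimal sketch in Python. Every name in it (HwAccelDrawContext, bind_decode_target, and so on) is hypothetical, standing in for whatever the real driver stack exposes; the point is just what the handle does and doesn't let you do:

```python
class EncodedStreamHandle:
    """Hypothetical handle to the encoded video stream sitting in GPU memory."""
    pass


class HwAccelDrawContext:
    """Hypothetical handle to a hardware-accelerated draw context.

    The GPU owns the backing VRAM for the lifetime of the context; the
    host only ever talks to it through this wrapper's command-set.
    """

    def __init__(self, width: int, height: int):
        self.width, self.height = width, height
        self._backing_vram = object()   # opaque to the host: no address, no mapping

    def bind_decode_target(self, stream: EncodedStreamHandle) -> None:
        """Tell the GPU: decode successive frames of this stream into this
        buffer, overwriting whatever frame was there before."""
        ...

    # Deliberately absent: read_pixels() / map() / copy_out().
    # "Read from" simply isn't part of this handle's command-set, and any
    # stale raw-VRAM handle you held beforehand is ignored by the GPU.
```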
So how does such a draw context even end up on the screen, then?
When a desktop window compositor sets up the GPU to display your windows, it takes various draw-contexts — some of which may be hardware-accelerated — and lays them out in 3D space, with a (usually orthographic) projection. The GPU itself is then told to take care of the rest (the final toplevel "get this scene out and onto a screen" part) internally. There's never a point under normal GPU rendering, where your entire screen-contents are rendered out to a single (externally-addressable) VRAM texture.
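Reduced to a cartoon, the compositor's side of that hand-off looks roughly like the sketch below. The Layer type and gpu.submit_scene call are made up for illustration; a real compositor would be going through D3D, Vulkan, or Metal:

```python
from dataclasses import dataclass


@dataclass
class Layer:
    context: object   # a draw-context handle; possibly HW-accelerated
    x: int            # position on the desktop
    y: int
    z: float          # stacking order; with an orthographic projection,
                      # z only determines what occludes what


def present_desktop(gpu, layers):
    # Hand the laid-out scene to the GPU and let it do the final
    # "get this onto a screen" step internally. There is no
    # screen-sized texture here that *we* hold a handle to.
    gpu.submit_scene(sorted(layers, key=lambda layer: layer.z))
```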
What happens when you take a screenshot, then? Usually, the compositor is pinged to ask it to render that screenshot; and it does so by allocating a screen-sized VRAM buffer, and then taking each GPU handle in turn and telling the GPU to copy the VRAM underlying that handle into the target buffer. But oh no — HW-accel draw-contexts don't have exposed VRAM handles!
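As a rough sketch of that screenshot path (again with made-up gpu.* calls), you can see exactly where it runs into trouble:

```python
def take_screenshot(gpu, windows, screen_w, screen_h):
    """Naive compositor screenshot: copy each window's backing VRAM
    into one freshly allocated screen-sized buffer."""
    target = gpu.alloc_texture(screen_w, screen_h)
    for win in windows:
        src = win.vram_handle()       # ordinary windows: a readable texture
        if src is None:               # HW-accel draw contexts: no exposed VRAM handle!
            gpu.fill_rect(target, win.rect, rgb=(0, 0, 0))   # black rectangle
            continue
        gpu.copy_texture(src, target, dest_rect=win.rect)
    return gpu.download_to_cpu(target)
```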
It used to be (until... Windows Vista, I think?) that hardware-accelerated draw contexts in general (like videos being played using hardware decoding, or like 3D games) just weren't ever captured in screenshots, DRM or no. This was purely a question of GPU performance: a GPU with an architecture that could snapshot a frame of an HW-accel context wouldn't perform as well as a GPU with an architecture that couldn't.
When this ability was later added, it was done so by adding a feature to the GPU: a command was added on HW-accel draw contexts, that would prompt them to pause whatever rendering is happening into the buffer for a moment, at a coherent point; to snapshot the internal VRAM in that coherent state, copying it over into a non-HW-accelerated buffer; and to then get the accelerated rendering going again. The compositor therefore now had the ability to ask such draw-contexts for snapshots, to then be re-copied out into a WIP screenshot.
But I emphasize "ask" here — you really are just asking the GPU to do it. The GPU can always just refuse! (OSes had to support GPUs refusing to do it, because many GPUs were old enough that they couldn't manage to do it, even given updated drivers that allowed software to ask the GPU the question; even given newer firmware pushed to the GPU by said drivers.)
So DRMed video was implemented simply by telling the GPU that there are some HW-accel draw-contexts that are special, and so the GPU should refuse any requests to snapshot them. (And also, telling the GPU that the encoded video itself is also special — so the GPU should taint any draw-contexts it decodes that video onto with the same specialness.)
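Put together, the policy amounts to something like this sketch. The protected flag and the method names are illustrative, not any vendor's actual interface:

```python
class ProtectedStream:
    def __init__(self, protected: bool):
        self.protected = protected    # set when the stream arrived under DRM


class HwAccelContext:
    def __init__(self):
        self.protected = False        # the "specialness" taint bit

    def decode_into(self, stream: ProtectedStream) -> None:
        # Taint propagation: decoding a protected stream onto this
        # context makes the context itself protected.
        if stream.protected:
            self.protected = True
        ...  # kick off the hardware decode

    def request_snapshot(self):
        # Snapshotting is a *request*; the GPU is free to refuse.
        if self.protected:
            return None               # refusal: the compositor gets a black rect
        return self._pause_copy_resume()

    def _pause_copy_resume(self):
        ...  # pause decode at a coherent frame, copy VRAM out, resume
```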
I believe that DRMed video streams get decrypted on the CPU before being sent to the GPU. So while you can't capture a screenshot of the decoded frame, you could in theory capture the (encoded) video in RAM. Sadly, that RAM is protected by the OS and the decoding happens using Intel SGX or something. So it's impossible to "get in under" the OS to grab the encoded video, either. (Unless you compile a custom kernel... but that's why these extensions have an in-kernel component that doesn't ship with the Windows or macOS DDKs, and why Linux doesn't even get DRMed-video support except in proprietary spins.)
If I thought it would be appropriate, there’d be a :hands-clapping: emoji here thanking you for that infodump (and the reply). That’s a great and useful explanation that’s easily modeled mentally. (“easily” lol)
I think at least the last bit is incorrect: DRM-protected video is still often decoded by a hardware decoder built into the GPU.
I am also in doubt about the claim that there's never a single VRAM texture that contains the entire desktop: while I can see ways for the GPU to assemble that frame on the fly as it sends it to the monitor, that sounds implausible compared to just reading it from a premade buffer.
> I think at least the last bit is incorrect: DRM-protected video is still often decoded by a hardware decoder built into the GPU.
I misspoke, sorry — the last paragraph is meant to say that the CPU "decrypts", not "decodes", the video (inside the Intel SGX enclave.)
> I am also in doubt about the claim that there's never a single VRAM texture that contains the entire desktop
Some GPUs do work this way internally, but my phrasing was specific — even when such a VRAM texture buffer exists, it's not addressable outside the GPU itself. You can't get a handle to it. (And even if you could, it'd just be another HW-accel draw context — and so would be DRM-tainted if-and-when any buffer being copied onto it is considered tainted.)
But why wouldn't you want a GPU to work this way? What's the advantage of not having some kind of "final screen buffer"?
It mostly comes down to a quirk of history. Movies are 24FPS, while computer CRTs were mostly designed for 60Hz refresh rates (with no support for 24Hz modes.) And when people see "hardware-accelerated video playback" as a feature on the box of a video card, one of the primary things they expect that to mean is that you should be able to play back your 24FPS movie on the 60Hz display. In windowed mode. With the desktop being refreshed behind the 24FPS movie. Smoothly.
If videos were mostly 30FPS, this would be easy, because 30Hz divides 60Hz; each video frame could just be copied to the 60Hz "final screen buffer" twice with the same content. But 24Hz doesn't cleanly divide 60Hz — you'd need each video frame to appear for 2.5 "final screen buffers." (And just doing that naively — alternating 2 and 3 repeats, i.e. 2:3 pulldown — is very stuttery.)
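The mismatch is easy to see with a few lines of plain Python: if each display refresh just snaps to the most recently decoded frame, source frames get held for alternating runs of 3 and 2 refreshes, which is exactly the judder described above.

```python
def naive_pulldown(video_fps=24, refresh_hz=60, refreshes=10):
    """Which source frame does each display refresh show, if we just
    repeat the most recently decoded frame?"""
    return [int(n * video_fps / refresh_hz) for n in range(refreshes)]

print(naive_pulldown())
# [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]  -> frames held for 3, 2, 3, 2, ... refreshes
```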
If you're doing this by trying to render out a "final screen buffer" VRAM texture in lockstep with your display frequency, all in advance of sending that texture down the pipe to the display, then for each frame, you need to complete that rendering pass with enough time left to communicate it to the display. But what happens when it takes a really long time to send the signal down the wire, leaving you only a little time each frame to actually render everything? In fact, what happens when the display is analogue — like a CRT — where the timing of sending the data down the wire is directly related to the timing of drawing?
Well, certainly, you can pipeline this final rendering — i.e. make everything show up one frame behind, so that you can work on each frame for an extra entire frame-time. But gamers won't like you messing with their draw-contexts that way. And gamers are the core audience for your product.
And also, that still doesn't let you render the video in windowed mode "smoothly." It just means that you now have slightly less than one full frame-time, each frame, to do the concurrent work of decoding... 2.5 (or some even worse multiple) frames of video, and then doing something with them to squeeze it down to one frame.
And also, how should an active hw-accel video-decode draw context react to the OS changing the display resolution to one that uses a different frequency? Or worse — though not invented at the time — to a Variable Refresh Rate monitor, that can be arbitrarily sent frames using FreeSync?
What the GPU vendors decided to do to make this work was to require one of two things — either:
1. that the algorithm of the video codec directly support decoding of a fractional frame position; or
2. that the creator of the draw-context pass it a VRAM buffer with enough space for not just one, but several (usually specifically three) full extracted frames. This buffer gets used as a ring buffer, with each new frame becoming a source texture; and these frames are then interpolated together (with some GPUs offering fancy motion-interpolations, and others just doing a linear interpolation) according to the fractional time-between-frames that the timing for the center of the dest draw context lands on.
In either case, the goal was to make the video-decode process something that could be demand-driven — where at any point the GPU could say "I'm at the point where I'm drawing the video; give me what the video should look like now" — and get a result that cleanly maps to the frame the video lands on.
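Here's a toy model of option 2 from the list above: a small ring of decoded frames, sampled on demand at a fractional frame position and linearly blended. (Real hardware may do motion-compensated interpolation instead; everything here is illustrative, and the "frames" are just flat lists of pixel values.)

```python
class FrameRing:
    """Ring buffer holding the last few decoded frames."""

    def __init__(self, slots=3):
        self.slots = [None] * slots
        self.newest = -1              # absolute index of the newest decoded frame

    def push(self, frame):
        self.newest += 1
        self.slots[self.newest % len(self.slots)] = frame

    def sample(self, t, video_fps):
        """Demand-driven: 'what should the video look like at time t?'
        Assumes the decoder stays a frame or two ahead of t."""
        pos = t * video_fps           # fractional frame position
        i = int(pos)
        w = pos - i                   # blend weight between frames i and i+1
        a = self.slots[i % len(self.slots)]
        b = self.slots[(i + 1) % len(self.slots)]
        return [(1 - w) * pa + w * pb for pa, pb in zip(a, b)]


ring = FrameRing()
for f in range(3):                    # decoder has produced frames 0, 1, 2
    ring.push([float(f)] * 4)         # dummy 4-pixel "frames"
print(ring.sample(t=1/60, video_fps=24))   # roughly a 60/40 blend of frames 0 and 1
```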
The whole toplevel render pass of early GPUs was designed to operate this way, with final VRAM buffers being "copied" not back into VRAM, but directly into the ring-buffer of the display protocol encoder; and with any final hw-accel passes — like interpolations — generating their output directly into that same display-protocol-encoder ring-buffer. (Or, in the modern day, executing the final compute shader that outputs video tiles only one line-of-tiles at a time, on a vastly sub-frame frequency, as signals like "scanlines matching tiles T50..T80 coming up, get them rendered so they can be read from" are sent by the DPE.)
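A sketch of that demand-driven final stage, with the DPE's raster clock driving fill orders into a narrow ring buffer; all of the dpe.* and gpu.* calls are stand-ins for hardware behaviour, not a real API:

```python
def scanout_one_refresh(dpe, gpu, screen_lines=1080, lines_per_band=16, lead_bands=2):
    """Toy display-protocol encoder pass for a single refresh: keep a
    ring buffer only a few bands tall filled just ahead of the wire."""
    n_bands = screen_lines // lines_per_band
    ring = [None] * (lead_bands + 1)              # much smaller than a full frame

    # Prime the first couple of bands before the wire needs them.
    for b in range(min(lead_bands, n_bands)):
        ring[b % len(ring)] = gpu.render_band(
            first_line=b * lines_per_band,
            n_lines=lines_per_band,
            sample_time=dpe.raster_clock(),       # the DPE's clock is the input
        )

    for band in range(n_bands):
        upcoming = band + lead_bands
        if upcoming < n_bands:
            # Fill order: render the band we'll need shortly, sampling every
            # source draw-context at the DPE's *current* raster-clock time.
            ring[upcoming % len(ring)] = gpu.render_band(
                first_line=upcoming * lines_per_band,
                n_lines=lines_per_band,
                sample_time=dpe.raster_clock(),
            )
        dpe.encode_to_wire(ring[band % len(ring)])  # push the ready band down the wire
```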
This decoupled and demand-driven final copy to the DPE has some interesting properties; it means that you can actually have two or more display-protocol encoders, rendering the same screen, or different parts of the screen — without that implying the need for a lowest-common-multiple frequency. In fact, GPUs are not only happy with multiple displays plugged into them at once — they're happy with those displays even when they have no supported refresh rates in common!
(This is in contrast to the old 2D video cards, which did operate on a single global base frame-raster clock. This meant that 2D video cards were always either single-headed, or only supported a set of monitors all running at the same refresh rate. And if anything wanted to draw to the screen at a different rate, it had to be full-screen, with all the heads switching to that rate.)
I forgot to mention the best part — a trick that modern GPUs can only pull off because of how they feed their DPEs:
Take a modern computer with a modern GPU, and plug the GPU into one 75Hz monitor and one 60Hz monitor. In your OS, arrange the resulting displays to be aligned horizontally. Start an (HW-accelerated) video playing. Now, place that video so it straddles the boundary between the two screens.
When you do this, each DPE will actually be demanding frames from the video draw-context on a different schedule; and so each DPE will actually be getting a snapshot copy of (its slice of) that buffer using a different time-weighted interpolation of two frames! But the result will look perfectly smooth and synchronized!
This synchronization would literally not be possible if the HW draw-context wasn't being snapshotted using each DPE's current rasterization/wire-encoding clock as an input.
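You can check the timing arithmetic behind that in plain Python: each display's refresh instants land on different fractional positions along the 24FPS timeline, so each gets different blend weights, yet both are following the same clock.

```python
VIDEO_FPS = 24.0

def blend(t):
    """Which decoded frame to start from at time t, and the mix weight
    toward the next frame."""
    pos = t * VIDEO_FPS
    return int(pos), round(pos - int(pos), 2)

# Two hypothetical DPEs scanning out their halves of the same video window:
for hz in (60, 75):
    print(hz, [blend(n / hz) for n in range(4)])
# 60 [(0, 0.0), (0, 0.4), (0, 0.8), (1, 0.2)]
# 75 [(0, 0.0), (0, 0.32), (0, 0.64), (0, 0.96)]
# Different per-display interpolation weights, same underlying video
# timeline, so the two halves of the window stay in step to the eye.
```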
The input required for this could — in theory, at least — be synthesized in a pure-functional manner internal to the GPU's compute, by having a final VRAM-to-VRAM copy that is done by a compute-shader with some extra logic to map each pixel position to a simulated DPE's raster-clock time.
But just having a DPE that really does things that way — keeping a physical raster-cycle clock, a "narrow" (much smaller than screen-sized) ring-buffer, and a demand circuit that issues fill orders for the GPU to write lines or tiles into the ring buffer according to the clock — is cheaper even than not synchronizing at all; let alone doing the synchronization through extra VRAM (for the mastering buffer, and for a per-pixel draw-context centroid UV texture) plus an extra compute-shader pass with a virtual raster clock.
> but who's telling it? Is it a driver? And if so, can a driver be made that just doesn't do it?
Yes, but then Microsoft won't have signed the driver, and it won't have the correct hardware root of trust to decrypt the key; so it will fail, and the service will fall back to a less secure mechanism. That's why services will normally limit the resolution of content delivered via those mechanisms to 720p or less.
Question: what stops you from modifying the GPU driver to make it so the command to set a draw-context to be DRM'd is a no-op? That way the GPU would happily let you snapshot the context, and the OS would be none the wiser. Or is GPU driver tampering something NVIDIA explicitly designs against?
EDIT: Oh wait, I bet the driver is signed, and the OS checks it and only allows DRM content to play if the driver is signed by NVIDIA. That would make sense.
> So while you can't capture a screenshot of the decoded frame, you could in theory capture the (encoded) video in RAM.
There was a simpler technique: open a second video player, then screenshot to your heart's content. Windows hardware acceleration for video had only one GPU thread or something like that; I don't remember the exact details.
> there's never a point under normal rendering where your entire screen-contents are flattened into a single (externally-addressable) VRAM texture.
...although there used to be, back at the dawn of GPUs. The first "3D acceleration cards" didn't do any 2D drawing — and so PCs would actually have distinct 2D and 3D video cards!
The oldest 2D video cards for PCs were just persistent framebuffers — their "VRAM" being just for what was on the screen. They relied on the OS to find anything that had been uncovered, and tell the app to repaint it.
But the last and greatest 2D video cards, from before 3D took over, were masters of copying memory around. Each frame, they'd be doing all of these huge accelerated rectangular blits of windows and their controls — some VRAM-to-VRAM, some from main memory using DMA — to bake down a VRAM buffer. They were hardware-accelerated 2D compositors, and they were great at it.
When early 3D cards came along, all that changed at first was that your OS would tell your 2D card that there was a certain region it was to avoid drawing anything on. The 2D card would leave a black rectangle there.
The 2D card would render and rasterize its VRAM to VGA — and the 3D accelerator card would then be passed this same VGA signal, along a cable, as if it was a display. It would mostly just repeat everything it saw. But when the signal hit the timings for where those black rectangles should appear, then it would instead emit its rendering of the 3D content. (In other words: early 3D cards were https://en.wikipedia.org/wiki/Genlock cards!)
A "hardware-accelerated draw context", in this very early conception of the term, was precisely the geometry, defined by your OS, that your 2D card knew to not bother to draw; which was also the geometry that your 3D card knew was its to draw on.
2D and 3D video cards later merged, and at first you just got one card that did both of these things, as separate phases, with the bridging now internal. But the concept of an HW-accel context lived on.
But later on, actual "GPUs" emerged — in the form of 3D hardware that was now sorta-okay at doing VRAM-to-VRAM copies, and also provided legacy VGA BIOS emulation. So people ditched their 2D cards.
But this change actually kind of sucked, because it meant that all the per-frame 2D blitting (which, again, GPUs were just barely passable at) was now contending for GPU time with 3D stuff. And modern OSes had grown highly dependent on being able to do lots of 2D blitting each frame.
This was why "full-screen mode" was, for the longest time, the way to get the best 3D performance out of a GPU, even at the same display resolution. If you tell the GPU that you want to cover the entire screen with a 3D hardware-accelerated draw context... then it will realize that it'd be pointless to keep drawing the "2D world", and so it will stop spending its time each frame doing 2D VRAM copies to render the 2D-composited layer. It takes the whole OS window-drawing VRAM-copy command-queue out of the loop.
(And also — once there was OS support for this — switching to full-screen mode would also cause the GPU to flush out all the VRAM textures for rendered windows, fonts, icons, etc — giving a 3D app or game far more VRAM to work with. Coming out of full-screen mode would likewise ping the OS to reload all the system brushes and textures into VRAM; and then ping all the running apps to redraw themselves.)