When you play a video, you're (usually) sending the encoded video to the GPU for hardware-accelerated decode. The resulting decoded frames then live purely on the GPU.
However, with hardware-accelerated decode, rather than each video frame being decoded into a distinct VRAM texture with its own manipulable GPU handle, you just tell the GPU to allocate a single buffer — a mutable canvas, or "draw context" — and then tell it that, as the chunks of your encoded video stream are played, decoded, and rasterized into frames by the GPU, it should write each successive frame it generates into this buffer.
In this setup, the only thing you can get a GPU handle for (besides the encoded video stream) is that buffer. That buffer is special, being the target of the GPU's own asynchronous hardware-accelerated rendering process; so that buffer is referred to as a "hardware-accelerated draw context."
(These days GPUs have enough VRAM that you really could get away with dumping each frame out to its own immutable VRAM texture buffer; but back in the early '90s when hardware-accelerated video decode was first invented, reusing a single buffer [or maybe a pair of them] for decoding was a practical necessity. Hardware-accelerated draw contexts have been core to how GPUs work since there were GPUs.)
In general, the GPU doesn't allow you to read directly from the VRAM backing a hardware-accelerated draw context (because doing so could block the GPU from writing to it, and also because doing so wouldn't get you a coherent single frame's state out.) The GPU locks that VRAM up and considers it to be "its to use" for as long as the draw context is the target of a hardware-accelerated pipeline. So the GPU will ignore any attempt to use any existing handle you had to the VRAM backing the draw context; only the handle to the wrapping draw-context ADT will work, and that handle doesn't have "read from" as part of its command-set.
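To make the shape of that relationship concrete, here's a minimal sketch in Python. Every name in it (HwAccelDrawContext, bind_decode_target, and so on) is hypothetical, standing in for whatever the real driver stack exposes; the point is just what the handle does and doesn't let you do:

```python
class EncodedStreamHandle:
    """Hypothetical handle to the encoded video stream sitting in GPU memory."""
    pass


class HwAccelDrawContext:
    """Hypothetical handle to a hardware-accelerated draw context.

    The GPU owns the backing VRAM for the lifetime of the context; the
    host only ever talks to it through this wrapper's command-set.
    """

    def __init__(self, width: int, height: int):
        self.width, self.height = width, height
        self._backing_vram = object()   # opaque to the host: no address, no mapping

    def bind_decode_target(self, stream: EncodedStreamHandle) -> None:
        """Tell the GPU: decode successive frames of this stream into this
        buffer, overwriting whatever frame was there before."""
        ...

    # Deliberately absent: read_pixels() / map() / copy_out().
    # "Read from" simply isn't part of this handle's command-set, and any
    # stale raw-VRAM handle you held beforehand is ignored by the GPU.
```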
So how does such a draw context even end up on the screen, then?
When a desktop window compositor sets up the GPU to display your windows, it takes various draw-contexts — some of which may be hardware-accelerated — and lays them out in 3D space, with a (usually orthographic) projection. The GPU itself is then told to take care of the rest (the final toplevel "get this scene out and onto a screen" part) internally. There's never a point under normal GPU rendering, where your entire screen-contents are rendered out to a single (externally-addressable) VRAM texture.
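Reduced to a cartoon, the compositor's side of that hand-off looks roughly like the sketch below. The Layer type and gpu.submit_scene call are made up for illustration; a real compositor would be going through D3D, Vulkan, or Metal:

```python
from dataclasses import dataclass


@dataclass
class Layer:
    context: object   # a draw-context handle; possibly HW-accelerated
    x: int            # position on the desktop
    y: int
    z: float          # stacking order; with an orthographic projection,
                      # z only determines what occludes what


def present_desktop(gpu, layers):
    # Hand the laid-out scene to the GPU and let it do the final
    # "get this onto a screen" step internally. There is no
    # screen-sized texture here that *we* hold a handle to.
    gpu.submit_scene(sorted(layers, key=lambda layer: layer.z))
```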
What happens when you take a screenshot, then? Usually, the compositor is pinged to ask it to render that screenshot; and it does so by allocating a screen-sized VRAM buffer, and then taking each GPU handle in turn and telling the GPU to copy the VRAM underlying that handle into the target buffer. But oh no — HW-accel draw-contexts don't have exposed VRAM handles!
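As a rough sketch of that screenshot path (again with made-up gpu.* calls), you can see exactly where it runs into trouble:

```python
def take_screenshot(gpu, windows, screen_w, screen_h):
    """Naive compositor screenshot: copy each window's backing VRAM
    into one freshly allocated screen-sized buffer."""
    target = gpu.alloc_texture(screen_w, screen_h)
    for win in windows:
        src = win.vram_handle()       # ordinary windows: a readable texture
        if src is None:               # HW-accel draw contexts: no exposed VRAM handle!
            gpu.fill_rect(target, win.rect, rgb=(0, 0, 0))   # black rectangle
            continue
        gpu.copy_texture(src, target, dest_rect=win.rect)
    return gpu.download_to_cpu(target)
```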
It used to be (until... Windows Vista, I think?) that hardware-accelerated draw contexts in general (like videos being played using hardware decoding, or like 3D games) just weren't ever captured in screenshots, DRM or no. This was purely a question of GPU performance: a GPU with an architecture that could snapshot a frame of an HW-accel context wouldn't perform as well as a GPU with an architecture that couldn't.
When this ability was later added, it was done so by adding a feature to the GPU: a command was added on HW-accel draw contexts, that would prompt them to pause whatever rendering is happening into the buffer for a moment, at a coherent point; to snapshot the internal VRAM in that coherent state, copying it over into a non-HW-accelerated buffer; and to then get the accelerated rendering going again. The compositor therefore now had the ability to ask such draw-contexts for snapshots, to then be re-copied out into a WIP screenshot.
But I emphasize "ask" here — you really are just asking the GPU to do it. The GPU can always just refuse! (OSes had to support GPUs refusing to do it, because many GPUs were old enough that they couldn't manage to do it, even given updated drivers that allowed software to ask the GPU the question; even given newer firmware pushed to the GPU by said drivers.)
So DRMed video was implemented simply by telling the GPU that there are some HW-accel draw-contexts that are special, and so the GPU should refuse any requests to snapshot them. (And also, telling the GPU that the encoded video itself is also special — so the GPU should taint any draw-contexts it decodes that video onto with the same specialness.)
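Put together, the policy amounts to something like this sketch. The protected flag and the method names are illustrative, not any vendor's actual interface:

```python
class ProtectedStream:
    def __init__(self, protected: bool):
        self.protected = protected    # set when the stream arrived under DRM


class HwAccelContext:
    def __init__(self):
        self.protected = False        # the "specialness" taint bit

    def decode_into(self, stream: ProtectedStream) -> None:
        # Taint propagation: decoding a protected stream onto this
        # context makes the context itself protected.
        if stream.protected:
            self.protected = True
        ...  # kick off the hardware decode

    def request_snapshot(self):
        # Snapshotting is a *request*; the GPU is free to refuse.
        if self.protected:
            return None               # refusal: the compositor gets a black rect
        return self._pause_copy_resume()

    def _pause_copy_resume(self):
        ...  # pause decode at a coherent frame, copy VRAM out, resume
```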
I believe that DRMed video streams get decrypted on the CPU before being sent to the GPU. So while you can't capture a screenshot of the decoded frame, you could in theory capture the (encoded) video in RAM. Sadly, that RAM is protected by the OS and the decoding happens using Intel SGX or something. So it's impossible to "get in under" the OS to grab the encoded video, either. (Unless you compile a custom kernel... but that's why these extensions have an in-kernel component that doesn't ship with the Windows or macOS DDKs, and why Linux doesn't even get DRMed-video support except in proprietary spins.)
If I thought it would be appropriate, there’d be a :hands-clapping: emoji here thanking you for that infodump (and the reply). That’s a great and useful explanation that’s easily modeled mentally. (“easily” lol)
I think at least the last bit is incorrect: DRM-protected video is still often decoded by a hardware decoder built into the GPU.
I am also in doubt about the claim that there's never a single VRAM texture that contains the entire desktop: while I can see ways for the GPU to assemble that frame on the fly as it sends it to the monitor, that sounds implausible compared to just reading it from a premade buffer.
> I think at least the last bit is incorrect: DRM-protected video is still often decoded by a hardware decoder built into the GPU.
I misspoke, sorry — the last paragraph is meant to say that the CPU "decrypts", not "decodes", the video (inside the Intel SGX enclave.)
> I am also in doubt about the claim that there's never a single VRAM texture that contains the entire desktop
Some GPUs do work this way internally, but my phrasing was specific — even when such a VRAM texture buffer exists, it's not addressable outside the GPU itself. You can't get a handle to it. (And even if you could, it'd just be another HW-accel draw context — and so would be DRM-tainted if-and-when any buffer being copied onto it is considered tainted.)
But why wouldn't you want a GPU to work this way? What's the advantage of not having some kind of "final screen buffer"?
It mostly comes down to a quirk of history. Movies are 24FPS, while computer CRTs were mostly designed for 60Hz refresh rates (with no support for 24Hz modes.) And when people see "hardware-accelerated video playback" as a feature on the box of a video card, one of the primary things they expect that to mean is that you should be able to play back your 24FPS movie on the 60Hz display. In windowed mode. With the desktop being refreshed behind the 24FPS movie. Smoothly.
If videos were mostly 30FPS, this would be easy, because 30Hz divides 60Hz; each video frame could just be copied to the 60Hz "final screen buffer" twice with the same content. But 24Hz doesn't cleanly divide 60Hz — you'd need each video frame to appear for 2.5 "final screen buffers." (And just doing that naively — alternating 2 and 3 repeats, i.e. 2:3 pulldown — is very stuttery.)
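The mismatch is easy to see with a few lines of plain Python: if each display refresh just snaps to the most recently decoded frame, source frames get held for alternating runs of 3 and 2 refreshes, which is exactly the judder described above.

```python
def naive_pulldown(video_fps=24, refresh_hz=60, refreshes=10):
    """Which source frame does each display refresh show, if we just
    repeat the most recently decoded frame?"""
    return [int(n * video_fps / refresh_hz) for n in range(refreshes)]

print(naive_pulldown())
# [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]  -> frames held for 3, 2, 3, 2, ... refreshes
```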
If you're doing this by trying to render out a "final screen buffer" VRAM texture in lockstep with your display frequency, all in advance of sending that texture down the pipe to the display, then for each frame, you need to complete that rendering pass with enough time left to communicate it to the display. But what happens when it takes a really long time to send the signal down the wire, leaving you only a little time each frame to actually render everything? In fact, what happens when the display is analogue — like a CRT — where the timing of sending the data down the wire is directly related to the timing of drawing?
Well, certainly, you can pipeline this final rendering — i.e. make everything show up one frame behind, so that you can work on each frame for an extra entire frame-time. But gamers won't like you messing with their draw-contexts that way. And gamers are the core audience for your product.
And also, that still doesn't let you render the video in windowed mode "smoothly." It just means that you now have slightly less than one full frame-time, each frame, to do the concurrent work of decoding... 2.5 (or some even worse multiple) frames of video, and then doing something with them to squeeze it down to one frame.
And also, how should an active hw-accel video-decode draw context react to the OS changing the display resolution to one that uses a different frequency? Or worse — though not invented at the time — to a Variable Refresh Rate monitor, that can be arbitrarily sent frames using FreeSync?
What the GPU vendors decided to do to make this work was to require one of two things — either:
1. that the algorithm of the video codec directly support decoding of a fractional frame position; or
2. that the creator of the draw-context pass it a VRAM buffer with enough space for not just one, but several (usually specifically three) full extracted frames. This buffer gets used as a ring buffer, with each new frame becoming a source texture; and these frames are then interpolated together (with some GPUs offering fancy motion-interpolations, and others just doing a linear interpolation) according to the fractional time-between-frames that the timing for the center of the dest draw context lands on.
In either case, the goal was to make the video-decode process something that could be demand-driven — where at any point the GPU could say "I'm at the point where I'm drawing the video; give me what the video should look like now" — and get a result that cleanly maps to the frame the video lands on.
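Here's a toy model of option 2 from the list above: a small ring of decoded frames, sampled on demand at a fractional frame position and linearly blended. (Real hardware may do motion-compensated interpolation instead; everything here is illustrative, and the "frames" are just flat lists of pixel values.)

```python
class FrameRing:
    """Ring buffer holding the last few decoded frames."""

    def __init__(self, slots=3):
        self.slots = [None] * slots
        self.newest = -1              # absolute index of the newest decoded frame

    def push(self, frame):
        self.newest += 1
        self.slots[self.newest % len(self.slots)] = frame

    def sample(self, t, video_fps):
        """Demand-driven: 'what should the video look like at time t?'
        Assumes the decoder stays a frame or two ahead of t."""
        pos = t * video_fps           # fractional frame position
        i = int(pos)
        w = pos - i                   # blend weight between frames i and i+1
        a = self.slots[i % len(self.slots)]
        b = self.slots[(i + 1) % len(self.slots)]
        return [(1 - w) * pa + w * pb for pa, pb in zip(a, b)]


ring = FrameRing()
for f in range(3):                    # decoder has produced frames 0, 1, 2
    ring.push([float(f)] * 4)         # dummy 4-pixel "frames"
print(ring.sample(t=1/60, video_fps=24))   # roughly a 60/40 blend of frames 0 and 1
```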
The whole toplevel render pass of early GPUs was designed to operate this way, with final VRAM buffers being "copied" not back into VRAM, but directly into the ring-buffer of the display protocol encoder; and with any final hw-accel passes — like interpolations — generating their output directly into that same display-protocol-encoder ring-buffer. (Or, in the modern day, executing the final compute shader that outputs video tiles only one line-of-tiles at a time, on a vastly sub-frame frequency, as signals like "scanlines matching tiles T50..T80 coming up, get them rendered so they can be read from" are sent by the DPE.)
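A sketch of that demand-driven final stage, with the DPE's raster clock driving fill orders into a narrow ring buffer; all of the dpe.* and gpu.* calls are stand-ins for hardware behaviour, not a real API:

```python
def scanout_one_refresh(dpe, gpu, screen_lines=1080, lines_per_band=16, lead_bands=2):
    """Toy display-protocol encoder pass for a single refresh: keep a
    ring buffer only a few bands tall filled just ahead of the wire."""
    n_bands = screen_lines // lines_per_band
    ring = [None] * (lead_bands + 1)              # much smaller than a full frame

    # Prime the first couple of bands before the wire needs them.
    for b in range(min(lead_bands, n_bands)):
        ring[b % len(ring)] = gpu.render_band(
            first_line=b * lines_per_band,
            n_lines=lines_per_band,
            sample_time=dpe.raster_clock(),       # the DPE's clock is the input
        )

    for band in range(n_bands):
        upcoming = band + lead_bands
        if upcoming < n_bands:
            # Fill order: render the band we'll need shortly, sampling every
            # source draw-context at the DPE's *current* raster-clock time.
            ring[upcoming % len(ring)] = gpu.render_band(
                first_line=upcoming * lines_per_band,
                n_lines=lines_per_band,
                sample_time=dpe.raster_clock(),
            )
        dpe.encode_to_wire(ring[band % len(ring)])  # push the ready band down the wire
```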
This decoupled and demand-driven final copy to the DPE has some interesting properties; it means that you can actually have two or more display-protocol encoders, rendering the same screen, or different parts of the screen — without that implying the need for a lowest-common-multiple frequency. In fact, GPUs are not only happy with multiple displays plugged into them at once — they're happy with those displays even when they have no supported refresh rates in common!
(This is in contrast to the old 2D video cards, which did operate on a single global base frame-raster clock. This meant that 2D video cards were always either single-headed, or only supported a set of monitors all running at the same refresh rate. And if anything wanted to draw to the screen at a different rate, it had to be full-screen, with all the heads switching to that rate.)
I forgot to mention the best part — a trick that modern GPUs can only pull off because of how they feed their DPEs:
Take a modern computer with a modern GPU, and plug the GPU into one 75Hz monitor and one 60Hz monitor. In your OS, arrange the resulting displays to be aligned horizontally. Start an (HW-accelerated) video playing. Now, place that video so it straddles the boundary between the two screens.
When you do this, each DPE will actually be demanding frames from the video draw-context on a different schedule; and so each DPE will actually be getting a snapshot copy of (its slice of) that buffer using a different time-weighted interpolation of two frames! But the result will look perfectly smooth and synchronized!
This synchronization would literally not be possible if the HW draw-context wasn't being snapshotted using each DPE's current rasterization/wire-encoding clock as an input.
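You can check the timing arithmetic behind that in plain Python: each display's refresh instants land on different fractional positions along the 24FPS timeline, so each gets different blend weights, yet both are following the same clock.

```python
VIDEO_FPS = 24.0

def blend(t):
    """Which decoded frame to start from at time t, and the mix weight
    toward the next frame."""
    pos = t * VIDEO_FPS
    return int(pos), round(pos - int(pos), 2)

# Two hypothetical DPEs scanning out their halves of the same video window:
for hz in (60, 75):
    print(hz, [blend(n / hz) for n in range(4)])
# 60 [(0, 0.0), (0, 0.4), (0, 0.8), (1, 0.2)]
# 75 [(0, 0.0), (0, 0.32), (0, 0.64), (0, 0.96)]
# Different per-display interpolation weights, same underlying video
# timeline, so the two halves of the window stay in step to the eye.
```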
The input required for this could — in theory, at least — be synthesized in a pure-functional manner internal to the GPU's compute, by having a final VRAM-to-VRAM copy that is done by a compute-shader with some extra logic to map each pixel position to a simulated DPE's raster-clock time.
But just having a DPE that really does things that way — keeping a physical raster-cycle clock, a "narrow" (much smaller than screen-sized) ring-buffer, and a demand circuit that issues fill orders for the GPU to write lines or tiles into the ring buffer according to the clock — is cheaper even than not synchronizing at all; let alone doing the synchronization through extra VRAM (for the mastering buffer, and for a per-pixel draw-context centroid UV texture) plus an extra compute-shader pass with a virtual raster clock.
> but who's telling it? Is it a driver? And if so, can a driver be made that just doesn't do it?
Yes, but then Microsoft won't have signed the driver, and it won't have the correct hardware root of trust to decrypt the key; so it will fail, and the service will fall back to a less secure mechanism. That's why services will normally limit the resolution of content delivered via those mechanisms to 720p or less.
Question: what stops you from modifying the GPU driver to make it so the command to set a draw-context to be DRM'd is a no-op? That way the GPU would happily let you snapshot the context, and the OS would be none the wiser. Or is GPU driver tampering something NVIDIA explicitly designs against?
EDIT: Oh wait, I bet the driver is signed, and the OS checks it and only allows DRM content to play if the driver is signed by NVIDIA. That would make sense.
> So while you can't capture a screenshot of the decoded frame, you could in theory capture the (encoded) video in RAM.
There was a simpler technique: open a second video player, then screenshot to your heart's content. Windows hardware acceleration for video had only one GPU thread or something like that; I don't remember the exact details.
> there's never a point under normal rendering where your entire screen-contents are flattened into a single (externally-addressable) VRAM texture.
...although there used to be, back at the dawn of GPUs. The first "3D acceleration cards" didn't do any 2D drawing — and so PCs would actually have distinct 2D and 3D video cards!
The oldest 2D video cards for PCs were just persistent framebuffers — their "VRAM" being just for what was on the screen. They relied on the OS to find anything that had been uncovered, and tell the app to repaint it.
But the last and greatest 2D video cards, from before 3D took over, were masters of copying memory around. Each frame, they'd be doing all of these huge accelerated rectangular blits of windows and their controls — some VRAM-to-VRAM, some from main memory using DMA — to bake down a VRAM buffer. They were hardware-accelerated 2D compositors, and they were great at it.
When early 3D cards came along, all that changed at first was that your OS would tell your 2D card that there was a certain region it was to avoid drawing anything on. The 2D card would leave a black rectangle there.
The 2D card would render and rasterize its VRAM to VGA — and the 3D accelerator card would then be passed this same VGA signal, along a cable, as if it was a display. It would mostly just repeat everything it saw. But when the signal hit the timings for where those black rectangles should appear, then it would instead emit its rendering of the 3D content. (In other words: early 3D cards were https://en.wikipedia.org/wiki/Genlock cards!)
A "hardware-accelerated draw context", in this very early conception of the term, was precisely the geometry, defined by your OS, that your 2D card knew to not bother to draw; which was also the geometry that your 3D card knew was its to draw on.
2D and 3D video cards later merged, and at first you just got one card that did both of these things, as separate phases, with the bridging now internal. But the concept of an HW-accel context lived on.
But later on, actual "GPUs" emerged — in the form of 3D hardware that was now sorta-okay at doing VRAM-to-VRAM copies, and also provided legacy VGA BIOS emulation. So people ditched their 2D cards.
But this change actually kind of sucked, because it meant that all the per-frame 2D blitting (which, again, GPUs were just barely passable at) was now contending for GPU time with 3D stuff. And modern OSes had grown highly dependent on being able to do lots of 2D blitting each frame.
This was why "full-screen mode" was, for the longest time, the way to get the best 3D performance out of a GPU, even at the same display resolution. If you tell the GPU that you want to cover the entire screen with a 3D hardware-accelerated draw context... then it will realize that it'd be pointless to keep drawing the "2D world", and so it will stop spending its time each frame doing 2D VRAM copies to render the 2D-composited layer. It takes the whole OS window-drawing VRAM-copy command-queue out of the loop.
(And also — once there was OS support for this — switching to full-screen mode would also cause the GPU to flush out all the VRAM textures for rendered windows, fonts, icons, etc — giving a 3D app or game far more VRAM to work with. Coming out of full-screen mode would likewise ping the OS to reload all the system brushes and textures into VRAM; and then ping all the running apps to redraw themselves.)