I wonder how this is implemented in the GPU. From my time working on a 3D renderer a long time ago, triangles with offscreen vertices would be clipped into smaller triangles, so in the end you would still be rendering multiple triangles anyway. I imagine it would also be possible to clip the scanlines instead.
Actual clipping is expensive, so indeed a "guard band" is used: as long as the triangle fits inside the region allowed by the internal precision of the rasterizer it isn't clipped, and the offscreen pixels are simply ignored.
I haven't seen it in years, but the cause is actually the limitations of float precision and/or the way the developer has set up the world/view matrices.
E.g. your shaders operate at float16 precision (common in the D3D9 days) on a screen with a high enough resolution. The boundary pixels at the diagonal now can't fit into the precision of a float16 cleanly.
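To make that concrete, here's a minimal sketch in plain C (assuming a compiler with _Float16 support, e.g. recent GCC or Clang): half floats only have enough mantissa to represent every integer up to 2048, so pixel coordinates past that start snapping to even values.

    #include <stdio.h>

    int main(void) {
        /* float16 has a 10-bit mantissa: integers above 2048 round to
           the nearest multiple of 2, so on a 2560-wide screen the
           rightmost columns can't be addressed exactly. */
        for (int x = 2047; x <= 2051; ++x) {
            _Float16 h = (_Float16)x;   /* round x into half precision */
            printf("%d -> %g\n", x, (double)h);
        }
        return 0;
    }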
Mmmh, rather, an easy way to get that bug due to float precision is when the vertex positions are the result of some maths. Then two vertices which should mathematically be at the same position will end up not exactly at the same position in practice due to float precision. The only ways to have two vertices at the exact same position are hardcoding them, assigning the position of one to the other, or using index buffers.
An easy thing to miss.
That wouldn't be a problem in this case though? The vertices of 2 fullscreen triangles can be passed to the GPU using hardcoded values without any precision error (integers below a certain value can be represented precisely using floats).
That wouldn't be a problem as long as the vertex positions go through exactly the same transformations, as that ends with them accumulating the same rounding errors. Assuming that nobody was messing with floating point rounding behavior at runtime or flat out decided to enable fast-math and live with the resulting mess.
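A caricature of that in plain C (made-up numbers, nothing to do with any real renderer): the same sum evaluated in a different order can round differently, and for shared-edge vertices coming from two different transform chains that's enough to open a crack.

    #include <stdio.h>

    int main(void) {
        float big = 16777216.0f;        /* 2^24: above this, float32 can no
                                           longer represent every integer */
        float v1 = (big + 1.0f) + 1.0f; /* each +1 is individually rounded away */
        float v2 = big + (1.0f + 1.0f); /* the +2 survives */

        printf("v1 = %.1f\nv2 = %.1f\nequal: %s\n",
               v1, v2, v1 == v2 ? "yes" : "no");
        return 0;
    }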
Nearly everything done with floating point would be better done another way...
But the reality is that floating point numbers 'just work' for a huge number of things, and the space and compute inefficiency is small enough to not worry about.
In actual hardware, shading is done 32 or 64 pixels at a time, not four. The problem above just gets worse.
While it's true that there are "wasted" lanes of execution in 2x2 quads for derivative computation, it's absolutely not the case that all lanes of a hardware thread (warp / wavefront) have to come from the same triangle. That would be insanely inefficient.
I don't think that it's publicly documented how the "packing" of quads into lanes is done in the rasterizer for modern GPUs. I'd guess something opportunistic (maybe per tile), taking advantage of the general spatial coherency of triangles in mesh order.
> it's absolutely not the case that all lanes of a hardware thread (warp / wavefront) have to come from the same triangle. That would be insanely inefficient
I am no GPU expert, but I performed some experiments a while ago indicating that this is in fact how it works, at least on nvidia.
I would expect it simplifies the fragment processing pipeline to have all the interpolants come from the same triangle. Another factor that comes to mind is that, due to the 2x2 quad-padding, you would end up with multiple shader executions corresponding to the same pixel location, coming from different triangles; that would probably involve complicated bookkeeping. Especially given MSAA.
It would be interesting to see how you were testing for that, because at least on AMD it's fairly certain that a single thread can be shading multiple primitives.
For example, from the ISA docs [1], pixel waves are preloaded with an SGPR containing a bit mask indicating just that:
> The new_prim_mask is a 15-bit mask with one bit per quad; a one in this mask indicates that this quad begins a new primitive, a zero indicates it uses the same primitive as the previous quad. The mask is 15 bits, not 16, since the first quad in a wavefront begins a new primitive and so it is not included in the mask
The mask is used by the interp instructions to load the correct interpolants from local memory.
In fact, in the (older) GCN3 docs [2] there is a diagram showing the memory layout of attributes from multiple primitives for a single wavefront (page 99).
That being said, of course I expect this process to be "lazy": you would not want to buffer execution of a partially filled thread forever, so depending on the workload you might measure different things.
I think I drew that conclusion from the following: I rendered a full-screen quad using two triangles, and made each fragment display a hash of its threadgroup id. Most threadgroups were arranged in nice, aligned 4x8 rectangles, but near the boundary they became morphed and distorted so they could stay within the same triangle. That said, it occurs to me now that this could be an opportunistic thing; I am going to try to repeat that experiment, but with many triangles which are all smaller than a single threadgroup.
The linked AMD guide seems to suggest the author is correct:
> Because the quad is rendered using two separate triangles, separate wavefronts are generated for the pixel work associated with each of those triangles. Some of the pixels near the boundary separating those triangles end up being organized into partial wavefronts
Triangles in 3D Euclidean space have the advantage of being uniquely defined by 3 vertices that have the nice property of always sitting on a single plane. There's no linear transformation you could apply that results in something other than a well-defined, planar triangle. The worst that might happen is collapsing it into a line or a point.
The same cannot be said for quads. If you e.g. have 4 vertices where 3 are sitting on the same plane but the 4th isn't, there are two different possible ways to subdivide this into triangles, in addition to all the possible non-linear surfaces one might imagine occupying that space.
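A tiny numeric illustration in plain C (made-up vertices): lift one corner of a quad off the plane and the two possible triangulations describe different surfaces; here they don't even agree on where the centre of the quad is.

    #include <stdio.h>

    typedef struct { float x, y, z; } Vec3;

    /* Midpoint of the chosen diagonal: the surface point that sits over
       the centre of the quad for that triangulation. */
    static Vec3 midpoint(Vec3 a, Vec3 b) {
        return (Vec3){ (a.x + b.x) * 0.5f, (a.y + b.y) * 0.5f, (a.z + b.z) * 0.5f };
    }

    int main(void) {
        /* p2 is lifted off the z = 0 plane, so the quad is non-planar */
        Vec3 p0 = {0, 0, 0}, p1 = {1, 0, 0}, p2 = {1, 1, 1}, p3 = {0, 1, 0};

        Vec3 c02 = midpoint(p0, p2);  /* split along diagonal p0-p2 */
        Vec3 c13 = midpoint(p1, p3);  /* split along diagonal p1-p3 */

        printf("centre with diagonal p0-p2: z = %.2f\n", c02.z);  /* 0.50 */
        printf("centre with diagonal p1-p3: z = %.2f\n", c13.z);  /* 0.00 */
        return 0;
    }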
Old OpenGL and Direct3D versions did have quad and arbitrary polygon rendering support. IIRC from vague memory, the above scenario was something you were supposed to avoid and results varied between graphics card vendors.
On a side note: If we are talking purely 2D, some old 2D software rendering systems used trapezoids as basic primitives, defined by their top and bottom edges. For instance, the X11 XRender API explicitly supports drawing trapezoids. They are easy to rasterize, interpolate across and quite flexible. Many other 2D shapes can be conveniently composed from them, including screen space triangles, if you were to implement a software rasterizer.
The OpenGL specification basically had no explanation of what a "Quad" was; it didn't say anything about how to render quads, interpolate quads, etc. Pretty much every driver implemented quads as a funky way of spelling "two triangles". Much like tristrips/trifans/etc., index buffers give you all of the benefits with none of the drawbacks.
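For what it's worth, the indexed spelling of a quad is just this (plain C arrays; how you upload them depends on the API): four unique vertices, six indices, and the shared corners are bit-identical because they're literally the same vertex.

    #include <stdint.h>

    static const float quad_vertices[4][2] = {
        { -1.0f, -1.0f },   /* 0: bottom-left  */
        {  1.0f, -1.0f },   /* 1: bottom-right */
        {  1.0f,  1.0f },   /* 2: top-right    */
        { -1.0f,  1.0f },   /* 3: top-left     */
    };

    static const uint16_t quad_indices[6] = {
        0, 1, 2,    /* first triangle  */
        2, 3, 0,    /* second triangle */
    };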
> For instance, the X11 XRender API explicitly supports drawing trapezoids.
It should be said XRender trapezoids have extra constraints: the top and bottom edges of the trapezoid must be completely straight (go exactly left to right). So it's basically a rectangle with sloped left/right sides. This makes it extremely easy to software rasterize, while you can still break each trapezoid into two triangles for 3D hardware.
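If memory serves, the declaration in X11/extensions/Xrender.h makes that constraint visible: top and bottom are plain y coordinates (so those edges are horizontal by construction), while left and right are arbitrary lines. Roughly:

    typedef int XFixed;   /* 24.8 fixed point */

    typedef struct { XFixed x, y; }        XPointFixed;
    typedef struct { XPointFixed p1, p2; } XLineFixed;

    typedef struct {
        XFixed     top, bottom;  /* y of the horizontal top/bottom edges */
        XLineFixed left, right;  /* the two (possibly sloped) side edges */
    } XTrapezoid;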
> It should be said XRender trapezoids have extra constraints: the top and bottom edges of the trapezoid must be completely straight (go exactly left to right). So it's basically a rectangle with sloped left/right sides.
Isn't that just the definition of a trapezoid? Do you mean the top and bottom have to be parallel with the x axis or something?
Of course you're right, but I was thinking about a subset of features for quads for this kind of use case. But yeah, for all features, triangles are better!
> Old OpenGL and Direct3D versions did have quad and arbitrary polygon rendering support.
I vaguely recall that at different times people tried to build both PC and phone graphics hardware around quads and basically failed, but I don’t know why that didn’t work (aside from the issues of geometry that you mention).
Importantly, the Sega Saturn kind of wasn't a true 3D engine that had quads as its primitive, but rather a stamp/sprite based system that allowed you to skew, warp, and transform those rectangular images in very powerful ways, mostly being able to reproduce 3D scenes but with certain caveats about the "polygons" that caused some limitations. One good example was that the system has no concept of UV maps. Each polygon is exactly the whole texture, linearly warped. This means you can't do, for example, environment mapping natively.
A triangle is the simplest polygon. As soon as you start doing quads the complexity skyrockets: what if it self-intersects? What if it's concave? What does a winding order mean when the verts can be all out of order?
Triangles are used to render 3D meshes because of that, yes.
However, the article is about "Full screen post processing effects", which are more easily done with a rectangle.
Do you mean a bounding rectangle? That exists on a lot of GPUs, where you draw a triangle and the entire bounding box surrounding the triangle is covered. However, the details differ greatly between different GPUs, meaning it's hard to standardize. Full-screen triangles are just fine.
> In my microbenchmark1 the single triangle approach was 0.2% faster than two.
Sounds like something that would be within the margin of error? Seems especially meaningless because it's just the average of the timings, instead of something that would visualize the distribution, like a histogram or KDE.
Probably quite far outside the margin of error given the number of runs and the simplicity of the test, but you'd need the measured variance to be sure. The AMD study goes into more metrics and the whys of it being faster.
This is interesting, but also wouldn't the texture mapping / UVs be more confusing and possibly outweigh the benefit of micro-optimisation?
The good thing about having 4 vertices is that you can just use a vertex position and a set of texture coordinates (x,y) on each one, and the texture can just be mapped exactly.
What UVs would you use for full-screen effects? Typically you are only interested in the screen position to sample relevant buffers, i.e. gl_FragCoord or interpolated vertex positions depending on what scale you want.
That’s true - not so much UVs, but texture coords I find are useful for quickly flipping the screen horizontally or vertically - so storing both a vertex position and an x,y value in 0-1 for sampling a texture. I guess you could just interpolate the vertex positions directly, it's just less intuitive with a large triangle.
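For what it's worth, the usual single-triangle trick derives both the clip-space position and the UV from the vertex index (gl_VertexID in GLSL); here's the same arithmetic sketched in plain C so you can see where the numbers come from.

    #include <stdio.h>

    int main(void) {
        for (int id = 0; id < 3; ++id) {
            /* uv = ((id << 1) & 2, id & 2); pos = uv * 2 - 1 */
            float u = (float)((id << 1) & 2);   /*  0,  2,  0 */
            float v = (float)(id & 2);          /*  0,  0,  2 */
            float x = u * 2.0f - 1.0f;          /* -1,  3, -1 */
            float y = v * 2.0f - 1.0f;          /* -1, -1,  3 */
            printf("vertex %d: pos = (%4.1f, %4.1f)  uv = (%3.1f, %3.1f)\n",
                   id, x, y, u, v);
        }
        return 0;
    }

The visible part of the screen maps to uv in 0..1; the overhang past the edges carries uv up to 2 but gets clipped or scissored away, and flipping the screen is just flipping the UV.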
Even desktop GPUs have used tiled rendering since the Maxwell generation on Nvidia, and I forget which gen for AMD. I don't see how it's possible for many triangles to be faster than one for fullscreen rendering.
Be patient. One of the reasons complaining about downvotes is discouraged is that it's often transient, a comment at score 1 is easily punted down to 0, but usually drifts back up.