IHV Business strategies and consumer choice

One of the bigger problems with the graphics pipeline with respect to micro-triangle geometry is that GPUs shade 2x2 quads to compute the LoDs for texture sampling. If a triangle only covers a single pixel (one active lane) within a 2x2 quad, the hardware still has to shade the 3 uncovered pixels (helper lanes) if the user wants the correct LoDs ...

I think the visibility buffer they used with deferred materials is supposed to improve quad utilization in the case of small triangles, somehow. I don't quite get it. They have 1 draw call per material to write directly to a G-buffer for deferred lighting/shading, and there's only one material per pixel, I guess. I read a blog post once that compared quad utilization across forward, deferred and visibility rendering, and in the visibility case shader invocations for materials and lighting stayed at 1 per pixel each, even with small triangles. Not sure if that approach was the same as UE5 though.
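To make "one invocation per pixel" concrete, here is a minimal sketch of what a visibility-buffer texel could look like (a hypothetical layout of my own, not necessarily what UE5 stores): one packed value per pixel holding a triangle ID plus quantized barycentrics, which a later per-material pass can unpack to reconstruct whatever attributes it needs.

```cpp
// Hypothetical visibility-buffer texel: 32-bit triangle ID plus two
// 16-bit quantized barycentric coordinates (W = 1 - U - V).
#include <cmath>
#include <cstdint>
#include <cstdio>

struct VisSample {
    uint32_t triangleId;    // which triangle covers this pixel
    float    baryU, baryV;  // barycentric coordinates in [0,1]
};

// Pack into a single 64-bit value, one per pixel of the visibility buffer.
uint64_t packVisSample(const VisSample& s) {
    uint64_t u = (uint64_t)std::lround(s.baryU * 65535.0f);
    uint64_t v = (uint64_t)std::lround(s.baryV * 65535.0f);
    return ((uint64_t)s.triangleId << 32) | (u << 16) | v;
}

// The material pass runs once per pixel and unpacks what it needs.
VisSample unpackVisSample(uint64_t p) {
    VisSample s;
    s.triangleId = (uint32_t)(p >> 32);
    s.baryU = ((p >> 16) & 0xFFFF) / 65535.0f;
    s.baryV = (p & 0xFFFF) / 65535.0f;
    return s;
}

int main() {
    VisSample s{12345, 0.25f, 0.5f};
    VisSample r = unpackVisSample(packVisSample(s));
    std::printf("id=%u u=%.4f v=%.4f\n", r.triangleId, r.baryU, r.baryV);
}
```

The point of such an encoding is that material and lighting work no longer depends on triangle size at all; quad occupancy only matters for the (cheap) pass that writes the buffer.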
 
This raises two questions for me:
If we don't use texture LOD, can any or all modern GPUs shade one sub-pixel triangle per lane? I don't think so, but your formulation makes it sound that way.
Do helper lanes only help with derivatives, but are then marked inactive to bypass things like texture fetches? I've thought so, but again your use of 'shade' gives me doubts.
To answer the first question, a GPU software rasterizer can shade single-pixel triangles without the extra "helper lanes", and if you want accurate texture LoDs, you're going to have to use other methods, such as analytic techniques, or store the necessary attributes in a visibility buffer/G-buffer.

For the second question, the GPU hardware rasterizer exclusively generates gradients/partial derivatives for 2x2 quads, so you're basically forced to shade the non-visible pixel helper lanes in lockstep with the active lanes until texture sampling happens if you want to rely on hardware derivatives to automatically get the texture LoDs.
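As an illustration of the "analytic techniques" route, here is a rough sketch of my own (perspective correction omitted for brevity, positions assumed to already be in pixel coordinates): since UVs vary affinely across the projected triangle, their screen-space gradients can be derived once from the three vertices, and the mip level follows directly with no 2x2 quad involved.

```cpp
// Sketch: analytic texture LOD from triangle vertex data instead of the
// hardware's 2x2-quad finite differences.
#include <algorithm>
#include <cmath>
#include <cstdio>

struct Vec2 { float x, y; };

float analyticTextureLod(Vec2 p0, Vec2 p1, Vec2 p2,   // screen positions (pixels)
                         Vec2 t0, Vec2 t1, Vec2 t2,   // UVs in [0,1]
                         float texW, float texH)
{
    // Screen-space edge vectors; their cross product is twice the signed area.
    float ex1 = p1.x - p0.x, ey1 = p1.y - p0.y;
    float ex2 = p2.x - p0.x, ey2 = p2.y - p0.y;
    float inv = 1.0f / (ex1 * ey2 - ex2 * ey1);

    // UV deltas along those edges.
    float du1 = t1.x - t0.x, dv1 = t1.y - t0.y;
    float du2 = t2.x - t0.x, dv2 = t2.y - t0.y;

    // Constant gradients of the affine UV mapping, scaled to texel units.
    float dudx = (du1 * ey2 - du2 * ey1) * inv * texW;
    float dudy = (du2 * ex1 - du1 * ex2) * inv * texW;
    float dvdx = (dv1 * ey2 - dv2 * ey1) * inv * texH;
    float dvdy = (dv2 * ex1 - dv1 * ex2) * inv * texH;

    // Same rule as hardware mip selection: LOD from the larger gradient.
    float rho2 = std::max(dudx * dudx + dvdx * dvdx, dudy * dudy + dvdy * dvdy);
    return 0.5f * std::log2(rho2);
}

int main() {
    // A ~2-pixel triangle mapping the full UV range of a 1024x1024 texture
    // lands deep in the mip chain, as expected for a micro triangle.
    float lod = analyticTextureLod({0, 0}, {2, 0}, {0, 2},
                                   {0, 0}, {1, 0}, {0, 1}, 1024, 1024);
    std::printf("lod = %.2f\n", lod);   // prints 9.00
}
```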
 
I think the visibility buffer they used with deferred materials is supposed to improve quad utilization in the case of small triangles, somehow. I don't quite get it. They have 1 draw call per material to write directly to a G-buffer for deferred lighting/shading, and there's only one material per pixel, I guess. I read a blog post once that compared quad utilization across forward, deferred and visibility rendering, and in the visibility case shader invocations for materials and lighting stayed at 1 per pixel each, even with small triangles. Not sure if that approach was the same as UE5 though.

Nanite sidesteps the quad problem completely. All calculations are done independently for each pixel, including the derivatives needed for texture sampling.
 
To answer the first question, a GPU software rasterizer can shade single-pixel triangles without the extra "helper lanes", and if you want accurate texture LoDs, you're going to have to use other methods, such as analytic techniques, or store the necessary attributes in a visibility buffer/G-buffer.
Thanks, but I'm wondering about something else for both questions.

Say we want to render a visibility buffer, just containing triangle ID and barycentric coords. And we use heavy upscaling too, so almost every triangle becomes a 'micro triangle' due to our small render resolution.
In this case we use neither UV coords nor texture LOD.
But I assume the rasterizer still spends the same number of helper lanes even if we don't need them?

It always does, no matter what we actually do in our shaders, because those 2x2 quads are currently the one and only granularity the HW rasterizer can work with, I guess.

For the second question, the GPU hardware rasterizer exclusively generates gradients/partial derivatives for 2x2 quads, so you're basically forced to shade the non-visible pixel helper lanes in lockstep with the active lanes
Yes, but assume we do a lot of memory access, e.g. sampling many shadow maps to shade our pixel.
The helper lanes then execute the same instructions as the active lanes, but they should not read or write any memory. So the bogus texture fetches of helper lanes should not waste any bandwidth, ideally.

I guess I'm right with those assumptions, but I do not know for sure.
 
Thanks, but I'm wondering about something else for both questions.

Say we want to render a visibility buffer, just containing triangle ID and barycentric coords. And we use heavy upscaling too, so almost every triangle becomes a 'micro triangle' due to our small render resolution.
In this case we use neither UV coords nor texture LOD.
But I assume the rasterizer still spends the same number of helper lanes even if we don't need them?

It always does, no matter what we actually do in our shaders, because those 2x2 quads are currently the one and only granularity the HW rasterizer can work with, I guess.
The GPU will always schedule and spawn 2x2 quads due to how they're designed, so your last statement is absolutely correct even if the pixel shader doesn't do any texture sampling. As to what happens to the helper lanes during pixel shader execution if no texture sampling operations are incurred, they become "inactive lanes" immediately ...
Yes, but assume we do a lot of memory access, e.g. sampling many shadow maps to shade our pixel.
The helper lanes then execute the same instructions as the active lanes, but they should not read or write any memory. So the bogus texture fetches of helper lanes should not waste any bandwidth, ideally.

I guess I'm right with those assumptions, but I do not know for sure.
That's correct, helper lanes won't export their results to the render target ...
 
Say we want to render a visibility buffer, just containing triangle ID and barycentric coords. And we use heavy upscaling too, so almost every triangle becomes a 'micro triangle' due to our small render resolution.
In this case we use neither UV coords nor texture LOD.
But I assume the rasterizer still spends the same number of helper lanes even if we don't need them?
There are no "helper" lanes. Each active lane is processing the entire triangle. It's a traditional SIMD loop over as many distinct triangles as your width. If you need derivatives, then each lane calculates it's own private derivative _once_ from the three available vertices, and has it as a constant available in the loop body.
 
There are no "helper" lanes. Each active lane is processing the entire triangle. It's a traditional SIMD loop over as many distinct triangles as your width. If you need derivatives, then each lane calculates it's own private derivative _once_ from the three available vertices, and has it as a constant available in the loop body.
We just need to give those helper lanes some name.
But I thought derivatives available in the PS are not calculated from vertices, but from adjacent lanes working on adjacent pixels of a 2x2 quad.
I imagine each triangle is rasterized coarsely and conservatively to form a 'half res, double sized' triangle of 2x2 quads, 'wasting' some lanes falling outside of the full-res triangle.
But because full quads are guaranteed to exist, derivatives / gradients can be calculated using the registers of the other 3 threads in the quad, so the PS stage no longer needs access to vertex or triangle data.
Those quads also give us some limited but fast options to read registers from adjacent threads in compute shaders since the introduction of wave intrinsics.
I further assume the coarse allocation of quads to a triangle is tight, using a half-res bounding triangle, and not something simple like a bounding rectangle, which would waste many threads and entire quads.
So each quad will have at least one pixel inside the triangle.

However, that's not actually bad, and I wonder why SW raster in Nanite is a win at all.
Generating a visibility buffer means little work happening in the PS, so even wasting half of the threads on very small triangles does not sound too bad to me.
For a compute rasterizer we need some way to fuse two nested loops into a single one to keep all threads working in lockstep on multiple triangles per wave, increasing the per-pixel logic.
So I assume it should be hard to beat the HW rasterizer.
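One common way to do that loop fusion (a sketch under my own assumptions, not Nanite's actual scheme): prefix-sum the per-triangle pixel counts into one flat work range, then let each work item binary-search which triangle its flat index falls into, so a wave always has a full set of pixels to chew on regardless of triangle size.

```cpp
// Fusing the per-triangle and per-pixel loops into one flat loop via a
// prefix sum. On a GPU each iteration of the flat loop would be one lane.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<uint32_t> pixelsPerTri = {1, 7, 2, 32, 5};   // made-up coverage

    // Exclusive prefix sum: offset[i] = first flat work index of triangle i.
    std::vector<uint32_t> offset(pixelsPerTri.size() + 1, 0);
    for (size_t i = 0; i < pixelsPerTri.size(); ++i)
        offset[i + 1] = offset[i] + pixelsPerTri[i];

    uint32_t totalWork = offset.back();
    for (uint32_t flat = 0; flat < totalWork; ++flat) {       // one lane each
        // Find the triangle whose range contains this flat index.
        uint32_t tri = (uint32_t)(std::upper_bound(offset.begin(), offset.end(), flat)
                                  - offset.begin()) - 1;
        uint32_t pixelWithinTri = flat - offset[tri];
        std::printf("lane %2u -> triangle %u, pixel %u\n", flat, tri, pixelWithinTri);
    }
}
```

The extra per-pixel cost is exactly that search (or an equivalent lookup table), which is the "increasing per-pixel logic" trade-off mentioned above.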

Thus I wonder if there are some other factors here as well, besides those 'wasted helper lanes' we always hear cited as the reason.
But what could those reasons be? The guarantee to render triangles in the given API order? That can be turned off these days. So what else?

But the even bigger question to me is: Do we really need sub-pixel triangles in the days of 4K? Or is it rather a limitation of our LOD solution being unable to keep triangles large enough to be efficient?
For Nanite I speculate the latter may apply, at least in some cases. They have fine-grained and watertight LOD switching over the geometry, which is nice. But AFAICT they don't have the same for texture UVs.
There are cracks in the UVs, and they may need to hide them by using very small triangles, to some degree.
How much of a problem this is depends a lot on content. High-genus / noisy meshes or many UV charts would be bad, and results might look good only if we crank up the visible geometry resolution, which kind of defeats the primary goal of dynamic LOD.

That said, I'm not really sure about 'micropolygon support' in future GPUs, or an eventual removal of ROPs altogether, jokes aside.
It would mean more fixed-function blocks, and the more such stuff we establish, the harder it becomes to maintain and develop new GPU architectures.
As so often, I expect a short-term push and marketing hype, but in the long run we just stack up legacy bloat and complexity.
And worrying that fixed-function hardware might eat up too much area for compute is not the only argument here.
 
Isn’t the ideal coverage for a polygon something like 10 pixels? That means you have to push resolution higher to keep polygons perceptually small, which causes other performance pitfalls. Plus LOD changes and shadow pop-in are probably the two biggest blights on real-time graphics.

I expect other companies will make their own software rasterizers before this generation is over.
 
We just need to give those helper lanes some name.
But I thought derivatives available in the PS are not calculated from vertices, but from adjacent lanes working on adjacent pixels of a 2x2 quad.
I think there's a bit of confusion: I explained compute/Nanite here. There are no "helper" lanes in Nanite.

The hardware parallelizes pixels (and loops over triangles); Nanite parallelizes triangles (and loops over pixels).
Thus I wonder if there are some other factors here as well, besides those 'wasted helper lanes' we always hear cited as the reason.
But what could those reasons be? The guarantee to render triangles in the given API order? That can be turned off these days. So what else?
Let's look at an ideal situation: 32 triangles which all cover precisely 32 pixels (8x4).

Nanite will have 100% utilization of lanes, with no divergence (different loop lengths, branches). The hardware will have ~50% utilization, because it processes a rectangular area which can only be filled 50% by a triangle. This hardware efficiency can only be shifted relatively by making triangles much larger, so that the hardware breaks the rectangular area into smaller ones, of which some start to be filled fully.

Even if you made some smart hardware block that filters outside-the-triangle pixels _before_ setting up the lanes, you would pay for it with additional filter logic as well as individual per-lane parameters.

Nanite chooses its "battlefield" wisely; all of its front-end is designed to create precisely the situation where it's more efficient. Even if you don't always have same-sized triangles, you have to go down to 1, 2, 3, ..., 32-pixel triangles in a wavefront to arrive at the maximum ~50% utilization of the hardware.
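To make that arithmetic concrete, here is a rough counting sketch (my own numbers, not from the post above): for a triangle spanning an 8x4 region, count the covered pixels and the 2x2 quads it touches, then compare against both the quad footprint and the full rectangular footprint. The ~50% figure corresponds to the rectangular area; packing measured only against touched quads looks somewhat better.

```cpp
// Count covered pixels vs. touched 2x2 quads for one triangle.
#include <cstdio>
#include <set>
#include <utility>

static float edge(float ax, float ay, float bx, float by, float px, float py) {
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
}

int main() {
    // Right triangle with legs of 8 and 4 pixels, inside an 8x4 rectangle.
    float x[3] = {0.0f, 8.0f, 0.0f};
    float y[3] = {0.0f, 0.0f, 4.0f};

    int covered = 0;
    std::set<std::pair<int, int>> quads;   // distinct 2x2 quads touched
    for (int py = 0; py < 4; ++py)
        for (int px = 0; px < 8; ++px) {
            float cx = px + 0.5f, cy = py + 0.5f;   // pixel center
            bool inside = edge(x[0], y[0], x[1], y[1], cx, cy) >= 0 &&
                          edge(x[1], y[1], x[2], y[2], cx, cy) >= 0 &&
                          edge(x[2], y[2], x[0], y[0], cx, cy) >= 0;
            if (inside) {
                ++covered;
                quads.insert({px / 2, py / 2});
            }
        }
    // Prints: 16 covered pixels, 6 quads -> 67% of quad lanes, 50% of the rect.
    std::printf("covered %d, quads %zu, quad util %.0f%%, rect util %.0f%%\n",
                covered, quads.size(),
                100.0 * covered / (quads.size() * 4),
                100.0 * covered / (8 * 4));
}
```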
 
Isn’t the ideal coverage for a polygon something like 10 pixels? That means you have to push resolution higher to keep polygons perceptually small, which causes other performance pitfalls.
Maybe yes, but not anymore. It seems every game is using upscaling now, independent of whether it uses RT or not.
So we don't know which resolution gamers choose, and now we don't know which upscaling factor they use either.
It's harder now to rely on something like a '10 px per triangle' threshold if we look at the final image people will see.
Besides the demand for more geometric detail, the use of upscaling may stress the need for a solution to the small-triangle problem even more.
I wonder why no IHV has a solution yet.

I expect other companies will make their own software rasterizers before this generation is over.
Would be interesting. But I guess they would mostly rather wait for RT to be ready. Currently we don't know how RT will evolve to support LOD, so anybody investing in SW rendering to get better detail risks a big investment in something which may turn out not to be future proof soon after.

But I still hope for some experiments similar to Dreams, regardless.

I think there's a bit of confusion: I explained compute/Nanite here.
Ah yes, I was talking about ROPs the whole time.

The hardware will have ~50% utilization, because it processes a rectangular area which can only be filled 50% by a triangle.
I've only skimmed the Nanite code quickly, but it did not look like a bounding rectangle per triangle with one thread per pixel of the rectangle.
It rather looked like the traditional scanline approach, with one thread processing one triangle. The divergence would then be the number of pixels those triangles cover. But I was not sure.
I did not spot whether they bin the triangles into equally sized batches to reduce this divergence.
In theory it should be easy to get over 75% utilization, but such an attempt likely conflicts with other things such as cluster sizes, or drawing front to back for culling, etc.
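For reference, the binning idea could look roughly like this (a hypothetical sketch, not what Nanite does): bucket triangles by covered-pixel count so each wave processes triangles of similar size, which bounds the loop-length divergence.

```cpp
// Bucket triangles by (approximate) pixel count into power-of-two bins so
// that triangles dispatched together have similar loop lengths.
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<uint32_t> pixelsPerTri = {1, 3, 30, 2, 17, 9, 4, 25, 6, 1};

    // Power-of-two size buckets: [1..2), [2..4), [4..8), ...
    const int numBuckets = 8;
    std::vector<std::vector<uint32_t>> buckets(numBuckets);
    for (uint32_t i = 0; i < pixelsPerTri.size(); ++i) {
        int b = 0;
        while ((1u << (b + 1)) <= pixelsPerTri[i] && b + 1 < numBuckets) ++b;
        buckets[b].push_back(i);   // triangles in one bucket share waves
    }

    for (int b = 0; b < numBuckets; ++b) {
        if (buckets[b].empty()) continue;
        // Within a bucket the longest loop is less than 2x the shortest, so
        // per-wave utilization stays above ~50%; finer buckets push it higher.
        std::printf("bucket [%u..%u) pixels: %zu triangles\n",
                    1u << b, 1u << (b + 1), buckets[b].size());
    }
}
```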

Anyway, when I personally think about compute raster, I would not want to work with triangles.
I would want something like point splatting, or those spherical Gaussians, which are really nice... something that gives us some new options we cannot get now.
The main problem here seems to be that we might lose those new options again if we want to combine this with traditional triangle rendering as well. And for many things triangles are just great.
 
To circle back to the topic at hand, I found this interesting bit from a long time ago. In November 2009, nearly 15 years ago, Richard Huddy (AMD's senior manager of developer relations) accused NVIDIA of not caring about DirectX 11 and Tessellation.

(1) The positive mention of DX11 is a rarity in recent communications from NVIDIA - except perhaps in their messaging that 'DirectX 11 doesn't matter'. For example I don't remember Jensen or others mentioning tessellation (the biggest of the new hardware features) from the stage at GTC. In fact if reports are to be trusted only one game was shown on stage during the whole conference - hardly what I would call treating gaming as a priority!

This was the mentality at AMD: they were first with DX11 hardware (NVIDIA was six months late), and AMD cared so much about state-of-the-art features that they immediately moved to attack NVIDIA for being late and for not being as outspoken about DX11 and tessellation as they felt was needed, right to the point of accusing NVIDIA of not considering gaming a priority!

Fast forward to 2018, and AMD is nowhere to be seen in the Ray Tracing/DXR scene! They tried to downplay the importance of ray tracing in the beginning, then they were two whole years late to introduce it. Worse yet, they are still downplaying its importance five years later, not pushing as hard for it as NVIDIA or being as outspoken; they also actively make sure ray tracing in their sponsored titles is no more than a console-level implementation.

If we want a clear picture of how AMD changed their priorities over the years, I guess it doesn't get any clearer than this. The question now is: Why? What changed at AMD?
 
Tom's Hardware on the current situation after DLSS 3.5.

Now there's a pretty major gap between Nvidia and AMD GPUs when you enable ray tracing, and the more RT effects a game uses, the bigger the gap becomes.

Games that only make limited use of RT for only shadows or only reflections may have AMD and Nvidia on relatively level ground,

while games that use so-called path tracing (full ray tracing) like Cyberpunk 2077 in RT Overdrive mode can result in massive differences in performance — as shown here, AMD's currently fastest GPU, the RX 7900 XTX, can't even keep up with the three years old RTX 3080.

Now, with Ray Reconstruction, that performance gap is set to become a performance chasm, because not only does Nvidia deliver vastly superior RT performance, but it can also provide clearly superior image fidelity.

It's a messy situation, and it's not likely to get any better in the near term. If there were some hypothetical universal solution that could provide upscaling, frame generation, ray reconstruction, and whatever Nvidia's engineers come up with next, things might be different.

But it's increasingly looking like anyone purchasing a new graphics card will have Nvidia hardware with all the bells and whistles on one side, and everything else on the other.

That's a potential massive blow to competition, and we don't like the long-term ramifications. Yes, Nvidia might make superior ray tracing hardware, but all we have to do is look at the generational price increases on the RTX 40-series to guess where that will lead us.


Alex from DigitalFoundry comments on that sentiment.

 
expect the gap to grow ever wider with the coming of AC Mirage
Funny thing happened: Assassin's Creed Mirage got released, and the AMD advantage is wiped out. A 3080 is almost as fast as a 6900 XT at 4K, and NVIDIA GPUs are ahead of AMD GPUs. The 4080 is 18% ahead of the 7900 XTX at 4K.


 
Eh, Origins had much simpler geometry, I'd say. But otherwise the only notable changes there are the switch to D3D12 and the introduction of TAA in Valhalla, and now DLSS/XeSS/FSR2 with Mirage.
 
What's even more interesting is how these games have done almost nothing to improve the visual or gameplay experience since AC Origins, yet continue to get much more demanding.
I didn't bother checking any other sources, but GameGPU universally has Mirage giving 5-10% higher FPS than Valhalla.
 
Eh, Origins had much simpler geometry, I'd say. But otherwise the only notable changes there are the switch to D3D12 and the introduction of TAA in Valhalla, and now DLSS/XeSS/FSR2 with Mirage.
TAA was introduced with Origins, no? Many of these assets are ripped straight from Origins. There is a mild increase in scene density, but I wouldn't classify it as a major increase in geometry.
 
TAA was introduced with Origins, no? Many of these assets are ripped straight from Origins. There is a mild increase in scene density, but I wouldn't classify it as a major increase in geometry.
I don't really remember, but I thought that Origins and Odyssey were still using their weird mix of MSAA and FXAA.
 
Maybe.
Anyway, I've run the built-in benchmark in Mirage, and the game is pulling no more than 330W at 4K on ultra on my 4090 while showing 100% GPU load in peak power-draw sections.
So the fact that it runs about the same on NV and AMD h/w doesn't mean that it's any better optimized now, I'm afraid.
Their Anvil engine needs a proper refactoring.
 