Nvidia Blackwell Architecture Speculation

  • Thread starter Deleted member 2197
UE5.1 has mesh shaders. Sadly nobody yet has benchmarked the influence of mesh shaders by comparing the 5700 to the 2060 Super or similar in a recent UE5.1 demo and Fortnite.
 
To measure the win of mesh shaders, you'd need to do so on the same architecture, once with and once without using mesh shaders.
Otherwise you're just guessing.

Virtual shadow maps in Unreal Engine 5 seem to be pretty crucial, as I understand it.
Shadow maps are on the way out.
Yeah mesh shaders do seem quite dead right out of the gate.
Well, RT does not work with Nanite, nor does it work with mesh shader generated geometry.

So those are solid reasons why shadow maps will stay around longer than hoped, and why mesh shaders are now less useful than they could have been.
 
It seems like RGT does not know that tensor cores are not in use for denoising at all, contrary to what Nvidia had planned, which makes that rumor more credible in my eyes.

Nvidia probably had no success in moving denoising to an AI neural network running on the tensor cores, which is why they are now working on a fixed function denoiser accelerator. Makes sense and is indeed very interesting!
A guy not knowing that tensor cores are not used for denoising does not make his rumors more credible; rather the opposite?

Denoising seems too much of an open problem to get HW acceleration. The obvious next steps for RT would be HW BVH builder and ray reordering.
 
Why wouldn't RT work with mesh shaders?
Because mesh shaders generate geometry temporarily on chip, render it, and forget it.
So you can't include the output in a BVH build or traceRay(), as this would require storing the geometry in VRAM, defeating the advantage of mesh shaders.

Edit: Basically, a ray that hits a bounding box containing such temporary, procedural geometry would need to run the mesh shader again. That is inefficient, because you generate many triangles but intersect only one.
To make it efficient, you would first need to collect all rays which (might) hit the box, but that's the problem reordering tries to solve, with not much hope of it being worth it.

DMM is a better compromise. It's not flexible, so no complex program needs to run.
 
Because mesh shaders generate geometry temporarily on chip, render it, and forget it.
Are you sure that you properly understand what a mesh shader is and how it works? For one, it doesn't render anything; it outputs triangles for pixel shaders (or as a UAV if no shading is needed).

So you can't include the output in a BVH build or traceRay(), as this would require storing the geometry in VRAM, defeating the advantage of mesh shaders.
Why would storing geometry in VRAM defeat the advantage of mesh shaders?
 
Are you sure that you properly understand what a mesh shader is and how it works? For one, it doesn't render anything; it outputs triangles for pixel shaders (or as a UAV if no shading is needed).
No, I'm not sure about any features I haven't used myself yet.
But by 'rendering' I mean the pipeline which comes after vertex shaders. It's not important to specify this pipeline, I think, but the whole point of mesh shaders is to stay 'on chip', bypassing VRAM. You disagree?
 
but the whole point of mesh shaders is to stay 'on chip', bypassing VRAM. You disagree?
Microsoft does, as shown in the link above.

I feel like you're confusing the mesh shader's advantage over implementing geometry processing through compute shaders with its advantage over the traditional geometry pipeline. Mesh shader output can be fed directly into the pixel shader, which wouldn't be the case with a compute shader. But even in this case, neither has anything blocking the use of the geometry for BVH and RT?

The Nanite issue with h/w RT isn't in how the geometry is produced; it's in the fact that this geometry bypasses the h/w rasterizer, I think? It still seems like something which can probably be solved to be usable in RT with more complex compute?
 
But even in this case, neither has anything blocking the use of the geometry for BVH and RT?
Well, if you want to process the geometry twice, once for BVH, and then each frame with mesh shaders, you can do that.
But in this case I would use the same static geometry I've used for RT for rasterization as well, to spare the double processing, which would just be redundant.
With RT, mesh, geometry and tessellation shaders are pointless.
But that's just the difference between RT and raster. For raster, you process a triangle only once, so all those shader stages make sense.
For RT, you don't know how many rays hit a triangle, so you need to pre-transform and store it to build / refit the BVH first, and render afterwards. Thus there is no need to emulate a geometry pipeline which is no longer used.

Ofc. you could still use vertex shaders to generate geometry, but then write it to VRAM instead of rendering it. But why accept such single threaded restrictions if you can use flexible compute instead?
Besides, if you still do it every frame like before, e.g. to animate or change tessellation factors, you'd need to rebuild the BVH every frame as well, which isn't practical.

Personally I have no problem with burying the geometry pipeline, even if mesh shaders are really nice.
But habits may vary. E.g. I remember ImgTec's pre-DXR API used a kind of vertex shader to feed the HW BVH builder, for what it's worth.

The Nanite issue with h/w RT isn't in how the geometry is produced; it's in the fact that this geometry bypasses the h/w rasterizer, I think? It still seems like something which can probably be solved to be usable in RT with more complex compute?
No. The SW rasterizer has nothing to do with the incompatibility. It's even optional.
The incompatibility is caused by the BVH black box not supporting local refinement of mesh detail, so a full rebuild on minor changes would be the only technically possible, but impractical, option. But don't get me started on this again.
 
A guy not knowing that tensor cores are not used for denoising does not make his rumors more credible; rather the opposite?

Denoising seems too much of an open problem to get HW acceleration. The obvious next steps for RT would be HW BVH builder and ray reordering.

Dedicated silicon might be a bit much for BVH construction; there'd be a tendency towards overfitting and building too narrow an accelerator for the sake of benchmarks over usability. Not that Nvidia wouldn't be guilty of that, but still. I can easily see optimization of the hw/driver stack for acceleration structure building becoming a focus. I wouldn't be surprised if the next Nvidia arch had this under consideration.
 
Well, I wonder if it's a good idea to use oriented bounding boxes instead of axis aligned boxes.
One orientation could be shared over a branch of the tree, and it could be quantized if memory is a problem.
Surfaces tend to be locally smooth and planar, but not necessarily aligned to the global coordinate system axes. That would reduce box surface area a lot.
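
To put a rough number on the box surface argument, here is a minimal sketch comparing an oriented box around a thin, flat patch with the axis aligned box needed once the patch is tilted. The patch size and the 45 degree tilt are made-up illustration values, and the rotation is only about one axis to keep it short.

```python
import math

def box_surface_area(extents):
    """Surface area of a box with the given full side lengths (x, y, z)."""
    x, y, z = extents
    return 2.0 * (x * y + y * z + z * x)

def aabb_extents_of_rotated_box(extents, angle_rad):
    """Extents of the axis aligned box enclosing a box rotated about the z axis."""
    x, y, z = extents
    c, s = abs(math.cos(angle_rad)), abs(math.sin(angle_rad))
    return (c * x + s * y, s * x + c * y, z)

# Hypothetical patch: a thin slab, 10 x 0.1 x 10 units, tilted by 45 degrees.
patch = (10.0, 0.1, 10.0)
obb_area = box_surface_area(patch)  # the oriented box fits the slab exactly
aabb_area = box_surface_area(aabb_extents_of_rotated_box(patch, math.radians(45)))

print(f"OBB area:  {obb_area:.1f}")
print(f"AABB area: {aabb_area:.1f}")
print(f"AABB / OBB ratio: {aabb_area / obb_area:.2f}")
```

Box surface area matters because, under the usual surface area heuristic, the chance that a random ray hits a box is roughly proportional to its surface area, so a tighter box means fewer wasted traversal steps.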
It's what's needed to really push ray tracing and performance forward.
The future should be to stream BVH, imo. Bottom levels could still be built on the client to keep storage small, but top levels would profit from a high quality offline build.

Though, if such a HW BVH builder were fast enough to enable fine grained LOD, I would take it. That's not really possible, but maybe setting HW acceleration in stone would motivate them to fix some things before they do this.
Building BVH over given mesh clusters, and allowing clusters to be replaced with finer or coarser ones, really shouldn't be that hard. So we would have build, refit, and finally cluster management.
However, such an idea leads to a hierarchy of clusters. And if we already provide such a hierarchy, a HW BVH builder isn't needed, because figuring out such a hierarchy is all a BVH build is doing.

Edit: With this in mind, maybe HW acceleration for refit would be much more useful than build.
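
For readers unfamiliar with the difference between build and refit, here is a minimal sketch of a refit: the tree topology stays as it is and only the boxes are recomputed bottom-up after the geometry moves. The `Node` layout and helper names are hypothetical, not taken from DXR or any driver.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
AABB = Tuple[Vec3, Vec3]  # (min corner, max corner)

@dataclass
class Node:
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    points: List[Vec3] = field(default_factory=list)  # leaf vertices
    bounds: Optional[AABB] = None

def bounds_of(points: List[Vec3]) -> AABB:
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))

def union(a: AABB, b: AABB) -> AABB:
    lo = tuple(min(p, q) for p, q in zip(a[0], b[0]))
    hi = tuple(max(p, q) for p, q in zip(a[1], b[1]))
    return lo, hi

def refit(node: Node) -> AABB:
    """Recompute boxes bottom-up; assumes inner nodes have two children."""
    if node.left is None and node.right is None:
        node.bounds = bounds_of(node.points)
    else:
        node.bounds = union(refit(node.left), refit(node.right))
    return node.bounds

# Two leaves under one root; then the second leaf is "animated" upwards.
leaf_a = Node(points=[(0, 0, 0), (1, 0, 0), (0, 1, 0)])
leaf_b = Node(points=[(2, 0, 0), (3, 0, 0), (2, 1, 0)])
root = Node(left=leaf_a, right=leaf_b)
refit(root)
leaf_b.points = [(x, y + 5.0, z) for x, y, z in leaf_b.points]
refit(root)  # same tree, updated boxes
print(root.bounds)
```

A refit never changes which triangles share a subtree, which is why it is cheap, and also why tree quality degrades when geometry moves far from where it was at build time.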
 
It's what's needed to really push ray tracing and performance forward.

From a HW perspective you really want multi-function stuff. For example, if the supposedly upcoming texturing/RT hardware from AMD (RDNA 4, reportedly) can also accelerate BVH construction, then great! But a dedicated unit by itself is a lot of hw for one task among many.

This also brings up how inefficient triangles are for this. You need to cut up scenes dramatically for both testing and acceleration structure purposes, but the more you cut it up the worse the building gets.
 
Why wouldn't RT work with mesh shaders?

This would require writing out the geometry to VRAM, but that’s not supported yet. Mesh shaders can only send data to the rasterizer.


We don’t plan on supporting streamout as part of this feature. Instead, in the future, we would like to add a special append buffer UAV type or mode which will ensure the UAV’s outputs are in order of inputs and can be used from any shader stage including regular compute.
 
The obvious next steps for RT would be HW BVH builder and ray reordering.

Nvidia likely does some basic ray-reordering within an SM if their patents are anything to go by. Essentially during MIMD execution the RT hardware chooses rays from the queue that are likely to hit the L0 BVH cache. Those rays can come from any active warp. If my guess is right and the MIMD cores can process rays out of submission order there should be enough latency hiding work to cover rays waiting on cache misses.
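
As a toy illustration of that idea (not Nvidia's actual hardware scheme, which isn't public in this detail), the sketch below models a ray queue and a tiny LRU node cache. With reordering on, the scheduler issues the oldest queued ray whose next BVH node is already cached, and falls back to submission order otherwise. The function name, the cache size and the ray list are all made up.

```python
from collections import OrderedDict, deque

def schedule(rays, cache_size=1, reorder=True):
    """rays: (ray_id, node_id) pairs in submission order.
    Returns (issue order, number of cache misses)."""
    queue = deque(rays)
    cache = OrderedDict()  # node_id -> None, kept in LRU order
    order, misses = [], 0
    while queue:
        pick = queue[0]
        if reorder:
            # Prefer the oldest queued ray whose node is already cached.
            pick = next((r for r in queue if r[1] in cache), queue[0])
        queue.remove(pick)
        ray_id, node_id = pick
        if node_id in cache:
            cache.move_to_end(node_id)
        else:
            misses += 1
            cache[node_id] = None
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used node
        order.append(ray_id)
    return order, misses

rays = [(0, "A"), (1, "B"), (2, "A"), (3, "C"), (4, "B"), (5, "A")]
print(schedule(rays, reorder=False))  # ([0, 1, 2, 3, 4, 5], 6)
print(schedule(rays, reorder=True))   # ([0, 2, 5, 1, 4, 3], 3)
```

Real hardware would have to bound how long a ray can be skipped, which this sketch ignores.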

Hardware accelerated BVH builds and refits would be interesting though I think it’s a solvable problem without the need for dedicated hardware. In a typical game scene most of the geometry doesn’t move or deform in world space. For the things that are animated there seem to be pretty efficient algorithms for parallelizing the BVH update on general compute cores.

Hardware support for BVH LODs is another possibility but I wouldn’t hold my breath on that. Nvidia doesn’t seem to think lack of LOD is a big problem and it would require developers to implement yet another proprietary NVAPI path.
 
Bumping this thread because RedGamingTech has some GB20x (for RTX 50) rumours with specs listed:
  • GB207 - 28SM, 96 bit, GDDR7, 3x GPC PCIe 5.0 x4.
  • GB206 - 44SM, 128 bit, GDDR7, 3x GPC PCIe 5.0 x8.
  • GB205 - 72SM, 192 bit, GDDR7, 3x GPC PCIe 5.0 x8.
  • GB203 - 108SM, 256 bit, GDDR7, 3x GPC PCIe 5.0 x16.
  • GB202 - 192SM, 384 bit, GDDR7, 3x GPC PCIe 5.0 x16.
It does make me wonder if 205, 206 & 207 will be on TSMC N4P, with 202 & 203 most likely on TSMC N3E. I mean, if GB205 can get something like AD103 raster performance with better RT while still on N4P, and with better power consumption figures, then why have it on N3 when those N3 wafers for Nvidia can produce more GB202s for AI? And a GB205 on N4P is probably 300-320 mm², which has the same if not slightly higher cost than AD104 two years ago. I think GDDR7 can give a boost without having to resort to more L2 cache (especially if 36 Gbps 3GB modules become available).
 
when those N3 wafers for Nvidia can produce more GB202s for AI
GB100 is for AI, GB202 is a gaming chip.

I'm also wondering where the idea that wafers, of all things, are the limiting factor on producing more AI chips has come from? From all we know this isn't true, and there are no indications of issues with non-AI chip shipments at the moment.

As for the split in process tech, that's something to ponder, I'm sure, but it's not clear that there's a benefit to using a less advanced process, because even at the same price per chip you still get the advantage of more chips per wafer by going fully onto a more advanced node. It should also be a lot cheaper to put all chips from one family on one node.
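
The "more chips per wafer" point can be sketched with the common gross dies-per-wafer approximation. This is a rough estimate only: it assumes a 300 mm wafer, ignores yield, scribe lines and edge exclusion, and the 240 mm² "shrunk" die area is a made-up illustration, not a figure from the rumor.

```python
import math

WAFER_DIAMETER_MM = 300.0  # assumption: standard 300 mm wafer

def dies_per_wafer(die_area_mm2, wafer_diameter_mm=WAFER_DIAMETER_MM):
    """Gross die count: wafer area / die area, minus an edge-loss term."""
    radius = wafer_diameter_mm / 2.0
    gross = math.pi * radius ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2.0 * die_area_mm2)
    return int(gross - edge_loss)

# 320 and 300 mm^2 are the N4P guesses from the post above;
# 240 mm^2 is a hypothetical area for the same chip on a denser node.
for area in (320.0, 300.0, 240.0):
    print(f"{area:.0f} mm^2 -> ~{dies_per_wafer(area)} gross dies per wafer")
```

Whether the extra dies make up for the higher wafer price on the newer node depends on pricing and yield figures that aren't public.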
 
I mean, AD102 is basically a "gaming chip", but it's used in the RTX 6000, which goes into workstations that do AI work (like the Nvidia RTX workstations announced last year).

And on the split in process node: it means Nvidia can devote more of its 3nm wafers at TSMC to GB100, 202 etc., since 3nm capacity is limited because Apple and others will be eating that supply up for a while. I suspect from now on we'll see mixed processes within a generation. I won't be surprised if, say, in 2028/2029 with RTX 70, if Nvidia does go with chiplets, we end up seeing the x02 & x03 equivalents using 1.4nm compute chiplets while the x07 & x06 equivalents use 3nm compute chiplets, or even further than that, especially as 2nm and beyond looks... pricey.
 
In the original Lovelace leaks, AD102 was going to include 144 SMs. And while I think we discovered that's technically true for a die with zero execution units fused off, the actual shipping AD102 product only exposes 128 SMs. Thus, with 128 CUDA cores per SM, that provides the 16,384 total CUDA cores of the consumer RTX 4090.

That said, I wonder if the rumored 192 SMs on GB202 is the total count on a 100% functional die, or the expected usable count on a production-release die... 192 SMs would deliver a 50% increase over the direct predecessor, for 24,576 CUDA cores. This is roughly in line with, albeit a little short of, the jump in core counts from the 3090 to the 4090...
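
The arithmetic is easy to check. A quick sketch, assuming the rumored GB202 keeps 128 CUDA cores per SM like Ada (nothing confirms that), and using the RTX 3090's shipping count of 10,496 cores:

```python
CORES_PER_SM = 128  # Ada figure from the post; assumed unchanged for GB202

def cuda_cores(sm_count, cores_per_sm=CORES_PER_SM):
    return sm_count * cores_per_sm

rtx_3090 = 10_496               # shipping RTX 3090 core count
rtx_4090 = cuda_cores(128)      # shipping AD102 configuration
ad102_full = cuda_cores(144)    # fully enabled AD102 die
gb202_rumor = cuda_cores(192)   # rumored GB202

print(f"RTX 4090:      {rtx_4090}")
print(f"Full AD102:    {ad102_full}")
print(f"Rumored GB202: {gb202_rumor}")
print(f"GB202 vs 4090:       +{gb202_rumor / rtx_4090 - 1:.0%}")
print(f"GB202 vs full AD102: +{gb202_rumor / ad102_full - 1:.0%}")
print(f"4090 vs 3090:        +{rtx_4090 / rtx_3090 - 1:.0%}")
```

If 192 SMs is the full-die count and the shipping part is cut down the way the 4090 was, the generational jump in usable cores would be smaller than 50%.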
 