Most of the articles on the architecture are BS, not surprisingly. "Variable rate shading" you mean the F*ing software paper on that recently that has shit all to do with hardware?

The way I read the Variable Rate Shading section of the whitepaper, Turing has hardware support for something that, in the past, was done in software. If you can specify, in hardware, for each 16x16 rectangle on the screen that the shading rate can be lowered to 1/2 or 1/4 of the pixels, then you are reducing the load on the shaders and texture units by a similar amount. (I assume that the ROPs will still be operating at the same rate?)
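To put rough numbers on that, here is a small CPU-side sketch (the tile layout and rate distribution are made up for illustration, and this is not any vendor's actual API) that estimates how much pixel-shader work a per-16x16-tile rate map would save:

Code:
// Hypothetical sketch: estimate the pixel-shader work saved by per-tile shading rates.
// 16x16 tiles and the 1/2 and 1/4 rates follow the whitepaper description above;
// the rate distribution below is invented, and this is not any vendor's actual API.
#include <cstdio>
#include <vector>

int main() {
    const int tileW = 16, tileH = 16;
    const int tilesX = 3840 / tileW, tilesY = 2160 / tileH;   // a 4K framebuffer

    // Example rate map: 50% of tiles at full rate, 30% at half, 20% at quarter.
    std::vector<double> rates(tilesX * tilesY, 1.0);
    for (size_t i = 0; i < rates.size(); ++i) {
        if (i % 10 < 3)      rates[i] = 0.5;
        else if (i % 10 < 5) rates[i] = 0.25;
    }

    double fullRateWork = double(tilesX) * tilesY * tileW * tileH;
    double actualWork = 0.0;
    for (double r : rates) actualWork += r * tileW * tileH;

    // ROP throughput is unchanged in this model: the same pixels are still written,
    // only the shading rate behind them drops.
    std::printf("Shader/texture work relative to full rate: %.0f%%\n",
                100.0 * actualWork / fullRateWork);
    return 0;
}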
Yeah thanks for including that in a hardware paper. Why is the "white paper" filled with PR bullshit? I'm just trying to read about your computer architecture guys, keep it out of the hands of the god damned PR people.

I'm trying to read about computer architecture too, and my reading is clearly very different from yours. I must be biased.
From that paper that's all I can really conclude.

Here's another conclusion from the whitepaper: the RTX cores may simply be 'an extra instruction', as you wrote earlier, but it seems pretty clear from the whitepaper that this extra instruction kicks off a LOT of hardware. I can't wait to see how Vega will pull that off with a software driver.
Fortunately Anandtech has done a proper job, and shows that CUDA cores are responsible for BVH construction.

I wonder how you come to this conclusion. Here's all I can find in the Anandtech article about BVH construction:

AnandTech said: The other major computational cost here is that BVHs themselves aren’t free. One needs to be created for a scene from the polygons in it, so there is an additional step before ray casting can even begin. This is more a developer concern – when can they modify and reuse a BVH versus building a new one – but it’s another step in the process. Furthermore it’s an example of why developer training and efficient engine implementations are so crucial to the process, as a poor implementation can make ray tracing much too slow to be viable.

Maybe I missed something, but the whitepaper doesn't talk about BVH construction at all either.
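For a sense of why BVH construction is its own cost on top of the hardware traversal, here is a minimal top-down, median-split builder over triangle centroids. This is a generic textbook-style illustration of the work a DXR implementation has to do somewhere before ray casting can start, not NVIDIA's actual algorithm:

Code:
// Minimal top-down, median-split BVH build over triangle centroids.
// Generic illustration of the construction step discussed above; not NVIDIA's algorithm.
#include <algorithm>
#include <array>
#include <cstdio>
#include <vector>

using Vec3 = std::array<float, 3>;

struct Node {
    Vec3 mn, mx;                 // bounding box of this node
    int left = -1, right = -1;   // child node indices (-1 = leaf)
    int first = 0, count = 0;    // leaf: range into the triangle index list
};

struct Builder {
    const std::vector<Vec3>& centroids;  // one centroid per triangle
    std::vector<int> order;              // triangle indices, partitioned in place
    std::vector<Node> nodes;

    int build(int first, int count) {
        Node n;
        n.mn = {1e30f, 1e30f, 1e30f};
        n.mx = {-1e30f, -1e30f, -1e30f};
        for (int i = first; i < first + count; ++i)
            for (int a = 0; a < 3; ++a) {
                n.mn[a] = std::min(n.mn[a], centroids[order[i]][a]);
                n.mx[a] = std::max(n.mx[a], centroids[order[i]][a]);
            }
        int id = (int)nodes.size();
        nodes.push_back(n);
        if (count <= 2) {                              // small leaf: stop splitting
            nodes[id].first = first; nodes[id].count = count;
            return id;
        }
        int axis = 0;                                  // split the longest axis at the median
        for (int a = 1; a < 3; ++a)
            if (n.mx[a] - n.mn[a] > n.mx[axis] - n.mn[axis]) axis = a;
        int mid = first + count / 2;
        std::nth_element(order.begin() + first, order.begin() + mid,
                         order.begin() + first + count,
                         [&](int l, int r) { return centroids[l][axis] < centroids[r][axis]; });
        int l = build(first, mid - first);
        int r = build(mid, first + count - mid);
        nodes[id].left = l;
        nodes[id].right = r;
        return id;
    }
};

int main() {
    std::vector<Vec3> centroids;                       // toy "scene": 1000 triangle centroids
    for (int i = 0; i < 1000; ++i)
        centroids.push_back({float(i % 10), float((i / 10) % 10), float(i / 100)});
    Builder b{centroids, {}, {}};
    for (int i = 0; i < (int)centroids.size(); ++i) b.order.push_back(i);
    b.build(0, (int)centroids.size());
    std::printf("Built a BVH with %zu nodes for %zu triangles\n",
                b.nodes.size(), centroids.size());
    return 0;
}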
Right now, considering talented programmers can get the same level of raytracing performance out of a 1080ti as Nvidia claims can come out of their new RTX cards, well consider me unimpressed.

Oh boy, here we go again.
I wonder how you come to this conclusion. Here's all I can find in the Anandtech article about BVH construction.

The article does make the claim but I don’t see anything in the white paper. Maybe AT got some info offline.

Thanks for pointing that out. I finally found the quote that I was looking for:

AnandTech said: But otherwise, everything else is at a high level governed by the API (i.e. DXR) and the application; construction and update of the BVH is done on CUDA cores, governed by the particular IHV – in this case, NVIDIA – in their DXR implementation.
What I don’t get is how does this help with having lots more objects?

I think part of it is that you completely bypass a bunch of fixed function geometry hardware that does not scale as well as the number of SMs.
- Mesh shaders can do much of the object level culling and LOD selection work currently done on the CPU

My reading of the method is that decisions about cluster and LOD handling would occur in the task shader, which would then control how many mesh shader invocations would be created, and what it is they would be working on.
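To make that split concrete, here is a rough CPU-side sketch of the control flow (the structures, LOD thresholds and meshlet counts are invented for illustration): the task stage culls clusters and picks an LOD, and only then decides how many mesh-shader workgroups get launched:

Code:
// CPU-side sketch of the task -> mesh division of labor described above. The structures,
// LOD thresholds and meshlet counts are hypothetical; the point is that culling and LOD
// selection happen in the task stage, which decides how much mesh-shader work is launched.
#include <cstdio>
#include <vector>

struct Cluster {
    float distance;          // distance from camera (stand-in for a real bounds test)
    bool visible;            // result of a frustum/occlusion test
    int meshletsPerLod[3];   // meshlet count at LOD 0 / 1 / 2
};

int main() {
    std::vector<Cluster> clusters = {
        {5.0f,   true,  {64, 16, 4}},
        {50.0f,  true,  {64, 16, 4}},
        {200.0f, false, {64, 16, 4}},   // culled entirely: no mesh work launched at all
    };

    int meshWorkgroups = 0;
    for (const Cluster& c : clusters) {
        if (!c.visible) continue;                                          // task stage: cull
        int lod = c.distance < 10.0f ? 0 : (c.distance < 100.0f ? 1 : 2);  // task stage: pick LOD
        meshWorkgroups += c.meshletsPerLod[lod];                           // amplification decided here
    }
    std::printf("Mesh workgroups launched: %d\n", meshWorkgroups);
    return 0;
}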
- This stuff could all be done using compute shaders but a lot of memory allocation and caching optimizations are taken care of by nvidia if you use the mesh pipeline

At least for the compute methods discussed by the presentation, using them at the front of the overall pipeline means they can stay on chip and reduces the overhead introduced by the launch of large compute batches. I presume there's cache thrashing and pipeline events that must write back to memory if the generating shader is in a prior compute pass.
What I don’t get is how does this help with having lots more objects?

The base pipeline always has a primitive distributor that runs sequentially through a single index buffer, creates a de-duplicated in-pipeline vertex batch, and builds a list of in-pipeline primitives that refer back to the entries from the vertex batch they use.
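For reference, here is what that de-duplication step looks like in a minimal form (the index buffer contents are made up, and real hardware works on bounded in-pipeline batches rather than a whole buffer at once):

Code:
// Sketch of the primitive distributor's de-duplication described above: walk the index
// buffer in order, keep one copy of each referenced vertex in the in-pipeline batch,
// and rewrite each triangle to point at batch-local entries. Data is illustrative only.
#include <cstdio>
#include <unordered_map>
#include <vector>

int main() {
    // A strip-like triangle list that reuses vertices heavily.
    std::vector<int> indexBuffer = {0, 1, 2,  2, 1, 3,  2, 3, 4,  4, 3, 5};

    std::vector<int> vertexBatch;              // unique source vertices, in first-use order
    std::unordered_map<int, int> batchSlot;    // source vertex id -> slot in vertexBatch
    std::vector<int> primitives;               // triangle list, indexing into vertexBatch

    for (int src : indexBuffer) {
        auto it = batchSlot.find(src);
        if (it == batchSlot.end()) {
            it = batchSlot.emplace(src, (int)vertexBatch.size()).first;
            vertexBatch.push_back(src);        // each vertex shaded once, not once per reference
        }
        primitives.push_back(it->second);
    }

    std::printf("%zu index entries -> %zu unique vertices, %zu triangles\n",
                indexBuffer.size(), vertexBatch.size(), primitives.size() / 3);
    return 0;
}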
This is an important slide. Scenes behind the 10 Gigarays/s number. Each scene has a single high poly mesh (no background). Primary rays = fully coherent.
4K at 60 fps = 0.5 Gigarays/s, assuming one ray/pixel. Thus RTX 2080 Ti allows you to cast 20 rays per pixel in scenes like this at 60 fps (4K). Assuming of course that you use 100% of your frame time for TraceRays calls, or overlap TraceRays calls (async compute) with other work.
These are pretty good numbers, especially if you overlap TraceRays with other work using async compute, hiding the potential bottleneck of BVH traversal and ray-triangle tests in more complex scenes. I will definitely do some async tests when I get my hands dirty with an RTX 2080.
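Spelling that arithmetic out, under the same assumptions as the post above (one primary ray per pixel, all frame time spent tracing, the quoted 10 Gigarays/s peak):

Code:
// The ray-budget arithmetic from the post above: rays/s needed for 1 ray per pixel at
// 4K 60 fps, and the rays-per-pixel budget a quoted 10 Gigarays/s peak would leave.
#include <cstdio>

int main() {
    const double width = 3840.0, height = 2160.0, fps = 60.0;
    const double gigaraysPerSecond = 10.0;    // quoted peak for fully coherent primary rays

    double raysPerSecondOnePerPixel = width * height * fps;               // ~0.5 Gigarays/s
    double raysPerPixelBudget = gigaraysPerSecond * 1e9 / raysPerSecondOnePerPixel;

    std::printf("1 ray/pixel at 4K 60 fps needs %.2f Gigarays/s\n",
                raysPerSecondOnePerPixel / 1e9);
    std::printf("A %.0f Gigarays/s budget allows ~%.0f rays per pixel\n",
                gigaraysPerSecond, raysPerPixelBudget);
    return 0;
}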
The lead designer for the PS4 indicated there was a triangle sieve option for compiling code from a vertex shader, which would produce a shader containing only position-related calculations for culling. This would then be linked to the geometry front end running a normally compiled version of the vertex shader by a firmware-managed ring buffer hosted in the L2 cache, which was something Sony considered a tweak for their architecture.

@3dilettante well written.
One thing though: the compute overhead is for small dispatches, not large ones (those would be hidden). As you said, there are also extra waits for completing previous tasks. Now it may be doable (on console, I guess) to tune producer/consumer batch sizes to architecture-specific L2 sizes to keep things on chip, but that is very much not portable.
When originally designed, it was not certain whether fixed function blocks would see further improvement, hence my "awkward" comment. However, as scaling can be demonstrated, and depending on the feature's success, it opens the door to not having to improve fixed function at some point. (The market decides.)

It's a frequently retold story for designs facing the programmable/dedicated wheel of reincarnation; sometimes it's a matter of where on the cycle a design finds itself. Perhaps a future design will be able to heavily leverage the more programmable path and leave the traditional hardware path alone, and then someone starts to think about what hardware could accelerate the new one.
Not sure what you mean by less transparent about the task/mesh split, given that is all up to the developer?

Transparent in this case would be whether the shaders would need to expose elements like the number of primitives per mesh or limits to the amount of context being passed from one stage to the next. Some alternate methods for refactoring the geometry pipeline had thresholds for batch or mesh context, but they could be hidden from the developer to varying degrees, either by the driver or in some cases by hardware that could start and end batches based on conditions during execution. Granted, being totally unaware of those limits could lead to less than peak performance, and some of those hardware-linked behaviors add a dependence on other parts of the hardware which the explicitly separate task and mesh shaders discard.
New Metro Exodus GeForce RTX global illumination demo:

Wow, this is going to be the best RTX demo ever in the near future.
So much nicer without the AO. AO always seemed like a weird, awkward, and inaccurate kludge. Some people liked it, but I always felt it made things look worse, even less natural than having no AO at all.
Regards,
SB