AMD: Navi Speculation, Rumours and Discussion [2019-2020]

My god, man! Please inform yourself before posting! Just think for a moment: what would be the difference between static ray tracing and simple cube map reflections?

Reflective surfaces in Battlefield V reflect anything that moves, down to the tiny flares of rocket trails.


Right, because Nvidia chose a PUDDLE for that effect. Yet global illumination and other effects are missing. Turing can only do so much, while RDNA2 can do much more. The demo shows how robust RDNA2 is compared to the narrowly focused ray tracing that Nvidia likes to showcase for players in puddles.
 
PC-land reached crazy territory years ago; not even console-land is this silly.
 
Just wait for the next generation of console posters to arrive after the launch of the PS5, etc. :)

We've already had the influx for Silly Season and it's well handled.
 
I'm not all over the console forum, but keeping up with the various major threads over the past year or so, I think the PC areas have been worse, mainly since the release of RTX and the ensuing marketing spam and red-vs-green wars.
 
I think by now it's clear that RDNA2 uses the ray tracing method described by that AMD patent, where the intersection engine is in the TMU. If you do the calculation for the case of the XSX, 208 TMUs × 1.825 GHz = 379.6G intersections per second, which matches the figure given by Microsoft in the Eurogamer article. Yet according to Mark Cerny, in the case of the PS5 the intersection engine is in the CU. The PS5 is supposed to be using RDNA2.

Could the "intersection engine" be a shader program running on the shaders, in the case of the PS5?
 
Well, TMU is in the CU as well.
 
As @szatkus pointed out, TMUs are in the CUs too.
For future reference, this is what the RDNA1 Dual Compute Unit looks like:
[attached image: RDNA1 dual compute unit block diagram]

Intersection engines will be added in the yellow part on the right, the "TMU", which currently includes the filtering and mapping units. Whether it will be separate new blocks or added functionality in the current ones, I'm not sure.
 
So, can Nvidia just add another TMU quad per SM in Ampere to increase the intersection test rate, or will it be limited by cache/memory data bandwidth?
 
The ray tracing fixed-function hardware is for a BVH, I assume, which at a high level is triangles within boxes within more boxes. Voxels are boxes too, so if a game had part of its geometry made up of voxels, could the box testers be used for that too? Or would it have to be a specific BVH format? Also, could these box and triangle testers be used for physics collisions, like testing if a box hit another box? The PowerVR marketing material always lists physics as a possible use for their ray tracing.
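
For what it's worth, a box test is just a ray-vs-AABB slab test and is agnostic to what the box contains, so in principle the same tester could evaluate voxel bounds or coarse collision proxies. Here's a minimal software version of such a test, purely for illustration (not AMD's hardware algorithm):

Code:
#include <algorithm>

struct Ray  { float ox, oy, oz; float invDx, invDy, invDz; };  // origin and 1/direction
struct AABB { float minX, minY, minZ, maxX, maxY, maxZ; };

// Classic slab test: does the ray hit the box within [tMin, tMax]?
bool rayHitsBox(const Ray& r, const AABB& b, float tMin, float tMax)
{
    float t0 = (b.minX - r.ox) * r.invDx, t1 = (b.maxX - r.ox) * r.invDx;
    tMin = std::max(tMin, std::min(t0, t1));  tMax = std::min(tMax, std::max(t0, t1));

    t0 = (b.minY - r.oy) * r.invDy;  t1 = (b.maxY - r.oy) * r.invDy;
    tMin = std::max(tMin, std::min(t0, t1));  tMax = std::min(tMax, std::max(t0, t1));

    t0 = (b.minZ - r.oz) * r.invDz;  t1 = (b.maxZ - r.oz) * r.invDz;
    tMin = std::max(tMin, std::min(t0, t1));  tMax = std::min(tMax, std::max(t0, t1));

    // The test is identical whether the box bounds triangles, a voxel, or a collision proxy.
    return tMin <= tMax;
}

Whether the actual units could be fed arbitrary boxes outside a driver-built acceleration structure is a separate question; as far as the public APIs go, DXR only exposes them through acceleration structure queries.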
 
So, can Nvidia just add another TMU quad per SM in Ampere to increase the intersection test rate, or will it be limited by cache/memory data bandwidth?
No, NVIDIA uses a different approach, with a separate fixed-function (? not 110% sure on this) "RT core" within each SM, which handles everything. They can probably build it beefier, though, and of course there are more of them the more SMs there are.
 
As far as I understand, the RTX core is a kind of specialized unit, like an FPU for example. It's used as part of ray tracing shaders, coupled with plain old compute resources*.

Of course they can just add more RTX cores. We have already seen this in the past: at the beginning, ROPs and "compute pipelines" (was that the term?) were in 1:1 proportion. Now AMD usually has 64 pipelines for every ROP.

* That's why moving RT to a separate chip is impossible. Sorry, whenever I see news about anything related to ray tracing there's always at least one comment like "AMD should add a chip just for ray tracing".
 
NVIDIA's RT core does all the ray traversal & intersection work on its own with no CUDA core involvement (after the probe is sent to the RT core); only afterwards is the actual shading done on the CUDA cores. On AMD, the stream processors will be involved in the traversal portion (I didn't double-check the patent and my memory is short and bad, so that might need checking).
Here's NVIDIA's whitepaper on theirs anyway: https://www.nvidia.com/content/dam/...ure/NVIDIA-Turing-Architecture-Whitepaper.pdf
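
Conceptually, the split on Turing looks something like the sketch below from the shader's point of view; everything inside rtCoreTraverse runs on the RT core, and the names are made up for illustration (this is not NVIDIA's actual API or code):

Code:
struct Ray { float origin[3], dir[3], tMax; };
struct Hit { bool found; float t; unsigned triId; };

// Stand-in for the fixed-function unit: the whole BVH walk (box tests, triangle
// tests, picking the next node) happens here without touching the CUDA cores.
Hit rtCoreTraverse(const Ray& ray, const void* bvh)
{
    (void)bvh;
    return Hit{false, ray.tMax, 0};  // stub body; the real work is done in hardware
}

void shadeOneRay(const Ray& ray, const void* bvh)
{
    Hit h = rtCoreTraverse(ray, bvh);  // SM issues the probe and gets back the final result
    if (h.found) {
        // closest-hit shading runs on the CUDA cores
    } else {
        // miss shading runs on the CUDA cores
    }
}

The point is that the traversal loop itself stays off the SIMDs; only the final shading returns to them.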
 
No, NVIDIA uses a different approach, with a separate fixed-function (? not 110% sure on this) "RT core" within each SM, which handles everything.
It doesn't handle "everything", but it does evaluate hits and chooses the next level of BLAS to trace against, something which will presumably have to be handled by the CU SIMDs in RDNA2.
There is no reason to assume that NV's approach is any more "fixed function" than AMD's right now. The traversal of the BVH is handled by dedicated specialized h/w in both cases. In NV's case this h/w has the additional function of evaluating the results of traversal without SIMD involvement.
I think it's also a stretch to just assume that these BVH units in RDNA2 are tied to the TMUs and especially to their numbers. The fact that the CU accesses them through TMU data paths means little more than the usage of said data paths and associated caches, IMO. The wiring of the actual units can be absolutely arbitrary for all we know: 1 BVH unit per TMU, 1 per TMU quad, 1 per CU, or maybe even 1 per WGP?

They can probably build it beefier, though, and of course there are more of them the more SMs there are.
Why exactly can't NV add more RT cores to an SM if that would actually improve performance?
Again, if we assume that Turing RT cores aren't coupled to / accessed through the TMUs (like in RDNA2, presumably), then this actually seems like an advantage, as you can probably add (or remove) them as you see fit, irrespective of your TMU count per SM.
 
On AMD, the stream processors will be involved in the traversal portion (I didn't double-check the patent and my memory is short and bad, so that might need checking)

"AMD Patent said:
The intersection testing results and/or a list of node pointers which need to be traversed next (in the order they need to be traversed) are returned to the shader unit using the texture data return path. The shader unit reviews the results of the intersection and the indications received to decide how to traverse to the next node in the BVH tree.

I'll be very impressed if AMD takes this approach and delivers usable performance. Passing every node hit or miss back to the shader core will incur crazy scheduling overhead and compete with other texturing & shading work.
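
As a rough sketch of what the quoted patent text describes, the traversal loop would live in the shader, with only the per-node box/triangle tests outsourced. The names below are invented for illustration; this is not actual RDNA2 or driver code:

Code:
#include <cstdint>
#include <vector>

struct NodeResult { bool leafHit; float t; std::vector<uint32_t> nextNodes; };

// Stand-in for the fixed-function tester reached over the texture data path.
NodeResult intersectNode(uint32_t nodePtr)
{
    (void)nodePtr;
    return NodeResult{false, 0.0f, {}};  // stub body; the real work is done by the intersection engine
}

float traceShaderDriven(uint32_t rootNode)
{
    float closest = 1e30f;
    std::vector<uint32_t> stack{rootNode};       // traversal stack kept by the shader
    while (!stack.empty()) {
        uint32_t node = stack.back();
        stack.pop_back();
        NodeResult r = intersectNode(node);      // round trip through the texture return path
        if (r.leafHit && r.t < closest) closest = r.t;
        for (uint32_t child : r.nextNodes)       // the shader decides how to traverse next
            stack.push_back(child);
    }
    return closest;                              // every node visit went through the SIMDs
}

Every iteration of that loop is a shader-visible transaction, which is where the scheduling overhead concern comes from.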
 
I don’t see how it is more “problematic” in terms of scheduling and pipelining than, say, an atomic RMW operation (unless there are strict ordering constraints). Both conceptually send some data to an external blackbox and get back some results, from the CU’s perspective.
 

The number of calls to the external blackbox will be significantly higher in AMD's case. There will be a cost to be paid in compute capacity, cache efficiency and on-chip traffic. The AMD patent discusses returning intermediate BVH traversal results. Nvidia just returns the final triangle hit or miss.

It's more complicated with transparent geometry that invokes any-hit shaders during traversal but the distinction still applies.
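
To put some purely illustrative numbers on that (made-up figures, just to show the scale): if an average ray visits on the order of 30 BVH nodes, the shader-driven scheme issues ~30 intersection requests and consumes ~30 result packets per ray, while an autonomous traversal unit issues 1 request and returns 1 final hit or miss per ray. At 1 million rays per frame, that would be roughly 30 million extra shader-visible transactions per frame.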
 
That's fair, and it would indeed be competing for resources in the CU memory/texturing pipeline. But there are still some counterpoints to be made:

1. On-chip traffic: If the traversal process can saturate the memory hierarchy, the added CU-internal round-trip time could be less relevant in the end. It would burn more joules in absolute terms, but it could be a drop in the ocean in relative terms.

2. Compute capacity: Compute is cheap relative to the slower growth in memory bandwidth. One may then argue that the intersection engine would already have subsumed the most costly and divergent routines, relative to a complete in-shader implementation. So if what's left is cheap enough, the practical relevance of "having freed up more compute capacity" can be disputed.

3. Cache efficiency: That assumes the loads generated by the intersection engine would thrash the cache hierarchy. But alternative caching policies (skip L0/L1/L2) could be used for this traffic. The same issue applies equally to a dedicated RT core sharing the cache hierarchy.
 

Sure, we won’t know exactly how things balance out in the end and there are many factors at play.

In Nvidia’s patent the RT core (or tree traversal unit) has its own local memory and L0 cache. Clearly much more expensive than AMD’s approach but likely faster too.
 