Next gen lighting technologies - voxelised, traced, and everything else *spawn*

No. BVH trees are created on the CPU on RTX too (PowerVR does this on their RT unit though, IIRC). Turing RT cores accelerate ray intersection / hit testing on those BVHs the CPU built.
So does that go for all RTX implementations, or only this specific one? The driver creates the BVH tree, and the RT cores are able to accelerate traversal of that BVH structure, which would otherwise have to be done with shaders.. right? So it's not just ray intersection/hit testing..?

My question is then, if this is using the RT cores, why is there such a gulf between this implementation and what Nvidia and their partners are doing in other implementations? Comparatively, this is relatively limited in scope: single light source (the sun), and only vehicle shadows are being ray traced as far as I know.

I guess maybe limitations of the API? (DX11 vs DXR)
 
No. BVH trees are created on the CPU on RTX too
We discussed this some posts above. The bottom levels (BLAS - most of the work) are built in compute on the GPU. Only the top level tree (TLAS) is built on the CPU.
It would make sense to build everything on CPU and only refit on GPU, but the quote i have found says the entire BLAS work is done on GPU. (Nice GPU friendly task, avoids large data transfers.)

But it could be that the GPU only builds trees of lower quality, while all high quality trees for static objects are built on CPU. (Typically you'd use low quality trees for dynamic objects because they are fast to build, and high quality for the static world because they are faster to trace. Also, if dynamic objects do not change their shape too much, refitting is better than rebuilding per frame.)
With DXR it's also open to the vendors how they do this, and i remember NV claiming to make constant changes and improvements here. A rough sketch of how the API exposes that trade-off is below.
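For illustration, a minimal sketch of the DXR side of this (the function, addresses and counts are placeholders from a hypothetical engine, not from any real codebase): the app only describes the geometry and picks the build-flag trade-off, while the actual tree construction is recorded on a command list and runs in the driver's GPU compute work.

Code:
#include <d3d12.h>

// Hypothetical helper: all GPU virtual addresses and counts come from the
// caller's engine; error handling omitted for brevity.
void buildBlas(ID3D12GraphicsCommandList4* cmdList,
               D3D12_GPU_VIRTUAL_ADDRESS vertexBufferVA, UINT vertexCount,
               D3D12_GPU_VIRTUAL_ADDRESS blasBufferVA,
               D3D12_GPU_VIRTUAL_ADDRESS scratchBufferVA)
{
    // Describe one non-indexed triangle mesh (float3 positions).
    D3D12_RAYTRACING_GEOMETRY_DESC geom = {};
    geom.Type = D3D12_RAYTRACING_GEOMETRY_TYPE_TRIANGLES;
    geom.Triangles.VertexBuffer.StartAddress = vertexBufferVA;
    geom.Triangles.VertexBuffer.StrideInBytes = 3 * sizeof(float);
    geom.Triangles.VertexFormat = DXGI_FORMAT_R32G32B32_FLOAT;
    geom.Triangles.VertexCount = vertexCount;

    D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_INPUTS inputs = {};
    inputs.Type = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_TYPE_BOTTOM_LEVEL;
    inputs.DescsLayout = D3D12_ELEMENTS_LAYOUT_ARRAY;
    inputs.NumDescs = 1;
    inputs.pGeometryDescs = &geom;

    // Static geometry: spend more build time for a tree that traces faster.
    inputs.Flags = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_PREFER_FAST_TRACE;
    // Dynamic geometry would instead use PREFER_FAST_BUILD | ALLOW_UPDATE,
    // then refit per frame with PERFORM_UPDATE rather than rebuilding.

    D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_DESC desc = {};
    desc.Inputs = inputs;
    desc.DestAccelerationStructureData = blasBufferVA;
    desc.ScratchAccelerationStructureData = scratchBufferVA;

    // The build itself executes on the GPU; how good the resulting tree
    // actually is remains up to the vendor's driver.
    cmdList->BuildRaytracingAccelerationStructure(&desc, 0, nullptr);
}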
 
The driver creates the BVH tree, and the RT cores are able to accelerate traversal of that BVH structure, which would otherwise have to be done with shaders.. right? So it's not just ray intersection/hit testing..?

My question is then, if this is using the RT cores, why is there such a gulf between this implementation and what Nvidia and their partners are doing in other implementations?
Sounds like you assume the World of Tanks demo could use RT cores for GPU tracing? But no, that's not possible without using DX12, DXR and ofc. a Turing GPU.
So their implementation is compute based and has no such requirements, but also lacks support for Turing RT cores. They would need to port to DX12 and offer the current compute solution as a fallback for those owning other GPUs.
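For what it's worth, detecting whether the hardware path even exists under DX12 is a single feature query (a hypothetical helper sketch; `device` is an assumed ID3D12Device*). Anything below Tier 1.0 has to keep a compute tracer as the only option:

Code:
#include <d3d12.h>

// Returns true if the driver exposes DXR; otherwise an engine has to fall
// back to its own compute-shader tracer, which is all WoT uses today.
bool supportsDxr(ID3D12Device* device)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS5 opts5 = {};
    if (FAILED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS5,
                                           &opts5, sizeof(opts5))))
        return false;
    // Tier 1.0+ means hardware-accelerated traversal is available
    // (RT cores on Turing); below that, compute is all there is.
    return opts5.RaytracingTier >= D3D12_RAYTRACING_TIER_1_0;
}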

Kinda sucks that MS seems to use RT to force DX12 and thus Win10? Many devs don't want to use a low level API for good reasons, and NV is often faster on DX11. But it works - personally i would still use Win7 if it had support for DX12 :(
 
Kinda sucks that MS seems to use RT to force DX12 and thus Win10? Many devs don't want to use a low level API for good reasons, and NV is often faster on DX11. But it works - personally i would still use Win7 if it had support for DX12

Wasn't DX12 coming to W7 at some point? Think i read that somewhere a while ago. I'm on W10 and it's faster and more stable for most games.
 
Somehow older games get a rundll problem on 7, resulting in 100% CPU usage, thx to a Windows update i think. They won't fix it.
Why would they? Support for Win 7 is dropping in a few months. It'll be completely unsupported. Why spend any effort getting DX12 working on it properly when it's dead? May as well ask MS to patch DX12 support into Windows 3.1 while you're at it. ;)
 
I'm on W10 and it's faster and more stable for most games.
This was my impression too, years ago, when it was new.
But some updates later, the usual bloat comes up again. What remains is twice the memory requirements - for Windows Update, Cortana and other crap i do not want but can't get rid of.
That's just ranting, ofc. - i was not so serious about Win7 :)

But no DXR for DX11 is a topic. Didn't MS say DX11 would be kept alive? Or did i get this wrong and it is now deprecated?
Just asking - personally i use Vulkan and can't avoid the low level pain because of performance. It's just that DX11 would be a nicer option than OpenGL for indies still using custom engines. Forcing low level APIs also means pushing people towards UE or Unity to some degree?
 
Sounds like you assume the World of Tanks demo could use RT cores for GPU tracing? But no, that's not possible without using DX12, DXR and ofc. a Turing GPU.
So their implementation is compute based and has no such requirements, but also lacks support for Turing RT cores. They would need to port to DX12 and offer the current compute solution as a fallback for those owning other GPUs.

Kinda sucks that MS seems to use RT to force DX12 and thus Win10? Many devs don't want to use a low level API for good reasons, and NV is often faster on DX11. But it works - personally i would still use Win7 if it had support for DX12 :(

Yea, I always thought that using the RT cores required DX12+DXR/Vulkan. But then seeing how Turing and Navi compare with each other in this benchmark with RT on vs off.. it seems that Navi takes a bigger hit with RT enabled. Which I guess is leading some people to assume, and me to question, whether the RT cores are being used here in some way.

I guess it just comes down to Nvidia's architecture being better suited for this game and RT through compute in general.
 
Yea, I always thought that using the RT cores required DX12+DXR/Vulkan. But then seeing how Turing and Navi compare with each other in this benchmark with RT on vs off.. it seems that Navi takes a bigger hit with RT enabled. Which I guess is leading some people to assume, and me to question, whether the RT cores are being used here in some way.

I guess it just comes down to Nvidia's architecture being better suited for this game and RT through compute in general.
If RT cores were being used, the difference between them would be substantial. Very substantial.
 
According to NVIDIA, the separation of INT32 from FP32 in Turing helps in RT workloads; that's why a GTX 1660 Ti is faster at running DXR games than a GTX 1070, even though they both lack RT cores.
This is the type of customization that matters, if any customization should happen at all.

Is there any reason why specifically AMD does not separate INT and FP workloads? Or perhaps is there any intention to?
 
Why would separating them help? I wouldn't have thought you'd be doing the different types of maths in parallel for RT. It's all very floating-point based, isn't it?
 
Why would separating them help? I wouldn't have thought you'd be doing the different types of maths in parallel for RT. It's all very floating-point based, isn't it?
Parallel int / float should help a lot, pretty much with everything that has some logic / algorithm and is not just dumb brute force like building mip maps or something. Think of all those loop counters and indexing math. Tree traversal and intersection would be just one example.
It's the most interesting and promising feature of Turing for me personally :)
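To make that concrete, here is a made-up traversal loop (the Node layout and the countHits function are invented purely for illustration, not from any real tracer): the slab test is all float math, while the stack, node indexing and loop control are exactly the integer ops Turing can issue on its parallel INT pipe.

Code:
#include <math.h>

// Invented BVH node layout, purely for illustration.
struct Node { float bmin[3], bmax[3]; int firstChild, sibling; int isLeaf; };

// Counts leaf boxes hit by a ray, given its origin and inverse direction.
int countHits(const Node* nodes, const float org[3], const float inv[3])
{
    int stack[64];                      // INT: node-index stack
    int sp = 0, node = 0, hits = 0;     // INT: bookkeeping registers
    for (;;) {
        const Node& n = nodes[node];    // INT: indexing / address math
        float t0 = 0.0f, t1 = 1e30f;    // FP
        for (int a = 0; a < 3; ++a) {   // INT: loop counter
            float ta = (n.bmin[a] - org[a]) * inv[a];  // FP
            float tb = (n.bmax[a] - org[a]) * inv[a];  // FP
            t0 = fmaxf(t0, fminf(ta, tb));             // FP: slab test
            t1 = fminf(t1, fmaxf(ta, tb));             // FP
        }
        if (t0 <= t1) {                 // FP compare feeding control flow
            if (n.isLeaf) { ++hits; }                  // INT
            else if (sp < 64) {                        // INT: overflow guard
                stack[sp++] = n.sibling;               // INT: push sibling
                node = n.firstChild;                   // INT: descend
                continue;
            }
        }
        if (sp == 0) break;             // INT: traversal done
        node = stack[--sp];             // INT: pop
    }
    return hits;
}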
 
Is there any reason why specifically AMD does not separate INT and FP workloads? Or perhaps is there any intention to?
They have had parallel scalar / vector execution since GCN; also the SFU for transcendental math works independently IIRC, but not sure if that is only since RDNA.
But i think Turing has all this too, or something similar, now. (Does anyone know if Turing has scalar registers?)
So Turing can do 3 ops at a time: int, float and tensor, while AMD could do scalar / vector and SFU.

However, having these features does not necessarily say much about end performance. For example, if code does just a bunch of floating point math but not much else, a simpler architecture could be more efficient, or could have higher IPC. Depends on what we do with it.
 
(Does anyone know if Turing has scalar registers?)
Seems yes - found this as the first search result (https://arxiv.org/pdf/1903.07486.pdf):

"As per NVidia’s documentation, Turing introduces a new feature intended to improve the maximum achievable arithmetic throughput of the main, floating-point capable datapaths, by adding a separate, integer-only, scalar datapath (named the uniform datapath) that operates in parallel with the main datapath. This design is intended to accelerate numerical, array-based, computebound workloads that occupy the main datapaths almost completely with floating-point instructions, typically FFMA or HMMA, but also contain a few integer operations, typically updating array indices, loop indices or pointers; or performing array or loop boundary checks. These few integer instructions spoil the instruction mix, and prevent the main datapaths from ingesting a 100% pure stream of FFMA or HMMA. In these circumstances, even a small fraction of integer instructions can hurt the overall arithmetic throughput, lowering it significantly from its theoretical maximum. On Turing, the compiler has the option to push these integer operations onto the separate uniform datapath, out of the way of the main datapath. To do so, the compiler must emit uniform datapath instructions."

But this also implies the int path has the same limitation as AMD's scalar instructions. (Scalar means all threads do the exact same instruction on just one common register.)
Now i'm disappointed, because it sounded like Turing could do vector int / float math in parallel (every thread with its own data)... they always come up with marketing terms that sound much better :D
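If that reading is right, the uniform datapath targets code shaped like this trivial made-up loop (the function is invented for illustration): the body compiles to a near-pure FFMA stream, and the index update plus bounds check are the "few integer operations" the quoted paper says the compiler can push aside.

Code:
// Made-up compute-bound loop matching the paper's description: the body
// maps to FFMA, while i++ and the i < n check are scalar integer ops a
// Turing compiler can emit as uniform datapath instructions, keeping the
// main FP pipes fed.
float dotAccumulate(const float* a, const float* b, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)    // INT: loop index update + bound check
        acc = a[i] * b[i] + acc;   // FP: fused multiply-add per element
    return acc;
}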

Correct me if i'm wrong!
 
Now i'm disappointed, because it sounded like Turing could do vector int / float math in parallel
It can, according to NVIDIA. They found that at least 30% of game code is integer, so they separated INT from FP to exploit the parallelism. They call it concurrent FP & INT.

[NVIDIA GDC slide: concurrent FP & INT execution]


But i think Turing has all this too, or something similar, now. (Does anyone know if Turing has scalar registers?)
Each CU in RDNA has:
  • 32 SPs (IEEE 754 FP32 and INT32 vector ALUs)
  • 1 SFU
  • 1 INT32 scalar ALU
  • 4 Texture units
  • 1 scheduling and dispatch unit
  • units for cache read/writes
Each SM in Turing has:
  • 64 IEEE 754 FP32 scalar ALUs
  • 64 INT32 scalar ALUs
  • 8 Tensor cores
  • 1 Ray Tracing unit
  • 4 Texture units
  • 0.5 Geometry Unit (PolyMorph/Tessellation Engine)
  • 4 SFUs
  • 1 instruction scheduling and dispatch unit
  • 4 Load/Store units (which handle cache read/writes)
 